Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making mobile devices, robots, and cars context-aware, and building intelligent monitoring systems that recognize activities in their environments from acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
|Task||Task description||Development dataset||Baseline system||Evaluation dataset||Results|
|Task 1, Acoustic Scene Classification||Released||Released||Released||TBA||TBA|
|Task 2, Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring||Released||Released||Released||TBA||TBA|
|Task 3, Sound Event Localization and Detection||Released||Released||Released||TBA||TBA|
|Task 4, Sound Event Detection and Separation in Domestic Environments||Released||Released||Partial||TBA||TBA|
|Task 5, Urban Sound Tagging with Spatiotemporal Context||Released||Released||Released||TBA||TBA|
|Task 6, Automated Audio Captioning||Released||Released||Released||TBA||TBA|
Acoustic Scene Classification
The goal of acoustic scene classification is to classify a test recording into one of the predefined acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from previous DCASE Challenge editions, with some changes that bring new research problems into focus.
We provide two different setups of the acoustic scene classification problem:
Classification of data from multiple devices (real and simulated) targeting generalization properties of systems across a number of different devices.
Classification of data into three higher level classes while focusing on low-complexity solutions.
Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring
This is a novel DCASE challenge task closely related to an industrial problem: automatically detecting mechanical failure in order to achieve factory automation. The goal of this task is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data. Although normal/anomalous classification looks like a two-class problem, it cannot be solved as a simple classification problem, because no anomalous examples are available for training. Prompt detection of machine anomalies from the sounds a machine emits will be useful for machine condition monitoring.
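The normal-only training condition can be illustrated with a minimal sketch. It is not the challenge baseline: it assumes each clip has already been reduced to a short feature vector (the values below are hypothetical), models normal operation with a per-dimension mean and standard deviation, and scores a clip by how far it departs from those statistics. The names `fit_normal_model`, `anomaly_score`, and `is_anomalous` are made up for illustration.

```python
from statistics import mean, pstdev

def fit_normal_model(normal_features):
    """Estimate per-dimension mean and standard deviation from normal clips only."""
    dims = list(zip(*normal_features))
    means = [mean(d) for d in dims]
    # Floor the deviation so constant dimensions do not cause division by zero.
    stds = [max(pstdev(d), 1e-8) for d in dims]
    return means, stds

def anomaly_score(features, model):
    """Mean squared z-score: grows as a clip departs from the normal statistics."""
    means, stds = model
    return mean(((x - m) / s) ** 2 for x, m, s in zip(features, means, stds))

# Hypothetical 2-D feature vectors (e.g. band energies) from normal operation.
normal = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
model = fit_normal_model(normal)

# The threshold comes from the training scores themselves; no anomalous data is used.
threshold = max(anomaly_score(f, model) for f in normal)

def is_anomalous(features):
    """Flag a clip whose score exceeds anything seen during normal operation."""
    return anomaly_score(features, model) > threshold
```

The key design point is that both the model and the decision threshold are derived from normal data alone, which is what separates this setting from ordinary two-class classification.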
Sound Event Localization and Detection
Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, together with their respective onset-offset times and their spatial locations as azimuth and elevation angles. Each sound event instance in the provided recordings is spatially stationary, with a fixed location during its entire duration. Successful implementation of such a SELD method will enable the automatic description of social and human activities and help machines interact with the world more seamlessly. Specifically, SELD will enable people with hearing impairment to visualize sounds. Robots and smart video conference equipment can recognize and track the sound source of interest. Further, smart homes, smart cities, and smart industries can use SELD for audio surveillance.
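As a rough illustration of what a SELD system outputs, the sketch below represents one detected event as a record holding a label, onset and offset times, and a fixed azimuth and elevation. The class name, field names, units, and example values are assumptions made for this sketch and do not reflect the challenge's official output format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SoundEvent:
    """One detected event; the location is fixed for the whole duration."""
    label: str
    onset_s: float        # start time in seconds
    offset_s: float       # end time in seconds
    azimuth_deg: float    # horizontal angle in degrees
    elevation_deg: float  # vertical angle in degrees

def active_events(events, t):
    """Return the events whose onset-offset interval contains time t (seconds)."""
    return [e for e in events if e.onset_s <= t < e.offset_s]

# Hypothetical system output for a short recording with overlapping events.
events = [
    SoundEvent("footsteps", 0.5, 3.0, azimuth_deg=-30.0, elevation_deg=0.0),
    SoundEvent("door_slam", 2.0, 2.4, azimuth_deg=90.0, elevation_deg=10.0),
]
```

Querying `active_events(events, 2.2)` returns both records, which shows that events may overlap in time while each keeps its own location.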
Sound Event Detection and Separation in Domestic Environments
This task is the follow-up to DCASE 2019 Task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. This year, we also encourage participants to propose systems that use source separation jointly with sound event detection. This task investigates how synthetic data can be exploited optimally and to what extent source separation can improve sound event detection, and vice versa.
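The required output, an event class plus its time boundaries, is often obtained by post-processing frame-level predictions. The sketch below assumes a model that already produces one probability per frame for a single class; the function name, the threshold, and the hop size are hypothetical choices, and the sketch does not cover training from weak labels.

```python
def frames_to_events(probs, threshold, hop_s):
    """Turn per-frame probabilities for one class into (onset, offset) pairs in seconds."""
    events, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # event begins at this frame
        elif p < threshold and start is not None:
            events.append((start * hop_s, i * hop_s))
            start = None                   # event ended before this frame
    if start is not None:
        # The event is still active when the recording ends.
        events.append((start * hop_s, len(probs) * hop_s))
    return events

# Hypothetical frame probabilities for one class, one frame every 0.5 s.
probs = [0.1, 0.8, 0.9, 0.2, 0.7]
boundaries = frames_to_events(probs, threshold=0.5, hop_s=0.5)
```

Running the function once per class yields overlapping events across classes, which matches the multi-event setting described above.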
Urban Sound Tagging with Spatiotemporal Context
This task is a follow-up to DCASE 2019 Task 5. It investigates how spatiotemporal metadata can aid in the prediction of urban sound tags for 10-second recordings from an urban acoustic sensor network for which we know when and where the recordings were taken. This task is motivated by the real-world problem of building machine listening tools to aid in the monitoring, analysis, and mitigation of urban noise pollution. In this task, in addition to the recordings, we provide identifiers of the New York City block in which each recording was taken, as well as the time at which it was taken, quantized to the hour. We encourage all submissions to exploit this provided information and to incorporate any external data (e.g. weather, land usage, traffic, Twitter) that can further help the system predict tags.
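One way to feed hour-quantized timestamps to a tagging model is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, and the circular sine/cosine encoding is a common choice for time-of-day features, not something prescribed by the task.

```python
import math
from datetime import datetime

def quantize_to_hour(ts):
    """Drop minutes and seconds, mirroring the hour-level time resolution."""
    return ts.replace(minute=0, second=0, microsecond=0)

def hour_features(ts):
    """Encode the hour of day on a circle so that 23:00 and 00:00 are neighbours."""
    angle = 2 * math.pi * ts.hour / 24
    return (math.sin(angle), math.cos(angle))

# Hypothetical recording time.
recorded_at = quantize_to_hour(datetime(2019, 5, 4, 17, 42, 10))
features = hour_features(recorded_at)
```

The resulting pair can be concatenated with audio features; a block identifier would typically be handled separately, for example with a one-hot or learned embedding.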
Automated Audio Captioning
Automated audio captioning is the task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. Audio captioning methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge (e.g. "a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content-oriented machine-to-machine interaction.