Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content; context-aware mobile devices, robots, and cars; and intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are often present simultaneously and distorted by the environment.
| Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
| --- | --- | --- | --- | --- | --- |
| Task 1, Acoustic Scene Classification | TBA | TBA | TBA | TBA | TBA |
| Task 2, Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions | TBA | TBA | TBA | TBA | TBA |
| Task 3, Sound Event Localization and Detection with Directional Interference | TBA | TBA | TBA | TBA | TBA |
| Task 4, Sound Event Detection and Separation in Domestic Environments | TBA | TBA | TBA | TBA | TBA |
| Task 5, Few-shot Bioacoustic Event Detection | TBA | TBA | TBA | TBA | TBA |
| Task 6, Automated Audio Captioning | TBA | TBA | TBA | TBA | TBA |
Acoustic Scene Classification
The goal of acoustic scene classification is to classify a test recording into one of ten predefined acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from previous DCASE Challenge editions, most recently DCASE 2020 Task 1, with changes that bring new research problems into focus.
We provide two different setups of the acoustic scene classification problem:

- Subtask A: classification of data from multiple devices (real and simulated), targeting the generalization of systems across a number of different devices while focusing on low-complexity solutions.
- Subtask B: classification of scene data with both audio and video material, allowing systems to learn complementary information from a different modality. There are no restrictions on the modality or combination of modalities used. This subtask is intended for machine learning enthusiasts interested in developing complex methods without the complexity limitations or device-mismatch problems of Subtask A.
Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions
The scope of this task is to identify whether the sound emitted from a target machine is normal or anomalous, using an anomaly detector trained only on normal sound data. The main difference from DCASE 2020 Task 2 is that participants have to solve the domain shift problem, i.e., the condition where the acoustic characteristics of the training and test data differ.
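The "trained only on normal data" principle can be sketched as follows. This is a hedged illustration, not the challenge baseline: the feature vectors, function names, and the simple diagonal-Gaussian model are all placeholder assumptions. The idea is that a model of normal sounds is fitted, and test clips far from that model receive high anomaly scores.

```python
import numpy as np

def fit_normal_model(normal_features):
    """Fit a diagonal Gaussian to feature vectors of normal machine sounds."""
    mean = normal_features.mean(axis=0)
    var = normal_features.var(axis=0) + 1e-6  # floor to avoid division by zero
    return mean, var

def anomaly_score(features, mean, var):
    """Variance-normalized squared distance: large scores suggest anomalies."""
    return float(np.sum((features - mean) ** 2 / var))

# Toy example: random "embeddings" stand in for real audio features.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 16))   # normal training clips only
mean, var = fit_normal_model(normal)

normal_test = rng.normal(0.0, 1.0, size=16)     # matches training conditions
shifted_test = rng.normal(4.0, 1.0, size=16)    # simulated anomalous clip
assert anomaly_score(shifted_test, mean, var) > anomaly_score(normal_test, mean, var)
```

The domain shift problem arises because a model fitted like this also scores *normal* sounds highly when recording conditions change between training and test, so systems must distinguish benign condition changes from true machine faults.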
Sound Event Localization and Detection with Directional Interference
The scope of this task is the temporal detection, classification, and simultaneous localization of sound events of interest emitted by sound sources under real reverberant conditions, in both static and dynamic scenarios. The main difference from the previous year's task is the introduction of directional (localized) interference from unknown sound types, in conjunction with realistic spatial ambient noise. Additionally, multichannel reverberation simulation tools will be provided to participants to explore new spatial training and augmentation strategies. This task is a follow-up to DCASE 2020 Task 3.
Sound Event Detection and Separation in Domestic Environments
The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. This year, we also encourage participants to propose systems that use source separation jointly with sound event detection. The task aims to investigate how synthetic data can be optimally exploited and to what extent source separation can improve sound event detection, and vice versa. This task is a follow-up to DCASE 2020 Task 4.
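The gap between weak labels and timestamped output can be illustrated with a common recipe (a sketch under assumed names, not the task baseline): a system produces per-frame probabilities for each event class; during training only a clip-level pooling of those probabilities is compared against the weak (tag-only) label, while at test time thresholding the frame probabilities yields event boundaries.

```python
import numpy as np

def clip_probability(frame_probs):
    """Pool frame-level probabilities to one clip-level score.

    Only this pooled output is supervised by the weak label; max pooling
    is one common choice of pooling function."""
    return float(frame_probs.max())

def event_boundaries(frame_probs, threshold=0.5, hop_s=0.02):
    """Threshold frame probabilities and merge consecutive active frames
    into (onset, offset) pairs in seconds."""
    active = frame_probs > threshold
    events, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            events.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        events.append((start * hop_s, len(active) * hop_s))
    return events

# Toy frame probabilities for one event class (50 frames, 20 ms hop).
probs = np.zeros(50)
probs[10:20] = 0.9                 # one active region
print(clip_probability(probs))     # high score -> clip is tagged with the class
print(event_boundaries(probs))     # one event spanning frames 10..19
```

Real systems replace the toy probabilities with neural network outputs and typically smooth the thresholded activity (e.g. median filtering) before reporting boundaries.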
Few-shot Bioacoustic Event Detection
This challenge focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance, and noisy/busy environments.
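One widely used family of approaches to such few-shot problems is prototype-based: embed the five labelled shots, average them into a class prototype, and flag query frames that lie close to it. The sketch below assumes placeholder embeddings (random vectors standing in for a learned feature extractor) and is only an illustration of the idea, not a method prescribed by the task.

```python
import numpy as np

def prototype(shot_embeddings):
    """Average the embeddings of the few labelled shots into one prototype."""
    return shot_embeddings.mean(axis=0)

def detect(query_embeddings, proto, threshold):
    """Mark query frames whose Euclidean distance to the prototype is small."""
    dists = np.linalg.norm(query_embeddings - proto, axis=1)
    return dists < threshold

rng = np.random.default_rng(0)
# Placeholder embeddings: five "shots" of a call type clustered around one point.
shots = rng.normal(loc=1.0, scale=0.1, size=(5, 8))
proto = prototype(shots)

# Query frames: two resemble the shots, one is background noise.
query = np.stack([rng.normal(1.0, 0.1, 8),
                  rng.normal(1.0, 0.1, 8),
                  rng.normal(-2.0, 0.1, 8)])
print(detect(query, proto, threshold=1.0))  # first two frames detected
```

The hard parts the challenge targets, data sparsity, class imbalance, and busy soundscapes, show up here as the choice of embedding and threshold, which must work from only five examples.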
Automated Audio Captioning
Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e., a caption) of that signal. Audio captioning methods can model concepts (e.g. "muffled sound"), physical properties of objects and environments (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge (e.g. "a clock rings three times"). Such modeling can be used in various applications, ranging from automatic content description to intelligent, content-oriented machine-to-machine interaction. This task is a follow-up to DCASE 2020 Task 6.