Sounds carry a large amount of information about our everyday environment and physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars etc., and intelligent monitoring systems to recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
|Task||Task description||Development dataset||Baseline system||Evaluation dataset||Results|
|Task 1, Low-Complexity Acoustic Scene Classification||TBA||TBA||TBA||TBA||TBA|
|Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring||TBA||TBA||TBA||TBA||TBA|
|Task 3, Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes||TBA||TBA||TBA||TBA||TBA|
|Task 4, Sound Event Detection with Weak Labels and Synthetic Soundscapes; Sound Event Detection with Soft Labels||TBA||TBA||TBA||TBA||TBA|
|Task 5, Few-shot Bioacoustic Event Detection||TBA||TBA||TBA||TBA||TBA|
|Task 6, Automated Audio Captioning and Language-Based Audio Retrieval||TBA||TBA||TBA||TBA||TBA|
|Task 7, Foley Sound Synthesis||TBA||TBA||TBA||TBA||TBA|
Low-Complexity Acoustic Scene Classification
The task targets acoustic scene classification (ASC) on devices with low computational and memory allowance, which imposes limits on model complexity. The task setup is based on limited model complexity, acoustically diverse data, and multiple mobile devices, reflecting a real-life application of ASC. The focus of the task is on training strategies for obtaining robust models that are also light enough to run on embedded systems. Specific implementation requirements include a maximum memory allowance (with no predefined parameter representation format), a maximum of 30 MMACs (million multiply-accumulate operations), and the calculation of energy consumption.
The task is a repeat task from DCASE 2022, with the added calculation of the energy consumption, which is a factor in the overall ranking.
Submissions will be ranked on model accuracy, memory, MMACs, and energy needs, so that the ranking reflects low resource use from multiple perspectives.
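To illustrate the 30 MMACs limit mentioned above, the following is a minimal sketch of counting multiply-accumulate operations for a small convolutional model. The layer shapes are hypothetical and are not taken from the task baseline, and the official complexity calculation may count operations differently (for example, it may include layers other than convolutions).

```python
from dataclasses import dataclass

MAX_MMACS = 30.0  # complexity limit stated in the task description


@dataclass
class Conv2d:
    """Shape of one 2-D convolution layer (hypothetical example values)."""
    in_ch: int
    out_ch: int
    kernel: int      # square kernel size
    out_h: int       # output feature-map height
    out_w: int       # output feature-map width
    groups: int = 1  # groups == in_ch gives a depthwise convolution

    def macs(self) -> int:
        # Each output element needs (in_ch / groups) * k * k multiply-accumulates.
        per_output = (self.in_ch // self.groups) * self.kernel * self.kernel
        return per_output * self.out_ch * self.out_h * self.out_w


def total_mmacs(layers):
    """Sum of MACs over all layers, in millions."""
    return sum(layer.macs() for layer in layers) / 1e6


layers = [
    Conv2d(in_ch=1, out_ch=16, kernel=3, out_h=64, out_w=32),
    Conv2d(in_ch=16, out_ch=16, kernel=3, out_h=64, out_w=32, groups=16),  # depthwise
    Conv2d(in_ch=16, out_ch=32, kernel=1, out_h=64, out_w=32),             # pointwise
]
mmacs = total_mmacs(layers)
print(f"{mmacs:.4f} MMACs, within limit: {mmacs <= MAX_MMACS}")
```

The depthwise and pointwise pair in this sketch shows why separable convolutions are a common choice under a MAC budget: the depthwise layer costs the same as the first layer despite having 16 times as many input channels.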
First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The goal of this task is to identify whether a machine is normal or anomalous using only normal sound data under domain-shifted conditions. One major difference from DCASE 2022 Task 2 is that the set of machine types is completely different between the development dataset and the evaluation dataset. Participants are therefore expected to develop a system that can handle completely new machine types.
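The idea of learning from normal data only can be sketched with a very simple outlier score: fit per-dimension statistics on normal feature vectors, then score a new clip by how far it deviates from them. This is an illustrative assumption, not the task baseline. The feature values are made up, and the sketch does not address domain shift or unseen machine types.

```python
import statistics


def fit_normal_model(normal_features):
    """Per-dimension mean and standard deviation, estimated from normal data only."""
    dims = list(zip(*normal_features))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1e-9 for d in dims]  # guard against zero spread
    return means, stds


def anomaly_score(x, model):
    """Mean squared z-score of feature vector x under the normal model."""
    means, stds = model
    return sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)) / len(x)


# Hypothetical 2-D features extracted from normal machine sounds.
normal_train = [[1.0, 10.0], [1.2, 9.8], [0.8, 10.2], [1.0, 10.0]]
model = fit_normal_model(normal_train)

# One simple way to pick a decision threshold: the largest score seen on normal data.
threshold = max(anomaly_score(x, model) for x in normal_train)


def is_anomalous(x):
    return anomaly_score(x, model) > threshold


print(is_anomalous([1.0, 10.0]), is_anomalous([3.0, 6.0]))
```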
Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes
Sound event localization and detection (SELD) is the joint task of detecting the temporal activities of multiple sound event classes of interest and estimating their respective spatial trajectories when active. This task is a continuation of Task 3 of DCASE 2022, which was the first time that the SELD task was evaluated with real spatially annotated recordings, whereas previous years used synthetic data. Testing on (and partially training with) real recordings brings new challenges and opportunities to improve real-world performance. To foster innovation even further, this task will have, in addition to an audio-only setup similar to the previous year, a second setup in which participants also have access to 360° video of the recorded scenes.
Sound Event Detection with Weak Labels and Synthetic Soundscapes; Sound Event Detection with Soft Labels
The goal of sound event detection is to provide the event class together with the event time boundaries, given that multiple events can be present in an audio recording. This task targets sound event detection using training datasets with varying types of annotations, with the aim of exploring how to leverage them. Because strongly labeled data is costly to obtain, prone to annotator bias, and does not account for annotator uncertainty, this task investigates weakly labeled data and strongly labeled synthetic soundscapes in subtask A (follow-up to DCASE Task 4) and softly labeled data (non-binary activity for sounds) in subtask B. In both cases the labeled in-domain data can be used together with external datasets. To encourage participation across subtasks, one baseline will be common to both subtasks.
This task is a continuation of Task 4 at DCASE 2022. The main novelties this year are that additional sets of extracted embeddings will be made available, the evaluation will include a systematic evaluation of energy consumption, and alternative evaluation methods will be used to explore the robustness of the systems to changes in the operating point.
This is a new subtask in which annotations on the training data are given with a 1 s resolution. The annotation is based on multiple annotators, and the aggregation is a value between 0 and 1 per event class per 1 s segment. These values are considered soft labels that indicate the uncertainty of the annotator pool about the content. The soft annotations are used to train a sound event detection system that should perform at a 1 s time resolution.
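As a rough illustration of soft labels, the sketch below aggregates binary votes from several annotators into a value between 0 and 1 per class per 1 s segment, and evaluates a binary cross-entropy loss against such a soft target. The unweighted mean is an assumption made for illustration. The aggregation actually used for the dataset may differ (for example, by weighting annotators), and the votes shown are invented.

```python
import math


def aggregate_soft_labels(votes):
    """votes[a][t][c] is 1 if annotator a marked class c active in 1 s segment t.

    Returns soft[t][c], the fraction of annotators who marked the class active.
    """
    n_annotators = len(votes)
    n_segments = len(votes[0])
    n_classes = len(votes[0][0])
    return [
        [sum(votes[a][t][c] for a in range(n_annotators)) / n_annotators
         for c in range(n_classes)]
        for t in range(n_segments)
    ]


def soft_bce(target, prediction, eps=1e-7):
    """Binary cross-entropy that accepts a non-binary target in [0, 1]."""
    p = min(max(prediction, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Four hypothetical annotators, three 1 s segments, two event classes.
votes = [
    [[1, 0], [1, 1], [0, 0]],
    [[1, 0], [0, 1], [0, 0]],
    [[1, 1], [1, 1], [0, 0]],
    [[1, 0], [1, 0], [0, 1]],
]
soft = aggregate_soft_labels(votes)
print(soft)
print(soft_bce(soft[1][0], 0.75), soft_bce(soft[1][0], 0.99))
```

With a soft target of 0.75, the loss is lowest when the prediction is also 0.75, so a system trained this way is encouraged to reproduce the annotators' uncertainty instead of a hard 0/1 decision.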
Few-shot Bioacoustic Event Detection
This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations in a long sound recording, and generalise from those exemplars to find the rest of the sound events in the audio. The bioacoustic (animal sound) setting addresses a real need of animal researchers, while also providing a well-defined, constrained yet highly variable domain in which to evaluate few-shot learning methodology. In 2021 and 2022, the task stimulated a lot of innovation, but there is still plenty of scope for technical innovation to achieve robust, strong performance. For this edition, the task will keep most details the same as in 2022 to ensure maximum comparability, but will introduce new unseen evaluation recordings from new animal sound recording scenarios, so that the evaluation is as representative as possible and includes truly unseen datasets.
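One way to picture generalising from five exemplars is a nearest-prototype rule: average the embeddings of the five annotated events into a positive prototype, average some background embeddings into a negative prototype, and label each remaining frame by whichever prototype is closer. This is a sketch under stated assumptions, not a description of the official baseline. The 2-D embeddings are invented, and a real system would obtain embeddings from a trained model.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def detect_events(frame_embeddings, positive_prototype, negative_prototype):
    """Mark a frame as an event if it is closer to the positive prototype."""
    return [
        math.dist(f, positive_prototype) < math.dist(f, negative_prototype)
        for f in frame_embeddings
    ]


# Hypothetical embeddings of the five exemplar vocalisations.
support_positive = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [1.0, 1.0]]
# Hypothetical embeddings of background frames between the exemplars.
support_negative = [[0.0, 0.1], [0.1, 0.0], [-0.1, -0.1]]

pos_proto = mean_vector(support_positive)
neg_proto = mean_vector(support_negative)

# Frames from the rest of the recording.
query_frames = [[0.95, 1.05], [0.05, 0.0], [0.8, 0.7], [0.2, 0.3]]
print(detect_events(query_frames, pos_proto, neg_proto))
```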
University of Salford
University of Oxford
Queen Mary University of London
Automated Audio Captioning and Language-Based Audio Retrieval
This task approaches the problem of analysis of audio signals by using natural language to represent rich characteristics of audio signals. The task setup is otherwise similar to the DCASE 2022 Challenge Task 6, but with some changes, to take into account the development of the field.
This task is a continuation of Task 6 of the DCASE 2022 Challenge and focuses on the research question “How can we make machines understand higher-level and human-perceived information from general sounds?”.
The metric used to rank the submissions will combine SPIDEr with a fluency error detection model. Participants will still report METEOR, CIDEr, and SPICE metrics. In addition, FENSE and CB-score will be reported as contrastive metrics.
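SPIDEr is the average of the CIDEr-D and SPICE scores. The sketch below shows that average and one possible way of combining it with a fluency error detector, by scaling down the score of captions flagged as disfluent. The `threshold` and `penalty` values are assumptions chosen for illustration and should be checked against the official task description. The input scores are invented, and computing CIDEr-D, SPICE, and the fluency probability themselves is out of scope here.

```python
def spider(cider_d, spice):
    """SPIDEr: arithmetic mean of CIDEr-D and SPICE."""
    return 0.5 * (cider_d + spice)


def spider_with_fluency_penalty(cider_d, spice, fluency_error_prob,
                                threshold=0.9, penalty=0.9):
    """Scale SPIDEr down when a fluency error is detected.

    fluency_error_prob is the output of some fluency error detector;
    threshold and penalty are hypothetical values, not confirmed settings.
    """
    score = spider(cider_d, spice)
    if fluency_error_prob > threshold:
        score *= (1.0 - penalty)
    return score


fluent = spider_with_fluency_penalty(1.0, 0.2, fluency_error_prob=0.05)
disfluent = spider_with_fluency_penalty(1.0, 0.2, fluency_error_prob=0.97)
print(fluent, disfluent)
```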
This task is a continuation of Task 6 of the DCASE 2022 Challenge. Its goal is to evaluate methods in which a retrieval system takes a free-form textual description as input and ranks audio signals in a fixed dataset by how well they match the description.
In the 2022 Challenge, the ground truth relevance of audio files was considered binary, so that only audio files matching their corresponding caption were considered relevant, and all other audio files non-relevant. In the 2023 Challenge, this limitation is addressed by crowdsourced graded relevance scores and the use of Normalized Discounted Cumulative Gain (nDCG) as the metric. mAP@10 (mean average precision at cut-off 10) and recall@k (i.e., recall@1, recall@5, and recall@10) will be used as secondary metrics.
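The retrieval metrics named above can be sketched for a single query as follows. The file identifiers and relevance grades are hypothetical. The nDCG here uses the linear-gain form (other variants use an exponential gain), and the average precision is normalised by the smaller of the number of relevant items and the cut-off, which is one common convention. The official evaluation code may make different choices. mAP@10 is then the mean of the per-query average precision.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k results."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG with graded relevance; relevance maps item id to a grade."""
    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average precision at cut-off k for one query (binary relevance)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            total += hits / rank
    denom = min(len(relevant_ids), k)
    return total / denom if denom else 0.0


# One hypothetical query: system ranking and crowdsourced relevance grades.
ranked = ["a.wav", "b.wav", "c.wav", "d.wav"]
graded = {"a.wav": 1.0, "c.wav": 0.5}
relevant = set(graded)

print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, graded, 3))
print(average_precision_at_k(ranked, relevant, 10))
```

Unlike recall@k and mAP@10, nDCG rewards placing a highly relevant file above a partially relevant one, which is what graded relevance scores make possible.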
Foley Sound Synthesis
This task aims to build a foley sound synthesis system that can generate plausible audio signals fitting into given categories of foley sound. The foley sound categories are composed of sound events and environmental background sounds. The challenge has two subproblems: the development of models with and without external resources. Participants are expected to submit a system for one of the two problems, and each problem is evaluated independently. Submissions will be evaluated by Fréchet Audio Distance (FAD), followed by a subjective test.
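FAD compares two sets of audio embeddings by fitting a Gaussian to each set and computing the Fréchet distance between the two Gaussians. The sketch below shows that distance under a simplifying assumption of diagonal covariances, which keeps it dependency-free. The full metric uses full covariance matrices and a matrix square root, and the embeddings would come from a pretrained audio model, not from the invented numbers used here.

```python
import math
import statistics


def gaussian_stats(embeddings):
    """Per-dimension mean and variance of a set of embedding vectors."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    variances = [statistics.pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diag(stats_a, stats_b):
    """Fréchet distance between two Gaussians with diagonal covariances.

    General form: |mu_a - mu_b|^2 + tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    For diagonal S the trace term reduces to sum((sqrt(v_a) - sqrt(v_b))^2).
    """
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Hypothetical 2-D embeddings of reference and generated foley sounds.
reference = [[0.0, 1.0], [2.0, 3.0]]
generated = [[1.0, 1.0], [3.0, 5.0]]

distance = frechet_distance_diag(gaussian_stats(reference), gaussian_stats(generated))
print(distance)
```

A lower distance means the generated sounds are statistically closer to the reference set in the embedding space. It does not by itself guarantee perceptual quality, which is why a subjective test follows.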