Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (a car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example searching for multimedia based on its audio content, making mobile devices, robots, and cars context-aware, and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Challenge status
Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
---|---|---|---|---|---|
Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification | Released | Released | Released | Released | Released |
Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released |
Task 3, Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation | Released | Released | Released | Released | Released |
Task 4, Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels | Released | Released | Released | Released | Released |
Task 5, Few-shot Bioacoustic Event Detection | Released | Released | Released | Released | Released |
Task 6, Automated Audio Captioning | Released | Released | Released | Released | Released |
Task 7, Sound Scene Synthesis | Released | Released | Released | No evaluation dataset | Released |
Task 8, Language-Based Audio Retrieval | Released | Released | Released | Released | Released |
Task 9, Language-Queried Audio Source Separation | Released | Released | Released | Released | Released |
Task 10, Acoustic-based Traffic Monitoring | Released | Released | Released | Released | Released |
updated 2024/07/01
Tasks
Data-Efficient Low-Complexity Acoustic Scene Classification
Acoustic scene classification aims to automatically categorize audio recordings into specific environmental sound scenes, such as "metro station," "urban park," or "public square." Previous editions of the acoustic scene classification (ASC) task have focused on limited computational resources and diverse recording conditions, reflecting typical challenges faced when developing ASC models for embedded systems.
This year, participants are additionally encouraged to tackle another problematic condition, namely the limited availability of labeled training data. To this end, the ranking will be based on the number of labeled examples used for training and on the system's performance on a test set with diverse recording conditions. Participants will train their system on predefined training sets with a varying number of items. External resources, such as datasets and pre-trained models not specific to ASC, are allowed after approval by the task organizers and do not count toward the amount of labeled training data used in the ranking.
Additionally, the focus on low-complexity models is preserved by restricting the model size to 128 kB and the number of multiply-accumulate operations (MACs) per one-second audio clip to 30 million.
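As a rough illustration of what these limits mean in practice, the sketch below counts parameters and MACs for a hypothetical small CNN and compares them against the 128 kB and 30 million MAC budgets, assuming 16-bit weights; the official task description defines the exact counting rules and allowed precisions, and the layer shapes here are not the baseline architecture.

```python
# Minimal sketch: check a hypothetical model against the Task 1 complexity
# limits (128 kB of parameters and 30 million MACs per 1-second clip).

MAX_PARAM_BYTES = 128 * 1024   # 128 kB memory budget for weights
MAX_MACS = 30_000_000          # 30 M multiply-accumulate ops per 1 s clip
BYTES_PER_PARAM = 2            # assumption: 16-bit (e.g. float16) weights

def conv2d_stats(in_ch, out_ch, kernel, out_h, out_w):
    """Parameter and MAC count of a standard 2D convolution layer."""
    weights = in_ch * out_ch * kernel * kernel
    params = weights + out_ch            # weights + biases
    macs = weights * out_h * out_w       # one MAC per weight per output position
    return params, macs

# Hypothetical small CNN operating on a 1-second mel spectrogram.
layers = [
    conv2d_stats(1, 16, 3, 32, 50),
    conv2d_stats(16, 32, 3, 16, 25),
    conv2d_stats(32, 64, 3, 8, 12),
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)

print(f"params: {total_params}, "
      f"size: {total_params * BYTES_PER_PARAM / 1024:.1f} kB "
      f"(limit {MAX_PARAM_BYTES / 1024:.0f} kB)")
print(f"MACs per clip: {total_macs / 1e6:.1f} M (limit {MAX_MACS / 1e6:.0f} M)")
```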
First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The task is to develop anomalous sound detection techniques that can train models on new training data without manual tuning (the first-shot problem) and that remain robust under domain shifts. Our 2023 task also focused on the first-shot problem and domain generalization, realized by providing completely different machine types in the development and evaluation datasets, with additional attribute information (such as the machine operation speed) attached to them. Our 2024 task will again use development and evaluation datasets with different machine types, and the evaluation dataset will contain new machine types not seen in our previous tasks. In addition, we will conceal the additional attribute information for at least one machine type in the development dataset, which also mimics some real-world situations. Participants are expected to develop techniques that help solve the first-shot problem and train models that are robust to domain shifts.
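As a minimal illustration of the unsupervised setting (not the official baseline), the sketch below scores a clip by the reconstruction error of an autoencoder trained only on normal sounds; the feature extraction, model size, and thresholding shown here are placeholders.

```python
# Minimal sketch of an unsupervised anomaly score: an autoencoder is trained
# only on normal machine sounds, and the reconstruction error of a test clip
# serves as its anomaly score. Features and threshold are simplified here.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_features=640, n_hidden=128, n_bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, frames):
    """Mean squared reconstruction error over the frames of one clip."""
    with torch.no_grad():
        recon = model(frames)
    return torch.mean((frames - recon) ** 2).item()

# Usage: train on normal data with an MSE loss, then flag clips whose score
# exceeds a threshold chosen on the development set.
model = FrameAutoencoder()
clip_frames = torch.randn(309, 640)   # placeholder for log-mel context frames
score = anomaly_score(model, clip_frames)
```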
Organizers
Tomoya Nishida
Hitachi, Ltd.
Kota Dohi
Hitachi, Ltd.
Harsh Purohit
Hitachi, Ltd.
Takashi Endo
Hitachi, Ltd.
Yohei Kawaguchi
Hitachi, Ltd.
Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation
DCASE Challenge 2023 Task 3 introduced audiovisual processing in sound event localization and detection (SELD), in parallel to methods working only with audio input. In this year's challenge we aim to maintain the same setup, since last year's audiovisual submissions in DCASE2023 showed that stronger approaches are needed to effectively combine visual information with audio for SELD. There is, however, a new component in this year's task: source distance estimation (SDE). Complementary to the direction-of-arrival estimation (DOAE) of previous challenges, SDE has attracted increasing interest from the audio research community in the last couple of years, as it is a useful spatial cue for multiple downstream tasks, such as speech enhancement or separation. Successful SDE allows stronger learning of the spatial composition of a sound scene, since, together with the DOA, it allows absolute positional mapping of events in space.
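To make the last point concrete, the short sketch below converts a DOA estimate (azimuth and elevation) plus an estimated distance into an absolute Cartesian position; the angle convention used here is illustrative and may differ from the one defined in the task setup.

```python
# Minimal sketch of why distance complements DOA: an azimuth/elevation
# estimate only fixes a direction, while adding a distance estimate maps
# the event to an absolute position in space.
import numpy as np

def doa_and_distance_to_xyz(azimuth_deg, elevation_deg, distance_m):
    """Convert a DOA estimate plus source distance to Cartesian coordinates."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    x = distance_m * np.cos(el) * np.cos(az)
    y = distance_m * np.cos(el) * np.sin(az)
    z = distance_m * np.sin(el)
    return np.array([x, y, z])

# A source at 45 degrees azimuth, 10 degrees elevation, 2.5 m away:
print(doa_and_distance_to_xyz(45.0, 10.0, 2.5))
```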
Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels
This task is the follow-up to Tasks 4A and 4B in 2023; we propose to unify the setup of both subtasks. The target of this task is to provide the event class together with the event time boundaries, given that multiple events can be present, and may overlap, in an audio recording. The task aims at exploring how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels). One major novelty is that systems will also be evaluated on labels with different granularity, in order to get a broader view of system behavior and to assess robustness across different applications. An additional novelty is that the target classes differ between datasets, so sounds labeled in one dataset may be present but not annotated in another. Systems will therefore have to cope with potentially missing target labels during training.
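One simple way to cope with missing labels, shown here only as an illustrative sketch and not as the task baseline, is to mask the loss so that classes outside a clip's annotated vocabulary do not contribute to training:

```python
# Minimal sketch of a masked binary cross-entropy: classes that are not part
# of a clip's source-dataset vocabulary are excluded from the loss, so the
# model is not penalized for predicting sounds that were present but never
# annotated. Soft targets are supported as well, since BCE accepts them.
import torch
import torch.nn.functional as F

def masked_bce(logits, targets, annotated_mask):
    """
    logits, targets: (batch, n_classes) clip- or frame-level predictions/labels.
    annotated_mask:  (batch, n_classes) 1 where the class is annotated in the
                     clip's dataset, 0 where its label may be missing.
    """
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    loss = loss * annotated_mask
    return loss.sum() / annotated_mask.sum().clamp(min=1)

# Example: a batch of 2 clips and 4 target classes, where the second clip
# comes from a dataset that only annotates the first two classes.
logits = torch.randn(2, 4)
targets = torch.tensor([[1., 0., 1., 0.], [0., 1., 0., 0.]])
mask = torch.tensor([[1., 1., 1., 1.], [1., 1., 0., 0.]])
print(masked_bce(logits, targets, mask))
```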
Few-shot Bioacoustic Event Detection
This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Can you design a system that extracts information from five exemplar vocalisations in a long sound recording and generalises from those exemplars to find the remaining sound events in the audio? The bioacoustic (animal sound) setting addresses a real need of animal researchers, while also providing a well-defined, constrained, yet highly variable domain in which to evaluate few-shot learning methodology. Between 2021 and 2023 the task stimulated a lot of innovation, but there are many open questions and opportunities to innovate, e.g. prototypical networks, fine-tuning methods, or new architectures. The 2024 edition is the same as 2023, except for an enlarged set of data sources and a new deep learning baseline. We have added two mammal datasets (one of them a marine mammal dataset) to the validation set to make the domain shift between the validation and evaluation sets a little less steep. We will also introduce new unseen recordings for a couple of evaluation datasets.
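As a minimal sketch of the prototypical-network idea mentioned above (assuming an existing frame-level embedding model, and not reproducing the official baseline), one can build a positive prototype from the five exemplars, a negative prototype from background frames, and score every frame of the recording by its distance to the two:

```python
# Minimal sketch of few-shot detection with prototypes: the embedding model
# itself is assumed to exist (any frame-level encoder will do).
import numpy as np

def prototype(embeddings):
    """Mean of a set of (n, d) embeddings."""
    return embeddings.mean(axis=0)

def frame_scores(frame_emb, pos_proto, neg_proto):
    """Probability-like score that each frame belongs to the target event."""
    d_pos = np.linalg.norm(frame_emb - pos_proto, axis=1)
    d_neg = np.linalg.norm(frame_emb - neg_proto, axis=1)
    # softmax over negative distances, as in prototypical networks
    logits = np.stack([-d_pos, -d_neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 0]

# five exemplar events and some background frames, embedded into 64-d vectors
pos_proto = prototype(np.random.randn(5, 64))
neg_proto = prototype(np.random.randn(200, 64))
scores = frame_scores(np.random.randn(1000, 64), pos_proto, neg_proto)
events = scores > 0.5   # thresholding + post-processing yields event boundaries
```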
Organizers
Helen Whitehead
University of Salford
Joe Morford
University of Oxford
Michael Emmerson
Queen Mary University of London
Automated Audio Captioning
Automated audio captioning (AAC) is the task of describing general audio content using free text. It is an inter-modal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and environments (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent, content-oriented machine-to-machine interaction.
Sound Scene Synthesis
Sound Scene Synthesis is the task of generating a sound scene given a textual description. This new generation task expands the scope from last year’s Foley sounds to a more general sound scene. Further, it adds controllability with natural language in the form of text prompts.
The organizers will provide: 1) a development set consisting of 50 to 100 prompts and corresponding audio embeddings, and 2) a Colab template with the baseline. There will be only one track, and it permits any external resources for system training. Before inference starts, external resources may be used to install Python modules and download pre-trained models; however, calls to external resources are not allowed during inference. In other words, the inference stage must be 100% self-contained.
The evaluation will proceed as follows:
- The participants submit a Colab notebook with their contributions.
- A debugging phase will open a few days before the challenge deadline, during which participants can submit their system to check that it runs correctly.
- The organizers will query the submitted synthesizers with a new set of prompts. Reasonable limits will be set on the system footprint.
- The organizers will rank the submitted synthesizers in terms of Fréchet Audio Distance (FAD); see the sketch after this list.
- The participants will be asked to perceptually rate a sample of the generated sounds from the highest-ranked systems.
- Results will be published on July 15th, due to the time needed to perform the subjective evaluation.
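For reference, the sketch below shows how a Fréchet Audio Distance is typically computed from reference and generated embedding sets; the embedding model and the exact implementation used for ranking are chosen by the organizers, so this only illustrates the final distance computation.

```python
# Minimal sketch of the Fréchet Audio Distance: embeddings (e.g. from a
# pretrained audio classifier) are computed for the reference and generated
# audio sets, each set is summarized by a Gaussian, and the Fréchet distance
# between the two Gaussians is reported. Lower is better.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """emb_ref, emb_gen: (n_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# rank systems by FAD against the reference embedding set
print(frechet_audio_distance(np.random.randn(500, 128), np.random.randn(500, 128)))
```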
Language-Based Audio Retrieval
Language-based audio retrieval is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
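A minimal sketch of the retrieval step, assuming some cross-modal model already provides text and audio embeddings (the model itself is not part of this example), is to rank the candidate files by cosine similarity to the query and keep the top 10:

```python
# Minimal sketch of ranking candidate audio files against a caption query
# using cosine similarity between (assumed) cross-modal embeddings.
import numpy as np

def retrieve_top10(text_emb, audio_embs, audio_ids):
    """Return the ids of the 10 audio files most similar to the text query."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_embs = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = audio_embs @ text_emb          # cosine similarity per file
    order = np.argsort(-sims)[:10]        # highest similarity first
    return [audio_ids[i] for i in order]

# hypothetical 1000-file candidate pool with 512-d embeddings
ids = [f"file_{i:04d}.wav" for i in range(1000)]
print(retrieve_top10(np.random.randn(512), np.random.randn(1000, 512), ids))
```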
Language-Queried Audio Source Separation
Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Such a system could be useful in many applications, such as automatic audio editing, multimedia content retrieval, and augmented listening. The objective of this challenge is to effectively separate sound sources using natural language queries, thereby advancing the way we interact with and manipulate audio content.
Acoustic-based Traffic Monitoring
This task aims to develop an acoustic-based traffic monitoring solution to count the number of pass-by vehicles, identify their type (passenger cars or commercial vehicles) and determine the direction of travel. Traffic monitoring systems are integral components of smart city developments, facilitating road monitoring and anomaly detection within urban environments. These solutions leverage various sensors, including induction loops, vibration or magnetic sensors, radar, cameras, infrared, and acoustic sensors. Acoustic-based systems offer several advantages, making them an attractive option either on a standalone basis or in conjunction with other sensors. These advantages include low cost, power efficiency, ease of installation, resilience to adverse weather, and adaptability to low visibility conditions, among others.
In this task, participants receive traffic recordings from diverse locations, each characterized by distinct traffic densities. These recordings are captured using a 4-microphone linear array deployed alongside a road. Participants are tasked with designing a vehicle counting system for each vehicle type and travel direction. Given the challenges associated with collecting and labeling real-world traffic data, participants are encouraged to employ various traffic simulation and augmentation techniques to address the limited availability of training data. Notably, participants are provided with an acoustic traffic simulation tool alongside the traffic recordings, enabling the generation of synthetic data. This tool can simulate both individual pass-by events, based on provided information about the vehicle's type, speed, and direction, and longer audio segments containing multiple events that mimic specific traffic densities.
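As one conceptual illustration of how the linear array can be exploited (not the task baseline), the sketch below estimates the direction of travel of a single pass-by from the trend of the time difference of arrival between two microphones; it ignores calibration, overlapping vehicles, and the mapping of the sign to actual road directions, which depends on the array orientation.

```python
# Minimal sketch: estimate the direction of travel of a pass-by from the time
# difference of arrival (TDOA) between two microphones of the linear array,
# using short-time cross-correlation. A passing vehicle makes the TDOA sweep
# from one sign to the other; the sweep direction indicates travel direction.
import numpy as np

def tdoa_per_frame(mic_a, mic_b, frame_len=4096, hop=2048, max_lag=60):
    """TDOA (in samples) between two channels for each analysis frame."""
    lags = []
    for start in range(0, len(mic_a) - frame_len, hop):
        a = mic_a[start:start + frame_len]
        b = mic_b[start:start + frame_len]
        corr = np.correlate(a, b, mode="full")
        center = len(a) - 1                       # index of zero lag
        window = corr[center - max_lag:center + max_lag + 1]
        lags.append(np.argmax(window) - max_lag)
    return np.array(lags)

def travel_direction(lags):
    """Sign of the TDOA trend over the pass-by; mapping the sign to a road
    direction depends on the array orientation and channel ordering."""
    trend = np.polyfit(np.arange(len(lags)), lags, 1)[0]
    return int(np.sign(trend))

# usage with a 4-channel recording `audio` of shape (4, n_samples):
# lags = tdoa_per_frame(audio[0], audio[3])
# print(travel_direction(lags))
```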