DCASE2024 Challenge

IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events
1 April - 15 June 2024

Introduction

Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example, searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars, etc., and intelligent monitoring systems to recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Challenge status

Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results
Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification | Released | Released | Released | Released | Released
Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released
Task 3, Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation | Released | Released | Released | Released | Released
Task 4, Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels | Released | Released | Released | Released | Released
Task 5, Few-shot Bioacoustic Event Detection | Released | Released | Released | Released | Released
Task 6, Automated Audio Captioning | Released | Released | Released | Released | Released
Task 7, Sound Scene Synthesis | Released | Released | Released | No evaluation dataset | Released
Task 8, Language-Based Audio Retrieval | Released | Released | Released | Released | Released
Task 9, Language-Queried Audio Source Separation | Released | Released | Released | Released | Released
Task 10, Acoustic-based Traffic Monitoring | Released | Released | Released | Released | Released

updated 2024/07/01

Tasks

Data-Efficient Low-Complexity Acoustic Scene Classification

Task 1 (Scenes)

Acoustic scene classification (ASC) aims to automatically categorize audio recordings into specific environmental sound scenes, such as "metro station," "urban park," or "public square." Previous editions of the ASC task have focused on limited computational resources and diverse recording conditions, reflecting typical challenges faced when developing ASC models for embedded systems.

This year, participants are additionally encouraged to tackle a further challenging condition: the limited availability of labeled training data. To this end, the ranking will be based on both the number of labeled examples used for training and the system's performance on a test set with diverse recording conditions. Participants will train their systems on predefined training sets of varying size. External resources, such as datasets and pre-trained models not specific to ASC, are allowed after approval by the task organizers and do not count towards the labeled training data used in the ranking.

Additionally, the focus on low-complexity models is preserved by restricting the model size to 128 kB and the number of multiply-accumulate operations (MACs) for a one-second audio clip to 30 million.
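
As a rough illustration of these limits, the sketch below checks a model's parameter memory and per-clip MAC count against the Task 1 budget. The constants mirror the numbers above, but the helper function and the example model figures are illustrative assumptions, not the official complexity checker.

```python
# Minimal sketch (not the official checker) of the Task 1 complexity limits:
# at most 128 kB of parameter memory and 30 million MACs per 1-second clip.
# The hypothetical model figures below are illustrative only.

MAX_PARAM_BYTES = 128 * 1024          # 128 kB allowance for model weights
MAX_MACS_PER_SECOND = 30_000_000      # 30 MMACs per one-second audio clip

def within_complexity_limits(num_params: int, bytes_per_param: float,
                             macs_per_clip: int) -> bool:
    """Return True if a model fits both Task 1 complexity constraints."""
    param_bytes = num_params * bytes_per_param
    return param_bytes <= MAX_PARAM_BYTES and macs_per_clip <= MAX_MACS_PER_SECOND

# Example: a hypothetical 60k-parameter model quantized to 16-bit weights,
# spending 25 MMACs on a 1-second input, fits within the limits.
print(within_complexity_limits(60_000, 2, 25_000_000))   # True
print(within_complexity_limits(200_000, 4, 25_000_000))  # False: 800 kB of weights
```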

Organizers

Task description Results



First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Task 2 (Monitoring)

The task is to develop anomalous sound detection (ASD) techniques that can train models on new training data without manual tuning (the first-shot problem) and that work under domain generalization. Our 2023 task also focused on the first-shot problem and domain generalization, which was realized by providing completely different machine types in the development and evaluation datasets, with additional attribute information (such as the machine operation speed) attached to them. Our 2024 task will again use different machine types in the development and evaluation datasets, and the machine types in the evaluation dataset will be new ones not seen in our previous tasks. In addition, we will conceal the additional attribute information for at least one machine type in the development dataset, which also mimics real-world situations. Participants are expected to develop techniques that are useful for solving the first-shot problem and to train models that are robust to domain shifts.
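
For illustration, the sketch below shows one common unsupervised ASD recipe: fit a Gaussian to embeddings of normal training clips and score a test clip by its Mahalanobis distance. The embedding dimensionality and the stand-in random vectors are assumptions made to keep the example runnable; the official baseline may differ.

```python
# A minimal sketch of one common unsupervised ASD recipe (not the official
# baseline): fit a Gaussian to embeddings of normal training clips and use the
# Mahalanobis distance as the anomaly score. Embeddings could come from any
# pre-trained audio model; here they are stand-in random vectors.
import numpy as np

rng = np.random.default_rng(0)
normal_embeddings = rng.normal(size=(500, 64))   # stand-in for normal-clip embeddings
test_embedding = rng.normal(size=64)             # stand-in for a test clip

mean = normal_embeddings.mean(axis=0)
cov = np.cov(normal_embeddings, rowvar=False) + 1e-6 * np.eye(64)  # regularized
cov_inv = np.linalg.inv(cov)

diff = test_embedding - mean
anomaly_score = float(np.sqrt(diff @ cov_inv @ diff))  # larger = more anomalous
print(anomaly_score)
```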

Organizers

Tomoya Nishida

Hitachi, Ltd.

Noboru Harada

Daisuke Niizumi

Davide Albertini

Roberto Sannino

Simone Pradolini

Filippo Augusti

Keisuke Imoto

Doshisha University

Kota Dohi

Hitachi, Ltd.

Harsh Purohit

Hitachi, Ltd.

Takashi Endo

Hitachi, Ltd.

Yohei Kawaguchi

Hitachi, Ltd.

Task description Results



Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation

Task 3 (Localization)

DCASE2023 Challenge Task 3 introduced audiovisual processing in sound event localization and detection (SELD), in parallel to methods working only with audio input. In this year’s challenge we maintain the same setup, since last year’s audiovisual submissions showed that stronger approaches are needed to effectively combine visual information with audio for SELD. There is, however, a new component in this year’s task: source distance estimation (SDE). Complementary to the direction-of-arrival estimation (DOAE) of previous challenges, SDE has seen increased interest from the audio research community over the last couple of years, as it is a useful spatial cue for multiple downstream tasks, such as speech enhancement or separation. Successful SDE allows stronger learning of the spatial composition of a sound scene, since, together with the DOA, it allows absolute positional mapping of events in space.
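
To illustrate how SDE complements DOAE, the sketch below maps a DOA (azimuth, elevation) and a source distance to an absolute Cartesian position relative to the array. The angle convention used here is an assumption and may differ from the dataset's convention.

```python
# A minimal sketch of why SDE complements DOAE: given a direction of arrival
# (azimuth, elevation) and a source distance, the event can be mapped to an
# absolute position relative to the array. The angle convention is assumed.
import math

def doa_and_distance_to_xyz(azimuth_deg: float, elevation_deg: float,
                            distance_m: float) -> tuple:
    """Convert DOA angles and distance to Cartesian coordinates (metres)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return (x, y, z)

# Example: a source at 45 degrees azimuth, 0 degrees elevation, 2 m away.
print(doa_and_distance_to_xyz(45.0, 0.0, 2.0))  # roughly (1.41, 1.41, 0.0)
```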

Organizers

Archontis Politis

Kazuki Shimada

Yuki Mitsufuji

Tuomas Virtanen

Parthasaarathy Sudarsanam

Daniel Krause

Kengo Uchida

David Diaz-Guerra

Yuichiro Koyama

Naoya Takahashi

Takashi Shibuya

Shusuke Takahashi

Task description Results



Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Task 4 (Events)

This task is the follow-up to Task 4A and Task 4B of 2023, unifying the setup of both subtasks. The target of the task is to provide the event class together with the event time boundaries, given that multiple events can be present, and may overlap, in an audio recording. The task aims at exploring how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels). One major novelty is that systems will also be evaluated on labels of different granularity, in order to get a broader view of system behavior and to assess robustness across applications. An additional novelty is that the target classes differ between the datasets, so sounds labeled in one dataset may be present but unannotated in the other. Systems will therefore have to cope with potentially missing target labels during training.
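
One simple way to cope with missing labels, sketched below, is a masked binary cross-entropy loss that is evaluated only where a class is actually annotated in the source dataset. The array shapes, names, and soft-label handling are assumptions for illustration, not the task baseline.

```python
# A minimal sketch (not the official baseline) of handling potentially missing
# labels: binary cross-entropy computed only where a label mask marks the class
# as annotated in that dataset. Shapes and names are assumptions.
import numpy as np

def masked_bce(pred: np.ndarray, target: np.ndarray, annotated: np.ndarray) -> float:
    """BCE averaged over (frame, class) cells whose label is actually annotated.

    pred, target: arrays of shape (frames, classes) with values in [0, 1];
    target may hold soft labels. annotated: same shape, 1 where the class is
    labeled in the source dataset, 0 where the label is missing.
    """
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float((bce * annotated).sum() / max(annotated.sum(), 1))

# Example: the second class is unannotated in this dataset, so it is ignored.
pred = np.array([[0.9, 0.2], [0.1, 0.7]])
target = np.array([[1.0, 0.0], [0.0, 0.0]])
annotated = np.array([[1, 0], [1, 0]])
print(masked_bce(pred, target, annotated))
```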

Organizers

Irene Martin Morato

Constance Douwes

Annamaria Mesaros

Task description Results



Few-shot Bioacoustic Event Detection

Task 5 (Bio)

This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Can you design a system that can extract information from five exemplar vocalisations in a long sound recording, and generalise from those exemplars to find the rest of the sound events in the audio? The bioacoustic (animal sound) setting addresses a real need of animal researchers, while also providing a well-defined, constrained yet highly variable domain in which to evaluate few-shot learning methodology. Between 2021 and 2023, the task stimulated a lot of innovation, but there are still many open questions and opportunities to innovate, e.g. prototypical networks, fine-tuning methods, or new architectures. The 2024 edition is the same as 2023, except for an enlarged set of data sources and a new deep learning baseline. We have added two mammal datasets (one of marine mammals) to the validation set, to make the domain shift between the validation and evaluation sets a little less steep. We will also introduce new, unseen recordings for a couple of the evaluation datasets.
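
As a minimal illustration of the prototypical-network idea mentioned above (not the task baseline), the sketch below averages the embeddings of the five exemplar events into a prototype and flags query frames close to it. The embeddings and the threshold are stand-in assumptions.

```python
# A minimal sketch of the prototypical-network idea: average the embeddings of
# the five exemplar vocalisations into a prototype, then score query frames by
# their distance to that prototype. Embeddings are stand-in random vectors.
import numpy as np

rng = np.random.default_rng(0)
exemplar_embeddings = rng.normal(size=(5, 128))   # the five annotated shots
query_embeddings = rng.normal(size=(1000, 128))   # frames from the rest of the recording

prototype = exemplar_embeddings.mean(axis=0)
distances = np.linalg.norm(query_embeddings - prototype, axis=1)

# Frames closer to the prototype than some threshold become candidate events;
# the percentile threshold here is an arbitrary illustrative choice.
threshold = np.percentile(distances, 5)
candidate_frames = np.flatnonzero(distances <= threshold)
print(candidate_frames[:10])
```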

Organizers

Helen Whitehead

University of Salford

Joe Morford

University of Oxford

Michael Emmerson

Queen Mary University of London

Frants Jensen

Syracuse University

Ester Vidana Vila

La Salle, Universitat Ramon Llull

Dan Stowell

Task description Results



Automated Audio Captioning

Task 6 (Caption)

Automated audio captioning (AAC) is the task of general audio content description using free text. It is an inter-modal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge (e.g. "a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent, content-oriented machine-to-machine interaction.

Organizers

Task description Results



Sound Scene Synthesis

Task 7 (Synthesis)

Sound Scene Synthesis is the task of generating a sound scene given a textual description. This new generation task expands the scope from last year’s Foley sounds to more general sound scenes. Further, it adds controllability via natural language, in the form of text prompts.

The organizers will provide: 1) a development set consisting of 50 to 100 prompts and corresponding audio embeddings, and 2) a Colab template with the baseline. There will be only one track, and it permits any external resources for system training. Before inference starts, external resources may be used to install Python modules and download pre-trained models; however, calls to external resources are not allowed during inference. In other words, the inference stage must be 100% self-contained.

The evaluation will proceed as follows:

  1. The participants submit a Colab notebook with their contributions.
  2. A debugging phase will be opened a few days before the challenge deadline, during which the participants will be able to submit their system to check that it runs correctly.
  3. The organizers will query the submitted synthesizers with a new set of prompts. Reasonable limits will be set on the system footprint.
  4. The organizers will rank the submitted synthesizers in terms of Fréchet Audio Distance (FAD); a sketch of the metric is given after this list.
  5. The participants will be asked to perceptually rate a sample of the generated sounds from the highest-ranked systems.
  6. Results will be published on July 15th, due to the time needed to perform the subjective evaluation.
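
For reference, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of audio embeddings, which is the quantity behind FAD. The random stand-in embeddings and their dimensionality are assumptions; the official evaluation computes FAD on embeddings from a pre-trained audio model.

```python
# A minimal sketch of the Fréchet Audio Distance (FAD) used for ranking: the
# Fréchet distance between Gaussians fitted to embeddings of reference and
# generated audio. The arrays below are stand-ins for real embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets of shape (num_clips, embedding_dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 32))
generated = rng.normal(loc=0.3, size=(200, 32))
print(frechet_audio_distance(reference, generated))  # lower is better
```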

Organizers

Mathieu Lagrange

CNRS, Ecole Centrale Nantes, Nantes Université

Laurie Heller

Carnegie Mellon University

Keunwoo Choi

Gaudio Lab, Inc.

Brian McFee

New York University

Keisuke Imoto

Doshisha University

Yuki Okamoto

The University of Tokyo

Task description Results



Language-Based Audio Retrieval

Task 8 (Retrieval)

Language-based audio retrieval is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and sort them by how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
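
As an illustration of how such retrieval is commonly scored (not necessarily the task baseline), the sketch below ranks candidate audio clips by cosine similarity between a text-query embedding and audio embeddings, keeping the top 10. The embeddings here are random stand-ins; real systems obtain them from paired audio and text encoders.

```python
# A minimal sketch of language-based retrieval scoring: rank audio clips by
# cosine similarity to the caption-query embedding and keep the 10 best.
import numpy as np

rng = np.random.default_rng(0)
audio_embeddings = rng.normal(size=(5000, 256))   # one embedding per audio file
query_embedding = rng.normal(size=256)            # embedding of the caption query

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

similarities = l2_normalize(audio_embeddings) @ l2_normalize(query_embedding)
top10 = np.argsort(-similarities)[:10]            # indices of the 10 best matches
print(top10)
```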

Organizers

Task description Results



Language-Queried Audio Source Separation

Task 9 (Separation)

Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Such a system could be useful in many applications, such as automatic audio editing, multimedia content retrieval, and augmented listening. The objective of this challenge is to effectively separate sound sources using natural language queries, thereby advancing the way we interact with and manipulate audio content.

Organizers

Task description Results



Acoustic-based Traffic Monitoring

Task 10 (Monitoring)

This task aims to develop an acoustic-based traffic monitoring solution to count the number of pass-by vehicles, identify their type (passenger cars or commercial vehicles) and determine the direction of travel. Traffic monitoring systems are integral components of smart city developments, facilitating road monitoring and anomaly detection within urban environments. These solutions leverage various sensors, including induction loops, vibration or magnetic sensors, radar, cameras, infrared, and acoustic sensors. Acoustic-based systems offer several advantages, making them an attractive option either on a standalone basis or in conjunction with other sensors. These advantages include low cost, power efficiency, ease of installation, resilience to adverse weather, and adaptability to low visibility conditions, among others.

In this task, participants receive traffic recordings from diverse locations, each characterized by a distinct traffic density. These recordings are captured using a four-microphone linear array deployed alongside a road. Participants are tasked with designing a vehicle counting system for each vehicle type and travel direction. Given the challenges associated with collecting and labeling real-world traffic data, participants are encouraged to employ traffic simulation and augmentation techniques to address the limited availability of training data. Notably, an acoustic traffic simulation tool is provided alongside the traffic recordings, enabling the generation of synthetic data. The tool can simulate both individual pass-by events, based on provided information on the type, speed, and direction of the vehicle, and longer audio segments containing multiple events that mimic specific traffic densities.
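
As one example of a spatial cue such a system might exploit (not the task baseline), the sketch below estimates the time difference of arrival (TDOA) between two microphones by cross-correlation; tracking how this delay evolves as a vehicle passes is one way to infer its direction of travel. The signals and the delay value are synthetic stand-ins.

```python
# A minimal sketch of estimating the TDOA between two microphones of a linear
# array via cross-correlation. Real vehicle recordings would replace the
# stand-in noise signal used here.
import numpy as np

rng = np.random.default_rng(0)
source = rng.normal(size=2000)      # short stand-in for vehicle noise
true_delay = 12                     # lag in samples between the two microphones

mic_a = source
mic_b = np.concatenate([np.zeros(true_delay), source[:-true_delay]])  # delayed copy

# Cross-correlate and convert the peak index to a signed lag (the TDOA estimate).
corr = np.correlate(mic_b, mic_a, mode="full")
estimated_delay = int(np.argmax(corr)) - (len(mic_a) - 1)

print(estimated_delay)  # 12: mic_b receives the signal later than mic_a
```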

Organizers

Shabnam Ghaffarzadegan

Stefano Damiano

Abinaya Kumar

Wei-Cheng Lin

Hans-Georg Horst

Toon van Waterschoot

Task description Results