Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (a car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example searching for multimedia based on its audio content, making mobile devices, robots, and cars context-aware, and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Challenge status
Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
---|---|---|---|---|---|
Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification | Released | Released | Released | Released | Released |
Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released |
Task 3, Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation | Released | Released | Released | Released | Released |
Task 4, Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels | Released | Released | Released | Released | Released |
Task 5, Few-shot Bioacoustic Event Detection | Released | Released | Released | Released | Released |
Task 6, Automated Audio Captioning | Released | Released | Released | Released | Released |
Task 7, Sound Scene Synthesis | Released | Released | Released | No evaluation dataset | Released |
Task 8, Language-Based Audio Retrieval | Released | Released | Released | Released | Released |
Task 9, Language-Queried Audio Source Separation | Released | Released | Released | Released | Released |
Task 10, Acoustic-based Traffic Monitoring | Released | Released | Released | Released | Released |
updated 2024/07/01
Tasks
Data-Efficient Low-Complexity Acoustic Scene Classification
Acoustic scene classification aims to automatically categorize audio recordings into specific environmental sound scenes, such as "metro station," "urban park," or "public square." Previous editions of the acoustic scene classification (ASC) task have focused on limited computational resources and diverse recording conditions, reflecting typical challenges faced when developing ASC models for embedded systems.
This year, participants are additionally encouraged to tackle another problematic condition, namely the limited availability of labeled training data. To this end, the ranking will be based on the number of labeled examples used for training and on the system's performance on a test set with diverse recording conditions. Participants will train their system on predefined training sets with a varying number of items. External resources, such as datasets and pre-trained models not specific to ASC, are allowed after approval by the task organizers and do not count toward the amount of labeled training data used in the ranking.
Additionally, the focus on low-complexity models is preserved by restricting the model size to 128 kB and the number of multiply-accumulate operations (MACs) per one-second audio clip to 30 million.
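As a rough illustration of what these limits mean in practice, the sketch below counts parameters and MACs for a hypothetical small CNN and compares them against the 128 kB and 30 million MAC budgets, assuming 16-bit weights; the official task description defines the exact counting rules and allowed precisions, and the layer shapes here are not the baseline architecture.

```python
# Minimal sketch: check a hypothetical model against the Task 1 complexity
# limits (128 kB of parameters and 30 million MACs per 1-second clip).

MAX_PARAM_BYTES = 128 * 1024   # 128 kB memory budget for weights
MAX_MACS = 30_000_000          # 30 M multiply-accumulate ops per 1 s clip
BYTES_PER_PARAM = 2            # assumption: 16-bit (e.g. float16) weights

def conv2d_stats(in_ch, out_ch, kernel, out_h, out_w):
    """Parameter and MAC count of a standard 2D convolution layer."""
    weights = in_ch * out_ch * kernel * kernel
    params = weights + out_ch            # weights + biases
    macs = weights * out_h * out_w       # one MAC per weight per output position
    return params, macs

# Hypothetical small CNN operating on a 1-second mel spectrogram.
layers = [
    conv2d_stats(1, 16, 3, 32, 50),
    conv2d_stats(16, 32, 3, 16, 25),
    conv2d_stats(32, 64, 3, 8, 12),
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)

print(f"params: {total_params}, "
      f"size: {total_params * BYTES_PER_PARAM / 1024:.1f} kB "
      f"(limit {MAX_PARAM_BYTES / 1024:.0f} kB)")
print(f"MACs per clip: {total_macs / 1e6:.1f} M (limit {MAX_MACS / 1e6:.0f} M)")
```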
First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The task is to develop anomalous sound detection techniques that can train models on new training data without manual tuning (the first-shot problem) and that remain robust under domain shifts. Our 2023 task also focused on the first-shot problem and domain generalization, realized by providing completely different machine types in the development and evaluation datasets, with additional attribute information (such as the machine operation speed) attached to them. Our 2024 task will again use development and evaluation datasets with different machine types, and the evaluation dataset will contain new machine types not seen in our previous tasks. In addition, we will conceal the additional attribute information for at least one machine type in the development dataset, which also mimics some real-world situations. Participants are expected to develop techniques that help solve the first-shot problem and train models that are robust to domain shifts.
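As a minimal illustration of the unsupervised setting (not the official baseline), the sketch below scores a clip by the reconstruction error of an autoencoder trained only on normal sounds; the feature extraction, model size, and thresholding shown here are placeholders.

```python
# Minimal sketch of an unsupervised anomaly score: an autoencoder is trained
# only on normal machine sounds, and the reconstruction error of a test clip
# serves as its anomaly score. Features and threshold are simplified here.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, n_features=640, n_hidden=128, n_bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, frames):
    """Mean squared reconstruction error over the frames of one clip."""
    with torch.no_grad():
        recon = model(frames)
    return torch.mean((frames - recon) ** 2).item()

# Usage: train on normal data with an MSE loss, then flag clips whose score
# exceeds a threshold chosen on the development set.
model = FrameAutoencoder()
clip_frames = torch.randn(309, 640)   # placeholder for log-mel context frames
score = anomaly_score(model, clip_frames)
```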
Organizers
Tomoya Nishida
Hitachi, Ltd.
Kota Dohi
Hitachi, Ltd.
Harsh Purohit
Hitachi, Ltd.
Takashi Endo
Hitachi, Ltd.
Yohei Kawaguchi
Hitachi, Ltd.
Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation
DCASE Challenge 2023 Task 3 introduced audiovisual processing in sound event localization and detection (SELD), in parallel to methods working only with audio input. In this year's challenge we aim to maintain the same setup, since last year's audiovisual submissions in DCASE2023 showed that stronger approaches are needed to effectively combine visual information with audio for SELD. There is, however, a new component in this year's task: source distance estimation (SDE). Complementary to the direction-of-arrival estimation (DOAE) of previous challenges, SDE has attracted increasing interest from the audio research community in the last couple of years, as it is a useful spatial cue for multiple downstream tasks, such as speech enhancement or separation. Successful SDE allows stronger learning of the spatial composition of a sound scene, since, together with the DOA, it allows absolute positional mapping of events in space.
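To make the last point concrete, the short sketch below converts a DOA estimate (azimuth and elevation) plus an estimated distance into an absolute Cartesian position; the angle convention used here is illustrative and may differ from the one defined in the task setup.

```python
# Minimal sketch of why distance complements DOA: an azimuth/elevation
# estimate only fixes a direction, while adding a distance estimate maps
# the event to an absolute position in space.
import numpy as np

def doa_and_distance_to_xyz(azimuth_deg, elevation_deg, distance_m):
    """Convert a DOA estimate plus source distance to Cartesian coordinates."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    x = distance_m * np.cos(el) * np.cos(az)
    y = distance_m * np.cos(el) * np.sin(az)
    z = distance_m * np.sin(el)
    return np.array([x, y, z])

# A source at 45 degrees azimuth, 10 degrees elevation, 2.5 m away:
print(doa_and_distance_to_xyz(45.0, 10.0, 2.5))
```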
Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels
This task is the follow-up to Tasks 4A and 4B in 2023; we propose to unify the setup of both subtasks. The target of this task is to provide the event class together with the event time boundaries, given that multiple events can be present, and may overlap, in an audio recording. The task aims at exploring how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels). One major novelty is that systems will also be evaluated on labels with different granularity, in order to get a broader view of system behavior and to assess robustness across different applications. An additional novelty is that the target classes differ between datasets, so sounds labeled in one dataset may be present but not annotated in another. Systems will therefore have to cope with potentially missing target labels during training.
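One simple way to cope with missing labels, shown here only as an illustrative sketch and not as the task baseline, is to mask the loss so that classes outside a clip's annotated vocabulary do not contribute to training:

```python
# Minimal sketch of a masked binary cross-entropy: classes that are not part
# of a clip's source-dataset vocabulary are excluded from the loss, so the
# model is not penalized for predicting sounds that were present but never
# annotated. Soft targets are supported as well, since BCE accepts them.
import torch
import torch.nn.functional as F

def masked_bce(logits, targets, annotated_mask):
    """
    logits, targets: (batch, n_classes) clip- or frame-level predictions/labels.
    annotated_mask:  (batch, n_classes) 1 where the class is annotated in the
                     clip's dataset, 0 where its label may be missing.
    """
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    loss = loss * annotated_mask
    return loss.sum() / annotated_mask.sum().clamp(min=1)

# Example: a batch of 2 clips and 4 target classes, where the second clip
# comes from a dataset that only annotates the first two classes.
logits = torch.randn(2, 4)
targets = torch.tensor([[1., 0., 1., 0.], [0., 1., 0., 0.]])
mask = torch.tensor([[1., 1., 1., 1.], [1., 1., 0., 0.]])
print(masked_bce(logits, targets, mask))
```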
Few-shot Bioacoustic Event Detection
This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Can you design a system that extracts information from five exemplar vocalisations in a long sound recording and generalises from those exemplars to find the remaining sound events in the audio? The bioacoustic (animal sound) setting addresses a real need of animal researchers, while also providing a well-defined, constrained, yet highly variable domain in which to evaluate few-shot learning methodology. Between 2021 and 2023 the task stimulated a lot of innovation, but there are many open questions and opportunities to innovate, e.g. prototypical networks, fine-tuning methods, or new architectures. The 2024 edition is the same as 2023, except for an enlarged set of data sources and a new deep learning baseline. We have added two mammal datasets (one of them a marine mammal dataset) to the validation set to make the domain shift between the validation and evaluation sets a little less steep. We will also introduce new unseen recordings for a couple of evaluation datasets.
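As a minimal sketch of the prototypical-network idea mentioned above (assuming an existing frame-level embedding model, and not reproducing the official baseline), one can build a positive prototype from the five exemplars, a negative prototype from background frames, and score every frame of the recording by its distance to the two:

```python
# Minimal sketch of few-shot detection with prototypes: the embedding model
# itself is assumed to exist (any frame-level encoder will do).
import numpy as np

def prototype(embeddings):
    """Mean of a set of (n, d) embeddings."""
    return embeddings.mean(axis=0)

def frame_scores(frame_emb, pos_proto, neg_proto):
    """Probability-like score that each frame belongs to the target event."""
    d_pos = np.linalg.norm(frame_emb - pos_proto, axis=1)
    d_neg = np.linalg.norm(frame_emb - neg_proto, axis=1)
    # softmax over negative distances, as in prototypical networks
    logits = np.stack([-d_pos, -d_neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 0]

# five exemplar events and some background frames, embedded into 64-d vectors
pos_proto = prototype(np.random.randn(5, 64))
neg_proto = prototype(np.random.randn(200, 64))
scores = frame_scores(np.random.randn(1000, 64), pos_proto, neg_proto)
events = scores > 0.5   # thresholding + post-processing yields event boundaries
```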
Organizers
Helen Whitehead
University of Salford
Joe Morford
University of Oxford
Michael Emmerson
Queen Mary University of London
Automated Audio Captioning
Automated audio captioning (AAC) is the task of describing general audio content using free text. It is an inter-modal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and environments (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent, content-oriented machine-to-machine interaction.
Sound Scene Synthesis
Sound Scene Synthesis is the task of generating a sound scene given a textual description. This new generation task expands the scope from last year’s Foley sounds to a more general sound scene. Further, it adds controllability with natural language in the form of text prompts.
The organizers will provide: 1) a development set consisting of 50 to 100 prompts and corresponding audio embeddings, and 2) a Colab template with the baseline. There will be only one track, and it permits any external resources for system training. Before inference starts, external resources may be used to install Python modules and download pre-trained models; however, calls to external resources are not allowed during inference. In other words, the inference stage must be 100% self-contained.
The evaluation will proceed as follows:
- The participants submit a Colab notebook with their contributions.
- A debugging phase will open a few days before the challenge deadline, during which participants can submit their system to check that it runs correctly.
- The organizers will query the submitted synthesizers with a new set of prompts. Reasonable limits will be set on the system footprint.
- The organizers will rank the submitted synthesizers in terms of Fréchet Audio Distance (FAD); see the sketch after this list.
- The participants will be asked to perceptually rate a sample of the generated sounds from the highest-ranked systems.
- Results will be published on July 15th, due to the time needed to perform the subjective evaluation.
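For reference, the sketch below shows how a Fréchet Audio Distance is typically computed from reference and generated embedding sets; the embedding model and the exact implementation used for ranking are chosen by the organizers, so this only illustrates the final distance computation.

```python
# Minimal sketch of the Fréchet Audio Distance: embeddings (e.g. from a
# pretrained audio classifier) are computed for the reference and generated
# audio sets, each set is summarized by a Gaussian, and the Fréchet distance
# between the two Gaussians is reported. Lower is better.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """emb_ref, emb_gen: (n_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# rank systems by FAD against the reference embedding set
print(frechet_audio_distance(np.random.randn(500, 128), np.random.randn(500, 128)))
```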
Language-Based Audio Retrieval
Language-based audio retrieval is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
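A minimal sketch of the retrieval step, assuming some cross-modal model already provides text and audio embeddings (the model itself is not part of this example), is to rank the candidate files by cosine similarity to the query and keep the top 10:

```python
# Minimal sketch of ranking candidate audio files against a caption query
# using cosine similarity between (assumed) cross-modal embeddings.
import numpy as np

def retrieve_top10(text_emb, audio_embs, audio_ids):
    """Return the ids of the 10 audio files most similar to the text query."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_embs = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = audio_embs @ text_emb          # cosine similarity per file
    order = np.argsort(-sims)[:10]        # highest similarity first
    return [audio_ids[i] for i in order]

# hypothetical 1000-file candidate pool with 512-d embeddings
ids = [f"file_{i:04d}.wav" for i in range(1000)]
print(retrieve_top10(np.random.randn(512), np.random.randn(1000, 512), ids))
```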
Language-Queried Audio Source Separation
Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Such a system could be useful in many applications, such as automatic audio editing, multimedia content retrieval, and augmented listening. The objective of this challenge is to effectively separate sound sources using natural language queries, thereby advancing the way we interact with and manipulate audio content.
Acoustic-based Traffic Monitoring
This task aims to develop an acoustic-based traffic monitoring solution to count the number of pass-by vehicles, identify their type (passenger cars or commercial vehicles) and determine the direction of travel. Traffic monitoring systems are integral components of smart city developments, facilitating road monitoring and anomaly detection within urban environments. These solutions leverage various sensors, including induction loops, vibration or magnetic sensors, radar, cameras, infrared, and acoustic sensors. Acoustic-based systems offer several advantages, making them an attractive option either on a standalone basis or in conjunction with other sensors. These advantages include low cost, power efficiency, ease of installation, resilience to adverse weather, and adaptability to low visibility conditions, among others.
In this task, participants receive traffic recordings from diverse locations, each characterized by distinct traffic densities. These recordings are captured using a 4-microphone linear array deployed alongside a road. Participants are tasked with designing a vehicle counting system for each vehicle type and travel direction. Given the challenges associated with collecting and labeling real-world traffic data, participants are encouraged to employ various traffic simulation and augmentation techniques to address the limited availability of training data. Notably, participants are provided with an acoustic traffic simulation tool alongside the traffic recordings, enabling the generation of synthetic data. This tool can simulate both individual pass-by events, based on provided information about the vehicle's type, speed, and direction, and longer audio segments containing multiple events that mimic specific traffic densities.
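As one conceptual illustration of how the linear array can be exploited (not the task baseline), the sketch below estimates the direction of travel of a single pass-by from the trend of the time difference of arrival between two microphones; it ignores calibration, overlapping vehicles, and the mapping of the sign to actual road directions, which depends on the array orientation.

```python
# Minimal sketch: estimate the direction of travel of a pass-by from the time
# difference of arrival (TDOA) between two microphones of the linear array,
# using short-time cross-correlation. A passing vehicle makes the TDOA sweep
# from one sign to the other; the sweep direction indicates travel direction.
import numpy as np

def tdoa_per_frame(mic_a, mic_b, frame_len=4096, hop=2048, max_lag=60):
    """TDOA (in samples) between two channels for each analysis frame."""
    lags = []
    for start in range(0, len(mic_a) - frame_len, hop):
        a = mic_a[start:start + frame_len]
        b = mic_b[start:start + frame_len]
        corr = np.correlate(a, b, mode="full")
        center = len(a) - 1                       # index of zero lag
        window = corr[center - max_lag:center + max_lag + 1]
        lags.append(np.argmax(window) - max_lag)
    return np.array(lags)

def travel_direction(lags):
    """Sign of the TDOA trend over the pass-by; mapping the sign to a road
    direction depends on the array orientation and channel ordering."""
    trend = np.polyfit(np.arange(len(lags)), lags, 1)[0]
    return int(np.sign(trend))

# usage with a 4-channel recording `audio` of shape (4, n_samples):
# lags = tdoa_per_frame(audio[0], audio[3])
# print(travel_direction(lags))
```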