DCASE2025 Challenge

IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events
1 April - 15 June 2025

Introduction

Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (a car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications: for example, searching multimedia by its audio content; context-aware mobile devices, robots, and cars; and intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Challenge status

Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results
Task 1, Low-Complexity Acoustic Scene Classification with Device Information | Released | Released | Released | Released | Released
Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released
Task 3, Stereo sound event localization and detection in regular video content | Released | Released | Released | Released | Released
Task 4, Spatial semantic segmentation of sound scenes | Released | Released | Released | Released | Released
Task 5, Audio Question Answering | Released | Released | Released | Released | Released
Task 6, Language-Based Audio Retrieval | Released | Released | Released | Released | Released

updated 2025/07/01

Tasks

Low-Complexity Acoustic Scene Classification with Device Information

Task 1

This task is a follow-up to previous years' Data-Efficient Low-Complexity Acoustic Scene Classification, with several modifications: 1) recording device information is available for the evaluation set; 2) modified low-complexity constraints; 3) no constraints on external data sources; 4) only a small training set is available; 5) participants must make their inference code available. A quick complexity check is sketched below.
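
Since the low-complexity constraints are central to this task, a pre-submission check of model size can save trouble. Below is a minimal sketch, assuming PyTorch; the parameter limit shown is a placeholder, and the binding figures are those in the official task rules.

    import torch.nn as nn

    MAX_PARAMS = 128_000  # placeholder limit; see the task rules for the real constraint

    # tiny illustrative classifier; 10 stands in for the number of scene classes
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, 10),
    )

    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params} parameters; within limit: {n_params <= MAX_PARAMS}")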

Organizers





First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Task 2

The aim of this task is to develop anomalous sound detection techniques that can train models on new data containing noisy normal machine sounds, plus a few additional samples containing only factory noise or clean normal machine sounds, so that detection performance remains high under environmental noise shifts and other domain shifts. A sketch of one common approach follows.
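
One common unsupervised approach, shown here only as a hedged sketch (not the official baseline), is to train an autoencoder on normal sounds and use reconstruction error as the anomaly score; clips the model reconstructs poorly are flagged as anomalous.

    import torch
    import torch.nn as nn

    class AE(nn.Module):
        def __init__(self, dim=128, bottleneck=8):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, bottleneck))
            self.dec = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(), nn.Linear(64, dim))

        def forward(self, x):
            return self.dec(self.enc(x))

    model = AE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    normal = torch.randn(1024, 128)  # stand-in for normal-sound feature frames
    for _ in range(10):  # tiny training loop, for illustration only
        loss = ((model(normal) - normal) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    def anomaly_score(feats):  # higher = more anomalous
        with torch.no_grad():
            return ((model(feats) - feats) ** 2).mean(dim=1)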

Organizers

Tomoya Nishida (Hitachi, Ltd.)

Noboru Harada (NTT Corporation)

Daisuke Niizumi (NTT Corporation)

Davide Albertini

Roberto Sannino

Simone Pradolini

Filippo Augusti

Keisuke Imoto (Doshisha University)

Kota Dohi (Hitachi, Ltd.)

Harsh Purohit (Hitachi, Ltd.)

Takashi Endo (Hitachi, Ltd.)

Yohei Kawaguchi (Hitachi, Ltd.)





Stereo sound event localization and detection in regular video content

Task 3

Sound event localization and detection (SELD) is the joint task of detecting the temporal activity of multiple sound event classes of interest and estimating their spatial trajectories while active. This task focuses on SELD systems that use stereo audio from regular video content. When viewing video content with stereo audio, viewers perceive source locations from the stereo audio alone. Since generative AI has shown potential for producing content with stereo audio, analyzing stereo audio becomes essential for guaranteeing content quality in terms of spatial perception. Such analysis can also support various cognition tasks, such as media description and summarization with spatial cues. Specifically, this task has two tracks: audio-only inference and audiovisual inference. Given stereo audio, without or with its corresponding video, SELD systems output detection and localization results for the target sound classes, as in the sketch below.
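
To make the two outputs concrete, here is a hedged sketch of a SELD-style multi-task head in PyTorch: per-frame class activity plus a per-class azimuth estimate. The feature extraction, label format, class list, and metrics are all defined by the task specification, not by this sketch.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 13  # placeholder; the real class list comes from the task

    class TinySELD(nn.Module):
        def __init__(self, feat_dim=64, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.sed = nn.Linear(hidden, NUM_CLASSES)  # event-activity logits
            self.doa = nn.Linear(hidden, NUM_CLASSES)  # azimuth per class

        def forward(self, x):  # x: (batch, frames, feat_dim)
            h, _ = self.rnn(x)
            activity = torch.sigmoid(self.sed(h))         # in [0, 1]
            azimuth = torch.tanh(self.doa(h)) * torch.pi  # in [-pi, pi]
            return activity, azimuth

    activity, azimuth = TinySELD()(torch.randn(2, 100, 64))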

Organizers

Kazuki Shimada

Archontis Politis

Tuomas Virtanen

Yuki Mitsufuji

Parthasaarathy Sudarsanam

David Diaz-Guerra

Kengo Uchida

Yuichiro Koyama

Naoya Takahashi

Takashi Shibuya

Shusuke Takahashi





Spatial semantic segmentation of sound scenes

Task 4

This task aims to advance technologies for detecting and separating sound events from multichannel input signals that mix multiple sound events with spatial information, a fundamental building block of immersive communication. The ultimate goal is to separate a multichannel mixture into dry sound object signals together with metadata describing each object's type (sound event class) and temporal localization. However, because several existing challenge tasks already cover subsets of this functionality, this year's task focuses on detecting and separating sound events from multichannel spatial input signals; the expected output shape is sketched below.
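
The output of such a system, sketched below under loose assumptions, is a list of separated object signals, each tagged with a class label and temporal localization; the energy-based "detector" here is only a toy stand-in for a real detection and separation model.

    import numpy as np

    def detect_and_separate(mixture, sr, threshold=0.1):
        """Toy stand-in: returns (class, onset_s, offset_s, signal) tuples
        in the spirit of the task output, from a (channels, samples) mixture."""
        mono = mixture.mean(axis=0)  # crude downmix as the 'separated' signal
        frame = sr // 10
        active = [i for i in range(0, len(mono) - frame, frame)
                  if np.sqrt((mono[i:i + frame] ** 2).mean()) > threshold]
        if not active:
            return []
        onset, offset = active[0] / sr, (active[-1] + frame) / sr
        return [("unknown_event", onset, offset, mono)]

    mix = np.random.randn(4, 16000)  # 4-channel mixture, 1 s at 16 kHz
    events = detect_and_separate(mix, sr=16000)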

Organizers

Masahiro Yasuda (NTT Corporation)

Noboru Harada (NTT Corporation)

Binh Thien Nguyen (NTT Corporation)

Daiki Takeuchi (NTT Corporation)

Daisuke Niizumi (NTT Corporation)

Marc Delcroix (NTT Corporation)

Shoko Araki (NTT Corporation)

Tomohiro Nakatani (NTT Corporation)

Yasunori Ohishi (NTT Corporation)

Nobutaka Ono (Tokyo Metropolitan University)





Audio Question Answering

Task 5

The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex multiple-choice audio questions (answered by choosing option (A), (B), or (C)), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research community and beyond. Reproducible baselines will include one resource-efficient computing setting (i.e., a single 8 GB RAM configuration); directly using enterprise APIs for challenge entries is prohibited. A scoring sketch follows.
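
For orientation, scoring a multiple-choice AQA system reduces to option accuracy. The sketch below assumes a simple item format (question, options A/B/C, gold letter) and a hypothetical file name; the actual data format and evaluation protocol are defined by the task.

    def accuracy(items, predict):
        correct = sum(predict(it["audio"], it["question"], it["options"]) == it["answer"]
                      for it in items)
        return correct / len(items)

    items = [{"audio": "clip_001.wav",  # hypothetical file name
              "question": "Which instrument plays first?",
              "options": {"A": "piano", "B": "violin", "C": "drums"},
              "answer": "B"}]
    print(accuracy(items, lambda audio, q, opts: "B"))  # dummy predictor -> 1.0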

Organizers

Huck Yang (NVIDIA Research)

Sreyan Ghosh (University of Maryland, College Park)

Sonal Kumar (University of Maryland, College Park)

Sakshi Singh (University of Maryland, College Park)

Jaeyeon Kim (Harvard University)

Zhifeng Kong (NVIDIA Research)

Dinesh Manocha (University of Maryland, College Park)

Oriol Nieto (Adobe)

Rafael Valle (NVIDIA Research)





Language-Based Audio Retrieval

Task 6

This task focuses on the development of audio retrieval systems that can find audio recordings based on textual queries. While similar to previous editions, this year's evaluation setup introduces the possibility of multiple matching audio candidates for a single query. To support this, we provide additional correspondence annotations for audio-query pairs in the evaluation sets, enabling a more nuanced assessment of the retrieval systems' performance during development and final ranking.
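
A common retrieval recipe, sketched here under the assumption of a shared text-audio embedding space (not necessarily the task baseline), ranks clips by cosine similarity to the query embedding; average precision then naturally handles multiple matching candidates per query.

    import numpy as np

    def average_precision(ranked_relevance):
        hits, precisions = 0, []
        for rank, relevant in enumerate(ranked_relevance, start=1):
            if relevant:
                hits += 1
                precisions.append(hits / rank)
        return float(np.mean(precisions)) if precisions else 0.0

    rng = np.random.default_rng(0)
    audio_emb = rng.normal(size=(100, 32))  # stand-in clip embeddings
    audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
    query = audio_emb[[3, 7]].mean(axis=0)  # query close to two relevant clips
    query /= np.linalg.norm(query)
    order = np.argsort(-(audio_emb @ query))  # descending cosine similarity
    relevant = {3, 7}  # multiple ground-truth matches for one query
    print(average_precision([int(i) in relevant for i in order]))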

Organizers

Benno Weck (Universitat Pompeu Fabra)

Tuomas Virtanen (Tampere University)