Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example, searching multimedia by its audio content; making mobile devices, robots, and cars context-aware; and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Challenge status
| Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
|---|---|---|---|---|---|
| Task 1, Low-Complexity Acoustic Scene Classification with Device Information | Released | Released | Released | Released | Released |
| Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released |
| Task 3, Stereo sound event localization and detection in regular video content | Released | Released | Released | Released | Released |
| Task 4, Spatial semantic segmentation of sound scenes | Released | Released | Released | Released | Released |
| Task 5, Audio Question Answering | Released | Released | Released | Released | Released |
| Task 6, Language-Based Audio Retrieval | Released | Released | Released | Released | Released |
updated 2024/07/01
Tasks
Low-Complexity Acoustic Scene Classification with Device Information
The task is a follow-up to previous years' Data-Efficient Low-Complexity Acoustic Scene Classification, with several modifications: 1) recording device information will be available for the evaluation set; 2) modified low-complexity constraints; 3) no constraints on external data sources; 4) only a small training set is available; 5) participants must make their inference code available.
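To make the low-complexity constraint concrete, here is a minimal sketch (not the official baseline) of a tiny CNN scene classifier together with a parameter-count check. The class count and the parameter budget below are illustrative assumptions, not the task's actual limits.

```python
# Minimal sketch: a tiny CNN acoustic scene classifier and a check
# against a hypothetical complexity budget (not the official limit).
import torch
import torch.nn as nn

NUM_CLASSES = 10          # assumption: 10 acoustic scene classes
MAX_PARAMS = 128_000      # hypothetical low-complexity budget

class TinyASC(nn.Module):
    def __init__(self, n_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 1, mel_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = TinyASC()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters (budget: {MAX_PARAMS})")
assert n_params <= MAX_PARAMS, "model exceeds the complexity budget"
```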
Organizers
First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The aim of this task is to develop anomalous sound detection techniques that can be trained on new data consisting of noisy normal machine sounds, plus a few additional samples containing only factory noise or clean normal machine sounds. The resulting models should achieve high detection performance despite environmental noise shifts and other domain shifts.
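A common approach to unsupervised anomalous sound detection is to train an autoencoder on normal sounds only and use reconstruction error as the anomaly score. The sketch below illustrates this idea; the input dimensionality and layer sizes are illustrative assumptions, not those of the task baseline.

```python
# Minimal sketch: frame-level autoencoder trained on normal machine
# sounds only; anomaly score = mean squared reconstruction error.
import torch
import torch.nn as nn

FRAME_DIM = 640  # assumption: 5 stacked 128-bin log-mel frames

class FrameAE(nn.Module):
    def __init__(self, dim: int = FRAME_DIM, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model: FrameAE, frames: torch.Tensor) -> float:
    """Mean squared reconstruction error over all frames of one clip."""
    with torch.no_grad():
        recon = model(frames)
    return torch.mean((frames - recon) ** 2).item()

# Training loop body (sketch), using normal data only:
# loss = nn.functional.mse_loss(model(batch), batch); loss.backward(); ...
```

Clips whose score exceeds a threshold chosen on normal development data are flagged as anomalous.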
Organizers
Tomoya Nishida
Hitachi, Ltd.
Kota Dohi
Hitachi, Ltd.
Harsh Purohit
Hitachi, Ltd.
Takashi Endo
Hitachi, Ltd.
Yohei Kawaguchi
Hitachi, Ltd.
Stereo sound event localization and detection in regular video content
Sound event localization and detection (SELD) is a joint task of detecting the temporal activities of multiple sound event classes of interest and estimating their respective spatial trajectories when active. This task focuses on SELD systems using stereo audio in regular video content. When viewing such content, a viewer perceives the locations of sound sources from the stereo audio. As generative AI increasingly produces content with stereo audio, analyzing stereo audio becomes essential for guaranteeing the spatial quality of such content. This analysis can also be used in various cognition tasks, such as media description and summarization with spatial cues. Specifically, this task has two tracks: audio-only inference and audiovisual inference. Given stereo audio, with or without its corresponding video, SELD systems aim to output detection and localization results for the target sound classes.
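As a rough illustration of the audio-only track, the sketch below shows a CRNN that maps a 2-channel spectrogram to per-frame class activity and a per-class azimuth estimate. The class count, layer sizes, and azimuth-only output format are illustrative assumptions, not the task's prescribed architecture or output representation.

```python
# Minimal sketch: stereo SELD model producing per-frame class activity
# probabilities and per-class azimuth estimates (illustrative only).
import torch
import torch.nn as nn

N_CLASSES = 13  # assumption: number of target sound event classes

class StereoSELD(nn.Module):
    def __init__(self, n_classes: int = N_CLASSES, mel_bins: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (mel_bins // 16), 128, batch_first=True,
                          bidirectional=True)
        self.activity = nn.Linear(256, n_classes)   # per-frame detection
        self.azimuth = nn.Linear(256, n_classes)    # per-frame DOA (radians)

    def forward(self, x):               # x: (batch, 2, frames, mel_bins)
        h = self.cnn(x)                 # (batch, 64, frames, mel_bins/16)
        h = h.permute(0, 2, 1, 3).flatten(2)
        h, _ = self.gru(h)
        return torch.sigmoid(self.activity(h)), self.azimuth(h)
```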
Organizers
Spatial semantic segmentation of sound scenes
This task aims to advance technologies for detecting and separating sound events from multichannel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals from a multichannel mixture into dry sound object signals and metadata about the object type (sound event class) and temporal localization. However, because several existing challenge tasks already cover some of these sub-functions, this year's task focuses on detecting and separating sound events from multichannel spatial input signals.
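One standard way to realize the separation step is time-frequency masking, sketched below. The mixture is transformed with an STFT, a network estimates a mask for the target event class, and the masked spectrogram is inverted back to a waveform. Here `mask_net` is a hypothetical placeholder for whatever mask estimator a system uses, and the STFT settings are illustrative.

```python
# Minimal sketch: mask-based separation of one detected event class
# from a multichannel mixture, using the first channel as reference.
import torch

def separate(mixture: torch.Tensor, mask_net, n_fft: int = 1024,
             hop: int = 256) -> torch.Tensor:
    """mixture: (channels, samples) -> estimated source: (samples,)"""
    window = torch.hann_window(n_fft)
    ref = mixture[0]                                   # reference channel
    spec = torch.stft(ref, n_fft, hop, window=window,
                      return_complex=True)             # (freq, frames)
    mask = mask_net(spec.abs())                        # values in [0, 1]
    est = torch.istft(spec * mask, n_fft, hop, window=window,
                      length=ref.shape[-1])
    return est
```

In practice the mask estimator can also exploit inter-channel spatial cues rather than a single reference channel; the single-channel version is kept here only for brevity.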
Organizers
Binh Thien Nguyen
NTT Corporation
Audio Question Answering
The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of “interactive audio understanding,” covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex audio-based questions, answered by choosing option (A), (B), or (C), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research communities and beyond. Reproducible baselines will include one resource-efficient computing setting (i.e., a single 8 GB RAM configuration); directly using enterprise APIs for a challenge entry is prohibited.
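As a rough illustration of the multiple-choice format, the sketch below scores each candidate answer against the audio with an embedding-similarity heuristic and returns the best option. The `embed_audio` and `embed_text` functions are hypothetical stand-ins for whatever encoders a system actually uses; nothing here is part of a released baseline.

```python
# Minimal sketch: pick the (A)/(B)/(C) option whose text embedding is
# most similar to the audio embedding (hypothetical encoders).
import numpy as np

def answer(audio_path: str, question: str, options: dict[str, str],
           embed_audio, embed_text) -> str:
    """options: {"A": "...", "B": "...", "C": "..."} -> chosen key."""
    a = embed_audio(audio_path)                        # (dim,)
    scores = {}
    for key, text in options.items():
        t = embed_text(f"Question: {question} Answer: {text}")
        scores[key] = float(np.dot(a, t) /
                            (np.linalg.norm(a) * np.linalg.norm(t) + 1e-9))
    return max(scores, key=scores.get)
```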
Organizers
Language-Based Audio Retrieval
This task focuses on the development of audio retrieval systems that can find audio recordings based on textual queries. While similar to previous editions, this year's evaluation setup introduces the possibility of multiple matching audio candidates for a single query. To support this, we provide additional correspondence annotations for audio-query pairs in the evaluation sets, enabling a more nuanced assessment of the retrieval systems' performance during development and final ranking.
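A typical retrieval system of this kind embeds the text query and all audio recordings into a shared space and ranks recordings by similarity. The sketch below shows that ranking step; `embed_text` is a hypothetical placeholder for a system's text encoder, and the audio embeddings are assumed to be precomputed and L2-normalized.

```python
# Minimal sketch: rank audio clips by cosine similarity between a text
# query embedding and precomputed, L2-normalized audio embeddings.
import numpy as np

def retrieve(query: str, audio_embs: np.ndarray, audio_ids: list[str],
             embed_text, top_k: int = 10) -> list[str]:
    """audio_embs: (n_clips, dim). Returns ids of the top-k clips."""
    q = embed_text(query)
    q = q / (np.linalg.norm(q) + 1e-9)
    sims = audio_embs @ q                       # cosine similarity scores
    order = np.argsort(-sims)[:top_k]
    return [audio_ids[i] for i in order]
```

With multiple matching candidates per query, as in this year's evaluation setup, the ranked list is scored against all annotated correspondences rather than a single ground-truth clip.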