DCASE2026 Challenge

IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events
1 April - 15 June 2026

Challenge status

Task Task description Development dataset Baseline system Evaluation dataset Results
Task 1, Heterogeneous Audio Classification TBA TBA TBA TBA TBA
Task 2, Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring TBA TBA TBA TBA TBA
Task 3, Semantic Acoustic Imaging for Sound Event Localization and Detection from Spatial Audio and Audiovisual Scenes TBA TBA TBA TBA TBA
Task 4, Spatial Semantic Segmentation of Sound Scenes TBA TBA TBA TBA TBA
Task 5, Audio-Dependent Question Answering TBA TBA TBA TBA TBA
Task 6, Audio Moment Retrieval from Long Audio TBA TBA TBA TBA TBA
Task 7, Domain-Agnostic Incremental Learning for Audio Classification TBA TBA TBA TBA TBA

updated 2026/01/29

Introduction

Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example, searching multimedia based on its audio content, building context-aware mobile devices, robots, and cars, and creating intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Tasks

Heterogeneous Audio Classification

Sounds Task 1

This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The taxonomy covers Music, Instrument samples, Speech, Sound Effects, and Soundscapes, and is currently used in Freesound. The goal is to evaluate models on diverse, real-world audio that varies widely in duration, content, and recording conditions. To support this, two complementary datasets derived from Freesound are provided: a curated subset (BSD10k-v1.1) and a larger, noisier user-annotated collection (BSDNoisy50k) reflecting real-world labeling variability. Participants are encouraged to explore audio, text metadata, and multimodal approaches, as well as hierarchical relationships between categories, since evaluation is performed with hierarchical measures. The task promotes general-purpose sound classification applicable to audio organization, retrieval, and analysis.
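As a rough illustration of hierarchical scoring over a two-level taxonomy, the Python sketch below gives full credit for the correct second-level label and partial credit when only the top-level category matches; the category names, the parent_of mapping, and the scoring rule are illustrative assumptions, not the official BST definition or evaluation measure.

# Hypothetical second-level -> top-level mapping (not the real BST taxonomy).
parent_of = {
    "piano_sample": "Instrument samples",
    "field_recording": "Soundscapes",
    "male_speech": "Speech",
}

def hierarchical_score(pred, truth):
    # Full credit for the exact second-level label, half credit when only the
    # top-level (parent) category matches, zero otherwise.
    if pred == truth:
        return 1.0
    if parent_of.get(pred) == parent_of.get(truth):
        return 0.5
    return 0.0

preds = ["piano_sample", "male_speech", "field_recording"]
truths = ["piano_sample", "field_recording", "male_speech"]
score = sum(hierarchical_score(p, t) for p, t in zip(preds, truths)) / len(truths)
print(f"hierarchical accuracy: {score:.2f}")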

Organizers

Panagiota Anastasopoulou
Universitat Pompeu Fabra

Frederic Font
Universitat Pompeu Fabra

Dmitry Bogdanov
Universitat Pompeu Fabra

Lonce Wyse
Universitat Pompeu Fabra

Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Monitoring Task 2

The aim of this task is to develop anomalous sound detection techniques whose models are trained on new training data containing noisy normal machine sounds, together with additional data consisting of distant recordings of the same machine, so that the model achieves high detection performance regardless of environmental noise or the occurrence of domain shifts.
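As a minimal illustration of the unsupervised setting (fit a model of normal sounds only, then score test clips by how poorly they fit that model), the sketch below fits a Gaussian to placeholder features and uses the Mahalanobis distance as the anomaly score; the features, model, and scoring rule are illustrative assumptions, not the task baseline.

# Fit a Gaussian to features of normal (training) clips and score test clips by
# Mahalanobis distance; the random features stand in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(500, 64))      # features of noisy normal clips
test_feats = rng.normal(size=(10, 64)) + 0.5   # unseen test clips

mean = normal_feats.mean(axis=0)
cov = np.cov(normal_feats, rowvar=False) + 1e-6 * np.eye(64)
cov_inv = np.linalg.inv(cov)

def anomaly_score(x):
    d = x - mean
    return float(d @ cov_inv @ d)   # larger distance -> more anomalous

scores = [anomaly_score(x) for x in test_feats]
print(scores)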

Organizers

Tomoya Nishida
Hitachi, Ltd.

Noboru Harada

Daiki Takeuchi
NTT, Inc.

Daisuke Niizumi
Tokyo Metropolitan University

Keisuke Imoto
Kyoto University

Kota Dohi
Hitachi, Ltd.

Harsh Purohit
Hitachi, Ltd.

Takashi Endo
Hitachi, Ltd.

Yohei Kawaguchi
Hitachi, Ltd.

Semantic Acoustic Imaging for Sound Event Localization and Detection from Spatial Audio and Audiovisual Scenes

Localization Task 3

This task proposes a paradigm shift from vector-based localization to Acoustic Imaging SELD. While traditional SELD estimates sparse Direction-of-Arrival (DOA) vectors, this task models the acoustic field as a high-resolution, dense energy map that captures the physical extent, instantaneous energy modulation, and diffuseness of sound sources. Acoustic Imaging SELD requires generating high-resolution semantic energy maps from low-channel audio, outputting dynamic polygon masks that encode event class, localization, and acoustic energy. Participants will be required to train models that reconstruct high-fidelity acoustic images from low-resolution (4-channel) inputs.
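To make the target representation concrete, the shape-level sketch below contrasts the dense per-class energy maps of Acoustic Imaging SELD with the sparse DOA vectors of conventional SELD; all array sizes here are hypothetical and are not taken from the task specification.

# Shape-level sketch only: 4-channel input audio is mapped to a dense per-class
# energy map per frame, rather than to sparse DOA vectors.
import numpy as np

n_channels, n_samples = 4, 16000 * 5           # 5 s of 4-channel audio at 16 kHz
n_frames, n_classes = 50, 13                   # hypothetical temporal/class resolution
map_h, map_w = 64, 128                         # hypothetical elevation x azimuth grid

audio = np.zeros((n_channels, n_samples), dtype=np.float32)

# A model would predict, for each frame and class, an energy map over the grid;
# here we only allocate the target tensor to make the representation explicit.
energy_maps = np.zeros((n_frames, n_classes, map_h, map_w), dtype=np.float32)

# Conventional SELD would instead output one DOA vector per frame and class:
doa_vectors = np.zeros((n_frames, n_classes, 3), dtype=np.float32)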

Organizers

Archontis Politis

Kazuki Shimada

Huw Cheston
Queen Mary University of London

Parthasaarathy Sudarsanam

David Diaz-Guerra

Takashi Shibuya

Shusuke Takahashi

Yuki Mitsufuji

Spatial Semantic Segmentation of Sound Scenes

Semantic Task 4

The spatial semantic segmentation of sound scenes (S5) task aims to advance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. This is a fundamental building block of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata describing the object type (sound event class) and its spatial information, including direction. This task is a sequel to DCASE 2025 Task 4, but it introduces two new challenges: handling overlapping sound sources primarily within the same class, and handling cases where the separation target is absent. Both are essential for real-world applications and highlight new research challenges.
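A minimal sketch of the kind of output S5 asks for follows: one dry object signal plus class and direction metadata per detected event. The field names, units, and function are hypothetical placeholders, not the official submission format or baseline.

# Sketch of a per-event output record: separated dry signal plus metadata.
from dataclasses import dataclass
import numpy as np

@dataclass
class SoundObject:
    event_class: str          # e.g. "footsteps" (illustrative label)
    azimuth_deg: float        # direction metadata (illustrative units)
    elevation_deg: float
    dry_signal: np.ndarray    # separated single-channel waveform

def segment_scene(multichannel_audio: np.ndarray) -> list[SoundObject]:
    # Placeholder for a detection/separation model; it may return an empty list
    # when the target is absent, one of the new challenges this year.
    return []

mixture = np.zeros((4, 16000 * 10), dtype=np.float32)   # 10 s, 4 channels
objects = segment_scene(mixture)
print(f"{len(objects)} sound objects extracted")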

Organizers

Masahiro Yasuda
NTT, Inc.

Binh Thien Nguyen
NTT, Inc.

Noboru Harada

Daiki Takeuchi
NTT, Inc.

Carlos Hernandez-Olivan
NTT

Marc Delcroix
NTT, Inc.

Shoko Araki
NTT, Inc.

Tomohiro Nakatani
NTT, Inc.

Daisuke Niizumi
Tokyo Metropolitan University

Nobutaka Ono
Tokyo Metropolitan University

Audio-Dependent Question Answering

Reasoning Task 5

Audio-Dependent Question Answering (ADQA) is a multiple-choice question answering task where questions are specifically designed to be audio-dependent: the correct answer cannot be determined without listening to the audio. Current Large Audio-Language Models (LALMs) achieve high accuracy on audio benchmarks even when the audio is replaced with silence, demonstrating that existing benchmarks contain many audio-independent questions answerable through textual priors. To address this issue, we introduce AudioMCQ, a comprehensive training set with 570,000+ samples covering the Sound, Music, Speech, and Temporal domains, where each sample is labeled as either strongly or weakly audio-dependent. The evaluation set, ADQA-Bench, is constructed by collecting samples from mainstream benchmarks and applying multi-stage audio-dependency filtering, followed by careful data curation, data leakage prevention, and final human verification. This benchmark ensures models demonstrate genuine audio comprehension capabilities rather than relying on text-based reasoning or memorization.
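The sketch below illustrates one possible audio-dependency filter of the kind described above: a question is kept as strongly audio-dependent only if a model answers it correctly with the real audio but not with silence. The model interface is hypothetical, and the actual multi-stage filtering pipeline may differ.

# Conceptual filter: keep a question only if silence breaks the model's answer.
import numpy as np

def answer(model, audio, question, choices):
    # Hypothetical LALM call returning the index of the chosen option.
    return model(audio, question, choices)

def is_audio_dependent(model, audio, question, choices, correct_idx):
    silence = np.zeros_like(audio)
    with_audio = answer(model, audio, question, choices) == correct_idx
    with_silence = answer(model, silence, question, choices) == correct_idx
    # Strongly audio-dependent: solvable with audio, not from text priors alone.
    return with_audio and not with_silence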

Organizers

Haolin He
The Chinese University of Hong Kong

Renhe Sun
Antgroup

Zheqi Dai
The Chinese University of Hong Kong

Xingjian Du
University of Rochester

Chunyat Wu
The Chinese University of Hong Kong

Jiayi Zhou
Antgroup

Xiquan Li
Shanghai Jiao Tong University

Yun Chen
University of Surrey

Xie Chen
Shanghai Jiao Tong University

Zhiyao Duan
University of Rochester

Weiqiang Wang
Antgroup

Mark D. Plumbley

Jian Liu
Antgroup

Qiuqiang Kong
The Chinese University of Hong Kong

Audio Moment Retrieval from Long Audio

Retrieval Task 6

Audio moment retrieval is the task of retrieving specific moments within long audio recordings that align with a given textual query. While conventional language-based audio retrieval aims to accurately capture the correspondence between queries and audio clips, this task focuses on localizing exactly which part of a long audio recording corresponds to the query. The primary challenge is to capture temporal context within long audio, which requires effective sequence modeling and learning methods to enhance retrieval accuracy.
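For context, a retrieved moment is typically compared with a ground-truth moment through temporal intersection-over-union; the function below is a minimal sketch of that computation and is not necessarily the official evaluation metric of this task.

# Temporal IoU between a predicted and a ground-truth (start_s, end_s) moment.
def temporal_iou(pred, truth):
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((12.0, 20.0), (15.0, 25.0)))  # ~0.385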

Organizers

Hokuto Munakata

Tatsuya Komatsu

Keisuke Imoto
Kyoto University

Tuomas Virtanen

Domain-Agnostic Incremental Learning for Audio Classification

Learning Task 7

This task aims to develop a universal domain-incremental learning (DIL) system that learns to classify audio from different domains sequentially over time without significantly forgetting the knowledge of any previously learned domain. Participants will train a model for sound event classification in incremental steps using data from different domains, without access to data from previous domains at each step.
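The schematic loop below illustrates the domain-incremental protocol: the model is updated on one domain at a time, and data from earlier domains is unavailable at later steps. The domain names, loader, and update function are hypothetical placeholders, not the baseline system.

# Hypothetical domain order for a DIL run over three audio domains.
domains = ["urban", "machines", "wildlife"]

def load_domain(name):
    # Hypothetical loader returning (clips, labels) for one domain.
    return [], []

def train_one_step(model, clips, labels):
    # Placeholder for an incremental update, e.g. a regularization-based or
    # exemplar-free forgetting-mitigation strategy; returns the updated model.
    return model

model = None
for domain in domains:
    clips, labels = load_domain(domain)        # data from the current domain only
    model = train_one_step(model, clips, labels)
    # Evaluation after each step covers all domains seen so far, so forgetting
    # earlier domains directly lowers the final score.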

Organizers

Manjunath Mulimani

Riccardo Casciotti

Annamaria Mesaros