Challenge status
| Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
|---|---|---|---|---|---|
| Task 1, Heterogeneous Audio Classification | TBA | TBA | TBA | TBA | TBA |
| Task 2, Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | TBA | TBA | TBA | TBA | TBA |
| Task 3, Semantic Acoustic Imaging for Sound Event Localization and Detection from Spatial Audio and Audiovisual Scenes | TBA | TBA | TBA | TBA | TBA |
| Task 4, Spatial Semantic Segmentation of Sound Scenes | TBA | TBA | TBA | TBA | TBA |
| Task 5, Audio-Dependent Question Answering | TBA | TBA | TBA | TBA | TBA |
| Task 6, Audio Moment Retrieval from Long Audio | TBA | TBA | TBA | TBA | TBA |
| Task 7, Domain-Agnostic Incremental Learning for Audio Classification | TBA | TBA | TBA | TBA | TBA |
updated 2026/01/29
Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (a car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications: searching multimedia based on its audio content; making mobile devices, robots, and cars context-aware; and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are often present simultaneously and are distorted by the environment.
Tasks
Heterogeneous Audio Classification
This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The taxonomy covers Music, Instrument samples, Speech, Sound Effects, and Soundscapes, and is currently used in Freesound. The goal is to evaluate models on diverse, real-world audio that varies widely in duration, content, and recording conditions. To support this, two complementary datasets derived from Freesound are provided: a curated subset (BSD10k-v1.1) and a larger, noisier user-annotated collection (BSDNoisy50k) reflecting real-world labeling variability. Participants are encouraged to explore audio, text metadata, and multimodal approaches, as well as hierarchical relationships between categories, since the evaluation is performed using hierarchical measures. The task promotes general-purpose sound classification applicable to audio organization, retrieval, and analysis.
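As an illustration of how hierarchical evaluation rewards partial correctness, the sketch below computes a set-based hierarchical F1 over a toy two-level taxonomy. The category names and the metric itself are illustrative assumptions, not the official BST labels or the official challenge measure.

```python
# Toy two-level taxonomy: second-level label -> top-level parent.
# These names are illustrative, not the official BST categories.
TAXONOMY = {
    "piano_sample": "instrument_samples",
    "guitar_sample": "instrument_samples",
    "rain": "soundscapes",
    "footsteps": "sound_effects",
}

def ancestors(label: str) -> set[str]:
    """Return the label together with its top-level parent."""
    return {label, TAXONOMY[label]}

def hierarchical_f1(predicted: str, reference: str) -> float:
    """Set-based hierarchical F1 over the {label, parent} ancestor sets.

    A prediction in the wrong leaf but the right top-level branch
    still receives partial credit.
    """
    p, r = ancestors(predicted), ancestors(reference)
    overlap = len(p & r)
    precision = overlap / len(p)
    recall = overlap / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(hierarchical_f1("piano_sample", "guitar_sample"))  # 0.5: same branch
print(hierarchical_f1("rain", "guitar_sample"))          # 0.0: different branch
```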
Organizers
Noise-aware Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The aim of this task is to develop anomalous sound detection techniques that train models on new training data containing noisy normal machine sounds, supplemented by additional distant recordings of the same machine, so that the models achieve high detection performance regardless of environmental noise or domain shifts.
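For orientation, the sketch below shows the common reconstruction-error recipe for unsupervised anomalous sound detection: an autoencoder trained only on normal sounds, with a clip's reconstruction error used as its anomaly score. This is a generic baseline idea with illustrative dimensions, not the official challenge baseline.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Small bottleneck autoencoder over per-frame features
    (e.g. stacked log-mel frames; the sizes here are illustrative)."""
    def __init__(self, dim: int = 640):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 8))
        self.decoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model: Autoencoder, features: torch.Tensor) -> float:
    """Mean squared reconstruction error over all frames of one clip.

    The model is trained only on (noisy) normal sounds, so clips it
    reconstructs poorly are flagged as anomalous.
    """
    with torch.no_grad():
        recon = model(features)
    return float(((features - recon) ** 2).mean())
```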
Organizers
Tomoya Nishida
Hitachi, Ltd.
Kota Dohi
Hitachi, Ltd.
Harsh Purohit
Hitachi, Ltd.
Takashi Endo
Hitachi, Ltd.
Yohei Kawaguchi
Hitachi, Ltd.
Semantic Acoustic Imaging for Sound Event Localization and Detection from Spatial Audio and Audiovisual Scenes
This task proposes a paradigm shift, moving from vector-based localization to Acoustic Imaging SELD. While traditional SELD estimates sparse direction-of-arrival (DOA) vectors, this task models the acoustic field as a high-resolution, dense energy map, capturing the physical extent, instantaneous energy modulation, and diffuseness of sound sources. Acoustic Imaging SELD requires generating high-resolution semantic energy maps from low-channel audio, outputting dynamic polygon masks that encode event class, localization, and acoustic energy. Participants will be required to train models that reconstruct high-fidelity acoustic images from low-resolution (4-channel) inputs.
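To ground the idea of a dense acoustic energy map, the sketch below computes a coarse steered-response-power (SRP) map from a small array by delay-and-sum beamforming over a grid of azimuths. The far-field 2-D geometry and all sizes are simplifying assumptions; the task's semantic energy maps and polygon masks go well beyond this classical starting point.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def srp_map(stft, freqs, mic_xy, n_az=72):
    """Delay-and-sum energy over a grid of azimuths (far-field, 2-D).

    stft:   (mics, freqs, frames) complex STFT of the array channels
    freqs:  (freqs,) bin frequencies in Hz
    mic_xy: (mics, 2) microphone positions in metres
    Returns an (n_az, frames) energy map over look directions.
    """
    az = np.linspace(0.0, 2 * np.pi, n_az, endpoint=False)
    u = np.stack([np.cos(az), np.sin(az)], axis=1)        # (n_az, 2) look directions
    tau = (mic_xy @ u.T) / C                              # (mics, n_az) relative delays
    # plane-wave steering vectors, one per (mic, freq, direction)
    steer = np.exp(2j * np.pi * freqs[None, :, None] * tau[:, None, :])
    # phase-align each channel toward every direction, then sum over mics
    aligned = np.einsum('mfa,mft->aft', np.conj(steer), stft)
    return (np.abs(aligned) ** 2).sum(axis=1)             # sum energy over freqs
```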
Organizers
Spatial Semantic Segmentation of Sound Scenes
The spatial semantic segmentation of sound scenes (S5) task aims to advance technologies for detecting and separating sound events from multi-channel input signals that mix multiple sound events with spatial information, a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals carrying 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata describing the object type (sound event class) and spatial information, including direction. This task is a sequel to DCASE 2025 Task 4, but it introduces two new challenges: handling overlapping sound sources primarily within the same class, and handling cases where the separation target is absent. Both are essential for real-world applications and highlight new research challenges.
Organizers
Binh Thien Nguyen
NTT, Inc.
Audio-Dependent Question Answering
Audio-Dependent Question Answering (ADQA) is a multiple-choice question answering task where questions are specifically designed to be audio-dependent: the correct answer cannot be determined without listening to the audio. Current Large Audio-Language Models (LALMs) achieve high accuracy on audio benchmarks even when the audio is replaced with silence, demonstrating that existing benchmarks contain many audio-independent questions answerable through textual priors. To address this issue, we introduce AudioMCQ, a comprehensive training set of 570,000+ samples covering Sound, Music, Speech, and Temporal domains, where each sample is labeled as either strongly or weakly audio-dependent. The evaluation set, ADQA-Bench, is constructed by collecting samples from mainstream benchmarks and applying multi-stage audio-dependency filtering, followed by careful data curation and data leakage prevention, with final human verification. This benchmark ensures that models demonstrate genuine audio comprehension rather than relying on text-based reasoning or memorization.
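The silence-substitution check described above can be sketched as follows. Here `model.answer` is a hypothetical interface standing in for any LALM, and the actual challenge pipeline uses multi-stage filtering with human verification rather than this single probe.

```python
import numpy as np

def answerable_from_silence(model, audio, question, choices, answer, trials=3):
    """Probe whether a multiple-choice question is audio-independent.

    If the model keeps answering correctly when the waveform is replaced
    by silence, the question can be solved from textual priors alone and
    is therefore a poor test of audio comprehension. Repeated trials
    guard against lucky guesses on multiple-choice questions.
    """
    silent = np.zeros_like(audio)  # same length, no acoustic content
    hits = sum(model.answer(silent, question, choices) == answer
               for _ in range(trials))
    return hits == trials
```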
Organizers
Renhe Sun
Antgroup
Zheqi Dai
The Chinese University of Hong Kong
Chunyat Wu
The Chinese University of Hong Kong
Jiayi Zhou
Antgroup
Xiquan Li
Shanghai Jiao Tong University
Yun Chen
University of Surrey
Xie Chen
Shanghai Jiao Tong University
Weiqiang Wang
Antgroup
Jian Liu
Antgroup
Audio Moment Retrieval from Long Audio
Audio moment retrieval is the task of retrieving the specific moments within a long audio recording that align with a given textual query. While conventional language-based audio retrieval aims to capture the correspondence between queries and whole audio clips, this task focuses on localizing exactly which part of a long recording corresponds to the query. The primary challenge is capturing temporal context within long audio, which requires effective sequence modeling and learning methods to improve retrieval accuracy.
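As a naive starting point, moment retrieval can be framed as scoring fixed-length windows of the long recording against the query in a shared audio-text embedding space. In the sketch below, `embed_audio` and `embed_text` stand in for any pretrained audio-language encoder (e.g. a CLAP-style model); the windowing scheme and parameters are assumptions, not the task baseline.

```python
import numpy as np

def retrieve_moment(audio, sr, query, embed_audio, embed_text,
                    win_s=5.0, hop_s=1.0):
    """Return (start, end) in seconds of the window most similar to the query."""
    q = embed_text(query)
    q = q / np.linalg.norm(q)                       # unit-normalize the text embedding
    win, hop = int(win_s * sr), int(hop_s * sr)
    best_score, best_span = -np.inf, (0.0, win_s)
    for start in range(0, max(1, len(audio) - win + 1), hop):
        a = embed_audio(audio[start:start + win])
        score = float(q @ (a / np.linalg.norm(a)))  # cosine similarity
        if score > best_score:
            best_score = score
            best_span = (start / sr, (start + win) / sr)
    return best_span
```

A real system would merge adjacent high-scoring windows into variable-length moments and model long-range temporal context rather than scoring windows independently, which is exactly the sequence-modeling challenge this task targets.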
Organizers
Domain-Agnostic Incremental Learning for Audio Classification
This task aims to develop a universal domain-incremental learning (DIL) system that learns to classify audio from different domains sequentially over time without significantly forgetting the knowledge of previously learned domains. Participants will have to train a model for sound event classification in incremental steps using data from different domains, without access to earlier domains' data at each step.
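The protocol can be made concrete with the sketch below: one shared model visits each domain exactly once and never sees earlier data again. The naive sequential fine-tuning shown is precisely what causes catastrophic forgetting; all names are illustrative, and a real submission would replace the noted step with a forgetting-mitigation strategy.

```python
import torch

def train_incrementally(model, domain_loaders, epochs=1, lr=1e-3):
    """Naive sequential fine-tuning over a stream of domains (no replay)."""
    criterion = torch.nn.CrossEntropyLoss()
    for loader in domain_loaders:                 # domains arrive one at a time
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for features, labels in loader:       # only the current domain's data
                optimizer.zero_grad()
                loss = criterion(model(features), labels)
                loss.backward()
                optimizer.step()
        # earlier loaders are never revisited; a real system would add
        # e.g. regularization, distillation, or parameter isolation here
    return model
```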
