Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to extract this information automatically has enormous potential in several applications, for example, searching multimedia by its audio content; making mobile devices, robots, and cars context-aware; and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Challenge status
| Task | Task description | Development dataset | Baseline system | Evaluation dataset | Results |
|---|---|---|---|---|---|
| Task 1, Low-Complexity Acoustic Scene Classification with Device Information | Released | Released | Released | Released | Released |
| Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring | Released | Released | Released | Released | Released |
| Task 3, Stereo sound event localization and detection in regular video content | Released | Released | Released | Released | Released |
| Task 4, Spatial semantic segmentation of sound scenes | Released | Released | Released | Released | Released |
| Task 5, Audio Question Answering | Released | Released | Released | Released | Released |
| Task 6, Language-Based Audio Retrieval | Released | Released | Released | Released | Released |
updated 2024/07/01
Tasks
Low-Complexity Acoustic Scene Classification with Device Information
The task is a follow-up to previous years' Data-Efficient Low-Complexity Acoustic Scene Classification, with several modifications: 1) recording device information will be available for the evaluation set; 2) modified low-complexity constraints; 3) no constraints on external data sources; 4) only a small training set is available; 5) participants must make their inference code available.
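To make the low-complexity constraint concrete, here is a minimal sketch (not the official baseline) of a tiny CNN scene classifier together with a parameter-count check. The class count and the parameter budget below are illustrative assumptions, not the task's actual limits.

```python
# Minimal sketch: a tiny CNN acoustic scene classifier and a check
# against a hypothetical complexity budget (not the official limit).
import torch
import torch.nn as nn

NUM_CLASSES = 10          # assumption: 10 acoustic scene classes
MAX_PARAMS = 128_000      # hypothetical low-complexity budget

class TinyASC(nn.Module):
    def __init__(self, n_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 1, mel_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = TinyASC()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters (budget: {MAX_PARAMS})")
assert n_params <= MAX_PARAMS, "model exceeds the complexity budget"
```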
Organizers
First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
The aim of this task is to develop anomalous sound detection techniques that can be trained on new data consisting of noisy normal machine sounds, plus a few additional samples containing only factory noise or clean normal machine sounds. The resulting models should achieve high detection performance despite environmental noise shifts and other domain shifts.
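A common approach to unsupervised anomalous sound detection is to train an autoencoder on normal sounds only and use reconstruction error as the anomaly score. The sketch below illustrates this idea; the input dimensionality and layer sizes are illustrative assumptions, not those of the task baseline.

```python
# Minimal sketch: frame-level autoencoder trained on normal machine
# sounds only; anomaly score = mean squared reconstruction error.
import torch
import torch.nn as nn

FRAME_DIM = 640  # assumption: 5 stacked 128-bin log-mel frames

class FrameAE(nn.Module):
    def __init__(self, dim: int = FRAME_DIM, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model: FrameAE, frames: torch.Tensor) -> float:
    """Mean squared reconstruction error over all frames of one clip."""
    with torch.no_grad():
        recon = model(frames)
    return torch.mean((frames - recon) ** 2).item()

# Training loop body (sketch), using normal data only:
# loss = nn.functional.mse_loss(model(batch), batch); loss.backward(); ...
```

Clips whose score exceeds a threshold chosen on normal development data are flagged as anomalous.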
Organizers
Tomoya Nishida
Hitachi, Ltd.
Kota Dohi
Hitachi, Ltd.
Harsh Purohit
Hitachi, Ltd.
Takashi Endo
Hitachi, Ltd.
Yohei Kawaguchi
Hitachi, Ltd.
Stereo sound event localization and detection in regular video content
Sound event localization and detection (SELD) is a joint task of detecting the temporal activities of multiple sound event classes of interest and estimating their respective spatial trajectories when active. This task focuses on SELD systems using stereo audio in regular video content. When viewing such content, a viewer perceives the locations of sound sources from the stereo audio. As generative AI increasingly produces content with stereo audio, analyzing stereo audio becomes essential for guaranteeing the spatial quality of such content. This analysis can also be used in various cognition tasks, such as media description and summarization with spatial cues. Specifically, this task has two tracks: audio-only inference and audiovisual inference. Given stereo audio, with or without its corresponding video, SELD systems aim to output detection and localization results for the target sound classes.
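As a rough illustration of the audio-only track, the sketch below shows a CRNN that maps a 2-channel spectrogram to per-frame class activity and a per-class azimuth estimate. The class count, layer sizes, and azimuth-only output format are illustrative assumptions, not the task's prescribed architecture or output representation.

```python
# Minimal sketch: stereo SELD model producing per-frame class activity
# probabilities and per-class azimuth estimates (illustrative only).
import torch
import torch.nn as nn

N_CLASSES = 13  # assumption: number of target sound event classes

class StereoSELD(nn.Module):
    def __init__(self, n_classes: int = N_CLASSES, mel_bins: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (mel_bins // 16), 128, batch_first=True,
                          bidirectional=True)
        self.activity = nn.Linear(256, n_classes)   # per-frame detection
        self.azimuth = nn.Linear(256, n_classes)    # per-frame DOA (radians)

    def forward(self, x):               # x: (batch, 2, frames, mel_bins)
        h = self.cnn(x)                 # (batch, 64, frames, mel_bins/16)
        h = h.permute(0, 2, 1, 3).flatten(2)
        h, _ = self.gru(h)
        return torch.sigmoid(self.activity(h)), self.azimuth(h)
```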
Organizers
Spatial semantic segmentation of sound scenes
This task aims to advance technologies for detecting and separating sound events from multichannel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals from a multichannel mixture into dry sound object signals and metadata about the object type (sound event class) and temporal localization. However, because several existing challenge tasks already cover some of these sub-functions, this year's task focuses on detecting and separating sound events from multichannel spatial input signals.
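One standard way to realize the separation step is time-frequency masking, sketched below. The mixture is transformed with an STFT, a network estimates a mask for the target event class, and the masked spectrogram is inverted back to a waveform. Here `mask_net` is a hypothetical placeholder for whatever mask estimator a system uses, and the STFT settings are illustrative.

```python
# Minimal sketch: mask-based separation of one detected event class
# from a multichannel mixture, using the first channel as reference.
import torch

def separate(mixture: torch.Tensor, mask_net, n_fft: int = 1024,
             hop: int = 256) -> torch.Tensor:
    """mixture: (channels, samples) -> estimated source: (samples,)"""
    window = torch.hann_window(n_fft)
    ref = mixture[0]                                   # reference channel
    spec = torch.stft(ref, n_fft, hop, window=window,
                      return_complex=True)             # (freq, frames)
    mask = mask_net(spec.abs())                        # values in [0, 1]
    est = torch.istft(spec * mask, n_fft, hop, window=window,
                      length=ref.shape[-1])
    return est
```

In practice the mask estimator can also exploit inter-channel spatial cues rather than a single reference channel; the single-channel version is kept here only for brevity.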
Organizers
Binh Thien Nguyen
NTT Corporation
Audio Question Answering
The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of “interactive audio understanding,” covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex audio-based questions, answered by choosing option (A), (B), or (C), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research communities and beyond. Reproducible baselines will include one resource-efficient computing setting (i.e., a single 8 GB RAM configuration); directly using enterprise APIs for a challenge entry is prohibited.
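As a rough illustration of the multiple-choice format, the sketch below scores each candidate answer against the audio with an embedding-similarity heuristic and returns the best option. The `embed_audio` and `embed_text` functions are hypothetical stand-ins for whatever encoders a system actually uses; nothing here is part of a released baseline.

```python
# Minimal sketch: pick the (A)/(B)/(C) option whose text embedding is
# most similar to the audio embedding (hypothetical encoders).
import numpy as np

def answer(audio_path: str, question: str, options: dict[str, str],
           embed_audio, embed_text) -> str:
    """options: {"A": "...", "B": "...", "C": "..."} -> chosen key."""
    a = embed_audio(audio_path)                        # (dim,)
    scores = {}
    for key, text in options.items():
        t = embed_text(f"Question: {question} Answer: {text}")
        scores[key] = float(np.dot(a, t) /
                            (np.linalg.norm(a) * np.linalg.norm(t) + 1e-9))
    return max(scores, key=scores.get)
```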
Organizers
Language-Based Audio Retrieval
This task focuses on the development of audio retrieval systems that can find audio recordings based on textual queries. While similar to previous editions, this year's evaluation setup introduces the possibility of multiple matching audio candidates for a single query. To support this, we provide additional correspondence annotations for audio-query pairs in the evaluation sets, enabling a more nuanced assessment of the retrieval systems' performance during development and final ranking.
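A typical retrieval system of this kind embeds the text query and all audio recordings into a shared space and ranks recordings by similarity. The sketch below shows that ranking step; `embed_text` is a hypothetical placeholder for a system's text encoder, and the audio embeddings are assumed to be precomputed and L2-normalized.

```python
# Minimal sketch: rank audio clips by cosine similarity between a text
# query embedding and precomputed, L2-normalized audio embeddings.
import numpy as np

def retrieve(query: str, audio_embs: np.ndarray, audio_ids: list[str],
             embed_text, top_k: int = 10) -> list[str]:
    """audio_embs: (n_clips, dim). Returns ids of the top-k clips."""
    q = embed_text(query)
    q = q / (np.linalg.norm(q) + 1e-9)
    sims = audio_embs @ q                       # cosine similarity scores
    order = np.argsort(-sims)[:top_k]
    return [audio_ids[i] for i in order]
```

With multiple matching candidates per query, as in this year's evaluation setup, the ranked list is scored against all annotated correspondences rather than a single ground-truth clip.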