Answering questions about general acoustic events and knowledge-heavy sound information.
Description
The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of “interactive audio understanding,” covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex audio-based questions posed in multiple-choice form (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research communities and beyond.

Audio dataset
The AQA task in DCASE 2025 consists of three distinct QA subsets in a multiple-choice question format: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning. In this section, we describe each dataset in detail. All subsets include training and development sets, while the evaluation set for the challenge will be released on June 1, 2025.
Part 1: Bioacoustics QA
Marine mammals produce a wide range of acoustic signals for various purposes, and these vocalizations are often species-specific. This characteristic enables fine-grained grounding of sounds to real-world biological events. To evaluate the perceptual and cognitive capabilities of the audio-language models, the Bioacoustics QA dataset includes questions about 31 species of marine mammals, which vary significantly in their acoustic ranges, habitats, and vocalization durations.
This subset challenges models to classify the species, the vocalization type, or both. In addition, models are asked to retrieve factual knowledge about the perceived species, interpret acoustic details, and compare the characteristics of different vocalizations.
The dataset contains 0.7K QA pairs for training and 0.2K QA pairs for development. The sample rate varies from 600 Hz to 160 kHz, and the audio duration ranges from 0.4 seconds to 625 seconds. This variability enables evaluation of how well models can adapt to diverse acoustic conditions.
Acknowledgments
All audio files used in this subset are sourced from the Watkins Marine Mammal Sound Database, maintained by the Woods Hole Oceanographic Institution and the New Bedford Whaling Museum (www.whalingmuseum.org).
Participants are strictly prohibited from using additional audio files from the same database, as some of them may be included in the evaluation set.
Part 2: Temporal Soundscapes QA
Multiple different sounds commonly occur within a single audio sample. The categories, order, and timestamps of these sounds are important for a model to understand how sounds interact and to improve its temporal reasoning ability. To evaluate the perceptual and cognitive capabilities of audio-language models, the Temporal Soundscapes QA dataset is introduced, including questions about 26 sound classes.
This subset challenges models to identify the active sound, its class, the temporal relationship among different sounds (e.g., their order of occurrence), or the timestamps of a sound (including onset, offset, and duration). The questions range from easy to difficult: for example, judging the class of the first sound in a clip is simple, while judging the duration of a sound is harder because it requires determining both the onset and offset timestamps.
The dataset contains 1k QA pairs for training and 0.6k QA pairs for development. The sample rate varies from 32 kHz to 48 kHz, and the audio is processed as mono with a duration of 10 seconds. Most audio samples correspond to a single QA pair; a small number have a one-to-many relationship with up to three QA pairs. Note that each audio sample and QA text has been carefully verified manually, including the number of sounds, sound timestamps, and text content.
Acknowledgements
All audio files used in this subset are sourced from the NIGENS general sound events database, L3DAS23 Challenge, and TAU Spatial Sound Events 2019.
Part 3: Complex QA (MMAU)
Complex QA focuses on complex question answering grounded in audio understanding. Each instance consists of a natural audio clip paired with a multi-faceted question that requires reasoning over temporal, acoustic, and contextual cues within the audio. Questions may involve identifying overlapping sound events, interpreting sequences of auditory phenomena, or discerning abstract relationships implied by the soundscape. The audio clips are sourced from AudioSet and Mira datasets, ensuring a rich and diverse set of real-world scenarios. This task builds on the principles established in the MMAU Sound subset, extending the challenge to higher-order auditory comprehension and inference.
The dataset contains 6.4k QA pairs for training and 1.6k QA pairs for development. All audio files are 10 seconds long and sampled at 16 kHz.
Acknowledgements
All audio files in this subset are sourced from AudioSet and MiraData.
Task Setup
Development Dataset
The development set consists of three subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA, as described above. Each subset includes a training set for model development and a development set for evaluating model performance.
The development set can be downloaded from Hugging Face.
Evaluation Dataset
The evaluation set will be used for ranking submissions. It can also be downloaded from Hugging Face. Please run eval_download_aqa_2025.sh to download the evaluation set.
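For convenience, the snippet below sketches one way to fetch a dataset snapshot with the huggingface_hub library; the repository ID and local directory are placeholders, not the official ones, so substitute the repository linked on the task page.

```python
# Minimal sketch of fetching a challenge subset from the Hugging Face Hub.
# The repository ID below is a placeholder, not the official one; use the
# repository linked on the task page instead.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="dcase2025-task5/aqa-development",  # placeholder repo ID
    repo_type="dataset",
    local_dir="data/aqa_2025_dev",
)
print(f"Development set downloaded to {local_dir}")
```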
External Data Resources
Use of external data resources is allowed as long as they are publicly available. The following rules apply to the use of external data:
- External data resources, such as datasets and pre-trained models, must be freely and publicly accessible before May 15th 2025.
- If participants intend to use external data resources, they are required to inform the organizers in advance. This ensures fairness by providing all participants the opportunity to access the external data. Should you wish to make use of data resources not mentioned in the list, please send an email or message in the Slack channel (# task-5-2025-audioqa-public) to the task coordinators.
- The list of allowed external data resources will be locked on May 29th 2025 (AOE).
- The participants are required to indicate clearly which external data resources they have used in the technical report.
- An up-to-date list of allowed and prohibited datasets is available here: # task-5-dataset-spreadsheet
List of external data resources allowed:
Resource name | Link |
---|---|
AudioSet | https://research.google.com/audioset/ |
AVQA | https://mn.cs.tsinghua.edu.cn/avqa/ |
Clotho | https://arxiv.org/abs/1910.09387 |
Clotho-AQA | https://arxiv.org/abs/2004.01444 |
MMAU | https://github.com/Sakshi113/mmau/tree/main |
AudioSet-Strong | https://github.com/curlsloth/audioset-strong-download |
CompA | https://github.com/Sreyan88/CompA |
WavCaps | https://github.com/XinhaoMei/WavCaps |
AudioCaps | https://github.com/cdjkim/audiocaps |
FSD-50k | https://zenodo.org/records/4060432 |
VGGSound | https://arxiv.org/abs/2004.14368 |
MUSIC-AVQA | https://gewu-lab.github.io/MUSIC-AVQA/ |
Urbansound8K | https://github.com/reml-group/fortisavqa |
FortisAVQA | https://github.com/reml-group/fortisavqa |
JamendoMaxCaps | https://github.com/AMAAI-Lab/JamendoMaxCaps |
CochlScene | https://zenodo.org/records/7080122 |
TUT Acoustic scenes 2016 | https://zenodo.org/records/45739 |
TUT Acoustic scenes 2017 | https://zenodo.org/records/400515 |
TAU Urban Acoustic Scenes 2022 Mobile | https://zenodo.org/records/6337421 |
TACOS | https://zenodo.org/records/15379789 |
OpenAQA | https://github.com/YuanGongND/ltu |
AudSem | https://huggingface.co/datasets/gijs/audsem |
SpeechCraft | https://github.com/thuhcsi/SpeechCraft |
MusicCaps | https://www.kaggle.com/datasets/googleai/musiccaps |
LP-MusicCaps-MTT | https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MTT |
Task Rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.
Task specific rules:
- Each participating team may submit up to four systems for official evaluation.
- The use of open-source large language models (LLMs) is permitted, while closed-source LLMs are not allowed. The parameter size of any single model must not exceed 100 billion (100B). For cascaded or multi-stage architectures, each individual component model must also be under 100B.
- At least one of the submitted systems must be a lightweight solution with a total parameter size under 9B.
- API-based models are permitted only if the underlying model is open-source (e.g., LLaMA-3-70B) and accompanied by documentation verifying that it adheres to the 100B constraint for scientific reproducibility. Participants intending to use such APIs must report the API usage details, including model version and associated dataset versions, in the DCASE 2025 Slack channel (# task-5-2025-audioqa-public) by May 30th for organizer review and approval.
- The use of external data resources is allowed, as long as they are publicly accessible and approved by the organizers.
- Participants are not permitted to make subjective judgments or manual annotations on the evaluation set. The evaluation set must also not be used for training any of the submitted systems.
- Participants may include unlimited additional system variants in their technical reports (e.g., for ablation studies or further analysis). However, only the four officially submitted systems per team will be considered for the challenge ranking.
- Participants may explore various post-processing methods (e.g., string matching, Sentence-BERT matching, etc.) to extract answer choices from the system’s response, as demonstrated in the baseline systems. If using LLMs for post-processing, only open-source models are allowed.
Evaluation
Participants may submit up to four systems, each of which will be ranked independently.
We use top-1 accuracy, defined as the proportion of multiple-choice questions for which the system selects the correct answer, as the primary evaluation metric. The leaderboard ranking of individual submissions will be based on this metric.
To assess the robustness of each submission, we will perform multiple evaluations by shuffling the order of the answer choices. The best and worst accuracy scores obtained across these permutations will be reported as the system’s robustness score.
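As a rough illustration of this scoring protocol (not the official scoring code), the sketch below computes top-1 accuracy from prediction/reference dictionaries keyed by question ID and reports the best and worst accuracy across several shuffled-choice runs; all names and example values are illustrative.

```python
# Illustrative scoring sketch: top-1 accuracy, plus the best and worst
# accuracy obtained across several runs with shuffled answer choices.
def top1_accuracy(predictions: dict, references: dict) -> float:
    # Proportion of questions for which the predicted choice matches the reference.
    correct = sum(predictions.get(qid) == ans for qid, ans in references.items())
    return correct / len(references)

def robustness_scores(runs: dict, references: dict) -> tuple[float, float]:
    # `runs` maps a permutation ID to {question_id: predicted_choice_letter}.
    scores = [top1_accuracy(preds, references) for preds in runs.values()]
    return max(scores), min(scores)

references = {"part2_dev_0001": "A", "part2_dev_0002": "C"}
runs = {
    "shuffle_0": {"part2_dev_0001": "A", "part2_dev_0002": "B"},
    "shuffle_1": {"part2_dev_0001": "A", "part2_dev_0002": "C"},
}
print(top1_accuracy(runs["shuffle_0"], references))  # 0.5
print(robustness_scores(runs, references))           # (1.0, 0.5)
```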
Baseline System
We adopt three baseline models for evaluation: Qwen2-Audio-7B, AudioFlamingo 2, and Gemini-2.0-Flash. These models are evaluated in a zero-shot setting on the development split of the Development Set.
Qwen2-Audio-7B
Qwen2-Audio-7B is a large audio-language model that integrates a Whisper-large-v3 audio encoder with the 7B-parameter Qwen language model, capable of generating text-based answers from audio inputs. The model is pre-trained on over 30 diverse audio understanding tasks—spanning speech, music, and environmental sounds—using unified natural-language prompts.
Evaluation
The given prompt and audio input are fed into Qwen2-Audio-Instruct for inference, generating an output text. The output text, along with the question options, is then fed into a pre-trained Sentence-BERT model to derive their respective embeddings. The final step computes the cosine similarity between the embedding of the output text and each option's embedding; the option with the highest similarity score is chosen as the final answer. Because the model needs to know the content of the options to answer certain questions in the Part 1 and Part 3 data, we include the question options in the model input during inference on these two parts. For the Part 2 data, the answer can be inferred solely from the audio, so the question options are not included in the prompt during inference on this part.
Baseline inference code for Qwen2-Audio-7B is available on Hugging Face; a minimal sketch of the option-matching step is shown below.
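The snippet below is a minimal sketch of this option-matching step using the sentence-transformers library; the Sentence-BERT checkpoint name and the example options are placeholders and may differ from those used in the baseline.

```python
# Sketch of the Sentence-BERT post-processing described above: embed the
# model response and each answer option, then return the option with the
# highest cosine similarity. The checkpoint name is a placeholder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder Sentence-BERT checkpoint

def match_option(response: str, options: list[str]) -> str:
    response_emb = model.encode(response, convert_to_tensor=True)
    option_embs = model.encode(options, convert_to_tensor=True)
    scores = util.cos_sim(response_emb, option_embs)[0]
    return options[int(scores.argmax())]

options = ["A. Whistle", "B. Echolocation clicks", "C. Pulsed call"]
print(match_option("The audio contains a series of echolocation clicks.", options))
```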
AudioFlamingo 2
AudioFlamingo 2 is an audio-language model developed for long-form audio understanding and reasoning. It adopts a Flamingo-style cross-attention architecture, combining a custom CLAP audio encoder with a lightweight 3B-parameter language model. The model is trained through a multi-stage curriculum: it is first fine-tuned on synthetic audio QA data (AudioSkills) to develop expert reasoning capabilities, and then on the LongAudio dataset to support extended audio inputs of up to 5 minutes.
Evaluation
For AudioFlamingo 2, the question format was modified to align with the model’s expected input template. Specifically, the original format—“Question? A. xxx, B. xxx, C. xxx, D. xxx”—was reformatted to: “Question? (A) xxx. (B) xxx. (C) xxx. (D) xxx.” This adjustment enabled the model to better follow the instruction and produce responses in the format: “(A/B/C/D) xxx.” As the model consistently generated clearly structured answers, the evaluation was performed via direct string matching between the model output and the correct option, without computing embedding similarity.
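The snippet below sketches this string-matching step; the regular expression and example response are illustrative assumptions rather than the exact baseline code.

```python
# Sketch of the string-matching evaluation described above: AudioFlamingo 2
# answers in the form "(A/B/C/D) xxx.", so the chosen letter can be read
# off directly from the response.
import re

def extract_choice(response: str) -> str | None:
    match = re.search(r"\(([A-D])\)", response)
    return match.group(1) if match else None

print(extract_choice("(B) The sound of a boat engine."))  # "B"
```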
Gemini-2.0-Flash
Gemini 2.0 Flash is a multimodal Transformer developed by Google DeepMind, optimized for fast and robust audio-visual question answering. It accepts audio, image, video, and text inputs within a context window of up to 1 million tokens and generates coherent textual responses. As a second-generation "Flash" model, it features advanced tool usage, long-context reasoning, and native multimodal capabilities—including image generation and speech output. While training details remain proprietary, the model is known to be trained on large-scale web multimodal datasets.
Evaluation
The given prompt and audio input are fed into Gemini-2.0-flash for inference to generate the answer. The question and options are combined into a single prompt to make Gemini select the correct answer. An example of such a prompt is: “I want you to answer the question about the audio. I will provide you with the question and multiple options. Your task is to generate the only correct option for the question. Here is the question: At what time does the first occurrence of the baby crying sound end? The options are: A. 1.9s; B. 3.1s; C. 4.8s; D. 6.0s”.
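The snippet below sketches how such a prompt could be assembled and sent with the google-generativeai Python SDK; the SDK calls, API key, and audio file name are illustrative assumptions and may need to be adapted to your setup.

```python
# Sketch of building the single prompt used for Gemini-2.0-Flash. The
# google-generativeai calls below are illustrative and may need to be
# adapted to your own API setup; the audio file name is hypothetical.
import google.generativeai as genai

def build_prompt(question: str, options: list[str]) -> str:
    return (
        "I want you to answer the question about the audio. I will provide you "
        "with the question and multiple options. Your task is to generate the "
        f"only correct option for the question. Here is the question: {question} "
        f"The options are: {'; '.join(options)}"
    )

genai.configure(api_key="YOUR_API_KEY")  # placeholder API key
model = genai.GenerativeModel("gemini-2.0-flash")
audio = genai.upload_file("part2_dev_0001.wav")  # hypothetical audio file
prompt = build_prompt(
    "At what time does the first occurrence of the baby crying sound end?",
    ["A. 1.9s", "B. 3.1s", "C. 4.8s", "D. 6.0s"],
)
response = model.generate_content([prompt, audio])
print(response.text)
```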
Baseline Results
The table below reports top-1 accuracy on the development splits.
Dataset | Qwen2-Audio-7B | AudioFlamingo 2 | Gemini-2.0-Flash |
---|---|---|---|
Part 1 Dev | 30.0% | 53.9% | 42.0% |
Part 2 Dev | 39.2% | 31.7% | 46.3% |
Part 3 Dev | 49.6% | 49.5% | 56.6% |
Dev Total | 45.0% | 45.7% | 52.5% |
Submission
General information for all DCASE submissions can be found on the Submission page. The official challenge submission must include the following:
- System output files for the evaluation set (.csv file)
- Metadata file for the submission (.yaml file)
- A technical report detailing the method for their submission (.pdf file)
- If using post-processing, the code used to process the model’s response (e.g., .py files or other scripts) must be provided.
We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file). For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task5_<submission_index>.<output or meta or technical_report or post_process>.<csv or yaml or pdf or py>
For example:
Kim_SNU_task5_1.output.csv
Kim_SNU_task5_1.meta.yaml
Kim_SNU_task5_1.post_process.py
Kim_SNU_task5_1.technical_report.pdf
System output file
The system output file should be a .csv file, and should have the following two columns:
- question: Name of the question
- answer: Answer of the system for the question. It must match one of the given choice options. Participants may apply a post-processing method to extract the exact choice from the system’s response.
For example:
question answer
. .
. .
. .
part1_test_0471 A. Sound 1
part1_test_0472 B. Sound 2
. .
. .
. .
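As a small illustration, the snippet below writes predictions in this two-column format with Python's csv module; the file name follows the naming convention above and the question IDs and answers are illustrative.

```python
# Sketch of writing the system output file in the required two-column
# format; the question IDs and answers here are illustrative.
import csv

predictions = {
    "part1_test_0471": "A. Sound 1",
    "part1_test_0472": "B. Sound 2",
}

with open("Kim_SNU_task5_1.output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    for question, answer in predictions.items():
        writer.writerow([question, answer])
```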
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below.
# Submission information for task 5
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Kim_SNU_task5_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: Qwen2-Audio-7B Baseline
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Qwen2base
  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Kim
      firstname: Jaeyeon
      email: jaeyeonkim99@snu.ac.kr # Contact email address
      corresponding: true # Mark true for one of the authors
      # Affiliation information for the author
      affiliation:
        abbreviation: SNU
        institute: Seoul National University
        department: Vision and Learning Lab # Optional
        location: Seoul, Korea
    # Second author
    # ...
# System information
system:
  end_to_end: true # True if single end-to-end system, false if cascaded (chained) system
  pretrained: true # True if the system is pretrained, false if not
  pre_loaded: qwen2-audio-7b-instruct # Name of the pre-trained model used in the system. If not pretrained, null
  autoregressive_model: true # True if the system is based on an autoregressive language model, false if not
  model_size: 8.4B # Number of total parameters of the system in billions.
  light_weighted: false # True if the system is a lightweight submission (i.e. less than 8B parameters)
  # Post processing: Details about the post processing method used in the system
  post_processing: Selected the option that has the highest SentenceBERT similarity score with the model response
  # Optional. External data resources used to train the system
  external_data_resources: [
    "AudioSet",
    "AudioCaps"
  ]
# System results on the development-testing split.
# - Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
# - If you are unable to provide all the results, incomplete results can also be reported.
# - Each score should contain at least 3 decimals.
results:
  development:
    accuracy: 45.0%
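As an optional sanity check before packaging (not part of the official tooling), the sketch below loads a metadata file with PyYAML and verifies that the main fields from the example above are present; the required-key sets are assumptions based on that example.

```python
# Optional sanity check (not part of the official tooling): load the
# metadata file with PyYAML and verify that the main fields from the
# example above are present. The required-key sets are assumptions.
import yaml

REQUIRED_SUBMISSION_KEYS = {"label", "name", "abbreviation", "authors"}
REQUIRED_SYSTEM_KEYS = {"end_to_end", "pretrained", "model_size", "light_weighted"}

with open("Kim_SNU_task5_1.meta.yaml") as f:
    meta = yaml.safe_load(f)

missing = REQUIRED_SUBMISSION_KEYS - set(meta["submission"])
missing |= REQUIRED_SYSTEM_KEYS - set(meta["system"])
if missing:
    raise ValueError(f"Missing metadata fields: {sorted(missing)}")
print("Metadata file looks complete.")
```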
Citations
If you are participating in this task, please consider citing the following papers:
- Task report
Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, and Bryan Catanzaro. Multi-domain audio question answering toward acoustic content reasoning in the dcase 2025 challenge. 2025. URL: https://arxiv.org/abs/2505.07365, arXiv:2505.07365.
- Part 1: Bioacoustics QA is part of the following work, which will be uploaded after the challenge deadline.
Jaeyeon Kim, Heeseung Yun, Sang Hoon Woo, Chao-Han Huck Yang, and Gunhee Kim. Wow-bench: evaluating fine-grained acoustic perception in audio-language models via marine mammal vocalizations. 2025.
- Part 3: ComplexQA builds on the principles established in the MMAU Sound subset.
- Baseline Models
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. 2024. URL: https://arxiv.org/abs/2407.10759, arXiv:2407.10759.
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: an audio-language model with long-audio understanding and expert reasoning abilities. 2025. URL: https://arxiv.org/abs/2503.03983, arXiv:2503.03983.
- Exemplar Inference-Time Strategies
Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, and Andreas Stolcke. Generative speech recognition error correction with large language models and task-activating prompting. In ASRU, 1–8. IEEE, 2023.