Answering questions about general acoustic events and knowledge-heavy sound information.
Description
The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of “interactive audio understanding,” covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex audio-based questions posed in multiple-choice form (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research communities and beyond.

Audio dataset
The AQA task in DCASE 2025 consists of three distinct QA subsets in a multiple-choice question format: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning. In this section, we describe each dataset in detail. All subsets include training and development sets, while the evaluation set for the challenge will be released on June 1, 2025.
Part 1: Bioacoustics QA
Marine mammals produce a wide range of acoustic signals for various purposes, and these vocalizations are often species-specific. This characteristic enables fine-grained grounding of sounds to real-world biological events. To evaluate the perceptual and cognitive capabilities of the audio-language models, the Bioacoustics QA dataset includes questions about 31 species of marine mammals, which vary significantly in their acoustic ranges, habitats, and vocalization durations.
This subset challenges models to classify the species, the vocalization type, or both. In addition, models are asked to retrieve factual knowledge about the perceived species, interpret acoustic details, and compare the characteristics of different vocalizations.
The dataset contains 0.7K QA pairs for training and 0.2K QA pairs for development. The sample rate varies from 600 Hz to 160 kHz, and the audio duration ranges from 0.4 seconds to 625 seconds. This variability enables evaluation of how well models can adapt to diverse acoustic conditions.
Acknowledgments
All audio files used in this subset are sourced from the Watkins Marine Mammal Sound Database, maintained by the Woods Hole Oceanographic Institution and the New Bedford Whaling Museum (www.whalingmuseum.org).
Participants are strictly prohibited from using additional audio files from the same database, as some of them may be included in the evaluation set.
Part 2: Temporal Soundscapes QA
Multiple different sounds commonly occur within a single audio sample. The categories, order, and timestamps of these sounds are important for a model to understand how sounds interact and to improve its temporal reasoning ability. To evaluate the perceptual and cognitive capabilities of audio-language models, the Temporal Soundscapes QA dataset is introduced, including questions about 26 sound classes.
This subset challenges models to identify the active sound, its class, the temporal relationship among different sounds (e.g., their order of occurrence), or the timestamps of a sound (including onset, offset, and duration). The questions range from easy to difficult: for example, judging the class of the first sound in a clip is simple, while judging the duration of a sound is harder because it requires determining both the onset and offset timestamps.
The dataset contains 1k QA pairs for training and 0.6k QA pairs for development. The sample rate varies from 32 kHz to 48 kHz, and the audio is processed as mono with a duration of 10 seconds. Most audio samples correspond to a single QA pair; a small number have a one-to-many relationship with up to three QA pairs. Note that each audio sample and QA text has been carefully verified manually, including the number of sounds, sound timestamps, and text content.
Acknowledgements
All audio files used in this subset are sourced from the NIGENS general sound events database, L3DAS23 Challenge, and TAU Spatial Sound Events 2019.
Part 3: Complex QA (MMAU)
Complex QA focuses on complex question answering grounded in audio understanding. Each instance consists of a natural audio clip paired with a multi-faceted question that requires reasoning over temporal, acoustic, and contextual cues within the audio. Questions may involve identifying overlapping sound events, interpreting sequences of auditory phenomena, or discerning abstract relationships implied by the soundscape. The audio clips are sourced from AudioSet and Mira datasets, ensuring a rich and diverse set of real-world scenarios. This task builds on the principles established in the MMAU Sound subset, extending the challenge to higher-order auditory comprehension and inference.
The dataset contains 6.4k QA pairs for training and 1.6k QA pairs for development. All audio files are 10 seconds long and sampled at 16 kHz.
Acknowledgements
All audio files in this subset are sourced from AudioSet and MiraData.
Task Setup
Development Dataset
The development set consists of three subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA, as described above. Each subset includes a training set for model development and a development set for evaluating model performance.
The development set can be downloaded from Hugging Face.
Evaluation Dataset
The evaluation set will be used for ranking submissions. It can also be downloaded from Hugging Face. Please run eval_download_aqa_2025.sh to download the evaluation set.
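For convenience, the snippet below sketches one way to fetch a dataset snapshot with the huggingface_hub library; the repository ID and local directory are placeholders, not the official ones, so substitute the repository linked on the task page.

```python
# Minimal sketch of fetching a challenge subset from the Hugging Face Hub.
# The repository ID below is a placeholder, not the official one; use the
# repository linked on the task page instead.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="dcase2025-task5/aqa-development",  # placeholder repo ID
    repo_type="dataset",
    local_dir="data/aqa_2025_dev",
)
print(f"Development set downloaded to {local_dir}")
```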
External Data Resources
Use of external data resources is allowed as long as they are publicly available. The following rules apply to the use of external data:
- External data resources, such as datasets and pre-trained models, must be freely and publicly accessible before May 15th 2025.
- If participants intend to use external data resources, they are required to inform the organizers in advance. This ensures fairness by providing all participants the opportunity to access the external data. Should you wish to make use of data resources not mentioned in the list, please send an email or message in the Slack channel (# task-5-2025-audioqa-public) to the task coordinators.
- The list of allowed external data resources will be locked on May 29th 2025 (AOE).
- The participants are required to indicate clearly which external data resources they have used in the technical report.
- An up-to-date list of allowed and prohibited datasets is available here: # task-5-dataset-spreadsheet
List of external data resources allowed:
Resource name | Link |
---|---|
AudioSet | https://research.google.com/audioset/ |
AVQA | https://mn.cs.tsinghua.edu.cn/avqa/ |
Clotho | https://arxiv.org/abs/1910.09387 |
Clotho-AQA | https://arxiv.org/abs/2004.01444 |
MMAU | https://github.com/Sakshi113/mmau/tree/main |
AudioSet-Strong | https://github.com/curlsloth/audioset-strong-download |
CompA | https://github.com/Sreyan88/CompA |
WavCaps | https://github.com/XinhaoMei/WavCaps |
AudioCaps | https://github.com/cdjkim/audiocaps |
FSD-50k | https://zenodo.org/records/4060432 |
VGGSound | https://arxiv.org/abs/2004.14368 |
MUSIC-AVQA | https://gewu-lab.github.io/MUSIC-AVQA/ |
Urbansound8K | https://github.com/reml-group/fortisavqa |
FortisAVQA | https://github.com/reml-group/fortisavqa |
JamendoMaxCaps | https://github.com/AMAAI-Lab/JamendoMaxCaps |
CochlScene | https://zenodo.org/records/7080122 |
TUT Acoustic scenes 2016 | https://zenodo.org/records/45739 |
TUT Acoustic scenes 2017 | https://zenodo.org/records/400515 |
TAU Urban Acoustic Scenes 2022 Mobile | https://zenodo.org/records/6337421 |
TACOS | https://zenodo.org/records/15379789 |
OpenAQA | https://github.com/YuanGongND/ltu |
AudSem | https://huggingface.co/datasets/gijs/audsem |
SpeechCraft | https://github.com/thuhcsi/SpeechCraft |
MusicCaps | https://www.kaggle.com/datasets/googleai/musiccaps |
LP-MusicCaps-MTT | https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MTT |
Task Rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.
Task specific rules:
- Each participating team may submit up to four systems for official evaluation.
- The use of open-source large language models (LLMs) is permitted, while closed-source LLMs are not allowed. The parameter size of any single model must not exceed 100 billion (100B). For cascaded or multi-stage architectures, each individual component model must also be under 100B.
- At least one of the submitted systems must be a lightweight solution with a total parameter size under 9B.
- API-based models are permitted only if the underlying model is open-source (e.g., LLaMA-3-70B) and accompanied by documentation verifying that it adheres to the 100B constraint for scientific reproducibility. Participants intending to use such APIs must report the API usage details, including model version and associated dataset versions, in the DCASE 2025 Slack channel (# task-5-2025-audioqa-public) by May 30th for organizer review and approval.
- The use of external data resources is allowed, as long as they are publicly accessible and approved by the organizers.
- Participants are not permitted to make subjective judgments or manual annotations on the evaluation set. The evaluation set must also not be used for training any of the submitted systems.
- Participants may include unlimited additional system variants in their technical reports (e.g., for ablation studies or further analysis). However, only the four officially submitted systems per team will be considered for the challenge ranking.
- Participants may explore various post-processing methods (e.g., string matching, Sentence-BERT matching, etc.) to extract answer choices from the system’s response, as demonstrated in the baseline systems. If using LLMs for post-processing, only open-source models are allowed.
Evaluation
Participants may submit up to four systems, each of which will be ranked independently.
We use top-1 accuracy, defined as the proportion of multiple-choice questions for which the system selects the correct answer, as the primary evaluation metric. The leaderboard ranking of individual submissions will be based on this metric.
To assess the robustness of each submission, we will perform multiple evaluations by shuffling the order of the answer choices. The best and worst accuracy scores obtained across these permutations will be reported as the system’s robustness score.
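As a rough illustration of this scoring protocol (not the official scoring code), the sketch below computes top-1 accuracy from prediction/reference dictionaries keyed by question ID and reports the best and worst accuracy across several shuffled-choice runs; all names and example values are illustrative.

```python
# Illustrative scoring sketch: top-1 accuracy, plus the best and worst
# accuracy obtained across several runs with shuffled answer choices.
def top1_accuracy(predictions: dict, references: dict) -> float:
    # Proportion of questions for which the predicted choice matches the reference.
    correct = sum(predictions.get(qid) == ans for qid, ans in references.items())
    return correct / len(references)

def robustness_scores(runs: dict, references: dict) -> tuple[float, float]:
    # `runs` maps a permutation ID to {question_id: predicted_choice_letter}.
    scores = [top1_accuracy(preds, references) for preds in runs.values()]
    return max(scores), min(scores)

references = {"part2_dev_0001": "A", "part2_dev_0002": "C"}
runs = {
    "shuffle_0": {"part2_dev_0001": "A", "part2_dev_0002": "B"},
    "shuffle_1": {"part2_dev_0001": "A", "part2_dev_0002": "C"},
}
print(top1_accuracy(runs["shuffle_0"], references))  # 0.5
print(robustness_scores(runs, references))           # (1.0, 0.5)
```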
Baseline System
We adopt three baseline models for evaluation: Qwen2-Audio-7B, AudioFlamingo 2, and Gemini-2.0-Flash. These models are evaluated in a zero-shot setting on the development split of the Development Set.
Qwen2-Audio-7B
Qwen2-Audio-7B is a large audio-language model that integrates a Whisper-large-v3 audio encoder with the 7B-parameter Qwen language model, capable of generating text-based answers from audio inputs. The model is pre-trained on over 30 diverse audio understanding tasks—spanning speech, music, and environmental sounds—using unified natural-language prompts.
Evaluation
The given prompt and audio input are fed into Qwen2-Audio-Instruct for inference, generating an output text. The output text, along with the question options, is then fed into a pre-trained Sentence-BERT model to derive their respective embeddings. The final step computes the cosine similarity between the embedding of the output text and each option's embedding; the option with the highest similarity score is chosen as the final answer. Because the model needs to know the content of the options to answer certain questions in the Part 1 and Part 3 data, we include the question options in the model input during inference on these two parts. For the Part 2 data, the answer can be inferred solely from the audio, so the question options are not included in the prompt during inference on this part.
Baseline inference code for Qwen2-Audio-7B is available on Hugging Face; a minimal sketch of the option-matching step is shown below.
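The snippet below is a minimal sketch of this option-matching step using the sentence-transformers library; the Sentence-BERT checkpoint name and the example options are placeholders and may differ from those used in the baseline.

```python
# Sketch of the Sentence-BERT post-processing described above: embed the
# model response and each answer option, then return the option with the
# highest cosine similarity. The checkpoint name is a placeholder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder Sentence-BERT checkpoint

def match_option(response: str, options: list[str]) -> str:
    response_emb = model.encode(response, convert_to_tensor=True)
    option_embs = model.encode(options, convert_to_tensor=True)
    scores = util.cos_sim(response_emb, option_embs)[0]
    return options[int(scores.argmax())]

options = ["A. Whistle", "B. Echolocation clicks", "C. Pulsed call"]
print(match_option("The audio contains a series of echolocation clicks.", options))
```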
AudioFlamingo 2
AudioFlamingo 2 is an audio-language model developed for long-form audio understanding and reasoning. It adopts a Flamingo-style cross-attention architecture, combining a custom CLAP audio encoder with a lightweight 3B-parameter language model. The model is trained through a multi-stage curriculum: it is first fine-tuned on synthetic audio QA data (AudioSkills) to develop expert reasoning capabilities, and then on the LongAudio dataset to support extended audio inputs of up to 5 minutes.
Evaluation
For AudioFlamingo 2, the question format was modified to align with the model’s expected input template. Specifically, the original format—“Question? A. xxx, B. xxx, C. xxx, D. xxx”—was reformatted to: “Question? (A) xxx. (B) xxx. (C) xxx. (D) xxx.” This adjustment enabled the model to better follow the instruction and produce responses in the format: “(A/B/C/D) xxx.” As the model consistently generated clearly structured answers, the evaluation was performed via direct string matching between the model output and the correct option, without computing embedding similarity.
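The snippet below sketches this string-matching step; the regular expression and example response are illustrative assumptions rather than the exact baseline code.

```python
# Sketch of the string-matching evaluation described above: AudioFlamingo 2
# answers in the form "(A/B/C/D) xxx.", so the chosen letter can be read
# off directly from the response.
import re

def extract_choice(response: str) -> str | None:
    match = re.search(r"\(([A-D])\)", response)
    return match.group(1) if match else None

print(extract_choice("(B) The sound of a boat engine."))  # "B"
```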
Gemini-2.0-Flash
Gemini 2.0 Flash is a multimodal Transformer developed by Google DeepMind, optimized for fast and robust audio-visual question answering. It accepts audio, image, video, and text inputs within a context window of up to 1 million tokens and generates coherent textual responses. As a second-generation "Flash" model, it features advanced tool usage, long-context reasoning, and native multimodal capabilities—including image generation and speech output. While training details remain proprietary, the model is known to be trained on large-scale web multimodal datasets.
Evaluation
The given prompt and audio input are fed into Gemini-2.0-flash for inference to generate the answer. The question and options are combined into a single prompt to make Gemini select the correct answer. An example of such a prompt is: “I want you to answer the question about the audio. I will provide you with the question and multiple options. Your task is to generate the only correct option for the question. Here is the question: At what time does the first occurrence of the baby crying sound end? The options are: A. 1.9s; B. 3.1s; C. 4.8s; D. 6.0s”.
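The snippet below sketches how such a prompt could be assembled and sent with the google-generativeai Python SDK; the SDK calls, API key, and audio file name are illustrative assumptions and may need to be adapted to your setup.

```python
# Sketch of building the single prompt used for Gemini-2.0-Flash. The
# google-generativeai calls below are illustrative and may need to be
# adapted to your own API setup; the audio file name is hypothetical.
import google.generativeai as genai

def build_prompt(question: str, options: list[str]) -> str:
    return (
        "I want you to answer the question about the audio. I will provide you "
        "with the question and multiple options. Your task is to generate the "
        f"only correct option for the question. Here is the question: {question} "
        f"The options are: {'; '.join(options)}"
    )

genai.configure(api_key="YOUR_API_KEY")  # placeholder API key
model = genai.GenerativeModel("gemini-2.0-flash")
audio = genai.upload_file("part2_dev_0001.wav")  # hypothetical audio file
prompt = build_prompt(
    "At what time does the first occurrence of the baby crying sound end?",
    ["A. 1.9s", "B. 3.1s", "C. 4.8s", "D. 6.0s"],
)
response = model.generate_content([prompt, audio])
print(response.text)
```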
Baseline Results
The table below reports top-1 accuracy on the development splits.
Dataset | Qwen2-Audio-7B | AudioFlamingo 2 | Gemini-2.0-Flash |
---|---|---|---|
Part 1 Dev | 30.0% | 53.9% | 42.0% |
Part 2 Dev | 39.2% | 31.7% | 46.3% |
Part 3 Dev | 49.6% | 49.5% | 56.6% |
Dev Total | 45.0% | 45.7% | 52.5% |
Submission
General information for all DCASE submissions can be found on the Submission page. The official challenge submission must include the following:
- System output files for the evaluation set (.csv file)
- Metadata file for the submission (.yaml file)
- A technical report detailing the method for their submission (.pdf file)
- If using post-processing, the code used to process the model’s response (e.g., .py files or other scripts) must be provided.
We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file). For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task5_<submission_index>.<output or meta or technical_report or post_process>.<csv or yaml or pdf or py>
For example:
Kim_SNU_task5_1.output.csv
Kim_SNU_task5_1.meta.yaml
Kim_SNU_task5_1.post_process.py
Kim_SNU_task5_1.technical_report.pdf
System output file
The system output file should be a .csv file, and should have the following two columns:
- question: Name of the question
- answer: Answer of the system for the question. It must match one of the given choice options. Participants may apply a post-processing method to extract the exact choice from the system’s response.
For example:
question answer
. .
. .
. .
part1_test_0471 A. Sound 1
part1_test_0472 B. Sound 2
. .
. .
. .
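As a small illustration, the snippet below writes predictions in this two-column format with Python's csv module; the file name follows the naming convention above and the question IDs and answers are illustrative.

```python
# Sketch of writing the system output file in the required two-column
# format; the question IDs and answers here are illustrative.
import csv

predictions = {
    "part1_test_0471": "A. Sound 1",
    "part1_test_0472": "B. Sound 2",
}

with open("Kim_SNU_task5_1.output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    for question, answer in predictions.items():
        writer.writerow([question, answer])
```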
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below.
# Submission information for task 5
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Kim_SNU_task5_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: Qwen2-Audio-7B Baseline
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Qwen2base
  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Kim
      firstname: Jaeyeon
      email: jaeyeonkim99@snu.ac.kr # Contact email address
      corresponding: true # Mark true for one of the authors
      # Affiliation information for the author
      affiliation:
        abbreviation: SNU
        institute: Seoul National University
        department: Vision and Learning Lab # Optional
        location: Seoul, Korea
    # Second author
    # ...
# System information
system:
  end_to_end: true # True if single end-to-end system, false if cascaded (chained) system
  pretrained: true # True if the system is pretrained, false if not
  pre_loaded: qwen2-audio-7b-instruct # Name of the pre-trained model used in the system. If not pretrained, null
  autoregressive_model: true # True if the system is based on an autoregressive language model, false if not
  model_size: 8.4B # Number of total parameters of the system in billions.
  light_weighted: false # True if the system is a lightweight submission (i.e. less than 8B parameters)
  # Post processing: Details about the post processing method used in the system
  post_processing: Selected the option that has the highest SentenceBERT similarity score with the model response
  # Optional. External data resources used to train the system
  external_data_resources: [
    "AudioSet",
    "AudioCaps"
  ]
# System results on the development-testing split.
# - Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
# - If you are unable to provide all the results, incomplete results can also be reported.
# - Each score should contain at least 3 decimals.
results:
  development:
    accuracy: 45.0%
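As an optional sanity check before packaging (not part of the official tooling), the sketch below loads a metadata file with PyYAML and verifies that the main fields from the example above are present; the required-key sets are assumptions based on that example.

```python
# Optional sanity check (not part of the official tooling): load the
# metadata file with PyYAML and verify that the main fields from the
# example above are present. The required-key sets are assumptions.
import yaml

REQUIRED_SUBMISSION_KEYS = {"label", "name", "abbreviation", "authors"}
REQUIRED_SYSTEM_KEYS = {"end_to_end", "pretrained", "model_size", "light_weighted"}

with open("Kim_SNU_task5_1.meta.yaml") as f:
    meta = yaml.safe_load(f)

missing = REQUIRED_SUBMISSION_KEYS - set(meta["submission"])
missing |= REQUIRED_SYSTEM_KEYS - set(meta["system"])
if missing:
    raise ValueError(f"Missing metadata fields: {sorted(missing)}")
print("Metadata file looks complete.")
```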
Citations
If you are participating in this task, please consider citing the following papers:
- Task report
Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, and Bryan Catanzaro. Multi-domain audio question answering toward acoustic content reasoning in the dcase 2025 challenge. 2025. URL: https://arxiv.org/abs/2505.07365, arXiv:2505.07365.
- Part 1: Bioacoustics QA is part of the following work, which will be uploaded after the challenge deadline.
Jaeyeon Kim, Heeseung Yun, Sang Hoon Woo, Chao-Han Huck Yang, and Gunhee Kim. Wow-bench: evaluating fine-grained acoustic perception in audio-language models via marine mammal vocalizations. 2025.
- Part 3: ComplexQA builds on the principles established in the MMAU Sound subset.
- Baseline Models
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. 2024. URL: https://arxiv.org/abs/2407.10759, arXiv:2407.10759.
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: an audio-language model with long-audio understanding and expert reasoning abilities. 2025. URL: https://arxiv.org/abs/2503.03983, arXiv:2503.03983.
- Exemplar Inference-Time Strategies
Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, and Andreas Stolcke. Generative speech recognition error correction with large language models and task-activating prompting. In ASRU, 1–8. IEEE, 2023.