Audio Question Answering


Task description

Answering questions about general acoustic events and knowledge-heavy sound information.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities in the realm of “interactive audio understanding,” covering both general acoustic events and knowledge-heavy sound information within a single track. The AQA task encourages participants to develop systems that can accurately interpret and respond to complex audio-based multiple-choice questions (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types. Such systems could be useful in many applications, including audio-text and multimodal model evaluation, as well as building interactive audio-agent models for the audio research community and beyond.

Figure 1: Overview of Audio QA system.


Audio dataset

The AudioQA task in DCASE2025 consists of three distinct QA subsets, all in multiple-choice question format: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning. In this section, we describe each dataset in detail. All subsets include training and development sets, while the evaluation set for the challenge will be released on June 1, 2025.

Part 1: Bioacoustics QA

Marine mammals produce a wide range of acoustic signals for various purposes, and these vocalizations are often species-specific. This characteristic enables fine-grained grounding of sounds to real-world biological events. To evaluate the perceptual and cognitive capabilities of the audio-language models, the Bioacoustics QA dataset includes questions about 31 species of marine mammals, which vary significantly in their acoustic ranges, habitats, and vocalization durations.

This subset challenges models to classify the species, the vocalization type, or both. In addition, models are asked to retrieve factual knowledge about the perceived species, interpret acoustic details, and compare the characteristics of different vocalizations.

The dataset contains 0.7K QA pairs for training and 0.2K QA pairs for development. The sample rate varies from 600 Hz to 16 kHz, and the audio duration ranges from 0.4 seconds to 625 seconds. This variability enables evaluation of how well models can adapt to diverse acoustic conditions.
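Because the sample rate and duration vary so widely, a common first step is to resample every clip to a fixed rate before feeding it to a model. The short sketch below is only an illustration of such preprocessing; the 16 kHz target rate and the file name are assumptions, not part of the official pipeline.

import librosa

def load_clip(path, target_sr=16000):
    """Load an audio file as mono and resample it to target_sr (assumed 16 kHz)."""
    waveform, sr = librosa.load(path, sr=target_sr, mono=True)
    return waveform, sr

# Example with a hypothetical file name:
# audio, sr = load_clip("bioacoustics/train/clip_0001.wav")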

Acknowledgments

All audio files used in this subset are sourced from the Watkins Marine Mammal Sound Database, maintained by the Woods Hole Oceanographic Institution, New Bedford Whaling Museum (www.whalingmuseum.org).

Participants are strictly prohibited from using additional audio files from the same database, as some of them may be included in the evaluation set.

Part 2: Temporal Soundscapes QA

Multiple different sounds commonly occur within a single audio sample. The categories, order, and timestamps of these sounds are important for a model to understand the interactions between sounds and to improve its temporal reasoning ability. To evaluate the perceptual and cognitive capabilities of audio-language models, the Temporal Soundscapes QA dataset is introduced, including questions about 26 sound classes.

This subset challenges models to identify the active sound, the sound class, the temporal relationship among different sounds (e.g., their order of occurrence), or the timestamps (onset, offset, and duration) of a sound. The questions range from easy to difficult: for example, it is simple to identify the class of the first sound in an audio clip, but difficult to determine the duration of a sound, which requires estimating both its onset and offset timestamps.

The dataset contains ~1k QA pairs for training and ~0.6k QA pairs for development. The sample rate varies from 32 kHz to 48 kHz, and the audio is processed as mono with a duration of 10 seconds. Most audio samples correspond to a single QA pair; a small number have a one-to-many relationship (at most one to three). Note that each audio sample and QA text has been carefully verified by hand, including the number of sounds, the sound timestamps, and the text content.

Acknowledgements

All audio files used in this subset are sourced from the NIGENS general sound events database, L3DAS23 Challenge, and TAU Spatial Sound Events 2019.

Part 3: Complex QA (MMAU)

Task Setup

Complex QA focuses on complex question answering grounded in audio understanding. Each instance consists of a natural audio clip paired with a multi-faceted question that requires reasoning over temporal, acoustic, and contextual cues within the audio. Questions may involve identifying overlapping sound events, interpreting sequences of auditory phenomena, or discerning abstract relationships implied by the soundscape. The audio clips are sourced from AudioSet and Mira datasets, ensuring a rich and diverse set of real-world scenarios. This task builds on the principles established in the MMAU Sound subset, extending the challenge to higher-order auditory comprehension and inference.

Acknowledgements

All audio files in this subset are sourced from the AudioSet and Mira datasets.

Development Dataset

The development set consists of three subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA, as described above. Each subset includes a training set for model development and a development set for evaluating model performance.

The development set can be downloaded through Hugging Face.
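As a rough illustration, the files can be fetched with the huggingface_hub client; the repository identifier below is a placeholder, so use the one given on the challenge page.

from huggingface_hub import snapshot_download

# Download the development data locally; the repo_id is hypothetical.
local_dir = snapshot_download(repo_id="dcase2025-task5/aqa-development",
                              repo_type="dataset")
print(local_dir)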

Evaluation Dataset

The evaluation set will be used for ranking submissions. It will be released a few weeks prior to the final submission deadline.

External Data Resources

Use of external data resources is allowed as long as they are publicly available. The following rules apply to the use of external data:

  • External data resources, such as datasets and pre-trained models, must be freely and publicly accessible before May 15th 2025.
  • If participants intend to use external data resources, they are required to inform the organizers in advance. This ensures fairness by providing all participants the opportunity to access the external data. Should you wish to make use of data resources not mentioned in the list, please send an email or message in the Slack channel (# task-5-2025-audioqa-public) to the task coordinators.
  • The list of allowed external data resources will be locked on May 15th 2025 (no further external sources allowed).
  • The participants are required to indicate clearly which external data resources they have used in the technical report.

List of external data resources allowed:

Resource name    Type     Added         Link
AudioSet         audio    01.04.2025    https://research.google.com/audioset/

Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task specific rules:

  • Each participating team may submit up to four systems for official evaluation.
  • The use of open-source large language models (LLMs) is permitted, while closed-source LLMs are not allowed. The parameter size of any single model must not exceed 100 billion (100B). For cascaded or multi-stage architectures, each individual component model must also be under 100B.
  • At least one of the submitted systems must be a lightweight solution with a total parameter size under 9B.
  • API-based models are permitted only if the underlying model is open-source (e.g., LLaMA-3-70B) and accompanied by documentation verifying that it adheres to the 100B constraint for scientific reproducibility. Participants intending to use such APIs must report the API usage details, including model version and associated dataset versions, in the DCASE 2025 Slack channel (# task-5-2025-audioqa-public) by May 30th for organizer review and approval.
  • The use of external data resources is allowed, as long as they are publicly accessible and approved by the organizers.
  • Participants are not permitted to make subjective judgments or manual annotations on the evaluation set. The evaluation set must also not be used for training any of the submitted systems.
  • Participants may include unlimited additional system variants in their technical reports (e.g., for ablation studies or further analysis). However, only the four officially submitted systems per team will be considered for the challenge ranking.

Evaluation

Participants may submit up to four systems, each of which will be ranked independently.

We use top-1 accuracy, defined as the proportion of multiple-choice questions for which the system selects the correct answer, as the primary evaluation metric. The leaderboard ranking of individual submissions will be based on this metric.

To assess the robustness of each submission, we will perform multiple evaluations by shuffling the order of the answer choices. The best and worst accuracy scores obtained across these permutations will be reported as the system’s robustness score.
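The sketch below illustrates both metrics: top-1 accuracy over the multiple-choice questions, and the best/worst accuracy across random permutations of the answer choices. The predict function, which maps a question and an ordered list of choices to a chosen index, stands in for a submitted system and is purely hypothetical; the official scoring script may differ in detail.

import random

def top1_accuracy(predictions, references):
    """Fraction of questions where the predicted choice matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def robustness(questions, choices_per_q, answer_idx_per_q, predict, n_perms=5, seed=0):
    """Best and worst top-1 accuracy over random shufflings of the answer choices."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perms):
        preds, refs = [], []
        for q, choices, ans in zip(questions, choices_per_q, answer_idx_per_q):
            order = list(range(len(choices)))
            rng.shuffle(order)
            shuffled = [choices[i] for i in order]
            preds.append(predict(q, shuffled))   # index chosen by the system
            refs.append(order.index(ans))        # position of the true answer after shuffling
        scores.append(top1_accuracy(preds, refs))
    return max(scores), min(scores)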

Baseline System

We adopt three baseline models—Qwen2-Audio-7B, AudioFlamingo 2, and Gemini-2-Flash—for evaluation. These models are evaluated in a zero-shot setting on the development split of the Development Set.

Qwen2-Audio-7B

Qwen2-Audio-7B is a large audio-language model that integrates a Whisper-large-v3 audio encoder with the 7B-parameter Qwen language model, capable of generating text-based answers from audio inputs. The model is pre-trained on over 30 diverse audio understanding tasks—spanning speech, music, and environmental sounds—using unified natural-language prompts.

Evaluation

The given prompt and audio input are fed into Qwen2-Audio-Instruct for inference, generating an output text. The output text and each of the question options are then passed through a pre-trained Sentence-BERT model to obtain their embeddings. Finally, the cosine similarity between the embedding of the output text and the embedding of each option is computed, and the option with the highest similarity score is chosen as the final answer. Since the model needs to know the content of the options to answer certain questions in the Part 1 and Part 3 data, the question options are included in the model input during inference on these two parts. For the Part 2 data, the answer can be inferred from the audio alone, so the question options are not included in the prompt given to Qwen.
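The option-matching step can be sketched with the sentence-transformers library as below; the Sentence-BERT checkpoint name is an assumption and is not necessarily the one used by the baseline.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint

def select_option(output_text, options):
    """Return the index of the option most similar to the generated answer."""
    out_emb = embedder.encode(output_text, convert_to_tensor=True)
    opt_embs = embedder.encode(options, convert_to_tensor=True)
    sims = util.cos_sim(out_emb, opt_embs)[0]    # one similarity score per option
    return int(sims.argmax())

# Example:
# select_option("The sound is a thunderstorm.",
#               ["(A) Thunderstorm", "(B) Rain", "(C) Wind"])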

AudioFlamingo 2

AudioFlamingo 2 is an audio-language model developed for long-form audio understanding and reasoning. It adopts a Flamingo-style cross-attention architecture, combining a custom CLAP audio encoder with a lightweight 3B-parameter language model. The model is trained through a multi-stage curriculum: it is first fine-tuned on synthetic audio QA data (AudioSkills) to develop expert reasoning capabilities, and then on the LongAudio dataset to support extended audio inputs of up to 5 minutes.

Evaluation

For AudioFlamingo 2, the question format was modified to align with the model’s expected input template. Specifically, the original format—“Question? A. xxx, B. xxx, C. xxx, D. xxx”—was reformatted to: “Question? (A) xxx. (B) xxx. (C) xxx. (D) xxx.” This adjustment enabled the model to better follow the instruction and produce responses in the format: “(A/B/C/D) xxx.” As the model consistently generated clearly structured answers, the evaluation was performed via direct string matching between the model output and the correct option, without computing embedding similarity.
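A minimal sketch of this reformatting and of the string matching is given below; the regular expression and the option handling are assumptions about the input formatting rather than the baseline's exact code.

import re

def reformat_question(question, options):
    """Render 'Question? (A) xxx. (B) xxx. ...' from a question and its options."""
    letters = "ABCD"
    rendered = " ".join(f"({letters[i]}) {opt.rstrip('.')}." for i, opt in enumerate(options))
    return f"{question} {rendered}"

def match_answer(model_output):
    """Extract the option letter from an output such as '(B) Rain.'."""
    m = re.match(r"\s*\(([ABCD])\)", model_output)
    return m.group(1) if m else None

# Example:
# reformat_question("What sound is heard?", ["Thunderstorm", "Rain", "Wind", "Birds"])
# match_answer("(B) Rain.")   # -> "B"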

Gemini-2-Flash

Gemini 2.0 Flash is a multimodal Transformer developed by Google DeepMind, optimized for fast and robust audio-visual question answering. It accepts audio, image, video, and text inputs within a context window of up to 1 million tokens and generates coherent textual responses. As a second-generation "Flash" model, it features advanced tool usage, long-context reasoning, and native multimodal capabilities—including image generation and speech output. While training details remain proprietary, the model is known to be trained on large-scale web multimodal datasets.

Evaluation

The given prompt and audio input are fed into Gemini-2.0-flash for inference to generate the answer. The question and options are combined into a single prompt to make Gemini select the correct answer. An example of such a prompt is: “I want you to answer the question about the audio. I will provide you with the question and multiple options. Your task is to generate the only correct option for the question. Here is the question: At what time does the first occurrence of the baby crying sound end? The options are: A. 1.9s; B. 3.1s; C. 4.8s; D. 6.0s”.
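A small sketch of how such a prompt can be assembled from a question and its options is shown below; sending the prompt together with the audio to the Gemini API is left to the client library of your choice and is not shown.

def build_prompt(question, options):
    """Combine the question and options into a single prompt, as in the example above."""
    letters = "ABCD"
    option_str = "; ".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return ("I want you to answer the question about the audio. "
            "I will provide you with the question and multiple options. "
            "Your task is to generate the only correct option for the question. "
            f"Here is the question: {question} The options are: {option_str}")

# Example:
# build_prompt("At what time does the first occurrence of the baby crying sound end?",
#              ["1.9s", "3.1s", "4.8s", "6.0s"])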

Baseline Results

Dataset accuracy

Dataset       Qwen2-Audio-7B    AudioFlamingo 2    Gemini-2-Flash
Part 1 Dev    30.0%             53.9%              42.0%
Part 2 Dev    39.2%             31.7%              46.3%
Part 3 Dev    49.6%             49.5%              56.6%
Dev Total     45.0%             45.7%              52.5%

Submission

General information for all DCASE submissions can be found on the Submission page. The official challenge submission must include the following:

  • System output files for the evaluation set (.csv file)
  • Metadata file for the submission (.yaml file)
  • A technical report detailing the method for their submission (.pdf file)

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file)! To indicate the connection between your files, you can use the following naming convention:

<author>_<institute>_task5_<submission_index>.<output or meta or technical_report>.<csv or yaml or pdf>

For example:

Kim_SNU_task5_1.output.csv
Kim_SNU_task5_1.meta.yaml
Kim_SNU_task5_1.technical_report.pdf

The <submission_index> field is used to differentiate your submissions in case you have multiple submissions.

System output file

The system output file should be a .csv file with the following two columns: 1. question: identifier of the question 2. answer: the system's answer to the question

For example:

    question         answer
    test_aqa_1       (A) Thunderstorm
    ...              ...
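A minimal sketch for writing such a file with Python's csv module is shown below; the column names follow the description in this section, and the exact required header should be checked against the official format instructions once released.

import csv

def write_output(path, answers):
    """answers maps question identifiers to the selected option text."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        for question_id, answer in answers.items():
            writer.writerow([question_id, answer])

# Example:
# write_output("Kim_SNU_task5_1.output.csv", {"test_aqa_1": "(A) Thunderstorm"})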

More details on the required format for system output and the metadata specific to this task will be released in mid-April.