Audio-Dependent Question Answering


Task description

Coordinators

Haolin He

The Chinese University of Hong Kong

Renhe Sun

Ant Group

Zheqi Dai

The Chinese University of Hong Kong

Xingjian Du

University of Rochester

Chunyat Wu

The Chinese University of Hong Kong

Jiayi Zhou

Ant Group

Xiquan Li

Shanghai Jiao Tong University

Yun Chen

University of Surrey

Xie Chen

Shanghai Jiao Tong University

Zhiyao Duan

University of Rochester

Weiqiang Wang

Ant Group

Mark D. Plumbley

University of Surrey

Jian Liu

Ant Group

Qiuqiang Kong

The Chinese University of Hong Kong

Evaluating whether Large Audio-Language Models truly "listen" to audio or rely on textual shortcuts, using Audio-Dependency Filtering to ensure genuine audio perception.

If you are interested in the task, you can join us on the DCASE 2026 Slack channel (task-5-2026-adqa). Please first join the DCASE community Slack workspace before accessing the task channel.

Description

The Audio-Dependent Question Answering (ADQA) task focuses on addressing a critical bottleneck in current Large Audio-Language Models (LALMs): "Textual Hallucination." Many state-of-the-art models currently pass audio understanding benchmarks by relying on text prompts and internal linguistic priors rather than actual audio perception. Research shows that even when audio is replaced with silence, models can achieve over 50% accuracy on certain benchmarks.

ADQA introduces a rigorous evaluation framework using Audio-Dependency Filtering (ADF). This ensures that the questions in the test set cannot be answered through common sense or text-only reasoning. Participants are encouraged to develop systems that truly "listen" and reason based on the provided audio signal. Such advancements are vital for building reliable interactive audio agents and robust multimodal evaluation systems.

Figure 1: Overview of the ADQA benchmark construction, training, and evaluation pipeline.


Audio Dataset

The ADQA task in DCASE 2026 provides a high-quality, curated dataset designed to promote audio-centric reasoning.

Official Training Set: AudioMCQ-StrongAC-GeminiCoT

The official training set is derived from the large-scale AudioMCQ dataset (571k samples).

  • Strong Audio-Contribution (StrongAC): Samples are selected using the StrongAC Split, ensuring the answer is highly dependent on audio cues.
  • Gemini-Distilled CoT: Includes native Chain-of-Thought (CoT) reasoning labels generated by Gemini 3.1 Pro. These labels provide explicit reasoning steps grounded in audio perception, facilitating CoT distillation for smaller, lightweight models.

Development Dataset

A small portion of the development set is derived from existing benchmarks (MMAU, MMAR, and MMSU), while the remaining majority consists of newly constructed, human-annotated multiple-choice questions. All samples undergo the following four-step Audio-Dependency Filtering (ADF) process to ensure genuine audio dependence:

  1. Silent Audio Filtering: Questions solvable by LALMs without audio are removed.
  2. LLM Common-sense Check: Ensures no external knowledge alone can solve the question.
  3. Perplexity-based Soft Filtering: Eliminates samples with text-based statistical shortcuts.
  4. Manual Verification: Final human-in-the-loop check for ground-truth accuracy.
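The four steps above can be sketched as a simple sequential filter. This is an illustrative sketch only: the four predicate functions are hypothetical stand-ins for the organizers' actual silence-probing, LLM, perplexity, and human checks.

```python
def adf_filter(samples, solvable_on_silence, solvable_by_llm,
               has_text_shortcut, passes_manual_check):
    """Apply the four ADF steps in order: a sample survives only if it
    resists every text-only shortcut and passes human verification."""
    kept = []
    for s in samples:
        if solvable_on_silence(s):       # 1. silent-audio filtering
            continue
        if solvable_by_llm(s):           # 2. LLM common-sense check
            continue
        if has_text_shortcut(s):         # 3. perplexity-based soft filtering
            continue
        if not passes_manual_check(s):   # 4. manual verification
            continue
        kept.append(s)
    return kept
```

Note that the steps act as a conjunction: failing any single check removes the sample, which is why the retained set is audio-dependent by construction.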

Evaluation Set: ADQA-Bench

The evaluation set (ADQA-Bench) will be released on June 1, 2026. This set will be used for the final leaderboard ranking. Similar to the development set, a small portion of ADQA-Bench is sourced from MMAU, MMAR, and MMSU, with the rest being human-annotated multiple-choice questions. The entire evaluation set is constructed through the same rigorous four-step ADF hard-filtering process described above to guarantee that answers genuinely depend on the audio.

Evaluation

Participants may submit up to four systems, each of which will be ranked independently.

The primary evaluation metric is Top-1 Accuracy on the ADQA-Bench.

  • Robustness Score: To ensure models are not overfitting to option order, we will perform multiple evaluations by shuffling the order of choices. The variance between these scores will determine the system's robustness.
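A minimal sketch of this shuffled-choice evaluation is given below, assuming a `model_answer(question, choices)` callable that returns the chosen option text; the actual evaluation harness and number of shuffling rounds are not published.

```python
import random
import statistics

def shuffled_accuracy(model_answer, questions, n_rounds=5, seed=0):
    """Evaluate a model over several rounds, permuting the choice order
    each round; the variance across rounds serves as a robustness score."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_rounds):
        correct = 0
        for q in questions:
            choices = q["choices"][:]
            rng.shuffle(choices)               # shuffle option order
            pred = model_answer(q["question"], choices)
            correct += (pred == q["answer"])   # compare plain answer text
        accuracies.append(correct / len(questions))
    return statistics.mean(accuracies), statistics.pvariance(accuracies)
```

An order-invariant model attains identical accuracy in every round, so its variance (and hence its robustness penalty) is zero.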

Baseline System

We provide five baseline systems built on recent open-source Large Audio-Language Models:

Fun-Audio-Chat

Fun-Audio-Chat is a Large Audio Language Model designed for natural, low-latency voice interactions, featuring Dual-Resolution Speech Representations and Core-Cocktail training to balance compute efficiency with strong audio understanding capabilities.

Kimi-Audio

Kimi-Audio is an open-source audio foundation model excelling in audio understanding, generation, and conversation.

MiMo-Audio

MiMo-Audio is scaled with over one hundred million hours of pretraining data, enabling few-shot learning capabilities across diverse audio tasks. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.

Qwen3-Omni

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech.

Step-Audio 2 Mini

Step-Audio 2 Mini is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Baseline Results

The following table shows the Top-1 Accuracy of each baseline system on the development set (random guess accuracy: 0.2546):

System                       Accuracy
Fun-Audio-Chat               0.5681
Kimi-Audio                   0.4636
MiMo-Audio                   0.5457
Qwen3-Omni                   0.6248
Step-Audio 2 Mini            0.5053
Overall (weighted average)   0.5415
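As a sanity check, the overall figure can be reproduced from the per-system scores; the exact weighting is not specified, but the numbers above happen to match the simple (equal-weight) mean.

```python
# Reproduce the overall baseline score, assuming equal weights per system
# (the actual weighting scheme is not published).
scores = {
    "Fun-Audio-Chat": 0.5681,
    "Kimi-Audio": 0.4636,
    "MiMo-Audio": 0.5457,
    "Qwen3-Omni": 0.6248,
    "Step-Audio 2 Mini": 0.5053,
}
overall = sum(scores.values()) / len(scores)
print(round(overall, 4))  # → 0.5415
```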

Task Rules

General Rules

  • Submission Limit: Each team may submit up to four systems.
  • LALM Policy: Only open-source Large Audio Language Models are permitted. Closed-source/API-only models are prohibited.
  • No Manual Labeling: Participants are not allowed to perform manual annotations or subjective judgments on the evaluation set.

External Data Resources and Base Model Application

The use of external data and base models is allowed under the following conditions:

  • External data resources and base models must be publicly available before May 15th, 2026.
  • Participants must register any external dataset or base model they plan to use via the External Data & Base Model Registration Sheet. Simply add a new row with the resource you wish to use and set its status to "Under Review". Organizers will review and approve requests twice daily throughout the challenge period.
  • If the source audio of a dataset you intend to use is already fully contained within another dataset that has been registered and approved, you do not need to register it separately.
  • Prohibited Data: Using the evaluation set's original source audio for training is strictly forbidden.
  • External Base Model Size: No single model component may exceed 30 Billion (30B) parameters. If using an agent system or multi-model voting ensemble, the total combined parameter count of all models must not exceed 100 Billion (100B) parameters.

Important Dates

  • April 1, 2026: Release of Training/Dev sets and Baseline systems.
  • June 1, 2026: Release of Evaluation set.
  • June 15, 2026: System submission deadline.
  • June 30, 2026: Announcement of challenge results.

Submission

General information for all DCASE submissions can be found on the Submission page. The official challenge submission must include the following:

  • System output files for the evaluation set (.csv file)
  • Metadata file for the submission (.yaml file)
  • A technical report detailing the submitted method (.pdf file)
  • If using post-processing, the code used to process the model's response (e.g., .py files or other scripts) must be provided.

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file containing the task-specific information. All files should be packaged into a single zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file). To indicate the connection between your files, consider using the following naming convention:

<author>_<institute>_task5_<submission_index>.<output or meta or technical_report or post_process>.<csv or yaml or pdf or py>

For example:

He_CUHK_task5_1.output.csv
He_CUHK_task5_1.meta.yaml
He_CUHK_task5_1.post_process.py
He_CUHK_task5_1.technical_report.pdf
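Packaging the files into the required zip can be done in a few lines; the helper below is an illustrative sketch, and the file names are the example names from above.

```python
import os
import zipfile

def package_submission(file_paths, zip_path):
    """Bundle all submission files into a single zip for upload,
    keeping only the base file names inside the archive."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for path in file_paths:
            zf.write(path, arcname=os.path.basename(path))
    return zip_path
```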

The <submission_index> field differentiates your submissions in case you have multiple submissions.

System output file

The system output file should be a .csv file, and should have the following two columns:

  1. question: Identifier of the question (e.g., test_0001)

  2. answer: Answer of the system for the question. It must match one of the given choice options. Important: The answer field must contain only the plain answer text itself, without any option prefix such as A., B., (a), (A), etc. For example, if the correct choice is A. Jazz music, the answer should be Jazz music, not A. Jazz music. If the model's output naturally includes such prefixes, participants should apply a post-processing script to strip them before submission.

For example:

    question      answer
    ...           ...
    test_0001     Jazz music
    test_0002     Three speakers
    ...           ...
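A minimal post-processing sketch for stripping option prefixes and writing the two-column output file is shown below; the regular expression covers the prefix forms mentioned above ("A.", "a)", "(A)"), and you should adapt it to your model's actual output format.

```python
import csv
import re

# Matches an option prefix such as "A. ", "(b) ", or "C) " at the start
# of an answer; only letters A-D (either case) are treated as prefixes.
PREFIX = re.compile(r"^\s*\(?[A-Da-d][.)]\s+")

def strip_option_prefix(answer):
    """Return the plain answer text with any option prefix removed."""
    return PREFIX.sub("", answer).strip()

def write_output(rows, path):
    """Write the two-column system output file expected by the challenge."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        for question_id, raw_answer in rows:
            writer.writerow([question_id, strip_option_prefix(raw_answer)])
```

For example, `strip_option_prefix("A. Jazz music")` returns `"Jazz music"`, while an answer without a prefix is left unchanged.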

Metadata file

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 5
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping labels among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: He_CUHK_task5_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: Qwen3-Omni Baseline
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Qwen3base

# Authors of the submitted system. Mark authors in
# the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
  # First author
  - lastname: He
    firstname: Haolin
    email: haolin.he@example.com                   # Contact email address
    corresponding: true                            # Mark true for one of the authors

    # Affiliation information for the author
    affiliation:
      abbreviation: CUHK
      institute: The Chinese University of Hong Kong
      department: Department of Computer Science      # Optional
      location: Hong Kong, China

  # Second author
  # ...

# System information
system:
  end_to_end: true # True if single end-to-end system, false if cascaded (chained) system
  pretrained: true # True if the system is pretrained, false if not
  pre_loaded: qwen3-omni  # Name of the pre-trained model used in the system. If not pretrained, null
  autoregressive_model: true # True if the system is based on autoregressive language model, false if not
  model_size: 30B # Number of total parameters of the system in billions.
  light_weighted: false # True if the system is lightweight submission (i.e. less than 30B parameters)

  # Post processing: Details about the post processing method used in the system
  post_processing: Direct string matching to extract the selected option text from the model response

  # Optional. External data resources to train the system
  external_data_resources: [
    "AudioSet"
  ]

# System results on the development set.
    # - Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
    # - If you are unable to provide all the results, incomplete results can also be reported.
    # - Each score should contain at least 3 decimals.
results:
  development:
    accuracy: 64.450%

Citations

If you are participating in this task, please consider citing the following papers:

Publication

Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Wei-Long Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, and Qiuqiang Kong. Measuring audio's impact on correctness: audio-contribution-aware post-training of large audio language models. In International Conference on Learning Representations (ICLR). 2026.


Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

Abstract

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we firstly present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Secondly, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance.



Publication

S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: a massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations (ICLR). 2025.


MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Abstract

The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.



Publication

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xue, Emmanouil Benetos, Kai Yu, Eng-Siong Chng, and Xie Chen. MMAR: a challenging benchmark for deep reasoning in speech, audio, music, and their mix. 2025. URL: https://arxiv.org/abs/2505.13032, arXiv:2505.13032.


MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Abstract

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.



Publication

Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: a massive multi-task spoken language understanding and reasoning benchmark. 2025. URL: https://arxiv.org/abs/2506.04779, arXiv:2506.04779.


MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Abstract

Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL.
