Audio-Dependent Question Answering


Task description

Coordinators

Haolin He

The Chinese University of Hong Kong

Renhe Sun

Ant Group

Zheqi Dai

The Chinese University of Hong Kong

Xingjian Du

University of Rochester

Chunyat Wu

The Chinese University of Hong Kong

Jiayi Zhou

Ant Group

Xiquan Li

Shanghai Jiao Tong University

Yun Chen

University of Surrey

Xie Chen

Shanghai Jiao Tong University

Zhiyao Duan

University of Rochester

Weiqiang Wang

Ant Group

Mark D. Plumbley

University of Surrey

Jian Liu

Ant Group

Qiuqiang Kong

The Chinese University of Hong Kong

Evaluating whether Large Audio-Language Models truly "listen" to audio or rely on textual shortcuts, using Audio-Dependency Filtering to ensure genuine audio perception.

If you are interested in the task, you can join us on the DCASE 2026 Slack channel (task-5-2026-adqa). Please first join the DCASE community Slack workspace before accessing the task channel.

Description

The Audio-Dependent Question Answering (ADQA) task focuses on addressing a critical bottleneck in current Large Audio-Language Models (LALMs): "Textual Hallucination." Many state-of-the-art models currently pass audio understanding benchmarks by relying on text prompts and internal linguistic priors rather than actual audio perception. Research shows that even when audio is replaced with silence, models can achieve over 50% accuracy on certain benchmarks.

ADQA introduces a rigorous evaluation framework using Audio-Dependency Filtering (ADF). This ensures that the questions in the test set cannot be answered through common sense or text-only reasoning. Participants are encouraged to develop systems that truly "listen" and reason based on the provided audio signal. Such advancements are vital for building reliable interactive audio agents and robust multimodal evaluation systems.

Figure 1: Overview of the ADQA benchmark construction, training, and evaluation pipeline.


Audio Dataset

The ADQA task in DCASE 2026 provides a high-quality, curated dataset designed to promote audio-centric reasoning.

Official Training Set: AudioMCQ-StrongAC-GeminiCoT

The official training set is derived from the large-scale AudioMCQ dataset (571k samples).

  • Strong Audio-Contribution (StrongAC): Samples are selected using the StrongAC Split, ensuring the answer is highly dependent on audio cues.
  • Gemini-Distilled CoT: Includes native Chain-of-Thought (CoT) reasoning labels generated by Gemini 3.1 Pro. These labels provide explicit reasoning steps grounded in audio perception, facilitating CoT distillation for smaller, lightweight models.

Development Dataset

A small portion of the development set is derived from existing benchmarks (MMAU, MMAR, and MMSU), while the remaining majority consists of newly constructed, human-annotated multiple-choice questions. All samples undergo the following four-step Audio-Dependency Filtering (ADF) process to ensure genuine audio dependence:

  1. Silent Audio Filtering: Questions solvable by LALMs without audio are removed.
  2. LLM Common-sense Check: Ensures no external knowledge alone can solve the question.
  3. Perplexity-based Soft Filtering: Eliminates samples with text-based statistical shortcuts.
  4. Manual Verification: Final human-in-the-loop check for ground-truth accuracy.
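The four steps above can be sketched as a simple sequential filter. This is an illustrative sketch only: the four predicate functions are hypothetical stand-ins for the organizers' actual silence-probing, LLM, perplexity, and human checks.

```python
def adf_filter(samples, solvable_on_silence, solvable_by_llm,
               has_text_shortcut, passes_manual_check):
    """Apply the four ADF steps in order: a sample survives only if it
    resists every text-only shortcut and passes human verification."""
    kept = []
    for s in samples:
        if solvable_on_silence(s):       # 1. silent-audio filtering
            continue
        if solvable_by_llm(s):           # 2. LLM common-sense check
            continue
        if has_text_shortcut(s):         # 3. perplexity-based soft filtering
            continue
        if not passes_manual_check(s):   # 4. manual verification
            continue
        kept.append(s)
    return kept
```

Note that the steps act as a conjunction: failing any single check removes the sample, which is why the retained set is audio-dependent by construction.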

Evaluation Set: ADQA-Bench

The evaluation set (ADQA-Bench) will be released on June 1, 2026. This set will be used for the final leaderboard ranking. Similar to the development set, a small portion of ADQA-Bench is sourced from MMAU, MMAR, and MMSU, with the rest being human-annotated multiple-choice questions. The entire evaluation set is constructed through the same rigorous four-step ADF hard-filtering process described above to guarantee that answers genuinely depend on the audio.

Evaluation

Participants may submit up to four systems, each of which will be ranked independently.

The primary evaluation metric is Top-1 Accuracy on the ADQA-Bench.

  • Robustness Score: To ensure models are not overfitting to option order, we will perform multiple evaluations by shuffling the order of choices. The variance between these scores will determine the system's robustness.
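A minimal sketch of this shuffled-choice evaluation is given below, assuming a `model_answer(question, choices)` callable that returns the chosen option text; the actual evaluation harness and number of shuffling rounds are not published.

```python
import random
import statistics

def shuffled_accuracy(model_answer, questions, n_rounds=5, seed=0):
    """Evaluate a model over several rounds, permuting the choice order
    each round; the variance across rounds serves as a robustness score."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_rounds):
        correct = 0
        for q in questions:
            choices = q["choices"][:]
            rng.shuffle(choices)               # shuffle option order
            pred = model_answer(q["question"], choices)
            correct += (pred == q["answer"])   # compare plain answer text
        accuracies.append(correct / len(questions))
    return statistics.mean(accuracies), statistics.pvariance(accuracies)
```

An order-invariant model attains identical accuracy in every round, so its variance (and hence its robustness penalty) is zero.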

Baseline System

We provide five baseline systems built on recent open-source Large Audio-Language Models:

Fun-Audio-Chat

Fun-Audio-Chat is a Large Audio Language Model designed for natural, low-latency voice interactions, featuring Dual-Resolution Speech Representations and Core-Cocktail training to balance compute efficiency with strong audio understanding capabilities.

Kimi-Audio

Kimi-Audio is an open-source audio foundation model excelling in audio understanding, generation, and conversation.

MiMo-Audio

MiMo-Audio is scaled with over one hundred million hours of pretraining data, enabling few-shot learning capabilities across diverse audio tasks. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.

Qwen3-Omni

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech.

Step-Audio 2 Mini

Step-Audio 2 Mini is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Baseline Results

The following table shows the Top-1 Accuracy of each baseline system on the development set (random guess accuracy: 0.2546):

System                       Accuracy
Fun-Audio-Chat               0.5681
Kimi-Audio                   0.4636
MiMo-Audio                   0.5457
Qwen3-Omni                   0.6248
Step-Audio 2 Mini            0.5053
Overall (weighted average)   0.5415
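As a sanity check, the overall figure can be reproduced from the per-system scores; the exact weighting is not specified, but the numbers above happen to match the simple (equal-weight) mean.

```python
# Reproduce the overall baseline score, assuming equal weights per system
# (the actual weighting scheme is not published).
scores = {
    "Fun-Audio-Chat": 0.5681,
    "Kimi-Audio": 0.4636,
    "MiMo-Audio": 0.5457,
    "Qwen3-Omni": 0.6248,
    "Step-Audio 2 Mini": 0.5053,
}
overall = sum(scores.values()) / len(scores)
print(round(overall, 4))  # → 0.5415
```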

Task Rules

General Rules

  • Submission Limit: Each team may submit up to four systems.
  • LALM Policy: Only open-source Large Audio Language Models are permitted. Closed-source/API-only models are prohibited.
  • No Manual Labeling: Participants are not allowed to perform manual annotations or subjective judgments on the evaluation set.

External Data Resources and Base Model Application

The use of external data and base models is allowed under the following conditions:

  • External data resources and base models must be publicly available before May 15th, 2026.
  • Participants must register any external dataset or base model they plan to use via the External Data & Base Model Registration Sheet. Simply add a new row with the resource you wish to use and set its status to "Under Review". Organizers will review and approve requests twice daily throughout the challenge period.
  • If the source audio of a dataset you intend to use is already fully contained within another dataset that has been registered and approved, you do not need to register it separately.
  • Prohibited Data: Using the evaluation set's original source audio for training is strictly forbidden.
  • External Base Model Size: No single model component may exceed 30 Billion (30B) parameters. If using an agent system or multi-model voting ensemble, the total combined parameter count of all models must not exceed 100 Billion (100B) parameters.

Important Dates

  • April 1, 2026: Release of Training/Dev sets and Baseline systems.
  • June 1, 2026: Release of Evaluation set.
  • June 15, 2026: System submission deadline.
  • June 30, 2026: Announcement of challenge results.

Submission

General information for all DCASE submissions can be found on the Submission page. The official challenge submission must include the following:

  • System output files for the evaluation set (.csv file)
  • Metadata file for the submission (.yaml file)
  • A technical report detailing the submitted method (.pdf file)
  • If using post-processing, the code used to process the model's response (e.g., .py files or other scripts) must be provided.

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file containing the task-specific information. All files should be packaged into a single zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file). To indicate the connection between your files, consider using the following naming convention:

<author>_<institute>_task5_<submission_index>.<output or meta or technical_report or post_process>.<csv or yaml or pdf or py>

For example:

He_CUHK_task5_1.output.csv
He_CUHK_task5_1.meta.yaml
He_CUHK_task5_1.post_process.py
He_CUHK_task5_1.technical_report.pdf
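Packaging the files into the required zip can be done in a few lines; the helper below is an illustrative sketch, and the file names are the example names from above.

```python
import os
import zipfile

def package_submission(file_paths, zip_path):
    """Bundle all submission files into a single zip for upload,
    keeping only the base file names inside the archive."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for path in file_paths:
            zf.write(path, arcname=os.path.basename(path))
    return zip_path
```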

The <submission_index> field differentiates your submissions in case you have multiple submissions.

System output file

The system output file should be a .csv file, and should have the following two columns:

  1. question: Identifier of the question (e.g., test_0001)

  2. answer: Answer of the system for the question. It must match one of the given choice options. Important: The answer field must contain only the plain answer text itself, without any option prefix such as A., B., (a), (A), etc. For example, if the correct choice is A. Jazz music, the answer should be Jazz music, not A. Jazz music. If the model's output naturally includes such prefixes, participants should apply a post-processing script to strip them before submission.

For example:

    question      answer
    ...           ...
    test_0001     Jazz music
    test_0002     Three speakers
    ...           ...
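A minimal post-processing sketch for stripping option prefixes and writing the two-column output file is shown below; the regular expression covers the prefix forms mentioned above ("A.", "a)", "(A)"), and you should adapt it to your model's actual output format.

```python
import csv
import re

# Matches an option prefix such as "A. ", "(b) ", or "C) " at the start
# of an answer; only letters A-D (either case) are treated as prefixes.
PREFIX = re.compile(r"^\s*\(?[A-Da-d][.)]\s+")

def strip_option_prefix(answer):
    """Return the plain answer text with any option prefix removed."""
    return PREFIX.sub("", answer).strip()

def write_output(rows, path):
    """Write the two-column system output file expected by the challenge."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        for question_id, raw_answer in rows:
            writer.writerow([question_id, strip_option_prefix(raw_answer)])
```

For example, `strip_option_prefix("A. Jazz music")` returns `"Jazz music"`, while an answer without a prefix is left unchanged.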

Metadata file

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 5
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping labels among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: He_CUHK_task5_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: Qwen3-Omni Baseline
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Qwen3base

# Authors of the submitted system. Mark authors in
# the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
  # First author
  - lastname: He
    firstname: Haolin
    email: haolin.he@example.com                   # Contact email address
    corresponding: true                            # Mark true for one of the authors

    # Affiliation information for the author
    affiliation:
      abbreviation: CUHK
      institute: The Chinese University of Hong Kong
      department: Department of Computer Science      # Optional
      location: Hong Kong, China

  # Second author
  # ...

# System information
system:
  end_to_end: true # True if single end-to-end system, false if cascaded (chained) system
  pretrained: true # True if the system is pretrained, false if not
  pre_loaded: qwen3-omni  # Name of the pre-trained model used in the system. If not pretrained, null
  autoregressive_model: true # True if the system is based on autoregressive language model, false if not
  model_size: 30B # Number of total parameters of the system in billions.
  light_weighted: false # True if the system is lightweight submission (i.e. less than 30B parameters)

  # Post processing: Details about the post processing method used in the system
  post_processing: Direct string matching to extract the selected option text from the model response

  # Optional. External data resources to train the system
  external_data_resources: [
    "AudioSet"
  ]

# System results on the development set.
    # - Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
    # - If you are unable to provide all the results, incomplete results can also be reported.
    # - Each score should contain at least 3 decimals.
results:
  development:
    accuracy: 64.450%

Citations

If you are participating in this task, please consider citing the following papers:

Publication

Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Wei-Long Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, and Qiuqiang Kong. Measuring audio's impact on correctness: audio-contribution-aware post-training of large audio language models. In International Conference on Learning Representations (ICLR). 2026.


Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

Abstract

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we firstly present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Secondly, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance.



Publication

S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: a massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations (ICLR). 2025.


MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Abstract

The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.



Publication

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xue, Emmanouil Benetos, Kai Yu, Eng-Siong Chng, and Xie Chen. MMAR: a challenging benchmark for deep reasoning in speech, audio, music, and their mix. 2025. URL: https://arxiv.org/abs/2505.13032, arXiv:2505.13032.


MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Abstract

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.



Publication

Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: a massive multi-task spoken language understanding and reasoning benchmark. 2025. URL: https://arxiv.org/abs/2506.04779, arXiv:2506.04779.


MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Abstract

Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL.
