Audio Question Answering


Challenge results

Task description

The Audio Question Answering (AQA) task focuses on advancing question-answering capabilities for "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information within a single track. The task encourages participants to develop systems that accurately interpret and answer complex multiple-choice questions about audio (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types.

The task consists of three distinct QA subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning.

A more detailed task description can be found on the task description page.

Teams ranking

The best system from each team is listed below. Teams are ranked by the domain-average accuracy achieved on the evaluation dataset; all accuracies are given in percent.

| Submission Code | Rank | Corresponding Author | Technical Report | Eval Domain Average | Eval Part 1 Accuracy | Eval Part 2 Accuracy | Eval Part 3 Accuracy | Dev Domain Average | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sun_Antgroup_task5_2 | 1 | Renhe Sun | He_cuhk_t5_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 |
| Shi_USTC_task5_1 | 3 | Song Yan | Cai_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 |
| Chen_SRCN_task5_3 | 9 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 |
| Grzeszczyk_SRPOL_task5_4 | 10 | Michal Grzeszczyk | Wojtowicz-Kruk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 |
| Gibier_inria_task5_1 | 13 | Marcel Gibier | Gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 |
| Wijngaard_DACS_task5_4 | 14 | Gijs Wijngaard | Wijngaard_um_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 |
| Baseline_Kimi_Audio | 19 | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
| Baseline_Gemini_2_0 | 20 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
| Baseline_AudioFlamingo2 | 21 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 3000000000 |
| Chung_IND_task5_1 | 22 | HaeChun Chung | Chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 |
| Guan_HEU_task5_2 | 23 | Jian Guan | Xiao_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 |
| Yin_XJTLU_task5_1 | 27 | Zeyu Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 |
| Baseline_Qwen2_Audio | 31 | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
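The domain average used for ranking is consistent with the unweighted mean of the three per-part accuracies (e.g., for Baseline_Kimi_Audio: (37.44 + 42.58 + 76.25) / 3 = 52.09). A minimal sketch of the metric under exactly that assumption:

```python
def domain_average(part_accuracies):
    """Unweighted mean of the per-subset accuracies, in percent."""
    return round(sum(part_accuracies) / len(part_accuracies), 2)

# Sanity check against the Baseline_Kimi_Audio row above.
assert domain_average([37.44, 42.58, 76.25]) == 52.09
```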

Systems ranking

All submitted systems are listed below, together with their rankings according to the different metrics.

| Submission Code | Corresponding Author | Technical Report | Eval Domain Average | Eval Part 1 Accuracy | Eval Part 2 Accuracy | Eval Part 3 Accuracy | Dev Domain Average | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline_Qwen2_Audio | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
| Baseline_AudioFlamingo2 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 |
| Baseline_Gemini_2_0 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
| Baseline_Kimi_Audio | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
| Sun_Antgroup_task5_2 | Sun | sun_antgroup_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 |
| Sun_Antgroup_task5_4 | Sun | sun_antgroup_2025 | 73.11 | 68.88 | 65.31 | 85.15 | 75.73 | 20000000000 |
| Shi_USTC_task5_1 | Song Yan | shi_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 |
| Sun_Antgroup_task5_1 | Sun | sun_antgroup_2025 | 72.14 | 67.75 | 63.28 | 85.40 | 76.90 | 12000000000 |
| Shi_USTC_task5_2 | Song Yan | shi_ustc_2025 | 71.52 | 68.15 | 59.57 | 86.85 | 75.23 | 7100000000 |
| Shi_USTC_task5_4 | Song Yan | shi_ustc_2025 | 71.45 | 64.34 | 67.94 | 82.05 | 73.90 | 7100000000 |
| Sun_Antgroup_task5_3 | Sun | sun_antgroup_2025 | 69.78 | 62.88 | 62.80 | 83.65 | 75.43 | 8200000000 |
| Shi_USTC_task5_3 | Song Yan | shi_ustc_2025 | 69.27 | 65.88 | 56.58 | 85.35 | 76.53 | 8100000000 |
| Chen_SRCN_task5_3 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 |
| Grzeszczyk_SRPOL_task5_4 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 |
| Grzeszczyk_SRPOL_task5_2 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.08 | 47.65 | 52.39 | 80.20 | 65.54 | 8400000000 |
| Grzeszczyk_SRPOL_task5_3 | Grzeszczyk | grzeszczyk_srpol_2025 | 59.34 | 47.41 | 49.16 | 81.45 | 64.73 | 8400000000 |
| Gibier_inria_task5_1 | Marcel Gibier | gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 |
| Wijngaard_DACS_task5_4 | Wijngaard | wijngaard_dacs_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 |
| Wijngaard_DACS_task5_1 | Wijngaard | wijngaard_dacs_2025 | 55.00 | 44.49 | 41.51 | 79.00 | 58.92 | 8400000000 |
| Wijngaard_DACS_task5_2 | Wijngaard | wijngaard_dacs_2025 | 54.68 | 42.30 | 39.83 | 81.90 | 59.32 | 8400000000 |
| Wijngaard_DACS_task5_3 | Wijngaard | wijngaard_dacs_2025 | 53.58 | 43.11 | 38.28 | 79.35 | 59.86 | 8400000000 |
| Grzeszczyk_SRPOL_task5_1 | Grzeszczyk | grzeszczyk_srpol_2025 | 52.77 | 46.03 | 39.47 | 72.80 | 57.57 | 1600000000 |
| Chung_IND_task5_1 | Chung | chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 |
| Guan_HEU_task5_2 | Guan | guan_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 |
| Chung_IND_task5_2 | Chung | chung_ind_2025 | 49.20 | 44.65 | 29.90 | 73.05 | 65.93 | 4800000000 |
| Chung_IND_task5_3 | Chung | chung_ind_2025 | 48.98 | 46.35 | 27.75 | 72.85 | 64.96 | 4800000000 |
| Guan_HEU_task5_1 | Guan | guan_heu_2025 | 46.94 | 30.79 | 39.12 | 70.90 | 66.58 | 8400000000 |
| Yin_XJTLU_task5_1 | Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 |
| Yin_XJTLU_task5_2 | Yin | yin_xjtlu_2025 | 41.67 | 35.58 | 30.62 | 58.80 | 47.91 | 230000000 |
| Yin_XJTLU_task5_4 | Yin | yin_xjtlu_2025 | 41.55 | 34.93 | 30.98 | 58.75 | 48.28 | 230000000 |
| Yin_XJTLU_task5_3 | Yin | yin_xjtlu_2025 | 40.87 | 34.68 | 29.19 | 58.75 | 49.50 | 230000000 |

System characteristics

In this section you can find the characteristics of the submitted systems included in the official ranking.

| Rank | Submission Code | Technical Report | Domain Average | Parameters | Pretrained Model | Model Type | External Data | Post-processing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 31 | Baseline_Qwen2_Audio | | 37.19 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Select the option with the highest SentenceBERT similarity score to the model response |
| 21 | Baseline_AudioFlamingo2 | | 50.85 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching |
| 20 | Baseline_Gemini_2_0 | | 51.20 | | Gemini-2.0-Flash | | | Direct matching |
| 19 | Baseline_Kimi_Audio | | 52.09 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching |
| 1 | Sun_Antgroup_task5_2 | sun_antgroup_2025 | 73.74 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
| 2 | Sun_Antgroup_task5_4 | sun_antgroup_2025 | 73.11 | 20000000000 | Qwen2-Audio-7B-Instruct, Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
| 3 | Shi_USTC_task5_1 | shi_ustc_2025 | 72.81 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output to select the corresponding option |
| 4 | Sun_Antgroup_task5_1 | sun_antgroup_2025 | 72.14 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
| 5 | Shi_USTC_task5_2 | shi_ustc_2025 | 71.52 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output to select the corresponding option |
| 6 | Shi_USTC_task5_4 | shi_ustc_2025 | 71.45 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output to select the corresponding option |
| 7 | Sun_Antgroup_task5_3 | sun_antgroup_2025 | 69.78 | 8200000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
| 8 | Shi_USTC_task5_3 | shi_ustc_2025 | 69.27 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output to select the corresponding option |
| 9 | Chen_SRCN_task5_3 | Chen_srcn_t5_2025 | 64.91 | 8300000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, Clotho-AQA, AudioSet, AudioSet-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from tags |
| 10 | Grzeszczyk_SRPOL_task5_4 | grzeszczyk_srpol_2025 | 60.18 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Select the option with the lowest Levenshtein distance to the model response |
| 11 | Grzeszczyk_SRPOL_task5_2 | grzeszczyk_srpol_2025 | 60.08 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Select the option with the lowest Levenshtein distance to the model response |
| 12 | Grzeszczyk_SRPOL_task5_3 | grzeszczyk_srpol_2025 | 59.34 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU | Select the option with the lowest Levenshtein distance to the model response |
| 13 | Gibier_inria_task5_1 | gibier_inria_2025 | 55.97 | 7700000000 | Qwen2.5-7B-Instruct | autoregressive | AudioSet | Select the response at the index corresponding to the letter label of the model's answer (A → 0, B → 1, ...) |
| 14 | Wijngaard_DACS_task5_4 | wijngaard_dacs_2025 | 55.25 | 92400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, ClothoAQA, CompA-Order, TACOS, AudSem | |
| 15 | Wijngaard_DACS_task5_1 | wijngaard_dacs_2025 | 55.00 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, ClothoAQA, CompA-Order, TACOS, AudSem | |
| 16 | Wijngaard_DACS_task5_2 | wijngaard_dacs_2025 | 54.68 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, ClothoAQA, CompA-Order, TACOS, AudSem | Exact match |
| 17 | Wijngaard_DACS_task5_3 | wijngaard_dacs_2025 | 53.58 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, ClothoAQA, CompA-Order, TACOS, AudSem | Exact match |
| 18 | Grzeszczyk_SRPOL_task5_1 | grzeszczyk_srpol_2025 | 52.77 | 1600000000 | | end-to-end, autoregressive | AudioCaps, AudioSet, Clotho, WavCaps, OpenAQA, VGGSound, FSD50K | Select the option with the lowest Levenshtein distance to the model response |
| 22 | Chung_IND_task5_1 | chung_ind_2025 | 50.37 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the choice whose index matches the answer index derived from the model |
| 23 | Guan_HEU_task5_2 | guan_heu_2025 | 49.70 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Select the option with the highest SentenceBERT similarity score to the model response |
| 24 | Chung_IND_task5_2 | chung_ind_2025 | 49.20 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the choice whose index matches the answer index derived from the model |
| 25 | Chung_IND_task5_3 | chung_ind_2025 | 48.98 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the choice whose index matches the answer index derived from the model |
| 26 | Guan_HEU_task5_1 | guan_heu_2025 | 46.94 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Select the option with the highest SentenceBERT similarity score to the model response |
| 27 | Yin_XJTLU_task5_1 | yin_xjtlu_2025 | 42.13 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write it to CSV |
| 28 | Yin_XJTLU_task5_2 | yin_xjtlu_2025 | 41.67 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write it to CSV |
| 29 | Yin_XJTLU_task5_4 | yin_xjtlu_2025 | 41.55 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write it to CSV |
| 30 | Yin_XJTLU_task5_3 | yin_xjtlu_2025 | 40.87 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write it to CSV |
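The Post-processing column describes how each team mapped free-form model output back to one of the answer options. As an illustration only (not any team's actual code), the sketch below combines two of the listed strategies: first-letter matching, with a lowest-Levenshtein-distance fallback when no choice letter is found. The option texts are hypothetical.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pick_option(response: str, options: dict) -> str:
    """Map a free-form response to a choice letter.

    First try the first-letter strategy used by several teams; if the
    response does not start with a choice letter, fall back to the option
    with the lowest Levenshtein distance (the SRPOL-style strategy).
    """
    m = re.match(r"\(?([A-D])\)?\b", response.strip())
    if m and m.group(1) in options:
        return m.group(1)
    return min(options, key=lambda k: levenshtein(response.lower(),
                                                  options[k].lower()))

# Hypothetical options for illustration.
options = {"A": "a dog barking", "B": "a car horn", "C": "rainfall"}
print(pick_option("(B) a car horn", options))  # -> B (first-letter match)
print(pick_option("rainfall", options))        # -> C (distance fallback)
```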

Additional System Characteristics

This section provides metrics and attributes of the submitted systems that are not part of the official ranking.

| Submission Code | Technical Report | Domain Average | Part 1 Accuracy | Part 2 Accuracy | Part 3 Accuracy | Dev Domain Average | Parameters | Pretrained Model | Model Type | External Data | Post-processing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline_Qwen2_Audio | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Select the option with the highest SentenceBERT similarity score to the model response |
| Baseline_AudioFlamingo2 | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching |
| Baseline_Gemini_2_0 | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | | Gemini-2.0-Flash | | | Direct matching |
| Baseline_Kimi_Audio | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching |
| omni_Chen_SRCN_task5_1 | Chen_srcn_t5_2025 | 75.67 | 66.45 | 74.52 | 86.05 | 81.26 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clotho-AQA, AudioSet, AudioSet-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from tags |
| omni_Chen_SRCN_task5_2 | Chen_srcn_t5_2025 | 75.51 | 66.94 | 73.80 | 85.80 | 81.40 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clotho-AQA, AudioSet, AudioSet-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from tags |
| omni_Chen_SRCN_task5_4 | Chen_srcn_t5_2025 | 63.31 | 46.68 | 58.61 | 84.65 | 64.75 | 4000000000 | Qwen2.5-Omni-3B | end-to-end, autoregressive | AVQA, Clotho-AQA, AudioSet, AudioSet-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from tags |
| omni_Liu_MLPXC_task5_1 | Li_mlpxc_2025 | 66.04 | 49.92 | 61.01 | 87.20 | 69.57 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA, TACOS | |
| omni_Liu_MLPXC_task5_2 | Li_mlpxc_2025 | 73.69 | 62.32 | 72.01 | 86.75 | 75.10 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA, TACOS | |
| omni_Liu_MLPXC_task5_3 | Li_mlpxc_2025 | 71.22 | 59.56 | 66.99 | 87.10 | 71.42 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA, TACOS | |
| omni_Liu_MLPXC_task5_4 | Li_mlpxc_2025 | 71.24 | 58.35 | 67.46 | 87.90 | 72.28 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA, TACOS | |

Technical reports

Parameter-Efficient Tuning of Large Audio-Language Models for DCASE 2025 Challenge Task 5

Pengfei Cai, Yanfeng Shi, Qing Gu, Nan Jiang, and Yan Song
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China

Abstract

In this technical report, we describe our systems developed for DCASE 2025 Challenge Task 5. Our system is mainly based on parameter-efficient tuning of large audio-language models, e.g., Qwen2-Audio and Kimi-Audio. Training is conducted with Low-Rank Adaptation (LoRA) and divided into two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In addition, we reformatted the annotations of the AudioSet-Strong and MMAU datasets into a question-answer format to augment the official task dataset. Our final system achieves an accuracy of 80.0% on the development set.

PDF
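As an illustration of the parameter-efficient tuning described above (a sketch, not the authors' code), LoRA adapters can be attached to a causal-LM-style backbone with the HuggingFace peft library. The checkpoint name and target modules below are placeholder assumptions; backbones such as Qwen2-Audio ship their own model classes.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint name, for illustration only.
model = AutoModelForCausalLM.from_pretrained("placeholder/audio-lm-7b")

lora_cfg = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections (illustrative)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
```

The same adapted model can then be handed to an SFT trainer first and an RL trainer afterwards, matching the two-stage recipe described in the abstract.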

DCASE 2025 Challenge Task 5 Technical Report

Minjun Chen, Jun Shao, Yangyang Liu, Bo Peng and Jie Chen
Samsung Research China-Nanjing, Nanjing, China

Abstract

In this technical report, we describe our submitted systems for DCASE 2025 Task 5: Audio Question Answering. Our systems focus on training a Large Audio Language Model (LALM) with carefully curated training datasets and training stages, starting from a carefully chosen multi-modality baseline model. We choose Qwen 2.5 Omni 7B, which has shown impressive performance on audio- and vision-related tasks, as the base to initialize the audio encoder and LLM components of the proposed systems. We collect and transform multiple audio-text datasets for training, totaling 800K samples and covering multiple audio-related tasks such as closed audio QA, open audio QA, audio captioning, and audio temporal understanding and reasoning. We curate a multi-stage training procedure that helps the model focus on different aspects of the data and learn from easy to hard over the course of training. In the post-training stage, we adopt different training methods, including supervised fine-tuning (SFT) and GRPO, to take advantage of their complementary strengths in generalization and memorization. With these carefully considered designs, our model learns to answer questions correctly not only in content but also in the format specified in the prompts, which simplifies post-processing for evaluation. Our proposed systems achieve a top-1 accuracy of 81.3% on the DCASE Task 5 development set.

PDF

Parameter-Efficient Fine-Tuning of Audio Flamingo 2 with LoRA for the DCASE 2025 Audio Question Answering Challenge

HaeChun Chung
Independent, Seoul, Korea

Abstract

Audio Question Answering (AQA) presents a significant challenge, demanding models capable of complex reasoning over extensive audio sequences. In this research, we boost the performance of Audio Flamingo 2 (AF2), a compact yet powerful audio-language model, by employing parameter-efficient Low-Rank Adaptation (LoRA). We apply targeted data augmentation strategies for multiple-choice QA and fine-tune the model using the DCASE 2025 Challenge Task 5 dataset. Our top-performing model, utilizing LoRA with a rank of 8, achieves a remarkable 69.67% accuracy. This substantially outperforms all established baselines, including the strong Gemini-2.0-Flash (52.5%). These results highlight the effectiveness and practical value of a lightweight adaptation approach, especially when operating under constrained computational resources.

PDF

Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, and Jean-François Bonastre
Inria Paris, LR2, Paris, France

Abstract

In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves a top-1 accuracy of 65.4% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.

PDF
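The abstract mentions GRPO training with a simple reward function. The exact reward is not spelled out here, so the following is a hedged sketch of what such a reward could look like for multiple-choice AQA; `aqa_reward` and its scoring values are illustrative assumptions.

```python
import re
import statistics

def aqa_reward(completion: str, gold_letter: str) -> float:
    """Illustrative reward: full credit for the correct choice letter,
    a small format credit for any parseable letter, zero otherwise."""
    m = re.search(r"\b([A-D])\b", completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1) == gold_letter.upper() else 0.1

# GRPO scores a group of sampled completions and uses the group-normalized
# reward as the advantage for each sample.
rewards = [aqa_reward(c, "B") for c in ["B", "The answer is A", "(B)"]]
mean, std = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
advantages = [(r - mean) / std for r in rewards]
```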

Audio Question Answering at the DCASE 2025 Challenge

Haolin He1, Mingru Yang2, Renhe Sun3, Jiayi Zhou3, Jian Liu3, Qianhua He2 and Qiuqiang Kong1
1Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China; 2School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China; 3Machine Intelligence, Ant Group, Shanghai, China

Abstract

In this technical report, we describe our submission system for DCASE 2025 Task 5: Audio Question Answering. In this work, we introduce a comprehensive audio question answering dataset named DCASE-AQA-Boost, featuring diverse question types and carefully curated answer options to address the limitations of existing collections. Based on DCASE-AQA-Boost, we have developed two models, Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B. Kimi-Audio-SFT-12B is obtained through a two-stage Supervised Fine-Tuning (SFT) process using the Pretraining and Finetuning splits of DCASE-AQA-Boost. Qwen2-Audio-R1-8B is trained using our proposed three-stage training paradigm on DCASE-AQA-Boost and the DCASE 2025 Task 5 training set, which incorporates Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that the proposed method significantly improves the accuracy of multiple-choice audio question answering systems. Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B achieve 77.66% and 78.18% accuracy on the DCASE 2025 Task 5 development set, respectively.

PDF

MIAQA Submission for DCASE 2025 Challenge Task 5: A Reinforcement Learning Driven Audio Question Answering Method

Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Tianzi Wang, Junbo Zhang, and Jian Luan
MiLM Plus, Xiaomi Corp., China

Abstract

This technical report presents an audio question answering (AQA) method submitted to DCASE 2025 Challenge Task 5. Recent studies have shown that reinforcement learning (RL) can enhance the audio reasoning capabilities of large audio language models (LALMs). Thus, we employ an RL strategy to optimize our AQA model. The MiAQA submission is based on our preliminary study [1]. We apply the group relative policy optimization (GRPO) algorithm to Qwen2.5-Omni-7B. The model directly generates responses after implicit reasoning, without relying on complex, explicit chain-of-thought (CoT). To enhance data diversity, the training data combines human-annotated datasets with weakly labeled datasets generated by large language models (LLMs). Using only a single model and 35k training samples, MiAQA achieves up to 78.0% accuracy on the DCASE 2025 AQA development set.

PDF

Data-Balanced Curriculum Learning for Audio Question Answering

Gijs Wijngaard1, Elia Formisano2,3,4, Michele Esposito1, and Michel Dumontier1
1Department of Advanced Computing Sciences, Maastricht University, Netherlands; 2Department of Cognitive Neuroscience, Maastricht University, Netherlands; 3Maastricht Centre for Systems Biology, Maastricht University, Netherlands; 4Brightlands Institute for Smart Society, Maastricht University, Netherlands

Abstract

Audio question answering (AQA) requires models to understand acoustic content and perform complex reasoning. Current models struggle with dataset imbalances and unstable training dynamics. This work combines curriculum learning with statistical data balancing to address these challenges. The method labels question difficulty using language models, then trains progressively from easy to hard examples. Statistical filtering removes overrepresented audio categories, and guided decoding constrains outputs to valid multiple-choice formats. Experiments on the DCASE 2025 training set and five additional public datasets show that data curation improves accuracy by 19.2% over baseline models, achieving 64.2% on the DCASE 2025 benchmark.

PDF
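One way to realize the guided decoding mentioned above (a sketch under assumptions, not the team's implementation) is to skip free generation entirely and compare the model's next-token scores for the option letters only; gpt2 stands in here for the actual audio-language backbone.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

def constrained_choice(prompt: str, letters=("A", "B", "C")) -> str:
    """Return the option letter whose next-token logit is highest."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]             # next-token scores
    letter_ids = [tok.encode(" " + l)[0] for l in letters]
    return letters[int(torch.argmax(logits[letter_ids]))]

print(constrained_choice("Question: ... Options: (A) ... Answer:"))
```

Restricting the output space this way guarantees a valid choice and removes the need for fuzzy answer matching in post-processing.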

Take It with a Grain of Salt: Improving Audio Question Answering with Large Language Models

Juliusz Wójtowicz-Kruk1, Piotr Masztalski1,2, Bartłomiej Zgórzyński1, and Michal Grzeszczyk1
1Samsung R&D Institute Poland, Warsaw, Poland; 2AGH University of Krakow, Poland

Abstract

In this report, we present our solution for DCASE 2025 Task 5: Audio Question Answering. We explore two distinct architectures: an audio encoder trained with a text decoder model, and the R1-AQA model fine-tuned for the challenge tasks. Our original solution utilizes the PaSST-S audio encoder and the Qwen2.5-1.5B-Instruct model. The model was pre-trained on captioning, tagging, and question answering tasks, followed by fine-tuning for each of the three challenge tasks. The R1-AQA model underwent fine-tuning across all challenge tasks using LoRA. Through experimentation with various datasets and training methodologies, including a few-shot approach, our best trained-from-scratch model achieved an accuracy of 0.61, while the fine-tuned R1-AQA model reached an accuracy of 0.71 on the challenge development split.

PDF

Audio Question Answering Using Audio-Language Model with Supervised Fine-Tuning and Group Relative Policy Optimization

Feiyang Xiao1, Tong Ye1, Kejia Zhang1, Qiaoxi Zhu2, Guangjun He3, Pengming Feng3, Li Liu4, Wenwu Wang5, and Jian Guan1
1College of Computer Science and Technology, Harbin Engineering University, Harbin, China; 2University of Technology Sydney, Ultimo, Australia; 3State Key Laboratory of Space Information System and Integrated Application (SISIA), Beijing, China; 4The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; 5Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

This report presents our submission for Task 5 (Audio Question Answering, AQA) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge. Building on the Qwen-Audio backbone, our system interprets natural language questions and generates textual answers directly from audio inputs. In our submission, an open-source variant of Qwen-Audio-7B with Group Relative Policy Optimization (GRPO) is adopted as our first system (GRPO-Qwen), which is then fine-tuned on the DCASE training set with GRPO to obtain our second system (GRPO2). Experiments on the official AQA benchmark show that our systems surpass the baselines in terms of top-1 accuracy, demonstrating their effectiveness.

PDF

EchoTwin-QA: A Dual-Tower BEATsBERT System for DCASE 2025 Task 5 Audio Question Answering

Zeyu Yin1, Ziyang Zhou1, Yiqiang Cai1, Shengchen Li1, and Xi Shao2
1Xi’an Jiaotong-Liverpool University, School of Advanced Technology, Suzhou, China; 2Nanjing University of Posts and Telecommunications, College of Telecommunications and Information Engineering, Nanjing, China

Abstract

Task 5 of the DCASE 2025 Challenge frames Audio Question Answering (AQA) as a multiple-choice test of acoustic reasoning across marine bioacoustics, temporal soundscapes, and everyday recordings. We present a lightweight dual-tower system that couples a BEATs-Base audio encoder with a BERT-Base text encoder; a two-layer MLP, amounting to ∼132M trainable parameters, maps the concatenated embeddings to answer logits. On the official development set our best submission achieves 54.46% accuracy, surpassing the strongest baseline (Gemini-2.0-Flash, 52.5%).

PDF
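A minimal sketch of one plausible reading of the dual-tower head described above (not the authors' code): a pooled BEATs audio embedding and a pooled BERT embedding of each question-plus-candidate-answer pair are concatenated and scored by a two-layer MLP; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualTowerHead(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one score per (audio, option) pair
        )

    def forward(self, audio_emb, text_embs):
        # audio_emb: (B, audio_dim); text_embs: (B, n_options, text_dim).
        a = audio_emb.unsqueeze(1).expand(-1, text_embs.size(1), -1)
        return self.mlp(torch.cat([a, text_embs], dim=-1)).squeeze(-1)

head = DualTowerHead()
logits = head(torch.randn(2, 768), torch.randn(2, 3, 768))
print(logits.shape)  # torch.Size([2, 3]); argmax over dim 1 = predicted option
```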