Task description
The Audio Question Answering (AQA) task focuses on advancing question answering in the realm of "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information within a single track. The task encourages participants to develop systems that can accurately interpret and respond to complex audio-based multiple-choice questions (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types.
The task consists of three distinct QA subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning.
A more detailed task description can be found on the task description page.
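To make the answer format concrete, a single multiple-choice item can be thought of as a question, a small set of lettered options, and the associated audio clip. The sketch below uses hypothetical field names and is not the official data format.

```python
# Minimal, hypothetical representation of one multiple-choice AQA item.
# Field names are illustrative only; see the task description page for the
# official data format.
item = {
    "audio_path": "example_clip.wav",
    "question": "Which sound event occurs first in the recording?",
    "options": {"A": "Dog barking", "B": "Car horn", "C": "Door slamming"},
    "answer": "B",  # ground-truth option label
}

def is_correct(predicted_label: str, item: dict) -> bool:
    """Accuracy is computed per item: the predicted option letter must match."""
    return predicted_label.strip().upper() == item["answer"]

print(is_correct("b", item))  # True
```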
Teams ranking
Listed below is the best system from each team. The ranking is based on the domain average accuracy achieved on the evaluation dataset.
Submission Code | Rank | Corresponding Author | Technical Report | Domain Average (Eval) | Part 1 Accuracy (Eval) | Part 2 Accuracy (Eval) | Part 3 Accuracy (Eval) | Domain Average (Dev) | Amount of Parameters
---|---|---|---|---|---|---|---|---|---
Sun_Antgroup_task5_2 | 1 | Renhe Sun | He_cuhk_t5_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 | |
Chen_SRCN_task5_3 | 9 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 | |
Shi_USTC_task5_1 | 3 | Song Yan | Cai_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 | |
Grzeszczyk_SRPOL_task5_4 | 10 | Michal Grzeszczyk | Wojtowicz-Kruk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 | |
Gibier_inria_task5_1 | 13 | Marcel Gibier | Gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 | |
Wijngaard_DACS_task5_4 | 14 | Gijs Wijngaard | Wijngaard_um_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 | |
Baseline_Kimi_Audio | 19 | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
Baseline_Gemini_2_0 | 20 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
Baseline_AudioFlamingo2 | 21 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 3000000000 |
Chung_IND_task5_1 | 22 | HaeChun Chung | Chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 | |
Guan_HEU_task5_2 | 23 | Jian Guan | Xiao_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 | |
Yin_XJTLU_task5_1 | 27 | Zeyu Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 | |
Baseline_Qwen2_Audio | 31 | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
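The Domain Average figures above are consistent with an unweighted mean of the three per-part accuracies, e.g., (70.75 + 65.31 + 85.15) / 3 ≈ 73.74 for the top-ranked system. The short sketch below reproduces that check under this equal-weighting assumption.

```python
def domain_average(part_accuracies: list[float]) -> float:
    """Unweighted mean over the three QA subsets (equal weighting assumed)."""
    return sum(part_accuracies) / len(part_accuracies)

# Part 1-3 evaluation accuracies of Sun_Antgroup_task5_2 from the table above.
print(round(domain_average([70.75, 65.31, 85.15]), 2))  # 73.74
```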
Systems ranking
Listed below are all submitted systems and their rankings according to the different metrics.
Submission Code | Corresponding Author | Technical Report | Domain Average (Eval) | Part 1 Accuracy (Eval) | Part 2 Accuracy (Eval) | Part 3 Accuracy (Eval) | Domain Average (Dev) | Amount of Parameters
---|---|---|---|---|---|---|---|---
Baseline_Qwen2_Audio | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
Baseline_AudioFlamingo2 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 |
Baseline_Gemini_2_0 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
Baseline_Kimi_Audio | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
Sun_Antgroup_task5_2 | Sun | sun_antgroup_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 | |
Sun_Antgroup_task5_4 | Sun | sun_antgroup_2025 | 73.11 | 68.88 | 65.31 | 85.15 | 75.73 | 20000000000 | |
Shi_USTC_task5_1 | Song Yan | shi_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 | |
Sun_Antgroup_task5_1 | Sun | sun_antgroup_2025 | 72.14 | 67.75 | 63.28 | 85.40 | 76.90 | 12000000000 | |
Shi_USTC_task5_2 | Song Yan | shi_ustc_2025 | 71.52 | 68.15 | 59.57 | 86.85 | 75.23 | 7100000000 | |
Shi_USTC_task5_4 | Song Yan | shi_ustc_2025 | 71.45 | 64.34 | 67.94 | 82.05 | 73.90 | 7100000000 | |
Sun_Antgroup_task5_3 | Sun | sun_antgroup_2025 | 69.78 | 62.88 | 62.80 | 83.65 | 75.43 | 8200000000 | |
Shi_USTC_task5_3 | Song Yan | shi_ustc_2025 | 69.27 | 65.88 | 56.58 | 85.35 | 76.53 | 8100000000 | |
Chen_SRCN_task5_3 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 | |
Grzeszczyk_SRPOL_task5_4 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 | |
Grzeszczyk_SRPOL_task5_2 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.08 | 47.65 | 52.39 | 80.20 | 65.54 | 8400000000 | |
Grzeszczyk_SRPOL_task5_3 | Grzeszczyk | grzeszczyk_srpol_2025 | 59.34 | 47.41 | 49.16 | 81.45 | 64.73 | 8400000000 | |
Gibier_inria_task5_1 | Marcel Gibier | gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 | |
Wijngaard_DACS_task5_4 | Wijngaard | wijngaard_dacs_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 | |
Wijngaard_DACS_task5_1 | Wijngaard | wijngaard_dacs_2025 | 55.00 | 44.49 | 41.51 | 79.00 | 58.92 | 8400000000 | |
Wijngaard_DACS_task5_2 | Wijngaard | wijngaard_dacs_2025 | 54.68 | 42.30 | 39.83 | 81.90 | 59.32 | 8400000000 | |
Wijngaard_DACS_task5_3 | Wijngaard | wijngaard_dacs_2025 | 53.58 | 43.11 | 38.28 | 79.35 | 59.86 | 8400000000 | |
Grzeszczyk_SRPOL_task5_1 | Grzeszczyk | grzeszczyk_srpol_2025 | 52.77 | 46.03 | 39.47 | 72.80 | 57.57 | 1600000000 | |
Chung_IND_task5_1 | Chung | chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 | |
Guan_HEU_task5_2 | Guan | guan_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 | |
Chung_IND_task5_2 | Chung | chung_ind_2025 | 49.20 | 44.65 | 29.90 | 73.05 | 65.93 | 4800000000 | |
Chung_IND_task5_3 | Chung | chung_ind_2025 | 48.98 | 46.35 | 27.75 | 72.85 | 64.96 | 4800000000 | |
Guan_HEU_task5_1 | Guan | guan_heu_2025 | 46.94 | 30.79 | 39.12 | 70.90 | 66.58 | 8400000000 | |
Yin_XJTLU_task5_1 | Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 | |
Yin_XJTLU_task5_2 | Yin | yin_xjtlu_2025 | 41.67 | 35.58 | 30.62 | 58.80 | 47.91 | 230000000 | |
Yin_XJTLU_task5_4 | Yin | yin_xjtlu_2025 | 41.55 | 34.93 | 30.98 | 58.75 | 48.28 | 230000000 | |
Yin_XJTLU_task5_3 | Yin | yin_xjtlu_2025 | 40.87 | 34.68 | 29.19 | 58.75 | 49.50 | 230000000 |
System characteristics
In this section you can find the characteristics of the submitted systems included in the official ranking.
Rank | Submission Code | Technical Report | Domain Average (Eval) | Amount of Parameters | Pretrained Model | Model Type | External Data | Post-processing
---|---|---|---|---|---|---|---|---
31 | Baseline_Qwen2_Audio | | 37.19 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
21 | Baseline_AudioFlamingo2 | | 50.85 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching
20 | Baseline_Gemini_2_0 | | 51.20 | | Gemini-2.0-Flash | | | Direct matching
19 | Baseline_Kimi_Audio | | 52.09 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching
1 | Sun_Antgroup_task5_2 | sun_antgroup_2025 | 73.74 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
2 | Sun_Antgroup_task5_4 | sun_antgroup_2025 | 73.11 | 20000000000 | Qwen2-Audio-7B-Instruct, Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
3 | Shi_USTC_task5_1 | shi_ustc_2025 | 72.81 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
4 | Sun_Antgroup_task5_1 | sun_antgroup_2025 | 72.14 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
5 | Shi_USTC_task5_2 | shi_ustc_2025 | 71.52 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
6 | Shi_USTC_task5_4 | shi_ustc_2025 | 71.45 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
7 | Sun_Antgroup_task5_3 | sun_antgroup_2025 | 69.78 | 8200000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
8 | Shi_USTC_task5_3 | shi_ustc_2025 | 69.27 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
9 | Chen_SRCN_task5_3 | Chen_srcn_t5_2025 | 64.91 | 8300000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
10 | Grzeszczyk_SRPOL_task5_4 | grzeszczyk_srpol_2025 | 60.18 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Selected the option that has the lowest Levenshtein distance with the model response |
11 | Grzeszczyk_SRPOL_task5_2 | grzeszczyk_srpol_2025 | 60.08 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Selected the option that has the lowest Levenshtein distance with the model response |
12 | Grzeszczyk_SRPOL_task5_3 | grzeszczyk_srpol_2025 | 59.34 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU | Selected the option that has the lowest Levenshtein distance with the model response |
13 | Gibier_inria_task5_1 | gibier_inria_2025 | 55.97 | 7700000000 | Qwen2.5-7B-Instruct | autoregressive | AudioSet | Selected the response at index i corresponding to the letter label of the model's answer (e.g., A → 0, B → 1, ...) |
14 | Wijngaard_DACS_task5_4 | wijngaard_dacs_2025 | 55.25 | 92400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | |
15 | Wijngaard_DACS_task5_1 | wijngaard_dacs_2025 | 55.00 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | |
16 | Wijngaard_DACS_task5_2 | wijngaard_dacs_2025 | 54.68 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | Exact match |
17 | Wijngaard_DACS_task5_3 | wijngaard_dacs_2025 | 53.58 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | Exact match |
18 | Grzeszczyk_SRPOL_task5_1 | grzeszczyk_srpol_2025 | 52.77 | 1600000000 | | end-to-end, autoregressive | AudioCaps, AudioSet, Clotho, WavCaps, OpenAQA, VGGSound, FSD-50k | Selected the option that has the lowest Levenshtein distance with the model response
22 | Chung_IND_task5_1 | chung_ind_2025 | 50.37 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
23 | Guan_HEU_task5_2 | guan_heu_2025 | 49.70 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
24 | Chung_IND_task5_2 | chung_ind_2025 | 49.20 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
25 | Chung_IND_task5_3 | chung_ind_2025 | 48.98 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
26 | Guan_HEU_task5_1 | guan_heu_2025 | 46.94 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
27 | Yin_XJTLU_task5_1 | yin_xjtlu_2025 | 42.13 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
28 | Yin_XJTLU_task5_2 | yin_xjtlu_2025 | 41.67 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
29 | Yin_XJTLU_task5_4 | yin_xjtlu_2025 | 41.55 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
30 | Yin_XJTLU_task5_3 | yin_xjtlu_2025 | 40.87 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
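Most of the post-processing strategies listed above map a free-form model response back to one of the candidate options. The sketch below illustrates three such variants in a hedged form: first-letter extraction, nearest option by string similarity (difflib stands in for the Levenshtein distance several teams report), and highest sentence-embedding similarity (assuming the sentence-transformers package and an arbitrary all-MiniLM-L6-v2 checkpoint, not necessarily what any team used).

```python
import difflib  # standard library; stands in for a dedicated Levenshtein package

OPTIONS = {"A": "Dog barking", "B": "Car horn", "C": "Door slamming"}  # hypothetical options

def by_first_letter(response: str) -> str | None:
    """Take the first character of the response as the option letter, if valid."""
    letter = response.strip()[:1].upper()
    return letter if letter in OPTIONS else None

def by_string_similarity(response: str) -> str:
    """Pick the option text closest to the response (difflib ratio as a proxy for Levenshtein)."""
    return max(OPTIONS, key=lambda k: difflib.SequenceMatcher(
        None, response.lower(), OPTIONS[k].lower()).ratio())

def by_sentence_embedding(response: str) -> str:
    """Pick the option with the highest embedding cosine similarity (assumed checkpoint)."""
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([response] + list(OPTIONS.values()), convert_to_tensor=True)
    scores = util.cos_sim(emb[0], emb[1:])[0]
    return list(OPTIONS)[int(scores.argmax())]

print(by_first_letter("B) Car horn"), by_string_similarity("a horn from a car"))  # B B
```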
Additional System Characteristics
This section provides metrics and attributes of the submitted systems that are not part of the official ranking.
Submission Code | Technical Report | Domain Average (Eval) | Part 1 Accuracy | Part 2 Accuracy | Part 3 Accuracy | Domain Average (Dev) | Amount of Parameters | Pretrained Model | Model Type | External Data | Post-processing
---|---|---|---|---|---|---|---|---|---|---|---
Baseline_Qwen2_Audio | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
Baseline_AudioFlamingo2 | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching
Baseline_Gemini_2_0 | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | | Gemini-2.0-Flash | | | Direct matching
Baseline_Kimi_Audio | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching
omni_Chen_SRCN_task5_1 | Chen_srcn_t5_2025 | 75.67 | 66.45 | 74.52 | 86.05 | 81.26 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Chen_SRCN_task5_2 | Chen_srcn_t5_2025 | 75.51 | 66.94 | 73.80 | 85.80 | 81.40 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Chen_SRCN_task5_4 | Chen_srcn_t5_2025 | 63.31 | 46.68 | 58.61 | 84.65 | 64.75 | 4000000000 | Qwen2.5-Omni-3B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Liu_MLPXC_task5_1 | Li_mlpxc_2025 | 66.04 | 49.92 | 61.01 | 87.20 | 69.57 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_2 | Li_mlpxc_2025 | 73.69 | 62.32 | 72.01 | 86.75 | 75.10 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_3 | Li_mlpxc_2025 | 71.22 | 59.56 | 66.99 | 87.10 | 71.42 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_4 | Li_mlpxc_2025 | 71.24 | 58.35 | 67.46 | 87.90 | 72.28 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS |
Technical reports
Parameter-Efficient Tuning of Large Audio-Language Models for DCASE 2025 Challenge Task 5
Pengfei Cai, Yanfeng Shi, Qing Gu, Nan Jiang, and Yan Song
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China
Shi_USTC_task5_1 Shi_USTC_task5_2 Shi_USTC_task5_3 Shi_USTC_task5_4
Abstract
In this technical report, we describe our systems developed for the DCASE 2025 Challenge Task 5. Our system is mainly based on parameter-efficient tuning of large audio-language models, e.g., Qwen2-Audio and Kimi-Audio. The training process uses Low-Rank Adaptation (LoRA) and is divided into two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In addition, we reformatted the annotations of the AudioSet-Strong and MMAU datasets into a question-answer format to augment the official task dataset. Our final system achieves an accuracy of 80.0% on the development set.
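As a rough illustration of the parameter-efficient tuning described in this report, the sketch below attaches LoRA adapters to a Hugging Face causal-LM backbone with the peft library. The checkpoint name, target modules, and hyperparameters are placeholders, not the authors' actual configuration (which targets audio-language models such as Qwen2-Audio and Kimi-Audio).

```python
# Minimal LoRA setup sketch; assumes transformers and peft are installed and the
# placeholder checkpoint exposes standard attention projection layers.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")  # placeholder backbone
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (hypothetical value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```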
DCASE 2025 Challenge Task 5 Technical Report
Minjun Chen, Jun Shao, Yangyang Liu, Bo Peng and Jie Chen
Samsung Research China-Nanjing, Nanjing, China
omni_Chen_SRCN_task5_1 omni_Chen_SRCN_task5_2 Chen_SRCN_task5_3 omni_Chen_SRCN_task5_4
Abstract
In this technical report, we describe our submitted systems for DCASE 2025 Task 5: Audio Question Answering. Our systems focus on training a Large Audio Language Model (LALM) with carefully curated training datasets and training stages, starting from a carefully chosen multi-modality baseline model. We choose Qwen 2.5 Omni 7B, which has shown impressive performance on audio- and vision-related tasks, as the base to initialize the audio encoder and LLM component of the proposed systems. We collect and transform multiple audio-text datasets for training; the total number of samples reaches 800K, covering multiple audio-related tasks such as closed-ended audio QA, open-ended audio QA, audio captioning, and audio temporal understanding and reasoning. We curate a multi-stage training procedure that helps the model focus on different aspects of the data and learn from easy to hard over the course of training. In the post-training stage, we adopt different training methods, including supervised fine-tuning (SFT) and GRPO, to take advantage of their complementary strengths in generalization and memorization. With these carefully considered designs, our model learns to answer questions correctly not only in content but also in the format specified in the prompts, which simplifies the post-processing procedure for evaluation. Our proposed systems achieve a top-1 accuracy of 81.3% on the DCASE Task 5 development set.
Parameter-Efficient Fine-Tuning of Audio Flamingo 2 with LoRA for the DCASE 2025 Audio Question Answering Challenge
HaeChun Chung
Independent, Seoul, Korea
Chung_IND_task5_1 Chung_IND_task5_2 Chung_IND_task5_3
Abstract
Audio Question Answering (AQA) presents a significant challenge, demanding models capable of complex reasoning over extensive audio sequences. In this research, we boost the performance of Audio Flamingo 2 (AF2), a compact yet powerful audio-language model, by employing parameter-efficient Low-Rank Adaptation (LoRA). We apply targeted data augmentation strategies for multiple-choice QA and fine-tune the model using the DCASE 2025 Challenge Task 5 dataset. Our top-performing model, utilizing LoRA with a rank of 8, achieves a remarkable 69.67% accuracy. This substantially outperforms all established baselines, including the strong Gemini-2.0-Flash (52.5%). These results highlight the effectiveness and practical value of a lightweight adaptation approach, especially when operating under constrained computational resources.
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, and Jean-François Bonastre
Inria Paris, LR2, Paris, France
Gibier_inria_task5_1
Abstract
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the Audioset ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves a top-1 accuracy of 65.4 % on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
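The report mentions GRPO training with a simple reward function. A minimal correctness-based reward of the kind such setups typically use could look as follows; the authors' exact reward is not specified here, so this is a hypothetical stand-in.

```python
import re

def mcq_reward(model_output: str, correct_letter: str) -> float:
    """Hypothetical reward: 1.0 if the first standalone option letter found in the
    output matches the ground-truth letter, otherwise 0.0."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return 1.0 if match and match.group(1) == correct_letter.upper() else 0.0

print(mcq_reward("The answer is (B): car horn.", "B"))  # 1.0
```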
Audio Question Answering at the DCASE 2025 Challenge
Haolin He1, Mingru Yang2, Renhe Sun3, Jiayi Zhou3, Jian Liu3, Qianhua He2 and Qiuqiang Kong1
1Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China 2School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China 3Machine Intelligence, Ant Group, Shanghai, China
Sun_Antgroup_task5_1 Sun_Antgroup_task5_2 Sun_Antgroup_task5_3 Sun_Antgroup_task5_4
Abstract
In this technical report, we describe the submission system for DCASE 2025 Task 5: Audio Question Answering. In this work, we introduce a comprehensive audio question answering dataset named DCASE-AQA-Boost, featuring diverse question types and carefully curated answer options to address the limitations of existing collections. Based on DCASE-AQA-Boost, we have developed two models, Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B. Kimi-Audio-SFT-12B is obtained through a two-stage Supervised Fine-Tuning (SFT) process using the Pretraining and Finetuning splits of DCASE-AQA-Boost. Qwen2-Audio-R1-8B is trained using our proposed three-stage training paradigm based on DCASE-AQA-Boost and the DCASE 2025 Task 5 training set, which incorporates Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that the proposed method significantly improves the accuracy of multiple-choice audio question answering systems. Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B achieve 77.66% and 78.18% accuracy on the DCASE 2025 Task 5 development set, respectively.
MIAQA Submission for DCASE 2025 Challenge Task 5: A Reinforcement Learning Driven Audio Question Answering Method
Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Tianzi Wang, Junbo Zhang, and Jian Luan
MiLM Plus, Xiaomi Corp., China
omni_Liu_MLPXC_task5_1 omni_Liu_MLPXC_task5_2 omni_Liu_MLPXC_task5_3 omni_Liu_MLPXC_task5_4
Abstract
This technical report presents an audio question answering (AQA) method submitted to DCASE 2025 Challenge Task 5. Recent studies have shown that reinforcement learning (RL) can enhance the audio reasoning capabilities of large audio language models (LALMs). Thus, we employ an RL strategy to optimize our AQA model. The MiAQA submission is based on our preliminary study [1]. We apply the group relative policy optimization (GRPO) algorithm to Qwen2.5-Omni-7B. The model directly generates responses after implicit reasoning, without relying on complex, explicit chain-of-thought (CoT). To enhance data diversity, the training data combines human-annotated datasets with weakly labeled datasets generated by large language models (LLMs). Using only a single model and 35k training samples, MiAQA achieves up to 78.0% accuracy on the DCASE 2025 AQA development set.
Data-Balanced Curriculum Learning for Audio Question Answering
Gijs Wijngaard1, Elia Formisano2,3,4, Michele Esposito1, and Michel Dumontier1
1Department of Advanced Computing Sciences, Maastricht University, Netherlands 2Department of Cognitive Neuroscience, Maastricht University, Netherlands 3Maastricht Centre for Systems Biology, Maastricht University, Netherlands 4Brightlands Institute for Smart Society, Maastricht University, Netherlands
Wijngaard_DACS_task5_1 Wijngaard_DACS_task5_2 Wijngaard_DACS_task5_3 Wijngaard_DACS_task5_4
Abstract
Audio question answering (AQA) requires models to understand acoustic content and perform complex reasoning. Current models struggle with dataset imbalances and unstable training dynamics. This work combines curriculum learning with statistical data balancing to address these challenges. The method labels question difficulty using language models, then trains progressively from easy to hard examples. Statistical filtering removes overrepresented audio categories, and guided decoding constrains outputs to valid multiple-choice formats. Experiments on the DCASE 2025 training set and five additional public datasets show that data curation improves accuracy by 19.2% over baseline models, achieving 64.2% on the DCASE 2025 benchmark.
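A minimal sketch of the easy-to-hard ordering and category balancing described in this abstract, assuming each training example already carries a difficulty label (e.g., assigned by a language model) and an audio category; the field layout, labels, and cap value are illustrative only.

```python
from collections import Counter

# Hypothetical examples: (question_id, difficulty 0=easy/1=medium/2=hard, audio_category)
examples = [
    ("q1", 0, "dog"), ("q2", 2, "dog"), ("q3", 1, "siren"),
    ("q4", 0, "dog"), ("q5", 1, "speech"), ("q6", 2, "siren"),
]

MAX_PER_CATEGORY = 2  # cap overrepresented audio categories (illustrative threshold)

def balance(items, cap=MAX_PER_CATEGORY):
    """Statistical filtering: keep at most `cap` examples per audio category."""
    counts, kept = Counter(), []
    for ex in items:
        if counts[ex[2]] < cap:
            counts[ex[2]] += 1
            kept.append(ex)
    return kept

# Curriculum: train on the balanced set ordered from easy to hard.
curriculum = sorted(balance(examples), key=lambda ex: ex[1])
print([qid for qid, _, _ in curriculum])  # ['q1', 'q3', 'q5', 'q2', 'q6']
```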
Take It with a Grain of Salt: Improving Audio Question Answering with Large Language Models
Juliusz Wójtowicz-Kruk1, Piotr Masztalski1,2, Bartłomiej Zgórzyński1, and Michal Grzeszczyk1
1Samsung R&D Institute Poland, Warsaw, Poland 2AGH University of Krakow, Poland
Grzeszczyk_SRPOL_task5_1 Grzeszczyk_SRPOL_task5_2 Grzeszczyk_SRPOL_task5_3 Grzeszczyk_SRPOL_task5_4
Abstract
In this report, we present our solution for DCASE 2025 Task 5: Audio Question Answering. We explore two distinct architectures: an audio encoder trained with a text decoder model, and the R1-AQA model fine-tuned for the challenge tasks. Our original solution utilizes the PaSST-S audio encoder and the Qwen2.5-1.5B-Instruct model. The model was pre-trained on captioning, tagging, and question answering tasks, followed by fine-tuning for each of the three challenge tasks. The R1-AQA model underwent fine-tuning across all challenge tasks using LoRA. Through experimentation with various datasets and training methodologies, including a few-shot approach, our best trained-from-scratch model achieved an accuracy of 0.61, while the fine-tuned R1-AQA model reached an accuracy of 0.71 on the challenge development split.
Audio Question Answering Using Audio-Language Model with Supervised Fine-Tuning and Group Relative Policy Optimization
Feiyang Xiao1, Tong Ye1, Kejia Zhang1, Qiaoxi Zhu2, Guangjun He3, Pengming Feng3, Li Liu4, Wenwu Wang5, and Jian Guan1
1College of Computer Science and Technology, Harbin Engineering University, Harbin, China 2University of Technology Sydney, Ultimo, Australia 3State Key Laboratory of Space Information System and Integrated Application (SISIA), Beijing, China 4The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China 5Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Guan_HEU_task5_1 Guan_HEU_task5_2
Abstract
This report presents our submission for Task 5 (Audio Question Answering, AQA) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge. Building on the Qwen-Audio backbone, our system interprets natural language questions and generates textual answers directly from audio inputs. In our submission, an open-source variant of Qwen-Audio-7B with Group Relative Policy Optimization (GRPO) is adopted as our first system (GRPO-Qwen), which is then fine-tuned on the DCASE training set with GRPO to obtain our second system (GRPO2). Experiments on the official AQA benchmark show that our systems surpass the baselines in terms of Top-1 accuracy, demonstrating their effectiveness.
EchoTwin-QA: A Dual-Tower BEATsBERT System for DCASE 2025 Task 5 Audio Question Answering
Zeyu Yin1, Ziyang Zhou1, Yiqiang Cai1, Shengchen Li1, and Xi Shao2
1Xi’an Jiaotong-Liverpool University, School of Advanced Technology, Suzhou, China 2Nanjing University of Posts and Telecommunications, College of Telecommunications and Information Engineering, Nanjing, China
Yin_XJTLU_task5_1 Yin_XJTLU_task5_2 Yin_XJTLU_task5_3 Yin_XJTLU_task5_4
Abstract
Task 5 of the DCASE 2025 Challenge frames Audio Question Answering (AQA) as a multiple-choice test of acoustic reasoning across marine bioacoustics, temporal soundscapes, and everyday recordings. We present a lightweight dual-tower system that couples a BEATs-Base audio encoder with a BERT-Base text encoder; a two-layer MLP, amounting to ∼132M trainable parameters, maps the concatenated embeddings to answer logits. On the official development set, our best submission achieves 54.46% accuracy, surpassing the strongest baseline (Gemini-2.0-Flash, 52.5%).
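A structural sketch of the dual-tower design described above: pooled audio and text embeddings from the two encoders are concatenated and scored by a small MLP. The embedding dimensions and the per-option scoring scheme are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualTowerAQA(nn.Module):
    """Toy dual-tower head: concatenate pooled audio and text embeddings, score with an MLP."""
    def __init__(self, audio_dim: int = 768, text_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one score per (audio, question+option) pair
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (batch, audio_dim); text_emb: (batch, n_options, text_dim)
        n_options = text_emb.size(1)
        audio_rep = audio_emb.unsqueeze(1).expand(-1, n_options, -1)
        return self.mlp(torch.cat([audio_rep, text_emb], dim=-1)).squeeze(-1)  # (batch, n_options)

# Pooled embeddings would come from BEATs-Base (audio) and BERT-Base (question + option text).
model = DualTowerAQA()
logits = model(torch.randn(2, 768), torch.randn(2, 3, 768))
print(logits.shape)  # torch.Size([2, 3])
```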