Task description
The Audio Question Answering (AQA) task focuses on advancing question answering in the realm of "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information within a single track. The task encourages participants to develop systems that can accurately interpret and respond to complex audio-based multiple-choice questions (i.e., selecting option (A), (B), or (C)), requiring models to process and reason across diverse audio types.
The task consists of three distinct QA subsets: Bioacoustics QA, Temporal Soundscapes QA, and Complex QA (MMAU). Each subset is designed to evaluate different aspects of audio understanding and reasoning.
A more detailed task description can be found on the task description page.
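To make the answer format concrete, a single multiple-choice item can be thought of as a question, a small set of lettered options, and the associated audio clip. The sketch below uses hypothetical field names and is not the official data format.

```python
# Minimal, hypothetical representation of one multiple-choice AQA item.
# Field names are illustrative only; see the task description page for the
# official data format.
item = {
    "audio_path": "example_clip.wav",
    "question": "Which sound event occurs first in the recording?",
    "options": {"A": "Dog barking", "B": "Car horn", "C": "Door slamming"},
    "answer": "B",  # ground-truth option label
}

def is_correct(predicted_label: str, item: dict) -> bool:
    """Accuracy is computed per item: the predicted option letter must match."""
    return predicted_label.strip().upper() == item["answer"]

print(is_correct("b", item))  # True
```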
Teams ranking
Listed below is the best system from each team. The ranking is based on the domain average accuracy achieved on the evaluation dataset.
Submission Code | Rank | Corresponding Author | Technical Report | Domain Average (Eval) | Part 1 Accuracy (Eval) | Part 2 Accuracy (Eval) | Part 3 Accuracy (Eval) | Domain Average (Dev) | Amount of Parameters
---|---|---|---|---|---|---|---|---|---
Sun_Antgroup_task5_2 | 1 | Renhe Sun | He_cuhk_t5_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 | |
Chen_SRCN_task5_3 | 9 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 | |
Shi_USTC_task5_1 | 3 | Song Yan | Cai_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 | |
Grzeszczyk_SRPOL_task5_4 | 10 | Michal Grzeszczyk | Wojtowicz-Kruk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 | |
Gibier_inria_task5_1 | 13 | Marcel Gibier | Gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 | |
Wijngaard_DACS_task5_4 | 14 | Gijs Wijngaard | Wijngaard_um_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 | |
Baseline_Kimi_Audio | 19 | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
Baseline_Gemini_2_0 | 20 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
Baseline_AudioFlamingo2 | 21 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 3000000000 |
Chung_IND_task5_1 | 22 | HaeChun Chung | Chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 | |
Guan_HEU_task5_2 | 23 | Jian Guan | Xiao_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 | |
Yin_XJTLU_task5_1 | 27 | Zeyu Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 | |
Baseline_Qwen2_Audio | 31 | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
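The Domain Average figures above are consistent with an unweighted mean of the three per-part accuracies, e.g., (70.75 + 65.31 + 85.15) / 3 ≈ 73.74 for the top-ranked system. The short sketch below reproduces that check under this equal-weighting assumption.

```python
def domain_average(part_accuracies: list[float]) -> float:
    """Unweighted mean over the three QA subsets (equal weighting assumed)."""
    return sum(part_accuracies) / len(part_accuracies)

# Part 1-3 evaluation accuracies of Sun_Antgroup_task5_2 from the table above.
print(round(domain_average([70.75, 65.31, 85.15]), 2))  # 73.74
```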
Systems ranking
Listed below are all submitted systems and their rankings according to the different metrics.
Submission Code | Corresponding Author | Technical Report | Domain Average (Eval) | Part 1 Accuracy (Eval) | Part 2 Accuracy (Eval) | Part 3 Accuracy (Eval) | Domain Average (Dev) | Amount of Parameters
---|---|---|---|---|---|---|---|---
Baseline_Qwen2_Audio | | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 |
Baseline_AudioFlamingo2 | | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 |
Baseline_Gemini_2_0 | | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | |
Baseline_Kimi_Audio | | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 |
Sun_Antgroup_task5_2 | Sun | sun_antgroup_2025 | 73.74 | 70.75 | 65.31 | 85.15 | 77.93 | 12000000000 | |
Sun_Antgroup_task5_4 | Sun | sun_antgroup_2025 | 73.11 | 68.88 | 65.31 | 85.15 | 75.73 | 20000000000 | |
Shi_USTC_task5_1 | Song Yan | shi_ustc_2025 | 72.81 | 69.37 | 61.96 | 87.10 | 78.13 | 8100000000 | |
Sun_Antgroup_task5_1 | Sun | sun_antgroup_2025 | 72.14 | 67.75 | 63.28 | 85.40 | 76.90 | 12000000000 | |
Shi_USTC_task5_2 | Song Yan | shi_ustc_2025 | 71.52 | 68.15 | 59.57 | 86.85 | 75.23 | 7100000000 | |
Shi_USTC_task5_4 | Song Yan | shi_ustc_2025 | 71.45 | 64.34 | 67.94 | 82.05 | 73.90 | 7100000000 | |
Sun_Antgroup_task5_3 | Sun | sun_antgroup_2025 | 69.78 | 62.88 | 62.80 | 83.65 | 75.43 | 8200000000 | |
Shi_USTC_task5_3 | Song Yan | shi_ustc_2025 | 69.27 | 65.88 | 56.58 | 85.35 | 76.53 | 8100000000 | |
Chen_SRCN_task5_3 | Minjun Chen | Chen_srcn_t5_2025 | 64.91 | 52.19 | 58.85 | 83.70 | 69.82 | 8300000000 | |
Grzeszczyk_SRPOL_task5_4 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.18 | 47.41 | 51.67 | 81.45 | 68.18 | 8400000000 | |
Grzeszczyk_SRPOL_task5_2 | Grzeszczyk | grzeszczyk_srpol_2025 | 60.08 | 47.65 | 52.39 | 80.20 | 65.54 | 8400000000 | |
Grzeszczyk_SRPOL_task5_3 | Grzeszczyk | grzeszczyk_srpol_2025 | 59.34 | 47.41 | 49.16 | 81.45 | 64.73 | 8400000000 | |
Gibier_inria_task5_1 | Marcel Gibier | gibier_inria_2025 | 55.97 | 42.55 | 50.00 | 75.35 | 62.25 | 7700000000 | |
Wijngaard_DACS_task5_4 | Wijngaard | wijngaard_dacs_2025 | 55.25 | 44.25 | 40.19 | 81.30 | 61.83 | 92400000000 | |
Wijngaard_DACS_task5_1 | Wijngaard | wijngaard_dacs_2025 | 55.00 | 44.49 | 41.51 | 79.00 | 58.92 | 8400000000 | |
Wijngaard_DACS_task5_2 | Wijngaard | wijngaard_dacs_2025 | 54.68 | 42.30 | 39.83 | 81.90 | 59.32 | 8400000000 | |
Wijngaard_DACS_task5_3 | Wijngaard | wijngaard_dacs_2025 | 53.58 | 43.11 | 38.28 | 79.35 | 59.86 | 8400000000 | |
Grzeszczyk_SRPOL_task5_1 | Grzeszczyk | grzeszczyk_srpol_2025 | 52.77 | 46.03 | 39.47 | 72.80 | 57.57 | 1600000000 | |
Chung_IND_task5_1 | Chung | chung_ind_2025 | 50.37 | 48.54 | 30.26 | 72.30 | 67.68 | 4800000000 | |
Guan_HEU_task5_2 | Guan | guan_heu_2025 | 49.70 | 36.87 | 39.47 | 72.75 | 68.94 | 8400000000 | |
Chung_IND_task5_2 | Chung | chung_ind_2025 | 49.20 | 44.65 | 29.90 | 73.05 | 65.93 | 4800000000 | |
Chung_IND_task5_3 | Chung | chung_ind_2025 | 48.98 | 46.35 | 27.75 | 72.85 | 64.96 | 4800000000 | |
Guan_HEU_task5_1 | Guan | guan_heu_2025 | 46.94 | 30.79 | 39.12 | 70.90 | 66.58 | 8400000000 | |
Yin_XJTLU_task5_1 | Yin | yin_xjtlu_2025 | 42.13 | 35.41 | 31.94 | 59.05 | 48.94 | 230000000 | |
Yin_XJTLU_task5_2 | Yin | yin_xjtlu_2025 | 41.67 | 35.58 | 30.62 | 58.80 | 47.91 | 230000000 | |
Yin_XJTLU_task5_4 | Yin | yin_xjtlu_2025 | 41.55 | 34.93 | 30.98 | 58.75 | 48.28 | 230000000 | |
Yin_XJTLU_task5_3 | Yin | yin_xjtlu_2025 | 40.87 | 34.68 | 29.19 | 58.75 | 49.50 | 230000000 |
System characteristics
In this section you can find the characteristics of the submitted systems included in the official ranking.
Rank | Submission Code | Technical Report | Domain Average (Eval) | Amount of Parameters | Pretrained Model | Model Type | External Data | Post-processing
---|---|---|---|---|---|---|---|---
31 | Baseline_Qwen2_Audio | | 37.19 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
21 | Baseline_AudioFlamingo2 | | 50.85 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching
20 | Baseline_Gemini_2_0 | | 51.20 | | Gemini-2.0-Flash | | | Direct matching
19 | Baseline_Kimi_Audio | | 52.09 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching
1 | Sun_Antgroup_task5_2 | sun_antgroup_2025 | 73.74 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
2 | Sun_Antgroup_task5_4 | sun_antgroup_2025 | 73.11 | 20000000000 | Qwen2-Audio-7B-Instruct, Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
3 | Shi_USTC_task5_1 | shi_ustc_2025 | 72.81 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
4 | Sun_Antgroup_task5_1 | sun_antgroup_2025 | 72.14 | 12000000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
5 | Shi_USTC_task5_2 | shi_ustc_2025 | 71.52 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
6 | Shi_USTC_task5_4 | shi_ustc_2025 | 71.45 | 7100000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
7 | Sun_Antgroup_task5_3 | sun_antgroup_2025 | 69.78 | 8200000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, AudioCaps, Clotho, CompA-R, LP-MusicCaps-MTT, MMAU, MusicCaps, SpeechCraft, TACOS, VGGSound, AudioSet, VocalSound | Select and output the option with the same first letter |
8 | Shi_USTC_task5_3 | shi_ustc_2025 | 69.27 | 8100000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | AudioSet-Strong, MMAU | Use the first letter of the model output and select the corresponding option |
9 | Chen_SRCN_task5_3 | Chen_srcn_t5_2025 | 64.91 | 8300000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
10 | Grzeszczyk_SRPOL_task5_4 | grzeszczyk_srpol_2025 | 60.18 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Selected the option that has the lowest Levenshtein distance with the model response |
11 | Grzeszczyk_SRPOL_task5_2 | grzeszczyk_srpol_2025 | 60.08 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU, TACOS | Selected the option that has the lowest Levenshtein distance with the model response |
12 | Grzeszczyk_SRPOL_task5_3 | grzeszczyk_srpol_2025 | 59.34 | 8400000000 | R1-AQA | end-to-end, autoregressive | MMAU | Selected the option that has the lowest Levenshtein distance with the model response |
13 | Gibier_inria_task5_1 | gibier_inria_2025 | 55.97 | 7700000000 | Qwen2.5-7B-Instruct | autoregressive | AudioSet | Selected the response at index i corresponding to the letter label of the model's answer (e.g., A → 0, B → 1, ...) |
14 | Wijngaard_DACS_task5_4 | wijngaard_dacs_2025 | 55.25 | 92400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | |
15 | Wijngaard_DACS_task5_1 | wijngaard_dacs_2025 | 55.00 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | |
16 | Wijngaard_DACS_task5_2 | wijngaard_dacs_2025 | 54.68 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | Exact match |
17 | Wijngaard_DACS_task5_3 | wijngaard_dacs_2025 | 53.58 | 8400000000 | Qwen2-Audio-7B-Instruct | autoregressive | AVQA, CLothoAQA, CompA-Order, TACOS, AudSem | Exact match |
18 | Grzeszczyk_SRPOL_task5_1 | grzeszczyk_srpol_2025 | 52.77 | 1600000000 | | end-to-end, autoregressive | AudioCaps, AudioSet, Clotho, WavCaps, OpenAQA, VGGSound, FSD-50k | Selected the option that has the lowest Levenshtein distance with the model response
22 | Chung_IND_task5_1 | chung_ind_2025 | 50.37 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
23 | Guan_HEU_task5_2 | guan_heu_2025 | 49.70 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
24 | Chung_IND_task5_2 | chung_ind_2025 | 49.20 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
25 | Chung_IND_task5_3 | chung_ind_2025 | 48.98 | 4800000000 | AudioFlamingo2 | end-to-end, autoregressive | | Select the same choice as the index of the answer derived from the model
26 | Guan_HEU_task5_1 | guan_heu_2025 | 46.94 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
27 | Yin_XJTLU_task5_1 | yin_xjtlu_2025 | 42.13 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
28 | Yin_XJTLU_task5_2 | yin_xjtlu_2025 | 41.67 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
29 | Yin_XJTLU_task5_4 | yin_xjtlu_2025 | 41.55 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
30 | Yin_XJTLU_task5_3 | yin_xjtlu_2025 | 40.87 | 230000000 | BEATs-Base, BERT-Base-Uncased | end-to-end | | Map the predicted choice letter (A/B/…) to its answer string and write to CSV
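Most of the post-processing strategies listed above map a free-form model response back to one of the candidate options. The sketch below illustrates three such variants in a hedged form: first-letter extraction, nearest option by string similarity (difflib stands in for the Levenshtein distance several teams report), and highest sentence-embedding similarity (assuming the sentence-transformers package and an arbitrary all-MiniLM-L6-v2 checkpoint, not necessarily what any team used).

```python
import difflib  # standard library; stands in for a dedicated Levenshtein package

OPTIONS = {"A": "Dog barking", "B": "Car horn", "C": "Door slamming"}  # hypothetical options

def by_first_letter(response: str) -> str | None:
    """Take the first character of the response as the option letter, if valid."""
    letter = response.strip()[:1].upper()
    return letter if letter in OPTIONS else None

def by_string_similarity(response: str) -> str:
    """Pick the option text closest to the response (difflib ratio as a proxy for Levenshtein)."""
    return max(OPTIONS, key=lambda k: difflib.SequenceMatcher(
        None, response.lower(), OPTIONS[k].lower()).ratio())

def by_sentence_embedding(response: str) -> str:
    """Pick the option with the highest embedding cosine similarity (assumed checkpoint)."""
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([response] + list(OPTIONS.values()), convert_to_tensor=True)
    scores = util.cos_sim(emb[0], emb[1:])[0]
    return list(OPTIONS)[int(scores.argmax())]

print(by_first_letter("B) Car horn"), by_string_similarity("a horn from a car"))  # B B
```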
Additional System Characteristics
This section provides metrics and attributes of the submitted systems that are not part of the official ranking.
Submission Code | Technical Report | Domain Average (Eval) | Part 1 Accuracy | Part 2 Accuracy | Part 3 Accuracy | Domain Average (Dev) | Amount of Parameters | Pretrained Model | Model Type | External Data | Post-processing
---|---|---|---|---|---|---|---|---|---|---|---
Baseline_Qwen2_Audio | | 37.19 | 27.71 | 38.52 | 45.35 | 39.60 | 8400000000 | Qwen2-Audio-7B-Instruct | end-to-end, autoregressive | | Selected the option that has highest SentenceBERT similarity score with the model response
Baseline_AudioFlamingo2 | | 50.85 | 42.87 | 32.18 | 77.50 | 45.00 | 300000000 | AudioFlamingo2 | end-to-end, autoregressive | | Direct matching
Baseline_Gemini_2_0 | | 51.20 | 36.41 | 43.18 | 74.00 | 48.30 | | Gemini-2.0-Flash | | | Direct matching
Baseline_Kimi_Audio | | 52.09 | 37.44 | 42.58 | 76.25 | 46.80 | 9800000000 | Kimi-Audio-7B-Instruct | end-to-end, autoregressive | | Direct matching
omni_Chen_SRCN_task5_1 | Chen_srcn_t5_2025 | 75.67 | 66.45 | 74.52 | 86.05 | 81.26 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Chen_SRCN_task5_2 | Chen_srcn_t5_2025 | 75.51 | 66.94 | 73.80 | 85.80 | 81.40 | 8300000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Chen_SRCN_task5_4 | Chen_srcn_t5_2025 | 63.31 | 46.68 | 58.61 | 84.65 | 64.75 | 4000000000 | Qwen2.5-Omni-3B | end-to-end, autoregressive | AVQA, Clothov-AQA, Audioset, Audioset-Strong, AudioCaps, WavCaps, FSD50K, CompA-R, TACOS, MMAU | Directly extract the string content from |
omni_Liu_MLPXC_task5_1 | Li_mlpxc_2025 | 66.04 | 49.92 | 61.01 | 87.20 | 69.57 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_2 | Li_mlpxc_2025 | 73.69 | 62.32 | 72.01 | 86.75 | 75.10 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_3 | Li_mlpxc_2025 | 71.22 | 59.56 | 66.99 | 87.10 | 71.42 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS | |
omni_Liu_MLPXC_task5_4 | Li_mlpxc_2025 | 71.24 | 58.35 | 67.46 | 87.90 | 72.28 | 8900000000 | Qwen2.5-Omni-7B | end-to-end, autoregressive | Clotho-AQA,TACOS |
Technical reports
Parameter-Efficient Tuning of Large Audio-Language Models for DCASE 2025 Challenge Task 5
Pengfei Cai, Yanfeng Shi, Qing Gu, Nan Jiang, and Yan Song
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China
Shi_USTC_task5_1 Shi_USTC_task5_2 Shi_USTC_task5_3 Shi_USTC_task5_4
Abstract
In this technical report, we describe our systems developed for the DCASE 2025 Challenge Task 5. Our system is mainly based on parameter-efficient tuning of large audio-language models, e.g., Qwen2-Audio and Kimi-Audio. The training process uses Low-Rank Adaptation (LoRA) and is divided into two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In addition, we reformatted the annotations of the AudioSet-Strong and MMAU datasets into a question-answer format to augment the official task dataset. Our final system achieves an accuracy of 80.0% on the development set.
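As a rough illustration of the parameter-efficient tuning described in this report, the sketch below attaches LoRA adapters to a Hugging Face causal-LM backbone with the peft library. The checkpoint name, target modules, and hyperparameters are placeholders, not the authors' actual configuration (which targets audio-language models such as Qwen2-Audio and Kimi-Audio).

```python
# Minimal LoRA setup sketch; assumes transformers and peft are installed and the
# placeholder checkpoint exposes standard attention projection layers.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")  # placeholder backbone
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (hypothetical value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```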
DCASE 2025 Challenge Task 5 Technical Report
Minjun Chen, Jun Shao, Yangyang Liu, Bo Peng and Jie Chen
Samsung Research China-Nanjing, Nanjing, China
omni_Chen_SRCN_task5_1 omni_Chen_SRCN_task5_2 Chen_SRCN_task5_3 omni_Chen_SRCN_task5_4
Abstract
In this technical report, we describe our submitted systems for DCASE 2025 Task 5: Audio Question Answering. Our systems focus on training a Large Audio Language Model (LALM) with carefully curated training datasets and training stages, starting from a carefully chosen multi-modality baseline model. We choose Qwen 2.5 Omni 7B, which has shown impressive performance on audio- and vision-related tasks, as the base to initialize the audio encoder and LLM component of the proposed systems. We collect and transform multiple audio-text datasets for training; the total number of samples reaches 800K, covering multiple audio-related tasks such as closed-ended audio QA, open-ended audio QA, audio captioning, and audio temporal understanding and reasoning. We curate a multi-stage training procedure that helps the model focus on different aspects of the data and learn from easy to hard over the course of training. In the post-training stage, we adopt different training methods, including supervised fine-tuning (SFT) and GRPO, to take advantage of their complementary strengths in generalization and memorization. With these carefully considered designs, our model learns to answer questions correctly not only in content but also in the format specified in the prompts, which simplifies the post-processing procedure for evaluation. Our proposed systems achieve a top-1 accuracy of 81.3% on the DCASE Task 5 development set.
Parameter-Efficient Fine-Tuning of Audio Flamingo 2 with LoRA for the DCASE 2025 Audio Question Answering Challenge
HaeChun Chung
Independent, Seoul, Korea
Chung_IND_task5_1 Chung_IND_task5_2 Chung_IND_task5_3
Abstract
Audio Question Answering (AQA) presents a significant challenge, demanding models capable of complex reasoning over extensive audio sequences. In this research, we boost the performance of Audio Flamingo 2 (AF2), a compact yet powerful audio-language model, by employing parameter-efficient Low-Rank Adaptation (LoRA). We apply targeted data augmentation strategies for multiple-choice QA and fine-tune the model using the DCASE 2025 Challenge Task 5 dataset. Our top-performing model, utilizing LoRA with a rank of 8, achieves a remarkable 69.67% accuracy. This substantially outperforms all established baselines, including the strong Gemini-2.0-Flash (52.5%). These results highlight the effectiveness and practical value of a lightweight adaptation approach, especially when operating under constrained computational resources.
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, and Jean-François Bonastre
Inria Paris, LR2, Paris, France
Gibier_inria_task5_1
Abstract
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the Audioset ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves a top-1 accuracy of 65.4 % on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
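The report mentions GRPO training with a simple reward function. A minimal correctness-based reward of the kind such setups typically use could look as follows; the authors' exact reward is not specified here, so this is a hypothetical stand-in.

```python
import re

def mcq_reward(model_output: str, correct_letter: str) -> float:
    """Hypothetical reward: 1.0 if the first standalone option letter found in the
    output matches the ground-truth letter, otherwise 0.0."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return 1.0 if match and match.group(1) == correct_letter.upper() else 0.0

print(mcq_reward("The answer is (B): car horn.", "B"))  # 1.0
```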
Audio Question Answering at the DCASE 2025 Challenge
Haolin He1, Mingru Yang2, Renhe Sun3, Jiayi Zhou3, Jian Liu3, Qianhua He2 and Qiuqiang Kong1
1Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China 2School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China 3Machine Intelligence, Ant Group, Shanghai, China
Sun_Antgroup_task5_1 Sun_Antgroup_task5_2 Sun_Antgroup_task5_3 Sun_Antgroup_task5_4
Abstract
In this technical report, we describe the submission system for DCASE 2025 Task 5: Audio Question Answering. In this work, we introduce a comprehensive audio question answering dataset named DCASE-AQA-Boost, featuring diverse question types and carefully curated answer options to address the limitations of existing collections. Based on DCASE-AQA-Boost, we have developed two models, Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B. Kimi-Audio-SFT-12B is obtained through a two-stage Supervised Fine-Tuning (SFT) process using the Pretraining and Finetuning splits of DCASE-AQA-Boost. Qwen2-Audio-R1-8B is trained using our proposed three-stage training paradigm based on DCASE-AQA-Boost and the DCASE 2025 Task 5 training set, which incorporates Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that the proposed method significantly improves the accuracy of multiple-choice audio question answering systems. Kimi-Audio-SFT-12B and Qwen2-Audio-R1-8B achieve 77.66% and 78.18% accuracy on the DCASE 2025 Task 5 development set, respectively.
MIAQA Submission for DCASE 2025 Challenge Task 5: A Reinforcement Learning Driven Audio Question Answering Method
Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Tianzi Wang, Junbo Zhang, and Jian Luan
MiLM Plus, Xiaomi Corp., China
omni_Liu_MLPXC_task5_1 omni_Liu_MLPXC_task5_2 omni_Liu_MLPXC_task5_3 omni_Liu_MLPXC_task5_4
Abstract
This technical report presents an audio question answering (AQA) method submitted to DCASE 2025 Challenge Task 5. Recent studies have shown that reinforcement learning (RL) can enhance the audio reasoning capabilities of large audio language models (LALMs). Thus, we employ an RL strategy to optimize our AQA model. The MiAQA submission is based on our preliminary study [1]. We apply the group relative policy optimization (GRPO) algorithm to Qwen2.5-Omni-7B. The model directly generates responses after implicit reasoning, without relying on complex, explicit chain-of-thought (CoT). To enhance data diversity, the training data combines human-annotated datasets with weakly labeled datasets generated by large language models (LLMs). Using only a single model and 35k training samples, MiAQA achieves up to 78.0% accuracy on the DCASE 2025 AQA development set.
Data-Balanced Curriculum Learning for Audio Question Answering
Gijs Wijngaard1, Elia Formisano2,3,4, Michele Esposito1, and Michel Dumontier1
1Department of Advanced Computing Sciences, Maastricht University, Netherlands 2Department of Cognitive Neuroscience, Maastricht University, Netherlands 3Maastricht Centre for Systems Biology, Maastricht University, Netherlands 4Brightlands Institute for Smart Society, Maastricht University, Netherlands
Wijngaard_DACS_task5_1 Wijngaard_DACS_task5_2 Wijngaard_DACS_task5_3 Wijngaard_DACS_task5_4
Abstract
Audio question answering (AQA) requires models to understand acoustic content and perform complex reasoning. Current models struggle with dataset imbalances and unstable training dynamics. This work combines curriculum learning with statistical data balancing to address these challenges. The method labels question difficulty using language models, then trains progressively from easy to hard examples. Statistical filtering removes overrepresented audio categories, and guided decoding constrains outputs to valid multiple-choice formats. Experiments on the DCASE 2025 training set and five additional public datasets show that data curation improves accuracy by 19.2% over baseline models, achieving 64.2% on the DCASE 2025 benchmark.
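A minimal sketch of the easy-to-hard ordering and category balancing described in this abstract, assuming each training example already carries a difficulty label (e.g., assigned by a language model) and an audio category; the field layout, labels, and cap value are illustrative only.

```python
from collections import Counter

# Hypothetical examples: (question_id, difficulty 0=easy/1=medium/2=hard, audio_category)
examples = [
    ("q1", 0, "dog"), ("q2", 2, "dog"), ("q3", 1, "siren"),
    ("q4", 0, "dog"), ("q5", 1, "speech"), ("q6", 2, "siren"),
]

MAX_PER_CATEGORY = 2  # cap overrepresented audio categories (illustrative threshold)

def balance(items, cap=MAX_PER_CATEGORY):
    """Statistical filtering: keep at most `cap` examples per audio category."""
    counts, kept = Counter(), []
    for ex in items:
        if counts[ex[2]] < cap:
            counts[ex[2]] += 1
            kept.append(ex)
    return kept

# Curriculum: train on the balanced set ordered from easy to hard.
curriculum = sorted(balance(examples), key=lambda ex: ex[1])
print([qid for qid, _, _ in curriculum])  # ['q1', 'q3', 'q5', 'q2', 'q6']
```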
Take It with a Grain of Salt: Improving Audio Question Answering with Large Language Models
Juliusz Wójtowicz-Kruk1, Piotr Masztalski1,2, Bartłomiej Zgórzyński1, and Michal Grzeszczyk1
1Samsung R&D Institute Poland, Warsaw, Poland 2AGH University of Krakow, Poland
Grzeszczyk_SRPOL_task5_1 Grzeszczyk_SRPOL_task5_2 Grzeszczyk_SRPOL_task5_3 Grzeszczyk_SRPOL_task5_4
Abstract
In this report, we present our solution for DCASE 2025 Task 5: Audio Question Answering. We explore two distinct architectures: an audio encoder trained with a text decoder model, and the R1-AQA model fine-tuned for the challenge tasks. Our original solution utilizes the PaSST-S audio encoder and the Qwen2.5-1.5B-Instruct model. The model was pre-trained on captioning, tagging, and question answering tasks, followed by fine-tuning for each of the three challenge tasks. The R1-AQA model underwent fine-tuning across all challenge tasks using LoRA. Through experimentation with various datasets and training methodologies, including a few-shot approach, our best trained-from-scratch model achieved an accuracy of 0.61, while the fine-tuned R1-AQA model reached an accuracy of 0.71 on the challenge development split.
Audio Question Answering Using Audio-Language Model with Supervised Fine-Tuning and Group Relative Policy Optimization
Feiyang Xiao1, Tong Ye1, Kejia Zhang1, Qiaoxi Zhu2, Guangjun He3, Pengming Feng3, Li Liu4, Wenwu Wang5, and Jian Guan1
1College of Computer Science and Technology, Harbin Engineering University, Harbin, China 2University of Technology Sydney, Ultimo, Australia 3State Key Laboratory of Space Information System and Integrated Application (SISIA), Beijing, China 4The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China 5Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Guan_HEU_task5_1 Guan_HEU_task5_2
Abstract
This report presents our submission for Task 5 (Audio Question Answering, AQA) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge. Building on the Qwen-Audio backbone, our system interprets natural language questions and generates textual answers directly from audio inputs. In our submission, an open-source variant of Qwen-Audio-7B with Group Relative Policy Optimization (GRPO) is adopted as our first system (GRPO-Qwen), which is then fine-tuned on the DCASE training set with GRPO to obtain our second system (GRPO2). Experiments on the official AQA benchmark show that our systems surpass the baselines in terms of Top-1 accuracy, demonstrating their effectiveness.
EchoTwin-QA: A Dual-Tower BEATsBERT System for DCASE 2025 Task 5 Audio Question Answering
Zeyu Yin1, Ziyang Zhou1, Yiqiang Cai1, Shengchen Li1, and Xi Shao2
1Xi’an Jiaotong-Liverpool University, School of Advanced Technology, Suzhou, China 2Nanjing University of Posts and Telecommunications, College of Telecommunications and Information Engineering, Nanjing, China
Yin_XJTLU_task5_1 Yin_XJTLU_task5_2 Yin_XJTLU_task5_3 Yin_XJTLU_task5_4
Abstract
Task 5 of the DCASE 2025 Challenge frames Audio Question Answering (AQA) as a multiple-choice test of acoustic reasoning across marine bioacoustics, temporal soundscapes, and everyday recordings. We present a lightweight dual-tower system that couples a BEATs-Base audio encoder with a BERT-Base text encoder; a two-layer MLP, amounting to ∼132M trainable parameters, maps the concatenated embeddings to answer logits. On the official development set, our best submission achieves 54.46% accuracy, surpassing the strongest baseline (Gemini-2.0-Flash, 52.5%).
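A structural sketch of the dual-tower design described above: pooled audio and text embeddings from the two encoders are concatenated and scored by a small MLP. The embedding dimensions and the per-option scoring scheme are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualTowerAQA(nn.Module):
    """Toy dual-tower head: concatenate pooled audio and text embeddings, score with an MLP."""
    def __init__(self, audio_dim: int = 768, text_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one score per (audio, question+option) pair
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (batch, audio_dim); text_emb: (batch, n_options, text_dim)
        n_options = text_emb.size(1)
        audio_rep = audio_emb.unsqueeze(1).expand(-1, n_options, -1)
        return self.mlp(torch.cat([audio_rep, text_emb], dim=-1)).squeeze(-1)  # (batch, n_options)

# Pooled embeddings would come from BEATs-Base (audio) and BERT-Base (question + option text).
model = DualTowerAQA()
logits = model(torch.randn(2, 768), torch.randn(2, 3, 768))
print(logits.shape)  # torch.Size([2, 3])
```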