Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal of this task is to retrieve audio files from a given dataset and sort them by how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
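All of the submitted systems (see the system characteristics below) implement this retrieve-and-sort step with a dual-encoder model: the caption and each candidate audio clip are embedded into a shared space, and candidates are ordered by similarity. As a purely illustrative sketch, assuming pre-computed embeddings from any such encoder pair, ranking reduces to a similarity sort:

```python
import numpy as np

def rank_audio_files(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """Return indices of the candidate audio files, best match first.

    query_emb:  (d,)   embedding of the caption query
    audio_embs: (n, d) embeddings of the n candidate audio files
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    return np.argsort(-(a @ q))  # sort by descending similarity
```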
Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound.
As a novelty this year, we introduced additional relevance annotations for the evaluation datasets. While previously only one audio file was considered relevant for each text query, this year we provide multiple relevant audio files for each text query.
A more detailed task description can be found on the task description page.
A few stats about the evaluation sets:
In the development-eval set, there are on average 2.8 relevant audio files per caption, and 46.56% of the originally matching audio files are still linked to their queries.
The evaluation set has 597 queries, with on average 1.8 relevant audio files per query and 43.21% of the original true positives kept.
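Since a query can now have several relevant audio files, systems are scored with mean average precision truncated at rank 16 (mAP@16; see the rankings below). The following is a minimal sketch of one common mAP@k formulation; the official challenge implementation may differ in details such as the normalisation:

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=16):
    """AP@k for one query: precision is accumulated at each rank where a
    relevant item appears, then normalised by min(#relevant, k)."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(rankings, relevants, k=16):
    """mAP@k: mean of the per-query AP@k values."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)
```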
Teams ranking
Here are listed the best systems from all teams. The ranking is based on the achieved mAP@16 metric. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all metrics employed in the task. Metric values are given for both the evaluation dataset and the development-testing split.
In the table, Eval denotes the evaluation dataset and Dev the development-testing split; mAP@16 is computed with the new 2025 relevance annotations, while mAP@10 and R@k follow the 2024 single-relevance protocol.

| Submission code | Best official system rank | Corresponding author | Technical Report | Eval mAP@16 | Eval mAP@10 | Eval R@1 | Eval R@5 | Eval R@10 | Dev mAP@16 | Dev mAP@10 | Dev R@1 | Dev R@5 | Dev R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kim_AISTAT_task6_3 | 1 | Changwon Lim | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| Calvet_AUDIAS_task6_4 | 5 | Oscar Calvet | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| baseline | 8 | Paul Primus | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| Cai_NCUT_task6_1 | 10 | Xichang Cai | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| Filomeno_JKU_task6_1 | 11 | Giovanni Filomeno | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| Pandey_IITK_task6_1 | 12 | Saubhagya Pandey | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |
Systems ranking
Here are listed all systems and their rankings according to the different metrics. Detailed information about each system is given in the next section.
As above, Eval denotes the evaluation dataset and Dev the development-testing split; mAP@16 uses the 2025 relevance annotations and the remaining metrics the 2024 protocol.

| Submission code | Best official system rank | Technical Report | Eval mAP@16 | Eval mAP@10 | Eval R@1 | Eval R@5 | Eval R@10 | Dev mAP@16 | Dev mAP@10 | Dev R@1 | Dev R@5 | Dev R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kim_AISTAT_task6_3 | 1 | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| Kim_AISTAT_task6_2 | 2 | Kim2025_t6 | 0.414 | 0.399 | 0.288 | 0.549 | 0.665 | 0.488 | 0.416 | 0.283 | 0.599 | 0.722 |
| Kim_AISTAT_task6_4 | 3 | Kim2025_t6 | 0.412 | 0.398 | 0.286 | 0.552 | 0.669 | 0.488 | 0.417 | 0.284 | 0.600 | 0.725 |
| Kim_AISTAT_task6_1 | 4 | Kim2025_t6 | 0.411 | 0.397 | 0.286 | 0.549 | 0.661 | 0.488 | 0.416 | 0.283 | 0.597 | 0.721 |
| Calvet_AUDIAS_task6_4 | 5 | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| Calvet_AUDIAS_task6_2 | 6 | Calvet2025_t6 | 0.344 | 0.348 | 0.229 | 0.510 | 0.638 | 0.439 | 0.375 | 0.248 | 0.549 | 0.686 |
| Calvet_AUDIAS_task6_1 | 7 | Calvet2025_t6 | 0.339 | 0.359 | 0.251 | 0.510 | 0.627 | 0.442 | 0.383 | 0.253 | 0.560 | 0.693 |
| baseline | 8 | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| Calvet_AUDIAS_task6_3 | 9 | Calvet2025_t6 | 0.324 | 0.357 | 0.249 | 0.500 | 0.636 | 0.425 | 0.379 | 0.258 | 0.544 | 0.676 |
| Cai_NCUT_task6_1 | 10 | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| Filomeno_JKU_task6_1 | 11 | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| Pandey_IITK_task6_1 | 12 | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |
System characteristics
This section presents the characteristics of the submitted systems in two tables, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.
Overview of characteristics
| Rank | Submission code | mAP@16 | Technical Report | Number of parameters | Audio modelling | Text modelling | Loss function |
|---|---|---|---|---|---|---|---|
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11350187000 | PaSST | Sentence-BERT | NT-Xent |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452000000 | PaSST | Sentence-BERT | NT-Xent |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446187000 | PaSST | Sentence-BERT | NT-Xent |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732354 | PaSST | RoBERTa-large | InfoNCE |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452000000 | PaSST | Sentence-BERT | NT-Xent |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732354 | PaSST | RoBERTa-large | InfoNCE |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732354 | PaSST | RoBERTa-large | contrastive InfoNCE |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442000000 | PaSST | RoBERTa-large | Enhanced InfoNCE with Multi-Positive Learning and Hard Negative Mining |
Detailed characteristics
| Rank | Submission code | mAP@16 | Technical Report | Number of parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Dataset(s) used for training | Dataset(s) used for validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11350187000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452000000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446187000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732354 | PaSST | log-mel energies | RoBERTa-large | | | 32kHz | InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452000000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732354 | PaSST | log-mel energies | RoBERTa-large | mixup | | 44.1kHz | InfoNCE | Adam | validation_loss | Clotho-development, AudioCaps | Clotho-validation |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732354 | PaSST | log-mel energies | RoBERTa-large | time-frequency masking | | 32kHz | contrastive InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442000000 | PaSST | log-mel energies | RoBERTa-large | | | 32kHz | Enhanced InfoNCE with Multi-Positive Learning and Hard Negative Mining | AdamW | validation_loss | Clotho-development | Clotho-validation |
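Two systems augment audio at the spectrogram level: mixup (Cai_NCUT_task6_1) and time-frequency masking (Filomeno_JKU_task6_1). For illustration, a minimal SpecAugment-style sketch of the latter follows; the mask widths are assumed defaults, not values taken from the report:

```python
import numpy as np

def time_frequency_mask(spec, max_f=16, max_t=32, rng=None):
    """Zero out one random frequency band and one random time band of a
    (freq_bins, time_frames) log-mel spectrogram, SpecAugment-style.
    The maximum widths max_f and max_t are illustrative defaults."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_f, n_t = spec.shape
    f = int(rng.integers(0, max_f + 1))         # frequency mask width
    f0 = int(rng.integers(0, max(n_f - f, 1)))  # frequency mask start
    spec[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))         # time mask width
    t0 = int(rng.integers(0, max(n_t - t, 1)))  # time mask start
    spec[:, t0:t0 + t] = 0.0
    return spec
```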
Technical reports
Dual-Encoder Audio Retrieval with PaSST and RoBERTa: A Contrastive and Distillation Approach
Xichang Cai, Weijie Luo
North China University of Technology, Beijing, China
Cai_NCUT_task6_1
Abstract
This technical report describes the Cai_NCUT team's submissions to the language-based audio retrieval task of the 2025 DCASE Challenge (Task 6). Our systems are built upon the dual-encoder architecture, mapping audio clips and textual queries into a joint embedding space using pretrained backbones. In this submission, we explore the use of the Patchout faSt Spectrogram Transformer (PaSST) as the audio encoder and RoBERTa as the text encoder. We introduce a two-stage training pipeline involving contrastive learning and a self-distillation phase to leverage cross-modal soft alignments. Our best single system, based on PaSST and RoBERTa-large, achieves a mAP@16 of 0.32 on the ClothoV2 test split.
A CROSS-MODAL ATTENTION APPROACH TO LANGUAGE-BASED AUDIO RETRIEVAL
Oscar Calvet, Doroteo Torre
Escuela Politecnica Superior, Madrid, Spain
Calvet_AUDIAS_task6_1 Calvet_AUDIAS_task6_2 Calvet_AUDIAS_task6_3 Calvet_AUDIAS_task6_4
Abstract
This report presents the systems we developed for the 2025 DCASE Language-Based Audio Retrieval challenge (Task 6). We use a bi-encoder architecture and propose a novel cross-modal attention approach to calculate the similarity between the embeddings produced by both models. We make use of pretrained encoders in both modalities: PaSST is used for encoding audio and RoBERTa for encoding text. We trained our system on WavCaps, AudioCaps, ClothoV2 and TACOS using contrastive learning. The best single system that we were able to produce reaches a mAP@10 of 38.293 on the ClothoV2 test split and a mAP@16 of 44.203 using the task-specific improved annotations. An ensemble of the presented models achieves a mAP@10 of 40.423 on the ClothoV2 test split and a mAP@16 of 46.864 using the task-specific improved annotations.
ENHANCING LANGUAGE-BASED AUDIO RETRIEVAL WITH PARTIAL FINE-TUNING AND ATTENTION POOLING
Giovanni Filomeno, Youssef Kandah, Florian Spiessberger
Johannes Kepler University, Linz, Austria
Filomeno_JKU_task6_1
Abstract
This technical report describes our submission to the language-based audio retrieval task of the DCASE 2025 Challenge (Task 6). Building upon our previous work, we retain the dual-encoder architecture that projects audio recordings and textual descriptions into a shared embedding space. This year we focus on architectural and training-level refinements within a single model framework. Specifically, we fine-tune only the upper transformer layers of a PaSST audio encoder, apply attention-based segment pooling, and replace CLS token extraction in RoBERTa with masked mean pooling. Additionally, we introduce time-frequency spectrogram augmentation and reduce the hop size to capture more segment detail. Our improved system achieves a mAP@10 of 36.005 on the ClothoV2 test set, outperforming the official DCASE 2025 baseline without relying on external caption generation or model ensembles. The mAP@16 result requested this year is 36.661 (without the new annotations). All code and trained models are available on GitHub.
AISTAT LAB SYSTEM FOR DCASE 2025 TASK 6: LANGUAGE-BASED AUDIO RETRIEVAL
Hyun Jun Kim, Hyeong Yong Choi, Kyuwon Choi, Eunsin Choi, Changwon Lim
Chung-Ang University, Seoul, Korea
Kim_AISTAT_task6_1 Kim_AISTAT_task6_2 Kim_AISTAT_task6_3 Kim_AISTAT_task6_4
Abstract
This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs a dual-encoder architecture, where audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing inspiration from methodologies of the previous year's challenge, we implemented a distillation approach and leveraged large language models (LLMs) for effective data augmentation techniques, including back-translation and LLM mix. Additionally, we incorporated pseudo-labeling to introduce a classification auxiliary task for further fine-tuning. Our best single system achieved a mAP@16 of 46.5, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
Enhanced Audio-Text Retrieval with Multi-Positive Learning and Hard Negative Mining
Saubhagya Pandey, Khushal Wadhwa, Ayush Goyal, Kanak Khandelwal
Indian Institute of Technology Kanpur, India
Pandey_IITK_task6_1
Abstract
This paper presents an enhanced implementation of audio-text cross-modal retrieval for DCASE 2025 Task 6, featuring advanced contrastive learning techniques. Our system implements a dual-encoder architecture using PaSST (Patch-out Fast Spectrogram Transformer) for audio encoding and RoBERTa-large for text encoding, enhanced with multi-positive learning and hard negative mining strategies. The proposed method introduces progressive training with staged activation of advanced techniques, achieving significant performance improvements over baseline approaches. Experimental evaluation on the Clotho dataset demonstrates competitive retrieval performance with R@1 of 18.68%, R@5 of 44.77%, R@10 of 59.35%, and mAP@10 of 30.01%. The implementation supports mixed precision training and comprehensive evaluation metrics for robust cross-modal retrieval.