Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal of this task is to retrieve audio files from a given dataset and sort them by how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
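All of the submitted systems (see the system characteristics below) implement this retrieve-and-sort step with a dual-encoder model: the caption and each candidate audio clip are embedded into a shared space, and candidates are ordered by similarity. As a purely illustrative sketch, assuming pre-computed embeddings from any such encoder pair, ranking reduces to a similarity sort:

```python
import numpy as np

def rank_audio_files(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """Return indices of the candidate audio files, best match first.

    query_emb:  (d,)   embedding of the caption query
    audio_embs: (n, d) embeddings of the n candidate audio files
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    return np.argsort(-(a @ q))  # sort by descending similarity
```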
Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound.
As a novelty this year, we introduced additional relevance annotations for the evaluation datasets. While previously only one audio file was considered relevant for each text query, this year we provide multiple relevant audio files for each text query.
A more detailed task description can be found on the task description page.
A few stats about the evaluation sets:
In the development-eval set, there are on average 2.8 relevant audio files per caption, and 46.56% of the originally matching audio files are still linked to their queries.
The evaluation set has 597 queries, with on average 1.8 relevant audio files per query and 43.21% of the original true positives kept.
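Since a query can now have several relevant audio files, systems are scored with mean average precision truncated at rank 16 (mAP@16; see the rankings below). The following is a minimal sketch of one common mAP@k formulation; the official challenge implementation may differ in details such as the normalisation:

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=16):
    """AP@k for one query: precision is accumulated at each rank where a
    relevant item appears, then normalised by min(#relevant, k)."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(rankings, relevants, k=16):
    """mAP@k: mean of the per-query AP@k values."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)
```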
Teams ranking
Here are listed the best systems from all teams. The ranking is based on the achieved mAP@16 metric. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all metrics employed in the task. Metric values are given for both the evaluation dataset and the development-testing split.
In the table, Eval denotes the evaluation dataset and Dev the development-testing split; mAP@16 is computed with the new 2025 relevance annotations, while mAP@10 and R@k follow the 2024 single-relevance protocol.

| Submission code | Best official system rank | Corresponding author | Technical Report | Eval mAP@16 | Eval mAP@10 | Eval R@1 | Eval R@5 | Eval R@10 | Dev mAP@16 | Dev mAP@10 | Dev R@1 | Dev R@5 | Dev R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kim_AISTAT_task6_3 | 1 | Changwon Lim | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| Calvet_AUDIAS_task6_4 | 5 | Oscar Calvet | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| baseline | 8 | Paul Primus | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| Cai_NCUT_task6_1 | 10 | Xichang Cai | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| Filomeno_JKU_task6_1 | 11 | Giovanni Filomeno | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| Pandey_IITK_task6_1 | 12 | Saubhagya Pandey | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |
Systems ranking
Here are listed all systems and their rankings according to the different metrics. Detailed information about each system is given in the next section.
As above, Eval denotes the evaluation dataset and Dev the development-testing split; mAP@16 uses the 2025 relevance annotations and the remaining metrics the 2024 protocol.

| Submission code | Best official system rank | Technical Report | Eval mAP@16 | Eval mAP@10 | Eval R@1 | Eval R@5 | Eval R@10 | Dev mAP@16 | Dev mAP@10 | Dev R@1 | Dev R@5 | Dev R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kim_AISTAT_task6_3 | 1 | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| Kim_AISTAT_task6_2 | 2 | Kim2025_t6 | 0.414 | 0.399 | 0.288 | 0.549 | 0.665 | 0.488 | 0.416 | 0.283 | 0.599 | 0.722 |
| Kim_AISTAT_task6_4 | 3 | Kim2025_t6 | 0.412 | 0.398 | 0.286 | 0.552 | 0.669 | 0.488 | 0.417 | 0.284 | 0.600 | 0.725 |
| Kim_AISTAT_task6_1 | 4 | Kim2025_t6 | 0.411 | 0.397 | 0.286 | 0.549 | 0.661 | 0.488 | 0.416 | 0.283 | 0.597 | 0.721 |
| Calvet_AUDIAS_task6_4 | 5 | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| Calvet_AUDIAS_task6_2 | 6 | Calvet2025_t6 | 0.344 | 0.348 | 0.229 | 0.510 | 0.638 | 0.439 | 0.375 | 0.248 | 0.549 | 0.686 |
| Calvet_AUDIAS_task6_1 | 7 | Calvet2025_t6 | 0.339 | 0.359 | 0.251 | 0.510 | 0.627 | 0.442 | 0.383 | 0.253 | 0.560 | 0.693 |
| baseline | 8 | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| Calvet_AUDIAS_task6_3 | 9 | Calvet2025_t6 | 0.324 | 0.357 | 0.249 | 0.500 | 0.636 | 0.425 | 0.379 | 0.258 | 0.544 | 0.676 |
| Cai_NCUT_task6_1 | 10 | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| Filomeno_JKU_task6_1 | 11 | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| Pandey_IITK_task6_1 | 12 | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |
System characteristics
This section presents the characteristics of the submitted systems in two tables, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.
Overview of characteristics
| Rank | Submission code | mAP@16 | Technical Report | Number of parameters | Audio modelling | Text modelling | Loss function |
|---|---|---|---|---|---|---|---|
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11350187000 | PaSST | Sentence-BERT | NT-Xent |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452000000 | PaSST | Sentence-BERT | NT-Xent |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446187000 | PaSST | Sentence-BERT | NT-Xent |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732354 | PaSST | RoBERTa-large | InfoNCE |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452000000 | PaSST | Sentence-BERT | NT-Xent |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732354 | PaSST | RoBERTa-large | InfoNCE |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732354 | PaSST | RoBERTa-large | contrastive InfoNCE |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442000000 | PaSST | RoBERTa-large | Enhanced InfoNCE with Multi-Positive Learning and Hard Negative Mining |
Detailed characteristics
| Rank | Submission code | mAP@16 | Technical Report | Number of parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Dataset(s) used for training | Dataset(s) used for validation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2697000000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16kHz, 32kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11350187000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452000000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446187000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732354 | PaSST | log-mel energies | RoBERTa-large | | | 32kHz | InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452000000 | PaSST | log-mel energies | Sentence-BERT | | | 32kHz | NT-Xent | Adam | validation_loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732354 | PaSST | log-mel energies | RoBERTa-large | mixup | | 44.1kHz | InfoNCE | Adam | validation_loss | Clotho-development, AudioCaps | Clotho-validation |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732354 | PaSST | log-mel energies | RoBERTa-large | time-frequency masking | | 32kHz | contrastive InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442000000 | PaSST | log-mel energies | RoBERTa-large | | | 32kHz | Enhanced InfoNCE with Multi-Positive Learning and Hard Negative Mining | AdamW | validation_loss | Clotho-development | Clotho-validation |
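Two systems augment audio at the spectrogram level: mixup (Cai_NCUT_task6_1) and time-frequency masking (Filomeno_JKU_task6_1). For illustration, a minimal SpecAugment-style sketch of the latter follows; the mask widths are assumed defaults, not values taken from the report:

```python
import numpy as np

def time_frequency_mask(spec, max_f=16, max_t=32, rng=None):
    """Zero out one random frequency band and one random time band of a
    (freq_bins, time_frames) log-mel spectrogram, SpecAugment-style.
    The maximum widths max_f and max_t are illustrative defaults."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_f, n_t = spec.shape
    f = int(rng.integers(0, max_f + 1))         # frequency mask width
    f0 = int(rng.integers(0, max(n_f - f, 1)))  # frequency mask start
    spec[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))         # time mask width
    t0 = int(rng.integers(0, max(n_t - t, 1)))  # time mask start
    spec[:, t0:t0 + t] = 0.0
    return spec
```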
Technical reports
Dual-Encoder Audio Retrieval with PaSST and RoBERTa: A Contrastive and Distillation Approach
Xichang Cai, Weijie Luo
North China University of Technology, Beijing, China
Cai_NCUT_task6_1
Abstract
This technical report describes the Cai_NCUT team's submissions to the language-based audio retrieval task of the 2025 DCASE Challenge (Task 6). Our systems are built upon the dual-encoder architecture, mapping audio clips and textual queries into a joint embedding space using pretrained backbones. In this submission, we explore the use of the Patchout faSt Spectrogram Transformer (PaSST) as the audio encoder and RoBERTa as the text encoder. We introduce a two-stage training pipeline involving contrastive learning and a self-distillation phase to leverage cross-modal soft alignments. Our best single system, based on PaSST and RoBERTa-large, achieves a mAP@16 of 0.32 on the ClothoV2 test split.
A CROSS-MODAL ATTENTION APPROACH TO LANGUAGE-BASED AUDIO RETRIEVAL
Oscar Calvet, Doroteo Torre
Escuela Politecnica Superior, Madrid, Spain
Calvet_AUDIAS_task6_1 Calvet_AUDIAS_task6_2 Calvet_AUDIAS_task6_3 Calvet_AUDIAS_task6_4
Abstract
This report presents the systems we developed for the 2025 DCASE Language-Based Audio Retrieval challenge (Task 6). We use a bi-encoder architecture and propose a novel cross-modal attention approach to calculate the similarity between the embeddings produced by both models. We make use of pretrained encoders in both modalities: PaSST is used for encoding audio and RoBERTa for encoding text. We trained our system on WavCaps, AudioCaps, ClothoV2 and TACOS using contrastive learning. The best single system that we were able to produce reaches a mAP@10 of 38.293 on the ClothoV2 test split and a mAP@16 of 44.203 using the task-specific improved annotations. An ensemble of the presented models achieves a mAP@10 of 40.423 on the ClothoV2 test split and a mAP@16 of 46.864 using the task-specific improved annotations.
ENHANCING LANGUAGE-BASED AUDIO RETRIEVAL WITH PARTIAL FINE-TUNING AND ATTENTION POOLING
Giovanni Filomeno, Youssef Kandah, Florian Spiessberger
Johannes Kepler University, Linz, Austria
Filomeno_JKU_task6_1
Abstract
This technical report describes our submission to the language-based audio retrieval task of the DCASE 2025 Challenge (Task 6). Building upon our previous work, we retain the dual-encoder architecture that projects audio recordings and textual descriptions into a shared embedding space. This year we focus on architectural and training-level refinements within a single model framework. Specifically, we fine-tune only the upper transformer layers of a PaSST audio encoder, apply attention-based segment pooling, and replace CLS token extraction in RoBERTa with masked mean pooling. Additionally, we introduce time-frequency spectrogram augmentation and reduce the hop size to capture more segment detail. Our improved system achieves a mAP@10 of 36.005 on the ClothoV2 test set, outperforming the official DCASE 2025 baseline without relying on external caption generation or model ensembles. The mAP@16 result requested this year is 36.661 (without the new annotations). All code and trained models are available on GitHub.
AISTAT LAB SYSTEM FOR DCASE 2025 TASK 6: LANGUAGE-BASED AUDIO RETRIEVAL
Hyun Jun Kim, Hyeong Yong Choi, Kyuwon Choi, Eunsin Choi, Changwon Lim
Chung-Ang University, Seoul, Korea
Kim_AISTAT_task6_1 Kim_AISTAT_task6_2 Kim_AISTAT_task6_3 Kim_AISTAT_task6_4
Abstract
This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs a dual-encoder architecture, where audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing inspiration from methodologies of the previous year's challenge, we implemented a distillation approach and leveraged large language models (LLMs) for effective data augmentation techniques, including back-translation and LLM mix. Additionally, we incorporated pseudo-labeling to introduce a classification auxiliary task for further fine-tuning. Our best single system achieved a mAP@16 of 46.5, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
Enhanced Audio-Text Retrieval with Multi-Positive Learning and Hard Negative Mining
Saubhagya Pandey, Khushal Wadhwa, Ayush Goyal, Kanak Khandelwal
Indian Institute of Technology Kanpur, India
Pandey_IITK_task6_1
Abstract
This paper presents an enhanced implementation of audio-text cross-modal retrieval for DCASE 2025 Task 6, featuring advanced contrastive learning techniques. Our system implements a dual-encoder architecture using PaSST (Patch-out Fast Spectrogram Transformer) for audio encoding and RoBERTa-large for text encoding, enhanced with multi-positive learning and hard negative mining strategies. The proposed method introduces progressive training with staged activation of advanced techniques, achieving significant performance improvements over baseline approaches. Experimental evaluation on the Clotho dataset demonstrates competitive retrieval performance with R@1 of 18.68%, R@5 of 44.77%, R@10 of 59.35%, and mAP@10 of 30.01%. The implementation supports mixed precision training and comprehensive evaluation metrics for robust cross-modal retrieval.