Language-Based Audio Retrieval


Challenge results

Task description

Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve audio files from a given dataset and sort them by how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
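In practice, the dominant approach to this task (used by the baseline and by every submission below) is a dual-encoder model that embeds captions and audio clips in a shared space and ranks clips by their similarity to the query. The following is a minimal sketch of this retrieval step; `audio_encoder` and `text_encoder` are hypothetical stand-ins for any such pair of models:

```python
import numpy as np

def retrieve(query, audio_files, audio_encoder, text_encoder, top_k=16):
    """Rank audio files by cosine similarity to a text query."""
    q = text_encoder(query)                                 # query embedding, shape (d,)
    q = q / np.linalg.norm(q)
    A = np.stack([audio_encoder(a) for a in audio_files])   # clip embeddings, shape (n, d)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    scores = A @ q                                          # cosine similarities, shape (n,)
    ranking = np.argsort(-scores)[:top_k]                   # best-matching files first
    return [(audio_files[i], float(scores[i])) for i in ranking]
```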

Clotho v2 is provided as the development dataset and includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing. Additionally, participants may use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound.

As a novelty this year, we introduced additional relevance annotations for the evaluation datasets. While previously only one audio file was considered relevant for each text query, this year we provide multiple relevant audio files for each text query.

A more detailed task description can be found on the task description page.

A few statistics about the evaluation sets:

In the development-eval set, each caption has on average 2.8 relevant audio files, and 46.56% of the original matching audio files remain linked to their queries.

The evaluation set has 597 queries, with on average 1.8 relevant audio files per query and 43.21% of the original true positives kept.
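Under these multi-relevance annotations, systems are scored with mAP@K (mean average precision at cutoff K, here K = 16 or 10) and recall at K. The sketch below shows one common way to compute mAP@K with several relevant files per query; the challenge's exact normalisation convention may differ slightly:

```python
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k=16):
    """AP@k for one query that may have several relevant audio files.

    Normalises by min(#relevant, k), a common convention; the exact
    normalisation used by the challenge may differ.
    """
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def mean_ap_at_k(all_rankings, all_relevant, k=16):
    """mAP@k averaged over all queries."""
    return float(np.mean([average_precision_at_k(r, rel, k)
                          for r, rel in zip(all_rankings, all_relevant)]))
```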

Teams ranking

Listed here are the best systems from all teams. The ranking is based on the mAP@16 metric achieved on the evaluation dataset. For a more detailed view of system performance, the same table lists the values achieved for all metrics employed in the task, on both the evaluation dataset and the development-testing split.

In the tables below, Rank is the best official system rank, "Eval." columns are computed on the evaluation dataset, and "Dev." columns on the development-testing split.

| Rank | Submission code | Corresponding author | Technical report | Eval. mAP@16 | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev. mAP@16 | Dev. mAP@10 | Dev. R@1 | Dev. R@5 | Dev. R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kim_AISTAT_task6_3 | Changwon Lim | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| 5 | Calvet_AUDIAS_task6_4 | Oscar Calvet | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| 8 | baseline | Paul Primus | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| 10 | Cai_NCUT_task6_1 | Xichang Cai | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| 11 | Filomeno_JKU_task6_1 | Giovanni Filomeno | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| 12 | Pandey_IITK_task6_1 | Saubhagya Pandey | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |

Systems ranking

Listed here are all systems and their ranking according to the different metrics. Detailed information on each system is given in the next section.

Columns are as in the teams ranking table above.

| Rank | Submission code | Technical report | Eval. mAP@16 | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev. mAP@16 | Dev. mAP@10 | Dev. R@1 | Dev. R@5 | Dev. R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kim_AISTAT_task6_3 | Kim2025_t6 | 0.421 | 0.401 | 0.290 | 0.551 | 0.669 | 0.488 | 0.417 | 0.285 | 0.599 | 0.724 |
| 2 | Kim_AISTAT_task6_2 | Kim2025_t6 | 0.414 | 0.399 | 0.288 | 0.549 | 0.665 | 0.488 | 0.416 | 0.283 | 0.599 | 0.722 |
| 3 | Kim_AISTAT_task6_4 | Kim2025_t6 | 0.412 | 0.398 | 0.286 | 0.552 | 0.669 | 0.488 | 0.417 | 0.284 | 0.600 | 0.725 |
| 4 | Kim_AISTAT_task6_1 | Kim2025_t6 | 0.411 | 0.397 | 0.286 | 0.549 | 0.661 | 0.488 | 0.416 | 0.283 | 0.597 | 0.721 |
| 5 | Calvet_AUDIAS_task6_4 | Calvet2025_t6 | 0.345 | 0.379 | 0.270 | 0.526 | 0.658 | 0.469 | 0.404 | 0.277 | 0.582 | 0.717 |
| 6 | Calvet_AUDIAS_task6_2 | Calvet2025_t6 | 0.344 | 0.348 | 0.229 | 0.510 | 0.638 | 0.439 | 0.375 | 0.248 | 0.549 | 0.686 |
| 7 | Calvet_AUDIAS_task6_1 | Calvet2025_t6 | 0.339 | 0.359 | 0.251 | 0.510 | 0.627 | 0.442 | 0.383 | 0.253 | 0.560 | 0.693 |
| 8 | baseline | Primus2025_t6_baseline | 0.330 | 0.337 | 0.229 | 0.482 | 0.624 | 0.406 | 0.352 | 0.233 | 0.522 | 0.648 |
| 9 | Calvet_AUDIAS_task6_3 | Calvet2025_t6 | 0.324 | 0.357 | 0.249 | 0.500 | 0.636 | 0.425 | 0.379 | 0.258 | 0.544 | 0.676 |
| 10 | Cai_NCUT_task6_1 | Cai2025_t6 | 0.320 | 0.293 | 0.186 | 0.447 | 0.567 | 0.372 | 0.328 | 0.198 | 0.474 | 0.605 |
| 11 | Filomeno_JKU_task6_1 | Filomeno2025_t6 | 0.302 | 0.342 | 0.231 | 0.473 | 0.619 | 0.367 | 0.360 | 0.241 | 0.518 | 0.653 |
| 12 | Pandey_IITK_task6_1 | Pandey2025_t6 | 0.301 | 0.270 | 0.163 | 0.410 | 0.549 | 0.347 | 0.300 | 0.187 | 0.448 | 0.594 |

System characteristics

This section presents the characteristics of the submitted systems in two tables, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed description of each system.

Overview of characteristics

| Rank | Submission code | mAP@16 | Technical report | Parameters | Audio modelling | Text modelling | Loss function |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | RoBERTa-large | InfoNCE |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11,350,187,000 | PaSST | Sentence-BERT | NT-Xent |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452,000,000 | PaSST | Sentence-BERT | NT-Xent |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446,187,000 | PaSST | Sentence-BERT | NT-Xent |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732,354 | PaSST | RoBERTa-large | InfoNCE |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452,000,000 | PaSST | Sentence-BERT | NT-Xent |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732,354 | PaSST | RoBERTa-large | InfoNCE |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732,354 | PaSST | RoBERTa-large | contrastive InfoNCE |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442,000,000 | PaSST | RoBERTa-large | Enhanced InfoNCE with multi-positive learning and hard negative mining |
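As the table shows, nearly all systems optimise a contrastive objective, either InfoNCE or NT-Xent (the same objective up to minor conventions). A minimal PyTorch sketch of the symmetric batch-wise InfoNCE loss over matched audio-caption pairs:

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of matched audio-caption pairs.

    audio_emb, text_emb: (B, d) embeddings where the i-th audio matches
    the i-th caption; all other in-batch pairs act as negatives.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device) # matching pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)        # audio -> text direction
    loss_t2a = F.cross_entropy(logits.T, targets)      # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
```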



Detailed characteristics

| Rank | Submission code | mAP@16 | Technical report | Parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Training dataset(s) | Validation dataset(s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kim_AISTAT_task6_3 | 0.421 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16 kHz, 32 kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 2 | Kim_AISTAT_task6_2 | 0.414 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16 kHz, 32 kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 3 | Kim_AISTAT_task6_4 | 0.412 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16 kHz, 32 kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 4 | Kim_AISTAT_task6_1 | 0.411 | Kim2025_t6 | 2,697,000,000 | PaSST, EAT, BEATs | log-mel spectrogram | RoBERTa-large | LLM mix | random deletion, synonym replacement, back-translation, LLM mix | 16 kHz, 32 kHz | InfoNCE | AdamW | mAP@10 | Clotho-development | Clotho-validation |
| 5 | Calvet_AUDIAS_task6_4 | 0.345 | Calvet2025_t6 | 11,350,187,000 | PaSST | log-mel energies | Sentence-BERT | – | – | 32 kHz | NT-Xent | Adam | validation loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 6 | Calvet_AUDIAS_task6_2 | 0.344 | Calvet2025_t6 | 452,000,000 | PaSST | log-mel energies | Sentence-BERT | – | – | 32 kHz | NT-Xent | Adam | validation loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 7 | Calvet_AUDIAS_task6_1 | 0.339 | Calvet2025_t6 | 446,187,000 | PaSST | log-mel energies | Sentence-BERT | – | – | 32 kHz | NT-Xent | Adam | validation loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 8 | baseline | 0.330 | Primus2025_t6_baseline | 732,354 | PaSST | log-mel energies | RoBERTa-large | – | – | 32 kHz | InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 9 | Calvet_AUDIAS_task6_3 | 0.324 | Calvet2025_t6 | 452,000,000 | PaSST | log-mel energies | Sentence-BERT | – | – | 32 kHz | NT-Xent | Adam | validation loss | Clotho-development, AudioCaps, WavCaps, TACOS | Clotho-validation |
| 10 | Cai_NCUT_task6_1 | 0.320 | Cai2025_t6 | 732,354 | PaSST | log-mel energies | RoBERTa-large | mixup | – | 44.1 kHz | InfoNCE | Adam | validation loss | Clotho-development, AudioCaps | Clotho-validation |
| 11 | Filomeno_JKU_task6_1 | 0.302 | Filomeno2025_t6 | 732,354 | PaSST | log-mel energies | RoBERTa-large | time-frequency masking | – | 32 kHz | contrastive InfoNCE | AdamW | mAP@16 | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 12 | Pandey_IITK_task6_1 | 0.301 | Pandey2025_t6 | 442,000,000 | PaSST | log-mel energies | RoBERTa-large | – | – | 32 kHz | Enhanced InfoNCE with multi-positive learning and hard negative mining | AdamW | validation loss | Clotho-development | Clotho-validation |



Technical reports

Dual-Encoder Audio Retrieval with PaSST and RoBERTa: A Contrastive and Distillation Approach

Xichang Cai, Weijie Luo
North China University of Technology, Beijing, China

Abstract

This technical report describes the Cai_NCUT team's submissions to the language-based audio retrieval task of the 2025 DCASE Challenge (Task 6). Our systems are built upon the dual-encoder architecture, mapping audio clips and textual queries into a joint embedding space using pretrained backbones. In this submission, we explore the use of the recently proposed Patchout faSt Spectrogram Transformer (PaSST) as the audio encoder and RoBERTa as the text encoder. We introduce a two-stage training pipeline involving contrastive learning and a self-distillation phase to leverage cross-modal soft alignments. Our best single system, based on PaSST and RoBERTa-large, achieves a mAP@16 of 0.32 on the ClothoV2 test split.

PDF
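The abstract mentions a self-distillation phase over cross-modal soft alignments but does not give details. One plausible reading, sketched below purely as an illustration, is that the similarity distribution of a frozen stage-one teacher provides soft targets for the student; `tau` is a hypothetical softening temperature:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Soft cross-modal alignment distillation (a hedged sketch).

    student_logits, teacher_logits: (B, B) audio-text similarity matrices;
    the teacher's softened similarities serve as soft targets.
    """
    soft_targets = F.softmax(teacher_logits / tau, dim=-1)
    log_probs = F.log_softmax(student_logits / tau, dim=-1)
    # tau**2 rescales gradients to the usual magnitude, as in standard distillation
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * tau ** 2
```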

A CROSS-MODAL ATTENTION APPROACH TO LANGUAGE-BASED AUDIO RETRIEVAL

Oscar Calvet, Doroteo Torre
Escuela Politécnica Superior, Madrid, Spain

Abstract

This report presents the systems we developed for the 2025 DCASE Language-Based Audio Retrieval challenge (Task 6). We use a bi-encoder architecture and propose a novel cross-modal attention approach to calculate the similarity between the embeddings produced by both models. We make use of pretrained encoders in both modalities: PaSST is used for encoding audio and RoBERTa for encoding text. We trained our system on WavCaps, AudioCaps, ClothoV2, and TACOS using contrastive learning. The best single system we were able to produce reaches a mAP@10 of 38.293 on the ClothoV2 test split and a mAP@16 of 44.203 using the task-specific improved annotations. An ensemble of the presented models achieves a mAP@10 of 40.423 on the ClothoV2 test split and a mAP@16 of 46.864 using the task-specific improved annotations.

PDF
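The report's main novelty is scoring query-clip pairs with cross-modal attention instead of a single dot product between pooled embeddings. The sketch below shows one way such a scorer can be built from token- and frame-level embeddings; it is an assumption for illustration, and the mechanism in the report may differ:

```python
import torch
import torch.nn.functional as F

def cross_attention_similarity(text_tokens, audio_frames):
    """Cross-modal attention scoring (a hedged sketch).

    text_tokens:  (T, d) token embeddings of one caption
    audio_frames: (S, d) frame embeddings of one audio clip
    Each text token attends over the audio frames; the clip-query score is
    the mean similarity between tokens and their attended audio summaries.
    """
    t = F.normalize(text_tokens, dim=-1)
    a = F.normalize(audio_frames, dim=-1)
    attn = F.softmax(t @ a.T / t.size(-1) ** 0.5, dim=-1)  # (T, S) attention weights
    attended = attn @ a                                    # (T, d) audio summary per token
    return F.cosine_similarity(t, attended, dim=-1).mean()
```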

ENHANCING LANGUAGE-BASED AUDIO RETRIEVAL WITH PARTIAL FINE-TUNING AND ATTENTION POOLING

Giovanni Filomeno, Youssef Kandah, Florian Spiessberger
Johannes Kepler University, Linz, Austria

Abstract

This technical report describes our submission to the language-based audio retrieval task of the DCASE 2025 Challenge (Task 6). Building upon our previous work, we retain the dual-encoder architecture that projects audio recordings and textual descriptions into a shared embedding space. This year we focus on architectural and training-level refinements within a single model framework. Specifically, we fine-tune only the upper transformer layers of a PaSST audio encoder, apply attention-based segment pooling, and replace CLS token extraction in RoBERTa with masked mean pooling. Additionally, we introduce time-frequency spectrogram augmentation and reduce the hop size to capture more segment detail. Our improved system achieves a mAP@10 of 36.005 on the ClothoV2 test set, outperforming the official DCASE 2025 baseline without relying on external caption generation or model ensembles. The result for mAP@16 as requested this year is 36.661 (without new annotations). All code and trained models are available on GitHub.

PDF
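Masked mean pooling, which this report uses in place of RoBERTa's CLS token, averages the token states over real (non-padding) tokens only. A minimal sketch:

```python
import torch

def masked_mean_pooling(token_states, attention_mask):
    """Mean-pool transformer token states, ignoring padding positions.

    token_states:   (B, T, d) last hidden states
    attention_mask: (B, T) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (B, T, 1)
    summed = (token_states * mask).sum(dim=1)     # (B, d) sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts
```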

AISTAT LAB SYSTEM FOR DCASE 2025 TASK 6: LANGUAGE-BASED AUDIO RETRIEVAL

Hyun Jun Kim, Hyeong Yong Choi, Kyuwon Choi, Eunsin Choi, Changwon Lim
Chung-Ang University, Seoul, Korea

Abstract

This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs a dual-encoder architecture, where audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing inspiration from methodologies of the previous year's challenge, we implemented a distillation approach and leveraged large language models (LLMs) for effective data augmentation techniques, including back-translation and LLM mix. Additionally, we incorporated pseudo-labeling to introduce a classification auxiliary task for further fine-tuning. Our best single system achieved a mAP@16 of 46.5, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.

PDF
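The abstract does not spell out how the four systems are ensembled. A common and simple fusion rule, shown here only as an illustration rather than the team's actual method, is to average the systems' query-audio similarity matrices before re-ranking:

```python
import numpy as np

def ensemble_rankings(similarity_matrices, top_k=16):
    """Score-level ensemble of retrieval systems (illustrative sketch).

    similarity_matrices: list of (Q, N) arrays, one per system, holding
    query-audio similarity scores on the same query and audio sets.
    Returns per-query indices of the top-k audio files under the average.
    """
    avg = np.mean(np.stack(similarity_matrices), axis=0)  # (Q, N) averaged scores
    return np.argsort(-avg, axis=1)[:, :top_k]            # top-k per query
```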

Enhanced Audio-Text Retrieval with Multi-Positive Learning and Hard Negative Mining

Saubhagya Pandey, Khushal Wadhwa, Ayush Goyal, Kanak Khandelwal
Indian Institute of Technology Kanpur, India

Abstract

This paper presents an enhanced implementation of audio-text cross-modal retrieval for DCASE 2025 Task 6, featuring advanced contrastive learning techniques. Our system implements a dual-encoder architecture using PaSST (Patch-out Fast Spectrogram Transformer) for audio encoding and RoBERTa-large for text encoding, enhanced with multi-positive learning and hard negative mining strategies. The proposed method introduces progressive training with staged activation of advanced techniques, achieving significant performance improvements over baseline approaches. Experimental evaluation on the Clotho dataset demonstrates competitive retrieval performance with R@1 of 18.68%, R@5 of 44.77%, R@10 of 59.35%, and mAP@10 of 30.01%. The implementation supports mixed precision training and comprehensive evaluation metrics for robust cross-modal retrieval.

PDF
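The two ingredients named in the title can be illustrated as a modified batch loss: spread the cross-entropy target over all audios that match a caption (multi-positive learning) and keep only the highest-scoring negatives (hard negative mining). This is a hedged sketch under those assumptions; the report's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(logits, positive_mask, hard_k=8):
    """InfoNCE variant with multiple positives and hard negatives (sketch).

    logits:        (B, N) caption-to-audio similarity scores
    positive_mask: (B, N) bool, True where caption i matches audio j;
                   each row is assumed to contain at least one positive
    hard_k:        number of top-scoring negatives kept per caption
    """
    # Mask positives out, then pick the hard_k highest-scoring negatives
    neg_logits = logits.masked_fill(positive_mask, float("-inf"))
    hard_idx = neg_logits.topk(hard_k, dim=-1).indices
    keep = positive_mask.clone()
    keep.scatter_(1, hard_idx, True)                 # positives + hard negatives
    masked_logits = logits.masked_fill(~keep, float("-inf"))
    # Spread the target mass uniformly over all positives of each caption
    log_probs = F.log_softmax(masked_logits, dim=-1)
    targets = positive_mask.float() / positive_mask.sum(dim=-1, keepdim=True)
    # Zero out -inf log-probs at non-positive positions before weighting
    return -(targets * log_probs.masked_fill(~positive_mask, 0.0)).sum(dim=-1).mean()
```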