Language-Based Audio Retrieval


Challenge results

Task description

Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
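As a rough illustration of the retrieval step, the sketch below ranks audio files for one text query by cosine similarity and keeps the best 10. It assumes that query and audio files have already been mapped to fixed-size embedding vectors by some encoders; the function name retrieve_top_k and the toy vectors are hypothetical and not part of the task definition.

```python
import numpy as np

def retrieve_top_k(query_emb, audio_embs, audio_ids, k=10):
    """Return the k (audio_id, score) pairs most similar to the query, best first."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q
    # Sort by descending score and keep the first k entries.
    order = np.argsort(-scores, kind="stable")[:k]
    return [(audio_ids[i], float(scores[i])) for i in order]

# Toy data (hypothetical): three audio files in a 2-D embedding space.
audio_ids = ["rain.wav", "dog.wav", "car.wav"]
audio_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_emb = np.array([0.9, 0.1])

ranked = retrieve_top_k(query_emb, audio_embs, audio_ids, k=2)
print(ranked)
```

In the task itself k is 10 and the candidate set is the whole evaluation dataset; how the embeddings are produced is left to each system.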

Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model or additional audio data such as AudioSet or Freesound.

A more detailed task description can be found on the task description page.

Teams ranking

Listed here is the best system from each team. The ranking is based on the achieved mAP@10 metric. To allow a more detailed comparison of the systems, the same table lists the values achieved for all the metrics employed in the task. The metric values are for the development-testing split and the evaluation dataset.
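For readers unfamiliar with the metrics, the following sketch shows one way mAP@10 and R@k could be computed from ranked result lists. It assumes that every query has exactly one relevant audio file, in which case average precision at 10 reduces to the reciprocal rank of that file within the top 10 (or 0 if it is missing). The function names and toy data are hypothetical, and this is not the official evaluation script.

```python
def average_precision_at_10(ranked_ids, relevant_id):
    """AP@10 for a query with a single relevant item: 1/rank if it is in the top 10, else 0."""
    top10 = ranked_ids[:10]
    if relevant_id not in top10:
        return 0.0
    return 1.0 / (top10.index(relevant_id) + 1)

def recall_at_k(ranked_ids, relevant_id, k):
    """R@k for a query with a single relevant item: 1 if it is in the top k, else 0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def evaluate(results):
    """Average the per-query metrics over a list of (ranked_ids, relevant_id) pairs."""
    n = len(results)
    scores = {"mAP@10": sum(average_precision_at_10(r, t) for r, t in results) / n}
    for k in (1, 5, 10):
        scores[f"R@{k}"] = sum(recall_at_k(r, t, k) for r, t in results) / n
    return scores

# Toy example (hypothetical): the same ranked list is reused for three queries.
files = [f"clip_{i}.wav" for i in range(12)]
results = [
    (files, "clip_0.wav"),   # relevant file at rank 1
    (files, "clip_3.wav"),   # relevant file at rank 4
    (files, "clip_11.wav"),  # relevant file at rank 12, outside the top 10
]
scores = evaluate(results)
print(scores)
```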

Submission code | Best official system rank | Corresponding author | Technical Report | Evaluation mAP@10 | Evaluation R@1 | Evaluation R@5 | Evaluation R@10 | Dev-testing mAP@10 | Dev-testing R@1 | Dev-testing R@5 | Dev-testing R@10
Primus_CP-JKU_8_1 | 1 | Paul Primus | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719
kulik_SRPOL_task8_4 | 2 | Jan Kulik | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733
Chen_SRCN_task8_1 | 3 | Minjun Chen | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705
Munakata_LYVA_1 | 5 | Hokuto Munakata | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728
Kim_MAUM_task8_2 | 13 | Jaeyeon Kim | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676
Cai_NCUT_task8_2 | 17 | Xichang Cai | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577
xie_tau_task8_1 | 19 | Huang Xie | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480

Systems ranking

Listed here are all systems and their ranking according to the different metrics. Detailed information on each system is given in the next section.

Submission code | Best official system rank | Technical Report | Evaluation mAP@10 | Evaluation R@1 | Evaluation R@5 | Evaluation R@10 | Dev-testing mAP@10 | Dev-testing R@1 | Dev-testing R@5 | Dev-testing R@10
Primus_CP-JKU_8_1 | 1 | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719
kulik_SRPOL_task8_4 | 2 | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733
Chen_SRCN_task8_1 | 3 | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705
Primus_CP-JKU_8_4 | 4 | Primus2024_t8 | 0.389 | 0.275 | 0.545 | 0.664 | 0.389 | 0.268 | 0.549 | 0.688
Munakata_LYVA_1 | 5 | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728
Munakata_LYVA_2 | 6 | Munakata2024_t8 | 0.386 | 0.277 | 0.531 | 0.656 | 0.423 | 0.292 | 0.598 | 0.727
kulik_SRPOL_task8_3 | 7 | Kulik2024_t8 | 0.386 | 0.269 | 0.544 | 0.661 | 0.426 | 0.301 | 0.597 | 0.731
kulik_SRPOL_task8_2 | 8 | Kulik2024_t8 | 0.384 | 0.267 | 0.546 | 0.659 | 0.426 | 0.302 | 0.592 | 0.731
Primus_CP-JKU_8_1 | 9 | Primus2024_t8 | 0.378 | 0.266 | 0.539 | 0.648 | 0.398 | 0.271 | 0.571 | 0.699
Primus_CP-JKU_8_3 | 10 | Primus2024_t8 | 0.373 | 0.265 | 0.524 | 0.654 | 0.377 | 0.252 | 0.547 | 0.680
kulik_SRPOL_task8_1 | 11 | Kulik2024_t8 | 0.369 | 0.250 | 0.531 | 0.646 | 0.408 | 0.287 | 0.574 | 0.709
Chen_SRCN_task8_2 | 12 | Chen2024_t8 | 0.364 | 0.254 | 0.521 | 0.627 | 0.370 | 0.244 | 0.534 | 0.662
Kim_MAUM_task8_2 | 13 | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676
Kim_MAUM_task8_3 | 14 | Kim2024_t8 | 0.362 | 0.246 | 0.516 | 0.643 | 0.386 | 0.267 | 0.547 | 0.680
Kim_MAUM_task8_4 | 15 | Kim2024_t8 | 0.359 | 0.254 | 0.510 | 0.633 | 0.378 | 0.257 | 0.543 | 0.676
Kim_MAUM_task8_1 | 16 | Kim2024_t8 | 0.350 | 0.236 | 0.499 | 0.630 | 0.375 | 0.256 | 0.535 | 0.669
Cai_NCUT_task8_2 | 17 | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577
Cai_NCUT_task8_1 | 18 | Cai2024_t8 | 0.255 | 0.159 | 0.383 | 0.520 | 0.292 | 0.180 | 0.440 | 0.576
xie_tau_task8_1 | 19 | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480

System characteristics

In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections. The first table has an overview of the systems and the second has a detailed presentation of each system.

Overview of characteristics

Rank | Submission code | mAP@10 | Technical Report | Amount of parameters | Audio modelling | Text modelling | Loss function
1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | BERT, RoBERTa | NT-Xent loss
2 | kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss
3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | BERT | Contrastive loss
4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | RoBERTa | NT-Xent loss
5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | RoBERTa | InfoNCE loss
6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | RoBERTa | InfoNCE loss
7 | kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss
8 | kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss
9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | RoBERTa | NT-Xent loss
10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | RoBERTa | NT-Xent loss
11 | kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | GTE-large | InfoNCE loss
12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | BERT | Contrastive loss
13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM
14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM
15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM
16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | RoBERTa | m-LTM
17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss
18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss
19 | xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | Sentence-BERT | InfoNCE loss



Detailed characteristics

Rank | Submission code | mAP@10 | Technical Report | Amount of parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Dataset(s) used for training | Dataset(s) used for validation
1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | log-mel energies | BERT, RoBERTa | patchout, frequency warping | synonym replacement, random deletions | 32.0kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
2 | kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation
3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation
4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | log-mel energies | RoBERTa | frequency warping | synonym replacement, random deletions | 32.0kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | log-mel energies | RoBERTa | SpecAugment, Patchout, Mix-up Contrast | Text token masking, GPT augmentation | 32kHz, 16kHz | InfoNCE loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation
6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | log-mel energies | RoBERTa | SpecAugment, Patchout, Mix-up Contrast | Text token masking, GPT augmentation | 32kHz, 16kHz | InfoNCE loss | Adam, AdamW | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation
7 | kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation
8 | kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation
9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | log-mel energies | RoBERTa | patchout | synonym replacement, random deletions | 32.0kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | log-mel energies | RoBERTa | None | synonym replacement, random deletions | 32.0kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
11 | kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | log-mel energies | GTE-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation
12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation
13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | | 32kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | | 32kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | | 32kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | log-mel energies | RoBERTa | SpecAugment | | 32kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation
17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | Mixup | ChatGPT | 44.1kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation
18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | Mixup | ChatGPT | 44.1kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation
19 | xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | log-mel energies | Sentence-BERT | | | 44.1kHz | InfoNCE loss | Adam | loss | Clotho-development | Clotho-validation



Technical reports

ENSEMBLE SYSTEMS WITH PRETRAINED DUAL-ENCODERS FOR LANGUAGE-BASED AUDIO RETRIEVAL

Jiafeng Li, Xichang Cai, Shenghao Liu, Liangxiao Zuo, Menglong Wu
North China University of Technology, Beijing, China

Abstract

This article presents our system developed for Task 8 of the DCASE2024 Challenge, which focuses on audio retrieval using natural language queries. Our submission incorporates a retrieval system that integrates a frozen pre-trained audio encoder and RoBERTa as a text encoder. We adopted a methodology similar to the CLAP framework, training our model using paired data from the AudioCaps and Clotho datasets. Our best-performing system achieved a mean Average Precision (mAP) of 29.6% and a Recall at 1 (R@1) of 18.6% on the Clotho evaluation set.

DCASE 2024 CHALLENGE TASK 8 TECHNICAL REPORT

Minjun Chen, Yangyang Liu, Bo Peng, Jie Chen
Samsung Research China-Nanjing, Nanjing, China

Abstract

In this technical report, we describe our submitted systems for DCASE2024 Task 8: Language-based Audio Retrieval. Our proposed system focuses on jointly training the audio encoder and the text encoder to obtain expressive audio and text representations, which helps distinguish different audio clips and texts more efficiently. We use the pre-trained audio and text encoders of VAST, which were trained on the large multi-modality dataset VAST27M. We further train these encoders on several audio caption datasets, including AudioCaps, WavCaps, FSD50K, Laion630k, and ClothoV2, with three learning objectives: besides the audio-text contrastive objective, we also use audio-text matching and masked language modeling objectives to strengthen the training procedure. We use mix-up as the data augmentation policy during pre-training. Our proposed systems achieve 0.37 mAP@10 and 0.244 R@1; with model ensembling, our systems achieve 0.406 mAP@10 and 0.278 R@1 on the ClothoV2 evaluation set.

EXPANDING ON ENCLAP WITH AUXILIARY RETRIEVAL MODEL FOR AUTOMATED AUDIO CAPTIONING

Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee
Seoul National University, MAUM AI Inc., Soongsil University, Independent Researcher

Abstract

In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve FENSE score of 0.542 on Task6 and mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.

TAKE IT FOR GRANTED: IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL WITH LARGE LANGUAGE MODELS

Jan Kulik, Bartlomiej Zgorzynski, Juliusz Kruk, Ivan Ryzhankow, Anna Ples, Theodore Lamort de Gail
Samsung R&D Institute Poland, Warsaw, Poland

Abstract

In this report, we present our solution to DCASE 2024 task 8: Language-Based Audio Retrieval. We employ a bi-encoder architecture trained using InfoNCE loss. The audio encoder is a pretrained PaSST-S model, while the text encoder is either a pre-trained GTE-large or RoBERTa-large model. In order to increase the amount of training data, we obtain 10.8 million video-caption pairs from various open-source datasets. We then extract useful audio-caption pairs and evaluate them using our model to filter out low-quality samples. Finally, we use GPT-4o to rephrase the video captions to make them more audio-oriented. In addition, we use GPT-4o for back-translation and GPT-3.5-turbo for Clotho caption mixing. We achieve 43.69% mAP@10 on the development-testing split of Clotho using an ensemble solution, and 40.78% mAP@10 with a single model.
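To make the training objective mentioned in this abstract more concrete, here is a minimal NumPy sketch of an InfoNCE loss over a batch of paired audio and text embeddings. The symmetric (audio-to-text plus text-to-audio) form, the temperature value, and the function names are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(audio_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE: row i of audio_emb is paired with row i of text_emb."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature  # cosine similarities scaled by temperature
    idx = np.arange(len(a))
    # Audio-to-text: each audio clip should pick its own caption among all captions.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each caption should pick its own clip among all clips.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)

# Toy check (hypothetical data): correctly paired embeddings give a lower loss
# than the same embeddings with the pairing reversed.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = info_nce(emb, emb)
misaligned = info_nce(emb, emb[::-1])
print(aligned, misaligned)
```

All other items in the batch act as negatives, so larger batches provide more negatives per positive pair.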

TRAINING STRATEGY OF MASSIVE TEXT-TO-AUDIO MODELS AND GPT-BASED QUERY-AUGMENTATION

Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
LY Corporation, Japan

Abstract

This report describes our system submitted to the DCASE 2024 Task 8: Language-based Audio Retrieval. We adopted a conventional language-based audio retrieval approach, leveraging a joint embedding space for the audio and text encoders trained through contrastive learning. We compared and utilized several state-of-the-art models for the audio encoder, including PaSST, BEATs, VAST, and CAV-MAE. We also employed various datasets with text-audio pairs for training, such as AudioCaps, WavCaps, Auto-ACD, and MACS. Additionally, we incorporated advanced training techniques such as Mixco and text token masking. During inference, we devised an ensemble method based on queries augmented by ChatGPT. Our final results achieved 39.65 points with a single model and 42.26 points with the ensemble of multiple models in the mean average precision among the top 10 results on the evaluation split of Clotho-V2. Compared with the champion system of the DCASE 2023 Challenge, our model outperformed it by 1.09 points for the single model and 0.84 points for the ensemble of multiple models, respectively.

A KNOWLEDGE DISTILLATION APPROACH TO IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL MODELS

Paul Primus, Gerhard Widmer
Institute of Computational Perception (CP-JKU), LIT Artificial Intelligence Lab, Johannes Kepler University, Austria

Abstract

This technical report describes the CP-JKU team’s submissions to the language-based audio retrieval task of the 2024 DCASE Challenge (Task 8). All our submitted systems are based on the dual encoder architecture that projects recordings and textual descriptions into a shared audio-caption space in which related examples from the two modalities are similar. We utilized pretrained audio and text embedding models and trained them on audio-caption datasets (WavCaps, AudioCaps, and ClothoV2) via contrastive learning. We further fine-tuned the resulting models on ClothoV2 via knowledge distillation from a large ensemble of audio retrieval models. Our best single system submission, based on PaSST and RoBERTa, achieves an mAP@10 of 39.77 on the ClothoV2 test split, outperforming last year’s best single system submission by around 1 pp without utilizing metadata and synthetic captions. An ensemble of three distilled models achieves 41.91 mAP@10 on the ClothoV2 test split.
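The distillation step described in this abstract could look roughly like the sketch below, in which a student's caption-to-audio similarity matrix is trained to match the averaged similarities of an ensemble of teachers. The abstract does not give the exact loss, so the KL-divergence objective, the averaging of teacher similarities, the temperature, and all names here are assumptions for illustration only.

```python
import numpy as np

def log_softmax(x, axis=1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_sim, teacher_sims, temperature=0.05):
    """Mean KL(teacher || student) over per-query distributions across candidates.

    student_sim:  (n_queries, n_audio) similarity matrix of the student.
    teacher_sims: (n_teachers, n_queries, n_audio) similarities of the ensemble.
    """
    # Assumption: the ensemble target is the mean of the teachers' similarities.
    teacher_logp = log_softmax(teacher_sims.mean(axis=0) / temperature)
    student_logp = log_softmax(student_sim / temperature)
    kl = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=1)
    return kl.mean()

# Toy check (hypothetical data): the loss vanishes when the student reproduces
# the ensemble average and is positive when it disagrees.
rng = np.random.default_rng(0)
teacher_sims = rng.uniform(-1.0, 1.0, size=(3, 4, 4))
perfect_student = teacher_sims.mean(axis=0)
loss_match = distillation_loss(perfect_student, teacher_sims)
loss_mismatch = distillation_loss(-perfect_student, teacher_sims)
print(loss_match, loss_mismatch)
```

Compared with a purely contrastive loss, soft targets of this kind also tell the student how similar the non-matching items are, rather than treating them all as equally wrong.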
