Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and rank them by how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
Clotho v2 is provided as the development dataset and includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing, such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound.
A more detailed task description can be found on the task description page.
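Most submitted systems treat this as a cross-modal retrieval problem: a text encoder embeds the query and an audio encoder embeds the candidate recordings into a shared space, and the 10 most similar recordings are returned. Below is a minimal sketch of the retrieval step only; the embeddings are assumed to come from pre-trained encoders (e.g., PaSST for audio, RoBERTa for text), which are placeholders here rather than any particular team's models.

```python
# Retrieval step of a dual-encoder system (sketch). Embeddings are
# assumed to be L2-normalized, so the dot product is cosine similarity.
import numpy as np

def retrieve_top10(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """query_emb: (d,) text embedding; audio_embs: (n, d) audio embeddings.

    Returns indices of the 10 best-matching audio files, best first.
    """
    scores = audio_embs @ query_emb         # cosine similarities, shape (n,)
    return np.argsort(-scores)[:10]         # rank by descending similarity
```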
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed exploration of system performance, the same table lists the values achieved for all metrics employed in the task, on both the evaluation dataset and the development-testing split. A sketch of how these metrics are computed follows the table.
| Submission code | Best official system rank | Corresponding author | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev.-test mAP@10 | Dev.-test R@1 | Dev.-test R@5 | Dev.-test R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Primus_CP-JKU_8_1 | 1 | Paul Primus | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719 |
| Kulik_SRPOL_task8_4 | 2 | Jan Kulik | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733 |
| Chen_SRCN_task8_1 | 3 | Minjun Chen | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705 |
| Munakata_LYVA_1 | 5 | Hokuto Munakata | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728 |
| Kim_MAUM_task8_2 | 13 | Jaeyeon Kim | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676 |
| Cai_NCUT_task8_2 | 17 | Xichang Cai | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577 |
| Xie_tau_task8_1 | 19 | Huang Xie | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 |
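For reference, a minimal sketch of the metrics above, under the assumption that each caption query has exactly one relevant audio file (as in Clotho-based retrieval evaluation); in that case AP@10 reduces to the reciprocal of the rank at which the relevant file appears.

```python
# mAP@10 and R@k, assuming one relevant audio file per query.

def map_at_10(ranked_lists, relevant):
    """ranked_lists[i]: top-10 audio indices for query i (best first);
    relevant[i]: index of the ground-truth audio for query i."""
    total = 0.0
    for top10, rel in zip(ranked_lists, relevant):
        if rel in top10:
            total += 1.0 / (top10.index(rel) + 1)   # AP = 1 / rank of the hit
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant file appears in the top k."""
    hits = sum(rel in top[:k] for top, rel in zip(ranked_lists, relevant))
    return hits / len(ranked_lists)
```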
Systems ranking
Listed here are all systems and their rankings according to the different metrics. Detailed information on each system is given in the next section.
| Submission code | Rank | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev.-test mAP@10 | Dev.-test R@1 | Dev.-test R@5 | Dev.-test R@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Primus_CP-JKU_8_1 | 1 | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719 |
| Kulik_SRPOL_task8_4 | 2 | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733 |
| Chen_SRCN_task8_1 | 3 | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705 |
| Primus_CP-JKU_8_4 | 4 | Primus2024_t8 | 0.389 | 0.275 | 0.545 | 0.664 | 0.389 | 0.268 | 0.549 | 0.688 |
| Munakata_LYVA_1 | 5 | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728 |
| Munakata_LYVA_2 | 6 | Munakata2024_t8 | 0.386 | 0.277 | 0.531 | 0.656 | 0.423 | 0.292 | 0.598 | 0.727 |
| Kulik_SRPOL_task8_3 | 7 | Kulik2024_t8 | 0.386 | 0.269 | 0.544 | 0.661 | 0.426 | 0.301 | 0.597 | 0.731 |
| Kulik_SRPOL_task8_2 | 8 | Kulik2024_t8 | 0.384 | 0.267 | 0.546 | 0.659 | 0.426 | 0.302 | 0.592 | 0.731 |
| Primus_CP-JKU_8_1 | 9 | Primus2024_t8 | 0.378 | 0.266 | 0.539 | 0.648 | 0.398 | 0.271 | 0.571 | 0.699 |
| Primus_CP-JKU_8_3 | 10 | Primus2024_t8 | 0.373 | 0.265 | 0.524 | 0.654 | 0.377 | 0.252 | 0.547 | 0.680 |
| Kulik_SRPOL_task8_1 | 11 | Kulik2024_t8 | 0.369 | 0.250 | 0.531 | 0.646 | 0.408 | 0.287 | 0.574 | 0.709 |
| Chen_SRCN_task8_2 | 12 | Chen2024_t8 | 0.364 | 0.254 | 0.521 | 0.627 | 0.370 | 0.244 | 0.534 | 0.662 |
| Kim_MAUM_task8_2 | 13 | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676 |
| Kim_MAUM_task8_3 | 14 | Kim2024_t8 | 0.362 | 0.246 | 0.516 | 0.643 | 0.386 | 0.267 | 0.547 | 0.680 |
| Kim_MAUM_task8_4 | 15 | Kim2024_t8 | 0.359 | 0.254 | 0.510 | 0.633 | 0.378 | 0.257 | 0.543 | 0.676 |
| Kim_MAUM_task8_1 | 16 | Kim2024_t8 | 0.350 | 0.236 | 0.499 | 0.630 | 0.375 | 0.256 | 0.535 | 0.669 |
| Cai_NCUT_task8_2 | 17 | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577 |
| Cai_NCUT_task8_1 | 18 | Cai2024_t8 | 0.255 | 0.159 | 0.383 | 0.520 | 0.292 | 0.180 | 0.440 | 0.576 |
| Xie_tau_task8_1 | 19 | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 |
System characteristics
This section presents the characteristics of the submitted systems in two tables, one per subsection: the first gives an overview of the systems, and the second a detailed presentation of each system. A sketch of one common audio augmentation follows the detailed table.
Overview of characteristics
| Rank | Submission code | mAP@10 | Technical Report | Parameters | Audio modelling | Text modelling | Loss function |
|---|---|---|---|---|---|---|---|
| 1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | BERT, RoBERTa | NT-Xent loss |
| 2 | Kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | BERT | Contrastive loss |
| 4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | RoBERTa | NT-Xent loss |
| 5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | RoBERTa | InfoNCE loss |
| 6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | RoBERTa | InfoNCE loss |
| 7 | Kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 8 | Kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | RoBERTa | NT-Xent loss |
| 10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | RoBERTa | NT-Xent loss |
| 11 | Kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | GTE-large | InfoNCE loss |
| 12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | BERT | Contrastive loss |
| 13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | RoBERTa | m-LTM |
| 17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss |
| 18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss |
| 19 | Xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | Sentence-BERT | InfoNCE loss |
Detailed characteristics
| Rank | Submission code | mAP@10 | Technical Report | Parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Training dataset(s) | Validation dataset(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | log-mel energies | BERT, RoBERTa | patchout, frequency warping | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 2 | Kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16 kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation |
| 4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | log-mel energies | RoBERTa | frequency warping | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | log-mel energies | RoBERTa | SpecAugment, patchout, Mix-up Contrast | text token masking, GPT augmentation | 32 kHz, 16 kHz | InfoNCE loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation |
| 6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | log-mel energies | RoBERTa | SpecAugment, patchout, Mix-up Contrast | text token masking, GPT augmentation | 32 kHz, 16 kHz | InfoNCE loss | Adam, AdamW | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation |
| 7 | Kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 8 | Kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | log-mel energies | RoBERTa | patchout | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | log-mel energies | RoBERTa | None | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 11 | Kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | log-mel energies | GTE-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16 kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation |
| 13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | log-mel energies | RoBERTa | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | mixup | ChatGPT | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation |
| 18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | mixup | ChatGPT | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation |
| 19 | Xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | log-mel energies | Sentence-BERT | — | — | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development | Clotho-validation |
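Several of the audio augmentations listed above (SpecAugment, time and frequency masking) amount to masking random bands of the log-mel spectrogram. A minimal sketch follows, with illustrative mask counts and widths rather than any team's settings:

```python
# SpecAugment-style time/frequency masking on a log-mel spectrogram (sketch).
import numpy as np

def spec_augment(log_mel, n_masks=2, max_f=8, max_t=20, rng=np.random):
    """log_mel: (n_mels, n_frames) array; returns an augmented copy."""
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()                              # value used to mask
    for _ in range(n_masks):
        f = rng.randint(0, max_f + 1)              # frequency-mask height
        f0 = rng.randint(0, max(1, n_mels - f))
        out[f0:f0 + f, :] = fill
        t = rng.randint(0, max_t + 1)              # time-mask width
        t0 = rng.randint(0, max(1, n_frames - t))
        out[:, t0:t0 + t] = fill
    return out
```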
Technical reports
ENSEMBLE SYSTEMS WITH PRETRAINED DUAL-ENCODERS FOR LANGUAGE-BASED AUDIO RETRIEVAL
Jiafeng Li, Xichang Cai, Shenghao Liu, Liangxiao Zuo, Menglong Wu
North China University of Technology, Beijing, China
Cai_NCUT_task8_1 Cai_NCUT_task8_2
Abstract
This article presents our system developed for Task 8 of the DCASE 2024 Challenge, which focuses on audio retrieval using natural language queries. Our submission incorporates a retrieval system that integrates a frozen pre-trained audio encoder with RoBERTa as the text encoder. We adopted a methodology similar to the CLAP framework, training our model using paired data from the AudioCaps and Clotho datasets. Our best-performing system achieved a mean Average Precision (mAP) of 29.6% and a Recall at 1 (R@1) of 18.6% on the Clotho evaluation set.
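A hedged sketch of the frozen-encoder setup the abstract describes: the pre-trained audio encoder is kept fixed while the text branch and projection heads are trained. The encoder interfaces (the `out_dim` attribute and call signatures) are assumptions for illustration, not the team's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """CLAP-style dual encoder with a frozen audio branch (sketch)."""
    def __init__(self, audio_enc: nn.Module, text_enc: nn.Module, dim: int = 512):
        super().__init__()
        self.audio_enc = audio_enc
        for p in self.audio_enc.parameters():
            p.requires_grad = False                 # audio encoder stays frozen
        self.text_enc = text_enc                    # e.g., RoBERTa, fine-tuned
        # `out_dim` is an assumed attribute exposing each encoder's width
        self.audio_proj = nn.Linear(audio_enc.out_dim, dim)
        self.text_proj = nn.Linear(text_enc.out_dim, dim)

    def forward(self, audio, text):
        with torch.no_grad():                       # no gradients to audio branch
            a = self.audio_enc(audio)
        a = F.normalize(self.audio_proj(a), dim=-1)
        t = F.normalize(self.text_proj(self.text_enc(text)), dim=-1)
        return a, t                                 # unit-norm joint embeddings
```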
DCASE 2024 CHALLENGE TASK 8 TECHNICAL REPORT
Minjun Chen, Yangyang Liu, Bo Peng, Jie Chen
Samsung Research China-Nanjing, Nanjing, China
Chen_SRCN_task8_1 Chen_SRCN_task8_2
Abstract
We describe our submitted systems for DCASE 2024 Task 8, Language-based Audio Retrieval, in this technical report. Our proposed system focuses on jointly training the audio and text encoders to obtain expressive audio and text representations, which helps distinguish different audio clips and texts more effectively. We use the pre-trained audio and text encoders of VAST, which were trained on the large multi-modality dataset VAST27M. We further train these encoders on several audio-caption datasets, including AudioCaps, WavCaps, FSD50K, Laion630k, and ClothoV2, with three learning objectives: in addition to the audio-text contrastive objective, we use audio-text matching and masked language modeling objectives to strengthen the training procedure. We use mixup as the data augmentation policy during pre-training. Our proposed system achieves 0.37 mAP@10 and 0.244 R@1; with model ensembling, our systems achieve 0.406 mAP@10 and 0.278 R@1 on the ClothoV2 evaluation set.
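The report names mixup as the augmentation policy. A minimal sketch of mixup on raw waveforms, with a Beta-distributed mixing coefficient that is our own illustrative choice:

```python
# Mixup of two training examples (sketch); alpha is illustrative.
import numpy as np

def mixup(wave_a, wave_b, alpha: float = 0.2):
    lam = np.random.beta(alpha, alpha)              # mixing coefficient
    n = min(len(wave_a), len(wave_b))               # align lengths
    mixed = lam * wave_a[:n] + (1.0 - lam) * wave_b[:n]
    return mixed, lam       # lam likewise weights the paired captions/targets
```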
EXPANDING ON ENCLAP WITH AUXILIARY RETRIEVAL MODEL FOR AUTOMATED AUDIO CAPTIONING
Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee
Seoul National University, MAUM AI Inc., Soongsil University, Independent Researcher
Kim_MAUM_task8_1 Kim_MAUM_task8_2 Kim_MAUM_task8_3 Kim_MAUM_task8_4
Abstract
In this technical report, we describe our submissions to DCASE 2024 Challenge Task 6 (Automated Audio Captioning) and Task 8 (Language-based Audio Retrieval). We build our approach upon the EnCLAP audio captioning framework, optimizing it for Task 6 of the challenge. Notably, we outline the changes to the underlying components and the incorporation of a reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task 8. Our proposed systems achieve a FENSE score of 0.542 on Task 6 and an mAP@10 score of 0.386 on Task 8, significantly outperforming the baseline models.
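As a rough illustration of how a retriever model can rerank candidate captions (the report does not give details here, so the scoring interface below is hypothetical): each candidate is embedded and scored against the audio embedding, and the best-matching caption is kept.

```python
# Rerank candidate captions by audio-text similarity (sketch).
def rerank(candidates, audio_emb, embed_text):
    """Return the candidate caption that best matches the audio embedding;
    `embed_text` is a hypothetical call into the retriever's text encoder."""
    return max(candidates, key=lambda c: float(embed_text(c) @ audio_emb))
```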
TAKE IT FOR GRANTED: IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL WITH LARGE LANGUAGE MODELS
Jan Kulik, Bartlomiej Zgorzynski, Juliusz Kruk, Ivan Ryzhankow, Anna Ples, Theodore Lamort de Gail
Samsung R&D Institute Poland, Warsaw, Poland
kulik_SRPOL_task8_1 kulik_SRPOL_task8_2 kulik_SRPOL_task8_3 kulik_SRPOL_task8_4
Abstract
In this report, we present our solution to DCASE 2024 Task 8: Language-Based Audio Retrieval. We employ a bi-encoder architecture trained using the InfoNCE loss. The audio encoder is a pre-trained PaSST-S model, while the text encoder is either a pre-trained GTE-large or RoBERTa-large model. In order to increase the amount of training data, we obtain 10.8 million video-caption pairs from various open-source datasets. We then extract useful audio-caption pairs and evaluate them using our model to filter out low-quality samples. Finally, we use GPT-4o to rephrase the video captions to make them more audio-oriented. In addition, we use GPT-4o for back-translation and GPT-3.5-turbo for Clotho caption mixing. We achieve 43.69% mAP@10 on the development-testing split of Clotho using an ensemble solution, and 40.78% mAP@10 with a single model.
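For reference, a compact sketch of the symmetric InfoNCE objective named here (and used by several other teams); the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, tau: float = 0.05):
    """audio_emb, text_emb: (batch, dim), L2-normalized; row i of each
    is a matched audio-caption pair."""
    logits = audio_emb @ text_emb.t() / tau         # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # cross-entropy in both retrieval directions: audio->text and text->audio
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```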
TRAINING STRATEGY OF MASSIVE TEXT-TO-AUDIO MODELS AND GPT-BASED QUERY-AUGMENTATION
Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
LY Corporation, Japan
Munakata_LYVA_1 Munakata_LYVA_2
Abstract
This report describes our system submitted to DCASE 2024 Task 8: Language-based Audio Retrieval. We adopted a conventional language-based audio retrieval approach, learning a joint embedding space for the audio and text encoders through contrastive training. We compared and utilized several state-of-the-art models for the audio encoder, including PaSST, BEATs, VAST, and CAV-MAE. We also employed various datasets with text-audio pairs for training, such as AudioCaps, WavCaps, Auto-ACD, and MACS. Additionally, we incorporated advanced training techniques such as Mixco and text token masking. During inference, we devised an ensemble method based on queries augmented by ChatGPT. Our final system achieved 39.65 points with a single model and 42.26 points with an ensemble of multiple models, in terms of mean average precision among the top-10 results (mAP@10), on the evaluation split of Clotho-V2. Compared with the champion system of the DCASE 2023 Challenge, our model outperformed it by 1.09 points for the single model and 0.84 points for the ensemble, respectively.
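A sketch of the query-augmentation ensemble at inference time; averaging similarity scores over the original caption and its ChatGPT paraphrases is our reading, as the report does not spell out the aggregation, and `embed_text` is a hypothetical text-encoder call.

```python
# Ensemble retrieval over ChatGPT-augmented query variants (sketch).
import numpy as np

def ensemble_retrieve(query_variants, audio_embs, embed_text, k: int = 10):
    """query_variants: the original caption plus its LLM paraphrases;
    audio_embs: (n, d) L2-normalized audio embeddings."""
    scores = np.mean([audio_embs @ embed_text(q) for q in query_variants], axis=0)
    return np.argsort(-scores)[:k]                  # top-k audio indices
```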
A KNOWLEDGE DISTILLATION APPROACH TO IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL MODELS
Paul Primus, Gerhard Widmer
Institute of Computational Perception (CP-JKU), LIT Artificial Intelligence Lab, Johannes Kepler University, Austria
Primus_CP-JKU_task8_1 Primus_CP-JKU_task8_2 Primus_CP-JKU_task8_3 Primus_CP-JKU_task8_4
Abstract
This technical report describes the CP-JKU team's submissions to the language-based audio retrieval task of the 2024 DCASE Challenge (Task 8). All our submitted systems are based on a dual-encoder architecture that projects recordings and textual descriptions into a shared audio-caption space in which related examples from the two modalities are similar. We utilized pre-trained audio and text embedding models and trained them on audio-caption datasets (WavCaps, AudioCaps, and ClothoV2) via contrastive learning. We further fine-tuned the resulting models on ClothoV2 via knowledge distillation from a large ensemble of audio retrieval models. Our best single-system submission, based on PaSST and RoBERTa, achieves an mAP@10 of 39.77 on the ClothoV2 test split, outperforming last year's best single-system submission by around 1 percentage point without utilizing metadata or synthetic captions. An ensemble of three distilled models achieves 41.91 mAP@10 on the ClothoV2 test split.
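One common way to distill an ensemble of retrieval models into a single dual encoder, sketched here under our own assumptions (the report's exact formulation may differ): the student's caption-audio similarity distribution is pulled toward the ensemble teacher's.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, tau: float = 1.0):
    """Both inputs: (batch, batch) audio-caption similarity matrices;
    tau is an illustrative distillation temperature."""
    teacher = F.softmax(teacher_logits / tau, dim=-1).detach()
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    # KL divergence between row-wise retrieval distributions
    return F.kl_div(log_student, teacher, reduction="batchmean") * tau ** 2
```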