Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and rank them by how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
Clotho v2 is provided as the development dataset and includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing, such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound.
A more detailed task description can be found on the task description page.
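Most submitted systems treat this as a cross-modal retrieval problem: a text encoder embeds the query and an audio encoder embeds the candidate recordings into a shared space, and the 10 most similar recordings are returned. Below is a minimal sketch of the retrieval step only; the embeddings are assumed to come from pre-trained encoders (e.g., PaSST for audio, RoBERTa for text), which are placeholders here rather than any particular team's models.

```python
# Retrieval step of a dual-encoder system (sketch). Embeddings are
# assumed to be L2-normalized, so the dot product is cosine similarity.
import numpy as np

def retrieve_top10(query_emb: np.ndarray, audio_embs: np.ndarray) -> np.ndarray:
    """query_emb: (d,) text embedding; audio_embs: (n, d) audio embeddings.

    Returns indices of the 10 best-matching audio files, best first.
    """
    scores = audio_embs @ query_emb         # cosine similarities, shape (n,)
    return np.argsort(-scores)[:10]         # rank by descending similarity
```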
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed exploration of system performance, the same table lists the values achieved for all metrics employed in the task, on both the evaluation dataset and the development-testing split. A sketch of how these metrics are computed follows the table.
| Submission code | Best official system rank | Corresponding author | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev.-test mAP@10 | Dev.-test R@1 | Dev.-test R@5 | Dev.-test R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Primus_CP-JKU_8_1 | 1 | Paul Primus | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719 |
| Kulik_SRPOL_task8_4 | 2 | Jan Kulik | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733 |
| Chen_SRCN_task8_1 | 3 | Minjun Chen | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705 |
| Munakata_LYVA_1 | 5 | Hokuto Munakata | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728 |
| Kim_MAUM_task8_2 | 13 | Jaeyeon Kim | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676 |
| Cai_NCUT_task8_2 | 17 | Xichang Cai | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577 |
| Xie_tau_task8_1 | 19 | Huang Xie | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 |
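For reference, a minimal sketch of the metrics above, under the assumption that each caption query has exactly one relevant audio file (as in Clotho-based retrieval evaluation); in that case AP@10 reduces to the reciprocal of the rank at which the relevant file appears.

```python
# mAP@10 and R@k, assuming one relevant audio file per query.

def map_at_10(ranked_lists, relevant):
    """ranked_lists[i]: top-10 audio indices for query i (best first);
    relevant[i]: index of the ground-truth audio for query i."""
    total = 0.0
    for top10, rel in zip(ranked_lists, relevant):
        if rel in top10:
            total += 1.0 / (top10.index(rel) + 1)   # AP = 1 / rank of the hit
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant file appears in the top k."""
    hits = sum(rel in top[:k] for top, rel in zip(ranked_lists, relevant))
    return hits / len(ranked_lists)
```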
Systems ranking
Listed here are all systems and their rankings according to the different metrics. Detailed information on each system is given in the next section.
| Submission code | Rank | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev.-test mAP@10 | Dev.-test R@1 | Dev.-test R@5 | Dev.-test R@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Primus_CP-JKU_8_1 | 1 | Primus2024_t8 | 0.416 | 0.307 | 0.563 | 0.686 | 0.419 | 0.293 | 0.593 | 0.719 |
| Kulik_SRPOL_task8_4 | 2 | Kulik2024_t8 | 0.403 | 0.292 | 0.546 | 0.663 | 0.437 | 0.314 | 0.601 | 0.733 |
| Chen_SRCN_task8_1 | 3 | Chen2024_t8 | 0.396 | 0.290 | 0.541 | 0.661 | 0.406 | 0.278 | 0.576 | 0.705 |
| Primus_CP-JKU_8_4 | 4 | Primus2024_t8 | 0.389 | 0.275 | 0.545 | 0.664 | 0.389 | 0.268 | 0.549 | 0.688 |
| Munakata_LYVA_1 | 5 | Munakata2024_t8 | 0.388 | 0.284 | 0.532 | 0.654 | 0.422 | 0.290 | 0.597 | 0.728 |
| Munakata_LYVA_2 | 6 | Munakata2024_t8 | 0.386 | 0.277 | 0.531 | 0.656 | 0.423 | 0.292 | 0.598 | 0.727 |
| Kulik_SRPOL_task8_3 | 7 | Kulik2024_t8 | 0.386 | 0.269 | 0.544 | 0.661 | 0.426 | 0.301 | 0.597 | 0.731 |
| Kulik_SRPOL_task8_2 | 8 | Kulik2024_t8 | 0.384 | 0.267 | 0.546 | 0.659 | 0.426 | 0.302 | 0.592 | 0.731 |
| Primus_CP-JKU_8_1 | 9 | Primus2024_t8 | 0.378 | 0.266 | 0.539 | 0.648 | 0.398 | 0.271 | 0.571 | 0.699 |
| Primus_CP-JKU_8_3 | 10 | Primus2024_t8 | 0.373 | 0.265 | 0.524 | 0.654 | 0.377 | 0.252 | 0.547 | 0.680 |
| Kulik_SRPOL_task8_1 | 11 | Kulik2024_t8 | 0.369 | 0.250 | 0.531 | 0.646 | 0.408 | 0.287 | 0.574 | 0.709 |
| Chen_SRCN_task8_2 | 12 | Chen2024_t8 | 0.364 | 0.254 | 0.521 | 0.627 | 0.370 | 0.244 | 0.534 | 0.662 |
| Kim_MAUM_task8_2 | 13 | Kim2024_t8 | 0.363 | 0.252 | 0.514 | 0.642 | 0.385 | 0.265 | 0.547 | 0.676 |
| Kim_MAUM_task8_3 | 14 | Kim2024_t8 | 0.362 | 0.246 | 0.516 | 0.643 | 0.386 | 0.267 | 0.547 | 0.680 |
| Kim_MAUM_task8_4 | 15 | Kim2024_t8 | 0.359 | 0.254 | 0.510 | 0.633 | 0.378 | 0.257 | 0.543 | 0.676 |
| Kim_MAUM_task8_1 | 16 | Kim2024_t8 | 0.350 | 0.236 | 0.499 | 0.630 | 0.375 | 0.256 | 0.535 | 0.669 |
| Cai_NCUT_task8_2 | 17 | Cai2024_t8 | 0.259 | 0.162 | 0.391 | 0.513 | 0.296 | 0.186 | 0.444 | 0.577 |
| Cai_NCUT_task8_1 | 18 | Cai2024_t8 | 0.255 | 0.159 | 0.383 | 0.520 | 0.292 | 0.180 | 0.440 | 0.576 |
| Xie_tau_task8_1 | 19 | Xie2024_t8 | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 |
System characteristics
This section presents the characteristics of the submitted systems in two tables, one per subsection: the first gives an overview of the systems, and the second a detailed presentation of each system. A sketch of one common audio augmentation follows the detailed table.
Overview of characteristics
| Rank | Submission code | mAP@10 | Technical Report | Parameters | Audio modelling | Text modelling | Loss function |
|---|---|---|---|---|---|---|---|
| 1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | BERT, RoBERTa | NT-Xent loss |
| 2 | Kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | BERT | Contrastive loss |
| 4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | RoBERTa | NT-Xent loss |
| 5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | RoBERTa | InfoNCE loss |
| 6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | RoBERTa | InfoNCE loss |
| 7 | Kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 8 | Kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | GTE-large, RoBERTa-large | InfoNCE loss |
| 9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | RoBERTa | NT-Xent loss |
| 10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | RoBERTa | NT-Xent loss |
| 11 | Kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | GTE-large | InfoNCE loss |
| 12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | BERT | Contrastive loss |
| 13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | BERT, RoBERTa, BGE | m-LTM |
| 16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | RoBERTa | m-LTM |
| 17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss |
| 18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | RoBERTa | InfoNCE loss |
| 19 | Xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | Sentence-BERT | InfoNCE loss |
Detailed characteristics
| Rank | Submission code | mAP@10 | Technical Report | Parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Training dataset(s) | Validation dataset(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Primus_CP-JKU_8_1 | 0.416 | Primus2024_t8 | 2596000000 | PaSST, ATST, Dynamic MobileNet | log-mel energies | BERT, RoBERTa | patchout, frequency warping | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 2 | Kulik_SRPOL_task8_4 | 0.403 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 3 | Chen_SRCN_task8_1 | 0.396 | Chen2024_t8 | 10390000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16 kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation |
| 4 | Primus_CP-JKU_8_4 | 0.389 | Primus2024_t8 | 453000000 | ATST | log-mel energies | RoBERTa | frequency warping | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 5 | Munakata_LYVA_1 | 0.388 | Munakata2024_t8 | 2680000000 | PaSST, VAST, BEATs, CAV-MAE | log-mel energies | RoBERTa | SpecAugment, patchout, Mix-up Contrast | text token masking, GPT augmentation | 32 kHz, 16 kHz | InfoNCE loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation |
| 6 | Munakata_LYVA_2 | 0.386 | Munakata2024_t8 | 2240000000 | PaSST, VAST | log-mel energies | RoBERTa | SpecAugment, patchout, Mix-up Contrast | text token masking, GPT augmentation | 32 kHz, 16 kHz | InfoNCE loss | Adam, AdamW | mAP | Clotho-development, AudioCaps, WavCaps, MACS, Auto-ACD-VS | Clotho-validation |
| 7 | Kulik_SRPOL_task8_3 | 0.386 | Kulik2024_t8 | 3855200000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 8 | Kulik_SRPOL_task8_2 | 0.384 | Kulik2024_t8 | 1485700000 | PaSST-S | log-mel energies | GTE-large, RoBERTa-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 9 | Primus_CP-JKU_8_1 | 0.378 | Primus2024_t8 | 442000000 | PaSST | log-mel energies | RoBERTa | patchout | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 10 | Primus_CP-JKU_8_3 | 0.373 | Primus2024_t8 | 430000000 | Dynamic MobileNet | log-mel energies | RoBERTa | None | synonym replacement, random deletions | 32 kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 11 | Kulik_SRPOL_task8_1 | 0.369 | Kulik2024_t8 | 521900000 | PaSST-S | log-mel energies | GTE-large | mixing, time and frequency masking, patchout | random deletion, synonym replacement, back-translation, LLM mixing, LLM rephrasing | 32 kHz | InfoNCE loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, VideoCaps | Clotho-validation |
| 12 | Chen_SRCN_task8_2 | 0.364 | Chen2024_t8 | 230000000 | BEATs | log-mel energies | BERT | mixup | mixup | 16 kHz | Contrastive loss | AdamW | mAP | Clotho-development, AudioCaps, WavCaps, FSD50K, Laion630k, LASS validation (synth) set | Clotho-validation |
| 13 | Kim_MAUM_task8_2 | 0.363 | Kim2024_t8 | 1131908653 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 14 | Kim_MAUM_task8_3 | 0.362 | Kim2024_t8 | 1581588058 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 15 | Kim_MAUM_task8_4 | 0.359 | Kim2024_t8 | 3163176116 | ConvNeXt-Tiny | log-mel energies | BERT, RoBERTa, BGE | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 16 | Kim_MAUM_task8_1 | 0.350 | Kim2024_t8 | 390781455 | ConvNeXt-Tiny | log-mel energies | RoBERTa | SpecAugment | — | 32 kHz | m-LTM | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
| 17 | Cai_NCUT_task8_2 | 0.259 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | mixup | ChatGPT | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation |
| 18 | Cai_NCUT_task8_1 | 0.255 | Cai2024_t8 | 160771192 | PANNs-CNN14, BEATs | log-mel energies | RoBERTa | mixup | ChatGPT | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development, AudioCaps-train | Clotho-validation |
| 19 | Xie_tau_task8_1 | 0.211 | Xie2024_t8 | 160771192 | PANNs-CNN14 | log-mel energies | Sentence-BERT | — | — | 44.1 kHz | InfoNCE loss | Adam | loss | Clotho-development | Clotho-validation |
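Several of the audio augmentations listed above (SpecAugment, time and frequency masking) amount to masking random bands of the log-mel spectrogram. A minimal sketch follows, with illustrative mask counts and widths rather than any team's settings:

```python
# SpecAugment-style time/frequency masking on a log-mel spectrogram (sketch).
import numpy as np

def spec_augment(log_mel, n_masks=2, max_f=8, max_t=20, rng=np.random):
    """log_mel: (n_mels, n_frames) array; returns an augmented copy."""
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()                              # value used to mask
    for _ in range(n_masks):
        f = rng.randint(0, max_f + 1)              # frequency-mask height
        f0 = rng.randint(0, max(1, n_mels - f))
        out[f0:f0 + f, :] = fill
        t = rng.randint(0, max_t + 1)              # time-mask width
        t0 = rng.randint(0, max(1, n_frames - t))
        out[:, t0:t0 + t] = fill
    return out
```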
Technical reports
ENSEMBLE SYSTEMS WITH PRETRAINED DUAL-ENCODERS FOR LANGUAGE-BASED AUDIO RETRIEVAL
Jiafeng Li, Xichang Cai, Shenghao Liu, Liangxiao Zuo, Menglong Wu
North China University of Technology, Beijing, China
Cai_NCUT_task8_1 Cai_NCUT_task8_2
Abstract
This article presents our system developed for Task 8 of the DCASE 2024 Challenge, which focuses on audio retrieval using natural language queries. Our submission incorporates a retrieval system that integrates a frozen pre-trained audio encoder with RoBERTa as the text encoder. We adopted a methodology similar to the CLAP framework, training our model using paired data from the AudioCaps and Clotho datasets. Our best-performing system achieved a mean Average Precision (mAP) of 29.6% and a Recall at 1 (R@1) of 18.6% on the Clotho evaluation set.
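A hedged sketch of the frozen-encoder setup the abstract describes: the pre-trained audio encoder is kept fixed while the text branch and projection heads are trained. The encoder interfaces (the `out_dim` attribute and call signatures) are assumptions for illustration, not the team's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """CLAP-style dual encoder with a frozen audio branch (sketch)."""
    def __init__(self, audio_enc: nn.Module, text_enc: nn.Module, dim: int = 512):
        super().__init__()
        self.audio_enc = audio_enc
        for p in self.audio_enc.parameters():
            p.requires_grad = False                 # audio encoder stays frozen
        self.text_enc = text_enc                    # e.g., RoBERTa, fine-tuned
        # `out_dim` is an assumed attribute exposing each encoder's width
        self.audio_proj = nn.Linear(audio_enc.out_dim, dim)
        self.text_proj = nn.Linear(text_enc.out_dim, dim)

    def forward(self, audio, text):
        with torch.no_grad():                       # no gradients to audio branch
            a = self.audio_enc(audio)
        a = F.normalize(self.audio_proj(a), dim=-1)
        t = F.normalize(self.text_proj(self.text_enc(text)), dim=-1)
        return a, t                                 # unit-norm joint embeddings
```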
DCASE 2024 CHALLENGE TASK 8 TECHNICAL REPORT
Minjun Chen, Yangyang Liu, Bo Peng, Jie Chen
Samsung Research China-Nanjing, Nanjing, China
Chen_SRCN_task8_1 Chen_SRCN_task8_2
Abstract
We describe our submitted systems for DCASE 2024 Task 8, Language-based Audio Retrieval, in this technical report. Our proposed system focuses on jointly training the audio and text encoders to obtain expressive audio and text representations, which helps distinguish different audio clips and texts more effectively. We use the pre-trained audio and text encoders of VAST, which were trained on the large multi-modality dataset VAST27M. We further train these encoders on several audio-caption datasets, including AudioCaps, WavCaps, FSD50K, Laion630k, and ClothoV2, with three learning objectives: in addition to the audio-text contrastive objective, we use audio-text matching and masked language modeling objectives to strengthen the training procedure. We use mixup as the data augmentation policy during pre-training. Our proposed system achieves 0.37 mAP@10 and 0.244 R@1; with model ensembling, our systems achieve 0.406 mAP@10 and 0.278 R@1 on the ClothoV2 evaluation set.
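The report names mixup as the augmentation policy. A minimal sketch of mixup on raw waveforms, with a Beta-distributed mixing coefficient that is our own illustrative choice:

```python
# Mixup of two training examples (sketch); alpha is illustrative.
import numpy as np

def mixup(wave_a, wave_b, alpha: float = 0.2):
    lam = np.random.beta(alpha, alpha)              # mixing coefficient
    n = min(len(wave_a), len(wave_b))               # align lengths
    mixed = lam * wave_a[:n] + (1.0 - lam) * wave_b[:n]
    return mixed, lam       # lam likewise weights the paired captions/targets
```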
EXPANDING ON ENCLAP WITH AUXILIARY RETRIEVAL MODEL FOR AUTOMATED AUDIO CAPTIONING
Jaeyeon Kim, Jaeyoon Jung, Minjeong Jeon, Sang Hoon Woo, Jinjoo Lee
Seoul National University, MAUM AI Inc., Soongsil University, Independent Researcher
Kim_MAUM_task8_1 Kim_MAUM_task8_2 Kim_MAUM_task8_3 Kim_MAUM_task8_4
Abstract
In this technical report, we describe our submissions to DCASE 2024 Challenge Task 6 (Automated Audio Captioning) and Task 8 (Language-based Audio Retrieval). We build our approach upon the EnCLAP audio captioning framework, optimizing it for Task 6 of the challenge. Notably, we outline the changes to the underlying components and the incorporation of a reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task 8. Our proposed systems achieve a FENSE score of 0.542 on Task 6 and an mAP@10 score of 0.386 on Task 8, significantly outperforming the baseline models.
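As a rough illustration of how a retriever model can rerank candidate captions (the report does not give details here, so the scoring interface below is hypothetical): each candidate is embedded and scored against the audio embedding, and the best-matching caption is kept.

```python
# Rerank candidate captions by audio-text similarity (sketch).
def rerank(candidates, audio_emb, embed_text):
    """Return the candidate caption that best matches the audio embedding;
    `embed_text` is a hypothetical call into the retriever's text encoder."""
    return max(candidates, key=lambda c: float(embed_text(c) @ audio_emb))
```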
TAKE IT FOR GRANTED: IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL WITH LARGE LANGUAGE MODELS
Jan Kulik, Bartlomiej Zgorzynski, Juliusz Kruk, Ivan Ryzhankow, Anna Ples, Theodore Lamort de Gail
Samsung R&D Institute Poland, Warsaw, Poland
kulik_SRPOL_task8_1 kulik_SRPOL_task8_2 kulik_SRPOL_task8_3 kulik_SRPOL_task8_4
Abstract
In this report, we present our solution to DCASE 2024 Task 8: Language-Based Audio Retrieval. We employ a bi-encoder architecture trained using the InfoNCE loss. The audio encoder is a pre-trained PaSST-S model, while the text encoder is either a pre-trained GTE-large or RoBERTa-large model. In order to increase the amount of training data, we obtain 10.8 million video-caption pairs from various open-source datasets. We then extract useful audio-caption pairs and evaluate them using our model to filter out low-quality samples. Finally, we use GPT-4o to rephrase the video captions to make them more audio-oriented. In addition, we use GPT-4o for back-translation and GPT-3.5-turbo for Clotho caption mixing. We achieve 43.69% mAP@10 on the development-testing split of Clotho using an ensemble solution, and 40.78% mAP@10 with a single model.
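For reference, a compact sketch of the symmetric InfoNCE objective named here (and used by several other teams); the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, tau: float = 0.05):
    """audio_emb, text_emb: (batch, dim), L2-normalized; row i of each
    is a matched audio-caption pair."""
    logits = audio_emb @ text_emb.t() / tau         # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # cross-entropy in both retrieval directions: audio->text and text->audio
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```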
TRAINING STRATEGY OF MASSIVE TEXT-TO-AUDIO MODELS AND GPT-BASED QUERY-AUGMENTATION
Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
LY Corporation, Japan
Munakata_LYVA_1 Munakata_LYVA_2
Abstract
This report describes our system submitted to DCASE 2024 Task 8: Language-based Audio Retrieval. We adopted a conventional language-based audio retrieval approach, learning a joint embedding space for the audio and text encoders through contrastive training. We compared and utilized several state-of-the-art models for the audio encoder, including PaSST, BEATs, VAST, and CAV-MAE. We also employed various datasets with text-audio pairs for training, such as AudioCaps, WavCaps, Auto-ACD, and MACS. Additionally, we incorporated advanced training techniques such as Mixco and text token masking. During inference, we devised an ensemble method based on queries augmented by ChatGPT. Our final system achieved 39.65 points with a single model and 42.26 points with an ensemble of multiple models, in terms of mean average precision among the top-10 results (mAP@10), on the evaluation split of Clotho-V2. Compared with the champion system of the DCASE 2023 Challenge, our model outperformed it by 1.09 points for the single model and 0.84 points for the ensemble, respectively.
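A sketch of the query-augmentation ensemble at inference time; averaging similarity scores over the original caption and its ChatGPT paraphrases is our reading, as the report does not spell out the aggregation, and `embed_text` is a hypothetical text-encoder call.

```python
# Ensemble retrieval over ChatGPT-augmented query variants (sketch).
import numpy as np

def ensemble_retrieve(query_variants, audio_embs, embed_text, k: int = 10):
    """query_variants: the original caption plus its LLM paraphrases;
    audio_embs: (n, d) L2-normalized audio embeddings."""
    scores = np.mean([audio_embs @ embed_text(q) for q in query_variants], axis=0)
    return np.argsort(-scores)[:k]                  # top-k audio indices
```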
A KNOWLEDGE DISTILLATION APPROACH TO IMPROVING LANGUAGE-BASED AUDIO RETRIEVAL MODELS
Paul Primus, Gerhard Widmer
Institute of Computational Perception (CP-JKU), LIT Artificial Intelligence Lab, Johannes Kepler University, Austria
Primus_CP-JKU_task8_1 Primus_CP-JKU_task8_2 Primus_CP-JKU_task8_3 Primus_CP-JKU_task8_4
Abstract
This technical report describes the CP-JKU team's submissions to the language-based audio retrieval task of the 2024 DCASE Challenge (Task 8). All our submitted systems are based on a dual-encoder architecture that projects recordings and textual descriptions into a shared audio-caption space in which related examples from the two modalities are similar. We utilized pre-trained audio and text embedding models and trained them on audio-caption datasets (WavCaps, AudioCaps, and ClothoV2) via contrastive learning. We further fine-tuned the resulting models on ClothoV2 via knowledge distillation from a large ensemble of audio retrieval models. Our best single-system submission, based on PaSST and RoBERTa, achieves an mAP@10 of 39.77 on the ClothoV2 test split, outperforming last year's best single-system submission by around 1 percentage point without utilizing metadata or synthetic captions. An ensemble of three distilled models achieves 41.91 mAP@10 on the ClothoV2 test split.
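One common way to distill an ensemble of retrieval models into a single dual encoder, sketched here under our own assumptions (the report's exact formulation may differ): the student's caption-audio similarity distribution is pulled toward the ensemble teacher's.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, tau: float = 1.0):
    """Both inputs: (batch, batch) audio-caption similarity matrices;
    tau is an illustrative distillation temperature."""
    teacher = F.softmax(teacher_logits / tau, dim=-1).detach()
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    # KL divergence between row-wise retrieval distributions
    return F.kl_div(log_student, teacher, reduction="batchmean") * tau ** 2
```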