Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and rank them according to how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound. More information about Task 6B: Language-based Audio Retrieval can be found on the task description page.
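To make the retrieval setup concrete, the sketch below ranks candidate audio clips against a single caption query by cosine similarity in a shared embedding space and returns the ten best matches. It is an illustration only (random embeddings stand in for real model outputs), not part of the official evaluation protocol.

```python
import numpy as np

def retrieve_top10(query_emb, audio_embs, audio_ids):
    """Rank candidate audio clips against a caption embedding by cosine
    similarity and return the ten best-matching file ids."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                              # one cosine similarity per clip
    top10 = np.argsort(scores)[::-1][:10]
    return [audio_ids[i] for i in top10]

# toy usage with random embeddings in place of real model outputs
rng = np.random.default_rng(0)
ids = [f"clip_{i:03d}.wav" for i in range(100)]
print(retrieve_top10(rng.normal(size=256), rng.normal(size=(100, 256)), ids))
```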
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed view of system performance, the same table also lists the values achieved for all metrics employed in the task, reported for both the evaluation dataset and the development-testing split; a short sketch of how these metrics are computed is given after the table.
Rank | Submission code | Corresponding author | Technical Report | mAP@10 (Evaluation) | R@1 (Evaluation) | R@5 (Evaluation) | R@10 (Evaluation) | mAP@10 (Dev-test) | R@1 (Dev-test) | R@5 (Dev-test) | R@10 (Dev-test) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ensmbl_5 | Xuenan Xu | xu2022_t6b | 0.276 | 0.176 | 0.416 | 0.536 | 0.299 | 0.188 | 0.447 | 0.587 |
2 | Mei_Surrey_1 | Xinhao Mei | mei2022_t6b | 0.251 | 0.153 | 0.387 | 0.504 | 0.260 | 0.150 | 0.400 | 0.530 |
3 | RELAX_4 | Theodore Lamort de Gail | lamort2022_t6b | 0.221 | 0.131 | 0.343 | 0.466 | 0.226 | 0.132 | 0.350 | 0.478 |
4 | wtagsACens | Thomas Pellegrini | pellegrini2022_t6b | 0.216 | 0.127 | 0.321 | 0.463 | 0.243 | 0.148 | 0.369 | 0.498 |
5 | lai_pa_4 | Yongquan Lai | lai2022_t6b | 0.215 | 0.122 | 0.328 | 0.478 | 0.510 | 0.350 | 0.750 | 0.890 |
6 | CLAP_4 | Yusong Wu | wu2022_t6b | 0.188 | 0.107 | 0.303 | 0.413 | 0.212 | 0.124 | 0.327 | 0.455 |
7 | ATAE-NP-F | Benno Weck | weck2022_t6b | 0.128 | 0.077 | 0.188 | 0.284 | 0.140 | 0.075 | 0.225 | 0.324 |
8 | P-GAT | Feiyang Xiao | xiao2022_t6b | 0.097 | 0.043 | 0.162 | 0.267 | 0.130 | 0.070 | 0.210 | 0.330 |
9 | park_cau_1 | Jiwon Park | park2022_t6b | 0.075 | 0.033 | 0.127 | 0.208 | 0.090 | 0.050 | 0.150 | 0.230 |
10 | Baseline | Huang Xie | xie2022_t6b | 0.061 | 0.026 | 0.102 | 0.176 | 0.068 | 0.032 | 0.109 | 0.188 |
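The metrics in the table above can be computed as sketched below, under the assumption that each caption query has exactly one matching audio file (as in Clotho-based retrieval); with a single relevant item per query, mAP@10 reduces to the reciprocal rank truncated at ten.

```python
def retrieval_metrics(ranked_lists, targets, ks=(1, 5, 10)):
    """ranked_lists[i] is the system's ranking of audio ids for query i,
    targets[i] is the single relevant audio id for that query."""
    n = len(targets)
    recalls = {k: 0.0 for k in ks}
    ap10 = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking[:10]:
            ap10 += 1.0 / (ranking.index(target) + 1)   # reciprocal rank within top 10
        for k in ks:
            recalls[k] += target in ranking[:k]
    return {"mAP@10": ap10 / n, **{f"R@{k}": recalls[k] / n for k in ks}}

# example: the relevant clip is ranked 2nd for the first query, 1st for the second
print(retrieval_metrics([["b", "a", "c"], ["d", "e", "f"]], ["a", "d"]))
```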
Systems ranking
All systems are listed here together with their rankings according to the different metrics. Detailed information on each system is given in the next section.
Rank | Submission code | Technical Report | mAP@10 (Evaluation) | R@1 (Evaluation) | R@5 (Evaluation) | R@10 (Evaluation) | mAP@10 (Dev-test) | R@1 (Dev-test) | R@5 (Dev-test) | R@10 (Dev-test) |
---|---|---|---|---|---|---|---|---|---|---|
14 | wotags | pellegrini2022_t6b | 0.212 | 0.124 | 0.319 | 0.448 | 0.229 | 0.135 | 0.355 | 0.482 |
12 | wtags | pellegrini2022_t6b | 0.214 | 0.128 | 0.332 | 0.445 | 0.234 | 0.138 | 0.364 | 0.485 |
13 | wtagsAC | pellegrini2022_t6b | 0.213 | 0.125 | 0.322 | 0.454 | 0.240 | 0.145 | 0.365 | 0.500 |
10 | wtagsACens | pellegrini2022_t6b | 0.216 | 0.127 | 0.321 | 0.463 | 0.243 | 0.148 | 0.369 | 0.498 |
23 | ATAE | weck2022_t6b | 0.114 | 0.057 | 0.179 | 0.279 | 0.136 | 0.072 | 0.219 | 0.325 |
24 | ATAE-ET | weck2022_t6b | 0.113 | 0.066 | 0.168 | 0.267 | 0.122 | 0.064 | 0.194 | 0.288 |
22 | ATAE-EP-F | weck2022_t6b | 0.121 | 0.069 | 0.178 | 0.281 | 0.127 | 0.068 | 0.202 | 0.299 |
21 | ATAE-NP-F | weck2022_t6b | 0.128 | 0.077 | 0.188 | 0.284 | 0.140 | 0.075 | 0.225 | 0.324 |
25 | P-GAT | xiao2022_t6b | 0.097 | 0.043 | 0.162 | 0.267 | 0.130 | 0.070 | 0.210 | 0.330 |
12 | lai_pa_1 | lai2022_t6b | 0.214 | 0.125 | 0.328 | 0.469 | 0.510 | 0.350 | 0.750 | 0.890 |
16 | lai_pa_2 | lai2022_t6b | 0.209 | 0.118 | 0.326 | 0.462 | 0.510 | 0.340 | 0.750 | 0.890 |
15 | lai_pa_3 | lai2022_t6b | 0.210 | 0.115 | 0.331 | 0.484 | 0.510 | 0.340 | 0.750 | 0.890 |
11 | lai_pa_4 | lai2022_t6b | 0.215 | 0.122 | 0.328 | 0.478 | 0.510 | 0.350 | 0.750 | 0.890 |
9 | RELAX_1 | lamort2022_t6b | 0.218 | 0.128 | 0.337 | 0.467 | 0.231 | 0.137 | 0.354 | 0.484 |
14 | RELAX_2 | lamort2022_t6b | 0.212 | 0.118 | 0.327 | 0.464 | 0.229 | 0.137 | 0.351 | 0.469 |
10 | RELAX_3 | lamort2022_t6b | 0.216 | 0.129 | 0.336 | 0.458 | 0.228 | 0.137 | 0.354 | 0.470 |
8 | RELAX_4 | lamort2022_t6b | 0.221 | 0.131 | 0.343 | 0.466 | 0.226 | 0.132 | 0.350 | 0.478 |
19 | CLAP_1 | wu2022_t6b | 0.182 | 0.104 | 0.295 | 0.388 | 0.214 | 0.126 | 0.335 | 0.452 |
20 | CLAP_2 | wu2022_t6b | 0.180 | 0.100 | 0.284 | 0.385 | 0.211 | 0.124 | 0.326 | 0.451 |
18 | CLAP_3 | wu2022_t6b | 0.183 | 0.102 | 0.289 | 0.401 | 0.212 | 0.124 | 0.331 | 0.451 |
17 | CLAP_4 | wu2022_t6b | 0.188 | 0.107 | 0.303 | 0.413 | 0.212 | 0.124 | 0.327 | 0.455 |
1 | ensmbl_5 | xu2022_t6b | 0.276 | 0.176 | 0.416 | 0.536 | 0.299 | 0.188 | 0.447 | 0.587 |
2 | ensmbl_4 | xu2022_t6b | 0.269 | 0.177 | 0.397 | 0.517 | 0.288 | 0.182 | 0.427 | 0.568 |
3 | ensmbl_3_1 | xu2022_t6b | 0.265 | 0.174 | 0.395 | 0.514 | 0.283 | 0.179 | 0.424 | 0.558 |
4 | ensmbl_3_2 | xu2022_t6b | 0.259 | 0.168 | 0.379 | 0.508 | 0.282 | 0.175 | 0.417 | 0.565 |
26 | park_cau_1 | park2022_t6b | 0.075 | 0.033 | 0.127 | 0.208 | 0.090 | 0.050 | 0.150 | 0.230 |
26 | park_cau_2 | park2022_t6b | 0.075 | 0.037 | 0.117 | 0.204 | 0.090 | 0.050 | 0.150 | 0.230 |
5 | Mei_Surrey_1 | mei2022_t6b | 0.251 | 0.153 | 0.387 | 0.504 | 0.260 | 0.150 | 0.400 | 0.530 |
6 | Mei_Surrey_2 | mei2022_t6b | 0.250 | 0.151 | 0.388 | 0.507 | 0.260 | 0.150 | 0.400 | 0.530 |
7 | Mei_Surrey_3 | mei2022_t6b | 0.244 | 0.150 | 0.382 | 0.497 | 0.240 | 0.140 | 0.370 | 0.500 |
5 | Mei_Surrey_4 | mei2022_t6b | 0.251 | 0.162 | 0.378 | 0.496 | 0.260 | 0.150 | 0.400 | 0.530 |
27 | Baseline | xie2022_t6b | 0.061 | 0.026 | 0.102 | 0.176 | 0.068 | 0.032 | 0.109 | 0.188 |
System characteristics
This section presents the characteristics of the submitted systems. Two tables are provided in the corresponding subsections for easy reference: the first gives an overview of the systems and the second a detailed presentation of each system.
Overview of characteristics
Rank | Submission code | mAP@10 | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---|
14 | wotags | 0.212 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
12 | wtags | 0.214 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
13 | wtagsAC | 0.213 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
10 | wtagsACens | 0.216 | pellegrini2022_t6b | cross-modal alignment | 196453339 | PaSST | Transformer | |
23 | ATAE | 0.114 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
24 | ATAE-ET | 0.113 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
22 | ATAE-EP-F | 0.121 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
21 | ATAE-NP-F | 0.128 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
25 | P-GAT | 0.097 | xiao2022_t6b | cross-modal alignment | 6799328 | PANNs, GAT | Word2vec | |
12 | lai_pa_1 | 0.214 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
16 | lai_pa_2 | 0.209 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
15 | lai_pa_3 | 0.210 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
11 | lai_pa_4 | 0.215 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
9 | RELAX_1 | 0.218 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
14 | RELAX_2 | 0.212 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
10 | RELAX_3 | 0.216 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
8 | RELAX_4 | 0.221 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
19 | CLAP_1 | 0.182 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
20 | CLAP_2 | 0.180 | wu2022_t6b | cross-modal alignment | 96460249 | HTSAT-tiny | Transformer | Spec-Augment |
18 | CLAP_3 | 0.183 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
17 | CLAP_4 | 0.188 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
1 | ensmbl_5 | 0.276 | xu2022_t6b | cross-modal alignment | 911895813 | CNN | Transformer | |
2 | ensmbl_4 | 0.269 | xu2022_t6b | cross-modal alignment | 715582596 | CNN | Transformer | |
3 | ensmbl_3_1 | 0.265 | xu2022_t6b | cross-modal alignment | 591600259 | CNN | Transformer | |
4 | ensmbl_3_2 | 0.259 | xu2022_t6b | cross-modal alignment | 508377539 | CNN | Transformer | |
26 | park_cau_1 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | Word2vec | |
26 | park_cau_2 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | Word2vec | |
5 | Mei_Surrey_1 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
6 | Mei_Surrey_2 | 0.250 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
7 | Mei_Surrey_3 | 0.244 | mei2022_t6b | cross-modal alignment | 188449792 | CNN | Transformer | Spec-Augment |
5 | Mei_Surrey_4 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
27 | Baseline | 0.061 | xie2022_t6b | cross-modal alignment | 732354 | CRNN | Word2vec |
Detailed characteristics
Rank | Submission code | mAP@10 | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | wotags | 0.212 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
12 | wtags | 0.214 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
13 | wtagsAC | 0.213 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
10 | wtagsACens | 0.216 | pellegrini2022_t6b | cross-modal alignment | 196453339 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | ATAE | 0.114 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho | Clotho | ||||
24 | ATAE-ET | 0.113 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
22 | ATAE-EP-F | 0.121 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
21 | ATAE-NP-F | 0.128 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
25 | P-GAT | 0.097 | xiao2022_t6b | cross-modal alignment | 6799328 | PANNs, GAT | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-4 | validation_loss | Clotho | Clotho | ||||
12 | lai_pa_1 | 0.214 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
16 | lai_pa_2 | 0.209 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
15 | lai_pa_3 | 0.210 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
11 | lai_pa_4 | 0.215 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
9 | RELAX_1 | 0.218 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
14 | RELAX_2 | 0.212 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
10 | RELAX_3 | 0.216 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
8 | RELAX_4 | 0.221 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
19 | CLAP_1 | 0.182 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | |||
20 | CLAP_2 | 0.180 | wu2022_t6b | cross-modal alignment | 96460249 | HTSAT-tiny | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
18 | CLAP_3 | 0.183 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
17 | CLAP_4 | 0.188 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
1 | ensmbl_5 | 0.276 | xu2022_t6b | cross-modal alignment | 911895813 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
2 | ensmbl_4 | 0.269 | xu2022_t6b | cross-modal alignment | 715582596 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
3 | ensmbl_3_1 | 0.265 | xu2022_t6b | cross-modal alignment | 591600259 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
4 | ensmbl_3_2 | 0.259 | xu2022_t6b | cross-modal alignment | 508377539 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
26 | park_cau_1 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
26 | park_cau_2 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
5 | Mei_Surrey_1 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
6 | Mei_Surrey_2 | 0.250 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
7 | Mei_Surrey_3 | 0.244 | mei2022_t6b | cross-modal alignment | 188449792 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
5 | Mei_Surrey_4 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
27 | Baseline | 0.061 | xie2022_t6b | cross-modal alignment | 732354 | CRNN | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho |
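Most submissions above list cross-modal alignment as their method scheme: an audio encoder and a text encoder map their inputs into a shared embedding space that is trained so matching audio-caption pairs score higher than mismatched ones. The sketch below shows the triplet-loss variant reported for several systems (including the baseline); the margin, embedding size, and batch handling are illustrative assumptions rather than settings taken from any particular submission.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(audio_emb, text_emb, margin=1.0):
    """audio_emb, text_emb: (batch, dim) L2-normalised embeddings where
    row i of each tensor belongs to the same audio-caption pair."""
    sim = text_emb @ audio_emb.t()                 # (batch, batch) similarity matrix
    pos = sim.diag().unsqueeze(1)                  # similarities of matching pairs
    # hinge on every mismatched audio for each caption (and vice versa)
    cost_t2a = (margin + sim - pos).clamp(min=0)
    cost_a2t = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_t2a = cost_t2a.masked_fill(mask, 0)
    cost_a2t = cost_a2t.masked_fill(mask, 0)
    return cost_t2a.mean() + cost_a2t.mean()

# toy batch of 8 pairs with 300-dimensional embeddings
audio = F.normalize(torch.randn(8, 300), dim=-1)
text = F.normalize(torch.randn(8, 300), dim=-1)
print(triplet_ranking_loss(audio, text))
```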
Technical reports
A ResNet-Based Clip Text-to-Audio Retrieval System for DCASE Challenge 2022 Task
Yongquan Lai, Jinsong Pan, Buxian Chen
Ping An Property & Casualty Insurance Company of China, Ltd., China
lai_pa_task6b_1 lai_pa_task6b_2 lai_pa_task6b_3 lai_pa_task6b_4
Abstract
Language-based audio retrieval aims to use language to retrieve audio recordings from a given dataset. This technical report presents a text-to-audio retrieval system submitted to Task 6b of the DCASE 2022 challenge. The proposed system is based on AudioCLIP, which incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet and Clotho v2 datasets and introduces a pre-training method to perform bimodal querying. The original AudioCLIP achieved poor retrieval performance on the Clotho v2 dataset in a zero-shot inference fashion, so we used the AudioCLIP model as a weight initializer and fine-tuned the audio and text encoders with a symmetric cross-entropy loss over the similarity measures of the mini-batch (audio, text) pairs. Through pre-training and data augmentation, our model achieved an R@1 score of 0.35 and an mAP@10 score of 0.51 on the Clotho v2 evaluation set.
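The symmetric cross-entropy objective mentioned above is the standard CLIP-style loss: within a mini-batch, each caption should identify its own audio clip and vice versa. A minimal PyTorch sketch, with an illustrative temperature rather than the submission's actual setting:

```python
import torch
import torch.nn.functional as F

def symmetric_clip_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim); row i is a matching (audio, text) pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (t @ a.t()) / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0))        # the diagonal holds the true matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```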
System characteristics
Data augmentation | audio cropping |
Take It Easy: Relaxing Contrastive Ranking Loss with CIDEr
Theodore Lamort de Gail, Dawid Kicinski
Samsung R&D Institute Poland, Warsaw, Poland
lamort_srpol_task6b_1 lamort_srpol_task6b_2 lamort_srpol_task6b_3 lamort_srpol_task6b_4
Abstract
This report presents our approach and results for task 6B of the DCASE2022 challenge concerning natural-language-based audio retrieval. To match the audio-text pairs, we learn cross-modal embeddings. The audio samples are encoded by an ensemble of four frozen expert models with transformer heads for time aggregation. Captions are encoded using a pre-trained language model. The model is trained with a modified contrastive ranking loss, enhanced with a heuristic caption similarity prior based on the CIDEr metric. We train the system on the AudioCaps and Clotho audio captioning datasets. Furthermore, we use an NLP classifier to gather additional useful audio-caption pairs from Freesound. We achieve 0.48 R10 and 0.23 mAP10 on the Clotho evaluation split (vs. 0.19 and 0.07 respectively for the challenge baseline).
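One plausible reading of the relaxed contrastive ranking loss is to soften the one-hot targets of a standard contrastive loss with a caption-similarity prior (CIDEr in the report). The sketch below is an assumption about how such a relaxation could look, not the authors' exact formulation; `caption_sim` stands in for a precomputed CIDEr similarity matrix over the captions in the batch.

```python
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(audio_emb, text_emb, caption_sim,
                             temperature=0.07, alpha=0.5):
    """caption_sim: (batch, batch) non-negative similarity between the batch captions,
    e.g. CIDEr scores; its rows are renormalised into soft target distributions."""
    logits = (F.normalize(text_emb, dim=-1) @ F.normalize(audio_emb, dim=-1).t()) / temperature
    hard = torch.eye(logits.size(0))
    soft = caption_sim / caption_sim.sum(dim=1, keepdim=True)
    targets = alpha * hard + (1 - alpha) * soft          # relax the one-hot targets
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```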
System characteristics
Data augmentation | None |
Language-Based Audio Retrieval with Pre-trained Models
Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK
Mei_Surrey_task6b_1 Mei_Surrey_task6b_2 Mei_Surrey_task6b_3 Mei_Surrey_task6b_4
Abstract
This technical report presents a language-based audio retrieval system that we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6b. Language-based audio retrieval is a cross-modal task aiming at retrieving a matched audio clip from a pool of candidates given a language query such as a sentence. Cross-modal retrieval tasks are often solved by using deep learning models where the features from different modalities are extracted and then mapped to a joint embedding space. These models usually require a large amount of training data to obtain reasonable performance. However, the audio captioning dataset employed in this audio retrieval task is limited in size. In this work, we propose to use large-scale pre-trained models as both audio and text encoders to mitigate the data scarcity problem and learn the acoustic semantic embeddings. Results on the Clotho dataset show that our proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.
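As an illustration of using large pre-trained models as encoders, the caption branch could be built on an off-the-shelf BERT model from the `transformers` library and projected into the joint embedding space; the audio branch (a pre-trained PANNs-style CNN in the report) would be projected analogously. The checkpoint name, pooling strategy, and projection size below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
text_proj = nn.Linear(768, 1024)        # map BERT features into the joint space

def encode_captions(captions):
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (batch, tokens, 768)
    pooled = hidden.mean(dim=1)                       # simple mean pooling (an assumption)
    return torch.nn.functional.normalize(text_proj(pooled), dim=-1)

print(encode_captions(["birds chirping near a stream",
                       "a car passes by in the rain"]).shape)
```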
System characteristics
Data augmentation | SpecAugment |
CAU Submission to DCASE 2022 Task6B: Language-Based Audio Retrieval using Transfer Learning
Jiwon Park, Chaewon Hwang, Il-Youp Kwak, Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
park_cau_task6b_1 park_cau_task6b_2
Abstract
This report proposes a language-based audio retrieval model for the 2022 DCASE audio retrieval challenge. To make use of features learned from AudioSet, we utilized a CNN10 network pre-trained on AudioSet. With this transfer learning, our proposed model takes the 10-layer CNN and adds a GRU after the CNN module. We used a pre-trained Word2Vec model as the text encoder [1]. Experiments show that the proposed model achieved an mAP score of 0.091, outperforming the baseline mAP score of 0.067.
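A rough sketch of the audio-encoder shape described above, i.e. a pre-trained CNN front-end followed by a GRU; it illustrates the idea only, and `cnn_backbone` (here an identity module so the example runs) stands in for the authors' pre-trained CNN10.

```python
import torch
import torch.nn as nn

class CnnGruEncoder(nn.Module):
    """Pre-trained CNN front-end followed by a GRU over the time axis."""
    def __init__(self, cnn_backbone, cnn_dim=512, out_dim=300):
        super().__init__()
        self.cnn = cnn_backbone                     # e.g. a CNN10 pre-trained on AudioSet
        self.gru = nn.GRU(cnn_dim, out_dim, batch_first=True)

    def forward(self, mel):                         # mel: (batch, time, n_mels)
        feats = self.cnn(mel)                       # (batch, time', cnn_dim) frame features
        _, last_hidden = self.gru(feats)            # summarise the frame sequence
        return torch.nn.functional.normalize(last_hidden.squeeze(0), dim=-1)

# toy run with an identity "backbone" so the sketch is self-contained
encoder = CnnGruEncoder(nn.Identity(), cnn_dim=64)
print(encoder(torch.randn(2, 100, 64)).shape)       # torch.Size([2, 300])
```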
System characteristics
Data augmentation | None |
IRIT-UPS DCASE 2022 Language-Based Audio Retrieval System
Thomas Pellegrini
Computing Sciences, University Toulouse III, Toulouse, France
Abstract
This technical report is a short description of the IRIT-UPS systems used in DCASE 2022 Task 6b, dedicated to language-based audio retrieval. Four submissions were made: i) a baseline one using pretrained representations for both the audio signal (scene embeddings extracted with PaSST) and the caption queries (using a large pretrained sentence transformer called all-MPNet), ii) the same baseline system but adding information from AudioSet tags in the audio encoder part, iii) the same as ii) but pretrained on an external dataset (AudioCaps), and iv) an ensemble of two systems of type iii).
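The caption side of systems i)-iv) can be reproduced with the `sentence-transformers` package, which provides the all-mpnet-base-v2 model named in the report; the PaSST audio embeddings and the learned audio-text alignment are assumed to be handled separately.

```python
from sentence_transformers import SentenceTransformer

# all-mpnet-base-v2 is the sentence encoder named in the report; the call below
# returns one 768-dimensional embedding per caption query.
text_encoder = SentenceTransformer("all-mpnet-base-v2")
caption_embeddings = text_encoder.encode(
    ["water trickles over rocks while birds sing",
     "a crowd applauds at the end of a speech"],
    normalize_embeddings=True,
)
print(caption_embeddings.shape)        # (2, 768)
```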
System characteristics
Data augmentation | None |
Aligning Audio and Text Embeddings for the Language-Based Audio Retrieval Task of the DCASE Challenge 2022
Benno Weck1,2, Miguel Pérez Fernández1,2, Holger Kirchhoff1, Xavier Serra2
1Huawei Technologies, Munich Research Center, Germany, 2Universitat Pompeu Fabra, Music Technology Group, Spain
Weck_Huawei_task6b_1 Weck_Huawei_task6b_2 Weck_Huawei_task6b_3 Weck_Huawei_task6b_4
Abstract
Our challenge submission shows how large-scale pretrained deep learning models can serve as a strong basis for a cross-modal (text-to-audio) retrieval system. Our system uses embeddings extracted by these models in a general alignment framework to connect matching pairs of audio and text. It processes audio and text separately through different pretrained models, each returning an embedding. Shallow neural networks map the embeddings to a common dimensionality. The cross-modal alignment of the individual embeddings is optimised using a contrastive loss. We employ the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. The embedding extractor model weights remain frozen. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. We find that a two-stage training process consisting of pretraining with noisy data and fine-tuning with the challenge datasets gives the best results for our approach. Our system showcases a simple yet effective method which is superior to the challenge baseline.
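A minimal sketch of the alignment framework described: both embedding extractors stay frozen and only shallow projection networks are trained to map the two embedding types to a common dimensionality. The layer sizes are illustrative, not the submission's configuration.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Shallow network mapping a frozen extractor's embedding to the shared space."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return torch.nn.functional.normalize(self.net(x), dim=-1)

audio_head = ProjectionHead(in_dim=2048)   # e.g. a PANNs clip-level embedding size
text_head = ProjectionHead(in_dim=768)     # e.g. a DistilRoBERTa sentence embedding size

# only the shallow heads are optimised; the pretrained extractors stay frozen
optimizer = torch.optim.Adam(
    list(audio_head.parameters()) + list(text_head.parameters()), lr=1e-4)
```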
System characteristics
Data augmentation | None |
Text-to-Audio Retrieval via Large-Scale Contrastive Training
Yusong Wu1,2, Tianyu Zhang1,2, Ke Chen3
1University of Montreal, Quebec, Canada, 2Mila, Quebec, Canada, 3University of California San Diego, San Diego, United States
Wu_Mila_task6b_1 Wu_Mila_task6b_2 Wu_Mila_task6b_3 Wu_Mila_task6b_4
Abstract
Although there is an abundance of data available on the internet, audio data is still limited in terms of dataset size and label precision. Scaling the size of audio datasets would therefore be one of the most valuable ways to develop models for better audio understanding. In this report, we propose a pipeline to better learn the audio understanding mechanism by combining audio data with more abundantly available natural language descriptions. We collected a mixed dataset consisting of over 2 million data pairs and trained a contrastive model based on Contrastive Language–Image Pre-training (CLIP) in order to discover correspondences between audio and text. As an audio encoder, we use HTS-AT as a transformer-based model and PANN as a CNN-based model, and as a text encoder we employ the frozen pre-trained CLIP text encoder. The resulting models are submitted to Task 6B of the DCASE 2022 challenge and achieve an mAP@10 score of at least 0.214.
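The frozen CLIP text branch mentioned in the abstract can be loaded through the `transformers` library as sketched below; the audio branch (HTS-AT or PANN) would then be trained against these fixed text features. The public checkpoint name is an assumption about which CLIP variant was used.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.requires_grad_(False)        # keep the text tower frozen during training

with torch.no_grad():
    batch = tokenizer(["rain hitting a tin roof", "a dog barking in the distance"],
                      padding=True, return_tensors="pt")
    text_features = text_encoder(**batch).pooler_output    # (2, 512)
print(text_features.shape)
```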
System characteristics
Data augmentation | SpecAugment |
Language-Based Audio Retrieval with Pretrained CNN and Graph Attention
Feiyang Xiao1, Jian Guan1∗, Haiyan Lan1, Qiaoxi Zhu2, and Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Guan_HEU_task6b_1
Abstract
This technical report describes our submission for Task 6B of the DCASE2022 Challenge (language-based audio retrieval). Our audio retrieval system has an audio encoder composed of a pretrained CNN module (i.e., pretrained audio neural networks, PANNs) and a novel graph attention module. Its text encoder is the pretrained Word2Vec model, the same as in the baseline system of Task 6B. Experiments show that our audio retrieval system achieves an mAP@10 (the metric used for ranking) of 13% on the development-testing split of Task 6B.
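As an illustration of a graph attention module operating on frame-level audio features, the sketch below uses a single attention head over a fully connected graph of frames; the report's actual graph construction and attention design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    """Single-head graph attention over frame-level audio embeddings, treating
    the frames as nodes of a fully connected graph (an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, frames):                      # frames: (T, dim)
        h = self.proj(frames)                       # (T, dim) projected nodes
        T = h.size(0)
        hi = h.unsqueeze(1).expand(T, T, -1)        # node i repeated along axis 1
        hj = h.unsqueeze(0).expand(T, T, -1)        # node j repeated along axis 0
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)            # attention weights over neighbours
        return alpha @ h                            # (T, dim) refined frame embeddings

layer = SimpleGraphAttention(dim=128)
print(layer(torch.randn(50, 128)).shape)            # torch.Size([50, 128])
```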
System characteristics
Data augmentation | None |
The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training
Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
xu_sjtu_task6b_1 xu_sjtu_task6b_2 xu_sjtu_task6b_3 xu_sjtu_task6b_4
Judges’ award
Abstract
This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 6, which comprises two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture with pre-trained audio and text encoders. The system is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and then take the audio encoder as the feature extractor. A standard sequence-to-sequence model is then trained on Clotho on top of the pre-trained feature extractor. The captioning model is first trained with a word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.
Awards: Judges’ award
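Submissions ensmbl_3/4/5 are ensembles of several bi-encoder models. One common way to fuse such retrieval systems, shown below as an assumption rather than the report's exact procedure, is to average the per-model text-to-audio similarity matrices before ranking.

```python
import numpy as np

def ensemble_rank(similarity_matrices, audio_ids, top_k=10):
    """similarity_matrices: list of (n_queries, n_audio) score matrices, one per model;
    returns the fused top-k audio ids for every query."""
    fused = np.mean(np.stack(similarity_matrices), axis=0)
    rankings = np.argsort(-fused, axis=1)[:, :top_k]
    return [[audio_ids[j] for j in row] for row in rankings]
```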
System characteristics
Data augmentation | None |