Language-Based Audio Retrieval


Challenge results

Task description

Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
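As a rough illustration of the retrieval step described above, the following minimal sketch ranks audio files against one text query and keeps the best 10. It assumes that queries and audio files have already been mapped to vectors in a shared embedding space and that cosine similarity is the matching score. Neither is mandated by the task, and the function names, the dictionary format, and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_embedding, audio_embeddings, k=10):
    """Return the names of the k best-matching audio files, best first.

    audio_embeddings maps a file name to its embedding (hypothetical format).
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in audio_embeddings.items()
    ]
    # Highest score first; ties are broken by file name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Toy example with made-up 2-D embeddings.
toy_audio = {
    "rain.wav": [1.0, 0.0],
    "dog.wav": [0.0, 1.0],
    "storm.wav": [0.9, 0.1],
}
top = retrieve_top_k([1.0, 0.0], toy_audio, k=2)
```

In an actual submission the embeddings would come from trained audio and text encoders; the sorting step stays the same.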

Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model or additional audio data such as AudioSet or Freesound. More information about Task 6B: Language-based Audio Retrieval can be found at the task description page.

Teams ranking

Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed exploration of the performance of the different systems, the same table also lists the values achieved for all the metrics employed in the task. The metric values are for the development-testing split and the evaluation dataset.

Submission code, Best official system rank, Corresponding author, Technical report, Evaluation dataset (mAP@10, R@1, R@5, R@10), Development-testing split (mAP@10, R@1, R@5, R@10)
ensmbl_5 1 Xuenan Xu xu2022_t6b 0.276 0.176 0.416 0.536 0.299 0.188 0.447 0.587
Mei_Surrey_1 2 Xinhao Mei mei2022_t6b 0.251 0.153 0.387 0.504 0.260 0.150 0.400 0.530
RELAX_4 3 Theodore Lamort de Gail lamort2022_t6b 0.221 0.131 0.343 0.466 0.226 0.132 0.350 0.478
wtagsACens 4 Thomas Pellegrini pellegrini2022_t6b 0.216 0.127 0.321 0.463 0.243 0.148 0.369 0.498
lai_pa_4 5 Yongquan Lai lai2022_t6b 0.215 0.122 0.328 0.478 0.510 0.350 0.750 0.890
CLAP_4 6 Yusong Wu wu2022_t6b 0.188 0.107 0.303 0.413 0.212 0.124 0.327 0.455
ATAE-NP-F 7 Benno Weck weck2022_t6b 0.128 0.077 0.188 0.284 0.140 0.075 0.225 0.324
P-GAT 8 Feiyang Xiao xiao2022_t6b 0.097 0.043 0.162 0.267 0.130 0.070 0.210 0.330
park_cau_1 9 Jiwon Park park2022_t6b 0.075 0.033 0.127 0.208 0.090 0.050 0.150 0.230
Baseline 10 Huang Xie xie2022_t6b 0.061 0.026 0.102 0.176 0.068 0.032 0.109 0.188
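
The metrics in the table above can be sketched as follows. This is a minimal illustration, not the official evaluation code: it assumes that every query has exactly one relevant audio file, in which case the average precision at 10 reduces to the reciprocal rank of that file if it appears in the top 10, and 0 otherwise. The function names and toy data are hypothetical.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears among the top k results."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


def map_at_k(ranked_lists, targets, k=10):
    """Mean average precision at k, assuming one relevant item per query."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            # With a single relevant item, AP@k equals 1 / (rank of that item).
            total += 1.0 / (top.index(target) + 1)
    return total / len(targets)


# Toy example: three queries, each with a ranked list of retrieved files.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "x"]  # "x" is never retrieved for the third query

r_at_1 = recall_at_k(ranked_lists, targets, k=1)
r_at_5 = recall_at_k(ranked_lists, targets, k=5)
map_at_10 = map_at_k(ranked_lists, targets, k=10)
```

In the toy example the reciprocal ranks are 1, 1/2 and 0, so the mean is 0.5.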

Systems ranking

Listed here are all systems and their ranking according to the different metrics. Detailed information on each system is given in the next section.

Submission code, Best official system rank, Technical report, Evaluation dataset (mAP@10, R@1, R@5, R@10), Development-testing split (mAP@10, R@1, R@5, R@10)
wotags 14 pellegrini2022_t6b 0.212 0.124 0.319 0.448 0.229 0.135 0.355 0.482
wtags 12 pellegrini2022_t6b 0.214 0.128 0.332 0.445 0.234 0.138 0.364 0.485
wtagsAC 13 pellegrini2022_t6b 0.213 0.125 0.322 0.454 0.240 0.145 0.365 0.500
wtagsACens 10 pellegrini2022_t6b 0.216 0.127 0.321 0.463 0.243 0.148 0.369 0.498
ATAE 23 weck2022_t6b 0.114 0.057 0.179 0.279 0.136 0.072 0.219 0.325
ATAE-ET 24 weck2022_t6b 0.113 0.066 0.168 0.267 0.122 0.064 0.194 0.288
ATAE-EP-F 22 weck2022_t6b 0.121 0.069 0.178 0.281 0.127 0.068 0.202 0.299
ATAE-NP-F 21 weck2022_t6b 0.128 0.077 0.188 0.284 0.140 0.075 0.225 0.324
P-GAT 25 xiao2022_t6b 0.097 0.043 0.162 0.267 0.130 0.070 0.210 0.330
lai_pa_1 12 lai2022_t6b 0.214 0.125 0.328 0.469 0.510 0.350 0.750 0.890
lai_pa_2 16 lai2022_t6b 0.209 0.118 0.326 0.462 0.510 0.340 0.750 0.890
lai_pa_3 15 lai2022_t6b 0.210 0.115 0.331 0.484 0.510 0.340 0.750 0.890
lai_pa_4 11 lai2022_t6b 0.215 0.122 0.328 0.478 0.510 0.350 0.750 0.890
RELAX_1 9 lamort2022_t6b 0.218 0.128 0.337 0.467 0.231 0.137 0.354 0.484
RELAX_2 14 lamort2022_t6b 0.212 0.118 0.327 0.464 0.229 0.137 0.351 0.469
RELAX_3 10 lamort2022_t6b 0.216 0.129 0.336 0.458 0.228 0.137 0.354 0.470
RELAX_4 8 lamort2022_t6b 0.221 0.131 0.343 0.466 0.226 0.132 0.350 0.478
CLAP_1 19 wu2022_t6b 0.182 0.104 0.295 0.388 0.214 0.126 0.335 0.452
CLAP_2 20 wu2022_t6b 0.180 0.100 0.284 0.385 0.211 0.124 0.326 0.451
CLAP_3 18 wu2022_t6b 0.183 0.102 0.289 0.401 0.212 0.124 0.331 0.451
CLAP_4 17 wu2022_t6b 0.188 0.107 0.303 0.413 0.212 0.124 0.327 0.455
ensmbl_5 1 xu2022_t6b 0.276 0.176 0.416 0.536 0.299 0.188 0.447 0.587
ensmbl_4 2 xu2022_t6b 0.269 0.177 0.397 0.517 0.288 0.182 0.427 0.568
ensmbl_3_1 3 xu2022_t6b 0.265 0.174 0.395 0.514 0.283 0.179 0.424 0.558
ensmbl_3_2 4 xu2022_t6b 0.259 0.168 0.379 0.508 0.282 0.175 0.417 0.565
park_cau_1 26 park2022_t6b 0.075 0.033 0.127 0.208 0.090 0.050 0.150 0.230
park_cau_2 26 park2022_t6b 0.075 0.037 0.117 0.204 0.090 0.050 0.150 0.230
Mei_Surrey_1 5 mei2022_t6b 0.251 0.153 0.387 0.504 0.260 0.150 0.400 0.530
Mei_Surrey_2 6 mei2022_t6b 0.250 0.151 0.388 0.507 0.260 0.150 0.400 0.530
Mei_Surrey_3 7 mei2022_t6b 0.244 0.150 0.382 0.497 0.240 0.140 0.370 0.500
Mei_Surrey_4 5 mei2022_t6b 0.251 0.162 0.378 0.496 0.260 0.150 0.400 0.530
Baseline 27 xie2022_t6b 0.061 0.026 0.102 0.176 0.068 0.032 0.109 0.188
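
The ranks in the table above are consistent with a dense ranking on the evaluation mAP@10 as printed: systems with equal scores share a rank and the next distinct score gets the next integer. The sketch below reproduces that pattern for a few rows. How the organizers actually computed the ranks is an assumption here, and the function name is hypothetical.

```python
def dense_rank(scores):
    """Rank systems by descending score; equal scores share a rank, no gaps."""
    distinct_scores = sorted(set(scores.values()), reverse=True)
    rank_of_score = {score: i + 1 for i, score in enumerate(distinct_scores)}
    return {name: rank_of_score[score] for name, score in scores.items()}


# Evaluation mAP@10 values copied from the table above.
scores = {
    "ensmbl_5": 0.276,
    "ensmbl_4": 0.269,
    "ensmbl_3_1": 0.265,
    "ensmbl_3_2": 0.259,
    "Mei_Surrey_1": 0.251,
    "Mei_Surrey_4": 0.251,
    "Mei_Surrey_2": 0.250,
}
ranks = dense_rank(scores)
```

Here Mei_Surrey_1 and Mei_Surrey_4 share rank 5 and Mei_Surrey_2 follows at rank 6, as in the table.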

System characteristics

In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections. The first table has an overview of the systems and the second has a detailed presentation of each system.

Overview of characteristics

Rank, Submission code, mAP@10, Technical report, Method scheme/architecture, Amount of parameters, Audio modelling, Word modelling, Data augmentation
14 wotags 0.212 pellegrini2022_t6b cross-modal alignment 196046781 PaSST Transformer
12 wtags 0.214 pellegrini2022_t6b cross-modal alignment 196046781 PaSST Transformer
13 wtagsAC 0.213 pellegrini2022_t6b cross-modal alignment 196046781 PaSST Transformer
10 wtagsACens 0.216 pellegrini2022_t6b cross-modal alignment 196453339 PaSST Transformer
23 ATAE 0.114 weck2022_t6b cross-modal alignment 165000000 PANNs DistilRoBERTa
24 ATAE-ET 0.113 weck2022_t6b cross-modal alignment 165000000 PANNs DistilRoBERTa
22 ATAE-EP-F 0.121 weck2022_t6b cross-modal alignment 165000000 PANNs DistilRoBERTa
21 ATAE-NP-F 0.128 weck2022_t6b cross-modal alignment 165000000 PANNs DistilRoBERTa
25 P-GAT 0.097 xiao2022_t6b cross-modal alignment 6799328 PANNs, GAT Word2vec
12 lai_pa_1 0.214 lai2022_t6b supervised learning 134111910 ESResNeXt CLIP audio cropping
16 lai_pa_2 0.209 lai2022_t6b supervised learning 134111910 ESResNeXt CLIP audio cropping
15 lai_pa_3 0.210 lai2022_t6b supervised learning 134111910 ESResNeXt CLIP audio cropping
11 lai_pa_4 0.215 lai2022_t6b supervised learning 134111910 ESResNeXt CLIP audio cropping
9 RELAX_1 0.218 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads BERT
14 RELAX_2 0.212 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads BERT
10 RELAX_3 0.216 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads BERT
8 RELAX_4 0.221 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads BERT
19 CLAP_1 0.182 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 Transformer Spec-Augment
20 CLAP_2 0.180 wu2022_t6b cross-modal alignment 96460249 HTSAT-tiny Transformer Spec-Augment
18 CLAP_3 0.183 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 Transformer Spec-Augment
17 CLAP_4 0.188 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 Transformer Spec-Augment
1 ensmbl_5 0.276 xu2022_t6b cross-modal alignment 911895813 CNN Transformer
2 ensmbl_4 0.269 xu2022_t6b cross-modal alignment 715582596 CNN Transformer
3 ensmbl_3_1 0.265 xu2022_t6b cross-modal alignment 591600259 CNN Transformer
4 ensmbl_3_2 0.259 xu2022_t6b cross-modal alignment 508377539 CNN Transformer
26 park_cau_1 0.075 park2022_t6b cross-modal alignment 732354 CNN10(pretrained-learning)+gru Word2vec
26 park_cau_2 0.075 park2022_t6b cross-modal alignment 732354 CNN10(pretrained-learning)+gru Word2vec
5 Mei_Surrey_1 0.251 mei2022_t6b cross-modal alignment 195420160 CNN Transformer Spec-Augment
6 Mei_Surrey_2 0.250 mei2022_t6b cross-modal alignment 195420160 CNN Transformer Spec-Augment
7 Mei_Surrey_3 0.244 mei2022_t6b cross-modal alignment 188449792 CNN Transformer Spec-Augment
5 Mei_Surrey_4 0.251 mei2022_t6b cross-modal alignment 195420160 CNN Transformer Spec-Augment
27 Baseline 0.061 xie2022_t6b cross-modal alignment 732354 CRNN Word2vec



Detailed characteristics

Rank, Submission code, mAP@10, Technical report, Method scheme/architecture, Amount of parameters, Audio modelling, Acoustic features, Word modelling, Word embeddings, Data augmentation, Sampling rate, Learning set-up, Ensemble method, Loss function, Optimizer, Learning rate, Gradient clipping, Gradient norm for clipping, Metric monitored for training, Dataset(s) used for audio modelling, Dataset(s) used for word modelling
14 wotags 0.212 pellegrini2022_t6b cross-modal alignment 196046781 PaSST PaSST scene embeddings Transformer all-mpnet-base-v2 32.0kHz supervised Triplet Loss Adam 1e-3 validation_loss Clotho Clotho
12 wtags 0.214 pellegrini2022_t6b cross-modal alignment 196046781 PaSST PaSST scene embeddings Transformer all-mpnet-base-v2 32.0kHz supervised Triplet Loss Adam 1e-3 validation_loss Clotho Clotho
13 wtagsAC 0.213 pellegrini2022_t6b cross-modal alignment 196046781 PaSST PaSST scene embeddings Transformer all-mpnet-base-v2 32.0kHz supervised Triplet Loss Adam 1e-3 validation_loss Clotho, AudioCaps Clotho, AudioCaps
10 wtagsACens 0.216 pellegrini2022_t6b cross-modal alignment 196453339 PaSST PaSST scene embeddings Transformer all-mpnet-base-v2 32.0kHz supervised Triplet Loss Adam 1e-3 validation_loss Clotho, AudioCaps Clotho, AudioCaps
23 ATAE 0.114 weck2022_t6b cross-modal alignment 165000000 PANNs PANNs DistilRoBERTa DistilRoBERTa 32.0kHz metric learning Contrastive loss Adam 1e-4 validation_mAP@10 Clotho Clotho
24 ATAE-ET 0.113 weck2022_t6b cross-modal alignment 165000000 PANNs PANNs DistilRoBERTa DistilRoBERTa 32.0kHz metric learning Contrastive loss Adam 1e-4 validation_mAP@10 Clotho, FSD50K Clotho, FSD50K
22 ATAE-EP-F 0.121 weck2022_t6b cross-modal alignment 165000000 PANNs PANNs DistilRoBERTa DistilRoBERTa 32.0kHz metric learning Contrastive loss Adam 1e-4 validation_mAP@10 Clotho, FSD50K Clotho, FSD50K
21 ATAE-NP-F 0.128 weck2022_t6b cross-modal alignment 165000000 PANNs PANNs DistilRoBERTa DistilRoBERTa 32.0kHz metric learning Contrastive loss Adam 1e-4 validation_mAP@10 Clotho, FSD50K Clotho, FSD50K
25 P-GAT 0.097 xiao2022_t6b cross-modal alignment 6799328 PANNs, GAT log-mel energies Word2vec Word2Vec 44.1kHz self-supervised Triplet Loss Adam 1e-4 validation_loss Clotho Clotho
12 lai_pa_1 0.214 lai2022_t6b supervised learning 134111910 ESResNeXt waveform CLIP Transformer audio cropping 44.1kHz supervised symmetric cross entropy loss Adam 1e-5 clip grad norm training_loss Clotho Clotho
16 lai_pa_2 0.209 lai2022_t6b supervised learning 134111910 ESResNeXt waveform CLIP Transformer audio cropping 44.1kHz supervised symmetric cross entropy loss Adam 1e-5 clip grad norm training_loss Clotho Clotho
15 lai_pa_3 0.210 lai2022_t6b supervised learning 134111910 ESResNeXt waveform CLIP Transformer audio cropping 44.1kHz supervised symmetric cross entropy loss Adam 1e-5 clip grad norm training_loss Clotho Clotho
11 lai_pa_4 0.215 lai2022_t6b supervised learning 134111910 ESResNeXt waveform CLIP Transformer audio cropping 44.1kHz supervised symmetric cross entropy loss Adam 1e-5 clip grad norm training_loss Clotho Clotho
9 RELAX_1 0.218 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads several audio experts BERT BERT several sampling rates supervised contrastive ranking loss AdamW 1e-4 validation_loss, validation_ranking_accuracy Clotho, AudioCaps, Freesound Clotho, AudioCaps
14 RELAX_2 0.212 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads several audio experts BERT BERT several sampling rates supervised contrastive ranking loss AdamW 1e-4 validation_loss, validation_ranking_accuracy Clotho, AudioCaps, Freesound Clotho, AudioCaps
10 RELAX_3 0.216 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads several audio experts BERT BERT several sampling rates supervised contrastive ranking loss AdamW 1e-4 validation_loss, validation_ranking_accuracy Clotho, AudioCaps, Freesound Clotho, AudioCaps
8 RELAX_4 0.221 lamort2022_t6b cross-modal alignment 401606726 several audio experts, Transformer heads several audio experts BERT BERT several sampling rates supervised contrastive ranking loss AdamW 1e-4 validation_loss, validation_ranking_accuracy Clotho, AudioCaps, Freesound Clotho, AudioCaps
19 CLAP_1 0.182 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 log-mel energies Transformer learned Spec-Augment 48.0kHz self-supervised Contrastive loss AdamW 1e-3 text-to-audio-mAP@10 BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects
20 CLAP_2 0.180 wu2022_t6b cross-modal alignment 96460249 HTSAT-tiny log-mel energies Transformer learned Spec-Augment 48.0kHz self-supervised Contrastive loss AdamW 1e-3 text-to-audio-mAP@10 BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects
18 CLAP_3 0.183 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 log-mel energies Transformer learned Spec-Augment 48.0kHz self-supervised Contrastive loss AdamW 1e-3 text-to-audio-mAP@10 BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects
17 CLAP_4 0.188 wu2022_t6b cross-modal alignment 244087786 HTSAT-tiny, PANN-14 log-mel energies Transformer learned Spec-Augment 48.0kHz self-supervised Contrastive loss AdamW 1e-3 text-to-audio-mAP@10 BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects
1 ensmbl_5 0.276 xu2022_t6b cross-modal alignment 911895813 CNN waveform Transformer learned 32.0kHz self-supervised InfoNCE loss Adam 2e-5 validation_t2a_R1_R5_R10_mean Clotho, AudioCaps Clotho, AudioCaps
2 ensmbl_4 0.269 xu2022_t6b cross-modal alignment 715582596 CNN waveform Transformer learned 32.0kHz self-supervised InfoNCE loss Adam 2e-5 validation_t2a_R1_R5_R10_mean Clotho, AudioCaps Clotho, AudioCaps
3 ensmbl_3_1 0.265 xu2022_t6b cross-modal alignment 591600259 CNN waveform Transformer learned 32.0kHz self-supervised InfoNCE loss Adam 2e-5 validation_t2a_R1_R5_R10_mean Clotho, AudioCaps Clotho, AudioCaps
4 ensmbl_3_2 0.259 xu2022_t6b cross-modal alignment 508377539 CNN waveform Transformer learned 32.0kHz self-supervised InfoNCE loss Adam 2e-5 validation_t2a_R1_R5_R10_mean Clotho, AudioCaps Clotho, AudioCaps
26 park_cau_1 0.075 park2022_t6b cross-modal alignment 732354 CNN10(pretrained-learning)+gru log-mel energies Word2vec Word2Vec 44.1kHz self-supervised Triplet Loss Adam 1e-3 validation_loss Clotho Clotho
26 park_cau_2 0.075 park2022_t6b cross-modal alignment 732354 CNN10(pretrained-learning)+gru log-mel energies Word2vec Word2Vec 44.1kHz self-supervised Triplet Loss Adam 1e-3 validation_loss Clotho Clotho
5 Mei_Surrey_1 0.251 mei2022_t6b cross-modal alignment 195420160 CNN PANNs Transformer BERT Spec-Augment 44.1kHz supervised NTXent loss AdamW 1e-4 validation_recall Clotho Clotho
6 Mei_Surrey_2 0.250 mei2022_t6b cross-modal alignment 195420160 CNN PANNs Transformer BERT Spec-Augment 44.1kHz supervised NTXent loss AdamW 1e-4 validation_recall Clotho Clotho
7 Mei_Surrey_3 0.244 mei2022_t6b cross-modal alignment 188449792 CNN PANNs Transformer BERT Spec-Augment 44.1kHz supervised NTXent loss AdamW 1e-4 validation_recall Clotho Clotho
5 Mei_Surrey_4 0.251 mei2022_t6b cross-modal alignment 195420160 CNN PANNs Transformer BERT Spec-Augment 44.1kHz supervised NTXent loss AdamW 1e-4 validation_recall Clotho Clotho
27 Baseline 0.061 xie2022_t6b cross-modal alignment 732354 CRNN log-mel energies Word2vec Word2Vec 44.1kHz self-supervised Triplet Loss Adam 1e-3 validation_loss Clotho Clotho



Technical reports

A ResNet-Based Clip Text-to-Audio Retrieval System for DCASE Challenge 2022 Task

Yongquan Lai, Jinsong Pan, Buxian Chen
Ping An Property & Casualty Insurance Company of China, Ltd., China

Abstract

Language-based audio retrieval aims to use language to retrieve audio clips from a given dataset. This technical report presents a text-to-audio retrieval system submitted to Task 6b of the DCASE 2022 challenge. The proposed system is based on AudioCLIP, which incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet and Clotho V2 datasets and introduces a pre-training method to perform bimodal querying. The original AudioCLIP achieved poor retrieval performance on the Clotho V2 dataset in a zero-shot inference setting, so we used AudioCLIP’s model as a weight initializer and fine-tuned the audio encoder and text encoder using a symmetric cross-entropy loss over the similarity measure among the (audio, text) pairs in each mini-batch. Through pre-training and data augmentation, our model achieved an R1 score of 0.35 and an mAP10 score of 0.51 on the Clotho V2 evaluation set.
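
The symmetric cross-entropy loss mentioned in the abstract can be illustrated with the CLIP-style formulation below. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a square similarity matrix in which entry (i, j) scores audio i against text j, with matched pairs on the diagonal, and it omits any temperature scaling. The function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_cross_entropy(similarity):
    """Average of audio-to-text and text-to-audio cross-entropy losses.

    similarity[i][j] is the score of audio i against text j; the matched
    pair for index i is assumed to sit on the diagonal.
    """
    n = len(similarity)
    # Audio-to-text: each row is treated as logits over the texts.
    audio_to_text = -sum(log_softmax(similarity[i])[i] for i in range(n)) / n
    # Text-to-audio: each column is treated as logits over the audio clips.
    columns = [list(column) for column in zip(*similarity)]
    text_to_audio = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


aligned = symmetric_cross_entropy([[5.0, 0.0], [0.0, 5.0]])
misaligned = symmetric_cross_entropy([[0.0, 5.0], [5.0, 0.0]])
uniform = symmetric_cross_entropy([[0.0, 0.0], [0.0, 0.0]])
```

The loss is small when matched pairs score highest along the diagonal, and it equals log(2) for a 2×2 matrix of identical scores.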

System characteristics
Data augmentation audio cropping

Take It Easy: Relaxing Contrastive Ranking Loss with CIDEr

Theodore Lamort de Gail, Dawid Kicinski
Samsung R&D Institute Poland, Warsaw, Poland

Abstract

This report presents our approach and results for task 6B of the DCASE2022 challenge concerning natural-language-based audio retrieval. To match the audio-text pairs, we learn cross-modal embeddings. The audio samples are encoded by an ensemble of four frozen expert models with transformer heads for time aggregation. Captions are encoded using a pre-trained language model. The model is trained with a modified contrastive ranking loss, enhanced with a heuristic caption similarity prior based on the CIDEr metric. We train the system on the AudioCaps and Clotho audio captioning datasets. Furthermore, we use an NLP classifier to gather additional useful audio-caption pairs from Freesound. We achieve 0.48 R10 and 0.23 mAP10 on the Clotho evaluation split (vs. 0.19 and 0.07 respectively for the challenge baseline).

System characteristics
Data augmentation None

Language-Based Audio Retrieval with Pre-trained Models

Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK

Abstract

This technical report presents a language-based audio retrieval system that we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6b. Language-based audio retrieval is a cross-modal task aiming at retrieving a matched audio clip from a pool of candidates given a language query such as a sentence. Cross-modal retrieval tasks are often solved by using deep learning models where the features from different modalities are extracted and then mapped to a joint embedding space. These models usually require a large amount of training data to obtain reasonable performance. However, the audio captioning dataset employed in this audio retrieval task is limited in size. In this work, we propose to use large-scale pre-trained models as both audio and text encoders to mitigate the data scarcity problem and learn the acoustic semantic embeddings. Results on the Clotho dataset show that our proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.

System characteristics
Data augmentation SpecAugment

CAU Submission to DCASE 2022 Task6B: Language-Based Audio Retrieval using Transfer Learning

Jiwon Park, Chaewon Hwang, Il-Youp Kwak, Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea

Abstract

This report proposes a language-based audio retrieval model for the 2022 DCASE audio retrieval challenge. To make use of the features learned from AudioSet data, we used a CNN10 network pre-trained on AudioSet. Following a transfer-learning approach, our proposed model takes this 10-layer CNN and adds a GRU after the CNN module. We used pre-trained Word2Vec as the text encoder [1]. Experiments show that the proposed model achieved an mAP score of 0.091, outperforming the baseline mAP score of 0.067.

System characteristics
Data augmentation None

IRIT-UPS DCASE 2022 Language-Based Audio Retrieval System

Thomas Pellegrini
Computing Sciences, University Toulouse III, Toulouse, France

Abstract

This technical report is a short description of the IRIT-UPS systems used in the DCASE 2022 task 6b dedicated to language-based audio retrieval. Four submissions were made: i) a baseline one using pretrained representations for both the audio signal (scene embeddings extracted with PaSST) and the caption queries (using a large pretrained sentence transformer called all-MPNet), ii) the same baseline system but adding information from AudioSet tags in the audio encoder part, iii) the same as ii) but pretrained on an external dataset (AudioCaps), iv) an ensemble of two systems of type iii).

System characteristics
Data augmentation None

Aligning Audio and Text Embeddings for the Language-Based Audio Retrieval Task of the DCASE Challenge 2022

Benno Weck1,2, Miguel Pérez Fernández1,2, Holger Kirchhoff1, Xavier Serra2
1Huawei Technologies, Munich Research Center, Germany, 2Universitat Pompeu Fabra, Music Technology Group, Spain

Abstract

Our challenge submission shows how large-scale pretrained deep learning models can serve as a strong basis for a cross-modal (text-to-audio) retrieval system. Our system uses embeddings extracted by these models in a general alignment framework to connect matching pairs of audio and text. It processes audio and text separately through different pretrained models, each returning an embedding. Shallow neural networks map the embeddings to a common dimensionality. The cross-modal alignment of the individual embeddings is optimised using a contrastive loss. We employ the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. The embedding extractor model weights remain frozen. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. We find that a two-stage training process consisting of pretraining with noisy data and fine-tuning with the challenge datasets gives the best results for our approach. Our system showcases a simple yet effective method which is superior to the challenge baseline.

System characteristics
Data augmentation None

Text-to-Audio Retrieval via Large-Scale Contrastive Training

Yusong Wu1,2, Tianyu Zhang1,2, Ke Chen3
1University of Montreal, Quebec, Canada, 2Mila, Quebec, Canada, 3University of California San Diego, San Diego, United States

Abstract

Although there is an abundance of data available on the internet, audio data is still limited in terms of dataset size and label precision. Scaling the size of audio datasets would therefore be one of the most valuable ways to develop models for better audio understanding. In this report, we propose a pipeline to better learn the audio understanding mechanism by combining audio data with more abundantly available natural language descriptions. We collected a mixed dataset consisting of over 2 million data pairs and trained a contrastive model based on Contrastive Language–Image Pre-training (CLIP) in order to discover correspondence between audio and text. As audio encoders, we use HTS-AT as a transformer-based model and PANN as a CNN-based model, and as a text encoder, we employ the frozen pre-trained CLIP text encoder. The resulting models are submitted to Task 6B of the DCASE 2022 challenge and achieve a mAP@10 score of at least 0.214.

System characteristics
Data augmentation SpecAugment

Language-Based Audio Retrieval with Pretrained CNN and Graph Attention

Feiyang Xiao1, Jian Guan1∗, Haiyan Lan1, Qiaoxi Zhu2, and Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Abstract

This technical report describes our submission for Task 6B of the DCASE2022 Challenge (language-based audio retrieval). Our audio retrieval system has an audio encoder composed of a pretrained CNN module (i.e., pretrained audio neural network, PANNs) and a novel graph attention module. Its text encoder is the pretrained word2vec model, the same as the baseline system of Task 6B. Experiments show that our audio retrieval system can achieve the mAP10 metric (used for ranking) of 13% on the development-testing dataset of Task 6B.

System characteristics
Data augmentation None

The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training

Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge Task 6. The task involves two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture using pre-trained audio and text encoders. The system is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and then take the audio encoder as the feature extractor. Then a standard sequence-to-sequence model is trained on Clotho based on the pre-trained feature extractor. The captioning model is first trained with a word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.

System characteristics
Data augmentation None