Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and rank them according to how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound. More information about Task 6B: Language-based Audio Retrieval can be found on the task description page.
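To make the retrieval setup concrete, the sketch below ranks candidate audio clips against a single caption query by cosine similarity in a shared embedding space and returns the ten best matches. It is an illustration only (random embeddings stand in for real model outputs), not part of the official evaluation protocol.

```python
import numpy as np

def retrieve_top10(query_emb, audio_embs, audio_ids):
    """Rank candidate audio clips against a caption embedding by cosine
    similarity and return the ten best-matching file ids."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                              # one cosine similarity per clip
    top10 = np.argsort(scores)[::-1][:10]
    return [audio_ids[i] for i in top10]

# toy usage with random embeddings in place of real model outputs
rng = np.random.default_rng(0)
ids = [f"clip_{i:03d}.wav" for i in range(100)]
print(retrieve_top10(rng.normal(size=256), rng.normal(size=(100, 256)), ids))
```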
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed view of system performance, the same table also lists the values achieved for all metrics employed in the task, reported for both the evaluation dataset and the development-testing split; a short sketch of how these metrics are computed is given after the table.
Rank | Submission code | Corresponding author | Technical Report | mAP@10 (Evaluation) | R@1 (Evaluation) | R@5 (Evaluation) | R@10 (Evaluation) | mAP@10 (Dev-test) | R@1 (Dev-test) | R@5 (Dev-test) | R@10 (Dev-test) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ensmbl_5 | Xuenan Xu | xu2022_t6b | 0.276 | 0.176 | 0.416 | 0.536 | 0.299 | 0.188 | 0.447 | 0.587 |
2 | Mei_Surrey_1 | Xinhao Mei | mei2022_t6b | 0.251 | 0.153 | 0.387 | 0.504 | 0.260 | 0.150 | 0.400 | 0.530 |
3 | RELAX_4 | Theodore Lamort de Gail | lamort2022_t6b | 0.221 | 0.131 | 0.343 | 0.466 | 0.226 | 0.132 | 0.350 | 0.478 |
4 | wtagsACens | Thomas Pellegrini | pellegrini2022_t6b | 0.216 | 0.127 | 0.321 | 0.463 | 0.243 | 0.148 | 0.369 | 0.498 |
5 | lai_pa_4 | Yongquan Lai | lai2022_t6b | 0.215 | 0.122 | 0.328 | 0.478 | 0.510 | 0.350 | 0.750 | 0.890 |
6 | CLAP_4 | Yusong Wu | wu2022_t6b | 0.188 | 0.107 | 0.303 | 0.413 | 0.212 | 0.124 | 0.327 | 0.455 |
7 | ATAE-NP-F | Benno Weck | weck2022_t6b | 0.128 | 0.077 | 0.188 | 0.284 | 0.140 | 0.075 | 0.225 | 0.324 |
8 | P-GAT | Feiyang Xiao | xiao2022_t6b | 0.097 | 0.043 | 0.162 | 0.267 | 0.130 | 0.070 | 0.210 | 0.330 |
9 | park_cau_1 | Jiwon Park | park2022_t6b | 0.075 | 0.033 | 0.127 | 0.208 | 0.090 | 0.050 | 0.150 | 0.230 |
10 | Baseline | Huang Xie | xie2022_t6b | 0.061 | 0.026 | 0.102 | 0.176 | 0.068 | 0.032 | 0.109 | 0.188 |
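The metrics in the table above can be computed as sketched below, under the assumption that each caption query has exactly one matching audio file (as in Clotho-based retrieval); with a single relevant item per query, mAP@10 reduces to the reciprocal rank truncated at ten.

```python
def retrieval_metrics(ranked_lists, targets, ks=(1, 5, 10)):
    """ranked_lists[i] is the system's ranking of audio ids for query i,
    targets[i] is the single relevant audio id for that query."""
    n = len(targets)
    recalls = {k: 0.0 for k in ks}
    ap10 = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking[:10]:
            ap10 += 1.0 / (ranking.index(target) + 1)   # reciprocal rank within top 10
        for k in ks:
            recalls[k] += target in ranking[:k]
    return {"mAP@10": ap10 / n, **{f"R@{k}": recalls[k] / n for k in ks}}

# example: the relevant clip is ranked 2nd for the first query, 1st for the second
print(retrieval_metrics([["b", "a", "c"], ["d", "e", "f"]], ["a", "d"]))
```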
Systems ranking
All systems are listed here together with their rankings according to the different metrics. Detailed information on each system is given in the next section.
Rank | Submission code | Technical Report | mAP@10 (Evaluation) | R@1 (Evaluation) | R@5 (Evaluation) | R@10 (Evaluation) | mAP@10 (Dev-test) | R@1 (Dev-test) | R@5 (Dev-test) | R@10 (Dev-test) |
---|---|---|---|---|---|---|---|---|---|---|
14 | wotags | pellegrini2022_t6b | 0.212 | 0.124 | 0.319 | 0.448 | 0.229 | 0.135 | 0.355 | 0.482 |
12 | wtags | pellegrini2022_t6b | 0.214 | 0.128 | 0.332 | 0.445 | 0.234 | 0.138 | 0.364 | 0.485 |
13 | wtagsAC | pellegrini2022_t6b | 0.213 | 0.125 | 0.322 | 0.454 | 0.240 | 0.145 | 0.365 | 0.500 |
10 | wtagsACens | pellegrini2022_t6b | 0.216 | 0.127 | 0.321 | 0.463 | 0.243 | 0.148 | 0.369 | 0.498 |
23 | ATAE | weck2022_t6b | 0.114 | 0.057 | 0.179 | 0.279 | 0.136 | 0.072 | 0.219 | 0.325 |
24 | ATAE-ET | weck2022_t6b | 0.113 | 0.066 | 0.168 | 0.267 | 0.122 | 0.064 | 0.194 | 0.288 |
22 | ATAE-EP-F | weck2022_t6b | 0.121 | 0.069 | 0.178 | 0.281 | 0.127 | 0.068 | 0.202 | 0.299 |
21 | ATAE-NP-F | weck2022_t6b | 0.128 | 0.077 | 0.188 | 0.284 | 0.140 | 0.075 | 0.225 | 0.324 |
25 | P-GAT | xiao2022_t6b | 0.097 | 0.043 | 0.162 | 0.267 | 0.130 | 0.070 | 0.210 | 0.330 |
12 | lai_pa_1 | lai2022_t6b | 0.214 | 0.125 | 0.328 | 0.469 | 0.510 | 0.350 | 0.750 | 0.890 |
16 | lai_pa_2 | lai2022_t6b | 0.209 | 0.118 | 0.326 | 0.462 | 0.510 | 0.340 | 0.750 | 0.890 |
15 | lai_pa_3 | lai2022_t6b | 0.210 | 0.115 | 0.331 | 0.484 | 0.510 | 0.340 | 0.750 | 0.890 |
11 | lai_pa_4 | lai2022_t6b | 0.215 | 0.122 | 0.328 | 0.478 | 0.510 | 0.350 | 0.750 | 0.890 |
9 | RELAX_1 | lamort2022_t6b | 0.218 | 0.128 | 0.337 | 0.467 | 0.231 | 0.137 | 0.354 | 0.484 |
14 | RELAX_2 | lamort2022_t6b | 0.212 | 0.118 | 0.327 | 0.464 | 0.229 | 0.137 | 0.351 | 0.469 |
10 | RELAX_3 | lamort2022_t6b | 0.216 | 0.129 | 0.336 | 0.458 | 0.228 | 0.137 | 0.354 | 0.470 |
8 | RELAX_4 | lamort2022_t6b | 0.221 | 0.131 | 0.343 | 0.466 | 0.226 | 0.132 | 0.350 | 0.478 |
19 | CLAP_1 | wu2022_t6b | 0.182 | 0.104 | 0.295 | 0.388 | 0.214 | 0.126 | 0.335 | 0.452 |
20 | CLAP_2 | wu2022_t6b | 0.180 | 0.100 | 0.284 | 0.385 | 0.211 | 0.124 | 0.326 | 0.451 |
18 | CLAP_3 | wu2022_t6b | 0.183 | 0.102 | 0.289 | 0.401 | 0.212 | 0.124 | 0.331 | 0.451 |
17 | CLAP_4 | wu2022_t6b | 0.188 | 0.107 | 0.303 | 0.413 | 0.212 | 0.124 | 0.327 | 0.455 |
1 | ensmbl_5 | xu2022_t6b | 0.276 | 0.176 | 0.416 | 0.536 | 0.299 | 0.188 | 0.447 | 0.587 |
2 | ensmbl_4 | xu2022_t6b | 0.269 | 0.177 | 0.397 | 0.517 | 0.288 | 0.182 | 0.427 | 0.568 |
3 | ensmbl_3_1 | xu2022_t6b | 0.265 | 0.174 | 0.395 | 0.514 | 0.283 | 0.179 | 0.424 | 0.558 |
4 | ensmbl_3_2 | xu2022_t6b | 0.259 | 0.168 | 0.379 | 0.508 | 0.282 | 0.175 | 0.417 | 0.565 |
26 | park_cau_1 | park2022_t6b | 0.075 | 0.033 | 0.127 | 0.208 | 0.090 | 0.050 | 0.150 | 0.230 |
26 | park_cau_2 | park2022_t6b | 0.075 | 0.037 | 0.117 | 0.204 | 0.090 | 0.050 | 0.150 | 0.230 |
5 | Mei_Surrey_1 | mei2022_t6b | 0.251 | 0.153 | 0.387 | 0.504 | 0.260 | 0.150 | 0.400 | 0.530 |
6 | Mei_Surrey_2 | mei2022_t6b | 0.250 | 0.151 | 0.388 | 0.507 | 0.260 | 0.150 | 0.400 | 0.530 |
7 | Mei_Surrey_3 | mei2022_t6b | 0.244 | 0.150 | 0.382 | 0.497 | 0.240 | 0.140 | 0.370 | 0.500 |
5 | Mei_Surrey_4 | mei2022_t6b | 0.251 | 0.162 | 0.378 | 0.496 | 0.260 | 0.150 | 0.400 | 0.530 |
27 | Baseline | xie2022_t6b | 0.061 | 0.026 | 0.102 | 0.176 | 0.068 | 0.032 | 0.109 | 0.188 |
System characteristics
This section presents the characteristics of the submitted systems. Two tables are provided in the corresponding subsections for easy reference: the first gives an overview of the systems and the second a detailed presentation of each system.
Overview of characteristics
Rank | Submission code | mAP@10 | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---|
14 | wotags | 0.212 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
12 | wtags | 0.214 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
13 | wtagsAC | 0.213 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | Transformer | |
10 | wtagsACens | 0.216 | pellegrini2022_t6b | cross-modal alignment | 196453339 | PaSST | Transformer | |
23 | ATAE | 0.114 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
24 | ATAE-ET | 0.113 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
22 | ATAE-EP-F | 0.121 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
21 | ATAE-NP-F | 0.128 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | DistilRoBERTa | |
25 | P-GAT | 0.097 | xiao2022_t6b | cross-modal alignment | 6799328 | PANNs, GAT | Word2vec | |
12 | lai_pa_1 | 0.214 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
16 | lai_pa_2 | 0.209 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
15 | lai_pa_3 | 0.210 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
11 | lai_pa_4 | 0.215 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | CLIP | audio cropping |
9 | RELAX_1 | 0.218 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
14 | RELAX_2 | 0.212 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
10 | RELAX_3 | 0.216 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
8 | RELAX_4 | 0.221 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | BERT | |
19 | CLAP_1 | 0.182 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
20 | CLAP_2 | 0.180 | wu2022_t6b | cross-modal alignment | 96460249 | HTSAT-tiny | Transformer | Spec-Augment |
18 | CLAP_3 | 0.183 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
17 | CLAP_4 | 0.188 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | Transformer | Spec-Augment |
1 | ensmbl_5 | 0.276 | xu2022_t6b | cross-modal alignment | 911895813 | CNN | Transformer | |
2 | ensmbl_4 | 0.269 | xu2022_t6b | cross-modal alignment | 715582596 | CNN | Transformer | |
3 | ensmbl_3_1 | 0.265 | xu2022_t6b | cross-modal alignment | 591600259 | CNN | Transformer | |
4 | ensmbl_3_2 | 0.259 | xu2022_t6b | cross-modal alignment | 508377539 | CNN | Transformer | |
26 | park_cau_1 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | Word2vec | |
26 | park_cau_2 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | Word2vec | |
5 | Mei_Surrey_1 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
6 | Mei_Surrey_2 | 0.250 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
7 | Mei_Surrey_3 | 0.244 | mei2022_t6b | cross-modal alignment | 188449792 | CNN | Transformer | Spec-Augment |
5 | Mei_Surrey_4 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | Transformer | Spec-Augment |
27 | Baseline | 0.061 | xie2022_t6b | cross-modal alignment | 732354 | CRNN | Word2vec |
Detailed characteristics
Rank | Submission code | mAP@10 | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | wotags | 0.212 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
12 | wtags | 0.214 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
13 | wtagsAC | 0.213 | pellegrini2022_t6b | cross-modal alignment | 196046781 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
10 | wtagsACens | 0.216 | pellegrini2022_t6b | cross-modal alignment | 196453339 | PaSST | PaSST scene embeddings | Transformer | all-mpnet-base-v2 | 32.0kHz | supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | ATAE | 0.114 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho | Clotho | ||||
24 | ATAE-ET | 0.113 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
22 | ATAE-EP-F | 0.121 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
21 | ATAE-NP-F | 0.128 | weck2022_t6b | cross-modal alignment | 165000000 | PANNs | PANNs | DistilRoBERTa | DistilRoBERTa | 32.0kHz | metric learning | Contrastive loss | Adam | 1e-4 | validation_mAP@10 | Clotho, FSD50K | Clotho, FSD50K | ||||
25 | P-GAT | 0.097 | xiao2022_t6b | cross-modal alignment | 6799328 | PANNs, GAT | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-4 | validation_loss | Clotho | Clotho | ||||
12 | lai_pa_1 | 0.214 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
16 | lai_pa_2 | 0.209 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
15 | lai_pa_3 | 0.210 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
11 | lai_pa_4 | 0.215 | lai2022_t6b | supervised learning | 134111910 | ESResNeXt | waveform | CLIP | Transformer | audio cropping | 44.1kHz | supervised | symmetric cross entropy loss | Adam | 1e-5 | clip grad norm | training_loss | Clotho | Clotho | ||
9 | RELAX_1 | 0.218 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
14 | RELAX_2 | 0.212 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
10 | RELAX_3 | 0.216 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
8 | RELAX_4 | 0.221 | lamort2022_t6b | cross-modal alignment | 401606726 | several audio experts, Transformer heads | several audio experts | BERT | BERT | several sampling rates | supervised | contrastive ranking loss | AdamW | 1e-4 | validation_loss, validation_ranking_accuracy | Clotho, AudioCaps, Freesound | Clotho, AudioCaps | ||||
19 | CLAP_1 | 0.182 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | |||
20 | CLAP_2 | 0.180 | wu2022_t6b | cross-modal alignment | 96460249 | HTSAT-tiny | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
18 | CLAP_3 | 0.183 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
17 | CLAP_4 | 0.188 | wu2022_t6b | cross-modal alignment | 244087786 | HTSAT-tiny, PANN-14 | log-mel energies | Transformer | learned | Spec-Augment | 48.0kHz | self-supervised | Contrastive loss | AdamW | 1e-3 | text-to-audio-mAP@10 | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | BBC_Sound_Effects, Clotho, AudioCaps, AudioSet, Free_To_Use_Sounds, We_Sound_Effects, Sonniss_Game_Audio_Effects | | |
1 | ensmbl_5 | 0.276 | xu2022_t6b | cross-modal alignment | 911895813 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
2 | ensmbl_4 | 0.269 | xu2022_t6b | cross-modal alignment | 715582596 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
3 | ensmbl_3_1 | 0.265 | xu2022_t6b | cross-modal alignment | 591600259 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
4 | ensmbl_3_2 | 0.259 | xu2022_t6b | cross-modal alignment | 508377539 | CNN | waveform | Transformer | learned | 32.0kHz | self-supervised | InfoNCE loss | Adam | 2e-5 | validation_t2a_R1_R5_R10_mean | Clotho, AudioCaps | Clotho, AudioCaps | ||||
26 | park_cau_1 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
26 | park_cau_2 | 0.075 | park2022_t6b | cross-modal alignment | 732354 | CNN10(pretrained-learning)+gru | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho | ||||
5 | Mei_Surrey_1 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
6 | Mei_Surrey_2 | 0.250 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
7 | Mei_Surrey_3 | 0.244 | mei2022_t6b | cross-modal alignment | 188449792 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
5 | Mei_Surrey_4 | 0.251 | mei2022_t6b | cross-modal alignment | 195420160 | CNN | PANNs | Transformer | BERT | Spec-Augment | 44.1kHz | supervised | NTXent loss | AdamW | 1e-4 | validation_recall | Clotho | Clotho | |||
27 | Baseline | 0.061 | xie2022_t6b | cross-modal alignment | 732354 | CRNN | log-mel energies | Word2vec | Word2Vec | 44.1kHz | self-supervised | Triplet Loss | Adam | 1e-3 | validation_loss | Clotho | Clotho |
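Most submissions above list cross-modal alignment as their method scheme: an audio encoder and a text encoder map their inputs into a shared embedding space that is trained so matching audio-caption pairs score higher than mismatched ones. The sketch below shows the triplet-loss variant reported for several systems (including the baseline); the margin, embedding size, and batch handling are illustrative assumptions rather than settings taken from any particular submission.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(audio_emb, text_emb, margin=1.0):
    """audio_emb, text_emb: (batch, dim) L2-normalised embeddings where
    row i of each tensor belongs to the same audio-caption pair."""
    sim = text_emb @ audio_emb.t()                 # (batch, batch) similarity matrix
    pos = sim.diag().unsqueeze(1)                  # similarities of matching pairs
    # hinge on every mismatched audio for each caption (and vice versa)
    cost_t2a = (margin + sim - pos).clamp(min=0)
    cost_a2t = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_t2a = cost_t2a.masked_fill(mask, 0)
    cost_a2t = cost_a2t.masked_fill(mask, 0)
    return cost_t2a.mean() + cost_a2t.mean()

# toy batch of 8 pairs with 300-dimensional embeddings
audio = F.normalize(torch.randn(8, 300), dim=-1)
text = F.normalize(torch.randn(8, 300), dim=-1)
print(triplet_ranking_loss(audio, text))
```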
Technical reports
A ResNet-Based Clip Text-to-Audio Retrieval System for DCASE Challenge 2022 Task
Yongquan Lai, Jinsong Pan, Buxian Chen
Ping An Property & Casualty Insurance Company of China, Ltd., China
lai_pa_task6b_1 lai_pa_task6b_2 lai_pa_task6b_3 lai_pa_task6b_4
Abstract
Language-based audio retrieval aims to use language to retrieve audio recordings from a given dataset. This technical report presents a text-to-audio retrieval system submitted to Task 6b of the DCASE 2022 challenge. The proposed system is based on AudioCLIP, which incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet and Clotho v2 datasets and introduces a pre-training method to perform bimodal querying. The original AudioCLIP achieved poor retrieval performance on the Clotho v2 dataset in a zero-shot inference fashion, so we used the AudioCLIP model as a weight initializer and fine-tuned the audio and text encoders with a symmetric cross-entropy loss over the similarity measures of the mini-batch (audio, text) pairs. Through pre-training and data augmentation, our model achieved an R@1 score of 0.35 and an mAP@10 score of 0.51 on the Clotho v2 evaluation set.
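The symmetric cross-entropy objective mentioned above is the standard CLIP-style loss: within a mini-batch, each caption should identify its own audio clip and vice versa. A minimal PyTorch sketch, with an illustrative temperature rather than the submission's actual setting:

```python
import torch
import torch.nn.functional as F

def symmetric_clip_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim); row i is a matching (audio, text) pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (t @ a.t()) / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0))        # the diagonal holds the true matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```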
System characteristics
Data augmentation | audio cropping |
Take It Easy: Relaxing Contrastive Ranking Loss with CIDEr
Theodore Lamort de Gail, Dawid Kicinski
Samsung R&D Institute Poland, Warsaw, Poland
lamort_srpol_task6b_1 lamort_srpol_task6b_2 lamort_srpol_task6b_3 lamort_srpol_task6b_4
Abstract
This report presents our approach and results for task 6B of the DCASE2022 challenge concerning natural-language-based audio retrieval. To match the audio-text pairs, we learn cross-modal embeddings. The audio samples are encoded by an ensemble of four frozen expert models with transformer heads for time aggregation. Captions are encoded using a pre-trained language model. The model is trained with a modified contrastive ranking loss, enhanced with a heuristic caption similarity prior based on the CIDEr metric. We train the system on the AudioCaps and Clotho audio captioning datasets. Furthermore, we use an NLP classifier to gather additional useful audio-caption pairs from Freesound. We achieve 0.48 R10 and 0.23 mAP10 on the Clotho evaluation split (vs. 0.19 and 0.07 respectively for the challenge baseline).
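One plausible reading of the relaxed contrastive ranking loss is to soften the one-hot targets of a standard contrastive loss with a caption-similarity prior (CIDEr in the report). The sketch below is an assumption about how such a relaxation could look, not the authors' exact formulation; `caption_sim` stands in for a precomputed CIDEr similarity matrix over the captions in the batch.

```python
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(audio_emb, text_emb, caption_sim,
                             temperature=0.07, alpha=0.5):
    """caption_sim: (batch, batch) non-negative similarity between the batch captions,
    e.g. CIDEr scores; its rows are renormalised into soft target distributions."""
    logits = (F.normalize(text_emb, dim=-1) @ F.normalize(audio_emb, dim=-1).t()) / temperature
    hard = torch.eye(logits.size(0))
    soft = caption_sim / caption_sim.sum(dim=1, keepdim=True)
    targets = alpha * hard + (1 - alpha) * soft          # relax the one-hot targets
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```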
System characteristics
Data augmentation | None |
Language-Based Audio Retrieval with Pre-trained Models
Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK
Mei_Surrey_task6b_1 Mei_Surrey_task6b_2 Mei_Surrey_task6b_3 Mei_Surrey_task6b_4
Abstract
This technical report presents a language-based audio retrieval system that we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6b. Language-based audio retrieval is a cross-modal task aiming at retrieving a matched audio clip from a pool of candidates given a language query such as a sentence. Cross-modal retrieval tasks are often solved by using deep learning models where the features from different modalities are extracted and then mapped to a joint embedding space. These models usually require a large amount of training data to obtain reasonable performance. However, the audio captioning dataset employed in this audio retrieval task is limited in size. In this work, we propose to use large-scale pre-trained models as both audio and text encoders to mitigate the data scarcity problem and learn the acoustic semantic embeddings. Results on the Clotho dataset show that our proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.
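As an illustration of using large pre-trained models as encoders, the caption branch could be built on an off-the-shelf BERT model from the `transformers` library and projected into the joint embedding space; the audio branch (a pre-trained PANNs-style CNN in the report) would be projected analogously. The checkpoint name, pooling strategy, and projection size below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
text_proj = nn.Linear(768, 1024)        # map BERT features into the joint space

def encode_captions(captions):
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (batch, tokens, 768)
    pooled = hidden.mean(dim=1)                       # simple mean pooling (an assumption)
    return torch.nn.functional.normalize(text_proj(pooled), dim=-1)

print(encode_captions(["birds chirping near a stream",
                       "a car passes by in the rain"]).shape)
```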
System characteristics
Data augmentation | SpecAugment |
CAU Submission to DCASE 2022 Task6B: Language-Based Audio Retrieval using Transfer Learning
Jiwon Park, Chaewon Hwang, Il-Youp Kwak, Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
park_cau_task6b_1 park_cau_task6b_2
Abstract
This report proposes a language-based audio retrieval model for the 2022 DCASE audio retrieval challenge. To make use of features learned from AudioSet, we utilized a CNN10 network pre-trained on AudioSet. With this transfer learning, our proposed model takes the 10-layer CNN and adds a GRU after the CNN module. We used a pre-trained Word2Vec model as the text encoder [1]. Experiments show that the proposed model achieved an mAP score of 0.091, outperforming the baseline mAP score of 0.067.
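A rough sketch of the audio-encoder shape described above, i.e. a pre-trained CNN front-end followed by a GRU; it illustrates the idea only, and `cnn_backbone` (here an identity module so the example runs) stands in for the authors' pre-trained CNN10.

```python
import torch
import torch.nn as nn

class CnnGruEncoder(nn.Module):
    """Pre-trained CNN front-end followed by a GRU over the time axis."""
    def __init__(self, cnn_backbone, cnn_dim=512, out_dim=300):
        super().__init__()
        self.cnn = cnn_backbone                     # e.g. a CNN10 pre-trained on AudioSet
        self.gru = nn.GRU(cnn_dim, out_dim, batch_first=True)

    def forward(self, mel):                         # mel: (batch, time, n_mels)
        feats = self.cnn(mel)                       # (batch, time', cnn_dim) frame features
        _, last_hidden = self.gru(feats)            # summarise the frame sequence
        return torch.nn.functional.normalize(last_hidden.squeeze(0), dim=-1)

# toy run with an identity "backbone" so the sketch is self-contained
encoder = CnnGruEncoder(nn.Identity(), cnn_dim=64)
print(encoder(torch.randn(2, 100, 64)).shape)       # torch.Size([2, 300])
```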
System characteristics
Data augmentation | None |
IRIT-UPS DCASE 2022 Language-Based Audio Retrieval System
Thomas Pellegrini
Computing Sciences, University Toulouse III, Toulouse, France
Abstract
This technical report is a short description of the IRIT-UPS systems used in DCASE 2022 Task 6b, dedicated to language-based audio retrieval. Four submissions were made: i) a baseline one using pretrained representations for both the audio signal (scene embeddings extracted with PaSST) and the caption queries (using a large pretrained sentence transformer called all-MPNet), ii) the same baseline system but adding information from AudioSet tags in the audio encoder part, iii) the same as ii) but pretrained on an external dataset (AudioCaps), and iv) an ensemble of two systems of type iii).
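The caption side of systems i)-iv) can be reproduced with the `sentence-transformers` package, which provides the all-mpnet-base-v2 model named in the report; the PaSST audio embeddings and the learned audio-text alignment are assumed to be handled separately.

```python
from sentence_transformers import SentenceTransformer

# all-mpnet-base-v2 is the sentence encoder named in the report; the call below
# returns one 768-dimensional embedding per caption query.
text_encoder = SentenceTransformer("all-mpnet-base-v2")
caption_embeddings = text_encoder.encode(
    ["water trickles over rocks while birds sing",
     "a crowd applauds at the end of a speech"],
    normalize_embeddings=True,
)
print(caption_embeddings.shape)        # (2, 768)
```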
System characteristics
Data augmentation | None |
Aligning Audio and Text Embeddings for the Language-Based Audio Retrieval Task of the DCASE Challenge 2022
Benno Weck1,2, Miguel Pérez Fernández1,2, Holger Kirchhoff1, Xavier Serra2
1Huawei Technologies, Munich Research Center, Germany, 2Universitat Pompeu Fabra, Music Technology Group, Spain
Weck_Huawei_task6b_1 Weck_Huawei_task6b_2 Weck_Huawei_task6b_3 Weck_Huawei_task6b_4
Abstract
Our challenge submission shows how large-scale pretrained deep learning models can serve as a strong basis for a cross-modal (text-to-audio) retrieval system. Our system uses embeddings extracted by these models in a general alignment framework to connect matching pairs of audio and text. It processes audio and text separately through different pretrained models, each returning an embedding. Shallow neural networks map the embeddings to a common dimensionality. The cross-modal alignment of the individual embeddings is optimised using a contrastive loss. We employ the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. The embedding extractor model weights remain frozen. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. We find that a two-stage training process consisting of pretraining with noisy data and fine-tuning with the challenge datasets gives the best results for our approach. Our system showcases a simple yet effective method which is superior to the challenge baseline.
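A minimal sketch of the alignment framework described: both embedding extractors stay frozen and only shallow projection networks are trained to map the two embedding types to a common dimensionality. The layer sizes are illustrative, not the submission's configuration.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Shallow network mapping a frozen extractor's embedding to the shared space."""
    def __init__(self, in_dim, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return torch.nn.functional.normalize(self.net(x), dim=-1)

audio_head = ProjectionHead(in_dim=2048)   # e.g. a PANNs clip-level embedding size
text_head = ProjectionHead(in_dim=768)     # e.g. a DistilRoBERTa sentence embedding size

# only the shallow heads are optimised; the pretrained extractors stay frozen
optimizer = torch.optim.Adam(
    list(audio_head.parameters()) + list(text_head.parameters()), lr=1e-4)
```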
System characteristics
Data augmentation | None |
Text-to-Audio Retrieval via Large-Scale Contrastive Training
Yusong Wu1,2, Tianyu Zhang1,2, Ke Chen3
1University of Montreal, Quebec, Canada, 2Mila, Quebec, Canada, 3University of California San Diego, San Diego, United States
Wu_Mila_task6b_1 Wu_Mila_task6b_2 Wu_Mila_task6b_3 Wu_Mila_task6b_4
Abstract
Although there is an abundance of data available on the internet, audio data is still limited in terms of dataset size and label precision. Scaling the size of audio datasets would therefore be one of the most valuable ways to develop models for better audio understanding. In this report, we propose a pipeline to better learn the audio understanding mechanism by combining audio data with more abundantly available natural language descriptions. We collected a mixed dataset consisting of over 2 million data pairs and trained a contrastive model based on Contrastive Language–Image Pre-training (CLIP) in order to discover correspondences between audio and text. As an audio encoder, we use HTS-AT as a transformer-based model and PANN as a CNN-based model, and as a text encoder we employ the frozen pre-trained CLIP text encoder. The resulting models are submitted to Task 6B of the DCASE 2022 challenge and achieve an mAP@10 score of at least 0.214.
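The frozen CLIP text branch mentioned in the abstract can be loaded through the `transformers` library as sketched below; the audio branch (HTS-AT or PANN) would then be trained against these fixed text features. The public checkpoint name is an assumption about which CLIP variant was used.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.requires_grad_(False)        # keep the text tower frozen during training

with torch.no_grad():
    batch = tokenizer(["rain hitting a tin roof", "a dog barking in the distance"],
                      padding=True, return_tensors="pt")
    text_features = text_encoder(**batch).pooler_output    # (2, 512)
print(text_features.shape)
```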
System characteristics
Data augmentation | SpecAugment |
Language-Based Audio Retrieval with Pretrained CNN and Graph Attention
Feiyang Xiao1, Jian Guan1∗, Haiyan Lan1, Qiaoxi Zhu2, and Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Guan_HEU_task6b_1
Abstract
This technical report describes our submission for Task 6B of the DCASE2022 Challenge (language-based audio retrieval). Our audio retrieval system has an audio encoder composed of a pretrained CNN module (i.e., pretrained audio neural networks, PANNs) and a novel graph attention module. Its text encoder is the pretrained Word2Vec model, the same as in the baseline system of Task 6B. Experiments show that our audio retrieval system achieves an mAP@10 (the metric used for ranking) of 13% on the development-testing split of Task 6B.
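As an illustration of a graph attention module operating on frame-level audio features, the sketch below uses a single attention head over a fully connected graph of frames; the report's actual graph construction and attention design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    """Single-head graph attention over frame-level audio embeddings, treating
    the frames as nodes of a fully connected graph (an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, frames):                      # frames: (T, dim)
        h = self.proj(frames)                       # (T, dim) projected nodes
        T = h.size(0)
        hi = h.unsqueeze(1).expand(T, T, -1)        # node i repeated along axis 1
        hj = h.unsqueeze(0).expand(T, T, -1)        # node j repeated along axis 0
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)            # attention weights over neighbours
        return alpha @ h                            # (T, dim) refined frame embeddings

layer = SimpleGraphAttention(dim=128)
print(layer(torch.randn(50, 128)).shape)            # torch.Size([50, 128])
```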
System characteristics
Data augmentation | None |
The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training
Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
xu_sjtu_task6b_1 xu_sjtu_task6b_2 xu_sjtu_task6b_3 xu_sjtu_task6b_4
Judges’ award
Abstract
This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 6, which comprises two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture with pre-trained audio and text encoders. The system is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and then take the audio encoder as the feature extractor. A standard sequence-to-sequence model is then trained on Clotho on top of the pre-trained feature extractor. The captioning model is first trained with a word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.
Awards: Judges’ award
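Submissions ensmbl_3/4/5 are ensembles of several bi-encoder models. One common way to fuse such retrieval systems, shown below as an assumption rather than the report's exact procedure, is to average the per-model text-to-audio similarity matrices before ranking.

```python
import numpy as np

def ensemble_rank(similarity_matrices, audio_ids, top_k=10):
    """similarity_matrices: list of (n_queries, n_audio) score matrices, one per model;
    returns the fused top-k audio ids for every query."""
    fused = np.mean(np.stack(similarity_matrices), axis=0)
    rankings = np.argsort(-fused, axis=1)[:, :top_k]
    return [[audio_ids[j] for j in row] for row in rankings]
```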
System characteristics
Data augmentation | None |