Task description
Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound. A more detailed task description can be found on the task description page.
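In practice, most systems submitted to this task embed audio and text into a shared space and rank recordings by similarity to the query embedding. The following is a minimal sketch of that retrieval step; the function and variable names are illustrative, not taken from any particular submission.

```python
import numpy as np

def retrieve_top10(query_embedding, audio_embeddings, audio_ids):
    """Rank audio files by cosine similarity to a text query embedding."""
    # Normalize so the dot product equals cosine similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    a = audio_embeddings / np.linalg.norm(audio_embeddings, axis=1, keepdims=True)
    scores = a @ q                      # one similarity score per audio file
    top10 = np.argsort(-scores)[:10]    # indices of the 10 best matches
    return [(audio_ids[i], float(scores[i])) for i in top10]
```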
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed exploration of system performance, the same table lists the values achieved for all the metrics employed in the task, on both the evaluation dataset (Eval.) and the development-testing split (Dev-test).
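Since all rankings below rest on mAP@10 and R@k, here is a minimal sketch of how these metrics are typically computed for this task, assuming each caption has exactly one relevant recording (as in the Clotho retrieval setup); helper names are illustrative.

```python
def map_at_10(ranked_ids, relevant_id):
    """Average precision at 10 for a single query with one relevant item.

    With one relevant audio per caption, AP@10 reduces to 1/rank when the
    item appears in the top 10, and 0 otherwise.
    """
    for rank, audio_id in enumerate(ranked_ids[:10], start=1):
        if audio_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant audio appears among the first k results, else 0."""
    return float(relevant_id in ranked_ids[:k])

# Corpus-level scores are means over all queries, e.g.:
# mAP@10 = sum(map_at_10(r, g) for r, g in results) / len(results)
```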
| Submission code | Best official system rank | Corresponding author | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev-test mAP@10 | Dev-test R@1 | Dev-test R@5 | Dev-test R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Primus_CP-JKU_6b_1 | 1 | Paul Primus | primus2023_t6b | 0.401 | 0.283 | 0.553 | 0.681 | 0.414 | 0.289 | 0.587 | 0.711 | |
Lamort_SRPOL_task6B_3 | 5 | Theodore Lamort de Gail | lamort2023_t6b | 0.281 | 0.179 | 0.426 | 0.548 | 0.297 | 0.191 | 0.436 | 0.579 | |
Wang_NTU_task6b_3 | 7 | Chung-Che Wang | wang2023_t6b | 0.273 | 0.162 | 0.428 | 0.548 | 0.314 | 0.186 | 0.452 | 0.585 | |
guan_heu_task6b_1 | 10 | Jian Guan | guan2023_t6b | 0.262 | 0.160 | 0.403 | 0.518 | 0.305 | 0.194 | 0.458 | 0.590 | |
fan_lb_task6b_3 | 17 | Ziye Fan | fan2023_t6b | 0.243 | 0.144 | 0.373 | 0.499 | 0.265 | 0.160 | 0.402 | 0.544 | |
Park_CAU_task6b_3 | 18 | Jiwon Park | park2023_t6b | 0.240 | 0.151 | 0.354 | 0.473 | 0.236 | 0.139 | 0.363 | 0.497 |
labbe_irit_task6b_4 | 23 | Etienne Labbe | labbe2023_t6b | 0.234 | 0.146 | 0.339 | 0.475 | 0.269 | 0.169 | 0.399 | 0.523 | |
Baseline | 26 | Huang Xie | xie2023_t6b | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 | |
shah_cmu_task6b_1 | 30 | Ankit Shah | shah2023_t6b | 0.004 | 0.003 | 0.007 | 0.011 | 0.250 | 0.154 | 0.381 | 0.511 | |
kim_snu_task6b_2 | 31 | Jinhee Kim | kim2023_t6b | 0.004 | 0.002 | 0.005 | 0.013 | 0.270 | 0.168 | 0.409 | 0.551 |
Systems ranking
Listed here are all systems and their ranking according to the different metrics. Detailed information on each system is given in the next section.
| Submission code | Best official system rank | Technical Report | Eval. mAP@10 | Eval. R@1 | Eval. R@5 | Eval. R@10 | Dev-test mAP@10 | Dev-test R@1 | Dev-test R@5 | Dev-test R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Primus_CP-JKU_6b_1 | 1 | primus2023_t6b | 0.401 | 0.283 | 0.553 | 0.681 | 0.414 | 0.289 | 0.587 | 0.711 | |
Primus_CP-JKU_6b_3 | 2 | primus2023_t6b | 0.363 | 0.250 | 0.518 | 0.648 | 0.386 | 0.261 | 0.553 | 0.693 | |
Primus_CP-JKU_6b_2 | 3 | primus2023_t6b | 0.358 | 0.245 | 0.512 | 0.637 | 0.380 | 0.255 | 0.551 | 0.686 | |
Primus_CP-JKU_6b_4 | 4 | primus2023_t6b | 0.341 | 0.229 | 0.501 | 0.626 | 0.363 | 0.244 | 0.525 | 0.662 | |
Lamort_SRPOL_task6B_3 | 5 | lamort2023_t6b | 0.281 | 0.179 | 0.426 | 0.548 | 0.297 | 0.191 | 0.436 | 0.579 | |
Lamort_SRPOL_task6B_2 | 6 | lamort2023_t6b | 0.275 | 0.172 | 0.412 | 0.543 | 0.297 | 0.191 | 0.436 | 0.579 | |
Wang_NTU_task6b_3 | 7 | wang2023_t6b | 0.273 | 0.162 | 0.428 | 0.548 | 0.314 | 0.186 | 0.452 | 0.585 | |
Wang_NTU_task6b_4 | 8 | wang2023_t6b | 0.265 | 0.167 | 0.385 | 0.528 | 0.313 | 0.191 | 0.441 | 0.581 | |
Lamort_SRPOL_task6B_4 | 9 | lamort2023_t6b | 0.263 | 0.168 | 0.396 | 0.525 | 0.297 | 0.191 | 0.436 | 0.579 | |
guan_heu_task6b_1 | 10 | guan2023_t6b | 0.262 | 0.160 | 0.403 | 0.518 | 0.305 | 0.194 | 0.458 | 0.590 | |
guan_heu_task6b_3 | 11 | guan2023_t6b | 0.261 | 0.163 | 0.392 | 0.522 | 0.309 | 0.197 | 0.461 | 0.594 | |
Lamort_SRPOL_task6B_1 | 12 | lamort2023_t6b | 0.261 | 0.163 | 0.402 | 0.527 | 0.297 | 0.191 | 0.436 | 0.579 | |
guan_heu_task6b_2 | 13 | guan2023_t6b | 0.260 | 0.156 | 0.392 | 0.524 | 0.309 | 0.198 | 0.462 | 0.592 | |
guan_heu_task6b_4 | 14 | guan2023_t6b | 0.259 | 0.156 | 0.395 | 0.523 | 0.312 | 0.200 | 0.464 | 0.599 | |
Wang_NTU_task6b_1 | 15 | wang2023_t6b | 0.256 | 0.165 | 0.366 | 0.506 | 0.260 | 0.153 | 0.407 | 0.544 | |
Wang_NTU_task6b_2 | 16 | wang2023_t6b | 0.245 | 0.147 | 0.370 | 0.504 | 0.271 | 0.167 | 0.410 | 0.539 | |
fan_lb_task6b_3 | 17 | fan2023_t6b | 0.243 | 0.144 | 0.373 | 0.499 | 0.265 | 0.160 | 0.402 | 0.544 | |
Park_CAU_task6b_3 | 18 | park2023_t6b | 0.240 | 0.151 | 0.354 | 0.473 | 0.236 | 0.139 | 0.363 | 0.497 |
fan_lb_task6b_1 | 19 | fan2023_t6b | 0.239 | 0.143 | 0.376 | 0.499 | 0.253 | 0.147 | 0.401 | 0.542 | |
Park_CAU_task6b_1 | 20 | park2023_t6b | 0.239 | 0.151 | 0.352 | 0.472 | 0.245 | 0.147 | 0.376 | 0.507 |
fan_lb_task6b_2 | 21 | fan2023_t6b | 0.238 | 0.144 | 0.363 | 0.494 | 0.262 | 0.154 | 0.406 | 0.551 | |
fan_lb_task6b_4 | 22 | fan2023_t6b | 0.235 | 0.137 | 0.369 | 0.499 | 0.251 | 0.146 | 0.390 | 0.533 | |
labbe_irit_task6b_4 | 23 | labbe2023_t6b | 0.234 | 0.146 | 0.339 | 0.475 | 0.269 | 0.169 | 0.399 | 0.523 | |
Park_CAU_task6b_2 | 24 | park2023_t6b | 0.232 | 0.143 | 0.346 | 0.466 | 0.240 | 0.143 | 0.374 | 0.501 |
labbe_irit_task6b_3 | 25 | labbe2023_t6b | 0.213 | 0.123 | 0.328 | 0.441 | 0.257 | 0.160 | 0.384 | 0.512 | |
Baseline | 26 | xie2023_t6b | 0.211 | 0.121 | 0.332 | 0.459 | 0.222 | 0.130 | 0.343 | 0.480 | |
labbe_irit_task6b_2 | 27 | labbe2023_t6b | 0.204 | 0.124 | 0.312 | 0.429 | 0.231 | 0.140 | 0.353 | 0.483 | |
Park_CAU_task6b_4 | 28 | park2023_t6b | 0.177 | 0.088 | 0.285 | 0.405 | 0.193 | 0.106 | 0.306 | 0.433 |
labbe_irit_task6b_1 | 29 | labbe2023_t6b | 0.159 | 0.085 | 0.247 | 0.359 | 0.186 | 0.106 | 0.288 | 0.419 | |
shah_cmu_task6b_1 | 30 | shah2023_t6b | 0.004 | 0.003 | 0.007 | 0.011 | 0.250 | 0.154 | 0.381 | 0.511 | |
kim_snu_task6b_2 | 31 | kim2023_t6b | 0.004 | 0.002 | 0.005 | 0.013 | 0.270 | 0.168 | 0.409 | 0.551 | |
kim_snu_task6b_3 | 32 | kim2023_t6b | 0.004 | 0.001 | 0.006 | 0.012 | 0.280 | 0.175 | 0.422 | 0.566 | |
kim_snu_task6b_4 | 33 | kim2023_t6b | 0.004 | 0.002 | 0.005 | 0.013 | 0.271 | 0.169 | 0.410 | 0.549 | |
kim_snu_task6b_1 | 34 | kim2023_t6b | 0.003 | 0.001 | 0.005 | 0.012 | 0.279 | 0.172 | 0.419 | 0.562 |
System characteristics
In this section you can find the characteristics of the submitted systems, presented in two tables in the corresponding subsections. The first table gives an overview of the systems; the second presents each system in detail.
Overview of characteristics
| Rank | Submission code | mAP@10 | Technical Report | Amount of parameters | Audio modelling | Text modelling | Loss function |
| --- | --- | --- | --- | --- | --- | --- | --- |
1 | Primus_CP-JKU_6b_1 | 0.401 | primus2023_t6b | 3003000000 | PaSST | BERT, RoBERTa | NT-Xent loss |
2 | Primus_CP-JKU_6b_3 | 0.363 | primus2023_t6b | 441200000 | PaSST | BERT, RoBERTa | NT-Xent loss |
3 | Primus_CP-JKU_6b_2 | 0.358 | primus2023_t6b | 441200000 | PaSST | BERT, RoBERTa | NT-Xent loss |
4 | Primus_CP-JKU_6b_4 | 0.341 | primus2023_t6b | 441200000 | PaSST | BERT, RoBERTa | NT-Xent loss |
5 | Lamort_SRPOL_task6B_3 | 0.281 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | MPNet | Triplet loss |
6 | Lamort_SRPOL_task6B_2 | 0.275 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | MPNet | Triplet loss |
7 | Wang_NTU_task6b_3 | 0.273 | wang2023_t6b | 653661759 | VALOR | BERT | Contrastive loss |
8 | Wang_NTU_task6b_4 | 0.265 | wang2023_t6b | 653661759 | VALOR | BERT | Contrastive loss |
9 | Lamort_SRPOL_task6B_4 | 0.263 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | MPNet | Triplet loss |
10 | guan_heu_task6b_1 | 0.262 | guan2023_t6b | 1686974464 | PANNs-CNN14, PANNs-CNN14-Attention | BERT, RoBERTa | InfoNCE loss |
11 | guan_heu_task6b_3 | 0.261 | guan2023_t6b | 1686974464 | PANNs-CNN14, PANNs-CNN14-Attention | BERT, RoBERTa | InfoNCE loss |
12 | Lamort_SRPOL_task6B_1 | 0.261 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | MPNet | Triplet loss |
13 | guan_heu_task6b_2 | 0.260 | guan2023_t6b | 1313971200 | PANNs-CNN14, PANNs-CNN14-Attention | BERT, RoBERTa | InfoNCE loss |
14 | guan_heu_task6b_4 | 0.259 | guan2023_t6b | 1313971200 | PANNs-CNN14, PANNs-CNN14-Attention | BERT, RoBERTa | InfoNCE loss |
15 | Wang_NTU_task6b_1 | 0.256 | wang2023_t6b | 97100280 | PANNs-CNN14 | BERT | NT-Xent loss |
16 | Wang_NTU_task6b_2 | 0.245 | wang2023_t6b | 97100280 | PANNs-CNN14 | BERT | NT-Xent loss |
17 | fan_lb_task6b_3 | 0.243 | fan2023_t6b | 271483819 | BEATs | Qformer | InfoNCE loss |
18 | Park_CAU_task6b_3 | 0.240 | park2023_t6b | 560958124 | PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 | Sentence-BERT | Triplet loss |
19 | fan_lb_task6b_1 | 0.239 | fan2023_t6b | 271483819 | BEATs | Qformer | InfoNCE loss |
20 | Park_CAU_task6b_1 | 0.239 | park2023_t6b | 937970420 | PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 | Sentence-BERT | Triplet loss |
21 | fan_lb_task6b_2 | 0.238 | fan2023_t6b | 271483819 | BEATs | Qformer | InfoNCE loss |
22 | fan_lb_task6b_4 | 0.235 | fan2023_t6b | 271483819 | BEATs | Qformer | InfoNCE loss |
23 | labbe_irit_task6b_4 | 0.234 | labbe2023_t6b | 98064347 | CNN | Transformer | Cross-Entropy loss |
24 | Park_CAU_task6b_2 | 0.232 | park2023_t6b | 565518444 | PANNs-CNN14 | Sentence-BERT | Triplet loss |
25 | labbe_irit_task6b_3 | 0.213 | labbe2023_t6b | 42191083 | CNN | Transformer | Cross-Entropy loss |
26 | Baseline | 0.211 | xie2023_t6b | 80902892 | PANNs-CNN14 | Sentence-BERT | InfoNCE loss |
27 | labbe_irit_task6b_2 | 0.204 | labbe2023_t6b | 40133440 | CNN | Transformer | Cross-Entropy loss |
28 | Park_CAU_task6b_4 | 0.177 | park2023_t6b | 188506148 | PANNs-CNN14 | Sentence-BERT | InfoNCE+VICReg loss |
29 | labbe_irit_task6b_1 | 0.159 | labbe2023_t6b | 87715793 | CNN | Transformer | Cross-Entropy loss |
30 | shah_cmu_task6b_1 | 0.004 | shah2023_t6b | 647256 | CLAP | RoBERTa | InfoNCE loss |
31 | kim_snu_task6b_2 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | BERT | NT-Xent loss |
32 | kim_snu_task6b_3 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | BERT | NT-Xent loss |
33 | kim_snu_task6b_4 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | BERT | NT-Xent loss |
34 | kim_snu_task6b_1 | 0.003 | kim2023_t6b | 196469248 | PANNs-CNN14 | BERT | NT-Xent loss |
Detailed characteristics
| Rank | Submission code | mAP@10 | Technical Report | Amount of parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Dataset(s) used for training | Dataset(s) used for validation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | Primus_CP-JKU_6b_1 | 0.401 | primus2023_t6b | 3003000000 | PaSST | log-mel energies | BERT, RoBERTa | patchout | synonym replacement, random deletions, ChatGPT | 32kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
2 | Primus_CP-JKU_6b_3 | 0.363 | primus2023_t6b | 441200000 | PaSST | log-mel energies | BERT, RoBERTa | patchout | synonym replacement, random deletions, ChatGPT | 32kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
3 | Primus_CP-JKU_6b_2 | 0.358 | primus2023_t6b | 441200000 | PaSST | log-mel energies | BERT, RoBERTa | patchout | synonym replacement, random deletions | 32kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
4 | Primus_CP-JKU_6b_4 | 0.341 | primus2023_t6b | 441200000 | PaSST | log-mel energies | BERT, RoBERTa | patchout | synonym replacement, random deletions | 32kHz | NT-Xent loss | Adam | mAP | Clotho-development, AudioCaps, WavCaps | Clotho-validation |
5 | Lamort_SRPOL_task6B_3 | 0.281 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | log-mel energies | MPNet | MixGen, noise mix, cutout | paraphrase | 16kHz | Triplet loss | AdamW | loss, recall | Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions | Clotho-evaluation |
6 | Lamort_SRPOL_task6B_2 | 0.275 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | log-mel energies | MPNet | MixGen, noise mix, cutout | paraphrase | 16kHz | Triplet loss | AdamW | loss, recall | Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions | Clotho-evaluation |
7 | Wang_NTU_task6b_3 | 0.273 | wang2023_t6b | 653661759 | VALOR | fbank | BERT | | | 44.1kHz | Contrastive loss | AdamW | loss | Clotho-development | Clotho-validation |
8 | Wang_NTU_task6b_4 | 0.265 | wang2023_t6b | 653661759 | VALOR | fbank | BERT | | | 44.1kHz | Contrastive loss | AdamW | loss | Clotho-development | Clotho-validation |
9 | Lamort_SRPOL_task6B_4 | 0.263 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | log-mel energies | MPNet | MixGen, noise mix, cutout | paraphrase | 16kHz | Triplet loss | AdamW | loss, recall | Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions | Clotho-evaluation |
10 | guan_heu_task6b_1 | 0.262 | guan2023_t6b | 1686974464 | PANNs-CNN14, PANNs-CNN14-Attention | log-mel energies | BERT, RoBERTa | SpecAugment | augmentation by GPT-3.5 | 32kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, WavText5K, AudioCaps | Clotho-evaluation |
11 | guan_heu_task6b_3 | 0.261 | guan2023_t6b | 1686974464 | PANNs-CNN14, PANNs-CNN14-Attention | log-mel energies | BERT, RoBERTa | SpecAugment | augmentation by GPT-3.5 | 32kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, WavText5K, AudioCaps | Clotho-evaluation |
12 | Lamort_SRPOL_task6B_1 | 0.261 | lamort2023_t6b | 279940464 | BEATs, VGGish, CLAP | log-mel energies | MPNet | MixGen, noise mix, cutout | paraphrase | 16kHz | Triplet loss | AdamW | loss, recall | Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions | Clotho-evaluation |
13 | guan_heu_task6b_2 | 0.260 | guan2023_t6b | 1313971200 | PANNs-CNN14, PANNs-CNN14-Attention | log-mel energies | BERT, RoBERTa | SpecAugment | augmentation by GPT-3.5 | 32kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, WavText5K, AudioCaps | Clotho-evaluation |
14 | guan_heu_task6b_4 | 0.259 | guan2023_t6b | 1313971200 | PANNs-CNN14, PANNs-CNN14-Attention | log-mel energies | BERT, RoBERTa | SpecAugment | augmentation by GPT-3.5 | 32kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, WavText5K, AudioCaps | Clotho-evaluation |
15 | Wang_NTU_task6b_1 | 0.256 | wang2023_t6b | 97100280 | PANNs-CNN14 | log-mel energies | BERT | SpecAugment | random deletion | 32kHz | NT-Xent loss | Adam | mrr | Clotho-development | Clotho-validation |
16 | Wang_NTU_task6b_2 | 0.245 | wang2023_t6b | 97100280 | PANNs-CNN14 | log-mel energies | BERT | SpecAugment | random deletion | 32kHz | NT-Xent loss | Adam | mrr | Clotho-development | Clotho-validation |
17 | fan_lb_task6b_3 | 0.243 | fan2023_t6b | 271483819 | BEATs | mel energies | Qformer | | | 16kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation | Clotho-evaluation |
18 | Park_CAU_task6b_3 | 0.240 | park2023_t6b | 560958124 | PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 | PANNs | Sentence-BERT | SpecAugment | | 32kHz | Triplet loss | Adam | loss | Clotho-development | Clotho-validation |
19 | fan_lb_task6b_1 | 0.239 | fan2023_t6b | 271483819 | BEATs | mel energies | Qformer | | | 16kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation | Clotho-evaluation |
20 | Park_CAU_task6b_1 | 0.239 | park2023_t6b | 937970420 | PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 | PANNs | Sentence-BERT | SpecAugment | | 32kHz | Triplet loss | Adam | loss | Clotho-development | Clotho-validation |
21 | fan_lb_task6b_2 | 0.238 | fan2023_t6b | 271483819 | BEATs | mel energies | Qformer | | | 16kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation | Clotho-evaluation |
22 | fan_lb_task6b_4 | 0.235 | fan2023_t6b | 271483819 | BEATs | mel energies | Qformer | | | 16kHz | InfoNCE loss | Adam | recall | Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation | Clotho-evaluation |
23 | labbe_irit_task6b_4 | 0.234 | labbe2023_t6b | 98064347 | CNN | ConvNeXt-tiny | Transformer | mixup, SpecAugment, label_smoothing | | 32kHz | Cross-Entropy loss | AdamW | FENSE | Clotho-development, AudioCaps-train, MACS, WavCaps (without FreeSound) | Clotho-validation |
24 | Park_CAU_task6b_2 | 0.232 | park2023_t6b | 565518444 | PANNs-CNN14 | PANNs | Sentence-BERT | SpecAugment | | 32kHz | Triplet loss | Adam | loss | Clotho-development | Clotho-validation |
25 | labbe_irit_task6b_3 | 0.213 | labbe2023_t6b | 42191083 | CNN | ConvNeXt-tiny | Transformer | mixup, SpecAugment, label_smoothing | | 32kHz | Cross-Entropy loss | AdamW | FENSE | Clotho-development, AudioCaps-train, MACS, WavCaps (without FreeSound) | Clotho-validation |
26 | Baseline | 0.211 | xie2023_t6b | 80902892 | PANNs-CNN14 | log-mel energies | Sentence-BERT | | | 44.1kHz | InfoNCE loss | Adam | loss | Clotho-development | Clotho-validation |
27 | labbe_irit_task6b_2 | 0.204 | labbe2023_t6b | 40133440 | CNN | ConvNeXt-tiny | Transformer | mixup, SpecAugment, label_smoothing | | 32kHz | Cross-Entropy loss | AdamW | FENSE | Clotho-development | Clotho-validation |
28 | Park_CAU_task6b_4 | 0.177 | park2023_t6b | 188506148 | PANNs-CNN14 | PANNs | Sentence-BERT | SpecAugment | | 32kHz | InfoNCE+VICReg loss | Adam | loss | Clotho-development | Clotho-validation |
29 | labbe_irit_task6b_1 | 0.159 | labbe2023_t6b | 87715793 | CNN | PANNs | Transformer | mixup, SpecAugment, label_smoothing | | 32kHz | Cross-Entropy loss | AdamW | FENSE | Clotho-development | Clotho-validation |
30 | shah_cmu_task6b_1 | 0.004 | shah2023_t6b | 647256 | CLAP | log-mel energies | RoBERTa | | | 48kHz | InfoNCE loss | Adam | loss | Clotho-development | Clotho-validation |
31 | kim_snu_task6b_2 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | PANNs | BERT | SpecAugment, noise, PairMix, multi-TTA | EDA, ChatGPT | 16kHz | NT-Xent loss | AdamW | loss | Clotho-development, AudioCaps-train, WavText5K, WavCaps | Clotho-validation |
32 | kim_snu_task6b_3 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | PANNs | BERT | SpecAugment, noise, PairMix, multi-TTA | EDA, ChatGPT | 16kHz | NT-Xent loss | AdamW | loss | Clotho-development, AudioCaps-train, WavText5K, WavCaps | Clotho-validation |
33 | kim_snu_task6b_4 | 0.004 | kim2023_t6b | 196469248 | PANNs-CNN14 | PANNs | BERT | SpecAugment, noise, PairMix, multi-TTA | EDA, ChatGPT | 16kHz | NT-Xent loss | AdamW | loss | Clotho-development, AudioCaps-train, WavText5K, WavCaps | Clotho-validation |
34 | kim_snu_task6b_1 | 0.003 | kim2023_t6b | 196469248 | PANNs-CNN14 | PANNs | BERT | SpecAugment, noise, PairMix, multi-TTA | EDA, ChatGPT | 16kHz | NT-Xent loss | AdamW | loss | Clotho-development, AudioCaps-train, WavText5K, WavCaps | Clotho-validation |
Technical reports
QFORMER BASED TEXT AUDIO RETRIEVAL SYSTEM
Ziye Fan, Fengyun Zhu
R&D, Lingban Technology Ltd., Beijing, China
fan_lb_task6b_1 fan_lb_task6b_2 fan_lb_task6b_3 fan_lb_task6b_4
Abstract
This paper describes the system we submitted for DCASE2023 Challenge Task 6B. Task 6B involves audio retrieval using natural language. Our submitted retrieval system includes a frozen pretrained audio encoder and a Qformer as the text encoder. The system uses the paired data provided by the AudioCaps and Clotho datasets for contrastive learning in the style of BLIP-2. Natural language queries are first encoded by the text encoder, followed by a top-k recall over the pre-extracted audio embeddings. The recalled clips are then paired with the query text to form k audio-text pairs, which are reranked by the model's matching ability to produce the final retrieval results. This system achieved an mAP of 26.47% and an R@1 of 16.02% on the Clotho test set, compared with the baseline system's mAP of 22.2% and R@1 of 13.0%.
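For readers unfamiliar with the BLIP-2-style retrieve-then-rerank scheme the abstract describes, the following sketch illustrates the two stages; `text_encoder` and `match_head` are hypothetical stand-ins for the system's actual components, and the candidate count `k` is illustrative.

```python
import torch

@torch.no_grad()
def retrieve_and_rerank(query, text_encoder, match_head, audio_embs, k=32):
    """Two-stage retrieval: fast top-k recall with embedding similarity,
    then reranking the k candidates with a pairwise audio-text matcher."""
    q = text_encoder(query)                                  # (d,) query embedding
    sims = audio_embs @ q / (audio_embs.norm(dim=1) * q.norm())
    topk = sims.topk(k).indices                              # coarse candidates
    # Rerank: score each (audio, query) pair with the finer matching model.
    match_scores = torch.stack([match_head(audio_embs[i], query) for i in topk])
    order = match_scores.argsort(descending=True)
    return topk[order][:10]                                  # final top-10 indices
```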
ENSEMBLE SYSTEMS WITH CONTRASTIVE LANGUAGE-AUDIO PRETRAINING AND ATTENTION-BASED AUDIO FEATURES FOR AUDIO CAPTIONING AND RETRIEVAL
Feiyang Xiao, Qiaoxi Zhu, Haiyan Lan, Wenwu Wang, Jian Guan
Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
guan_heu_task6b_1 guan_heu_task6b_2 guan_heu_task6b_3 guan_heu_task6b_4
Abstract
This technical report describes our submission to Task 6 (automated audio captioning and language-based audio retrieval) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge. The proposed systems in this submission are based on a contrastive language-audio pretraining strategy and attention-based audio feature representations. Experiments show that our systems can achieve a SPIDEr-FL score of 28.32 on automated audio captioning and an mAP score of 31.18 on language-based audio retrieval.
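Several submissions, including this one, train with an InfoNCE objective. Below is a generic sketch of the symmetric audio-text InfoNCE loss under the usual in-batch-negatives assumption; the temperature value is illustrative, not the authors' setting.

```python
import torch
import torch.nn.functional as F

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature              # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Pull matched pairs together, push all other in-batch pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```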
OVERCOMING DATA SHORTAGE IN AUDIO-TEXT MULTI-MODAL RETRIEVAL: A TECH REPORT FOR DCASE 2023 CHALLENGE
Jinhee Kim, Chang-Bin Jeon, Yoori Oh, JoonHyeon Bae, Kyogu Lee
Intelligence and Information, Seoul National University, Seoul, Republic of Korea
kim_snu_task6b_1 kim_snu_task6b_2 kim_snu_task6b_3 kim_snu_task6b_4
Abstract
This technical report proposes an audio-text retrieval model for the DCASE 2023 language-based audio retrieval challenge. We focus on overcoming the shortage of data for this task. To this end, we propose two approaches: the first involves gathering large paired audio-text datasets, while the second employs various augmentation techniques such as PairMix and Multi-TTA. Our experimental evaluations demonstrate the effectiveness of these approaches, achieving competitive performance in audio-text multi-modal retrieval tasks.
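The PairMix augmentation mentioned above mixes audio-text pairs to create new synthetic training pairs. The sketch below shows one plausible reading of such pair mixing (overlay two waveforms, join their captions); the exact formulation in the report may differ.

```python
import torch

def pairmix(audio_a, audio_b, caption_a, caption_b, alpha=0.5):
    """Mix two audio-text pairs: blend the waveforms with a Beta-sampled
    weight and join the captions. A sketch, not the authors' exact method."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    n = min(audio_a.size(-1), audio_b.size(-1))       # align clip lengths
    mixed_audio = lam * audio_a[..., :n] + (1 - lam) * audio_b[..., :n]
    mixed_caption = f"{caption_a} and {caption_b}"    # combined description
    return mixed_audio, mixed_caption
```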
IRIT-UPS DCASE 2023 AUDIO CAPTIONING AND RETRIEVAL SYSTEM
Etienne Labbé, Thomas Pellegrini, Julien Pinquier
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
Artificial and Natural Intelligence Toulouse Institute (ANITI)
labbe_irit_task6b_1 labbe_irit_task6b_2 labbe_irit_task6b_3 labbe_irit_task6b_4
Abstract
This technical report provides a concise overview of our systems submitted to the DCASE Challenge 2023 for tasks 6a, "Automated Audio Captioning" (AAC), and 6b, "Language-Based Audio Retrieval" (LBAR). In task 6a, we made four distinct submissions. The first submission employed a standard CNN14 encoder paired with a transformer decoder. In the second submission, we replaced this encoder with a ConvNeXt model to enhance audio representation. The third submission incorporated additional training data; we introduced a new task embedding approach to differentiate between different writing styles and audio types. Finally, in the fourth submission, we employed an ensemble method to combine five models trained on different seeds, aiming to improve the quality of the captions. For task 6b, we reuse the AAC models and propose a novel approach to the LBAR task, leveraging the AAC system's loss function without requiring any additional training. Our most successful AAC and LBAR systems achieved a SPIDEr-FL score of 0.320 and an mAP@10 score of 0.269. These results demonstrate relative improvements of 22.6% and 21.2% compared to the AAC and LBAR baselines, respectively.
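The LBAR approach described here scores candidates with the captioning loss itself, needing no retrieval-specific training. Below is a hedged sketch of that idea; `captioning_model` and its calling convention are assumptions for illustration, not the authors' actual interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_by_caption_loss(caption_tokens, audio_files, captioning_model, k=10):
    """Score each audio by the cross-entropy the captioning model assigns to
    the query caption under teacher forcing; lower loss = better match."""
    losses = []
    for audio in audio_files:
        # Assumed interface: logits over the vocabulary for each target token,
        # shape (seq_len, vocab_size), given the audio and the shifted caption.
        logits = captioning_model(audio, caption_tokens[:-1])
        losses.append(F.cross_entropy(logits, caption_tokens[1:]).item())
    # Rank all candidates by ascending caption loss and keep the top k.
    return sorted(range(len(audio_files)), key=losses.__getitem__)[:k]
```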
TAKE IT SERIOUSLY: IMPROVING ON LAST YEAR WITH ALL SORTS OF TRICKS
Theodore Lamort de Gail, Bartłomiej Zgórzyński, Anna Plęs, Kamil Górzyński
Samsung R&D Institute Poland, Warsaw, Poland
Lamort_SRPOL_task6B_1 Lamort_SRPOL_task6B_2 Lamort_SRPOL_task6B_3 Lamort_SRPOL_task6B_4
Abstract
In this report, we present our solution to DCASE 2023, task 6B: Language-Based Audio Retrieval. We employ a bi-encoder architecture trained using a contrastive ranking loss. The audio encoder is an ensemble of three pre-trained models (BEATs, VGGish, CLAP) with added self-attention heads, while the text encoder is a pre-trained MPNet. To address the small dataset size, we gather 1.7M caption-audio pairs from YouTube videos. We use MixGen and paraphrasing, as well as traditional audio augmentations, and Low-Rank Adaptation (LoRA) for fine-tuning on Clotho. We achieve 29.66% mAP@10 on the development-testing split of Clotho using an ensemble solution, and 26.93% mAP@10 with a single model. We also submit an ensemble of our solution and CLAP.
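The contrastive ranking loss used here is a margin-based triplet objective: the matching audio must outscore a non-matching one by at least a margin. A generic sketch follows, with an illustrative margin value.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(anchor_text, pos_audio, neg_audio, margin=0.2):
    """Triplet loss for a bi-encoder: penalize whenever the negative audio
    comes within `margin` of the positive audio's similarity to the text."""
    pos = F.cosine_similarity(anchor_text, pos_audio, dim=-1)
    neg = F.cosine_similarity(anchor_text, neg_audio, dim=-1)
    return F.relu(margin - pos + neg).mean()
```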
TEXT-TO-AUDIO RETRIEVAL: ENSEMBLE COMBINATIONS OF THE MODELS
Jiwon Park, SangJe Park, Changwon Lim
Applied Statistics, Chung-ang University, Seoul, South Korea
Park_CAU_task6b_1 Park_CAU_task6b_2 Park_CAU_task6b_3 Park_CAU_task6b_4
Abstract
This technical report focuses on the audio-text retrieval model designed for the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2023 Task 6b. In this task, the objective is to retrieve 10 audio files from a given dataset based on a given text query and then sort them according to how well they match the query. The audio encoder in our model employs Pretrained Audio Neural Networks (PANNs), pre-trained on the AudioSet dataset. We have fine-tuned the encoders using the Clotho dataset. For the text encoder, we have used transfer learning with Sentence-BERT, which is based on the Transformer architecture. To bring audio and text inputs into a joint embedding space, we have passed them through their respective encoders. We have then employed contrastive learning on audio-text pairs so that similar pairs are positioned close together and all other pairs further apart. We achieve an mAP@10 of 0.245 for text-to-audio retrieval.
CP-JKU’S SUBMISSION TO TASK 6b OF THE DCASE2023 CHALLENGE: AUDIO RETRIEVAL WITH PaSST AND GPT-AUGMENTED CAPTIONS
Paul Primus, Khaled Koutini, Gerhard Widmer
Institute of Computational Perception (CP-JKU), LIT Artificial Intelligence Lab, Johannes Kepler University, Austria
Primus_CP-JKU_6b_1 Primus_CP-JKU_6b_2 Primus_CP-JKU_6b_3 Primus_CP-JKU_6b_4
Abstract
This technical report describes CP-JKU’s submission to the natural language-based audio retrieval task of the 2023 DCASE Challenge (Task 6b). Our proposed system uses pretrained audio and text embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We pre-train our models on WavCaps, AudioCaps, and ClothoV2, three large datasets with audio-caption pairs. We further augment the captions in the ClothoV2 dataset using the provided metadata and the ChatGPT API in order to reduce overfitting. Our best single system submission outperforms the current state-of-the-art text-to-audio retrieval system on the ClothoV2 test split by 4.6 pp. R@1. Furthermore, our ensemble beats the previous year’s best submission on the test split by 11.5 pp. mAP@10. Our implementation is available on GitHub.
Awards: Judges’ award
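As a rough illustration of the GPT-based caption augmentation the abstract mentions, the sketch below paraphrases a caption through a hypothetical `chat(prompt)` helper standing in for an LLM API; the authors' actual prompts and interface are not shown here.

```python
def chat(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (e.g., the ChatGPT API the
    authors mention). Replace with a real client; here it is only a stub."""
    raise NotImplementedError("plug in your LLM client here")

def augment_caption(caption: str, n_variants: int = 2) -> list[str]:
    """Generate paraphrased captions so one recording yields several
    text-audio training pairs. A sketch of the idea, not the authors' prompts."""
    prompt = ("Rewrite the following audio caption in different words, "
              f"keeping the described sounds identical:\n{caption}")
    return [chat(prompt) for _ in range(n_variants)]
```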
DCASE 2023 TASK 6 AUTOMATED AUDIO CAPTIONING AND LANGUAGE-BASED RETRIEVAL
Greeshma Karanth, Ninaad Rao, Srikumar Subramanian, Ankit Shah
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
shah_cmu_task6b_1
Abstract
The objective of this project is to examine audio signals utilizing natural language to capture their complex characteristics. This initiative is part of Task 6 in the DCASE 2023 Competition and consists of two subtasks. The first subtask is Automated Audio Captioning, which generates text descriptions of audio content. This task involves the intermodal processing of an audio signal as input and a text description as output. Our best-performing model for this uses the PANN architecture [1] with the CNN-14 feature extractor and a BART [2] encoder and decoder. The second subtask is called Language-Based Audio Retrieval, where the system retrieves audio signals by searching for their sound content descriptions. The queries for this subtask are human-generated audio captions. In this task, our best-performing model uses CLAP [3] audio embeddings and RoBERTa text embeddings [4]. This document presents a summary of our work done for this challenge.
DCASE 2023 TASK 6B: TEXT-TO-AUDIO RETRIEVAL USING PRETRAINED MODELS
Chung-Che Wang, Jiawei Du, Jyh-Shing Roger Jang
Dept. of Computer Science and Information Engineering, National Taiwan Univ., Taipei, Taiwan
Wang_NTU_task6b_1 Wang_NTU_task6b_2 Wang_NTU_task6b_3 Wang_NTU_task6b_4
Abstract
This technical report describes our methods for Task 6b of the DCASE 2023 challenge: Language-Based Audio Retrieval. In this work, we use a bi-encoder structure and investigate the effectiveness of different pretrained audio and text encoders, including CNN14 from PANNs, the Audio Spectrogram Transformer, and BERT. We also use random deletion as data augmentation for the text data, and multi-label classification as an auxiliary task for the audio data.
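Random deletion, used here as text augmentation, simply drops words at random so the model sees varied phrasings of the same caption. A minimal sketch, with an illustrative deletion probability:

```python
import random

def random_deletion(caption: str, p: float = 0.1) -> str:
    """Drop each word independently with probability p.
    The submission's exact settings may differ."""
    words = caption.split()
    if not words:
        return caption
    kept = [w for w in words if random.random() > p]
    # Keep at least one word so the caption never becomes empty.
    return " ".join(kept) if kept else random.choice(words)
```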