Language-Based Audio Retrieval


Challenge results

Task description

Language-based audio retrieval is the task of retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions are used as text queries. For each text query, the goal is to retrieve 10 audio files from a given dataset and sort them by how well they match the query. Through this task, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
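
Systems of this kind typically embed the query and all candidate audio files into a shared space and rank by similarity. A minimal sketch of the retrieve-and-sort step (all names, the embedding dimensionality, and the toy data are illustrative, not taken from any submitted system):

```python
import numpy as np

def retrieve_top_k(query_emb, audio_embs, k=10):
    """Rank audio files by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                    # cosine similarity per audio file
    order = np.argsort(-scores)[:k]  # indices of the k best matches
    return order, scores[order]

# Toy example: 5 hypothetical audio embeddings, query near item 2.
rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(5, 8))
query_emb = audio_embs[2] + 0.01 * rng.normal(size=8)
idx, sc = retrieve_top_k(query_emb, audio_embs, k=3)
```

In a real system the embeddings would come from trained audio and text encoders; the ranking logic itself is the same.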

Clotho v2 is provided as the development dataset, which includes both audio and corresponding captions. Participants are also allowed to use pre-trained models and external data for training their systems. This includes pre-trained models for feature extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or Freesound. A more detailed task description can be found on the task description page.

Teams ranking

Listed here are the best systems from all teams. The ranking is based on the achieved mAP@10 metric. For a more detailed exploration of system performance, the same table lists the values achieved for all metrics employed in the task. Metric values are given for both the development-testing split and the evaluation dataset.
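
For reference, the metrics in these tables can be computed as follows when each caption query has a single matching audio file, as in Clotho. The helper names and toy rankings are illustrative:

```python
def average_precision_at_10(ranked_ids, relevant_id):
    """AP@10 for a query with one relevant audio file:
    1/rank if the ground truth appears in the top 10, else 0."""
    top10 = ranked_ids[:10]
    if relevant_id in top10:
        return 1.0 / (top10.index(relevant_id) + 1)
    return 0.0

def mean_ap_at_10(all_rankings, all_relevant):
    """mAP@10: average AP@10 over all queries."""
    aps = [average_precision_at_10(r, rel)
           for r, rel in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)

def recall_at_k(ranked_ids, relevant_id, k):
    """R@k with one relevant item: 1 if it is ranked within the top k."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Two toy queries: ground truth "a" retrieved at rank 1 and at rank 4.
rankings = [["a", "b", "c", "d"], ["b", "c", "d", "a"]]
truths = ["a", "a"]
map10 = mean_ap_at_10(rankings, truths)  # (1.0 + 0.25) / 2 = 0.625
```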

Columns: Submission code | Best official system rank | Corresponding author | Technical report | Evaluation dataset: mAP@10, R@1, R@5, R@10 | Development-testing split: mAP@10, R@1, R@5, R@10
Primus_CP-JKU_6b_1 1 Paul Primus primus2023_t6b 0.401 0.283 0.553 0.681 0.414 0.289 0.587 0.711
Lamort_SRPOL_task6B_3 5 Theodore Lamort de Gail lamort2023_t6b 0.281 0.179 0.426 0.548 0.297 0.191 0.436 0.579
Wang_NTU_task6b_3 7 Chung-Che Wang wang2023_t6b 0.273 0.162 0.428 0.548 0.314 0.186 0.452 0.585
guan_heu_task6b_1 10 Jian Guan guan2023_t6b 0.262 0.160 0.403 0.518 0.305 0.194 0.458 0.590
fan_lb_task6b_3 17 Ziye Fan fan2023_t6b 0.243 0.144 0.373 0.499 0.265 0.160 0.402 0.544
Park_CAU_task6b_3 18 Jiwon Park park2023_t6b 0.240 0.151 0.354 0.473 0.236 0.139 0.363 0.497
labbe_irit_task6b_4 23 Etienne Labbe labbe2023_t6b 0.234 0.146 0.339 0.475 0.269 0.169 0.399 0.523
Baseline 26 Huang Xie xie2023_t6b 0.211 0.121 0.332 0.459 0.222 0.130 0.343 0.480
shah_cmu_task6b_1 30 Ankit Shah shah2023_t6b 0.004 0.003 0.007 0.011 0.250 0.154 0.381 0.511
kim_snu_task6b_2 31 Jinhee Kim kim2023_t6b 0.004 0.002 0.005 0.013 0.270 0.168 0.409 0.551

Systems ranking

Listed here are all systems and their ranking according to the different metrics. Detailed information on each system is given in the next section.

Columns: Submission code | Best official system rank | Technical report | Evaluation dataset: mAP@10, R@1, R@5, R@10 | Development-testing split: mAP@10, R@1, R@5, R@10
Primus_CP-JKU_6b_1 1 primus2023_t6b 0.401 0.283 0.553 0.681 0.414 0.289 0.587 0.711
Primus_CP-JKU_6b_3 2 primus2023_t6b 0.363 0.250 0.518 0.648 0.386 0.261 0.553 0.693
Primus_CP-JKU_6b_2 3 primus2023_t6b 0.358 0.245 0.512 0.637 0.380 0.255 0.551 0.686
Primus_CP-JKU_6b_4 4 primus2023_t6b 0.341 0.229 0.501 0.626 0.363 0.244 0.525 0.662
Lamort_SRPOL_task6B_3 5 lamort2023_t6b 0.281 0.179 0.426 0.548 0.297 0.191 0.436 0.579
Lamort_SRPOL_task6B_2 6 lamort2023_t6b 0.275 0.172 0.412 0.543 0.297 0.191 0.436 0.579
Wang_NTU_task6b_3 7 wang2023_t6b 0.273 0.162 0.428 0.548 0.314 0.186 0.452 0.585
Wang_NTU_task6b_4 8 wang2023_t6b 0.265 0.167 0.385 0.528 0.313 0.191 0.441 0.581
Lamort_SRPOL_task6B_4 9 lamort2023_t6b 0.263 0.168 0.396 0.525 0.297 0.191 0.436 0.579
guan_heu_task6b_1 10 guan2023_t6b 0.262 0.160 0.403 0.518 0.305 0.194 0.458 0.590
guan_heu_task6b_3 11 guan2023_t6b 0.261 0.163 0.392 0.522 0.309 0.197 0.461 0.594
Lamort_SRPOL_task6B_1 12 lamort2023_t6b 0.261 0.163 0.402 0.527 0.297 0.191 0.436 0.579
guan_heu_task6b_2 13 guan2023_t6b 0.260 0.156 0.392 0.524 0.309 0.198 0.462 0.592
guan_heu_task6b_4 14 guan2023_t6b 0.259 0.156 0.395 0.523 0.312 0.200 0.464 0.599
Wang_NTU_task6b_1 15 wang2023_t6b 0.256 0.165 0.366 0.506 0.260 0.153 0.407 0.544
Wang_NTU_task6b_2 16 wang2023_t6b 0.245 0.147 0.370 0.504 0.271 0.167 0.410 0.539
fan_lb_task6b_3 17 fan2023_t6b 0.243 0.144 0.373 0.499 0.265 0.160 0.402 0.544
Park_CAU_task6b_3 18 park2023_t6b 0.240 0.151 0.354 0.473 0.236 0.139 0.363 0.497
fan_lb_task6b_1 19 fan2023_t6b 0.239 0.143 0.376 0.499 0.253 0.147 0.401 0.542
Park_CAU_task6b_1 20 park2023_t6b 0.239 0.151 0.352 0.472 0.245 0.147 0.376 0.507
fan_lb_task6b_2 21 fan2023_t6b 0.238 0.144 0.363 0.494 0.262 0.154 0.406 0.551
fan_lb_task6b_4 22 fan2023_t6b 0.235 0.137 0.369 0.499 0.251 0.146 0.390 0.533
labbe_irit_task6b_4 23 labbe2023_t6b 0.234 0.146 0.339 0.475 0.269 0.169 0.399 0.523
Park_CAU_task6b_2 24 park2023_t6b 0.232 0.143 0.346 0.466 0.240 0.143 0.374 0.501
labbe_irit_task6b_3 25 labbe2023_t6b 0.213 0.123 0.328 0.441 0.257 0.160 0.384 0.512
Baseline 26 xie2023_t6b 0.211 0.121 0.332 0.459 0.222 0.130 0.343 0.480
labbe_irit_task6b_2 27 labbe2023_t6b 0.204 0.124 0.312 0.429 0.231 0.140 0.353 0.483
Park_CAU_task6b_4 28 park2023_t6b 0.177 0.088 0.285 0.405 0.193 0.106 0.306 0.433
labbe_irit_task6b_1 29 labbe2023_t6b 0.159 0.085 0.247 0.359 0.186 0.106 0.288 0.419
shah_cmu_task6b_1 30 shah2023_t6b 0.004 0.003 0.007 0.011 0.250 0.154 0.381 0.511
kim_snu_task6b_2 31 kim2023_t6b 0.004 0.002 0.005 0.013 0.270 0.168 0.409 0.551
kim_snu_task6b_3 32 kim2023_t6b 0.004 0.001 0.006 0.012 0.280 0.175 0.422 0.566
kim_snu_task6b_4 33 kim2023_t6b 0.004 0.002 0.005 0.013 0.271 0.169 0.410 0.549
kim_snu_task6b_1 34 kim2023_t6b 0.003 0.001 0.005 0.012 0.279 0.172 0.419 0.562

System characteristics

This section presents the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.

Overview of characteristics

Columns: Rank | Submission code | mAP@10 | Technical report | Number of parameters | Audio modelling | Text modelling | Loss function
1 Primus_CP-JKU_6b_1 0.401 primus2023_t6b 3003000000 PaSST BERT, RoBERTa NT-Xent loss
2 Primus_CP-JKU_6b_3 0.363 primus2023_t6b 441200000 PaSST BERT, RoBERTa NT-Xent loss
3 Primus_CP-JKU_6b_2 0.358 primus2023_t6b 441200000 PaSST BERT, RoBERTa NT-Xent loss
4 Primus_CP-JKU_6b_4 0.341 primus2023_t6b 441200000 PaSST BERT, RoBERTa NT-Xent loss
5 Lamort_SRPOL_task6B_3 0.281 lamort2023_t6b 279940464 BEATs, VGGish, CLAP MPNet Triplet loss
6 Lamort_SRPOL_task6B_2 0.275 lamort2023_t6b 279940464 BEATs, VGGish, CLAP MPNet Triplet loss
7 Wang_NTU_task6b_3 0.273 wang2023_t6b 653661759 VALOR BERT Contrastive loss
8 Wang_NTU_task6b_4 0.265 wang2023_t6b 653661759 VALOR BERT Contrastive loss
9 Lamort_SRPOL_task6B_4 0.263 lamort2023_t6b 279940464 BEATs, VGGish, CLAP MPNet Triplet loss
10 guan_heu_task6b_1 0.262 guan2023_t6b 1686974464 PANNs-CNN14, PANNs-CNN14-Attention BERT, RoBERTa InfoNCE loss
11 guan_heu_task6b_3 0.261 guan2023_t6b 1686974464 PANNs-CNN14, PANNs-CNN14-Attention BERT, RoBERTa InfoNCE loss
12 Lamort_SRPOL_task6B_1 0.261 lamort2023_t6b 279940464 BEATs, VGGish, CLAP MPNet Triplet loss
13 guan_heu_task6b_2 0.260 guan2023_t6b 1313971200 PANNs-CNN14, PANNs-CNN14-Attention BERT, RoBERTa InfoNCE loss
14 guan_heu_task6b_4 0.259 guan2023_t6b 1313971200 PANNs-CNN14, PANNs-CNN14-Attention BERT, RoBERTa InfoNCE loss
15 Wang_NTU_task6b_1 0.256 wang2023_t6b 97100280 PANNs-CNN14 BERT NT-Xent loss
16 Wang_NTU_task6b_2 0.245 wang2023_t6b 97100280 PANNs-CNN14 BERT NT-Xent loss
17 fan_lb_task6b_3 0.243 fan2023_t6b 271483819 BEATs Qformer InfoNCE loss
18 Park_CAU_task6b_3 0.240 park2023_t6b 560958124 PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 Sentence-BERT Triplet loss
19 fan_lb_task6b_1 0.239 fan2023_t6b 271483819 BEATs Qformer InfoNCE loss
20 Park_CAU_task6b_1 0.239 park2023_t6b 937970420 PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 Sentence-BERT Triplet loss
21 fan_lb_task6b_2 0.238 fan2023_t6b 271483819 BEATs Qformer InfoNCE loss
22 fan_lb_task6b_4 0.235 fan2023_t6b 271483819 BEATs Qformer InfoNCE loss
23 labbe_irit_task6b_4 0.234 labbe2023_t6b 98064347 CNN Transformer Cross-Entropy loss
24 Park_CAU_task6b_2 0.232 park2023_t6b 565518444 PANNs-CNN14 Sentence-BERT Triplet loss
25 labbe_irit_task6b_3 0.213 labbe2023_t6b 42191083 CNN Transformer Cross-Entropy loss
26 Baseline 0.211 xie2023_t6b 80902892 PANNs-CNN14 Sentence-BERT InfoNCE loss
27 labbe_irit_task6b_2 0.204 labbe2023_t6b 40133440 CNN Transformer Cross-Entropy loss
28 Park_CAU_task6b_4 0.177 park2023_t6b 188506148 PANNs-CNN14 Sentence-BERT InfoNCE+VICReg loss
29 labbe_irit_task6b_1 0.159 labbe2023_t6b 87715793 CNN Transformer Cross-Entropy loss
30 shah_cmu_task6b_1 0.004 shah2023_t6b 647256 CLAP RoBERTa InfoNCE loss
31 kim_snu_task6b_2 0.004 kim2023_t6b 196469248 PANNs-CNN14 BERT NT-Xent loss
32 kim_snu_task6b_3 0.004 kim2023_t6b 196469248 PANNs-CNN14 BERT NT-Xent loss
33 kim_snu_task6b_4 0.004 kim2023_t6b 196469248 PANNs-CNN14 BERT NT-Xent loss
34 kim_snu_task6b_1 0.003 kim2023_t6b 196469248 PANNs-CNN14 BERT NT-Xent loss



Detailed characteristics

Columns: Rank | Submission code | mAP@10 | Technical report | Number of parameters | Audio modelling | Acoustic features | Text modelling | Audio augmentation | Text augmentation | Sampling rate | Loss function | Optimizer | Metric monitored for training | Dataset(s) used for training | Dataset(s) used for validation
1 Primus_CP-JKU_6b_1 0.401 primus2023_t6b 3003000000 PaSST log-mel energies BERT, RoBERTa patchout synonym replacement, random deletions, ChatGPT 32kHz NT-Xent loss Adam mAP Clotho-development, AudioCaps, WavCaps Clotho-validation
2 Primus_CP-JKU_6b_3 0.363 primus2023_t6b 441200000 PaSST log-mel energies BERT, RoBERTa patchout synonym replacement, random deletions, ChatGPT 32kHz NT-Xent loss Adam mAP Clotho-development, AudioCaps, WavCaps Clotho-validation
3 Primus_CP-JKU_6b_2 0.358 primus2023_t6b 441200000 PaSST log-mel energies BERT, RoBERTa patchout synonym replacement, random deletions 32kHz NT-Xent loss Adam mAP Clotho-development, AudioCaps, WavCaps Clotho-validation
4 Primus_CP-JKU_6b_4 0.341 primus2023_t6b 441200000 PaSST log-mel energies BERT, RoBERTa patchout synonym replacement, random deletions 32kHz NT-Xent loss Adam mAP Clotho-development, AudioCaps, WavCaps Clotho-validation
5 Lamort_SRPOL_task6B_3 0.281 lamort2023_t6b 279940464 BEATs, VGGish, CLAP log-mel energies MPNet MixGen, noise mix, cutout, paraphrase 16kHz Triplet loss AdamW loss, recall Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions Clotho-evaluation
6 Lamort_SRPOL_task6B_2 0.275 lamort2023_t6b 279940464 BEATs, VGGish, CLAP log-mel energies MPNet MixGen, noise mix, cutout, paraphrase 16kHz Triplet loss AdamW loss, recall Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions Clotho-evaluation
7 Wang_NTU_task6b_3 0.273 wang2023_t6b 653661759 VALOR fbank BERT 44.1kHz Contrastive loss AdamW loss Clotho-development Clotho-validation
8 Wang_NTU_task6b_4 0.265 wang2023_t6b 653661759 VALOR fbank BERT 44.1kHz Contrastive loss AdamW loss Clotho-development Clotho-validation
9 Lamort_SRPOL_task6B_4 0.263 lamort2023_t6b 279940464 BEATs, VGGish, CLAP log-mel energies MPNet MixGen, noise mix, cutout, paraphrase 16kHz Triplet loss AdamW loss, recall Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions Clotho-evaluation
10 guan_heu_task6b_1 0.262 guan2023_t6b 1686974464 PANNs-CNN14, PANNs-CNN14-Attention log-mel energies BERT, RoBERTa SpecAugment augmentation by GPT-3.5 32kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, WavText5K, AudioCaps Clotho-evaluation
11 guan_heu_task6b_3 0.261 guan2023_t6b 1686974464 PANNs-CNN14, PANNs-CNN14-Attention log-mel energies BERT, RoBERTa SpecAugment augmentation by GPT-3.5 32kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, WavText5K, AudioCaps Clotho-evaluation
12 Lamort_SRPOL_task6B_1 0.261 lamort2023_t6b 279940464 BEATs, VGGish, CLAP log-mel energies MPNet MixGen, noise mix, cutout, paraphrase 16kHz Triplet loss AdamW loss, recall Clotho-development, Clotho-validation, AudioCaps, MACS, SoundDescs, Freesound, YouTube Closed Captions Clotho-evaluation
13 guan_heu_task6b_2 0.260 guan2023_t6b 1313971200 PANNs-CNN14, PANNs-CNN14-Attention log-mel energies BERT, RoBERTa SpecAugment augmentation by GPT-3.5 32kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, WavText5K, AudioCaps Clotho-evaluation
14 guan_heu_task6b_4 0.259 guan2023_t6b 1313971200 PANNs-CNN14, PANNs-CNN14-Attention log-mel energies BERT, RoBERTa SpecAugment augmentation by GPT-3.5 32kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, WavText5K, AudioCaps Clotho-evaluation
15 Wang_NTU_task6b_1 0.256 wang2023_t6b 97100280 PANNs-CNN14 log-mel energies BERT SpecAugment, random deletion 32kHz NT-Xent loss Adam mrr Clotho-development Clotho-validation
16 Wang_NTU_task6b_2 0.245 wang2023_t6b 97100280 PANNs-CNN14 log-mel energies BERT SpecAugment, random deletion 32kHz NT-Xent loss Adam mrr Clotho-development Clotho-validation
17 fan_lb_task6b_3 0.243 fan2023_t6b 271483819 BEATs mel energies Qformer 16kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation Clotho-evaluation
18 Park_CAU_task6b_3 0.240 park2023_t6b 560958124 PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 PANNs Sentence-BERT SpecAugment 32kHz Triplet loss Adam loss Clotho-development Clotho-validation
19 fan_lb_task6b_1 0.239 fan2023_t6b 271483819 BEATs mel energies Qformer 16kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation Clotho-evaluation
20 Park_CAU_task6b_1 0.239 park2023_t6b 937970420 PANNs-CNN14, PANNs-ResNet38, PANNs-Wavegram-Logmel-Cnn14 PANNs Sentence-BERT SpecAugment 32kHz Triplet loss Adam loss Clotho-development Clotho-validation
21 fan_lb_task6b_2 0.238 fan2023_t6b 271483819 BEATs mel energies Qformer 16kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation Clotho-evaluation
22 fan_lb_task6b_4 0.235 fan2023_t6b 271483819 BEATs mel energies Qformer 16kHz InfoNCE loss Adam recall Clotho-development, Clotho-validation, AudioCaps-train, AudioCaps-validation Clotho-evaluation
23 labbe_irit_task6b_4 0.234 labbe2023_t6b 98064347 CNN ConvNeXt-tiny Transformer mixup, SpecAugment, label_smoothing 32kHz Cross-Entropy loss AdamW FENSE Clotho-development, AudioCaps-train, MACS, WavCaps (without FreeSound) Clotho-validation
24 Park_CAU_task6b_2 0.232 park2023_t6b 565518444 PANNs-CNN14 PANNs Sentence-BERT SpecAugment 32kHz Triplet loss Adam loss Clotho-development Clotho-validation
25 labbe_irit_task6b_3 0.213 labbe2023_t6b 42191083 CNN ConvNeXt-tiny Transformer mixup, SpecAugment, label_smoothing 32kHz Cross-Entropy loss AdamW FENSE Clotho-development, AudioCaps-train, MACS, WavCaps (without FreeSound) Clotho-validation
26 Baseline 0.211 xie2023_t6b 80902892 PANNs-CNN14 log-mel energies Sentence-BERT 44.1kHz InfoNCE loss Adam loss Clotho-development Clotho-validation
27 labbe_irit_task6b_2 0.204 labbe2023_t6b 40133440 CNN ConvNeXt-tiny Transformer mixup, SpecAugment, label_smoothing 32kHz Cross-Entropy loss AdamW FENSE Clotho-development Clotho-validation
28 Park_CAU_task6b_4 0.177 park2023_t6b 188506148 PANNs-CNN14 PANNs Sentence-BERT SpecAugment 32kHz InfoNCE+VICReg loss Adam loss Clotho-development Clotho-validation
29 labbe_irit_task6b_1 0.159 labbe2023_t6b 87715793 CNN PANNs Transformer mixup, SpecAugment, label_smoothing 32kHz Cross-Entropy loss AdamW FENSE Clotho-development Clotho-validation
30 shah_cmu_task6b_1 0.004 shah2023_t6b 647256 CLAP log-mel energies RoBERTa 48kHz InfoNCE loss Adam loss Clotho-development Clotho-validation
31 kim_snu_task6b_2 0.004 kim2023_t6b 196469248 PANNs-CNN14 PANNs BERT SpecAugment, noise, PairMix, multi-TTA EDA, ChatGPT 16kHz NT-Xent loss AdamW loss Clotho-development, AudioCaps-train, WavText5K, WavCaps Clotho-validation
32 kim_snu_task6b_3 0.004 kim2023_t6b 196469248 PANNs-CNN14 PANNs BERT SpecAugment, noise, PairMix, multi-TTA EDA, ChatGPT 16kHz NT-Xent loss AdamW loss Clotho-development, AudioCaps-train, WavText5K, WavCaps Clotho-validation
33 kim_snu_task6b_4 0.004 kim2023_t6b 196469248 PANNs-CNN14 PANNs BERT SpecAugment, noise, PairMix, multi-TTA EDA, ChatGPT 16kHz NT-Xent loss AdamW loss Clotho-development, AudioCaps-train, WavText5K, WavCaps Clotho-validation
34 kim_snu_task6b_1 0.003 kim2023_t6b 196469248 PANNs-CNN14 PANNs BERT SpecAugment, noise, PairMix, multi-TTA EDA, ChatGPT 16kHz NT-Xent loss AdamW loss Clotho-development, AudioCaps-train, WavText5K, WavCaps Clotho-validation



Technical reports

QFORMER BASED TEXT AUDIO RETRIEVAL SYSTEM

Ziye Fan, Fengyun Zhu
R&D, Lingban Technology Ltd., Beijing, China

Abstract

This paper describes the system we submitted for DCASE 2023 Challenge Task 6B. Task 6B involves audio retrieval using natural language. Our submitted retrieval system comprises a frozen pretrained audio encoder and a Qformer as the text encoder. The system uses paired data from the AudioCaps and Clotho datasets for contrastive learning in the style of BLIP-2. Natural language queries are first encoded by the text encoder, followed by a top-k recall over the pre-extracted audio embeddings. The recalled audio files are then paired with the query text to form k audio-text pairs, which are reranked based on the model's matching ability to produce the final retrieval results. This system achieved an mAP of 26.47% and an R@1 of 16.02% on the Clotho test set, compared to the baseline system's mAP of 22.2% and R@1 of 13.0%.
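
The two-stage recall-then-rerank pipeline this abstract describes can be sketched as follows. The embedding models and the matching scorer are hypothetical stand-ins, not the submitted system's components:

```python
import numpy as np

def recall_then_rerank(text_emb, audio_embs, match_scorer, k=10):
    """Stage 1: fast top-k recall over pre-extracted audio embeddings.
    Stage 2: rerank the k candidates with a more expensive matching model,
    represented here by `match_scorer(i)` (higher = better match)."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    coarse = a @ t
    candidates = np.argsort(-coarse)[:k]                    # stage 1
    return sorted(candidates, key=match_scorer, reverse=True)  # stage 2

# Toy run: 20 random audio embeddings, query close to item 7,
# and a dummy scorer that also prefers item 7.
rng = np.random.default_rng(1)
audio = rng.normal(size=(20, 16))
query = audio[7] + 0.05 * rng.normal(size=16)
order = recall_then_rerank(query, audio,
                           match_scorer=lambda i: -abs(i - 7), k=5)
```

The design point is that the expensive matcher only sees k candidates instead of the whole dataset.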


ENSEMBLE SYSTEMS WITH CONTRASTIVE LANGUAGE-AUDIO PRETRAINING AND ATTENTION-BASED AUDIO FEATURES FOR AUDIO CAPTIONING AND RETRIEVAL

Feiyang Xiao, Qiaoxi Zhu, Haiyan Lan, Wenwu Wang, Jian Guan
Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

This technical report describes our submission to Task 6 (automated audio captioning and language-based audio retrieval) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge. The proposed systems in this submission are based on a contrastive language-audio pretraining strategy and an attention-based audio feature representation. Experiments show that our systems can achieve a SPIDEr-FL score of 28.32 on automated audio captioning and an mAP score of 31.18 on language-based audio retrieval.


OVERCOMING DATA SHORTAGE IN AUDIO-TEXT MULTI-MODAL RETRIEVAL: A TECH REPORT FOR DCASE 2023 CHALLENGE

Jinhee Kim, Chang-Bin Jeon, Yoori Oh, JoonHyeon Bae, Kyogu Lee
Intelligence and Information, Seoul National University, Seoul, Republic of Korea

Abstract

This technical report proposes an audio-text retrieval model for the DCASE 2023 language-based audio retrieval challenge. We focus on overcoming the shortage of data in this task. To this end, we propose two approaches: the first involves gathering large paired audio-text datasets, while the second employs various augmentation techniques such as PairMix and Multi-TTA. Our experimental evaluations demonstrate the effectiveness of these approaches, achieving competitive performance in audio-text multi-modal retrieval tasks.


IRIT-UPS DCASE 2023 AUDIO CAPTIONING AND RETRIEVAL SYSTEM

Etienne Labbé, Thomas Pellegrini, Julien Pinquier
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
Artificial and Natural Intelligence Toulouse Institute (ANITI)

Abstract

This technical report provides a concise overview of our systems submitted to the DCASE Challenge 2023 for tasks 6a, "Automated Audio Captioning" (AAC), and 6b, "Language-Based Audio Retrieval" (LBAR). In task 6a, we made four distinct submissions. The first submission employed a standard CNN14 encoder paired with a transformer decoder. In the second submission, we replaced this encoder with a ConvNeXt model to enhance audio representation. The third submission incorporated additional training data; we introduced a new task embedding approach to differentiate between different writing styles and audio types. Finally, in the fourth submission, we employed an ensemble method combining five models trained with different seeds, aiming to improve the quality of the captions. For task 6b, we reuse the AAC models and propose a novel approach to accomplish the LBAR task by leveraging the AAC system loss function without requiring any additional training. Our most successful AAC and LBAR systems achieved a SPIDEr-FL score of 0.320 and an mAP@10 score of 0.269. These results demonstrate relative improvements of 22.6% and 21.2% compared to the AAC and LBAR baselines, respectively.
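
The idea of scoring retrieval candidates with a captioning loss, as this abstract describes, can be sketched as follows. The `caption_nll` callable is a hypothetical stand-in for the AAC model's per-caption loss; the toy "loss" below is just a distance:

```python
import numpy as np

def rank_by_caption_loss(query_caption, audio_candidates, caption_nll):
    """Rank candidate audio files for a text query by the loss a captioning
    model assigns to the query caption given each audio: a lower caption
    loss suggests the audio is a better match for the description."""
    losses = [caption_nll(audio, query_caption) for audio in audio_candidates]
    return np.argsort(losses)  # best match = smallest caption loss

# Toy stand-in: each "audio" is a scalar and the "loss" is |audio - query|.
audios = [0.9, 0.1, 0.5]
ranking = rank_by_caption_loss(0.12, audios,
                               caption_nll=lambda a, q: abs(a - q))
```

The appeal of this formulation is that a trained captioning system can be reused for retrieval without any retrieval-specific training.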


TAKE IT SERIOUSLY: IMPROVING ON LAST YEAR WITH ALL SORTS OF TRICKS

Theodore Lamort de Gail, Bartłomiej Zgórzyński, Anna Plęs, Kamil Górzyński
Samsung R&D Institute Poland, Warsaw, Poland

Abstract

In this report, we present our solution to DCASE 2023, task 6B: Language-Based Audio Retrieval. We employ a bi-encoder architecture trained using a contrastive ranking loss. The audio encoder is an ensemble of three pre-trained models (BEATs, VGGish, CLAP) with added self-attention heads, while the text encoder is a pre-trained MPNet. To address the small dataset size, we gather 1.7M caption-audio pairs from YouTube videos. We use MixGen and paraphrasing, as well as traditional audio augmentations, and apply Low-Rank Adaptation (LoRA) for fine-tuning on Clotho. We achieve 29.66% mAP@10 on the development-testing split of Clotho using an ensemble solution, and 26.93% mAP@10 with a single model. We also submit an ensemble of our solution and CLAP.
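
A minimal sketch of the triplet ranking loss this abstract refers to, on L2-normalized embeddings; the margin value and the toy 2-d embeddings are illustrative, not the submission's actual settings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarities: the matching
    audio-caption pair (anchor, positive) must score at least `margin`
    higher than the mismatched pair (anchor, negative)."""
    a = anchor / np.linalg.norm(anchor)
    p = positive / np.linalg.norm(positive)
    n = negative / np.linalg.norm(negative)
    return max(0.0, margin - a @ p + a @ n)

anchor = np.array([1.0, 0.0])
# Identical positive and orthogonal negative: margin satisfied, zero loss.
easy = triplet_loss(anchor, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Swapped positive/negative: the violation is penalized.
hard = triplet_loss(anchor, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```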


TEXT-TO-AUDIO RETRIEVAL: ENSEMBLE COMBINATIONS OF THE MODELS

Jiwon Park, SangJe Park, Changwon Lim
Applied Statistics, Chung-Ang University, Seoul, South Korea

Abstract

This technical report focuses on the audio-text retrieval model designed for the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2023 Task 6b. In this task, the objective is to retrieve 10 audio files from a given dataset based on a given text query and then sort them according to how well they match the query. The audio encoder in our model employs Pretrained Audio Neural Networks (PANNs), a model pre-trained on the AudioSet dataset. We have fine-tuned the encoders using the Clotho dataset. For the text encoder, we have used transfer learning with Sentence-BERT, which is based on the Transformer architecture. To bring audio and text inputs into a joint embedding space, we have passed them through their respective encoders. We have then employed contrastive learning for audio-text pairs so that similar pairs are positioned close together and dissimilar pairs are positioned farther apart. We achieve 0.245 mAP@10 on text-to-audio retrieval.


CP-JKU’S SUBMISSION TO TASK 6b OF THE DCASE2023 CHALLENGE: AUDIO RETRIEVAL WITH PaSST AND GPT-AUGMENTED CAPTIONS

Paul Primus, Khaled Koutini, Gerhard Widmer
Institute of Computational Perception (CP-JKU), LIT Artificial Intelligence Lab, Johannes Kepler University, Austria

Abstract

This technical report describes CP-JKU's submission to the natural language-based audio retrieval task of the 2023 DCASE Challenge (Task 6b). Our proposed system uses pretrained audio and text embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We pre-train our models on WavCaps, AudioCaps, and ClothoV2, three large datasets with audio-caption pairs. We further augment the captions in the ClothoV2 dataset using the provided metadata and the ChatGPT API in order to reduce overfitting. Our best single-system submission outperforms the current state-of-the-art text-to-audio retrieval system on the ClothoV2 test split by 4.6 pp. R@1. Furthermore, our ensemble beats the previous year's best submission on the test split by 11.5 pp. mAP@10. Our implementation is available on GitHub.
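
A minimal sketch of the NT-Xent objective used by this and several other submissions, over a batch of matched audio-caption pairs; the temperature value and toy one-hot embeddings are illustrative:

```python
import numpy as np

def _xent_diag(logits):
    """Mean cross-entropy of the diagonal (matched pair) in each row."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    return -log_probs[np.arange(n), np.arange(n)].mean()

def nt_xent(audio_embs, text_embs, tau=0.07):
    """NT-Xent (normalized temperature-scaled cross-entropy): the i-th
    caption must pick out the i-th audio among all audios in the batch,
    and symmetrically for the audio-to-text direction."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ a.T) / tau  # scaled cosine similarity matrix
    return 0.5 * (_xent_diag(logits) + _xent_diag(logits.T))

# Perfectly aligned pairs give near-zero loss; shuffled pairs do not.
matched = nt_xent(np.eye(4), np.eye(4))
shuffled = nt_xent(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```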

Awards: Judges’ award


DCASE 2023 TASK 6 AUTOMATED AUDIO CAPTIONING AND LANGUAGE-BASED RETRIEVAL

Greeshma Karanth, Ninaad Rao, Srikumar Subramanian, Ankit Shah
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA

Abstract

The objective of this project is to examine audio signals utilizing natural language to capture their complex characteristics. This initiative is part of Task 6 in the DCASE 2023 Competition and consists of two subtasks. The first subtask is Automated Audio Captioning, which generates text descriptions of audio content; it involves the intermodal processing of an audio signal as input and a text description as output. Our best-performing model for this uses the PANN architecture [1] with the CNN-14 feature extractor and a BART [2] encoder and decoder. The second subtask is Language-Based Audio Retrieval, where the system retrieves audio signals by searching for their sound content descriptions. The queries for this subtask are human-generated audio captions. In this task, our best-performing model uses CLAP [3] audio embeddings and RoBERTa text embeddings [4]. This document presents a summary of our work for this challenge.


DCASE 2023 TASK 6B: TEXT-TO-AUDIO RETRIEVAL USING PRETRAINED MODELS

Chung-Che Wang, Jiawei Du, Jyh-Shing Roger Jang
Dept. of Computer Science and Information Engineering, National Taiwan Univ., Taipei, Taiwan

Abstract

This technical report describes our methods for Task 6b of the DCASE 2023 challenge: Language-Based Audio Retrieval. In this work, we use a bi-encoder structure and investigate the effectiveness of different pretrained audio and text encoders, including CNN14 of PANNs, the Audio Spectrogram Transformer, and BERT. We also explore random deletion as data augmentation for text data, and multi-label classification as an auxiliary task for audio data.
