Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing the content of a general audio signal using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can generate captions for general audio recordings. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).

Participants used the freely available Clotho development and evaluation splits, as well as any external data they saw fit to use. The developed systems are evaluated on their generated captions using the Clotho testing split, for which the corresponding captions are not provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.

The ranking of the submitted systems is based on the SPIDEr metric penalized by fluency error detection (SPIDEr-FL). However, this page provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
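For reference, SPIDEr is the arithmetic mean of CIDEr and SPICE, and SPIDEr-FL heavily penalizes the score of any caption that the fluency error detector flags. The minimal sketch below illustrates the idea; the 0.9 penalty factor is an assumption taken from the FENSE paper, so the exact factor should be checked against the official challenge tooling.

```python
# Minimal sketch of the ranking metric, assuming per-caption CIDEr and SPICE
# scores and a boolean fluency-error flag are already available (e.g. from the
# official evaluation tooling). The 0.9 penalty follows the FENSE paper and is
# an assumption here.

def spider_score(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider + spice)


def spider_fl(cider: float, spice: float, has_fluency_error: bool,
              penalty: float = 0.9) -> float:
    """SPIDEr, heavily penalized when the fluency error detector flags the caption."""
    score = spider_score(cider, spice)
    return score * (1.0 - penalty) if has_fluency_error else score


# A fluent caption keeps its SPIDEr score; a disfluent one is reduced by 90%.
print(spider_fl(0.505, 0.149, has_fluency_error=False))  # ~0.327
print(spider_fl(0.505, 0.149, has_fluency_error=True))   # ~0.0327
```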

Teams ranking

Listed here are the best systems from all teams, ranked by SPIDEr-FL. To allow a more detailed exploration of system performance, the same table lists the values achieved for all the metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside this task, since the captions for that split are freely available.

Each row lists the submission code, best official system rank, corresponding author, and technical report, followed by METEOR, CIDEr, SPICE, SPIDEr, and SPIDEr-FL on the Clotho testing split and then the same metrics on the Clotho evaluation split.

Wu_t6a_4 1 Shih-Lun Wu wu2023_t6a 0.195 0.505 0.149 0.327 0.327 0.197 0.505 0.145 0.325 0.325
Chang_t6a_4 2 Joon-Hyuk Chang chang2023_t6a 0.197 0.539 0.149 0.344 0.315 0.197 0.541 0.146 0.343 0.313
Labbe_t6a_4 3 Etienne Labbe labbe2023_t6a 0.193 0.486 0.142 0.314 0.314 0.193 0.500 0.140 0.320 0.320
Yan_t6a_4 4 Zhiyong Yan yan2023_t6a 0.191 0.461 0.139 0.300 0.289 0.192 0.474 0.136 0.305 0.294
Schaumloeffel_t6a_1 5 Timothy Schaumloeffel schaumloeffel2023_t6a 0.181 0.436 0.130 0.283 0.282 0.183 0.454 0.132 0.293 0.292
Guan_t6a_3 6 Jian Guan guan2023_t6a 0.180 0.427 0.129 0.278 0.273 0.184 0.450 0.129 0.290 0.283
Kadlčík_t6a_1 7 Marek Kadlčík kadlčík2023_t6a 0.172 0.414 0.123 0.269 0.267 0.378 0.433 0.126 0.279
Lee_t6a_1 8 Kyogu Lee lee2023_t6a 0.176 0.416 0.123 0.269 0.266 0.177 0.431 0.126 0.279 0.275
Baseline 9 Felix Gontier gontier2023_t6a 0.177 0.415 0.126 0.271 0.264 0.177 0.420 0.119 0.270 0.261
Greeshma_t6a_1 10 Karanth Greeshma greeshma2023_t6a 0.178 0.406 0.125 0.265 0.261 0.178 0.419 0.121 0.270 0.264
Lim_t6a_1 11 Changwon Lim lim2023_t6a 0.089 0.035 0.039 0.037 0.010 0.089 0.034 0.038 0.036 0.011

Systems ranking

Listed here are all submitted systems and their rankings according to the different metrics and metric groupings. The first table shows all systems with the challenge metrics, and the second table shows all systems with the additional metrics.

Detailed information for each system is provided in the next section.

Systems ranking, challenge metrics

Each row lists the submission code, rank, and technical report, followed by METEOR, CIDEr, SPICE, SPIDEr, and SPIDEr-FL on the Clotho testing split and then the same metrics on the Clotho evaluation split.

Wu_t6a_4 1 wu2023_t6a 0.195 0.505 0.149 0.327 0.327 0.197 0.505 0.145 0.325 0.325
Wu_t6a_3 2 wu2023_t6a 0.196 0.504 0.149 0.326 0.326 0.197 0.525 0.147 0.336 0.336
Wu_t6a_2 3 wu2023_t6a 0.196 0.499 0.149 0.324 0.324 0.198 0.510 0.147 0.329 0.329
Chang_t6a_4 4 chang2023_t6a 0.197 0.539 0.149 0.344 0.315 0.197 0.541 0.146 0.343 0.313
Labbe_t6a_4 5 labbe2023_t6a 0.193 0.486 0.142 0.314 0.314 0.193 0.500 0.140 0.320 0.320
Wu_t6a_1 6 wu2023_t6a 0.190 0.477 0.145 0.311 0.311 0.193 0.506 0.146 0.326 0.326
Labbe_t6a_3 7 labbe2023_t6a 0.192 0.479 0.141 0.310 0.309 0.192 0.485 0.139 0.312 0.310
Chang_t6a_1 8 chang2023_t6a 0.188 0.486 0.138 0.312 0.308 0.188 0.483 0.137 0.309 0.307
Labbe_t6a_2 9 labbe2023_t6a 0.189 0.470 0.139 0.304 0.304 0.190 0.474 0.136 0.305 0.303
Yan_t6a_4 10 yan2023_t6a 0.191 0.461 0.139 0.300 0.289 0.192 0.474 0.136 0.305 0.294
Yan_t6a_3 11 yan2023_t6a 0.190 0.457 0.139 0.298 0.288 0.190 0.468 0.135 0.302 0.292
Yan_t6a_1 12 yan2023_t6a 0.187 0.445 0.137 0.291 0.282 0.191 0.471 0.136 0.304 0.295
Schaumloeffel_t6a_1 13 schaumloeffel2023_t6a 0.181 0.436 0.130 0.283 0.282 0.183 0.454 0.132 0.293 0.292
Schaumloeffel_t6a_2 14 schaumloeffel2023_t6a 0.178 0.425 0.124 0.274 0.274 0.179 0.443 0.126 0.285 0.284
Guan_t6a_3 15 guan2023_t6a 0.180 0.427 0.129 0.278 0.273 0.184 0.450 0.129 0.290 0.283
Guan_t6a_4 16 guan2023_t6a 0.181 0.429 0.130 0.279 0.272 0.184 0.443 0.128 0.285 0.279
Yan_t6a_2 17 yan2023_t6a 0.185 0.424 0.132 0.278 0.270 0.189 0.460 0.136 0.298 0.286
Guan_t6a_1 18 guan2023_t6a 0.180 0.421 0.131 0.276 0.270 0.182 0.438 0.126 0.282 0.275
Kadlčík_t6a_1 19 kadlčík2023_t6a 0.172 0.414 0.123 0.269 0.267 0.378 0.433 0.126 0.279
Lee_t6a_1 20 lee2023_t6a 0.176 0.416 0.123 0.269 0.266 0.177 0.431 0.126 0.279 0.275
Baseline 21 gontier2023_t6a 0.177 0.415 0.126 0.271 0.264 0.177 0.420 0.119 0.270 0.261
Guan_t6a_2 22 guan2023_t6a 0.178 0.415 0.127 0.271 0.263 0.181 0.426 0.124 0.275 0.267
Kadlčík_t6a_2 23 kadlčík2023_t6a 0.177 0.406 0.129 0.267 0.261 0.378 0.414 0.123 0.269
Greeshma_t6a_1 24 greeshma2023_t6a 0.178 0.406 0.125 0.265 0.261 0.178 0.419 0.121 0.270 0.264
Labbe_t6a_1 25 labbe2023_t6a 0.177 0.389 0.125 0.257 0.256 0.179 0.414 0.126 0.270 0.269
Chang_t6a_3 26 chang2023_t6a 0.194 0.527 0.142 0.335 0.231 0.195 0.539 0.143 0.341 0.233
Chang_t6a_2 27 chang2023_t6a 0.195 0.520 0.141 0.330 0.229 0.195 0.526 0.143 0.335 0.225
Kadlčík_t6a_3 28 kadlčík2023_t6a 0.161 0.348 0.116 0.232 0.225 0.345 0.340 0.108 0.224
Lim_t6a_1 29 lim2023_t6a 0.089 0.035 0.039 0.037 0.010 0.089 0.034 0.038 0.036 0.011

Systems ranking, additional metrics

Each row lists the submission code, rank, and technical report, followed by the Sentence-BERT and FENSE scores on the Clotho testing split.

Wu_t6a_4 1 wu2023_t6a 0.536 0.536
Wu_t6a_3 2 wu2023_t6a 0.536 0.536
Wu_t6a_2 3 wu2023_t6a 0.538 0.538
Chang_t6a_4 4 chang2023_t6a 0.530 0.488
Labbe_t6a_4 5 labbe2023_t6a 0.523 0.522
Wu_t6a_1 6 wu2023_t6a 0.526 0.526
Labbe_t6a_3 7 labbe2023_t6a 0.521 0.519
Chang_t6a_1 8 chang2023_t6a 0.527 0.521
Labbe_t6a_2 9 labbe2023_t6a 0.523 0.522
Yan_t6a_4 10 yan2023_t6a 0.521 0.498
Yan_t6a_3 11 yan2023_t6a 0.523 0.501
Yan_t6a_1 12 yan2023_t6a 0.520 0.503
Schaumloeffel_t6a_1 13 schaumloeffel2023_t6a 0.501 0.498
Schaumloeffel_t6a_2 14 schaumloeffel2023_t6a 0.496 0.496
Guan_t6a_3 15 guan2023_t6a 0.496 0.487
Guan_t6a_4 16 guan2023_t6a 0.496 0.480
Yan_t6a_2 17 yan2023_t6a 0.509 0.490
Guan_t6a_1 18 guan2023_t6a 0.495 0.481
Kadlčík_t6a_1 19 kadlčík2023_t6a 0.495 0.492
Lee_t6a_1 20 lee2023_t6a 0.500 0.495
Baseline 21 gontier2023_t6a 0.482 0.472
Guan_t6a_2 22 guan2023_t6a 0.494 0.475
Kadlčík_t6a_2 23 kadlčík2023_t6a 0.492 0.481
Greeshma_t6a_1 24 greeshma2023_t6a 0.486 0.477
Labbe_t6a_1 25 labbe2023_t6a 0.481 0.480
Chang_t6a_3 26 chang2023_t6a 0.522 0.363
Chang_t6a_2 27 chang2023_t6a 0.522 0.362
Kadlčík_t6a_3 28 kadlčík2023_t6a 0.459 0.445
Lim_t6a_1 29 lim2023_t6a 0.121 0.033
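
Sentence-BERT and FENSE are embedding-based metrics: FENSE scores a candidate caption by its Sentence-BERT similarity to the reference captions and penalizes it when a fluency error is detected. The sketch below uses the sentence-transformers library; the checkpoint name and the 0.9 penalty are assumptions taken from the FENSE paper rather than the exact challenge configuration.

```python
# Hedged sketch of a FENSE-style score with sentence-transformers.
# Model name and penalty factor are assumptions, not the official setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-TinyBERT-L6-v2")  # assumed checkpoint


def sbert_similarity(candidate: str, references: list[str]) -> float:
    """Mean cosine similarity between a candidate caption and its references."""
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    return util.cos_sim(cand_emb, ref_embs).mean().item()


def fense(candidate: str, references: list[str], has_fluency_error: bool,
          penalty: float = 0.9) -> float:
    """Sentence-BERT similarity, penalized on detected fluency errors."""
    sim = sbert_similarity(candidate, references)
    return sim * (1.0 - penalty) if has_fluency_error else sim


print(fense("a dog barks while birds chirp",
            ["a dog is barking and birds are chirping nearby"],
            has_fluency_error=False))
```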

System characteristics

This section presents the characteristics of the submitted systems. Two tables are provided in the corresponding subsections for easy reference: the first gives an overview of the systems, and the second a detailed presentation of each system.

Overview of characteristics

Each row lists the rank, submission code, SPIDEr-FL, and technical report, followed by the method scheme/architecture, amount of parameters, audio modelling, word modelling, and data augmentation (empty fields are omitted).

1 Wu_t6a_4 0.327 wu2023_t6a encoder-decoder 2542000000 conformer transformer
2 Wu_t6a_3 0.326 wu2023_t6a encoder-decoder 2542000000 conformer transformer
3 Wu_t6a_2 0.324 wu2023_t6a encoder-decoder 887000000 conformer transformer
4 Chang_t6a_4 0.315 chang2023_t6a encoder-decoder 1313200128 PANNs BART spec augmentation, AL-mixgen, synonyms substitution
5 Labbe_t6a_4 0.314 labbe2023_t6a encoder-decoder 98064347 cnn transformer mixup, spec_augment, label_smoothing
6 Wu_t6a_1 0.311 wu2023_t6a encoder-decoder 127000000 conformer transformer
7 Labbe_t6a_3 0.309 labbe2023_t6a encoder-decoder 42191083 cnn transformer mixup, spec_augment, label_smoothing
8 Chang_t6a_1 0.308 chang2023_t6a encoder-decoder 218866688 PANNs BART spec augmentation, AL-mixgen, synonyms substitution
9 Labbe_t6a_2 0.304 labbe2023_t6a encoder-decoder 40133440 cnn transformer mixup, spec_augment, label_smoothing
10 Yan_t6a_4 0.289 yan2023_t6a encoder-decoder 90086352 transformer transformer
11 Yan_t6a_3 0.288 yan2023_t6a encoder-decoder 90086352 transformer transformer
12 Yan_t6a_1 0.282 yan2023_t6a encoder-decoder 90086352 transformer transformer
13 Schaumloeffel_t6a_1 0.282 schaumloeffel2023_t6a encoder-decoder 248325888 transformer GPT2 SpecAugment
14 Schaumloeffel_t6a_2 0.274 schaumloeffel2023_t6a encoder-decoder 248325888 transformer GPT2 SpecAugment
15 Guan_t6a_3 0.273 guan2023_t6a encoder-decoder 35502652 PANNs (CNN10) + GAT, PANNs (CNN10) transformer SpecAugmentation
16 Guan_t6a_4 0.272 guan2023_t6a encoder-decoder 17768222 PANNs (CNN10) + GAT transformer SpecAugmentation
17 Yan_t6a_2 0.270 yan2023_t6a encoder-decoder 90086352 transformer transformer
18 Guan_t6a_1 0.270 guan2023_t6a encoder-decoder 8884111 PANNs (CNN10) + GAT transformer SpecAugmentation
19 Kadlčík_t6a_1 0.267 kadlčík2023_t6a encoder-decoder 1550000000 transformer transformer
20 Lee_t6a_1 0.266 lee2023_t6a encoder-decoder 178755308 cnn transformer
21 Baseline 0.264 gontier2023_t6a encoder-decoder 98500000 PANNs transformer
22 Guan_t6a_2 0.263 guan2023_t6a encoder-decoder 8884111 PANNs (CNN10) + GAT transformer SpecAugmentation
23 Kadlčík_t6a_2 0.261 kadlčík2023_t6a encoder-decoder 244000000 transformer transformer
24 Greeshma_t6a_1 0.261 greeshma2023_t6a encoder-decoder 178755308 cnn BART
25 Labbe_t6a_1 0.256 labbe2023_t6a encoder-decoder 87715793 cnn transformer mixup, spec_augment, label_smoothing
26 Chang_t6a_3 0.231 chang2023_t6a encoder-decoder 656600064 PANNs BART spec augmentation, AL-mixgen, synonyms substitution
27 Chang_t6a_2 0.229 chang2023_t6a encoder-decoder 218866688 PANNs BART spec augmentation, AL-mixgen, synonyms substitution
28 Kadlčík_t6a_3 0.225 kadlčík2023_t6a encoder-decoder 39000000 transformer transformer
29 Lim_t6a_1 0.010 lim2023_t6a encoder-decoder 178755308 CNN14 transformer
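
All submissions above follow an encoder-decoder scheme: an audio encoder (PANNs, ConvNeXt, BEATs, CLAP, Whisper, and so on) produces a sequence of audio embeddings, and a text decoder cross-attends to them while generating the caption token by token. The PyTorch sketch below shows the overall shape of such a system with illustrative dimensions and a toy encoder; it does not reproduce any particular submission, and positional encodings, pretraining, and beam search are omitted.

```python
# Illustrative encoder-decoder audio captioner, assuming log-mel input frames
# and a word-piece vocabulary. Dimensions and the toy encoder are placeholders.
import torch
import torch.nn as nn


class AudioCaptioner(nn.Module):
    def __init__(self, vocab_size: int = 5000, d_model: int = 256,
                 n_mels: int = 64, num_layers: int = 3, nhead: int = 4):
        super().__init__()
        # Toy encoder: project log-mel frames to the model dimension.
        # Real systems used PANNs, ConvNeXt, BEATs, CLAP, Whisper, etc.
        self.encoder = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); tokens: (batch, caption_len)
        memory = self.encoder(mel)                 # audio embedding sequence
        tgt = self.token_emb(tokens)               # embed caption tokens
        length = tokens.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                   # next-token logits


# Training step: cross-entropy against the ground-truth tokens
# (the usual input/target shift is omitted here for brevity).
model = AudioCaptioner()
mel = torch.randn(2, 100, 64)
tokens = torch.randint(0, 5000, (2, 12))
logits = model(mel, tokens)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 5000), tokens.reshape(-1))
```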



Detailed characteristics

Each row lists the rank, submission code, SPIDEr-FL, and technical report, followed by the method scheme/architecture, amount of parameters, audio modelling, acoustic features, word modelling, word embeddings, data augmentation, sampling rate, learning set-up, ensemble method, loss function, optimizer, learning rate, gradient clipping, gradient norm for clipping, metric monitored for training, and the dataset(s) used for audio modelling, word modelling, and audio similarity (empty fields are omitted).

1 Wu_t6a_4 0.327 wu2023_t6a encoder-decoder 2542000000 conformer BEATs transformer BART 16kHz supervised adamw 2e-5 validation_acc Clotho, AudioCaps Clotho, AudioCaps
2 Wu_t6a_3 0.326 wu2023_t6a encoder-decoder 2542000000 conformer BEATs transformer BART 16kHz supervised adamw 2e-5 validation_acc Clotho, AudioCaps Clotho, AudioCaps
3 Wu_t6a_2 0.324 wu2023_t6a encoder-decoder 887000000 conformer BEATs transformer BART 16kHz supervised adamw 2e-5 validation_acc Clotho, AudioCaps Clotho, AudioCaps
4 Chang_t6a_4 0.315 chang2023_t6a encoder-decoder 1313200128 PANNs PANNs BART BART spec augmentation, AL-mixgen, synonyms substitution 44.1kHz supervised, reinforcement learning crossentropy adamw 1e-6 CIDEr Clotho, AudioCaps, WavCaps Clotho, AudioCaps, WavCaps
5 Labbe_t6a_4 0.314 labbe2023_t6a encoder-decoder 98064347 cnn ConvNeXt-tiny transformer learned mixup, spec_augment, label_smoothing 32kHz supervised crossentropy adamw 5e-4 l2 validation_fense Clotho, AudioCaps, MACS, WavCaps (without FreeSound) Clotho, AudioCaps, MACS, WavCaps (without FreeSound)
6 Wu_t6a_1 0.311 wu2023_t6a encoder-decoder 127000000 conformer BEATs transformer BART 16kHz supervised adamw 2e-5 validation_acc Clotho, AudioCaps Clotho, AudioCaps
7 Labbe_t6a_3 0.309 labbe2023_t6a encoder-decoder 42191083 cnn ConvNeXt-tiny transformer learned mixup, spec_augment, label_smoothing 32kHz supervised crossentropy adamw 5e-4 l2 validation_fense Clotho, AudioCaps, MACS, WavCaps (without FreeSound) Clotho, AudioCaps, MACS, WavCaps (without FreeSound)
8 Chang_t6a_1 0.308 chang2023_t6a encoder-decoder 218866688 PANNs PANNs BART BART spec augmentation, AL-mixgen, synonyms substitution 44.1kHz supervised crossentropy adamw 1e-6 validation loss Clotho, AudioCaps, WavCaps Clotho, AudioCaps, WavCaps
9 Labbe_t6a_2 0.304 labbe2023_t6a encoder-decoder 40133440 cnn ConvNeXt-tiny transformer learned mixup, spec_augment, label_smoothing 32kHz supervised crossentropy adamw 5e-4 l2 validation_fense Clotho Clotho
10 Yan_t6a_4 0.289 yan2023_t6a encoder-decoder 90086352 transformer audioset transformer BERT 16kHz supervised crossentropy adamw 1e-4 validation_loss Clotho, FreeSound Clotho, FreeSound
11 Yan_t6a_3 0.288 yan2023_t6a encoder-decoder 90086352 transformer audioset transformer BERT 16kHz supervised crossentropy adamw 1e-4 validation_loss Clotho, FreeSound Clotho, FreeSound
12 Yan_t6a_1 0.282 yan2023_t6a encoder-decoder 90086352 transformer audioset transformer BERT 16kHz supervised crossentropy adamw 1e-4 validation_loss Clotho, FreeSound Clotho, FreeSound
13 Schaumloeffel_t6a_1 0.282 schaumloeffel2023_t6a encoder-decoder 248325888 transformer CLAP GPT2 SpecAugment 48kHz supervised crossentropy adamw 1e-5 validation_loss Clotho, AudioCaps, MACS, WavText5k, SoundDescs Clotho, AudioCaps, MACS, WavText5k, SoundDescs
14 Schaumloeffel_t6a_2 0.274 schaumloeffel2023_t6a encoder-decoder 248325888 transformer CLAP GPT2 SpecAugment 48kHz supervised crossentropy adamw 1e-5 validation_loss Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
15 Guan_t6a_3 0.273 guan2023_t6a encoder-decoder 35502652 PANNs (CNN10) + GAT, PANNs (CNN10) log-mel energies transformer Word2Vec SpecAugmentation 32.0kHz supervised crossentropy with label smoothing adamw 1e-3 SPIDEr metric Clotho, AudioCaps Clotho, AudioCaps
16 Guan_t6a_4 0.272 guan2023_t6a encoder-decoder 17768222 PANNs (CNN10) + GAT log-mel energies transformer Word2Vec SpecAugmentation 32.0kHz supervised crossentropy with label smoothing adamw 1e-3 SPIDEr metric Clotho, AudioCaps Clotho, AudioCaps
17 Yan_t6a_2 0.270 yan2023_t6a encoder-decoder 90086352 transformer audioset transformer BERT 16kHz supervised crossentropy adamw 1e-4 validation_loss Clotho, FreeSound Clotho, FreeSound
18 Guan_t6a_1 0.270 guan2023_t6a encoder-decoder 8884111 PANNs (CNN10) + GAT log-mel energies transformer Word2Vec SpecAugmentation 32.0kHz supervised crossentropy with label smoothing adamw 1e-3 SPIDEr metric Clotho, AudioCaps Clotho, AudioCaps
19 Kadlčík_t6a_1 0.267 kadlčík2023_t6a encoder-decoder 1550000000 transformer WhisperFeatureExtractor transformer Whisper 16kHz supervised crossentropy adamw 4e-6 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps, AudioSet
20 Lee_t6a_1 0.266 lee2023_t6a encoder-decoder 178755308 cnn PANNs transformer BART 44.1kHz supervised crossentropy adamw 1e-5 validation_loss Clotho, AudioCaps, WavText5K, SoundDescs Clotho, AudioCaps, WavText5K, SoundDescs
21 Baseline 0.264 gontier2023_t6a encoder-decoder 98500000 PANNs log-mel energies transformer BART 16kHz supervised crossentropy adamw 1e-5 validation_loss Clotho Clotho
22 Guan_t6a_2 0.263 guan2023_t6a encoder-decoder 8884111 PANNs (CNN10) + GAT log-mel energies transformer Word2Vec SpecAugmentation 32.0kHz supervised crossentropy with label smoothing adamw 1e-3 SPIDEr metric Clotho, AudioCaps Clotho, AudioCaps
23 Kadlčík_t6a_2 0.261 kadlčík2023_t6a encoder-decoder 244000000 transformer WhisperFeatureExtractor transformer Whisper 16kHz supervised crossentropy adamw 4e-6 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps, AudioSet
24 Greeshma_t6a_1 0.261 greeshma2023_t6a encoder-decoder 178755308 cnn log-mel energies BART BART 44.1kHz supervised crossentropy adamw 1e-5 validation_loss Clotho Clotho
25 Labbe_t6a_1 0.256 labbe2023_t6a encoder-decoder 87715793 cnn PANNs-CNN14 transformer learned mixup, spec_augment, label_smoothing 32kHz supervised crossentropy adamw 5e-4 l2 validation_fense Clotho Clotho
26 Chang_t6a_3 0.231 chang2023_t6a encoder-decoder 656600064 PANNs PANNs BART BART spec augmentation, AL-mixgen, synonyms substitution 44.1kHz supervised, reinforcement learning crossentropy adamw 1e-6 CIDEr Clotho, AudioCaps, WavCaps Clotho, AudioCaps, WavCaps
27 Chang_t6a_2 0.229 chang2023_t6a encoder-decoder 218866688 PANNs PANNs BART BART spec augmentation, AL-mixgen, synonyms substitution 44.1kHz supervised, reinforcement learning crossentropy adamw 1e-6 CIDEr Clotho, AudioCaps, WavCaps Clotho, AudioCaps, WavCaps
28 Kadlčík_t6a_3 0.225 kadlčík2023_t6a encoder-decoder 39000000 transformer WhisperFeatureExtractor transformer Whisper 16kHz supervised crossentropy adamw 4e-6 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps, AudioSet
29 Lim_t6a_1 0.010 lim2023_t6a encoder-decoder 178755308 CNN14 mel energies transformer PASST 44.1 kHz supervised crossentropy adamw 1e-5 validation_loss Clotho Clotho



Technical reports

HYU submission for the DCASE 2023 task 6a: automated audio captioning model using AL-MixGen and synonyms substitution

Jae-Heung Cho1, Yoon-Ah Park1, Jaewon Kim1, Joon-Hyuk Chang1
1Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea

Abstract

This paper presents the automated audio captioning model for participating in the Detection and Classification of Acoustic Scenes and Events 2023 challenge, Task 6A. The model consists of two parts: an audio feature extractor and a language model. The audio feature extractor employed in our model is the pre-trained convolutional neural network 14 (CNN14), trained on AudioSet, which has demonstrated excellent performance in extracting audio features. For the language model, we utilized the bidirectional and auto-regressive transformers (BART) model, which has achieved remarkable success in text generation. We pre-trained the model with the WavCaps, AudioCaps, and Clotho datasets to manage the limited data availability, and then fine-tuned it with the Clotho dataset. Furthermore, AL-MixGen and synonym substitution methods were also implemented for data augmentation. To improve the evaluation metric directly, we trained the model with reinforcement learning to optimize the CIDEr score. Finally, we achieved improved performance by adopting an ensemble of higher-performing models, reaching a SPIDEr score of 0.343.

System characteristics
Data augmentation AL-MixGen, SpecAugment, Synonym substitution
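As a rough illustration of the synonym-substitution augmentation mentioned above, the hedged sketch below swaps random caption words for WordNet synonyms using NLTK. The authors' actual policy (which words are eligible, part-of-speech handling, replacement rate) is not specified here, so treat this only as an illustrative variant.

```python
# Illustrative synonym substitution for caption augmentation with NLTK WordNet.
# Requires the corpus once: import nltk; nltk.download("wordnet")
import random
from nltk.corpus import wordnet


def substitute_synonyms(caption: str, p: float = 0.2) -> str:
    """Randomly replace some words of a caption with a WordNet synonym."""
    out = []
    for word in caption.split():
        lemmas = {lemma.name().replace("_", " ")
                  for synset in wordnet.synsets(word)
                  for lemma in synset.lemmas()}
        lemmas.discard(word)
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(word)
    return " ".join(out)


print(substitute_synonyms("a dog barks loudly while birds chirp"))
```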

DCASE 2023 task 6 automated audio captioning and language-based retrieval

Karanth Greeshma1, Ninaad Rao1, Srikumar Subramanian1, Ankit Shah1
1Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA

Abstract

The objective of this project is to examine audio signals utilizing natural language to capture their complex characteristics. This initiative is part of Task 6 in the DCASE 2023 Competition and consists of two subtasks. The first subtask is Automated Audio Captioning, which generates text descriptions of audio content. This task involves the intermodal processing of an audio signal as input and a text description as output. Our best-performing model for this uses the PANN architecture [1] with the CNN-14 feature extractor and BART [2] encoder and decoder. The second subtask is called Language-Based Audio Retrieval, where the system retrieves audio signals by searching for their sound content descriptions. The queries for this subtask are human-generated audio captions. In this task, our best-performing model uses CLAP [3] audio embeddings and Roberta text embeddings [4]. This document presents a summary of our work done for this challenge.

System characteristics
Data augmentation None

Ensemble systems with contrastive language-audio pretraining and attention-based audio features for audio captioning and retrieval

Feiyang Xiao1, Qiaoxi Zhu2, Haiyan Lan1, Wenwu Wang3, Jian Guan1
1Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2 Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

This technical report describes our submission on Task 6 (automated audio captioning and language-based audio retrieval) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge. The proposed systems in this submission are based on a contrastive language-audio pretraining strategy and the attention-based audio feature representation. Experiments show that our systems can achieve a SPIDEr-FL score of 28.32 on automated audio captioning and an mAP score of 31.18 on language-based audio retrieval.

System characteristics
Data augmentation SpecAugment

A whisper transformer for audio captioning trained with synthetic captions and transfer learning

Marek Kadlčík1,2, Adam Hájek1,2, Jürgen Kieslich2, Radosław Winiecki2,3
1Student at Masaryk University, Brno, Czech Republic, 2Student at Johannes Kepler University, Linz, Austria, 3Student at Politechnika Poznańska, Poznan, Poland

Abstract

The field of audio captioning has seen significant advancements in recent years, driven by the availability of large-scale audio datasets and advancements in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present our experiments’ results, which include model size variations, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.

System characteristics
Data augmentation Gaussian noise, Time shifting, Gain
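The sketch below shows the generic Hugging Face transformers plumbing for generating text from audio with a Whisper encoder-decoder checkpoint. Here "openai/whisper-small" is only a stand-in for the authors' fine-tuned captioning checkpoints (published on the Hugging Face Hub, per the abstract), and the dummy waveform is a placeholder for a real 16 kHz clip.

```python
# Hedged sketch: caption generation with a Whisper checkpoint via transformers.
# "openai/whisper-small" stands in for the authors' fine-tuned model.
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Placeholder input: 10 seconds of silence sampled at 16 kHz (Whisper's rate).
waveform = np.zeros(16000 * 10, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```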

IRIT-UPS DCASE 2023 audio captioning and retrieval system

Etienne Labbé1, Thomas Pellegrini1,2, Julien Pinquier1
1IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France, 2Artificial and Natural Intelligence Toulouse Institute (ANITI)

Abstract

This technical report provides a concise overview of our systems submitted to the DCASE Challenge 2023 for tasks 6a, "Automated Audio Captioning" (AAC), and 6b, "Language-Based Audio Retrieval" (LBAR). In task 6a, we made four distinct submissions. The first submission employed a standard CNN14 encoder paired with a transformer decoder. In the second submission, we replaced this encoder with a ConvNeXt model to enhance audio representation. The third submission incorporated additional training data. We introduced a new task embedding approach to differentiate between different writing styles and audio types. Finally, in the fourth submission, we employed an ensemble method to combine five models trained on different seeds, aiming to improve the quality of the captions. For task 6b, we use the AAC models and we propose a novel approach to accomplish the LBAR task by leveraging the AAC system loss function without requiring any additional training. Our most successful AAC and LBAR systems achieved a SPIDEr-FL score of 0.320 and an mAP@10 score of 0.269. These results demonstrate relative improvements of 22.6% and 21.2% compared to the AAC and LBAR baselines, respectively.

System characteristics
Data augmentation MixUp, SpecAugment, Label Smoothing
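Two of the training ingredients listed for this submission, mixup and label smoothing, are easy to sketch in isolation. The mixing distribution and smoothing strength below are illustrative assumptions, not the authors' exact settings.

```python
# Hedged sketch of mixup on spectrogram batches plus label smoothing.
# Alpha and the smoothing value are illustrative, not the submission's settings.
import torch
import torch.nn as nn


def mixup(specs: torch.Tensor, alpha: float = 0.4):
    """Mix each spectrogram with a randomly chosen partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(specs.size(0))
    mixed = lam * specs + (1.0 - lam) * specs[perm]
    return mixed, perm, lam


specs = torch.randn(8, 64, 1000)              # (batch, mels, frames)
mixed, perm, lam = mixup(specs)

# Label smoothing is built into PyTorch's cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# With mixup, the caption loss would combine both targets with the same lam:
# loss = lam * criterion(logits, tokens) + (1 - lam) * criterion(logits, tokens[perm])
```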

Label-refined sequential training with noisy data for automated audio captioning

Jaeheon Sim1, Eungbeom Kim1, Kyogu Lee1,2
1Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea, 2Department of Intelligence and Information, AIIS, Seoul National University, Seoul, Korea

Abstract

This technical report describes the submission to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 6A: Automated Audio Captioning. We utilize a label-refined sequential training method to leverage a large additional dataset that contains two types of noise: domain shift and label noise. We investigate the usefulness of the additional noisy dataset and observe that models trained directly on a dataset that naively combines the additional and target data suffer from poor performance. From this observation, we aim to fully leverage the additional dataset by addressing the two types of noise simultaneously. We sequentially train the model with prior knowledge about the difference between the target dataset and each of the additional datasets, from the largest to the nearest. We finally train the model on the target dataset, thereby progressively minimizing the domain gap. After this training procedure, we apply a label refinement method based on pseudo-labelling from self-training and repeat the sequential training procedure. The proposed method mitigates the noise in the dataset and achieves improved performance.

System characteristics
Data augmentation None

CAU submission to DCASE 2023 task 6a: Audio captioning using wavegrams that contain frequency information

Seungmin Chou1, Jaeseung Yim1, Changwon Lim1
1Chung-Ang University, Department of Applied Statistics, Seoul, South Korea

Abstract

This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6A. Utilizing wavegram and patchout as proposed in [1] and [2], respectively, we propose audio captioning using wavegrams that contain frequency information. We use pre-trained models trained on AudioSet data to create word embeddings. Our proposed sequence-to-sequence model consists of a CNN14 encoder and a Transformer decoder. Experiments show that the proposed model achieves a SPIDEr score of 0.011.

System characteristics
Data augmentation None

PEACS: Prefix encoding for auditory caption synthesis

Timothy Schaumlöffel1, Martina G. Vilas1,2, Gemma Roig1,3
1Goethe University Frankfurt, Department of Computer Science, Robert-Mayer-Str. 11-15, 60323 Frankfurt, Germany, 2Ernst Strüngmann Institute for Neuroscience, Deutschordenstraße 46, 60528 Frankfurt, Germany, 3The Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany

Abstract

This technical report describes an Automated Audio Captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6a (automated audio captioning). Our approach employs an encoder-decoder architecture, with the encoder utilizing a large contrastive pre-trained HTS-AT capable of handling variable-length audio segments. The decoder is based on the GPT2 model. To incorporate audio into the decoding process, we employ a light mapping network that translates audio representations into a prefix, effectively guiding the decoder’s generation process. Given the limited data availability, we pre-train our model on various audio captioning datasets and fine-tune it on Clotho. We reach a SPIDEr-FL score of 29.3 on the evaluation split of the Clotho-v2 dataset.

System characteristics
Data augmentation SpecAugment
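The prefix-mapping idea described above can be sketched as a small network that turns a pooled audio embedding into a short sequence of "prefix" embeddings prepended to GPT-2's input embeddings. The sizes, the pooled-embedding assumption, and the two-layer mapper below are illustrative, not the authors' configuration.

```python
# Hedged sketch of a prefix-mapping network feeding GPT-2 via inputs_embeds.
# Audio dimension, prefix length, and mapper architecture are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class PrefixMapper(nn.Module):
    def __init__(self, audio_dim: int = 512, prefix_len: int = 10, gpt_dim: int = 768):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.net = nn.Sequential(nn.Linear(audio_dim, gpt_dim * prefix_len), nn.Tanh())

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (batch, audio_dim) -> prefix: (batch, prefix_len, gpt_dim)
        return self.net(audio_emb).view(-1, self.prefix_len, self.gpt_dim)


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = PrefixMapper()

audio_emb = torch.randn(1, 512)                # pooled output of a frozen audio encoder
prefix = mapper(audio_emb)                     # (1, 10, 768)
tokens = tokenizer("a dog barks in the distance", return_tensors="pt").input_ids
token_embs = gpt2.transformer.wte(tokens)      # GPT-2 word embeddings
inputs_embeds = torch.cat([prefix, token_embs], dim=1)
outputs = gpt2(inputs_embeds=inputs_embeds)    # logits for the prefix-conditioned caption
```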

BEATs-based audio captioning model with INSTRUCTOR embedding supervision and ChatGPT mix-up

Shih-Lun Wu1, Xuankai Chang1, Gordon Wichern2, Jee-weon Jung1, François Germain2, Jonathan Le Roux2, Shinji Watanabe1
1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2Speech & Audio Team, Mitsubishi Electric Research Labs, Cambridge, MA, USA

Abstract

DCASE 2023 Task 6A, automated audio captioning (AAC), aims at generating informative descriptions for various sounds from nature and/or human activities. Our AAC system follows the sequence-to-sequence (seq2seq) architecture. The audio encoder stack is comprised of a frozen BEATs Transformer followed by a 2-layer Conformer. The BEATs module, which has been pretrained on both masked audio token prediction and audio event classification, extracts fine-grained (i.e., ≈ 50 Hz) audio features, while the Conformer downsamples and summarizes the audio features before they are cross-attended by the BART text decoder. Besides the autoregressive negative log-likelihood (NLL) loss computed on decoder outputs, we simultaneously apply an audio-text contrastive loss on our encoder output to infuse language modality knowledge into it. Specifically, we feed ground-truth captions into INSTRUCTOR Transformer, a state-of-the-art text embedding model, and teach our audio encoder to predict the INSTRUCTOR text embeddings through InfoNCE loss. In addition, we leverage ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increases not only the amount but also the complexity and diversity of our training data. During inference, we employ nucleus sampling and a hybrid reranking algorithm that considers both likelihood and audio-caption representation similarity. Combining our efforts, our best single model and ensemble system achieve 0.326 and 0.336 SPIDEr-FL scores, respectively, on the Clotho (V2) evaluation split.

System characteristics
Data augmentation SpecAugment, MixUp
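The audio-text contrastive component described above can be sketched as a symmetric InfoNCE loss between audio-encoder outputs and caption (e.g. INSTRUCTOR) embeddings, with matching pairs on the diagonal of the similarity matrix. Batch construction, pooling, and the temperature below are illustrative assumptions.

```python
# Hedged sketch of a symmetric InfoNCE audio-text contrastive loss.
# Embedding dimension, batch size, and temperature are illustrative.
import torch
import torch.nn.functional as F


def info_nce(audio_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Matching audio/text pairs sit on the diagonal of the logits matrix."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(audio_emb.size(0))
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


loss = info_nce(torch.randn(16, 768), torch.randn(16, 768))
print(loss.item())
```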

Leveraging multi-task training and image retrieval with CLAP for audio captioning

Haoran Sun1, Zhiyong Yan1, Yongqing Wang1, Heinrich Dinkel1, Junbo Zhang1, Yujun Wang1
1Xiaomi Corporation, Beijing, China

Abstract

This technical report serves as our submission to Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. Our system, as described in this report, consists of two sub-systems designed for the respective sub-tasks: automated audio captioning (task A) and text-to-audio retrieval (task B). The text-to-audio retrieval system employs a tri-encoder architecture, where pre-trained audio and text encoders are trained to establish relationships. Additionally, an extra pre-trained image encoder is utilized to enhance the connections between these encoders. Through this retrieval process, the audio encoder can be considered a pre-trained encoder for task A. Furthermore, we employ multi-task training with audio tagging during the retrieval phase to strengthen the encoder for audio captioning. Pre-training is conducted using AudioCaps and a portion of WavCaps datasets, and both sub-systems are subsequently finetuned on Clotho dataset. Experimental results demonstrate that our model achieves a SPIDEr score of 0.305 and a SPIDEr-FL score of 0.294 for captioning, as well as an mAP (mean Average Precision) of 0.321 for text-to-audio retrieval.

System characteristics
Data augmentation None