Task description
Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for a general audio recording. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e., words that appear only once in a split).
Participants used the freely available Clotho development and evaluation splits, as well as any external data they deemed fit. The developed systems are evaluated on the captions they generate for the Clotho testing split, for which the corresponding captions are not publicly provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.
The ranking of the submitted systems is based on the SPIDEr metric penalized by fluency error detection (SPIDEr-FL). This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
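To make the ranking metric concrete, the sketch below shows how SPIDEr and its fluency-penalized variant combine per-caption scores. It is a minimal illustration, not the official evaluation code: the per-caption CIDEr/SPICE values and the fluency-error flags are assumed to come from the standard captioning toolkit and the FENSE error detector, and the 0.9 penalty coefficient follows the FENSE formulation.

```python
# Minimal sketch of SPIDEr and SPIDEr-FL scoring (not the official evaluation code).
# Assumes per-caption CIDEr/SPICE scores and a binary fluency-error flag are already
# available, e.g. from the coco-caption toolkit and the FENSE error detector.
from statistics import mean

def spider(cider: float, spice: float) -> float:
    """SPIDEr is the average of CIDEr and SPICE for one candidate caption."""
    return 0.5 * (cider + spice)

def spider_fl(cider: float, spice: float, has_fluency_error: bool,
              penalty: float = 0.9) -> float:
    """SPIDEr-FL: if the fluency error detector flags the caption, its SPIDEr
    score is down-weighted by the penalty coefficient (0.9 here, following FENSE)."""
    score = spider(cider, spice)
    return score * (1.0 - penalty) if has_fluency_error else score

# Corpus-level scores are the mean over all candidate captions.
per_caption = [(0.62, 0.18, False), (0.41, 0.12, True)]  # (CIDEr, SPICE, flagged)
print(mean(spider_fl(c, s, f) for c, s, f in per_caption))
```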
Teams ranking
Listed here are the best systems from all teams, ranked by SPIDEr-FL. To allow a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow comparison with systems and methods developed outside this task, since the captions for the Clotho evaluation split are freely available.
Submission code | Rank | Corresponding author | Technical Report | METEOR (testing) | CIDEr (testing) | SPICE (testing) | SPIDEr (testing) | SPIDEr-FL (testing) | METEOR (evaluation) | CIDEr (evaluation) | SPICE (evaluation) | SPIDEr (evaluation) | SPIDEr-FL (evaluation) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Wu_t6a_4 | 1 | Shih-Lun Wu | wu2023_t6a | 0.195 | 0.505 | 0.149 | 0.327 | 0.327 | 0.197 | 0.505 | 0.145 | 0.325 | 0.325 | |
Chang_t6a_4 | 2 | Joon-Hyuk Chang | chang2023_t6a | 0.197 | 0.539 | 0.149 | 0.344 | 0.315 | 0.197 | 0.541 | 0.146 | 0.343 | 0.313 | |
Labbe_t6a_4 | 3 | Etienne Labbe | labbe2023_t6a | 0.193 | 0.486 | 0.142 | 0.314 | 0.314 | 0.193 | 0.500 | 0.140 | 0.320 | 0.320 | |
Yan_t6a_4 | 4 | Zhiyong Yan | yan2023_t6a | 0.191 | 0.461 | 0.139 | 0.300 | 0.289 | 0.192 | 0.474 | 0.136 | 0.305 | 0.294 | |
Schaumloeffel_t6a_1 | 5 | Timothy Schaumloeffel | schaumloeffel2023_t6a | 0.181 | 0.436 | 0.130 | 0.283 | 0.282 | 0.183 | 0.454 | 0.132 | 0.293 | 0.292 | |
Guan_t6a_3 | 6 | Jian Guan | guan2023_t6a | 0.180 | 0.427 | 0.129 | 0.278 | 0.273 | 0.184 | 0.450 | 0.129 | 0.290 | 0.283 | |
Kadlčík_t6a_1 | 7 | Marek Kadlčík | kadlčík2023_t6a | 0.172 | 0.414 | 0.123 | 0.269 | 0.267 | 0.378 | 0.433 | 0.126 | 0.279 | ||
Lee_t6a_1 | 8 | Kyogu Lee | lee2023_t6a | 0.176 | 0.416 | 0.123 | 0.269 | 0.266 | 0.177 | 0.431 | 0.126 | 0.279 | 0.275 | |
Baseline | 9 | Felix Gontier | gontier2023_t6a | 0.177 | 0.415 | 0.126 | 0.271 | 0.264 | 0.177 | 0.420 | 0.119 | 0.270 | 0.261 | |
Greeshma_t6a_1 | 10 | Karanth Greeshma | greeshma2023_t6a | 0.178 | 0.406 | 0.125 | 0.265 | 0.261 | 0.178 | 0.419 | 0.121 | 0.270 | 0.264 | |
Lim_t6a_1 | 11 | Changwon Lim | lim2023_t6a | 0.089 | 0.035 | 0.039 | 0.037 | 0.010 | 0.089 | 0.034 | 0.038 | 0.036 | 0.011 |
Systems ranking
Listed here are all submitted systems and their rankings according to the different metrics and metric groupings. The first table shows all systems with the challenge metrics, and the second shows all systems with the additional metrics (Sentence-BERT and FENSE).
Detailed information for each system is provided in the next section.
Systems ranking, challenge metrics
Submission code | Best official system rank | Technical Report | METEOR (testing) | CIDEr (testing) | SPICE (testing) | SPIDEr (testing) | SPIDEr-FL (testing) | METEOR (evaluation) | CIDEr (evaluation) | SPICE (evaluation) | SPIDEr (evaluation) | SPIDEr-FL (evaluation) |
---|---|---|---|---|---|---|---|---|---|---|---|---
Wu_t6a_4 | 1 | wu2023_t6a | 0.195 | 0.505 | 0.149 | 0.327 | 0.327 | 0.197 | 0.505 | 0.145 | 0.325 | 0.325 | |
Wu_t6a_3 | 2 | wu2023_t6a | 0.196 | 0.504 | 0.149 | 0.326 | 0.326 | 0.197 | 0.525 | 0.147 | 0.336 | 0.336 | |
Wu_t6a_2 | 3 | wu2023_t6a | 0.196 | 0.499 | 0.149 | 0.324 | 0.324 | 0.198 | 0.510 | 0.147 | 0.329 | 0.329 | |
Chang_t6a_4 | 4 | chang2023_t6a | 0.197 | 0.539 | 0.149 | 0.344 | 0.315 | 0.197 | 0.541 | 0.146 | 0.343 | 0.313 | |
Labbe_t6a_4 | 5 | labbe2023_t6a | 0.193 | 0.486 | 0.142 | 0.314 | 0.314 | 0.193 | 0.500 | 0.140 | 0.320 | 0.320 | |
Wu_t6a_1 | 6 | wu2023_t6a | 0.190 | 0.477 | 0.145 | 0.311 | 0.311 | 0.193 | 0.506 | 0.146 | 0.326 | 0.326 | |
Labbe_t6a_3 | 7 | labbe2023_t6a | 0.192 | 0.479 | 0.141 | 0.310 | 0.309 | 0.192 | 0.485 | 0.139 | 0.312 | 0.310 | |
Chang_t6a_1 | 8 | chang2023_t6a | 0.188 | 0.486 | 0.138 | 0.312 | 0.308 | 0.188 | 0.483 | 0.137 | 0.309 | 0.307 | |
Labbe_t6a_2 | 9 | labbe2023_t6a | 0.189 | 0.470 | 0.139 | 0.304 | 0.304 | 0.190 | 0.474 | 0.136 | 0.305 | 0.303 | |
Yan_t6a_4 | 10 | yan2023_t6a | 0.191 | 0.461 | 0.139 | 0.300 | 0.289 | 0.192 | 0.474 | 0.136 | 0.305 | 0.294 | |
Yan_t6a_3 | 11 | yan2023_t6a | 0.190 | 0.457 | 0.139 | 0.298 | 0.288 | 0.190 | 0.468 | 0.135 | 0.302 | 0.292 | |
Yan_t6a_1 | 12 | yan2023_t6a | 0.187 | 0.445 | 0.137 | 0.291 | 0.282 | 0.191 | 0.471 | 0.136 | 0.304 | 0.295 | |
Schaumloeffel_t6a_1 | 13 | schaumloeffel2023_t6a | 0.181 | 0.436 | 0.130 | 0.283 | 0.282 | 0.183 | 0.454 | 0.132 | 0.293 | 0.292 | |
Schaumloeffel_t6a_2 | 14 | schaumloeffel2023_t6a | 0.178 | 0.425 | 0.124 | 0.274 | 0.274 | 0.179 | 0.443 | 0.126 | 0.285 | 0.284 | |
Guan_t6a_3 | 15 | guan2023_t6a | 0.180 | 0.427 | 0.129 | 0.278 | 0.273 | 0.184 | 0.450 | 0.129 | 0.290 | 0.283 | |
Guan_t6a_4 | 16 | guan2023_t6a | 0.181 | 0.429 | 0.130 | 0.279 | 0.272 | 0.184 | 0.443 | 0.128 | 0.285 | 0.279 | |
Yan_t6a_2 | 17 | yan2023_t6a | 0.185 | 0.424 | 0.132 | 0.278 | 0.270 | 0.189 | 0.460 | 0.136 | 0.298 | 0.286 | |
Guan_t6a_1 | 18 | guan2023_t6a | 0.180 | 0.421 | 0.131 | 0.276 | 0.270 | 0.182 | 0.438 | 0.126 | 0.282 | 0.275 | |
Kadlčík_t6a_1 | 19 | kadlčík2023_t6a | 0.172 | 0.414 | 0.123 | 0.269 | 0.267 | 0.378 | 0.433 | 0.126 | 0.279 | ||
Lee_t6a_1 | 20 | lee2023_t6a | 0.176 | 0.416 | 0.123 | 0.269 | 0.266 | 0.177 | 0.431 | 0.126 | 0.279 | 0.275 | |
Baseline | 21 | gontier2023_t6a | 0.177 | 0.415 | 0.126 | 0.271 | 0.264 | 0.177 | 0.420 | 0.119 | 0.270 | 0.261 | |
Guan_t6a_2 | 22 | guan2023_t6a | 0.178 | 0.415 | 0.127 | 0.271 | 0.263 | 0.181 | 0.426 | 0.124 | 0.275 | 0.267 | |
Kadlčík_t6a_2 | 23 | kadlčík2023_t6a | 0.177 | 0.406 | 0.129 | 0.267 | 0.261 | 0.378 | 0.414 | 0.123 | 0.269 | ||
Greeshma_t6a_1 | 24 | greeshma2023_t6a | 0.178 | 0.406 | 0.125 | 0.265 | 0.261 | 0.178 | 0.419 | 0.121 | 0.270 | 0.264 | |
Labbe_t6a_1 | 25 | labbe2023_t6a | 0.177 | 0.389 | 0.125 | 0.257 | 0.256 | 0.179 | 0.414 | 0.126 | 0.270 | 0.269 | |
Chang_t6a_3 | 26 | chang2023_t6a | 0.194 | 0.527 | 0.142 | 0.335 | 0.231 | 0.195 | 0.539 | 0.143 | 0.341 | 0.233 | |
Chang_t6a_2 | 27 | chang2023_t6a | 0.195 | 0.520 | 0.141 | 0.330 | 0.229 | 0.195 | 0.526 | 0.143 | 0.335 | 0.225 | |
Kadlčík_t6a_3 | 28 | kadlčík2023_t6a | 0.161 | 0.348 | 0.116 | 0.232 | 0.225 | 0.345 | 0.340 | 0.108 | 0.224 | ||
Lim_t6a_1 | 29 | lim2023_t6a | 0.089 | 0.035 | 0.039 | 0.037 | 0.010 | 0.089 | 0.034 | 0.038 | 0.036 | 0.011 |
Systems ranking, additional metrics
Submission code | Best official system rank | Technical Report | Sentence-BERT (testing) | FENSE (testing) |
---|---|---|---|---
Wu_t6a_4 | 1 | wu2023_t6a | 0.536 | 0.536 | |
Wu_t6a_3 | 2 | wu2023_t6a | 0.536 | 0.536 | |
Wu_t6a_2 | 3 | wu2023_t6a | 0.538 | 0.538 | |
Chang_t6a_4 | 4 | chang2023_t6a | 0.530 | 0.488 | |
Labbe_t6a_4 | 5 | labbe2023_t6a | 0.523 | 0.522 | |
Wu_t6a_1 | 6 | wu2023_t6a | 0.526 | 0.526 | |
Labbe_t6a_3 | 7 | labbe2023_t6a | 0.521 | 0.519 | |
Chang_t6a_1 | 8 | chang2023_t6a | 0.527 | 0.521 | |
Labbe_t6a_2 | 9 | labbe2023_t6a | 0.523 | 0.522 | |
Yan_t6a_4 | 10 | yan2023_t6a | 0.521 | 0.498 | |
Yan_t6a_3 | 11 | yan2023_t6a | 0.523 | 0.501 | |
Yan_t6a_1 | 12 | yan2023_t6a | 0.520 | 0.503 | |
Schaumloeffel_t6a_1 | 13 | schaumloeffel2023_t6a | 0.501 | 0.498 | |
Schaumloeffel_t6a_2 | 14 | schaumloeffel2023_t6a | 0.496 | 0.496 | |
Guan_t6a_3 | 15 | guan2023_t6a | 0.496 | 0.487 | |
Guan_t6a_4 | 16 | guan2023_t6a | 0.496 | 0.480 | |
Yan_t6a_2 | 17 | yan2023_t6a | 0.509 | 0.490 | |
Guan_t6a_1 | 18 | guan2023_t6a | 0.495 | 0.481 | |
Kadlčík_t6a_1 | 19 | kadlčík2023_t6a | 0.495 | 0.492 | |
Lee_t6a_1 | 20 | lee2023_t6a | 0.500 | 0.495 | |
Baseline | 21 | gontier2023_t6a | 0.482 | 0.472 | |
Guan_t6a_2 | 22 | guan2023_t6a | 0.494 | 0.475 | |
Kadlčík_t6a_2 | 23 | kadlčík2023_t6a | 0.492 | 0.481 | |
Greeshma_t6a_1 | 24 | greeshma2023_t6a | 0.486 | 0.477 | |
Labbe_t6a_1 | 25 | labbe2023_t6a | 0.481 | 0.480 | |
Chang_t6a_3 | 26 | chang2023_t6a | 0.522 | 0.363 | |
Chang_t6a_2 | 27 | chang2023_t6a | 0.522 | 0.362 | |
Kadlčík_t6a_3 | 28 | kadlčík2023_t6a | 0.459 | 0.445 | |
Lim_t6a_1 | 29 | lim2023_t6a | 0.121 | 0.033 |
System characteristics
This section presents the characteristics of the submitted systems. Two tables are provided for easy reference, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.
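All submissions listed below follow the encoder-decoder scheme, pairing a pretrained audio encoder with a transformer-style text decoder. As a purely illustrative point of reference for how the listed components fit together, here is a minimal PyTorch sketch; the layer sizes, the log-mel front end, and the toy encoder are placeholders rather than any team's actual configuration.

```python
# Illustrative encoder-decoder audio captioning skeleton (not any submitted system).
# An audio encoder turns a log-mel spectrogram into a sequence of embeddings that a
# transformer decoder cross-attends to while generating caption tokens.
import torch
import torch.nn as nn

class ToyAudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Stand-in audio encoder: real systems use PANNs, ConvNeXt, BEATs, CLAP, etc.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, token_ids):
        # mel: (batch, n_mels, time); token_ids: (batch, caption_length)
        memory = self.encoder(mel).transpose(1, 2)           # (batch, time', d_model)
        tgt = self.embed(token_ids)
        T = token_ids.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                             # (batch, length, vocab)

model = ToyAudioCaptioner()
logits = model(torch.randn(2, 64, 500), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```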
Overview of characteristics
Rank | Submission code | SPIDEr-FL | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---
1 | Wu_t6a_4 | 0.327 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | transformer | |
2 | Wu_t6a_3 | 0.326 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | transformer | |
3 | Wu_t6a_2 | 0.324 | wu2023_t6a | encoder-decoder | 887000000 | conformer | transformer | |
4 | Chang_t6a_4 | 0.315 | chang2023_t6a | encoder-decoder | 1313200128 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
5 | Labbe_t6a_4 | 0.314 | labbe2023_t6a | encoder-decoder | 98064347 | cnn | transformer | mixup, spec_augment, label_smoothing |
6 | Wu_t6a_1 | 0.311 | wu2023_t6a | encoder-decoder | 127000000 | conformer | transformer | |
7 | Labbe_t6a_3 | 0.309 | labbe2023_t6a | encoder-decoder | 42191083 | cnn | transformer | mixup, spec_augment, label_smoothing |
8 | Chang_t6a_1 | 0.308 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
9 | Labbe_t6a_2 | 0.304 | labbe2023_t6a | encoder-decoder | 40133440 | cnn | transformer | mixup, spec_augment, label_smoothing |
10 | Yan_t6a_4 | 0.289 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
11 | Yan_t6a_3 | 0.288 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
12 | Yan_t6a_1 | 0.282 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
13 | Schaumloeffel_t6a_1 | 0.282 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | GPT2 | SpecAugment |
14 | Schaumloeffel_t6a_2 | 0.274 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | GPT2 | SpecAugment |
15 | Guan_t6a_3 | 0.273 | guan2023_t6a | encoder-decoder | 35502652 | PANNs (CNN10) + GAT, PANNs (CNN10) | transformer | SpecAugmentation |
16 | Guan_t6a_4 | 0.272 | guan2023_t6a | encoder-decoder | 17768222 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
17 | Yan_t6a_2 | 0.270 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
18 | Guan_t6a_1 | 0.270 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
19 | Kadlčík_t6a_1 | 0.267 | kadlčík2023_t6a | encoder-decoder | 1550000000 | transformer | transformer | |
20 | Lee_t6a_1 | 0.266 | lee2023_t6a | encoder-decoder | 178755308 | cnn | transformer | |
21 | Baseline | 0.264 | gontier2023_t6a | encoder-decoder | 98500000 | PANNs | transformer | |
22 | Guan_t6a_2 | 0.263 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
23 | Kadlčík_t6a_2 | 0.261 | kadlčík2023_t6a | encoder-decoder | 244000000 | transformer | transformer | |
24 | Greeshma_t6a_1 | 0.261 | greeshma2023_t6a | encoder-decoder | 178755308 | cnn | BART | |
25 | Labbe_t6a_1 | 0.256 | labbe2023_t6a | encoder-decoder | 87715793 | cnn | transformer | mixup, spec_augment, label_smoothing |
26 | Chang_t6a_3 | 0.231 | chang2023_t6a | encoder-decoder | 656600064 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
27 | Chang_t6a_2 | 0.229 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
28 | Kadlčík_t6a_3 | 0.225 | kadlčík2023_t6a | encoder-decoder | 39000000 | transformer | transformer | |
29 | Lim_t6a_1 | 0.010 | lim2023_t6a | encoder-decoder | 178755308 | CNN14 | transformer |
Detailed characteristics
Rank | Submission code | SPIDEr-FL | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Wu_t6a_4 | 0.327 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
2 | Wu_t6a_3 | 0.326 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
3 | Wu_t6a_2 | 0.324 | wu2023_t6a | encoder-decoder | 887000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
4 | Chang_t6a_4 | 0.315 | chang2023_t6a | encoder-decoder | 1313200128 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
5 | Labbe_t6a_4 | 0.314 | labbe2023_t6a | encoder-decoder | 98064347 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | |||
6 | Wu_t6a_1 | 0.311 | wu2023_t6a | encoder-decoder | 127000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
7 | Labbe_t6a_3 | 0.309 | labbe2023_t6a | encoder-decoder | 42191083 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | |||
8 | Chang_t6a_1 | 0.308 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised | crossentropy | adamw | 1e-6 | validation loss | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
9 | Labbe_t6a_2 | 0.304 | labbe2023_t6a | encoder-decoder | 40133440 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho | Clotho | |||
10 | Yan_t6a_4 | 0.289 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
11 | Yan_t6a_3 | 0.288 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
12 | Yan_t6a_1 | 0.282 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
13 | Schaumloeffel_t6a_1 | 0.282 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | CLAP | GPT2 | SpecAugment | 48kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, MACS, WavText5k, SoundDescs | Clotho, AudioCaps, MACS, WavText5k, SoundDescs | |||||
14 | Schaumloeffel_t6a_2 | 0.274 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | CLAP | GPT2 | SpecAugment | 48kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
15 | Guan_t6a_3 | 0.273 | guan2023_t6a | encoder-decoder | 35502652 | PANNs (CNN10) + GAT, PANNs (CNN10) | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
16 | Guan_t6a_4 | 0.272 | guan2023_t6a | encoder-decoder | 17768222 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
17 | Yan_t6a_2 | 0.270 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
18 | Guan_t6a_1 | 0.270 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
19 | Kadlčík_t6a_1 | 0.267 | kadlčík2023_t6a | encoder-decoder | 1550000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
20 | Lee_t6a_1 | 0.266 | lee2023_t6a | encoder-decoder | 178755308 | cnn | PANNs | transformer | BART | 44.1kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, WavText5K, SoundDescs | Clotho, AudioCaps, WavText5K, SoundDescs | |||||
21 | Baseline | 0.264 | gontier2023_t6a | encoder-decoder | 98500000 | PANNs | log-mel energies | transformer | BART | 16kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho | |||||
22 | Guan_t6a_2 | 0.263 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | Kadlčík_t6a_2 | 0.261 | kadlčík2023_t6a | encoder-decoder | 244000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
24 | Greeshma_t6a_1 | 0.261 | greeshma2023_t6a | encoder-decoder | 178755308 | cnn | log-mel energies | BART | BART | 44.1kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho | |||||
25 | Labbe_t6a_1 | 0.256 | labbe2023_t6a | encoder-decoder | 87715793 | cnn | PANNs-CNN14 | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho | Clotho | |||
26 | Chang_t6a_3 | 0.231 | chang2023_t6a | encoder-decoder | 656600064 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
27 | Chang_t6a_2 | 0.229 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
28 | Kadlčík_t6a_3 | 0.225 | kadlčík2023_t6a | encoder-decoder | 39000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
29 | Lim_t6a_1 | 0.010 | lim2023_t6a | encoder-decoder | 178755308 | CNN14 | mel energies | transformer | PASST | 44.1 kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho |
Technical reports
HYU submission for the DCASE 2023 task 6a: automated audio captioning model using AL-MixGen and synonyms substitution
Jae-Heung Cho1, Yoon-Ah Park1, Jaewon Kim1, Joon-Hyuk Chang1
1Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea
chang_t6a_1 chang_t6a_2 chang_t6a_3 chang_t6a_4
Abstract
This paper presents the automated audio captioning model for participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 6A. The model consists of two parts: an audio feature extractor and a language model. The audio feature extractor employed in our model is the pre-trained convolutional neural network 14 (CNN14), trained with AudioSet, which has demonstrated excellent performance in extracting audio features. For the language model, we utilized the bidirectional and auto-regressive transformers (BART) model, which has achieved remarkable success in text generation. We pre-trained the model with the WavCaps, AudioCaps, and Clotho datasets to manage the limited data availability, and then fine-tuned it with the Clotho dataset. Furthermore, AL-MixGen and synonyms substitution methods were also implemented for data augmentation. To improve the evaluation metric directly, we trained the model with reinforcement learning to optimize the CIDEr score. Finally, we achieved improved performance by adopting an ensemble of higher-performing models, reaching a SPIDEr score of 0.343.
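The abstract mentions reinforcement learning towards CIDEr but does not spell out the exact recipe; the common choice for this is self-critical sequence training (SCST), sketched below under that assumption. The `sample_caption`/`greedy_caption` methods and the `cider_score` callback are hypothetical placeholders, not the authors' code.

```python
# Hedged SCST-style sketch for optimizing CIDEr (assumed recipe, not the authors' code).
# `sample_caption` / `greedy_caption` are assumed to return (token_log_probs, text),
# and `cider_score(text, references)` is a placeholder for a real CIDEr scorer.
import torch

def scst_loss(model, audio_features, references, cider_score):
    log_probs, sampled_text = model.sample_caption(audio_features)   # stochastic decode
    with torch.no_grad():
        _, greedy_text = model.greedy_caption(audio_features)        # baseline decode
    # Reward is the CIDEr advantage of the sampled caption over the greedy baseline.
    reward = cider_score(sampled_text, references) - cider_score(greedy_text, references)
    # REINFORCE with the greedy score as baseline: raise likelihood when reward > 0.
    return -reward * log_probs.sum()
```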
System characteristics
Data augmentation | AL-MixGen, SpecAugment, Synonym substitution |
DCASE 2023 task 6 automated audio captioning and language-based retrieval
Karanth Greeshma1, Ninaad Rao1, Srikumar Subramanian1, Ankit Shah1
1Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA
Abstract
The objective of this project is to examine audio signals utilizing natural language to capture their complex characteristics. This initiative is part of Task 6 in the DCASE 2023 Competition and consists of two subtasks. The first subtask is Automated Audio Captioning, which generates text descriptions of audio content. This task involves the intermodal processing of an audio signal as input and a text description as output. Our best-performing model for this uses the PANN architecture [1] with the CNN-14 feature extractor and BART [2] encoder and decoder. The second subtask is called Language-Based Audio Retrieval, where the system retrieves audio signals by searching for their sound content descriptions. The queries for this subtask are human-generated audio captions. In this task, our best-performing model uses CLAP [3] audio embeddings and RoBERTa text embeddings [4]. This document presents a summary of our work done for this challenge.
System characteristics
Data augmentation | None |
Ensemble systems with contrastive language-audio pretraining and attention-based audio features for audio captioning and retrieval
Feiyang Xiao1, Qiaoxi Zhu2, Haiyan Lan1, Wenwu Wang3, Jian Guan1
1Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2 Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
guan_t6a_1 guan_t6a_2 guan_t6a_3 guan_t6a_4
Abstract
This technical report describes our submission to Task 6 (automated audio captioning and language-based audio retrieval) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge. The proposed systems in this submission are based on a contrastive language-audio pretraining strategy and an attention-based audio feature representation. Experiments show that our systems can achieve a SPIDEr-FL score of 28.32 on automated audio captioning and an mAP score of 31.18 on language-based audio retrieval.
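The contrastive language-audio pretraining strategy mentioned above typically trains an audio encoder and a text encoder with a symmetric InfoNCE objective over paired clips and captions. The following sketch shows that objective in isolation; it is a generic CLAP-style formulation assumed here for illustration, not the submitted implementation.

```python
# Symmetric InfoNCE loss over a batch of paired audio/text embeddings (generic CLAP-style
# sketch, not the submitted system). Matching pairs sit on the diagonal of the similarity
# matrix and are treated as the positive class for both directions.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)            # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)        # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)

print(clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```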
System characteristics
Data augmentation | SpecAugment |
A whisper transformer for audio captioning trained with synthetic captions and transfer learning
Marek Kadlčík1,2, Adam Hájek1,2, Jürgen Kieslich2, Radosław Winiecki2,3
1Student at Masaryk University, Brno, Czech Republic, 2Student at Johannes Kepler University, Linz, Austria, 3Student at Politechnika Poznańska, Poznan, Poland
kadlcik_t6a_1 kadlcik_t6a_2 kadlcik_t6a_3
Abstract
The field of audio captioning has seen significant advancements in recent years, driven by the availability of large-scale audio datasets and advancements in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present our experiments’ results, which include model size variations, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.
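Because the system reuses Whisper's encoder-decoder, caption generation looks like ordinary Whisper inference. The sketch below shows that pattern with the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder, since the exact repository names of the released models are not listed here.

```python
# Generic inference sketch for a Whisper model fine-tuned on captioning data
# (the checkpoint path is a placeholder, not the authors' published repository name).
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

checkpoint = "./whisper-audio-captioning"          # hypothetical fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

audio, sr = librosa.load("example.wav", sr=16000)  # Whisper's feature extractor expects 16 kHz
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```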
System characteristics
Data augmentation | Gaussian noise, Time shifting, Gain |
IRIT-UPS DCASE 2023 audio captioning and retrieval system
Etienne Labbé1, Thomas Pellegrini1,2, Julien Pinquier1
1IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France, 2Artificial and Natural Intelligence Toulouse Institute (ANITI)
labbe_t6a_1 labbe_t6a_2 labbe_t6a_3 labbe_t6a_4
Abstract
This technical report provides a concise overview of our systems submitted to the DCASE Challenge 2023 for tasks 6a, "Automated Audio Captioning" (AAC), and 6b, "Language-Based Audio Retrieval" (LBAR). In task 6a, we made four distinct submissions. The first submission employed a standard CNN14 encoder paired with a transformer decoder. In the second submission, we replaced this encoder with a ConvNeXt model to enhance audio representation. The third submission incorporated additional training data. We introduced a new task embedding approach to differentiate between different writing styles and audio types. Finally, in the fourth submission, we employed an ensemble method to combine five models trained on different seeds, aiming to improve the quality of the captions. For task 6b, we use the AAC models and we propose a novel approach to accomplish the LBAR task by leveraging the AAC system loss function without requiring any additional training. Our most successful AAC and LBAR systems achieved a SPIDEr-FL score of 0.320 and an mAP@10 score of 0.269. These results demonstrate relative improvements of 22.6% and 21.2% compared to the AAC and LBAR baselines, respectively.
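The retrieval approach described above, reusing the AAC loss, can be written down directly: for a text query, every candidate clip is scored by the teacher-forced cross-entropy the caption model assigns to the query, and clips are ranked by that loss. The sketch below assumes a `captioning_loss(audio, text)` interface as a placeholder for a trained AAC model; it is not the authors' code.

```python
# Hedged sketch of retrieval by captioning loss (assumed interface, not the submitted code):
# rank audio clips by the cross-entropy the caption model assigns to the query caption.
import torch

def retrieve(query_caption, audio_clips, captioning_loss, top_k=10):
    """captioning_loss(audio, text) -> scalar teacher-forced cross-entropy (placeholder)."""
    with torch.no_grad():
        scores = [captioning_loss(audio, query_caption).item() for audio in audio_clips]
    # A lower loss means the caption explains the audio better, so sort ascending.
    ranked = sorted(range(len(audio_clips)), key=lambda i: scores[i])
    return ranked[:top_k]
```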
System characteristics
Data augmentation | MixUp, SpecAugment, Label Smoothing |
Label-refined sequential training with noisy data for automated audio captioning
Jaeheon Sim1, Eungbeom Kim1, Kyogu Lee1,2
1Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea, 2Department of Intelligence and Information, AIIS, Seoul National University, Seoul, Korea
lee_t6a_1
Abstract
This technical report describes the submission to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 6A: Automated Audio Captioning. We utilize a label-refined sequential training method to leverage a large additional dataset that contains two types of noise: domain shift and label noise. We investigate the usefulness of the additional noisy dataset and observe that models trained naively on the combination of the additional and target datasets suffer from poor performance. From this observation, we aim to fully leverage the additional dataset by addressing the two types of noise simultaneously. We sequentially train the model with prior knowledge about the difference between the target dataset and each of the additional datasets, from the largest to the nearest. We finally train the model on the target dataset, thereby progressively minimizing the domain gap. After this training procedure, we apply a label refinement method based on pseudo-labelling from self-training and repeat the sequential training procedure. The proposed method mitigates the noise in the dataset and achieves improved performance.
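Read as pseudocode, the procedure above alternates sequential training (through the external datasets towards Clotho) with a label-refinement pass that replaces noisy captions by the model's own predictions. The sketch below is a high-level paraphrase under that reading, with placeholder `train_on`/`predict` methods rather than the authors' implementation.

```python
# High-level paraphrase of label-refined sequential training (placeholder methods,
# not the authors' implementation).
def label_refined_sequential_training(model, external_datasets, clotho, rounds=2):
    # external_datasets is assumed to be ordered as described above,
    # from the largest / most distant dataset to the nearest one.
    for _ in range(rounds):
        for dataset in external_datasets:
            model.train_on(dataset)            # sequential training, far -> near
        model.train_on(clotho)                 # finish on the target dataset
        for dataset in external_datasets:
            # Label refinement: replace noisy captions with the model's own predictions.
            dataset.captions = [model.predict(clip) for clip in dataset.audio]
    return model
```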
System characteristics
Data augmentation | None |
CAU submission to DCASE 2023 task 6a: Audio captioning using wavegrams that contain frequency information
Seungmin Chou1, Jaeseung Yim1, Changwon Lim1
1Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
lim_t6a_1
Abstract
This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6A. Utilizing wavegrams and patchout as proposed in [1] and [2], respectively, we propose audio captioning using wavegrams that contain frequency information. We use pre-trained models, trained on AudioSet data, to create the word embeddings. Our proposed sequence-to-sequence model consists of a CNN14 encoder and a Transformer decoder. Experiments show that the proposed model achieves a SPIDEr score of 0.011.
System characteristics
Data augmentation | None |
PEACS: Prefix encoding for auditory caption synthesis
Timothy Schaumlöffel1, Martina G. Vilas1,2, Gemma Roig1,3
1Goethe University Frankfurt, Department of Computer Science, Robert-Mayer-Str. 11-15, 60323 Frankfurt, Germany, 2Ernst Strüngmann Institute for Neuroscience, Deutschordenstraße 46, 60528 Frankfurt, Germany, 3The Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
schaumloeffel_t6a_1 schaumloeffel_t6a_2
Abstract
This technical report describes an Automated Audio Captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6a (automated audio captioning). Our approach employs an encoder-decoder architecture, with the encoder utilizing a large contrastive pre-trained HTS-AT capable of handling variable-length audio segments. The decoder is based on the GPT2 model. To incorporate audio into the decoding process, we employ a light mapping network that translates audio representations into a prefix, effectively guiding the decoder’s generation process. Given the limited data availability, we pre-train our model on various audio captioning datasets and fine-tune it on Clotho. We reach a SPIDEr-FL score of 29.3 on the evaluation split of the Clotho-v2 dataset.
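The light mapping network follows the familiar prefix-conditioning pattern: a pooled audio embedding is projected to a short sequence of pseudo-token embeddings that are prepended to the caption embeddings before the GPT2 decoder. The sketch below illustrates that pattern with assumed dimensions and a plain MLP mapper; it is not the submitted PEACS code.

```python
# Illustrative prefix-mapping sketch (assumed dimensions; not the submitted PEACS system).
# A pooled audio embedding (e.g. from a CLAP/HTS-AT encoder) is mapped to `prefix_len`
# pseudo-token embeddings that are prepended to the caption embeddings fed to GPT-2.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AudioPrefixCaptioner(nn.Module):
    def __init__(self, audio_dim=512, prefix_len=10):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d_model = self.gpt2.config.n_embd
        self.prefix_len = prefix_len
        self.mapper = nn.Sequential(                       # the "light mapping network"
            nn.Linear(audio_dim, d_model * prefix_len), nn.Tanh(),
        )

    def forward(self, audio_emb, caption_ids):
        batch = audio_emb.size(0)
        prefix = self.mapper(audio_emb).view(batch, self.prefix_len, -1)
        token_emb = self.gpt2.transformer.wte(caption_ids)
        inputs_embeds = torch.cat([prefix, token_emb], dim=1)
        # Prefix positions carry no language-model targets, so mask them out with -100.
        ignore = torch.full((batch, self.prefix_len), -100, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels).loss

model = AudioPrefixCaptioner()
print(model(torch.randn(2, 512), torch.randint(0, 50257, (2, 16))))  # training loss
```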
System characteristics
Data augmentation | SpecAugment |
BEATs-based audio captioning model with INSTRUCTOR embedding supervision and ChatGPT mix-up
Shih-Lun Wu1, Xuankai Chang1, Gordon Wichern2, Jee-weon Jung1, François Germain2, Jonathan Le Roux2, Shinji Watanabe1
1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2Speech & Audio Team, Mitsubishi Electric Research Labs, Cambridge, MA, USA
wu_t6a_1 wu_t6a_2 wu_t6a_3 wu_t6a_4
Abstract
DCASE 2023 Task 6A, automated audio captioning (AAC), aims at generating informative descriptions for various sounds from nature and/or human activities. Our AAC system follows the sequence-to-sequence (seq2seq) architecture. The audio encoder stack comprises a frozen BEATs Transformer followed by a 2-layer Conformer. The BEATs module, which has been pretrained on both masked audio token prediction and audio event classification, extracts fine-grained (i.e., ≈ 50 Hz) audio features, while the Conformer downsamples and summarizes the audio features before they are cross-attended by the BART text decoder. Besides the autoregressive negative log-likelihood (NLL) loss computed on decoder outputs, we simultaneously apply an audio-text contrastive loss on our encoder output to infuse language modality knowledge into it. Specifically, we feed ground-truth captions into INSTRUCTOR Transformer, a state-of-the-art text embedding model, and teach our audio encoder to predict the INSTRUCTOR text embeddings through InfoNCE loss. In addition, we leverage ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increases not only the amount but also the complexity and diversity of our training data. During inference, we employ nucleus sampling and a hybrid reranking algorithm that considers both likelihood and audio-caption representation similarity. Combining our efforts, our best single model and ensemble system achieve 0.326 and 0.336 SPIDEr-FL scores, respectively, on the Clotho (V2) evaluation split.
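The hybrid reranking step can be summarized compactly: each sampled candidate caption is scored by a weighted combination of its (length-normalized) decoder log-likelihood and the cosine similarity between the audio embedding and a text embedding of the candidate. In the sketch below, the weighting factor and the embedding interfaces are assumptions, not the submission's actual settings.

```python
# Hedged sketch of likelihood-plus-similarity reranking (assumed weighting; `embed_text`
# and `audio_emb` stand in for the system's text and audio encoders).
import torch
import torch.nn.functional as F

def rerank(candidates, log_likelihoods, audio_emb, embed_text, alpha=0.5):
    """candidates: caption strings; log_likelihoods: length-normalized decoder scores."""
    sims = torch.stack([F.cosine_similarity(audio_emb, embed_text(c), dim=-1)
                        for c in candidates])
    scores = alpha * torch.tensor(log_likelihoods) + (1.0 - alpha) * sims
    return candidates[int(torch.argmax(scores))]             # best candidate caption
```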
System characteristics
Data augmentation | SpecAugment, MixUp |
Leveraging multi-task training and image retrieval with CLAP for audio captioning
Haoran Sun1, Zhiyong Yan1, Yongqing Wang1, Heinrich Dinkel1, Junbo Zhang1, Yujun Wang1
1Xiaomi Corporation, Beijing, China
yan_t6a_1 yan_t6a_2 yan_t6a_3 yan_t6a_4
Abstract
This technical report serves as our submission to Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. Our system, as described in this report, consists of two sub-systems designed for the respective sub-tasks: automated audio captioning (task A) and text-to-audio retrieval (task B). The text-to-audio retrieval system employs a tri-encoder architecture, where pre-trained audio and text encoders are trained to establish relationships. Additionally, an extra pre-trained image encoder is utilized to enhance the connections between these encoders. Through this retrieval process, the audio encoder can be considered a pre-trained encoder for task A. Furthermore, we employ multi-task training with audio tagging during the retrieval phase to strengthen the encoder for audio captioning. Pre-training is conducted using AudioCaps and a portion of the WavCaps dataset, and both sub-systems are subsequently fine-tuned on the Clotho dataset. Experimental results demonstrate that our model achieves a SPIDEr score of 0.305 and a SPIDEr-FL score of 0.294 for captioning, as well as an mAP (mean Average Precision) of 0.321 for text-to-audio retrieval.
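The multi-task training mentioned above, adding audio tagging during the retrieval phase, amounts to mixing a tagging term into the retrieval objective. The sketch below shows one plausible formulation; the loss weighting and the 527-class AudioSet-style label space are assumptions, not the authors' settings.

```python
# Plausible multi-task objective sketch (assumed weighting, not the submitted code):
# a retrieval/contrastive loss on paired embeddings plus a BCE audio-tagging loss.
import torch
import torch.nn.functional as F

def multitask_loss(contrastive_loss, tag_logits, tag_targets, tagging_weight=0.5):
    """tag_targets are multi-hot AudioSet-style labels; BCE-with-logits handles multi-label."""
    tagging_loss = F.binary_cross_entropy_with_logits(tag_logits, tag_targets)
    return contrastive_loss + tagging_weight * tagging_loss

print(multitask_loss(torch.tensor(1.2),
                     torch.randn(4, 527),
                     torch.randint(0, 2, (4, 527)).float()))
```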
System characteristics
Data augmentation | None |