Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can generate captions for general audio recordings. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
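The hapax legomena constraint above can be made concrete with a small sketch. This is illustrative only (it is not the actual Clotho preparation tooling), using simple whitespace tokenization:

```python
from collections import Counter

def hapax_legomena(captions):
    """Return the words that occur exactly once across a split's captions."""
    counts = Counter(w for caption in captions for w in caption.lower().split())
    return {w for w, n in counts.items() if n == 1}

split = ["a dog barks loudly", "a dog growls then barks"]
print(sorted(hapax_legomena(split)))  # ['growls', 'loudly', 'then']
```

A dataset free of hapax legomena guarantees that every word a model must learn to output appears at least twice in its training split.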

Participants used the freely available development and evaluation splits of Clotho, as well as any external data they deemed suitable. The developed systems are evaluated on their generated captions using the testing split of Clotho, for which the corresponding captions are not publicly provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.

The ranking of the submitted systems is based on the achieved SPIDEr metric. However, this page provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
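SPIDEr, the ranking metric, is simply the arithmetic mean of the CIDEr and SPICE scores, which makes the tables below easy to cross-check. A minimal helper (the function name is illustrative):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2

# Cross-check against the teams table: Xu_t6a_4 reports CIDEr 0.508 and
# SPICE 0.130 on the Clotho testing split, giving SPIDEr of about 0.319.
print(round(spider(0.508, 0.130), 3))  # 0.319
```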

This year, we also introduce contrastive metrics as well as an analysis subset of Clotho. The corresponding results will be published in the coming days.

Teams ranking

Listed here are the best systems from all teams, ranked by the SPIDEr metric. To allow a more detailed exploration of system performance, the same table lists the values achieved for all metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside this task, since captions for that split are freely available.

Submission code | Best official system rank | Corresponding author | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 Xuenan Xu xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.508 0.130 0.319 0.667 0.435 0.285 0.183 0.186 0.415 0.513 0.126 0.320
Zou_t6a_3 2 Yuexian Zou zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.502 0.133 0.318 0.646 0.430 0.289 0.186 0.186 0.409 0.497 0.119 0.308
Mei_t6a_3 3 Xinhao Mei mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.482 0.135 0.309 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Primus_t6a_4 4 Paul Primus primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.458 0.134 0.296 0.636 0.417 0.275 0.168 0.183 0.401 0.461 0.130 0.295
Kouzelis_t6a_4 5 Thodoris Kouzelis kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.453 0.134 0.293 0.579 0.386 0.262 0.173 0.178 0.387 0.457 0.134 0.296
Guan_t6a_4 6 Jian Guan guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.451 0.130 0.291 0.649 0.439 0.303 0.199 0.181 0.415 0.471 0.133 0.302
Kiciński_t6a_1 7 Dawid Kiciński kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.414 0.126 0.270 0.583 0.382 0.255 0.164 0.179 0.387 0.433 0.125 0.279
Pan_t6a_4 8 Chaofan Pan pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.387 0.123 0.255 0.568 0.367 0.245 0.161 0.175 0.383 0.395 0.119 0.257
Labbe_t6a_1 9 Etienne Labbe labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.359 0.123 0.241 0.555 0.357 0.240 0.157 0.170 0.374 0.367 0.118 0.242
Baseline 10 Felix Gontier gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.338 0.110 0.224 0.555 0.358 0.239 0.156 0.164 0.364 0.358 0.109 0.233

Systems ranking

Listed here are all submitted systems and their rankings according to the different metrics and metric groupings. The first table shows all systems with all metrics, the second with only the machine translation metrics, and the third with only the captioning metrics.

Detailed information for each system is provided in the next section.

Systems ranking, all metrics

Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.508 0.130 0.319 0.667 0.435 0.285 0.183 0.186 0.415 0.513 0.126 0.320
Zou_t6a_3 2 zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.502 0.133 0.318 0.646 0.430 0.289 0.186 0.186 0.409 0.497 0.119 0.308
Xu_t6a_3 3 xu2022_t6a 0.658 0.430 0.281 0.178 0.186 0.410 0.501 0.131 0.316 0.663 0.433 0.285 0.185 0.185 0.413 0.517 0.127 0.322
Xu_t6a_1 4 xu2022_t6a 0.645 0.421 0.276 0.173 0.186 0.402 0.498 0.130 0.314 0.647 0.424 0.280 0.180 0.186 0.409 0.507 0.130 0.318
Zou_t6a_4 5 zou2022_t6a 0.652 0.433 0.287 0.182 0.185 0.408 0.497 0.130 0.314 0.663 0.443 0.299 0.195 0.189 0.416 0.520 0.126 0.323
Xu_t6a_2 6 xu2022_t6a 0.650 0.425 0.278 0.176 0.187 0.407 0.495 0.127 0.311 0.654 0.431 0.286 0.187 0.188 0.413 0.524 0.126 0.325
Zou_t6a_1 7 zou2022_t6a 0.648 0.424 0.279 0.176 0.185 0.410 0.489 0.133 0.311 0.647 0.438 0.296 0.194 0.185 0.414 0.503 0.132 0.317
Zou_t6a_2 8 zou2022_t6a 0.655 0.429 0.283 0.178 0.184 0.405 0.491 0.128 0.309 0.645 0.422 0.281 0.183 0.186 0.408 0.495 0.131 0.313
Mei_t6a_3 9 mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.482 0.135 0.309 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_1 10 mei2022_t6a 0.647 0.423 0.278 0.174 0.186 0.407 0.476 0.134 0.305 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_4 11 mei2022_t6a 0.669 0.428 0.281 0.173 0.184 0.410 0.468 0.138 0.303 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_2 12 mei2022_t6a 0.672 0.427 0.276 0.167 0.182 0.410 0.456 0.138 0.297 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Primus_t6a_4 13 primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.458 0.134 0.296 0.636 0.417 0.275 0.168 0.183 0.401 0.461 0.130 0.295
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.453 0.134 0.293 0.579 0.386 0.262 0.173 0.178 0.387 0.457 0.134 0.296
Guan_t6a_4 15 guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.451 0.130 0.291 0.649 0.439 0.303 0.199 0.181 0.415 0.471 0.133 0.302
Guan_t6a_2 16 guan2022_t6a 0.575 0.388 0.268 0.178 0.178 0.387 0.451 0.129 0.290 0.595 0.402 0.277 0.189 0.179 0.395 0.465 0.127 0.296
Guan_t6a_3 17 guan2022_t6a 0.657 0.425 0.279 0.169 0.179 0.405 0.447 0.132 0.290 0.660 0.424 0.279 0.170 0.178 0.410 0.442 0.129 0.285
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.567 0.378 0.257 0.169 0.176 0.385 0.447 0.131 0.289 0.575 0.384 0.262 0.174 0.178 0.386 0.457 0.133 0.295
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.570 0.382 0.259 0.170 0.177 0.384 0.439 0.132 0.286 0.576 0.384 0.261 0.176 0.166 0.385 0.453 0.130 0.292
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.569 0.378 0.256 0.168 0.177 0.386 0.441 0.130 0.285 0.578 0.384 0.262 0.176 0.177 0.387 0.454 0.133 0.293
Primus_t6a_3 21 primus2022_t6a 0.654 0.420 0.271 0.163 0.177 0.395 0.434 0.127 0.280 0.653 0.424 0.278 0.169 0.181 0.404 0.455 0.125 0.290
Primus_t6a_2 22 primus2022_t6a 0.562 0.364 0.243 0.153 0.181 0.374 0.418 0.132 0.275 0.573 0.370 0.244 0.158 0.181 0.376 0.440 0.128 0.284
Guan_t6a_1 23 guan2022_t6a 0.556 0.367 0.249 0.165 0.173 0.375 0.417 0.124 0.270 0.581 0.386 0.265 0.181 0.175 0.385 0.437 0.126 0.281
Kiciński_t6a_1 24 kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.414 0.126 0.270 0.583 0.382 0.255 0.164 0.179 0.387 0.433 0.125 0.279
Primus_t6a_1 25 primus2022_t6a 0.556 0.364 0.241 0.150 0.176 0.367 0.400 0.127 0.264 0.566 0.373 0.252 0.164 0.178 0.376 0.408 0.120 0.264
Pan_t6a_4 26 pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.387 0.123 0.255 0.568 0.367 0.245 0.161 0.175 0.383 0.395 0.119 0.257
Kiciński_t6a_3 27 kiciński2022_t6a 0.546 0.346 0.224 0.138 0.170 0.364 0.379 0.123 0.251 0.566 0.363 0.236 0.147 0.173 0.372 0.400 0.121 0.260
Kiciński_t6a_4 28 kiciński2022_t6a 0.556 0.360 0.238 0.152 0.171 0.367 0.380 0.121 0.250 0.562 0.364 0.242 0.157 0.172 0.377 0.378 0.119 0.249
Pan_t6a_3 29 pan2022_t6a 0.559 0.360 0.237 0.152 0.173 0.376 0.376 0.123 0.250 0.567 0.363 0.240 0.154 0.174 0.380 0.386 0.121 0.253
Pan_t6a_1 30 pan2022_t6a 0.556 0.365 0.244 0.154 0.170 0.375 0.377 0.121 0.249 0.560 0.362 0.240 0.155 0.169 0.375 0.381 0.116 0.248
Kiciński_t6a_2 31 kiciński2022_t6a 0.556 0.358 0.235 0.148 0.171 0.369 0.378 0.119 0.249 0.567 0.365 0.242 0.155 0.172 0.378 0.393 0.117 0.255
Pan_t6a_2 32 pan2022_t6a 0.556 0.358 0.236 0.146 0.169 0.371 0.363 0.120 0.241 0.562 0.361 0.240 0.156 0.172 0.377 0.384 0.118 0.251
Labbe_t6a_1 33 labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.359 0.123 0.241 0.555 0.357 0.240 0.157 0.170 0.374 0.367 0.118 0.242
Baseline 34 gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.338 0.110 0.224 0.555 0.358 0.239 0.156 0.164 0.364 0.358 0.109 0.233
Labbe_t6a_3 35 labbe2022_t6a 0.535 0.326 0.203 0.121 0.164 0.356 0.307 0.117 0.212 0.532 0.322 0.200 0.121 0.161 0.354 0.303 0.111 0.207
Labbe_t6a_2 36 labbe2022_t6a 0.490 0.282 0.163 0.090 0.156 0.328 0.247 0.113 0.180 0.488 0.279 0.160 0.085 0.154 0.329 0.241 0.106 0.174
Labbe_t6a_4 37 labbe2022_t6a 0.460 0.251 0.139 0.072 0.142 0.311 0.203 0.099 0.151 0.457 0.245 0.135 0.067 0.140 0.310 0.198 0.090 0.144

Systems ranking, machine translation metrics

Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L)
Xu_t6a_4 1 xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.667 0.435 0.285 0.183 0.186 0.415
Zou_t6a_3 2 zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.646 0.430 0.289 0.186 0.186 0.409
Xu_t6a_3 3 xu2022_t6a 0.658 0.430 0.281 0.178 0.186 0.410 0.663 0.433 0.285 0.185 0.185 0.413
Xu_t6a_1 4 xu2022_t6a 0.645 0.421 0.276 0.173 0.186 0.402 0.647 0.424 0.280 0.180 0.186 0.409
Zou_t6a_4 5 zou2022_t6a 0.652 0.433 0.287 0.182 0.185 0.408 0.663 0.443 0.299 0.195 0.189 0.416
Xu_t6a_2 6 xu2022_t6a 0.650 0.425 0.278 0.176 0.187 0.407 0.654 0.431 0.286 0.187 0.188 0.413
Zou_t6a_1 7 zou2022_t6a 0.648 0.424 0.279 0.176 0.185 0.410 0.647 0.438 0.296 0.194 0.185 0.414
Zou_t6a_2 8 zou2022_t6a 0.655 0.429 0.283 0.178 0.184 0.405 0.645 0.422 0.281 0.183 0.186 0.408
Mei_t6a_3 9 mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_1 10 mei2022_t6a 0.647 0.423 0.278 0.174 0.186 0.407 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_4 11 mei2022_t6a 0.669 0.428 0.281 0.173 0.184 0.410 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_2 12 mei2022_t6a 0.672 0.427 0.276 0.167 0.182 0.410 0.646 0.414 0.270 0.169 0.183 0.407
Primus_t6a_4 13 primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.636 0.417 0.275 0.168 0.183 0.401
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.579 0.386 0.262 0.173 0.178 0.387
Guan_t6a_4 15 guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.649 0.439 0.303 0.199 0.181 0.415
Guan_t6a_2 16 guan2022_t6a 0.575 0.388 0.268 0.178 0.178 0.387 0.595 0.402 0.277 0.189 0.179 0.395
Guan_t6a_3 17 guan2022_t6a 0.657 0.425 0.279 0.169 0.179 0.405 0.660 0.424 0.279 0.170 0.178 0.410
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.567 0.378 0.257 0.169 0.176 0.385 0.575 0.384 0.262 0.174 0.178 0.386
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.570 0.382 0.259 0.170 0.177 0.384 0.576 0.384 0.261 0.176 0.166 0.385
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.569 0.378 0.256 0.168 0.177 0.386 0.578 0.384 0.262 0.176 0.177 0.387
Primus_t6a_3 21 primus2022_t6a 0.654 0.420 0.271 0.163 0.177 0.395 0.653 0.424 0.278 0.169 0.181 0.404
Primus_t6a_2 22 primus2022_t6a 0.562 0.364 0.243 0.153 0.181 0.374 0.573 0.370 0.244 0.158 0.181 0.376
Guan_t6a_1 23 guan2022_t6a 0.556 0.367 0.249 0.165 0.173 0.375 0.581 0.386 0.265 0.181 0.175 0.385
Kiciński_t6a_1 24 kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.583 0.382 0.255 0.164 0.179 0.387
Primus_t6a_1 25 primus2022_t6a 0.556 0.364 0.241 0.150 0.176 0.367 0.566 0.373 0.252 0.164 0.178 0.376
Pan_t6a_4 26 pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.568 0.367 0.245 0.161 0.175 0.383
Kiciński_t6a_3 27 kiciński2022_t6a 0.546 0.346 0.224 0.138 0.170 0.364 0.566 0.363 0.236 0.147 0.173 0.372
Kiciński_t6a_4 28 kiciński2022_t6a 0.556 0.360 0.238 0.152 0.171 0.367 0.562 0.364 0.242 0.157 0.172 0.377
Pan_t6a_3 29 pan2022_t6a 0.559 0.360 0.237 0.152 0.173 0.376 0.567 0.363 0.240 0.154 0.174 0.380
Pan_t6a_1 30 pan2022_t6a 0.556 0.365 0.244 0.154 0.170 0.375 0.560 0.362 0.240 0.155 0.169 0.375
Kiciński_t6a_2 31 kiciński2022_t6a 0.556 0.358 0.235 0.148 0.171 0.369 0.567 0.365 0.242 0.155 0.172 0.378
Pan_t6a_2 32 pan2022_t6a 0.556 0.358 0.236 0.146 0.169 0.371 0.562 0.361 0.240 0.156 0.172 0.377
Labbe_t6a_1 33 labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.555 0.357 0.240 0.157 0.170 0.374
Baseline 34 gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.555 0.358 0.239 0.156 0.164 0.364
Labbe_t6a_3 35 labbe2022_t6a 0.535 0.326 0.203 0.121 0.164 0.356 0.532 0.322 0.200 0.121 0.161 0.354
Labbe_t6a_2 36 labbe2022_t6a 0.490 0.282 0.163 0.090 0.156 0.328 0.488 0.279 0.160 0.085 0.154 0.329
Labbe_t6a_4 37 labbe2022_t6a 0.460 0.251 0.139 0.072 0.142 0.311 0.457 0.245 0.135 0.067 0.140 0.310

Systems ranking, captioning metrics

Submission code | Best official system rank | Technical report | Clotho testing split (CIDEr, SPICE, SPIDEr) | Clotho evaluation split (CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 xu2022_t6a 0.508 0.130 0.319 0.513 0.126 0.320
Zou_t6a_3 2 zou2022_t6a 0.502 0.133 0.318 0.497 0.119 0.308
Xu_t6a_3 3 xu2022_t6a 0.501 0.131 0.316 0.517 0.127 0.322
Xu_t6a_1 4 xu2022_t6a 0.498 0.130 0.314 0.507 0.130 0.318
Zou_t6a_4 5 zou2022_t6a 0.497 0.130 0.314 0.520 0.126 0.323
Xu_t6a_2 6 xu2022_t6a 0.495 0.127 0.311 0.524 0.126 0.325
Zou_t6a_1 7 zou2022_t6a 0.489 0.133 0.311 0.503 0.132 0.317
Zou_t6a_2 8 zou2022_t6a 0.491 0.128 0.309 0.495 0.131 0.313
Mei_t6a_3 9 mei2022_t6a 0.482 0.135 0.309 0.453 0.128 0.291
Mei_t6a_1 10 mei2022_t6a 0.476 0.134 0.305 0.453 0.128 0.291
Mei_t6a_4 11 mei2022_t6a 0.468 0.138 0.303 0.453 0.128 0.291
Mei_t6a_2 12 mei2022_t6a 0.456 0.138 0.297 0.453 0.128 0.291
Primus_t6a_4 13 primus2022_t6a 0.458 0.134 0.296 0.461 0.130 0.295
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.453 0.134 0.293 0.457 0.134 0.296
Guan_t6a_4 15 guan2022_t6a 0.451 0.130 0.291 0.471 0.133 0.302
Guan_t6a_2 16 guan2022_t6a 0.451 0.129 0.290 0.465 0.127 0.296
Guan_t6a_3 17 guan2022_t6a 0.447 0.132 0.290 0.442 0.129 0.285
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.447 0.131 0.289 0.457 0.133 0.295
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.439 0.132 0.286 0.453 0.130 0.292
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.441 0.130 0.285 0.454 0.133 0.293
Primus_t6a_3 21 primus2022_t6a 0.434 0.127 0.280 0.455 0.125 0.290
Primus_t6a_2 22 primus2022_t6a 0.418 0.132 0.275 0.440 0.128 0.284
Guan_t6a_1 23 guan2022_t6a 0.417 0.124 0.270 0.437 0.126 0.281
Kiciński_t6a_1 24 kiciński2022_t6a 0.414 0.126 0.270 0.433 0.125 0.279
Primus_t6a_1 25 primus2022_t6a 0.400 0.127 0.264 0.408 0.120 0.264
Pan_t6a_4 26 pan2022_t6a 0.387 0.123 0.255 0.395 0.119 0.257
Kiciński_t6a_3 27 kiciński2022_t6a 0.379 0.123 0.251 0.400 0.121 0.260
Kiciński_t6a_4 28 kiciński2022_t6a 0.380 0.121 0.250 0.378 0.119 0.249
Pan_t6a_3 29 pan2022_t6a 0.376 0.123 0.250 0.386 0.121 0.253
Pan_t6a_1 30 pan2022_t6a 0.377 0.121 0.249 0.381 0.116 0.248
Kiciński_t6a_2 31 kiciński2022_t6a 0.378 0.119 0.249 0.393 0.117 0.255
Pan_t6a_2 32 pan2022_t6a 0.363 0.120 0.241 0.384 0.118 0.251
Labbe_t6a_1 33 labbe2022_t6a 0.359 0.123 0.241 0.367 0.118 0.242
Baseline 34 gontier2022_t6a 0.338 0.110 0.224 0.358 0.109 0.233
Labbe_t6a_3 35 labbe2022_t6a 0.307 0.117 0.212 0.303 0.111 0.207
Labbe_t6a_2 36 labbe2022_t6a 0.247 0.113 0.180 0.241 0.106 0.174
Labbe_t6a_4 37 labbe2022_t6a 0.203 0.099 0.151 0.198 0.090 0.144

System characteristics

This section presents the characteristics of the submitted systems in two tables, one per subsection: the first gives an overview of the systems, and the second a detailed presentation of each.

Overview of characteristics

Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation
1 Xu_t6a_4 0.319 xu2022_t6a Rnn_Transformer 528099252 RNN Transformer, RNN
2 Zou_t6a_3 0.318 zou2022_t6a encoder-decoder 84541437 CNN LSTM SpecAugment, SpecAugment++
3 Xu_t6a_3 0.316 xu2022_t6a Rnn_Transformer 347873912 RNN Transformer
4 Xu_t6a_1 0.314 xu2022_t6a Rnn_Transformer 85915038 RNN Transformer
5 Zou_t6a_4 0.314 zou2022_t6a encoder-decoder 84541437 CNN LSTM SpecAugment, SpecAugment++
6 Xu_t6a_2 0.311 xu2022_t6a Rnn_Transformer 171830076 RNN Transformer
7 Zou_t6a_1 0.311 zou2022_t6a encoder-decoder 86643711 CNN LSTM SpecAugment, SpecAugment++
8 Zou_t6a_2 0.309 zou2022_t6a encoder-decoder 140000000 CNN LSTM SpecAugment, SpecAugment++
9 Mei_t6a_3 0.309 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
10 Mei_t6a_1 0.305 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
11 Mei_t6a_4 0.303 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
12 Mei_t6a_2 0.297 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
13 Primus_t6a_4 0.296 primus2022_t6a encoder-decoder 780000000 CNN Transformer SpecAugment
14 Kouzelis_t6a_4 0.293 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
15 Guan_t6a_4 0.291 guan2022_t6a encoder-decoder 36147701 PANNs, GAT Transformer, LocalAFT Mixup, SpecAugment
16 Guan_t6a_2 0.290 guan2022_t6a encoder-decoder 28920672 PANNs, GAT Transformer, LocalAFT Mixup, SpecAugment
17 Guan_t6a_3 0.290 guan2022_t6a encoder-decoder 7227029 PANNs, GAT Transformer Mixup, SpecAugment
18 Kouzelis_t6a_3 0.289 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
19 Kouzelis_t6a_1 0.286 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
20 Kouzelis_t6a_2 0.285 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
21 Primus_t6a_3 0.280 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
22 Primus_t6a_2 0.275 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
23 Guan_t6a_1 0.270 guan2022_t6a encoder-decoder 7227029 PANNs, GAT Transformer Mixup, SpecAugment
24 Kiciński_t6a_1 0.270 kiciński2022_t6a encoder-decoder 104000000 PANNs Transformer random crop, random pad, adding white noise, SpecAugment
25 Primus_t6a_1 0.264 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
26 Pan_t6a_4 0.255 pan2022_t6a encoder-decoder 9857679 CNN Transformer Mixture
27 Kiciński_t6a_3 0.251 kiciński2022_t6a encoder-decoder 104000000 PANNs Transformer, KeyBert random crop, random pad, adding white noise, SpecAugment
28 Kiciński_t6a_4 0.250 kiciński2022_t6a encoder-decoder 207000000 PANNs GPT2, KeyBert random crop, random pad, adding white noise, SpecAugment
29 Pan_t6a_3 0.250 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
30 Pan_t6a_1 0.249 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
31 Kiciński_t6a_2 0.249 kiciński2022_t6a encoder-decoder 207000000 PANNs GPT2 random crop, random pad, adding white noise, SpecAugment
32 Pan_t6a_2 0.241 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
33 Labbe_t6a_1 0.241 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
34 Baseline 0.224 gontier2022_t6a encoder-decoder 140000000 Transformer Transformer
35 Labbe_t6a_3 0.212 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
36 Labbe_t6a_2 0.180 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
37 Labbe_t6a_4 0.151 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment



Detailed characteristics

Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity
1 Xu_t6a_4 0.319 xu2022_t6a Rnn_Transformer 528099252 RNN Clap feature Transformer, RNN learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
2 Zou_t6a_3 0.318 zou2022_t6a encoder-decoder 84541437 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
3 Xu_t6a_3 0.316 xu2022_t6a Rnn_Transformer 347873912 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
4 Xu_t6a_1 0.314 xu2022_t6a Rnn_Transformer 85915038 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
5 Zou_t6a_4 0.314 zou2022_t6a encoder-decoder 84541437 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
6 Xu_t6a_2 0.311 xu2022_t6a Rnn_Transformer 171830076 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
7 Zou_t6a_1 0.311 zou2022_t6a encoder-decoder 86643711 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
8 Zou_t6a_2 0.309 zou2022_t6a encoder-decoder 140000000 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
9 Mei_t6a_3 0.309 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
10 Mei_t6a_1 0.305 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
11 Mei_t6a_4 0.303 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
12 Mei_t6a_2 0.297 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
13 Primus_t6a_4 0.296 primus2022_t6a encoder-decoder 780000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised score function estimator adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
14 Kouzelis_t6a_4 0.293 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
15 Guan_t6a_4 0.291 guan2022_t6a encoder-decoder 36147701 PANNs, GAT log-mel energies Transformer, LocalAFT Word2Vec Mixup, SpecAugment 44.1kHz supervised, reinforcement learning cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
16 Guan_t6a_2 0.290 guan2022_t6a encoder-decoder 28920672 PANNs, GAT log-mel energies Transformer, LocalAFT Word2Vec Mixup, SpecAugment 44.1kHz supervised cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
17 Guan_t6a_3 0.290 guan2022_t6a encoder-decoder 7227029 PANNs, GAT log-mel energies Transformer Word2Vec Mixup, SpecAugment 44.1kHz supervised, reinforcement learning cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
18 Kouzelis_t6a_3 0.289 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
19 Kouzelis_t6a_1 0.286 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam linear warmup 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
20 Kouzelis_t6a_2 0.285 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
21 Primus_t6a_3 0.280 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised score function estimator adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
22 Primus_t6a_2 0.275 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised cross-entropy adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
23 Guan_t6a_1 0.270 guan2022_t6a encoder-decoder 7227029 PANNs, GAT log-mel energies Transformer Word2Vec Mixup, SpecAugment 44.1kHz supervised cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
24 Kiciński_t6a_1 0.270 kiciński2022_t6a encoder-decoder 104000000 PANNs mel energies Transformer learned random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 1e-4 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
25 Primus_t6a_1 0.264 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised cross-entropy adamw 1e-5 SPIDEr Clotho, AudioSet Clotho
26 Pan_t6a_4 0.255 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Mixture 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
27 Kiciński_t6a_3 0.251 kiciński2022_t6a encoder-decoder 104000000 PANNs mel energies Transformer, KeyBert learned random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 1e-4 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
28 Kiciński_t6a_4 0.250 kiciński2022_t6a encoder-decoder 207000000 PANNs mel energies GPT2, KeyBert GPT2 random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 5e-5 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
29 Pan_t6a_3 0.250 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Zero_value, Mixup 44.1kHz supervised cross-entropy adamw 1e-3 loss Clotho Clotho
30 Pan_t6a_1 0.249 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer learned Zero_value, Mixup 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
31 Kiciński_t6a_2 0.249 kiciński2022_t6a encoder-decoder 207000000 PANNs mel energies GPT2 GPT2 random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 5e-5 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
32 Pan_t6a_2 0.241 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Zero_value, Mixup 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
33 Labbe_t6a_1 0.241 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
34 Baseline 0.224 gontier2022_t6a encoder-decoder 140000000 Transformer VGGish Transformer BART 16kHz supervised cross-entropy adamw 1e-5 loss Clotho Clotho
35 Labbe_t6a_3 0.212 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
36 Labbe_t6a_2 0.180 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
37 Labbe_t6a_4 0.151 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho



Technical reports

Ensemble learning for audio captioning with graph audio feature representation

Feiyang Xiao1, Jian Guan1, Haiyan Lan1, Qiaoxi Zhu2, Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Abstract

This technical report describes our submission for Task 6A of the DCASE2022 Challenge (automated audio captioning). Our system is built on an ensemble learning strategy, integrating the advantages of different audio captioning methods, including a graph attention-based audio feature representation method. Experiments show that our ensemble system achieves a SPIDEr score (used for ranking) of 30.2 on the evaluation split of the Clotho v2 dataset.

System characteristics
Data augmentation Mixup, SpecAugment
PDF
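The report does not spell out the combination rule of its ensemble, but one common late-fusion scheme for captioning models averages the member models' next-token distributions at each decoding step. A hedged sketch of that scheme (names are illustrative):

```python
import numpy as np

def ensemble_step(per_model_probs):
    """Average the next-token distributions of several captioning models
    and return the consensus token (one simple late-fusion scheme)."""
    avg = np.mean(per_model_probs, axis=0)
    return int(np.argmax(avg)), avg

p1 = np.array([0.6, 0.3, 0.1])  # model A's next-token distribution
p2 = np.array([0.2, 0.7, 0.1])  # model B's next-token distribution
token, avg = ensemble_step([p1, p2])
print(token)  # 1, the argmax of the averaged distribution [0.4, 0.5, 0.1]
```

Averaging in probability space lets models that disagree on the top token still contribute to a shared second choice, which is why ensembles often smooth out individual models' errors.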

Exploring audio captioning with keyword-guided text generation

Dawid Kicinski, Teodor Lamort de Gail, Pawel Bujnowski
Samsung R&D Institute Poland, Artificial Intelligence, Warsaw, Poland

Abstract

This technical report describes our submission to the DCASE 2022 challenge, Task 6 A: automated audio captioning. In our system, we explore the use of pre-trained language models for the audio captioning task. The proposed system is an encoder-decoder architecture consisting of a pre-trained PANN encoder and a GPT2 decoder. Audio embeddings are encoded into language model prompts using a simple mapping network. We further develop our system by employing strategies for guiding the decoder with textual information: we prompt the decoder with keywords extracted from semantically similar audio clips and use keyword occurrence to choose the best-matching caption.

System characteristics
Data augmentation random crop, random pad, adding white noise, SpecAugment
PDF
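The occurrence-based selection step the authors describe could be realized along these lines (a sketch with illustrative names, not the submitted implementation):

```python
def pick_caption(candidates, keywords):
    """Pick the candidate caption that contains the most of the given keywords."""
    def hits(caption):
        words = set(caption.lower().split())
        return sum(kw in words for kw in keywords)
    return max(candidates, key=hits)

candidates = ["water drips into a metal sink", "a crowd cheers loudly"]
print(pick_caption(candidates, ["water", "sink"]))  # water drips into a metal sink
```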

Efficient audio captioning transformer with patchout and text guidance

Thodoris Kouzelis1,2, Grigoris Bastas1,2, Athanasios Katsamanis2, Alexandros Potamianos1
1Institute for Language and Speech Processing, Athena Research Center, Athens, Greece, 2School of ECE, National Technical University of Athens, Athens, Greece

Abstract

This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge, Task 6. We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting. The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model, which is fine-tuned to maximize the semantic similarity between AudioSet labels and ground-truth captions. To mitigate the data scarcity problem of Automated Audio Captioning (AAC), we pre-train our model on an enlarged dataset. Moreover, we propose a method to apply Mixup augmentation for AAC. Our best model achieves a SPIDEr score of 0.296.

Awards: Judges’ award

System characteristics
Data augmentation Mixup, SpecAugment, Label Smoothing
PDF
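Generic Mixup on log-mel spectrograms can be sketched as follows; the report proposes an AAC-specific adaptation (captions cannot be linearly interpolated the way labels can), so this only illustrates the basic idea:

```python
import numpy as np

def mixup(spec_a, spec_b, alpha=0.4, rng=None):
    """Blend two (time, mel) spectrograms with a Beta(alpha, alpha) weight.
    In classification, the same weight lam also mixes the training targets."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    return lam * spec_a + (1.0 - lam) * spec_b, lam

a, b = np.zeros((400, 64)), np.ones((400, 64))
mixed, lam = mixup(a, b)  # every bin of the mix lies between the two inputs
```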

IRIT-UPS DCASE 2022 task6a system: stochastic decoding methods for audio captioning

Etienne Labbé, Thomas Pellegrini, Julien Pinquier
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France

Abstract

This document presents a summary of our models used in the Automated Audio Captioning task (6a) of the DCASE2022 challenge. Four submissions were made using different decoding methods: beam search, top-k sampling, nucleus sampling, and typical decoding.

System characteristics
Data augmentation SpecAugment
PDF

Automated audio captioning with keywords guidance

Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK

Abstract

This technical report describes an automated audio captioning system we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6a. The proposed system is built on an encoder-decoder architecture we submitted to DCASE 2021 Challenge Task 6 last year, where the encoder is a pre-trained 10-layer convolutional neural network and the decoder is a Transformer network. In this new submission, we investigate the use of keywords estimated from input audio clips to guide the caption generation process. The results show that keywords guidance can improve system performance, especially when the pre-trained encoder is frozen, and can also reduce the variance of the results when the model is trained with different seeds. The overall system consists of a pre-trained keywords estimation model and a CNN-Transformer audio captioning model. The captioning model is first trained via the cross-entropy loss and then fine-tuned with reinforcement learning to optimize the evaluation metric CIDEr. The proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.
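One simple way to picture keyword guidance at decoding time is a bias added to the logits of tokens the keyword estimator predicted. This is a hypothetical sketch, not the report's exact mechanism (which conditions generation inside the model), using a toy five-word vocabulary:

```python
import numpy as np

def keyword_bias(logits, keyword_ids, bonus=2.0):
    """Shift the logits of predicted keyword tokens upward before softmax."""
    biased = logits.copy()
    biased[keyword_ids] += bonus
    return biased

vocab = ["a", "dog", "barks", "rain", "falls"]   # toy vocabulary
logits = np.array([1.0, 0.5, 0.5, 2.0, 1.5])
keyword_ids = [1, 2]                             # estimator predicted "dog", "barks"
biased = keyword_bias(logits, keyword_ids)
probs = np.exp(biased) / np.exp(biased).sum()    # keyword tokens now dominate
```

The strength of the `bonus` term controls how strongly generation is pulled toward the estimated keywords.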

System characteristics
Data augmentation SpecAugment
PDF

Audio captioning using pre-trained model and data augmentation

Tianyang Huang1, Chaofan Pan1, Wenyao Chen1, Chenyang Zhu3, Shengchen Li2, Xi Shao1
1College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 2School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China, 3School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China

Abstract

This technical report describes an automatic audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge. Based on an encoder-decoder architecture, the system is composed of a convolutional neural network (CNN) encoder and a Transformer decoder. Instead of using pre-trained models only in the audio modality, we try to introduce pre-trained models in the text modality as well. In addition, we consider using a data augmentation method with and without noise to improve the data quality and thus improve the generalization and robustness of the model. The experimental results show that our system can achieve a SPIDEr score of 0.257 (official baseline: 0.233) on the Clotho evaluation set.
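Noise-based augmentation of this kind is typically applied by adding Gaussian noise scaled to a target signal-to-noise ratio. A minimal sketch (the SNR value and waveform are illustrative; the report does not specify its exact noise parameters):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add Gaussian noise scaled so the result has the given SNR in dB."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_white_noise(clean, snr_db=20, rng=rng)
```

Training on both the clean and the noisy copies of each clip is one common way to realize "with and without noise" augmentation.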

System characteristics
Data augmentation Mixture
PDF

CP-JKU's submission to task 6a of the DCASE2022 challenge: a BART encoder-decoder for automatic audio captioning trained via the reinforce algorithm and transfer learning

Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University, Austria

Abstract

This technical report details the CP-JKU submission to the automatic audio captioning task of the 2022 DCASE challenge (Task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge’s baseline system in three directions: first, the VGGish embedding model was replaced with a custom CNN10-like model that we pre-trained on AudioSet. Second, the BART encoder-decoder model was pre-trained on AudioCaps, which led to faster convergence. Finally, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric using the REINFORCE algorithm. Our best model achieves a SPIDEr score of 0.29 (single-model performance), which is an improvement of 6.6 pp. over the challenge’s baseline score.
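REINFORCE fine-tuning on CIDEr is commonly implemented in self-critical form: the reward of a greedy-decoded caption serves as the baseline, and each sampled caption is weighted by its advantage (sampled reward minus greedy reward). A toy sketch, with made-up numbers standing in for real CIDEr scores and caption log-probabilities:

```python
import numpy as np

def scst_loss(sample_logprobs, sample_rewards, greedy_reward):
    """REINFORCE with the greedy caption's reward as baseline:
    loss = -mean((r_sample - r_greedy) * log p(sample))."""
    advantage = np.asarray(sample_rewards) - greedy_reward
    return -np.mean(advantage * np.asarray(sample_logprobs))

# Two sampled captions: their total log-probs and CIDEr rewards (made up).
logp = [-4.0, -6.0]
rewards = [0.9, 0.5]
greedy = 0.6            # greedy caption's CIDEr score (made up)
loss = scst_loss(logp, rewards, greedy)
print(round(loss, 3))   # 0.3
```

Samples that beat the greedy baseline get a positive advantage and are reinforced; samples that fall short are suppressed, so no learned value function is needed.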

System characteristics
Data augmentation SpecAugment
PDF

The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training

Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge Task 6, which involves two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture using pre-trained audio and text encoders. The system is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and then take the audio encoder as the feature extractor. Then a standard sequence-to-sequence model is trained on Clotho based on the pre-trained feature extractor. The captioning model is first trained by word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.
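A bi-encoder retrieval setup scores each (text, audio) pair by the similarity of independently encoded embeddings, so the whole collection can be ranked with one matrix product. A minimal sketch with random vectors standing in for the encoder outputs (shapes and dimensions are illustrative):

```python
import numpy as np

def retrieve(text_emb, audio_embs):
    """Rank audio clips by cosine similarity to a query text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ t
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(5, 128))                  # 5 clips, 128-d embeddings
text_emb = audio_embs[3] + 0.1 * rng.normal(size=128)   # query close to clip 3
ranking, scores = retrieve(text_emb, audio_embs)
print(ranking[0])  # clip 3 ranks first
```

Because audio and text are encoded separately, the audio embeddings can be precomputed once and reused for every query.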

System characteristics
Data augmentation None
PDF

Automated audio captioning with multi-task learning

Zhongjie Ye1, Yuexian Zou1, Fan Cui2, Yujun Wang2
1ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2Xiaomi Corporation, Beijing, China

Abstract

This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 6A Challenge. Our model consists of a convolutional neural network (CNN) encoder and a single-layer LSTM decoder with a temporal attention module. In order to enhance the representation in the domain dataset, we use the ResNet38 pretrained on the AudioSet dataset as our audio encoder and fine-tune it with keywords of nouns and verbs as labels which are extracted from the captions. For training the whole caption model, we first train the model with the standard cross-entropy loss, and fine-tune it with reinforcement learning to directly optimize the CIDEr score. Experimental results show that our single model can achieve a SPIDEr score of 31.7 on the evaluation split.
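Fine-tuning the encoder on caption-derived keywords amounts to building multi-hot targets over a keyword vocabulary. A sketch with a hand-written noun/verb vocabulary (the report extracts keywords with a tagger; the word list and caption here are illustrative):

```python
import numpy as np

KEYWORD_VOCAB = ["dog", "bark", "rain", "fall", "car"]  # illustrative noun/verb vocabulary

def keyword_targets(caption, vocab=KEYWORD_VOCAB):
    """Multi-hot label vector: 1 where a vocabulary keyword occurs in the caption."""
    tokens = set(caption.lower().split())
    return np.array([1.0 if w in tokens else 0.0 for w in vocab])

y = keyword_targets("a dog bark as rain fall on the street")
print(y)  # [1. 1. 1. 1. 0.]
```

The encoder is then trained against these vectors with a multi-label objective (e.g. binary cross-entropy) before caption training begins.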

System characteristics
Data augmentation SpecAugment, SpecAugment++
PDF