Task description
Automated audio captioning is the task of describing the content of a general audio recording using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can produce captions for general audio recordings. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
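The hapax legomena criterion can be made concrete with a short sketch. The helper below is hypothetical (it is not part of the Clotho tooling) and simply counts word occurrences across a list of captions, returning the words seen exactly once:

```python
from collections import Counter

def hapax_legomena(captions):
    """Return the words that appear exactly once across a caption split.

    Hypothetical helper: Clotho avoids such words, since a word seen
    only once in a split cannot be modelled reliably.
    """
    counts = Counter(word for caption in captions
                     for word in caption.lower().split())
    return {word for word, count in counts.items() if count == 1}

split = [
    "a dog barks loudly in the distance",
    "a dog barks as cars pass by",
]
# "a", "dog", and "barks" occur twice; the remaining words are hapaxes.
print(sorted(hapax_legomena(split)))
```

This uses naive whitespace tokenization for illustration; a real filtering pass would tokenize the same way the dataset pipeline does.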
Participants used the freely available Clotho development and evaluation splits, as well as any external data they deemed fit. The developed systems are evaluated on their generated captions using the Clotho testing split, for which the corresponding captions are not publicly provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.
The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
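SPIDEr is the arithmetic mean of CIDEr and SPICE, balancing CIDEr's n-gram-based consensus score with SPICE's semantic-graph-based score. A minimal sketch, checked against the top-ranked system's scores reported in the tables below:

```python
def spider(cider, spice):
    """SPIDEr: the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2

# Scores of the top-ranked system (Xu_t6a_4) on the Clotho testing split.
print(round(spider(cider=0.508, spice=0.130), 3))  # 0.319
```

The same relation holds for every row of the results tables, which is why the SPIDEr columns can be recomputed from the CIDEr and SPICE columns.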
This year, we also introduce contrastive metrics as well as an analysis subset of Clotho. The corresponding results will be published in the coming days.
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the SPIDEr metric. For a more elaborate exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task. Metric values are given for both the Clotho testing split and the Clotho evaluation split. Values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside of this task, since the captions for the Clotho evaluation split are freely available.
In the table below, the first block of metric columns (BLEU1 through SPIDEr) refers to the Clotho testing split and the second block to the Clotho evaluation split.

Submission code | Best official system rank | Corresponding author | Technical Report | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | CIDEr | SPICE | SPIDEr | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | CIDEr | SPICE | SPIDEr
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Xu_t6a_4 | 1 | Xuenan Xu | xu2022_t6a | 0.666 | 0.433 | 0.282 | 0.178 | 0.187 | 0.412 | 0.508 | 0.130 | 0.319 | 0.667 | 0.435 | 0.285 | 0.183 | 0.186 | 0.415 | 0.513 | 0.126 | 0.320 | |
Zou_t6a_3 | 2 | Yuexian Zou | zou2022_t6a | 0.670 | 0.437 | 0.289 | 0.183 | 0.185 | 0.415 | 0.502 | 0.133 | 0.318 | 0.646 | 0.430 | 0.289 | 0.186 | 0.186 | 0.409 | 0.497 | 0.119 | 0.308 | |
Mei_t6a_3 | 3 | Xinhao Mei | mei2022_t6a | 0.661 | 0.433 | 0.287 | 0.179 | 0.188 | 0.415 | 0.482 | 0.135 | 0.309 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | 0.453 | 0.128 | 0.291 | |
Primus_t6a_4 | 4 | Paul Primus | primus2022_t6a | 0.641 | 0.421 | 0.276 | 0.168 | 0.184 | 0.402 | 0.458 | 0.134 | 0.296 | 0.636 | 0.417 | 0.275 | 0.168 | 0.183 | 0.401 | 0.461 | 0.130 | 0.295 | |
Kouzelis_t6a_4 | 5 | Thodoris Kouzelis | kouzelis2022_t6a | 0.581 | 0.387 | 0.262 | 0.170 | 0.180 | 0.388 | 0.453 | 0.134 | 0.293 | 0.579 | 0.386 | 0.262 | 0.173 | 0.178 | 0.387 | 0.457 | 0.134 | 0.296 | |
Guan_t6a_4 | 6 | Jian Guan | guan2022_t6a | 0.623 | 0.417 | 0.284 | 0.180 | 0.177 | 0.405 | 0.451 | 0.130 | 0.291 | 0.649 | 0.439 | 0.303 | 0.199 | 0.181 | 0.415 | 0.471 | 0.133 | 0.302 | |
Kiciński_t6a_1 | 7 | Dawid Kiciński | kiciński2022_t6a | 0.567 | 0.368 | 0.244 | 0.155 | 0.175 | 0.378 | 0.414 | 0.126 | 0.270 | 0.583 | 0.382 | 0.255 | 0.164 | 0.179 | 0.387 | 0.433 | 0.125 | 0.279 | |
Pan_t6a_4 | 8 | Chaofan Pan | pan2022_t6a | 0.555 | 0.361 | 0.240 | 0.155 | 0.173 | 0.374 | 0.387 | 0.123 | 0.255 | 0.568 | 0.367 | 0.245 | 0.161 | 0.175 | 0.383 | 0.395 | 0.119 | 0.257 | |
Labbe_t6a_1 | 9 | Etienne Labbe | labbe2022_t6a | 0.548 | 0.351 | 0.233 | 0.149 | 0.170 | 0.370 | 0.359 | 0.123 | 0.241 | 0.555 | 0.357 | 0.240 | 0.157 | 0.170 | 0.374 | 0.367 | 0.118 | 0.242 | |
Baseline | 10 | Felix Gontier | gontier2022_t6a | 0.549 | 0.353 | 0.234 | 0.147 | 0.164 | 0.361 | 0.338 | 0.110 | 0.224 | 0.555 | 0.358 | 0.239 | 0.156 | 0.164 | 0.364 | 0.358 | 0.109 | 0.233 |
Systems ranking
Listed here are all submitted systems and their ranking according to the different metrics and metric groupings. The first table shows all systems with all metrics, the second shows all systems with only the machine translation metrics, and the third shows all systems with only the captioning metrics.
Detailed information for each system is provided in the next section.
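The per-metric rankings below are simply descending sorts of the same result set. The following sketch (illustrative data taken from a few rows of the tables, hypothetical structure) shows how a re-ranking by any chosen metric can be produced:

```python
# Illustrative subset of the results (Clotho testing split).
results = {
    "Xu_t6a_4":  {"CIDEr": 0.508, "SPICE": 0.130, "SPIDEr": 0.319},
    "Zou_t6a_3": {"CIDEr": 0.502, "SPICE": 0.133, "SPIDEr": 0.318},
    "Mei_t6a_3": {"CIDEr": 0.482, "SPICE": 0.135, "SPIDEr": 0.309},
}

def rank_by(results, metric):
    """Return submission codes sorted by the given metric, best first."""
    return sorted(results, key=lambda code: results[code][metric], reverse=True)

print(rank_by(results, "SPIDEr"))  # ['Xu_t6a_4', 'Zou_t6a_3', 'Mei_t6a_3']
print(rank_by(results, "SPICE"))   # ['Mei_t6a_3', 'Zou_t6a_3', 'Xu_t6a_4']
```

Note how the order changes with the metric: Mei_t6a_3 leads on SPICE in this subset even though Xu_t6a_4 leads on the official SPIDEr ranking.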
Systems ranking, all metrics
In the table below, the first block of metric columns (BLEU1 through SPIDEr) refers to the Clotho testing split and the second block to the Clotho evaluation split.

Submission code | Best official system rank | Technical Report | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | CIDEr | SPICE | SPIDEr | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | CIDEr | SPICE | SPIDEr
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Xu_t6a_4 | 1 | xu2022_t6a | 0.666 | 0.433 | 0.282 | 0.178 | 0.187 | 0.412 | 0.508 | 0.130 | 0.319 | 0.667 | 0.435 | 0.285 | 0.183 | 0.186 | 0.415 | 0.513 | 0.126 | 0.320 | |
Zou_t6a_3 | 2 | zou2022_t6a | 0.670 | 0.437 | 0.289 | 0.183 | 0.185 | 0.415 | 0.502 | 0.133 | 0.318 | 0.646 | 0.430 | 0.289 | 0.186 | 0.186 | 0.409 | 0.497 | 0.119 | 0.308 | |
Xu_t6a_3 | 3 | xu2022_t6a | 0.658 | 0.430 | 0.281 | 0.178 | 0.186 | 0.410 | 0.501 | 0.131 | 0.316 | 0.663 | 0.433 | 0.285 | 0.185 | 0.185 | 0.413 | 0.517 | 0.127 | 0.322 | |
Xu_t6a_1 | 4 | xu2022_t6a | 0.645 | 0.421 | 0.276 | 0.173 | 0.186 | 0.402 | 0.498 | 0.130 | 0.314 | 0.647 | 0.424 | 0.280 | 0.180 | 0.186 | 0.409 | 0.507 | 0.130 | 0.318 | |
Zou_t6a_4 | 5 | zou2022_t6a | 0.652 | 0.433 | 0.287 | 0.182 | 0.185 | 0.408 | 0.497 | 0.130 | 0.314 | 0.663 | 0.443 | 0.299 | 0.195 | 0.189 | 0.416 | 0.520 | 0.126 | 0.323 | |
Xu_t6a_2 | 6 | xu2022_t6a | 0.650 | 0.425 | 0.278 | 0.176 | 0.187 | 0.407 | 0.495 | 0.127 | 0.311 | 0.654 | 0.431 | 0.286 | 0.187 | 0.188 | 0.413 | 0.524 | 0.126 | 0.325 | |
Zou_t6a_1 | 7 | zou2022_t6a | 0.648 | 0.424 | 0.279 | 0.176 | 0.185 | 0.410 | 0.489 | 0.133 | 0.311 | 0.647 | 0.438 | 0.296 | 0.194 | 0.185 | 0.414 | 0.503 | 0.132 | 0.317 | |
Zou_t6a_2 | 8 | zou2022_t6a | 0.655 | 0.429 | 0.283 | 0.178 | 0.184 | 0.405 | 0.491 | 0.128 | 0.309 | 0.645 | 0.422 | 0.281 | 0.183 | 0.186 | 0.408 | 0.495 | 0.131 | 0.313 | |
Mei_t6a_3 | 9 | mei2022_t6a | 0.661 | 0.433 | 0.287 | 0.179 | 0.188 | 0.415 | 0.482 | 0.135 | 0.309 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_1 | 10 | mei2022_t6a | 0.647 | 0.423 | 0.278 | 0.174 | 0.186 | 0.407 | 0.476 | 0.134 | 0.305 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_4 | 11 | mei2022_t6a | 0.669 | 0.428 | 0.281 | 0.173 | 0.184 | 0.410 | 0.468 | 0.138 | 0.303 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_2 | 12 | mei2022_t6a | 0.672 | 0.427 | 0.276 | 0.167 | 0.182 | 0.410 | 0.456 | 0.138 | 0.297 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | 0.453 | 0.128 | 0.291 | |
Primus_t6a_4 | 13 | primus2022_t6a | 0.641 | 0.421 | 0.276 | 0.168 | 0.184 | 0.402 | 0.458 | 0.134 | 0.296 | 0.636 | 0.417 | 0.275 | 0.168 | 0.183 | 0.401 | 0.461 | 0.130 | 0.295 | |
Kouzelis_t6a_4 | 14 | kouzelis2022_t6a | 0.581 | 0.387 | 0.262 | 0.170 | 0.180 | 0.388 | 0.453 | 0.134 | 0.293 | 0.579 | 0.386 | 0.262 | 0.173 | 0.178 | 0.387 | 0.457 | 0.134 | 0.296 | |
Guan_t6a_4 | 15 | guan2022_t6a | 0.623 | 0.417 | 0.284 | 0.180 | 0.177 | 0.405 | 0.451 | 0.130 | 0.291 | 0.649 | 0.439 | 0.303 | 0.199 | 0.181 | 0.415 | 0.471 | 0.133 | 0.302 | |
Guan_t6a_2 | 16 | guan2022_t6a | 0.575 | 0.388 | 0.268 | 0.178 | 0.178 | 0.387 | 0.451 | 0.129 | 0.290 | 0.595 | 0.402 | 0.277 | 0.189 | 0.179 | 0.395 | 0.465 | 0.127 | 0.296 | |
Guan_t6a_3 | 17 | guan2022_t6a | 0.657 | 0.425 | 0.279 | 0.169 | 0.179 | 0.405 | 0.447 | 0.132 | 0.290 | 0.660 | 0.424 | 0.279 | 0.170 | 0.178 | 0.410 | 0.442 | 0.129 | 0.285 | |
Kouzelis_t6a_3 | 18 | kouzelis2022_t6a | 0.567 | 0.378 | 0.257 | 0.169 | 0.176 | 0.385 | 0.447 | 0.131 | 0.289 | 0.575 | 0.384 | 0.262 | 0.174 | 0.178 | 0.386 | 0.557 | 0.133 | 0.295 | |
Kouzelis_t6a_1 | 19 | kouzelis2022_t6a | 0.570 | 0.382 | 0.259 | 0.170 | 0.177 | 0.384 | 0.439 | 0.132 | 0.286 | 0.576 | 0.384 | 0.261 | 0.176 | 0.166 | 0.385 | 0.453 | 0.130 | 0.292 | |
Kouzelis_t6a_2 | 20 | kouzelis2022_t6a | 0.569 | 0.378 | 0.256 | 0.168 | 0.177 | 0.386 | 0.441 | 0.130 | 0.285 | 0.578 | 0.384 | 0.262 | 0.176 | 0.177 | 0.387 | 0.454 | 0.133 | 0.293 | |
Primus_t6a_3 | 21 | primus2022_t6a | 0.654 | 0.420 | 0.271 | 0.163 | 0.177 | 0.395 | 0.434 | 0.127 | 0.280 | 0.653 | 0.424 | 0.278 | 0.169 | 0.181 | 0.404 | 0.455 | 0.125 | 0.290 | |
Primus_t6a_2 | 22 | primus2022_t6a | 0.562 | 0.364 | 0.243 | 0.153 | 0.181 | 0.374 | 0.418 | 0.132 | 0.275 | 0.573 | 0.370 | 0.244 | 0.158 | 0.181 | 0.376 | 0.440 | 0.128 | 0.284 | |
Guan_t6a_1 | 23 | guan2022_t6a | 0.556 | 0.367 | 0.249 | 0.165 | 0.173 | 0.375 | 0.417 | 0.124 | 0.270 | 0.581 | 0.386 | 0.265 | 0.181 | 0.175 | 0.385 | 0.437 | 0.126 | 0.281 | |
Kiciński_t6a_1 | 24 | kiciński2022_t6a | 0.567 | 0.368 | 0.244 | 0.155 | 0.175 | 0.378 | 0.414 | 0.126 | 0.270 | 0.583 | 0.382 | 0.255 | 0.164 | 0.179 | 0.387 | 0.433 | 0.125 | 0.279 | |
Primus_t6a_1 | 25 | primus2022_t6a | 0.556 | 0.364 | 0.241 | 0.150 | 0.176 | 0.367 | 0.400 | 0.127 | 0.264 | 0.566 | 0.373 | 0.252 | 0.164 | 0.178 | 0.376 | 0.408 | 0.120 | 0.264 | |
Pan_t6a_4 | 26 | pan2022_t6a | 0.555 | 0.361 | 0.240 | 0.155 | 0.173 | 0.374 | 0.387 | 0.123 | 0.255 | 0.568 | 0.367 | 0.245 | 0.161 | 0.175 | 0.383 | 0.395 | 0.119 | 0.257 | |
Kiciński_t6a_3 | 27 | kiciński2022_t6a | 0.546 | 0.346 | 0.224 | 0.138 | 0.170 | 0.364 | 0.379 | 0.123 | 0.251 | 0.566 | 0.363 | 0.236 | 0.147 | 0.173 | 0.372 | 0.400 | 0.121 | 0.260 | |
Kiciński_t6a_4 | 28 | kiciński2022_t6a | 0.556 | 0.360 | 0.238 | 0.152 | 0.171 | 0.367 | 0.380 | 0.121 | 0.250 | 0.562 | 0.364 | 0.242 | 0.157 | 0.172 | 0.377 | 0.378 | 0.119 | 0.249 | |
Pan_t6a_3 | 29 | pan2022_t6a | 0.559 | 0.360 | 0.237 | 0.152 | 0.173 | 0.376 | 0.376 | 0.123 | 0.250 | 0.567 | 0.363 | 0.240 | 0.154 | 0.174 | 0.380 | 0.386 | 0.121 | 0.253 | |
Pan_t6a_1 | 30 | pan2022_t6a | 0.556 | 0.365 | 0.244 | 0.154 | 0.170 | 0.375 | 0.377 | 0.121 | 0.249 | 0.560 | 0.362 | 0.240 | 0.155 | 0.169 | 0.375 | 0.381 | 0.116 | 0.248 | |
Kiciński_t6a_2 | 31 | kiciński2022_t6a | 0.556 | 0.358 | 0.235 | 0.148 | 0.171 | 0.369 | 0.378 | 0.119 | 0.249 | 0.567 | 0.365 | 0.242 | 0.155 | 0.172 | 0.378 | 0.393 | 0.117 | 0.255 | |
Pan_t6a_2 | 32 | pan2022_t6a | 0.556 | 0.358 | 0.236 | 0.146 | 0.169 | 0.371 | 0.363 | 0.120 | 0.241 | 0.562 | 0.361 | 0.240 | 0.156 | 0.172 | 0.377 | 0.384 | 0.118 | 0.251 | |
Labbe_t6a_1 | 33 | labbe2022_t6a | 0.548 | 0.351 | 0.233 | 0.149 | 0.170 | 0.370 | 0.359 | 0.123 | 0.241 | 0.555 | 0.357 | 0.240 | 0.157 | 0.170 | 0.374 | 0.367 | 0.118 | 0.242 | |
Baseline | 34 | gontier2022_t6a | 0.549 | 0.353 | 0.234 | 0.147 | 0.164 | 0.361 | 0.338 | 0.110 | 0.224 | 0.555 | 0.358 | 0.239 | 0.156 | 0.164 | 0.364 | 0.358 | 0.109 | 0.233 | |
Labbe_t6a_3 | 35 | labbe2022_t6a | 0.535 | 0.326 | 0.203 | 0.121 | 0.164 | 0.356 | 0.307 | 0.117 | 0.212 | 0.532 | 0.322 | 0.200 | 0.121 | 0.161 | 0.354 | 0.303 | 0.111 | 0.207 | |
Labbe_t6a_2 | 36 | labbe2022_t6a | 0.490 | 0.282 | 0.163 | 0.090 | 0.156 | 0.328 | 0.247 | 0.113 | 0.180 | 0.488 | 0.279 | 0.160 | 0.085 | 0.154 | 0.329 | 0.241 | 0.106 | 0.174 | |
Labbe_t6a_4 | 37 | labbe2022_t6a | 0.460 | 0.251 | 0.139 | 0.072 | 0.142 | 0.311 | 0.203 | 0.099 | 0.151 | 0.457 | 0.245 | 0.135 | 0.067 | 0.140 | 0.310 | 0.198 | 0.090 | 0.144 |
Systems ranking, machine translation metrics
In the table below, the first block of metric columns (BLEU1 through ROUGEL) refers to the Clotho testing split and the second block to the Clotho evaluation split.

Submission code | Best official system rank | Technical Report | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Xu_t6a_4 | 1 | xu2022_t6a | 0.666 | 0.433 | 0.282 | 0.178 | 0.187 | 0.412 | 0.667 | 0.435 | 0.285 | 0.183 | 0.186 | 0.415 | |
Zou_t6a_3 | 2 | zou2022_t6a | 0.670 | 0.437 | 0.289 | 0.183 | 0.185 | 0.415 | 0.646 | 0.430 | 0.289 | 0.186 | 0.186 | 0.409 | |
Xu_t6a_3 | 3 | xu2022_t6a | 0.658 | 0.430 | 0.281 | 0.178 | 0.186 | 0.410 | 0.663 | 0.433 | 0.285 | 0.185 | 0.185 | 0.413 | |
Xu_t6a_1 | 4 | xu2022_t6a | 0.645 | 0.421 | 0.276 | 0.173 | 0.186 | 0.402 | 0.647 | 0.424 | 0.280 | 0.180 | 0.186 | 0.409 | |
Zou_t6a_4 | 5 | zou2022_t6a | 0.652 | 0.433 | 0.287 | 0.182 | 0.185 | 0.408 | 0.663 | 0.443 | 0.299 | 0.195 | 0.189 | 0.416 | |
Xu_t6a_2 | 6 | xu2022_t6a | 0.650 | 0.425 | 0.278 | 0.176 | 0.187 | 0.407 | 0.654 | 0.431 | 0.286 | 0.187 | 0.188 | 0.413 | |
Zou_t6a_1 | 7 | zou2022_t6a | 0.648 | 0.424 | 0.279 | 0.176 | 0.185 | 0.410 | 0.647 | 0.438 | 0.296 | 0.194 | 0.185 | 0.414 | |
Zou_t6a_2 | 8 | zou2022_t6a | 0.655 | 0.429 | 0.283 | 0.178 | 0.184 | 0.405 | 0.645 | 0.422 | 0.281 | 0.183 | 0.186 | 0.408 | |
Mei_t6a_3 | 9 | mei2022_t6a | 0.661 | 0.433 | 0.287 | 0.179 | 0.188 | 0.415 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | |
Mei_t6a_1 | 10 | mei2022_t6a | 0.647 | 0.423 | 0.278 | 0.174 | 0.186 | 0.407 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | |
Mei_t6a_4 | 11 | mei2022_t6a | 0.669 | 0.428 | 0.281 | 0.173 | 0.184 | 0.410 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | |
Mei_t6a_2 | 12 | mei2022_t6a | 0.672 | 0.427 | 0.276 | 0.167 | 0.182 | 0.410 | 0.646 | 0.414 | 0.270 | 0.169 | 0.183 | 0.407 | |
Primus_t6a_4 | 13 | primus2022_t6a | 0.641 | 0.421 | 0.276 | 0.168 | 0.184 | 0.402 | 0.636 | 0.417 | 0.275 | 0.168 | 0.183 | 0.401 | |
Kouzelis_t6a_4 | 14 | kouzelis2022_t6a | 0.581 | 0.387 | 0.262 | 0.170 | 0.180 | 0.388 | 0.579 | 0.386 | 0.262 | 0.173 | 0.178 | 0.387 | |
Guan_t6a_4 | 15 | guan2022_t6a | 0.623 | 0.417 | 0.284 | 0.180 | 0.177 | 0.405 | 0.649 | 0.439 | 0.303 | 0.199 | 0.181 | 0.415 | |
Guan_t6a_2 | 16 | guan2022_t6a | 0.575 | 0.388 | 0.268 | 0.178 | 0.178 | 0.387 | 0.595 | 0.402 | 0.277 | 0.189 | 0.179 | 0.395 | |
Guan_t6a_3 | 17 | guan2022_t6a | 0.657 | 0.425 | 0.279 | 0.169 | 0.179 | 0.405 | 0.660 | 0.424 | 0.279 | 0.170 | 0.178 | 0.410 | |
Kouzelis_t6a_3 | 18 | kouzelis2022_t6a | 0.567 | 0.378 | 0.257 | 0.169 | 0.176 | 0.385 | 0.575 | 0.384 | 0.262 | 0.174 | 0.178 | 0.386 | |
Kouzelis_t6a_1 | 19 | kouzelis2022_t6a | 0.570 | 0.382 | 0.259 | 0.170 | 0.177 | 0.384 | 0.576 | 0.384 | 0.261 | 0.176 | 0.166 | 0.385 | |
Kouzelis_t6a_2 | 20 | kouzelis2022_t6a | 0.569 | 0.378 | 0.256 | 0.168 | 0.177 | 0.386 | 0.578 | 0.384 | 0.262 | 0.176 | 0.177 | 0.387 | |
Primus_t6a_3 | 21 | primus2022_t6a | 0.654 | 0.420 | 0.271 | 0.163 | 0.177 | 0.395 | 0.653 | 0.424 | 0.278 | 0.169 | 0.181 | 0.404 | |
Primus_t6a_2 | 22 | primus2022_t6a | 0.562 | 0.364 | 0.243 | 0.153 | 0.181 | 0.374 | 0.573 | 0.370 | 0.244 | 0.158 | 0.181 | 0.376 | |
Guan_t6a_1 | 23 | guan2022_t6a | 0.556 | 0.367 | 0.249 | 0.165 | 0.173 | 0.375 | 0.581 | 0.386 | 0.265 | 0.181 | 0.175 | 0.385 | |
Kiciński_t6a_1 | 24 | kiciński2022_t6a | 0.567 | 0.368 | 0.244 | 0.155 | 0.175 | 0.378 | 0.583 | 0.382 | 0.255 | 0.164 | 0.179 | 0.387 | |
Primus_t6a_1 | 25 | primus2022_t6a | 0.556 | 0.364 | 0.241 | 0.150 | 0.176 | 0.367 | 0.566 | 0.373 | 0.252 | 0.164 | 0.178 | 0.376 | |
Pan_t6a_4 | 26 | pan2022_t6a | 0.555 | 0.361 | 0.240 | 0.155 | 0.173 | 0.374 | 0.568 | 0.367 | 0.245 | 0.161 | 0.175 | 0.383 | |
Kiciński_t6a_3 | 27 | kiciński2022_t6a | 0.546 | 0.346 | 0.224 | 0.138 | 0.170 | 0.364 | 0.566 | 0.363 | 0.236 | 0.147 | 0.173 | 0.372 | |
Kiciński_t6a_4 | 28 | kiciński2022_t6a | 0.556 | 0.360 | 0.238 | 0.152 | 0.171 | 0.367 | 0.562 | 0.364 | 0.242 | 0.157 | 0.172 | 0.377 | |
Pan_t6a_3 | 29 | pan2022_t6a | 0.559 | 0.360 | 0.237 | 0.152 | 0.173 | 0.376 | 0.567 | 0.363 | 0.240 | 0.154 | 0.174 | 0.380 | |
Pan_t6a_1 | 30 | pan2022_t6a | 0.556 | 0.365 | 0.244 | 0.154 | 0.170 | 0.375 | 0.560 | 0.362 | 0.240 | 0.155 | 0.169 | 0.375 | |
Kiciński_t6a_2 | 31 | kiciński2022_t6a | 0.556 | 0.358 | 0.235 | 0.148 | 0.171 | 0.369 | 0.567 | 0.365 | 0.242 | 0.155 | 0.172 | 0.378 | |
Pan_t6a_2 | 32 | pan2022_t6a | 0.556 | 0.358 | 0.236 | 0.146 | 0.169 | 0.371 | 0.562 | 0.361 | 0.240 | 0.156 | 0.172 | 0.377 | |
Labbe_t6a_1 | 33 | labbe2022_t6a | 0.548 | 0.351 | 0.233 | 0.149 | 0.170 | 0.370 | 0.555 | 0.357 | 0.240 | 0.157 | 0.170 | 0.374 | |
Baseline | 34 | gontier2022_t6a | 0.549 | 0.353 | 0.234 | 0.147 | 0.164 | 0.361 | 0.555 | 0.358 | 0.239 | 0.156 | 0.164 | 0.364 | |
Labbe_t6a_3 | 35 | labbe2022_t6a | 0.535 | 0.326 | 0.203 | 0.121 | 0.164 | 0.356 | 0.532 | 0.322 | 0.200 | 0.121 | 0.161 | 0.354 | |
Labbe_t6a_2 | 36 | labbe2022_t6a | 0.490 | 0.282 | 0.163 | 0.090 | 0.156 | 0.328 | 0.488 | 0.279 | 0.160 | 0.085 | 0.154 | 0.329 | |
Labbe_t6a_4 | 37 | labbe2022_t6a | 0.460 | 0.251 | 0.139 | 0.072 | 0.142 | 0.311 | 0.457 | 0.245 | 0.135 | 0.067 | 0.140 | 0.310 |
Systems ranking, captioning metrics
In the table below, the first block of metric columns (CIDEr through SPIDEr) refers to the Clotho testing split and the second block to the Clotho evaluation split.

Submission code | Best official system rank | Technical Report | CIDEr | SPICE | SPIDEr | CIDEr | SPICE | SPIDEr
---|---|---|---|---|---|---|---|---
Xu_t6a_4 | 1 | xu2022_t6a | 0.508 | 0.130 | 0.319 | 0.513 | 0.126 | 0.320 | |
Zou_t6a_3 | 2 | zou2022_t6a | 0.502 | 0.133 | 0.318 | 0.497 | 0.119 | 0.308 | |
Xu_t6a_3 | 3 | xu2022_t6a | 0.501 | 0.131 | 0.316 | 0.517 | 0.127 | 0.322 | |
Xu_t6a_1 | 4 | xu2022_t6a | 0.498 | 0.130 | 0.314 | 0.507 | 0.130 | 0.318 | |
Zou_t6a_4 | 5 | zou2022_t6a | 0.497 | 0.130 | 0.314 | 0.520 | 0.126 | 0.323 | |
Xu_t6a_2 | 6 | xu2022_t6a | 0.495 | 0.127 | 0.311 | 0.524 | 0.126 | 0.325 | |
Zou_t6a_1 | 7 | zou2022_t6a | 0.489 | 0.133 | 0.311 | 0.503 | 0.132 | 0.317 | |
Zou_t6a_2 | 8 | zou2022_t6a | 0.491 | 0.128 | 0.309 | 0.495 | 0.131 | 0.313 | |
Mei_t6a_3 | 9 | mei2022_t6a | 0.482 | 0.135 | 0.309 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_1 | 10 | mei2022_t6a | 0.476 | 0.134 | 0.305 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_4 | 11 | mei2022_t6a | 0.468 | 0.138 | 0.303 | 0.453 | 0.128 | 0.291 | |
Mei_t6a_2 | 12 | mei2022_t6a | 0.456 | 0.138 | 0.297 | 0.453 | 0.128 | 0.291 | |
Primus_t6a_4 | 13 | primus2022_t6a | 0.458 | 0.134 | 0.296 | 0.461 | 0.130 | 0.295 | |
Kouzelis_t6a_4 | 14 | kouzelis2022_t6a | 0.453 | 0.134 | 0.293 | 0.457 | 0.134 | 0.296 | |
Guan_t6a_4 | 15 | guan2022_t6a | 0.451 | 0.130 | 0.291 | 0.471 | 0.133 | 0.302 | |
Guan_t6a_2 | 16 | guan2022_t6a | 0.451 | 0.129 | 0.290 | 0.465 | 0.127 | 0.296 | |
Guan_t6a_3 | 17 | guan2022_t6a | 0.447 | 0.132 | 0.290 | 0.442 | 0.129 | 0.285 | |
Kouzelis_t6a_3 | 18 | kouzelis2022_t6a | 0.447 | 0.131 | 0.289 | 0.557 | 0.133 | 0.295 | |
Kouzelis_t6a_1 | 19 | kouzelis2022_t6a | 0.439 | 0.132 | 0.286 | 0.453 | 0.130 | 0.292 | |
Kouzelis_t6a_2 | 20 | kouzelis2022_t6a | 0.441 | 0.130 | 0.285 | 0.454 | 0.133 | 0.293 | |
Primus_t6a_3 | 21 | primus2022_t6a | 0.434 | 0.127 | 0.280 | 0.455 | 0.125 | 0.290 | |
Primus_t6a_2 | 22 | primus2022_t6a | 0.418 | 0.132 | 0.275 | 0.440 | 0.128 | 0.284 | |
Guan_t6a_1 | 23 | guan2022_t6a | 0.417 | 0.124 | 0.270 | 0.437 | 0.126 | 0.281 | |
Kiciński_t6a_1 | 24 | kiciński2022_t6a | 0.414 | 0.126 | 0.270 | 0.433 | 0.125 | 0.279 | |
Primus_t6a_1 | 25 | primus2022_t6a | 0.400 | 0.127 | 0.264 | 0.408 | 0.120 | 0.264 | |
Pan_t6a_4 | 26 | pan2022_t6a | 0.387 | 0.123 | 0.255 | 0.395 | 0.119 | 0.257 | |
Kiciński_t6a_3 | 27 | kiciński2022_t6a | 0.379 | 0.123 | 0.251 | 0.400 | 0.121 | 0.260 | |
Kiciński_t6a_4 | 28 | kiciński2022_t6a | 0.380 | 0.121 | 0.250 | 0.378 | 0.119 | 0.249 | |
Pan_t6a_3 | 29 | pan2022_t6a | 0.376 | 0.123 | 0.250 | 0.386 | 0.121 | 0.253 | |
Pan_t6a_1 | 30 | pan2022_t6a | 0.377 | 0.121 | 0.249 | 0.381 | 0.116 | 0.248 | |
Kiciński_t6a_2 | 31 | kiciński2022_t6a | 0.378 | 0.119 | 0.249 | 0.393 | 0.117 | 0.255 | |
Pan_t6a_2 | 32 | pan2022_t6a | 0.363 | 0.120 | 0.241 | 0.384 | 0.118 | 0.251 | |
Labbe_t6a_1 | 33 | labbe2022_t6a | 0.359 | 0.123 | 0.241 | 0.367 | 0.118 | 0.242 | |
Baseline | 34 | gontier2022_t6a | 0.338 | 0.110 | 0.224 | 0.358 | 0.109 | 0.233 | |
Labbe_t6a_3 | 35 | labbe2022_t6a | 0.307 | 0.117 | 0.212 | 0.303 | 0.111 | 0.207 | |
Labbe_t6a_2 | 36 | labbe2022_t6a | 0.247 | 0.113 | 0.180 | 0.241 | 0.106 | 0.174 | |
Labbe_t6a_4 | 37 | labbe2022_t6a | 0.203 | 0.099 | 0.151 | 0.198 | 0.090 | 0.144 |
System characteristics
In this section you can find the characteristics of the submitted systems, presented in two tables for easy reference in the corresponding subsections. The first table gives an overview of the systems and the second a detailed presentation of each system.
Overview of characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation
---|---|---|---|---|---|---|---|---
1 | Xu_t6a_4 | 0.319 | xu2022_t6a | Rnn_Transformer | 528099252 | RNN | Transformer, RNN | |
2 | Zou_t6a_3 | 0.318 | zou2022_t6a | encoder-decoder | 84541437 | CNN | LSTM | SpecAugment, SpecAugment++ |
3 | Xu_t6a_3 | 0.316 | xu2022_t6a | Rnn_Transformer | 347873912 | RNN | Transformer | |
4 | Xu_t6a_1 | 0.314 | xu2022_t6a | Rnn_Transformer | 85915038 | RNN | Transformer | |
5 | Zou_t6a_4 | 0.314 | zou2022_t6a | encoder-decoder | 84541437 | CNN | LSTM | SpecAugment, SpecAugment++ |
6 | Xu_t6a_2 | 0.311 | xu2022_t6a | Rnn_Transformer | 171830076 | RNN | Transformer | |
7 | Zou_t6a_1 | 0.311 | zou2022_t6a | encoder-decoder | 86643711 | CNN | LSTM | SpecAugment, SpecAugment++ |
8 | Zou_t6a_2 | 0.309 | zou2022_t6a | encoder-decoder | 140000000 | CNN | LSTM | SpecAugment, SpecAugment++ |
9 | Mei_t6a_3 | 0.309 | mei2022_t6a | encoder-decoder | 8867215 | CNN | Transformer | SpecAugment |
10 | Mei_t6a_1 | 0.305 | mei2022_t6a | encoder-decoder | 8867215 | CNN | Transformer | SpecAugment |
11 | Mei_t6a_4 | 0.303 | mei2022_t6a | encoder-decoder | 8867215 | CNN | Transformer | SpecAugment |
12 | Mei_t6a_2 | 0.297 | mei2022_t6a | encoder-decoder | 8867215 | CNN | Transformer | SpecAugment |
13 | Primus_t6a_4 | 0.296 | primus2022_t6a | encoder-decoder | 780000000 | CNN | Transformer | SpecAugment |
14 | Kouzelis_t6a_4 | 0.293 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | Transformer | Mixup, SpecAugment, Label Smoothing |
15 | Guan_t6a_4 | 0.291 | guan2022_t6a | encoder-decoder | 36147701 | PANNs, GAT | Transformer, LocalAFT | Mixup, SpecAugment |
16 | Guan_t6a_2 | 0.290 | guan2022_t6a | encoder-decoder | 28920672 | PANNs, GAT | Transformer, LocalAFT | Mixup, SpecAugment |
17 | Guan_t6a_3 | 0.290 | guan2022_t6a | encoder-decoder | 7227029 | PANNs, GAT | Transformer | Mixup, SpecAugment |
18 | Kouzelis_t6a_3 | 0.289 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | Transformer | Mixup, SpecAugment, Label Smoothing |
19 | Kouzelis_t6a_1 | 0.286 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | Transformer | Mixup, SpecAugment, Label Smoothing |
20 | Kouzelis_t6a_2 | 0.285 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | Transformer | Mixup, SpecAugment, Label Smoothing |
21 | Primus_t6a_3 | 0.280 | primus2022_t6a | encoder-decoder | 130000000 | CNN | Transformer | SpecAugment |
22 | Primus_t6a_2 | 0.275 | primus2022_t6a | encoder-decoder | 130000000 | CNN | Transformer | SpecAugment |
23 | Guan_t6a_1 | 0.270 | guan2022_t6a | encoder-decoder | 7227029 | PANNs, GAT | Transformer | Mixup, SpecAugment |
24 | Kiciński_t6a_1 | 0.270 | kiciński2022_t6a | encoder-decoder | 104000000 | PANNs | Transformer | random crop, random pad, adding white noise, SpecAugment |
25 | Primus_t6a_1 | 0.264 | primus2022_t6a | encoder-decoder | 130000000 | CNN | Transformer | SpecAugment |
26 | Pan_t6a_4 | 0.255 | pan2022_t6a | encoder-decoder | 9857679 | CNN | Transformer | Mixture |
27 | Kiciński_t6a_3 | 0.251 | kiciński2022_t6a | encoder-decoder | 104000000 | PANNs | Transformer, KeyBert | random crop, random pad, adding white noise, SpecAugment |
28 | Kiciński_t6a_4 | 0.250 | kiciński2022_t6a | encoder-decoder | 207000000 | PANNs | GPT2, KeyBert | random crop, random pad, adding white noise, SpecAugment |
29 | Pan_t6a_3 | 0.250 | pan2022_t6a | encoder-decoder | 9857679 | CNN | Transformer | Zero_value, Mixup |
30 | Pan_t6a_1 | 0.249 | pan2022_t6a | encoder-decoder | 9857679 | CNN | Transformer | Zero_value, Mixup |
31 | Kiciński_t6a_2 | 0.249 | kiciński2022_t6a | encoder-decoder | 207000000 | PANNs | GPT2 | random crop, random pad, adding white noise, SpecAugment |
32 | Pan_t6a_2 | 0.241 | pan2022_t6a | encoder-decoder | 9857679 | CNN | Transformer | Zero_value, Mixup |
33 | Labbe_t6a_1 | 0.241 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | Transformer | SpecAugment |
34 | Baseline | 0.224 | gontier2022_t6a | encoder-decoder | 140000000 | Transformer | Transformer | |
35 | Labbe_t6a_3 | 0.212 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | Transformer | SpecAugment |
36 | Labbe_t6a_2 | 0.180 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | Transformer | SpecAugment |
37 | Labbe_t6a_4 | 0.151 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | Transformer | SpecAugment |
Detailed characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Xu_t6a_4 | 0.319 | xu2022_t6a | Rnn_Transformer | 528099252 | RNN | Clap feature | Transformer, RNN | learned | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 5e-4 | CIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
2 | Zou_t6a_3 | 0.318 | zou2022_t6a | encoder-decoder | 84541437 | CNN | ResNet38 | LSTM | Random | SpecAugment, SpecAugment++ | 44.1KHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, Clotho | Clotho, Clotho | ||||
3 | Xu_t6a_3 | 0.316 | xu2022_t6a | Rnn_Transformer | 347873912 | RNN | Clap feature | Transformer | learned | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 5e-4 | CIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
4 | Xu_t6a_1 | 0.314 | xu2022_t6a | Rnn_Transformer | 85915038 | RNN | Clap feature | Transformer | learned | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 5e-4 | CIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
5 | Zou_t6a_4 | 0.314 | zou2022_t6a | encoder-decoder | 84541437 | CNN | ResNet38 | LSTM | Random | SpecAugment, SpecAugment++ | 44.1KHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, Clotho | Clotho, Clotho | ||||
6 | Xu_t6a_2 | 0.311 | xu2022_t6a | Rnn_Transformer | 171830076 | RNN | Clap feature | Transformer | learned | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 5e-4 | CIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
7 | Zou_t6a_1 | 0.311 | zou2022_t6a | encoder-decoder | 86643711 | CNN | ResNet38 | LSTM | Random | SpecAugment, SpecAugment++ | 44.1KHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, Clotho | Clotho, Clotho | ||||
8 | Zou_t6a_2 | 0.309 | zou2022_t6a | encoder-decoder | 140000000 | CNN | ResNet38 | LSTM | Random | SpecAugment, SpecAugment++ | 44.1KHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, Clotho | Clotho, Clotho | ||||
9 | Mei_t6a_3 | 0.309 | mei2022_t6a | encoder-decoder | 8867215 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | SPIDEr | Clotho | Clotho | ||||
10 | Mei_t6a_1 | 0.305 | mei2022_t6a | encoder-decoder | 8867215 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | SPIDEr | Clotho | Clotho | ||||
11 | Mei_t6a_4 | 0.303 | mei2022_t6a | encoder-decoder | 8867215 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | SPIDEr | Clotho | Clotho | ||||
12 | Mei_t6a_2 | 0.297 | mei2022_t6a | encoder-decoder | 8867215 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | SPIDEr | Clotho | Clotho | ||||
13 | Primus_t6a_4 | 0.296 | primus2022_t6a | encoder-decoder | 780000000 | CNN | CNN10 | Transformer | BART | SpecAugment | 44.1kHz | supervised | score function estimator | adamw | 1e-5 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps | ||||
14 | Kouzelis_t6a_4 | 0.293 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | mel energies | Transformer | Word2Vec | Mixup, SpecAugment, Label Smoothing | 32kHz | supervised | cross-entropy | adam | 1e-5 | SPIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | ||||
15 | Guan_t6a_4 | 0.291 | guan2022_t6a | encoder-decoder | 36147701 | PANNs, GAT | log-mel energies | Transformer, LocalAFT | Word2Vec | Mixup, SpecAugment | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 1e-4 | loss, SPIDEr | Clotho, AudioCaps | Clotho, AudioCaps | ||||
16 | Guan_t6a_2 | 0.290 | guan2022_t6a | encoder-decoder | 28920672 | PANNs, GAT | log-mel energies | Transformer, LocalAFT | Word2Vec | Mixup, SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-4 | loss, SPIDEr | Clotho, AudioCaps | Clotho, AudioCaps | ||||
17 | Guan_t6a_3 | 0.290 | guan2022_t6a | encoder-decoder | 7227029 | PANNs, GAT | log-mel energies | Transformer | Word2Vec | Mixup, SpecAugment | 44.1kHz | supervised, reinforcement learning | cross-entropy | adam | 1e-4 | loss, SPIDEr | Clotho, AudioCaps | Clotho, AudioCaps | ||||
18 | Kouzelis_t6a_3 | 0.289 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | mel energies | Transformer | Word2Vec | Mixup, SpecAugment, Label Smoothing | 32kHz | supervised | cross-entropy | adam | 1e-5 | SPIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | ||||
19 | Kouzelis_t6a_1 | 0.286 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | mel energies | Transformer | Word2Vec | Mixup, SpecAugment, Label Smoothing | 32kHz | supervised | cross-entropy | adam | linear warmup 1e-5 | SPIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | ||||
20 | Kouzelis_t6a_2 | 0.285 | kouzelis2022_t6a | encoder-decoder | 119757102 | PaSST | mel energies | Transformer | Word2Vec | Mixup, SpecAugment, Label Smoothing | 32kHz | supervised | cross-entropy | adam | 1e-5 | SPIDEr | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | ||||
21 | Primus_t6a_3 | 0.280 | primus2022_t6a | encoder-decoder | 130000000 | CNN | CNN10 | Transformer | BART | SpecAugment | 44.1kHz | supervised | score function estimator | adamw | 1e-5 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps | ||||
22 | Primus_t6a_2 | 0.275 | primus2022_t6a | encoder-decoder | 130000000 | CNN | CNN10 | Transformer | BART | SpecAugment | 44.1kHz | supervised | cross-entropy | adamw | 1e-5 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps | ||||
23 | Guan_t6a_1 | 0.270 | guan2022_t6a | encoder-decoder | 7227029 | PANNs, GAT | log-mel energies | Transformer | Word2Vec | Mixup, SpecAugment | 44.1kHz | supervised | cross-entropy | adam | 1e-4 | loss, SPIDEr | Clotho, AudioCaps | Clotho, AudioCaps | ||||
24 | Kiciński_t6a_1 | 0.270 | kiciński2022_t6a | encoder-decoder | 104000000 | PANNs | mel energies | Transformer | learned | random crop, random pad, adding white noise, SpecAugment | 16kHz | supervised | cross-entropy | adamw | 1e-4 | loss | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | ||||
25 | Primus_t6a_1 | 0.264 | primus2022_t6a | encoder-decoder | 130000000 | CNN | CNN10 | Transformer | BART | SpecAugment | 44.1kHz | supervised | cross-entropy | adamw | 1e-5 | SPIDEr | Clotho, AudioSet | Clotho | ||||
26 | Pan_t6a_4 | 0.255 | pan2022_t6a | encoder-decoder | 9857679 | CNN | mel energies | Transformer | Word2Vec | Mixture | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | loss | Clotho | Clotho | ||||
27 | Kiciński_t6a_3 | 0.251 | kiciński2022_t6a | encoder-decoder | 104000000 | PANNs | mel energies | Transformer, KeyBert | learned | random crop, random pad, adding white noise, SpecAugment | 16kHz | supervised | cross-entropy | adamw | 1e-4 | loss | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | ||||
28 | Kiciński_t6a_4 | 0.250 | kiciński2022_t6a | encoder-decoder | 207000000 | PANNs | mel energies | GPT2, KeyBert | GPT2 | random crop, random pad, adding white noise, SpecAugment | 16kHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | ||||
29 | Pan_t6a_3 | 0.250 | pan2022_t6a | encoder-decoder | 9857679 | CNN | mel energies | Transformer | Word2Vec | Zero_value, Mixup | 44.1kHz | supervised | cross-entropy | adamw | 1e-3 | loss | Clotho | Clotho | ||||
30 | Pan_t6a_1 | 0.249 | pan2022_t6a | encoder-decoder | 9857679 | CNN | mel energies | Transformer | learned | Zero_value, Mixup | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | loss | Clotho | Clotho | ||||
31 | Kiciński_t6a_2 | 0.249 | kiciński2022_t6a | encoder-decoder | 207000000 | PANNs | mel energies | GPT2 | GPT2 | random crop, random pad, adding white noise, SpecAugment | 16kHz | supervised | cross-entropy | adamw | 5e-5 | loss | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | ||||
32 | Pan_t6a_2 | 0.241 | pan2022_t6a | encoder-decoder | 9857679 | CNN | mel energies | Transformer | Word2Vec | Zero_value, Mixup | 44.1kHz | supervised | cross-entropy | adam | 1e-3 | loss | Clotho | Clotho | ||||
33 | Labbe_t6a_1 | 0.241 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | log-mel energies | Transformer | learned | SpecAugment | 32kHz | supervised | cross-entropy | adam | 5e-4 | loss | Clotho | Clotho | ||||
34 | Baseline | 0.224 | gontier2022_t6a | encoder-decoder | 140000000 | Transformer | VGGish | Transformer | BART | | 16kHz | supervised | cross-entropy | adamw | 1e-5 | loss | Clotho | Clotho | ||||
35 | Labbe_t6a_3 | 0.212 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | log-mel energies | Transformer | learned | SpecAugment | 32kHz | supervised | cross-entropy | adam | 5e-4 | loss | Clotho | Clotho | ||||
36 | Labbe_t6a_2 | 0.180 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | log-mel energies | Transformer | learned | SpecAugment | 32kHz | supervised | cross-entropy | adam | 5e-4 | loss | Clotho | Clotho | ||||
37 | Labbe_t6a_4 | 0.151 | labbe2022_t6a | encoder-decoder | 16531922 | CNN10 | log-mel energies | Transformer | learned | SpecAugment | 32kHz | supervised | cross-entropy | adam | 5e-4 | loss | Clotho | Clotho |
Technical reports
Ensemble learning for audio captioning with graph audio feature representation
Feiyang Xiao1, Jian Guan1, Haiyan Lan1, Qiaoxi Zhu2, Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
guan_t6a_1 guan_t6a_2 guan_t6a_3 guan_t6a_4
Abstract
This technical report describes our submission for Task 6A of the DCASE2022 Challenge (automated audio captioning). Our system is built on an ensemble learning strategy: it integrates the advantages of different audio captioning methods, including a graph attention-based audio feature representation method. Experiments show that our ensemble system achieves a SPIDEr score (used for ranking) of 30.2 on the evaluation split of the Clotho v2 dataset.
System characteristics
Data augmentation | Mixup, SpecAugment |
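As an illustration of the Mixup augmentation listed above, the sketch below blends two feature vectors and their label vectors with a Beta-sampled weight. This is a generic sketch of the technique, not the authors' implementation; all names are ours.

```python
import random

def mixup(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Blend two examples (and their label vectors) with a Beta-sampled weight."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in (0, 1)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

With a small `alpha`, the Beta distribution concentrates mass near 0 and 1, so most mixed examples stay close to one of the two originals.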
Exploring audio captioning with keyword-guided text generation
Dawid Kicinski, Teodor Lamort de Gail, Pawel Bujnowski
Samsung R&D Institute Poland, Artificial Intelligence, Warsaw, Poland
kicinski_t6a_1 kicinski_t6a_2 kicinski_t6a_3 kicinski_t6a_4
Abstract
This technical report describes our submission to the DCASE 2022 challenge, Task 6A: automated audio captioning. In our system, we explore the use of pre-trained language models for the audio captioning task. The proposed system is an encoder-decoder architecture consisting of a pre-trained PANN encoder and a GPT2 decoder. Audio embeddings are mapped to language-model prompts using a simple mapping network. We further develop our system by employing strategies for guiding the decoder with textual information: we prompt the decoder with keywords extracted from semantically similar audio clips, and use keyword occurrence to select the best-matching caption.
System characteristics
Data augmentation | random crop, random pad, adding white noise, SpecAugment |
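The waveform-level augmentations listed above (random crop, random pad, additive white noise) can be sketched in a few lines. This is an illustrative sketch under our own naming, not the submitted code.

```python
import random

def augment_waveform(wav, target_len, noise_std=0.01, rng=None):
    """Randomly crop or zero-pad a waveform to target_len, then add white noise."""
    rng = rng or random.Random()
    if len(wav) > target_len:                      # random crop
        start = rng.randrange(len(wav) - target_len + 1)
        wav = wav[start:start + target_len]
    else:                                          # random pad with zeros
        pad = target_len - len(wav)
        left = rng.randrange(pad + 1)
        wav = [0.0] * left + wav + [0.0] * (pad - left)
    return [s + rng.gauss(0.0, noise_std) for s in wav]  # additive white noise
```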
Efficient audio captioning transformer with patchout and text guidance
Thodoris Kouzelis1,2, Grigoris Bastas1,2, Athanasios Katsamanis2, Alexandros Potamianos1
1Institute for Language and Speech Processing, Athena Research Center, Athens, Greece, 2School of ECE, National Technical University of Athens, Athens, Greece
kouzelis_t6a_1 kouzelis_t6a_2 kouzelis_t6a_3 kouzelis_t6a_4
Judges’ award
Abstract
This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge, Task 6. We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing computational complexity and avoiding overfitting. Caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model, which is fine-tuned to maximize the semantic similarity between AudioSet labels and ground-truth captions. To mitigate the data scarcity problem of Automated Audio Captioning (AAC), we pre-train our model on an enlarged dataset. Moreover, we propose a method to apply Mixup augmentation for AAC. Our best model achieves a SPIDEr score of 0.296.
System characteristics
Data augmentation | Mixup, SpecAugment, Label Smoothing |
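Of the regularizers listed above, label smoothing is the easiest to write down: the cross-entropy target puts mass 1-eps on the true token and spreads eps uniformly over the vocabulary. A minimal sketch (one common variant; names are ours):

```python
import math

def label_smoothing_nll(log_probs, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution:
    (1 - eps) + eps/V on the true token, eps/V on every other token."""
    v = len(log_probs)
    smooth = eps / v
    return -sum((1 - eps + smooth if i == target else smooth) * lp
                for i, lp in enumerate(log_probs))
```

Because the smoothed target never assigns probability 1 to any token, the decoder is discouraged from becoming over-confident on the training captions.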
IRIT-UPS DCASE 2022 task6a system: stochastic decoding methods for audio captioning
Etienne Labbé, Thomas Pellegrini, Julien Pinquier
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
labbe_t6a_1 labbe_t6a_2 labbe_t6a_3 labbe_t6a_4
Abstract
This document presents a summary of our models used in the Automated Audio Captioning task (6a) for the DCASE2022 challenge. Four submissions were made using different decoding methods: beam search, top-k sampling, nucleus sampling, and typical decoding.
System characteristics
Data augmentation | SpecAugment |
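One of the stochastic decoding methods compared in this submission, nucleus (top-p) sampling, can be sketched as follows: sample only from the smallest set of tokens whose cumulative probability exceeds p. This is a generic sketch of the technique, not the authors' code.

```python
import random

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest set of highest-probability
    tokens whose cumulative probability reaches p (nucleus sampling)."""
    rng = rng or random.Random()
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    r = rng.random() * total                   # renormalise over the nucleus
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With p close to 0 this reduces to greedy decoding; with p = 1 it is plain sampling from the full distribution.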
Automated audio captioning with keywords guidance
Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK
mei_t6a_1 mei_t6a_2 mei_t6a_3 mei_t6a_4
Abstract
This technical report describes an automated audio captioning system we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6a. The proposed system is built on an encoder-decoder architecture we submitted to DCASE 2021 Challenge Task 6 last year, where the encoder is a pre-trained 10-layer convolutional neural network and the decoder is a Transformer network. In this new submission, we investigate the use of keywords estimated from input audio clips to guide the caption generation process. The results show that keywords guidance can improve the system performance especially when the pre-trained encoder is frozen, and can also reduce the variance of the results when the model is trained with different seeds. The overall system consists of a pre-trained keywords estimation model and a CNN-Transformer audio captioning model. The captioning model is first trained via the cross-entropy loss and then fine-tuned with reinforcement learning to optimize the evaluation metric CIDEr. The proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.
System characteristics
Data augmentation | SpecAugment |
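Keyword guidance can take many forms; one simple illustrative mechanism (not necessarily the one used in this submission) is to additively bias the decoder logits toward tokens belonging to the estimated keywords before sampling or beam search. All names below are hypothetical.

```python
def bias_logits(logits, keyword_ids, bonus=1.0):
    """Additively bias decoder logits toward estimated-keyword tokens,
    a simple form of keyword guidance at decoding time."""
    return [l + (bonus if i in keyword_ids else 0.0)
            for i, l in enumerate(logits)]
```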
Audio captioning using pre-trained model and data augmentation
Tianyang Huang1, Chaofan Pan1, Wenyao Chen1, Chenyang Zhu3, Shengchen Li2, Xi Shao1
1College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 2School of Advanced Technology, Xi’an Jiaotong-liverpool University, Suzhou, China, 3School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
pan_t6a_1 pan_t6a_2 pan_t6a_3 pan_t6a_4
Abstract
This technical report describes an automatic audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge. Based on an encoder-decoder architecture, the system is composed of a convolutional neural network (CNN) encoder and a Transformer decoder. Instead of using pre-trained models only in the audio modality, we also introduce pre-trained models in the text modality. In addition, we use a data augmentation method, with and without noise, to improve data quality and thus the generalization and robustness of the model. Experimental results show that our system achieves a SPIDEr score of 0.257 (official baseline: 0.233) on the Clotho evaluation set.
System characteristics
Data augmentation | Mixture |
CP-JKU's submission to task 6a of the DCASE2022 challenge: a BART encoder-decoder for automatic audio captioning trained via the reinforce algorithm and transfer learning
Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University, Austria
primus_t6a_1 primus_t6a_2 primus_t6a_3 primus_t6a_4
Abstract
This technical report details the CP-JKU submission to the automatic audio captioning task of the DCASE 2022 challenge (Task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge's baseline system in three directions: first, the VGGish embedding model was replaced with a custom CNN10-like model that we pre-trained on AudioSet; second, the BART encoder-decoder model was pre-trained on AudioCaps, which led to faster convergence; finally, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric with the REINFORCE algorithm. Our best model achieves a SPIDEr score of 0.29 (single-model performance), an improvement of 6.6 pp. over the challenge's baseline score.
System characteristics
Data augmentation | SpecAugment |
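The score-function (REINFORCE) fine-tuning used here optimizes a non-differentiable reward such as CIDEr. In the common self-critical variant, the reward of the greedily decoded caption serves as the baseline, so the surrogate loss for one sampled caption is only a few lines. This sketch shows the loss computation alone (the sampling, decoding, and reward model are outside its scope; names are ours):

```python
def self_critical_loss(sample_logp, sample_reward, greedy_reward):
    """Self-critical REINFORCE surrogate loss for one sampled caption:
    minimising -(r_sample - r_greedy) * log p(caption) raises the
    probability of captions that beat the greedy decode's reward."""
    advantage = sample_reward - greedy_reward
    return -advantage * sample_logp
```

A sampled caption with reward 0.3 against a greedy baseline of 0.2 yields a positive advantage, so gradient descent on this loss increases the caption's log-probability.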
The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training
Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
xu_t6a_1 xu_t6a_2 xu_t6a_3 xu_t6a_4
Abstract
This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge Task 6, which involves two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture using pre-trained audio and text encoders; it is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and take its audio encoder as the feature extractor. A standard sequence-to-sequence model is then trained on Clotho on top of the pre-trained feature extractor. The captioning model is first trained with a word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.
System characteristics
Data augmentation | None |
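At inference time, a bi-encoder retrieval system like the one described above ranks audio clips by the similarity of their embeddings to the text-query embedding. A minimal cosine-similarity ranking sketch (illustrative only; the actual encoders are learned networks):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_audio(text_emb, audio_embs):
    """Return audio indices sorted by similarity to the text query,
    as a bi-encoder retrieval system would rank them at test time."""
    scores = [(cosine(text_emb, e), i) for i, e in enumerate(audio_embs)]
    return [i for _, i in sorted(scores, reverse=True)]
```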
Automated audio captioning with multi-task learning
Zhongjie Ye1, Yuexian Zou1, Fan Cui2, Yujun Wang2
1ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2Xiaomi Corporation, Beijing, China
zou_t6a_1 zou_t6a_2 zou_t6a_3 zou_t6a_4
Abstract
This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 6A Challenge. Our model consists of a convolutional neural network (CNN) encoder and a single-layer LSTM decoder with a temporal attention module. To enhance the representation on the domain dataset, we use a ResNet38 pre-trained on AudioSet as our audio encoder and fine-tune it with keyword labels, nouns and verbs extracted from the captions. The whole captioning model is first trained with the standard cross-entropy loss and then fine-tuned with reinforcement learning to directly optimize the CIDEr score. Experimental results show that our single model achieves a SPIDEr score of 31.7 on the evaluation split.
System characteristics
Data augmentation | SpecAugment, SpecAugment++ |
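The SpecAugment augmentation used by this and several other submissions masks random bands of mel bins and random spans of time frames in the input spectrogram. A minimal sketch with one frequency mask and one time mask (a generic illustration, not the submitted code):

```python
import random

def spec_augment(spec, max_f=2, max_t=2, rng=None):
    """Zero out one random band of mel bins (frequency mask) and one
    random span of frames (time mask) in a (frames x bins) spectrogram."""
    rng = rng or random.Random()
    out = [row[:] for row in spec]            # leave the input untouched
    n_frames, n_bins = len(out), len(out[0])
    f = rng.randrange(max_f + 1)              # frequency-mask width
    f0 = rng.randrange(n_bins - f + 1)
    t = rng.randrange(max_t + 1)              # time-mask width
    t0 = rng.randrange(n_frames - t + 1)
    for row in out:
        for j in range(f0, f0 + f):
            row[j] = 0.0                      # frequency mask
    for i in range(t0, t0 + t):
        out[i] = [0.0] * n_bins               # time mask
    return out
```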