Task description
Automated audio captioning is the task of describing general audio content using free text. It is an inter-modal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for general audio recordings. To this aim, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
Participants used the freely available Clotho development and evaluation splits, which provide both audio and corresponding captions. In this edition of the task, the use of external data and pre-trained models was permitted, as reflected in the system characteristics below. The developed systems are evaluated on the captions they generate for the Clotho testing split, for which the corresponding captions are not provided. More information about Task 6: Automated Audio Captioning can be found at the task description page.
The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
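For quick reference, SPIDEr is simply the arithmetic mean of the CIDEr and SPICE scores. A minimal Python sketch of how the ranking metric is formed; the example values are taken from the top-ranked entry in the table below:

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = (CIDEr + SPICE) / 2."""
    return (cider + spice) / 2.0

# Example: CIDEr 0.485 and SPICE 0.135 (the top-ranked system's
# testing-split scores) give SPIDEr 0.310, matching the table.
print(spider(0.485, 0.135))  # 0.31
```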
Teams ranking
Listed here is the best system from each team. The ranking is based on the SPIDEr metric. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task, for both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside of this task, since the Clotho evaluation split is freely available.
In the metric columns, the prefix “Test” denotes the Clotho testing split and “Eval” the Clotho evaluation split.

Submission code | Best official system rank | Corresponding author | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Test CIDEr | Test SPICE | Test SPIDEr | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_2 | 1 | Weiqiang Yuan | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.485 | 0.135 | 0.310 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | 0.499 | 0.137 | 0.318 | |
Xu_t6_3 | 2 | Xuenan Xu | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.463 | 0.129 | 0.296 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | 0.472 | 0.124 | 0.298 | |
Xinhao_t6_1 | 3 | Xinhao Mei | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.457 | 0.131 | 0.294 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | 0.412 | 0.122 | 0.268 | |
Ye_t6_3 | 4 | Zhongjie Ye | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.434 | 0.126 | 0.280 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | 0.440 | 0.125 | 0.282 | |
Chen_t6_4 | 5 | Zhiwen Chen | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.402 | 0.121 | 0.262 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | 0.406 | 0.119 | 0.262 | |
Won_t6_4 | 6 | Hyejin Won | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.381 | 0.118 | 0.249 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | 0.441 | 0.128 | 0.285 | |
Narisetty_t6_4 | 7 | Chaitanya Narisetty | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.362 | 0.110 | 0.236 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_4 | 8 | Etienne Labbe | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.333 | 0.108 | 0.221 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | 0.351 | 0.110 | 0.231 | |
Liu_t6_1 | 9 | Yang Liu | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.274 | 0.094 | 0.184 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | 0.243 | 0.088 | 0.166 | |
Eren_t6_1 | 10 | Ayşegül Özkaya Eren | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.256 | 0.107 | 0.182 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | 0.328 | 0.155 | 0.242 | |
Gebhard_t6_1 | 11 | Alexander Gebhard | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.105 | 0.047 | 0.076 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | 0.098 | 0.043 | 0.071 | |
Xiao_t6_2 | 12 | Feiyang Xiao | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.058 | 0.033 | 0.046 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | 0.210 | 0.079 | 0.144 | |
Baseline_t6_1 | 13 | Konstantinos Drossos | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.020 | 0.004 | 0.012 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 | 0.075 | 0.028 | 0.051 |
Systems ranking
Here are listed all systems and their rankings according to the different metrics and groupings of metrics. First comes a table with all systems and all metrics, then a table with all systems but only the machine-translation metrics, and finally a table with all systems but only the captioning metrics.
Detailed information on each system is given in the next section.
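As a hedged illustration of the machine-translation metric group, the snippet below computes cumulative BLEU-1 through BLEU-4 for a candidate caption against two references using NLTK. The captions are invented examples; the official scores in the tables come from the task's caption-evaluation tooling, not from this code.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog barks loudly in the distance".split(),
    "a dog is barking far away".split(),
]
candidate = "a dog barks in the distance".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # cumulative BLEU-n weights
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU{n}: {score:.3f}")
```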
Systems ranking, all metrics
Submission code | Best official system rank | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Test CIDEr | Test SPICE | Test SPIDEr | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.586 | 0.387 | 0.261 | 0.170 | 0.181 | 0.384 | 0.457 | 0.136 | 0.296 | 0.595 | 0.402 | 0.278 | 0.189 | 0.184 | 0.392 | 0.495 | 0.136 | 0.315 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.485 | 0.135 | 0.310 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | 0.499 | 0.137 | 0.318 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.590 | 0.396 | 0.271 | 0.176 | 0.181 | 0.388 | 0.471 | 0.133 | 0.302 | 0.635 | 0.444 | 0.310 | 0.211 | 0.197 | 0.420 | 0.569 | 0.151 | 0.360 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.584 | 0.392 | 0.266 | 0.175 | 0.181 | 0.389 | 0.465 | 0.131 | 0.298 | 0.665 | 0.487 | 0.359 | 0.260 | 0.214 | 0.449 | 0.684 | 0.163 | 0.423 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.351 | 0.150 | 0.079 | 0.041 | 0.082 | 0.238 | 0.057 | 0.029 | 0.043 | 0.471 | 0.282 | 0.182 | 0.112 | 0.128 | 0.317 | 0.208 | 0.078 | 0.143 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.058 | 0.033 | 0.046 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | 0.210 | 0.079 | 0.144 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.549 | 0.356 | 0.235 | 0.149 | 0.169 | 0.360 | 0.389 | 0.117 | 0.253 | 0.555 | 0.357 | 0.236 | 0.152 | 0.168 | 0.366 | 0.409 | 0.120 | 0.265 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.535 | 0.345 | 0.227 | 0.142 | 0.161 | 0.359 | 0.349 | 0.113 | 0.231 | 0.553 | 0.364 | 0.247 | 0.161 | 0.167 | 0.371 | 0.408 | 0.118 | 0.263 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.537 | 0.351 | 0.234 | 0.151 | 0.167 | 0.362 | 0.373 | 0.117 | 0.245 | 0.561 | 0.369 | 0.249 | 0.167 | 0.169 | 0.373 | 0.406 | 0.118 | 0.262 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.402 | 0.121 | 0.262 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | 0.406 | 0.119 | 0.262 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.582 | 0.385 | 0.259 | 0.169 | 0.180 | 0.382 | 0.432 | 0.126 | 0.279 | 0.578 | 0.381 | 0.257 | 0.169 | 0.181 | 0.381 | 0.433 | 0.125 | 0.279 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.577 | 0.379 | 0.254 | 0.164 | 0.182 | 0.385 | 0.420 | 0.128 | 0.274 | 0.579 | 0.384 | 0.261 | 0.172 | 0.181 | 0.386 | 0.436 | 0.128 | 0.282 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.434 | 0.126 | 0.280 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | 0.440 | 0.125 | 0.282 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.586 | 0.389 | 0.261 | 0.170 | 0.181 | 0.387 | 0.429 | 0.125 | 0.277 | 0.590 | 0.395 | 0.272 | 0.183 | 0.182 | 0.394 | 0.453 | 0.129 | 0.291 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.274 | 0.094 | 0.184 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | 0.243 | 0.088 | 0.166 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.105 | 0.047 | 0.076 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | 0.098 | 0.043 | 0.071 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.256 | 0.107 | 0.182 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | 0.328 | 0.155 | 0.242 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.457 | 0.131 | 0.294 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | 0.412 | 0.122 | 0.268 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.653 | 0.423 | 0.282 | 0.176 | 0.180 | 0.408 | 0.439 | 0.136 | 0.287 | 0.635 | 0.406 | 0.268 | 0.166 | 0.176 | 0.400 | 0.412 | 0.121 | 0.266 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.644 | 0.420 | 0.278 | 0.170 | 0.181 | 0.406 | 0.447 | 0.136 | 0.291 | 0.621 | 0.407 | 0.273 | 0.177 | 0.179 | 0.395 | 0.431 | 0.122 | 0.277 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.627 | 0.407 | 0.269 | 0.166 | 0.182 | 0.399 | 0.436 | 0.129 | 0.283 | 0.625 | 0.412 | 0.278 | 0.178 | 0.176 | 0.401 | 0.428 | 0.126 | 0.277 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.531 | 0.346 | 0.235 | 0.157 | 0.160 | 0.361 | 0.362 | 0.108 | 0.235 | 0.546 | 0.356 | 0.243 | 0.165 | 0.163 | 0.369 | 0.381 | 0.110 | 0.246 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.158 | 0.359 | 0.358 | 0.109 | 0.234 | 0.558 | 0.373 | 0.261 | 0.181 | 0.167 | 0.376 | 0.410 | 0.114 | 0.262 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.159 | 0.362 | 0.360 | 0.110 | 0.235 | 0.562 | 0.377 | 0.261 | 0.182 | 0.169 | 0.377 | 0.416 | 0.116 | 0.266 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.362 | 0.110 | 0.236 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.435 | 0.222 | 0.128 | 0.073 | 0.121 | 0.305 | 0.146 | 0.072 | 0.109 | 0.435 | 0.229 | 0.129 | 0.069 | 0.252 | 0.195 | 0.136 | 0.067 | 0.101 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.454 | 0.270 | 0.176 | 0.109 | 0.122 | 0.310 | 0.178 | 0.078 | 0.128 | 0.452 | 0.262 | 0.168 | 0.102 | 0.249 | 0.193 | 0.172 | 0.071 | 0.122 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.525 | 0.321 | 0.200 | 0.117 | 0.157 | 0.354 | 0.296 | 0.115 | 0.205 | 0.523 | 0.316 | 0.191 | 0.109 | 0.309 | 0.231 | 0.287 | 0.104 | 0.195 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.333 | 0.108 | 0.221 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | 0.351 | 0.110 | 0.231 | |
Won_t6_1 | 21 | won2021_t6 | 0.535 | 0.344 | 0.231 | 0.151 | 0.162 | 0.359 | 0.375 | 0.111 | 0.243 | 0.540 | 0.345 | 0.230 | 0.152 | 0.161 | 0.361 | 0.383 | 0.109 | 0.246 | |
Won_t6_2 | 24 | won2021_t6 | 0.516 | 0.338 | 0.226 | 0.145 | 0.161 | 0.359 | 0.357 | 0.114 | 0.236 | 0.550 | 0.361 | 0.244 | 0.160 | 0.172 | 0.375 | 0.401 | 0.121 | 0.261 | |
Won_t6_3 | 22 | won2021_t6 | 0.518 | 0.346 | 0.235 | 0.151 | 0.163 | 0.366 | 0.366 | 0.117 | 0.242 | 0.554 | 0.370 | 0.254 | 0.168 | 0.170 | 0.379 | 0.400 | 0.119 | 0.259 | |
Won_t6_4 | 19 | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.381 | 0.118 | 0.249 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | 0.441 | 0.128 | 0.285 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.560 | 0.366 | 0.245 | 0.159 | 0.177 | 0.376 | 0.403 | 0.127 | 0.265 | 0.576 | 0.377 | 0.252 | 0.164 | 0.178 | 0.382 | 0.421 | 0.122 | 0.271 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.556 | 0.365 | 0.245 | 0.161 | 0.178 | 0.375 | 0.404 | 0.125 | 0.265 | 0.572 | 0.374 | 0.251 | 0.165 | 0.178 | 0.381 | 0.418 | 0.122 | 0.270 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.463 | 0.129 | 0.296 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | 0.472 | 0.124 | 0.298 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.651 | 0.421 | 0.271 | 0.170 | 0.182 | 0.403 | 0.461 | 0.128 | 0.295 | 0.660 | 0.427 | 0.276 | 0.177 | 0.181 | 0.411 | 0.471 | 0.123 | 0.297 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.020 | 0.004 | 0.012 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 | 0.075 | 0.028 | 0.051 |
Systems ranking, machine translation metrics
Submission code | Best official system rank | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.586 | 0.387 | 0.261 | 0.170 | 0.181 | 0.384 | 0.595 | 0.402 | 0.278 | 0.189 | 0.184 | 0.392 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.590 | 0.396 | 0.271 | 0.176 | 0.181 | 0.388 | 0.635 | 0.444 | 0.310 | 0.211 | 0.197 | 0.420 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.584 | 0.392 | 0.266 | 0.175 | 0.181 | 0.389 | 0.665 | 0.487 | 0.359 | 0.260 | 0.214 | 0.449 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.351 | 0.150 | 0.079 | 0.041 | 0.082 | 0.238 | 0.471 | 0.282 | 0.182 | 0.112 | 0.128 | 0.317 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.549 | 0.356 | 0.235 | 0.149 | 0.169 | 0.360 | 0.555 | 0.357 | 0.236 | 0.152 | 0.168 | 0.366 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.535 | 0.345 | 0.227 | 0.142 | 0.161 | 0.359 | 0.553 | 0.364 | 0.247 | 0.161 | 0.167 | 0.371 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.537 | 0.351 | 0.234 | 0.151 | 0.167 | 0.362 | 0.561 | 0.369 | 0.249 | 0.167 | 0.169 | 0.373 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.582 | 0.385 | 0.259 | 0.169 | 0.180 | 0.382 | 0.578 | 0.381 | 0.257 | 0.169 | 0.181 | 0.381 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.577 | 0.379 | 0.254 | 0.164 | 0.182 | 0.385 | 0.579 | 0.384 | 0.261 | 0.172 | 0.181 | 0.386 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.586 | 0.389 | 0.261 | 0.170 | 0.181 | 0.387 | 0.590 | 0.395 | 0.272 | 0.183 | 0.182 | 0.394 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.653 | 0.423 | 0.282 | 0.176 | 0.180 | 0.408 | 0.635 | 0.406 | 0.268 | 0.166 | 0.176 | 0.400 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.644 | 0.420 | 0.278 | 0.170 | 0.181 | 0.406 | 0.621 | 0.407 | 0.273 | 0.177 | 0.179 | 0.395 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.627 | 0.407 | 0.269 | 0.166 | 0.182 | 0.399 | 0.625 | 0.412 | 0.278 | 0.178 | 0.176 | 0.401 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.531 | 0.346 | 0.235 | 0.157 | 0.160 | 0.361 | 0.546 | 0.356 | 0.243 | 0.165 | 0.163 | 0.369 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.158 | 0.359 | 0.558 | 0.373 | 0.261 | 0.181 | 0.167 | 0.376 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.159 | 0.362 | 0.562 | 0.377 | 0.261 | 0.182 | 0.169 | 0.377 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.435 | 0.222 | 0.128 | 0.073 | 0.121 | 0.305 | 0.435 | 0.229 | 0.129 | 0.069 | 0.252 | 0.195 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.454 | 0.270 | 0.176 | 0.109 | 0.122 | 0.310 | 0.452 | 0.262 | 0.168 | 0.102 | 0.249 | 0.193 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.525 | 0.321 | 0.200 | 0.117 | 0.157 | 0.354 | 0.523 | 0.316 | 0.191 | 0.109 | 0.309 | 0.231 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | |
Won_t6_1 | 21 | won2021_t6 | 0.535 | 0.344 | 0.231 | 0.151 | 0.162 | 0.359 | 0.540 | 0.345 | 0.230 | 0.152 | 0.161 | 0.361 | |
Won_t6_2 | 24 | won2021_t6 | 0.516 | 0.338 | 0.226 | 0.145 | 0.161 | 0.359 | 0.550 | 0.361 | 0.244 | 0.160 | 0.172 | 0.375 | |
Won_t6_3 | 22 | won2021_t6 | 0.518 | 0.346 | 0.235 | 0.151 | 0.163 | 0.366 | 0.554 | 0.370 | 0.254 | 0.168 | 0.170 | 0.379 | |
Won_t6_4 | 19 | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.560 | 0.366 | 0.245 | 0.159 | 0.177 | 0.376 | 0.576 | 0.377 | 0.252 | 0.164 | 0.178 | 0.382 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.556 | 0.365 | 0.245 | 0.161 | 0.178 | 0.375 | 0.572 | 0.374 | 0.251 | 0.165 | 0.178 | 0.381 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.651 | 0.421 | 0.271 | 0.170 | 0.182 | 0.403 | 0.660 | 0.427 | 0.276 | 0.177 | 0.181 | 0.411 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 |
Systems ranking, captioning metrics
Submission code | Best official system rank | Technical Report | Test CIDEr | Test SPICE | Test SPIDEr | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.457 | 0.136 | 0.296 | 0.495 | 0.136 | 0.315 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.485 | 0.135 | 0.310 | 0.499 | 0.137 | 0.318 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.471 | 0.133 | 0.302 | 0.569 | 0.151 | 0.360 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.465 | 0.131 | 0.298 | 0.684 | 0.163 | 0.423 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.057 | 0.029 | 0.043 | 0.208 | 0.078 | 0.143 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.058 | 0.033 | 0.046 | 0.210 | 0.079 | 0.144 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.389 | 0.117 | 0.253 | 0.409 | 0.120 | 0.265 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.349 | 0.113 | 0.231 | 0.408 | 0.118 | 0.263 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.373 | 0.117 | 0.245 | 0.406 | 0.118 | 0.262 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.402 | 0.121 | 0.262 | 0.406 | 0.119 | 0.262 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.432 | 0.126 | 0.279 | 0.433 | 0.125 | 0.279 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.420 | 0.128 | 0.274 | 0.436 | 0.128 | 0.282 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.434 | 0.126 | 0.280 | 0.440 | 0.125 | 0.282 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.429 | 0.125 | 0.277 | 0.453 | 0.129 | 0.291 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.274 | 0.094 | 0.184 | 0.243 | 0.088 | 0.166 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.105 | 0.047 | 0.076 | 0.098 | 0.043 | 0.071 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.256 | 0.107 | 0.182 | 0.328 | 0.155 | 0.242 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.457 | 0.131 | 0.294 | 0.412 | 0.122 | 0.268 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.439 | 0.136 | 0.287 | 0.412 | 0.121 | 0.266 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.447 | 0.136 | 0.291 | 0.431 | 0.122 | 0.277 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.436 | 0.129 | 0.283 | 0.428 | 0.126 | 0.277 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.362 | 0.108 | 0.235 | 0.381 | 0.110 | 0.246 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.358 | 0.109 | 0.234 | 0.410 | 0.114 | 0.262 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.360 | 0.110 | 0.235 | 0.416 | 0.116 | 0.266 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.362 | 0.110 | 0.236 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.146 | 0.072 | 0.109 | 0.136 | 0.067 | 0.101 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.178 | 0.078 | 0.128 | 0.172 | 0.071 | 0.122 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.296 | 0.115 | 0.205 | 0.287 | 0.104 | 0.195 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.333 | 0.108 | 0.221 | 0.351 | 0.110 | 0.231 | |
Won_t6_1 | 21 | won2021_t6 | 0.375 | 0.111 | 0.243 | 0.383 | 0.109 | 0.246 | |
Won_t6_2 | 24 | won2021_t6 | 0.357 | 0.114 | 0.236 | 0.401 | 0.121 | 0.261 | |
Won_t6_3 | 22 | won2021_t6 | 0.366 | 0.117 | 0.242 | 0.400 | 0.119 | 0.259 | |
Won_t6_4 | 19 | won2021_t6 | 0.381 | 0.118 | 0.249 | 0.441 | 0.128 | 0.285 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.403 | 0.127 | 0.265 | 0.421 | 0.122 | 0.271 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.404 | 0.125 | 0.265 | 0.418 | 0.122 | 0.270 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.463 | 0.129 | 0.296 | 0.472 | 0.124 | 0.298 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.461 | 0.128 | 0.295 | 0.471 | 0.123 | 0.297 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.020 | 0.004 | 0.012 | 0.075 | 0.028 | 0.051 |
System characteristics
In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections: the first table gives an overview of the systems, and the second a detailed presentation of each system.
Overview of characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Number of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---|
4 | Yuan_t6_1 | 0.296 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
1 | Yuan_t6_2 | 0.310 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
2 | Yuan_t6_3 | 0.302 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
3 | Yuan_t6_4 | 0.298 | yuan2021_t6 | encoder-decoder | 2572592190 | PANNs | Transformer | noise enhance |
37 | Xiao_t6_1 | 0.043 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding | 2448349 | MLP-mixer encoder | Transformer decoder | |
36 | Xiao_t6_2 | 0.046 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder | 2448349 | MLP-mixer encoder | Transformer decoder | |
18 | Chen_t6_1 | 0.253 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
28 | Chen_t6_2 | 0.231 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
20 | Chen_t6_3 | 0.245 | chen2021_t6 | encoder-decoder | 22775528 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
17 | Chen_t6_4 | 0.262 | chen2021_t6 | encoder-decoder | 86440064 | ResNet38, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
12 | Ye_t6_1 | 0.279 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
14 | Ye_t6_2 | 0.274 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
11 | Ye_t6_3 | 0.280 | ye2021_t6 | encoder-decoder | 779793399 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
13 | Ye_t6_4 | 0.277 | ye2021_t6 | encoder-decoder | 259931133 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
31 | Liu_t6_1 | 0.184 | liu2021_t6 | encoder-decoder | 3045913 | CNN | self-attention, Word2Vec | |
35 | Gebhard_t6_1 | 0.076 | gebhard2021_t6 | encoder-decoder | 13409747 | CNN | RNN | |
32 | Eren_t6_1 | 0.182 | eren2021_t6 | encoder-decoder | 2511570 | PANNs | RNN | |
7 | Xinhao_t6_1 | 0.294 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | Transformer | SpecAugment |
9 | Xinhao_t6_2 | 0.287 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | Transformer | SpecAugment |
8 | Xinhao_t6_3 | 0.291 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | Transformer | SpecAugment |
10 | Xinhao_t6_4 | 0.283 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | Transformer | SpecAugment |
26 | Narisetty_t6_1 | 0.235 | narisetty2021_t6 | Conformer | 143676488 | Conformer | Transformer | SpecAugment |
27 | Narisetty_t6_2 | 0.234 | narisetty2021_t6 | Conformer | 143676488 | Conformer | Transformer | SpecAugment |
25 | Narisetty_t6_3 | 0.235 | narisetty2021_t6 | Conformer | 167205128 | Conformer | Transformer | SpecAugment |
23 | Narisetty_t6_4 | 0.236 | narisetty2021_t6 | Conformer | 167205128 | Conformer | Transformer | SpecAugment |
34 | Labbe_t6_1 | 0.109 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | RNN-LSTM, attention | |
33 | Labbe_t6_2 | 0.128 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | RNN-LSTM, attention | |
30 | Labbe_t6_3 | 0.205 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | RNN-LSTM, attention | |
29 | Labbe_t6_4 | 0.221 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | RNN-LSTM, attention | |
21 | Won_t6_1 | 0.243 | won2021_t6 | encoder-decoder | 8445139 | ResNet | Transformer | SpecAugment |
24 | Won_t6_2 | 0.236 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
22 | Won_t6_3 | 0.242 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
19 | Won_t6_4 | 0.249 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
15 | Xu_t6_1 | 0.265 | xu2021_t6 | seq2seq | 36181131 | CNN | RNN | SpecAugment |
16 | Xu_t6_2 | 0.265 | xu2021_t6 | seq2seq | 60301885 | CNN | RNN | SpecAugment |
5 | Xu_t6_3 | 0.296 | xu2021_t6 | seq2seq | 48241508 | CNN | RNN | SpecAugment |
6 | Xu_t6_4 | 0.295 | xu2021_t6 | seq2seq | 60301885 | CNN | RNN | SpecAugment |
38 | Baseline_t6_1 | 0.012 | Baseline2021_t6 | encoder-decoder | 5012931 | RNN | RNN |
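Almost every row above follows the same encoder-decoder scheme: an audio encoder turns a spectrogram into a sequence of embeddings, and an autoregressive text decoder emits the caption word by word. The sketch below is a deliberately tiny, generic PyTorch illustration of that scheme (CNN encoder, Transformer decoder); all layer sizes and names are assumptions and do not correspond to any particular submission.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        # Encoder: a small CNN that keeps the time axis as a sequence.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool the mel axis only
            nn.Conv2d(32, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),           # -> (B, d_model, 1, T)
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (B, 1, n_mels, T) log-mel clip; tokens: (B, L) caption so far
        memory = self.encoder(mels).squeeze(2).transpose(1, 2)  # (B, T, d_model)
        L = tokens.size(1)                    # causal mask for autoregression
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(hidden)               # (B, L, vocab) next-word logits

model = TinyCaptioner()
logits = model(torch.randn(2, 1, 64, 100), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```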
Detailed characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Number of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Loss function | Optimizer | Learning rate | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Yuan_t6_1 | 0.296 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
1 | Yuan_t6_2 | 0.310 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
2 | Yuan_t6_3 | 0.302 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
3 | Yuan_t6_4 | 0.298 | yuan2021_t6 | encoder-decoder | 2572592190 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
37 | Xiao_t6_1 | 0.043 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding | 2448349 | MLP-mixer encoder | log-mel energies | Transformer decoder | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Validation loss | Clotho | Clotho | |
36 | Xiao_t6_2 | 0.046 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder | 2448349 | MLP-mixer encoder | log-mel energies | Transformer decoder | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Validation loss | Clotho | Clotho | |
18 | Chen_t6_1 | 0.253 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | Word2Vec | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
28 | Chen_t6_2 | 0.231 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | FastText | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
20 | Chen_t6_3 | 0.245 | chen2021_t6 | encoder-decoder | 22775528 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | learned embeddings | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
17 | Chen_t6_4 | 0.262 | chen2021_t6 | encoder-decoder | 86440064 | ResNet38, MemoryEncoder | log-mel energies | MeshedDecoder | Word2Vec | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
12 | Ye_t6_1 | 0.279 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
14 | Ye_t6_2 | 0.274 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
11 | Ye_t6_3 | 0.280 | ye2021_t6 | encoder-decoder | 779793399 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
13 | Ye_t6_4 | 0.277 | ye2021_t6 | encoder-decoder | 259931133 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
31 | Liu_t6_1 | 0.184 | liu2021_t6 | encoder-decoder | 3045913 | CNN | log-mel energies | self-attention, Word2Vec | Word2Vec | | 44.1kHz | supervised | crossentropy, sentence-loss | adam | 5e-4 | Training loss | Clotho | Clotho | |
35 | Gebhard_t6_1 | 0.076 | gebhard2021_t6 | encoder-decoder | 13409747 | CNN | log-mel energies | RNN | numeric-representation | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Training loss | Clotho | Clotho | |
32 | Eren_t6_1 | 0.182 | eren2021_t6 | encoder-decoder | 2511570 | PANNs | log-mel energies, PANNs | RNN | Word2Vec | | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation loss | Clotho | Clotho | |
7 | Xinhao_t6_1 | 0.294 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | PANNs | Transformer | random | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho | Clotho | ||||
9 | Xinhao_t6_2 | 0.287 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | PANNs | Transformer | random | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho | Clotho | ||||
8 | Xinhao_t6_3 | 0.291 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho, AudioCaps | Clotho, AudioCaps | ||||
10 | Xinhao_t6_4 | 0.283 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho, AudioCaps | Clotho, AudioCaps | ||||
26 | Narisetty_t6_1 | 0.235 | narisetty2021_t6 | Conformer | 143676488 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
27 | Narisetty_t6_2 | 0.234 | narisetty2021_t6 | Conformer | 143676488 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
25 | Narisetty_t6_3 | 0.235 | narisetty2021_t6 | Conformer | 167205128 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | Narisetty_t6_4 | 0.236 | narisetty2021_t6 | Conformer | 167205128 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
34 | Labbe_t6_1 | 0.109 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
33 | Labbe_t6_2 | 0.128 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
30 | Labbe_t6_3 | 0.205 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
29 | Labbe_t6_4 | 0.221 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
21 | Won_t6_1 | 0.243 | won2021_t6 | encoder-decoder | 8445139 | ResNet | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
24 | Won_t6_2 | 0.236 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
22 | Won_t6_3 | 0.242 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
19 | Won_t6_4 | 0.249 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
15 | Xu_t6_1 | 0.265 | xu2021_t6 | seq2seq | 36181131 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
16 | Xu_t6_2 | 0.265 | xu2021_t6 | seq2seq | 60301885 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
5 | Xu_t6_3 | 0.296 | xu2021_t6 | seq2seq | 48241508 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
6 | Xu_t6_4 | 0.295 | xu2021_t6 | seq2seq | 60301885 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
38 | Baseline_t6_1 | 0.012 | Baseline2021_t6 | encoder-decoder | 5012931 | RNN | log-mel energies | RNN | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation loss | Clotho | Clotho | |
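Nearly all systems above list “log-mel energies” as acoustic features, most at a 44.1 kHz sampling rate. Below is a hedged sketch of such a feature pipeline with librosa; the FFT, hop, and mel-band settings are assumptions, as each team chose its own, and the random signal stands in for a Clotho clip.

```python
import numpy as np
import librosa

sr = 44100
y = np.random.randn(sr * 2).astype(np.float32)  # stand-in for a Clotho clip

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log-mel energies in dB
print(log_mel.shape)  # (64, n_frames)
```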
Technical reports
Audio Captioning With Meshed-Memory Transformer
Zhiwen Chen, Dawei Zhang, Jun Wang, and Feng Deng
University of Chinese Academy of Sciences, Beijing, China
Chen_t6_1 Chen_t6_2 Chen_t6_3 Chen_t6_4
Abstract
Automated audio captioning is the task of describing the audio content of a given audio signal in natural language. In recent years, Transformer-based language models have become widely used in audio captioning. However, most Transformer-based architectures cannot learn prior knowledge across samples well, leading to weaker text decoding. For better acoustic event and language modeling, a sequence-to-sequence model is proposed which consists of a CNN-based encoder, a memory-augmented refiner, and a meshed decoder. The proposed architecture refines a multi-level representation of the relationships between audio features, integrating learned a priori knowledge. At the decoding stage, it exploits low- and high-level features with mesh-like connectivity. Experiments show that the proposed model achieves a SPIDEr score of 0.2645 on the Clotho V2 dataset.
System characteristics
Data augmentation | SpecAugment, Label Smoothing |
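The label smoothing listed above can be reproduced directly with PyTorch's cross-entropy loss, as in the minimal sketch below; the smoothing value 0.1 is an assumption, not taken from the report.

```python
import torch
import torch.nn as nn

# Label smoothing spreads a little probability mass over all words,
# discouraging over-confident caption predictions.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(8, 1000)           # (batch, vocab) decoder outputs
targets = torch.randint(0, 1000, (8,))  # ground-truth word indices
print(criterion(logits, targets).item())
```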
Audio Captioning Using Sound Event Detection
Ayşegül Özkaya Eren and Mustafa Sert
Department of Computer Engineering, Baskent University, Ankara, Turkey
Eren_t6_1
Abstract
This technical report proposes an audio captioning system for the DCASE 2021 Task 6 audio captioning challenge. Our proposed model is based on an encoder-decoder architecture with bi-directional Gated Recurrent Units (BiGRU), using pretrained audio features and sound event detection. A pretrained audio neural network (PANN) is used to extract audio features, and Word2Vec is selected to extract word embeddings from the audio captions. To create semantically meaningful captions, we extract sound events from the audio clips and feed the encoder-decoder architecture with these sound events in addition to the PANNs features. Our experiments on the Clotho dataset show that our proposed method achieves significantly better results than the challenge baseline model across all evaluation metrics.
System characteristics
Data augmentation | None |
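A small sketch of how Word2Vec embeddings can be trained on caption text with gensim, in the spirit of the report; the corpus and parameters here are illustrative stand-ins for the actual Clotho captions and settings.

```python
from gensim.models import Word2Vec

# Toy caption corpus; the report trains on the Clotho captions.
captions = [
    "a dog barks loudly in the distance".split(),
    "rain falls on a metal roof".split(),
    "a dog is barking far away".split(),
]
model = Word2Vec(sentences=captions, vector_size=128, window=5,
                 min_count=1, workers=1)
print(model.wv["dog"].shape)  # (128,) embedding for the word "dog"
```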
An Automated Audio Captioning Approach Utilising a ResNet-based Encoder
Alexander Gebhard1, Andreas Triantafyllopoulos1,2, Alice Baird1, and Björn Schuller1,2,3
1 EIHW, Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany, 2 audEERING GmbH, Gilching, Germany, 3 GLAM, Group on Language, Audio, and Music, Imperial College, London, UK
Gebhard_t6_1
Abstract
In this report, we present our submission to Task 6 of the DCASE 2021 Challenge. The main module is based on the baseline architecture for the automated audio captioning (AAC) task provided by the challenge organisers. We replace the encoder of the baseline architecture with a Residual Neural Network (ResNet)-18 encoder adapted to the AAC task. Results from our proposed architecture show an average increase of 35.7% over the baseline system, reaching a BLEU1 score of 0.449 on the development set and demonstrating the effectiveness of the proposed encoder for this task.
System characteristics
Data augmentation | None |
IRIT-UPS DCASE 2021 Audio Captioning System
Etienne Labbé and Thomas Pellegrini
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
Labbe_t6_1 Labbe_t6_2 Labbe_t6_3 Labbe_t6_4
Abstract
This document describes our sequence-to-sequence models used for audio captioning in Task 6 of the DCASE 2021 challenge. Four submissions were made with two different models, a “Listen-Attend-Tell” and a “CNN-Tell”, and two different inference algorithms, greedy and beam search.
System characteristics
Data augmentation | None |
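Of the two inference algorithms mentioned, greedy search is the simpler: at each step only the single most probable next word is kept, whereas beam search keeps the k most probable partial captions. A hedged sketch of greedy decoding follows; it assumes a captioning model that maps (audio features, tokens so far) to per-step logits, and the BOS/EOS token ids are illustrative.

```python
import torch

def greedy_decode(model, mels, bos_id=1, eos_id=2, max_len=20):
    """Greedy search: keep the single most probable word at every step."""
    tokens = torch.full((mels.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(mels, tokens)                    # (B, L, vocab)
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_word], dim=1)
        if (next_word == eos_id).all():                 # every caption ended
            break
    return tokens
```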
The DCASE2021 Challenge Task 6 System : Automated Audio Caption
Liu Yang and Bi Sijun
Beijing Institute of Technology
Liu_t6_1
Abstract
This technical report describes our system participating in the DCASE 2021 Challenge, Task 6: automated audio captioning. We employ several learnable stacked CNNs to extract audio features in the encoder; for caption generation we employ the decoder of the widely used Transformer structure. To optimize the system, we use a sentence-level cosine loss function together with the cross-entropy loss. The experimental results show that our system achieves a SPIDEr of 0.166 on the evaluation split of the Clotho dataset.
System characteristics
Data augmentation | None |
Leveraging State-of-the-art ASR Techniques to Audio Captioning
Chaitanya Narisetty1, Tomoki Hayashi2, Ryunosuke Ishizaki2, Shinji Watanabe1, and Kazuya Takeda2
1 Carnegie Mellon University, Pittsburgh, USA, 2 Nagoya University, Nagoya, Japan
Narisetty_t6_1 Narisetty_t6_2 Narisetty_t6_3 Narisetty_t6_4
Abstract
This report presents a summary of our submission to the 2021 DCASE challenge Task 6: Automated Audio Captioning. Our approach to this task is derived from state-of-the-art ASR techniques available in the ESPnet toolkit. Specifically, we train a convolution-augmented Transformer (Conformer) model to generate captions from input acoustic features in an end-to-end manner. In addition to the prescribed challenge dataset, Clotho-v2, we also incorporate the external AudioCaps dataset. To overcome the limited availability of training data, we further incorporate AudioSet tags and audio embeddings obtained from pretrained audio neural networks (PANNs) as auxiliary inputs to our model. An ensemble of models trained over various architectures and input embeddings is selected as our final submission system. Experimental results indicate that our models achieve SPIDEr scores of 0.224 and 0.246 on the development-validation and development-evaluation sets, respectively.
System characteristics
Data augmentation | SpecAugment |
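Below is a minimal numpy sketch of the SpecAugment-style masking listed above: zero out a random band of mel bins and a random span of time frames. The mask widths are assumptions; each system tunes its own.

```python
import numpy as np

def spec_augment(log_mel, max_freq_width=8, max_time_width=20, rng=None):
    """Apply one frequency mask and one time mask to a (mels, frames) array."""
    rng = rng or np.random.default_rng()
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    f = rng.integers(0, max_freq_width + 1)     # frequency mask width
    f0 = rng.integers(0, max(1, n_mels - f))
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)     # time mask width
    t0 = rng.integers(0, max(1, n_frames - t))
    out[:, t0:t0 + t] = 0.0
    return out

augmented = spec_augment(np.random.randn(64, 200))
```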
CAU Submission to DCASE 2021 Task6: Transformer Followed by Transfer Learning for Audio Captioning
Hyejin Won, Baekseung Kim, Il-Youp Kwak, and Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
Won_t6_1 Won_t6_2 Won_t6_3 Won_t6_4
Abstract
This report proposes an automated audio captioning model for the 2021 DCASE audio captioning challenge. In this challenge, a model is required to generate natural language descriptions of a given audio signal. We use pre-trained models trained on AudioSet, a large-scale dataset of manually annotated audio events. The large amount of audio event data helps capture important audio feature representations. To make use of the features learned from AudioSet, we explored several transfer learning approaches. Our proposed sequence-to-sequence model consists of a CNN14 or ResNet54 encoder and a Transformer decoder. Experiments show that the proposed model achieves SPIDEr scores of 0.246 and 0.285 on audio captioning.
System characteristics
Data augmentation | SpecAugment |
Automated Audio Captioning With MLP-mixer and Pre-trained Encoder
Feiyang Xiao1, Jian Guan1, and Qiuqiang Kong2
1 Group of Intelligent Signal Processing, College of Computer Science and Technology Harbin Engineering University, Harbin, China, 2 ByteDance, Shanghai, China
Xiao_t6_1 Xiao_t6_2
Abstract
This technical report describes the submission from the Group of Intelligent Signal Processing (GISP) for Task 6 of the DCASE 2021 challenge (automated audio captioning). Our audio captioning system is based on a sequence-to-sequence autoencoder model. Previous recurrent neural network (RNN) and Transformer-based methods perceive only time-dimension information and ignore frequency information. To utilize both time- and frequency-dimension information, a multi-layer perceptron mixer (MLP-Mixer) is used as the encoder. For caption prediction, a Transformer decoder structure is used as the decoder. No extra data is employed. In addition, to highlight content information, we use an encoder pre-trained with multi-label content information. The experimental results show that our system achieves a SPIDEr of 0.144 (official baseline: 0.051) on the evaluation split of the Clotho dataset. In addition, compared with Transformer methods, our system requires less training time.
System characteristics
Data augmentation | None |
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning for DCASE Challenge 2021 Task 6
Xinhao Mei1, Qiushi Huang1, Xubo Liu1, Gengyun Chen2, Jingqian Wu3∗, Yusong Wu3†, Jinzheng Zhao1, Shengchen Li3, Tom Ko4, H Lilian Tang1, Xi Shao2, Mark D. Plumbley1, Wenwu Wang1
1 University of Surrey, Guildford, United Kingdom, 2 Nanjing University of Posts and Telecommunications, Nanjing, China, 3 Xi’an Jiaotong-Liverpool University, Suzhou, China, 4 Southern University of Science and Technology, Shenzhen, China, ∗ Jingqian Wu is currently with Wake Forest University, USA, †Yusong Wu is currently with University of Montreal, Canada
Xinhao_t6_1 Xinhao_t6_2 Xinhao_t6_3 Xinhao_t6_4
Abstract
Audio captioning aims to use natural language to describe the content of audio data. This technical report presents an automated audio captioning system submitted to Task 6 of the DCASE 2021 challenge. The proposed system is based on an encoder-decoder architecture, consisting of a convolutional neural network (CNN) encoder and a Transformer decoder. We further improve the system with two techniques, namely, pre-training the model via transfer learning techniques, either on upstream audio-related tasks or large in-domain datasets, and incorporating evaluation metrics into the optimization of the model with reinforcement learning techniques, which help address the problem caused by the mismatch between the evaluation metrics and the loss function. The results show that both techniques can further improve the performance of the captioning system. The overall system achieves a SPIDEr score of 0.277 on the Clotho evaluation set, which outperforms the top-ranked system from the DCASE 2020 challenge.
System characteristics
Data augmentation | SpecAugment |
The SJTU System for DCASE2021 Challenge Task 6: Audio Captioning Based on Encoder Pre-training and Reinforcement Learning
Xuenan Xu, Zeyu Xie, Mengyue Wu, and Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
Xu_t6_1 Xu_t6_2 Xu_t6_3 Xu_t6_4
Abstract
This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a temporal-attention, single-layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy based training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensemble.
System characteristics
Data augmentation | SpecAugment |
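The reinforcement-learning fine-tuning described in the abstract is commonly realised as self-critical sequence training; whether this exact variant was used is an assumption here. The idea: the reward of a sampled caption is its metric score minus the score of a greedy-decoded baseline, and the policy-gradient loss weights the caption's log-probability by that advantage. The sketch below uses placeholder rewards and is not the authors' implementation.

```python
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    # log_probs: (B,) summed log-probabilities of the sampled captions.
    # The greedy baseline reduces variance; detach so gradients flow
    # only through the log-probabilities.
    advantage = sampled_reward - greedy_reward
    return -(advantage.detach() * log_probs).mean()

log_probs = torch.randn(4, requires_grad=True)       # placeholder values
loss = scst_loss(log_probs,
                 torch.tensor([0.30, 0.25, 0.35, 0.28]),  # sampled SPIDEr
                 torch.tensor([0.27, 0.27, 0.27, 0.27]))  # greedy SPIDEr
loss.backward()
```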
Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Textual Information
Zhongjie Ye1, Helin Wang1, Dongchao Yang1, and Yuexian Zou1,2
1 ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2 Peng Cheng Laboratory, Shenzhen, China
Ye_t6_1 Ye_t6_2 Ye_t6_3 Ye_t6_4
Abstract
This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Task 6 challenge. In order to utilize more acoustic and textual information, we propose a novel sequence-to-sequence model named KPE-MAD, with a keyword pre-trained encoder and a multi-modal attention decoder. For the encoder, we use a classification model pre-trained on the AudioSet dataset and fine-tune it with keywords of nouns and verbs as labels. In addition, a multi-modal attention module is proposed to integrate the acoustic and textual information in the decoder. Our single model achieves a SPIDEr score of 0.279 on the evaluation split, and our best ensemble model, optimizing CIDEr-D via reinforcement learning, achieves a SPIDEr score of 0.291. Our code and models will be released after the competition.
System characteristics
Data augmentation | Mixup, SpecAugment, SpecAugment++ |
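A minimal sketch of the mixup augmentation listed above: two training spectrograms are blended with a Beta-distributed weight (in practice, their training targets are blended with the same weight). The alpha value 0.2 is a common default and an assumption here, not the report's setting.

```python
import numpy as np

def mixup(x1, x2, alpha=0.2, rng=None):
    """Blend two examples; return the mixed input and its mixing weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

mixed, lam = mixup(np.random.randn(64, 200), np.random.randn(64, 200))
```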
The DCASE 2021 Challenge Task 6 System: Automated Audio Captioning With Weakly Supervised Pre-training and Word Selection Methods
Weiqiang Yuan, Qichen Han, Dong Liu, Xiang Li, and Zhen Yang
NetEase (Hangzhou) Network Co., Ltd., China
Yuan_t6_1 Yuan_t6_2 Yuan_t6_3 Yuan_t6_4
Abstract
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modelling framework for audio understanding and caption generation. Our solution focuses on solving two problems in automated audio captioning: data insufficiency and word selection indeterminacy. As the amount of training data is limited, we collect a large-scale weakly labelled dataset from the Web with heuristic methods. We then pre-train the encoder-decoder models with this dataset, followed by fine-tuning on the Clotho dataset. To solve the word selection indeterminacy problem, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained audio tagging models, to guide word generation at the decoding stage. We tested our submissions using the development-testing dataset. Our best submission achieved a 31.8 SPIDEr score.
System characteristics
Data augmentation | Noise enhance |