Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of the task, the current focus is on exploring and developing methods that can provide captions for general audio recordings. To this aim, the novel Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
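Conceptually, every system in this task maps an audio signal to a token sequence with some learned decoder. The decoding loop common to most submissions can be sketched as follows; `next_token_scores` is a placeholder standing in for a trained decoder (Transformer, RNN, etc.), not part of any submitted system:

```python
def generate_caption(audio_features, next_token_scores, vocab, max_len=20):
    """Greedy caption decoding: repeatedly pick the most likely next word.

    `next_token_scores(audio_features, tokens)` is a stand-in for a trained
    decoder that returns a score per vocabulary word; a real system replaces
    it with, e.g., a Transformer or RNN conditioned on encoded audio.
    """
    tokens = ["<sos>"]
    for _ in range(max_len):
        scores = next_token_scores(audio_features, tokens)
        best = max(vocab, key=lambda w: scores[w])
        if best == "<eos>":  # end-of-sequence token terminates the caption
            break
        tokens.append(best)
    return " ".join(tokens[1:])  # drop the start-of-sequence marker
```

With a toy scorer that deterministically emits "water", "flows", then the end token, the loop returns the caption "water flows".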

Participants used the freely available development and evaluation splits of Clotho, which provide both audio and the corresponding captions. The systems were developed without the use of any external data. The developed systems were evaluated on the captions they generated for the Clotho testing split, which does not provide the corresponding captions for the audio. More information about Task 6: Automated Audio Captioning can be found on the task description page.

The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
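SPIDEr is defined as the arithmetic mean of the two captioning metrics, CIDEr and SPICE, so the ranking column can be reproduced from those two columns in the tables below:

```python
def spider(cider, spice):
    """SPIDEr is the mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2

# Top-ranked system on the Clotho testing split: CIDEr 0.485, SPICE 0.135.
print(round(spider(0.485, 0.135), 3))  # 0.31, matching the reported 0.310
```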

Teams ranking

Listed here are the best systems from all teams, ranked by the SPIDEr metric. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside of this task, since the Clotho evaluation split is freely available.

Columns: Submission code · Best official system rank · Corresponding author · Technical report · Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) · Clotho evaluation split (same nine metrics)
Yuan_t6_2 1 Weiqiang Yuan yuan2021_t6 0.595 0.400 0.275 0.184 0.182 0.394 0.485 0.135 0.310 0.603 0.414 0.286 0.195 0.186 0.400 0.499 0.137 0.318
Xu_t6_3 2 Xuenan Xu xu2021_t6 0.650 0.420 0.271 0.171 0.182 0.405 0.463 0.129 0.296 0.659 0.424 0.275 0.176 0.182 0.411 0.472 0.124 0.298
Xinhao_t6_1 3 Xinhao Mei xinhao2021_t6 0.620 0.416 0.282 0.180 0.184 0.401 0.457 0.131 0.294 0.615 0.403 0.270 0.171 0.179 0.392 0.412 0.122 0.268
Ye_t6_3 4 Zhongjie Ye ye2021_t6 0.584 0.391 0.265 0.173 0.179 0.384 0.434 0.126 0.280 0.586 0.391 0.268 0.180 0.180 0.388 0.440 0.125 0.282
Chen_t6_4 5 Zhiwen Chen chen2021_t6 0.549 0.358 0.239 0.156 0.169 0.367 0.402 0.121 0.262 0.563 0.367 0.244 0.158 0.170 0.371 0.406 0.119 0.262
Won_t6_4 6 Hyejin Won won2021_t6 0.538 0.359 0.247 0.162 0.166 0.372 0.381 0.118 0.249 0.564 0.376 0.254 0.163 0.177 0.388 0.441 0.128 0.285
Narisetty_t6_4 7 Chaitanya Narisetty narisetty2021_t6 0.534 0.348 0.238 0.160 0.157 0.361 0.362 0.110 0.236 0.563 0.378 0.264 0.184 0.168 0.378 0.417 0.115 0.266
Labbe_t6_4 8 Etienne Labbe labbe2021_t6 0.539 0.354 0.239 0.154 0.156 0.361 0.333 0.108 0.221 0.541 0.358 0.243 0.159 0.327 0.235 0.351 0.110 0.231
Liu_t6_1 9 Yang Liu liu2021_t6 0.478 0.291 0.189 0.118 0.143 0.324 0.274 0.094 0.184 0.483 0.298 0.197 0.119 0.322 0.133 0.243 0.088 0.166
Eren_t6_1 10 Ayşegül Özkaya Eren eren2021_t6 0.479 0.280 0.168 0.090 0.140 0.302 0.256 0.107 0.182 0.586 0.356 0.268 0.150 0.214 0.444 0.328 0.155 0.242
Gebhard_t6_1 11 Alexander Gebhard gebhard2021_t6 0.447 0.169 0.072 0.029 0.099 0.287 0.105 0.047 0.076 0.449 0.167 0.068 0.029 0.097 0.284 0.098 0.043 0.071
Xiao_t6_2 12 Feiyang Xiao xiao2021_t6 0.344 0.152 0.085 0.044 0.082 0.239 0.058 0.033 0.046 0.461 0.275 0.180 0.112 0.126 0.312 0.210 0.079 0.144
Baseline_t6_1 13 Konstantinos Drossos Baseline2021_t6 0.405 0.061 0.014 0.000 0.070 0.265 0.020 0.004 0.012 0.378 0.119 0.050 0.017 0.078 0.263 0.075 0.028 0.051

Systems ranking

Listed here are all systems and their rankings according to the different metrics and groupings of metrics. First is a table with all systems and all metrics, then a table with all systems but only the machine translation metrics, and finally a table with all systems but only the captioning metrics.

Detailed information about each system is given in the next section.
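The per-metric rankings in the tables below amount to sorting the submissions on the chosen metric in descending order. A minimal sketch, using a small illustrative subset of the testing-split results rather than the full table:

```python
def rank_by_metric(results, metric):
    """Return submission codes sorted by the given metric, best first."""
    return [code for code, scores in
            sorted(results.items(), key=lambda kv: kv[1][metric], reverse=True)]

# Toy subset of the Clotho testing-split results from the tables.
results = {
    "Yuan_t6_2":     {"SPIDEr": 0.310, "BLEU1": 0.595},
    "Xu_t6_3":       {"SPIDEr": 0.296, "BLEU1": 0.650},
    "Baseline_t6_1": {"SPIDEr": 0.012, "BLEU1": 0.405},
}
print(rank_by_metric(results, "SPIDEr"))  # Yuan_t6_2 first
print(rank_by_metric(results, "BLEU1"))   # Xu_t6_3 first
```

Note that the ordering depends on the metric: Xu_t6_3 leads on BLEU1 while Yuan_t6_2 leads on SPIDEr, which is why the tables below repeat the ranking per metric group.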

Systems ranking, all metrics

Columns: Submission code · Best official system rank · Technical report · Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) · Clotho evaluation split (same nine metrics)
Yuan_t6_1 4 yuan2021_t6 0.586 0.387 0.261 0.170 0.181 0.384 0.457 0.136 0.296 0.595 0.402 0.278 0.189 0.184 0.392 0.495 0.136 0.315
Yuan_t6_2 1 yuan2021_t6 0.595 0.400 0.275 0.184 0.182 0.394 0.485 0.135 0.310 0.603 0.414 0.286 0.195 0.186 0.400 0.499 0.137 0.318
Yuan_t6_3 2 yuan2021_t6 0.590 0.396 0.271 0.176 0.181 0.388 0.471 0.133 0.302 0.635 0.444 0.310 0.211 0.197 0.420 0.569 0.151 0.360
Yuan_t6_4 3 yuan2021_t6 0.584 0.392 0.266 0.175 0.181 0.389 0.465 0.131 0.298 0.665 0.487 0.359 0.260 0.214 0.449 0.684 0.163 0.423
Xiao_t6_1 37 xiao2021_t6 0.351 0.150 0.079 0.041 0.082 0.238 0.057 0.029 0.043 0.471 0.282 0.182 0.112 0.128 0.317 0.208 0.078 0.143
Xiao_t6_2 36 xiao2021_t6 0.344 0.152 0.085 0.044 0.082 0.239 0.058 0.033 0.046 0.461 0.275 0.180 0.112 0.126 0.312 0.210 0.079 0.144
Chen_t6_1 18 chen2021_t6 0.549 0.356 0.235 0.149 0.169 0.360 0.389 0.117 0.253 0.555 0.357 0.236 0.152 0.168 0.366 0.409 0.120 0.265
Chen_t6_2 28 chen2021_t6 0.535 0.345 0.227 0.142 0.161 0.359 0.349 0.113 0.231 0.553 0.364 0.247 0.161 0.167 0.371 0.408 0.118 0.263
Chen_t6_3 20 chen2021_t6 0.537 0.351 0.234 0.151 0.167 0.362 0.373 0.117 0.245 0.561 0.369 0.249 0.167 0.169 0.373 0.406 0.118 0.262
Chen_t6_4 17 chen2021_t6 0.549 0.358 0.239 0.156 0.169 0.367 0.402 0.121 0.262 0.563 0.367 0.244 0.158 0.170 0.371 0.406 0.119 0.262
Ye_t6_1 12 ye2021_t6 0.582 0.385 0.259 0.169 0.180 0.382 0.432 0.126 0.279 0.578 0.381 0.257 0.169 0.181 0.381 0.433 0.125 0.279
Ye_t6_2 14 ye2021_t6 0.577 0.379 0.254 0.164 0.182 0.385 0.420 0.128 0.274 0.579 0.384 0.261 0.172 0.181 0.386 0.436 0.128 0.282
Ye_t6_3 11 ye2021_t6 0.584 0.391 0.265 0.173 0.179 0.384 0.434 0.126 0.280 0.586 0.391 0.268 0.180 0.180 0.388 0.440 0.125 0.282
Ye_t6_4 13 ye2021_t6 0.586 0.389 0.261 0.170 0.181 0.387 0.429 0.125 0.277 0.590 0.395 0.272 0.183 0.182 0.394 0.453 0.129 0.291
Liu_t6_1 31 liu2021_t6 0.478 0.291 0.189 0.118 0.143 0.324 0.274 0.094 0.184 0.483 0.298 0.197 0.119 0.322 0.133 0.243 0.088 0.166
Gebhard_t6_1 35 gebhard2021_t6 0.447 0.169 0.072 0.029 0.099 0.287 0.105 0.047 0.076 0.449 0.167 0.068 0.029 0.097 0.284 0.098 0.043 0.071
Eren_t6_1 32 eren2021_t6 0.479 0.280 0.168 0.090 0.140 0.302 0.256 0.107 0.182 0.586 0.356 0.268 0.150 0.214 0.444 0.328 0.155 0.242
Xinhao_t6_1 7 xinhao2021_t6 0.620 0.416 0.282 0.180 0.184 0.401 0.457 0.131 0.294 0.615 0.403 0.270 0.171 0.179 0.392 0.412 0.122 0.268
Xinhao_t6_2 9 xinhao2021_t6 0.653 0.423 0.282 0.176 0.180 0.408 0.439 0.136 0.287 0.635 0.406 0.268 0.166 0.176 0.400 0.412 0.121 0.266
Xinhao_t6_3 8 xinhao2021_t6 0.644 0.420 0.278 0.170 0.181 0.406 0.447 0.136 0.291 0.621 0.407 0.273 0.177 0.179 0.395 0.431 0.122 0.277
Xinhao_t6_4 10 xinhao2021_t6 0.627 0.407 0.269 0.166 0.182 0.399 0.436 0.129 0.283 0.625 0.412 0.278 0.178 0.176 0.401 0.428 0.126 0.277
Narisetty_t6_1 26 narisetty2021_t6 0.531 0.346 0.235 0.157 0.160 0.361 0.362 0.108 0.235 0.546 0.356 0.243 0.165 0.163 0.369 0.381 0.110 0.246
Narisetty_t6_2 27 narisetty2021_t6 0.534 0.347 0.235 0.157 0.158 0.359 0.358 0.109 0.234 0.558 0.373 0.261 0.181 0.167 0.376 0.410 0.114 0.262
Narisetty_t6_3 25 narisetty2021_t6 0.534 0.347 0.235 0.157 0.159 0.362 0.360 0.110 0.235 0.562 0.377 0.261 0.182 0.169 0.377 0.416 0.116 0.266
Narisetty_t6_4 23 narisetty2021_t6 0.534 0.348 0.238 0.160 0.157 0.361 0.362 0.110 0.236 0.563 0.378 0.264 0.184 0.168 0.378 0.417 0.115 0.266
Labbe_t6_1 34 labbe2021_t6 0.435 0.222 0.128 0.073 0.121 0.305 0.146 0.072 0.109 0.435 0.229 0.129 0.069 0.252 0.195 0.136 0.067 0.101
Labbe_t6_2 33 labbe2021_t6 0.454 0.270 0.176 0.109 0.122 0.310 0.178 0.078 0.128 0.452 0.262 0.168 0.102 0.249 0.193 0.172 0.071 0.122
Labbe_t6_3 30 labbe2021_t6 0.525 0.321 0.200 0.117 0.157 0.354 0.296 0.115 0.205 0.523 0.316 0.191 0.109 0.309 0.231 0.287 0.104 0.195
Labbe_t6_4 29 labbe2021_t6 0.539 0.354 0.239 0.154 0.156 0.361 0.333 0.108 0.221 0.541 0.358 0.243 0.159 0.327 0.235 0.351 0.110 0.231
Won_t6_1 21 won2021_t6 0.535 0.344 0.231 0.151 0.162 0.359 0.375 0.111 0.243 0.540 0.345 0.230 0.152 0.161 0.361 0.383 0.109 0.246
Won_t6_2 24 won2021_t6 0.516 0.338 0.226 0.145 0.161 0.359 0.357 0.114 0.236 0.550 0.361 0.244 0.160 0.172 0.375 0.401 0.121 0.261
Won_t6_3 22 won2021_t6 0.518 0.346 0.235 0.151 0.163 0.366 0.366 0.117 0.242 0.554 0.370 0.254 0.168 0.170 0.379 0.400 0.119 0.259
Won_t6_4 19 won2021_t6 0.538 0.359 0.247 0.162 0.166 0.372 0.381 0.118 0.249 0.564 0.376 0.254 0.163 0.177 0.388 0.441 0.128 0.285
Xu_t6_1 15 xu2021_t6 0.560 0.366 0.245 0.159 0.177 0.376 0.403 0.127 0.265 0.576 0.377 0.252 0.164 0.178 0.382 0.421 0.122 0.271
Xu_t6_2 16 xu2021_t6 0.556 0.365 0.245 0.161 0.178 0.375 0.404 0.125 0.265 0.572 0.374 0.251 0.165 0.178 0.381 0.418 0.122 0.270
Xu_t6_3 5 xu2021_t6 0.650 0.420 0.271 0.171 0.182 0.405 0.463 0.129 0.296 0.659 0.424 0.275 0.176 0.182 0.411 0.472 0.124 0.298
Xu_t6_4 6 xu2021_t6 0.651 0.421 0.271 0.170 0.182 0.403 0.461 0.128 0.295 0.660 0.427 0.276 0.177 0.181 0.411 0.471 0.123 0.297
Baseline_t6_1 38 Baseline2021_t6 0.405 0.061 0.014 0.000 0.070 0.265 0.020 0.004 0.012 0.378 0.119 0.050 0.017 0.078 0.263 0.075 0.028 0.051

Systems ranking, machine translation metrics

Columns: Submission code · Best official system rank · Technical report · Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L) · Clotho evaluation split (same six metrics)
Yuan_t6_1 4 yuan2021_t6 0.586 0.387 0.261 0.170 0.181 0.384 0.595 0.402 0.278 0.189 0.184 0.392
Yuan_t6_2 1 yuan2021_t6 0.595 0.400 0.275 0.184 0.182 0.394 0.603 0.414 0.286 0.195 0.186 0.400
Yuan_t6_3 2 yuan2021_t6 0.590 0.396 0.271 0.176 0.181 0.388 0.635 0.444 0.310 0.211 0.197 0.420
Yuan_t6_4 3 yuan2021_t6 0.584 0.392 0.266 0.175 0.181 0.389 0.665 0.487 0.359 0.260 0.214 0.449
Xiao_t6_1 37 xiao2021_t6 0.351 0.150 0.079 0.041 0.082 0.238 0.471 0.282 0.182 0.112 0.128 0.317
Xiao_t6_2 36 xiao2021_t6 0.344 0.152 0.085 0.044 0.082 0.239 0.461 0.275 0.180 0.112 0.126 0.312
Chen_t6_1 18 chen2021_t6 0.549 0.356 0.235 0.149 0.169 0.360 0.555 0.357 0.236 0.152 0.168 0.366
Chen_t6_2 28 chen2021_t6 0.535 0.345 0.227 0.142 0.161 0.359 0.553 0.364 0.247 0.161 0.167 0.371
Chen_t6_3 20 chen2021_t6 0.537 0.351 0.234 0.151 0.167 0.362 0.561 0.369 0.249 0.167 0.169 0.373
Chen_t6_4 17 chen2021_t6 0.549 0.358 0.239 0.156 0.169 0.367 0.563 0.367 0.244 0.158 0.170 0.371
Ye_t6_1 12 ye2021_t6 0.582 0.385 0.259 0.169 0.180 0.382 0.578 0.381 0.257 0.169 0.181 0.381
Ye_t6_2 14 ye2021_t6 0.577 0.379 0.254 0.164 0.182 0.385 0.579 0.384 0.261 0.172 0.181 0.386
Ye_t6_3 11 ye2021_t6 0.584 0.391 0.265 0.173 0.179 0.384 0.586 0.391 0.268 0.180 0.180 0.388
Ye_t6_4 13 ye2021_t6 0.586 0.389 0.261 0.170 0.181 0.387 0.590 0.395 0.272 0.183 0.182 0.394
Liu_t6_1 31 liu2021_t6 0.478 0.291 0.189 0.118 0.143 0.324 0.483 0.298 0.197 0.119 0.322 0.133
Gebhard_t6_1 35 gebhard2021_t6 0.447 0.169 0.072 0.029 0.099 0.287 0.449 0.167 0.068 0.029 0.097 0.284
Eren_t6_1 32 eren2021_t6 0.479 0.280 0.168 0.090 0.140 0.302 0.586 0.356 0.268 0.150 0.214 0.444
Xinhao_t6_1 7 xinhao2021_t6 0.620 0.416 0.282 0.180 0.184 0.401 0.615 0.403 0.270 0.171 0.179 0.392
Xinhao_t6_2 9 xinhao2021_t6 0.653 0.423 0.282 0.176 0.180 0.408 0.635 0.406 0.268 0.166 0.176 0.400
Xinhao_t6_3 8 xinhao2021_t6 0.644 0.420 0.278 0.170 0.181 0.406 0.621 0.407 0.273 0.177 0.179 0.395
Xinhao_t6_4 10 xinhao2021_t6 0.627 0.407 0.269 0.166 0.182 0.399 0.625 0.412 0.278 0.178 0.176 0.401
Narisetty_t6_1 26 narisetty2021_t6 0.531 0.346 0.235 0.157 0.160 0.361 0.546 0.356 0.243 0.165 0.163 0.369
Narisetty_t6_2 27 narisetty2021_t6 0.534 0.347 0.235 0.157 0.158 0.359 0.558 0.373 0.261 0.181 0.167 0.376
Narisetty_t6_3 25 narisetty2021_t6 0.534 0.347 0.235 0.157 0.159 0.362 0.562 0.377 0.261 0.182 0.169 0.377
Narisetty_t6_4 23 narisetty2021_t6 0.534 0.348 0.238 0.160 0.157 0.361 0.563 0.378 0.264 0.184 0.168 0.378
Labbe_t6_1 34 labbe2021_t6 0.435 0.222 0.128 0.073 0.121 0.305 0.435 0.229 0.129 0.069 0.252 0.195
Labbe_t6_2 33 labbe2021_t6 0.454 0.270 0.176 0.109 0.122 0.310 0.452 0.262 0.168 0.102 0.249 0.193
Labbe_t6_3 30 labbe2021_t6 0.525 0.321 0.200 0.117 0.157 0.354 0.523 0.316 0.191 0.109 0.309 0.231
Labbe_t6_4 29 labbe2021_t6 0.539 0.354 0.239 0.154 0.156 0.361 0.541 0.358 0.243 0.159 0.327 0.235
Won_t6_1 21 won2021_t6 0.535 0.344 0.231 0.151 0.162 0.359 0.540 0.345 0.230 0.152 0.161 0.361
Won_t6_2 24 won2021_t6 0.516 0.338 0.226 0.145 0.161 0.359 0.550 0.361 0.244 0.160 0.172 0.375
Won_t6_3 22 won2021_t6 0.518 0.346 0.235 0.151 0.163 0.366 0.554 0.370 0.254 0.168 0.170 0.379
Won_t6_4 19 won2021_t6 0.538 0.359 0.247 0.162 0.166 0.372 0.564 0.376 0.254 0.163 0.177 0.388
Xu_t6_1 15 xu2021_t6 0.560 0.366 0.245 0.159 0.177 0.376 0.576 0.377 0.252 0.164 0.178 0.382
Xu_t6_2 16 xu2021_t6 0.556 0.365 0.245 0.161 0.178 0.375 0.572 0.374 0.251 0.165 0.178 0.381
Xu_t6_3 5 xu2021_t6 0.650 0.420 0.271 0.171 0.182 0.405 0.659 0.424 0.275 0.176 0.182 0.411
Xu_t6_4 6 xu2021_t6 0.651 0.421 0.271 0.170 0.182 0.403 0.660 0.427 0.276 0.177 0.181 0.411
Baseline_t6_1 38 Baseline2021_t6 0.405 0.061 0.014 0.000 0.070 0.265 0.378 0.119 0.050 0.017 0.078 0.263

Systems ranking, captioning metrics

Columns: Submission code · Best official system rank · Technical report · Clotho testing split (CIDEr, SPICE, SPIDEr) · Clotho evaluation split (same three metrics)
Yuan_t6_1 4 yuan2021_t6 0.457 0.136 0.296 0.495 0.136 0.315
Yuan_t6_2 1 yuan2021_t6 0.485 0.135 0.310 0.499 0.137 0.318
Yuan_t6_3 2 yuan2021_t6 0.471 0.133 0.302 0.569 0.151 0.360
Yuan_t6_4 3 yuan2021_t6 0.465 0.131 0.298 0.684 0.163 0.423
Xiao_t6_1 37 xiao2021_t6 0.057 0.029 0.043 0.208 0.078 0.143
Xiao_t6_2 36 xiao2021_t6 0.058 0.033 0.046 0.210 0.079 0.144
Chen_t6_1 18 chen2021_t6 0.389 0.117 0.253 0.409 0.120 0.265
Chen_t6_2 28 chen2021_t6 0.349 0.113 0.231 0.408 0.118 0.263
Chen_t6_3 20 chen2021_t6 0.373 0.117 0.245 0.406 0.118 0.262
Chen_t6_4 17 chen2021_t6 0.402 0.121 0.262 0.406 0.119 0.262
Ye_t6_1 12 ye2021_t6 0.432 0.126 0.279 0.433 0.125 0.279
Ye_t6_2 14 ye2021_t6 0.420 0.128 0.274 0.436 0.128 0.282
Ye_t6_3 11 ye2021_t6 0.434 0.126 0.280 0.440 0.125 0.282
Ye_t6_4 13 ye2021_t6 0.429 0.125 0.277 0.453 0.129 0.291
Liu_t6_1 31 liu2021_t6 0.274 0.094 0.184 0.243 0.088 0.166
Gebhard_t6_1 35 gebhard2021_t6 0.105 0.047 0.076 0.098 0.043 0.071
Eren_t6_1 32 eren2021_t6 0.256 0.107 0.182 0.328 0.155 0.242
Xinhao_t6_1 7 xinhao2021_t6 0.457 0.131 0.294 0.412 0.122 0.268
Xinhao_t6_2 9 xinhao2021_t6 0.439 0.136 0.287 0.412 0.121 0.266
Xinhao_t6_3 8 xinhao2021_t6 0.447 0.136 0.291 0.431 0.122 0.277
Xinhao_t6_4 10 xinhao2021_t6 0.436 0.129 0.283 0.428 0.126 0.277
Narisetty_t6_1 26 narisetty2021_t6 0.362 0.108 0.235 0.381 0.110 0.246
Narisetty_t6_2 27 narisetty2021_t6 0.358 0.109 0.234 0.410 0.114 0.262
Narisetty_t6_3 25 narisetty2021_t6 0.360 0.110 0.235 0.416 0.116 0.266
Narisetty_t6_4 23 narisetty2021_t6 0.362 0.110 0.236 0.417 0.115 0.266
Labbe_t6_1 34 labbe2021_t6 0.146 0.072 0.109 0.136 0.067 0.101
Labbe_t6_2 33 labbe2021_t6 0.178 0.078 0.128 0.172 0.071 0.122
Labbe_t6_3 30 labbe2021_t6 0.296 0.115 0.205 0.287 0.104 0.195
Labbe_t6_4 29 labbe2021_t6 0.333 0.108 0.221 0.351 0.110 0.231
Won_t6_1 21 won2021_t6 0.375 0.111 0.243 0.383 0.109 0.246
Won_t6_2 24 won2021_t6 0.357 0.114 0.236 0.401 0.121 0.261
Won_t6_3 22 won2021_t6 0.366 0.117 0.242 0.400 0.119 0.259
Won_t6_4 19 won2021_t6 0.381 0.118 0.249 0.441 0.128 0.285
Xu_t6_1 15 xu2021_t6 0.403 0.127 0.265 0.421 0.122 0.271
Xu_t6_2 16 xu2021_t6 0.404 0.125 0.265 0.418 0.122 0.270
Xu_t6_3 5 xu2021_t6 0.463 0.129 0.296 0.472 0.124 0.298
Xu_t6_4 6 xu2021_t6 0.461 0.128 0.295 0.471 0.123 0.297
Baseline_t6_1 38 Baseline2021_t6 0.020 0.004 0.012 0.075 0.028 0.051

System characteristics

In this section you can find the characteristics of the submitted systems, presented in two tables for easy reference, one in each of the corresponding subsections. The first table gives an overview of the systems and the second a detailed presentation of each system.

Overview of characteristics

Columns: Rank · Submission code · SPIDEr · Technical report · Method scheme/architecture · Amount of parameters · Audio modelling · Word modelling · Data augmentation
4 Yuan_t6_1 0.296 yuan2021_t6 encoder-decoder 986302137 PANNs Transformer noise enhance
1 Yuan_t6_2 0.310 yuan2021_t6 encoder-decoder 986302137 PANNs Transformer noise enhance
2 Yuan_t6_3 0.302 yuan2021_t6 encoder-decoder 986302137 PANNs Transformer noise enhance
3 Yuan_t6_4 0.298 yuan2021_t6 encoder-decoder 2572592190 PANNs Transformer noise enhance
37 Xiao_t6_1 0.043 xiao2021_t6 encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding 2448349 MLP-mixer encoder Transformer decoder
36 Xiao_t6_2 0.046 xiao2021_t6 encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder 2448349 MLP-mixer encoder Transformer decoder
18 Chen_t6_1 0.253 chen2021_t6 encoder-decoder 93410432 CNN14, MemoryEncoder MeshedDecoder SpecAugment, Label Smoothing
28 Chen_t6_2 0.231 chen2021_t6 encoder-decoder 93410432 CNN14, MemoryEncoder MeshedDecoder SpecAugment, Label Smoothing
20 Chen_t6_3 0.245 chen2021_t6 encoder-decoder 22775528 CNN14, MemoryEncoder MeshedDecoder SpecAugment, Label Smoothing
17 Chen_t6_4 0.262 chen2021_t6 encoder-decoder 86440064 ResNet38, MemoryEncoder MeshedDecoder SpecAugment, Label Smoothing
12 Ye_t6_1 0.279 ye2021_t6 encoder-decoder 86643711 ResNet38 RNN Mixup, SpecAugment, SpecAugment++
14 Ye_t6_2 0.274 ye2021_t6 encoder-decoder 86643711 ResNet38 RNN Mixup, SpecAugment, SpecAugment++
11 Ye_t6_3 0.280 ye2021_t6 encoder-decoder 779793399 ResNet38 RNN Mixup, SpecAugment, SpecAugment++
13 Ye_t6_4 0.277 ye2021_t6 encoder-decoder 259931133 ResNet38 RNN Mixup, SpecAugment, SpecAugment++
31 Liu_t6_1 0.184 liu2021_t6 encoder-decoder 3045913 CNN self-attention, Word2Vec
35 Gebhard_t6_1 0.076 gebhard2021_t6 encoder-decoder 13409747 CNN RNN
32 Eren_t6_1 0.182 eren2021_t6 encoder-decoder 2511570 PANNs RNN
7 Xinhao_t6_1 0.294 xinhao2021_t6 encoder-decoder 7455570 CNN Transformer SpecAugment
9 Xinhao_t6_2 0.287 xinhao2021_t6 encoder-decoder 7455570 CNN Transformer SpecAugment
8 Xinhao_t6_3 0.291 xinhao2021_t6 encoder-decoder 8038703 CNN Transformer SpecAugment
10 Xinhao_t6_4 0.283 xinhao2021_t6 encoder-decoder 8038703 CNN Transformer SpecAugment
26 Narisetty_t6_1 0.235 narisetty2021_t6 Conformer 143676488 Conformer Transformer SpecAugment
27 Narisetty_t6_2 0.234 narisetty2021_t6 Conformer 143676488 Conformer Transformer SpecAugment
25 Narisetty_t6_3 0.235 narisetty2021_t6 Conformer 167205128 Conformer Transformer SpecAugment
23 Narisetty_t6_4 0.236 narisetty2021_t6 Conformer 167205128 Conformer Transformer SpecAugment
34 Labbe_t6_1 0.109 labbe2021_t6 encoder-decoder 2887632 pyramidal bidirectional RNN-LSTM RNN-LSTM, attention
33 Labbe_t6_2 0.128 labbe2021_t6 encoder-decoder 2887632 pyramidal bidirectional RNN-LSTM RNN-LSTM, attention
30 Labbe_t6_3 0.205 labbe2021_t6 encoder-decoder 84521484 CNN14 RNN-LSTM, attention
29 Labbe_t6_4 0.221 labbe2021_t6 encoder-decoder 84521484 CNN14 RNN-LSTM, attention
21 Won_t6_1 0.243 won2021_t6 encoder-decoder 8445139 ResNet Transformer SpecAugment
24 Won_t6_2 0.236 won2021_t6 encoder-decoder 8445139 CNN Transformer SpecAugment
22 Won_t6_3 0.242 won2021_t6 encoder-decoder 8445139 CNN Transformer SpecAugment
19 Won_t6_4 0.249 won2021_t6 encoder-decoder 8445139 CNN Transformer SpecAugment
15 Xu_t6_1 0.265 xu2021_t6 seq2seq 36181131 CNN RNN SpecAugment
16 Xu_t6_2 0.265 xu2021_t6 seq2seq 60301885 CNN RNN SpecAugment
5 Xu_t6_3 0.296 xu2021_t6 seq2seq 48241508 CNN RNN SpecAugment
6 Xu_t6_4 0.295 xu2021_t6 seq2seq 60301885 CNN RNN SpecAugment
38 Baseline_t6_1 0.012 Baseline2021_t6 encoder-decoder 5012931 RNN RNN



Detailed characteristics

Columns: Rank · Submission code · SPIDEr · Technical report · Method scheme/architecture · Amount of parameters · Audio modelling · Acoustic features · Word modelling · Word embeddings · Data augmentation · Sampling rate · Learning set-up · Ensemble method · Loss function · Optimizer · Learning rate · Gradient clipping · Gradient norm for clipping · Metric monitored for training · Dataset(s) used for audio modelling · Dataset(s) used for word modelling · Dataset(s) used for audio similarity
4 Yuan_t6_1 0.296 yuan2021_t6 encoder-decoder 986302137 PANNs log-mel energies Transformer one-hot noise enhance 36kHz supervised crossentropy adam 1e-4 Validation SPIDEr score Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound Clotho
1 Yuan_t6_2 0.310 yuan2021_t6 encoder-decoder 986302137 PANNs log-mel energies Transformer one-hot noise enhance 36kHz supervised crossentropy adam 1e-4 Validation SPIDEr score Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound Clotho
2 Yuan_t6_3 0.302 yuan2021_t6 encoder-decoder 986302137 PANNs log-mel energies Transformer one-hot noise enhance 36kHz supervised crossentropy adam 1e-4 Validation SPIDEr score Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound Clotho
3 Yuan_t6_4 0.298 yuan2021_t6 encoder-decoder 2572592190 PANNs log-mel energies Transformer one-hot noise enhance 36kHz supervised crossentropy adam 1e-4 Validation SPIDEr score Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound Clotho
37 Xiao_t6_1 0.043 xiao2021_t6 encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding 2448349 MLP-mixer encoder log-mel energies Transformer decoder learned embeddings 44.1kHz supervised crossentropy adam 1e-4 Validation loss Clotho Clotho
36 Xiao_t6_2 0.046 xiao2021_t6 encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder 2448349 MLP-mixer encoder log-mel energies Transformer decoder learned embeddings 44.1kHz supervised crossentropy adam 1e-4 Validation loss Clotho Clotho
18 Chen_t6_1 0.253 chen2021_t6 encoder-decoder 93410432 CNN14, MemoryEncoder log-mel energies MeshedDecoder Word2Vec SpecAugment, Label Smoothing 44.1kHz supervised crossentropy adam 3e-5 Validation SPIDEr score Clotho Clotho
28 Chen_t6_2 0.231 chen2021_t6 encoder-decoder 93410432 CNN14, MemoryEncoder log-mel energies MeshedDecoder FastText SpecAugment, Label Smoothing 44.1kHz supervised crossentropy adam 3e-5 Validation SPIDEr score Clotho Clotho
20 Chen_t6_3 0.245 chen2021_t6 encoder-decoder 22775528 CNN14, MemoryEncoder log-mel energies MeshedDecoder learned embeddings SpecAugment, Label Smoothing 44.1kHz supervised crossentropy adam 3e-5 Validation SPIDEr score Clotho Clotho
17 Chen_t6_4 0.262 chen2021_t6 encoder-decoder 86440064 ResNet38, MemoryEncoder log-mel energies MeshedDecoder Word2Vec SpecAugment, Label Smoothing 44.1kHz supervised crossentropy adam 3e-5 Validation SPIDEr score Clotho Clotho
12 Ye_t6_1 0.279 ye2021_t6 encoder-decoder 86643711 ResNet38 log-mel energies RNN learned embeddings Mixup, SpecAugment, SpecAugment++ 44.1kHz supervised crossentropy adam 2e-5 Validation loss Clotho Clotho
14 Ye_t6_2 0.274 ye2021_t6 encoder-decoder 86643711 ResNet38 log-mel energies RNN learned embeddings Mixup, SpecAugment, SpecAugment++ 44.1kHz supervised crossentropy adam 2e-5 Validation loss Clotho Clotho
11 Ye_t6_3 0.280 ye2021_t6 encoder-decoder 779793399 ResNet38 log-mel energies RNN learned embeddings Mixup, SpecAugment, SpecAugment++ 44.1kHz supervised crossentropy adam 2e-5 Validation loss Clotho Clotho
13 Ye_t6_4 0.277 ye2021_t6 encoder-decoder 259931133 ResNet38 log-mel energies RNN learned embeddings Mixup, SpecAugment, SpecAugment++ 44.1kHz supervised crossentropy adam 2e-5 Validation loss Clotho Clotho
31 Liu_t6_1 0.184 liu2021_t6 encoder-decoder 3045913 CNN log-mel energies self-attention, Word2Vec Word2Vec 44.1kHz supervised crossentropy, sentence-loss adam 5e-4 Training loss Clotho Clotho
35 Gebhard_t6_1 0.076 gebhard2021_t6 encoder-decoder 13409747 CNN log-mel energies RNN numeric-representation 44.1kHz supervised crossentropy adam 1e-4 Training loss Clotho Clotho
32 Eren_t6_1 0.182 eren2021_t6 encoder-decoder 2511570 PANNs log-mel energies, PANNs RNN Word2Vec 44.1kHz supervised crossentropy adam 1e-3 Validation loss Clotho Clotho
7 Xinhao_t6_1 0.294 xinhao2021_t6 encoder-decoder 7455570 CNN PANNs Transformer random SpecAugment 44.1kHz supervised crossentropy adam 1e-3 Validation SPIDEr score Clotho Clotho
9 Xinhao_t6_2 0.287 xinhao2021_t6 encoder-decoder 7455570 CNN PANNs Transformer random SpecAugment 44.1kHz supervised crossentropy adam 1e-3 Validation SPIDEr score Clotho Clotho
8 Xinhao_t6_3 0.291 xinhao2021_t6 encoder-decoder 8038703 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam 1e-3 Validation SPIDEr score Clotho, AudioCaps Clotho, AudioCaps
10 Xinhao_t6_4 0.283 xinhao2021_t6 encoder-decoder 8038703 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam 1e-3 Validation SPIDEr score Clotho, AudioCaps Clotho, AudioCaps
26 Narisetty_t6_1 0.235 narisetty2021_t6 Conformer 143676488 Conformer log-mel energies, PANNs, tags, embeddings Transformer learned embeddings SpecAugment 16kHz supervised crossentropy noam 5e-1 Validation loss Clotho, AudioCaps Clotho, AudioCaps
27 Narisetty_t6_2 0.234 narisetty2021_t6 Conformer 143676488 Conformer log-mel energies, PANNs, tags, embeddings Transformer learned embeddings SpecAugment 16kHz supervised crossentropy noam 5e-1 Validation loss Clotho, AudioCaps Clotho, AudioCaps
25 Narisetty_t6_3 0.235 narisetty2021_t6 Conformer 167205128 Conformer log-mel energies, PANNs, tags, embeddings Transformer learned embeddings SpecAugment 16kHz supervised crossentropy noam 5e-1 Validation loss Clotho, AudioCaps Clotho, AudioCaps
23 Narisetty_t6_4 0.236 narisetty2021_t6 Conformer 167205128 Conformer log-mel energies, PANNs, tags, embeddings Transformer learned embeddings SpecAugment 16kHz supervised crossentropy noam 5e-1 Validation loss Clotho, AudioCaps Clotho, AudioCaps
34 Labbe_t6_1 0.109 labbe2021_t6 encoder-decoder 2887632 pyramidal bidirectional RNN-LSTM log-mel energies RNN-LSTM, attention learned embeddings 32kHz supervised crossentropy adam 5e-4 Validation loss Clotho Clotho
33 Labbe_t6_2 0.128 labbe2021_t6 encoder-decoder 2887632 pyramidal bidirectional RNN-LSTM log-mel energies RNN-LSTM, attention learned embeddings 32kHz supervised crossentropy adam 5e-4 Validation loss Clotho Clotho
30 Labbe_t6_3 0.205 labbe2021_t6 encoder-decoder 84521484 CNN14 log-mel energies RNN-LSTM, attention learned embeddings 32kHz supervised crossentropy adam 5e-4 Validation loss Clotho Clotho
29 Labbe_t6_4 0.221 labbe2021_t6 encoder-decoder 84521484 CNN14 log-mel energies RNN-LSTM, attention learned embeddings 32kHz supervised crossentropy adam 5e-4 Validation loss Clotho Clotho
21 Won_t6_1 0.243 won2021_t6 encoder-decoder 8445139 ResNet log-mel energies Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam, SWA 3e-4, 1e-4 Training SPIDEr score Clotho Clotho
24 Won_t6_2 0.236 won2021_t6 encoder-decoder 8445139 CNN log-mel energies Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam, SWA 3e-4, 1e-4 Training SPIDEr score Clotho Clotho
22 Won_t6_3 0.242 won2021_t6 encoder-decoder 8445139 CNN log-mel energies Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam, SWA 3e-4, 1e-4 Training SPIDEr score Clotho Clotho
19 Won_t6_4 0.249 won2021_t6 encoder-decoder 8445139 CNN log-mel energies Transformer Word2Vec SpecAugment 44.1kHz supervised crossentropy adam, SWA 3e-4, 1e-4 Training SPIDEr score Clotho Clotho
15 Xu_t6_1 0.265 xu2021_t6 seq2seq 36181131 CNN log-mel energies RNN learned embeddings SpecAugment 44.1kHz supervised, reinforcement learning crossentropy adam 5e-4 Validation SPIDEr score Clotho, AudioSet Clotho
16 Xu_t6_2 0.265 xu2021_t6 seq2seq 60301885 CNN log-mel energies RNN learned embeddings SpecAugment 44.1kHz supervised, reinforcement learning crossentropy adam 5e-4 Validation SPIDEr score Clotho, AudioSet Clotho
5 Xu_t6_3 0.296 xu2021_t6 seq2seq 48241508 CNN log-mel energies RNN learned embeddings SpecAugment 44.1kHz supervised, reinforcement learning crossentropy adam 5e-4 Validation SPIDEr score Clotho, AudioSet Clotho
6 Xu_t6_4 0.295 xu2021_t6 seq2seq 60301885 CNN log-mel energies RNN learned embeddings SpecAugment 44.1kHz supervised, reinforcement learning crossentropy adam 5e-4 Validation SPIDEr score Clotho, AudioSet Clotho
38 Baseline_t6_1 0.012 Baseline2021_t6 encoder-decoder 5012931 RNN log-mel energies RNN learned embeddings 44.1kHz supervised crossentropy adam 1e-3 Validation loss Clotho Clotho
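SpecAugment, the most common augmentation in the table above, masks random frequency bins and time frames of the log-mel spectrogram. A minimal pure-Python sketch of the idea follows; the mask counts and widths are illustrative defaults, not values taken from any submitted system:

```python
import random

def spec_augment(spec, num_freq_masks=1, freq_width=2,
                 num_time_masks=1, time_width=2, mask_value=0.0):
    """Mask random frequency bins and time frames of a spectrogram.

    `spec` is a list of frames, each a list of mel-bin energies.
    Mask counts, widths, and the mask value are illustrative
    assumptions, not parameters from any submission.
    """
    n_frames, n_bins = len(spec), len(spec[0])
    out = [frame[:] for frame in spec]  # copy, leave the input untouched
    for _ in range(num_freq_masks):
        f0 = random.randrange(max(1, n_bins - freq_width + 1))
        for frame in out:  # zero a band of mel bins in every frame
            for f in range(f0, f0 + freq_width):
                frame[f] = mask_value
    for _ in range(num_time_masks):
        t0 = random.randrange(max(1, n_frames - time_width + 1))
        for t in range(t0, t0 + time_width):  # zero whole frames
            out[t] = [mask_value] * n_bins
    return out
```

Applied on the fly during training, each epoch sees a differently masked copy of every spectrogram, which is what makes the augmentation effective with a dataset as small as Clotho.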



Technical reports

Audio Captioning With Meshed-Memory Transformer

Zhiwen Chen, Dawei Zhang, Jun Wang, and Feng Deng
University of Chinese Academy of Sciences, Beijing, China

Abstract

Automated audio captioning is the task of describing the audio content of a given audio signal in natural language. Transformer-based language models have come to be widely used in audio captioning. However, most transformer-based architectures cannot learn prior knowledge between samples well, leading to worse text decoding. For better acoustic event and language modeling, a sequence-to-sequence model is proposed which consists of a CNN-based encoder, a memory-augmented refiner, and a meshed decoder. The proposed architecture refines a multi-level representation of the relationships between audio features, integrating learned a priori knowledge. At the decoding stage, it exploits low- and high-level features with mesh-like connectivity. Experiments show that the proposed model can achieve a SPIDEr score of 0.2645 on the Clotho V2 dataset.

System characteristics
Data augmentation SpecAugment, Label Smoothing

Audio Captioning Using Sound Event Detection

Ayşegül Özkaya Eren and Mustafa Sert
Department of Computer Engineering, Baskent University, Ankara, Turkey

Abstract

This technical report proposes an audio captioning system for the DCASE 2021 Task 6 audio captioning challenge. Our proposed model is based on an encoder-decoder architecture with bi-directional Gated Recurrent Units (BiGRU) using pretrained audio features and sound event detection. A pretrained audio neural network (PANN) is used to extract audio features, and Word2Vec is selected to extract word embeddings from the audio captions. To create semantically meaningful captions, we extract sound events from the audio clips and feed the encoder-decoder architecture with these sound events in addition to the PANNs features. Our experiments on the Clotho dataset show that our proposed method achieves significantly better results than the challenge baseline model across all evaluation metrics.

System characteristics
Data augmentation None

An Automated Audio Captioning Approach Utilising a ResNet-based Encoder

Alexander Gebhard1, Andreas Triantafyllopoulos1,2, Alice Baird1, and Björn Schuller1,2,3
1 EIHW, Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany, 2 audEERING GmbH, Gilching, Germany, 3 GLAM, Group on Language, Audio, and Music, Imperial College, London, UK

Abstract

In this report, we present our submission to Task 6 of the DCASE2021 Challenge. The main module is based on the baseline architecture for the automated audio captioning (AAC) task, which was provided by the challenge organisers. We exchange the encoder part of the baseline architecture, replacing it with a Residual Neural Network (ResNet-18) encoder adapted to the AAC task. Results from our proposed architecture show an average increase of 35.7% over the baseline system, reaching a BLEU1 score of 0.449 on the development set, demonstrating the effectiveness of the proposed encoder for this task.

System characteristics
Data augmentation None
PDF

IRIT-UPS DCASE 2021 Audio Captioning System

Etienne Labbé and Thomas Pellegrini
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France

Abstract

This document describes our sequence-to-sequence models used for audio captioning in Task 6 of the DCASE 2021 challenge. Four submissions were made with two different models, a “Listen-Attend-Tell” and a “CNN-Tell”, and two different inference algorithms, greedy and beam search.

System characteristics
Data augmentation None
PDF
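The abstract above contrasts greedy and beam-search inference. The difference can be illustrated with a minimal, framework-free sketch (a toy next-token model, not the authors' actual decoder): greedy decoding is beam search with a beam width of 1, while a wider beam can recover a globally more probable caption that the greedy path misses.

```python
import math

def beam_search(step_fn, start, eos, beam_width=3, max_len=10):
    """Generic beam search over a next-token log-probability function.

    step_fn(prefix) -> dict mapping each candidate token to its
    log-probability given the prefix. Greedy decoding is the special
    case beam_width=1.
    """
    beams = [([start], 0.0)]                  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # keep the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Toy next-token model where the greedy path is suboptimal:
# "a" looks best locally (0.6 > 0.4), but the "b" continuation
# yields a higher total probability (0.4 * 0.9 > 0.6 * 0.5).
def toy_step(seq):
    if seq == ["<s>"]:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if seq[-1] == "a":
        return {"<eos>": math.log(0.5), "a": math.log(0.5)}
    return {"<eos>": math.log(0.9), "a": math.log(0.1)}

greedy = beam_search(toy_step, "<s>", "<eos>", beam_width=1)
wide = beam_search(toy_step, "<s>", "<eos>", beam_width=2)
```

Here the greedy decoder commits to "a", while a beam of width 2 keeps the "b" hypothesis alive and selects it once its higher-probability ending is visible.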

The DCASE2021 Challenge Task 6 System : Automated Audio Caption

Liu Yang and Bi Sijun
Beijing Institute of Technology

Abstract

This technical report describes our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. In this work, we employ several stacked learnable CNNs to extract audio features in the encoder, and we employ the decoder of the widely used Transformer architecture to generate captions. To optimize the system, we use a sentence-level cosine loss function together with cross-entropy loss. The experimental results show that our system achieves a SPIDEr of 0.166 on the evaluation split of the Clotho dataset.

System characteristics
Data augmentation None
PDF
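The sentence-level cosine loss mentioned in the abstract can be written compactly. The NumPy sketch below is illustrative (the report's exact formulation and the choice of sentence embeddings may differ): it penalises the angular distance between a predicted and a reference sentence embedding.

```python
import numpy as np

def cosine_loss(pred, target, eps=1e-8):
    """Sentence-level cosine loss: 1 minus the cosine similarity of the
    predicted and reference sentence embeddings. Near 0 when the
    embeddings point the same way, near 2 when they are opposite."""
    sim = np.dot(pred, target) / (np.linalg.norm(pred) * np.linalg.norm(target) + eps)
    return 1.0 - sim

v = np.array([1.0, 2.0, 3.0])
aligned_loss = cosine_loss(v, 2 * v)   # same direction, scale-invariant
opposite_loss = cosine_loss(v, -v)     # opposite direction
```

Because cosine similarity ignores magnitude, this loss only constrains the direction of the sentence embedding, which is why it is typically paired with cross-entropy rather than used alone.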

Leveraging State-of-the-art ASR Techniques to Audio Captioning

Chaitanya Narisetty1, Tomoki Hayashi2, Ryunosuke Ishizaki2, Shinji Watanabe1, and Kazuya Takeda2
1 Carnegie Mellon University, Pittsburgh, USA, 2 Nagoya University, Nagoya, Japan

Abstract

This report presents a summary of our submission to the 2021 DCASE challenge Task 6: Automated Audio Captioning. Our approach to this task is derived from state-of-the-art ASR techniques available in the ESPnet toolkit. Specifically, we train a convolution-augmented Transformer (Conformer) model to generate captions from input acoustic features in an end-to-end manner. In addition to the prescribed challenge dataset, Clotho-v2, we also augment training with the external AudioCaps dataset. To overcome the limited availability of training data, we further incorporate AudioSet tags and audio embeddings obtained from pretrained audio neural networks (PANNs) as auxiliary inputs to our model. An ensemble of models trained over various architectures and input embeddings is selected as our final submission. Experimental results indicate that our models achieve SPIDEr scores of 0.224 and 0.246 on the development-validation and development-evaluation sets, respectively.

System characteristics
Data augmentation SpecAugment
PDF
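SpecAugment, listed above (and used by several other submissions on this page), masks random frequency bands and time spans of the input spectrogram. A minimal NumPy sketch follows; the mask counts, widths, and the mean-fill choice are illustrative defaults, not the submissions' actual settings.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """SpecAugment-style masking on a (mel_bins, frames) spectrogram:
    a few randomly placed frequency bands and time spans are replaced
    by the spectrogram mean (one common fill choice)."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    fill = out.mean()
    n_freq, n_time = out.shape
    for _ in range(num_freq_masks):                    # frequency masking
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = fill
    for _ in range(num_time_masks):                    # time masking
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = fill
    return out

# Example on a toy log-mel spectrogram of 64 bins x 100 frames.
spec = np.arange(64 * 100, dtype=float).reshape(64, 100)
augmented = spec_augment(spec, rng=np.random.default_rng(0))
```

Each training epoch sees differently masked copies of the same clips, which forces the encoder not to rely on any single band or span.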

CAU Submission to DCASE 2021 Task6: Transformer Followed by Transfer Learning for Audio Captioning

Hyejin Won, Baekseung Kim, Il-Youp Kwak, and Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea

Abstract

This report proposes an automated audio captioning model for the 2021 DCASE audio captioning challenge. In this challenge, a model is required to generate natural language descriptions of a given audio signal. We use pre-trained models trained on AudioSet, a large-scale dataset of manually annotated audio events. The large amount of audio event data helps capture important audio feature representations. To make use of the features learned from AudioSet, we explored several transfer learning approaches. Our proposed sequence-to-sequence model consists of a CNN14 or ResNet54 encoder and a Transformer decoder. Experiments show that the proposed model can achieve SPIDEr scores of 0.246 and 0.285 on audio captioning.

System characteristics
Data augmentation SpecAugment
PDF

Automated Audio Captioning With MLP-mixer and Pre-trained Encoder

Feiyang Xiao1, Jian Guan1, and Qiuqiang Kong2
1 Group of Intelligent Signal Processing, College of Computer Science and Technology Harbin Engineering University, Harbin, China, 2 ByteDance, Shanghai, China

Abstract

This technical report describes the submission from the Group of Intelligent Signal Processing (GISP) for Task 6 of the DCASE 2021 challenge (automated audio captioning). Our audio captioning system is based on a sequence-to-sequence autoencoder model. Previous recurrent neural network (RNN) and Transformer based methods perceive only time-dimension information and ignore frequency information. To utilize both the time and the frequency dimension, a multi-layer perceptron mixer (MLP-Mixer) is used as the encoder; for caption prediction, a Transformer decoder structure is used. No extra data is employed. In addition, to highlight content information, we use an encoder pre-trained with multi-label content information. The experimental results show that our system achieves a SPIDEr of 0.144 (official baseline: 0.051) on the evaluation split of the Clotho dataset. In addition, compared with Transformer methods, our system requires less training time.

System characteristics
Data augmentation None
PDF
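The abstract's key point is that an MLP-Mixer encoder mixes information along both the time and the frequency axis. The NumPy sketch below illustrates that idea only; the dimensions, activation, and omitted layer normalisation are assumptions, not the authors' configuration.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer ReLU MLP applied along the last axis (biases omitted)."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_layer(x, w_time, w_freq):
    """One simplified Mixer layer on a (frames, mel_bins) input: first
    an MLP mixes information along the time axis (applied per frequency
    bin), then another mixes along the frequency axis (applied per
    frame). Residual connections as in MLP-Mixer; layer norm omitted."""
    x = x + mlp(x.T, *w_time).T   # time mixing: time is the last axis of x.T
    x = x + mlp(x, *w_freq)       # frequency mixing
    return x

rng = np.random.default_rng(0)
T, F, H = 50, 64, 128             # frames, mel bins, hidden width
x = rng.standard_normal((T, F))
w_time = (0.02 * rng.standard_normal((T, H)), 0.02 * rng.standard_normal((H, T)))
w_freq = (0.02 * rng.standard_normal((F, H)), 0.02 * rng.standard_normal((H, F)))
y = mixer_layer(x, w_time, w_freq)
```

The transpose before the first MLP is what distinguishes this from an RNN or Transformer over frames alone: the same layer gets a full view across time for each frequency bin and across frequency for each frame.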

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning for DCASE Challenge 2021 Task 6

Xinhao Mei1, Qiushi Huang1, Xubo Liu1, Gengyun Chen2, Jingqian Wu3∗, Yusong Wu3†, Jinzheng Zhao1, Shengchen Li3, Tom Ko4, H Lilian Tang1, Xi Shao2, Mark D. Plumbley1, Wenwu Wang1
1 University of Surrey, Guildford, United Kingdom, 2 Nanjing University of Posts and Telecommunications, Nanjing, China, 3 Xi’an Jiaotong-Liverpool University, Suzhou, China, 4 Southern University of Science and Technology, Shenzhen, China. ∗ Jingqian Wu is currently with Wake Forest University, USA. † Yusong Wu is currently with University of Montreal, Canada.

Abstract

Audio captioning aims to use natural language to describe the content of audio data. This technical report presents an automated audio captioning system submitted to Task 6 of the DCASE 2021 challenge. The proposed system is based on an encoder-decoder architecture, consisting of a convolutional neural network (CNN) encoder and a Transformer decoder. We further improve the system with two techniques, namely, pre-training the model via transfer learning techniques, either on upstream audio-related tasks or large in-domain datasets, and incorporating evaluation metrics into the optimization of the model with reinforcement learning techniques, which help address the problem caused by the mismatch between the evaluation metrics and the loss function. The results show that both techniques can further improve the performance of the captioning system. The overall system achieves a SPIDEr score of 0.277 on the Clotho evaluation set, which outperforms the top-ranked system from the DCASE 2020 challenge.

System characteristics
Data augmentation SpecAugment
PDF

The SJTU System for DCASE2021 Challenge Task 6: Audio Captioning Based on Encoder Pre-training and Reinforcement Learning

Xuenan Xu, Zeyu Xie, Mengyue Wu, and Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge, Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a single-layer gated recurrent unit (GRU) decoder with temporal attention. In this challenge, there is no restriction on the use of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.

System characteristics
Data augmentation SpecAugment
PDF
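Several entries on this page, including the one above, fine-tune with reinforcement learning to directly optimize the (non-differentiable) evaluation metric. A common recipe for this is self-critical sequence training (SCST), where the greedy decode's reward serves as the baseline. The sketch below is a deliberately tiny single-token NumPy illustration of that update rule, not any team's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def scst_step(logits, sampled, greedy, reward_fn, lr=0.5):
    """One self-critical policy-gradient update on a single token
    distribution. advantage = reward(sampled) - reward(greedy):
    sampling something better than the greedy output reinforces it,
    sampling something worse suppresses it."""
    p = softmax(logits)
    advantage = reward_fn(sampled) - reward_fn(greedy)
    grad = p.copy()
    grad[sampled] -= 1.0          # gradient of -log p[sampled] w.r.t. logits
    return logits - lr * advantage * grad

# Toy "metric": only token 2 earns reward (stand-in for e.g. CIDEr-D/SPIDEr).
reward = lambda tok: 1.0 if tok == 2 else 0.0
logits = np.zeros(3)
updated = scst_step(logits, sampled=2, greedy=0, reward_fn=reward)
```

Because the baseline is the model's own greedy output, no separate value network is needed, and updates vanish exactly when sampling cannot beat greedy decoding.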

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Textual Information

Zhongjie Ye1, Helin Wang1, Dongchao Yang1, and Yuexian Zou1,2
1 ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2 Peng Cheng Laboratory, Shenzhen, China

Abstract

This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Task 6 Challenge. In order to utilize more acoustic and textual information, we propose a novel sequence-to-sequence model named KPE-MAD, with a keyword pre-trained encoder and a multi-modal attention decoder. For the encoder, we use a classification model pre-trained on the AudioSet dataset and fine-tune it with keywords (nouns and verbs) as labels. In addition, a multi-modal attention module is proposed to integrate the acoustic and textual information in the decoder. Our single model achieves a SPIDEr score of 0.279 on the evaluation split, and our best ensemble model, which optimizes CIDEr-D via reinforcement learning, achieves a SPIDEr score of 0.291. Our code and models will be released after the competition.

System characteristics
Data augmentation Mixup, SpecAugment, SpecAugment++
PDF
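Mixup, one of the augmentations listed above, blends pairs of training examples and their label vectors. A minimal NumPy sketch follows; the Beta parameter and the toy inputs are illustrative, not the team's actual settings.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Mixup augmentation: a convex combination of two training
    examples and of their label vectors, with the mixing weight lam
    drawn from a Beta(alpha, alpha) distribution."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Mix two toy spectrogram patches with one-hot event labels.
x_mix, y_mix = mixup(np.ones(8), np.array([1.0, 0.0]),
                     np.zeros(8), np.array([0.0, 1.0]),
                     rng=np.random.default_rng(0))
```

The resulting soft labels encourage the model to behave linearly between examples, which tends to regularize training on small datasets such as Clotho.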

The DCASE 2021 Challenge Task 6 System: Automated Audio Captioning With Weakly Supervised Pre-training and Word Selection Methods

Weiqiang Yuan, Qichen Han, Dong Liu, Xiang Li, and Zhen Yang
NetEase (Hangzhou) Network Co., Ltd., China

Abstract

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. As the amount of training data is limited, we collect a large-scale weakly labelled dataset from the Web with heuristic methods, pre-train the encoder-decoder models with this dataset, and then fine-tune them on the Clotho dataset. To address word selection indeterminacy, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained audio tagging models, to guide word generation during decoding. We tested our submissions on the development-testing dataset; our best submission achieved a SPIDEr score of 31.8.

System characteristics
Data augmentation Noise enhancement
PDF
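The abstract above guides decoding with keywords and audio event tags. One simple way such guidance can be realized (a hypothetical shallow-fusion-style mechanism; the report's actual method may differ) is to add a bonus to the logits of keyword tokens before the softmax.

```python
import numpy as np

def keyword_biased_probs(logits, vocab, keywords, bonus=1.5):
    """Bias next-word probabilities toward keyword tokens by adding a
    fixed bonus to their logits before the softmax. vocab[i] is the
    word for logits[i]; keywords is a set of favoured words."""
    boosted = np.array(logits, dtype=float)
    for i, word in enumerate(vocab):
        if word in keywords:
            boosted[i] += bonus
    e = np.exp(boosted - boosted.max())
    return e / e.sum()

# With uniform logits, the keyword "dog" becomes the most likely word.
vocab = ["a", "dog", "barks", "rain"]
probs = keyword_biased_probs(np.zeros(4), vocab, keywords={"dog"})
```

Because the bias is applied per decoding step, it nudges rather than forces the decoder: a keyword still loses to a word whose logit exceeds it by more than the bonus.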