Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing the content of a general audio signal using free text. It is an inter-modal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. Given the novelty of the task, the current focus is on exploring and developing methods that can produce captions for general audio recordings. To this end, the novel Clotho dataset is used, which provides good-quality captions free of speech transcription, named entities, and hapax legomena (i.e. words that appear only once in a split).

Participants used the freely available Clotho development and evaluation splits, which provide both audio and the corresponding captions. Systems were developed without the use of any external data. The developed systems are evaluated on the captions they generate for the Clotho testing split, which does not provide reference captions for the audio. More information about Task 6: Automated Audio Captioning can be found on the task description page.

The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
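SPIDEr, the ranking metric, is the arithmetic mean of CIDEr (an n-gram consensus captioning metric) and SPICE (a semantic scene-graph metric), so the values reported in the tables below can be cross-checked directly:

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr: the mean of CIDEr and SPICE, used for the official ranking."""
    return (cider + spice) / 2.0

# Cross-check one row of the results: Wang_PKU_task6_1, Clotho testing split.
print(round(spider(0.290, 0.102), 3))  # 0.196
```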

Teams ranking

Listed here are the best systems from all teams, ranked by the SPIDEr metric. To allow a more elaborate exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside this task, since the Clotho evaluation split is freely available.

Table columns: Submission code | Best official system rank | Corresponding author | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGEL, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (same metrics)
Wang_PKU_task6_1 3 Yuexian Zou wang2020_t6 0.491 0.296 0.189 0.119 0.153 0.331 0.290 0.102 0.196 0.489 0.285 0.177 0.107 0.148 0.325 0.252 0.091 0.172
Shi_SFF_task6_3 7 Anna Shi shi2020_t6 0.435 0.254 0.163 0.099 0.117 0.299 0.172 0.069 0.121 0.423 0.247 0.158 0.097 0.115 0.294 0.168 0.066 0.117
Wu_UESTC_task6_1 11 Qianyang Wu wu2020_t6 0.378 0.030 0.000 0.000 0.063 0.262 0.024 0.000 0.012 0.379 0.020 0.000 0.000 0.063 0.261 0.024 0.001 0.012
Naranjo-Alcazar_UV_task6_2 5 Javier Naranjo-Alcazar naranjoalcazar2020_t6 0.469 0.265 0.162 0.096 0.136 0.310 0.214 0.086 0.150 0.464 0.217 0.107 0.056 0.313 0.144 0.065 0.104
Xu_SJTU_task6_4 4 Xuenan Xu xu2020_t6 0.525 0.330 0.219 0.136 0.153 0.351 0.284 0.104 0.194 0.529 0.335 0.226 0.146 0.149 0.352 0.280 0.099 0.190
Sampathkumar_TUC_task6_1 10 Arunodhayan Sampathkumar sampathkumar2020_t6 0.335 0.077 0.018 0.007 0.061 0.225 0.024 0.009 0.017 0.432 0.128 0.141 0.010 0.078 0.251 0.071 0.024 0.024
Yuma_NTT_task6_1 1 Koizumi Yuma koizumi2020_t1 0.544 0.355 0.239 0.157 0.157 0.365 0.340 0.103 0.222 0.619 0.439 0.313 0.220 0.186 0.417 0.521 0.129 0.325
Pellegrini_IRIT_task6_2 6 Thomas Pellegrini pellegrini2020_t6 0.439 0.252 0.160 0.094 0.137 0.310 0.178 0.082 0.130 0.430 0.248 0.160 0.096 0.305 0.133 0.169 0.079 0.124
Wu_BUPT_task6_4 2 Yusong Wu wuyusong2020_t6 0.519 0.327 0.217 0.141 0.154 0.349 0.323 0.106 0.214 0.532 0.341 0.227 0.149 0.157 0.354 0.340 0.108 0.224
Kuzmin_MSU_task6_1 8 Nikita Kuzmin kuzmin2020_t6 0.312 0.052 0.007 0.000 0.082 0.252 0.020 0.023 0.021
Task6_baseline 9 Konstantinos Drossos 0.344 0.082 0.023 0.000 0.066 0.234 0.022 0.013 0.018 0.389 0.136 0.055 0.015 0.084 0.262 0.074 0.033 0.054

Systems ranking

Listed here are all systems and their rankings according to the different metrics and metric groupings. First is a table with all metrics for all systems, followed by a table with only the machine translation metrics, and finally a table with only the captioning metrics.

Detailed information on each system is given in the next section.

Systems ranking, all metrics

Table columns: Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGEL, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (same metrics)
Wang_PKU_task6_1 9 wang2020_t6 0.491 0.296 0.189 0.119 0.153 0.331 0.290 0.102 0.196 0.489 0.285 0.177 0.107 0.148 0.325 0.252 0.091 0.172
Wang_PKU_task6_2 11 wang2020_t6 0.498 0.304 0.195 0.121 0.154 0.335 0.287 0.101 0.194 0.489 0.285 0.177 0.107 0.148 0.325 0.252 0.091 0.172
Wang_PKU_task6_3 10 wang2020_t6 0.495 0.301 0.193 0.121 0.155 0.336 0.288 0.101 0.195 0.489 0.285 0.177 0.107 0.148 0.325 0.252 0.091 0.172
Wang_PKU_task6_4 11 wang2020_t6 0.500 0.299 0.191 0.120 0.153 0.334 0.287 0.100 0.194 0.489 0.285 0.177 0.107 0.148 0.325 0.252 0.091 0.172
Shi_SFF_task6_1 24 shi2020_t6 0.432 0.251 0.162 0.098 0.117 0.302 0.161 0.070 0.115 0.419 0.238 0.150 0.092 0.114 0.292 0.149 0.064 0.106
Shi_SFF_task6_2 25 shi2020_t6 0.429 0.246 0.158 0.096 0.117 0.300 0.161 0.065 0.113 0.421 0.239 0.148 0.089 0.115 0.292 0.153 0.063 0.108
Shi_SFF_task6_3 20 shi2020_t6 0.435 0.254 0.163 0.099 0.117 0.299 0.172 0.069 0.121 0.423 0.247 0.158 0.097 0.115 0.294 0.168 0.066 0.117
Shi_SFF_task6_4 23 shi2020_t6 0.428 0.242 0.156 0.099 0.116 0.301 0.172 0.063 0.118 0.425 0.241 0.154 0.098 0.115 0.298 0.169 0.063 0.116
Wu_UESTC_task6_1 31 wu2020_t6 0.378 0.030 0.000 0.000 0.063 0.262 0.024 0.000 0.012 0.379 0.020 0.000 0.000 0.063 0.261 0.024 0.001 0.012
Naranjo-Alcazar_UV_task6_1 17 naranjoalcazar2020_t6 0.464 0.260 0.157 0.092 0.135 0.308 0.195 0.083 0.139 0.453 0.206 0.098 0.049 0.307 0.122 0.060 0.091
Naranjo-Alcazar_UV_task6_2 13 naranjoalcazar2020_t6 0.469 0.265 0.162 0.096 0.136 0.310 0.214 0.086 0.150 0.464 0.217 0.107 0.056 0.313 0.144 0.065 0.104
Naranjo-Alcazar_UV_task6_3 14 naranjoalcazar2020_t6 0.466 0.261 0.156 0.091 0.137 0.310 0.207 0.086 0.147 0.448 0.208 0.102 0.054 0.310 0.124 0.063 0.093
Naranjo-Alcazar_UV_task6_4 15 naranjoalcazar2020_t6 0.464 0.259 0.154 0.086 0.137 0.310 0.205 0.087 0.146 0.445 0.205 0.105 0.057 0.309 0.125 0.064 0.095
Xu_SJTU_task6_1 16 xu2020_t6 0.456 0.253 0.150 0.087 0.135 0.311 0.198 0.086 0.142 0.457 0.248 0.143 0.083 0.135 0.306 0.203 0.081 0.142
Xu_SJTU_task6_2 18 xu2020_t6 0.459 0.254 0.151 0.086 0.134 0.313 0.182 0.085 0.133 0.459 0.253 0.151 0.086 0.133 0.314 0.192 0.083 0.138
Xu_SJTU_task6_4 11 xu2020_t6 0.525 0.330 0.219 0.136 0.153 0.351 0.284 0.104 0.194 0.529 0.335 0.226 0.146 0.149 0.352 0.280 0.099 0.190
Xu_SJTU_task6_3 12 xu2020_t6 0.470 0.266 0.160 0.095 0.138 0.318 0.215 0.090 0.153 0.479 0.274 0.167 0.099 0.143 0.328 0.232 0.088 0.142
Sampathkumar_TUC_task6_1 30 sampathkumar2020_t6 0.335 0.077 0.018 0.007 0.061 0.225 0.024 0.009 0.017 0.432 0.128 0.141 0.010 0.078 0.251 0.071 0.024 0.024
Yuma_NTT_task6_1 1 koizumi2020_t1 0.544 0.355 0.239 0.157 0.157 0.365 0.340 0.103 0.222 0.619 0.439 0.313 0.220 0.186 0.417 0.521 0.129 0.325
Yuma_NTT_task6_2 2 koizumi2020_t1 0.540 0.351 0.236 0.155 0.156 0.363 0.338 0.103 0.220 0.618 0.439 0.314 0.221 0.186 0.416 0.515 0.130 0.322
Yuma_NTT_task6_3 4 koizumi2020_t1 0.537 0.349 0.233 0.150 0.156 0.358 0.330 0.103 0.216 0.618 0.441 0.315 0.221 0.186 0.417 0.527 0.129 0.328
Yuma_NTT_task6_4 3 koizumi2020_t1 0.535 0.347 0.233 0.153 0.156 0.359 0.332 0.102 0.217 0.619 0.441 0.317 0.224 0.188 0.418 0.531 0.130 0.331
Pellegrini_IRIT_task6_1 26 pellegrini2020_t6 0.426 0.225 0.131 0.072 0.125 0.295 0.136 0.072 0.104 0.436 0.234 0.138 0.076 0.301 0.124 0.140 0.072 0.106
Pellegrini_IRIT_task6_2 19 pellegrini2020_t6 0.439 0.252 0.160 0.094 0.137 0.310 0.178 0.082 0.130 0.430 0.248 0.160 0.096 0.305 0.133 0.169 0.079 0.124
Pellegrini_IRIT_task6_3 22 pellegrini2020_t6 0.430 0.248 0.154 0.089 0.116 0.292 0.171 0.068 0.119 0.426 0.247 0.157 0.094 0.283 0.112 0.165 0.063 0.114
Pellegrini_IRIT_task6_4 21 pellegrini2020_t6 0.421 0.232 0.145 0.086 0.130 0.301 0.164 0.076 0.120 0.415 0.230 0.143 0.085 0.298 0.125 0.162 0.071 0.116
Wu_BUPT_task6_1 6 wuyusong2020_t6 0.519 0.331 0.221 0.144 0.155 0.347 0.316 0.106 0.211 0.534 0.343 0.230 0.151 0.160 0.356 0.346 0.108 0.227
Wu_BUPT_task6_2 8 wuyusong2020_t6 0.510 0.318 0.210 0.137 0.149 0.342 0.302 0.101 0.202 0.530 0.340 0.228 0.151 0.155 0.355 0.339 0.108 0.223
Wu_BUPT_task6_3 7 wuyusong2020_t6 0.515 0.324 0.213 0.137 0.152 0.348 0.304 0.102 0.203 0.529 0.340 0.229 0.154 0.156 0.357 0.339 0.104 0.221
Wu_BUPT_task6_4 5 wuyusong2020_t6 0.519 0.327 0.217 0.141 0.154 0.349 0.323 0.106 0.214 0.532 0.341 0.227 0.149 0.157 0.354 0.340 0.108 0.224
Kuzmin_MSU_task6_1 27 kuzmin2020_t6 0.312 0.052 0.007 0.000 0.082 0.252 0.020 0.023 0.021
Kuzmin_MSU_task6_2 28 kuzmin2020_t6 0.361 0.094 0.028 0.007 0.069 0.248 0.027 0.014 0.020 0.424 0.159 0.067 0.027 0.093 0.288 0.115 0.042 0.078
Kuzmin_MSU_task6_3 28 kuzmin2020_t6 0.359 0.094 0.033 0.010 0.071 0.250 0.027 0.014 0.020 0.425 0.158 0.065 0.025 0.094 0.290 0.112 0.042 0.077
Kuzmin_MSU_task6_4 30 kuzmin2020_t6 0.312 0.072 0.028 0.000 0.065 0.232 0.023 0.011 0.017 0.370 0.133 0.059 0.021 0.085 0.269 0.107 0.038 0.072
Task6_baseline 29 0.344 0.082 0.023 0.000 0.066 0.234 0.022 0.013 0.018 0.389 0.136 0.055 0.015 0.084 0.262 0.074 0.033 0.054

Systems ranking, machine translation metrics

Table columns: Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGEL) | Clotho evaluation split (same metrics)
Wang_PKU_task6_1 9 wang2020_t6 0.491 0.296 0.189 0.119 0.153 0.331 0.489 0.285 0.177 0.107 0.148 0.325
Wang_PKU_task6_2 11 wang2020_t6 0.498 0.304 0.195 0.121 0.154 0.335 0.489 0.285 0.177 0.107 0.148 0.325
Wang_PKU_task6_3 10 wang2020_t6 0.495 0.301 0.193 0.121 0.155 0.336 0.489 0.285 0.177 0.107 0.148 0.325
Wang_PKU_task6_4 11 wang2020_t6 0.500 0.299 0.191 0.120 0.153 0.334 0.489 0.285 0.177 0.107 0.148 0.325
Shi_SFF_task6_1 24 shi2020_t6 0.432 0.251 0.162 0.098 0.117 0.302 0.419 0.238 0.150 0.092 0.114 0.292
Shi_SFF_task6_2 25 shi2020_t6 0.429 0.246 0.158 0.096 0.117 0.300 0.421 0.239 0.148 0.089 0.115 0.292
Shi_SFF_task6_3 20 shi2020_t6 0.435 0.254 0.163 0.099 0.117 0.299 0.423 0.247 0.158 0.097 0.115 0.294
Shi_SFF_task6_4 23 shi2020_t6 0.428 0.242 0.156 0.099 0.116 0.301 0.425 0.241 0.154 0.098 0.115 0.298
Wu_UESTC_task6_1 31 wu2020_t6 0.378 0.030 0.000 0.000 0.063 0.262 0.379 0.020 0.000 0.000 0.063 0.261
Naranjo-Alcazar_UV_task6_1 17 naranjoalcazar2020_t6 0.464 0.260 0.157 0.092 0.135 0.308 0.453 0.206 0.098 0.049 0.307
Naranjo-Alcazar_UV_task6_2 13 naranjoalcazar2020_t6 0.469 0.265 0.162 0.096 0.136 0.310 0.464 0.217 0.107 0.056 0.313
Naranjo-Alcazar_UV_task6_3 14 naranjoalcazar2020_t6 0.466 0.261 0.156 0.091 0.137 0.310 0.448 0.208 0.102 0.054 0.310
Naranjo-Alcazar_UV_task6_4 15 naranjoalcazar2020_t6 0.464 0.259 0.154 0.086 0.137 0.310 0.445 0.205 0.105 0.057 0.309
Xu_SJTU_task6_1 16 xu2020_t6 0.456 0.253 0.150 0.087 0.135 0.311 0.457 0.248 0.143 0.083 0.135 0.306
Xu_SJTU_task6_2 18 xu2020_t6 0.459 0.254 0.151 0.086 0.134 0.313 0.459 0.253 0.151 0.086 0.133 0.314
Xu_SJTU_task6_4 11 xu2020_t6 0.525 0.330 0.219 0.136 0.153 0.351 0.529 0.335 0.226 0.146 0.149 0.352
Xu_SJTU_task6_3 12 xu2020_t6 0.470 0.266 0.160 0.095 0.138 0.318 0.479 0.274 0.167 0.099 0.143 0.328
Sampathkumar_TUC_task6_1 30 sampathkumar2020_t6 0.335 0.077 0.018 0.007 0.061 0.225 0.432 0.128 0.141 0.010 0.078 0.251
Yuma_NTT_task6_1 1 koizumi2020_t1 0.544 0.355 0.239 0.157 0.157 0.365 0.619 0.439 0.313 0.220 0.186 0.417
Yuma_NTT_task6_2 2 koizumi2020_t1 0.540 0.351 0.236 0.155 0.156 0.363 0.618 0.439 0.314 0.221 0.186 0.416
Yuma_NTT_task6_3 4 koizumi2020_t1 0.537 0.349 0.233 0.150 0.156 0.358 0.618 0.441 0.315 0.221 0.186 0.417
Yuma_NTT_task6_4 3 koizumi2020_t1 0.535 0.347 0.233 0.153 0.156 0.359 0.619 0.441 0.317 0.224 0.188 0.418
Pellegrini_IRIT_task6_1 26 pellegrini2020_t6 0.426 0.225 0.131 0.072 0.125 0.295 0.436 0.234 0.138 0.076 0.301 0.124
Pellegrini_IRIT_task6_2 19 pellegrini2020_t6 0.439 0.252 0.160 0.094 0.137 0.310 0.430 0.248 0.160 0.096 0.305 0.133
Pellegrini_IRIT_task6_3 22 pellegrini2020_t6 0.430 0.248 0.154 0.089 0.116 0.292 0.426 0.247 0.157 0.094 0.283 0.112
Pellegrini_IRIT_task6_4 21 pellegrini2020_t6 0.421 0.232 0.145 0.086 0.130 0.301 0.415 0.230 0.143 0.085 0.298 0.125
Wu_BUPT_task6_1 6 wuyusong2020_t6 0.519 0.331 0.221 0.144 0.155 0.347 0.534 0.343 0.230 0.151 0.160 0.356
Wu_BUPT_task6_2 8 wuyusong2020_t6 0.510 0.318 0.210 0.137 0.149 0.342 0.530 0.340 0.228 0.151 0.155 0.355
Wu_BUPT_task6_3 7 wuyusong2020_t6 0.515 0.324 0.213 0.137 0.152 0.348 0.529 0.340 0.229 0.154 0.156 0.357
Wu_BUPT_task6_4 5 wuyusong2020_t6 0.519 0.327 0.217 0.141 0.154 0.349 0.532 0.341 0.227 0.149 0.157 0.354
Kuzmin_MSU_task6_1 27 kuzmin2020_t6 0.312 0.052 0.007 0.000 0.082 0.252
Kuzmin_MSU_task6_2 28 kuzmin2020_t6 0.361 0.094 0.028 0.007 0.069 0.248 0.424 0.159 0.067 0.027 0.093 0.288
Kuzmin_MSU_task6_3 28 kuzmin2020_t6 0.359 0.094 0.033 0.010 0.071 0.250 0.425 0.158 0.065 0.025 0.094 0.290
Kuzmin_MSU_task6_4 30 kuzmin2020_t6 0.312 0.072 0.028 0.000 0.065 0.232 0.370 0.133 0.059 0.021 0.085 0.269
Task6_baseline 29 0.344 0.082 0.023 0.000 0.066 0.234 0.389 0.136 0.055 0.015 0.084 0.262

Systems ranking, captioning metrics

Table columns: Submission code | Best official system rank | Technical report | Clotho testing split (CIDEr, SPICE, SPIDEr) | Clotho evaluation split (same metrics)
Wang_PKU_task6_1 9 wang2020_t6 0.290 0.102 0.196 0.252 0.091 0.172
Wang_PKU_task6_2 11 wang2020_t6 0.287 0.101 0.194 0.252 0.091 0.172
Wang_PKU_task6_3 10 wang2020_t6 0.288 0.101 0.195 0.252 0.091 0.172
Wang_PKU_task6_4 11 wang2020_t6 0.287 0.100 0.194 0.252 0.091 0.172
Shi_SFF_task6_1 24 shi2020_t6 0.161 0.070 0.115 0.149 0.064 0.106
Shi_SFF_task6_2 25 shi2020_t6 0.161 0.065 0.113 0.153 0.063 0.108
Shi_SFF_task6_3 20 shi2020_t6 0.172 0.069 0.121 0.168 0.066 0.117
Shi_SFF_task6_4 23 shi2020_t6 0.172 0.063 0.118 0.169 0.063 0.116
Wu_UESTC_task6_1 31 wu2020_t6 0.024 0.000 0.012 0.024 0.001 0.012
Naranjo-Alcazar_UV_task6_1 17 naranjoalcazar2020_t6 0.195 0.083 0.139 0.122 0.060 0.091
Naranjo-Alcazar_UV_task6_2 13 naranjoalcazar2020_t6 0.214 0.086 0.150 0.144 0.065 0.104
Naranjo-Alcazar_UV_task6_3 14 naranjoalcazar2020_t6 0.207 0.086 0.147 0.124 0.063 0.093
Naranjo-Alcazar_UV_task6_4 15 naranjoalcazar2020_t6 0.205 0.087 0.146 0.125 0.064 0.095
Xu_SJTU_task6_1 16 xu2020_t6 0.198 0.086 0.142 0.203 0.081 0.142
Xu_SJTU_task6_2 18 xu2020_t6 0.182 0.085 0.133 0.192 0.083 0.138
Xu_SJTU_task6_4 11 xu2020_t6 0.284 0.104 0.194 0.280 0.099 0.190
Xu_SJTU_task6_3 12 xu2020_t6 0.215 0.090 0.153 0.232 0.088 0.142
Sampathkumar_TUC_task6_1 30 sampathkumar2020_t6 0.024 0.009 0.017 0.071 0.024 0.024
Yuma_NTT_task6_1 1 koizumi2020_t1 0.340 0.103 0.222 0.521 0.129 0.325
Yuma_NTT_task6_2 2 koizumi2020_t1 0.338 0.103 0.220 0.515 0.130 0.322
Yuma_NTT_task6_3 4 koizumi2020_t1 0.330 0.103 0.216 0.527 0.129 0.328
Yuma_NTT_task6_4 3 koizumi2020_t1 0.332 0.102 0.217 0.531 0.130 0.331
Pellegrini_IRIT_task6_1 26 pellegrini2020_t6 0.136 0.072 0.104 0.140 0.072 0.106
Pellegrini_IRIT_task6_2 19 pellegrini2020_t6 0.178 0.082 0.130 0.169 0.079 0.124
Pellegrini_IRIT_task6_3 22 pellegrini2020_t6 0.171 0.068 0.119 0.165 0.063 0.114
Pellegrini_IRIT_task6_4 21 pellegrini2020_t6 0.164 0.076 0.120 0.162 0.071 0.116
Wu_BUPT_task6_1 6 wuyusong2020_t6 0.316 0.106 0.211 0.346 0.108 0.227
Wu_BUPT_task6_2 8 wuyusong2020_t6 0.302 0.101 0.202 0.339 0.108 0.223
Wu_BUPT_task6_3 7 wuyusong2020_t6 0.304 0.102 0.203 0.339 0.104 0.221
Wu_BUPT_task6_4 5 wuyusong2020_t6 0.323 0.106 0.214 0.340 0.108 0.224
Kuzmin_MSU_task6_1 27 kuzmin2020_t6 0.020 0.023 0.021
Kuzmin_MSU_task6_2 28 kuzmin2020_t6 0.027 0.014 0.020 0.115 0.042 0.078
Kuzmin_MSU_task6_3 28 kuzmin2020_t6 0.027 0.014 0.020 0.112 0.042 0.077
Kuzmin_MSU_task6_4 30 kuzmin2020_t6 0.023 0.011 0.017 0.107 0.038 0.072
Task6_baseline 29 0.022 0.013 0.018 0.074 0.033 0.054

System characteristics

Table columns: Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Encoder | Decoder | Classifier | Acoustic features | Word representation | Data augmentation | Sampling rate | Used meta-data
9 Wang_PKU_task6_1 0.196 wang2020_t6 encoder-decoder 12577360 CNN RNN-LSTM feed-forward log-mel energies one-hot SpecAugment 44.1kHz No
11 Wang_PKU_task6_2 0.194 wang2020_t6 encoder-decoder 12577360 CNN RNN-LSTM feed-forward log-mel energies one-hot SpecAugment 44.1kHz No
10 Wang_PKU_task6_3 0.195 wang2020_t6 encoder-decoder 12577360 CNN RNN-LSTM feed-forward log-mel energies one-hot SpecAugment 44.1kHz No
11 Wang_PKU_task6_4 0.194 wang2020_t6 encoder-decoder 12577360 CNN RNN-LSTM feed-forward log-mel energies one-hot SpecAugment 44.1kHz No
24 Shi_SFF_task6_1 0.115 shi2020_t6 seq2seq transformer encoder feed-forward log-mel energies one-hot temporal-frequency shift 44.1kHz No
25 Shi_SFF_task6_2 0.113 shi2020_t6 seq2seq transformer encoder feed-forward log-mel energies one-hot temporal-frequency shift 44.1kHz No
20 Shi_SFF_task6_3 0.121 shi2020_t6 seq2seq transformer encoder transformer decoder feed-forward log-mel energies one-hot temporal-frequency shift 44.1kHz No
23 Shi_SFF_task6_4 0.118 shi2020_t6 seq2seq transformer encoder transformer decoder feed-forward log-mel energies one-hot temporal-frequency shift 44.1kHz No
31 Wu_UESTC_task6_1 0.012 wu2020_t6 seq2seq 60730943 CNN multi-layer RNN-GRU feed-forward log-mel energies one-hot 44.1kHz No
17 Naranjo-Alcazar_UV_task6_1 0.139 naranjoalcazar2020_t6 encoder-decoder 38734544 CNN RNN-LSTM feed-forward log-Gammatone spectrogram one-hot 44.1kHz No
13 Naranjo-Alcazar_UV_task6_2 0.150 naranjoalcazar2020_t6 encoder-decoder 57726672 CNN RNN-LSTM feed-forward log-Gammatone spectrogram one-hot 44.1kHz No
14 Naranjo-Alcazar_UV_task6_3 0.147 naranjoalcazar2020_t6 encoder-decoder 73370320 CNN RNN-LSTM feed-forward log-Gammatone spectrogram one-hot 44.1kHz No
15 Naranjo-Alcazar_UV_task6_4 0.146 naranjoalcazar2020_t6 encoder-decoder 140064208 CNN RNN-LSTM feed-forward log-Gammatone spectrogram one-hot 44.1kHz No
16 Xu_SJTU_task6_1 0.142 xu2020_t6 seq2seq 5224055 CRNN-BGRU RNN-GRU feed-forward log-mel energies embeddings 44.1kHz No
18 Xu_SJTU_task6_2 0.133 xu2020_t6 seq2seq 5224055 CRNN-BGRU RNN-GRU feed-forward log-mel energies embeddings 44.1kHz No
11 Xu_SJTU_task6_4 0.194 xu2020_t6 seq2seq 5224055 CRNN-BGRU RNN-GRU feed-forward log-mel energies embeddings 44.1kHz No
12 Xu_SJTU_task6_3 0.153 xu2020_t6 seq2seq 10448110 CRNN-BGRU RNN-GRU feed-forward log-mel energies embeddings 44.1kHz No
30 Sampathkumar_TUC_task6_1 0.017 sampathkumar2020_t6 seq2seq 5756431 multi-layer RNN-BGRU RNN-GRU feed-forward log-mel energies embedding 44.1kHz No
1 Yuma_NTT_task6_1 0.222 koizumi2020_t1 seq2seq, keyword estimation, sentence length estimation 32994840 multi-layer RNN-BLSTM RNN-LSTM feed-forward log-mel energies embeddings mix-up, TF-IDF-based word replacement, random data cropping 22.05kHz Yes
2 Yuma_NTT_task6_2 0.220 koizumi2020_t1 seq2seq, keyword estimation, sentence length estimation 82487110 multi-layer RNN-BLSTM RNN-LSTM feed-forward log-mel energies embeddings mix-up, TF-IDF-based word replacement, random data cropping 22.05kHz Yes
4 Yuma_NTT_task6_3 0.216 koizumi2020_t1 seq2seq, keyword estimation, sentence length estimation 20670182 multi-layer RNN-BLSTM RNN-LSTM feed-forward log-mel energies embeddings mix-up, TF-IDF-based word replacement, random data cropping 22.05kHz Yes
3 Yuma_NTT_task6_4 0.217 koizumi2020_t1 seq2seq, keyword estimation, sentence length estimation 51675455 multi-layer RNN-BLSTM RNN-LSTM feed-forward log-mel energies embeddings mix-up, TF-IDF-based word replacement, random data cropping 22.05kHz Yes
26 Pellegrini_IRIT_task6_1 0.104 pellegrini2020_t6 seq2seq 2887375 multi-layer RNN-pBLSTM multi-layer RNN-LSTM feed-forward, greedy search log-mel energies one-hot 44.1kHz No
19 Pellegrini_IRIT_task6_2 0.130 pellegrini2020_t6 seq2seq 2887375 multi-layer RNN-pBLSTM multi-layer RNN-LSTM feed-forward, beam search log-mel energies one-hot 44.1kHz No
22 Pellegrini_IRIT_task6_3 0.119 pellegrini2020_t6 seq2seq 2887375 multi-layer RNN-pBLSTM multi-layer RNN-LSTM feed-forward, beam search with LM log-mel energies one-hot 44.1kHz No
21 Pellegrini_IRIT_task6_4 0.120 pellegrini2020_t6 seq2seq 2120744 multi-layer RNN-pBLSTM multi-layer RNN-LSTM feed-forward, greedy search log-mel energies one-hot 44.1kHz No
6 Wu_BUPT_task6_1 0.211 wuyusong2020_t6 encoder-decoder 8901648 CNN Transformer feed-forward log-mel energies embeddings SpecAugment 44.1kHz No
8 Wu_BUPT_task6_2 0.202 wuyusong2020_t6 encoder-decoder 8901648 CNN Transformer feed-forward log-mel energies embeddings SpecAugment 44.1kHz No
7 Wu_BUPT_task6_3 0.203 wuyusong2020_t6 encoder-decoder 8901648 CNN Transformer feed-forward log-mel energies embeddings SpecAugment 44.1kHz No
5 Wu_BUPT_task6_4 0.214 wuyusong2020_t6 encoder-decoder 8901648 CNN Transformer feed-forward log-mel energies embeddings SpecAugment 44.1kHz No
27 Kuzmin_MSU_task6_1 0.021 kuzmin2020_t6 seq2seq 4804112 multi-layer RNN-GRU RNN-GRU feed-forward log-mel energies one-hot mix-up, reverb, pitch, overdrive, speed 44.1kHz No
28 Kuzmin_MSU_task6_2 0.020 kuzmin2020_t6 seq2seq 15178255 multi-layer RNN-GRU RNN-GRU feed-forward log-mel energies one-hot mix-up 44.1kHz No
28 Kuzmin_MSU_task6_3 0.020 kuzmin2020_t6 seq2seq 15178255 multi-layer RNN-GRU RNN-GRU feed-forward log-mel energies one-hot mix-up, reverb, pitch, overdrive, speed 44.1kHz No
30 Kuzmin_MSU_task6_4 0.017 kuzmin2020_t6 seq2seq 4804112 multi-layer RNN-GRU RNN-GRU feed-forward log-mel energies one-hot mix-up 44.1kHz No
29 Task6_baseline 0.018 seq2seq 5012931 multi-layer RNN-GRU multi-layer RNN-GRU feed-forward log-mel energies one-hot 44.1kHz No



Technical reports

The NTT DCASE2020 Challenge Task 6 System: Automated Audio Captioning With Keywords and Sentence Length Estimation

Yuma Koizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino
NTT Corporation, Japan

Abstract

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word-selection indeterminacy and sentence-length indeterminacy. We simultaneously solve the main caption-generation problem and these sub-problems by estimating keywords and sentence length through multi-task learning. We tested a simplified model of our submission on the development-testing dataset. Our model achieved a SPIDEr score of 20.7, whereas the baseline system achieved 5.4.
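The multi-task setup described above can be sketched as a weighted sum of the three losses. The weights below are illustrative placeholders, not values reported by the authors:

```python
# Hypothetical multi-task objective: caption cross-entropy plus auxiliary
# keyword-estimation and sentence-length-estimation losses.
def multitask_loss(l_caption, l_keyword, l_length, w_kw=0.5, w_len=0.1):
    # The auxiliary weights trade caption quality against the sub-tasks.
    return l_caption + w_kw * l_keyword + w_len * l_length

print(multitask_loss(2.0, 1.0, 0.5))  # ≈ 2.55
```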

System characteristics
Method scheme/architecture seq2seq, keyword estimation, sentence length estimation
Encoder RNN-BLSTM
Decoder RNN-LSTM
Alignment mechanism self-attention
Classifier feed-forward
Amount of parameters 3299484
Sampling rate 22.05kHz
Audio features log-mel energies
Word representation embeddings
Data augmentation mix-up, TF-IDF-based word replacement, random data cropping

Automated Audio Captioning

Nikita Kuzmin and Alexander Dyakonov
Moscow State University, CMC Faculty, Mathematical Methods of Forecasting Dept. GSP-1, 1-52, Leninskiye Gory Moscow, 119991, Russia

Abstract

This task can be stated as the automated generation of a textual content description from a raw audio file. We propose a method for the automated audio captioning task and examine the impact of augmentations (MixUp, Reverb, Pitch, Overdrive, Speed) on its performance. Our method is based on a modified encoder-decoder architecture. The encoder consists of three bidirectional gated recurrent units (GRUs); the decoder consists of one GRU and one fully connected layer for classification. The encoder input is log-mel spectrogram features for every part of the audio file, segmented with a Hann window [1] of 1024 samples and 50% overlap. The decoder output is a matrix of word probabilities for each position in a sentence. We used the BLEU1, BLEU2, BLEU3, BLEU4, ROUGEL, METEOR, CIDEr, SPICE, and SPIDEr metrics to compare methods.
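The front-end segmentation described above (1024-sample Hann window, 50% overlap, i.e. a 512-sample hop) can be sketched as follows; `frame_signal` is an illustrative helper, not code from the report:

```python
import numpy as np

def frame_signal(x, win_len=1024, hop=512):
    """Segment a signal into 50%-overlapping Hann-windowed frames
    (no padding), as in the described log-mel front end."""
    n_frames = 1 + (len(x) - win_len) // hop
    window = np.hanning(win_len)
    return np.stack([x[i * hop : i * hop + win_len] * window
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(44100))  # one second at 44.1 kHz
print(frames.shape)  # (85, 1024)
```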

System characteristics
Method scheme/architecture seq2seq
Encoder RNN-GRU
Decoder RNN-GRU
Alignment mechanism attention, vector2sequence
Classifier feed-forward
Amount of parameters 4804112
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation one-hot
Data augmentation mix-up, reverb, pitch, overdrive, speed

Task 6 DCASE 2020: Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation

Javier Naranjo-Alcazar1, and Sergi Perez-Castanos, and Pedro Zuccarello1, and Maximo Cobos1
1Computer Science Department, Universitat de València, Burjassot, Spain

Abstract

Automated audio captioning is a machine listening task whose goal is to describe an audio signal using free text. An automated audio captioning system accepts an audio signal as input and outputs a textual description, that is, the caption of the signal. This task can be useful in many applications, such as automatic content description or machine-to-machine interaction. In this technical report, an automated audio captioning approach based on residual learning in the encoder phase is proposed. The encoder phase is implemented via different Residual Network configurations, and the decoder phase (which creates the caption) uses recurrent layers plus an attention mechanism. The chosen audio representation is the Gammatone spectrogram. Results show that the framework proposed in this work surpasses the baseline system, improving all metrics.

System characteristics
Method scheme/architecture encoder-decoder
Encoder CNN
Decoder RNN-LSTM
Alignment mechanism attention
Classifier feed-forward
Amount of parameters 38734544
Sampling rate 44.1kHz
Audio features log-Gammatone spectrogram
Word representation one-hot

IRIT-UPS DCASE 2020 audio captioning system

Thomas Pellegrini
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France

Abstract

This technical report is a short description of the sequence-to-sequence model used in DCASE 2020 Task 6, dedicated to audio captioning. Four submissions were made: i) a baseline one using greedy search, ii) one using beam search, iii) one using beam search integrating a 2-gram language model, and iv) one with a model trained with a vocabulary limited to the most frequent word types (1k words instead of about 5k words).
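As a minimal illustration of the greedy-versus-beam distinction (submissions i and ii), the sketch below runs beam search over a fixed table of per-step log-probabilities; a real decoder would condition each step on the decoded prefix, which is where beam search departs from greedy search:

```python
import math

def beam_search(step_logprobs, beam_width=3):
    """Minimal beam search over a fixed table of per-step log-probabilities.
    step_logprobs[t][w] is the log-probability of word w at step t
    (a stand-in for a decoder call)."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for logprobs in step_logprobs:
        candidates = [(seq + [w], score + lp)
                      for seq, score in beams
                      for w, lp in enumerate(logprobs)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep the top-k hypotheses
    return beams[0]

seq, score = beam_search([[math.log(0.6), math.log(0.4)],
                          [math.log(0.3), math.log(0.7)]])
print(seq)  # [0, 1]
```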

System characteristics
Method scheme/architecture seq2seq
Encoder RNN-pBLSTM
Decoder RNN-LSTM
Alignment mechanism attention
Classifier feed-forward, greedy search
Amount of parameters 2887375
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation one-hot

Automated Audio Captioning

Arunodhayan Sampathkumar and Danny Kowerko
Technische Universität Chemnitz, Juniorprofessur Media Computing, Chemnitz, Germany

Abstract

Audio captioning is a novel approach to describing an audio scene based on human-like perception. Human-like perception of audio events not only performs detection and localization, but also tries to summarize the relationships between different audio events. DCASE2020 has provided a strongly labelled caption dataset for automated audio captioning. In this research, the mel spectrogram is used to extract the audio features, and a Recurrent Neural Network (RNN) encoder-decoder is trained on the dataset. Finally, the network is evaluated using the MS COCO metrics, where the BLEU1 and BLEU3 scores were strong; this is discussed in detail in section 5.

System characteristics
Method scheme/architecture seq2seq
Encoder RNN-BGRU
Decoder RNN-GRU
Alignment mechanism identity
Classifier feed-forward
Amount of parameters 16521
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation embeddings

Audio Captioning With the Transformer

Anna Shi
ShuangFeng First, Beijing, China

Abstract

In this technical report, we present the techniques and models applied to our submission for DCASE 2020 Task 6: automated audio captioning. We focus primarily on how to apply transformer methods efficiently to deal with large amounts of audio data. Our experiments with the public DCASE2020 Challenge Task 6 Clotho evaluation data resulted in a SPIDEr of 0.1171, while the SPIDEr of the official baseline is 0.054.

System characteristics
Method scheme/architecture seq2seq
Encoder transformer encoder
Decoder transformer decoder
Alignment mechanism self-attention
Classifier feed-forward
Amount of parameters Not reported
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation one-hot
Data augmentation temporal-frequency shift

Automated Audio Captioning With Temporal Attention

Helin Wang1, Bang Yang1, Yuexian Zou1,2 and Dading Chong1
1ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2Peng Cheng Laboratory, Shenzhen, China

Abstract

This technical report describes the ADSPLAB team's submission for Task 6 of the DCASE2020 Challenge (automated audio captioning). Our audio captioning system is based on the sequence-to-sequence model. A convolutional neural network (CNN) is used as the encoder, and a long short-term memory (LSTM)-based decoder with temporal attention is used to generate the captions. No extra data, pre-trained models, or extra annotations are employed. The experimental results show that our system achieves a SPIDEr of 0.172 (official baseline: 0.054) on the evaluation split of the Clotho dataset.

System characteristics
Method scheme/architecture encoder-decoder
Encoder CNN
Decoder RNN-LSTM
Alignment mechanism attention
Classifier feed-forward
Amount of parameters 12577360
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation one-hot
Data augmentation SpecAugment

Automatic Audio Captioning System Based on Convolutional Neural Network

Qianyang Wu, Shengqi Tao, and Xingyu Yang
University of Electronic Science and Technology of China Communication Engineering Dept Chengdu,China

Abstract

Automated audio captioning has emerged as a new problem in natural language processing (NLP) in recent years. The key point of an automated audio captioning system is that it describes audio signals in the form of natural language: the system takes audio as input and outputs descriptive sentences. Most approaches use a seq2seq model with RNNs as both the encoder and decoder, which results in considerable training time. This paper proposes a neural network with a CNN as the encoder and a GRU as the decoder. The encoder is based on VGG16, which has deeper networks and three fully connected layers. Despite the low prediction accuracy, our model decreases the training time significantly, showing that a CNN can be a viable choice for automated audio captioning.

System characteristics
Method scheme/architecture encoder-decoder
Encoder CNN
Decoder RNN-GRU
Alignment mechanism attention
Classifier feed-forward
Amount of parameters 60730943
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation one-hot

Audio Captioning Based on Transformer and Pre-Training for 2020 DCASE Audio Captioning Challenge

Yusong Wu1, Kun Chen1, Ziyue Wang2, Xuan Zhang2, Fudong Nian3, Shengchen Li1, and Xi Shao2
1Beijing University of Posts and Telecommunications, Beijing, China, 2Nanjing University of Posts and Telecommunications, Nanjing, China, 3Anhui University, Anhui, China

Abstract

This report proposes an automated audio captioning model for the 2020 DCASE audio captioning challenge. In this challenge, a model must be trained from scratch to generate natural-language descriptions of a given audio signal. However, with limited data available and restrictions on using pre-trained models trained on external data, training directly from scratch can result in poor performance, with acoustic events and language poorly modeled. For better acoustic event and language modeling, a sequence-to-sequence model is proposed which consists of a CNN encoder and a Transformer decoder. In the proposed model, the encoder and word embedding are first pre-trained. Regularization and data augmentation are applied during training, and fine-tuning is applied afterwards. Experiments show that the proposed model achieves a SPIDEr score of 0.227 on audio captioning performance.

System characteristics
Method scheme/architecture encoder-decoder
Encoder CNN
Decoder Transformer
Alignment mechanism attention, self-attention
Classifier feed-forward
Amount of parameters 8901648
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation embeddings
Data augmentation SpecAugment

The SJTU Submission for DCASE2020 Task 6: A CRNN-GRU Based Reinforcement Learning Approach to Audiocaption

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu
MoE Key Lab of Artificial Intelligence SpeechLab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This paper proposes the SJTU AudioCaption system for the DCASE2020 Task 6 challenge. Our system consists of a powerful CRNN encoder combined with a GRU decoder. In addition to standard cross-entropy training, reinforcement learning is also investigated. Our approach significantly improves on the challenge baseline model across all shown metrics, achieving a relative improvement of at least 34%. Our best submission achieves a BLEU4 of 0.146, ROUGE-L of 0.352, CIDEr of 0.280, METEOR of 0.149, and SPICE of 0.099 on the Clotho evaluation set.
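The reinforcement-learning component is not detailed here, but a common choice for optimizing caption metrics (self-critical sequence training) scales the sampled caption's log-likelihood by the difference between its metric score and that of a greedily decoded baseline. The sketch below shows such an advantage term under that assumption; it is not necessarily the authors' exact formulation:

```python
def self_critical_advantage(sampled_score: float, greedy_score: float) -> float:
    """Advantage that weights the sampled caption's log-probability:
    positive when the sample beats the greedy baseline, negative otherwise."""
    return sampled_score - greedy_score

# A sampled caption scoring CIDEr 0.30 against a greedy baseline of 0.25
# would be reinforced; the reverse would be suppressed.
print(round(self_critical_advantage(0.30, 0.25), 2))  # 0.05
```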

System characteristics
Method scheme/architecture seq2seq
Encoder CRNN-BGRU
Decoder RNN-GRU
Alignment mechanism vector2sequence
Classifier feed-forward
Amount of parameters 5224055
Sampling rate 44.1kHz
Audio features log-mel energies
Word representation embeddings