Task description
Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal. Given the novelty of the audio captioning task, the current focus is on exploring and developing methods that can provide captions for general audio recordings. To this end, the novel Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e., words that appear only once in a split).
Participants used the freely available Clotho development and evaluation splits, which provide both the audio and the corresponding captions. The systems were developed without the use of any external data. The developed systems are evaluated on their generated captions using the Clotho testing split, which does not provide the corresponding captions for the audio. More information about Task 6: Automated Audio Captioning can be found on the task description page.
The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
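For reference, SPIDEr (Liu et al., 2017) is simply the arithmetic mean of the CIDEr and SPICE scores, combining CIDEr's sensitivity to consensus n-grams with SPICE's sensitivity to semantic content. A minimal sketch in Python:

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr (Liu et al., 2017): the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider + spice)

# Example with the baseline system's Clotho evaluation-split scores from the
# tables below: CIDEr = 0.074, SPICE = 0.033 -> SPIDEr = 0.0535 (~0.054 as reported).
print(spider(0.074, 0.033))
```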
Teams ranking
Listed here are the best systems from all teams. The ranking is based on the SPIDEr metric. To allow a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all metrics employed in the task. Values are given for both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside this task, since the Clotho evaluation split is freely available.
In the following tables, "(test)" denotes the Clotho testing split and "(eval)" the Clotho evaluation split.

| Submission code | Best official system rank | Corresponding author | Technical report | BLEU1 (test) | BLEU2 (test) | BLEU3 (test) | BLEU4 (test) | METEOR (test) | ROUGEL (test) | CIDEr (test) | SPICE (test) | SPIDEr (test) | BLEU1 (eval) | BLEU2 (eval) | BLEU3 (eval) | BLEU4 (eval) | METEOR (eval) | ROUGEL (eval) | CIDEr (eval) | SPICE (eval) | SPIDEr (eval) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang_PKU_task6_1 | 3 | Yuexian Zou | wang2020_t6 | 0.491 | 0.296 | 0.189 | 0.119 | 0.153 | 0.331 | 0.290 | 0.102 | 0.196 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | 0.252 | 0.091 | 0.172 | |
Shi_SFF_task6_3 | 7 | Anna Shi | shi2020_t6 | 0.435 | 0.254 | 0.163 | 0.099 | 0.117 | 0.299 | 0.172 | 0.069 | 0.121 | 0.423 | 0.247 | 0.158 | 0.097 | 0.115 | 0.294 | 0.168 | 0.066 | 0.117 | |
Wu_UESTC_task6_1 | 11 | Qianyang Wu | wu2020_t6 | 0.378 | 0.030 | 0.000 | 0.000 | 0.063 | 0.262 | 0.024 | 0.000 | 0.012 | 0.379 | 0.020 | 0.000 | 0.000 | 0.063 | 0.261 | 0.024 | 0.001 | 0.012 | |
Naranjo-Alcazar_UV_task6_2 | 5 | Javier Naranjo-Alcazar | naranjoalcazar2020_t6 | 0.469 | 0.265 | 0.162 | 0.096 | 0.136 | 0.310 | 0.214 | 0.086 | 0.150 | 0.464 | 0.217 | 0.107 | 0.056 | | 0.313 | 0.144 | 0.065 | 0.104 |
Xu_SJTU_task6_4 | 4 | Xuenan Xu | xu2020_t6 | 0.525 | 0.330 | 0.219 | 0.136 | 0.153 | 0.351 | 0.284 | 0.104 | 0.194 | 0.529 | 0.335 | 0.226 | 0.146 | 0.149 | 0.352 | 0.280 | 0.099 | 0.190 | |
Sampathkumar_TUC_task6_1 | 10 | Arunodhayan Sampathkumar | sampathkumar2020_t6 | 0.335 | 0.077 | 0.018 | 0.007 | 0.061 | 0.225 | 0.024 | 0.009 | 0.017 | 0.432 | 0.128 | 0.141 | 0.010 | 0.078 | 0.251 | 0.071 | 0.024 | 0.024 | |
Yuma_NTT_task6_1 | 1 | Koizumi Yuma | koizumi2020_t1 | 0.544 | 0.355 | 0.239 | 0.157 | 0.157 | 0.365 | 0.340 | 0.103 | 0.222 | 0.619 | 0.439 | 0.313 | 0.220 | 0.186 | 0.417 | 0.521 | 0.129 | 0.325 | |
Pellegrini_IRIT_task6_2 | 6 | Thomas Pellegrini | pellegrini2020_t6 | 0.439 | 0.252 | 0.160 | 0.094 | 0.137 | 0.310 | 0.178 | 0.082 | 0.130 | 0.430 | 0.248 | 0.160 | 0.096 | 0.133 | 0.305 | 0.169 | 0.079 | 0.124 |
Wu_BUPT_task6_4 | 2 | Yusong Wu | wuyusong2020_t6 | 0.519 | 0.327 | 0.217 | 0.141 | 0.154 | 0.349 | 0.323 | 0.106 | 0.214 | 0.532 | 0.341 | 0.227 | 0.149 | 0.157 | 0.354 | 0.340 | 0.108 | 0.224 | |
Kuzmin_MSU_task6_1 | 8 | Nikita Kuzmin | kuzmin2020_t6 | 0.312 | 0.052 | 0.007 | 0.000 | 0.082 | 0.252 | 0.020 | 0.023 | 0.021 | | | | | | | | | |
Task6_baseline | 9 | Konstantinos Drossos | | 0.344 | 0.082 | 0.023 | 0.000 | 0.066 | 0.234 | 0.022 | 0.013 | 0.018 | 0.389 | 0.136 | 0.055 | 0.015 | 0.084 | 0.262 | 0.074 | 0.033 | 0.054 |
Systems ranking
Listed here are all systems and their rankings according to the different metrics and metric groupings. First is a table with all systems and all metrics, then a table with all systems but only the machine translation metrics, and finally a table with all systems but only the captioning metrics.
Detailed information on each system is given in the next section.
Systems ranking, all metrics
| Submission code | Best official system rank | Technical report | BLEU1 (test) | BLEU2 (test) | BLEU3 (test) | BLEU4 (test) | METEOR (test) | ROUGEL (test) | CIDEr (test) | SPICE (test) | SPIDEr (test) | BLEU1 (eval) | BLEU2 (eval) | BLEU3 (eval) | BLEU4 (eval) | METEOR (eval) | ROUGEL (eval) | CIDEr (eval) | SPICE (eval) | SPIDEr (eval) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang_PKU_task6_1 | 9 | wang2020_t6 | 0.491 | 0.296 | 0.189 | 0.119 | 0.153 | 0.331 | 0.290 | 0.102 | 0.196 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_2 | 11 | wang2020_t6 | 0.498 | 0.304 | 0.195 | 0.121 | 0.154 | 0.335 | 0.287 | 0.101 | 0.194 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_3 | 10 | wang2020_t6 | 0.495 | 0.301 | 0.193 | 0.121 | 0.155 | 0.336 | 0.288 | 0.101 | 0.195 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_4 | 11 | wang2020_t6 | 0.500 | 0.299 | 0.191 | 0.120 | 0.153 | 0.334 | 0.287 | 0.100 | 0.194 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | 0.252 | 0.091 | 0.172 | |
Shi_SFF_task6_1 | 24 | shi2020_t6 | 0.432 | 0.251 | 0.162 | 0.098 | 0.117 | 0.302 | 0.161 | 0.070 | 0.115 | 0.419 | 0.238 | 0.150 | 0.092 | 0.114 | 0.292 | 0.149 | 0.064 | 0.106 | |
Shi_SFF_task6_2 | 25 | shi2020_t6 | 0.429 | 0.246 | 0.158 | 0.096 | 0.117 | 0.300 | 0.161 | 0.065 | 0.113 | 0.421 | 0.239 | 0.148 | 0.089 | 0.115 | 0.292 | 0.153 | 0.063 | 0.108 | |
Shi_SFF_task6_3 | 20 | shi2020_t6 | 0.435 | 0.254 | 0.163 | 0.099 | 0.117 | 0.299 | 0.172 | 0.069 | 0.121 | 0.423 | 0.247 | 0.158 | 0.097 | 0.115 | 0.294 | 0.168 | 0.066 | 0.117 | |
Shi_SFF_task6_4 | 23 | shi2020_t6 | 0.428 | 0.242 | 0.156 | 0.099 | 0.116 | 0.301 | 0.172 | 0.063 | 0.118 | 0.425 | 0.241 | 0.154 | 0.098 | 0.115 | 0.298 | 0.169 | 0.063 | 0.116 | |
Wu_UESTC_task6_1 | 31 | wu2020_t6 | 0.378 | 0.030 | 0.000 | 0.000 | 0.063 | 0.262 | 0.024 | 0.000 | 0.012 | 0.379 | 0.020 | 0.000 | 0.000 | 0.063 | 0.261 | 0.024 | 0.001 | 0.012 | |
Naranjo-Alcazar_UV_task6_1 | 17 | naranjoalcazar2020_t6 | 0.464 | 0.260 | 0.157 | 0.092 | 0.135 | 0.308 | 0.195 | 0.083 | 0.139 | 0.453 | 0.206 | 0.098 | 0.049 | | 0.307 | 0.122 | 0.060 | 0.091 |
Naranjo-Alcazar_UV_task6_2 | 13 | naranjoalcazar2020_t6 | 0.469 | 0.265 | 0.162 | 0.096 | 0.136 | 0.310 | 0.214 | 0.086 | 0.150 | 0.464 | 0.217 | 0.107 | 0.056 | | 0.313 | 0.144 | 0.065 | 0.104 |
Naranjo-Alcazar_UV_task6_3 | 14 | naranjoalcazar2020_t6 | 0.466 | 0.261 | 0.156 | 0.091 | 0.137 | 0.310 | 0.207 | 0.086 | 0.147 | 0.448 | 0.208 | 0.102 | 0.054 | | 0.310 | 0.124 | 0.063 | 0.093 |
Naranjo-Alcazar_UV_task6_4 | 15 | naranjoalcazar2020_t6 | 0.464 | 0.259 | 0.154 | 0.086 | 0.137 | 0.310 | 0.205 | 0.087 | 0.146 | 0.445 | 0.205 | 0.105 | 0.057 | | 0.309 | 0.125 | 0.064 | 0.095 |
Xu_SJTU_task6_1 | 16 | xu2020_t6 | 0.456 | 0.253 | 0.150 | 0.087 | 0.135 | 0.311 | 0.198 | 0.086 | 0.142 | 0.457 | 0.248 | 0.143 | 0.083 | 0.135 | 0.306 | 0.203 | 0.081 | 0.142 | |
Xu_SJTU_task6_2 | 18 | xu2020_t6 | 0.459 | 0.254 | 0.151 | 0.086 | 0.134 | 0.313 | 0.182 | 0.085 | 0.133 | 0.459 | 0.253 | 0.151 | 0.086 | 0.133 | 0.314 | 0.192 | 0.083 | 0.138 | |
Xu_SJTU_task6_4 | 11 | xu2020_t6 | 0.525 | 0.330 | 0.219 | 0.136 | 0.153 | 0.351 | 0.284 | 0.104 | 0.194 | 0.529 | 0.335 | 0.226 | 0.146 | 0.149 | 0.352 | 0.280 | 0.099 | 0.190 | |
Xu_SJTU_task6_3 | 12 | xu2020_t6 | 0.470 | 0.266 | 0.160 | 0.095 | 0.138 | 0.318 | 0.215 | 0.090 | 0.153 | 0.479 | 0.274 | 0.167 | 0.099 | 0.143 | 0.328 | 0.232 | 0.088 | 0.142 | |
Sampathkumar_TUC_task6_1 | 30 | sampathkumar2020_t6 | 0.335 | 0.077 | 0.018 | 0.007 | 0.061 | 0.225 | 0.024 | 0.009 | 0.017 | 0.432 | 0.128 | 0.141 | 0.010 | 0.078 | 0.251 | 0.071 | 0.024 | 0.024 | |
Yuma_NTT_task6_1 | 1 | koizumi2020_t1 | 0.544 | 0.355 | 0.239 | 0.157 | 0.157 | 0.365 | 0.340 | 0.103 | 0.222 | 0.619 | 0.439 | 0.313 | 0.220 | 0.186 | 0.417 | 0.521 | 0.129 | 0.325 | |
Yuma_NTT_task6_2 | 2 | koizumi2020_t1 | 0.540 | 0.351 | 0.236 | 0.155 | 0.156 | 0.363 | 0.338 | 0.103 | 0.220 | 0.618 | 0.439 | 0.314 | 0.221 | 0.186 | 0.416 | 0.515 | 0.130 | 0.322 | |
Yuma_NTT_task6_3 | 4 | koizumi2020_t1 | 0.537 | 0.349 | 0.233 | 0.150 | 0.156 | 0.358 | 0.330 | 0.103 | 0.216 | 0.618 | 0.441 | 0.315 | 0.221 | 0.186 | 0.417 | 0.527 | 0.129 | 0.328 | |
Yuma_NTT_task6_4 | 3 | koizumi2020_t1 | 0.535 | 0.347 | 0.233 | 0.153 | 0.156 | 0.359 | 0.332 | 0.102 | 0.217 | 0.619 | 0.441 | 0.317 | 0.224 | 0.188 | 0.418 | 0.531 | 0.130 | 0.331 | |
Pellegrini_IRIT_task6_1 | 26 | pellegrini2020_t6 | 0.426 | 0.225 | 0.131 | 0.072 | 0.125 | 0.295 | 0.136 | 0.072 | 0.104 | 0.436 | 0.234 | 0.138 | 0.076 | 0.124 | 0.301 | 0.140 | 0.072 | 0.106 |
Pellegrini_IRIT_task6_2 | 19 | pellegrini2020_t6 | 0.439 | 0.252 | 0.160 | 0.094 | 0.137 | 0.310 | 0.178 | 0.082 | 0.130 | 0.430 | 0.248 | 0.160 | 0.096 | 0.133 | 0.305 | 0.169 | 0.079 | 0.124 |
Pellegrini_IRIT_task6_3 | 22 | pellegrini2020_t6 | 0.430 | 0.248 | 0.154 | 0.089 | 0.116 | 0.292 | 0.171 | 0.068 | 0.119 | 0.426 | 0.247 | 0.157 | 0.094 | 0.112 | 0.283 | 0.165 | 0.063 | 0.114 |
Pellegrini_IRIT_task6_4 | 21 | pellegrini2020_t6 | 0.421 | 0.232 | 0.145 | 0.086 | 0.130 | 0.301 | 0.164 | 0.076 | 0.120 | 0.415 | 0.230 | 0.143 | 0.085 | 0.125 | 0.298 | 0.162 | 0.071 | 0.116 |
Wu_BUPT_task6_1 | 6 | wuyusong2020_t6 | 0.519 | 0.331 | 0.221 | 0.144 | 0.155 | 0.347 | 0.316 | 0.106 | 0.211 | 0.534 | 0.343 | 0.230 | 0.151 | 0.160 | 0.356 | 0.346 | 0.108 | 0.227 | |
Wu_BUPT_task6_2 | 8 | wuyusong2020_t6 | 0.510 | 0.318 | 0.210 | 0.137 | 0.149 | 0.342 | 0.302 | 0.101 | 0.202 | 0.530 | 0.340 | 0.228 | 0.151 | 0.155 | 0.355 | 0.339 | 0.108 | 0.223 | |
Wu_BUPT_task6_3 | 7 | wuyusong2020_t6 | 0.515 | 0.324 | 0.213 | 0.137 | 0.152 | 0.348 | 0.304 | 0.102 | 0.203 | 0.529 | 0.340 | 0.229 | 0.154 | 0.156 | 0.357 | 0.339 | 0.104 | 0.221 | |
Wu_BUPT_task6_4 | 5 | wuyusong2020_t6 | 0.519 | 0.327 | 0.217 | 0.141 | 0.154 | 0.349 | 0.323 | 0.106 | 0.214 | 0.532 | 0.341 | 0.227 | 0.149 | 0.157 | 0.354 | 0.340 | 0.108 | 0.224 | |
Kuzmin_MSU_task6_1 | 27 | kuzmin2020_t6 | 0.312 | 0.052 | 0.007 | 0.000 | 0.082 | 0.252 | 0.020 | 0.023 | 0.021 | | | | | | | | | |
Kuzmin_MSU_task6_2 | 28 | kuzmin2020_t6 | 0.361 | 0.094 | 0.028 | 0.007 | 0.069 | 0.248 | 0.027 | 0.014 | 0.020 | 0.424 | 0.159 | 0.067 | 0.027 | 0.093 | 0.288 | 0.115 | 0.042 | 0.078 | |
Kuzmin_MSU_task6_3 | 28 | kuzmin2020_t6 | 0.359 | 0.094 | 0.033 | 0.010 | 0.071 | 0.250 | 0.027 | 0.014 | 0.020 | 0.425 | 0.158 | 0.065 | 0.025 | 0.094 | 0.290 | 0.112 | 0.042 | 0.077 | |
Kuzmin_MSU_task6_4 | 30 | kuzmin2020_t6 | 0.312 | 0.072 | 0.028 | 0.000 | 0.065 | 0.232 | 0.023 | 0.011 | 0.017 | 0.370 | 0.133 | 0.059 | 0.021 | 0.085 | 0.269 | 0.107 | 0.038 | 0.072 | |
Task6_baseline | 29 | | 0.344 | 0.082 | 0.023 | 0.000 | 0.066 | 0.234 | 0.022 | 0.013 | 0.018 | 0.389 | 0.136 | 0.055 | 0.015 | 0.084 | 0.262 | 0.074 | 0.033 | 0.054 |
Systems ranking, machine translation metrics
| Submission code | Best official system rank | Technical report | BLEU1 (test) | BLEU2 (test) | BLEU3 (test) | BLEU4 (test) | METEOR (test) | ROUGEL (test) | BLEU1 (eval) | BLEU2 (eval) | BLEU3 (eval) | BLEU4 (eval) | METEOR (eval) | ROUGEL (eval) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang_PKU_task6_1 | 9 | wang2020_t6 | 0.491 | 0.296 | 0.189 | 0.119 | 0.153 | 0.331 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | |
Wang_PKU_task6_2 | 11 | wang2020_t6 | 0.498 | 0.304 | 0.195 | 0.121 | 0.154 | 0.335 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | |
Wang_PKU_task6_3 | 10 | wang2020_t6 | 0.495 | 0.301 | 0.193 | 0.121 | 0.155 | 0.336 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | |
Wang_PKU_task6_4 | 11 | wang2020_t6 | 0.500 | 0.299 | 0.191 | 0.120 | 0.153 | 0.334 | 0.489 | 0.285 | 0.177 | 0.107 | 0.148 | 0.325 | |
Shi_SFF_task6_1 | 24 | shi2020_t6 | 0.432 | 0.251 | 0.162 | 0.098 | 0.117 | 0.302 | 0.419 | 0.238 | 0.150 | 0.092 | 0.114 | 0.292 | |
Shi_SFF_task6_2 | 25 | shi2020_t6 | 0.429 | 0.246 | 0.158 | 0.096 | 0.117 | 0.300 | 0.421 | 0.239 | 0.148 | 0.089 | 0.115 | 0.292 | |
Shi_SFF_task6_3 | 20 | shi2020_t6 | 0.435 | 0.254 | 0.163 | 0.099 | 0.117 | 0.299 | 0.423 | 0.247 | 0.158 | 0.097 | 0.115 | 0.294 | |
Shi_SFF_task6_4 | 23 | shi2020_t6 | 0.428 | 0.242 | 0.156 | 0.099 | 0.116 | 0.301 | 0.425 | 0.241 | 0.154 | 0.098 | 0.115 | 0.298 | |
Wu_UESTC_task6_1 | 31 | wu2020_t6 | 0.378 | 0.030 | 0.000 | 0.000 | 0.063 | 0.262 | 0.379 | 0.020 | 0.000 | 0.000 | 0.063 | 0.261 | |
Naranjo-Alcazar_UV_task6_1 | 17 | naranjoalcazar2020_t6 | 0.464 | 0.260 | 0.157 | 0.092 | 0.135 | 0.308 | 0.453 | 0.206 | 0.098 | 0.049 | | 0.307 |
Naranjo-Alcazar_UV_task6_2 | 13 | naranjoalcazar2020_t6 | 0.469 | 0.265 | 0.162 | 0.096 | 0.136 | 0.310 | 0.464 | 0.217 | 0.107 | 0.056 | | 0.313 |
Naranjo-Alcazar_UV_task6_3 | 14 | naranjoalcazar2020_t6 | 0.466 | 0.261 | 0.156 | 0.091 | 0.137 | 0.310 | 0.448 | 0.208 | 0.102 | 0.054 | | 0.310 |
Naranjo-Alcazar_UV_task6_4 | 15 | naranjoalcazar2020_t6 | 0.464 | 0.259 | 0.154 | 0.086 | 0.137 | 0.310 | 0.445 | 0.205 | 0.105 | 0.057 | | 0.309 |
Xu_SJTU_task6_1 | 16 | xu2020_t6 | 0.456 | 0.253 | 0.150 | 0.087 | 0.135 | 0.311 | 0.457 | 0.248 | 0.143 | 0.083 | 0.135 | 0.306 | |
Xu_SJTU_task6_2 | 18 | xu2020_t6 | 0.459 | 0.254 | 0.151 | 0.086 | 0.134 | 0.313 | 0.459 | 0.253 | 0.151 | 0.086 | 0.133 | 0.314 | |
Xu_SJTU_task6_4 | 11 | xu2020_t6 | 0.525 | 0.330 | 0.219 | 0.136 | 0.153 | 0.351 | 0.529 | 0.335 | 0.226 | 0.146 | 0.149 | 0.352 | |
Xu_SJTU_task6_3 | 12 | xu2020_t6 | 0.470 | 0.266 | 0.160 | 0.095 | 0.138 | 0.318 | 0.479 | 0.274 | 0.167 | 0.099 | 0.143 | 0.328 | |
Sampathkumar_TUC_task6_1 | 30 | sampathkumar2020_t6 | 0.335 | 0.077 | 0.018 | 0.007 | 0.061 | 0.225 | 0.432 | 0.128 | 0.141 | 0.010 | 0.078 | 0.251 | |
Yuma_NTT_task6_1 | 1 | koizumi2020_t1 | 0.544 | 0.355 | 0.239 | 0.157 | 0.157 | 0.365 | 0.619 | 0.439 | 0.313 | 0.220 | 0.186 | 0.417 | |
Yuma_NTT_task6_2 | 2 | koizumi2020_t1 | 0.540 | 0.351 | 0.236 | 0.155 | 0.156 | 0.363 | 0.618 | 0.439 | 0.314 | 0.221 | 0.186 | 0.416 | |
Yuma_NTT_task6_3 | 4 | koizumi2020_t1 | 0.537 | 0.349 | 0.233 | 0.150 | 0.156 | 0.358 | 0.618 | 0.441 | 0.315 | 0.221 | 0.186 | 0.417 | |
Yuma_NTT_task6_4 | 3 | koizumi2020_t1 | 0.535 | 0.347 | 0.233 | 0.153 | 0.156 | 0.359 | 0.619 | 0.441 | 0.317 | 0.224 | 0.188 | 0.418 | |
Pellegrini_IRIT_task6_1 | 26 | pellegrini2020_t6 | 0.426 | 0.225 | 0.131 | 0.072 | 0.125 | 0.295 | 0.436 | 0.234 | 0.138 | 0.076 | 0.124 | 0.301 |
Pellegrini_IRIT_task6_2 | 19 | pellegrini2020_t6 | 0.439 | 0.252 | 0.160 | 0.094 | 0.137 | 0.310 | 0.430 | 0.248 | 0.160 | 0.096 | 0.133 | 0.305 |
Pellegrini_IRIT_task6_3 | 22 | pellegrini2020_t6 | 0.430 | 0.248 | 0.154 | 0.089 | 0.116 | 0.292 | 0.426 | 0.247 | 0.157 | 0.094 | 0.112 | 0.283 |
Pellegrini_IRIT_task6_4 | 21 | pellegrini2020_t6 | 0.421 | 0.232 | 0.145 | 0.086 | 0.130 | 0.301 | 0.415 | 0.230 | 0.143 | 0.085 | 0.125 | 0.298 |
Wu_BUPT_task6_1 | 6 | wuyusong2020_t6 | 0.519 | 0.331 | 0.221 | 0.144 | 0.155 | 0.347 | 0.534 | 0.343 | 0.230 | 0.151 | 0.160 | 0.356 | |
Wu_BUPT_task6_2 | 8 | wuyusong2020_t6 | 0.510 | 0.318 | 0.210 | 0.137 | 0.149 | 0.342 | 0.530 | 0.340 | 0.228 | 0.151 | 0.155 | 0.355 | |
Wu_BUPT_task6_3 | 7 | wuyusong2020_t6 | 0.515 | 0.324 | 0.213 | 0.137 | 0.152 | 0.348 | 0.529 | 0.340 | 0.229 | 0.154 | 0.156 | 0.357 | |
Wu_BUPT_task6_4 | 5 | wuyusong2020_t6 | 0.519 | 0.327 | 0.217 | 0.141 | 0.154 | 0.349 | 0.532 | 0.341 | 0.227 | 0.149 | 0.157 | 0.354 | |
Kuzmin_MSU_task6_1 | 27 | kuzmin2020_t6 | 0.312 | 0.052 | 0.007 | 0.000 | 0.082 | 0.252 | | | | | | |
Kuzmin_MSU_task6_2 | 28 | kuzmin2020_t6 | 0.361 | 0.094 | 0.028 | 0.007 | 0.069 | 0.248 | 0.424 | 0.159 | 0.067 | 0.027 | 0.093 | 0.288 | |
Kuzmin_MSU_task6_3 | 28 | kuzmin2020_t6 | 0.359 | 0.094 | 0.033 | 0.010 | 0.071 | 0.250 | 0.425 | 0.158 | 0.065 | 0.025 | 0.094 | 0.290 | |
Kuzmin_MSU_task6_4 | 30 | kuzmin2020_t6 | 0.312 | 0.072 | 0.028 | 0.000 | 0.065 | 0.232 | 0.370 | 0.133 | 0.059 | 0.021 | 0.085 | 0.269 | |
Task6_baseline | 29 | | 0.344 | 0.082 | 0.023 | 0.000 | 0.066 | 0.234 | 0.389 | 0.136 | 0.055 | 0.015 | 0.084 | 0.262 |
Systems ranking, captioning metrics
| Submission code | Best official system rank | Technical report | CIDEr (test) | SPICE (test) | SPIDEr (test) | CIDEr (eval) | SPICE (eval) | SPIDEr (eval) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
Wang_PKU_task6_1 | 9 | wang2020_t6 | 0.290 | 0.102 | 0.196 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_2 | 11 | wang2020_t6 | 0.287 | 0.101 | 0.194 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_3 | 10 | wang2020_t6 | 0.288 | 0.101 | 0.195 | 0.252 | 0.091 | 0.172 | |
Wang_PKU_task6_4 | 11 | wang2020_t6 | 0.287 | 0.100 | 0.194 | 0.252 | 0.091 | 0.172 | |
Shi_SFF_task6_1 | 24 | shi2020_t6 | 0.161 | 0.070 | 0.115 | 0.149 | 0.064 | 0.106 | |
Shi_SFF_task6_2 | 25 | shi2020_t6 | 0.161 | 0.065 | 0.113 | 0.153 | 0.063 | 0.108 | |
Shi_SFF_task6_3 | 20 | shi2020_t6 | 0.172 | 0.069 | 0.121 | 0.168 | 0.066 | 0.117 | |
Shi_SFF_task6_4 | 23 | shi2020_t6 | 0.172 | 0.063 | 0.118 | 0.169 | 0.063 | 0.116 | |
Wu_UESTC_task6_1 | 31 | wu2020_t6 | 0.024 | 0.000 | 0.012 | 0.024 | 0.001 | 0.012 | |
Naranjo-Alcazar_UV_task6_1 | 17 | naranjoalcazar2020_t6 | 0.195 | 0.083 | 0.139 | 0.122 | 0.060 | 0.091 | |
Naranjo-Alcazar_UV_task6_2 | 13 | naranjoalcazar2020_t6 | 0.214 | 0.086 | 0.150 | 0.144 | 0.065 | 0.104 | |
Naranjo-Alcazar_UV_task6_3 | 14 | naranjoalcazar2020_t6 | 0.207 | 0.086 | 0.147 | 0.124 | 0.063 | 0.093 | |
Naranjo-Alcazar_UV_task6_4 | 15 | naranjoalcazar2020_t6 | 0.205 | 0.087 | 0.146 | 0.125 | 0.064 | 0.095 | |
Xu_SJTU_task6_1 | 16 | xu2020_t6 | 0.198 | 0.086 | 0.142 | 0.203 | 0.081 | 0.142 | |
Xu_SJTU_task6_2 | 18 | xu2020_t6 | 0.182 | 0.085 | 0.133 | 0.192 | 0.083 | 0.138 | |
Xu_SJTU_task6_4 | 11 | xu2020_t6 | 0.284 | 0.104 | 0.194 | 0.280 | 0.099 | 0.190 | |
Xu_SJTU_task6_3 | 12 | xu2020_t6 | 0.215 | 0.090 | 0.153 | 0.232 | 0.088 | 0.142 | |
Sampathkumar_TUC_task6_1 | 30 | sampathkumar2020_t6 | 0.024 | 0.009 | 0.017 | 0.071 | 0.024 | 0.024 | |
Yuma_NTT_task6_1 | 1 | koizumi2020_t1 | 0.340 | 0.103 | 0.222 | 0.521 | 0.129 | 0.325 | |
Yuma_NTT_task6_2 | 2 | koizumi2020_t1 | 0.338 | 0.103 | 0.220 | 0.515 | 0.130 | 0.322 | |
Yuma_NTT_task6_3 | 4 | koizumi2020_t1 | 0.330 | 0.103 | 0.216 | 0.527 | 0.129 | 0.328 | |
Yuma_NTT_task6_4 | 3 | koizumi2020_t1 | 0.332 | 0.102 | 0.217 | 0.531 | 0.130 | 0.331 | |
Pellegrini_IRIT_task6_1 | 26 | pellegrini2020_t6 | 0.136 | 0.072 | 0.104 | 0.140 | 0.072 | 0.106 | |
Pellegrini_IRIT_task6_2 | 19 | pellegrini2020_t6 | 0.178 | 0.082 | 0.130 | 0.169 | 0.079 | 0.124 | |
Pellegrini_IRIT_task6_3 | 22 | pellegrini2020_t6 | 0.171 | 0.068 | 0.119 | 0.165 | 0.063 | 0.114 | |
Pellegrini_IRIT_task6_4 | 21 | pellegrini2020_t6 | 0.164 | 0.076 | 0.120 | 0.162 | 0.071 | 0.116 | |
Wu_BUPT_task6_1 | 6 | wuyusong2020_t6 | 0.316 | 0.106 | 0.211 | 0.346 | 0.108 | 0.227 | |
Wu_BUPT_task6_2 | 8 | wuyusong2020_t6 | 0.302 | 0.101 | 0.202 | 0.339 | 0.108 | 0.223 | |
Wu_BUPT_task6_3 | 7 | wuyusong2020_t6 | 0.304 | 0.102 | 0.203 | 0.339 | 0.104 | 0.221 | |
Wu_BUPT_task6_4 | 5 | wuyusong2020_t6 | 0.323 | 0.106 | 0.214 | 0.340 | 0.108 | 0.224 | |
Kuzmin_MSU_task6_1 | 27 | kuzmin2020_t6 | 0.020 | 0.023 | 0.021 | | | |
Kuzmin_MSU_task6_2 | 28 | kuzmin2020_t6 | 0.027 | 0.014 | 0.020 | 0.115 | 0.042 | 0.078 | |
Kuzmin_MSU_task6_3 | 28 | kuzmin2020_t6 | 0.027 | 0.014 | 0.020 | 0.112 | 0.042 | 0.077 | |
Kuzmin_MSU_task6_4 | 30 | kuzmin2020_t6 | 0.023 | 0.011 | 0.017 | 0.107 | 0.038 | 0.072 | |
Task6_baseline | 29 | | 0.022 | 0.013 | 0.018 | 0.074 | 0.033 | 0.054 |
System characteristics
| Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Encoder | Decoder | Classifier | Acoustic features | Word representation | Data augmentation | Sampling rate | Used meta-data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
9 | Wang_PKU_task6_1 | 0.196 | wang2020_t6 | encoder-decoder | 12577360 | CNN | RNN-LSTM | feed-forward | log-mel energies | one-hot | SpecAugment | 44.1kHz | No |
11 | Wang_PKU_task6_2 | 0.194 | wang2020_t6 | encoder-decoder | 12577360 | CNN | RNN-LSTM | feed-forward | log-mel energies | one-hot | SpecAugment | 44.1kHz | No |
10 | Wang_PKU_task6_3 | 0.195 | wang2020_t6 | encoder-decoder | 12577360 | CNN | RNN-LSTM | feed-forward | log-mel energies | one-hot | SpecAugment | 44.1kHz | No |
11 | Wang_PKU_task6_4 | 0.194 | wang2020_t6 | encoder-decoder | 12577360 | CNN | RNN-LSTM | feed-forward | log-mel energies | one-hot | SpecAugment | 44.1kHz | No |
24 | Shi_SFF_task6_1 | 0.115 | shi2020_t6 | seq2seq | | transformer encoder | | feed-forward | log-mel energies | one-hot | temporal-frequency shift | 44.1kHz | No |
25 | Shi_SFF_task6_2 | 0.113 | shi2020_t6 | seq2seq | | transformer encoder | | feed-forward | log-mel energies | one-hot | temporal-frequency shift | 44.1kHz | No |
20 | Shi_SFF_task6_3 | 0.121 | shi2020_t6 | seq2seq | | transformer encoder | transformer decoder | feed-forward | log-mel energies | one-hot | temporal-frequency shift | 44.1kHz | No |
23 | Shi_SFF_task6_4 | 0.118 | shi2020_t6 | seq2seq | | transformer encoder | transformer decoder | feed-forward | log-mel energies | one-hot | temporal-frequency shift | 44.1kHz | No |
31 | Wu_UESTC_task6_1 | 0.012 | wu2020_t6 | seq2seq | 60730943 | CNN | multi-layer RNN-GRU | feed-forward | log-mel energies | one-hot | | 44.1kHz | No |
17 | Naranjo-Alcazar_UV_task6_1 | 0.139 | naranjoalcazar2020_t6 | encoder-decoder | 38734544 | CNN | RNN-LSTM | feed-forward | log-Gammatone spectrogram | one-hot | | 44.1kHz | No |
13 | Naranjo-Alcazar_UV_task6_2 | 0.150 | naranjoalcazar2020_t6 | encoder-decoder | 57726672 | CNN | RNN-LSTM | feed-forward | log-Gammatone spectrogram | one-hot | | 44.1kHz | No |
14 | Naranjo-Alcazar_UV_task6_3 | 0.147 | naranjoalcazar2020_t6 | encoder-decoder | 73370320 | CNN | RNN-LSTM | feed-forward | log-Gammatone spectrogram | one-hot | | 44.1kHz | No |
15 | Naranjo-Alcazar_UV_task6_4 | 0.146 | naranjoalcazar2020_t6 | encoder-decoder | 140064208 | CNN | RNN-LSTM | feed-forward | log-Gammatone spectrogram | one-hot | | 44.1kHz | No |
16 | Xu_SJTU_task6_1 | 0.142 | xu2020_t6 | seq2seq | 5224055 | CRNN-BGRU | RNN-GRU | feed-forward | log-mel energies | embeddings | | 44.1kHz | No |
18 | Xu_SJTU_task6_2 | 0.133 | xu2020_t6 | seq2seq | 5224055 | CRNN-BGRU | RNN-GRU | feed-forward | log-mel energies | embeddings | | 44.1kHz | No |
11 | Xu_SJTU_task6_4 | 0.194 | xu2020_t6 | seq2seq | 5224055 | CRNN-BGRU | RNN-GRU | feed-forward | log-mel energies | embeddings | | 44.1kHz | No |
12 | Xu_SJTU_task6_3 | 0.153 | xu2020_t6 | seq2seq | 10448110 | CRNN-BGRU | RNN-GRU | feed-forward | log-mel energies | embeddings | | 44.1kHz | No |
30 | Sampathkumar_TUC_task6_1 | 0.017 | sampathkumar2020_t6 | seq2seq | 5756431 | multi-layer RNN-BGRU | RNN-GRU | feed-forward | log-mel energies | embeddings | | 44.1kHz | No |
1 | Yuma_NTT_task6_1 | 0.222 | koizumi2020_t1 | seq2seq, keyword estimation, sentence length estimation | 32994840 | multi-layer RNN-BLSTM | RNN-LSTM | feed-forward | log-mel energies | embeddings | mix-up, TF-IDF-based word replacement, random data cropping | 22.05kHz | Yes |
2 | Yuma_NTT_task6_2 | 0.220 | koizumi2020_t1 | seq2seq, keyword estimation, sentence length estimation | 82487110 | multi-layer RNN-BLSTM | RNN-LSTM | feed-forward | log-mel energies | embeddings | mix-up, TF-IDF-based word replacement, random data cropping | 22.05kHz | Yes |
4 | Yuma_NTT_task6_3 | 0.216 | koizumi2020_t1 | seq2seq, keyword estimation, sentence length estimation | 20670182 | multi-layer RNN-BLSTM | RNN-LSTM | feed-forward | log-mel energies | embeddings | mix-up, TF-IDF-based word replacement, random data cropping | 22.05kHz | Yes |
3 | Yuma_NTT_task6_4 | 0.217 | koizumi2020_t1 | seq2seq, keyword estimation, sentence length estimation | 51675455 | multi-layer RNN-BLSTM | RNN-LSTM | feed-forward | log-mel energies | embeddings | mix-up, TF-IDF-based word replacement, random data cropping | 22.05kHz | Yes |
26 | Pellegrini_IRIT_task6_1 | 0.104 | pellegrini2020_t6 | seq2seq | 2887375 | multi-layer RNN-pBLSTM | multi-layer RNN-LSTM | feed-forward, greedy search | log-mel energies | one-hot | | 44.1kHz | No |
19 | Pellegrini_IRIT_task6_2 | 0.130 | pellegrini2020_t6 | seq2seq | 2887375 | multi-layer RNN-pBLSTM | multi-layer RNN-LSTM | feed-forward, beam search | log-mel energies | one-hot | | 44.1kHz | No |
22 | Pellegrini_IRIT_task6_3 | 0.119 | pellegrini2020_t6 | seq2seq | 2887375 | multi-layer RNN-pBLSTM | multi-layer RNN-LSTM | feed-forward, beam search with LM | log-mel energies | one-hot | | 44.1kHz | No |
21 | Pellegrini_IRIT_task6_4 | 0.120 | pellegrini2020_t6 | seq2seq | 2120744 | multi-layer RNN-pBLSTM | multi-layer RNN-LSTM | feed-forward, greedy search | log-mel energies | one-hot | | 44.1kHz | No |
6 | Wu_BUPT_task6_1 | 0.211 | wuyusong2020_t6 | encoder-decoder | 8901648 | CNN | Transformer | feed-forward | log-mel energies | embeddings | SpecAugment | 44.1kHz | No |
8 | Wu_BUPT_task6_2 | 0.202 | wuyusong2020_t6 | encoder-decoder | 8901648 | CNN | Transformer | feed-forward | log-mel energies | embeddings | SpecAugment | 44.1kHz | No |
7 | Wu_BUPT_task6_3 | 0.203 | wuyusong2020_t6 | encoder-decoder | 8901648 | CNN | Transformer | feed-forward | log-mel energies | embeddings | SpecAugment | 44.1kHz | No |
5 | Wu_BUPT_task6_4 | 0.214 | wuyusong2020_t6 | encoder-decoder | 8901648 | CNN | Transformer | feed-forward | log-mel energies | embeddings | SpecAugment | 44.1kHz | No |
27 | Kuzmin_MSU_task6_1 | 0.021 | kuzmin2020_t6 | seq2seq | 4804112 | multi-layer RNN-GRU | RNN-GRU | feed-forward | log-mel energies | one-hot | mix-up, reverb, pitch, overdrive, speed | 44.1kHz | No |
28 | Kuzmin_MSU_task6_2 | 0.020 | kuzmin2020_t6 | seq2seq | 15178255 | multi-layer RNN-GRU | RNN-GRU | feed-forward | log-mel energies | one-hot | mix-up | 44.1kHz | No |
28 | Kuzmin_MSU_task6_3 | 0.020 | kuzmin2020_t6 | seq2seq | 15178255 | multi-layer RNN-GRU | RNN-GRU | feed-forward | log-mel energies | one-hot | mix-up, reverb, pitch, overdrive, speed | 44.1kHz | No |
30 | Kuzmin_MSU_task6_4 | 0.017 | kuzmin2020_t6 | seq2seq | 4804112 | multi-layer RNN-GRU | RNN-GRU | feed-forward | log-mel energies | one-hot | mix-up | 44.1kHz | No |
29 | Task6_baseline | 0.018 | | seq2seq | 5012931 | multi-layer RNN-GRU | multi-layer RNN-GRU | feed-forward | log-mel energies | one-hot | | 44.1kHz | No |
Technical reports
The NTT DCASE2020 Challenge Task 6 System: Automated Audio Captioning With Keywords and Sentence Length Estimation
Yuma Koizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino
NTT Corporation, Japan
Koizumi_NTT_task6_1 Koizumi_NTT_task6_2 Koizumi_NTT_task6_3 Koizumi_NTT_task6_4
Abstract
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word-selection indeterminacy and sentence-length indeterminacy. We simultaneously solve the main caption-generation problem and these sub-problems by estimating keywords and sentence length through multi-task learning. We tested a simplified model of our submission using the development-testing dataset. Our model achieved a SPIDEr score of 20.7, whereas that of the baseline system was 5.4.
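As a rough illustration of the multi-task objective described above (main caption generation plus keyword and sentence-length estimation), the sketch below combines three losses; the tensor shapes, loss weights, and the framing of length estimation as classification over length bins are assumptions, not details taken from the report:

```python
import torch.nn.functional as F

def multitask_loss(caption_logits, caption_targets,   # (B, T, V), (B, T)
                   keyword_logits, keyword_targets,   # (B, K) multi-label
                   length_logits, length_targets,     # (B, L_bins), (B,)
                   w_kw=0.5, w_len=0.5):              # hypothetical loss weights
    # Main task: per-word cross-entropy for caption generation.
    l_cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    # Sub-task 1: multi-label keyword estimation.
    l_kw = F.binary_cross_entropy_with_logits(keyword_logits, keyword_targets)
    # Sub-task 2: sentence-length estimation, treated here as classification.
    l_len = F.cross_entropy(length_logits, length_targets)
    return l_cap + w_kw * l_kw + w_len * l_len
```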
System characteristics
Method scheme/architecture | seq2seq, keyword estimation, sentence length estimation |
Encoder | RNN-BLSTM |
Decoder | RNN-LSTM |
Alignment mechanism | self-attention |
Classifier | feed-forward |
Amount of parameters | 3299484 |
Sampling rate | 22.05kHz |
Audio features | log-mel energies |
Word representation | embeddings |
Data augmentation | mix-up, TF-IDF-based word replacement, random data cropping |
Automated Audio Captioning
Nikita Kuzmin and Alexander Dyakonov
Moscow State University, CMC Faculty, Mathematical Methods of Forecasting Dept., GSP-1, 1-52, Leninskiye Gory, Moscow, 119991, Russia
Kuzmin_MSU_task6_1 Kuzmin_MSU_task6_2 Kuzmin_MSU_task6_3 Kuzmin_MSU_task6_4
Abstract
This task can be stated as the automated generation of a textual content description from a raw audio file. We propose a method for the automated audio captioning task and examine the impact of augmentations (mix-up, reverb, pitch, overdrive, speed) on its performance. Our method is based on a modified encoder-decoder architecture. The encoder consists of three bidirectional gated recurrent units (GRUs). The decoder consists of one GRU and one fully connected layer for classification. The encoder input is log-mel spectrogram features for every part of the audio file, segmented by a Hann window [1] of 1024 samples with 50% overlap. The decoder output is a matrix with the probabilities of words for each position in a sentence. We used the BLEU1, BLEU2, BLEU3, BLEU4, ROUGEL, METEOR, CIDEr, SPICE, and SPIDEr metrics to compare methods.
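The front end described above (log-mel features from Hann windows of 1024 samples with 50% overlap) can be reproduced roughly as follows; the number of mel bands and the log offset are assumptions, not values from the report:

```python
import librosa
import numpy as np

def log_mel(path, sr=44100, n_mels=64):  # n_mels is an assumption
    y, _ = librosa.load(path, sr=sr)
    # Hann window of 1024 samples with 50% overlap, i.e. a hop of 512 samples.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512,
                                         window="hann", n_mels=n_mels)
    return np.log(mel + 1e-10)  # log compression; the exact offset is an assumption
```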
System characteristics
Method scheme/architecture | seq2seq |
Encoder | RNN-GRU |
Decoder | RNN-GRU |
Alignment mechanism | attention, vector2sequence |
Classifier | feed-forward |
Amount of parameters | 4804112 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | one-hot |
Data augmentation | mix-up, reverb, pitch, overdrive, speed |
Task 6 DCASE 2020: Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation
Javier Naranjo-Alcazar1, Sergi Perez-Castanos, Pedro Zuccarello1, and Maximo Cobos1
1Computer Science Department, Universitat de València, Burjassot, Spain
Naranjo-Alcazar_UV_task6_1 Naranjo-Alcazar_UV_task6_2 Naranjo-Alcazar_UV_task6_3 Naranjo-Alcazar_UV_task6_4
Abstract
Automated audio captioning is a machine listening task whose goal is to describe an audio signal using free text. An automated audio captioning system accepts an audio signal as input and outputs a textual description, that is, the caption of the signal. This task can be useful in many applications, such as automatic content description or machine-to-machine interaction. In this technical report, an automated audio captioning system based on residual learning in the encoder phase is proposed. The encoder phase is implemented via different residual network configurations. The decoder phase (which creates the caption) uses recurrent layers plus an attention mechanism. The chosen audio representation is the Gammatone spectrogram. Results show that the framework proposed in this work surpasses the baseline system, improving all metrics.
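A minimal sketch of the residual learning idea used in the encoder phase: each block adds its input back onto the output of a small convolutional stack, so the block only has to learn a residual correction. Channel counts and layer arrangement are illustrative, not the report's configuration:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: the stack learns a residual on top of its input.
        return self.act(x + self.body(x))
```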
System characteristics
Method scheme/architecture | encoder-decoder |
Encoder | CNN |
Decoder | RNN-LSTM |
Alignment mechanism | attention |
Classifier | feed-forward |
Amount of parameters | 38734544 |
Sampling rate | 44.1kHz |
Audio features | log-Gammatone spectrogram |
Word representation | one-hot |
IRIT-UPS DCASE 2020 audio captioning system
Thomas Pellegrini
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
Pellegrini_IRIT_task6_1 Pellegrini_IRIT_task6_2 Pellegrini_IRIT_task6_3 Pellegrini_IRIT_task6_4
Abstract
This technical report is a short description of the sequence-to-sequence model used in DCASE 2020 Task 6, dedicated to audio captioning. Four submissions were made: i) a baseline one using greedy search, ii) one using beam search, iii) one using beam search integrating a 2-gram language model, and iv) one with a model trained on a vocabulary limited to the most frequent word types (1k words instead of about 5k).
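For illustration, greedy search (submission i) keeps only the single most probable word at each step, whereas beam search (submissions ii and iii) keeps the k highest-scoring partial captions and expands each of them. A sketch of the greedy variant, assuming a hypothetical `decoder(tokens, enc_out)` interface that returns next-word logits:

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, enc_out, bos_id, eos_id, max_len=30):
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), enc_out)  # (1, vocab_size)
        next_id = int(logits.argmax(dim=-1))
        if next_id == eos_id:
            break  # caption finished
        tokens.append(next_id)
    return tokens[1:]  # word ids of the caption, without the BOS token
```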
System characteristics
Method scheme/architecture | seq2seq |
Encoder | RNN-pBLSTM |
Decoder | RNN-LSTM |
Alignment mechanism | attention |
Classifier | feed-forward, greedy search |
Amount of parameters | 2887375 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | one-hot |
Automated Audio Captioning
Arunodhayan Sampathkumar and Danny Kowerko
Technische Universität Chemnitz, Juniorprofessur Media Computing, Chemnitz, Germany
Sampathkumar_TUC_task6_1
Abstract
Audio captioning is a novel approach to describing an audio scene based on human-like perception. Human-like perception of audio events not only performs detection and localization, but also tries to summarize the relationships between different audio events. DCASE 2020 provides a strongly labelled caption dataset for automated audio captioning. In this research, mel spectrograms are used to extract the audio features. A recurrent neural network (RNN) encoder-decoder is trained on the dataset. Finally, the network is evaluated using the MS COCO metrics, where the BLEU3 and BLEU1 scores were strong; these are discussed in detail in section 5 of the report.
System characteristics
Method scheme/architecture | seq2seq |
Encoder | RNN-BGRU |
Decoder | RNN-GRU |
Alignment mechanism | identity |
Classifier | feed-forward |
Amount of parameters | 16521 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | embeddings |
Audio Captioning With the Transformer
Anna Shi
ShuangFeng First, Beijing, China
Shi_SFF_task6_1 Shi_SFF_task6_2 Shi_SFF_task6_3 Shi_SFF_task6_4
Abstract
In this technical report, we present the techniques and models applied in our submission for DCASE 2020 Task 6: automated audio captioning. We focus primarily on how to apply transformer methods efficiently to deal with large amounts of audio data. Our experiments with the public DCASE 2020 Challenge Task 6 Clotho evaluation data resulted in a SPIDEr of 0.1171, while the SPIDEr of the official baseline is 0.054.
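As a rough illustration of such a transformer encoder-decoder (all dimensions are assumptions; the report does not state the model size here), PyTorch's built-in module can be wired up as follows:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=4, num_encoder_layers=4,
                       num_decoder_layers=4, batch_first=True)
src = torch.randn(8, 200, 256)  # projected log-mel frames: (batch, time, d_model)
tgt = torch.randn(8, 20, 256)   # embedded caption prefix: (batch, words, d_model)
# Causal mask so each position attends only to earlier words.
mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
out = model(src, tgt, tgt_mask=mask)  # (8, 20, 256), fed to a word classifier
```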
System characteristics
Method scheme/architecture | seq2seq |
Encoder | transformer encoder |
Decoder | transformer decoder |
Alignment mechanism | self-attention |
Classifier | feed-forward |
Amount of parameters | Not reported |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | one-hot |
Data augmentation | temporal-frequency shift |
Automated Audio Captioning With Temporal Attention
Helin Wang1, Bang Yang1, Yuexian Zou1,2 and Dading Chong1
1ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2Peng Cheng Laboratory, Shenzhen, China
Helin_ADSPLAB_task6_1 Helin_ADSPLAB_task6_2 Helin_ADSPLAB_task6_3 Helin_ADSPLAB_task6_4
Abstract
This technical report describes the ADSPLAB team's submission for Task 6 of the DCASE 2020 Challenge (automated audio captioning). Our audio captioning system is based on the sequence-to-sequence model. A convolutional neural network (CNN) is used as the encoder, and a long short-term memory (LSTM)-based decoder with temporal attention is used to generate the captions. No extra data or pre-trained models are employed, and no extra annotations are used. The experimental results show that our system achieves a SPIDEr of 0.172 (official baseline: 0.054) on the evaluation split of the Clotho dataset.
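A minimal sketch of temporal attention as commonly implemented in such decoders: at each decoding step the LSTM state scores every encoder time step, and the encoder outputs are summed with those weights into a context vector. The additive scoring function and all dimensions are assumptions, not details from the report:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int = 128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_out) +
                                   self.w_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)    # one weight per time step
        context = (weights * enc_out).sum(dim=1)  # (B, enc_dim) weighted summary
        return context, weights
```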
System characteristics
Method scheme/architecture | encoder-decoder |
Encoder | CNN |
Decoder | RNN-LSTM |
Alignment mechanism | attention |
Classifier | feed-forward |
Amount of parameters | 12577360 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | one-hot |
Data augmentation | SpecAugment |
Automatic Audio Captioning System Based on Convolutional Neural Network
Qianyang Wu, Shengqi Tao, and Xingyu Yang
University of Electronic Science and Technology of China, Communication Engineering Dept., Chengdu, China
Wu_UESTC_task6_1
Abstract
Automated audio captioning has been a new issue in natural language processing (NLP) in recent years. The key point of an automated audio captioning system is that it describes audio signals in the form of natural language: the system takes audio as input and outputs descriptive sentences. Most approaches use a seq2seq model with RNNs as both the encoder and decoder, which results in considerable training time. This paper proposes a neural network with a CNN as the encoder and a GRU as the decoder. The encoder is based on VGG16, with deep convolutional networks and three fully-connected layers. Despite the low prediction accuracy, our model decreases the training time significantly, showing that CNNs can be a viable choice for automated audio captioning.
System characteristics
Method scheme/architecture | encoder-decoder |
Encoder | CNN |
Decoder | RNN-GRU |
Alignment mechanism | attention |
Classifier | feed-forward |
Amount of parameters | 60730943 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | one-hot |
Audio Captioning Based on Transformer and Pre-Training for 2020 DCASE Audio Captioning Challenge
Yusong Wu1, Kun Chen1, Ziyue Wang2, Xuan Zhang2, Fudong Nian3, Shengchen Li1, and Xi Shao2
1Beijing University of Posts and Telecommunications, Beijing, China, 2Nanjing University of Posts and Telecommunications, Nanjing, China, 3Anhui University, Anhui, China
Wu_BUPT_task6_1 Wu_BUPT_task6_2 Wu_BUPT_task6_3 Wu_BUPT_task6_4
Abstract
This report proposes an automated audio captioning model for the 2020 DCASE audio captioning challenge. In this challenge, a model must be trained from scratch to generate natural language descriptions of a given audio signal. However, because limited data is available and there are restrictions on using models pre-trained on external data, training directly from scratch can result in poor performance, with acoustic events and language poorly modeled. For better acoustic event and language modeling, a sequence-to-sequence model is proposed that consists of a CNN encoder and a Transformer decoder. In the proposed model, the encoder and word embedding are first pre-trained. Regularization and data augmentation are applied during training, and fine-tuning is applied after training. Experiments show that the proposed model can achieve a SPIDEr score of 0.227 on audio captioning performance.
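For illustration, SpecAugment-style augmentation zeroes out random blocks of time frames and frequency bins in the input spectrogram, forcing the model not to rely on any single region; the mask sizes and counts below are assumptions, not the report's settings:

```python
import numpy as np

def spec_augment(spec, max_t=30, max_f=12, n_masks=2):
    """Simplified SpecAugment (Park et al., 2019) on a (n_mels, n_frames) array."""
    spec = spec.copy()
    n_f, n_t = spec.shape
    for _ in range(n_masks):
        t0 = np.random.randint(0, max(1, n_t - max_t))
        spec[:, t0:t0 + np.random.randint(1, max_t + 1)] = 0.0  # time mask
        f0 = np.random.randint(0, max(1, n_f - max_f))
        spec[f0:f0 + np.random.randint(1, max_f + 1), :] = 0.0  # frequency mask
    return spec
```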
System characteristics
Method scheme/architecture | encoder-decoder |
Encoder | CNN |
Decoder | Transformer |
Alignment mechanism | attention, self-attention |
Classifier | feed-forward |
Amount of parameters | 8901648 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | embeddings |
Data augmentation | SpecAugment |
The SJTU Submission for DCASE2020 Task 6: A CRNN-GRU Based Reinforcement Learning Approach to Audiocaption
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Xu_SJTU_task6_1 Xu_SJTU_task6_2 Xu_SJTU_task6_3 Xu_SJTU_task6_4
Abstract
This paper proposes the SJTU AudioCaption system for the DCASE 2020 Task 6 challenge. Our system consists of a powerful CRNN encoder combined with a GRU decoder. In addition to standard cross-entropy training, reinforcement learning is also investigated. Our approach significantly improves on the challenge baseline model in all shown metrics, achieving a relative improvement of at least 34%. Our best submission achieves a BLEU4 of 0.146, ROUGE-L of 0.352, CIDEr of 0.280, METEOR of 0.149, and SPICE of 0.099 on the Clotho evaluation set.
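Reinforcement learning for captioning is commonly implemented as self-critical sequence training (SCST; Rennie et al., 2017), in which a sampled caption's metric reward (e.g., CIDEr) is compared against that of a greedy-decoded baseline caption; whether the SJTU system uses exactly this variant is not stated here. A sketch of the loss:

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    # sample_logprobs: (B, T) log-probs of the sampled caption's words.
    # sample_reward / greedy_reward: (B,) per-caption metric scores.
    advantage = sample_reward - greedy_reward
    # REINFORCE: raise the log-probability of captions that beat the baseline.
    return -(advantage.detach() * sample_logprobs.sum(dim=1)).mean()
```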
System characteristics
Method scheme/architecture | seq2seq |
Encoder | CRNN-BGRU |
Decoder | RNN-GRU |
Alignment mechanism | vector2sequence |
Classifier | feed-forward |
Amount of parameters | 5224055 |
Sampling rate | 44.1kHz |
Audio features | log-mel energies |
Word representation | embeddings |