Task description
Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system takes an audio signal as input and outputs a textual description (i.e., a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for a general audio recording. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e., words that appear only once in a split).
The developed systems are evaluated on their generated captions, using the evaluation split of Clotho, for which the corresponding captions are not publicly provided.
More information about Task 6: Automated Audio Captioning can be found at the task description page.
Teams ranking
Listed here are the best-performing systems from each team. The ranking is based on FENSE. To allow a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task. The values are given for both the Clotho evaluation split and the Clotho development-testing split. The values for the Clotho development-testing split are provided to allow further comparison with systems and methods developed outside of this task, since the captions for that split are freely available. This year, we asked participants to exclude a list of Freesound IDs to prevent data leakage between the training and evaluation subsets. We mark "True" in the "Data leak" column for participants who used Freesound data without excluding these forbidden IDs.
All confidence intervals are computed using a bootstrap method on the evaluation set, with a 95% confidence level.
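For illustration only, the following sketch shows one common way to obtain such percentile bootstrap intervals, assuming a metric that is a simple average of per-file scores (corpus-level metrics such as CIDEr-D would instead need to be recomputed on every resampled set of files). It is not the organizers' exact evaluation code, and the function name and the 1000-resample setting are arbitrary choices.

```python
import numpy as np

def bootstrap_ci(per_file_scores, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-file metric scores.

    Simplifying assumption: the metric is an average of per-file scores;
    corpus-level metrics would require re-running the full metric on each
    resampled set of files instead of averaging cached scores.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_file_scores, dtype=float)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        sample = rng.choice(scores, size=scores.size, replace=True)
        means[i] = sample.mean()
    alpha = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return scores.mean(), (low, high)

# Example with fake per-file scores (illustrative only).
point, (lo, hi) = bootstrap_ci(np.random.default_rng(1).uniform(0.3, 0.7, size=1045))
print(f"{point:.3f} ({lo:.3f} - {hi:.3f})")
```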
In the following table, "(eval)" denotes the Clotho evaluation split and "(dev-test)" the Clotho development-testing split.

Team rank | Submission code | Data leak | Corresponding author | Technical Report | METEOR (eval) | CIDEr-D (eval) | SPICE (eval) | SPIDEr (eval) | SPIDEr-FL (eval) | FENSE (eval) | METEOR (dev-test) | CIDEr-D (dev-test) | SPICE (dev-test) | SPIDEr (dev-test) | SPIDEr-FL (dev-test) | FENSE (dev-test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Jung_CMU_t6_4 | False | Jee-weon Jung | jung_cmu_t6_2024 | 0.172 (0.174 - 0.182) | 0.344 (0.324 - 0.366) | 0.140 (0.134 - 0.146) | 0.242 (0.230 - 0.255) | 0.241 (0.229 - 0.254) | 0.554 (0.543 - 0.563) | 0.174 | 0.327 | 0.136 | 0.230 | 0.230 | 0.542 | |
2 | Kim_SNU_t6_2 | False | Jaeyeon Kim | kim_snu_t6_2024 | 0.199 (0.200 - 0.210) | 0.480 (0.453 - 0.508) | 0.148 (0.142 - 0.154) | 0.314 (0.299 - 0.330) | 0.314 (0.299 - 0.330) | 0.544 (0.534 - 0.555) | 0.196 | 0.477 | 0.142 | 0.310 | 0.310 | 0.542 | |
3 | Chen_SJTU_t6_4 | True | Wenxi Chen | chen_sjtu_t6_2024 | 0.194 (0.196 - 0.207) | 0.509 (0.479 - 0.541) | 0.145 (0.138 - 0.151) | 0.327 (0.310 - 0.345) | 0.322 (0.306 - 0.341) | 0.541 (0.530 - 0.552) | 0.193 | 0.522 | 0.148 | 0.335 | 0.333 | 0.543 | |
4 | Li_ALXC_t6_4 | False | Gang Li | li_alxc_t6_2024 | 0.195 (0.197 - 0.208) | 0.493 (0.464 - 0.525) | 0.145 (0.139 - 0.151) | 0.319 (0.302 - 0.337) | 0.317 (0.300 - 0.335) | 0.533 (0.522 - 0.543) | 0.194 | 0.503 | 0.145 | 0.324 | 0.323 | 0.532 | |
5 | Kyogu_SNU_t6_2 | False | Lee Kyogu | kyogu_snu_t6_2024 | 0.189 (0.190 - 0.201) | 0.409 (0.383 - 0.437) | 0.135 (0.129 - 0.141) | 0.272 (0.257 - 0.288) | 0.272 (0.257 - 0.288) | 0.526 (0.515 - 0.537) | 0.187 | 0.412 | 0.134 | 0.273 | 0.273 | 0.518 | |
6 | Kong_CUHK_t6_1 | True | Qiuqiang Kong | kong_cuhk_t6_2024 | 0.192 (0.195 - 0.206) | 0.495 (0.467 - 0.526) | 0.141 (0.135 - 0.147) | 0.318 (0.301 - 0.336) | 0.315 (0.299 - 0.333) | 0.525 (0.514 - 0.536) | 0.196 | 0.529 | 0.138 | 0.334 | 0.332 | 0.528 | |
7 | Choi_KAIST_t6_1 | False | Inhan Choi | choi_kaist_t6_2024 | 0.187 (0.188 - 0.199) | 0.465 (0.438 - 0.494) | 0.135 (0.129 - 0.142) | 0.300 (0.284 - 0.317) | 0.299 (0.284 - 0.316) | 0.520 (0.509 - 0.531) | 0.189 | 0.464 | 0.134 | 0.299 | 0.299 | 0.521 | |
8 | Li_SCUT_t6_4 | False | Qianqian Li | li_scut_t6_2024 | 0.188 (0.190 - 0.201) | 0.468 (0.440 - 0.497) | 0.138 (0.132 - 0.145) | 0.303 (0.287 - 0.320) | 0.302 (0.286 - 0.319) | 0.520 (0.508 - 0.531) | 0.189 | 0.469 | 0.134 | 0.301 | 0.301 | 0.513 | |
9 | Silva_JKUICP_t6_2 | False | Jakob De Jesus Silva | de_jesus_silva_jkuicp_t6_2024 | 0.188 (0.190 - 0.201) | 0.456 (0.430 - 0.484) | 0.138 (0.132 - 0.144) | 0.297 (0.282 - 0.313) | 0.296 (0.281 - 0.313) | 0.516 (0.505 - 0.527) | 0.192 | 0.479 | 0.138 | 0.309 | 0.308 | 0.508 | |
10 | Epshtein_ARC_t6_1 | False | Dan Epshtein | epshtein_arc_t6_2024 | 0.188 (0.190 - 0.200) | 0.462 (0.437 - 0.491) | 0.137 (0.131 - 0.143) | 0.300 (0.285 - 0.316) | 0.298 (0.283 - 0.315) | 0.514 (0.503 - 0.525) | 0.189 | 0.473 | 0.135 | 0.304 | 0.302 | 0.504 | |
11 | Hong_CAU_t6_1 | False | Hyunhee Hong | hong_cau_t6_2024 | 0.184 (0.185 - 0.195) | 0.427 (0.402 - 0.454) | 0.134 (0.128 - 0.140) | 0.280 (0.266 - 0.296) | 0.279 (0.265 - 0.295) | 0.513 (0.502 - 0.524) | 0.188 | 0.458 | 0.133 | 0.295 | 0.294 | 0.509 | |
12 | Baseline | False | Étienne Labbé | labbé_irit_t6_2024 | 0.186 (0.187 - 0.198) | 0.442 (0.417 - 0.468) | 0.135 (0.129 - 0.141) | 0.288 (0.274 - 0.304) | 0.287 (0.273 - 0.303) | 0.510 (0.499 - 0.521) | 0.190 | 0.462 | 0.134 | 0.298 | 0.296 | 0.504 |
Systems ranking
Listed here are all submitted systems and their rankings according to the different metrics and groupings of metrics. The first table shows all systems with all challenge metrics, and the second table shows all systems with the additional metrics.
Detailed information for each system is provided in the next section.
Systems ranking, challenge metrics
As above, "(eval)" denotes the Clotho evaluation split and "(dev-test)" the Clotho development-testing split.

Submission rank | Submission code | Data leak | Technical Report | METEOR (eval) | CIDEr-D (eval) | SPICE (eval) | SPIDEr (eval) | SPIDEr-FL (eval) | FENSE (eval) | METEOR (dev-test) | CIDEr-D (dev-test) | SPICE (dev-test) | SPIDEr (dev-test) | SPIDEr-FL (dev-test) | FENSE (dev-test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Jung_CMU_t6_4 | False | jung_cmu_t6_2024 | 0.172 (0.174 - 0.182) | 0.344 (0.324 - 0.366) | 0.140 (0.134 - 0.146) | 0.242 (0.230 - 0.255) | 0.241 (0.229 - 0.254) | 0.554 (0.543 - 0.563) | 0.174 | 0.327 | 0.136 | 0.230 | 0.230 | 0.542 | |
2 | Jung_CMU_t6_2 | False | jung_cmu_t6_2024 | 0.176 (0.177 - 0.186) | 0.359 (0.338 - 0.382) | 0.142 (0.137 - 0.148) | 0.251 (0.238 - 0.264) | 0.249 (0.236 - 0.262) | 0.549 (0.538 - 0.559) | 0.177 | 0.341 | 0.140 | 0.240 | 0.239 | 0.542 | |
3 | Jung_CMU_t6_3 | False | jung_cmu_t6_2024 | 0.172 (0.173 - 0.182) | 0.345 (0.325 - 0.368) | 0.141 (0.135 - 0.146) | 0.243 (0.231 - 0.256) | 0.239 (0.227 - 0.252) | 0.547 (0.536 - 0.557) | 0.174 | 0.333 | 0.132 | 0.232 | 0.232 | 0.544 | |
4 | Jung_CMU_t6_1 | False | jung_cmu_t6_2024 | 0.181 (0.182 - 0.192) | 0.387 (0.365 - 0.412) | 0.135 (0.130 - 0.141) | 0.261 (0.248 - 0.276) | 0.260 (0.247 - 0.275) | 0.544 (0.534 - 0.555) | 0.182 | 0.366 | 0.133 | 0.250 | 0.249 | 0.541 | |
5 | Kim_SNU_t6_2 | False | kim_snu_t6_2024 | 0.199 (0.200 - 0.210) | 0.480 (0.453 - 0.508) | 0.148 (0.142 - 0.154) | 0.314 (0.299 - 0.330) | 0.314 (0.299 - 0.330) | 0.544 (0.534 - 0.555) | 0.196 | 0.477 | 0.142 | 0.310 | 0.310 | 0.542 | |
6 | Kim_SNU_t6_4 | False | kim_snu_t6_2024 | 0.199 (0.200 - 0.211) | 0.487 (0.460 - 0.516) | 0.151 (0.145 - 0.158) | 0.319 (0.303 - 0.336) | 0.319 (0.303 - 0.336) | 0.544 (0.534 - 0.555) | 0.199 | 0.478 | 0.149 | 0.313 | 0.313 | 0.542 | |
7 | Kim_SNU_t6_3 | False | kim_snu_t6_2024 | 0.197 (0.198 - 0.209) | 0.472 (0.446 - 0.501) | 0.148 (0.142 - 0.154) | 0.310 (0.295 - 0.326) | 0.310 (0.295 - 0.326) | 0.542 (0.532 - 0.552) | 0.200 | 0.478 | 0.149 | 0.313 | 0.313 | 0.539 | |
8 | Chen_SJTU_t6_4 | True | chen_sjtu_t6_2024 | 0.194 (0.196 - 0.207) | 0.509 (0.479 - 0.541) | 0.145 (0.138 - 0.151) | 0.327 (0.310 - 0.345) | 0.322 (0.306 - 0.341) | 0.541 (0.530 - 0.552) | 0.193 | 0.522 | 0.148 | 0.335 | 0.333 | 0.543 | |
9 | Chen_SJTU_t6_3 | True | chen_sjtu_t6_2024 | 0.194 (0.196 - 0.207) | 0.510 (0.480 - 0.542) | 0.145 (0.139 - 0.152) | 0.327 (0.310 - 0.346) | 0.323 (0.306 - 0.342) | 0.541 (0.530 - 0.552) | 0.193 | 0.518 | 0.148 | 0.333 | 0.331 | 0.543 | |
10 | Chen_SJTU_t6_1 | True | chen_sjtu_t6_2024 | 0.195 (0.197 - 0.208) | 0.497 (0.468 - 0.528) | 0.144 (0.138 - 0.151) | 0.321 (0.304 - 0.339) | 0.317 (0.301 - 0.335) | 0.540 (0.529 - 0.551) | 0.195 | 0.512 | 0.147 | 0.329 | 0.329 | 0.543 | |
11 | Kim_SNU_t6_1 | False | kim_snu_t6_2024 | 0.195 (0.197 - 0.207) | 0.470 (0.443 - 0.499) | 0.145 (0.139 - 0.151) | 0.307 (0.292 - 0.324) | 0.307 (0.292 - 0.324) | 0.540 (0.530 - 0.550) | 0.199 | 0.483 | 0.148 | 0.316 | 0.316 | 0.539 | |
12 | Chen_SJTU_t6_2 | True | chen_sjtu_t6_2024 | 0.195 (0.197 - 0.208) | 0.518 (0.489 - 0.551) | 0.146 (0.140 - 0.153) | 0.332 (0.315 - 0.351) | 0.329 (0.312 - 0.348) | 0.538 (0.527 - 0.550) | 0.196 | 0.537 | 0.150 | 0.343 | 0.342 | 0.540 | |
13 | Li_ALXC_t6_4 | False | li_alxc_t6_2024 | 0.195 (0.197 - 0.208) | 0.493 (0.464 - 0.525) | 0.145 (0.139 - 0.151) | 0.319 (0.302 - 0.337) | 0.317 (0.300 - 0.335) | 0.533 (0.522 - 0.543) | 0.194 | 0.503 | 0.145 | 0.324 | 0.323 | 0.532 | |
14 | Li_ALXC_t6_3 | False | li_alxc_t6_2024 | 0.177 (0.179 - 0.189) | 0.441 (0.415 - 0.470) | 0.128 (0.123 - 0.134) | 0.285 (0.270 - 0.301) | 0.284 (0.270 - 0.301) | 0.528 (0.517 - 0.538) | 0.178 | 0.447 | 0.127 | 0.287 | 0.287 | 0.521 | |
15 | Kyogu_SNU_t6_2 | False | kyogu_snu_t6_2024 | 0.189 (0.190 - 0.201) | 0.409 (0.383 - 0.437) | 0.135 (0.129 - 0.141) | 0.272 (0.257 - 0.288) | 0.272 (0.257 - 0.288) | 0.526 (0.515 - 0.537) | 0.187 | 0.412 | 0.134 | 0.273 | 0.273 | 0.518 | |
16 | Kong_CUHK_t6_1 | True | kong_cuhk_t6_2024 | 0.192 (0.195 - 0.206) | 0.495 (0.467 - 0.526) | 0.141 (0.135 - 0.147) | 0.318 (0.301 - 0.336) | 0.315 (0.299 - 0.333) | 0.525 (0.514 - 0.536) | 0.196 | 0.529 | 0.138 | 0.334 | 0.332 | 0.528 | |
17 | Kong_CUHK_t6_2 | False | kong_cuhk_t6_2024 | 0.193 (0.195 - 0.206) | 0.478 (0.451 - 0.507) | 0.145 (0.138 - 0.151) | 0.311 (0.296 - 0.328) | 0.307 (0.292 - 0.325) | 0.525 (0.514 - 0.536) | 0.193 | 0.495 | 0.140 | 0.317 | 0.314 | 0.523 | |
18 | Choi_KAIST_t6_1 | False | choi_kaist_t6_2024 | 0.187 (0.188 - 0.199) | 0.465 (0.438 - 0.494) | 0.135 (0.129 - 0.142) | 0.300 (0.284 - 0.317) | 0.299 (0.284 - 0.316) | 0.520 (0.509 - 0.531) | 0.189 | 0.464 | 0.134 | 0.299 | 0.299 | 0.521 | |
19 | Li_ALXC_t6_1 | False | li_alxc_t6_2024 | 0.190 (0.191 - 0.202) | 0.474 (0.446 - 0.506) | 0.141 (0.135 - 0.148) | 0.308 (0.291 - 0.326) | 0.307 (0.290 - 0.325) | 0.520 (0.509 - 0.532) | 0.191 | 0.499 | 0.139 | 0.319 | 0.318 | 0.522 | |
20 | Li_SCUT_t6_4 | False | li_scut_t6_2024 | 0.188 (0.190 - 0.201) | 0.468 (0.440 - 0.497) | 0.138 (0.132 - 0.145) | 0.303 (0.287 - 0.320) | 0.302 (0.286 - 0.319) | 0.520 (0.508 - 0.531) | 0.189 | 0.469 | 0.134 | 0.301 | 0.301 | 0.513 | |
21 | Li_SCUT_t6_3 | False | li_scut_t6_2024 | 0.189 (0.191 - 0.202) | 0.471 (0.443 - 0.502) | 0.138 (0.132 - 0.145) | 0.305 (0.288 - 0.322) | 0.304 (0.288 - 0.322) | 0.519 (0.508 - 0.530) | 0.187 | 0.467 | 0.133 | 0.134 | 0.300 | 0.512 | |
22 | Choi_KAIST_t6_2 | False | choi_kaist_t6_2024 | 0.184 (0.185 - 0.196) | 0.429 (0.403 - 0.457) | 0.133 (0.127 - 0.139) | 0.281 (0.266 - 0.297) | 0.279 (0.264 - 0.296) | 0.518 (0.507 - 0.529) | 0.182 | 0.414 | 0.130 | 0.272 | 0.272 | 0.515 | |
23 | Li_ALXC_t6_2 | False | li_alxc_t6_2024 | 0.187 (0.188 - 0.199) | 0.462 (0.433 - 0.492) | 0.135 (0.129 - 0.141) | 0.298 (0.282 - 0.316) | 0.298 (0.281 - 0.315) | 0.518 (0.506 - 0.529) | 0.187 | 0.458 | 0.137 | 0.298 | 0.297 | 0.520 | |
24 | Silva_JKUICP_t6_2 | False | de_jesus_silva_jkuicp_t6_2024 | 0.188 (0.190 - 0.201) | 0.456 (0.430 - 0.484) | 0.138 (0.132 - 0.144) | 0.297 (0.282 - 0.313) | 0.296 (0.281 - 0.313) | 0.516 (0.505 - 0.527) | 0.192 | 0.479 | 0.138 | 0.309 | 0.308 | 0.508 | |
25 | Li_SCUT_t6_2 | False | li_scut_t6_2024 | 0.189 (0.191 - 0.202) | 0.467 (0.441 - 0.497) | 0.139 (0.133 - 0.145) | 0.303 (0.287 - 0.320) | 0.301 (0.286 - 0.318) | 0.516 (0.505 - 0.527) | 0.186 | 0.460 | 0.133 | 0.296 | 0.295 | 0.505 | |
26 | Silva_JKUICP_t6_1 | False | de_jesus_silva_jkuicp_t6_2024 | 0.187 (0.188 - 0.199) | 0.450 (0.424 - 0.478) | 0.135 (0.129 - 0.141) | 0.292 (0.277 - 0.308) | 0.291 (0.276 - 0.307) | 0.515 (0.504 - 0.526) | 0.186 | 0.451 | 0.134 | 0.292 | 0.290 | 0.506 | |
27 | Epshtein_ARC_t6_1 | False | epshtein_arc_t6_2024 | 0.188 (0.190 - 0.200) | 0.462 (0.437 - 0.491) | 0.137 (0.131 - 0.143) | 0.300 (0.285 - 0.316) | 0.298 (0.283 - 0.315) | 0.514 (0.503 - 0.525) | 0.189 | 0.473 | 0.135 | 0.304 | 0.302 | 0.504 | |
28 | Hong_CAU_t6_1 | False | hong_cau_t6_2024 | 0.184 (0.185 - 0.195) | 0.427 (0.402 - 0.454) | 0.134 (0.128 - 0.140) | 0.280 (0.266 - 0.296) | 0.279 (0.265 - 0.295) | 0.513 (0.502 - 0.524) | 0.188 | 0.458 | 0.133 | 0.295 | 0.294 | 0.509 | |
29 | Kyogu_SNU_t6_1 | False | kyogu_snu_t6_2024 | 0.186 (0.189 - 0.200) | 0.441 (0.414 - 0.469) | 0.134 (0.128 - 0.140) | 0.288 (0.272 - 0.304) | 0.287 (0.271 - 0.303) | 0.512 (0.501 - 0.524) | 0.185 | 0.444 | 0.133 | 0.288 | 0.287 | 0.507 | |
30 | Baseline | False | labbé_irit_t6_2024 | 0.186 (0.187 - 0.198) | 0.442 (0.417 - 0.468) | 0.135 (0.129 - 0.141) | 0.288 (0.274 - 0.304) | 0.287 (0.273 - 0.303) | 0.510 (0.499 - 0.521) | 0.190 | 0.462 | 0.134 | 0.298 | 0.296 | 0.504 | |
31 | Li_SCUT_t6_1 | False | li_scut_t6_2024 | 0.187 (0.189 - 0.200) | 0.459 (0.432 - 0.488) | 0.137 (0.131 - 0.143) | 0.298 (0.283 - 0.315) | 0.296 (0.281 - 0.314) | 0.508 (0.496 - 0.519) | 0.187 | 0.470 | 0.131 | 0.301 | 0.300 | 0.507 |
Systems ranking, additional metrics
All metrics in this table are computed on the Clotho evaluation split.

Submission rank | Submission code | Data leak | Technical Report | FENSE | Sentence-BERT | Fluency Error Rate | Vocabulary |
---|---|---|---|---|---|---|---|
1 | Jung_CMU_t6_4 | False | jung_cmu_t6_2024 | 0.554 (0.543 - 0.563) | 0.556 (0.546 - 0.566) | 0.004 (0.001 - 0.010) | 915.0 | |
2 | Jung_CMU_t6_2 | False | jung_cmu_t6_2024 | 0.549 (0.538 - 0.559) | 0.553 (0.543 - 0.563) | 0.008 (0.004 - 0.014) | 920.0 | |
3 | Jung_CMU_t6_3 | False | jung_cmu_t6_2024 | 0.547 (0.536 - 0.557) | 0.554 (0.544 - 0.564) | 0.012 (0.007 - 0.020) | 888.0 | |
4 | Jung_CMU_t6_1 | False | jung_cmu_t6_2024 | 0.544 (0.534 - 0.555) | 0.548 (0.537 - 0.558) | 0.007 (0.003 - 0.013) | 896.0 | |
5 | Kim_SNU_t6_2 | False | kim_snu_t6_2024 | 0.544 (0.534 - 0.555) | 0.544 (0.534 - 0.555) | 0.000 (0.000 - 0.000) | 836.0 | |
6 | Kim_SNU_t6_4 | False | kim_snu_t6_2024 | 0.544 (0.534 - 0.555) | 0.544 (0.534 - 0.555) | 0.000 (0.000 - 0.000) | 799.0 | |
7 | Kim_SNU_t6_3 | False | kim_snu_t6_2024 | 0.542 (0.532 - 0.552) | 0.542 (0.532 - 0.552) | 0.000 (0.000 - 0.000) | 840.0 | |
8 | Chen_SJTU_t6_4 | True | chen_sjtu_t6_2024 | 0.541 (0.530 - 0.552) | 0.546 (0.536 - 0.557) | 0.009 (0.004 - 0.015) | 783.0 | |
9 | Chen_SJTU_t6_3 | True | chen_sjtu_t6_2024 | 0.541 (0.530 - 0.552) | 0.546 (0.535 - 0.557) | 0.009 (0.004 - 0.015) | 787.0 | |
10 | Chen_SJTU_t6_1 | True | chen_sjtu_t6_2024 | 0.540 (0.529 - 0.551) | 0.546 (0.534 - 0.556) | 0.010 (0.005 - 0.017) | 835.0 | |
11 | Kim_SNU_t6_1 | False | kim_snu_t6_2024 | 0.540 (0.530 - 0.550) | 0.540 (0.530 - 0.550) | 0.000 (0.000 - 0.000) | 832.0 | |
12 | Chen_SJTU_t6_2 | True | chen_sjtu_t6_2024 | 0.538 (0.527 - 0.550) | 0.543 (0.532 - 0.554) | 0.010 (0.005 - 0.017) | 800.0 | |
13 | Li_ALXC_t6_4 | False | li_alxc_t6_2024 | 0.533 (0.522 - 0.543) | 0.535 (0.524 - 0.545) | 0.004 (0.001 - 0.010) | 786.0 | |
14 | Li_ALXC_t6_3 | False | li_alxc_t6_2024 | 0.528 (0.517 - 0.538) | 0.528 (0.518 - 0.539) | 0.001 (0.000 - 0.006) | 612.0 | |
15 | Kyogu_SNU_t6_2 | False | kyogu_snu_t6_2024 | 0.526 (0.515 - 0.537) | 0.526 (0.516 - 0.537) | 0.001 (0.000 - 0.006) | 954.0 | |
16 | Kong_CUHK_t6_1 | True | kong_cuhk_t6_2024 | 0.525 (0.514 - 0.536) | 0.529 (0.518 - 0.539) | 0.006 (0.002 - 0.012) | 606.0 | |
17 | Kong_CUHK_t6_2 | False | kong_cuhk_t6_2024 | 0.525 (0.514 - 0.536) | 0.531 (0.520 - 0.541) | 0.011 (0.006 - 0.018) | 565.0 | |
18 | Choi_KAIST_t6_1 | False | choi_kaist_t6_2024 | 0.520 (0.509 - 0.531) | 0.521 (0.510 - 0.532) | 0.003 (0.001 - 0.009) | 609.0 | |
19 | Li_ALXC_t6_1 | False | li_alxc_t6_2024 | 0.520 (0.509 - 0.532) | 0.522 (0.511 - 0.533) | 0.004 (0.001 - 0.010) | 751.0 | |
20 | Li_SCUT_t6_4 | False | li_scut_t6_2024 | 0.520 (0.508 - 0.531) | 0.521 (0.510 - 0.532) | 0.002 (0.000 - 0.007) | 498.0 | |
21 | Li_SCUT_t6_3 | False | li_scut_t6_2024 | 0.519 (0.508 - 0.530) | 0.520 (0.509 - 0.531) | 0.002 (0.000 - 0.007) | 513.0 | |
22 | Choi_KAIST_t6_2 | False | choi_kaist_t6_2024 | 0.518 (0.507 - 0.529) | 0.520 (0.509 - 0.531) | 0.004 (0.001 - 0.010) | 866.0 | |
23 | Li_ALXC_t6_2 | False | li_alxc_t6_2024 | 0.518 (0.506 - 0.529) | 0.520 (0.508 - 0.531) | 0.003 (0.001 - 0.008) | 773.0 | |
24 | Silva_JKUICP_t6_2 | False | de_jesus_silva_jkuicp_t6_2024 | 0.516 (0.505 - 0.527) | 0.517 (0.505 - 0.528) | 0.001 (0.000 - 0.005) | 606.0 | |
25 | Li_SCUT_t6_2 | False | li_scut_t6_2024 | 0.516 (0.505 - 0.527) | 0.517 (0.506 - 0.528) | 0.002 (0.000 - 0.007) | 517.0 | |
26 | Silva_JKUICP_t6_1 | False | de_jesus_silva_jkuicp_t6_2024 | 0.515 (0.504 - 0.526) | 0.517 (0.506 - 0.528) | 0.003 (0.001 - 0.008) | 610.0 | |
27 | Epshtein_ARC_t6_1 | False | epshtein_arc_t6_2024 | 0.514 (0.503 - 0.525) | 0.516 (0.505 - 0.527) | 0.005 (0.002 - 0.011) | 563.0 | |
28 | Hong_CAU_t6_1 | False | hong_cau_t6_2024 | 0.513 (0.502 - 0.524) | 0.515 (0.504 - 0.526) | 0.004 (0.001 - 0.010) | 604.0 | |
29 | Kyogu_SNU_t6_1 | False | kyogu_snu_t6_2024 | 0.512 (0.501 - 0.524) | 0.515 (0.503 - 0.526) | 0.004 (0.001 - 0.010) | 822.0 | |
30 | Baseline | False | labbé_irit_t6_2024 | 0.510 (0.499 - 0.521) | 0.512 (0.501 - 0.523) | 0.004 (0.001 - 0.010) | 532.0 | |
31 | Li_SCUT_t6_1 | False | li_scut_t6_2024 | 0.508 (0.496 - 0.519) | 0.511 (0.499 - 0.522) | 0.006 (0.002 - 0.012) | 539.0 |
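As a reading aid for the three metric columns above: FENSE combines a Sentence-BERT similarity between the candidate and reference captions with a penalty applied when a fluency-error detector flags the candidate, which is why systems with a zero Fluency Error Rate have identical FENSE and Sentence-BERT values. The sketch below illustrates this relation; the `embed` and `has_fluency_error` callables are placeholders for the Sentence-BERT encoder and the fluency classifier, and the max-over-references aggregation and 0.9 penalty factor are assumptions rather than the official constants.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fense_like_score(candidate, references, embed, has_fluency_error, penalty=0.9):
    """Per-caption Sentence-BERT similarity with a fluency penalty (FENSE-style sketch).

    embed(text) -> vector and has_fluency_error(text) -> bool are placeholders
    for a Sentence-BERT encoder and a fluency-error classifier. Aggregating with
    the maximum over references and scaling by (1 - penalty) on error are
    assumptions; the official implementation may differ.
    """
    cand_emb = embed(candidate)
    sbert = max(cosine(cand_emb, embed(ref)) for ref in references)
    if has_fluency_error(candidate):
        sbert *= (1.0 - penalty)
    return sbert
```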
System characteristics
In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections. The first table has an overview of the systems and the second has a detailed presentation of each system.
Overview of characteristics
Submission rank | Submission code | Data leak | FENSE | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---|---|
1 | Jung_CMU_t6_4 | False | 0.554 (0.543 - 0.563) | jung_cmu_t6_2024 | encoder-decoder | 7857055850 | Conformer | transformer | SpecAugment, mixup |
2 | Jung_CMU_t6_2 | False | 0.549 (0.538 - 0.559) | jung_cmu_t6_2024 | encoder-decoder | 1571411170 | Conformer | transformer | SpecAugment, mixup |
3 | Jung_CMU_t6_3 | False | 0.547 (0.536 - 0.557) | jung_cmu_t6_2024 | encoder-decoder | 5612182750 | Conformer | transformer | SpecAugment, mixup |
4 | Jung_CMU_t6_1 | False | 0.544 (0.534 - 0.555) | jung_cmu_t6_2024 | encoder-decoder | 224487310 | Conformer | transformer | SpecAugment, mixup |
5 | Kim_SNU_t6_2 | False | 0.544 (0.534 - 0.555) | kim_snu_t6_2024 | encoder-decoder | 754328981 | cnn | transformer | mixup |
6 | Kim_SNU_t6_4 | False | 0.544 (0.534 - 0.555) | kim_snu_t6_2024 | encoder-decoder | 4575364501 | cnn | transformer | mixup |
7 | Kim_SNU_t6_3 | False | 0.542 (0.532 - 0.552) | kim_snu_t6_2024 | encoder-decoder | 3620105621 | cnn | transformer | mixup |
8 | Chen_SJTU_t6_4 | True | 0.541 (0.530 - 0.552) | chen_sjtu_t6_2024 | encoder-decoder | 6840335631 | transformer | transformer | SpecAugment, mixup |
9 | Chen_SJTU_t6_3 | True | 0.541 (0.530 - 0.552) | chen_sjtu_t6_2024 | encoder-decoder | 6840335631 | transformer | transformer | SpecAugment, mixup |
10 | Chen_SJTU_t6_1 | True | 0.540 (0.529 - 0.551) | chen_sjtu_t6_2024 | encoder-decoder | 6840335631 | transformer | transformer | SpecAugment, mixup |
11 | Kim_SNU_t6_1 | False | 0.540 (0.530 - 0.550) | kim_snu_t6_2024 | encoder-decoder | 754328981 | cnn | transformer | mixup |
12 | Chen_SJTU_t6_2 | True | 0.538 (0.527 - 0.550) | chen_sjtu_t6_2024 | encoder-decoder | 6840335631 | transformer | transformer | SpecAugment, mixup |
13 | Li_ALXC_t6_4 | False | 0.533 (0.522 - 0.543) | li_alxc_t6_2024 | encoder-decoder | 6850672271 | ced | transformer | |
14 | Li_ALXC_t6_3 | False | 0.528 (0.517 - 0.538) | li_alxc_t6_2024 | encoder-decoder | 245365903 | ced | transformer | |
15 | Kyogu_SNU_t6_2 | False | 0.526 (0.515 - 0.537) | kyogu_snu_t6_2024 | encoder-decoder | 8131137200 | |||
16 | Kong_CUHK_t6_1 | True | 0.525 (0.514 - 0.536) | kong_cuhk_t6_2024 | encoder-decoder | 146403855 | cnn | transformer | spec-based mixup, label smoothing |
17 | Kong_CUHK_t6_2 | False | 0.525 (0.514 - 0.536) | kong_cuhk_t6_2024 | encoder-decoder | 126355215 | cnn | transformer | spec-based mixup, label smoothing |
18 | Choi_KAIST_t6_1 | False | 0.520 (0.509 - 0.531) | choi_kaist_t6_2024 | encoder-decoder | 42038209 | transformer | mixup, label smoothing, ChatGPT paraphrasing | |
19 | Li_ALXC_t6_1 | False | 0.520 (0.509 - 0.532) | li_alxc_t6_2024 | encoder-decoder | 6850408320 | Dasheng | transformer | |
20 | Li_SCUT_t6_4 | False | 0.520 (0.508 - 0.531) | li_scut_t6_2024 | ConvNeXt-Trans | 41303080 | ConvNeXt | transformer | mixup, SpecAugment |
21 | Li_SCUT_t6_3 | False | 0.519 (0.508 - 0.530) | li_scut_t6_2024 | ConvNeXt-Trans | 41303080 | ConvNeXt | transformer | mixup, SpecAugment |
22 | Choi_KAIST_t6_2 | False | 0.518 (0.507 - 0.529) | choi_kaist_t6_2024 | encoder-decoder | 42038209 | transformer | mixup, label smoothing, ChatGPT paraphrasing | |
23 | Li_ALXC_t6_2 | False | 0.518 (0.506 - 0.529) | li_alxc_t6_2024 | encoder-decoder | 7397882752 | Dasheng | transformer | |
24 | Silva_JKUICP_t6_2 | False | 0.516 (0.505 - 0.527) | de_jesus_silva_jkuicp_t6_2024 | encoder-decoder | 59486498 | transformer | mixup, label smoothing | |
25 | Li_SCUT_t6_2 | False | 0.516 (0.505 - 0.527) | li_scut_t6_2024 | ConvNeXt-Trans | 41303080 | ConvNeXt | transformer | mixup, SpecAugment |
26 | Silva_JKUICP_t6_1 | False | 0.515 (0.504 - 0.526) | de_jesus_silva_jkuicp_t6_2024 | encoder-decoder | 59486498 | transformer | mixup, label smoothing | |
27 | Epshtein_ARC_t6_1 | False | 0.514 (0.503 - 0.525) | epshtein_arc_t6_2024 | encoder-decoder | 48014000 | transformer | mixup, label smoothing | |
28 | Hong_CAU_t6_1 | False | 0.513 (0.502 - 0.524) | hong_cau_t6_2024 | encoder-decoder | 41303080 | transformer | mixup, label smoothing | |
29 | Kyogu_SNU_t6_1 | False | 0.512 (0.501 - 0.524) | kyogu_snu_t6_2024 | encoder-decoder | 8131137200 | |||
30 | Baseline | False | 0.510 (0.499 - 0.521) | labbé_irit_t6_2024 | encoder-decoder | 41303080 | transformer | mixup, label smoothing | |
31 | Li_SCUT_t6_1 | False | 0.508 (0.496 - 0.519) | li_scut_t6_2024 | ConvNeXt-Trans | 41303080 | ConvNeXt | transformer | mixup, SpecAugment |
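Almost every system in the data augmentation column above lists mixup (often alongside SpecAugment). The snippet below is a minimal, generic mixup sketch for batches of log-mel spectrograms; the Beta(0.4, 0.4) mixing distribution is an assumed, common choice, and each team applies its own variant (e.g. spec-based mixup or ChatGPT caption mixup), so this is not any particular submission's implementation.

```python
import numpy as np

def mixup_batch(spectrograms, alpha=0.4, seed=None):
    """Mix each spectrogram in the batch with a randomly permuted partner.

    spectrograms: array of shape (batch, n_mels, time). Returns the mixed batch,
    the partner indices, and the mixing coefficients, so the caller can decide
    how to combine the corresponding targets (teams differ on this point).
    The Beta(alpha, alpha) distribution is a common but assumed choice.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(spectrograms, dtype=float)
    lam = rng.beta(alpha, alpha, size=x.shape[0])
    perm = rng.permutation(x.shape[0])
    mixed = lam[:, None, None] * x + (1.0 - lam)[:, None, None] * x[perm]
    return mixed, perm, lam
```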
Detailed characteristics
Submission rank | Submission code | Data leak | FENSE | Technical Report | Method scheme/architecture | Amount of learnable parameters | Amount of frozen parameters | Amount of inference parameters | Amount of total parameters | Amount of inference MACs | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble number of systems | Loss function | Optimizer | Learning rate | Weight decay | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for training | Number of GPUs used for training | GPU model used for training |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Jung_CMU_t6_4 | False | 0.554 (0.543 - 0.563) | jung_cmu_t6_2024 | encoder-decoder | 3653368320 | 4203687530 | 7857055850 | 7857055850 | Conformer | BEATs, ConvNeXt-Tiny | transformer | BART | SpecAugment, mixup | 32kHz, 16kHz | supervised | 5 | cross_entropy, infonce | AdamW | 2e-5 | 0.001 | 0.0 | validation_accuracy | Clotho, AudioCaps | 4 | NVIDIA A5000 | ||
2 | Jung_CMU_t6_2 | False | 0.549 (0.538 - 0.559) | jung_cmu_t6_2024 | encoder-decoder | 730673664 | 840737506 | 1571411170 | 1571411170 | Conformer | BEATs, ConvNeXt-Tiny | transformer | BART | SpecAugment, mixup | 32kHz, 16kHz | supervised | 5 | cross_entropy, infonce | AdamW | 2e-5 | 0.001 | 0.0 | validation_accuracy | Clotho, AudioCaps | 4 | NVIDIA A5000 | ||
3 | Jung_CMU_t6_3 | False | 0.547 (0.536 - 0.557) | jung_cmu_t6_2024 | encoder-decoder | 2609548800 | 3002633950 | 5612182750 | 5612182750 | Conformer | BEATs, ConvNeXt-Tiny | transformer | BART | SpecAugment, mixup | 32kHz, 16kHz | supervised | 5 | cross_entropy, infonce | AdamW | 2e-5 | 0.001 | 0.0 | validation_accuracy | Clotho, AudioCaps | 4 | NVIDIA A5000 | ||
4 | Jung_CMU_t6_1 | False | 0.544 (0.534 - 0.555) | jung_cmu_t6_2024 | encoder-decoder | 104381952 | 120105358 | 224487310 | 224487310 | Conformer | BEATs, ConvNeXt-Tiny | transformer | BART | SpecAugment, mixup | 32kHz, 16kHz | supervised | 1 | cross_entropy, infonce | AdamW | 2e-5 | 0.001 | 0.0 | validation_accuracy | Clotho, AudioCaps | 4 | NVIDIA A5000 | ||
5 | Kim_SNU_t6_2 | False | 0.544 (0.534 - 0.555) | kim_snu_t6_2024 | encoder-decoder | 477629440 | 276699541 | 754328981 | 754328981 | cnn | ConvNeXt-Tiny | transformer | BART-large | mixup | 32kHz | supervised | 1 | cross_entropy | AdamW | 3e-5 | 0.010 | 1.0 | L2 | FENSE | Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps | 8 | NVIDIA A100 80GB | |
6 | Kim_SNU_t6_4 | False | 0.544 (0.534 - 0.555) | kim_snu_t6_2024 | encoder-decoder | 4298664960 | 276699541 | 4575364501 | 4575364501 | cnn | ConvNeXt-Tiny | transformer | BART-large | mixup | 32kHz | supervised | 9 | cross_entropy | AdamW | 3e-5 | 0.010 | 1.0 | L2 | FENSE | Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps | 8 | NVIDIA A100 80GB | |
7 | Kim_SNU_t6_3 | False | 0.542 (0.532 - 0.552) | kim_snu_t6_2024 | encoder-decoder | 3343406080 | 276699541 | 3620105621 | 3620105621 | cnn | ConvNeXt-Tiny | transformer | BART-large | mixup | 32kHz | supervised | 7 | cross_entropy | AdamW | 3e-5 | 0.010 | 1.0 | L2 | FENSE | Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps | 8 | NVIDIA A100 80GB | |
8 | Chen_SJTU_t6_4 | True | 0.541 (0.530 - 0.552) | chen_sjtu_t6_2024 | encoder-decoder | 20453376 | 6819882255 | 6840335631 | 6840335631 | 6990830300000 | transformer | EAT | transformer | vicuna-7b-v1.5 | SpecAugment, mixup | 16kHz | supervised | 10 | cross_entropy | AdamW | 8e-6 | 0.000 | 0.0 | validation_loss | Clotho, AudioCaps, MACS, WavCaps | 1 | NVIDIA A800-SXM4-80GB | |
9 | Chen_SJTU_t6_3 | True | 0.541 (0.530 - 0.552) | chen_sjtu_t6_2024 | encoder-decoder | 20453376 | 6819882255 | 6840335631 | 6840335631 | 6990830300000 | transformer | EAT | transformer | vicuna-7b-v1.5 | SpecAugment, mixup | 16kHz | supervised | 10 | cross_entropy | AdamW | 8e-6 | 0.000 | 0.0 | validation_loss | Clotho, AudioCaps, MACS, WavCaps | 1 | NVIDIA A800-SXM4-80GB | |
10 | Chen_SJTU_t6_1 | True | 0.540 (0.529 - 0.551) | chen_sjtu_t6_2024 | encoder-decoder | 20453376 | 6819882255 | 6840335631 | 6840335631 | 6990830300000 | transformer | EAT | transformer | vicuna-7b-v1.5 | SpecAugment, mixup | 16kHz | supervised | 1 | cross_entropy | AdamW | 8e-6 | 0.000 | 0.0 | validation_loss | Clotho, AudioCaps, MACS, WavCaps | 1 | NVIDIA A800-SXM4-80GB | |
11 | Kim_SNU_t6_1 | False | 0.540 (0.530 - 0.550) | kim_snu_t6_2024 | encoder-decoder | 477629440 | 276699541 | 754328981 | 754328981 | cnn | ConvNeXt-Tiny | transformer | BART-large | mixup | 32kHz | supervised | 1 | cross_entropy | AdamW | 3e-5 | 0.010 | 1.0 | L2 | FENSE | Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps | 8 | NVIDIA A100 80GB | |
12 | Chen_SJTU_t6_2 | True | 0.538 (0.527 - 0.550) | chen_sjtu_t6_2024 | encoder-decoder | 20453376 | 6819882255 | 6840335631 | 6840335631 | 6990830300000 | transformer | EAT | transformer | vicuna-7b-v1.5 | SpecAugment, mixup | 16kHz | supervised | 5 | cross_entropy | AdamW | 8e-6 | 0.000 | 0.0 | validation_loss | Clotho, AudioCaps, MACS, WavCaps | 1 | NVIDIA A800-SXM4-80GB | |
13 | Li_ALXC_t6_4 | False | 0.533 (0.522 - 0.543) | li_alxc_t6_2024 | encoder-decoder | 26544896 | 6824127375 | 6850672271 | 6850672271 | none | ced | CED | transformer | llama2_7b | 16kHz | supervised | 1 | cross_entropy | AdamW | 5e-5 | 0.0 | validation_loss | Clotho | 2 | NVIDIA A100 | |||
14 | Li_ALXC_t6_3 | False | 0.528 (0.517 - 0.538) | li_alxc_t6_2024 | encoder-decoder | 20233728 | 225132175 | 245365903 | 245365903 | none | ced | CED | transformer | bart | 16kHz | supervised | 1 | cross_entropy | AdamW | 5e-5 | 0.0 | validation_loss | Clotho | 2 | NVIDIA A100 | |||
15 | Kyogu_SNU_t6_2 | False | 0.526 (0.515 - 0.537) | kyogu_snu_t6_2024 | encoder-decoder | 9965568 | 8121171632 | 8124321456 | 8131137200 | BEATs | LLaMa | 16kHz | supervised | 1 | cross_entropy | AdamW | 3e-4 | 5.0 | L2 | validation_loss | Clotho, AudioCaps | 1 | NVIDIA GeForce RTX 3090 | |||||
16 | Kong_CUHK_t6_1 | True | 0.525 (0.514 - 0.536) | kong_cuhk_t6_2024 | encoder-decoder | 117015552 | 29388303 | 146403855 | 146403855 | 60483202884 | cnn | ConvNeXt-Tiny | transformer | learned | spec-based mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 3e-5 | 0.0 | the SPIDEr metric | Clotho, AudioCaps, WavCaps | 5 | NVIDIA GeForce RTX 4090 | ||
17 | Kong_CUHK_t6_2 | False | 0.525 (0.514 - 0.536) | kong_cuhk_t6_2024 | encoder-decoder | 96966912 | 29388303 | 126355215 | 126355215 | 53459049669 | cnn | ConvNeXt-Tiny | transformer | learned | spec-based mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 3e-5 | 0.0 | the SPIDEr metric | Clotho | 1 | NVIDIA GeForce RTX 4090 | ||
18 | Choi_KAIST_t6_1 | False | 0.520 (0.509 - 0.531) | choi_kaist_t6_2024 | encoder-decoder | 12649906 | 29388303 | 42038209 | 42038209 | 49888899616 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing, ChatGPT paraphrasing | 32kHz | supervised | 1 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | train_loss | Clotho | 3 | NVIDIA GeForce RTX 2080 Ti | |
19 | Li_ALXC_t6_1 | False | 0.520 (0.509 - 0.532) | li_alxc_t6_2024 | encoder-decoder | 26544896 | 6823863424 | 6850408320 | 6850408320 | none | Dasheng | Dasheng | transformer | llama2_7b | 16kHz | supervised | 1 | cross_entropy | AdamW | 5e-5 | 0.0 | validation_loss | Clotho | 2 | NVIDIA A100 | |||
20 | Li_SCUT_t6_4 | False | 0.520 (0.508 - 0.531) | li_scut_t6_2024 | ConvNeXt-Trans | 11914777 | 29388303 | 41303080 | 41303080 | ConvNeXt | ConvNeXt-Tiny | transformer | mixup, SpecAugment | 32kHz | supervised | 4 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce RTX 4090 Ti | ||
21 | Li_SCUT_t6_3 | False | 0.519 (0.508 - 0.530) | li_scut_t6_2024 | ConvNeXt-Trans | 11914777 | 29388303 | 41303080 | 41303080 | ConvNeXt | ConvNeXt-Tiny | transformer | mixup, SpecAugment | 32kHz | supervised | 4 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce RTX 4090 Ti | ||
22 | Choi_KAIST_t6_2 | False | 0.518 (0.507 - 0.529) | choi_kaist_t6_2024 | encoder-decoder | 12649906 | 29388303 | 42038209 | 42038209 | 50768107552 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing, ChatGPT paraphrasing | 32kHz | supervised | 1 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | train_loss | Clotho | 3 | NVIDIA GeForce RTX 2080 Ti | |
23 | Li_ALXC_t6_2 | False | 0.518 (0.506 - 0.529) | li_alxc_t6_2024 | encoder-decoder | 29133568 | 7368749184 | 7397882752 | 7397882752 | none | Dasheng | Dasheng | transformer | llama2_7b | 16kHz | supervised | 1 | cross_entropy | AdamW | 5e-5 | 0.0 | validation_loss | Clotho | 2 | NVIDIA A100 | |||
24 | Silva_JKUICP_t6_2 | False | 0.516 (0.505 - 0.527) | de_jesus_silva_jkuicp_t6_2024 | encoder-decoder | 30098195 | 29388303 | 59486498 | 59486498 | 14715294720 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 4e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho, Clotho | 1 | NVIDIA GeForce GTX 1060 6GB | |
25 | Li_SCUT_t6_2 | False | 0.516 (0.505 - 0.527) | li_scut_t6_2024 | ConvNeXt-Trans | 11914777 | 29388303 | 41303080 | 41303080 | ConvNeXt | ConvNeXt-Tiny | transformer | mixup, SpecAugment | 32kHz | supervised | 4 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce RTX 4090 Ti | ||
26 | Silva_JKUICP_t6_1 | False | 0.515 (0.504 - 0.526) | de_jesus_silva_jkuicp_t6_2024 | encoder-decoder | 30098195 | 29388303 | 59486498 | 59486498 | 15301713408 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 4e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce GTX 1060 6GB | |
27 | Epshtein_ARC_t6_1 | False | 0.514 (0.503 - 0.525) | epshtein_arc_t6_2024 | encoder-decoder | 12003511 | 36010489 | 48014000 | 48014000 | 4821624576 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy, NTXent | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA T1200 Laptop GPU | |
28 | Hong_CAU_t6_1 | False | 0.513 (0.502 - 0.524) | hong_cau_t6_2024 | encoder-decoder | 11914777 | 29388303 | 41303080 | 41303080 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA 20TF-V100 | ||
29 | Kyogu_SNU_t6_1 | False | 0.512 (0.501 - 0.524) | kyogu_snu_t6_2024 | encoder-decoder | 9965568 | 8121171632 | 8124321456 | 8131137200 | BEATs | LLaMa | 16kHz | supervised | 1 | cross_entropy | AdamW | 3e-4 | 5.0 | L2 | validation_loss | Clotho, AudioCaps | 1 | NVIDIA GeForce RTX 3090 | |||||
30 | Baseline | False | 0.510 (0.499 - 0.521) | labbé_irit_t6_2024 | encoder-decoder | 11914777 | 29388303 | 41303080 | 41303080 | 48762319200 | ConvNeXt-Tiny | transformer | learned | mixup, label smoothing | 32kHz | supervised | 1 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce RTX 2080 Ti | |
31 | Li_SCUT_t6_1 | False | 0.508 (0.496 - 0.519) | li_scut_t6_2024 | ConvNeXt-Trans | 11914777 | 29388303 | 41303080 | 41303080 | ConvNeXt | ConvNeXt-Tiny | transformer | mixup, SpecAugment | 32kHz | supervised | 4 | cross_entropy | AdamW | 5e-4 | 2.000 | 1.0 | L2 | validation_loss | Clotho | 1 | NVIDIA GeForce RTX 4090 Ti |
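Most rows of the detailed table share the same optimisation recipe: cross-entropy (often with label smoothing), the AdamW optimizer, and gradient clipping by L2 norm. The following PyTorch sketch shows how these settings typically combine in a single training step; the model interface is hypothetical and the hyperparameter values in the commented line are taken loosely from the baseline row, so this should be read as an illustration, not any team's actual code.

```python
import torch
from torch import nn

def train_step(model, optimizer, audio_feats, token_ids, pad_id=0, clip_norm=1.0):
    """One generic seq2seq captioning step with label smoothing and gradient clipping.

    Assumption: model(audio_feats, token_ids[:, :-1]) returns logits of shape
    (batch, seq_len - 1, vocab). This mirrors the table's common settings
    (cross_entropy + label smoothing, AdamW, L2 gradient clipping), not any
    specific submission's code.
    """
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
    model.train()
    optimizer.zero_grad()
    logits = model(audio_feats, token_ids[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)), token_ids[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)  # L2-norm clipping
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=2.0)  # values as in the baseline row
```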
Technical reports
AUTOMATIC AUDIO CAPTIONING WITH ENCODER FUSION, MULTI-LAYER AGGREGATION, AND LARGE LANGUAGE MODEL ENRICHED SUMMARIZATION
Jee-weon Jung1, Dong Zhang2, Huck C.-H. Yang3, Shih-Lun Wu1, David M. Chan4, Zhifeng Kong5, Deng Ruifan2, Zhou Yaqian2, Valle Rafael5, Shinji Watanabe1
1Carnegie Mellon University, USA, 2Fudan University, China, 3NVIDIA Research, USA, 4University of California, Berkeley, USA, 5NVIDIA Applied Deep Learning Research, USA
Jung_CMU_t6_1 Jung_CMU_t6_2 Jung_CMU_t6_4 Jung_CMU_t6_3
Abstract
In this report, we describe our submission to Track 6 of the DCASE 2024 challenge for the task of Automated Audio Captioning (AAC). The submitted models utilize an encoder-decoder architecture using pre-trained and frozen audio encoders, a Conformer post-encoder, and a BART decoder. We introduce five different architectures, employing diverse fusion strategies to leverage multiple audio encoders and a multi-layer aggregation technique, thus exploiting the complementary information from various representations. For inference, we propose a novel scheme incorporating nucleus sampling, CLAP-based filtering, hybrid re-ranking, and large language model summarization. Combining these approaches, our top-performing single and ensemble systems achieve Fluency Enhanced Sentence-BERT Evaluation (FENSE) scores of 0.5410 and 0.5442, respectively, on the Clotho (V2) evaluation partition.
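To make the inference scheme described above more concrete, the sketch below shows the general idea of sampling several candidate captions and keeping the one that a CLAP-like model judges most similar to the audio. The `sample_captions` and `audio_text_similarity` callables are placeholders, and the simple arg-max selection is an assumption; it does not reproduce the submission's hybrid re-ranking or LLM summarization stages.

```python
def rerank_with_clap(audio, sample_captions, audio_text_similarity, n_candidates=20):
    """Pick the candidate caption most similar to the audio under a CLAP-like model.

    sample_captions(audio, n) -> list[str]        # e.g. nucleus sampling from a captioner
    audio_text_similarity(audio, text) -> float   # e.g. CLAP embedding cosine similarity
    Both callables are placeholders; the team's hybrid re-ranking and LLM
    summarization stages are not reproduced here.
    """
    candidates = sample_captions(audio, n_candidates)
    scored = sorted(candidates, key=lambda c: audio_text_similarity(audio, c), reverse=True)
    return scored[0], scored
```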
System characteristics
Best submission | Jung_CMU_t6_4 |
Team rank | 1 |
Audio modelling | Conformer |
Word modelling | transformer |
Data augmentation | SpecAugment, mixup |
Ensemble number of systems | 5 |
Train datasets used | Clotho, AudioCaps |
Total number of parameters | 7857055850 |
FENSE score | 0.5536877719555068 |
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim1, Jaeyoon Jung2, Minjeong Jeon3, Sang Hoon Woo4, Jinjoo Lee5
1Seoul National University, Seoul, Republic of Korea, 2Soongsil University, Seoul, Republic of Korea, 3MAUM AI Inc., Seongnam, Republic of Korea, 4Independent Researcher, Everywhere, 5MAUM AI Inc., Seongnam, Republic of Korea
Kim_SNU_t6_3 Kim_SNU_t6_4 Kim_SNU_t6_1 Kim_SNU_t6_2
Abstract
In this technical report, we describe our submission to DCASE 2024 Challenge Task 6 (Automated Audio Captioning) and Task 8 (Language-based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task 6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task 8. Our proposed systems achieve a FENSE score of 0.542 on Task 6 and an mAP@10 score of 0.386 on Task 8, significantly outperforming the baseline models.
System characteristics
Best submission | Kim_SNU_t6_2 |
Team rank | 2 |
Audio modelling | cnn |
Word modelling | transformer |
Data augmentation | mixup |
Ensemble number of systems | 1 |
Train datasets used | Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps |
Total number of parameters | 754328981 |
FENSE score | 0.5441769132406691 |
SJTU-THU Automated Audio Captioning System for DCASE 2024
Wenxi Chen1, Xiquan Li1, Ziyang Ma1, Yuzhe Liang1, Anbai Jiang2, Zhisheng Zheng1, Yanmin Qian1, Pingyi Fan2, Wei-Qiang Zhang2, Cheng Lu3, Jia Liu2, Xie Chen1
1Shanghai Jiao Tong University, Shanghai, China, 2Tsinghua University, Beijing, China, 3North China Electric Power University, Beijing, China
Chen_SJTU_t6_1 Chen_SJTU_t6_4 Chen_SJTU_t6_2 Chen_SJTU_t6_3
Abstract
Task 6 (Automated Audio Captioning) of the DCASE 2024 Challenge requires the automatic creation of textual descriptions for general audio signals. This technical report presents a novel model that integrates a self-supervised model with a large language model (LLM) for audio captioning. For audio feature extraction, we utilize the efficient self-supervised pre-trained model, EAT, to achieve more effective audio representation extraction. The language model component is based on Vicuna, a large language model, which we fine-tune using LoRA to fully harness its robust reasoning capabilities. During training, linear layers function as projectors to align audio and textual representations. Our model is pre-trained using the Clotho, WavCaps, AudioCaps, and MACS datasets, and fine-tuned on Clotho. For decoding, we employ a filtering strategy based on the CLAP model. By leveraging the text-audio alignment capabilities of the CLAP model, we filter out the beam search decoding results to retain only the textual description that best matches the input audio. Evaluation on the testing subset of Clotho demonstrates that our model achieves a FENSE score of 0.5431 in the single-system setting and 0.5429 in the multi-system setting, while the multi-systems outperform the single-system in other metrics. Our project code is based on the SLAM-LLM toolkit.
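The abstract mentions fine-tuning the Vicuna decoder with LoRA. For readers unfamiliar with the technique, the following is a minimal PyTorch sketch of a LoRA-augmented linear layer; the rank and scaling values are assumptions, and the team most likely relied on an existing adapter library rather than code of this form.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # keep the pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity-preserving) update
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# layer = LoRALinear(nn.Linear(4096, 4096), rank=8)  # only ~65k trainable parameters
```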
System characteristics
Best submission | Chen_SJTU_t6_4 |
Team rank | 3 |
Audio modelling | transformer |
Word modelling | transformer |
Data augmentation | SpecAugment, mixup |
Ensemble number of systems | 10 |
Train datasets used | Clotho, AudioCaps, MACS, WavCaps |
Total number of parameters | 6840335631 |
FENSE score | 0.5412474964331918 |
Leveraging CED Encoder and Large Language Models for Automated Audio Captioning
Jizhong Liu1, Gang Li1
1AI Lab, Xiaomi Corporation, Wuhan, China
Li_ALXC_t6_2 Li_ALXC_t6_4 Li_ALXC_t6_1 Li_ALXC_t6_3
Abstract
This technical report presents an automated audio captioning (AAC) method participating in the DCASE 2024 Challenge Task 6. The method builds upon our previous work. Recent advancements in large language models (LLMs), coupled with improved training approaches for audio encoders, have opened up possibilities for enhancing AAC. Thus, we optimize AAC from three points: 1) a pre-trained audio encoder named consistent ensemble distillation (CED) improves the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing acoustic tokens; 2) we introduce a Llama 2 with 7B parameters as the decoder; 3) a frozen Llama 3 Instruct with 8B parameters corrects text errors caused by insufficient training data and annotation ambiguities. Both the encoder and text decoder are optimized by low-rank adaptation (LoRA). Our method obtains a FENSE score of 53.2.
System characteristics
Best submission | Li_ALXC_t6_4 |
Team rank | 4 |
Audio modelling | ced |
Word modelling | transformer |
Ensemble number of systems | 1 |
Train datasets used | Clotho |
Total number of parameters | 6850672271 |
FENSE score | 0.5327607233845204 |
AUTOMATED AUDIO CAPTIONING USING PARAMETER EFFICIENT FINE-TUNING AND MERGING OF LLMS
Kim Eungbeom1, Sim Jaeheon1, Lee Jin Woo1, Lee Kyogu1
1Seoul National University, Seoul, Korea
Kyogu_SNU_t6_2 Kyogu_SNU_t6_1
Abstract
This technical report introduces an audio captioning system designed to tackle the task of Automated Audio Captioning (AAC) in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Our approach employs BEATs for robust audio representation learning and Llama 3 for high-quality text generation. To address the limitations of small datasets like Clotho, we fix the pre-trained weights of BEATs and train a small linear model to map audio encoder dimensions to the LLM input. We further fine-tune the LLM using a parameter-efficient fine-tuning method, LoRA. We also explore a concatenation-based LoRA merging method, achieving notable results on standard benchmarks. Experimental results show that our proposed system achieves a FENSE [1] score of 0.5180 on the evaluation dataset.
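The parameter-efficient setup described above (frozen BEATs encoder, a small trainable mapping into the LLM input space, LoRA on the LLM) can be illustrated with the sketch below; the 768 and 4096 dimensions and the single-linear-layer design are assumptions based on the abstract, not the submission's actual architecture.

```python
import torch
from torch import nn

class AudioToLLMProjector(nn.Module):
    """Map frozen audio-encoder features to LLM input embeddings (sketch)."""

    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, audio_dim) from a frozen encoder such as BEATs
        return self.proj(audio_features)  # (batch, frames, llm_dim), prepended to the text embeddings

# Only the projector (and LoRA adapters) would be trained; the encoder and LLM stay frozen.
```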
System characteristics
Best submission | Kyogu_SNU_t6_2 |
Team rank | 5 |
Audio modelling | None |
Word modelling | None |
Ensemble number of systems | 1 |
Train datasets used | Clotho, AudioCaps |
Total number of parameters | 8131137200 |
FENSE score | 0.5262071474093661 |
Semantic Enhancement Encoder for Audio Captioning and Spectrogram-based data augmentation
Qianhang Feng1, Qiuqiang Kong1
1The Chinese University of Hong Kong, New Territories, Hong Kong
Abstract
Automatic Audio Captioning (AAC) is a process that transforms audio signals into descriptive narratives. This paper introduces an innovative automated audio captioning model developed for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge Task 6A. The model architecture presented here is meticulously designed to adeptly manage the intricacies of AAC tasks. Additionally, this project introduces a novel data enhancement technique, which, with minimal model adjustments, significantly boosts performance. Exclusively trained and fine-tuned on the Clotho dataset, this project achieved a final SPIDEr-FL score of 0.3318, demonstrating its effectiveness.
System characteristics
Best submission | Kong_CUHK_t6_1 |
Team rank | 6 |
Audio modelling | cnn |
Word modelling | transformer |
Data augmentation | spec-based mixup, label smoothing |
Ensemble number of systems | 1 |
Train datasets used | Clotho, AudioCaps, WavCaps |
Total number of parameters | 146403855 |
FENSE score | 0.5254088402978455 |
CHATGPT CAPTION PARAPHRASING AND FENSE-BASED CAPTION FILTERING FOR AUTOMATED AUDIO CAPTIONING
Inhan Choi1, Hyeonuk Nam1, Deokki Min1, Seung-Deok Choi1, Yong-Hwa Park1
1Korea Advanced Institute of Science and Technology, 291, Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea
Choi_KAIST_t6_1 Choi_KAIST_t6_2
Abstract
This paper presents an Automated Audio Captioning (AAC) model developed for the DCASE2024 Task 6. To address the scarcity of audio captioning datasets, we generate paraphrases of captions from the Clotho dataset as a data augmentation strategy. We utilize the ChatGPT-API to produce captions. To ensure the selection of paraphrases with high semantic relevance, we employed FENSE, the metric adopted for this AAC task. By integrating ChatGPT paraphrasing into the AAC baseline model, our submitted model achieves 0.521 FENSE score.
System characteristics
Best submission | Choi_KAIST_t6_1 |
Team rank | 7 |
Audio modelling | None |
Word modelling | transformer |
Data augmentation | mixup, label smoothing, ChatGPT paraphrasing |
Ensemble number of systems | 1 |
Train datasets used | Clotho |
Total number of parameters | 42038209 |
FENSE score | 0.5203327059152886 |
SCUT SUBMISSION FOR AUTOMATED AUDIO CAPTIONING USING GRAPH ATTENTION AND CROSS-ATTENTION MECHANISMS
Qianqian Li1
1South China University of Technology, Guangzhou, China
Li_SCUT_t6_2 Li_SCUT_t6_4 Li_SCUT_t6_1 Li_SCUT_t6_3
Abstract
This report presents our work for automated audio captioning, which is Task 6A of DCASE 2024. Our system is an encoder-decoder framework. The encoder uses a pre-trained ConvNeXt network and the decoder employs a standard Transformer structure. In the encoder, we include a graph attention module to enhance the model's ability to extract audio features. In the decoder, in addition to the Transformer's multi-head self-attention mechanism, a cross-attention mechanism is added to improve the association between output captions and audio features. Finally, our system achieves a FENSE score of 0.5131, which is higher than the baseline system's FENSE score of 0.5040.
System characteristics
Best submission | Li_SCUT_t6_4 |
Team rank | 8 |
Audio modelling | ConvNeXt |
Word modelling | transformer |
Data augmentation | mixup, SpecAugment |
Ensemble number of systems | 4 |
Train datasets used | Clotho |
Total number of parameters | 41303080 |
FENSE score | 0.5196854597534395 |
HYPERPARAMETER TUNING OF THE CONETTE AUDIO CAPTIONING SYSTEM
Jakob De Jesus Silva1, Justus Tobias1, Sebastian Sonderegger1
1Institute for Computational Perception, JKU Linz, Linz, Austria
Abstract
In the course of this challenge, we explored various methods to achieve a state-of-the-art audio captioning model. Initially, we worked with the baseline provided by the challenge organizers, and then we also constructed several models from scratch, using diverse architectures. The best outcome we could achieve was by tuning the hyperparameters of the baseline model CoNeTTE [1]. Our systematic approach involved finding the hyperparameters that had the most effect on performance and their best combination. Although our enhanced baseline model demonstrated some performance gains, it did not achieve a significant breakthrough over the original baseline. This is a student project carried out as part of the lecture "Machine-learning and Audio: A Challenge" at JKU.
System characteristics
Best submission | Silva_JKUICP_t6_2 |
Team rank | 9 |
Audio modelling | None |
Word modelling | transformer |
Data augmentation | mixup, label smoothing |
Ensemble number of systems | 1 |
Train datasets used | Clotho, Clotho |
Total number of parameters | 59486498 |
FENSE score | 0.5161157423087457 |
DCASE 2024 TASK6: AUTOMATED AUDIO CAPTIONING USING CONTRASTIVE LEARNING
Dan Epshtein1, Yuval Amsalem1, Alon Amar1
1Acoustics Research Center, Israel
Epshtein_ARC_t6_1
Abstract
This technical report presents our proposed enhancements for improving the baseline results of the DCASE 2024 Challenge Task 6 on Automated Audio Captioning. We introduce an additional loss function for contrastive learning, incorporating the NT-Xent loss, as proposed in [1][3], into the baseline platform.
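For reference, a compact PyTorch version of an NT-Xent-style contrastive loss over paired audio and caption embeddings is sketched below; the temperature value and the symmetric formulation are illustrative assumptions and may differ from the submission's exact setup.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric NT-Xent / InfoNCE-style loss over paired audio and caption embeddings.

    audio_emb, text_emb: (batch, dim), where row i of each tensor forms a matching pair.
    The temperature value and symmetric formulation are assumptions.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```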
System characteristics
Best submission | Epshtein_ARC_t6_1 |
Team rank | 10 |
Audio modelling | None |
Word modelling | transformer |
Data augmentation | mixup, label smoothing |
Ensemble number of systems | 1 |
Train datasets used | Clotho |
Total number of parameters | 48014000 |
FENSE score | 0.5140716527189527 |
DCASE 2024 task 6 automated audio captioning
Hyunhee Hong1, Yunjung Lee1
1Chungang University Graduate School, Seoul, Korea
Hong_CAU_t6_1
Abstract
This project describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge, Task 6. The proposed systems in this submission are based on a supervised language-audio pretraining strategy. Experiments show that our systems can achieve a SPIDEr-FL score of 29.39 on automated audio captioning.
System characteristics
Best submission | Hong_CAU_t6_1 |
Team rank | 11 |
Audio modelling | None |
Word modelling | transformer |
Data augmentation | mixup, label smoothing |
Ensemble number of systems | 1 |
Train datasets used | Clotho |
Total number of parameters | 41303080 |
FENSE score | 0.5131689575665977 |