Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing general audio content with free text. It is an inter-modal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for general audio recordings. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).

Participants used the freely available Clotho development and evaluation splits, as well as any external data they saw fit. The developed systems are evaluated on their generated captions using the evaluation split of Clotho, for which the corresponding captions are not provided.

More information about Task 6: Automated Audio Captioning can be found at the task description page.

The ranking of the submitted systems is based on the achieved FENSE metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
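
As a quick reference for how the reported numbers relate, here is a minimal sketch (not the official scoring code; packages such as aac-metrics implement the real evaluation): SPIDEr is the arithmetic mean of CIDEr-D and SPICE, while SPIDEr-FL and FENSE penalize captions flagged by the FENSE fluency-error detector, assuming the 90% penalty factor from the FENSE paper.

```python
def spider(cider_d: float, spice: float) -> float:
    # SPIDEr is defined as the arithmetic mean of CIDEr-D and SPICE.
    return 0.5 * (cider_d + spice)

def spider_fl(cider_d: float, spice: float, has_fluency_error: bool) -> float:
    # SPIDEr-FL applies the FENSE fluency penalty on top of SPIDEr
    # (assumed 90% penalty, i.e. multiply by 0.1, when an error is flagged).
    return spider(cider_d, spice) * (0.1 if has_fluency_error else 1.0)

def fense(sbert_similarity: float, has_fluency_error: bool) -> float:
    # FENSE keeps the Sentence-BERT similarity for fluent captions and
    # penalizes it by 90% when the fluency-error detector flags one.
    return sbert_similarity * 0.1 if has_fluency_error else sbert_similarity
```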

Teams ranking

Here are listed the best systems from all teams. The ranking is based on FENSE. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task. Values are given for both the Clotho evaluation split and the Clotho development-testing split. The values for the development-testing split are provided in order to allow further comparison with systems and methods developed outside of this task, since the captions for that split are freely available. This year, we asked participants to exclude a list of Freesound IDs in order to prevent data leakage between the training and evaluation subsets; a sketch of such a filtering step is shown below. We mark "True" in the "Data leak" column for participants who used Freesound data without taking the forbidden IDs into account.
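
For illustration, such an exclusion step can be as simple as the following sketch; the file name and metadata keys are hypothetical, not an official artifact of the challenge.

```python
# Hypothetical sketch: drop any external Freesound clip whose ID appears
# in the organizers' forbidden-ID list before training.

def load_forbidden_ids(path: str = "task6_forbidden_freesound_ids.txt") -> set[str]:
    # One Freesound ID per line (illustrative file format).
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_external_data(clips: list[dict], forbidden: set[str]) -> list[dict]:
    # Keep only clips that are not on the forbidden list.
    return [c for c in clips if str(c["freesound_id"]) not in forbidden]
```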

Team rank | Submission code | Data leak | Corresponding author | Technical report | Clotho evaluation split: METEOR, CIDEr-D, SPICE, SPIDEr, SPIDEr-FL, FENSE | Clotho development-testing split: METEOR, CIDEr-D, SPICE, SPIDEr, SPIDEr-FL, FENSE
1 Jung_CMU_t6_4 False Jee-weon Jung jung_cmu_t6_2024 0.172 0.344 0.140 0.242 0.241 0.554 0.174 0.327 0.136 0.230 0.230 0.542
2 Kim_SNU_t6_2 False Jaeyeon Kim kim_snu_t6_2024 0.199 0.480 0.148 0.314 0.314 0.544 0.196 0.477 0.142 0.310 0.310 0.542
3 Chen_SJTU_t6_4 True Wenxi Chen chen_sjtu_t6_2024 0.194 0.509 0.145 0.327 0.322 0.541 0.193 0.522 0.148 0.335 0.333 0.543
4 Li_ALXC_t6_4 False Gang Li li_alxc_t6_2024 0.195 0.493 0.145 0.319 0.317 0.533 0.194 0.503 0.145 0.324 0.323 0.532
5 Kyogu_SNU_t6_2 False Lee Kyogu kyogu_snu_t6_2024 0.189 0.409 0.135 0.272 0.272 0.526 0.187 0.412 0.134 0.273 0.273 0.518
6 Kong_CUHK_t6_1 True Qiuqiang Kong kong_cuhk_t6_2024 0.192 0.495 0.141 0.318 0.315 0.525 0.196 0.529 0.138 0.334 0.332 0.528
7 Choi_KAIST_t6_1 False Inhan Choi choi_kaist_t6_2024 0.187 0.465 0.135 0.300 0.299 0.520 0.189 0.464 0.134 0.299 0.299 0.521
8 Li_SCUT_t6_4 False Qianqian Li li_scut_t6_2024 0.188 0.468 0.138 0.303 0.302 0.520 0.189 0.469 0.134 0.301 0.301 0.513
9 Silva_JKUICP_t6_2 False Jakob De Jesus Silva de_jesus_silva_jkuicp_t6_2024 0.188 0.456 0.138 0.297 0.296 0.516 0.192 0.479 0.138 0.309 0.308 0.508
10 Epshtein_ARC_t6_1 False Dan Epshtein epshtein_arc_t6_2024 0.188 0.462 0.137 0.300 0.298 0.514 0.189 0.473 0.135 0.304 0.302 0.504
11 Hong_CAU_t6_1 False Hyunhee Hong hong_cau_t6_2024 0.184 0.427 0.134 0.280 0.279 0.513 0.188 0.458 0.133 0.295 0.294 0.509
12 Baseline False Étienne Labbé labbé_irit_t6_2024 0.186 0.442 0.135 0.288 0.287 0.510 0.190 0.462 0.134 0.298 0.296 0.504

Systems ranking

Here are listed all submitted systems and their ranking according to the different metrics. The first table shows all systems with the challenge metrics, and the second table shows all systems with the additional metrics.

Detailed information for each system is provided in the next section.

Systems ranking, challenge metrics

Submission rank | Submission code | Data leak | Technical report | Clotho evaluation split: METEOR, CIDEr-D, SPICE, SPIDEr, SPIDEr-FL, FENSE | Clotho development-testing split: METEOR, CIDEr-D, SPICE, SPIDEr, SPIDEr-FL, FENSE
1 Jung_CMU_t6_4 False jung_cmu_t6_2024 0.172 0.344 0.140 0.242 0.241 0.554 0.174 0.327 0.136 0.230 0.230 0.542
2 Jung_CMU_t6_2 False jung_cmu_t6_2024 0.176 0.359 0.142 0.251 0.249 0.549 0.177 0.341 0.140 0.240 0.239 0.542
3 Jung_CMU_t6_3 False jung_cmu_t6_2024 0.172 0.345 0.141 0.243 0.239 0.547 0.174 0.333 0.132 0.232 0.232 0.544
4 Jung_CMU_t6_1 False jung_cmu_t6_2024 0.181 0.387 0.135 0.261 0.260 0.544 0.182 0.366 0.133 0.250 0.249 0.541
5 Kim_SNU_t6_2 False kim_snu_t6_2024 0.199 0.480 0.148 0.314 0.314 0.544 0.196 0.477 0.142 0.310 0.310 0.542
6 Kim_SNU_t6_4 False kim_snu_t6_2024 0.199 0.487 0.151 0.319 0.319 0.544 0.199 0.478 0.149 0.313 0.313 0.542
7 Kim_SNU_t6_3 False kim_snu_t6_2024 0.197 0.472 0.148 0.310 0.310 0.542 0.200 0.478 0.149 0.313 0.313 0.539
8 Chen_SJTU_t6_4 True chen_sjtu_t6_2024 0.194 0.509 0.145 0.327 0.322 0.541 0.193 0.522 0.148 0.335 0.333 0.543
9 Chen_SJTU_t6_3 True chen_sjtu_t6_2024 0.194 0.510 0.145 0.327 0.323 0.541 0.193 0.518 0.148 0.333 0.331 0.543
10 Chen_SJTU_t6_1 True chen_sjtu_t6_2024 0.195 0.497 0.144 0.321 0.317 0.540 0.195 0.512 0.147 0.329 0.329 0.543
11 Kim_SNU_t6_1 False kim_snu_t6_2024 0.195 0.470 0.145 0.307 0.307 0.540 0.199 0.483 0.148 0.316 0.316 0.539
12 Chen_SJTU_t6_2 True chen_sjtu_t6_2024 0.195 0.518 0.146 0.332 0.329 0.538 0.196 0.537 0.150 0.343 0.342 0.540
13 Li_ALXC_t6_4 False li_alxc_t6_2024 0.195 0.493 0.145 0.319 0.317 0.533 0.194 0.503 0.145 0.324 0.323 0.532
14 Li_ALXC_t6_3 False li_alxc_t6_2024 0.177 0.441 0.128 0.285 0.284 0.528 0.178 0.447 0.127 0.287 0.287 0.521
15 Kyogu_SNU_t6_2 False kyogu_snu_t6_2024 0.189 0.409 0.135 0.272 0.272 0.526 0.187 0.412 0.134 0.273 0.273 0.518
16 Kong_CUHK_t6_1 True kong_cuhk_t6_2024 0.192 0.495 0.141 0.318 0.315 0.525 0.196 0.529 0.138 0.334 0.332 0.528
17 Kong_CUHK_t6_2 False kong_cuhk_t6_2024 0.193 0.478 0.145 0.311 0.307 0.525 0.193 0.495 0.140 0.317 0.314 0.523
18 Choi_KAIST_t6_1 False choi_kaist_t6_2024 0.187 0.465 0.135 0.300 0.299 0.520 0.189 0.464 0.134 0.299 0.299 0.521
19 Li_ALXC_t6_1 False li_alxc_t6_2024 0.190 0.474 0.141 0.308 0.307 0.520 0.191 0.499 0.139 0.319 0.318 0.522
20 Li_SCUT_t6_4 False li_scut_t6_2024 0.188 0.468 0.138 0.303 0.302 0.520 0.189 0.469 0.134 0.301 0.301 0.513
21 Li_SCUT_t6_3 False li_scut_t6_2024 0.189 0.471 0.138 0.305 0.304 0.519 0.187 0.467 0.133 0.300 0.300 0.512
22 Choi_KAIST_t6_2 False choi_kaist_t6_2024 0.184 0.429 0.133 0.281 0.279 0.518 0.182 0.414 0.130 0.272 0.272 0.515
23 Li_ALXC_t6_2 False li_alxc_t6_2024 0.187 0.462 0.135 0.298 0.298 0.518 0.187 0.458 0.137 0.298 0.297 0.520
24 Silva_JKUICP_t6_2 False de_jesus_silva_jkuicp_t6_2024 0.188 0.456 0.138 0.297 0.296 0.516 0.192 0.479 0.138 0.309 0.308 0.508
25 Li_SCUT_t6_2 False li_scut_t6_2024 0.189 0.467 0.139 0.303 0.301 0.516 0.186 0.460 0.133 0.296 0.295 0.505
26 Silva_JKUICP_t6_1 False de_jesus_silva_jkuicp_t6_2024 0.187 0.450 0.135 0.292 0.291 0.515 0.186 0.451 0.134 0.292 0.290 0.506
27 Epshtein_ARC_t6_1 False epshtein_arc_t6_2024 0.188 0.462 0.137 0.300 0.298 0.514 0.189 0.473 0.135 0.304 0.302 0.504
28 Hong_CAU_t6_1 False hong_cau_t6_2024 0.184 0.427 0.134 0.280 0.279 0.513 0.188 0.458 0.133 0.295 0.294 0.509
29 Kyogu_SNU_t6_1 False kyogu_snu_t6_2024 0.186 0.441 0.134 0.288 0.287 0.512 0.185 0.444 0.133 0.288 0.287 0.507
30 Baseline False labbé_irit_t6_2024 0.186 0.442 0.135 0.288 0.287 0.510 0.190 0.462 0.134 0.298 0.296 0.504
31 Li_SCUT_t6_1 False li_scut_t6_2024 0.187 0.459 0.137 0.298 0.296 0.508 0.187 0.470 0.131 0.301 0.300 0.507

Systems ranking, additional metrics

Submission rank | Submission code | Data leak | Technical report | Clotho evaluation split: FENSE, Sentence-BERT, Fluency Error Rate, Vocabulary (unique words)
1 Jung_CMU_t6_4 False jung_cmu_t6_2024 0.554 0.556 0.004 915.0
2 Jung_CMU_t6_2 False jung_cmu_t6_2024 0.549 0.553 0.008 920.0
3 Jung_CMU_t6_3 False jung_cmu_t6_2024 0.547 0.554 0.012 888.0
4 Jung_CMU_t6_1 False jung_cmu_t6_2024 0.544 0.548 0.007 896.0
5 Kim_SNU_t6_2 False kim_snu_t6_2024 0.544 0.544 0.000 836.0
6 Kim_SNU_t6_4 False kim_snu_t6_2024 0.544 0.544 0.000 799.0
7 Kim_SNU_t6_3 False kim_snu_t6_2024 0.542 0.542 0.000 840.0
8 Chen_SJTU_t6_4 True chen_sjtu_t6_2024 0.541 0.546 0.009 783.0
9 Chen_SJTU_t6_3 True chen_sjtu_t6_2024 0.541 0.546 0.009 787.0
10 Chen_SJTU_t6_1 True chen_sjtu_t6_2024 0.540 0.546 0.010 835.0
11 Kim_SNU_t6_1 False kim_snu_t6_2024 0.540 0.540 0.000 832.0
12 Chen_SJTU_t6_2 True chen_sjtu_t6_2024 0.538 0.543 0.010 800.0
13 Li_ALXC_t6_4 False li_alxc_t6_2024 0.533 0.535 0.004 786.0
14 Li_ALXC_t6_3 False li_alxc_t6_2024 0.528 0.528 0.001 612.0
15 Kyogu_SNU_t6_2 False kyogu_snu_t6_2024 0.526 0.526 0.001 954.0
16 Kong_CUHK_t6_1 True kong_cuhk_t6_2024 0.525 0.529 0.006 606.0
17 Kong_CUHK_t6_2 False kong_cuhk_t6_2024 0.525 0.531 0.011 565.0
18 Choi_KAIST_t6_1 False choi_kaist_t6_2024 0.520 0.521 0.003 609.0
19 Li_ALXC_t6_1 False li_alxc_t6_2024 0.520 0.522 0.004 751.0
20 Li_SCUT_t6_4 False li_scut_t6_2024 0.520 0.521 0.002 498.0
21 Li_SCUT_t6_3 False li_scut_t6_2024 0.519 0.520 0.002 513.0
22 Choi_KAIST_t6_2 False choi_kaist_t6_2024 0.518 0.520 0.004 866.0
23 Li_ALXC_t6_2 False li_alxc_t6_2024 0.518 0.520 0.003 773.0
24 Silva_JKUICP_t6_2 False de_jesus_silva_jkuicp_t6_2024 0.516 0.517 0.001 606.0
25 Li_SCUT_t6_2 False li_scut_t6_2024 0.516 0.517 0.002 517.0
26 Silva_JKUICP_t6_1 False de_jesus_silva_jkuicp_t6_2024 0.515 0.517 0.003 610.0
27 Epshtein_ARC_t6_1 False epshtein_arc_t6_2024 0.514 0.516 0.005 563.0
28 Hong_CAU_t6_1 False hong_cau_t6_2024 0.513 0.515 0.004 604.0
29 Kyogu_SNU_t6_1 False kyogu_snu_t6_2024 0.512 0.515 0.004 822.0
30 Baseline False labbé_irit_t6_2024 0.510 0.512 0.004 532.0
31 Li_SCUT_t6_1 False li_scut_t6_2024 0.508 0.511 0.006 539.0

System characteristics

In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.

Overview of characteristics

Submission rank | Submission code | Data leak | FENSE | Technical report | Method scheme/architecture | Number of parameters | Audio modelling | Word modelling | Data augmentation
1 Jung_CMU_t6_4 False 0.554 jung_cmu_t6_2024 encoder-decoder 7857055850 Conformer transformer SpecAugment, mixup
2 Jung_CMU_t6_2 False 0.549 jung_cmu_t6_2024 encoder-decoder 1571411170 Conformer transformer SpecAugment, mixup
3 Jung_CMU_t6_3 False 0.547 jung_cmu_t6_2024 encoder-decoder 5612182750 Conformer transformer SpecAugment, mixup
4 Jung_CMU_t6_1 False 0.544 jung_cmu_t6_2024 encoder-decoder 224487310 Conformer transformer SpecAugment, mixup
5 Kim_SNU_t6_2 False 0.544 kim_snu_t6_2024 encoder-decoder 754328981 cnn transformer mixup
6 Kim_SNU_t6_4 False 0.544 kim_snu_t6_2024 encoder-decoder 4575364501 cnn transformer mixup
7 Kim_SNU_t6_3 False 0.542 kim_snu_t6_2024 encoder-decoder 3620105621 cnn transformer mixup
8 Chen_SJTU_t6_4 True 0.541 chen_sjtu_t6_2024 encoder-decoder 6840335631 transformer transformer SpecAugment, mixup
9 Chen_SJTU_t6_3 True 0.541 chen_sjtu_t6_2024 encoder-decoder 6840335631 transformer transformer SpecAugment, mixup
10 Chen_SJTU_t6_1 True 0.540 chen_sjtu_t6_2024 encoder-decoder 6840335631 transformer transformer SpecAugment, mixup
11 Kim_SNU_t6_1 False 0.540 kim_snu_t6_2024 encoder-decoder 754328981 cnn transformer mixup
12 Chen_SJTU_t6_2 True 0.538 chen_sjtu_t6_2024 encoder-decoder 6840335631 transformer transformer SpecAugment, mixup
13 Li_ALXC_t6_4 False 0.533 li_alxc_t6_2024 encoder-decoder 6850672271 ced transformer
14 Li_ALXC_t6_3 False 0.528 li_alxc_t6_2024 encoder-decoder 245365903 ced transformer
15 Kyogu_SNU_t6_2 False 0.526 kyogu_snu_t6_2024 encoder-decoder 8131137200
16 Kong_CUHK_t6_1 True 0.525 kong_cuhk_t6_2024 encoder-decoder 146403855 cnn transformer spec-based mixup, label smoothing
17 Kong_CUHK_t6_2 False 0.525 kong_cuhk_t6_2024 encoder-decoder 126355215 cnn transformer spec-based mixup, label smoothing
18 Choi_KAIST_t6_1 False 0.520 choi_kaist_t6_2024 encoder-decoder 42038209 transformer mixup, label smoothing, ChatGPT paraphrasing
19 Li_ALXC_t6_1 False 0.520 li_alxc_t6_2024 encoder-decoder 6850408320 Dasheng transformer
20 Li_SCUT_t6_4 False 0.520 li_scut_t6_2024 ConvNeXt-Trans 41303080 ConvNeXt transformer mixup, SpecAugment
21 Li_SCUT_t6_3 False 0.519 li_scut_t6_2024 ConvNeXt-Trans 41303080 ConvNeXt transformer mixup, SpecAugment
22 Choi_KAIST_t6_2 False 0.518 choi_kaist_t6_2024 encoder-decoder 42038209 transformer mixup, label smoothing, ChatGPT paraphrasing
23 Li_ALXC_t6_2 False 0.518 li_alxc_t6_2024 encoder-decoder 7397882752 Dasheng transformer
24 Silva_JKUICP_t6_2 False 0.516 de_jesus_silva_jkuicp_t6_2024 encoder-decoder 59486498 transformer mixup, label smoothing
25 Li_SCUT_t6_2 False 0.516 li_scut_t6_2024 ConvNeXt-Trans 41303080 ConvNeXt transformer mixup, SpecAugment
26 Silva_JKUICP_t6_1 False 0.515 de_jesus_silva_jkuicp_t6_2024 encoder-decoder 59486498 transformer mixup, label smoothing
27 Epshtein_ARC_t6_1 False 0.514 epshtein_arc_t6_2024 encoder-decoder 48014000 transformer mixup, label smoothing
28 Hong_CAU_t6_1 False 0.513 hong_cau_t6_2024 encoder-decoder 41303080 transformer mixup, label smoothing
29 Kyogu_SNU_t6_1 False 0.512 kyogu_snu_t6_2024 encoder-decoder 8131137200
30 Baseline False 0.510 labbé_irit_t6_2024 encoder-decoder 41303080 transformer mixup, label smoothing
31 Li_SCUT_t6_1 False 0.508 li_scut_t6_2024 ConvNeXt-Trans 41303080 ConvNeXt transformer mixup, SpecAugment



Detailed characteristics

Submission rank | Submission code | Data leak | FENSE | Technical report | Method scheme/architecture | Learnable parameters | Frozen parameters | Inference parameters | Total parameters | Inference MACs | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble number of systems | Loss function | Optimizer | Learning rate | Weight decay | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for training | Number of GPUs used for training | GPU model used for training
1 Jung_CMU_t6_4 False 0.554 jung_cmu_t6_2024 encoder-decoder 3653368320 4203687530 7857055850 7857055850 Conformer BEATs, ConvNeXt-Tiny transformer BART SpecAugment, mixup 32kHz, 16kHz supervised 5 cross_entropy, infonce AdamW 2e-5 0.001 0.0 validation_accuracy Clotho, AudioCaps 4 NVIDIA A5000
2 Jung_CMU_t6_2 False 0.549 jung_cmu_t6_2024 encoder-decoder 730673664 840737506 1571411170 1571411170 Conformer BEATs, ConvNeXt-Tiny transformer BART SpecAugment, mixup 32kHz, 16kHz supervised 5 cross_entropy, infonce AdamW 2e-5 0.001 0.0 validation_accuracy Clotho, AudioCaps 4 NVIDIA A5000
3 Jung_CMU_t6_3 False 0.547 jung_cmu_t6_2024 encoder-decoder 2609548800 3002633950 5612182750 5612182750 Conformer BEATs, ConvNeXt-Tiny transformer BART SpecAugment, mixup 32kHz, 16kHz supervised 5 cross_entropy, infonce AdamW 2e-5 0.001 0.0 validation_accuracy Clotho, AudioCaps 4 NVIDIA A5000
4 Jung_CMU_t6_1 False 0.544 jung_cmu_t6_2024 encoder-decoder 104381952 120105358 224487310 224487310 Conformer BEATs, ConvNeXt-Tiny transformer BART SpecAugment, mixup 32kHz, 16kHz supervised 1 cross_entropy, infonce AdamW 2e-5 0.001 0.0 validation_accuracy Clotho, AudioCaps 4 NVIDIA A5000
5 Kim_SNU_t6_2 False 0.544 kim_snu_t6_2024 encoder-decoder 477629440 276699541 754328981 754328981 cnn ConvNeXt-Tiny transformer BART-large mixup 32kHz supervised 1 cross_entropy AdamW 3e-5 0.010 1.0 L2 FENSE Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps 8 NVIDIA A100 80GB
6 Kim_SNU_t6_4 False 0.544 kim_snu_t6_2024 encoder-decoder 4298664960 276699541 4575364501 4575364501 cnn ConvNeXt-Tiny transformer BART-large mixup 32kHz supervised 9 cross_entropy AdamW 3e-5 0.010 1.0 L2 FENSE Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps 8 NVIDIA A100 80GB
7 Kim_SNU_t6_3 False 0.542 kim_snu_t6_2024 encoder-decoder 3343406080 276699541 3620105621 3620105621 cnn ConvNeXt-Tiny transformer BART-large mixup 32kHz supervised 7 cross_entropy AdamW 3e-5 0.010 1.0 L2 FENSE Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps 8 NVIDIA A100 80GB
8 Chen_SJTU_t6_4 True 0.541 chen_sjtu_t6_2024 encoder-decoder 20453376 6819882255 6840335631 6840335631 6990830300000 transformer EAT transformer vicuna-7b-v1.5 SpecAugment, mixup 16kHz supervised 10 cross_entropy AdamW 8e-6 0.000 0.0 validation_loss Clotho, AudioCaps, MACS, WavCaps 1 NVIDIA A800-SXM4-80GB
9 Chen_SJTU_t6_3 True 0.541 chen_sjtu_t6_2024 encoder-decoder 20453376 6819882255 6840335631 6840335631 6990830300000 transformer EAT transformer vicuna-7b-v1.5 SpecAugment, mixup 16kHz supervised 10 cross_entropy AdamW 8e-6 0.000 0.0 validation_loss Clotho, AudioCaps, MACS, WavCaps 1 NVIDIA A800-SXM4-80GB
10 Chen_SJTU_t6_1 True 0.540 chen_sjtu_t6_2024 encoder-decoder 20453376 6819882255 6840335631 6840335631 6990830300000 transformer EAT transformer vicuna-7b-v1.5 SpecAugment, mixup 16kHz supervised 1 cross_entropy AdamW 8e-6 0.000 0.0 validation_loss Clotho, AudioCaps, MACS, WavCaps 1 NVIDIA A800-SXM4-80GB
11 Kim_SNU_t6_1 False 0.540 kim_snu_t6_2024 encoder-decoder 477629440 276699541 754328981 754328981 cnn ConvNeXt-Tiny transformer BART-large mixup 32kHz supervised 1 cross_entropy AdamW 3e-5 0.010 1.0 L2 FENSE Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps 8 NVIDIA A100 80GB
12 Chen_SJTU_t6_2 True 0.538 chen_sjtu_t6_2024 encoder-decoder 20453376 6819882255 6840335631 6840335631 6990830300000 transformer EAT transformer vicuna-7b-v1.5 SpecAugment, mixup 16kHz supervised 5 cross_entropy AdamW 8e-6 0.000 0.0 validation_loss Clotho, AudioCaps, MACS, WavCaps 1 NVIDIA A800-SXM4-80GB
13 Li_ALXC_t6_4 False 0.533 li_alxc_t6_2024 encoder-decoder 26544896 6824127375 6850672271 6850672271 none ced CED transformer llama2_7b 16kHz supervised 1 cross_entropy AdamW 5e-5 0.0 validation_loss Clotho 2 NVIDIA A100
14 Li_ALXC_t6_3 False 0.528 li_alxc_t6_2024 encoder-decoder 20233728 225132175 245365903 245365903 none ced CED transformer bart 16kHz supervised 1 cross_entropy AdamW 5e-5 0.0 validation_loss Clotho 2 NVIDIA A100
15 Kyogu_SNU_t6_2 False 0.526 kyogu_snu_t6_2024 encoder-decoder 9965568 8121171632 8124321456 8131137200 BEATs LLaMa 16kHz supervised 1 cross_entropy AdamW 3e-4 5.0 L2 validation_loss Clotho, AudioCaps 1 NVIDIA GeForce RTX 3090
16 Kong_CUHK_t6_1 True 0.525 kong_cuhk_t6_2024 encoder-decoder 117015552 29388303 146403855 146403855 60483202884 cnn ConvNeXt-Tiny transformer learned spec-based mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 3e-5 0.0 the SPIDEr metric Clotho, AudioCaps, WavCaps 5 NVIDIA GeForce RTX 4090
17 Kong_CUHK_t6_2 False 0.525 kong_cuhk_t6_2024 encoder-decoder 96966912 29388303 126355215 126355215 53459049669 cnn ConvNeXt-Tiny transformer learned spec-based mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 3e-5 0.0 the SPIDEr metric Clotho 1 NVIDIA GeForce RTX 4090
18 Choi_KAIST_t6_1 False 0.520 choi_kaist_t6_2024 encoder-decoder 12649906 29388303 42038209 42038209 49888899616 ConvNeXt-Tiny transformer learned mixup, label smoothing, ChatGPT paraphrasing 32kHz supervised 1 cross_entropy AdamW 5e-4 2.000 1.0 L2 train_loss Clotho 3 NVIDIA GeForce RTX 2080 Ti
19 Li_ALXC_t6_1 False 0.520 li_alxc_t6_2024 encoder-decoder 26544896 6823863424 6850408320 6850408320 none Dasheng Dasheng transformer llama2_7b 16kHz supervised 1 cross_entropy AdamW 5e-5 0.0 validation_loss Clotho 2 NVIDIA A100
20 Li_SCUT_t6_4 False 0.520 li_scut_t6_2024 ConvNeXt-Trans 11914777 29388303 41303080 41303080 ConvNeXt ConvNeXt-Tiny transformer mixup, SpecAugment 32kHz supervised 4 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce RTX 4090 Ti
21 Li_SCUT_t6_3 False 0.519 li_scut_t6_2024 ConvNeXt-Trans 11914777 29388303 41303080 41303080 ConvNeXt ConvNeXt-Tiny transformer mixup, SpecAugment 32kHz supervised 4 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce RTX 4090 Ti
22 Choi_KAIST_t6_2 False 0.518 choi_kaist_t6_2024 encoder-decoder 12649906 29388303 42038209 42038209 50768107552 ConvNeXt-Tiny transformer learned mixup, label smoothing, ChatGPT paraphrasing 32kHz supervised 1 cross_entropy AdamW 5e-4 2.000 1.0 L2 train_loss Clotho 3 NVIDIA GeForce RTX 2080 Ti
23 Li_ALXC_t6_2 False 0.518 li_alxc_t6_2024 encoder-decoder 29133568 7368749184 7397882752 7397882752 none Dasheng Dasheng transformer llama2_7b 16kHz supervised 1 cross_entropy AdamW 5e-5 0.0 validation_loss Clotho 2 NVIDIA A100
24 Silva_JKUICP_t6_2 False de_jesus_silva_jkuicp_t6_2024 encoder-decoder 30098195 29388303 59486498 59486498 14715294720 ConvNeXt-Tiny transformer learned mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 4e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce GTX 1060 6GB
25 Li_SCUT_t6_2 False 0.516 li_scut_t6_2024 ConvNeXt-Trans 11914777 29388303 41303080 41303080 ConvNeXt ConvNeXt-Tiny transformer mixup, SpecAugment 32kHz supervised 4 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce RTX 4090 Ti
26 Silva_JKUICP_t6_1 False 0.515 de_jesus_silva_jkuicp_t6_2024 encoder-decoder 30098195 29388303 59486498 59486498 15301713408 ConvNeXt-Tiny transformer learned mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 4e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce GTX 1060 6GB
27 Epshtein_ARC_t6_1 False 0.514 epshtein_arc_t6_2024 encoder-decoder 12003511 36010489 48014000 48014000 4821624576 ConvNeXt-Tiny transformer learned mixup, label smoothing 32kHz supervised 1 cross_entropy, NTXent AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA T1200 Laptop GPU
28 Hong_CAU_t6_1 False 0.513 hong_cau_t6_2024 encoder-decoder 11914777 29388303 41303080 41303080 ConvNeXt-Tiny transformer learned mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA 20TF-V100
29 Kyogu_SNU_t6_1 False 0.512 kyogu_snu_t6_2024 encoder-decoder 9965568 8121171632 8124321456 8131137200 BEATs LLaMa 16kHz supervised 1 cross_entropy AdamW 3e-4 5.0 L2 validation_loss Clotho, AudioCaps 1 NVIDIA GeForce RTX 3090
30 Baseline False 0.510 labbé_irit_t6_2024 encoder-decoder 11914777 29388303 41303080 41303080 48762319200 ConvNeXt-Tiny transformer learned mixup, label smoothing 32kHz supervised 1 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce RTX 2080 Ti
31 Li_SCUT_t6_1 False 0.508 li_scut_t6_2024 ConvNeXt-Trans 11914777 29388303 41303080 41303080 ConvNeXt ConvNeXt-Tiny transformer mixup, SpecAugment 32kHz supervised 4 cross_entropy AdamW 5e-4 2.000 1.0 L2 validation_loss Clotho 1 NVIDIA GeForce RTX 4090 Ti



Technical reports

AUTOMATIC AUDIO CAPTIONING WITH ENCODER FUSION, MULTI-LAYER AGGREGATION, AND LARGE LANGUAGE MODEL ENRICHED SUMMARIZATION

Jee-weon Jung1, Dong Zhang2, Huck C.-H. Yang3, Shih-Lun Wu1, David M. Chan4, Zhifeng Kong5, Ruifan Deng2, Yaqian Zhou2, Rafael Valle5, Shinji Watanabe1
1Carnegie Mellon University, USA, 2Fudan University, China, 3NVIDIA Research, USA, 4University of California, Berkeley, USA, 5NVIDIA Applied Deep Learning Research, USA

Abstract

In this report, we describe our submission to Track 6 of the DCASE 2024 challenge for the task of Automated Audio Captioning (AAC). The submitted models utilize an encoder-decoder architecture using pre-trained and frozen audio encoders, a Conformer post-encoder, and a BART decoder. We introduce five different architectures, employing diverse fusion strategies to leverage multiple audio encoders and a multi-layer aggregation technique, thus exploiting the complementary information from various representations. For inference, we propose a novel scheme incorporating nucleus sampling, CLAP-based filtering, hybrid re-ranking, and large language model summarization. Combining these approaches, our top-performing single and ensemble systems achieve Fluency Enhanced Sentence-BERT Evaluation (FENSE) scores of 0.5410 and 0.5442, respectively, on the Clotho (V2) evaluation partition.
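
The inference scheme described above lends itself to a compact sketch. The following is an illustrative rendering, not the team's actual code: `model`, `tokenizer`, and `clap` are hypothetical handles, and the hybrid re-ranking and LLM summarization steps are omitted.

```python
import torch

@torch.no_grad()
def sample_and_filter(model, tokenizer, clap, audio, n_candidates=30, top_p=0.95):
    # Draw diverse candidate captions with nucleus (top-p) sampling...
    candidates = [
        tokenizer.decode(model.generate(audio, do_sample=True, top_p=top_p))
        for _ in range(n_candidates)
    ]
    # ...then keep the candidate whose CLAP text embedding is closest to
    # the CLAP audio embedding of the input clip (hypothetical interfaces).
    audio_emb = clap.embed_audio(audio)        # shape (d,)
    text_embs = clap.embed_text(candidates)    # shape (n, d)
    sims = torch.nn.functional.cosine_similarity(text_embs, audio_emb.unsqueeze(0))
    return candidates[int(sims.argmax())]
```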

System characteristics
Best submission Jung_CMU_t6_4
Team rank 1
Audio modelling Conformer
Word modelling transformer
Data augmentation SpecAugment, mixup
Ensemble number of systems 5
Train datasets used Clotho, AudioCaps
Total number of parameters 7857055850
FENSE score 0.5536877719555068

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

Jaeyeon Kim1, Jaeyoon Jung2, Minjeong Jeon3, Sang Hoon Woo4, Jinjoo Lee5
1Seoul National University, Seoul, Republic of Korea, 2Soongsil University, Seoul, Republic of Korea, 3MAUM AI Inc., Seongnam, Republic of Korea, 4Independent Researcher, Everywhere, 5MAUM AI Inc., Seongnam, Republic of Korea

Abstract

In this technical report, we describe our submission to DCASE 2024 Challenge Task 6 (Automated Audio Captioning) and Task 8 (Language-Based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task 6 of the challenge. Notably, we outline the changes to the underlying components and the incorporation of a reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task 8. Our proposed systems achieve a FENSE score of 0.542 on Task 6 and an mAP@10 score of 0.386 on Task 8, significantly outperforming the baseline models.

System characteristics
Best submission Kim_SNU_t6_2
Team rank 2
Audio modelling cnn
Word modelling transformer
Data augmentation mixup
Ensemble number of systems 1
Train datasets used Clotho, Clotho-ChatGPT-mixup, AudioCaps, WavCaps
Total number of parameters 754328981
FENSE score 0.5441769132406691

SJTU-THU Automated Audio Captioning System for DCASE 2024

Wenxi Chen1, Xiquan Li1, Ziyang Ma1, Yuzhe Liang1, Anbai Jiang2, Zhisheng Zheng1, Yanmin Qian1, Pingyi Fan2, Wei-Qiang Zhang2, Cheng Lu3, Jia Liu2, Xie Chen1
1Shanghai Jiao Tong University, Shanghai, China, 2Tsinghua University, Beijing, China, 3North China Electric Power University, Beijing, China

Abstract

Task 6 (Automated Audio Captioning) of the DCASE 2024 Challenge requires the automatic creation of textual descriptions for general audio signals. This technical report presents a novel model that integrates a self-supervised model with a large language model (LLM) for audio captioning. For audio feature extraction, we utilize the efficient self-supervised pre-trained model EAT to achieve more effective audio representation extraction. The language model component is based on Vicuna, a large language model, which we fine-tune using LoRA to fully harness its robust reasoning capabilities. During training, linear layers function as projectors to align audio and textual representations. Our model is pre-trained using the Clotho, WavCaps, AudioCaps, and MACS datasets, and fine-tuned on Clotho. For decoding, we employ a filtering strategy based on the CLAP model: by leveraging its text-audio alignment capabilities, we filter the beam search decoding results to retain only the textual description that best matches the input audio. Evaluation on the testing subset of Clotho demonstrates that our model achieves a FENSE score of 0.5431 in the single-system setting and 0.5429 in the multi-system setting, while the multi-system setting outperforms the single-system one on the other metrics. Our project code is based on the SLAM-LLM toolkit.
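
As a rough illustration of the LoRA fine-tuning step, the following PEFT-based sketch attaches low-rank adapters to a Vicuna-style decoder; the rank, scaling, dropout, and target modules are illustrative defaults rather than the values used by the team.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base decoder (real Hugging Face model id for Vicuna 7B v1.5).
base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Illustrative LoRA hyperparameters, not the submission's actual settings.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```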

System characteristics
Best submission Chen_SJTU_t6_4
Team rank 3
Audio modelling transformer
Word modelling transformer
Data augmentation SpecAugment, mixup
Ensemble number of systems 10
Train datasets used Clotho, AudioCaps, MACS, WavCaps
Total number of parameters 6840335631
FENSE score 0.5412474964331918

Leveraging CED Encoder and Large Language Models for Automated Audio Captioning

Jizhong Liu1, Gang Li1
1AI Lab, Xiaomi Corporation, Wuhan, China

Abstract

This technical report presents an automated audio captioning (AAC) method participating in the DCASE 2024 Challenge Task 6. The method builds upon our previous work. Recent advancements in large language models (LLMs), coupled with improved training approaches for audio encoders, have opened up possibilities for enhancing AAC. Thus, we optimize AAC from three angles: 1) a pre-trained audio encoder named consistent ensemble distillation (CED) improves the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we introduce Llama 2 with 7B parameters as the decoder; 3) a frozen Llama 3 Instruct with 8B parameters corrects text errors caused by insufficient training data and annotation ambiguities. Both the encoder and text decoder are optimized with low-rank adaptation (LoRA). Our method obtains a FENSE score of 53.2.
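
The Q-Former idea, compressing a long sequence of CED acoustic tokens into a short fixed-length sequence that fits the LLM, can be sketched with a set of learned queries cross-attending to the audio tokens. Dimensions below are illustrative, not those of the submission.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Minimal Q-Former-flavored sketch: learned queries cross-attend to
    the (long) sequence of audio tokens and return a short, fixed-length
    sequence projected to the LLM width."""

    def __init__(self, n_queries=32, d_audio=768, d_llm=4096, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_audio) * 0.02)
        self.attn = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        self.proj = nn.Linear(d_audio, d_llm)  # bridge to the LLM embedding width

    def forward(self, audio_tokens):              # (B, T, d_audio), T large
        q = self.queries.unsqueeze(0).expand(audio_tokens.size(0), -1, -1)
        out, _ = self.attn(q, audio_tokens, audio_tokens)
        return self.proj(out)                     # (B, n_queries, d_llm)
```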

System characteristics
Best submission Li_ALXC_t6_4
Team rank 4
Audio modelling ced
Word modelling transformer
Ensemble number of systems 1
Train datasets used Clotho
Total number of parameters 6850672271
FENSE score 0.5327607233845204

Retrieval-Augmented Audio Captioning with LLM fine-tuning

Kim Eungbeom1, Sim Jaeheon1, Lee Jin Woo1, Lee Kyogu1
1Seoul National University, Seoul, Korea

Abstract

This technical report introduces an audio captioning system designed to tackle the task of Automated Audio Captioning (AAC) in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Our approach employs BEATs for robust audio representation learning and Llama 3 for high-quality text generation. To address the limitations of small datasets like Clotho, we fix the pre-trained weights of BEATs and train a small linear model to map the audio encoder dimensions to the LLM input. We further fine-tune the LLM using the parameter-efficient fine-tuning method LoRA. We also explore a concatenation-based LoRA merging method, achieving notable results on standard benchmarks. Experimental results show that our proposed system achieves a FENSE [1] score of 0.5180 on the evaluation dataset.
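
The report's core recipe, a frozen BEATs encoder with only a small trainable linear mapping into the LLM's embedding space, can be sketched as follows; the widths and helper below are illustrative, not the submission's actual code.

```python
import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    # Minimal sketch of the mapping: BEATs-like features (assumed 768-d)
    # projected to a Llama-like embedding width (assumed 4096-d).
    def __init__(self, d_audio: int = 768, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_llm)

    def forward(self, x):       # x: (batch, frames, d_audio)
        return self.proj(x)     # -> (batch, frames, d_llm)

def freeze(module: nn.Module) -> None:
    # Keep the pre-trained encoder fixed; only the projector
    # (and the LoRA adapters) receive gradients.
    for p in module.parameters():
        p.requires_grad = False
```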

System characteristics
Best submission Kyogu_SNU_t6_2
Team rank 5
Audio modelling None
Word modelling None
Ensemble number of systems 1
Train datasets used Clotho, AudioCaps
Total number of parameters 8131137200
FENSE score 0.5262071474093661

Semantic Enhancement Encoder for Audio Captioning and Spectrogram-based data augmentation

Qianhang Feng1, Qiuqiang Kong1
1The Chinese University of Hong Kong, New Territories, Hong Kong

Abstract

Automatic Audio Captioning (AAC) is a process that transforms audio signals into descriptive narratives. This paper introduces an automated audio captioning model developed for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge Task 6A. The model architecture is designed to manage the intricacies of AAC tasks. Additionally, this work introduces a novel spectrogram-based data augmentation technique which, with minimal model adjustments, significantly boosts performance. Exclusively trained and fine-tuned on the Clotho dataset, the system achieved a final SPIDEr-FL score of 0.3318, demonstrating its effectiveness.
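
A common form of spectrogram-based mixup, to which the characteristics table's "spec-based mixup" plausibly refers, blends two log-mel spectrograms with a Beta-sampled weight; the team's exact variant may differ. A minimal sketch:

```python
import torch

def spec_mixup(spec_a, spec_b, caption_a, caption_b, alpha=0.4):
    # Blend two log-mel spectrograms with a Beta(alpha, alpha)-sampled
    # weight; the caption target here follows the dominant clip
    # (one simple choice among several for text targets).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * spec_a + (1.0 - lam) * spec_b
    target = caption_a if lam >= 0.5 else caption_b
    return mixed, target
```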

System characteristics
Best submission Kong_CUHK_t6_1
Team rank 6
Audio modelling cnn
Word modelling transformer
Data augmentation spec-based mixup, label smoothing
Ensemble number of systems 1
Train datasets used Clotho, AudioCaps, WavCaps
Total number of parameters 146403855
FENSE score 0.5254088402978455

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes

Inhan Choi1, Hyeonuk Nam1, Deokki Min1, Seung-Deok Choi1, Yong-Hwa Park1
1Korea Advanced Institute of Science and Technology, 291, Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea

Abstract

To tackle the sound event detection (SED) task, we propose frequency-dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of three branches: an audio teacher-student transformer (ATST) branch, a BEATs branch, and a CNN branch including either partial dilated frequency dynamic convolution (PDFD) or squeeze-and-excitation (SE) with time-frame frequency-wise SE (tfwSE). To train on MAESTRO labels with coarse temporal resolution, we apply max pooling on predictions for the MAESTRO dataset. Using the best ensemble model, we apply self-training to obtain pseudo-labels from the DESED weak set, the DESED unlabeled set, and AudioSet. The AudioSet labels are filtered to focus on high-confidence pseudo-labels, and the AudioSet pseudo-labels are used to train on DESED labels only. We use change-detection-based sound event bounding boxes (cSEBBs) as post-processing for the ensemble models in self-training and for the submission models.

System characteristics
Best submission Choi_KAIST_t6_1
Team rank 7
Audio modelling None
Word modelling transformer
Data augmentation mixup, label smoothing, ChatGPT paraphrasing
Ensemble number of systems 1
Train datasets used Clotho
Total number of parameters 42038209
FENSE score 0.5203327059152886

SCUT SUBMISSION FOR AUTOMATED AUDIO CAPTIONING USING GRAPH ATTENTION AND CROSS-ATTENTION MECHANISMS

Qianqian Li1
1South China University of Technology, Guangzhou, China

Abstract

This report presents our work for automated audio captioning, which is Task 6A of DCASE 2024. Our system is an encoder-decoder framework. The encoder uses a pre-trained ConvNeXt network and the decoder employs a standard Transformer structure. Within the encoder, we include a graph attention module to enhance the model's ability to extract audio features. In the decoder, in addition to the Transformer's multi-head self-attention mechanism, a cross-attention mechanism is added to improve the association between the output captions and the audio features. Finally, our system achieves a FENSE score of 0.5131, which is higher than the baseline system's FENSE score of 0.5040.
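
The decoder pattern described here, masked self-attention over caption tokens plus cross-attention over audio features, matches a standard Transformer decoder; a generic PyTorch sketch follows (dimensions are illustrative, not the SCUT configuration).

```python
import torch
import torch.nn as nn

# A stock Transformer decoder layer already combines self-attention over
# the caption tokens with cross-attention over the encoder's audio features.
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=4)

caption_emb = torch.randn(2, 20, 256)   # (batch, caption length, d_model)
audio_feats = torch.randn(2, 31, 256)   # (batch, audio frames, d_model)
out = decoder(tgt=caption_emb, memory=audio_feats)  # -> (2, 20, 256)
```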

System characteristics
Best submission Li_SCUT_t6_4
Team rank 8
Audio modelling ConvNeXt
Word modelling transformer
Data augmentation mixup, SpecAugment
Ensemble number of systems 4
Train datasets used Clotho
Total number of parameters 41303080
FENSE score 0.5196854597534395

HYPERPARAMETER TUNING OF THE CONETTE AUDIO CAPTIONING SYSTEM

Jakob De Jesus Silva1, Justus Tobias1, Sebastian Sonderegger1
1Institute for Computational Perception, JKU Linz, Linz, Austria

Abstract

In the course of this challenge, we explored various methods to achieve a state-of-the-art audio captioning model. Initially, we worked with the baseline provided by the challenge organizers; we then also constructed several models from scratch, using diverse architectures. The best outcome we achieved was by tuning the hyperparameters of the baseline model CoNeTTE [1]. Our systematic approach involved finding the hyperparameters that had the most effect on performance and their best combination. Although our enhanced baseline model demonstrated some performance gains, it did not achieve a significant breakthrough over the original baseline. This is a student project in the course of the lecture "Machine Learning and Audio: A Challenge" at JKU.
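
A toy sketch of such a systematic hyperparameter search; the parameter names, grid values, and stubbed training helper are illustrative, not the report's actual setup.

```python
from itertools import product

def train_and_validate(lr: float, label_smoothing: float) -> float:
    """Hypothetical stub: train the baseline with these settings and
    return the validation FENSE score."""
    raise NotImplementedError

# Illustrative grid over two influential hyperparameters.
grid = {"lr": [1e-4, 4e-4, 5e-4], "label_smoothing": [0.0, 0.1, 0.2]}
best_score, best_cfg = float("-inf"), None
for lr, ls in product(grid["lr"], grid["label_smoothing"]):
    score = train_and_validate(lr=lr, label_smoothing=ls)
    if score > best_score:
        best_score, best_cfg = score, {"lr": lr, "label_smoothing": ls}
```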

System characteristics
Best submission Silva_JKUICP_t6_2
Team rank 9
Audio modelling None
Word modelling transformer
Data augmentation mixup, label smoothing
Ensemble number of systems 1
Train datasets used Clotho
Total number of parameters 59486498
FENSE score 0.5161157423087457

DCASE 2024 TASK6: AUTOMATED AUDIO CAPTIONING USING CONTRASTIVE LEARNING

Dan Epshtein1, Yuval Amsalem1, Alon Amar1
1Acoustics Research Center, Israel

Abstract

This technical report presents our proposed enhancements to improve the baseline results of the DCASE 2024 Challenge Task 6 on Automated Audio Captioning. We introduce an additional loss function for contrastive learning, incorporating the NT-Xent loss as proposed in [1][3] into the baseline platform.
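
For reference, a standard symmetric NT-Xent (InfoNCE-style) contrastive loss over matched audio/text embedding pairs looks like the sketch below; the report's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def nt_xent(audio_emb, text_emb, temperature=0.07):
    # Matched audio/text pairs in the batch are positives; every other
    # pairing serves as a negative. Loss is symmetrized over both directions.
    a = F.normalize(audio_emb, dim=-1)            # (B, d)
    t = F.normalize(text_emb, dim=-1)             # (B, d)
    logits = a @ t.T / temperature                # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```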

System characteristics
Best submission Epshtein_ARC_t6_1
Team rank 10
Audio modelling None
Word modelling transformer
Data augmentation mixup, label smoothing
Ensemble number of systems 1
Train datasets used Clotho
Total number of parameters 48014000
FENSE score 0.5140716527189527

DCASE 2024 task 6 automated audio captioning

Hyunhee Hong1, Yunjung Lee1
1Chungang University Graduate School, Seoul, Korea

Abstract

This project describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge, Task 6. The proposed systems in this submission are based on a supervised language-audio pretraining strategy. Experiments show that our systems can achieve a SPIDEr-FL score of 29.39 on automated audio captioning.

System characteristics
Best submission Hong_CAU_t6_1
Team rank 11
Audio modelling None
Word modelling transformer
Data augmentation mixup, label smoothing
Ensemble number of systems 1
Train datasets used Clotho
Total number of parameters 41303080
FENSE score 0.5131689575665977