Automated Audio Captioning


Challenge results

Task description

Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. a caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can generate captions for general audio recordings. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
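The hapax legomena constraint above can be made concrete with a small sketch. This is illustrative only (it is not the actual Clotho preparation tooling), using simple whitespace tokenization:

```python
from collections import Counter

def hapax_legomena(captions):
    """Return the words that occur exactly once across a split's captions."""
    counts = Counter(w for caption in captions for w in caption.lower().split())
    return {w for w, n in counts.items() if n == 1}

split = ["a dog barks loudly", "a dog growls then barks"]
print(sorted(hapax_legomena(split)))  # ['growls', 'loudly', 'then']
```

A dataset free of hapax legomena guarantees that every word a model must learn to output appears at least twice in its training split.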

Participants used the freely available development and evaluation splits of Clotho, as well as any external data they deemed suitable. The developed systems are evaluated on their generated captions using the testing split of Clotho, for which the corresponding captions are not publicly provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.

The ranking of the submitted systems is based on the achieved SPIDEr metric. However, this page provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
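SPIDEr, the ranking metric, is simply the arithmetic mean of the CIDEr and SPICE scores, which makes the tables below easy to cross-check. A minimal helper (the function name is illustrative):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2

# Cross-check against the teams table: Xu_t6a_4 reports CIDEr 0.508 and
# SPICE 0.130 on the Clotho testing split, giving SPIDEr of about 0.319.
print(round(spider(0.508, 0.130), 3))  # 0.319
```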

This year, we also introduce contrastive metrics as well as an analysis subset of Clotho. The corresponding results will be published in the coming days.

Teams ranking

Listed here are the best systems from all teams, ranked by the SPIDEr metric. To allow a more detailed exploration of system performance, the same table lists the values achieved for all metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside this task, since captions for that split are freely available.

Submission code | Best official system rank | Corresponding author | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 Xuenan Xu xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.508 0.130 0.319 0.667 0.435 0.285 0.183 0.186 0.415 0.513 0.126 0.320
Zou_t6a_3 2 Yuexian Zou zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.502 0.133 0.318 0.646 0.430 0.289 0.186 0.186 0.409 0.497 0.119 0.308
Mei_t6a_3 3 Xinhao Mei mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.482 0.135 0.309 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Primus_t6a_4 4 Paul Primus primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.458 0.134 0.296 0.636 0.417 0.275 0.168 0.183 0.401 0.461 0.130 0.295
Kouzelis_t6a_4 5 Thodoris Kouzelis kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.453 0.134 0.293 0.579 0.386 0.262 0.173 0.178 0.387 0.457 0.134 0.296
Guan_t6a_4 6 Jian Guan guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.451 0.130 0.291 0.649 0.439 0.303 0.199 0.181 0.415 0.471 0.133 0.302
Kiciński_t6a_1 7 Dawid Kiciński kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.414 0.126 0.270 0.583 0.382 0.255 0.164 0.179 0.387 0.433 0.125 0.279
Pan_t6a_4 8 Chaofan Pan pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.387 0.123 0.255 0.568 0.367 0.245 0.161 0.175 0.383 0.395 0.119 0.257
Labbe_t6a_1 9 Etienne Labbe labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.359 0.123 0.241 0.555 0.357 0.240 0.157 0.170 0.374 0.367 0.118 0.242
Baseline 10 Felix Gontier gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.338 0.110 0.224 0.555 0.358 0.239 0.156 0.164 0.364 0.358 0.109 0.233

Systems ranking

Listed here are all submitted systems and their rankings according to the different metrics and metric groupings. The first table shows all systems with all metrics, the second with only the machine translation metrics, and the third with only the captioning metrics.

Detailed information for each system is provided in the next section.

Systems ranking, all metrics

Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L, CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.508 0.130 0.319 0.667 0.435 0.285 0.183 0.186 0.415 0.513 0.126 0.320
Zou_t6a_3 2 zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.502 0.133 0.318 0.646 0.430 0.289 0.186 0.186 0.409 0.497 0.119 0.308
Xu_t6a_3 3 xu2022_t6a 0.658 0.430 0.281 0.178 0.186 0.410 0.501 0.131 0.316 0.663 0.433 0.285 0.185 0.185 0.413 0.517 0.127 0.322
Xu_t6a_1 4 xu2022_t6a 0.645 0.421 0.276 0.173 0.186 0.402 0.498 0.130 0.314 0.647 0.424 0.280 0.180 0.186 0.409 0.507 0.130 0.318
Zou_t6a_4 5 zou2022_t6a 0.652 0.433 0.287 0.182 0.185 0.408 0.497 0.130 0.314 0.663 0.443 0.299 0.195 0.189 0.416 0.520 0.126 0.323
Xu_t6a_2 6 xu2022_t6a 0.650 0.425 0.278 0.176 0.187 0.407 0.495 0.127 0.311 0.654 0.431 0.286 0.187 0.188 0.413 0.524 0.126 0.325
Zou_t6a_1 7 zou2022_t6a 0.648 0.424 0.279 0.176 0.185 0.410 0.489 0.133 0.311 0.647 0.438 0.296 0.194 0.185 0.414 0.503 0.132 0.317
Zou_t6a_2 8 zou2022_t6a 0.655 0.429 0.283 0.178 0.184 0.405 0.491 0.128 0.309 0.645 0.422 0.281 0.183 0.186 0.408 0.495 0.131 0.313
Mei_t6a_3 9 mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.482 0.135 0.309 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_1 10 mei2022_t6a 0.647 0.423 0.278 0.174 0.186 0.407 0.476 0.134 0.305 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_4 11 mei2022_t6a 0.669 0.428 0.281 0.173 0.184 0.410 0.468 0.138 0.303 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Mei_t6a_2 12 mei2022_t6a 0.672 0.427 0.276 0.167 0.182 0.410 0.456 0.138 0.297 0.646 0.414 0.270 0.169 0.183 0.407 0.453 0.128 0.291
Primus_t6a_4 13 primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.458 0.134 0.296 0.636 0.417 0.275 0.168 0.183 0.401 0.461 0.130 0.295
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.453 0.134 0.293 0.579 0.386 0.262 0.173 0.178 0.387 0.457 0.134 0.296
Guan_t6a_4 15 guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.451 0.130 0.291 0.649 0.439 0.303 0.199 0.181 0.415 0.471 0.133 0.302
Guan_t6a_2 16 guan2022_t6a 0.575 0.388 0.268 0.178 0.178 0.387 0.451 0.129 0.290 0.595 0.402 0.277 0.189 0.179 0.395 0.465 0.127 0.296
Guan_t6a_3 17 guan2022_t6a 0.657 0.425 0.279 0.169 0.179 0.405 0.447 0.132 0.290 0.660 0.424 0.279 0.170 0.178 0.410 0.442 0.129 0.285
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.567 0.378 0.257 0.169 0.176 0.385 0.447 0.131 0.289 0.575 0.384 0.262 0.174 0.178 0.386 0.457 0.133 0.295
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.570 0.382 0.259 0.170 0.177 0.384 0.439 0.132 0.286 0.576 0.384 0.261 0.176 0.166 0.385 0.453 0.130 0.292
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.569 0.378 0.256 0.168 0.177 0.386 0.441 0.130 0.285 0.578 0.384 0.262 0.176 0.177 0.387 0.454 0.133 0.293
Primus_t6a_3 21 primus2022_t6a 0.654 0.420 0.271 0.163 0.177 0.395 0.434 0.127 0.280 0.653 0.424 0.278 0.169 0.181 0.404 0.455 0.125 0.290
Primus_t6a_2 22 primus2022_t6a 0.562 0.364 0.243 0.153 0.181 0.374 0.418 0.132 0.275 0.573 0.370 0.244 0.158 0.181 0.376 0.440 0.128 0.284
Guan_t6a_1 23 guan2022_t6a 0.556 0.367 0.249 0.165 0.173 0.375 0.417 0.124 0.270 0.581 0.386 0.265 0.181 0.175 0.385 0.437 0.126 0.281
Kiciński_t6a_1 24 kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.414 0.126 0.270 0.583 0.382 0.255 0.164 0.179 0.387 0.433 0.125 0.279
Primus_t6a_1 25 primus2022_t6a 0.556 0.364 0.241 0.150 0.176 0.367 0.400 0.127 0.264 0.566 0.373 0.252 0.164 0.178 0.376 0.408 0.120 0.264
Pan_t6a_4 26 pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.387 0.123 0.255 0.568 0.367 0.245 0.161 0.175 0.383 0.395 0.119 0.257
Kiciński_t6a_3 27 kiciński2022_t6a 0.546 0.346 0.224 0.138 0.170 0.364 0.379 0.123 0.251 0.566 0.363 0.236 0.147 0.173 0.372 0.400 0.121 0.260
Kiciński_t6a_4 28 kiciński2022_t6a 0.556 0.360 0.238 0.152 0.171 0.367 0.380 0.121 0.250 0.562 0.364 0.242 0.157 0.172 0.377 0.378 0.119 0.249
Pan_t6a_3 29 pan2022_t6a 0.559 0.360 0.237 0.152 0.173 0.376 0.376 0.123 0.250 0.567 0.363 0.240 0.154 0.174 0.380 0.386 0.121 0.253
Pan_t6a_1 30 pan2022_t6a 0.556 0.365 0.244 0.154 0.170 0.375 0.377 0.121 0.249 0.560 0.362 0.240 0.155 0.169 0.375 0.381 0.116 0.248
Kiciński_t6a_2 31 kiciński2022_t6a 0.556 0.358 0.235 0.148 0.171 0.369 0.378 0.119 0.249 0.567 0.365 0.242 0.155 0.172 0.378 0.393 0.117 0.255
Pan_t6a_2 32 pan2022_t6a 0.556 0.358 0.236 0.146 0.169 0.371 0.363 0.120 0.241 0.562 0.361 0.240 0.156 0.172 0.377 0.384 0.118 0.251
Labbe_t6a_1 33 labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.359 0.123 0.241 0.555 0.357 0.240 0.157 0.170 0.374 0.367 0.118 0.242
Baseline 34 gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.338 0.110 0.224 0.555 0.358 0.239 0.156 0.164 0.364 0.358 0.109 0.233
Labbe_t6a_3 35 labbe2022_t6a 0.535 0.326 0.203 0.121 0.164 0.356 0.307 0.117 0.212 0.532 0.322 0.200 0.121 0.161 0.354 0.303 0.111 0.207
Labbe_t6a_2 36 labbe2022_t6a 0.490 0.282 0.163 0.090 0.156 0.328 0.247 0.113 0.180 0.488 0.279 0.160 0.085 0.154 0.329 0.241 0.106 0.174
Labbe_t6a_4 37 labbe2022_t6a 0.460 0.251 0.139 0.072 0.142 0.311 0.203 0.099 0.151 0.457 0.245 0.135 0.067 0.140 0.310 0.198 0.090 0.144

Systems ranking, machine translation metrics

Submission code | Best official system rank | Technical report | Clotho testing split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L) | Clotho evaluation split (BLEU1, BLEU2, BLEU3, BLEU4, METEOR, ROUGE-L)
Xu_t6a_4 1 xu2022_t6a 0.666 0.433 0.282 0.178 0.187 0.412 0.667 0.435 0.285 0.183 0.186 0.415
Zou_t6a_3 2 zou2022_t6a 0.670 0.437 0.289 0.183 0.185 0.415 0.646 0.430 0.289 0.186 0.186 0.409
Xu_t6a_3 3 xu2022_t6a 0.658 0.430 0.281 0.178 0.186 0.410 0.663 0.433 0.285 0.185 0.185 0.413
Xu_t6a_1 4 xu2022_t6a 0.645 0.421 0.276 0.173 0.186 0.402 0.647 0.424 0.280 0.180 0.186 0.409
Zou_t6a_4 5 zou2022_t6a 0.652 0.433 0.287 0.182 0.185 0.408 0.663 0.443 0.299 0.195 0.189 0.416
Xu_t6a_2 6 xu2022_t6a 0.650 0.425 0.278 0.176 0.187 0.407 0.654 0.431 0.286 0.187 0.188 0.413
Zou_t6a_1 7 zou2022_t6a 0.648 0.424 0.279 0.176 0.185 0.410 0.647 0.438 0.296 0.194 0.185 0.414
Zou_t6a_2 8 zou2022_t6a 0.655 0.429 0.283 0.178 0.184 0.405 0.645 0.422 0.281 0.183 0.186 0.408
Mei_t6a_3 9 mei2022_t6a 0.661 0.433 0.287 0.179 0.188 0.415 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_1 10 mei2022_t6a 0.647 0.423 0.278 0.174 0.186 0.407 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_4 11 mei2022_t6a 0.669 0.428 0.281 0.173 0.184 0.410 0.646 0.414 0.270 0.169 0.183 0.407
Mei_t6a_2 12 mei2022_t6a 0.672 0.427 0.276 0.167 0.182 0.410 0.646 0.414 0.270 0.169 0.183 0.407
Primus_t6a_4 13 primus2022_t6a 0.641 0.421 0.276 0.168 0.184 0.402 0.636 0.417 0.275 0.168 0.183 0.401
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.581 0.387 0.262 0.170 0.180 0.388 0.579 0.386 0.262 0.173 0.178 0.387
Guan_t6a_4 15 guan2022_t6a 0.623 0.417 0.284 0.180 0.177 0.405 0.649 0.439 0.303 0.199 0.181 0.415
Guan_t6a_2 16 guan2022_t6a 0.575 0.388 0.268 0.178 0.178 0.387 0.595 0.402 0.277 0.189 0.179 0.395
Guan_t6a_3 17 guan2022_t6a 0.657 0.425 0.279 0.169 0.179 0.405 0.660 0.424 0.279 0.170 0.178 0.410
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.567 0.378 0.257 0.169 0.176 0.385 0.575 0.384 0.262 0.174 0.178 0.386
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.570 0.382 0.259 0.170 0.177 0.384 0.576 0.384 0.261 0.176 0.166 0.385
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.569 0.378 0.256 0.168 0.177 0.386 0.578 0.384 0.262 0.176 0.177 0.387
Primus_t6a_3 21 primus2022_t6a 0.654 0.420 0.271 0.163 0.177 0.395 0.653 0.424 0.278 0.169 0.181 0.404
Primus_t6a_2 22 primus2022_t6a 0.562 0.364 0.243 0.153 0.181 0.374 0.573 0.370 0.244 0.158 0.181 0.376
Guan_t6a_1 23 guan2022_t6a 0.556 0.367 0.249 0.165 0.173 0.375 0.581 0.386 0.265 0.181 0.175 0.385
Kiciński_t6a_1 24 kiciński2022_t6a 0.567 0.368 0.244 0.155 0.175 0.378 0.583 0.382 0.255 0.164 0.179 0.387
Primus_t6a_1 25 primus2022_t6a 0.556 0.364 0.241 0.150 0.176 0.367 0.566 0.373 0.252 0.164 0.178 0.376
Pan_t6a_4 26 pan2022_t6a 0.555 0.361 0.240 0.155 0.173 0.374 0.568 0.367 0.245 0.161 0.175 0.383
Kiciński_t6a_3 27 kiciński2022_t6a 0.546 0.346 0.224 0.138 0.170 0.364 0.566 0.363 0.236 0.147 0.173 0.372
Kiciński_t6a_4 28 kiciński2022_t6a 0.556 0.360 0.238 0.152 0.171 0.367 0.562 0.364 0.242 0.157 0.172 0.377
Pan_t6a_3 29 pan2022_t6a 0.559 0.360 0.237 0.152 0.173 0.376 0.567 0.363 0.240 0.154 0.174 0.380
Pan_t6a_1 30 pan2022_t6a 0.556 0.365 0.244 0.154 0.170 0.375 0.560 0.362 0.240 0.155 0.169 0.375
Kiciński_t6a_2 31 kiciński2022_t6a 0.556 0.358 0.235 0.148 0.171 0.369 0.567 0.365 0.242 0.155 0.172 0.378
Pan_t6a_2 32 pan2022_t6a 0.556 0.358 0.236 0.146 0.169 0.371 0.562 0.361 0.240 0.156 0.172 0.377
Labbe_t6a_1 33 labbe2022_t6a 0.548 0.351 0.233 0.149 0.170 0.370 0.555 0.357 0.240 0.157 0.170 0.374
Baseline 34 gontier2022_t6a 0.549 0.353 0.234 0.147 0.164 0.361 0.555 0.358 0.239 0.156 0.164 0.364
Labbe_t6a_3 35 labbe2022_t6a 0.535 0.326 0.203 0.121 0.164 0.356 0.532 0.322 0.200 0.121 0.161 0.354
Labbe_t6a_2 36 labbe2022_t6a 0.490 0.282 0.163 0.090 0.156 0.328 0.488 0.279 0.160 0.085 0.154 0.329
Labbe_t6a_4 37 labbe2022_t6a 0.460 0.251 0.139 0.072 0.142 0.311 0.457 0.245 0.135 0.067 0.140 0.310

Systems ranking, captioning metrics

Submission code | Best official system rank | Technical report | Clotho testing split (CIDEr, SPICE, SPIDEr) | Clotho evaluation split (CIDEr, SPICE, SPIDEr)
Xu_t6a_4 1 xu2022_t6a 0.508 0.130 0.319 0.513 0.126 0.320
Zou_t6a_3 2 zou2022_t6a 0.502 0.133 0.318 0.497 0.119 0.308
Xu_t6a_3 3 xu2022_t6a 0.501 0.131 0.316 0.517 0.127 0.322
Xu_t6a_1 4 xu2022_t6a 0.498 0.130 0.314 0.507 0.130 0.318
Zou_t6a_4 5 zou2022_t6a 0.497 0.130 0.314 0.520 0.126 0.323
Xu_t6a_2 6 xu2022_t6a 0.495 0.127 0.311 0.524 0.126 0.325
Zou_t6a_1 7 zou2022_t6a 0.489 0.133 0.311 0.503 0.132 0.317
Zou_t6a_2 8 zou2022_t6a 0.491 0.128 0.309 0.495 0.131 0.313
Mei_t6a_3 9 mei2022_t6a 0.482 0.135 0.309 0.453 0.128 0.291
Mei_t6a_1 10 mei2022_t6a 0.476 0.134 0.305 0.453 0.128 0.291
Mei_t6a_4 11 mei2022_t6a 0.468 0.138 0.303 0.453 0.128 0.291
Mei_t6a_2 12 mei2022_t6a 0.456 0.138 0.297 0.453 0.128 0.291
Primus_t6a_4 13 primus2022_t6a 0.458 0.134 0.296 0.461 0.130 0.295
Kouzelis_t6a_4 14 kouzelis2022_t6a 0.453 0.134 0.293 0.457 0.134 0.296
Guan_t6a_4 15 guan2022_t6a 0.451 0.130 0.291 0.471 0.133 0.302
Guan_t6a_2 16 guan2022_t6a 0.451 0.129 0.290 0.465 0.127 0.296
Guan_t6a_3 17 guan2022_t6a 0.447 0.132 0.290 0.442 0.129 0.285
Kouzelis_t6a_3 18 kouzelis2022_t6a 0.447 0.131 0.289 0.457 0.133 0.295
Kouzelis_t6a_1 19 kouzelis2022_t6a 0.439 0.132 0.286 0.453 0.130 0.292
Kouzelis_t6a_2 20 kouzelis2022_t6a 0.441 0.130 0.285 0.454 0.133 0.293
Primus_t6a_3 21 primus2022_t6a 0.434 0.127 0.280 0.455 0.125 0.290
Primus_t6a_2 22 primus2022_t6a 0.418 0.132 0.275 0.440 0.128 0.284
Guan_t6a_1 23 guan2022_t6a 0.417 0.124 0.270 0.437 0.126 0.281
Kiciński_t6a_1 24 kiciński2022_t6a 0.414 0.126 0.270 0.433 0.125 0.279
Primus_t6a_1 25 primus2022_t6a 0.400 0.127 0.264 0.408 0.120 0.264
Pan_t6a_4 26 pan2022_t6a 0.387 0.123 0.255 0.395 0.119 0.257
Kiciński_t6a_3 27 kiciński2022_t6a 0.379 0.123 0.251 0.400 0.121 0.260
Kiciński_t6a_4 28 kiciński2022_t6a 0.380 0.121 0.250 0.378 0.119 0.249
Pan_t6a_3 29 pan2022_t6a 0.376 0.123 0.250 0.386 0.121 0.253
Pan_t6a_1 30 pan2022_t6a 0.377 0.121 0.249 0.381 0.116 0.248
Kiciński_t6a_2 31 kiciński2022_t6a 0.378 0.119 0.249 0.393 0.117 0.255
Pan_t6a_2 32 pan2022_t6a 0.363 0.120 0.241 0.384 0.118 0.251
Labbe_t6a_1 33 labbe2022_t6a 0.359 0.123 0.241 0.367 0.118 0.242
Baseline 34 gontier2022_t6a 0.338 0.110 0.224 0.358 0.109 0.233
Labbe_t6a_3 35 labbe2022_t6a 0.307 0.117 0.212 0.303 0.111 0.207
Labbe_t6a_2 36 labbe2022_t6a 0.247 0.113 0.180 0.241 0.106 0.174
Labbe_t6a_4 37 labbe2022_t6a 0.203 0.099 0.151 0.198 0.090 0.144

System characteristics

This section presents the characteristics of the submitted systems in two tables, one per subsection: the first gives an overview of the systems, and the second a detailed presentation of each.

Overview of characteristics

Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation
1 Xu_t6a_4 0.319 xu2022_t6a Rnn_Transformer 528099252 RNN Transformer, RNN
2 Zou_t6a_3 0.318 zou2022_t6a encoder-decoder 84541437 CNN LSTM SpecAugment, SpecAugment++
3 Xu_t6a_3 0.316 xu2022_t6a Rnn_Transformer 347873912 RNN Transformer
4 Xu_t6a_1 0.314 xu2022_t6a Rnn_Transformer 85915038 RNN Transformer
5 Zou_t6a_4 0.314 zou2022_t6a encoder-decoder 84541437 CNN LSTM SpecAugment, SpecAugment++
6 Xu_t6a_2 0.311 xu2022_t6a Rnn_Transformer 171830076 RNN Transformer
7 Zou_t6a_1 0.311 zou2022_t6a encoder-decoder 86643711 CNN LSTM SpecAugment, SpecAugment++
8 Zou_t6a_2 0.309 zou2022_t6a encoder-decoder 140000000 CNN LSTM SpecAugment, SpecAugment++
9 Mei_t6a_3 0.309 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
10 Mei_t6a_1 0.305 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
11 Mei_t6a_4 0.303 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
12 Mei_t6a_2 0.297 mei2022_t6a encoder-decoder 8867215 CNN Transformer SpecAugment
13 Primus_t6a_4 0.296 primus2022_t6a encoder-decoder 780000000 CNN Transformer SpecAugment
14 Kouzelis_t6a_4 0.293 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
15 Guan_t6a_4 0.291 guan2022_t6a encoder-decoder 36147701 PANNs, GAT Transformer, LocalAFT Mixup, SpecAugment
16 Guan_t6a_2 0.290 guan2022_t6a encoder-decoder 28920672 PANNs, GAT Transformer, LocalAFT Mixup, SpecAugment
17 Guan_t6a_3 0.290 guan2022_t6a encoder-decoder 7227029 PANNs, GAT Transformer Mixup, SpecAugment
18 Kouzelis_t6a_3 0.289 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
19 Kouzelis_t6a_1 0.286 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
20 Kouzelis_t6a_2 0.285 kouzelis2022_t6a encoder-decoder 119757102 PaSST Transformer Mixup, SpecAugment, Label Smoothing
21 Primus_t6a_3 0.280 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
22 Primus_t6a_2 0.275 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
23 Guan_t6a_1 0.270 guan2022_t6a encoder-decoder 7227029 PANNs, GAT Transformer Mixup, SpecAugment
24 Kiciński_t6a_1 0.270 kiciński2022_t6a encoder-decoder 104000000 PANNs Transformer random crop, random pad, adding white noise, SpecAugment
25 Primus_t6a_1 0.264 primus2022_t6a encoder-decoder 130000000 CNN Transformer SpecAugment
26 Pan_t6a_4 0.255 pan2022_t6a encoder-decoder 9857679 CNN Transformer Mixture
27 Kiciński_t6a_3 0.251 kiciński2022_t6a encoder-decoder 104000000 PANNs Transformer, KeyBert random crop, random pad, adding white noise, SpecAugment
28 Kiciński_t6a_4 0.250 kiciński2022_t6a encoder-decoder 207000000 PANNs GPT2, KeyBert random crop, random pad, adding white noise, SpecAugment
29 Pan_t6a_3 0.250 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
30 Pan_t6a_1 0.249 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
31 Kiciński_t6a_2 0.249 kiciński2022_t6a encoder-decoder 207000000 PANNs GPT2 random crop, random pad, adding white noise, SpecAugment
32 Pan_t6a_2 0.241 pan2022_t6a encoder-decoder 9857679 CNN Transformer Zero_value, Mixup
33 Labbe_t6a_1 0.241 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
34 Baseline 0.224 gontier2022_t6a encoder-decoder 140000000 Transformer Transformer
35 Labbe_t6a_3 0.212 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
36 Labbe_t6a_2 0.180 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment
37 Labbe_t6a_4 0.151 labbe2022_t6a encoder-decoder 16531922 CNN10 Transformer SpecAugment



Detailed characteristics

Rank | Submission code | SPIDEr | Technical report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity
1 Xu_t6a_4 0.319 xu2022_t6a Rnn_Transformer 528099252 RNN Clap feature Transformer, RNN learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
2 Zou_t6a_3 0.318 zou2022_t6a encoder-decoder 84541437 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
3 Xu_t6a_3 0.316 xu2022_t6a Rnn_Transformer 347873912 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
4 Xu_t6a_1 0.314 xu2022_t6a Rnn_Transformer 85915038 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
5 Zou_t6a_4 0.314 zou2022_t6a encoder-decoder 84541437 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
6 Xu_t6a_2 0.311 xu2022_t6a Rnn_Transformer 171830076 RNN Clap feature Transformer learned 44.1kHz supervised, reinforcement learning cross-entropy adam 5e-4 CIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
7 Zou_t6a_1 0.311 zou2022_t6a encoder-decoder 86643711 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
8 Zou_t6a_2 0.309 zou2022_t6a encoder-decoder 140000000 CNN ResNet38 LSTM Random SpecAugment, SpecAugment++ 44.1KHz supervised cross-entropy adamw 5e-5 loss Clotho, Clotho Clotho, Clotho
9 Mei_t6a_3 0.309 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
10 Mei_t6a_1 0.305 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
11 Mei_t6a_4 0.303 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
12 Mei_t6a_2 0.297 mei2022_t6a encoder-decoder 8867215 CNN PANNs Transformer Word2Vec SpecAugment 44.1kHz supervised cross-entropy adam 1e-3 SPIDEr Clotho Clotho
13 Primus_t6a_4 0.296 primus2022_t6a encoder-decoder 780000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised score function estimator adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
14 Kouzelis_t6a_4 0.293 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
15 Guan_t6a_4 0.291 guan2022_t6a encoder-decoder 36147701 PANNs, GAT log-mel energies Transformer, LocalAFT Word2Vec Mixup, SpecAugment 44.1kHz supervised, reinforcement learning cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
16 Guan_t6a_2 0.290 guan2022_t6a encoder-decoder 28920672 PANNs, GAT log-mel energies Transformer, LocalAFT Word2Vec Mixup, SpecAugment 44.1kHz supervised cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
17 Guan_t6a_3 0.290 guan2022_t6a encoder-decoder 7227029 PANNs, GAT log-mel energies Transformer Word2Vec Mixup, SpecAugment 44.1kHz supervised, reinforcement learning cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
18 Kouzelis_t6a_3 0.289 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
19 Kouzelis_t6a_1 0.286 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam linear warmup 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
20 Kouzelis_t6a_2 0.285 kouzelis2022_t6a encoder-decoder 119757102 PaSST mel energies Transformer Word2Vec Mixup, SpecAugment, Label Smoothing 32kHz supervised cross-entropy adam 1e-5 SPIDEr Clotho, AudioCaps, MACS Clotho, AudioCaps, MACS
21 Primus_t6a_3 0.280 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised score function estimator adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
22 Primus_t6a_2 0.275 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised cross-entropy adamw 1e-5 SPIDEr Clotho, AudioCaps, AudioSet Clotho, AudioCaps
23 Guan_t6a_1 0.270 guan2022_t6a encoder-decoder 7227029 PANNs, GAT log-mel energies Transformer Word2Vec Mixup, SpecAugment 44.1kHz supervised cross-entropy adam 1e-4 loss, SPIDEr Clotho, AudioCaps Clotho, AudioCaps
24 Kiciński_t6a_1 0.270 kiciński2022_t6a encoder-decoder 104000000 PANNs mel energies Transformer learned random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 1e-4 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
25 Primus_t6a_1 0.264 primus2022_t6a encoder-decoder 130000000 CNN CNN10 Transformer BART SpecAugment 44.1kHz supervised cross-entropy adamw 1e-5 SPIDEr Clotho, AudioSet Clotho
26 Pan_t6a_4 0.255 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Mixture 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
27 Kiciński_t6a_3 0.251 kiciński2022_t6a encoder-decoder 104000000 PANNs mel energies Transformer, KeyBert learned random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 1e-4 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
28 Kiciński_t6a_4 0.250 kiciński2022_t6a encoder-decoder 207000000 PANNs mel energies GPT2, KeyBert GPT2 random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 5e-5 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
29 Pan_t6a_3 0.250 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Zero_value, Mixup 44.1kHz supervised cross-entropy adamw 1e-3 loss Clotho Clotho
30 Pan_t6a_1 0.249 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer learned Zero_value, Mixup 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
31 Kiciński_t6a_2 0.249 kiciński2022_t6a encoder-decoder 207000000 PANNs mel energies GPT2 GPT2 random crop, random pad, adding white noise, SpecAugment 16kHz supervised cross-entropy adamw 5e-5 loss Clotho, AudioCaps, Freesound Clotho, AudioCaps, Freesound
32 Pan_t6a_2 0.241 pan2022_t6a encoder-decoder 9857679 CNN mel energies Transformer Word2Vec Zero_value, Mixup 44.1kHz supervised cross-entropy adam 1e-3 loss Clotho Clotho
33 Labbe_t6a_1 0.241 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
34 Baseline 0.224 gontier2022_t6a encoder-decoder 140000000 Transformer VGGish Transformer BART 16kHz supervised cross-entropy adamw 1e-5 loss Clotho Clotho
35 Labbe_t6a_3 0.212 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
36 Labbe_t6a_2 0.180 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho
37 Labbe_t6a_4 0.151 labbe2022_t6a encoder-decoder 16531922 CNN10 log-mel energies Transformer learned SpecAugment 32kHz supervised cross-entropy adam 5e-4 loss Clotho Clotho



Technical reports

Ensemble learning for audio captioning with graph audio feature representation

Feiyang Xiao1, Jian Guan1, Haiyan Lan1, Qiaoxi Zhu2, Wenwu Wang3
1Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Centre for Audio, Acoustic and Vibration, University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Abstract

This technical report describes our submission for Task 6A of the DCASE2022 Challenge (automated audio captioning). Our system is built on an ensemble learning strategy, integrating the advantages of different audio captioning methods, including a graph attention-based audio feature representation method. Experiments show that our ensemble system achieves a SPIDEr score (used for ranking) of 30.2 on the evaluation split of the Clotho v2 dataset.

System characteristics
Data augmentation Mixup, SpecAugment
PDF
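The report does not spell out the combination rule of its ensemble, but one common late-fusion scheme for captioning models averages the member models' next-token distributions at each decoding step. A hedged sketch of that scheme (names are illustrative):

```python
import numpy as np

def ensemble_step(per_model_probs):
    """Average the next-token distributions of several captioning models
    and return the consensus token (one simple late-fusion scheme)."""
    avg = np.mean(per_model_probs, axis=0)
    return int(np.argmax(avg)), avg

p1 = np.array([0.6, 0.3, 0.1])  # model A's next-token distribution
p2 = np.array([0.2, 0.7, 0.1])  # model B's next-token distribution
token, avg = ensemble_step([p1, p2])
print(token)  # 1, the argmax of the averaged distribution [0.4, 0.5, 0.1]
```

Averaging in probability space lets models that disagree on the top token still contribute to a shared second choice, which is why ensembles often smooth out individual models' errors.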

Exploring audio captioning with keyword-guided text generation

Dawid Kicinski, Teodor Lamort de Gail, Pawel Bujnowski
Samsung R&D Institute Poland, Artificial Intelligence, Warsaw, Poland

Abstract

This technical report describes our submission to the DCASE 2022 challenge, Task 6 A: automated audio captioning. In our system, we explore the use of pre-trained language models for the audio captioning task. The proposed system is an encoder-decoder architecture consisting of a pre-trained PANN encoder and a GPT2 decoder. Audio embeddings are encoded into language model prompts using a simple mapping network. We further develop our system by employing strategies for guiding the decoder with textual information: we prompt the decoder with keywords extracted from semantically similar audio clips and use keyword occurrence to choose the best-matching caption.

System characteristics
Data augmentation random crop, random pad, adding white noise, SpecAugment
PDF
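The occurrence-based selection step the authors describe could be realized along these lines (a sketch with illustrative names, not the submitted implementation):

```python
def pick_caption(candidates, keywords):
    """Pick the candidate caption that contains the most of the given keywords."""
    def hits(caption):
        words = set(caption.lower().split())
        return sum(kw in words for kw in keywords)
    return max(candidates, key=hits)

candidates = ["water drips into a metal sink", "a crowd cheers loudly"]
print(pick_caption(candidates, ["water", "sink"]))  # water drips into a metal sink
```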

Efficient audio captioning transformer with patchout and text guidance

Thodoris Kouzelis1,2, Grigoris Bastas1,2, Athanasios Katsamanis2, Alexandros Potamianos1
1Institute for Language and Speech Processing, Athena Research Center, Athens, Greece, 2School of ECE, National Technical University of Athens, Athens, Greece

Abstract

This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge, Task 6. We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting. The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model, which is fine-tuned to maximize the semantic similarity between AudioSet labels and ground-truth captions. To mitigate the data scarcity problem of Automated Audio Captioning (AAC), we pre-train our model on an enlarged dataset. Moreover, we propose a method to apply Mixup augmentation for AAC. Our best model achieves a SPIDEr score of 0.296.

Awards: Judges’ award

System characteristics
Data augmentation Mixup, SpecAugment, Label Smoothing
PDF
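Generic Mixup on log-mel spectrograms can be sketched as follows; the report proposes an AAC-specific adaptation (captions cannot be linearly interpolated the way labels can), so this only illustrates the basic idea:

```python
import numpy as np

def mixup(spec_a, spec_b, alpha=0.4, rng=None):
    """Blend two (time, mel) spectrograms with a Beta(alpha, alpha) weight.
    In classification, the same weight lam also mixes the training targets."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    return lam * spec_a + (1.0 - lam) * spec_b, lam

a, b = np.zeros((400, 64)), np.ones((400, 64))
mixed, lam = mixup(a, b)  # every bin of the mix lies between the two inputs
```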

IRIT-UPS DCASE 2022 task6a system: stochastic decoding methods for audio captioning

Etienne Labbé, Thomas Pellegrini, Julien Pinquier
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France

Abstract

This document presents a summary of our models used in the Automated Audio Captioning task (6a) of the DCASE2022 challenge. Four submissions were made using different decoding methods: beam search, top-k sampling, nucleus sampling, and typical decoding.

System characteristics
Data augmentation SpecAugment
PDF

Automated audio captioning with keywords guidance

Xinhao Mei, Xubo Liu, Haohe Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK

Abstract

This technical report describes an automated audio captioning system we submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2022 Task 6a. The proposed system is built on an encoder-decoder architecture we submitted to DCASE 2021 Challenge Task 6 last year, where the encoder is a pre-trained 10-layer convolutional neural network and the decoder is a Transformer network. In this new submission, we investigate the use of keywords estimated from input audio clips to guide the caption generation process. The results show that keywords guidance can improve system performance, especially when the pre-trained encoder is frozen, and can also reduce the variance of the results when the model is trained with different seeds. The overall system consists of a pre-trained keywords estimation model and a CNN-Transformer audio captioning model. The captioning model is first trained via the cross-entropy loss and then fine-tuned with reinforcement learning to optimize the evaluation metric CIDEr. The proposed system significantly improves the scores of all the evaluation metrics as compared to the baseline system.
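One simple way to picture keyword guidance at decoding time is a bias added to the logits of tokens the keyword estimator predicted. This is a hypothetical sketch, not the report's exact mechanism (which conditions generation inside the model), using a toy five-word vocabulary:

```python
import numpy as np

def keyword_bias(logits, keyword_ids, bonus=2.0):
    """Shift the logits of predicted keyword tokens upward before softmax."""
    biased = logits.copy()
    biased[keyword_ids] += bonus
    return biased

vocab = ["a", "dog", "barks", "rain", "falls"]   # toy vocabulary
logits = np.array([1.0, 0.5, 0.5, 2.0, 1.5])
keyword_ids = [1, 2]                             # estimator predicted "dog", "barks"
biased = keyword_bias(logits, keyword_ids)
probs = np.exp(biased) / np.exp(biased).sum()    # keyword tokens now dominate
```

The strength of the `bonus` term controls how strongly generation is pulled toward the estimated keywords.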

System characteristics
Data augmentation SpecAugment
PDF

Audio captioning using pre-trained model and data augmentation

Tianyang Huang1, Chaofan Pan1, Wenyao Chen1, Chenyang Zhu3, Shengchen Li2, Xi Shao1
1College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 2School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China, 3School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China

Abstract

This technical report describes an automatic audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge. Based on an encoder-decoder architecture, the system is composed of a convolutional neural network (CNN) encoder and a Transformer decoder. Instead of using pre-trained models only in the audio modality, we try to introduce pre-trained models in the text modality as well. In addition, we consider using a data augmentation method with and without noise to improve the data quality and thus improve the generalization and robustness of the model. The experimental results show that our system can achieve a SPIDEr score of 0.257 (official baseline: 0.233) on the Clotho evaluation set.
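Noise-based augmentation of this kind is typically applied by adding Gaussian noise scaled to a target signal-to-noise ratio. A minimal sketch (the SNR value and waveform are illustrative; the report does not specify its exact noise parameters):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add Gaussian noise scaled so the result has the given SNR in dB."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_white_noise(clean, snr_db=20, rng=rng)
```

Training on both the clean and the noisy copies of each clip is one common way to realize "with and without noise" augmentation.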

System characteristics
Data augmentation Mixture
PDF

CP-JKU's submission to task 6a of the DCASE2022 challenge: a BART encoder-decoder for automatic audio captioning trained via the reinforce algorithm and transfer learning

Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University, Austria

Abstract

This technical report details the CP-JKU submission to the automatic audio captioning task of the 2022 DCASE challenge (Task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge’s baseline system in three directions: first, the VGGish embedding model was replaced with a custom CNN10-like model that we pre-trained on AudioSet. Second, the BART encoder-decoder model was pre-trained on AudioCaps, which led to faster convergence. Finally, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric using the REINFORCE algorithm. Our best model achieves a SPIDEr score of 0.29 (single-model performance), which is an improvement of 6.6 pp. over the challenge’s baseline score.
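REINFORCE fine-tuning on CIDEr is commonly implemented in self-critical form: the reward of a greedy-decoded caption serves as the baseline, and each sampled caption is weighted by its advantage (sampled reward minus greedy reward). A toy sketch, with made-up numbers standing in for real CIDEr scores and caption log-probabilities:

```python
import numpy as np

def scst_loss(sample_logprobs, sample_rewards, greedy_reward):
    """REINFORCE with the greedy caption's reward as baseline:
    loss = -mean((r_sample - r_greedy) * log p(sample))."""
    advantage = np.asarray(sample_rewards) - greedy_reward
    return -np.mean(advantage * np.asarray(sample_logprobs))

# Two sampled captions: their total log-probs and CIDEr rewards (made up).
logp = [-4.0, -6.0]
rewards = [0.9, 0.5]
greedy = 0.6            # greedy caption's CIDEr score (made up)
loss = scst_loss(logp, rewards, greedy)
print(round(loss, 3))   # 0.3
```

Samples that beat the greedy baseline get a positive advantage and are reinforced; samples that fall short are suppressed, so no learned value function is needed.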

System characteristics
Data augmentation SpecAugment
PDF

The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training

Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge Task 6, which involves two subtasks: text-to-audio retrieval and automated audio captioning. The text-to-audio retrieval system adopts a bi-encoder architecture using pre-trained audio and text encoders. The system is first pre-trained on AudioCaps and then fine-tuned on the challenge dataset Clotho. For the audio captioning system, we first train a retrieval model on all public captioning data and then take the audio encoder as the feature extractor. Then a standard sequence-to-sequence model is trained on Clotho based on the pre-trained feature extractor. The captioning model is first trained by word-level cross-entropy loss and then fine-tuned using self-critical sequence training. Our system achieves a SPIDEr of 32.5 on captioning and an mAP of 29.9 on text-to-audio retrieval.
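A bi-encoder retrieval setup scores each (text, audio) pair by the similarity of independently encoded embeddings, so the whole collection can be ranked with one matrix product. A minimal sketch with random vectors standing in for the encoder outputs (shapes and dimensions are illustrative):

```python
import numpy as np

def retrieve(text_emb, audio_embs):
    """Rank audio clips by cosine similarity to a query text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ t
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(5, 128))                  # 5 clips, 128-d embeddings
text_emb = audio_embs[3] + 0.1 * rng.normal(size=128)   # query close to clip 3
ranking, scores = retrieve(text_emb, audio_embs)
print(ranking[0])  # clip 3 ranks first
```

Because audio and text are encoded separately, the audio embeddings can be precomputed once and reused for every query.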

System characteristics
Data augmentation None
PDF

Automated audio captioning with multi-task learning

Zhongjie Ye1, Yuexian Zou1, Fan Cui2, Yujun Wang2
1ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2Xiaomi Corporation, Beijing, China

Abstract

This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 6A Challenge. Our model consists of a convolutional neural network (CNN) encoder and a single-layer LSTM decoder with a temporal attention module. In order to enhance the representation in the domain dataset, we use the ResNet38 pretrained on the AudioSet dataset as our audio encoder and fine-tune it with keywords of nouns and verbs as labels which are extracted from the captions. For training the whole caption model, we first train the model with the standard cross-entropy loss, and fine-tune it with reinforcement learning to directly optimize the CIDEr score. Experimental results show that our single model can achieve a SPIDEr score of 31.7 on the evaluation split.
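Fine-tuning the encoder on caption-derived keywords amounts to building multi-hot targets over a keyword vocabulary. A sketch with a hand-written noun/verb vocabulary (the report extracts keywords with a tagger; the word list and caption here are illustrative):

```python
import numpy as np

KEYWORD_VOCAB = ["dog", "bark", "rain", "fall", "car"]  # illustrative noun/verb vocabulary

def keyword_targets(caption, vocab=KEYWORD_VOCAB):
    """Multi-hot label vector: 1 where a vocabulary keyword occurs in the caption."""
    tokens = set(caption.lower().split())
    return np.array([1.0 if w in tokens else 0.0 for w in vocab])

y = keyword_targets("a dog bark as rain fall on the street")
print(y)  # [1. 1. 1. 1. 0.]
```

The encoder is then trained against these vectors with a multi-label objective (e.g. binary cross-entropy) before caption training begins.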

System characteristics
Data augmentation SpecAugment, SpecAugment++
PDF