Task description
Automated audio captioning is the task of describing general audio content using free text. It is an inter-modal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for general audio recordings. To this aim, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e. words that appear only once in a split).
Participants used the freely available Clotho development and evaluation splits, which provide both audio and corresponding captions. In this edition of the task, the use of external data and pre-trained models was permitted, as reflected in the system characteristics below. The developed systems are evaluated on the captions they generate for the Clotho testing split, for which the corresponding captions are not provided. More information about Task 6: Automated Audio Captioning can be found at the task description page.
The ranking of the submitted systems is based on the achieved SPIDEr metric. This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
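For quick reference, SPIDEr is simply the arithmetic mean of the CIDEr and SPICE scores. A minimal Python sketch of how the ranking metric is formed; the example values are taken from the top-ranked entry in the table below:

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = (CIDEr + SPICE) / 2."""
    return (cider + spice) / 2.0

# Example: CIDEr 0.485 and SPICE 0.135 (the top-ranked system's
# testing-split scores) give SPIDEr 0.310, matching the table.
print(spider(0.485, 0.135))  # 0.31
```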
Teams ranking
Listed here is the best system from each team. The ranking is based on the SPIDEr metric. For a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all the metrics employed in the task, for both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow further comparison with systems and methods developed outside of this task, since the Clotho evaluation split is freely available.
In the metric columns, the prefix “Test” denotes the Clotho testing split and “Eval” the Clotho evaluation split.

Submission code | Best official system rank | Corresponding author | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Test CIDEr | Test SPICE | Test SPIDEr | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_2 | 1 | Weiqiang Yuan | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.485 | 0.135 | 0.310 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | 0.499 | 0.137 | 0.318 | |
Xu_t6_3 | 2 | Xuenan Xu | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.463 | 0.129 | 0.296 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | 0.472 | 0.124 | 0.298 | |
Xinhao_t6_1 | 3 | Xinhao Mei | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.457 | 0.131 | 0.294 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | 0.412 | 0.122 | 0.268 | |
Ye_t6_3 | 4 | Zhongjie Ye | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.434 | 0.126 | 0.280 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | 0.440 | 0.125 | 0.282 | |
Chen_t6_4 | 5 | Zhiwen Chen | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.402 | 0.121 | 0.262 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | 0.406 | 0.119 | 0.262 | |
Won_t6_4 | 6 | Hyejin Won | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.381 | 0.118 | 0.249 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | 0.441 | 0.128 | 0.285 | |
Narisetty_t6_4 | 7 | Chaitanya Narisetty | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.362 | 0.110 | 0.236 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_4 | 8 | Etienne Labbe | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.333 | 0.108 | 0.221 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | 0.351 | 0.110 | 0.231 | |
Liu_t6_1 | 9 | Yang Liu | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.274 | 0.094 | 0.184 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | 0.243 | 0.088 | 0.166 | |
Eren_t6_1 | 10 | Ayşegül Özkaya Eren | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.256 | 0.107 | 0.182 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | 0.328 | 0.155 | 0.242 | |
Gebhard_t6_1 | 11 | Alexander Gebhard | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.105 | 0.047 | 0.076 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | 0.098 | 0.043 | 0.071 | |
Xiao_t6_2 | 12 | Feiyang Xiao | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.058 | 0.033 | 0.046 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | 0.210 | 0.079 | 0.144 | |
Baseline_t6_1 | 13 | Konstantinos Drossos | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.020 | 0.004 | 0.012 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 | 0.075 | 0.028 | 0.051 |
Systems ranking
Here are listed all systems and their rankings according to the different metrics and groupings of metrics. First comes a table with all systems and all metrics, then a table with all systems but only the machine-translation metrics, and finally a table with all systems but only the captioning metrics.
Detailed information on each system is given in the next section.
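As a hedged illustration of the machine-translation metric group, the snippet below computes cumulative BLEU-1 through BLEU-4 for a candidate caption against two references using NLTK. The captions are invented examples; the official scores in the tables come from the task's caption-evaluation tooling, not from this code.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog barks loudly in the distance".split(),
    "a dog is barking far away".split(),
]
candidate = "a dog barks in the distance".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # cumulative BLEU-n weights
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU{n}: {score:.3f}")
```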
Systems ranking, all metrics
Submission code | Best official system rank | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Test CIDEr | Test SPICE | Test SPIDEr | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.586 | 0.387 | 0.261 | 0.170 | 0.181 | 0.384 | 0.457 | 0.136 | 0.296 | 0.595 | 0.402 | 0.278 | 0.189 | 0.184 | 0.392 | 0.495 | 0.136 | 0.315 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.485 | 0.135 | 0.310 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | 0.499 | 0.137 | 0.318 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.590 | 0.396 | 0.271 | 0.176 | 0.181 | 0.388 | 0.471 | 0.133 | 0.302 | 0.635 | 0.444 | 0.310 | 0.211 | 0.197 | 0.420 | 0.569 | 0.151 | 0.360 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.584 | 0.392 | 0.266 | 0.175 | 0.181 | 0.389 | 0.465 | 0.131 | 0.298 | 0.665 | 0.487 | 0.359 | 0.260 | 0.214 | 0.449 | 0.684 | 0.163 | 0.423 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.351 | 0.150 | 0.079 | 0.041 | 0.082 | 0.238 | 0.057 | 0.029 | 0.043 | 0.471 | 0.282 | 0.182 | 0.112 | 0.128 | 0.317 | 0.208 | 0.078 | 0.143 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.058 | 0.033 | 0.046 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | 0.210 | 0.079 | 0.144 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.549 | 0.356 | 0.235 | 0.149 | 0.169 | 0.360 | 0.389 | 0.117 | 0.253 | 0.555 | 0.357 | 0.236 | 0.152 | 0.168 | 0.366 | 0.409 | 0.120 | 0.265 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.535 | 0.345 | 0.227 | 0.142 | 0.161 | 0.359 | 0.349 | 0.113 | 0.231 | 0.553 | 0.364 | 0.247 | 0.161 | 0.167 | 0.371 | 0.408 | 0.118 | 0.263 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.537 | 0.351 | 0.234 | 0.151 | 0.167 | 0.362 | 0.373 | 0.117 | 0.245 | 0.561 | 0.369 | 0.249 | 0.167 | 0.169 | 0.373 | 0.406 | 0.118 | 0.262 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.402 | 0.121 | 0.262 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | 0.406 | 0.119 | 0.262 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.582 | 0.385 | 0.259 | 0.169 | 0.180 | 0.382 | 0.432 | 0.126 | 0.279 | 0.578 | 0.381 | 0.257 | 0.169 | 0.181 | 0.381 | 0.433 | 0.125 | 0.279 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.577 | 0.379 | 0.254 | 0.164 | 0.182 | 0.385 | 0.420 | 0.128 | 0.274 | 0.579 | 0.384 | 0.261 | 0.172 | 0.181 | 0.386 | 0.436 | 0.128 | 0.282 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.434 | 0.126 | 0.280 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | 0.440 | 0.125 | 0.282 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.586 | 0.389 | 0.261 | 0.170 | 0.181 | 0.387 | 0.429 | 0.125 | 0.277 | 0.590 | 0.395 | 0.272 | 0.183 | 0.182 | 0.394 | 0.453 | 0.129 | 0.291 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.274 | 0.094 | 0.184 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | 0.243 | 0.088 | 0.166 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.105 | 0.047 | 0.076 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | 0.098 | 0.043 | 0.071 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.256 | 0.107 | 0.182 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | 0.328 | 0.155 | 0.242 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.457 | 0.131 | 0.294 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | 0.412 | 0.122 | 0.268 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.653 | 0.423 | 0.282 | 0.176 | 0.180 | 0.408 | 0.439 | 0.136 | 0.287 | 0.635 | 0.406 | 0.268 | 0.166 | 0.176 | 0.400 | 0.412 | 0.121 | 0.266 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.644 | 0.420 | 0.278 | 0.170 | 0.181 | 0.406 | 0.447 | 0.136 | 0.291 | 0.621 | 0.407 | 0.273 | 0.177 | 0.179 | 0.395 | 0.431 | 0.122 | 0.277 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.627 | 0.407 | 0.269 | 0.166 | 0.182 | 0.399 | 0.436 | 0.129 | 0.283 | 0.625 | 0.412 | 0.278 | 0.178 | 0.176 | 0.401 | 0.428 | 0.126 | 0.277 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.531 | 0.346 | 0.235 | 0.157 | 0.160 | 0.361 | 0.362 | 0.108 | 0.235 | 0.546 | 0.356 | 0.243 | 0.165 | 0.163 | 0.369 | 0.381 | 0.110 | 0.246 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.158 | 0.359 | 0.358 | 0.109 | 0.234 | 0.558 | 0.373 | 0.261 | 0.181 | 0.167 | 0.376 | 0.410 | 0.114 | 0.262 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.159 | 0.362 | 0.360 | 0.110 | 0.235 | 0.562 | 0.377 | 0.261 | 0.182 | 0.169 | 0.377 | 0.416 | 0.116 | 0.266 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.362 | 0.110 | 0.236 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.435 | 0.222 | 0.128 | 0.073 | 0.121 | 0.305 | 0.146 | 0.072 | 0.109 | 0.435 | 0.229 | 0.129 | 0.069 | 0.252 | 0.195 | 0.136 | 0.067 | 0.101 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.454 | 0.270 | 0.176 | 0.109 | 0.122 | 0.310 | 0.178 | 0.078 | 0.128 | 0.452 | 0.262 | 0.168 | 0.102 | 0.249 | 0.193 | 0.172 | 0.071 | 0.122 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.525 | 0.321 | 0.200 | 0.117 | 0.157 | 0.354 | 0.296 | 0.115 | 0.205 | 0.523 | 0.316 | 0.191 | 0.109 | 0.309 | 0.231 | 0.287 | 0.104 | 0.195 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.333 | 0.108 | 0.221 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | 0.351 | 0.110 | 0.231 | |
Won_t6_1 | 21 | won2021_t6 | 0.535 | 0.344 | 0.231 | 0.151 | 0.162 | 0.359 | 0.375 | 0.111 | 0.243 | 0.540 | 0.345 | 0.230 | 0.152 | 0.161 | 0.361 | 0.383 | 0.109 | 0.246 | |
Won_t6_2 | 24 | won2021_t6 | 0.516 | 0.338 | 0.226 | 0.145 | 0.161 | 0.359 | 0.357 | 0.114 | 0.236 | 0.550 | 0.361 | 0.244 | 0.160 | 0.172 | 0.375 | 0.401 | 0.121 | 0.261 | |
Won_t6_3 | 22 | won2021_t6 | 0.518 | 0.346 | 0.235 | 0.151 | 0.163 | 0.366 | 0.366 | 0.117 | 0.242 | 0.554 | 0.370 | 0.254 | 0.168 | 0.170 | 0.379 | 0.400 | 0.119 | 0.259 | |
Won_t6_4 | 19 | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.381 | 0.118 | 0.249 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | 0.441 | 0.128 | 0.285 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.560 | 0.366 | 0.245 | 0.159 | 0.177 | 0.376 | 0.403 | 0.127 | 0.265 | 0.576 | 0.377 | 0.252 | 0.164 | 0.178 | 0.382 | 0.421 | 0.122 | 0.271 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.556 | 0.365 | 0.245 | 0.161 | 0.178 | 0.375 | 0.404 | 0.125 | 0.265 | 0.572 | 0.374 | 0.251 | 0.165 | 0.178 | 0.381 | 0.418 | 0.122 | 0.270 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.463 | 0.129 | 0.296 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | 0.472 | 0.124 | 0.298 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.651 | 0.421 | 0.271 | 0.170 | 0.182 | 0.403 | 0.461 | 0.128 | 0.295 | 0.660 | 0.427 | 0.276 | 0.177 | 0.181 | 0.411 | 0.471 | 0.123 | 0.297 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.020 | 0.004 | 0.012 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 | 0.075 | 0.028 | 0.051 |
Systems ranking, machine translation metrics
Submission code | Best official system rank | Technical Report | Test BLEU1 | Test BLEU2 | Test BLEU3 | Test BLEU4 | Test METEOR | Test ROUGEL | Eval BLEU1 | Eval BLEU2 | Eval BLEU3 | Eval BLEU4 | Eval METEOR | Eval ROUGEL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.586 | 0.387 | 0.261 | 0.170 | 0.181 | 0.384 | 0.595 | 0.402 | 0.278 | 0.189 | 0.184 | 0.392 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.595 | 0.400 | 0.275 | 0.184 | 0.182 | 0.394 | 0.603 | 0.414 | 0.286 | 0.195 | 0.186 | 0.400 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.590 | 0.396 | 0.271 | 0.176 | 0.181 | 0.388 | 0.635 | 0.444 | 0.310 | 0.211 | 0.197 | 0.420 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.584 | 0.392 | 0.266 | 0.175 | 0.181 | 0.389 | 0.665 | 0.487 | 0.359 | 0.260 | 0.214 | 0.449 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.351 | 0.150 | 0.079 | 0.041 | 0.082 | 0.238 | 0.471 | 0.282 | 0.182 | 0.112 | 0.128 | 0.317 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.344 | 0.152 | 0.085 | 0.044 | 0.082 | 0.239 | 0.461 | 0.275 | 0.180 | 0.112 | 0.126 | 0.312 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.549 | 0.356 | 0.235 | 0.149 | 0.169 | 0.360 | 0.555 | 0.357 | 0.236 | 0.152 | 0.168 | 0.366 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.535 | 0.345 | 0.227 | 0.142 | 0.161 | 0.359 | 0.553 | 0.364 | 0.247 | 0.161 | 0.167 | 0.371 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.537 | 0.351 | 0.234 | 0.151 | 0.167 | 0.362 | 0.561 | 0.369 | 0.249 | 0.167 | 0.169 | 0.373 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.549 | 0.358 | 0.239 | 0.156 | 0.169 | 0.367 | 0.563 | 0.367 | 0.244 | 0.158 | 0.170 | 0.371 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.582 | 0.385 | 0.259 | 0.169 | 0.180 | 0.382 | 0.578 | 0.381 | 0.257 | 0.169 | 0.181 | 0.381 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.577 | 0.379 | 0.254 | 0.164 | 0.182 | 0.385 | 0.579 | 0.384 | 0.261 | 0.172 | 0.181 | 0.386 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.584 | 0.391 | 0.265 | 0.173 | 0.179 | 0.384 | 0.586 | 0.391 | 0.268 | 0.180 | 0.180 | 0.388 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.586 | 0.389 | 0.261 | 0.170 | 0.181 | 0.387 | 0.590 | 0.395 | 0.272 | 0.183 | 0.182 | 0.394 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.478 | 0.291 | 0.189 | 0.118 | 0.143 | 0.324 | 0.483 | 0.298 | 0.197 | 0.119 | 0.322 | 0.133 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.447 | 0.169 | 0.072 | 0.029 | 0.099 | 0.287 | 0.449 | 0.167 | 0.068 | 0.029 | 0.097 | 0.284 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.479 | 0.280 | 0.168 | 0.090 | 0.140 | 0.302 | 0.586 | 0.356 | 0.268 | 0.150 | 0.214 | 0.444 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.620 | 0.416 | 0.282 | 0.180 | 0.184 | 0.401 | 0.615 | 0.403 | 0.270 | 0.171 | 0.179 | 0.392 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.653 | 0.423 | 0.282 | 0.176 | 0.180 | 0.408 | 0.635 | 0.406 | 0.268 | 0.166 | 0.176 | 0.400 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.644 | 0.420 | 0.278 | 0.170 | 0.181 | 0.406 | 0.621 | 0.407 | 0.273 | 0.177 | 0.179 | 0.395 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.627 | 0.407 | 0.269 | 0.166 | 0.182 | 0.399 | 0.625 | 0.412 | 0.278 | 0.178 | 0.176 | 0.401 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.531 | 0.346 | 0.235 | 0.157 | 0.160 | 0.361 | 0.546 | 0.356 | 0.243 | 0.165 | 0.163 | 0.369 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.158 | 0.359 | 0.558 | 0.373 | 0.261 | 0.181 | 0.167 | 0.376 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.534 | 0.347 | 0.235 | 0.157 | 0.159 | 0.362 | 0.562 | 0.377 | 0.261 | 0.182 | 0.169 | 0.377 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.534 | 0.348 | 0.238 | 0.160 | 0.157 | 0.361 | 0.563 | 0.378 | 0.264 | 0.184 | 0.168 | 0.378 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.435 | 0.222 | 0.128 | 0.073 | 0.121 | 0.305 | 0.435 | 0.229 | 0.129 | 0.069 | 0.252 | 0.195 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.454 | 0.270 | 0.176 | 0.109 | 0.122 | 0.310 | 0.452 | 0.262 | 0.168 | 0.102 | 0.249 | 0.193 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.525 | 0.321 | 0.200 | 0.117 | 0.157 | 0.354 | 0.523 | 0.316 | 0.191 | 0.109 | 0.309 | 0.231 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.539 | 0.354 | 0.239 | 0.154 | 0.156 | 0.361 | 0.541 | 0.358 | 0.243 | 0.159 | 0.327 | 0.235 | |
Won_t6_1 | 21 | won2021_t6 | 0.535 | 0.344 | 0.231 | 0.151 | 0.162 | 0.359 | 0.540 | 0.345 | 0.230 | 0.152 | 0.161 | 0.361 | |
Won_t6_2 | 24 | won2021_t6 | 0.516 | 0.338 | 0.226 | 0.145 | 0.161 | 0.359 | 0.550 | 0.361 | 0.244 | 0.160 | 0.172 | 0.375 | |
Won_t6_3 | 22 | won2021_t6 | 0.518 | 0.346 | 0.235 | 0.151 | 0.163 | 0.366 | 0.554 | 0.370 | 0.254 | 0.168 | 0.170 | 0.379 | |
Won_t6_4 | 19 | won2021_t6 | 0.538 | 0.359 | 0.247 | 0.162 | 0.166 | 0.372 | 0.564 | 0.376 | 0.254 | 0.163 | 0.177 | 0.388 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.560 | 0.366 | 0.245 | 0.159 | 0.177 | 0.376 | 0.576 | 0.377 | 0.252 | 0.164 | 0.178 | 0.382 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.556 | 0.365 | 0.245 | 0.161 | 0.178 | 0.375 | 0.572 | 0.374 | 0.251 | 0.165 | 0.178 | 0.381 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.650 | 0.420 | 0.271 | 0.171 | 0.182 | 0.405 | 0.659 | 0.424 | 0.275 | 0.176 | 0.182 | 0.411 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.651 | 0.421 | 0.271 | 0.170 | 0.182 | 0.403 | 0.660 | 0.427 | 0.276 | 0.177 | 0.181 | 0.411 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.405 | 0.061 | 0.014 | 0.000 | 0.070 | 0.265 | 0.378 | 0.119 | 0.050 | 0.017 | 0.078 | 0.263 |
Systems ranking, captioning metrics
Submission code | Best official system rank | Technical Report | Test CIDEr | Test SPICE | Test SPIDEr | Eval CIDEr | Eval SPICE | Eval SPIDEr |
---|---|---|---|---|---|---|---|---|
Yuan_t6_1 | 4 | yuan2021_t6 | 0.457 | 0.136 | 0.296 | 0.495 | 0.136 | 0.315 | |
Yuan_t6_2 | 1 | yuan2021_t6 | 0.485 | 0.135 | 0.310 | 0.499 | 0.137 | 0.318 | |
Yuan_t6_3 | 2 | yuan2021_t6 | 0.471 | 0.133 | 0.302 | 0.569 | 0.151 | 0.360 | |
Yuan_t6_4 | 3 | yuan2021_t6 | 0.465 | 0.131 | 0.298 | 0.684 | 0.163 | 0.423 | |
Xiao_t6_1 | 37 | xiao2021_t6 | 0.057 | 0.029 | 0.043 | 0.208 | 0.078 | 0.143 | |
Xiao_t6_2 | 36 | xiao2021_t6 | 0.058 | 0.033 | 0.046 | 0.210 | 0.079 | 0.144 | |
Chen_t6_1 | 18 | chen2021_t6 | 0.389 | 0.117 | 0.253 | 0.409 | 0.120 | 0.265 | |
Chen_t6_2 | 28 | chen2021_t6 | 0.349 | 0.113 | 0.231 | 0.408 | 0.118 | 0.263 | |
Chen_t6_3 | 20 | chen2021_t6 | 0.373 | 0.117 | 0.245 | 0.406 | 0.118 | 0.262 | |
Chen_t6_4 | 17 | chen2021_t6 | 0.402 | 0.121 | 0.262 | 0.406 | 0.119 | 0.262 | |
Ye_t6_1 | 12 | ye2021_t6 | 0.432 | 0.126 | 0.279 | 0.433 | 0.125 | 0.279 | |
Ye_t6_2 | 14 | ye2021_t6 | 0.420 | 0.128 | 0.274 | 0.436 | 0.128 | 0.282 | |
Ye_t6_3 | 11 | ye2021_t6 | 0.434 | 0.126 | 0.280 | 0.440 | 0.125 | 0.282 | |
Ye_t6_4 | 13 | ye2021_t6 | 0.429 | 0.125 | 0.277 | 0.453 | 0.129 | 0.291 | |
Liu_t6_1 | 31 | liu2021_t6 | 0.274 | 0.094 | 0.184 | 0.243 | 0.088 | 0.166 | |
Gebhard_t6_1 | 35 | gebhard2021_t6 | 0.105 | 0.047 | 0.076 | 0.098 | 0.043 | 0.071 | |
Eren_t6_1 | 32 | eren2021_t6 | 0.256 | 0.107 | 0.182 | 0.328 | 0.155 | 0.242 | |
Xinhao_t6_1 | 7 | xinhao2021_t6 | 0.457 | 0.131 | 0.294 | 0.412 | 0.122 | 0.268 | |
Xinhao_t6_2 | 9 | xinhao2021_t6 | 0.439 | 0.136 | 0.287 | 0.412 | 0.121 | 0.266 | |
Xinhao_t6_3 | 8 | xinhao2021_t6 | 0.447 | 0.136 | 0.291 | 0.431 | 0.122 | 0.277 | |
Xinhao_t6_4 | 10 | xinhao2021_t6 | 0.436 | 0.129 | 0.283 | 0.428 | 0.126 | 0.277 | |
Narisetty_t6_1 | 26 | narisetty2021_t6 | 0.362 | 0.108 | 0.235 | 0.381 | 0.110 | 0.246 | |
Narisetty_t6_2 | 27 | narisetty2021_t6 | 0.358 | 0.109 | 0.234 | 0.410 | 0.114 | 0.262 | |
Narisetty_t6_3 | 25 | narisetty2021_t6 | 0.360 | 0.110 | 0.235 | 0.416 | 0.116 | 0.266 | |
Narisetty_t6_4 | 23 | narisetty2021_t6 | 0.362 | 0.110 | 0.236 | 0.417 | 0.115 | 0.266 | |
Labbe_t6_1 | 34 | labbe2021_t6 | 0.146 | 0.072 | 0.109 | 0.136 | 0.067 | 0.101 | |
Labbe_t6_2 | 33 | labbe2021_t6 | 0.178 | 0.078 | 0.128 | 0.172 | 0.071 | 0.122 | |
Labbe_t6_3 | 30 | labbe2021_t6 | 0.296 | 0.115 | 0.205 | 0.287 | 0.104 | 0.195 | |
Labbe_t6_4 | 29 | labbe2021_t6 | 0.333 | 0.108 | 0.221 | 0.351 | 0.110 | 0.231 | |
Won_t6_1 | 21 | won2021_t6 | 0.375 | 0.111 | 0.243 | 0.383 | 0.109 | 0.246 | |
Won_t6_2 | 24 | won2021_t6 | 0.357 | 0.114 | 0.236 | 0.401 | 0.121 | 0.261 | |
Won_t6_3 | 22 | won2021_t6 | 0.366 | 0.117 | 0.242 | 0.400 | 0.119 | 0.259 | |
Won_t6_4 | 19 | won2021_t6 | 0.381 | 0.118 | 0.249 | 0.441 | 0.128 | 0.285 | |
Xu_t6_1 | 15 | xu2021_t6 | 0.403 | 0.127 | 0.265 | 0.421 | 0.122 | 0.271 | |
Xu_t6_2 | 16 | xu2021_t6 | 0.404 | 0.125 | 0.265 | 0.418 | 0.122 | 0.270 | |
Xu_t6_3 | 5 | xu2021_t6 | 0.463 | 0.129 | 0.296 | 0.472 | 0.124 | 0.298 | |
Xu_t6_4 | 6 | xu2021_t6 | 0.461 | 0.128 | 0.295 | 0.471 | 0.123 | 0.297 | |
Baseline_t6_1 | 38 | Baseline2021_t6 | 0.020 | 0.004 | 0.012 | 0.075 | 0.028 | 0.051 |
System characteristics
In this section you can find the characteristics of the submitted systems. There are two tables for easy reference, in the corresponding subsections: the first table gives an overview of the systems, and the second a detailed presentation of each system.
Overview of characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Number of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---|
4 | Yuan_t6_1 | 0.296 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
1 | Yuan_t6_2 | 0.310 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
2 | Yuan_t6_3 | 0.302 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | Transformer | noise enhance |
3 | Yuan_t6_4 | 0.298 | yuan2021_t6 | encoder-decoder | 2572592190 | PANNs | Transformer | noise enhance |
37 | Xiao_t6_1 | 0.043 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding | 2448349 | MLP-mixer encoder | Transformer decoder | |
36 | Xiao_t6_2 | 0.046 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder | 2448349 | MLP-mixer encoder | Transformer decoder | |
18 | Chen_t6_1 | 0.253 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
28 | Chen_t6_2 | 0.231 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
20 | Chen_t6_3 | 0.245 | chen2021_t6 | encoder-decoder | 22775528 | CNN14, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
17 | Chen_t6_4 | 0.262 | chen2021_t6 | encoder-decoder | 86440064 | ResNet38, MemoryEncoder | MeshedDecoder | SpecAugment, Label Smoothing |
12 | Ye_t6_1 | 0.279 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
14 | Ye_t6_2 | 0.274 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
11 | Ye_t6_3 | 0.280 | ye2021_t6 | encoder-decoder | 779793399 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
13 | Ye_t6_4 | 0.277 | ye2021_t6 | encoder-decoder | 259931133 | ResNet38 | RNN | Mixup, SpecAugment, SpecAugment++ |
31 | Liu_t6_1 | 0.184 | liu2021_t6 | encoder-decoder | 3045913 | CNN | self-attention, Word2Vec | |
35 | Gebhard_t6_1 | 0.076 | gebhard2021_t6 | encoder-decoder | 13409747 | CNN | RNN | |
32 | Eren_t6_1 | 0.182 | eren2021_t6 | encoder-decoder | 2511570 | PANNs | RNN | |
7 | Xinhao_t6_1 | 0.294 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | Transformer | SpecAugment |
9 | Xinhao_t6_2 | 0.287 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | Transformer | SpecAugment |
8 | Xinhao_t6_3 | 0.291 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | Transformer | SpecAugment |
10 | Xinhao_t6_4 | 0.283 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | Transformer | SpecAugment |
26 | Narisetty_t6_1 | 0.235 | narisetty2021_t6 | Conformer | 143676488 | Conformer | Transformer | SpecAugment |
27 | Narisetty_t6_2 | 0.234 | narisetty2021_t6 | Conformer | 143676488 | Conformer | Transformer | SpecAugment |
25 | Narisetty_t6_3 | 0.235 | narisetty2021_t6 | Conformer | 167205128 | Conformer | Transformer | SpecAugment |
23 | Narisetty_t6_4 | 0.236 | narisetty2021_t6 | Conformer | 167205128 | Conformer | Transformer | SpecAugment |
34 | Labbe_t6_1 | 0.109 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | RNN-LSTM, attention | |
33 | Labbe_t6_2 | 0.128 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | RNN-LSTM, attention | |
30 | Labbe_t6_3 | 0.205 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | RNN-LSTM, attention | |
29 | Labbe_t6_4 | 0.221 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | RNN-LSTM, attention | |
21 | Won_t6_1 | 0.243 | won2021_t6 | encoder-decoder | 8445139 | ResNet | Transformer | SpecAugment |
24 | Won_t6_2 | 0.236 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
22 | Won_t6_3 | 0.242 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
19 | Won_t6_4 | 0.249 | won2021_t6 | encoder-decoder | 8445139 | CNN | Transformer | SpecAugment |
15 | Xu_t6_1 | 0.265 | xu2021_t6 | seq2seq | 36181131 | CNN | RNN | SpecAugment |
16 | Xu_t6_2 | 0.265 | xu2021_t6 | seq2seq | 60301885 | CNN | RNN | SpecAugment |
5 | Xu_t6_3 | 0.296 | xu2021_t6 | seq2seq | 48241508 | CNN | RNN | SpecAugment |
6 | Xu_t6_4 | 0.295 | xu2021_t6 | seq2seq | 60301885 | CNN | RNN | SpecAugment |
38 | Baseline_t6_1 | 0.012 | Baseline2021_t6 | encoder-decoder | 5012931 | RNN | RNN |
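Almost every row above follows the same encoder-decoder scheme: an audio encoder turns a spectrogram into a sequence of embeddings, and an autoregressive text decoder emits the caption word by word. The sketch below is a deliberately tiny, generic PyTorch illustration of that scheme (CNN encoder, Transformer decoder); all layer sizes and names are assumptions and do not correspond to any particular submission.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        # Encoder: a small CNN that keeps the time axis as a sequence.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool the mel axis only
            nn.Conv2d(32, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),           # -> (B, d_model, 1, T)
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (B, 1, n_mels, T) log-mel clip; tokens: (B, L) caption so far
        memory = self.encoder(mels).squeeze(2).transpose(1, 2)  # (B, T, d_model)
        L = tokens.size(1)                    # causal mask for autoregression
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.out(hidden)               # (B, L, vocab) next-word logits

model = TinyCaptioner()
logits = model(torch.randn(2, 1, 64, 100), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```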
Detailed characteristics
Rank | Submission code | SPIDEr | Technical Report | Method scheme/architecture | Number of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Loss function | Optimizer | Learning rate | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | Yuan_t6_1 | 0.296 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
1 | Yuan_t6_2 | 0.310 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
2 | Yuan_t6_3 | 0.302 | yuan2021_t6 | encoder-decoder | 986302137 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
3 | Yuan_t6_4 | 0.298 | yuan2021_t6 | encoder-decoder | 2572592190 | PANNs | log-mel energies | Transformer | one-hot | noise enhance | 36kHz | supervised | crossentropy | adam | 1e-4 | Validation SPIDEr score | Clotho, AudioCaps, Freesound | Clotho, AudioCaps, Freesound | Clotho |
37 | Xiao_t6_1 | 0.043 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding | 2448349 | MLP-mixer encoder | log-mel energies | Transformer decoder | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Validation loss | Clotho | Clotho | |
36 | Xiao_t6_2 | 0.046 | xiao2021_t6 | encoder-decoder, Transformer, MLP-mixer, Residual, audio embedding, pre-train encoder | 2448349 | MLP-mixer encoder | log-mel energies | Transformer decoder | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Validation loss | Clotho | Clotho | |
18 | Chen_t6_1 | 0.253 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | Word2Vec | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
28 | Chen_t6_2 | 0.231 | chen2021_t6 | encoder-decoder | 93410432 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | FastText | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
20 | Chen_t6_3 | 0.245 | chen2021_t6 | encoder-decoder | 22775528 | CNN14, MemoryEncoder | log-mel energies | MeshedDecoder | learned embeddings | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
17 | Chen_t6_4 | 0.262 | chen2021_t6 | encoder-decoder | 86440064 | ResNet38, MemoryEncoder | log-mel energies | MeshedDecoder | Word2Vec | SpecAugment, Label Smoothing | 44.1kHz | supervised | crossentropy | adam | 3e-5 | Validation SPIDEr score | Clotho | Clotho | ||||
12 | Ye_t6_1 | 0.279 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
14 | Ye_t6_2 | 0.274 | ye2021_t6 | encoder-decoder | 86643711 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
11 | Ye_t6_3 | 0.280 | ye2021_t6 | encoder-decoder | 779793399 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
13 | Ye_t6_4 | 0.277 | ye2021_t6 | encoder-decoder | 259931133 | ResNet38 | log-mel energies | RNN | learned embeddings | Mixup, SpecAugment, SpecAugment++ | 44.1kHz | supervised | crossentropy | adam | 2e-5 | Validation loss | Clotho | Clotho | ||||
31 | Liu_t6_1 | 0.184 | liu2021_t6 | encoder-decoder | 3045913 | CNN | log-mel energies | self-attention, Word2Vec | Word2Vec | | 44.1kHz | supervised | crossentropy, sentence-loss | adam | 5e-4 | Training loss | Clotho | Clotho | |
35 | Gebhard_t6_1 | 0.076 | gebhard2021_t6 | encoder-decoder | 13409747 | CNN | log-mel energies | RNN | numeric-representation | | 44.1kHz | supervised | crossentropy | adam | 1e-4 | Training loss | Clotho | Clotho | |
32 | Eren_t6_1 | 0.182 | eren2021_t6 | encoder-decoder | 2511570 | PANNs | log-mel energies, PANNs | RNN | Word2Vec | | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation loss | Clotho | Clotho | |
7 | Xinhao_t6_1 | 0.294 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | PANNs | Transformer | random | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho | Clotho | ||||
9 | Xinhao_t6_2 | 0.287 | xinhao2021_t6 | encoder-decoder | 7455570 | CNN | PANNs | Transformer | random | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho | Clotho | ||||
8 | Xinhao_t6_3 | 0.291 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho, AudioCaps | Clotho, AudioCaps | ||||
10 | Xinhao_t6_4 | 0.283 | xinhao2021_t6 | encoder-decoder | 8038703 | CNN | PANNs | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation SPIDEr score | Clotho, AudioCaps | Clotho, AudioCaps | ||||
26 | Narisetty_t6_1 | 0.235 | narisetty2021_t6 | Conformer | 143676488 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
27 | Narisetty_t6_2 | 0.234 | narisetty2021_t6 | Conformer | 143676488 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
25 | Narisetty_t6_3 | 0.235 | narisetty2021_t6 | Conformer | 167205128 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | Narisetty_t6_4 | 0.236 | narisetty2021_t6 | Conformer | 167205128 | Conformer | log-mel energies, PANNs, tags, embeddings | Transformer | learned embeddings | SpecAugment | 16kHz | supervised | crossentropy | noam | 5e-1 | Validation loss | Clotho, AudioCaps | Clotho, AudioCaps | ||||
34 | Labbe_t6_1 | 0.109 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
33 | Labbe_t6_2 | 0.128 | labbe2021_t6 | encoder-decoder | 2887632 | pyramidal bidirectional RNN-LSTM | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
30 | Labbe_t6_3 | 0.205 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
29 | Labbe_t6_4 | 0.221 | labbe2021_t6 | encoder-decoder | 84521484 | CNN14 | log-mel energies | RNN-LSTM, attention | learned embeddings | | 32kHz | supervised | crossentropy | adam | 5e-4 | Validation loss | Clotho | Clotho | |
21 | Won_t6_1 | 0.243 | won2021_t6 | encoder-decoder | 8445139 | ResNet | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
24 | Won_t6_2 | 0.236 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
22 | Won_t6_3 | 0.242 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
19 | Won_t6_4 | 0.249 | won2021_t6 | encoder-decoder | 8445139 | CNN | log-mel energies | Transformer | Word2Vec | SpecAugment | 44.1kHz | supervised | crossentropy | adam, SWA | 3e-4, 1e-4 | Training SPIDEr score | Clotho | Clotho | ||||
15 | Xu_t6_1 | 0.265 | xu2021_t6 | seq2seq | 36181131 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
16 | Xu_t6_2 | 0.265 | xu2021_t6 | seq2seq | 60301885 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
5 | Xu_t6_3 | 0.296 | xu2021_t6 | seq2seq | 48241508 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
6 | Xu_t6_4 | 0.295 | xu2021_t6 | seq2seq | 60301885 | CNN | log-mel energies | RNN | learned embeddings | SpecAugment | 44.1kHz | supervised, reinforcement learning | crossentropy | adam | 5e-4 | Validation SPIDEr score | Clotho, AudioSet | Clotho | ||||
38 | Baseline_t6_1 | 0.012 | Baseline2021_t6 | encoder-decoder | 5012931 | RNN | log-mel energies | RNN | learned embeddings | | 44.1kHz | supervised | crossentropy | adam | 1e-3 | Validation loss | Clotho | Clotho | |
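Nearly all systems above list “log-mel energies” as acoustic features, most at a 44.1 kHz sampling rate. Below is a hedged sketch of such a feature pipeline with librosa; the FFT, hop, and mel-band settings are assumptions, as each team chose its own, and the random signal stands in for a Clotho clip.

```python
import numpy as np
import librosa

sr = 44100
y = np.random.randn(sr * 2).astype(np.float32)  # stand-in for a Clotho clip

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log-mel energies in dB
print(log_mel.shape)  # (64, n_frames)
```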
Technical reports
Audio Captioning With Meshed-Memory Transformer
Zhiwen Chen, Dawei Zhang, Jun Wang, and Feng Deng
University of Chinese Academy of Sciences, Beijing, China
Chen_t6_1 Chen_t6_2 Chen_t6_3 Chen_t6_4
Abstract
Automated audio captioning is the task of describing the audio content of a given audio signal in natural language. In recent years, Transformer-based language models have become widely used in audio captioning. However, most Transformer-based architectures cannot learn prior knowledge across samples well, leading to weaker text decoding. For better acoustic event and language modeling, a sequence-to-sequence model is proposed which consists of a CNN-based encoder, a memory-augmented refiner, and a meshed decoder. The proposed architecture refines a multi-level representation of the relationships between audio features, integrating learned a priori knowledge. At the decoding stage, it exploits low- and high-level features with mesh-like connectivity. Experiments show that the proposed model achieves a SPIDEr score of 0.2645 on the Clotho V2 dataset.
System characteristics
Data augmentation | SpecAugment, Label Smoothing |
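The label smoothing listed above can be reproduced directly with PyTorch's cross-entropy loss, as in the minimal sketch below; the smoothing value 0.1 is an assumption, not taken from the report.

```python
import torch
import torch.nn as nn

# Label smoothing spreads a little probability mass over all words,
# discouraging over-confident caption predictions.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(8, 1000)           # (batch, vocab) decoder outputs
targets = torch.randint(0, 1000, (8,))  # ground-truth word indices
print(criterion(logits, targets).item())
```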
Audio Captioning Using Sound Event Detection
Ayşegül Özkaya Eren and Mustafa Sert
Department of Computer Engineering, Baskent University, Ankara, Turkey
Eren_t6_1
Abstract
This technical report proposes an audio captioning system for the DCASE 2021 Task 6 audio captioning challenge. Our proposed model is based on an encoder-decoder architecture with bi-directional Gated Recurrent Units (BiGRU), using pretrained audio features and sound event detection. A pretrained audio neural network (PANN) is used to extract audio features, and Word2Vec is selected to extract word embeddings from the audio captions. To create semantically meaningful captions, we extract sound events from the audio clips and feed the encoder-decoder architecture with these sound events in addition to the PANNs features. Our experiments on the Clotho dataset show that our proposed method achieves significantly better results than the challenge baseline model across all evaluation metrics.
System characteristics
Data augmentation | None |
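A small sketch of how Word2Vec embeddings can be trained on caption text with gensim, in the spirit of the report; the corpus and parameters here are illustrative stand-ins for the actual Clotho captions and settings.

```python
from gensim.models import Word2Vec

# Toy caption corpus; the report trains on the Clotho captions.
captions = [
    "a dog barks loudly in the distance".split(),
    "rain falls on a metal roof".split(),
    "a dog is barking far away".split(),
]
model = Word2Vec(sentences=captions, vector_size=128, window=5,
                 min_count=1, workers=1)
print(model.wv["dog"].shape)  # (128,) embedding for the word "dog"
```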
An Automated Audio Captioning Approach Utilising a ResNet-based Encoder
Alexander Gebhard1, Andreas Triantafyllopoulos1,2, Alice Baird1, and Björn Schuller1,2,3
1 EIHW, Chair of Embedded Intelligence for Healthcare and Wellbeing, University of Augsburg, Augsburg, Germany, 2 audEERING GmbH, Gilching, Germany, 3 GLAM, Group on Language, Audio, and Music, Imperial College, London, UK
Gebhard_t6_1
Abstract
In this report, we present our submission to Task 6 of the DCASE 2021 Challenge. The main module is based on the baseline architecture for the automated audio captioning (AAC) task provided by the challenge organisers. We replace the encoder of the baseline architecture with a Residual Neural Network (ResNet)-18 encoder adapted to the AAC task. Results from our proposed architecture show an average increase of 35.7% over the baseline system, reaching a BLEU1 score of 0.449 on the development set and demonstrating the effectiveness of the proposed encoder for this task.
System characteristics
Data augmentation | None |
IRIT-UPS DCASE 2021 Audio Captioning System
Etienne Labbé and Thomas Pellegrini
IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France
Labbe_t6_1 Labbe_t6_2 Labbe_t6_3 Labbe_t6_4
Abstract
This document describes our sequence-to-sequence models used for audio captioning in Task 6 of the DCASE 2021 challenge. Four submissions were made with two different models, a “Listen-Attend-Tell” and a “CNN-Tell”, and two different inference algorithms, greedy and beam search.
System characteristics
Data augmentation | None |
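Of the two inference algorithms mentioned, greedy search is the simpler: at each step only the single most probable next word is kept, whereas beam search keeps the k most probable partial captions. A hedged sketch of greedy decoding follows; it assumes a captioning model that maps (audio features, tokens so far) to per-step logits, and the BOS/EOS token ids are illustrative.

```python
import torch

def greedy_decode(model, mels, bos_id=1, eos_id=2, max_len=20):
    """Greedy search: keep the single most probable word at every step."""
    tokens = torch.full((mels.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(mels, tokens)                    # (B, L, vocab)
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_word], dim=1)
        if (next_word == eos_id).all():                 # every caption ended
            break
    return tokens
```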
The DCASE2021 Challenge Task 6 System : Automated Audio Caption
Liu Yang and Bi Sijun
Beijing Institute of Technology
Liu_t6_1
Abstract
This technical report describes our system participating in the DCASE 2021 Challenge, Task 6: automated audio captioning. We employ several learnable stacked CNNs to extract audio features in the encoder; for caption generation we employ the decoder of the widely used Transformer structure. To optimize the system, we use a sentence-level cosine loss function together with the cross-entropy loss. The experimental results show that our system achieves a SPIDEr of 0.166 on the evaluation split of the Clotho dataset.
System characteristics
Data augmentation | None |
Leveraging State-of-the-art ASR Techniques to Audio Captioning
Chaitanya Narisetty1, Tomoki Hayashi2, Ryunosuke Ishizaki2, Shinji Watanabe1, and Kazuya Takeda2
1 Carnegie Mellon University, Pittsburgh, USA, 2 Nagoya University, Nagoya, Japan
Narisetty_t6_1 Narisetty_t6_2 Narisetty_t6_3 Narisetty_t6_4
Abstract
This report presents a summary of our submission to the 2021 DCASE challenge Task 6: Automated Audio Captioning. Our approach to this task is derived from state-of-the-art ASR techniques available in the ESPnet toolkit. Specifically, we train a convolution-augmented Transformer (Conformer) model to generate captions from input acoustic features in an end-to-end manner. In addition to the prescribed challenge dataset, Clotho-v2, we also incorporate the external AudioCaps dataset. To overcome the limited availability of training data, we further incorporate AudioSet tags and audio embeddings obtained from pretrained audio neural networks (PANNs) as auxiliary inputs to our model. An ensemble of models trained over various architectures and input embeddings is selected as our final submission system. Experimental results indicate that our models achieve SPIDEr scores of 0.224 and 0.246 on the development-validation and development-evaluation sets, respectively.
System characteristics
Data augmentation | SpecAugment |
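Below is a minimal numpy sketch of the SpecAugment-style masking listed above: zero out a random band of mel bins and a random span of time frames. The mask widths are assumptions; each system tunes its own.

```python
import numpy as np

def spec_augment(log_mel, max_freq_width=8, max_time_width=20, rng=None):
    """Apply one frequency mask and one time mask to a (mels, frames) array."""
    rng = rng or np.random.default_rng()
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    f = rng.integers(0, max_freq_width + 1)     # frequency mask width
    f0 = rng.integers(0, max(1, n_mels - f))
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)     # time mask width
    t0 = rng.integers(0, max(1, n_frames - t))
    out[:, t0:t0 + t] = 0.0
    return out

augmented = spec_augment(np.random.randn(64, 200))
```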
CAU Submission to DCASE 2021 Task6: Transformer Followed by Transfer Learning for Audio Captioning
Hyejin Won, Baekseung Kim, Il-Youp Kwak, and Changwon Lim
Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
Won_t6_1 Won_t6_2 Won_t6_3 Won_t6_4
Abstract
This report proposes an automated audio captioning model for the 2021 DCASE audio captioning challenge. In this challenge, a model is required to generate natural language descriptions of a given audio signal. We use pre-trained models trained on AudioSet, a large-scale dataset of manually annotated audio events. The large amount of audio event data helps capture important audio feature representations. To make use of the features learned from AudioSet, we explored several transfer learning approaches. Our proposed sequence-to-sequence model consists of a CNN14 or ResNet54 encoder and a Transformer decoder. Experiments show that the proposed model achieves SPIDEr scores of 0.246 and 0.285 on audio captioning.
System characteristics
Data augmentation | SpecAugment |
Automated Audio Captioning With MLP-mixer and Pre-trained Encoder
Feiyang Xiao1, Jian Guan1, and Qiuqiang Kong2
1 Group of Intelligent Signal Processing, College of Computer Science and Technology Harbin Engineering University, Harbin, China, 2 ByteDance, Shanghai, China
Xiao_t6_1 Xiao_t6_2
Abstract
This technical report describes the submission from the Group of Intelligent Signal Processing (GISP) for Task 6 of the DCASE 2021 challenge (automated audio captioning). Our audio captioning system is based on a sequence-to-sequence autoencoder model. Previous recurrent neural network (RNN) and Transformer-based methods perceive only time-dimension information and ignore frequency information. To utilize both time- and frequency-dimension information, a multi-layer perceptron mixer (MLP-Mixer) is used as the encoder. For caption prediction, a Transformer decoder structure is used as the decoder. No extra data is employed. In addition, to highlight content information, we use an encoder pre-trained with multi-label content information. The experimental results show that our system achieves a SPIDEr of 0.144 (official baseline: 0.051) on the evaluation split of the Clotho dataset. In addition, compared with Transformer methods, our system requires less training time.
System characteristics
Data augmentation | None |
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning for DCASE Challenge 2021 Task 6
Xinhao Mei1, Qiushi Huang1, Xubo Liu1, Gengyun Chen2, Jingqian Wu3∗, Yusong Wu3†, Jinzheng Zhao1, Shengchen Li3, Tom Ko4, H Lilian Tang1, Xi Shao2, Mark D. Plumbley1, Wenwu Wang1
1 University of Surrey, Guildford, United Kingdom, 2 Nanjing University of Posts and Telecommunications, Nanjing, China, 3 Xi’an Jiaotong-Liverpool University, Suzhou, China, 4 Southern University of Science and Technology, Shenzhen, China, ∗ Jingqian Wu is currently with Wake Forest University, USA, †Yusong Wu is currently with University of Montreal, Canada
Xinhao_t6_1 Xinhao_t6_2 Xinhao_t6_3 Xinhao_t6_4
Abstract
Audio captioning aims to use natural language to describe the content of audio data. This technical report presents an automated audio captioning system submitted to Task 6 of the DCASE 2021 challenge. The proposed system is based on an encoder-decoder architecture, consisting of a convolutional neural network (CNN) encoder and a Transformer decoder. We further improve the system with two techniques, namely, pre-training the model via transfer learning techniques, either on upstream audio-related tasks or large in-domain datasets, and incorporating evaluation metrics into the optimization of the model with reinforcement learning techniques, which help address the problem caused by the mismatch between the evaluation metrics and the loss function. The results show that both techniques can further improve the performance of the captioning system. The overall system achieves a SPIDEr score of 0.277 on the Clotho evaluation set, which outperforms the top-ranked system from the DCASE 2020 challenge.
System characteristics
Data augmentation | SpecAugment |
The SJTU System for DCASE2021 Challenge Task 6: Audio Captioning Based on Encoder Pre-training and Reinforcement Learning
Xuenan Xu, Zeyu Xie, Mengyue Wu, and Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
Xu_t6_1 Xu_t6_2 Xu_t6_3 Xu_t6_4
Abstract
This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a temporal-attention, single-layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy based training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensemble.
System characteristics
Data augmentation | SpecAugment |
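The reinforcement-learning fine-tuning described in the abstract is commonly realised as self-critical sequence training; whether this exact variant was used is an assumption here. The idea: the reward of a sampled caption is its metric score minus the score of a greedy-decoded baseline, and the policy-gradient loss weights the caption's log-probability by that advantage. The sketch below uses placeholder rewards and is not the authors' implementation.

```python
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    # log_probs: (B,) summed log-probabilities of the sampled captions.
    # The greedy baseline reduces variance; detach so gradients flow
    # only through the log-probabilities.
    advantage = sampled_reward - greedy_reward
    return -(advantage.detach() * log_probs).mean()

log_probs = torch.randn(4, requires_grad=True)       # placeholder values
loss = scst_loss(log_probs,
                 torch.tensor([0.30, 0.25, 0.35, 0.28]),  # sampled SPIDEr
                 torch.tensor([0.27, 0.27, 0.27, 0.27]))  # greedy SPIDEr
loss.backward()
```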
Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Textual Information
Zhongjie Ye1, Helin Wang1, Dongchao Yang1, and Yuexian Zou1,2
1 ADSPLAB, School of ECE, Peking University, Shenzhen, China, 2 Peng Cheng Laboratory, Shenzhen, China
Ye_t6_1 Ye_t6_2 Ye_t6_3 Ye_t6_4
Abstract
This technical report describes an automated audio captioning (AAC) model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Task 6 challenge. In order to utilize more acoustic and textual information, we propose a novel sequence-to-sequence model named KPE-MAD, with a keyword pre-trained encoder and a multi-modal attention decoder. For the encoder, we use a classification model pre-trained on the AudioSet dataset and fine-tune it with keywords of nouns and verbs as labels. In addition, a multi-modal attention module is proposed to integrate the acoustic and textual information in the decoder. Our single model achieves a SPIDEr score of 0.279 on the evaluation split, and our best ensemble model, optimizing CIDEr-D via reinforcement learning, achieves a SPIDEr score of 0.291. Our code and models will be released after the competition.
System characteristics
Data augmentation | Mixup, SpecAugment, SpecAugment++ |
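A minimal sketch of the mixup augmentation listed above: two training spectrograms are blended with a Beta-distributed weight (in practice, their training targets are blended with the same weight). The alpha value 0.2 is a common default and an assumption here, not the report's setting.

```python
import numpy as np

def mixup(x1, x2, alpha=0.2, rng=None):
    """Blend two examples; return the mixed input and its mixing weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

mixed, lam = mixup(np.random.randn(64, 200), np.random.randn(64, 200))
```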
The DCASE 2021 Challenge Task 6 System: Automated Audio Captioning With Weakly Supervised Pre-training and Word Selection Methods
Weiqiang Yuan, Qichen Han, Dong Liu, Xiang Li, and Zhen Yang
NetEase (Hangzhou) Network Co., Ltd., China
Yuan_t6_1 Yuan_t6_2 Yuan_t6_3 Yuan_t6_4
Abstract
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modelling framework for audio understanding and caption generation. Our solution focuses on solving two problems in automated audio captioning: data insufficiency and word selection indeterminacy. As the amount of training data is limited, we collect a large-scale weakly labelled dataset from the Web with heuristic methods. We then pre-train the encoder-decoder models with this dataset, followed by fine-tuning on the Clotho dataset. To solve the word selection indeterminacy problem, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained audio tagging models, to guide word generation at the decoding stage. We tested our submissions using the development-testing dataset. Our best submission achieved a 31.8 SPIDEr score.
System characteristics
Data augmentation | Noise enhance |