Task description
Automated audio captioning is the task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text): a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal. Given the novelty of audio captioning, the current focus is on exploring and developing methods that can provide captions for a general audio recording. To this end, the Clotho dataset is used, which provides good-quality captions without speech transcription, named entities, or hapax legomena (i.e., words that appear only once in a split).
Participants used the freely available Clotho development and evaluation splits, as well as any external data they deemed fit. The developed systems are evaluated on the captions they generate for the Clotho testing split, for which the corresponding captions are not publicly provided. More information about Task 6a: Automated Audio Captioning can be found on the task description page.
The ranking of the submitted systems is based on the SPIDEr metric penalized by fluency error detection (SPIDEr-FL). This page, however, provides a more thorough presentation, grouping the metrics into those originating from machine translation and those originating from captioning.
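To make the ranking metric concrete, the sketch below shows how SPIDEr and its fluency-penalized variant combine per-caption scores. It is a minimal illustration, not the official evaluation code: the per-caption CIDEr/SPICE values and the fluency-error flags are assumed to come from the standard captioning toolkit and the FENSE error detector, and the 0.9 penalty coefficient follows the FENSE formulation.

```python
# Minimal sketch of SPIDEr and SPIDEr-FL scoring (not the official evaluation code).
# Assumes per-caption CIDEr/SPICE scores and a binary fluency-error flag are already
# available, e.g. from the coco-caption toolkit and the FENSE error detector.
from statistics import mean

def spider(cider: float, spice: float) -> float:
    """SPIDEr is the average of CIDEr and SPICE for one candidate caption."""
    return 0.5 * (cider + spice)

def spider_fl(cider: float, spice: float, has_fluency_error: bool,
              penalty: float = 0.9) -> float:
    """SPIDEr-FL: if the fluency error detector flags the caption, its SPIDEr
    score is down-weighted by the penalty coefficient (0.9 here, following FENSE)."""
    score = spider(cider, spice)
    return score * (1.0 - penalty) if has_fluency_error else score

# Corpus-level scores are the mean over all candidate captions.
per_caption = [(0.62, 0.18, False), (0.41, 0.12, True)]  # (CIDEr, SPICE, flagged)
print(mean(spider_fl(c, s, f) for c, s, f in per_caption))
```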
Teams ranking
Listed here are the best systems from all teams, ranked by SPIDEr-FL. To allow a more detailed exploration of the performance of the different systems, the same table lists the values achieved for all metrics employed in the task, on both the Clotho testing split and the Clotho evaluation split. The values for the Clotho evaluation split are provided to allow comparison with systems and methods developed outside this task, since the captions for the Clotho evaluation split are freely available.
Submission code | Rank | Corresponding author | Technical Report | METEOR (testing) | CIDEr (testing) | SPICE (testing) | SPIDEr (testing) | SPIDEr-FL (testing) | METEOR (evaluation) | CIDEr (evaluation) | SPICE (evaluation) | SPIDEr (evaluation) | SPIDEr-FL (evaluation) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Wu_t6a_4 | 1 | Shih-Lun Wu | wu2023_t6a | 0.195 | 0.505 | 0.149 | 0.327 | 0.327 | 0.197 | 0.505 | 0.145 | 0.325 | 0.325 | |
Chang_t6a_4 | 2 | Joon-Hyuk Chang | chang2023_t6a | 0.197 | 0.539 | 0.149 | 0.344 | 0.315 | 0.197 | 0.541 | 0.146 | 0.343 | 0.313 | |
Labbe_t6a_4 | 3 | Etienne Labbe | labbe2023_t6a | 0.193 | 0.486 | 0.142 | 0.314 | 0.314 | 0.193 | 0.500 | 0.140 | 0.320 | 0.320 | |
Yan_t6a_4 | 4 | Zhiyong Yan | yan2023_t6a | 0.191 | 0.461 | 0.139 | 0.300 | 0.289 | 0.192 | 0.474 | 0.136 | 0.305 | 0.294 | |
Schaumloeffel_t6a_1 | 5 | Timothy Schaumloeffel | schaumloeffel2023_t6a | 0.181 | 0.436 | 0.130 | 0.283 | 0.282 | 0.183 | 0.454 | 0.132 | 0.293 | 0.292 | |
Guan_t6a_3 | 6 | Jian Guan | guan2023_t6a | 0.180 | 0.427 | 0.129 | 0.278 | 0.273 | 0.184 | 0.450 | 0.129 | 0.290 | 0.283 | |
Kadlčík_t6a_1 | 7 | Marek Kadlčík | kadlčík2023_t6a | 0.172 | 0.414 | 0.123 | 0.269 | 0.267 | 0.378 | 0.433 | 0.126 | 0.279 | ||
Lee_t6a_1 | 8 | Kyogu Lee | lee2023_t6a | 0.176 | 0.416 | 0.123 | 0.269 | 0.266 | 0.177 | 0.431 | 0.126 | 0.279 | 0.275 | |
Baseline | 9 | Felix Gontier | gontier2023_t6a | 0.177 | 0.415 | 0.126 | 0.271 | 0.264 | 0.177 | 0.420 | 0.119 | 0.270 | 0.261 | |
Greeshma_t6a_1 | 10 | Karanth Greeshma | greeshma2023_t6a | 0.178 | 0.406 | 0.125 | 0.265 | 0.261 | 0.178 | 0.419 | 0.121 | 0.270 | 0.264 | |
Lim_t6a_1 | 11 | Changwon Lim | lim2023_t6a | 0.089 | 0.035 | 0.039 | 0.037 | 0.010 | 0.089 | 0.034 | 0.038 | 0.036 | 0.011 |
Systems ranking
Listed here are all submitted systems and their rankings according to the different metrics and metric groupings. The first table shows all systems with the challenge metrics, and the second shows all systems with the additional metrics (Sentence-BERT and FENSE).
Detailed information for each system is provided in the next section.
Systems ranking, challenge metrics
Submission code | Best official system rank | Technical Report | METEOR (testing) | CIDEr (testing) | SPICE (testing) | SPIDEr (testing) | SPIDEr-FL (testing) | METEOR (evaluation) | CIDEr (evaluation) | SPICE (evaluation) | SPIDEr (evaluation) | SPIDEr-FL (evaluation) |
---|---|---|---|---|---|---|---|---|---|---|---|---
Wu_t6a_4 | 1 | wu2023_t6a | 0.195 | 0.505 | 0.149 | 0.327 | 0.327 | 0.197 | 0.505 | 0.145 | 0.325 | 0.325 | |
Wu_t6a_3 | 2 | wu2023_t6a | 0.196 | 0.504 | 0.149 | 0.326 | 0.326 | 0.197 | 0.525 | 0.147 | 0.336 | 0.336 | |
Wu_t6a_2 | 3 | wu2023_t6a | 0.196 | 0.499 | 0.149 | 0.324 | 0.324 | 0.198 | 0.510 | 0.147 | 0.329 | 0.329 | |
Chang_t6a_4 | 4 | chang2023_t6a | 0.197 | 0.539 | 0.149 | 0.344 | 0.315 | 0.197 | 0.541 | 0.146 | 0.343 | 0.313 | |
Labbe_t6a_4 | 5 | labbe2023_t6a | 0.193 | 0.486 | 0.142 | 0.314 | 0.314 | 0.193 | 0.500 | 0.140 | 0.320 | 0.320 | |
Wu_t6a_1 | 6 | wu2023_t6a | 0.190 | 0.477 | 0.145 | 0.311 | 0.311 | 0.193 | 0.506 | 0.146 | 0.326 | 0.326 | |
Labbe_t6a_3 | 7 | labbe2023_t6a | 0.192 | 0.479 | 0.141 | 0.310 | 0.309 | 0.192 | 0.485 | 0.139 | 0.312 | 0.310 | |
Chang_t6a_1 | 8 | chang2023_t6a | 0.188 | 0.486 | 0.138 | 0.312 | 0.308 | 0.188 | 0.483 | 0.137 | 0.309 | 0.307 | |
Labbe_t6a_2 | 9 | labbe2023_t6a | 0.189 | 0.470 | 0.139 | 0.304 | 0.304 | 0.190 | 0.474 | 0.136 | 0.305 | 0.303 | |
Yan_t6a_4 | 10 | yan2023_t6a | 0.191 | 0.461 | 0.139 | 0.300 | 0.289 | 0.192 | 0.474 | 0.136 | 0.305 | 0.294 | |
Yan_t6a_3 | 11 | yan2023_t6a | 0.190 | 0.457 | 0.139 | 0.298 | 0.288 | 0.190 | 0.468 | 0.135 | 0.302 | 0.292 | |
Yan_t6a_1 | 12 | yan2023_t6a | 0.187 | 0.445 | 0.137 | 0.291 | 0.282 | 0.191 | 0.471 | 0.136 | 0.304 | 0.295 | |
Schaumloeffel_t6a_1 | 13 | schaumloeffel2023_t6a | 0.181 | 0.436 | 0.130 | 0.283 | 0.282 | 0.183 | 0.454 | 0.132 | 0.293 | 0.292 | |
Schaumloeffel_t6a_2 | 14 | schaumloeffel2023_t6a | 0.178 | 0.425 | 0.124 | 0.274 | 0.274 | 0.179 | 0.443 | 0.126 | 0.285 | 0.284 | |
Guan_t6a_3 | 15 | guan2023_t6a | 0.180 | 0.427 | 0.129 | 0.278 | 0.273 | 0.184 | 0.450 | 0.129 | 0.290 | 0.283 | |
Guan_t6a_4 | 16 | guan2023_t6a | 0.181 | 0.429 | 0.130 | 0.279 | 0.272 | 0.184 | 0.443 | 0.128 | 0.285 | 0.279 | |
Yan_t6a_2 | 17 | yan2023_t6a | 0.185 | 0.424 | 0.132 | 0.278 | 0.270 | 0.189 | 0.460 | 0.136 | 0.298 | 0.286 | |
Guan_t6a_1 | 18 | guan2023_t6a | 0.180 | 0.421 | 0.131 | 0.276 | 0.270 | 0.182 | 0.438 | 0.126 | 0.282 | 0.275 | |
Kadlčík_t6a_1 | 19 | kadlčík2023_t6a | 0.172 | 0.414 | 0.123 | 0.269 | 0.267 | 0.378 | 0.433 | 0.126 | 0.279 | ||
Lee_t6a_1 | 20 | lee2023_t6a | 0.176 | 0.416 | 0.123 | 0.269 | 0.266 | 0.177 | 0.431 | 0.126 | 0.279 | 0.275 | |
Baseline | 21 | gontier2023_t6a | 0.177 | 0.415 | 0.126 | 0.271 | 0.264 | 0.177 | 0.420 | 0.119 | 0.270 | 0.261 | |
Guan_t6a_2 | 22 | guan2023_t6a | 0.178 | 0.415 | 0.127 | 0.271 | 0.263 | 0.181 | 0.426 | 0.124 | 0.275 | 0.267 | |
Kadlčík_t6a_2 | 23 | kadlčík2023_t6a | 0.177 | 0.406 | 0.129 | 0.267 | 0.261 | 0.378 | 0.414 | 0.123 | 0.269 | ||
Greeshma_t6a_1 | 24 | greeshma2023_t6a | 0.178 | 0.406 | 0.125 | 0.265 | 0.261 | 0.178 | 0.419 | 0.121 | 0.270 | 0.264 | |
Labbe_t6a_1 | 25 | labbe2023_t6a | 0.177 | 0.389 | 0.125 | 0.257 | 0.256 | 0.179 | 0.414 | 0.126 | 0.270 | 0.269 | |
Chang_t6a_3 | 26 | chang2023_t6a | 0.194 | 0.527 | 0.142 | 0.335 | 0.231 | 0.195 | 0.539 | 0.143 | 0.341 | 0.233 | |
Chang_t6a_2 | 27 | chang2023_t6a | 0.195 | 0.520 | 0.141 | 0.330 | 0.229 | 0.195 | 0.526 | 0.143 | 0.335 | 0.225 | |
Kadlčík_t6a_3 | 28 | kadlčík2023_t6a | 0.161 | 0.348 | 0.116 | 0.232 | 0.225 | 0.345 | 0.340 | 0.108 | 0.224 | ||
Lim_t6a_1 | 29 | lim2023_t6a | 0.089 | 0.035 | 0.039 | 0.037 | 0.010 | 0.089 | 0.034 | 0.038 | 0.036 | 0.011 |
Systems ranking, additional metrics
Submission code | Best official system rank | Technical Report | Sentence-BERT (testing) | FENSE (testing) |
---|---|---|---|---
Wu_t6a_4 | 1 | wu2023_t6a | 0.536 | 0.536 | |
Wu_t6a_3 | 2 | wu2023_t6a | 0.536 | 0.536 | |
Wu_t6a_2 | 3 | wu2023_t6a | 0.538 | 0.538 | |
Chang_t6a_4 | 4 | chang2023_t6a | 0.530 | 0.488 | |
Labbe_t6a_4 | 5 | labbe2023_t6a | 0.523 | 0.522 | |
Wu_t6a_1 | 6 | wu2023_t6a | 0.526 | 0.526 | |
Labbe_t6a_3 | 7 | labbe2023_t6a | 0.521 | 0.519 | |
Chang_t6a_1 | 8 | chang2023_t6a | 0.527 | 0.521 | |
Labbe_t6a_2 | 9 | labbe2023_t6a | 0.523 | 0.522 | |
Yan_t6a_4 | 10 | yan2023_t6a | 0.521 | 0.498 | |
Yan_t6a_3 | 11 | yan2023_t6a | 0.523 | 0.501 | |
Yan_t6a_1 | 12 | yan2023_t6a | 0.520 | 0.503 | |
Schaumloeffel_t6a_1 | 13 | schaumloeffel2023_t6a | 0.501 | 0.498 | |
Schaumloeffel_t6a_2 | 14 | schaumloeffel2023_t6a | 0.496 | 0.496 | |
Guan_t6a_3 | 15 | guan2023_t6a | 0.496 | 0.487 | |
Guan_t6a_4 | 16 | guan2023_t6a | 0.496 | 0.480 | |
Yan_t6a_2 | 17 | yan2023_t6a | 0.509 | 0.490 | |
Guan_t6a_1 | 18 | guan2023_t6a | 0.495 | 0.481 | |
Kadlčík_t6a_1 | 19 | kadlčík2023_t6a | 0.495 | 0.492 | |
Lee_t6a_1 | 20 | lee2023_t6a | 0.500 | 0.495 | |
Baseline | 21 | gontier2023_t6a | 0.482 | 0.472 | |
Guan_t6a_2 | 22 | guan2023_t6a | 0.494 | 0.475 | |
Kadlčík_t6a_2 | 23 | kadlčík2023_t6a | 0.492 | 0.481 | |
Greeshma_t6a_1 | 24 | greeshma2023_t6a | 0.486 | 0.477 | |
Labbe_t6a_1 | 25 | labbe2023_t6a | 0.481 | 0.480 | |
Chang_t6a_3 | 26 | chang2023_t6a | 0.522 | 0.363 | |
Chang_t6a_2 | 27 | chang2023_t6a | 0.522 | 0.362 | |
Kadlčík_t6a_3 | 28 | kadlčík2023_t6a | 0.459 | 0.445 | |
Lim_t6a_1 | 29 | lim2023_t6a | 0.121 | 0.033 |
System characteristics
This section presents the characteristics of the submitted systems. Two tables are provided for easy reference, in the corresponding subsections: the first gives an overview of the systems, and the second a detailed presentation of each system.
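All submissions listed below follow the encoder-decoder scheme, pairing a pretrained audio encoder with a transformer-style text decoder. As a purely illustrative point of reference for how the listed components fit together, here is a minimal PyTorch sketch; the layer sizes, the log-mel front end, and the toy encoder are placeholders rather than any team's actual configuration.

```python
# Illustrative encoder-decoder audio captioning skeleton (not any submitted system).
# An audio encoder turns a log-mel spectrogram into a sequence of embeddings that a
# transformer decoder cross-attends to while generating caption tokens.
import torch
import torch.nn as nn

class ToyAudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab_size=5000):
        super().__init__()
        # Stand-in audio encoder: real systems use PANNs, ConvNeXt, BEATs, CLAP, etc.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, token_ids):
        # mel: (batch, n_mels, time); token_ids: (batch, caption_length)
        memory = self.encoder(mel).transpose(1, 2)           # (batch, time', d_model)
        tgt = self.embed(token_ids)
        T = token_ids.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                             # (batch, length, vocab)

model = ToyAudioCaptioner()
logits = model(torch.randn(2, 64, 500), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```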
Overview of characteristics
Rank | Submission code | SPIDEr-FL | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Word modelling | Data augmentation |
---|---|---|---|---|---|---|---|---
1 | Wu_t6a_4 | 0.327 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | transformer | |
2 | Wu_t6a_3 | 0.326 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | transformer | |
3 | Wu_t6a_2 | 0.324 | wu2023_t6a | encoder-decoder | 887000000 | conformer | transformer | |
4 | Chang_t6a_4 | 0.315 | chang2023_t6a | encoder-decoder | 1313200128 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
5 | Labbe_t6a_4 | 0.314 | labbe2023_t6a | encoder-decoder | 98064347 | cnn | transformer | mixup, spec_augment, label_smoothing |
6 | Wu_t6a_1 | 0.311 | wu2023_t6a | encoder-decoder | 127000000 | conformer | transformer | |
7 | Labbe_t6a_3 | 0.309 | labbe2023_t6a | encoder-decoder | 42191083 | cnn | transformer | mixup, spec_augment, label_smoothing |
8 | Chang_t6a_1 | 0.308 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
9 | Labbe_t6a_2 | 0.304 | labbe2023_t6a | encoder-decoder | 40133440 | cnn | transformer | mixup, spec_augment, label_smoothing |
10 | Yan_t6a_4 | 0.289 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
11 | Yan_t6a_3 | 0.288 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
12 | Yan_t6a_1 | 0.282 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
13 | Schaumloeffel_t6a_1 | 0.282 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | GPT2 | SpecAugment |
14 | Schaumloeffel_t6a_2 | 0.274 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | GPT2 | SpecAugment |
15 | Guan_t6a_3 | 0.273 | guan2023_t6a | encoder-decoder | 35502652 | PANNs (CNN10) + GAT, PANNs (CNN10) | transformer | SpecAugmentation |
16 | Guan_t6a_4 | 0.272 | guan2023_t6a | encoder-decoder | 17768222 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
17 | Yan_t6a_2 | 0.270 | yan2023_t6a | encoder-decoder | 90086352 | transformer | transformer | |
18 | Guan_t6a_1 | 0.270 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
19 | Kadlčík_t6a_1 | 0.267 | kadlčík2023_t6a | encoder-decoder | 1550000000 | transformer | transformer | |
20 | Lee_t6a_1 | 0.266 | lee2023_t6a | encoder-decoder | 178755308 | cnn | transformer | |
21 | Baseline | 0.264 | gontier2023_t6a | encoder-decoder | 98500000 | PANNs | transformer | |
22 | Guan_t6a_2 | 0.263 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | transformer | SpecAugmentation |
23 | Kadlčík_t6a_2 | 0.261 | kadlčík2023_t6a | encoder-decoder | 244000000 | transformer | transformer | |
24 | Greeshma_t6a_1 | 0.261 | greeshma2023_t6a | encoder-decoder | 178755308 | cnn | BART | |
25 | Labbe_t6a_1 | 0.256 | labbe2023_t6a | encoder-decoder | 87715793 | cnn | transformer | mixup, spec_augment, label_smoothing |
26 | Chang_t6a_3 | 0.231 | chang2023_t6a | encoder-decoder | 656600064 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
27 | Chang_t6a_2 | 0.229 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | BART | spec augmentation, AL-mixgen, synonyms substitution |
28 | Kadlčík_t6a_3 | 0.225 | kadlčík2023_t6a | encoder-decoder | 39000000 | transformer | transformer | |
29 | Lim_t6a_1 | 0.010 | lim2023_t6a | encoder-decoder | 178755308 | CNN14 | transformer |
Detailed characteristics
Rank | Submission code | SPIDEr-FL | Technical Report | Method scheme/architecture | Amount of parameters | Audio modelling | Acoustic features | Word modelling | Word embeddings | Data augmentation | Sampling rate | Learning set-up | Ensemble method | Loss function | Optimizer | Learning rate | Gradient clipping | Gradient norm for clipping | Metric monitored for training | Dataset(s) used for audio modelling | Dataset(s) used for word modelling | Dataset(s) used for audio similarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Wu_t6a_4 | 0.327 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
2 | Wu_t6a_3 | 0.326 | wu2023_t6a | encoder-decoder | 2542000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
3 | Wu_t6a_2 | 0.324 | wu2023_t6a | encoder-decoder | 887000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
4 | Chang_t6a_4 | 0.315 | chang2023_t6a | encoder-decoder | 1313200128 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
5 | Labbe_t6a_4 | 0.314 | labbe2023_t6a | encoder-decoder | 98064347 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | |||
6 | Wu_t6a_1 | 0.311 | wu2023_t6a | encoder-decoder | 127000000 | conformer | BEATs | transformer | BART | 16kHz | supervised | adamw | 2e-5 | validation_acc | Clotho, AudioCaps | Clotho, AudioCaps | ||||||
7 | Labbe_t6a_3 | 0.309 | labbe2023_t6a | encoder-decoder | 42191083 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | Clotho, AudioCaps, MACS, WavCaps (without FreeSound) | |||
8 | Chang_t6a_1 | 0.308 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised | crossentropy | adamw | 1e-6 | validation loss | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
9 | Labbe_t6a_2 | 0.304 | labbe2023_t6a | encoder-decoder | 40133440 | cnn | ConvNeXt-tiny | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho | Clotho | |||
10 | Yan_t6a_4 | 0.289 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
11 | Yan_t6a_3 | 0.288 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
12 | Yan_t6a_1 | 0.282 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
13 | Schaumloeffel_t6a_1 | 0.282 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | CLAP | GPT2 | SpecAugment | 48kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, MACS, WavText5k, SoundDescs | Clotho, AudioCaps, MACS, WavText5k, SoundDescs | |||||
14 | Schaumloeffel_t6a_2 | 0.274 | schaumloeffel2023_t6a | encoder-decoder | 248325888 | transformer | CLAP | GPT2 | SpecAugment | 48kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, MACS | Clotho, AudioCaps, MACS | |||||
15 | Guan_t6a_3 | 0.273 | guan2023_t6a | encoder-decoder | 35502652 | PANNs (CNN10) + GAT, PANNs (CNN10) | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
16 | Guan_t6a_4 | 0.272 | guan2023_t6a | encoder-decoder | 17768222 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
17 | Yan_t6a_2 | 0.270 | yan2023_t6a | encoder-decoder | 90086352 | transformer | audioset | transformer | BERT | 16kHz | supervised | crossentropy | adamw | 1e-4 | validation_loss | Clotho, FreeSound | Clotho, FreeSound | |||||
18 | Guan_t6a_1 | 0.270 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
19 | Kadlčík_t6a_1 | 0.267 | kadlčík2023_t6a | encoder-decoder | 1550000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
20 | Lee_t6a_1 | 0.266 | lee2023_t6a | encoder-decoder | 178755308 | cnn | PANNs | transformer | BART | 44.1kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho, AudioCaps, WavText5K, SoundDescs | Clotho, AudioCaps, WavText5K, SoundDescs | |||||
21 | Baseline | 0.264 | gontier2023_t6a | encoder-decoder | 98500000 | PANNs | log-mel energies | transformer | BART | 16kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho | |||||
22 | Guan_t6a_2 | 0.263 | guan2023_t6a | encoder-decoder | 8884111 | PANNs (CNN10) + GAT | log-mel energies | transformer | Word2Vec | SpecAugmentation | 32.0kHz | supervised | crossentropy with label smoothing | adamw | 1e-3 | SPIDEr metric | Clotho, AudioCaps | Clotho, AudioCaps | ||||
23 | Kadlčík_t6a_2 | 0.261 | kadlčík2023_t6a | encoder-decoder | 244000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
24 | Greeshma_t6a_1 | 0.261 | greeshma2023_t6a | encoder-decoder | 178755308 | cnn | log-mel energies | BART | BART | 44.1kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho | |||||
25 | Labbe_t6a_1 | 0.256 | labbe2023_t6a | encoder-decoder | 87715793 | cnn | PANNs-CNN14 | transformer | learned | mixup, spec_augment, label_smoothing | 32kHz | supervised | crossentropy | adamw | 5e-4 | l2 | validation_fense | Clotho | Clotho | |||
26 | Chang_t6a_3 | 0.231 | chang2023_t6a | encoder-decoder | 656600064 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
27 | Chang_t6a_2 | 0.229 | chang2023_t6a | encoder-decoder | 218866688 | PANNs | PANNs | BART | BART | spec augmentation, AL-mixgen, synonyms substitution | 44.1kHz | supervised, reinforcement learning | crossentropy | adamw | 1e-6 | CIDEr | Clotho, AudioCaps, WavCaps | Clotho, AudioCaps, WavCaps | ||||
28 | Kadlčík_t6a_3 | 0.225 | kadlčík2023_t6a | encoder-decoder | 39000000 | transformer | WhisperFeatureExtractor | transformer | Whisper | 16kHz | supervised | crossentropy | adamw | 4e-6 | SPIDEr | Clotho, AudioCaps, AudioSet | Clotho, AudioCaps, AudioSet | |||||
29 | Lim_t6a_1 | 0.010 | lim2023_t6a | encoder-decoder | 178755308 | CNN14 | mel energies | transformer | PASST | 44.1 kHz | supervised | crossentropy | adamw | 1e-5 | validation_loss | Clotho | Clotho |
Technical reports
HYU submission for the DCASE 2023 task 6a: automated audio captioning model using AL-MixGen and synonyms substitution
Jae-Heung Cho1, Yoon-Ah Park1, Jaewon Kim1, Joon-Hyuk Chang1
1Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea
chang_t6a_1 chang_t6a_2 chang_t6a_3 chang_t6a_4
Abstract
This paper presents the automated audio captioning model for participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 6A. The model consists of two parts: an audio feature extractor and a language model. The audio feature extractor employed in our model is the pre-trained convolutional neural network 14 (CNN14), trained with AudioSet, which has demonstrated excellent performance in extracting audio features. For the language model, we utilized the bidirectional and auto-regressive transformers (BART) model, which has achieved remarkable success in text generation. We pre-trained the model with the WavCaps, AudioCaps, and Clotho datasets to manage the limited data availability, and then fine-tuned it with the Clotho dataset. Furthermore, AL-MixGen and synonyms substitution methods were also implemented for data augmentation. To improve the evaluation metric directly, we trained the model with reinforcement learning to optimize the CIDEr score. Finally, we achieved improved performance by adopting an ensemble of higher-performing models, reaching a SPIDEr score of 0.343.
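The abstract mentions reinforcement learning towards CIDEr but does not spell out the exact recipe; the common choice for this is self-critical sequence training (SCST), sketched below under that assumption. The `sample_caption`/`greedy_caption` methods and the `cider_score` callback are hypothetical placeholders, not the authors' code.

```python
# Hedged SCST-style sketch for optimizing CIDEr (assumed recipe, not the authors' code).
# `sample_caption` / `greedy_caption` are assumed to return (token_log_probs, text),
# and `cider_score(text, references)` is a placeholder for a real CIDEr scorer.
import torch

def scst_loss(model, audio_features, references, cider_score):
    log_probs, sampled_text = model.sample_caption(audio_features)   # stochastic decode
    with torch.no_grad():
        _, greedy_text = model.greedy_caption(audio_features)        # baseline decode
    # Reward is the CIDEr advantage of the sampled caption over the greedy baseline.
    reward = cider_score(sampled_text, references) - cider_score(greedy_text, references)
    # REINFORCE with the greedy score as baseline: raise likelihood when reward > 0.
    return -reward * log_probs.sum()
```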
System characteristics
Data augmentation | AL-MixGen, SpecAugment, Synonym substitution |
DCASE 2023 task 6 automated audio captioning and language-based retrieval
Karanth Greeshma1, Ninaad Rao1, Srikumar Subramanian1, Ankit Shah1
1Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA
Abstract
The objective of this project is to examine audio signals utilizing natural language to capture their complex characteristics. This initiative is part of Task 6 in the DCASE 2023 Competition and consists of two subtasks. The first subtask is Automated Audio Captioning, which generates text descriptions of audio content. This task involves the intermodal processing of an audio signal as input and a text description as output. Our best-performing model for this uses the PANN architecture [1] with the CNN-14 feature extractor and BART [2] encoder and decoder. The second subtask is called Language-Based Audio Retrieval, where the system retrieves audio signals by searching for their sound content descriptions. The queries for this subtask are human-generated audio captions. In this task, our best-performing model uses CLAP [3] audio embeddings and RoBERTa text embeddings [4]. This document presents a summary of our work done for this challenge.
System characteristics
Data augmentation | None |
Ensemble systems with contrastive language-audio pretraining and attention-based audio features for audio captioning and retrieval
Feiyang Xiao1, Qiaoxi Zhu2, Haiyan Lan1, Wenwu Wang3, Jian Guan1
1Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2 Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia, 3Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
guan_t6a_1 guan_t6a_2 guan_t6a_3 guan_t6a_4
Abstract
This technical report describes our submission to Task 6 (automated audio captioning and language-based audio retrieval) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge. The proposed systems in this submission are based on a contrastive language-audio pretraining strategy and an attention-based audio feature representation. Experiments show that our systems can achieve a SPIDEr-FL score of 28.32 on automated audio captioning and an mAP score of 31.18 on language-based audio retrieval.
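The contrastive language-audio pretraining strategy mentioned above typically trains an audio encoder and a text encoder with a symmetric InfoNCE objective over paired clips and captions. The following sketch shows that objective in isolation; it is a generic CLAP-style formulation assumed here for illustration, not the submitted implementation.

```python
# Symmetric InfoNCE loss over a batch of paired audio/text embeddings (generic CLAP-style
# sketch, not the submitted system). Matching pairs sit on the diagonal of the similarity
# matrix and are treated as the positive class for both directions.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)            # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)        # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)

print(clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```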
System characteristics
Data augmentation | SpecAugment |
A whisper transformer for audio captioning trained with synthetic captions and transfer learning
Marek Kadlčík1,2, Adam Hájek1,2, Jürgen Kieslich2, Radosław Winiecki2,3
1Student at Masaryk University, Brno, Czech Republic, 2Student at Johannes Kepler University, Linz, Austria, 3Student at Politechnika Poznańska, Poznan, Poland
kadlcik_t6a_1 kadlcik_t6a_2 kadlcik_t6a_3
Abstract
The field of audio captioning has seen significant advancements in recent years, driven by the availability of large-scale audio datasets and advancements in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present our experiments’ results, which include model size variations, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.
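Because the system reuses Whisper's encoder-decoder, caption generation looks like ordinary Whisper inference. The sketch below shows that pattern with the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder, since the exact repository names of the released models are not listed here.

```python
# Generic inference sketch for a Whisper model fine-tuned on captioning data
# (the checkpoint path is a placeholder, not the authors' published repository name).
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

checkpoint = "./whisper-audio-captioning"          # hypothetical fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

audio, sr = librosa.load("example.wav", sr=16000)  # Whisper's feature extractor expects 16 kHz
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```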
System characteristics
Data augmentation | Gaussian noise, Time shifting, Gain |
IRIT-UPS DCASE 2023 audio captioning and retrieval system
Etienne Labbé1, Thomas Pellegrini1,2, Julien Pinquier1
1IRIT (UMR 5505), Université Paul Sabatier, CNRS, Toulouse, France, 2Artificial and Natural Intelligence Toulouse Institute (ANITI)
labbe_t6a_1 labbe_t6a_2 labbe_t6a_3 labbe_t6a_4
Abstract
This technical report provides a concise overview of our systems submitted to the DCASE Challenge 2023 for tasks 6a, "Automated Audio Captioning" (AAC), and 6b, "Language-Based Audio Retrieval" (LBAR). In task 6a, we made four distinct submissions. The first submission employed a standard CNN14 encoder paired with a transformer decoder. In the second submission, we replaced this encoder with a ConvNeXt model to enhance audio representation. The third submission incorporated additional training data. We introduced a new task embedding approach to differentiate between different writing styles and audio types. Finally, in the fourth submission, we employed an ensemble method to combine five models trained on different seeds, aiming to improve the quality of the captions. For task 6b, we use the AAC models and we propose a novel approach to accomplish the LBAR task by leveraging the AAC system loss function without requiring any additional training. Our most successful AAC and LBAR systems achieved a SPIDEr-FL score of 0.320 and an mAP@10 score of 0.269. These results demonstrate relative improvements of 22.6% and 21.2% compared to the AAC and LBAR baselines, respectively.
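The retrieval approach described above, reusing the AAC loss, can be written down directly: for a text query, every candidate clip is scored by the teacher-forced cross-entropy the caption model assigns to the query, and clips are ranked by that loss. The sketch below assumes a `captioning_loss(audio, text)` interface as a placeholder for a trained AAC model; it is not the authors' code.

```python
# Hedged sketch of retrieval by captioning loss (assumed interface, not the submitted code):
# rank audio clips by the cross-entropy the caption model assigns to the query caption.
import torch

def retrieve(query_caption, audio_clips, captioning_loss, top_k=10):
    """captioning_loss(audio, text) -> scalar teacher-forced cross-entropy (placeholder)."""
    with torch.no_grad():
        scores = [captioning_loss(audio, query_caption).item() for audio in audio_clips]
    # A lower loss means the caption explains the audio better, so sort ascending.
    ranked = sorted(range(len(audio_clips)), key=lambda i: scores[i])
    return ranked[:top_k]
```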
System characteristics
Data augmentation | MixUp, SpecAugment, Label Smoothing |
Label-refined sequential training with noisy data for automated audio captioning
Jaeheon Sim1, Eungbeom Kim1, Kyogu Lee1,2
1Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea, 2Department of Intelligence and Information, AIIS, Seoul National University, Seoul, Korea
lee_t6a_1
Abstract
This technical report describes the submission to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 6A: Automated Audio Captioning. We utilize a label-refined sequential training method to leverage a large additional dataset that contains two types of noise: domain shift and label noise. We investigate the usefulness of the additional noisy dataset and observe that models trained naively on the combination of the additional and target datasets suffer from poor performance. From this observation, we aim to fully leverage the additional dataset by addressing the two types of noise simultaneously. We sequentially train the model with prior knowledge about the difference between the target dataset and each of the additional datasets, from the largest to the nearest. We finally train the model on the target dataset, thereby progressively minimizing the domain gap. After this training procedure, we apply a label refinement method based on pseudo-labelling from self-training and repeat the sequential training procedure. The proposed method mitigates the noise in the dataset and achieves improved performance.
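Read as pseudocode, the procedure above alternates sequential training (through the external datasets towards Clotho) with a label-refinement pass that replaces noisy captions by the model's own predictions. The sketch below is a high-level paraphrase under that reading, with placeholder `train_on`/`predict` methods rather than the authors' implementation.

```python
# High-level paraphrase of label-refined sequential training (placeholder methods,
# not the authors' implementation).
def label_refined_sequential_training(model, external_datasets, clotho, rounds=2):
    # external_datasets is assumed to be ordered as described above,
    # from the largest / most distant dataset to the nearest one.
    for _ in range(rounds):
        for dataset in external_datasets:
            model.train_on(dataset)            # sequential training, far -> near
        model.train_on(clotho)                 # finish on the target dataset
        for dataset in external_datasets:
            # Label refinement: replace noisy captions with the model's own predictions.
            dataset.captions = [model.predict(clip) for clip in dataset.audio]
    return model
```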
System characteristics
Data augmentation | None |
CAU submission to DCASE 2023 task 6a: Audio captioning using wavegrams that contain frequency information
Seungmin Chou1, Jaeseung Yim1, Changwon Lim1
1Chung-Ang University, Department of Applied Statistics, Seoul, South Korea
lim_t6a_1
Abstract
This technical report describes an Automated Audio Captioning model for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6A. Utilizing wavegrams and patchout as proposed in [1] and [2], respectively, we propose audio captioning using wavegrams that contain frequency information. We use pre-trained models, trained on AudioSet data, to create the word embeddings. Our proposed sequence-to-sequence model consists of a CNN14 encoder and a Transformer decoder. Experiments show that the proposed model achieves a SPIDEr score of 0.011.
System characteristics
Data augmentation | None |
PEACS: Prefix encoding for auditory caption synthesis
Timothy Schaumlöffel1, Martina G. Vilas1,2, Gemma Roig1,3
1Goethe University Frankfurt, Department of Computer Science, Robert-Mayer-Str. 11-15, 60323 Frankfurt, Germany, 2Ernst Strüngmann Institute for Neuroscience, Deutschordenstraße 46, 60528 Frankfurt, Germany, 3The Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
schaumloeffel_t6a_1 schaumloeffel_t6a_2
Abstract
This technical report describes an Automated Audio Captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6a (automated audio captioning). Our approach employs an encoder-decoder architecture, with the encoder utilizing a large contrastive pre-trained HTS-AT capable of handling variable-length audio segments. The decoder is based on the GPT2 model. To incorporate audio into the decoding process, we employ a light mapping network that translates audio representations into a prefix, effectively guiding the decoder’s generation process. Given the limited data availability, we pre-train our model on various audio captioning datasets and fine-tune it on Clotho. We reach a SPIDEr-FL score of 29.3 on the evaluation split of the Clotho-v2 dataset.
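The light mapping network follows the familiar prefix-conditioning pattern: a pooled audio embedding is projected to a short sequence of pseudo-token embeddings that are prepended to the caption embeddings before the GPT2 decoder. The sketch below illustrates that pattern with assumed dimensions and a plain MLP mapper; it is not the submitted PEACS code.

```python
# Illustrative prefix-mapping sketch (assumed dimensions; not the submitted PEACS system).
# A pooled audio embedding (e.g. from a CLAP/HTS-AT encoder) is mapped to `prefix_len`
# pseudo-token embeddings that are prepended to the caption embeddings fed to GPT-2.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AudioPrefixCaptioner(nn.Module):
    def __init__(self, audio_dim=512, prefix_len=10):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d_model = self.gpt2.config.n_embd
        self.prefix_len = prefix_len
        self.mapper = nn.Sequential(                       # the "light mapping network"
            nn.Linear(audio_dim, d_model * prefix_len), nn.Tanh(),
        )

    def forward(self, audio_emb, caption_ids):
        batch = audio_emb.size(0)
        prefix = self.mapper(audio_emb).view(batch, self.prefix_len, -1)
        token_emb = self.gpt2.transformer.wte(caption_ids)
        inputs_embeds = torch.cat([prefix, token_emb], dim=1)
        # Prefix positions carry no language-model targets, so mask them out with -100.
        ignore = torch.full((batch, self.prefix_len), -100, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels).loss

model = AudioPrefixCaptioner()
print(model(torch.randn(2, 512), torch.randint(0, 50257, (2, 16))))  # training loss
```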
System characteristics
Data augmentation | SpecAugment |
BEATs-based audio captioning model with INSTRUCTOR embedding supervision and ChatGPT mix-up
Shih-Lun Wu1, Xuankai Chang1, Gordon Wichern2, Jee-weon Jung1, François Germain2, Jonathan Le Roux2, Shinji Watanabe1
1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2Speech & Audio Team, Mitsubishi Electric Research Labs, Cambridge, MA, USA
wu_t6a_1 wu_t6a_2 wu_t6a_3 wu_t6a_4
Abstract
DCASE 2023 Task 6A, automated audio captioning (AAC), aims at generating informative descriptions for various sounds from nature and/or human activities. Our AAC system follows the sequence-to-sequence (seq2seq) architecture. The audio encoder stack comprises a frozen BEATs Transformer followed by a 2-layer Conformer. The BEATs module, which has been pretrained on both masked audio token prediction and audio event classification, extracts fine-grained (i.e., ≈ 50 Hz) audio features, while the Conformer downsamples and summarizes the audio features before they are cross-attended by the BART text decoder. Besides the autoregressive negative log-likelihood (NLL) loss computed on decoder outputs, we simultaneously apply an audio-text contrastive loss on our encoder output to infuse language modality knowledge into it. Specifically, we feed ground-truth captions into INSTRUCTOR Transformer, a state-of-the-art text embedding model, and teach our audio encoder to predict the INSTRUCTOR text embeddings through InfoNCE loss. In addition, we leverage ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increases not only the amount but also the complexity and diversity of our training data. During inference, we employ nucleus sampling and a hybrid reranking algorithm that considers both likelihood and audio-caption representation similarity. Combining our efforts, our best single model and ensemble system achieve 0.326 and 0.336 SPIDEr-FL scores, respectively, on the Clotho (V2) evaluation split.
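The hybrid reranking step can be summarized compactly: each sampled candidate caption is scored by a weighted combination of its (length-normalized) decoder log-likelihood and the cosine similarity between the audio embedding and a text embedding of the candidate. In the sketch below, the weighting factor and the embedding interfaces are assumptions, not the submission's actual settings.

```python
# Hedged sketch of likelihood-plus-similarity reranking (assumed weighting; `embed_text`
# and `audio_emb` stand in for the system's text and audio encoders).
import torch
import torch.nn.functional as F

def rerank(candidates, log_likelihoods, audio_emb, embed_text, alpha=0.5):
    """candidates: caption strings; log_likelihoods: length-normalized decoder scores."""
    sims = torch.stack([F.cosine_similarity(audio_emb, embed_text(c), dim=-1)
                        for c in candidates])
    scores = alpha * torch.tensor(log_likelihoods) + (1.0 - alpha) * sims
    return candidates[int(torch.argmax(scores))]             # best candidate caption
```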
System characteristics
Data augmentation | SpecAugment, MixUp |
Leveraging multi-task training and image retrieval with CLAP for audio captioning
Haoran Sun1, Zhiyong Yan1, Yongqing Wang1, Heinrich Dinkel1, Junbo Zhang1, Yujun Wang1
1Xiaomi Corporation, Beijing, China
yan_t6a_1 yan_t6a_2 yan_t6a_3 yan_t6a_4
Abstract
This technical report serves as our submission to Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. Our system, as described in this report, consists of two sub-systems designed for the respective sub-tasks: automated audio captioning (task A) and text-to-audio retrieval (task B). The text-to-audio retrieval system employs a tri-encoder architecture, where pre-trained audio and text encoders are trained to establish relationships. Additionally, an extra pre-trained image encoder is utilized to enhance the connections between these encoders. Through this retrieval process, the audio encoder can be considered a pre-trained encoder for task A. Furthermore, we employ multi-task training with audio tagging during the retrieval phase to strengthen the encoder for audio captioning. Pre-training is conducted using AudioCaps and a portion of the WavCaps dataset, and both sub-systems are subsequently fine-tuned on the Clotho dataset. Experimental results demonstrate that our model achieves a SPIDEr score of 0.305 and a SPIDEr-FL score of 0.294 for captioning, as well as an mAP (mean Average Precision) of 0.321 for text-to-audio retrieval.
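The multi-task training mentioned above, adding audio tagging during the retrieval phase, amounts to mixing a tagging term into the retrieval objective. The sketch below shows one plausible formulation; the loss weighting and the 527-class AudioSet-style label space are assumptions, not the authors' settings.

```python
# Plausible multi-task objective sketch (assumed weighting, not the submitted code):
# a retrieval/contrastive loss on paired embeddings plus a BCE audio-tagging loss.
import torch
import torch.nn.functional as F

def multitask_loss(contrastive_loss, tag_logits, tag_targets, tagging_weight=0.5):
    """tag_targets are multi-hot AudioSet-style labels; BCE-with-logits handles multi-label."""
    tagging_loss = F.binary_cross_entropy_with_logits(tag_logits, tag_targets)
    return contrastive_loss + tagging_weight * tagging_loss

print(multitask_loss(torch.tensor(1.2),
                     torch.randn(4, 527),
                     torch.randint(0, 2, (4, 527)).float()))
```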
System characteristics
Data augmentation | None |