Automatic creation of textual content descriptions for general audio signals.

Description

Automated audio captioning (AAC) is the task of describing general audio content using free text. It is an inter-modal translation task (not speech-to-text): a system takes an audio signal as input and outputs a textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and the environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge (e.g. "a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content-oriented machine-to-machine interaction.

Figure 1: An example of an automated audio captioning system and process.

The task is a continuation of the AAC task from DCASE2023.

This year, the AAC task still allows the use of external data and/or pre-trained models under several restrictions (please see the section below). Participants are allowed to use other datasets for AAC, datasets for sound event detection/tagging or acoustic scene classification, or datasets from any other task that might be deemed fit. Additionally, participants can use pre-trained models, such as (but not limited to) Word2Vec, BERT, and PANNs, anywhere in their model. Please see below for some recommendations for datasets and pre-trained models.

Audio dataset

The AAC task in DCASE2024 uses the Clotho v2 dataset for the evaluation of the submissions. Participants can, however, use any other dataset, from any other task, for the development of their methods. This section describes the Clotho v2 dataset.

Clotho dataset

Clotho v2 is an extension of the original Clotho dataset (i.e. v1) and consists of audio samples of 15 to 30 seconds in duration, each with five captions of eight to 20 words in length. There is a total of 6972 audio samples in Clotho (4981 from v1 and 1991 added in v2), with 34 860 captions (i.e. 6972 audio samples × 5 captions per sample). Clotho v2 is built with a focus on audio content and caption diversity, and the splits of the data do not hamper the training or evaluation of methods. The new data in Clotho v2 does not affect the splits used to assess the performance of methods with the previous version of Clotho (i.e. the evaluation and testing splits of Clotho v1). All audio samples are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed with post-processing.
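The figures above can be checked with a few lines of arithmetic, and the stated caption constraints (five captions per sample, eight to 20 words each) lend themselves to a simple validation routine. The following is a minimal sketch only: the helper `check_clip` and the whitespace-based word count are illustrative assumptions, not part of the Clotho tooling.

```python
# Dataset sizes as stated in the text above.
N_V1 = 4981           # audio samples from Clotho v1
N_V2_NEW = 1991       # audio samples added in Clotho v2
CAPTIONS_PER_CLIP = 5
MIN_WORDS, MAX_WORDS = 8, 20

n_clips = N_V1 + N_V2_NEW
n_captions = n_clips * CAPTIONS_PER_CLIP


def check_clip(captions):
    """Return a list of problems found in the captions of one audio sample.

    Hypothetical helper: words are counted by splitting on whitespace,
    which is an assumption and may differ from the dataset's own tokenization.
    """
    problems = []
    if len(captions) != CAPTIONS_PER_CLIP:
        problems.append(
            f"expected {CAPTIONS_PER_CLIP} captions, got {len(captions)}"
        )
    for i, caption in enumerate(captions):
        n_words = len(caption.split())
        if not MIN_WORDS <= n_words <= MAX_WORDS:
            problems.append(f"caption {i} has {n_words} words")
    return problems


print(n_clips, n_captions)  # 6972 34860
```

An empty list returned by `check_clip` means the sample satisfies both constraints under the stated assumptions.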

Clotho v2 contains around 4500 distinct words and is divided into four splits: development, validation, evaluation, and testing. Audio samples are publicly available for all four splits, but captions are publicly available only for the development, validation, and evaluation splits. There are no overlapping audio samples between the four splits, and no word appears in the evaluation, validation, or testing splits without also appearing in the development split. Likewise, no word appears in the development split without appearing in at least one of the other three splits. All words appear proportionally between splits (the word distribution is kept similar across splits), i.e. 55% in the development, 15% in the validation, 15% in the evaluation, and 15% in the testing split.

Words that could not be divided using the above 55-15-15-15 scheme (e.g. words that appear only twice in all four splits combined) appear at least once in the development split and at least once in one of the other three splits. This splitting process is similar to the one used for the previous version of Clotho. More details about the splitting process can be found in the paper presenting Clotho, freely available online and cited as:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.
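The vocabulary constraints on the splits described above (every word of the validation, evaluation, and testing splits also occurs in the development split, every development word occurs in at least one other split, and word occurrences are distributed roughly 55-15-15-15) can be expressed as simple set and count operations. The sketch below is an illustration under stated assumptions, not the procedure used to build Clotho: the function names are hypothetical, the toy captions are invented, and words are obtained by lowercasing and splitting on whitespace.

```python
from collections import Counter


def words(captions):
    """Yield lowercase whitespace-separated words (a simplifying assumption)."""
    for caption in captions:
        yield from caption.lower().split()


def check_split_vocabulary(splits):
    """Compare the development vocabulary against the other splits combined.

    `splits` maps a split name to a list of caption strings and must
    contain the key "development".
    """
    dev = set(words(splits["development"]))
    others = set()
    for name, captions in splits.items():
        if name != "development":
            others |= set(words(captions))
    return {
        "missing_from_development": others - dev,
        "only_in_development": dev - others,
    }


def word_shares(splits, word):
    """Return the fraction of occurrences of `word` that falls in each split."""
    counts = {name: Counter(words(caps))[word] for name, caps in splits.items()}
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Invented toy data, only to show the calls.
toy = {
    "development": ["a dog barks", "rain falls"],
    "validation": ["a dog"],
    "evaluation": ["rain"],
    "testing": ["barks falls"],
}
report = check_split_vocabulary(toy)
shares = word_shares(toy, "dog")
```

For data that follows the scheme, both sets in `report` are empty, and `word_shares` for a frequent word would be expected to come out close to 0.55, 0.15, 0.15, and 0.15.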

Coordinators

Huang Xie, Tampere University
Tuomas Virtanen, Tampere University
Romain Serizel, University of Lorraine
Etienne Labbé, Université Toulouse III – Paul Sabatier, Institut de Recherche en Informatique de Toulouse
Thomas Pellegrini, Université Toulouse III – Paul Sabatier, Institut de Recherche en Informatique de Toulouse
Xinhao Mei, University of Surrey
Xuenan Xu, University of Surrey
Mark D. Plumbley, University of Surrey
Wenwu Wang, University of Surrey
Mengyue Wu, Shanghai Jiao Tong University

Clotho naming of splits	DCASE Challenge naming of splits
development	development
validation	development
evaluation	development
testing	evaluation

Metric	Value
METEOR	0.1897
CIDEr	0.4619
SPICE	0.1335
SPIDEr	0.2977
SPIDEr-FL	0.2962
Vocabulary	551
FENSE	0.5040
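As a consistency check on the table above: SPIDEr is defined as the arithmetic mean of the CIDEr and SPICE scores, so its value can be recomputed from those two rows. The snippet below is a small illustrative sketch; the function name `spider` is a hypothetical helper and not part of any evaluation toolkit.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# Values copied from the metric table above.
reported = {"CIDEr": 0.4619, "SPICE": 0.1335, "SPIDEr": 0.2977}

recomputed = spider(reported["CIDEr"], reported["SPICE"])
print(f"{recomputed:.4f}")  # 0.2977
```

The recomputed value matches the reported SPIDEr to four decimal places.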
