Automated Audio Captioning


Task description

Automatic creation of textual content descriptions for general audio signals.

Challenge has ended. Full results for this task can be found in the Results page.

Description

Automated audio captioning is the task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. Audio captioning methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content oriented machine-to-machine interaction.

Figure 1: Illustration of automated audio captioning system and process.


Audio dataset

This task will be using the freely available Clotho dataset, presented in IEEE ICASSP 2020. Clotho consists of audio samples of 15 to 30 seconds duration, each audio sample having five captions of eight to 20 words length. There is a total of 4981 audio samples in Clotho, with 24 905 captions (i.e. 4981 audio samples * 5 captions per sample). Clotho is built with a focus on audio content and caption diversity, and the splits of the data do not hamper the training or evaluation of methods. All audio samples are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed with post-processing.

Clotho has a total of 4365 unique words and is divided into three splits: development, evaluation, and testing. Development and evaluation splits are publicly available. Testing split will be released according to the schedule of DCASE challenge. Each audio sample in Clotho appears in only one split. Also, there is no word in Clotho that appears in only one split. Additionally, all words appear proportionally between splits (the word distribution is kept similar across splits), i.e. 60% in the development, 20% in the evaluation, and 20% in the testing split.

Words that could not be divided using the above 60-20-20 scheme (e.g. words that appear only two times in all three splits combined) appear at least one time in the development split and at least one time in one of the other two splits. More detailed info can be found in the paper presenting Clotho, freely available online here and cited as:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Barcelona, Spain, May 2020. URL: https://arxiv.org/abs/1910.09387.


Clotho: An Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).


The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Audio samples

Clotho audio data are extracted from an initial set of 12 000 audio files collected from Freesound. The 12k audio files have durations ranging from 10s to 300s, no spelling errors in the first sentence of the description on Freesound, good quality (44.1kHz and 16-bit), and no tags on Freesound indicating sound effects, music or speech. Before extraction, all 12k files were normalized and the leading and trailing silences were trimmed.

The content of audio samples in Clotho varies greatly, ranging from ambiance in a forest (e.g. water flowing over some rocks), animal sounds (e.g. goats bleating), and crowd yelling or murmuring, to machines and engines operating (e.g. inside a factory) or revving (e.g. cars, motorbikes), and devices functioning (e.g. container with contents moving, doors opening/closing). For a thorough description of how the audio samples are selected and filtered, you can check the paper that presents the Clotho dataset.

The following figure shows the distribution of the duration of audio files in Clotho.

Figure 2: Audio duration distribution for Clotho dataset.


Captions

The captions in the Clotho dataset range from 8 to 20 words in length, and were gathered by employing the crowdsourcing platform Amazon Mechanical Turk and a three-step framework. The three steps are:

  1. audio description,
  2. description editing, and
  3. description scoring.

In step 1, five initial captions were gathered for each audio clip from distinct annotators. In step 2, these initial captions were edited to fix grammatical errors; captions that were already grammatically correct were instead rephrased, in order to acquire diverse captions for the same audio clip. In step 3, the initial and edited captions were scored based on accuracy, i.e. how well the caption describes the audio clip, and fluency, i.e. the quality of the English in the caption itself. The initial and edited captions were scored by three distinct annotators. The scores were then summed together and the captions were sorted by the total accuracy score first, total fluency score second. The top five captions, after sorting, were selected as the final captions of the audio clip. More information about the caption scoring (e.g. scoring values, scoring threshold, etc.) is in the corresponding paper of the three-step framework.
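
The selection in step 3 can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Candidate` structure, the function name `select_top_captions`, and the score values are hypothetical and are not taken from the framework's code; the actual scoring values and thresholds are given in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate caption with per-annotator scores (hypothetical structure)."""
    text: str
    accuracy: List[int]  # one accuracy score per annotator
    fluency: List[int]   # one fluency score per annotator


def select_top_captions(candidates: List[Candidate], keep: int = 5) -> List[str]:
    """Sum the scores, sort by total accuracy first and total fluency second."""
    ranked = sorted(
        candidates,
        key=lambda c: (sum(c.accuracy), sum(c.fluency)),
        reverse=True,
    )
    return [c.text for c in ranked[:keep]]


# Toy data: ten candidates (five initial, five edited), three annotators each.
candidates = [
    Candidate(f"caption {i}", accuracy=[i % 4, 2, 3], fluency=[i % 3, 1, 2])
    for i in range(10)
]
top_five = select_top_captions(candidates)
```

Sorting on the tuple (total accuracy, total fluency) means fluency only decides between captions whose total accuracy is tied.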

We then manually sanitized the final captions of the dataset by removing apostrophes, making compound words consistent, removing phrases describing the content of speech, and replacing named entities. We used in-house annotators to replace transcribed speech in the captions. If the resulting caption was under 8 words, we attempted to find captions among the lower-scored captions (i.e. those that were not selected in step 3) that still had decent scores for accuracy and fluency. If there were no such captions, or these captions could not be rephrased to at least 8 words, the audio file was removed from the dataset entirely. The same in-house annotators were also used to replace unique words that only appeared in the captions of one audio clip. Since audio clips are not shared between splits, words that appear only in the captions of one audio clip would appear in only one split. This process yields a total of 4981 audio samples, each having five captions and amounting to a total of 24 905 captions.

A thorough description of the three-step framework can be found in the corresponding paper, freely available online here and cited as:

Publication

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.


Crowdsourcing a Dataset of Audio Captions

Abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.


Development and evaluation datasets

Clotho is divided into a development split of 2893 audio clips with 14465 captions, an evaluation split of 1045 audio clips with 5225 captions, and a testing split of 1043 audio clips with 5215 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which we can derive the frequency of a given word. With the unique words of audio files as classes, we use multi-label stratification. More information on the splits of Clotho can be found in the corresponding paper.
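
The first two steps (per-clip sets of unique words, then a dataset-level bag of words) can be sketched as below. This is a minimal illustration, not the code used to build Clotho: the function names and the toy captions are made up, tokenization is assumed to be a plain whitespace split, and the multi-label stratification itself is not shown.

```python
from collections import Counter
from typing import Dict, List, Set


def unique_words_per_clip(captions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each audio file name to the set of unique words in its captions."""
    return {
        name: {word for caption in caps for word in caption.lower().split()}
        for name, caps in captions.items()
    }


def word_frequencies(word_sets: Dict[str, Set[str]]) -> Counter:
    """Count in how many clips each word occurs (dataset-level bag of words)."""
    counts: Counter = Counter()
    for words in word_sets.values():
        counts.update(words)
    return counts


# Toy captions (hypothetical, not taken from Clotho).
captions = {
    "a.wav": ["a dog barks loudly", "a dog is barking"],
    "b.wav": ["rain falls on a roof", "heavy rain is falling"],
}
word_sets = unique_words_per_clip(captions)
frequencies = word_frequencies(word_sets)
```

Because each clip contributes a set, a word is counted once per clip, so the frequency here is the number of clips in which the word appears.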

The names of the splits for Clotho differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho and the DCASE challenge is:

Clotho naming of splits    DCASE Challenge naming of splits
development                development
evaluation                 development
testing                    evaluation

Clotho development split is meant for optimizing audio captioning methods. The performance of the audio captioning methods can then be assessed (e.g. for reporting results in a conference or journal paper) using Clotho evaluation split. Clotho testing split is meant only for usage in scientific challenges, e.g. DCASE challenge. For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training and development-testing will be used, wherever necessary. Development-training refers to Clotho development split and development-testing refers to Clotho evaluation split.

Download

DCASE Development split of Clotho can be found at the online Zenodo repository.


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.



DCASE Evaluation split of Clotho can be found at the online Zenodo repository.


Evaluation data are:

  • clotho_audio_test.7z: The evaluation audio clips.
  • clotho_metadata_test.csv: The meta-data of the evaluation audio clips, containing only authors and licence.

NOTE: Participants are strictly prohibited from using any additional information for the DCASE evaluation (testing) of their method, apart from the provided audio files from the DCASE Evaluation split.

Task setup

The development-training split is meant for optimizing the parameters of an audio captioning method, and the development-testing for assessing the performance of the method. Currently there are no folds or other splits publicly available.

The assessment of the submitted methods will be performed using a withheld split, the evaluation split. The evaluation split will be released according to the schedule of DCASE Challenge.

Participants are free to use the two provided splits of Clotho in whatever way they deem proper, in order to optimize and assess the generalization of their audio captioning methods. For example, the parameters in the provided baseline system are optimized until the loss on the development-training split saturates.

Development dataset

The development dataset of the challenge consists of the development-training and development-testing splits. The development-training split of Clotho consists of 2893 audio samples and two CSV files:

  1. clotho_captions_development.csv
  2. clotho_metadata_development.csv

In the clotho_captions_development.csv file there are six columns: file_name, caption_1, caption_2, caption_3, caption_4, and caption_5. The file_name column contains the file name of the corresponding audio file in the development-training split, and the other columns contain the corresponding captions for that particular file.

The clotho_metadata_development.csv file has seven columns: file_name, keywords, sound_id, sound_link, start_end_samples, manufacturer, and license. Again, the file_name column has the file name of the audio sample in the development-training split. The other columns contain metadata from the Freesound platform. The keywords column contains the tags that the sound has in Freesound. sound_id and sound_link are the sound ID and URL in Freesound. The values in start_end_samples show the exact samples that are used in the Clotho data, from the original sound in Freesound. Finally, manufacturer and license have the user name of the uploader in Freesound and the associated license of the sound, respectively.

The development-testing split of Clotho consists of 1045 audio samples and two CSV files with the same structure as those of the development-training split:

  1. clotho_captions_evaluation.csv
  2. clotho_metadata_evaluation.csv

The above CSV files of the development-testing split have the same columns as the corresponding CSV files of the development-training split. That is, clotho_captions_evaluation.csv has the same structure as clotho_captions_development.csv, and clotho_metadata_evaluation.csv has the same structure as clotho_metadata_development.csv.

Please note: The development-testing set of Clotho is offered for evaluating audio captioning methods. For testing purposes, there is another set, the evaluation split (Clotho testing split), which will be made available according to the schedule of the DCASE challenge.
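
A captions file with these six columns can be read with Python's standard csv module, for example as sketched below. The helper name `read_captions` is hypothetical, the file is assumed to be comma-delimited with a header row, and the in-memory sample stands in for the real file with made-up content.

```python
import csv
import io
from typing import Dict, List

CAPTION_COLUMNS = [f"caption_{i}" for i in range(1, 6)]


def read_captions(csv_file) -> Dict[str, List[str]]:
    """Return a mapping from file_name to the five captions of that file."""
    reader = csv.DictReader(csv_file)
    return {row["file_name"]: [row[col] for col in CAPTION_COLUMNS] for row in reader}


# In-memory stand-in for clotho_captions_development.csv (made-up content).
sample = io.StringIO(
    "file_name,caption_1,caption_2,caption_3,caption_4,caption_5\n"
    "example.wav,first caption,second caption,third caption,"
    "fourth caption,fifth caption\n"
)
captions = read_captions(sample)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption.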

Evaluation dataset

The evaluation dataset of the challenge is separate from the development dataset and consists of 1043 audio samples and one CSV file:

  1. clotho_metadata_test.csv

In the clotho_metadata_test.csv there are four columns: file_name, start_end_samples, manufacturer, and license. The file_name column has the file name of the audio sample in the split. The other columns contain metadata from the Freesound platform. The values in start_end_samples show the exact samples that are used in the Clotho data, from the original sound in Freesound. Finally, manufacturer and license have the user name of the uploader in Freesound and the associated license of the sound, respectively.

Submission

All participants should submit:

  • the output of their audio captioning with the evaluation split (*.csv file)
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

The *.csv file should have the following two columns:

  1. file_name, which will contain the file name of the audio file.
  2. caption_predicted, which will contain the output of the audio captioning method for the file with file name as specified in the file_name column.

For example, if a file has the file name test_0001.wav and the predicted caption of the audio captioning method for the file test_0001.wav is hello world, then the CSV file should have the entry:

    file_name      caption_predicted
        .               .
        .               .
        .               .
   test_0001.wav    hello world

Please note: Automated procedures will be used for the evaluation of the submitted results. Therefore, the column names should be exactly as indicated above.
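
A file with the two required column names can be produced with Python's standard csv module, as in the sketch below. The helper name `write_submission` is hypothetical, and the comma delimiter is an assumption, since the delimiter is not specified here; the prediction is the made-up example from above.

```python
import csv
import io
from typing import Dict


def write_submission(predictions: Dict[str, str], out_file) -> None:
    """Write predictions using the two required column names."""
    writer = csv.writer(out_file)
    writer.writerow(["file_name", "caption_predicted"])
    for file_name, caption in sorted(predictions.items()):
        writer.writerow([file_name, caption])


# Made-up prediction, written to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_submission({"test_0001.wav": "hello world"}, buffer)
submission_text = buffer.getvalue()
```

When writing to disk, opening the file with `newline=""` lets the csv module control the line endings.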

In addition to the CSV files, the participants are asked to update the information of their method in the provided metadata file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the *.yaml file), the submitted system output (the *.csv file), and the technical report! Instead of the system name, you can also use the submission label. Detailed information regarding the challenge submission can be found in the Submission page.

Finally, to support reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. in GitHub) and pre-trained models available after the challenge is over.

Task rules

  • Use of external data (e.g. audio files, text, annotations) is not allowed.
  • Use of pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models) is not allowed.
  • The development dataset (i.e. development-training and development-testing) can be augmented without the use of external data.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.
  • Participants are not allowed to use extra annotations for the provided data.
  • Participants are allowed to use all the available metadata provided, but they must explicitly state whether they use it. This will not affect the rating of their method.
  • Participants are not allowed to use any external description/caption of the audio files/samples.
  • Participants are strictly prohibited from using any additional information for the DCASE evaluation (testing) of their method, apart from the provided audio files from the DCASE Evaluation split.

Evaluation

The submitted systems will be evaluated according to their performance on the withheld evaluation split. For the evaluation, the captions will not have any punctuation and all letters will be lowercase. Therefore, the participants are advised to optimize their methods using captions which do not have punctuation and in which all letters are lowercase. The freely available online tools of the Clotho dataset already provide such functionality.


All of the following metrics will be reported for every submitted method:

  1. BLEU1
  2. BLEU2
  3. BLEU3
  4. BLEU4
  5. ROUGEL
  6. METEOR
  7. CIDEr
  8. SPICE
  9. SPIDEr

Ranking of the methods will be performed according to the SPIDEr metric, which is a combination of CIDEr and SPICE. Specifically, the evaluation will be performed based on the average of CIDEr and SPICE, referred to as SPIDEr and shown to have the combined benefits of CIDEr and SPICE. More information is available in the corresponding paper, available online here.

For a brief introduction and more pointers on the above mentioned metrics, you can refer to the original paper of audio captioning:
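
The averaging itself can be written out in a few lines of Python. This is a minimal sketch; the function name `spider` is a placeholder and not part of the official evaluation tools, which should be used for any reported results. The example values are the CIDEr and SPICE scores reported for the baseline on the development dataset.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# CIDEr and SPICE of the baseline on the development dataset.
score = spider(0.074, 0.033)
```

The mean of these two values is 0.0535, which is consistent with the reported SPIDEr of 0.054 up to rounding.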

Publication

Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen. Automated audio captioning with recurrent neural networks. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, New York, U.S.A., Oct. 2017. URL: https://arxiv.org/abs/1706.10006.


Automated Audio Captioning with Recurrent Neural Networks

Abstract

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.


Results

Here you can see the ranking of the systems and information about the submissions to the task of automated audio captioning.

Submissions

The task of automated audio captioning received multiple submissions from different institutions all over the world. Below you can see the total number of submissions and some statistics regarding the submissions to the task of automated audio captioning.

Total submissions   Teams   Authors   Average authors per team   Institutions   Average institutions per team
34                  10      33        3.3                        13             1.3

Ranking of submissions

The ranking of the submissions is based on the SPIDEr metric, measured on the generated captions of the systems when using the testing split of the Clotho dataset (i.e. the DCASE Challenge evaluation dataset). Below you can see a table with the best submission of each team, and all the submissions ranked according to their SPIDEr metric.

For brevity, only the code of the submission, the name of the corresponding author, the affiliation of the corresponding author, and the achieved SPIDEr value are listed here. Complete results and technical reports can be found in the results page.

Submission code | Author | Affiliation | Technical report | SPIDEr
Wang_PKU_task6_1 | Yuexian Zou | School of ECE, Peking University, Shenzhen, China | task-automatic-audio-captioning-results#wang2020_t6 | 0.196
Shi_SFF_task6_3 | Anna Shi | ShuangFeng First, Beijing, China | task-automatic-audio-captioning-results#shi2020_t6 | 0.121
Wu_UESTC_task6_1 | Qianyang Wu | Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan | task-automatic-audio-captioning-results#wu2020_t6 | 0.012
Naranjo-Alcazar_UV_task6_2 | Javier Naranjo-Alcazar | Computer Science Department, Universitat de Valencia, Burjassot, Spain | task-automatic-audio-captioning-results#naranjoalcazar2020_t6 | 0.150
Xu_SJTU_task6_4 | Xuenan Xu | Computing Sciences and Technology, Shanghai Jiao Tong University, Shanghai, China | task-automatic-audio-captioning-results#xu2020_t6 | 0.194
Sampathkumar_TUC_task6_1 | Arunodhayan Sampathkumar | Juniorprofessur MEDIA COMPUTING, Technische Universität Chemnitz, Chemnitz, Germany | task-automatic-audio-captioning-results#sampathkumar2020_t6 | 0.017
Yuma_NTT_task6_1 | Koizumi Yuma | NTT Corporation, Tokyo, Japan | task-automatic-audio-captioning-results#koizumi2020_t1 | 0.222
Pellegrini_IRIT_task6_2 | Thomas Pellegrini | Computing Sciences, Université Paul Sabatier, Toulouse, France | task-automatic-audio-captioning-results#pellegrini2020_t6 | 0.130
Wu_BUPT_task6_4 | Yusong Wu | Beijing University of Posts and Telecommunications, Beijing, China | task-automatic-audio-captioning-results#wuyusong2020_t6 | 0.214
Kuzmin_MSU_task6_1 | Nikita Kuzmin | Mathematical Methods of Forecasting, Lomonosov Moscow State University, Moscow, Russia | task-automatic-audio-captioning-results#kuzmin2020_t6 | 0.021
Task6_baseline | Konstantinos Drossos | Audio Research Group, Tampere University, Tampere, Finland | | 0.018

Baseline system

To provide a starting point and some initial results for the challenge, there is a baseline system for the task of automated audio captioning. The baseline system is freely available online, is a sequence-to-sequence model, and is implemented using PyTorch.

The baseline system consists of four parts:

  1. the caption evaluation part,
  2. the dataset pre-processing/feature extraction part,
  3. the data handling part for the PyTorch library, and
  4. the deep neural network (DNN) method part.

You can find the baseline system of automated audio captioning task at GitHub.


Caption evaluation

Caption evaluation is performed using a version of the caption evaluation tools used for the MS COCO challenge. This version of the code has been updated in order to be compliant with Python 3.6 and above and with the needs of the automated audio captioning task. The code for the evaluation is included in the baseline system, but can also be found online.


Dataset pre-processing/feature extraction

Clotho data are WAV and CSV files. In order to be used for an audio captioning method, features have to be extracted from the audio clips (i.e. WAV files) and the captions in the CSV files have to be turned into a more computation-friendly form (e.g. one-hot encoding). Finally, the extracted features and processed words have to be matched and used as input-output pairs for optimizing the parameters of an audio captioning method.

In the baseline system there is code that implements the above. This code is also available as stand-alone, in the following repository:


Data handling for PyTorch library

In PyTorch there is the DataLoader class, which offers a convenient way of handling the data iteration. The Clotho dataset has an associated data loader for PyTorch, available in the baseline system and online.


Deep neural network (DNN) method

Finally, the DNN of the baseline is a sequence-to-sequence system, consisting of an encoder and a decoder. The encoder takes as an input 64 log mel-band energies, consists of three bi-directional GRUs, and outputs the summary of the input sequence of features. Each GRU of the encoder has 256 output features.

The input sequence to the encoder (i.e. the extracted audio features) has a different length from the targeted output sequence (i.e. the words). For that reason, there has to be some kind of alignment between these two sequences. Our baseline system does not employ any alignment mechanism. Instead, the encoder outputs the summary vector of the input sequence, and this summary vector is then repeated as an input to the decoder.

The decoder consists of one GRU and one classifier (a trainable affine transform with bias and a non-linearity at the end), accepts the output of the encoder, and outputs a probability for each of the unique words. The decoder iterates for 22 time steps, which is the length of the longest caption.

Please note: The DNN method serves as an example. It is not meant to be used as a solid method for audio captioning or as a basis for further hyper-parameter optimization.

Hyper-parameters

Feature extraction and caption processing: The baseline system uses the following hyper-parameters for the feature extraction:

  • 64 log mel-band energies
  • Hamming window of 46 ms length with 50% overlap
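
As a rough worked example, the window and hop sizes implied by the values above can be computed for 44.1 kHz audio, the sampling rate stated for the Clotho source files. The rounding choices and the frame count without padding are assumptions of this sketch; the exact FFT size and padding used by the baseline are not stated here.

```python
SAMPLE_RATE_HZ = 44_100   # sampling rate stated for the Clotho audio
WINDOW_SECONDS = 0.046    # 46 ms Hamming window
OVERLAP = 0.5             # 50% overlap


def frame_parameters(sample_rate: int, window_seconds: float, overlap: float):
    """Return (window length, hop length) in samples."""
    window_length = round(sample_rate * window_seconds)
    hop_length = int(window_length * (1.0 - overlap))
    return window_length, hop_length


def number_of_frames(n_samples: int, window_length: int, hop_length: int) -> int:
    """Frames that fit fully inside the signal (no padding; an assumption)."""
    if n_samples < window_length:
        return 0
    return 1 + (n_samples - window_length) // hop_length


window_length, hop_length = frame_parameters(SAMPLE_RATE_HZ, WINDOW_SECONDS, OVERLAP)
frames_15s = number_of_frames(15 * SAMPLE_RATE_HZ, window_length, hop_length)
frames_30s = number_of_frames(30 * SAMPLE_RATE_HZ, window_length, hop_length)
```

Under these assumptions the window is about 2029 samples and the hop about 1014 samples, which gives on the order of 650 feature vectors for a 15-second clip and 1300 for a 30-second clip.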

Captions are pre-processed according to the following:

  • Removal of all punctuation
  • All letters to small case
  • Tokenization
  • Pre-pending of start-of-sentence token (i.e. <sos>)
  • Appending of end-of-sentence token (i.e. <eos>)
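
The caption pre-processing steps above can be sketched as follows. This is not the baseline's code: the function name `preprocess_caption` is hypothetical, punctuation is assumed to be the ASCII set in `string.punctuation`, and tokenization is assumed to be a plain whitespace split.

```python
import string
from typing import List


def preprocess_caption(caption: str) -> List[str]:
    """Remove punctuation, lowercase, tokenize, and add <sos>/<eos> tokens."""
    no_punctuation = caption.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punctuation.lower().split()
    return ["<sos>"] + tokens + ["<eos>"]


# Made-up caption, not taken from Clotho.
tokens = preprocess_caption("A dog barks, then a door closes.")
```

Punctuation is stripped before the special tokens are added, so the angle brackets of `<sos>` and `<eos>` are not affected.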

Neural network and optimization hyper-parameters: The deep neural network used in the baseline system has the following hyper-parameters:

  • Three layers of bi-directional gated recurrent units (GRUs).
    • First GRU has 64 input dimensionality and outputs 256 features (i.e. 256 * 2 = 512 for the two directions)
    • Second and third GRUs have 512 input dimensionality and output 256 features (i.e. 256 * 2 = 512 for the two directions)
    • The outputs from the second GRU are added to the inputs of the second GRU before being used as an input to the third GRU (i.e. there is a residual connection between the second and third GRU).
    • The outputs from the third GRU are also added to the input of the third GRU, before being used as an input to the decoder.
  • One GRU layer, with input dimensionality 512 and output 256
  • One trainable affine transform with bias, acting as a classifier, with input dimensionality of 256 and output of 4367 (i.e. the length of the one-hot encoding of the words).

The optimization of the parameters of the DNN is performed using the Adam optimizer for 300 epochs, with a batch size of 16 and the cross-entropy loss. The learning rate of Adam is 10^-4 and, before every weight update, the 2-norm of the gradients is clipped using a threshold value of 2.
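
The gradient clipping step can be illustrated with plain Python lists. This is a sketch of clipping by the 2-norm in general, not the baseline's implementation: the function name `clip_by_global_norm` is hypothetical and the gradient values are made up. In PyTorch a comparable operation is offered by `torch.nn.utils.clip_grad_norm_`; whether the baseline calls it is not stated here.

```python
import math
from typing import List


def clip_by_global_norm(gradients: List[float], max_norm: float = 2.0) -> List[float]:
    """Rescale the gradients so that their 2-norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


clipped = clip_by_global_norm([3.0, 4.0])    # norm 5 is scaled down to norm 2
unchanged = clip_by_global_norm([0.6, 0.8])  # norm 1 is below the threshold
```

Clipping rescales all gradients by the same factor, so the direction of the update is kept and only its magnitude is limited.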

All input audio features and captions in a batch are padded to the longest length in the batch. That is, the input audio features are padded with zero vectors at the beginning, so that all input audio feature sequences have the same number of vectors. The output sequences of words are padded with <eos> tokens at the end, so that all word sequences have the same number of words.
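
The two padding schemes can be sketched with plain Python lists. The function names `pad_features` and `pad_captions` are hypothetical and the toy batch is made up; the baseline's own data handling code is the reference for the actual behaviour.

```python
from typing import List

Vector = List[float]


def pad_features(batch: List[List[Vector]], n_features: int) -> List[List[Vector]]:
    """Prepend zero vectors so every feature sequence has the longest length."""
    longest = max(len(sequence) for sequence in batch)
    zero = [0.0] * n_features
    return [[list(zero) for _ in range(longest - len(seq))] + seq for seq in batch]


def pad_captions(batch: List[List[str]], pad_token: str = "<eos>") -> List[List[str]]:
    """Append <eos> tokens so every caption has the longest length."""
    longest = max(len(caption) for caption in batch)
    return [caption + [pad_token] * (longest - len(caption)) for caption in batch]


features = pad_features([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]], n_features=2)
captions = pad_captions(
    [["<sos>", "rain", "<eos>"], ["<sos>", "a", "dog", "barks", "<eos>"]]
)
```

Padding the features at the beginning keeps the actual audio content at the end of the sequence, next to the point where the encoder produces its summary.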

Results for the development dataset

The results of the baseline system for the development dataset are:

Metric Value
BLEU1 0.389
BLEU2 0.136
BLEU3 0.055
BLEU4 0.015
ROUGEL 0.262
METEOR 0.084
CIDEr 0.074
SPICE 0.033
SPIDEr 0.054

The pre-trained weights for the DNN of the baseline system yielding the above results are freely available on Zenodo:


Citation

If you are participating in this task, you might want to check the following papers.

  • The initial publication on audio captioning:

    Publication

    Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen. Automated audio captioning with recurrent neural networks. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, New York, U.S.A., Oct. 2017. URL: https://arxiv.org/abs/1706.10006.


  • The three-step framework, employed for collecting the annotations of Clotho (if you use the three-step framework, consider citing the paper):

    Publication

    Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.


  • The Clotho dataset (if you use Clotho consider citing the Clotho paper):

    Publication

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Barcelona, Spain, May 2020. URL: https://arxiv.org/abs/1910.09387.
