Automated Audio Captioning


Task description

Automatic creation of textual content descriptions for general audio signals.

The challenge has ended. Full results for this task can be found on the Results page.

Description

Automated audio captioning (AAC) is the task of general audio content description using free text. It is an inter-modal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content-oriented machine-to-machine interaction.

Figure 1: An example of an automated audio captioning system and process.


The task is a continuation of the AAC task from DCASE2023.

This year, the AAC task again allows the usage of external data and/or pre-trained models, under several restrictions (please see the section below). Participants are allowed to use other datasets for AAC, or even datasets for sound event detection/tagging, acoustic scene classification, or any other task that might be deemed fit. Additionally, participants can use pre-trained models, such as (but not limited to) Word2Vec, BERT, and PANNs, wherever they want in their model. Please see below for some recommendations on datasets and pre-trained models.

Audio dataset

The AAC task in DCASE2024 will use the Clotho v2 dataset for the evaluation of the submissions. However, participants can use any other dataset, from any other task, for the development of their methods. In this section, we describe the Clotho v2 dataset.

Clotho dataset

Clotho v2 is an extension of the original Clotho dataset (i.e. v1) and consists of audio samples of 15 to 30 seconds duration, each audio sample having five captions of eight to 20 words in length. There is a total of 6972 audio samples in Clotho (4981 from version 1 and 1991 added in v2), with 34 860 captions (i.e. 6972 audio samples * 5 captions per sample). Clotho v2 is built with a focus on audio content and caption diversity, and the splits of the data do not hamper the training or evaluation of methods. The new data in Clotho v2 do not affect the splits used to assess the performance of methods using the previous version of Clotho (i.e. the evaluation and testing splits of Clotho v1). All audio samples are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed with post-processing.

Clotho v2 has a total of around 4500 words and is divided into four splits: development, validation, evaluation, and testing. Audio samples are publicly available for all four splits, but captions are publicly available only for the development, validation, and evaluation splits. There are no overlapping audio samples between the four splits, and no word appears in the evaluation, validation, or testing split without also appearing in the development split. Likewise, no word appears in the development split without appearing in at least one of the other three splits. All words appear proportionally between splits (the word distribution is kept similar across splits), i.e. 55% in the development, 15% in the validation, 15% in the evaluation, and 15% in the testing split.

Words that could not be divided using the above scheme of 55-15-15-15 (e.g. words that appear only twice in all four splits combined) appear at least once in the development split and at least once in one of the other three splits. This splitting process is similar to the one used for the previous version of Clotho. More detailed information about the splitting process can be found in the paper presenting Clotho, freely available online here and cited as:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.


Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).


The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Audio samples in Clotho

Audio samples have durations ranging from 10s to 30s, no spelling errors in the first sentence of the description on Freesound, good quality (44.1kHz and 16-bit), and no tags on Freesound indicating sound effects, music, or speech. Before extraction, all files were normalized and the preceding and trailing silences were trimmed.

The content of audio samples in Clotho greatly varies, ranging from ambiance in a forest (e.g. water flowing over some rocks), animal sounds (e.g. goats bleating), and crowd yelling or murmuring, to machines and engines operating (e.g. inside a factory) or revving (e.g. cars, motorbikes), and devices functioning (e.g. container with contents moving, doors opening/closing). For a thorough description of how the audio samples were selected and filtered, you can check the paper presenting the Clotho dataset.

The following figure shows the distribution of audio file durations in Clotho v2.

Figure 2: Audio duration distribution for Clotho dataset.


Captions in Clotho

The captions in the Clotho dataset range from 8 to 20 words in length, and were gathered by employing the crowdsourcing platform Amazon Mechanical Turk and a three-step framework. The three steps are:

  1. audio description,
  2. description editing, and
  3. description scoring.

In step 1, five initial captions were gathered for each audio clip from distinct annotators. In step 2, these initial captions were edited to fix grammatical errors; grammatically correct captions were instead rephrased, in order to acquire diverse captions for the same audio clip. In step 3, the initial and edited captions were scored based on accuracy, i.e. how well the caption describes the audio clip, and fluency, i.e. the English fluency of the caption itself. The initial and edited captions were scored by three distinct annotators. The scores were then summed, and the captions were sorted by total accuracy score first and total fluency score second. The top five captions after sorting were selected as the final captions of the audio clip. More information about the caption scoring (e.g. scoring values, scoring threshold) can be found in the corresponding paper on the three-step framework.

We then manually sanitized the final captions of the dataset by removing apostrophes, making compound words consistent, removing phrases describing the content of speech, and replacing named entities. We used in-house annotators to replace transcribed speech in the captions. If a resulting caption was under eight words, we attempted to find captions among the lower-scored ones (i.e. those not selected in step 3) that still had decent accuracy and fluency scores. If there were no such captions, or if these captions could not be rephrased to eight words or more, the audio file was removed from the dataset entirely. The same in-house annotators were also used to replace unique words that appeared only in the captions of one audio clip. Since audio clips are not shared between splits, a word appearing only in the captions of one audio clip would appear in only one split.

A thorough description of the three-step framework can be found in the corresponding paper, freely available online here and cited as:

Publication

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.


Crowdsourcing a Dataset of Audio Captions

Abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.


Development, validation, and evaluation datasets of Clotho

Clotho v2 is divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, an evaluation split of 1045 audio clips with 5225 captions, and a testing split of 1043 audio clips with 5215 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which we can derive the frequency of a given word. With the unique words of audio files as classes, we use multi-label stratification. More information on the splits of Clotho can be found in the corresponding paper.

The names of the Clotho splits differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho and the DCASE challenge is:

Clotho naming of splits    DCASE Challenge naming of splits
development                development
validation                 development
evaluation                 development
testing                    evaluation

Clotho development and validation splits are meant for optimizing audio captioning methods. The performance of the audio captioning methods can then be assessed (e.g. for reporting results in a conference or journal paper) using the Clotho evaluation split. The Clotho testing split is meant only for usage in scientific challenges, e.g. the DCASE challenge. For the rest of this text, the DCASE challenge terminology will be used. To differentiate between the Clotho development, validation, and evaluation splits, the terms development-training, development-validation, and development-testing will be used wherever necessary: development-training refers to the Clotho development split, development-validation refers to the Clotho validation split, and development-testing refers to the Clotho evaluation split.

Clotho download

The DCASE development split of Clotho can be found in the online Zenodo repository. Make sure that you download Clotho v2.1, as it contains some minor fixes to the dataset (file naming and some corrupted files).

To download and set up Clotho, you can use the aac-datasets package:
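For instance, a minimal Python sketch using aac-datasets is shown below; the subset names ("dev", "val", "eval"), the constructor arguments (root, subset, download), and the item keys follow recent versions of the package and may change, so check the package documentation:

from aac_datasets import Clotho

# Download and load the three DCASE development subsets of Clotho.
# aac-datasets naming: "dev" = development-training, "val" = development-validation,
# "eval" = development-testing.
dev_train = Clotho(root=".", subset="dev", download=True)
dev_val = Clotho(root=".", subset="val", download=True)
dev_test = Clotho(root=".", subset="eval", download=True)

# Each item is a dictionary holding (among others) the audio waveform,
# the audio file name, and its five reference captions.
item = dev_train[0]
print(item["fname"], item["captions"])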


Otherwise, you can download the files manually from Zenodo:


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.



The DCASE Evaluation split of Clotho (i.e. Clotho-testing) can be found in the online Zenodo repository, or obtained using aac-datasets (subset name dcase_aac_test).


Evaluation data are:

  • clotho_audio_test.7z: The DCASE evaluation (Clotho-testing) audio clips.
  • clotho_metadata_test.csv: The meta-data of the DCASE evaluation (Clotho-testing) audio clips, containing only authors and licence.

NOTE: Participants are strictly prohibited from using any additional information for the DCASE evaluation (testing) of their method, apart from the provided audio files of the DCASE Evaluation split.

A supplementary dataset will also be used for contrastive evaluation. The objective of these data is to analyze system performance on out-of-domain or noisy audio. Ground-truth captions are currently not available to participants, though they may be released at a future date. Participants are thus encouraged to submit pairs of audio files and generated captions for this subset in the same manner as for the DCASE evaluation set, with the format detailed below. This supplementary subset, Clotho-analysis, can be obtained on Zenodo, or using aac-datasets (subset name dcase_aac_analysis).
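Assuming the same Clotho class and arguments as in the sketch of the download section above, both withheld subsets can be fetched as follows (these subsets provide audio only, without reference captions):

from aac_datasets import Clotho

# Audio-only subsets: Clotho-testing (DCASE evaluation) and Clotho-analysis.
clotho_testing = Clotho(root=".", subset="dcase_aac_test", download=True)
clotho_analysis = Clotho(root=".", subset="dcase_aac_analysis", download=True)

print(len(clotho_testing), len(clotho_analysis))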


Note that results on the Clotho-analysis subset will not affect submission rankings.

Task rules

The assessment of the methods will be performed using the withheld split of Clotho, which is the same as last year's, offering a direct comparison with the results of the AAC task from previous years.

Participants are allowed to:

  • Use external data (e.g. audio files, text, annotations), except if Freesound data is involved (see below).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-validation) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state that they did so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Use Freesound data for training or validation, if these data overlap with the development-testing or evaluation subsets of Clotho (see below).
  • Make subjective judgments of the DCASE evaluation (testing) data, or annotate it.
  • Use any additional information about the DCASE evaluation (testing) data for their method, apart from the provided audio files from the DCASE Evaluation split.
  • Use the development-testing or the evaluation subset to train or validate their models.

Excluded data

Since the Clotho dataset is extracted from the Freesound website, any dataset crowdsourced from this website may overlap with the Clotho evaluation data. To solve this issue, we published a CSV file containing the forbidden Freesound sound ids. If you use any data from Freesound (e.g., through WavCaps or FSD50K), you must exclude these files from your pre-training, training, and validation data.
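A minimal sketch of this filtering step is given below; the CSV file name and its column name are placeholders, so adapt them to the file actually published for the task:

import csv

# Hypothetical file and column names; replace them with those of the published CSV.
with open("task6_forbidden_freesound_ids.csv", newline="") as f:
    forbidden_ids = {row["freesound_id"] for row in csv.DictReader(f)}

# Example external records from a Freesound-based dataset (e.g. WavCaps, FSD50K):
# each record is (freesound_id, audio_path, caption).
external_records = [
    ("12345", "audio/12345.wav", "a dog barks twice"),
    ("67890", "audio/67890.wav", "rain falls on a tin roof"),
]
kept_records = [r for r in external_records if r[0] not in forbidden_ids]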


Submission

All participants should submit:

  • the output of their audio captioning method on the Clotho-testing split (*.csv file),
  • the output of their audio captioning method on the Clotho-analysis split (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file)! To indicate the connection between your files, you can use the following naming convention:

<author>_<institute>_task6_<submission_index>.<testing_output or analysis_output or meta or technical_report>.<csv or yaml or pdf>

For example:

Labbe_IRIT_task6_1.testing_output.csv
Labbe_IRIT_task6_1.analysis_output.csv
Labbe_IRIT_task6_1.meta.yaml
Labbe_IRIT_task6_1.technical_report.pdf

The field <submission_index> differentiates your submissions in case you have more than one.

System output file

The system output file should be a *.csv file, and should have the following two columns:

  1. file_name, which will contain the file name of the audio file.
  2. caption_predicted, which will contain the output of the audio captioning method for the file with file name as specified in the file_name column.

For example, if a file has a file name test_0001.wav and the predicted caption of the audio captioning method for the file test_0001.wav is hello world, then the CSV file should have the entry:

    file_name      caption_predicted
        .               .
        .               .
        .               .
   test_0001.wav    hello world

Please note: automated procedures will be used for the evaluation of the submitted results. Therefore, the column names should be exactly as indicated above.

Two output files are expected for each submitted system: one for the Clotho-testing (DCASE evaluation) split, and one for the Clotho-analysis subset.
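For illustration, a short Python sketch that writes one such file with the required columns (the file name and caption below are placeholders):

import csv

# Mapping from audio file name to the caption predicted by your system (placeholder values).
predictions = {"test_0001.wav": "hello world"}

with open("Labbe_IRIT_task6_1.testing_output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "caption_predicted"])
    writer.writeheader()
    for file_name, caption in predictions.items():
        writer.writerow({"file_name": file_name, "caption_predicted": caption})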

Metadata file

For each system, metadata should be provided in a separate file. It should contain information about your system, including the architecture employed, the number of parameters, the GPU used, the training time, etc. The file format should be as indicated below.

# Submission information for task 6
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Labbe_IRIT_task6_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2024 baseline system
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

# Authors of the submitted system. Mark authors in
# the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
  # First author
  - lastname: Labbé
    firstname: Étienne
    email: etienne.labbe@irit.fr               # Contact email address
    corresponding: true                         # Mark true for one of the authors

    # Affiliation information for the author
    affiliation:
      abbreviation: IRIT
      institute: Institut de Recherche en Informatique de Toulouse
      department: Signaux et Images            # Optional
      location: Toulouse, France

  # Second author
  # ...

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:
    # Audio input / sampling rate
    # e.g., 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 32kHz

    # Acoustic representation
    # Here you should indicate what kind of audio representation
    # you used. If your system used hand-crafted features (e.g.
    # mel band energies), then you can do
    #
    # `acoustic_features: mel energies`
    #
    # Else, if you used some pre-trained audio feature extractor, 
    # you can indicate the name of the system, for example
    #
    # `acoustic_features: cnn10`
    acoustic_features: ConvNeXt-Tiny
    # acoustic_features_url: 

    # Word embeddings
    # Here you can indicate how you treated word embeddings.
    # If your method learned its own word embeddings (i.e. you
    # did not use any pre-trained word embeddings) then you can
    # do
    #
    # `word_embeddings: learned`
    #  
    # Else, specify the pre-trained word embeddings that you used
    # (e.g., Word2Vec, BERT, etc).
    # If possible, please use the fullname of the model involved. (e.g., BART-base)
    word_embeddings: learned

    # Data augmentation methods
    # e.g., mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: mixup + label smoothing

    # Method scheme
    # Here you should indicate the scheme of the method that you
    # used. For example
    machine_learning_method: encoder-decoder

    # Learning scheme
    # Here you should indicate the learning scheme. 
    # For example, you could specify either
    # supervised, self-supervised, or even 
    # reinforcement learning. 
    learning_scheme: supervised

    # Ensemble
    # - Here you should indicate the number of systems involved if you used ensembling.
    # - If you did not use ensembling, just write 1.
    ensemble_num_systems: 1

    # Audio modelling
    # Here you should indicate the type of system used for
    # audio modelling. For example, if you used some stacked CNNs, then
    # you could do
    #
    # audio_modelling: cnn
    #
    # If you used some pre-trained system for audio modelling,
    # then you should indicate the system used (e.g., COALA, COLA,
    # transformer).
    audio_modelling: !!null

    # Word modelling
    # Similarly, here you should indicate the type of system used
    # for word modelling. For example, if you used some RNNs,
    # then you could do
    #
    # word_modelling: rnn
    #
    # If you used some pre-trained system for word modelling, then you should indicate the system used (e.g., transformer).
    word_modelling: transformer

    # Loss function
    # - Here you should indicate the loss function that you employed.
    loss_function: cross_entropy

    # Optimizer
    # - Here you should indicate the name of the optimizer that you used. 
    optimizer: AdamW

    # Learning rate
    # - Here you should indicate the learning rate of the optimizer that you used.
    learning_rate: 5e-4

    # Weight decay
    # - Here you should indicate if you used any weight decay of your optimizer.
    # - Be careful because most optimizers use a non-zero value by default.
    # - Use 0 for no weight decay.
    weight_decay: 2

    # Gradient clipping
    # - Here you should indicate if you used any gradient clipping. 
    # - Use 0 for no clipping.
    gradient_clipping: 1

    # Gradient norm
    # - Here you should indicate the norm of the gradient that you used for gradient clipping.
    # - Use !!null for no clipping.
    gradient_norm: "L2"

    # Metric monitored
    # - Here you should report the monitored metric for optimizing your method.
    # - For example, did you monitor the loss on the validation data (i.e. validation loss)?
    # - Or did you monitor the SPIDEr metric? Maybe the training loss?
    metric_monitored: validation_loss

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # About the amount of parameters used in the acoustic model.
    # - For neural networks, this information is usually given before training process in the network summary.
    # - For other than neural networks, if parameter count information is not directly available, try estimating the count as accurately as possible.
    # - In case embeddings are used, add up parameter count of the embedding extraction networks and classification network
    # - Use numerical value (do not use comma for thousands-separator).
    # - WARNING: In case of ensembling, add up parameters for all subsystems.

    # Learnable parameters
    learnable_parameters: 11914777
    # Frozen parameters (from the feature extractor and other parts of the model)
    frozen_parameters: 29388303
    # Total amount of parameters involved at inference time
    # Unless you used a complex method for prediction (e.g., re-ranking methods that use additional pretrained models), this value is equal to the sum of the learnable and frozen parameters.
    inference_parameters: 41303080

    # Training duration of your entire system in SECONDS.
    # - WARNING: In case of ensembling, add up durations for all subsystems trained.
    duration: 8840
    # Number of GPUs used for training
    gpu_count: 1
    # GPU model name
    gpu_model: NVIDIA GeForce RTX 2080 Ti

    # Optionally, number of multiply-accumulate operations (macs) to generate a caption
    # - You should use the same audio file ('Santa Motor.wav' from Clotho development-testing subset) for fair comparison with other models.
    # - You should include all the operations involved, including: feature extraction, beam search, etc. However, you can exclude operations used to resample the waveform.
    inference_macs: 48762319200

  # List of datasets used for training your system.
  # Unless you also used them to train your captioning system, you do not need to include the datasets used to pretrain your encoder and/or decoder (e.g., AudioSet for ConvNeXt in the baseline).
  # However, you should:
  # - Keep the Clotho development-training if you used it to train your system.
  # - Include here Clotho development-validation subset if you used it to train your system.
  # - Please always specify the correct subset of the dataset involved.
  train_datasets:
    - # Dataset name
      name: Clotho
      # Subset name (DCASE convention for Clotho)
      subset: development-training
      # Audio source (use !!null if not applicable)
      source: Freesound
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.3490683
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Total amount of examples used
      total_audio_length: 3839
      # Used for (e.g., audio_modelling, word_modelling, audio_and_word_modelling)
      used_for: audio_and_word_modelling

  # List of datasets used for validation (checkpoint selection).
  # However, you should:
  # - Keep the Clotho development-validation if you used it to validate your system.
  # - If you did not use any validation dataset, just write `validation_datasets: []`.
  # - Please always specify the correct subset involved.
  validation_datasets:
    - # Dataset name
      name: Clotho
      # Subset name (DCASE convention for Clotho)
      subset: development-validation
      # Audio source (use !!null if not applicable)
      source: Freesound
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.3490683
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Total amount of examples used
      total_audio_length: 1045

  # URL to the source code of the system (optional, write !!null if you do not want to share code)
  source_code: https://github.com/Labbeti/dcase2024-task6-baseline

# System results
results:
  development_testing:
    # System results on the development-testing split.
    # - Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
    # - If you are unable to provide all the results, incomplete results can also be reported.
    # - Each score should contain at least 3 decimals.
    meteor: 0.18979284501354263
    cider: 0.4619283292849137
    spice: 0.1335348395173806
    spider: 0.2977315844011471
    spider_fl: 0.2962828356306173
    fense: 0.5040896972480929
    vocabulary: 551.000

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models available after the challenge is over. A repository link can be specified in the corresponding entry of the metadata file to this effect.

Evaluation

The submitted systems will be evaluated according to their performance on the withheld evaluation split. For evaluation, the captions should not contain any punctuation marks, and all letters will be lowercased. Therefore, participants are advised to optimize their methods using captions without punctuation marks and in lowercase.
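As an illustration, a minimal normalization sketch along these lines (the exact pre-processing applied by the evaluation scripts may differ, e.g. in how hyphens or apostrophes are handled):

import re
import string

def normalize_caption(caption: str) -> str:
    # Lowercase, remove punctuation, and collapse repeated whitespace.
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", caption).strip()

print(normalize_caption("A clock rings, three times!"))  # -> "a clock rings three times"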

All of the following metrics will be reported for every submitted method:

  1. METEOR
  2. CIDEr
  3. SPICE
  4. SPIDEr
  5. SPIDEr with fluency error detection, denoted SPIDEr-FL
  6. FENSE
  7. Vocabulary

Methods will be ranked according to the FENSE metric. Further information on FENSE can be found in the corresponding paper, available online here.

Several contrastive metrics will also be reported. These include the previous main metric, SPIDEr, as well as some recent model-based captioning metrics. This year, we also added Vocabulary, which corresponds to the number of word types (unique words) used by the system across all caption candidates generated on the development-testing subset. Note that these additional metrics are contrastive and will not affect the team rankings.

All these metrics can be computed with the aac-metrics package:
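A minimal sketch using the package's evaluate entry point is given below (assuming recent versions of aac-metrics; SPICE and METEOR additionally require a Java runtime, and the exact metric names returned may vary by version):

from aac_metrics import evaluate

# One candidate caption per audio file, and the list of reference captions for each file.
candidates = ["a man is speaking while rain falls"]
mult_references = [[
    "a man speaks as rain pours down",
    "rain falls while a man is talking",
    "someone talks during heavy rain",
    "a person speaks and rain hits the ground",
    "a man talks while it is raining",
]]

corpus_scores, sentence_scores = evaluate(candidates, mult_references)
print(corpus_scores)  # corpus-level scores such as METEOR, CIDEr-D, SPICE, SPIDEr, FENSE, ...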


Optional: report Multiply–ACcumulate operations (MACs)

This year, we are interested in the complexity of the submitted systems, in terms of Multiply–ACcumulate operations (MACs). Please use the audio file "Santa Motor.wav" from the Clotho development-testing subset as input to estimate this value.

In the baseline source code, we use DeepSpeed as the framework to compute this number. For more information on how to use DeepSpeed, the reader may refer to the DeepSpeed flops profiler documentation. The number of learnable and frozen parameters of the system must still be reported in the metadata file.
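A hedged sketch of the profiler call is shown below; the placeholder module only stands in for a full captioning system, and the get_model_profile arguments should be adapted to your own model's forward() signature:

import torch
import torchaudio
from deepspeed.profiling.flops_profiler import get_model_profile

# Placeholder module standing in for a complete captioning system (replace it with your model,
# whose forward() should cover feature extraction and caption generation).
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 8, kernel_size=1024, stride=512),
    torch.nn.AdaptiveAvgPool1d(1),
)

audio, sr = torchaudio.load("Santa Motor.wav")            # file required by the task instructions
waveform = audio.mean(dim=0, keepdim=True).unsqueeze(0)   # (batch=1, channels=1, samples)

flops, macs, params = get_model_profile(
    model=model,
    args=(waveform,),       # adapt to your system's forward() inputs
    print_profile=False,
    as_string=False,
)
print("inference_macs:", macs)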

Results

Complete results and technical reports can be found in the results page.

Baseline system: updated for the 2024 edition

To help participants engage with the challenge, we provide an open source baseline system, implemented using PyTorch and PyTorch-Lightning. You can find the baseline system on GitHub:


Deep neural network (DNN) method

The baseline system is a sequence-to-sequence model consisting of an audio encoder (a frozen ConvNeXt), a projection layer, and a decoder (a transformer).

The ConvNeXt encoder has been pretrained on AudioSet. The weights of convnext_tiny_465mAP_BL_AC_70kit.pth are automatically downloaded in the baseline recipe.

The ConvNeXt output representations are passed through a sequence of layers comprising Dropout, Linear, ReLU, and another Dropout.

These features are then given to a vanilla TransformerDecoder implemented in PyTorch, together with the previously generated words, in order to produce the probability distribution over the next word.
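A minimal PyTorch sketch of this projection-plus-decoder arrangement is given below; the dimensions, number of layers, and vocabulary size are illustrative assumptions rather than the exact baseline hyper-parameters, and positional encodings are omitted for brevity:

import torch
import torch.nn as nn

d_frame, d_model, vocab_size = 768, 256, 4370   # illustrative sizes only

projection = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(d_frame, d_model), nn.ReLU(), nn.Dropout(0.5),
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
word_embedding = nn.Embedding(vocab_size, d_model)
classifier = nn.Linear(d_model, vocab_size)

# audio_features: frame-wise embeddings from the frozen ConvNeXt encoder (batch, frames, d_frame).
audio_features = torch.randn(2, 31, d_frame)
prev_words = torch.randint(0, vocab_size, (2, 12))   # previously generated token ids

memory = projection(audio_features)
tgt = word_embedding(prev_words)
seq_len = prev_words.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
hidden = decoder(tgt=tgt, memory=memory, tgt_mask=causal_mask)
next_word_logits = classifier(hidden)   # distribution over the next word at each position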

The model is trained for 400 epochs on the Clotho development-training subset. Batch size is set to 64, with a gradient accumulation of 8 (effective batch size is 512). We chose the AdamW optimizer, with a learning rate starting from 5e-4, and following a cosine decay updated at the end of each epoch. Weight decay is set to 2, which greatly reduces overfitting. Gradient clipping using L2 norm is applied with a value of 1. Label smoothing is applied by reducing the highest probability by 0.2. Mixup between audio and previous words' features is applied to add noise to the decoder inputs. Finally, beam search is used during inference, with a beam size of 3 and a constraint to avoid the repetition of non-stopwords.
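The optimization setup can be sketched as follows; the placeholder model and synthetic batch are only there to make the snippet self-contained, the CrossEntropyLoss label_smoothing argument approximates the label smoothing described above, and gradient accumulation, mixup, and beam search are omitted:

import torch
import torch.nn as nn

model = nn.Linear(16, 8)   # placeholder for the full captioning system
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=2.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)

for epoch in range(400):
    logits = model(torch.randn(4, 16))                  # stand-in for one training batch
    loss = criterion(logits, torch.randint(0, 8, (4,)))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # L2 gradient clipping at 1
    optimizer.step()
    scheduler.step()                                    # cosine decay stepped once per epoch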

Results for the development dataset

The results of the baseline system for the development-testing subset of Clotho v2.1 are:

Metric      Value
METEOR      0.1897
CIDEr       0.4619
SPICE       0.1335
SPIDEr      0.2977
SPIDEr-FL   0.2962
Vocabulary  551
FENSE       0.5040

Please download the trained weights of the baseline system here:


If needed, pre-trained weights of ConvNeXt variants are also available on Zenodo:


Citations

If you are participating in this task, you might want to check the following papers. If you find a paper that should be cited here but is not (e.g. a paper for one of the suggested resources), please contact us.

  • The initial publication on audio captioning:
Publication

Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen. Automated audio captioning with recurrent neural networks. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, New York, U.S.A., Oct. 2017. URL: https://arxiv.org/abs/1706.10006.


Automated Audio Captioning with Recurrent Neural Networks

Abstract

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

  • The three-step framework, employed for collecting the annotations of Clotho (if you use the three-step framework, consider citing the paper):
Publication

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.


Crowdsourcing a Dataset of Audio Captions

Abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.

  • The Clotho dataset (if you use Clotho consider citing the Clotho paper):
Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.


Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

  • The paper presenting the original ConvNeXt architecture (trained for image classification):
Publication

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11966–11976. 2022. doi:10.1109/CVPR52688.2022.01167.


A ConvNet for the 2020s

  • The paper presenting the ConvNeXt model trained for audio classification:
Publication

Thomas Pellegrini, Ismail Khalfaoui-Hassani, Etienne Labbé, and Timothée Masquelier. Adapting a ConvNeXt Model to Audio Classification on AudioSet. In Proc. INTERSPEECH 2023, 4169–4173. 2023. URL: https://www.isca-speech.org/archive/pdfs/interspeech_2023/pellegrini23_interspeech.pdf, doi:10.21437/Interspeech.2023-1564.


Adapting a ConvNeXt Model to Audio Classification on AudioSet

  • The baseline model is further described in the following paper, under the name "CNext-trans". The only difference from the paper is the removal of the SpecAugment augmentation, which is not used to train the proposed baseline system.
Publication

Etienne Labbé, Thomas Pellegrini, and Julien Pinquier. CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding. 2023. URL: https://arxiv.org/abs/2309.00454, arXiv:2309.00454.


CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
