Language-Based Audio Retrieval


Task description

Audio retrieval with human-written captions.

If you are interested in the task, you can join us on the dedicated Slack channel.
Evaluation data collection is ongoing. The evaluation data will be available by the end of April 2025.

Description

This subtask focuses on retrieving audio signals based on their textual descriptions, also known as audio captions. Human-written audio captions will serve as text queries. For each query, the objective is to retrieve a set of audio files from a given dataset and rank them according to their relevance to the query. This subtask aims to stimulate further research in language-based audio retrieval using unconstrained textual descriptions.

Figure 1: Overview of Language-based Audio Retrieval.


Participants are permitted to use pre-trained models and external data for training their models. This includes pre-trained models for extracting embeddings from audio and/or captions, as well as pre-optimized methods for natural language processing, such as part-of-speech (POS) tagging. Additionally, participants may use external audio and/or textual data, such as external text corpora for training language models or additional audio datasets such as AudioSet or ESC-50.

Audio dataset

The development dataset for this task is Clotho v2.1, as in previous years.

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds duration, with each audio sample having five captions of eight to 20 words in length. There are 6974 audio samples in total, with 34,870 captions (i.e., 6974 audio samples × 5 captions per sample). All audio samples are from the Freesound platform, and the captions were crowdsourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see the following publication:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 736–740, 2020.


Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).


The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Development dataset

The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19,195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
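
For illustration only (this is not the official split-generation code), the sketch below shows how such a multi-label stratified split over caption words could be produced with scikit-learn and scikit-multilearn; the toy caption dictionary, the tokenisation, and the 15% test fraction are assumptions made for the example.

# Illustrative sketch: multi-label stratified splitting over caption words.
# The toy data, tokenisation, and split ratio are placeholders, not the
# official Clotho procedure.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection import iterative_train_test_split

clip_captions = {
    "clip_0001.wav": ["a dog barks in the distance", "dog barking outdoors"],
    "clip_0002.wav": ["rain falls on a metal roof", "heavy rain hits a roof"],
    # ... one entry per audio clip (five captions each in Clotho)
}

clip_ids = list(clip_captions)
word_sets = [
    {word for caption in captions for word in caption.lower().split()}
    for captions in clip_captions.values()
]

# The unique caption words of each clip act as its (multi-)labels.
Y = MultiLabelBinarizer().fit_transform(word_sets)
X = np.arange(len(clip_ids)).reshape(-1, 1)  # indices into clip_ids

# Iterative stratification keeps word frequencies balanced across the splits.
X_train, Y_train, X_test, Y_test = iterative_train_test_split(X, Y, test_size=0.15)
train_clips = [clip_ids[i] for i in X_train.ravel()]
test_clips = [clip_ids[i] for i in X_test.ravel()]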

Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. The following table provides the correspondence of splits between Clotho v2 and DCASE challenge terminology.

Clotho naming of splits    DCASE Challenge naming of splits
development                development-training
validation                 development-validation
evaluation                 development-testing

For the rest of this text, the DCASE challenge terminology will be used. To differentiate between the Clotho splits, the terms development-training, development-validation, and development-testing will be used wherever necessary: development-training refers to the Clotho development split, development-validation to the Clotho validation split, and development-testing to the Clotho evaluation split.

The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).


New Annotations for the Development-Testing Dataset

As a new addition to the Clotho dataset, we provide new annotations specifically for the development-testing dataset of this task. The development-testing dataset consists of 1,069 text queries (audio captions) and their corresponding audio files. For each query, multiple audio recordings are marked as relevant.

Annotations for the development-testing dataset are provided in a separate CSV file.

Note: Data collection for the evaluation dataset is ongoing. The evaluation dataset will be available by the end of April 2025.

Annotation Process

The relevance annotations were collected on Amazon Mechanical Turk. To generate a list of potentially relevant audio files for each query, we utilised last year's winning submission (Primus_CP-JKU_8_1). This system provided a ranked list of 15 audio files for each query, in addition to the ground-truth audio file. Annotators were asked to indicate the relevance of each audio file with respect to the query.

Evaluation dataset

The evaluation dataset builds on last year's DCASE 2024 Challenge (task 8). It comprises 1,000 text queries (audio captions). Similar to the development-testing dataset, submissions are evaluated based on the relevance of the retrieved audio files. Multiple relevant audio files are indicated for each query. The evaluation data are provided to the participants in the form of audio files and captions without any additional information.

Download

The development dataset is available for download from the Zenodo repository.

Task Rules

Participants are allowed to:

  • Use external resources (datasets, pre-trained models) under the conditions specified in the External Resources section.
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state which metadata they used. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Use Freesound data for training or validation if these data overlap with the development-testing and the evaluation subsets of Clotho (see below).
  • Make subjective judgments of the evaluation (testing) data, or annotate it.
  • Use additional information about the evaluation (testing) data for their method, apart from the provided audio files and captions.

External Resources

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

  • The task coordinators have approved the resource and shared it on the task webpage. To this end, please email the task coordinators. The list of allowed external resources will be finalised on May 18 (no further external resources will be allowed after this date).
  • The external resource must be freely accessible to any other research group in the world before May 18, 2025.
  • The list of external resources used must be clearly indicated in the technical report.
  • The use of large-language model (LLM) APIs is allowed, provided they are reasonably accessible to everyone and incur minimal costs. For instance, a reasonable price would be equivalent to the total subscription cost of a leading provider, approximately 15 to 25 USD, over the duration of the challenge (2.5 months).

List of external data resources allowed:

Resource name                                  Type            Added       Link
PaSST                                          model           01.04.2025  https://github.com/kkoutini/PaSST
AudioSet                                       audio, video    01.04.2025  https://research.google.com/audioset/
FSD50K                                         audio, tags     01.04.2025  https://zenodo.org/record/4060432
MACS - Multi-Annotator Captioned Soundscapes   audio, caption  01.04.2025  https://doi.org/10.5281/zenodo.5114770
WavCaps                                        audio, caption  01.04.2025  https://huggingface.co/datasets/cvssp/WavCaps
AudioCaps                                      audio, caption  01.04.2025  https://audiocaps.github.io/
BERT                                           text, caption   01.04.2025  https://arxiv.org/abs/1810.04805
RoBERTa                                        text, caption   01.04.2025  https://arxiv.org/abs/1907.11692


Excluded data

Since the Clotho dataset is extracted from the Freesound website, any dataset crowdsourced from this website may overlap with the Clotho evaluation data. To address this, we have published a CSV file containing the forbidden Freesound sound ids. Note that this year's list contains more recordings than last year's. If you use any data from Freesound (e.g. through WavCaps or FSD50K), you must exclude these recordings from your pre-training, training, and validation data.
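
As a minimal sketch of how this exclusion could be applied in practice: assuming the published CSV exposes the forbidden Freesound sound ids in a column named sound_id, and that the metadata of the external dataset carries a Freesound id per clip in a column named freesound_id (both column names and the file paths are illustrative assumptions), the overlapping recordings could be dropped as follows.

# Hypothetical sketch: remove forbidden Freesound recordings from external
# training metadata. Column names and file paths are assumptions for
# illustration, not the official format of the provided CSV.
import pandas as pd

forbidden_ids = set(pd.read_csv("forbidden_freesound_ids.csv")["sound_id"].astype(int))

external_meta = pd.read_csv("external_dataset_metadata.csv")  # e.g. WavCaps or FSD50K metadata
before = len(external_meta)
external_meta = external_meta[~external_meta["freesound_id"].astype(int).isin(forbidden_ids)]
print(f"Removed {before - len(external_meta)} overlapping clips from the training pool.")
external_meta.to_csv("external_dataset_metadata_filtered.csv", index=False)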


Submission

All participants should submit:

  • the output of their audio retrieval system in the form of a similarity matrix (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to four system output submissions per participating team. For each system, metadata should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection of your files, you can consider using the following naming convention:

<author>_<institute>_task6_<submission_index>_<output or meta or technical_report>.<csv or yaml or pdf>

For example:

Primus_CP-JKU_task6_1.output.csv
Primus_CP-JKU_task6_1.meta.yaml
Primus_CP-JKU_task6_1.technical_report.pdf

The field <submission_index> is used to differentiate your submissions in case you have multiple submissions.
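
For illustration, the three files of one submission could be packaged into a zip archive along these lines (a minimal sketch using the example names above):

# Minimal packaging sketch following the naming convention above.
import zipfile

base = "Primus_CP-JKU_task6_1"
files = [f"{base}.output.csv", f"{base}.meta.yaml", f"{base}.technical_report.pdf"]

with zipfile.ZipFile(f"{base}.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name in files:
        archive.write(name)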

System output file

The expected system output is a *.csv file holding all similarities between the text queries and the audio files in the dataset:

Index        Filename_0       Filename_1       Filename_2       Filename_...     Filename_N
Caption_0    Similarity_00    Similarity_01    Similarity_02    Similarity_0...  Similarity_0N
...          ...              ...              ...              ...              ...
Caption_M    Similarity_M0    Similarity_M1    Similarity_M2    Similarity_M...  Similarity_MN
  • The table must include one column for each of the N audio files and one row for each of the M queries.
  • It must further include a header row that specifies which audio file each column corresponds to.
  • The leftmost column contains the text queries.
  • The individual cells should give the estimated similarity between text queries and audio recordings.
  • Higher similarity scores indicate a stronger correspondence between the textual query and the audio file.
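
As a sketch of how such a file could be produced (the captions, audio file names, and random scores below are placeholders; a real system would fill the matrix with model-predicted similarities):

# Minimal sketch of writing the system output CSV: one row per text query,
# one column per audio file, each cell holding a query-audio similarity.
import numpy as np
import pandas as pd

captions = ["a dog barks while birds sing", "rain falls on a tin roof"]   # M queries
audio_files = ["clip_0001.wav", "clip_0002.wav", "clip_0003.wav"]         # N audio files

# Placeholder scores; in a real system these would come from your model,
# e.g. cosine similarities between caption and audio embeddings.
scores = np.random.default_rng(0).random((len(captions), len(audio_files)))

pd.DataFrame(scores, index=captions, columns=audio_files).to_csv(
    "Primus_CP-JKU_task6_1.output.csv"
)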

Metadata file

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 6
submission:
    # Submission label
    # The label is used to index submissions.
    # Generate your label in the following way to avoid overlapping labels among submissions:
    # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
    label: Primus_CP-JKU_task6_1
    #
    # Submission name
    # This name will be used in the results tables when space permits
    name: DCASE2025 baseline system
    #
    # Submission name abbreviated
    # This abbreviated name will be used in the result table when space is tight.
    # Use maximum 10 characters.
    abbreviation: Baseline

    # Authors of the submitted system.
    # Mark authors in the order you want them to appear in submission lists.
    # One of the authors has to be marked as corresponding author,
    # this will be listed next to the submission in the results tables.
    authors:
        # First author
        -   lastname: Primus
            firstname: Paul
            email: paul.primus@jku.at                    # Contact email address
            corresponding: true                         # Mark true for one of the authors

            # Affiliation information for the author
            affiliation:
                abbreviation: CP-JKU
                institute: Johannes Kepler University
                department: Institute of Computational Perception
                location: Linz, Austria

        # Second author
        -   lastname: Author
            firstname: Second
            email: first.last@some.org

            affiliation:
                abbreviation: ORG
                institute: Some Organization
                department: Department of Something
                location: City, Country

# System information
system:
    # System description, meta-data provided here will be used to do meta analysis of the submitted system.
    # Use general level tags, when possible use the tags provided in comments.
    # If information field is not applicable to the system, use "!!null".
    description:

        # Audio input / sampling rate, e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
        input_sampling_rate: 44.1kHz

        # Acoustic representation
        # Here you should indicate what kind of audio representation you used.
        # If your system used hand-crafted features (e.g. mel band energies), then you can do:
        #
        # `acoustic_features: mel energies`
        #
        # Else, if you used some pre-trained audio feature extractor, you can indicate the name of the system, for example:
        #
        # `acoustic_features: audioset`
        acoustic_features: log-mel energies

        # Text embeddings
        # Here you can indicate how you treated text embeddings.
        # If your method learned its own text embeddings (i.e. you did not use any pre-trained or fine-tuned NLP embeddings),
        # then you can do:
        #
        # `text_embeddings: learned`
        #
        # Else, specify the pre-trained or fine-tuned NLP embeddings that you used, for example:
        #
        # `text_embeddings: Sentence-BERT`
        text_embeddings: Sentence-BERT

        # Data augmentation methods for audio
        # e.g. mixup, time stretching, block mixing, pitch shifting, ...
        audio_augmentation: !!null

        # Data augmentation methods for text
        # e.g. random swapping, synonym replacement, ...
        text_augmentation: !!null

        # Learning scheme
        # Here you should indicate the learning scheme.
        # For example, you could specify either supervised, self-supervised, or even reinforcement learning.
        learning_scheme: self-supervised

        # Ensemble
        # Here you should indicate if you used ensemble of systems or not.
        ensemble: No

        # Audio modelling
        # Here you should indicate the type of system used for audio modelling.
        # For example, if you used some stacked CNNs, then you could do:
        #
        # audio_modelling: cnn
        #
        # If you used some pre-trained system for audio modelling, then you should indicate the system used,
        # for example, PANNs-CNN14, PANNs-ResNet38.
        audio_modelling: PANNs-CNN14

        # Text modelling
        # Similarly, here you should indicate the type of system used for text modelling.
        # For example, if you used some RNNs, then you could do:
        #
        # text_modelling: rnn
        #
        # If you used some pre-trained system for text modelling,
        # then you should indicate the system used (e.g. BERT).
        text_modelling: Sentence-BERT

        # Loss function
        # Here you should indicate the loss function that you employed.
        loss_function: InfoNCE

        # Optimizer
        # Here you should indicate the name of the optimizer that you used.
        optimizer: adam

        # Learning rate
        # Here you should indicate the learning rate of the optimizer that you used.
        learning_rate: 1e-3

        # Metric monitored
        # Here you should report the monitored metric for optimizing your method.
        # For example, did you monitor the loss on the validation data (i.e. validation loss)?
        # Or did you monitor the training mAP?
        metric_monitored: validation_loss

    # System complexity, meta-data provided here will be used to evaluate
    # submitted systems from the computational load perspective.
    complexity:
        # Total amount of parameters used in the acoustic model.
        # For neural networks, this information is usually given before training process in the network summary.
        # For other than neural networks, if parameter count information is not directly
        # available, try estimating the count as accurately as possible.
        # In case of ensemble approaches, add up parameters for all subsystems.
        # In case embeddings are used, add up parameter count of the embedding
        # extraction networks and classification network
        # Use numerical value (do not use comma for thousands-separator).
        total_parameters: 732354

    # List of datasets used for the system (e.g., pre-training, fine-tuning, training).
    # Development-training data is used here only as an example.
    training_datasets:
        -   name: Clotho-development
            purpose: training                           # Used for training system
            url: https://doi.org/10.5281/zenodo.4783391
            data_types: audio, caption                  # Contained data types, e.g., audio, caption, label.
            data_instances:
                audio: 3839                             # Number of contained audio instances
                caption: 19195                          # Number of contained caption instances
            data_volume:
                audio: 86353                            # Total duration (in seconds) of the audio instances
                caption: 6453                           # Total number of unique word types in the caption instances

        # More datasets
        #-   name:
        #    purpose: pre-training
        #    url:
        #    data_types: A, B, C
        #    data_instances:
        #        A: xxx
        #        B: xxx
        #        C: xxx
        #    data_volume:
        #        A: xxx
        #        B: xxx
        #        C: xxx

    # List of datasets used for validating the system, for example, optimizing hyperparameter.
    # Development-validation data is used here only as an example.
    validation_datasets:
        -   name: Clotho-validation
            url: https://doi.org/10.5281/zenodo.4783391
            data_types: audio, caption
            data_instances:
                audio: 1045
                caption: 5225
            data_volume:
                audio: 23636
                caption: 2763

        # More datasets
        #-   name:
        #    url:
        #    data_types: A, B, C
        #    data_instances:
        #        A: xxx
        #        B: xxx
        #        C: xxx
        #    data_volume:
        #        A: xxx
        #        B: xxx
        #        C: xxx

    # URL to the source code of the system [optional]
    source_code: https://github.com/OptimusPrimus/

# System results
results:
    development_testing:
        # System results for the development-testing split.
        # Full results are not mandatory; however, they are highly recommended, as they are needed for a thorough analysis of the challenge submissions.
        # If you are unable to provide all results, incomplete results can also be reported.
        R@1: 0.0
        R@5: 0.0
        R@10: 0.0
        mAP@10: 0.0
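
Before packaging, it can be worth sanity-checking that the metadata file parses as valid YAML and contains the expected top-level sections, for example (a minimal sketch using PyYAML):

# Quick sanity check that the submission metadata parses as valid YAML
# and contains the top-level sections shown in the template above.
import yaml

with open("Primus_CP-JKU_task6_1.meta.yaml") as f:
    meta = yaml.safe_load(f)

for key in ("submission", "system", "results"):
    assert key in meta, f"missing top-level section: {key}"
print("label:", meta["submission"]["label"])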

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models available after the challenge is over.

Evaluation

The submitted systems will be evaluated according to their performance, i.e., mean average precision at K (mAP@K) and recall at K (R@K), on the withheld evaluation dataset. An explanation of these metrics can be found on the Wikipedia page on evaluation measures in information retrieval, in the IR book, and in a blog post.

The mAP@K is a rank-aware metric: it averages, over all caption queries, the average precision of the relevant audio files among the top-K retrieved results. The R@K is a rank-unaware metric: for each caption query, it measures the proportion of that query's relevant audio files that appear among the top-K retrieved results. Submitted methods will be ranked by the mAP@16 metric.
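
As an informal sketch of the per-query computations (not the official evaluation code, which may differ in details such as tie-breaking and normalisation): ranked is the list of audio file names ordered by decreasing similarity, and relevant is the set of files annotated as relevant for that query.

# Informal sketch of the per-query retrieval metrics; mAP@K and R@K over the
# whole query set are the means of these per-query values.

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the query's relevant files that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Average of precision@i over the ranks i <= k where a relevant file is retrieved."""
    hits, score = 0, 0.0
    for i, name in enumerate(ranked[:k], start=1):
        if name in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

# Toy example: two relevant files, retrieved at ranks 1 and 4.
ranked = ["a.wav", "b.wav", "c.wav", "d.wav"]
relevant = {"a.wav", "d.wav"}
print(recall_at_k(ranked, relevant, 2), average_precision_at_k(ranked, relevant, 16))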

Baseline system

The code for the baseline system of task 6 in the DCASE 2025 Challenge is available in our GitHub repository.


New in 2025:

Some technical highlights:

Results on the development-testing split

The results of the baseline system on the development-testing split are given in the table below.

Metric    Value
R@1       0.2329
R@5       0.5217
R@10      0.6478
mAP@10    0.3523

For more detailed results, have a look at our GitHub repository.

Citations

If you participate in this task, you might want to check the following papers. If you find a paper that should be cited here, please contact us.

  • The Clotho dataset:
Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 736–740, 2020.
