Language-Based Audio Retrieval


Task description

Audio retrieval with human written captions.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.

Figure 1: Overview of Language-based Audio Retrieval.


This subtask will allow using pre-trained models and external data for training models. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing like part-of-speech (POS) tagging. Additionally, the participants can use external audio and/or textual data, e.g., external text corpus for learning a language model or additional audio data like AudioSet, ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2023 Challenge (task 6b). Specifically, each caption will be treated as a text query, and the corresponding audio file as the relevant item in retrieval.

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds in duration, each with five captions of eight to 20 words in length. There are 6974 audio samples in total, with 34870 captions (i.e. 6974 audio samples * 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing see

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Task setup

Development dataset

The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
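The first steps of the split construction described above can be sketched as follows. This is a minimal illustration, not the code used to build Clotho: the tokenization rule, the helper names, and the toy captions are assumptions, frequency is counted here as the number of clips containing a word, and the multi-label stratification step itself is not implemented.

```python
from collections import Counter
import re


def unique_words(captions):
    """Return the set of lowercase word tokens over all captions of one clip."""
    words = set()
    for caption in captions:
        # Assumed tokenization: runs of letters and apostrophes.
        words.update(re.findall(r"[a-z']+", caption.lower()))
    return words


def word_frequencies(captions_per_clip):
    """Count, for each word, the number of clips whose captions contain it."""
    counts = Counter()
    for captions in captions_per_clip.values():
        counts.update(unique_words(captions))
    return counts


# Toy captions (hypothetical, not taken from Clotho).
toy = {
    "a.wav": ["A dog barks loudly", "A dog is barking"],
    "b.wav": ["Rain falls on a roof", "Heavy rain on the roof"],
}
freq = word_frequencies(toy)
print(freq.most_common(3))
```

The per-clip word sets would then serve as the multi-label classes that a stratified splitting procedure balances across the splits.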

Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho v2 and the DCASE challenge is:

Clotho naming of splits    DCASE Challenge naming of splits
development                development-training
validation                 development-validation
evaluation                 development-testing

For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to Clotho development split, development-validation refers to Clotho validation split, and development-testing refers to Clotho evaluation split.

The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.
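Since each caption is treated as a text query and its audio file as the relevant item, a captions file can be flattened into (query, relevant file) pairs. The sketch below assumes the captions CSV has a `file_name` column followed by `caption_1`, `caption_2`, ... columns; check the header of the downloaded files before relying on it. The in-memory sample stands in for a real file.

```python
import csv
import io


def load_queries(csv_file):
    """Yield (caption, file_name) pairs, one per caption in the captions CSV."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        for key, value in row.items():
            # Assumed layout: caption columns are named caption_1, caption_2, ...
            if key and key.startswith("caption_") and value:
                yield value, row["file_name"]


# In-memory stand-in for a captions file; the column names are an assumption.
sample = io.StringIO(
    "file_name,caption_1,caption_2\n"
    "rain.wav,Rain falls on a roof,Heavy rain hits a tin roof\n"
)
pairs = list(load_queries(sample))
print(pairs)
```

With a real file, the same function could be called on `open("clotho_captions_development.csv", newline="", encoding="utf-8")`.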

Evaluation dataset

The evaluation dataset, the same as in the DCASE 2023 Challenge (task 6b), is reused for evaluation this year. It consists of 1000 audio samples, with one caption each. The data are collected following the same procedure as Clotho v2.


Task rules

Participants are allowed to:

  • Use external data (e.g. audio files, text, annotations), except if Freesound data is involved (see below).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state that they do so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Use Freesound data for training or validation, if these data overlap with the development-testing and the evaluation subsets of Clotho (see below).
  • Make subjective judgments of the evaluation (testing) data, or annotate it.
  • Use additional information of the evaluation (testing) data for their method, apart from the provided audio files and captions from the evaluation data.

Excluded data

Since the Clotho dataset is extracted from the Freesound website, any dataset crowdsourced from this website may overlap with the Clotho evaluation data. To solve this issue, we published a CSV file containing the forbidden sound ids of Freesound (see also Task 6). So if you use any data from Freesound (e.g., through WavCaps or FSD50K), you have to exclude the sounds with these ids from your pretraining, training and validation data.
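One way to apply the exclusion is to load the forbidden ids into a set and filter the external data against it. The sketch below is illustrative only: the column name `sound_id`, the `freesound_id` key on each external item, and the sample ids are hypothetical, so adapt them to the published CSV file and to the metadata of the external dataset you use.

```python
import csv
import io


def load_forbidden_ids(csv_file, column="sound_id"):
    """Read the forbidden Freesound ids from a CSV file into a set of strings."""
    # The column name is an assumption; check the header of the published file.
    return {row[column].strip() for row in csv.DictReader(csv_file)}


def filter_external(items, forbidden_ids):
    """Keep only the external items whose Freesound id is not forbidden."""
    return [item for item in items if str(item["freesound_id"]) not in forbidden_ids]


# Hypothetical ids and items, for illustration only.
forbidden = load_forbidden_ids(io.StringIO("sound_id\n1001\n1002\n"))
external = [
    {"freesound_id": 1001, "caption": "a door slams"},
    {"freesound_id": 2002, "caption": "birds sing in a park"},
]
kept = filter_external(external, forbidden)
print(len(kept), "of", len(external), "items kept")
```

Comparing ids as strings avoids mismatches between integer ids in one source and text ids in the other.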


Submission

All participants should submit:

  • the output of their audio retrieval with the evaluation data (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection of your files, you can consider using the following naming convention:

<author>_<institute>_task8_<submission_index>_<output or meta or technical_report>.<csv or yaml or pdf>

For example:

Xie_TAU_task8_1.output.csv
Xie_TAU_task8_1.meta.yaml
Xie_TAU_task8_1.technical_report.pdf

The field <submission_index> differentiates your submissions in case you have multiple submissions.
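The three file names of one submission can be generated from a single label so that they stay consistent. This small helper is hypothetical and follows the example file names given above; the limit of 4 reflects the maximum number of system submissions per team.

```python
def submission_files(author, institute, index, task="task8"):
    """Build the output, metadata and report file names for one submission."""
    if not 1 <= index <= 4:
        raise ValueError("submission index must be between 1 and 4")
    label = f"{author}_{institute}_{task}_{index}"
    return {
        "label": label,
        "output": f"{label}.output.csv",
        "meta": f"{label}.meta.yaml",
        "technical_report": f"{label}.technical_report.pdf",
    }


files = submission_files("Xie", "TAU", 1)
print(files["output"])
```

The same label can be reused for the `label` field of the metadata file.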

System output file

The system output file should be a *.csv file, and should have the following 11 columns:

  1. caption: query caption.
  2. fname_1: file name of the audio file that is most relevant to the query caption in caption.
  3. fname_2: file name of the audio file that is 2nd most relevant to the query caption in caption.
  4. fname_3: file name of the audio file that is 3rd most relevant to the query caption in caption.
  5. fname_4: file name of the audio file that is 4th most relevant to the query caption in caption.
  6. fname_5: file name of the audio file that is 5th most relevant to the query caption in caption.
  7. fname_6: file name of the audio file that is 6th most relevant to the query caption in caption.
  8. fname_7: file name of the audio file that is 7th most relevant to the query caption in caption.
  9. fname_8: file name of the audio file that is 8th most relevant to the query caption in caption.
  10. fname_9: file name of the audio file that is 9th most relevant to the query caption in caption.
  11. fname_10: file name of the audio file that is 10th most relevant to the query caption in caption.

Example output:

caption,fname_1,fname_2,fname_3,fname_4,fname_5,fname_6,fname_7,fname_8,fname_9,fname_10
The person is rummaging through the pans while looking for something,fn1.wav,fn2.wav,fn3.wav,fn4.wav,fn5.wav,fn6.wav,fn7.wav,fn8.wav,fn9.wav,fn10.wav

In the example output, "fn2.wav" is the file name of the audio file that is 2nd most relevant to the query caption "The person is rummaging through the pans while looking for something".
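A file in this 11-column layout could be produced with Python's standard `csv` module, which also quotes captions that contain commas. The sketch assumes that a system yields, for each caption, a mapping from audio file name to relevance score; the function names and the toy scores are hypothetical.

```python
import csv
import io


def top_k(scores, k=10):
    """Return the file names with the k highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def write_output(rows, out_file, k=10):
    """Write one line per caption: the caption followed by its top-k file names."""
    writer = csv.writer(out_file)
    writer.writerow(["caption"] + [f"fname_{i}" for i in range(1, k + 1)])
    for caption, scores in rows:
        writer.writerow([caption] + top_k(scores, k))


# Toy scores for 12 files (hypothetical): fn1.wav scores highest, fn12.wav lowest.
scores = {f"fn{i}.wav": 1.0 / i for i in range(1, 13)}
caption = "The person is rummaging through the pans while looking for something"
buffer = io.StringIO()
write_output([(caption, scores)], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the in-memory buffer.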

Metadata file

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 8
submission:
    # Submission label
    # Label is used to index submissions.
    # Generate your label in the following way to avoid overlapping codes among submissions:
    # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
    label: Xie_TAU_task8_1
    #
    # Submission name
    # This name will be used in the results tables when space permits
    name: DCASE2024 baseline system
    #
    # Submission name abbreviated
    # This abbreviated name will be used in the result table when space is tight.
    # Use maximum 10 characters.
    abbreviation: Baseline

    # Authors of the submitted system.
    # Mark authors in the order you want them to appear in submission lists.
    # One of the authors has to be marked as corresponding author,
    # this will be listed next to the submission in the results tables.
    authors:
        # First author
        -   lastname: Xie
            firstname: Huang
            email: huang.xie@tuni.fi                    # Contact email address
            corresponding: true                         # Mark true for one of the authors

            # Affiliation information for the author
            affiliation:
                abbreviation: TAU
                institute: Tampere University
                department: Computing Sciences
                location: Tampere, Finland

        # Second author
        -   lastname: Virtanen
            firstname: Tuomas
            email: tuomas.virtanen@tuni.fi

            affiliation:
                abbreviation: TAU
                institute: Tampere University
                department: Computing Sciences
                location: Tampere, Finland

# System information
system:
    # System description, meta-data provided here will be used to do meta analysis of the submitted system.
    # Use general level tags, when possible use the tags provided in comments.
    # If information field is not applicable to the system, use "!!null".
    description:

        # Audio input / sampling rate, e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
        input_sampling_rate: 44.1kHz

        # Acoustic representation
        # Here you should indicate what kind of audio representation you used.
        # If your system used hand-crafted features (e.g. mel band energies), then you can do:
        #
        # `acoustic_features: mel energies`
        #
        # Else, if you used some pre-trained audio feature extractor, you can indicate the name of the system, for example:
        #
        # `acoustic_features: audioset`
        acoustic_features: log-mel energies

        # Text embeddings
        # Here you can indicate how you treated text embeddings.
        # If your method learned its own text embeddings (i.e. you did not use any pre-trained or fine-tuned NLP embeddings),
        # then you can do:
        #
        # `text_embeddings: learned`
        #
        # Else, specify the pre-trained or fine-tuned NLP embeddings that you used, for example:
        #
        # `text_embeddings: Sentence-BERT`
        text_embeddings: Sentence-BERT

        # Data augmentation methods for audio
        # e.g. mixup, time stretching, block mixing, pitch shifting, ...
        audio_augmentation: !!null

        # Data augmentation methods for text
        # e.g. random swapping, synonym replacement, ...
        text_augmentation: !!null

        # Learning scheme
        # Here you should indicate the learning scheme.
        # For example, you could specify either supervised, self-supervised, or even reinforcement learning.
        learning_scheme: self-supervised

        # Ensemble
        # Here you should indicate if you used ensemble of systems or not.
        ensemble: No

        # Audio modelling
        # Here you should indicate the type of system used for audio modelling.
        # For example, if you used some stacked CNNs, then you could do:
        #
        # audio_modelling: cnn
        #
        # If you used some pre-trained system for audio modelling, then you should indicate the system used,
        # for example, PANNs-CNN14, PANNs-ResNet38.
        audio_modelling: PANNs-CNN14

        # Text modelling
        # Similarly, here you should indicate the type of system used for text modelling.
        # For example, if you used some RNNs, then you could do:
        #
        # text_modelling: rnn
        #
        # If you used some pre-trained system for text modelling,
        # then you should indicate the system used (e.g. BERT).
        text_modelling: Sentence-BERT

        # Loss function
        # Here you should indicate the loss function that you employed.
        loss_function: InfoNCE

        # Optimizer
        # Here you should indicate the name of the optimizer that you used.
        optimizer: adam

        # Learning rate
        # Here you should indicate the learning rate of the optimizer that you used.
        learning_rate: 1e-3

        # Metric monitored
        # Here you should report the monitored metric for optimizing your method.
        # For example, did you monitor the loss on the validation data (i.e. validation loss)?
        # Or you monitored the training mAP?
        metric_monitored: validation_loss

    # System complexity, meta-data provided here will be used to evaluate
    # submitted systems from the computational load perspective.
    complexity:
        # Total amount of parameters used in the acoustic model.
        # For neural networks, this information is usually given before training process in the network summary.
        # For other than neural networks, if parameter count information is not directly
        # available, try estimating the count as accurately as possible.
        # In case of ensemble approaches, add up parameters for all subsystems.
        # In case embeddings are used, add up parameter count of the embedding
        # extraction networks and classification network
        # Use numerical value (do not use comma for thousands-separator).
        total_parameters: 732354

    # List of datasets used for the system (e.g., pre-training, fine-tuning, training).
    # Development-training data is used here only as example.
    training_datasets:
        -   name: Clotho-development
            purpose: training                           # Used for training system
            url: https://doi.org/10.5281/zenodo.4783391
            data_types: audio, caption                  # Contained data types, e.g., audio, caption, label.
            data_instances:
                audio: 3839                             # Number of contained audio instances
                caption: 19195                          # Number of contained caption instances
            data_volume:
                audio: 86353                            # Total amount durations (in seconds) of audio instances
                caption: 6453                           # Total word types in caption instances

        # More datasets
        #-   name:
        #    purpose: pre-training
        #    url:
        #    data_types: A, B, C
        #    data_instances:
        #        A: xxx
        #        B: xxx
        #        C: xxx
        #    data_volume:
        #        A: xxx
        #        B: xxx
        #        C: xxx

    # List of datasets used for validating the system, for example, optimizing hyperparameter.
    # Development-validation data is used here only as example.
    validation_datasets:
        -   name: Clotho-validation
            url: https://doi.org/10.5281/zenodo.4783391
            data_types: audio, caption
            data_instances:
                audio: 1045
                caption: 5225
            data_volume:
                audio: 23636
                caption: 2763

        # More datasets
        #-   name:
        #    url:
        #    data_types: A, B, C
        #    data_instances:
        #        A: xxx
        #        B: xxx
        #        C: xxx
        #    data_volume:
        #        A: xxx
        #        B: xxx
        #        C: xxx

    # URL to the source code of the system [optional]
    source_code: https://github.com/xieh97/dcase2023-audio-retrieval

# System results
results:
    development_testing:
        # System results for the development-testing split.
        # Full results are not mandatory, however, they are highly recommended as they are needed for thorough analysis of the challenge submissions.
        # If you are unable to provide all results, incomplete results can also be reported.
        R@1: 0.130
        R@5: 0.343
        R@10: 0.480
        mAP@10: 0.222

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and pre-trained models available after the challenge is over.

Evaluation

The submitted systems will be evaluated according to their performance, i.e., recall at K (R@K) and mean average precision at K (mAP@K), on the withheld evaluation dataset. An explanation of these metrics can be found here. Specifically, the following metrics will be reported for every submitted method:

  1. R@1: Recall score among the top-1 retrieved result, averaged across all caption queries.
  2. R@5: Recall score among the top-5 retrieved results, averaged across all caption queries.
  3. R@10: Recall score among the top-10 retrieved results, averaged across all caption queries.
  4. mAP@10: Average precision among the top-10 retrieved results, averaged across all caption queries.

The R@K is a rank-unaware evaluation metric, which measures the proportion of relevant audio files among the top-K retrieved results to all relevant audio files in the evaluation dataset for the caption query. The mAP@K is a rank-aware metric, which gives the mean of the averaged precision of relevant audio files among the top-K retrieved results for all caption queries. Submitted methods will be ranked by the mAP@10 metric.
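The two metrics can be sketched in a few lines of Python. This is not the official evaluation code: the function names are hypothetical, and the normalization of the average precision by min(number of relevant items, K) is an assumption. With a single relevant audio file per caption, the average precision reduces to 1/rank when the file is within the top K, and 0 otherwise.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear among the top-k retrieved items."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(retrieved, relevant, k):
    """Average of precision@i over the ranks i <= k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Assumed normalization; other conventions exist for truncated rankings.
    return total / min(len(relevant), k)


def mean_over_queries(metric, queries, k):
    """Average a per-query metric over all (retrieved, relevant) pairs."""
    return sum(metric(r, rel, k) for r, rel in queries) / len(queries)


# Two toy queries (hypothetical file names), each with one relevant file.
queries = [
    (["b.wav", "a.wav", "c.wav"], {"a.wav"}),  # relevant file at rank 2
    (["d.wav", "e.wav", "f.wav"], {"d.wav"}),  # relevant file at rank 1
]
r_at_1 = mean_over_queries(recall_at_k, queries, 1)
r_at_5 = mean_over_queries(recall_at_k, queries, 5)
map_at_10 = mean_over_queries(average_precision_at_k, queries, 10)
print(r_at_1, r_at_5, map_at_10)
```

In the toy example, only the second query has its relevant file at rank 1, so R@1 is 0.5, while mAP@10 averages 1/2 and 1/1 to 0.75.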


Results

Submission code | Author | Affiliation | Technical report | Rank score
Primus_CP-JKU_8_1 | Paul Primus | Computational Perception, Johannes Kepler University, Linz, Austria | task-language-based-audio-retrieval-results#Primus2024_t8 | 0.416
Kulik_SRPOL_task8_4 | Jan Kulik | Artificial Intelligence Team, Samsung R&D Institute Poland, Warsaw, Poland | task-language-based-audio-retrieval-results#Kulik2024_t8 | 0.403
Chen_SRCN_task8_1 | Minjun Chen | AI SW Team, Samsung Research China-Nanjing, Nanjing, China | task-language-based-audio-retrieval-results#Chen2024_t8 | 0.396
Munakata_LYVA_1 | Hokuto Munakata | Video Analysis, LY Corporation, Tokyo, Japan | task-language-based-audio-retrieval-results#Munakata2024_t8 | 0.388
Kim_MAUM_task8_2 | Jaeyeon Kim | Seoul National University, Seoul, Republic of Korea | task-language-based-audio-retrieval-results#Kim2024_t8 | 0.363
Cai_NCUT_task8_2 | Xichang Cai | School of Information, North China University of Technology, Beijing, China | task-language-based-audio-retrieval-results#Cai2024_t8 | 0.259
Xie_tau_task8_1 | Huang Xie | Computing Sciences, Tampere University, Tampere, Finland | task-language-based-audio-retrieval-results#Xie2024_t8 | 0.211


Complete results and technical reports can be found on the results page.

Baseline system

The baseline system, the same as in the DCASE 2023 Challenge (task 6b), is reused this year. It employs a bi-encoder architecture with a pretrained CNN14 (see PANNs) as the audio encoder and Sentence-BERT (i.e., "all-mpnet-base-v2") as the text encoder. The pretrained CNN14 is fine-tuned and the Sentence-BERT is frozen during training. The relevance score between an audio signal and a textual description is calculated as the dot product of their audio embedding and text embedding. The InfoNCE loss is used to optimize the baseline system.
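To make the scoring and the loss concrete, the sketch below computes dot-product relevance scores and a text-to-audio InfoNCE loss in plain Python. It is not the baseline implementation: the temperature parameter, the one-directional (text-to-audio) form of the loss, and the toy two-dimensional embeddings are assumptions, and a real system would use a tensor library such as PyTorch.

```python
import math


def dot(u, v):
    """Relevance score: dot product of a text embedding and an audio embedding."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(text_embs, audio_embs, temperature=1.0):
    """Mean text-to-audio InfoNCE loss; audio i is the positive for text i."""
    losses = []
    for i, text in enumerate(text_embs):
        logits = [dot(text, audio) / temperature for audio in audio_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): in the aligned batch each caption matches its audio.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(texts, aligned), info_nce(texts, swapped))
```

The loss is lower for the aligned batch than for the swapped one, which is the behavior that training exploits: matching pairs are pushed towards higher scores than the other pairs in the batch.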

For details on PANNs, please see

Publication

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894, 2020. doi:10.1109/TASLP.2020.3030497.

For details on Sentence-BERT, please see

Publication

Nils Reimers and Iryna Gurevych. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

For details on InfoNCE loss, please see

Publication

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. URL: https://arxiv.org/abs/1807.03748.

For information about submitted systems last year, please see

Publication

Huang Xie, Samuel Lipping, and Tuomas Virtanen. Language-based audio retrieval task in dcase 2022 challenge. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 216–220. 2022.

Repository

The PyTorch implementation of the baseline system is freely available online, and can be found at GitHub.


Results for the development dataset

The results of the baseline system on the development-testing split are shown below.

Metric    Value
R@1       0.130
R@5       0.343
R@10      0.480
mAP@10    0.222

Citations

If you participate in this task, you might want to check the following papers. If you find a paper that needs to be cited here, please contact us.

  • The Clotho dataset:
Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

  • The CNN14 audio encoder, used for the baseline system:
Publication

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894, 2020. doi:10.1109/TASLP.2020.3030497.

  • The Sentence-BERT text encoder, used for the baseline system:
Publication

Nils Reimers and Iryna Gurevych. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

  • The InfoNCE loss, used for the baseline system:
Publication

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. URL: https://arxiv.org/abs/1807.03748.
