Language-Based Audio Retrieval


Task description

Audio retrieval with human-written captions.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.

Figure 1: Overview of Language-based Audio Retrieval.


Participants in this subtask may use pre-trained models and external data for training their models. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2023 Challenge (task 6b). Specifically, each caption will be treated as a text query, and the corresponding audio file as the relevant item in retrieval.
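As a minimal sketch of this setup, the snippet below expands a captions table into (query, relevant file) pairs. The column names `file_name` and `caption_1` to `caption_5` are assumptions about the CSV layout, and the two rows are made-up placeholder data, not real Clotho entries.

```python
import csv
import io

# Placeholder data; the column names are an assumed layout.
SAMPLE_CSV = """file_name,caption_1,caption_2,caption_3,caption_4,caption_5
a.wav,rain falls,water drips,it rains,rain on roof,steady rain
b.wav,a dog barks,dog barking,barking dog,a dog is barking,dog barks loudly
"""


def build_queries(csv_text):
    """Return a list of (caption, relevant_file_name) pairs, one per caption."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for i in range(1, 6):
            pairs.append((row[f"caption_{i}"], row["file_name"]))
    return pairs


queries = build_queries(SAMPLE_CSV)
print(len(queries))  # 2 files * 5 captions = 10 pairs
print(queries[0])
```

Each caption becomes its own query, so one audio file is the relevant item for five different queries.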

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds in duration, with each audio sample having five captions of eight to 20 words in length. There is a total of 6974 audio samples, with 34870 captions (i.e. 6974 audio samples * 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Task setup

Development dataset

The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
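The first two steps described above (per-clip unique word sets, then a dataset-wide bag of words) can be sketched as follows. This is only an illustration under stated assumptions: tokenization by lowercasing and whitespace splitting, counting each word once per clip, the file names, and the captions are all hypothetical. The multi-label stratification step itself is not shown.

```python
from collections import Counter


def unique_words(captions):
    """Set of lowercase words appearing in any caption of one audio clip."""
    return {word for caption in captions for word in caption.lower().split()}


# Placeholder captions keyed by hypothetical file names.
clips = {
    "a.wav": ["Rain falls on a roof", "Heavy rain falls"],
    "b.wav": ["A dog barks in the rain", "A dog barks"],
}

# One set of unique words per audio clip.
word_sets = {name: unique_words(caps) for name, caps in clips.items()}

# Combine the per-clip sets: each word counts once per clip it occurs in.
word_freq = Counter(word for words in word_sets.values() for word in words)

print(word_freq.most_common(3))
```

The per-clip word sets would then serve as the label sets given to a multi-label stratification routine.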

Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho v2 and the DCASE challenge is:

Clotho naming of splits    DCASE Challenge naming of splits
development                development-training
validation                 development-validation
evaluation                 development-testing

For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to Clotho development split, development-validation refers to Clotho validation split, and development-testing refers to Clotho evaluation split.

The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.

Evaluation dataset

The evaluation dataset consists of 1000 audio samples, with one caption each. The data are collected following the same procedure as Clotho v2.

Evaluation data will be downloadable here in June.

Task rules

Participants are allowed to:

  • Use external data (e.g. audio files, text corpus, annotations).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state in their submission that they do so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Make subjective judgments of the evaluation data, or annotate it.
  • Use additional information about the evaluation data for their method, apart from the provided audio files and captions.

Submission

All participants should submit:

  • the output of their audio retrieval with the evaluation data (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection of your files, you can consider using the following naming convention:

<author>_<institute>_task8_submission_<submission_index>_<output or metadata or report>.<csv or yaml or pdf>

For example:

xie_tau_task8_submission_1_output.csv
xie_tau_task8_submission_1_metadata.yaml
xie_tau_task8_submission_1_report.pdf

The field <submission_index> differentiates your submissions in case you have more than one.
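The naming convention above can be generated programmatically. The helper below is a hypothetical convenience function, not part of any official tooling; it simply fills in the pattern for the three required files.

```python
def submission_names(author, institute, index, task="task8"):
    """Build the three file names for one submission, following the pattern above."""
    stem = f"{author}_{institute}_{task}_submission_{index}"
    return {
        "output": f"{stem}_output.csv",
        "metadata": f"{stem}_metadata.yaml",
        "report": f"{stem}_report.pdf",
    }


names = submission_names("xie", "tau", 1)
for kind, file_name in names.items():
    print(kind, file_name)
```

Generating all three names from one stem keeps the output, metadata, and report files consistently linked.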

System output file

The system output file should be a *.csv file, and should have the following 11 columns:

  1. caption: query caption.
  2. fname_1: file name of the audio file that is most relevant to the query caption in caption.
  3. fname_2: file name of the audio file that is 2nd most relevant to the query caption in caption.
  4. fname_3: file name of the audio file that is 3rd most relevant to the query caption in caption.
  5. fname_4: file name of the audio file that is 4th most relevant to the query caption in caption.
  6. fname_5: file name of the audio file that is 5th most relevant to the query caption in caption.
  7. fname_6: file name of the audio file that is 6th most relevant to the query caption in caption.
  8. fname_7: file name of the audio file that is 7th most relevant to the query caption in caption.
  9. fname_8: file name of the audio file that is 8th most relevant to the query caption in caption.
  10. fname_9: file name of the audio file that is 9th most relevant to the query caption in caption.
  11. fname_10: file name of the audio file that is 10th most relevant to the query caption in caption.

Example output:

caption,fname_1,fname_2,fname_3,fname_4,fname_5,fname_6,fname_7,fname_8,fname_9,fname_10
The person is rummaging through the pans while looking for something,fn1.wav,fn2.wav,fn3.wav,fn4.wav,fn5.wav,fn6.wav,fn7.wav,fn8.wav,fn9.wav,fn10.wav

In the example output, "fn2.wav" is the file name of the audio file that is 2nd most relevant to the query caption "The person is rummaging through the pans while looking for something".
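A file with these 11 columns could be written with Python's standard `csv` module, as in the sketch below. The function name `write_output` and the `rankings` dictionary are hypothetical; the snippet writes to an in-memory buffer so it can be run as is. Note that `csv.writer` quotes captions containing commas; whether the organizers' evaluation script expects that quoting is an assumption to verify.

```python
import csv
import io

HEADER = ["caption"] + [f"fname_{i}" for i in range(1, 11)]


def write_output(rankings, stream):
    """Write one row per caption; rankings maps caption -> 10 file names, best first."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    for caption, fnames in rankings.items():
        if len(fnames) != 10:
            raise ValueError(f"expected 10 file names, got {len(fnames)}")
        writer.writerow([caption] + list(fnames))


rankings = {
    "The person is rummaging through the pans while looking for something":
        [f"fn{i}.wav" for i in range(1, 11)],
}
buffer = io.StringIO()
write_output(rankings, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="")` and pass the file object as `stream`.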

Metadata file

An example metadata file will be added later.

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models available after the challenge is over.

Evaluation

The submitted systems will be evaluated according to their performance, i.e., recall at K (R@K) and mean average precision at K (mAP@K), on the withheld evaluation dataset. An explanation of these metrics can be found here. Specifically, the following metrics will be reported for every submitted method:

  1. R@1: Recall score among the top-1 retrieved result, averaged across all caption queries.
  2. R@5: Recall score among the top-5 retrieved results, averaged across all caption queries.
  3. R@10: Recall score among the top-10 retrieved results, averaged across all caption queries.
  4. mAP@10: Average precision among the top-10 retrieved results, averaged across all caption queries.

The R@K is a rank-unaware evaluation metric, which measures the proportion of relevant audio files among the top-K retrieved results to all relevant audio files in the evaluation dataset for the caption query. The mAP@K is a rank-aware metric, which gives the mean of the averaged precision of relevant audio files among the top-K retrieved results for all caption queries. Submitted methods will be ranked by the mAP@10 metric.
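The two metrics can be sketched in a few lines of Python. This is an illustrative implementation, not the official evaluation script: the function names and toy file names are hypothetical, each ranked list is assumed to contain no duplicates, and the average precision is normalized by min(K, number of relevant files), which is one common convention. With a single relevant file per caption, as in this task, that choice reduces to 1/rank.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant files that appear in the top-k results."""
    hits = sum(1 for name in ranked[:k] if name in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Precision at each rank holding a relevant file, averaged (assumed normalization)."""
    hits = 0
    precision_sum = 0.0
    for rank, name in enumerate(ranked[:k], start=1):
        if name in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))


def mean_over_queries(metric, results, k):
    """Average a per-query metric over all (ranked list, relevant set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in results) / len(results)


# Toy results: (ranked file names, set of relevant file names) per caption query.
results = [
    (["a.wav", "b.wav", "c.wav"], {"a.wav"}),  # relevant file at rank 1
    (["b.wav", "c.wav", "a.wav"], {"c.wav"}),  # relevant file at rank 2
    (["a.wav", "b.wav", "c.wav"], {"d.wav"}),  # relevant file not retrieved
]

print(mean_over_queries(recall_at_k, results, 1))
print(mean_over_queries(recall_at_k, results, 10))
print(mean_over_queries(average_precision_at_k, results, 10))
```

Recall ignores where in the top-K the relevant file sits, while the average precision rewards placing it closer to rank 1.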

Baseline system

The baseline system is the same as in the DCASE 2023 Challenge (task 6b) and is reused this year. It employs a bi-encoder architecture with a pretrained CNN14 (see PANNs) as the audio encoder and Sentence-BERT (i.e., "all-mpnet-base-v2") as the text encoder. During training, the pretrained CNN14 is fine-tuned and Sentence-BERT is frozen. The relevance score between an audio signal and a textual description is calculated as the dot product of their audio embedding and text embedding. The InfoNCE loss is used to optimize the baseline system.
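To illustrate the scoring and loss described above, here is a small pure-Python sketch. It is not the baseline code (which is implemented in PyTorch): the toy embeddings, the `temperature` parameter, and the choice of a single caption-to-audio direction are assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(text_embs, audio_embs):
    """scores[i][j] = dot product of caption i and audio clip j."""
    return [[dot(t, a) for a in audio_embs] for t in text_embs]


def info_nce(scores, temperature=1.0):
    """Mean cross-entropy over captions, where caption i's positive is audio i."""
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(scores)


# Toy 2-dimensional embeddings for a batch of two caption/audio pairs.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_embs = [[2.0, 0.0], [0.0, 2.0]]

scores = score_matrix(text_embs, audio_embs)
loss = info_nce(scores)
print(scores, loss)
```

The loss is small when each caption scores highest against its own audio clip and grows when a mismatched clip scores higher. At retrieval time, sorting the clips by a caption's row of scores gives the ranked list.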

For details on PANNs, please see

Publication

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894, 2020. doi:10.1109/TASLP.2020.3030497.

For details on Sentence-BERT, please see

Publication

Nils Reimers and Iryna Gurevych. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

For details on InfoNCE loss, please see

Publication

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. URL: https://arxiv.org/abs/1807.03748.

For information about submitted systems last year, please see

Publication

Huang Xie, Samuel Lipping, and Tuomas Virtanen. Language-based audio retrieval task in DCASE 2022 Challenge. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 216–220. 2022.

Repository

The PyTorch implementation of the baseline system is freely available online, and can be found at GitHub.


Results for the development dataset

The results of the baseline system on the development-testing split are shown below.

Metric    Value
R@1       0.130
R@5       0.343
R@10      0.480
mAP@10    0.222

Citations

If you participate in this task, you might want to check the following papers. If you find a paper that needs to be cited here, please contact us.

  • The Clotho dataset:
    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

  • The CNN14 audio encoder, used for the baseline system:
    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 2880–2894, 2020. doi:10.1109/TASLP.2020.3030497.

  • The Sentence-BERT text encoder, used for the baseline system:
    Nils Reimers and Iryna Gurevych. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

  • The InfoNCE loss, used for the baseline system:
    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. URL: https://arxiv.org/abs/1807.03748.