Language-Based Audio Retrieval


Task description

Audio retrieval with human-written captions.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
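The retrieval step described above amounts to a top-10 ranking per query. A minimal sketch, assuming some model has already produced one relevance score per audio file; the `scores` mapping, the file names, and the function name are hypothetical and not part of the task definition:

```python
import heapq

def retrieve_top_k(scores, k=10):
    """Return the k file names with the highest scores, best match first."""
    # heapq.nlargest returns items ordered from largest to smallest key.
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical relevance scores of four audio files for one text query.
scores = {f"clip_{i}.wav": s for i, s in enumerate([0.1, 0.9, 0.4, 0.7])}
top2 = retrieve_top_k(scores, k=2)
```

If fewer than `k` files are scored, the function simply returns all of them in ranked order.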

This subtask allows the use of pre-trained models and external data for training models. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing like part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model or additional audio data like AudioSet or ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2021 Challenge (task 6), but repurposed for language-based retrieval. Specifically, each caption will be treated as a text query, and non-corresponding audio files as non-relevant retrieved items when measuring the retrieval performance.

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds duration, with each audio sample having five captions of eight to 20 words in length. There is a total of 6974 audio samples, with 34870 captions (i.e. 6974 audio samples * 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Task setup

Development dataset

The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
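The first two steps of the split construction (per-clip sets of unique words, then dataset-wide word frequencies) can be sketched as follows. This is only an illustration: the tokenization rule, the function names, and the example captions are assumptions, and the multi-label stratification step itself is not shown.

```python
from collections import Counter
import re

def unique_words(captions):
    """Set of lower-cased words appearing in any caption of one audio clip."""
    words = set()
    for caption in captions:
        # Assumed tokenization: runs of letters and apostrophes.
        words.update(re.findall(r"[a-z']+", caption.lower()))
    return words

def word_frequencies(captions_per_clip):
    """Count, for each word, the number of clips whose captions contain it."""
    counts = Counter()
    for captions in captions_per_clip.values():
        counts.update(unique_words(captions))
    return counts

# Made-up captions for two hypothetical clips.
captions_per_clip = {
    "a.wav": ["A dog barks loudly", "A dog is barking"],
    "b.wav": ["Rain falls on a roof"],
}
freq = word_frequencies(captions_per_clip)
```

Each clip's word set can then serve as its multi-label target when stratifying the clips into splits.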

Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho v2 and the DCASE challenge is:

Clotho naming of splits    DCASE Challenge naming of splits
development                development-training
validation                 development-validation
evaluation                 development-testing

For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to Clotho development split, development-validation refers to Clotho validation split, and development-testing refers to Clotho evaluation split.

The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.
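For retrieval, each caption in a captions file becomes one text query whose only relevant item is the audio file in the same row. A minimal sketch of that expansion, assuming the captions CSV has a `file_name` column followed by `caption_1`, `caption_2`, ... columns (check the header of the downloaded file; the rows below are made up):

```python
import csv
import io

def caption_queries(csv_file):
    """Yield (caption, relevant_file_name) pairs, one per caption."""
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column.startswith("caption_") and value:
                yield value, row["file_name"]

# Tiny in-memory stand-in for a captions file such as
# clotho_captions_development.csv (hypothetical content).
sample = io.StringIO(
    "file_name,caption_1,caption_2\n"
    "rain.wav,Rain falls on a roof,Water drips steadily\n"
)
pairs = list(caption_queries(sample))
```

With a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory buffer.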

Evaluation dataset

The evaluation dataset will consist of 1000 audio samples, with one caption each. The data are collected following the same procedure as Clotho v2.

Evaluation data will be downloadable here in June.

Task rules

Participants are allowed to:

  • Use external data (e.g. audio files, text corpus, annotations).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state in their submission whether they do so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Make subjective judgments of the evaluation data, nor to annotate it.
  • Use additional information about the evaluation data for their method, apart from the provided audio files and captions of the evaluation data.

Submission

All participants should submit:

  • the output of their audio retrieval with the evaluation data (*.csv file, the evaluation dataset will be announced later),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection of your files, you can consider using the following naming convention:

<author>_<institute>_task6b_submission_<submission_index>_<output or metadata or report>.<csv or yaml or pdf>

For example:

xie_tau_task6b_submission_1_output.csv
xie_tau_task6b_submission_1_metadata.yaml
xie_tau_task6b_submission_1_report.pdf

The field <submission_index> differentiates your submissions in case you have more than one.
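The suggested naming convention can be generated programmatically so that the three files of one submission always share the same stem. A small sketch; the function name and its parameters are illustrative only:

```python
def submission_file_names(author, institute, index, task="task6b"):
    """Build the three suggested file names for one submission."""
    stem = f"{author}_{institute}_{task}_submission_{index}"
    return {
        "output": f"{stem}_output.csv",
        "metadata": f"{stem}_metadata.yaml",
        "report": f"{stem}_report.pdf",
    }

names = submission_file_names("xie", "tau", 1)
```

Calling it with a different `index` for each system keeps multiple submissions from the same team distinguishable.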

System output file

The system output file should be a *.csv file, and should have the following 11 columns:

  1. caption: query caption.
  2. file_name_1: file name of the audio file that is most relevant to the query caption in caption.
  3. file_name_2: file name of the audio file that is 2nd most relevant to the query caption in caption.
  4. file_name_3: file name of the audio file that is 3rd most relevant to the query caption in caption.
  5. file_name_4: file name of the audio file that is 4th most relevant to the query caption in caption.
  6. file_name_5: file name of the audio file that is 5th most relevant to the query caption in caption.
  7. file_name_6: file name of the audio file that is 6th most relevant to the query caption in caption.
  8. file_name_7: file name of the audio file that is 7th most relevant to the query caption in caption.
  9. file_name_8: file name of the audio file that is 8th most relevant to the query caption in caption.
  10. file_name_9: file name of the audio file that is 9th most relevant to the query caption in caption.
  11. file_name_10: file name of the audio file that is 10th most relevant to the query caption in caption.

Example output:

caption,file_name_1,file_name_2,file_name_3,file_name_4,file_name_5,file_name_6,file_name_7,file_name_8,file_name_9,file_name_10
The person is rummaging through the pans while looking for something,fn1.wav,fn2.wav,fn3.wav,fn4.wav,fn5.wav,fn6.wav,fn7.wav,fn8.wav,fn9.wav,fn10.wav

In the example output, "fn2.wav" is the file name of the audio file that is 2nd most relevant to the query caption "The person is rummaging through the pans while looking for something".
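One way to produce a file with these 11 columns is Python's standard `csv` module. The sketch below assumes the ranked file names for each query are already available; the function name and the `results` structure are hypothetical. Note that `csv.writer` quotes any caption that contains a comma, and whether the organizers' tooling expects that quoting is not stated here.

```python
import csv
import io

HEADER = ["caption"] + [f"file_name_{i}" for i in range(1, 11)]

def write_system_output(results, csv_file):
    """Write one row per query: the caption followed by its 10 ranked files."""
    writer = csv.writer(csv_file)
    writer.writerow(HEADER)
    for caption, ranked_files in results:
        if len(ranked_files) != 10:
            raise ValueError(f"expected 10 files for query {caption!r}")
        writer.writerow([caption, *ranked_files])

# One query with placeholder file names, most relevant first.
results = [(
    "The person is rummaging through the pans while looking for something",
    [f"fn{i}.wav" for i in range(1, 11)],
)]
buffer = io.StringIO()
write_system_output(results, buffer)
```

To write to disk, replace the buffer with `open(path, "w", newline="", encoding="utf-8")`.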

Metadata file

Example meta information file will be added later.

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models available after the challenge is over.

Evaluation

The submitted systems will be evaluated according to their performance on the withheld evaluation dataset. All of the following metrics will be reported for every submitted method:

  1. R1: Recall score at top 1 retrieved result.
  2. R5: Recall score at top 5 retrieved results.
  3. R10: Recall score at top 10 retrieved results.
  4. mAP10: mean Average Precision at top 10 retrieved results.

Systems will be ranked by the mAP10 metric.
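The four metrics can be sketched as below, assuming each query caption has exactly one relevant audio file, so that average precision at 10 reduces to the reciprocal rank of that file within the top 10 (or 0 if it is missing). The function names and data structures are illustrative, and the official evaluation code may differ in details.

```python
def recall_at_k(ranked_files, relevant_file, k):
    """1.0 if the relevant file is among the top-k results, else 0.0."""
    return 1.0 if relevant_file in ranked_files[:k] else 0.0

def average_precision_at_10(ranked_files, relevant_file):
    """AP@10 with a single relevant item: reciprocal rank within the top 10."""
    for rank, name in enumerate(ranked_files[:10], start=1):
        if name == relevant_file:
            return 1.0 / rank
    return 0.0

def evaluate(predictions, ground_truth):
    """Average R1, R5, R10 and mAP10 over all queries."""
    n = len(ground_truth)
    totals = {"R1": 0.0, "R5": 0.0, "R10": 0.0, "mAP10": 0.0}
    for caption, relevant_file in ground_truth.items():
        ranked = predictions[caption]
        for k in (1, 5, 10):
            totals[f"R{k}"] += recall_at_k(ranked, relevant_file, k)
        totals["mAP10"] += average_precision_at_10(ranked, relevant_file)
    return {name: total / n for name, total in totals.items()}

# Two hypothetical queries: the relevant file is ranked 1st and 4th.
files = [f"fn{i}.wav" for i in range(1, 11)]
ground_truth = {"query a": "fn1.wav", "query b": "fn4.wav"}
predictions = {"query a": files, "query b": files}
metrics = evaluate(predictions, ground_truth)
```

In this toy case the second query contributes 0 to R1 and 1/4 to mAP10, which shows how mAP10 rewards placing the relevant file higher even when R5 and R10 are already saturated.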

Baseline system

To provide a starting point and some initial results for the challenge, there is a baseline system for the language-based audio retrieval subtask.

The baseline system is a simpler version of the audio-text aligning framework presented in this paper, which calculates relevance scores between encoded textual descriptions and encoded audio signals. A convolutional recurrent neural network (CRNN) is used as the audio encoder, which extracts frame-wise acoustic embeddings from audio signals. An audio signal is then represented by the average of its frame-wise acoustic embeddings. A pre-trained Word2Vec (published here by Google) is used as the text encoder, which converts textual descriptions into sequences of word embeddings. A textual description is represented by the average of its word embeddings. The relevance score between an audio signal and a textual description is calculated as the dot product of their vector representations. The baseline system is optimized with a triplet ranking loss criterion.
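The scoring and loss described above can be illustrated with plain Python lists standing in for encoder outputs. This is a sketch under stated assumptions, not the baseline implementation: the margin value, the use of one mismatched caption plus one mismatched audio clip as negatives, and all numbers are hypothetical; the baseline repository is the authoritative reference.

```python
def mean_pool(vectors):
    """Average a sequence of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(u, v):
    """Dot product used as the relevance score."""
    return sum(a * b for a, b in zip(u, v))

def triplet_ranking_loss(audio, text, other_audio, other_text, margin=1.0):
    """Hinge loss pushing the matching pair above two mismatched pairs."""
    positive = dot(audio, text)
    return (max(0.0, margin + dot(audio, other_text) - positive)
            + max(0.0, margin + dot(other_audio, text) - positive))

# Toy 2-dimensional embeddings (made-up numbers, not real encoder outputs).
audio = mean_pool([[1.0, 0.0], [3.0, 0.0]])   # average of frame embeddings
text = mean_pool([[1.0, 1.0], [1.0, -1.0]])   # average of word embeddings
other_audio = [0.0, 1.0]                      # embedding of a mismatched clip
other_text = [1.5, 2.0]                       # embedding of a mismatched caption
loss = triplet_ranking_loss(audio, text, other_audio, other_text)
```

The loss is zero only when the matching pair outscores both mismatched pairs by at least the margin, which is what drives matching audio and captions together during training.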

For more details on the audio-text aligning framework, please see

Publication

Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. Accepted at ICASSP, 2022.

Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

Abstract

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.

For details on the CRNN audio encoder, please see

Publication

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.

A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning

Abstract

Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).

For the triplet ranking loss used to train the baseline system, please see

Publication

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.

Signature Verification Using a "Siamese" Time Delay Neural Network

Abstract

This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a 'Siamese' neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.

For the pre-trained Word2Vec:

Repository

The PyTorch implementation of the baseline system is freely available online, and can be found at GitHub.


Results for the development dataset

The results of the baseline system for the development-testing split (i.e., the Clotho evaluation split) are:

Metric    Value
R1        0.03
R5        0.11
R10       0.19
mAP10     0.07

Citations

If you participate in this task, you might want to check the following papers. If you find a paper that needs to be cited here, please contact us.

  • The Clotho dataset:
Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

  • The original audio-text aligning framework:
Publication

Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. Accepted at ICASSP, 2022.

  • The CRNN audio encoder, used for the baseline system:
Publication

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.

  • The triplet ranking loss, used for the baseline system:
Publication

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.
