Audio retrieval with human-written captions.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
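As an illustration of the retrieval step described above, the following is a minimal sketch that ranks audio files for one text query. It assumes that the query and every audio file have already been mapped to vectors in a shared embedding space and that cosine similarity is the matching score; neither assumption comes from the task description. The function names, file names, and toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_embedding, audio_embeddings, k=10):
    """Return the names of the k audio files most similar to the query, best match first."""
    scored = [
        (cosine(query_embedding, embedding), name)
        for name, embedding in audio_embeddings.items()
    ]
    # Sort by descending score; ties are broken by file name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Toy data (hypothetical): three audio files with 3-dimensional embeddings.
audio_embeddings = {
    "rain.wav": [0.9, 0.1, 0.0],
    "dog_bark.wav": [0.0, 1.0, 0.1],
    "traffic.wav": [0.4, 0.4, 0.8],
}
query = [1.0, 0.2, 0.1]  # hypothetical embedding of a caption about rain
ranking = retrieve_top_k(query, audio_embeddings, k=2)
print(ranking)
```

In the task itself, k would be 10 and the candidate set would be the full audio collection; how the embeddings are produced is left to each system.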

This subtask allows the use of pre-trained models and external data for training. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model or additional audio data such as AudioSet or ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2021 Challenge (task 6), but repurposed for language-based retrieval. Specifically, each caption will be treated as a text query, and non-corresponding audio files as non-relevant retrieved items when measuring the retrieval performance.
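The repurposing described above can be sketched as follows: every caption becomes one query whose only relevant item is the audio file it was written for. This is a minimal sketch, assuming the captions are available as a mapping from audio file name to caption list; the function names and toy data are hypothetical and do not reflect the actual Clotho file layout.

```python
def build_queries(captions_by_audio):
    """Flatten {audio_file: [captions]} into a list of (caption, target_audio) pairs."""
    return [
        (caption, audio)
        for audio, captions in captions_by_audio.items()
        for caption in captions
    ]


def relevance_flags(retrieved, target_audio):
    """Mark each retrieved file as relevant (1) only if it is the caption's own audio file."""
    return [1 if name == target_audio else 0 for name in retrieved]


# Toy data (hypothetical captions and file names).
captions_by_audio = {
    "a.wav": ["Rain falls on a tin roof", "Water drips steadily"],
    "b.wav": ["A dog barks in the distance"],
}
queries = build_queries(captions_by_audio)
flags = relevance_flags(["b.wav", "a.wav"], target_audio="a.wav")
print(len(queries), flags)
```

Under this setup each query has exactly one relevant audio file, even if other files in the dataset happen to contain similar sounds.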

The Clotho v2 dataset consists of audio samples 15 to 30 seconds long, each with five captions of 8 to 20 words. There is a total of 6974 audio samples and 34870 captions (i.e., 6974 audio samples * 5 captions per sample). All audio samples are from the Freesound platform, and the captions were crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see the publication below.

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Coordinators

Huang Xie, Tampere University
Felix Gontier, INRIA
Samuel Lipping, Tampere University
Konstantinos Drossos, Tampere University
Tuomas Virtanen, Tampere University
Romain Serizel, University of Lorraine

Clotho naming of splits	DCASE Challenge naming of splits
development	development (training)
validation	development (validation)
evaluation	development (testing)

Metric	Value with 95% confidence interval
R1	0.03 (0.03 - 0.04)
R5	0.11 (0.10 - 0.12)
R10	0.19 (0.18 - 0.20)
mAP10	0.07 (0.06 - 0.07)
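The metrics in the table above can be illustrated with a short sketch. It assumes that R1, R5, and R10 denote recall at ranks 1, 5, and 10, that mAP10 denotes mean average precision over the top 10 results, and that each query has exactly one relevant audio file, in which case average precision reduces to the reciprocal rank of that file (or 0 if it is not in the top 10). The function names and toy rankings are hypothetical, and the official evaluation code may differ in its details.

```python
def rank_of_target(retrieved, target):
    """Return the 1-based rank of target in the retrieved list, or None if absent."""
    for position, name in enumerate(retrieved, start=1):
        if name == target:
            return position
    return None


def recall_at_k(results, k):
    """Fraction of queries whose target appears among the top k retrieved items."""
    hits = sum(1 for retrieved, target in results if target in retrieved[:k])
    return hits / len(results)


def map_at_10(results):
    """Mean average precision over the top 10, assuming one relevant item per query."""
    total = 0.0
    for retrieved, target in results:
        rank = rank_of_target(retrieved[:10], target)
        if rank is not None:
            total += 1.0 / rank  # AP equals 1/rank with a single relevant item
    return total / len(results)


# Toy results (hypothetical): each entry is (ranked retrieved files, target file).
results = [
    (["a", "b", "c"], "a"),            # target at rank 1
    (["b", "a", "c"], "a"),            # target at rank 2
    (["b", "c", "d"], "a"),            # target not retrieved
    (["c", "b", "d", "e", "a"], "a"),  # target at rank 5
]
r1 = recall_at_k(results, 1)
r5 = recall_at_k(results, 5)
map10 = map_at_10(results)
print(r1, r5, map10)
```

Here one of four targets is ranked first and three of four fall within the top 5, while mAP10 averages the reciprocal ranks 1, 1/2, 0, and 1/5.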
