Audio retrieval with human-written captions.

Challenge has ended. Full results for this task can be found in the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
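The retrieval step described above can be sketched as a ranking over similarity scores. This is a minimal illustration only: it assumes that queries and audio files have already been mapped to vectors in a shared embedding space and that cosine similarity is the matching score. Neither assumption is prescribed by the task, and the file names, vectors, and function names below are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_vec, audio_vecs, k=10):
    """Return up to k file names, sorted from best to worst match."""
    scored = [(cosine(query_vec, vec), name) for name, vec in audio_vecs.items()]
    # Highest score first; ties are broken by file name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]

# Toy, made-up embeddings (hypothetical; a real system would compute these).
audio_vecs = {
    "rain.wav": [0.9, 0.1, 0.0],
    "dog.wav": [0.1, 0.9, 0.1],
    "car.wav": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # hypothetical embedding of a caption about rain

print(retrieve_top_k(query_vec, audio_vecs, k=2))
```

With the task's setting of k=10, each query would yield a ranked list of 10 file names.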

Figure 1: Overview of Language-based Audio Retrieval.

This subtask allows the use of pre-trained models and external data for training. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2023 Challenge (task 6b). Specifically, each caption will be treated as a text query, and the corresponding audio file as the relevant item in retrieval.
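As a rough sketch of this setup, each caption can be expanded into a (query, relevant file) pair. The row layout and the names `build_queries` and `rows` are illustrative assumptions, not part of the official tooling, and the captions are placeholders rather than real Clotho data.

```python
def build_queries(rows):
    """Expand (file_name, captions) rows into (query_text, relevant_file) pairs."""
    pairs = []
    for file_name, captions in rows:
        for caption in captions:
            # Each caption is one text query; its source audio file is the
            # single relevant item for that query.
            pairs.append((caption, file_name))
    return pairs

# Made-up rows for illustration; real captions are not reproduced here.
rows = [
    ("clip_a.wav", [f"placeholder caption {i} for clip a" for i in range(1, 6)]),
    ("clip_b.wav", [f"placeholder caption {i} for clip b" for i in range(1, 6)]),
]

queries = build_queries(rows)
print(len(queries))  # 2 clips x 5 captions each
print(queries[0])
```

Under this construction, the number of queries is always five times the number of audio files in a split.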

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds in duration, each with five captions of 8 to 20 words. There are 6974 audio samples in total, with 34870 captions (i.e., 6974 audio samples × 5 captions per sample). All audio samples are from the Freesound platform, and the captions were crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see the following publication:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Coordinators

Huang Xie, Tampere University
Tuomas Virtanen, Tampere University
Romain Serizel, University of Lorraine
Etienne Labbé, Université Toulouse III – Paul Sabatier, Institut de Recherche en Informatique de Toulouse
Thomas Pellegrini, Université Toulouse III – Paul Sabatier, Institut de Recherche en Informatique de Toulouse

Clotho naming of splits	DCASE Challenge naming of splits	DCASE Challenge dataset
development	training	development
validation	validation	development
evaluation	testing	development

Official rank	Code	Author	Affiliation	Technical Report	Rank Score
	Primus_CP-JKU_8_1	Paul Primus	Computational Perception, Johannes Kepler University, Linz, Austria	task-language-based-audio-retrieval-results#Primus2024_t8	0.416
	Kulik_SRPOL_task8_4	Jan Kulik	Artificial Intelligence Team, Samsung R&D Institute Poland, Warsaw, Poland	task-language-based-audio-retrieval-results#Kulik2024_t8	0.403
	Chen_SRCN_task8_1	Minjun Chen	AI SW Team, Samsung Research China-Nanjing, Nanjing, China	task-language-based-audio-retrieval-results#Chen2024_t8	0.396
	Munakata_LYVA_1	Hokuto Munakata	Video Analysis, LY Corporation, Tokyo, Japan	task-language-based-audio-retrieval-results#Munakata2024_t8	0.388
	Kim_MAUM_task8_2	Jaeyeon Kim	Seoul National University, Seoul, Republic of Korea	task-language-based-audio-retrieval-results#Kim2024_t8	0.363
	Cai_NCUT_task8_2	Xichang Cai	School of Information, North China University of Technology, Beijing, China	task-language-based-audio-retrieval-results#Cai2024_t8	0.259
	Xie_tau_task8_1	Huang Xie	Computing Sciences, Tampere University, Tampere, Finland	task-language-based-audio-retrieval-results#Xie2024_t8	0.211

Metric	Value
R1	0.130
R5	0.343
R10	0.480
mAP10	0.222
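The metric names above are not defined in this section. The sketch below assumes the conventional reading: R1, R5, and R10 are recall at ranks 1, 5, and 10, and mAP10 is mean average precision over the top 10 results. It also assumes one relevant audio file per query, in which case average precision reduces to the reciprocal of the rank of that file (or 0 if it is not in the top 10). The function names and toy data are hypothetical, and the official evaluation code may differ in details.

```python
def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant file appears in the top k results."""
    hits = sum(1 for q, ranked in ranked_lists.items() if relevant[q] in ranked[:k])
    return hits / len(ranked_lists)

def map_at_10(ranked_lists, relevant):
    """Mean average precision at 10, assuming one relevant file per query."""
    total = 0.0
    for q, ranked in ranked_lists.items():
        top = ranked[:10]
        if relevant[q] in top:
            # With a single relevant item, AP is 1 / (1-based rank of that item).
            total += 1.0 / (top.index(relevant[q]) + 1)
    return total / len(ranked_lists)

# Toy example: four queries, each with exactly one relevant file.
relevant = {"q1": "a.wav", "q2": "b.wav", "q3": "c.wav", "q4": "d.wav"}
ranked_lists = {
    "q1": ["a.wav", "x.wav", "y.wav"],  # relevant file at rank 1
    "q2": ["x.wav", "y.wav", "b.wav"],  # relevant file at rank 3
    "q3": ["x.wav", "y.wav", "z.wav"],  # relevant file not retrieved
    "q4": ["x.wav", "d.wav", "y.wav"],  # relevant file at rank 2
}

print(recall_at_k(ranked_lists, relevant, 1))  # 0.25
print(recall_at_k(ranked_lists, relevant, 5))  # 0.75
print(round(map_at_10(ranked_lists, relevant), 4))
```

In the toy data, one query out of four has its relevant file at rank 1 and three out of four have it within the top 5, which gives the two recall values shown.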
