Audio retrieval with human-written captions.
Description
This subtask focuses on retrieving audio signals based on their textual descriptions, also known as audio captions. Human-written audio captions serve as text queries. For each query, the objective is to retrieve a set of audio files from a given dataset and rank them according to their relevance to the query. This subtask aims to stimulate further research in language-based audio retrieval using unconstrained textual descriptions.

Participants are permitted to use pre-trained models and external data for training their models. This includes pre-trained models for extracting embeddings from audio and/or captions, as well as pre-optimized methods for natural language processing, such as part-of-speech (POS) tagging. Additionally, participants may use external audio and/or textual data, such as external text corpora for training language models or additional audio datasets like AudioSet or ESC-50.
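For illustration, the minimal sketch below shows how a pre-trained sentence encoder could be used to embed captions. The sentence-transformers library and the checkpoint name are example choices only, not requirements of the task.

```python
# Sketch: embedding captions with a pre-trained sentence encoder. The
# sentence-transformers library and the checkpoint name are example choices,
# not requirements of the task.
from sentence_transformers import SentenceTransformer

captions = [
    "Birds are chirping while water flows in the background",
    "A crowd applauds after a short announcement",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # example pre-trained checkpoint
caption_embeddings = encoder.encode(captions)       # numpy array: (num_captions, dim)
print(caption_embeddings.shape)
```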
Audio dataset
The development dataset for this task is Clotho v2.1 (similar to previous years).
The Clotho v2 dataset consists of audio samples of 15 to 30 seconds duration, with each audio sample having five captions of eight to 20 words in length. There is a total of 6974 audio samples with 34,870 captions (i.e. 6974 audio samples × 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.
Clotho: an Audio Captioning Dataset
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
Development dataset
The Clotho v2 dataset is divided into a development split of 3839 audio clips with 19,195 captions, a validation split of 1045 audio clips with 5,225 captions, and an evaluation split of 1045 audio clips with 5,225 captions. These splits are created by first constructing, for each audio clip, the set of unique words appearing in its captions. These sets are combined to form the bag of words of the whole dataset, from which the frequency of each word can be derived. Using the unique words of the audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. The following table provides the correspondence of splits between Clotho v2 and DCASE challenge terminology.
| Clotho naming of splits | DCASE Challenge naming of splits | DCASE dataset |
|---|---|---|
| development | training | development |
| validation | validation | development |
| evaluation | testing | development |
For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to Clotho development split, development-validation refers to Clotho validation split, and development-testing refers to Clotho evaluation split.
The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, which includes some minor fixes to the dataset (corrected file naming and some repaired corrupted files).
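For orientation, the sketch below shows one way to read the development-training captions, assuming the standard Clotho caption CSV layout (one row per clip, columns file_name and caption_1 through caption_5).

```python
# Sketch: reading the development-training captions, assuming the standard
# Clotho caption CSV layout (one row per clip, columns file_name and
# caption_1 ... caption_5).
import pandas as pd

captions = pd.read_csv("clotho_captions_development.csv")
print(len(captions))                    # number of audio clips in the split
print(captions.loc[0, "file_name"])     # audio file name of the first clip
print(captions.loc[0, "caption_1"])     # first of its five captions

# Flatten to (file_name, caption) pairs, e.g. for training a retrieval model.
pairs = captions.melt(
    id_vars="file_name",
    value_vars=[f"caption_{i}" for i in range(1, 6)],
    value_name="caption",
)[["file_name", "caption"]]
print(len(pairs))                       # five caption pairs per clip
```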
New Annotations for the Development-Testing Dataset
As a new addition to the Clotho dataset, we provide new annotations specifically for the development-testing dataset of this task. The development-testing dataset consists of 1,069 text queries (audio captions) and their corresponding audio files. For each query, multiple audio recordings are marked as relevant.
Annotations for the development-testing dataset are provided in a separate CSV file.
Annotation Process
The relevance annotations were collected on Amazon Mechanical Turk. To generate a list of potentially relevant audio files for each query, we utilised last year's winning submission (Primus_CP-JKU_8_1). This system provided a ranked list of 15 audio files for each query, in addition to the ground-truth audio file. Annotators were asked to indicate the relevance of each audio file with respect to the query.
Evaluation dataset
The evaluation dataset builds on last year's DCASE 2024 Challenge (task 8). It comprises 1,000 text queries (audio captions). Similar to the development-testing dataset, submissions are evaluated based on the relevance of the retrieved audio files. Multiple relevant audio files are indicated for each query. The evaluation data are provided to the participants in the form of audio files and captions without any additional information.
Download
The development dataset is available for download from the Zenodo repository.
Task Rules
Participants are allowed to:
- Use external resources (datasets, pretrained models) under the conditions specified in the External Resources section.
- Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
- Use all the available metadata provided, but they must explicitly state whether and how the metadata was used. This will not affect the rating of their method.
Participants are NOT allowed to:
- Use Freesound data for training or validation if these data overlap with the development-testing and the evaluation subsets of Clotho (see below).
- Make subjective judgments of the evaluation (testing) data or annotate it.
- Use additional information about the evaluation (testing) data for their method, apart from the provided audio files and captions.
External Resources
The use of external resources (data sets, pretrained models) is allowed under the following conditions:
- The task coordinators have approved the resource and shared it on the task webpage. To this end, please email the task coordinators. The list of allowed external resources will be finalised on May 18, 2025 (no further external resources will be allowed after this date).
- The external resource must be freely accessible to any other research group in the world before May 18, 2025.
- The list of external resources used must be clearly indicated in the technical report.
- The use of large-language model (LLM) APIs is allowed, provided they are reasonably accessible to everyone and incur minimal costs. For instance, a reasonable price would be equivalent to the total subscription cost of a leading provider, approximately 15 to 25 USD, over the duration of the challenge (2.5 months).
List of external data resources allowed:
Resource name | Type | Added | Link |
---|---|---|---|
PaSST | model | 01.04.2025 | https://github.com/kkoutini/PaSST |
AudioSet | audio, video | 01.04.2025 | https://research.google.com/audioset/ |
FSD50K | audio, tags | 01.04.2025 | https://zenodo.org/record/4060432 |
MACS - Multi-Annotator Captioned Soundscapes | audio, caption | 01.04.2025 | https://doi.org/10.5281/zenodo.5114770 |
WavCaps | audio, caption | 01.04.2025 | https://huggingface.co/datasets/cvssp/WavCaps |
AudioCaps | audio, caption | 01.04.2025 | https://audiocaps.github.io/ |
BERT | text, caption | 01.04.2025 | https://arxiv.org/abs/1810.04805 |
RoBERTa | text, caption | 01.04.2025 | https://arxiv.org/abs/1907.11692 |
Excluded data
Since the Clotho dataset is extracted from the Freesound website, any dataset crowdsourced from this website may overlap with the Clotho evaluation data. To address this issue, we have published a CSV file containing the forbidden Freesound sound ids. Note that the list provided this year contains more recordings than last year's. If you use any data from Freesound (e.g. through WavCaps or FSD50K), you must exclude these recordings from your pre-training, training, and validation data.
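As an illustration, the snippet below sketches how such an exclusion could be applied. The file names and the "freesound_id" column are placeholders; adapt them to the published CSV of forbidden sound ids and to your own metadata files.

```python
# Sketch: removing forbidden Freesound recordings from external training data.
# The file names and the "freesound_id" column are placeholders; adapt them to
# the published CSV of forbidden sound ids and to your own metadata files.
import pandas as pd

forbidden_ids = set(pd.read_csv("forbidden_freesound_ids.csv")["freesound_id"])

external_meta = pd.read_csv("external_freesound_metadata.csv")   # hypothetical metadata file
clean_meta = external_meta[~external_meta["freesound_id"].isin(forbidden_ids)]
clean_meta.to_csv("external_freesound_metadata_clean.csv", index=False)
print(f"removed {len(external_meta) - len(clean_meta)} overlapping recordings")
```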
Submission
All participants should submit:
- the output of their audio retrieval system in the form of a similarity matrix (a `*.csv` file),
- metadata for their submission (a `*.yaml` file), and
- a technical report for their submission (a `*.pdf` file).
We allow up to four system output submissions per participating team.
For each system, metadata should be provided in a separate file, containing the task-specific information.
All files should be packaged into a zip file for submission.
Please make a clear connection between the system name in the submitted metadata (the `.yaml` file), the submitted system output (the `.csv` file), and the technical report (the `.pdf` file)!
For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task6_<submission_index>_<output or meta or technical_report>.<csv or yaml or pdf>
For example:
Primus_CP-JKU_task6_1.output.csv
Primus_CP-JKU_task6_1.meta.yaml
Primus_CP-JKU_task6_1.technical_report.pdf
The field `<submission_index>` is used to differentiate your submissions in case you have multiple submissions.
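For convenience, the sketch below shows one way to package the three files of a single submission into a zip archive, using the example file names from above; the archive name itself is an arbitrary choice, not prescribed by the task.

```python
# Sketch: packaging one submission's files into a zip archive, using the
# example file names from above; the archive name is an arbitrary choice.
import zipfile

files = [
    "Primus_CP-JKU_task6_1.output.csv",
    "Primus_CP-JKU_task6_1.meta.yaml",
    "Primus_CP-JKU_task6_1.technical_report.pdf",
]

with zipfile.ZipFile("Primus_CP-JKU_task6_1.zip", "w") as archive:
    for path in files:
        archive.write(path)
```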
System output file
The expected system output is a `*.csv` file holding all similarities between the text queries and the audio files in the dataset (a short sketch of producing such a file follows the list below):
Index | Filename_0 | Filename_1 | Filename_2 | Filename_... | Filename_N |
---|---|---|---|---|---|
Caption_0 | Similarity_00 | Similarity_01 | Similarity_02 | Similarity_0... | Similarity_0N |
... | ... | ... | ... | ... | ... |
Caption_M | Similarity_M0 | Similarity_M1 | Similarity_M2 | Similarity_M... | Similarity_MN |
- The table must include one column for each of the N audio files and one row for each of the M queries.
- It must also include a header that specifies which audio file each column corresponds to.
- The leftmost column contains the text queries.
- The individual cells should give the estimated similarity between text queries and audio recordings.
- Higher similarity scores indicate a stronger correspondence between the textual query and the audio file.
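The sketch below illustrates one way to produce a file in this layout with pandas. The query strings and file names are placeholders, the similarity values are random stand-ins for a real system's scores, and the "caption" label of the leftmost column is an assumption.

```python
# Sketch: writing a similarity matrix in the expected layout. One row per
# caption query, one column per audio file; the values are random placeholders
# for a real system's similarity scores, and the "caption" label of the
# leftmost column is an assumption.
import numpy as np
import pandas as pd

queries = ["a dog barks while cars pass by", "rain falls on a metal roof"]   # M queries
audio_files = ["file_001.wav", "file_002.wav", "file_003.wav"]               # N audio files

similarities = np.random.rand(len(queries), len(audio_files))

matrix = pd.DataFrame(similarities, index=queries, columns=audio_files)
matrix.to_csv("Primus_CP-JKU_task6_1.output.csv", index_label="caption")
```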
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below (a small validation sketch follows the example).
# Submission information for task 6
submission:
# Submission label
# The label is used to index submissions.
# Generate your label in the following way to avoid overlapping labels among submissions:
# [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
label: Primus_CP-JKU_task6_1
#
# Submission name
# This name will be used in the results tables when space permits
name: DCASE2025 baseline system
#
# Submission name abbreviated
# This abbreviated name will be used in the result table when space is tight.
# Use maximum 10 characters.
abbreviation: Baseline
# Authors of the submitted system.
# Mark authors in the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
# First author
- lastname: Primus
firstname: Paul
email: paul.primus@jku.at # Contact email address
corresponding: true # Mark true for one of the authors
# Affiliation information for the author
affiliation:
abbreviation: CP-JKU
institute: Johannes Kepler University
department: Institute of Computational Perception
location: Linz, Austria
# Second author
- lastname: Author
firstname: Second
email: first.last@some.org
affiliation:
abbreviation: ORG
institute: Some Organization
department: Department of Something
location: City, Country
# System information
system:
# System description; the metadata provided here will be used for a meta-analysis of the submitted systems.
# Use general level tags, when possible use the tags provided in comments.
# If an information field is not applicable to the system, use "!!null".
description:
# Audio input / sampling rate, e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
input_sampling_rate: 44.1kHz
# Acoustic representation
# Here you should indicate what acoustic or audio representation you used.
# If your system used hand-crafted features (e.g. mel band energies), then you can do:
#
# `acoustic_features: mel energies`
#
# Else, if you used some pre-trained audio feature extractor, you can indicate the name of the system, for example:
#
# `acoustic_features: audioset`
acoustic_features: log-mel energies
# Text embeddings
# Here you can indicate how you treated text embeddings.
# If your method learned its own text embeddings (i.e. you did not use any pre-trained or fine-tuned NLP embeddings),
# then you can do:
#
# `text_embeddings: learned`
#
# Else, specify the pre-trained or fine-tuned NLP embeddings that you used, for example:
#
# `text_embeddings: Sentence-BERT`
text_embeddings: Sentence-BERT
# Data augmentation methods for audio
# e.g. mixup, time stretching, block mixing, pitch shifting, ...
audio_augmentation: !!null
# Data augmentation methods for text
# e.g. random swapping, synonym replacement, ...
text_augmentation: !!null
# Learning scheme
# Here you should indicate the learning scheme.
# For example, you could specify either supervised, self-supervised, or even reinforcement learning.
learning_scheme: self-supervised
# Ensemble
# Here you should indicate if you used ensemble of systems or not.
ensemble: No
# Audio modelling
# Here you should indicate the type of system used for audio modelling.
# For example, if you used some stacked CNNs, then you could do:
#
# audio_modelling: cnn
#
# If you used some pre-trained system for audio modelling, then you should indicate the system used,
# for example, PANNs-CNN14, PANNs-ResNet38.
audio_modelling: PANNs-CNN14
# Text modelling
# Similarly, here you should indicate the type of system used for text modelling.
# For example, if you used some RNNs, then you could do:
#
# text_modelling: rnn
#
# If you used some pre-trained system for text modelling,
# then you should indicate the system used (e.g. BERT).
text_modelling: Sentence-BERT
# Loss function
# Here you should indicate the loss function that you employed.
loss_function: InfoNCE
# Optimizer
# Here you should indicate the name of the optimizer that you used.
optimizer: adam
# Learning rate
# Here you should indicate the learning rate of the optimizer that you used.
learning_rate: 1e-3
# Metric monitored
# Here you should report the monitored metric for optimizing your method.
# For example, did you monitor the loss on the validation data (i.e. validation loss)?
# Or you monitored the training mAP?
metric_monitored: validation_loss
# System complexity, meta-data provided here will be used to evaluate
# submitted systems from the computational load perspective.
complexity:
# Total amount of parameters used in the acoustic model.
# For neural networks, this information is usually given before training process in the network summary.
# For other than neural networks, if parameter count information is not directly
# available, try estimating the count as accurately as possible.
# In case of ensemble approaches, add up parameters for all subsystems.
# In case embeddings are used, add up parameter count of the embedding
# extraction networks and classification network
# Use numerical value (do not use comma for thousands-separator).
total_parameters: 732354
# List of datasets used for the system (e.g., pre-training, fine-tuning, training).
# Development-training data is used here only as example.
training_datasets:
- name: Clotho-development
purpose: training # Used for training system
url: https://doi.org/10.5281/zenodo.4783391
data_types: audio, caption # Contained data types, e.g., audio, caption, label.
data_instances:
audio: 3839 # Number of contained audio instances
caption: 19195 # Number of contained caption instances
data_volume:
audio: 86353 # Total duration (in seconds) of the audio instances
caption: 6453 # Number of unique word types in the caption instances
# More datasets
#- name:
# purpose: pre-training
# url:
# data_types: A, B, C
# data_instances:
# A: xxx
# B: xxx
# C: xxx
# data_volume:
# A: xxx
# B: xxx
# C: xxx
# List of datasets used for validating the system, for example, optimizing hyperparameter.
# Development-validation data is used here only as example.
validation_datasets:
- name: Clotho-validation
url: https://doi.org/10.5281/zenodo.4783391
data_types: audio, caption
data_instances:
audio: 1045
caption: 5225
data_volume:
audio: 23636
caption: 2763
# More datasets
#- name:
# url:
# data_types: A, B, C
# data_instances:
# A: xxx
# B: xxx
# C: xxx
# data_volume:
# A: xxx
# B: xxx
# C: xxx
# URL to the source code of the system [optional]
source_code: https://github.com/OptimusPrimus/
# System results
results:
development_testing:
# System results for the development-testing split.
# Full results are not mandatory; however, they are highly recommended, as they are needed for a thorough analysis of the challenge submissions.
# If you are unable to provide all results, also incomplete results can be reported.
R@1: 0.0
R@5: 0.0
R@10: 0.0
mAP@10: 0.0
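Before packaging, it can be useful to check that the metadata file parses correctly. The sketch below performs a minimal structural check with PyYAML, based only on the fields shown in the example above.

```python
# Sketch: a minimal structural check of the metadata file with PyYAML,
# based only on the fields shown in the example above.
import yaml

with open("Primus_CP-JKU_task6_1.meta.yaml") as f:
    meta = yaml.safe_load(f)

submission = meta["submission"]
print(submission["label"], submission["abbreviation"])
assert any(author.get("corresponding") for author in submission["authors"]), \
    "one author must be marked as corresponding"
```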
Open and reproducible research
Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models publicly available after the challenge is over.
Evaluation
The submitted systems will be evaluated according to their performance, i.e., mean average precision at K (mAP@K) and recall at K (R@K), on the withheld evaluation dataset. Explanations of these metrics can be found on the Wikipedia page on evaluation measures in information retrieval, in the IR book, and in a blog post.
The mAP@K is a rank-aware metric: it is the mean, over all caption queries, of the average precision of the relevant audio files among the top-K retrieved results. The R@K is a rank-unaware metric: it measures the proportion of all relevant audio files for a caption query that appear among the top-K retrieved results. Submitted methods will be ranked by the mAP@16 metric.
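For clarity, the sketch below implements both metrics for a single query; mAP@K over the whole dataset is then the mean of the per-query average precision values. The official evaluation script may differ in details (for example, in how average precision is normalised), so this is only meant to illustrate the definitions.

```python
# Sketch of both metrics for a single query, given a ranked list of file names
# and the set of relevant files for that query.
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant files that appear among the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision_at_k(ranked, relevant, k):
    """Average of precision@i over the ranks i at which a relevant file appears."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for i, name in enumerate(ranked[:k], start=1):
        if name in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / min(len(relevant), k)

ranked = ["a.wav", "b.wav", "c.wav", "d.wav"]      # system ranking for one query
relevant = {"a.wav", "d.wav"}                      # annotated relevant files
print(recall_at_k(ranked, relevant, k=2))              # 0.5
print(average_precision_at_k(ranked, relevant, k=10))  # (1/1 + 2/4) / 2 = 0.75
```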
Baseline system
The code for the baseline system of task 6 in the DCASE 2025 Challenge is available in a GitHub repository:
New in 2025:
- The architecture of the baseline system is based on the top-ranked system of Task 8 in the DCASE 2024 Challenge.
- It uses the Patchout faSt Spectrogram Transformer (PaSST) and RoBERTa-large to encode audio recordings and text queries, respectively (a minimal scoring sketch is given after the lists below).
- It is trained on three audio-caption datasets (Clotho, AudioCaps, and WavCaps) to match the performance of the previous year's submissions.
Some technical highlights:
- The training loop is implemented using PyTorch and PyTorch Lightning.
- Logging is implemented using Weights and Biases.
- Dataset loading is (partially) handled via aac-datasets.
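To make the retrieval step concrete, the sketch below shows how a dual-encoder system of this kind scores query-audio pairs at retrieval time. The embeddings are random stand-ins for pooled and projected PaSST (audio) and RoBERTa (text) outputs, and the shared embedding dimensionality of 1024 is an assumption rather than a detail of the actual baseline.

```python
# Minimal sketch of dual-encoder retrieval scoring. The embeddings below are
# random stand-ins for pooled PaSST (audio) and RoBERTa (text) outputs after
# projection; the shared dimensionality of 1024 is an assumption.
import torch
import torch.nn.functional as F

def rank_audio(text_emb: torch.Tensor, audio_emb: torch.Tensor):
    """Return the similarity matrix and, per query, audio indices sorted by score."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    similarities = text_emb @ audio_emb.T          # (num_queries, num_audio_files)
    ranking = similarities.argsort(dim=-1, descending=True)
    return similarities, ranking

queries = torch.randn(5, 1024)      # stand-in for projected text-encoder outputs
audio = torch.randn(100, 1024)      # stand-in for projected audio-encoder outputs
similarities, ranking = rank_audio(queries, audio)
print(similarities.shape)           # torch.Size([5, 100])
print(ranking[:, :10])              # indices of the top-10 audio files per query
```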
Results on the development-testing split
The results of the baseline system on the development-testing split are given in the table below.
| Metric | Value |
|---|---|
| R@1 | 0.2329 |
| R@5 | 0.5217 |
| R@10 | 0.6478 |
| mAP@10 | 0.3523 |
For more detailed results, have a look at our GitHub repository.
Citations
If you participate in this task, you might want to check the following papers. If you find a paper that should be cited here, please contact us.
- The Clotho dataset:
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.
Clotho: an Audio Captioning Dataset
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
- Find more relevant papers for the baseline system in the GitHub repository.