Retrieving specific moments within long audio recordings that align with a given textual query
Description
Audio moment retrieval (AMR) focuses on retrieving specific moments within long audio recordings that align with a given textual query. For instance, given a long audio clip and a free-format text query like "Spectators are cheering and shouting in a sports game," the objective is to identify and output the timestamps of moments that match the query, as shown in Fig. 1. The primary challenge is to capture temporal contexts within long audio, which requires effective sequence modeling and learning methods to enhance retrieval accuracy. Participants are encouraged to develop advanced audio-text models, design networks that effectively capture temporal structures, and generate synthetic data to enhance training processes.
Audio datasets
Development datasets
The development datasets for this task are Clotho-Moment (synthetic dataset) and CASTELLA (manually annotated dataset). These datasets consist of long audio recordings, audio captions, and their temporal boundaries.
Clotho-Moment is a large-scale synthetic dataset designed to boost the training of moment retrieval models. It simulates audio events occurring at random intervals by overlaying clips from Clotho, an audio-text pair dataset, onto background noise. Because the simulation requires no manual annotation, the dataset is very large, containing 51,240 one-minute audio recordings with 44,261 captions. For complete details of the dataset construction, see the following paper:
CASTELLA is a manually annotated dataset for model training and evaluation. Each audio recording ranges from one to five minutes in length and contains up to five moments, each described with an average of 7.8 words. Audio recordings are extracted from YouTube videos used for AudioCaps, totaling 1,862 samples. Each moment is defined by a pair of start and end timestamps. The audio captions and temporal boundaries are crowdsourced. For complete details of the dataset construction, see the following paper:
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, and Tatsuya Komatsu. CASTELLA: long audio dataset with captions and temporal boundaries. 2026. arXiv:2511.15131.
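The overlay simulation described above can be sketched as follows. This is a simplified illustration, not the actual Clotho-Moment generation code: the real dataset overlays Clotho audio-text pairs onto recorded background noise, whereas this sketch uses synthetic noise and assumes mono waveforms as NumPy arrays.

```python
import numpy as np

def simulate_moment_track(event_clips, sr=16000, duration_s=60.0,
                          noise_scale=0.01, seed=0):
    """Overlay event clips onto a background track at random,
    non-overlapping offsets, returning the mix and [start, end]
    boundaries in seconds for each placed event.

    Hypothetical sketch of a Clotho-Moment-style simulation; the
    function name and parameters are illustrative, not the original.
    """
    rng = np.random.default_rng(seed)
    n_total = int(duration_s * sr)
    # Stand-in for recorded background noise.
    mix = rng.normal(scale=noise_scale, size=n_total)
    boundaries = []
    cursor = 0  # earliest sample index where the next event may start
    for clip in event_clips:
        n_clip = len(clip)
        latest = n_total - n_clip
        if cursor > latest:
            break  # no room left for this clip
        start = int(rng.integers(cursor, latest + 1))
        mix[start:start + n_clip] += clip
        boundaries.append([start / sr, (start + n_clip) / sr])
        cursor = start + n_clip  # keep events non-overlapping
    return mix, boundaries
```

Because the boundaries are known by construction, each overlaid clip's caption can serve directly as the query for its [start, end] window, which is what makes annotation-free scaling possible.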
CASTELLA and Clotho-Moment are originally divided into training, validation, and test splits. To ensure consistency, the split names are redefined using DCASE terminology: development-training refers to the CASTELLA and Clotho-Moment training splits, development-validation to the CASTELLA validation split, and development-testing to the CASTELLA test split.
Evaluation data
The evaluation data consists of 100 audio recordings. The audio recordings are collected from YouTube, and the audio captions and timestamps are collected in the same manner as CASTELLA. Please note that the evaluation dataset only contains captions that represent local events with timestamps, not summaries of the entire audio recordings.
The organizers will provide extracted audio-text features of MS-CLAP 2023 with a sliding window, instead of raw audio files. Participants who wish to use raw audio files should contact the organizers.
Download
Experiments for this task require audio data, captions, and timestamps.
| Dataset | Audio | Caption | Timestamps | Extracted features |
|---|---|---|---|---|
| Clotho-Moment | HuggingFace | HuggingFace | HuggingFace | Zenodo / HuggingFace |
| CASTELLA | To be downloaded by participants, or contact the organizers | GitHub | GitHub | Zenodo / HuggingFace |
| Evaluation data | TBD | TBD | TBD | TBD |
The organizers provide audio data, captions, and timestamps for Clotho-Moment.
Captions and timestamps of CASTELLA are also available.
The audio data of CASTELLA and the evaluation data for this challenge will **NOT** be distributed. The following scripts include a downloader for CASTELLA audio data and the feature extractor.
If there are issues downloading the data (e.g., due to videos being set to private), please contact the organizers.
To avoid download issues with audio data, the organizers provide extracted audio-text features of CASTELLA, Clotho-Moment, and the evaluation data using MS-CLAP 2023 with a sliding window. These features enable participants to train and evaluate the baseline system without downloading raw audio data.
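The provided features are per-window embeddings produced by sliding a fixed-length window over each long recording. A minimal sketch of that windowing scheme is below; the window and hop lengths, and the `embed_fn` placeholder standing in for a pretrained encoder such as MS-CLAP 2023, are assumptions for illustration, not the organizers' exact settings.

```python
import numpy as np

def sliding_window_embeddings(waveform, sr, embed_fn, win_s=7.0, hop_s=1.0):
    """Slice a long mono recording into fixed-length windows and embed
    each one, yielding a sequence of frame-level features.

    `embed_fn` maps a waveform chunk to a 1-D embedding; window/hop
    lengths here are illustrative defaults.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = []
    for start in range(0, max(len(waveform) - win, 0) + 1, hop):
        frames.append(embed_fn(waveform[start:start + win]))
    return np.stack(frames)  # shape: (num_windows, embed_dim)
```

The resulting `(num_windows, embed_dim)` sequence is the audio-side input a retrieval model consumes; the text query is embedded once by the matching text encoder.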
Clotho-Moment
CASTELLA
Evaluation data
TBD
Task Rules
Participants are allowed to:
- Use external resources (datasets, pre-trained models) under the conditions specified in the External Resources section.
- Augment the development dataset (i.e., development-training and development-testing) with or without the use of external data.
- Use all the metadata provided, but participants must explicitly state whether they used it. This will not affect the rating of their method.
Participants are NOT allowed to:
- Make subjective judgments of the evaluation (testing) data, nor annotate it.
- Use additional information about the evaluation (testing) data for their method, apart from the provided audio files and captions from the evaluation data.
- Use visual information from original video data.
External Resources
The use of external resources (datasets, pre-trained models) is allowed under the following conditions:
- The task organizers have approved the resource and shared it on the task webpage. To this end, please email the task organizers. The list of allowed external resources will be finalized on May 15 (no further external sources allowed).
- The external resource must be freely accessible to any other research group in the world before May 15, 2026.
- The list of external resources used must be clearly indicated in the technical report.
- The use of large-language models (LLMs) that can run in a local environment, and have publicly available weights, is allowed.
- The use of LLM APIs, such as ChatGPT and Gemini, is NOT allowed.
List of external data resources allowed:
| Resource name | Type | Added | Link |
|---|---|---|---|
| MS-CLAP | model | 19.09.2024 | https://github.com/microsoft/CLAP |
| LAION-CLAP | model | 16.05.2025 | https://github.com/LAION-AI/CLAP |
| M2D-CLAP | model | 23.02.2026 | https://github.com/nttcslab/m2d |
| PaSST | model | 01.04.2025 | https://github.com/kkoutini/PaSST |
| EAT | model | 04.05.2025 | https://arxiv.org/pdf/2401.03497v1 |
| BEATs | model | 04.05.2025 | https://arxiv.org/pdf/2212.09058 |
| BERT | text, caption | 01.04.2025 | https://arxiv.org/abs/1810.04805 |
| RoBERTa | text, caption | 01.04.2025 | https://arxiv.org/abs/1907.11692 |
| Qwen2-Audio | model | 15.04.2025 | https://github.com/QwenLM/Qwen2-Audio |
| Qwen2.5-Omni | model | 12.06.2025 | https://github.com/QwenLM/Qwen2.5-Omni |
| Audio-Flamingo | model | 16.12.2025 | https://github.com/NVIDIA/audio-flamingo |
| SALMONN | model | 28.09.2025 | https://github.com/bytedance/SALMONN |
| LLAMA-Omni | model | 19.05.2025 | https://github.com/ictnlp/LLaMA-Omni |
| TimeAudio | model | 18.11.2025 | https://github.com/lysanderism/TimeAudio |
| AudioSet | audio | 01.04.2025 | https://research.google.com/audioset/ |
| TACOS | audio, caption | 12.05.2025 | https://zenodo.org/records/15379789 |
| FSD50K | audio, tags | 01.04.2025 | https://zenodo.org/record/4060432 |
| MACS - Multi-Annotator Captioned Soundscapes | audio, caption | 01.04.2025 | https://doi.org/10.5281/zenodo.5114770 |
| WavCaps | audio, caption | 01.04.2025 | https://huggingface.co/datasets/cvssp/WavCaps |
| AudioCaps | audio, caption | 01.04.2025 | https://audiocaps.github.io/ |
| Clotho2.1 | audio, caption | 26.05.2021 | https://zenodo.org/records/4783391 |
| AudioSetCaps | audio, caption | 13.12.2025 | https://github.com/JishengBai/AudioSetCaps |
Evaluation
The systems are evaluated using ranking metrics, with a particular focus on Recall1 and mean average precision (mAP). To account for the correspondence of temporal boundaries between retrieved and ground-truth moments, a retrieved result counts as correct when its intersection over union (IoU) with a ground-truth moment exceeds the threshold θ. Recall1@θ considers only the top-ranked predicted moment, while mAP@θ considers all predicted and ground-truth moments.
Since this challenge focuses on how accurately the top-ranked retrieved moment matches the ground-truth moment, the primary metric for ranking is Recall1@0.7, which counts the most confident predicted moment as correct when it has an IoU of 0.7 or higher with the corresponding ground-truth moment. Please note that participants are only required to predict a single moment, even when multiple ground-truth moments exist.
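The two quantities above can be sketched as follows. This is a simplified illustration of temporal IoU and Recall1@θ, not the official evaluation script, which may differ in details such as tie handling.

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at(predictions, ground_truths, threshold=0.7):
    """Fraction of queries whose top-ranked predicted moment reaches
    an IoU of at least `threshold` with some ground-truth moment.

    `predictions`: one top-ranked [start, end] span per query.
    `ground_truths`: a list of ground-truth spans per query.
    """
    hits = 0
    for top_pred, gts in zip(predictions, ground_truths):
        if any(temporal_iou(top_pred, gt) >= threshold for gt in gts):
            hits += 1
    return hits / len(predictions)
```

For example, a prediction [0, 10] against a ground truth [5, 15] overlaps for 5 s over a 15 s union, so its IoU of 1/3 would not count as a hit at θ = 0.7.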
Submission
All participants should submit:
- the output of their audio moment retrieval system in the form below (`*.jsonl` file),
- metadata for their submission (`*.yaml` file), and
- a technical report for their submission (`*.pdf` file).
We allow up to four system output submissions per participating team.
For each system, metadata should be provided in a separate file, containing task-specific information.
All files should be packaged into a zip file for submission.
Please make a clear connection between the system name in the submitted metadata (the .yaml file),
submitted system output (the .jsonl file), and the technical report (the .pdf file)!
For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task6_<submission_index>.<output or meta or technical_report>.<jsonl or yaml or pdf>
For example:
Munakata_LY_task6_1.output.jsonl
Munakata_LY_task6_1.meta.yaml
Munakata_LY_task6_1.technical_report.pdf
The field <submission_index> is used to differentiate your submissions in case you submit multiple systems.
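One way to sanity-check filenames before packaging is a small regular-expression filter matching the examples above. The pattern is an assumption inferred from those examples (submission indices 1-4, the three file roles shown), not an official validator.

```python
import re

# Matches the example convention, e.g. Munakata_LY_task6_1.output.jsonl.
# Assumed from the examples above: indices 1-4 and three file roles.
SUBMISSION_NAME = re.compile(
    r"^[A-Za-z]+_[A-Za-z]+_task6_[1-4]"
    r"\.(output\.jsonl|meta\.yaml|technical_report\.pdf)$"
)

def check_submission_names(filenames):
    """Return the filenames that do not follow the naming convention."""
    return [f for f in filenames if not SUBMISSION_NAME.match(f)]
```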
System output file
The expected system output is a *.jsonl file containing qid, query, duration, vid, and pred_relevant_windows.
pred_relevant_windows is the target to predict.
{"qid": "dcase2026_evaluation_q001", "query": "Something entering the water", "duration": 60, "vid": "dcase2026_evaluation_audio_001", "pred_relevant_windows": [[0.000, 50.000]]}
{"qid": "dcase2026_evaluation_q002", "query": "Heavy breathing", "duration": 150, "vid": "dcase2026_evaluation_audio_001", "pred_relevant_windows": [[10.000, 35.000]]}
{"qid": "dcase2026_evaluation_q003", "query": "A man talking while the wind is blowing", "duration": 300, "vid": "dcase2026_evaluation_audio_002", "pred_relevant_windows": [[40.000, 45.000]]}
{"qid": "dcase2026_evaluation_q004", "query": "Tapping glass with a mallet", "duration": 250, "vid": "dcase2026_evaluation_audio_003", "pred_relevant_windows": [[30.000, 50.000]]}
| Entry | Type | Description |
|---|---|---|
| `qid` | `int` | unique query id |
| `query` | `str` | natural language query, not used by the evaluation script |
| `vid` | `str` | unique audio id (`vid` was originally the video id used in video moment retrieval) |
| `pred_relevant_windows` | `list(list)` | moment retrieval predictions; each sublist contains two elements, [start (seconds), end (seconds)] |
Participants must submit at least one moment and can submit multiple moments per query in descending order of confidence. If multiple moments are submitted for a single query, they are used to compute the mAP; however, since the primary metric is Recall1@0.7, which considers only the most confident moment, additional moments do not affect the final ranking.
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below.
# Submission information for task 6
submission:
# Submission label
# The label is used to index submissions.
# Generate your label in the following way to avoid overlapping codes among submissions:
# [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
label: Munakata_LY_task6_1
#
# Submission name
# This name will be used in the results tables when space permits
name: DCASE2026 baseline system
#
# Submission name abbreviated
# This abbreviated name will be used in the result table when space is tight.
# Use maximum 10 characters.
abbreviation: Baseline
# Authors of the submitted system.
# Mark authors in the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
# First author
- lastname: Munakata
firstname: Hokuto
email: hokuto.munakata@lycorp.co.jp # Contact email address
corresponding: true # Mark true for one of the authors
# Affiliation information for the author
affiliation:
abbreviation: LY
institute: LY Corporation
department: Multimodal AI Unit
location: Osaka, Japan
# Second author
- lastname: Author
firstname: Second
email: first.last@some.org
affiliation:
abbreviation: ORG
institute: Some Organization
department: Department of Something
location: City, Country
# System information
system:
model:
# Describe the model architecture of your system.
# If your system is an ensemble of multiple models, please describe all models used in the system.
audio_models: [
MS-CLAP,
]
text_models: [
MS-CLAP
]
# If you use audio LLMs, such as Qwen2-Audio, please specify the number of trainable and frozen parameters.
LLMs: []
# Describe the number of trainable parameters in your system.
trainable_parameters: 7.1 M
# Describe the number of frozen parameters in your system, if any.
freezed_parameters: 158.4 M
loss_function: [
"L1",
"gIoU",
"cross_entropy"
]
dataset:
# If you use data augmentation, please specify the data augmentation methods used in your system.
data_augmentation: !!null
# If you use external data resources except for the provided dataset (i.e., CASTELLA and Clotho-Moment), please specify the name of the data resources used in your system.
external_data_resources: [
"audiocaps"
]
# Describe the number of audio-caption pairs used for training your system.
audio_captions: 48k
ensemble: false
# System results
results:
development_testing:
# System results for the development-testing split.
# Report Recall1@0.5 and Recall1@0.7 for the CASTELLA test set.
# Full results are not mandatory; however, they are highly recommended, as they are needed for thorough analysis of the challenge submissions.
# If you are unable to provide all results, incomplete results can also be reported.
Recall1@0.7: 0.0
Recall1@0.5: 0.0
Open and reproducible research
Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g., on GitHub) and pre-trained models available after the challenge is over.
Baseline System
The organizers provide a baseline system that is simple but effective for this challenge. In addition to the baseline system, the following repository contains an evaluation script for the development dataset and a preparation script for submission.
The baseline system consists of two modules: a feature extractor and a Detection Transformer (DETR)-based network. First, the feature extractor, based on pre-trained MS-CLAP 2023 with a sliding window, converts an input audio recording and a text query into sequential embeddings that account for cross-modal alignment between audio and text. Using these sequential embeddings, the DETR-based network directly outputs multiple pairs of start and end timestamps. Through training on audio moment retrieval datasets (e.g., CASTELLA and Clotho-Moment), the DETR-based network learns both the cross-modal alignment between audio and text and the dependencies between audio frames of the long recording. For more details, please see the following paper:
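Moment-DETR-style heads commonly parameterize each predicted span as a normalized (center, width) pair; decoding those into absolute timestamps is a small but essential step. The sketch below assumes that parameterization (an assumption about the baseline's output head, not its actual code).

```python
import numpy as np

def decode_spans(norm_spans, duration_s):
    """Convert DETR-style normalized (center, width) predictions into
    absolute [start, end] timestamps in seconds, clipped to the
    recording length.

    Assumes a Moment-DETR-style span parameterization; illustrative
    only, not the baseline's exact decoding code.
    """
    spans = np.asarray(norm_spans, dtype=float)
    centers, widths = spans[:, 0], spans[:, 1]
    starts = np.clip((centers - widths / 2) * duration_s, 0.0, duration_s)
    ends = np.clip((centers + widths / 2) * duration_s, 0.0, duration_s)
    return np.stack([starts, ends], axis=1)
```

For a 60-second recording, a normalized prediction (0.5, 0.2) decodes to the span [24 s, 36 s]; clipping keeps spans that spill past the edges inside the recording.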
Results on the development-testing split
| Model | Recall1@0.5 (↑) | Recall1@0.7 (↑) | mAP (avg) (↑) | mAP@0.5 (↑) | mAP@0.75 (↑) |
|---|---|---|---|---|---|
| Baseline (CASTELLA only) | 23.16 | 10.32 | 9.11 | 20.34 | 6.96 |
| Baseline (CASTELLA + Clotho-Moment) | 25.61 | 13.59 | 12.06 | 23.60 | 10.72 |
Citation
If you participate in this task, you might want to check the following papers.
Baseline system
Dataset
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, and Tatsuya Komatsu. CASTELLA: long audio dataset with captions and temporal boundaries. 2026. arXiv:2511.15131.