Retrieving specific moments within long audio recordings that align with a given textual query
Description
Audio moment retrieval (AMR) focuses on retrieving specific moments within long audio recordings that align with a given textual query. For instance, given a long audio clip and a free-format text query like "Spectators are cheering and shouting in a sports game," the objective is to identify and output the timestamps of moments that match the query, as shown in Fig. 1. The primary challenge is to capture temporal contexts within long audio, which requires effective sequence modeling and learning methods to enhance retrieval accuracy. Participants are encouraged to develop advanced audio-text models, design networks that effectively capture temporal structures, and generate synthetic data to enhance training processes.
Audio datasets
Development datasets
The development datasets for this task are Clotho-Moment (synthetic dataset) and CASTELLA (manually annotated dataset). These datasets consist of long audio recordings, audio captions, and their temporal boundaries.
Clotho-Moment is a large-scale synthetic dataset designed to boost the training of moment retrieval models. It simulates audio events occurring at random intervals by overlaying clips from Clotho, an audio-text pair dataset, onto background noise. Because the simulation requires no manual annotation, the dataset is very large, containing 51,240 one-minute audio recordings with 44,261 captions. For complete details of the dataset construction, see the following paper:
CASTELLA is a manually annotated dataset for model training and evaluation. Each audio recording ranges from one to five minutes in length and contains up to five moments, each described with an average of 7.8 words. Audio recordings are extracted from YouTube videos used for AudioCaps, totaling 1,862 samples. Each moment is defined by a pair of start and end timestamps. The audio captions and temporal boundaries are crowdsourced. For complete details of the dataset construction, see the following paper:
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, and Tatsuya Komatsu. CASTELLA: long audio dataset with captions and temporal boundaries. 2026. arXiv:2511.15131.
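The overlay simulation described above can be sketched as follows. This is a simplified illustration, not the actual Clotho-Moment generation code: the real dataset overlays Clotho audio-text pairs onto recorded background noise, whereas this sketch uses synthetic noise and assumes mono waveforms as NumPy arrays.

```python
import numpy as np

def simulate_moment_track(event_clips, sr=16000, duration_s=60.0,
                          noise_scale=0.01, seed=0):
    """Overlay event clips onto a background track at random,
    non-overlapping offsets, returning the mix and [start, end]
    boundaries in seconds for each placed event.

    Hypothetical sketch of a Clotho-Moment-style simulation; the
    function name and parameters are illustrative, not the original.
    """
    rng = np.random.default_rng(seed)
    n_total = int(duration_s * sr)
    # Stand-in for recorded background noise.
    mix = rng.normal(scale=noise_scale, size=n_total)
    boundaries = []
    cursor = 0  # earliest sample index where the next event may start
    for clip in event_clips:
        n_clip = len(clip)
        latest = n_total - n_clip
        if cursor > latest:
            break  # no room left for this clip
        start = int(rng.integers(cursor, latest + 1))
        mix[start:start + n_clip] += clip
        boundaries.append([start / sr, (start + n_clip) / sr])
        cursor = start + n_clip  # keep events non-overlapping
    return mix, boundaries
```

Because the boundaries are known by construction, each overlaid clip's caption can serve directly as the query for its [start, end] window, which is what makes annotation-free scaling possible.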
CASTELLA and Clotho-Moment are originally divided into training, validation, and test splits. To ensure consistency, the split names are redefined using DCASE terminology: development-training refers to the CASTELLA and Clotho-Moment training splits, development-validation to the CASTELLA validation split, and development-testing to the CASTELLA test split.
Evaluation data
The evaluation data consists of 100 audio recordings. The audio recordings are collected from YouTube, and the audio captions and timestamps are collected in the same manner as CASTELLA. Please note that the evaluation dataset only contains captions that represent local events with timestamps, not summaries of the entire audio recordings.
The organizers will provide extracted audio-text features of MS-CLAP 2023 with a sliding window, instead of raw audio files. Participants who wish to use raw audio files should contact the organizers.
Download
Experiments for this task require audio data, captions, and timestamps.
| Dataset | Audio | Caption | Timestamps | Extracted features |
|---|---|---|---|---|
| Clotho-Moment | HuggingFace | HuggingFace | HuggingFace | Zenodo / HuggingFace |
| CASTELLA | To be downloaded by participants, or contact the organizers | GitHub | GitHub | Zenodo / HuggingFace |
| Evaluation data | TBD | TBD | TBD | TBD |
The organizers provide audio data, captions, and timestamps for Clotho-Moment.
Captions and timestamps of CASTELLA are also available.
The audio data of CASTELLA and the evaluation data for this challenge will **NOT** be distributed. The following scripts include a downloader for CASTELLA audio data and the feature extractor.
If there are issues downloading the data (e.g., due to videos being set to private), please contact the organizers.
To avoid download issues with audio data, the organizers provide extracted audio-text features of CASTELLA, Clotho-Moment, and the evaluation data using MS-CLAP 2023 with a sliding window. These features enable participants to train and evaluate the baseline system without downloading raw audio data.
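The provided features are per-window embeddings produced by sliding a fixed-length window over each long recording. A minimal sketch of that windowing scheme is below; the window and hop lengths, and the `embed_fn` placeholder standing in for a pretrained encoder such as MS-CLAP 2023, are assumptions for illustration, not the organizers' exact settings.

```python
import numpy as np

def sliding_window_embeddings(waveform, sr, embed_fn, win_s=7.0, hop_s=1.0):
    """Slice a long mono recording into fixed-length windows and embed
    each one, yielding a sequence of frame-level features.

    `embed_fn` maps a waveform chunk to a 1-D embedding; window/hop
    lengths here are illustrative defaults.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = []
    for start in range(0, max(len(waveform) - win, 0) + 1, hop):
        frames.append(embed_fn(waveform[start:start + win]))
    return np.stack(frames)  # shape: (num_windows, embed_dim)
```

The resulting `(num_windows, embed_dim)` sequence is the audio-side input a retrieval model consumes; the text query is embedded once by the matching text encoder.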
Clotho-Moment
CASTELLA
Evaluation data
TBD
Task Rules
Participants are allowed to:
- Use external resources (datasets, pre-trained models) under the conditions specified in the External Resources section.
- Augment the development dataset (i.e., development-training and development-testing) with or without the use of external data.
- Use all the metadata provided, but participants must explicitly state whether they used it. This will not affect the rating of their method.
Participants are NOT allowed to:
- Make subjective judgments of the evaluation (testing) data, nor annotate it.
- Use additional information about the evaluation (testing) data for their method, apart from the provided audio files and captions from the evaluation data.
- Use visual information from original video data.
External Resources
The use of external resources (datasets, pre-trained models) is allowed under the following conditions:
- The task organizers have approved the resource and shared it on the task webpage. To this end, please email the task organizers. The list of allowed external resources will be finalized on May 15 (no further external sources allowed).
- The external resource must be freely accessible to any other research group in the world before May 15, 2026.
- The list of external resources used must be clearly indicated in the technical report.
- The use of large-language models (LLMs) that can run in a local environment, and have publicly available weights, is allowed.
- The use of LLM APIs, such as ChatGPT and Gemini, is NOT allowed.
List of external data resources allowed:
| Resource name | Type | Added | Link |
|---|---|---|---|
| MS-CLAP | model | 19.09.2024 | https://github.com/microsoft/CLAP |
| LAION-CLAP | model | 16.05.2025 | https://github.com/LAION-AI/CLAP |
| M2D-CLAP | model | 23.02.2026 | https://github.com/nttcslab/m2d |
| PaSST | model | 01.04.2025 | https://github.com/kkoutini/PaSST |
| EAT | model | 04.05.2025 | https://arxiv.org/pdf/2401.03497v1 |
| BEATs | model | 04.05.2025 | https://arxiv.org/pdf/2212.09058 |
| BERT | text, caption | 01.04.2025 | https://arxiv.org/abs/1810.04805 |
| RoBERTa | text, caption | 01.04.2025 | https://arxiv.org/abs/1907.11692 |
| Qwen2-Audio | model | 15.04.2025 | https://github.com/QwenLM/Qwen2-Audio |
| Qwen2.5-Omni | model | 12.06.2025 | https://github.com/QwenLM/Qwen2.5-Omni |
| Audio-Flamingo | model | 16.12.2025 | https://github.com/NVIDIA/audio-flamingo |
| SALMONN | model | 28.09.2025 | https://github.com/bytedance/SALMONN |
| LLAMA-Omni | model | 19.05.2025 | https://github.com/ictnlp/LLaMA-Omni |
| TimeAudio | model | 18.11.2025 | https://github.com/lysanderism/TimeAudio |
| AudioSet | audio | 01.04.2025 | https://research.google.com/audioset/ |
| TACOS | audio, caption | 12.05.2025 | https://zenodo.org/records/15379789 |
| FSD50K | audio, tags | 01.04.2025 | https://zenodo.org/record/4060432 |
| MACS - Multi-Annotator Captioned Soundscapes | audio, caption | 01.04.2025 | https://doi.org/10.5281/zenodo.5114770 |
| WavCaps | audio, caption | 01.04.2025 | https://huggingface.co/datasets/cvssp/WavCaps |
| AudioCaps | audio, caption | 01.04.2025 | https://audiocaps.github.io/ |
| Clotho2.1 | audio, caption | 26.05.2021 | https://zenodo.org/records/4783391 |
| AudioSetCaps | audio, caption | 13.12.2025 | https://github.com/JishengBai/AudioSetCaps |
Evaluation
The systems are evaluated using ranking metrics, with a particular focus on Recall1 and mean average precision (mAP). To account for the correspondence of temporal boundaries between retrieved and ground-truth moments, a retrieved result counts as correct when its intersection over union (IoU) with a ground-truth moment exceeds the threshold θ. Recall1@θ considers only the top-ranked predicted moment, while mAP@θ considers all predicted and ground-truth moments.
Since this challenge focuses on how accurately the top-ranked retrieved moment matches the ground-truth moment, the primary metric for ranking is Recall1@0.7, which counts the most confident predicted moment as correct when it has an IoU of 0.7 or higher with the corresponding ground-truth moment. Please note that participants are only required to predict a single moment, even when multiple ground-truth moments exist.
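The two quantities above can be sketched as follows. This is a simplified illustration of temporal IoU and Recall1@θ, not the official evaluation script, which may differ in details such as tie handling.

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at(predictions, ground_truths, threshold=0.7):
    """Fraction of queries whose top-ranked predicted moment reaches
    an IoU of at least `threshold` with some ground-truth moment.

    `predictions`: one top-ranked [start, end] span per query.
    `ground_truths`: a list of ground-truth spans per query.
    """
    hits = 0
    for top_pred, gts in zip(predictions, ground_truths):
        if any(temporal_iou(top_pred, gt) >= threshold for gt in gts):
            hits += 1
    return hits / len(predictions)
```

For example, a prediction [0, 10] against a ground truth [5, 15] overlaps for 5 s over a 15 s union, so its IoU of 1/3 would not count as a hit at θ = 0.7.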
Submission
All participants should submit:
- the output of their audio moment retrieval system in the form below (`*.jsonl` file),
- metadata for their submission (`*.yaml` file), and
- a technical report for their submission (`*.pdf` file).
We allow up to four system output submissions per participating team.
For each system, metadata should be provided in a separate file, containing task-specific information.
All files should be packaged into a zip file for submission.
Please make a clear connection between the system name in the submitted metadata (the .yaml file),
submitted system output (the .jsonl file), and the technical report (the .pdf file)!
For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task6_<submission_index>.<output or meta or technical_report>.<jsonl or yaml or pdf>
For example:
Munakata_LY_task6_1.output.jsonl
Munakata_LY_task6_1.meta.yaml
Munakata_LY_task6_1.technical_report.pdf
The field <submission_index> is used to differentiate your submissions in case you submit multiple systems.
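One way to sanity-check filenames before packaging is a small regular-expression filter matching the examples above. The pattern is an assumption inferred from those examples (submission indices 1-4, the three file roles shown), not an official validator.

```python
import re

# Matches the example convention, e.g. Munakata_LY_task6_1.output.jsonl.
# Assumed from the examples above: indices 1-4 and three file roles.
SUBMISSION_NAME = re.compile(
    r"^[A-Za-z]+_[A-Za-z]+_task6_[1-4]"
    r"\.(output\.jsonl|meta\.yaml|technical_report\.pdf)$"
)

def check_submission_names(filenames):
    """Return the filenames that do not follow the naming convention."""
    return [f for f in filenames if not SUBMISSION_NAME.match(f)]
```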
System output file
The expected system output is a *.jsonl file containing qid, query, duration, vid, and pred_relevant_windows.
pred_relevant_windows is the target to predict.
{"qid": "dcase2026_evaluation_q001", "query": "Something entering the water", "duration": 60, "vid": "dcase2026_evaluation_audio_001", "pred_relevant_windows": [[0.000, 50.000]]}
{"qid": "dcase2026_evaluation_q002", "query": "Heavy breathing", "duration": 150, "vid": "dcase2026_evaluation_audio_001", "pred_relevant_windows": [[10.000, 35.000]]}
{"qid": "dcase2026_evaluation_q003", "query": "A man talking while the wind is blowing", "duration": 300, "vid": "dcase2026_evaluation_audio_002", "pred_relevant_windows": [[40.000, 45.000]]}
{"qid": "dcase2026_evaluation_q004", "query": "Tapping glass with a mallet", "duration": 250, "vid": "dcase2026_evaluation_audio_003", "pred_relevant_windows": [[30.000, 50.000]]}
| Entry | Type | Description |
|---|---|---|
| `qid` | `int` | unique query id |
| `query` | `str` | natural language query, not used by the evaluation script |
| `vid` | `str` | unique audio id (`vid` was originally the video id used in video moment retrieval) |
| `pred_relevant_windows` | `list(list)` | moment retrieval predictions; each sublist contains two elements, [start (seconds), end (seconds)] |
Participants must submit at least one moment and can submit multiple moments per query in descending order of confidence. If multiple moments are submitted for a single query, they are used to compute the mAP; however, since the primary metric is Recall1@0.7, which considers only the most confident moment, additional moments do not affect the final ranking.
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below.
# Submission information for task 6
submission:
# Submission label
# The label is used to index submissions.
# Generate your label in the following way to avoid overlapping codes among submissions:
# [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
label: Munakata_LY_task6_1
#
# Submission name
# This name will be used in the results tables when space permits
name: DCASE2026 baseline system
#
# Submission name abbreviated
# This abbreviated name will be used in the result table when space is tight.
# Use maximum 10 characters.
abbreviation: Baseline
# Authors of the submitted system.
# Mark authors in the order you want them to appear in submission lists.
# One of the authors has to be marked as corresponding author,
# this will be listed next to the submission in the results tables.
authors:
# First author
- lastname: Munakata
firstname: Hokuto
email: hokuto.munakata@lycorp.co.jp # Contact email address
corresponding: true # Mark true for one of the authors
# Affiliation information for the author
affiliation:
abbreviation: LY
institute: LY Corporation
department: Multimodal AI Unit
location: Osaka, Japan
# Second author
- lastname: Author
firstname: Second
email: first.last@some.org
affiliation:
abbreviation: ORG
institute: Some Organization
department: Department of Something
location: City, Country
# System information
system:
model:
# Describe the model architecture of your system.
# If your system is an ensemble of multiple models, please describe all models used in the system.
audio_models: [
MS-CLAP,
]
text_models: [
MS-CLAP
]
# If you use audio LLMs, such as Qwen2-Audio, please specify the number of trainable and frozen parameters.
LLMs: []
# Describe the number of trainable parameters in your system.
trainable_parameters: 7.1 M
# Describe the number of frozen parameters in your system, if any.
freezed_parameters: 158.4 M
loss_function: [
"L1",
"gIoU",
"cross_entropy"
]
dataset:
# If you use data augmentation, please specify the data augmentation methods used in your system.
data_augmentation: !!null
# If you use external data resources except for the provided dataset (i.e., CASTELLA and Clotho-Moment), please specify the name of the data resources used in your system.
external_data_resources: [
"audiocaps"
]
# Describe the number of audio-caption pairs used for training your system.
audio_captions: 48k
ensemble: false
# System results
results:
development_testing:
# System results for the development-testing split.
# Report Recall1@0.5 and Recall1@0.7 for the CASTELLA test set.
# Full results are not mandatory; however, they are highly recommended, as they are needed for thorough analysis of the challenge submissions.
# If you are unable to provide all results, incomplete results can also be reported.
Recall1@0.7: 0.0
Recall1@0.5: 0.0
Open and reproducible research
Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g., on GitHub) and pre-trained models available after the challenge is over.
Baseline System
The organizers provide a baseline system that is simple but effective for this challenge. In addition to the baseline system, the following repository contains an evaluation script for the development dataset and a preparation script for submission.
The baseline system consists of two modules: a feature extractor and a Detection Transformer (DETR)-based network. First, the feature extractor, based on pre-trained MS-CLAP 2023 with a sliding window, converts an input audio recording and a text query into sequential embeddings that account for cross-modal alignment between audio and text. Using these sequential embeddings, the DETR-based network directly outputs multiple pairs of start and end timestamps. Through training on audio moment retrieval datasets (e.g., CASTELLA and Clotho-Moment), the DETR-based network learns both the cross-modal alignment between audio and text and the dependencies between audio frames of the long recording. For more details, please see the following paper:
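Moment-DETR-style heads commonly parameterize each predicted span as a normalized (center, width) pair; decoding those into absolute timestamps is a small but essential step. The sketch below assumes that parameterization (an assumption about the baseline's output head, not its actual code).

```python
import numpy as np

def decode_spans(norm_spans, duration_s):
    """Convert DETR-style normalized (center, width) predictions into
    absolute [start, end] timestamps in seconds, clipped to the
    recording length.

    Assumes a Moment-DETR-style span parameterization; illustrative
    only, not the baseline's exact decoding code.
    """
    spans = np.asarray(norm_spans, dtype=float)
    centers, widths = spans[:, 0], spans[:, 1]
    starts = np.clip((centers - widths / 2) * duration_s, 0.0, duration_s)
    ends = np.clip((centers + widths / 2) * duration_s, 0.0, duration_s)
    return np.stack([starts, ends], axis=1)
```

For a 60-second recording, a normalized prediction (0.5, 0.2) decodes to the span [24 s, 36 s]; clipping keeps spans that spill past the edges inside the recording.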
Results on the development-testing split
| Model | Recall1@0.5 (↑) | Recall1@0.7 (↑) | mAP (avg) (↑) | mAP@0.5 (↑) | mAP@0.75 (↑) |
|---|---|---|---|---|---|
| Baseline (CASTELLA only) | 23.16 | 10.32 | 9.11 | 20.34 | 6.96 |
| Baseline (CASTELLA + Clotho-Moment) | 25.61 | 13.59 | 12.06 | 23.60 | 10.72 |
Citation
If you participate in this task, you might want to check the following papers.
Baseline system
Dataset
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, and Tatsuya Komatsu. CASTELLA: long audio dataset with captions and temporal boundaries. 2026. arXiv:2511.15131.