Audio Moment Retrieval from Long Audio


Challenge results

Task description

Audio moment retrieval is the task of retrieving specific moments within long audio recordings that align with a given textual query. A more detailed task description can be found on the task description page.

Teams ranking

This table includes only the best-performing system from each team. The ranking is based on the Recall1@0.7 metric achieved on the evaluation dataset. Metric values are reported for both the development-testing split and the evaluation dataset.
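
As a rough illustration of the ranking metric, the sketch below computes temporal intersection over union (IoU) and Recall1 at a given IoU threshold. It assumes one ground-truth span per query and a "greater than or equal" comparison; the function names and the toy spans are hypothetical, and the official scoring tool may differ in these details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at(top1_preds, gts, threshold):
    """Percentage of queries whose top-ranked span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy data: one top-1 prediction and one ground-truth span per query.
preds = [(10.0, 20.0), (3.0, 9.0), (30.0, 50.0)]
gts = [(11.0, 20.0), (0.0, 9.0), (80.0, 90.0)]
print(recall1_at(preds, gts, 0.7), recall1_at(preds, gts, 0.5))
```

In this toy case the second prediction has an IoU of about 0.67, so it counts at the 0.5 threshold but not at 0.7.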

Team Best Rank | Submission Code | System Rank | Corresponding author | Technical Report | Recall1@0.7 (eval dataset) | Recall1@0.5 (eval dataset) | Recall1@0.7 (dev-testing dataset) | Recall1@0.5 (dev-testing dataset)
1 Kibata_YCU_task6_2 1 Koki Kibata kibata2026_t6 48.59 69.49 40.70 55.98
1 Kim_CAU_task6_3 1 Changwon Lim kim2026_t6 48.59 63.84 45.14 58.87
1 Sugawara_YCU_task6_3 1 Haruto Sugawara sugawara2026_t6 48.59 59.89 40.68 53.50
4 Ogawa_YCU_task6_1 6 Takumi Ogawa ogawa2026_t6 46.89 60.45 41.13 54.19
5 Calvet_AUDIAS_task6_2 11 Oscar Calvet calvet2026_t6 43.50 64.41 36.82 53.30
6 Usui_YCU_task6_1 12 Ren Usui usui2026_t6 41.24 58.19 37.86 54.10
6 Choi_KAIST_task6_3 12 Seungdeok Choi choi2026_t6 41.24 55.93 31.03 48.40
8 Nakazawa_AM_task6_3 21 Kazushi Nakazawa nakazawa2026_t6 36.16 49.72 29.03 49.59
9 Kang_ISCT_task6_1 22 Yaozhong Kang kang2026_t6 35.59 50.28 33.23 45.89
10 Xiao_HEU_task6_4 26 Jian Guan xiao2026_t6 33.90 48.02 31.55 41.05
11 Khan_WPI_task6_2 28 Mohammad Nur Hossain Khan khan2026_t6 32.20 48.02 32.07 48.71
12 Chunarkar_NTHU_task6_3 32 Snehit Chunarkar chunarkar2026_t6 29.38 52.54 25.39 43.88
13 Huang_WHU_task6_1 34 Gongping Huang huang2026_t6 27.68 48.02 25.09 37.49
14 Nishijima_UTokyo_task6_3 40 Hiroshi Nishijima nishijima2026_t6 22.60 32.77 21.88 30.58
15 Chen_CHT_task6_1 42 Wei-Yu Chen chen2026_t6 22.03 37.29 20.19 31.48
16 LU_YZU_task6_1 43 Jun-Ting LU lu2026_t6 21.47 31.64 14.40 25.61
17 Xu_GZHU_task6_1 47 Yutao Xu xu2026_t6 15.25 31.07 16.11 29.92
18 DCASE2026_baseline_task6 48 Hokuto Munakata 13.56 28.25 13.59 25.61
18 Huck_NV_task6_3 48 Huck Yang huck2026_t6 13.56 22.03
18 Zhang_XJTLU_task6_1 48 Xiaokai Zhang zhang2026_t6 13.56 22.03 16.11 23.83
21 Kret_CooperUnion_task6_2 52 Meghan Kret kret2026_t6 11.30 19.21 11.88 17.67
22 Minh_VGU_task6_1 57 Le Duc Minh minh2026_t6 5.65 12.99 6.76 14.33



Systems ranking

This table includes all systems submitted by participating teams.

System Rank | Submission Code | Technical Report | Recall1@0.7 (eval dataset) | Recall1@0.5 (eval dataset) | Recall1@0.7 (dev-testing dataset) | Recall1@0.5 (dev-testing dataset)
1 Kibata_YCU_task6_2 kibata2026_t6 48.59 69.49 40.70 55.98
1 Kim_CAU_task6_3 kim2026_t6 48.59 63.84 45.14 58.87
1 Sugawara_YCU_task6_3 sugawara2026_t6 48.59 59.89 40.68 53.50
4 Kim_CAU_task6_2 kim2026_t6 48.02 62.15 45.14 59.02
5 Sugawara_YCU_task6_1 sugawara2026_t6 47.46 58.19 39.57 51.80
6 Ogawa_YCU_task6_1 ogawa2026_t6 46.89 60.45 41.13 54.19
7 Sugawara_YCU_task6_2 sugawara2026_t6 45.76 58.19 38.38 51.30
8 Kim_CAU_task6_1 kim2026_t6 45.20 57.63 42.91 55.23
9 Sugawara_YCU_task6_4 sugawara2026_t6 44.63 57.63 38.01 52.50
10 Kim_CAU_task6_4 kim2026_t6 44.07 61.02 41.28 54.57
11 Calvet_AUDIAS_task6_2 calvet2026_t6 43.50 64.41 36.82 53.30
12 Usui_YCU_task6_1 usui2026_t6 41.24 58.19 37.86 54.10
12 Choi_KAIST_task6_3 choi2026_t6 41.24 55.93 31.03 48.40
14 Usui_YCU_task6_2 usui2026_t6 40.68 58.76 39.68 53.95
14 Usui_YCU_task6_3 usui2026_t6 40.68 58.19 38.32 54.02
14 Calvet_AUDIAS_task6_4 calvet2026_t6 40.68 59.89 42.09 59.39
14 Choi_KAIST_task6_4 choi2026_t6 40.68 54.80 33.70 50.41
18 Usui_YCU_task6_4 usui2026_t6 40.11 58.76 40.14 54.17
19 Choi_KAIST_task6_2 choi2026_t6 37.85 51.41 29.55 47.07
20 Calvet_AUDIAS_task6_1 calvet2026_t6 37.29 59.89 40.01 56.12
21 Nakazawa_AM_task6_3 nakazawa2026_t6 36.16 49.72 29.03 49.59
22 Kang_ISCT_task6_1 kang2026_t6 35.59 50.28 33.23 45.89
22 Nakazawa_AM_task6_1 nakazawa2026_t6 35.59 49.72 30.14 49.29
24 Calvet_AUDIAS_task6_3 calvet2026_t6 35.03 57.06 37.56 54.27
25 Kibata_YCU_task6_1 kibata2026_t6 34.46 53.11 26.80 39.50
26 Xiao_HEU_task6_4 xiao2026_t6 33.90 48.02 31.55 41.05
26 Choi_KAIST_task6_1 choi2026_t6 33.90 55.93 28.36 43.06
28 Khan_WPI_task6_2 khan2026_t6 32.20 48.02 32.07 48.71
29 Kang_ISCT_task6_2 kang2026_t6 31.64 49.15 31.95 45.52
29 Nakazawa_AM_task6_2 nakazawa2026_t6 31.64 51.41 29.47 47.07
31 Xiao_HEU_task6_3 xiao2026_t6 29.94 44.07 28.36 40.31
32 Xiao_HEU_task6_1 xiao2026_t6 29.38 48.02 28.21 40.83
32 Chunarkar_NTHU_task6_3 chunarkar2026_t6 29.38 52.54 25.39 43.88
34 Huang_WHU_task6_1 huang2026_t6 27.68 48.02 25.09 37.49
35 Chunarkar_NTHU_task6_4 chunarkar2026_t6 27.12 49.72 26.28 43.50
36 Xiao_HEU_task6_2 xiao2026_t6 26.55 42.37 30.44 41.13
37 Chunarkar_NTHU_task6_2 chunarkar2026_t6 25.99 48.02 26.43 43.21
38 Chunarkar_NTHU_task6_1 chunarkar2026_t6 25.42 46.33 26.13 44.02
39 Khan_WPI_task6_1 khan2026_t6 23.73 42.37 25.32 37.27
40 Nishijima_UTokyo_task6_3 nishijima2026_t6 22.60 32.77 21.88 30.58
40 Nishijima_UTokyo_task6_4 nishijima2026_t6 22.60 32.77 21.43 29.76
42 Chen_CHT_task6_1 chen2026_t6 22.03 37.29 20.19 31.48
43 LU_YZU_task6_1 lu2026_t6 21.47 31.64 14.40 25.61
44 Nishijima_UTokyo_task6_1 nishijima2026_t6 20.90 37.29 20.91 30.36
44 LU_YZU_task6_2 lu2026_t6 20.90 35.59 12.25 23.01
46 Nishijima_UTokyo_task6_2 nishijima2026_t6 19.21 35.03 18.68 26.49
47 Xu_GZHU_task6_1 xu2026_t6 15.25 31.07 16.11 29.92
48 DCASE2026_baseline_task6 13.56 28.25 13.59 25.61
48 Huck_NV_task6_3 huck2026_t6 13.56 22.03
48 Nakazawa_AM_task6_4 nakazawa2026_t6 13.56 23.16 28.88 49.29
48 Zhang_XJTLU_task6_1 zhang2026_t6 13.56 22.03 16.11 23.83
52 Huck_NV_task6_1 huck2026_t6 11.30 20.34
52 Kret_CooperUnion_task6_2 kret2026_t6 11.30 19.21 11.88 17.67
54 Kret_CooperUnion_task6_1 kret2026_t6 10.73 16.95 13.51 18.49
55 Huck_NV_task6_2 huck2026_t6 9.04 19.21
56 Huck_NV_task6_4 huck2026_t6 6.21 13.56
57 Minh_VGU_task6_1 minh2026_t6 5.65 12.99 6.76 14.33
58 Zhang_XJTLU_task6_2 zhang2026_t6 2.26 5.08 21.97 33.04
59 Zhang_XJTLU_task6_3 zhang2026_t6 1.13 3.39 23.42 34.65
59 Zhang_XJTLU_task6_4 zhang2026_t6 1.13 2.82 23.59 35.13
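
The rank column above appears to follow standard competition ranking: systems with equal Recall1@0.7 on the evaluation dataset share a rank, and the next rank skips ahead (1, 1, 1, 4, ...). The sketch below reproduces that pattern; `competition_ranks` is a hypothetical helper, not the organizers' script.

```python
def competition_ranks(scores):
    """Rank scores in descending order; tied scores share the best position."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Remember only the first (best) position at which a score appears.
        first_position.setdefault(score, position)
    return [first_position[s] for s in scores]


# Recall1@0.7 (eval) of the five top-ranked systems in the table above.
top_scores = [48.59, 48.59, 48.59, 48.02, 47.46]
print(competition_ranks(top_scores))  # [1, 1, 1, 4, 5]
```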



System characteristics

Summary of the submitted system characteristics.

System Rank | Submission Code | Recall1@0.7 (eval) | Technical Report | Audio Model | Text Model | LLM | Loss Function | External Data Resources | Data augmentation | Ensemble | Trainable parameters | Frozen parameters | Total parameters
1 Kibata_YCU_task6_2 48.59 kibata2026_t6 M2D-CLAP M2D-CLAP CG-DETR, Varifocal, SIGReg FALSE FALSE 13370000 198500000 211870000
1 Kim_CAU_task6_3 48.59 kim2026_t6 LAION-CLAP, MS-CLAP, WavLM LAION-CLAP, MS-CLAP Qwen2.5-Omni-7B QD-DETR, Quality regression FALSE TRUE 55600000 11130000000 11185600000
1 Sugawara_YCU_task6_3 48.59 sugawara2026_t6 MS-CLAP, M2D-CLAP MS-CLAP UVCOM TRUE TRUE 445400000 158400000 603800000
4 Kim_CAU_task6_2 48.02 kim2026_t6 LAION-CLAP, MS-CLAP, WavLM LAION-CLAP, MS-CLAP, RoBERTa Qwen2.5-Omni-7B QD-DETR, Quality regression FALSE TRUE 55600000 11670000000 11725600000
5 Sugawara_YCU_task6_1 47.46 sugawara2026_t6 MS-CLAP, M2D-CLAP MS-CLAP UVCOM TRUE TRUE 338600000 158400000 497000000
6 Ogawa_YCU_task6_1 46.89 ogawa2026_t6 M2D-CLAP M2D-CLAP QD-DETR, Quality-based reranking, Distillation FALSE FALSE 19400000 19400000
7 Sugawara_YCU_task6_2 45.76 sugawara2026_t6 MS-CLAP, M2D-CLAP MS-CLAP UVCOM TRUE TRUE 285300000 158400000 443700000
8 Kim_CAU_task6_1 45.20 kim2026_t6 LAION-CLAP, MS-CLAP, WavLM LAION-CLAP, MS-CLAP, RoBERTa QD-DETR, Quality regression FALSE TRUE 55600000 941800000 997400000
9 Sugawara_YCU_task6_4 44.63 sugawara2026_t6 M2D-CLAP, MS-CLAP MS-CLAP UVCOM TRUE TRUE 338000000 158400000 496400000
10 Kim_CAU_task6_4 44.07 kim2026_t6 LAION-CLAP, MS-CLAP LAION-CLAP, MS-CLAP Qwen2.5-Omni-7B QD-DETR, Quality regression FALSE TRUE 55600000 11040000000 11095600000
11 Calvet_AUDIAS_task6_2 43.50 calvet2026_t6 BEATs RoBERTa UVCOM AudioCaps, WavCaps, Clotho, TACOS FALSE FALSE 922020000 922020000
12 Usui_YCU_task6_1 41.24 usui2026_t6 M2D-CLAP M2D-CLAP UVCOM, Varifocal Clotho TRUE FALSE 18580000 198500000 217080000
12 Choi_KAIST_task6_3 41.24 choi2026_t6 M2D-CLAP M2D-CLAP QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR, Cascade FALSE FALSE 8000000 158400000 166400000
14 Usui_YCU_task6_2 40.68 usui2026_t6 M2D-CLAP M2D-CLAP UVCOM, Varifocal Clotho TRUE FALSE 18580000 198500000 217080000
14 Usui_YCU_task6_3 40.68 usui2026_t6 M2D-CLAP M2D-CLAP UVCOM, Varifocal Clotho TRUE FALSE 18580000 198500000 217080000
14 Calvet_AUDIAS_task6_4 40.68 calvet2026_t6 EAT, BEATs RoBERTa UVCOM AudioCaps, WavCaps, Clotho, TACOS FALSE TRUE 1870000000 1870000000
14 Choi_KAIST_task6_4 40.68 choi2026_t6 M2D-CLAP M2D-CLAP QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR, Cascade FALSE TRUE 16000000 158400000 174400000
18 Usui_YCU_task6_4 40.11 usui2026_t6 M2D-CLAP M2D-CLAP UVCOM, Varifocal Clotho TRUE FALSE 18580000 198500000 217080000
19 Choi_KAIST_task6_2 37.85 choi2026_t6 M2D-CLAP M2D-CLAP QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR FALSE FALSE 7400000 158400000 165800000
20 Calvet_AUDIAS_task6_1 37.29 calvet2026_t6 EAT RoBERTa UVCOM AudioCaps, WavCaps, Clotho, TACOS FALSE FALSE 921300000 921300000
21 Nakazawa_AM_task6_3 36.16 nakazawa2026_t6 MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, SP-based VAD MS-CLAP, LAION-CLAP QD-DETR FALSE FALSE 9890000 587910000 597800000
22 Kang_ISCT_task6_1 35.59 kang2026_t6 OpenFLAM FLAM text encoder Qwen2.5-7B-Instruct QD-DETR AudioCaps, WavCaps TRUE TRUE 6990000 7780000000 7786990000
22 Nakazawa_AM_task6_1 35.59 nakazawa2026_t6 MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, SP-based VAD MS-CLAP, LAION-CLAP QD-DETR FALSE FALSE 9890000 587910000 597800000
24 Calvet_AUDIAS_task6_3 35.03 calvet2026_t6 EAT, BEATs RoBERTa UVCOM AudioCaps, WavCaps, Clotho, TACOS FALSE FALSE 1786050000 1786050000
25 Kibata_YCU_task6_1 34.46 kibata2026_t6 MS-CLAP MS-CLAP CG-DETR, Varifocal, SIGReg FALSE FALSE 13370000 158400000 171770000
26 Xiao_HEU_task6_4 33.90 xiao2026_t6 MS-CLAP MS-CLAP UVCOM AudioCaps, FSD50K FALSE TRUE 19000000 158000000 177000000
26 Choi_KAIST_task6_1 33.90 choi2026_t6 M2D-CLAP M2D-CLAP QD-DETR, Span rerank, Coarse auxiliary FALSE FALSE 7100000 158400000 165500000
28 Khan_WPI_task6_2 32.20 khan2026_t6 LAION-CLAP MS-CLAP Qwen2.5-Omni-7B UVCOM FALSE TRUE 76000000 7700000000 7776000000
29 Kang_ISCT_task6_2 31.64 kang2026_t6 OpenFLAM FLAM text encoder Qwen2.5-7B-Instruct QD-DETR AudioCaps, WavCaps TRUE TRUE 6990000 7780000000 7786990000
29 Nakazawa_AM_task6_2 31.64 nakazawa2026_t6 MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, PaSST, SP-based VAD MS-CLAP, LAION-CLAP, RoBERTa QD-DETR FALSE FALSE 10750000 798710000 809460000
31 Xiao_HEU_task6_3 29.94 xiao2026_t6 MS-CLAP MS-CLAP UVCOM AudioCaps, FSD50K FALSE FALSE 19000000 158000000 177000000
32 Xiao_HEU_task6_1 29.38 xiao2026_t6 MS-CLAP MS-CLAP UVCOM AudioCaps, FSD50K FALSE FALSE 21500000 158000000 179500000
32 Chunarkar_NTHU_task6_3 29.38 chunarkar2026_t6 M2D-CLAP M2D-CLAP QD-DETR FALSE FALSE 7180000 89040000 96220000
34 Huang_WHU_task6_1 27.68 huang2026_t6 MS-CLAP MS-CLAP UVCOM, Boundary hard negative TRUE FALSE 14200000 14200000
35 Chunarkar_NTHU_task6_4 27.12 chunarkar2026_t6 M2D-CLAP, LAION-CLAP M2D-CLAP, LAION-CLAP QD-DETR FALSE TRUE 7450000 247370000 254820000
36 Xiao_HEU_task6_2 26.55 xiao2026_t6 MS-CLAP MS-CLAP UVCOM AudioCaps, FSD50K FALSE FALSE 19000000 158000000 177000000
37 Chunarkar_NTHU_task6_2 25.99 chunarkar2026_t6 M2D-CLAP, LAION-CLAP M2D-CLAP, LAION-CLAP QD-DETR FALSE TRUE 7450000 247370000 254820000
38 Chunarkar_NTHU_task6_1 25.42 chunarkar2026_t6 M2D-CLAP M2D-CLAP QD-DETR FALSE FALSE 7180000 89040000 96220000
39 Khan_WPI_task6_1 23.73 khan2026_t6 MS-CLAP MS-CLAP Qwen2.5-Omni-7B UVCOM, GRPO margin IoU reward FALSE TRUE 108000000 7500000000 7608000000
40 Nishijima_UTokyo_task6_3 22.60 nishijima2026_t6 Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Causal LM cross-entropy FTAR (TimeAudio) TRUE TRUE 44000000 8397000000 8441000000
40 Nishijima_UTokyo_task6_4 22.60 nishijima2026_t6 Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Causal LM cross-entropy FTAR (TimeAudio) TRUE TRUE 44000000 8397000000 8441000000
42 Chen_CHT_task6_1 22.03 chen2026_t6 MS-CLAP MS-CLAP UVCOM FALSE FALSE
43 LU_YZU_task6_1 21.47 lu2026_t6 MS-CLAP MS-CLAP QD-DETR TRUE FALSE 7100000 7100000
44 Nishijima_UTokyo_task6_1 20.90 nishijima2026_t6 Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Causal LM cross-entropy FTAR (TimeAudio) TRUE FALSE 44000000 8397000000 8441000000
44 LU_YZU_task6_2 20.90 lu2026_t6 MS-CLAP MS-CLAP QD-DETR TRUE FALSE 7100000 7100000
46 Nishijima_UTokyo_task6_2 19.21 nishijima2026_t6 Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Qwen2-Audio-7B-Instruct Causal LM cross-entropy FTAR (TimeAudio) TRUE FALSE 44000000 8397000000 8441000000
47 Xu_GZHU_task6_1 15.25 xu2026_t6 MS-CLAP MS-CLAP QD-DETR, Focal, IoU-aware quality, Boundary auxiliary, Teacher distillation FALSE TRUE 7100000 158400000 165500000
48 DCASE2026_baseline_task6 13.56 MS-CLAP MS-CLAP QD-DETR FALSE FALSE 7100000 158400000 165500000
48 Huck_NV_task6_3 13.56 huck2026_t6 MS-CLAP Audio-Flamingo FALSE FALSE 8000000000 8000000000
48 Nakazawa_AM_task6_4 13.56 nakazawa2026_t6 MS-CLAP, LAION-CLAP, BEATs, EAT MS-CLAP, LAION-CLAP QD-DETR FALSE FALSE 9620000 498870000 508490000
48 Zhang_XJTLU_task6_1 13.56 zhang2026_t6 MS-CLAP MS-CLAP Cross-entropy, Boundary classification FALSE FALSE 1053000 158400000 159453000
52 Huck_NV_task6_1 11.30 huck2026_t6 MS-CLAP Audio-Flamingo FALSE FALSE 8000000000 8000000000
52 Kret_CooperUnion_task6_2 11.30 kret2026_t6 MS-CLAP MS-CLAP FALSE FALSE
54 Kret_CooperUnion_task6_1 10.73 kret2026_t6 MS-CLAP MS-CLAP FALSE FALSE
55 Huck_NV_task6_2 9.04 huck2026_t6 MS-CLAP Audio-Flamingo FALSE FALSE 8000000000 8000000000
56 Huck_NV_task6_4 6.21 huck2026_t6 MS-CLAP Audio-Flamingo FALSE FALSE 8000000000 8000000000
57 Minh_VGU_task6_1 5.65 minh2026_t6 MS-CLAP MS-CLAP QD-DETR, Focal FALSE FALSE 9900000 9900000
58 Zhang_XJTLU_task6_2 2.26 zhang2026_t6 MS-CLAP MS-CLAP Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression FALSE FALSE 1070000 158400000 159470000
59 Zhang_XJTLU_task6_3 1.13 zhang2026_t6 MS-CLAP MS-CLAP Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression, Pairwise ranking FALSE TRUE 2167000 158400000 160567000
59 Zhang_XJTLU_task6_4 1.13 zhang2026_t6 MS-CLAP MS-CLAP Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression, Semantic temporal risk heads, Pairwise ranking FALSE TRUE 2177000 158400000 160577000



Technical reports

COARSE-TO-FINE AUDIO MOMENT RETRIEVAL WITH TEMPORAL REFINEMENT AND RE-RANKING

Óscar Calvet1, Doroteo T. Toledano2
1AUDIAS, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain, 2AUDIAS, Escuela Politécnica Superior, Universidad Autónoma de Madrid

Abstract

This technical report describes our submission to Task 6 of the DCASE 2026 Challenge, which addresses audio moment retrieval from long audio recordings. The goal of the task is to localize the temporal segment, or segments, that match a natural-language query in an untrimmed audio recording. Our main approach follows a coarse-to-fine retrieval strategy based on UVCOM-style temporal localization models and window-level audio embeddings. First, a coarse model processes the full audio recording and produces a ranked set of candidate moments. Then, a second refinement model operates on local crops around the most promising candidates using higher-resolution audio features, improving the temporal precision of the predicted boundaries. Finally, a lightweight candidate reranker combines temporal, confidence, audio-text similarity, and boundary-context features to select and rank the final predictions. We also apply submission-oriented postprocessing, including timestamp rounding, duration clamping, duplicate removal, and boundary clipping. Our best single system achieves an R1@0.7 of 40.01, while our ensemble system achieves an R1@0.7 of 42.09. These results show that combining global proposal generation, local high-resolution refinement, and candidate-level reranking is an effective strategy for language-based audio moment retrieval in long recordings.
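
The postprocessing steps named in this abstract (timestamp rounding, duration clamping, duplicate removal, boundary clipping) can be sketched as follows. This is only an illustration under assumed settings: the function name, the minimum and maximum lengths, and the rounding precision are hypothetical and are not taken from the report.

```python
def postprocess(spans, audio_duration, min_len=1.0, max_len=60.0, decimals=2):
    """Clean ranked (start, end, score) spans; all limits are assumed values."""
    cleaned, seen = [], set()
    for start, end, score in spans:
        # Boundary clipping: keep the span inside the recording.
        start = min(max(start, 0.0), audio_duration)
        end = min(max(end, 0.0), audio_duration)
        # Duration clamping: enforce a minimum and maximum length.
        length = min(max(end - start, min_len), max_len)
        end = min(start + length, audio_duration)
        start = max(end - length, 0.0)
        # Timestamp rounding.
        key = (round(start, decimals), round(end, decimals))
        # Duplicate removal: keep the first (highest-ranked) occurrence.
        if key not in seen:
            seen.add(key)
            cleaned.append((key[0], key[1], score))
    return cleaned


ranked = [(-2.0, 10.004, 0.9), (0.0, 10.0, 0.8), (295.5, 310.0, 0.7)]
print(postprocess(ranked, audio_duration=300.0))
```

Here the first two candidates collapse to the same rounded span, so only the higher-ranked one survives, and the third is clipped to the end of the recording.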


Audio Moment Retrieval from Long Audio for DCASE 2026 task 6

Wei-Yu Chen, Chung-Li Lu
Telecommunication Laboratories, Chunghwa Telecom Co., Ltd., Taiwan

Abstract

In this technical report, we briefly describe the system we designed for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. We first evaluated several existing models, including the official DCASE baseline QD-DETR, as well as QD-DETR, Moment-DETR, UVCom (with and without fine-tuning via the Lighthouse framework), and SpotSound-A. On the test set, fine-tuned UVCom achieved the best performance (R1@0.7 = 20.19%), followed by SpotSound-A (16.04%) and QD-DETR (15.96%). As model sizes continue to grow, the time and computational cost of fine-tuning increases accordingly. We therefore explored an alternative approach: rather than fine-tuning, can we decompose a complex query into simpler sub-queries, predict each sub-query independently, and merge the predictions to recover the full answer? Using Gemma4, we decomposed original queries into sub-events, which were then individually localized by the best-performing UVCom and SpotSound models. Following the TFVTG framework, Gemma4 determined whether each sub-event occurred simultaneously or sequentially, and the predicted time windows were merged via union or intersection accordingly. UVCom achieved R1@0.7 = 20.19% on the test set; SpotSound achieved 16.04%. Results show that this direction remains highly challenging: all sub-event decomposition variants underperformed their respective baselines, primarily because the post-processing merge logic caused over-expansion of predicted windows, reducing overlap with ground-truth annotations.
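
The merge step described in this abstract can be pictured with the minimal sketch below. It assumes that sequential sub-events are merged by union (approximated here as the covering span) and simultaneous sub-events by intersection; that mapping, the covering-span simplification, and the name `merge_windows` are assumptions made for illustration, not details confirmed by the report.

```python
def merge_windows(windows, relation):
    """Merge per-sub-event (start, end) windows into a single window."""
    starts = [start for start, _ in windows]
    ends = [end for _, end in windows]
    if relation == "sequential":
        # Union, approximated as the span covering every sub-event window.
        return (min(starts), max(ends))
    if relation == "simultaneous":
        # Intersection: the interval where all sub-event windows overlap.
        start, end = max(starts), min(ends)
        return (start, end) if start < end else None
    raise ValueError(f"unknown relation: {relation}")


sub_event_windows = [(10.0, 20.0), (18.0, 35.0)]
print(merge_windows(sub_event_windows, "sequential"))
print(merge_windows(sub_event_windows, "simultaneous"))
```

The sequential case shows how a union can grow well beyond either individual window, which is the kind of over-expansion the abstract reports.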


MULTI-SIGNAL CASCADED GROUNDING FOR AUDIO MOMENT RETRIEVAL FROM LONG AUDIO

Seungdeok Choi, Yong-Hwa Park
Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

Abstract

We describe our submission to DCASE 2026 Task 6, audio moment retrieval (AMR) from long audio. Building on the query-dependent DETR (QD-DETR) detection-transformer baseline, we replace the MS-CLAP feature extractor with frozen M2D-CLAP embeddings and augment the detector with a sequence of complementary, training-only supervision signals: a multi-resolution coarse auxiliary branch, a span-query re-ranking regularizer, a per-boundary InfoNCE contrast, and a DN-DETR denoising task, together with a bidirectional Mamba audio encoder. A lightweight cascade refinement decoder then adds a second, localized detection stage that reads per-frame audio crops around each first-stage prediction and emits bounded boundary corrections. At inference we combine a two-seed score-level ensemble with a saliency-as-proposals stream that repurposes the saliency head as a parallel moment generator. Our four submitted systems form a monotone progression on the development-testing split, the strongest reaching Recall1@0.7 = 33.70, a large gain over the official baseline.
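
For readers unfamiliar with the InfoNCE contrast mentioned in this abstract, here is a generic sketch of the loss for one anchor scored against several candidates. How positives and negatives are formed per boundary is not specified in the abstract, so the inputs, the temperature value, and the name `info_nce` are illustrative assumptions.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one anchor: negative log-softmax of the positive candidate."""
    logits = [s / temperature for s in similarities]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[positive_index]


# Similarities between an anchor and three candidates; index 0 is the positive.
print(info_nce([0.9, 0.1, 0.0], positive_index=0))
```

The loss is small when the positive candidate has the highest similarity and grows as negatives score closer to it.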


Exploring Pretrained Audio-Text Encoders for Audio Moment Retrieval: DCASE 2026 Task 6

Snehit B. Chunarkar1, Krishnagiri Hamza2, Chi-Chun Lee1
1Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2Electronics and Communication Engineering, Shri Ramasamy Memorial Institute of Science and Technology, Chennai, India

Abstract

We present our system submitted to DCASE 2026 Task 6: Audio Moment Retrieval (AMR), which aims to retrieve a temporally grounded moment within a long audio recording given a natural language query. Following the AM-DETR baseline framework, we adopt frame-level audio feature extraction by segmenting long recordings into non-overlapping one-second clips, yielding a temporally ordered sequence of clip-level embeddings. Building on this, our primary contribution is a systematic investigation of pretrained audio and text encoders as replacements for the baseline MS-CLAP features, including M2D-CLAP and LAION-CLAP for both audio and text. The selected feature representations are fused into an ensemble and fed to an AM-DETR-based retrieval head for temporal boundary regression. We further incorporate frame masking during training and an IoU-based loss to improve localisation. Our system achieves a Recall1@0.7 score of 26.43 on the CASTELLA test split, surpassing the baseline score of 13.59.
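
The segmentation into non-overlapping one-second clips can be sketched as below. The helper name, the sample rate in the example, and the choice to keep a shorter final clip are assumptions for illustration; the report may pad or drop the remainder instead.

```python
def split_into_clips(num_samples, sample_rate, clip_seconds=1.0):
    """Return (start, end) sample indices of consecutive non-overlapping clips."""
    clip_len = int(round(sample_rate * clip_seconds))
    return [
        (start, min(start + clip_len, num_samples))
        for start in range(0, num_samples, clip_len)
    ]


# A 35000-sample recording at an assumed 16 kHz yields two full clips and one remainder.
print(split_into_clips(35000, 16000))
```

Each clip would then be passed through the audio encoder, giving one embedding per second in temporal order.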


Quality- and Boundary-Aware Cross-Modal Refinement for Long-Audio Moment Retrieval

Yingzhao Hou1, Anda Liu1, Zhongqin Shu1, Xin Guo1, Yuzheng Wu1, Yiwei Liu1, Xiaolan Xia1, Xueqin Luo2, Gongping Huang1
1School of Electronic Information, Wuhan University, Wuhan, China, 2CIAIC, Northwestern Polytechnical University, Xi'an, China

Abstract

This technical report describes our submission to DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. Based on the official QD-DETR baseline, we improve the system through three aspects: cross-modal semantic refinement, temporal localization optimization, and long-context training data construction. First, we adapt the Comprehensive Integration Module (CIM) and Multi-Aspect Contrastive Learning (MCL) from UVCOM to enhance audio-language semantic interaction. Second, we introduce several task-specific designs, including dense all-window saliency supervision, auxiliary audio-text similarity learning, localization-quality-aware candidate resorting, short-window proposal generation, and candidate-level boundary hard negative learning, which jointly improve proposal generation, boundary estimation, and candidate ranking. Third, we construct Clotho-Moment-Long, a long-context extension of Clotho-Moment following the same data generation pipeline while extending the audio context to 300 s and introducing repeated event occurrences and sparser event placement to better simulate realistic and challenging long-audio retrieval scenarios. The resulting model is first trained on Clotho-Moment-Long and then fine-tuned on the real-world CASTELLA dataset. On the CASTELLA development-testing split, the final fused system improves Recall1@0.7 from the official CASTELLA+Clotho-Moment baseline value of 13.59 to 25.09, while increasing average mAP from 12.06 to 18.61.


TEXT-SPACE IMAGINATION OF AUDIO RETRIEVAL VIA JOINT-SPACE PROJECTION

Chao-Han Huck Yang1, Zih-Ching Chen1, Eli Chien2, Sabato Marco Siniscalchi3,4
1NVIDIA, 2National Taiwan University, 3University of Palermo, 4Georgia Tech

Abstract

We study audio moment retrieval (AMR) as text-space imagination of audio: given a natural-language query and a long recording, a contrastive audio–language model can locate when the described sound is active by scoring each second of audio against the text in a shared embedding space. DCASE 2026 Task 6 provides pre-extracted MS-CLAP-2023 backbone features (768 dimensions) rather than raw audio. We first show, through a matched versus mismatched query analysis, that cosine similarity in this backbone space carries almost no query-specific signal: a query’s peak relevance on its own audio is no sharper than that of a random query. Our method, JointProj, restores the missing cross-modal alignment by passing the provided features and a re-encoded query through MS-CLAP’s own projection heads into the 1024-dimensional joint space, where cosine similarity yields a sharp per-second relevance curve. Moments are then localized by full-width-at-half-maximum (FWHM) peak detection. On the Clotho-Moment validation split, scored with the official tool, JointProj raises Recall1@0.7 from 13.2 to 46.0 and mAP from 12.4 to 42.5 over backbone cosine. We submit four systems: three FWHM peak-width variants that perform best in different moment-length regimes, and one exploratory localizer in which a language model reasons over the joint-space curve. We select the strongest development configuration for the blind evaluation.
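
Full-width-at-half-maximum localization on a per-second relevance curve can be sketched as follows. This sketch measures the half maximum relative to the curve minimum and returns a single moment around the global peak; both choices, and the name `fwhm_moment`, are assumptions rather than the report's exact procedure.

```python
def fwhm_moment(relevance, frame_seconds=1.0):
    """Localize one moment as the full width at half maximum around the peak."""
    peak = max(range(len(relevance)), key=relevance.__getitem__)
    floor = min(relevance)
    half = floor + 0.5 * (relevance[peak] - floor)
    # Grow the window left and right while the curve stays at or above half height.
    left = peak
    while left > 0 and relevance[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(relevance) - 1 and relevance[right + 1] >= half:
        right += 1
    return (left * frame_seconds, (right + 1) * frame_seconds)


# Toy per-second query-audio similarity curve with one clear peak.
curve = [0.1, 0.1, 0.2, 0.6, 0.9, 0.7, 0.2, 0.1]
print(fwhm_moment(curve))  # (3.0, 6.0)
```

Changing the fraction of the peak height used as the cut-off would widen or narrow the predicted moment, which is one way peak-width variants could differ.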


FLAM-CONDITIONED QD-DETR WITH PARAPHRASE POOLING AND WEIGHTED-BOXES-FUSION ENSEMBLING FOR AUDIO MOMENT RETRIEVAL

Yaozhong Kang, Runwu Shi, Benjamin Yen, Kazuhiro Nakadai
Department of Systems and Control Engineering, Institute of Science Tokyo, Tokyo, Japan

Abstract

We describe our submission to DCASE 2026 Task 6, Audio Moment Retrieval (AMR) from long audio. Starting from the official QD-DETR baseline, we leave the detection head untouched and instead strengthen the two stages that bound its boundary-precise recall: how audio and text are encoded, and how predictions are combined. Replacing the MS-CLAP encoder with FLAM (Frame-Wise Language-Audio Modeling), whose features are text-aligned at a high frame rate, is the single largest source of our gain. We then add several inexpensive sources of prediction diversity, namely query paraphrasing, saliency-curve candidates, and augmentation-trained students, and fuse them with a one-dimensional Weighted-Boxes-Fusion (WBF) ensemble that an iterative leave-one-out sweep prunes to its most complementary members. On the held-out test split the final ensemble more than doubles the baseline, raising Recall1 at 0.7 IoU (R1@0.7) from 13.59 to 33.23 and mean average precision (mAP) from 12.06 to 24.51.
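
A simplified one-dimensional take on Weighted-Boxes-Fusion is sketched below: spans from several models are clustered by IoU, and each cluster is replaced by a score-weighted average of its members. The IoU threshold, the plain averaging of scores, and the function names are assumptions; the full WBF algorithm also rescales confidence by the number of contributing models, which is omitted here. Scores are assumed to be positive.

```python
def iou_1d(a, b):
    """IoU of two spans; only the first two fields (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def weighted_span_fusion(spans, iou_threshold=0.5):
    """Fuse (start, end, score) spans from several models, WBF-style, in 1-D."""
    clusters = []  # members of each cluster
    fused = []     # running fused span of each cluster
    for span in sorted(spans, key=lambda s: s[2], reverse=True):
        for i, current in enumerate(fused):
            if iou_1d(span, current) >= iou_threshold:
                clusters[i].append(span)
                break
        else:
            # No sufficiently overlapping cluster: start a new one.
            clusters.append([span])
            fused.append(None)
            i = len(clusters) - 1
        members = clusters[i]
        total = sum(m[2] for m in members)
        fused[i] = (
            sum(m[0] * m[2] for m in members) / total,  # score-weighted start
            sum(m[1] * m[2] for m in members) / total,  # score-weighted end
            total / len(members),                       # mean member score
        )
    return sorted(fused, key=lambda s: s[2], reverse=True)


pooled = [(10.0, 20.0, 0.9), (12.0, 22.0, 0.3), (50.0, 60.0, 0.5)]
print(weighted_span_fusion(pooled))
```

The two overlapping spans merge into one span pulled toward the higher-scoring member, while the isolated span passes through unchanged.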


Encoder-aware Verifier Fusion with Boundary Refinement for Audio Moment Retrieval

Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA, USA

Abstract

We describe our two submissions to the DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. Both build on the UVCOM [2] retriever and add complementary post-processing stages, but they target different encoder families: (i) an MS-CLAP [3] pipeline closer to the baseline, combining DAPO RL reranking, confidence-gated swap, and a 2-checkpoint refinement fusion; and (ii) a LAION-CLAP [4] pipeline that swaps the audio encoder, applies our verifier-fusion principle, and concludes with a single-checkpoint refinement SFT. The core insight is that the highest-quality rerank signal comes not from training a reranker over the retriever’s candidates, but from fusing the retriever’s candidate scores with an architecturally independent predictor (a standalone span-SFT Qwen2.5-Omni-7B) whose predictions have never seen the retriever’s ranking. On the CASTELLA test set (public dev-test), our LAION pipeline reaches R1@0.7=32.07 and mAP=27.02, the strongest configuration in our system grid. The MS-CLAP pipeline reaches R1@0.7=25.32, complementing the LAION submission with a different encoder family and a different refinement variant.


ADVANCED AUDIO MOMENT RETRIEVAL VIA CG-DETR

Koki Kibata, Sayaka Yamamoto, Tomoki Kawabata, Yuma Higashino, Yuki Osawa
Data Science Department, Yokohama City University, Kanagawa, Japan

Abstract

This technical report describes our system designed for the DCASE 2026 Challenge Task 6 (Audio Moment Retrieval). Utilizing the Context-Gated Detection Transformer (CG-DETR) as our structural foundation, we constructed four model variations through targeted architectural refinements, dataset expansions, and the strategic integration of both CLAP and M2D-CLAP feature backends. Within the CG-DETR encoder, we incorporated dedicated register tokens to suppress attention sinks, encouraging the model to yield noise-reduced latent representations. Concurrently, the decoder network is enhanced via Feature-wise Linear Modulation (FiLM) to inject text conditioning into the query channels, enabling dynamic time-localized predictions guided by natural language inputs. Furthermore, we shared the internal representations across the span regression and classification heads, establishing a task-aligned forecasting topology that coordinates category confidence with temporal boundaries. To optimize data pipeline transitions, we addressed inherent dataset limitations. While CASTELLA offers high-quality, long human annotations, its sample size is limited. Conversely, Clotho-Moment provides abundant but automated annotations for short 60s clips. To bridge this gap, we curated a 5-minute intermediate dataset with 14k queries for mid-phase training. During model training, we employed Periodic ASAM based on the AdamW optimizer and introduced random temporal shifting of audio data to mitigate overfitting. Additionally, adding a Sketched Isotropic Gaussian Regularization (SIGReg) loss as a feature-level penalty on the intermediate representations suppressed dimensional collapse. Following a multi-stage workflow, the model pre-trained on Clotho-Moment was iteratively fine-tuned using the CASTELLA dataset. Finally, the resulting models were blended via Model Soups, securing superior generalization without increasing inference latency.
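
The Model Soups step averages the weights of several fine-tuned checkpoints into a single model, which is why inference cost does not increase. The sketch below shows a uniform soup over plain Python dictionaries; whether the report uses a uniform or a greedy soup is not stated, and the name `uniform_soup` and the toy parameters are hypothetical.

```python
def uniform_soup(checkpoints):
    """Average same-named parameter vectors across fine-tuned checkpoints."""
    names = checkpoints[0].keys()
    count = len(checkpoints)
    return {
        name: [
            sum(values) / count
            for values in zip(*(ckpt[name] for ckpt in checkpoints))
        ]
        for name in names
    }


# Two toy checkpoints with identical parameter names and shapes.
ckpt_a = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
ckpt_b = {"head.weight": [3.0, 4.0], "head.bias": [1.0]}
print(uniform_soup([ckpt_a, ckpt_b]))
```

Averaging is only meaningful when the checkpoints share an architecture and start from the same pre-trained initialization.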


QAM-DETR SYSTEM FOR DCASE 2026 TASK 6: QUALITY-AWARE MAMBA DETR FOR QUERY-BASED AUDIO MOMENT RETRIEVAL

JeongRae Kim1, Ho jun Jung1, Yewon Park2, Changwon Lim2
1Department of Statistics and Data Science, Chung-Ang University, Seoul, Korea, 2Department of Applied Statistics, Chung-Ang University, Seoul, Korea

Abstract

This technical report describes our system for DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. The task aims to localize the temporal segment in a long audio recording that matches a given natural-language query. Our system builds on a DETR-style moment localization framework using precomputed pretrained audio and text features. LAION-CLAP is used as the primary audio-language representation, while MS-CLAP, WavLM, and RoBERTa features are selectively integrated to provide complementary acoustic and linguistic information. To improve query-conditioned localization, we use 3-way cross-attention-based cross-modal fusion, a BiMamba-based temporal encoder, lightweight multi-scale temporal fusion, and quality-aware candidate ranking. For selected final submissions, a frozen audio-language LLM verifier is further applied as a post-processing reranker without modifying predicted temporal boundaries. Experiments on the CASTELLA development splits show that the proposed components improve the DETR-style baseline, and the final ensemble systems further enhance temporal localization performance.


TRAINING-FREE AUDIO MOMENT RETRIEVAL VIA BACKGROUND-CONTRASTIVE GAUSSIAN MIXTURE LOCALIZATION

Meghan Kret
Department of Electrical Engineering, The Cooper Union, New York, USA

Abstract

This report describes two submitted systems for DCASE 2026 Challenge Task 6 (Audio Moment Retrieval from Long Audio [1]). Both systems use no supervised temporal training and no labeled data. System 1 combines background contrast normalization with per-query two-component Gaussian mixture model (GMM) [2] inference over frozen MS-CLAP 2023 [3], [4] similarity traces. System 2 is an ablation without the contrast step. On the CASTELLA [5] development-test set, System 1 achieves 10.03% mAP and 13.51% R1@0.7, surpassing the single-dataset supervised DETR [6], [7] baseline (9.11%, 10.32%) [8] using identical frozen features. On Clotho-Moment [8], System 1 achieves 44.28% mAP against the supervised baseline’s 6.32%, a 37.96 pp cross-domain gap explained by the supervised decoder’s domain-specific prior mismatch.
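
One way to picture per-query two-component GMM inference over a similarity trace is sketched below: a 1-D mixture is fitted with plain EM, and the longest run of frames assigned to the higher-mean component is returned as the moment. The initialization, the 0.5 posterior cut-off, the longest-run rule, and all names are assumptions for illustration; the background contrast normalization step is not shown.

```python
import math


def responsibilities(values, weights, means, variances):
    """E-step: posterior probability of each component for every value."""
    resp = []
    for v in values:
        dens = [
            weights[k]
            * math.exp(-0.5 * (v - means[k]) ** 2 / variances[k])
            / math.sqrt(2.0 * math.pi * variances[k])
            for k in range(2)
        ]
        total = sum(dens)
        resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
    return resp


def fit_two_component_gmm(values, iterations=50, min_var=1e-6):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    weights, means = [0.5, 0.5], [lo, hi]
    variances = [max((hi - lo) ** 2 / 4.0, min_var)] * 2
    for _ in range(iterations):
        resp = responsibilities(values, weights, means, variances)
        for k in range(2):  # M-step: re-estimate each component
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                min_var,
            )
    return weights, means, variances


def localize_foreground(trace, frame_seconds=1.0):
    """Return the longest run of frames assigned to the higher-mean component."""
    weights, means, variances = fit_two_component_gmm(trace)
    fg = 0 if means[0] > means[1] else 1
    resp = responsibilities(trace, weights, means, variances)
    active = [r[fg] > 0.5 for r in resp] + [False]  # sentinel closes the last run
    best, start = None, None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return None if best is None else (best[0] * frame_seconds, best[1] * frame_seconds)


# Toy similarity trace: background at 0.1, query-relevant frames at 0.8.
trace = [0.1] * 10 + [0.8] * 5 + [0.1] * 5
print(localize_foreground(trace))
```

No labels are involved: the mixture is fitted separately for each query on its own similarity trace.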


DCASE 2026 TASK 6: AUDIO MOMENT RETRIEVAL USING BACK-TRANSLATION AND TIME MASKING FOR DATA AUGMENTATION

Jun-Ting Lu, Chung-Che Wang, Syu-Siang Wang
Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan

Abstract

This technical report describes our methods for Task 6 of the DCASE 2026 challenge: Audio Moment Retrieval from Long Audio. In this work, we build upon the official baseline without modifying its network architecture, and focus on data augmentation to improve the generalization ability of the model. Specifically, we investigate two augmentation methods: back-translation-based paraphrasing of the text queries using the OPUS-MT models, and a time mask applied to the CLAP embedding sequence on the audio side. Both augmentation methods improve the performance over the baseline, and the best configuration applies back-translation in both the pretraining and fine-tuning stages.
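
The audio-side augmentation can be pictured as masking a random contiguous block of frames in the embedding sequence. The sketch below zeroes the masked frames and draws the block width uniformly up to `max_width`; both choices, along with the function name and default width, are assumptions rather than the report's settings.

```python
import random


def time_mask(embeddings, max_width=5, rng=random):
    """Zero out one random contiguous block of frames in a [T][D] sequence."""
    num_frames = len(embeddings)
    width = rng.randint(0, min(max_width, num_frames))
    start = rng.randint(0, num_frames - width)
    # Return a new sequence so the original embeddings stay untouched.
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(embeddings)
    ]


sequence = [[1.0, 2.0] for _ in range(10)]
masked = time_mask(sequence, max_width=3, rng=random.Random(0))
print(sum(frame == [0.0, 0.0] for frame in masked), "frames masked")
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible during experiments.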

PDF

LOCAL CONTINUITY SALIENCY DETR FOR LANGUAGE-BASED AUDIO MOMENT RETRIEVAL

Le Duc Minh1, Tran Nguyen Van Anh2
1Department of Computer Science, Vietnamese-German University, Binh Duong, Vietnam, 2Faculty of Mathematics and Computer Science, Ho Chi Minh City University of Science, Ho Chi Minh City, Vietnam

Abstract

This paper proposes LCS-DETR, a Detection Transformer-based model for Language-Based Audio Moment Retrieval that addresses two critical limitations of standard temporal grounding: coarse temporal alignment and class imbalance. We introduce a saliency-guided framework with local temporal continuity modeling via depthwise-separable convolution and Focal Loss in the Hungarian matcher to suppress background dominance. Evaluated on the combined Clotho-Moment and CASTELLA [1] datasets, LCS-DETR achieves an mAP of 70.26% and R1@0.7 of 73.08%, representing 4.8× and 4.4× improvements over the baseline DETR [2], respectively. Code and results are available at https://github.com/MinLee0210/DCASE_2026.git.
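For reference, the standard binary focal loss down-weights easy examples by the factor (1 - p_t)^gamma. The sketch below shows only that formula, with the commonly used defaults alpha = 0.25 and gamma = 2; how LCS-DETR plugs it into the Hungarian matching cost is not described in the abstract and is not reproduced here.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted foreground probability p in (0, 1)
    and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = focal_loss(0.05, 0)   # confident, correct: almost no loss
hard_foreground = focal_loss(0.05, 1)   # confident, wrong: large loss
```

With gamma = 0 the expression reduces to alpha-weighted cross-entropy, so gamma controls how strongly the many easy background frames are suppressed.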

PDF

GATED MULTI-FEATURE FUSION FOR DCASE 2026 TASK 6

Kazushi Nakazawa
Advanced Media, Inc., Japan

Abstract

This report describes our submission to DCASE 2026 Challenge Task 6, Audio Moment Retrieval from Long Audio. The task is to retrieve the temporal segment in a long audio recording that best matches a natural-language query, and systems are ranked primarily by recall1@0.7. Our approach is a DETR-style audio moment retrieval model that combines frozen audio and text representations through a lightweight gated fusion layer. Although residual reranking and external audio-language verifier scores improved some validation runs, a final sweep on the CASTELLA development-testing split selected four non-reranked checkpoint variants for submission. The best submitted system combines MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP PCA, and VAD audio features with MS-CLAP and LAION-CLAP text features. It achieved recall1@0.7 = 30.14, recall1@0.5 = 49.29, mAP = 24.27, and mAP@0.75 = 22.71 on CASTELLA development-testing. For the hidden evaluation set, whose query file contains 177 queries over 100 recordings, we generated and format-validated four output files.
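A gated fusion layer, in its generic form, blends two feature vectors with a learned sigmoid gate. The sketch below is only that generic form: it assumes two streams already projected to the same dimension and an element-wise gate, and the parameter names `w_a`, `w_b`, and `bias` are hypothetical. The report's actual layer, which fuses many more feature streams, may differ.

```python
import math

def gated_fusion(a, b, w_a, w_b, bias):
    """Element-wise gate g = sigmoid(w_a*a + w_b*b + bias);
    returns g*a + (1 - g)*b for each dimension."""
    fused = []
    for x, y, wa, wb, c in zip(a, b, w_a, w_b, bias):
        g = 1.0 / (1.0 + math.exp(-(wa * x + wb * y + c)))
        fused.append(g * x + (1.0 - g) * y)
    return fused

stream_a = [0.2, -1.0, 0.5]   # toy features from one frozen encoder
stream_b = [0.4, 1.0, 0.5]    # toy features from another frozen encoder
zeros = [0.0, 0.0, 0.0]
# With all gate parameters at zero the gate is 0.5, i.e. a plain average.
fused_avg = gated_fusion(stream_a, stream_b, zeros, zeros, zeros)
```

Since the gate is the only trainable part, the layer stays lightweight while the encoders remain frozen.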

PDF

TIME COMPRESSION FOR AUDIO MOMENT RETRIEVAL WITH LARGE AUDIO LANGUAGE MODELS

Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu
Graduate School of Engineering, The University of Tokyo, Tokyo, Japan

Abstract

We study time compression (fast-forwarding) as a test-time adaptation that brings long-form audio into the short native context of a large audio language model (LALM), for the DCASE 2026 Task 6 problem of Audio Moment Retrieval (AMR) from long audio. Instead of redesigning the model for long inputs, we time-stretch each recording so that the entire clip fits into a single native input window. We then adapt the model with staged fine-tuning: it first learns temporal grounding, and a final stage fine-tunes it on the target data. Our best system is a three-stage fine-tuned Qwen2-Audio-7B. At inference, we compress each recording to at most 15 s, half of the model’s 30 s input limit, and rerank its predictions across several compression settings. On CASTELLA it reaches R1@0.7 (Recall@1 at IoU 0.7) of 27.01 on validation and 20.91 on test, and it exceeds a DETR-based baseline model that ingests the whole recording without any compression. We further show that pitch-preserving compression is essential and that compressing the whole clip beats sliding-window inference for short-context LALMs. The best compression amount tends to track the length of the model’s training clips, a trend that is significant on the validation split.
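The arithmetic behind the compression step can be sketched as below. The sketch assumes a uniform speed-up and that predicted timestamps on the compressed clip are mapped back by multiplying by the same rate; both function names are hypothetical, and the pitch-preserving time-stretch itself is not shown.

```python
def compression_rate(duration_s, target_s=15.0):
    """Speed-up factor needed so the recording lasts at most `target_s`
    seconds; recordings that are already short enough are left at 1.0."""
    return max(1.0, duration_s / target_s)

def to_original_time(span_compressed, rate):
    """Map a (start, end) span predicted on the compressed clip back to
    the original timeline (assumes uniform compression)."""
    start, end = span_compressed
    return (start * rate, end * rate)

rate = compression_rate(60.0)                     # 60 s recording, 15 s target
original_span = to_original_time((2.5, 5.0), rate)
```

For a 60 s recording the rate is 4, so a moment predicted at 2.5 to 5.0 s on the compressed clip corresponds to 10 to 20 s in the original audio.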

PDF

ENHANCED AUDIO MOMENT RETRIEVAL APPROACH FOR DCASE 2026 TASK 6

Takumi Ogawa, Nobuyuki Ohhashi, Yota Kurisuno, Minami Takenaka, Ririko Miyamoto
Graduate School of Data Science, Yokohama City University, Yokohama, Japan

Abstract

This technical report presents our solution to Task 6 of the DCASE 2026 Challenge, which focuses on retrieving specific moments within long audio recordings that correspond to a given textual query. We address this task by building on the official DETR-based baseline system provided by the DCASE 2026 Task 6 organizers and introducing three main components: M2D-CLAP audio features, a CG-DETR-based moment detection model enhanced with boundary modeling adapted from BAM-DETR, and a post-processing method for boundary refinement and candidate re-ranking. These components are designed to improve audio representation, temporal localization, and final moment selection. Our submitted system achieves an R1@0.7 of 41.13% on the development-testing set.

PDF

YCU SUBMISSION FOR DCASE 2026 CHALLENGE TASK 6

Haruto Sugawara, Tomoko Nakamura, Ken Sakaguchi, Hikari Aida, Shunta Yuri
Yokohama City University, Yokohama, Japan

Abstract

This report describes our submission to DCASE 2026 Task 6 (Audio Moment Retrieval, AMR). Our system extends UVCOM, pretrained on Clotho-Moment and fine-tuned on CASTELLA, with a set of improvement components investigated through ablation on the CASTELLA validation and test splits. Our primary system is a 19-member Weighted Box Fusion ensemble that combines models trained with Quality Focal Loss, delta-feature-augmented inputs, CASTELLA-style audio augmentation, and two acoustic encoders (MS-CLAP and M2D-CLAP) at several temporal resolutions and learning-rate schedules. It achieves R1@0.7 = 39.57% on the CASTELLA test split.
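Weighted Box Fusion merges overlapping predictions from several models by averaging their coordinates with confidence weights. The following is a simplified 1-D sketch for temporal segments, not the ensemble code used in the report: the IoU threshold, the score normalisation by `num_models`, and the function names are assumptions.

```python
def iou_1d(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_segments(preds, num_models, iou_thr=0.5):
    """preds: (start, end, score) tuples pooled from all models, scores > 0.
    Returns fused (start, end, score) tuples sorted by score."""
    clusters, fused = [], []          # members and current fused span per cluster
    for seg in sorted(preds, key=lambda s: s[2], reverse=True):
        match = next((i for i, f in enumerate(fused)
                      if iou_1d(f, seg) > iou_thr), None)
        if match is None:             # no overlapping cluster: start a new one
            clusters.append([seg])
            fused.append((seg[0], seg[1]))
        else:                         # update the score-weighted average span
            clusters[match].append(seg)
            members = clusters[match]
            w = sum(m[2] for m in members)
            fused[match] = (sum(m[0] * m[2] for m in members) / w,
                            sum(m[1] * m[2] for m in members) / w)
    out = [(f[0], f[1], sum(m[2] for m in c) / num_models)
           for f, c in zip(fused, clusters)]
    return sorted(out, key=lambda s: s[2], reverse=True)

# Two models agree on a moment near 10-20 s; one also predicts 50-60 s.
pooled = [(10.0, 20.0, 0.9), (12.0, 22.0, 0.6), (50.0, 60.0, 0.3)]
fused_out = fuse_segments(pooled, num_models=2)
```

Unlike non-maximum suppression, which discards all but the top-scoring segment, the fused boundaries here use every overlapping prediction, and segments supported by only one model end up with a lower score.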

PDF

IMPROVING TEMPORAL BOUNDARY PRECISION IN AUDIO MOMENT RETRIEVAL

Ren Usui, Ryuta Fujimoto, Mikuri Kikuchi, Taichi Kitao, Tomohisa Suzuki
Graduate School of Data Science, Yokohama City University, Yokohama, Japan

Abstract

This technical report describes our system for Audio Moment Retrieval (AMR) from long audio in the DCASE 2026 challenge (Task 6). To improve boundary localization and retrieval accuracy, we build our system on the Unified Video Comprehension framework (UVCOM) and introduce four main modifications: M2D-CLAP-based audio and text feature extraction, Varifocal Loss for IoU-aware confidence estimation, random query sampling during pretraining, and inference-time boundary refinement based on saliency scores. Experimental results on the CASTELLA test dataset show that the proposed modules consistently improve retrieval performance, with the submitted systems substantially outperforming the official QD-DETR baseline across all evaluation metrics. The best configuration achieves a Recall@0.7 of 40.14, demonstrating the effectiveness of the integrated approach for robust AMR in real-world long audio recordings.
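Varifocal Loss trains the confidence score to predict localization quality: for a positive sample the target is its IoU q with the ground truth, and for a negative sample (q = 0) the loss is down-weighted as in focal loss. The sketch below shows the per-sample formula with the usual defaults alpha = 0.75 and gamma = 2; how the report assigns q to each predicted moment is not stated in the abstract.

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Per-sample Varifocal Loss for predicted score p in (0, 1)
    and IoU-aware target q in [0, 1]."""
    if q > 0:
        # Positive: binary cross-entropy against soft target q, weighted by q
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative: focal-style down-weighting of easy negatives
    return -alpha * p ** gamma * math.log(1.0 - p)

well_calibrated = varifocal_loss(0.8, 0.8)   # score matches the IoU target
under_confident = varifocal_loss(0.2, 0.8)   # score far below the IoU target
```

Because the positive term is minimized when p equals q, the learned confidence tends to rank well-localized moments first, which is what a top-1 metric at a high IoU threshold rewards.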

PDF

GISP@HEU’S SUBMISSION FOR DCASE 2026 TASK 6: FREQUENCY-AWARE CROSS-MODAL FUSION FOR AUDIO MOMENT RETRIEVAL

Feiyang Xiao1, Li'ang Luo1, Kejia Zhang1, Qiaoxi Zhu2, Guangjun He3, Pengming Feng3, Wenwu Wang4, Jian Guan1
1College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2University of Technology Sydney, Ultimo, Australia, 3State Key Laboratory of Space Information System and Integrated Application, Beijing, China, 4Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

This technical report presents GISP@HEU’s submission for DCASE 2026 Task 6 audio moment retrieval, which aims to retrieve corresponding segments in long audio recordings based on the content-semantic correlation between audio and text queries. In our submission, we describe four systems built upon the UVCOM framework, focusing on improving the cross-modal fusion process between audio and text features.

PDF

TEF-GUIDED AND QUALITY-AWARE DISTILLED DETR FOR LANGUAGE-BASED AUDIO MOMENT RETRIEVAL

Yutao Xu, Liu Yang, Weixi Zheng
School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China

Abstract

This technical report describes our system for Task 6, language-based audio moment retrieval, of the DCASE 2026 challenge. The system builds upon the official QD-DETR/AM-DETR baseline, which employs fixed CLAP audio and text features. The core design of the proposed system is a quality-aware distilled DETR framework for precise temporal grounding. It improves the baseline in three aspects: query-conditioned audio representation, temporal-position encoding with temporal endpoint features, and localization-quality-aware candidate scoring with prediction-level consensus. Specifically, it integrates text-guided audio refinement, focal and IoU-aware quality objectives, boundary auxiliary supervision, span evidence adaptation, teacher distillation, recall-balanced checkpoint averaging, and weighted box fusion oriented to R1@0.7. In addition, since R1@0.7 is the primary competition target, our model selection prioritizes precise top-1 localization rather than broad recall alone. Experimental results show that, on the CASTELLA test set, the proposed system achieves improvements of 6.76 in R1@0.5, 5.79 in R1@0.7, and 3.47 in average mAP over the official baseline.

PDF

TSEL: TEMPORAL SEMANTIC EVIDENCE LEARNING FOR LANGUAGE-BASED AUDIO MOMENT RETRIEVAL

Xiaokai Zhang, Xiang Shang
Department of Computer Science and Software Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China

Abstract

This technical report describes our submissions to DCASE 2026 Challenge Task 6, Audio Moment Retrieval from Long Audio. The task requires returning temporal windows in a long audio recording that match a natural-language query, with the official ranking emphasizing the top-ranked prediction at Recall1@0.7. Our system is based on Temporal Semantic Evidence Learning (TSEL): instead of directly regressing a single start/end pair, it first predicts query-conditioned temporal evidence and then decodes candidate windows. We package four systems: a conservative internal MS-CLAP evidence baseline (task6-1), a TSEL/SBEC evidence system with a learned evidence decoder (task6-2), a candidate-level evidence fusion system (task6-3), and a risk-aware evidence scoring system (task6-4). On the CASTELLA development-testing split, these systems obtain 16.11%, 21.97%, 23.42%, and 23.59% Recall1@0.7, respectively. TSEL-ECF is our practical main system; TSEL-RAES is retained as a risk-analysis variant. A clean five-seed validation rerun gives 29.83±0.45% Recall1@0.7 for ECF and 28.92±0.24% for RAES, with RAES producing fewer harmful interventions. The official evaluation set has no public ground truth, so no official hidden score is claimed here.
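Since the abstracts on this page report Recall1@0.7 and Recall1@0.5, a minimal sketch of the metric may help. It assumes one ground-truth moment per query and counts a hit when the temporal IoU of the top-ranked prediction is at least the threshold; the official evaluation script may handle multiple moments or ties differently.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at(top1_preds, gts, thr=0.7):
    """Percentage of queries whose top-ranked prediction reaches IoU >= thr."""
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

top1 = [(10.0, 20.0), (0.0, 5.0), (30.0, 40.0)]    # toy top-1 predictions
truth = [(11.0, 20.0), (2.0, 9.0), (30.0, 40.0)]   # toy ground-truth moments
r1_07 = recall1_at(top1, truth, thr=0.7)
r1_03 = recall1_at(top1, truth, thr=0.3)
```

In this toy case two of the three predictions reach IoU 0.7, so Recall1@0.7 is about 66.7, while all three pass the looser 0.3 threshold.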

PDF