Task description
Audio moment retrieval is the task of retrieving specific moments within long audio recordings that align with a given textual query. A more detailed task description can be found on the task description page.
Teams ranking
This table includes only the best-performing system from each team. The ranking is based on the Recall1@0.7 metric achieved on the evaluation dataset. Metric values are reported for both the development-testing split and the evaluation dataset.
| Team Best Rank | Submission Code | System Rank | Corresponding author | Technical Report | Recall1@0.7 (eval dataset) | Recall1@0.5 (eval dataset) | Recall1@0.7 (dev-testing dataset) | Recall1@0.5 (dev-testing dataset) |
|---|---|---|---|---|---|---|---|---|
| 1 | Kibata_YCU_task6_2 | 1 | Koki Kibata | kibata2026_t6 | 48.59 | 69.49 | 40.70 | 55.98 |
| 1 | Kim_CAU_task6_3 | 1 | Changwon Lim | kim2026_t6 | 48.59 | 63.84 | 45.14 | 58.87 |
| 1 | Sugawara_YCU_task6_3 | 1 | Haruto Sugawara | sugawara2026_t6 | 48.59 | 59.89 | 40.68 | 53.50 |
| 4 | Ogawa_YCU_task6_1 | 6 | Takumi Ogawa | ogawa2026_t6 | 46.89 | 60.45 | 41.13 | 54.19 |
| 5 | Calvet_AUDIAS_task6_2 | 11 | Oscar Calvet | calvet2026_t6 | 43.50 | 64.41 | 36.82 | 53.30 |
| 6 | Usui_YCU_task6_1 | 12 | Ren Usui | usui2026_t6 | 41.24 | 58.19 | 37.86 | 54.10 |
| 6 | Choi_KAIST_task6_3 | 12 | Seungdeok Choi | choi2026_t6 | 41.24 | 55.93 | 31.03 | 48.40 |
| 8 | Nakazawa_AM_task6_3 | 21 | Kazushi Nakazawa | nakazawa2026_t6 | 36.16 | 49.72 | 29.03 | 49.59 |
| 9 | Kang_ISCT_task6_1 | 22 | Yaozhong Kang | kang2026_t6 | 35.59 | 50.28 | 33.23 | 45.89 |
| 10 | Xiao_HEU_task6_4 | 26 | Jian Guan | xiao2026_t6 | 33.90 | 48.02 | 31.55 | 41.05 |
| 11 | Khan_WPI_task6_2 | 28 | Mohammad Nur Hossain Khan | khan2026_t6 | 32.20 | 48.02 | 32.07 | 48.71 |
| 12 | Chunarkar_NTHU_task6_3 | 32 | Snehit Chunarkar | chunarkar2026_t6 | 29.38 | 52.54 | 25.39 | 43.88 |
| 13 | Huang_WHU_task6_1 | 34 | Gongping Huang | huang2026_t6 | 27.68 | 48.02 | 25.09 | 37.49 |
| 14 | Nishijima_UTokyo_task6_3 | 40 | Hiroshi Nishijima | nishijima2026_t6 | 22.60 | 32.77 | 21.88 | 30.58 |
| 15 | Chen_CHT_task6_1 | 42 | Wei-Yu Chen | chen2026_t6 | 22.03 | 37.29 | 20.19 | 31.48 |
| 16 | LU_YZU_task6_1 | 43 | Jun-Ting LU | lu2026_t6 | 21.47 | 31.64 | 14.40 | 25.61 |
| 17 | Xu_GZHU_task6_1 | 47 | Yutao Xu | xu2026_t6 | 15.25 | 31.07 | 16.11 | 29.92 |
| 18 | DCASE2026_baseline_task6 | 48 | Hokuto Munakata | | 13.56 | 28.25 | 13.59 | 25.61 |
| 18 | Huck_NV_task6_3 | 48 | Huck Yang | huck2026_t6 | 13.56 | 22.03 | | |
| 18 | Zhang_XJTLU_task6_1 | 48 | Xiaokai Zhang | zhang2026_t6 | 13.56 | 22.03 | 16.11 | 23.83 |
| 21 | Kret_CooperUnion_task6_2 | 52 | Meghan Kret | kret2026_t6 | 11.30 | 19.21 | 11.88 | 17.67 |
| 22 | Minh_VGU_task6_1 | 57 | Le Duc Minh | minh2026_t6 | 5.65 | 12.99 | 6.76 | 14.33 |
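
As a rough illustration of the ranking metric, the sketch below computes Recall1 at a temporal IoU threshold: the percentage of queries whose top-ranked predicted moment overlaps a ground-truth moment with IoU at or above the threshold. The function names and the handling of multiple ground-truth moments per query are assumptions made for this example. The official evaluation tool may differ in detail.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(top1: List[Span], truths: List[List[Span]], thr: float) -> float:
    """Percentage of queries whose top-1 span reaches IoU >= thr.

    Assumption: a query counts as a hit if ANY of its ground-truth
    spans is matched by the top-1 prediction.
    """
    hits = sum(
        1
        for pred, gts in zip(top1, truths)
        if any(temporal_iou(pred, gt) >= thr for gt in gts)
    )
    return 100.0 * hits / len(top1)


# Two toy queries: the first prediction has IoU 0.9, the second IoU 0.6.
preds = [(10.0, 20.0), (0.0, 10.0)]
truths = [[(11.0, 20.0)], [(0.0, 6.0)]]
r1_07 = recall1_at_iou(preds, truths, 0.7)
r1_05 = recall1_at_iou(preds, truths, 0.5)
```

In this toy case only the first query passes the stricter 0.7 threshold, while both pass 0.5, which is why Recall1@0.5 is never lower than Recall1@0.7 in the tables.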
Systems ranking
This table includes all systems submitted by participating teams.
| System Rank | Submission Code | Technical Report | Recall1@0.7 (eval dataset) | Recall1@0.5 (eval dataset) | Recall1@0.7 (dev-testing dataset) | Recall1@0.5 (dev-testing dataset) |
|---|---|---|---|---|---|---|
| 1 | Kibata_YCU_task6_2 | kibata2026_t6 | 48.59 | 69.49 | 40.70 | 55.98 |
| 1 | Kim_CAU_task6_3 | kim2026_t6 | 48.59 | 63.84 | 45.14 | 58.87 |
| 1 | Sugawara_YCU_task6_3 | sugawara2026_t6 | 48.59 | 59.89 | 40.68 | 53.50 |
| 4 | Kim_CAU_task6_2 | kim2026_t6 | 48.02 | 62.15 | 45.14 | 59.02 |
| 5 | Sugawara_YCU_task6_1 | sugawara2026_t6 | 47.46 | 58.19 | 39.57 | 51.80 |
| 6 | Ogawa_YCU_task6_1 | ogawa2026_t6 | 46.89 | 60.45 | 41.13 | 54.19 |
| 7 | Sugawara_YCU_task6_2 | sugawara2026_t6 | 45.76 | 58.19 | 38.38 | 51.30 |
| 8 | Kim_CAU_task6_1 | kim2026_t6 | 45.20 | 57.63 | 42.91 | 55.23 |
| 9 | Sugawara_YCU_task6_4 | sugawara2026_t6 | 44.63 | 57.63 | 38.01 | 52.50 |
| 10 | Kim_CAU_task6_4 | kim2026_t6 | 44.07 | 61.02 | 41.28 | 54.57 |
| 11 | Calvet_AUDIAS_task6_2 | calvet2026_t6 | 43.50 | 64.41 | 36.82 | 53.30 |
| 12 | Usui_YCU_task6_1 | usui2026_t6 | 41.24 | 58.19 | 37.86 | 54.10 |
| 12 | Choi_KAIST_task6_3 | choi2026_t6 | 41.24 | 55.93 | 31.03 | 48.40 |
| 14 | Usui_YCU_task6_2 | usui2026_t6 | 40.68 | 58.76 | 39.68 | 53.95 |
| 14 | Usui_YCU_task6_3 | usui2026_t6 | 40.68 | 58.19 | 38.32 | 54.02 |
| 14 | Calvet_AUDIAS_task6_4 | calvet2026_t6 | 40.68 | 59.89 | 42.09 | 59.39 |
| 14 | Choi_KAIST_task6_4 | choi2026_t6 | 40.68 | 54.80 | 33.70 | 50.41 |
| 18 | Usui_YCU_task6_4 | usui2026_t6 | 40.11 | 58.76 | 40.14 | 54.17 |
| 19 | Choi_KAIST_task6_2 | choi2026_t6 | 37.85 | 51.41 | 29.55 | 47.07 |
| 20 | Calvet_AUDIAS_task6_1 | calvet2026_t6 | 37.29 | 59.89 | 40.01 | 56.12 |
| 21 | Nakazawa_AM_task6_3 | nakazawa2026_t6 | 36.16 | 49.72 | 29.03 | 49.59 |
| 22 | Kang_ISCT_task6_1 | kang2026_t6 | 35.59 | 50.28 | 33.23 | 45.89 |
| 22 | Nakazawa_AM_task6_1 | nakazawa2026_t6 | 35.59 | 49.72 | 30.14 | 49.29 |
| 24 | Calvet_AUDIAS_task6_3 | calvet2026_t6 | 35.03 | 57.06 | 37.56 | 54.27 |
| 25 | Kibata_YCU_task6_1 | kibata2026_t6 | 34.46 | 53.11 | 26.80 | 39.50 |
| 26 | Xiao_HEU_task6_4 | xiao2026_t6 | 33.90 | 48.02 | 31.55 | 41.05 |
| 26 | Choi_KAIST_task6_1 | choi2026_t6 | 33.90 | 55.93 | 28.36 | 43.06 |
| 28 | Khan_WPI_task6_2 | khan2026_t6 | 32.20 | 48.02 | 32.07 | 48.71 |
| 29 | Kang_ISCT_task6_2 | kang2026_t6 | 31.64 | 49.15 | 31.95 | 45.52 |
| 29 | Nakazawa_AM_task6_2 | nakazawa2026_t6 | 31.64 | 51.41 | 29.47 | 47.07 |
| 31 | Xiao_HEU_task6_3 | xiao2026_t6 | 29.94 | 44.07 | 28.36 | 40.31 |
| 32 | Xiao_HEU_task6_1 | xiao2026_t6 | 29.38 | 48.02 | 28.21 | 40.83 |
| 32 | Chunarkar_NTHU_task6_3 | chunarkar2026_t6 | 29.38 | 52.54 | 25.39 | 43.88 |
| 34 | Huang_WHU_task6_1 | huang2026_t6 | 27.68 | 48.02 | 25.09 | 37.49 |
| 35 | Chunarkar_NTHU_task6_4 | chunarkar2026_t6 | 27.12 | 49.72 | 26.28 | 43.50 |
| 36 | Xiao_HEU_task6_2 | xiao2026_t6 | 26.55 | 42.37 | 30.44 | 41.13 |
| 37 | Chunarkar_NTHU_task6_2 | chunarkar2026_t6 | 25.99 | 48.02 | 26.43 | 43.21 |
| 38 | Chunarkar_NTHU_task6_1 | chunarkar2026_t6 | 25.42 | 46.33 | 26.13 | 44.02 |
| 39 | Khan_WPI_task6_1 | khan2026_t6 | 23.73 | 42.37 | 25.32 | 37.27 |
| 40 | Nishijima_UTokyo_task6_3 | nishijima2026_t6 | 22.60 | 32.77 | 21.88 | 30.58 |
| 40 | Nishijima_UTokyo_task6_4 | nishijima2026_t6 | 22.60 | 32.77 | 21.43 | 29.76 |
| 42 | Chen_CHT_task6_1 | chen2026_t6 | 22.03 | 37.29 | 20.19 | 31.48 |
| 43 | LU_YZU_task6_1 | lu2026_t6 | 21.47 | 31.64 | 14.40 | 25.61 |
| 44 | Nishijima_UTokyo_task6_1 | nishijima2026_t6 | 20.90 | 37.29 | 20.91 | 30.36 |
| 44 | LU_YZU_task6_2 | lu2026_t6 | 20.90 | 35.59 | 12.25 | 23.01 |
| 46 | Nishijima_UTokyo_task6_2 | nishijima2026_t6 | 19.21 | 35.03 | 18.68 | 26.49 |
| 47 | Xu_GZHU_task6_1 | xu2026_t6 | 15.25 | 31.07 | 16.11 | 29.92 |
| 48 | DCASE2026_baseline_task6 | | 13.56 | 28.25 | 13.59 | 25.61 |
| 48 | Huck_NV_task6_3 | huck2026_t6 | 13.56 | 22.03 | | |
| 48 | Nakazawa_AM_task6_4 | nakazawa2026_t6 | 13.56 | 23.16 | 28.88 | 49.29 |
| 48 | Zhang_XJTLU_task6_1 | zhang2026_t6 | 13.56 | 22.03 | 16.11 | 23.83 |
| 52 | Huck_NV_task6_1 | huck2026_t6 | 11.30 | 20.34 | | |
| 52 | Kret_CooperUnion_task6_2 | kret2026_t6 | 11.30 | 19.21 | 11.88 | 17.67 |
| 54 | Kret_CooperUnion_task6_1 | kret2026_t6 | 10.73 | 16.95 | 13.51 | 18.49 |
| 55 | Huck_NV_task6_2 | huck2026_t6 | 9.04 | 19.21 | | |
| 56 | Huck_NV_task6_4 | huck2026_t6 | 6.21 | 13.56 | | |
| 57 | Minh_VGU_task6_1 | minh2026_t6 | 5.65 | 12.99 | 6.76 | 14.33 |
| 58 | Zhang_XJTLU_task6_2 | zhang2026_t6 | 2.26 | 5.08 | 21.97 | 33.04 |
| 59 | Zhang_XJTLU_task6_3 | zhang2026_t6 | 1.13 | 3.39 | 23.42 | 34.65 |
| 59 | Zhang_XJTLU_task6_4 | zhang2026_t6 | 1.13 | 2.82 | 23.59 | 35.13 |
System characteristics
Summary of the submitted system characteristics.
| System Rank | Submission Code | Recall1@0.7 (eval) | Technical Report | Audio Model | Text Model | LLM | Loss Function | External Data Resources | Data augmentation | Ensemble | Trainable parameters | Frozen parameters | Total parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kibata_YCU_task6_2 | 48.59 | kibata2026_t6 | M2D-CLAP | M2D-CLAP | | CG-DETR, Varifocal, SIGReg | | FALSE | FALSE | 13370000 | 198500000 | 211870000 |
| 1 | Kim_CAU_task6_3 | 48.59 | kim2026_t6 | LAION-CLAP, MS-CLAP, WavLM | LAION-CLAP, MS-CLAP | Qwen2.5-Omni-7B | QD-DETR, Quality regression | | FALSE | TRUE | 55600000 | 11130000000 | 11185600000 |
| 1 | Sugawara_YCU_task6_3 | 48.59 | sugawara2026_t6 | MS-CLAP, M2D-CLAP | MS-CLAP | | UVCOM | | TRUE | TRUE | 445400000 | 158400000 | 603800000 |
| 4 | Kim_CAU_task6_2 | 48.02 | kim2026_t6 | LAION-CLAP, MS-CLAP, WavLM | LAION-CLAP, MS-CLAP, RoBERTa | Qwen2.5-Omni-7B | QD-DETR, Quality regression | | FALSE | TRUE | 55600000 | 11670000000 | 11725600000 |
| 5 | Sugawara_YCU_task6_1 | 47.46 | sugawara2026_t6 | MS-CLAP, M2D-CLAP | MS-CLAP | | UVCOM | | TRUE | TRUE | 338600000 | 158400000 | 497000000 |
| 6 | Ogawa_YCU_task6_1 | 46.89 | ogawa2026_t6 | M2D-CLAP | M2D-CLAP | QD-DETR, Quality-based reranking, Distillation | FALSE | FALSE | 19400000 | 19400000 | |||
| 7 | Sugawara_YCU_task6_2 | 45.76 | sugawara2026_t6 | MS-CLAP, M2D-CLAP | MS-CLAP | | UVCOM | | TRUE | TRUE | 285300000 | 158400000 | 443700000 |
| 8 | Kim_CAU_task6_1 | 45.20 | kim2026_t6 | LAION-CLAP, MS-CLAP, WavLM | LAION-CLAP, MS-CLAP, RoBERTa | | QD-DETR, Quality regression | | FALSE | TRUE | 55600000 | 941800000 | 997400000 |
| 9 | Sugawara_YCU_task6_4 | 44.63 | sugawara2026_t6 | M2D-CLAP, MS-CLAP | MS-CLAP | | UVCOM | | TRUE | TRUE | 338000000 | 158400000 | 496400000 |
| 10 | Kim_CAU_task6_4 | 44.07 | kim2026_t6 | LAION-CLAP, MS-CLAP | LAION-CLAP, MS-CLAP | Qwen2.5-Omni-7B | QD-DETR, Quality regression | | FALSE | TRUE | 55600000 | 11040000000 | 11095600000 |
| 11 | Calvet_AUDIAS_task6_2 | 43.50 | calvet2026_t6 | BEATs | RoBERTa | UVCOM | AudioCaps, WavCaps, Clotho, TACOS | FALSE | FALSE | 922020000 | 922020000 | ||
| 12 | Usui_YCU_task6_1 | 41.24 | usui2026_t6 | M2D-CLAP | M2D-CLAP | | UVCOM, Varifocal | Clotho | TRUE | FALSE | 18580000 | 198500000 | 217080000 |
| 12 | Choi_KAIST_task6_3 | 41.24 | choi2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR, Cascade | | FALSE | FALSE | 8000000 | 158400000 | 166400000 |
| 14 | Usui_YCU_task6_2 | 40.68 | usui2026_t6 | M2D-CLAP | M2D-CLAP | | UVCOM, Varifocal | Clotho | TRUE | FALSE | 18580000 | 198500000 | 217080000 |
| 14 | Usui_YCU_task6_3 | 40.68 | usui2026_t6 | M2D-CLAP | M2D-CLAP | | UVCOM, Varifocal | Clotho | TRUE | FALSE | 18580000 | 198500000 | 217080000 |
| 14 | Calvet_AUDIAS_task6_4 | 40.68 | calvet2026_t6 | EAT, BEATs | RoBERTa | UVCOM | AudioCaps, WavCaps, Clotho, TACOS | FALSE | TRUE | 1870000000 | 1870000000 | ||
| 14 | Choi_KAIST_task6_4 | 40.68 | choi2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR, Cascade | | FALSE | TRUE | 16000000 | 158400000 | 174400000 |
| 18 | Usui_YCU_task6_4 | 40.11 | usui2026_t6 | M2D-CLAP | M2D-CLAP | | UVCOM, Varifocal | Clotho | TRUE | FALSE | 18580000 | 198500000 | 217080000 |
| 19 | Choi_KAIST_task6_2 | 37.85 | choi2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR, Span rerank, Coarse auxiliary, Boundary contrast InfoNCE, DN-DETR | | FALSE | FALSE | 7400000 | 158400000 | 165800000 |
| 20 | Calvet_AUDIAS_task6_1 | 37.29 | calvet2026_t6 | EAT | RoBERTa | UVCOM | AudioCaps, WavCaps, Clotho, TACOS | FALSE | FALSE | 921300000 | 921300000 | ||
| 21 | Nakazawa_AM_task6_3 | 36.16 | nakazawa2026_t6 | MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, SP-based VAD | MS-CLAP, LAION-CLAP | | QD-DETR | | FALSE | FALSE | 9890000 | 587910000 | 597800000 |
| 22 | Kang_ISCT_task6_1 | 35.59 | kang2026_t6 | OpenFLAM | FLAM text encoder | Qwen2.5-7B-Instruct | QD-DETR | AudioCaps, WavCaps | TRUE | TRUE | 6990000 | 7780000000 | 7786990000 |
| 22 | Nakazawa_AM_task6_1 | 35.59 | nakazawa2026_t6 | MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, SP-based VAD | MS-CLAP, LAION-CLAP | | QD-DETR | | FALSE | FALSE | 9890000 | 587910000 | 597800000 |
| 24 | Calvet_AUDIAS_task6_3 | 35.03 | calvet2026_t6 | EAT, BEATs | RoBERTa | UVCOM | AudioCaps, WavCaps, Clotho, TACOS | FALSE | FALSE | 1786050000 | 1786050000 | ||
| 25 | Kibata_YCU_task6_1 | 34.46 | kibata2026_t6 | MS-CLAP | MS-CLAP | | CG-DETR, Varifocal, SIGReg | | FALSE | FALSE | 13370000 | 158400000 | 171770000 |
| 26 | Xiao_HEU_task6_4 | 33.90 | xiao2026_t6 | MS-CLAP | MS-CLAP | | UVCOM | AudioCaps, FSD50K | FALSE | TRUE | 19000000 | 158000000 | 177000000 |
| 26 | Choi_KAIST_task6_1 | 33.90 | choi2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR, Span rerank, Coarse auxiliary | | FALSE | FALSE | 7100000 | 158400000 | 165500000 |
| 28 | Khan_WPI_task6_2 | 32.20 | khan2026_t6 | LAION-CLAP | MS-CLAP | Qwen2.5-Omni-7B | UVCOM | | FALSE | TRUE | 76000000 | 7700000000 | 7776000000 |
| 29 | Kang_ISCT_task6_2 | 31.64 | kang2026_t6 | OpenFLAM | FLAM text encoder | Qwen2.5-7B-Instruct | QD-DETR | AudioCaps, WavCaps | TRUE | TRUE | 6990000 | 7780000000 | 7786990000 |
| 29 | Nakazawa_AM_task6_2 | 31.64 | nakazawa2026_t6 | MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP, PaSST, SP-based VAD | MS-CLAP, LAION-CLAP, RoBERTa | | QD-DETR | | FALSE | FALSE | 10750000 | 798710000 | 809460000 |
| 31 | Xiao_HEU_task6_3 | 29.94 | xiao2026_t6 | MS-CLAP | MS-CLAP | | UVCOM | AudioCaps, FSD50K | FALSE | FALSE | 19000000 | 158000000 | 177000000 |
| 32 | Xiao_HEU_task6_1 | 29.38 | xiao2026_t6 | MS-CLAP | MS-CLAP | | UVCOM | AudioCaps, FSD50K | FALSE | FALSE | 21500000 | 158000000 | 179500000 |
| 32 | Chunarkar_NTHU_task6_3 | 29.38 | chunarkar2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR | | FALSE | FALSE | 7180000 | 89040000 | 96220000 |
| 34 | Huang_WHU_task6_1 | 27.68 | huang2026_t6 | MS-CLAP | MS-CLAP | UVCOM, Boundary hard negative | TRUE | FALSE | 14200000 | 14200000 | |||
| 35 | Chunarkar_NTHU_task6_4 | 27.12 | chunarkar2026_t6 | M2D-CLAP, LAION-CLAP | M2D-CLAP, LAION-CLAP | | QD-DETR | | FALSE | TRUE | 7450000 | 247370000 | 254820000 |
| 36 | Xiao_HEU_task6_2 | 26.55 | xiao2026_t6 | MS-CLAP | MS-CLAP | | UVCOM | AudioCaps, FSD50K | FALSE | FALSE | 19000000 | 158000000 | 177000000 |
| 37 | Chunarkar_NTHU_task6_2 | 25.99 | chunarkar2026_t6 | M2D-CLAP, LAION-CLAP | M2D-CLAP, LAION-CLAP | | QD-DETR | | FALSE | TRUE | 7450000 | 247370000 | 254820000 |
| 38 | Chunarkar_NTHU_task6_1 | 25.42 | chunarkar2026_t6 | M2D-CLAP | M2D-CLAP | | QD-DETR | | FALSE | FALSE | 7180000 | 89040000 | 96220000 |
| 39 | Khan_WPI_task6_1 | 23.73 | khan2026_t6 | MS-CLAP | MS-CLAP | Qwen2.5-Omni-7B | UVCOM, GRPO margin IoU reward | | FALSE | TRUE | 108000000 | 7500000000 | 7608000000 |
| 40 | Nishijima_UTokyo_task6_3 | 22.60 | nishijima2026_t6 | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Causal LM cross-entropy | FTAR (TimeAudio) | TRUE | TRUE | 44000000 | 8397000000 | 8441000000 |
| 40 | Nishijima_UTokyo_task6_4 | 22.60 | nishijima2026_t6 | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Causal LM cross-entropy | FTAR (TimeAudio) | TRUE | TRUE | 44000000 | 8397000000 | 8441000000 |
| 42 | Chen_CHT_task6_1 | 22.03 | chen2026_t6 | MS-CLAP | MS-CLAP | | UVCOM | | FALSE | FALSE | | | |
| 43 | LU_YZU_task6_1 | 21.47 | lu2026_t6 | MS-CLAP | MS-CLAP | QD-DETR | TRUE | FALSE | 7100000 | 7100000 | |||
| 44 | Nishijima_UTokyo_task6_1 | 20.90 | nishijima2026_t6 | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Causal LM cross-entropy | FTAR (TimeAudio) | TRUE | FALSE | 44000000 | 8397000000 | 8441000000 |
| 44 | LU_YZU_task6_2 | 20.90 | lu2026_t6 | MS-CLAP | MS-CLAP | QD-DETR | TRUE | FALSE | 7100000 | 7100000 | |||
| 46 | Nishijima_UTokyo_task6_2 | 19.21 | nishijima2026_t6 | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Qwen2-Audio-7B-Instruct | Causal LM cross-entropy | FTAR (TimeAudio) | TRUE | FALSE | 44000000 | 8397000000 | 8441000000 |
| 47 | Xu_GZHU_task6_1 | 15.25 | xu2026_t6 | MS-CLAP | MS-CLAP | | QD-DETR, Focal, IoU-aware quality, Boundary auxiliary, Teacher distillation | | FALSE | TRUE | 7100000 | 158400000 | 165500000 |
| 48 | DCASE2026_baseline_task6 | 13.56 | | MS-CLAP | MS-CLAP | | QD-DETR | | FALSE | FALSE | 7100000 | 158400000 | 165500000 |
| 48 | Huck_NV_task6_3 | 13.56 | huck2026_t6 | MS-CLAP | Audio-Flamingo | FALSE | FALSE | 8000000000 | 8000000000 | ||||
| 48 | Nakazawa_AM_task6_4 | 13.56 | nakazawa2026_t6 | MS-CLAP, LAION-CLAP, BEATs, EAT | MS-CLAP, LAION-CLAP | | QD-DETR | | FALSE | FALSE | 9620000 | 498870000 | 508490000 |
| 48 | Zhang_XJTLU_task6_1 | 13.56 | zhang2026_t6 | MS-CLAP | MS-CLAP | | Cross-entropy, Boundary classification | | FALSE | FALSE | 1053000 | 158400000 | 159453000 |
| 52 | Huck_NV_task6_1 | 11.30 | huck2026_t6 | MS-CLAP | Audio-Flamingo | FALSE | FALSE | 8000000000 | 8000000000 | ||||
| 52 | Kret_CooperUnion_task6_2 | 11.30 | kret2026_t6 | MS-CLAP | MS-CLAP | | | | FALSE | FALSE | | | |
| 54 | Kret_CooperUnion_task6_1 | 10.73 | kret2026_t6 | MS-CLAP | MS-CLAP | | | | FALSE | FALSE | | | |
| 55 | Huck_NV_task6_2 | 9.04 | huck2026_t6 | MS-CLAP | Audio-Flamingo | FALSE | FALSE | 8000000000 | 8000000000 | ||||
| 56 | Huck_NV_task6_4 | 6.21 | huck2026_t6 | MS-CLAP | Audio-Flamingo | FALSE | FALSE | 8000000000 | 8000000000 | ||||
| 57 | Minh_VGU_task6_1 | 5.65 | minh2026_t6 | MS-CLAP | MS-CLAP | QD-DETR, Focal | FALSE | FALSE | 9900000 | 9900000 | |||
| 58 | Zhang_XJTLU_task6_2 | 2.26 | zhang2026_t6 | MS-CLAP | MS-CLAP | | Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression | | FALSE | FALSE | 1070000 | 158400000 | 159470000 |
| 59 | Zhang_XJTLU_task6_3 | 1.13 | zhang2026_t6 | MS-CLAP | MS-CLAP | | Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression, Pairwise ranking | | FALSE | TRUE | 2167000 | 158400000 | 160567000 |
| 59 | Zhang_XJTLU_task6_4 | 1.13 | zhang2026_t6 | MS-CLAP | MS-CLAP | | Cross-entropy, Boundary classification, Boundary width hard negative, Candidate quality regression, Semantic temporal risk heads, Pairwise ranking | | FALSE | TRUE | 2177000 | 158400000 | 160577000 |
Technical reports
COARSE-TO-FINE AUDIO MOMENT RETRIEVAL WITH TEMPORAL REFINEMENT AND RE-RANKING
Óscar Calvet1, Doroteo T. Toledano2
1AUDIAS, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain, 2AUDIAS, Escuela Politécnica Superior, Universidad Autónoma de Madrid
Calvet_AUDIAS_task6_1 Calvet_AUDIAS_task6_2 Calvet_AUDIAS_task6_3 Calvet_AUDIAS_task6_4
Abstract
This technical report describes our submission to Task 6 of the DCASE 2026 Challenge, which addresses audio moment retrieval from long audio recordings. The goal of the task is to localize the temporal segment, or segments, that match a natural-language query in an untrimmed audio recording. Our main approach follows a coarse-to-fine retrieval strategy based on UVCOM-style temporal localization models and window-level audio embeddings. First, a coarse model processes the full audio recording and produces a ranked set of candidate moments. Then, a second refinement model operates on local crops around the most promising candidates using higher-resolution audio features, improving the temporal precision of the predicted boundaries. Finally, a lightweight candidate reranker combines temporal, confidence, audio-text similarity, and boundary-context features to select and rank the final predictions. We also apply submission-oriented postprocessing, including timestamp rounding, duration clamping, duplicate removal, and boundary clipping. Our best single system achieves an R1@0.7 of 40.01, while our ensemble system achieves an R1@0.7 of 42.09. These results show that combining global proposal generation, local high-resolution refinement, and candidate-level reranking is an effective strategy for language-based audio moment retrieval in long recordings.
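
The post-processing steps named in the abstract above (timestamp rounding, duration clamping, duplicate removal, and boundary clipping) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the default duration limits, the rounding precision, and the order of the steps are all assumptions.

```python
from typing import List, Tuple

Moment = Tuple[float, float, float]  # (start_sec, end_sec, score)


def postprocess(moments: List[Moment], audio_len: float,
                min_dur: float = 1.0, max_dur: float = 60.0,
                decimals: int = 2) -> List[Moment]:
    """Clean a candidate list; min_dur, max_dur and decimals are hypothetical."""
    cleaned: List[Moment] = []
    seen = set()
    # Visit candidates from highest to lowest score so that, among
    # duplicates, the best-scored one is kept.
    for start, end, score in sorted(moments, key=lambda m: -m[2]):
        # Boundary clipping: keep both ends inside the recording.
        start = min(max(start, 0.0), audio_len)
        end = min(max(end, start), audio_len)
        # Duration clamping: enforce min_dur <= duration <= max_dur,
        # shifting the start back if the end hits the recording limit.
        dur = min(max(end - start, min_dur), max_dur)
        end = min(start + dur, audio_len)
        start = max(end - dur, 0.0)
        # Timestamp rounding.
        span = (round(start, decimals), round(end, decimals))
        # Duplicate removal: drop spans identical after rounding.
        if span in seen:
            continue
        seen.add(span)
        cleaned.append((span[0], span[1], score))
    return cleaned


candidates = [(-2.0, 5.004, 0.9), (0.0, 5.0, 0.8), (298.7, 305.0, 0.5)]
result = postprocess(candidates, audio_len=300.0)
```

Here the first two candidates collapse to the same rounded span, so only the higher-scored one survives, and the third is clipped to the end of the 300 s recording.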
Audio Moment Retrieval from Long Audio for DCASE 2026 task 6
Wei-Yu Chen, Chung-Li Lu
Telecommunication Laboratories, Chunghwa Telecom Co., Ltd., Taiwan
Chen_CHT_task6_1
Abstract
In this technical report, we briefly describe the system we designed for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. We first evaluated several existing models, including the official DCASE baseline QD-DETR, as well as QD-DETR, Moment-DETR, UVCom (with and without fine-tuning via the Lighthouse framework), and SpotSound-A. On the test set, fine-tuned UVCom achieved the best performance (R1@0.7 = 20.19%), followed by SpotSound-A (16.04%) and QD-DETR (15.96%). As model sizes continue to grow, the time and computational cost of fine-tuning increases accordingly. We therefore explored an alternative approach: rather than fine-tuning, can we decompose a complex query into simpler sub-queries, predict each sub-query independently, and merge the predictions to recover the full answer? Using Gemma4, we decomposed original queries into sub-events, which were then individually localized by the best-performing UVCom and SpotSound models. Following the TFVTG framework, Gemma4 determined whether each sub-event occurred simultaneously or sequentially, and the predicted time windows were merged via union or intersection accordingly. UVCom achieved R1@0.7 = 20.19% on the test set; SpotSound achieved 16.04%. Results show that this direction remains highly challenging: all sub-event decomposition variants underperformed their respective baselines, primarily because the postprocessing merge logic caused over-expansion of predicted windows, reducing overlap with ground-truth annotations.
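
The window-merging step described in the abstract above can be sketched as below. The mapping used here (sequential sub-events merged by union, simultaneous sub-events merged by intersection) is an assumption, as is the choice to represent a union as the single span covering all sub-event windows. The function name and the relation labels are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def merge_windows(windows: List[Span], relation: str) -> Optional[Span]:
    """Merge per-sub-event windows into one moment for the full query."""
    starts = [s for s, _ in windows]
    ends = [e for _, e in windows]
    if relation == "sequential":
        # Union, taken here as the span from the earliest start to the latest end.
        return (min(starts), max(ends))
    if relation == "simultaneous":
        # Intersection: the part shared by all windows, if any.
        start, end = max(starts), min(ends)
        return (start, end) if start < end else None
    raise ValueError(f"unknown relation: {relation}")


sub_event_windows = [(10.0, 20.0), (18.0, 35.0)]
union_span = merge_windows(sub_event_windows, "sequential")
shared_span = merge_windows(sub_event_windows, "simultaneous")
```

The union in this example is 25 s long although neither sub-event window exceeds 17 s, which illustrates how a union-style merge can over-expand the predicted window.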
MULTI-SIGNAL CASCADED GROUNDING FOR AUDIO MOMENT RETRIEVAL FROM LONG AUDIO
Seungdeok Choi, Yong-Hwa Park
Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Choi_KAIST_task6_1 Choi_KAIST_task6_2 Choi_KAIST_task6_3 Choi_KAIST_task6_4
Abstract
We describe our submission to DCASE 2026 Task 6, audio moment retrieval (AMR) from long audio. Building on the query-dependent DETR (QD-DETR) detection-transformer baseline, we replace the MS-CLAP feature extractor with frozen M2D-CLAP embeddings and augment the detector with a sequence of complementary, training-only supervision signals: a multi-resolution coarse auxiliary branch, a span-query re-ranking regularizer, a per-boundary InfoNCE contrast, and a DN-DETR denoising task, together with a bidirectional Mamba audio encoder. A lightweight cascade refinement decoder then adds a second, localized detection stage that reads per-frame audio crops around each first-stage prediction and emits bounded boundary corrections. At inference we combine a two-seed score-level ensemble with a saliency-as-proposals stream that repurposes the saliency head as a parallel moment generator. Our four submitted systems form a monotone progression on the development-testing split, the strongest reaching Recall1@0.7 = 33.70, a large gain over the official baseline.
Exploring Pretrained Audio-Text Encoders for Audio Moment Retrieval: DCASE 2026 Task 6
Snehit B. Chunarkar1, Krishnagiri Hamza2, Chi-Chun Lee1
1Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2Electronics and Communication Engineering, Shri Ramasamy Memorial Institute of Science and Technology, Chennai, India
Chunarkar_NTHU_task6_1 Chunarkar_NTHU_task6_2 Chunarkar_NTHU_task6_3 Chunarkar_NTHU_task6_4
Abstract
We present our system submitted to DCASE 2026 Task 6: Audio Moment Retrieval (AMR), which aims to retrieve a temporally grounded moment within a long audio recording given a natural language query. Following the AM-DETR baseline framework, we adopt frame-level audio feature extraction by segmenting long recordings into non-overlapping one-second clips, yielding a temporally ordered sequence of clip-level embeddings. Building on this, our primary contribution is a systematic investigation of pretrained audio and text encoders as replacements for the baseline MS-CLAP features, including M2D-CLAP and LAION-CLAP for both audio and text. The selected feature representations are fused into an ensemble and fed to an AM-DETR-based retrieval head for temporal boundary regression. We further incorporate frame masking during training and an IoU-based loss to improve localisation. Our system achieves a Recall1@0.7 score of 26.43 on the CASTELLA test split, surpassing the baseline score of 13.59.
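
The clip segmentation mentioned in the abstract above might look like the sketch below, which only splits a waveform into non-overlapping one-second clips. The function name and the zero-padding of a final partial clip are assumptions. Each clip would then be passed to an audio encoder, which is omitted here.

```python
from typing import List, Sequence


def split_into_clips(samples: Sequence[float], sample_rate: int,
                     clip_sec: float = 1.0) -> List[List[float]]:
    """Split a mono waveform into non-overlapping fixed-length clips."""
    hop = int(round(sample_rate * clip_sec))
    clips = []
    for i in range(0, len(samples), hop):
        clip = list(samples[i:i + hop])
        # Assumption: zero-pad the last clip if the recording length
        # is not a multiple of the clip length.
        clip += [0.0] * (hop - len(clip))
        clips.append(clip)
    return clips


# Toy waveform: 10 samples at a (hypothetical) 4 Hz sample rate gives 3 clips.
toy_clips = split_into_clips([float(i) for i in range(10)], sample_rate=4)
```

The resulting list is already in temporal order, so encoding the clips one by one yields the ordered sequence of clip-level embeddings that the retrieval head consumes.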
Quality- and Boundary-Aware Cross-Modal Refinement for Long-Audio Moment Retrieval
Yingzhao Hou1, Anda Liu1, Zhongqin Shu1, Xin Guo1, Yuzheng Wu1, Yiwei Liu1, Xiaolan Xia1, Xueqin Luo2, Gongping Huang1
1School of Electronic Information, Wuhan University, Wuhan, China, 2CIAIC, Northwestern Polytechnical University, Xi'an, China
Huang_WHU_task6_1
Abstract
This technical report describes our submission to DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. Based on the official QD-DETR baseline, we improve the system through three aspects: cross-modal semantic refinement, temporal localization optimization, and long-context training data construction. First, we adapt the Comprehensive Integration Module (CIM) and Multi-Aspect Contrastive Learning (MCL) from UVCOM to enhance audio-language semantic interaction. Second, we introduce several task-specific designs, including dense all-window saliency supervision, auxiliary audio-text similarity learning, localization-quality-aware candidate resorting, short-window proposal generation, and candidate-level boundary hard negative learning, which jointly improve proposal generation, boundary estimation, and candidate ranking. Third, we construct Clotho-Moment-Long, a long-context extension of Clotho-Moment following the same data generation pipeline while extending the audio context to 300 s and introducing repeated event occurrences and sparser event placement to better simulate realistic and challenging long-audio retrieval scenarios. The resulting model is first trained on Clotho-Moment-Long and then fine-tuned on the real-world CASTELLA dataset. On the CASTELLA development-testing split, the final fused system improves Recall1@0.7 from the official CASTELLA+Clotho-Moment baseline value of 13.59 to 25.09, while increasing average mAP from 12.06 to 18.61.
TEXT-SPACE IMAGINATION OF AUDIO RETRIEVAL VIA JOINT-SPACE PROJECTION
Chao-Han Huck Yang1, Zih-Ching Chen1, Eli Chien2, Sabato Marco Siniscalchi3,4
1NVIDIA, 2National Taiwan University, 3University of Palermo, 4Georgia Tech
Huck_NV_task6_1 Huck_NV_task6_2 Huck_NV_task6_3 Huck_NV_task6_4
Abstract
We study audio moment retrieval (AMR) as text-space imagination of audio: given a natural-language query and a long recording, a contrastive audio–language model can locate when the described sound is active by scoring each second of audio against the text in a shared embedding space. DCASE 2026 Task 6 provides pre-extracted MS-CLAP-2023 backbone features (768 dimensions) rather than raw audio. We first show, through a matched versus mismatched query analysis, that cosine similarity in this backbone space carries almost no query-specific signal: a query’s peak relevance on its own audio is no sharper than that of a random query. Our method, JointProj, restores the missing cross-modal alignment by passing the provided features and a re-encoded query through MS-CLAP’s own projection heads into the 1024-dimensional joint space, where cosine similarity yields a sharp per-second relevance curve. Moments are then localized by full-width-at-half-maximum (FWHM) peak detection. On the Clotho-Moment validation split, scored with the official tool, JointProj raises Recall1@0.7 from 13.2 to 46.0 and mAP from 12.4 to 42.5 over backbone cosine. We submit four systems: three FWHM peak-width variants that perform best in different moment-length regimes, and one exploratory localizer in which a language model reasons over the joint-space curve. We select the strongest development configuration for the blind evaluation.
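
The scoring and localization idea in the abstract above can be sketched as follows: a per-second cosine relevance curve, followed by a full-width-at-half-maximum (FWHM) span around its peak. The sketch assumes the per-second audio embeddings and the query embedding are already in the same joint space, so the projection heads are not shown. Measuring the half maximum relative to the curve minimum is also an assumption, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def relevance_curve(frame_embs: List[Sequence[float]],
                    query_emb: Sequence[float]) -> List[float]:
    """One similarity score per audio frame (e.g. per second)."""
    return [cosine(f, query_emb) for f in frame_embs]


def fwhm_moment(curve: List[float], sec_per_frame: float = 1.0) -> Tuple[float, float]:
    """Span around the global peak where the curve stays at or above half maximum."""
    peak = max(range(len(curve)), key=curve.__getitem__)
    base = min(curve)
    half = base + 0.5 * (curve[peak] - base)
    left = peak
    while left > 0 and curve[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(curve) - 1 and curve[right + 1] >= half:
        right += 1
    # Frame i covers [i, i + 1) * sec_per_frame, hence right + 1.
    return (left * sec_per_frame, (right + 1) * sec_per_frame)


toy_curve = [0.0, 0.1, 0.6, 1.0, 0.7, 0.2, 0.0]
moment = fwhm_moment(toy_curve)
```

On the toy curve, frames 2 to 4 stay at or above half of the peak value, so the predicted moment spans 2.0 s to 5.0 s.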
FLAM-CONDITIONED QD-DETR WITH PARAPHRASE POOLING AND WEIGHTED-BOXES-FUSION ENSEMBLING FOR AUDIO MOMENT RETRIEVAL
Yaozhong Kang, Runwu Shi, Benjamin Yen, Kazuhiro Nakadai
Department of Systems and Control Engineering, Institute of Science Tokyo, Tokyo, Japan
Kang_ISCT_task6_1 Kang_ISCT_task6_2
Abstract
We describe our submission to DCASE 2026 Task 6, Audio Moment Retrieval (AMR) from long audio. Starting from the official QD-DETR baseline, we leave the detection head untouched and instead strengthen the two stages that bound its boundary-precise recall: how audio and text are encoded, and how predictions are combined. Replacing the MS-CLAP encoder with FLAM (Frame-Wise Language-Audio Modeling), whose features are text-aligned at a high frame rate, is the single largest source of our gain. We then add several inexpensive sources of prediction diversity, namely query paraphrasing, saliency-curve candidates, and augmentation-trained students, and fuse them with a one-dimensional Weighted-Boxes-Fusion (WBF) ensemble that an iterative leave-one-out sweep prunes to its most complementary members. On the held-out test split the final ensemble more than doubles the baseline, raising Recall1 at 0.7 IoU (R1@0.7) from 13.59 to 33.23 and mean average precision (mAP) from 12.06 to 24.51.
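
A one-dimensional variant of Weighted-Boxes-Fusion, as named in the abstract above, can be sketched roughly as follows. This is a simplified illustration rather than the authors' implementation: spans from all ensemble members are clustered greedily by temporal IoU, each cluster is fused into a score-weighted average span, and the fused score is the summed score divided by the number of models. The IoU threshold, the score rule, and all names are assumptions, and scores are assumed to be positive.

```python
from typing import List, Tuple

Scored = Tuple[float, float, float]  # (start_sec, end_sec, score)


def temporal_iou(a, b) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def _fuse(members: List[Scored], n_models: int) -> Scored:
    """Score-weighted mean of the boundaries of one cluster."""
    w = sum(m[2] for m in members)
    start = sum(m[0] * m[2] for m in members) / w
    end = sum(m[1] * m[2] for m in members) / w
    return (start, end, w / n_models)


def wbf_1d(spans: List[Scored], n_models: int, iou_thr: float = 0.5) -> List[Scored]:
    """Fuse spans pooled from n_models predictors into consensus spans."""
    clusters: List[List[Scored]] = []
    fused: List[Scored] = []
    for s in sorted(spans, key=lambda x: -x[2]):
        for i, f in enumerate(fused):
            if temporal_iou(s, f) >= iou_thr:
                # Join the first matching cluster and refresh its fused span.
                clusters[i].append(s)
                fused[i] = _fuse(clusters[i], n_models)
                break
        else:
            # No sufficiently overlapping cluster: start a new one.
            clusters.append([s])
            fused.append(_fuse([s], n_models))
    return sorted(fused, key=lambda f: -f[2])


pooled = [(10.0, 20.0, 0.5), (12.0, 22.0, 0.5), (100.0, 110.0, 0.25)]
consensus = wbf_1d(pooled, n_models=2)
```

Dividing by the number of models lowers the score of spans proposed by only one ensemble member, so moments that several members agree on rise to the top of the ranking.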
Encoder-aware Verifier Fusion with Boundary Refinement for Audio Moment Retrieval
Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
Khan_WPI_task6_1 Khan_WPI_task6_2
Abstract
We describe our two submissions to the DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. Both build on the UVCOM [2] retriever and add complementary post-processing stages, but they target different encoder families: (i) a MS-CLAP [3] pipeline closer to the baseline, combining DAPO RL reranking, confidence-gated swap, and a 2-checkpoint refinement fusion; and (ii) a LAION-CLAP [4] pipeline that swaps the audio encoder, applies our verifier-fusion principle, and concludes with a single-checkpoint refinement SFT. The core insight is that the highest-quality rerank signal comes not from training a reranker over the retriever’s candidates, but from fusing the retriever’s candidate scores with an architecturally independent predictor (a standalone span-SFT Qwen2.5-Omni-7B) whose predictions have never seen the retriever’s ranking. On the CASTELLA test set (public dev-test), our LAION pipeline reaches R1@0.7=32.07 and mAP=27.02 — the strongest configuration in our system grid. The MS-CLAP pipeline reaches R1@0.7=25.32, complementing the LAION submission with a different encoder family and a different refinement variant.
ADVANCED AUDIO MOMENT RETRIEVAL VIA CG-DETR
Koki Kibata, Sayaka Yamamoto, Tomoki Kawabata, Yuma Higashino, Yuki Osawa
Data Science Department, Yokohama City University, Kanagawa, Japan
Kibata_YCU_task6_1 Kibata_YCU_task6_2
Abstract
This technical report describes our system designed for the DCASE 2026 Challenge Task 6 (Audio Moment Retrieval). Utilizing the Context-Gated Detection Transformer (CG-DETR) as our structural foundation, we constructed four model variations through targeted architectural refinements, dataset expansions, and the strategic integration of both CLAP and M2D-CLAP feature backends. Within the CG-DETR encoder, we incorporated dedicated register tokens to suppress attention sinks, encouraging the model to yield noise-reduced latent representations. Concurrently, the decoder network is enhanced via Feature-wise Linear Modulation (FiLM) to inject text conditioning into the query channels, enabling dynamic time-localized predictions guided by natural language inputs. Furthermore, we shared the internal representations across the span regression and classification heads, establishing a task-aligned forecasting topology that coordinates category confidence with temporal boundaries. To optimize data pipeline transitions, we addressed inherent dataset limitations. While CASTELLA offers high-quality, long human annotations, its sample size is limited. Conversely, Clotho-Moment provides abundant but automated annotations for short 60s clips. To bridge this gap, we curated a 5-minute intermediate dataset with 14k queries for mid-phase training. During model training, we employed Periodic ASAM based on the AdamW optimizer and introduced random temporal shifting of audio data to mitigate overfitting. Additionally, adding a Sketched Isotropic Gaussian Regularization (SIGReg) loss as a feature-level penalty on the intermediate representations suppressed dimensional collapse. Following a multi-stage workflow, the model pre-trained on Clotho-Moment was iteratively fine-tuned using the CASTELLA dataset. Finally, the resulting models were blended via Model Soups, securing superior generalization without increasing inference latency.
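The reason Model Soups add no inference latency is that the checkpoints are averaged in weight space, leaving a single model to run. The sketch below shows a uniform soup under simplifying assumptions: `uniform_soup` is a hypothetical helper, parameters are represented as flat lists of floats rather than tensors, and the abstract does not say whether the authors used a uniform or a greedy soup.

```python
def uniform_soup(state_dicts):
    """Average parameter values key by key across checkpoints.

    Each state dict maps a parameter name to a flat list of floats. Real
    checkpoints would hold tensors, but the arithmetic is the same.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    return {
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }
```

The averaged dictionary has the same shape as any single checkpoint, so it can be loaded into one copy of the network, in contrast to an output-level ensemble that must run every member.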
QAM-DETR SYSTEM FOR DCASE 2026 TASK 6: QUALITY-AWARE MAMBA DETR FOR QUERY-BASED AUDIO MOMENT RETRIEVAL
JeongRae Kim1, Ho jun Jung1, Yewon Park2, Changwon Lim2
1Department of Statistics and Data Science, Chung-Ang University, Seoul, Korea, 2Department of Applied Statistics, Chung-Ang University, Seoul, Korea
Kim_CAU_task6_1 Kim_CAU_task6_2 Kim_CAU_task6_3 Kim_CAU_task6_4
Abstract
This technical report describes our system for DCASE 2026 Challenge Task 6: Audio Moment Retrieval from Long Audio. The task aims to localize the temporal segment in a long audio recording that matches a given natural-language query. Our system builds on a DETR-style moment localization framework using precomputed pretrained audio and text features. LAION-CLAP is used as the primary audio-language representation, while MS-CLAP, WavLM, and RoBERTa features are selectively integrated to provide complementary acoustic and linguistic information. To improve query-conditioned localization, we use 3-way cross-attention-based cross-modal fusion, a BiMamba-based temporal encoder, lightweight multi-scale temporal fusion, and quality-aware candidate ranking. For selected final submissions, a frozen audio-language LLM verifier is further applied as a post-processing reranker without modifying predicted temporal boundaries. Experiments on the CASTELLA development splits show that the proposed components improve the DETR-style baseline, and the final ensemble systems further enhance temporal localization performance.
TRAINING-FREE AUDIO MOMENT RETRIEVAL VIA BACKGROUND-CONTRASTIVE GAUSSIAN MIXTURE LOCALIZATION
Meghan Kret
Department of Electrical Engineering, The Cooper Union, New York, USA
Kret_CooperUnion_task6_1 Kret_CooperUnion_task6_2
Abstract
This report describes two submitted systems for DCASE 2026 Challenge Task 6 (Audio Moment Retrieval from Long Audio [1]). Both systems use no supervised temporal training and no labeled data. System 1 combines background contrast normalization with per-query two-component Gaussian mixture model (GMM) [2] inference over frozen MS-CLAP 2023 [3], [4] similarity traces. System 2 is an ablation without the contrast step. On the CASTELLA [5] development-test set, System 1 achieves 10.03% mAP and 13.51% R1@0.7 – surpassing the single-dataset supervised DETR [6], [7] baseline (9.11%, 10.32%) [8] using identical frozen features. On Clotho-Moment [8], System 1 achieves 44.28% mAP against the supervised baseline’s 6.32%, a 37.96 pp cross-domain gap explained by the supervised decoder’s domain-specific prior mismatch.
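A training-free localizer of this kind can be sketched as follows. The abstract does not specify the details, so everything here is an assumption made for illustration: the background contrast is taken to be a per-frame subtraction of a background similarity trace, the mixture is fitted with a plain EM loop, and the predicted moment is the longest run of frames assigned to the higher-mean component. The names `fit_gmm2` and `localize` are hypothetical.

```python
import math


def _weighted_pdf(x, w, mu, var):
    """Weighted Gaussian density w * N(x; mu, var)."""
    return w * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_gmm2(xs, iters=50, init_var=1e-2, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    if len(xs) < 2:
        raise ValueError("need at least two frames")
    s = sorted(xs)
    half = len(s) // 2
    # Initialise the means on the lower and upper halves of the sorted values.
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [init_var, init_var]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in xs:
            p = [_weighted_pdf(x, w[k], mu[k], var[k]) for k in range(2)]
            z = sum(p) or 1e-300
            resp.append([pk / z for pk in p])
        # M-step: re-estimate weights, means, and floored variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, var_floor)
    return w, mu, var


def localize(query_sim, background_sim, hop_s=1.0):
    """Return the longest foreground run as (start_s, end_s), or None."""
    # Assumed contrast step: subtract the background trace frame by frame.
    trace = [q - b for q, b in zip(query_sim, background_sim)]
    w, mu, var = fit_gmm2(trace)
    fg = 0 if mu[0] > mu[1] else 1
    bg = 1 - fg
    best, start = None, None
    for i, x in enumerate(trace + [None]):  # None closes a trailing run
        is_fg = x is not None and (
            _weighted_pdf(x, w[fg], mu[fg], var[fg])
            > _weighted_pdf(x, w[bg], mu[bg], var[bg])
        )
        if is_fg and start is None:
            start = i
        elif not is_fg and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    if best is None:
        return None
    return (best[0] * hop_s, best[1] * hop_s)
```

Because the mixture is fitted separately for every query, the decision boundary between "moment" and "background" adapts to each similarity trace instead of relying on a global threshold.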
DCASE 2026 TASK 6: AUDIO MOMENT RETRIEVAL USING BACK-TRANSLATION AND TIME MASKING FOR DATA AUGMENTATION
Jun-Ting Lu, Chung-Che Wang, Syu-Siang Wang
Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
LU_YZU_task6_1 LU_YZU_task6_2
Abstract
This technical report describes our methods for Task 6 of the DCASE 2026 challenge: Audio Moment Retrieval from Long Audio. In this work, we build upon the official baseline without modifying its network architecture, and focus on data augmentation to improve the generalization ability of the model. Specifically, we investigate two augmentation methods: back-translation-based paraphrasing of the text queries using the OPUS-MT models, and a time mask applied to the CLAP embedding sequence on the audio side. Both augmentation methods improve the performance over the baseline, and the best configuration applies back-translation in both the pretraining and fine-tuning stages.
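A time mask on an embedding sequence can be sketched in a few lines. This is an illustrative version only: the maximum mask width, the use of zeros as the fill value, and the single-mask policy are assumptions, since the abstract does not state the authors' settings, and `time_mask` is a hypothetical name.

```python
import random


def time_mask(embeddings, max_width=10, rng=None):
    """Zero out one random contiguous run of frames in an embedding sequence.

    embeddings is a list of per-frame vectors (lists of floats). The input
    is left untouched; a masked copy is returned.
    """
    rng = rng or random.Random()
    n = len(embeddings)
    width = rng.randint(0, min(max_width, n))  # inclusive on both ends
    start = rng.randint(0, n - width)
    masked = [list(frame) for frame in embeddings]
    for t in range(start, start + width):
        masked[t] = [0.0] * len(masked[t])
    return masked
```

Masking after feature extraction keeps the augmentation cheap, because the frozen encoder does not need to be rerun for each augmented copy.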
LOCAL CONTINUITY SALIENCY DETR FOR LANGUAGE-BASED AUDIO MOMENT RETRIEVAL
Le Duc Minh1, Tran Nguyen Van Anh2
1Department of Computer Science, Vietnamese-German University, Binh Duong, Vietnam, 2Faculty of Mathematics and Computer Science, Ho Chi Minh City University of Science, Ho Chi Minh City, Vietnam
Minh_VGU_task6_1
Abstract
This paper proposes LCS-DETR, a Detection Transformer-based model for Language-Based Audio Moment Retrieval that addresses two critical limitations of standard temporal grounding: coarse temporal alignment and class imbalance. We introduce a saliency-guided framework with local temporal continuity modeling via depthwise-separable convolution and Focal Loss in the Hungarian matcher to suppress background dominance. Evaluated on the combined Clotho-Moment and CASTELLA [1] datasets, LCS-DETR achieves a mAP of 70.26% and R1@0.7 of 73.08%, representing a 4.8× and 4.4× improvement over the baseline DETR [2] respectively. Code and results are available at https://github.com/MinLee0210/DCASE_2026.git.
GATED MULTI-FEATURE FUSION FOR DCASE 2026 TASK 6
Kazushi Nakazawa
Advanced Media, Inc., Japan
Nakazawa_AM_task6_1 Nakazawa_AM_task6_2 Nakazawa_AM_task6_3 Nakazawa_AM_task6_4
Abstract
This report describes our submission to DCASE 2026 Challenge Task 6, Audio Moment Retrieval from Long Audio. The task is to retrieve the temporal segment in a long audio recording that best matches a natural-language query, and systems are ranked primarily by recall1@0.7. Our approach is a DETR-style audio moment retrieval model that combines frozen audio and text representations through a lightweight gated fusion layer. Although residual reranking and external audio-language verifier scores improved some validation runs, a final sweep on the CASTELLA development-testing split selected four non-reranked checkpoint variants for submission. The best submitted system combines MS-CLAP, LAION-CLAP, BEATs, EAT, M2D-CLAP PCA, and VAD audio features with MS-CLAP and LAION-CLAP text features. It achieved recall1@0.7 = 30.14, recall1@0.5 = 49.29, mAP = 24.27, and mAP@0.75 = 22.71 on CASTELLA development-testing. For the hidden evaluation set, whose query file contains 177 queries over 100 recordings, we generated and format-validated four output files.
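For readers unfamiliar with the ranking metric, the sketch below shows one way recall1@0.7 can be computed. It is not the official evaluation script: the function names are hypothetical, and the rule that a query counts as a hit when its top-ranked span reaches the IoU threshold with any of its ground-truth moments is an assumption.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at(top1_preds, ground_truths, iou_thr=0.7):
    """Percentage of queries whose top-1 span matches a ground-truth moment.

    top1_preds[i] is the top-ranked (start, end) span for query i, and
    ground_truths[i] is the list of annotated spans for that query.
    """
    if not top1_preds:
        return 0.0
    hits = sum(
        any(temporal_iou(pred, gt) >= iou_thr for gt in gts)
        for pred, gts in zip(top1_preds, ground_truths)
    )
    return 100.0 * hits / len(top1_preds)
```

With a threshold of 0.7, a prediction for a 10 s moment that is offset by 2 s on both boundaries has an IoU of 8/12 and is already a miss, which is why many of the systems listed here focus on boundary refinement.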
TIME COMPRESSION FOR AUDIO MOMENT RETRIEVAL WITH LARGE AUDIO LANGUAGE MODELS
Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu
Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Nishijima_UTokyo_task6_1 Nishijima_UTokyo_task6_2 Nishijima_UTokyo_task6_3 Nishijima_UTokyo_task6_4
Abstract
We study time compression (fast-forwarding) as a test-time adaptation that brings long-form audio into the short native context of a large audio language model (LALM), for the DCASE 2026 Task 6 problem of Audio Moment Retrieval (AMR) from long audio. Instead of redesigning the model for long inputs, we time-stretch each recording so that the entire clip fits into a single native input window. We then adapt the model with staged fine-tuning: it first learns temporal grounding, and a final stage fine-tunes it on the target data. Our best system is a three-stage fine-tuned Qwen2-Audio-7B. At inference, we compress each recording to at most 15 s, half of the model’s 30 s input limit, and rerank its predictions across several compression settings. On CASTELLA it reaches R1@0.7 (Recall@1 at IoU 0.7) of 27.01 on validation and 20.91 on test, and it exceeds a DETR-based baseline model that ingests the whole recording without any compression. We further show that pitch-preserving compression is essential and that compressing the whole clip beats sliding-window inference for short-context LALMs. The best compression amount tends to track the length of the model’s training clips, a trend that is significant on the validation split.
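The bookkeeping around time compression is simple arithmetic, sketched below under stated assumptions: a single uniform speed-up factor per recording, a 15 s target taken from the abstract, and hypothetical helper names. The actual pitch-preserving time stretch is left to an audio tool and is not shown.

```python
def compression_rate(duration_s, target_s=15.0):
    """Speed-up factor that squeezes a recording into target_s seconds.

    Recordings that are already short enough keep a rate of 1.0.
    """
    return max(1.0, duration_s / target_s)


def to_original_time(span_compressed, rate):
    """Map a (start, end) span predicted on compressed audio back to the
    original timeline by undoing the uniform speed-up."""
    start, end = span_compressed
    return (start * rate, end * rate)
```

For example, a 300 s recording needs a rate of 20, so a span predicted at 2.0 to 3.5 s in the compressed clip corresponds to 40 to 70 s in the original. The same factor also multiplies any timestamp error made by the model, which is one reason the choice of compression amount matters.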
ENHANCED AUDIO MOMENT RETRIEVAL APPROACH FOR DCASE 2026 TASK 6
Takumi Ogawa, Nobuyuki Ohhashi, Yota Kurisuno, Minami Takenaka, Ririko Miyamoto
Graduate School of Data Science, Yokohama City University, Yokohama, Japan
Ogawa_YCU_task6_1
Abstract
This technical report presents our solution to Task 6 of the DCASE 2026 Challenge, which focuses on retrieving specific moments within long audio recordings that correspond to a given textual query. We address this task by building on the official DETR-based baseline system provided by the DCASE 2026 Task 6 organizers and introducing three main components: M2D-CLAP audio features, a CG-DETR-based moment detection model enhanced with boundary modeling adapted from BAM-DETR, and a post-processing method for boundary refinement and candidate re-ranking. These components are designed to improve audio representation, temporal localization, and final moment selection. Our submitted system achieves an R1@0.7 of 41.13% on the development-testing set.
YCU SUBMISSION FOR DCASE 2026 CHALLENGE TASK 6
Haruto Sugawara, Tomoko Nakamura, Ken Sakaguchi, Hikari Aida, Shunta Yuri
Yokohama City University, Yokohama, Japan
Sugawara_YCU_task6_1 Sugawara_YCU_task6_2 Sugawara_YCU_task6_3 Sugawara_YCU_task6_4
Abstract
This report describes our submission to DCASE 2026 Task 6 (Audio Moment Retrieval, AMR). Our system extends UVCOM, pretrained on Clotho-Moment and fine-tuned on CASTELLA, with a set of improvement components investigated through ablation on the CASTELLA validation and test splits. Our primary system is a 19-member Weighted Box Fusion ensemble that combines models trained with Quality Focal Loss, delta-feature augmented inputs, CASTELLA-style audio augmentation, and two acoustic encoders (MS-CLAP and M2D-CLAP) at several temporal resolutions and learning-rate schedules. It achieves R1@0.7 = 39.57 % on the CASTELLA test split.
IMPROVING TEMPORAL BOUNDARY PRECISION IN AUDIO MOMENT RETRIEVAL
Ren Usui, Ryuta Fujimoto, Mikuri Kikuchi, Taichi Kitao, Tomohisa Suzuki
Graduate School of Data Science, Yokohama City University, Yokohama, Japan
Usui_YCU_task6_1 Usui_YCU_task6_2 Usui_YCU_task6_3 Usui_YCU_task6_4
Abstract
This technical report describes our system for Audio Moment Retrieval (AMR) from long audio in the DCASE 2026 challenge (Task 6). To improve boundary localization and retrieval accuracy, we build our system on the Unified Video Comprehension framework (UVCOM) and introduce four main modifications: M2D-CLAP-based audio and text feature extraction, Varifocal Loss for IoU-aware confidence estimation, random query sampling during pretraining, and inference-time boundary refinement based on saliency scores. Experimental results on the CASTELLA test dataset show that the proposed modules consistently improve retrieval performance, with the submitted systems substantially outperforming the official QD-DETR baseline across all evaluation metrics. The best configuration achieves a Recall@0.7 of 40.14, demonstrating the effectiveness of the integrated approach for robust AMR in real-world long audio recordings.
GISP@HEU’S SUBMISSION FOR DCASE 2026 TASK 6: FREQUENCY-AWARE CROSS-MODAL FUSION FOR AUDIO MOMENT RETRIEVAL
Feiyang Xiao1, Li'ang Luo1, Kejia Zhang1, Qiaoxi Zhu2, Guangjun He3, Pengming Feng3, Wenwu Wang4, Jian Guan1
1College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2University of Technology Sydney, Ultimo, Australia, 3State Key Laboratory of Space Information System and Integrated Application, Beijing, China, 4Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Xiao_HEU_task6_1 Xiao_HEU_task6_2 Xiao_HEU_task6_3 Xiao_HEU_task6_4
Abstract
This technical report presents GISP@HEU’s submission for DCASE 2026 Task 6 audio moment retrieval, which aims to retrieve corresponding segments in long audio recordings based on the content-semantic correlation between audio and text queries. In our submission, we describe four systems built upon the UVCOM framework, focusing on improving the cross-modal fusion process between audio and text features.
TEF-GUIDED AND QUALITY-AWARE DISTILLED DETR FOR LANGUAGE-BASED AUDIO MOMENT RETRIEVAL
Yutao Xu, Liu Yang, Weixi Zheng
School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
Xu_GZHU_task6_1
Abstract
This technical report describes our system for Task 6 (language-based audio moment retrieval) of the DCASE 2026 challenge. The system builds upon the official QD-DETR/AM-DETR baseline, which employs fixed CLAP audio and text features. The core design of the proposed system is a quality-aware distilled DETR framework for precise temporal grounding. It improves the baseline in three aspects: query-conditioned audio representation, temporal-position encoding with temporal endpoint features, and localization-quality-aware candidate scoring with prediction-level consensus. Specifically, it integrates text-guided audio refinement, focal and IoU-aware quality objectives, boundary auxiliary supervision, span evidence adaptation, teacher distillation, recall-balanced checkpoint averaging, and weighted box fusion oriented to R1@0.7. In addition, since R1@0.7 is the primary competition target, our model selection prioritizes precise top-1 localization rather than broad recall alone. Experimental results show that, on the CASTELLA test set, the proposed system achieves improvements of 6.76 in R1@0.5, 5.79 in R1@0.7, and 3.47 in average mAP over the official baseline.
TSEL: Temporal Semantic Evidence Learning for Language-Based Audio Moment Retrieval
Xiaokai Zhang, Xiang Shang
Department of Computer Science and Software Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China
Zhang_XJTLU_task6_1 Zhang_XJTLU_task6_2 Zhang_XJTLU_task6_3 Zhang_XJTLU_task6_4
Abstract
This technical report describes our submissions to DCASE 2026 Challenge Task 6, Audio Moment Retrieval from Long Audio. The task requires returning temporal windows in a long audio recording that match a natural-language query, with the official ranking emphasizing the top-ranked prediction at Recall1@0.7. Our system is based on Temporal Semantic Evidence Learning (TSEL): instead of directly regressing a single start/end pair, it first predicts query-conditioned temporal evidence and then decodes candidate windows. We package four systems: a conservative internal MS-CLAP evidence baseline (task6-1), a TSEL/SBEC evidence system with a learned evidence decoder (task6-2), a candidate-level evidence fusion system (task6-3), and a risk-aware evidence scoring system (task6-4). On the CASTELLA development-testing split, these systems obtain 16.11%, 21.97%, 23.42%, and 23.59% Recall1@0.7, respectively. TSEL-ECF is our practical main system; TSEL-RAES is retained as a risk-analysis variant. A clean five-seed validation rerun gives 29.83±0.45% Recall1@0.7 for ECF and 28.92±0.24% for RAES, with RAES producing fewer harmful interventions. The official evaluation set has no public ground truth, so no official hidden score is claimed here.
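The two-step idea of predicting per-frame evidence first and decoding windows afterwards can be illustrated with a minimal decoder. This sketch is not the learned evidence decoder described in the abstract: it assumes a fixed threshold, contiguous runs as candidate windows, and mean evidence as the window score, and `decode_windows` is a hypothetical name.

```python
def decode_windows(evidence, hop_s=1.0, threshold=0.5):
    """Turn per-frame evidence scores into ranked (start, end, score) windows.

    Each maximal run of frames with evidence >= threshold becomes one
    candidate window, scored by the mean evidence inside it.
    """
    windows = []
    start = None
    # The sentinel below the threshold closes a run that reaches the end.
    for i, e in enumerate(list(evidence) + [float("-inf")]):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            score = sum(evidence[start:i]) / (i - start)
            windows.append((start * hop_s, i * hop_s, score))
            start = None
    return sorted(windows, key=lambda w: w[2], reverse=True)
```

Decoding from an evidence curve yields several candidate windows per query, which is what makes a later candidate-level fusion or scoring stage possible.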