Task description
The goal of the acoustic scene classification task is to classify recordings into one of the ten predefined acoustic scene classes. This task continues the Acoustic Scene Classification tasks from previous editions of the DCASE Challenge, with a slight shift of focus. This year, the task concentrates on three challenging aspects: (1) a recording device mismatch, (2) low-complexity constraints, and (3) the limited availability of labeled data for training.
A more detailed task description can be found on the task description page.
Teams ranking
| Team | Submission label | Technical Report | Official rank | Accuracy (maximum among entries) |
| --- | --- | --- | --- | --- |
| Chang_HYU | mel | Han2025 | 5 | 58.98 |
| Chen_GXU | CD | Chen2025 | 9 | 56.63 |
| Han_CSU | KDTF-SepN | Han2025a | 13 | 32.58 |
| Jeong_SEOULTECH | DAFA-TE | Jeong2025 | 8 | 57.86 |
| Karasin_JKU | MALACH25_4 | Karasin2025 | 1 | 61.47 |
| Krishna_SRIB | SRIB-Team | Gurugubelli2025 | 10 | 56.06 |
| Li_NTU | S2 | Li2025 | 6 | 58.85 |
| Luo_CQUPT | DynaCP | Luo2025 | 3 | 59.58 |
| Ramezanee_SUT | Sharif | Ramezanee2025 | 7 | 57.92 |
| DCASE2025 baseline | Baseline | | 12 | 53.24 |
| Tan_SNTLNTU | SNTLNTU_T1_1 | Tan2025 | 2 | 59.94 |
| Zhang_AITHU-SJTU | Agp_c96_s1 | Zhang2025 | 4 | 59.28 |
| Zhou_XJTLU | Baseline | Ziyang2025 | 11 | 55.52 |
Systems ranking
| Submission label | Name | Technical Report | Official system rank | Memory rank | MACs rank | Accuracy with 95% CI (eval) | Accuracy, known devices (eval) | Accuracy, unknown devices (eval) | Logloss (eval) | Accuracy (dev) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | base | Han2025 | 17 | 7 | 4 | 58.1 (57.9 - 58.4) | 60.6 | 53.3 | 3.182 | 57.4 |
| Chang_HYU_task1_2 | mel | Han2025 | 10 | 7 | 10 | 59.0 (58.7 - 59.2) | 61.4 | 54.1 | 3.095 | 58.2 |
| Chang_HYU_task1_3 | hop | Han2025 | 15 | 7 | 13 | 58.7 (58.4 - 58.9) | 60.7 | 54.5 | 3.016 | 58.3 |
| Chang_HYU_task1_4 | hop_mel | Han2025 | 16 | 7 | 13 | 58.6 (58.4 - 58.8) | 60.8 | 54.2 | 3.210 | 57.6 |
| Chen_GXU_task1_1 | CD | Chen2025 | 21 | 3 | 11 | 56.6 (56.4 - 56.9) | 58.5 | 52.8 | 1.566 | 56.5 |
| Chen_GXU_task1_2 | CD | Chen2025 | 22 | 3 | 11 | 56.5 (56.3 - 56.8) | 58.4 | 52.9 | 1.411 | 57.7 |
| Chen_GXU_task1_3 | CD | Chen2025 | 26 | 3 | 11 | 55.3 (55.0 - 55.5) | 57.0 | 51.7 | 1.994 | 56.1 |
| Han_CSU_task1_1 | TF-SepN | Han2025a | 31 | 3 | 1 | 26.4 (26.2 - 26.7) | 26.6 | 26.2 | 2.796 | 55.0 |
| Han_CSU_task1_2 | KDTF-SepN | Han2025a | 32 | 3 | 1 | 25.1 (24.9 - 25.3) | 25.3 | 24.7 | 2.289 | 51.3 |
| Han_CSU_task1_3 | KDTF-SepN | Han2025a | 29 | 3 | 1 | 32.6 (32.3 - 32.8) | 33.2 | 31.4 | 1.879 | 54.1 |
| Han_CSU_task1_4 | KDTF-SepN | Han2025a | 30 | 3 | 1 | 30.9 (30.7 - 31.1) | 31.5 | 29.6 | 1.923 | 51.2 |
| Jeong_SEOULTECH_task1_1 | DAFA-TE | Jeong2025 | 20 | 3 | 5 | 56.9 (56.6 - 57.1) | 60.1 | 50.3 | 1.379 | 54.6 |
| Jeong_SEOULTECH_task1_2 | DAFA-TE | Jeong2025 | 19 | 3 | 5 | 57.9 (57.6 - 58.1) | 60.3 | 53.0 | 1.362 | 56.0 |
| Karasin_JKU_task1_1 | MALACH25_1 | Karasin2025 | 2 | 3 | 11 | 61.4 (61.1 - 61.6) | 64.0 | 56.2 | 1.105 | 60.5 |
| Karasin_JKU_task1_2 | MALACH25_2 | Karasin2025 | 4 | 3 | 11 | 60.1 (59.9 - 60.4) | 62.1 | 56.1 | 1.177 | 57.5 |
| Karasin_JKU_task1_3 | MALACH25_3 | Karasin2025 | 3 | 3 | 11 | 60.3 (60.0 - 60.5) | 62.8 | 55.2 | 1.155 | 59.0 |
| Karasin_JKU_task1_4 | MALACH25_4 | Karasin2025 | 1 | 3 | 11 | 61.5 (61.2 - 61.7) | 64.1 | 56.2 | 1.102 | 60.5 |
| Krishna_SRIB_task1_1 | SRIB-Team | Gurugubelli2025 | 23 | 4 | 6 | 56.1 (55.8 - 56.3) | 58.2 | 51.8 | 2.515 | 56.4 |
| Li_NTU_task1_1 | S1 | Li2025 | 14 | 3 | 3 | 58.7 (58.5 - 59.0) | 60.6 | 55.1 | 1.133 | 58.8 |
| Li_NTU_task1_2 | S2 | Li2025 | 12 | 3 | 3 | 58.8 (58.6 - 59.1) | 60.5 | 55.5 | 1.128 | 59.3 |
| Luo_CQUPT_task1_1 | DynaCP | Luo2025 | 6 | 5 | 8 | 59.6 (59.3 - 59.8) | 61.9 | 55.0 | 1.616 | 59.0 |
| Ramezanee_SUT_task1_1 | Sharif | Ramezanee2025 | 27 | 6 | 7 | 54.6 (54.3 - 54.8) | 54.8 | 54.1 | 1.318 | 58.2 |
| Ramezanee_SUT_task1_2 | SUT | Ramezanee2025 | 25 | 6 | 7 | 55.5 (55.3 - 55.7) | 56.2 | 54.1 | 1.262 | 58.2 |
| Ramezanee_SUT_task1_3 | Sharif | Ramezanee2025 | 18 | 6 | 7 | 57.9 (57.7 - 58.2) | 59.8 | 54.1 | 1.176 | 58.2 |
| DCASE2025 baseline | Baseline | | 28 | 3 | 11 | 53.2 (53.0 - 53.5) | 55.4 | 49.0 | 1.686 | 50.7 |
| Tan_SNTLNTU_task1_1 | SNTLNTU_T1_1 | Tan2025 | 5 | 1 | 2 | 59.9 (59.7 - 60.2) | 62.2 | 55.4 | 1.136 | 60.4 |
| Tan_SNTLNTU_task1_2 | SNTLNTU_T1_2 | Tan2025 | 9 | 2 | 2 | 59.0 (58.8 - 59.3) | 61.6 | 53.9 | 1.179 | 60.2 |
| Zhang_AITHU-SJTU_task1_1 | Agp_c64_s1 | Zhang2025 | 13 | 10 | 14 | 58.8 (58.5 - 59.0) | 60.7 | 54.8 | 1.118 | 59.0 |
| Zhang_AITHU-SJTU_task1_2 | Agp_c64_s2 | Zhang2025 | 11 | 10 | 14 | 58.9 (58.7 - 59.2) | 60.9 | 55.0 | 1.110 | 58.5 |
| Zhang_AITHU-SJTU_task1_3 | Agp_c96_s1 | Zhang2025 | 7 | 8 | 9 | 59.3 (59.0 - 59.5) | 60.8 | 56.2 | 1.108 | 59.0 |
| Zhang_AITHU-SJTU_task1_4 | Agp_c96_s2 | Zhang2025 | 8 | 8 | 9 | 59.3 (59.0 - 59.5) | 60.8 | 56.1 | 1.105 | 58.7 |
| Zhou_XJTLU_task1_1 | Baseline | Ziyang2025 | 24 | 9 | 12 | 55.5 (55.3 - 55.8) | 60.0 | 46.6 | 1.232 | 58.5 |
System complexity
| Submission label | Technical Report | System rank | Overall accuracy (eval) | Accuracy, known devices (eval) | Accuracy, unknown devices (eval) | Model size (bytes) | MACs | Parameters | Complexity management |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | 60.6 | 53.3 | 125156 | 18758084 | 62578 | precision_16, network design, knowledge distillation |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | 61.4 | 54.1 | 125156 | 29302844 | 62578 | precision_16, network design, knowledge distillation |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | 60.7 | 54.5 | 125156 | 29512940 | 62578 | precision_16, network design, knowledge distillation |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | 60.8 | 54.2 | 125156 | 29512940 | 62578 | precision_16, network design, knowledge distillation |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | 58.5 | 52.8 | 122296 | 29419156 | 61148 | knowledge distillation, precision_16 |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | 58.4 | 52.9 | 122296 | 29419156 | 61148 | knowledge distillation, precision_16 |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | 57.0 | 51.7 | 122296 | 29419156 | 61148 | knowledge distillation, precision_16 |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | 26.6 | 26.2 | 122296 | 298637 | 61148 | network design |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | 25.3 | 24.7 | 122296 | 298637 | 61148 | knowledge distillation |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | 33.2 | 31.4 | 122296 | 298637 | 61148 | knowledge distillation |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | 31.5 | 29.6 | 122296 | 298637 | 61148 | knowledge distillation |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | 60.1 | 50.3 | 122296 | 26059412 | 61148 | knowledge distillation |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | 60.3 | 53.0 | 122296 | 26059412 | 61148 | knowledge distillation |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | 64.0 | 56.2 | 122296 | 29419156 | 61148 | precision_16, network design, knowledge distillation |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | 62.1 | 56.1 | 122296 | 29419156 | 61148 | precision_16, network design, knowledge distillation |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | 62.8 | 55.2 | 122296 | 29419156 | 61148 | precision_16, network design, knowledge distillation |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | 64.1 | 56.2 | 122296 | 29419156 | 61148 | precision_16, network design, knowledge distillation |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | 58.2 | 51.8 | 122320 | 27862676 | 61160 | precision_16, network design |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | 60.6 | 55.1 | 122296 | 17050260 | 61160 | knowledge distillation, network design, precision_16 |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | 60.5 | 55.5 | 122296 | 17050260 | 61160 | knowledge distillation, network design |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | 61.9 | 55.0 | 123300 | 28938900 | 61650 | knowledge distillation, precision_16, network design |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | 54.8 | 54.1 | 125040 | 28642220 | 31260 | network design, knowledge distillation, pruning, reparametrization |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | 56.2 | 54.1 | 125040 | 28642220 | 31260 | network design, knowledge distillation, pruning, reparametrization |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | 59.8 | 54.1 | 125040 | 28642220 | 31260 | network design, knowledge distillation, pruning, reparametrization |
| DCASE2025 baseline | | 28 | 53.2 | 55.4 | 49.0 | 122296 | 29419156 | 61148 | precision_16, network design |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | 62.2 | 55.4 | 116342 | 10902300 | 116342 | precision_16, network design |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | 61.6 | 53.9 | 117210 | 10902300 | 117210 | precision_16, network design |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | 60.7 | 54.8 | 127496 | 29982132 | 63748 | precision_16, network design, knowledge distillation, pruning |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | 60.9 | 55.0 | 127496 | 29982132 | 63748 | precision_16, network design, knowledge distillation, pruning |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | 60.8 | 56.2 | 126430 | 29221122 | 63215 | precision_16, network design, knowledge distillation, pruning |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | 60.8 | 56.1 | 126430 | 29221122 | 63215 | precision_16, network design, knowledge distillation, pruning |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | 60.0 | 46.6 | 126858 | 29419648 | 126858 | network design, weight quantization, knowledge distillation |
Generalization performance
All results in this section are computed on the evaluation dataset.
Class-wise performance
| Submission label | Technical Report | System rank | Accuracy | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | 42.6 | 76.8 | 59.3 | 52.1 | 80.4 | 34.2 | 59.3 | 38.2 | 74.5 | 64.1 |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | 43.7 | 78.4 | 57.6 | 53.8 | 81.0 | 37.3 | 61.9 | 36.2 | 75.7 | 64.2 |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | 43.9 | 74.4 | 59.1 | 54.5 | 80.6 | 35.6 | 61.4 | 35.5 | 75.6 | 65.9 |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | 44.9 | 77.5 | 59.3 | 50.6 | 80.3 | 35.6 | 62.9 | 38.8 | 74.5 | 61.5 |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | 36.8 | 77.9 | 54.9 | 50.4 | 80.1 | 36.0 | 63.1 | 33.5 | 74.3 | 59.3 |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | 39.9 | 72.4 | 54.7 | 50.9 | 87.5 | 33.4 | 58.7 | 30.6 | 72.8 | 64.5 |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | 37.3 | 72.0 | 54.1 | 44.9 | 82.5 | 28.8 | 59.4 | 38.8 | 71.6 | 63.4 |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | 4.6 | 38.8 | 3.2 | 25.8 | 16.7 | 10.7 | 17.8 | 28.2 | 78.2 | 40.4 |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | 19.2 | 61.2 | 52.3 | 9.9 | 14.6 | 8.0 | 9.5 | 28.6 | 26.3 | 21.4 |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | 12.0 | 26.9 | 18.0 | 32.6 | 30.5 | 25.5 | 49.8 | 35.9 | 66.6 | 28.0 |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | 37.4 | 24.6 | 45.9 | 34.7 | 25.8 | 11.3 | 50.7 | 41.5 | 11.1 | 25.9 |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | 44.8 | 68.8 | 53.3 | 48.8 | 85.9 | 39.0 | 67.5 | 24.7 | 76.6 | 59.0 |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | 48.1 | 68.2 | 50.6 | 49.4 | 84.0 | 41.5 | 66.8 | 30.3 | 77.3 | 62.3 |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | 51.9 | 83.5 | 61.5 | 50.7 | 87.6 | 35.7 | 68.0 | 33.5 | 77.4 | 64.2 |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | 50.0 | 81.5 | 58.3 | 48.0 | 85.4 | 37.8 | 63.8 | 33.2 | 77.6 | 65.8 |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | 45.3 | 76.7 | 59.9 | 49.0 | 85.7 | 33.6 | 70.9 | 33.2 | 79.6 | 68.7 |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | 52.6 | 83.2 | 61.8 | 50.4 | 87.6 | 35.9 | 67.6 | 33.5 | 77.5 | 64.4 |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | 40.0 | 76.4 | 55.1 | 46.7 | 81.5 | 32.4 | 57.2 | 35.6 | 75.3 | 60.3 |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | 44.6 | 70.2 | 59.8 | 52.2 | 85.7 | 36.6 | 66.6 | 31.4 | 75.1 | 65.2 |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | 39.8 | 73.7 | 58.0 | 53.4 | 84.0 | 39.9 | 69.6 | 31.3 | 74.5 | 64.3 |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | 44.8 | 77.3 | 58.8 | 54.5 | 84.9 | 37.7 | 61.1 | 37.6 | 76.5 | 62.6 |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | 46.4 | 82.3 | 48.6 | 45.9 | 85.4 | 31.8 | 54.9 | 28.4 | 69.2 | 52.7 |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | 44.6 | 83.0 | 47.8 | 51.6 | 84.3 | 34.1 | 57.2 | 26.3 | 70.6 | 55.5 |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | 43.0 | 82.8 | 49.7 | 51.9 | 83.7 | 38.6 | 61.1 | 34.6 | 68.3 | 65.4 |
| DCASE2025 baseline | | 28 | 53.2 | 40.5 | 69.7 | 47.2 | 42.1 | 79.8 | 36.1 | 53.5 | 34.8 | 74.8 | 53.9 |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | 50.6 | 83.4 | 54.8 | 47.0 | 85.4 | 37.7 | 63.9 | 35.9 | 70.6 | 70.0 |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | 49.0 | 78.9 | 58.7 | 50.6 | 82.6 | 36.1 | 59.4 | 38.7 | 70.0 | 66.6 |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | 46.5 | 70.3 | 52.5 | 53.3 | 84.2 | 34.9 | 65.1 | 36.1 | 75.7 | 68.9 |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | 43.6 | 72.1 | 53.3 | 49.8 | 83.2 | 36.4 | 65.5 | 40.0 | 78.9 | 66.7 |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | 50.7 | 66.4 | 54.0 | 51.1 | 85.2 | 35.3 | 66.3 | 37.7 | 77.2 | 68.9 |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | 47.8 | 68.3 | 53.2 | 52.0 | 87.0 | 32.8 | 70.6 | 34.1 | 78.3 | 68.4 |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | 41.1 | 68.8 | 60.1 | 48.4 | 65.6 | 34.9 | 67.1 | 33.9 | 74.9 | 60.5 |
Device-wise performance
Unseen devices: D, S7, S8, S9, S10. Seen devices: A, B, C, S1, S2, S3.

| Submission label | Technical Report | System rank | Accuracy | Accuracy, unseen devices | Accuracy, seen devices | D | S7 | S8 | S9 | S10 | A | B | C | S1 | S2 | S3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | 53.3 | 60.6 | 44.3 | 57.9 | 57.3 | 52.4 | 54.6 | 67.3 | 60.9 | 61.7 | 57.1 | 56.8 | 59.5 |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | 54.1 | 61.4 | 46.8 | 58.5 | 56.1 | 53.9 | 55.3 | 67.0 | 61.9 | 62.2 | 57.3 | 59.1 | 61.1 |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | 54.5 | 60.7 | 48.1 | 58.8 | 57.7 | 54.0 | 53.8 | 67.3 | 60.4 | 61.8 | 57.2 | 57.8 | 59.9 |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | 54.2 | 60.8 | 46.3 | 58.2 | 57.6 | 53.2 | 55.5 | 67.2 | 60.0 | 61.8 | 58.0 | 57.8 | 60.0 |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | 52.8 | 58.5 | 48.8 | 57.7 | 56.9 | 46.6 | 54.2 | 67.9 | 60.6 | 61.5 | 53.5 | 50.9 | 56.8 |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | 52.9 | 58.4 | 46.4 | 57.0 | 55.6 | 49.3 | 56.1 | 68.2 | 60.6 | 61.1 | 53.2 | 50.1 | 57.1 |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | 51.7 | 57.0 | 48.4 | 54.6 | 56.6 | 46.5 | 52.6 | 67.8 | 59.3 | 59.1 | 52.4 | 49.8 | 53.9 |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | 26.2 | 26.6 | 27.0 | 27.7 | 25.5 | 24.1 | 26.5 | 29.5 | 29.4 | 27.1 | 25.6 | 22.4 | 25.5 |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | 24.7 | 25.3 | 26.6 | 24.6 | 24.6 | 23.4 | 24.5 | 28.2 | 27.2 | 26.7 | 22.3 | 22.8 | 24.7 |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | 31.4 | 33.2 | 36.5 | 31.6 | 29.3 | 29.4 | 30.3 | 41.1 | 34.1 | 38.9 | 30.4 | 26.1 | 28.3 |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | 29.6 | 31.5 | 34.2 | 28.9 | 27.2 | 29.3 | 28.5 | 39.7 | 33.8 | 38.1 | 27.5 | 23.1 | 26.8 |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | 50.3 | 60.1 | 39.0 | 54.2 | 53.6 | 53.1 | 51.3 | 68.8 | 60.0 | 63.0 | 54.5 | 55.5 | 59.2 |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | 53.0 | 60.3 | 44.7 | 57.2 | 54.2 | 54.8 | 54.2 | 68.7 | 60.5 | 63.7 | 55.4 | 54.6 | 58.8 |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | 56.2 | 64.0 | 47.5 | 60.9 | 58.0 | 58.3 | 56.1 | 70.6 | 64.6 | 65.4 | 59.7 | 59.8 | 63.9 |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | 56.1 | 62.1 | 49.6 | 59.3 | 56.9 | 58.0 | 56.9 | 70.0 | 62.0 | 63.1 | 58.1 | 57.7 | 61.7 |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | 55.2 | 62.8 | 45.3 | 59.6 | 55.5 | 58.2 | 57.1 | 69.1 | 62.3 | 64.8 | 59.6 | 58.2 | 63.0 |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | 56.2 | 64.1 | 47.5 | 60.9 | 58.0 | 58.3 | 56.1 | 70.6 | 64.6 | 65.7 | 60.2 | 59.8 | 63.9 |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | 51.8 | 58.2 | 45.0 | 56.0 | 54.9 | 48.6 | 54.4 | 66.8 | 59.4 | 61.8 | 52.4 | 52.0 | 56.9 |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | 55.1 | 60.6 | 51.3 | 58.6 | 58.6 | 52.2 | 54.7 | 67.0 | 59.7 | 62.2 | 57.5 | 56.8 | 60.3 |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | 55.5 | 60.5 | 53.0 | 58.6 | 58.2 | 52.7 | 54.7 | 66.4 | 59.9 | 62.1 | 57.9 | 56.7 | 60.2 |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | 55.0 | 61.9 | 42.1 | 59.7 | 60.1 | 56.4 | 56.7 | 70.7 | 62.1 | 64.7 | 57.3 | 56.7 | 59.6 |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | 54.1 | 54.8 | 41.4 | 59.6 | 58.0 | 56.1 | 55.3 | 64.2 | 53.9 | 53.0 | 52.5 | 50.4 | 54.9 |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | 54.1 | 56.2 | 41.4 | 59.6 | 58.0 | 56.1 | 55.3 | 63.7 | 54.8 | 55.4 | 54.4 | 52.3 | 56.8 |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | 54.1 | 59.8 | 41.4 | 59.6 | 58.0 | 56.1 | 55.3 | 65.6 | 59.2 | 58.8 | 57.3 | 58.4 | 59.8 |
| DCASE2025 baseline | | 28 | 53.2 | 49.0 | 55.4 | 47.5 | 51.6 | 48.8 | 45.3 | 51.7 | 64.8 | 57.2 | 59.9 | 48.9 | 48.7 | 52.7 |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | 55.4 | 62.2 | 49.3 | 58.6 | 59.2 | 52.2 | 57.5 | 67.8 | 61.3 | 64.5 | 59.9 | 59.2 | 60.7 |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | 53.9 | 61.6 | 44.6 | 58.7 | 59.2 | 50.0 | 56.9 | 67.7 | 60.6 | 63.4 | 59.5 | 57.4 | 61.1 |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | 54.8 | 60.7 | 46.7 | 58.6 | 57.4 | 55.2 | 56.4 | 69.2 | 60.0 | 62.7 | 56.6 | 55.0 | 61.0 |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | 55.0 | 60.9 | 48.7 | 58.5 | 57.5 | 53.5 | 56.8 | 69.6 | 60.8 | 62.8 | 55.9 | 55.7 | 60.8 |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | 56.2 | 60.8 | 51.7 | 58.9 | 57.4 | 55.6 | 57.2 | 69.3 | 59.8 | 62.4 | 56.5 | 55.9 | 61.0 |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | 56.1 | 60.8 | 52.2 | 59.0 | 58.4 | 54.8 | 56.0 | 69.4 | 60.8 | 62.6 | 55.9 | 56.0 | 60.4 |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | 46.6 | 60.0 | 49.3 | 51.5 | 50.0 | 42.7 | 39.7 | 67.5 | 60.7 | 60.5 | 55.3 | 57.7 | 58.1 |
Cities
| Submission label | Technical Report | System rank | Accuracy | Accuracy, unseen cities | Accuracy, seen cities |
| --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | 56.06 | 58.59 |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | 58.13 | 59.18 |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | 57.80 | 58.86 |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | 56.79 | 58.98 |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | 54.71 | 57.05 |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | 54.38 | 57.01 |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | 53.89 | 55.59 |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | 28.12 | 26.11 |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | 25.57 | 25.02 |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | 33.91 | 32.31 |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | 32.42 | 30.57 |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | 56.03 | 57.06 |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | 57.03 | 58.07 |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | 61.95 | 61.30 |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | 60.10 | 60.17 |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | 60.95 | 60.16 |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | 62.06 | 61.38 |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | 54.69 | 56.37 |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | 57.92 | 58.94 |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | 58.52 | 58.94 |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | 57.91 | 59.96 |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | 53.90 | 54.74 |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | 54.72 | 55.69 |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | 56.96 | 58.15 |
| DCASE2025 baseline | | 28 | 53.2 | 52.95 | 53.33 |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | 58.45 | 60.28 |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | 58.85 | 59.11 |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | 57.87 | 58.98 |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | 58.39 | 59.09 |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | 58.32 | 59.51 |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | 58.18 | 59.51 |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | 55.60 | 55.53 |
System characteristics
General characteristics
| Submission label | Technical Report | Rank | Accuracy | Sampling rate | Data augmentation | Features |
| --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | 32kHz | freq-mixstyle, frequency masking, time rolling, DIR | log-mel energies |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | 32kHz | freq-mixstyle, frequency masking, time rolling, DIR | log-mel energies |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | 32kHz | freq-mixstyle, frequency masking, time rolling, DIR | log-mel energies |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | 32kHz | freq-mixstyle, frequency masking, time rolling, DIR | log-mel energies |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | 32kHz | freq-mixstyle, time rolling | log-mel energies |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | 32kHz | freq-mixstyle, time rolling | log-mel energies |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | 32kHz | freq-mixstyle, time rolling | log-mel energies |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | 44.1kHz | MixUp, MixStyle, SpecAug, FiltAug, AddNoise, FrameShift, TimeMask, FreqMask, DIR | log-mel spectrogram |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | 44.1kHz | MixUp, MixStyle, SpecAug, FiltAug, AddNoise, FrameShift, TimeMask, FreqMask, DIR | log-mel spectrogram |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | 44.1kHz | MixUp, MixStyle, SpecAug, FiltAug, AddNoise, FrameShift, TimeMask, FreqMask, DIR | log-mel spectrogram |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | 44.1kHz | MixUp, MixStyle, SpecAug, FiltAug, AddNoise, FrameShift, TimeMask, FreqMask, DIR | log-mel spectrogram |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | 44.1kHz | freq-mixstyle, mixup | log-mel energies |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | 44.1kHz | freq-mixstyle, mixup | log-mel energies |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | 32kHz | freq-mixstyle, DIR, time masking, frequency masking, time rolling | log-mel energies |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | 32kHz | freq-mixstyle, DIR, time masking, frequency masking, time rolling | log-mel energies |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | 32kHz | freq-mixstyle, DIR, time masking, frequency masking, time rolling | log-mel energies |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | 32kHz | freq-mixstyle, DIR, time masking, frequency masking, time rolling | log-mel energies |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | 32kHz | freq-mixstyle, frequency masking, time rolling | log-mel energies |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | 32kHz | freq-mixstyle, time rolling, DIR | log-mel energies |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | 32kHz | freq-mixstyle, time rolling, DIR | log-mel energies |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | 44.1kHz | freq-mixstyle, pitch shifting, time rolling | log-mel energies |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | 32kHz | freq-mixstyle, frequency masking, time masking, random noise, random gain, DIR | log-mel energies |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | 32kHz | freq-mixstyle, frequency masking, time masking, random noise, random gain, DIR | log-mel energies |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | 32kHz | freq-mixstyle, frequency masking, time masking, random noise, random gain, DIR | log-mel energies |
| DCASE2025 baseline | | 28 | 53.2 | 32kHz | freq-mixstyle, pitch shifting, time rolling | log-mel energies |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | 44.1kHz | freq-mixstyle, DIR, SpecAug | log-mel energies |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | 44.1kHz | freq-mixstyle, DIR, SpecAug | log-mel energies |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | 32kHz | freq-mixstyle, frequency masking, time masking, time rolling | log-mel energies |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | 32kHz | freq-mixstyle, frequency masking, time masking, time rolling | log-mel energies |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | 32kHz | freq-mixstyle, frequency masking, time masking, time rolling | log-mel energies |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | 32kHz | freq-mixstyle, frequency masking, time masking, time rolling | log-mel energies |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | 32kHz | mixup, freq-mixstyle, DIR | log-mel spectrogram |
Machine learning characteristics
| Submission label | Technical Report | Rank | Accuracy | External data usage | External data sources | Model complexity (parameters) | Model MACs | Classifier | Framework | Pipeline | Device information | Number of models | Model weight sharing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chang_HYU_task1_1 | Han2025 | 17 | 58.1 | pre-trained model | MicIRP, PaSST | 62578 | 18758084 | RF-regularized CNN, CTFAttention | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Chang_HYU_task1_2 | Han2025 | 10 | 59.0 | pre-trained model | MicIRP, PaSST | 62578 | 29302844 | RF-regularized CNN, CTFAttention | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Chang_HYU_task1_3 | Han2025 | 15 | 58.7 | pre-trained model | MicIRP, PaSST | 62578 | 29512940 | RF-regularized CNN, CTFAttention | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Chang_HYU_task1_4 | Han2025 | 16 | 58.6 | pre-trained model | MicIRP, PaSST | 62578 | 29512940 | RF-regularized CNN, CTFAttention | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Chen_GXU_task1_1 | Chen2025 | 21 | 56.6 | pre-trained model | AudioSet | 61148 | 29419156 | CP-Mobile | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation | | 1 | fully device-specific |
| Chen_GXU_task1_2 | Chen2025 | 22 | 56.5 | pre-trained model | | 61148 | 29419156 | CP-Mobile | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation | | 1 | fully device-specific |
| Chen_GXU_task1_3 | Chen2025 | 26 | 55.3 | pre-trained model | AudioSet | 61148 | 29419156 | CP-Mobile | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation | | 1 | fully device-specific |
| Han_CSU_task1_1 | Han2025a | 31 | 26.4 | None | | 61148 | 298637 | CNN (SepNet) | pytorch | data augmentation, train baseline model | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Han_CSU_task1_2 | Han2025a | 32 | 25.1 | pre-trained model | BEATs | 61148 | 298637 | CNN | pytorch | train transformer teachers, knowledge distillation to SepNet student, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Han_CSU_task1_3 | Han2025a | 29 | 32.6 | pre-trained model | BEATs, EfficientAT | 61148 | 298637 | CNN | pytorch | train transformer teachers, knowledge distillation to SepNet student, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Han_CSU_task1_4 | Han2025a | 30 | 30.9 | pre-trained model | BEATs, EfficientAT | 61148 | 298637 | CNN | pytorch | train transformer teachers, knowledge distillation to SepNet student, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Jeong_SEOULTECH_task1_1 | Jeong2025 | 20 | 56.9 | pre-trained model | | 61148 | 26059412 | CNN, Transformer | pytorch | train general teacher model, ensemble teachers, device-specific fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Jeong_SEOULTECH_task1_2 | Jeong2025 | 19 | 57.9 | pre-trained model | | 61148 | 26059412 | CNN, Transformer | pytorch | train general teacher model, ensemble teachers, device-specific fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Karasin_JKU_task1_1 | Karasin2025 | 2 | 61.4 | dataset, pre-trained model | PretrainedSED, MicIRP, CochlScene | 61148 | 29419156 | RF-regularized CNN | pytorch | pretrain CP-ResNet teacher on CochlScene, train teachers (CP-ResNet; BEATs) on TAU22, device-specific end-to-end fine-tuning of the CP-ResNet teacher, pretrain student model on CochlScene, train general student model with knowledge distillation, device-specific end-to-end fine-tuning with knowledge distillation | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Karasin_JKU_task1_2 | Karasin2025 | 4 | 60.1 | dataset, pre-trained model | PretrainedSED, MicIRP, CochlScene | 61148 | 29419156 | RF-regularized CNN | pytorch | pretrain CP-ResNet teacher on CochlScene, train teachers (CP-ResNet; BEATs) on TAU22, device-specific end-to-end fine-tuning of the CP-ResNet teacher, pretrain student model on CochlScene, train general student model with knowledge distillation, device-specific end-to-end fine-tuning with knowledge distillation | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Karasin_JKU_task1_3 | Karasin2025 | 3 | 60.3 | dataset, pre-trained model | MicIRP, CochlScene, PaSST | 61148 | 29419156 | RF-regularized CNN | pytorch | pretrain CP-ResNet teacher on CochlScene, train teachers (CP-ResNet; PaSST) on TAU22, device-specific end-to-end fine-tuning of the CP-ResNet teacher, pretrain student model on CochlScene, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Karasin_JKU_task1_4 | Karasin2025 | 1 | 61.5 | dataset, pre-trained model | PretrainedSED, MicIRP, CochlScene | 61148 | 29419156 | RF-regularized CNN | pytorch | pretrain CP-ResNet teacher on CochlScene, train teachers (CP-ResNet; BEATs) on TAU22, device-specific end-to-end fine-tuning of the CP-ResNet teacher, pretrain student model on CochlScene, train general student model with knowledge distillation, device-specific end-to-end fine-tuning for device s1, device-specific end-to-end fine-tuning with knowledge distillation for the rest of the devices | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Krishna_SRIB_task1_1 | Gurugubelli2025 | 23 | 56.1 | | | 61160 | 27862676 | RF-regularized CNN | pytorch | train general model, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Li_NTU_task1_1 | Li2025 | 14 | 58.7 | dataset, micIRP, pre-trained model, PaSST | | 61160 | 17050260 | RF-regularized CNN | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation (both stage-wise and output-wise), model soup, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning, device-IR augmentation | 1 | fully shared |
| Li_NTU_task1_2 | Li2025 | 12 | 58.8 | dataset, micIRP, pre-trained model, PaSST | MicIRP | 61160 | 17050260 | RF-regularized CNN | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation (both stage-wise and output-wise), model soup, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning, DIR | 1 | fully shared |
| Luo_CQUPT_task1_1 | Luo2025 | 6 | 59.6 | pre-trained model | | 61650 | 28938900 | RF-regularized CNN | pytorch | train teachers, ensemble teachers, train general student model with knowledge distillation, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Ramezanee_SUT_task1_1 | Ramezanee2025 | 27 | 54.6 | dataset | MicIRP | 31260 | 28642220 | CNN | pytorch | train teachers, ensemble teachers, train general model, device-specific end-to-end fine-tuning, train student models with knowledge distillation | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Ramezanee_SUT_task1_2 | Ramezanee2025 | 25 | 55.5 | dataset | MicIRP | 31260 | 28642220 | CNN | pytorch | train teachers, ensemble teachers, train general model, device-specific end-to-end fine-tuning, train student models with knowledge distillation | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Ramezanee_SUT_task1_3 | Ramezanee2025 | 18 | 57.9 | dataset | MicIRP | 31260 | 28642220 | CNN | pytorch | train teachers, ensemble teachers, train general model, device-specific end-to-end fine-tuning, train student models with knowledge distillation | per-device end-to-end fine-tuning | 7 | fully device-specific |
| DCASE2025 baseline | | 28 | 53.2 | | | 61148 | 29419156 | RF-regularized CNN | pytorch | training | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Tan_SNTLNTU_task1_1 | Tan2025 | 5 | 59.9 | | | 116342 | 10902300 | GRU-CNN | pytorch | train general model, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Tan_SNTLNTU_task1_2 | Tan2025 | 9 | 59.0 | | | 117210 | 10902300 | GRU-CNN | pytorch | train general model, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
| Zhang_AITHU-SJTU_task1_1 | Zhang2025 | 13 | 58.8 | pre-trained model | EfficientAT | 63748 | 29982132 | CNN | pytorch | train teachers, ensemble teachers, train student using knowledge distillation, pruning | | 1 | fully shared |
| Zhang_AITHU-SJTU_task1_2 | Zhang2025 | 11 | 58.9 | pre-trained model | EfficientAT | 63748 | 29982132 | CNN | pytorch | train teachers, ensemble teachers, train student using knowledge distillation, pruning | | 1 | fully shared |
| Zhang_AITHU-SJTU_task1_3 | Zhang2025 | 7 | 59.3 | pre-trained model | EfficientAT | 63215 | 29221122 | CNN | pytorch | train teachers, ensemble teachers, train student using knowledge distillation, pruning | | 1 | fully shared |
| Zhang_AITHU-SJTU_task1_4 | Zhang2025 | 8 | 59.3 | pre-trained model | EfficientAT | 63215 | 29221122 | CNN | pytorch | train teachers, ensemble teachers, train student using knowledge distillation, pruning | | 1 | fully shared |
| Zhou_XJTLU_task1_1 | Ziyang2025 | 24 | 55.5 | dataset, embeddings, pre-trained model | AudioSet_balanced | 126858 | 29419648 | CNN (TF-SepNet) | pytorch_lightning | train general model, device-specific end-to-end fine-tuning | per-device end-to-end fine-tuning | 7 | fully device-specific |
Technical reports
McCi Submission to DCASE 2025: Training Low-Complexity Acoustic Scene Classification System with Knowledge Distillation and Curriculum
Xuanyan Chen and Wei Xie
School of Computer, Electronics and Information, Guangxi University, Guangxi, China
Chen_GXU_task1_1 Chen_GXU_task1_2 Chen_GXU_task1_3
Abstract
Task 1 of DCASE 2025 focuses on several aspects of acoustic scene classification (ASC), including recording-device mismatch, low-complexity constraints, data efficiency, and the development of recording-device-specific models. This technical report describes the system we submitted. We first trained several teacher models on the ASC dataset using self-distillation and curriculum learning; these teacher models included a model pre-trained on AudioSet. We then distilled the knowledge from the teacher models into the student model via curriculum learning. We used the same inference model (i.e., the student model) and data augmentation settings as the baseline system. In experiments, our best system achieved an accuracy of 57.66%.
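The distillation objective at the core of this and several other submissions can be summarized in a few lines. Below is a minimal sketch assuming the usual temperature-softened KL formulation; the temperature and mixing weight are illustrative, not the authors' values.

```python
# Hedged sketch of temperature-scaled knowledge distillation from an
# (ensemble) teacher to a compact student. T and alpha are assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL divergence to softened teacher outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * T * T  # rescale gradients by T^2, following standard practice
    return alpha * ce + (1.0 - alpha) * kl
```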
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, time rolling |
| Features | log-mel energies |
| Classifier | CP-Mobile |
| Complexity management | knowledge distillation, precision_16 |
| Number of models at inference | 1 |
| Model weight sharing | fully device-specific |
SRIB Submission for DCASE 2025 Challenge Task-1: Low-Complexity Acoustic Scene Classification with Device Information
Krishna Gurugubelli, Ravi Solanki, Sujith Viswanathan, Madhu Rayappa Kamble, Aditi Deo, Abhinandan Udupa, Ramya Viswanathan and Rajesh Krishna K S
Audio AI Team, Samsung R&D Institute India-Bangalore, Bangalore, India
Krishna_SRIB_task1_1
Abstract
This report details our submission for Task 1: Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 challenge [1]. Our method builds upon the leading system from the DCASE 2023 competition; specifically, we explore the CP-Mobile architecture. To improve generalization across devices, we incorporate several data augmentation strategies, including Freq-MixStyle, frequency masking, and time rolling. To meet the model complexity requirements of the competition, we evaluate the model with 16-bit precision and use mixed-precision training to achieve better performance at inference with the 16-bit model. Our results show significant improvements in test accuracy over the baseline, confirming the effectiveness of our approach across all subsets.
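Freq-MixStyle, the first augmentation listed, normalizes per-frequency statistics of the log-mel input and remixes them between samples so the model sees device-like spectral shifts during training. A minimal sketch of the idea; the application probability and Beta parameter below are assumptions.

```python
# Hedged sketch of Freq-MixStyle-style augmentation on log-mel batches.
import torch

def freq_mixstyle(x, alpha=0.3, p=0.7):
    """x: (batch, channels, freq, time). Mixes per-frequency statistics
    between random pairs of samples in the batch."""
    if torch.rand(1).item() > p or x.size(0) < 2:
        return x
    B = x.size(0)
    mu = x.mean(dim=3, keepdim=True)            # per-bin mean over time
    sig = x.std(dim=3, keepdim=True) + 1e-6     # per-bin std over time
    x_norm = (x - mu) / sig
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1, 1, 1)).to(x.device)
    perm = torch.randperm(B, device=x.device)   # random pairing within batch
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```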
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, frequency masking, time rolling |
| Features | log-mel energies |
| Classifier | RF-regularized CNN |
| Complexity management | precision_16, network design |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
HYU Submission for DCASE 2025 Task 1: Low-Complexity Acoustic Scene Classification Using Reparameterizable CNN with Channel-Time-Frequency Attention
Seung-Gyu Han1, Pil Moo Byun2 and Joon-Hyuk Chang1,2
1Artificial Intelligence Semiconductor Engineering, Hanyang University, Seoul, Republic of Korea, 2Artificial Intelligence, Hanyang University, Seoul, Republic of Korea
Chang_HYU_task1_1 Chang_HYU_task1_2 Chang_HYU_task1_3 Chang_HYU_task1_4
Abstract
This paper presents the Hanyang University team’s submission for the DCASE 2025 Challenge Task 1: Low-Complexity Acoustic Scene Classification with Device Information. The task focuses on developing compact and efficient models that generalize well across both seen and unseen recording devices, under strict constraints on model size and computational cost. To address these challenges, we propose Rep-CTFA, a lightweight convolutional neural network that integrates two key design elements: (1) reparameterizable convolutional blocks with learnable branch scaling coefficients, and (2) a Channel-Time-Frequency Attention (CTFA) module. In addition, we explore input resolution variation by adjusting the hop length and number of mel bins to control time-frequency granularity. Knowledge distillation from a PaSST-based teacher ensemble is used to guide the training of the student model, improving generalization. Finally, we adopt a device-aware fine-tuning scheme that updates lightweight classification heads per device while keeping the shared backbone intact.
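The input-resolution variation described above comes down to the STFT hop length and the number of mel bins of the log-mel frontend. A hedged sketch using torchaudio; all parameter values are illustrative, not the submitted configurations.

```python
# Sketch of a configurable log-mel frontend: a smaller hop length gives finer
# time resolution and more mel bins give finer frequency resolution, at the
# cost of more downstream MACs. Values here are assumptions for illustration.
import torch
import torchaudio

def make_frontend(sample_rate=32000, n_fft=1024, hop_length=500, n_mels=256):
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels,
    )
    db = torchaudio.transforms.AmplitudeToDB()
    # waveform (batch, samples) -> log-mel (batch, n_mels, frames)
    return torch.nn.Sequential(mel, db)
```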
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, frequency masking, time rolling, DIR |
| Features | log-mel energies |
| Classifier | RF-regularized CNN, CTFAttention |
| Complexity management | precision_16, network design, knowledge distillation |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Confidence-Aware Ensemble Knowledge Distillation for Low-Complexity Acoustic Scene Classification
Sarang Han1, Dong Ho Lee2, Min Sik Jo1, Eun Seo Ha1, Min Ju Chae1 and Geon Woo Lee1
1Intelligence Speech and Processing Language, ChoSun University (CSU) Gwangju, Gwangju, South Korea, 2Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria
Han_CSU_task1_1 Han_CSU_task1_2 Han_CSU_task1_3 Han_CSU_task1_4
Abstract
We propose a confidence-aware ensemble knowledge distillation method for acoustic scene classification under low-complexity and limited-data settings. Our approach utilizes heterogeneous teacher models (BEATs and EfficientAT), fine-tuned on the DCASE 2025 Task 1 dataset, to guide the training of a lightweight student model, TF-SepNet. To improve over naive ensemble distillation, we introduce a confidence-weighted strategy that emphasizes reliable teacher outputs. Experimental results show improved generalization on unseen devices and domains, outperforming single-teacher and uniform-ensemble baselines.
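One plausible reading of the confidence-weighted strategy is sketched below: each teacher's softened output is weighted per example by its own prediction confidence, taken here as the maximum softmax probability. The report does not spell out the exact weighting rule, so treat this as an assumption.

```python
# Hedged sketch of confidence-weighted ensemble soft targets for distillation.
import torch
import torch.nn.functional as F

def confidence_weighted_targets(teacher_logits_list, T=2.0):
    """teacher_logits_list: list of (batch, classes) logits from the teachers."""
    probs = [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    # per-example confidence of each teacher: max softmax probability
    conf = torch.stack([p.max(dim=-1).values for p in probs], dim=0)  # (K, B)
    w = conf / conf.sum(dim=0, keepdim=True)                          # normalize over teachers
    mixed = sum(w_k.unsqueeze(-1) * p_k for w_k, p_k in zip(w, probs))
    return mixed  # (batch, classes) soft target distribution
```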
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 44.1kHz |
| Data augmentation | MixUp, MixStyle, SpecAug, FiltAug, AddNoise, FrameShift, TimeMask, FreqMask, DIR |
| Features | log-mel spectrogram |
| Classifier | CNN (SepNet); CNN |
| Complexity management | network design; knowledge distillation |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Adaptive Knowledge Distillation Using a Device-Aware Teacher for Low-Complexity Acoustic Scene Classification
Seunggyu Jeong and Seongeon Kim
Department of Artificial Intelligence, Seoul National University of Science and Technology, Seoul, South Korea
Jeong_SEOULTECH_task1_1 Jeong_SEOULTECH_task1_2
Abstract
In this technical report, we describe our submission for Task 1, Low-Complexity Device-Robust Acoustic Scene Classification, of the DCASE 2025 Challenge. Our work tackles the dual challenges of strict complexity constraints and robust generalization to both seen and unseen devices, while also leveraging the new rule allowing the use of device labels at test time. Our proposed system is based on a knowledge distillation framework in which an efficient CP-MobileNet student learns from a compact, specialized two-teacher ensemble. This ensemble combines a baseline PaSST teacher, trained with standard cross-entropy, and a "generalization expert" teacher. The expert is trained using our novel Device-Aware Feature Alignment (DAFA) loss, adapted from prior work, which explicitly structures the feature space for device robustness. To capitalize on the availability of test-time device labels, the distilled student model then undergoes a final device-specific fine-tuning stage. Our proposed system achieves a final accuracy of 57.93% on the development set, a significant improvement over the official baseline, particularly on unseen devices.
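The final device-specific fine-tuning stage, shared by many systems in this task, can be sketched as cloning the distilled student once per known device and fine-tuning each copy only on that device's recordings. The function names, epoch count, and learning rate below are placeholders, not the authors' settings.

```python
# Hedged sketch of per-device end-to-end fine-tuning of a distilled student.
import copy
import torch
import torch.nn.functional as F

def finetune_per_device(student, loaders_by_device, epochs=5, lr=1e-4):
    """loaders_by_device: dict mapping device label (e.g. "a", "s1") to a
    DataLoader of that device's labeled clips. Returns one model per device."""
    device_models = {}
    for dev, loader in loaders_by_device.items():
        model = copy.deepcopy(student)          # start from the general student
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
        device_models[dev] = model
    return device_models  # at test time, pick the model matching the device label
```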
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 44.1kHz |
| Data augmentation | freq-mixstyle, mixup |
| Features | log-mel energies |
| Classifier | CNN, Transformer |
| Complexity management | knowledge distillation |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Domain-Specific External Data Pre-Training and Device-Aware Distillation for Data-Efficient Acoustic Scene Classification
Dominik Karasin, Ioan-Cristian Olariu, Michael Schöpf and Anna Szymańska
Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria
Karasin_JKU_task1_1 Karasin_JKU_task1_2 Karasin_JKU_task1_3 Karasin_JKU_task1_4
Abstract
In this technical report, we present our submission to the DCASE 2025 Challenge Task 1: Low-Complexity Acoustic Scene Classification with Device Information. Our approach centers on a compact CP-Mobile student model distilled via Bayesian ensemble averaging from different combinations of three teacher architectures: CP-ResNet, BEATs, and PaSST, using AudioSet-pretrained checkpoints for the last two. We then fine-tune the student on each recording device to improve per-device classification accuracy. To compensate for the limited 25% train split, we pre-train both teacher and student on CochlScene and apply data augmentation, of which Device Impulse Response augmentation was particularly effective.
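Device Impulse Response (DIR) augmentation, highlighted above as particularly effective, convolves a training waveform with a microphone impulse response (for example from MicIRP) to simulate recording on a different device. A minimal sketch; the application probability is an assumption.

```python
# Hedged sketch of DIR augmentation: time-domain convolution of a waveform
# batch with a single device impulse response.
import torch
import torch.nn.functional as F

def apply_dir(waveform, ir, p=0.4):
    """waveform: (batch, samples); ir: (ir_len,) device impulse response."""
    if torch.rand(1).item() > p:
        return waveform
    kernel = ir.flip(0).view(1, 1, -1)   # flip: conv1d computes correlation
    x = waveform.unsqueeze(1)            # (batch, 1, samples)
    y = F.conv1d(x, kernel, padding=ir.numel() - 1)
    return y[..., : waveform.size(1)].squeeze(1)  # crop back to input length
```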
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, DIR, time masking, frequency masking, time rolling |
| Features | log-mel energies |
| Classifier | RF-regularized CNN |
| Complexity management | precision_16, network design, knowledge distillation |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Joint Feature and Output Distillation for Low-Complexity Acoustic Scene Classification
Haowen Li1, Ziyi Yang1, Mou Wang2, Ee-Leng Tan1, Junwei Yeow1, Santi Peksi1 and Woon-Seng Gan1
1Smart Nation TRANS Lab, Nanyang Technological University, Singapore, 2Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Li_NTU_task1_1 Li_NTU_task1_2
Abstract
This report presents a dual-level knowledge distillation framework with multi-teacher guidance for low-complexity acoustic scene classification (ASC) in DCASE2025 Task 1. We propose a distillation strategy that jointly transfers both soft logits and intermediate feature representations. Specifically, we pre-trained PaSST and CP-ResNet models as teacher models. Logits from teachers are averaged to generate soft targets, while one CP-ResNet is selected for feature-level distillation. This enables the compact student model (CP-Mobile) to capture both semantic distribution and structural information from teacher guidance. Experiments on the TAU Urban Acoustic Scenes 2022 Mobile dataset (development set) demonstrate that our submitted systems achieve up to 59.30% accuracy.
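A hedged sketch of the dual-level objective described above: a KL term against the averaged teacher logits plus a feature-matching term against the selected CP-ResNet teacher. The projection layer, MSE feature loss, and loss weights are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch of joint output-level and feature-level distillation.
import torch
import torch.nn.functional as F

def joint_kd_loss(s_logits, s_feat, t_logits_avg, t_feat, proj, labels,
                  T=2.0, w_kl=1.0, w_feat=0.1):
    """proj: small module mapping student feature dims to teacher feature dims."""
    ce = F.cross_entropy(s_logits, labels)
    kl = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits_avg / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = F.mse_loss(proj(s_feat), t_feat)   # intermediate feature matching
    return ce + w_kl * kl + w_feat * feat
```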
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, time rolling, DIR |
| Features | log-mel energies |
| Classifier | RF-regularized CNN |
| Complexity management | knowledge distillation, network design, precision_16; knowledge distillation, network design |
| Device information | per-device end-to-end fine-tuning, device-IR augmentation; per-device end-to-end fine-tuning, DIR |
| Number of models at inference | 1 |
| Model weight sharing | fully shared |
DynaCP: Dynamic Parallel Selective Convolution in CP-Mobile under Multi-Teacher Distillation for Acoustic Scene Classification
Yuandong Luo1, Hongqing Liu1, Liming Shi2 and Lu Gan3
1Chongqing Key Lab of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing, China, 2School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China, 3College of Engineering, Design and Physical Science, Brunel University, London, U.K.
Luo_CQUPT_task1_1
Abstract
This report introduces the acoustic scene classification (ASC) architecture submitted by the Chongqing University of Posts and Telecommunications Audio Lab (CQUPT-AUL) for DCASE 2025 Task 1. The architecture is a lightweight and efficient network, termed DynaCP. Built upon CP-Mobile, DynaCP dynamically selects between dilated convolutions with pooling and depth-wise convolutions with pooling at different network layers, enhancing multi-scale feature representation with minimal computational overhead while alleviating the information sparsity caused by dilated convolutions. To improve classification accuracy, a multi-teacher knowledge distillation approach is employed using pre-trained DYMN and MN models. Experimental results demonstrate that DynaCP achieves competitive performance while maintaining low computational complexity.
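A hedged sketch of the parallel selective idea: a dilated depthwise branch and a plain depthwise branch run side by side and are blended by a learned gate. The sigmoid gating below is an assumption; the report does not give the exact selection mechanism or where pooling is applied.

```python
# Hedged sketch of a dynamic parallel branch mixing dilated and plain
# depthwise convolutions with a learned scalar gate.
import torch
import torch.nn as nn

class DynaBranch(nn.Module):
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.dilated = nn.Conv2d(channels, channels, 3, padding=dilation,
                                 dilation=dilation, groups=channels)
        self.plain = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.gate = nn.Parameter(torch.zeros(1))  # learned selection weight

    def forward(self, x):
        g = torch.sigmoid(self.gate)              # soft choice between branches
        return g * self.dilated(x) + (1 - g) * self.plain(x)
```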
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 44.1kHz |
| Data augmentation | freq-mixstyle, pitch shifting, time rolling |
| Features | log-mel energies |
| Classifier | RF-regularized CNN |
| Complexity management | knowledge distillation, precision_16, network design |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Acoustic Scene Classification with Knowledge Distillation and Device-Specific Fine-Tuning for DCASE 2025
Mohamad Mahdee Ramezanee, Hossein Sharify, Amir Mohamad Mehrani Kia and Behnam Raoufi
Electrical Engineering, Sharif University of Technology, Tehran, Iran
Ramezanee_SUT_task1_1 Ramezanee_SUT_task1_2 Ramezanee_SUT_task1_3
Abstract
The objective of the acoustic scene classification task is to categorize audio recordings into one of ten predetermined environmental sound categories, such as urban parks or metro stations. This report describes our submission to Task 1 of the DCASE 2025 Challenge, which emphasizes developing data-efficient, low-complexity systems for acoustic scene classification under real-world constraints such as limited training data and device mismatches [1]. Our model is designed with a reparameterizable convolutional structure that unifies multiple asymmetric kernels into a single efficient layer during inference, enabling both rich spatial representation and computational efficiency. It further integrates a novel attention-guided pooling strategy and a hybrid normalization scheme to enhance feature discrimination and stability throughout the network. Finally, we use an ensemble of the newly defined teacher models and minimize the KL divergence between student and teacher outputs to improve the results.
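The reparameterization described above can be illustrated as follows: parallel 3x3, 1x3, and 3x1 branches used during training collapse into one 3x3 convolution for inference by zero-padding the asymmetric kernels and summing the weights. A sketch under the assumption of RepVGG-style linear branches with no intermediate nonlinearity.

```python
# Hedged sketch of merging asymmetric conv branches into one 3x3 kernel.
import torch
import torch.nn.functional as F

def merge_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1):
    """w3x3: (O, I, 3, 3), w1x3: (O, I, 1, 3), w3x1: (O, I, 3, 1).
    Zero-pad the asymmetric kernels to 3x3 and sum; biases add directly."""
    w = (w3x3
         + F.pad(w1x3, (0, 0, 1, 1))   # pad height: (O, I, 1, 3) -> (O, I, 3, 3)
         + F.pad(w3x1, (1, 1, 0, 0)))  # pad width:  (O, I, 3, 1) -> (O, I, 3, 3)
    return w, b3x3 + b1x3 + b3x1       # single fused kernel and bias
```

The fused kernel produces exactly the same output as the sum of the three branches, so the extra capacity is free at inference time.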
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, frequency masking, time masking, random noise, random gain, DIR |
| Features | log-mel energies |
| Classifier | CNN |
| Complexity management | network design, knowledge distillation, pruning, reparametrization |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
SNTL-NTU DCASE25 Submission: Acoustic Scene Classification Using CNN-GRU Model Without Knowledge Distillation
Ee-Leng Tan1, Jun Wei Yeow2, Santi Peksi2, Haowen Li2, Ziyi Yang2 and Woon-Seng Gan2
1Smart Nation TRANS Lab, Nanyang Technological University, Singapore, 2Smart Nation TRANS Lab, Nanyang Technological University, Singapore, Singapore
Tan_SNTLNTU_task1_1 Tan_SNTLNTU_task1_2
Abstract
In this technical report, we present the SNTL-NTU team’s Task 1 submission for the Low-Complexity Acoustic Scene Classification of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 challenge [1]. This submission departs from the typical application of knowledge distillation from a teacher to a student model, aiming to achieve high performance with limited complexity. The proposed model is based on a CNN-GRU model and is trained solely using the TAU Urban Acoustic Scene 2022 Mobile development dataset [2], without utilizing any external datasets, except for MicIRP [3], which is used for device impulse response (DIR) augmentation. Two models have been submitted to this challenge with memory usage not more than 117 KB and requiring 10.9M multiply-and-accumulate (MAC) operations. Using the development dataset, the proposed model achieved an accuracy of 60.25%.
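A hedged sketch of a CNN-GRU layout in the spirit of this submission: a small CNN extracts a per-frame embedding from the log-mel input, a GRU models the frame sequence, and mean-pooled GRU states feed the classifier. All layer sizes are illustrative, not the submitted 117 KB configuration.

```python
# Hedged sketch of a CNN-GRU acoustic scene classifier.
import torch
import torch.nn as nn

class CnnGru(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d((2, 2)),                     # halve freq and time
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # halve freq only
        )
        self.gru = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (B, 1, n_mels, T)
        z = self.cnn(x)                               # (B, 32, n_mels//4, T//2)
        z = z.permute(0, 3, 1, 2).flatten(2)          # (B, T//2, 32 * n_mels//4)
        out, _ = self.gru(z)                          # model the frame sequence
        return self.head(out.mean(dim=1))             # mean-pool over time
```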
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 44.1kHz |
| Data augmentation | freq-mixstyle, DIR, SpecAug |
| Features | log-mel energies |
| Classifier | GRU-CNN |
| Complexity management | precision_16, network design |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |
Data-Efficient Acoustic Scene Classification via Ensemble Teachers Distillation and Pruning
Shuwei Zhang1, Bing Han2, Anbai Jiang3, Xinhu Zheng2, Wei-Qiang Zhang3, Xie Chen2, Pingyi Fan3, Cheng Lu4, Jia Liu1,3 and Yanmin Qian2
1Huakong AI, Beijing, China, 2Shanghai Jiao Tong University, Shanghai, China, 3Tsinghua University, Beijing, China, 4North China Electric Power University, Beijing, China
Zhang_AITHU-SJTU_task1_1 Zhang_AITHU-SJTU_task1_2 Zhang_AITHU-SJTU_task1_3 Zhang_AITHU-SJTU_task1_4
Abstract
The goal of the acoustic scene classification task is to classify recordings into one of ten predefined acoustic scene classes. In this report, we describe the submission of the THU-SJTU team for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the DCASE 2025 challenge. Our methods are consistent with those of last year. First, we use an architecture named SSCP-Mobile, which enhances CP-Mobile with a spatially separable convolution structure, achieving lower computational cost and better performance. We then adopt several pre-trained PaSST models as ensemble teachers to teach CP-Mobile with knowledge distillation. Next, we apply model pruning to trim the model to meet the computational and parameter requirements of the competition. Finally, we use knowledge distillation again to fine-tune the pruned model and further improve its performance. Our submissions comprise four systems containing only general models, although we also attempted to use device-type information to improve the performance of system S1.
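The spatially separable convolution that gives SSCP-Mobile its name factors a k x k kernel into k x 1 followed by 1 x k, cutting per-position multiplies from k*k to 2k. A minimal sketch; the depthwise (grouped) variant is an assumption consistent with CP-Mobile-style blocks.

```python
# Hedged sketch of a spatially separable depthwise convolution.
import torch.nn as nn

def spatially_separable(channels, k=3):
    """Replaces one k x k depthwise conv with k x 1 followed by 1 x k."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
        nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
    )
```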
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | freq-mixstyle, frequency masking, time masking, time rolling |
| Features | log-mel energies |
| Classifier | CNN |
| Complexity management | precision_16, network design, knowledge distillation, pruning |
| Number of models at inference | 1 |
| Model weight sharing | fully shared |
AdapTF-SepNet: AudioSet-Driven Adaptive Pre-Training of TF-SepNet for Multi-Device Acoustic Scene Classification
Zhou Ziyang1, Yin Zeyu1, Cai Yiqiang1, Li Shengchen1 and Shao Xi2
1School of Advanced Technology, Xi'an Jiaotong Liverpool University, Suzhou, China, 2Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
Zhou_XJTLU_task1_1
Abstract
This technical report presents our submission to DCASE 2025 Challenge Task 1: Low-Complexity Acoustic Scene Classification with Device Information. We propose a multi-device framework that leverages device-specific models trained with knowledge distillation techniques and enhanced through AudioSet pre-training. Our approach utilizes TF-SepNet as the backbone architecture, pre-trained on the large-scale AudioSet dataset to learn robust acoustic representations. For each of the known devices, a dedicated model is trained. At inference time, the system identifies the device source of the audio clip and selects the corresponding pre-trained model for classification. Evaluated on the test set, our device-specific system achieves an overall accuracy of 59.5%.
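The inference-time dispatch described above reduces to a lookup keyed by the device label, which the 2025 rules make available at test time, with a fallback for devices that have no dedicated model. A minimal sketch; all names are placeholders.

```python
# Hedged sketch of device-specific model selection at inference time.
import torch

@torch.no_grad()
def classify(x, device_label, device_models, general_model):
    """device_models: dict mapping known device labels to fine-tuned models;
    unknown devices fall back to the general model."""
    model = device_models.get(device_label, general_model)
    model.eval()
    return model(x).argmax(dim=-1)
```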
System characteristics
| Characteristic | Value |
| --- | --- |
| Sampling rate | 32kHz |
| Data augmentation | mixup, freq-mixstyle, DIR |
| Features | log-mel spectrogram |
| Classifier | CNN (TF-SepNet) |
| Complexity management | network design, weight quantization, knowledge distillation |
| Device information | per-device end-to-end fine-tuning |
| Number of models at inference | 7 |
| Model weight sharing | fully device-specific |