Task description
A detailed task description can be found on the task description page.
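For orientation when reading the tables below, the primary metric, class-aware signal-to-distortion ratio improvement (CA-SDRi), can be paraphrased as follows (the authoritative definition is given on the task description page): per clip, the SDR improvement is averaged over the union of the reference classes $\mathcal{C}$ and the predicted classes $\hat{\mathcal{C}}$, with false positives and false negatives contributing 0 dB.

$$
\text{CA-SDRi} = \frac{1}{\lvert \mathcal{C} \cup \hat{\mathcal{C}} \rvert} \sum_{c \in \mathcal{C} \cup \hat{\mathcal{C}}} P_c,
\qquad
P_c =
\begin{cases}
\text{SDR}(s_c, \hat{s}_c) - \text{SDR}(s_c, x) & \text{if } c \in \mathcal{C} \cap \hat{\mathcal{C}} \text{ (true positive)} \\
0 & \text{otherwise (FP or FN),}
\end{cases}
$$

where $s_c$ is the reference signal of class $c$, $\hat{s}_c$ the corresponding estimate, and $x$ the unprocessed mixture; the reported score is the average over all evaluation clips.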
Team ranking
The table includes only the best-ranked submission from each team; scores are shown for the evaluation set and the test (development) set. A short sketch after the table shows how this selection can be derived from the full system ranking.
Submission Code | Technical Report | Official Team Rank | CA-SDRi (eval) | Label Prediction Accuracy (eval) | CA-SDRi (test) | Label Prediction Accuracy (test)
---|---|---|---|---|---|---
Bando_AIST_task4_2 | Bando_2025_t4 | 5 | 7.55 | 49.51 | 13.31 | 64.07 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 1 | 11.00 | 55.80 | 14.94 | 61.80 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 3 | 9.73 | 51.54 | 14.00 | 59.80 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 2 | 9.77 | 61.60 | 15.04 | 77.07 | |
Baseline_task4_ResUNetK | | 8 | 6.60 | 51.48 | 11.09 | 59.80
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6 | 6.69 | 47.22 | 13.22 | 76.53 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 4 | 7.84 | 47.72 | 14.38 | 73.93 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 7 | 6.60 | 51.48 | 11.12 | 60.67 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 9 | 3.84 | 22.41 | 11.78 | 65.47 |
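For reference, the team-level table above can be derived from the full system ranking in the next section by keeping each team's best-ranked submission. A minimal sketch in pandas, assuming the system table has been exported to a CSV with hypothetical column names `team`, `submission_code`, `rank`, and `ca_sdri_eval`:

```python
import pandas as pd

# Hypothetical export of the "System ranking" table below, one row per submission.
systems = pd.read_csv("system_ranking.csv")  # columns: team, submission_code, rank, ca_sdri_eval, ...

# Keep the best-ranked (lowest official system rank) submission of each team,
# then order the teams by that rank to obtain the team-level table.
team_ranking = (
    systems.sort_values("rank")
           .drop_duplicates(subset="team", keep="first")
           .reset_index(drop=True)
)
team_ranking["team_rank"] = team_ranking.index + 1
print(team_ranking[["team_rank", "submission_code", "rank", "ca_sdri_eval"]])
```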
System ranking
The table shows the ranking of all submitted systems, with scores on the evaluation set and the test (development) set.
Submission Code | Technical Report | Official System Rank | CA-SDRi (eval) | Label Prediction Accuracy (eval) | CA-SDRi (test) | Label Prediction Accuracy (test)
---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 24 | 5.17 | 29.20 | 12.38 | 57.13 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 13 | 7.55 | 49.51 | 13.31 | 64.07 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 23 | 5.26 | 31.98 | 11.23 | 48.80 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 14 | 7.52 | 48.58 | 12.46 | 55.93 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 4 | 10.63 | 53.52 | 14.82 | 61.67 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 3 | 10.80 | 56.42 | 14.60 | 60.90 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 1 | 11.00 | 55.80 | 14.94 | 61.80 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 2 | 10.85 | 54.26 | 14.94 | 61.67 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 8 | 9.11 | 51.54 | 14.16 | 59.80 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 7 | 9.73 | 51.54 | 14.00 | 59.80 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 6 | 9.76 | 61.30 | 14.95 | 76.87 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 5 | 9.77 | 61.60 | 15.04 | 77.07 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 10 | 8.91 | 51.98 | 14.49 | 73.07 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 9 | 8.99 | 57.28 | 14.27 | 71.73 | |
Baseline_task4_ResUNetK | | 19 | 6.60 | 51.48 | 11.09 | 59.80
Baseline_task4_ResUNet | | 22 | 5.72 | 51.48 | 11.03 | 59.80
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 21 | 6.38 | 43.89 | 12.95 | 73.73 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 17 | 6.64 | 46.67 | 13.21 | 76.53 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 20 | 6.50 | 45.19 | 13.09 | 76.00 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 16 | 6.69 | 47.22 | 13.22 | 76.53 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 11 | 7.84 | 47.72 | 14.38 | 73.93 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 12 | 7.72 | 50.68 | 12.40 | 62.73 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 25 | 4.43 | 22.84 | 10.47 | 49.53 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 15 | 7.49 | 44.20 | 14.66 | 76.27 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 18 | 6.60 | 51.48 | 11.12 | 60.67 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 26 | 3.84 | 22.41 | 11.78 | 65.47 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 27 | 3.54 | 22.41 | 11.07 | 62.67 |
Supplementary metrics
Detailed analysis of separation and detection performance
All metrics in this table are computed on the evaluation set. True positives (TP), false positives (FP), and false negatives (FN) are counted per clip. TP-SDRi is the SDRi computed over the clips in which the label prediction is a true positive.
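The precision, recall, and F-scores in the table follow directly from these counts: the micro F-score pools the counts over all classes, while the macro F-score averages the per-class F-scores. A minimal sketch with placeholder counts (not taken from the table):

```python
def f_score(tp, fp, fn):
    """Precision, recall, and F-score from counts (0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# per_class: {class_name: (tp, fp, fn)} counted per clip, as described above (hypothetical numbers)
per_class = {"Speech": (500, 40, 60), "Doorbell": (120, 15, 30)}

# Micro: pool the counts over all classes, then compute one F-score.
tp, fp, fn = (sum(counts[i] for counts in per_class.values()) for i in range(3))
_, _, f_micro = f_score(tp, fp, fn)

# Macro: average the per-class F-scores.
f_macro = sum(f_score(*counts)[2] for counts in per_class.values()) / len(per_class)
print(f_micro, f_macro)
```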
Submission Code | Technical Report | CA-SDRi | TP-SDRi | Accuracy | Precision | Recall | F-Score (micro) | F-Score (macro) | TP | FP | FN
---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 10.72 | 29.20 | 0.72 | 0.50 | 0.59 | 0.56 | 1605 | 622 | 1635 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 11.04 | 49.51 | 0.83 | 0.70 | 0.77 | 0.75 | 2277 | 411 | 963 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 10.39 | 31.98 | 0.76 | 0.52 | 0.60 | 0.59 | 1684 | 648 | 1556 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 11.06 | 48.58 | 0.85 | 0.70 | 0.76 | 0.74 | 2258 | 432 | 982 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 14.37 | 53.52 | 0.88 | 0.79 | 0.82 | 0.82 | 2552 | 401 | 688 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 14.12 | 56.42 | 0.85 | 0.84 | 0.83 | 0.83 | 2704 | 541 | 536 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 14.46 | 55.80 | 0.86 | 0.83 | 0.84 | 0.83 | 2681 | 499 | 559 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 14.52 | 54.26 | 0.88 | 0.80 | 0.83 | 0.82 | 2576 | 399 | 664 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 12.83 | 51.54 | 0.86 | 0.73 | 0.79 | 0.77 | 2365 | 401 | 875 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 13.63 | 51.54 | 0.86 | 0.73 | 0.79 | 0.77 | 2365 | 401 | 875 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 12.51 | 61.30 | 0.93 | 0.78 | 0.85 | 0.82 | 2528 | 210 | 712 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 12.51 | 61.60 | 0.93 | 0.78 | 0.85 | 0.82 | 2527 | 200 | 713 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 12.62 | 51.98 | 0.86 | 0.73 | 0.79 | 0.77 | 2368 | 390 | 872 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 11.87 | 57.28 | 0.93 | 0.75 | 0.83 | 0.81 | 2447 | 203 | 793 | |
Baseline_task4_ResUNetK | | 6.60 | 9.33 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876
Baseline_task4_ResUNet | | 5.72 | 8.21 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 10.07 | 43.89 | 0.87 | 0.65 | 0.74 | 0.69 | 2121 | 395 | 1119 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 10.18 | 46.67 | 0.85 | 0.67 | 0.75 | 0.70 | 2183 | 379 | 1057 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 10.13 | 45.19 | 0.86 | 0.67 | 0.75 | 0.70 | 2156 | 377 | 1084 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 10.15 | 47.22 | 0.88 | 0.67 | 0.76 | 0.71 | 2185 | 328 | 1055 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 11.27 | 47.72 | 0.84 | 0.71 | 0.78 | 0.75 | 2309 | 393 | 931 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 11.06 | 50.68 | 0.87 | 0.70 | 0.78 | 0.75 | 2286 | 309 | 954 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 10.30 | 22.84 | 0.59 | 0.49 | 0.54 | 0.49 | 1590 | 1110 | 1650 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 11.44 | 44.20 | 0.87 | 0.65 | 0.75 | 0.72 | 2117 | 280 | 1123 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 9.33 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 9.41 | 22.41 | 0.89 | 0.40 | 0.52 | 0.46 | 1278 | 418 | 1962 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 8.97 | 22.41 | 0.89 | 0.40 | 0.52 | 0.46 | 1278 | 418 | 1962 |
Detailed analysis focused on the quality of separated speech
The table shows the quality of the separated speech in the evaluation dataset. Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Audio Quality (PEAQ) were adopted as objective metrics.
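For reference, PESQ and STOI can be computed with the open-source `pesq` and `pystoi` Python packages; PEAQ typically requires an external ITU-R BS.1387 implementation and is omitted here. A minimal sketch, assuming 32 kHz reference and estimate waveforms and wide-band PESQ at 16 kHz:

```python
import numpy as np
from scipy.signal import resample_poly
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

def speech_quality(reference: np.ndarray, estimate: np.ndarray, fs: int = 32000):
    """PESQ (wide-band) and STOI for one separated speech sample."""
    # Wide-band PESQ is defined for 16 kHz input, so resample first.
    ref16 = resample_poly(reference, 16000, fs)
    est16 = resample_poly(estimate, 16000, fs)
    pesq_score = pesq(16000, ref16, est16, "wb")
    # STOI accepts the native sampling rate and resamples internally.
    stoi_score = stoi(reference, estimate, fs, extended=False)
    return pesq_score, stoi_score
```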
Submission Code | Technical Report | CA-SDRi | Number of Speech Samples (/251) | PESQ mean | PESQ std | PESQ min | PESQ max | STOI mean | STOI std | STOI min | STOI max | PEAQ mean | PEAQ std | PEAQ min | PEAQ max
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 233 | 2.50 | 0.61 | 1.26 | 4.07 | 0.85 | 0.10 | 0.46 | 0.98 | -3.52 | 0.48 | -3.91 | -0.91 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 243 | 2.46 | 0.61 | 1.07 | 4.06 | 0.85 | 0.11 | 0.46 | 0.98 | -3.52 | 0.48 | -3.91 | -1.10 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 184 | 2.47 | 0.57 | 1.22 | 4.02 | 0.88 | 0.08 | 0.55 | 0.99 | -3.56 | 0.46 | -3.91 | -1.64 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 226 | 2.35 | 0.61 | 1.06 | 4.02 | 0.86 | 0.09 | 0.45 | 0.98 | -3.56 | 0.45 | -3.91 | -1.46 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 236 | 2.83 | 0.58 | 1.27 | 4.22 | 0.91 | 0.07 | 0.63 | 0.99 | -3.42 | 0.49 | -3.91 | -0.62 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 241 | 2.82 | 0.56 | 1.27 | 4.23 | 0.91 | 0.07 | 0.58 | 0.99 | -3.46 | 0.47 | -3.91 | -0.88 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 241 | 2.88 | 0.58 | 1.17 | 4.24 | 0.91 | 0.08 | 0.46 | 0.99 | -3.43 | 0.48 | -3.91 | -0.63 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 231 | 2.89 | 0.57 | 1.30 | 4.23 | 0.91 | 0.07 | 0.58 | 0.99 | -3.42 | 0.49 | -3.91 | -0.62 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 246 | 2.67 | 0.54 | 1.19 | 4.02 | 0.90 | 0.08 | 0.49 | 0.99 | -3.46 | 0.50 | -3.91 | -0.95 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 246 | 2.77 | 0.58 | 1.04 | 4.18 | 0.90 | 0.10 | 0.08 | 0.99 | -3.43 | 0.50 | -3.91 | -1.40 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 249 | 2.97 | 0.59 | 1.25 | 4.17 | 0.90 | 0.07 | 0.61 | 0.99 | -3.40 | 0.50 | -3.91 | -0.74 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 249 | 2.97 | 0.60 | 1.24 | 4.18 | 0.90 | 0.07 | 0.61 | 0.99 | -3.39 | 0.51 | -3.91 | -0.75 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 250 | 2.89 | 0.62 | 1.17 | 4.17 | 0.89 | 0.08 | 0.56 | 0.99 | -3.36 | 0.51 | -3.91 | -0.87 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 250 | 2.91 | 0.61 | 1.22 | 4.18 | 0.89 | 0.08 | 0.58 | 0.99 | -3.45 | 0.50 | -3.91 | -0.72 | |
Baseline_task4_ResUNetK | | 6.60 | 246 | 2.39 | 0.63 | 1.09 | 4.07 | 0.84 | 0.11 | 0.39 | 0.99 | -3.60 | 0.43 | -3.91 | -0.87
Baseline_task4_ResUNet | | 5.72 | 246 | 2.55 | 0.63 | 1.11 | 4.11 | 0.85 | 0.10 | 0.44 | 0.99 | -3.56 | 0.47 | -3.91 | -0.98
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 249 | 2.40 | 0.62 | 1.06 | 4.07 | 0.85 | 0.11 | 0.38 | 0.99 | -3.57 | 0.47 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 245 | 2.42 | 0.62 | 1.10 | 4.07 | 0.85 | 0.11 | 0.41 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 246 | 2.40 | 0.61 | 1.10 | 4.07 | 0.85 | 0.11 | 0.41 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 248 | 2.41 | 0.62 | 1.06 | 4.07 | 0.85 | 0.11 | 0.38 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 248 | 2.40 | 0.59 | 1.09 | 4.08 | 0.84 | 0.12 | 0.04 | 0.98 | -3.66 | 0.40 | -3.91 | -0.34 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 244 | 2.46 | 0.58 | 1.15 | 4.02 | 0.84 | 0.10 | 0.44 | 0.98 | -3.55 | 0.48 | -3.91 | -0.98 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 210 | 2.22 | 0.60 | 1.07 | 3.80 | 0.80 | 0.19 | -0.19 | 0.97 | -3.64 | 0.50 | -3.91 | -1.36 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 247 | 2.41 | 0.59 | 1.10 | 4.08 | 0.84 | 0.12 | 0.06 | 0.98 | -3.65 | 0.42 | -3.91 | -0.34 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 246 | 2.39 | 0.63 | 1.09 | 4.07 | 0.84 | 0.12 | 0.39 | 0.99 | -3.60 | 0.43 | -3.91 | -0.87 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 58 | 2.79 | 0.70 | 1.29 | 4.08 | 0.88 | 0.11 | 0.44 | 0.99 | -3.53 | 0.54 | -3.91 | -0.98 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 58 | 2.90 | 0.61 | 1.39 | 4.06 | 0.89 | 0.10 | 0.56 | 0.99 | -3.41 | 0.57 | -3.91 | -0.93 |
System performance under partially known conditions
The table shows the separation and detection performance of each system under partially known conditions. The 'Known IR Condition' is the case where the evaluation data is synthesized with room impulse responses included in the training data. The 'Known Target Condition' is the case where the evaluation data is synthesized with target sound event samples included in the training data. The 'Known Noise Condition' is the case where the evaluation data is synthesized with background noise included in the training data. The 'Known Interference Condition' is the case where the evaluation data is synthesized with interference sound samples included in the training data.
Submission Code | Technical Report | CA-SDRi (eval) | Accuracy (eval) | Known IR CA-SDRi | Known IR Accuracy | Known Target CA-SDRi | Known Target Accuracy | Known Noise CA-SDRi | Known Noise Accuracy | Known Interference CA-SDRi | Known Interference Accuracy
---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 7.63 | 37.78 | 8.90 | 62.96 | 5.97 | 33.33 | 4.53 | 26.85 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 10.34 | 61.11 | 9.94 | 68.52 | 8.81 | 58.33 | 7.01 | 48.15 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 6.89 | 33.33 | 8.10 | 51.85 | 5.80 | 25.93 | 4.90 | 26.85 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 10.19 | 51.11 | 9.19 | 61.11 | 8.11 | 51.85 | 7.29 | 50.93 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 13.49 | 53.33 | 10.79 | 64.81 | 11.65 | 62.04 | 11.35 | 63.89 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 13.76 | 56.67 | 10.16 | 59.26 | 11.77 | 67.59 | 10.78 | 62.96 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 14.53 | 61.11 | 10.60 | 59.26 | 12.11 | 67.59 | 11.59 | 66.67 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 13.32 | 52.22 | 10.89 | 63.89 | 11.85 | 63.89 | 11.19 | 62.04 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 11.08 | 48.89 | 11.25 | 75.00 | 9.68 | 53.70 | 8.49 | 47.22 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 11.87 | 48.89 | 11.06 | 75.00 | 9.92 | 53.70 | 9.12 | 47.22 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 12.75 | 68.89 | 12.37 | 92.59 | 10.02 | 61.11 | 9.22 | 58.33 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 12.68 | 70.00 | 12.36 | 91.67 | 9.80 | 61.11 | 9.23 | 59.26 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 10.95 | 58.89 | 12.17 | 91.67 | 9.07 | 48.15 | 8.37 | 50.00 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 12.13 | 67.78 | 11.96 | 90.74 | 9.69 | 61.11 | 8.91 | 56.48 | |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 7.91 | 48.89 | 9.88 | 75.00 | 6.41 | 53.70 | 5.95 | 47.22
Baseline_task4_ResUNet | | 5.72 | 51.48 | 7.03 | 48.89 | 9.51 | 75.00 | 5.27 | 53.70 | 5.08 | 47.22
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 7.88 | 48.89 | 8.45 | 54.63 | 6.58 | 39.81 | 5.77 | 45.37 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 7.46 | 43.33 | 8.30 | 51.85 | 7.07 | 43.52 | 5.87 | 38.89 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 8.12 | 46.67 | 8.34 | 52.78 | 7.07 | 44.44 | 5.41 | 40.74 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 8.05 | 46.67 | 8.43 | 54.63 | 7.13 | 44.44 | 5.72 | 42.59 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 10.58 | 52.22 | 10.66 | 88.89 | 8.55 | 50.93 | 7.32 | 47.22 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 9.54 | 54.44 | 9.86 | 80.56 | 8.39 | 49.07 | 7.19 | 52.78 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 5.86 | 27.78 | 5.43 | 32.41 | 5.33 | 29.63 | 3.81 | 18.52 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 10.34 | 52.22 | 11.04 | 95.37 | 8.04 | 46.30 | 7.24 | 45.37 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 7.91 | 48.89 | 9.88 | 75.00 | 6.41 | 53.70 | 5.95 | 47.22 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 3.86 | 17.78 | 5.15 | 28.70 | 3.89 | 30.56 | 2.86 | 26.85 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 3.64 | 17.78 | 5.21 | 28.70 | 3.59 | 30.56 | 2.73 | 26.85 |
System performance in more challenging conditions
This table shows the performance of each system under the 'Multiple Same Class Condition', in which multiple sound events of the same class are included in the same clip.
Submission Code | Technical Report | CA-SDRi (eval) | Accuracy (eval) | Multiple Same Class CA-SDRi | Multiple Same Class Accuracy
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 1.89 | 41.67 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 3.98 | 60.19 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 2.53 | 47.22 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 4.12 | 64.81 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 4.15 | 67.59 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 3.79 | 62.96 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 3.84 | 62.04 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 3.89 | 66.67 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 2.97 | 65.74 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 3.09 | 65.74 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 5.01 | 72.22 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 4.87 | 73.15 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 3.99 | 62.96 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 4.45 | 69.44 | |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 3.53 | 65.74
Baseline_task4_ResUNet | | 5.72 | 51.48 | 2.21 | 65.74
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 3.17 | 63.89 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 3.49 | 65.74 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 2.88 | 62.04 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 3.33 | 65.74 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 2.62 | 61.11 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 2.89 | 64.81 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 0.69 | 22.22 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 2.51 | 64.81 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 3.53 | 65.74 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 1.16 | 36.11 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 1.05 | 36.11 |
System characteristics
General characteristics
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Input Sampling Rate | Input Acoustic Features
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 32kHz | waveform, spectrogram |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 32kHz | waveform, spectrogram |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 32kHz | waveform, spectrogram |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 32kHz | waveform, spectrogram
Baseline_task4_ResUNet | | 5.72 | 51.48 | 32kHz | waveform, spectrogram
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 32kHz | waveform, spectrogram, spectral_roll-off |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 32kHz | waveform, spectrogram, chroma |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 32kHz | waveform, spectrogram, spectral_roll-off, chroma |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 32kHz | waveform, spectrogram, spectral_roll-off, chroma |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 32kHz | waveform, log mel spectrogram |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 32kHz | spectrogram, log mel spectrogram |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 32kHz | waveform |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 32kHz | waveform, log mel spectrogram, fbank |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 32kHz | waveform, spectrogram |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 32kHz | waveform, spectrogram |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 32kHz | waveform, spectrogram |
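Several submissions augment the waveform and spectrogram inputs with additional features such as log-mel spectrograms, spectral roll-off, and chroma. A minimal sketch of how such features might be extracted with librosa (the file name and parameter values are illustrative, not those used by any submission):

```python
import librosa
import numpy as np

# Hypothetical single-channel input at the task's 32 kHz sampling rate.
y, sr = librosa.load("mixture.wav", sr=32000, mono=True)

spectrogram = np.abs(librosa.stft(y, n_fft=1024, hop_length=320))              # magnitude spectrogram
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=320, n_mels=128)
)                                                                               # log-mel spectrogram
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=320)          # spectral roll-off per frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=320)                # chroma features per frame
```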
Machine learning characteristics
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Machine Learning Method | Loss Function | Training Dataset | Data Augmentation | Pretrained Models
---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | Neural blind source separation model (ISS-enhanced RE-SepFormer) | CE, SNR loss | DCASE2025Task4Dataset | SpecAug, dynamic mixing | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) | CE, SNR loss | DCASE2025Task4Dataset | SpecAug, dynamic mixing | BEATs |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | Neural blind source separation model (ISS-enhanced RE-SepFormer) | CE, SNR loss | DCASE2025Task4Dataset,AudioSet | SpecAug, dynamic mixing | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) | CE, SNR loss | DCASE2025Task4Dataset,AudioSet | SpecAug, dynamic mixing | BEATs |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | TSTF_v1-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | TSTF_v1-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | ResUNet-based separation model (AudioSep); M2D model for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,AudioSep |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Baseline_task4_ResUNet | | 5.72 | 51.48 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | Baseline separation model, M2D and spectral rolloff feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | Baseline separation model, M2D and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | Sepformer-based separation model, M2D-based multi-channel audio tagging model | BCE, masked SDR loss | DCASE2025Task4Dataset | | M2D
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | BSRNN-based separation model, M2D-based multi-channel audio tagging model | BCE, PIT SDR loss, masked SDR loss | DCASE2025Task4Dataset | | M2D
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | SepEDA-based audio tagging and separation model | BCE, PIT SDR loss | DCASE2025Task4Dataset | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | Sepformer-based separation model, M2D-based multi-channel audio tagging model + BEATs-based audio tagging model | BCE, masked SDR loss | DCASE2025Task4Dataset | | M2D,BEATs
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | Attentive ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | SpecAugment | M2D |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
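Most systems in the table combine a (binary) cross-entropy loss for tagging with an SDR-style loss for separation. A minimal sketch of a negative-SDR training loss in PyTorch (plain SDR; the SA-SDR, PIT, and masked variants listed above differ in how sources are grouped or matched):

```python
import torch

def neg_sdr_loss(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative signal-to-distortion ratio, averaged over the batch (higher SDR means lower loss)."""
    signal_power = torch.sum(reference ** 2, dim=-1)
    error_power = torch.sum((reference - estimate) ** 2, dim=-1)
    sdr = 10.0 * torch.log10((signal_power + eps) / (error_power + eps))
    return -sdr.mean()
```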
Complexity
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Ensemble Subsystems | Number of Parameters
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 10 | 25.2M |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 10 | 117M |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 10 | 25.2M |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 10 | 117M |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 1 | 204M |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 1 | 204M |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 1 | 204M |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 1 | 204M |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 1 | 88.3M |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 1 | 87.4M |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 118 | 10821.00M |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 58 | 5359.00M |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 1 | 228.60M |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 33 | 3034.20M |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 1 | 115.40M
Baseline_task4_ResUNet | | 5.72 | 51.48 | 1 | 115.38M
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 1 | 116.6M |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 1 | 115.9M |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 1 | 116.6M |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 4 | 464.5M |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 1 | 105.1M |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 1 | 204M |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 1 | 8.9M |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 2 | 195.4M |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 1 | 115.41M |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 1 | 115.40M |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 1 | 115.38M |
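The parameter counts above typically cover all subsystems of a submission, including frozen pretrained encoders. For a PyTorch model, such a count is commonly obtained as follows (illustrative sketch):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> str:
    """Total parameter count, formatted in millions as used in the table."""
    n_params = sum(p.numel() for p in model.parameters())
    return f"{n_params / 1e6:.1f}M"
```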
Representative examples of separated audio samples
Evaluation set
The following table shows separated sound samples from the evaluation set. Here, representative outputs from systems ranked 1 to 3 and the baseline are selected.
Condition (synthesized) | Oracle | Choi_KAIST_task4_3 (Rank 1) | Morocutti_CPJKU_task4_2 (Rank 2) | Wu_SUSTech_task4_2 (Rank 3) | Baseline_task4_ResUNetK
---|---|---|---|---|---
Success case | Speech, Buzzer, Doorbell | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 17.32 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 17.00 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 16.93 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 11.87 dB)
Challenging case | Pour, Percussion, Cough | Pour, Percussion, -- (CA-SDRi for this sample: 7.19 dB) | Pour, --, -- (CA-SDRi for this sample: 3.89 dB) | Pour, --, -- (CA-SDRi for this sample: 4.63 dB) | Pour, --, -- (CA-SDRi for this sample: 0.21 dB)
Real recording
The following table shows separated sound samples from mixtures recorded using a first-order Ambisonics microphone. Here, representative outputs from systems ranked 1 to 3 and the baseline are selected.
Condition (real recording) | Choi_KAIST_task4_3 (Rank 1) | Morocutti_CPJKU_task4_2 (Rank 2) | Wu_SUSTech_task4_2 (Rank 3) | Baseline_task4_ResUNetK
---|---|---|---|---
Indoor | Blender, Doorbell, Cough | --, Doorbell, Cough, HairDryer (FP) | --, Doorbell, Cough | --, Doorbell, Cough
Outdoor | --, BicycleBell, MusicalKeyboard (FP) | --, BicycleBell, MusicalKeyboard (FP) | Speech, BicycleBell, MusicalKeyboard (FP) | Speech, BicycleBell, MusicalKeyboard (FP)
Technical reports
A HYBRID S5 SYSTEM BASED ON NEURAL BLIND SOURCE SEPARATION
Yuto Nozaki1, Shun Sakurai1,2, Yoshiaki Bando1, Kohei Saijo1,3, Keisuke Imoto1,4, Masaki Onishi1
1National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, 2University of Tsukuba, Ibaraki, Japan, 3Waseda University, Tokyo, Japan, 4Kyoto University, Kyoto, Japan
Bando_AIST_task4_1 Bando_AIST_task4_2 Bando_AIST_task4_3 Bando_AIST_task4_4
Abstract
In this paper, we report our hybrid system for the DCASE 2025 Challenge Task 4 based on neural blind source separation (BSS). This task, called spatial semantic segmentation of sound scenes (S5), aims to detect and separate sound events from a multichannel mixture signal. To make the separation model robust against unseen audio environments, we leverage neural BSS to combine robust statistical signal processing and high-performing neural modeling. Specifically, our network architecture incorporates the iterative source steering algorithm to separate source signals using spatial statistics. The network is trained via multitask learning of source separation and classification with permutation invariant training. In addition, to improve the performance, we utilized an audio foundation model called BEATs and augmented the training data by curating AudioSet. The experimental results on the official development test set show that our best system (System 2) improved class-aware signal-to-distortion ratio improvement (CA-SDRi) by more than 2 dB over the official baseline system.
SELF-GUIDED TARGET SOUND EXTRACTION AND CLASSIFICATION THROUGH UNIVERSAL SOUND SEPARATION MODEL AND MULTIPLE CLUES
Younghoo Kwon1, Dongheon Lee1, Dohwan Kim1, Jung-Woo Choi1
1KAIST, School of Electrical Engineering, Daejeon, South Korea
Choi_KAIST_task4_1 Choi_KAIST_task4_2 Choi_KAIST_task4_3 Choi_KAIST_task4_4
Abstract
This paper presents a multi-stage framework that integrates Universal Sound Separation (USS) and Target Sound Extraction (TSE) for sound separation and classification. In the first stage, USS is applied to decompose the input audio into multiple source components. Each separated source is then individually classified to generate two types of clues: enrollment and class clues. These clues are subsequently utilized in the second stage to guide the TSE process. By generating multiple clues in a self-guided manner, the proposed method enhances the performance of target sound extraction. The final output of the TSE module is then re-classified to improve the classification accuracy further.
TS-TFGRIDNET: EXTENDING TFGRIDNET FOR LABEL-QUERIED TARGET SOUND EXTRACTION VIA EMBEDDING CONCATENATION
Fulin Wu1, Zhong-Qiu Wang1
1Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China
Wu_SUSTech_task4_submission_1 Wu_SUSTech_task4_submission_2
Abstract
The DCASE2025 Challenge Task 4 - Spatial Semantic Segmentation of Sound Scenes (S5) challenges participants to separate a set of mixed sound events (sampled from 18 targeted sound events) into individual sound-event signals. The baseline system provided by the challenge organizers first performs audio tagging to identify the sound events present in the mixture, and then conducts label-queried target sound extraction (TSE) to extract the signal of each identified sound event. Building on the baseline system, we propose to improve the label-queried TSE component by using a novel model named Target Sound Extraction TF-GridNet (TS-TFGridNet), leveraging the strong capability of TF-GridNet at speech separation for TSE. TS-TFGridNet concatenates audio and sound-class embeddings along the frequency or feature dimension, thereby conditioning TF-GridNet to perform TSE. Clear improvement is observed over the baseline system.
TRANSFORMER-AIDED AUDIO SOURCE SEPARATION WITH TEMPORAL GUIDANCE AND ITERATIVE REFINEMENT
Tobias Morocutti2, Florian Schmid1, Jonathan Greif1, Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University Linz, Austria
Morocutti_CPJKU_task4_1 Morocutti_CPJKU_task4_2 Morocutti_CPJKU_task4_3 Morocutti_CPJKU_task4_4
Abstract
This technical report presents the CP-JKU team's system for Task 4, Spatial Semantic Segmentation of Sound Scenes, of the DCASE 2025 Challenge. Building on the two-stage baseline of audio tagging followed by label-conditioned source separation, we introduce several key enhancements. We reframe the tagging stage as a sound event detection task using five Transformers pre-trained on AudioSet Strong, enabling temporally strong conditioning of the separator. We further inject the Transformer's latent representations into a ResUNet separator initialized from AudioSep and extended with a Dual-Path RNN. Additionally, we propose an iterative refinement scheme that reuses the separator's output as input. These improvements raise label prediction accuracy to 73.07% and CA-SDRi to 14.49 for a single-model system on the development test set, substantially surpassing the baseline.
PERFORMANCE IMPROVEMENT OF SPATIAL SEMANTIC SEGMENTATION WITH ENRICHED AUDIO FEATURES AND AGENT-BASED ERROR CORRECTION FOR DCASE 2025 CHALLENGE TASK 4
Jongyeon Park1, Joonhee Lee2, Do-Hyeon Lim1, Hong Kook Kim1,2, Hyeongcheol Geum3, Jeong Eun Lim3
1Dept. of AI Convergence, 2Dept. of EECS, Gwangju Institute of Science and Technology, Gwangju 61005, Korea, 3AI Lab., R&D Center, Hanwha Vision, Seongnam-si, Gyeonggi-do 13488, Korea
Park_GIST-HanwhaVision_task4_1 Park_GIST-HanwhaVision_task4_2 Park_GIST-HanwhaVision_task4_3 Park_GIST-HanwhaVision_task4_4
Abstract
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. First, the model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches improve CA-SDRi by up to 14.7% relative to the baseline of DCASE 2025 Challenge Task 4.
SJTU-AUDIOCC SYSTEM FOR DCASE 2025 CHALLENGE TASK 4: SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES
Xin Zhou1, Hongyu Wang1, Chenda Li1, Bing Han1, Xinhu Zheng1, Yanmin Qian1
1Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Qian_SJTU_task4_1 Qian_SJTU_task4_2 Qian_SJTU_task4_3 Qian_SJTU_task4_4
Abstract
The present report introduces four systems developed by the AudioCC Lab at Shanghai Jiao Tong University for DCASE 2025 Task 4. The task at hand is to detect target sound events and separate the corresponding signals from a multi-channel mixture. We found that effective detection of sound events and extraction of the corresponding signals is challenging under conditions where the mixture consists of multiple target sound events, non-target sound events, and non-directional background noise. In order to address this challenge, we propose four systems. The first system represents an enhancement to the baseline system, the second is a multistage iterative system that is both novel and promising, the third is a lightweight model based on an Encoder-Decoder Attractor (EDA) module, and the fourth integrates multiple audio tagging models to achieve optimal performance. These four systems cover high performance, low overhead, and promising frameworks, providing a reference for future research on this task.
REDUX: AN ITERATIVE STRATEGY FOR SEMANTIC SOURCE SEPARATION
Vasileios Stergioulis1, Gerasimos Potamianos1
1Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
Stergioulis_UTh_task4_submission_1
Abstract
In this report, we present a system for the Spatial Semantic Segmentation of Sound Scenes (DCASE 2025 Task 4), combining enhanced source separation and label classification through an iterative verification strategy. Our approach integrates the Masked Modeling Duo (M2D) classifier with a separator architecture based on an attentive ResUNeXt network. Inspired by recent advances in universal audio modeling and self-supervised separation, our system incorporates feedback between multiple classification and separation stages to correct early-stage prediction errors. Specifically, classification outputs are verified using post-separation reclassification, and ambiguous cases are resolved through targeted waveform subtraction and re-analysis. This strategy enables improved source-label associations without increasing model complexity. Evaluated on the development set, our method achieves a relative improvement of 0.28% in CA-SDRi and 1.46% in accuracy over the baseline.
SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES BASED ON ADAPTER FINE-TUNING
Zehao Wang1, Sen Wang1, Zhicheng Zhang1, Jianqin Yin1
1Beijing University of Posts and Telecommunications, China
Zhang_BUPT_task4_1 Zhang_BUPT_task4_2
Abstract
In this work, we present our submission system for DCASE 2025 Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). We introduce audio tagging (AT) and label-queried source separation (LSS) systems built on a fine-tuned M2D and a modified version of ResUNet. By introducing a dual-path recurrent neural network (DPRNN) module into ResUNet and improving the Feature-wise Linear Modulation (FiLM) mechanism, we enhance the model's ability to capture long-term dependencies in spatial audio and the flexibility of dynamic feature adjustment. Experimental results show that the improved system outperforms the baseline system in class-aware evaluation metrics (CA-SDRi, CA-SI-SDRi), verifying the effectiveness of the method.