Spatial Semantic Segmentation of Sound Scenes


Challenge results

Task description

A detailed task description can be found on the task description page.

Team ranking

The table includes only the best-ranked submission from each submitting team.

Each row lists: Submission code | Technical report | Official team rank | CA-SDRi (eval) [dB] | Label prediction accuracy (eval) [%] | CA-SDRi (test) [dB] | Label prediction accuracy (test) [%]. Here, ‘eval’ refers to the evaluation set and ‘test’ to the development test set; the baseline row has no technical-report entry.
Bando_AIST_task4_2 Bando_2025_t4 5 7.55 49.51 13.31 64.07
Choi_KAIST_task4_3 Choi_2025_t4 1 11.00 55.80 14.94 61.80
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 3 9.73 51.54 14.00 59.80
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 2 9.77 61.60 15.04 77.07
Baseline_task4_ResUNetK 8 6.60 51.48 11.09 59.80
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6 6.69 47.22 13.22 76.53
Qian_SJTU_task4_1 Qian_2025_t4 4 7.84 47.72 14.38 73.93
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 7 6.60 51.48 11.12 60.67
Zhang_BUPT_task4_1 Zhang_2025_t4 9 3.84 22.41 11.78 65.47

System ranking

The table shows the ranking of all submitted systems.

Each row lists: Submission code | Technical report | Official system rank | CA-SDRi (eval) [dB] | Label prediction accuracy (eval) [%] | CA-SDRi (test) [dB] | Label prediction accuracy (test) [%]. The baseline rows have no technical-report entry.
Bando_AIST_task4_1 Bando_2025_t4 24 5.17 29.20 12.38 57.13
Bando_AIST_task4_2 Bando_2025_t4 13 7.55 49.51 13.31 64.07
Bando_AIST_task4_3 Bando_2025_t4 23 5.26 31.98 11.23 48.80
Bando_AIST_task4_4 Bando_2025_t4 14 7.52 48.58 12.46 55.93
Choi_KAIST_task4_1 Choi_2025_t4 4 10.63 53.52 14.82 61.67
Choi_KAIST_task4_2 Choi_2025_t4 3 10.80 56.42 14.60 60.90
Choi_KAIST_task4_3 Choi_2025_t4 1 11.00 55.80 14.94 61.80
Choi_KAIST_task4_4 Choi_2025_t4 2 10.85 54.26 14.94 61.67
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 8 9.11 51.54 14.16 59.80
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 7 9.73 51.54 14.00 59.80
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 6 9.76 61.30 14.95 76.87
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 5 9.77 61.60 15.04 77.07
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 10 8.91 51.98 14.49 73.07
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 9 8.99 57.28 14.27 71.73
Baseline_task4_ResUNetK 19 6.60 51.48 11.09 59.80
Baseline_task4_ResUNet 22 5.72 51.48 11.03 59.80
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 21 6.38 43.89 12.95 73.73
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 17 6.64 46.67 13.21 76.53
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 20 6.50 45.19 13.09 76.00
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 16 6.69 47.22 13.22 76.53
Qian_SJTU_task4_1 Qian_2025_t4 11 7.84 47.72 14.38 73.93
Qian_SJTU_task4_2 Qian_2025_t4 12 7.72 50.68 12.40 62.73
Qian_SJTU_task4_3 Qian_2025_t4 25 4.43 22.84 10.47 49.53
Qian_SJTU_task4_4 Qian_2025_t4 15 7.49 44.20 14.66 76.27
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 18 6.60 51.48 11.12 60.67
Zhang_BUPT_task4_1 Zhang_2025_t4 26 3.84 22.41 11.78 65.47
Zhang_BUPT_task4_2 Zhang_2025_t4 27 3.54 22.41 11.07 62.67

Supplementary metrics

Detailed analysis of separation and detection performance

All metrics in this table are computed on the evaluation set. True positives (TP), false positives (FP), and false negatives (FN) are counted per clip. TP-SDRi is the SDRi computed over the clips whose label prediction is a true positive.
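As a sanity check, the micro-averaged F-score can be recomputed directly from these per-clip counts. The snippet below is a minimal sketch (not the official evaluation code); the example values are taken from the Bando_AIST_task4_2 row.

```python
# Minimal sketch (not the official evaluation code): recompute the micro-averaged
# F-score of the label predictions from the aggregated per-clip counts.
def micro_f_score(tp: int, fp: int, fn: int) -> float:
    # Micro-averaging pools all clips into single TP/FP/FN totals.
    return 2 * tp / (2 * tp + fp + fn)

# Example: Bando_AIST_task4_2 with TP=2277, FP=411, FN=963.
print(round(micro_f_score(2277, 411, 963), 2))  # 0.77, matching the table
```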

Each row lists: Submission code | Technical report | CA-SDRi [dB] | TP-SDRi [dB] | Accuracy [%] | Precision | Recall | F-score (micro) | F-score (macro) | TP | FP | FN
Bando_AIST_task4_1 Bando_2025_t4 5.17 10.72 29.20 0.72 0.50 0.59 0.56 1605 622 1635
Bando_AIST_task4_2 Bando_2025_t4 7.55 11.04 49.51 0.83 0.70 0.77 0.75 2277 411 963
Bando_AIST_task4_3 Bando_2025_t4 5.26 10.39 31.98 0.76 0.52 0.60 0.59 1684 648 1556
Bando_AIST_task4_4 Bando_2025_t4 7.52 11.06 48.58 0.85 0.70 0.76 0.74 2258 432 982
Choi_KAIST_task4_1 Choi_2025_t4 10.63 14.37 53.52 0.88 0.79 0.82 0.82 2552 401 688
Choi_KAIST_task4_2 Choi_2025_t4 10.80 14.12 56.42 0.85 0.84 0.83 0.83 2704 541 536
Choi_KAIST_task4_3 Choi_2025_t4 11.00 14.46 55.80 0.86 0.83 0.84 0.83 2681 499 559
Choi_KAIST_task4_4 Choi_2025_t4 10.85 14.52 54.26 0.88 0.80 0.83 0.82 2576 399 664
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 12.83 51.54 0.86 0.73 0.79 0.77 2365 401 875
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 13.63 51.54 0.86 0.73 0.79 0.77 2365 401 875
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 12.51 61.30 0.93 0.78 0.85 0.82 2528 210 712
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 12.51 61.60 0.93 0.78 0.85 0.82 2527 200 713
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 12.62 51.98 0.86 0.73 0.79 0.77 2368 390 872
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 11.87 57.28 0.93 0.75 0.83 0.81 2447 203 793
Baseline_task4_ResUNetK 6.60 9.33 51.48 0.86 0.73 0.79 0.77 2364 402 876
Baseline_task4_ResUNet 5.72 8.21 51.48 0.86 0.73 0.79 0.77 2364 402 876
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 10.07 43.89 0.87 0.65 0.74 0.69 2121 395 1119
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 10.18 46.67 0.85 0.67 0.75 0.70 2183 379 1057
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 10.13 45.19 0.86 0.67 0.75 0.70 2156 377 1084
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 10.15 47.22 0.88 0.67 0.76 0.71 2185 328 1055
Qian_SJTU_task4_1 Qian_2025_t4 7.84 11.27 47.72 0.84 0.71 0.78 0.75 2309 393 931
Qian_SJTU_task4_2 Qian_2025_t4 7.72 11.06 50.68 0.87 0.70 0.78 0.75 2286 309 954
Qian_SJTU_task4_3 Qian_2025_t4 4.43 10.30 22.84 0.59 0.49 0.54 0.49 1590 1110 1650
Qian_SJTU_task4_4 Qian_2025_t4 7.49 11.44 44.20 0.87 0.65 0.75 0.72 2117 280 1123
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 9.33 51.48 0.86 0.73 0.79 0.77 2364 402 876
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 9.41 22.41 0.89 0.40 0.52 0.46 1278 418 1962
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 8.97 22.41 0.89 0.40 0.52 0.46 1278 418 1962

Detailed analysis focused on quality of separated speech

The table shows the quality of the separated speech in the evaluation dataset. Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Audio Quality (PEAQ) were adopted as objective metrics.
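For reference, the sketch below shows one possible way to compute PESQ and STOI for a single reference/estimate pair using the commonly available `pesq` and `pystoi` Python packages. It is an illustration only, not the challenge's evaluation code; the file names are placeholders, mono signals are assumed, and PEAQ is omitted because there is no comparably standard Python implementation.

```python
# Illustrative sketch (not the challenge evaluation code): PESQ and STOI for one
# reference/estimate pair. File names are placeholders; signals are assumed mono.
import soundfile as sf
from scipy.signal import resample_poly
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("speech_reference.wav")  # ground-truth speech source
est, _ = sf.read("speech_estimate.wav")    # separated speech from a system

# Wideband PESQ is defined at 16 kHz, so resample the 32 kHz task audio first.
target_fs = 16000
ref16 = resample_poly(ref, target_fs, fs)
est16 = resample_poly(est, target_fs, fs)

print("PESQ:", pesq(target_fs, ref16, est16, "wb"))
print("STOI:", stoi(ref16, est16, target_fs, extended=False))
```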

Each row lists: Submission code | Technical report | CA-SDRi [dB] | Number of speech samples (out of 251) | PESQ mean | PESQ std | PESQ min | PESQ max | STOI mean | STOI std | STOI min | STOI max | PEAQ mean | PEAQ std | PEAQ min | PEAQ max
Bando_AIST_task4_1 Bando_2025_t4 5.17 233 2.50 0.61 1.26 4.07 0.85 0.10 0.46 0.98 -3.52 0.48 -3.91 -0.91
Bando_AIST_task4_2 Bando_2025_t4 7.55 243 2.46 0.61 1.07 4.06 0.85 0.11 0.46 0.98 -3.52 0.48 -3.91 -1.10
Bando_AIST_task4_3 Bando_2025_t4 5.26 184 2.47 0.57 1.22 4.02 0.88 0.08 0.55 0.99 -3.56 0.46 -3.91 -1.64
Bando_AIST_task4_4 Bando_2025_t4 7.52 226 2.35 0.61 1.06 4.02 0.86 0.09 0.45 0.98 -3.56 0.45 -3.91 -1.46
Choi_KAIST_task4_1 Choi_2025_t4 10.63 236 2.83 0.58 1.27 4.22 0.91 0.07 0.63 0.99 -3.42 0.49 -3.91 -0.62
Choi_KAIST_task4_2 Choi_2025_t4 10.80 241 2.82 0.56 1.27 4.23 0.91 0.07 0.58 0.99 -3.46 0.47 -3.91 -0.88
Choi_KAIST_task4_3 Choi_2025_t4 11.00 241 2.88 0.58 1.17 4.24 0.91 0.08 0.46 0.99 -3.43 0.48 -3.91 -0.63
Choi_KAIST_task4_4 Choi_2025_t4 10.85 231 2.89 0.57 1.30 4.23 0.91 0.07 0.58 0.99 -3.42 0.49 -3.91 -0.62
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 246 2.67 0.54 1.19 4.02 0.90 0.08 0.49 0.99 -3.46 0.50 -3.91 -0.95
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 246 2.77 0.58 1.04 4.18 0.90 0.10 0.08 0.99 -3.43 0.50 -3.91 -1.40
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 249 2.97 0.59 1.25 4.17 0.90 0.07 0.61 0.99 -3.40 0.50 -3.91 -0.74
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 249 2.97 0.60 1.24 4.18 0.90 0.07 0.61 0.99 -3.39 0.51 -3.91 -0.75
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 250 2.89 0.62 1.17 4.17 0.89 0.08 0.56 0.99 -3.36 0.51 -3.91 -0.87
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 250 2.91 0.61 1.22 4.18 0.89 0.08 0.58 0.99 -3.45 0.50 -3.91 -0.72
Baseline_task4_ResUNetK 6.60 246 2.39 0.63 1.09 4.07 0.84 0.11 0.39 0.99 -3.60 0.43 -3.91 -0.87
Baseline_task4_ResUNet 5.72 246 2.55 0.63 1.11 4.11 0.85 0.10 0.44 0.99 -3.56 0.47 -3.91 -0.98
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 249 2.40 0.62 1.06 4.07 0.85 0.11 0.38 0.99 -3.57 0.47 -3.91 -1.07
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 245 2.42 0.62 1.10 4.07 0.85 0.11 0.41 0.99 -3.57 0.46 -3.91 -1.07
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 246 2.40 0.61 1.10 4.07 0.85 0.11 0.41 0.99 -3.57 0.46 -3.91 -1.07
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 248 2.41 0.62 1.06 4.07 0.85 0.11 0.38 0.99 -3.57 0.46 -3.91 -1.07
Qian_SJTU_task4_1 Qian_2025_t4 7.84 248 2.40 0.59 1.09 4.08 0.84 0.12 0.04 0.98 -3.66 0.40 -3.91 -0.34
Qian_SJTU_task4_2 Qian_2025_t4 7.72 244 2.46 0.58 1.15 4.02 0.84 0.10 0.44 0.98 -3.55 0.48 -3.91 -0.98
Qian_SJTU_task4_3 Qian_2025_t4 4.43 210 2.22 0.60 1.07 3.80 0.80 0.19 -0.19 0.97 -3.64 0.50 -3.91 -1.36
Qian_SJTU_task4_4 Qian_2025_t4 7.49 247 2.41 0.59 1.10 4.08 0.84 0.12 0.06 0.98 -3.65 0.42 -3.91 -0.34
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 246 2.39 0.63 1.09 4.07 0.84 0.12 0.39 0.99 -3.60 0.43 -3.91 -0.87
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 58 2.79 0.70 1.29 4.08 0.88 0.11 0.44 0.99 -3.53 0.54 -3.91 -0.98
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 58 2.90 0.61 1.39 4.06 0.89 0.10 0.56 0.99 -3.41 0.57 -3.91 -0.93

System performance under partially known conditions

The table shows the separation and detection performance of each system under partially known conditions. The ‘Known IR Condition’ is the case where the evaluation data is synthesized with room impulse responses included in the training data. The ‘Known Target Condition’ is the case where the evaluation data is synthesized with target sound event samples included in the training data. The ‘Known Noise Condition’ is the case where the evaluation data is synthesized with background noise included in the training data. The ‘Known Interference Condition’ is the case where the evaluation data is synthesized with interference sound samples included in the training data.

Each row lists: Submission code | Technical report | CA-SDRi (full evaluation set) [dB] | Accuracy (full evaluation set) [%] | Known IR CA-SDRi [dB] | Known IR accuracy [%] | Known Target CA-SDRi [dB] | Known Target accuracy [%] | Known Noise CA-SDRi [dB] | Known Noise accuracy [%] | Known Interference CA-SDRi [dB] | Known Interference accuracy [%]
Bando_AIST_task4_1 Bando_2025_t4 5.17 29.20 7.63 37.78 8.90 62.96 5.97 33.33 4.53 26.85
Bando_AIST_task4_2 Bando_2025_t4 7.55 49.51 10.34 61.11 9.94 68.52 8.81 58.33 7.01 48.15
Bando_AIST_task4_3 Bando_2025_t4 5.26 31.98 6.89 33.33 8.10 51.85 5.80 25.93 4.90 26.85
Bando_AIST_task4_4 Bando_2025_t4 7.52 48.58 10.19 51.11 9.19 61.11 8.11 51.85 7.29 50.93
Choi_KAIST_task4_1 Choi_2025_t4 10.63 53.52 13.49 53.33 10.79 64.81 11.65 62.04 11.35 63.89
Choi_KAIST_task4_2 Choi_2025_t4 10.80 56.42 13.76 56.67 10.16 59.26 11.77 67.59 10.78 62.96
Choi_KAIST_task4_3 Choi_2025_t4 11.00 55.80 14.53 61.11 10.60 59.26 12.11 67.59 11.59 66.67
Choi_KAIST_task4_4 Choi_2025_t4 10.85 54.26 13.32 52.22 10.89 63.89 11.85 63.89 11.19 62.04
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 51.54 11.08 48.89 11.25 75.00 9.68 53.70 8.49 47.22
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 51.54 11.87 48.89 11.06 75.00 9.92 53.70 9.12 47.22
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 61.30 12.75 68.89 12.37 92.59 10.02 61.11 9.22 58.33
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 61.60 12.68 70.00 12.36 91.67 9.80 61.11 9.23 59.26
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 51.98 10.95 58.89 12.17 91.67 9.07 48.15 8.37 50.00
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 57.28 12.13 67.78 11.96 90.74 9.69 61.11 8.91 56.48
Baseline_task4_ResUNetK 6.60 51.48 7.91 48.89 9.88 75.00 6.41 53.70 5.95 47.22
Baseline_task4_ResUNet 5.72 51.48 7.03 48.89 9.51 75.00 5.27 53.70 5.08 47.22
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 43.89 7.88 48.89 8.45 54.63 6.58 39.81 5.77 45.37
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 46.67 7.46 43.33 8.30 51.85 7.07 43.52 5.87 38.89
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 45.19 8.12 46.67 8.34 52.78 7.07 44.44 5.41 40.74
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 47.22 8.05 46.67 8.43 54.63 7.13 44.44 5.72 42.59
Qian_SJTU_task4_1 Qian_2025_t4 7.84 47.72 10.58 52.22 10.66 88.89 8.55 50.93 7.32 47.22
Qian_SJTU_task4_2 Qian_2025_t4 7.72 50.68 9.54 54.44 9.86 80.56 8.39 49.07 7.19 52.78
Qian_SJTU_task4_3 Qian_2025_t4 4.43 22.84 5.86 27.78 5.43 32.41 5.33 29.63 3.81 18.52
Qian_SJTU_task4_4 Qian_2025_t4 7.49 44.20 10.34 52.22 11.04 95.37 8.04 46.30 7.24 45.37
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 51.48 7.91 48.89 9.88 75.00 6.41 53.70 5.95 47.22
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 22.41 3.86 17.78 5.15 28.70 3.89 30.56 2.86 26.85
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 22.41 3.64 17.78 5.21 28.70 3.59 30.56 2.73 26.85

System performance in more challenging conditions

This table shows the performance of each system under the ‘Multiple Same Class Condition’, in which multiple sound events of the same class are included in the same clip.

Each row lists: Submission code | Technical report | CA-SDRi (full evaluation set) [dB] | Accuracy (full evaluation set) [%] | Multiple Same Class CA-SDRi [dB] | Multiple Same Class accuracy [%]
Bando_AIST_task4_1 Bando_2025_t4 5.17 29.20 1.89 41.67
Bando_AIST_task4_2 Bando_2025_t4 7.55 49.51 3.98 60.19
Bando_AIST_task4_3 Bando_2025_t4 5.26 31.98 2.53 47.22
Bando_AIST_task4_4 Bando_2025_t4 7.52 48.58 4.12 64.81
Choi_KAIST_task4_1 Choi_2025_t4 10.63 53.52 4.15 67.59
Choi_KAIST_task4_2 Choi_2025_t4 10.80 56.42 3.79 62.96
Choi_KAIST_task4_3 Choi_2025_t4 11.00 55.80 3.84 62.04
Choi_KAIST_task4_4 Choi_2025_t4 10.85 54.26 3.89 66.67
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 51.54 2.97 65.74
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 51.54 3.09 65.74
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 61.30 5.01 72.22
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 61.60 4.87 73.15
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 51.98 3.99 62.96
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 57.28 4.45 69.44
Baseline_task4_ResUNetK 6.60 51.48 3.53 65.74
Baseline_task4_ResUNet 5.72 51.48 2.21 65.74
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 43.89 3.17 63.89
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 46.67 3.49 65.74
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 45.19 2.88 62.04
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 47.22 3.33 65.74
Qian_SJTU_task4_1 Qian_2025_t4 7.84 47.72 2.62 61.11
Qian_SJTU_task4_2 Qian_2025_t4 7.72 50.68 2.89 64.81
Qian_SJTU_task4_3 Qian_2025_t4 4.43 22.84 0.69 22.22
Qian_SJTU_task4_4 Qian_2025_t4 7.49 44.20 2.51 64.81
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 51.48 3.53 65.74
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 22.41 1.16 36.11
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 22.41 1.05 36.11

System characteristics

General characteristics

Each row lists: Submission code | Technical report | CA-SDRi (eval) [dB] | Label prediction accuracy (eval) [%] | Input sampling rate | Input acoustic features
Bando_AIST_task4_1 Bando_2025_t4 5.17 29.20 32kHz waveform, spectrogram
Bando_AIST_task4_2 Bando_2025_t4 7.55 49.51 32kHz waveform, spectrogram
Bando_AIST_task4_3 Bando_2025_t4 5.26 31.98 32kHz waveform, spectrogram
Bando_AIST_task4_4 Bando_2025_t4 7.52 48.58 32kHz waveform, spectrogram
Choi_KAIST_task4_1 Choi_2025_t4 10.63 53.52 32kHz waveform, spectrogram
Choi_KAIST_task4_2 Choi_2025_t4 10.80 56.42 32kHz waveform, spectrogram
Choi_KAIST_task4_3 Choi_2025_t4 11.00 55.80 32kHz waveform, spectrogram
Choi_KAIST_task4_4 Choi_2025_t4 10.85 54.26 32kHz waveform, spectrogram
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 51.54 32kHz waveform, spectrogram
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 51.54 32kHz waveform, spectrogram
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 61.30 32kHz waveform, spectrogram
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 61.60 32kHz waveform, spectrogram
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 51.98 32kHz waveform, spectrogram
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 57.28 32kHz waveform, spectrogram
Baseline_task4_ResUNetK 6.60 51.48 32kHz waveform, spectrogram
Baseline_task4_ResUNet 5.72 51.48 32kHz waveform, spectrogram
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 43.89 32kHz waveform, spectrogram, spectral_roll-off
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 46.67 32kHz waveform, spectrogram, chroma
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 45.19 32kHz waveform, spectrogram, spectral_roll-off, chroma
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 47.22 32kHz waveform, spectrogram, spectral_roll-off, chroma
Qian_SJTU_task4_1 Qian_2025_t4 7.84 47.72 32kHz waveform, log mel spectrogram
Qian_SJTU_task4_2 Qian_2025_t4 7.72 50.68 32kHz spectrogram, log mel spectrogram
Qian_SJTU_task4_3 Qian_2025_t4 4.43 22.84 32kHz waveform
Qian_SJTU_task4_4 Qian_2025_t4 7.49 44.20 32kHz waveform, log mel spectrogram, fbank
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 51.48 32kHz waveform, spectrogram
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 22.41 32kHz waveform, spectrogram
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 22.41 32kHz waveform, spectrogram

Machine learning characteristics

Each row lists: Submission code | Technical report | CA-SDRi (eval) [dB] | Label prediction accuracy (eval) [%] | Machine learning method | Loss function | Training dataset | Data augmentation | Pretrained models
Bando_AIST_task4_1 Bando_2025_t4 5.17 29.20 Neural blind source separation model (ISS-enhanced RE-SepFormer) CE, SNR loss DCASE2025Task4Dataset SpecAug, dynamic mixing
Bando_AIST_task4_2 Bando_2025_t4 7.55 49.51 Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) CE, SNR loss DCASE2025Task4Dataset SpecAug, dynamic mixing BEATs
Bando_AIST_task4_3 Bando_2025_t4 5.26 31.98 Neural blind source separation model (ISS-enhanced RE-SepFormer) CE, SNR loss DCASE2025Task4Dataset,AudioSet SpecAug, dynamic mixing
Bando_AIST_task4_4 Bando_2025_t4 7.52 48.58 Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) CE, SNR loss DCASE2025Task4Dataset,AudioSet SpecAug, dynamic mixing BEATs
Choi_KAIST_task4_1 Choi_2025_t4 10.63 53.52 ResUNet-based separation model, M2D-based audio tagging model SA-SDR, CE, KL-divergence, ArcFace DCASE2025Task4Dataset M2D,ATST
Choi_KAIST_task4_2 Choi_2025_t4 10.80 56.42 ResUNet-based separation model, M2D-based audio tagging model SA-SDR, CE, KL-divergence, ArcFace DCASE2025Task4Dataset M2D,ATST
Choi_KAIST_task4_3 Choi_2025_t4 11.00 55.80 ResUNet-based separation model, M2D-based audio tagging model SA-SDR, CE, KL-divergence, ArcFace DCASE2025Task4Dataset M2D,ATST
Choi_KAIST_task4_4 Choi_2025_t4 10.85 54.26 ResUNet-based separation model, M2D-based audio tagging model SA-SDR, CE, KL-divergence, ArcFace DCASE2025Task4Dataset M2D,ATST
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 51.54 TSTF_v1-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 51.54 TSTF_v1-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 61.30 ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging BCE, SDR loss DCASE2025Task4Dataset,FOA-MEIR SNR Range Augmentation BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 61.60 ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging BCE, SDR loss DCASE2025Task4Dataset,FOA-MEIR SNR Range Augmentation BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 51.98 ResUNet-based separation model (AudioSep); M2D model for audio tagging BCE, SDR loss DCASE2025Task4Dataset,FOA-MEIR SNR Range Augmentation BEATs,M2D,AudioSep
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 57.28 ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging BCE, SDR loss DCASE2025Task4Dataset SNR Range Augmentation BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep
Baseline_task4_ResUNetK 6.60 51.48 ResUNet-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D
Baseline_task4_ResUNet 5.72 51.48 ResUNet-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 43.89 Baseline separation model, M2D and spectral rolloff feature based audio tagging model BCE, SDR loss DCASE2025Task4Dataset,AudioSet M2D,ResUNetK
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 46.67 Baseline separation model, M2D and chroma feature based audio tagging model BCE, SDR loss DCASE2025Task4Dataset,AudioSet M2D,ResUNetK
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 45.19 Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model BCE, SDR loss DCASE2025Task4Dataset,AudioSet M2D,ResUNetK
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 47.22 Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model BCE, SDR loss DCASE2025Task4Dataset,AudioSet M2D,ResUNetK
Qian_SJTU_task4_1 Qian_2025_t4 7.84 47.72 Sepformer-based separation model, M2D-based multi-channel audio tagging model BCE, masked SDR loss DCASE2025Task4Dataset M2D
Qian_SJTU_task4_2 Qian_2025_t4 7.72 50.68 BSRNN-based separation model, M2D-based multi-channel audio tagging model BCE, PIT SDR loss, masked SDR loss DCASE2025Task4Dataset M2D
Qian_SJTU_task4_3 Qian_2025_t4 4.43 22.84 SepEDA-based audio tagging and separation model BCE, PIT SDR loss DCASE2025Task4Dataset
Qian_SJTU_task4_4 Qian_2025_t4 7.49 44.20 Sepformer-based separation model, M2D-based multi-channel audio tagging model + BEATs-based audio tagging model BCE, masked SDR loss DCASE2025Task4Dataset M2D,BEATs
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 51.48 Attentive ResUNet-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset SpecAugment M2D
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 22.41 ResUNet-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 22.41 ResUNet-based separation model, M2D-based audio tagging model BCE, SDR loss DCASE2025Task4Dataset M2D

Complexity

Each row lists: Submission code | Technical report | CA-SDRi (eval) [dB] | Label prediction accuracy (eval) [%] | Number of ensemble subsystems | Number of parameters
Bando_AIST_task4_1 Bando_2025_t4 5.17 29.20 10 25.2M
Bando_AIST_task4_2 Bando_2025_t4 7.55 49.51 10 117M
Bando_AIST_task4_3 Bando_2025_t4 5.26 31.98 10 25.2M
Bando_AIST_task4_4 Bando_2025_t4 7.52 48.58 10 117M
Choi_KAIST_task4_1 Choi_2025_t4 10.63 53.52 1 204M
Choi_KAIST_task4_2 Choi_2025_t4 10.80 56.42 1 204M
Choi_KAIST_task4_3 Choi_2025_t4 11.00 55.80 1 204M
Choi_KAIST_task4_4 Choi_2025_t4 10.85 54.26 1 204M
Wu_SUSTech_task4_submission_1 FulinWu_2025_t4 9.11 51.54 1 88.3M
Wu_SUSTech_task4_submission_2 FulinWu_2025_t4 9.73 51.54 1 87.4M
Morocutti_CPJKU_task4_1 Morocutti_2025_t4 9.76 61.30 118 10821.00M
Morocutti_CPJKU_task4_2 Morocutti_2025_t4 9.77 61.60 58 5359.00M
Morocutti_CPJKU_task4_3 Morocutti_2025_t4 8.91 51.98 1 228.60M
Morocutti_CPJKU_task4_4 Morocutti_2025_t4 8.99 57.28 33 3034.20M
Baseline_task4_ResUNetK 6.60 51.48 1 115.40M
Baseline_task4_ResUNet 5.72 51.48 1 115.38M
Park_GIST-HanwhaVision_task4_1 Park_2025_t4 6.38 43.89 1 116.6M
Park_GIST-HanwhaVision_task4_2 Park_2025_t4 6.64 46.67 1 115.9M
Park_GIST-HanwhaVision_task4_3 Park_2025_t4 6.50 45.19 1 116.6M
Park_GIST-HanwhaVision_task4_4 Park_2025_t4 6.69 47.22 4 464.5M
Qian_SJTU_task4_1 Qian_2025_t4 7.84 47.72 1 105.1M
Qian_SJTU_task4_2 Qian_2025_t4 7.72 50.68 1 204M
Qian_SJTU_task4_3 Qian_2025_t4 4.43 22.84 1 8.9M
Qian_SJTU_task4_4 Qian_2025_t4 7.49 44.20 2 195.4M
Stergioulis_UTh_task4_submission_1 Stergioulis_2025_t4 6.60 51.48 1 115.41M
Zhang_BUPT_task4_1 Zhang_2025_t4 3.84 22.41 1 115.40M
Zhang_BUPT_task4_2 Zhang_2025_t4 3.54 22.41 1 115.38M



Representative example of separated audio samples

Evaluation set

The following table shows separated sound samples from the evaluation set. Here, representative outputs from systems ranked 1 to 3 and the baseline are selected.

Success case
Oracle: Speech, Buzzer, Doorbell
Choi_KAIST_task4_3 (Rank 1): Speech, Buzzer, Doorbell; CA-SDRi (this sample) = 17.32 dB
Morocutti_CPJKU_task4_2 (Rank 2): Speech, Buzzer, Doorbell; CA-SDRi (this sample) = 17.00 dB
Wu_SUSTech_task4_2 (Rank 3): Speech, Buzzer, Doorbell; CA-SDRi (this sample) = 16.93 dB
Baseline_task4_ResUNetK: Speech, Buzzer, Doorbell; CA-SDRi (this sample) = 11.87 dB

Challenging case
Oracle: Pour, Percussion, Cough
Choi_KAIST_task4_3 (Rank 1): Pour, Percussion, --; CA-SDRi (this sample) = 7.19 dB
Morocutti_CPJKU_task4_2 (Rank 2): Pour, --, --; CA-SDRi (this sample) = 3.89 dB
Wu_SUSTech_task4_2 (Rank 3): Pour, --, --; CA-SDRi (this sample) = 4.63 dB
Baseline_task4_ResUNetK: Pour, --, --; CA-SDRi (this sample) = 0.21 dB

Real recording

The following table shows separated sound samples from mixtures recorded using a first-order ambisonic microphone. Here, representative outputs from the systems ranked 1 to 3 and the baseline are selected.

Indoor
Choi_KAIST_task4_3 (Rank 1): Blender, Doorbell, Cough
Morocutti_CPJKU_task4_2 (Rank 2): --, Doorbell, Cough, HairDryer (FP)
Wu_SUSTech_task4_2 (Rank 3): --, Doorbell, Cough
Baseline_task4_ResUNetK: --, Doorbell, Cough

Outdoor
Choi_KAIST_task4_3 (Rank 1): --, BicycleBell, MusicalKeyboard (FP)
Morocutti_CPJKU_task4_2 (Rank 2): --, BicycleBell, MusicalKeyboard (FP)
Wu_SUSTech_task4_2 (Rank 3): Speech, BicycleBell, MusicalKeyboard (FP)
Baseline_task4_ResUNetK: Speech, BicycleBell, MusicalKeyboard (FP)

Technical reports

A HYBRID S5 SYSTEM BASED ON NEURAL BLIND SOURCE SEPARATION

Yuto Nozaki1, Shun Sakurai1,2, Yoshiaki Bando1, Kohei Saijo1,3, Keisuke Imoto1,4, Masaki Onishi1
1National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, 2University of Tsukuba, Ibaraki, Japan, 3Waseda University, Tokyo, Japan, 4Kyoto University, Kyoto, Japan

Abstract

In this paper, we report our hybrid system for the DCASE 2025 Challenge Task 4 based on neural blind source separation (BSS). This task, called spatial semantic segmentation of sound scenes (S5), aims to detect and separate sound events from a multichannel mixture signal. To make the separation model robust against unseen audio environments, we leverage neural BSS to combine robust statistical signal processing and high-performing neural modeling. Specifically, our network architecture incorporates the iterative source steering algorithm to separate source signals using spatial statistics. The network is trained via multitask learning of source separation and classification with permutation invariant training. In addition, to improve the performance, we utilized an audio foundation model called BEATs and augmented the training data by curating AudioSet. The experimental results on the official development test set show that our best system (System 2) improved more than 2 dB in class-aware signal-to-distortion ratio improvement (CA-SDRi) from the official baseline system.
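As an illustration of the kind of objective named in the abstract (a permutation-invariant combination of an SNR separation loss and a classification loss), the following PyTorch sketch searches source permutations jointly over both terms. It is a generic example under assumed tensor shapes and module names, not the authors' implementation.

```python
# Hedged sketch of a PIT-style multitask loss: permutation-invariant SNR loss for
# separation plus cross-entropy for the per-source class labels. Not the authors' code.
import itertools
import torch
import torch.nn.functional as F

def snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SNR in dB for waveforms shaped (n_src, samples)."""
    noise = est - ref
    snr = 10 * torch.log10((ref.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr

def pit_multitask_loss(est_wavs, ref_wavs, class_logits, class_targets, alpha=1.0):
    """est_wavs/ref_wavs: (batch, n_src, samples); class_logits: (batch, n_src, n_classes);
    class_targets: (batch, n_src) integer labels. Brute-force permutation search,
    which is fine for a small number of sources."""
    batch, n_src, _ = est_wavs.shape
    losses = []
    for b in range(batch):
        best = None
        for perm in itertools.permutations(range(n_src)):
            perm = list(perm)
            sep = snr_loss(est_wavs[b, perm], ref_wavs[b]).mean()
            cls = F.cross_entropy(class_logits[b, perm], class_targets[b])
            total = sep + alpha * cls
            best = total if best is None else torch.minimum(best, total)
        losses.append(best)
    return torch.stack(losses).mean()
```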

PDF

SELF-GUIDED TARGET SOUND EXTRACTION AND CLASSIFICATION THROUGH UNIVERSAL SOUND SEPARATION MODEL AND MULTIPLE CLUES

Younghoo Kwon1, Dongheon Lee1, Dohwan Kim1, Jung-Woo Choi1
1KAIST, School of Electrical Engineering, Daejeon, South Korea

Abstract

This paper presents a multi-stage framework that integrates Universal Sound Separation (USS) and Target Sound Extraction (TSE) for sound separation and classification. In the first stage, USS is applied to decompose the input audio into multiple source components. Each separated source is then individually classified to generate two types of clues: enrollment and class clues. These clues are subsequently utilized in the second stage to guide the TSE process. By generating multiple clues in a self-guided manner, the proposed method enhances the performance of target sound extraction. The final output of the TSE module is then re-classified to improve the classification accuracy further.

PDF

TS-TFGRIDNET: EXTENDING TFGRIDNET FOR LABEL-QUERIED TARGET SOUND EXTRACTION VIA EMBEDDING CONCATENATION

Fulin Wu1, Zhong-Qiu Wang1
1Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China

Abstract

The DCASE2025 Challenge Task 4 - Spatial Semantic Segmentation of Sound Scenes (S5) challenges participants to separate a set of mixed sound events (sampled from 18 targeted sound events) into individual sound-event signals. The baseline system provided by the challenge organizers first performs audio tagging to identify the sound events present in the mixture, and then conducts label-queried target sound extraction (TSE) to extract the signal of each identified sound event. Building on the baseline system, we propose to improve the label-queried TSE component by using a novel model named Target Sound Extraction TF-GridNet (TS-TFGridNet), leveraging the strong capability of TF-GridNet at speech separation for TSE. TS-TFGridNet concatenates audio and sound-class embeddings along the frequency or feature dimension, thereby conditioning TF-GridNet to perform TSE. Clear improvement is observed over the baseline system.
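The conditioning idea described above (concatenating a sound-class embedding with the audio features before separation) can be sketched roughly as follows; the module name, dimensions, and the 1x1 projection are illustrative assumptions, not the TS-TFGridNet implementation.

```python
# Rough sketch of label-embedding conditioning by concatenation along the feature axis.
# Shapes and module names are illustrative only, not the submitted system.
import torch
import torch.nn as nn

class LabelConditioning(nn.Module):
    def __init__(self, n_classes: int, emb_dim: int, feat_dim: int):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        # Project back to the separator's expected feature size after concatenation.
        self.proj = nn.Conv1d(feat_dim + emb_dim, feat_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, class_idx: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time); class_idx: (batch,) target-class indices
        emb = self.embed(class_idx)                       # (batch, emb_dim)
        emb = emb.unsqueeze(-1).expand(-1, -1, feats.shape[-1])
        return self.proj(torch.cat([feats, emb], dim=1))  # (batch, feat_dim, time)

# Example usage with dummy tensors (18 target classes, as in the task):
cond = LabelConditioning(n_classes=18, emb_dim=64, feat_dim=256)
out = cond(torch.randn(2, 256, 100), torch.tensor([3, 7]))
```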

PDF

TRANSFORMER-AIDED AUDIO SOURCE SEPARATION WITH TEMPORAL GUIDANCE AND ITERATIVE REFINEMENT

Tobias Morocutti2, Florian Schmid1, Jonathan Greif1, Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University Linz, Austria

Abstract

This technical report presents the CP-JKU team’s system for Task 4 Spatial Semantic Segmentation of Sound Scenes of the DCASE 2025 Challenge. Building on the two-stage baseline of audio tagging followed by label-conditioned source separation, we introduce several key enhancements. We reframe the tagging stage as a sound event detection task using five Transformers pre-trained on AudioSet Strong, enabling temporally strong conditioning of the separator. We further inject the Transformer’s latent representations into a ResUNet separator initialized from AudioSep and extended with a Dual-Path RNN. Additionally, we propose an iterative refinement scheme that reuses the separator’s output as input. These improvements raise label prediction accuracy to 73.07% and CA-SDRi to 14.49 for a single-model system on the development test set, substantially surpassing the baseline.
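The iterative refinement idea mentioned above (re-running the separator with its own previous output as an additional input) can be outlined as in the hedged sketch below; `separator`, its call signature, and the zero initialization are placeholders rather than the CP-JKU system.

```python
# Hedged sketch of an iterative refinement loop: the separator is re-applied with its
# own previous estimate concatenated to the mixture. Interface is a placeholder.
import torch

def iterative_refinement(separator, mixture: torch.Tensor, cond, n_iters: int = 2):
    """mixture: (batch, channels, samples); cond: conditioning info (e.g. detected events)."""
    estimate = torch.zeros_like(mixture[:, :1])  # start from an empty estimate
    for _ in range(n_iters):
        # Feed the mixture together with the current estimate back into the separator.
        estimate = separator(torch.cat([mixture, estimate], dim=1), cond)
    return estimate
```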

PDF

PERFORMANCE IMPROVEMENT OF SPATIAL SEMANTIC SEGMENTATION WITH ENRICHED AUDIO FEATURES AND AGENT-BASED ERROR CORRECTION FOR DCASE 2025 CHALLENGE TASK 4

Jongyeon Park1, Joonhee Lee2, Do-Hyeon Lim1, Hong Kook Kim1,2, Hyeongcheol Geum3, Jeong Eun Lim3
1Dept. of AI Convergence, 2Dept. of EECS, Gwangju Institute of Science and Technology, Gwangju 61005, Korea, 3AI Lab., R&D Center, Hanwha Vision, Seongnam-si, Gyeonggi-do 13488, Korea

Abstract

This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.
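A rough sketch of how the additional spectral roll-off and chroma features could be extracted and stacked alongside a log-mel representation is given below, using librosa with its default frame settings; the file name and the simple frame-wise concatenation are illustrative assumptions, not the submitted system.

```python
# Illustrative sketch (not the authors' code) of extracting the additional features
# mentioned above with librosa; "clip.wav" is a placeholder file name.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=32000, mono=True)

log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)  # (1, frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # (12, frames)

# One simple way to combine them: stack the frame-aligned features for the tagging model.
features = np.concatenate([log_mel, rolloff, chroma], axis=0)
print(features.shape)
```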

PDF

SJTU-AUDIOCC SYSTEM FOR DCASE 2025 CHALLENGE TASK 4: SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES

Xin Zhou1, Hongyu Wang1, Chenda Li1, Bing Han1, Xinhu Zheng1, Yanmin Qian1
1Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China

Abstract

The present report introduces four systems developed by the AudioCC Lab at Shanghai Jiao Tong University for DCASE 2025 Task 4. The task at hand is to detect target sound events and separate the corresponding signals from a multi-channel mixture. It was found that the effective detection of sound events and extraction of the corresponding signals was challenging under conditions where the mixture consists of multiple target sound events, non-target sound events, and non-directional background noise. In order to address this challenge, we propose four systems. The first system represents an enhancement to the baseline system, the second is a multistage iterative system that is both novel and promising, the third is a lightweight model based on the Encoder-Decoder Attractor (EDA) module, and the fourth integrates multiple audio tagging models to achieve optimal performance. These four systems cover high performance, low overhead, and promising frameworks, providing a reference for future research on this task.

PDF

REDUX: AN ITERATIVE STRATEGY FOR SEMANTIC SOURCE SEPARATION

Vasileios Stergioulis1, Gerasimos Potamianos1
1Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece

Abstract

In this report, we present a system for the Spatial Semantic Segmentation of Sound Scenes (DCASE 2025 Task 4), combining enhanced source separation and label classification through an iterative verification strategy. Our approach integrates the Masked Modeling Duo (M2D) classifier with a separator architecture based on an attentive ResUNeXt network. Inspired by recent advances in universal audio modeling and self-supervised separation, our system incorporates feedback between multiple classification and separation stages to correct early-stage prediction errors. Specifically, classification outputs are verified using post-separation reclassification, and ambiguous cases are resolved through targeted waveform subtraction and re-analysis. This strategy enables improved source-label associations without increasing model complexity. Evaluated on the development set, our method achieves a relative improvement of 0.28% in CA-SDRi and 1.46% in accuracy over the baseline.

PDF

SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES BASED ON ADAPTER FINE-TUNING

Zehao Wang1, Sen Wang1, Zhicheng Zhang1, Jianqin Yin1
1Beijing University of Posts and Telecommunications, China

Abstract

In this work, we present our submission system for DCASE 2025 Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). Among them, we introduce the audio tagging (AT) and label-queried source separation (LSS) systems built on the fine-tuned M2D and the modified version of ResUNet. By introducing the bidirectional recurrent neural network (DPRNN) module into ResUNet and improving the Feature-wise Linear Modulation (FiLM) mechanism, the model’s ability to capture long-term dependent features in spatial audio and the flexibility of dynamic feature adjustment are enhanced. Experimental results show that the improved system outperforms the baseline system in class-aware evaluation metrics (CA-SDRi, CA-SI-SDRi), verifying the effectiveness of the method.
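The FiLM-style conditioning mentioned in the abstract can be illustrated with the minimal sketch below, in which a label embedding predicts per-channel scale and shift parameters; the dimensions and the residual-style (1 + scale) form are assumptions made for illustration, not the submitted system.

```python
# Minimal FiLM-style conditioning sketch: a label embedding produces per-channel
# scale and shift applied to a spectrogram-like feature map. Illustration only.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time); cond: (batch, cond_dim) label embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return x * (1 + scale) + shift

# Example usage with dummy tensors:
film = FiLM(cond_dim=64, n_channels=32)
out = film(torch.randn(2, 32, 257, 100), torch.randn(2, 64))
```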

PDF