Spatial Semantic Segmentation of Sound Scenes


Challenge results

Task description

A detailed task description can be found on the task description page.

Teams ranking

The table lists the best-ranked system of each participating team. The DCASE 2026 Task 4 baseline is included as a reference.

Columns: Submission Code; Technical Report; Official Team Rank; CAPI-SDRi (eval); Label Prediction Accuracy (mix) (eval); CAPI-SDRi (test); Label Prediction Accuracy (mix) (test). The (eval) columns refer to the Evaluation Set and the (test) columns to the Test (Development) Set.
Bando_AIST_task4_3 Bando_AIST2026 1 14.93 65.54 16.32 64.88
Choi_KAIST_task4_4 Choi_KAIST2026 2 12.98 64.88 14.65 66.07
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 3 12.94 76.92 14.95 78.11
Wang_SRCN_task4_2 Wang_SRCN2026 4 10.13 57.80 11.74 62.23
Park_SGU_task4_3 Park_SGU2026 5 10.10 53.17 11.42 56.09
You_PKU_task4_4 You_PKU2026 6 8.24 52.58 11.85 74.87
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 7 6.97 58.53 8.98 63.09
Wang_BUPT_task4_2 Wang_BUPT2026 8 6.90 56.61 8.81 61.44
Deng_WHU_task4_1 Deng_WHU2026 9 6.84 58.13 8.62 61.71
Nguyen_NTT_task4_1 10 6.77 56.55 8.17 57.14
Park_KUBIG_task4_1 Park_KUBIG2026 11 1.18 39.81 1.20 40.28

Systems ranking

The table shows the ranking of all submitted systems. The DCASE 2026 Task 4 baseline systems are included as references.

Columns: Submission Code; Technical Report; Official System Rank; CAPI-SDRi (eval); Label Prediction Accuracy (mix) (eval); CAPI-SDRi (test); Label Prediction Accuracy (mix) (test). The (eval) columns refer to the Evaluation Set and the (test) columns to the Test (Development) Set.
Bando_AIST_task4_3 Bando_AIST2026 1 14.93 65.54 16.32 64.88
Bando_AIST_task4_4 Bando_AIST2026 2 14.61 59.52 16.36 61.51
Bando_AIST_task4_2 Bando_AIST2026 3 14.34 58.40 16.45 62.30
Bando_AIST_task4_1 Bando_AIST2026 4 14.23 58.33 15.75 57.74
Choi_KAIST_task4_4 Choi_KAIST2026 5 12.98 64.88 14.65 66.07
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 6 12.94 76.92 14.95 78.11
Choi_KAIST_task4_1 Choi_KAIST2026 7 12.88 64.15 15.51 71.10
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 8 12.77 76.92 14.74 78.11
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 9 12.77 75.00 14.84 78.11
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 10 12.54 73.28 14.41 74.41
Choi_KAIST_task4_2 Choi_KAIST2026 11 12.32 59.59 14.80 66.40
Choi_KAIST_task4_3 Choi_KAIST2026 12 12.25 59.13 15.50 71.10
Wang_SRCN_task4_2 Wang_SRCN2026 13 10.13 57.80 11.74 62.23
Wang_SRCN_task4_1 Wang_SRCN2026 14 10.12 57.94 11.73 62.04
Park_SGU_task4_3 Park_SGU2026 15 10.10 53.17 11.42 56.09
Wang_SRCN_task4_3 Wang_SRCN2026 16 10.10 57.94 11.74 62.04
Park_SGU_task4_4 Park_SGU2026 17 10.08 53.64 11.43 54.70
Park_SGU_task4_2 Park_SGU2026 18 9.42 53.17 10.53 56.09
Wang_SRCN_task4_4 Wang_SRCN2026 19 9.20 45.63 11.26 51.98
You_PKU_task4_4 You_PKU2026 20 8.24 52.58 11.85 74.87
You_PKU_task4_1 You_PKU2026 21 8.24 52.58 11.86 74.87
You_PKU_task4_2 You_PKU2026 22 8.20 52.58 11.86 74.87
You_PKU_task4_3 You_PKU2026 23 8.20 52.58 11.86 74.87
Park_SGU_task4_1 Park_SGU2026 24 8.12 53.70 9.05 55.36
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 25 6.97 58.53 8.98 63.09
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 26 6.96 58.20 9.05 63.95
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 27 6.94 58.33 9.06 63.82
Wang_BUPT_task4_2 Wang_BUPT2026 28 6.90 56.61 8.81 61.44
Deng_WHU_task4_1 Deng_WHU2026 29 6.84 58.13 8.62 61.71
Nguyen_NTT_task4_1 30 6.77 56.55 8.17 57.14
Wang_BUPT_task4_1 Wang_BUPT2026 31 6.76 55.56 8.56 60.05
Nguyen_NTT_task4_2 32 6.76 55.16 8.49 60.71
Park_KUBIG_task4_1 Park_KUBIG2026 33 1.18 39.81 1.20 40.28
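The teams ranking appears to follow from the systems ranking by keeping each team's best-ranked system. The sketch below illustrates that reading on a few rows copied from the table above. It assumes, as a convention inferred from the submission codes rather than stated on this page, that the team label is the part of the code before `_task4_`; the helper names are hypothetical.

```python
# A few (submission code, official system rank) pairs copied from the table above.
systems = [
    ("Bando_AIST_task4_3", 1),
    ("Bando_AIST_task4_4", 2),
    ("Choi_KAIST_task4_4", 5),
    ("Saijo_Mitsubishi_task4_3", 6),
    ("Choi_KAIST_task4_1", 7),
    ("Wang_SRCN_task4_2", 13),
    ("Wang_SRCN_task4_1", 14),
]


def team_of(submission_code):
    # Assumption: the team label is everything before "_task4_".
    return submission_code.split("_task4_")[0]


def best_per_team(rows):
    """Return {team: best-ranked submission code}, ordered by system rank."""
    best = {}
    for code, _rank in sorted(rows, key=lambda row: row[1]):
        # setdefault keeps the first (best-ranked) system seen for each team.
        best.setdefault(team_of(code), code)
    return best


team_best = best_per_team(systems)
team_ranking = list(team_best.values())
```

Because Python dictionaries preserve insertion order, `team_ranking` lists the selected systems in team-rank order for the rows supplied.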

Supplementary metrics

Detailed analysis of joint scores, separation, and detection performance

All metrics in this table are evaluated on the evaluation set. CAPI-SDRi and CASA-SDRi are joint separation and label prediction scores computed by the official evaluator, while TP-SDRi is a separation-only score computed from matched true-positive source pairs. True Positive (TP), False Positive (FP), and False Negative (FN) are counted over target-source hypotheses after class-aware permutation-invariant matching. Accuracy (mix) is mixture-level label prediction accuracy, while Accuracy (src) is source-level label prediction accuracy.

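Columns: Submission Code; Technical Report; CAPI-SDRi; CASA-SDRi; TP-SDRi; Accuracy (mix); Accuracy (src); Precision; Recall; F-Score (micro); F-Score (macro); TP; FP; FN. CAPI-SDRi and CASA-SDRi are the joint separation and label prediction scores, TP-SDRi is the separation score, the accuracy, precision, recall, and F-score columns are the label prediction scores, and TP, FP, and FN are the counts.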
Bando_AIST_task4_3 Bando_AIST2026 14.93 14.92 20.15 65.54 73.80 0.91 0.81 0.85 0.85 2242 266 530
Bando_AIST_task4_4 Bando_AIST2026 14.61 14.59 20.90 59.52 69.61 0.91 0.77 0.82 0.81 2116 268 656
Bando_AIST_task4_2 Bando_AIST2026 14.34 14.33 20.91 58.40 67.95 0.90 0.76 0.81 0.80 2093 308 679
Bando_AIST_task4_1 Bando_AIST2026 14.23 14.22 20.87 58.33 67.48 0.92 0.73 0.81 0.79 2017 217 755
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 12.98 17.78 64.88 73.95 0.89 0.82 0.85 0.85 2279 310 493
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 12.94 15.96 76.92 82.84 0.93 0.90 0.91 0.91 2486 229 286
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 12.88 17.76 64.15 73.57 0.89 0.82 0.85 0.85 2277 323 495
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 12.76 15.72 76.92 82.84 0.93 0.90 0.91 0.91 2486 229 286
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 12.76 15.82 75.00 82.31 0.92 0.90 0.90 0.90 2499 264 273
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 12.54 15.90 73.28 80.45 0.90 0.90 0.89 0.89 2493 327 279
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 12.31 17.72 59.59 71.64 0.86 0.84 0.83 0.84 2314 458 458
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 12.24 17.69 59.13 71.47 0.85 0.84 0.83 0.83 2315 467 457
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 10.10 14.48 57.80 69.89 0.86 0.80 0.82 0.82 2221 406 551
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 10.09 14.46 57.94 69.92 0.85 0.81 0.82 0.82 2232 420 540
Park_SGU_task4_3 Park_SGU2026 10.10 10.07 14.78 53.17 68.26 0.90 0.75 0.81 0.81 2047 227 725
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 10.05 14.31 57.94 69.92 0.85 0.81 0.82 0.82 2232 420 540
Park_SGU_task4_4 Park_SGU2026 10.08 10.02 14.74 53.64 68.28 0.90 0.75 0.81 0.81 2051 232 721
Park_SGU_task4_2 Park_SGU2026 9.42 9.38 13.75 53.17 68.26 0.90 0.75 0.81 0.81 2047 227 725
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 9.18 14.49 45.63 62.11 0.75 0.81 0.77 0.77 2226 812 546
You_PKU_task4_4 You_PKU2026 8.24 8.16 12.08 52.58 68.08 0.83 0.82 0.81 0.81 2244 524 528
You_PKU_task4_1 You_PKU2026 8.24 8.16 12.01 52.58 68.08 0.83 0.82 0.81 0.81 2244 524 528
You_PKU_task4_2 You_PKU2026 8.20 8.13 11.97 52.58 68.08 0.83 0.82 0.81 0.81 2244 524 528
You_PKU_task4_3 You_PKU2026 8.20 8.13 11.96 52.58 68.08 0.83 0.82 0.81 0.81 2244 524 528
Park_SGU_task4_1 Park_SGU2026 8.12 8.06 11.82 53.70 68.68 0.91 0.75 0.81 0.81 2057 223 715
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 6.90 9.78 58.53 70.00 0.86 0.80 0.82 0.82 2196 365 576
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 6.89 9.79 58.20 69.87 0.86 0.80 0.82 0.82 2201 378 571
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 6.87 9.76 58.33 70.07 0.86 0.80 0.82 0.82 2201 369 571
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 6.84 9.92 56.61 68.95 0.84 0.81 0.82 0.82 2221 449 551
Deng_WHU_task4_1 Deng_WHU2026 6.84 6.80 9.70 58.13 69.58 0.85 0.80 0.82 0.82 2210 404 562
Nguyen_NTT_task4_1 6.77 6.72 9.74 56.55 68.00 0.83 0.80 0.81 0.81 2193 453 579
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 6.72 9.97 55.56 66.57 0.84 0.78 0.80 0.80 2147 453 625
Nguyen_NTT_task4_2 6.76 6.71 9.81 55.16 66.66 0.83 0.79 0.80 0.80 2175 491 597
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 0.98 2.84 39.81 56.68 0.70 0.77 0.72 0.73 2133 991 639
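As a rough guide to how the count columns relate to the detection scores, the sketch below computes pooled (micro-averaged) precision, recall, and F-score from TP, FP, and FN. This is the standard textbook definition, not the official evaluator's code; the function name is hypothetical. The averaging used for the Precision and Recall columns is not stated on this page, so pooled values need not reproduce those columns exactly.

```python
def detection_scores(tp, fp, fn):
    """Pooled precision, recall, and F-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f_micro = 2 * tp / denom if denom else 0.0
    return precision, recall, f_micro


# Counts copied from the Bando_AIST_task4_3 row above.
precision, recall, f_micro = detection_scores(tp=2242, fp=266, fn=530)
print(f"precision={precision:.4f} recall={recall:.4f} f_micro={f_micro:.4f}")
```

For this row the pooled recall and F-score round to the tabulated 0.81 and 0.85.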

Detailed analysis focused on quality of separated speech

The table shows the quality of the separated Speech-class target outputs in the evaluation dataset. Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Audio Quality (PEAQ) are reported as objective metrics.

Columns: Submission Code; Technical Report; CAPI-SDRi; PESQ mean; PESQ std; PESQ min; PESQ max; STOI mean; STOI std; STOI min; STOI max; PEAQ mean; PEAQ std; PEAQ min; PEAQ max.
Bando_AIST_task4_3 Bando_AIST2026 14.93 3.26 0.60 1.72 4.38 0.94 0.05 0.62 1.00 -2.37 0.85 -3.91 -0.12
Bando_AIST_task4_4 Bando_AIST2026 14.61 3.29 0.70 1.07 4.40 0.92 0.19 -0.05 1.00 -2.27 0.86 -3.91 -0.07
Bando_AIST_task4_2 Bando_AIST2026 14.34 3.18 0.68 1.37 4.39 0.95 0.05 0.64 1.00 -2.35 0.87 -3.91 -0.09
Bando_AIST_task4_1 Bando_AIST2026 14.23 3.23 0.68 1.30 4.37 0.95 0.04 0.76 1.00 -2.33 0.87 -3.91 -0.09
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 2.98 0.57 1.65 4.18 0.93 0.05 0.77 0.99 -2.93 0.78 -3.91 -0.33
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 3.07 0.51 1.85 4.23 0.93 0.04 0.76 0.99 -3.01 0.71 -3.91 -0.51
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 3.01 0.54 1.67 4.17 0.94 0.04 0.77 0.99 -2.93 0.78 -3.91 -0.34
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 2.92 0.58 1.63 4.23 0.92 0.06 0.72 0.99 -3.03 0.71 -3.91 -0.40
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 2.88 0.56 1.55 4.14 0.92 0.05 0.71 0.99 -3.02 0.71 -3.91 -0.47
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 3.06 0.52 1.85 4.23 0.93 0.05 0.62 0.99 -3.01 0.72 -3.91 -0.51
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 2.97 0.58 1.43 4.18 0.93 0.05 0.66 0.99 -2.94 0.78 -3.91 -0.33
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 2.97 0.57 1.67 4.18 0.93 0.05 0.77 0.99 -2.94 0.78 -3.91 -0.33
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 2.70 0.56 1.47 4.07 0.91 0.05 0.68 0.99 -3.07 0.71 -3.91 -0.54
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 2.70 0.56 1.47 4.07 0.91 0.05 0.68 0.99 -3.07 0.71 -3.91 -0.54
Park_SGU_task4_3 Park_SGU2026 10.10 2.57 0.70 1.08 4.07 0.86 0.18 0.13 0.99 -3.08 0.83 -3.91 -0.31
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 2.42 0.66 1.06 4.00 0.88 0.12 -0.03 0.99 -3.40 0.68 -3.91 -0.46
Park_SGU_task4_4 Park_SGU2026 10.08 2.57 0.70 1.07 4.08 0.86 0.17 0.08 0.99 -3.08 0.82 -3.91 -0.36
Park_SGU_task4_2 Park_SGU2026 9.42 2.53 0.70 1.08 4.03 0.86 0.19 0.14 0.99 -3.15 0.80 -3.91 -0.31
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 2.59 0.60 1.18 4.05 0.90 0.06 0.66 0.99 -3.14 0.71 -3.91 -0.54
You_PKU_task4_4 You_PKU2026 8.24 2.47 0.77 1.15 4.16 0.86 0.14 0.33 0.99 -3.27 0.73 -3.91 -0.50
You_PKU_task4_1 You_PKU2026 8.24 2.37 0.84 1.09 4.16 0.81 0.19 0.29 0.99 -3.28 0.72 -3.91 -0.50
You_PKU_task4_2 You_PKU2026 8.20 2.47 0.75 1.15 4.16 0.87 0.12 0.40 0.99 -3.28 0.72 -3.91 -0.50
You_PKU_task4_3 You_PKU2026 8.20 2.47 0.75 1.15 4.16 0.87 0.12 0.40 0.99 -3.28 0.72 -3.91 -0.50
Park_SGU_task4_1 Park_SGU2026 8.12 2.29 0.74 1.08 3.92 0.82 0.21 0.11 0.99 -3.36 0.72 -3.91 -0.39
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 1.89 0.66 1.09 3.91 0.75 0.18 0.30 0.99 -3.54 0.61 -3.91 -0.60
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 1.89 0.66 1.09 3.91 0.75 0.18 0.30 0.99 -3.54 0.61 -3.91 -0.60
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 1.90 0.66 1.08 3.93 0.75 0.18 0.31 0.99 -3.53 0.61 -3.91 -0.47
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 1.90 0.66 1.09 3.91 0.75 0.18 0.30 0.99 -3.54 0.61 -3.91 -0.60
Deng_WHU_task4_1 Deng_WHU2026 6.84 1.79 0.60 1.09 4.00 0.74 0.18 0.27 0.98 -3.53 0.61 -3.91 -0.63
Nguyen_NTT_task4_1 6.77 1.88 0.65 1.11 3.91 0.75 0.18 0.30 0.99 -3.54 0.60 -3.91 -0.61
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 1.89 0.65 1.09 3.91 0.75 0.18 0.30 0.99 -3.53 0.61 -3.91 -0.60
Nguyen_NTT_task4_2 6.76 1.90 0.65 1.11 3.93 0.76 0.18 0.29 0.99 -3.53 0.61 -3.91 -0.57
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 1.91 0.55 1.11 3.40 0.84 0.09 0.57 0.98 -2.73 0.60 -3.91 -1.21

System performance under partially known conditions

The table shows the separation and mixture-level label prediction performance of each system under partially known conditions. The Known IR condition uses evaluation mixtures synthesized with room impulse responses included in the training data. The Known Target condition uses evaluation mixtures synthesized with target sound event samples included in the training data. The Known Noise condition uses evaluation mixtures synthesized with background noise included in the training data. The Known Interference condition uses evaluation mixtures synthesized with interference sound samples included in the training data.

Columns: Submission Code; Technical Report; CAPI-SDRi; Accuracy (mix); Known IR CAPI-SDRi; Known IR Accuracy (mix); Known Target CAPI-SDRi; Known Target Accuracy (mix); Known Noise CAPI-SDRi; Known Noise Accuracy (mix); Known Interference CAPI-SDRi; Known Interference Accuracy (mix). The first CAPI-SDRi and Accuracy (mix) columns are measured on the full evaluation set.
Bando_AIST_task4_3 Bando_AIST2026 14.93 65.54 16.21 69.44 15.73 70.37 15.39 67.59 15.62 70.37
Bando_AIST_task4_4 Bando_AIST2026 14.61 59.52 15.62 60.32 15.13 62.04 15.08 61.11 16.14 68.98
Bando_AIST_task4_2 Bando_AIST2026 14.34 58.40 15.62 61.90 15.70 65.74 15.55 65.28 15.91 68.52
Bando_AIST_task4_1 Bando_AIST2026 14.23 58.33 16.12 64.68 15.30 62.50 15.27 61.11 15.45 65.28
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 64.88 13.97 63.89 13.82 68.98 13.63 68.06 13.71 70.83
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 76.92 13.69 77.38 14.52 86.11 13.70 80.09 14.10 85.65
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 64.15 13.75 62.30 13.53 67.59 13.80 68.98 14.13 72.69
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 76.92 13.55 77.38 14.28 86.11 13.56 80.09 13.90 85.65
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 75.00 13.51 76.19 14.29 83.80 13.50 77.78 13.90 85.19
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 73.28 13.19 73.41 13.67 79.17 13.06 73.61 13.62 81.48
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 59.59 13.51 60.71 12.77 62.96 12.96 62.04 13.75 71.76
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 59.13 13.68 61.11 13.02 64.81 12.84 61.57 13.73 72.22
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 57.80 10.63 56.35 10.91 65.28 10.48 61.11 11.03 65.74
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 57.94 10.51 55.95 10.93 65.74 10.40 60.19 10.93 64.81
Park_SGU_task4_3 Park_SGU2026 10.10 53.17 10.86 53.97 11.00 68.06 10.59 54.17 11.27 62.04
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 57.94 10.49 55.95 10.99 65.74 10.33 60.19 11.18 64.81
Park_SGU_task4_4 Park_SGU2026 10.08 53.64 4.10 29.37 11.22 66.20 10.45 53.24 10.92 60.19
Park_SGU_task4_2 Park_SGU2026 9.42 53.17 9.94 53.97 10.26 68.06 9.97 54.17 10.48 62.04
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 45.63 9.29 43.25 10.79 52.78 10.06 49.54 9.79 47.69
You_PKU_task4_4 You_PKU2026 8.24 52.58 8.78 51.59 8.97 50.93 8.39 53.24 8.70 57.41
You_PKU_task4_1 You_PKU2026 8.24 52.58 8.91 51.59 8.96 50.93 8.35 53.24 8.65 57.41
You_PKU_task4_2 You_PKU2026 8.20 52.58 8.94 51.59 8.82 50.93 8.38 53.24 8.60 57.41
You_PKU_task4_3 You_PKU2026 8.20 52.58 8.94 51.59 8.81 50.93 8.38 53.24 8.60 57.41
Park_SGU_task4_1 Park_SGU2026 8.12 53.70 8.76 54.76 8.90 68.52 8.28 54.17 9.03 65.74
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 58.53 6.94 56.75 9.23 72.22 7.31 57.87 7.79 64.35
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 58.20 6.93 57.54 9.24 72.22 7.32 59.26 7.81 63.89
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 58.33 7.00 57.14 9.20 72.22 7.35 59.26 7.79 63.89
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 56.61 6.94 53.17 9.26 71.30 7.37 59.72 7.51 63.89
Deng_WHU_task4_1 Deng_WHU2026 6.84 58.13 6.68 58.33 8.95 70.83 7.33 59.26 7.34 63.89
Nguyen_NTT_task4_1 6.77 56.55 6.76 52.38 8.98 70.37 7.10 56.02 7.43 62.04
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 55.56 6.63 52.78 9.22 70.83 7.40 59.72 7.84 65.74
Nguyen_NTT_task4_2 6.76 55.16 6.65 55.16 9.01 70.37 6.79 52.78 7.81 65.74
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 39.81 0.67 39.29 1.08 37.50 1.37 45.83 1.23 42.59
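One way to read the table above is as the CAPI-SDRi difference between each partially known condition and the full evaluation set. The sketch below does that arithmetic for two rows copied from the table. The data layout and the `gains` helper are illustrative assumptions, and since the conditions cover different subsets of mixtures, the differences are only indicative.

```python
# (overall CAPI-SDRi, CAPI-SDRi per partially known condition), copied from the table.
rows = {
    "Bando_AIST_task4_3": (14.93, {
        "Known IR": 16.21, "Known Target": 15.73,
        "Known Noise": 15.39, "Known Interference": 15.62,
    }),
    "Jeong_Medisensing_task4_1": (6.97, {
        "Known IR": 6.94, "Known Target": 9.23,
        "Known Noise": 7.31, "Known Interference": 7.79,
    }),
}


def gains(overall, by_condition):
    """Difference in dB between each condition's score and the overall score."""
    return {name: round(score - overall, 2) for name, score in by_condition.items()}


all_gains = {code: gains(overall, conds) for code, (overall, conds) in rows.items()}
for code, deltas in all_gains.items():
    print(code, deltas)
```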

System performance by target-source overlap condition

This table shows performance by target-source overlap condition. Each condition is denoted as (N, M), where N is the number of active target sound sources in the mixture and M is the number of active target sound sources involved in same-class overlap. M=0 means that no same-class target overlap is present. (0,0) denotes mixtures with no target sound source, for which CAPI-SDRi is not defined; the (0,0) column therefore reports the number of false positives (FP) in zero-target mixtures.

Columns: Submission Code; Technical Report; CAPI-SDRi; Accuracy (mix); FP (0,0); CAPI-SDRi (1,0); CAPI-SDRi (2,0); CAPI-SDRi (2,2); CAPI-SDRi (3,0); CAPI-SDRi (3,2); CAPI-SDRi (3,3). The first CAPI-SDRi and Accuracy (mix) columns are measured on the full evaluation set; the remaining columns are the target-source overlap conditions.
Bando_AIST_task4_3 Bando_AIST2026 14.93 65.54 33 12.26 15.23 16.86 15.51 16.37 17.08
Bando_AIST_task4_4 Bando_AIST2026 14.61 59.52 32 12.01 14.37 16.77 15.02 15.88 17.25
Bando_AIST_task4_2 Bando_AIST2026 14.34 58.40 43 11.73 14.64 16.52 14.87 16.01 16.58
Bando_AIST_task4_1 Bando_AIST2026 14.23 58.33 26 11.96 13.84 16.80 13.81 15.56 16.61
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 64.88 51 10.66 13.02 14.00 15.34 14.89 14.09
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 76.92 30 9.71 13.27 13.51 15.21 14.87 14.14
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 64.15 57 10.81 12.91 13.72 15.39 14.89 14.02
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 76.92 30 9.67 13.13 13.36 14.99 14.55 13.82
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 75.00 30 9.65 13.15 13.50 14.91 14.66 13.53
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 73.28 47 9.36 12.99 13.17 15.13 14.81 13.84
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 59.59 115 10.34 12.56 13.77 15.53 15.07 14.20
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 59.13 119 10.36 12.69 13.67 15.41 15.08 14.13
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 57.80 55 8.28 10.80 11.10 11.29 11.08 11.21
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 57.94 58 8.19 10.90 11.06 11.29 11.11 11.34
Park_SGU_task4_3 Park_SGU2026 10.10 53.17 24 9.33 11.72 8.46 13.20 10.00 7.57
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 57.94 58 8.31 10.87 11.27 10.94 10.94 11.42
Park_SGU_task4_4 Park_SGU2026 10.08 53.64 25 9.05 12.15 8.50 12.99 9.64 7.79
Park_SGU_task4_2 Park_SGU2026 9.42 53.17 24 8.98 11.44 7.07 12.81 9.22 6.23
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 45.63 131 6.71 9.36 10.29 11.04 11.38 11.35
You_PKU_task4_4 You_PKU2026 8.24 52.58 61 6.77 10.35 6.29 12.38 8.78 5.71
You_PKU_task4_1 You_PKU2026 8.24 52.58 61 7.00 10.40 6.12 12.35 8.62 5.67
You_PKU_task4_2 You_PKU2026 8.20 52.58 61 6.91 10.33 6.09 12.33 8.64 5.71
You_PKU_task4_3 You_PKU2026 8.20 52.58 61 6.91 10.33 6.09 12.32 8.65 5.70
Park_SGU_task4_1 Park_SGU2026 8.12 53.70 27 7.97 10.71 4.73 11.54 8.03 5.05
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 58.53 15 6.37 8.70 4.19 10.11 7.09 4.64
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 58.20 18 6.44 8.71 4.21 10.15 7.08 4.51
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 58.33 18 6.32 8.75 4.19 10.20 7.04 4.41
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 56.61 48 6.30 8.90 4.46 10.10 7.10 5.02
Deng_WHU_task4_1 Deng_WHU2026 6.84 58.13 17 5.96 8.66 3.91 10.40 6.85 4.66
Nguyen_NTT_task4_1 6.77 56.55 21 6.13 8.28 4.21 9.94 7.10 4.54
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 55.56 34 6.23 8.66 4.07 10.12 6.80 4.40
Nguyen_NTT_task4_2 6.76 55.16 20 6.27 8.59 3.92 9.96 6.91 4.26
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 39.81 68 -2.81 0.90 1.77 2.59 3.04 4.38
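The (N, M) notation used for the overlap conditions can be made concrete with a small helper. The sketch below assumes each mixture is described by the list of its active target class labels, which is a hypothetical representation rather than the dataset's metadata format; the function name and example labels are illustrative as well.

```python
from collections import Counter


def overlap_condition(target_labels):
    """Return (N, M) for one mixture.

    N is the number of active target sources; M is the number of those
    sources whose class label is shared with at least one other target.
    """
    counts = Counter(target_labels)
    n = len(target_labels)
    m = sum(count for count in counts.values() if count > 1)
    return n, m


examples = {
    "no targets": overlap_condition([]),
    "one target": overlap_condition(["Speech"]),
    "two same-class": overlap_condition(["Speech", "Speech"]),
    "three, two same-class": overlap_condition(["Speech", "Speech", "Blender"]),
    "three same-class": overlap_condition(["Speech", "Speech", "Speech"]),
}
```

With up to three targets per mixture, this helper yields the seven conditions in the table: (0,0), (1,0), (2,0), (2,2), (3,0), (3,2), and (3,3).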

System characteristics

General characteristics

Columns: Submission Code; Technical Report; CAPI-SDRi; Label Prediction Accuracy (mix); Input Sampling Rate; Input Acoustic Features.
Bando_AIST_task4_3 Bando_AIST2026 14.93 65.54 32kHz spectrogram
Bando_AIST_task4_4 Bando_AIST2026 14.61 59.52 32kHz spectrogram
Bando_AIST_task4_2 Bando_AIST2026 14.34 58.40 32kHz spectrogram
Bando_AIST_task4_1 Bando_AIST2026 14.23 58.33 32kHz spectrogram
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 64.88 32kHz waveform, spectrogram
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 76.92 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 64.15 32kHz waveform, spectrogram
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 76.92 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 75.00 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 73.28 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models spectrogram for separation, Kaldi fbank features for two tagging models
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 59.59 32kHz waveform, spectrogram
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 59.13 32kHz waveform, spectrogram
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 57.80 32kHz waveform, spectrogram
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 57.94 32kHz waveform, spectrogram
Park_SGU_task4_3 Park_SGU2026 10.10 53.17 32kHz spectrogram
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 57.94 32kHz waveform, spectrogram
Park_SGU_task4_4 Park_SGU2026 10.08 53.64 32kHz spectrogram
Park_SGU_task4_2 Park_SGU2026 9.42 53.17 32kHz spectrogram
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 45.63 32kHz waveform, spectrogram
You_PKU_task4_4 You_PKU2026 8.24 52.58 32kHz FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS all-label candidate sources
You_PKU_task4_1 You_PKU2026 8.24 52.58 32kHz FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings
You_PKU_task4_2 You_PKU2026 8.20 52.58 32kHz FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS query-conditioned separated sources
You_PKU_task4_3 You_PKU2026 8.20 52.58 32kHz FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS query-conditioned separated sources
Park_SGU_task4_1 Park_SGU2026 8.12 53.70 32kHz spectrogram
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 58.53 32kHz waveform, log-mel spectrogram, STFT spectrogram
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 58.20 32kHz waveform, log-mel spectrogram, STFT spectrogram
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 58.33 32kHz waveform, log-mel spectrogram, STFT spectrogram
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 56.61 32kHz four-channel FOA waveform, channel-wise log-mel spectrogram
Deng_WHU_task4_1 Deng_WHU2026 6.84 58.13 32kHz waveform, spectrogram
Nguyen_NTT_task4_1 6.77 56.55 32kHz waveform, spectrogram
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 55.56 32kHz waveform, log-mel spectrogram
Nguyen_NTT_task4_2 6.76 55.16 32kHz waveform, spectrogram
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 39.81 32kHz waveform

Machine learning characteristics

Columns: Submission Code; Technical Report; CAPI-SDRi; Label Prediction Accuracy (mix); Machine Learning Method; Loss Function; Training Dataset; Data Augmentation; Pretrained Models.
Bando_AIST_task4_3 Bando_AIST2026 14.93 65.54 TF-Locoformer-based separation model BCE, SNR DCASE2026Task4Dataset ATST-Frame
Bando_AIST_task4_4 Bando_AIST2026 14.61 59.52 TF-Locoformer-based separation model BCE, SNR DCASE2026Task4Dataset ATST-Frame
Bando_AIST_task4_2 Bando_AIST2026 14.34 58.40 TF-Locoformer-based separation model BCE, SNR DCASE2026Task4Dataset
Bando_AIST_task4_1 Bando_AIST2026 14.23 58.33 TF-Locoformer-based separation model BCE, SNR DCASE2026Task4Dataset
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 64.88 Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner difficulty-based mixup M2D; Audio-Flamingo 3
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 76.92 BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model SNR (separation), CE (counting and classification) DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 64.15 Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner difficulty-based mixup M2D; Audio-Flamingo 3
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 76.92 BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model SNR (separation), CE (counting and classification) DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 75.00 BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model SNR (separation), CE (counting and classification) DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 73.28 BEATs-based tagging model, TF-Locoformer-based separation model SNR (separation), CE (counting and classification) DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models BEATs (BEATs_strong_1.pt)
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 59.59 Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner difficulty-based mixup M2D; Audio-Flamingo 3
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 59.13 Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner difficulty-based mixup M2D; Audio-Flamingo 3
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 57.80 Transformer-based separation model, PretrainedSED-based audio tagging model BCE, SA-SDR loss, KL-divergence DCASE2026Task4Dataset; AudioSet Frame Shift, SpecAugmentation PretrainedSED
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 57.94 Transformer-based separation model, PretrainedSED-based audio tagging model BCE, SA-SDR loss, KL-divergence DCASE2026Task4Dataset; AudioSet Frame Shift, SpecAugmentation PretrainedSED
Park_SGU_task4_3 Park_SGU2026 10.10 53.17 SRCorrNet-based separation model, M2D-fPaSST based audio tagging model cross entropy, PIT-SNR DCASE2026Task4Dataset spec augmentation, angle rotation, random gain filter M2D; fPaSST
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 57.94 Transformer-based separation model, PretrainedSED-based audio tagging model BCE, SA-SDR loss, KL-divergence DCASE2026Task4Dataset; AudioSet Frame Shift, SpecAugmentation PretrainedSED
Park_SGU_task4_4 Park_SGU2026 10.08 53.64 SRCorrNet-based separation model, M2D-fPaSST based audio tagging model cross entropy, PIT-SNR DCASE2026Task4Dataset spec augmentation, angle rotation, random gain filter M2D; fPaSST
Park_SGU_task4_2 Park_SGU2026 9.42 53.17 SRCorrNet-based separation model, M2D-fPaSST based audio tagging model cross entropy, PIT-SNR DCASE2026Task4Dataset spec augmentation, angle rotation, random gain filter M2D; fPaSST
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 45.63 Transformer-based separation model, PretrainedSED-based audio tagging model BCE, SA-SDR loss, KL-divergence DCASE2026Task4Dataset; AudioSet Frame Shift, SpecAugmentation PretrainedSED
You_PKU_task4_4 You_PKU2026 8.24 52.58 global candidate selector over raw TUSS source hypotheses and baseline separation outputs binary cross entropy for audio tagging, SDR-style separation objectives for source separation DCASE2026 Task 4 development dataset all-label TUSS candidate pool with ExtraTrees threshold selection
You_PKU_task4_1 You_PKU2026 8.24 52.58 ResUNet/ResUNetK source separation with M2D audio tagging and duplicate-label-aware meta-selection binary cross entropy for audio tagging, SDR-style separation objectives for source separation DCASE2026 Task 4 development dataset class-balanced candidate generation, source-level overlay selection, duplicate-label-aware filtering
You_PKU_task4_2 You_PKU2026 8.20 52.58 duplicate-label weighted selector with label-specific TUSS overlay grafting binary cross entropy for audio tagging, SDR-style separation objectives for source separation DCASE2026 Task 4 development dataset source-level replacement using labels with positive development-set overlay evidence
You_PKU_task4_3 You_PKU2026 8.20 52.58 duplicate-label weighted selector with label-specific TUSS overlay grafting and small all-label Blender-line refinement binary cross entropy for audio tagging, SDR-style separation objectives for source separation DCASE2026 Task 4 development dataset source-level replacement using labels with positive development-set overlay evidence, plus a small Blender-focused overlay
Park_SGU_task4_1 Park_SGU2026 8.12 53.70 SRCorrNet-based separation model, M2D-fPaSST based audio tagging model cross entropy, PIT-SNR DCASE2026Task4Dataset spec augmentation, angle rotation, random gain filter M2D; fPaSST
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 58.53 ResUNet-based label-queried separation; dual single/4-channel M2D audio tagging with weighted-confidence fusion; verify-and-refine stem re-tagging; energy/probability silence gating BCE (audio tagging), CAPI-SDR (separation) DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing M2D; ResUNetK (DCASE2026 Task 4 baseline)
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 58.20 As submission 1 plus per-class verification/drop/probability-gate thresholds tuned offline against exact CAPI-SDRi semantics BCE (audio tagging), CAPI-SDR (separation) DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing M2D; ResUNetK (DCASE2026 Task 4 baseline)
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 58.33 As submission 2 plus dual-separator stem selection: per stem keep whichever of two ResUNet separators its single-channel re-tagging verifies more strongly BCE (audio tagging), CAPI-SDR (separation) DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing M2D; ResUNetK (DCASE2026 Task 4 baseline)
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 56.61 four-channel M2D audio tagging with Qwen2-Audio semantic distillation, followed by baseline ResUNetK label-queried separation permutation-invariant classification loss, linear CKA loss, cosine loss, CAPI-SDR loss DCASE2026Task4Dataset on-the-fly spatial sound-scene synthesis M2D; Qwen2-Audio-7B-Instruct; DCASE2026 baseline ResUNetK
Deng_WHU_task4_1 Deng_WHU2026 6.84 58.13 Label-guided ensemble with multi-model audio tagging fusion and fine-tuned ResUNetK or TF-GridNet separation result selection or fusion BCE, CAPI-SDR DCASE2026Task4Dataset Random mixture synthesis with same-class duplication, random SNR/angle sampling, background/interference mixing, and label shuffling M2D; CLAP
Nguyen_NTT_task4_1 6.77 56.55 ResUNet-based separation model, M2D-based audio tagging model BCE, CAPI-SDR DCASE2026Task4Dataset M2D
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 55.56 single-channel M2D audio tagging with Qwen2-Audio semantic distillation, followed by baseline ResUNetK label-queried separation permutation-invariant classification loss, linear CKA loss, cosine loss, CAPI-SDR loss DCASE2026Task4Dataset on-the-fly spatial sound-scene synthesis M2D; Qwen2-Audio-7B-Instruct; DCASE2026 baseline ResUNetK
Nguyen_NTT_task4_2 6.76 55.16 ResUNet-based separation model, M2D-based audio tagging model BCE, CAPI-SDR DCASE2026Task4Dataset M2D
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 39.81 Joint separation-classification-DoA model (SpatialSeparatorModel). BEATs encoder (frozen) for feature extraction, followed by mask-based waveform separation and class prediction heads. Trained end-to-end with joint PIT loss (SI-SDR + CE + DoA). SI-SDR, Cross-Entropy, DoA regression (joint PIT) DCASE2026Task4Dataset; FSD50K; EARS on-the-fly spatial audio synthesis using SpAudSyn BEATs

Complexity

Columns: Submission Code; Technical Report; CAPI-SDRi; Label Prediction Accuracy (mix); Ensemble subsystems; Number of Parameters
Bando_AIST_task4_3 Bando_AIST2026 14.93 65.54 1 100917678
Bando_AIST_task4_4 Bando_AIST2026 14.61 59.52 1 108401934
Bando_AIST_task4_2 Bando_AIST2026 14.34 58.40 1 22569614
Bando_AIST_task4_1 Bando_AIST2026 14.23 58.33 1 15085358
Choi_KAIST_task4_4 Choi_KAIST2026 12.98 64.88 1 745744243
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi2026 12.94 76.92 5 for tagging, 10 for source counting 687.3M
Choi_KAIST_task4_1 Choi_KAIST2026 12.88 64.15 2 760292329
Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi2026 12.77 76.92 5 for tagging, 10 for source counting 687.3M
Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi2026 12.77 75.00 5 for tagging, 10 for source counting 681.3M
Saijo_Mitsubishi_task4_4 Saijo_Mitsubishi2026 12.54 73.28 1 104.84M
Choi_KAIST_task4_2 Choi_KAIST2026 12.32 59.59 2 760292329
Choi_KAIST_task4_3 Choi_KAIST2026 12.25 59.13 1 745744243
Wang_SRCN_task4_2 Wang_SRCN2026 10.13 57.80 1 375724043
Wang_SRCN_task4_1 Wang_SRCN2026 10.12 57.94 1 375724043
Park_SGU_task4_3 Park_SGU2026 10.10 53.17 1 197210110
Wang_SRCN_task4_3 Wang_SRCN2026 10.10 57.94 1 375724043
Park_SGU_task4_4 Park_SGU2026 10.08 53.64 1 197210110
Park_SGU_task4_2 Park_SGU2026 9.42 53.17 1 197210110
Wang_SRCN_task4_4 Wang_SRCN2026 9.20 45.63 1 375524106
You_PKU_task4_4 You_PKU2026 8.24 52.58 5 approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models
You_PKU_task4_1 You_PKU2026 8.24 52.58 4 approximately 115.4M per baseline separation/tagging pass, excluding lightweight selector models
You_PKU_task4_2 You_PKU2026 8.20 52.58 5 approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models
You_PKU_task4_3 You_PKU2026 8.20 52.58 5 approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models
Park_SGU_task4_1 Park_SGU2026 8.12 53.70 1 197210110
Jeong_Medisensing_task4_1 Jeong_Medisensing2026 6.97 58.53 1 215.90M
Jeong_Medisensing_task4_2 Jeong_Medisensing2026 6.96 58.20 1 215.90M
Jeong_Medisensing_task4_3 Jeong_Medisensing2026 6.94 58.33 2 245.79M
Wang_BUPT_task4_2 Wang_BUPT2026 6.90 56.61 1 126434854
Deng_WHU_task4_1 Deng_WHU2026 6.84 58.13 3 29900000
Nguyen_NTT_task4_1 6.77 56.55 1 119356966
Wang_BUPT_task4_1 Wang_BUPT2026 6.76 55.56 1 119356966
Nguyen_NTT_task4_2 6.76 55.16 1 126434854
Park_KUBIG_task4_1 Park_KUBIG2026 1.18 39.81 1 91815245

Representative example of separated audio samples

Evaluation set

The following table shows separated sound samples from the evaluation set. Representative outputs from teams ranked 1 to 3 and the baseline are selected. The mixture column uses pseudo-stereo mixture files, and each score row reports clip-level CAPI-SDRi for the corresponding system.

Columns: Condition (evaluation set); Mixture*; Oracle (azimuth, elevation); Bando_AIST_task4_3 (Rank 1); Choi_KAIST_task4_4 (Rank 2); Saijo_Mitsubishi_task4_3 (Rank 3); Nguyen_NTT_task4_1 (Baseline)

Success case (3 overlapping target events, including 2 same-class events)
Oracle (azimuth, elevation): Speech (160°, -20°); BicycleBell (-20°, 0°); BicycleBell (80°, -20°)
Bando_AIST_task4_3 (Rank 1): Speech; BicycleBell; BicycleBell. CAPI-SDRi (this sample) = 25.89 dB
Choi_KAIST_task4_4 (Rank 2): Speech; BicycleBell; BicycleBell. CAPI-SDRi (this sample) = 23.78 dB
Saijo_Mitsubishi_task4_3 (Rank 3): Speech; BicycleBell; BicycleBell. CAPI-SDRi (this sample) = 21.06 dB
Nguyen_NTT_task4_1 (Baseline): Speech; --; BicycleBell. CAPI-SDRi (this sample) = 7.56 dB

Challenging case (3 overlapping target events, including 2 same-class events)
Oracle (azimuth, elevation): Percussion (-40°, -20°); Percussion (60°, -20°); Blender (80°, -20°)
Bando_AIST_task4_3 (Rank 1): Percussion; Percussion; --. CAPI-SDRi (this sample) = 10.04 dB
Choi_KAIST_task4_4 (Rank 2): Percussion; Percussion; --. CAPI-SDRi (this sample) = 9.30 dB
Saijo_Mitsubishi_task4_3 (Rank 3): Percussion; Percussion; Blender. CAPI-SDRi (this sample) = 10.72 dB
Nguyen_NTT_task4_1 (Baseline): Percussion; --; --. CAPI-SDRi (this sample) = 1.22 dB

3 Speech
Oracle (azimuth, elevation): Speech (-140°, -20°); Speech (80°, -20°); Speech (160°, 0°)
Bando_AIST_task4_3 (Rank 1): Speech; Speech; Speech. CAPI-SDRi (this sample) = 23.56 dB
Choi_KAIST_task4_4 (Rank 2): Speech; Speech; Speech. CAPI-SDRi (this sample) = 21.17 dB
Saijo_Mitsubishi_task4_3 (Rank 3): Speech; Speech; Speech. CAPI-SDRi (this sample) = 20.00 dB
Nguyen_NTT_task4_1 (Baseline): Speech; Speech; Speech. CAPI-SDRi (this sample) = 8.04 dB

* A pseudo-stereo signal extracted from the ambisonic input signal. Directional components toward azimuth -90° and 90° are extracted and assigned to the left and right channels, respectively.
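
The footnote above does not say which beamformer is used to extract the directional components. As a rough illustration only, the sketch below steers two first-order cardioid beams toward azimuth -90° and 90° and assigns them to the left and right channels. It assumes SN3D-normalised first-order ambisonic components W, Y, X given as plain lists, and a beam at 0° elevation (so Z is unused); the function names and the `encode_horizontal_source` helper are hypothetical and not part of the challenge tooling.

```python
import math

def cardioid_beam(w, y, x, azimuth_deg):
    """First-order cardioid steered to `azimuth_deg` at 0 deg elevation.

    Assumes SN3D-normalised FOA components given as equal-length lists.
    """
    az = math.radians(azimuth_deg)
    cos_az, sin_az = math.cos(az), math.sin(az)
    return [0.5 * (wi + cos_az * xi + sin_az * yi) for wi, yi, xi in zip(w, y, x)]

def pseudo_stereo(w, y, x):
    """Return (left, right): beams toward -90 and +90 deg, as in the footnote."""
    return cardioid_beam(w, y, x, -90.0), cardioid_beam(w, y, x, 90.0)

def encode_horizontal_source(signal, azimuth_deg):
    """Hypothetical helper: encode a plane wave at 0 deg elevation into (w, y, x)."""
    az = math.radians(azimuth_deg)
    return (list(signal),
            [s * math.sin(az) for s in signal],
            [s * math.cos(az) for s in signal])
```

Under these assumptions, a source encoded at azimuth 90° appears at full level in the right channel and is cancelled in the left channel, because a cardioid has its null directly opposite the steering direction.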

Technical reports

END-TO-END ITERATIVE S5 SYSTEM BASED ON TF-LOCOFORMER AND ATST-FRAME

Yoshiaki Bando, Shun Sakurai, Yuto Nozaki, Keisuke Imoto, Masaki Onishi
National Institute of Advanced Industrial Science and Technology, Koto, Tokyo, Japan; Kyoto University, Kyoto, Kyoto, Japan

Abstract

This technical report describes our end-to-end (E2E) system based on TF-Locoformer and ATST-Frame. The best system in the previous challenge on spatial semantic segmentation of sound scenes (S5) leveraged an iterative architecture that alternately performs separation and classification to achieve excellent performance. Inspired by this architecture, we built an E2E system that iteratively applies TF-Locoformer and ATST-Frame. Specifically, the overall architecture is based on TF-Locoformer, which stacks Transformer-based blocks to process each time-frequency bin. We inserted signal and classification heads into the outputs of intermediate blocks, and applied ATST-Frame to the separated source signals. The source-wise embeddings extracted by ATST-Frame are then fed back to the Locoformer block to progressively improve the performance. The whole architecture is trained in an E2E manner with permutation invariant training. Our best model on the development set achieved a class-aware permutation invariant signal-to-distortion ratio improvement (CAPI-SDRi) of 16.3 dB and source-wise label accuracy of 73.6% on the test subset of the development set.

PDF
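
The abstract above mentions permutation invariant training. As a generic illustration of that idea, and not the authors' loss, the sketch below scores every assignment of estimated sources to reference sources by mean SDR and keeps the best one. It assumes equal numbers of references and estimates, signals given as plain lists, and a brute-force search that is only practical for a handful of sources; `sdr_db` and `pit_best_assignment` are hypothetical names.

```python
import itertools
import math

def sdr_db(ref, est, eps=1e-12):
    """Signal-to-distortion ratio in dB between a reference and an estimate."""
    signal = sum(r * r for r in ref)
    distortion = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10((signal + eps) / (distortion + eps))

def pit_best_assignment(refs, ests):
    """Return (perm, mean_sdr) where refs[i] is matched with ests[perm[i]]."""
    best_perm, best_score = None, -math.inf
    for perm in itertools.permutations(range(len(ests))):
        score = sum(sdr_db(refs[i], ests[p]) for i, p in enumerate(perm)) / len(refs)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score
```

A training loss built this way would typically be the negative of the best mean score, so the gradient flows only through the best-matching assignment.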

A MULTI-STAGE SEPARATION-AND-CLASSIFICATION FRAMEWORK GUIDED BY COMPLEMENTARY ACOUSTIC-TO-SEMANTIC CLUES

Younghoo Kwon, Junwoo Park, Han Yin, Jung-Woo Choi
Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

Abstract

This report describes the system proposed for the DCASE 2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes (S5). Specifically, we develop a multi-stage framework in which each stage couples a separation model with a classification model. The first stage performs source separation and classification directly on the multi-channel mixture. Its outputs are then propagated to the following stage as two complementary clues that progressively refine each target estimate: (i) an enrollment clue, the separated waveform itself, serving as a low-level acoustic reference; and (ii) a class clue, the predicted label encoded as a one-hot vector. The third stage reuses the second-stage outputs under the same scheme, forming an iterative self-guided refinement process. In addition, we use a fine-grained frame-level audio embedding from an audio encoder pretrained on a large audio corpus as an additional clue to further improve the audio separation performance. On the test set, the proposed system achieves a CAPI-SDRi of 15.51 dB, a mixture accuracy of 71.09%, and a source accuracy of 78.62%, which are improvements of 7.02 dB, 10.38 percentage points, and 8.22 percentage points over the challenge baseline, respectively.

PDF

A LABEL-GUIDED ENSEMBLE SYSTEM FOR SPATIAL SEMANTIC SEGMENTATION OF SAME-CLASS SOUND SOURCES

Yongyi Deng, Tong Zou, Yanxin Tian, Hao Shi, Jiayue Luo, Yicheng Yan, Gongping Huang
Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), School of Electronic Information, Wuhan University, Wuhan, China; Kyoto University, Kyoto, Japan

Abstract

This report presents our system for DCASE 2026 Task 4, which addresses spatial semantic segmentation of sound scenes containing same-class foreground sources and inactive source labels. The task requires not only separating active sound events that may share the same class label, but also handling source slots corresponding to inactive labels. To improve the reliability of label-conditioned source separation, we adopt a label-guided ensemble strategy. In the tagging stage, two M2D-AT variants and one CLAP-AT variant are fused by weighted voting to obtain robust source-label estimates. The estimated labels are then used to guide source separation. The primary separator is a fine-tuned ResUNetK model with a mask-sharpen inference variant, while a TF-GridNet model is used only as a weak auxiliary branch for a small number of selected classes through class-dependent fusion weights. Instead of uniformly averaging separator outputs, the final outputs are generated through label-guided class-dependent fusion, which improves the consistency between predicted labels and separated sources while keeping inactive slots controlled by silence-label conditioning. On the development set, according to the submission-pack evaluation snapshot, our final system achieves a CAPI-SDRi of 8.625, with a mixture-level accuracy of 61.706% and a source-level accuracy of 72.139%.

PDF
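
The abstract above states that several taggers are fused by weighted voting, without giving the weights or the decision rule. The sketch below shows one plausible reading under stated assumptions: each tagger outputs per-class probabilities, the fused score is their weighted average, and classes above a threshold are kept, up to three labels. The weights, the threshold of 0.5, and the function name `fuse_taggers` are hypothetical, and the sketch does not handle same-class duplicate sources.

```python
def fuse_taggers(prob_dicts, weights, threshold=0.5, max_labels=3):
    """Weighted-average fusion of per-class probabilities from several taggers.

    prob_dicts: one {class_name: probability} dict per tagger.
    weights:    one non-negative weight per tagger (hypothetical values).
    Returns (labels, fused) with labels sorted by descending fused score.
    """
    total = float(sum(weights))
    classes = set().union(*prob_dicts)
    fused = {
        c: sum(w * p.get(c, 0.0) for p, w in zip(prob_dicts, weights)) / total
        for c in classes
    }
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    labels = [c for c, score in ranked if score >= threshold][:max_labels]
    return labels, fused
```

For example, with three taggers weighted 0.4, 0.4, and 0.2, a class scored 0.9, 0.8, and 0.7 fuses to 0.82 and is kept, while a class scored 0.6, 0.2, and 0.3 fuses to 0.38 and is dropped.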

TAGGING-DRIVEN INFERENCE REFINEMENT AND DUAL-SEPARATOR SELECTION FOR SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES

Seunggyu Jeong, Seong-Eun Kim
Medisensing, Seoul, Korea; Seoul National University of Science and Technology (SeoulTech), Seoul, Korea

Abstract

We describe our submission to DCASE 2026 Challenge Task 4, spatial semantic segmentation of sound scenes. A system has to detect which sound classes are active in a four-channel mixture and to separate each detected source, and is scored by class-aware permutation-invariant signal-to-distortion ratio improvement (CA-SDRi). We build on the official baseline, which couples two Masked Modeling Duo (M2D) audio taggers with a FiLM-conditioned ResUNet label-queried separator. An oracle study, in which ground-truth labels are fed to the unchanged baseline separator, reaches 9.52 dB, so for the given separator the label decision is the limiting factor up to about 9.5 dB. Our system follows this finding and concentrates on the labelling and decision stages, almost all of it at inference time. We fuse the single- and four-channel taggers, fine-tune the four-channel tagger with a curriculum that oversamples silent-target clips, re-tag the separated stems to verify and clean the queries, gate residual false positives using the reward structure of CA-SDRi, and select stems from two separators by re-tagging agreement. These steps raise development-set CA-SDRi from 8.49 dB to 9.06 dB and mixture tagging accuracy from 60.7% to 63.8%. We submit three systems that add these components in turn.

PDF

END-TO-END SPATIAL SEMANTIC SEPARATOR WITH DOA MODULE

Yoohan Park, Haejin Cho
student-oriented research group, Korea University Big data and AI Group, 'KUBIG', Korea University, Seoul, South Korea

Abstract

This technical report describes our submission to DCASE 2026 Challenge Task 4, Spatial Semantic Segmentation of Sound Scenes. The task requires a system to separate sound events from 4-channel First-Order Ambisonics (FOA) recordings while simultaneously predicting their sound classes and directions of arrival (DoA). To address this challenge, we propose an end-to-end framework that jointly performs source separation, sound classification, and DoA estimation. The model generates a fixed set of source representations, each associated with a separated waveform, a class label, and a DoA estimate. A frozen BEATs encoder is used to provide robust acoustic representations, while lightweight task-specific modules are trained for spatial modeling and prediction. The entire system is optimized using a joint permutation-invariant training objective that encourages consistent source assignment across all outputs.

PDF

EXTENDING SR-CORRNET TO LABEL-QUERIED TARGET SOUND EXTRACTION

Bon-Hyeok Ku, Woocheol Jeong, Hyung-Min Park
Intelligent Information Processing Lab, Sogang University, Mapo, Seoul, Korea

Abstract

This paper describes our submission to DCASE 2026 Task 4. We extend SR-CorrNet, originally designed for blind source separation, into a label-queried target sound extraction and separation model. The system reformulates blind separation into target extraction by conditioning the separator with frame-level strong class labels via Feature-wise Linear Modulation (FiLM), steering each output slot to extract the queried class. To stabilize ambiguous regions, we supplement this with a time-invariant weak class label FiLM bias. The class labels are predicted by a front-end fusion tagger that combines two complementary AudioSet-pretrained Transformers (M2D and PaSST) via feature-axis concatenation. The extended model operates with block streaming inference, coupling the tagger and separator through a soft-query interface.

PDF
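
The abstract above describes conditioning the separator with frame-level labels via FiLM plus a time-invariant weak-label bias. The sketch below illustrates only the modulation step, under the assumption that per-frame scale and shift values and a per-channel weak-label bias have already been produced from the labels by some projection that is not shown. The function name `film` and the list-of-lists layout are hypothetical and not taken from the report.

```python
def film(features, gamma, beta, weak_bias=None):
    """Feature-wise Linear Modulation: out[t][c] = gamma[t][c] * x[t][c] + beta[t][c].

    features, gamma, beta: lists shaped [frames][channels].
    weak_bias: optional per-channel bias shared by all frames (time-invariant).
    """
    out = []
    for x_t, g_t, b_t in zip(features, gamma, beta):
        row = [g * x + b for x, g, b in zip(x_t, g_t, b_t)]
        if weak_bias is not None:
            row = [r + wb for r, wb in zip(row, weak_bias)]
        out.append(row)
    return out
```

Because the scale and shift vary per frame while the weak bias is constant over time, the bias acts as a steady pull toward the queried class even in frames where the frame-level label is uncertain.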

THE MERL SYSTEMS FOR DCASE 2026 CHALLENGE TASK 4

Kohei Saijo, Yoshiki Masuyama, Christoph Boeddeker, Julius Richter, Takahiro Edo, Gordon Wichern, Jonathan Le Roux
Information Technology R&D Center, Mitsubishi Electric Corporation, Ofuna, Kanagawa, Japan; Mitsubishi Electric Research Laboratories, Cambridge, MA, USA

Abstract

This technical report describes our spatial semantic segmentation of sound scenes (S5) systems for DCASE 2026 Challenge Task 4. Inspired by the top-ranked system in DCASE 2025 Task 4, we adopt a cascaded framework consisting of universal sound separation (USS) with source counting, source classification, and class-aware refinement. In the first stage, a TF-Locoformer-based USS model separates multi-channel mixtures into single-channel foreground and interference signals. Then, each separated signal is classified into one of 18 foreground classes or as interference. The separated foreground signals are further refined by another TF-Locoformer-based model conditioned on the predicted class labels and the observed mixture. Our best system achieves CA-PI-SDRi of 14.95 dB and mixture accuracy of 78.11% on the dev test set.

PDF

SEMANTIC DISTILLATION FOR SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES

Sen Wang, Chengyao Tang, Zhicheng Zhang, Jianqin Yin
Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes our system for DCASE 2026 Task 4, spatial semantic segmentation of sound scenes. The system follows a two-stage pipeline: an M2D-based audio tagger first predicts up to three event labels, and a label-queried ResUNetK then separates the corresponding dry monaural sources from a four-channel spatial mixture. We improve audio tagging along two complementary directions. The first uses permutation-invariant deep supervision and an exponential-moving-average teacher. The second transfers semantic representations from a frozen Qwen2-Audio model to M2D using centered-kernel-alignment and cosine objectives. Both methods are developed for single-channel (1c) and four-channel (4c) tagging. On the development test set, the submitted 1c system obtains 8.557 dB CAPI-SDRi, while the submitted 4c system obtains 8.807 dB CAPI-SDRi with 61.442% mixture-level label accuracy and 72.535% source-level label accuracy.

PDF
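
The abstract above names centered-kernel-alignment and cosine objectives for distillation. As a reference for what those quantities measure, the sketch below computes the standard linear CKA between two feature matrices (rows are examples, columns are feature dimensions) and a plain cosine similarity. How the report turns them into losses is not stated; using `1 - similarity` is a common choice and is an assumption here, as are the function names.

```python
import math

def _center(m):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def _cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    return sum(
        sum(u * v for u, v in zip(col_a, col_b)) ** 2
        for col_a in zip(*a)
        for col_b in zip(*b)
    )

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = _center(x), _center(y)
    return _cross_fro_sq(xc, yc) / math.sqrt(_cross_fro_sq(xc, xc) * _cross_fro_sq(yc, yc))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

Linear CKA is invariant to isotropic scaling and does not require the two feature spaces to have the same width, which is one reason it suits comparing a student and a much larger teacher.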

Local-Global Transformer with Iterative Refinement for Multi-Channel Sound Source Separation and Extraction

Ruohan Wang, Minjun Chen, Yangyang Liu, Longhai Wu, Jie Chen
AI Solution Lab, AI SW Team, Samsung Research China-Nanjing, Nanjing, China; AI SW Team, Samsung Research China-Nanjing, Nanjing, China

Abstract

This technical report describes our proposed systems for DCASE2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes (S5). The task aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information, which is a fundamental basis of immersive communication. Our approach employs a deep frequency-time transformer architecture with local modeling by convolution that processes multi-channel audio recordings from a 4-microphone array. The system consists of three main components: (1) a universal sound separation module for separating waveform and predicting the sound classes at the same time, (2) an audio tagging model for semantic label prediction, and (3) an iterative target sound extraction module that leverages enrollment clues and semantic labels to extract specific sound sources. We incorporate spatial features including inter-channel phase difference (IPD) and inter-channel level difference (ILD) to enhance separation performance.

PDF
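
The abstract above lists inter-channel phase difference (IPD) and inter-channel level difference (ILD) as spatial features. The sketch below shows one common way to compute them for a single time-frequency bin from the complex STFT values of a reference channel and a second channel. The sign convention, the dB scaling of the ILD, and the cosine/sine encoding of the IPD are assumptions, since the report's exact definitions are not given here; the function names are hypothetical.

```python
import cmath
import math

def ipd_ild(ref_bin, other_bin, eps=1e-8):
    """Return (ipd_radians, ild_db) of `other_bin` relative to `ref_bin`.

    Both arguments are complex STFT values for the same time-frequency bin.
    """
    ipd = cmath.phase(other_bin * ref_bin.conjugate())  # wrapped to (-pi, pi]
    ild = 20.0 * math.log10((abs(other_bin) + eps) / (abs(ref_bin) + eps))
    return ipd, ild

def spatial_features(ref_bin, other_bin):
    """Encode the IPD as (cos, sin) to avoid the wrap-around discontinuity."""
    ipd, ild = ipd_ild(ref_bin, other_bin)
    return math.cos(ipd), math.sin(ipd), ild
```

In practice such features are computed for every bin and channel pair and stacked with the spectrogram as extra input channels to the separator.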

DCASE 2026 Task 4 Submission: Duplicate-Label-Aware Source Selection and TUSS Overlay Grafting

Yuhuan You
Peking University, Beijing, China

Abstract

This report describes the PKU submission to DCASE 2026 Challenge Task 4, Spatial Semantic Segmentation of Sound Scenes. The task requires detection and separation of target sound events from FOA mixtures, and the 2026 setting includes same-class multiple sources and zero-target soundscapes. Our submitted systems use the official baseline family as the separation and tagging foundation, then add duplicate-label-aware candidate selection and label-conditioned overlay grafting. The main ranking metric is CAPI-SDRi, so the system is optimized to select separated sources that jointly improve class assignment and permutation-invariant signal quality. We submit four systems: a stable duplicate-label weighted selector, two progressively more aggressive TUSS overlay systems, and a full all-label TUSS candidate-pool selector for diversity.

PDF