Task description
A detailed task description can be found on the task description page.
Teams ranking
This table lists the best-ranked system from each participating team. The DCASE 2026 Task 4 baseline is included as a reference.
| Submission Code | Technical Report | Official Team Rank | CAPI-SDRi (eval) | Label Prediction Accuracy (mix) (eval) | CAPI-SDRi (test) | Label Prediction Accuracy (mix) (test) | |
|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 1 | 14.93 | 65.54 | 16.32 | 64.88 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 2 | 12.98 | 64.88 | 14.65 | 66.07 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 3 | 12.94 | 76.92 | 14.95 | 78.11 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 4 | 10.13 | 57.80 | 11.74 | 62.23 | |
| Park_SGU_task4_3 | Park_SGU2026 | 5 | 10.10 | 53.17 | 11.42 | 56.09 | |
| You_PKU_task4_4 | You_PKU2026 | 6 | 8.24 | 52.58 | 11.85 | 74.87 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 7 | 6.97 | 58.53 | 8.98 | 63.09 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 8 | 6.90 | 56.61 | 8.81 | 61.44 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 9 | 6.84 | 58.13 | 8.62 | 61.71 | |
| Nguyen_NTT_task4_1 | | 10 | 6.77 | 56.55 | 8.17 | 57.14 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 11 | 1.18 | 39.81 | 1.20 | 40.28 | |
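The team table can be derived from the per-system results by keeping each team's top system. The following is a minimal sketch, assuming that teams are ordered by CAPI-SDRi on the evaluation set (which is consistent with the rows above but is an assumption here) and that the team identifier is the submission code without its `_task4_<n>` suffix. The helper names `team_of` and `best_per_team` are illustrative, and only a few rows are included.

```python
import re

# Excerpt of per-system results: (submission code, CAPI-SDRi on the evaluation set).
systems = [
    ("Bando_AIST_task4_3", 14.93),
    ("Bando_AIST_task4_4", 14.61),
    ("Choi_KAIST_task4_4", 12.98),
    ("Choi_KAIST_task4_1", 12.88),
    ("Saijo_Mitsubishi_task4_3", 12.94),
]


def team_of(code: str) -> str:
    """Strip the trailing '_task4_<n>' suffix to obtain the team identifier."""
    return re.sub(r"_task4_\d+$", "", code)


def best_per_team(rows):
    """Return each team's highest-scoring system, sorted by score (descending)."""
    best = {}
    for code, score in rows:
        team = team_of(code)
        if team not in best or score > best[team][1]:
            best[team] = (code, score)
    return sorted(best.values(), key=lambda row: row[1], reverse=True)


team_ranking = best_per_team(systems)
for rank, (code, score) in enumerate(team_ranking, start=1):
    print(rank, code, score)
```

For this excerpt the order is Bando_AIST_task4_3, Choi_KAIST_task4_4, Saijo_Mitsubishi_task4_3, matching the first three rows of the team table.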
Systems ranking
This table shows the ranking of all submitted systems. The DCASE 2026 Task 4 baseline systems are included as references.
| Submission Code | Technical Report | Official System Rank | CAPI-SDRi (eval) | Label Prediction Accuracy (mix) (eval) | CAPI-SDRi (test) | Label Prediction Accuracy (mix) (test) | |
|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 1 | 14.93 | 65.54 | 16.32 | 64.88 | |
| Bando_AIST_task4_4 | Bando_AIST2026 | 2 | 14.61 | 59.52 | 16.36 | 61.51 | |
| Bando_AIST_task4_2 | Bando_AIST2026 | 3 | 14.34 | 58.40 | 16.45 | 62.30 | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 4 | 14.23 | 58.33 | 15.75 | 57.74 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 5 | 12.98 | 64.88 | 14.65 | 66.07 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 6 | 12.94 | 76.92 | 14.95 | 78.11 | |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 7 | 12.88 | 64.15 | 15.51 | 71.10 | |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 8 | 12.77 | 76.92 | 14.74 | 78.11 | |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 9 | 12.77 | 75.00 | 14.84 | 78.11 | |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 10 | 12.54 | 73.28 | 14.41 | 74.41 | |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 11 | 12.32 | 59.59 | 14.80 | 66.40 | |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12 | 12.25 | 59.13 | 15.50 | 71.10 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 13 | 10.13 | 57.80 | 11.74 | 62.23 | |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 14 | 10.12 | 57.94 | 11.73 | 62.04 | |
| Park_SGU_task4_3 | Park_SGU2026 | 15 | 10.10 | 53.17 | 11.42 | 56.09 | |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 16 | 10.10 | 57.94 | 11.74 | 62.04 | |
| Park_SGU_task4_4 | Park_SGU2026 | 17 | 10.08 | 53.64 | 11.43 | 54.70 | |
| Park_SGU_task4_2 | Park_SGU2026 | 18 | 9.42 | 53.17 | 10.53 | 56.09 | |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 19 | 9.20 | 45.63 | 11.26 | 51.98 | |
| You_PKU_task4_4 | You_PKU2026 | 20 | 8.24 | 52.58 | 11.85 | 74.87 | |
| You_PKU_task4_1 | You_PKU2026 | 21 | 8.24 | 52.58 | 11.86 | 74.87 | |
| You_PKU_task4_2 | You_PKU2026 | 22 | 8.20 | 52.58 | 11.86 | 74.87 | |
| You_PKU_task4_3 | You_PKU2026 | 23 | 8.20 | 52.58 | 11.86 | 74.87 | |
| Park_SGU_task4_1 | Park_SGU2026 | 24 | 8.12 | 53.70 | 9.05 | 55.36 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 25 | 6.97 | 58.53 | 8.98 | 63.09 | |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 26 | 6.96 | 58.20 | 9.05 | 63.95 | |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 27 | 6.94 | 58.33 | 9.06 | 63.82 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 28 | 6.90 | 56.61 | 8.81 | 61.44 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 29 | 6.84 | 58.13 | 8.62 | 61.71 | |
| Nguyen_NTT_task4_1 | | 30 | 6.77 | 56.55 | 8.17 | 57.14 | |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 31 | 6.76 | 55.56 | 8.56 | 60.05 | |
| Nguyen_NTT_task4_2 | | 32 | 6.76 | 55.16 | 8.49 | 60.71 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 33 | 1.18 | 39.81 | 1.20 | 40.28 | |
Supplementary metrics
Detailed analysis of joint scores, separation, and detection performance
All metrics in this table are evaluated on the evaluation set. CAPI-SDRi and CASA-SDRi are joint separation and label prediction scores computed by the official evaluator, while TP-SDRi is a separation-only score computed from matched true-positive source pairs. True Positive (TP), False Positive (FP), and False Negative (FN) are counted over target-source hypotheses after class-aware permutation-invariant matching. Accuracy (mix) is mixture-level label prediction accuracy, while Accuracy (src) is source-level label prediction accuracy.
| Submission Code | Technical Report | CAPI-SDRi | CASA-SDRi | TP-SDRi | Accuracy (mix) | Accuracy (src) | Precision | Recall | F-Score (micro) | F-Score (macro) | TP | FP | FN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 14.92 | 20.15 | 65.54 | 73.80 | 0.91 | 0.81 | 0.85 | 0.85 | 2242 | 266 | 530 | |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 14.59 | 20.90 | 59.52 | 69.61 | 0.91 | 0.77 | 0.82 | 0.81 | 2116 | 268 | 656 | |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 14.33 | 20.91 | 58.40 | 67.95 | 0.90 | 0.76 | 0.81 | 0.80 | 2093 | 308 | 679 | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 14.22 | 20.87 | 58.33 | 67.48 | 0.92 | 0.73 | 0.81 | 0.79 | 2017 | 217 | 755 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 12.98 | 17.78 | 64.88 | 73.95 | 0.89 | 0.82 | 0.85 | 0.85 | 2279 | 310 | 493 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 12.94 | 15.96 | 76.92 | 82.84 | 0.93 | 0.90 | 0.91 | 0.91 | 2486 | 229 | 286 | |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 12.88 | 17.76 | 64.15 | 73.57 | 0.89 | 0.82 | 0.85 | 0.85 | 2277 | 323 | 495 | |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 12.76 | 15.72 | 76.92 | 82.84 | 0.93 | 0.90 | 0.91 | 0.91 | 2486 | 229 | 286 | |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 12.76 | 15.82 | 75.00 | 82.31 | 0.92 | 0.90 | 0.90 | 0.90 | 2499 | 264 | 273 | |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 12.54 | 15.90 | 73.28 | 80.45 | 0.90 | 0.90 | 0.89 | 0.89 | 2493 | 327 | 279 | |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 12.31 | 17.72 | 59.59 | 71.64 | 0.86 | 0.84 | 0.83 | 0.84 | 2314 | 458 | 458 | |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 12.24 | 17.69 | 59.13 | 71.47 | 0.85 | 0.84 | 0.83 | 0.83 | 2315 | 467 | 457 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 10.10 | 14.48 | 57.80 | 69.89 | 0.86 | 0.80 | 0.82 | 0.82 | 2221 | 406 | 551 | |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 10.09 | 14.46 | 57.94 | 69.92 | 0.85 | 0.81 | 0.82 | 0.82 | 2232 | 420 | 540 | |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 10.07 | 14.78 | 53.17 | 68.26 | 0.90 | 0.75 | 0.81 | 0.81 | 2047 | 227 | 725 | |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 10.05 | 14.31 | 57.94 | 69.92 | 0.85 | 0.81 | 0.82 | 0.82 | 2232 | 420 | 540 | |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 10.02 | 14.74 | 53.64 | 68.28 | 0.90 | 0.75 | 0.81 | 0.81 | 2051 | 232 | 721 | |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 9.38 | 13.75 | 53.17 | 68.26 | 0.90 | 0.75 | 0.81 | 0.81 | 2047 | 227 | 725 | |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 9.18 | 14.49 | 45.63 | 62.11 | 0.75 | 0.81 | 0.77 | 0.77 | 2226 | 812 | 546 | |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 8.16 | 12.08 | 52.58 | 68.08 | 0.83 | 0.82 | 0.81 | 0.81 | 2244 | 524 | 528 | |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 8.16 | 12.01 | 52.58 | 68.08 | 0.83 | 0.82 | 0.81 | 0.81 | 2244 | 524 | 528 | |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 8.13 | 11.97 | 52.58 | 68.08 | 0.83 | 0.82 | 0.81 | 0.81 | 2244 | 524 | 528 | |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 8.13 | 11.96 | 52.58 | 68.08 | 0.83 | 0.82 | 0.81 | 0.81 | 2244 | 524 | 528 | |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 8.06 | 11.82 | 53.70 | 68.68 | 0.91 | 0.75 | 0.81 | 0.81 | 2057 | 223 | 715 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 6.90 | 9.78 | 58.53 | 70.00 | 0.86 | 0.80 | 0.82 | 0.82 | 2196 | 365 | 576 | |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 6.89 | 9.79 | 58.20 | 69.87 | 0.86 | 0.80 | 0.82 | 0.82 | 2201 | 378 | 571 | |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 6.87 | 9.76 | 58.33 | 70.07 | 0.86 | 0.80 | 0.82 | 0.82 | 2201 | 369 | 571 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 6.84 | 9.92 | 56.61 | 68.95 | 0.84 | 0.81 | 0.82 | 0.82 | 2221 | 449 | 551 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 6.80 | 9.70 | 58.13 | 69.58 | 0.85 | 0.80 | 0.82 | 0.82 | 2210 | 404 | 562 | |
| Nguyen_NTT_task4_1 | | 6.77 | 6.72 | 9.74 | 56.55 | 68.00 | 0.83 | 0.80 | 0.81 | 0.81 | 2193 | 453 | 579 | |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 6.72 | 9.97 | 55.56 | 66.57 | 0.84 | 0.78 | 0.80 | 0.80 | 2147 | 453 | 625 | |
| Nguyen_NTT_task4_2 | | 6.76 | 6.71 | 9.81 | 55.16 | 66.66 | 0.83 | 0.79 | 0.80 | 0.80 | 2175 | 491 | 597 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 0.98 | 2.84 | 39.81 | 56.68 | 0.70 | 0.77 | 0.72 | 0.73 | 2133 | 991 | 639 | |
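The TP, FP, and FN counts relate to the micro-averaged scores through the standard definitions. The sketch below applies those textbook formulas to pooled counts; it is not the official evaluator, and the function name `micro_scores` is illustrative. The table's Precision and Recall columns may be averaged differently, so they need not match the pooled values.

```python
def micro_scores(tp: int, fp: int, fn: int):
    """Micro-averaged precision, recall, and F-score from pooled counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0
    return precision, recall, f_score


# Counts taken from the Bando_AIST_task4_3 row.
precision, recall, f_micro = micro_scores(tp=2242, fp=266, fn=530)
print(f"recall={recall:.2f} f_micro={f_micro:.2f}")
```

With these counts, the pooled recall and F-score round to 0.81 and 0.85, in line with the Recall and F-Score (micro) columns of that row.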
Detailed analysis focused on quality of separated speech
This table shows the quality of separated Speech-class target outputs in the evaluation dataset. Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Audio Quality (PEAQ) are reported as objective metrics.
| Submission Code | Technical Report | CAPI-SDRi | PESQ mean | PESQ std | PESQ min | PESQ max | STOI mean | STOI std | STOI min | STOI max | PEAQ mean | PEAQ std | PEAQ min | PEAQ max | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 3.26 | 0.60 | 1.72 | 4.38 | 0.94 | 0.05 | 0.62 | 1.00 | -2.37 | 0.85 | -3.91 | -0.12 | |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 3.29 | 0.70 | 1.07 | 4.40 | 0.92 | 0.19 | -0.05 | 1.00 | -2.27 | 0.86 | -3.91 | -0.07 | |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 3.18 | 0.68 | 1.37 | 4.39 | 0.95 | 0.05 | 0.64 | 1.00 | -2.35 | 0.87 | -3.91 | -0.09 | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 3.23 | 0.68 | 1.30 | 4.37 | 0.95 | 0.04 | 0.76 | 1.00 | -2.33 | 0.87 | -3.91 | -0.09 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 2.98 | 0.57 | 1.65 | 4.18 | 0.93 | 0.05 | 0.77 | 0.99 | -2.93 | 0.78 | -3.91 | -0.33 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 3.07 | 0.51 | 1.85 | 4.23 | 0.93 | 0.04 | 0.76 | 0.99 | -3.01 | 0.71 | -3.91 | -0.51 | |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 3.01 | 0.54 | 1.67 | 4.17 | 0.94 | 0.04 | 0.77 | 0.99 | -2.93 | 0.78 | -3.91 | -0.34 | |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 2.92 | 0.58 | 1.63 | 4.23 | 0.92 | 0.06 | 0.72 | 0.99 | -3.03 | 0.71 | -3.91 | -0.40 | |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 2.88 | 0.56 | 1.55 | 4.14 | 0.92 | 0.05 | 0.71 | 0.99 | -3.02 | 0.71 | -3.91 | -0.47 | |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 3.06 | 0.52 | 1.85 | 4.23 | 0.93 | 0.05 | 0.62 | 0.99 | -3.01 | 0.72 | -3.91 | -0.51 | |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 2.97 | 0.58 | 1.43 | 4.18 | 0.93 | 0.05 | 0.66 | 0.99 | -2.94 | 0.78 | -3.91 | -0.33 | |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 2.97 | 0.57 | 1.67 | 4.18 | 0.93 | 0.05 | 0.77 | 0.99 | -2.94 | 0.78 | -3.91 | -0.33 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 2.70 | 0.56 | 1.47 | 4.07 | 0.91 | 0.05 | 0.68 | 0.99 | -3.07 | 0.71 | -3.91 | -0.54 | |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 2.70 | 0.56 | 1.47 | 4.07 | 0.91 | 0.05 | 0.68 | 0.99 | -3.07 | 0.71 | -3.91 | -0.54 | |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 2.57 | 0.70 | 1.08 | 4.07 | 0.86 | 0.18 | 0.13 | 0.99 | -3.08 | 0.83 | -3.91 | -0.31 | |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 2.42 | 0.66 | 1.06 | 4.00 | 0.88 | 0.12 | -0.03 | 0.99 | -3.40 | 0.68 | -3.91 | -0.46 | |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 2.57 | 0.70 | 1.07 | 4.08 | 0.86 | 0.17 | 0.08 | 0.99 | -3.08 | 0.82 | -3.91 | -0.36 | |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 2.53 | 0.70 | 1.08 | 4.03 | 0.86 | 0.19 | 0.14 | 0.99 | -3.15 | 0.80 | -3.91 | -0.31 | |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 2.59 | 0.60 | 1.18 | 4.05 | 0.90 | 0.06 | 0.66 | 0.99 | -3.14 | 0.71 | -3.91 | -0.54 | |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 2.47 | 0.77 | 1.15 | 4.16 | 0.86 | 0.14 | 0.33 | 0.99 | -3.27 | 0.73 | -3.91 | -0.50 | |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 2.37 | 0.84 | 1.09 | 4.16 | 0.81 | 0.19 | 0.29 | 0.99 | -3.28 | 0.72 | -3.91 | -0.50 | |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 2.47 | 0.75 | 1.15 | 4.16 | 0.87 | 0.12 | 0.40 | 0.99 | -3.28 | 0.72 | -3.91 | -0.50 | |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 2.47 | 0.75 | 1.15 | 4.16 | 0.87 | 0.12 | 0.40 | 0.99 | -3.28 | 0.72 | -3.91 | -0.50 | |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 2.29 | 0.74 | 1.08 | 3.92 | 0.82 | 0.21 | 0.11 | 0.99 | -3.36 | 0.72 | -3.91 | -0.39 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 1.89 | 0.66 | 1.09 | 3.91 | 0.75 | 0.18 | 0.30 | 0.99 | -3.54 | 0.61 | -3.91 | -0.60 | |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 1.89 | 0.66 | 1.09 | 3.91 | 0.75 | 0.18 | 0.30 | 0.99 | -3.54 | 0.61 | -3.91 | -0.60 | |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 1.90 | 0.66 | 1.08 | 3.93 | 0.75 | 0.18 | 0.31 | 0.99 | -3.53 | 0.61 | -3.91 | -0.47 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 1.90 | 0.66 | 1.09 | 3.91 | 0.75 | 0.18 | 0.30 | 0.99 | -3.54 | 0.61 | -3.91 | -0.60 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 1.79 | 0.60 | 1.09 | 4.00 | 0.74 | 0.18 | 0.27 | 0.98 | -3.53 | 0.61 | -3.91 | -0.63 | |
| Nguyen_NTT_task4_1 | | 6.77 | 1.88 | 0.65 | 1.11 | 3.91 | 0.75 | 0.18 | 0.30 | 0.99 | -3.54 | 0.60 | -3.91 | -0.61 | |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 1.89 | 0.65 | 1.09 | 3.91 | 0.75 | 0.18 | 0.30 | 0.99 | -3.53 | 0.61 | -3.91 | -0.60 | |
| Nguyen_NTT_task4_2 | | 6.76 | 1.90 | 0.65 | 1.11 | 3.93 | 0.76 | 0.18 | 0.29 | 0.99 | -3.53 | 0.61 | -3.91 | -0.57 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 1.91 | 0.55 | 1.11 | 3.40 | 0.84 | 0.09 | 0.57 | 0.98 | -2.73 | 0.60 | -3.91 | -1.21 | |
System performance under partially known conditions
This table shows the separation and mixture-level label prediction performance of each system under partially known conditions. The Known IR condition uses evaluation mixtures synthesized with room impulse responses included in the training data. The Known Target condition uses evaluation mixtures synthesized with target sound event samples included in the training data. The Known Noise condition uses evaluation mixtures synthesized with background noise included in the training data. The Known Interference condition uses evaluation mixtures synthesized with interference sound samples included in the training data.
| Submission Code | Technical Report | CAPI-SDRi | Accuracy (mix) | Known IR CAPI-SDRi | Known IR Accuracy (mix) | Known Target CAPI-SDRi | Known Target Accuracy (mix) | Known Noise CAPI-SDRi | Known Noise Accuracy (mix) | Known Interference CAPI-SDRi | Known Interference Accuracy (mix) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 65.54 | 16.21 | 69.44 | 15.73 | 70.37 | 15.39 | 67.59 | 15.62 | 70.37 | |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 59.52 | 15.62 | 60.32 | 15.13 | 62.04 | 15.08 | 61.11 | 16.14 | 68.98 | |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 58.40 | 15.62 | 61.90 | 15.70 | 65.74 | 15.55 | 65.28 | 15.91 | 68.52 | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 58.33 | 16.12 | 64.68 | 15.30 | 62.50 | 15.27 | 61.11 | 15.45 | 65.28 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 64.88 | 13.97 | 63.89 | 13.82 | 68.98 | 13.63 | 68.06 | 13.71 | 70.83 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 76.92 | 13.69 | 77.38 | 14.52 | 86.11 | 13.70 | 80.09 | 14.10 | 85.65 | |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 64.15 | 13.75 | 62.30 | 13.53 | 67.59 | 13.80 | 68.98 | 14.13 | 72.69 | |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 76.92 | 13.55 | 77.38 | 14.28 | 86.11 | 13.56 | 80.09 | 13.90 | 85.65 | |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 75.00 | 13.51 | 76.19 | 14.29 | 83.80 | 13.50 | 77.78 | 13.90 | 85.19 | |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 73.28 | 13.19 | 73.41 | 13.67 | 79.17 | 13.06 | 73.61 | 13.62 | 81.48 | |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 59.59 | 13.51 | 60.71 | 12.77 | 62.96 | 12.96 | 62.04 | 13.75 | 71.76 | |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 59.13 | 13.68 | 61.11 | 13.02 | 64.81 | 12.84 | 61.57 | 13.73 | 72.22 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 57.80 | 10.63 | 56.35 | 10.91 | 65.28 | 10.48 | 61.11 | 11.03 | 65.74 | |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 57.94 | 10.51 | 55.95 | 10.93 | 65.74 | 10.40 | 60.19 | 10.93 | 64.81 | |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 53.17 | 10.86 | 53.97 | 11.00 | 68.06 | 10.59 | 54.17 | 11.27 | 62.04 | |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 57.94 | 10.49 | 55.95 | 10.99 | 65.74 | 10.33 | 60.19 | 11.18 | 64.81 | |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 53.64 | 4.10 | 29.37 | 11.22 | 66.20 | 10.45 | 53.24 | 10.92 | 60.19 | |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 53.17 | 9.94 | 53.97 | 10.26 | 68.06 | 9.97 | 54.17 | 10.48 | 62.04 | |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 45.63 | 9.29 | 43.25 | 10.79 | 52.78 | 10.06 | 49.54 | 9.79 | 47.69 | |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 52.58 | 8.78 | 51.59 | 8.97 | 50.93 | 8.39 | 53.24 | 8.70 | 57.41 | |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 52.58 | 8.91 | 51.59 | 8.96 | 50.93 | 8.35 | 53.24 | 8.65 | 57.41 | |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 52.58 | 8.94 | 51.59 | 8.82 | 50.93 | 8.38 | 53.24 | 8.60 | 57.41 | |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 52.58 | 8.94 | 51.59 | 8.81 | 50.93 | 8.38 | 53.24 | 8.60 | 57.41 | |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 53.70 | 8.76 | 54.76 | 8.90 | 68.52 | 8.28 | 54.17 | 9.03 | 65.74 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 58.53 | 6.94 | 56.75 | 9.23 | 72.22 | 7.31 | 57.87 | 7.79 | 64.35 | |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 58.20 | 6.93 | 57.54 | 9.24 | 72.22 | 7.32 | 59.26 | 7.81 | 63.89 | |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 58.33 | 7.00 | 57.14 | 9.20 | 72.22 | 7.35 | 59.26 | 7.79 | 63.89 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 56.61 | 6.94 | 53.17 | 9.26 | 71.30 | 7.37 | 59.72 | 7.51 | 63.89 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 58.13 | 6.68 | 58.33 | 8.95 | 70.83 | 7.33 | 59.26 | 7.34 | 63.89 | |
| Nguyen_NTT_task4_1 | | 6.77 | 56.55 | 6.76 | 52.38 | 8.98 | 70.37 | 7.10 | 56.02 | 7.43 | 62.04 | |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 55.56 | 6.63 | 52.78 | 9.22 | 70.83 | 7.40 | 59.72 | 7.84 | 65.74 | |
| Nguyen_NTT_task4_2 | | 6.76 | 55.16 | 6.65 | 55.16 | 9.01 | 70.37 | 6.79 | 52.78 | 7.81 | 65.74 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 39.81 | 0.67 | 39.29 | 1.08 | 37.50 | 1.37 | 45.83 | 1.23 | 42.59 | |
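One way to read the table above is to compare each known-condition score with the score on the full evaluation set. The sketch below does this for a single row; the helper name `gains` is illustrative, and the differences are plain subtractions of the reported values, not an official analysis.

```python
# CAPI-SDRi values from the Saijo_Mitsubishi_task4_3 row.
overall = 12.94
known = {"IR": 13.69, "Target": 14.52, "Noise": 13.70, "Interference": 14.10}


def gains(overall_score, per_condition):
    """Difference between each known-condition score and the overall score."""
    return {
        name: round(score - overall_score, 2)
        for name, score in per_condition.items()
    }


deltas = gains(overall, known)
largest = max(deltas, key=deltas.get)
print(deltas, largest)
```

For this row the largest difference is in the Known Target condition.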
System performance by target-source overlap condition
This table shows performance by target-source overlap condition. Each condition is denoted as (N, M), where N is the number of active target sound sources in the mixture and M is the number of active target sound sources involved in same-class overlap. M=0 means that no same-class target overlap is present. (0,0) denotes mixtures with no target sound source, for which CAPI-SDRi is not defined; the (0,0) column therefore reports the number of false positives (FP) in zero-target mixtures.
| Submission Code | Technical Report | CAPI-SDRi | Accuracy (mix) | FP (0,0) | CAPI-SDRi (1,0) | CAPI-SDRi (2,0) | CAPI-SDRi (2,2) | CAPI-SDRi (3,0) | CAPI-SDRi (3,2) | CAPI-SDRi (3,3) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 65.54 | 33 | 12.26 | 15.23 | 16.86 | 15.51 | 16.37 | 17.08 | |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 59.52 | 32 | 12.01 | 14.37 | 16.77 | 15.02 | 15.88 | 17.25 | |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 58.40 | 43 | 11.73 | 14.64 | 16.52 | 14.87 | 16.01 | 16.58 | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 58.33 | 26 | 11.96 | 13.84 | 16.80 | 13.81 | 15.56 | 16.61 | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 64.88 | 51 | 10.66 | 13.02 | 14.00 | 15.34 | 14.89 | 14.09 | |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 76.92 | 30 | 9.71 | 13.27 | 13.51 | 15.21 | 14.87 | 14.14 | |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 64.15 | 57 | 10.81 | 12.91 | 13.72 | 15.39 | 14.89 | 14.02 | |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 76.92 | 30 | 9.67 | 13.13 | 13.36 | 14.99 | 14.55 | 13.82 | |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 75.00 | 30 | 9.65 | 13.15 | 13.50 | 14.91 | 14.66 | 13.53 | |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 73.28 | 47 | 9.36 | 12.99 | 13.17 | 15.13 | 14.81 | 13.84 | |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 59.59 | 115 | 10.34 | 12.56 | 13.77 | 15.53 | 15.07 | 14.20 | |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 59.13 | 119 | 10.36 | 12.69 | 13.67 | 15.41 | 15.08 | 14.13 | |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 57.80 | 55 | 8.28 | 10.80 | 11.10 | 11.29 | 11.08 | 11.21 | |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 57.94 | 58 | 8.19 | 10.90 | 11.06 | 11.29 | 11.11 | 11.34 | |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 53.17 | 24 | 9.33 | 11.72 | 8.46 | 13.20 | 10.00 | 7.57 | |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 57.94 | 58 | 8.31 | 10.87 | 11.27 | 10.94 | 10.94 | 11.42 | |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 53.64 | 25 | 9.05 | 12.15 | 8.50 | 12.99 | 9.64 | 7.79 | |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 53.17 | 24 | 8.98 | 11.44 | 7.07 | 12.81 | 9.22 | 6.23 | |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 45.63 | 131 | 6.71 | 9.36 | 10.29 | 11.04 | 11.38 | 11.35 | |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 52.58 | 61 | 6.77 | 10.35 | 6.29 | 12.38 | 8.78 | 5.71 | |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 52.58 | 61 | 7.00 | 10.40 | 6.12 | 12.35 | 8.62 | 5.67 | |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 52.58 | 61 | 6.91 | 10.33 | 6.09 | 12.33 | 8.64 | 5.71 | |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 52.58 | 61 | 6.91 | 10.33 | 6.09 | 12.32 | 8.65 | 5.70 | |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 53.70 | 27 | 7.97 | 10.71 | 4.73 | 11.54 | 8.03 | 5.05 | |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 58.53 | 15 | 6.37 | 8.70 | 4.19 | 10.11 | 7.09 | 4.64 | |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 58.20 | 18 | 6.44 | 8.71 | 4.21 | 10.15 | 7.08 | 4.51 | |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 58.33 | 18 | 6.32 | 8.75 | 4.19 | 10.20 | 7.04 | 4.41 | |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 56.61 | 48 | 6.30 | 8.90 | 4.46 | 10.10 | 7.10 | 5.02 | |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 58.13 | 17 | 5.96 | 8.66 | 3.91 | 10.40 | 6.85 | 4.66 | |
| Nguyen_NTT_task4_1 | | 6.77 | 56.55 | 21 | 6.13 | 8.28 | 4.21 | 9.94 | 7.10 | 4.54 | |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 55.56 | 34 | 6.23 | 8.66 | 4.07 | 10.12 | 6.80 | 4.40 | |
| Nguyen_NTT_task4_2 | | 6.76 | 55.16 | 20 | 6.27 | 8.59 | 3.92 | 9.96 | 6.91 | 4.26 | |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 39.81 | 68 | -2.81 | 0.90 | 1.77 | 2.59 | 3.04 | 4.38 | |
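The (N, M) notation can be illustrated by labelling a mixture from the classes of its active target sources. This is a minimal sketch, assuming that same-class overlap is decided from class labels alone (timing is ignored); the function name `overlap_condition` and the non-Speech class names are hypothetical.

```python
from collections import Counter


def overlap_condition(target_labels):
    """Return (N, M) for one mixture, given the class label of each active target source."""
    counts = Counter(target_labels)
    n = len(target_labels)
    # Sources whose class occurs more than once take part in same-class overlap.
    m = sum(c for c in counts.values() if c > 1)
    return n, m


print(overlap_condition([]))                                # (0, 0)
print(overlap_condition(["Speech", "Doorbell"]))            # (2, 0)
print(overlap_condition(["Speech", "Speech", "Doorbell"]))  # (3, 2)
```

Under this assumption the reachable conditions for up to three targets are exactly the column headings of the table: (0,0), (1,0), (2,0), (2,2), (3,0), (3,2), and (3,3).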
System characteristics
General characteristics
| Submission Code | Technical Report | CAPI-SDRi | Label Prediction Accuracy (mix) | Input Sampling Rate | Input Acoustic Features |
|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 65.54 | 32kHz | spectrogram |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 59.52 | 32kHz | spectrogram |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 58.40 | 32kHz | spectrogram |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 58.33 | 32kHz | spectrogram |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 64.88 | 32kHz | waveform, spectrogram |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 76.92 | 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models | spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 64.15 | 32kHz | waveform, spectrogram |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 76.92 | 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models | spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 75.00 | 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models | spectrogram for separation, Kaldi fbank features for two tagging models, log-Mel spectrogram for one tagging model, and DAC-VAE features for two tagging models |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 73.28 | 32kHz for separation, 16kHz for three tagging models, 48 kHz for two tagging models | spectrogram for separation, Kaldi fbank features for two tagging models |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 59.59 | 32kHz | waveform, spectrogram |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 59.13 | 32kHz | waveform, spectrogram |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 57.80 | 32kHz | waveform, spectrogram |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 57.94 | 32kHz | waveform, spectrogram |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 53.17 | 32kHz | spectrogram |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 57.94 | 32kHz | waveform, spectrogram |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 53.64 | 32kHz | spectrogram |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 53.17 | 32kHz | spectrogram |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 45.63 | 32kHz | waveform, spectrogram |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 52.58 | 32kHz | FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS all-label candidate sources |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 52.58 | 32kHz | FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 52.58 | 32kHz | FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS query-conditioned separated sources |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 52.58 | 32kHz | FOA waveform, log-mel spectrogram, M2D audio-tagging embeddings, TUSS query-conditioned separated sources |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 53.70 | 32kHz | spectrogram |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 58.53 | 32kHz | waveform, log-mel spectrogram, STFT spectrogram |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 58.20 | 32kHz | waveform, log-mel spectrogram, STFT spectrogram |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 58.33 | 32kHz | waveform, log-mel spectrogram, STFT spectrogram |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 56.61 | 32kHz | four-channel FOA waveform, channel-wise log-mel spectrogram |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 58.13 | 32kHz | waveform, spectrogram |
| Nguyen_NTT_task4_1 | | 6.77 | 56.55 | 32kHz | waveform, spectrogram |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 55.56 | 32kHz | waveform, log-mel spectrogram |
| Nguyen_NTT_task4_2 | | 6.76 | 55.16 | 32kHz | waveform, spectrogram |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 39.81 | 32kHz | waveform |
Machine learning characteristics
| Submission Code | Technical Report | CAPI-SDRi | Label Prediction Accuracy (mix) | Machine Learning Method | Loss Function | Training Dataset | Data Augmentation | Pretrained Models |
|---|---|---|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 65.54 | TF-Locoformer-based separation model | BCE, SNR | DCASE2026Task4Dataset | | ATST-Frame |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 59.52 | TF-Locoformer-based separation model | BCE, SNR | DCASE2026Task4Dataset | | ATST-Frame |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 58.40 | TF-Locoformer-based separation model | BCE, SNR | DCASE2026Task4Dataset | | |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 58.33 | TF-Locoformer-based separation model | BCE, SNR | DCASE2026Task4Dataset | | |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 64.88 | Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model | SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss | DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner | difficulty-based mixup | M2D; Audio-Flamingo 3 |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 76.92 | BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model | SNR (separation), CE (counting and classification) | DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 | MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models | BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 64.15 | Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model | SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss | DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner | difficulty-based mixup | M2D; Audio-Flamingo 3 |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 76.92 | BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model | SNR (separation), CE (counting and classification) | DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 | MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models | BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 75.00 | BEATs-based, M2D-based, ATST-based, PE-A-Frame-small-based, and PE-A-Frame-base-based tagging models, TF-Locoformer-based separation model | SNR (separation), CE (counting and classification) | DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 | MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models | BEATs (BEATs_strong_1.pt); M2D (M2D_strong_1.pt); ATST-Frame (ATST-F_strong_1.pt); PE-A-Frame-small; PE-A-Frame-base |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 73.28 | BEATs-based tagging model, TF-Locoformer-based separation model | SNR (separation), CE (counting and classification) | DCASE2026Task4Dataset; AudioSet; SINS database; NIGENS; STARSS23 | MixUp, frequency warping, and filter augmentation for three 16-kHz tagging models | BEATs (BEATs_strong_1.pt) |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 59.59 | Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model | SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss | DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner | difficulty-based mixup | M2D; Audio-Flamingo 3 |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 59.13 | Transformer-Mamba-based separation/extraction model, CRNN-based audio classification model | SA-SDR loss, ArcFace loss, KL-divergence loss, BCE loss | DCASE2026Task4Dataset; AudioSet-2M-VacuumCleaner | difficulty-based mixup | M2D; Audio-Flamingo 3 |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 57.80 | Transformer-based separation model, PretrainedSED-based audio tagging model | BCE, SA-SDR loss, KL-divergence | DCASE2026Task4Dataset; AudioSet | Frame Shift, SpecAugmentation | PretrainedSED |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 57.94 | Transformer-based separation model, PretrainedSED-based audio tagging model | BCE, SA-SDR loss, KL-divergence | DCASE2026Task4Dataset; AudioSet | Frame Shift, SpecAugmentation | PretrainedSED |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 53.17 | SRCorrNet-based separation model, M2D-fPaSST based audio tagging model | cross entropy, PIT-SNR | DCASE2026Task4Dataset | spec augmentation, angle rotation, random gain filter | M2D; fPaSST |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 57.94 | Transformer-based separation model, PretrainedSED-based audio tagging model | BCE, SA-SDR loss, KL-divergence | DCASE2026Task4Dataset; AudioSet | Frame Shift, SpecAugmentation | PretrainedSED |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 53.64 | SRCorrNet-based separation model, M2D-fPaSST based audio tagging model | cross entropy, PIT-SNR | DCASE2026Task4Dataset | spec augmentation, angle rotation, random gain filter | M2D; fPaSST |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 53.17 | SRCorrNet-based separation model, M2D-fPaSST based audio tagging model | cross entropy, PIT-SNR | DCASE2026Task4Dataset | spec augmentation, angle rotation, random gain filter | M2D; fPaSST |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 45.63 | Transformer-based separation model, PretrainedSED-based audio tagging model | BCE, SA-SDR loss, KL-divergence | DCASE2026Task4Dataset; AudioSet | Frame Shift, SpecAugmentation | PretrainedSED |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 52.58 | global candidate selector over raw TUSS source hypotheses and baseline separation outputs | binary cross entropy for audio tagging, SDR-style separation objectives for source separation | DCASE2026 Task 4 development dataset | all-label TUSS candidate pool with ExtraTrees threshold selection | |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 52.58 | ResUNet/ResUNetK source separation with M2D audio tagging and duplicate-label-aware meta-selection | binary cross entropy for audio tagging, SDR-style separation objectives for source separation | DCASE2026 Task 4 development dataset | class-balanced candidate generation, source-level overlay selection, duplicate-label-aware filtering | |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 52.58 | duplicate-label weighted selector with label-specific TUSS overlay grafting | binary cross entropy for audio tagging, SDR-style separation objectives for source separation | DCASE2026 Task 4 development dataset | source-level replacement using labels with positive development-set overlay evidence | |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 52.58 | duplicate-label weighted selector with label-specific TUSS overlay grafting and small all-label Blender-line refinement | binary cross entropy for audio tagging, SDR-style separation objectives for source separation | DCASE2026 Task 4 development dataset | source-level replacement using labels with positive development-set overlay evidence, plus a small Blender-focused overlay | |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 53.70 | SRCorrNet-based separation model, M2D-fPaSST based audio tagging model | cross entropy, PIT-SNR | DCASE2026Task4Dataset | spec augmentation, angle rotation, random gain filter | M2D; fPaSST |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 58.53 | ResUNet-based label-queried separation; dual single/4-channel M2D audio tagging with weighted-confidence fusion; verify-and-refine stem re-tagging; energy/probability silence gating | BCE (audio tagging), CAPI-SDR (separation) | DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) | spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing | M2D; ResUNetK (DCASE2026 Task 4 baseline) |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 58.20 | As submission 1 plus per-class verification/drop/probability-gate thresholds tuned offline against exact CAPI-SDRi semantics | BCE (audio tagging), CAPI-SDR (separation) | DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) | spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing | M2D; ResUNetK (DCASE2026 Task 4 baseline) |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 58.33 | As submission 2 plus dual-separator stem selection: per stem keep whichever of two ResUNet separators its single-channel re-tagging verifies more strongly | BCE (audio tagging), CAPI-SDR (separation) | DCASE2026Task4Dataset; FSD50K; EARS; Semantic Hearing (BinauralCuratedDataset) | spatial soundscape synthesis, zero-target oversampling (40%), duplicate-source-event (DUPSE) curriculum, interference mixing | M2D; ResUNetK (DCASE2026 Task 4 baseline) |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 56.61 | four-channel M2D audio tagging with Qwen2-Audio semantic distillation, followed by baseline ResUNetK label-queried separation | permutation-invariant classification loss, linear CKA loss, cosine loss, CAPI-SDR loss | DCASE2026Task4Dataset | on-the-fly spatial sound-scene synthesis | M2D; Qwen2-Audio-7B-Instruct; DCASE2026 baseline ResUNetK |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 58.13 | Label-guided ensemble with multi-model audio tagging fusion and fine-tuned ResUNetK or TF-GridNet separation result selection or fusion | BCE, CAPI-SDR | DCASE2026Task4Dataset | Random mixture synthesis with same-class duplication, random SNR/angle sampling, background/interference mixing, and label shuffling | M2D; CLAP |
| Nguyen_NTT_task4_1 | | 6.77 | 56.55 | ResUNet-based separation model, M2D-based audio tagging model | BCE, CAPI-SDR | DCASE2026Task4Dataset | | M2D |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 55.56 | single-channel M2D audio tagging with Qwen2-Audio semantic distillation, followed by baseline ResUNetK label-queried separation | permutation-invariant classification loss, linear CKA loss, cosine loss, CAPI-SDR loss | DCASE2026Task4Dataset | on-the-fly spatial sound-scene synthesis | M2D; Qwen2-Audio-7B-Instruct; DCASE2026 baseline ResUNetK |
| Nguyen_NTT_task4_2 | | 6.76 | 55.16 | ResUNet-based separation model, M2D-based audio tagging model | BCE, CAPI-SDR | DCASE2026Task4Dataset | | M2D |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 39.81 | Joint separation-classification-DoA model (SpatialSeparatorModel). BEATs encoder (frozen) for feature extraction, followed by mask-based waveform separation and class prediction heads. Trained end-to-end with joint PIT loss (SI-SDR + CE + DoA). | SI-SDR, Cross-Entropy, DoA regression (joint PIT) | DCASE2026Task4Dataset; FSD50K; EARS | on-the-fly spatial audio synthesis using SpAudSyn | BEATs |
Complexity
| Submission Code | Technical Report | CAPI-SDRi | Label Prediction Accuracy (mix) | Ensemble subsystems | Number of Parameters |
|---|---|---|---|---|---|
| Bando_AIST_task4_3 | Bando_AIST2026 | 14.93 | 65.54 | 1 | 100917678 |
| Bando_AIST_task4_4 | Bando_AIST2026 | 14.61 | 59.52 | 1 | 108401934 |
| Bando_AIST_task4_2 | Bando_AIST2026 | 14.34 | 58.40 | 1 | 22569614 |
| Bando_AIST_task4_1 | Bando_AIST2026 | 14.23 | 58.33 | 1 | 15085358 |
| Choi_KAIST_task4_4 | Choi_KAIST2026 | 12.98 | 64.88 | 1 | 745744243 |
| Saijo_Mitsubishi_task4_3 | Saijo_Mitsubishi2026 | 12.94 | 76.92 | 5 for tagging, 10 for source counting | 687.3M |
| Choi_KAIST_task4_1 | Choi_KAIST2026 | 12.88 | 64.15 | 2 | 760292329 |
| Saijo_Mitsubishi_task4_2 | Saijo_Mitsubishi2026 | 12.77 | 76.92 | 5 for tagging, 10 for source counting | 687.3M |
| Saijo_Mitsubishi_task4_1 | Saijo_Mitsubishi2026 | 12.77 | 75.00 | 5 for tagging, 10 for source counting | 681.3M |
| Saijo_Mitsubishi_task4_4 | Saijo_Mitsubishi2026 | 12.54 | 73.28 | 1 | 104.84M |
| Choi_KAIST_task4_2 | Choi_KAIST2026 | 12.32 | 59.59 | 2 | 760292329 |
| Choi_KAIST_task4_3 | Choi_KAIST2026 | 12.25 | 59.13 | 1 | 745744243 |
| Wang_SRCN_task4_2 | Wang_SRCN2026 | 10.13 | 57.80 | 1 | 375724043 |
| Wang_SRCN_task4_1 | Wang_SRCN2026 | 10.12 | 57.94 | 1 | 375724043 |
| Park_SGU_task4_3 | Park_SGU2026 | 10.10 | 53.17 | 1 | 197210110 |
| Wang_SRCN_task4_3 | Wang_SRCN2026 | 10.10 | 57.94 | 1 | 375724043 |
| Park_SGU_task4_4 | Park_SGU2026 | 10.08 | 53.64 | 1 | 197210110 |
| Park_SGU_task4_2 | Park_SGU2026 | 9.42 | 53.17 | 1 | 197210110 |
| Wang_SRCN_task4_4 | Wang_SRCN2026 | 9.20 | 45.63 | 1 | 375524106 |
| You_PKU_task4_4 | You_PKU2026 | 8.24 | 52.58 | 5 | approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models |
| You_PKU_task4_1 | You_PKU2026 | 8.24 | 52.58 | 4 | approximately 115.4M per baseline separation/tagging pass, excluding lightweight selector models |
| You_PKU_task4_2 | You_PKU2026 | 8.20 | 52.58 | 5 | approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models |
| You_PKU_task4_3 | You_PKU2026 | 8.20 | 52.58 | 5 | approximately 115.4M per baseline separation/tagging pass, excluding external overlay candidate generation and lightweight selector models |
| Park_SGU_task4_1 | Park_SGU2026 | 8.12 | 53.70 | 1 | 197210110 |
| Jeong_Medisensing_task4_1 | Jeong_Medisensing2026 | 6.97 | 58.53 | 1 | 215.90M |
| Jeong_Medisensing_task4_2 | Jeong_Medisensing2026 | 6.96 | 58.20 | 1 | 215.90M |
| Jeong_Medisensing_task4_3 | Jeong_Medisensing2026 | 6.94 | 58.33 | 2 | 245.79M |
| Wang_BUPT_task4_2 | Wang_BUPT2026 | 6.90 | 56.61 | 1 | 126434854 |
| Deng_WHU_task4_1 | Deng_WHU2026 | 6.84 | 58.13 | 3 | 29900000 |
| Nguyen_NTT_task4_1 | | 6.77 | 56.55 | 1 | 119356966 |
| Wang_BUPT_task4_1 | Wang_BUPT2026 | 6.76 | 55.56 | 1 | 119356966 |
| Nguyen_NTT_task4_2 | | 6.76 | 55.16 | 1 | 126434854 |
| Park_KUBIG_task4_1 | Park_KUBIG2026 | 1.18 | 39.81 | 1 | 91815245 |
Representative example of separated audio samples
Evaluation set
The following table shows separated sound samples from the evaluation set. Representative outputs from teams ranked 1 to 3 and the baseline are selected. The mixture column uses pseudo-stereo mixture files, and each score row reports clip-level CAPI-SDRi for the corresponding system.
| Condition (evaluation set) | Mixture* | Oracle (azimuth, elevation) | Bando_AIST_task4_3 (Rank 1) | Choi_KAIST_task4_4 (Rank 2) | Saijo_Mitsubishi_task4_3 (Rank 3) | Nguyen_NTT_task4_1 (Baseline) |
|---|---|---|---|---|---|---|
| Success case (3 overlapping target events, including 2 same-class events) | (audio) | Speech (160°, -20°); BicycleBell (-20°, 0°); BicycleBell (80°, -20°) | Speech; BicycleBell; BicycleBell; CAPI-SDRi (this sample) = 25.89 dB | Speech; BicycleBell; BicycleBell; CAPI-SDRi (this sample) = 23.78 dB | Speech; BicycleBell; BicycleBell; CAPI-SDRi (this sample) = 21.06 dB | Speech; --; BicycleBell; CAPI-SDRi (this sample) = 7.56 dB |
| Challenging case (3 overlapping target events, including 2 same-class events) | (audio) | Percussion (-40°, -20°); Percussion (60°, -20°); Blender (80°, -20°) | Percussion; Percussion; --; CAPI-SDRi (this sample) = 10.04 dB | Percussion; Percussion; --; CAPI-SDRi (this sample) = 9.30 dB | Percussion; Percussion; Blender; CAPI-SDRi (this sample) = 10.72 dB | Percussion; --; --; CAPI-SDRi (this sample) = 1.22 dB |
| 3 Speech | (audio) | Speech (-140°, -20°); Speech (80°, -20°); Speech (160°, 0°) | Speech; Speech; Speech; CAPI-SDRi (this sample) = 23.56 dB | Speech; Speech; Speech; CAPI-SDRi (this sample) = 21.17 dB | Speech; Speech; Speech; CAPI-SDRi (this sample) = 20.00 dB | Speech; Speech; Speech; CAPI-SDRi (this sample) = 8.04 dB |
* A pseudo-stereo signal extracted from the ambisonic input signal. Directional components toward azimuth -90° and 90° are extracted and assigned to the left and right channels, respectively.
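The pseudo-stereo extraction described in the footnote can be sketched as a pair of virtual microphones steered to azimuth -90° and 90°. This is a minimal sketch under stated assumptions, not the procedure actually used for this page: it assumes first-order ambisonics with an omnidirectional W component and horizontal X and Y components, a cardioid beam pattern, and a convention in which positive azimuth points along the positive Y axis. The function names `virtual_cardioid` and `foa_to_pseudo_stereo` are hypothetical.

```python
import math

def virtual_cardioid(w, x, y, azimuth_deg):
    """Steer a horizontal first-order cardioid toward azimuth_deg.

    w, x, y are equal-length lists of samples. W is assumed to be the
    omnidirectional component with unit gain (an assumption, not from the page).
    """
    a = math.radians(azimuth_deg)
    cx, cy = math.cos(a), math.sin(a)
    return [0.5 * (wi + cx * xi + cy * yi) for wi, xi, yi in zip(w, x, y)]

def foa_to_pseudo_stereo(w, x, y):
    """Return (left, right): beams toward -90 and +90 degrees, as in the footnote."""
    left = virtual_cardioid(w, x, y, -90.0)
    right = virtual_cardioid(w, x, y, 90.0)
    return left, right

# Toy input: one sample of a unit plane wave from azimuth +90 degrees,
# which under the assumed convention encodes as W=1, X=0, Y=1.
w, x, y = [1.0], [0.0], [1.0]
left, right = foa_to_pseudo_stereo(w, x, y)
```

With this toy input the beam facing the source passes it at full gain and the opposite beam sits in the cardioid null, which is the channel separation the pseudo-stereo mixture relies on.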
Technical reports
END-TO-END ITERATIVE S5 SYSTEM BASED ON TF-LOCOFORMER AND ATST-FRAME
Yoshiaki Bando, Shun Sakurai, Yuto Nozaki, Keisuke Imoto, Masaki Onishi
National Institute of Advanced Industrial Science and Technology, Koto, Tokyo, Japan; Kyoto University, Kyoto, Kyoto, Japan
Bando_AIST_task4_3 Bando_AIST_task4_4 Bando_AIST_task4_2 Bando_AIST_task4_1
Abstract
This technical report describes our end-to-end (E2E) system based on TF-Locoformer and ATST-Frame. The best system in the previous challenge on spatial semantic segmentation of sound scenes (S5) leveraged an iterative architecture that alternately performs separation and classification to achieve excellent performance. Inspired by this architecture, we built an E2E system that iteratively applies TF-Locoformer and ATST-Frame. Specifically, the overall architecture is based on TF-Locoformer, which stacks Transformer-based blocks to process each time-frequency bin. We inserted signal and classification heads into the outputs of intermediate blocks, and applied ATST-Frame to the separated source signals. The source-wise embeddings extracted by ATST-Frame are then fed back to the Locoformer block to progressively improve the performance. The whole architecture is trained in an E2E manner with permutation-invariant training. Our best model achieved a class-aware permutation-invariant signal-to-distortion ratio improvement (CAPI-SDRi) of 16.3 dB and a source-wise label accuracy of 73.6% on the test subset of the development set.
A MULTI-STAGE SEPARATION-AND-CLASSIFICATION FRAMEWORK GUIDED BY COMPLEMENTARY ACOUSTIC-TO-SEMANTIC CLUES
Younghoo Kwon, Junwoo Park, Han Yin, Jung-Woo Choi
Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Choi_KAIST_task4_4 Choi_KAIST_task4_1 Choi_KAIST_task4_2 Choi_KAIST_task4_3
Abstract
This report describes the system proposed for the DCASE 2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes (S5). Specifically, we develop a multi-stage framework in which each stage couples a separation model with a classification model. The first stage performs source separation and classification directly on the multi-channel mixture. Its outputs are then propagated to the following stage as two complementary clues that progressively refine each target estimate: (i) an enrollment clue, the separated waveform itself, serving as a low-level acoustic reference; and (ii) a class clue, the predicted label encoded as a one-hot vector. The third stage reuses the second-stage outputs under the same scheme, forming an iterative self-guided refinement process. In addition, we use a fine-grained frame-level audio embedding from an audio encoder pretrained on a large audio corpus as an additional clue to further improve the audio separation performance. On the test set, the proposed system achieves a CAPI-SDRi of 15.51 dB, a mixture accuracy of 71.09%, and a source accuracy of 78.62%, which are improvements of 7.02 dB, 10.38 percentage points, and 8.22 percentage points over the challenge baseline, respectively.
A LABEL-GUIDED ENSEMBLE SYSTEM FOR SPATIAL SEMANTIC SEGMENTATION OF SAME-CLASS SOUND SOURCES
Yongyi Deng, Tong Zou, Yanxin Tian, Hao Shi, Jiayue Luo, Yicheng Yan, Gongping Huang
Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), School of Electronic Information, Wuhan University, Wuhan, China; Kyoto University, Kyoto, Japan
Deng_WHU_task4_1
Abstract
This report presents our system for DCASE 2026 Task 4, which addresses spatial semantic segmentation of sound scenes containing same-class foreground sources and inactive source labels. The task requires not only separating active sound events that may share the same class label, but also handling source slots corresponding to inactive labels. To improve the reliability of label-conditioned source separation, we adopt a label-guided ensemble strategy. In the tagging stage, two M2D-AT variants and one CLAP-AT variant are fused by weighted voting to obtain robust source-label estimates. The estimated labels are then used to guide source separation. The primary separator is a fine-tuned ResUNetK model with a mask-sharpen inference variant, while a TF-GridNet model is used only as a weak auxiliary branch for a small number of selected classes through class-dependent fusion weights. Instead of uniformly averaging separator outputs, the final outputs are generated through label-guided class-dependent fusion, which improves the consistency between predicted labels and separated sources while keeping inactive slots controlled by silence-label conditioning. On the development set, according to the submission-pack evaluation snapshot, our final system achieves a CAPI-SDRi of 8.625 dB, with a mixture-level accuracy of 61.706% and a source-level accuracy of 72.139%.
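The weighted-voting fusion of tagger outputs mentioned in this abstract can be illustrated with a small example. This is a minimal sketch under assumptions: the abstract does not give the weights, the decision threshold, or the per-class scores, so every number below and the function name `fuse_taggers` are hypothetical.

```python
def fuse_taggers(prob_sets, weights, threshold=0.5):
    """Weighted average of per-class probabilities from several taggers.

    prob_sets: list of dicts mapping class name -> probability (one per tagger).
    weights:   one non-negative weight per tagger (hypothetical values).
    Returns (fused scores, classes whose fused score reaches the threshold).
    """
    total = sum(weights)
    classes = sorted(set().union(*prob_sets))
    fused = {
        c: sum(w * p.get(c, 0.0) for w, p in zip(weights, prob_sets)) / total
        for c in classes
    }
    active = [c for c in classes if fused[c] >= threshold]
    return fused, active

# Three taggers (standing in for two M2D-AT variants and one CLAP-AT variant)
# with made-up scores for two classes.
m2d_a = {"Speech": 0.9, "Blender": 0.4}
m2d_b = {"Speech": 0.8, "Blender": 0.6}
clap = {"Speech": 0.7, "Blender": 0.2}
fused, active = fuse_taggers([m2d_a, m2d_b, clap], weights=[0.4, 0.4, 0.2])
```

In this toy case only `Speech` passes the threshold, because the low-scoring tagger pulls the fused `Blender` score below 0.5 even though one tagger voted for it.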
TAGGING-DRIVEN INFERENCE REFINEMENT AND DUAL-SEPARATOR SELECTION FOR SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES
Seunggyu Jeong, Seong-Eun Kim
Medisensing, Seoul, Korea; Seoul National University of Science and Technology (SeoulTech), Seoul, Korea
Jeong_Medisensing_task4_1 Jeong_Medisensing_task4_2 Jeong_Medisensing_task4_3
Abstract
We describe our submission to DCASE 2026 Challenge Task 4, spatial semantic segmentation of sound scenes. A system has to detect which sound classes are active in a four-channel mixture and separate each detected source, and is scored by class-aware permutation-invariant signal-to-distortion ratio improvement (CA-SDRi). We build on the official baseline, which couples two Masked Modeling Duo (M2D) audio taggers with a FiLM-conditioned ResUNet label-queried separator. An oracle study, in which ground-truth labels are fed to the unchanged baseline separator, reaches 9.52 dB, so for the given separator the label decision is the limiting factor up to about 9.5 dB. Our system follows this finding and concentrates on the labelling and decision stages, almost all of it at inference time. We fuse the single- and four-channel taggers, fine-tune the four-channel tagger with a curriculum that oversamples silent-target clips, re-tag the separated stems to verify and clean the queries, gate residual false positives using the reward structure of CA-SDRi, and select stems from two separators by re-tagging agreement. These steps raise development-set CA-SDRi from 8.49 dB to 9.06 dB and mixture tagging accuracy from 60.7% to 63.8%. We submit three systems that add these components in turn.
END-TO-END SPATIAL SEMANTIC SEPARATOR WITH DOA MODULE
Yoohan Park, Haejin Cho
student-oriented research group, Korea University Big data and AI Group, 'KUBIG', Korea University, Seoul, South Korea
Park_KUBIG_task4_1
Abstract
This technical report describes our submission to DCASE 2026 Challenge Task 4, Spatial Semantic Segmentation of Sound Scenes. The task requires a system to separate sound events from 4-channel First-Order Ambisonics (FOA) recordings while simultaneously predicting their sound classes and directions of arrival (DoA). To address this challenge, we propose an end-to-end framework that jointly performs source separation, sound classification, and DoA estimation. The model generates a fixed set of source representations, each associated with a separated waveform, a class label, and a DoA estimate. A frozen BEATs encoder is used to provide robust acoustic representations, while lightweight task-specific modules are trained for spatial modeling and prediction. The entire system is optimized using a joint permutation-invariant training objective that encourages consistent source assignment across all outputs.
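The joint permutation-invariant objective described in this abstract can be illustrated by searching over all assignments of estimates to references and keeping the cheapest one. This is a minimal sketch, not the authors' loss: the system table lists SI-SDR, cross-entropy, and DoA regression terms, whereas the toy cost below uses squared waveform error plus a 0/1 class penalty as stand-ins, and the names `pit_assign` and `joint_cost` are hypothetical.

```python
from itertools import permutations

def pit_assign(pair_cost, n_sources):
    """Return (best_cost, best_perm) minimizing the summed pairwise cost.

    pair_cost(i, j) is the cost of matching estimate i to reference j;
    best_perm[i] is the reference index assigned to estimate i.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n_sources)):
        cost = sum(pair_cost(i, j) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

# Toy data: each source is (waveform, class label); all values are made up.
est = [([0.0, 1.0], "Speech"), ([1.0, 0.0], "Blender")]
ref = [([1.0, 0.1], "Blender"), ([0.0, 0.9], "Speech")]

def joint_cost(i, j):
    """Squared waveform error plus a unit penalty for a class mismatch."""
    wave_err = sum((a - b) ** 2 for a, b in zip(est[i][0], ref[j][0]))
    class_err = 0.0 if est[i][1] == ref[j][1] else 1.0
    return wave_err + class_err

cost, perm = pit_assign(joint_cost, 2)
```

Because a single permutation is chosen for the summed cost, the waveform and the class of each output slot are always matched to the same reference, which is the consistent source assignment the abstract refers to. The exhaustive search grows factorially with the number of sources, so it is only practical for small source counts.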
EXTENDING SR-CORRNET TO LABEL-QUERIED TARGET SOUND EXTRACTION
Bon-Hyeok Ku, Woocheol Jeong, Hyung-Min Park
Intelligent Information Processing Lab, Sogang University, Mapo, Seoul, Korea
Park_SGU_task4_3 Park_SGU_task4_4 Park_SGU_task4_2 Park_SGU_task4_1
Abstract
This paper describes our submission to DCASE 2026 Task 4. We extend SR-CorrNet, originally designed for blind source separation, into a label-queried target sound extraction and separation model. The system reformulates blind separation into target extraction by conditioning the separator with frame-level strong class labels via Feature-wise Linear Modulation (FiLM), steering each output slot to extract the queried class. To stabilize ambiguous regions, we supplement this with a time-invariant weak class label FiLM bias. The class labels are predicted by a front-end fusion tagger that combines two complementary AudioSet-pretrained Transformers (M2D and PaSST) via feature-axis concatenation. The extended model operates with block streaming inference, coupling the tagger and separator through a soft-query interface.
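Feature-wise Linear Modulation, as used here to condition the separator on a class query, scales and shifts each feature channel with parameters derived from the label. The following is a minimal sketch of the time-invariant case only, with hypothetical lookup tables standing in for learned projections; it does not reproduce the frame-level conditioning or the network described in the report, and the names `film` and `label_to_film_params` are illustrative.

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.

    features: list of channels, each a list of frame values.
    gamma, beta: one scale and one shift per channel.
    """
    return [[g * v + b for v in ch] for ch, g, b in zip(features, gamma, beta)]

def label_to_film_params(one_hot, gamma_table, beta_table):
    """Map a class query to (gamma, beta) using hypothetical lookup tables.

    Each table has one row per class and one column per feature channel;
    in a trained model these would be learned projections.
    """
    n_ch = len(gamma_table[0])
    gamma = [sum(q * row[c] for q, row in zip(one_hot, gamma_table)) for c in range(n_ch)]
    beta = [sum(q * row[c] for q, row in zip(one_hot, beta_table)) for c in range(n_ch)]
    return gamma, beta

# Two classes, two feature channels, three frames (all values made up).
gamma_table = [[1.0, 0.5], [2.0, 0.0]]
beta_table = [[0.0, 1.0], [-1.0, 0.0]]
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Query the second class: its rows of the tables become gamma and beta.
gamma, beta = label_to_film_params([0.0, 1.0], gamma_table, beta_table)
modulated = film(features, gamma, beta)
```

Passing a soft query (for example `[0.3, 0.7]`) instead of a one-hot vector blends the per-class parameters, which is one way to read the soft-query interface between tagger and separator, though the report's actual mechanism may differ.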
THE MERL SYSTEMS FOR DCASE 2026 CHALLENGE TASK 4
Kohei Saijo, Yoshiki Masuyama, Christoph Boeddeker, Julius Richter, Takahiro Edo, Gordon Wichern, Jonathan Le Roux
Information Technology R&D Center, Mitsubishi Electric Corporation, Ofuna, Kanagawa, Japan; Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
Saijo_Mitsubishi_task4_3 Saijo_Mitsubishi_task4_2 Saijo_Mitsubishi_task4_1 Saijo_Mitsubishi_task4_4
Abstract
This technical report describes our spatial semantic segmentation of sound scenes (S5) systems for DCASE 2026 Challenge Task 4. Inspired by the top-ranked system in DCASE 2025 Task 4, we adopt a cascaded framework consisting of universal sound separation (USS) with source counting, source classification, and class-aware refinement. In the first stage, a TF-Locoformer-based USS model separates multi-channel mixtures into single-channel foreground and interference signals. Then, each separated signal is classified into one of 18 foreground classes or as interference. The separated foreground signals are further refined by another TF-Locoformer-based model conditioned on the predicted class labels and the observed mixture. Our best system achieves a CAPI-SDRi of 14.95 dB and a mixture accuracy of 78.11% on the development test set.
SEMANTIC DISTILLATION FOR SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES
Sen Wang, Chengyao Tang, Zhicheng Zhang, Jianqin Yin
Beijing University of Posts and Telecommunications, Beijing, China
Wang_BUPT_task4_2 Wang_BUPT_task4_1
Abstract
This report describes our system for DCASE 2026 Task 4, spatial semantic segmentation of sound scenes. The system follows a two-stage pipeline: an M2D-based audio tagger first predicts up to three event labels, and a label-queried ResUNetK then separates the corresponding dry monaural sources from a four-channel spatial mixture. We improve audio tagging along two complementary directions. The first uses permutation-invariant deep supervision and an exponential-moving-average teacher. The second transfers semantic representations from a frozen Qwen2-Audio model to M2D using centered-kernel-alignment and cosine objectives. Both methods are developed for single-channel (1c) and four-channel (4c) tagging. On the development test set, the submitted 1c system obtains 8.557 dB CAPI-SDRi, while the submitted 4c system obtains 8.807 dB CAPI-SDRi with 61.442% mixture-level label accuracy and 72.535% source-level label accuracy.
Local-Global Transformer with Iterative Refinement for Multi-Channel Sound Source Separation and Extraction
Ruohan Wang, Minjun Chen, Yangyang Liu, Longhai Wu, Jie Chen
AI Solution Lab, AI SW Team, Samsung Research China-Nanjing, Nanjing, China; AI SW Team, Samsung Research China-Nanjing, Nanjing, China
Wang_SRCN_task4_2 Wang_SRCN_task4_1 Wang_SRCN_task4_3 Wang_SRCN_task4_4
Abstract
This technical report describes our proposed systems for DCASE 2026 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes (S5). The task aims to advance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information, which is a fundamental basis of immersive communication. Our approach employs a deep frequency-time transformer architecture with local modeling by convolution that processes multi-channel audio recordings from a 4-microphone array. The system consists of three main components: (1) a universal sound separation module that separates waveforms and predicts sound classes at the same time, (2) an audio tagging model for semantic label prediction, and (3) an iterative target sound extraction module that leverages enrollment clues and semantic labels to extract specific sound sources. We incorporate spatial features including inter-channel phase difference (IPD) and inter-channel level difference (ILD) to enhance separation performance.
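The two spatial features named in this abstract can be computed per time-frequency bin from the complex spectra of a reference channel and another channel. This is a minimal sketch under assumptions: the report does not state the reference channel, sign convention, or any feature encoding (such as cosine and sine of the IPD), so the definitions and the names `ipd` and `ild_db` below are illustrative.

```python
import cmath
import math

def ipd(ref_bin, other_bin):
    """Inter-channel phase difference in radians, in the range [-pi, pi].

    Computed as the phase of other * conj(ref) for one time-frequency bin.
    """
    return cmath.phase(other_bin * ref_bin.conjugate())

def ild_db(ref_bin, other_bin, eps=1e-8):
    """Inter-channel level difference in dB (other relative to reference).

    eps guards against division by zero and log of zero in silent bins.
    """
    return 20.0 * math.log10((abs(other_bin) + eps) / (abs(ref_bin) + eps))

# Toy bin: the second channel has half the magnitude and leads by 90 degrees.
ref = complex(1.0, 0.0)
other = cmath.rect(0.5, math.pi / 2)
phase_diff = ipd(ref, other)
level_diff = ild_db(ref, other)
```

For this toy bin the phase difference is a quarter cycle and the level difference is about -6 dB, since halving the magnitude corresponds to 20·log10(0.5).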
DCASE 2026 Task 4 Submission: Duplicate-Label-Aware Source Selection and TUSS Overlay Grafting
Yuhuan You
Peking University, Beijing, China
You_PKU_task4_4 You_PKU_task4_1 You_PKU_task4_2 You_PKU_task4_3
Abstract
This report describes the PKU submission to DCASE 2026 Challenge Task 4, Spatial Semantic Segmentation of Sound Scenes. The task requires detection and separation of target sound events from FOA mixtures, and the 2026 setting includes same-class multiple sources and zero-target soundscapes. Our submitted systems use the official baseline family as the separation and tagging foundation, then add duplicate-label-aware candidate selection and label-conditioned overlay grafting. The main ranking metric is CAPI-SDRi, so the system is optimized to select separated sources that jointly improve class assignment and permutation-invariant signal quality. We submit four systems: a stable duplicate-label weighted selector, two progressively more aggressive TUSS overlay systems, and a full all-label TUSS candidate-pool selector for diversity.