Task description
A detailed task description can be found on the task description page.
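For orientation when reading the tables below, the primary metric, class-aware signal-to-distortion ratio improvement (CA-SDRi), can be paraphrased as follows (the authoritative definition is given on the task description page): per clip, the SDR improvement is averaged over the union of the reference classes $\mathcal{C}$ and the predicted classes $\hat{\mathcal{C}}$, with false positives and false negatives contributing 0 dB.

$$
\text{CA-SDRi} = \frac{1}{\lvert \mathcal{C} \cup \hat{\mathcal{C}} \rvert} \sum_{c \in \mathcal{C} \cup \hat{\mathcal{C}}} P_c,
\qquad
P_c =
\begin{cases}
\text{SDR}(s_c, \hat{s}_c) - \text{SDR}(s_c, x) & \text{if } c \in \mathcal{C} \cap \hat{\mathcal{C}} \text{ (true positive)} \\
0 & \text{otherwise (FP or FN),}
\end{cases}
$$

where $s_c$ is the reference signal of class $c$, $\hat{s}_c$ the corresponding estimate, and $x$ the unprocessed mixture; the reported score is the average over all evaluation clips.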
Team ranking
The table includes only the best-ranked submission from each team; scores are shown for the evaluation set and the test (development) set. A short sketch after the table shows how this selection can be derived from the full system ranking.
Submission Code | Technical Report | Official Team Rank | CA-SDRi (eval) | Label Prediction Accuracy (eval) | CA-SDRi (test) | Label Prediction Accuracy (test)
---|---|---|---|---|---|---
Bando_AIST_task4_2 | Bando_2025_t4 | 5 | 7.55 | 49.51 | 13.31 | 64.07 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 1 | 11.00 | 55.80 | 14.94 | 61.80 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 3 | 9.73 | 51.54 | 14.00 | 59.80 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 2 | 9.77 | 61.60 | 15.04 | 77.07 | |
Baseline_task4_ResUNetK | | 8 | 6.60 | 51.48 | 11.09 | 59.80
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6 | 6.69 | 47.22 | 13.22 | 76.53 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 4 | 7.84 | 47.72 | 14.38 | 73.93 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 7 | 6.60 | 51.48 | 11.12 | 60.67 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 9 | 3.84 | 22.41 | 11.78 | 65.47 |
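For reference, the team-level table above can be derived from the full system ranking in the next section by keeping each team's best-ranked submission. A minimal sketch in pandas, assuming the system table has been exported to a CSV with hypothetical column names `team`, `submission_code`, `rank`, and `ca_sdri_eval`:

```python
import pandas as pd

# Hypothetical export of the "System ranking" table below, one row per submission.
systems = pd.read_csv("system_ranking.csv")  # columns: team, submission_code, rank, ca_sdri_eval, ...

# Keep the best-ranked (lowest official system rank) submission of each team,
# then order the teams by that rank to obtain the team-level table.
team_ranking = (
    systems.sort_values("rank")
           .drop_duplicates(subset="team", keep="first")
           .reset_index(drop=True)
)
team_ranking["team_rank"] = team_ranking.index + 1
print(team_ranking[["team_rank", "submission_code", "rank", "ca_sdri_eval"]])
```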
System ranking
The table shows the ranking of all submitted systems, with scores on the evaluation set and the test (development) set.
Submission Code | Technical Report | Official System Rank | CA-SDRi (eval) | Label Prediction Accuracy (eval) | CA-SDRi (test) | Label Prediction Accuracy (test)
---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 24 | 5.17 | 29.20 | 12.38 | 57.13 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 13 | 7.55 | 49.51 | 13.31 | 64.07 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 23 | 5.26 | 31.98 | 11.23 | 48.80 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 14 | 7.52 | 48.58 | 12.46 | 55.93 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 4 | 10.63 | 53.52 | 14.82 | 61.67 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 3 | 10.80 | 56.42 | 14.60 | 60.90 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 1 | 11.00 | 55.80 | 14.94 | 61.80 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 2 | 10.85 | 54.26 | 14.94 | 61.67 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 8 | 9.11 | 51.54 | 14.16 | 59.80 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 7 | 9.73 | 51.54 | 14.00 | 59.80 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 6 | 9.76 | 61.30 | 14.95 | 76.87 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 5 | 9.77 | 61.60 | 15.04 | 77.07 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 10 | 8.91 | 51.98 | 14.49 | 73.07 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 9 | 8.99 | 57.28 | 14.27 | 71.73 | |
Baseline_task4_ResUNetK | | 19 | 6.60 | 51.48 | 11.09 | 59.80
Baseline_task4_ResUNet | | 22 | 5.72 | 51.48 | 11.03 | 59.80
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 21 | 6.38 | 43.89 | 12.95 | 73.73 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 17 | 6.64 | 46.67 | 13.21 | 76.53 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 20 | 6.50 | 45.19 | 13.09 | 76.00 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 16 | 6.69 | 47.22 | 13.22 | 76.53 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 11 | 7.84 | 47.72 | 14.38 | 73.93 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 12 | 7.72 | 50.68 | 12.40 | 62.73 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 25 | 4.43 | 22.84 | 10.47 | 49.53 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 15 | 7.49 | 44.20 | 14.66 | 76.27 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 18 | 6.60 | 51.48 | 11.12 | 60.67 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 26 | 3.84 | 22.41 | 11.78 | 65.47 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 27 | 3.54 | 22.41 | 11.07 | 62.67 |
Supplementary metrics
Detailed analysis of separation and detection performance
All metrics in this table are computed on the evaluation set. True positives (TP), false positives (FP), and false negatives (FN) are counted per clip. TP-SDRi is the SDRi computed over the clips in which the label prediction is a true positive.
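The precision, recall, and F-scores in the table follow directly from these counts: the micro F-score pools the counts over all classes, while the macro F-score averages the per-class F-scores. A minimal sketch with placeholder counts (not taken from the table):

```python
def f_score(tp, fp, fn):
    """Precision, recall, and F-score from counts (0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# per_class: {class_name: (tp, fp, fn)} counted per clip, as described above (hypothetical numbers)
per_class = {"Speech": (500, 40, 60), "Doorbell": (120, 15, 30)}

# Micro: pool the counts over all classes, then compute one F-score.
tp, fp, fn = (sum(counts[i] for counts in per_class.values()) for i in range(3))
_, _, f_micro = f_score(tp, fp, fn)

# Macro: average the per-class F-scores.
f_macro = sum(f_score(*counts)[2] for counts in per_class.values()) / len(per_class)
print(f_micro, f_macro)
```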
Submission Code | Technical Report | CA-SDRi | TP-SDRi | Accuracy | Precision | Recall | F-Score (micro) | F-Score (macro) | TP | FP | FN
---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 10.72 | 29.20 | 0.72 | 0.50 | 0.59 | 0.56 | 1605 | 622 | 1635 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 11.04 | 49.51 | 0.83 | 0.70 | 0.77 | 0.75 | 2277 | 411 | 963 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 10.39 | 31.98 | 0.76 | 0.52 | 0.60 | 0.59 | 1684 | 648 | 1556 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 11.06 | 48.58 | 0.85 | 0.70 | 0.76 | 0.74 | 2258 | 432 | 982 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 14.37 | 53.52 | 0.88 | 0.79 | 0.82 | 0.82 | 2552 | 401 | 688 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 14.12 | 56.42 | 0.85 | 0.84 | 0.83 | 0.83 | 2704 | 541 | 536 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 14.46 | 55.80 | 0.86 | 0.83 | 0.84 | 0.83 | 2681 | 499 | 559 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 14.52 | 54.26 | 0.88 | 0.80 | 0.83 | 0.82 | 2576 | 399 | 664 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 12.83 | 51.54 | 0.86 | 0.73 | 0.79 | 0.77 | 2365 | 401 | 875 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 13.63 | 51.54 | 0.86 | 0.73 | 0.79 | 0.77 | 2365 | 401 | 875 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 12.51 | 61.30 | 0.93 | 0.78 | 0.85 | 0.82 | 2528 | 210 | 712 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 12.51 | 61.60 | 0.93 | 0.78 | 0.85 | 0.82 | 2527 | 200 | 713 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 12.62 | 51.98 | 0.86 | 0.73 | 0.79 | 0.77 | 2368 | 390 | 872 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 11.87 | 57.28 | 0.93 | 0.75 | 0.83 | 0.81 | 2447 | 203 | 793 | |
Baseline_task4_ResUNetK | | 6.60 | 9.33 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876
Baseline_task4_ResUNet | | 5.72 | 8.21 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 10.07 | 43.89 | 0.87 | 0.65 | 0.74 | 0.69 | 2121 | 395 | 1119 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 10.18 | 46.67 | 0.85 | 0.67 | 0.75 | 0.70 | 2183 | 379 | 1057 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 10.13 | 45.19 | 0.86 | 0.67 | 0.75 | 0.70 | 2156 | 377 | 1084 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 10.15 | 47.22 | 0.88 | 0.67 | 0.76 | 0.71 | 2185 | 328 | 1055 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 11.27 | 47.72 | 0.84 | 0.71 | 0.78 | 0.75 | 2309 | 393 | 931 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 11.06 | 50.68 | 0.87 | 0.70 | 0.78 | 0.75 | 2286 | 309 | 954 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 10.30 | 22.84 | 0.59 | 0.49 | 0.54 | 0.49 | 1590 | 1110 | 1650 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 11.44 | 44.20 | 0.87 | 0.65 | 0.75 | 0.72 | 2117 | 280 | 1123 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 9.33 | 51.48 | 0.86 | 0.73 | 0.79 | 0.77 | 2364 | 402 | 876 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 9.41 | 22.41 | 0.89 | 0.40 | 0.52 | 0.46 | 1278 | 418 | 1962 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 8.97 | 22.41 | 0.89 | 0.40 | 0.52 | 0.46 | 1278 | 418 | 1962 |
Detailed analysis focused on the quality of separated speech
The table shows the quality of the separated speech in the evaluation dataset. Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Audio Quality (PEAQ) were adopted as objective metrics.
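For reference, PESQ and STOI can be computed with the open-source `pesq` and `pystoi` Python packages; PEAQ typically requires an external ITU-R BS.1387 implementation and is omitted here. A minimal sketch, assuming 32 kHz reference and estimate waveforms and wide-band PESQ at 16 kHz:

```python
import numpy as np
from scipy.signal import resample_poly
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

def speech_quality(reference: np.ndarray, estimate: np.ndarray, fs: int = 32000):
    """PESQ (wide-band) and STOI for one separated speech sample."""
    # Wide-band PESQ is defined for 16 kHz input, so resample first.
    ref16 = resample_poly(reference, 16000, fs)
    est16 = resample_poly(estimate, 16000, fs)
    pesq_score = pesq(16000, ref16, est16, "wb")
    # STOI accepts the native sampling rate and resamples internally.
    stoi_score = stoi(reference, estimate, fs, extended=False)
    return pesq_score, stoi_score
```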
Submission Code | Technical Report | CA-SDRi | Number of Speech Samples (/251) | PESQ mean | PESQ std | PESQ min | PESQ max | STOI mean | STOI std | STOI min | STOI max | PEAQ mean | PEAQ std | PEAQ min | PEAQ max
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 233 | 2.50 | 0.61 | 1.26 | 4.07 | 0.85 | 0.10 | 0.46 | 0.98 | -3.52 | 0.48 | -3.91 | -0.91 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 243 | 2.46 | 0.61 | 1.07 | 4.06 | 0.85 | 0.11 | 0.46 | 0.98 | -3.52 | 0.48 | -3.91 | -1.10 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 184 | 2.47 | 0.57 | 1.22 | 4.02 | 0.88 | 0.08 | 0.55 | 0.99 | -3.56 | 0.46 | -3.91 | -1.64 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 226 | 2.35 | 0.61 | 1.06 | 4.02 | 0.86 | 0.09 | 0.45 | 0.98 | -3.56 | 0.45 | -3.91 | -1.46 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 236 | 2.83 | 0.58 | 1.27 | 4.22 | 0.91 | 0.07 | 0.63 | 0.99 | -3.42 | 0.49 | -3.91 | -0.62 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 241 | 2.82 | 0.56 | 1.27 | 4.23 | 0.91 | 0.07 | 0.58 | 0.99 | -3.46 | 0.47 | -3.91 | -0.88 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 241 | 2.88 | 0.58 | 1.17 | 4.24 | 0.91 | 0.08 | 0.46 | 0.99 | -3.43 | 0.48 | -3.91 | -0.63 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 231 | 2.89 | 0.57 | 1.30 | 4.23 | 0.91 | 0.07 | 0.58 | 0.99 | -3.42 | 0.49 | -3.91 | -0.62 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 246 | 2.67 | 0.54 | 1.19 | 4.02 | 0.90 | 0.08 | 0.49 | 0.99 | -3.46 | 0.50 | -3.91 | -0.95 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 246 | 2.77 | 0.58 | 1.04 | 4.18 | 0.90 | 0.10 | 0.08 | 0.99 | -3.43 | 0.50 | -3.91 | -1.40 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 249 | 2.97 | 0.59 | 1.25 | 4.17 | 0.90 | 0.07 | 0.61 | 0.99 | -3.40 | 0.50 | -3.91 | -0.74 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 249 | 2.97 | 0.60 | 1.24 | 4.18 | 0.90 | 0.07 | 0.61 | 0.99 | -3.39 | 0.51 | -3.91 | -0.75 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 250 | 2.89 | 0.62 | 1.17 | 4.17 | 0.89 | 0.08 | 0.56 | 0.99 | -3.36 | 0.51 | -3.91 | -0.87 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 250 | 2.91 | 0.61 | 1.22 | 4.18 | 0.89 | 0.08 | 0.58 | 0.99 | -3.45 | 0.50 | -3.91 | -0.72 | |
Baseline_task4_ResUNetK | | 6.60 | 246 | 2.39 | 0.63 | 1.09 | 4.07 | 0.84 | 0.11 | 0.39 | 0.99 | -3.60 | 0.43 | -3.91 | -0.87
Baseline_task4_ResUNet | | 5.72 | 246 | 2.55 | 0.63 | 1.11 | 4.11 | 0.85 | 0.10 | 0.44 | 0.99 | -3.56 | 0.47 | -3.91 | -0.98
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 249 | 2.40 | 0.62 | 1.06 | 4.07 | 0.85 | 0.11 | 0.38 | 0.99 | -3.57 | 0.47 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 245 | 2.42 | 0.62 | 1.10 | 4.07 | 0.85 | 0.11 | 0.41 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 246 | 2.40 | 0.61 | 1.10 | 4.07 | 0.85 | 0.11 | 0.41 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 248 | 2.41 | 0.62 | 1.06 | 4.07 | 0.85 | 0.11 | 0.38 | 0.99 | -3.57 | 0.46 | -3.91 | -1.07 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 248 | 2.40 | 0.59 | 1.09 | 4.08 | 0.84 | 0.12 | 0.04 | 0.98 | -3.66 | 0.40 | -3.91 | -0.34 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 244 | 2.46 | 0.58 | 1.15 | 4.02 | 0.84 | 0.10 | 0.44 | 0.98 | -3.55 | 0.48 | -3.91 | -0.98 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 210 | 2.22 | 0.60 | 1.07 | 3.80 | 0.80 | 0.19 | -0.19 | 0.97 | -3.64 | 0.50 | -3.91 | -1.36 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 247 | 2.41 | 0.59 | 1.10 | 4.08 | 0.84 | 0.12 | 0.06 | 0.98 | -3.65 | 0.42 | -3.91 | -0.34 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 246 | 2.39 | 0.63 | 1.09 | 4.07 | 0.84 | 0.12 | 0.39 | 0.99 | -3.60 | 0.43 | -3.91 | -0.87 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 58 | 2.79 | 0.70 | 1.29 | 4.08 | 0.88 | 0.11 | 0.44 | 0.99 | -3.53 | 0.54 | -3.91 | -0.98 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 58 | 2.90 | 0.61 | 1.39 | 4.06 | 0.89 | 0.10 | 0.56 | 0.99 | -3.41 | 0.57 | -3.91 | -0.93 |
System performance under partially known conditions
The table shows the separation and detection performance of each system under partially known conditions. The 'Known IR Condition' is the case where the evaluation data is synthesized with room impulse responses included in the training data. The 'Known Target Condition' is the case where the evaluation data is synthesized with target sound event samples included in the training data. The 'Known Noise Condition' is the case where the evaluation data is synthesized with background noise included in the training data. The 'Known Interference Condition' is the case where the evaluation data is synthesized with interference sound samples included in the training data.
Submission Code | Technical Report | CA-SDRi (eval) | Accuracy (eval) | Known IR CA-SDRi | Known IR Accuracy | Known Target CA-SDRi | Known Target Accuracy | Known Noise CA-SDRi | Known Noise Accuracy | Known Interference CA-SDRi | Known Interference Accuracy
---|---|---|---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 7.63 | 37.78 | 8.90 | 62.96 | 5.97 | 33.33 | 4.53 | 26.85 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 10.34 | 61.11 | 9.94 | 68.52 | 8.81 | 58.33 | 7.01 | 48.15 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 6.89 | 33.33 | 8.10 | 51.85 | 5.80 | 25.93 | 4.90 | 26.85 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 10.19 | 51.11 | 9.19 | 61.11 | 8.11 | 51.85 | 7.29 | 50.93 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 13.49 | 53.33 | 10.79 | 64.81 | 11.65 | 62.04 | 11.35 | 63.89 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 13.76 | 56.67 | 10.16 | 59.26 | 11.77 | 67.59 | 10.78 | 62.96 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 14.53 | 61.11 | 10.60 | 59.26 | 12.11 | 67.59 | 11.59 | 66.67 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 13.32 | 52.22 | 10.89 | 63.89 | 11.85 | 63.89 | 11.19 | 62.04 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 11.08 | 48.89 | 11.25 | 75.00 | 9.68 | 53.70 | 8.49 | 47.22 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 11.87 | 48.89 | 11.06 | 75.00 | 9.92 | 53.70 | 9.12 | 47.22 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 12.75 | 68.89 | 12.37 | 92.59 | 10.02 | 61.11 | 9.22 | 58.33 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 12.68 | 70.00 | 12.36 | 91.67 | 9.80 | 61.11 | 9.23 | 59.26 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 10.95 | 58.89 | 12.17 | 91.67 | 9.07 | 48.15 | 8.37 | 50.00 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 12.13 | 67.78 | 11.96 | 90.74 | 9.69 | 61.11 | 8.91 | 56.48 | |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 7.91 | 48.89 | 9.88 | 75.00 | 6.41 | 53.70 | 5.95 | 47.22
Baseline_task4_ResUNet | | 5.72 | 51.48 | 7.03 | 48.89 | 9.51 | 75.00 | 5.27 | 53.70 | 5.08 | 47.22
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 7.88 | 48.89 | 8.45 | 54.63 | 6.58 | 39.81 | 5.77 | 45.37 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 7.46 | 43.33 | 8.30 | 51.85 | 7.07 | 43.52 | 5.87 | 38.89 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 8.12 | 46.67 | 8.34 | 52.78 | 7.07 | 44.44 | 5.41 | 40.74 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 8.05 | 46.67 | 8.43 | 54.63 | 7.13 | 44.44 | 5.72 | 42.59 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 10.58 | 52.22 | 10.66 | 88.89 | 8.55 | 50.93 | 7.32 | 47.22 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 9.54 | 54.44 | 9.86 | 80.56 | 8.39 | 49.07 | 7.19 | 52.78 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 5.86 | 27.78 | 5.43 | 32.41 | 5.33 | 29.63 | 3.81 | 18.52 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 10.34 | 52.22 | 11.04 | 95.37 | 8.04 | 46.30 | 7.24 | 45.37 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 7.91 | 48.89 | 9.88 | 75.00 | 6.41 | 53.70 | 5.95 | 47.22 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 3.86 | 17.78 | 5.15 | 28.70 | 3.89 | 30.56 | 2.86 | 26.85 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 3.64 | 17.78 | 5.21 | 28.70 | 3.59 | 30.56 | 2.73 | 26.85 |
System performance in more challenging conditions
This table shows the performance of each system under the 'Multiple Same Class Condition', in which multiple sound events of the same class are included in the same clip.
Submission Code | Technical Report | CA-SDRi (eval) | Accuracy (eval) | Multiple Same Class CA-SDRi | Multiple Same Class Accuracy
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 1.89 | 41.67 | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 3.98 | 60.19 | |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 2.53 | 47.22 | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 4.12 | 64.81 | |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 4.15 | 67.59 | |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 3.79 | 62.96 | |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 3.84 | 62.04 | |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 3.89 | 66.67 | |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 2.97 | 65.74 | |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 3.09 | 65.74 | |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 5.01 | 72.22 | |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 4.87 | 73.15 | |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 3.99 | 62.96 | |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 4.45 | 69.44 | |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 3.53 | 65.74
Baseline_task4_ResUNet | | 5.72 | 51.48 | 2.21 | 65.74
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 3.17 | 63.89 | |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 3.49 | 65.74 | |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 2.88 | 62.04 | |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 3.33 | 65.74 | |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 2.62 | 61.11 | |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 2.89 | 64.81 | |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 0.69 | 22.22 | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 2.51 | 64.81 | |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 3.53 | 65.74 | |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 1.16 | 36.11 | |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 1.05 | 36.11 |
System characteristics
General characteristics
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Input Sampling Rate | Input Acoustic Features
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 32kHz | waveform, spectrogram |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 32kHz | waveform, spectrogram |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 32kHz | waveform, spectrogram |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 32kHz | waveform, spectrogram |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 32kHz | waveform, spectrogram |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 32kHz | waveform, spectrogram |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 32kHz | waveform, spectrogram
Baseline_task4_ResUNet | | 5.72 | 51.48 | 32kHz | waveform, spectrogram
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 32kHz | waveform, spectrogram, spectral_roll-off |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 32kHz | waveform, spectrogram, chroma |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 32kHz | waveform, spectrogram, spectral_roll-off, chroma |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 32kHz | waveform, spectrogram, spectral_roll-off, chroma |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 32kHz | waveform, log mel spectrogram |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 32kHz | spectrogram, log mel spectrogram |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 32kHz | waveform |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 32kHz | waveform, log mel spectrogram, fbank |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 32kHz | waveform, spectrogram |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 32kHz | waveform, spectrogram |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 32kHz | waveform, spectrogram |
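Several submissions augment the waveform and spectrogram inputs with additional features such as log-mel spectrograms, spectral roll-off, and chroma. A minimal sketch of how such features might be extracted with librosa (the file name and parameter values are illustrative, not those used by any submission):

```python
import librosa
import numpy as np

# Hypothetical single-channel input at the task's 32 kHz sampling rate.
y, sr = librosa.load("mixture.wav", sr=32000, mono=True)

spectrogram = np.abs(librosa.stft(y, n_fft=1024, hop_length=320))              # magnitude spectrogram
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=320, n_mels=128)
)                                                                               # log-mel spectrogram
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=320)          # spectral roll-off per frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=320)                # chroma features per frame
```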
Machine learning characteristics
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Machine Learning Method | Loss Function | Training Dataset | Data Augmentation | Pretrained Models
---|---|---|---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | Neural blind source separation model (ISS-enhanced RE-SepFormer) | CE, SNR loss | DCASE2025Task4Dataset | SpecAug, dynamic mixing | |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) | CE, SNR loss | DCASE2025Task4Dataset | SpecAug, dynamic mixing | BEATs |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | Neural blind source separation model (ISS-enhanced RE-SepFormer) | CE, SNR loss | DCASE2025Task4Dataset,AudioSet | SpecAug, dynamic mixing | |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | Neural blind source separation model (ISS-enhanced RE-SepFormer), self-supervised audio encoder (BEATs) | CE, SNR loss | DCASE2025Task4Dataset,AudioSet | SpecAug, dynamic mixing | BEATs |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | ResUNet-based separation model, M2D-based audio tagging model | SA-SDR, CE, KL-divergence, ArcFace | DCASE2025Task4Dataset | | M2D,ATST
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | TSTF_v1-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | TSTF_v1-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | ResUNet-based separation model (AudioSep); M2D model for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset,FOA-MEIR | SNR Range Augmentation | BEATs,M2D,AudioSep |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | ResUNet-based separation models (AudioSep); M2D, BEATs, ATST-F, ASiT and fPaSST models for audio tagging | BCE, SDR loss | DCASE2025Task4Dataset | SNR Range Augmentation | BEATs,M2D,ATST-F,ASiT,fPaSST,AudioSep |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Baseline_task4_ResUNet | | 5.72 | 51.48 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | Baseline separation model, M2D and spectral rolloff feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | Baseline separation model, M2D and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | Baseline separation model, M2D, spectral rolloff, and chroma feature based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset,AudioSet | | M2D,ResUNetK
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | Sepformer-based separation model, M2D-based multi-channel audio tagging model | BCE, masked SDR loss | DCASE2025Task4Dataset | | M2D
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | BSRNN-based separation model, M2D-based multi-channel audio tagging model | BCE, PIT SDR loss, masked SDR loss | DCASE2025Task4Dataset | | M2D
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | SepEDA-based audio tagging and separation model | BCE, PIT SDR loss | DCASE2025Task4Dataset | |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | Sepformer-based separation model, M2D-based multi-channel audio tagging model + BEATs-based audio tagging model | BCE, masked SDR loss | DCASE2025Task4Dataset | | M2D,BEATs
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | Attentive ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | SpecAugment | M2D |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | ResUNet-based separation model, M2D-based audio tagging model | BCE, SDR loss | DCASE2025Task4Dataset | | M2D
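Most systems in the table combine a (binary) cross-entropy loss for tagging with an SDR-style loss for separation. A minimal sketch of a negative-SDR training loss in PyTorch (plain SDR; the SA-SDR, PIT, and masked variants listed above differ in how sources are grouped or matched):

```python
import torch

def neg_sdr_loss(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative signal-to-distortion ratio, averaged over the batch (higher SDR means lower loss)."""
    signal_power = torch.sum(reference ** 2, dim=-1)
    error_power = torch.sum((reference - estimate) ** 2, dim=-1)
    sdr = 10.0 * torch.log10((signal_power + eps) / (error_power + eps))
    return -sdr.mean()
```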
Complexity
Submission Code | Technical Report | CA-SDRi | Label Prediction Accuracy | Ensemble Subsystems | Number of Parameters
---|---|---|---|---|---
Bando_AIST_task4_1 | Bando_2025_t4 | 5.17 | 29.20 | 10 | 25.2M |
Bando_AIST_task4_2 | Bando_2025_t4 | 7.55 | 49.51 | 10 | 117M |
Bando_AIST_task4_3 | Bando_2025_t4 | 5.26 | 31.98 | 10 | 25.2M |
Bando_AIST_task4_4 | Bando_2025_t4 | 7.52 | 48.58 | 10 | 117M |
Choi_KAIST_task4_1 | Choi_2025_t4 | 10.63 | 53.52 | 1 | 204M |
Choi_KAIST_task4_2 | Choi_2025_t4 | 10.80 | 56.42 | 1 | 204M |
Choi_KAIST_task4_3 | Choi_2025_t4 | 11.00 | 55.80 | 1 | 204M |
Choi_KAIST_task4_4 | Choi_2025_t4 | 10.85 | 54.26 | 1 | 204M |
Wu_SUSTech_task4_submission_1 | FulinWu_2025_t4 | 9.11 | 51.54 | 1 | 88.3M |
Wu_SUSTech_task4_submission_2 | FulinWu_2025_t4 | 9.73 | 51.54 | 1 | 87.4M |
Morocutti_CPJKU_task4_1 | Morocutti_2025_t4 | 9.76 | 61.30 | 118 | 10821.00M |
Morocutti_CPJKU_task4_2 | Morocutti_2025_t4 | 9.77 | 61.60 | 58 | 5359.00M |
Morocutti_CPJKU_task4_3 | Morocutti_2025_t4 | 8.91 | 51.98 | 1 | 228.60M |
Morocutti_CPJKU_task4_4 | Morocutti_2025_t4 | 8.99 | 57.28 | 33 | 3034.20M |
Baseline_task4_ResUNetK | | 6.60 | 51.48 | 1 | 115.40M
Baseline_task4_ResUNet | | 5.72 | 51.48 | 1 | 115.38M
Park_GIST-HanwhaVision_task4_1 | Park_2025_t4 | 6.38 | 43.89 | 1 | 116.6M |
Park_GIST-HanwhaVision_task4_2 | Park_2025_t4 | 6.64 | 46.67 | 1 | 115.9M |
Park_GIST-HanwhaVision_task4_3 | Park_2025_t4 | 6.50 | 45.19 | 1 | 116.6M |
Park_GIST-HanwhaVision_task4_4 | Park_2025_t4 | 6.69 | 47.22 | 4 | 464.5M |
Qian_SJTU_task4_1 | Qian_2025_t4 | 7.84 | 47.72 | 1 | 105.1M |
Qian_SJTU_task4_2 | Qian_2025_t4 | 7.72 | 50.68 | 1 | 204M |
Qian_SJTU_task4_3 | Qian_2025_t4 | 4.43 | 22.84 | 1 | 8.9M |
Qian_SJTU_task4_4 | Qian_2025_t4 | 7.49 | 44.20 | 2 | 195.4M |
Stergioulis_UTh_task4_submission_1 | Stergioulis_2025_t4 | 6.60 | 51.48 | 1 | 115.41M |
Zhang_BUPT_task4_1 | Zhang_2025_t4 | 3.84 | 22.41 | 1 | 115.40M |
Zhang_BUPT_task4_2 | Zhang_2025_t4 | 3.54 | 22.41 | 1 | 115.38M |
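The parameter counts above typically cover all subsystems of a submission, including frozen pretrained encoders. For a PyTorch model, such a count is commonly obtained as follows (illustrative sketch):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> str:
    """Total parameter count, formatted in millions as used in the table."""
    n_params = sum(p.numel() for p in model.parameters())
    return f"{n_params / 1e6:.1f}M"
```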
Representative examples of separated audio samples
Evaluation set
The following table shows separated sound samples from the evaluation set. Here, representative outputs from systems ranked 1 to 3 and the baseline are selected.
Condition (synthesized) | Oracle | Choi_KAIST_task4_3 (Rank 1) | Morocutti_CPJKU_task4_2 (Rank 2) | Wu_SUSTech_task4_2 (Rank 3) | Baseline_task4_ResUNetK
---|---|---|---|---|---
Success case | Speech, Buzzer, Doorbell | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 17.32 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 17.00 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 16.93 dB) | Speech, Buzzer, Doorbell (CA-SDRi for this sample: 11.87 dB)
Challenging case | Pour, Percussion, Cough | Pour, Percussion, -- (CA-SDRi for this sample: 7.19 dB) | Pour, --, -- (CA-SDRi for this sample: 3.89 dB) | Pour, --, -- (CA-SDRi for this sample: 4.63 dB) | Pour, --, -- (CA-SDRi for this sample: 0.21 dB)
Real recording
The following table shows separated sound samples from mixtures recorded using a first-order Ambisonics microphone. Here, representative outputs from systems ranked 1 to 3 and the baseline are selected.
Condition (real recording) | Choi_KAIST_task4_3 (Rank 1) | Morocutti_CPJKU_task4_2 (Rank 2) | Wu_SUSTech_task4_2 (Rank 3) | Baseline_task4_ResUNetK
---|---|---|---|---
Indoor | Blender, Doorbell, Cough | --, Doorbell, Cough, HairDryer (FP) | --, Doorbell, Cough | --, Doorbell, Cough
Outdoor | --, BicycleBell, MusicalKeyboard (FP) | --, BicycleBell, MusicalKeyboard (FP) | Speech, BicycleBell, MusicalKeyboard (FP) | Speech, BicycleBell, MusicalKeyboard (FP)
Technical reports
A HYBRID S5 SYSTEM BASED ON NEURAL BLIND SOURCE SEPARATION
Yuto Nozaki1, Shun Sakurai1,2, Yoshiaki Bando1, Kohei Saijo1,3, Keisuke Imoto1,4, Masaki Onishi1
1National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, 2University of Tsukuba, Ibaraki, Japan, 3Waseda University, Tokyo, Japan, 4Kyoto University, Kyoto, Japan
Bando_AIST_task4_1 Bando_AIST_task4_2 Bando_AIST_task4_3 Bando_AIST_task4_4
Abstract
In this paper, we report our hybrid system for the DCASE 2025 Challenge Task 4 based on neural blind source separation (BSS). This task, called spatial semantic segmentation of sound scenes (S5), aims to detect and separate sound events from a multichannel mixture signal. To make the separation model robust against unseen audio environments, we leverage neural BSS to combine robust statistical signal processing and high-performing neural modeling. Specifically, our network architecture incorporates the iterative source steering algorithm to separate source signals using spatial statistics. The network is trained via multitask learning of source separation and classification with permutation invariant training. In addition, to improve the performance, we utilized an audio foundation model called BEATs and augmented the training data by curating AudioSet. The experimental results on the official development test set show that our best system (System 2) improved class-aware signal-to-distortion ratio improvement (CA-SDRi) by more than 2 dB over the official baseline system.
SELF-GUIDED TARGET SOUND EXTRACTION AND CLASSIFICATION THROUGH UNIVERSAL SOUND SEPARATION MODEL AND MULTIPLE CLUES
Younghoo Kwon1, Dongheon Lee1, Dohwan Kim1, Jung-Woo Choi1
1KAIST, School of Electrical Engineering, Daejeon, South Korea
Choi_KAIST_task4_1 Choi_KAIST_task4_2 Choi_KAIST_task4_3 Choi_KAIST_task4_4
Abstract
This paper presents a multi-stage framework that integrates Universal Sound Separation (USS) and Target Sound Extraction (TSE) for sound separation and classification. In the first stage, USS is applied to decompose the input audio into multiple source components. Each separated source is then individually classified to generate two types of clues: enrollment and class clues. These clues are subsequently utilized in the second stage to guide the TSE process. By generating multiple clues in a self-guided manner, the proposed method enhances the performance of target sound extraction. The final output of the TSE module is then re-classified to improve the classification accuracy further.
TS-TFGRIDNET: EXTENDING TFGRIDNET FOR LABEL-QUERIED TARGET SOUND EXTRACTION VIA EMBEDDING CONCATENATION
Fulin Wu1, Zhong-Qiu Wang1
1Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China
Wu_SUSTech_task4_submission_1 Wu_SUSTech_task4_submission_2
Abstract
The DCASE2025 Challenge Task 4 - Spatial Semantic Segmentation of Sound Scenes (S5) challenges participants to separate a set of mixed sound events (sampled from 18 targeted sound events) into individual sound-event signals. The baseline system provided by the challenge organizers first performs audio tagging to identify the sound events present in the mixture, and then conducts label-queried target sound extraction (TSE) to extract the signal of each identified sound event. Building on the baseline system, we propose to improve the label-queried TSE component by using a novel model named Target Sound Extraction TF-GridNet (TS-TFGridNet), leveraging the strong capability of TF-GridNet at speech separation for TSE. TS-TFGridNet concatenates audio and sound-class embeddings along the frequency or feature dimension, thereby conditioning TF-GridNet to perform TSE. Clear improvement is observed over the baseline system.
TRANSFORMER-AIDED AUDIO SOURCE SEPARATION WITH TEMPORAL GUIDANCE AND ITERATIVE REFINEMENT
Tobias Morocutti2, Florian Schmid1, Jonathan Greif1, Paul Primus1, Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), 2LIT Artificial Intelligence Lab, Johannes Kepler University Linz, Austria
Morocutti_CPJKU_task4_1 Morocutti_CPJKU_task4_2 Morocutti_CPJKU_task4_3 Morocutti_CPJKU_task4_4
Abstract
This technical report presents the CP-JKU team's system for Task 4, Spatial Semantic Segmentation of Sound Scenes, of the DCASE 2025 Challenge. Building on the two-stage baseline of audio tagging followed by label-conditioned source separation, we introduce several key enhancements. We reframe the tagging stage as a sound event detection task using five Transformers pre-trained on AudioSet Strong, enabling temporally strong conditioning of the separator. We further inject the Transformer's latent representations into a ResUNet separator initialized from AudioSep and extended with a Dual-Path RNN. Additionally, we propose an iterative refinement scheme that reuses the separator's output as input. These improvements raise label prediction accuracy to 73.07% and CA-SDRi to 14.49 for a single-model system on the development test set, substantially surpassing the baseline.
PERFORMANCE IMPROVEMENT OF SPATIAL SEMANTIC SEGMENTATION WITH ENRICHED AUDIO FEATURES AND AGENT-BASED ERROR CORRECTION FOR DCASE 2025 CHALLENGE TASK 4
Jongyeon Park1, Joonhee Lee2, Do-Hyeon Lim1, Hong Kook Kim1,2, Hyeongcheol Geum3, Jeong Eun Lim3
1Dept. of AI Convergence, 2Dept. of EECS, Gwangju Institute of Science and Technology, Gwangju 61005, Korea, 3AI Lab., R&D Center, Hanwha Vision, Seongnam-si, Gyeonggi-do 13488, Korea
Park_GIST-HanwhaVision_task4_1 Park_GIST-HanwhaVision_task4_2 Park_GIST-HanwhaVision_task4_3 Park_GIST-HanwhaVision_task4_4
Abstract
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. First, the model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches improve CA-SDRi by up to 14.7% relative to the baseline of DCASE 2025 Challenge Task 4.
SJTU-AUDIOCC SYSTEM FOR DCASE 2025 CHALLENGE TASK 4: SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES
Xin Zhou1, Hongyu Wang1, Chenda Li1, Bing Han1, Xinhu Zheng1, Yanmin Qian1
1Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Qian_SJTU_task4_1 Qian_SJTU_task4_2 Qian_SJTU_task4_3 Qian_SJTU_task4_4
Abstract
The present report introduces four systems developed by the AudioCC Lab at Shanghai Jiao Tong University for DCASE 2025 Task 4. The task at hand is to detect target sound events and separate the corresponding signals from a multi-channel mixture. We found that effective detection of sound events and extraction of the corresponding signals is challenging under conditions where the mixture consists of multiple target sound events, non-target sound events, and non-directional background noise. In order to address this challenge, we propose four systems. The first system represents an enhancement to the baseline system, the second is a multistage iterative system that is both novel and promising, the third is a lightweight model based on an Encoder-Decoder Attractor (EDA) module, and the fourth integrates multiple audio tagging models to achieve optimal performance. These four systems cover high performance, low overhead, and promising frameworks, providing a reference for future research on this task.
REDUX: AN ITERATIVE STRATEGY FOR SEMANTIC SOURCE SEPARATION
Vasileios Stergioulis1, Gerasimos Potamianos1
1Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
Stergioulis_UTh_task4_submission_1
Abstract
In this report, we present a system for the Spatial Semantic Segmentation of Sound Scenes (DCASE 2025 Task 4), combining enhanced source separation and label classification through an iterative verification strategy. Our approach integrates the Masked Modeling Duo (M2D) classifier with a separator architecture based on an attentive ResUNeXt network. Inspired by recent advances in universal audio modeling and self-supervised separation, our system incorporates feedback between multiple classification and separation stages to correct early-stage prediction errors. Specifically, classification outputs are verified using post-separation reclassification, and ambiguous cases are resolved through targeted waveform subtraction and re-analysis. This strategy enables improved source-label associations without increasing model complexity. Evaluated on the development set, our method achieves a relative improvement of 0.28% in CA-SDRi and 1.46% in accuracy over the baseline.
SPATIAL SEMANTIC SEGMENTATION OF SOUND SCENES BASED ON ADAPTER FINE-TUNING
Zehao Wang1, Sen Wang1, Zhicheng Zhang1, Jianqin Yin1
1Beijing University of Posts and Telecommunications, China
Zhang_BUPT_task4_1 Zhang_BUPT_task4_2
Abstract
In this work, we present our submission system for DCASE 2025 Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). We introduce audio tagging (AT) and label-queried source separation (LSS) systems built on a fine-tuned M2D and a modified version of ResUNet. By introducing a dual-path recurrent neural network (DPRNN) module into ResUNet and improving the Feature-wise Linear Modulation (FiLM) mechanism, we enhance the model's ability to capture long-term dependencies in spatial audio and the flexibility of dynamic feature adjustment. Experimental results show that the improved system outperforms the baseline system in class-aware evaluation metrics (CA-SDRi, CA-SI-SDRi), verifying the effectiveness of the method.