Stereo Sound Event Localization and Detection in Regular Video Content


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event from a known set of sound classes, and localize the events in space while they are active.

The focus of the current SELD task is developing systems that can perform adequately on stereo audio data. There are two tracks: an audio-only track (Track A) for systems that use only stereo audio to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally employ simultaneous perspective video, spatially aligned with the stereo audio.

The task provides two datasets: development and evaluation. Only the development dataset provides reference labels. Participants are expected to build and validate systems on the development dataset, report results on a predefined development set split, and finally test their systems on the unseen evaluation dataset.

More details on the task setup and evaluation can be found on the task description page.
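The headline metric in the tables below, F-score (20°/1), is location-aware: a detection only counts as a true positive if the class is correct, the direction-of-arrival error is within 20°, and the relative distance error is within 1. A minimal sketch of such a matching rule (the exact official matching procedure is defined in the task description, so the tuple layout and details here are illustrative):

```python
def matches(pred, ref, doa_thresh_deg=20.0, rel_dist_thresh=1.0):
    """Location-aware true-positive test behind an F-score like F(20°/1).

    pred and ref are (class_idx, azimuth_deg, distance_m) tuples; this is
    an illustrative sketch, not the official evaluation code.
    """
    cls_p, az_p, d_p = pred
    cls_r, az_r, d_r = ref
    # Wrap the angular difference into [-180, 180] before thresholding.
    ang_err = abs((az_p - az_r + 180.0) % 360.0 - 180.0)
    rel_dist_err = abs(d_p - d_r) / d_r
    return (cls_p == cls_r
            and ang_err <= doa_thresh_deg
            and rel_dist_err <= rel_dist_thresh)

ok = matches((3, 12.0, 2.0), (3, 25.0, 2.4))  # 13° off, ~17% distance error: true positive
```

Predictions that fail any of the three checks are counted as false positives, and unmatched references as false negatives, before the usual F-score is computed.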

Teams ranking

The SELD task received 57 submissions in total from 16 teams across the world. Of those, 40 submissions were for the audio-only Track A and 17 for the audiovisual Track B. Four teams participated in both tracks, ten only in Track A, and two only in Track B.

The following table includes only the best-performing system per submitting team. Confidence intervals are reported for each metric on the evaluation set results.
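Intervals like these are typically obtained by resampling the evaluation clips. A percentile-bootstrap sketch of the idea (the official interval procedure is described with the task evaluation setup; the function and score values here are illustrative):

```python
import random

def bootstrap_ci(per_clip_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over per-clip metric scores (illustrative).

    Resample clips with replacement, recompute the mean metric each time,
    and take the alpha/2 and 1-alpha/2 percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(per_clip_scores)
    means = sorted(
        sum(rng.choices(per_clip_scores, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

low, high = bootstrap_ci([0.41, 0.52, 0.47, 0.39, 0.55, 0.44])
```

The width of the resulting interval reflects how much the metric varies across clips, which is why systems with similar point scores can still have clearly separated or heavily overlapping intervals.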

Track A: Audio-only

Submission name | Corresponding author | Affiliation | Technical report | Team rank | F-score (20°/1) | DOA error (°) | Relative distance error
All metrics are computed on the evaluation dataset.
Du_NERCSLIP_task3a_4 Jun Du University of Science and Technology of China Du_NERCSLIP_task3_report 1 50.4 (49.2 - 51.4) 12.2 (11.7 - 12.5) 26.9 (25.9 - 28.1)
He_HIT_task3a_1 Changjiang He Harbin Institute of Technology He_HIT_task3a_report 2 47.0 (45.6 - 48.2) 13.3 (12.6 - 13.9) 38.6 (37.5 - 39.9)
Banerjee_NTU_task3a_1 Mohor Banerjee Nanyang Technological University Banerjee_NTU_task3a_report 3 43.9 (42.5 - 45.5) 14.0 (13.2 - 14.7) 35.2 (33.6 - 36.5)
Berghi_SURREY_task3a_2 Davide Berghi University of Surrey Berghi_SURREY_task3_report 4 42.5 (41.4 - 43.8) 15.4 (14.5 - 16.2) 31.4 (30.5 - 32.4)
Wu_HUST_task3a_2 Digao Wu Huazhong University of Science and Technology Wu_HUST_task3a_report 5 41.8 (40.4 - 43.3) 15.3 (14.6 - 16.0) 29.3 (28.2 - 30.4)
Yeow_NTU_task3a_3 Jun Wei Yeow Nanyang Technological University Yeow_NTU_task3a_report 6 41.3 (40.0 - 42.7) 14.5 (13.3 - 15.6) 28.0 (26.9 - 28.9)
Wan_XJU_task3a_1 QingJing Wan Xinjiang University Wan_XJU_task3a_report 7 35.4 (34.3 - 36.7) 18.6 (17.4 - 19.4) 34.9 (33.1 - 36.9)
Zhao_MITC-MG_task3a_3 Tianbo Zhao Xiaomi Corporation Zhao_MITC-MG_task3a_report 8 34.0 (33.1 - 34.7) 16.8 (15.1 - 18.4) 36.6 (35.7 - 37.3)
Gao_DTU_task3a_1 Wenmiao Gao Technical University of Denmark Gao_DTU_task3a_report 9 31.0 (30.2 - 31.8) 17.4 (13.6 - 18.6) 40.1 (35.9 - 41.5)
Park_KAIST_task3a_2 Jehyun Park Korea Advanced Institute of Science and Technology Park_KAIST_task3a_report 10 30.3 (29.5 - 31.1) 14.6 (13.6 - 17.5) 32.4 (26.9 - 43.2)
Bahuguna_UPF_task3a_3 Arjun Bahuguna Universitat Pompeu Fabra Bahuguna_UPF_task3a_report 11 28.8 (27.7 - 29.7) 21.2 (16.8 - 26.9) 100.0 (100.0 - 100.0)
Bingnan_UOE_task3a_1 Duan Bingnan The University of Edinburgh Bingnan_UOE_task3a_report 12 26.9 (26.1 - 27.9) 24.6 (21.4 - 32.8) 37.9 (31.6 - 54.4)
AO_Baseline Parthasaarathy Sudarsanam Tampere University Baseline_report 13 26.1 (25.0 - 27.6) 23.0 (21.5 - 24.1) 33.2 (30.8 - 37.3)
Guan_GISP-HEU_task3a_1 Jian Guan Harbin Engineering University Guan_GISP-HEU_task3_report 14 25.1 (24.0 - 26.1) 24.7 (22.1 - 27.9) 35.6 (34.7 - 36.2)
Kim_Samsung_task3a_1 Gwantae Kim Samsung Electronics Kim_Samsung_task3_report 15 24.6 (23.7 - 25.5) 18.2 (13.2 - 25.9) 33.7 (32.4 - 35.1)

Track B: Audiovisual

Submission name | Corresponding author | Affiliation | Technical report | Team rank | F-score (20°/1/on) | F-score (20°/1) | DOA error (°) | Relative distance error | Onscreen accuracy
All metrics are computed on the evaluation dataset.
Du_NERCSLIP_task3b_1 Jun Du University of Science and Technology of China Du_NERCSLIP_task3_report 1 41.6 (40.3 - 42.6) 50.1 (48.8 - 51.1) 12.2 (11.7 - 12.5) 27.0 (26.0 - 28.1) 82.2 (80.0 - 84.5)
Berghi_SURREY_task3b_3 Davide Berghi University of Surrey Berghi_SURREY_task3_report 2 34.8 (33.7 - 35.9) 46.2 (44.9 - 47.9) 14.1 (13.5 - 14.4) 30.4 (29.0 - 31.5) 76.9 (73.5 - 80.0)
Chengnuo_JSU_task3b_4 Sun Chengnuo Jiangsu University Chengnuo_JSU_task3b_report 3 20.8 (19.9 - 21.7) 27.5 (26.7 - 28.3) 22.2 (20.9 - 23.4) 37.7 (35.7 - 40.1) 77.8 (74.0 - 81.3)
AV_Baseline Parthasaarathy Sudarsanam Tampere University Baseline_report 4 20.8 (19.9 - 21.7) 27.5 (26.7 - 28.3) 22.2 (20.9 - 23.4) 37.7 (35.7 - 40.1) 77.8 (74.0 - 81.3)
Guan_GISP-HEU_task3b_3 Jian Guan Harbin Engineering University Guan_GISP-HEU_task3_report 5 18.2 (17.7 - 18.7) 24.7 (23.9 - 25.5) 24.0 (20.2 - 27.1) 38.2 (31.3 - 46.1) 76.1 (68.0 - 84.8)
Yu_Polyu_task3b_1 Xiang Yu The Hong Kong Polytechnic University Yu_Polyu_task3b_report 6 18.1 (17.2 - 19.1) 24.8 (24.0 - 25.8) 18.1 (17.2 - 19.0) 34.0 (31.6 - 35.8) 79.8 (77.3 - 82.3)
Kim_Samsung_task3b_1 Gwantae Kim Samsung Electronics Kim_Samsung_task3_report 7 18.0 (17.0 - 18.9) 24.5 (23.6 - 25.4) 20.9 (14.8 - 38.0) 34.0 (31.2 - 42.8) 78.4 (76.0 - 82.8)

Systems ranking

Performance of all submitted systems on the evaluation and development datasets. Confidence intervals are reported for each metric on the evaluation set results.

Track A: Audio-only

Submission name | Technical report | Submission rank | Evaluation dataset: F-score (20°/1), DOA error (°), Relative distance error | Development dataset: F-score (20°/1), DOA error (°), Relative distance error
Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3_report 1 50.4 (49.2 - 51.4) 12.2 (11.7 - 12.5) 26.9 (25.9 - 28.1) 54.3 11.8 26.0
Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3_report 2 49.6 (48.5 - 50.4) 12.4 (11.9 - 12.8) 26.9 (25.9 - 28.0) 54.0 12.0 25.9
Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3_report 3 49.2 (48.0 - 50.2) 12.4 (12.0 - 12.8) 27.4 (26.4 - 28.6) 53.1 12.1 25.9
He_HIT_task3a_1 He_HIT_task3a_report 4 47.0 (45.6 - 48.2) 13.3 (12.6 - 13.9) 38.6 (37.5 - 39.9) 50.0 13.1 36.0
Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3_report 5 46.6 (45.1 - 47.7) 12.3 (11.8 - 12.8) 29.8 (28.6 - 31.1) 50.1 12.5 26.5
He_HIT_task3a_2 He_HIT_task3a_report 6 45.4 (44.4 - 46.3) 13.6 (12.9 - 14.2) 30.4 (29.4 - 31.8) 48.8 13.4 32.0
He_HIT_task3a_3 He_HIT_task3a_report 7 45.3 (43.9 - 46.3) 12.8 (12.2 - 13.3) 31.3 (29.8 - 32.9) 51.3 12.5 33.0
He_HIT_task3a_4 He_HIT_task3a_report 8 44.3 (43.0 - 45.6) 13.5 (13.0 - 13.9) 29.8 (28.8 - 31.0) 47.8 13.0 30.0
Banerjee_NTU_task3a_1 Banerjee_NTU_task3a_report 9 43.9 (42.5 - 45.5) 14.0 (13.2 - 14.7) 35.2 (33.6 - 36.5) 48.2 13.3 36.0
Banerjee_NTU_task3a_2 Banerjee_NTU_task3a_report 10 43.7 (42.3 - 45.0) 14.1 (13.0 - 15.2) 36.5 (34.3 - 38.4) 47.5 13.7 36.0
Berghi_SURREY_task3a_2 Berghi_SURREY_task3_report 11 42.5 (41.4 - 43.8) 15.4 (14.5 - 16.2) 31.4 (30.5 - 32.4) 46.0 15.2 30.8
Wu_HUST_task3a_2 Wu_HUST_task3a_report 12 41.8 (40.4 - 43.3) 15.3 (14.6 - 16.0) 29.3 (28.2 - 30.4) 42.5 15.0 30.0
Yeow_NTU_task3a_3 Yeow_NTU_task3a_report 13 41.3 (40.0 - 42.7) 14.5 (13.3 - 15.6) 28.0 (26.9 - 28.9) 45.3 13.2 26.2
Yeow_NTU_task3a_4 Yeow_NTU_task3a_report 14 40.9 (39.3 - 42.4) 13.9 (12.9 - 14.7) 28.7 (27.8 - 29.7) 43.1 12.7 25.9
Wu_HUST_task3a_1 Wu_HUST_task3a_report 15 40.8 (39.3 - 42.3) 14.9 (14.1 - 15.7) 29.2 (28.0 - 30.3) 41.3 14.9 30.0
Yeow_NTU_task3a_1 Yeow_NTU_task3a_report 16 40.5 (39.3 - 42.0) 14.1 (13.3 - 14.7) 28.4 (27.0 - 29.7) 44.0 13.2 27.1
Yeow_NTU_task3a_2 Yeow_NTU_task3a_report 17 39.8 (38.3 - 41.1) 13.9 (13.1 - 14.7) 28.4 (27.0 - 29.8) 43.4 13.2 26.1
Wu_HUST_task3a_4 Wu_HUST_task3a_report 18 39.8 (38.9 - 40.9) 16.3 (15.5 - 16.9) 28.3 (27.3 - 29.1) 42.7 16.3 27.0
Wu_HUST_task3a_3 Wu_HUST_task3a_report 19 39.5 (38.5 - 40.7) 15.8 (15.0 - 16.4) 28.3 (27.3 - 29.2) 41.7 16.1 27.0
Wan_XJU_task3a_1 Wan_XJU_task3a_report 20 35.4 (34.3 - 36.7) 18.6 (17.4 - 19.4) 34.9 (33.1 - 36.9) 37.1 18.3 30.0
Zhao_MITC-MG_task3a_3 Zhao_MITC-MG_task3a_report 21 34.0 (33.1 - 34.7) 16.8 (15.1 - 18.4) 36.6 (35.7 - 37.3) 37.0 16.9 39.0
Zhao_MITC-MG_task3a_4 Zhao_MITC-MG_task3a_report 22 32.6 (31.8 - 33.6) 18.8 (16.7 - 20.8) 38.1 (37.1 - 39.1) 35.1 18.0 37.0
Gao_DTU_task3a_1 Gao_DTU_task3a_report 23 31.0 (30.2 - 31.8) 17.4 (13.6 - 18.6) 40.1 (35.9 - 41.5) 39.6 15.8 33.0
Gao_DTU_task3a_3 Gao_DTU_task3a_report 24 30.4 (29.5 - 31.2) 18.8 (16.5 - 20.1) 36.4 (34.2 - 38.0) 38.2 15.9 33.0
Park_KAIST_task3a_2 Park_KAIST_task3a_report 25 30.3 (29.5 - 31.1) 14.6 (13.6 - 17.5) 32.4 (26.9 - 43.2) 36.3 14.5 28.0
Gao_DTU_task3a_4 Gao_DTU_task3a_report 26 29.9 (29.3 - 30.7) 20.8 (19.2 - 22.3) 36.7 (35.5 - 37.9) 35.1 16.5 30.0
Berghi_SURREY_task3a_1 Berghi_SURREY_task3_report 27 29.4 (28.4 - 30.5) 20.1 (17.6 - 22.4) 34.9 (31.5 - 37.7) 45.7 15.0 31.0
Bahuguna_UPF_task3a_3 Bahuguna_UPF_task3a_report 28 28.8 (27.7 - 29.7) 21.2 (16.8 - 26.9) 100.0 (100.0 - 100.0) 28.4 20.3 100.0
Park_KAIST_task3a_1 Park_KAIST_task3a_report 29 28.5 (27.5 - 29.3) 13.2 (9.7 - 14.6) 28.5 (26.2 - 30.4) 35.3 15.5 30.0
Gao_DTU_task3a_2 Gao_DTU_task3a_report 30 28.2 (27.4 - 29.1) 19.8 (17.3 - 21.8) 39.0 (33.2 - 43.5) 36.2 16.6 33.0
Bahuguna_UPF_task3a_1 Bahuguna_UPF_task3a_report 31 27.4 (26.8 - 28.4) 22.3 (20.1 - 24.1) 35.1 (34.1 - 36.4) 26.6 20.4 36.0
Bingnan_UOE_task3a_1 Bingnan_UOE_task3a_report 32 26.9 (26.1 - 27.9) 24.6 (21.4 - 32.8) 37.9 (31.6 - 54.4) 29.0 19.3 30.0
Bahuguna_UPF_task3a_4 Bahuguna_UPF_task3a_report 33 26.8 (26.0 - 27.9) 22.2 (20.5 - 23.6) 37.0 (36.2 - 37.9) 24.8 20.6 34.0
Bahuguna_UPF_task3a_2 Bahuguna_UPF_task3a_report 34 26.4 (25.6 - 27.4) 22.1 (20.0 - 24.4) 36.6 (35.2 - 38.1) 28.0 17.3 43.0
AO_Baseline Baseline_report 35 26.1 (25.0 - 27.6) 23.0 (21.5 - 24.1) 33.2 (30.8 - 37.3) 22.8 24.5 41.0
Guan_GISP-HEU_task3a_1 Guan_GISP-HEU_task3_report 36 25.1 (24.0 - 26.1) 24.7 (22.1 - 27.9) 35.6 (34.7 - 36.2) 22.9 23.5 32.0
Kim_Samsung_task3a_1 Kim_Samsung_task3_report 37 24.6 (23.7 - 25.5) 18.2 (13.2 - 25.9) 33.7 (32.4 - 35.1) 28.8 18.1 34.0
Guan_GISP-HEU_task3a_2 Guan_GISP-HEU_task3_report 38 23.8 (22.8 - 25.2) 27.3 (26.0 - 28.4) 37.2 (33.4 - 40.7) 21.9 28.2 44.0
Guan_GISP-HEU_task3a_3 Guan_GISP-HEU_task3_report 39 22.9 (22.0 - 23.9) 25.1 (22.4 - 27.2) 36.8 (34.1 - 40.1) 25.3 23.0 45.0
Zhao_MITC-MG_task3a_1 Zhao_MITC-MG_task3a_report 40 11.6 (11.4 - 11.7) 20.5 (18.6 - 22.8) 38.0 (36.7 - 39.0) 35.2 17.4 38.0
Zhao_MITC-MG_task3a_2 Zhao_MITC-MG_task3a_report 41 11.2 (11.0 - 11.3) 22.3 (20.1 - 24.6) 38.9 (37.9 - 39.8) 36.3 17.2 37.0

Track B: Audiovisual

Submission name | Technical report | Submission rank | Evaluation dataset: F-score (20°/1/on), F-score (20°/1), DOA error (°), Relative distance error, Onscreen accuracy | Development dataset: F-score (20°/1/on), F-score (20°/1), DOA error (°), Relative distance error, Onscreen accuracy
Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3_report 1 41.6 (40.3 - 42.6) 50.1 (48.8 - 51.1) 12.2 (11.7 - 12.5) 27.0 (26.0 - 28.1) 82.2 (80.0 - 84.5) 46.9 54.1 11.9 26.0 86.0
Du_NERCSLIP_task3b_4 Du_NERCSLIP_task3_report 2 41.4 (40.0 - 42.4) 50.1 (48.6 - 51.0) 12.2 (11.8 - 12.6) 26.9 (25.9 - 28.0) 81.9 (79.7 - 84.2) 47.0 54.2 12.0 25.9 85.6
Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3_report 3 41.2 (39.8 - 42.1) 49.8 (48.3 - 50.8) 11.9 (11.5 - 12.3) 26.6 (25.6 - 27.7) 82.0 (79.8 - 84.3) 47.3 54.6 11.8 26.1 85.8
Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3_report 4 41.0 (39.6 - 41.9) 49.6 (48.1 - 50.5) 11.8 (11.4 - 12.2) 26.9 (25.8 - 28.0) 82.1 (79.9 - 84.3) 46.9 54.2 11.9 25.4 85.9
Berghi_SURREY_task3b_3 Berghi_SURREY_task3_report 5 34.8 (33.7 - 35.9) 46.2 (44.9 - 47.9) 14.1 (13.5 - 14.4) 30.4 (29.0 - 31.5) 76.9 (73.5 - 80.0) 37.3 48.0 14.0 29.3 80.8
Berghi_SURREY_task3b_4 Berghi_SURREY_task3_report 6 34.8 (33.5 - 36.0) 46.2 (44.9 - 47.9) 14.1 (13.5 - 14.4) 30.4 (29.0 - 31.5) 76.3 (72.5 - 79.8) 37.5 48.0 14.0 29.3 80.8
Berghi_SURREY_task3b_1 Berghi_SURREY_task3_report 7 33.6 (32.3 - 34.8) 45.1 (43.7 - 46.6) 14.8 (14.0 - 15.3) 32.3 (29.9 - 34.5) 75.3 (71.6 - 78.6) 34.4 44.4 15.6 30.4 80.5
Berghi_SURREY_task3b_2 Berghi_SURREY_task3_report 8 33.3 (32.0 - 34.7) 43.5 (42.0 - 45.5) 15.0 (14.3 - 15.5) 31.9 (30.6 - 33.3) 77.7 (73.7 - 81.2) 35.8 45.5 15.2 32.2 81.0
Chengnuo_JSU_task3b_4 Chengnuo_JSU_task3b_report 9 20.8 (19.9 - 21.7) 27.5 (26.7 - 28.3) 22.2 (20.9 - 23.4) 37.7 (35.7 - 40.1) 77.8 (74.0 - 81.3) 18.8 26.8 20.1 34.0 80.0
AV_Baseline Baseline_report 10 20.8 (19.9 - 21.7) 27.5 (26.7 - 28.3) 22.2 (20.9 - 23.4) 37.7 (35.7 - 40.1) 77.8 (74.0 - 81.3) 20.0 26.8 23.8 40.0 80.0
Guan_GISP-HEU_task3b_3 Guan_GISP-HEU_task3_report 11 18.2 (17.7 - 18.7) 24.7 (23.9 - 25.5) 24.0 (20.2 - 27.1) 38.2 (31.3 - 46.1) 76.1 (68.0 - 84.8) 17.9 23.7 25.8 37.0 81.0
Chengnuo_JSU_task3b_1 Chengnuo_JSU_task3b_report 12 18.1 (17.2 - 18.8) 23.8 (23.0 - 24.6) 24.4 (20.6 - 29.7) 38.7 (37.0 - 40.1) 74.7 (68.2 - 78.7) 23.1 26.8 20.1 34.0 80.5
Yu_Polyu_task3b_1 Yu_Polyu_task3b_report 13 18.1 (17.2 - 19.1) 24.8 (24.0 - 25.8) 18.1 (17.2 - 19.0) 34.0 (31.6 - 35.8) 79.8 (77.3 - 82.3) 24.1 32.4 18.1 32.9 81.4
Kim_Samsung_task3b_1 Kim_Samsung_task3_report 14 18.0 (17.0 - 18.9) 24.5 (23.6 - 25.4) 20.9 (14.8 - 38.0) 34.0 (31.2 - 42.8) 78.4 (76.0 - 82.8) 26.1 19.6 30.0 74.0
Chengnuo_JSU_task3b_2 Chengnuo_JSU_task3b_report 15 17.9 (16.9 - 18.8) 23.7 (22.8 - 24.6) 23.2 (21.9 - 25.0) 39.1 (37.5 - 40.6) 79.1 (75.9 - 81.9) 22.7 26.8 20.1 34.0 80.5
Guan_GISP-HEU_task3b_2 Guan_GISP-HEU_task3_report 16 17.1 (16.1 - 18.3) 22.7 (21.6 - 23.8) 24.5 (20.6 - 27.6) 42.4 (39.3 - 46.9) 78.0 (74.6 - 81.6) 19.6 26.4 22.3 46.0 80.0
Chengnuo_JSU_task3b_3 Chengnuo_JSU_task3b_report 17 15.8 (15.1 - 16.5) 21.1 (20.1 - 22.0) 24.5 (22.3 - 27.4) 47.2 (46.2 - 48.5) 74.5 (70.4 - 77.8) 20.5 26.8 20.1 34.0 80.0
Guan_GISP-HEU_task3b_1 Guan_GISP-HEU_task3_report 18 15.5 (14.9 - 16.2) 22.0 (21.1 - 23.2) 26.8 (24.3 - 28.7) 46.0 (43.6 - 48.4) 78.4 (75.4 - 81.1) 16.6 23.6 25.8 48.0 80.0

System characteristics

Track A: Audio-only

Rank | Submission name | Technical report | Model | Model params | Acoustic features | Data augmentation | External datasets | Pre-trained models
1 Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 58472848 log mel spectra Audio Channel Swapping, Multi-channel data simulation, Mixup AudioSet
2 Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 46792105 log mel spectra Audio Channel Swapping, Multi-channel data simulation, Mixup AudioSet
3 Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 35111362 log mel spectra Audio Channel Swapping, Multi-channel data simulation, Mixup AudioSet
4 He_HIT_task3a_1 He_HIT_task3a_report ResNet, Conformer, ensemble 104852705 log mel spectra Audio Channel Swapping, audio generation, synthetic audio FSD50K, TAU-SRIR DB
5 Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 23430619 log mel spectra Audio Channel Swapping, Multi-channel data simulation, Mixup AudioSet
6 He_HIT_task3a_2 He_HIT_task3a_report ResNet, Conformer, ensemble 104852705 log mel spectra Audio Channel Swapping, audio generation, synthetic audio FSD50K, TAU-SRIR DB
7 He_HIT_task3a_3 He_HIT_task3a_report ResNet, Conformer, ensemble 104854001 log mel spectra, intensity vector Audio Channel Swapping, audio generation, synthetic audio FSD50K, TAU-SRIR DB
8 He_HIT_task3a_4 He_HIT_task3a_report ResNet, Conformer 52388101 log mel spectra Audio Channel Swapping, audio generation, synthetic audio FSD50K, TAU-SRIR DB
9 Banerjee_NTU_task3a_1 Banerjee_NTU_task3a_report ResNet, Conformer, ONE-PEACE embedding, ensemble 26337755 log mel spectra, GCC-PHAT, inter-channel level difference, Sine of Interaural Phase Difference, Cosine of Interaural Phase Difference FSD50K, TAU-SRIR DB ONE-PEACE
10 Banerjee_NTU_task3a_2 Banerjee_NTU_task3a_report ResNet, Conformer, ONE-PEACE embedding, ensemble 26337755 log mel spectra, GCC-PHAT, inter-channel level difference, Sine of Interaural Phase Difference, Cosine of Interaural Phase Difference Audio Channel Swapping FSD50K, TAU-SRIR DB ONE-PEACE
11 Berghi_SURREY_task3a_2 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer 30959349 log mel spectra, inter-channel level difference, short-term power of ACC Audio Channel Swapping FSD50K, RIR datasets, SpatialScaper CLAP
12 Wu_HUST_task3a_2 Wu_HUST_task3a_report CNN, Conformer, AFF 11785511 log mel spectra frequency shifting, SpecAugment, random cutout, augmix, data simulation FSD50K, TAU-SRIR DB
13 Yeow_NTU_task3a_3 Yeow_NTU_task3a_report CRNN 4000000 log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector frequency shifting, FilterAugment SpatialScaper
14 Yeow_NTU_task3a_4 Yeow_NTU_task3a_report CRNN 4000000 log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector, Magnitude-Squared Coherence FilterAugment, frequency shifting SpatialScaper
15 Wu_HUST_task3a_1 Wu_HUST_task3a_report CNN, Conformer, AFF 11785511 log mel spectra frequency shifting, SpecAugment, random cutout, augmix, data simulation FSD50K, TAU-SRIR DB
16 Yeow_NTU_task3a_1 Yeow_NTU_task3a_report CRNN 4000000 log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector Inter-Channel Aware Time-Frequency Masking SpatialScaper
17 Yeow_NTU_task3a_2 Yeow_NTU_task3a_report CRNN 4000000 log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector, Magnitude-Squared Coherence Inter-Channel Aware Time-Frequency Masking SpatialScaper
18 Wu_HUST_task3a_4 Wu_HUST_task3a_report CNN, Conformer, AFF 11785511 log mel spectra frequency shifting, SpecAugment, random cutout, augmix, data simulation FSD50K, TAU-SRIR DB
19 Wu_HUST_task3a_3 Wu_HUST_task3a_report CNN, Conformer, AFF 11785511 log mel spectra frequency shifting, SpecAugment, random cutout, augmix, data simulation FSD50K, TAU-SRIR DB
20 Wan_XJU_task3a_1 Wan_XJU_task3a_report CNN, Conformer 4011000 log mel spectra frequency shifting SpatialScaper
21 Zhao_MITC-MG_task3a_3 Zhao_MITC-MG_task3a_report ResNet, Conformer 3656167 log mel spectra Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb FSD50K, FMA dasheng_base
22 Zhao_MITC-MG_task3a_4 Zhao_MITC-MG_task3a_report ResNet, Conformer 3656167 log mel spectra Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb FSD50K, FMA, STARSS23 dasheng_base
23 Gao_DTU_task3a_1 Gao_DTU_task3a_report CRNN, CNN, Mamba, Conformer, asymmetric CNN 76083414 log mel spectra, intensity vector Audio Channel Swapping PSELDnet
24 Gao_DTU_task3a_3 Gao_DTU_task3a_report CRNN, CNN, Conformer 210078389 log mel spectra, intensity vector Audio Channel Swapping PSELDnet
25 Park_KAIST_task3a_2 Park_KAIST_task3a_report ResNet, Conformer, ensemble 28100981 log mel spectra Audio Channel Swapping, FilterAugment
26 Gao_DTU_task3a_4 Gao_DTU_task3a_report Transformer, Swin Transformer 28083725 log mel spectra, intensity vector Audio Channel Swapping PSELDnet
27 Berghi_SURREY_task3a_1 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer 30959349 log mel spectra, inter-channel level difference, short-term power of ACC Audio Channel Swapping FSD50K, RIR datasets, SpatialScaper CLAP
28 Bahuguna_UPF_task3a_3 Bahuguna_UPF_task3a_report Conformer 1856247 log mel spectra FSD50K, TAU-SRIR DB
29 Park_KAIST_task3a_1 Park_KAIST_task3a_report ResNet, Conformer 14085057 log mel spectra Audio Channel Swapping, FilterAugment self-trained SED+DOA model
30 Gao_DTU_task3a_2 Gao_DTU_task3a_report CRNN, CNN, Mamba, Conformer 178113205 log mel spectra, intensity vector Audio Channel Swapping PSELDnet
31 Bahuguna_UPF_task3a_1 Bahuguna_UPF_task3a_report ensemble, Conformer 3732618 log mel spectra Spatial Scaper for rare class augmentation FSD50K, TAU-SRIR DB
32 Bingnan_UOE_task3a_1 Bingnan_UOE_task3a_report ResNet, MHSA 1350197 log mel spectra, short-term power of ACC Time Masking, Frame Shuffle
33 Bahuguna_UPF_task3a_4 Bahuguna_UPF_task3a_report Conformer 1866309 log mel spectra FSD50K, TAU-SRIR DB
34 Bahuguna_UPF_task3a_2 Bahuguna_UPF_task3a_report ensemble, Conformer 5588865 log mel spectra Spatial Scaper for rare class augmentation FSD50K, TAU-SRIR DB
35 AO_Baseline Baseline_report CRNN, MHSA 734261 log mel spectra
36 Guan_GISP-HEU_task3a_1 Guan_GISP-HEU_task3_report CRNN, MHSA 1728181 log mel spectra FSD50K, TAU-SRIR DB
37 Kim_Samsung_task3a_1 Kim_Samsung_task3_report ViT, ensemble 291473710 log mel spectra Specmix FSD50K, TAU-SRIR DB
38 Guan_GISP-HEU_task3a_2 Guan_GISP-HEU_task3_report CRNN, MHSA 1728181 log mel spectra FSD50K, TAU-SRIR DB
39 Guan_GISP-HEU_task3a_3 Guan_GISP-HEU_task3_report CRNN, MHSA 1728181 log mel spectra SpecAugment FSD50K, TAU-SRIR DB
40 Zhao_MITC-MG_task3a_1 Zhao_MITC-MG_task3a_report ResNet, Conformer 3656167 log mel spectra Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb FSD50K, FMA dasheng_base
41 Zhao_MITC-MG_task3a_2 Zhao_MITC-MG_task3a_report ResNet, Conformer 3656167 log mel spectra Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb FSD50K, FMA dasheng_base
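Audio Channel Swapping, listed by the majority of systems above, doubles the stereo training data by mirroring each scene left-right. A minimal sketch, assuming audio as two per-channel sample lists and azimuth labels in degrees (the data layout is an assumption for illustration, not the challenge format):

```python
def channel_swap(stereo, azimuths_deg):
    """Left-right channel swapping for stereo SELD data augmentation.

    Swapping the two channels mirrors the acoustic scene across the median
    plane, so azimuth labels flip sign; classes and distances are unchanged.
    stereo is [left_samples, right_samples]; azimuths_deg is a list of
    per-event azimuth labels (an illustrative layout).
    """
    left, right = stereo
    swapped = [right, left]
    mirrored = [-az for az in azimuths_deg]
    return swapped, mirrored

aug_audio, aug_az = channel_swap([[0.1, 0.2], [0.3, 0.4]], [30.0, -75.0])
```

For the audiovisual track, the same idea extends to Audio Channel and Video Pixel Swapping: the video frames are flipped horizontally so the visual scene stays consistent with the mirrored audio.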



Track B: Audiovisual

Rank | Submission name | Technical report | Model | Model params | Acoustic features | Visual features | Data augmentation | External datasets | Pre-trained models
1 Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 58484930 log mel spectra ResNet-50 features, video object detection, video human keypoints detection Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup AudioSet ResNet-50, ppyoloe, grounding dino
2 Du_NERCSLIP_task3b_4 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 67264873 log mel spectra ResNet-50 features, video object detection, video human keypoints detection Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup AudioSet ResNet-50, ppyoloe, grounding dino
3 Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 99415144 log mel spectra ResNet-50 features, video object detection, video human keypoints detection Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup AudioSet ResNet-50, ppyoloe, grounding dino
4 Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3_report ResNet, Conformer, ensemble 67268214 log mel spectra ResNet-50 features, video object detection, video human keypoints detection Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup AudioSet ResNet-50, ppyoloe, grounding dino
5 Berghi_SURREY_task3b_3 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer, ViT, ensemble 134694434 log mel spectra, inter-channel level difference, short-term power of ACC OWL-ViT features Audio Channel and Video Pixel Swapping, frame flip FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor CLAP, OWL-ViT
6 Berghi_SURREY_task3b_4 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer, ViT, ensemble 134694434 log mel spectra, inter-channel level difference, short-term power of ACC OWL-ViT features Audio Channel and Video Pixel Swapping, frame flip FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor CLAP, OWL-ViT
7 Berghi_SURREY_task3b_1 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer, ViT 36387868 log mel spectra, inter-channel level difference, short-term power of ACC OWL-ViT features Audio Channel and Video Pixel Swapping, frame flip FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor CLAP, OWL-ViT
8 Berghi_SURREY_task3b_2 Berghi_SURREY_task3_report CNN, Conformer, Cross-Modal Conformer, ViT 36387868 log mel spectra, inter-channel level difference, short-term power of ACC OWL-ViT features Audio Channel and Video Pixel Swapping, frame flip FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor CLAP, OWL-ViT
9 Chengnuo_JSU_task3b_4 Chengnuo_JSU_task3b_report MLP, MHSA, CNN 2896093 log mel spectra ResNet-50 features ResNet-50
10 AV_Baseline Baseline_report CRNN, MHSA 2723676 log mel spectra ResNet-50 features ResNet-50
11 Guan_GISP-HEU_task3b_3 Guan_GISP-HEU_task3_report CRNN, MHSA 3717596 log mel spectra ResNet-50 features FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k ResNet-50
12 Chengnuo_JSU_task3b_1 Chengnuo_JSU_task3b_report MLP, MHSA, CNN 2896093 log mel spectra ResNet-50 features ResNet-50
13 Yu_Polyu_task3b_1 Yu_Polyu_task3b_report Mamba, ResNet 980000 log mel spectra, intensity vector ResNet-50 features Multi-channel data simulation FSD50K, TAU-SRIR DB ResNet-50
14 Kim_Samsung_task3b_1 Kim_Samsung_task3_report ViT, ensemble 291473710 log mel spectra ResNet-50 features Specmix FSD50K, TAU-SRIR DB ResNet-50
15 Chengnuo_JSU_task3b_2 Chengnuo_JSU_task3b_report MLP, MHSA, CNN 2896093 log mel spectra ResNet-50 features ResNet-50
16 Guan_GISP-HEU_task3b_2 Guan_GISP-HEU_task3_report CRNN, MHSA 2720000 log mel spectra ResNet-50 features SpecAugment FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k ResNet-50
17 Chengnuo_JSU_task3b_3 Chengnuo_JSU_task3b_report MLP, MHSA, CNN 2896093 log mel spectra ResNet-50 features ResNet-50
18 Guan_GISP-HEU_task3b_1 Guan_GISP-HEU_task3_report CRNN, MHSA 2720000 log mel spectra ResNet-50 features FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k ResNet-50
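Most Track B systems feed per-frame ResNet-50 embeddings alongside the audio features, which requires aligning the (slower) video frame rate to the audio label frame rate. A nearest-index alignment sketch, with frame counts and the repeat-rather-than-interpolate choice as illustrative assumptions:

```python
def align_video_to_audio(video_feats, n_audio_frames):
    """Align per-video-frame features to the audio label frame grid.

    video_feats: one feature per video frame (e.g. a ResNet-50 embedding);
    n_audio_frames: number of label frames in the same clip. Uses simple
    proportional indexing, repeating each video feature as needed; real
    systems may interpolate instead.
    """
    n_video = len(video_feats)
    return [
        video_feats[min(i * n_video // n_audio_frames, n_video - 1)]
        for i in range(n_audio_frames)
    ]
```

For example, 5 video frames aligned to 10 label frames simply repeats each video feature twice before the fused audiovisual sequence is passed to the SELD model.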



Technical reports

A CONFORMER-BASED ENSEMBLE APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION FOR STEREO DATA

Arjun Bahuguna1, Rahul Peter2
1Universitat Pompeu Fabra, Dept. of Engineering, Barcelona, 08018, Spain, 2Aalto University, Electrical Engineering Dept., Espoo, 20150, Finland

Abstract

This report presents our approach to Task 3 of the DCASE Challenge 2025, which focuses on stereo sound event localization and detection (SELD) in regular video content. We propose a three-part ensemble model that operates in the audio domain and outperforms the official baseline. To address class imbalance in the STARSS23 dataset, we explore synthetic data generation using SpatialScaper and apply data augmentation techniques such as channel swapping and time-domain remixing. Our proposed system achieves an F-score of 28%, a DOA error of 17.3°, and a relative distance error of 0.43 on the development dataset. We conclude by suggesting possible future enhancements.

PDF

EXPLOITING STEREO SPATIAL PROPERTIES WITH RESNET-CONFORMERS FOR ROBUST EVENT DETECTION AND LOCALIZATION

Banerjee Mohor1, Nagisetty Srikanth2, Han Boon Teo2
1Nanyang Technological University, 2Panasonic R&D Center Singapore

Abstract

This technical report presents our submission for Task 3, Track A of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in regular video content. Our system focuses exclusively on the audio-only track, leveraging stereo audio inputs for precise SELD. At the heart of our framework lies a ResNet-Conformer architecture augmented with pre-trained ONE-PEACE embeddings. This powerful combination processes a rich set of spatial and spectral features, including Mel spectrograms, Interaural Phase Differences (IPDs), Interaural Level Differences (ILDs), and Generalized Cross-Correlation with Phase Transform (GCC-PHAT). To optimize performance across multiple SELD tasks - namely Direction of Arrival (DoA) estimation, sound event detection (SED), and sound distance estimation (SDE) - we employ a modular training strategy: separate modules are trained for DoA estimation and SDE, with their outputs fused through a joint prediction scheme. The ONE-PEACE embeddings are integrated alongside the ResNet-Conformer outputs and jointly processed to enhance downstream task performance further. We evaluate our system on the DCASE 2025 Task 3 dev-test set, derived from the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset. Our method achieves an F-score of 48.2%, representing a substantial 25.42% improvement over the official baseline, thereby demonstrating the effectiveness of our approach.

PDF

Baseline Models and Evaluation of Sound Event Localization and Detection with Distance Estimation in DCASE2024 Challenge

David Diaz-Guerra1, Archontis Politis1, Parthasaarathy Sudarsanam1, Kazuki Shimada2, Daniel Krause1, Kengo Uchida2, Yuichiro Koyama3, Naoya Takahashi4, Shusuke Takahashi3, Takashi Shibuya2, Yuki Mitsufuji5,6, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2Sony AI, Tokyo, Japan, 3Sony Group Corporation, Tokyo, Japan, 4Sony AI, Zurich, Switzerland, 5Sony AI, NY, USA, 6Sony Group Corporation, NY, USA

PDF

SPATIAL AND SEMANTIC EMBEDDING INTEGRATION FOR STEREO SOUND EVENT LOCALIZATION AND DETECTION IN REGULAR VIDEOS

Davide Berghi, Philip J. B. Jackson
CVSSP, University of Surrey, U.K.

Abstract

This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. Additionally, we incorporate autocorrelation-based acoustic features to improve distance estimation. We pre-train our models on curated synthetic audio and audio-visual datasets and apply a left-right channel swapping augmentation to further increase the training data. Both our audio-only and audio-visual systems substantially outperform the challenge baselines on the development set, demonstrating the effectiveness of our strategy. Performance is further improved through model ensembling and a visual post-processing step based on human keypoints. Future work will investigate the contribution of each modality and explore architectural variants to further enhance results.

PDF

MULTI-ACCDOA-BASED SELD IN STEREO AUDIO: FEATURE EXTRACTION AND DATA AUGMENTATION STRATEGIES

Bingnan Duan, Yinhuan Dong, Liuyuan Na
The University of Edinburgh, School of Engineering, Edinburgh, UK

Abstract

This technical report describes the proposed system submitted to the DCASE2025 Task3: Stereo sound event localization and detection in regular video content (Track A: Audio-only inference). To improve SELD performance, we replace the convolutional blocks in the baseline model with ResNet blocks, extract a 3-channel input feature consisting of log-mel spectrograms and short-term power of autocorrelation (stpACC), and employ two data augmentation techniques: Time Masking and Frame Shuffle. Our system uses the Multi-ACCDOA output representation with an ADPIT loss function to support overlapping sound events. Evaluated on the development dataset, our proposed method achieves significant improvements over the official baseline across F1-score, DOA error, and relative distance error.

PDF

THE SYSTEM FOR DCASE 2025 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE

Chengnuo Sun, Lijian Gao
Jiangsu University, Zhenjiang, China

Abstract

This technical report gives an overview of our system for the audiovisual track of Task 3 of the DCASE 2025 Challenge. We propose a sound event localization and detection (SELD) system for stereo SELD in regular video content. Compared with the baseline, the proposed method pays more attention to the temporal relationships between modalities. We evaluated our methods on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and achieved significant improvements over the baseline method.

PDF

THE NERC-SLIP SYSTEM FOR STEREO SOUND EVENT LOCALIZATION AND DETECTION IN REGULAR VIDEO CONTENT OF DCASE 2025 CHALLENGE

Qing Wang1, Hengyi Hong1, Ruoyu Wei3, Lin Li2, Yuxuan Dong1, Mingqi Cai2, Xin Fang2, Jiangzhao Wu3, Jun Du1
1University of Science and Technology of China, Hefei, China, 2iFLYTEK, Hefei, China, 3National Intelligent Voice Innovation Center, Hefei, China

Abstract

This technical report details our submission system for Task 3 of the DCASE 2025 Challenge, which focuses on sound event localization and detection (SELD) in regular video content with stereo audio. In addition to estimating the direction of arrival (DOA) and distance of sound sources, the audio-visual SELD task requires predicting whether the sound source is on-screen. For the audio-only track, we used two-channel log-Mel spectrogram features from stereo audio as model inputs. We adapted the audio-visual pixel swapping (AVPS) technique from first-order Ambisonics (FOA) to stereo format through left-right channel swapping coupled with horizontal video pixel transposition, effectively doubling the training data. Our architecture implemented three specialized models for DOA, distance, and source coordinates estimation tasks, subsequently integrated through a joint prediction framework. The audio-visual track utilized a ResNet-50 model pre-trained on ImageNet for visual feature extraction, enhanced by a teacher-student learning paradigm for cross-modal knowledge distillation. To improve on-screen event detection, we developed a novel two-stage visual post-processing method. Our methods were evaluated using the development set of the DCASE 2025 Task 3.
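The stereo adaptation of audio-visual pixel swapping (AVPS) described above can be sketched as a joint transform of audio, video, and labels. The data layout here (per-frame pixel grids, a flat azimuth list) is an illustrative assumption; the point is that the channel swap, the horizontal pixel flip, and the azimuth negation must be applied together so the modalities stay spatially consistent.

```python
def stereo_avps(stereo, frames, azimuths):
    """Sketch of AVPS adapted to stereo audio.

    Swaps the two audio channels, horizontally flips each video
    frame (row-wise pixel reversal), and negates azimuth labels,
    keeping audio, video, and labels mutually consistent.
    """
    audio = [(r, l) for (l, r) in stereo]
    video = [[row[::-1] for row in frame] for frame in frames]
    labels = [-az for az in azimuths]
    return audio, video, labels

audio, video, labels = stereo_avps(
    stereo=[(0.5, -0.5)],
    frames=[[[1, 2, 3], [4, 5, 6]]],  # one 2x3 grayscale frame
    azimuths=[30.0],
)
```

As with the audio-only channel swap, applying this to every clip doubles the training data without new annotations.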

PDF

STEREO SOUND EVENT LOCALIZATION AND DETECTION BASED ON PSELDNET PRETRAINING AND BIMAMBA SEQUENCE MODELING

Wenmiao Gao1, Yang Xiao2
1Denmark Technical University, 2800 Kgs. Lyngby, Denmark, 2The University of Melbourne, Melbourne, Australia

Abstract

Pre-training methods have achieved significant performance improvements in sound event localization and detection (SELD) tasks, but existing Transformer-based models suffer from high computational complexity. In this work, we propose a stereo sound event localization and detection system based on pre-trained PSELDnet and bidirectional Mamba sequence modeling. We replace the Conformer module with a BiMamba module and introduce asymmetric convolutions to more effectively model the spatiotemporal relationships between time and frequency dimensions. Experimental results demonstrate that the proposed method achieves significantly better performance than the baseline and the original PSELDnet with Conformer decoder architecture on the DCASE2025 Task 3 development dataset, while also reducing computational complexity. These findings highlight the effectiveness of the BiMamba architecture in addressing the challenges of the SELD task.

PDF

GISP@HEU'S SUBMISSION TO THE DCASE 2025 CHALLENGE: STEREO SELD TASK

Congyi Fan1, Shitong Fan1, Feiyang Xiao1, Wenbo Wang2, Xinyi Che3, Qiaoxi Zhu4, Jian Guan1
1Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Faculty of Computing, Harbin Institute of Technology, Harbin, China, 3Sichuan University, 4University of Technology Sydney, Ultimo, Australia

Abstract

This technical report presents our submission to Task 3 of the DCASE 2025 Challenge. To enhance the model's generalization ability, we adopt the official synthetic data generation pipeline to expand the training set. In addition, SpecAugment is applied for data augmentation to improve event recognition performance. To address the challenges of ambiguous localization and long-range temporal dependencies inherent in stereo SELD, we use the Mamba architecture, which effectively captures both local and global temporal dynamics, thereby improving overall system performance.

PDF

STEREO SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION USING DATA-DRIVEN RESNET-CONFORMER ENSEMBLE

Changjiang He1,2, Jian Chen1, Siyao Cheng1,2, Jiahua Bao1,2, Jie Liu1,2
1Harbin Institute of Technology, Faculty of Computing, China, 2State Key Laboratory of Smart Farm Technologies and Systems, China

Abstract

This technical report presents our submitted system for Task 3 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. DCASE Task 3 includes two tracks, and we participate exclusively in the audio-only track. First, we perform data augmentation by employing audio channel swapping (ACS) and data simulation techniques, expanding the dataset to 3.7 times its original size. Subsequently, a single ResNet-Conformer model is used to perform SELD predictions. To further optimize the model and submit multiple model-ensemble solutions, we fine-tune it on the original dataset after training it on the augmented dataset. For the model ensemble, we integrate two models: SED-DoA and SED-SDE. Our approach is evaluated on the development test set of the dataset.

PDF

SOUND EVENT LOCALIZATION AND DETECTION MODEL WITH ATTENTION-BASED NEURAL NETWORKS AND DATA MODELING

Gwantae Kim
Samsung Electronics, Suwon, South Korea

Abstract

This technical report presents our submission system for Task 3 of the DCASE 2025 Challenge, which tackles sound event localization and detection (SELD) in regular video content with stereo audio. In this report, we introduce our data preparation and augmentation methods, neural network structure, post-processing, and model ensemble strategy.

PDF

RESNET-CONFORMER FOR STEREO SOUND EVENT LOCALIZATION AND DISTANCE ESTIMATION IN DCASE 2025 TASK 3

Jehyun Park, Hyeonuk Nam, Yong-Hwa Park
Korea Advanced Institute of Science and Technology, South Korea

Abstract

In the DCASE 2025 Task 3 Track A challenge, we propose a ResNet-Conformer architecture for stereo sound event localization and detection (SELD) with integrated sound distance estimation (SDE). We develop two complementary systems. The first system follows a two-stage training strategy, where the model is initially trained to perform sound event detection (SED) and direction-of-arrival (DOA) estimation, and then fine-tuned to also predict source distance. The second system is based on a dual-branch ensemble that combines a model trained for SED and DOA with another model trained for SED and SDE. Both systems share a common backbone consisting of a ResNet-based convolutional encoder followed by an 8-layer Conformer stack, with separate output branches for SED (sigmoid), DOA (tanh), and SDE (ReLU). To enhance robustness, we apply audio channel swapping (ACS) and FilterAugment as data augmentation techniques. Evaluation on the DCASE 2025 Task 3 development set demonstrates that the proposed ensemble system improves overall SELD performance.

PDF

A MULTI-LEVEL FEATURE EXTRACTION NETWORK FOR SOUND EVENT LOCALIZATION AND DETECTION IN DCASE 2025 TASK 3

QingJing Wan1,2, Ying Hu1,2, Jie Liu1,2, Qiong Wu1,2, Qin Yang1,2, WenTao Zhou1,2, Tianqing Zhou1,2, Nannan Teng1,2, Fangxu Chen1,2, Zijun Chen1,2
1Xinjiang University, School of Information Science and Engineering, Urumqi, China, 2Key Laboratory of Signal Detection and Processing in Xinjiang, Urumqi, China

Abstract

This technical report describes our submission system for Task 3 of the DCASE 2025 Challenge: stereo sound event localization and detection (SELD) in regular video content. We participate in the audio-only track. Our system adopts a Multi-Level Feature Extraction Network, which consists of three main components. First, a Feature Extraction Enhancement Module (FEEM) is used to extract fine-grained and meaningful features at multiple hierarchical levels, improving the model's ability to handle both sub-tasks: Direction of Arrival (DOA) estimation and Sound Event Detection (SED). Second, a Feature Fusion Module (FFM) is employed to integrate multi-level features, further enhancing the representational capacity of the network. Finally, several data augmentation strategies are applied to improve the robustness of the network. Experimental results on the DCASE 2025 Task 3 stereo SELD dataset demonstrate the effectiveness of the proposed system.

PDF

A STEREO SOUND EVENT LOCALIZATION AND DETECTION METHOD BASED ON FEATURE FUSION AND TWO-STAGE TRAINING

Digao Wu, Ming Zhu
Huazhong University of Science and Technology, School of Electronic Information and Communications, Wuhan, China

Abstract

This technical report presents our system for Task 3 of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection in Regular Video Content. The task requires predicting the activity, azimuth, and distance of sound events using stereo audio. We participate in the audio-only track. We propose a stereo SELD model based on the ResNet-Conformer structure, integrating channel-wise attention and feature fusion, with outputs represented in the ACCDOA format. To enhance model performance, we augment the training data with additional stereo audio segments sampled from the official DCASE 2024 synthetic dataset. We apply several data augmentation techniques and adopt a two-stage training strategy to improve generalization and performance on real data. A dynamic thresholding method is also introduced during inference to further boost the prediction accuracy. The experimental results on the official development dataset show that our proposed system outperforms the baseline in all evaluation metrics.

PDF

IMPROVING STEREO 3D SOUND EVENT LOCALIZATION AND DETECTION: PERCEPTUAL FEATURES, STEREO-SPECIFIC DATA AUGMENTATION, AND DISTANCE NORMALIZATION

Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi, Woon-Seng Gan
Smart Nation TRANS Lab, Nanyang Technological University, Singapore

Abstract

This technical report presents our submission to Task 3 of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. We address the audio-only task in this report and introduce several key contributions. First, we design perceptually-motivated input features that improve event detection, sound source localization, and distance estimation. Second, we adapt augmentation strategies specifically for the intricacies of stereo audio, including channel swapping and time-frequency masking. We also incorporate the recently proposed FilterAugment technique that has yet to be explored for SELD work. Lastly, we apply a distance normalization approach during training to stabilize regression targets. Experiments on the stereo STARSS23 dataset demonstrate consistent performance gains across all SELD metrics. Code to replicate our work is available in this repository.
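The distance normalization mentioned above could look like the following minimal sketch, assuming a scene-dependent distance cap `d_max` (a hypothetical parameter, not taken from the report): distances are mapped to [0, 1] as regression targets during training and rescaled at inference.

```python
def normalize_distances(dists, d_max):
    """Map distances to [0, 1] for stable regression targets.

    d_max is an assumed scene-dependent cap; distances beyond
    it are clipped before scaling.
    """
    return [min(d, d_max) / d_max for d in dists]

def denormalize_distances(norm, d_max):
    """Invert the normalization at inference time."""
    return [n * d_max for n in norm]

targets = normalize_distances([0.5, 2.0, 6.0], d_max=4.0)
```

Bounding the regression range this way keeps the distance loss on a scale comparable to the detection and DOA losses.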

PDF

THE MAMBA-BASED SYSTEM FOR DCASE 2025 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE

Rendong Pi, Xiang Yu
The Hong Kong Polytechnic University, Mechanical Engineering Dept., Hung Hom, Kowloon, Hong Kong, China

Abstract

This technical report gives an overview of our system for Task 3 of the DCASE 2025 Challenge. We propose a stereo sound event localization and detection (SELD) system using mel spectrograms and intensity vectors as input features. We construct a Mamba-based network and achieve significant improvements over the baseline system with a few data augmentation methods. We conduct the performance evaluation on the dev-test set of the stereo SELD dataset extracted from the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.

PDF

ENHANCING STEREO SOUND EVENT LOCALIZATION AND DETECTION THROUGH PRETRAINED AUDIO REPRESENTATIONS AND HYBRID ARCHITECTURES

Tianbo Zhao, Zerui Han, Mengmei Liu
Xiaomi Corporation, MITC-Multimodal generation, Beijing, China

Abstract

This technical report presents our submission system for Task 3 of the DCASE 2025 Challenge: Stereo sound event localization and detection (SELD) in regular video content. This year we participate in the audio-only track. We propose a method that decomposes the SELD task into two sub-tasks. For the detection task, we employ the pre-trained Dasheng model, a high-performing audio encoder. For the localization task, we utilize the ResNet-Conformer architecture, which has demonstrated excellent performance in recent DCASE tasks. We evaluated our method on the dev-test set of the development dataset. The results show that our approach outperforms the baseline.

PDF