Task description
The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into one of a known set of sound classes, and further localize the events in space while they are active.
The focus of the current SELD task is developing systems that perform adequately on stereo audio data. There are two tracks: an audio-only track (Track A) for systems estimating the SELD labels from stereo audio alone, and an audiovisual track (Track B) for systems that additionally employ simultaneous perspective video spatially aligned with the stereo audio.
The task provides two datasets, development and evaluation; only the development dataset provides reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development-set split, and finally test their systems on the unseen evaluation dataset.
More details on the task setup and evaluation can be found in the task description page.
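The localization metrics reported in the rankings below can be sketched compactly. The helpers `doa_error_deg` and `relative_distance_error` are hypothetical names illustrating the standard definitions (great-circle angle between predicted and reference directions, and absolute distance deviation normalized by the reference distance); they are not the official evaluation code.

```python
import numpy as np

def doa_error_deg(pred, ref):
    """Great-circle angle (in degrees) between two direction vectors."""
    pred = pred / np.linalg.norm(pred)
    ref = ref / np.linalg.norm(ref)
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(np.dot(pred, ref), -1.0, 1.0)))

def relative_distance_error(pred_dist, ref_dist):
    """Absolute distance deviation normalized by the reference distance."""
    return abs(pred_dist - ref_dist) / ref_dist
```

In the location-aware F-score (20°/1), a detection counts as a true positive only if its class is correct, its DOA error is below 20°, and its relative distance error is below 1.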
Teams ranking
The SELD task received 57 submissions in total from 16 teams across the world. Of those, 40 submissions were in the audio-only Track A and 17 in the audiovisual Track B. Four teams participated in both tracks, ten only in Track A, and two only in Track B.
The following table includes only the best-performing system per submitting team. Confidence intervals are also reported for each metric on the evaluation set results.
Track A: Audio-only
Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | F-score (20°/1) | DOA error (°) | Relative distance error
---|---|---|---|---|---|---|---
Du_NERCSLIP_task3a_4 | Jun Du | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 50.4 (49.2 - 51.4) | 12.2 (11.7 - 12.5) | 26.9 (25.9 - 28.1) | ||
He_HIT_task3a_1 | Changjiang He | Harbin Institute of Technology | He_HIT_task3a_report | 2 | 47.0 (45.6 - 48.2) | 13.3 (12.6 - 13.9) | 38.6 (37.5 - 39.9) | ||
Banerjee_NTU_task3a_1 | Mohor Banerjee | Nanyang Technological University | Banerjee_NTU_task3a_report | 3 | 43.9 (42.5 - 45.5) | 14.0 (13.2 - 14.7) | 35.2 (33.6 - 36.5) | ||
Berghi_SURREY_task3a_2 | Davide Berghi | University of Surrey | Berghi_SURREY_task3_report | 4 | 42.5 (41.4 - 43.8) | 15.4 (14.5 - 16.2) | 31.4 (30.5 - 32.4) | ||
Wu_HUST_task3a_2 | Digao Wu | Huazhong University of Science and Technology | Wu_HUST_task3a_report | 5 | 41.8 (40.4 - 43.3) | 15.3 (14.6 - 16.0) | 29.3 (28.2 - 30.4) | ||
Yeow_NTU_task3a_3 | Jun Wei Yeow | Nanyang Technological University | Yeow_NTU_task3a_report | 6 | 41.3 (40.0 - 42.7) | 14.5 (13.3 - 15.6) | 28.0 (26.9 - 28.9) | ||
Wan_XJU_task3a_1 | QingJing Wan | Xinjiang University | Wan_XJU_task3a_report | 7 | 35.4 (34.3 - 36.7) | 18.6 (17.4 - 19.4) | 34.9 (33.1 - 36.9) | ||
Zhao_MITC-MG_task3a_3 | Tianbo Zhao | Xiaomi Corporation | Zhao_MITC-MG_task3a_report | 8 | 34.0 (33.1 - 34.7) | 16.8 (15.1 - 18.4) | 36.6 (35.7 - 37.3) | ||
Gao_DTU_task3a_1 | Wenmiao Gao | Technical University of Denmark | Gao_DTU_task3a_report | 9 | 31.0 (30.2 - 31.8) | 17.4 (13.6 - 18.6) | 40.1 (35.9 - 41.5) | ||
Park_KAIST_task3a_2 | Jehyun Park | Korea Advanced Institute of Science and Technology | Park_KAIST_task3a_report | 10 | 30.3 (29.5 - 31.1) | 14.6 (13.6 - 17.5) | 32.4 (26.9 - 43.2) | ||
Bahuguna_UPF_task3a_3 | Arjun Bahuguna | Universitat Pompeu Fabra | Bahuguna_UPF_task3a_report | 11 | 28.8 (27.7 - 29.7) | 21.2 (16.8 - 26.9) | 100.0 (100.0 - 100.0) | ||
Bingnan_UOE_task3a_1 | Duan Bingnan | The University of Edinburgh | Bingnan_UOE_task3a_report | 12 | 26.9 (26.1 - 27.9) | 24.6 (21.4 - 32.8) | 37.9 (31.6 - 54.4) | ||
AO_Baseline | Parthasaarathy Sudarsanam | Tampere University | Baseline_report | 13 | 26.1 (25.0 - 27.6) | 23.0 (21.5 - 24.1) | 33.2 (30.8 - 37.3) | ||
Guan_GISP-HEU_task3a_1 | Jian Guan | Harbin Engineering University | Guan_GISP-HEU_task3_report | 14 | 25.1 (24.0 - 26.1) | 24.7 (22.1 - 27.9) | 35.6 (34.7 - 36.2) | ||
Kim_Samsung_task3a_1 | Gwantae Kim | Samsung Electronics | Kim_Samsung_task3_report | 15 | 24.6 (23.7 - 25.5) | 18.2 (13.2 - 25.9) | 33.7 (32.4 - 35.1) |
Track B: Audiovisual
Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | F-score (20°/1/on) | F-score (20°/1) | DOA error (°) | Relative distance error | Onscreen accuracy
---|---|---|---|---|---|---|---|---|---
Du_NERCSLIP_task3b_1 | Jun Du | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 41.6 (40.3 - 42.6) | 50.1 (48.8 - 51.1) | 12.2 (11.7 - 12.5) | 27.0 (26.0 - 28.1) | 82.2 (80.0 - 84.5) | |
Berghi_SURREY_task3b_3 | Davide Berghi | University of Surrey | Berghi_SURREY_task3_report | 2 | 34.8 (33.7 - 35.9) | 46.2 (44.9 - 47.9) | 14.1 (13.5 - 14.4) | 30.4 (29.0 - 31.5) | 76.9 (73.5 - 80.0) | |
Chengnuo_JSU_task3b_4 | Sun Chengnuo | Jiangsu University | Chengnuo_JSU_task3b_report | 3 | 20.8 (19.9 - 21.7) | 27.5 (26.7 - 28.3) | 22.2 (20.9 - 23.4) | 37.7 (35.7 - 40.1) | 77.8 (74.0 - 81.3) | |
AV_Baseline | Parthasaarathy Sudarsanam | Tampere University | Baseline_report | 4 | 20.8 (19.9 - 21.7) | 27.5 (26.7 - 28.3) | 22.2 (20.9 - 23.4) | 37.7 (35.7 - 40.1) | 77.8 (74.0 - 81.3) | |
Guan_GISP-HEU_task3b_3 | Jian Guan | Harbin Engineering University | Guan_GISP-HEU_task3_report | 5 | 18.2 (17.7 - 18.7) | 24.7 (23.9 - 25.5) | 24.0 (20.2 - 27.1) | 38.2 (31.3 - 46.1) | 76.1 (68.0 - 84.8) | |
Yu_Polyu_task3b_1 | Xiang Yu | The Hong Kong Polytechnic University | Yu_Polyu_task3b_report | 6 | 18.1 (17.2 - 19.1) | 24.8 (24.0 - 25.8) | 18.1 (17.2 - 19.0) | 34.0 (31.6 - 35.8) | 79.8 (77.3 - 82.3) | |
Kim_Samsung_task3b_1 | Gwantae Kim | Samsung Electronics | Kim_Samsung_task3_report | 7 | 18.0 (17.0 - 18.9) | 24.5 (23.6 - 25.4) | 20.9 (14.8 - 38.0) | 34.0 (31.2 - 42.8) | 78.4 (76.0 - 82.8) |
Systems ranking
Performance of all the submitted systems on the evaluation and the development datasets. Confidence intervals are also reported for each metric on the evaluation set results.
Track A: Audio-only
Submission name | Technical Report | Submission Rank | Eval F-score (20°/1) | Eval DOA error (°) | Eval relative distance error | Dev F-score (20°/1) | Dev DOA error (°) | Dev relative distance error
---|---|---|---|---|---|---|---|---
Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | 1 | 50.4 (49.2 - 51.4) | 12.2 (11.7 - 12.5) | 26.9 (25.9 - 28.1) | 54.3 | 11.8 | 26.0 | |
Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | 2 | 49.6 (48.5 - 50.4) | 12.4 (11.9 - 12.8) | 26.9 (25.9 - 28.0) | 54.0 | 12.0 | 25.9 | |
Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | 3 | 49.2 (48.0 - 50.2) | 12.4 (12.0 - 12.8) | 27.4 (26.4 - 28.6) | 53.1 | 12.1 | 25.9 | |
He_HIT_task3a_1 | He_HIT_task3a_report | 4 | 47.0 (45.6 - 48.2) | 13.3 (12.6 - 13.9) | 38.6 (37.5 - 39.9) | 50.0 | 13.1 | 36.0 | |
Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | 5 | 46.6 (45.1 - 47.7) | 12.3 (11.8 - 12.8) | 29.8 (28.6 - 31.1) | 50.1 | 12.5 | 26.5 | |
He_HIT_task3a_2 | He_HIT_task3a_report | 6 | 45.4 (44.4 - 46.3) | 13.6 (12.9 - 14.2) | 30.4 (29.4 - 31.8) | 48.8 | 13.4 | 32.0 | |
He_HIT_task3a_3 | He_HIT_task3a_report | 7 | 45.3 (43.9 - 46.3) | 12.8 (12.2 - 13.3) | 31.3 (29.8 - 32.9) | 51.3 | 12.5 | 33.0 | |
He_HIT_task3a_4 | He_HIT_task3a_report | 8 | 44.3 (43.0 - 45.6) | 13.5 (13.0 - 13.9) | 29.8 (28.8 - 31.0) | 47.8 | 13.0 | 30.0 | |
Banerjee_NTU_task3a_1 | Banerjee_NTU_task3a_report | 9 | 43.9 (42.5 - 45.5) | 14.0 (13.2 - 14.7) | 35.2 (33.6 - 36.5) | 48.2 | 13.3 | 36.0 | |
Banerjee_NTU_task3a_2 | Banerjee_NTU_task3a_report | 10 | 43.7 (42.3 - 45.0) | 14.1 (13.0 - 15.2) | 36.5 (34.3 - 38.4) | 47.5 | 13.7 | 36.0 | |
Berghi_SURREY_task3a_2 | Berghi_SURREY_task3_report | 11 | 42.5 (41.4 - 43.8) | 15.4 (14.5 - 16.2) | 31.4 (30.5 - 32.4) | 46.0 | 15.2 | 30.8 | |
Wu_HUST_task3a_2 | Wu_HUST_task3a_report | 12 | 41.8 (40.4 - 43.3) | 15.3 (14.6 - 16.0) | 29.3 (28.2 - 30.4) | 42.5 | 15.0 | 30.0 | |
Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | 13 | 41.3 (40.0 - 42.7) | 14.5 (13.3 - 15.6) | 28.0 (26.9 - 28.9) | 45.3 | 13.2 | 26.2 | |
Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | 14 | 40.9 (39.3 - 42.4) | 13.9 (12.9 - 14.7) | 28.7 (27.8 - 29.7) | 43.1 | 12.7 | 25.9 | |
Wu_HUST_task3a_1 | Wu_HUST_task3a_report | 15 | 40.8 (39.3 - 42.3) | 14.9 (14.1 - 15.7) | 29.2 (28.0 - 30.3) | 41.3 | 14.9 | 30.0 | |
Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | 16 | 40.5 (39.3 - 42.0) | 14.1 (13.3 - 14.7) | 28.4 (27.0 - 29.7) | 44.0 | 13.2 | 27.1 | |
Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | 17 | 39.8 (38.3 - 41.1) | 13.9 (13.1 - 14.7) | 28.4 (27.0 - 29.8) | 43.4 | 13.2 | 26.1 | |
Wu_HUST_task3a_4 | Wu_HUST_task3a_report | 18 | 39.8 (38.9 - 40.9) | 16.3 (15.5 - 16.9) | 28.3 (27.3 - 29.1) | 42.7 | 16.3 | 27.0 | |
Wu_HUST_task3a_3 | Wu_HUST_task3a_report | 19 | 39.5 (38.5 - 40.7) | 15.8 (15.0 - 16.4) | 28.3 (27.3 - 29.2) | 41.7 | 16.1 | 27.0 | |
Wan_XJU_task3a_1 | Wan_XJU_task3a_report | 20 | 35.4 (34.3 - 36.7) | 18.6 (17.4 - 19.4) | 34.9 (33.1 - 36.9) | 37.1 | 18.3 | 30.0 | |
Zhao_MITC-MG_task3a_3 | Zhao_MITC-MG_task3a_report | 21 | 34.0 (33.1 - 34.7) | 16.8 (15.1 - 18.4) | 36.6 (35.7 - 37.3) | 37.0 | 16.9 | 39.0 | |
Zhao_MITC-MG_task3a_4 | Zhao_MITC-MG_task3a_report | 22 | 32.6 (31.8 - 33.6) | 18.8 (16.7 - 20.8) | 38.1 (37.1 - 39.1) | 35.1 | 18.0 | 37.0 | |
Gao_DTU_task3a_1 | Gao_DTU_task3a_report | 23 | 31.0 (30.2 - 31.8) | 17.4 (13.6 - 18.6) | 40.1 (35.9 - 41.5) | 39.6 | 15.8 | 33.0 | |
Gao_DTU_task3a_3 | Gao_DTU_task3a_report | 24 | 30.4 (29.5 - 31.2) | 18.8 (16.5 - 20.1) | 36.4 (34.2 - 38.0) | 38.2 | 15.9 | 33.0 | |
Park_KAIST_task3a_2 | Park_KAIST_task3a_report | 25 | 30.3 (29.5 - 31.1) | 14.6 (13.6 - 17.5) | 32.4 (26.9 - 43.2) | 36.3 | 14.5 | 28.0 | |
Gao_DTU_task3a_4 | Gao_DTU_task3a_report | 26 | 29.9 (29.3 - 30.7) | 20.8 (19.2 - 22.3) | 36.7 (35.5 - 37.9) | 35.1 | 16.5 | 30.0 | |
Berghi_SURREY_task3a_1 | Berghi_SURREY_task3_report | 27 | 29.4 (28.4 - 30.5) | 20.1 (17.6 - 22.4) | 34.9 (31.5 - 37.7) | 45.7 | 15.0 | 31.0 | |
Bahuguna_UPF_task3a_3 | Bahuguna_UPF_task3a_report | 28 | 28.8 (27.7 - 29.7) | 21.2 (16.8 - 26.9) | 100.0 (100.0 - 100.0) | 28.4 | 20.3 | 100.0 | |
Park_KAIST_task3a_1 | Park_KAIST_task3a_report | 29 | 28.5 (27.5 - 29.3) | 13.2 (9.7 - 14.6) | 28.5 (26.2 - 30.4) | 35.3 | 15.5 | 30.0 | |
Gao_DTU_task3a_2 | Gao_DTU_task3a_report | 30 | 28.2 (27.4 - 29.1) | 19.8 (17.3 - 21.8) | 39.0 (33.2 - 43.5) | 36.2 | 16.6 | 33.0 | |
Bahuguna_UPF_task3a_1 | Bahuguna_UPF_task3a_report | 31 | 27.4 (26.8 - 28.4) | 22.3 (20.1 - 24.1) | 35.1 (34.1 - 36.4) | 26.6 | 20.4 | 36.0 | |
Bingnan_UOE_task3a_1 | Bingnan_UOE_task3a_report | 32 | 26.9 (26.1 - 27.9) | 24.6 (21.4 - 32.8) | 37.9 (31.6 - 54.4) | 29.0 | 19.3 | 30.0 | |
Bahuguna_UPF_task3a_4 | Bahuguna_UPF_task3a_report | 33 | 26.8 (26.0 - 27.9) | 22.2 (20.5 - 23.6) | 37.0 (36.2 - 37.9) | 24.8 | 20.6 | 34.0 | |
Bahuguna_UPF_task3a_2 | Bahuguna_UPF_task3a_report | 34 | 26.4 (25.6 - 27.4) | 22.1 (20.0 - 24.4) | 36.6 (35.2 - 38.1) | 28.0 | 17.3 | 43.0 | |
AO_Baseline | Baseline_report | 35 | 26.1 (25.0 - 27.6) | 23.0 (21.5 - 24.1) | 33.2 (30.8 - 37.3) | 22.8 | 24.5 | 41.0 | |
Guan_GISP-HEU_task3a_1 | Guan_GISP-HEU_task3_report | 36 | 25.1 (24.0 - 26.1) | 24.7 (22.1 - 27.9) | 35.6 (34.7 - 36.2) | 22.9 | 23.5 | 32.0 | |
Kim_Samsung_task3a_1 | Kim_Samsung_task3_report | 37 | 24.6 (23.7 - 25.5) | 18.2 (13.2 - 25.9) | 33.7 (32.4 - 35.1) | 28.8 | 18.1 | 34.0 | |
Guan_GISP-HEU_task3a_2 | Guan_GISP-HEU_task3_report | 38 | 23.8 (22.8 - 25.2) | 27.3 (26.0 - 28.4) | 37.2 (33.4 - 40.7) | 21.9 | 28.2 | 44.0 | |
Guan_GISP-HEU_task3a_3 | Guan_GISP-HEU_task3_report | 39 | 22.9 (22.0 - 23.9) | 25.1 (22.4 - 27.2) | 36.8 (34.1 - 40.1) | 25.3 | 23.0 | 45.0 | |
Zhao_MITC-MG_task3a_1 | Zhao_MITC-MG_task3a_report | 40 | 11.6 (11.4 - 11.7) | 20.5 (18.6 - 22.8) | 38.0 (36.7 - 39.0) | 35.2 | 17.4 | 38.0 | |
Zhao_MITC-MG_task3a_2 | Zhao_MITC-MG_task3a_report | 41 | 11.2 (11.0 - 11.3) | 22.3 (20.1 - 24.6) | 38.9 (37.9 - 39.8) | 36.3 | 17.2 | 37.0 |
Track B: Audiovisual
Submission name | Technical Report | Submission Rank | Eval F-score (20°/1/on) | Eval F-score (20°/1) | Eval DOA error (°) | Eval relative distance error | Eval onscreen accuracy | Dev F-score (20°/1/on) | Dev F-score (20°/1) | Dev DOA error (°) | Dev relative distance error | Dev onscreen accuracy
---|---|---|---|---|---|---|---|---|---|---|---|---
Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | 1 | 41.6 (40.3 - 42.6) | 50.1 (48.8 - 51.1) | 12.2 (11.7 - 12.5) | 27.0 (26.0 - 28.1) | 82.2 (80.0 - 84.5) | 46.9 | 54.1 | 11.9 | 26.0 | 86.0 | |
Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | 2 | 41.4 (40.0 - 42.4) | 50.1 (48.6 - 51.0) | 12.2 (11.8 - 12.6) | 26.9 (25.9 - 28.0) | 81.9 (79.7 - 84.2) | 47.0 | 54.2 | 12.0 | 25.9 | 85.6 | |
Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | 3 | 41.2 (39.8 - 42.1) | 49.8 (48.3 - 50.8) | 11.9 (11.5 - 12.3) | 26.6 (25.6 - 27.7) | 82.0 (79.8 - 84.3) | 47.3 | 54.6 | 11.8 | 26.1 | 85.8 | |
Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | 4 | 41.0 (39.6 - 41.9) | 49.6 (48.1 - 50.5) | 11.8 (11.4 - 12.2) | 26.9 (25.8 - 28.0) | 82.1 (79.9 - 84.3) | 46.9 | 54.2 | 11.9 | 25.4 | 85.9 | |
Berghi_SURREY_task3b_3 | Berghi_SURREY_task3_report | 5 | 34.8 (33.7 - 35.9) | 46.2 (44.9 - 47.9) | 14.1 (13.5 - 14.4) | 30.4 (29.0 - 31.5) | 76.9 (73.5 - 80.0) | 37.3 | 48.0 | 14.0 | 29.3 | 80.8 | |
Berghi_SURREY_task3b_4 | Berghi_SURREY_task3_report | 6 | 34.8 (33.5 - 36.0) | 46.2 (44.9 - 47.9) | 14.1 (13.5 - 14.4) | 30.4 (29.0 - 31.5) | 76.3 (72.5 - 79.8) | 37.5 | 48.0 | 14.0 | 29.3 | 80.8 | |
Berghi_SURREY_task3b_1 | Berghi_SURREY_task3_report | 7 | 33.6 (32.3 - 34.8) | 45.1 (43.7 - 46.6) | 14.8 (14.0 - 15.3) | 32.3 (29.9 - 34.5) | 75.3 (71.6 - 78.6) | 34.4 | 44.4 | 15.6 | 30.4 | 80.5 | |
Berghi_SURREY_task3b_2 | Berghi_SURREY_task3_report | 8 | 33.3 (32.0 - 34.7) | 43.5 (42.0 - 45.5) | 15.0 (14.3 - 15.5) | 31.9 (30.6 - 33.3) | 77.7 (73.7 - 81.2) | 35.8 | 45.5 | 15.2 | 32.2 | 81.0 | |
Chengnuo_JSU_task3b_4 | Chengnuo_JSU_task3b_report | 9 | 20.8 (19.9 - 21.7) | 27.5 (26.7 - 28.3) | 22.2 (20.9 - 23.4) | 37.7 (35.7 - 40.1) | 77.8 (74.0 - 81.3) | 18.8 | 26.8 | 20.1 | 34.0 | 80.0 | |
AV_Baseline | Baseline_report | 10 | 20.8 (19.9 - 21.7) | 27.5 (26.7 - 28.3) | 22.2 (20.9 - 23.4) | 37.7 (35.7 - 40.1) | 77.8 (74.0 - 81.3) | 20.0 | 26.8 | 23.8 | 40.0 | 80.0 | |
Guan_GISP-HEU_task3b_3 | Guan_GISP-HEU_task3_report | 11 | 18.2 (17.7 - 18.7) | 24.7 (23.9 - 25.5) | 24.0 (20.2 - 27.1) | 38.2 (31.3 - 46.1) | 76.1 (68.0 - 84.8) | 17.9 | 23.7 | 25.8 | 37.0 | 81.0 | |
Chengnuo_JSU_task3b_1 | Chengnuo_JSU_task3b_report | 12 | 18.1 (17.2 - 18.8) | 23.8 (23.0 - 24.6) | 24.4 (20.6 - 29.7) | 38.7 (37.0 - 40.1) | 74.7 (68.2 - 78.7) | 23.1 | 26.8 | 20.1 | 34.0 | 80.5 | |
Yu_Polyu_task3b_1 | Yu_Polyu_task3b_report | 13 | 18.1 (17.2 - 19.1) | 24.8 (24.0 - 25.8) | 18.1 (17.2 - 19.0) | 34.0 (31.6 - 35.8) | 79.8 (77.3 - 82.3) | 24.1 | 32.4 | 18.1 | 32.9 | 81.4 | |
Kim_Samsung_task3b_1 | Kim_Samsung_task3_report | 14 | 18.0 (17.0 - 18.9) | 24.5 (23.6 - 25.4) | 20.9 (14.8 - 38.0) | 34.0 (31.2 - 42.8) | 78.4 (76.0 - 82.8) | 26.1 | 19.6 | 30.0 | 74.0 | ||
Chengnuo_JSU_task3b_2 | Chengnuo_JSU_task3b_report | 15 | 17.9 (16.9 - 18.8) | 23.7 (22.8 - 24.6) | 23.2 (21.9 - 25.0) | 39.1 (37.5 - 40.6) | 79.1 (75.9 - 81.9) | 22.7 | 26.8 | 20.1 | 34.0 | 80.5 | |
Guan_GISP-HEU_task3b_2 | Guan_GISP-HEU_task3_report | 16 | 17.1 (16.1 - 18.3) | 22.7 (21.6 - 23.8) | 24.5 (20.6 - 27.6) | 42.4 (39.3 - 46.9) | 78.0 (74.6 - 81.6) | 19.6 | 26.4 | 22.3 | 46.0 | 80.0 | |
Chengnuo_JSU_task3b_3 | Chengnuo_JSU_task3b_report | 17 | 15.8 (15.1 - 16.5) | 21.1 (20.1 - 22.0) | 24.5 (22.3 - 27.4) | 47.2 (46.2 - 48.5) | 74.5 (70.4 - 77.8) | 20.5 | 26.8 | 20.1 | 34.0 | 80.0 | |
Guan_GISP-HEU_task3b_1 | Guan_GISP-HEU_task3_report | 18 | 15.5 (14.9 - 16.2) | 22.0 (21.1 - 23.2) | 26.8 (24.3 - 28.7) | 46.0 (43.6 - 48.4) | 78.4 (75.4 - 81.1) | 16.6 | 23.6 | 25.8 | 48.0 | 80.0 |
System characteristics
Track A: Audio-only
Rank | Submission name | Technical Report | Model | Model params | Acoustic features | Data augmentation | External datasets | Pre-trained models
---|---|---|---|---|---|---|---|---
1 | Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 58472848 | log mel spectra | Audio Channel Swapping, Multi-channel data simulation, Mixup | AudioSet | |
2 | Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 46792105 | log mel spectra | Audio Channel Swapping, Multi-channel data simulation, Mixup | AudioSet | |
3 | Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 35111362 | log mel spectra | Audio Channel Swapping, Multi-channel data simulation, Mixup | AudioSet | |
4 | He_HIT_task3a_1 | He_HIT_task3a_report | ResNet, Conformer, ensemble | 104852705 | log mel spectra | Audio Channel Swapping, audio generation, synthetic audio | FSD50K, TAU-SRIR DB | |
5 | Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 23430619 | log mel spectra | Audio Channel Swapping, Multi-channel data simulation, Mixup | AudioSet | |
6 | He_HIT_task3a_2 | He_HIT_task3a_report | ResNet, Conformer, ensemble | 104852705 | log mel spectra | Audio Channel Swapping, audio generation, synthetic audio | FSD50K, TAU-SRIR DB | |
7 | He_HIT_task3a_3 | He_HIT_task3a_report | ResNet, Conformer, ensemble | 104854001 | log mel spectra, intensity vector | Audio Channel Swapping, audio generation, synthetic audio | FSD50K, TAU-SRIR DB | |
8 | He_HIT_task3a_4 | He_HIT_task3a_report | ResNet, Conformer | 52388101 | log mel spectra | Audio Channel Swapping, audio generation, synthetic audio | FSD50K, TAU-SRIR DB | |
9 | Banerjee_NTU_task3a_1 | Banerjee_NTU_task3a_report | ResNet, Conformer, ONE-PEACE embedding, ensemble | 26337755 | log mel spectra, GCC-PHAT, inter-channel level difference, Sine of Interaural Phase Difference, Cosine of Interaural Phase Difference | FSD50K, TAU-SRIR DB | ONE-PEACE | |
10 | Banerjee_NTU_task3a_2 | Banerjee_NTU_task3a_report | ResNet, Conformer, ONE-PEACE embedding, ensemble | 26337755 | log mel spectra, GCC-PHAT, inter-channel level difference, Sine of Interaural Phase Difference, Cosine of Interaural Phase Difference | Audio Channel Swapping | FSD50K, TAU-SRIR DB | ONE-PEACE |
11 | Berghi_SURREY_task3a_2 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer | 30959349 | log mel spectra, inter-channel level difference, short-term power of ACC | Audio Channel Swapping | FSD50K, RIR datasets, SpatialScaper | CLAP |
12 | Wu_HUST_task3a_2 | Wu_HUST_task3a_report | CNN, Conformer, AFF | 11785511 | log mel spectra | frequency shifting, SpecAugment, random cutout, augmix, data simulation | FSD50K, TAU-SRIR DB | |
13 | Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | CRNN | 4000000 | log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector | frequency shifting, FilterAugment | SpatialScaper | |
14 | Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | CRNN | 4000000 | log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector, Magnitude-Squared Coherence | FilterAugment, frequency shifting | SpatialScaper | |
15 | Wu_HUST_task3a_1 | Wu_HUST_task3a_report | CNN, Conformer, AFF | 11785511 | log mel spectra | frequency shifting, SpecAugment, random cutout, augmix, data simulation | FSD50K, TAU-SRIR DB | |
16 | Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | CRNN | 4000000 | log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector | Inter-Channel Aware Time-Frequency Masking | SpatialScaper |
17 | Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | CRNN | 4000000 | log mel spectra, Mid-Side spectrogram, Mid-Side Intensity Vector, Magnitude-Squared Coherence | Inter-Channel Aware Time-Frequency Masking | SpatialScaper |
18 | Wu_HUST_task3a_4 | Wu_HUST_task3a_report | CNN, Conformer, AFF | 11785511 | log mel spectra | frequency shifting, SpecAugment, random cutout, augmix, data simulation | FSD50K, TAU-SRIR DB | |
19 | Wu_HUST_task3a_3 | Wu_HUST_task3a_report | CNN, Conformer, AFF | 11785511 | log mel spectra | frequency shifting, SpecAugment, random cutout, augmix, data simulation | FSD50K, TAU-SRIR DB | |
20 | Wan_XJU_task3a_1 | Wan_XJU_task3a_report | CNN, Conformer | 4011000 | log mel spectra | frequency shifting | SpatialScaper |
21 | Zhao_MITC-MG_task3a_3 | Zhao_MITC-MG_task3a_report | ResNet, Conformer | 3656167 | log mel spectra | Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb | FSD50K, FMA | dasheng_base
22 | Zhao_MITC-MG_task3a_4 | Zhao_MITC-MG_task3a_report | ResNet, Conformer | 3656167 | log mel spectra | Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb | FSD50K, FMA, STARSS23 | dasheng_base
23 | Gao_DTU_task3a_1 | Gao_DTU_task3a_report | CRNN, CNN, Mamba, Conformer, asymmetric CNN | 76083414 | log mel spectra, intensity vector | Audio Channel Swapping | PSELDnet | |
24 | Gao_DTU_task3a_3 | Gao_DTU_task3a_report | CRNN, CNN, Conformer | 210078389 | log mel spectra, intensity vector | Audio Channel Swapping | PSELDnet | |
25 | Park_KAIST_task3a_2 | Park_KAIST_task3a_report | ResNet, Conformer, ensemble | 28100981 | log mel spectra | Audio Channel Swapping, FilterAugment | ||
26 | Gao_DTU_task3a_4 | Gao_DTU_task3a_report | Transformer, Swin Transformer | 28083725 | log mel spectra, intensity vector | Audio Channel Swapping | PSELDnet | |
27 | Berghi_SURREY_task3a_1 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer | 30959349 | log mel spectra, inter-channel level difference, short-term power of ACC | Audio Channel Swapping | FSD50K, RIR datasets, SpatialScaper | CLAP |
28 | Bahuguna_UPF_task3a_3 | Bahuguna_UPF_task3a_report | Conformer | 1856247 | log mel spectra | FSD50K, TAU-SRIR DB | ||
29 | Park_KAIST_task3a_1 | Park_KAIST_task3a_report | ResNet, Conformer | 14085057 | log mel spectra | Audio Channel Swapping, FilterAugment | self-trained SED+DOA model | |
30 | Gao_DTU_task3a_2 | Gao_DTU_task3a_report | CRNN, CNN, Mamba, Conformer | 178113205 | log mel spectra, intensity vector | Audio Channel Swapping | PSELDnet | |
31 | Bahuguna_UPF_task3a_1 | Bahuguna_UPF_task3a_report | ensemble, Conformer | 3732618 | log mel spectra | Spatial Scaper for rare class augmentation | FSD50K, TAU-SRIR DB | |
32 | Bingnan_UOE_task3a_1 | Bingnan_UOE_task3a_report | ResNet, MHSA | 1350197 | log mel spectra, short-term power of ACC | Time Masking, Frame Shuffle | ||
33 | Bahuguna_UPF_task3a_4 | Bahuguna_UPF_task3a_report | Conformer | 1866309 | log mel spectra | FSD50K, TAU-SRIR DB | ||
34 | Bahuguna_UPF_task3a_2 | Bahuguna_UPF_task3a_report | ensemble, Conformer | 5588865 | log mel spectra | Spatial Scaper for rare class augmentation | FSD50K, TAU-SRIR DB | |
35 | AO_Baseline | Baseline_report | CRNN, MHSA | 734261 | log mel spectra | |||
36 | Guan_GISP-HEU_task3a_1 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 1728181 | log mel spectra | FSD50K, TAU-SRIR DB | ||
37 | Kim_Samsung_task3a_1 | Kim_Samsung_task3_report | ViT, ensemble | 291473710 | log mel spectra | Specmix | FSD50K, TAU-SRIR DB | |
38 | Guan_GISP-HEU_task3a_2 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 1728181 | log mel spectra | FSD50K, TAU-SRIR DB | ||
39 | Guan_GISP-HEU_task3a_3 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 1728181 | log mel spectra | SpecAugment | FSD50K, TAU-SRIR DB | |
40 | Zhao_MITC-MG_task3a_1 | Zhao_MITC-MG_task3a_report | ResNet, Conformer | 3656167 | log mel spectra | Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb | FSD50K, FMA | dasheng_base
41 | Zhao_MITC-MG_task3a_2 | Zhao_MITC-MG_task3a_report | ResNet, Conformer | 3656167 | log mel spectra | Gain, PolarityInversion, SevenBandParametricEQ, Time Masking, Reverb | FSD50K, FMA | dasheng_base
Track B: Audiovisual
Rank | Submission name | Technical Report | Model | Model params | Acoustic features | Visual features | Data augmentation | External datasets | Pre-trained models
---|---|---|---|---|---|---|---|---|---
1 | Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 58484930 | log mel spectra | ResNet-50 features, video object detection, video human keypoints detection | Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup | AudioSet | ResNet-50, ppyoloe, grounding dino |
2 | Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 67264873 | log mel spectra | ResNet-50 features, video object detection, video human keypoints detection | Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup | AudioSet | ResNet-50, ppyoloe, grounding dino |
3 | Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 99415144 | log mel spectra | ResNet-50 features, video object detection, video human keypoints detection | Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup | AudioSet | ResNet-50, ppyoloe, grounding dino |
4 | Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, ensemble | 67268214 | log mel spectra | ResNet-50 features, video object detection, video human keypoints detection | Audio Channel and Video Pixel Swapping, Multi-channel data simulation, Mixup | AudioSet | ResNet-50, ppyoloe, grounding dino |
5 | Berghi_SURREY_task3b_3 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer, ViT, ensemble | 134694434 | log mel spectra, inter-channel level difference, short-term power of ACC | OWL-ViT features | Audio Channel and Video Pixel Swapping, frame flip | FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor | CLAP, OWL-ViT |
6 | Berghi_SURREY_task3b_4 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer, ViT, ensemble | 134694434 | log mel spectra, inter-channel level difference, short-term power of ACC | OWL-ViT features | Audio Channel and Video Pixel Swapping, frame flip | FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor | CLAP, OWL-ViT |
7 | Berghi_SURREY_task3b_1 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer, ViT | 36387868 | log mel spectra, inter-channel level difference, short-term power of ACC | OWL-ViT features | Audio Channel and Video Pixel Swapping, frame flip | FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor | CLAP, OWL-ViT |
8 | Berghi_SURREY_task3b_2 | Berghi_SURREY_task3_report | CNN, Conformer, Cross-Modal Conformer, ViT | 36387868 | log mel spectra, inter-channel level difference, short-term power of ACC | OWL-ViT features | Audio Channel and Video Pixel Swapping, frame flip | FSD50K, RIR datasets, SpatialScaper, SELDVisualSynth Canvas and Assets, Flickr30k, DoorDetect Dataset, 360-Indoor | CLAP, OWL-ViT |
9 | Chengnuo_JSU_task3b_4 | Chengnuo_JSU_task3b_report | MLP, MHSA, CNN | 2896093 | log mel spectra | ResNet-50 features | ResNet-50 | |
10 | AV_Baseline | Baseline_report | CRNN, MHSA | 2723676 | log mel spectra | ResNet-50 features | ResNet-50 | ||
11 | Guan_GISP-HEU_task3b_3 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 3717596 | log mel spectra | ResNet-50 features | FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k | ResNet-50 | |
12 | Chengnuo_JSU_task3b_1 | Chengnuo_JSU_task3b_report | MLP, MHSA, CNN | 2896093 | log mel spectra | ResNet-50 features | ResNet-50 | |
13 | Yu_Polyu_task3b_1 | Yu_Polyu_task3b_report | Mamba, ResNet | 980000 | log mel spectra, intensity vector | ResNet-50 features | Multi-channel data simulation | FSD50K, TAU-SRIR DB | ResNet-50 |
14 | Kim_Samsung_task3b_1 | Kim_Samsung_task3_report | ViT, ensemble | 291473710 | log mel spectra | ResNet-50 features | Specmix | FSD50K, TAU-SRIR DB | ResNet-50 |
15 | Chengnuo_JSU_task3b_2 | Chengnuo_JSU_task3b_report | MLP, MHSA, CNN | 2896093 | log mel spectra | ResNet-50 features | ResNet-50 | |
16 | Guan_GISP-HEU_task3b_2 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 2720000 | log mel spectra | ResNet-50 features | SpecAugment | FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k | ResNet-50 |
17 | Chengnuo_JSU_task3b_3 | Chengnuo_JSU_task3b_report | MLP, MHSA, CNN | 2896093 | log mel spectra | ResNet-50 features | ResNet-50 | |
18 | Guan_GISP-HEU_task3b_1 | Guan_GISP-HEU_task3_report | CRNN, MHSA | 2720000 | log mel spectra | ResNet-50 features | FSD50K, TAU-SRIR DB, SELDVisualSynth Canvas and Assets, Flickr30k | ResNet-50 |
Technical reports
A CONFORMER-BASED ENSEMBLE APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION FOR STEREO DATA
Arjun Bahuguna1, Rahul Peter2
1Universitat Pompeu Fabra, Dept. of Engineering, Barcelona, 08018, Spain, 2Aalto University, Electrical Engineering Dept., Espoo, 20150, Finland
Bahuguna_UPF_task3a_1 Bahuguna_UPF_task3a_2 Bahuguna_UPF_task3a_3 Bahuguna_UPF_task3a_4
Abstract
This report presents our approach to Task 3 of the DCASE 2025 Challenge, which focuses on stereo sound event localization and detection (SELD) in regular video content. We propose a three-part ensemble model that operates in the audio domain and outperforms the official baseline. To address class imbalance in the STARSS23 dataset, we explore synthetic data generation using SpatialScaper and apply data augmentation techniques such as channel swapping and time-domain remixing. Our proposed system achieves an F-score of 28%, a DOA error of 17.3°, and a relative distance error of 0.43 on the development dataset. We conclude by suggesting possible future enhancements.
EXPLOITING STEREO SPATIAL PROPERTIES WITH RESNET-CONFORMERS FOR ROBUST EVENT DETECTION AND LOCALIZATION
Banerjee Mohor1, Nagisetty Srikanth2, Han Boon Teo2
1Nanyang Technological University, 2Panasonic R&D Center Singapore
Banerjee_NTU_task3a_1 Banerjee_NTU_task3a_2
Abstract
This technical report presents our submission for Task 3, Track A of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in regular video content. Our system focuses exclusively on the audio-only track, leveraging stereo audio inputs for precise SELD. At the heart of our framework lies a ResNet-Conformer architecture augmented with pre-trained ONE-PEACE embeddings. This powerful combination processes a rich set of spatial and spectral features, including Mel spectrograms, Interaural Phase Differences (IPDs), Interaural Level Differences (ILDs), and Generalized Cross-Correlation with Phase Transform (GCC-PHAT). To optimize performance across multiple SELD tasks - namely Direction of Arrival (DoA) estimation, sound event detection (SED), and sound distance estimation (SDE) - we employ a modular training strategy: separate modules are trained for DoA estimation and SDE, with their outputs fused through a joint prediction scheme. The ONE-PEACE embeddings are integrated alongside the ResNet-Conformer outputs and jointly processed to enhance downstream task performance further. We evaluate our system on the DCASE 2025 Task 3 dev-test set, derived from the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset. Our method achieves an F-score of 48.2%, representing a substantial 25.42% improvement over the official baseline, thereby demonstrating the effectiveness of our approach.
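The binaural features named in the abstract above (ILDs and the sine/cosine of IPDs) can be sketched from the complex STFTs of the two channels. The helper below is an illustrative sketch under assumed conventions (left channel as reference, ILD in dB), not the authors' implementation:

```python
import numpy as np

def stereo_spatial_features(stft_l, stft_r, eps=1e-8):
    """ILD and sin/cos IPD from complex STFTs of shape (frames, bins).

    Hypothetical helper illustrating the feature definitions; submitted
    systems may differ in normalization and smoothing details.
    """
    # Inter-channel (interaural) level difference in dB, left over right.
    ild = 20.0 * np.log10((np.abs(stft_l) + eps) / (np.abs(stft_r) + eps))
    # Inter-channel phase difference; sin/cos avoid the 2*pi wrap-around.
    ipd = np.angle(stft_l * np.conj(stft_r))
    return ild, np.sin(ipd), np.cos(ipd)
```

Feeding sin(IPD) and cos(IPD) instead of the raw phase difference keeps the features continuous at the ±180° boundary, which is why this parameterization is common in stereo SELD front ends.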
Baseline Models and Evaluation of Sound Event Localization and Detection with Distance Estimation in DCASE2024 Challenge
David Diaz-Guerra1, Archontis Politis1, Parthasaarathy Sudarsanam1, Kazuki Shimada2, Daniel Krause1, Kengo Uchida2, Yuichiro Koyama3, Naoya Takahashi4, Shusuke Takahashi3, Takashi Shibuya2, Yuki Mitsufuji5,6, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2Sony AI, Tokyo, Japan, 3Sony Group Corporation, Tokyo, Japan, 4Sony AI, Zurich, Switzerland, 5Sony AI, NY, USA, 6Sony Group Corporation, NY, USA
AO_Baseline AV_Baseline
SPATIAL AND SEMANTIC EMBEDDING INTEGRATION FOR STEREO SOUND EVENT LOCALIZATION AND DETECTION IN REGULAR VIDEOS
Davide Berghi, Philip J. B. Jackson
CVSSP, University of Surrey, U.K.
Berghi_SURREY_task3a_1 Berghi_SURREY_task3a_2 Berghi_SURREY_task3b_1 Berghi_SURREY_task3b_2 Berghi_SURREY_task3b_3 Berghi_SURREY_task3b_4
Abstract
This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. Additionally, we incorporate autocorrelation-based acoustic features to improve distance estimation. We pre-train our models on curated synthetic audio and audio-visual datasets and apply a left-right channel swapping augmentation to further increase the training data. Both our audio-only and audio-visual systems substantially outperform the challenge baselines on the development set, demonstrating the effectiveness of our strategy. Performance is further improved through model ensembling and a visual post-processing step based on human keypoints. Future work will investigate the contribution of each modality and explore architectural variants to further enhance results.
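The left-right channel-swapping augmentation mentioned above has a simple form for stereo SELD: swapping the channels mirrors the scene around the median plane, so azimuth labels flip sign while class and distance labels are unchanged. A minimal sketch, with an illustrative function name not taken from the report:

```python
import numpy as np

def channel_swap(stereo, azimuths):
    """Left-right channel-swap augmentation for stereo SELD.
    stereo: (2, n_samples) array; azimuths: per-event azimuths in degrees.
    Swapping (L, R) -> (R, L) mirrors the scene, so azimuths negate."""
    swapped = stereo[::-1].copy()         # reverse the channel axis
    return swapped, -np.asarray(azimuths)

audio = np.arange(8, dtype=float).reshape(2, 4)   # toy (L, R) signal
aug, az = channel_swap(audio, [30.0, -45.0])
```

Applied on the fly during training, this doubles the effective data at negligible cost.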
MULTI-ACCDOA-BASED SELD IN STEREO AUDIO: FEATURE EXTRACTION AND DATA AUGMENTATION STRATEGIES
Bingnan Duan, Yinhuan Dong, Liuyuan Na
The University of Edinburgh, School of Engineering, Edinburgh, UK
Bingnan_UOE_task3a_1
Abstract
This technical report describes the proposed system submitted to the DCASE 2025 Task 3: Stereo sound event localization and detection in regular video content (Track A: Audio-only inference). To improve SELD performance, we replace the convolutional blocks in the baseline model with ResNet blocks, extract a 3-channel input feature consisting of log-mel spectrograms and short-term power of autocorrelation (stpACC), and employ two data augmentation techniques: Time Masking and Frame Shuffle. Our system uses the Multi-ACCDOA output representation with an ADPIT loss function to support overlapping sound events. Evaluated on the development dataset, our proposed method achieves significant improvements over the official baseline across F1-score, DOA error, and relative distance error.
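The two augmentations named in the abstract, Time Masking and Frame Shuffle, can be sketched as below; the mask width, chunk size, and function names are illustrative assumptions rather than values from the report.

```python
import numpy as np

rng = np.random.default_rng(7)

def time_mask(spec, max_width=8):
    """SpecAugment-style time masking: zero a random block of frames.
    spec: (n_bins, n_frames) spectrogram; the input is left untouched."""
    out = spec.copy()
    width = rng.integers(1, max_width + 1)
    start = rng.integers(0, spec.shape[1] - width + 1)
    out[:, start:start + width] = 0.0
    return out

def frame_shuffle(spec, labels, chunk=16):
    """Shuffle fixed-size chunks of frames, permuting the frame-wise
    labels identically so features and targets stay aligned."""
    n_chunks = spec.shape[1] // chunk
    perm = rng.permutation(n_chunks)
    spec_c = spec[:, :n_chunks * chunk].reshape(spec.shape[0], n_chunks, chunk)
    lab_c = np.asarray(labels)[:n_chunks * chunk].reshape(n_chunks, chunk)
    return (spec_c[:, perm].reshape(spec.shape[0], -1),
            lab_c[perm].reshape(-1))

spec = np.ones((64, 64))
labels = np.arange(64)
masked = time_mask(spec)
shuf_spec, shuf_labels = frame_shuffle(spec, labels)
```

Shuffling coarse chunks rather than single frames preserves short-term temporal structure inside each chunk while still breaking long-range ordering.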
THE SYSTEM FOR DCASE 2025 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE
Chengnuo Sun, Lijian Gao
Jiangsu University, Zhenjiang, China
Chengnuo_JSU_task3b_1 Chengnuo_JSU_task3b_2 Chengnuo_JSU_task3b_3 Chengnuo_JSU_task3b_4
Abstract
This technical report gives an overview of our system for the audiovisual track of Task 3 of the DCASE 2025 Challenge. We propose a Sound Event Localization and Detection (SELD) system for stereo sound event localization and detection in regular video content. Compared with the baseline, the proposed method pays more attention to the temporal relationship between modalities. We evaluated our methods on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and achieve significant improvements over the baseline method.
THE NERC-SLIP SYSTEM FOR STEREO SOUND EVENT LOCALIZATION AND DETECTION IN REGULAR VIDEO CONTENT OF DCASE 2025 CHALLENGE
Qing Wang1, Hengyi Hong1, Ruoyu Wei3, Lin Li2, Yuxuan Dong1, Mingqi Cai2, Xin Fang2, Jiangzhao Wu3, Jun Du1
1University of Science and Technology of China, Hefei, China, 2iFLYTEK, Hefei, China, 3National Intelligent Voice Innovation Center, Hefei, China
Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3b_4
Abstract
This technical report details our submission system for Task 3 of the DCASE 2025 Challenge, which focuses on sound event localization and detection (SELD) in regular video content with stereo audio. In addition to estimating the direction of arrival (DOA) and distance of sound sources, the audio-visual SELD task requires predicting whether the sound source is on-screen. For the audio-only track, we used two-channel log-Mel spectrogram features from stereo audio as model inputs. We adapted the audio-visual pixel swapping (AVPS) technique from first-order Ambisonics (FOA) to stereo format through left-right channel swapping coupled with horizontal video pixel transposition, effectively doubling the training data. Our architecture implemented three specialized models for DOA, distance, and source coordinates estimation tasks, subsequently integrated through a joint prediction framework. The audio-visual track utilized a ResNet-50 model pre-trained on ImageNet for visual feature extraction, enhanced by a teacher-student learning paradigm for cross-modal knowledge distillation. To improve on-screen event detection, we developed a novel two-stage visual post-processing method. Our methods were evaluated using the development set of the DCASE 2025 Task 3.
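The stereo adaptation of AVPS described above pairs an audio left-right channel swap with a horizontal flip of the video, negating azimuth labels so audio, video, and labels stay spatially consistent. A minimal sketch under those assumptions (array layouts and the function name are illustrative):

```python
import numpy as np

def av_pixel_swap(stereo, frames, azimuths):
    """Stereo AVPS-style augmentation: swap the two audio channels,
    mirror each video frame horizontally, and negate azimuth labels.
    stereo: (2, n_samples); frames: (n_frames, H, W, 3); azimuths in degrees."""
    return (stereo[::-1].copy(),          # (L, R) -> (R, L)
            frames[:, :, ::-1].copy(),    # flip the width (pixel) axis
            -np.asarray(azimuths))

stereo = np.arange(6, dtype=float).reshape(2, 3)
frames = np.arange(24, dtype=float).reshape(1, 2, 4, 3)
s2, f2, az2 = av_pixel_swap(stereo, frames, [60.0])
```

Because the same mirroring is applied to both modalities, the augmented clip remains a physically plausible audio-visual scene, which is what lets this transform double the training data.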
STEREO SOUND EVENT LOCALIZATION AND DETECTION BASED ON PSELDNET PRETRAINING AND BIMAMBA SEQUENCE MODELING
Wenmiao Gao1, Yang Xiao2
1Technical University of Denmark, Kgs. Lyngby, Denmark, 2The University of Melbourne, Melbourne, Australia
Gao_DTU_task3a_1 Gao_DTU_task3a_2 Gao_DTU_task3a_3 Gao_DTU_task3a_4
Abstract
Pre-training methods have achieved significant performance improvements in sound event localization and detection (SELD) tasks, but existing Transformer-based models suffer from high computational complexity. In this work, we propose a stereo sound event localization and detection system based on pre-trained PSELDnet and bidirectional Mamba sequence modeling. We replace the Conformer module with a BiMamba module and introduce asymmetric convolutions to more effectively model the spatiotemporal relationships between time and frequency dimensions. Experimental results demonstrate that the proposed method achieves significantly better performance than the baseline and the original PSELDnet with Conformer decoder architecture on the DCASE2025 Task 3 development dataset, while also reducing computational complexity. These findings highlight the effectiveness of the BiMamba architecture in addressing the challenges of the SELD task.
GISP@HEU'S SUBMISSION TO THE DCASE 2025 CHALLENGE: STEREO SELD TASK
Congyi Fan1, Shitong Fan1, Feiyang Xiao1, Wenbo Wang2, Xinyi Che3, Qiaoxi Zhu4, Jian Guan1
1Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 2Faculty of Computing, Harbin Institute of Technology, Harbin, China, 3Sichuan University, 4University of Technology Sydney, Ultimo, Australia
Guan_GISP-HEU_task3a_1 Guan_GISP-HEU_task3a_2 Guan_GISP-HEU_task3a_3 Guan_GISP-HEU_task3b_1 Guan_GISP-HEU_task3b_2 Guan_GISP-HEU_task3b_3
Abstract
This technical report presents our submission to Task 3 of the DCASE 2025 Challenge. To enhance the model's generalization ability, we adopt the official synthetic data generation pipeline to expand the training set. In addition, SpecAugment is applied for data augmentation to improve event recognition performance. To address the challenges of ambiguous localization and long-range temporal dependencies inherent in stereo SELD, we use the Mamba architecture, which effectively captures both local and global temporal dynamics, thereby improving overall system performance.
STEREO SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION USING DATA-DRIVEN RESNET-CONFORMER ENSEMBLE
Changjiang He1,2, Jian Chen1, Siyao Cheng1,2, Jiahua Bao1,2, Jie Liu1,2
1Harbin Institute of Technology, Faculty of Computing, China, 2State Key Laboratory of Smart Farm Technologies and Systems, China
He_HIT_task3a_1 He_HIT_task3a_2 He_HIT_task3a_3 He_HIT_task3a_4
Abstract
This technical report presents our submitted system for Task 3 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. DCASE 2025 Task 3 includes two tracks, and we participate exclusively in the audio-only track. First, we perform data augmentation by employing audio channel swapping (ACS) and data simulation techniques, expanding the dataset to 3.7 times its original size. Subsequently, a single ResNet-Conformer model is used to perform SELD predictions. To further optimize the model and submit multiple model-ensemble solutions, we fine-tuned it on the original dataset after training it on the augmented dataset. During model ensembling, we integrate two models, SED-DoA and SED-SDE. Our approach is evaluated on the development test set of the dataset.
SOUND EVENT LOCALIZATION AND DETECTION MODEL WITH ATTENTION-BASED NEURAL NETWORKS AND DATA MODELING
Gwantae Kim
Samsung Electronics, Suwon, South Korea
Kim_Samsung_task3a_1 Kim_Samsung_task3b_1
Abstract
This technical report presents our submission system for Task 3 of the DCASE 2025 Challenge, which tackles sound event localization and detection (SELD) in regular video content with stereo audio. We introduce our data preparation and augmentation methods, neural network structure, post-processing, and model ensemble strategy.
ResNet-Conformer for Stereo Sound Event Localization and Distance Estimation in DCASE 2025 task3
Jehyun Park, Hyeonuk Nam, Yong-Hwa Park
Korea Advanced Institute of Science and Technology, South Korea
Park_KAIST_task3a_1 Park_KAIST_task3a_2
Abstract
In the DCASE 2025 Task 3 Track A challenge, we propose a ResNet-Conformer architecture for stereo sound event localization and detection (SELD) with integrated sound distance estimation (SDE). We develop two complementary systems. The first system follows a two-stage training strategy, where the model is initially trained to perform sound event detection (SED) and direction-of-arrival (DOA) estimation, and then fine-tuned to also predict source distance. The second system is based on a dual-branch ensemble that combines a model trained for SED and DOA with another model trained for SED and SDE. Both systems share a common backbone consisting of a ResNet-based convolutional encoder followed by an 8-layer Conformer stack, with separate output branches for SED (sigmoid), DOA (tanh), and SDE (ReLU). To enhance robustness, we apply audio channel swapping (ACS) and FilterAugment as data augmentation techniques. Evaluation on the DCASE 2025 Task 3 development set demonstrates that the proposed ensemble system improves overall SELD performance.
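The three output branches described above (sigmoid for SED activity, tanh for DOA components, ReLU for distance) can be illustrated with a toy forward pass. The random weights below merely stand in for the ResNet-Conformer backbone, and the class count and feature width are assumptions, not values from the report.

```python
import numpy as np

def seld_heads(features, n_classes=13):
    """Toy version of the three SELD output branches: sigmoid bounds
    SED activity to (0, 1), tanh bounds DOA components to (-1, 1),
    and ReLU keeps distance estimates non-negative."""
    rng = np.random.default_rng(0)
    d = features.shape[-1]
    W_sed = rng.standard_normal((d, n_classes))
    W_doa = rng.standard_normal((d, 2 * n_classes))  # (x, y) per class
    W_sde = rng.standard_normal((d, n_classes))
    sed = 1.0 / (1.0 + np.exp(-(features @ W_sed)))  # activity probability
    doa = np.tanh(features @ W_doa)                  # bounded direction components
    sde = np.maximum(features @ W_sde, 0.0)          # non-negative distance
    return sed, doa, sde

feats = np.ones((10, 32))                            # 10 frames of backbone output
sed, doa, sde = seld_heads(feats)
```

Matching each activation to the range of its target is what lets one shared backbone serve all three regression and classification heads.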
A MULTI-LEVEL FEATURE EXTRACTION NETWORK FOR SOUND EVENT LOCALIZATION AND DETECTION IN DCASE 2025 TASK 3
QingJing Wan1,2, Ying Hu1,2, Jie Liu1,2, Qiong Wu1,2, Qin Yang1,2, WenTao Zhou1,2, Tianqing Zhou1,2, Nannan Teng1,2, Fangxu Chen1,2, Zijun Chen1,2
1Xinjiang University, School of Information Science and Engineering, Urumqi, China, 2Key Laboratory of Signal Detection and Processing in Xinjiang, Urumqi, China
Wan_XJU_task3a_1
Abstract
This technical report describes our submission system for Task 3 of the DCASE 2025 Challenge: stereo sound event localization and detection (SELD) in regular video content. We participate in the audio-only track. Our system adopts a Multi-Level Feature Extraction Network, which consists of three main components. First, a Feature Extraction Enhancement module (FEEM) is used to extract fine-grained and meaningful features at multiple hierarchical levels, improving the model's ability to handle both sub-tasks: Direction of Arrival (DOA) estimation and Sound Event Detection (SED). Second, a Feature Fusion module (FFM) is employed to integrate multi-level features, further enhancing the representational capacity of the network. Finally, several data augmentation strategies are applied to improve the robustness of the network. Experimental results on the DCASE 2025 Task 3 stereo SELD dataset demonstrate the effectiveness of the proposed system.
A STEREO SOUND EVENT LOCALIZATION AND DETECTION METHOD BASED ON FEATURE FUSION AND TWO-STAGE TRAINING
Digao Wu, Ming Zhu
Huazhong University of Science and Technology, School of Electronic Information and Communications, Wuhan, China
Wu_HUST_task3a_1 Wu_HUST_task3a_2 Wu_HUST_task3a_3 Wu_HUST_task3a_4
Abstract
This technical report presents our system for Task 3 of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection in Regular Video Content. The task requires predicting the activity, azimuth, and distance of sound events using stereo audio. We participate in the audio-only track. We propose a stereo SELD model based on the ResNet-Conformer structure, integrating channel-wise attention and feature fusion, with outputs represented in the ACCDOA format. To enhance model performance, we augment the training data with additional stereo audio segments sampled from the official DCASE 2024 synthetic dataset. We apply several data augmentation techniques and adopt a two-stage training strategy to improve generalization and performance on real data. A dynamic thresholding method is also introduced during inference to further boost the prediction accuracy. The experimental results on the official development dataset show that our proposed system outperforms the baseline in all evaluation metrics.
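The abstract does not spell out its dynamic thresholding rule, but ACCDOA-style decoding generally declares a class active in a frame when the norm of its direction vector exceeds a threshold. The sketch below uses fixed per-class thresholds as a stand-in for the dynamic rule; all names and values are illustrative.

```python
import numpy as np

def accdoa_activity(accdoa, thresholds):
    """Activity decoding from ACCDOA vectors: a class is active when
    the norm of its direction vector exceeds its threshold.
    accdoa: (n_frames, n_classes, 2) for stereo azimuth (x, y) components."""
    norms = np.linalg.norm(accdoa, axis=-1)   # (n_frames, n_classes)
    return norms > np.asarray(thresholds)     # boolean activity mask

accdoa = np.zeros((2, 3, 2))
accdoa[0, 1] = [0.6, 0.0]                     # strong vector -> active
accdoa[1, 2] = [0.1, 0.1]                     # weak vector -> inactive at 0.5
active = accdoa_activity(accdoa, thresholds=[0.5, 0.5, 0.5])
```

A "dynamic" variant would adjust `thresholds` at inference, e.g. per class or per clip, rather than fixing them ahead of time.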
IMPROVING STEREO 3D SOUND EVENT LOCALIZATION AND DETECTION: PERCEPTUAL FEATURES, STEREO-SPECIFIC DATA AUGMENTATION, AND DISTANCE NORMALIZATION
Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi, Woon-Seng Gan
Smart Nation TRANS Lab, Nanyang Technological University, Singapore
Yeow_NTU_task3a_1 Yeow_NTU_task3a_2 Yeow_NTU_task3a_3 Yeow_NTU_task3a_4
Abstract
This technical report presents our submission to Task 3 of the DCASE 2025 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. We address the audio-only task in this report and introduce several key contributions. First, we design perceptually-motivated input features that improve event detection, sound source localization, and distance estimation. Second, we adapt augmentation strategies specifically for the intricacies of stereo audio, including channel swapping and time-frequency masking. We also incorporate the recently proposed FilterAugment technique that has yet to be explored for SELD work. Lastly, we apply a distance normalization approach during training to stabilize regression targets. Experiments on the stereo STARSS23 dataset demonstrate consistent performance gains across all SELD metrics. Code to replicate our work is available in this repository.
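Distance normalization of the kind described above can be as simple as clipping and scaling raw distances into [0, 1] during training and inverting the map at inference. The maximum distance below is an assumed constant, not a value from the report, and the function names are illustrative.

```python
import numpy as np

def normalize_distance(d, d_max=10.0):
    """Map raw distances (metres) into [0, 1] as a stabler regression
    target; d_max is an assumed scene-dependent upper bound."""
    return np.clip(np.asarray(d, dtype=float), 0.0, d_max) / d_max

def denormalize_distance(d_norm, d_max=10.0):
    """Invert the normalization to recover distances in metres."""
    return np.asarray(d_norm, dtype=float) * d_max

d = [0.0, 2.5, 12.0]                 # 12 m exceeds the assumed bound
d_norm = normalize_distance(d)       # clipped, then scaled to [0, 1]
d_back = denormalize_distance(d_norm)
```

Keeping the regression target bounded avoids the loss being dominated by a few distant sources.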
THE MAMBA-BASED SYSTEM FOR DCASE 2025 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE
Rendong Pi, Xiang Yu
The Hong Kong Polytechnic University, Mechanical Engineering Dept., Hung Hom, Kowloon, Hong Kong, China
Yu_Polyu_task3b_1
Abstract
This technical report gives an overview of our system for Task 3 of the DCASE 2025 Challenge. We propose a stereo sound event localization and detection (SELD) system using the mel spectrogram and intensity vector as input features. We construct a Mamba-based network and achieve significant improvements over the baseline system with a few data augmentation methods. We conduct the performance evaluation on the dev-test split of the stereo SELD dataset extracted from the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.
ENHANCING STEREO SOUND EVENT LOCALIZATION AND DETECTION THROUGH PRETRAINED AUDIO REPRESENTATIONS AND HYBRID ARCHITECTURES
Tianbo Zhao, Zerui Han, Mengmei Liu
Xiaomi Corporation, MITC-Multimodal generation, Beijing, China
Zhao_MITC-MG_task3a_1 Zhao_MITC-MG_task3a_2 Zhao_MITC-MG_task3a_3 Zhao_MITC-MG_task3a_4
Abstract
This technical report presents our submission system for Task 3 of the DCASE 2025 Challenge: Stereo sound event localization and detection (SELD) in regular video content. This year we participate in the audio-only track. We propose a method that decomposes the SELD task into two sub-tasks. For the detection task, we employ the pre-trained Dasheng model, a high-performing audio encoder. For the localization task, we utilize the ResNet-Conformer architecture, which has demonstrated excellent performance in recent DCASE challenges. We evaluated our method on the dev-test set of the development dataset. The results show that our approach outperforms the baseline.