Task description
The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into a known set of sound classes, and localize the events in space for as long as they are active.
The focus of the current SELD task is developing systems that can perform adequately on real sound scene recordings with a small amount of training data. There are two tracks: an audio-only track (Track A) for systems that use only microphone recordings to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally employ simultaneous 360° video recordings spatially aligned with the multichannel microphone recordings.
The task provides two datasets, development and evaluation, recorded in multiple rooms at two different sites. Of the two, only the development dataset provides reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development-set split, and finally test their systems on the unseen evaluation dataset.
More details on the task setup and evaluation can be found on the task description page.
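As context for the metric columns below: a detection counts toward the location-dependent F-score only if the class is correct, the DOA error is at most 20°, and the relative distance error is at most 1 (hence "F-score (20°/1)"). The sketch below shows how the two localization errors are computed for a single prediction/reference pair; the function names are ours, not the official evaluation code.

```python
import numpy as np

def doa_error_deg(pred_xyz, ref_xyz):
    """Angular error in degrees between two Cartesian DOA vectors."""
    pred = pred_xyz / np.linalg.norm(pred_xyz)
    ref = ref_xyz / np.linalg.norm(ref_xyz)
    return np.degrees(np.arccos(np.clip(np.dot(pred, ref), -1.0, 1.0)))

def relative_distance_error(pred_dist, ref_dist):
    """Absolute distance error normalized by the reference distance."""
    return abs(pred_dist - ref_dist) / ref_dist

# Example: ~5.7 degrees angular error, 0.25 relative distance error.
print(doa_error_deg(np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.0, 0.0])))
print(relative_distance_error(2.5, 2.0))
```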
Teams ranking
The SELD task received 47 submissions in total from 13 teams across the world. Of those, 29 submissions were on the audio-only Track A and 18 on the audiovisual Track B. Four teams participated in both tracks, seven participated only in Track A, and two participated only in Track B.
The following tables include only the best-performing system per submitting team. All metrics are computed on the evaluation dataset, with confidence intervals reported for each metric.
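The page does not restate how these confidence intervals were computed. Purely as an illustration of the general idea, a nonparametric percentile bootstrap over per-clip metric values could look like the following sketch; all names are ours, and the official intervals may be derived differently.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_ci(per_clip_scores, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of a metric."""
    scores = np.asarray(per_clip_scores, dtype=float)
    means = [rng.choice(scores, size=scores.size, replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```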
Track A: Audio-only
Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | F-score (20°/1) | DOA error (°) | Relative distance error
---|---|---|---|---|---|---|---
Du_NERCSLIP_task3a_4 | Qing Wang | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 54.4 (48.9 - 59.2) | 13.6 (12.4 - 15.0) | 0.21 (0.18 - 0.23)
Yu_HYUNDAI_task3a_3 | Hogeon Yu | Hyundai Motor Company | Yu_HYUNDAI_task3a_report | 2 | 29.8 (25.1 - 34.2) | 19.8 (18.3 - 21.6) | 0.28 (0.25 - 0.32)
Yeow_NTU_task3a_2 | Jun Wei Yeow | Nanyang Technological University | Yeow_NTU_task3a_report | 3 | 26.2 (22.0 - 30.5) | 25.1 (23.2 - 27.6) | 0.26 (0.22 - 0.28)
Guan_CQUPT_task3a_4 | Xin Guan | Chongqing University of Posts and Telecommunications | Guan_CQUPT_task3_report | 4 | 26.7 (22.7 - 31.1) | 18.6 (17.4 - 21.8) | 0.36 (0.34 - 0.39)
Vo_DU_task3a_1 | Quoc Thinh Vo | Drexel University | Vo_DU_task3a_report | 5 | 24.7 (20.8 - 28.4) | 19.3 (17.7 - 21.3) | 0.34 (0.30 - 0.37)
Berg_LU_task3a_3 | Axel Berg | Lund University, Arm | Berg_LU_task3_report | 6 | 25.5 (21.8 - 29.6) | 23.2 (18.2 - 28.8) | 0.39 (0.34 - 0.44)
Sun_JLESS_task3a_1 | Wenqiang Sun | Northwestern Polytechnical University | Sun_JLESS_task3a_report | 7 | 28.5 (24.2 - 33.0) | 23.8 (21.5 - 25.9) | 0.51 (0.49 - 0.53)
Qian_IASP_task3a_1 | Yuanhang Qian | Wuhan University | Qian_IASP_task3a_report | 8 | 22.8 (18.6 - 26.8) | 27.2 (24.6 - 29.8) | 0.36 (0.31 - 0.42)
AO_Baseline_FOA | Parthasaarathy Sudarsanam | Tampere University | Politis_TAU_task3a_report | 9 | 18.0 (14.6 - 21.7) | 29.6 (24.6 - 33.3) | 0.31 (0.28 - 0.36)
Zhang_BUPT_task3a_1 | Zhicheng Zhang | Beijing University of Posts and Telecommunications | Zhang_BUPT_task3a_report | 10 | 19.0 (16.1 - 21.8) | 29.6 (26.6 - 32.9) | 0.40 (0.32 - 0.48)
Chen_ECUST_task3a_1 | Ning Chen | East China University of Science and Technology | Chen_ECUST_task3_report | 11 | 15.1 (12.2 - 17.9) | 28.3 (25.5 - 30.9) | 0.48 (0.39 - 0.59)
Li_BIT_task3a_1 | Jiahao Li | Beijing Institute of Technology | Li_BIT_task3a_report | 12 | 16.9 (13.4 - 20.5) | 33.5 (30.0 - 42.7) | 0.51 (0.26 - 1.25)
Track B: Audiovisual
Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | F-score (20°/1) | DOA error (°) | Relative distance error
---|---|---|---|---|---|---|---
Du_NERCSLIP_task3b_4 | Qing Wang | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 55.8 (51.2 - 60.4) | 11.4 (10.4 - 12.5) | 0.25 (0.22 - 0.29)
Berghi_SURREY_task3b_4 | Davide Berghi | University of Surrey | Berghi_SURREY_task3b_report | 2 | 39.2 (33.9 - 44.3) | 15.8 (14.2 - 17.4) | 0.29 (0.25 - 0.32)
Li_SHU_task3b_2 | Yongbo Li | Shanghai University | Li_SHU_task3b_report | 3 | 34.2 (29.9 - 38.4) | 21.5 (19.8 - 23.4) | 0.28 (0.25 - 0.31)
Guan_CQUPT_task3b_2 | Xin Guan | Chongqing University of Posts and Telecommunications | Guan_CQUPT_task3_report | 4 | 23.2 (19.2 - 27.2) | 18.8 (17.3 - 21.5) | 0.32 (0.28 - 0.37)
Berg_LU_task3b_3 | Axel Berg | Lund University, Arm | Berg_LU_task3_report | 5 | 25.9 (22.1 - 30.1) | 23.2 (18.2 - 28.8) | 0.33 (0.28 - 0.38)
Chen_ECUST_task3b_1 | Ning Chen | East China University of Science and Technology | Chen_ECUST_task3_report | 6 | 16.3 (13.7 - 19.3) | 25.1 (22.3 - 26.9) | 0.32 (0.27 - 0.39)
AV_Baseline_MIC | Parthasaarathy Sudarsanam | Tampere University | Shimada_SONY_task3b_report | 7 | 16.0 (12.1 - 20.0) | 35.9 (31.8 - 39.6) | 0.30 (0.27 - 0.33)
Systems ranking
Performance of all the submitted systems on the evaluation and the development datasets. Confidence intervals are also reported for each metric on the evaluation set results.
Track A: Audio-only
Submission name | Technical Report | Submission Rank | Eval F-score (20°/1) | Eval DOA error (°) | Eval Relative distance error | Dev F-score (20°/1) | Dev DOA error (°) | Dev Relative distance error
---|---|---|---|---|---|---|---|---
Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | 1 | 54.4 (48.9 - 59.2) | 13.6 (12.4 - 15.0) | 0.21 (0.18 - 0.23) | 59.7 | 12.4 | 0.21
Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | 2 | 55.7 (50.8 - 60.0) | 13.7 (12.4 - 15.3) | 0.21 (0.19 - 0.23) | 61.0 | 12.3 | 0.21
Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | 3 | 54.3 (48.9 - 59.0) | 13.6 (12.4 - 15.0) | 0.21 (0.19 - 0.23) | 59.7 | 12.4 | 0.22
Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | 4 | 53.8 (47.9 - 58.9) | 14.2 (12.6 - 16.0) | 0.21 (0.18 - 0.24) | 58.8 | 12.4 | 0.21
Yu_HYUNDAI_task3a_3 | Yu_HYUNDAI_task3a_report | 5 | 29.8 (25.1 - 34.2) | 19.8 (18.3 - 21.6) | 0.28 (0.25 - 0.32) | 34.7 | 18.8 | 0.28
Yu_HYUNDAI_task3a_4 | Yu_HYUNDAI_task3a_report | 6 | 29.2 (24.4 - 33.6) | 19.7 (18.1 - 21.5) | 0.30 (0.27 - 0.34) | 35.0 | 19.0 | 0.29
Yu_HYUNDAI_task3a_1 | Yu_HYUNDAI_task3a_report | 7 | 29.2 (24.5 - 33.5) | 19.8 (18.2 - 21.5) | 0.29 (0.25 - 0.33) | 33.9 | 19.5 | 0.28
Yu_HYUNDAI_task3a_2 | Yu_HYUNDAI_task3a_report | 8 | 28.2 (23.5 - 32.6) | 20.1 (18.4 - 22.3) | 0.29 (0.24 - 0.32) | 33.4 | 19.2 | 0.28
Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | 9 | 26.2 (22.0 - 30.5) | 25.1 (23.2 - 27.6) | 0.26 (0.22 - 0.28) | 33.8 | 21.4 | 0.30
Guan_CQUPT_task3a_4 | Guan_CQUPT_task3_report | 10 | 26.7 (22.7 - 31.1) | 18.6 (17.4 - 21.8) | 0.36 (0.34 - 0.39) | | |
Vo_DU_task3a_1 | Vo_DU_task3a_report | 11 | 24.7 (20.8 - 28.4) | 19.3 (17.7 - 21.3) | 0.34 (0.30 - 0.37) | 39.7 | 17.4 | 0.33
Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | 12 | 24.6 (20.2 - 29.4) | 25.9 (21.2 - 28.4) | 0.26 (0.19 - 0.29) | 32.7 | 22.9 | 0.30
Vo_DU_task3a_2 | Vo_DU_task3a_report | 13 | 25.6 (21.4 - 29.5) | 20.1 (18.4 - 22.2) | 0.33 (0.29 - 0.36) | 39.9 | 17.5 | 0.32
Guan_CQUPT_task3a_1 | Guan_CQUPT_task3_report | 14 | 21.9 (17.4 - 26.2) | 16.7 (15.5 - 18.9) | 0.31 (0.28 - 0.34) | 43.2 | 14.6 | 0.29
Vo_DU_task3a_3 | Vo_DU_task3a_report | 15 | 24.6 (20.4 - 28.1) | 18.9 (17.4 - 20.5) | 0.34 (0.30 - 0.37) | 40.2 | 17.5 | 0.32
Guan_CQUPT_task3a_3 | Guan_CQUPT_task3_report | 16 | 22.5 (18.2 - 26.7) | 16.7 (15.8 - 18.9) | 0.36 (0.33 - 0.42) | 44.1 | 13.7 | 0.30
Berg_LU_task3a_3 | Berg_LU_task3_report | 17 | 25.5 (21.8 - 29.6) | 23.2 (18.2 - 28.8) | 0.39 (0.34 - 0.44) | 32.0 | 21.8 | 0.44
Berg_LU_task3a_1 | Berg_LU_task3_report | 18 | 27.0 (23.3 - 31.2) | 26.1 (23.0 - 28.6) | 0.37 (0.34 - 0.44) | 29.0 | 23.9 | 0.38
Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | 19 | 23.5 (19.3 - 27.9) | 27.2 (24.2 - 30.5) | 0.28 (0.25 - 0.33) | 33.9 | 20.4 | 0.30
Sun_JLESS_task3a_1 | Sun_JLESS_task3a_report | 20 | 28.5 (24.2 - 33.0) | 23.8 (21.5 - 25.9) | 0.51 (0.49 - 0.53) | 29.2 | 20.7 | 0.47
Guan_CQUPT_task3a_2 | Guan_CQUPT_task3_report | 21 | 21.6 (17.7 - 25.4) | 17.2 (15.1 - 20.2) | 0.40 (0.37 - 0.45) | 43.7 | 14.0 | 0.30
Berg_LU_task3a_2 | Berg_LU_task3_report | 22 | 24.3 (20.4 - 28.3) | 21.5 (18.7 - 24.0) | 0.39 (0.31 - 0.50) | 28.7 | 20.8 | 0.38
Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | 23 | 21.6 (17.8 - 25.6) | 27.3 (23.5 - 30.9) | 0.27 (0.23 - 0.30) | 32.7 | 20.6 | 0.31
Berg_LU_task3a_4 | Berg_LU_task3_report | 24 | 23.5 (19.5 - 27.6) | 23.9 (18.2 - 31.1) | 0.43 (0.38 - 0.54) | 26.8 | 26.5 | 0.57
Qian_IASP_task3a_1 | Qian_IASP_task3a_report | 25 | 22.8 (18.6 - 26.8) | 27.2 (24.6 - 29.8) | 0.36 (0.31 - 0.42) | 23.0 | 25.1 | 0.43
AO_Baseline_FOA | Politis_TAU_task3a_report | 26 | 18.0 (14.6 - 21.7) | 29.6 (24.6 - 33.3) | 0.31 (0.28 - 0.36) | 13.1 | 36.9 | 0.33
AO_Baseline_MIC | Politis_TAU_task3a_report | 27 | 16.3 (13.1 - 19.3) | 34.1 (30.7 - 37.4) | 0.30 (0.28 - 0.33) | 9.9 | 38.1 | 0.30
Sun_JLESS_task3a_2 | Sun_JLESS_task3a_report | 28 | 21.9 (18.7 - 25.4) | 26.4 (24.9 - 28.1) | 0.51 (0.49 - 0.53) | 21.7 | 26.5 | 0.48
Zhang_BUPT_task3a_1 | Zhang_BUPT_task3a_report | 29 | 19.0 (16.1 - 21.8) | 29.6 (26.6 - 32.9) | 0.40 (0.32 - 0.48) | 19.0 | 27.5 | 0.39
Chen_ECUST_task3a_1 | Chen_ECUST_task3_report | 30 | 15.1 (12.2 - 17.9) | 28.3 (25.5 - 30.9) | 0.48 (0.39 - 0.59) | 19.2 | 22.9 | 0.32
Li_BIT_task3a_1 | Li_BIT_task3a_report | 31 | 16.9 (13.4 - 20.5) | 33.5 (30.0 - 42.7) | 0.51 (0.26 - 1.25) | 33.9 | 21.1 | 0.30
Track B: Audiovisual
Submission name | Technical Report | Submission Rank | Eval F-score (20°/1) | Eval DOA error (°) | Eval Relative distance error | Dev F-score (20°/1) | Dev DOA error (°) | Dev Relative distance error
---|---|---|---|---|---|---|---|---
Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | 1 | 55.8 (51.2 - 60.4) | 11.4 (10.4 - 12.5) | 0.25 (0.22 - 0.29) | 59.9 | 10.9 | 0.21
Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | 2 | 55.6 (50.9 - 60.3) | 11.3 (10.3 - 12.4) | 0.25 (0.22 - 0.29) | 59.2 | 10.8 | 0.22
Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | 3 | 53.5 (49.1 - 57.8) | 11.7 (10.5 - 12.9) | 0.27 (0.22 - 0.32) | 59.9 | 11.2 | 0.21
Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | 4 | 52.6 (47.7 - 56.9) | 13.6 (12.4 - 15.2) | 0.29 (0.25 - 0.34) | 61.0 | 10.9 | 0.22
Berghi_SURREY_task3b_4 | Berghi_SURREY_task3b_report | 5 | 39.2 (33.9 - 44.3) | 15.8 (14.2 - 17.4) | 0.29 (0.25 - 0.32) | 40.3 | 18.0 | 0.30
Berghi_SURREY_task3b_2 | Berghi_SURREY_task3b_report | 6 | 36.5 (31.5 - 41.1) | 14.4 (13.0 - 15.8) | 0.29 (0.26 - 0.33) | 38.7 | 16.8 | 0.30
Berghi_SURREY_task3b_1 | Berghi_SURREY_task3b_report | 7 | 39.5 (34.3 - 44.3) | 15.4 (13.9 - 16.9) | 0.31 (0.26 - 0.36) | 40.8 | 17.7 | 0.30
Li_SHU_task3b_2 | Li_SHU_task3b_report | 8 | 34.2 (29.9 - 38.4) | 21.5 (19.8 - 23.4) | 0.28 (0.25 - 0.31) | 36.4 | 19.1 | 0.30
Berghi_SURREY_task3b_3 | Berghi_SURREY_task3b_report | 9 | 30.0 (25.8 - 34.2) | 26.1 (19.4 - 29.8) | 0.29 (0.25 - 0.33) | 30.7 | 18.9 | 0.27
Li_SHU_task3b_1 | Li_SHU_task3b_report | 10 | 31.9 (27.9 - 36.0) | 19.6 (18.1 - 21.2) | 0.33 (0.29 - 0.37) | 39.2 | 18.7 | 0.31
Guan_CQUPT_task3b_2 | Guan_CQUPT_task3_report | 11 | 23.2 (19.2 - 27.2) | 18.8 (17.3 - 21.5) | 0.32 (0.28 - 0.37) | 46.7 | 14.2 | 0.28
Guan_CQUPT_task3b_1 | Guan_CQUPT_task3_report | 12 | 22.2 (18.2 - 26.0) | 20.3 (18.4 - 23.9) | 0.30 (0.26 - 0.34) | 44.4 | 15.2 | 0.27
Berg_LU_task3b_3 | Berg_LU_task3_report | 13 | 25.9 (22.1 - 30.1) | 23.2 (18.2 - 28.8) | 0.33 (0.28 - 0.38) | 33.4 | 21.8 | 0.28
Berg_LU_task3b_2 | Berg_LU_task3_report | 14 | 24.3 (20.4 - 28.4) | 21.5 (18.7 - 24.0) | 0.34 (0.28 - 0.41) | 29.4 | 20.8 | 0.28
Berg_LU_task3b_4 | Berg_LU_task3_report | 15 | 23.7 (19.7 - 27.8) | 23.9 (18.2 - 31.1) | 0.34 (0.26 - 0.40) | 29.0 | 26.5 | 0.28
Chen_ECUST_task3b_1 | Chen_ECUST_task3_report | 16 | 16.3 (13.7 - 19.3) | 25.1 (22.3 - 26.9) | 0.32 (0.27 - 0.39) | 16.2 | 26.2 | 0.41
AV_Baseline_MIC | Shimada_SONY_task3b_report | 17 | 16.0 (12.1 - 20.0) | 35.9 (31.8 - 39.6) | 0.30 (0.27 - 0.33) | 11.8 | 38.5 | 0.29
Berg_LU_task3b_1 | Berg_LU_task3_report | 18 | 26.4 (22.9 - 30.4) | 26.1 (23.0 - 28.6) | 0.35 (0.30 - 0.44) | 29.8 | 23.9 | 0.28
AV_Baseline_FOA | Shimada_SONY_task3b_report | 19 | 15.5 (12.9 - 18.6) | 34.6 (31.0 - 37.3) | 0.31 (0.27 - 0.35) | 11.3 | 38.4 | 0.36
Chen_ECUST_task3b_2 | Chen_ECUST_task3_report | 20 | 14.1 (11.6 - 16.7) | 42.2 (26.1 - 90.5) | 0.39 (0.34 - 0.49) | 17.9 | 24.2 | 0.38
System characteristics
Track A: Audio-only
Rank | Submission name | Technical Report | Model | Model params | Audio format | Acoustic features | Data augmentation
---|---|---|---|---|---|---|---
1 | Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 46878922 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
2 | Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, Conv-TasNet, Ensemble | 145105065 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
3 | Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 46803107 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
4 | Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 93682029 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
5 | Yu_HYUNDAI_task3a_3 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
6 | Yu_HYUNDAI_task3a_4 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
7 | Yu_HYUNDAI_task3a_1 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
8 | Yu_HYUNDAI_task3a_2 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
9 | Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
10 | Guan_CQUPT_task3a_4 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 14479876 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
11 | Vo_DU_task3a_1 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
12 | Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
13 | Vo_DU_task3a_2 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
14 | Guan_CQUPT_task3a_1 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 9318488 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
15 | Vo_DU_task3a_3 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
16 | Guan_CQUPT_task3a_3 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 9820632 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
17 | Berg_LU_task3a_3 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 1490000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
18 | Berg_LU_task3a_1 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 663000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
19 | Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
20 | Sun_JLESS_task3a_1 | Sun_JLESS_task3a_report | CNN, Conformer, Ensemble | 13107932 | Ambisonic | mel spectra, intensity vector, sinIPD | channel rotation |
21 | Guan_CQUPT_task3a_2 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 10322776 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
22 | Berg_LU_task3a_2 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 663000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
23 | Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
24 | Berg_LU_task3a_4 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 1490000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
25 | Qian_IASP_task3a_1 | Qian_IASP_task3a_report | ResNet, Conformer, CNN | 64560 | Ambisonic | mel spectra, intensity vector | audio channel swapping
26 | AO_Baseline_FOA | Politis_TAU_task3a_report | CRNN, MHSA | 742559 | Ambisonic | mel spectra, intensity vector | |
27 | AO_Baseline_MIC | Politis_TAU_task3a_report | CRNN, MHSA | 744287 | Microphone Array | mel spectra, GCC-PHAT | |
28 | Sun_JLESS_task3a_2 | Sun_JLESS_task3a_report | CNN, Conformer, Ensemble | 13107932 | Microphone Array | mel spectra, intensity vector, sinIPD | channel rotation |
29 | Zhang_BUPT_task3a_1 | Zhang_BUPT_task3a_report | CNN, Conformer | 7461404 | Ambisonic | mel spectra, intensity vector | |
30 | Chen_ECUST_task3a_1 | Chen_ECUST_task3_report | CRNN, MHSA | 740963 | Ambisonic | mel spectra, intensity vector, magnitude spectra | audio channel swapping |
31 | Li_BIT_task3a_1 | Li_BIT_task3a_report | Conformer, ConvNeXt | 3714972 | Ambisonic | mel spectra, intensity vector | audio channel swapping |
Track B: Audiovisual
Rank | Submission name | Technical Report | Model | Model params | Audio format | Acoustic features | Data augmentation
---|---|---|---|---|---|---|---
1 | Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 93537081 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
2 | Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 81851917 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
3 | Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 58488271 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
4 | Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 70166753 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
5 | Berghi_SURREY_task3b_4 | Berghi_SURREY_task3b_report | CNN, Conformer, ViT, MHST | 446613716 | Ambisonic | mel spectra, intensity vector, direct-reverberant components | audio-visual channel swapping |
6 | Berghi_SURREY_task3b_2 | Berghi_SURREY_task3b_report | CNN, Conformer | 85483420 | Ambisonic | mel spectra, intensity vector | audio-visual channel swapping |
7 | Berghi_SURREY_task3b_1 | Berghi_SURREY_task3b_report | CNN, Conformer | 85483420 | Ambisonic | mel spectra, intensity vector | audio-visual channel swapping |
8 | Li_SHU_task3b_2 | Li_SHU_task3b_report | ResNet-50, ResNet, Conformer, Transformer | 9995660 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping
9 | Berghi_SURREY_task3b_3 | Berghi_SURREY_task3b_report | CNN, Conformer, ViT, MHST | 275646876 | Ambisonic | mel spectra, intensity vector, direct-reverberant components | audio-visual channel swapping |
10 | Li_SHU_task3b_1 | Li_SHU_task3b_report | ResNet-50, ResNet, Conformer, Transformer | 9995660 | Ambisonic | mel spectra, intensity vector | audio channel swapping, video pixel swapping, multi-channel data simulation
11 | Guan_CQUPT_task3b_2 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble, MHSA, MHCA | 13401544 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping, audio-visual channel swapping |
12 | Guan_CQUPT_task3b_1 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble, MHSA, MHCA | 13401544 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping, audio-visual channel swapping |
13 | Berg_LU_task3b_3 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21900000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
14 | Berg_LU_task3b_2 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21000000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
15 | Berg_LU_task3b_4 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21900000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
16 | Chen_ECUST_task3b_1 | Chen_ECUST_task3_report | CRNN, MHSA | 743428 | Ambisonic | mel spectra, GCC-PHAT, magnitude spectra | audio channel swapping, video pixel swapping |
17 | AV_Baseline_MIC | Shimada_SONY_task3b_report | CRNN | 2728671 | Microphone Array | magnitude spectra, IPD | |
18 | Berg_LU_task3b_1 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21000000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
19 | AV_Baseline_FOA | Shimada_SONY_task3b_report | CRNN | 2726943 | Ambisonic | magnitude spectra, IPD | |
20 | Chen_ECUST_task3b_2 | Chen_ECUST_task3_report | CRNN, MHSA | 745963 | Ambisonic | mel spectra, GCC-PHAT, magnitude spectra | audio channel swapping, video pixel swapping |
Technical reports
THE LU SYSTEM FOR DCASE 2024 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE
Axel Berg1,2, Johanna Engman1, Jens Gulin1,3, Karl Astrom1, Magnus Oskarsson1
1Computer Vision and Machine Learning, Centre for Mathematical Sciences, Lund University, Sweden, 2Arm, Lund, Sweden, 3Sony Europe B.V., Lund, Sweden
Berg_LU_task3a_1 Berg_LU_task3a_2 Berg_LU_task3a_3 Berg_LU_task3a_4 Berg_LU_task3b_1 Berg_LU_task3b_2 Berg_LU_task3b_3 Berg_LU_task3b_4
Abstract
This technical report gives an overview of our submission to task 3 of the DCASE 2024 challenge. We present a sound event localization and detection (SELD) system using input features based on trainable neural generalized cross-correlations with phase transform (NGCC-PHAT). With these features together with spectrograms as input to a Transformer-based network, we achieve significant improvements over the baseline method. In addition, we also present an audio-visual version of our system, where distance predictions are updated using depth maps from the panorama video frames.
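NGCC-PHAT makes the classical generalized cross-correlation with phase transform trainable. As background, here is a minimal NumPy sketch of the classical (non-learned) GCC-PHAT it builds on; the function name is ours.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """Classical GCC-PHAT between two microphone signals.

    The peak of the returned correlation (with wrap-around for negative
    lags) indicates the time difference of arrival between the channels."""
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12  # phase transform: discard magnitude
    return np.fft.irfft(cross, n_fft)
```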
LEVERAGING REVERBERATION AND VISUAL DEPTH CUES FOR SOUND EVENT LOCALIZATION AND DETECTION WITH DISTANCE ESTIMATION
Davide Berghi, Philip J. B. Jackson
CVSSP, University of Surrey, Guildford, UK
Berghi_SURREY_task3b_1 Berghi_SURREY_task3b_2 Berghi_SURREY_task3b_3 Berghi_SURREY_task3b_4
Abstract
This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively. This model outperformed the audio-visual baseline on the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 score by more than 3x. Our second system performs a temporal ensemble from the outputs of the AV-Conformer. We then extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While the new system improved the RDE of our previous model by about 3 percentage points, it achieved a lower F1 score, possibly because the more complex system fails to detect sound classes that rarely appear in the training set, as further analysis could determine. To overcome this problem, our fourth and final system consists of an ensemble strategy combining the predictions of the other three. Many opportunities to refine the system and training strategy remain to be tested in future ablation experiments, which would likely yield incremental performance gains for this audio-visual task.
FEATURE FUSION BASED ON CROSS-FEATURE TRANSFORMER FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION
Jishen Tao, Ning Chen
East China University of Science and Technology School of Information Science and Engineering, Shanghai, China
Chen_ECUST_task3a_1 Chen_ECUST_task3b_1 Chen_ECUST_task3b_2
Abstract
Since the audio of many sound events contains rich high-frequency components, the log-mel representation, which severely compresses the high-frequency range, cannot fully capture the essential features of sound events. In this paper, the Log-Mel Spectrogram + Intensity Vector (LMSIV) and Magnitude Spectrogram (MS) features are fused to address this problem. First, a Cross-Feature Transformer (CFT) is applied to each feature, letting it reinforce itself by attending directly to the latent relevance revealed in the other feature, so that the two are fused with awareness of their interaction. Then a Self-Attention Transformer (SAT) is applied to the concatenation of the resulting embeddings to further prioritize the contextual information in it. The experimental results show that our proposed system outperforms the baseline system on the development dataset of DCASE 2024 Task 3.
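As a rough illustration of the cross-feature idea (not the authors' exact CFT), one feature stream can be reinforced by attending to the other with standard cross-attention; the PyTorch module below is a minimal sketch with invented names and dimensions.

```python
import torch
import torch.nn as nn

class CrossFeatureBlock(nn.Module):
    """Feature A attends to feature B; a residual keeps A's own content."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a, b):
        # Queries from A, keys/values from B: A is enriched with B's relevance.
        fused, _ = self.attn(query=a, key=b, value=b)
        return self.norm(a + fused)

lmsiv = torch.randn(2, 100, 64)  # e.g. an LMSIV embedding (batch, time, dim)
ms = torch.randn(2, 100, 64)     # e.g. a magnitude-spectrogram embedding
out = CrossFeatureBlock()(lmsiv, ms)  # shape (2, 100, 64)
```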
THE NERC-SLIP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION OF DCASE 2024 CHALLENGE
Qing Wang1, Yuxuan Dong1, Hengyi Hong2, Ruoyu Wei3, Maocheng Hu4, Shi Cheng1, Ya Jiang1, Mingqi Cai3, Xin Fang3, Jun Du1
1University of Science and Technology of China, Hefei, China, 2Harbin Engineering University, Harbin, China, 3iFLYTEK, Hefei, China, 4National Intelligent Voice Innovation Center, Hefei, China
Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3b_4
Abstract
This technical report presents our submission system for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). In addition to direction of arrival estimation (DOAE) of the sound source, this challenge also requires predicting the source distance. We attempted three methods to enable the system to predict both the DOA and the distance of the sound source. First, we proposed two multi-task learning frameworks. One introduces an extra branch to the original SELD model, resulting in a three-branch output that simultaneously predicts the DOA and distance of the sound source. The other integrates the sound source distance into the DOA prediction, estimating the absolute position of the sound source. Second, we trained two models for DOAE and SDE respectively, and then used a joint prediction method based on the outputs of the two models. For the audiovisual SELD task with SDE, we used a ResNet-50 model pretrained on ImageNet as the visual feature extractor. Additionally, we simulated audio-visual data and used a teacher-student learning method to train our multi-modal system. We evaluated our methods on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.
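The second multi-task formulation above folds distance into the DOA target, predicting an absolute source position. That conversion is just a scaling of the unit DOA vector, as in this minimal sketch (our names, not the authors' code):

```python
import numpy as np

def to_absolute_position(doa_unit_xyz, distance_m):
    """Combine a unit DOA vector and a distance into a Cartesian position."""
    return distance_m * np.asarray(doa_unit_xyz)

def from_absolute_position(pos_xyz):
    """Recover (unit DOA, distance) from an absolute-position prediction."""
    pos = np.asarray(pos_xyz)
    dist = np.linalg.norm(pos)
    return pos / dist, dist
```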
POWER CUE ENHANCED NETWORK AND AUDIO-VISUAL FUSION FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2024 CHALLENGE
Xin Guan1, Yi Zhou1, Hongqing Liu1, Yin Cao2
1Chongqing University of Posts and Telecommunications, School of Communication and Information Engineering, Chongqing, China, 2Department of Intelligent Science, Xi'an Jiaotong-Liverpool University, China
Guan_CQUPT_task3a_1 Guan_CQUPT_task3a_2 Guan_CQUPT_task3a_3 Guan_CQUPT_task3a_4 Guan_CQUPT_task3b_1 Guan_CQUPT_task3b_2
Abstract
This technical report describes our submission systems for Task 3 of the DCASE2024 challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. To address the audio-only SELD task, we utilize a Resnet-Conformer as the main network. Additionally, we introduce a branch to receive power cue features, specifically log root mean square (log-rms). We employ various data augmentation techniques, including audio channel swapping (ACS), random cutout, time-frequency masking, frequency shifting, and AugMix, to enhance the model’s generalization. For the audio-visual SELD task, we also augment the visual modality in alignment with ACS. The audio and visual embeddings are sent to parallel Cross-Modal Attentive Fusion (CMAF) blocks before concatenation. We evaluate our approach on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.
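As an illustration of the power cue feature named above, framewise log root-mean-square energy can be computed as in the following sketch (our naming and framing convention, not the authors' implementation):

```python
import numpy as np

def log_rms(frames, eps=1e-8):
    """Framewise log-RMS energy.

    `frames` has shape (channels, n_frames, frame_len), i.e. a framed
    multichannel waveform; the result is one value per channel and frame."""
    rms = np.sqrt(np.mean(frames ** 2, axis=-1))
    return np.log(rms + eps)
```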
THE SYSTEM USING CONVNEXT, CONFORMER, AND DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION
Jiahao Li
Beijing Institute of Technology, China
Li_BIT_task3a_1
Abstract
This technical report details our submission system for DCASE2024 Task 3: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation. To address the audio-only task, we initially apply the Audio Channel Swapping (ACS) method to generate augmented data, enhancing the performance of the proposed system. Subsequently, we introduce the ConvNeXt module for feature extraction and processing. To further enhance feature extraction capabilities, we employ the Squeeze-and-Excitation Block (SEBlock) after ConvNeXt. We then utilize the Conformer to extract additional features and ultimately compute the multi-ACCDOA output. The proposed system significantly outperforms the baseline on the development dataset of DCASE2024 Task 3.
Data Augmentation and Cross-Fusion for Audiovisual Sound Event Localization and Detection with Source Distance Estimation
Yongbo Li, Chuan Wang, Qinghua Huang
Shanghai University, Shanghai, China
Li_SHU_task3b_1 Li_SHU_task3b_2
Abstract
This technical report describes a system participating in the DCASE2024 challenge Task 3: Sound Event Localization and Detection with Source Distance Estimation-Track B: Audio-Visual Reasoning. A system based on the official baseline system is developed and improved in terms of network architecture and data augmentation. The convolutional recurrent neural network (CRNN) is substituted by a ResNet-Conformer block pre-trained on an audio-only network. Audio Channel Swapping (ACS) is applied to the DCASE 2024 official audio dataset to generate more audio data. A simulated audio dataset is also created. Video Pixel Swapping (VPS) is performed on the original video data to obtain more video data. Experimental results show that our system outperforms the baseline method on the Sony-TAU Real Spatial Soundscape 2024 (STARSS24) development dataset. A series of experiments are implemented only on the First-Order Ambisonics (FOA) dataset.
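Audio channel swapping exploits the fact that FOA channels transform simply under rotations and reflections of the scene. The sketch below shows one member of the ACS family, a +90° azimuth rotation, assuming ACN channel order (W, Y, Z, X); reference labels, and for Track B the 360° video frames (video pixel swapping), must be transformed consistently.

```python
import numpy as np

def acs_rotate_90(foa):
    """Rotate an FOA recording's azimuth by +90 degrees via channel swaps.

    `foa` has shape (4, n_samples) in ACN order (W, Y, Z, X). For a source
    at azimuth phi: X' = -Y and Y' = X moves it to phi + 90 degrees."""
    w, y, z, x = foa
    return np.stack([w, x, z, -y])

def vps_rotate_90(frame):
    """Matching video pixel swap: roll an equirectangular frame a quarter
    turn horizontally (the sign depends on the panorama convention)."""
    return np.roll(frame, frame.shape[1] // 4, axis=1)
```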
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Archontis Politis1, Kazuki Shimada2, Parthasaarathy Sudarsanam1, Sharath Adavanne1, Daniel Krause1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Yuki Mitsufuji2, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan
AO_Baseline_FOA AO_Baseline_MIC
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge, and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6600531.
THE IASP SUBMISSION FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2024 CHALLENGE
Yuanhang Qian, Tianqin Zheng, Yichen Zeng, Gongping Huang
School of Electronic Information, Wuhan University, Wuhan, China
Qian_IASP_task3a_1
Abstract
This technical report describes the submission system developed for Task 3a of the DCASE2024 challenge: Audio Sound Event Localization and Detection with Source Distance Estimation. To enhance the performance of the audio-only task, we implement audio channel swapping as a data augmentation technique. We adopt the ResNet-Conformer model for the network architecture, which is well-suited for capturing patterns in First-Order Ambisonics (FOA) format data. Additionally, the approach utilizes the Multi-ACCDOA method to concurrently predict the event type and estimate the source distance. This comprehensive strategy yielded superior results compared to the baseline system.
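Multi-ACCDOA encodes, per output track and class, a 3-D vector whose norm acts as the activity score and whose direction is the DOA; for this challenge a distance prediction is attached for SDE. A minimal decode sketch of that general scheme (illustrative, not the exact baseline output layout):

```python
import numpy as np

def decode_accdoa(vec, dist, threshold=0.5):
    """Decode one track/class slot of a Multi-ACCDOA-style output frame."""
    activity = np.linalg.norm(vec)
    if activity < threshold:
        return None              # class considered inactive in this frame
    doa = vec / activity         # unit DOA vector
    return doa, dist
```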
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Kazuki Shimada2, Archontis Politis1, Parthasaarathy Sudarsanam1, Daniel Krause1, Kengo Uchida2, Sharath Adavanne1, Aapo Hakala1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Tuomas Virtanen1, Yuki Mitsufuji2
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan
AV_Baseline_FOA AV_Baseline_MIC
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.
JLESS SUBMISSION TO DCASE2024 TASK3: Conformer with Data Augmentation for Sound Event Localization and Detection with Source Distance Estimation
Wenqiang Sun1, Dongzhe Zhang1,2, Jisheng Bai1,2, Jianfeng Chen1,2
1Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 2 LianFeng Acoustic Technologies Co., Ltd. Xi’an, China
Sun_JLESS_task3a_1 Sun_JLESS_task3a_2
Abstract
In this technical report, we describe our proposed system for DCASE2024 Task 3: Sound Event Localization and Detection (SELD) with Source Distance Estimation in Real Spatial Sound Scenes. First, we review well-known deep learning methods for SELD. To augment our dataset, we employ channel rotation techniques. In addition to existing features, we introduce a novel feature: the sine value of the inter-channel phase difference. Finally, we validate the effectiveness of our approach on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset, and the results demonstrate that our method outperforms the baseline across multiple metrics.
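The sinIPD feature introduced above is straightforward to compute from the STFTs of two channels, as in this minimal sketch (our naming):

```python
import numpy as np

def sin_ipd(stft_ref, stft_other):
    """Sine of the inter-channel phase difference between two complex STFTs.

    Both inputs have shape (n_frames, n_bins); the output is real-valued
    in [-1, 1] and avoids the discontinuity of the raw phase difference."""
    return np.sin(np.angle(stft_other) - np.angle(stft_ref))
```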
RESNET-CONFORMER NETWORK WITH SHARED WEIGHTS AND ATTENTION MECHANISM FOR SOUND EVENT LOCALIZATION, DETECTION, AND DISTANCE ESTIMATION
Quoc Thinh Vo, David K. Han
Drexel University, College of Engineering Electrical and Computer Engineering Department, Philadelphia, USA
Vo_DU_task3a_1 Vo_DU_task3a_2 Vo_DU_task3a_3
Abstract
This technical report outlines our approach to Task 3A of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024, focusing on Sound Event Localization and Detection (SELD). SELD provides valuable insights by estimating sound event localization and detection, aiding in various machine cognition tasks such as environmental inference, navigation, and other sound localization-related applications. This year’s challenge evaluates models using either audio-only (Track A) or audiovisual (Track B) inputs on annotated recordings of real sound scenes. A notable change this year is the introduction of distance estimation, with evaluation metrics adjusted accordingly for a comprehensive assessment. Our submission is for Task A of the Challenge, which focuses on the audio-only track. Our approach utilizes log-mel spectrograms, intensity vectors, and employs multiple data augmentations. We proposed an EINV2-based network architecture, achieving improved results: an F-score of 40.2%, Angular Error (DOA) of 17.7°, and Relative Distance Error (RDE) of 0.32 on the test set of the Development Dataset.
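The FOA intensity vector used here (and by many systems above) points toward the dominant source in each time-frequency bin. A minimal sketch, assuming ACN channel order (W, Y, Z, X) and unit-norm normalization (normalization conventions vary between systems):

```python
import numpy as np

def foa_intensity_vector(stft_foa, eps=1e-8):
    """Normalized acoustic intensity vector from FOA STFTs.

    `stft_foa` is complex with shape (4, n_frames, n_bins); the result is a
    3-D vector per time-frequency bin, stacked along the first axis."""
    w, y, z, x = stft_foa
    intensity = np.real(np.conj(w)[None] * np.stack([x, y, z]))
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + eps
    return intensity / norm
```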
SQUEEZE-AND-EXCITE RESNET-CONFORMERS FOR SOUND EVENT LOCALIZATION, DETECTION, AND DISTANCE ESTIMATION FOR DCASE2024 CHALLENGE
Jun Wei Yeow1, Ee-Leng Tan1, Jisheng Bai2, Santi Peksi1, Woon-Seng Gan1
1Smart Nation TRANS Lab, Nanyang Technological University, Singapore, 2School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
Yeow_NTU_task3a_1 Yeow_NTU_task3a_2 Yeow_NTU_task3a_3 Yeow_NTU_task3a_4
Abstract
This technical report details our systems submitted for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). We address only the audio-only SELD with SDE (SELDDE) task in this report. We propose to improve the existing ResNet-Conformer architectures with Squeeze-and-Excitation blocks in order to introduce additional forms of channel- and spatial-wise attention. To improve SELD performance, we also utilize the Spatial Cue-Augmented Log-Spectrogram (SALSA) features over the commonly used log-mel spectral features for polyphonic SELD. We complement the existing Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset with the audio channel swapping technique and synthesize additional data using the SpatialScaper generator. We also perform distance scaling in order to prevent large distance errors from contributing more towards the loss function. Finally, we evaluate our approach on the evaluation subset of the STARSS23 dataset.
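Distance scaling in the loss mirrors the relative distance error metric, preventing far sources from dominating training. A minimal sketch of the idea (our names and clamp value, not the authors' exact recipe):

```python
import torch

def distance_scaled_l1(pred_dist, ref_dist):
    """L1 distance loss divided by the reference distance, per prediction."""
    # Clamp avoids division blow-ups for very close (near-zero) sources.
    return torch.mean(torch.abs(pred_dist - ref_dist) / ref_dist.clamp(min=0.1))
```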
DOA AND EVENT GUIDANCE SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION
Hogeon Yu
Hyundai Motor Company, Robotics Lab, South Korea
Yu_HYUNDAI_task3a_1 Yu_HYUNDAI_task3a_2 Yu_HYUNDAI_task3a_3 Yu_HYUNDAI_task3a_4
Abstract
This technical report describes the proposed system submitted to DCASE2024 Task 3: Sound Event Localization and Detection with Source Distance Estimation. There are two tracks, and we participate in the audio-only track. First, we adopt the CST block, a transformer-based network, to extract meaningful features for predicting the DOA and SED sub-tasks. Next, DOA and EVENT guidance attention blocks are introduced to boost the performance of a Multi-ACCDOA-based single-task system for the SELD tasks. We apply only one data augmentation method, a multi-channel simulation technique, to compensate for the sparsity of the training data provided by the challenge. Tested on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset, our proposed systems outperform the baseline system.
MULTI-SCALE FEATURE FUSION FOR SOUND EVENT LOCALIZATION AND DETECTION
Da Mu, Huamei Sun, Haobo Yue, Yuanyuan Jiang, Zehao Wang, Zhicheng Zhang, Jianqin Yin
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Zhang_BUPT_task3a_1
Abstract
This technical report describes our submission system for Task 3 of the DCASE2024 challenge: Sound Event Localization and Detection with Source Distance Estimation. Our experiments specifically focused on the first-order ambisonics (FOA) dataset. Building upon our previous work, we utilized a three-stage network structure known as the Multi-scale Feature Fusion (MFF) module. This module allowed us to efficiently extract multi-scale features across the spectral, spatial, and temporal domains. In this report, we introduce the implementation of the MFF module as the encoder and Conformer blocks as the decoder within a single-branch neural network named MFF-Conformer. This configuration enables us to generate Multi-ACCDOA labels as the output. Compared to the baseline system, our approach exhibits significant improvements in the F20° and DOAE metrics and demonstrates its effectiveness on the development dataset of DCASE Task 3.