Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into one of a known set of sound classes, and further localize the events in space while they are active.

The focus of the current SELD task is developing systems that can perform adequately on real sound scene recordings with a small amount of training data. There are two tracks: an audio-only track (Track A) for systems that use only microphone recordings to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally employ simultaneous 360° video recordings spatially aligned with the multichannel microphone recordings.

The task provides two datasets, development and evaluation, recorded in multiple rooms across two different sites. Of the two, only the development dataset provides reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their systems on the unseen evaluation dataset.

More details on the task setup and evaluation can be found in the task description page.
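
Systems in the tables below are compared on three metrics: the location-dependent F-score (20°/1), which accepts a detection only when its DOA error is at most 20° and its relative distance error at most 1, the DOA error in degrees, and the relative distance error. As a rough illustration of the two localization quantities, the following sketch (illustrative helper names, not the official evaluation code) computes them for a single matched prediction/reference pair:

```python
import numpy as np

def doa_error_deg(pred_xyz, ref_xyz):
    """Angular error in degrees between two Cartesian DOA vectors."""
    pred = np.asarray(pred_xyz, dtype=float)
    ref = np.asarray(ref_xyz, dtype=float)
    cos_sim = pred @ ref / (np.linalg.norm(pred) * np.linalg.norm(ref))
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

def relative_distance_error(pred_dist, ref_dist):
    """Absolute distance error normalized by the reference distance."""
    return abs(pred_dist - ref_dist) / ref_dist

# Example: a prediction 15 degrees off in azimuth, 10% off in distance
ref_doa = np.array([1.0, 0.0, 0.0])
pred_doa = np.array([np.cos(np.radians(15)), np.sin(np.radians(15)), 0.0])
print(doa_error_deg(pred_doa, ref_doa))    # -> ~15.0
print(relative_distance_error(2.2, 2.0))   # -> ~0.1
```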

Teams ranking

The SELD task received 47 submissions in total from 13 teams across the world. Of those, 29 submissions were for the audio-only Track A and 18 for the audiovisual Track B. Four teams participated in both tracks, seven participated only in Track A, and two participated only in Track B.

The following tables include only the best performing system per submitting team. Confidence intervals are also reported for each metric on the evaluation set results.
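
The intervals in parentheses accompany each evaluation-set metric. Setting the official tooling aside, such intervals are typically obtained by bootstrapping over evaluation clips; the sketch below is a generic percentile bootstrap, an illustration rather than the challenge's exact procedure:

```python
import numpy as np

def bootstrap_ci(per_clip_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-clip scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_clip_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example: interval for a hypothetical set of per-clip F-scores
print(bootstrap_ci([0.52, 0.61, 0.47, 0.55, 0.58, 0.49]))
```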

Track A: Audio-only

All metrics are computed on the evaluation dataset.

| Team rank | Submission name | Corresponding author | Affiliation | Technical report | F-score (20°/1) | DOA error (°) | Relative distance error |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3a_4 | Qing Wang | University of Science and Technology of China | Du_NERCSLIP_task3_report | 54.4 (48.9 - 59.2) | 13.6 (12.4 - 15.0) | 0.21 (0.18 - 0.23) |
| 2 | Yu_HYUNDAI_task3a_3 | Hogeon Yu | Hyundai Motor Company | Yu_HYUNDAI_task3a_report | 29.8 (25.1 - 34.2) | 19.8 (18.3 - 21.6) | 0.28 (0.25 - 0.32) |
| 3 | Yeow_NTU_task3a_2 | Jun Wei Yeow | Nanyang Technological University | Yeow_NTU_task3a_report | 26.2 (22.0 - 30.5) | 25.1 (23.2 - 27.6) | 0.26 (0.22 - 0.28) |
| 4 | Guan_CQUPT_task3a_4 | Xin Guan | Chongqing University of Posts and Telecommunications | Guan_CQUPT_task3_report | 26.7 (22.7 - 31.1) | 18.6 (17.4 - 21.8) | 0.36 (0.34 - 0.39) |
| 5 | Vo_DU_task3a_1 | Quoc Thinh Vo | Drexel University | Vo_DU_task3a_report | 24.7 (20.8 - 28.4) | 19.3 (17.7 - 21.3) | 0.34 (0.30 - 0.37) |
| 6 | Berg_LU_task3a_3 | Axel Berg | Lund University, Arm | Berg_LU_task3_report | 25.5 (21.8 - 29.6) | 23.2 (18.2 - 28.8) | 0.39 (0.34 - 0.44) |
| 7 | Sun_JLESS_task3a_1 | Wenqiang Sun | Northwestern Polytechnical University | Sun_JLESS_task3a_report | 28.5 (24.2 - 33.0) | 23.8 (21.5 - 25.9) | 0.51 (0.49 - 0.53) |
| 8 | Qian_IASP_task3a_1 | Yuanhang Qian | Wuhan University | Qian_IASP_task3a_report | 22.8 (18.6 - 26.8) | 27.2 (24.6 - 29.8) | 0.36 (0.31 - 0.42) |
| 9 | AO_Baseline_FOA | Parthasaarathy Sudarsanam | Tampere University | Politis_TAU_task3a_report | 18.0 (14.6 - 21.7) | 29.6 (24.6 - 33.3) | 0.31 (0.28 - 0.36) |
| 10 | Zhang_BUPT_task3a_1 | Zhicheng Zhang | Beijing University of Posts and Telecommunications | Zhang_BUPT_task3a_report | 19.0 (16.1 - 21.8) | 29.6 (26.6 - 32.9) | 0.40 (0.32 - 0.48) |
| 11 | Chen_ECUST_task3a_1 | Ning Chen | East China University of Science and Technology | Chen_ECUST_task3_report | 15.1 (12.2 - 17.9) | 28.3 (25.5 - 30.9) | 0.48 (0.39 - 0.59) |
| 12 | Li_BIT_task3a_1 | Jiahao Li | Beijing Institute of Technology | Li_BIT_task3a_report | 16.9 (13.4 - 20.5) | 33.5 (30.0 - 42.7) | 0.51 (0.26 - 1.25) |

Track B: Audiovisual

All metrics are computed on the evaluation dataset.

| Team rank | Submission name | Corresponding author | Affiliation | Technical report | F-score (20°/1) | DOA error (°) | Relative distance error |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3b_4 | Qing Wang | University of Science and Technology of China | Du_NERCSLIP_task3_report | 55.8 (51.2 - 60.4) | 11.4 (10.4 - 12.5) | 0.25 (0.22 - 0.29) |
| 2 | Berghi_SURREY_task3b_4 | Davide Berghi | University of Surrey | Berghi_SURREY_task3b_report | 39.2 (33.9 - 44.3) | 15.8 (14.2 - 17.4) | 0.29 (0.25 - 0.32) |
| 3 | Li_SHU_task3b_2 | Yongbo Li | Shanghai University | Li_SHU_task3b_report | 34.2 (29.9 - 38.4) | 21.5 (19.8 - 23.4) | 0.28 (0.25 - 0.31) |
| 4 | Guan_CQUPT_task3b_2 | Xin Guan | Chongqing University of Posts and Telecommunications | Guan_CQUPT_task3_report | 23.2 (19.2 - 27.2) | 18.8 (17.3 - 21.5) | 0.32 (0.28 - 0.37) |
| 5 | Berg_LU_task3b_3 | Axel Berg | Lund University, Arm | Berg_LU_task3_report | 25.9 (22.1 - 30.1) | 23.2 (18.2 - 28.8) | 0.33 (0.28 - 0.38) |
| 6 | Chen_ECUST_task3b_1 | Ning Chen | East China University of Science and Technology | Chen_ECUST_task3_report | 16.3 (13.7 - 19.3) | 25.1 (22.3 - 26.9) | 0.32 (0.27 - 0.39) |
| 7 | AV_Baseline_MIC | Parthasaarathy Sudarsanam | Tampere University | Shimada_SONY_task3b_report | 16.0 (12.1 - 20.0) | 35.9 (31.8 - 39.6) | 0.30 (0.27 - 0.33) |

Systems ranking

Performance of all the submitted systems on the evaluation and the development datasets. Confidence intervals are also reported for each metric on the evaluation set results.

Track A: Audio-only

| Rank | Submission name | Technical report | Eval. F-score (20°/1) | Eval. DOA error (°) | Eval. rel. dist. error | Dev. F-score (20°/1) | Dev. DOA error (°) | Dev. rel. dist. error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | 54.4 (48.9 - 59.2) | 13.6 (12.4 - 15.0) | 0.21 (0.18 - 0.23) | 59.7 | 12.4 | 0.21 |
| 2 | Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | 55.7 (50.8 - 60.0) | 13.7 (12.4 - 15.3) | 0.21 (0.19 - 0.23) | 61.0 | 12.3 | 0.21 |
| 3 | Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | 54.3 (48.9 - 59.0) | 13.6 (12.4 - 15.0) | 0.21 (0.19 - 0.23) | 59.7 | 12.4 | 0.22 |
| 4 | Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | 53.8 (47.9 - 58.9) | 14.2 (12.6 - 16.0) | 0.21 (0.18 - 0.24) | 58.8 | 12.4 | 0.21 |
| 5 | Yu_HYUNDAI_task3a_3 | Yu_HYUNDAI_task3a_report | 29.8 (25.1 - 34.2) | 19.8 (18.3 - 21.6) | 0.28 (0.25 - 0.32) | 34.7 | 18.8 | 0.28 |
| 6 | Yu_HYUNDAI_task3a_4 | Yu_HYUNDAI_task3a_report | 29.2 (24.4 - 33.6) | 19.7 (18.1 - 21.5) | 0.30 (0.27 - 0.34) | 35.0 | 19.0 | 0.29 |
| 7 | Yu_HYUNDAI_task3a_1 | Yu_HYUNDAI_task3a_report | 29.2 (24.5 - 33.5) | 19.8 (18.2 - 21.5) | 0.29 (0.25 - 0.33) | 33.9 | 19.5 | 0.28 |
| 8 | Yu_HYUNDAI_task3a_2 | Yu_HYUNDAI_task3a_report | 28.2 (23.5 - 32.6) | 20.1 (18.4 - 22.3) | 0.29 (0.24 - 0.32) | 33.4 | 19.2 | 0.28 |
| 9 | Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | 26.2 (22.0 - 30.5) | 25.1 (23.2 - 27.6) | 0.26 (0.22 - 0.28) | 33.8 | 21.4 | 0.30 |
| 10 | Guan_CQUPT_task3a_4 | Guan_CQUPT_task3_report | 26.7 (22.7 - 31.1) | 18.6 (17.4 - 21.8) | 0.36 (0.34 - 0.39) | | | |
| 11 | Vo_DU_task3a_1 | Vo_DU_task3a_report | 24.7 (20.8 - 28.4) | 19.3 (17.7 - 21.3) | 0.34 (0.30 - 0.37) | 39.7 | 17.4 | 0.33 |
| 12 | Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | 24.6 (20.2 - 29.4) | 25.9 (21.2 - 28.4) | 0.26 (0.19 - 0.29) | 32.7 | 22.9 | 0.30 |
| 13 | Vo_DU_task3a_2 | Vo_DU_task3a_report | 25.6 (21.4 - 29.5) | 20.1 (18.4 - 22.2) | 0.33 (0.29 - 0.36) | 39.9 | 17.5 | 0.32 |
| 14 | Guan_CQUPT_task3a_1 | Guan_CQUPT_task3_report | 21.9 (17.4 - 26.2) | 16.7 (15.5 - 18.9) | 0.31 (0.28 - 0.34) | 43.2 | 14.6 | 0.29 |
| 15 | Vo_DU_task3a_3 | Vo_DU_task3a_report | 24.6 (20.4 - 28.1) | 18.9 (17.4 - 20.5) | 0.34 (0.30 - 0.37) | 40.2 | 17.5 | 0.32 |
| 16 | Guan_CQUPT_task3a_3 | Guan_CQUPT_task3_report | 22.5 (18.2 - 26.7) | 16.7 (15.8 - 18.9) | 0.36 (0.33 - 0.42) | 44.1 | 13.7 | 0.30 |
| 17 | Berg_LU_task3a_3 | Berg_LU_task3_report | 25.5 (21.8 - 29.6) | 23.2 (18.2 - 28.8) | 0.39 (0.34 - 0.44) | 32.0 | 21.8 | 0.44 |
| 18 | Berg_LU_task3a_1 | Berg_LU_task3_report | 27.0 (23.3 - 31.2) | 26.1 (23.0 - 28.6) | 0.37 (0.34 - 0.44) | 29.0 | 23.9 | 0.38 |
| 19 | Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | 23.5 (19.3 - 27.9) | 27.2 (24.2 - 30.5) | 0.28 (0.25 - 0.33) | 33.9 | 20.4 | 0.30 |
| 20 | Sun_JLESS_task3a_1 | Sun_JLESS_task3a_report | 28.5 (24.2 - 33.0) | 23.8 (21.5 - 25.9) | 0.51 (0.49 - 0.53) | 29.2 | 20.7 | 0.47 |
| 21 | Guan_CQUPT_task3a_2 | Guan_CQUPT_task3_report | 21.6 (17.7 - 25.4) | 17.2 (15.1 - 20.2) | 0.40 (0.37 - 0.45) | 43.7 | 14.0 | 0.30 |
| 22 | Berg_LU_task3a_2 | Berg_LU_task3_report | 24.3 (20.4 - 28.3) | 21.5 (18.7 - 24.0) | 0.39 (0.31 - 0.50) | 28.7 | 20.8 | 0.38 |
| 23 | Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | 21.6 (17.8 - 25.6) | 27.3 (23.5 - 30.9) | 0.27 (0.23 - 0.30) | 32.7 | 20.6 | 0.31 |
| 24 | Berg_LU_task3a_4 | Berg_LU_task3_report | 23.5 (19.5 - 27.6) | 23.9 (18.2 - 31.1) | 0.43 (0.38 - 0.54) | 26.8 | 26.5 | 0.57 |
| 25 | Qian_IASP_task3a_1 | Qian_IASP_task3a_report | 22.8 (18.6 - 26.8) | 27.2 (24.6 - 29.8) | 0.36 (0.31 - 0.42) | 23.0 | 25.1 | 0.43 |
| 26 | AO_Baseline_FOA | Politis_TAU_task3a_report | 18.0 (14.6 - 21.7) | 29.6 (24.6 - 33.3) | 0.31 (0.28 - 0.36) | 13.1 | 36.9 | 0.33 |
| 27 | AO_Baseline_MIC | Politis_TAU_task3a_report | 16.3 (13.1 - 19.3) | 34.1 (30.7 - 37.4) | 0.30 (0.28 - 0.33) | 9.9 | 38.1 | 0.30 |
| 28 | Sun_JLESS_task3a_2 | Sun_JLESS_task3a_report | 21.9 (18.7 - 25.4) | 26.4 (24.9 - 28.1) | 0.51 (0.49 - 0.53) | 21.7 | 26.5 | 0.48 |
| 29 | Zhang_BUPT_task3a_1 | Zhang_BUPT_task3a_report | 19.0 (16.1 - 21.8) | 29.6 (26.6 - 32.9) | 0.40 (0.32 - 0.48) | 19.0 | 27.5 | 0.39 |
| 30 | Chen_ECUST_task3a_1 | Chen_ECUST_task3_report | 15.1 (12.2 - 17.9) | 28.3 (25.5 - 30.9) | 0.48 (0.39 - 0.59) | 19.2 | 22.9 | 0.32 |
| 31 | Li_BIT_task3a_1 | Li_BIT_task3a_report | 16.9 (13.4 - 20.5) | 33.5 (30.0 - 42.7) | 0.51 (0.26 - 1.25) | 33.9 | 21.1 | 0.30 |

Track B: Audiovisual

| Rank | Submission name | Technical report | Eval. F-score (20°/1) | Eval. DOA error (°) | Eval. rel. dist. error | Dev. F-score (20°/1) | Dev. DOA error (°) | Dev. rel. dist. error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | 55.8 (51.2 - 60.4) | 11.4 (10.4 - 12.5) | 0.25 (0.22 - 0.29) | 59.9 | 10.9 | 0.21 |
| 2 | Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | 55.6 (50.9 - 60.3) | 11.3 (10.3 - 12.4) | 0.25 (0.22 - 0.29) | 59.2 | 10.8 | 0.22 |
| 3 | Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | 53.5 (49.1 - 57.8) | 11.7 (10.5 - 12.9) | 0.27 (0.22 - 0.32) | 59.9 | 11.2 | 0.21 |
| 4 | Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | 52.6 (47.7 - 56.9) | 13.6 (12.4 - 15.2) | 0.29 (0.25 - 0.34) | 61.0 | 10.9 | 0.22 |
| 5 | Berghi_SURREY_task3b_4 | Berghi_SURREY_task3b_report | 39.2 (33.9 - 44.3) | 15.8 (14.2 - 17.4) | 0.29 (0.25 - 0.32) | 40.3 | 18.0 | 0.30 |
| 6 | Berghi_SURREY_task3b_2 | Berghi_SURREY_task3b_report | 36.5 (31.5 - 41.1) | 14.4 (13.0 - 15.8) | 0.29 (0.26 - 0.33) | 38.7 | 16.8 | 0.30 |
| 7 | Berghi_SURREY_task3b_1 | Berghi_SURREY_task3b_report | 39.5 (34.3 - 44.3) | 15.4 (13.9 - 16.9) | 0.31 (0.26 - 0.36) | 40.8 | 17.7 | 0.30 |
| 8 | Li_SHU_task3b_2 | Li_SHU_task3b_report | 34.2 (29.9 - 38.4) | 21.5 (19.8 - 23.4) | 0.28 (0.25 - 0.31) | 36.4 | 19.1 | 0.30 |
| 9 | Berghi_SURREY_task3b_3 | Berghi_SURREY_task3b_report | 30.0 (25.8 - 34.2) | 26.1 (19.4 - 29.8) | 0.29 (0.25 - 0.33) | 30.7 | 18.9 | 0.27 |
| 10 | Li_SHU_task3b_1 | Li_SHU_task3b_report | 31.9 (27.9 - 36.0) | 19.6 (18.1 - 21.2) | 0.33 (0.29 - 0.37) | 39.2 | 18.7 | 0.31 |
| 11 | Guan_CQUPT_task3b_2 | Guan_CQUPT_task3_report | 23.2 (19.2 - 27.2) | 18.8 (17.3 - 21.5) | 0.32 (0.28 - 0.37) | 46.7 | 14.2 | 0.28 |
| 12 | Guan_CQUPT_task3b_1 | Guan_CQUPT_task3_report | 22.2 (18.2 - 26.0) | 20.3 (18.4 - 23.9) | 0.30 (0.26 - 0.34) | 44.4 | 15.2 | 0.27 |
| 13 | Berg_LU_task3b_3 | Berg_LU_task3_report | 25.9 (22.1 - 30.1) | 23.2 (18.2 - 28.8) | 0.33 (0.28 - 0.38) | 33.4 | 21.8 | 0.28 |
| 14 | Berg_LU_task3b_2 | Berg_LU_task3_report | 24.3 (20.4 - 28.4) | 21.5 (18.7 - 24.0) | 0.34 (0.28 - 0.41) | 29.4 | 20.8 | 0.28 |
| 15 | Berg_LU_task3b_4 | Berg_LU_task3_report | 23.7 (19.7 - 27.8) | 23.9 (18.2 - 31.1) | 0.34 (0.26 - 0.40) | 29.0 | 26.5 | 0.28 |
| 16 | Chen_ECUST_task3b_1 | Chen_ECUST_task3_report | 16.3 (13.7 - 19.3) | 25.1 (22.3 - 26.9) | 0.32 (0.27 - 0.39) | 16.2 | 26.2 | 0.41 |
| 17 | AV_Baseline_MIC | Shimada_SONY_task3b_report | 16.0 (12.1 - 20.0) | 35.9 (31.8 - 39.6) | 0.30 (0.27 - 0.33) | 11.8 | 38.5 | 0.29 |
| 18 | Berg_LU_task3b_1 | Berg_LU_task3_report | 26.4 (22.9 - 30.4) | 26.1 (23.0 - 28.6) | 0.35 (0.30 - 0.44) | 29.8 | 23.9 | 0.28 |
| 19 | AV_Baseline_FOA | Shimada_SONY_task3b_report | 15.5 (12.9 - 18.6) | 34.6 (31.0 - 37.3) | 0.31 (0.27 - 0.35) | 11.3 | 38.4 | 0.36 |
| 20 | Chen_ECUST_task3b_2 | Chen_ECUST_task3_report | 14.1 (11.6 - 16.7) | 42.2 (26.1 - 90.5) | 0.39 (0.34 - 0.49) | 17.9 | 24.2 | 0.38 |

System characteristics

Track A: Audio-only

| Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3a_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 46878922 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
| 2 | Du_NERCSLIP_task3a_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, Conv-TasNet, Ensemble | 145105065 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
| 3 | Du_NERCSLIP_task3a_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 46803107 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
| 4 | Du_NERCSLIP_task3a_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 93682029 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, manifold mixup |
| 5 | Yu_HYUNDAI_task3a_3 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
| 6 | Yu_HYUNDAI_task3a_4 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
| 7 | Yu_HYUNDAI_task3a_1 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
| 8 | Yu_HYUNDAI_task3a_2 | Yu_HYUNDAI_task3a_report | CNN, MHSA, MHA | 6317996 | Ambisonic | mel spectra, intensity vector | multi-channel data simulation |
| 9 | Yeow_NTU_task3a_2 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
| 10 | Guan_CQUPT_task3a_4 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 14479876 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
| 11 | Vo_DU_task3a_1 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
| 12 | Yeow_NTU_task3a_3 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
| 13 | Vo_DU_task3a_2 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
| 14 | Guan_CQUPT_task3a_1 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 9318488 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
| 15 | Vo_DU_task3a_3 | Vo_DU_task3a_report | ResNet, Conformer | 40262940 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, audio channel swapping |
| 16 | Guan_CQUPT_task3a_3 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 9820632 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
| 17 | Berg_LU_task3a_3 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 1490000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
| 18 | Berg_LU_task3a_1 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 663000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
| 19 | Yeow_NTU_task3a_1 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
| 20 | Sun_JLESS_task3a_1 | Sun_JLESS_task3a_report | CNN, Conformer, Ensemble | 13107932 | Ambisonic | mel spectra, intensity vector, sinIPD | channel rotation |
| 21 | Guan_CQUPT_task3a_2 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble | 10322776 | Ambisonic | mel spectra, intensity vector, log-rms | cutout, specAugment, pitch shifting, augmix, audio channel swapping |
| 22 | Berg_LU_task3a_2 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 663000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
| 23 | Yeow_NTU_task3a_4 | Yeow_NTU_task3a_report | ResNet, Conformer, Squeeze-and-Excitation | 5383000 | Ambisonic | SALSA | mixup, frequency shifting, audio channel swapping |
| 24 | Berg_LU_task3a_4 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 1490000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
| 25 | Qian_IASP_task3a_1 | Qian_IASP_task3a_report | ResNet, Conformer, CNN | 64560 | Ambisonic | mel spectra, intensity vector | audio channel swapping |
| 26 | AO_Baseline_FOA | Politis_TAU_task3a_report | CRNN, MHSA | 742559 | Ambisonic | mel spectra, intensity vector | |
| 27 | AO_Baseline_MIC | Politis_TAU_task3a_report | CRNN, MHSA | 744287 | Microphone Array | mel spectra, GCC-PHAT | |
| 28 | Sun_JLESS_task3a_2 | Sun_JLESS_task3a_report | CNN, Conformer, Ensemble | 13107932 | Microphone Array | mel spectra, intensity vector, sinIPD | channel rotation |
| 29 | Zhang_BUPT_task3a_1 | Zhang_BUPT_task3a_report | CNN, Conformer | 7461404 | Ambisonic | mel spectra, intensity vector | |
| 30 | Chen_ECUST_task3a_1 | Chen_ECUST_task3_report | CRNN, MHSA | 740963 | Ambisonic | mel spectra, intensity vector, magnitude spectra | audio channel swapping |
| 31 | Li_BIT_task3a_1 | Li_BIT_task3a_report | Conformer, ConvNeXt | 3714972 | Ambisonic | mel spectra, intensity vector | audio channel swapping |



Track B: Audiovisual

| Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Du_NERCSLIP_task3b_4 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 93537081 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
| 2 | Du_NERCSLIP_task3b_3 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 81851917 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
| 3 | Du_NERCSLIP_task3b_2 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 58488271 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
| 4 | Du_NERCSLIP_task3b_1 | Du_NERCSLIP_task3_report | ResNet, Conformer, Ensemble | 70166753 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping, manifold mixup |
| 5 | Berghi_SURREY_task3b_4 | Berghi_SURREY_task3b_report | CNN, Conformer, ViT, MHST | 446613716 | Ambisonic | mel spectra, intensity vector, direct-reverberant components | audio-visual channel swapping |
| 6 | Berghi_SURREY_task3b_2 | Berghi_SURREY_task3b_report | CNN, Conformer | 85483420 | Ambisonic | mel spectra, intensity vector | audio-visual channel swapping |
| 7 | Berghi_SURREY_task3b_1 | Berghi_SURREY_task3b_report | CNN, Conformer | 85483420 | Ambisonic | mel spectra, intensity vector | audio-visual channel swapping |
| 8 | Li_SHU_task3b_2 | Li_SHU_task3b_report | ResNet-50, ResNet, Conformer, Transformer | 9995660 | Ambisonic | mel spectra, intensity vector | audio channel swapping, multi-channel data simulation, video pixel swapping |
| 9 | Berghi_SURREY_task3b_3 | Berghi_SURREY_task3b_report | CNN, Conformer, ViT, MHST | 275646876 | Ambisonic | mel spectra, intensity vector, direct-reverberant components | audio-visual channel swapping |
| 10 | Li_SHU_task3b_1 | Li_SHU_task3b_report | ResNet-50, ResNet, Conformer, Transformer | 9995660 | Ambisonic | mel spectra, intensity vector | audio channel swapping, video pixel swapping, multi-channel data simulation |
| 11 | Guan_CQUPT_task3b_2 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble, MHSA, MHCA | 13401544 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping, audio-visual channel swapping |
| 12 | Guan_CQUPT_task3b_1 | Guan_CQUPT_task3_report | CNN, Conformer, Ensemble, MHSA, MHCA | 13401544 | Ambisonic | mel spectra, intensity vector | cutout, specAugment, pitch shifting, augmix, audio channel swapping, audio-visual channel swapping |
| 13 | Berg_LU_task3b_3 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21900000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
| 14 | Berg_LU_task3b_2 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21000000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
| 15 | Berg_LU_task3b_4 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21900000 | Microphone Array | MFCC, NGCC-PHAT | audio channel swapping |
| 16 | Chen_ECUST_task3b_1 | Chen_ECUST_task3_report | CRNN, MHSA | 743428 | Ambisonic | mel spectra, GCC-PHAT, magnitude spectra | audio channel swapping, video pixel swapping |
| 17 | AV_Baseline_MIC | Shimada_SONY_task3b_report | CRNN | 2728671 | Microphone Array | magnitude spectra, IPD | |
| 18 | Berg_LU_task3b_1 | Berg_LU_task3_report | CST-Former, MHSA, Transformer | 21000000 | Microphone Array | mel spectra, NGCC-PHAT | audio channel swapping |
| 19 | AV_Baseline_FOA | Shimada_SONY_task3b_report | CRNN | 2726943 | Ambisonic | magnitude spectra, IPD | |
| 20 | Chen_ECUST_task3b_2 | Chen_ECUST_task3_report | CRNN, MHSA | 745963 | Ambisonic | mel spectra, GCC-PHAT, magnitude spectra | audio channel swapping, video pixel swapping |



Technical reports

THE LU SYSTEM FOR DCASE 2024 SOUND EVENT LOCALIZATION AND DETECTION CHALLENGE

Axel Berg1,2, Johanna Engman1, Jens Gulin1,3, Karl Åström1, Magnus Oskarsson1
1Computer Vision and Machine Learning, Centre for Mathematical Sciences, Lund University, Sweden, 2Arm, Lund, Sweden, 3Sony Europe B.V., Lund, Sweden

Abstract

This technical report gives an overview of our submission to task 3 of the DCASE 2024 challenge. We present a sound event localization and detection (SELD) system using input features based on trainable neural generalized cross-correlations with phase transform (NGCC-PHAT). With these features together with spectrograms as input to a Transformer-based network, we achieve significant improvements over the baseline method. In addition, we also present an audio-visual version of our system, where distance predictions are updated using depth maps from the panorama video frames.

PDF
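
The NGCC-PHAT features above are a trainable extension of the classic generalized cross-correlation with phase transform. For reference, here is a minimal sketch of the conventional (non-learned) GCC-PHAT that they generalize, computed for one frame of a two-microphone pair; NGCC-PHAT replaces the fixed phase-transform weighting with learned filters:

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """Classic GCC-PHAT cross-correlation for one frame of a microphone pair."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12        # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n=n_fft)
    return np.fft.fftshift(cc)            # zero lag at the center

# Example: a 5-sample delay shows up as a peak offset from the center
rng = np.random.default_rng(0)
sig = rng.standard_normal(1024)
delayed = np.roll(sig, 5)
print(np.argmax(gcc_phat(delayed, sig)) - 512)   # -> 5
```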

LEVERAGING REVERBERATION AND VISUAL DEPTH CUES FOR SOUND EVENT LOCALIZATION AND DETECTION WITH DISTANCE ESTIMATION

Davide Berghi, Philip J. B. Jackson
CVSSP, University of Surrey, Guildford, UK

Abstract

This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively. This model outperformed the audio-visual baseline on the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 by more than 3x. Our second system performs a temporal ensemble from the outputs of the AV-Conformer. We then extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While the new system improved the RDE of our previous model by about 3 percentage points, it achieved a lower F1 score, possibly because the more complex system fails to detect sound classes that rarely appear in the training set, as analysis can determine. To overcome this problem, our fourth and final system consists of an ensemble strategy combining the predictions of the other three. Many opportunities to refine the system and training strategy remain to be tested in future ablation experiments, and would likely yield incremental performance gains for this audio-visual task.

PDF
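
The temporal ensemble mentioned above is not specified in detail here; one plausible reading is averaging framewise predictions obtained from time-shifted copies of the input, as sketched below (the shift values and the model interface are assumptions, not taken from the report):

```python
import numpy as np

def temporal_ensemble(model, feats, shifts=(0, 10, 20)):
    """Average framewise SELD outputs over time-shifted copies of the input.

    `model` maps a (frames, feat_dim) array to (frames, out_dim) predictions;
    the frame shifts are illustrative values only.
    """
    outs = []
    for s in shifts:
        pred = model(np.roll(feats, shift=s, axis=0))
        outs.append(np.roll(pred, shift=-s, axis=0))  # realign to original frames
    return np.mean(outs, axis=0)

# Example with a stand-in "model" (identity mapping)
feats = np.arange(12.0).reshape(6, 2)
print(temporal_ensemble(lambda x: x, feats))
```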

FEATURE FUSION BASED ON CROSS-FEATURE TRANSFORMER FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION

Jishen Tao, Ning Chen
East China University of Science and Technology School of Information Science and Engineering, Shanghai, China

Abstract

Since the audio of many sound events contains rich high-frequency components, the log-mel spectrogram, which heavily compresses the high-frequency range, cannot fully represent the essential features of a sound event. In this paper, the Log-Mel Spectrogram + Intensity Vector (LMSIV) and Magnitude Spectrogram (MS) features are fused to solve this problem. First, a Cross-Feature Transformer (CFT) is applied to each feature, letting it reinforce itself by directly attending to the latent relevance revealed in the other feature, so that the features are fused while their interaction is taken into account. Then a Self-Attention Transformer (SAT) is applied to the concatenation of the obtained embeddings to further prioritize the contextual information in it. The experimental results show that our proposed system outperforms the baseline system on the development dataset of DCASE 2024 Task 3.

PDF
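
A compact sketch of how an LMSIV-style feature can be assembled from FOA audio is given below, assuming ACN channel order (W, Y, Z, X) and librosa's mel filterbank; the exact parameters and layout in the report may differ:

```python
import numpy as np
import librosa

def lmsiv(foa, sr=24000, n_fft=512, hop=300, n_mels=64):
    """Log-mel spectrograms + mel-projected intensity vector from FOA audio.

    `foa` is a (4, samples) float array assumed to be in ACN order
    (W, Y, Z, X); channel order and STFT parameters are assumptions.
    """
    stft = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in foa])
    W, Y, Z, X = stft
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    logmel = np.log(mel_fb @ (np.abs(stft) ** 2) + 1e-8)    # (4, mels, frames)
    iv = np.real(np.conj(W) * np.stack([X, Y, Z]))          # acoustic intensity
    iv /= np.linalg.norm(iv, axis=0, keepdims=True) + 1e-8  # unit norm per bin
    iv_mel = mel_fb @ iv                                    # (3, mels, frames)
    return np.concatenate([logmel, iv_mel], axis=0)         # (7, mels, frames)
```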

THE NERC-SLIP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION OF DCASE 2024 CHALLENGE

Qing Wang1, Yuxuan Dong1, Hengyi Hong2, Ruoyu Wei3, Maocheng Hu4, Shi Cheng1, Ya Jiang1, Mingqi Cai3, Xin Fang3, Jun Du1
1University of Science and Technology of China, Hefei, China, 2Harbin Engineering University, Harbin, China, 3iFLYTEK, Hefei, China, 4National Intelligent Voice Innovation Center, Hefei, China

Abstract

This technical report presents our submission system for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). In addition to direction of arrival estimation (DOAE) of the sound source, this challenge also requires predicting the source distance. We attempted three methods to enable the system to predict both the DOA and the distance of the sound source. First, we proposed two multi-task learning frameworks. One introduces an extra branch to the original SELD model with a multi-task learning framework, resulting in a three-branch output to simultaneously predict the DOA and distance of the sound source. The other integrates the sound source distance into the DOA prediction, estimating the absolute position of the sound source. Second, we trained two models for DOAE and SDE respectively, and then used a joint prediction method based on the outputs of the two models. For the audiovisual SELD task with SDE, we used a ResNet-50 model pretrained on ImageNet as the visual feature extractor. Additionally, we simulated audio-visual data and used a teacher-student learning method to train our multi-modal system. We evaluated our methods on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.

PDF

POWER CUE ENHANCED NETWORK AND AUDIO-VISUAL FUSION FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2024 CHALLENGE

Xin Guan1, Yi Zhou1, Hongqing Liu1, Yin Cao2
1Chongqing University of Posts and Telecommunications, School of Communication and Information Engineering, Chongqing, China, 2Department of Intelligent Science, Xi’an Jiaotong-Liverpool University, China

Abstract

This technical report describes our submission systems for Task 3 of the DCASE2024 challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. To address the audio-only SELD task, we utilize a ResNet-Conformer as the main network. Additionally, we introduce a branch to receive power cue features, specifically log root mean square (log-rms). We employ various data augmentation techniques, including audio channel swapping (ACS), random cutout, time-frequency masking, frequency shifting, and AugMix, to enhance the model’s generalization. For the audio-visual SELD task, we also augment the visual modality in alignment with ACS. The audio and visual embeddings are sent to parallel Cross-Modal Attentive Fusion (CMAF) blocks before concatenation. We evaluate our approach on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.

PDF
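
The "power cue" branch above receives log root-mean-square (log-rms) energy. A minimal framewise sketch (frame and hop lengths are illustrative, not the report's):

```python
import numpy as np

def log_rms(x, frame_len=512, hop=300):
    """Framewise log root-mean-square energy of a mono signal."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    rms = np.sqrt(np.mean(x[idx] ** 2, axis=1))
    return np.log(rms + 1e-8)

# Example on unit-variance noise: values hover around log(1) = 0
rng = np.random.default_rng(0)
print(log_rms(rng.standard_normal(24000))[:5])
```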

THE SYSTEM USING CONVNEXT, CONFORMER, AND DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION

Jiahao Li
Beijing Institute of Technology, China

Abstract

This technical report details our submission system for DCASE2024 Task 3: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation. To address the audio-only task, we initially apply the Audio Channel Swapping (ACS) method to generate augmented data, enhancing the performance of the proposed system. Subsequently, we introduce the ConvNeXt module for feature extraction and processing. To further enhance feature extraction capabilities, we employ the Squeeze-and-Excitation Block (SEBlock) after ConvNeXt. We then utilize the Conformer to extract additional features and ultimately compute the multi-ACCDOA output. The proposed system significantly outperforms the baseline on the development dataset of DCASE2024 Task 3.

PDF
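
Audio channel swapping (ACS), used above, generates new spatial scenes by permuting and sign-flipping first-order Ambisonics channels, which corresponds to azimuth rotations and reflections of the sound field, with labels transformed to match. A sketch of one such transform, assuming ACN channel order (W, Y, Z, X):

```python
import numpy as np

def acs_rotate_90(foa, azimuths_deg):
    """Rotate an FOA scene by +90 degrees in azimuth via channel swapping.

    `foa` is (4, samples) assumed in ACN order (W, Y, Z, X). For the
    rotation phi -> phi + 90 the first-order components map as
    Y' = X and X' = -Y, while W and Z are unchanged; labels rotate too.
    """
    W, Y, Z, X = foa
    foa_rot = np.stack([W, X, Z, -Y])
    az = (np.asarray(azimuths_deg, dtype=float) + 90.0 + 180.0) % 360.0 - 180.0
    return foa_rot, az

# Example: a source labeled at azimuth 30 degrees moves to 120 degrees
foa = np.zeros((4, 8)); foa[0] = 1.0
_, az = acs_rotate_90(foa, [30.0])
print(az)   # -> [120.]
```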

Data Augmentation and Cross-Fusion for Audiovisual Sound Event Localization and Detection with Source Distance Estimation

Yongbo Li, Chuan Wang, Qinghua Huang
Shanghai University, Shanghai, China

Abstract

This technical report describes a system participating in the DCASE2024 challenge Task 3: Sound Event Localization and Detection with Source Distance Estimation - Track B: Audio-Visual Reasoning. A system based on the official baseline system is developed and improved in terms of network architecture and data augmentation. The convolutional recurrent neural network (CRNN) is substituted by a ResNet-Conformer block pre-trained on an audio-only network. Audio Channel Swapping (ACS) is applied to the DCASE 2024 official audio dataset to generate more audio data. A simulated audio dataset is also created. Video Pixel Swapping (VPS) is performed on the original video data to obtain more video data. Experimental results show that our system outperforms the baseline method on the Sony-TAU Real Spatial Soundscape 2024 (STARSS24) development dataset. A series of experiments are implemented only on the First-Order Ambisonics (FOA) dataset.

PDF
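
For equirectangular 360° frames, the azimuth rotation applied by ACS on the audio side corresponds to a horizontal circular shift of pixels, which is presumably what video pixel swapping (VPS) exploits; a sketch, with the sign convention treated as an assumption:

```python
import numpy as np

def vps_rotate(frame, rot_deg):
    """Circularly shift an equirectangular frame to match an azimuth rotation.

    `frame` is (height, width, 3); the shift direction must match the
    audio-side rotation convention, so treat the sign as an assumption.
    """
    width = frame.shape[1]
    shift = int(round(rot_deg / 360.0 * width))
    return np.roll(frame, shift=shift, axis=1)

# Example: a 90-degree rotation shifts a 360-pixel-wide frame by 90 pixels
frame = np.zeros((4, 360, 3), dtype=np.uint8)
print(vps_rotate(frame, 90.0).shape)   # -> (4, 360, 3)
```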

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Archontis Politis1, Kazuki Shimada2, Parthasaarathy Sudarsanam1, Sharath Adavanne1, Daniel Krause1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Yuki Mitsufuji2, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan

Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge, and it introduces significant new challenges with respect to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset, with emphasis on its differences from the baseline of the previous challenge. Baseline results indicate that, with a suitable training strategy, a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6600531.

PDF

THE IASP SUBMISSION FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2024 CHALLENGE

Yuanhang Qian, Tianqin Zheng, Yichen Zeng, Gongping Huang
School of Electronic Information, Wuhan University, Wuhan, China

Abstract

This technical report describes the submission systems developed for task 3a of the DCASE2024 challenge: Audio Sound Event Localization and Detection with Source Distance Estimation. To enhance the performance of the audio-only task, we implement audio channel swapping as a data augmentation technique. We adopt the ResNet-Conformer model for the network architecture, which is well-suited for capturing First-Order Ambisonics (FOA) format data patterns. Additionally, the approach utilizes the Multi-ACCDOA method to concurrently predict the event type and estimate the source distance. This comprehensive strategy yielded superior results compared to the baseline system.

PDF
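
Multi-ACCDOA, used above, encodes per track, class, and frame a Cartesian vector whose norm is the event activity and whose direction is the DOA; the distance-extended variant used this year adds a distance output. A decoding sketch (the 0.5 threshold and output layout are illustrative):

```python
import numpy as np

def decode_accdoa(vec_xyz, dist, threshold=0.5):
    """Decode one track/class/frame of distance-extended Multi-ACCDOA output.

    The norm of `vec_xyz` is the event activity and its direction the DOA;
    `dist` is the predicted distance. The 0.5 threshold is illustrative.
    """
    vec = np.asarray(vec_xyz, dtype=float)
    activity = np.linalg.norm(vec)
    if activity < threshold:
        return None                          # event considered inactive
    doa = vec / activity                     # unit DOA vector
    azi = np.degrees(np.arctan2(doa[1], doa[0]))
    ele = np.degrees(np.arcsin(np.clip(doa[2], -1.0, 1.0)))
    return azi, ele, float(dist)

# Example: an active event on the left at 1.5 m
print(decode_accdoa([0.0, 0.9, 0.0], 1.5))   # -> (90.0, 0.0, 1.5)
```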

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada2, Archontis Politis1, Parthasaarathy Sudarsanam1, Daniel Krause1, Kengo Uchida2, Sharath Adavanne1, Aapo Hakala1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Tuomas Virtanen1, Yuki Mitsufuji2
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan

Abstract

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.

PDF

JLESS SUBMISSION TO DCASE2024 TASK3: Conformer with Data Augmentation for Sound Event Localization and Detection with Source Distance Estimation

Wenqiang Sun1, Dongzhe Zhang1,2, Jisheng Bai1,2, Jianfeng Chen1,2
1Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 2LianFeng Acoustic Technologies Co., Ltd., Xi’an, China

Abstract

In this technical report, we describe our proposed system for DCASE2024 Task 3: Sound Event Localization and Detection (SELD) with Source Distance Estimation in Real Spatial Sound Scenes. We first review well-known deep learning methods for SELD. To augment our dataset, we employ channel rotation techniques. In addition to existing features, we introduce a novel feature: the sine value of the inter-channel phase difference. Finally, we validate the effectiveness of our approach on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset, and the results demonstrate that our method outperforms the baseline across multiple metrics.

PDF
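
The sine of the inter-channel phase difference introduced above can be read directly off a pair of STFTs; a minimal sketch:

```python
import numpy as np

def sin_ipd(stft_ref, stft_other):
    """Sine of the inter-channel phase difference between two STFTs."""
    return np.sin(np.angle(stft_other * np.conj(stft_ref)))

# Example: a quarter-cycle phase offset gives sin(IPD) = 1 everywhere
ref = np.ones((3, 4), dtype=complex)
other = ref * np.exp(1j * np.pi / 2)
print(sin_ipd(ref, other))   # -> all ones
```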

RESNET-CONFORMER NETWORK WITH SHARED WEIGHTS AND ATTENTION MECHANISM FOR SOUND EVENT LOCALIZATION, DETECTION, AND DISTANCE ESTIMATION

Quoc Thinh Vo, David K. Han
Drexel University, College of Engineering, Electrical and Computer Engineering Department, Philadelphia, USA

Abstract

This technical report outlines our approach to Task 3A of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge, focusing on Sound Event Localization and Detection (SELD). SELD provides valuable insights by estimating sound event localization and detection, aiding in various machine cognition tasks such as environmental inference, navigation, and other sound localization-related applications. This year’s challenge evaluates models using either audio-only (Track A) or audiovisual (Track B) inputs on annotated recordings of real sound scenes. A notable change this year is the introduction of distance estimation, with evaluation metrics adjusted accordingly for a comprehensive assessment. Our submission is for Track A of the challenge, the audio-only track. Our approach utilizes log-mel spectrograms and intensity vectors, and employs multiple data augmentations. We propose an EINV2-based network architecture, achieving improved results: an F-score of 40.2%, Angular Error (DOA) of 17.7°, and Relative Distance Error (RDE) of 0.32 on the test set of the Development Dataset.

PDF

SQUEEZE-AND-EXCITE RESNET-CONFORMERS FOR SOUND EVENT LOCALIZATION, DETECTION, AND DISTANCE ESTIMATION FOR DCASE2024 CHALLENGE

Jun Wei Yeow1, Ee-Leng Tan1, Jisheng Bai2, Santi Peksi1, Woon-Seng Gan1
1Smart Nation TRANS Lab, Nanyang Technological University, Singapore, 2School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China

Abstract

This technical report details our systems submitted for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). We address only the audio-only SELD with SDE (SELDDE) task in this report. We propose to improve the existing ResNet-Conformer architectures with Squeeze-and-Excitation blocks in order to introduce additional forms of channel- and spatial-wise attention. In order to improve SELD performance, we also utilize the Spatial Cue-Augmented Log-Spectrogram (SALSA) features over the commonly used log-mel spectra features for polyphonic SELD. We complement the existing Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset with the audio channel swapping technique and synthesize additional data using the SpatialScaper generator. We also perform distance scaling in order to prevent large distance errors from contributing more towards the loss function. Finally, we evaluate our approach on the evaluation subset of the STARSS23 dataset.

PDF
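
The distance scaling mentioned above is described only briefly here; one plausible formulation is normalizing the distance error by the reference distance, mirroring the relative distance error metric, as sketched below (an assumption, not necessarily the report's exact loss):

```python
import numpy as np

def scaled_distance_loss(pred_dist, ref_dist, eps=1e-3):
    """Mean absolute distance error scaled by the reference distance.

    A plausible reading of the report's distance scaling, mirroring the
    relative distance error metric; not necessarily the exact loss used.
    """
    pred = np.asarray(pred_dist, dtype=float)
    ref = np.asarray(ref_dist, dtype=float)
    return float(np.mean(np.abs(pred - ref) / (ref + eps)))

# Example: a 0.5 m error counts for less at 5 m than at 1 m
print(scaled_distance_loss([5.5, 1.5], [5.0, 1.0]))   # -> ~0.3
```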

DOA AND EVENT GUIDANCE SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION WITH SOURCE DISTANCE ESTIMATION

Hogeon Yu
Hyundai Motor Company, Robotics Lab, South Korea

Abstract

This technical report describes the proposed system submitted to DCASE2024 Task 3: Sound Event Localization and Detection with Source Distance Estimation. There are two tracks, and we participate in the audio-only track. First, we adopt the CST block, a transformer-based network, to extract meaningful features for predicting the DOA and SED sub-tasks. Next, DOA and EVENT guidance attention blocks are introduced to boost the performance of a Multi-ACCDOA-based single-task system for the SELD tasks. We apply only one data augmentation method, a multi-channel simulation technique, to compensate for the sparsity of the training data provided by the challenge. Tested on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset, our proposed systems outperform the baseline system.

PDF

MULTI-SCALE FEATURE FUSION FOR SOUND EVENT LOCALIZATION AND DETECTION

Da Mu, Huamei Sun, Haobo Yue, Yuanyuan Jiang, Zehao Wang, Zhicheng Zhang, Jianqin Yin
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China

Abstract

This technical report describes our submission system for task 3 of the DCASE2024 challenge: Sound Event Localization and Detection with Source Distance Estimation. Our experiments specifically focused on analyzing the first-order ambisonics (FOA) dataset. Building upon our previous work, we utilized a three-stage network structure known as the Multi-scale Feature Fusion (MFF) module. This module allowed us to efficiently extract multi-scale features across the spectral, spatial, and temporal domains. In this report, we introduce the implementation of the MFF module as the encoder and Conformer blocks as the decoder within a single-branch neural network named MFF-Conformer. This configuration enables us to generate Multi-ACCDOA labels as the output. Compared to the baseline system, our approach exhibits significant improvements in the F20° and DOAE metrics and demonstrates its effectiveness on the development dataset of DCASE Task 3.

PDF