Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify the type of each event from a known set of sound classes, and further localize the events in space while they are active.

The focus of the current SELD task is on developing systems that can perform adequately on real sound scene recordings, with a small amount of training data. There are two tracks: an audio-only track (Track A) for systems using only microphone recordings to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally employ simultaneous 360° video recordings spatially aligned with the multichannel microphone recordings.

The task provides two datasets, development and evaluation, recorded in multiple rooms across two different sites. Of the two, only the development dataset provides reference labels. The participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their systems on the unseen evaluation dataset.

More details on the task setup and evaluation can be found on the task description page.

Teams ranking

The SELD task received 44 submissions in total from 11 teams across the world. All teams participated in Track A, while 4 of these teams also participated in Track B. The following tables include only the best-performing system per submitting team. Confidence intervals are also reported for each metric on the evaluation set results.
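
For reference, the detection metrics in the tables below are location-dependent: a prediction counts towards the error rate and F-score only if its class matches the reference and its direction of arrival (DOA) lies within 20° of the reference DOA. The sketch below is a simplified illustration of that gating for a single prediction/reference pair, not the official evaluation code; function names and the Cartesian DOA representation are chosen here for illustration.

```python
import numpy as np

def angular_distance_deg(doa_pred, doa_ref):
    """Great-circle angle in degrees between two Cartesian DOA vectors."""
    u = np.asarray(doa_pred, dtype=float)
    v = np.asarray(doa_ref, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Clip guards against values marginally outside [-1, 1] due to rounding.
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def is_true_positive(class_pred, class_ref, doa_pred, doa_ref, threshold_deg=20.0):
    """Location-dependent detection: class must match and the DOA error must be within the threshold."""
    return class_pred == class_ref and angular_distance_deg(doa_pred, doa_ref) <= threshold_deg

# Example: a prediction 15 degrees off in azimuth still counts as a true positive.
pred_doa = [np.cos(np.radians(15)), np.sin(np.radians(15)), 0.0]
print(is_true_positive(3, 3, pred_doa, [1.0, 0.0, 0.0]))  # True
```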

Track A: Audio-only

Columns: Submission name | Corresponding author | Affiliation | Technical report | Team rank | Error rate (20°) | F-score (20°) | Localization error (°) | Localization recall (metrics computed on the evaluation dataset)
Du_NERCSLIP_task3a_1 Jun Du National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China Du_NERCSLIP_task3_report 1 0.33 (0.29 - 0.37) 62.7 (57.9 - 68.5) 12.9 (11.8 - 14.2) 72.1 (66.8 - 77.5)
Liu_CQUPT_task3a_2 Xue Lihua Chongqing University of Posts and Telecommunications Liu_CQUPT_task3a_report 2 0.35 (0.31 - 0.40) 58.5 (52.8 - 64.0) 13.5 (12.2 - 15.0) 65.7 (60.2 - 71.1)
Yang_IACAS_task3a_2 Jun Yang Institute of Acoustics, Chinese Academy of Sciences Yang_IACAS_task3a_report 3 0.35 (0.31 - 0.39) 54.5 (50.2 - 59.1) 15.8 (14.5 - 17.3) 66.7 (61.6 - 72.2)
Kang_KT_task3a_2 Sang-Ick Kang KT Corporation Kang_KT_task3_report 4 0.40 (0.36 - 0.45) 51.4 (46.6 - 56.4) 15.0 (13.8 - 16.6) 63.8 (58.3 - 69.7)
Kim_KU_task3a_4 Gwantae Kim Korea University Kim_KU_task3_report 5 0.45 (0.40 - 0.50) 49.0 (44.6 - 53.9) 15.0 (13.3 - 17.8) 62.5 (57.3 - 67.6)
Bai_JLESS_task3a_3 Jisheng Bai Northwestern Polytechnical University, Xi’an, China Bai_JLESS_task3a_report 6 0.44 (0.39 - 0.49) 51.0 (45.6 - 56.4) 14.2 (12.9 - 15.6) 57.7 (51.6 - 63.4)
Wu_NKU_task3a_2 Shichao Wu Nankai University Wu_NKU_task3a_report 7 0.48 (0.43 - 0.53) 45.0 (40.2 - 49.1) 18.6 (16.7 - 20.6) 59.1 (54.5 - 63.3)
YShul_KAIST_task3a_1 Yusun Shul Korea Advanced Institute of Science and Technology YShul_KAIST_task3a_report 8 0.49 (0.44 - 0.54) 39.6 (34.8 - 44.7) 17.8 (15.9 - 20.1) 51.6 (46.3 - 56.9)
Kumar_SRIB_task3a_1 Amit Kumar Samsung Research Institute Bangalore Kumar_SRIB_task3a_report 9 0.56 (0.51 - 0.61) 33.1 (28.4 - 37.8) 19.8 (18.3 - 21.5) 52.1 (46.1 - 57.8)
AO_Baseline_FOA Archontis Politis Tampere University Politis_TAU_task3a_report 10 0.55 (0.51 - 0.60) 29.4 (24.5 - 33.9) 20.5 (18.1 - 22.7) 48.0 (43.5 - 52.9)
Ma_XJU_task3a_1 Mengzhen Ma Xinjiang University Ma_XJU_task3a_report 11 0.64 (0.59 - 0.68) 22.1 (18.6 - 25.7) 39.4 (36.9 - 42.0) 44.0 (39.5 - 48.4)
Wu_CVSSP_task3a_1 Peipei Wu Center for Vision, Speech and Signal Processing, University of Surrey, UK Wu_CVSSP_task3a_report 12 1.68 (1.60 - 1.75) 0.1 (0.0 - 0.2) 114.7 (76.9 - 127.9) 6.7 (3.8 - 9.4)

Track B: Audiovisual

Columns: Submission name | Corresponding author | Affiliation | Technical report | Team rank | Error rate (20°) | F-score (20°) | Localization error (°) | Localization recall (metrics computed on the evaluation dataset)
Du_NERCSLIP_task3b_1 Jun Du National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China Du_NERCSLIP_task3_report 1 0.31 (0.27 - 0.34) 63.6 (58.9 - 69.5) 11.3 (10.2 - 12.5) 72.0 (66.8 - 77.5)
Kang_KT_task3b_1 Sang-Ick Kang KT Corporation Kang_KT_task3_report 2 0.41 (0.37 - 0.46) 48.6 (43.5 - 53.8) 15.5 (14.3 - 16.8) 62.1 (56.3 - 67.4)
Kim_KU_task3b_1 Gwantae Kim Korea University Kim_KU_task3_report 3 0.47 (0.43 - 0.51) 40.9 (35.9 - 45.2) 19.6 (17.8 - 21.5) 53.5 (47.9 - 58.6)
Liu_CQUPT_task3b_1 Wang Yi Chongqing University of Posts and Telecommunications Liu_CQUPT_task3b_report 4 1.05 (0.97 - 1.13) 12.7 (10.5 - 14.9) 47.8 (10.6 - 58.8) 33.0 (27.7 - 39.0)
AV_Baseline_FOA Kazuki Shimada SONY Politis_TAU_task3_report 5 1.10 (1.00 - 1.19) 11.1 (8.8 - 13.6) 47.2 (42.0 - 54.9) 35.2 (30.1 - 41.1)

Systems ranking

Performance of all submitted systems on the evaluation and development datasets. Confidence intervals are also reported for each metric on the evaluation set results.

Track A: Audio-only

Columns: Submission name | Technical report | Submission rank | Evaluation dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall | Development dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3_report 1 0.33 (0.29 - 0.37) 62.7 (57.9 - 68.5) 12.9 (11.8 - 14.2) 72.1 (66.8 - 77.5)
Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3_report 2 0.36 (0.32 - 0.40) 59.3 (54.5 - 64.8) 13.7 (12.6 - 15.0) 70.2 (65.1 - 75.5)
Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3_report 3 0.38 (0.34 - 0.43) 58.4 (53.8 - 63.5) 14.0 (12.8 - 15.5) 69.9 (64.6 - 75.6)
Liu_CQUPT_task3a_2 Liu_CQUPT_task3a_report 4 0.35 (0.31 - 0.40) 58.5 (52.8 - 64.0) 13.5 (12.2 - 15.0) 65.7 (60.2 - 71.1) 0.44 54.2 13.9 67.9
Liu_CQUPT_task3a_4 Liu_CQUPT_task3a_report 5 0.35 (0.31 - 0.40) 58.5 (52.7 - 63.9) 13.5 (12.2 - 15.0) 65.7 (60.2 - 71.0) 0.41 56.4 13.7 67.8
Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3_report 6 0.37 (0.33 - 0.41) 55.4 (50.9 - 60.0) 14.2 (13.0 - 15.6) 66.6 (61.4 - 71.9) 0.38 66.0 12.8 75.0
Yang_IACAS_task3a_2 Yang_IACAS_task3a_report 7 0.35 (0.31 - 0.39) 54.5 (50.2 - 59.1) 15.8 (14.5 - 17.3) 66.7 (61.6 - 72.2)
Liu_CQUPT_task3a_3 Liu_CQUPT_task3a_report 8 0.37 (0.33 - 0.42) 56.9 (50.7 - 62.6) 13.6 (12.3 - 15.1) 64.4 (58.7 - 69.7) 0.42 55.7 13.9 67.7
Yang_IACAS_task3a_3 Yang_IACAS_task3a_report 9 0.36 (0.32 - 0.40) 53.3 (48.5 - 57.8) 16.3 (14.7 - 18.2) 65.7 (60.9 - 70.7)
Yang_IACAS_task3a_1 Yang_IACAS_task3a_report 10 0.35 (0.31 - 0.40) 52.8 (48.9 - 57.1) 16.2 (15.0 - 17.7) 64.5 (59.3 - 69.5) 0.48 47.3 16.1 62.6
Liu_CQUPT_task3a_1 Liu_CQUPT_task3a_report 11 0.40 (0.36 - 0.45) 53.5 (48.0 - 59.1) 14.4 (13.0 - 16.0) 62.4 (56.9 - 67.7) 0.43 54.8 14.7 68.0
Kang_KT_task3a_2 Kang_KT_task3_report 12 0.40 (0.36 - 0.45) 51.4 (46.6 - 56.4) 15.0 (13.8 - 16.6) 63.8 (58.3 - 69.7) 0.43 55.8 15.9 71.5
Kang_KT_task3a_1 Kang_KT_task3_report 13 0.40 (0.36 - 0.45) 51.0 (46.0 - 56.0) 15.1 (14.0 - 16.5) 62.3 (56.5 - 67.8) 0.43 56.9 15.3 70.9
Kang_KT_task3a_3 Kang_KT_task3_report 14 0.41 (0.36 - 0.45) 50.6 (45.4 - 55.8) 15.4 (14.1 - 17.1) 63.1 (57.7 - 69.0) 0.42 57.5 15.8 72.7
Kang_KT_task3a_4 Kang_KT_task3_report 15 0.41 (0.36 - 0.45) 50.8 (45.6 - 55.9) 15.4 (14.1 - 17.1) 62.6 (57.2 - 68.4) 0.43 56.4 15.8 70.4
Yang_IACAS_task3a_4 Yang_IACAS_task3a_report 16 0.38 (0.34 - 0.43) 47.6 (42.6 - 51.7) 17.5 (16.1 - 19.3) 64.0 (58.8 - 68.9)
Kim_KU_task3a_4 Kim_KU_task3_report 17 0.45 (0.40 - 0.50) 49.0 (44.6 - 53.9) 15.0 (13.3 - 17.8) 62.5 (57.3 - 67.6) 0.47 51.7 15.2 70.2
Kim_KU_task3a_1 Kim_KU_task3_report 18 0.44 (0.40 - 0.49) 49.6 (44.7 - 54.7) 14.6 (13.4 - 16.3) 61.2 (55.8 - 66.6) 0.47 52.7 15.2 68.8
Kim_KU_task3a_3 Kim_KU_task3_report 19 0.45 (0.40 - 0.50) 49.1 (44.6 - 54.0) 15.3 (-19.8 - 27.1) 61.7 (56.6 - 67.0) 0.47 52.9 15.0 69.3
Bai_JLESS_task3a_3 Bai_JLESS_task3a_report 20 0.44 (0.39 - 0.49) 51.0 (45.6 - 56.4) 14.2 (12.9 - 15.6) 57.7 (51.6 - 63.4) 0.46 52.0 14.0 59.5
Kim_KU_task3a_2 Kim_KU_task3_report 21 0.45 (0.41 - 0.50) 48.3 (43.8 - 53.3) 15.2 (13.6 - 17.8) 62.3 (57.2 - 67.4) 0.49 51.1 15.5 69.7
Bai_JLESS_task3a_4 Bai_JLESS_task3a_report 22 0.45 (0.41 - 0.50) 46.4 (41.9 - 51.4) 15.1 (13.7 - 16.6) 58.7 (53.0 - 63.9) 0.48 49.2 15.1 61.6
Bai_JLESS_task3a_2 Bai_JLESS_task3a_report 23 0.46 (0.41 - 0.50) 49.4 (44.2 - 54.9) 14.6 (13.3 - 16.0) 55.4 (49.2 - 61.0) 0.47 49.6 15.6 58.9
Bai_JLESS_task3a_1 Bai_JLESS_task3a_report 24 0.46 (0.41 - 0.52) 45.2 (40.5 - 50.0) 15.4 (14.0 - 16.9) 58.7 (52.6 - 64.3) 0.47 51.1 14.6 60.9
Wu_NKU_task3a_2 Wu_NKU_task3a_report 25 0.48 (0.43 - 0.53) 45.0 (40.2 - 49.1) 18.6 (16.7 - 20.6) 59.1 (54.5 - 63.3) 0.54 41.1 22.3 62.3
Wu_NKU_task3a_1 Wu_NKU_task3a_report 26 0.49 (0.44 - 0.53) 38.8 (33.9 - 43.0) 20.4 (18.8 - 22.1) 53.9 (48.9 - 58.0) 0.60 38.0 23.0 60.4
YShul_KAIST_task3a_1 YShul_KAIST_task3a_report 27 0.49 (0.44 - 0.54) 39.6 (34.8 - 44.7) 17.8 (15.9 - 20.1) 51.6 (46.3 - 56.9) 0.49 42.7 16.7 55.2
YShul_KAIST_task3a_2 YShul_KAIST_task3a_report 28 0.51 (0.46 - 0.56) 37.8 (33.0 - 42.9) 18.2 (16.2 - 20.2) 50.7 (45.9 - 55.2)
YShul_KAIST_task3a_3 YShul_KAIST_task3a_report 29 0.50 (0.46 - 0.55) 36.3 (31.6 - 41.1) 22.5 (4.8 - 34.0) 51.3 (47.1 - 55.4) 0.52 41.2 17.7 54.1
Wu_NKU_task3a_3 Wu_NKU_task3a_report 30 0.53 (0.47 - 0.59) 36.0 (31.0 - 40.6) 20.8 (17.4 - 23.7) 51.9 (46.3 - 57.0) 0.54 40.4 19.3 58.4
Kumar_SRIB_task3a_1 Kumar_SRIB_task3a_report 31 0.56 (0.51 - 0.61) 33.1 (28.4 - 37.8) 19.8 (18.3 - 21.5) 52.1 (46.1 - 57.8) 0.39 56.0 20.3 63.0
AO_Baseline_FOA Politis_TAU_task3a_report 32 0.55 (0.51 - 0.60) 29.4 (24.5 - 33.9) 20.5 (18.1 - 22.7) 48.0 (43.5 - 52.9) 0.57 29.9 22.0 47.7
AO_Baseline_MIC Politis_TAU_task3a_report 33 0.55 (0.51 - 0.59) 30.4 (26.1 - 34.9) 22.5 (19.8 - 24.6) 47.9 (42.6 - 53.3) 0.62 27.8 27.0 44.3
Ma_XJU_task3a_1 Ma_XJU_task3a_report 34 0.64 (0.59 - 0.68) 22.1 (18.6 - 25.7) 39.4 (36.9 - 42.0) 44.0 (39.5 - 48.4) 0.69 36.4 25.3 63.3
Wu_CVSSP_task3a_1 Wu_CVSSP_task3a_report 35 1.68 (1.60 - 1.75) 0.1 (-0.0 - 0.2) 114.7 (76.9 - 127.9) 6.7 (3.8 - 9.4) 0.71 21.0 29.3 46.0

Track B: Audiovisual

Columns: Submission name | Technical report | Submission rank | Evaluation dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall | Development dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3_report 1 0.31 (0.27 - 0.34) 63.6 (58.9 - 69.5) 11.3 (10.2 - 12.5) 72.0 (66.8 - 77.5)
Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3_report 2 0.32 (0.28 - 0.35) 60.5 (56.2 - 65.5) 11.6 (10.5 - 12.7) 70.1 (64.6 - 75.7)
Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3_report 3 0.33 (0.29 - 0.37) 60.9 (56.2 - 66.4) 11.7 (10.6 - 13.0) 70.2 (65.1 - 75.5)
Du_NERCSLIP_task3b_4 Du_NERCSLIP_task3_report 4 0.33 (0.29 - 0.37) 58.7 (54.3 - 63.9) 12.6 (11.7 - 13.7) 71.5 (66.1 - 76.6)
Kang_KT_task3b_1 Kang_KT_task3_report 5 0.41 (0.37 - 0.46) 48.6 (43.5 - 53.8) 15.5 (14.3 - 16.8) 62.1 (56.3 - 67.4) 0.43 54.5 15.6 65.8
Kang_KT_task3b_2 Kang_KT_task3_report 6 0.41 (0.37 - 0.46) 48.4 (43.2 - 53.6) 15.9 (14.5 - 17.5) 62.0 (56.2 - 67.1) 0.44 54.1 15.6 66.5
Kim_KU_task3b_1 Kim_KU_task3_report 7 0.47 (0.43 - 0.51) 40.9 (35.9 - 45.2) 19.6 (17.8 - 21.5) 53.5 (47.9 - 58.6) 0.52 45.1 17.8 59.9
Liu_CQUPT_task3b_1 Liu_CQUPT_task3b_report 8 1.05 (0.97 - 1.13) 12.7 (10.5 - 14.9) 47.8 (10.6 - 58.8) 33.0 (27.7 - 39.0) 0.94 17.0 44.1 42.0
Liu_CQUPT_task3b_2 Liu_CQUPT_task3b_report 9 1.04 (0.96 - 1.11) 11.9 (9.9 - 14.1) 49.5 (12.6 - 60.8) 32.4 (27.2 - 38.4) 0.97 15.9 45.0 41.7
AV_Baseline_FOA Shimada_SONY_task3b_report 10 1.10 (1.00 - 1.19) 11.1 (8.8 - 13.6) 47.2 (42.0 - 54.9) 35.2 (30.1 - 41.1) 1.07 14.3 48.0 35.5
Liu_CQUPT_task3b_3 Liu_CQUPT_task3b_report 11 1.09 (1.00 - 1.18) 11.4 (9.3 - 13.2) 57.8 (52.7 - 63.0) 33.7 (28.3 - 40.1) 0.99 17.8 42.0 40.0
Liu_CQUPT_task3b_4 Liu_CQUPT_task3b_report 12 1.11 (1.02 - 1.19) 11.1 (9.1 - 12.8) 59.6 (54.2 - 65.1) 34.9 (29.5 - 41.1) 1.00 17.4 42.3 42.1
AV_Baseline_MIC Shimada_SONY_task3b_report 13 1.20 (1.09 - 1.30) 9.9 (7.9 - 12.0) 58.5 (53.2 - 62.6) 32.3 (27.7 - 37.9) 1.08 9.8 62.0 29.2

System characteristics

Track A: Audio-only

Columns: Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation
1 Du_NERCSLIP_task3a_1 Du_NERCSLIP_task3_report Resnet, Conformer, ensemble 34937272 Ambisonic log mel spectra, intensity vector audio channel swapping, multichannel data simulation
2 Du_NERCSLIP_task3a_2 Du_NERCSLIP_task3_report Resnet-Conformer, Conv-TasNet, ensemble 63177333 Ambisonic log mel spectra, intensity vector, log power spectrum audio channel swapping, multi-channel data simulation, single-channel data simulation
3 Du_NERCSLIP_task3a_3 Du_NERCSLIP_task3_report Resnet-Conformer, Conv-TasNet, ensemble 63177333 Ambisonic log mel spectra, intensity vector, log power spectrum audio channel swapping, multi-channel data simulation, single-channel data simulation
4 Liu_CQUPT_task3a_2 Liu_CQUPT_task3a_report CNN, Conformer, Ensemble 33300000 Ambisonic log-mel spectra, intensity vector audio channel swapping, frequency shifting, specAugment, random cutout, augmix, simulation
5 Liu_CQUPT_task3a_4 Liu_CQUPT_task3a_report CNN, Conformer, Ensemble 33300000 Ambisonic log-mel spectra, intensity vector audio channel swapping, frequency shifting, specAugment, random cutout, augmix, simulation
6 Du_NERCSLIP_task3a_4 Du_NERCSLIP_task3_report Resnet-Conformer, Conv-TasNet, ensemble 63177333 Ambisonic log mel spectra, intensity vector, log power spectrum audio channel swapping, multi-channel data simulation, single-channel data simulation
7 Yang_IACAS_task3a_2 Yang_IACAS_task3a_report EINV2, Conformer, CNN 85288432 Ambisonic mel spectra, intensity vector mixup, specAugment, rotation, random crop, frequency shifting
8 Liu_CQUPT_task3a_3 Liu_CQUPT_task3a_report CNN, Conformer, Ensemble 19700000 Ambisonic log-mel spectra, intensity vector audio channel swapping, frequency shifting, specAugment, random cutout, augmix, simulation
9 Yang_IACAS_task3a_3 Yang_IACAS_task3a_report EINV2, Conformer, CNN 85288432 Ambisonic mel spectra, intensity vector mixup, specAugment, rotation, random crop, frequency shifting
10 Yang_IACAS_task3a_1 Yang_IACAS_task3a_report EINV2, Conformer, CNN 85288432 Ambisonic mel spectra, intensity vector mixup, specAugment, rotation, random crop, frequency shifting
11 Liu_CQUPT_task3a_1 Liu_CQUPT_task3a_report CNN, Conformer 6400000 Ambisonic log-mel spectra, intensity vector audio channel swapping, frequency shifting, specAugment, random cutout, augmix, simulation
12 Kang_KT_task3a_2 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 202900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation
13 Kang_KT_task3a_1 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 148900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation
14 Kang_KT_task3a_3 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 202900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation
15 Kang_KT_task3a_4 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 202900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation
16 Yang_IACAS_task3a_4 Yang_IACAS_task3a_report EINV2, Conformer, CNN 85288432 Ambisonic mel spectra, intensity vector mixup, specAugment, rotation, random crop, frequency shifting
17 Kim_KU_task3a_4 Kim_KU_task3_report CNN, RNN, MHSA, transfer learning, ensemble 138250318 Ambisonic log-mel spectra, intensity vector specmix, audio rotation
18 Kim_KU_task3a_1 Kim_KU_task3_report CNN, RNN, MHSA, transfer learning, ensemble 69125159 Ambisonic log-mel spectra, intensity vector specmix, audio rotation
19 Kim_KU_task3a_3 Kim_KU_task3_report CNN, RNN, MHSA, transfer learning, ensemble 138250318 Ambisonic log-mel spectra, intensity vector specmix, audio rotation
20 Bai_JLESS_task3a_3 Bai_JLESS_task3a_report CNN, Conformer, ensemble 104800000 Ambisonic mel spectra, intensity vector FMix, random cutout, channel rotation, data generation
21 Kim_KU_task3a_2 Kim_KU_task3_report CNN, RNN, MHSA, transfer learning, ensemble 69125159 Ambisonic log-mel spectra, intensity vector specmix, audio rotation
22 Bai_JLESS_task3a_4 Bai_JLESS_task3a_report CNN, Conformer, ensemble 26200000 Ambisonic mel spectra, intensity vector FMix, random cutout, channel rotation, data generation
23 Bai_JLESS_task3a_2 Bai_JLESS_task3a_report CNN, Conformer, ensemble 52400000 Ambisonic mel spectra, intensity vector FMix, random cutout, channel rotation, data generation
24 Bai_JLESS_task3a_1 Bai_JLESS_task3a_report CNN, Conformer, ensemble 52400000 Ambisonic mel spectra, intensity vector FMix, random cutout, channel rotation, data generation
25 Wu_NKU_task3a_2 Wu_NKU_task3a_report EINV2 85288432 Ambisonic mel spectra, intensity vector HAAC
26 Wu_NKU_task3a_1 Wu_NKU_task3a_report EINV2 85288432 Ambisonic mel spectra, intensity vector HAAC
27 YShul_KAIST_task3a_1 YShul_KAIST_task3a_report CNN, MHSA 2396839 Ambisonic mel spectra, intensity vector time masking, frequency shifting, channel swap, moderate mixup
28 YShul_KAIST_task3a_2 YShul_KAIST_task3a_report CNN, MHSA 2396839 Ambisonic mel spectra, intensity vector time masking, frequency shifting, channel swap, moderate mixup
29 YShul_KAIST_task3a_3 YShul_KAIST_task3a_report CNN, MHSA 2396839 Ambisonic mel spectra, intensity vector time masking, frequency shifting, channel swap, moderate mixup
30 Wu_NKU_task3a_3 Wu_NKU_task3a_report EINV2 85288432 Ambisonic mel spectra, intensity vector, VQT HAAC
31 Kumar_SRIB_task3a_1 Kumar_SRIB_task3a_report CNN, Conformer 3158389 Ambisonic mel spectra, intensity vector Audio Channel Swapping
32 AO_Baseline_FOA Politis_TAU_task3a_report CRNN, MHSA 737528 Ambisonic mel spectra, intensity vector
33 AO_Baseline_MIC Politis_TAU_task3a_report CRNN, MHSA 737528 Microphone Array mel spectra, GCC-PHAT
34 Ma_XJU_task3a_1 Ma_XJU_task3a_report CRNN, MHSA 27760000 Ambisonic mel spectra, intensity vector
35 Wu_CVSSP_task3a_1 Wu_CVSSP_task3a_report CNN, Transformer 848000 Ambisonic mel spectra, phase vector



Track B: Audiovisual

Columns: Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation
1 Du_NERCSLIP_task3b_1 Du_NERCSLIP_task3_report Resnet-Conformer, ensemble 34937272 Ambisonic log mel spectra, intensity vector audio channel swapping, multi-channel data simulation
2 Du_NERCSLIP_task3b_3 Du_NERCSLIP_task3_report Resnet-Conformer, ensemble 34917658 Ambisonic log mel spectra, intensity vector audio channel swapping, multi-channel data simulation, video pixel swapping
3 Du_NERCSLIP_task3b_2 Du_NERCSLIP_task3_report Resnet-Conformer, Conv-TasNet, ensemble 63177333 Ambisonic log mel spectra, intensity vector, log power spectrum audio channel swapping, multi-channel data simulation, single-channel data simulation
4 Du_NERCSLIP_task3b_4 Du_NERCSLIP_task3_report Resnet-Conformer, ensemble 23302059 Ambisonic log mel spectra, intensity vector audio channel swapping, multi-channel data simulation, video pixel swapping
5 Kang_KT_task3b_1 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 273900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation, video mosaic augmentation
6 Kang_KT_task3b_2 Kang_KT_task3_report CNN, Conformer, GRU, ensemble 398900000 Ambisonic mel spectra, intensity vector audio channel swapping, SpecAugment, cutout, multichannel data generation, video mosaic augmentation
7 Kim_KU_task3b_1 Kim_KU_task3_report CNN, RNN, MHSA, transfer learning 7642858 Ambisonic log-mel spectra, intensity vector specmix, audio rotation
8 Liu_CQUPT_task3b_1 Liu_CQUPT_task3b_report CRNN 2044899 Ambisonic magnitude spectra, IPD
9 Liu_CQUPT_task3b_2 Liu_CQUPT_task3b_report CRNN 2044899 Ambisonic magnitude spectra, IPD
10 AV_Baseline_FOA Shimada_SONY_task3b_report CRNN 763701 Ambisonic magnitude spectra, IPD
11 Liu_CQUPT_task3b_3 Liu_CQUPT_task3b_report CRNN 2044899 Ambisonic magnitude spectra, IPD pitch shifting
12 Liu_CQUPT_task3b_4 Liu_CQUPT_task3b_report CRNN 2044899 Ambisonic magnitude spectra, IPD pitch shifting
13 AV_Baseline_MIC Shimada_SONY_task3b_report CRNN 763701 Microphone Array magnitude spectra, IPD



Technical reports

JLESS SUBMISSION TO DCASE2023 TASK3: CONFORMER WITH DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION IN REAL SPACE

Dongzhe Zhang1,2, Jisheng Bai1,2, Siwei Huang1, Mou Wang1, Jianfeng Chen1,2
1Key Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 2LianFeng Acoustic Technologies Co., Ltd., Xi’an, China

Abstract

In this technical report, we describe our proposed system for DCASE2023 Task 3: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. First, we review well-known deep learning methods for SELD. Then we apply various data augmentation methods to balance the sound event classes in the dataset, and generate more spatial audio files to augment the training data. Finally, we use different strategies in the training stage to improve the generalization of the system in realistic environments. The results show that the proposed systems outperform the baseline system on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.


THE NERC-SLIP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2023 CHALLENGE

Qing Wang1, Ya Jiang1, Shi Cheng1, Maocheng Hu2, Zhaoxu Nian1, Pengfei Hu1, Zeyan Liu1, Yuxuan Dong1, Mingqi Cai3, Jun Du1, Chin-Hui Lee4
1University of Science and Technology of China, Hefei, China, 2National Intelligent Voice Innovation Center, Hefei, China, 3iFLYTEK, Hefei, China, 4Georgia Institute of Technology, Atlanta, USA

Abstract

This technical report details our submission system for Task 3 of the DCASE2023 Challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. To address the audio-only SELD task, we apply the audio channel swapping (ACS) technique to generate augmented data, upon which a ResNet-Conformer architecture is employed as the acoustic model. Additionally, we introduce a class-dependent sound separation (SS) model to tackle overlapping mixtures and extract features from the SS model as prompts to perform SELD for a specific event class. For the audio-visual SELD task, we leverage object detection and human body key point detection algorithms to identify potential sound events and extract Gaussian-like vectors, which are subsequently concatenated with acoustic features as the input. Moreover, we propose a video data augmentation method based on the ACS method for audio data. Finally, we present a post-processing strategy to enhance the results of audio-only SELD models with the location information predicted from video data. We evaluate our approach on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.
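
As an illustration of the audio channel swapping (ACS) idea mentioned in the abstract, the sketch below applies one label-preserving spatial transformation commonly associated with ACS for first-order Ambisonics: a 90° azimuth rotation realized by swapping and negating the X/Y channels, with the azimuth labels shifted accordingly. The ACN channel ordering (W, Y, Z, X) and this particular transformation are assumptions made for the example, not details taken from the report.

```python
import numpy as np

def rotate_foa_90deg(foa, azimuths_deg):
    """
    Rotate a FOA recording (assumed ACN channel order: W, Y, Z, X) by +90 degrees in azimuth.

    foa: array of shape (4, num_samples)
    azimuths_deg: per-event azimuth labels in degrees
    Returns the transformed signals and the shifted labels.
    """
    w, y, z, x = foa
    # az -> az + 90: the new Y channel carries the old X pattern, the new X channel the negated old Y.
    rotated = np.stack([w, x, z, -y])
    new_azimuths = (np.asarray(azimuths_deg) + 90.0 + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
    return rotated, new_azimuths

# Example: a dummy 4-channel clip with one event originally at azimuth 30 degrees.
dummy = np.random.randn(4, 24000)
_, new_az = rotate_foa_90deg(dummy, [30.0])
print(new_az)  # [120.]
```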


THE DISTILLATION SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2023 CHALLENGE

Sang-Ick Kang, Kyongil Cho, Myungchul Keum, Yeonseok Park
KT Corporation, South Korea

Abstract

This report describes our systems submitted to the DCASE2023 challenge task 3: Sound Event Localization and Detection (SELD) with audio-only data and audio-visual data. The audio-visual data consist of multi-channel audio data for sound events and 360-degree video data. To address the sparsity of the training data, we conducted various augmentations on both audio and video. We employ the proven ResNet-Conformer-based SELD architecture, trained on the augmented data. To effectively improve the performance of the audio network, we applied the knowledge distillation technique by training both a teacher model and a student model. In addition, we fused the SELD model with the YOLOv7 object detection model in the audio-visual network. Finally, post-processing involves an ensemble method for both the audio-only and audio-visual tracks. The experimental results demonstrate that the proposed deep learning models trained on the STARSS23 dataset significantly outperform the DCASE challenge baseline.
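
Knowledge distillation as described above is typically implemented by training the student against a weighted combination of a ground-truth loss and a loss towards the teacher's outputs. The snippet below is a generic sketch of such a combined loss for regression-style SELD outputs (e.g. ACCDOA vectors); the mean-squared-error terms and the weighting are assumptions for illustration, not the team's exact recipe.

```python
import numpy as np

def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """
    Combine a ground-truth MSE term with a teacher-matching MSE term.

    student_out, teacher_out, target: arrays of identical shape,
    e.g. (batch, frames, 3 * num_classes) for ACCDOA-style outputs.
    alpha weights the ground-truth term; (1 - alpha) weights the teacher term.
    """
    hard = np.mean((student_out - target) ** 2)       # supervised loss against labels
    soft = np.mean((student_out - teacher_out) ** 2)  # match the teacher's predictions
    return alpha * hard + (1.0 - alpha) * soft

# Toy example with random arrays standing in for model outputs and targets.
s, t, y = (np.random.randn(2, 10, 39) for _ in range(3))
print(distillation_loss(s, t, y))
```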


DATA AUGMENTATION, NEURAL NETWORKS, AND ENSEMBLE METHODS FOR SOUND EVENT LOCALIZATION AND DETECTION

Gwantae Kim, Hanseok Ko
Korea University Department of Electrical Engineering Seoul, South Korea

Abstract

This technical report describes the system participating in the DCASE 2023 Task 3 challenge: Sound event localization and detection evaluated in real spatial sound scenes. The system comprises data augmentation strategies, neural network models, and ensemble methods. For Track A, we adopt rotation and Specmix data augmentation strategies to increase the number of data samples and improve robustness. The neural network model, which is based on the baseline networks, consists of residual convolutional neural networks with spatial attention, recurrent neural networks, and multi-head self-attention. Moreover, we propose several ensemble methods, such as windowing, weight averaging, and clustering-based output selection. For Track B, we extend the audio-only baseline model to an audio-visual model with 3D convolution layers using raw video, optical flow, and object detection features. Through a series of relevant experiments, the proposed methods achieve competitive results compared to the baseline and state-of-the-art methods.
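
Among the ensemble methods listed, weight averaging is the simplest to illustrate: parameters from several checkpoints of the same architecture are averaged element-wise before inference. The sketch below assumes checkpoints stored as dictionaries of NumPy arrays; it is a generic illustration of the idea, not the authors' implementation.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """
    Element-wise average of model parameters across checkpoints.

    checkpoints: list of dicts mapping parameter names to arrays of identical shapes.
    Returns a new dict holding the averaged parameters.
    """
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# Example with two toy "checkpoints" of a single weight matrix.
ckpt_a = {"conv1.weight": np.ones((3, 3))}
ckpt_b = {"conv1.weight": 3 * np.ones((3, 3))}
print(average_checkpoints([ckpt_a, ckpt_b])["conv1.weight"][0, 0])  # 2.0
```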


A FRAMEWORK FOR SELD USING CONFORMER AND MULTI-ACCDOA STRATEGIES

Priyanshu Kumar, Amit Kumar, Shwetank Choudhary, Jiban Prakash, Sumit Kumar
Samsung Research Institute Bangalore, India

Abstract

This technical report describes our submission system for task 3A of the DCASE 2023 challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes, which uses only audio data, in contrast to task 3B, which leverages audio-visual input. We build our models based on the official baseline system and improve them in terms of model architecture and data augmentation. Since recent works in deep learning have experimented with replacing traditional recurrent neural networks with Transformer-based architectures, we replace the gated recurrent unit layers with Conformer blocks. To obtain more training data, we apply audio channel swapping (ACS) augmentation on the DCASE 2023 official dataset. These changes lead to an improved SELD score compared to the official baseline. The proposed system is evaluated on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and obtains an improvement of 14.5% in SELD score compared to the baseline.
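
The SELD score referred to above is not restated in the abstract; in DCASE practice it is commonly computed as the mean of the four metrics after converting each to an error in [0, 1]. The helper below follows that common convention, which is an assumption here rather than a definition taken from the report.

```python
def seld_score(error_rate, f_score, localization_error_deg, localization_recall):
    """
    Common SELD aggregate: mean of four error terms, each in [0, 1].
    f_score and localization_recall are fractions in 0..1; localization error is in degrees.
    """
    return (error_rate
            + (1.0 - f_score)
            + localization_error_deg / 180.0
            + (1.0 - localization_recall)) / 4.0

# Example using the audio-only FOA baseline's development-set numbers from the tables above.
print(round(seld_score(0.57, 0.299, 22.0, 0.477), 3))  # 0.479
```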


ATTENTION MECHANISM NETWORK AND DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION

Lihua Xue, Hongqing Liu, Yi Zhou
School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract

This technical report describes our submission systems for Task 3 of the DCASE2023 challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. In our approach, we first generate more spatial audio files for training. To improve the generalization of the model, we employ random cutout, time-frequency masking, frequency shifting, and augmix. Secondly, we utilize a ResNet-Conformer network as the main body of our model, and we merge the ResNet-Conformer network with the EINV2 framework using a multi-ACCDOA output. To extract more effective features, we introduce a multi-scale channel attention mechanism and attentive statistics pooling. Finally, we adopt an ensemble of different models with the same output format, together with post-processing strategies. The experimental results show that our proposed systems outperform the baseline system on the development dataset of DCASE2023 Task 3.
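
The multi-ACCDOA output format mentioned above encodes, for each output track and class, a Cartesian DOA vector whose length doubles as the activity score, so detection reduces to thresholding the vector norm. The sketch below decodes a single-track ACCDOA frame under that convention; the per-class (x, y, z) layout and the threshold value are illustrative assumptions, and the multi-track variant simply repeats this per track.

```python
import numpy as np

def decode_accdoa(frame_output, num_classes, threshold=0.5):
    """
    Decode one frame of ACCDOA output.

    frame_output: array of length 3 * num_classes, laid out as (x, y, z) per class.
    Returns a list of (class_index, unit_doa_vector) for classes whose vector norm
    exceeds the activity threshold.
    """
    vectors = np.asarray(frame_output, dtype=float).reshape(num_classes, 3)
    detections = []
    for cls, vec in enumerate(vectors):
        norm = np.linalg.norm(vec)
        if norm > threshold:  # the vector length acts as the activity score
            detections.append((cls, vec / norm))
    return detections

# Toy frame with 13 classes; only class 2 is active, pointing roughly front-left.
frame = np.zeros(3 * 13)
frame[6:9] = [0.7, 0.5, 0.1]
print(decode_accdoa(frame, num_classes=13))
```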


AUDIO-VISUAL SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING DEPTH-WISE SEPARABLE CONVOLUTION

Yi Wang, Hongqing Liu, Yi Zhou
School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract

This technical report describes the systems submitted to the DCASE2023 challenge task 3: sound event localization and detection (SELD) -- track B: audio-visual inference. The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions of arrival or positions while they are active. Compared with the official baseline system, the improvements of our submitted system, which is based on a CRNN [1], mainly comprise two parts: a more powerful audio feature processing network architecture and an additional visual feature module. For the audio network, we utilize depth-wise separable convolutions with multi-scale kernel sizes to better learn the relevant information of different sound event categories in the audio features. Then, we modify the pooling stage and add residual connections to prevent information loss. Besides, we use the image corresponding to the start frame of the audio feature sequence, processed by a pretrained ResNet-18 model, as an additional visual feature. Experimental results show that our system outperforms the baseline method on the development dataset of Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23).


SOUND EVENT LOCALIZATION AND DETECTION BASED ON OMNI-DIMENSIONAL DYNAMIC CONVOLUTION AND FEATURE PYRAMID ATTENTION MODULE

Mengzhen Ma1,2, Ying Hu1,2, Mingyu Wang1,2, Wenjie Fang1,2, Jie Liu1,2, Zunxue Niu1,2, Xin Fan1,2
1School of Information Science and Engineering, Xinjiang University, Urumqi, China, 2Key Laboratory of Signal Detection and Processing in Xinjiang, China

Abstract

In this report, we present our method for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. We propose a method based on Omni-dimensional dynamic convolution (ODConv) and a Feature Pyramid Attention Module (FPAM). To enhance the feature extraction ability of the convolution kernels, we introduce an attention mechanism along four kernel dimensions in ODConv. In addition, we explore FPAM to recalibrate high-level features from Residual Omni-dimensional Dynamic Convolution (Res ODConv) blocks, making the model pay more attention to significant positions and channels. We also design a bidirectional Conformer to model context information in the time and frequency dimensions. On the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset, our system demonstrates a notable improvement over the baseline system. Only the first-order Ambisonics (FOA) dataset was considered in this experiment.


STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Archontis Politis1, Kazuki Shimada2, Parthasaarathy Sudarsanam1, Sharath Adavanne1, Daniel Krause1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Yuki Mitsufuji2, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan

Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge, and it introduces significant new challenges compared to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset, with emphasis on its differences from the baseline of the previous challenge. Baseline results indicate that, with a suitable training strategy, reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6600531.


STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Kazuki Shimada2, Archontis Politis1, Parthasaarathy Sudarsanam1, Daniel Krause1, Kengo Uchida2, Sharath Adavanne1, Aapo Hakala1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Tuomas Virtanen1, Yuki Mitsufuji2
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan

Abstract

While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded under instructions that guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results from a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.


PLCST: PROBABILISTIC LOCALIZATION AND CLASSIFICATION OF SOUNDS WITH TRANSFORMERS FOR SOUND EVENT LOCALIZATION AND DETECTION

Peipei Wu1, Jinzheng Zhao1, Yaru Chen1, Berghi Davide1, Chenfei Zhu3, Yin Cao2, Yang Liu4, Philip Jackson1, Wenwu Wang1
1University of Surrey, Centre for Vision, Speech and Signal Processing (CVSSP), Surrey, UK, 2Xi’an Jiaotong-Liverpool University, Department of Intelligent Science, Suzhou, China, 3Daqian Information, Wuhan, China, 4Meta, Seattle, USA

Abstract

Sound Event Localization and Detection (SELD) is a task that involves detecting different types of sound events along with their temporal and spatial information, specifically, class-level event detection and the corresponding directions of arrival at each frame. In DCASE 2023 Task 3, the recordings consist of real-world sound scenes with complex conditions, which contain simultaneous occurrences of up to 3 or even 5 events. Our submitted system for this task is based on the previously proposed method PILOT (Probabilistic Localization of Sounds with Transformers). While PILOT combines transformers with CNN-based feature extraction modules and covers sound event localization (SEL) tasks with sound activity detection, it requires modifications to address SELD tasks. In our architecture, we adapt PILOT’s input features and output branches to SELD tasks and revise the loss function accordingly. We name our model Probabilistic Localization and Classification of Sounds with Transformers (PLCST). Unlike other approaches, we do not generate additional samples from the development dataset or use other datasets for training, aiming to mitigate discrepancies. In addition, another benefit of our model is that its number of parameters is relatively small. Our experimental results demonstrate improvements of our system over the baseline methods.


ONE AUDIO AUGMENTATION CHAIN PROPOSED FOR SOUND EVENT LOCALIZATION AND DETECTION IN DCASE 2023 TASK3

Shichao Wu
Nankai University, College of Artificial Intelligence, Tianjin, China

Abstract

In this technical report, we describe the system details of our submitted results for the sound event localization and detection challenge in DCASE 2023. We concentrate on the audio-only SELD sub-track, where inference of the SELD labels is performed with multichannel audio input only, as in previous years. We only used the audio data for training, without any video information, since we think it is hard to fully exploit the visual information collected with the 360-degree video setup. We present three improvements in this work concerning the neural network model, external data generation, and audio augmentation, compared to the baseline system. Specifically, we use a deeper and more powerful neural network, the event-independent network v2 (EINV2), in place of the CRNN. Second, we propose to augment the audio data with an audio augmentation chain. Third, we synthesize more simulated audio samples for network training. Experiments on the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) benchmark dataset showed that our system remarkably outperformed the DCASE baseline system.
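
The audio augmentation chain described above applies a sequence of augmentation operations, each triggered at random, to every training example. A minimal, generic sketch of such a chain on a multichannel waveform follows; the specific operations and probabilities are placeholders, not the ones used in the report.

```python
import numpy as np

def random_gain(x, low_db=-6.0, high_db=6.0):
    """Scale all channels by a random gain drawn in dB."""
    gain_db = np.random.uniform(low_db, high_db)
    return x * (10.0 ** (gain_db / 20.0))

def random_time_shift(x, max_shift=4800):
    """Circularly shift the waveform by a random number of samples."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(x, shift, axis=-1)

def augmentation_chain(x, ops, p=0.5):
    """Apply each operation in the chain independently with probability p."""
    for op in ops:
        if np.random.rand() < p:
            x = op(x)
    return x

# Example: a 4-channel, 1-second clip at 24 kHz passed through the chain.
clip = np.random.randn(4, 24000)
augmented = augmentation_chain(clip, [random_gain, random_time_shift])
print(augmented.shape)  # (4, 24000)
```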


A DATA GENERATION METHOD FOR SOUND EVENT LOCALIZATION AND DETECTION IN REAL SPATIAL SOUND SCENES

Jinbo Hu1,2, Yin Cao3, Ming Wu1, Feiran Yang1, Wenwu Wang4, Mark D. Plumbley4, Jun Yang1,2
1Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Department of Intelligent Science, Xi’an Jiaotong Liverpool University, China, 4Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK

Abstract

This technical report describes our submission systems for Task 3 of the DCASE 2023 Challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. Our proposed solution includes data synthesis, data augmentation, and track-wise model training. We focus on data generation and synthesize multi-channel spatial recordings by convolving monophonic sound event examples with multi-channel spatial room impulse responses (SRIRs) to overcome the problem of lacking real-scene recordings. The sound event samples are sourced from FSD50K and AudioSet. The SRIRs are extracted from the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) and computationally generated using the image source method (ISM). Furthermore, we utilize our previously proposed data augmentation chains, which randomly combine several data augmentation operations. Finally, based on the manually synthesized and augmented data, we employ the Event-Independent Network V2 (EINV2) with a track-wise output format to detect and localize up to three different sound events. These different sound events can be of the same type but from different locations. Our proposed solution significantly outperforms the baseline method on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset.
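
The data generation step described above boils down to convolving a monophonic sound event with a multichannel SRIR and mixing such spatialized events into a scene. A minimal sketch of spatializing one event follows; the array shapes and the use of SciPy are assumptions made for the example.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_event(mono_event, srir):
    """
    Convolve a mono sound event with a multichannel SRIR.

    mono_event: array of shape (num_samples,)
    srir: array of shape (num_channels, rir_length), e.g. 4 FOA channels.
    Returns an array of shape (num_channels, num_samples + rir_length - 1).
    """
    return np.stack([fftconvolve(mono_event, srir[ch]) for ch in range(srir.shape[0])])

# Toy example: a 1-second mono event and a 0.5-second 4-channel SRIR at 24 kHz.
event = np.random.randn(24000)
srir = np.random.randn(4, 12000) * np.exp(-np.linspace(0, 8, 12000))  # crude decaying "room"
scene_snippet = spatialize_event(event, srir)
print(scene_snippet.shape)  # (4, 35999)
```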


DIVIDED SPECTRO-TEMPORAL ATTENTION FOR SOUND EVENT LOCALIZATION AND DETECTION IN REAL SCENES FOR DCASE2023 CHALLENGE

Yusun Shul1, Byeong-Yun Ko2, Jung-Woo Choi1
1School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, 2Dept. of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

Abstract

Localizing sounds and detecting events in different room environments is a difficult task, mainly due to the wide range of reflections and reverberation. When training neural network models with sounds recorded in only a few room environments, there is a tendency for the models to become overly specialized to those specific environments, resulting in overfitting. To address this overfitting issue, we propose divided spectro-temporal attention. In comparison to the baseline method, which utilizes a convolutional recurrent neural network (CRNN) followed by a temporal multi-head self-attention (MHSA) layer, we introduce a separate spectral attention layer that aggregates spectral features prior to the temporal MHSA. To achieve efficient spectral attention, we reduce the frequency pooling size in the convolutional encoder of the baseline to obtain a 3D tensor that incorporates information about frequency, time, and channel. As a result, we can implement spectral attention with channel embeddings, which is not possible in the baseline method, which deals only with temporal context in the RNN and MHSA layers. We demonstrate that the proposed divided spectro-temporal attention significantly improves sound event detection and localization scores on the real test data of the STARSS23 development dataset. Additionally, we show that various data augmentations, such as frameshift, time masking, channel swapping, and moderate mix-up, along with the use of external data, contribute to the overall improvement in SELD performance.
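
To make the divided spectro-temporal attention concrete, the sketch below applies single-head self-attention first along the frequency axis and then along the time axis of a (channels, time, frequency) feature tensor, with the channel dimension serving as the embedding. It is only a structural illustration of the described ordering, using plain NumPy and identity projections, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq):
    """Single-head self-attention over a (length, dim) sequence with identity projections."""
    scores = seq @ seq.T / np.sqrt(seq.shape[-1])
    return softmax(scores, axis=-1) @ seq

def divided_spectro_temporal_attention(feat):
    """
    feat: (channels, time, frequency) tensor.
    Spectral stage: for each time step, attend across frequency bins.
    Temporal stage: for each frequency bin, attend across time steps.
    """
    c, t, f = feat.shape
    spec = np.stack([self_attention(feat[:, ti, :].T).T for ti in range(t)], axis=1)
    out = np.stack([self_attention(spec[:, :, fi].T).T for fi in range(f)], axis=2)
    return out

print(divided_spectro_temporal_attention(np.random.randn(8, 50, 16)).shape)  # (8, 50, 16)
```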
