Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify the type of each event from a known set of sound classes, and further localize the events in space while they are active.

The focus of the current SELD task is developing systems that can perform adequately on real sound scene recordings, with a small amount of training data. The task provides two datasets, development and evaluation, recorded in multiple rooms across two different sites. Of the two, only the development dataset provides reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their systems on the unseen evaluation dataset.

More details on the task setup and evaluation can be found in the task description page.
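
The rankings below use location-dependent detection metrics: the error rate and F-score at 20° count a prediction as correct only if its class matches the reference and its estimated direction of arrival (DOA) lies within 20° of the reference direction, alongside a class-dependent localization error and localization recall. A minimal sketch of the angular distance underlying the 20° threshold, assuming Cartesian unit-vector DOAs (the function name is illustrative):

```python
import numpy as np

def angular_error_deg(pred, ref):
    """Great-circle angle in degrees between two Cartesian DOA vectors."""
    pred = pred / np.linalg.norm(pred)
    ref = ref / np.linalg.norm(ref)
    return np.degrees(np.arccos(np.clip(np.dot(pred, ref), -1.0, 1.0)))

# A detection counts towards ER/F (20°) only when the class matches and
# this angle is at most 20 degrees.
print(angular_error_deg(np.array([1.0, 0.0, 0.0]),
                        np.array([0.94, 0.34, 0.0])))  # ~19.9, just inside
```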

Teams ranking

The SELD task received 63 submissions in total from 19 teams across the world. The following table includes only the best-performing system per submitting team; the task baseline is included for reference. Confidence intervals are also reported for each metric on the evaluation set results.
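
The exact resampling scheme behind these intervals is described in the task documentation; as a generic illustration of the idea, a percentile bootstrap over per-recording metric values could look like the sketch below (an assumption for illustration, not necessarily the procedure behind the official numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(per_clip_scores, n_boot=1000, alpha=0.05):
    """Percentile bootstrap interval for the mean of per-recording scores."""
    scores = np.asarray(per_clip_scores, dtype=float)
    means = [rng.choice(scores, size=scores.size, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)
```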

| Submission name | Corresponding author | Affiliation | Technical report | Best official system rank | Eval. ER (20°) | Eval. F (20°) [%] | Eval. LE [°] | Eval. LR [%] | Dev. ER (20°) | Dev. F (20°) [%] | Dev. LE [°] | Dev. LR [%] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Du_NERCSLIP_task3_2 | Jun Du | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 0.35 (0.30 - 0.41) | 58.3 (53.8 - 64.7) | 14.6 (12.8 - 16.5) | 73.7 (68.7 - 78.2) | | | | |
| Hu_IACAS_task3_3 | Jinbo Hu | Institute of Acoustics, Chinese Academy of Sciences | Hu_IACAS_task3_report | 5 | 0.39 (0.34 - 0.44) | 55.8 (51.2 - 61.1) | 16.2 (14.6 - 17.8) | 72.4 (67.3 - 77.2) | 0.53 | 48.1 | 17.8 | 62.6 |
| Han_KU_task3_4 | Sung Won Han | Korea University | Han_KU_task3_report | 7 | 0.37 (0.31 - 0.42) | 49.7 (44.4 - 56.6) | 16.5 (14.8 - 18.0) | 70.7 (65.8 - 76.1) | 0.39 | 59.5 | 13.0 | 73.7 |
| Xie_UESTC_task3_1 | Rong Xie | University of Electronic Science and Technology of China | Xie_UESTC_task3_report | 11 | 0.48 (0.41 - 0.55) | 48.6 (42.5 - 55.4) | 17.6 (16.0 - 19.2) | 73.5 (68.0 - 77.6) | 0.44 | 58.0 | 12.9 | 68.0 |
| Bai_JLESS_task3_4 | Jisheng Bai | Northwestern Polytechnical University | Bai_JLESS_task3_report | 14 | 0.47 (0.40 - 0.54) | 49.3 (41.8 - 57.1) | 16.9 (15.0 - 18.9) | 67.9 (59.3 - 73.3) | 0.48 | 52.2 | 16.9 | 70.7 |
| Kang_KT_task3_2 | Sang-Ick Kang | KT Corporation | Kang_KT_task3_report | 17 | 0.47 (0.40 - 0.53) | 45.9 (40.1 - 52.6) | 15.8 (13.6 - 18.0) | 59.3 (50.3 - 65.1) | 0.48 | 51.3 | 16.4 | 67.7 |
| FOA_Baseline_task3_1 | Archontis Politis | Tampere University | Politis_TAU_task3_report | 42 | 0.61 (0.57 - 0.65) | 23.7 (18.7 - 29.4) | 22.9 (21.0 - 26.0) | 51.4 (46.2 - 55.2) | 0.71 | 21.0 | 29.3 | 46.0 |
| Chun_Chosun_task3_3 | Chanjun Chun | Chosun University | Chun_Chosun_task3_report | 27 | 0.59 (0.52 - 0.66) | 31.0 (25.9 - 36.3) | 19.8 (17.3 - 22.6) | 50.7 (42.2 - 56.3) | 0.59 | 35.0 | 33.8 | 57.0 |
| Guo_XIAOMI_task3_2 | Kaibin Guo | Xiaomi | Guo_XIAOMI_task3_report | 33 | 0.60 (0.53 - 0.67) | 28.2 (22.8 - 34.1) | 23.8 (21.3 - 26.2) | 52.1 (43.4 - 58.1) | 0.61 | 29.0 | 23.5 | 49.0 |
| Scheibler_LINE_task3_1 | Robin Scheibler | LINE Corporation | Scheibler_LINE_task3_report | 30 | 0.62 (0.55 - 0.69) | 30.4 (25.2 - 36.3) | 16.7 (14.0 - 19.5) | 49.2 (42.1 - 54.5) | 0.50 | 51.1 | 16.7 | 63.4 |
| Park_SGU_task3_4 | Hyung-Min Park | Sogang University | Park_SGU_task3_report | 38 | 0.60 (0.53 - 0.67) | 30.6 (25.2 - 36.4) | 21.6 (17.8 - 25.1) | 45.9 (40.3 - 51.0) | 0.62 | 46.8 | 25.1 | 78.2 |
| Wang_SJTU_task3_2 | Yu Wang | Shanghai Jiao Tong University | Wang_SJTU_task3_report | 33 | 0.67 (0.60 - 0.74) | 27.0 (19.3 - 33.6) | 24.4 (22.0 - 27.1) | 60.3 (53.8 - 65.3) | 0.46 | 61.8 | 11.4 | 68.4 |
| FalconPerez_Aalto_task3_2 | Ricardo Falcon-Perez | Aalto University | FalconPerez_Aalto_task3_report | 52 | 0.73 (0.67 - 0.79) | 21.8 (15.5 - 27.6) | 24.4 (21.7 - 27.1) | 43.1 (35.7 - 48.7) | 0.74 | 23.0 | 27.4 | 45.0 |
| Kim_KU_task3_2 | Gwantae Kim | Korea University | Kim_KU_task3_report | 46 | 0.74 (0.66 - 0.81) | 24.1 (19.8 - 28.9) | 26.6 (23.4 - 29.8) | 55.1 (48.6 - 59.5) | 0.66 | 30.0 | 22.5 | 49.0 |
| Chen_SHU_task3_1 | Zhengyu Chen | Shanghai University | Chen_SHU_task3_report | 65 | 1.00 (1.00 - 1.00) | 0.3 (0.1 - 0.6) | 60.3 (45.4 - 94.0) | 4.5 (2.9 - 6.3) | 0.71 | 27.0 | 26.7 | 48.0 |
| Wu_NKU_task3_2 | Shichao Wu | Nankai University | Wu_NKU_task3_report | 53 | 0.69 (0.64 - 0.74) | 17.9 (14.4 - 21.5) | 28.5 (24.5 - 39.7) | 44.5 (38.2 - 48.4) | 0.63 | 33.0 | 22.7 | 49.0 |
| Ko_KAIST_task3_2 | Byeong-Yun Ko | Korea Advanced Institute of Science and Technology | Ko_KAIST_task3_report | 23 | 0.49 (0.42 - 0.55) | 39.9 (33.8 - 46.0) | 17.3 (15.3 - 19.3) | 54.6 (46.5 - 60.5) | 0.55 | 46.2 | 16.4 | 54.6 |
| Kapka_SRPOL_task3_4 | Slawomir Kapka | Samsung Research Poland | Kapka_SRPOL_task3_report | 48 | 0.72 (0.65 - 0.79) | 25.5 (21.3 - 30.4) | 25.4 (21.7 - 29.3) | 49.8 (42.8 - 55.3) | | | | |
| Zhaoyu_LRVT_task3_1 | Zhaoyu Yan | Lenovo Research | Zhaoyu_LRVT_task3_report | 60 | 0.96 (0.88 - 1.00) | 11.2 (8.8 - 13.9) | 31.0 (28.5 - 33.4) | 53.4 (44.4 - 58.9) | 0.58 | 35.0 | 22.5 | 42.0 |
| Xie_XJU_task3_1 | Yin Xie | Xinjiang University | Xie_XJU_task3_report | 44 | 0.66 (0.59 - 0.74) | 25.5 (19.3 - 32.2) | 23.1 (19.9 - 26.4) | 53.1 (42.7 - 59.4) | 0.66 | 34.2 | 22.9 | 57.7 |

Systems ranking

Performance of all submitted systems on the evaluation and development datasets. Confidence intervals are also reported for each metric on the evaluation set results.

| Submission name | Technical report | Official rank | Eval. ER (20°) | Eval. F (20°) [%] | Eval. LE [°] | Eval. LR [%] | Dev. ER (20°) | Dev. F (20°) [%] | Dev. LE [°] | Dev. LR [%] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FOA_Baseline_task3_1 | Politis_TAU_task3_report | 42 | 0.61 (0.57 - 0.65) | 23.7 (18.7 - 29.4) | 22.9 (21.0 - 26.0) | 51.4 (46.2 - 55.2) | 0.71 | 21.0 | 29.3 | 46.0 |
| MIC_Baseline_task3_1 | Politis_TAU_task3_report | 45 | 0.61 (0.56 - 0.66) | 21.6 (17.6 - 25.8) | 25.9 (22.6 - 28.5) | 48.1 (36.8 - 54.9) | 0.71 | 21.0 | 32.2 | 47.0 |
| Bai_JLESS_task3_1 | Bai_JLESS_task3_report | 20 | 0.48 (0.41 - 0.54) | 46.0 (38.0 - 54.0) | 16.3 (14.4 - 18.1) | 58.8 (48.3 - 65.2) | 0.48 | 52.4 | 16.1 | 62.1 |
| Bai_JLESS_task3_2 | Bai_JLESS_task3_report | 16 | 0.49 (0.42 - 0.56) | 47.8 (40.2 - 55.3) | 16.9 (14.9 - 18.8) | 66.6 (56.0 - 72.8) | 0.52 | 50.0 | 17.1 | 68.1 |
| Bai_JLESS_task3_3 | Bai_JLESS_task3_report | 19 | 0.46 (0.39 - 0.53) | 46.1 (38.3 - 53.8) | 16.3 (14.6 - 17.9) | 57.8 (46.4 - 64.6) | 0.44 | 54.2 | 16.0 | 65.4 |
| Bai_JLESS_task3_4 | Bai_JLESS_task3_report | 14 | 0.47 (0.40 - 0.54) | 49.3 (41.8 - 57.1) | 16.9 (15.0 - 18.9) | 67.9 (59.3 - 73.3) | 0.48 | 52.2 | 16.9 | 70.7 |
| Chun_Chosun_task3_1 | Chun_Chosun_task3_report | 28 | 0.59 (0.52 - 0.66) | 30.9 (25.9 - 36.2) | 19.7 (17.5 - 21.9) | 50.2 (42.0 - 55.7) | 0.59 | 35.0 | 20.7 | 57.0 |
| Chun_Chosun_task3_2 | Chun_Chosun_task3_report | 31 | 0.60 (0.53 - 0.66) | 30.1 (25.7 - 34.8) | 20.0 (17.8 - 22.3) | 50.2 (41.8 - 55.8) | 0.59 | 34.0 | 24.8 | 58.0 |
| Chun_Chosun_task3_3 | Chun_Chosun_task3_report | 27 | 0.59 (0.52 - 0.66) | 31.0 (25.9 - 36.3) | 19.8 (17.3 - 22.6) | 50.7 (42.2 - 56.3) | 0.59 | 35.0 | 33.8 | 57.0 |
| Chun_Chosun_task3_4 | Chun_Chosun_task3_report | 29 | 0.60 (0.53 - 0.67) | 30.4 (25.2 - 36.0) | 20.2 (17.0 - 22.6) | 50.5 (42.4 - 56.0) | 0.59 | 34.0 | 23.0 | 59.0 |
| Guo_XIAOMI_task3_1 | Guo_XIAOMI_task3_report | 47 | 0.63 (0.57 - 0.69) | 20.2 (16.9 - 24.1) | 22.9 (20.7 - 25.2) | 45.8 (40.4 - 49.7) | 0.63 | 25.0 | 23.9 | 48.0 |
| Guo_XIAOMI_task3_2 | Guo_XIAOMI_task3_report | 33 | 0.60 (0.53 - 0.67) | 28.2 (22.8 - 34.1) | 23.8 (21.3 - 26.2) | 52.1 (43.4 - 58.1) | 0.61 | 29.0 | 23.5 | 49.0 |
| Kang_KT_task3_1 | Kang_KT_task3_report | 21 | 0.47 (0.41 - 0.53) | 44.3 (38.4 - 50.6) | 16.0 (13.8 - 18.2) | 57.7 (49.0 - 63.5) | 0.49 | 53.0 | 15.8 | 68.0 |
| Kang_KT_task3_2 | Kang_KT_task3_report | 17 | 0.47 (0.40 - 0.53) | 45.9 (40.1 - 52.6) | 15.8 (13.6 - 18.0) | 59.3 (50.3 - 65.1) | 0.48 | 51.3 | 16.4 | 67.7 |
| Kang_KT_task3_3 | Kang_KT_task3_report | 18 | 0.46 (0.40 - 0.52) | 45.4 (39.4 - 51.5) | 15.8 (13.5 - 18.2) | 58.4 (50.8 - 63.7) | 0.49 | 52.6 | 15.8 | 66.4 |
| Kang_KT_task3_4 | Kang_KT_task3_report | 22 | 0.46 (0.40 - 0.52) | 43.7 (38.2 - 49.9) | 16.2 (14.0 - 18.5) | 56.4 (49.2 - 61.5) | 0.48 | 52.0 | 16.3 | 65.3 |
| Du_NERCSLIP_task3_1 | Du_NERCSLIP_task3_report | 4 | 0.37 (0.31 - 0.44) | 56.9 (50.9 - 64.5) | 15.0 (13.2 - 16.9) | 73.6 (68.1 - 78.7) | 0.38 | 67.0 | 14.8 | 78.0 |
| Du_NERCSLIP_task3_2 | Du_NERCSLIP_task3_report | 1 | 0.35 (0.30 - 0.41) | 58.3 (53.8 - 64.7) | 14.6 (12.8 - 16.5) | 73.7 (68.7 - 78.2) | | | | |
| Du_NERCSLIP_task3_3 | Du_NERCSLIP_task3_report | 2 | 0.36 (0.29 - 0.43) | 56.8 (50.6 - 63.9) | 15.5 (13.8 - 17.4) | 75.5 (70.1 - 80.4) | | | | |
| Du_NERCSLIP_task3_4 | Du_NERCSLIP_task3_report | 3 | 0.37 (0.31 - 0.44) | 57.8 (51.7 - 65.3) | 14.9 (13.2 - 16.7) | 73.4 (67.7 - 78.5) | 0.41 | 64.0 | 14.9 | 73.0 |
| Scheibler_LINE_task3_1 | Scheibler_LINE_task3_report | 30 | 0.62 (0.55 - 0.69) | 30.4 (25.2 - 36.3) | 16.7 (14.0 - 19.5) | 49.2 (42.1 - 54.5) | 0.50 | 51.1 | 16.7 | 63.4 |
| Park_SGU_task3_1 | Park_SGU_task3_report | 41 | 0.60 (0.53 - 0.67) | 28.4 (23.9 - 33.6) | 22.6 (19.8 - 25.3) | 46.9 (41.5 - 52.1) | 0.61 | 46.2 | 24.0 | 78.2 |
| Park_SGU_task3_2 | Park_SGU_task3_report | 40 | 0.63 (0.55 - 0.70) | 31.2 (25.3 - 37.5) | 21.6 (18.3 - 25.0) | 46.5 (40.7 - 51.7) | 0.62 | 46.8 | 25.1 | 78.2 |
| Park_SGU_task3_3 | Park_SGU_task3_report | 38 | 0.63 (0.56 - 0.70) | 31.4 (25.8 - 37.4) | 22.7 (18.6 - 26.5) | 47.4 (41.7 - 52.5) | 0.62 | 46.8 | 25.1 | 78.2 |
| Park_SGU_task3_4 | Park_SGU_task3_report | 38 | 0.60 (0.53 - 0.67) | 30.6 (25.2 - 36.4) | 21.6 (17.8 - 25.1) | 45.9 (40.3 - 51.0) | 0.62 | 46.8 | 25.1 | 78.2 |
| Wang_SJTU_task3_1 | Wang_SJTU_task3_report | 35 | 0.67 (0.60 - 0.74) | 26.3 (18.3 - 33.1) | 23.9 (21.8 - 26.3) | 59.2 (52.6 - 64.4) | 0.47 | 62.2 | 11.3 | 69.0 |
| Wang_SJTU_task3_2 | Wang_SJTU_task3_report | 33 | 0.67 (0.60 - 0.74) | 27.0 (19.3 - 33.6) | 24.4 (22.0 - 27.1) | 60.3 (53.8 - 65.3) | 0.46 | 61.8 | 11.4 | 68.4 |
| Wang_SJTU_task3_3 | Wang_SJTU_task3_report | 34 | 0.68 (0.60 - 0.75) | 26.3 (18.0 - 33.3) | 23.7 (21.7 - 25.9) | 59.8 (52.4 - 65.1) | 0.48 | 61.4 | 11.5 | 69.0 |
| Wang_SJTU_task3_4 | Wang_SJTU_task3_report | 36 | 0.67 (0.60 - 0.74) | 26.2 (18.0 - 33.2) | 23.8 (21.5 - 26.4) | 58.8 (51.2 - 64.2) | 0.47 | 61.6 | 11.4 | 68.7 |
| FalconPerez_Aalto_task3_1 | FalconPerez_Aalto_task3_report | 58 | 0.70 (0.64 - 0.75) | 16.2 (10.1 - 21.1) | 28.7 (24.0 - 32.6) | 33.9 (26.5 - 39.0) | 0.75 | 19.0 | 49.3 | 38.0 |
| FalconPerez_Aalto_task3_2 | FalconPerez_Aalto_task3_report | 52 | 0.73 (0.67 - 0.79) | 21.8 (15.5 - 27.6) | 24.4 (21.7 - 27.1) | 43.1 (35.7 - 48.7) | 0.74 | 23.0 | 27.4 | 45.0 |
| FalconPerez_Aalto_task3_3 | FalconPerez_Aalto_task3_report | 59 | 0.70 (0.64 - 0.77) | 17.2 (10.2 - 22.5) | 25.5 (22.6 - 28.6) | 31.2 (23.4 - 36.2) | 0.75 | 15.0 | 51.8 | 3.0 |
| Xie_UESTC_task3_1 | Xie_UESTC_task3_report | 11 | 0.48 (0.41 - 0.55) | 48.6 (42.5 - 55.4) | 17.6 (16.0 - 19.2) | 73.5 (68.0 - 77.6) | 0.44 | 58.0 | 12.9 | 68.0 |
| Xie_UESTC_task3_2 | Xie_UESTC_task3_report | 15 | 0.50 (0.43 - 0.57) | 47.8 (41.5 - 54.4) | 17.5 (15.9 - 19.2) | 72.3 (65.1 - 77.1) | 0.47 | 52.0 | 14.4 | 64.0 |
| Xie_UESTC_task3_3 | Xie_UESTC_task3_report | 13 | 0.52 (0.44 - 0.60) | 48.4 (42.4 - 55.1) | 17.9 (16.2 - 19.8) | 74.6 (69.3 - 78.8) | 0.46 | 55.0 | 14.0 | 66.0 |
| Xie_UESTC_task3_4 | Xie_UESTC_task3_report | 12 | 0.50 (0.42 - 0.57) | 49.5 (43.8 - 56.0) | 17.4 (15.9 - 19.1) | 74.0 (69.2 - 77.8) | 0.46 | 56.0 | 13.7 | 67.0 |
| Kim_KU_task3_1 | Kim_KU_task3_report | 54 | 0.80 (0.74 - 0.86) | 20.3 (16.3 - 24.9) | 26.1 (23.9 - 28.6) | 50.6 (43.8 - 55.5) | 0.66 | 31.0 | 21.7 | 51.0 |
| Kim_KU_task3_2 | Kim_KU_task3_report | 46 | 0.74 (0.66 - 0.81) | 24.1 (19.8 - 28.9) | 26.6 (23.4 - 29.8) | 55.1 (48.6 - 59.5) | 0.66 | 30.0 | 22.5 | 49.0 |
| Kim_KU_task3_3 | Kim_KU_task3_report | 49 | 0.75 (0.69 - 0.82) | 20.5 (12.6 - 25.9) | 26.1 (22.7 - 29.5) | 53.3 (47.0 - 57.6) | 0.65 | 33.0 | 20.4 | 51.0 |
| Hu_IACAS_task3_1 | Hu_IACAS_task3_report | 10 | 0.44 (0.38 - 0.49) | 49.2 (43.8 - 55.8) | 16.6 (14.4 - 19.0) | 70.4 (64.0 - 75.2) | 0.50 | 48.4 | 19.5 | 65.7 |
| Hu_IACAS_task3_2 | Hu_IACAS_task3_report | 6 | 0.40 (0.34 - 0.46) | 57.4 (53.4 - 62.8) | 15.1 (13.4 - 16.8) | 70.6 (65.4 - 75.4) | 0.50 | 51.0 | 16.4 | 65.9 |
| Hu_IACAS_task3_3 | Hu_IACAS_task3_report | 5 | 0.39 (0.34 - 0.44) | 55.8 (51.2 - 61.1) | 16.2 (14.6 - 17.8) | 72.4 (67.3 - 77.2) | 0.53 | 48.1 | 17.8 | 62.6 |
| Hu_IACAS_task3_4 | Hu_IACAS_task3_report | 9 | 0.40 (0.34 - 0.46) | 50.9 (44.4 - 59.4) | 15.9 (13.8 - 18.1) | 69.4 (63.7 - 75.7) | 0.53 | 45.4 | 17.4 | 62.5 |
| Chen_SHU_task3_1 | Chen_SHU_task3_report | 65 | 1.00 (1.00 - 1.00) | 0.3 (0.1 - 0.6) | 60.3 (45.4 - 94.0) | 4.5 (2.9 - 6.3) | 0.71 | 27.0 | 26.7 | 48.0 |
| Wu_NKU_task3_1 | Wu_NKU_task3_report | 55 | 0.72 (0.67 - 0.77) | 18.5 (13.3 - 23.6) | 25.1 (22.0 - 29.4) | 42.1 (33.3 - 47.6) | 0.66 | 32.0 | 23.2 | 48.0 |
| Wu_NKU_task3_2 | Wu_NKU_task3_report | 53 | 0.69 (0.64 - 0.74) | 17.9 (14.4 - 21.5) | 28.5 (24.5 - 39.7) | 44.5 (38.2 - 48.4) | 0.63 | 33.0 | 22.7 | 49.0 |
| Wu_NKU_task3_3 | Wu_NKU_task3_report | 57 | 0.72 (0.67 - 0.77) | 18.8 (14.2 - 24.6) | 30.2 (23.4 - 35.2) | 39.7 (29.9 - 45.5) | 0.65 | 31.0 | 26.0 | 43.0 |
| Wu_NKU_task3_4 | Wu_NKU_task3_report | 56 | 0.71 (0.65 - 0.76) | 18.7 (14.7 - 23.0) | 28.3 (22.8 - 40.2) | 38.6 (31.9 - 43.2) | 0.65 | 30.0 | 18.0 | 44.0 |
| Han_KU_task3_1 | Han_KU_task3_report | 39 | 0.73 (0.66 - 0.80) | 27.8 (22.6 - 35.2) | 25.6 (23.8 - 27.2) | 63.5 (57.7 - 68.7) | 0.45 | 63.6 | 14.4 | 71.1 |
| Han_KU_task3_2 | Han_KU_task3_report | 43 | 0.72 (0.64 - 0.79) | 23.0 (15.6 - 31.1) | 25.5 (23.9 - 27.0) | 64.0 (58.9 - 70.2) | 0.43 | 58.8 | 15.1 | 73.2 |
| Han_KU_task3_3 | Han_KU_task3_report | 8 | 0.38 (0.33 - 0.44) | 53.6 (47.8 - 60.7) | 15.6 (13.9 - 17.1) | 67.3 (61.7 - 73.1) | 0.28 | 67.2 | 11.8 | 76.7 |
| Han_KU_task3_4 | Han_KU_task3_report | 7 | 0.37 (0.31 - 0.42) | 49.7 (44.4 - 56.6) | 16.5 (14.8 - 18.0) | 70.7 (65.8 - 76.1) | 0.39 | 59.5 | 13.0 | 73.7 |
| Ko_KAIST_task3_1 | Ko_KAIST_task3_report | 24 | 0.47 (0.40 - 0.53) | 39.6 (32.9 - 45.9) | 18.9 (16.2 - 26.5) | 52.7 (42.7 - 59.8) | 0.53 | 49.8 | 16.0 | 55.9 |
| Ko_KAIST_task3_2 | Ko_KAIST_task3_report | 23 | 0.49 (0.42 - 0.55) | 39.9 (33.8 - 46.0) | 17.3 (15.3 - 19.3) | 54.6 (46.5 - 60.5) | 0.55 | 46.2 | 16.4 | 54.6 |
| Ko_KAIST_task3_3 | Ko_KAIST_task3_report | 25 | 0.48 (0.42 - 0.53) | 39.8 (33.3 - 46.2) | 19.6 (17.2 - 26.6) | 52.0 (42.4 - 58.7) | 0.57 | 46.4 | 17.2 | 54.4 |
| Ko_KAIST_task3_4 | Ko_KAIST_task3_report | 26 | 0.50 (0.44 - 0.56) | 35.7 (28.6 - 42.1) | 20.4 (18.3 - 22.6) | 52.8 (42.4 - 59.5) | 0.55 | 46.4 | 17.0 | 56.2 |
| Kapka_SRPOL_task3_1 | Kapka_SRPOL_task3_report | 58 | 0.92 (0.84 - 0.99) | 25.2 (21.6 - 29.2) | 24.1 (21.2 - 27.3) | 49.5 (43.4 - 54.3) | 0.85 | 32.1 | 24.7 | 51.4 |
| Kapka_SRPOL_task3_2 | Kapka_SRPOL_task3_report | 50 | 0.81 (0.73 - 0.88) | 26.0 (22.1 - 30.2) | 22.3 (19.2 - 25.9) | 48.1 (41.9 - 53.0) | 0.76 | 32.9 | 24.6 | 49.9 |
| Kapka_SRPOL_task3_3 | Kapka_SRPOL_task3_report | 51 | 0.81 (0.74 - 0.88) | 24.7 (20.5 - 29.5) | 26.2 (23.0 - 29.9) | 52.1 (45.3 - 57.2) | | | | |
| Kapka_SRPOL_task3_4 | Kapka_SRPOL_task3_report | 48 | 0.72 (0.65 - 0.79) | 25.5 (21.3 - 30.4) | 25.4 (21.7 - 29.3) | 49.8 (42.8 - 55.3) | | | | |
| Zhaoyu_LRVT_task3_1 | Zhaoyu_LRVT_task3_report | 60 | 0.96 (0.88 - 1.00) | 11.2 (8.8 - 13.9) | 31.0 (28.5 - 33.4) | 53.4 (44.4 - 58.9) | 0.58 | 35.0 | 22.5 | 42.0 |
| Zhaoyu_LRVT_task3_2 | Zhaoyu_LRVT_task3_report | 64 | 0.88 (0.84 - 0.92) | 3.5 (2.3 - 4.8) | 39.3 (28.9 - 59.3) | 7.5 (5.6 - 9.5) | 0.68 | 25.0 | 35.4 | 42.0 |
| Zhaoyu_LRVT_task3_3 | Zhaoyu_LRVT_task3_report | 62 | 0.83 (0.78 - 0.87) | 7.4 (5.5 - 9.5) | 24.5 (20.1 - 34.5) | 12.5 (10.0 - 15.1) | 0.70 | 25.4 | 45.2 | 42.0 |
| Zhaoyu_LRVT_task3_4 | Zhaoyu_LRVT_task3_report | 61 | 0.83 (0.80 - 0.87) | 12.1 (7.4 - 16.8) | 26.2 (23.0 - 29.0) | 36.0 (23.1 - 43.6) | 0.72 | 33.3 | 43.5 | 35.0 |
| Xie_XJU_task3_1 | Xie_XJU_task3_report | 44 | 0.66 (0.59 - 0.74) | 25.5 (19.3 - 32.2) | 23.1 (19.9 - 26.4) | 53.1 (42.7 - 59.4) | 0.66 | 34.2 | 22.9 | 57.7 |

System characteristics

| Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 42 | FOA_Baseline_task3_1 | Politis_TAU_task3_report | CRNN | 604920 | FOA | log-mel spectra, intensity vector | |
| 45 | MIC_Baseline_task3_1 | Politis_TAU_task3_report | CRNN | 606648 | MIC | log-mel spectra, GCC | |
| 20 | Bai_JLESS_task3_1 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 194560 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
| 16 | Bai_JLESS_task3_2 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 194560 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
| 19 | Bai_JLESS_task3_3 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 235212 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
| 14 | Bai_JLESS_task3_4 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 235212 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
| 28 | Chun_Chosun_task3_1 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 5650035 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
| 31 | Chun_Chosun_task3_2 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4194366 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
| 27 | Chun_Chosun_task3_3 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4983870 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
| 29 | Chun_Chosun_task3_4 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4654910 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
| 47 | Guo_XIAOMI_task3_1 | Guo_XIAOMI_task3_report | ComplexNew 3DCNN | 807257 | FOA | log-mel spectra, intensity vector | channel swapping, labels first, channels first |
| 33 | Guo_XIAOMI_task3_2 | Guo_XIAOMI_task3_report | 3DCNN | 902953 | FOA | log-mel spectra, intensity vector | channel swapping, labels first, channels first |
| 21 | Kang_KT_task3_1 | Kang_KT_task3_report | CRNN, ensemble | 97778356 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
| 17 | Kang_KT_task3_2 | Kang_KT_task3_report | CRNN, ensemble | 67818904 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
| 18 | Kang_KT_task3_3 | Kang_KT_task3_report | CRNN, ensemble | 126997260 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
| 22 | Kang_KT_task3_4 | Kang_KT_task3_report | CRNN, ensemble | 97137808 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
| 4 | Du_NERCSLIP_task3_1 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
| 1 | Du_NERCSLIP_task3_2 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
| 2 | Du_NERCSLIP_task3_3 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
| 3 | Du_NERCSLIP_task3_4 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
| 30 | Scheibler_LINE_task3_1 | Scheibler_LINE_task3_report | CNN, Conformer, SSAST, IVA | 4000000 | FOA | log-mel spectra, intensity vector | SpecAugment, FOA rotation, simulation, FSD50K |
| 41 | Park_SGU_task3_1 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
| 40 | Park_SGU_task3_2 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
| 38 | Park_SGU_task3_3 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
| 38 | Park_SGU_task3_4 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
| 35 | Wang_SJTU_task3_1 | Wang_SJTU_task3_report | CRNN, MHSA, ensemble | 538261542 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
| 33 | Wang_SJTU_task3_2 | Wang_SJTU_task3_report | CRNN, Transformer, ensemble | 672127703 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
| 34 | Wang_SJTU_task3_3 | Wang_SJTU_task3_report | CRNN, MHSA, ensemble | 672127703 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
| 36 | Wang_SJTU_task3_4 | Wang_SJTU_task3_report | CRNN, Transformer, ensemble | 805993864 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
| 58 | FalconPerez_Aalto_task3_1 | FalconPerez_Aalto_task3_report | SampleCNN | 713511 | FOA | raw waveform | |
| 52 | FalconPerez_Aalto_task3_2 | FalconPerez_Aalto_task3_report | CRNN | 4709607 | FOA | log-linear magnitude spectra, intensity vector | |
| 59 | FalconPerez_Aalto_task3_3 | FalconPerez_Aalto_task3_report | CRNN | 4709607 | FOA | log-linear magnitude spectra, intensity vector | |
| 11 | Xie_UESTC_task3_1 | Xie_UESTC_task3_report | CRNN | 482551524 | FOA | log-mel spectra, intensity vector | mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout, SpecAugment |
| 15 | Xie_UESTC_task3_2 | Xie_UESTC_task3_report | CRNN | 273011952 | FOA | log-mel spectra, intensity vector | mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout, SpecAugment |
| 13 | Xie_UESTC_task3_3 | Xie_UESTC_task3_report | CRNN | 295482564 | FOA | log-mel spectra, intensity vector | mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout, SpecAugment |
| 12 | Xie_UESTC_task3_4 | Xie_UESTC_task3_report | CRNN | 660798176 | FOA | log-mel spectra, intensity vector | mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout, SpecAugment |
| 54 | Kim_KU_task3_1 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference, intensity vector | Specmix |
| 46 | Kim_KU_task3_2 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference, intensity vector | Specmix |
| 49 | Kim_KU_task3_3 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference, intensity vector | Specmix |
| 10 | Hu_IACAS_task3_1 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, SpecAugment, rotation, random crop, frequency shifting |
| 6 | Hu_IACAS_task3_2 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, SpecAugment, rotation, random crop, frequency shifting |
| 5 | Hu_IACAS_task3_3 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, SpecAugment, rotation, random crop, frequency shifting |
| 9 | Hu_IACAS_task3_4 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, SpecAugment, rotation, random crop, frequency shifting |
| 65 | Chen_SHU_task3_1 | Chen_SHU_task3_report | CRNN, self-attention | 2918925 | FOA | log-mel spectra, intensity vector | |
| 55 | Wu_NKU_task3_1 | Wu_NKU_task3_report | CRNN | 1920757 | FOA | log-mel spectra, intensity vector, variable-Q transform (VQT) | |
| 53 | Wu_NKU_task3_2 | Wu_NKU_task3_report | CRNN | 10364997 | FOA | log-mel spectra, intensity vector, variable-Q transform (VQT) | block mixing |
| 57 | Wu_NKU_task3_3 | Wu_NKU_task3_report | CRNN | 1922485 | MIC | log-mel spectra, GCC, variable-Q transform (VQT) | |
| 56 | Wu_NKU_task3_4 | Wu_NKU_task3_report | CRNN | 10366725 | MIC | log-mel spectra, GCC, variable-Q transform (VQT) | block mixing |
| 39 | Han_KU_task3_1 | Han_KU_task3_report | SE-ResNet34, GRU | 6047746 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, SpecAugment |
| 43 | Han_KU_task3_2 | Han_KU_task3_report | SE-ResNet34, GRU | 6047746 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, SpecAugment |
| 8 | Han_KU_task3_3 | Han_KU_task3_report | SE-ResNet34, GRU | 24190984 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, SpecAugment |
| 7 | Han_KU_task3_4 | Han_KU_task3_report | SE-ResNet34, GRU | 24190984 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, SpecAugment |
| 24 | Ko_KAIST_task3_1 | Ko_KAIST_task3_report | CRNN | 160775516 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mixup, frame shift |
| 23 | Ko_KAIST_task3_2 | Ko_KAIST_task3_report | CRNN | 44050908 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mixup, frame shift |
| 25 | Ko_KAIST_task3_3 | Ko_KAIST_task3_report | CRNN | 44250060 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mixup, frame shift |
| 26 | Ko_KAIST_task3_4 | Ko_KAIST_task3_report | CRNN | 44250060 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mixup, frame shift |
| 58 | Kapka_SRPOL_task3_1 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augmentation |
| 50 | Kapka_SRPOL_task3_2 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augmentation |
| 51 | Kapka_SRPOL_task3_3 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augmentation |
| 48 | Kapka_SRPOL_task3_4 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augmentation |
| 60 | Zhaoyu_LRVT_task3_1 | Zhaoyu_LRVT_task3_report | CNN, Conformer, MLP | 30.35M | FOA | log-mel spectra, intensity vector | SpecAugment, time-frequency masking, audio channel swapping, reverb simulation |
| 64 | Zhaoyu_LRVT_task3_2 | Zhaoyu_LRVT_task3_report | CNN, Conformer, MLP | 30.35M | FOA | log-mel spectra, intensity vector | SpecAugment, time-frequency masking, audio channel swapping |
| 62 | Zhaoyu_LRVT_task3_3 | Zhaoyu_LRVT_task3_report | CNN, LSTM, U-Net | 17.42M | FOA | log-mel spectra, intensity vector | SpecAugment, time-frequency masking, audio channel swapping |
| 61 | Zhaoyu_LRVT_task3_4 | Zhaoyu_LRVT_task3_report | CRNN, MLP | 2.35M | FOA | log-mel spectra, intensity vector | SpecAugment, time-frequency masking, audio channel swapping, reverb simulation |
| 44 | Xie_XJU_task3_1 | Xie_XJU_task3_report | CRNN | 116118 | FOA | log-mel spectra, intensity vector | |



Technical reports

JLESS SUBMISSION TO DCASE2022 TASK3: DYNAMIC KERNEL CONVOLUTION NETWORK WITH DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION IN REAL SPACE

Siwei Huang1, Jisheng Bai1,2, Yafei Jia1, Mou Wang1, Jianfeng Chen1,2
1Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 2LianFeng Acoustic Technologies Co., Ltd. Xi’an, China

Abstract

This technical report describes our proposed system for DCASE2022 Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. In our approach, we first introduce a dynamic kernel convolution module after the convolution blocks to dynamically model channel-wise features with different receptive fields. We then incorporate the SELDnet and EINV2 frameworks into the proposed SELD system with multi-track ACCDOA output. Finally, we use different strategies in the training stage to improve the generalization of the system in realistic environments. Moreover, we apply data augmentation methods to balance the sound event classes in the dataset and generate more spatial audio files to augment the training data. Experimental results show that the proposed systems outperform the baseline on the development dataset of DCASE2022 Task 3.
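
The report does not spell out the dynamic kernel module here; in the spirit of selective-kernel convolutions, a rough PyTorch sketch of what dynamically mixing receptive fields can look like (the branch kernel sizes and gating layout are assumptions, not the submission's exact design):

```python
import torch
import torch.nn as nn

class DynamicKernelConv(nn.Module):
    """Mix parallel conv branches of different kernel sizes via channel-wise
    attention weights predicted from globally pooled features (a sketch)."""
    def __init__(self, channels, kernel_sizes=(3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels * len(kernel_sizes), 1))

    def forward(self, x):                                   # x: (B, C, H, W)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        w = self.gate(x).view(x.size(0), len(self.branches), -1, 1, 1)
        w = torch.softmax(w, dim=1)          # per-channel weights over branches
        return (w * feats).sum(dim=1)
```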

PDF

GLFE: GLOBAL-LOCAL FUSION ENHANCEMENT FOR SOUND EVENT LOCALIZATION AND DETECTION

Zhengyu Chen, Qinghua Huang
Shanghai University, Shanghai, China

Abstract

Sound event localization and detection (SELD), as a combination of the sound event detection (SED) task and the direction-of-arrival (DOA) estimation task, aims at detecting different sound events and obtaining their corresponding localization information simultaneously. Increasingly capable systems are required for increasingly complex acoustic environments. In this paper, our method, called global-local fusion enhancement (GLFE), is presented for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. It can be regarded as a convolution enhancement method. First, multiple feature cross fusion (MFCF) based on different local receptive fields is proposed. Considering the diversity of real sound events, a self-attention network (SANet) integrating global information into local features is introduced to help the system obtain more useful information. Further, skip fusion enhancement (SFE) is explored to fuse features of different levels through skip connections in order to improve the feature representation. On the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) development dataset, the proposed system shows a significant improvement over the baseline system. The experiments were conducted only on the first-order Ambisonics (FOA) dataset.

PDF

Polyphonic Sound Event Localization and Detection Using Convolutional Neural Networks and Self-Attention with Synthetic and Real Data

Yeongseo Shin, Kangmin Kim, Chanjun Chun
Chosun University, Gwangju, Korea

Abstract

This technical report describes the system submitted to DCASE 2022 Task 3: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. The goal of Task 3 is to detect the occurrence of sound events belonging to specific target classes in a real spatial sound scene, track their temporal activity, and estimate their direction or location of arrival. In the given dataset, synthetic and real data exist together, and only a very small amount of real data exists compared with the synthetic data. In this study, we developed a method utilizing a multi-generator and another applying SpecAugment as a data augmentation method to address the imbalance in the amount of data. In our network architecture, a Transformer encoder was added to the Convolutional Recurrent Neural Network (CRNN) structure commonly used in SELD. Training a single model and applying an ensemble both confirmed improved performance compared with the baseline system.

PDF

THE NERC-SLIP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2022 CHALLENGE

Qing Wang1, Li Chai2, Huaxin Wu2, Zhaoxu Nian1, Shutong Niu1, Siyuan Zheng1, Yuyang Wang1, Lei Sun2, Yi Fang2, Jia Pan2, Jun Du1, Chin-Hui Lee3
1University of Science and Technology of China, Hefei, China, 2iFLYTEK, Hefei, China, 3Georgia Institute of Technology, Atlanta, USA

Abstract

This technical report describes our submission system for Task 3 of the DCASE2022 Challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. Compared with the official baseline system, the improvements of our method lie mainly in three aspects: data augmentation, a more powerful network architecture, and model ensembling. First, our previous work shows that the audio channel swapping (ACS) technique [1] can effectively deal with data sparsity in the SELD task; it is utilized in our method and provides an effective improvement with limited real training data. In addition, we generate multichannel recordings using public datasets and perform data cleaning to drop bad data. Then, based on the augmented data, we employ a ResNet-Conformer architecture, which can better model the context dependencies within an audio sequence. Specifically, we found that time resolution had a significant impact on model performance: by moving the time pooling layer back, the model obtains a higher feature resolution and achieves better results. Finally, to attain robust performance, we employ a model ensemble over different target representations (e.g., activity-coupled Cartesian direction of arrival (ACCDOA) and multi-ACCDOA) and post-processing strategies. The proposed system is evaluated on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset.
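
For reference, ACS augments FOA recordings with channel swaps and sign flips that correspond to exact rotations or reflections of the sound scene, with the DOA labels transformed accordingly. A minimal sketch of one such pattern, assuming ACN channel order (W, Y, Z, X):

```python
import numpy as np

def foa_rotate_minus_90(audio, azi_deg, ele_deg):
    """Rotate an FOA scene by -90 degrees in azimuth (one ACS pattern).

    W and Z are invariant to rotations about the vertical axis; the X/Y
    dipole channels swap with a sign flip, and the DOA label rotates
    with the scene.
    """
    w, y, z, x = audio                       # audio shape: (4, n_samples)
    rotated = np.stack([w, -x, z, y])        # new (W, Y, Z, X)
    new_azi = (azi_deg - 90.0 + 180.0) % 360.0 - 180.0
    return rotated, new_azi, ele_deg
```

Combining several such rotations and reflections multiplies the amount of spatially distinct training data without any resynthesis.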

PDF

CURRICULUM LEARNING WITH AUDIO DOMAIN DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION

Ricardo Falcon-Perez
Aalto University, Espoo, Finland

Abstract

In this report we explore a variety of data augmentation techniques in the audio domain, along with a curriculum learning approach, for sound event localization and detection (SELD) tasks. We focus our work on two areas: 1) techniques that modify timbral or temporal characteristics of all channels simultaneously, such as equalization or added noise; 2) methods that transform the spatial impression of the full sound scene, such as directional loudness modifications. We test the approach on models using either time-frequency or raw audio features, trained and evaluated on the STARSS22 (Sony-TAu Realistic Spatial Soundscapes 2022) dataset. Although the proposed system struggles to beat the official benchmark system, the augmentation techniques show improvements over our non-augmented baseline.
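
A key constraint in the first family of techniques is that the transformation must be shared across channels so that inter-channel relations, and hence the DOA labels, are preserved. A minimal sketch with illustrative parameter ranges (not the report's exact settings):

```python
import numpy as np

def augment_scene(audio, rng):
    """Label-preserving timbral augmentation of a multichannel scene.

    One gain is shared by all channels, so inter-channel level and phase
    relations stay intact; the independent low-level noise acts like an
    added diffuse noise floor.
    """
    gain = 10.0 ** (rng.uniform(-6.0, 6.0) / 20.0)          # +/- 6 dB overall
    noise_std = 10.0 ** (rng.uniform(-60.0, -40.0) / 20.0)  # quiet floor
    return gain * audio + rng.normal(0.0, noise_std, size=audio.shape)
```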

PDF

TACCNN: TIME-ALIGNMENT COMPLEX CONVOLUTIONAL NEURAL NETWORK

Kaibin Guo, Runyu Shi, Tianrui He, Nian Liu, Junfei Yu
Xiaomi, Beijing, China

Abstract

In this technical report, we present our system submitted to the DCASE2022 Challenge Task 3: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. First, we review well-known deep learning methods for SELD and point out that these works have ignored time alignment from the perspective of the arrival time of the signal, and that amplitude and phase are modelled separately. We therefore put forward a new model, the Time-Alignment Complex Convolutional Neural Network (TACCNN). In our model, we use 3DCNN or ConvLSTM layers to align the features from the different microphones. Moreover, we combine the mel spectrogram and the intensity vector into a complex vector and extract salient features from this new representation using a complex convolutional neural network. Lastly, we apply a Bi-GRU with self-attention to extract relative information about the sound events and determine the rotation of each sound event. The results show that the time alignment block greatly improves the performance of the CNN-GRU model, while the complex convolutional neural network performs similarly to its real-valued counterpart; more experiments are needed to clarify the role of complex convolutions.

PDF

A ROBUST FRAMEWORK FOR SOUND EVENT LOCALIZATION AND DETECTION ON REAL RECORDINGS

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
School of Industrial and Management Engineering, Korea University, Seoul, South Korea

Abstract

This technical report describes the systems submitted to the DCASE2022 Challenge Task 3: sound event localization and detection (SELD). The task aims to detect occurrences of sound events, specify their class, and furthermore estimate their position. Our system utilizes a ResNet-based model under a proposed robust framework for SELD. To guarantee generalized performance on real-world sound scenes, we design the overall framework with augmentation techniques, a pipeline for mixing datasets of real-world sound scenes and emulations, and test-time augmentation. The augmentation techniques and the exploitation of external sound sources enable training on diverse samples while preserving enough real-world context by maintaining the number of real recording samples in each batch. In addition, we design a test-time augmentation and a clustering-based model ensemble method to aggregate confident predictions. Experimental results show that the model under the proposed framework outperforms the baseline methods and achieves competitive performance on real-world sound recordings.

Awards: Judges’ award

PDF

SOUND EVENT LOCALIZATION AND DETECTION FOR REAL SPATIAL SOUND SCENES: EVENT-INDEPENDENT NETWORK AND DATA AUGMENTATION CHAINS

Jinbo Hu1,2, Yin Cao3, Ming Wu1, Qiuqiang Kong4, Feiran Yang1, Mark D. Plumbley5, Jun Yang1,2
1Institute of Acoustics, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Xi’an Jiaotong-Liverpool University, Suzhou, China, 4ByteDance, Shanghai, China, 5Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK

Abstract

Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with their corresponding temporal activities and spatial locations. In DCASE 2022 Task 3, the data transition from computationally generated spatial recordings to recordings of real sound scenes. Our system submitted to DCASE 2022 Task 3 is based on our previously proposed Event-Independent Network V2 (EINV2) and a novel data augmentation method. To detect different sound events of the same type at different locations, our method employs EINV2, combining a track-wise output format, permutation-invariant training, and soft parameter sharing. EINV2 is also extended with Conformer structures to learn local and global patterns. To improve the generalization ability of the model, we use a data augmentation approach containing several data augmentation chains, which are composed of random combinations of several different data augmentation operations. To mitigate the lack of real-scene recordings in the development dataset and the class imbalance of sound events, we exploit FSD50K, AudioSet, and the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) to generate simulated datasets for training. The results show that our system improves over the baseline system on the dev-set-test of Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22).
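
The augmentation chains compose randomly chosen operations in random order. A minimal sketch of the chaining logic, with trivial stand-in operations (the report's chains draw from mixup, SpecAugment, rotation, random crop, and frequency shifting):

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real, label-consistent augmentation operations.
def gain(x):  return x * 10 ** (rng.uniform(-6, 6) / 20)
def noise(x): return x + rng.normal(0.0, 1e-3, x.shape)
OPS = [gain, noise]

def augmentation_chain(x, max_ops=2):
    """Apply a random subset of operations in random order."""
    for op in random.sample(OPS, k=random.randint(0, max_ops)):
        x = op(x)
    return x
```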

PDF

TRACK-WISE ENSEMBLE OF CRNN MODELS WITH MULTI-TASK ADPIT FOR SOUND EVENT LOCALIZATION AND DETECTION

Sang-Ick Kang, Myungchul Keum, Kyongil Cho, Yeonseok Park
KT Corporation, South Korea

Abstract

This report describes our systems submitted to the DCASE2022 Challenge Task 3: Sound Event Localization and Detection (SELD) with directional interference. Localizing and detecting sound events consists of two subtasks, detecting sound events and estimating their direction of arrival, performed simultaneously; it is therefore often difficult to jointly optimize the two at the same time. We propose a track-wise ensemble model that combines a multi-task-based auxiliary duplicating permutation invariant training (ADPIT) model and a multi-ACCDOA-based model. Specifically, we propose a novel method to ensemble CRNN multi-task models, event-independent network v2 (EINV2)-based multi-task models, and CRNN multi-ACCDOA models. Experimental results on the DCASE2022 SELD dataset show that the deep-learning model trained with this approach significantly outperforms the DCASE challenge baseline.

PDF

COLOC: CONDITIONED LOCALIZER AND CLASSIFIER FOR SOUND EVENT LOCALIZATION AND DETECTION

Slawomir Kapka
Samsung R&D, Warsaw, Poland

Abstract

This technical report for DCASE2022 Task 3 describes the Conditioned Localizer and Classifier (CoLoC), a novel solution for Sound Event Localization and Detection (SELD). The solution consists of two stages: localization is done first and is followed by classification conditioned on the output of the localizer. To resolve the problem of an unknown number of sources, we incorporate an idea borrowed from Sequential Set Generation (SSG). Models in both stages are SELDnet-like CRNNs, but with single outputs. We argue that two such single-output models are fit for the SELD task. We show that our solution improves on the baseline system in most metrics on the STARSS22 dataset.
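
A sketch of the SSG-style inference loop implied by the abstract: sources are extracted one at a time, each stage conditioned on what has already been emitted, until the localizer reports no further source (`localizer` and `classifier` are placeholder callables, not the submission's actual interfaces):

```python
def coloc_inference(features, localizer, classifier, max_sources=3):
    """Sequentially generate (DOA, class) pairs for an unknown source count."""
    sources = []
    for _ in range(max_sources):
        doa, active = localizer(features, condition=sources)
        if not active:          # stop once no further source is detected
            break
        sources.append((doa, classifier(features, condition=doa)))
    return sources
```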

PDF

CONVNEXT AND CONFORMER FOR SOUND EVENT LOCALIZATION AND DETECTION

Gwantae Kim, Hanseok Ko
Korea University, Seoul, South Korea

Abstract

This technical report describes our system participating in the DCASE 2022 Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes challenge. The system consists of convolutional neural networks and a multi-head self-attention mechanism. The convolutional networks consist of depth-wise and point-wise convolution layers, like the ConvNeXt block. The structure with the multi-head self-attention mechanism is based on the Conformer model, which combines convolution layers with multi-head self-attention. In the training phase, regularization methods such as Specmix, DropPath, and Dropout are used to improve generalization performance. Multi-ACCDOA is used as the output format, as it is well suited to the sound event localization and detection task. Our systems demonstrate an improvement over the baseline system.

PDF

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

Byeong-Yun Ko, Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Seung-Deok Choi, Yong-Hwa Park
Korea Advanced Institute of Science and Technology, Daejeon, South Korea

Abstract

The performance of sound event localization and detection (SELD) in real scenes is limited by the small size of SELD datasets, owing to the difficulty of obtaining a sufficient amount of realistic multi-channel audio recordings with accurate labels. We used two main strategies to address the problems arising from the small real SELD dataset. First, we applied various data augmentation methods on all data dimensions: channel, frequency, and time. We also propose an original data augmentation method named Moderate Mixup in order to simulate situations where a noise floor or interfering events exist. Second, we applied Squeeze-and-Excitation blocks on the channel and frequency dimensions to efficiently extract feature characteristics. Our trained models achieved best ER, F1, LE, and LR of 0.53, 49.8%, 16.0°, and 56.2%, respectively, on the STARSS22 test dataset.
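
Squeeze-and-excitation along a non-channel axis works the same way as the standard channel variant: pool over the other dimensions, then predict a per-bin weight. A PyTorch sketch of the frequency-axis case, assuming a batch × channel × frequency × time layout (not the submission's exact module):

```python
import torch
import torch.nn as nn

class FreqSE(nn.Module):
    """Squeeze-and-excitation over the frequency axis (a sketch)."""
    def __init__(self, n_freq, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_freq, n_freq // reduction), nn.ReLU(),
            nn.Linear(n_freq // reduction, n_freq), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, channel, freq, time)
        s = x.mean(dim=(1, 3))             # squeeze to (batch, freq)
        return x * self.fc(s)[:, None, :, None]   # re-weight each freq bin
```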

PDF

SOUND EVENT LOCALIZATION AND DETECTION BASED ON CROSS-MODAL ATTENTION AND SOURCE SEPARATION

Jin-Young Park1, Do-Hui Kim1, Bon Hyeok Ku1, Jun Hyung Kim1, Jaehun Kim2, Kisung Kim2, Hyungchan Yoo2, Kisik Chang2, Hyung-Min Park1
1Sogang University, Seoul, South Korea, 2AI Lab, IVS Inc., Seoul, South Korea

Abstract

Sound event localization and detection (SELD) is a task that combines sound event detection (SED) and direction-of-arrival (DOA) estimation (DOAE). This year's SELD task focuses on evaluation on real spatial scenes, which raises the difficulty for two reasons: 1) an increase in overlapped events, and 2) noise-like events combined with real noise. To overcome this, we applied source separation and improved data synthesis logic to our basic DCMA-SELD model, which utilizes dual cross-modal attention (DCMA) and soft parameter sharing between the SED and DOAE streams to simultaneously detect and localize sound events. To improve the SELD performance on male/female speech, which accounts for a large portion of the input sounds, source separation was performed to separate speech signals from other sounds. Regarding the data synthesis logic, sound events that occur in real life may have some regularity, such as a laugh event occurring in people's conversations or background music with a long duration. Instead of synthesizing data by mixing random sound events at random times, we therefore added several rules to simulate more natural data from which the context of the events can be learned. Experimental results on validation data showed that our proposed approach successfully improved performance on this task focusing on real spatial scenes.

PDF

STARSS22: A DATASET OF SPATIAL RECORDINGS OF REAL SCENES WITH SPATIOTEMPORAL ANNOTATIONS OF SOUND EVENTS

Archontis Politis1, Kazuki Shimada2, Parthasaarathy Sudarsanam1, Sharath Adavanne1, Daniel Krause1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Yuki Mitsufuji2, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan

Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprising spatial recordings of real scenes collected in various interiors at two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed, including the recording and annotation process, the target classes and their presence, and details of the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baselines of previous iterations; namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that, with a suitable training strategy, reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.
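
The baseline's FOA input features combine log-mel spectra of the four channels with a mel-aggregated acoustic intensity vector, whose direction points toward the dominant source per time-frequency bin. A rough sketch under assumed parameters (normalization details differ from the exact baseline code):

```python
import numpy as np
import librosa

def foa_logmel_iv(audio, sr=24000, n_fft=1024, hop=480, n_mels=64):
    """Log-mel spectra plus mel-band intensity vector for FOA input.

    Assumes ACN channel order (W, Y, Z, X); values are illustrative.
    """
    spec = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop)
                     for ch in audio])                     # (4, F, T) complex
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    logmel = np.log(mel_fb @ (np.abs(spec) ** 2) + 1e-8)   # (4, M, T)
    # Acoustic intensity Re{W* . (X, Y, Z)} points toward the active source.
    iv = np.real(np.conj(spec[0]) * spec[[3, 1, 2]])       # (3, F, T)
    iv /= np.linalg.norm(iv, axis=0, keepdims=True) + 1e-8
    return np.concatenate([logmel, mel_fb @ iv])           # (7, M, T)
```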

PDF

3D CNN AND CONFORMER WITH AUDIO SPECTROGRAM TRANSFORMER FOR SOUND EVENT DETECTION AND LOCALIZATION

Robin Scheibler, Tatsuya Komatsu, Yusuke Fujita, Michael Hentschel
LINE Corporation, Tokyo, Japan

Abstract

We propose a network for sound event detection and localization based on a 3D CNN followed by several Conformer layers: the CNN performs spatial feature extraction, and the subsequent Conformer layers predict the events and their locations. We combine this with features obtained from a fine-tuned audio spectrogram transformer and a multi-channel separation network trained separately. The two architectures are combined by a linear layer before the final non-linearity. We first train the network on the STARSS22 dataset extended by simulation using events from FSD50K and room impulse responses from previous challenges. To bridge the gap between the simulated dataset and the STARSS22 dataset, we fine-tune the model on the development part of the STARSS22 dataset only before the final evaluation.

PDF

IMPROVING LOW-RESOURCE SOUND EVENT LOCALIZATION AND DETECTION VIA ACTIVE LEARNING WITH DOMAIN ADAPTATION

Yuhao Wang1, Yuxin Duan1, Pingjie Wang1, Yu Wang1,2, Wei Xue3
1Shanghai Jiao Tong University, Shanghai, China, 2Shanghai AI Lab, Shanghai, China, 3Hong Kong Baptist University, Hong Kong, China

Abstract

This report describes our systems submitted to the DCASE2022 Challenge Task 3: sound event localization and detection (SELD) evaluated in real spatial sound scenes. We present two approaches to improve performance on this task. The first is to leverage active learning to bring in and filter the AudioSet dataset based on pre-trained audio neural networks (PANNs). The second is to adapt the generic models to different sound event categories, thereby improving performance on classes with scarce data. We have also explored various model structures incorporating attention mechanisms. Finally, we combine models trained on different input recording formats. Experimental results on the validation set show that the proposed systems greatly improve all metrics when compared to the baseline systems.

PDF

MLP-MIXER ENHANCED CRNN FOR SOUND EVENT LOCALIZATION AND DETECTION IN DCASE 2022 TASK 3

Shichao Wu1,2,3, Shouwang Huang1,2,3, Zicheng Liu1,2,3, Jingtai Liu1,2,3
1College of Artificial Intelligence, Nankai University, Tianjin, China, 2Institute of Robotics and Automatic Information System, Nankai University, Tianjin, China, 3Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China

Abstract

In this technical report, we give the details of our MLP-Mixer enhanced convolutional recurrent neural network (CRNN) systems submitted to the sound event localization and detection challenge in DCASE 2022. Specifically, we present two improvements over the baseline methods, concerning the input features and the model structure. For the input features, we introduce the variable-Q transform (VQT) for both the Ambisonics (FOA) and microphone array (MIC) audio representations. For the network design, we improve the original CRNN by inserting a shallow MLP-Mixer module between the convolution filters and the recurrent layers to model inter-channel audio patterns more elaborately, which we believe is conducive to estimating the sound events' directions of arrival (DOA). Experiments on the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) benchmark dataset show that our system outperforms the DCASE baseline method.
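
For the MIC format, inter-channel GCC features such as the GCC listed in the system characteristics table are computed per microphone pair. A minimal GCC-PHAT sketch from pre-computed complex STFTs (parameter values are illustrative):

```python
import numpy as np

def gcc_phat(stft_i, stft_j, n_lags=64):
    """GCC-PHAT feature map for one microphone pair.

    The cross-power spectrum is phase-normalised (PHAT weighting) and
    inverse-FFT'ed; the central lags form a per-frame feature analogous
    to a spectrogram.
    """
    cross = stft_i * np.conj(stft_j)                 # (F, T)
    cross /= np.abs(cross) + 1e-8                    # keep phase only
    cc = np.fft.irfft(cross, axis=0)                 # lag domain, (N, T)
    return np.concatenate([cc[-n_lags // 2:], cc[:n_lags // 2]])  # centre lags
```

Stacking one such map per microphone pair alongside the log-mel spectra yields the kind of multi-channel input the MIC systems describe.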

PDF

ENSEMBLE OF ATTENTION BASED CRNN FOR SOUND EVENT DETECTION AND LOCALIZATION

Rong Xie, Chuang Shi, Le Zhang, Yunxuan Liu and Huiyong Li
University of Electronic Science and Technology of China, Chengdu, China

Abstract

This report describes our submitted systems for the sound event localization and detection (SELD) task of DCASE 2022, which are implemented as multi-task learning. Soft-parameter-sharing convolutional recurrent neural networks (CRNNs) with split attention (SA), a convolutional block attention module (CBAM), and coordinate attention (CA) are trained and ensembled to solve the SELD task. To improve model generalization, angle noise and mini-batch time-frequency noise are introduced, and mini-batch mixup, FOA rotation, frequency shifting, random cutout, and SpecAugment are adopted. The proposed systems perform better than the baseline system on the development dataset.

PDF

SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING TIME-FREQUENCY ATTENTION AND CRISS-CROSS ATTENTION

Yin Xie1,2, Ying Hu1,2, Yunlong Li1,2, Shijing Hou1,2, Xiujuan Zhu1,2, Zihao Chen1,2, Liusong Wang1,2, Mengzhen Ma1,2
1Xinjiang University, School of Information Science and Engineering, Urumqi, China, 2Key Laboratory of Signal Detection and Processing in Xinjiang, Urumqi, China

Abstract

This report describes our systems submitted to the DCASE2022 Challenge Task 3: sound event localization and detection (SELD). We design a CRNN based on an asymmetric convolution mechanism with a Time-Frequency Attention (TFA) module and a Criss-Cross Attention (CCA) module, which performs well on SELD in complex real sound scenes. On the development dataset of the DCASE 2022 Task 3, our systems demonstrate a significant improvement over the baseline system. Only the first-order Ambisonics (FOA) dataset was considered in these experiments.

PDF

SOUND EVENT LOCALIZATION AND DETECTION COMBINED CONVOLUTIONAL CONFORMER STRUCTURE AND MULTI-ACCDOA STRATEGIES

Zhaoyu Yan, Jin Wang, Lin Yang, Junjie Wang
Lenovo Research, Beijing, China

Abstract

The sound event localization and detection (SELD) task aims to identify audio sources' direction of arrival (DOA) and their corresponding class. The SELD task was originally treated as a multi-task learning problem, with separate DOA estimation and sound event detection (SED) branches. Single-target methods were introduced recently as more end-to-end solutions and achieve better SELD performance. The activity-coupled Cartesian DOA (ACCDOA) vector was first introduced as a single SELD training target, and multi-ACCDOA with an auxiliary duplicating permutation invariant training (ADPIT) loss handles the situation where the same event class occurs at multiple locations. In this challenge, we combined a convolutional Conformer structure with the multi-ACCDOA training target and the ADPIT strategy. With multiple data augmentation methods adopted, the proposed method achieves a promising SELD improvement compared to the baseline CRNN result.
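
For reference, an ACCDOA target encodes class activity and DOA in a single vector whose length is the activity and whose direction is the DOA; multi-ACCDOA replicates this over a fixed number of tracks so that same-class events at different locations can coexist, with ADPIT resolving the track permutation during training. A minimal sketch of the target layout, assuming the 13-class STARSS22 setup and 3 tracks as in the common multi-ACCDOA configuration:

```python
import numpy as np

N_TRACKS, N_CLASSES = 3, 13   # STARSS22 defines 13 target classes

def multi_accdoa_frame(events):
    """Build one frame of a multi-ACCDOA target.

    `events` is a list of (track, class_idx, xyz) tuples; inactive
    class/track slots stay at the zero vector, so vector length encodes
    activity and direction jointly (ACCDOA), separately per track.
    """
    target = np.zeros((N_TRACKS, N_CLASSES, 3), dtype=np.float32)
    for track, cls, xyz in events:
        xyz = np.asarray(xyz, dtype=np.float32)
        target[track, cls] = xyz / np.linalg.norm(xyz)   # unit DOA vector
    return target
```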

PDF