Task description
The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into a known set of sound classes, and further localize the events in space while they are active.
The focus of the current SELD task is developing systems that can perform adequately on real sound scene recordings with a small amount of training data. The task provides two datasets, development and evaluation, recorded in multiple rooms across two different sites. Of the two datasets, only the development dataset provides reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their system on the unseen evaluation dataset.
More details on the task setup and evaluation can be found in the task description page.
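The tables below report four metrics: a location-dependent error rate and F-score (a detection counts as a true positive only if the class matches and the predicted direction is within 20° of the reference) and a class-dependent localization error and recall. As a rough illustration of how such location-aware matching works, here is a simplified per-frame sketch; the official challenge metrics are segment-based with a more careful matching procedure, so this is illustrative only.

```python
import numpy as np

def angular_distance_deg(u, v):
    """Great-circle angle in degrees between two unit DOA vectors."""
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def match_frame(refs, preds, threshold_deg=20.0):
    """Greedy class-wise matching of reference and predicted events in one frame.

    refs, preds: lists of (class_id, unit_doa_vector) pairs.
    Returns (tp, loc_errors, n_refs, n_preds) for metric accumulation.
    """
    tp, loc_errors = 0, []
    used = [False] * len(preds)
    for ref_cls, ref_doa in refs:
        best_j, best_err = None, None
        for j, (cls, doa) in enumerate(preds):
            if used[j] or cls != ref_cls:
                continue
            err = angular_distance_deg(ref_doa, doa)
            if best_err is None or err < best_err:
                best_j, best_err = j, err
        if best_j is not None:
            used[best_j] = True
            loc_errors.append(best_err)    # LE and LR count every class match...
            if best_err <= threshold_deg:  # ...but a true positive also needs <= 20 deg
                tp += 1
    return tp, loc_errors, len(refs), len(preds)

# Accumulated over all frames: F-score(20°) = 2*TP / (2*TP + FP + FN),
# LE = mean(loc_errors), LR = len(loc_errors) / n_refs.
```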
Teams ranking
The SELD task received 63 submissions in total from 19 teams across the world. The following table includes only the best-performing system per submitting team. Confidence intervals are also reported for each metric on the evaluation set results; a sketch of how such intervals are typically computed follows the table.
| Submission name | Corresponding author | Affiliation | Technical Report | Best official system rank | Error Rate (20°) (Eval) | F-score (20°) (Eval) | Localization error (°) (Eval) | Localization recall (Eval) | Error Rate (20°) (Dev) | F-score (20°) (Dev) | Localization error (°) (Dev) | Localization recall (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Du_NERCSLIP_task3_2 | Jun Du | University of Science and Technology of China | Du_NERCSLIP_task3_report | 1 | 0.35 (0.30 - 0.41) | 58.3 (53.8 - 64.7) | 14.6 (12.8 - 16.5) | 73.7 (68.7 - 78.2) | |||||
Hu_IACAS_task3_3 | Jinbo Hu | Institute of Acoustics, Chinese Academy of Sciences | Hu_IACAS_task3_report | 5 | 0.39 (0.34 - 0.44) | 55.8 (51.2 - 61.1) | 16.2 (14.6 - 17.8) | 72.4 (67.3 - 77.2) | 0.53 | 48.1 | 17.8 | 62.6 | |
Han_KU_task3_4 | Sung Won Han | Korea University | Han_KU_task3_report | 7 | 0.37 (0.31 - 0.42) | 49.7 (44.4 - 56.6) | 16.5 (14.8 - 18.0) | 70.7 (65.8 - 76.1) | 0.39 | 59.5 | 13.0 | 73.7 | |
Xie_UESTC_task3_1 | Rong Xie | University of Electronic Science and Technology of China | Xie_UESTC_task3_report | 11 | 0.48 (0.41 - 0.55) | 48.6 (42.5 - 55.4) | 17.6 (16.0 - 19.2) | 73.5 (68.0 - 77.6) | 0.44 | 58.0 | 12.9 | 68.0 | |
Bai_JLESS_task3_4 | Jisheng Bai | Northwestern Polytechnical University | Bai_JLESS_task3_report | 14 | 0.47 (0.40 - 0.54) | 49.3 (41.8 - 57.1) | 16.9 (15.0 - 18.9) | 67.9 (59.3 - 73.3) | 0.48 | 52.2 | 16.9 | 70.7 | |
Kang_KT_task3_2 | Sang-Ick Kang | KT Corporation | Kang_KT_task3_report | 17 | 0.47 (0.40 - 0.53) | 45.9 (40.1 - 52.6) | 15.8 (13.6 - 18.0) | 59.3 (50.3 - 65.1) | 0.48 | 51.3 | 16.4 | 67.7 | |
FOA_Baseline_task3_1 | Archontis Politis | Tampere University | Politis_TAU_task3_report | 42 | 0.61 (0.57 - 0.65) | 23.7 (18.7 - 29.4) | 22.9 (21.0 - 26.0) | 51.4 (46.2 - 55.2) | 0.71 | 21.0 | 29.3 | 46.0 | |
Chun_Chosun_task3_3 | Chanjun Chun | Chosun University | Chun_Chosun_task3_report | 27 | 0.59 (0.52 - 0.66) | 31.0 (25.9 - 36.3) | 19.8 (17.3 - 22.6) | 50.7 (42.2 - 56.3) | 0.59 | 35.0 | 33.8 | 57.0 | |
Guo_XIAOMI_task3_2 | Kaibin Guo | Xiaomi | Guo_XIAOMI_task3_report | 33 | 0.60 (0.53 - 0.67) | 28.2 (22.8 - 34.1) | 23.8 (21.3 - 26.2) | 52.1 (43.4 - 58.1) | 0.61 | 29.0 | 23.5 | 49.0 | |
Scheibler_LINE_task3_1 | Robin Scheibler | LINE Corporation | Scheibler_LINE_task3_report | 30 | 0.62 (0.55 - 0.69) | 30.4 (25.2 - 36.3) | 16.7 (14.0 - 19.5) | 49.2 (42.1 - 54.5) | 0.50 | 51.1 | 16.7 | 63.4 | |
Park_SGU_task3_4 | Hyung-Min Park | Sogang University | Park_SGU_task3_report | 38 | 0.60 (0.53 - 0.67) | 30.6 (25.2 - 36.4) | 21.6 (17.8 - 25.1) | 45.9 (40.3 - 51.0) | 0.62 | 46.8 | 25.1 | 78.2 | |
Wang_SJTU_task3_2 | Yu Wang | Shanghai Jiao Tong University | Wang_SJTU_task3_report | 33 | 0.67 (0.60 - 0.74) | 27.0 (19.3 - 33.6) | 24.4 (22.0 - 27.1) | 60.3 (53.8 - 65.3) | 0.46 | 61.8 | 11.4 | 68.4 | |
FalconPerez_Aalto_task3_2 | Ricardo Falcon-Perez | Aalto University | FalconPerez_Aalto_task3_report | 52 | 0.73 (0.67 - 0.79) | 21.8 (15.5 - 27.6) | 24.4 (21.7 - 27.1) | 43.1 (35.7 - 48.7) | 0.74 | 23.0 | 27.4 | 45.0 | |
Kim_KU_task3_2 | Gwantae Kim | Korea University | Kim_KU_task3_report | 46 | 0.74 (0.66 - 0.81) | 24.1 (19.8 - 28.9) | 26.6 (23.4 - 29.8) | 55.1 (48.6 - 59.5) | 0.66 | 30.0 | 22.5 | 49.0 | |
Chen_SHU_task3_1 | Zhengyu Chen | Shanghai University | Chen_SHU_task3_report | 65 | 1.00 (1.00 - 1.00) | 0.3 (0.1 - 0.6) | 60.3 (45.4 - 94.0) | 4.5 (2.9 - 6.3) | 0.71 | 27.0 | 26.7 | 48.0 | |
Wu_NKU_task3_2 | Shichao Wu | Nankai University | Wu_NKU_task3_report | 53 | 0.69 (0.64 - 0.74) | 17.9 (14.4 - 21.5) | 28.5 (24.5 - 39.7) | 44.5 (38.2 - 48.4) | 0.63 | 33.0 | 22.7 | 49.0 | |
Ko_KAIST_task3_2 | Byeong-Yun Ko | Korea Advanced Institute of Science and Technology | Ko_KAIST_task3_report | 23 | 0.49 (0.42 - 0.55) | 39.9 (33.8 - 46.0) | 17.3 (15.3 - 19.3) | 54.6 (46.5 - 60.5) | 0.55 | 46.2 | 16.4 | 54.6 | |
Kapka_SRPOL_task3_4 | Slawomir Kapka | Samsung Research Poland | Kapka_SRPOL_task3_report | 48 | 0.72 (0.65 - 0.79) | 25.5 (21.3 - 30.4) | 25.4 (21.7 - 29.3) | 49.8 (42.8 - 55.3) | |||||
Zhaoyu_LRVT_task3_1 | Zhaoyu Yan | Lenovo Research | Zhaoyu_LRVT_task3_report | 60 | 0.96 (0.88 - 1.00) | 11.2 (8.8 - 13.9) | 31.0 (28.5 - 33.4) | 53.4 (44.4 - 58.9) | 0.58 | 35.0 | 22.5 | 42.0 | |
Xie_XJU_task3_1 | Yin Xie | Xinjiang University | Xie_XJU_task3_report | 44 | 0.66 (0.59 - 0.74) | 25.5 (19.3 - 32.2) | 23.1 (19.9 - 26.4) | 53.1 (42.7 - 59.4) | 0.66 | 34.2 | 22.9 | 57.7
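The confidence intervals in the tables above and below are reported per metric on the evaluation data. One common way to derive such intervals is a nonparametric bootstrap over per-clip scores; the sketch below illustrates that generic procedure, and is not necessarily the organizers' exact method.

```python
import numpy as np

def bootstrap_ci(per_clip_scores, n_resamples=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap interval for the mean of per-clip metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_clip_scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Example: bootstrap_ci([0.61, 0.55, 0.70, 0.58, 0.66]) -> (lower, upper)
```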
Systems ranking
Performance of all the submitted systems on the evaluation and the development datasets. Confidence intervals are also reported for each metric on the evaluation set results.
| Submission name | Technical Report | Official rank | Error Rate (20°) (Eval) | F-score (20°) (Eval) | Localization error (°) (Eval) | Localization recall (Eval) | Error Rate (20°) (Dev) | F-score (20°) (Dev) | Localization error (°) (Dev) | Localization recall (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|
FOA_Baseline_task3_1 | Politis_TAU_task3_report | 42 | 0.61 (0.57 - 0.65) | 23.7 (18.7 - 29.4) | 22.9 (21.0 - 26.0) | 51.4 (46.2 - 55.2) | 0.71 | 21.0 | 29.3 | 46.0 | |
MIC_Baseline_task3_1 | Politis_TAU_task3_report | 45 | 0.61 (0.56 - 0.66) | 21.6 (17.6 - 25.8) | 25.9 (22.6 - 28.5) | 48.1 (36.8 - 54.9) | 0.71 | 21.0 | 32.2 | 47.0 | |
Bai_JLESS_task3_1 | Bai_JLESS_task3_report | 20 | 0.48 (0.41 - 0.54) | 46.0 (38.0 - 54.0) | 16.3 (14.4 - 18.1) | 58.8 (48.3 - 65.2) | 0.48 | 52.4 | 16.1 | 62.1 | |
Bai_JLESS_task3_2 | Bai_JLESS_task3_report | 16 | 0.49 (0.42 - 0.56) | 47.8 (40.2 - 55.3) | 16.9 (14.9 - 18.8) | 66.6 (56.0 - 72.8) | 0.52 | 50.0 | 17.1 | 68.1 | |
Bai_JLESS_task3_3 | Bai_JLESS_task3_report | 19 | 0.46 (0.39 - 0.53) | 46.1 (38.3 - 53.8) | 16.3 (14.6 - 17.9) | 57.8 (46.4 - 64.6) | 0.44 | 54.2 | 16.0 | 65.4 | |
Bai_JLESS_task3_4 | Bai_JLESS_task3_report | 14 | 0.47 (0.40 - 0.54) | 49.3 (41.8 - 57.1) | 16.9 (15.0 - 18.9) | 67.9 (59.3 - 73.3) | 0.48 | 52.2 | 16.9 | 70.7 | |
Chun_Chosun_task3_1 | Chun_Chosun_task3_report | 28 | 0.59 (0.52 - 0.66) | 30.9 (25.9 - 36.2) | 19.7 (17.5 - 21.9) | 50.2 (42.0 - 55.7) | 0.59 | 35.0 | 20.7 | 57.0 | |
Chun_Chosun_task3_2 | Chun_Chosun_task3_report | 31 | 0.60 (0.53 - 0.66) | 30.1 (25.7 - 34.8) | 20.0 (17.8 - 22.3) | 50.2 (41.8 - 55.8) | 0.59 | 34.0 | 24.8 | 58.0 | |
Chun_Chosun_task3_3 | Chun_Chosun_task3_report | 27 | 0.59 (0.52 - 0.66) | 31.0 (25.9 - 36.3) | 19.8 (17.3 - 22.6) | 50.7 (42.2 - 56.3) | 0.59 | 35.0 | 33.8 | 57.0 | |
Chun_Chosun_task3_4 | Chun_Chosun_task3_report | 29 | 0.60 (0.53 - 0.67) | 30.4 (25.2 - 36.0) | 20.2 (17.0 - 22.6) | 50.5 (42.4 - 56.0) | 0.59 | 34.0 | 23.0 | 59.0 | |
Guo_XIAOMI_task3_1 | Guo_XIAOMI_task3_report | 47 | 0.63 (0.57 - 0.69) | 20.2 (16.9 - 24.1) | 22.9 (20.7 - 25.2) | 45.8 (40.4 - 49.7) | 0.63 | 25.0 | 23.9 | 48.0 | |
Guo_XIAOMI_task3_2 | Guo_XIAOMI_task3_report | 33 | 0.60 (0.53 - 0.67) | 28.2 (22.8 - 34.1) | 23.8 (21.3 - 26.2) | 52.1 (43.4 - 58.1) | 0.61 | 29.0 | 23.5 | 49.0 | |
Kang_KT_task3_1 | Kang_KT_task3_report | 21 | 0.47 (0.41 - 0.53) | 44.3 (38.4 - 50.6) | 16.0 (13.8 - 18.2) | 57.7 (49.0 - 63.5) | 0.49 | 53.0 | 15.8 | 68.0 | |
Kang_KT_task3_2 | Kang_KT_task3_report | 17 | 0.47 (0.40 - 0.53) | 45.9 (40.1 - 52.6) | 15.8 (13.6 - 18.0) | 59.3 (50.3 - 65.1) | 0.48 | 51.3 | 16.4 | 67.7 | |
Kang_KT_task3_3 | Kang_KT_task3_report | 18 | 0.46 (0.40 - 0.52) | 45.4 (39.4 - 51.5) | 15.8 (13.5 - 18.2) | 58.4 (50.8 - 63.7) | 0.49 | 52.6 | 15.8 | 66.4 | |
Kang_KT_task3_4 | Kang_KT_task3_report | 22 | 0.46 (0.40 - 0.52) | 43.7 (38.2 - 49.9) | 16.2 (14.0 - 18.5) | 56.4 (49.2 - 61.5) | 0.48 | 52.0 | 16.3 | 65.3 | |
Du_NERCSLIP_task3_1 | Du_NERCSLIP_task3_report | 4 | 0.37 (0.31 - 0.44) | 56.9 (50.9 - 64.5) | 15.0 (13.2 - 16.9) | 73.6 (68.1 - 78.7) | 0.38 | 67.0 | 14.8 | 78.0 | |
Du_NERCSLIP_task3_2 | Du_NERCSLIP_task3_report | 1 | 0.35 (0.30 - 0.41) | 58.3 (53.8 - 64.7) | 14.6 (12.8 - 16.5) | 73.7 (68.7 - 78.2) | |||||
Du_NERCSLIP_task3_3 | Du_NERCSLIP_task3_report | 2 | 0.36 (0.29 - 0.43) | 56.8 (50.6 - 63.9) | 15.5 (13.8 - 17.4) | 75.5 (70.1 - 80.4) | |||||
Du_NERCSLIP_task3_4 | Du_NERCSLIP_task3_report | 3 | 0.37 (0.31 - 0.44) | 57.8 (51.7 - 65.3) | 14.9 (13.2 - 16.7) | 73.4 (67.7 - 78.5) | 0.41 | 64.0 | 14.9 | 73.0 | |
Scheibler_LINE_task3_1 | Scheibler_LINE_task3_report | 30 | 0.62 (0.55 - 0.69) | 30.4 (25.2 - 36.3) | 16.7 (14.0 - 19.5) | 49.2 (42.1 - 54.5) | 0.50 | 51.1 | 16.7 | 63.4 | |
Park_SGU_task3_1 | Park_SGU_task3_report | 41 | 0.60 (0.53 - 0.67) | 28.4 (23.9 - 33.6) | 22.6 (19.8 - 25.3) | 46.9 (41.5 - 52.1) | 0.61 | 46.2 | 24.0 | 78.2 | |
Park_SGU_task3_2 | Park_SGU_task3_report | 40 | 0.63 (0.55 - 0.70) | 31.2 (25.3 - 37.5) | 21.6 (18.3 - 25.0) | 46.5 (40.7 - 51.7) | 0.62 | 46.8 | 25.1 | 78.2 | |
Park_SGU_task3_3 | Park_SGU_task3_report | 38 | 0.63 (0.56 - 0.70) | 31.4 (25.8 - 37.4) | 22.7 (18.6 - 26.5) | 47.4 (41.7 - 52.5) | 0.62 | 46.8 | 25.1 | 78.2 | |
Park_SGU_task3_4 | Park_SGU_task3_report | 38 | 0.60 (0.53 - 0.67) | 30.6 (25.2 - 36.4) | 21.6 (17.8 - 25.1) | 45.9 (40.3 - 51.0) | 0.62 | 46.8 | 25.1 | 78.2 | |
Wang_SJTU_task3_1 | Wang_SJTU_task3_report | 35 | 0.67 (0.60 - 0.74) | 26.3 (18.3 - 33.1) | 23.9 (21.8 - 26.3) | 59.2 (52.6 - 64.4) | 0.47 | 62.2 | 11.3 | 69.0 | |
Wang_SJTU_task3_2 | Wang_SJTU_task3_report | 33 | 0.67 (0.60 - 0.74) | 27.0 (19.3 - 33.6) | 24.4 (22.0 - 27.1) | 60.3 (53.8 - 65.3) | 0.46 | 61.8 | 11.4 | 68.4 | |
Wang_SJTU_task3_3 | Wang_SJTU_task3_report | 34 | 0.68 (0.60 - 0.75) | 26.3 (18.0 - 33.3) | 23.7 (21.7 - 25.9) | 59.8 (52.4 - 65.1) | 0.48 | 61.4 | 11.5 | 69.0 | |
Wang_SJTU_task3_4 | Wang_SJTU_task3_report | 36 | 0.67 (0.60 - 0.74) | 26.2 (18.0 - 33.2) | 23.8 (21.5 - 26.4) | 58.8 (51.2 - 64.2) | 0.47 | 61.6 | 11.4 | 68.7 | |
FalconPerez_Aalto_task3_1 | FalconPerez_Aalto_task3_report | 58 | 0.70 (0.64 - 0.75) | 16.2 (10.1 - 21.1) | 28.7 (24.0 - 32.6) | 33.9 (26.5 - 39.0) | 0.75 | 19.0 | 49.3 | 38.0 | |
FalconPerez_Aalto_task3_2 | FalconPerez_Aalto_task3_report | 52 | 0.73 (0.67 - 0.79) | 21.8 (15.5 - 27.6) | 24.4 (21.7 - 27.1) | 43.1 (35.7 - 48.7) | 0.74 | 23.0 | 27.4 | 45.0 | |
FalconPerez_Aalto_task3_3 | FalconPerez_Aalto_task3_report | 59 | 0.70 (0.64 - 0.77) | 17.2 (10.2 - 22.5) | 25.5 (22.6 - 28.6) | 31.2 (23.4 - 36.2) | 0.75 | 15.0 | 51.8 | 3.0 | |
Xie_UESTC_task3_1 | Xie_UESTC_task3_report | 11 | 0.48 (0.41 - 0.55) | 48.6 (42.5 - 55.4) | 17.6 (16.0 - 19.2) | 73.5 (68.0 - 77.6) | 0.44 | 58.0 | 12.9 | 68.0 | |
Xie_UESTC_task3_2 | Xie_UESTC_task3_report | 15 | 0.50 (0.43 - 0.57) | 47.8 (41.5 - 54.4) | 17.5 (15.9 - 19.2) | 72.3 (65.1 - 77.1) | 0.47 | 52.0 | 14.4 | 64.0 | |
Xie_UESTC_task3_3 | Xie_UESTC_task3_report | 13 | 0.52 (0.44 - 0.60) | 48.4 (42.4 - 55.1) | 17.9 (16.2 - 19.8) | 74.6 (69.3 - 78.8) | 0.46 | 55.0 | 14.0 | 66.0 | |
Xie_UESTC_task3_4 | Xie_UESTC_task3_report | 12 | 0.50 (0.42 - 0.57) | 49.5 (43.8 - 56.0) | 17.4 (15.9 - 19.1) | 74.0 (69.2 - 77.8) | 0.46 | 56.0 | 13.7 | 67.0 | |
Kim_KU_task3_1 | Kim_KU_task3_report | 54 | 0.80 (0.74 - 0.86) | 20.3 (16.3 - 24.9) | 26.1 (23.9 - 28.6) | 50.6 (43.8 - 55.5) | 0.66 | 31.0 | 21.7 | 51.0 | |
Kim_KU_task3_2 | Kim_KU_task3_report | 46 | 0.74 (0.66 - 0.81) | 24.1 (19.8 - 28.9) | 26.6 (23.4 - 29.8) | 55.1 (48.6 - 59.5) | 0.66 | 30.0 | 22.5 | 49.0 | |
Kim_KU_task3_3 | Kim_KU_task3_report | 49 | 0.75 (0.69 - 0.82) | 20.5 (12.6 - 25.9) | 26.1 (22.7 - 29.5) | 53.3 (47.0 - 57.6) | 0.65 | 33.0 | 20.4 | 51.0 | |
Hu_IACAS_task3_1 | Hu_IACAS_task3_report | 10 | 0.44 (0.38 - 0.49) | 49.2 (43.8 - 55.8) | 16.6 (14.4 - 19.0) | 70.4 (64.0 - 75.2) | 0.50 | 48.4 | 19.5 | 65.7 | |
Hu_IACAS_task3_2 | Hu_IACAS_task3_report | 6 | 0.40 (0.34 - 0.46) | 57.4 (53.4 - 62.8) | 15.1 (13.4 - 16.8) | 70.6 (65.4 - 75.4) | 0.50 | 51.0 | 16.4 | 65.9 | |
Hu_IACAS_task3_3 | Hu_IACAS_task3_report | 5 | 0.39 (0.34 - 0.44) | 55.8 (51.2 - 61.1) | 16.2 (14.6 - 17.8) | 72.4 (67.3 - 77.2) | 0.53 | 48.1 | 17.8 | 62.6 | |
Hu_IACAS_task3_4 | Hu_IACAS_task3_report | 9 | 0.40 (0.34 - 0.46) | 50.9 (44.4 - 59.4) | 15.9 (13.8 - 18.1) | 69.4 (63.7 - 75.7) | 0.53 | 45.4 | 17.4 | 62.5 | |
Chen_SHU_task3_1 | Chen_SHU_task3_report | 65 | 1.00 (1.00 - 1.00) | 0.3 (0.1 - 0.6) | 60.3 (45.4 - 94.0) | 4.5 (2.9 - 6.3) | 0.71 | 27.0 | 26.7 | 48.0 | |
Wu_NKU_task3_1 | Wu_NKU_task3_report | 55 | 0.72 (0.67 - 0.77) | 18.5 (13.3 - 23.6) | 25.1 (22.0 - 29.4) | 42.1 (33.3 - 47.6) | 0.66 | 32.0 | 23.2 | 48.0 | |
Wu_NKU_task3_2 | Wu_NKU_task3_report | 53 | 0.69 (0.64 - 0.74) | 17.9 (14.4 - 21.5) | 28.5 (24.5 - 39.7) | 44.5 (38.2 - 48.4) | 0.63 | 33.0 | 22.7 | 49.0 | |
Wu_NKU_task3_3 | Wu_NKU_task3_report | 57 | 0.72 (0.67 - 0.77) | 18.8 (14.2 - 24.6) | 30.2 (23.4 - 35.2) | 39.7 (29.9 - 45.5) | 0.65 | 31.0 | 26.0 | 43.0 | |
Wu_NKU_task3_4 | Wu_NKU_task3_report | 56 | 0.71 (0.65 - 0.76) | 18.7 (14.7 - 23.0) | 28.3 (22.8 - 40.2) | 38.6 (31.9 - 43.2) | 0.65 | 30.0 | 18.0 | 44.0 | |
Han_KU_task3_1 | Han_KU_task3_report | 39 | 0.73 (0.66 - 0.80) | 27.8 (22.6 - 35.2) | 25.6 (23.8 - 27.2) | 63.5 (57.7 - 68.7) | 0.45 | 63.6 | 14.4 | 71.1 | |
Han_KU_task3_2 | Han_KU_task3_report | 43 | 0.72 (0.64 - 0.79) | 23.0 (15.6 - 31.1) | 25.5 (23.9 - 27.0) | 64.0 (58.9 - 70.2) | 0.43 | 58.8 | 15.1 | 73.2 | |
Han_KU_task3_3 | Han_KU_task3_report | 8 | 0.38 (0.33 - 0.44) | 53.6 (47.8 - 60.7) | 15.6 (13.9 - 17.1) | 67.3 (61.7 - 73.1) | 0.28 | 67.2 | 11.8 | 76.7 | |
Han_KU_task3_4 | Han_KU_task3_report | 7 | 0.37 (0.31 - 0.42) | 49.7 (44.4 - 56.6) | 16.5 (14.8 - 18.0) | 70.7 (65.8 - 76.1) | 0.39 | 59.5 | 13.0 | 73.7 | |
Ko_KAIST_task3_1 | Ko_KAIST_task3_report | 24 | 0.47 (0.40 - 0.53) | 39.6 (32.9 - 45.9) | 18.9 (16.2 - 26.5) | 52.7 (42.7 - 59.8) | 0.53 | 49.8 | 16.0 | 55.9 | |
Ko_KAIST_task3_2 | Ko_KAIST_task3_report | 23 | 0.49 (0.42 - 0.55) | 39.9 (33.8 - 46.0) | 17.3 (15.3 - 19.3) | 54.6 (46.5 - 60.5) | 0.55 | 46.2 | 16.4 | 54.6 | |
Ko_KAIST_task3_3 | Ko_KAIST_task3_report | 25 | 0.48 (0.42 - 0.53) | 39.8 (33.3 - 46.2) | 19.6 (17.2 - 26.6) | 52.0 (42.4 - 58.7) | 0.57 | 46.4 | 17.2 | 54.4 | |
Ko_KAIST_task3_4 | Ko_KAIST_task3_report | 26 | 0.50 (0.44 - 0.56) | 35.7 (28.6 - 42.1) | 20.4 (18.3 - 22.6) | 52.8 (42.4 - 59.5) | 0.55 | 46.4 | 17.0 | 56.2 | |
Kapka_SRPOL_task3_1 | Kapka_SRPOL_task3_report | 58 | 0.92 (0.84 - 0.99) | 25.2 (21.6 - 29.2) | 24.1 (21.2 - 27.3) | 49.5 (43.4 - 54.3) | 0.85 | 32.1 | 24.7 | 51.4 | |
Kapka_SRPOL_task3_2 | Kapka_SRPOL_task3_report | 50 | 0.81 (0.73 - 0.88) | 26.0 (22.1 - 30.2) | 22.3 (19.2 - 25.9) | 48.1 (41.9 - 53.0) | 0.76 | 32.9 | 24.6 | 49.9 | |
Kapka_SRPOL_task3_3 | Kapka_SRPOL_task3_report | 51 | 0.81 (0.74 - 0.88) | 24.7 (20.5 - 29.5) | 26.2 (23.0 - 29.9) | 52.1 (45.3 - 57.2) | |||||
Kapka_SRPOL_task3_4 | Kapka_SRPOL_task3_report | 48 | 0.72 (0.65 - 0.79) | 25.5 (21.3 - 30.4) | 25.4 (21.7 - 29.3) | 49.8 (42.8 - 55.3) | |||||
Zhaoyu_LRVT_task3_1 | Zhaoyu_LRVT_task3_report | 60 | 0.96 (0.88 - 1.00) | 11.2 (8.8 - 13.9) | 31.0 (28.5 - 33.4) | 53.4 (44.4 - 58.9) | 0.58 | 35.0 | 22.5 | 42.0 | |
Zhaoyu_LRVT_task3_2 | Zhaoyu_LRVT_task3_report | 64 | 0.88 (0.84 - 0.92) | 3.5 (2.3 - 4.8) | 39.3 (28.9 - 59.3) | 7.5 (5.6 - 9.5) | 0.68 | 25.0 | 35.4 | 42.0 | |
Zhaoyu_LRVT_task3_3 | Zhaoyu_LRVT_task3_report | 62 | 0.83 (0.78 - 0.87) | 7.4 (5.5 - 9.5) | 24.5 (20.1 - 34.5) | 12.5 (10.0 - 15.1) | 0.70 | 25.4 | 45.2 | 42.0 | |
Zhaoyu_LRVT_task3_4 | Zhaoyu_LRVT_task3_report | 61 | 0.83 (0.80 - 0.87) | 12.1 (7.4 - 16.8) | 26.2 (23.0 - 29.0) | 36.0 (23.1 - 43.6) | 0.72 | 33.3 | 43.5 | 35.0 | |
Xie_XJU_task3_1 | Xie_XJU_task3_report | 44 | 0.66 (0.59 - 0.74) | 25.5 (19.3 - 32.2) | 23.1 (19.9 - 26.4) | 53.1 (42.7 - 59.4) | 0.66 | 34.2 | 22.9 | 57.7 |
System characteristics
| Rank | Submission name | Technical Report | Model | Model params | Audio format | Acoustic features | Data augmentation |
|---|---|---|---|---|---|---|---|
42 | FOA_Baseline_task3_1 | Politis_TAU_task3_report | CRNN | 604920 | FOA | log-mel spectra, intensity vector | |
45 | MIC_Baseline_task3_1 | Politis_TAU_task3_report | CRNN | 606648 | MIC | log-mel spectra, GCC | |
20 | Bai_JLESS_task3_1 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 194560 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
16 | Bai_JLESS_task3_2 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 194560 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
19 | Bai_JLESS_task3_3 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 235212 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
14 | Bai_JLESS_task3_4 | Bai_JLESS_task3_report | CNN, Conformer, ensemble | 235212 | MIC | log-mel spectra, SALSA-Lite | FMix, mixup, random cutout, channel rotation, data generation |
28 | Chun_Chosun_task3_1 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 5650035 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
31 | Chun_Chosun_task3_2 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4194366 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
27 | Chun_Chosun_task3_3 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4983870 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
29 | Chun_Chosun_task3_4 | Chun_Chosun_task3_report | CRNN, Transformer, ensemble | 4654910 | FOA | log-mel spectra, intensity vector | SpecAugment, impulse response simulation |
47 | Guo_XIAOMI_task3_1 | Guo_XIAOMI_task3_report | ComplexNew 3DCNN | 807257 | FOA | log-mel spectra, intensity vector | Channel swapping, Labels first, Channels first |
33 | Guo_XIAOMI_task3_2 | Guo_XIAOMI_task3_report | 3DCNN | 902953 | FOA | log-mel spectra, intensity vector | Channel swapping, Labels first, Channels first |
21 | Kang_KT_task3_1 | Kang_KT_task3_report | CRNN, ensemble | 97778356 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
17 | Kang_KT_task3_2 | Kang_KT_task3_report | CRNN, ensemble | 67818904 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
18 | Kang_KT_task3_3 | Kang_KT_task3_report | CRNN, ensemble | 126997260 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
22 | Kang_KT_task3_4 | Kang_KT_task3_report | CRNN, ensemble | 97137808 | FOA+MIC | log-mel spectra, intensity vector, log-linear magnitude spectra, SALSA-Lite | SpecAugment, random cutout, frequency shifting, rotation, channel swapping |
4 | Du_NERCSLIP_task3_1 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
1 | Du_NERCSLIP_task3_2 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
2 | Du_NERCSLIP_task3_3 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
3 | Du_NERCSLIP_task3_4 | Du_NERCSLIP_task3_report | CNN, Conformer | 58100201 | FOA | log-mel spectra, intensity vector | audio channel swapping, multichannel data simulation |
30 | Scheibler_LINE_task3_1 | Scheibler_LINE_task3_report | CNN, Conformer, SSAST, IVA | 4000000 | FOA | log-mel spectra, intensity vector | SpecAug, FOA Rotation, Simulation, FSD50K |
41 | Park_SGU_task3_1 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
40 | Park_SGU_task3_2 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
38 | Park_SGU_task3_3 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
38 | Park_SGU_task3_4 | Park_SGU_task3_report | CRNN | 26242768 | FOA | log-mel spectra, intensity vector | rotate, rotate + mixup |
35 | Wang_SJTU_task3_1 | Wang_SJTU_task3_report | CRNN, MHSA, ensemble | 538261542 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
33 | Wang_SJTU_task3_2 | Wang_SJTU_task3_report | CRNN, Transformer, ensemble | 672127703 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
34 | Wang_SJTU_task3_3 | Wang_SJTU_task3_report | CRNN, MHSA, ensemble | 672127703 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
36 | Wang_SJTU_task3_4 | Wang_SJTU_task3_report | CRNN, Transformer, ensemble | 805993864 | FOA+MIC | log-mel spectra, intensity vector, GCC | |
58 | FalconPerez_Aalto_task3_1 | FalconPerez_Aalto_task3_report | SampleCNN | 713511 | FOA | raw waveform | |
52 | FalconPerez_Aalto_task3_2 | FalconPerez_Aalto_task3_report | CRNN | 4709607 | FOA | log-linear magnitude spectra, intensity vector | |
59 | FalconPerez_Aalto_task3_3 | FalconPerez_Aalto_task3_report | CRNN | 4709607 | FOA | log-linear magnitude spectra, intensity vector | |
11 | Xie_UESTC_task3_1 | Xie_UESTC_task3_report | CRNN | 482551524 | FOA | log-mel spectra, intensity vector | Mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout and SpecAugment |
15 | Xie_UESTC_task3_2 | Xie_UESTC_task3_report | CRNN | 273011952 | FOA | log-mel spectra, intensity vector | Mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout and SpecAugment |
13 | Xie_UESTC_task3_3 | Xie_UESTC_task3_report | CRNN | 295482564 | FOA | log-mel spectra, intensity vector | Mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout and SpecAugment |
12 | Xie_UESTC_task3_4 | Xie_UESTC_task3_report | CRNN | 660798176 | FOA | log-mel spectra, intensity vector | Mini-batch mixup, angle noise, mini-batch time-frequency noise, FOA rotation, random cutout and SpecAugment |
54 | Kim_KU_task3_1 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference intensity vector | Specmix |
46 | Kim_KU_task3_2 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference intensity vector | Specmix |
49 | Kim_KU_task3_3 | Kim_KU_task3_report | CNN, Conformer | 122211189 | FOA | log-mel spectra, inter-phase difference intensity vector | Specmix |
10 | Hu_IACAS_task3_1 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, specAugment, rotation, random crop, frequency shifting |
6 | Hu_IACAS_task3_2 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, specAugment, rotation, random crop, frequency shifting |
5 | Hu_IACAS_task3_3 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, specAugment, rotation, random crop, frequency shifting |
9 | Hu_IACAS_task3_4 | Hu_IACAS_task3_report | EINV2, Conformer, CNN | 85288432 | FOA | log-mel spectra, intensity vector | mixup, specAugment, rotation, random crop, frequency shifting |
65 | Chen_SHU_task3_1 | Chen_SHU_task3_report | CRNN, Self-Attention | 2918925 | FOA | log-mel spectra, intensity vector | |
55 | Wu_NKU_task3_1 | Wu_NKU_task3_report | CRNN | 1920757 | FOA | log-mel spectra, intensity vector, variable-Q transform (VQT) | |
53 | Wu_NKU_task3_2 | Wu_NKU_task3_report | CRNN | 10364997 | FOA | log-mel spectra, intensity vector, variable-Q transform (VQT) | block mixing |
57 | Wu_NKU_task3_3 | Wu_NKU_task3_report | CRNN | 1922485 | MIC | log-mel spectra, GCC, variable-Q transform (VQT) | |
56 | Wu_NKU_task3_4 | Wu_NKU_task3_report | CRNN | 10366725 | MIC | log-mel spectra, GCC, variable-Q transform (VQT) | block mixing |
39 | Han_KU_task3_1 | Han_KU_task3_report | SE-ResNet34, GRU | 6047746 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, Spec-augmentation |
43 | Han_KU_task3_2 | Han_KU_task3_report | SE-ResNet34, GRU | 6047746 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, Spec-augmentation |
8 | Han_KU_task3_3 | Han_KU_task3_report | SE-ResNet34, GRU | 24190984 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, Spec-augmentation |
7 | Han_KU_task3_4 | Han_KU_task3_report | SE-ResNet34, GRU | 24190984 | FOA | log-mel spectra, intensity vector | pitch shifting, gain adjusting, band-pass filter, noise, rotation, Spec-augmentation |
24 | Ko_KAIST_task3_1 | Ko_KAIST_task3_report | CRNN | 160775516 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mix-up, frame shift |
23 | Ko_KAIST_task3_2 | Ko_KAIST_task3_report | CRNN | 44050908 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mix-up, frame shift |
25 | Ko_KAIST_task3_3 | Ko_KAIST_task3_report | CRNN | 44250060 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mix-up, frame shift |
26 | Ko_KAIST_task3_4 | Ko_KAIST_task3_report | CRNN | 44250060 | FOA | log-linear magnitude spectra, eigenvector-based intensity vector | channel swapping, pitch shifting, mix-up, frame shift |
58 | Kapka_SRPOL_task3_1 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augment |
50 | Kapka_SRPOL_task3_2 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augment |
51 | Kapka_SRPOL_task3_3 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augment |
48 | Kapka_SRPOL_task3_4 | Kapka_SRPOL_task3_report | CRNN | 4604286 | FOA | log-linear magnitude spectra, phase spectra, intensity vector | volume perturbation, FOA spatial augment |
60 | Zhaoyu_LRVT_task3_1 | Zhaoyu_LRVT_task3_report | CNN, Conformer, MLP | 30.35M | FOA | log-mel spectra, intensity vector | SpecAugment, Time-Frequency Masking, Audio Channel Swapping, Reverb Simulation
64 | Zhaoyu_LRVT_task3_2 | Zhaoyu_LRVT_task3_report | CNN, Conformer, MLP | 30.35M | FOA | log-mel spectra, intensity vector | SpecAugment, Time-Frequency Masking, Audio Channel Swapping
62 | Zhaoyu_LRVT_task3_3 | Zhaoyu_LRVT_task3_report | CNN, LSTM, U-Net | 17.42M | FOA | log-mel spectra, intensity vector | SpecAugment, Time-Frequency Masking, Audio Channel Swapping
61 | Zhaoyu_LRVT_task3_4 | Zhaoyu_LRVT_task3_report | CRNN, MLP | 2.35M | FOA | log-mel spectra, intensity vector | SpecAugment, Time-Frequency Masking, Audio Channel Swapping, Reverb Simulation
44 | Xie_XJU_task3_1 | Xie_XJU_task3_report | CRNN | 116118 | FOA | log-mel spectra, intensity vector |
Technical reports
JLESS SUBMISSION TO DCASE2022 TASK3: DYNAMIC KERNEL CONVOLUTION NETWORK WITH DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION IN REAL SPACE
Siwei Huang1, Jisheng Bai1,2, Yafei Jia1, Mou Wang1, Jianfeng Chen1,2
1Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China, 2LianFeng Acoustic Technologies Co., Ltd. Xi’an, China
Bai_JLESS_task3_1 Bai_JLESS_task3_2 Bai_JLESS_task3_3 Bai_JLESS_task3_4
Abstract
This technical report describes our proposed system for DCASE2022 task3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. In our approach, we first introduce a dynamic kernel convolution module after the convolution blocks to dynamically model the channel-wise features with different receptive fields. We then incorporate the SELDnet and EINV2 frameworks into the proposed SELD system with multi-track ACCDOA output. Finally, we use different strategies in the training stage to improve the generalization of the system in realistic environments. Moreover, we apply data augmentation methods to balance the sound event classes in the dataset, and generate more spatial audio files to augment the training data. Experimental results show that the proposed systems outperform the baseline on the development dataset of DCASE2022 task3.
GLFE: GLOBAL-LOCAL FUSION ENHANCEMENT FOR SOUND EVENT LOCALIZATION AND DETECTION
Zhengyu Chen, Qinghua Huang
Shanghai University, Shanghai, China
Chen_SHU_task3_1
Abstract
Sound event localization and detection (SELD), as a combination of the sound event detection (SED) task and the direction-of-arrival (DOA) estimation task, aims at detecting different sound events and obtaining their corresponding localization information simultaneously. Increasingly capable systems are required for deployment in ever more complex acoustic environments. In this paper, our method called global-local fusion enhancement (GLFE) is presented for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes. It can be regarded as a convolution enhancement method. First, multiple feature cross fusion (MFCF) based on different local receptive fields is proposed. Considering the diversity of real sound events, a self-attention network (SANet) integrating global information into local features is introduced to help the system obtain more useful information. Further, skip fusion enhancement (SFE) is explored to fuse features of different levels through skip connections in order to improve the feature representation. On the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) development dataset, the proposed system shows a significant improvement over the baseline system. A series of experiments was conducted only on the first-order Ambisonics (FOA) dataset.
Polyphonic Sound Event Localization and Detection Using Convolutional Neural Networks and Self-Attention with Synthetic and Real Data
Yeongseo Shin, Kangmin Kim, Chanjun Chun
Chosun University, Gwangju, Korea
Chun_Chosun_task3_1 Chun_Chosun_task3_2 Chun_Chosun_task3_3 Chun_Chosun_task3_4
Abstract
This technical report describes the system submitted to DCASE 2022 Task 3: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. The goal of Task 3 is to detect the occurrence of sound events belonging to specific target classes in a real spatial sound scene, track their temporal activity, and estimate their direction or location of arrival. In the given dataset, synthetic and real data exist together, with only a very small amount of real data compared with synthetic data. In this study, we developed a method utilizing a multi-generator and another applying SpecAugment as a data augmentation method to address this data imbalance. In addition, in our network architecture, a Transformer encoder was applied to the Convolutional Recurrent Neural Network (CRNN) structure that is commonly used in SELD. Finally, training with single models and applying an ensemble confirmed improved performance compared to the baseline system.
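Several systems in this task, including this one, use SpecAugment-style masking for augmentation. A minimal sketch of the idea on a (channels, frequency, time) feature tensor follows; mask counts and sizes are illustrative, not the authors' settings.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20, rng=None):
    """spec: (channels, n_freq, n_time) feature tensor; returns a masked copy."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    _, n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f)))
        spec[:, f0:f0 + f, :] = 0.0   # zero out a band of frequency bins
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t)))
        spec[:, :, t0:t0 + t] = 0.0   # zero out a span of time frames
    return spec
```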
THE NERC-SLIP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2022 CHALLENGE
Qing Wang1, Li Chai2, Huaxin Wu2, Zhaoxu Nian1, Shutong Niu1, Siyuan Zheng1, Yuyang Wang1, Lei Sun2, Yi Fang2, Jia Pan2, Jun Du1, Chin-Hui Lee3
1University of Science and Technology of China, Hefei, China, 2iFLYTEK, Hefei, China, 3Georgia Institute of Technology, Atlanta, USA
Du_NERCSLIP_task3_1 Du_NERCSLIP_task3_2 Du_NERCSLIP_task3_3 Du_NERCSLIP_task3_4
Abstract
This technical report describes our submission system for Task 3 of the DCASE2022 challenge: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. Compared with the official baseline system, the improvements of our method lie mainly in three aspects: data augmentation, a more powerful network architecture, and model ensembling. First, our previous work shows that the audio channel swapping (ACS) technique [1] can effectively deal with data sparsity problems in the SELD task; it is utilized in our method and provides an effective improvement with limited real training data. In addition, we generate multichannel recordings by using public datasets and perform data cleaning to drop bad data. Then, based on the augmented data, we employ a ResNet-Conformer architecture which can better model the context dependencies within an audio sequence. Specifically, we found that time resolution had a significant impact on model performance: with the time pooling layer moved back, the model obtains a higher feature resolution and achieves better results. Finally, to attain robust performance, we employ a model ensemble over different target representations (e.g., activity-coupled Cartesian direction of arrival (ACCDOA) and multi-ACCDOA) and post-processing strategies. The proposed system is evaluated on the dev-test set of the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset.
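The audio channel swapping (ACS) idea referenced above exploits the fact that certain swaps and sign flips of first-order Ambisonics channels correspond to exact rotations or reflections of the whole sound scene, so DOA labels can be remapped consistently. A minimal sketch follows, assuming ACN channel order [W, Y, Z, X] and azimuth/elevation labels in degrees; the transforms listed are an illustrative subset, not the authors' exact set.

```python
import numpy as np

def acs_transforms(foa, azi, ele):
    """Yield (augmented_audio, azimuth, elevation) variants of one labeled clip.

    foa: array of shape (4, n_samples) in ACN order [W, Y, Z, X].
    """
    w, y, z, x = foa
    yield foa, azi, ele                               # identity
    yield np.stack([w, -y, z, x]), -azi, ele          # reflect y-axis: phi -> -phi
    yield np.stack([w, y, -z, x]), azi, -ele          # reflect z-axis: theta -> -theta
    yield np.stack([w, -y, z, -x]), azi + 180.0, ele  # rotate 180 deg about z
    yield np.stack([w, x, z, y]), 90.0 - azi, ele     # swap x/y: phi -> 90 - phi
```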
CURRICULUM LEARNING WITH AUDIO DOMAIN DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION
Ricardo Falcon-Perez
Aalto University, Espoo, Finland
Abstract
In this report we explore a variety of data augmentation techniques in the audio domain, along with a curriculum learning approach, for sound event localization and detection (SELD) tasks. We focus our work on two areas: 1) techniques that modify timbral or temporal characteristics of all channels simultaneously, such as equalization or added noise; 2) methods that transform the spatial impression of the full sound scene, such as directional loudness modifications. We test the approach on models using either time-frequency or raw audio features, trained and evaluated on the STARSS22: Sony-TAU Realistic Spatial Soundscapes 2022 dataset. Although the proposed system struggles to beat the official benchmark system, the augmentation techniques show improvements over our non-augmented baseline.
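For the first family of techniques the abstract mentions (timbral modifications applied to all channels simultaneously), the key point is that applying the same filter to every channel preserves inter-channel level and phase relations, and hence the spatial cues. A minimal sketch with an illustrative random-EQ design, not the author's implementation:

```python
import numpy as np

def random_eq_all_channels(audio, n_taps=31, max_db=6.0, rng=None):
    """audio: (channels, samples); applies one shared random linear-phase FIR."""
    rng = rng or np.random.default_rng()
    n_bins = n_taps // 2 + 1
    # a smooth random magnitude response, interpolated from 4 control gains (dB)
    gains_db = rng.uniform(-max_db, max_db, size=4)
    mag = 10.0 ** (np.interp(np.linspace(0, 3, n_bins), np.arange(4), gains_db) / 20.0)
    fir = np.roll(np.fft.irfft(mag, n=n_taps), n_taps // 2)  # zero-phase -> causal
    # identical filtering on every channel keeps inter-channel (spatial) cues intact
    return np.stack([np.convolve(ch, fir, mode="same") for ch in audio])
```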
TACCNN: TIME-ALIGNMENT COMPLEX CONVOLUTIONAL NEURAL NETWORK
Kaibin Guo, Runyu Shi, Tianrui He, Nian Liu, Junfei Yu
Xiaomi, Beijing, China
Guo_XIAOMI_task3_1 Guo_XIAOMI_task3_2
Abstract
In this technical report, we present our system submitted to the DCASE2022 challenge task3: Sound Event Localization and Detection (SELD) Evaluated in Real Spatial Sound Scenes. We first review well-known deep learning methods for SELD and point out that these works have ignored time alignment from the perspective of the signal's arrival time, and that amplitude and phase are modeled separately. We therefore put forward a new model, the Time Alignment Complex Convolutional Neural Network (TACCNN). In our model, we use a 3DCNN or ConvLSTM to align the features from different microphones. Moreover, we combine the mel spectrogram and the intensity vector into a complex vector, and then extract salient features from this new feature using a complex convolutional neural network. Lastly, we apply a Bi-GRU with self-attention to extract the relative information about sound events to determine the rotation of the sound event. The results show that the time alignment block greatly improves the performance of the CNN-GRU model. The complex convolutional neural network yields results similar to the real-valued convolutional neural network; it seems that we need more experiments to discover the role of complex convolutional neural networks.
A ROBUST FRAMEWORK FOR SOUND EVENT LOCALIZATION AND DETECTION ON REAL RECORDINGS
Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
School of Industrial and Management Engineering, Korea University, Seoul, South Korea
Han_KU_task3_1 Han_KU_task3_2 Han_KU_task3_3 Han_KU_task3_4
Abstract
This technical report describes the systems submitted to the DCASE2022 challenge task 3: sound event localization and detection (SELD). The task aims to detect occurrences of sound events, specify their class, and furthermore estimate their position. Our system utilizes a ResNet-based model under a proposed robust framework for SELD. To guarantee generalized performance on real-world sound scenes, we design the overall framework with augmentation techniques, a pipeline for mixing datasets from real-world sound scenes and emulations, and test-time augmentation. The augmentation techniques and exploitation of external sound sources enable training on diverse samples while retaining enough exposure to real-world context by maintaining the number of real recording samples in each batch. In addition, we design a test-time augmentation and a clustering-based model ensemble method to aggregate confident predictions. Experimental results show that the model under the proposed framework outperforms the baseline methods and achieves competitive performance on real-world sound recordings.
Awards: Judges’ award
SOUND EVENT LOCALIZATION AND DETECTION FOR REAL SPATIAL SOUND SCENES: EVENT-INDEPENDENT NETWORK AND DATA AUGMENTATION CHAINS
Jinbo Hu1,2, Yin Cao3, Ming Wu1, Qiuqiang Kong4, Feiran Yang1, Mark D. Plumbley5, Jun Yang1,2
1Institute of Acoustics, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Xi’an Jiaotong Liverpool University, Suzhou, China, 4ByteDance Shanghai, China, 5 Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Hu_IACAS_task3_1 Hu_IACAS_task3_2 Hu_IACAS_task3_3 Hu_IACAS_task3_4
Abstract
Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with their corresponding temporal activities and spatial locations. In DCASE 2022 Task 3, the data transition from computationally generated spatial recordings to recordings of real sound scenes. Our system submitted to DCASE 2022 Task 3 is based on our previously proposed Event-Independent Network V2 (EINV2) and a novel data augmentation method. To detect different sound events of the same type at different locations, our method employs EINV2, combining a track-wise output format, permutation-invariant training, and soft parameter sharing. EINV2 is also extended with conformer structures to learn local and global patterns. To improve the generalization ability of the model, we use a data augmentation approach containing several data augmentation chains, which are composed of random combinations of several different data augmentation operations. To mitigate the lack of real-scene recordings in the development dataset and the class imbalance of sound events, we exploit FSD50K, AudioSet, and the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) to generate simulated datasets for training. The results show that our system improves over the baseline system on the dev-set-test of Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22).
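The augmentation-chain idea described above, sampling a random subset of operations and applying them in sequence, can be sketched as follows; the placeholder operations are illustrative, not the authors' actual pool.

```python
import numpy as np

def gain(x, rng):
    return x * rng.uniform(0.5, 2.0)

def add_noise(x, rng):
    return x + rng.normal(0.0, 0.01, x.shape)

def time_shift(x, rng):
    return np.roll(x, int(rng.integers(-1000, 1000)), axis=-1)

def augmentation_chain(x, ops=(gain, add_noise, time_shift), max_ops=2, rng=None):
    """Apply a random subset of ops, in random order, to one waveform array."""
    rng = rng or np.random.default_rng()
    k = int(rng.integers(1, max_ops + 1))
    for i in rng.choice(len(ops), size=k, replace=False):
        x = ops[int(i)](x, rng)
    return x
```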
TRACK-WISE ENSEMBLE OF CRNN MODELS WITH MULTI-TASK ADPIT FOR SOUND EVENT LOCALIZATION AND DETECTION
Sang-Ick Kang , Myungchul Keum, Kyongil Cho, Yeonseok Park
KT Corporation, South Korea
Kang_KT_task3_1 Kang_KT_task3_2 Kang_KT_task3_3 Kang_KT_task3_4
Abstract
This report describes our systems submitted to the DCASE2022 challenge task 3: Sound Event Localization and Detection (SELD). Locating and detecting sound events consists of two subtasks: detecting sound events and estimating their direction of arrival simultaneously. It is therefore often difficult to jointly optimize these two subtasks at the same time. We propose a track-wise ensemble model that combines a multi-task-based auxiliary duplicating permutation invariant training (ADPIT) model with a multi-ACCDOA-based model. Specifically, we propose a novel method to ensemble CRNN multi-task models, event-independent network v2 (EINV2)-based multi-task models, and CRNN multi-ACCDOA models. Experimental results on the DCASE2022 sound event localization and detection dataset show that the deep-learning-based model trained with this approach significantly outperforms the DCASE challenge baseline.
COLOC: CONDITIONED LOCALIZER AND CLASSIFIER FOR SOUND EVENT LOCALIZATION AND DETECTION
Slawomir Kapka
Samsung R&D, Warsaw, Poland
Kapka_SRPOL_task3_1 Kapka_SRPOL_task3_2 Kapka_SRPOL_task3_3 Kapka_SRPOL_task3_4
Abstract
This technical report for DCASE2022 Task 3 describes the Conditioned Localizer and Classifier (CoLoC), a novel solution for Sound Event Localization and Detection (SELD). The solution consists of two stages: localization is done first, followed by classification conditioned on the output of the localizer. In order to resolve the problem of an unknown number of sources, we incorporate an idea borrowed from Sequential Set Generation (SSG). Models in both stages are SELDnet-like CRNNs, but with single outputs. Our reasoning shows that two such single-output models are fit for the SELD task. We show that our solution improves on the baseline system in most metrics on the STARSS22 Dataset.
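A rough sketch of the two-stage, SSG-style inference loop described above: a localizer proposes one source at a time, conditioned on the sources found so far, until it signals termination; a classifier then labels each event conditioned on its DOA. The `localizer` and `classifier` callables here are hypothetical stand-ins, not the author's models.

```python
def coloc_inference(features, localizer, classifier, max_sources=3):
    """Two-stage SELD inference: sequential localization, then conditioned classification."""
    found = []
    for _ in range(max_sources):
        # the localizer is conditioned on the sources already found (SSG-style)
        doa, stop = localizer(features, condition=found)
        if stop:               # termination signal: no more sources in the scene
            break
        found.append(doa)
    # classify each detected source, conditioned on its estimated DOA
    return [(classifier(features, condition=doa), doa) for doa in found]
```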
CONVNEXT AND CONFORMER FOR SOUND EVENT LOCALIZATION AND DETECTION
Gwantae Kim, Hanseok Ko
Korea University, Seoul, South Korea
Kim_KU_task3_1 Kim_KU_task3_2 Kim_KU_task3_3
Abstract
This technical report describes the system participating in the DCASE 2022 Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes challenge. The system consists of convolutional neural networks and a multi-head self-attention mechanism. The convolutional networks consist of depth-wise and point-wise convolution layers, like the ConvNeXt block. The structure with the multi-head self-attention mechanism is based on the Conformer model, which combines convolution layers with multi-head self-attention. In the training phase, regularization methods such as Specmix, DropPath, and Dropout are used to improve generalization performance. Multi-ACCDOA, an output format for the sound event localization and detection task, is used as a more suitable output representation for the task. Our systems demonstrate an improvement over the baseline system.
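A minimal PyTorch sketch of the ConvNeXt-style block the abstract describes (a depthwise convolution followed by pointwise expansion and projection, with a residual connection); dimensions and normalization choices are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, expansion=4):
        super().__init__()
        # depthwise conv: one 7x7 filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)              # applied in channels-last layout
        # pointwise convs implemented as linear layers on the channel dim
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):                          # x: (B, C, F, T)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                  # -> (B, F, T, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                  # -> (B, C, F, T)
        return x + residual
```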
Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes
Byeong-Yun Ko, Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Seung-Deok Choi, Yong-Hwa Park
Korea Advanced Institute of Science and Technology, Daejeon, South Korea
Ko_KAIST_task3_1 Ko_KAIST_task3_2 Ko_KAIST_task3_3 Ko_KAIST_task3_4
Abstract
Performance of sound event localization and detection (SELD) in real scenes is limited by the small size of SELD datasets, owing to the difficulty of obtaining a sufficient amount of realistic multi-channel audio recordings with accurate labels. We used two main strategies to solve the problems arising from the small real SELD dataset. First, we applied various data augmentation methods across all data dimensions: channel, frequency, and time. We also propose an original data augmentation method named Moderate Mixup to simulate situations where a noise floor or interfering events exist. Second, we applied Squeeze-and-Excitation blocks on the channel and frequency dimensions to efficiently extract feature characteristics. Our trained models achieved best ER, F1, LE, and LR of 0.53, 49.8%, 16.0°, and 56.2%, respectively, on the STARSS22 test dataset.
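A sketch of a squeeze-and-excitation block that can be applied along either the channel or the frequency dimension of a (batch, channel, frequency, time) tensor, mirroring the abstract's use of SE on both dimensions; the reduction ratio and exact placement are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, n_units, reduction=8, dim=1):
        """n_units must equal x.shape[dim]; dim=1 excites channels, dim=2 frequencies."""
        super().__init__()
        self.dim = dim
        self.fc = nn.Sequential(
            nn.Linear(n_units, n_units // reduction), nn.ReLU(),
            nn.Linear(n_units // reduction, n_units), nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, F, T)
        squeeze_dims = [d for d in (1, 2, 3) if d != self.dim]
        s = x.mean(dim=squeeze_dims)                # squeeze -> (B, n_units)
        w = self.fc(s)                              # excitation weights in (0, 1)
        shape = [x.shape[0], 1, 1, 1]
        shape[self.dim] = x.shape[self.dim]
        return x * w.view(*shape)                   # rescale along the chosen dim
```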
SOUND EVENT LOCALIZATION AND DETECTION BASED ON CROSS-MODAL ATTENTION AND SOURCE SEPARATION
Jin-Young Park1, Do-Hui Kim1, Bon Hyeok Ku1, Jun Hyung Kim1, Jaehun Kim2, Kisung Kim2, Hyungchan Yoo2, Kisik Chang2, Hyung-Min Park1
1Sogang University, Seoul, South Korea, 2AI Lab, IVS Inc., Seoul, South Korea
Park_SGU_task3_1 Park_SGU_task3_2 Park_SGU_task3_3 Park_SGU_task3_4
Abstract
Sound event localization and detection (SELD) is a task that combines sound event detection (SED) and direction-of-arrival (DOA) estimation (DOAE). This year's SELD task focuses on evaluation on real spatial scenes, which raises the difficulty for two reasons: 1) an increase in overlapped events; 2) noise-like events combined with real noise. To overcome this, we applied source separation and improved data synthesis logic to our basic DCMA-SELD model, which utilizes dual cross-modal attention (DCMA) and soft parameter sharing between the SED and DOAE streams to simultaneously detect and localize sound events. To improve the SELD performance for male/female speech, which accounts for a large portion of the input sounds, source separation is performed in our method to separate speech signals from other sounds. Regarding the data synthesis logic, sound events that occur in real life may exhibit some regularity, such as a laugh event occurring during conversations or background music with a long duration. Instead of synthesizing data by mixing random sound events at random times, we therefore added several rules to simulate more natural data from which the context of the events can be learned. Experimental results on validation data showed that our proposed approach successfully improved performance on the task's real spatial scenes.
STARSS22: A DATASET OF SPATIAL RECORDINGS OF REAL SCENES WITH SPATIOTEMPORAL ANNOTATIONS OF SOUND EVENTS
Archontis Politis1, Kazuki Shimada2, Parthasaarathy Sudarsanam1, Sharath Adavanne1, Daniel Krause1, Yuichiro Koyama2, Naoya Takahashi2, Shusuke Takahashi2, Yuki Mitsufuji2, Tuomas Virtanen1
1Tampere University, Tampere, Finland, 2SONY, Tokyo, Japan
FOA_Baseline_task3_1 MIC_Baseline_task3_1
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprising spatial recordings of real scenes collected in various interiors at two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed, including the recording and annotation process, target classes and their presence, and details on the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baseline of previous iterations; namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.
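The FOA intensity-vector feature used alongside log-mel spectra in the baseline can be sketched as follows: per time-frequency bin, the active intensity direction is proportional to Re{conj(W) · [X, Y, Z]}. This is a simplified illustration, without the mel-band aggregation or the exact normalization of the baseline implementation.

```python
import numpy as np

def foa_intensity_vector(stft_foa, eps=1e-8):
    """stft_foa: complex STFT array (4, freq, time) in ACN order [W, Y, Z, X]."""
    w = stft_foa[0]
    xyz = stft_foa[[3, 1, 2]]                        # reorder channels to [X, Y, Z]
    intensity = np.real(np.conj(w)[None] * xyz)      # active intensity, (3, freq, time)
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + eps
    return intensity / norm                          # unit DOA estimate per TF bin
```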
3D CNN AND CONFORMER WITH AUDIO SPECTROGRAM TRANSFORMER FOR SOUND EVENT DETECTION AND LOCALIZATION
Robin Scheibler, Tatsuya Komatsu, Yusuke Fujita, Michael Hentschel
LINE Corporation, Tokyo, Japan
Scheibler_LINE_task3_1
Abstract
We propose a network for sound event detection and localization based on a 3D CNN for the extraction of spatial features followed by several conformer layers. The CNN performs spatial feature extraction and the subsequent conformer layers predict the events and their locations. We combine this with features obtained from a fine-tuned audio-spectrogram transformer and a multi-channel separation network trained separately. The two architectures are combined by a linear layer before the final non-linearity. We first train the network on the STARSS22 dataset extended by simulation using events from FSD50K and room impulse responses from previous challenges. To bridge the gap between the simulated dataset and the STARSS22 dataset, we fine-tune the model on the development part of the STARSS22 dataset only before the final evaluation.
IMPROVING LOW-RESOURCE SOUND EVENT LOCALIZATION AND DETECTION VIA ACTIVE LEARNING WITH DOMAIN ADAPTATION
Yuhao Wang1, Yuxin Duan1, Pingjie Wang1, Yu Wang1,2, Wei Xue3
1Shanghai Jiao Tong University, Shanghai, China, 2Shanghai AI Lab, Shanghai, China, 3Hong Kong Baptist University, Hong Kong, China
Wang_SJTU_task3_1 Wang_SJTU_task3_2 Wang_SJTU_task3_3 Wang_SJTU_task3_4
Abstract
This report describes our systems submitted to the DCASE2022 challenge task 3: sound event localization and detection (SELD) evaluated in real spatial sound scenes. We present two approaches to improve performance on this task. The first is to leverage active learning to bring in and filter the AudioSet dataset based on pre-trained audio neural networks (PANNs). The second is to adapt the generic models to different sound event categories, thereby improving performance on classes with scarce data. We have also explored various model structures incorporating attention mechanisms. Finally, we combine models trained on different input recording formats. Experimental results on the validation set show that the proposed systems greatly improve all metrics compared to the baseline systems.
MLP-MIXER ENHANCED CRNN FOR SOUND EVENT LOCALIZATION AND DETECTION IN DCASE 2022 TASK 3
Shichao Wu1,2,3, Shouwang Huang1,2,3, Zicheng Liu1,2,3, Jingtai Liu1,2,3
1College of Artificial Intelligence, Nankai University, Tianjin, China, 2Institute of Robotics and Automatic Information System, Nankai University, Tianjin, China, 3Tianjin Key Laboratory of Intelligent Robotics, Nankai University, TianJin, China
Wu_NKU_task3_1 Wu_NKU_task3_2 Wu_NKU_task3_3 Wu_NKU_task3_4
Abstract
In this technical report, we give the system details of our MLP-Mixer enhanced convolutional recurrent neural networks (CRNN) submitted to the sound event localization and detection challenge in DCASE 2022. Specifically, we present two improvements over the baseline methods, concerning the input features and the model structures. For the input feature design, we incorporate the variable-Q transform (VQT) audio feature for both the Ambisonics (FOA) and microphone array (MIC) audio representations. For the deep neural network design, we improve the original CRNN by inserting a shallow MLP-Mixer module between the convolution filters and the recurrent layers to model the inter-channel audio patterns, which we believe are highly conducive to estimating the sound directions of arrival (DOA). Experiments on the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) benchmark dataset showed that our system outperforms the DCASE baseline method.
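For readers unfamiliar with the VQT feature mentioned above, librosa provides a built-in variable-Q transform; a minimal sketch follows, with illustrative parameter values that are not the authors' settings.

```python
import numpy as np
import librosa

def vqt_feature(y, sr=24000):
    """Log-magnitude VQT of a mono waveform y sampled at sr Hz."""
    vqt = librosa.vqt(y, sr=sr, hop_length=512, fmin=50.0,
                      n_bins=84, bins_per_octave=12, gamma=20.0)
    return librosa.amplitude_to_db(np.abs(vqt))  # (n_bins, n_frames) in dB
```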
ENSEMBLE OF ATTENTION BASED CRNN FOR SOUND EVENT DETECTION AND LOCALIZATION
Rong Xie, Chuang Shi, Le Zhang, Yunxuan Liu and Huiyong Li
University of Electronic Science and Technology of China, Chengdu, China
Xie_UESTC_task3_1 Xie_UESTC_task3_2 Xie_UESTC_task3_3 Xie_UESTC_task3_4
Abstract
This report describes our submitted systems for the sound event localization and detection (SELD) task of DCASE 2022, which are implemented as multi-task learning. Soft-parameter-sharing convolutional recurrent neural networks (CRNNs) with split attention (SA), a convolutional block attention module (CBAM), and coordinate attention (CA) are trained and ensembled to solve the SELD task. To improve generalization, angle noise and mini-batch time-frequency noise are introduced, and mini-batch mixup, FOA rotation, frequency shifting, random cutout, and SpecAugment are adopted. The proposed systems perform better than the baseline system on the development dataset.
SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING TIME-FREQUENCY ATTENTION AND CRISS-CROSS ATTENTION
Yin Xie1,2, Ying Hu1,2, Yunlong Li1,2, Shijing Hou1,2, Xiujuan Zhu1,2, Zihao Chen1,2, Liusong Wang1,2, Mengzhen Ma1,2
1Xinjiang University, School of Information Science and Engineering, Urumqi, China, 2Key Laboratory of Signal Detection and Processing in Xinjiang, Urumqi, China
Xie_XJU_task3_1
Abstract
This report describes our systems submitted to the DCASE2022 challenge task 3: sound event localization and detection (SELD). We design a CRNN based on an asymmetric convolution mechanism with a Time-Frequency Attention module (TFA) and a Criss-Cross Attention module (CCA), which performs well on SELD in complex real sound scenes. On the TAU-NIGENS Spatial Sound Events 2022 development dataset, our systems demonstrate a significant improvement over the baseline system. Only the first-order Ambisonics (FOA) data were considered in these experiments.
SOUND EVENT LOCALIZATION AND DETECTION COMBINED CONVOLUTIONAL CONFORMER STRUCTURE AND MULTI-ACCDOA STRATEGIES
Zhaoyu Yan, Jin Wang, Lin Yang, Junjie Wang
Lenovo Research, Beijing, China
Zhaoyu_LRVT_task3_1 Zhaoyu_LRVT_task3_2 Zhaoyu_LRVT_task3_3 Zhaoyu_LRVT_task3_4
Abstract
The sound event localization and detection (SELD) task aims to identify audio sources' direction of arrival (DOA) and the corresponding class. The SELD task was originally considered a multi-task learning problem, with DOA and sound event detection (SED) estimation branches. Single-target methods were introduced recently as more end-to-end solutions and achieve better SELD performance. The activity-coupled Cartesian DOA (ACCDOA) vector was first introduced as a single SELD training target, and multi-ACCDOA with an auxiliary duplicating permutation invariant training (ADPIT) loss handles the situation where the same event class occurs at multiple locations. In this challenge, we combine a convolutional Conformer structure with the multi-ACCDOA training target and the ADPIT strategy. With multiple data augmentation methods adopted, the proposed method achieves promising SELD improvement compared to the baseline CRNN result.
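As a minimal illustration of the ACCDOA representation discussed above: each class is assigned a 3-D Cartesian vector whose direction encodes the DOA and whose length encodes activity, so detection reduces to thresholding the vector norm at inference time. The threshold value below is illustrative.

```python
import numpy as np

def encode_accdoa(active, doa_unit):
    """active: (n_classes,) in {0, 1}; doa_unit: (n_classes, 3) unit DOA vectors."""
    return active[:, None] * doa_unit                 # (n_classes, 3) training target

def decode_accdoa(accdoa, threshold=0.5):
    """Recover per-class activity and unit DOA from a predicted ACCDOA tensor."""
    norms = np.linalg.norm(accdoa, axis=-1)
    active = norms > threshold                        # event detection by vector length
    doa = accdoa / np.maximum(norms[:, None], 1e-8)   # direction estimate
    return active, doa
```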