Task description
The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events when active, classify the type of the event from a known set of sound classes, and further localize the events in space when active.
The focus of the current SELD task is to build systems that are able to handle event polyphony while being robust to ambient noise and reverberation in different acoustic environments/rooms, under static and dynamic spatial conditions (i.e., with moving sources). The task provides two datasets, development and evaluation, recorded in a total of 13 different acoustic environments. Of the two datasets, only the development dataset provides the reference labels. The participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their system on the unseen evaluation dataset.
More details on the task setup and evaluation can be found in the task description page.
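The "(20°)" suffix on the error rate and F-score columns in the tables below refers to an angular threshold on the localization estimate. As a rough illustration, and assuming that a detection only counts as correct when its class matches the reference and its estimated direction lies within 20° of the reference direction, the check could be sketched as follows. The function names and the azimuth/elevation convention are assumptions made for this example, not the official evaluation code.

```python
import numpy as np

def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions (azimuth, elevation in degrees)."""
    az1, el1, az2, el2 = np.radians([az1, el1, az2, el2])
    cos_angle = (np.sin(el1) * np.sin(el2)
                 + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2))
    # Clip guards against values slightly outside [-1, 1] caused by rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def is_location_aware_hit(pred_class, ref_class, pred_doa, ref_doa, threshold_deg=20.0):
    """True when the class matches and the direction error is within the threshold (assumed rule)."""
    return (pred_class == ref_class
            and angular_distance_deg(*pred_doa, *ref_doa) <= threshold_deg)

# A prediction 10 degrees off in azimuth and 5 degrees off in elevation still counts.
print(is_location_aware_hit("alarm", "alarm", (0.0, 0.0), (10.0, 5.0)))
```

The task description page remains the authoritative source for the exact metric definitions.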
Teams ranking
Table including only the best performing system per submitting team.
Submission name | Technical Report | Best official system rank | Evaluation dataset: Error Rate (20°) | Evaluation dataset: F-score (20°) | Evaluation dataset: Localization error (°) | Evaluation dataset: Localization recall | Development dataset: Error Rate (20°) | Development dataset: F-score (20°) | Development dataset: Localization error (°) | Development dataset: Localization recall | |
---|---|---|---|---|---|---|---|---|---|---|---|
Du_USTC_task3_4 | Du2020_task3_report | 1 | 0.20 | 84.9 | 6.0 | 88.5 | 0.26 | 80.0 | 7.4 | 84.7 | |
Nguyen_NTU_task3_2 | Nguyen2020_task3_report | 4 | 0.23 | 82.0 | 9.3 | 90.0 | 0.36 | 71.4 | 12.1 | 82.0 | |
Shimada_SONY_task3_4 | Shimada2020_task3_report | 5 | 0.25 | 83.2 | 7.0 | 86.2 | 0.29 | 80.0 | 7.5 | 83.5 | |
Cao_Surrey_task3_4 | Cao2020_task3_report | 11 | 0.36 | 71.2 | 13.3 | 81.1 | 0.47 | 61.5 | 16.7 | 75.4 | |
Park_ETRI_task3_4 | Park2020_task3_report | 13 | 0.43 | 65.2 | 16.8 | 81.9 | 0.54 | 55.5 | 20.0 | 76.0 | |
Phan_QMUL_task3_3 | Phan2020_task3_report | 15 | 0.49 | 61.7 | 15.2 | 72.4 | 0.60 | 49.2 | 19.0 | 65.6 | |
PerezLopez_UPF_task3_2 | PerezLopez2020_task3_report | 16 | 0.51 | 60.1 | 12.4 | 65.1 | 0.44 | 68.0 | 13.3 | 79.6 | |
Sampathkumar_TUC_task3_1 | Sampathkumar2020_task3_report | 20 | 0.53 | 56.6 | 14.8 | 66.5 | 0.57 | 51.8 | 16.9 | 65.6 | |
Patel_MST_task3_4 | Patel2020_task3_report | 22 | 0.55 | 55.5 | 14.4 | 65.5 | 0.54 | 55.6 | 15.2 | 67.2 | |
Ronchini_UPF_task3_2 | Ronchini2020_task3_report | 28 | 0.58 | 50.8 | 16.9 | 65.5 | 0.60 | 49.9 | 17.9 | 66.8 | |
Naranjo-Alcazar_VFY_task3_2 | Naranjo-Alcazar2020_task3_report | 30 | 0.61 | 49.1 | 19.5 | 67.1 | 0.70 | 39.5 | 24.8 | 63.0 | |
Song_LGE_task3_3 | Song2020_task3_report | 31 | 0.57 | 50.4 | 20.0 | 64.3 | 0.57 | 50.6 | 20.2 | 64.1 | |
Tian_PKU_task3_1 | Tian2020_task3_report | 36 | 0.64 | 47.6 | 24.5 | 67.5 | 0.72 | 40.1 | 25.9 | 64.0 | |
Singla_SRIB_task3_2 | Singla2020_task3_report | 38 | 0.88 | 18.0 | 53.4 | 66.2 | 0.72 | 36.2 | 23.4 | 67.7 | |
DCASE2020_MIC_baseline | Politis2020_task3_report | 39 | 0.69 | 41.3 | 23.1 | 62.4 | 0.78 | 31.4 | 27.3 | 59.0 |
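The team table above can be derived from the per-system ranking by keeping, for each team, the submission with the lowest official rank. A minimal sketch, assuming that the team is identified by the part of the submission name before `_task3_` (a naming convention inferred from the tables, not a documented rule):

```python
def best_system_per_team(systems):
    """systems: iterable of (submission_name, official_rank) pairs."""
    best = {}
    for name, rank in systems:
        # Assumed naming scheme: <Author>_<Affiliation>_task3_<index>.
        team = name.split("_task3_")[0]
        if team not in best or rank < best[team][1]:
            best[team] = (name, rank)
    # Order the surviving entries by their official rank.
    return sorted(best.values(), key=lambda item: item[1])

rows = [
    ("Du_USTC_task3_4", 1),
    ("Du_USTC_task3_2", 2),
    ("Nguyen_NTU_task3_2", 4),
    ("Nguyen_NTU_task3_3", 5),
    ("Shimada_SONY_task3_4", 5),
]
print(best_system_per_team(rows))
```

Names without the `_task3_` marker, such as the two baselines, are treated as separate teams by this sketch and would need special handling to reproduce the table exactly.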
Systems ranking
Performance of all the submitted systems on the evaluation and the development datasets.
Submission name | Technical Report | Official rank | Evaluation dataset: Error Rate (20°) | Evaluation dataset: F-score (20°) | Evaluation dataset: Localization error (°) | Evaluation dataset: Localization recall | Development dataset: Error Rate (20°) | Development dataset: F-score (20°) | Development dataset: Localization error (°) | Development dataset: Localization recall | |
---|---|---|---|---|---|---|---|---|---|---|---|
Du_USTC_task3_4 | Du2020_task3_report | 1 | 0.20 | 84.9 | 6.0 | 88.5 | 0.26 | 80.0 | 7.4 | 84.7 | |
Du_USTC_task3_2 | Du2020_task3_report | 2 | 0.20 | 84.7 | 6.1 | 88.1 | 0.27 | 79.4 | 7.5 | 84.4 | |
Du_USTC_task3_1 | Du2020_task3_report | 3 | 0.23 | 83.0 | 6.1 | 86.2 | 0.27 | 78.9 | 7.5 | 83.6 | |
Nguyen_NTU_task3_2 | Nguyen2020_task3_report | 4 | 0.23 | 82.0 | 9.3 | 90.0 | 0.36 | 71.4 | 12.1 | 82.0 | |
Nguyen_NTU_task3_3 | Nguyen2020_task3_report | 5 | 0.23 | 81.7 | 9.3 | 90.1 | 0.36 | 71.2 | 12.1 | 82.0 | |
Nguyen_NTU_task3_4 | Nguyen2020_task3_report | 5 | 0.23 | 81.8 | 9.3 | 90.0 | 0.35 | 71.9 | 12.1 | 82.7 | |
Shimada_SONY_task3_4 | Shimada2020_task3_report | 5 | 0.25 | 83.2 | 7.0 | 86.2 | 0.29 | 80.0 | 7.5 | 83.5 | |
Du_USTC_task3_3 | Du2020_task3_report | 6 | 0.23 | 82.6 | 8.3 | 88.0 | 0.27 | 79.8 | 7.4 | 84.6 | |
Shimada_SONY_task3_3 | Shimada2020_task3_report | 7 | 0.25 | 82.6 | 7.0 | 85.5 | 0.28 | 79.9 | 7.6 | 83.7 | |
Nguyen_NTU_task3_1 | Nguyen2020_task3_report | 8 | 0.24 | 81.6 | 9.4 | 89.7 | 0.36 | 71.5 | 12.0 | 82.0 | |
Shimada_SONY_task3_2 | Shimada2020_task3_report | 9 | 0.26 | 81.7 | 7.0 | 84.6 | 0.29 | 79.4 | 7.5 | 82.9 | |
Shimada_SONY_task3_1 | Shimada2020_task3_report | 10 | 0.31 | 77.9 | 7.6 | 81.2 | 0.32 | 76.8 | 7.9 | 80.5 | |
Cao_Surrey_task3_4 | Cao2020_task3_report | 11 | 0.36 | 71.2 | 13.3 | 81.1 | 0.47 | 61.5 | 16.7 | 75.4 | |
Cao_Surrey_task3_3 | Cao2020_task3_report | 12 | 0.39 | 69.6 | 10.1 | 76.1 | 0.47 | 61.5 | 16.7 | 75.4 | |
Park_ETRI_task3_4 | Park2020_task3_report | 13 | 0.43 | 65.2 | 16.8 | 81.9 | 0.54 | 55.5 | 20.0 | 76.0 | |
Park_ETRI_task3_3 | Park2020_task3_report | 14 | 0.43 | 64.5 | 17.2 | 81.6 | 0.54 | 55.5 | 20.0 | 76.0 | |
Cao_Surrey_task3_2 | Cao2020_task3_report | 14 | 0.39 | 66.5 | 20.6 | 83.0 | 0.47 | 61.5 | 16.7 | 75.4 | |
Phan_QMUL_task3_3 | Phan2020_task3_report | 15 | 0.49 | 61.7 | 15.2 | 72.4 | 0.60 | 49.2 | 19.0 | 65.6 | |
Park_ETRI_task3_2 | Park2020_task3_report | 15 | 0.43 | 63.9 | 17.4 | 81.5 | 0.54 | 55.5 | 20.0 | 76.0 | |
PerezLopez_UPF_task3_2 | PerezLopez2020_task3_report | 16 | 0.51 | 60.1 | 12.4 | 65.1 | 0.44 | 68.0 | 13.3 | 79.6 | |
Phan_QMUL_task3_4 | Phan2020_task3_report | 17 | 0.53 | 59.2 | 14.6 | 68.2 | 0.59 | 50.8 | 18.2 | 64.1 | |
Park_ETRI_task3_1 | Park2020_task3_report | 18 | 0.50 | 59.0 | 19.2 | 79.0 | 0.54 | 55.5 | 20.0 | 76.0 | |
Phan_QMUL_task3_2 | Phan2020_task3_report | 18 | 0.55 | 58.8 | 14.6 | 68.2 | 0.59 | 50.8 | 18.2 | 64.1 | |
Phan_QMUL_task3_1 | Phan2020_task3_report | 19 | 0.52 | 57.8 | 16.8 | 69.8 | 0.60 | 49.2 | 19.0 | 65.6 | |
Sampathkumar_TUC_task3_1 | Sampathkumar2020_task3_report | 20 | 0.53 | 56.6 | 14.8 | 66.5 | 0.57 | 51.8 | 16.9 | 65.6 | |
Sampathkumar_TUC_task3_2 | Sampathkumar2020_task3_report | 21 | 0.54 | 56.3 | 15.6 | 66.8 | 0.57 | 51.6 | 17.5 | 66.1 | |
Patel_MST_task3_4 | Patel2020_task3_report | 22 | 0.55 | 55.5 | 14.4 | 65.5 | 0.54 | 55.6 | 15.2 | 67.2 | |
Patel_MST_task3_3 | Patel2020_task3_report | 23 | 0.56 | 54.5 | 15.0 | 65.1 | 0.55 | 55.4 | 14.9 | 66.5 | |
PerezLopez_UPF_task3_1 | PerezLopez2020_task3_report | 24 | 0.55 | 56.0 | 12.8 | 61.1 | 0.57 | 55.6 | 15.6 | 66.7 | |
Patel_MST_task3_1 | Patel2020_task3_report | 25 | 0.56 | 54.3 | 13.4 | 63.0 | 0.55 | 54.2 | 13.6 | 63.6 | |
Patel_MST_task3_2 | Patel2020_task3_report | 26 | 0.56 | 54.3 | 13.8 | 62.8 | 0.56 | 53.7 | 14.0 | 62.6 | |
Cao_Surrey_task3_1 | Cao2020_task3_report | 27 | 0.54 | 55.5 | 23.9 | 71.8 | 0.47 | 61.5 | 16.7 | 75.4 | |
Ronchini_UPF_task3_2 | Ronchini2020_task3_report | 28 | 0.58 | 50.8 | 16.9 | 65.5 | 0.60 | 49.9 | 17.9 | 66.8 | |
Ronchini_UPF_task3_3 | Ronchini2020_task3_report | 29 | 0.59 | 50.3 | 16.8 | 65.5 | 0.61 | 48.7 | 18.7 | 65.2 | |
Naranjo-Alcazar_VFY_task3_2 | Naranjo-Alcazar2020_task3_report | 30 | 0.61 | 49.1 | 19.5 | 67.1 | 0.70 | 39.5 | 24.8 | 63.0 | |
Song_LGE_task3_3 | Song2020_task3_report | 31 | 0.57 | 50.4 | 20.0 | 64.3 | 0.57 | 50.6 | 20.2 | 64.1 | |
Naranjo-Alcazar_VFY_task3_1 | Naranjo-Alcazar2020_task3_report | 32 | 0.61 | 48.3 | 19.2 | 65.9 | 0.69 | 40.3 | 22.1 | 63.8 | |
Ronchini_UPF_task3_1 | Ronchini2020_task3_report | 32 | 0.61 | 49.1 | 16.7 | 63.3 | 0.59 | 50.6 | 17.6 | 66.2 | |
Ronchini_UPF_task3_4 | Ronchini2020_task3_report | 33 | 0.60 | 49.1 | 17.1 | 63.7 | 0.61 | 48.4 | 18.6 | 65.6 | |
Song_LGE_task3_4 | Song2020_task3_report | 34 | 0.58 | 49.3 | 21.6 | 64.3 | 0.57 | 50.4 | 20.2 | 64.1 | |
Song_LGE_task3_1 | Song2020_task3_report | 35 | 0.58 | 49.1 | 21.8 | 64.3 | 0.57 | 50.6 | 20.1 | 64.2 | |
Naranjo-Alcazar_VFY_task3_4 | Naranjo-Alcazar2020_task3_report | 35 | 0.63 | 47.3 | 19.5 | 65.5 | 0.69 | 39.9 | 22.8 | 64.1 | |
Tian_PKU_task3_1 | Tian2020_task3_report | 36 | 0.64 | 47.6 | 24.5 | 67.5 | 0.72 | 40.1 | 25.9 | 64.0 | |
Naranjo-Alcazar_VFY_task3_3 | Naranjo-Alcazar2020_task3_report | 37 | 0.64 | 46.7 | 20.0 | 64.5 | 0.70 | 39.6 | 22.7 | 63.1 | |
Song_LGE_task3_2 | Song2020_task3_report | 37 | 0.59 | 48.0 | 23.5 | 64.3 | 0.58 | 49.5 | 21.2 | 64.2 | |
Singla_SRIB_task3_2 | Singla2020_task3_report | 38 | 0.88 | 18.0 | 53.4 | 66.2 | 0.72 | 36.2 | 23.4 | 67.7 | |
DCASE2020_MIC_baseline | Politis2020_task3_report | 39 | 0.69 | 41.3 | 23.1 | 62.4 | 0.78 | 31.4 | 27.3 | 59.0 | |
Singla_SRIB_task3_3 | Singla2020_task3_report | 40 | 0.89 | 13.3 | 59.9 | 66.8 | 0.78 | 27.1 | 25.6 | 62.3 | |
Singla_SRIB_task3_1 | Singla2020_task3_report | 41 | 0.88 | 17.4 | 55.6 | 64.6 | 0.73 | 34.2 | 24.3 | 65.8 | |
DCASE2020_FOA_baseline | Politis2020_task3_report | 42 | 0.70 | 39.5 | 23.2 | 62.1 | 0.72 | 37.4 | 22.8 | 60.7 | |
Singla_SRIB_task3_4 | Singla2020_task3_report | 43 | 0.92 | 15.9 | 57.0 | 62.5 | 0.83 | 25.5 | 26.9 | 56.9 |
Acoustic environment-wise performance
Performance of submitted systems on the two unseen acoustic environments of the evaluation dataset.
Submission name | Technical Report | Official rank | Location 1: Error rate (20°) | Location 1: F-score (20°) | Location 1: Localization error (°) | Location 1: Localization recall | Location 2: Error rate (20°) | Location 2: F-score (20°) | Location 2: Localization error (°) | Location 2: Localization recall | |
---|---|---|---|---|---|---|---|---|---|---|---|
Du_USTC_task3_4 | Du2020_task3_report | 1 | 0.18 | 86.4 | 6.0 | 90.7 | 0.22 | 83.4 | 6.0 | 86.2 | |
Du_USTC_task3_2 | Du2020_task3_report | 2 | 0.18 | 86.3 | 5.9 | 90.4 | 0.22 | 82.9 | 6.3 | 85.8 | |
Du_USTC_task3_1 | Du2020_task3_report | 3 | 0.22 | 83.9 | 6.0 | 87.7 | 0.24 | 82.1 | 6.1 | 84.6 | |
Nguyen_NTU_task3_2 | Nguyen2020_task3_report | 4 | 0.22 | 83.5 | 7.9 | 90.4 | 0.25 | 80.4 | 10.8 | 89.6 | |
Nguyen_NTU_task3_3 | Nguyen2020_task3_report | 5 | 0.22 | 83.4 | 7.9 | 90.5 | 0.25 | 80.1 | 10.8 | 89.6 | |
Nguyen_NTU_task3_4 | Nguyen2020_task3_report | 5 | 0.22 | 83.4 | 7.8 | 90.3 | 0.25 | 80.1 | 10.9 | 89.7 | |
Shimada_SONY_task3_4 | Shimada2020_task3_report | 5 | 0.24 | 84.5 | 6.7 | 87.7 | 0.26 | 81.9 | 7.3 | 84.7 | |
Du_USTC_task3_3 | Du2020_task3_report | 6 | 0.21 | 83.5 | 8.4 | 89.5 | 0.24 | 81.6 | 8.2 | 86.5 | |
Shimada_SONY_task3_3 | Shimada2020_task3_report | 7 | 0.24 | 83.8 | 6.7 | 87.0 | 0.26 | 81.4 | 7.4 | 84.1 | |
Nguyen_NTU_task3_1 | Nguyen2020_task3_report | 8 | 0.22 | 83.5 | 7.8 | 90.0 | 0.26 | 79.6 | 11.1 | 89.3 | |
Shimada_SONY_task3_2 | Shimada2020_task3_report | 9 | 0.25 | 83.1 | 6.6 | 86.1 | 0.28 | 80.2 | 7.3 | 83.0 | |
Shimada_SONY_task3_1 | Shimada2020_task3_report | 10 | 0.31 | 78.3 | 7.1 | 81.2 | 0.31 | 77.5 | 8.0 | 81.1 | |
Cao_Surrey_task3_4 | Cao2020_task3_report | 11 | 0.35 | 72.7 | 10.9 | 81.0 | 0.37 | 69.6 | 15.8 | 81.3 | |
Cao_Surrey_task3_3 | Cao2020_task3_report | 12 | 0.37 | 71.6 | 8.7 | 77.9 | 0.41 | 67.6 | 11.6 | 74.4 | |
Park_ETRI_task3_4 | Park2020_task3_report | 13 | 0.40 | 67.2 | 16.4 | 83.4 | 0.45 | 63.1 | 17.3 | 80.4 | |
Park_ETRI_task3_3 | Park2020_task3_report | 14 | 0.41 | 66.4 | 16.6 | 83.1 | 0.45 | 62.6 | 17.9 | 80.1 | |
Cao_Surrey_task3_2 | Cao2020_task3_report | 14 | 0.40 | 65.9 | 21.0 | 83.2 | 0.39 | 67.1 | 20.3 | 82.7 | |
Phan_QMUL_task3_3 | Phan2020_task3_report | 15 | 0.43 | 66.2 | 12.9 | 75.8 | 0.54 | 57.2 | 17.7 | 69.1 | |
Park_ETRI_task3_2 | Park2020_task3_report | 15 | 0.42 | 65.4 | 16.8 | 82.5 | 0.45 | 62.4 | 17.9 | 80.4 | |
PerezLopez_UPF_task3_2 | PerezLopez2020_task3_report | 16 | 0.49 | 62.1 | 12.3 | 67.0 | 0.53 | 58.1 | 12.5 | 63.1 | |
Phan_QMUL_task3_4 | Phan2020_task3_report | 17 | 0.47 | 62.2 | 13.1 | 71.0 | 0.60 | 56.3 | 16.2 | 65.6 | |
Park_ETRI_task3_1 | Park2020_task3_report | 18 | 0.50 | 58.2 | 18.6 | 76.8 | 0.49 | 59.7 | 19.7 | 81.2 | |
Phan_QMUL_task3_2 | Phan2020_task3_report | 18 | 0.48 | 62.2 | 13.8 | 71.6 | 0.62 | 55.6 | 15.4 | 64.8 | |
Phan_QMUL_task3_1 | Phan2020_task3_report | 19 | 0.47 | 62.9 | 14.4 | 72.9 | 0.57 | 52.7 | 19.5 | 66.6 | |
Sampathkumar_TUC_task3_1 | Sampathkumar2020_task3_report | 20 | 0.52 | 57.7 | 13.1 | 66.5 | 0.54 | 55.4 | 16.6 | 66.6 | |
Sampathkumar_TUC_task3_2 | Sampathkumar2020_task3_report | 21 | 0.53 | 57.8 | 13.0 | 66.1 | 0.56 | 54.9 | 18.0 | 67.5 | |
Patel_MST_task3_4 | Patel2020_task3_report | 22 | 0.54 | 56.3 | 12.4 | 65.2 | 0.56 | 54.7 | 16.4 | 65.8 | |
Patel_MST_task3_3 | Patel2020_task3_report | 23 | 0.56 | 54.1 | 13.5 | 63.5 | 0.55 | 54.9 | 16.4 | 66.6 | |
PerezLopez_UPF_task3_1 | PerezLopez2020_task3_report | 24 | 0.53 | 57.3 | 13.1 | 62.5 | 0.58 | 54.7 | 12.6 | 59.6 | |
Patel_MST_task3_1 | Patel2020_task3_report | 25 | 0.55 | 55.1 | 11.7 | 62.7 | 0.57 | 53.5 | 15.1 | 63.3 | |
Patel_MST_task3_2 | Patel2020_task3_report | 26 | 0.55 | 55.1 | 12.4 | 62.0 | 0.57 | 53.5 | 15.1 | 63.7 | |
Cao_Surrey_task3_1 | Cao2020_task3_report | 27 | 0.54 | 54.7 | 24.3 | 72.2 | 0.53 | 56.3 | 23.5 | 71.4 | |
Ronchini_UPF_task3_2 | Ronchini2020_task3_report | 28 | 0.56 | 52.8 | 15.4 | 65.0 | 0.60 | 48.9 | 18.3 | 66.1 | |
Ronchini_UPF_task3_3 | Ronchini2020_task3_report | 29 | 0.59 | 50.4 | 16.7 | 64.7 | 0.60 | 50.2 | 17.0 | 66.3 | |
Naranjo-Alcazar_VFY_task3_2 | Naranjo-Alcazar2020_task3_report | 30 | 0.57 | 52.0 | 17.2 | 68.8 | 0.66 | 46.2 | 21.8 | 65.4 | |
Song_LGE_task3_3 | Song2020_task3_report | 31 | 0.57 | 50.1 | 20.8 | 65.3 | 0.58 | 50.7 | 19.1 | 63.3 | |
Naranjo-Alcazar_VFY_task3_1 | Naranjo-Alcazar2020_task3_report | 32 | 0.59 | 50.1 | 17.7 | 65.9 | 0.63 | 46.6 | 20.7 | 65.8 | |
Ronchini_UPF_task3_1 | Ronchini2020_task3_report | 32 | 0.62 | 48.5 | 16.3 | 61.1 | 0.60 | 49.7 | 17.0 | 65.4 | |
Ronchini_UPF_task3_4 | Ronchini2020_task3_report | 33 | 0.57 | 52.4 | 16.0 | 65.0 | 0.64 | 45.9 | 18.4 | 62.5 | |
Song_LGE_task3_4 | Song2020_task3_report | 34 | 0.57 | 49.7 | 21.8 | 65.3 | 0.59 | 48.8 | 21.4 | 63.3 | |
Song_LGE_task3_1 | Song2020_task3_report | 35 | 0.58 | 49.3 | 22.1 | 65.3 | 0.59 | 48.9 | 21.6 | 63.3 | |
Naranjo-Alcazar_VFY_task3_4 | Naranjo-Alcazar2020_task3_report | 35 | 0.63 | 48.7 | 18.0 | 66.1 | 0.64 | 45.9 | 21.0 | 64.9 | |
Tian_PKU_task3_1 | Tian2020_task3_report | 36 | 0.60 | 50.9 | 21.8 | 69.1 | 0.68 | 44.3 | 27.2 | 65.8 | |
Naranjo-Alcazar_VFY_task3_3 | Naranjo-Alcazar2020_task3_report | 37 | 0.61 | 49.3 | 18.5 | 65.3 | 0.67 | 44.1 | 21.5 | 63.6 | |
Song_LGE_task3_2 | Song2020_task3_report | 37 | 0.58 | 48.9 | 23.1 | 65.3 | 0.61 | 47.0 | 23.8 | 63.3 | |
Singla_SRIB_task3_2 | Singla2020_task3_report | 38 | 0.85 | 21.6 | 57.6 | 66.3 | 0.91 | 14.2 | 49.1 | 66.0 | |
DCASE2020_MIC_baseline | Politis2020_task3_report | 39 | 0.66 | 44.0 | 21.8 | 65.9 | 0.72 | 38.6 | 24.7 | 58.9 | |
Singla_SRIB_task3_3 | Singla2020_task3_report | 40 | 0.84 | 19.7 | 61.5 | 68.3 | 0.95 | 6.9 | 58.2 | 65.2 | |
Singla_SRIB_task3_1 | Singla2020_task3_report | 41 | 0.82 | 22.5 | 58.8 | 69.5 | 0.93 | 12.3 | 51.9 | 59.7 | |
DCASE2020_FOA_baseline | Politis2020_task3_report | 42 | 0.66 | 43.3 | 20.5 | 65.0 | 0.74 | 35.5 | 26.2 | 59.1 | |
Singla_SRIB_task3_4 | Singla2020_task3_report | 43 | 0.87 | 20.2 | 60.2 | 63.7 | 0.96 | 11.4 | 53.6 | 61.4 |
Event polyphony-wise performance
Performance of submitted systems on different numbers of overlapping events of the evaluation dataset.
Submission name | Technical Report | Official rank | No overlapping: Error rate (20°) | No overlapping: F-score (20°) | No overlapping: Localization error (°) | No overlapping: Localization recall | 2 overlapping: Error rate (20°) | 2 overlapping: F-score (20°) | 2 overlapping: Localization error (°) | 2 overlapping: Localization recall | |
---|---|---|---|---|---|---|---|---|---|---|---|
Du_USTC_task3_4 | Du2020_task3_report | 1 | 0.12 | 90.8 | 3.4 | 91.1 | 0.24 | 81.8 | 7.5 | 87.1 | |
Du_USTC_task3_2 | Du2020_task3_report | 2 | 0.12 | 90.9 | 3.6 | 91.3 | 0.25 | 81.3 | 7.5 | 86.4 | |
Du_USTC_task3_1 | Du2020_task3_report | 3 | 0.14 | 89.1 | 3.6 | 89.5 | 0.27 | 79.7 | 7.5 | 84.4 | |
Nguyen_NTU_task3_2 | Nguyen2020_task3_report | 4 | 0.10 | 92.8 | 5.1 | 94.0 | 0.30 | 76.1 | 11.7 | 87.8 | |
Nguyen_NTU_task3_3 | Nguyen2020_task3_report | 5 | 0.10 | 92.5 | 5.2 | 93.9 | 0.30 | 75.9 | 11.7 | 88.0 | |
Nguyen_NTU_task3_4 | Nguyen2020_task3_report | 5 | 0.11 | 92.4 | 5.1 | 93.8 | 0.30 | 76.1 | 11.7 | 88.0 | |
Shimada_SONY_task3_4 | Shimada2020_task3_report | 5 | 0.18 | 88.4 | 5.2 | 89.3 | 0.28 | 80.3 | 8.0 | 84.5 | |
Du_USTC_task3_3 | Du2020_task3_report | 6 | 0.13 | 90.1 | 5.2 | 90.8 | 0.27 | 78.6 | 10.0 | 86.6 | |
Shimada_SONY_task3_3 | Shimada2020_task3_report | 7 | 0.18 | 87.9 | 5.2 | 88.7 | 0.28 | 79.7 | 8.1 | 83.8 | |
Nguyen_NTU_task3_1 | Nguyen2020_task3_report | 8 | 0.11 | 92.2 | 5.1 | 93.5 | 0.30 | 75.8 | 11.9 | 87.6 | |
Shimada_SONY_task3_2 | Shimada2020_task3_report | 9 | 0.20 | 86.7 | 5.3 | 87.6 | 0.30 | 78.9 | 7.9 | 82.9 | |
Shimada_SONY_task3_1 | Shimada2020_task3_report | 10 | 0.25 | 83.2 | 5.9 | 84.6 | 0.34 | 75.1 | 8.5 | 79.3 | |
Cao_Surrey_task3_4 | Cao2020_task3_report | 11 | 0.20 | 84.4 | 6.6 | 87.0 | 0.44 | 63.9 | 17.5 | 77.9 | |
Cao_Surrey_task3_3 | Cao2020_task3_report | 12 | 0.24 | 82.3 | 4.6 | 83.6 | 0.46 | 62.7 | 13.5 | 72.0 | |
Park_ETRI_task3_4 | Park2020_task3_report | 13 | 0.25 | 81.5 | 9.9 | 88.0 | 0.52 | 56.0 | 21.2 | 78.5 | |
Park_ETRI_task3_3 | Park2020_task3_report | 14 | 0.26 | 80.2 | 10.4 | 87.2 | 0.52 | 55.7 | 21.5 | 78.5 | |
Cao_Surrey_task3_2 | Cao2020_task3_report | 14 | 0.15 | 88.7 | 4.9 | 90.2 | 0.52 | 54.4 | 30.4 | 79.0 | |
Phan_QMUL_task3_3 | Phan2020_task3_report | 15 | 0.33 | 76.8 | 9.1 | 80.5 | 0.57 | 52.9 | 19.4 | 67.7 | |
Park_ETRI_task3_2 | Park2020_task3_report | 15 | 0.27 | 79.9 | 10.1 | 86.9 | 0.52 | 55.0 | 21.9 | 78.5 | |
PerezLopez_UPF_task3_2 | PerezLopez2020_task3_report | 16 | 0.34 | 76.3 | 7.7 | 77.1 | 0.60 | 50.6 | 16.1 | 57.9 | |
Phan_QMUL_task3_4 | Phan2020_task3_report | 17 | 0.41 | 73.5 | 6.8 | 75.0 | 0.60 | 50.8 | 20.0 | 64.2 | |
Park_ETRI_task3_1 | Park2020_task3_report | 18 | 0.36 | 73.7 | 11.4 | 82.6 | 0.57 | 50.6 | 23.9 | 76.9 | |
Phan_QMUL_task3_2 | Phan2020_task3_report | 18 | 0.43 | 73.0 | 7.5 | 74.8 | 0.60 | 50.4 | 19.6 | 64.2 | |
Phan_QMUL_task3_1 | Phan2020_task3_report | 19 | 0.36 | 73.1 | 10.1 | 77.2 | 0.60 | 49.0 | 21.4 | 65.5 | |
Sampathkumar_TUC_task3_1 | Sampathkumar2020_task3_report | 20 | 0.40 | 69.4 | 9.9 | 74.0 | 0.60 | 49.0 | 18.3 | 62.2 | |
Sampathkumar_TUC_task3_2 | Sampathkumar2020_task3_report | 21 | 0.43 | 68.4 | 8.9 | 71.8 | 0.60 | 49.3 | 20.0 | 63.9 | |
Patel_MST_task3_4 | Patel2020_task3_report | 22 | 0.43 | 66.7 | 9.8 | 71.6 | 0.61 | 48.8 | 17.6 | 61.8 | |
Patel_MST_task3_3 | Patel2020_task3_report | 23 | 0.42 | 67.5 | 10.7 | 73.4 | 0.63 | 46.7 | 18.1 | 60.1 | |
PerezLopez_UPF_task3_1 | PerezLopez2020_task3_report | 24 | 0.38 | 72.3 | 8.1 | 73.5 | 0.64 | 46.3 | 16.7 | 53.7 | |
Patel_MST_task3_1 | Patel2020_task3_report | 25 | 0.45 | 64.9 | 9.8 | 69.9 | 0.61 | 48.0 | 16.0 | 58.8 | |
Patel_MST_task3_2 | Patel2020_task3_report | 26 | 0.45 | 64.9 | 9.6 | 69.6 | 0.62 | 48.0 | 16.8 | 58.8 | |
Cao_Surrey_task3_1 | Cao2020_task3_report | 27 | 0.42 | 71.4 | 6.8 | 74.2 | 0.60 | 47.6 | 32.8 | 70.6 | |
Ronchini_UPF_task3_2 | Ronchini2020_task3_report | 28 | 0.45 | 64.1 | 12.2 | 72.8 | 0.65 | 43.1 | 20.2 | 61.3 | |
Ronchini_UPF_task3_3 | Ronchini2020_task3_report | 29 | 0.47 | 63.7 | 11.8 | 72.0 | 0.66 | 42.5 | 20.2 | 61.7 | |
Naranjo-Alcazar_VFY_task3_2 | Naranjo-Alcazar2020_task3_report | 30 | 0.50 | 62.4 | 14.1 | 74.8 | 0.67 | 41.2 | 23.3 | 62.5 | |
Song_LGE_task3_3 | Song2020_task3_report | 31 | 0.41 | 64.4 | 12.6 | 73.0 | 0.66 | 42.2 | 25.3 | 59.2 | |
Naranjo-Alcazar_VFY_task3_1 | Naranjo-Alcazar2020_task3_report | 32 | 0.48 | 61.7 | 13.9 | 73.5 | 0.68 | 40.6 | 22.9 | 61.5 | |
Ronchini_UPF_task3_1 | Ronchini2020_task3_report | 32 | 0.52 | 59.8 | 12.5 | 68.9 | 0.66 | 42.7 | 19.5 | 59.9 | |
Ronchini_UPF_task3_4 | Ronchini2020_task3_report | 33 | 0.49 | 61.6 | 11.9 | 69.3 | 0.67 | 41.9 | 20.7 | 60.5 | |
Song_LGE_task3_4 | Song2020_task3_report | 34 | 0.42 | 63.3 | 14.4 | 73.0 | 0.67 | 41.1 | 26.8 | 59.2 | |
Song_LGE_task3_1 | Song2020_task3_report | 35 | 0.42 | 63.1 | 13.3 | 73.0 | 0.67 | 40.8 | 28.1 | 59.2 | |
Naranjo-Alcazar_VFY_task3_4 | Naranjo-Alcazar2020_task3_report | 35 | 0.52 | 60.2 | 13.7 | 72.6 | 0.69 | 39.7 | 23.5 | 61.4 | |
Tian_PKU_task3_1 | Tian2020_task3_report | 36 | 0.47 | 66.7 | 10.4 | 72.4 | 0.73 | 37.1 | 33.1 | 64.7 | |
Naranjo-Alcazar_VFY_task3_3 | Naranjo-Alcazar2020_task3_report | 37 | 0.52 | 59.9 | 13.7 | 71.4 | 0.70 | 39.2 | 24.3 | 60.5 | |
Song_LGE_task3_2 | Song2020_task3_report | 37 | 0.43 | 62.0 | 15.1 | 73.0 | 0.68 | 39.7 | 29.5 | 59.2 | |
Singla_SRIB_task3_2 | Singla2020_task3_report | 38 | 0.85 | 23.3 | 49.9 | 73.3 | 0.90 | 14.8 | 55.8 | 62.0 | |
DCASE2020_MIC_baseline | Politis2020_task3_report | 39 | 0.75 | 33.7 | 16.0 | 69.4 | 0.75 | 33.7 | 28.1 | 58.3 | |
Singla_SRIB_task3_3 | Singla2020_task3_report | 40 | 0.86 | 17.5 | 58.0 | 74.2 | 0.91 | 10.8 | 61.2 | 62.4 | |
Singla_SRIB_task3_1 | Singla2020_task3_report | 41 | 0.85 | 22.0 | 53.6 | 71.0 | 0.89 | 14.7 | 57.0 | 60.7 | |
DCASE2020_FOA_baseline | Politis2020_task3_report | 42 | 0.75 | 32.5 | 26.7 | 57.4 | 0.58 | 51.3 | 18.3 | 69.9 | |
Singla_SRIB_task3_4 | Singla2020_task3_report | 43 | 0.92 | 19.7 | 54.7 | 68.9 | 0.91 | 13.6 | 58.6 | 58.7 |
System characteristics
Summary of the characteristics of the submitted systems.
Rank | Submission name | Technical Report | Classifier | Classifier params | Audio format | Acoustic feature | Data augmentation |
---|---|---|---|---|---|---|---|
1 | Du_USTC_task3_4 | Du2020_task3_report | CRNN, CNN, ensemble | 123942947 | both | mel spectra, intensity vector, GCC | time mixing, time and frequency masking, multichannel data simulation, voice channel switching |
2 | Du_USTC_task3_2 | Du2020_task3_report | CRNN, CNN, ensemble | 32725833 | both | mel spectra, intensity vector, GCC | time mixing, time and frequency masking, multichannel data simulation, voice channel switching |
3 | Du_USTC_task3_1 | Du2020_task3_report | CRNN, CNN, ensemble | 66238098 | both | mel spectra, intensity vector, GCC | time mixing, time and frequency masking, multichannel data simulation, voice channel switching |
4 | Nguyen_NTU_task3_2 | Nguyen2020_task3_report | CRNN, ensemble | 11589297 | Ambisonic | mel spectra, complex spectra | mixup, frequency-shift, random-cutout, specaugment |
5 | Nguyen_NTU_task3_3 | Nguyen2020_task3_report | CRNN, ensemble | 12418724 | Ambisonic | mel spectra, complex spectra | mixup, frequency-shift, random-cutout, specaugment |
5 | Nguyen_NTU_task3_4 | Nguyen2020_task3_report | CRNN, ensemble | 12418724 | Ambisonic | mel spectra, complex spectra | mixup, frequency-shift, random-cutout, specaugment |
5 | Shimada_SONY_task3_4 | Shimada2020_task3_report | RD3Net, ensemble | 11715040 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, Multichannel SpecAugment |
6 | Du_USTC_task3_3 | Du2020_task3_report | CRNN, CNN, ensemble | 24979016 | both | mel spectra, intensity vector, GCC | time mixing, time and frequency masking, multichannel data simulation, voice channel switching |
7 | Shimada_SONY_task3_3 | Shimada2020_task3_report | RD3Net, CRNN, ensemble | 14739274 | Ambisonic | magnitude spectra, PCEN mel spectra, IPD, cosIPD, sinIPD | EMDA, rotation, Multichannel SpecAugment |
8 | Nguyen_NTU_task3_1 | Nguyen2020_task3_report | CRNN, ensemble | 10759870 | Ambisonic | mel spectra, complex spectra | mixup, frequency-shift, random-cutout, specaugment |
9 | Shimada_SONY_task3_2 | Shimada2020_task3_report | RD3Net, ensemble | 8369540 | Ambisonic | magnitude spectra, IPD, cosIPD, sinIPD | EMDA, rotation, Multichannel SpecAugment |
10 | Shimada_SONY_task3_1 | Shimada2020_task3_report | RD3Net | 1674680 | Ambisonic | magnitude spectra, IPD | EMDA, rotation, Multichannel SpecAugment |
11 | Cao_Surrey_task3_4 | Cao2020_task3_report | CRNN | 23799012 | Ambisonic | mel spectra, intensity vector | |
12 | Cao_Surrey_task3_3 | Cao2020_task3_report | CRNN | 23799012 | Ambisonic | mel spectra, intensity vector | |
13 | Park_ETRI_task3_4 | Park2020_task3_report | FPN, RNN, TrellisNet, ensemble | 19510056 | both | mel spectra, intensity vector, HPSS | time stretching |
14 | Park_ETRI_task3_3 | Park2020_task3_report | FPN, RNN, TrellisNet, ensemble | 13078986 | both | mel spectra, intensity vector, HPSS | |
14 | Cao_Surrey_task3_2 | Cao2020_task3_report | CRNN | 23799012 | Ambisonic | mel spectra, intensity vector | |
15 | Phan_QMUL_task3_3 | Phan2020_task3_report | self-attention CRNN | 116118 | Ambisonic | mel spectra, intensity vector | SpecAugment |
15 | Park_ETRI_task3_2 | Park2020_task3_report | FPN, RNN, TrellisNet | 6647916 | both | mel spectra, intensity vector, HPSS | |
16 | PerezLopez_UPF_task3_2 | PerezLopez2020_task3_report | GBM | 20800 | Ambisonic | diffuseness | |
17 | Phan_QMUL_task3_4 | Phan2020_task3_report | self-attention CRNN | 116118 | Microphone Array | mel spectra, GCC | SpecAugment |
18 | Park_ETRI_task3_1 | Park2020_task3_report | FPN, RNN, TrellisNet | 6647916 | both | mel spectra, intensity vector, HPSS | |
18 | Phan_QMUL_task3_2 | Phan2020_task3_report | self-attention CRNN | 116118 | Microphone Array | mel spectra, GCC | SpecAugment |
19 | Phan_QMUL_task3_1 | Phan2020_task3_report | self-attention CRNN | 116118 | Ambisonic | mel spectra, intensity vector | SpecAugment |
20 | Sampathkumar_TUC_task3_1 | Sampathkumar2020_task3_report | CRNN | 8010648 | Ambisonic | Intensity vector | |
21 | Sampathkumar_TUC_task3_2 | Sampathkumar2020_task3_report | CRNN | 8010648 | Microphone Array | Intensity vector and GCC | |
22 | Patel_MST_task3_4 | Patel2020_task3_report | FC-CRNN | 14463224 | both | mel spectra, intensity vector, GCC | |
23 | Patel_MST_task3_3 | Patel2020_task3_report | CRNN | 14463224 | both | mel spectra, intensity vector, GCC | |
24 | PerezLopez_UPF_task3_1 | PerezLopez2020_task3_report | GBM | 20800 | Ambisonic | diffuseness | time shifting, time stretching, pitch shifting, white noise addition, reverberation |
25 | Patel_MST_task3_1 | Patel2020_task3_report | FC-CRNN | 107143683 | both | mel spectra, intensity vector, GCC | |
26 | Patel_MST_task3_2 | Patel2020_task3_report | FC-CRNN | 107143683 | both | mel spectra, intensity vector, GCC | |
27 | Cao_Surrey_task3_1 | Cao2020_task3_report | CRNN | 23799012 | Ambisonic | mel spectra, intensity vector | |
28 | Ronchini_UPF_task3_2 | Ronchini2020_task3_report | CRNN | 1244536 | Ambisonic | mel spectra, intensity vector | channel rotations |
29 | Ronchini_UPF_task3_3 | Ronchini2020_task3_report | CRNN | 1278200 | Ambisonic | mel spectra, intensity vector | channel rotations |
30 | Naranjo-Alcazar_VFY_task3_2 | Naranjo-Alcazar2020_task3_report | CRNN | 660264 | Microphone Array | mel spectra, GCC | |
31 | Song_LGE_task3_3 | Song2020_task3_report | CRNN | 2586601 | Both | mel spectra, GCC, intensity vector, angle mask | |
32 | Naranjo-Alcazar_VFY_task3_1 | Naranjo-Alcazar2020_task3_report | CRNN | 635496 | Microphone Array | mel spectra, GCC | |
32 | Ronchini_UPF_task3_1 | Ronchini2020_task3_report | CRNN | 1244536 | Ambisonic | mel spectra, intensity vector | channel rotations |
33 | Ronchini_UPF_task3_4 | Ronchini2020_task3_report | CRNN | 850680 | Ambisonic | mel spectra, intensity vector | channel rotations |
34 | Song_LGE_task3_4 | Song2020_task3_report | CRNN | 2717033 | Both | mel spectra, GCC, intensity vector, angle mask | |
35 | Song_LGE_task3_1 | Song2020_task3_report | CRNN | 2587753 | Microphone Array | mel spectra, GCC | |
35 | Naranjo-Alcazar_VFY_task3_4 | Naranjo-Alcazar2020_task3_report | CRNN | 637224 | Microphone Array | mel spectra, GCC | |
36 | Tian_PKU_task3_1 | Tian2020_task3_report | CRNN | 2000082 | Ambisonic | mel spectra, intensity vector | |
37 | Naranjo-Alcazar_VFY_task3_3 | Naranjo-Alcazar2020_task3_report | CRNN | 638760 | Microphone Array | mel spectra, GCC | |
37 | Song_LGE_task3_2 | Song2020_task3_report | CRNN | 2717033 | Both | mel spectra, GCC, intensity vector, angle mask | |
38 | Singla_SRIB_task3_2 | Singla2020_task3_report | CRNN | 517670 | Ambisonic | phase and magnitude spectra, mel spectra, intensity vector | |
39 | DCASE2020_MIC_baseline | Politis2020_task3_report | CRNN | 513000 | Microphone Array | mel spectra, GCC | |
40 | Singla_SRIB_task3_3 | Singla2020_task3_report | CRNN | 517670 | Ambisonic | phase and magnitude spectra, mel spectra, intensity vector | |
41 | Singla_SRIB_task3_1 | Singla2020_task3_report | CRNN | 513288 | Ambisonic | phase and magnitude spectra, mel spectra, intensity vector | |
42 | DCASE2020_FOA_baseline | Politis2020_task3_report | CRNN | 513000 | Ambisonic | mel spectra, intensity vector | |
43 | Singla_SRIB_task3_4 | Singla2020_task3_report | CRNN | 513288 | Ambisonic | phase and magnitude spectra, mel spectra, intensity vector | time stretching, block mixing |
Technical reports
EVENT-INDEPENDENT NETWORK FOR POLYPHONIC SOUND EVENT LOCALIZATION AND DETECTION
Yin Cao¹, Turab Iqbal¹, Qiuqiang Kong², Zhong Yue¹, Wenwu Wang¹, Mark D. Plumbley¹
¹University of Surrey, ²ByteDance Ltd.
Abstract
Polyphonic sound event localization and detection aims not only to detect which sound events are occurring but also to localize the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. This year, the sound event localization and detection task brings additional challenges: moving sources and up to two overlapping sound events, including cases where two events of the same type arrive from two different direction-of-arrival (DoA) angles. In this report, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method that we proposed last year [1], this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are fed into a 1-D convolutional layer to extract log-mel spectrograms and intensity vectors. The network is then split into two parallel branches. The first branch is for sound event detection (SED), and the second branch is for DoA estimation. The network produces three types of predictions: SED predictions, DoA predictions, and event activity detection (EAD) predictions, which are used to combine the SED and DoA features for onset and offset estimation. All of these predictions are formatted as two tracks, reflecting that there are at most two overlapping events. Within each track, at most one event can be active at a time. This architecture introduces a track permutation problem. To address it, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Performance on the Task 3 dataset is greatly improved compared with the baseline method.
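The frame-level permutation invariant training mentioned in the abstract can be illustrated with a small NumPy sketch. It assumes two output tracks and a squared-error loss; the actual loss terms, tensor shapes, and the function name `frame_level_pit_loss` are illustrative assumptions and are not taken from the report.

```python
import numpy as np

def frame_level_pit_loss(pred, target):
    """pred, target: arrays of shape (frames, 2 tracks, dims)."""
    # Per-frame squared error with the tracks kept in their given order.
    keep = ((pred - target) ** 2).sum(axis=(1, 2))
    # Per-frame squared error with the two target tracks swapped.
    swap = ((pred - target[:, ::-1, :]) ** 2).sum(axis=(1, 2))
    # Pick the cheaper assignment independently for every frame, then average.
    return np.minimum(keep, swap).mean()

# Track 0 carries ones and track 1 carries zeros in every frame.
target = np.zeros((3, 2, 2))
target[:, 0, :] = 1.0
# The prediction is perfect except that the tracks are swapped in frame 1.
pred = target.copy()
pred[1] = pred[1, ::-1].copy()
print(frame_level_pit_loss(pred, target))
```

Because the assignment is chosen per frame, a prediction that places the right events on the "wrong" tracks in some frames is not penalized, which is the point of making the loss permutation invariant.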
THE USTC-IFLYTEK SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2020 CHALLENGE
Qing Wang¹, Huaxin Wu², Zijun Jing², Feng Ma², Yi Fang², Yuxuan Wang¹, Tairan Chen¹, Jia Pan², Jun Du¹, Chin-Hui Lee³
¹University of Science and Technology of China, ²IFLYTEK CO. LTD., ³Georgia Institute of Technology
Du_USTC_task3_3, Du_USTC_task3_4, Du_USTC_task3_2, Du_USTC_task3_1
Abstract
In this report, we present our method for the DCASE 2020 challenge task Sound Event Localization and Detection (SELD). We propose a complete technical solution, which consists of data augmentation, network training, model ensemble, and post-processing. First, more training data is generated by applying transformations to both Ambisonic and microphone array signals, and by mixing the non-overlapping samples in the development dataset. SpecAugment is also used as an augmentation technique to expand the training dataset. Then we train several deep neural network (DNN) architectures to jointly predict the spatial and temporal location of sound events in addition to their type. In addition, for SED estimation, we use a softmax activation function to handle the classification of both non-overlapping and overlapping sound events. With several network architectures, a more robust prediction of SED and directions-of-arrival (DOA) is obtained by model ensemble. Finally, we use post-processing to apply different thresholds to different sound events. The proposed system is evaluated on the development set of TAU-NIGENS Spatial Sound Events 2020.
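The post-processing step described in the abstract, applying a different threshold to each sound event class, can be sketched as follows. The threshold values and the function name `apply_class_thresholds` are hypothetical and chosen only for illustration; the report does not specify them here.

```python
import numpy as np

def apply_class_thresholds(probs, thresholds):
    """probs: (frames, classes) activity scores; thresholds: (classes,) per-class cut-offs."""
    probs = np.asarray(probs, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcasting compares every frame against its class-specific threshold.
    return probs >= thresholds

# Two frames, two classes; the second class uses a lower (hypothetical) threshold.
probs = [[0.6, 0.4],
         [0.2, 0.35]]
active = apply_class_thresholds(probs, thresholds=[0.5, 0.3])
print(active)
```

A single global threshold would mark the second class as inactive in both frames at 0.5, whereas the class-specific value of 0.3 keeps it active.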
TASK 3 DCASE 2020: SOUND EVENT LOCALIZATION AND DETECTION USING RESIDUAL SQUEEZE-EXCITATION CNNS
Javier Naranjo-Alcazar1, Sergi Perez-Castanos2, Jose Ferrandis2, Pedro Zuccarello2, Maximo Cobos1
1Universitat de Valencia, 2Visualfy
Naranjo-Alcazar_VFY_task3_3, Naranjo-Alcazar_VFY_task3_2, Naranjo-Alcazar_VFY_task3_1, Naranjo-Alcazar_VFY_task3_4
Abstract
Sound Event Localization and Detection (SELD) is a problem related to the field of machine listening whose objective is to recognize individual sound events, detect their temporal activity, and estimate their spatial location. Thanks to the emergence of more hard-labeled audio datasets, Deep Learning techniques have become state-of-the-art solutions. The most common ones implement a convolutional recurrent network (CRNN) after transforming the audio signal into a multichannel 2D representation. In the context of this problem, the input to the network usually has many more channels than in other problems related to machine listening. This is because the audio is recorded by an array of microphones. A frequency representation is obtained for each of them, together with some additional representations, such as the generalized cross-correlation (GCC), whose objective is to assess the relationship between channels. This work aims to improve the accuracy of the baseline CRNN by adding residual squeeze-excitation (SE) blocks in the convolutional part of the CRNN. The procedure followed involves a grid search over the ratio parameter of the residual SE block, whereas the hyperparameters of the network remain the same as in the baseline. Experiments show that by simply introducing the residual SE blocks, the results obtained in the development phase clearly exceed the baseline.
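As a rough illustration of the squeeze-excitation idea referred to above, the sketch below gates each channel by a learned scalar and adds an identity shortcut. The convolutions are omitted, and the way the gate and the shortcut are combined here is an assumption, not the configuration used in the report; `residual_se_block` and the weight shapes are hypothetical.

```python
import math


def residual_se_block(x, w_reduce, w_expand):
    """x: list of C channels, each a flat list of time-frequency values.
    w_reduce: (C // ratio) x C weights, w_expand: C x (C // ratio) weights.
    The ratio parameter sets the width of the bottleneck.
    """
    # Squeeze: global average of each channel.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_reduce]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
         for row in w_expand]
    # Scale each channel by its gate and add the identity (residual) path.
    return [[v + g * v for v in ch] for ch, g in zip(x, s)]


# Two channels, ratio 2, all-zero weights: every gate equals sigmoid(0) = 0.5.
out = residual_se_block([[1.0, 3.0], [2.0, 2.0]],
                        w_reduce=[[0.0, 0.0]],
                        w_expand=[[0.0], [0.0]])
```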
DCASE 2020 TASK 3: ENSEMBLE OF SEQUENCE MATCHING NETWORKS FOR DYNAMIC SOUND EVENT LOCALIZATION, DETECTION, AND TRACKING
Thi Ngoc Tho Nguyen1, Douglas L. Jones2, Woon Seng Gan1
1Nanyang Technological University, 2University of Illinois Urbana-Champaign
Nguyen_NTU_task3_4, Nguyen_NTU_task3_2, Nguyen_NTU_task3_3, Nguyen_NTU_task3_1
Abstract
Sound event localization and detection consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different event classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Therefore, it is often difficult to train the two subtasks jointly. Our previous sequence matching approach, which solves sound event detection and direction-of-arrival estimation separately and trains a convolutional recurrent neural network to associate the sound classes with the directions-of-arrival using the onsets and offsets of the sound events, showed improved performance for multiple-static-sound-source scenarios compared to other state-of-the-art networks such as the SELDnet and the two-stage networks. Experimental results on the new DCASE dataset for sound event localization, detection, and tracking of multiple moving sound sources showed that the sequence matching network also outperformed the jointly trained SELDnet model. In order to estimate directions-of-arrival of moving sound sources with high spatial resolution, we proposed to separate the directional estimations into azimuth and elevation before feeding them into the sequence matching network. We combined several sequence matching networks into ensembles and achieved a sound event detection and localization error of 0.217, compared to 0.466 for the baseline.
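The combined detection and localization error quoted in abstracts such as the one above aggregates the four challenge metrics into one number. The sketch below assumes the aggregate is the unweighted mean of the error rate, one minus the F-score, the localization error divided by 180°, and one minus the localization recall; the function name `seld_score` is hypothetical.

```python
def seld_score(error_rate, f_score_pct, loc_error_deg, loc_recall_pct):
    """Combined SELD error: lower is better, 0 means a perfect system.

    F-score and localization recall are given in percent, the localization
    error in degrees.
    """
    return (error_rate
            + (1.0 - f_score_pct / 100.0)
            + loc_error_deg / 180.0
            + (1.0 - loc_recall_pct / 100.0)) / 4.0


# Development-set values listed for Nguyen_NTU_task3_2 in the teams ranking table.
score = seld_score(0.36, 71.4, 12.1, 82.0)
```

With these inputs the score evaluates to roughly 0.22.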
SOUND EVENT LOCALIZATION AND DETECTION WITH VARIOUS LOSS FUNCTIONS
Sooyoung Park1, Sangwon Suh1, Youngho Jeong1
1Electronics and Telecommunications Research Institute
Park_ETRI_task3_1, Park_ETRI_task3_3, Park_ETRI_task3_2, Park_ETRI_task3_4
Abstract
This technical report presents our system submitted to DCASE 2020 Task 3. The goal of DCASE Task 3 is to detect sound events and their locations when polyphonic sound events move dynamically. We focus on designing loss functions to overcome the characteristics of the sub-tasks and the imbalanced dataset. Temporal masking loss is used to overcome the imbalance caused by zero labels in silent frames. Soft floss is used to overcome the imbalance in the number of instances between class labels. A periodic loss function is proposed for the regression that infers the periodic label in direction-of-arrival estimation. We also adopt a feature pyramid network-based architecture to overcome the information leakage caused by the pooling layer in the CRNN.
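To show why a periodic loss helps when regressing angles, the sketch below compares a wrapped angular error with one possible periodic loss, 1 - cos(difference). This particular form is an assumption chosen for illustration; the abstract does not give the loss used in the report, and both function names are hypothetical.

```python
import math


def periodic_loss(pred_deg, target_deg):
    """One possible periodic loss: 0 when the angles coincide modulo 360."""
    return 1.0 - math.cos(math.radians(pred_deg - target_deg))


def wrapped_error_deg(pred_deg, target_deg):
    """Absolute angular difference wrapped into [0, 180] degrees."""
    return abs((pred_deg - target_deg + 180.0) % 360.0 - 180.0)


# 179 and -179 degrees are only 2 degrees apart on the circle, although
# their plain difference is 358 degrees.
err = wrapped_error_deg(179.0, -179.0)
loss = periodic_loss(179.0, -179.0)
```

A plain squared error would treat this pair as almost maximally wrong, while the periodic form keeps the penalty small.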
DCASE 2020 TASK 3: A SINGLE STAGE FULLY CONVOLUTIONAL NEURAL NETWORK FOR SOUND SOURCE LOCALIZATION AND DETECTION
Sohel Patel1, Maciej Zawodniok1, Jacob Benesty2
1Missouri University of Science and Technology, 2University of Quebec
Patel_MST_task3_1, Patel_MST_task3_3, Patel_MST_task3_4, Patel_MST_task3_2
Abstract
In this report, we present our approach for DCASE 2020 Challenge Task 3: Sound Event Localization and Detection. We use a single-step training method with SELDnet-like models, but with fully convolutional architectures. We consider the joint optimization of both event detection and DoA estimation, because the metrics that evaluate the model consider the interdependence of the two, unlike the independent metrics of the DCASE 2019 challenge. We use all the sound event classes and the corresponding Cartesian coordinates for each class to create an image-like reference label, which makes this an image-to-image mapping problem. The best model achieved a DoA error of around 13.5° and an error rate of 0.55.
PAPAFIL: A LOW COMPLEXITY SOUND EVENT LOCALIZATION AND DETECTION METHOD WITH PARAMETRIC PARTICLE FILTERING AND GRADIENT BOOSTING
Andres Perez-Lopez1, Rafael Ibanez-Usach2
1Pompeu Fabra University, 2STRATIO
Abstract
The present technical report describes the architecture of the system submitted to the DCASE 2020 Challenge - Task 3: Sound Event Localization and Detection. The proposed method constitutes a low-complexity solution for the task. It is based on four building blocks: a spatial parametric analysis to find single-source spectrogram bins, a particle tracker to estimate trajectories and temporal activities, a spatial filter, and a gradient boosting machine single-class classifier. Provisional results, computed from the development dataset, show that the proposed method outperforms a CRNN baseline in three out of the four evaluation metrics considered in the challenge, and obtains an overall score almost ten points above the baseline.
AUDIO EVENT DETECTION AND LOCALIZATION WITH MULTITASK REGRESSION NETWORK
Huy Phan1, Lam Pham2, Philipp Koch3, Ngoc Duong4, Ian McLoughlin5, Alfred Mertins3
1Queen Mary University of London, 2University of Kent, 3University of Luebeck, 4InterDigital R&D France, 5Singapore Institute of Technology
Phan_QMUL_task3_4, Phan_QMUL_task3_2, Phan_QMUL_task3_3, Phan_QMUL_task3_1
Abstract
This technical report describes our submission to DCASE 2020 Task 3 (Sound Event Localization and Detection, SELD). In the submission, we propose a multitask regression model, in which both (multi-label) event detection and localization are formulated as regression problems so that the mean squared error loss can be used homogeneously for model training. The deep learning model features a convolutional recurrent neural network (CRNN) architecture coupled with a self-attention mechanism. Experiments on the development set of the challenge's SELD task demonstrate that the proposed system outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute.
A DATASET OF REVERBERANT SPATIAL SOUND SCENES WITH MOVING SOURCES FOR SOUND EVENT LOCALIZATION AND DETECTION
Archontis Politis1, Sharath Adavanne1, Tuomas Virtanen1
1Tampere University
Abstract
This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datasets of diverse sound events occurring under realistic acoustic conditions are needed. Compared to the previous challenge, a significantly more complex dataset was created for DCASE 2020. The two key differences are a more diverse range of acoustical conditions, and dynamic conditions, i.e. moving sources. The spatial sound scenes are created using real room impulse responses captured in a continuous manner with a slowly moving excitation source. Both static and moving sound events are synthesized from them. Ambient noise recorded on location is added to complete the generation of scene recordings. A baseline SELD method accompanies the dataset, based on a convolutional recurrent neural network, to provide benchmark scores for the task. The baseline is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.
SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING DENSE RECTANGULAR FILTERS AND CHANNEL ROTATION DATA AUGMENTATION
Francesca Ronchini1, Andrés Pérez López1, Daniel Arteaga1
1Pompeu Fabra University
Ronchini_UPF_task3_3, Ronchini_UPF_task3_4, Ronchini_UPF_task3_2, Ronchini_UPF_task3_1
Abstract
This technical report describes the system submitted to the DCASE 2020 Challenge Task 3: Sound Event Localization and Detection. The algorithm consists of a CRNN using dense rectangular filters specialized to recognize significant frequency features related to the task. In order to further improve the score and to generalize the system performance to unseen data, the training dataset size has been increased using data augmentation based on channel rotations and reflection on the xy plane in the First Order Ambisonic domain, which make it possible to improve the Direction of Arrival labels while keeping the physical relationships between channels. Evaluation results on the cross-validation development dataset show that the proposed system outperforms the baseline results, considerably improving Error Rate and F-score for location-aware detection.
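The channel rotation and xy-plane reflection described above can be sketched for one sample of a first-order Ambisonic signal. The sketch assumes channels named W, X, Y, Z with plane-wave gains X = cos(azi)cos(ele), Y = sin(azi)cos(ele), Z = sin(ele) and ignores normalization; the channel order of the actual dataset and the exact set of rotations used in the report are not given in the abstract, and all function names are hypothetical.

```python
import math


def encode_foa(azimuth_deg, elevation_deg):
    """Plane-wave gains (W, X, Y, Z) for a unit-amplitude source."""
    a, e = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (1.0,
            math.cos(a) * math.cos(e),
            math.sin(a) * math.cos(e),
            math.sin(e))


def rotate_z_90(w, x, y, z, azimuth_deg, elevation_deg):
    """Rotate the scene by +90 degrees in azimuth: X' = -Y, Y' = X."""
    new_azimuth = (azimuth_deg + 90.0 + 180.0) % 360.0 - 180.0  # [-180, 180)
    return w, -y, x, z, new_azimuth, elevation_deg


def reflect_xy(w, x, y, z, azimuth_deg, elevation_deg):
    """Mirror the scene across the xy plane: Z' = -Z, elevation flips sign."""
    return w, x, y, -z, azimuth_deg, -elevation_deg


# Augment one sample of a source at azimuth 30, elevation 20 degrees.
w, x, y, z = encode_foa(30.0, 20.0)
rotated = rotate_z_90(w, x, y, z, 30.0, 20.0)
mirrored = reflect_xy(w, x, y, z, 30.0, 20.0)
```

Because each transform only swaps or negates channels and updates the label accordingly, the augmented audio stays physically consistent with its new Direction of Arrival label.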
SOUND EVENT DETECTION AND LOCALIZATION USING CRNN MODELS
Arunodhayan Sampathkumar1, Danny Kowerko1
1Technische Universität Chemnitz
Sampathkumar_TUC_task3_1, Sampathkumar_TUC_task3_2
Abstract
Sound Event Localization and Detection (SELD) requires both spatial and temporal information about the sound events that appear in an acoustic scene. DCASE2020 Task 3 on sound event localization and detection provides a strongly labelled dataset consisting of 14 classes. In this work, the existing method from DCASE2019 is used with significant modifications: it utilizes log-mel features for sound event detection, and intensity vector and generalized cross-correlation with phase transform (GCC-PHAT) features for sound source localization. A Convolutional Recurrent Neural Network (CRNN) is developed that jointly predicts the Sound Event Detection (SED) and Direction of Arrival (DOA) outputs, hence minimizing the overlapping problems. The developed model significantly outperformed the baseline system.
SOUND EVENT LOCALIZATION AND DETECTION USING ACTIVITY-COUPLED CARTESIAN DOA VECTOR AND RD3NET
Kazuki Shimada1, Naoya Takahashi1, Shusuke Takahashi1, Yuki Mitsufuji1
1SONY Corporation
Shimada_SONY_task3_1, Shimada_SONY_task3_2, Shimada_SONY_task3_3, Shimada_SONY_task3_4
Abstract
Our systems submitted to DCASE2020 Task 3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonic (FOA) signals, and a multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.
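One way to read the activity-coupled Cartesian DOA vector described above is that, for each class, the length of a Cartesian vector encodes the event activity and its direction encodes the DOA. The sketch below follows that reading; the decoding threshold of 0.5 and the function names are assumptions, not details taken from the report.

```python
import math


def accdoa_encode(active, azimuth_deg, elevation_deg):
    """Target vector for one class: unit DOA vector if active, zeros if not."""
    a, e = math.radians(azimuth_deg), math.radians(elevation_deg)
    s = 1.0 if active else 0.0
    return (s * math.cos(e) * math.cos(a),
            s * math.cos(e) * math.sin(a),
            s * math.sin(e))


def accdoa_decode(vec, threshold=0.5):
    """Recover (active, azimuth_deg, elevation_deg) from a predicted vector.

    The threshold on the vector length is an illustrative assumption.
    """
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    if norm <= threshold:
        return False, None, None
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return True, azimuth, elevation


active, azi, ele = accdoa_decode(accdoa_encode(True, 45.0, 30.0))
silent = accdoa_decode(accdoa_encode(False, 45.0, 30.0))
```

A single regression target per class therefore carries both the detection and the localization information, which is what allows one output branch to serve both tasks.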
A SEQUENTIAL SYSTEM FOR SOUND EVENT DETECTION AND LOCALIZATION USING CRNN
Rohit Singla1, Sourabh Tiwari1, Rajat Sharma1
1Samsung Research Institute Bangalore
Singla_SRIB_task3_3, Singla_SRIB_task3_2, Singla_SRIB_task3_1, Singla_SRIB_task3_4
Abstract
In this technical report, we describe our method for DCASE2020 Task 3: Sound Event Localization and Detection. We use CRNN SELDnet-like single-output models that run on log-mel spectrogram features extracted from the audio files. Our model uses CNN layers followed by RNN layers to predict the sound event classes (Sound Event Detection, SED); the SED output is then used to estimate the Direction Of Arrival (DOA) of those sound events, and the final output is the concatenation of SED and DOA. The proposed approach is evaluated on the development set of TAU Spatial Sound Events 2020 – First-Order Ambisonics (FOA).
LOCALIZATION AND DETECTION FOR MOVING SOUND SOURCES USING CONSECUTIVE ENSEMBLE OF 2D-CRNN
Ju-man Song1
1LG Electronics
Song_LGE_task3_4, Song_LGE_task3_2, Song_LGE_task3_1, Song_LGE_task3_3
Abstract
This technical report introduces a deep learning strategy for sound event localization and detection in DCASE 2020 Task 3. The strategy is designed to obtain accurate detection and localization of moving sound events by splitting the task into five sub-tasks. The sub-tasks estimate, respectively, the number of existing sound sources, the number of sound directions, a single sound direction, multiple sound directions, and the category of events. A separate two-dimensional convolutional recurrent neural network (2D-CRNN) is dedicated to each sub-task, which improves robustness to complex conditions. Finally, a consecutive ensemble strategy with some decision logic is applied to achieve high performance. With the proposed strategy, we could obtain optimal network models for each sub-task. The proposed strategy is evaluated on the development set of TAU-NIGENS Spatial Sound Events 2020 and shows notable improvements.
MULTIPLE CRNN FOR SELD
Congzhou Tian1
1Peking University
Tian_PKU_task3_1
Abstract
In this task, we use multiple CRNNs for SELD. First, a CRNN predicts the number of simultaneously active sound events. An SED CRNN then predicts the current sound events given the predicted number of active events. After that, we train a DOA1 CRNN specifically for frames with a single active event and a total DOA CRNN for frames with more active events. We believe that training separate networks is helpful for both the SED and DOA tasks, and our results are better than those of the baseline method on the development dataset.