Sound Event Localization and Detection


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into one of a known set of sound classes, and localize the event in space while it is active.

The focus of the current SELD task is to build systems that can handle event polyphony while being robust to ambient noise and reverberation in different acoustic environments/rooms, under static and dynamic spatial conditions (i.e. with moving sources). The task provides two datasets, development and evaluation, recorded in a total of 13 different acoustic environments. Of the two datasets, only the development dataset provides the reference labels. Participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their system on the unseen evaluation dataset.

More details on the task setup and evaluation can be found on the task description page.
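
The "(20°)" attached to the error rate and F-score in the tables below is a spatial threshold: a detection counts as correct only if its class matches the reference and its estimated direction lies within 20° of the reference direction. The following is a minimal sketch of that spatial check, assuming directions are given as azimuth and elevation in degrees. The function names are illustrative and are not taken from the challenge's evaluation code, and whether the boundary case of exactly 20° is accepted is an assumption.

```python
import math

def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions (azimuth, elevation in degrees)."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))

def is_spatially_correct(pred, ref, threshold_deg=20.0):
    """pred, ref: (azimuth, elevation) tuples in degrees (hypothetical helper)."""
    return angular_distance_deg(*pred, *ref) <= threshold_deg
```

For example, an estimate at azimuth 10° against a reference at azimuth 25° (both at elevation 0°) is 15° away and passes the check, while a reference at azimuth 40° does not.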

Teams ranking

Table including only the best performing system per submitting team.

Columns: Submission name, Technical report, Best official system rank; Evaluation dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall; Development dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_USTC_task3_4 Du2020_task3_report 1 0.20 84.9 6.0 88.5 0.26 80.0 7.4 84.7
Nguyen_NTU_task3_2 Nguyen2020_task3_report 4 0.23 82.0 9.3 90.0 0.36 71.4 12.1 82.0
Shimada_SONY_task3_4 Shimada2020_task3_report 5 0.25 83.2 7.0 86.2 0.29 80.0 7.5 83.5
Cao_Surrey_task3_4 Cao2020_task3_report 11 0.36 71.2 13.3 81.1 0.47 61.5 16.7 75.4
Park_ETRI_task3_4 Park2020_task3_report 13 0.43 65.2 16.8 81.9 0.54 55.5 20.0 76.0
Phan_QMUL_task3_3 Phan2020_task3_report 15 0.49 61.7 15.2 72.4 0.60 49.2 19.0 65.6
PerezLopez_UPF_task3_2 PerezLopez2020_task3_report 16 0.51 60.1 12.4 65.1 0.44 68.0 13.3 79.6
Sampathkumar_TUC_task3_1 Sampathkumar2020_task3_report 20 0.53 56.6 14.8 66.5 0.57 51.8 16.9 65.6
Patel_MST_task3_4 Patel2020_task3_report 22 0.55 55.5 14.4 65.5 0.54 55.6 15.2 67.2
Ronchini_UPF_task3_2 Ronchini2020_task3_report 28 0.58 50.8 16.9 65.5 0.60 49.9 17.9 66.8
Naranjo-Alcazar_VFY_task3_2 Naranjo-Alcazar2020_task3_report 30 0.61 49.1 19.5 67.1 0.70 39.5 24.8 63.0
Song_LGE_task3_3 Song2020_task3_report 31 0.57 50.4 20.0 64.3 0.57 50.6 20.2 64.1
Tian_PKU_task3_1 Tian2020_task3_report 36 0.64 47.6 24.5 67.5 0.72 40.1 25.9 64.0
Singla_SRIB_task3_2 Singla2020_task3_report 38 0.88 18.0 53.4 66.2 0.72 36.2 23.4 67.7
DCASE2020_MIC_baseline Politis2020_task3_report 39 0.69 41.3 23.1 62.4 0.78 31.4 27.3 59.0

Systems ranking

Performance of all the submitted systems on the evaluation and the development datasets.

Columns: Submission name, Technical report, Official rank; Evaluation dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall; Development dataset: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_USTC_task3_4 Du2020_task3_report 1 0.20 84.9 6.0 88.5 0.26 80.0 7.4 84.7
Du_USTC_task3_2 Du2020_task3_report 2 0.20 84.7 6.1 88.1 0.27 79.4 7.5 84.4
Du_USTC_task3_1 Du2020_task3_report 3 0.23 83.0 6.1 86.2 0.27 78.9 7.5 83.6
Nguyen_NTU_task3_2 Nguyen2020_task3_report 4 0.23 82.0 9.3 90.0 0.36 71.4 12.1 82.0
Nguyen_NTU_task3_3 Nguyen2020_task3_report 5 0.23 81.7 9.3 90.1 0.36 71.2 12.1 82.0
Nguyen_NTU_task3_4 Nguyen2020_task3_report 5 0.23 81.8 9.3 90.0 0.35 71.9 12.1 82.7
Shimada_SONY_task3_4 Shimada2020_task3_report 5 0.25 83.2 7.0 86.2 0.29 80.0 7.5 83.5
Du_USTC_task3_3 Du2020_task3_report 6 0.23 82.6 8.3 88.0 0.27 79.8 7.4 84.6
Shimada_SONY_task3_3 Shimada2020_task3_report 7 0.25 82.6 7.0 85.5 0.28 79.9 7.6 83.7
Nguyen_NTU_task3_1 Nguyen2020_task3_report 8 0.24 81.6 9.4 89.7 0.36 71.5 12.0 82.0
Shimada_SONY_task3_2 Shimada2020_task3_report 9 0.26 81.7 7.0 84.6 0.29 79.4 7.5 82.9
Shimada_SONY_task3_1 Shimada2020_task3_report 10 0.31 77.9 7.6 81.2 0.32 76.8 7.9 80.5
Cao_Surrey_task3_4 Cao2020_task3_report 11 0.36 71.2 13.3 81.1 0.47 61.5 16.7 75.4
Cao_Surrey_task3_3 Cao2020_task3_report 12 0.39 69.6 10.1 76.1 0.47 61.5 16.7 75.4
Park_ETRI_task3_4 Park2020_task3_report 13 0.43 65.2 16.8 81.9 0.54 55.5 20.0 76.0
Park_ETRI_task3_3 Park2020_task3_report 14 0.43 64.5 17.2 81.6 0.54 55.5 20.0 76.0
Cao_Surrey_task3_2 Cao2020_task3_report 14 0.39 66.5 20.6 83.0 0.47 61.5 16.7 75.4
Phan_QMUL_task3_3 Phan2020_task3_report 15 0.49 61.7 15.2 72.4 0.60 49.2 19.0 65.6
Park_ETRI_task3_2 Park2020_task3_report 15 0.43 63.9 17.4 81.5 0.54 55.5 20.0 76.0
PerezLopez_UPF_task3_2 PerezLopez2020_task3_report 16 0.51 60.1 12.4 65.1 0.44 68.0 13.3 79.6
Phan_QMUL_task3_4 Phan2020_task3_report 17 0.53 59.2 14.6 68.2 0.59 50.8 18.2 64.1
Park_ETRI_task3_1 Park2020_task3_report 18 0.50 59.0 19.2 79.0 0.54 55.5 20.0 76.0
Phan_QMUL_task3_2 Phan2020_task3_report 18 0.55 58.8 14.6 68.2 0.59 50.8 18.2 64.1
Phan_QMUL_task3_1 Phan2020_task3_report 19 0.52 57.8 16.8 69.8 0.60 49.2 19.0 65.6
Sampathkumar_TUC_task3_1 Sampathkumar2020_task3_report 20 0.53 56.6 14.8 66.5 0.57 51.8 16.9 65.6
Sampathkumar_TUC_task3_2 Sampathkumar2020_task3_report 21 0.54 56.3 15.6 66.8 0.57 51.6 17.5 66.1
Patel_MST_task3_4 Patel2020_task3_report 22 0.55 55.5 14.4 65.5 0.54 55.6 15.2 67.2
Patel_MST_task3_3 Patel2020_task3_report 23 0.56 54.5 15.0 65.1 0.55 55.4 14.9 66.5
PerezLopez_UPF_task3_1 PerezLopez2020_task3_report 24 0.55 56.0 12.8 61.1 0.57 55.6 15.6 66.7
Patel_MST_task3_1 Patel2020_task3_report 25 0.56 54.3 13.4 63.0 0.55 54.2 13.6 63.6
Patel_MST_task3_2 Patel2020_task3_report 26 0.56 54.3 13.8 62.8 0.56 53.7 14.0 62.6
Cao_Surrey_task3_1 Cao2020_task3_report 27 0.54 55.5 23.9 71.8 0.47 61.5 16.7 75.4
Ronchini_UPF_task3_2 Ronchini2020_task3_report 28 0.58 50.8 16.9 65.5 0.60 49.9 17.9 66.8
Ronchini_UPF_task3_3 Ronchini2020_task3_report 29 0.59 50.3 16.8 65.5 0.61 48.7 18.7 65.2
Naranjo-Alcazar_VFY_task3_2 Naranjo-Alcazar2020_task3_report 30 0.61 49.1 19.5 67.1 0.70 39.5 24.8 63.0
Song_LGE_task3_3 Song2020_task3_report 31 0.57 50.4 20.0 64.3 0.57 50.6 20.2 64.1
Naranjo-Alcazar_VFY_task3_1 Naranjo-Alcazar2020_task3_report 32 0.61 48.3 19.2 65.9 0.69 40.3 22.1 63.8
Ronchini_UPF_task3_1 Ronchini2020_task3_report 32 0.61 49.1 16.7 63.3 0.59 50.6 17.6 66.2
Ronchini_UPF_task3_4 Ronchini2020_task3_report 33 0.60 49.1 17.1 63.7 0.61 48.4 18.6 65.6
Song_LGE_task3_4 Song2020_task3_report 34 0.58 49.3 21.6 64.3 0.57 50.4 20.2 64.1
Song_LGE_task3_1 Song2020_task3_report 35 0.58 49.1 21.8 64.3 0.57 50.6 20.1 64.2
Naranjo-Alcazar_VFY_task3_4 Naranjo-Alcazar2020_task3_report 35 0.63 47.3 19.5 65.5 0.69 39.9 22.8 64.1
Tian_PKU_task3_1 Tian2020_task3_report 36 0.64 47.6 24.5 67.5 0.72 40.1 25.9 64.0
Naranjo-Alcazar_VFY_task3_3 Naranjo-Alcazar2020_task3_report 37 0.64 46.7 20.0 64.5 0.70 39.6 22.7 63.1
Song_LGE_task3_2 Song2020_task3_report 37 0.59 48.0 23.5 64.3 0.58 49.5 21.2 64.2
Singla_SRIB_task3_2 Singla2020_task3_report 38 0.88 18.0 53.4 66.2 0.72 36.2 23.4 67.7
DCASE2020_MIC_baseline Politis2020_task3_report 39 0.69 41.3 23.1 62.4 0.78 31.4 27.3 59.0
Singla_SRIB_task3_3 Singla2020_task3_report 40 0.89 13.3 59.9 66.8 0.78 27.1 25.6 62.3
Singla_SRIB_task3_1 Singla2020_task3_report 41 0.88 17.4 55.6 64.6 0.73 34.2 24.3 65.8
DCASE2020_FOA_baseline Politis2020_task3_report 42 0.70 39.5 23.2 62.1 0.72 37.4 22.8 60.7
Singla_SRIB_task3_4 Singla2020_task3_report 43 0.92 15.9 57.0 62.5 0.83 25.5 26.9 56.9

Acoustic environment-wise performance

Performance of submitted systems on the two unseen acoustic environments of the evaluation dataset.

Columns: Submission name, Technical report, Official rank; Location 1: Error rate (20°), F-score (20°), Localization error (°), Localization recall; Location 2: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_USTC_task3_4 Du2020_task3_report 1 0.18 86.4 6.0 90.7 0.22 83.4 6.0 86.2
Du_USTC_task3_2 Du2020_task3_report 2 0.18 86.3 5.9 90.4 0.22 82.9 6.3 85.8
Du_USTC_task3_1 Du2020_task3_report 3 0.22 83.9 6.0 87.7 0.24 82.1 6.1 84.6
Nguyen_NTU_task3_2 Nguyen2020_task3_report 4 0.22 83.5 7.9 90.4 0.25 80.4 10.8 89.6
Nguyen_NTU_task3_3 Nguyen2020_task3_report 5 0.22 83.4 7.9 90.5 0.25 80.1 10.8 89.6
Nguyen_NTU_task3_4 Nguyen2020_task3_report 5 0.22 83.4 7.8 90.3 0.25 80.1 10.9 89.7
Shimada_SONY_task3_4 Shimada2020_task3_report 5 0.24 84.5 6.7 87.7 0.26 81.9 7.3 84.7
Du_USTC_task3_3 Du2020_task3_report 6 0.21 83.5 8.4 89.5 0.24 81.6 8.2 86.5
Shimada_SONY_task3_3 Shimada2020_task3_report 7 0.24 83.8 6.7 87.0 0.26 81.4 7.4 84.1
Nguyen_NTU_task3_1 Nguyen2020_task3_report 8 0.22 83.5 7.8 90.0 0.26 79.6 11.1 89.3
Shimada_SONY_task3_2 Shimada2020_task3_report 9 0.25 83.1 6.6 86.1 0.28 80.2 7.3 83.0
Shimada_SONY_task3_1 Shimada2020_task3_report 10 0.31 78.3 7.1 81.2 0.31 77.5 8.0 81.1
Cao_Surrey_task3_4 Cao2020_task3_report 11 0.35 72.7 10.9 81.0 0.37 69.6 15.8 81.3
Cao_Surrey_task3_3 Cao2020_task3_report 12 0.37 71.6 8.7 77.9 0.41 67.6 11.6 74.4
Park_ETRI_task3_4 Park2020_task3_report 13 0.40 67.2 16.4 83.4 0.45 63.1 17.3 80.4
Park_ETRI_task3_3 Park2020_task3_report 14 0.41 66.4 16.6 83.1 0.45 62.6 17.9 80.1
Cao_Surrey_task3_2 Cao2020_task3_report 14 0.40 65.9 21.0 83.2 0.39 67.1 20.3 82.7
Phan_QMUL_task3_3 Phan2020_task3_report 15 0.43 66.2 12.9 75.8 0.54 57.2 17.7 69.1
Park_ETRI_task3_2 Park2020_task3_report 15 0.42 65.4 16.8 82.5 0.45 62.4 17.9 80.4
PerezLopez_UPF_task3_2 PerezLopez2020_task3_report 16 0.49 62.1 12.3 67.0 0.53 58.1 12.5 63.1
Phan_QMUL_task3_4 Phan2020_task3_report 17 0.47 62.2 13.1 71.0 0.60 56.3 16.2 65.6
Park_ETRI_task3_1 Park2020_task3_report 18 0.50 58.2 18.6 76.8 0.49 59.7 19.7 81.2
Phan_QMUL_task3_2 Phan2020_task3_report 18 0.48 62.2 13.8 71.6 0.62 55.6 15.4 64.8
Phan_QMUL_task3_1 Phan2020_task3_report 19 0.47 62.9 14.4 72.9 0.57 52.7 19.5 66.6
Sampathkumar_TUC_task3_1 Sampathkumar2020_task3_report 20 0.52 57.7 13.1 66.5 0.54 55.4 16.6 66.6
Sampathkumar_TUC_task3_2 Sampathkumar2020_task3_report 21 0.53 57.8 13.0 66.1 0.56 54.9 18.0 67.5
Patel_MST_task3_4 Patel2020_task3_report 22 0.54 56.3 12.4 65.2 0.56 54.7 16.4 65.8
Patel_MST_task3_3 Patel2020_task3_report 23 0.56 54.1 13.5 63.5 0.55 54.9 16.4 66.6
PerezLopez_UPF_task3_1 PerezLopez2020_task3_report 24 0.53 57.3 13.1 62.5 0.58 54.7 12.6 59.6
Patel_MST_task3_1 Patel2020_task3_report 25 0.55 55.1 11.7 62.7 0.57 53.5 15.1 63.3
Patel_MST_task3_2 Patel2020_task3_report 26 0.55 55.1 12.4 62.0 0.57 53.5 15.1 63.7
Cao_Surrey_task3_1 Cao2020_task3_report 27 0.54 54.7 24.3 72.2 0.53 56.3 23.5 71.4
Ronchini_UPF_task3_2 Ronchini2020_task3_report 28 0.56 52.8 15.4 65.0 0.60 48.9 18.3 66.1
Ronchini_UPF_task3_3 Ronchini2020_task3_report 29 0.59 50.4 16.7 64.7 0.60 50.2 17.0 66.3
Naranjo-Alcazar_VFY_task3_2 Naranjo-Alcazar2020_task3_report 30 0.57 52.0 17.2 68.8 0.66 46.2 21.8 65.4
Song_LGE_task3_3 Song2020_task3_report 31 0.57 50.1 20.8 65.3 0.58 50.7 19.1 63.3
Naranjo-Alcazar_VFY_task3_1 Naranjo-Alcazar2020_task3_report 32 0.59 50.1 17.7 65.9 0.63 46.6 20.7 65.8
Ronchini_UPF_task3_1 Ronchini2020_task3_report 32 0.62 48.5 16.3 61.1 0.60 49.7 17.0 65.4
Ronchini_UPF_task3_4 Ronchini2020_task3_report 33 0.57 52.4 16.0 65.0 0.64 45.9 18.4 62.5
Song_LGE_task3_4 Song2020_task3_report 34 0.57 49.7 21.8 65.3 0.59 48.8 21.4 63.3
Song_LGE_task3_1 Song2020_task3_report 35 0.58 49.3 22.1 65.3 0.59 48.9 21.6 63.3
Naranjo-Alcazar_VFY_task3_4 Naranjo-Alcazar2020_task3_report 35 0.63 48.7 18.0 66.1 0.64 45.9 21.0 64.9
Tian_PKU_task3_1 Tian2020_task3_report 36 0.60 50.9 21.8 69.1 0.68 44.3 27.2 65.8
Naranjo-Alcazar_VFY_task3_3 Naranjo-Alcazar2020_task3_report 37 0.61 49.3 18.5 65.3 0.67 44.1 21.5 63.6
Song_LGE_task3_2 Song2020_task3_report 37 0.58 48.9 23.1 65.3 0.61 47.0 23.8 63.3
Singla_SRIB_task3_2 Singla2020_task3_report 38 0.85 21.6 57.6 66.3 0.91 14.2 49.1 66.0
DCASE2020_MIC_baseline Politis2020_task3_report 39 0.66 44.0 21.8 65.9 0.72 38.6 24.7 58.9
Singla_SRIB_task3_3 Singla2020_task3_report 40 0.84 19.7 61.5 68.3 0.95 6.9 58.2 65.2
Singla_SRIB_task3_1 Singla2020_task3_report 41 0.82 22.5 58.8 69.5 0.93 12.3 51.9 59.7
DCASE2020_FOA_baseline Politis2020_task3_report 42 0.66 43.3 20.5 65.0 0.74 35.5 26.2 59.1
Singla_SRIB_task3_4 Singla2020_task3_report 43 0.87 20.2 60.2 63.7 0.96 11.4 53.6 61.4

Event polyphony-wise performance

Performance of submitted systems on different numbers of overlapping events of the evaluation dataset.

Columns: Submission name, Technical report, Official rank; No overlapping: Error rate (20°), F-score (20°), Localization error (°), Localization recall; 2 overlapping: Error rate (20°), F-score (20°), Localization error (°), Localization recall
Du_USTC_task3_4 Du2020_task3_report 1 0.12 90.8 3.4 91.1 0.24 81.8 7.5 87.1
Du_USTC_task3_2 Du2020_task3_report 2 0.12 90.9 3.6 91.3 0.25 81.3 7.5 86.4
Du_USTC_task3_1 Du2020_task3_report 3 0.14 89.1 3.6 89.5 0.27 79.7 7.5 84.4
Nguyen_NTU_task3_2 Nguyen2020_task3_report 4 0.10 92.8 5.1 94.0 0.30 76.1 11.7 87.8
Nguyen_NTU_task3_3 Nguyen2020_task3_report 5 0.10 92.5 5.2 93.9 0.30 75.9 11.7 88.0
Nguyen_NTU_task3_4 Nguyen2020_task3_report 5 0.11 92.4 5.1 93.8 0.30 76.1 11.7 88.0
Shimada_SONY_task3_4 Shimada2020_task3_report 5 0.18 88.4 5.2 89.3 0.28 80.3 8.0 84.5
Du_USTC_task3_3 Du2020_task3_report 6 0.13 90.1 5.2 90.8 0.27 78.6 10.0 86.6
Shimada_SONY_task3_3 Shimada2020_task3_report 7 0.18 87.9 5.2 88.7 0.28 79.7 8.1 83.8
Nguyen_NTU_task3_1 Nguyen2020_task3_report 8 0.11 92.2 5.1 93.5 0.30 75.8 11.9 87.6
Shimada_SONY_task3_2 Shimada2020_task3_report 9 0.20 86.7 5.3 87.6 0.30 78.9 7.9 82.9
Shimada_SONY_task3_1 Shimada2020_task3_report 10 0.25 83.2 5.9 84.6 0.34 75.1 8.5 79.3
Cao_Surrey_task3_4 Cao2020_task3_report 11 0.20 84.4 6.6 87.0 0.44 63.9 17.5 77.9
Cao_Surrey_task3_3 Cao2020_task3_report 12 0.24 82.3 4.6 83.6 0.46 62.7 13.5 72.0
Park_ETRI_task3_4 Park2020_task3_report 13 0.25 81.5 9.9 88.0 0.52 56.0 21.2 78.5
Park_ETRI_task3_3 Park2020_task3_report 14 0.26 80.2 10.4 87.2 0.52 55.7 21.5 78.5
Cao_Surrey_task3_2 Cao2020_task3_report 14 0.15 88.7 4.9 90.2 0.52 54.4 30.4 79.0
Phan_QMUL_task3_3 Phan2020_task3_report 15 0.33 76.8 9.1 80.5 0.57 52.9 19.4 67.7
Park_ETRI_task3_2 Park2020_task3_report 15 0.27 79.9 10.1 86.9 0.52 55.0 21.9 78.5
PerezLopez_UPF_task3_2 PerezLopez2020_task3_report 16 0.34 76.3 7.7 77.1 0.60 50.6 16.1 57.9
Phan_QMUL_task3_4 Phan2020_task3_report 17 0.41 73.5 6.8 75.0 0.60 50.8 20.0 64.2
Park_ETRI_task3_1 Park2020_task3_report 18 0.36 73.7 11.4 82.6 0.57 50.6 23.9 76.9
Phan_QMUL_task3_2 Phan2020_task3_report 18 0.43 73.0 7.5 74.8 0.60 50.4 19.6 64.2
Phan_QMUL_task3_1 Phan2020_task3_report 19 0.36 73.1 10.1 77.2 0.60 49.0 21.4 65.5
Sampathkumar_TUC_task3_1 Sampathkumar2020_task3_report 20 0.40 69.4 9.9 74.0 0.60 49.0 18.3 62.2
Sampathkumar_TUC_task3_2 Sampathkumar2020_task3_report 21 0.43 68.4 8.9 71.8 0.60 49.3 20.0 63.9
Patel_MST_task3_4 Patel2020_task3_report 22 0.43 66.7 9.8 71.6 0.61 48.8 17.6 61.8
Patel_MST_task3_3 Patel2020_task3_report 23 0.42 67.5 10.7 73.4 0.63 46.7 18.1 60.1
PerezLopez_UPF_task3_1 PerezLopez2020_task3_report 24 0.38 72.3 8.1 73.5 0.64 46.3 16.7 53.7
Patel_MST_task3_1 Patel2020_task3_report 25 0.45 64.9 9.8 69.9 0.61 48.0 16.0 58.8
Patel_MST_task3_2 Patel2020_task3_report 26 0.45 64.9 9.6 69.6 0.62 48.0 16.8 58.8
Cao_Surrey_task3_1 Cao2020_task3_report 27 0.42 71.4 6.8 74.2 0.60 47.6 32.8 70.6
Ronchini_UPF_task3_2 Ronchini2020_task3_report 28 0.45 64.1 12.2 72.8 0.65 43.1 20.2 61.3
Ronchini_UPF_task3_3 Ronchini2020_task3_report 29 0.47 63.7 11.8 72.0 0.66 42.5 20.2 61.7
Naranjo-Alcazar_VFY_task3_2 Naranjo-Alcazar2020_task3_report 30 0.50 62.4 14.1 74.8 0.67 41.2 23.3 62.5
Song_LGE_task3_3 Song2020_task3_report 31 0.41 64.4 12.6 73.0 0.66 42.2 25.3 59.2
Naranjo-Alcazar_VFY_task3_1 Naranjo-Alcazar2020_task3_report 32 0.48 61.7 13.9 73.5 0.68 40.6 22.9 61.5
Ronchini_UPF_task3_1 Ronchini2020_task3_report 32 0.52 59.8 12.5 68.9 0.66 42.7 19.5 59.9
Ronchini_UPF_task3_4 Ronchini2020_task3_report 33 0.49 61.6 11.9 69.3 0.67 41.9 20.7 60.5
Song_LGE_task3_4 Song2020_task3_report 34 0.42 63.3 14.4 73.0 0.67 41.1 26.8 59.2
Song_LGE_task3_1 Song2020_task3_report 35 0.42 63.1 13.3 73.0 0.67 40.8 28.1 59.2
Naranjo-Alcazar_VFY_task3_4 Naranjo-Alcazar2020_task3_report 35 0.52 60.2 13.7 72.6 0.69 39.7 23.5 61.4
Tian_PKU_task3_1 Tian2020_task3_report 36 0.47 66.7 10.4 72.4 0.73 37.1 33.1 64.7
Naranjo-Alcazar_VFY_task3_3 Naranjo-Alcazar2020_task3_report 37 0.52 59.9 13.7 71.4 0.70 39.2 24.3 60.5
Song_LGE_task3_2 Song2020_task3_report 37 0.43 62.0 15.1 73.0 0.68 39.7 29.5 59.2
Singla_SRIB_task3_2 Singla2020_task3_report 38 0.85 23.3 49.9 73.3 0.90 14.8 55.8 62.0
DCASE2020_MIC_baseline Politis2020_task3_report 39 0.75 33.7 16.0 69.4 0.75 33.7 28.1 58.3
Singla_SRIB_task3_3 Singla2020_task3_report 40 0.86 17.5 58.0 74.2 0.91 10.8 61.2 62.4
Singla_SRIB_task3_1 Singla2020_task3_report 41 0.85 22.0 53.6 71.0 0.89 14.7 57.0 60.7
DCASE2020_FOA_baseline Politis2020_task3_report 42 0.75 32.5 26.7 57.4 0.58 51.3 18.3 69.9
Singla_SRIB_task3_4 Singla2020_task3_report 43 0.92 19.7 54.7 68.9 0.91 13.6 58.6 58.7

System characteristics

Summary of the submitted systems' characteristics.

Columns: Rank, Submission name, Technical report, Classifier, Classifier params, Audio format, Acoustic feature, Data augmentation
1 Du_USTC_task3_4 Du2020_task3_report CRNN, CNN, ensemble 123942947 both mel spectra, intensity vector, GCC time mixing, time and frequency masking, multichannel data simulation, voice channel switching
2 Du_USTC_task3_2 Du2020_task3_report CRNN, CNN, ensemble 32725833 both mel spectra, intensity vector, GCC time mixing, time and frequency masking, multichannel data simulation, voice channel switching
3 Du_USTC_task3_1 Du2020_task3_report CRNN, CNN, ensemble 66238098 both mel spectra, intensity vector, GCC time mixing, time and frequency masking, multichannel data simulation, voice channel switching
4 Nguyen_NTU_task3_2 Nguyen2020_task3_report CRNN, ensemble 11589297 Ambisonic mel spectra, complex spectra mixup, frequency-shift, random-cutout, specaugment
5 Nguyen_NTU_task3_3 Nguyen2020_task3_report CRNN, ensemble 12418724 Ambisonic mel spectra, complex spectra mixup, frequency-shift, random-cutout, specaugment
5 Nguyen_NTU_task3_4 Nguyen2020_task3_report CRNN, ensemble 12418724 Ambisonic mel spectra, complex spectra mixup, frequency-shift, random-cutout, specaugment
5 Shimada_SONY_task3_4 Shimada2020_task3_report RD3Net, ensemble 11715040 Ambisonic magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD EMDA, rotation, Multichannel SpecAugment
6 Du_USTC_task3_3 Du2020_task3_report CRNN, CNN, ensemble 24979016 both mel spectra, intensity vector, GCC time mixing, time and frequency masking, multichannel data simulation, voice channel switching
7 Shimada_SONY_task3_3 Shimada2020_task3_report RD3Net, CRNN, ensemble 14739274 Ambisonic magnitude spectra, PCEN mel spectra, IPD, cosIPD, sinIPD EMDA, rotation, Multichannel SpecAugment
8 Nguyen_NTU_task3_1 Nguyen2020_task3_report CRNN, ensemble 10759870 Ambisonic mel spectra, complex spectra mixup, frequency-shift, random-cutout, specaugment
9 Shimada_SONY_task3_2 Shimada2020_task3_report RD3Net, ensemble 8369540 Ambisonic magnitude spectra, IPD, cosIPD, sinIPD EMDA, rotation, Multichannel SpecAugment
10 Shimada_SONY_task3_1 Shimada2020_task3_report RD3Net 1674680 Ambisonic magnitude spectra, IPD EMDA, rotation, Multichannel SpecAugment
11 Cao_Surrey_task3_4 Cao2020_task3_report CRNN 23799012 Ambisonic mel spectra, intensity vector
12 Cao_Surrey_task3_3 Cao2020_task3_report CRNN 23799012 Ambisonic mel spectra, intensity vector
13 Park_ETRI_task3_4 Park2020_task3_report FPN, RNN, TrellisNet, ensemble 19510056 both mel spectra, intensity vector, HPSS time stretching
14 Park_ETRI_task3_3 Park2020_task3_report FPN, RNN, TrellisNet, ensemble 13078986 both mel spectra, intensity vector, HPSS
14 Cao_Surrey_task3_2 Cao2020_task3_report CRNN 23799012 Ambisonic mel spectra, intensity vector
15 Phan_QMUL_task3_3 Phan2020_task3_report self-attention CRNN 116118 Ambisonic mel spectra, intensity vector SpecAugment
15 Park_ETRI_task3_2 Park2020_task3_report FPN, RNN, TrellisNet 6647916 both mel spectra, intensity vector, HPSS
16 PerezLopez_UPF_task3_2 PerezLopez2020_task3_report GBM 20800 Ambisonic diffuseness
17 Phan_QMUL_task3_4 Phan2020_task3_report self-attention CRNN 116118 Microphone Array mel spectra, GCC SpecAugment
18 Park_ETRI_task3_1 Park2020_task3_report FPN, RNN, TrellisNet 6647916 both mel spectra, intensity vector, HPSS
18 Phan_QMUL_task3_2 Phan2020_task3_report self-attention CRNN 116118 Microphone Array mel spectra, GCC SpecAugment
19 Phan_QMUL_task3_1 Phan2020_task3_report self-attention CRNN 116118 Ambisonic mel spectra, intensity vector SpecAugment
20 Sampathkumar_TUC_task3_1 Sampathkumar2020_task3_report CRNN 8010648 Ambisonic Intensity vector
21 Sampathkumar_TUC_task3_2 Sampathkumar2020_task3_report CRNN 8010648 Microphone Array Intensity vector and GCC
22 Patel_MST_task3_4 Patel2020_task3_report FC-CRNN 14463224 both mel spectra, intensity vector, GCC
23 Patel_MST_task3_3 Patel2020_task3_report CRNN 14463224 both mel spectra, intensity vector, GCC
24 PerezLopez_UPF_task3_1 PerezLopez2020_task3_report GBM 20800 Ambisonic diffuseness time shifting, time stretching, pitch shifting, white noise addition, reverberation
25 Patel_MST_task3_1 Patel2020_task3_report FC-CRNN 107143683 both mel spectra, intensity vector, GCC
26 Patel_MST_task3_2 Patel2020_task3_report FC-CRNN 107143683 both mel spectra, intensity vector, GCC
27 Cao_Surrey_task3_1 Cao2020_task3_report CRNN 23799012 Ambisonic mel spectra, intensity vector
28 Ronchini_UPF_task3_2 Ronchini2020_task3_report CRNN 1244536 Ambisonic mel spectra, intensity vector channel rotations
29 Ronchini_UPF_task3_3 Ronchini2020_task3_report CRNN 1278200 Ambisonic mel spectra, intensity vector channel rotations
30 Naranjo-Alcazar_VFY_task3_2 Naranjo-Alcazar2020_task3_report CRNN 660264 Microphone Array mel spectra, GCC
31 Song_LGE_task3_3 Song2020_task3_report CRNN 2586601 Both mel spectra, GCC, intensity vector, angle mask
32 Naranjo-Alcazar_VFY_task3_1 Naranjo-Alcazar2020_task3_report CRNN 635496 Microphone Array mel spectra, GCC
32 Ronchini_UPF_task3_1 Ronchini2020_task3_report CRNN 1244536 Ambisonic mel spectra, intensity vector channel rotations
33 Ronchini_UPF_task3_4 Ronchini2020_task3_report CRNN 850680 Ambisonic mel spectra, intensity vector channel rotations
34 Song_LGE_task3_4 Song2020_task3_report CRNN 2717033 Both mel spectra, GCC, intensity vector, angle mask
35 Song_LGE_task3_1 Song2020_task3_report CRNN 2587753 Microphone Array mel spectra, GCC
35 Naranjo-Alcazar_VFY_task3_4 Naranjo-Alcazar2020_task3_report CRNN 637224 Microphone Array mel spectra, GCC
36 Tian_PKU_task3_1 Tian2020_task3_report CRNN 2000082 Ambisonic mel spectra, intensity vector
37 Naranjo-Alcazar_VFY_task3_3 Naranjo-Alcazar2020_task3_report CRNN 638760 Microphone Array mel spectra, GCC
37 Song_LGE_task3_2 Song2020_task3_report CRNN 2717033 Both mel spectra, GCC, intensity vector, angle mask
38 Singla_SRIB_task3_2 Singla2020_task3_report CRNN 517670 Ambisonic phase and magnitude spectra, mel spectra, intensity vector
39 DCASE2020_MIC_baseline Politis2020_task3_report CRNN 513000 Microphone Array mel spectra, GCC
40 Singla_SRIB_task3_3 Singla2020_task3_report CRNN 517670 Ambisonic phase and magnitude spectra, mel spectra, intensity vector
41 Singla_SRIB_task3_1 Singla2020_task3_report CRNN 513288 Ambisonic phase and magnitude spectra, mel spectra, intensity vector
42 DCASE2020_FOA_baseline Politis2020_task3_report CRNN 513000 Ambisonic mel spectra, intensity vector
43 Singla_SRIB_task3_4 Singla2020_task3_report CRNN 513288 Ambisonic phase and magnitude spectra, mel spectra, intensity vector time stretching, block mixing



Technical reports

EVENT-INDEPENDENT NETWORK FOR POLYPHONIC SOUND EVENT LOCALIZATION AND DETECTION

Yin Cao1, Turab Iqbal1, Qiuqiang Kong2, Yue Zhong1, Wenwu Wang1, Mark Plumbley1
1University of Surrey, 2ByteDance Ltd.

Abstract

Polyphonic sound event localization and detection is to not only detect what sound events are happening but to localize corresponding sound sources. This series of tasks was firstly introduced in DCASE 2019 Task 3. This year, the sound event localization and detection task brings additional challenges in moving sources and up to two overlapping sound events, which include cases of two same type of events with two different direction-of-arrival (DoA) angles. In this report, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method that was proposed by us last year [1], this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are then fed into a 1-D convolutional layer to extract logmel spectrograms and intensity vectors. The network is then split into two parallel branches. The first branch is for the sound event detection (SED), and the second branch is for the DoA estimation. There are three types of predictions from the network, which are SED predictions, event activity detection (EAD) predictions that are used to combine the SED and DOA features for the on-set and off-set estimation, and DoA predictions. All of these predictions have the format of two tracks indicating that there are at most two overlapping events. Within each track, there could be at most one event happening. This architecture brings a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. The performance of Task 3 dataset is greatly increased compared with the baseline method.
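
The track permutation problem mentioned in this abstract can be illustrated with a small sketch. Because the two output tracks have no fixed order, the loss is computed for both possible assignments of predicted tracks to reference tracks in every frame, and the smaller one is kept. The code below assumes a mean squared error loss and arrays of shape (frames, 2, dims); both are assumptions made for illustration, not details taken from the report.

```python
import numpy as np

def frame_level_pit_loss(pred, target):
    """Frame-level permutation invariant MSE for two tracks.

    pred, target: arrays of shape (frames, 2, dims).
    Returns the mean over frames of the smaller of the two
    per-frame losses (tracks in order vs. tracks swapped).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Assignment 1: predicted track 0 -> reference track 0, 1 -> 1.
    straight = ((pred - target) ** 2).mean(axis=(1, 2))
    # Assignment 2: reference tracks swapped.
    swapped = ((pred - target[:, ::-1, :]) ** 2).mean(axis=(1, 2))
    # Choose the better assignment independently for each frame.
    return float(np.minimum(straight, swapped).mean())
```

With this loss, a prediction that places the right events on the "wrong" tracks in some frames is not penalized, which is what lets the network output tracks in any order.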

THE USTC-IFLYTEK SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2020 CHALLENGE

Qing Wang1, Huaxin Wu2, Zijun Jing2, Feng Ma2, Yi Fang2, Yuxuan Wang1, Tairan Chen1, Jia Pan2, Jun Du1, Chin-Hui Lee3
1University of Science and Technology of China, 2IFLYTEK CO. LTD., 3Georgia Institute of Technology

Abstract

In this report, we present our method for DCASE 2020 challenge: Sound Event Localization and Detection (SELD). We propose an entire technical solution, which consists of data augmentation, network training, model ensemble, and post-processing. First, more training data is generated by applying transformation to both Ambisonic and microphone array signals, and by mixing the non-overlapping samples in the development dataset. And SpecAugment is also used as an augmentation technique to expand the training dataset. Then we train several deep neural network (DNN) architectures to jointly predict the spatial and temporal location of sound events in addition to its type. Besides, for SED estimation, we also use softmax activation function to handle the classification of both non-overlapping and overlapping sound events. With several network architectures, a more robust prediction of SED and directions-of-arrival (DOA) is obtained by model ensemble. At last, we use post-processing to apply different thresholds to different sound events. The proposed system is evaluated on the development set of TAU-NIGENS Spatial Sound Events 2020.
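
The post-processing step described in this abstract, applying a different threshold to each sound event class, could look roughly like the sketch below. It assumes frame-wise class probabilities of shape (frames, classes); the function name and the threshold values in the example are hypothetical and not taken from the report.

```python
import numpy as np

def apply_class_thresholds(probs, thresholds):
    """Binarize frame-wise class probabilities with one threshold per class.

    probs: array of shape (frames, classes) with values in [0, 1].
    thresholds: array of shape (classes,), one threshold per class.
    Returns a boolean activity matrix of shape (frames, classes).
    """
    probs = np.asarray(probs, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcasting compares every frame against the per-class thresholds.
    return probs >= thresholds[None, :]
```

A class that the model tends to under-predict can then be given a lower threshold than the others without affecting the remaining classes.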

TASK 3 DCASE 2020: SOUND EVENT LOCALIZATION AND DETECTION USING RESIDUAL SQUEEZE-EXCITATION CNNS

Javier Naranjo-Alcazar1, Sergi Perez-Castanos2, Jose Ferrandis2, Pedro Zuccarello2, Maximo Cobos1
1Universitat de Valencia, 2Visualfy

Abstract

Sound Event Localization and Detection (SELD) is a problem related to the field of machine listening whose objective is to recognize individual sound events, detect their temporal activity, and estimate their spatial location. Thanks to the emergence of more hard-labeled audio datasets, Deep Learning techniques have become state-of-the-art solutions. The most common ones are those that implement a convolutional recurrent network (CRNN) having previously transformed the audio signal into multichannel 2D representation. In the context of this problem, the input to the network, usually, has many more channels than in other problems related to machine listening. This is because the audio is recorded by an array of microphones. Some frequency representation is obtained for each of them together with some additional representations, such as the generalized cross-correlation (GCC), whose objective is the assessment of the relationship between channels. This work aims to improve the accuracy results of the baseline CRNN by adding residual squeeze-excitation (SE) blocks in the convolutional part of the CRNN. The followed procedure involves a grid search of the parameter ratio of the residual SE block, whereas the hyperparameters of the network remain the same as in the baseline. Experiments show that by simply introducing the residual SE blocks, the results obtained in the development phase clearly exceed the baseline.
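
As a rough illustration of the residual squeeze-excitation idea named in this abstract, the sketch below applies channel-wise gating to a feature map and adds an identity shortcut. It is a simplified NumPy version that omits the convolutions of a real block; the weight shapes, the function name, and the exact placement of the shortcut are assumptions, not the authors' architecture. The bottleneck size `channels // ratio` corresponds to the ratio parameter that the abstract says was grid searched.

```python
import numpy as np

def residual_se_block(x, w1, w2):
    """Simplified residual squeeze-excitation on a feature map.

    x:  feature map of shape (channels, time, freq).
    w1: bottleneck weights, shape (channels // ratio, channels).
    w2: expansion weights, shape (channels, channels // ratio).
    """
    squeezed = x.mean(axis=(1, 2))                 # squeeze: global average per channel
    hidden = np.maximum(0.0, w1 @ squeezed)        # bottleneck layer with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gates in (0, 1)
    return x + x * gates[:, None, None]            # rescaled features plus identity shortcut
```

A smaller ratio gives the gating network more capacity at the cost of more parameters, which is the trade-off a grid search over the ratio explores.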

DCASE 2020 TASK 3: ENSEMBLE OF SEQUENCE MATCHING NETWORKS FOR DYNAMIC SOUND EVENT LOCALIZATION, DETECTION, AND TRACKING

Thi Ngoc Tho Nguyen1, Douglas L. Jones2, Woon Seng Gan1
1Nanyang Technological University, 2University of Illinois Urbana-Champaign

Abstract

Sound event localization and detection consisted two subtasks which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different event classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Therefore, it is often difficult to jointly train two subtasks simultaneously. Our previous sequence matching approach that solves sound event detection and direction-of-arrival separately and trains a convolutional recurrent neural network to associate the sound classes with the directions-of-arrival using onsets and offsets of the sound events shows improved performance for multiple-static-sound-source scenarios compared to other state-of-the-art networks such as the SELDnet, and the two-stage networks. Experimental results on the new DCASE dataset for sound event localization, detection, and tracking of multiple moving sound sources showed that the sequence matching network also outperformed the jointly trained SELDnet model. In order to estimate directions-of-arrival of moving sound sources with high spatial resolution, we proposed to separate the directional estimations into azimuth and elevation before feeding them into the sequence matching network. We combined several sequence matching networks into ensembles and achieved a sound event detection and localization error of 0.217 compared to 0.466 of the baseline.
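
The single error figure quoted in this abstract (0.466 for the baseline) is consistent with an aggregate SELD score that averages the four challenge metrics after mapping each to a "lower is better" value in [0, 1]. The sketch below assumes that definition. Feeding it the development-set row of DCASE2020_FOA_baseline from the systems ranking table (error rate 0.72, F-score 37.4, localization error 22.8°, localization recall 60.7) reproduces the quoted baseline value.

```python
def seld_score(error_rate, f_score_pct, loc_error_deg, loc_recall_pct):
    """Average of the four metrics, each expressed so that lower is better."""
    return (error_rate
            + (1.0 - f_score_pct / 100.0)      # F-score given in percent
            + loc_error_deg / 180.0            # localization error normalized by 180°
            + (1.0 - loc_recall_pct / 100.0)   # recall given in percent
            ) / 4.0

baseline_dev = seld_score(0.72, 37.4, 22.8, 60.7)
print(round(baseline_dev, 3))  # 0.466
```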

SOUND EVENT LOCALIZATION AND DETECTION WITH VARIOUS LOSS FUNCTIONS

Sooyoung Park1, Sangwon Suh1, Youngho Jeong1
1Electronics and Telecommunications Research Institute

Abstract

This technical report presents our system submitted to DCASE 2020 task 3. The goal of DCASE Task 3 is to detect a sound event and its location when a polyphonic sound event moves dynamically. We focus on designing loss functions to overcome the characteristics of the sub-task and imbalanced dataset. Temporal masking loss is used to overcome imbalance from zero labels of the silence frame. Soft floss is used for overcoming imbalance instances between class labels. A periodic loss function is proposed for regression that infers the periodic label in the direction of arrival estimation. Also, we take a feature pyramid network based network to overcome the information leakage occurred by the pooling layer in the CRNN.
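
Two of the ideas in this abstract, masking out silent frames and using a loss that respects the periodicity of angles, can be combined in a short sketch. The abstract does not give the exact formulas, so the code below substitutes a common choice: a 1 - cos(difference) penalty averaged over active frames only. The function name and this particular loss form are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def masked_periodic_loss(pred_deg, target_deg, active):
    """Periodic angle loss averaged over active frames only.

    pred_deg, target_deg: per-frame angles in degrees, shape (frames,).
    active: per-frame 0/1 flags; silent frames (0) do not contribute.
    """
    pred = np.radians(np.asarray(pred_deg, dtype=float))
    target = np.radians(np.asarray(target_deg, dtype=float))
    mask = np.asarray(active, dtype=float)
    # 1 - cos is zero whenever the angles agree modulo 360 degrees.
    per_frame = 1.0 - np.cos(pred - target)
    total_active = mask.sum()
    if total_active == 0:
        return 0.0  # no active frames, nothing to penalize
    return float((per_frame * mask).sum() / total_active)
```

With a plain squared error, a prediction of 179° against a reference of -179° would look like a 358° mistake; under the periodic form it is treated as the 2° error it actually is.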

DCASE 2020 TASK 3: A SINGLE STAGE FULLY CONVOLUTIONAL NEURAL NETWORK FOR SOUND SOURCE LOCALIZATION AND DETECTION

Sohel Patel1, Maciej Zawodniok1, Jacob Benesty2
1Missouri University of Science and Technology, 2University of Quebec

Abstract

In this report, we present our approach for DCASE 2020 Challenge Task3: Sound event localization and detection. We use a single step training method using SELDNet like models but using fully convolutional architectures. We consider the joint optimization of both event detection and DoA estimation. For the metrics that evaluate the performance of the model consider interdependence of both parameters performance unlike independent performance like DCASE 2019 challenge. We use all the sound event classes and corresponding cartesian co-ordinates for each class to create an image like label for reference and make this an image to image mapping problem. The best model could get DOA error of around 13.5° and error rate of 0.55.
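
The Cartesian coordinates used as regression targets in this abstract can be obtained from a direction of arrival given as azimuth and elevation. A minimal sketch follows, assuming azimuth is measured in the horizontal plane from the x-axis and elevation upward from that plane; the axis convention and the function name are assumptions and may differ from the report.

```python
import math

def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Unit vector (x, y, z) for a direction given in degrees."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.cos(az)
    y = math.cos(el) * math.sin(az)
    z = math.sin(el)
    return x, y, z
```

Regressing unit vectors instead of raw angles avoids the discontinuity at ±180° azimuth, since nearby directions always map to nearby points on the sphere.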

PAPAFIL: A LOW COMPLEXITY SOUND EVENT LOCALIZATION AND DETECTION METHOD WITH PARAMETRIC PARTICLE FILTERING AND GRADIENT BOOSTING

Andres Perez-Lopez1, Rafael Ibanez-Usach2
1Pompeu Fabra University, 2STRATIO

Abstract

The present technical report describes the architecture of the system submitted to the DCASE 2020 Challenge - Task 3: Sound Event Localization and Detection. The proposed method constitutes a low-complexity solution for the task. It is based on four building blocks: a spatial parametric analysis to find single-source spectrogram bins, a particle tracker to estimate trajectories and temporal activities, a spatial filter, and a gradient boosting machine single-class classifier. Provisional results, computed from the development dataset, show that the proposed method outperforms a CRNN baseline in three out of the four evaluation metrics considered in the challenge, and obtains an overall score almost ten points above the baseline.

AUDIO EVENT DETECTION AND LOCALIZATION WITH MULTITASK REGRESSION NETWORK

Huy Phan1, Lam Pham2, Philipp Koch3, Ngoc Duong4, Ian McLoughlin5, Alfred Mertins3
1Queen Mary University of London, 2University of Kent, 3University of Luebeck, 4InterDigital R&D France, 5Singapore Institute of Technology

Abstract

This technical report describes our submission to the DCASE 2020 Task 3 (Sound Event Localization and Detection (SELD)). In the submission, we propose a multitask regression model, in which both (multi-label) event detection and localization are formulated as regression problems so that the mean squared error loss can be used homogeneously for model training. The deep learning model features a convolutional recurrent neural network (CRNN) architecture coupled with a self-attention mechanism. Experiments on the development set of the challenge’s SELD task demonstrate that the proposed system outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute.
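For orientation, the combined metric can be sketched as the average of the four reported metrics after each is turned into an error between 0 and 1. This follows the aggregation used in the DCASE 2020 Task 3 evaluation setup as best understood here; the function name `seld_score` and the choice to pass F-score and recall as fractions are assumptions of this sketch.

```python
def seld_score(error_rate, f_score, loc_error_deg, loc_recall):
    """Aggregate SELD error: lower is better.

    error_rate     -- location-aware error rate (ER, 20 degree threshold)
    f_score        -- location-aware F-score as a fraction in [0, 1]
    loc_error_deg  -- class-dependent localization error in degrees
    loc_recall     -- localization recall as a fraction in [0, 1]
    """
    return (error_rate
            + (1.0 - f_score)
            + loc_error_deg / 180.0
            + (1.0 - loc_recall)) / 4.0

# Evaluation-set row of Du_USTC_task3_4 from the teams ranking table:
# ER 0.20, F-score 84.9 %, localization error 6.0 degrees, recall 88.5 %.
top_system = seld_score(0.20, 0.849, 6.0, 0.885)
```

For that row the aggregate comes out at about 0.125. Because the four terms are weighted equally, a "10% absolute" reduction corresponds to lowering this value by about 0.10.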

A DATASET OF REVERBERANT SPATIAL SOUND SCENES WITH MOVING SOURCES FOR SOUND EVENT LOCALIZATION AND DETECTION

Archontis Politis1, Sharath Adavanne1, Tuomas Virtanen1
1Tampere University

Abstract

This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datasets of diverse sound events occurring under realistic acoustic conditions are needed. Compared to the previous challenge, a significantly more complex dataset was created for DCASE 2020. The two key differences are a more diverse range of acoustical conditions, and dynamic conditions, i.e. moving sources. The spatial sound scenes are created using real room impulse responses captured in a continuous manner with a slowly moving excitation source. Both static and moving sound events are synthesized from them. Ambient noise recorded on location is added to complete the generation of scene recordings. A baseline SELD method accompanies the dataset, based on a convolutional recurrent neural network, to provide benchmark scores for the task. The baseline is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.

SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING DENSE RECTANGULAR FILTERS AND CHANNEL ROTATION DATA AUGMENTATION

Francesca Ronchini1, Andrés Pérez López1, Daniel Arteaga1
1Pompeu Fabra University

Abstract

This technical report illustrates the system submitted to the DCASE 2020 Challenge Task 3: Sound Event Localization and Detection. The algorithm consists of a CRNN using dense rectangular filters specialized to recognize significant frequency features related to the task. In order to further improve the score and to generalize the system performance to unseen data, the training dataset size has been increased using data augmentation based on channel rotations and reflection on the xy plane in the First Order Ambisonic domain, which makes it possible to improve Direction of Arrival labels while keeping the physical relationships between channels. Evaluation results on the cross-validation development dataset show that the proposed system outperforms the baseline results, considerably improving Error Rate and F-score for location-aware detection.
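As a rough illustration of why such augmentation preserves the physical relationships between channels, the sketch below applies a 90° rotation about the vertical axis and a reflection on the xy plane to a first-order Ambisonic signal and updates the labels accordingly. It assumes a channel order of (W, Y, Z, X) and azimuth/elevation labels in degrees; the helper names are hypothetical and the set of transformations actually used in the report may differ.

```python
import numpy as np

def encode_foa(azi_deg, ele_deg):
    """Ideal FOA gains for a plane wave, channel order (W, Y, Z, X) assumed."""
    azi, ele = np.deg2rad(azi_deg), np.deg2rad(ele_deg)
    return np.array([1.0,
                     np.sin(azi) * np.cos(ele),   # Y
                     np.sin(ele),                 # Z
                     np.cos(azi) * np.cos(ele)])  # X

def rotate_z_90(foa, azi_deg, ele_deg):
    """Rotate the scene by +90 degrees in azimuth: Y' = X, X' = -Y."""
    foa = np.asarray(foa, dtype=float)
    rotated = np.stack([foa[0], foa[3], foa[2], -foa[1]])
    new_azi = (azi_deg + 90 + 180) % 360 - 180  # wrap to [-180, 180)
    return rotated, new_azi, ele_deg

def reflect_xy(foa, azi_deg, ele_deg):
    """Mirror the scene on the xy plane: Z' = -Z, elevation changes sign."""
    foa = np.asarray(foa, dtype=float)
    reflected = np.stack([foa[0], foa[1], -foa[2], foa[3]])
    return reflected, azi_deg, -ele_deg
```

Both functions also accept multichannel audio of shape (4, n_samples), since they only permute and negate whole channels. No resampling or filtering is involved, so each augmented recording is as physically plausible as the original.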

SOUND EVENT DETECTION AND LOCALIZATION USING CRNN MODELS

Arunodhayan Sampathkumar1, Danny Kowerko1
1Technische Universität Chemnitz

Abstract

Sound Event Localization and Detection (SELD) requires both spatial and temporal information about the sound events that appear in an acoustic scene. DCASE 2020 Task 3 provides a strongly labelled dataset consisting of 14 classes. In this work, the existing method from DCASE 2019 is used with significant modifications: it utilizes log-mel features for sound event detection, and intensity vector and generalized cross-correlation with phase transform (GCC-PHAT) features for sound source localization. A Convolutional Recurrent Neural Network (CRNN) is developed that jointly predicts the Sound Event Detection (SED) and Direction of Arrival (DOA) outputs, thereby minimizing the overlapping problems. The developed model significantly outperformed the baseline system.

SOUND EVENT LOCALIZATION AND DETECTION USING ACTIVITY-COUPLED CARTESIAN DOA VECTOR AND RD3NET

Kazuki Shimada1, Naoya Takahashi1, Shusuke Takahashi1, Yuki Mitsufuji1
1SONY Corporation

Abstract

Our systems submitted to the DCASE2020 task 3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonic (FOA) signals, and multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.
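The idea of a single target that couples activity and direction can be sketched as follows: each class gets a Cartesian vector whose direction is the DOA and whose length is the event activity, so an inactive class is the zero vector. This is only an illustrative reading of the abstract; the function names, the azimuth/elevation convention, and the 0.5 decoding threshold are assumptions of this sketch.

```python
import numpy as np

def to_accdoa(activity, azi_deg, ele_deg):
    """Build per-class target vectors: unit DOA vector scaled by activity.

    activity, azi_deg, ele_deg -- arrays of shape (n_classes,)
    returns                    -- array of shape (n_classes, 3) holding (x, y, z)
    """
    azi, ele = np.deg2rad(azi_deg), np.deg2rad(ele_deg)
    unit = np.stack([np.cos(azi) * np.cos(ele),
                     np.sin(azi) * np.cos(ele),
                     np.sin(ele)], axis=-1)
    return np.asarray(activity, dtype=float)[..., None] * unit

def from_accdoa(vectors, threshold=0.5):
    """Decode activity from the vector length and DOA from its direction."""
    vectors = np.asarray(vectors, dtype=float)
    active = np.linalg.norm(vectors, axis=-1) > threshold  # assumed threshold
    azi = np.rad2deg(np.arctan2(vectors[..., 1], vectors[..., 0]))
    ele = np.rad2deg(np.arctan2(vectors[..., 2],
                                np.hypot(vectors[..., 0], vectors[..., 1])))
    return active, azi, ele
```

With such a target, a network needs only one regression output and one loss, rather than separate detection and localization branches whose losses must be balanced against each other.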

A SEQUENTIAL SYSTEM FOR SOUND EVENT DETECTION AND LOCALIZATION USING CRNN

Rohit Singla1, Sourabh Tiwari1, Rajat Sharma1
1Samsung Research Institute Bangalore

Abstract

In this technical report, we describe our method for DCASE2020 task 3: Sound Event Localization and Detection. We use CRNN SELDnet-like single-output models that run on log-mel spectrogram features extracted from the audio files. Our model uses CNN layers followed by RNN layers. It first predicts the sound event classes (Sound Event Detection, SED), then feeds the SED output into the estimation of the Direction Of Arrival (DOA) for those sound events; the final output is the concatenation of SED and DOA. The proposed approach is evaluated on the development set of TAU Spatial Sound Events 2020 – First-Order Ambisonics (FOA).

LOCALIZATION AND DETECTION FOR MOVING SOUND SOURCES USING CONSECUTIVE ENSEMBLE OF 2D-CRNN

Ju-man Song1
1LG Electronics

Abstract

This technical report introduces a deep learning strategy for sound event localization and detection in DCASE 2020 Task 3. This strategy is designed to accurately detect and localize moving sound events by splitting the task into five sub-tasks. The sub-tasks estimate, respectively, the number of existing sound sources, the number of sound directions, a single sound direction, multiple sound directions, and the category of events. A separate two-dimensional convolutional recurrent neural network (2D-CRNN) is dedicated to each sub-task, which improves robustness to complex conditions. Finally, a consecutive ensemble strategy with some decision logic is applied to achieve high performance. With the proposed strategy, we could obtain optimal network models for each sub-task. The proposed strategy is evaluated on the development set of TAU-NIGENS Spatial Sound Events 2020, and shows notable improvements.

MULTIPLE CRNN FOR SELD

Congzhou Tian1
1Peking University

Abstract

In this task, we use multiple CRNNs for SELD. First, a CRNN predicts the number of simultaneously active sound events. An SED CRNN then predicts the current sound events given the predicted number of active events. After that, we train a DOA1 CRNN specifically for frames with a single active event and a total DOA CRNN for frames with more active events. We believe that training with separate networks is helpful for both the SED and DOA tasks, and our results proved better than the baseline method on the development dataset.
