Sound Event Localization and Detection


Challenge results

Task description

The sound event localization and detection (SELD) task involves recognizing the temporal onset and offset of sound events, classifying the sound events into a known set of classes, and further localizing the events in space when active.

The focus of the current SELD task is to build systems that are robust to reverberation in different acoustic environments (rooms). The task provides two datasets, development and evaluation, recorded in five different acoustic environments. Of the two datasets, only the development dataset provides reference labels. Participants are expected to build systems using the provided four cross-validation splits of the development dataset, and finally test their systems on the unseen evaluation dataset.

A more detailed task description can be found on the task description page.
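
Systems in the rankings below are compared using four metrics: segment-based error rate and F-score for detection, and DOA error and frame recall for localization. Error rate and DOA error are lower-is-better; F-score and frame recall are higher-is-better. As a rough illustration of the two localization metrics (not the official implementation, which differs in details), the following Python sketch computes them from per-frame lists of reference and predicted DOAs, pairing DOAs with the Hungarian algorithm:

# A minimal sketch of the two localization metrics used in the tables below,
# assuming per-frame lists of reference and predicted DOAs in degrees.
# The official metric also pairs DOAs optimally; scipy's
# linear_sum_assignment is used here for that purpose.
import numpy as np
from scipy.optimize import linear_sum_assignment

def angular_distance(az1, el1, az2, el2):
    """Central angle (degrees) between two (azimuth, elevation) points."""
    az1, el1, az2, el2 = np.radians([az1, el1, az2, el2])
    cos_d = (np.sin(el1) * np.sin(el2)
             + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2))
    return np.degrees(np.arccos(np.clip(cos_d, -1.0, 1.0)))

def doa_error_and_frame_recall(ref_frames, pred_frames):
    """ref_frames/pred_frames: lists (one entry per frame) of lists of (az, el)."""
    errors, recalled = [], 0
    for ref, pred in zip(ref_frames, pred_frames):
        if len(ref) == len(pred):          # frame recall: correct DOA count
            recalled += 1
        if ref and pred:                   # DOA error over optimally paired DOAs
            cost = np.array([[angular_distance(r[0], r[1], p[0], p[1])
                              for p in pred] for r in ref])
            rows, cols = linear_sum_assignment(cost)
            errors.extend(cost[rows, cols])
    doa_error = float(np.mean(errors)) if errors else 0.0
    frame_recall = 100.0 * recalled / max(len(ref_frames), 1)
    return doa_error, frame_recall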

Systems ranking

Performance of all submitted systems on the evaluation and development datasets.

Columns: Submission name | Technical report | Official rank | Evaluation dataset: Error rate, F-score (%), DOA error (°), Frame recall (%) | Development dataset: Error rate, F-score (%), DOA error (°), Frame recall (%)
Kapka_SRPOL_task3_2 Kapka2019 1 0.08 94.7 3.7 96.8 0.14 89.3 5.7 95.6
Kapka_SRPOL_task3_4 Kapka2019 2 0.08 94.7 3.7 96.8 0.15 88.7 5.7 95.6
Kapka_SRPOL_task3_3 Kapka2019 3 0.10 93.5 4.6 96.0 0.14 89.3 5.7 95.6
Cao_Surrey_task3_4 Cao2019 4 0.08 95.5 5.5 92.2 0.13 92.8 6.7 90.8
Xue_JDAI_task3_1 Xue2019 5 0.06 96.3 9.7 92.3 0.11 93.4 9.0 90.6
He_THU_task3_2 He2019 6 0.06 96.7 22.4 94.1 0.14 91.6 24.8 90.8
He_THU_task3_1 He2019 7 0.06 96.6 23.8 94.4 0.14 91.6 24.8 90.8
Cao_Surrey_task3_1 Cao2019 8 0.09 95.1 5.5 91.0 0.13 93.0 6.6 89.4
Xue_JDAI_task3_4 Xue2019 9 0.07 95.9 10.0 92.6 0.11 93.4 9.2 90.6
Jee_NTU_task3_1 Jee2019 10 0.12 93.7 4.2 91.8 0.15 91.9 4.6 89.6
Xue_JDAI_task3_3 Xue2019 11 0.08 95.6 10.1 92.2 0.11 93.4 9.3 90.6
He_THU_task3_4 He2019 12 0.06 96.3 26.1 93.4 0.15 91.3 25.8 89.5
Cao_Surrey_task3_3 Cao2019 13 0.10 94.9 5.8 90.4 0.13 93.0 6.8 89.4
Xue_JDAI_task3_2 Xue2019 14 0.09 95.2 9.2 91.5 0.14 92.2 9.6 90.2
He_THU_task3_3 He2019 15 0.08 95.6 24.4 92.9 0.15 91.3 25.8 89.5
Cao_Surrey_task3_2 Cao2019 16 0.12 93.8 5.5 89.0 0.13 93.0 7.2 89.4
Nguyen_NTU_task3_3 Nguyen2019 17 0.11 93.4 5.4 88.8 0.17 89.3 5.1 87.5
MazzonYasuda_NTT_task3_3 MazzonYasuda2019 18 0.10 94.2 6.4 88.8 0.03 98.2 0.6 93.2
Chang_HYU_task3_3 Chang2019 19 0.14 91.9 2.7 90.8 0.24 85.9 3.6 88.7
Nguyen_NTU_task3_4 Nguyen2019 20 0.12 93.2 5.5 88.7 0.17 89.3 5.1 87.5
Chang_HYU_task3_4 Chang2019 21 0.17 90.5 3.1 94.1 0.28 84.2 4.1 91.8
MazzonYasuda_NTT_task3_2 MazzonYasuda2019 22 0.13 93.0 5.0 88.2 0.04 98.2 2.2 92.9
Chang_HYU_task3_2 Chang2019 23 0.14 92.3 9.7 95.3 0.25 85.0 10.6 92.8
Chang_HYU_task3_1 Chang2019 24 0.13 92.8 8.4 91.4 0.26 85.2 9.6 88.4
MazzonYasuda_NTT_task3_1 MazzonYasuda2019 25 0.12 93.3 7.1 88.1 0.03 98.2 2.1 93.2
Ranjan_NTU_task3_3 Ranjan2019 26 0.16 90.9 5.7 91.8 0.27 83.4 13.5 88.0
Ranjan_NTU_task3_4 Ranjan2019 27 0.16 90.7 6.4 92.0 0.27 83.4 13.5 88.0
Park_ETRI_task3_1 Park2019 28 0.15 91.9 5.1 87.4 0.17 90.6 6.4 85.7
Nguyen_NTU_task3_1 Nguyen2019 29 0.15 91.1 5.6 89.8 0.21 86.9 5.1 88.9
Leung_DBS_task3_2 Leung2019 30 0.12 93.3 25.9 91.1 0.20 88.4 25.4 89.6
Park_ETRI_task3_2 Park2019 31 0.15 91.8 5.0 87.2 0.17 90.5 6.4 85.6
Grondin_MIT_task3_1 Grondin2019 32 0.14 92.2 7.4 87.5 0.21 87.2 6.8 84.7
Leung_DBS_task3_1 Leung2019 33 0.12 93.4 27.2 90.7 0.20 88.1 26.9 89.0
Park_ETRI_task3_3 Park2019 34 0.15 91.9 7.0 87.4 0.18 90.1 8.3 85.4
MazzonYasuda_NTT_task3_4 MazzonYasuda2019 35 0.14 92.0 7.3 87.1 0.04 98.1 5.7 93.0
Park_ETRI_task3_4 Park2019 36 0.15 91.8 7.0 87.2 0.18 89.9 8.3 85.2
Ranjan_NTU_task3_1 Ranjan2019 37 0.18 89.9 8.6 90.1 0.27 83.4 13.5 88.0
Ranjan_NTU_task3_2 Ranjan2019 38 0.22 86.8 7.8 90.0 0.27 83.4 13.5 88.0
ZhaoLu_UESTC_task3_1 ZhaoLu2019 39 0.18 89.3 6.8 84.3 0.20 87.5 8.0 83.4
Rough_EMED_task3_2 Rough2019 40 0.18 89.7 9.4 85.5 0.10 86.0 10.0 89.6
Nguyen_NTU_task3_2 Nguyen2019 41 0.17 89.7 8.0 77.3 0.21 86.9 5.1 88.9
Jee_NTU_task3_2 Jee2019 42 0.19 89.1 8.1 85.0 0.18 90.3 7.2 85.6
Tan_NTU_task3_1 Tan2019 43 0.17 89.8 15.4 84.4 0.20 87.7 17.2 83.9
Lewandowski_SRPOL_task3_1 Kapka2019 44 0.19 89.4 36.2 87.7 0.26 84.1 35.3 87.9
Cordourier_IL_task3_2 Cordourier2019 45 0.22 86.5 20.8 85.7 0.12 92.4 15.9 89.9
Cordourier_IL_task3_1 Cordourier2019 46 0.22 86.3 19.9 85.6 0.11 93.3 15.0 90.8
Krause_AGH_task3_4 Krause2019 47 0.22 87.4 31.0 87.0 0.19 88.5 46.9 88.6
DCASE2019_FOA_baseline Adavanne2019 48 0.28 85.4 24.6 85.7 0.34 79.9 28.5 85.4
Perezlopez_UPF_task3_1 Perezlopez2019 49 0.29 82.1 9.3 75.8 0.32 79.7 9.1 76.4
Chytas_UTH_task3_1 Chytas2019 50 0.29 82.4 18.6 75.6 0.31 81.2 19.8 75.3
Anemueller_UOL_task3_3 Anemueller2019 51 0.28 83.8 29.2 84.1 0.30 82.1 28.9 85.8
Chytas_UTH_task3_2 Chytas2019 52 0.29 82.3 18.7 75.7 0.31 81.2 19.8 75.3
Krause_AGH_task3_2 Krause2019 53 0.32 82.9 31.7 85.7 0.35 80.7 62.2 84.1
Krause_AGH_task3_1 Krause2019 54 0.30 83.0 32.5 85.3 0.33 80.9 32.5 85.4
Anemueller_UOL_task3_1 Anemueller2019 55 0.33 81.3 28.2 84.5 0.33 80.8 26.4 86.4
Kong_SURREY_task3_1 Kong2019 56 0.29 83.4 37.6 81.3 0.31 81.2 43.9 78.4
Anemueller_UOL_task3_2 Anemueller2019 57 0.36 79.8 25.0 84.1 0.33 80.8 26.4 86.4
DCASE2019_MIC_baseline Adavanne2019 58 0.30 83.2 38.1 83.4 0.35 80.1 30.8 84.0
Lin_YYZN_task3_1 Lin2019 59 1.03 2.6 21.9 31.6 1.02 2.5 15.4 30.7
Krause_AGH_task3_3 Krause2019 60 0.35 80.3 52.6 83.6 0.48 72.0 70.7 79.3

Teams ranking

Table including only the best-performing system per submitting team.

Columns: Submission name | Technical report | Best official system rank | Evaluation dataset: Error rate, F-score (%), DOA error (°), Frame recall (%) | Development dataset: Error rate, F-score (%), DOA error (°), Frame recall (%)
Nguyen_NTU_task3_3 Nguyen2019 17 0.11 93.4 5.4 88.8 0.17 89.3 5.1 87.5
Jee_NTU_task3_1 Jee2019 10 0.12 93.7 4.2 91.8 0.15 91.9 4.6 89.6
Xue_JDAI_task3_1 Xue2019 5 0.06 96.3 9.7 92.3 0.11 93.4 9.0 90.6
He_THU_task3_2 He2019 6 0.06 96.7 22.4 94.1 0.14 91.6 24.8 90.8
Tan_NTU_task3_1 Tan2019 43 0.17 89.8 15.4 84.4 0.20 87.7 17.2 83.9
Cordourier_IL_task3_2 Cordourier2019 45 0.22 86.5 20.8 85.7 0.12 92.4 15.9 89.9
ZhaoLu_UESTC_task3_1 ZhaoLu2019 39 0.18 89.3 6.8 84.3 0.20 87.5 8.0 83.4
Park_ETRI_task3_1 Park2019 28 0.15 91.9 5.1 87.4 0.17 90.6 6.4 85.7
DCASE2019_FOA_baseline Adavanne2019 48 0.28 85.4 24.6 85.7 0.34 79.9 28.5 85.4
Perezlopez_UPF_task3_1 Perezlopez2019 49 0.29 82.1 9.3 75.8 0.32 79.7 9.1 76.4
Chang_HYU_task3_3 Chang2019 19 0.14 91.9 2.7 90.8 0.24 85.9 3.6 88.7
Cao_Surrey_task3_4 Cao2019 4 0.08 95.5 5.5 92.2 0.13 92.8 6.7 90.8
Rough_EMED_task3_2 Rough2019 40 0.18 89.7 9.4 85.5 0.10 86.0 10.0 89.6
Kapka_SRPOL_task3_2 Kapka2019 1 0.08 94.7 3.7 96.8 0.14 89.3 5.7 95.6
Anemueller_UOL_task3_3 Anemueller2019 51 0.28 83.8 29.2 84.1 0.30 82.1 28.9 85.8
Leung_DBS_task3_2 Leung2019 30 0.12 93.3 25.9 91.1 0.20 88.4 25.4 89.6
Krause_AGH_task3_4 Krause2019 47 0.22 87.4 31.0 87.0 0.19 88.5 46.9 88.6
Kong_SURREY_task3_1 Kong2019 56 0.29 83.4 37.6 81.3 0.31 81.2 43.9 78.4
Chytas_UTH_task3_1 Chytas2019 50 0.29 82.4 18.6 75.6 0.31 81.2 19.8 75.3
Ranjan_NTU_task3_3 Ranjan2019 26 0.16 90.9 5.7 91.8 0.27 83.4 13.5 88.0
Grondin_MIT_task3_1 Grondin2019 32 0.14 92.2 7.4 87.5 0.21 87.2 6.8 84.7
Lin_YYZN_task3_1 Lin2019 59 1.03 2.6 21.9 31.6 1.02 2.5 15.4 30.7
MazzonYasuda_NTT_task3_3 MazzonYasuda2019 18 0.10 94.2 6.4 88.8 0.03 98.2 0.6 93.2

Acoustic environment-wise performance

Performance of submitted systems on different acoustic environments of the evaluation dataset.

Columns: Submission name | Technical report | Official rank | then for each of Locations 1-5: Error rate, F-score (%), DOA error (°), Frame recall (%)
Kapka_SRPOL_task3_2 Kapka2019 1 0.05 96.4 2.5 98.0 0.09 94.6 3.3 96.4 0.11 92.5 4.6 95.2 0.09 94.4 4.0 97.3 0.07 95.8 3.9 97.3
Kapka_SRPOL_task3_4 Kapka2019 2 0.05 96.6 2.5 98.0 0.09 94.3 3.3 96.4 0.12 91.8 4.6 95.2 0.08 95.0 4.0 97.3 0.07 95.8 4.0 97.3
Kapka_SRPOL_task3_3 Kapka2019 3 0.06 95.7 4.1 97.3 0.13 91.5 4.2 94.5 0.14 90.8 5.9 94.5 0.08 94.9 4.6 96.6 0.08 94.5 4.4 96.9
Cao_Surrey_task3_4 Cao2019 4 0.06 96.5 5.4 92.1 0.09 94.8 5.9 92.2 0.08 95.6 4.8 92.7 0.10 94.5 5.4 91.7 0.07 96.0 6.1 92.3
Xue_JDAI_task3_1 Xue2019 5 0.05 97.4 10.1 92.1 0.06 96.3 10.3 92.9 0.08 95.6 10.0 91.8 0.07 96.1 9.6 92.6 0.07 96.2 8.3 92.1
He_THU_task3_2 He2019 6 0.04 97.6 20.8 94.5 0.09 95.1 25.2 93.4 0.06 96.9 22.8 94.1 0.07 96.5 21.8 94.1 0.05 97.3 21.3 94.4
He_THU_task3_1 He2019 7 0.04 97.8 22.7 94.5 0.09 95.0 27.0 94.0 0.06 96.4 23.1 94.6 0.06 96.7 23.6 94.5 0.06 96.8 22.7 94.3
Cao_Surrey_task3_1 Cao2019 8 0.08 96.0 5.4 91.0 0.10 94.6 5.8 90.8 0.09 95.3 5.1 91.7 0.10 94.8 5.3 90.9 0.10 95.0 6.0 90.8
Xue_JDAI_task3_4 Xue2019 9 0.05 97.2 10.2 92.5 0.08 95.1 11.1 92.8 0.08 95.5 10.2 92.3 0.08 95.3 9.9 92.7 0.07 96.2 8.8 92.8
Jee_NTU_task3_1 Jee2019 10 0.10 94.6 4.1 91.6 0.13 92.9 4.3 91.8 0.12 93.5 4.5 91.9 0.13 93.0 3.8 91.6 0.11 94.4 4.3 91.9
Xue_JDAI_task3_3 Xue2019 11 0.06 97.1 10.1 92.0 0.10 94.3 11.3 92.0 0.08 95.3 10.2 92.0 0.08 95.1 10.0 92.6 0.07 96.1 8.9 92.6
He_THU_task3_4 He2019 12 0.05 97.5 24.0 93.5 0.08 95.6 29.2 93.3 0.06 96.3 26.4 93.8 0.07 96.0 25.9 93.3 0.07 96.3 25.4 93.1
Cao_Surrey_task3_3 Cao2019 13 0.08 95.7 5.5 89.9 0.11 94.1 6.4 90.4 0.09 95.3 5.1 90.9 0.10 94.5 5.8 90.1 0.10 94.7 6.3 90.5
Xue_JDAI_task3_2 Xue2019 14 0.07 96.3 9.0 91.6 0.10 94.4 10.7 91.4 0.09 95.0 9.4 91.7 0.10 94.4 8.9 91.3 0.08 95.6 8.1 91.5
He_THU_task3_3 He2019 15 0.07 96.4 24.4 92.9 0.09 95.0 26.5 93.4 0.09 95.0 24.2 92.5 0.09 95.3 23.4 92.6 0.07 96.2 23.9 93.0
Cao_Surrey_task3_2 Cao2019 16 0.08 95.4 5.3 89.5 0.17 91.1 5.7 86.9 0.10 94.6 5.1 90.2 0.12 93.5 5.5 88.9 0.11 94.3 5.9 89.6
Nguyen_NTU_task3_3 Nguyen2019 17 0.09 94.9 4.0 88.9 0.12 92.7 4.7 88.9 0.12 92.8 7.0 88.7 0.11 93.6 4.8 89.4 0.13 92.8 6.5 87.9
MazzonYasuda_NTT_task3_3 MazzonYasuda2019 18 0.09 95.1 6.2 88.6 0.10 94.2 6.4 89.5 0.11 93.8 6.5 89.1 0.12 93.2 5.9 88.2 0.10 94.5 6.9 88.7
Chang_HYU_task3_3 Chang2019 19 0.12 93.2 2.5 90.3 0.15 91.9 2.5 91.3 0.14 91.2 2.9 91.4 0.15 91.4 2.7 90.5 0.15 91.6 2.7 90.4
Nguyen_NTU_task3_4 Nguyen2019 20 0.10 94.9 4.0 89.0 0.12 92.3 4.7 88.9 0.12 93.0 7.1 88.5 0.11 93.5 4.8 89.4 0.13 92.4 6.5 87.8
Chang_HYU_task3_4 Chang2019 21 0.14 92.0 2.9 94.1 0.21 89.0 2.8 93.6 0.16 90.3 3.1 94.1 0.17 90.4 3.3 94.5 0.17 90.6 3.3 94.3
MazzonYasuda_NTT_task3_2 MazzonYasuda2019 22 0.12 93.9 4.9 87.8 0.13 93.0 5.1 89.0 0.13 92.7 5.0 88.6 0.14 92.2 4.7 87.8 0.13 93.2 5.5 87.6
Chang_HYU_task3_2 Chang2019 23 0.12 93.4 9.5 95.1 0.17 91.2 9.6 95.0 0.14 92.1 10.0 94.9 0.13 92.5 9.7 96.0 0.14 92.3 9.9 95.3
Chang_HYU_task3_1 Chang2019 24 0.13 93.0 8.2 90.9 0.14 92.3 8.4 91.7 0.13 92.5 9.1 91.2 0.12 93.0 7.8 91.7 0.12 93.4 8.5 91.5
MazzonYasuda_NTT_task3_1 MazzonYasuda2019 25 0.10 94.7 6.8 87.9 0.13 93.2 7.2 88.7 0.13 92.7 7.2 88.7 0.14 92.5 6.6 87.6 0.12 93.6 7.5 87.8
Ranjan_NTU_task3_3 Ranjan2019 26 0.13 93.0 6.3 92.4 0.21 88.7 5.2 91.0 0.16 90.9 5.7 92.1 0.17 90.2 5.3 91.4 0.15 91.7 5.9 92.1
Ranjan_NTU_task3_4 Ranjan2019 27 0.12 93.2 6.8 93.0 0.21 88.5 5.7 90.8 0.15 90.8 6.7 92.4 0.17 89.7 6.5 91.9 0.15 91.0 6.6 92.1
Park_ETRI_task3_1 Park2019 28 0.13 93.4 4.5 87.1 0.15 91.7 5.1 88.3 0.14 92.1 5.3 87.7 0.17 90.4 5.0 86.9 0.15 91.8 5.4 86.8
Nguyen_NTU_task3_1 Nguyen2019 29 0.12 92.9 4.0 90.7 0.17 89.5 4.9 88.0 0.15 90.7 7.2 90.1 0.15 90.9 5.1 90.3 0.14 91.6 6.7 89.7
Leung_DBS_task3_2 Leung2019 30 0.10 94.9 25.2 91.1 0.14 92.1 27.7 91.1 0.13 92.3 26.9 90.7 0.12 93.3 24.7 91.0 0.11 94.1 25.2 91.6
Park_ETRI_task3_2 Park2019 31 0.13 93.2 4.5 86.9 0.15 91.7 5.0 88.3 0.15 91.8 5.1 87.3 0.16 90.7 5.1 87.2 0.15 91.7 5.2 86.5
Grondin_MIT_task3_1 Grondin2019 32 0.12 93.7 6.8 88.6 0.17 90.3 7.0 86.0 0.14 92.2 7.7 87.7 0.15 91.5 7.8 87.1 0.12 93.2 7.7 88.2
Leung_DBS_task3_1 Leung2019 33 0.10 94.9 27.3 90.6 0.13 92.6 28.6 91.2 0.13 92.1 28.3 90.2 0.13 93.0 25.5 90.3 0.10 94.2 26.4 91.2
Park_ETRI_task3_3 Park2019 34 0.13 93.4 7.1 87.1 0.15 91.7 7.7 88.3 0.14 92.1 6.7 87.7 0.17 90.4 6.8 86.9 0.15 91.8 7.1 86.8
MazzonYasuda_NTT_task3_4 MazzonYasuda2019 35 0.12 93.7 7.6 86.9 0.16 91.1 6.7 87.2 0.15 91.7 7.5 87.1 0.16 91.1 7.0 86.5 0.14 92.6 7.3 87.5
Park_ETRI_task3_4 Park2019 36 0.13 93.2 7.0 86.9 0.15 91.7 7.8 88.3 0.15 91.8 6.7 87.3 0.16 90.7 6.9 87.2 0.15 91.7 7.0 86.5
Ranjan_NTU_task3_1 Ranjan2019 37 0.13 92.5 8.9 91.0 0.24 87.4 8.2 87.7 0.15 91.3 8.6 91.2 0.19 88.6 8.3 90.2 0.19 89.5 9.0 90.5
Ranjan_NTU_task3_2 Ranjan2019 38 0.18 89.4 8.0 90.8 0.29 84.3 7.0 88.5 0.19 88.2 8.0 90.5 0.24 85.0 8.1 89.7 0.23 87.0 7.7 90.4
ZhaoLu_UESTC_task3_1 ZhaoLu2019 39 0.16 90.8 6.8 84.7 0.18 89.6 7.1 84.8 0.18 88.8 7.0 84.8 0.20 88.0 6.4 83.4 0.18 89.2 6.6 83.6
Rough_EMED_task3_2 Rough2019 40 0.16 90.7 10.2 85.1 0.18 90.2 9.5 86.7 0.18 89.2 9.2 85.5 0.21 87.8 8.9 84.7 0.17 90.5 9.1 85.3
Nguyen_NTU_task3_2 Nguyen2019 41 0.14 91.7 6.3 78.0 0.19 88.2 5.6 77.4 0.18 89.4 10.2 77.3 0.18 89.1 8.0 76.5 0.17 90.1 9.7 77.5
Jee_NTU_task3_2 Jee2019 42 0.16 90.5 8.0 84.8 0.18 89.7 7.9 86.9 0.18 89.4 8.7 83.9 0.22 87.0 7.5 84.6 0.18 89.2 8.2 84.7
Tan_NTU_task3_1 Tan2019 43 0.14 92.1 14.5 84.5 0.22 87.8 15.1 84.2 0.17 89.3 15.1 83.9 0.18 89.4 16.1 84.0 0.16 90.3 16.3 85.1
Lewandowski_SRPOL_task3_1 Kapka2019 44 0.14 92.4 36.1 89.5 0.32 83.5 32.9 79.4 0.18 89.5 37.5 88.7 0.17 90.1 36.8 89.5 0.15 91.2 37.8 91.2
Cordourier_IL_task3_2 Cordourier2019 45 0.21 86.9 20.9 86.3 0.22 86.6 20.0 85.9 0.21 87.0 21.0 84.9 0.24 84.6 20.3 84.7 0.20 87.6 21.5 86.5
Cordourier_IL_task3_1 Cordourier2019 46 0.19 88.6 19.5 86.5 0.26 85.0 19.7 85.6 0.22 86.2 20.3 85.6 0.24 85.3 19.8 84.4 0.22 86.5 20.1 85.6
Krause_AGH_task3_4 Krause2019 47 0.19 89.6 30.1 88.4 0.31 83.4 31.8 82.8 0.22 87.2 31.8 87.2 0.23 87.0 30.3 87.1 0.18 89.7 31.2 89.7
DCASE2019_FOA_baseline Adavanne2019 48 0.24 87.5 24.2 87.6 0.41 80.3 23.3 79.2 0.26 85.3 25.4 85.8 0.27 86.1 24.0 87.1 0.24 87.4 26.0 88.9
Perezlopez_UPF_task3_1 Perezlopez2019 49 0.20 87.9 7.4 80.2 0.51 71.0 9.8 61.2 0.25 84.6 10.5 79.9 0.27 83.5 9.6 79.1 0.26 84.1 9.2 78.4
Chytas_UTH_task3_1 Chytas2019 50 0.29 82.3 17.8 74.3 0.29 83.2 18.7 78.2 0.28 82.4 18.0 74.5 0.29 82.3 18.7 76.1 0.30 81.9 19.6 75.1
Anemueller_UOL_task3_3 Anemueller2019 51 0.23 85.7 30.3 86.3 0.41 79.2 27.4 76.9 0.28 82.4 28.8 83.5 0.26 85.2 28.4 85.8 0.22 86.5 31.1 87.9
Chytas_UTH_task3_2 Chytas2019 52 0.29 82.3 18.5 74.1 0.28 83.5 18.6 78.4 0.29 81.9 18.2 74.7 0.29 82.2 18.4 76.1 0.30 81.9 19.9 75.0
Krause_AGH_task3_2 Krause2019 53 0.26 85.9 31.0 87.3 0.43 77.9 32.5 80.7 0.29 83.7 31.9 86.4 0.33 82.0 31.0 85.8 0.29 85.2 32.3 88.1
Krause_AGH_task3_1 Krause2019 54 0.27 85.0 31.3 86.4 0.43 78.3 32.3 81.3 0.28 83.4 33.4 85.7 0.31 81.7 32.4 85.4 0.24 86.2 33.0 87.9
Anemueller_UOL_task3_1 Anemueller2019 55 0.26 84.8 27.3 86.0 0.45 76.5 28.1 79.4 0.32 81.5 28.9 85.8 0.37 78.8 28.0 84.5 0.28 85.0 28.8 87.0
Kong_SURREY_task3_1 Kong2019 56 0.26 85.2 30.5 81.0 0.31 83.0 43.4 82.4 0.29 83.0 34.3 80.8 0.30 83.0 39.3 81.5 0.29 82.6 40.0 80.9
Anemueller_UOL_task3_2 Anemueller2019 57 0.32 82.1 24.5 84.6 0.43 76.5 25.2 81.7 0.33 80.4 24.9 85.0 0.37 79.1 24.7 83.9 0.34 80.6 25.5 85.2
DCASE2019_MIC_baseline Adavanne2019 58 0.27 84.3 37.5 81.5 0.33 82.0 39.0 85.1 0.29 83.5 37.4 82.7 0.31 82.2 36.4 83.3 0.28 84.1 40.1 84.3
Lin_YYZN_task3_1 Lin2019 59 1.04 4.2 4.9 27.9 1.06 3.5 39.2 37.8 1.00 0.0 35.2 31.5 1.03 1.7 57.0 32.4 1.03 3.4 9.3 28.7
Krause_AGH_task3_3 Krause2019 60 0.27 84.7 54.2 85.7 0.50 74.1 49.1 79.3 0.32 80.8 54.6 83.2 0.37 79.2 50.5 84.2 0.32 82.7 54.7 85.8

System characteristics

Summary of the submitted systems' characteristics.

Columns: Rank | Submission name | Technical report | Classifier | Classifier params | Audio format | Acoustic feature | Data augmentation (if any) | Sampling rate
1 Kapka_SRPOL_task3_2 Kapka2019 CRNN 2651634 Ambisonic phase and magnitude spectrogram 48kHz
2 Kapka_SRPOL_task3_4 Kapka2019 CRNN 2651634 Ambisonic phase and magnitude spectrogram 48kHz
3 Kapka_SRPOL_task3_3 Kapka2019 CRNN 2651634 Ambisonic phase and magnitude spectrogram 48kHz
4 Cao_Surrey_task3_4 Cao2019 CRNN9, ensemble 6976673 both log mel and intensity vector 48kHz
5 Xue_JDAI_task3_1 Xue2019 CRNN, ensemble 2641263 Microphone Array Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra speed perturbation 48kHz
6 He_THU_task3_2 He2019 CRNN 474273 Ambisonic phase and magnitude spectrogram, Mel-spectrogram time and frequency masking (SpecAugment) 48kHz
7 He_THU_task3_1 He2019 CRNN 474273 Ambisonic phase and magnitude spectrogram, Mel-spectrogram time and frequency masking (SpecAugment) 48kHz
8 Cao_Surrey_task3_1 Cao2019 CRNN9, ensemble 6965153 Ambisonic log mel and intensity vector 48kHz
9 Xue_JDAI_task3_4 Xue2019 CRNN, ensemble 2641263 Microphone Array Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra speed perturbation 48kHz
10 Jee_NTU_task3_1 Jee2019 CRNN 10209057 Microphone Array log-mel spectrogram and GCC-PHAT mixup 34 kHz
11 Xue_JDAI_task3_3 Xue2019 CRNN, ensemble 2641263 Microphone Array Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra speed perturbation 48kHz
12 He_THU_task3_4 He2019 CRNN 474273 Microphone Array phase and magnitude spectrogram, Mel-spectrogram time and frequency masking (SpecAugment) 48kHz
13 Cao_Surrey_task3_3 Cao2019 CRNN11, ensemble 7071969 both log mel and intensity vector 48kHz
14 Xue_JDAI_task3_2 Xue2019 CRNN, ensemble 2641263 Microphone Array Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra speed perturbation 48kHz
15 He_THU_task3_3 He2019 CRNN 474273 Microphone Array phase and magnitude spectrogram, Mel-spectrogram time and frequency masking (SpecAugment) 48kHz
16 Cao_Surrey_task3_2 Cao2019 CRNN11, ensemble 7071969 both log mel and intensity vector 48kHz
17 Nguyen_NTU_task3_3 Nguyen2019 CRNN, ensemble 5620844 Ambisonic log-mel energies, magnitude spectrogram 48kHz
18 MazzonYasuda_NTT_task3_3 MazzonYasuda2019 CRNN, ResNet, ensemble 134896144 both log-mel energies, GCC FOA domain spatial augmentation 32kHz
19 Chang_HYU_task3_3 Chang2019 CRNN (for SAD), CRNN (for SED), CNN (for SEL) 280876424 Microphone Array Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) pitch shifting, audio file mixing 48kHz
20 Nguyen_NTU_task3_4 Nguyen2019 CRNN, ensemble 4382721 Ambisonic log-mel energies, magnitude spectrogram 48kHz
21 Chang_HYU_task3_4 Chang2019 CRNN (for SAD), CRNN (for SED), CNN (for SEL) 280941952 Microphone Array Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) pitch shifting, audio file mixing 48kHz
22 MazzonYasuda_NTT_task3_2 MazzonYasuda2019 CRNN, ResNet, ensemble 539584576 both log-mel energies, GCC FOA domain spatial augmentation 32kHz
23 Chang_HYU_task3_2 Chang2019 CRNN (for SAD), CRNN (for SED), CNN (for SEL) 280941952 Ambisonic Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) pitch shifting, audio file mixing 48kHz
24 Chang_HYU_task3_1 Chang2019 CRNN (for SAD), CRNN (for SED), CNN (for SEL) 280876424 Ambisonic Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) pitch shifting, audio file mixing 48kHz
25 MazzonYasuda_NTT_task3_1 MazzonYasuda2019 CRNN, ResNet, ensemble 539584576 both log-mel energies, GCC FOA domain spatial augmentation 32kHz
26 Ranjan_NTU_task3_3 Ranjan2019 Ensemble of 4 ResNet RNN models 12315976 Microphone Array phase and log magnitude melspectrogram time shifting 48kHz
27 Ranjan_NTU_task3_4 Ranjan2019 ResNet RNN trained using all 4 splits 3078994 Microphone Array phase and log magnitude melspectrogram time shifting 48kHz
28 Park_ETRI_task3_1 Park2019 CRNN, TrellisNet 23116833 both log mel-band energy and mel-band intensity vector 32kHz
29 Nguyen_NTU_task3_1 Nguyen2019 CRNN 1008235 Ambisonic log-mel energies, magnitude spectrogram 48kHz
30 Leung_DBS_task3_2 Leung2019 CRNN, ensemble 3193745 Ambisonic complex spectrograms, phase and log magnitude spectrogram 48kHz
31 Park_ETRI_task3_2 Park2019 CRNN, TrellisNet 23116833 both log mel-band energy and mel-band intensity vector 32kHz
32 Grondin_MIT_task3_1 Grondin2019 CRNN, ensemble 18191014 Microphone Array phase and magnitude spectrogram, GCC, TDOA 48kHz
33 Leung_DBS_task3_1 Leung2019 CRNN, ensemble 1891635 Ambisonic complex spectrograms, phase and log magnitude spectrogram 48kHz
34 Park_ETRI_task3_3 Park2019 CRNN 21147681 both log mel-band energy and mel-band intensity vector 32kHz
35 MazzonYasuda_NTT_task3_4 MazzonYasuda2019 CRNN, ResNet, ensemble 16862018 both log-mel energies, GCC FOA domain spatial augmentation 32kHz
36 Park_ETRI_task3_4 Park2019 CRNN 21147681 both log mel-band energy and mel-band intensity vector 32kHz
37 Ranjan_NTU_task3_1 Ranjan2019 ResNet_RNN 3076690 Microphone Array phase and log magnitude melspectrogram time shifting 48kHz
38 Ranjan_NTU_task3_2 Ranjan2019 ResNet_RNN 3078994 Microphone Array phase and log magnitude melspectrogram time shifting 48kHz
39 ZhaoLu_UESTC_task3_1 ZhaoLu2019 CNN+LSTM 7862177 Microphone Array LogMel spectrogram and GCC 48kHz
40 Rough_EMED_task3_2 Rough2019 CRNN 116118 Microphone Array phase and magnitude spectrogram 48kHz
41 Nguyen_NTU_task3_2 Nguyen2019 CRNN 1008235 Ambisonic log-mel energies, magnitude spectrogram 48kHz
42 Jee_NTU_task3_2 Jee2019 CRNN 10209057 Microphone Array log-mel spectrogram and GCC-PHAT mixup 34 kHz
43 Tan_NTU_task3_1 Tan2019 ResNet RNN + TDOA 116118 Microphone Array log mel spectrogram, TDOA frame shifting 48kHz
44 Lewandowski_SRPOL_task3_1 Kapka2019 CRNN 508574 Ambisonic phase and magnitude spectrogram time/frequency masking, time warping (SpecAugment) 48kHz
45 Cordourier_IL_task3_2 Cordourier2019 Four CRNNs in ensemble 731280 Microphone Array phase, magnitude spectrogram and GCC 48kHz
46 Cordourier_IL_task3_1 Cordourier2019 CRNN 182820 Microphone Array phase, magnitude spectrogram and GCC 48kHz
47 Krause_AGH_task3_4 Krause2019 CRNN, ensemble 615626 Ambisonic phase and magnitude spectrogram 48kHz
48 DCASE2019_FOA_baseline Adavanne2019 CRNN Ambisonic phase and magnitude spectrogram 48kHz
49 Perezlopez_UPF_task3_1 Perezlopez2019 CRNN 150000 Ambisonic DOA, diffuseness, DOA variance, energy density mixup 48kHz
50 Chytas_UTH_task3_1 Chytas2019 CNN, ensemble 100420239 Microphone Array raw audio data, spectrogram channel permutations, segment addition 16kHz
51 Anemueller_UOL_task3_3 Anemueller2019 CRNN 750137 Ambisonic group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands 48kHz
52 Chytas_UTH_task3_2 Chytas2019 CNN, ensemble 100420239 Microphone Array raw audio data, spectrogram channel permutations, segment addition 16kHz
53 Krause_AGH_task3_2 Krause2019 CRNN 282089 Ambisonic phase and magnitude spectrogram 48kHz
54 Krause_AGH_task3_1 Krause2019 CRNN 333537 Ambisonic phase and magnitude spectrogram 48kHz
55 Anemueller_UOL_task3_1 Anemueller2019 CRNN 552761 Ambisonic group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands 48kHz
56 Kong_SURREY_task3_1 Kong2019 CNN 4686144 Ambisonic magnitude spectrogram 32kHz
57 Anemueller_UOL_task3_2 Anemueller2019 CRNN 552761 Ambisonic group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands 48kHz
58 DCASE2019_MIC_baseline Adavanne2019 CRNN Microphone Array phase and magnitude spectrogram 48kHz
59 Lin_YYZN_task3_1 Lin2019 CRNN 613537 Ambisonic phase and magnitude spectrogram 44.1kHz
60 Krause_AGH_task3_3 Krause2019 CRNN 429857 Ambisonic phase and magnitude spectrogram 48kHz



Technical reports

A MULTI-ROOM REVERBERANT DATASET FOR SOUND EVENT LOCALIZATION AND DETECTION

Sharath Adavanne, Archontis Politis, Tuomas Virtanen
Tampere University, Finland

Abstract

This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.
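
For orientation, below is a condensed PyTorch sketch of a SELDnet-style CRNN with joint SED and DOA output branches, as used by the baseline; the published baseline is implemented in Keras, and the layer sizes and channel counts here are illustrative assumptions:

# A condensed PyTorch sketch of a SELDnet-style CRNN: a convolutional front
# end pooled along frequency, bidirectional GRUs, and two output branches
# (sigmoid class activities, tanh-scaled azimuth/elevation per class).
import torch
import torch.nn as nn

class SELDNet(nn.Module):
    def __init__(self, in_ch=8, n_classes=11, n_freq=512, rnn_size=128):
        super().__init__()
        blocks = []
        for pool in (8, 8, 2):                      # pool along frequency only
            blocks += [nn.Conv2d(in_ch, 64, 3, padding=1), nn.BatchNorm2d(64),
                       nn.ReLU(), nn.MaxPool2d((1, pool))]
            in_ch = 64
        self.cnn = nn.Sequential(*blocks)
        feat = 64 * (n_freq // (8 * 8 * 2))         # features per time step
        self.rnn = nn.GRU(feat, rnn_size, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed = nn.Linear(2 * rnn_size, n_classes)      # class activities
        self.doa = nn.Linear(2 * rnn_size, 2 * n_classes)  # az/el per class

    def forward(self, x):                           # x: (batch, ch, time, freq)
        z = self.cnn(x)                             # (batch, 64, time, freq')
        z = z.permute(0, 2, 1, 3).flatten(2)        # (batch, time, 64 * freq')
        z, _ = self.rnn(z)
        return torch.sigmoid(self.sed(z)), torch.tanh(self.doa(z))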

PDF

GROUP DELAY FEATURES FOR SOUND EVENT DETECTION AND LOCALIZATION (TASK 3) OF THE DCASE 2019 CHALLENGE

Eike Nustede, Jorn Anemuller
University of Oldenburg

Abstract

Sound event localization algorithms utilize features that encode a source's time-difference of arrival across an array of microphones. Direct encoding of a signal's phase in sub-bands is a common representation that is not without its shortcomings, since phase as a circular variable is prone to 2π-wrapping and systematic phase-advancement across frequencies. Group delay encoding may constitute a more robust feature for data-driven algorithms as it represents time-delays of the signal's spectral-band envelopes. Computed by differentiating the phase across frequency, it is in practice characterized by a lower degree of variability, resulting in reduced wrapping and to some extent permitting the computation of average group delay across (e.g., Mel-scaled) bands. The present contribution incorporates group delay features into the baseline system of DCASE 2019 Task 3, supplementing them with amplitude features. The system setup is based on the provided baseline's convolutional recurrent neural network architecture with some variation of its topology.
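
As a minimal sketch of the feature itself (STFT parameters are illustrative assumptions), group delay can be obtained by differentiating the unwrapped STFT phase across frequency:

# Group-delay features as the negative derivative of unwrapped STFT phase
# across frequency, paired with a matching log-magnitude feature.
import numpy as np
from scipy.signal import stft

def group_delay_features(audio, fs=48000, n_fft=1024, hop=480):
    _, _, X = stft(audio, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    phase = np.unwrap(np.angle(X), axis=0)        # unwrap along frequency
    group_delay = -np.diff(phase, axis=0)         # -dphi/domega per bin
    log_mag = np.log(np.abs(X[:-1]) + 1e-8)       # magnitude, same shape
    return group_delay, log_mag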

PDF

TWO-STAGE SOUND EVENT LOCALIZATION AND DETECTION USING INTENSITY VECTOR AND GENERALIZED CROSS-CORRELATION

Yin Cao, Turab Iqbal, Qiuqiang Kong, Miguel Galindo, Wenwu Wang, Mark Plumbley
University of Surrey

Abstract

Sound event localization and detection (SELD) refers to the spatial and temporal localization of sound events in addition to classification. The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3 introduces a strongly labelled dataset to address this problem. In this report, a two-stage polyphonic sound event detection and localization method is proposed. The method utilizes log mel features for event detection, and uses intensity vector and GCC features for localization. The intensity vector and GCC features use the supplied Ambisonic and microphone array signals, respectively. The method trains SED first, after which the learned feature layers are transferred for direction of arrival (DOA) estimation. It then uses the SED ground truth as a mask to train DOA estimation. Experimental results show that the proposed method is able to localize and detect overlapping sound events in different environments. It is also able to improve the performance of both SED and DOA estimation, and performs significantly better than the baseline method.
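
A minimal numpy sketch of the FOA intensity-vector feature used in the localization stage might look as follows, assuming complex Ambisonic STFTs W, X, Y, Z and a precomputed mel filterbank mel_fb (both names are assumptions):

# Acoustic intensity vector from first-order Ambisonic STFTs of shape
# (freq, time), normalized by local energy and projected into mel bands.
import numpy as np

def foa_intensity_vector(W, X, Y, Z, mel_fb, eps=1e-8):
    I = np.real(np.conj(W)[None] * np.stack([X, Y, Z]))   # (3, freq, time)
    energy = (np.abs(W) ** 2
              + (np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2) / 3.0)
    I = I / (energy[None] + eps)                  # normalize by local energy
    return np.einsum('mf,cft->cmt', mel_fb, I)    # project into mel bands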

PDF

THREE-STAGE APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION

Kyoungjin Noh, Choi Jeong-Hwan, Jeon Dongyeop, Chang Joon-Hyuk
Hanyang University

Abstract

This paper describes our three-stage approach to the sound event localization and detection (SELD) task. The system consists of three parts: sound activity detection (SAD), sound event detection (SED), and sound event localization (SEL). Firstly, we employ the multi-resolution cochleagram (MRCG) from 4-channel audio and a convolutional recurrent neural network (CRNN) model to detect sound activity. Secondly, we extract log mel-spectrograms from 4-channel audio, harmonic percussive source separation (HPSS) audio, and mono audio, and train another CRNN model. Lastly, we exploit the generalized cross-correlation phase transform (GCC-PHAT) of each microphone pair as an input feature of the convolutional neural network (CNN) model for SEL. Then we combine the SAD, SED, and SEL results to obtain the final prediction for the SELD task. To augment overlapped frames that degrade overall performance, we randomly select two non-overlapped audio files and mix them. We also average the predictions of several models to improve the result. Experimental results on the four cross-validation splits of the TAU Spatial Sound Events 2019 - Microphone Array dataset are error rate: 0.23, F-score: 85.91%, DOA error: 3.62 degrees, and frame recall: 88.66%, respectively.
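
The file-mixing augmentation described above can be sketched as summing two recordings and taking the union of their frame labels; names and array shapes are illustrative assumptions:

# Mix two recordings with non-overlapping events and merge their frame-wise
# binary labels, increasing the share of overlapped frames in training.
import numpy as np

def mix_recordings(audio_a, labels_a, audio_b, labels_b):
    """audio_*: (samples, channels); labels_*: (frames, classes) binary."""
    n = min(len(audio_a), len(audio_b))
    m = min(len(labels_a), len(labels_b))
    mixed_audio = audio_a[:n] + audio_b[:n]
    mixed_labels = np.maximum(labels_a[:m], labels_b[:m])  # union of events
    return mixed_audio, mixed_labels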

PDF

HIERARCHICAL DETECTION OF SOUND EVENTS AND THEIR LOCALIZATION USING CONVOLUTIONAL NEURAL NETWORKS WITH ADAPTIVE THRESHOLDS

Sotirios Panagiotis Chytas, Gerasimos Potamianos
University of Thessaly

Abstract

The paper details our approach to address Task 3 of the DCASE 2019 Challenge, namely that of sound event localization and detection (SELD). Our developed system is based on multi-channel convolutional neural networks (CNNs), combined with data augmentation and ensembling. In particular, it follows a hierarchical approach that first determines adaptive thresholds for the multi-label sound event detection (SED) problem, based on a CNN operating on spectrograms over long-duration windows. It subsequently exploits the derived thresholds in an ensemble of CNNs operating on raw waveforms over shorter-duration sliding windows to provide the desired event segmentation and labeling. Finally, it employs a separate event localization CNN to yield direction-of-arrival (DOA) source estimates of the detected sound events. The system is evaluated on the microphone-array development dataset of the SELD Task. Compared to the corresponding baseline provided by the Challenge organizers, it achieves relative improvements of 12% in SED error, 2% in F-score, 36% in DOA error, and 3% in the combined SELD metric, but trails significantly in frame recall.

PDF

GCC-PHAT CROSS-CORRELATION AUDIO FEATURES FOR SIMULTANEOUS SOUND EVENT LOCALIZATION AND DETECTION (SELD) ON MULTIPLE ROOMS

Hector Cordourier Maruri, Paulo Lopez Meyer, Jonathan Huang, Juan Antonio Del Hoyo Ontiveros, Hong Lu
Intel Corporation

Abstract

In this work, we show a simultaneous sound event localization and detection (SELD) system with enhanced acoustic features, in which we propose using the well-known Generalized Cross-Correlation Phase Transform (GCC-PHAT) algorithm to augment the regular magnitude and phase Fourier spectral features at each frame. GCC has already been used for some time to calculate the Time Difference of Arrival (TDOA) of simultaneous audio signals in moderately reverberant environments using classic signal processing techniques, and can assist audio source localization in current deep learning machines. The neural network architecture we used is a Convolutional Recurrent Neural Network (CRNN), and it is tested using the sound database prepared for Task 3 of the 2019 DCASE Challenge. Our proposed system is able to achieve a direction-of-arrival error of 20.4 degrees, 86.4% frame recall, 87.1% F-score, and a 0.20 detection error rate on testing samples.
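
A minimal sketch of GCC-PHAT for one microphone pair, computed from STFTs and truncated to lags around zero as is typical when stacking it with spectral features (sizes are assumptions):

# Per-frame GCC-PHAT between two channels: whiten the cross-power spectrum,
# transform back to the lag domain, keep a few lags around zero.
import numpy as np

def gcc_phat(X1, X2, n_lags=64, eps=1e-8):
    """X1, X2: complex STFTs of shape (freq, time) from one microphone pair."""
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + eps                 # PHAT: keep phase only
    cc = np.fft.irfft(cross, axis=0)             # correlation over lags
    # keep n_lags coefficients on each side of zero lag (wrap-around order)
    return np.concatenate([cc[-n_lags:], cc[:n_lags + 1]], axis=0)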

PDF

SOUND EVENT LOCALIZATION AND DETECTION USING CRNN ON PAIRS OF MICROPHONES

Francois Grondin, James Glass, Iwona Sobieraj, Mark D. Plumbley
University of Surrey, Massachusetts Institute of Technology

Abstract

This paper proposes a method for sound event localization and detection from multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) that perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones of a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
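
A minimal sketch of the pairwise scheme, assuming a per-pair model predict_pair (a hypothetical name): run it on each of the six pairs of a 4-microphone array and average the per-pair SED predictions. The paper's DOA combination is more involved and is not shown:

# Enumerate the six microphone pairs of a 4-mic array and average the
# per-pair SED scores into one prediction.
from itertools import combinations
import numpy as np

def combine_pairs(stfts, predict_pair):
    """stfts: list of 4 per-microphone STFTs; predict_pair: (X1, X2) -> (frames, classes)."""
    preds = [predict_pair(stfts[i], stfts[j])
             for i, j in combinations(range(4), 2)]   # six pairs
    return np.mean(preds, axis=0)                     # averaged SED scores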

PDF

DATA AUGMENTATION AND PRIOR KNOWLEDGE-BASED REGULARIZATION FOR SOUND EVENT LOCALIZATION AND DETECTION

Jingyang Zhang, Wenhao Ding, Liang He
Tsinghua University

Abstract

The goal of sound event localization and detection (SELD) is detecting the presence of polyphonic sound events and identifying the sources of those events at the same time. In this paper, we propose an entire pipeline, which contains data augmentation, network prediction and post-processing stages, to deal with the SELD task. In the data augmentation part, we expand the official dataset with SpecAugment [1]. In the network prediction part, we train the event detection network and the localization network separately, and utilize the prediction of events to output localization predictions for active frames. In the post-processing part, we propose a prior knowledge-based regularization (PKR), which calculates the average value of the localization prediction over each event segment and replaces the prediction of this event with this average value. We theoretically prove that this technique reduces mean square error (MSE). After evaluating our system on the DCASE 2019 Challenge Task 3 Development Dataset, we achieve approximately a 59% reduction in SED error rate (ER) and a 13% reduction in direction-of-arrival (DOA) error over the baseline system (on the Ambisonic dataset).
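
The PKR post-processing can be sketched as follows for one event class: frame-wise DOA predictions within each contiguous active segment are replaced by their segment mean (array shapes are illustrative assumptions):

# Replace per-frame DOA predictions inside each contiguous active segment
# of one event class with the segment average.
import numpy as np

def pkr_smooth(doa_pred, sed_active):
    """doa_pred: (frames, 2) az/el for one class; sed_active: (frames,) bool."""
    doa = doa_pred.copy()
    frames = np.flatnonzero(sed_active)
    if frames.size == 0:
        return doa
    # split active frames into contiguous segments
    segments = np.split(frames, np.where(np.diff(frames) > 1)[0] + 1)
    for seg in segments:
        doa[seg] = doa[seg].mean(axis=0)         # replace with segment mean
    return doa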

PDF

SOUND EVENT LOCALIZATION AND DETECTION USING CONVOLUTIONAL RECURRENT NEURAL NETWORK

Wen Jie Jee, Rohith Mars, Pranay Pratik, Srikanth Nagisetty, Chong Soon Lim
Nanyang Technological University, Panasonic R&D Center Singapore

Abstract

This report details the methods used on the development set of DCASE 2019 Task 3 and the results of the investigations. The mixup data augmentation was used in an attempt to train the model for greater generalization. The kernel sizes of the pooling layers were also modified to more intuitive values. In addition, different kernel sizes for the convolutional layers were investigated and the results are reported. Our best model achieved an F-score of 91.9% and a DOA error of 4.588 degrees on the development set, which is an improvement of 10% and about 25 degrees, respectively, compared to the baseline system.

PDF

SOUND SOURCE DETECTION, LOCALIZATION AND CLASSIFICATION USING CONSECUTIVE ENSEMBLE OF CRNN MODELS

Slawomir Kapka, Mateusz Lewandowski
Samsung R&D Institute Poland

Abstract

In this technical report, we describe our method for DCASE 2019 Task 3: Sound Event Localization and Detection. We use four CRNN SELDnet-like single-output models which run in a consecutive manner to recover all possible information about occurring events. We decompose the SELD task into estimating the number of active sources, estimating the direction of arrival of a single source, estimating the direction of arrival of the second source when the direction of the first one is known, and a multi-label classification task. We use a custom consecutive ensemble to predict events' onset, offset, direction of arrival and class. The proposed approach is evaluated on the development set of TAU Spatial Sound Events 2019 - Ambisonic.

PDF

CROSS-TASK LEARNING FOR AUDIO TAGGING, SOUND EVENT DETECTION AND SPATIAL LOCALIZATION: DCASE 2019 BASELINE SYSTEMS

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang, Mark D. Plumbley
University of Surrey

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.

PDF

ARBORESCENT NEURAL NETWORK ARCHITECTURES FOR SOUND EVENT DETECTION AND LOCALIZATION

Daniel Krause, Konrad Kowalczyk
AGH University of Science and Technology

Abstract

This paper describes our contribution to the task of sound event localization and detection (SELD) using first-order Ambisonic signals at the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Our approach is based on arborescent convolutional recurrent neural networks with the aim of achieving joint localization and detection of overlapping acoustic events. The four submitted systems can be briefly summarized as follows. System 1 splits the neural network into two branches associated with the localization and detection tasks. This splitting is performed directly after the first convolutional layer. System 2 utilizes depthwise separable convolutions in order to exploit interchannel dependencies whilst substantially reducing the model complexity. System 3 exhibits a tree-like architecture in which relations between the channels for phase and magnitude are exploited independently in two branches, which are concatenated before the recurrent layers. Finally, System 4 is based on score fusion of the first two systems.

PDF

SPECTRUM COMBINATION AND CONVOLUTIONAL RECURRENT NEURAL NETWORKS FOR JOINT LOCALIZATION AND DETECTION OF SOUND EVENTS

Shuangran Leung, Yi Ren
DBSonics

Abstract

In this work, we combine existing Short-Time Fourier Transforms (STFT) of 4-channel array audio signals to create new features, and show that this augmented input improves the performance of the DCASE 2019 Task 3 baseline system [1] in both sound event detection (SED) and direction-of-arrival (DOA) estimation. Techniques like ensembling and fine-tuning with masked DOA output are also applied and shown to further improve both SED and DOA accuracy.

PDF

A REPORT ON SOUND EVENT LOCALIZATION AND DETECTION

Yifeng Lin, Zhisheng Wang
Esound corporation

Abstract

In this paper, we make a small change to the baseline of Sound Event Localization and Detection. We add Gaussian noise to the input data to see whether noise helps improve the neural network. Sound event detection is performed with a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied neural network with noise on the input data is seen to perform consistently on par with the original baseline with respect to the error rate metric.

PDF

SOUND EVENT LOCALIZATION AND DETECTION USING FOA DOMAIN SPATIAL AUGMENTATION

Luca Mazzon, Masahiro Yasuda, Yuma Koizumi, Noboru Harada
NTT Media Intelligence Laboratories

Abstract

This technical report describes the system participating in the DCASE 2019 Task 3: Sound Event Localization and Detection challenge. The system consists of a convolutional recurrent neural network (CRNN) reinforced by a ResNet structure. A two-stage training strategy with label masking is adopted. The main advancement of the proposed method is a data augmentation method based on rotation in the first-order Ambisonics (FOA) domain. The proposed spatial augmentation enables us to augment direction of arrival (DOA) labels without losing the physical relationships between steering vectors and observations. Evaluation results on the development dataset show that, even though the proposed method did not use any ensemble method in this experiment, (i) the proposed method outperformed a state-of-the-art system published before the submission deadline and (ii) the DOA error decreased significantly: 2.73 degrees better than the state-of-the-art system.
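
A minimal sketch of the idea for rotations about the vertical axis, assuming (W, X, Y, Z) channel ordering: the X/Y Ambisonic channels are rotated and the azimuth labels shifted by the same angle, preserving the physical relation between signals and labels:

# FOA-domain spatial augmentation: rotate the X/Y channels by theta about
# the z-axis and shift the azimuth labels accordingly.
import numpy as np

def rotate_foa_azimuth(foa, azimuths, theta_deg):
    """foa: (4, samples) as (W, X, Y, Z); azimuths: label azimuths in degrees."""
    t = np.radians(theta_deg)
    W, X, Y, Z = foa
    X_rot = np.cos(t) * X - np.sin(t) * Y
    Y_rot = np.sin(t) * X + np.cos(t) * Y
    rotated = np.stack([W, X_rot, Y_rot, Z])
    new_az = (np.asarray(azimuths) + theta_deg + 180) % 360 - 180  # wrap to +/-180
    return rotated, new_az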

PDF

DCASE 2019 TASK 3: A TWO-STEP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION

Thi Ngoc Tho Nguyen, Douglas L. Jones, Rishabh Ranjan, Sathish Jayabalan, Woon Seng Gan
Nanyang Technological University, University of Illinois Urbana-Champaign

Abstract

Sound event detection and sound event localization require different features from the audio input signals. While sound event detection mainly relies on time-frequency patterns to distinguish different event classes, sound event localization uses magnitude or phase differences between microphones to estimate source directions. Therefore, we propose a two-step system for sound event localization and detection. In the first step, we detect the sound events and estimate the directions-of-arrival separately. In the second step, we combine the results of the event detector and the direction-of-arrival estimator. The obtained results show a significant improvement over the baseline solution for sound event localization and detection in the DCASE 2019 Task 3 challenge. Using the development dataset with 4-fold cross-validation, the proposed system achieves an F1 score of 86.9% for sound event detection and an error of 5.15 degrees for direction-of-arrival estimation, while the baseline F1 score and error are 79.9% and 28.5 degrees, respectively.

PDF

REASSEMBLY LEARNING FOR SOUND EVENT LOCALIZATION AND DETECTION USING CRNN AND TRELLISNET

Sooyoung Park, Wootaek Lim, Sangwon Suh, Youngho Jeong
Electronics and Telecommunications Research Institute

Abstract

This technical report proposes a deep learning based approach, reassembly learning, for polyphonic sound event localization and detection. Sound event localization and detection is a joint task of two dependent sub-tasks: sound event detection and direction of arrival estimation. Joint learning suffers performance degradation compared to learning each sub-task separately. For this reason, we propose reassembly learning to design a single network that deals with the dependent sub-tasks together. Reassembly learning is a method to divide a multi-task problem into individual sub-tasks, to train each sub-task, and then to reassemble and fine-tune them into a single network. Experimental results show that reassembly learning performs well on sound event localization and detection. Besides, convolutional recurrent neural networks have been the state of the art in both sound classification and detection applications. In the DCASE 2019 challenge Task 3, we suggest a new architecture, the trellis network, based on temporal convolutional networks, which can replace convolutional recurrent neural networks. The trellis network shows a strong point in direction of arrival estimation and has the possibility of being applied to a variety of sound classification and detection applications.

PDF

A HYBRID PARAMETRIC-DEEP LEARNING APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION

Andres Perez-Lopez, Eduardo Fonseca, Xavier Serra
Centre Tecnologic de Catalunya, Universitat Pompeu Fabra

Abstract

This technical report describes and discusses the algorithm submitted to the Sound Event Localization and Detection Task of the DCASE 2019 Challenge. The proposed methodology combines a parametric spatial audio analysis approach for localization estimation, a simple heuristic for event segmentation, and a deep learning based monophonic event classifier. The evaluation of the proposed algorithm with the development dataset yields overall results slightly outperforming the baseline system. The main highlight is a reduction of the localization error of over 65%.

PDF

SOUND EVENTS DETECTION AND DIRECTION OF ARRIVAL ESTIMATION USING RESIDUAL NET AND RECURRENT NEURAL NETWORKS

Rishabh Ranjan, Sathish Jayabalan, Woon-Seng Gan
Nanyang Technological University

Abstract

This paper presents a deep learning approach for sound event detection and localization, which is also a part of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge 2019 Task 3. Deep residual nets originally used for image classification are adapted and combined with recurrent neural networks (RNN) to estimate the onset-offset of sound events, the sound event class, and their direction in a reverberant environment. Additionally, data augmentation and post-processing techniques are applied to generalize the system performance to unseen data. Using our best model on the validation dataset, sound event detection achieves an F1-score of 0.89 and an error rate of 0.18, whereas the sound source localization task achieves an angular error of 9 degrees and 0.90 frame recall.

PDF

POLYPHONIC SOUND EVENT DETECTION AND LOCALIZATION USING A TWO-STAGE STRATEGY

Pi LiHong, Zheng Xue, Chen Ping, Wang Zhe, Zhang Chun
Tsinghua University, Beijing Yiemed Medical Technology Co. Ltd, Beijing Sanping Technology Co.Ltd

Abstract

The joint training of SED and DOAE affects the performance of both. We adopt a two-stage polyphonic sound event detection and localization method. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. We select log mel spectrograms and GCC-PHAT as the input features; the GCC-PHAT feature, which contains phase difference information between any two microphones, improves the performance of DOAE.

PDF

Sound Event Detection and Localization Using ResNet RNN and Time-Delay DOA

Ee Leng Tan, Rishabh Ranjan, Sathish Jayabalan
Nanyang Technological University

Abstract

This paper presents a deep learning approach for sound event detection, with a time-delay direction-of-arrival (TDOA) algorithm for localization, which is also a part of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge 2019 Task 3. Deep residual nets originally used for image classification are adapted and combined with recurrent neural networks (RNN) to estimate the onset-offset of sound events and the sound event class. Data augmentation and post-processing techniques are applied to generalize the system performance to unseen data. The direction of sound events in a reverberant environment is estimated using the TDOA algorithm. Using our best model on the validation dataset, sound event detection achieves an F1-score of 0.84 and an error rate of 0.25, whereas the sound source localization task achieves an angular error of 16.56 degrees and 0.82 frame recall.

PDF

MULTI-BEAM AND MULTI-TASK LEARNING FOR JOINT SOUND EVENT DETECTION AND LOCALIZATION

Wei Xue, Tong Ying, Zhang Chao, Ding Guohong
JD.COM

Abstract

Joint sound event detection (SED) and sound source localization (SSL) is essential since it provides both the temporal and spatial information of the events that appear in an acoustic scene. Although the problem can be tackled by designing a system based on deep neural networks (DNNs) and fundamental spectral and spatial features, in this paper, we largely leverage conventional microphone array signal processing techniques to generate more comprehensive representations for both SED and SSL, and to perform post-processing such that stable SED and SSL results can be obtained. Specifically, features extracted from the signals of multiple beams are utilized, which orient towards different directions of arrival (DOAs) and are formed according to the estimated steering vector of each DOA. Smoothed cross-power spectra (CPS) are computed based on the signal presence probability (SPP), and are used both as input features of the DNNs and for estimating the steering vectors of different DOAs. A triple-task learning scheme is developed, which jointly exploits classification and regression based criteria for DOA estimation, and uses the classification based criterion as a regularization for the DNN. Experimental results demonstrate that the proposed method yields substantial improvements compared with the baseline method for Task 3 of the DCASE 2019 challenge.
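
The recursive smoothing of cross-power spectra can be sketched as a forgetting-factor average of per-frame outer products of the multichannel STFT vector; the smoothing constant is an assumption, and the paper additionally weights frames by the signal presence probability:

# Recursively smoothed cross-power spectra: per-frame outer products of the
# multichannel STFT vector, averaged with a forgetting factor alpha.
import numpy as np

def smoothed_cps(X, alpha=0.9):
    """X: complex STFT of shape (channels, freq, frames)."""
    C, F, T = X.shape
    phi = np.zeros((F, C, C), dtype=complex)
    out = np.empty((T, F, C, C), dtype=complex)
    for t in range(T):
        frame = X[:, :, t].T                      # (freq, channels)
        inst = frame[:, :, None] * np.conj(frame[:, None, :])  # outer products
        phi = alpha * phi + (1.0 - alpha) * inst
        out[t] = phi
    return out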

PDF

Sound event detection and localization based on CNN and LSTM

Zhao Lu
University of Electronic Science and Technology of China

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge is a topical challenge on audio feature classification; Task 3 is the localization and detection of sound events. In this field, learning methods based on deep neural networks are becoming more and more popular. Building on CNNs, the spectral and cross-correlation information of a multichannel microphone array is learned with a combined CNN and LSTM network, from which the detection of sound events and the estimation of the direction of arrival are obtained. Compared with the baseline method, this method improves the DOA estimation accuracy and the SED recognition ability on the DCASE 2019 dataset, using the PyTorch deep learning framework. The combination of CNN and LSTM works very well on this kind of time-series feature classification problem.

PDF