Task description
The sound event localization and detection (SELD) task consists of recognizing the temporal onset and offset of sound events, classifying the events into a known set of classes, and further localizing the events in space while they are active.
The focus of the current SELD task is to build systems that are robust to reverberation in different acoustic environments (rooms). The task provides two datasets, development and evaluation, recorded in five different acoustic environments. Of the two, only the development dataset provides reference labels. Participants are expected to build their systems using the provided four cross-validation splits of the development dataset, and finally test them on the unseen evaluation dataset.
A more detailed task description can be found on the task description page.
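The four metrics reported in the tables below are the detection error rate and F-score, plus the DOA error and frame recall for localization. As an illustration only, these can be folded into a single lower-is-better SELD score by averaging four error-like quantities, similar to the aggregation used in the baseline code for early stopping; this sketch is not the official challenge ranking procedure, which considers the individual metrics:

```python
import numpy as np

def seld_score(error_rate, f_score, doa_error, frame_recall):
    """Combine the four SELD metrics into one score in [0, 1]; lower is better.

    F-score and frame recall are given in percent, DOA error in degrees.
    Illustrative summary only, not the official challenge ranking.
    """
    return np.mean([
        error_rate,                 # detection error rate (0 = perfect)
        1.0 - f_score / 100.0,      # convert "higher is better" F to an error
        doa_error / 180.0,          # normalize angular error by the 180° maximum
        1.0 - frame_recall / 100.0, # miss rate on the number of active events
    ])

# Top-ranked system, Kapka_SRPOL_task3_2 (evaluation-set row below)
print(round(seld_score(0.08, 94.7, 3.7, 96.8), 4))  # → 0.0464
```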
Systems ranking
Performance of all submitted systems on the evaluation and development datasets.
Submission name | Technical report | Official rank | Error rate (eval) | F-score (eval, %) | DOA error (eval, °) | Frame recall (eval, %) | Error rate (dev) | F-score (dev, %) | DOA error (dev, °) | Frame recall (dev, %)
---|---|---|---|---|---|---|---|---|---|---
Kapka_SRPOL_task3_2 | Kapka2019 | 1 | 0.08 | 94.7 | 3.7 | 96.8 | 0.14 | 89.3 | 5.7 | 95.6 | |
Kapka_SRPOL_task3_4 | Kapka2019 | 2 | 0.08 | 94.7 | 3.7 | 96.8 | 0.15 | 88.7 | 5.7 | 95.6 | |
Kapka_SRPOL_task3_3 | Kapka2019 | 3 | 0.10 | 93.5 | 4.6 | 96.0 | 0.14 | 89.3 | 5.7 | 95.6 | |
Cao_Surrey_task3_4 | Cao2019 | 4 | 0.08 | 95.5 | 5.5 | 92.2 | 0.13 | 92.8 | 6.7 | 90.8 | |
Xue_JDAI_task3_1 | Xue2019 | 5 | 0.06 | 96.3 | 9.7 | 92.3 | 0.11 | 93.4 | 9.0 | 90.6 | |
He_THU_task3_2 | He2019 | 6 | 0.06 | 96.7 | 22.4 | 94.1 | 0.14 | 91.6 | 24.8 | 90.8 | |
He_THU_task3_1 | He2019 | 7 | 0.06 | 96.6 | 23.8 | 94.4 | 0.14 | 91.6 | 24.8 | 90.8 | |
Cao_Surrey_task3_1 | Cao2019 | 8 | 0.09 | 95.1 | 5.5 | 91.0 | 0.13 | 93.0 | 6.6 | 89.4 | |
Xue_JDAI_task3_4 | Xue2019 | 9 | 0.07 | 95.9 | 10.0 | 92.6 | 0.11 | 93.4 | 9.2 | 90.6 | |
Jee_NTU_task3_1 | Jee2019 | 10 | 0.12 | 93.7 | 4.2 | 91.8 | 0.15 | 91.9 | 4.6 | 89.6 | |
Xue_JDAI_task3_3 | Xue2019 | 11 | 0.08 | 95.6 | 10.1 | 92.2 | 0.11 | 93.4 | 9.3 | 90.6 | |
He_THU_task3_4 | He2019 | 12 | 0.06 | 96.3 | 26.1 | 93.4 | 0.15 | 91.3 | 25.8 | 89.5 | |
Cao_Surrey_task3_3 | Cao2019 | 13 | 0.10 | 94.9 | 5.8 | 90.4 | 0.13 | 93.0 | 6.8 | 89.4 | |
Xue_JDAI_task3_2 | Xue2019 | 14 | 0.09 | 95.2 | 9.2 | 91.5 | 0.14 | 92.2 | 9.6 | 90.2 | |
He_THU_task3_3 | He2019 | 15 | 0.08 | 95.6 | 24.4 | 92.9 | 0.15 | 91.3 | 25.8 | 89.5 | |
Cao_Surrey_task3_2 | Cao2019 | 16 | 0.12 | 93.8 | 5.5 | 89.0 | 0.13 | 93.0 | 7.2 | 89.4 | |
Nguyen_NTU_task3_3 | Nguyen2019 | 17 | 0.11 | 93.4 | 5.4 | 88.8 | 0.17 | 89.3 | 5.1 | 87.5 | |
MazzonYasuda_NTT_task3_3 | MazzonYasuda2019 | 18 | 0.10 | 94.2 | 6.4 | 88.8 | 0.03 | 98.2 | 0.6 | 93.2 | |
Chang_HYU_task3_3 | Chang2019 | 19 | 0.14 | 91.9 | 2.7 | 90.8 | 0.24 | 85.9 | 3.6 | 88.7 | |
Nguyen_NTU_task3_4 | Nguyen2019 | 20 | 0.12 | 93.2 | 5.5 | 88.7 | 0.17 | 89.3 | 5.1 | 87.5 | |
Chang_HYU_task3_4 | Chang2019 | 21 | 0.17 | 90.5 | 3.1 | 94.1 | 0.28 | 84.2 | 4.1 | 91.8 | |
MazzonYasuda_NTT_task3_2 | MazzonYasuda2019 | 22 | 0.13 | 93.0 | 5.0 | 88.2 | 0.04 | 98.2 | 2.2 | 92.9 | |
Chang_HYU_task3_2 | Chang2019 | 23 | 0.14 | 92.3 | 9.7 | 95.3 | 0.25 | 85.0 | 10.6 | 92.8 | |
Chang_HYU_task3_1 | Chang2019 | 24 | 0.13 | 92.8 | 8.4 | 91.4 | 0.26 | 85.2 | 9.6 | 88.4 | |
MazzonYasuda_NTT_task3_1 | MazzonYasuda2019 | 25 | 0.12 | 93.3 | 7.1 | 88.1 | 0.03 | 98.2 | 2.1 | 93.2 | |
Ranjan_NTU_task3_3 | Ranjan2019 | 26 | 0.16 | 90.9 | 5.7 | 91.8 | 0.27 | 83.4 | 13.5 | 88.0 | |
Ranjan_NTU_task3_4 | Ranjan2019 | 27 | 0.16 | 90.7 | 6.4 | 92.0 | 0.27 | 83.4 | 13.5 | 88.0 | |
Park_ETRI_task3_1 | Park2019 | 28 | 0.15 | 91.9 | 5.1 | 87.4 | 0.17 | 90.6 | 6.4 | 85.7 | |
Nguyen_NTU_task3_1 | Nguyen2019 | 29 | 0.15 | 91.1 | 5.6 | 89.8 | 0.21 | 86.9 | 5.1 | 88.9 | |
Leung_DBS_task3_2 | Leung2019 | 30 | 0.12 | 93.3 | 25.9 | 91.1 | 0.20 | 88.4 | 25.4 | 89.6 | |
Park_ETRI_task3_2 | Park2019 | 31 | 0.15 | 91.8 | 5.0 | 87.2 | 0.17 | 90.5 | 6.4 | 85.6 | |
Grondin_MIT_task3_1 | Grondin2019 | 32 | 0.14 | 92.2 | 7.4 | 87.5 | 0.21 | 87.2 | 6.8 | 84.7 | |
Leung_DBS_task3_1 | Leung2019 | 33 | 0.12 | 93.4 | 27.2 | 90.7 | 0.20 | 88.1 | 26.9 | 89.0 | |
Park_ETRI_task3_3 | Park2019 | 34 | 0.15 | 91.9 | 7.0 | 87.4 | 0.18 | 90.1 | 8.3 | 85.4 | |
MazzonYasuda_NTT_task3_4 | MazzonYasuda2019 | 35 | 0.14 | 92.0 | 7.3 | 87.1 | 0.04 | 98.1 | 5.7 | 93.0 | |
Park_ETRI_task3_4 | Park2019 | 36 | 0.15 | 91.8 | 7.0 | 87.2 | 0.18 | 89.9 | 8.3 | 85.2 | |
Ranjan_NTU_task3_1 | Ranjan2019 | 37 | 0.18 | 89.9 | 8.6 | 90.1 | 0.27 | 83.4 | 13.5 | 88.0 | |
Ranjan_NTU_task3_2 | Ranjan2019 | 38 | 0.22 | 86.8 | 7.8 | 90.0 | 0.27 | 83.4 | 13.5 | 88.0 | |
ZhaoLu_UESTC_task3_1 | ZhaoLu2019 | 39 | 0.18 | 89.3 | 6.8 | 84.3 | 0.20 | 87.5 | 8.0 | 83.4 | |
Rough_EMED_task3_2 | Rough2019 | 40 | 0.18 | 89.7 | 9.4 | 85.5 | 0.10 | 86.0 | 10.0 | 89.6 | |
Nguyen_NTU_task3_2 | Nguyen2019 | 41 | 0.17 | 89.7 | 8.0 | 77.3 | 0.21 | 86.9 | 5.1 | 88.9 | |
Jee_NTU_task3_2 | Jee2019 | 42 | 0.19 | 89.1 | 8.1 | 85.0 | 0.18 | 90.3 | 7.2 | 85.6 | |
Tan_NTU_task3_1 | Tan2019 | 43 | 0.17 | 89.8 | 15.4 | 84.4 | 0.20 | 87.7 | 17.2 | 83.9 | |
Lewandowski_SRPOL_task3_1 | Kapka2019 | 44 | 0.19 | 89.4 | 36.2 | 87.7 | 0.26 | 84.1 | 35.3 | 87.9 | |
Cordourier_IL_task3_2 | Cordourier2019 | 45 | 0.22 | 86.5 | 20.8 | 85.7 | 0.12 | 92.4 | 15.9 | 89.9 | |
Cordourier_IL_task3_1 | Cordourier2019 | 46 | 0.22 | 86.3 | 19.9 | 85.6 | 0.11 | 93.3 | 15.0 | 90.8 | |
Krause_AGH_task3_4 | Krause2019 | 47 | 0.22 | 87.4 | 31.0 | 87.0 | 0.19 | 88.5 | 46.9 | 88.6 | |
DCASE2019_FOA_baseline | Adavanne2019 | 48 | 0.28 | 85.4 | 24.6 | 85.7 | 0.34 | 79.9 | 28.5 | 85.4 | |
Perezlopez_UPF_task3_1 | Perezlopez2019 | 49 | 0.29 | 82.1 | 9.3 | 75.8 | 0.32 | 79.7 | 9.1 | 76.4 | |
Chytas_UTH_task3_1 | Chytas2019 | 50 | 0.29 | 82.4 | 18.6 | 75.6 | 0.31 | 81.2 | 19.8 | 75.3 | |
Anemueller_UOL_task3_3 | Anemueller2019 | 51 | 0.28 | 83.8 | 29.2 | 84.1 | 0.30 | 82.1 | 28.9 | 85.8 | |
Chytas_UTH_task3_2 | Chytas2019 | 52 | 0.29 | 82.3 | 18.7 | 75.7 | 0.31 | 81.2 | 19.8 | 75.3 | |
Krause_AGH_task3_2 | Krause2019 | 53 | 0.32 | 82.9 | 31.7 | 85.7 | 0.35 | 80.7 | 62.2 | 84.1 | |
Krause_AGH_task3_1 | Krause2019 | 54 | 0.30 | 83.0 | 32.5 | 85.3 | 0.33 | 80.9 | 32.5 | 85.4 | |
Anemueller_UOL_task3_1 | Anemueller2019 | 55 | 0.33 | 81.3 | 28.2 | 84.5 | 0.33 | 80.8 | 26.4 | 86.4 | |
Kong_SURREY_task3_1 | Kong2019 | 56 | 0.29 | 83.4 | 37.6 | 81.3 | 0.31 | 81.2 | 43.9 | 78.4 | |
Anemueller_UOL_task3_2 | Anemueller2019 | 57 | 0.36 | 79.8 | 25.0 | 84.1 | 0.33 | 80.8 | 26.4 | 86.4 | |
DCASE2019_MIC_baseline | Adavanne2019 | 58 | 0.30 | 83.2 | 38.1 | 83.4 | 0.35 | 80.1 | 30.8 | 84.0 | |
Lin_YYZN_task3_1 | Lin2019 | 59 | 1.03 | 2.6 | 21.9 | 31.6 | 1.02 | 2.5 | 15.4 | 30.7 | |
Krause_AGH_task3_3 | Krause2019 | 60 | 0.35 | 80.3 | 52.6 | 83.6 | 0.48 | 72.0 | 70.7 | 79.3 |
Teams ranking
Table including only the best-performing system per submitting team.
Submission name | Technical report | Best official system rank | Error rate (eval) | F-score (eval, %) | DOA error (eval, °) | Frame recall (eval, %) | Error rate (dev) | F-score (dev, %) | DOA error (dev, °) | Frame recall (dev, %)
---|---|---|---|---|---|---|---|---|---|---
Nguyen_NTU_task3_3 | Nguyen2019 | 17 | 0.11 | 93.4 | 5.4 | 88.8 | 0.17 | 89.3 | 5.1 | 87.5 | |
Jee_NTU_task3_1 | Jee2019 | 10 | 0.12 | 93.7 | 4.2 | 91.8 | 0.15 | 91.9 | 4.6 | 89.6 | |
Xue_JDAI_task3_1 | Xue2019 | 5 | 0.06 | 96.3 | 9.7 | 92.3 | 0.11 | 93.4 | 9.0 | 90.6 | |
He_THU_task3_2 | He2019 | 6 | 0.06 | 96.7 | 22.4 | 94.1 | 0.14 | 91.6 | 24.8 | 90.8 | |
Tan_NTU_task3_1 | Tan2019 | 43 | 0.17 | 89.8 | 15.4 | 84.4 | 0.20 | 87.7 | 17.2 | 83.9 | |
Cordourier_IL_task3_2 | Cordourier2019 | 45 | 0.22 | 86.5 | 20.8 | 85.7 | 0.12 | 92.4 | 15.9 | 89.9 | |
ZhaoLu_UESTC_task3_1 | ZhaoLu2019 | 39 | 0.18 | 89.3 | 6.8 | 84.3 | 0.20 | 87.5 | 8.0 | 83.4 | |
Park_ETRI_task3_1 | Park2019 | 28 | 0.15 | 91.9 | 5.1 | 87.4 | 0.17 | 90.6 | 6.4 | 85.7 | |
DCASE2019_FOA_baseline | Adavanne2019 | 48 | 0.28 | 85.4 | 24.6 | 85.7 | 0.34 | 79.9 | 28.5 | 85.4 | |
Perezlopez_UPF_task3_1 | Perezlopez2019 | 49 | 0.29 | 82.1 | 9.3 | 75.8 | 0.32 | 79.7 | 9.1 | 76.4 | |
Chang_HYU_task3_3 | Chang2019 | 19 | 0.14 | 91.9 | 2.7 | 90.8 | 0.24 | 85.9 | 3.6 | 88.7 | |
Cao_Surrey_task3_4 | Cao2019 | 4 | 0.08 | 95.5 | 5.5 | 92.2 | 0.13 | 92.8 | 6.7 | 90.8 | |
Rough_EMED_task3_2 | Rough2019 | 40 | 0.18 | 89.7 | 9.4 | 85.5 | 0.10 | 86.0 | 10.0 | 89.6 | |
Kapka_SRPOL_task3_2 | Kapka2019 | 1 | 0.08 | 94.7 | 3.7 | 96.8 | 0.14 | 89.3 | 5.7 | 95.6 | |
Anemueller_UOL_task3_3 | Anemueller2019 | 51 | 0.28 | 83.8 | 29.2 | 84.1 | 0.30 | 82.1 | 28.9 | 85.8 | |
Leung_DBS_task3_2 | Leung2019 | 30 | 0.12 | 93.3 | 25.9 | 91.1 | 0.20 | 88.4 | 25.4 | 89.6 | |
Krause_AGH_task3_4 | Krause2019 | 47 | 0.22 | 87.4 | 31.0 | 87.0 | 0.19 | 88.5 | 46.9 | 88.6 | |
Kong_SURREY_task3_1 | Kong2019 | 56 | 0.29 | 83.4 | 37.6 | 81.3 | 0.31 | 81.2 | 43.9 | 78.4 | |
Chytas_UTH_task3_1 | Chytas2019 | 50 | 0.29 | 82.4 | 18.6 | 75.6 | 0.31 | 81.2 | 19.8 | 75.3 | |
Ranjan_NTU_task3_3 | Ranjan2019 | 26 | 0.16 | 90.9 | 5.7 | 91.8 | 0.27 | 83.4 | 13.5 | 88.0 | |
Grondin_MIT_task3_1 | Grondin2019 | 32 | 0.14 | 92.2 | 7.4 | 87.5 | 0.21 | 87.2 | 6.8 | 84.7 | |
Lin_YYZN_task3_1 | Lin2019 | 59 | 1.03 | 2.6 | 21.9 | 31.6 | 1.02 | 2.5 | 15.4 | 30.7 | |
MazzonYasuda_NTT_task3_3 | MazzonYasuda2019 | 18 | 0.10 | 94.2 | 6.4 | 88.8 | 0.03 | 98.2 | 0.6 | 93.2 |
Acoustic environment-wise performance
Performance of submitted systems on different acoustic environments of the evaluation dataset.
Submission name | Technical report | Official rank | Error rate (loc. 1) | F-score (loc. 1, %) | DOA error (loc. 1, °) | Frame recall (loc. 1, %) | Error rate (loc. 2) | F-score (loc. 2, %) | DOA error (loc. 2, °) | Frame recall (loc. 2, %) | Error rate (loc. 3) | F-score (loc. 3, %) | DOA error (loc. 3, °) | Frame recall (loc. 3, %) | Error rate (loc. 4) | F-score (loc. 4, %) | DOA error (loc. 4, °) | Frame recall (loc. 4, %) | Error rate (loc. 5) | F-score (loc. 5, %) | DOA error (loc. 5, °) | Frame recall (loc. 5, %)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Kapka_SRPOL_task3_2 | Kapka2019 | 1 | 0.05 | 96.4 | 2.5 | 98.0 | 0.09 | 94.6 | 3.3 | 96.4 | 0.11 | 92.5 | 4.6 | 95.2 | 0.09 | 94.4 | 4.0 | 97.3 | 0.07 | 95.8 | 3.9 | 97.3 | |
Kapka_SRPOL_task3_4 | Kapka2019 | 2 | 0.05 | 96.6 | 2.5 | 98.0 | 0.09 | 94.3 | 3.3 | 96.4 | 0.12 | 91.8 | 4.6 | 95.2 | 0.08 | 95.0 | 4.0 | 97.3 | 0.07 | 95.8 | 4.0 | 97.3 | |
Kapka_SRPOL_task3_3 | Kapka2019 | 3 | 0.06 | 95.7 | 4.1 | 97.3 | 0.13 | 91.5 | 4.2 | 94.5 | 0.14 | 90.8 | 5.9 | 94.5 | 0.08 | 94.9 | 4.6 | 96.6 | 0.08 | 94.5 | 4.4 | 96.9 | |
Cao_Surrey_task3_4 | Cao2019 | 4 | 0.06 | 96.5 | 5.4 | 92.1 | 0.09 | 94.8 | 5.9 | 92.2 | 0.08 | 95.6 | 4.8 | 92.7 | 0.10 | 94.5 | 5.4 | 91.7 | 0.07 | 96.0 | 6.1 | 92.3 | |
Xue_JDAI_task3_1 | Xue2019 | 5 | 0.05 | 97.4 | 10.1 | 92.1 | 0.06 | 96.3 | 10.3 | 92.9 | 0.08 | 95.6 | 10.0 | 91.8 | 0.07 | 96.1 | 9.6 | 92.6 | 0.07 | 96.2 | 8.3 | 92.1 | |
He_THU_task3_2 | He2019 | 6 | 0.04 | 97.6 | 20.8 | 94.5 | 0.09 | 95.1 | 25.2 | 93.4 | 0.06 | 96.9 | 22.8 | 94.1 | 0.07 | 96.5 | 21.8 | 94.1 | 0.05 | 97.3 | 21.3 | 94.4 | |
He_THU_task3_1 | He2019 | 7 | 0.04 | 97.8 | 22.7 | 94.5 | 0.09 | 95.0 | 27.0 | 94.0 | 0.06 | 96.4 | 23.1 | 94.6 | 0.06 | 96.7 | 23.6 | 94.5 | 0.06 | 96.8 | 22.7 | 94.3 | |
Cao_Surrey_task3_1 | Cao2019 | 8 | 0.08 | 96.0 | 5.4 | 91.0 | 0.10 | 94.6 | 5.8 | 90.8 | 0.09 | 95.3 | 5.1 | 91.7 | 0.10 | 94.8 | 5.3 | 90.9 | 0.10 | 95.0 | 6.0 | 90.8 | |
Xue_JDAI_task3_4 | Xue2019 | 9 | 0.05 | 97.2 | 10.2 | 92.5 | 0.08 | 95.1 | 11.1 | 92.8 | 0.08 | 95.5 | 10.2 | 92.3 | 0.08 | 95.3 | 9.9 | 92.7 | 0.07 | 96.2 | 8.8 | 92.8 | |
Jee_NTU_task3_1 | Jee2019 | 10 | 0.10 | 94.6 | 4.1 | 91.6 | 0.13 | 92.9 | 4.3 | 91.8 | 0.12 | 93.5 | 4.5 | 91.9 | 0.13 | 93.0 | 3.8 | 91.6 | 0.11 | 94.4 | 4.3 | 91.9 | |
Xue_JDAI_task3_3 | Xue2019 | 11 | 0.06 | 97.1 | 10.1 | 92.0 | 0.10 | 94.3 | 11.3 | 92.0 | 0.08 | 95.3 | 10.2 | 92.0 | 0.08 | 95.1 | 10.0 | 92.6 | 0.07 | 96.1 | 8.9 | 92.6 | |
He_THU_task3_4 | He2019 | 12 | 0.05 | 97.5 | 24.0 | 93.5 | 0.08 | 95.6 | 29.2 | 93.3 | 0.06 | 96.3 | 26.4 | 93.8 | 0.07 | 96.0 | 25.9 | 93.3 | 0.07 | 96.3 | 25.4 | 93.1 | |
Cao_Surrey_task3_3 | Cao2019 | 13 | 0.08 | 95.7 | 5.5 | 89.9 | 0.11 | 94.1 | 6.4 | 90.4 | 0.09 | 95.3 | 5.1 | 90.9 | 0.10 | 94.5 | 5.8 | 90.1 | 0.10 | 94.7 | 6.3 | 90.5 | |
Xue_JDAI_task3_2 | Xue2019 | 14 | 0.07 | 96.3 | 9.0 | 91.6 | 0.10 | 94.4 | 10.7 | 91.4 | 0.09 | 95.0 | 9.4 | 91.7 | 0.10 | 94.4 | 8.9 | 91.3 | 0.08 | 95.6 | 8.1 | 91.5 | |
He_THU_task3_3 | He2019 | 15 | 0.07 | 96.4 | 24.4 | 92.9 | 0.09 | 95.0 | 26.5 | 93.4 | 0.09 | 95.0 | 24.2 | 92.5 | 0.09 | 95.3 | 23.4 | 92.6 | 0.07 | 96.2 | 23.9 | 93.0 | |
Cao_Surrey_task3_2 | Cao2019 | 16 | 0.08 | 95.4 | 5.3 | 89.5 | 0.17 | 91.1 | 5.7 | 86.9 | 0.10 | 94.6 | 5.1 | 90.2 | 0.12 | 93.5 | 5.5 | 88.9 | 0.11 | 94.3 | 5.9 | 89.6 | |
Nguyen_NTU_task3_3 | Nguyen2019 | 17 | 0.09 | 94.9 | 4.0 | 88.9 | 0.12 | 92.7 | 4.7 | 88.9 | 0.12 | 92.8 | 7.0 | 88.7 | 0.11 | 93.6 | 4.8 | 89.4 | 0.13 | 92.8 | 6.5 | 87.9 | |
MazzonYasuda_NTT_task3_3 | MazzonYasuda2019 | 18 | 0.09 | 95.1 | 6.2 | 88.6 | 0.10 | 94.2 | 6.4 | 89.5 | 0.11 | 93.8 | 6.5 | 89.1 | 0.12 | 93.2 | 5.9 | 88.2 | 0.10 | 94.5 | 6.9 | 88.7 | |
Chang_HYU_task3_3 | Chang2019 | 19 | 0.12 | 93.2 | 2.5 | 90.3 | 0.15 | 91.9 | 2.5 | 91.3 | 0.14 | 91.2 | 2.9 | 91.4 | 0.15 | 91.4 | 2.7 | 90.5 | 0.15 | 91.6 | 2.7 | 90.4 | |
Nguyen_NTU_task3_4 | Nguyen2019 | 20 | 0.10 | 94.9 | 4.0 | 89.0 | 0.12 | 92.3 | 4.7 | 88.9 | 0.12 | 93.0 | 7.1 | 88.5 | 0.11 | 93.5 | 4.8 | 89.4 | 0.13 | 92.4 | 6.5 | 87.8 | |
Chang_HYU_task3_4 | Chang2019 | 21 | 0.14 | 92.0 | 2.9 | 94.1 | 0.21 | 89.0 | 2.8 | 93.6 | 0.16 | 90.3 | 3.1 | 94.1 | 0.17 | 90.4 | 3.3 | 94.5 | 0.17 | 90.6 | 3.3 | 94.3 | |
MazzonYasuda_NTT_task3_2 | MazzonYasuda2019 | 22 | 0.12 | 93.9 | 4.9 | 87.8 | 0.13 | 93.0 | 5.1 | 89.0 | 0.13 | 92.7 | 5.0 | 88.6 | 0.14 | 92.2 | 4.7 | 87.8 | 0.13 | 93.2 | 5.5 | 87.6 | |
Chang_HYU_task3_2 | Chang2019 | 23 | 0.12 | 93.4 | 9.5 | 95.1 | 0.17 | 91.2 | 9.6 | 95.0 | 0.14 | 92.1 | 10.0 | 94.9 | 0.13 | 92.5 | 9.7 | 96.0 | 0.14 | 92.3 | 9.9 | 95.3 | |
Chang_HYU_task3_1 | Chang2019 | 24 | 0.13 | 93.0 | 8.2 | 90.9 | 0.14 | 92.3 | 8.4 | 91.7 | 0.13 | 92.5 | 9.1 | 91.2 | 0.12 | 93.0 | 7.8 | 91.7 | 0.12 | 93.4 | 8.5 | 91.5 | |
MazzonYasuda_NTT_task3_1 | MazzonYasuda2019 | 25 | 0.10 | 94.7 | 6.8 | 87.9 | 0.13 | 93.2 | 7.2 | 88.7 | 0.13 | 92.7 | 7.2 | 88.7 | 0.14 | 92.5 | 6.6 | 87.6 | 0.12 | 93.6 | 7.5 | 87.8 | |
Ranjan_NTU_task3_3 | Ranjan2019 | 26 | 0.13 | 93.0 | 6.3 | 92.4 | 0.21 | 88.7 | 5.2 | 91.0 | 0.16 | 90.9 | 5.7 | 92.1 | 0.17 | 90.2 | 5.3 | 91.4 | 0.15 | 91.7 | 5.9 | 92.1 | |
Ranjan_NTU_task3_4 | Ranjan2019 | 27 | 0.12 | 93.2 | 6.8 | 93.0 | 0.21 | 88.5 | 5.7 | 90.8 | 0.15 | 90.8 | 6.7 | 92.4 | 0.17 | 89.7 | 6.5 | 91.9 | 0.15 | 91.0 | 6.6 | 92.1 | |
Park_ETRI_task3_1 | Park2019 | 28 | 0.13 | 93.4 | 4.5 | 87.1 | 0.15 | 91.7 | 5.1 | 88.3 | 0.14 | 92.1 | 5.3 | 87.7 | 0.17 | 90.4 | 5.0 | 86.9 | 0.15 | 91.8 | 5.4 | 86.8 | |
Nguyen_NTU_task3_1 | Nguyen2019 | 29 | 0.12 | 92.9 | 4.0 | 90.7 | 0.17 | 89.5 | 4.9 | 88.0 | 0.15 | 90.7 | 7.2 | 90.1 | 0.15 | 90.9 | 5.1 | 90.3 | 0.14 | 91.6 | 6.7 | 89.7 | |
Leung_DBS_task3_2 | Leung2019 | 30 | 0.10 | 94.9 | 25.2 | 91.1 | 0.14 | 92.1 | 27.7 | 91.1 | 0.13 | 92.3 | 26.9 | 90.7 | 0.12 | 93.3 | 24.7 | 91.0 | 0.11 | 94.1 | 25.2 | 91.6 | |
Park_ETRI_task3_2 | Park2019 | 31 | 0.13 | 93.2 | 4.5 | 86.9 | 0.15 | 91.7 | 5.0 | 88.3 | 0.15 | 91.8 | 5.1 | 87.3 | 0.16 | 90.7 | 5.1 | 87.2 | 0.15 | 91.7 | 5.2 | 86.5 | |
Grondin_MIT_task3_1 | Grondin2019 | 32 | 0.12 | 93.7 | 6.8 | 88.6 | 0.17 | 90.3 | 7.0 | 86.0 | 0.14 | 92.2 | 7.7 | 87.7 | 0.15 | 91.5 | 7.8 | 87.1 | 0.12 | 93.2 | 7.7 | 88.2 | |
Leung_DBS_task3_1 | Leung2019 | 33 | 0.10 | 94.9 | 27.3 | 90.6 | 0.13 | 92.6 | 28.6 | 91.2 | 0.13 | 92.1 | 28.3 | 90.2 | 0.13 | 93.0 | 25.5 | 90.3 | 0.10 | 94.2 | 26.4 | 91.2 | |
Park_ETRI_task3_3 | Park2019 | 34 | 0.13 | 93.4 | 7.1 | 87.1 | 0.15 | 91.7 | 7.7 | 88.3 | 0.14 | 92.1 | 6.7 | 87.7 | 0.17 | 90.4 | 6.8 | 86.9 | 0.15 | 91.8 | 7.1 | 86.8 | |
MazzonYasuda_NTT_task3_4 | MazzonYasuda2019 | 35 | 0.12 | 93.7 | 7.6 | 86.9 | 0.16 | 91.1 | 6.7 | 87.2 | 0.15 | 91.7 | 7.5 | 87.1 | 0.16 | 91.1 | 7.0 | 86.5 | 0.14 | 92.6 | 7.3 | 87.5 | |
Park_ETRI_task3_4 | Park2019 | 36 | 0.13 | 93.2 | 7.0 | 86.9 | 0.15 | 91.7 | 7.8 | 88.3 | 0.15 | 91.8 | 6.7 | 87.3 | 0.16 | 90.7 | 6.9 | 87.2 | 0.15 | 91.7 | 7.0 | 86.5 | |
Ranjan_NTU_task3_1 | Ranjan2019 | 37 | 0.13 | 92.5 | 8.9 | 91.0 | 0.24 | 87.4 | 8.2 | 87.7 | 0.15 | 91.3 | 8.6 | 91.2 | 0.19 | 88.6 | 8.3 | 90.2 | 0.19 | 89.5 | 9.0 | 90.5 | |
Ranjan_NTU_task3_2 | Ranjan2019 | 38 | 0.18 | 89.4 | 8.0 | 90.8 | 0.29 | 84.3 | 7.0 | 88.5 | 0.19 | 88.2 | 8.0 | 90.5 | 0.24 | 85.0 | 8.1 | 89.7 | 0.23 | 87.0 | 7.7 | 90.4 | |
ZhaoLu_UESTC_task3_1 | ZhaoLu2019 | 39 | 0.16 | 90.8 | 6.8 | 84.7 | 0.18 | 89.6 | 7.1 | 84.8 | 0.18 | 88.8 | 7.0 | 84.8 | 0.20 | 88.0 | 6.4 | 83.4 | 0.18 | 89.2 | 6.6 | 83.6 | |
Rough_EMED_task3_2 | Rough2019 | 40 | 0.16 | 90.7 | 10.2 | 85.1 | 0.18 | 90.2 | 9.5 | 86.7 | 0.18 | 89.2 | 9.2 | 85.5 | 0.21 | 87.8 | 8.9 | 84.7 | 0.17 | 90.5 | 9.1 | 85.3 | |
Nguyen_NTU_task3_2 | Nguyen2019 | 41 | 0.14 | 91.7 | 6.3 | 78.0 | 0.19 | 88.2 | 5.6 | 77.4 | 0.18 | 89.4 | 10.2 | 77.3 | 0.18 | 89.1 | 8.0 | 76.5 | 0.17 | 90.1 | 9.7 | 77.5 | |
Jee_NTU_task3_2 | Jee2019 | 42 | 0.16 | 90.5 | 8.0 | 84.8 | 0.18 | 89.7 | 7.9 | 86.9 | 0.18 | 89.4 | 8.7 | 83.9 | 0.22 | 87.0 | 7.5 | 84.6 | 0.18 | 89.2 | 8.2 | 84.7 | |
Tan_NTU_task3_1 | Tan2019 | 43 | 0.14 | 92.1 | 14.5 | 84.5 | 0.22 | 87.8 | 15.1 | 84.2 | 0.17 | 89.3 | 15.1 | 83.9 | 0.18 | 89.4 | 16.1 | 84.0 | 0.16 | 90.3 | 16.3 | 85.1 | |
Lewandowski_SRPOL_task3_1 | Kapka2019 | 44 | 0.14 | 92.4 | 36.1 | 89.5 | 0.32 | 83.5 | 32.9 | 79.4 | 0.18 | 89.5 | 37.5 | 88.7 | 0.17 | 90.1 | 36.8 | 89.5 | 0.15 | 91.2 | 37.8 | 91.2 | |
Cordourier_IL_task3_2 | Cordourier2019 | 45 | 0.21 | 86.9 | 20.9 | 86.3 | 0.22 | 86.6 | 20.0 | 85.9 | 0.21 | 87.0 | 21.0 | 84.9 | 0.24 | 84.6 | 20.3 | 84.7 | 0.20 | 87.6 | 21.5 | 86.5 | |
Cordourier_IL_task3_1 | Cordourier2019 | 46 | 0.19 | 88.6 | 19.5 | 86.5 | 0.26 | 85.0 | 19.7 | 85.6 | 0.22 | 86.2 | 20.3 | 85.6 | 0.24 | 85.3 | 19.8 | 84.4 | 0.22 | 86.5 | 20.1 | 85.6 | |
Krause_AGH_task3_4 | Krause2019 | 47 | 0.19 | 89.6 | 30.1 | 88.4 | 0.31 | 83.4 | 31.8 | 82.8 | 0.22 | 87.2 | 31.8 | 87.2 | 0.23 | 87.0 | 30.3 | 87.1 | 0.18 | 89.7 | 31.2 | 89.7 | |
DCASE2019_FOA_baseline | Adavanne2019 | 48 | 0.24 | 87.5 | 24.2 | 87.6 | 0.41 | 80.3 | 23.3 | 79.2 | 0.26 | 85.3 | 25.4 | 85.8 | 0.27 | 86.1 | 24.0 | 87.1 | 0.24 | 87.4 | 26.0 | 88.9 | |
Perezlopez_UPF_task3_1 | Perezlopez2019 | 49 | 0.20 | 87.9 | 7.4 | 80.2 | 0.51 | 71.0 | 9.8 | 61.2 | 0.25 | 84.6 | 10.5 | 79.9 | 0.27 | 83.5 | 9.6 | 79.1 | 0.26 | 84.1 | 9.2 | 78.4 | |
Chytas_UTH_task3_1 | Chytas2019 | 50 | 0.29 | 82.3 | 17.8 | 74.3 | 0.29 | 83.2 | 18.7 | 78.2 | 0.28 | 82.4 | 18.0 | 74.5 | 0.29 | 82.3 | 18.7 | 76.1 | 0.30 | 81.9 | 19.6 | 75.1 | |
Anemueller_UOL_task3_3 | Anemueller2019 | 51 | 0.23 | 85.7 | 30.3 | 86.3 | 0.41 | 79.2 | 27.4 | 76.9 | 0.28 | 82.4 | 28.8 | 83.5 | 0.26 | 85.2 | 28.4 | 85.8 | 0.22 | 86.5 | 31.1 | 87.9 | |
Chytas_UTH_task3_2 | Chytas2019 | 52 | 0.29 | 82.3 | 18.5 | 74.1 | 0.28 | 83.5 | 18.6 | 78.4 | 0.29 | 81.9 | 18.2 | 74.7 | 0.29 | 82.2 | 18.4 | 76.1 | 0.30 | 81.9 | 19.9 | 75.0 | |
Krause_AGH_task3_2 | Krause2019 | 53 | 0.26 | 85.9 | 31.0 | 87.3 | 0.43 | 77.9 | 32.5 | 80.7 | 0.29 | 83.7 | 31.9 | 86.4 | 0.33 | 82.0 | 31.0 | 85.8 | 0.29 | 85.2 | 32.3 | 88.1 | |
Krause_AGH_task3_1 | Krause2019 | 54 | 0.27 | 85.0 | 31.3 | 86.4 | 0.43 | 78.3 | 32.3 | 81.3 | 0.28 | 83.4 | 33.4 | 85.7 | 0.31 | 81.7 | 32.4 | 85.4 | 0.24 | 86.2 | 33.0 | 87.9 | |
Anemueller_UOL_task3_1 | Anemueller2019 | 55 | 0.26 | 84.8 | 27.3 | 86.0 | 0.45 | 76.5 | 28.1 | 79.4 | 0.32 | 81.5 | 28.9 | 85.8 | 0.37 | 78.8 | 28.0 | 84.5 | 0.28 | 85.0 | 28.8 | 87.0 | |
Kong_SURREY_task3_1 | Kong2019 | 56 | 0.26 | 85.2 | 30.5 | 81.0 | 0.31 | 83.0 | 43.4 | 82.4 | 0.29 | 83.0 | 34.3 | 80.8 | 0.30 | 83.0 | 39.3 | 81.5 | 0.29 | 82.6 | 40.0 | 80.9 | |
Anemueller_UOL_task3_2 | Anemueller2019 | 57 | 0.32 | 82.1 | 24.5 | 84.6 | 0.43 | 76.5 | 25.2 | 81.7 | 0.33 | 80.4 | 24.9 | 85.0 | 0.37 | 79.1 | 24.7 | 83.9 | 0.34 | 80.6 | 25.5 | 85.2 | |
DCASE2019_MIC_baseline | Adavanne2019 | 58 | 0.27 | 84.3 | 37.5 | 81.5 | 0.33 | 82.0 | 39.0 | 85.1 | 0.29 | 83.5 | 37.4 | 82.7 | 0.31 | 82.2 | 36.4 | 83.3 | 0.28 | 84.1 | 40.1 | 84.3 | |
Lin_YYZN_task3_1 | Lin2019 | 59 | 1.04 | 4.2 | 4.9 | 27.9 | 1.06 | 3.5 | 39.2 | 37.8 | 1.00 | 0.0 | 35.2 | 31.5 | 1.03 | 1.7 | 57.0 | 32.4 | 1.03 | 3.4 | 9.3 | 28.7 | |
Krause_AGH_task3_3 | Krause2019 | 60 | 0.27 | 84.7 | 54.2 | 85.7 | 0.50 | 74.1 | 49.1 | 79.3 | 0.32 | 80.8 | 54.6 | 83.2 | 0.37 | 79.2 | 50.5 | 84.2 | 0.32 | 82.7 | 54.7 | 85.8 |
System characteristics
Summary of the characteristics of the submitted systems.
Rank | Submission name | Technical report | Classifier | Classifier params | Audio format | Acoustic feature | Data augmentation | Sampling rate
---|---|---|---|---|---|---|---|---
1 | Kapka_SRPOL_task3_2 | Kapka2019 | CRNN | 2651634 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
2 | Kapka_SRPOL_task3_4 | Kapka2019 | CRNN | 2651634 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
3 | Kapka_SRPOL_task3_3 | Kapka2019 | CRNN | 2651634 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
4 | Cao_Surrey_task3_4 | Cao2019 | CRNN9, ensemble | 6976673 | both | log mel and intensity vector | | 48kHz |
5 | Xue_JDAI_task3_1 | Xue2019 | CRNN, ensemble | 2641263 | Microphone Array | Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra | speed perturbation | 48kHz
6 | He_THU_task3_2 | He2019 | CRNN | 474273 | Ambisonic | phase and magnitude spectrogram, Mel-spectrogram | time and frequency masking (SpecAugment) | 48kHz |
7 | He_THU_task3_1 | He2019 | CRNN | 474273 | Ambisonic | phase and magnitude spectrogram, Mel-spectrogram | time and frequency masking (SpecAugment) | 48kHz |
8 | Cao_Surrey_task3_1 | Cao2019 | CRNN9, ensemble | 6965153 | Ambisonic | log mel and intensity vector | | 48kHz |
9 | Xue_JDAI_task3_4 | Xue2019 | CRNN, ensemble | 2641263 | Microphone Array | Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra | speed perturbation | 48kHz
10 | Jee_NTU_task3_1 | Jee2019 | CRNN | 10209057 | Microphone Array | log-mel spectrogram and GCC-PHAT | mixup | 34 kHz |
11 | Xue_JDAI_task3_3 | Xue2019 | CRNN, ensemble | 2641263 | Microphone Array | Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra | speed perturbation | 48kHz
12 | He_THU_task3_4 | He2019 | CRNN | 474273 | Microphone Array | phase and magnitude spectrogram, Mel-spectrogram | time and frequency masking (SpecAugment) | 48kHz |
13 | Cao_Surrey_task3_3 | Cao2019 | CRNN11, ensemble | 7071969 | both | log mel and intensity vector | | 48kHz |
14 | Xue_JDAI_task3_2 | Xue2019 | CRNN, ensemble | 2641263 | Microphone Array | Log-Mel, Constant Q-transform, raw phase spectrogram, smoothed cross-power angular spectra | speed perturbation | 48kHz
15 | He_THU_task3_3 | He2019 | CRNN | 474273 | Microphone Array | phase and magnitude spectrogram, Mel-spectrogram | time and frequency masking (SpecAugment) | 48kHz |
16 | Cao_Surrey_task3_2 | Cao2019 | CRNN11, ensemble | 7071969 | both | log mel and intensity vector | | 48kHz |
17 | Nguyen_NTU_task3_3 | Nguyen2019 | CRNN, ensemble | 5620844 | Ambisonic | log-mel energies, magnitude spectrogram | | 48kHz |
18 | MazzonYasuda_NTT_task3_3 | MazzonYasuda2019 | CRNN, ResNet, ensemble | 134896144 | both | log-mel energies, GCC | FOA domain spatial augmentation | 32kHz |
19 | Chang_HYU_task3_3 | Chang2019 | CRNN (for SAD), CRNN (for SED), CNN (for SEL) | 280876424 | Microphone Array | Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) | pitch shifting, audio file mixing | 48kHz |
20 | Nguyen_NTU_task3_4 | Nguyen2019 | CRNN, ensemble | 4382721 | Ambisonic | log-mel energies, magnitude spectrogram | | 48kHz |
21 | Chang_HYU_task3_4 | Chang2019 | CRNN (for SAD), CRNN (for SED), CNN (for SEL) | 280941952 | Microphone Array | Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) | pitch shifting, audio file mixing | 48kHz |
22 | MazzonYasuda_NTT_task3_2 | MazzonYasuda2019 | CRNN, ResNet, ensemble | 539584576 | both | log-mel energies, GCC | FOA domain spatial augmentation | 32kHz |
23 | Chang_HYU_task3_2 | Chang2019 | CRNN (for SAD), CRNN (for SED), CNN (for SEL) | 280941952 | Ambisonic | Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) | pitch shifting, audio file mixing | 48kHz |
24 | Chang_HYU_task3_1 | Chang2019 | CRNN (for SAD), CRNN (for SED), CNN (for SEL) | 280876424 | Ambisonic | Multi-resolution Cochleagram (for SAD), log-mel spectrogram (for SED), generalized cross-correlation phase transform (for SEL) | pitch shifting, audio file mixing | 48kHz |
25 | MazzonYasuda_NTT_task3_1 | MazzonYasuda2019 | CRNN, ResNet, ensemble | 539584576 | both | log-mel energies, GCC | FOA domain spatial augmentation | 32kHz |
26 | Ranjan_NTU_task3_3 | Ranjan2019 | Ensemble of 4 ResNet RNN models | 12315976 | Microphone Array | phase and log magnitude melspectrogram | time shifting | 48kHz |
27 | Ranjan_NTU_task3_4 | Ranjan2019 | ResNet RNN trained using all 4 splits | 3078994 | Microphone Array | phase and log magnitude melspectrogram | time shifting | 48kHz |
28 | Park_ETRI_task3_1 | Park2019 | CRNN, TrellisNet | 23116833 | both | log mel-band energy and mel-band intensity vector | | 32kHz |
29 | Nguyen_NTU_task3_1 | Nguyen2019 | CRNN | 1008235 | Ambisonic | log-mel energies, magnitude spectrogram | | 48kHz |
30 | Leung_DBS_task3_2 | Leung2019 | CRNN, ensemble | 3193745 | Ambisonic | complex spectrograms, phase and log magnitude spectrogram | | 48kHz |
31 | Park_ETRI_task3_2 | Park2019 | CRNN, TrellisNet | 23116833 | both | log mel-band energy and mel-band intensity vector | | 32kHz |
32 | Grondin_MIT_task3_1 | Grondin2019 | CRNN, ensemble | 18191014 | Microphone Array | phase and magnitude spectrogram, GCC, TDOA | | 48kHz |
33 | Leung_DBS_task3_1 | Leung2019 | CRNN, ensemble | 1891635 | Ambisonic | complex spectrograms, phase and log magnitude spectrogram | | 48kHz |
34 | Park_ETRI_task3_3 | Park2019 | CRNN | 21147681 | both | log mel-band energy and mel-band intensity vector | | 32kHz |
35 | MazzonYasuda_NTT_task3_4 | MazzonYasuda2019 | CRNN, ResNet, ensemble | 16862018 | both | log-mel energies, GCC | FOA domain spatial augmentation | 32kHz |
36 | Park_ETRI_task3_4 | Park2019 | CRNN | 21147681 | both | log mel-band energy and mel-band intensity vector | | 32kHz |
37 | Ranjan_NTU_task3_1 | Ranjan2019 | ResNet_RNN | 3076690 | Microphone Array | phase and log magnitude melspectrogram | time shifting | 48kHz |
38 | Ranjan_NTU_task3_2 | Ranjan2019 | ResNet_RNN | 3078994 | Microphone Array | phase and log magnitude melspectrogram | time shifting | 48kHz |
39 | ZhaoLu_UESTC_task3_1 | ZhaoLu2019 | CNN+LSTM | 7862177 | Microphone Array | LogMel spectrogram and GCC | | 48kHz |
40 | Rough_EMED_task3_2 | Rough2019 | CRNN | 116118 | Microphone Array | phase and magnitude spectrogram | | 48kHz |
41 | Nguyen_NTU_task3_2 | Nguyen2019 | CRNN | 1008235 | Ambisonic | log-mel energies, magnitude spectrogram | | 48kHz |
42 | Jee_NTU_task3_2 | Jee2019 | CRNN | 10209057 | Microphone Array | log-mel spectrogram and GCC-PHAT | mixup | 34 kHz |
43 | Tan_NTU_task3_1 | Tan2019 | ResNet RNN + TDOA | 116118 | Microphone Array | log mel spectrogram, TDOA | frame shifting | 48kHz |
44 | Lewandowski_SRPOL_task3_1 | Kapka2019 | CRNN | 508574 | Ambisonic | phase and magnitude spectrogram | time / frequency masking, time warping (specAugment) | 48kHz |
45 | Cordourier_IL_task3_2 | Cordourier2019 | Four CRNNs in ensemble | 731280 | Microphone Array | phase, magnitude spectrogram and GCC | | 48kHz |
46 | Cordourier_IL_task3_1 | Cordourier2019 | CRNN | 182820 | Microphone Array | phase, magnitude spectrogram and GCC | | 48kHz |
47 | Krause_AGH_task3_4 | Krause2019 | CRNN, ensemble | 615626 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
48 | DCASE2019_FOA_baseline | Adavanne2019 | CRNN | | Ambisonic | phase and magnitude spectrogram | | 48kHz |
49 | Perezlopez_UPF_task3_1 | Perezlopez2019 | CRNN | 150000 | Ambisonic | DOA, diffuseness, DOA variance, energy density | mixup | 48kHz |
50 | Chytas_UTH_task3_1 | Chytas2019 | CNN, ensemble | 100420239 | Microphone Array | raw audio data, spectrogram | channel permutations, segment addition | 16kHz |
51 | Anemueller_UOL_task3_3 | Anemueller2019 | CRNN | 750137 | Ambisonic | group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands | | 48kHz |
52 | Chytas_UTH_task3_2 | Chytas2019 | CNN, ensemble | 100420239 | Microphone Array | raw audio data, spectrogram | channel permutations, segment addition | 16kHz |
53 | Krause_AGH_task3_2 | Krause2019 | CRNN | 282089 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
54 | Krause_AGH_task3_1 | Krause2019 | CRNN | 333537 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
55 | Anemueller_UOL_task3_1 | Anemueller2019 | CRNN | 552761 | Ambisonic | group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands | | 48kHz |
56 | Kong_SURREY_task3_1 | Kong2019 | CNN | 4686144 | Ambisonic | magnitude spectrogram | | 32kHz |
57 | Anemueller_UOL_task3_2 | Anemueller2019 | CRNN | 552761 | Ambisonic | group-delay and magnitude spectrogram, each in Mel-bands and FFT-bands | | 48kHz |
58 | DCASE2019_MIC_baseline | Adavanne2019 | CRNN | | Microphone Array | phase and magnitude spectrogram | | 48kHz |
59 | Lin_YYZN_task3_1 | Lin2019 | CRNN | 613537 | Ambisonic | phase and magnitude spectrogram | | 44.1kHz |
60 | Krause_AGH_task3_3 | Krause2019 | CRNN | 429857 | Ambisonic | phase and magnitude spectrogram | | 48kHz |
Technical reports
A MULTI-ROOM REVERBERANT DATASET FOR SOUND EVENT LOCALIZATION AND DETECTION
Sharath Adavanne, Archontis Politis, Tuomas Virtanen
Tampere University, Finland
Abstract
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset is provided in which each sound event is associated with a spatial coordinate represented using azimuth and elevation angles. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.
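The baseline's input representation, per-channel magnitude and phase spectrograms (as listed in the system characteristics table), can be sketched as follows. The FFT size, hop length, and function name are illustrative assumptions, not the published baseline configuration:

```python
import numpy as np

def stft_features(audio, n_fft=2048, hop=960):
    """Per-channel magnitude and phase spectrogram features.

    A hedged sketch of the baseline-style input: hop of 960 samples is
    20 ms at 48 kHz; the exact parameters of the official code may differ.
    audio: (n_samples, n_channels) array of the 4-channel recording.
    Returns features of shape (n_frames, n_fft // 2, 2 * n_channels).
    """
    n_samples, n_channels = audio.shape
    n_frames = (n_samples - n_fft) // hop + 1
    window = np.hanning(n_fft)
    feats = []
    for ch in range(n_channels):
        # Slice the channel into overlapping windowed frames
        frames = np.stack([audio[i * hop:i * hop + n_fft, ch] * window
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=1)[:, 1:n_fft // 2 + 1]  # drop DC bin
        feats.append(np.abs(spec))    # magnitude spectrogram
        feats.append(np.angle(spec))  # phase spectrogram
    return np.stack(feats, axis=-1)

x = np.random.randn(48000, 4)      # one second of 4-channel audio at 48 kHz
print(stft_features(x).shape)      # → (48, 1024, 8)
```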
GROUP DELAY FEATURES FOR SOUND EVENT DETECTION AND LOCALIZATION (TASK 3) OF THE DCASE 2019 CHALLENGE
Eike Nustede, Jörn Anemüller
University of Oldenburg
Anemueller_UOL_task3_1 Anemueller_UOL_task3_2 Anemueller_UOL_task3_3
Abstract
Sound event localization algorithms utilize features that encode a source's time-difference of arrival across an array of microphones. Direct encoding of a signal's phase in sub-bands is a common representation that is not without its shortcomings, since phase, as a circular variable, is prone to 2π-wrapping and systematic phase advancement across frequencies. Group delay encoding may constitute a more robust feature for data-driven algorithms, as it represents the time delays of the signal's spectral-band envelopes. Computed by differentiating the phase across frequency, it is in practice characterized by a lower degree of variability, resulting in reduced wrapping and to some extent permitting the computation of average group delay across (e.g., Mel-scaled) bands. The present contribution incorporates group delay features into the baseline system of DCASE 2019 Task 3, supplementing them with amplitude features. The system setup is based on the provided baseline system's convolutional recurrent neural network architecture, with some variation of its topology.
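For intuition, group delay can be computed per frame without explicit phase unwrapping via the standard identity τ_g(k) = Re(X_r(k)/X(k)), where X is the frame's FFT and X_r the FFT of the time-weighted frame. The sketch below is a generic illustration of that identity, not the submission's code:

```python
import numpy as np

def group_delay(frame, n_fft=1024):
    """Group delay of one windowed frame, in samples per frequency bin.

    Uses tau_g(k) = Re( X_r(k) / X(k) ) with X = FFT(x) and
    X_r = FFT(n * x[n]), which sidesteps 2*pi phase wrapping entirely.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Xr = np.fft.rfft(n * frame, n_fft)
    eps = 1e-12  # guard against division by zero in silent bins
    return np.real(Xr * np.conj(X)) / (np.abs(X) ** 2 + eps)

# Sanity check: a pure delay of d samples has constant group delay d
d = 5
x = np.zeros(64)
x[d] = 1.0
gd = group_delay(x, n_fft=64)
print(np.allclose(gd, d))  # → True
```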
TWO-STAGE SOUND EVENT LOCALIZATION AND DETECTION USING INTENSITY VECTOR AND GENERALIZED CROSS-CORRELATION
Yin Cao, Turab Iqbal, Qiuqiang Kong, Miguel Galindo, Wenwu Wang, Mark Plumbley
University of Surrey
Abstract
Sound event localization and detection (SELD) refers to the spatial and temporal localization of sound events in addition to classification. The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3 introduces a strongly labelled dataset to address this problem. In this report, a two-stage polyphonic sound event detection and localization method is proposed. The method utilizes log mel features for event detection, and uses intensity vector and GCC features for localization. The intensity vector and GCC features use the supplied Ambisonic and microphone array signals, respectively. The method trains SED first, after which the learned feature layers are transferred for direction of arrival (DOA) estimation. It then uses the SED ground truth as a mask to train DOA estimation. Experimental results show that the proposed method is able to localize and detect overlapping sound events in different environments. It is also able to improve the performance of both SED and DOA estimation, and performs significantly better than the baseline method.
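The GCC features mentioned here follow the standard GCC-PHAT construction; a minimal sketch (function names are ours, not from the report) of extracting the cross-correlation curve whose peak gives the inter-microphone time delay:

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """GCC-PHAT between two microphone signals: cross-power spectrum
    whitened by its magnitude, back-transformed to a lag-domain curve."""
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = np.conj(X1) * X2
    cross /= np.abs(cross) + 1e-12        # phase transform weighting
    cc = np.fft.irfft(cross, n_fft)
    return np.fft.fftshift(cc)            # zero lag at the centre

def tdoa_samples(x1, x2, n_fft=1024):
    """Lag (in samples) maximizing GCC-PHAT; positive when x2 lags x1."""
    cc = gcc_phat(x1, x2, n_fft)
    return int(np.argmax(cc)) - n_fft // 2
```

In SELD systems the whole `cc` curve (or a window of it around zero lag) is typically stacked per microphone pair as an input feature map, rather than reduced to the single argmax.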
THREE-STAGE APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION
Kyoungjin Noh, Choi Jeong-Hwan, Jeon Dongyeop, Chang Joon-Hyuk
Hanyang University
Chang_HYU_task3_1 Chang_HYU_task3_2 Chang_HYU_task3_3 Chang_HYU_task3_4
Abstract
This paper describes our three-stage approach to the sound event localization and detection (SELD) task. The system consists of three parts: sound activity detection (SAD), sound event detection (SED), and sound event localization (SEL). Firstly, we employ the multi-resolution cochleagram (MRCG) from 4-channel audio and a convolutional recurrent neural network (CRNN) model to detect sound activity. Secondly, we extract log mel-spectrograms from 4-channel audio, harmonic percussive source separation (HPSS) audio, and mono audio, and train another CRNN model. Lastly, we exploit the generalized cross-correlation phase transform (GCC-PHAT) of each microphone pair as an input feature of the convolutional neural network (CNN) model for SEL. We then combine the SAD, SED, and SEL results to obtain the final prediction for the SELD task. To augment overlapping frames, which degrade overall performance, we randomly select two non-overlapping audio files and mix them. We also average the predictions of several models to improve the result. Experimental results on the four cross-validation splits of the TAU Spatial Sound Events 2019 - Microphone Array dataset are an error rate of 0.23, an F-score of 85.91\%, a DOA error of 3.62 degrees, and a frame recall of 88.66\%.
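The mixing-based augmentation described above can be sketched as follows; the function, the gain range, and the label format (frame-wise binary activity) are our illustrative assumptions, not taken from the report:

```python
import numpy as np

def mix_examples(audio_a, labels_a, audio_b, labels_b,
                 gain_db_range=(-3, 3), rng=None):
    """Create a synthetic overlapping example by summing two recordings
    that each contain non-overlapping events, taking the union of their
    frame-level binary labels."""
    rng = np.random.default_rng() if rng is None else rng
    gain = 10 ** (rng.uniform(*gain_db_range) / 20)   # random level offset
    mixed_audio = audio_a + gain * audio_b
    mixed_labels = np.maximum(labels_a, labels_b)     # union of activities
    return mixed_audio, mixed_labels
```

The union of labels is what makes the new example genuinely polyphonic: frames where both sources are active now carry two simultaneous event annotations.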
HIERARCHICAL DETECTION OF SOUND EVENTS AND THEIR LOCALIZATION USING CONVOLUTIONAL NEURAL NETWORKS WITH ADAPTIVE THRESHOLDS
Sotirios Panagiotis Chytas, Gerasimos Potamianos
University of Thessaly
Chytas_UTH_task3_1 Chytas_UTH_task3_2
Abstract
The paper details our approach to address Task 3 of the DCASE 2019 Challenge, namely that of sound event localization and detection (SELD). Our developed system is based on multi-channel convolutional neural networks (CNNs), combined with data augmentation and ensembling. In particular, it follows a hierarchical approach that first determines adaptive thresholds for the multi-label sound event detection (SED) problem, based on a CNN operating on spectrograms over long-duration windows. It subsequently exploits the derived thresholds in an ensemble of CNNs operating on raw waveforms over shorter-duration sliding windows to provide the desired event segmentation and labeling. Finally, it employs a separate event localization CNN to yield direction-of-arrival (DOA) source estimates of the detected sound events. The system is evaluated on the microphone-array development dataset of the SELD Task. Compared to the corresponding baseline provided by the Challenge organizers, it achieves relative improvements of 12\% in SED error, 2\% in F-score, 36\% in DOA error, and 3\% in the combined SELD metric, but trails significantly in frame-recall.
GCC-PHAT CROSS-CORRELATION AUDIO FEATURES FOR SIMULTANEOUS SOUND EVENT LOCALIZATION AND DETECTION (SELD) ON MULTIPLE ROOMS
Hector Cordourier Maruri, Paulo Lopez Meyer, Jonathan Huang, Juan Antonio Del Hoyo Ontiveros, Hong Lu
Intel Corporation
Cordourier_IL_task3_1 Cordourier_IL_task3_2
Abstract
In this work, we show a simultaneous sound event localization and detection (SELD) system with enhanced acoustic features, in which we propose using the well-known Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm to augment the regular magnitude and phase Fourier spectral features at each frame. GCC has long been used to calculate the Time Difference of Arrival (TDOA) between simultaneous audio signals in moderately reverberant environments using classic signal processing techniques, and can assist audio source localization in current deep learning systems. The neural network architecture we used is a Convolutional Recurrent Neural Network (CRNN), tested on the sound database prepared for Task 3 of the 2019 DCASE Challenge. Our proposed system achieves a direction of arrival error of 20.4°, a frame recall of 86.4\%, an F-score of 87.1\%, and a detection error rate of 0.20 on the testing samples.
SOUND EVENT LOCALIZATION AND DETECTION USING CRNN ON PAIRS OF MICROPHONES
François Grondin, James Glass, Iwona Sobieraj, Mark D. Plumbley
University of Surrey, Massachusetts Institute of Technology
Grondin_MIT_task3_1
Abstract
This paper proposes a method for sound event localization and detection from multichannel recordings. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) that perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones of a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
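Combining the six pairwise estimates into a 3-D direction can be done, for example, with a least-squares fit under a far-field (plane-wave) model, where each pairwise TDOA constrains the projection of the unit direction onto the microphone baseline. This sketch is our illustration of that idea, not the authors' exact procedure:

```python
import numpy as np

def doa_from_tdoas(mic_pos, pairs, tdoas, c=343.0):
    """Least-squares far-field DOA from pairwise TDOAs (in seconds).
    Plane-wave model: tau_ij = (m_i - m_j) . u / c, so the unit
    direction u solves an overdetermined linear system."""
    A = np.array([mic_pos[i] - mic_pos[j] for i, j in pairs])
    b = c * np.asarray(tdoas)
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u / np.linalg.norm(u)   # normalize to a unit direction
```

With four microphones the six baselines give six equations for three unknowns, so the estimate averages out noise in individual pairwise delays.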
DATA AUGMENTATION AND PRIOR KNOWLEDGE-BASED REGULARIZATION FOR SOUND EVENT LOCALIZATION AND DETECTION
Jingyang Zhang, Wenhao Ding, Liang He
Tsinghua University
He_THU_task3_1 He_THU_task3_2 He_THU_task3_3 He_THU_task3_4
Abstract
The goal of sound event localization and detection (SELD) is to detect the presence of polyphonic sound events and identify the sources of those events at the same time. In this paper, we propose an entire pipeline, containing data augmentation, network prediction, and post-processing stages, to deal with the SELD task. In the data augmentation part, we expand the official dataset with SpecAugment [1]. In the network prediction part, we train the event detection network and the localization network separately, and utilize the prediction of events to output localization predictions for active frames. In the post-processing part, we propose a prior knowledge-based regularization (PKR), which calculates the average value of the localization predictions over each event segment and replaces the predictions of that event with this average value. We theoretically prove that this technique reduces mean square error (MSE). Evaluating our system on the DCASE 2019 Challenge Task 3 development dataset, we achieve approximately a 59\% reduction in SED error rate (ER) and a 13\% reduction in direction-of-arrival (DOA) error over the baseline system (on the Ambisonic dataset).
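The PKR post-processing step lends itself to a short sketch: within each contiguous active segment of an event, the frame-wise localization predictions are replaced by their mean, which for a static source is the constant that minimizes squared error. Function and variable names are ours:

```python
import numpy as np

def prior_knowledge_regularization(doa_pred, event_active):
    """Replace frame-wise DOA predictions inside each contiguous active
    segment with the segment mean (source assumed static per event)."""
    doa = np.array(doa_pred, dtype=float)
    act = np.asarray(event_active, dtype=int)
    padded = np.concatenate(([0], act, [0]))
    starts = np.flatnonzero(np.diff(padded) == 1)    # segment onsets
    ends = np.flatnonzero(np.diff(padded) == -1)     # segment offsets
    for s, e in zip(starts, ends):
        doa[s:e] = doa[s:e].mean(axis=0)
    return doa
```

Averaging reduces the variance of the per-frame estimator while leaving its bias unchanged, which is the intuition behind the MSE-reduction claim.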
SOUND EVENT LOCALIZATION AND DETECTION USING CONVOLUTIONAL RECURRENT NEURAL NETWORK
Wen Jie Jee, Rohith Mars, Pranay Pratik, Srikanth Nagisetty, Chong Soon Lim
Nanyang Technological University, Panasonic R&D Center Singapore
Jee_NTU_task3_1 Jee_NTU_task3_2
Abstract
This report details the methods developed on the development set of DCASE 2019 Task 3 and the results of our investigations. Mixup data augmentation was used in an attempt to train the model for greater generalization. The kernel sizes of the pooling layers were also modified to a more intuitive size. In addition, different kernel sizes for the convolutional layers were investigated and the results are reported. Our best model achieved an F-score of 91.9\% and a DOA error of 4.588 degrees on the development set, improvements of 10\% and about 25 degrees, respectively, compared to the baseline system.
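Mixup forms convex combinations of pairs of training examples and their label vectors; a minimal sketch (the `alpha` value is the commonly used default, not necessarily the one used in this report):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup augmentation: convex combination of two training examples
    and of their label vectors, with weight drawn from Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Because labels are mixed with the same weight as the inputs, the model is trained toward linear behaviour between training examples, which tends to regularize the decision boundaries.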
SOUND SOURCE DETECTION, LOCALIZATION AND CLASSIFICATION USING CONSECUTIVE ENSEMBLE OF CRNN MODELS
Slawomir Kapka, Mateusz Lewandowski
Samsung R&D Institute Poland
Lewandowski_SRPOL_task3_1 Kapka_SRPOL_task3_2 Kapka_SRPOL_task3_3 Kapka_SRPOL_task3_4
Abstract
In this technical report, we describe our method for DCASE 2019 Task 3: Sound Event Localization and Detection. We use four CRNN SELDnet-like single-output models which run in a consecutive manner to recover all possible information about occurring events. We decompose the SELD task into estimating the number of active sources, estimating the direction of arrival of a single source, estimating the direction of arrival of a second source when the direction of the first one is known, and a multi-label classification task. We use a custom consecutive ensemble to predict events' onset, offset, direction of arrival, and class. The proposed approach is evaluated on the development set of TAU Spatial Sound Events 2019 - Ambisonic.
CROSS-TASK LEARNING FOR AUDIO TAGGING, SOUND EVENT DETECTION AND SPATIAL LOCALIZATION: DCASE 2019 BASELINE SYSTEMS
Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang, Mark D. Plumbley
University of Surrey
Kong_SURREY_task3_1
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.
ARBORESCENT NEURAL NETWORK ARCHITECTURES FOR SOUND EVENT DETECTION AND LOCALIZATION
Daniel Krause, Konrad Kowalczyk
AGH University of Science and Technology
Abstract
This paper describes our contribution to the task of sound event localization and detection (SELD) using first-order Ambisonic signals at the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Our approach is based on arborescent convolutional recurrent neural networks with the aim of achieving joint localization and detection of overlapping acoustic events. The submitted systems can be briefly summarized as follows. System 1 splits the neural network into two branches associated with the localization and detection tasks; this split is performed directly after the first convolutional layer. System 2 utilizes depthwise separable convolutions in order to exploit inter-channel dependencies whilst substantially reducing model complexity. System 3 exhibits a tree-like architecture in which relations between the channels for phase and magnitude are exploited independently in two branches, which are concatenated before the recurrent layers. Finally, System 4 is based on score fusion of the first two systems.
SPECTRUM COMBINATION AND CONVOLUTIONAL RECURRENT NEURAL NETWORKS FOR JOINT LOCALIZATION AND DETECTION OF SOUND EVENTS
Shuangran Leung, Yi Ren
DBSonics
Leung_DBS_task3_1 Leung_DBS_task3_2
Abstract
In this work, we combine existing Short Time Fourier Transforms (STFT) of 4-channel array audio signals to create new features, and show that this augmented input improves the performance of DCASE2019 task 3 baseline system [1] in both sound event detection (SED) and direction-of-arrival (DOA) estimation. Techniques like ensembling and finetuning with masked DOA output are also applied and shown to further improve both SED and DOA accuracy.
A REPORT ON SOUND EVENT LOCALIZATION AND DETECTION
Yifeng Lin, Zhisheng Wang
Esound corporation
Lin_YYZN_task3_1
Abstract
In this paper, we make a small modification to the baseline for Sound Event Localization and Detection. We add Gaussian noise to the input data to investigate whether noise helps improve the neural network. Sound event detection is performed with a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied neural network with noise added to the input data is seen to consistently perform on par with the original baseline with respect to the error rate metric.
SOUND EVENT LOCALIZATION AND DETECTION USING FOA DOMAIN SPATIAL AUGMENTATION
Luca Mazzon, Masahiro Yasuda, Yuma Koizumi, Noboru Harada
NTT Media Intelligence Laboratories
MazzonYasuda_NTT_task3_1 MazzonYasuda_NTT_task3_2 MazzonYasuda_NTT_task3_3 MazzonYasuda_NTT_task3_4
Abstract
This technical report describes our system participating in DCASE 2019, Task 3: Sound Event Localization and Detection. The system consists of a convolutional recurrent neural network (CRNN) reinforced by a ResNet structure. A two-stage training strategy with label masking is adopted. The main advancement of the proposed method is a data augmentation method based on rotation in the first-order Ambisonics (FOA) domain. The proposed spatial augmentation enables us to augment direction of arrival (DOA) labels without losing the physical relationships between steering vectors and observations. Evaluation results on the development dataset show that, even though the proposed method did not use any ensemble method in this experiment, (i) the proposed method outperformed a state-of-the-art system published before the submission deadline and (ii) the DOA error decreased significantly: 2.73 degrees better than the state-of-the-art system.
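A rotation about the vertical axis in the FOA domain only remixes the first-order channels, so DOA labels can be shifted by the same angle at no acoustic cost. A sketch of this idea, assuming ACN channel order (W, Y, Z, X) and showing the azimuth-only case for brevity; the paper's full method may use more general rotations:

```python
import numpy as np

def rotate_foa_azimuth(foa, angle_deg):
    """Rotate a first-order Ambisonics recording about the vertical axis.
    A source at azimuth theta appears at theta + angle after rotation,
    so the DOA label is shifted by the same angle."""
    a = np.deg2rad(angle_deg)
    w, y, z, x = foa                         # ACN order: W, Y, Z, X
    x_rot = np.cos(a) * x - np.sin(a) * y    # X' = cos(theta+a) component
    y_rot = np.sin(a) * x + np.cos(a) * y    # Y' = sin(theta+a) component
    return np.stack([w, y_rot, z, x_rot])
```

The omnidirectional W channel and the vertical Z channel are invariant under this rotation, which is why the transform preserves the physics of the recording exactly.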
DCASE 2019 TASK 3: A TWO-STEP SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION
Thi Ngoc Tho Nguyen, Douglas L. Jones, Rishabh Ranjan, Sathish Jayabalan, Woon Seng Gan
Nanyang Technological University, University of Illinois Urbana-Champaign
Nguyen_NTU_task3_1 Nguyen_NTU_task3_2 Nguyen_NTU_task3_3 Nguyen_NTU_task3_4
Abstract
Sound event detection and sound event localization require different features from the audio input signals. While sound event detection mainly relies on time-frequency patterns to distinguish different event classes, sound event localization uses magnitude or phase differences between microphones to estimate source directions. Therefore, we propose a two-step system for sound event localization and detection. In the first step, we detect the sound events and estimate the directions of arrival separately. In the second step, we combine the results of the event detector and the direction-of-arrival estimator. The obtained results show a significant improvement over the baseline solution for sound event localization and detection in the DCASE 2019 Task 3 challenge. Using 4-fold cross-validation on the development dataset, the proposed system achieves an F1 score of 86.9\% for sound event detection and an error of 5.15 degrees for direction-of-arrival estimation, while the baseline F1 score and error are 79.9\% and 28.5 degrees, respectively.
REASSEMBLY LEARNING FOR SOUND EVENT LOCALIZATION AND DETECTION USING CRNN AND TRELLISNET
Sooyoung Park, Wootaek Lim, Sangwon Suh, Youngho Jeong
Electronics and Telecommunications Research Institute
Park_ETRI_task3_1 Park_ETRI_task3_2 Park_ETRI_task3_3 Park_ETRI_task3_4
Abstract
This technical report proposes a deep learning based approach, reassembly learning, for polyphonic sound event localization and detection. Sound event localization and detection is a joint task of two dependent sub-tasks: sound event detection and direction of arrival estimation. Joint learning suffers performance degradation compared to learning each sub-task separately. For this reason, we propose reassembly learning to design a single network that deals with the dependent sub-tasks together. Reassembly learning is a method to divide a multi-task problem into individual sub-tasks, train each sub-task, and then reassemble and fine-tune them into a single network. Experimental results show that reassembly learning performs well on sound event localization and detection. Besides, convolutional recurrent neural networks have been regarded as the state of the art in both sound classification and detection applications. In the DCASE 2019 challenge Task 3, we suggest a new architecture, the trellis network, based on temporal convolutional networks, which can replace convolutional recurrent neural networks. The trellis network shows strength in direction of arrival estimation and has the potential to be applied to a variety of sound classification and detection applications.
A HYBRID PARAMETRIC-DEEP LEARNING APPROACH FOR SOUND EVENT LOCALIZATION AND DETECTION
Andres Perez-Lopez, Eduardo Fonseca, Xavier Serra
Centre Tecnologic de Catalunya, Universitat Pompeu Fabra
Abstract
This technical report describes and discusses the algorithm submitted to the Sound Event Localization and Detection Task of the DCASE 2019 Challenge. The proposed methodology combines a parametric spatial audio analysis approach for localization estimation, a simple heuristic for event segmentation, and a deep learning based monophonic event classifier. The evaluation of the proposed algorithm on the development dataset yields overall results slightly outperforming the baseline system. The main highlight is a reduction of the localization error of over 65\%.
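A common parametric localization estimator for FOA signals is the active intensity vector, whose direction per time-frequency bin points toward the source. A sketch of that estimator (our illustration; Ambisonics normalization conventions are simplified):

```python
import numpy as np

def intensity_vector_doa(W, X, Y, Z):
    """DOA from the first-order Ambisonics active intensity vector.
    W, X, Y, Z are complex STFT values of the FOA channels; returns
    azimuth and elevation in degrees."""
    ix = np.real(np.conj(W) * X)   # active intensity, x component
    iy = np.real(np.conj(W) * Y)
    iz = np.real(np.conj(W) * Z)
    azi = np.degrees(np.arctan2(iy, ix))
    ele = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azi, ele
```

In practice the per-bin estimates are aggregated (e.g., averaged or histogrammed) over the bins attributed to one event before reporting a DOA.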
SOUND EVENTS DETECTION AND DIRECTION OF ARRIVAL ESTIMATION USING RESIDUAL NET AND RECURRENT NEURAL NETWORKS
Rishabh Ranjan, Sathish Jayabalan, Woon-Seng Gan
Nanyang Technological University
Ranjan_NTU_task3_1 Ranjan_NTU_task3_2 Ranjan_NTU_task3_3 Ranjan_NTU_task3_4
Abstract
This paper presents a deep learning approach for sound event detection and localization, which is part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge, Task 3. Deep residual nets originally used for image classification are adapted and combined with recurrent neural networks (RNNs) to estimate the onset and offset of sound events, the sound event class, and their direction in a reverberant environment. Additionally, data augmentation and post-processing techniques are applied to generalize the system performance to unseen data. Using our best model on the validation dataset, sound event detection achieves an F1-score of 0.89 and an error rate of 0.18, whereas the sound source localization task achieves an angular error of 9 degrees and a frame recall of 0.90.
POLYPHONIC SOUND EVENT DETECTION AND LOCALIZATION USING A TWO-STAGE STRATEGY
Pi LiHong, Zheng Xue, Chen Ping, Wang Zhe, Zhang Chun
Tsinghua University, Beijing Yiemed Medical Technology Co. Ltd, Beijing Sanping Technology Co.Ltd
Rough_EMED_task3_2
Abstract
The joint training of SED and direction-of-arrival estimation (DOAE) affects the performance of both. We adopt a two-stage polyphonic sound event detection and localization method. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. We select log mel spectrograms and GCC-PHAT as the input features; the GCC-PHAT feature, which contains the phase difference information between any two microphones, improves the performance of DOAE.
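Using the SED ground truth as a mask during DOAE training amounts to computing the regression loss only over frames where the event is active; a minimal sketch with illustrative names (not the authors' code):

```python
import numpy as np

def masked_doa_loss(doa_pred, doa_true, sed_true):
    """Mean squared DOA regression error computed only over frames
    where the SED ground truth marks the event as active."""
    mask = np.asarray(sed_true, dtype=float)
    err = (doa_pred - doa_true) ** 2 * mask       # zero out inactive frames
    return err.sum() / np.maximum(mask.sum(), 1.0)
```

Masking keeps meaningless DOA targets on silent frames from dominating the gradient, which is the motivation for the two-stage strategy.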
Sound Event Detection and Localization Using ResNet RNN and Time-Delay DOA
Ee Leng Tan, Rishabh Ranjan, Sathish Jayabalan
Nanyang Technological University
Tan_NTU_task3_1
Abstract
This paper presents a deep learning approach for sound event detection with time-delay direction-of-arrival (TDOA) based localization, which is part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge, Task 3. Deep residual nets originally used for image classification are adapted and combined with recurrent neural networks (RNNs) to estimate the onset and offset of sound events and the sound event class. Data augmentation and post-processing techniques are applied to generalize the system performance to unseen data. The direction of sound events in a reverberant environment is estimated using a time-delay direction-of-arrival (TDOA) algorithm. Using our best model on the validation dataset, sound event detection achieves an F1-score of 0.84 and an error rate of 0.25, whereas the sound source localization task achieves an angular error of 16.56 degrees and a frame recall of 0.82.
MULTI-BEAM AND MULTI-TASK LEARNING FOR JOINT SOUND EVENT DETECTION AND LOCALIZATION
Wei Xue, Tong Ying, Zhang Chao, Ding Guohong
JD.COM
Xue_JDAI_task3_1 Xue_JDAI_task3_2 Xue_JDAI_task3_3 Xue_JDAI_task3_4
Abstract
Joint sound event detection (SED) and sound source localization (SSL) is essential since it provides both the temporal and spatial information of the events that appear in an acoustic scene. Although the problem can be tackled by designing a system based on deep neural networks (DNNs) and fundamental spectral and spatial features, in this paper we largely leverage conventional microphone array signal processing techniques to generate more comprehensive representations for both SED and SSL, and to perform post-processing such that stable SED and SSL results can be obtained. Specifically, features extracted from the signals of multiple beams are utilized, which orient towards different directions of arrival (DOAs) and are formed according to the estimated steering vector of each DOA. Smoothed cross-power spectra (CPS) are computed based on the signal presence probability (SPP), and are used both as input features of the DNNs and for estimating the steering vectors of the different DOAs. A triple-task learning scheme is developed, which jointly exploits classification and regression based criteria for DOA estimation, and uses the classification based criterion as a regularization for the DNN. Experimental results demonstrate that the proposed method yields substantial improvements over the baseline method for Task 3 of the DCASE 2019 challenge.
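The multi-beam features rest on standard beamforming: given a steering vector for a candidate DOA, a frequency-domain delay-and-sum beam aligns the microphone spectra with the conjugate steering vector and averages them. A minimal sketch of that building block (in the paper the steering vectors are estimated from the smoothed CPS; here the steering vector is simply given):

```python
import numpy as np

def delay_and_sum(stft_mics, steering):
    """Frequency-domain delay-and-sum beam for one frame.
    stft_mics: (n_mics, n_bins) complex STFT values.
    steering:  (n_mics, n_bins) unit-modulus steering vector for one DOA."""
    return (np.conj(steering) * stft_mics).mean(axis=0)
```

A bank of such beams pointed at a grid of DOAs yields one spectrum per direction, which is the kind of direction-indexed representation the abstract describes feeding to the DNN.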
Sound event detection and localization based on CNN and LSTM
Zhao Lu
University of Electronic Science and Technology of China
ZhaoLu_UESTC_task3_1
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge is a benchmark evaluation for audio classification and detection, and Task 3 addresses the localization and detection of sound events. In this field, learning methods based on deep neural networks are becoming increasingly popular. Building on CNNs, the spectral and cross-correlation information of a multichannel microphone array is learned with a combined CNN and LSTM model, yielding sound event detection results and direction-of-arrival estimates. Compared with the baseline method, this method improves the DOA estimation accuracy and the SED recognition ability on the DCASE 2019 dataset, implemented with the PyTorch deep learning toolkit. The combination of CNN and LSTM works very well on this kind of time-series feature classification problem.