Task description
The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording (see Fig. 1). As in the previous edition, the challenge is to explore how a large amount of unbalanced and unlabeled training data can be exploited together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files, and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets have been verified and can be considered reliable.
A more detailed task description can be found on the task description page.
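Systems are ranked by the event-based macro F-score, computed with a 200 ms collar on event onsets and a collar of max(200 ms, 20% of the event length) on offsets. A minimal sketch of how comparable scores can be computed with the sed_eval toolbox (file names are illustrative placeholders):

```python
# Sketch: event-based macro F-score with the sed_eval toolbox, using the
# task's collars (200 ms on onsets, max(200 ms, 20% of length) on offsets).
# File names are illustrative placeholders.
import dcase_util
import sed_eval

reference = dcase_util.containers.MetaDataContainer().load('reference.txt')
estimated = dcase_util.containers.MetaDataContainer().load('estimated.txt')

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference.unique_event_labels,
    t_collar=0.200,            # onset collar in seconds
    percentage_of_length=0.2,  # offset collar relative to the event length
)

for filename in reference.unique_files:
    metrics.evaluate(
        reference_event_list=reference.filter(filename=filename),
        estimated_event_list=estimated.filter(filename=filename),
    )

print(metrics.results_class_wise_average_metrics()['f_measure'])
```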
Systems ranking
Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset)
---|---|---|---|---|---
Xiaomi_task4_SED_1 | DCASE2020 SED mean-teacher system | Liang2020 | 36.0 (35.3 - 36.8) | 35.6 | ||
Rykaczewski_Samsung_taks4_SED_3 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (21.0 - 22.4) | 35.0 | ||
Rykaczewski_Samsung_taks4_SED_2 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 36.0 | ||
Rykaczewski_Samsung_taks4_SED_4 | DCASE2020 SED CNNR | Rykaczewski2020 | 10.4 (9.7 - 11.1) | 35.7 | ||
Rykaczewski_Samsung_taks4_SED_1 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (20.8 - 22.4) | 34.3 | ||
Hou_IPS_task4_SED_1 | DCASE2020 WASEDA IPS SED | HouB2020 | 34.9 (34.0 - 35.7) | 40.8 | ||
Miyazaki_NU_task4_SED_1 | Conformer SED | Miyazaki2020 | 51.1 (50.1 - 52.3) | 50.6 ||
Miyazaki_NU_task4_SED_2 | Transformer SED | Miyazaki2020 | 46.4 (45.5 - 47.5) | 47.3 ||
Miyazaki_NU_task4_SED_3 | Transformer-Conformer ensemble SED | Miyazaki2020 | 50.7 (49.6 - 51.9) | 49.8 ||
Huang_ICT-TOSHIBA_task4_SED_3 | guided multi-branch learning | Huang2020 | 44.3 (43.4 - 45.4) | 45.8 | ||
Huang_ICT-TOSHIBA_task4_SED_1 | guided multi-branch learning | Huang2020 | 44.6 (43.5 - 46.0) | 46.7 | ||
Huang_ICT-TOSHIBA_task4_SS_SED_4 | guided multi branch learning | Huang2020 | Sound Separation | 44.1 (42.9 - 45.4) | 46.9 | |
Huang_ICT-TOSHIBA_task4_SED_4 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 46.7 | ||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | guided multi branch learning | Huang2020 | Sound Separation | 44.7 (43.6 - 46.2) | 47.2 | |
Huang_ICT-TOSHIBA_task4_SED_2 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 46.7 | ||
Huang_ICT-TOSHIBA_task4_SS_SED_3 | guided multi branch learning | Huang2020 | Sound Separation | 44.4 (43.2 - 45.8) | 47.2 | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | guided_and_multi_branch_learning | Huang2020 | Sound Separation | 44.5 (43.3 - 46.0) | 47.3 | |
Copiaco_UOW_task4_SED_2 | DCASE2020 SED system copiaco | Copiaco2020a | 7.8 (7.3 - 8.2) | 11.2 | ||
Copiaco_UOW_task4_SED_1 | DCASE2020 SED system copiaco | Copiaco2020a | 7.5 (7.0 - 8.0) | 8.1 | ||
Kim_AiTeR_GIST_SED_1 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.7 (42.8 - 44.7) | 44.6 | ||
Kim_AiTeR_GIST_SED_2 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.9 (43.0 - 44.7) | 45.2 | ||
Kim_AiTeR_GIST_SED_4 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.4 (43.5 - 45.2) | 44.0 | ||
Kim_AiTeR_GIST_SED_3 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.2 (43.4 - 45.1) | 44.5 | ||
Copiaco_UOW_task4_SS_SED_1 | DCASE2020 SS+SED system copiaco | Copiaco2020b | Sound Separation | 6.9 (6.7 - 7.2) | 8.7 | |
LJK_PSH_task4_SED_3 | LJK PSH DCASE2020 Task4 SED 3 | JiaKai2020 | 38.6 (37.7 - 39.7) | 45.4 | ||
LJK_PSH_task4_SED_1 | LJK PSH DCASE2020 Task4 SED 1 | JiaKai2020 | 39.3 (38.4 - 40.4) | 44.8 | ||
LJK_PSH_task4_SED_2 | LJK PSH DCASE2020 Task4 SED 2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 47.9 | ||
LJK_PSH_task4_SED_4 | LJK PSH DCASE2020 Task4 SED 4 | JiaKai2020 | 40.6 (39.6 - 41.6) | 46.7 | ||
Hao_CQU_task4_SED_2 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.0 (46.0 - 48.1) | 46.4 | ||
Hao_CQU_task4_SED_3 | DCASE2020 cross-domain sound event detection | Hao2020 | 46.3 (45.5 - 47.4) | 47.7 | ||
Hao_CQU_task4_SED_1 | DCASE2020 cross-domain sound event detection | Hao2020 | 44.9 (43.9 - 45.8) | 48.2 | ||
Hao_CQU_task4_SED_4 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.8 (46.9 - 49.0) | 50.0 | ||
Zhenwei_Hou_task4_SED_1 | DCASE2020 SED GCA system | HouZ2020 | 45.1 (44.2 - 45.8) | 44.8 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher with thresholding | deBenito2020 | 38.2 (37.5 - 39.2) | 43.4 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher | de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | 40.9 | ||
Koh_NTHU_task4_SED_3 | Koh_NTHU_3 | Koh2020 | 46.6 (45.8 - 47.6) | 48.0 | ||
Koh_NTHU_task4_SED_2 | Koh_NTHU_2 | Koh2020 | 45.2 (44.3 - 46.3) | 48.4 | ||
Koh_NTHU_task4_SED_1 | Koh_NTHU_1 | Koh2020 | 45.2 (44.2 - 46.1) | 46.4 | ||
Koh_NTHU_task4_SED_4 | Koh_NTHU_4 | Koh2020 | 46.3 (45.4 - 47.2) | 49.6 | ||
Cornell_UPM-INRIA_task4_SED_2 | UNIVPM-INRIA DAT HMM 1 | Cornell2020 | 42.0 (40.9 - 43.1) | 45.2 | ||
Cornell_UPM-INRIA_task4_SED_1 | UNIVPM-INRIA ensemble DAT+PCEN | Cornell2020 | 44.4 (43.3 - 45.5) | 46.2 | ||
Cornell_UPM-INRIA_task4_SED_4 | UNIVPM-INRIA ensemble DAT+PCEN HMM 2 | Cornell2020 | 43.2 (42.1 - 44.4) | 47.4 | ||
Cornell_UPM-INRIA_task4_SS_SED_1 | UNIVPM-INRIA separation hmm | Cornell2020 | Sound Separation | 38.6 (37.5 - 39.6) | 40.2 | |
Cornell_UPM-INRIA_task4_SED_3 | UNIVPM-INRIA ensemble MT+PCEN | Cornell2020 | 42.6 (41.6 - 43.5) | 43.7 | ||
Yao_UESTC_task4_SED_1 | Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | 47.9 | ||
Yao_UESTC_task4_SED_3 | Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 49.5 | ||
Yao_UESTC_task4_SED_2 | Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | 50.5 | ||
Yao_UESTC_task4_SED_4 | Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | 49.6 | ||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_valid | Liu2020 | 40.7 (39.7 - 41.7) | 47.4 | ||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_valid | Liu2020 | 41.8 (40.7 - 42.9) | 49.5 | ||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_pub | Liu2020 | 45.2 (44.2 - 46.5) | 44.3 | ||
Liu_thinkit_task4_SED_4 | MT_PPDA_finall | Liu2020 | 43.1 (42.1 - 44.2) | 46.8 | ||
PARK_JHU_task4_SED_1 | PARK_fusion_P | Park2020 | 35.8 (35.0 - 36.6) | 45.4 | ||
PARK_JHU_task4_SED_1 | PARK_fusion_N | Park2020 | 26.5 (25.7 - 27.5) | 41.1 | ||
PARK_JHU_task4_SED_2 | PARK_fusion_M | Park2020 | 36.9 (36.1 - 37.7) | 44.9 | ||
PARK_JHU_task4_SED_3 | PARK_fusion_L | Park2020 | 34.7 (34.1 - 35.6) | 45.4 | ||
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | Sound Separation | 34.5 (33.5 - 35.3) | 36.7 | |
CTK_NU_task4_SED_2 | CTK_NU NMF-CNN-2 | Chan2020 | 44.4 (43.5 - 45.5) | 45.7 | ||
CTK_NU_task4_SED_4 | CTK_NU NMF-CNN-4 | Chan2020 | 46.3 (45.3 - 47.4) | 48.6 | ||
CTK_NU_task4_SED_3 | CTK_NU NMF-CNN-3 | Chan2020 | 45.8 (45.0 - 47.0) | 48.0 | ||
CTK_NU_task4_SED_1 | CTK_NU NMF-CNN-1 | Chan2020 | 43.5 (42.6 - 44.7) | 45.2 | ||
YenKu_NTU_task4_SED_4 | DCASE2020 SED | Yen2020 | 42.7 (41.6 - 43.6) | 46.6 | ||
YenKu_NTU_task4_SED_2 | DCASE2020 SED | Yen2020 | 42.6 (41.8 - 43.7) | 45.7 | ||
YenKu_NTU_task4_SED_3 | DCASE2020 SED | Yen2020 | 41.6 (40.6 - 42.7) | 45.4 | ||
YenKu_NTU_task4_SED_1 | DCASE2020 SED | Yen2020 | 43.6 (42.4 - 44.6) | 45.6 | ||
Tang_SCU_task4_SED_1 | Basic ResNet block without weakly labeled data augmentation | Tang2020 | 43.1 (42.3 - 44.1) | 46.6 | ||
Tang_SCU_task4_SED_4 | Multi-scale ResNet block with weakly labeled data augmentation | Tang2020 | 44.1 (43.4 - 44.8) | 49.0 | ||
Tang_SCU_task4_SED_2 | Basic ResNet block with weakly labeled data augmentation | Tang2020 | 42.4 (41.4 - 43.4) | 48.4 | ||
Tang_SCU_task4_SED_3 | Multi-scale ResNet block without weakly labeled data augmentation | Tang2020 | 44.1 (43.3 - 45.0) | 48.2 | ||
DCASE2020_SED_baseline_system | DCASE2020 SED baseline system | turpault2020a | 34.9 (34.0 - 35.7) | 34.8 | ||
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | Sound Separation | 36.5 (35.6 - 37.2) | 35.6 | |
Ebbers_UPB_task4_SED_1 | DCASE2020 UPB SED system 1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 48.3 |
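The 95% confidence intervals above are estimated over the evaluation clips. One simple way to obtain comparable intervals is a nonparametric bootstrap over per-clip scores; the sketch below illustrates this approach and is not necessarily the organizers' exact estimator:

```python
# Sketch: nonparametric bootstrap for a 95% confidence interval,
# resampling evaluation clips with replacement. One common estimator;
# not necessarily the exact procedure used for the intervals above.
import numpy as np

def bootstrap_ci(per_clip_scores, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_clip_scores, dtype=float)
    resampled = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    low, high = np.quantile(resampled, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), low, high

# Example with placeholder per-clip scores:
mean, low, high = bootstrap_ci(np.random.rand(700))
print(f"{100 * mean:.1f} ({100 * low:.1f} - {100 * high:.1f})")
```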
Supplementary metrics
Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | PSDS Cross-trigger (Evaluation dataset) | Event-based F-score (Public evaluation) | PSDS Cross-trigger (Public evaluation) | Event-based F-score (Vimeo dataset) | PSDS Cross-trigger (Vimeo dataset)
---|---|---|---|---|---|---|---|---|---
Xiaomi_task4_SED_1 | DCASE2020 SED mean-teacher system | Liang2020 | 36.0 (35.3 - 36.8) | 40.7 | 25.3 | |||||
Rykaczewski_Samsung_taks4_SED_3 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (21.0 - 22.4) | 23.7 | 15.4 | |||||
Rykaczewski_Samsung_taks4_SED_2 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 24.0 | 15.7 | |||||
Rykaczewski_Samsung_taks4_SED_4 | DCASE2020 SED CNNR | Rykaczewski2020 | 10.4 (9.7 - 11.1) | 11.9 | 6.7 | |||||
Rykaczewski_Samsung_taks4_SED_1 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (20.8 - 22.4) | 23.5 | 15.7 | |||||
Hou_IPS_task4_SED_1 | DCASE2020 WASEDA IPS SED | HouB2020 | 34.9 (34.0 - 35.7) | 38.1 | 27.8 | |||||
Miyazaki_NU_task4_SED_1 | Conformer SED | Miyazaki2020 | 51.1 (50.1 - 52.3) | 55.7 | 39.6 |||||
Miyazaki_NU_task4_SED_2 | Transformer SED | Miyazaki2020 | 46.4 (45.5 - 47.5) | 51.1 | 34.9 |||||
Miyazaki_NU_task4_SED_3 | Transformer-Conformer ensemble SED | Miyazaki2020 | 50.7 (49.6 - 51.9) | 55.2 | 39.0 |||||
Huang_ICT-TOSHIBA_task4_SED_3 | guided multi-branch learning | Huang2020 | 44.3 (43.4 - 45.4) | 48.7 | 32.2 | |||||
Huang_ICT-TOSHIBA_task4_SED_1 | guided multi-branch learning | Huang2020 | 44.6 (43.5 - 46.0) | 49.7 | 31.8 | |||||
Huang_ICT-TOSHIBA_task4_SS_SED_4 | guided multi branch learning | Huang2020 | Sound Separation | 44.1 (42.9 - 45.4) | 48.6 | 32.8 | ||||
Huang_ICT-TOSHIBA_task4_SED_4 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 49.0 | 32.2 | |||||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | guided multi branch learning | Huang2020 | Sound Separation | 44.7 (43.6 - 46.2) | 49.5 | 32.7 | ||||
Huang_ICT-TOSHIBA_task4_SED_2 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 49.0 | 32.2 | |||||
Huang_ICT-TOSHIBA_task4_SS_SED_3 | guided multi branch learning | Huang2020 | Sound Separation | 44.4 (43.2 - 45.8) | 49.3 | 32.2 | ||||
Huang_ICT-TOSHIBA_task4_SS_SED_2 | guided_and_multi_branch_learning | Huang2020 | Sound Separation | 44.5 (43.3 - 46.0) | 49.3 | 32.6 | ||||
Copiaco_UOW_task4_SED_2 | DCASE2020 SED system copiaco | Copiaco2020a | 7.8 (7.3 - 8.2) | 8.5 | 5.8 | |||||
Copiaco_UOW_task4_SED_1 | DCASE2020 SED system copiaco | Copiaco2020a | 7.5 (7.0 - 8.0) | 7.9 | 5.9 | |||||
Kim_AiTeR_GIST_SED_1 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.7 (42.8 - 44.7) | 0.645 | 48.0 | 0.701 | 33.0 | 0.529 | ||
Kim_AiTeR_GIST_SED_2 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.9 (43.0 - 44.7) | 0.646 | 48.1 | 0.704 | 33.5 | 0.521 | ||
Kim_AiTeR_GIST_SED_4 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.4 (43.5 - 45.2) | 0.641 | 48.0 | 0.698 | 35.5 | 0.522 | ||
Kim_AiTeR_GIST_SED_3 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.2 (43.4 - 45.1) | 0.650 | 47.9 | 0.705 | 35.2 | 0.531 | ||
Copiaco_UOW_task4_SS_SED_1 | DCASE2020 SS+SED system copiaco | Copiaco2020b | Sound Separation | 6.9 (6.7 - 7.2) | 7.3 | 5.7 | ||||
LJK_PSH_task4_SED_3 | LJK PSH DCASE2020 Task4 SED 3 | JiaKai2020 | 38.6 (37.7 - 39.7) | 0.582 | 42.6 | 0.639 | 28.8 | 0.468 | ||
LJK_PSH_task4_SED_1 | LJK PSH DCASE2020 Task4 SED 1 | JiaKai2020 | 39.3 (38.4 - 40.4) | 0.595 | 43.9 | 0.644 | 28.1 | 0.509 | ||
LJK_PSH_task4_SED_2 | LJK PSH DCASE2020 Task4 SED 2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 0.602 | 45.8 | 0.653 | 29.7 | 0.513 | ||
LJK_PSH_task4_SED_4 | LJK PSH DCASE2020 Task4 SED 4 | JiaKai2020 | 40.6 (39.6 - 41.6) | 0.598 | 44.1 | 0.647 | 31.9 | 0.494 | ||
Hao_CQU_task4_SED_2 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.0 (46.0 - 48.1) | 50.6 | 37.3 | |||||
Hao_CQU_task4_SED_3 | DCASE2020 cross-domain sound event detection | Hao2020 | 46.3 (45.5 - 47.4) | 50.3 | 36.1 | |||||
Hao_CQU_task4_SED_1 | DCASE2020 cross-domain sound event detection | Hao2020 | 44.9 (43.9 - 45.8) | 49.4 | 33.1 | |||||
Hao_CQU_task4_SED_4 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.8 (46.9 - 49.0) | 52.3 | 35.3 | |||||
Zhenwei_Hou_task4_SED_1 | DCASE2020 SED GCA system | HouZ2020 | 45.1 (44.2 - 45.8) | 0.600 | 49.0 | 0.654 | 35.2 | 0.474 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher with thresholding | deBenito2020 | 38.2 (37.5 - 39.2) | 0.575 | 42.0 | 0.630 | 29.1 | 0.460 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher | de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | 0.575 | 41.5 | 0.630 | 29.4 | 0.460 | ||
Koh_NTHU_task4_SED_3 | Koh_NTHU_3 | Koh2020 | 46.6 (45.8 - 47.6) | 0.584 | 51.5 | 0.636 | 34.5 | 0.476 | ||
Koh_NTHU_task4_SED_2 | Koh_NTHU_2 | Koh2020 | 45.2 (44.3 - 46.3) | 0.645 | 48.7 | 0.688 | 36.4 | 0.549 | ||
Koh_NTHU_task4_SED_1 | Koh_NTHU_1 | Koh2020 | 45.2 (44.2 - 46.1) | 0.624 | 49.1 | 0.669 | 35.4 | 0.523 | ||
Koh_NTHU_task4_SED_4 | Koh_NTHU_4 | Koh2020 | 46.3 (45.4 - 47.2) | 0.586 | 50.3 | 0.639 | 36.1 | 0.478 | ||
Cornell_UPM-INRIA_task4_SED_2 | UNIVPM-INRIA DAT HMM 1 | Cornell2020 | 42.0 (40.9 - 43.1) | 45.6 | 32.6 | |||||
Cornell_UPM-INRIA_task4_SED_1 | UNIVPM-INRIA ensemble DAT+PCEN | Cornell2020 | 44.4 (43.3 - 45.5) | 48.6 | 33.8 | |||||
Cornell_UPM-INRIA_task4_SED_4 | UNIVPM-INRIA ensemble DAT+PCEN HMM 2 | Cornell2020 | 43.2 (42.1 - 44.4) | 47.9 | 31.4 | |||||
Cornell_UPM-INRIA_task4_SS_SED_1 | UNIVPM-INRIA separation hmm | Cornell2020 | Sound Separation | 38.6 (37.5 - 39.6) | 42.3 | 29.4 | ||||
Cornell_UPM-INRIA_task4_SED_3 | UNIVPM-INRIA ensemble MT+PCEN | Cornell2020 | 42.6 (41.6 - 43.5) | 47.7 | 29.4 | |||||
Yao_UESTC_task4_SED_1 | Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | 47.6 | 35.3 | |||||
Yao_UESTC_task4_SED_3 | Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 50.5 | 36.0 | |||||
Yao_UESTC_task4_SED_2 | Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | 49.6 | 35.9 | |||||
Yao_UESTC_task4_SED_4 | Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | 49.9 | 37.3 | |||||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_valid | Liu2020 | 40.7 (39.7 - 41.7) | 45.4 | 27.7 | |||||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_valid | Liu2020 | 41.8 (40.7 - 42.9) | 46.0 | 31.1 | |||||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_pub | Liu2020 | 45.2 (44.2 - 46.5) | 51.2 | 30.3 | |||||
Liu_thinkit_task4_SED_4 | MT_PPDA_finall | Liu2020 | 43.1 (42.1 - 44.2) | 47.2 | 32.3 | |||||
PARK_JHU_task4_SED_1 | PARK_fusion_P | Park2020 | 35.8 (35.0 - 36.6) | 38.9 | 28.0 | |||||
PARK_JHU_task4_SED_1 | PARK_fusion_N | Park2020 | 26.5 (25.7 - 27.5) | 28.1 | 22.6 | |||||
PARK_JHU_task4_SED_2 | PARK_fusion_M | Park2020 | 36.9 (36.1 - 37.7) | 40.2 | 28.7 | |||||
PARK_JHU_task4_SED_3 | PARK_fusion_L | Park2020 | 34.7 (34.1 - 35.6) | 38.1 | 26.7 | |||||
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | Sound Separation | 34.5 (33.5 - 35.3) | 37.8 | 26.9 | ||||
CTK_NU_task4_SED_2 | CTK_NU NMF-CNN-2 | Chan2020 | 44.4 (43.5 - 45.5) | 0.522 | 47.8 | 0.551 | 35.1 | 0.452 | ||
CTK_NU_task4_SED_4 | CTK_NU NMF-CNN-4 | Chan2020 | 46.3 (45.3 - 47.4) | 0.534 | 50.5 | 0.567 | 35.3 | 0.460 | ||
CTK_NU_task4_SED_3 | CTK_NU NMF-CNN-3 | Chan2020 | 45.8 (45.0 - 47.0) | 0.543 | 49.8 | 0.575 | 35.4 | 0.468 | ||
CTK_NU_task4_SED_1 | CTK_NU NMF-CNN-1 | Chan2020 | 43.5 (42.6 - 44.7) | 0.503 | 47.5 | 0.535 | 33.2 | 0.436 | ||
YenKu_NTU_task4_SED_4 | DCASE2020 SED | Yen2020 | 42.7 (41.6 - 43.6) | 46.9 | 31.6 | |||||
YenKu_NTU_task4_SED_2 | DCASE2020 SED | Yen2020 | 42.6 (41.8 - 43.7) | 47.5 | 29.8 | |||||
YenKu_NTU_task4_SED_3 | DCASE2020 SED | Yen2020 | 41.6 (40.6 - 42.7) | 45.5 | 30.9 | |||||
YenKu_NTU_task4_SED_1 | DCASE2020 SED | Yen2020 | 43.6 (42.4 - 44.6) | 48.5 | 30.8 | |||||
Tang_SCU_task4_SED_1 | Basic ResNet block without weakly labeled data augmentation | Tang2020 | 43.1 (42.3 - 44.1) | 0.495 | 46.6 | 0.538 | 33.7 | 0.400 | ||
Tang_SCU_task4_SED_4 | Multi-scale ResNet block with weakly labeled data augmentation | Tang2020 | 44.1 (43.4 - 44.8) | 0.510 | 47.5 | 0.556 | 35.3 | 0.411 | ||
Tang_SCU_task4_SED_2 | Basic ResNet block with weakly labeled data augmentation | Tang2020 | 42.4 (41.4 - 43.4) | 0.506 | 46.3 | 0.552 | 32.1 | 0.400 | ||
Tang_SCU_task4_SED_3 | Multi-scale ResNet block without weakly labeled data augmentation | Tang2020 | 44.1 (43.3 - 45.0) | 0.503 | 48.8 | 0.559 | 32.5 | 0.379 | ||
DCASE2020_SED_baseline_system | DCASE2020 SED baseline system | turpault2020a | 34.9 (34.0 - 35.7) | 0.496 | 38.1 | 0.552 | 27.8 | 0.378 | ||
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | Sound Separation | 36.5 (35.6 - 37.2) | 0.497 | 39.8 | 0.549 | 28.8 | 0.383 | |
Ebbers_UPB_task4_SED_1 | DCASE2020 UPB SED system 1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 50.9 | 38.7 |
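The PSDS (polyphonic sound event detection score) values above summarize performance across several operating points of each system, with a cross-trigger penalty for detections that fire on the wrong class. A minimal sketch with the psds_eval package; the threshold and alpha values are illustrative placeholders, not the official task settings:

```python
# Sketch: PSDS with the psds_eval package. The DataFrames follow the
# package's expected columns; threshold and alpha values below are
# illustrative placeholders, not the official task settings.
import pandas as pd
from psds_eval import PSDSEval

ground_truth = pd.read_csv('ground_truth.tsv', sep='\t')  # filename, onset, offset, event_label
metadata = pd.read_csv('metadata.tsv', sep='\t')          # filename, duration

psds_eval = PSDSEval(dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,
                     ground_truth=ground_truth, metadata=metadata)

# One operating point per decision threshold applied to the system output.
for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    detections = pd.read_csv(f'detections_{threshold}.tsv', sep='\t')
    psds_eval.add_operating_point(detections)

# alpha_ct > 0 penalizes cross-triggers, as in the cross-trigger PSDS above.
result = psds_eval.psds(alpha_ct=1.0, alpha_st=0.0, max_efpr=100)
print(result.value)
```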
Teams ranking
Table including only the best performing system per submitting team.
Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset)
---|---|---|---|---|---
Xiaomi_task4_SED_1 | DCASE2020 SED mean-teacher system | Liang2020 | 36.0 (35.3 - 36.8) | 35.6 | ||
Rykaczewski_Samsung_taks4_SED_2 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 36.0 | ||
Hou_IPS_task4_SED_1 | DCASE2020 WASEDA IPS SED | HouB2020 | 34.9 (34.0 - 35.7) | 40.8 | ||
Miyazaki_NU_task4_SED_1 | Conformer SED | Miyazaki2020 | 51.1 (50.1 - 52.3) | 50.6 ||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Guided multi branch learning | Huang2020 | Sound Separation | 44.7 (43.6 - 46.2) | 47.2 | |
Copiaco_UOW_task4_SED_2 | DCASE2020 SED system copiaco | Copiaco2020a | 7.8 (7.3 - 8.2) | 11.2 | ||
Kim_AiTeR_GIST_SED_4 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.4 (43.5 - 45.2) | 44.0 | ||
LJK_PSH_task4_SED_2 | LJK PSH DCASE2020 Task4 SED 2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 47.9 | ||
Hao_CQU_task4_SED_4 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.8 (46.9 - 49.0) | 50.0 | ||
Zhenwei_Hou_task4_SED_1 | DCASE2020 SED GCA system | HouZ2020 | 45.1 (44.2 - 45.8) | 44.8 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher with Thresholds | deBenito2020 | 38.2 (37.5 - 39.2) | 43.4 | ||
Koh_NTHU_task4_SED_3 | Koh_NTHU_3 | Koh2020 | 46.6 (45.8 - 47.6) | 48.0 | ||
Cornell_UPM-INRIA_task4_SED_1 | UNIVPM-INRIA ensemble DAT+PCEN | Cornell2020 | 44.4 (43.3 - 45.5) | 46.2 | ||
Yao_UESTC_task4_SED_3 | Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 49.5 | ||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_pub | Liu2020 | 45.2 (44.2 - 46.5) | 44.3 | ||
PARK_JHU_task4_SED_2 | PARK_fusion_M | Park2020 | 36.9 (36.1 - 37.7) | 44.9 | ||
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | Sound Separation | 34.5 (33.5 - 35.3) | 36.7 | |
CTK_NU_task4_SED_4 | CTK_NU NMF-CNN-4 | Chan2020 | 46.3 (45.3 - 47.4) | 48.6 | ||
YenKu_NTU_task4_SED_1 | DCASE2020 SED | Yen2020 | 43.6 (42.4 - 44.6) | 45.6 | ||
Tang_SCU_task4_SED_3 | Multi-scale ResNet block without weakly labeled data augmentation | Tang2020 | 44.1 (43.3 - 45.0) | 48.2 | ||
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | Sound Separation | 36.5 (35.6 - 37.2) | 35.6 | |
Ebbers_UPB_task4_SED_1 | DCASE2020 UPB SED system 1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 48.3 |
Supplementary metrics
Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | PSDS Cross-trigger (Evaluation dataset) | Event-based F-score (Public evaluation) | PSDS Cross-trigger (Public evaluation) | Event-based F-score (Vimeo dataset) | PSDS Cross-trigger (Vimeo dataset)
---|---|---|---|---|---|---|---|---|---
Xiaomi_task4_SED_1 | DCASE2020 SED mean-teacher system | Liang2020 | 36.0 (35.3 - 36.8) | 40.7 | 25.3 | |||||
Rykaczewski_Samsung_taks4_SED_2 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 24.0 | 15.7 | |||||
Hou_IPS_task4_SED_1 | DCASE2020 WASEDA IPS SED | HouB2020 | 34.9 (34.0 - 35.7) | 38.1 | 27.8 | |||||
Miyazaki_NU_task4_SED_1 | Conformer SED | Miyazaki2020 | 51.1 (50.1 - 52.3) | 55.7 | 39.6 |||||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Guided multi branch learning | Huang2020 | Sound Separation | 44.7 (43.6 - 46.2) | 49.5 | 32.7 | ||||
Copiaco_UOW_task4_SED_2 | DCASE2020 SED system copiaco | Copiaco2020a | 7.8 (7.3 - 8.2) | 8.5 | 5.8 | |||||
Kim_AiTeR_GIST_SED_4 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.4 (43.5 - 45.2) | 0.641 | 48.0 | 0.698 | 35.5 | 0.522 | ||
LJK_PSH_task4_SED_2 | LJK PSH DCASE2020 Task4 SED 2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 0.602 | 45.8 | 0.653 | 29.7 | 0.513 | ||
Hao_CQU_task4_SED_4 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.8 (46.9 - 49.0) | 52.3 | 35.3 | |||||
Zhenwei_Hou_task4_SED_1 | DCASE2020 SED GCA system | HouZ2020 | 45.1 (44.2 - 45.8) | 0.600 | 49.0 | 0.654 | 35.2 | 0.474 | ||
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher with Thresholds | deBenito2020 | 38.2 (37.5 - 39.2) | 0.575 | 42.0 | 0.630 | 29.1 | 0.460 | ||
Koh_NTHU_task4_SED_3 | Koh_NTHU_3 | Koh2020 | 46.6 (45.8 - 47.6) | 0.584 | 51.5 | 0.636 | 34.5 | 0.476 | ||
Cornell_UPM-INRIA_task4_SED_1 | UNIVPM-INRIA ensemble DAT+PCEN | Cornell2020 | 44.4 (43.3 - 45.5) | 48.6 | 33.8 | |||||
Yao_UESTC_task4_SED_3 | Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 50.5 | 36.0 | |||||
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_pub | Liu2020 | 45.2 (44.2 - 46.5) | 51.2 | 30.3 | |||||
PARK_JHU_task4_SED_2 | PARK_fusion_M | Park2020 | 36.9 (36.1 - 37.7) | 40.2 | 28.7 | |||||
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | Sound Separation | 34.5 (33.5 - 35.3) | 37.8 | 26.9 | ||||
CTK_NU_task4_SED_4 | CTK_NU NMF-CNN-4 | Chan2020 | 46.3 (45.3 - 47.4) | 0.534 | 50.5 | 0.567 | 35.3 | 0.460 | ||
YenKu_NTU_task4_SED_1 | DCASE2020 SED | Yen2020 | 43.6 (42.4 - 44.6) | 48.5 | 30.8 | |||||
Tang_SCU_task4_SED_3 | Multi-scale ResNet block without weakly labeled data augmentation | Tang2020 | 44.1 (43.3 - 45.0) | 0.503 | 48.8 | 0.559 | 32.5 | 0.379 | ||
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | Sound Separation | 36.5 (35.6 - 37.2) | 0.497 | 39.8 | 0.549 | 28.8 | 0.383 | |
Ebbers_UPB_task4_SED_1 | DCASE2020 UPB SED system 1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 50.9 | 38.7 |
Combined SS and SED ranking
System ranking
Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Public evaluation) | Event-based F-score (Vimeo dataset)
---|---|---|---|---|---
Copiaco_UOW_task4_SS_SED_1 | DCASE2020 SS+SED system copiaco | Copiaco2020b | 6.9 (6.7 - 7.2) | 7.3 | 5.7 | |
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | 34.5 (33.5 - 35.3) | 37.8 | 26.9 | |
Cornell_UPM-INRIA_task4_SS_SED_1 | UNIVPM-INRIA separation hmm | Cornell2020 | 38.6 (37.5 - 39.6) | 42.3 | 29.4 | |
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Guided multi branch learning | Huang2020 | 44.1 (42.9 - 45.4) | 48.6 | 32.8 | |
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Guided multi branch learning | Huang2020 | 44.4 (43.2 - 45.8) | 49.3 | 32.2 | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Guided_and_multi_branch_learning | Huang2020 | 44.5 (43.3 - 46.0) | 49.3 | 32.6 | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Guided multi branch learning | Huang2020 | 44.7 (43.6 - 46.2) | 49.5 | 32.7 | |
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | 36.5 (35.6 - 37.2) | 39.8 | 28.8 |
Team ranking
Table including only the best performing system per submitting team.
Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Public evaluation) | Event-based F-score (Vimeo dataset)
---|---|---|---|---|---
Copiaco_UOW_task4_SS_SED_1 | DCASE2020 SS+SED system copiaco | Copiaco2020b | 6.9 (6.7 - 7.2) | 7.3 | 5.7 | |
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | 34.5 (33.5 - 35.3) | 37.8 | 26.9 | |
Cornell_UPM-INRIA_task4_SS_SED_1 | UNIVPM-INRIA separation hmm | Cornell2020 | 38.6 (37.5 - 39.6) | 42.3 | 29.4 | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Guided multi branch learning | Huang2020 | 44.7 (43.6 - 46.2) | 49.5 | 32.7 | |
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | 36.5 (35.6 - 37.2) | 39.8 | 28.8 |
SS ranking
Submission code | Submission name | Technical Report | Single-source SI-SNR (Evaluation dataset) | Multi-source SI-SNRi (Evaluation dataset)
---|---|---|---|---
Xiaomi_task4_SS_1 | DCASE2020 SS system (STFT) | Liang2020 | 33.8 | 7.0 | |
Xiaomi_task4_SS_2 | DCASE2020 SS system (STFT + MFCC) | Liang2020 | 31.8 | 8.4 | |
DCASE2020_SS_baseline_system | DCASE2020 SS baseline system | kavalerov2019universal | 37.6 | 12.5 | |
Hou_IPS_task4_SS_1 | DCASE2020 WASEDA IPS SS | HouB2020 | 37.6 | 12.5 |
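Both separation metrics are scale-invariant: SI-SNR projects the estimate onto the reference signal before measuring the residual energy, and SI-SNRi reports the improvement over leaving the mixture unprocessed. A minimal NumPy sketch:

```python
# Sketch: scale-invariant SNR (SI-SNR) and its improvement (SI-SNRi).
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Scale-invariant target: project the estimate onto the reference.
    target = (estimate @ reference) / (reference @ reference + eps) * reference
    noise = estimate - target
    return 10.0 * np.log10((target @ target) / (noise @ noise + eps))

def si_snri(estimate, reference, mixture):
    # Improvement in dB over feeding the unprocessed mixture through.
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Scores are in dB; a positive SI-SNRi means the separated estimate is closer to the reference than the original mixture was.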
Class-wise performance
Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Alarm Bell Ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Xiaomi_task4_SED_1 | DCASE2020 SED mean-teacher system | Liang2020 | 36.0 (35.3 - 36.8) | 29.8 | 39.3 | 64.4 | 26.6 | 35.4 | 29.8 | 32.8 | 24.0 | 55.1 | 23.7 | |
Rykaczewski_Samsung_taks4_SED_3 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (21.0 - 22.4) | 36.5 | 5.0 | 50.7 | 22.5 | 37.1 | 5.8 | 9.3 | 1.6 | 43.1 | 3.8 | |
Rykaczewski_Samsung_taks4_SED_2 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 37.6 | 5.1 | 51.7 | 23.0 | 37.0 | 6.4 | 9.1 | 1.6 | 43.0 | 3.8 | |
Rykaczewski_Samsung_taks4_SED_4 | DCASE2020 SED CNNR | Rykaczewski2020 | 10.4 (9.7 - 11.1) | 10.2 | 10.3 | 0.5 | 0.7 | 0.6 | 20.0 | 33.0 | 8.4 | 1.2 | 20.3 | |
Rykaczewski_Samsung_taks4_SED_1 | DCASE2020 SED CNNR | Rykaczewski2020 | 21.6 (20.8 - 22.4) | 36.7 | 4.3 | 51.7 | 22.3 | 36.6 | 5.7 | 9.3 | 1.2 | 43.1 | 3.8 | |
Hou_IPS_task4_SED_1 | DCASE2020 WASEDA IPS SED | HouB2020 | 34.9 (34.0 - 35.7) | 35.9 | 37.0 | 62.3 | 26.0 | 27.1 | 25.9 | 24.7 | 24.3 | 48.2 | 39.0 | |
Miyazaki_NU_task4_SED_1 | Conformer SED | Miyazaki2020 | 51.1 (50.1 - 52.3) | 40.9 | 51.5 | 67.3 | 38.9 | 51.4 | 46.7 | 53.1 | 35.3 | 64.4 | 63.0 |
Miyazaki_NU_task4_SED_2 | Transformer SED | Miyazaki2020 | 46.4 (45.5 - 47.5) | 33.5 | 51.4 | 56.2 | 39.6 | 49.2 | 42.2 | 47.0 | 27.7 | 63.0 | 55.4 |
Miyazaki_NU_task4_SED_3 | Transformer-Conformer ensemble SED | Miyazaki2020 | 50.7 (49.6 - 51.9) | 43.7 | 53.8 | 64.1 | 39.5 | 52.0 | 48.2 | 47.5 | 29.0 | 66.6 | 62.7 |
Huang_ICT-TOSHIBA_task4_SED_3 | guided multi-branch learning | Huang2020 | 44.3 (43.4 - 45.4) | 41.0 | 49.0 | 55.3 | 26.1 | 44.8 | 40.3 | 43.1 | 27.0 | 58.6 | 56.2 | |
Huang_ICT-TOSHIBA_task4_SED_1 | guided multi-branch learning | Huang2020 | 44.6 (43.5 - 46.0) | 43.4 | 42.0 | 55.4 | 25.9 | 38.8 | 41.5 | 46.2 | 30.2 | 63.1 | 60.3 | |
Huang_ICT-TOSHIBA_task4_SS_SED_4 | guided multi branch learning | Huang2020 | 44.1 (42.9 - 45.4) | 41.4 | 42.3 | 56.8 | 28.4 | 37.4 | 39.0 | 47.1 | 30.0 | 59.7 | 59.1 | |
Huang_ICT-TOSHIBA_task4_SED_4 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 41.2 | 40.4 | 54.7 | 24.8 | 40.0 | 42.6 | 47.7 | 30.4 | 62.3 | 59.1 | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | guided multi branch learning | Huang2020 | 44.7 (43.6 - 46.2) | 42.2 | 43.4 | 57.1 | 27.7 | 38.1 | 40.2 | 48.8 | 30.8 | 60.4 | 59.1 | |
Huang_ICT-TOSHIBA_task4_SED_2 | guided multi-branch learning | Huang2020 | 44.3 (43.2 - 45.6) | 41.0 | 40.4 | 55.0 | 24.3 | 40.1 | 42.6 | 47.7 | 30.4 | 62.6 | 59.1 | |
Huang_ICT-TOSHIBA_task4_SS_SED_3 | guided multi branch learning | Huang2020 | 44.4 (43.2 - 45.8) | 42.5 | 42.7 | 56.9 | 27.5 | 38.5 | 39.8 | 46.6 | 30.0 | 61.0 | 59.0 | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | guided_and_multi_branch_learning | Huang2020 | 44.5 (43.3 - 46.0) | 41.9 | 41.3 | 56.5 | 27.8 | 37.6 | 42.0 | 47.6 | 30.0 | 61.7 | 59.0 | |
Copiaco_UOW_task4_SED_2 | DCASE2020 SED system copiaco | Copiaco2020a | 7.8 (7.3 - 8.2) | 3.8 | 9.7 | 13.1 | 3.2 | 5.1 | 4.1 | 1.0 | 9.3 | 10.4 | 18.0 | |
Copiaco_UOW_task4_SED_1 | DCASE2020 SED system copiaco | Copiaco2020a | 7.5 (7.0 - 8.0) | 4.3 | 11.7 | 13.6 | 1.6 | 6.2 | 1.3 | 1.2 | 5.3 | 12.1 | 17.0 | |
Kim_AiTeR_GIST_SED_1 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.7 (42.8 - 44.7) | 26.8 | 55.1 | 67.9 | 27.9 | 32.8 | 27.1 | 44.4 | 29.4 | 65.7 | 60.8 | |
Kim_AiTeR_GIST_SED_2 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 43.9 (43.0 - 44.7) | 26.5 | 53.6 | 68.6 | 28.1 | 33.9 | 29.7 | 43.6 | 31.6 | 64.9 | 59.4 | |
Kim_AiTeR_GIST_SED_4 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.4 (43.5 - 45.2) | 28.2 | 56.8 | 68.3 | 29.8 | 34.1 | 28.9 | 42.7 | 30.0 | 65.4 | 60.4 | |
Kim_AiTeR_GIST_SED_3 | DCASE 2020 task4 SED system with semi-supervised loss function | Kim2020 | 44.2 (43.4 - 45.1) | 27.2 | 58.2 | 67.2 | 28.7 | 34.3 | 26.7 | 44.3 | 31.9 | 66.3 | 58.1 | |
Copiaco_UOW_task4_SS_SED_1 | DCASE2020 SS+SED system copiaco | Copiaco2020b | 6.9 (6.7 - 7.2) | 3.9 | 5.8 | 15.0 | 1.0 | 6.3 | 1.3 | 2.2 | 3.4 | 15.7 | 14.3 | |
LJK_PSH_task4_SED_3 | LJK PSH DCASE2020 Task4 SED 3 | JiaKai2020 | 38.6 (37.7 - 39.7) | 34.5 | 41.8 | 57.1 | 28.7 | 31.7 | 39.7 | 38.5 | 23.4 | 54.9 | 36.4 | |
LJK_PSH_task4_SED_1 | LJK PSH DCASE2020 Task4 SED 1 | JiaKai2020 | 39.3 (38.4 - 40.4) | 33.8 | 37.3 | 59.6 | 28.9 | 33.8 | 38.9 | 36.8 | 28.6 | 55.0 | 41.7 | |
LJK_PSH_task4_SED_2 | LJK PSH DCASE2020 Task4 SED 2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 34.2 | 36.6 | 58.9 | 31.0 | 42.7 | 40.3 | 42.3 | 28.3 | 58.8 | 39.2 | |
LJK_PSH_task4_SED_4 | LJK PSH DCASE2020 Task4 SED 4 | JiaKai2020 | 40.6 (39.6 - 41.6) | 29.7 | 42.6 | 58.5 | 28.5 | 33.4 | 42.2 | 44.2 | 29.7 | 51.2 | 46.0 | |
Hao_CQU_task4_SED_2 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.0 (46.0 - 48.1) | 40.2 | 55.4 | 60.5 | 34.3 | 36.2 | 50.4 | 45.0 | 36.5 | 56.7 | 54.9 | |
Hao_CQU_task4_SED_3 | DCASE2020 cross-domain sound event detection | Hao2020 | 46.3 (45.5 - 47.4) | 45.5 | 48.6 | 59.1 | 30.5 | 33.5 | 51.6 | 47.3 | 33.2 | 57.0 | 56.8 | |
Hao_CQU_task4_SED_1 | DCASE2020 cross-domain sound event detection | Hao2020 | 44.9 (43.9 - 45.8) | 39.5 | 47.7 | 59.4 | 31.8 | 35.6 | 48.0 | 45.2 | 29.1 | 59.0 | 55.2 | |
Hao_CQU_task4_SED_4 | DCASE2020 cross-domain sound event detection | Hao2020 | 47.8 (46.9 - 49.0) | 42.8 | 56.9 | 63.2 | 31.7 | 36.4 | 49.8 | 50.2 | 30.1 | 61.9 | 55.0 | |
Zhenwei_Hou_task4_SED_1 | DCASE2020 SED GCA system | HouZ2020 | 45.1 (44.2 - 45.8) | 35.5 | 50.9 | 63.7 | 30.9 | 41.4 | 42.4 | 40.9 | 26.7 | 63.1 | 56.2 | |
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher with thresholding | deBenito2020 | 38.2 (37.5 - 39.2) | 38.5 | 42.2 | 63.1 | 22.3 | 21.5 | 36.8 | 30.8 | 23.5 | 54.0 | 51.5 | |
deBenito_AUDIAS_task4_SED_1 | 5-Resolution Mean Teacher | de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | 40.3 | 42.4 | 61.5 | 20.8 | 14.5 | 40.9 | 28.5 | 24.3 | 48.4 | 60.4 | |
Koh_NTHU_task4_SED_3 | Koh_NTHU_3 | Koh2020 | 46.6 (45.8 - 47.6) | 38.4 | 50.7 | 66.6 | 29.5 | 42.0 | 49.1 | 44.7 | 24.3 | 66.0 | 56.0 | |
Koh_NTHU_task4_SED_2 | Koh_NTHU_2 | Koh2020 | 45.2 (44.3 - 46.3) | 37.6 | 53.4 | 67.1 | 31.0 | 43.1 | 43.7 | 44.2 | 25.9 | 65.1 | 42.2 | |
Koh_NTHU_task4_SED_1 | Koh_NTHU_1 | Koh2020 | 45.2 (44.2 - 46.1) | 37.6 | 53.4 | 67.1 | 31.0 | 43.1 | 45.8 | 36.8 | 29.1 | 65.1 | 44.0 | |
Koh_NTHU_task4_SED_4 | Koh_NTHU_4 | Koh2020 | 46.3 (45.4 - 47.2) | 38.4 | 50.7 | 66.6 | 29.5 | 42.0 | 44.1 | 50.6 | 23.1 | 66.0 | 53.0 | |
Cornell_UPM-INRIA_task4_SED_2 | UNIVPM-INRIA DAT HMM 1 | Cornell2020 | 42.0 (40.9 - 43.1) | 41.3 | 44.4 | 68.2 | 29.3 | 35.6 | 38.8 | 35.1 | 27.4 | 50.2 | 50.8 | |
Cornell_UPM-INRIA_task4_SED_1 | UNIVPM-INRIA ensemble DAT+PCEN | Cornell2020 | 44.4 (43.3 - 45.5) | 45.2 | 49.4 | 69.8 | 25.2 | 33.0 | 45.2 | 35.3 | 32.4 | 55.3 | 55.0 | |
Cornell_UPM-INRIA_task4_SED_4 | UNIVPM-INRIA ensemble DAT+PCEN HMM 2 | Cornell2020 | 43.2 (42.1 - 44.4) | 40.0 | 50.4 | 63.3 | 26.4 | 33.5 | 46.1 | 39.1 | 25.8 | 52.4 | 57.0 | |
Cornell_UPM-INRIA_task4_SS_SED_1 | UNIVPM-INRIA separation hmm | Cornell2020 | 38.6 (37.5 - 39.6) | 30.3 | 43.4 | 65.6 | 28.3 | 25.2 | 41.9 | 32.4 | 23.8 | 49.1 | 47.6 | |
Cornell_UPM-INRIA_task4_SED_3 | UNIVPM-INRIA ensemble MT+PCEN | Cornell2020 | 42.6 (41.6 - 43.5) | 45.4 | 39.7 | 60.2 | 29.5 | 36.9 | 47.3 | 36.2 | 23.6 | 56.6 | 51.2 | |
Yao_UESTC_task4_SED_1 | Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | 39.5 | 51.4 | 47.8 | 29.7 | 35.1 | 43.3 | 50.0 | 31.8 | 52.5 | 60.9 | |
Yao_UESTC_task4_SED_3 | Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 39.6 | 48.1 | 50.3 | 32.9 | 36.7 | 54.3 | 54.1 | 36.4 | 49.2 | 63.7 | |
Yao_UESTC_task4_SED_2 | Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | 39.6 | 52.3 | 48.7 | 31.5 | 37.5 | 49.8 | 50.8 | 34.3 | 49.7 | 64.0 | |
Yao_UESTC_task4_SED_4 | Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | 40.1 | 54.9 | 53.5 | 33.8 | 22.0 | 53.0 | 56.7 | 36.1 | 49.9 | 62.9 | |
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_valid | Liu2020 | 40.7 (39.7 - 41.7) | 39.8 | 36.0 | 64.6 | 23.7 | 33.1 | 27.2 | 52.1 | 24.3 | 59.8 | 46.7 | |
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_valid | Liu2020 | 41.8 (40.7 - 42.9) | 39.8 | 36.0 | 62.4 | 23.7 | 32.8 | 39.1 | 52.1 | 24.3 | 59.6 | 49.4 | |
Liu_thinkit_task4_SED_1 | MT_PPDA_cg_glu_pub | Liu2020 | 45.2 (44.2 - 46.5) | 50.4 | 38.7 | 66.7 | 24.9 | 36.0 | 40.6 | 52.9 | 30.7 | 59.4 | 52.7 | |
Liu_thinkit_task4_SED_4 | MT_PPDA_finall | Liu2020 | 43.1 (42.1 - 44.2) | 43.6 | 36.3 | 65.5 | 24.2 | 32.6 | 40.2 | 52.2 | 27.2 | 59.6 | 50.2 | |
PARK_JHU_task4_SED_1 | PARK_fusion_P | Park2020 | 35.8 (35.0 - 36.6) | 10.1 | 38.5 | 53.4 | 13.5 | 37.7 | 42.2 | 33.9 | 24.6 | 56.1 | 48.5 | |
PARK_JHU_task4_SED_1 | PARK_fusion_N | Park2020 | 26.5 (25.7 - 27.5) | 10.1 | 38.5 | 33.3 | 13.5 | 37.7 | 18.2 | 32.0 | 15.3 | 25.9 | 40.9 | |
PARK_JHU_task4_SED_2 | PARK_fusion_M | Park2020 | 36.9 (36.1 - 37.7) | 21.0 | 38.5 | 53.4 | 13.5 | 37.7 | 42.2 | 33.9 | 24.6 | 56.1 | 48.5 | |
PARK_JHU_task4_SED_3 | PARK_fusion_L | Park2020 | 34.7 (34.1 - 35.6) | 19.4 | 38.5 | 52.2 | 13.0 | 37.4 | 37.3 | 27.7 | 23.2 | 53.5 | 44.8 | |
Chen_NTHU_task4_SS_SED_1 | DCASE2020 SS+SED system | Chen2020 | 34.5 (33.5 - 35.3) | 37.9 | 36.3 | 58.1 | 23.7 | 26.3 | 23.1 | 26.5 | 24.5 | 48.8 | 40.5 | |
CTK_NU_task4_SED_2 | CTK_NU NMF-CNN-2 | Chan2020 | 44.4 (43.5 - 45.5) | 40.3 | 54.0 | 49.4 | 22.0 | 34.6 | 43.9 | 47.6 | 31.5 | 58.6 | 61.7 | |
CTK_NU_task4_SED_4 | CTK_NU NMF-CNN-4 | Chan2020 | 46.3 (45.3 - 47.4) | 41.6 | 55.1 | 55.3 | 20.1 | 46.4 | 42.2 | 50.2 | 34.3 | 58.7 | 59.9 | |
CTK_NU_task4_SED_3 | CTK_NU NMF-CNN-3 | Chan2020 | 45.8 (45.0 - 47.0) | 44.8 | 52.9 | 54.4 | 20.1 | 41.6 | 42.2 | 49.8 | 33.3 | 59.8 | 59.9 | |
CTK_NU_task4_SED_1 | CTK_NU NMF-CNN-1 | Chan2020 | 43.5 (42.6 - 44.7) | 42.5 | 49.1 | 55.0 | 17.7 | 46.0 | 36.0 | 42.7 | 33.1 | 55.2 | 57.6 | |
YenKu_NTU_task4_SED_4 | DCASE2020 SED | Yen2020 | 42.7 (41.6 - 43.6) | 35.3 | 41.2 | 54.5 | 31.1 | 41.5 | 40.7 | 39.3 | 27.3 | 59.7 | 55.4 | |
YenKu_NTU_task4_SED_2 | DCASE2020 SED | Yen2020 | 42.6 (41.8 - 43.7) | 42.5 | 37.7 | 56.1 | 29.8 | 41.2 | 38.6 | 44.1 | 29.6 | 56.5 | 49.6 | |
YenKu_NTU_task4_SED_3 | DCASE2020 SED | Yen2020 | 41.6 (40.6 - 42.7) | 39.4 | 45.5 | 49.8 | 29.8 | 40.5 | 38.3 | 38.5 | 23.1 | 59.5 | 50.9 | |
YenKu_NTU_task4_SED_1 | DCASE2020 SED | Yen2020 | 43.6 (42.4 - 44.6) | 44.6 | 38.5 | 55.9 | 30.8 | 39.3 | 43.4 | 41.2 | 30.0 | 55.7 | 56.4 | |
Tang_SCU_task4_SED_1 | Basic ResNet block without weakly labeled data augmentation | Tang2020 | 43.1 (42.3 - 44.1) | 34.9 | 46.3 | 63.0 | 27.7 | 42.5 | 33.8 | 37.8 | 27.1 | 67.8 | 49.6 | |
Tang_SCU_task4_SED_4 | Multi-scale ResNet block with weakly labeled data augmentation | Tang2020 | 44.1 (43.4 - 44.8) | 34.4 | 48.6 | 62.7 | 23.8 | 41.0 | 35.9 | 42.8 | 32.3 | 66.7 | 52.8 | |
Tang_SCU_task4_SED_2 | Basic ResNet block with weakly labeled data augmentation | Tang2020 | 42.4 (41.4 - 43.4) | 40.3 | 41.4 | 62.7 | 27.3 | 37.0 | 33.3 | 48.3 | 20.0 | 66.6 | 46.6 | |
Tang_SCU_task4_SED_3 | Multi-scale ResNet block without weakly labeled data augmentation | Tang2020 | 44.1 (43.3 - 45.0) | 30.4 | 45.1 | 64.5 | 29.6 | 40.6 | 35.3 | 48.1 | 27.6 | 63.6 | 57.5 | |
DCASE2020_SED_baseline_system | DCASE2020 SED baseline system | turpault2020a | 34.9 (34.0 - 35.7) | 35.9 | 37.0 | 62.6 | 26.0 | 27.1 | 25.9 | 24.7 | 24.3 | 48.2 | 39.0 | |
DCASE2020_SS_SED_baseline_system | DCASE2020 SS+SED baseline system | turpault2020b | 36.5 (35.6 - 37.2) | 38.7 | 37.5 | 62.8 | 24.5 | 29.6 | 28.0 | 28.0 | 21.5 | 51.6 | 43.7 | |
Ebbers_UPB_task4_SED_1 | DCASE2020 UPB SED system 1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 28.5 | 56.4 | 64.0 | 24.4 | 37.4 | 45.8 | 51.3 | 37.6 | 60.6 | 67.2 |
System characteristics
General characteristics
Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features
---|---|---|---|---|---
Xiaomi_task4_SED_1 | Liang2020 | 36.0 (35.3 - 36.8) | 16kHz | time stretching,pitch shifting,reverberation | log-mel energies | |
Rykaczewski_Samsung_taks4_SED_3 | Rykaczewski2020 | 21.6 (21.0 - 22.4) | 16kHz | log-mel energies | ||
Rykaczewski_Samsung_taks4_SED_2 | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 16kHz | log-mel energies | ||
Rykaczewski_Samsung_taks4_SED_4 | Rykaczewski2020 | 10.4 (9.7 - 11.1) | 16kHz | log-mel energies | ||
Rykaczewski_Samsung_taks4_SED_1 | Rykaczewski2020 | 21.6 (20.8 - 22.4) | 16kHz | log-mel energies | ||
Hou_IPS_task4_SED_1 | HouB2020 | 34.9 (34.0 - 35.7) | 16kHz | log-mel energies | ||
Miyazaki_NU_task4_SED_1 | Miyazaki2020 | 51.1 (50.1 - 52.3) | 16kHz | time shifting, mixup | log-mel energies | |
Miyazaki_NU_task4_SED_2 | Miyazaki2020 | 46.4 (45.5 - 47.5) | 16kHz | time shifting, mixup | log-mel energies | |
Miyazaki_NU_task4_SED_3 | Miyazaki2020 | 50.7 (49.6 - 51.9) | 16kHz | time shifting, mixup | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SED_3 | Huang2020 | 44.3 (43.4 - 45.4) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SED_1 | Huang2020 | 44.6 (43.5 - 46.0) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Huang2020 | 44.1 (42.9 - 45.4) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SED_4 | Huang2020 | 44.3 (43.2 - 45.6) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Huang2020 | 44.7 (43.6 - 46.2) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SED_2 | Huang2020 | 44.3 (43.2 - 45.6) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Huang2020 | 44.4 (43.2 - 45.8) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Huang2020 | 44.5 (43.3 - 46.0) | 44.1kHz | time shifting, frequency shifting | log-mel energies | |
Copiaco_UOW_task4_SED_2 | Copiaco2020a | 7.8 (7.3 - 8.2) | 44.1kHz | scalogram, signal energy, spectral centroid | ||
Copiaco_UOW_task4_SED_1 | Copiaco2020a | 7.5 (7.0 - 8.0) | 16kHz | scalogram, signal energy, spectral centroid | ||
Kim_AiTeR_GIST_SED_1 | Kim2020 | 43.7 (42.8 - 44.7) | 16kHz | mixup, specaugment, Gaussian noise | mel-spectrogram | |
Kim_AiTeR_GIST_SED_2 | Kim2020 | 43.9 (43.0 - 44.7) | 16kHz | mixup, specaugment, Gaussian noise | mel-spectrogram | |
Kim_AiTeR_GIST_SED_4 | Kim2020 | 44.4 (43.5 - 45.2) | 16kHz | mixup, specaugment, Gaussian noise | mel-spectrogram | |
Kim_AiTeR_GIST_SED_3 | Kim2020 | 44.2 (43.4 - 45.1) | 16kHz | mixup, specaugment, Gaussian noise | mel-spectrogram | |
Copiaco_UOW_task4_SS_SED_1 | Copiaco2020b | 6.9 (6.7 - 7.2) | 16kHz | scalogram, spectral centroid, signal energy | ||
LJK_PSH_task4_SED_3 | JiaKai2020 | 38.6 (37.7 - 39.7) | 16kHz | pitch shifting | log-mel energies | |
LJK_PSH_task4_SED_1 | JiaKai2020 | 39.3 (38.4 - 40.4) | 16kHz | pitch shifting | log-mel energies | |
LJK_PSH_task4_SED_2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 16kHz | pitch shifting | log-mel energies | |
LJK_PSH_task4_SED_4 | JiaKai2020 | 40.6 (39.6 - 41.6) | 16kHz | pitch shifting | log-mel energies | |
Hao_CQU_task4_SED_2 | Hao2020 | 47.0 (46.0 - 48.1) | 22.05kHz | log-mel energies | ||
Hao_CQU_task4_SED_3 | Hao2020 | 46.3 (45.5 - 47.4) | 22.05kHz | log-mel energies | ||
Hao_CQU_task4_SED_1 | Hao2020 | 44.9 (43.9 - 45.8) | 22.05kHz | log-mel energies | ||
Hao_CQU_task4_SED_4 | Hao2020 | 47.8 (46.9 - 49.0) | 22.05kHz | log-mel energies | ||
Zhenwei_Hou_task4_SED_1 | HouZ2020 | 45.1 (44.2 - 45.8) | 22.05kHz | log-mel energies | ||
deBenito_AUDIAS_task4_SED_1 | deBenito2020 | 38.2 (37.5 - 39.2) | 16kHz | log-mel energies | ||
deBenito_AUDIAS_task4_SED_1 | de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | 16kHz | log-mel energies | ||
Koh_NTHU_task4_SED_3 | Koh2020 | 46.6 (45.8 - 47.6) | 16kHz | Gaussian noise, mixup, time shifting, pitch shifting | log-mel energies | |
Koh_NTHU_task4_SED_2 | Koh2020 | 45.2 (44.3 - 46.3) | 16kHz | Gaussian noise, mixup, time shifting, pitch shifting | log-mel energies | |
Koh_NTHU_task4_SED_1 | Koh2020 | 45.2 (44.2 - 46.1) | 16kHz | Gaussian noise, mixup, time shifting, pitch shifting | log-mel energies | |
Koh_NTHU_task4_SED_4 | Koh2020 | 46.3 (45.4 - 47.2) | 16kHz | Gaussian noise, mixup, time shifting, pitch shifting | log-mel energies | |
Cornell_UPM-INRIA_task4_SED_2 | Cornell2020 | 42.0 (40.9 - 43.1) | 16kHz | contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll | log-mel energies | |
Cornell_UPM-INRIA_task4_SED_1 | Cornell2020 | 44.4 (43.3 - 45.5) | 16kHz | contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll | log-mel energies | |
Cornell_UPM-INRIA_task4_SED_4 | Cornell2020 | 43.2 (42.1 - 44.4) | 16kHz | contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll | log-mel energies | |
Cornell_UPM-INRIA_task4_SS_SED_1 | Cornell2020 | 38.6 (37.5 - 39.6) | 16kHz | contrast, overdrive, pitch shifting, highshelf, lowshelf | log-mel energies | |
Cornell_UPM-INRIA_task4_SED_3 | Cornell2020 | 42.6 (41.6 - 43.5) | 16kHz | log-mel energies | ||
Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | 16kHz | mixup, time shifting | log-mel energies | |
Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 16kHz | mixup, time shifting | log-mel energies | |
Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | 16kHz | mixup, time shifting | log-mel energies | |
Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | 16kHz | mixup, time shifting | log-mel energies | |
Liu_thinkit_task4_SED_1 | Liu2020 | 40.7 (39.7 - 41.7) | 16kHz | noise addition, pitch shifting, time rolling, dynamic range compression | log-mel energies | |
Liu_thinkit_task4_SED_1 | Liu2020 | 41.8 (40.7 - 42.9) | 16kHz | noise addition, pitch shifting, time rolling, dynamic range compression | log-mel energies | |
Liu_thinkit_task4_SED_1 | Liu2020 | 45.2 (44.2 - 46.5) | 16kHz | noise addition, pitch shifting, time rolling, dynamic range compression | log-mel energies | |
Liu_thinkit_task4_SED_4 | Liu2020 | 43.1 (42.1 - 44.2) | 16kHz | noise addition, pitch shifting, time rolling, dynamic range compression | log-mel energies | |
PARK_JHU_task4_SED_1 | Park2020 | 35.8 (35.0 - 36.6) | 44.1kHz | log-mel energies | ||
PARK_JHU_task4_SED_1 | Park2020 | 26.5 (25.7 - 27.5) | 44.1kHz | log-mel energies | ||
PARK_JHU_task4_SED_2 | Park2020 | 36.9 (36.1 - 37.7) | 44.1kHz | log-mel energies | ||
PARK_JHU_task4_SED_3 | Park2020 | 34.7 (34.1 - 35.6) | 44.1kHz | log-mel energies | ||
Chen_NTHU_task4_SS_SED_1 | Chen2020 | 34.5 (33.5 - 35.3) | 16kHz | sound event separation | log-mel energies | |
CTK_NU_task4_SED_2 | Chan2020 | 44.4 (43.5 - 45.5) | 22.05kHz | Gaussian noise | log-mel energies | |
CTK_NU_task4_SED_4 | Chan2020 | 46.3 (45.3 - 47.4) | 22.05kHz | Gaussian noise | log-mel energies | |
CTK_NU_task4_SED_3 | Chan2020 | 45.8 (45.0 - 47.0) | 22.05kHz | Gaussian noise | log-mel energies | |
CTK_NU_task4_SED_1 | Chan2020 | 43.5 (42.6 - 44.7) | 22.05kHz | Gaussian noise | log-mel energies | |
YenKu_NTU_task4_SED_4 | Yen2020 | 42.7 (41.6 - 43.6) | 16kHz | log-mel spectrogram | ||
YenKu_NTU_task4_SED_2 | Yen2020 | 42.6 (41.8 - 43.7) | 16kHz | log-mel spectrogram | ||
YenKu_NTU_task4_SED_3 | Yen2020 | 41.6 (40.6 - 42.7) | 16kHz | log-mel spectrogram | ||
YenKu_NTU_task4_SED_1 | Yen2020 | 43.6 (42.4 - 44.6) | 16kHz | log-mel spectrogram | ||
Tang_SCU_task4_SED_1 | Tang2020 | 43.1 (42.3 - 44.1) | 22.05kHz | specaugment | log-mel energies | |
Tang_SCU_task4_SED_4 | Tang2020 | 44.1 (43.4 - 44.8) | 22.05kHz | specaugment, time stretching, pitch shifting, time shifting | log-mel energies | |
Tang_SCU_task4_SED_2 | Tang2020 | 42.4 (41.4 - 43.4) | 22.05kHz | specaugment, time stretching, pitch shifting, time shifting | log-mel energies | |
Tang_SCU_task4_SED_3 | Tang2020 | 44.1 (43.3 - 45.0) | 22.05kHz | specaugment | log-mel energies | |
DCASE2020_SED_baseline_system | turpault2020a | 34.9 (34.0 - 35.7) | 16kHz | log-mel energies | ||
DCASE2020_SS_SED_baseline_system | turpault2020b | 36.5 (35.6 - 37.2) | 16kHz | log-mel energies | ||
Ebbers_UPB_task4_SED_1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 16kHz | scaling, mixup, frequency warping, blurring, frequency masking, time masking, Gaussian noise | log-mel energies |
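Several of the augmentation strategies listed above, mixup in particular, operate on batches of extracted features and clip-level labels rather than on raw audio. A minimal sketch of mixup under the usual Beta-mixing formulation (the alpha value is an illustrative assumption, tuned per system in practice):

```python
# Sketch: mixup on a batch of log-mel features and weak (clip-level) labels.
# The alpha value is illustrative; submissions tune it per system.
import numpy as np

def mixup(features, labels, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing coefficient in (0, 1)
    perm = rng.permutation(len(features))  # pair each clip with another
    mixed_features = lam * features + (1.0 - lam) * features[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_features, mixed_labels
```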
Machine learning characteristics
Code | Technical Report | Event-based F-score (Eval) | Classifier | Semi-supervised approach | Post-processing | Segmentation method | Decision making
---|---|---|---|---|---|---|---
Xiaomi_task4_SED_1 | Liang2020 | 36.0 (35.3 - 36.8) | CRNN | mean-teacher student | median filtering (450ms) | |||
Rykaczewski_Samsung_taks4_SED_3 | Rykaczewski2020 | 21.6 (21.0 - 22.4) | CRNN | mean-teacher student | median filtering (93ms) | |||
Rykaczewski_Samsung_taks4_SED_2 | Rykaczewski2020 | 21.9 (21.3 - 22.7) | CRNN | mean-teacher student | median filtering (93ms) | |||
Rykaczewski_Samsung_taks4_SED_4 | Rykaczewski2020 | 10.4 (9.7 - 11.1) | CRNN | mean-teacher student | median filtering (93ms) | |||
Rykaczewski_Samsung_taks4_SED_1 | Rykaczewski2020 | 21.6 (20.8 - 22.4) | CRNN | mean-teacher student | median filtering (93ms) | |||
Hou_IPS_task4_SED_1 | HouB2020 | 34.9 (34.0 - 35.7) | CRNN | mean-teacher student | median filtering (450ms) | |||
Miyazaki_NU_task4_SED_1 | Miyazaki2020 | 51.1 (50.1 - 52.3) | conformer | mean-teacher student | thresholding, median filtering | mean | ||
Miyazaki_NU_task4_SED_2 | Miyazaki2020 | 46.4 (45.5 - 47.5) | transformer | mean-teacher student | thresholding, median filtering | mean | ||
Miyazaki_NU_task4_SED_3 | Miyazaki2020 | 50.7 (49.6 - 51.9) | transformer, conformer | mean-teacher student | thresholding, median filtering | mean | ||
Huang_ICT-TOSHIBA_task4_SED_3 | Huang2020 | 44.3 (43.4 - 45.4) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SED_1 | Huang2020 | 44.6 (43.5 - 46.0) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Huang2020 | 44.1 (42.9 - 45.4) | CNN | guided multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SED_4 | Huang2020 | 44.3 (43.2 - 45.6) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Huang2020 | 44.7 (43.6 - 46.2) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SED_2 | Huang2020 | 44.3 (43.2 - 45.6) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Huang2020 | 44.4 (43.2 - 45.8) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Huang2020 | 44.5 (43.3 - 46.0) | CNN | guided and multi branch learning | median filtering (with adaptive window size) | attention layers, global_max_pooling, global average pooling | ||
Copiaco_UOW_task4_SED_2 | Copiaco2020a | 7.8 (7.3 - 8.2) | AlexNet, transfer learning CNN | mean-teacher student | median filtering (93ms) | P-norm | ||
Copiaco_UOW_task4_SED_1 | Copiaco2020a | 7.5 (7.0 - 8.0) | AlexNet, transfer learning CNN | mean-teacher student | median filtering (93ms) | P-norm | ||
Kim_AiTeR_GIST_SED_1 | Kim2020 | 43.7 (42.8 - 44.7) | CRNN | pseudo-labelling | median filtering (79ms) | thresholding | ||
Kim_AiTeR_GIST_SED_2 | Kim2020 | 43.9 (43.0 - 44.7) | CRNN | pseudo-labelling | median filtering (79ms) | thresholding | ||
Kim_AiTeR_GIST_SED_4 | Kim2020 | 44.4 (43.5 - 45.2) | CRNN | pseudo-labelling | median filtering (79ms) | thresholding | ||
Kim_AiTeR_GIST_SED_3 | Kim2020 | 44.2 (43.4 - 45.1) | CRNN | pseudo-labelling | median filtering (79ms) | thresholding | ||
Copiaco_UOW_task4_SS_SED_1 | Copiaco2020b | 6.9 (6.7 - 7.2) | AlexNet, transfer learning CNN | mean-teacher student | median filtering (93ms) | P-norm | ||
LJK_PSH_task4_SED_3 | JiaKai2020 | 38.6 (37.7 - 39.7) | CRNN, ensemble | mean-teacher student | median filtering (class-dependent) | attention layers | macro F1 vote | |
LJK_PSH_task4_SED_1 | JiaKai2020 | 39.3 (38.4 - 40.4) | CRNN, ensemble | mean-teacher student | median filtering (class-dependent) | attention layers | macro F1 vote | |
LJK_PSH_task4_SED_2 | JiaKai2020 | 41.2 (40.1 - 42.4) | CRNN, ensemble | mean-teacher student | median filtering (class-dependent) | attention layers | macro F1 vote | |
LJK_PSH_task4_SED_4 | JiaKai2020 | 40.6 (39.6 - 41.6) | CRNN, ensemble | mean-teacher student | median filtering (class-dependent) | attention layers | macro F1 vote | |
Hao_CQU_task4_SED_2 | Hao2020 | 47.0 (46.0 - 48.1) | CRNN | domain adaptation | median filtering (with adaptive window size) | |||
Hao_CQU_task4_SED_3 | Hao2020 | 46.3 (45.5 - 47.4) | CRNN | domain adaptation | median filtering (with adaptive window size) | |||
Hao_CQU_task4_SED_1 | Hao2020 | 44.9 (43.9 - 45.8) | CRNN | domain adaptation | median filtering (with adaptive window size) | |||
Hao_CQU_task4_SED_4 | Hao2020 | 47.8 (46.9 - 49.0) | CRNN | domain adaptation | median filtering (with adaptive window size) | |||
Zhenwei_Hou_task4_SED_1 | HouZ2020 | 45.1 (44.2 - 45.8) | CRNN | mean-teacher student | median filtering (93ms) | mean | ||
deBenito_AUDIAS_task4_SED_1 | deBenito2020 | 38.2 (37.5 - 39.2) | CRNN | mean-teacher student | median filtering (45ms) | mean, class-specific thresholding | ||
deBenito_AUDIAS_task4_SED_1 | de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | CRNN | mean-teacher student | median filtering (45ms) | mean | ||
Koh_NTHU_task4_SED_3 | Koh2020 | 46.6 (45.8 - 47.6) | FP-CRNN | mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling | median filtering (0.45s) | mean probabilities, thresholding | ||
Koh_NTHU_task4_SED_2 | Koh2020 | 45.2 (44.3 - 46.3) | CRNN | mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling | median filtering (adaptive window size) | mean probabilities, thresholding | ||
Koh_NTHU_task4_SED_1 | Koh2020 | 45.2 (44.2 - 46.1) | CRNN | mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling | median filtering (0.45s) | mean probabilities, thresholding | ||
Koh_NTHU_task4_SED_4 | Koh2020 | 46.3 (45.4 - 47.2) | FP-CRNN | mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling | median filtering (adaptive window size) | mean probabilities, thresholding | ||
Cornell_UPM-INRIA_task4_SED_2 | Cornell2020 | 42.0 (40.9 - 43.1) | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Cornell_UPM-INRIA_task4_SED_1 | Cornell2020 | 44.4 (43.3 - 45.5) | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Cornell_UPM-INRIA_task4_SED_4 | Cornell2020 | 43.2 (42.1 - 44.4) | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Cornell_UPM-INRIA_task4_SS_SED_1 | Cornell2020 | 38.6 (37.5 - 39.6) | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Cornell_UPM-INRIA_task4_SED_3 | Cornell2020 | 42.6 (41.6 - 43.5) | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | CRNN | mean-teacher student | median filtering (class-dependent) | attention layers | CRNN | |
Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | CRNN | mean-teacher student | median filtering (class-dependent) | attention layers | CRNN | |
Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | CRNN | mean-teacher student | median filtering (class-dependent) | attention layers | CRNN | |
Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | CRNN | mean-teacher student | median filtering (class-dependent) | attention layers | CRNN | |
Liu_thinkit_task4_SED_1 | Liu2020 | 40.7 (39.7 - 41.7) | CRNN | mean-teacher student | median filtering (class-dependent) | weighted mean | ||
Liu_thinkit_task4_SED_1 | Liu2020 | 41.8 (40.7 - 42.9) | CRNN | mean-teacher student | median filtering (class-dependent) | weighted mean | ||
Liu_thinkit_task4_SED_1 | Liu2020 | 45.2 (44.2 - 46.5) | CRNN | mean-teacher student | median filtering (class-dependent) | weighted mean | ||
Liu_thinkit_task4_SED_4 | Liu2020 | 43.1 (42.1 - 44.2) | CRNN | mean-teacher student | median filtering (class-dependent) | weighted mean | ||
PARK_JHU_task4_SED_1 | Park2020 | 35.8 (35.0 - 36.6) | CRNN | mean-teacher student | median filtering (450ms) | weighted mean | ||
PARK_JHU_task4_SED_1 | Park2020 | 26.5 (25.7 - 27.5) | CRNN | mean-teacher student | median filtering (450ms) | weighted mean | ||
PARK_JHU_task4_SED_2 | Park2020 | 36.9 (36.1 - 37.7) | CRNN | mean-teacher student | median filtering (450ms) | weighted mean | ||
PARK_JHU_task4_SED_3 | Park2020 | 34.7 (34.1 - 35.6) | CRNN | mean-teacher student | median filtering (450ms) | weighted mean | ||
Chen_NTHU_task4_SS_SED_1 | Chen2020 | 34.5 (33.5 - 35.3) | CRNN | mean-teacher student | median filtering (93ms) | P-norm | ||
CTK_NU_task4_SED_2 | Chan2020 | 44.4 (43.5 - 45.5) | NMF, CNN | consistency enforcement | median filtering | |||
CTK_NU_task4_SED_4 | Chan2020 | 46.3 (45.3 - 47.4) | NMF, CNN | consistency enforcement | median filtering | |||
CTK_NU_task4_SED_3 | Chan2020 | 45.8 (45.0 - 47.0) | NMF, CNN | consistency enforcement | median filtering | |||
CTK_NU_task4_SED_1 | Chan2020 | 43.5 (42.6 - 44.7) | NMF, CNN | consistency enforcement | median filtering | |||
YenKu_NTU_task4_SED_4 | Yen2020 | 42.7 (41.6 - 43.6) | CRNN | guided learning pseudo-labelling, mean-teacher student | median filtering (with adaptive window size) | |||
YenKu_NTU_task4_SED_2 | Yen2020 | 42.6 (41.8 - 43.7) | CRNN | guided learning pseudo-labelling, mean-teacher student | median filtering (with adaptive window size) | |||
YenKu_NTU_task4_SED_3 | Yen2020 | 41.6 (40.6 - 42.7) | CRNN | guided learning pseudo-labelling, mean-teacher student | median filtering (with adaptive window size) | |||
YenKu_NTU_task4_SED_1 | Yen2020 | 43.6 (42.4 - 44.6) | CRNN | guided learning pseudo-labelling, mean-teacher student | median filtering (with adaptive window size) | |||
Tang_SCU_task4_SED_1 | Tang2020 | 43.1 (42.3 - 44.1) | CRNN | mean-teacher student | median filtering | |||
Tang_SCU_task4_SED_4 | Tang2020 | 44.1 (43.4 - 44.8) | CRNN | mean-teacher student | median filtering | |||
Tang_SCU_task4_SED_2 | Tang2020 | 42.4 (41.4 - 43.4) | CRNN | mean-teacher student | median filtering | |||
Tang_SCU_task4_SED_3 | Tang2020 | 44.1 (43.3 - 45.0) | CRNN | mean-teacher student | median filtering | |||
DCASE2020_SED_baseline_system | turpault2020a | 34.9 (34.0 - 35.7) | CRNN | mean-teacher student | median filtering (93ms) | |||
DCASE2020_SS_SED_baseline_system | turpault2020b | 36.5 (35.6 - 37.2) | CRNN | mean-teacher student | median filtering (93ms) | P-norm | ||
Ebbers_UPB_task4_SED_1 | Ebbers2020 | 47.2 (46.5 - 48.1) | CRNN | pseudo-labelling | median filtering (200-400ms) | small segment tagging + pseudo-labelled training | mean |
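Many of the post-processing entries in the table above are median filters applied to frame-level scores, in some cases with class-dependent window lengths. A minimal sketch of that step (the window lengths below are illustrative, not any submission's tuned values):

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_predictions(frame_probs, win_frames):
    """Median-filter frame-level class probabilities.

    frame_probs: (n_frames, n_classes) array of scores in [0, 1].
    win_frames: per-class window lengths in frames (class-dependent filtering).
    """
    smoothed = np.empty_like(frame_probs)
    for c, win in enumerate(win_frames):
        smoothed[:, c] = median_filter(frame_probs[:, c], size=win)
    return smoothed

# Illustrative only: short windows for brief events, long for sustained ones.
probs = np.random.rand(625, 10)          # ~10 s clip at 16 ms hop, 10 classes
windows = [3, 3, 7, 7, 15, 15, 27, 27, 45, 45]
events = smooth_predictions(probs, windows) > 0.5
```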
Complexity
Rank | Code |
Technical Report |
Event-based F-score (Eval) |
Model complexity |
Ensemble subsystems |
Training time |
---|---|---|---|---|---|---|
Xiaomi_task4_SED_1 | Liang2020 | 36.0 (35.3 - 36.8) | 1112420 | 3h (1 v100) | ||
Rykaczewski_Samsung_taks4_SED_3 | Rykaczewski2020 | 21.6 (21.0 - 22.4) | 165460 | 8h (1 GTX 1080 Ti) | ||
Rykaczewski_Samsung_taks4_SED_2 | Rykaczewski2020 | 21.9 (21.3 - 22.7) | 165460 | 8h (1 GTX 1080 Ti) | ||
Rykaczewski_Samsung_taks4_SED_4 | Rykaczewski2020 | 10.4 (9.7 - 11.1) | 165460 | 8h (1 GTX 1080 Ti) | ||
Rykaczewski_Samsung_taks4_SED_1 | Rykaczewski2020 | 21.6 (20.8 - 22.4) | 165460 | 8h (1 GTX 1080 Ti) | ||
Hou_IPS_task4_SED_1 | HouB2020 | 34.9 (34.0 - 35.7) | 1112420 | 12h (1 k80) | ||
Miyazaki_NU_task4_SED_1 | Miyazaki2020 | 51.1 (50.1 - 52.3) | 17097712 | 8 | 12h (1 TITAN Xp) | |
Miyazaki_NU_task4_SED_2 | Miyazaki2020 | 46.4 (45.5 - 47.5) | 72107714 | 7 | 12h (1 TITAN Xp) | |
Miyazaki_NU_task4_SED_3 | Miyazaki2020 | 50.7 (49.6 - 51.9) | 89205426 | 15 | 12h (1 TITAN Xp) | |
Huang_ICT-TOSHIBA_task4_SED_3 | Huang2020 | 44.3 (43.4 - 45.4) | 1145928 | 9h (GTX 1080 Ti) | ||
Huang_ICT-TOSHIBA_task4_SED_1 | Huang2020 | 44.6 (43.5 - 46.0) | 6878788 | 6 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Huang2020 | 44.1 (42.9 - 45.4) | 10316572 | 9 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SED_4 | Huang2020 | 44.3 (43.2 - 45.6) | 6878788 | 6 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Huang2020 | 44.7 (43.6 - 46.2) | 10316572 | 9 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SED_2 | Huang2020 | 44.3 (43.2 - 45.6) | 6878788 | 6 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Huang2020 | 44.4 (43.2 - 45.8) | 6878788 | 9 | 5h for each model (GTX 1080 Ti) | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Huang2020 | 44.5 (43.3 - 46.0) | 10316572 | 9 | 5h for each model (GTX 1080 Ti) | |
Copiaco_UOW_task4_SED_2 | Copiaco2020a | 7.8 (7.3 - 8.2) | 60000000 | 60h (Intel Core i7-5600U CPU @ 2.60GHz) | |
Copiaco_UOW_task4_SED_1 | Copiaco2020a | 7.5 (7.0 - 8.0) | 60000000 | 60h (Intel Core i7-5600U CPU @ 2.60GHz) | |
Kim_AiTeR_GIST_SED_1 | Kim2020 | 43.7 (42.8 - 44.7) | 13049904 | 5 | 12h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_2 | Kim2020 | 43.9 (43.0 - 44.7) | 13049904 | 5 | 12h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_4 | Kim2020 | 44.4 (43.5 - 45.2) | 13049904 | 5 | 12h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_3 | Kim2020 | 44.2 (43.4 - 45.1) | 13049904 | 5 | 12h (1 GTX 1080 Ti) | |
Copiaco_UOW_task4_SS_SED_1 | Copiaco2020b | 6.9 (6.7 - 7.2) | 6000000 | 60h (Intel Core i7-5600U CPU @ 2.60GHz) | |
LJK_PSH_task4_SED_3 | JiaKai2020 | 38.6 (37.7 - 39.7) | 1042132 | 5 | 2.5h (1 RTX 2080 Ti) | |
LJK_PSH_task4_SED_1 | JiaKai2020 | 39.3 (38.4 - 40.4) | 1042132 | 7 | 2.5h (1 RTX 2080 Ti) | |
LJK_PSH_task4_SED_2 | JiaKai2020 | 41.2 (40.1 - 42.4) | 1042132 | 10 | 2.5h (1 RTX 2080 Ti) | |
LJK_PSH_task4_SED_4 | JiaKai2020 | 40.6 (39.6 - 41.6) | 1042132 | 5 | 2.5h (1 RTX 2080 Ti) | |
Hao_CQU_task4_SED_2 | Hao2020 | 47.0 (46.0 - 48.1) | 1914974 | |||
Hao_CQU_task4_SED_3 | Hao2020 | 46.3 (45.5 - 47.4) | 2178270 | |||
Hao_CQU_task4_SED_1 | Hao2020 | 44.9 (43.9 - 45.8) | 1916264 | |||
Hao_CQU_task4_SED_4 | Hao2020 | 47.8 (46.9 - 49.0) | 2134091 | |||
Zhenwei_Hou_task4_SED_1 | HouZ2020 | 45.1 (44.2 - 45.8) | 1112420 | 6 | 9h (1 GTX 1080 Ti) | |
deBenito_AUDIAS_task4_SED_1 | deBenito2020 | 38.2 (37.5 - 39.2) | 5562100 | 5 | 10h (1 RTX 2080) | |
deBenito_AUDIAS_task4_SED_2 | deBenito2020 | 37.9 (37.0 - 39.1) | 5562100 | 5 | 10h (1 RTX 2080) | |
Koh_NTHU_task4_SED_3 | Koh2020 | 46.6 (45.8 - 47.6) | 12807540 | 5 | 19h (1 GTX 1080 Ti) | |
Koh_NTHU_task4_SED_2 | Koh2020 | 45.2 (44.3 - 46.3) | 3337260 | 3 | 14h (1 GTX 1080 Ti) | |
Koh_NTHU_task4_SED_1 | Koh2020 | 45.2 (44.2 - 46.1) | 3337260 | 3 | 14h (1 GTX 1080 Ti) | |
Koh_NTHU_task4_SED_4 | Koh2020 | 46.3 (45.4 - 47.2) | 12807540 | 5 | 19h (1 GTX 1080 Ti) | |
Cornell_UPM-INRIA_task4_SED_2 | Cornell2020 | 42.0 (40.9 - 43.1) | 1112420 | 48h (1 GTX 1080) | ||
Cornell_UPM-INRIA_task4_SED_1 | Cornell2020 | 44.4 (43.3 - 45.5) | 3337260 | 3 | 48h (1 GTX 1080) | |
Cornell_UPM-INRIA_task4_SED_4 | Cornell2020 | 43.2 (42.1 - 44.4) | 3337260 | 3 | 48h (1 GTX 1080) | |
Cornell_UPM-INRIA_task4_SS_SED_1 | Cornell2020 | 38.6 (37.5 - 39.6) | 1112420 | 3h (1 GTX 1080) | ||
Cornell_UPM-INRIA_task4_SED_3 | Cornell2020 | 42.6 (41.6 - 43.5) | 1113082 | 6h (1 GTX 1080) | ||
Yao_UESTC_task4_SED_1 | Yao2020 | 44.1 (43.1 - 45.2) | 2051582 | 4 | 17h (1 RTX 2070) | |
Yao_UESTC_task4_SED_3 | Yao2020 | 46.4 (45.3 - 47.6) | 6726008 | 6 | 21h (1 RTX 2070) | |
Yao_UESTC_task4_SED_2 | Yao2020 | 45.7 (44.7 - 47.0) | 6726008 | 6 | 21h (1 RTX 2070) | |
Yao_UESTC_task4_SED_4 | Yao2020 | 46.2 (45.2 - 47.0) | 12710274 | 4 | 24h (1 RTX 2070) | |
Liu_thinkit_task4_SED_1 | Liu2020 | 40.7 (39.7 - 41.7) | 1112421 | 1 | 48h (1 GTX 1080 Ti) | |
Liu_thinkit_task4_SED_2 | Liu2020 | 41.8 (40.7 - 42.9) | 1112421 | 6 | 48h (1 GTX 1080 Ti) | |
Liu_thinkit_task4_SED_3 | Liu2020 | 45.2 (44.2 - 46.5) | 1112421 | 6 | 48h (1 GTX 1080 Ti) | |
Liu_thinkit_task4_SED_4 | Liu2020 | 43.1 (42.1 - 44.2) | 1112421 | 6 | 48h (1 GTX 1080 Ti) | |
PARK_JHU_task4_SED_1 | Park2020 | 35.8 (35.0 - 36.6) | 1112420 | 5 | 3h (1 GTX 1080 Ti) | |
PARK_JHU_task4_SED_4 | Park2020 | 26.5 (25.7 - 27.5) | 1112420 | 5 | 3h (1 GTX 1080 Ti) | |
PARK_JHU_task4_SED_2 | Park2020 | 36.9 (36.1 - 37.7) | 1112420 | 5 | 3h (1 GTX 1080 Ti) | |
PARK_JHU_task4_SED_3 | Park2020 | 34.7 (34.1 - 35.6) | 1112420 | 5 | 3h (1 GTX 1080 Ti) | |
Chen_NTHU_task4_SS_SED_1 | Chen2020 | 34.5 (33.5 - 35.3) | 1112420 | 3 | 3h (1 GTX 1080 Ti) | |
CTK_NU_task4_SED_2 | Chan2020 | 44.4 (43.5 - 45.5) | 5038932 | 11h (1 GTX 1060) | ||
CTK_NU_task4_SED_4 | Chan2020 | 46.3 (45.3 - 47.4) | 10077864 | 2 | 17h (1 GTX 1060) | |
CTK_NU_task4_SED_3 | Chan2020 | 45.8 (45.0 - 47.0) | 10077864 | 2 | 17h (1 GTX 1060) | |
CTK_NU_task4_SED_1 | Chan2020 | 43.5 (42.6 - 44.7) | 5038932 | 6h (1 GTX 1060) | ||
YenKu_NTU_task4_SED_4 | Yen2020 | 42.7 (41.6 - 43.6) | 1403370 | 5h (1 GTX TITAN) | ||
YenKu_NTU_task4_SED_2 | Yen2020 | 42.6 (41.8 - 43.7) | 1403370 | 5h (1 GTX TITAN) | ||
YenKu_NTU_task4_SED_3 | Yen2020 | 41.6 (40.6 - 42.7) | 1403370 | 5h (1 GTX TITAN) | ||
YenKu_NTU_task4_SED_1 | Yen2020 | 43.6 (42.4 - 44.6) | 1403370 | 5h (1 GTX TITAN) | ||
Tang_SCU_task4_SED_1 | Tang2020 | 43.1 (42.3 - 44.1) | 1687466 | 5h (1 TITAN Xp) | ||
Tang_SCU_task4_SED_4 | Tang2020 | 44.1 (43.4 - 44.8) | 3186890 | 17h (1 TITAN Xp) | ||
Tang_SCU_task4_SED_2 | Tang2020 | 42.4 (41.4 - 43.4) | 1687466 | 6h (1 TITAN Xp) | ||
Tang_SCU_task4_SED_3 | Tang2020 | 44.1 (43.3 - 45.0) | 3186890 | 15h (1 TITAN Xp) | ||
DCASE2020_SED_baseline_system | turpault2020a | 34.9 (34.0 - 35.7) | 1112420 | 3h (1 GTX 1080 Ti) | ||
DCASE2020_SS_SED_baseline_system | turpault2020b | 36.5 (35.6 - 37.2) | 1112420 | 3 | 3h (1 GTX 1080 Ti) | |
Ebbers_UPB_task4_SED_1 | Ebbers2020 | 47.2 (46.5 - 48.1) | 20319432 | 8 | 24h (4 RTX 2080) |
Combined SS and SED
Rank | Code |
Technical Report |
Event-based F-score (Eval) |
Sound separation method |
SS Model complexity |
SS Training time |
Sources used for SED |
SS/SED Integration type |
SS/SED Integration method |
---|---|---|---|---|---|---|---|---|---|
Copiaco_UOW_task4_SS_SED_1 | Copiaco2020b | 6.9 (6.7 - 7.2) | DNN | 500000 | 3h (Intel Core i7-5600U CPU @ 2.60GHz) | DESED foreground | early | average | |
Chen_NTHU_task4_SS_SED_1 | Chen2020 | 34.5 (33.5 - 35.3) | ConvTasNet | 1112420 | 3h (1 GTX 1080 Ti) | DESED foreground, FUSS | both | average | |
Cornell_UPM-INRIA_task4_SS_SED_1 | Cornell2020 | 38.6 (37.5 - 39.6) | ConvTasNet (X=5, R=3) | 3203999 | 12h (1 GTX 1080 Ti) | all sources | late | concat | |
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Huang2020 | 44.1 (42.9 - 45.4) | TDCN++ | 1112420 | 3h (1 GTX 1080 Ti) | all sources | late | average | |
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Huang2020 | 44.4 (43.2 - 45.8) | TDCN++ | 1112420 | 3h (1 GTX 1080 Ti) | all sources | late | average | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Huang2020 | 44.5 (43.3 - 46.0) | TDCN++ | 1112420 | 3h (1 GTX 1080 Ti) | all sources | late | average | |
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Huang2020 | 44.7 (43.6 - 46.2) | TDCN++ | 1112420 | 3h (1 GTX 1080 Ti) | all sources | late | average | |
DCASE2020_SS_SED_baseline_system | 36.5 (35.6 - 37.2) | TDCN++ | 9179401 | DESED sources | late | P-norm |
SS systems
Rank | Code |
Technical Report |
Single source SI-SNR (Eval) |
Sound separation method |
SS Model complexity |
Input features |
Data augmentation |
---|---|---|---|---|---|---|---|
Xiaomi_task4_SS_1 | Liang2020 | 33.8 | TDCN++ | 27538206 | STFT | time shifting, pitch shifting | |
Xiaomi_task4_SS_2 | Liang2020 | 31.8 | TDCN++ | 27636510 | STFT, MFCC | time shifting, pitch shifting | |
DCASE2020_SS_baseline_system | kavalerov2019universal | 37.6 | TDCN++ | 9179401 | STFT (32 ms / 8 ms) | |
Hou_IPS_task4_SS_1 | HouB2020 | 37.6 | TDCN++ | | STFT (32 ms / 8 ms) |
Technical reports
Semi-Supervised NMF-CNN For Sound Event Detection
Chan, Teck Kai1,2 and Chin, Cheng Siong1 and Li, Ye2
1Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, 2Xylem Water Solution Singapore Pte Ltd, 3A International Business Park, Singapore
CTK_NU_task4_SED_1 CTK_NU_task4_SED_2 CTK_NU_task4_SED_3 CTK_NU_task4_SED_4
Abstract
For the DCASE 2020 Challenge Task 4, this paper proposes a combinative approach using Nonnegative Matrix Factorization (NMF) and Convolutional Neural Network (CNN). The main idea begins with utilizing NMF to approximate strong labels for the weakly labeled data. Subsequently, based on the approximated strongly labeled data, two different CNNs are trained using a semi-supervised framework, where one CNN is used for clip-level prediction and the other for frame-level prediction. Using this idea, the best model trained can achieve an event-based F1-score of 45.7% on the validation dataset. Using an ensemble of models, the event-based F1-score can be increased to 48.6%. The proposed model thus outperforms the baseline model by a margin of over 8%.
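A rough sketch of the central idea, approximating frame-level activity for a weakly labeled clip from NMF activations; the grouping of one component block per active class is an assumption for illustration, not the authors' exact procedure:

```python
import numpy as np
from sklearn.decomposition import NMF

def approximate_strong_labels(mel_spec, active_classes, n_components=4, thr=0.5):
    """Factorise a mel spectrogram and read event activity off the activations.

    mel_spec: (n_mels, n_frames), non-negative.
    active_classes: indices of the clip's weak labels. Assigning one block of
    NMF components per active class is a simplifying assumption.
    """
    model = NMF(n_components=n_components * len(active_classes),
                init="nndsvda", max_iter=400)
    W = model.fit_transform(mel_spec)   # (n_mels, k) spectral bases (unused here)
    H = model.components_               # (k, n_frames) temporal activations
    labels = {}
    for i, cls in enumerate(active_classes):
        act = H[i * n_components:(i + 1) * n_components].sum(axis=0)
        labels[cls] = act / (act.max() + 1e-9) > thr   # frame-level pseudo label
    return labels
```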
System characteristics
Input | mono |
Sampling rate | 22.05 kHz |
Data augmentation | Gaussian noise |
Features | log-mel energies |
Classifier | NMF, CNN |
Combined Sound Event Detection And Sound Event Separation Networks For DCASE 2020 Task 4
Chen, You-Siang1 and Lin, Zi Jie1 and Li, Shang-En2 and Koh, Chih-Yuan1 and Bai, Mingsian R.1 and Chien, Jen-Tzung2 and Liu, Yi-Wen1
1National Tsing Hua University, Hsinchu, 2National Chiao Tung University, Hsinchu, Taiwan
Chen_NTHU_task4_SS_SED_1
Abstract
In this paper, we propose a hybrid neural network (NN) to handle the tasks of sound event separation (SES) and sound event detection (SED) in Task 4 of the DCASE 2020 challenge. The convolutional time-domain audio separation network (Conv-TasNet) is employed to extract the foreground sound events defined in the DCASE challenge. By comparing the baseline SED network with various training strategies, we demonstrate that the SES network is capable of enhancing the SED performance effectively in terms of several event-based performance metrics, including macro F1 and the polyphonic sound detection score (PSDS).
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | Sound event separation |
Features | log-mel energies |
Classifier | CRNN, ConvTasNet |
Decision making | P-norm |
Sound Event Detection And Classification Using CWT Scalograms And Deep Learning
Copiaco, Abigail1 and Ritz, Christian2 and Fasciani, Stefano3 and Abdulaziz, Nidhal4
1Engineering and Information Sciences Dept., University of Wollongong, Northfields Wollongong, Australia, 2School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Northfields Wollongong, Australia, 3Musicology Dept., University of Oslo, Norway, 4University of Wollongong in Dubai, Dubai
Copiaco_UOW_task4_SED_1 Copiaco_UOW_task4_SED_2
Abstract
This report describes our proposed system for the DCASE 2020 Task 4 challenge. In this work, we examine the combination of signal energy and spectral centroid features with 0.05 s of time windowing for the detection of sound events. Along with this, spectro-temporal features extracted from Fast Fourier Transform (FFT) based wavelet coefficients of the audio files were used for classification. These coefficients are mapped into images called scalograms, which are fed into the layers of AlexNet, a pre-trained Deep Convolutional Neural Network (DCNN), for transfer learning. On the validation set, this method achieved an average F1-score of 74% across the 10 classes of the DESED database for weak labelling. However, the technique is not deemed suitable for classification with strong timestamp labelling, achieving an F1-score of only 11.21%.
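A minimal sketch of the scalogram-plus-transfer-learning pipeline described above, using PyWavelets for the CWT and torchvision's pre-trained AlexNet; the input sizing and the absence of normalisation are illustrative simplifications:

```python
import numpy as np
import pywt
import torch
import torchvision.models as models

def scalogram(audio, scales=np.arange(1, 128), wavelet="morl"):
    """CWT scalogram of a mono signal, treated as an image for classification."""
    coeffs, _ = pywt.cwt(audio, scales, wavelet)
    return np.abs(coeffs)                        # (n_scales, n_samples)

# Pre-trained AlexNet with its last layer swapped for the 10 DESED classes
# (torchvision >= 0.13; older versions use pretrained=True instead).
net = models.alexnet(weights="IMAGENET1K_V1")
net.classifier[6] = torch.nn.Linear(4096, 10)

img = scalogram(np.random.randn(16000)).astype(np.float32)   # 1 s at 16 kHz
img = torch.from_numpy(img)[None].repeat(3, 1, 1)            # grey -> fake RGB
img = torch.nn.functional.interpolate(img[None], size=(224, 224))
logits = net(img)                                             # (1, 10)
```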
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | scalogram, signal energy, spectral centroid |
Classifier | AlexNet, transfer learning CNN |
Decision making | P-norm |
Detecting And Classifying Separated Sound Events Using Wavelet-Based Scalograms And Deep Learning
Copiaco, Abigail1 and Ritz, Christian2 and Fasciani, Stefano3 and Abdulaziz, Nidhal4
1Engineering and Information Sciences Dept., University of Wollongong, Northfields Wollongong, Australia, 2School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Northfields Wollongong, Australia, 3Musicology Dept., University of Oslo, Norway, 4University of Wollongong in Dubai, Dubai
Copiaco_UOW_task4_SS_SED_1
Abstract
This report describes our proposed system submitted to the DCASE 2020 Task 4 challenge that includes sound separation and detection. In this work, we examine the use of a Deep Neural Network (DNN) based sound separation system as a preprocessing stage for the sound event classification technique used. For the detection of sound events, a combination of signal energy and spectral centroid features with 0.05 s of time windowing was utilized. Along with this, spectro-temporal features extracted from Fast Fourier Transform (FFT) based wavelet coefficients of the audio files were used for classification. These coefficients are mapped into images called scalograms, which are fed into the layers of AlexNet, a pre-trained Deep Convolutional Neural Network (DCNN), for transfer learning. On the validation set, this method achieved an average F1-score of 70% across the 10 classes of the DESED database for weak labelling. However, the technique is not deemed suitable for classification with strong timestamp labelling, achieving an F1-score of only 8.73%.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | scalogram, signal energy, spectral centroid |
Classifier | AlexNet, transfer learning CNN |
Decision making | P-norm |
The UNIVPM-INRIA Systems For The DCASE 2020 Task 4
Cornell, Samuele1 and Pepe, Giovanni1 and Principi, Emanuele1 and Pariente, Manuel2 and Olvera, Michel2 and Gabrielli, Leonardo1 and Squartini, Stefano1
1Dept. Information Engineering, Università Politecnica delle Marche, Ancona, Italy, 2Dept. Information and Communication Sciences and Technologies, INRIA Nancy Grand-Est, France
Cornell_UPM-INRIA_task4_SED_1 Cornell_UPM-INRIA_task4_SED_2 Cornell_UPM-INRIA_task4_SED_3 Cornell_UPM-INRIA_task4_SED_4 Cornell_UPM-INRIA_task4_SS_SED_1
Abstract
In this technical report, we propose different Sound Event Detection (SED) systems for the 2020 DCASE Task 4 challenge. Given the mismatch between synthetic labelled data and target domain data, we exploit domain adversarial training to improve the network's invariance to different types of background noise. Furthermore, we use dynamic mixing and augmentation of synthetic examples at training time, as well as prediction smoothing using Hidden Markov Models. In one system, we also show that a learnable dynamic-compression front-end, Per-Channel Energy Normalization (PCEN), improves robustness to background noise by making its distribution closer to Gaussian. Finally, an ensemble of models proves beneficial in improving the prediction score. Concerning joint separation and sound event detection, we propose a permutation-invariant training scheme that directly optimizes the sound event detection objective.
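As an illustration of the HMM prediction smoothing mentioned above, here is a minimal two-state (event off/on) Viterbi decoder over one class's frame posteriors; the self-transition probability is an assumed placeholder, not the authors' estimated value:

```python
import numpy as np

def viterbi_smooth(probs, p_stay=0.95):
    """Two-state (off/on) Viterbi decoding of one class's frame probabilities.

    probs: (n_frames,) posterior of the event being active.
    p_stay: self-transition probability; higher values suppress rapid switching.
    """
    logA = np.log([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    emis = np.log(np.stack([1 - probs, probs], axis=1) + 1e-9)  # (T, 2)
    T = len(probs)
    delta = np.zeros((T, 2))
    psi = np.zeros((T, 2), dtype=int)
    delta[0] = np.log(0.5) + emis[0]                 # uniform initial state
    for t in range(1, T):
        trans = delta[t - 1][:, None] + logA         # (prev, next) scores
        psi[t] = trans.argmax(axis=0)                # best predecessor
        delta[t] = trans.max(axis=0) + emis[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                   # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path                                      # 0/1 per frame
```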
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | contrast, overdrive, pitch shifting, highshelf, lowshelf |
Features | log-mel energies, mel-energies |
Classifier | CRNN, DPRNN |
Decision making | HMM smoothing |
Multi-Resolution Mean Teacher For DCASE 2020 Task 4
de Benito-Gorron, Diego and Segovia, Sergio and Ramos, Daniel and Toledano, Doroteo T.
Universidad Autónoma de Madrid, Madrid, Spain
deBenito_AUDIAS_task4_SED_1 deBenito_AUDIAS_task4_SED_2
Abstract
In this technical report, we describe our participation in DCASE 2020 Task 4: Sound event detection and separation in domestic environments. A multi-resolution feature extraction approach is proposed, aiming to take advantage of the different lengths and spectral characteristics of each target category. The combination of up to five different time-frequency resolutions via model fusion is able to outperform the baseline results. In addition, we propose class-specific thresholds for the F1-score metric, further improving the results on the validation set.
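A minimal sketch of the fusion step described above, assuming the per-resolution outputs have already been aligned to a common frame grid:

```python
import numpy as np

def fuse_resolutions(per_res_probs, class_thresholds):
    """Average frame-level probabilities from several time-frequency
    resolutions, then binarise each class with its own threshold.

    per_res_probs: list of (n_frames, n_classes) arrays, time-aligned.
    class_thresholds: (n_classes,) per-class decision thresholds.
    """
    fused = np.mean(per_res_probs, axis=0)
    return fused > np.asarray(class_thresholds)[None, :]
```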
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean |
Convolutional Recurrent Neural Networks For Weakly Labeled Semi-Supervised Sound Event Detection In Domestic Environments
Ebbers, Janek and Haeb-Umbach, Reinhold
Paderborn University, Paderborn, Germany
Abstract
In this report we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4: Sound event detection in domestic environments. We present a convolutional recurrent neural network (CRNN) with two recurrent neural network (RNN) classifiers sharing the same preprocessing convolutional neural network (CNN). Both recurrent networks perform audio tagging: one processes the input audio signal in the forward direction and the other in the backward direction. The networks are trained jointly using weakly labeled data, such that at each time step an active event is tagged by at least one of the networks, as the event lies either in the past (captured by the forward RNN) or in the future (captured by the backward RNN). This way the models are encouraged to output tags as soon as possible. After training, the networks can be used for sound event detection by applying them to smaller audio segments, e.g. 200 ms. Furthermore, we propose a tag-conditioned CNN as a second stage to refine sound event detection. Given its receptive field and the file tags as input, it performs strong label prediction, trained using pseudo strong labels provided by the CRNN system. By ensembling four CRNNs and four CNNs we obtain an event-based F-score of 48.3% on the validation set, which is 13.5% higher than the baseline. We are going to release the source code at https://github.com/fgnt/pb_sed.
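A minimal sketch of the forward/backward tagging idea, assuming CNN features are already computed; the shared classifier and the max-combination are simplifications for illustration:

```python
import torch
from torch import nn

class FwdBwdTagger(nn.Module):
    """Shared CNN features feed two GRU taggers, one reading the clip forward
    and one backward; each tags events it has already 'seen' (sketch only)."""
    def __init__(self, dim=128, n_classes=10):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, feats):                     # feats: (batch, time, dim)
        yf, _ = self.fwd(feats)                   # reads left to right
        yb, _ = self.bwd(feats.flip(1))           # reads right to left
        pf = torch.sigmoid(self.cls(yf))          # tags given the past
        pb = torch.sigmoid(self.cls(yb.flip(1)))  # tags given the future
        return torch.maximum(pf, pb)              # tagged by either network
```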
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | scaling, mixup, frequency warping, blurring, frequency masking, time masking, gaussian noise |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean |
Cross-domain sound event detection: from synthesized audio to real audio
Hao, Junyong and Hou, Zhenwei and Peng, Wang
Chongqing University, Chongqing, China
Hao_CQU_task4_SED_1 Hao_CQU_task4_SED_2 Hao_CQU_task4_SED_3 Hao_CQU_task4_SED_4
Abstract
This technical report describes the systems submitted to DCASE2020 task 4 - Sound Event Detection in Domestic Environments. We use the dataset of DCASE2019 task 4 to train our model, which contains strongly labeled synthetic data, a large amount of unlabeled data, and weakly labeled data. There is a very large domain gap in the statistical distribution between synthesized audio and real audio, and the performance of an SED model trained on synthesized audio degrades greatly when applied to real audio. To perform this task, we propose a DA-CRNN network for joint learning of sound event detection (SED) and domain adaptation (DA). We address the distribution within a single sound and its impact on the generalization performance of the model by mitigating the effect of complex background noise on event detection and by a self-correlation consistency regularization of clip-level sound event classification, which makes the intra-domain distribution of a single sound smoother. For cross-domain adaptation, adversarial learning through a feature extraction network with frame-level and clip-level domain discriminators forces the feature extraction network to learn domain-invariant features and further improves the generalization performance of the model. Without using sound source separation, we achieved an F1 score of 48.25% on the validation dataset and 49.43% on the public evaluation dataset.
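The cross-domain adversarial learning described above is commonly realised with a gradient reversal layer between the feature extractor and the domain discriminators. A minimal sketch of that building block (dimensions are placeholders, not the authors' configuration):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradients scaled by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainDiscriminator(nn.Module):
    """Predicts synthetic vs. real from shared features; the reversed gradient
    pushes the feature extractor toward domain-invariant representations."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, lam=1.0):
        return self.net(GradReverse.apply(feats, lam))   # domain logit
```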
System characteristics
Input | mono |
Sampling rate | 22.05 kHz |
Features | log-mel energies |
Classifier | CRNN |
Fine-Tuning Using Grid Search & Gradient Visualization
Hou, Bowei and Radzikowski, Kacper and Farid, Ahmed
Waseda University, Fukuoka, Japan
Hou_IPS_task4_SED_1 Hou_IPS_task4_SS_1
Abstract
In this technical report, we briefly describe the models used in the task 4 challenge of DCASE2020. We utilized previously available models and fine-tuned them using the grid search algorithm and gradient visualization. This is the first attempt by our team to enter a competition on sound source manipulation.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean |
AUTHOR GUIDELINES FOR DCASE 2020 CHALLENGE TECHNICAL REPORT
Hou, Zhenwei and Hao, Junyong and Peng, Wang
University of Chongqing, Chongqing, China
Zhenwei_Hou_task4_SED_1
Abstract
In this technical report, we present our sound event detection system for task 4 (Sound event detection and separation in domestic environments) of the DCASE2020 challenge. We propose an improved CRNN in which Context Gating and a channel attention mechanism are co-embedded into the backbone network. The aim is to construct a general and efficient attention structure for extracting features of sound events and to exploit the advantages of attention mechanisms in event feature extraction. Replacing the CRNN in the baseline model with our structure while keeping the other parts unchanged, the macro F-score of our model on the validation set is 4 percentage points higher than the baseline.
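A minimal sketch of the two attention ingredients named above, with a squeeze-and-excitation style module standing in for the channel attention; the authors' exact wiring may differ:

```python
import torch
from torch import nn

class ContextGating(nn.Module):
    """Element-wise sigmoid gate over the feature dimension (Context Gating)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        return x * torch.sigmoid(self.fc(x))

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention for CNN feature maps."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // r), nn.ReLU(),
                                nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, channels, freq, time)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze: global average pool
        return x * w[:, :, None, None]         # excite: per-channel rescaling
```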
System characteristics
Input | mono |
Sampling rate | 22.05 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean |
Guided Multi-Branch Learning Systems For DCASE 2020 Task 4
Huang, Yuxin1,2 and Lin, Liwei1,2 and Ma, Shuo1 and Wang, Xiangdong1 and Liu, Hong1 and Qian, Yueliang1 and Liu, Min3 and Ouchi, Kazushige3
1Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Toshiba China R&D Center, Beijing, China
Huang_ICT-TOSHIBA_task4_SED_1 Huang_ICT-TOSHIBA_task4_SED_2 Huang_ICT-TOSHIBA_task4_SED_3 Huang_ICT-TOSHIBA_task4_SED_4 Huang_ICT-TOSHIBA_task4_SS_SED_1 Huang_ICT-TOSHIBA_task4_SS_SED_2 Huang_ICT-TOSHIBA_task4_SS_SED_3 Huang_ICT-TOSHIBA_task4_SS_SED_4
Abstract
In this paper, we describe in detail our systems for DCASE2020 task 4. The systems are based on the first-place system of DCASE 2019 task 4, which adopts the multiple instance learning (MIL) framework with embedding-level attention pooling and a semi-supervised learning approach called guided learning. The multi-branch learning method is then incorporated into the system to further improve the performance. Multiple branches with different pooling strategies (embedding-level or instance-level) and different pooling modules (attention pooling, global max pooling or global average pooling) are used and share the same feature encoder. To better exploit the synthetic data with strong labels, inspired by multi-task learning, a sound event detection (SED) branch is also added. Therefore, multiple branches pursuing different purposes and focusing on different characteristics of the data can help the feature encoder model the feature space better and avoid over-fitting. To combine sound separation with sound event detection, we train models using the output of the sound separation baseline system and fuse the event detection results of models with or without sound separation.
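A minimal sketch of the multi-branch idea, with three pooling branches over a shared encoder's frame-level outputs; the branch set and dimensions are illustrative, not the submission's exact topology:

```python
import torch
from torch import nn

class MultiBranchTagger(nn.Module):
    """Shared encoder output feeds branches with different pooling strategies
    (attention / global max / global average), each trained on weak labels."""
    def __init__(self, dim=128, n_classes=10):
        super().__init__()
        self.frame_cls = nn.Linear(dim, n_classes)   # frame-level probabilities
        self.att = nn.Linear(dim, n_classes)         # attention weights per class

    def forward(self, frames):                       # frames: (batch, time, dim)
        p = torch.sigmoid(self.frame_cls(frames))    # (B, T, C)
        w = torch.softmax(self.att(frames), dim=1)   # normalise over time
        att_pool = (w * p).sum(dim=1)                # attention pooling branch
        max_pool = p.max(dim=1).values               # global max pooling branch
        avg_pool = p.mean(dim=1)                     # global average pooling branch
        return att_pool, max_pool, avg_pool
```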
System characteristics
Input | mono |
Sampling rate | 44.1 kHz |
Data augmentation | time shifting, frequency shifting |
Features | log-mel energies |
Classifier | CNN |
Multi-Scale Residual CRNN With Data Augmentation For DCASE 2020 Task 4
JiaKai, Lu
PFU Shanghai Co., LTD, Shanghai, China
LJK_PSH_task4_SED_1 LJK_PSH_task4_SED_2 LJK_PSH_task4_SED_3 LJK_PSH_task4_SED_4
Abstract
In this paper, we present our neural network for the DCASE 2020 challenge's Task 4 (Sound event detection and separation in domestic environments). This task evaluates systems for the large-scale detection of sound events using weakly labeled data, and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance for audio tagging and sound event detection. We propose a mean-teacher model with a convolutional neural network (CNN) and a recurrent neural network (RNN) to maximize the use of the unlabeled in-domain dataset. The architecture is based on our 2018 competition model.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | pitch shifting |
Features | log-mel energies |
Classifier | CRNN |
Decision making | macro F1 vote |
Universal sound separation
Kavalerov, Ilya1,2 and Wisdom, Scott1 and Erdogan, Hakan1 and Patton, Brian1 and Wilson, Kevin1 and Le Roux, Jonathan3 and Hershey, John R1
1Google Research, Cambridge MA, 2Department of Electrical and Computer Engineering, UMD, 3Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA
DCASE2020_SS_baseline_system
Abstract
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Polyphonic Sound Event Detection Based On Convolutional Recurrent Neural Networks With Semi-Supervised Loss Function For DCASE Challenge 2020 Task 4
Kim, Nam Kyun1 and Kim, Hong Kook1,2
1School of Electrical Engineering and Computer Science, Gwangju, South Korea, 2AI Graduate School (GIST), Gwangju Institute of Science and Technology, Gwangju, South Korea
Kim_AiTeR_GIST_SED_1 Kim_AiTeR_GIST_SED_2 Kim_AiTeR_GIST_SED_3 Kim_AiTeR_GIST_SED_4
Abstract
This report proposes a polyphonic sound event detection (SED) method for the DCASE 2020 Challenge Task 4. The proposed SED method is based on semi-supervised learning to deal with the different combinations of training datasets: a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. In particular, the target label of each audio clip from the weakly labeled or unlabeled dataset is first predicted by using the mean teacher model, which is the DCASE 2020 baseline. The data with predicted labels are used for training the proposed SED model, which consists of CNNs with skip connections and a self-attention mechanism, followed by RNNs. In order to compensate for the erroneous prediction of weakly labeled and unlabeled data, a semi-supervised loss function is employed for the proposed SED model. In this work, several versions of the proposed SED model are implemented and evaluated on the validation set according to different parameter settings for the semi-supervised loss function, and an ensemble model that combines five-fold validation models is finally selected as our final model.
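A minimal sketch of a semi-supervised loss in the spirit described above, combining supervised BCE on (pseudo-)labeled clips with a mean-teacher consistency term; the weighting is a placeholder, not the authors' setting:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(student_out, teacher_out, targets, labeled_mask, alpha=0.5):
    """Supervised BCE on (possibly pseudo-)labeled clips plus a consistency
    term on everything; alpha balances the two (illustrative value).

    student_out, teacher_out: (batch, n_classes) clip-level probabilities.
    targets: (batch, n_classes) weak / pseudo labels.
    labeled_mask: (batch,) bool, True where targets are trusted; the batch is
    assumed to contain at least one labeled clip.
    """
    sup = F.binary_cross_entropy(student_out[labeled_mask], targets[labeled_mask])
    cons = F.mse_loss(student_out, teacher_out.detach())  # teacher not updated
    return sup + alpha * cons
```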
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | mixup, SpecAugment, Gaussian noise |
Features | mel-spectrogram |
Classifier | CRNN |
Decision making | thresholding |
Sound Event Detection By Consistency Training And Pseudo-Labeling With Feature-Pyramid Convolutional Recurrent Neural Networks
Koh, Chih-Yuan1 and Chen, You-Siang1 and Li, Shang-En2 and Liu, Yi-Wen1 and Chien, Jen-Tzung2 and Bai, Mingsian R.1
1National Tsing Hua University, Hsinchu, 2National Chiao Tung University, Hsinchu, Taiwan
Koh_NTHU_task4_SED_1 Koh_NTHU_task4_SED_2 Koh_NTHU_task4_SED_3 Koh_NTHU_task4_SED_4
Abstract
An event detection system for DCASE 2020 task 4 is presented. To efficiently utilize a large amount of unlabeled in-domain data, three semi-supervised learning strategies are applied: 1) interpolation consistency training (ICT), 2) shift consistency training (SCT), 3) weakly pseudo-labeling. In addition, we propose FP-CRNN, a convolutional recurrent neural network which contains feature-pyramid components and is based on the provided baseline. In terms of event-based F-measure, these approaches outperform the baseline (34.8%) by a large margin, with an F-measure of 48.4% for the baseline network trained with the combination of all three strategies and 49.6% for FP-CRNN with the same training strategies.
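A minimal sketch of interpolation consistency training (ICT) on unlabeled examples, assuming `student` and `teacher` are callables mapping inputs to clip-level probabilities:

```python
import torch
import torch.nn.functional as F

def ict_loss(student, teacher, x_u1, x_u2, alpha=0.2):
    """Interpolation consistency training on unlabeled audio features: the
    student's prediction on a mixed input should match the same mix of the
    teacher's predictions on the two originals (alpha is illustrative).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    with torch.no_grad():                      # teacher provides targets only
        y1, y2 = teacher(x_u1), teacher(x_u2)
    x_mix = lam * x_u1 + (1 - lam) * x_u2
    y_mix = lam * y1 + (1 - lam) * y2
    return F.mse_loss(student(x_mix), y_mix)
```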
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | Gaussian noise, Mixup, time shifting, pitch shifting |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean probabilities, thresholding |
Mean Teacher With Sound Source Separation And Data Augmentation For DCASE 2020 Task 4
Liang, Chuming and Ying, Haorong and Chen, Yueyi and Wang, Zhao
Xiaomi AI Lab, Wuhan, China
Abstract
In this paper, we present our system for the DCASE 2020 challenge Task 4 (Sound event detection and separation in domestic environments). The target of this task is to provide time boundaries of multiple events in an audio recording using a system trained with unlabeled, weakly-labeled and synthetic data. The use of sound source separation in the system is also encouraged. We propose a mean-teacher model with a convolutional and recurrent neural network (CRNN) structure and adopt data augmentation and sound source separation techniques to improve the performance of sound event detection.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | time stretching, pitch shifting, reverberation |
Features | log-mel energies |
Classifier | CRNN |
Semi-Supervised Sound Event Detection Based On Mean Teacher With Power Pooling And Data Augmentation
Liu, Yuzhuo1,2 and Chen, Chengxin1,2 and Kuang, Jianzhong1,2 and Zhang, Pengyuan1,2
1Institute of Acoustics, Key Laboratory of Speech Acoustics & Content Understanding, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China
Liu_thinkit_task4_SED_1 Liu_thinkit_task4_SED_2 Liu_thinkit_task4_SED_3 Liu_thinkit_task4_SED_4
Abstract
In this technical report, we describe the details of the system submitted to DCASE2020 task 4: sound event detection (SED) and separation in domestic environments. We mainly focus on the scenario that recognizes sound events without source separation. The training set includes synthetic strongly labeled data, weakly labeled data and unlabeled data. For training SED models with weak labeling, a power pooling function is introduced to generate clip-level predictions from frame-level ones. Additionally, three traditional data augmentation approaches are applied to all data. We also ensemble models with different strategies. Our best system finally achieves an F1 of 49.55% on the validation set.
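A minimal sketch of a power pooling function with a learnable exponent, in the common formulation that generalises linear softmax pooling; whether this matches the authors' exact definition is an assumption:

```python
import torch
from torch import nn

class PowerPooling(nn.Module):
    """Clip-level probability from frame-level ones via a power-weighted mean;
    with n=1 this reduces to linear softmax pooling (one common formulation)."""
    def __init__(self, n_init=1.0):
        super().__init__()
        self.n = nn.Parameter(torch.tensor(n_init))   # learnable exponent

    def forward(self, frame_probs):                   # (batch, time, n_classes)
        w = frame_probs.clamp(min=1e-7) ** self.n     # frames weight themselves
        return (w * frame_probs).sum(dim=1) / w.sum(dim=1)
```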
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | noise addition, pitch shifting, time rolling, dynamic range compression |
Features | log-mel energies |
Classifier | CRNN |
Decision making | weighted averaging |
Convolution-Augmented Transformer For Semi-Supervised Sound Event Detection
Miyazaki, Koichi1 and Komatsu, Tatsuya2 and Hayashi, Tomoki1,3 and Watanabe, Shinji4 and Toda, Tomoki1 and Takeda, Kazuya1
1Nagoya University, Japan, 2LINE Corporation, Japan, 3Human Dataware Lab. Co., Japan, 4 Johns Hopkins University, USA
Miyazaki_NU_task4_SED_1 Miyazaki_NU_task4_SED_2 Miyazaki_NU_task4_SED_3
Abstract
In this technical report, we describe our submission system for DCASE2020 Task 4: sound event detection and separation in domestic environments. Our model employs conformer blocks, which combine self-attention and depthwise convolution networks, to efficiently capture both the global and local context information of an audio feature sequence. In addition to this novel architecture, we further improve the performance by utilizing a mean teacher semi-supervised learning technique, data augmentation, and post-processing optimized for each sound event class. We demonstrate that the proposed method achieves an event-based macro F1 score of 50.7% on the validation set, significantly outperforming the baseline score (34.8%).
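A minimal conformer-style block in PyTorch, for orientation only; layer sizes and ordering are illustrative rather than the submission's exact architecture:

```python
import torch
from torch import nn

class ConformerBlock(nn.Module):
    """Conformer-style block: half-step FFN, self-attention, depthwise
    convolution, half-step FFN, all residual (a sketch, not the exact layer)."""
    def __init__(self, d=144, heads=4, kernel=7, p=0.1):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.SiLU(), nn.Dropout(p), nn.Linear(4 * d, d))
        self.ff1, self.ff2 = ffn(), ffn()
        self.norm = nn.LayerNorm(d)
        # batch_first requires torch >= 1.9
        self.attn = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),                    # gating
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),   # depthwise
            nn.BatchNorm1d(d), nn.SiLU(), nn.Conv1d(d, d, 1))
        self.out = nn.LayerNorm(d)

    def forward(self, x):                            # x: (batch, time, d)
        x = x + 0.5 * self.ff1(x)
        a = self.norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.out(x + 0.5 * self.ff2(x))
```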
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | time shifting, mixup |
Features | log-mel energies |
Classifier | Conformer, Transformer |
Decision making | mean |
Joint Acoustic And Supervised Inference For Sound Event Detection
Park, Sangwook and Bellur, Ashwin and Kothinti, Sandeep and Kapurchali, Masoumeh Heidari and Elhilali, Mounya
Johns Hopkins University, Baltimore, USA
PARK_JHU_task4_SED_1 PARK_JHU_task4_SED_2 PARK_JHU_task4_SED_3 PARK_JHU_task4_SED_4
Abstract
This technical report describes a sound event detection system for task 4 of DCASE2020. The purpose of sound event detection is to find the event class label as well as its time boundaries. To achieve this, we considered several methods such as signal enhancement and event boundary detection, and built five systems by integrating these methods with a supervised system trained using the Mean Teacher model. In particular, we estimate event boundaries of weakly labeled data by performing event boundary detection, and then use the estimated strong labels in training the supervised system. In addition, we adopt a fusion method that computes a weighted average of the posteriors over the five outputs from each individual system. In experiments on the validation set, the final result of our system shows an improvement of about 11% in class-averaged F-score compared to the baseline performance.
System characteristics
Input | mono |
Sampling rate | 44.1 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | weighted averaging |
Multi-Task Learning Paradigm For Sound Event Detection
Rykaczewski, Krzysztof
Samsung R&D Institute Poland, Warsaw, Poland
Rykaczewski_Samsung_taks4_SED_1 Rykaczewski_Samsung_taks4_SED_2 Rykaczewski_Samsung_taks4_SED_3 Rykaczewski_Samsung_taks4_SED_4
Abstract
In this technical report, we describe our system submitted to DCASE2020 task 4. This task evaluates systems for the detection of sound events in domestic environments using large-scale weakly labeled data. To perform this task, we propose residual convolutional recurrent neural networks (CRNNs) as our system, trained on datasets including strong and weak labels. We also use a mean-teacher model based on confidence thresholding and a smooth embedding method. In addition, we apply SpecAugment to address the labeled-data shortage problem. Finally, we achieve better performance than the DCASE2020 baseline system.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |
Multi-Scale Residual CRNN With Data Augmentation For DCASE 2020 Task 4
Tang, Maolin and Guo, Longyin and Zhang, Yanqiu and Yan, Weiran and Zhao, Qijun
Sichuan University, ChengDu, China
Tang_SCU_task4_SED_1 Tang_SCU_task4_SED_2 Tang_SCU_task4_SED_3 Tang_SCU_task4_SED_4
Abstract
In this technical report, we present our method for task 4 of the DCASE 2020 challenge (Sound event detection and separation in domestic environments). The goal of the task is to evaluate systems for the detection of sound events using real data, either weakly labeled or unlabeled, and simulated data that is strongly labeled (with timestamps). We find that models perform well on synthetic data but may not perform well on real data. We thus improve the baseline [1] by using a variety of data augmentation methods and by synthesizing more complex synthetic data for training. Moreover, we present a multi-scale residual convolutional recurrent neural network (CRNN) to address the problem of multi-scale detection. The promising results on the validation set prove the effectiveness of our method.
System characteristics
Input | mono |
Sampling rate | 22.05 kHz |
Data augmentation | SpecAugment |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean |
Training Sound Event Detection On A Heterogeneous Dataset
Turpault, Nicolas and Serizel, Romain
Université de Lorraine, CNRS, Inria, Loria, France
Abstract
Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | p-norm |
Improving Sound Event Detection In Domestic Environments Using Sound Separation
Turpault, Nicolas1 and Wisdom, Scott2 and Erdogan, Hakan2 and Hershey, John R.2 and Serizel, Romain1 and Fonseca, Eduardo3 and Seetharaman, Prem4 and Salamon, Justin5
1Université de Lorraine, CNRS, Inria, Loria, Nancy, France, 2Google Research, AI Perception, Cambridge, United States, 3Music Technology Group, Universitat Pompeu Fabra, Barcelona, 4Interactive Audio Lab, Northwestern University, Evanston, United States, 5Adobe Research, San Francisco, United States
DCASE2020_SS_SED_baseline_system
Abstract
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.
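A minimal sketch of one way to fuse predictions from the mixture and the separated estimates with a p-norm, matching the "p-norm" decision-making entry below; the exact normalisation used in the baseline may differ:

```python
import torch

def pnorm_combine(clip_probs, p=2.0):
    """Fuse clip-level class probabilities predicted from the original mixture
    and from each separated source with a normalised p-norm (generalised mean).

    clip_probs: (n_streams, n_classes), mixture plus separated estimates.
    """
    x = torch.as_tensor(clip_probs)
    fused = x.pow(p).mean(dim=0).pow(1.0 / p)   # between mean (p=1) and max (p->inf)
    return fused.clamp(max=1.0)
```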
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | p-norm |
Sound Event Detection In Domestic Environments Using Dense Recurrent Neural Network
Yao, Tianchu and Shi, Chuang and Li, Huiyong
University of Electronic Science and Technology of China, Chengdu, China
Yao_UESTC_task4_SED_1 Yao_UESTC_task4_SED_2 Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_4
Abstract
In this paper, we introduce our sound event detection system using a mean-teacher model with a convolutional recurrent neural network (CRNN) for DCASE 2020 Task 4, which includes a residual convolutional block and a dense recurrent neural network (DRNN) block. To improve the performance of the system, we use various methods such as a multi-scale input layer, data augmentation, median window filters and model fusion. By combining these methods, our system achieves a 15% improvement in macro-averaged F-score on the development set compared to the baseline.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Data augmentation | mixup, time shifting |
Features | log-mel energies |
Classifier | CRNN |
Decision making | geometric-mean |
The Academia Sinica System Of Sound Event Detection And Separation For DCASE 2020
Yen, Hao1,2 and Ku, Pin-Jui1,2 and Yen, Ming-Chi1,2 and Lee, Hung-Shin1 and Wang, Hsin-Min1
1Institute of Information Science, Academia Sinica, Taiwan, 2Dept. Electrical Engineering, National Taiwan University, Taiwan
YenKu_NTU_task4_SED_1 YenKu_NTU_task4_SED_2 YenKu_NTU_task4_SED_3 YenKu_NTU_task4_SED_4
Abstract
In this report, we present our system for sound event detection and separation in domestic environments for DCASE 2020. The task aims to determine which sound events appear in a clip and the detailed temporal ranges they occupy. The system is trained using real data, which are either weakly labeled or unlabeled, and synthesized data with strongly annotated labels. Our proposed model starts with a feature-level front-end based on convolutional neural networks (CNN), followed by both embedding-level and instance-level back-end attention modules. To take full advantage of the large amount of unlabeled data, we jointly adopt a guided learning mechanism and Mean Teacher, which averages model weights instead of label predictions, to carry out weakly-supervised and semi-supervised learning. A group of adaptive median windows for each sound event is also utilized in post-processing to smooth the frame-level predictions. On the public evaluation set of DCASE 2019, our best system achieves a 48.50% event-based F-score, much better than the official baseline performance (38.14%), a relative improvement of 27.16%. Moreover, on the development set of DCASE 2020, our system is also superior to the baseline while using the student model as the back-end classifier; the F1-score is relatively improved by 32.91%.
System characteristics
Input | mono |
Sampling rate | 16 kHz |
Features | log-mel energies |
Classifier | CRNN |