Sound event detection and separation in domestic environments


Challenge results

Task description

The task evaluates systems for the detection of sound events using weakly labeled training data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording (see also Fig. 1). As in previous editions, the challenge is to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files, and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets have been verified and can be considered reliable.

A more detailed task description can be found on the task description page.
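Systems are scored with collar-based event matching (event-based F-score, macro-averaged over classes). As an illustration only, a minimal sketch of computing such a score with the sed_eval toolkit is given below; the 200 ms onset collar, the offset tolerance, and the toy event lists are assumptions for this example, not the official evaluation configuration.

```python
# Sketch: event-based F-score with the sed_eval toolkit (illustrative settings).
import sed_eval

# Toy event lists; field names follow dcase_util conventions.
reference = [
    {"file": "clip1.wav", "event_label": "Speech", "onset": 1.0, "offset": 2.5},
    {"file": "clip1.wav", "event_label": "Dog", "onset": 4.0, "offset": 4.8},
]
estimated = [
    {"file": "clip1.wav", "event_label": "Speech", "onset": 1.1, "offset": 2.4},
]

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=["Speech", "Dog"],
    t_collar=0.200,               # onset tolerance (assumed 200 ms)
    percentage_of_length=0.2,     # offset tolerance relative to event length
)
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
print(metrics.overall_f_measure())  # {'f_measure': ..., 'precision': ..., 'recall': ...}
```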

Systems ranking

Rank | Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset)
Xiaomi_task4_SED_1 DCASE2020 SED mean-teacher system Liang2020 36.0 (35.3 - 36.8) 35.6
Rykaczewski_Samsung_taks4_SED_3 DCASE2020 SED CNNR Rykaczewski2020 21.6 (21.0 - 22.4) 35.0
Rykaczewski_Samsung_taks4_SED_2 DCASE2020 SED CNNR Rykaczewski2020 21.9 (21.3 - 22.7) 36.0
Rykaczewski_Samsung_taks4_SED_4 DCASE2020 SED CNNR Rykaczewski2020 10.4 (9.7 - 11.1) 35.7
Rykaczewski_Samsung_taks4_SED_1 DCASE2020 SED CNNR Rykaczewski2020 21.6 (20.8 - 22.4) 34.3
Hou_IPS_task4_SED_1 DCASE2020 WASEDA IPS SED HouB2020 34.9 (34.0 - 35.7) 40.8
Miyazaki_NU_task4_SED_1 Conformer SED Miyazaki2020 51.1 (50.1 - 52.3) 50.6
Miyazaki_NU_task4_SED_2 Transformer SED Miyazaki2020 46.4 (45.5 - 47.5) 47.3
Miyazaki_NU_task4_SED_3 transformer conformer Ensemble SED Miyazaki2020 50.7 (49.6 - 51.9) 49.8
Huang_ICT-TOSHIBA_task4_SED_3 guided multi-branch learning Huang2020 44.3 (43.4 - 45.4) 45.8
Huang_ICT-TOSHIBA_task4_SED_1 guided multi-branch learning Huang2020 44.6 (43.5 - 46.0) 46.7
Huang_ICT-TOSHIBA_task4_SS_SED_4 guided multi branch learning Huang2020 Sound Separation 44.1 (42.9 - 45.4) 46.9
Huang_ICT-TOSHIBA_task4_SED_4 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 46.7
Huang_ICT-TOSHIBA_task4_SS_SED_1 guided multi branch learning Huang2020 Sound Separation 44.7 (43.6 - 46.2) 47.2
Huang_ICT-TOSHIBA_task4_SED_2 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 46.7
Huang_ICT-TOSHIBA_task4_SS_SED_3 guided multi branch learning Huang2020 Sound Separation 44.4 (43.2 - 45.8) 47.2
Huang_ICT-TOSHIBA_task4_SS_SED_2 guided_and_multi_branch_learning Huang2020 Sound Separation 44.5 (43.3 - 46.0) 47.3
Copiaco_UOW_task4_SED_2 DCASE2020 SED system copiaco Copiaco2020a 7.8 (7.3 - 8.2) 11.2
Copiaco_UOW_task4_SED_1 DCASE2020 SED system copiaco Copiaco2020a 7.5 (7.0 - 8.0) 8.1
Kim_AiTeR_GIST_SED_1 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.7 (42.8 - 44.7) 44.6
Kim_AiTeR_GIST_SED_2 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.9 (43.0 - 44.7) 45.2
Kim_AiTeR_GIST_SED_4 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.4 (43.5 - 45.2) 44.0
Kim_AiTeR_GIST_SED_3 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.2 (43.4 - 45.1) 44.5
Copiaco_UOW_task4_SS_SED_1 DCASE2020 SS+SED system copiaco Copiaco2020b Sound Separation 6.9 (6.7 - 7.2) 8.7
LJK_PSH_task4_SED_3 LJK PSH DCASE2020 Task4 SED 3 JiaKai2020 38.6 (37.7 - 39.7) 45.4
LJK_PSH_task4_SED_1 LJK PSH DCASE2020 Task4 SED 1 JiaKai2020 39.3 (38.4 - 40.4) 44.8
LJK_PSH_task4_SED_2 LJK PSH DCASE2020 Task4 SED 2 JiaKai2020 41.2 (40.1 - 42.4) 47.9
LJK_PSH_task4_SED_4 LJK PSH DCASE2020 Task4 SED 4 JiaKai2020 40.6 (39.6 - 41.6) 46.7
Hao_CQU_task4_SED_2 DCASE2020 cross-domain sound event detection Hao2020 47.0 (46.0 - 48.1) 46.4
Hao_CQU_task4_SED_3 DCASE2020 cross-domain sound event detection Hao2020 46.3 (45.5 - 47.4) 47.7
Hao_CQU_task4_SED_1 DCASE2020 cross-domain sound event detection Hao2020 44.9 (43.9 - 45.8) 48.2
Hao_CQU_task4_SED_4 DCASE2020 cross-domain sound event detection Hao2020 47.8 (46.9 - 49.0) 50.0
Zhenwei_Hou_task4_SED_1 DCASE2020 SED GCA system HouZ2020 45.1 (44.2 - 45.8) 44.8
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher with thresholding deBenito2020 38.2 (37.5 - 39.2) 43.4
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher de Benito-Gorron2020 37.9 (37.0 - 39.1) 40.9
Koh_NTHU_task4_SED_3 Koh_NTHU_3 Koh2020 46.6 (45.8 - 47.6) 48.0
Koh_NTHU_task4_SED_2 Koh_NTHU_2 Koh2020 45.2 (44.3 - 46.3) 48.4
Koh_NTHU_task4_SED_1 Koh_NTHU_1 Koh2020 45.2 (44.2 - 46.1) 46.4
Koh_NTHU_task4_SED_4 Koh_NTHU_4 Koh2020 46.3 (45.4 - 47.2) 49.6
Cornell_UPM-INRIA_task4_SED_2 UNIVPM-INRIA DAT HMM 1 Cornell2020 42.0 (40.9 - 43.1) 45.2
Cornell_UPM-INRIA_task4_SED_1 UNIVPM-INRIA ensemble DAT+PCEN Cornell2020 44.4 (43.3 - 45.5) 46.2
Cornell_UPM-INRIA_task4_SED_4 UNIVPM-INRIA ensemble DAT+PCEN HMM 2 Cornell2020 43.2 (42.1 - 44.4) 47.4
Cornell_UPM-INRIA_task4_SS_SED_1 UNIVPM-INRIA separation hmm Cornell2020 Sound Separation 38.6 (37.5 - 39.6) 40.2
Cornell_UPM-INRIA_task4_SED_3 UNIVPM-INRIA ensemble MT+PCEN Cornell2020 42.6 (41.6 - 43.5) 43.7
Yao_UESTC_task4_SED_1 Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) 47.9
Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 49.5
Yao_UESTC_task4_SED_2 Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) 50.5
Yao_UESTC_task4_SED_4 Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) 49.6
Liu_thinkit_task4_SED_1 MT_PPDA_cg_valid Liu2020 40.7 (39.7 - 41.7) 47.4
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_valid Liu2020 41.8 (40.7 - 42.9) 49.5
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_pub Liu2020 45.2 (44.2 - 46.5) 44.3
Liu_thinkit_task4_SED_4 MT_PPDA_finall Liu2020 43.1 (42.1 - 44.2) 46.8
PARK_JHU_task4_SED_1 PARK_fusion_P Park2020 35.8 (35.0 - 36.6) 45.4
PARK_JHU_task4_SED_1 PARK_fusion_N Park2020 26.5 (25.7 - 27.5) 41.1
PARK_JHU_task4_SED_2 PARK_fusion_M Park2020 36.9 (36.1 - 37.7) 44.9
PARK_JHU_task4_SED_3 PARK_fusion_L Park2020 34.7 (34.1 - 35.6) 45.4
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 Sound Separation 34.5 (33.5 - 35.3) 36.7
CTK_NU_task4_SED_2 CTK_NU NMF-CNN-2 Chan2020 44.4 (43.5 - 45.5) 45.7
CTK_NU_task4_SED_4 CTK_NU NMF-CNN-4 Chan2020 46.3 (45.3 - 47.4) 48.6
CTK_NU_task4_SED_3 CTK_NU NMF-CNN-3 Chan2020 45.8 (45.0 - 47.0) 48.0
CTK_NU_task4_SED_1 CTK_NU NMF-CNN-1 Chan2020 43.5 (42.6 - 44.7) 45.2
YenKu_NTU_task4_SED_4 DCASE2020 SED Yen2020 42.7 (41.6 - 43.6) 46.6
YenKu_NTU_task4_SED_2 DCASE2020 SED Yen2020 42.6 (41.8 - 43.7) 45.7
YenKu_NTU_task4_SED_3 DCASE2020 SED Yen2020 41.6 (40.6 - 42.7) 45.4
YenKu_NTU_task4_SED_1 DCASE2020 SED Yen2020 43.6 (42.4 - 44.6) 45.6
Tang_SCU_task4_SED_1 Basic ResNet block without weakly labeled data augmentation Tang2020 43.1 (42.3 - 44.1) 46.6
Tang_SCU_task4_SED_4 Multi-scale ResNet block with weakly labeled data augmentation Tang2020 44.1 (43.4 - 44.8) 49.0
Tang_SCU_task4_SED_2 Basic ResNet block with weakly labeled data augmentation Tang2020 42.4 (41.4 - 43.4) 48.4
Tang_SCU_task4_SED_3 Multi-scale ResNet block without weakly labeled data augmentation Tang2020 44.1 (43.3 - 45.0) 48.2
DCASE2020_SED_baseline_system DCASE2020 SED baseline system turpault2020a 34.9 (34.0 - 35.7) 34.8
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system turpault2020b Sound Separation 36.5 (35.6 - 37.2) 35.6
Ebbers_UPB_task4_SED_1 DCASE2020 UPB SED system 1 Ebbers2020 47.2 (46.5 - 48.1) 48.3

Supplementary metrics

Rank | Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | PSDS Cross-trigger (Evaluation dataset) | Event-based F-score (Public evaluation) | PSDS Cross-trigger (Public evaluation) | Event-based F-score (Vimeo dataset) | PSDS Cross-trigger (Vimeo dataset)
Xiaomi_task4_SED_1 DCASE2020 SED mean-teacher system Liang2020 36.0 (35.3 - 36.8) 40.7 25.3
Rykaczewski_Samsung_taks4_SED_3 DCASE2020 SED CNNR Rykaczewski2020 21.6 (21.0 - 22.4) 23.7 15.4
Rykaczewski_Samsung_taks4_SED_2 DCASE2020 SED CNNR Rykaczewski2020 21.9 (21.3 - 22.7) 24.0 15.7
Rykaczewski_Samsung_taks4_SED_4 DCASE2020 SED CNNR Rykaczewski2020 10.4 (9.7 - 11.1) 11.9 6.7
Rykaczewski_Samsung_taks4_SED_1 DCASE2020 SED CNNR Rykaczewski2020 21.6 (20.8 - 22.4) 23.5 15.7
Hou_IPS_task4_SED_1 DCASE2020 WASEDA IPS SED HouB2020 34.9 (34.0 - 35.7) 38.1 27.8
Miyazaki_NU_task4_SED_1 Conformer SED Miyazaki2020 51.1 (50.1 - 52.3) 55.7 39.6
Miyazaki_NU_task4_SED_2 Transformer SED Miyazaki2020 46.4 (45.5 - 47.5) 51.1 34.9
Miyazaki_NU_task4_SED_3 transformer conformer Ensemble SED Miyazaki2020 50.7 (49.6 - 51.9) 55.2 39.0
Huang_ICT-TOSHIBA_task4_SED_3 guided multi-branch learning Huang2020 44.3 (43.4 - 45.4) 48.7 32.2
Huang_ICT-TOSHIBA_task4_SED_1 guided multi-branch learning Huang2020 44.6 (43.5 - 46.0) 49.7 31.8
Huang_ICT-TOSHIBA_task4_SS_SED_4 guided multi branch learning Huang2020 Sound Separation 44.1 (42.9 - 45.4) 48.6 32.8
Huang_ICT-TOSHIBA_task4_SED_4 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 49.0 32.2
Huang_ICT-TOSHIBA_task4_SS_SED_1 guided multi branch learning Huang2020 Sound Separation 44.7 (43.6 - 46.2) 49.5 32.7
Huang_ICT-TOSHIBA_task4_SED_2 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 49.0 32.2
Huang_ICT-TOSHIBA_task4_SS_SED_3 guided multi branch learning Huang2020 Sound Separation 44.4 (43.2 - 45.8) 49.3 32.2
Huang_ICT-TOSHIBA_task4_SS_SED_2 guided_and_multi_branch_learning Huang2020 Sound Separation 44.5 (43.3 - 46.0) 49.3 32.6
Copiaco_UOW_task4_SED_2 DCASE2020 SED system copiaco Copiaco2020a 7.8 (7.3 - 8.2) 8.5 5.8
Copiaco_UOW_task4_SED_1 DCASE2020 SED system copiaco Copiaco2020a 7.5 (7.0 - 8.0) 7.9 5.9
Kim_AiTeR_GIST_SED_1 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.7 (42.8 - 44.7) 0.645 48.0 0.701 33.0 0.529
Kim_AiTeR_GIST_SED_2 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.9 (43.0 - 44.7) 0.646 48.1 0.704 33.5 0.521
Kim_AiTeR_GIST_SED_4 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.4 (43.5 - 45.2) 0.641 48.0 0.698 35.5 0.522
Kim_AiTeR_GIST_SED_3 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.2 (43.4 - 45.1) 0.650 47.9 0.705 35.2 0.531
Copiaco_UOW_task4_SS_SED_1 DCASE2020 SS+SED system copiaco Copiaco2020b Sound Separation 6.9 (6.7 - 7.2) 7.3 5.7
LJK_PSH_task4_SED_3 LJK PSH DCASE2020 Task4 SED 3 JiaKai2020 38.6 (37.7 - 39.7) 0.582 42.6 0.639 28.8 0.468
LJK_PSH_task4_SED_1 LJK PSH DCASE2020 Task4 SED 1 JiaKai2020 39.3 (38.4 - 40.4) 0.595 43.9 0.644 28.1 0.509
LJK_PSH_task4_SED_2 LJK PSH DCASE2020 Task4 SED 2 JiaKai2020 41.2 (40.1 - 42.4) 0.602 45.8 0.653 29.7 0.513
LJK_PSH_task4_SED_4 LJK PSH DCASE2020 Task4 SED 4 JiaKai2020 40.6 (39.6 - 41.6) 0.598 44.1 0.647 31.9 0.494
Hao_CQU_task4_SED_2 DCASE2020 cross-domain sound event detection Hao2020 47.0 (46.0 - 48.1) 50.6 37.3
Hao_CQU_task4_SED_3 DCASE2020 cross-domain sound event detection Hao2020 46.3 (45.5 - 47.4) 50.3 36.1
Hao_CQU_task4_SED_1 DCASE2020 cross-domain sound event detection Hao2020 44.9 (43.9 - 45.8) 49.4 33.1
Hao_CQU_task4_SED_4 DCASE2020 cross-domain sound event detection Hao2020 47.8 (46.9 - 49.0) 52.3 35.3
Zhenwei_Hou_task4_SED_1 DCASE2020 SED GCA system HouZ2020 45.1 (44.2 - 45.8) 0.600 49.0 0.654 35.2 0.474
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher with thresholding deBenito2020 38.2 (37.5 - 39.2) 0.575 42.0 0.630 29.1 0.460
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher de Benito-Gorron2020 37.9 (37.0 - 39.1) 0.575 41.5 0.630 29.4 0.460
Koh_NTHU_task4_SED_3 Koh_NTHU_3 Koh2020 46.6 (45.8 - 47.6) 0.584 51.5 0.636 34.5 0.476
Koh_NTHU_task4_SED_2 Koh_NTHU_2 Koh2020 45.2 (44.3 - 46.3) 0.645 48.7 0.688 36.4 0.549
Koh_NTHU_task4_SED_1 Koh_NTHU_1 Koh2020 45.2 (44.2 - 46.1) 0.624 49.1 0.669 35.4 0.523
Koh_NTHU_task4_SED_4 Koh_NTHU_4 Koh2020 46.3 (45.4 - 47.2) 0.586 50.3 0.639 36.1 0.478
Cornell_UPM-INRIA_task4_SED_2 UNIVPM-INRIA DAT HMM 1 Cornell2020 42.0 (40.9 - 43.1) 45.6 32.6
Cornell_UPM-INRIA_task4_SED_1 UNIVPM-INRIA ensemble DAT+PCEN Cornell2020 44.4 (43.3 - 45.5) 48.6 33.8
Cornell_UPM-INRIA_task4_SED_4 UNIVPM-INRIA ensemble DAT+PCEN HMM 2 Cornell2020 43.2 (42.1 - 44.4) 47.9 31.4
Cornell_UPM-INRIA_task4_SS_SED_1 UNIVPM-INRIA separation hmm Cornell2020 Sound Separation 38.6 (37.5 - 39.6) 42.3 29.4
Cornell_UPM-INRIA_task4_SED_3 UNIVPM-INRIA ensemble MT+PCEN Cornell2020 42.6 (41.6 - 43.5) 47.7 29.4
Yao_UESTC_task4_SED_1 Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) 47.6 35.3
Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 50.5 36.0
Yao_UESTC_task4_SED_2 Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) 49.6 35.9
Yao_UESTC_task4_SED_4 Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) 49.9 37.3
Liu_thinkit_task4_SED_1 MT_PPDA_cg_valid Liu2020 40.7 (39.7 - 41.7) 45.4 27.7
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_valid Liu2020 41.8 (40.7 - 42.9) 46.0 31.1
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_pub Liu2020 45.2 (44.2 - 46.5) 51.2 30.3
Liu_thinkit_task4_SED_4 MT_PPDA_finall Liu2020 43.1 (42.1 - 44.2) 47.2 32.3
PARK_JHU_task4_SED_1 PARK_fusion_P Park2020 35.8 (35.0 - 36.6) 38.9 28.0
PARK_JHU_task4_SED_1 PARK_fusion_N Park2020 26.5 (25.7 - 27.5) 28.1 22.6
PARK_JHU_task4_SED_2 PARK_fusion_M Park2020 36.9 (36.1 - 37.7) 40.2 28.7
PARK_JHU_task4_SED_3 PARK_fusion_L Park2020 34.7 (34.1 - 35.6) 38.1 26.7
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 Sound Separation 34.5 (33.5 - 35.3) 37.8 26.9
CTK_NU_task4_SED_2 CTK_NU NMF-CNN-2 Chan2020 44.4 (43.5 - 45.5) 0.522 47.8 0.551 35.1 0.452
CTK_NU_task4_SED_4 CTK_NU NMF-CNN-4 Chan2020 46.3 (45.3 - 47.4) 0.534 50.5 0.567 35.3 0.460
CTK_NU_task4_SED_3 CTK_NU NMF-CNN-3 Chan2020 45.8 (45.0 - 47.0) 0.543 49.8 0.575 35.4 0.468
CTK_NU_task4_SED_1 CTK_NU NMF-CNN-1 Chan2020 43.5 (42.6 - 44.7) 0.503 47.5 0.535 33.2 0.436
YenKu_NTU_task4_SED_4 DCASE2020 SED Yen2020 42.7 (41.6 - 43.6) 46.9 31.6
YenKu_NTU_task4_SED_2 DCASE2020 SED Yen2020 42.6 (41.8 - 43.7) 47.5 29.8
YenKu_NTU_task4_SED_3 DCASE2020 SED Yen2020 41.6 (40.6 - 42.7) 45.5 30.9
YenKu_NTU_task4_SED_1 DCASE2020 SED Yen2020 43.6 (42.4 - 44.6) 48.5 30.8
Tang_SCU_task4_SED_1 Basic ResNet block without weakly labeled data augmentation Tang2020 43.1 (42.3 - 44.1) 0.495 46.6 0.538 33.7 0.400
Tang_SCU_task4_SED_4 Multi-scale ResNet block with weakly labeled data augmentation Tang2020 44.1 (43.4 - 44.8) 0.510 47.5 0.556 35.3 0.411
Tang_SCU_task4_SED_2 Basic ResNet block with weakly labeled data augmentation Tang2020 42.4 (41.4 - 43.4) 0.506 46.3 0.552 32.1 0.400
Tang_SCU_task4_SED_3 Multi-scale ResNet block without weakly labeled data augmentation Tang2020 44.1 (43.3 - 45.0) 0.503 48.8 0.559 32.5 0.379
DCASE2020_SED_baseline_system DCASE2020 SED baseline system turpault2020a 34.9 (34.0 - 35.7) 0.496 38.1 0.552 27.8 0.378
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system turpault2020b Sound Separation 36.5 (35.6 - 37.2) 0.497 39.8 0.549 28.8 0.383
Ebbers_UPB_task4_SED_1 DCASE2020 UPB SED system 1 Ebbers2020 47.2 (46.5 - 48.1) 50.9 38.7
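The PSDS cross-trigger values above penalise detections of one class that overlap the ground truth of another class. For orientation, a hypothetical sketch using the psds_eval package is shown below; the thresholds, DataFrames, and alpha weights are toy values, not the challenge configuration.

```python
# Sketch: PSDS with cross-trigger penalty via the psds_eval package (toy data).
import pandas as pd
from psds_eval import PSDSEval

ground_truth = pd.DataFrame({"filename": ["clip1.wav"], "onset": [1.0],
                             "offset": [2.5], "event_label": ["Speech"]})
metadata = pd.DataFrame({"filename": ["clip1.wav"], "duration": [10.0]})
detections = pd.DataFrame({"filename": ["clip1.wav"], "onset": [1.1],
                           "offset": [2.4], "event_label": ["Speech"]})

psds_eval = PSDSEval(dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,
                     ground_truth=ground_truth, metadata=metadata)
psds_eval.add_operating_point(detections)   # one point per decision threshold
result = psds_eval.psds(alpha_ct=1.0,       # alpha_ct > 0 penalises cross-triggers
                        alpha_st=0.0, max_efpr=100)
print(result.value)
```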

Teams ranking

This table includes only the best-performing system per submitting team.

Rank | Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset)
Xiaomi_task4_SED_1 DCASE2020 SED mean-teacher system Liang2020 36.0 (35.3 - 36.8) 35.6
Rykaczewski_Samsung_taks4_SED_2 DCASE2020 SED CNNR Rykaczewski2020 21.9 (21.3 - 22.7) 36.0
Hou_IPS_task4_SED_1 DCASE2020 WASEDA IPS SED HouB2020 34.9 (34.0 - 35.7) 40.8
Miyazaki_NU_task4_SED_1 Conformer SED Miyazaki2020 51.1 (50.1 - 52.3) 50.6
Huang_ICT-TOSHIBA_task4_SS_SED_1 Guided multi branch learning Huang2020 Sound Separation 44.7 (43.6 - 46.2) 47.2
Copiaco_UOW_task4_SED_2 DCASE2020 SED system copiaco Copiaco2020a 7.8 (7.3 - 8.2) 11.2
Kim_AiTeR_GIST_SED_4 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.4 (43.5 - 45.2) 44.0
LJK_PSH_task4_SED_2 LJK PSH DCASE2020 Task4 SED 2 JiaKai2020 41.2 (40.1 - 42.4) 47.9
Hao_CQU_task4_SED_4 DCASE2020 cross-domain sound event detection Hao2020 47.8 (46.9 - 49.0) 50.0
Zhenwei_Hou_task4_SED_1 DCASE2020 SED GCA system HouZ2020 45.1 (44.2 - 45.8) 44.8
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher with Thresholds deBenito2020 38.2 (37.5 - 39.2) 43.4
Koh_NTHU_task4_SED_3 Koh_NTHU_3 Koh2020 46.6 (45.8 - 47.6) 48.0
Cornell_UPM-INRIA_task4_SED_1 UNIVPM-INRIA ensemble DAT+PCEN Cornell2020 44.4 (43.3 - 45.5) 46.2
Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 49.5
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_pub Liu2020 45.2 (44.2 - 46.5) 44.3
PARK_JHU_task4_SED_2 PARK_fusion_M Park2020 36.9 (36.1 - 37.7) 44.9
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 Sound Separation 34.5 (33.5 - 35.3) 36.7
CTK_NU_task4_SED_4 CTK_NU NMF-CNN-4 Chan2020 46.3 (45.3 - 47.4) 48.6
YenKu_NTU_task4_SED_1 DCASE2020 SED Yen2020 43.6 (42.4 - 44.6) 45.6
Tang_SCU_task4_SED_3 Multi-scale ResNet block without weakly labeled data augmentation Tang2020 44.1 (43.3 - 45.0) 48.2
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system turpault2020b Sound Separation 36.5 (35.6 - 37.2) 35.6
Ebbers_UPB_task4_SED_1 DCASE2020 UPB SED system 1 Ebbers2020 47.2 (46.5 - 48.1) 48.3

Supplementary metrics

Rank | Submission code | Submission name | Technical Report | Sound Separation | Event-based F-score with 95% confidence interval (Evaluation dataset) | PSDS Cross-trigger (Evaluation dataset) | Event-based F-score (Public evaluation) | PSDS Cross-trigger (Public evaluation) | Event-based F-score (Vimeo dataset) | PSDS Cross-trigger (Vimeo dataset)
Xiaomi_task4_SED_1 DCASE2020 SED mean-teacher system Liang2020 36.0 (35.3 - 36.8) 40.7 25.3
Rykaczewski_Samsung_taks4_SED_2 DCASE2020 SED CNNR Rykaczewski2020 21.9 (21.3 - 22.7) 24.0 15.7
Hou_IPS_task4_SED_1 DCASE2020 WASEDA IPS SED HouB2020 34.9 (34.0 - 35.7) 38.1 27.8
Miyazaki_NU_task4_SED_1 Conformer SED Miyazaki2020 51.1 (50.1 - 52.3) 55.7 39.6
Huang_ICT-TOSHIBA_task4_SS_SED_1 Guided multi branch learning Huang2020 Sound Separation 44.7 (43.6 - 46.2) 49.5 32.7
Copiaco_UOW_task4_SED_2 DCASE2020 SED system copiaco Copiaco2020a 7.8 (7.3 - 8.2) 8.5 5.8
Kim_AiTeR_GIST_SED_4 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.4 (43.5 - 45.2) 0.641 48.0 0.698 35.5 0.522
LJK_PSH_task4_SED_2 LJK PSH DCASE2020 Task4 SED 2 JiaKai2020 41.2 (40.1 - 42.4) 0.602 45.8 0.653 29.7 0.513
Hao_CQU_task4_SED_4 DCASE2020 cross-domain sound event detection Hao2020 47.8 (46.9 - 49.0) 52.3 35.3
Zhenwei_Hou_task4_SED_1 DCASE2020 SED GCA system HouZ2020 45.1 (44.2 - 45.8) 0.600 49.0 0.654 35.2 0.474
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher with Thresholds deBenito2020 38.2 (37.5 - 39.2) 0.575 42.0 0.630 29.1 0.460
Koh_NTHU_task4_SED_3 Koh_NTHU_3 Koh2020 46.6 (45.8 - 47.6) 0.584 51.5 0.636 34.5 0.476
Cornell_UPM-INRIA_task4_SED_1 UNIVPM-INRIA ensemble DAT+PCEN Cornell2020 44.4 (43.3 - 45.5) 48.6 33.8
Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 50.5 36.0
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_pub Liu2020 45.2 (44.2 - 46.5) 51.2 30.3
PARK_JHU_task4_SED_2 PARK_fusion_M Park2020 36.9 (36.1 - 37.7) 40.2 28.7
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 Sound Separation 34.5 (33.5 - 35.3) 37.8 26.9
CTK_NU_task4_SED_4 CTK_NU NMF-CNN-4 Chan2020 46.3 (45.3 - 47.4) 0.534 50.5 0.567 35.3 0.460
YenKu_NTU_task4_SED_1 DCASE2020 SED Yen2020 43.6 (42.4 - 44.6) 48.5 30.8
Tang_SCU_task4_SED_3 Multi-scale ResNet block without weakly labeled data augmentation Tang2020 44.1 (43.3 - 45.0) 0.503 48.8 0.559 32.5 0.379
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system turpault2020b Sound Separation 36.5 (35.6 - 37.2) 0.497 39.8 0.549 28.8 0.383
Ebbers_UPB_task4_SED_1 DCASE2020 UPB SED system 1 Ebbers2020 47.2 (46.5 - 48.1) 50.9 38.7

Combined SS and SED ranking

System ranking

Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Public evaluation) | Event-based F-score (Vimeo dataset)
Copiaco_UOW_task4_SS_SED_1 DCASE2020 SS+SED system copiaco Copiaco2020b 6.9 (6.7 - 7.2) 7.3 5.7
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 34.5 (33.5 - 35.3) 37.8 26.9
Cornell_UPM-INRIA_task4_SS_SED_1 UNIVPM-INRIA separation hmm Cornell2020 38.6 (37.5 - 39.6) 42.3 29.4
Huang_ICT-TOSHIBA_task4_SS_SED_4 Guided multi branch learning Huang2020 44.1 (42.9 - 45.4) 48.6 32.8
Huang_ICT-TOSHIBA_task4_SS_SED_3 Guided multi branch learning Huang2020 44.4 (43.2 - 45.8) 49.3 32.2
Huang_ICT-TOSHIBA_task4_SS_SED_2 Guided_and_multi_branch_learning Huang2020 44.5 (43.3 - 46.0) 49.3 32.6
Huang_ICT-TOSHIBA_task4_SS_SED_1 Guided multi branch learning Huang2020 44.7 (43.6 - 46.2) 49.5 32.7
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system 36.5 (35.6 - 37.2) 39.8 28.8

Team ranking

This table includes only the best-performing system per submitting team.

Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Public evaluation) | Event-based F-score (Vimeo dataset)
Copiaco_UOW_task4_SS_SED_1 DCASE2020 SS+SED system copiaco Copiaco2020b 6.9 (6.7 - 7.2) 7.3 5.7
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 34.5 (33.5 - 35.3) 37.8 26.9
Cornell_UPM-INRIA_task4_SS_SED_1 UNIVPM-INRIA separation hmm Cornell2020 38.6 (37.5 - 39.6) 42.3 29.4
Huang_ICT-TOSHIBA_task4_SS_SED_1 Guided multi branch learning Huang2020 44.7 (43.6 - 46.2) 49.5 32.7
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system 36.5 (35.6 - 37.2) 39.8 28.8

SS ranking

Rank | Submission code | Submission name | Technical Report | Single-source SI-SNR (Evaluation dataset) | Multi-source SI-SNRi (Evaluation dataset)
Xiaomi_task4_SS_1 DCASE2020 SS system (STFT) Liang2020 33.8 7.0
Xiaomi_task4_SS_2 DCASE2020 SS system (STFT + MFCC) Liang2020 31.8 8.4
DCASE2020_SS_baseline_system DCASE2020 SS baseline system kavalerov2019universal 37.6 12.5
Hou_IPS_task4_SS_1 DCASE2020 WASEDA IPS SS HouB2020 37.6 12.5
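SI-SNR measures reconstruction quality of a separated source invariantly to scaling; SI-SNRi reports the improvement over the unprocessed mixture. A minimal sketch of the standard definition (assuming the usual zero-mean convention) follows.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB (standard definition, zero-mean convention)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    s_target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

# SI-SNRi (improvement): si_snr(estimate, source) - si_snr(mixture, source)
```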

Class-wise performance

Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Alarm/bell/ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
Xiaomi_task4_SED_1 DCASE2020 SED mean-teacher system Liang2020 36.0 (35.3 - 36.8) 29.8 39.3 64.4 26.6 35.4 29.8 32.8 24.0 55.1 23.7
Rykaczewski_Samsung_taks4_SED_3 DCASE2020 SED CNNR Rykaczewski2020 21.6 (21.0 - 22.4) 36.5 5.0 50.7 22.5 37.1 5.8 9.3 1.6 43.1 3.8
Rykaczewski_Samsung_taks4_SED_2 DCASE2020 SED CNNR Rykaczewski2020 21.9 (21.3 - 22.7) 37.6 5.1 51.7 23.0 37.0 6.4 9.1 1.6 43.0 3.8
Rykaczewski_Samsung_taks4_SED_4 DCASE2020 SED CNNR Rykaczewski2020 10.4 (9.7 - 11.1) 10.2 10.3 0.5 0.7 0.6 20.0 33.0 8.4 1.2 20.3
Rykaczewski_Samsung_taks4_SED_1 DCASE2020 SED CNNR Rykaczewski2020 21.6 (20.8 - 22.4) 36.7 4.3 51.7 22.3 36.6 5.7 9.3 1.2 43.1 3.8
Hou_IPS_task4_SED_1 DCASE2020 WASEDA IPS SED HouB2020 34.9 (34.0 - 35.7) 35.9 37.0 62.3 26.0 27.1 25.9 24.7 24.3 48.2 39.0
Miyazaki_NU_task4_SED_1 Conformer SED Miyazaki2020 51.1 (50.1 - 52.3) 40.9 51.5 67.3 38.9 51.4 46.7 53.1 35.3 64.4 63.0
Miyazaki_NU_task4_SED_2 Transformer SED Miyazaki2020 46.4 (45.5 - 47.5) 33.5 51.4 56.2 39.6 49.2 42.2 47.0 27.7 63.0 55.4
Miyazaki_NU_task4_SED_3 transformer conformer Ensemble SED Miyazaki2020 50.7 (49.6 - 51.9) 43.7 53.8 64.1 39.5 52.0 48.2 47.5 29.0 66.6 62.7
Huang_ICT-TOSHIBA_task4_SED_3 guided multi-branch learning Huang2020 44.3 (43.4 - 45.4) 41.0 49.0 55.3 26.1 44.8 40.3 43.1 27.0 58.6 56.2
Huang_ICT-TOSHIBA_task4_SED_1 guided multi-branch learning Huang2020 44.6 (43.5 - 46.0) 43.4 42.0 55.4 25.9 38.8 41.5 46.2 30.2 63.1 60.3
Huang_ICT-TOSHIBA_task4_SS_SED_4 guided multi branch learning Huang2020 44.1 (42.9 - 45.4) 41.4 42.3 56.8 28.4 37.4 39.0 47.1 30.0 59.7 59.1
Huang_ICT-TOSHIBA_task4_SED_4 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 41.2 40.4 54.7 24.8 40.0 42.6 47.7 30.4 62.3 59.1
Huang_ICT-TOSHIBA_task4_SS_SED_1 guided multi branch learning Huang2020 44.7 (43.6 - 46.2) 42.2 43.4 57.1 27.7 38.1 40.2 48.8 30.8 60.4 59.1
Huang_ICT-TOSHIBA_task4_SED_2 guided multi-branch learning Huang2020 44.3 (43.2 - 45.6) 41.0 40.4 55.0 24.3 40.1 42.6 47.7 30.4 62.6 59.1
Huang_ICT-TOSHIBA_task4_SS_SED_3 guided multi branch learning Huang2020 44.4 (43.2 - 45.8) 42.5 42.7 56.9 27.5 38.5 39.8 46.6 30.0 61.0 59.0
Huang_ICT-TOSHIBA_task4_SS_SED_2 guided_and_multi_branch_learning Huang2020 44.5 (43.3 - 46.0) 41.9 41.3 56.5 27.8 37.6 42.0 47.6 30.0 61.7 59.0
Copiaco_UOW_task4_SED_2 DCASE2020 SED system copiaco Copiaco2020a 7.8 (7.3 - 8.2) 3.8 9.7 13.1 3.2 5.1 4.1 1.0 9.3 10.4 18.0
Copiaco_UOW_task4_SED_1 DCASE2020 SED system copiaco Copiaco2020a 7.5 (7.0 - 8.0) 4.3 11.7 13.6 1.6 6.2 1.3 1.2 5.3 12.1 17.0
Kim_AiTeR_GIST_SED_1 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.7 (42.8 - 44.7) 26.8 55.1 67.9 27.9 32.8 27.1 44.4 29.4 65.7 60.8
Kim_AiTeR_GIST_SED_2 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 43.9 (43.0 - 44.7) 26.5 53.6 68.6 28.1 33.9 29.7 43.6 31.6 64.9 59.4
Kim_AiTeR_GIST_SED_4 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.4 (43.5 - 45.2) 28.2 56.8 68.3 29.8 34.1 28.9 42.7 30.0 65.4 60.4
Kim_AiTeR_GIST_SED_3 DCASE 2020 task4 SED system with semi-supervised loss function Kim2020 44.2 (43.4 - 45.1) 27.2 58.2 67.2 28.7 34.3 26.7 44.3 31.9 66.3 58.1
Copiaco_UOW_task4_SS_SED_1 DCASE2020 SS+SED system copiaco Copiaco2020b 6.9 (6.7 - 7.2) 3.9 5.8 15.0 1.0 6.3 1.3 2.2 3.4 15.7 14.3
LJK_PSH_task4_SED_3 LJK PSH DCASE2020 Task4 SED 3 JiaKai2020 38.6 (37.7 - 39.7) 34.5 41.8 57.1 28.7 31.7 39.7 38.5 23.4 54.9 36.4
LJK_PSH_task4_SED_1 LJK PSH DCASE2020 Task4 SED 1 JiaKai2020 39.3 (38.4 - 40.4) 33.8 37.3 59.6 28.9 33.8 38.9 36.8 28.6 55.0 41.7
LJK_PSH_task4_SED_2 LJK PSH DCASE2020 Task4 SED 2 JiaKai2020 41.2 (40.1 - 42.4) 34.2 36.6 58.9 31.0 42.7 40.3 42.3 28.3 58.8 39.2
LJK_PSH_task4_SED_4 LJK PSH DCASE2020 Task4 SED 4 JiaKai2020 40.6 (39.6 - 41.6) 29.7 42.6 58.5 28.5 33.4 42.2 44.2 29.7 51.2 46.0
Hao_CQU_task4_SED_2 DCASE2020 cross-domain sound event detection Hao2020 47.0 (46.0 - 48.1) 40.2 55.4 60.5 34.3 36.2 50.4 45.0 36.5 56.7 54.9
Hao_CQU_task4_SED_3 DCASE2020 cross-domain sound event detection Hao2020 46.3 (45.5 - 47.4) 45.5 48.6 59.1 30.5 33.5 51.6 47.3 33.2 57.0 56.8
Hao_CQU_task4_SED_1 DCASE2020 cross-domain sound event detection Hao2020 44.9 (43.9 - 45.8) 39.5 47.7 59.4 31.8 35.6 48.0 45.2 29.1 59.0 55.2
Hao_CQU_task4_SED_4 DCASE2020 cross-domain sound event detection Hao2020 47.8 (46.9 - 49.0) 42.8 56.9 63.2 31.7 36.4 49.8 50.2 30.1 61.9 55.0
Zhenwei_Hou_task4_SED_1 DCASE2020 SED GCA system HouZ2020 45.1 (44.2 - 45.8) 35.5 50.9 63.7 30.9 41.4 42.4 40.9 26.7 63.1 56.2
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher with thresholding deBenito2020 38.2 (37.5 - 39.2) 38.5 42.2 63.1 22.3 21.5 36.8 30.8 23.5 54.0 51.5
deBenito_AUDIAS_task4_SED_1 5-Resolution Mean Teacher de Benito-Gorron2020 37.9 (37.0 - 39.1) 40.3 42.4 61.5 20.8 14.5 40.9 28.5 24.3 48.4 60.4
Koh_NTHU_task4_SED_3 Koh_NTHU_3 Koh2020 46.6 (45.8 - 47.6) 38.4 50.7 66.6 29.5 42.0 49.1 44.7 24.3 66.0 56.0
Koh_NTHU_task4_SED_2 Koh_NTHU_2 Koh2020 45.2 (44.3 - 46.3) 37.6 53.4 67.1 31.0 43.1 43.7 44.2 25.9 65.1 42.2
Koh_NTHU_task4_SED_1 Koh_NTHU_1 Koh2020 45.2 (44.2 - 46.1) 37.6 53.4 67.1 31.0 43.1 45.8 36.8 29.1 65.1 44.0
Koh_NTHU_task4_SED_4 Koh_NTHU_4 Koh2020 46.3 (45.4 - 47.2) 38.4 50.7 66.6 29.5 42.0 44.1 50.6 23.1 66.0 53.0
Cornell_UPM-INRIA_task4_SED_2 UNIVPM-INRIA DAT HMM 1 Cornell2020 42.0 (40.9 - 43.1) 41.3 44.4 68.2 29.3 35.6 38.8 35.1 27.4 50.2 50.8
Cornell_UPM-INRIA_task4_SED_1 UNIVPM-INRIA ensemble DAT+PCEN Cornell2020 44.4 (43.3 - 45.5) 45.2 49.4 69.8 25.2 33.0 45.2 35.3 32.4 55.3 55.0
Cornell_UPM-INRIA_task4_SED_4 UNIVPM-INRIA ensemble DAT+PCEN HMM 2 Cornell2020 43.2 (42.1 - 44.4) 40.0 50.4 63.3 26.4 33.5 46.1 39.1 25.8 52.4 57.0
Cornell_UPM-INRIA_task4_SS_SED_1 UNIVPM-INRIA separation hmm Cornell2020 38.6 (37.5 - 39.6) 30.3 43.4 65.6 28.3 25.2 41.9 32.4 23.8 49.1 47.6
Cornell_UPM-INRIA_task4_SED_3 UNIVPM-INRIA ensemble MT+PCEN Cornell2020 42.6 (41.6 - 43.5) 45.4 39.7 60.2 29.5 36.9 47.3 36.2 23.6 56.6 51.2
Yao_UESTC_task4_SED_1 Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) 39.5 51.4 47.8 29.7 35.1 43.3 50.0 31.8 52.5 60.9
Yao_UESTC_task4_SED_3 Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 39.6 48.1 50.3 32.9 36.7 54.3 54.1 36.4 49.2 63.7
Yao_UESTC_task4_SED_2 Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) 39.6 52.3 48.7 31.5 37.5 49.8 50.8 34.3 49.7 64.0
Yao_UESTC_task4_SED_4 Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) 40.1 54.9 53.5 33.8 22.0 53.0 56.7 36.1 49.9 62.9
Liu_thinkit_task4_SED_1 MT_PPDA_cg_valid Liu2020 40.7 (39.7 - 41.7) 39.8 36.0 64.6 23.7 33.1 27.2 52.1 24.3 59.8 46.7
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_valid Liu2020 41.8 (40.7 - 42.9) 39.8 36.0 62.4 23.7 32.8 39.1 52.1 24.3 59.6 49.4
Liu_thinkit_task4_SED_1 MT_PPDA_cg_glu_pub Liu2020 45.2 (44.2 - 46.5) 50.4 38.7 66.7 24.9 36.0 40.6 52.9 30.7 59.4 52.7
Liu_thinkit_task4_SED_4 MT_PPDA_finall Liu2020 43.1 (42.1 - 44.2) 43.6 36.3 65.5 24.2 32.6 40.2 52.2 27.2 59.6 50.2
PARK_JHU_task4_SED_1 PARK_fusion_P Park2020 35.8 (35.0 - 36.6) 10.1 38.5 53.4 13.5 37.7 42.2 33.9 24.6 56.1 48.5
PARK_JHU_task4_SED_1 PARK_fusion_N Park2020 26.5 (25.7 - 27.5) 10.1 38.5 33.3 13.5 37.7 18.2 32.0 15.3 25.9 40.9
PARK_JHU_task4_SED_2 PARK_fusion_M Park2020 36.9 (36.1 - 37.7) 21.0 38.5 53.4 13.5 37.7 42.2 33.9 24.6 56.1 48.5
PARK_JHU_task4_SED_3 PARK_fusion_L Park2020 34.7 (34.1 - 35.6) 19.4 38.5 52.2 13.0 37.4 37.3 27.7 23.2 53.5 44.8
Chen_NTHU_task4_SS_SED_1 DCASE2020 SS+SED system Chen2020 34.5 (33.5 - 35.3) 37.9 36.3 58.1 23.7 26.3 23.1 26.5 24.5 48.8 40.5
CTK_NU_task4_SED_2 CTK_NU NMF-CNN-2 Chan2020 44.4 (43.5 - 45.5) 40.3 54.0 49.4 22.0 34.6 43.9 47.6 31.5 58.6 61.7
CTK_NU_task4_SED_4 CTK_NU NMF-CNN-4 Chan2020 46.3 (45.3 - 47.4) 41.6 55.1 55.3 20.1 46.4 42.2 50.2 34.3 58.7 59.9
CTK_NU_task4_SED_3 CTK_NU NMF-CNN-3 Chan2020 45.8 (45.0 - 47.0) 44.8 52.9 54.4 20.1 41.6 42.2 49.8 33.3 59.8 59.9
CTK_NU_task4_SED_1 CTK_NU NMF-CNN-1 Chan2020 43.5 (42.6 - 44.7) 42.5 49.1 55.0 17.7 46.0 36.0 42.7 33.1 55.2 57.6
YenKu_NTU_task4_SED_4 DCASE2020 SED Yen2020 42.7 (41.6 - 43.6) 35.3 41.2 54.5 31.1 41.5 40.7 39.3 27.3 59.7 55.4
YenKu_NTU_task4_SED_2 DCASE2020 SED Yen2020 42.6 (41.8 - 43.7) 42.5 37.7 56.1 29.8 41.2 38.6 44.1 29.6 56.5 49.6
YenKu_NTU_task4_SED_3 DCASE2020 SED Yen2020 41.6 (40.6 - 42.7) 39.4 45.5 49.8 29.8 40.5 38.3 38.5 23.1 59.5 50.9
YenKu_NTU_task4_SED_1 DCASE2020 SED Yen2020 43.6 (42.4 - 44.6) 44.6 38.5 55.9 30.8 39.3 43.4 41.2 30.0 55.7 56.4
Tang_SCU_task4_SED_1 Basic ResNet block without weakly labeled data augmentation Tang2020 43.1 (42.3 - 44.1) 34.9 46.3 63.0 27.7 42.5 33.8 37.8 27.1 67.8 49.6
Tang_SCU_task4_SED_4 Multi-scale ResNet block with weakly labeled data augmentation Tang2020 44.1 (43.4 - 44.8) 34.4 48.6 62.7 23.8 41.0 35.9 42.8 32.3 66.7 52.8
Tang_SCU_task4_SED_2 Basic ResNet block with weakly labeled data augmentation Tang2020 42.4 (41.4 - 43.4) 40.3 41.4 62.7 27.3 37.0 33.3 48.3 20.0 66.6 46.6
Tang_SCU_task4_SED_3 Multi-scale ResNet block without weakly labeled data augmentation Tang2020 44.1 (43.3 - 45.0) 30.4 45.1 64.5 29.6 40.6 35.3 48.1 27.6 63.6 57.5
DCASE2020_SED_baseline_system DCASE2020 SED baseline system turpault2020a 34.9 (34.0 - 35.7) 35.9 37.0 62.6 26.0 27.1 25.9 24.7 24.3 48.2 39.0
DCASE2020_SS_SED_baseline_system DCASE2020 SS+SED baseline system turpault2020b 36.5 (35.6 - 37.2) 38.7 37.5 62.8 24.5 29.6 28.0 28.0 21.5 51.6 43.7
Ebbers_UPB_task4_SED_1 DCASE2020 UPB SED system 1 Ebbers2020 47.2 (46.5 - 48.1) 28.5 56.4 64.0 24.4 37.4 45.8 51.3 37.6 60.6 67.2

System characteristics

General characteristics

Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features
Xiaomi_task4_SED_1 Liang2020 36.0 (35.3 - 36.8) 16kHz time stretching, pitch shifting, reverberation log-mel energies
Rykaczewski_Samsung_taks4_SED_3 Rykaczewski2020 21.6 (21.0 - 22.4) 16kHz log-mel energies
Rykaczewski_Samsung_taks4_SED_2 Rykaczewski2020 21.9 (21.3 - 22.7) 16kHz log-mel energies
Rykaczewski_Samsung_taks4_SED_4 Rykaczewski2020 10.4 (9.7 - 11.1) 16kHz log-mel energies
Rykaczewski_Samsung_taks4_SED_1 Rykaczewski2020 21.6 (20.8 - 22.4) 16kHz log-mel energies
Hou_IPS_task4_SED_1 HouB2020 34.9 (34.0 - 35.7) 16kHz log-mel energies
Miyazaki_NU_task4_SED_1 Miyazaki2020 51.1 (50.1 - 52.3) 16kHz time shifting, mixup log-mel energies
Miyazaki_NU_task4_SED_2 Miyazaki2020 46.4 (45.5 - 47.5) 16kHz time shifting, mixup log-mel energies
Miyazaki_NU_task4_SED_3 Miyazaki2020 50.7 (49.6 - 51.9) 16kHz time shifting, mixup log-mel energies
Huang_ICT-TOSHIBA_task4_SED_3 Huang2020 44.3 (43.4 - 45.4) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SED_1 Huang2020 44.6 (43.5 - 46.0) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SS_SED_4 Huang2020 44.1 (42.9 - 45.4) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SED_4 Huang2020 44.3 (43.2 - 45.6) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SS_SED_1 Huang2020 44.7 (43.6 - 46.2) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SED_2 Huang2020 44.3 (43.2 - 45.6) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SS_SED_3 Huang2020 44.4 (43.2 - 45.8) 44.1kHz time shifting, frequency shifting log-mel energies
Huang_ICT-TOSHIBA_task4_SS_SED_2 Huang2020 44.5 (43.3 - 46.0) 44.1kHz time shifting, frequency shifting log-mel energies
Copiaco_UOW_task4_SED_2 Copiaco2020a 7.8 (7.3 - 8.2) 44.1kHz scalogram, signal energy, spectral centroid
Copiaco_UOW_task4_SED_1 Copiaco2020a 7.5 (7.0 - 8.0) 16kHz scalogram, signal energy, spectral centroid
Kim_AiTeR_GIST_SED_1 Kim2020 43.7 (42.8 - 44.7) 16kHz mixup, specaugment, Gaussian noise mel-spectrogram
Kim_AiTeR_GIST_SED_2 Kim2020 43.9 (43.0 - 44.7) 16kHz mixup, specaugment, Gaussian noise mel-spectrogram
Kim_AiTeR_GIST_SED_4 Kim2020 44.4 (43.5 - 45.2) 16kHz mixup, specaugment, Gaussian noise mel-spectrogram
Kim_AiTeR_GIST_SED_3 Kim2020 44.2 (43.4 - 45.1) 16kHz mixup, specaugment, Gaussian noise mel-spectrogram
Copiaco_UOW_task4_SS_SED_1 Copiaco2020b 6.9 (6.7 - 7.2) 16kHz scalogram, spectral centroid, signal energy
LJK_PSH_task4_SED_3 JiaKai2020 38.6 (37.7 - 39.7) 16kHz pitch shifting log-mel energies
LJK_PSH_task4_SED_1 JiaKai2020 39.3 (38.4 - 40.4) 16kHz pitch shifting log-mel energies
LJK_PSH_task4_SED_2 JiaKai2020 41.2 (40.1 - 42.4) 16kHz pitch shifting log-mel energies
LJK_PSH_task4_SED_4 JiaKai2020 40.6 (39.6 - 41.6) 16kHz pitch shifting log-mel energies
Hao_CQU_task4_SED_2 Hao2020 47.0 (46.0 - 48.1) 22.05kHz log-mel energies
Hao_CQU_task4_SED_3 Hao2020 46.3 (45.5 - 47.4) 22.05kHz log-mel energies
Hao_CQU_task4_SED_1 Hao2020 44.9 (43.9 - 45.8) 22.05kHz log-mel energies
Hao_CQU_task4_SED_4 Hao2020 47.8 (46.9 - 49.0) 22.05kHz log-mel energies
Zhenwei_Hou_task4_SED_1 HouZ2020 45.1 (44.2 - 45.8) 22.05kHz log-mel energies
deBenito_AUDIAS_task4_SED_1 deBenito2020 38.2 (37.5 - 39.2) 16kHz log-mel energies
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2020 37.9 (37.0 - 39.1) 16kHz log-mel energies
Koh_NTHU_task4_SED_3 Koh2020 46.6 (45.8 - 47.6) 16kHz Gaussian noise, mixup, time shifting, pitch shifting log-mel energies
Koh_NTHU_task4_SED_2 Koh2020 45.2 (44.3 - 46.3) 16kHz Gaussian noise, mixup, time shifting, pitch shifting log-mel energies
Koh_NTHU_task4_SED_1 Koh2020 45.2 (44.2 - 46.1) 16kHz Gaussian noise, mixup, time shifting, pitch shifting log-mel energies
Koh_NTHU_task4_SED_4 Koh2020 46.3 (45.4 - 47.2) 16kHz Gaussian noise, mixup, time shifting, pitch shifting log-mel energies
Cornell_UPM-INRIA_task4_SED_2 Cornell2020 42.0 (40.9 - 43.1) 16kHz contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll log-mel energies
Cornell_UPM-INRIA_task4_SED_1 Cornell2020 44.4 (43.3 - 45.5) 16kHz contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll log-mel energies
Cornell_UPM-INRIA_task4_SED_4 Cornell2020 43.2 (42.1 - 44.4) 16kHz contrast, overdrive, pitch shifting, highshelf, lowshelf, noise burst, sine burst, roll log-mel energies
Cornell_UPM-INRIA_task4_SS_SED_1 Cornell2020 38.6 (37.5 - 39.6) 16kHz contrast, overdrive, pitch shifting, highshelf, lowshelf log-mel energies
Cornell_UPM-INRIA_task4_SED_3 Cornell2020 42.6 (41.6 - 43.5) 16kHz log-mel energies
Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) 16kHz mixup, time shifting log-mel energies
Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 16kHz mixup, time shifting log-mel energies
Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) 16kHz mixup, time shifting log-mel energies
Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) 16kHz mixup, time shifting log-mel energies
Liu_thinkit_task4_SED_1 Liu2020 40.7 (39.7 - 41.7) 16kHz noise addition, pitch shifting, time rolling, dynamic range compression log-mel energies
Liu_thinkit_task4_SED_1 Liu2020 41.8 (40.7 - 42.9) 16kHz noise addition, pitch shifting, time rolling, dynamic range compression log-mel energies
Liu_thinkit_task4_SED_1 Liu2020 45.2 (44.2 - 46.5) 16kHz noise addition, pitch shifting, time rolling, dynamic range compression log-mel energies
Liu_thinkit_task4_SED_4 Liu2020 43.1 (42.1 - 44.2) 16kHz noise addition, pitch shifting, time rolling, dynamic range compression log-mel energies
PARK_JHU_task4_SED_1 Park2020 35.8 (35.0 - 36.6) 44.1kHz log-mel energies
PARK_JHU_task4_SED_1 Park2020 26.5 (25.7 - 27.5) 44.1kHz log-mel energies
PARK_JHU_task4_SED_2 Park2020 36.9 (36.1 - 37.7) 44.1kHz log-mel energies
PARK_JHU_task4_SED_3 Park2020 34.7 (34.1 - 35.6) 44.1kHz log-mel energies
Chen_NTHU_task4_SS_SED_1 Chen2020 34.5 (33.5 - 35.3) 16kHz sound event separation log-mel energies
CTK_NU_task4_SED_2 Chan2020 44.4 (43.5 - 45.5) 22.05kHz Gaussian noise log-mel energies
CTK_NU_task4_SED_4 Chan2020 46.3 (45.3 - 47.4) 22.05kHz Gaussian noise log-mel energies
CTK_NU_task4_SED_3 Chan2020 45.8 (45.0 - 47.0) 22.05kHz Gaussian noise log-mel energies
CTK_NU_task4_SED_1 Chan2020 43.5 (42.6 - 44.7) 22.05kHz Gaussian noise log-mel energies
YenKu_NTU_task4_SED_4 Yen2020 42.7 (41.6 - 43.6) 16kHz log-mel spectrogram
YenKu_NTU_task4_SED_2 Yen2020 42.6 (41.8 - 43.7) 16kHz log-mel spectrogram
YenKu_NTU_task4_SED_3 Yen2020 41.6 (40.6 - 42.7) 16kHz log-mel spectrogram
YenKu_NTU_task4_SED_1 Yen2020 43.6 (42.4 - 44.6) 16kHz log-mel spectrogram
Tang_SCU_task4_SED_1 Tang2020 43.1 (42.3 - 44.1) 22.05kHz specaugment log-mel energies
Tang_SCU_task4_SED_4 Tang2020 44.1 (43.4 - 44.8) 22.05kHz specaugment, time stretching, pitch shifting, time shifting log-mel energies
Tang_SCU_task4_SED_2 Tang2020 42.4 (41.4 - 43.4) 22.05kHz specaugment, time stretching, pitch shifting, time shifting log-mel energies
Tang_SCU_task4_SED_3 Tang2020 44.1 (43.3 - 45.0) 22.05kHz specaugment log-mel energies
DCASE2020_SED_baseline_system turpault2020a 34.9 (34.0 - 35.7) 16kHz log-mel energies
DCASE2020_SS_SED_baseline_system turpault2020b 36.5 (35.6 - 37.2) 16kHz log-mel energies
Ebbers_UPB_task4_SED_1 Ebbers2020 47.2 (46.5 - 48.1) 16kHz scaling, mixup, frequency warping, blurring, frequency masking, time masking, Gaussian noise log-mel energies
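Log-mel energies dominate the Features column above. A sketch of a typical extraction with librosa is given below; the analysis parameters are illustrative choices, not the settings of any particular submission.

```python
# Sketch: log-mel energy extraction (illustrative parameters).
import librosa

y, sr = librosa.load("clip1.wav", sr=16000)   # most systems resample to 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=255, n_mels=64)
log_mel = librosa.power_to_db(mel)            # shape: (n_mels, n_frames)
```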



Machine learning characteristics

Rank | Code | Technical Report | Event-based F-score (Eval) | Classifier | Semi-supervised approach | Post-processing | Segmentation method | Decision making
Xiaomi_task4_SED_1 Liang2020 36.0 (35.3 - 36.8) CRNN mean-teacher student median filtering (450ms)
Rykaczewski_Samsung_taks4_SED_3 Rykaczewski2020 21.6 (21.0 - 22.4) CRNN mean-teacher student median filtering (93ms)
Rykaczewski_Samsung_taks4_SED_2 Rykaczewski2020 21.9 (21.3 - 22.7) CRNN mean-teacher student median filtering (93ms)
Rykaczewski_Samsung_taks4_SED_4 Rykaczewski2020 10.4 (9.7 - 11.1) CRNN mean-teacher student median filtering (93ms)
Rykaczewski_Samsung_taks4_SED_1 Rykaczewski2020 21.6 (20.8 - 22.4) CRNN mean-teacher student median filtering (93ms)
Hou_IPS_task4_SED_1 HouB2020 34.9 (34.0 - 35.7) CRNN mean-teacher student median filtering (450ms)
Miyazaki_NU_task4_SED_1 Miyazaki2020 51.1 (50.1 - 52.3) conformer mean-teacher student thresholding, median filtering mean
Miyazaki_NU_task4_SED_2 Miyazaki2020 46.4 (45.5 - 47.5) transformer mean-teacher student thresholding, median filtering mean
Miyazaki_NU_task4_SED_3 Miyazaki2020 50.7 (49.6 - 51.9) transformer, conformer mean-teacher student thresholding, median filtering mean
Huang_ICT-TOSHIBA_task4_SED_3 Huang2020 44.3 (43.4 - 45.4) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SED_1 Huang2020 44.6 (43.5 - 46.0) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SS_SED_4 Huang2020 44.1 (42.9 - 45.4) CNN guided multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SED_4 Huang2020 44.3 (43.2 - 45.6) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SS_SED_1 Huang2020 44.7 (43.6 - 46.2) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SED_2 Huang2020 44.3 (43.2 - 45.6) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SS_SED_3 Huang2020 44.4 (43.2 - 45.8) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Huang_ICT-TOSHIBA_task4_SS_SED_2 Huang2020 44.5 (43.3 - 46.0) CNN guided and multi branch learning median filtering (with adaptive window size) attention layers, global max pooling, global average pooling
Copiaco_UOW_task4_SED_2 Copiaco2020a 7.8 (7.3 - 8.2) AlexNet, transfer learning CNN mean-teacher student median filtering (93ms) P-norm
Copiaco_UOW_task4_SED_1 Copiaco2020a 7.5 (7.0 - 8.0) AlexNet, transfer learning CNN mean-teacher student median filtering (93ms) P-norm
Kim_AiTeR_GIST_SED_1 Kim2020 43.7 (42.8 - 44.7) CRNN pseudo-labelling median filtering (79ms) thresholding
Kim_AiTeR_GIST_SED_2 Kim2020 43.9 (43.0 - 44.7) CRNN pseudo-labelling median filtering (79ms) thresholding
Kim_AiTeR_GIST_SED_4 Kim2020 44.4 (43.5 - 45.2) CRNN pseudo-labelling median filtering (79ms) thresholding
Kim_AiTeR_GIST_SED_3 Kim2020 44.2 (43.4 - 45.1) CRNN pseudo-labelling median filtering (79ms) thresholding
Copiaco_UOW_task4_SS_SED_1 Copiaco2020b 6.9 (6.7 - 7.2) AlexNet, transfer learning CNN mean-teacher student median filtering (93ms) P-norm
LJK_PSH_task4_SED_3 JiaKai2020 38.6 (37.7 - 39.7) CRNN, ensemble mean-teacher student median filtering (class-dependent) attention layers macro F1 vote
LJK_PSH_task4_SED_1 JiaKai2020 39.3 (38.4 - 40.4) CRNN, ensemble mean-teacher student median filtering (class-dependent) attention layers macro F1 vote
LJK_PSH_task4_SED_2 JiaKai2020 41.2 (40.1 - 42.4) CRNN, ensemble mean-teacher student median filtering (class-dependent) attention layers macro F1 vote
LJK_PSH_task4_SED_4 JiaKai2020 40.6 (39.6 - 41.6) CRNN, ensemble mean-teacher student median filtering (class-dependent) attention layers macro F1 vote
Hao_CQU_task4_SED_2 Hao2020 47.0 (46.0 - 48.1) CRNN domain adaptation median filtering (with adaptive window size)
Hao_CQU_task4_SED_3 Hao2020 46.3 (45.5 - 47.4) CRNN domain adaptation median filtering (with adaptive window size)
Hao_CQU_task4_SED_1 Hao2020 44.9 (43.9 - 45.8) CRNN domain adaptation median filtering (with adaptive window size)
Hao_CQU_task4_SED_4 Hao2020 47.8 (46.9 - 49.0) CRNN domain adaptation median filtering (with adaptive window size)
Zhenwei_Hou_task4_SED_1 HouZ2020 45.1 (44.2 - 45.8) CRNN mean-teacher student median filtering (93ms) mean
deBenito_AUDIAS_task4_SED_1 deBenito2020 38.2 (37.5 - 39.2) CRNN mean-teacher student median filtering (45ms) mean, class-specific thresholding
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2020 37.9 (37.0 - 39.1) CRNN mean-teacher student median filtering (45ms) mean
Koh_NTHU_task4_SED_3 Koh2020 46.6 (45.8 - 47.6) FP-CRNN mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling median filtering (0.45s) mean probabilities, thresholding
Koh_NTHU_task4_SED_2 Koh2020 45.2 (44.3 - 46.3) CRNN mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling median filtering (adaptive window size) mean probabilities, thresholding
Koh_NTHU_task4_SED_1 Koh2020 45.2 (44.2 - 46.1) CRNN mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling median filtering (0.45s) mean probabilities, thresholding
Koh_NTHU_task4_SED_4 Koh2020 46.3 (45.4 - 47.2) FP-CRNN mean-teacher student, interpolation consistency training, shift consistency training, weakly pseudo-labeling median filtering (adaptive window size) mean probabilities, thresholding
Cornell_UPM-INRIA_task4_SED_2 Cornell2020 42.0 (40.9 - 43.1) CRNN mean-teacher student HMM smoothing HMM smoothing
Cornell_UPM-INRIA_task4_SED_1 Cornell2020 44.4 (43.3 - 45.5) CRNN mean-teacher student HMM smoothing HMM smoothing
Cornell_UPM-INRIA_task4_SED_4 Cornell2020 43.2 (42.1 - 44.4) CRNN mean-teacher student HMM smoothing HMM smoothing
Cornell_UPM-INRIA_task4_SS_SED_1 Cornell2020 38.6 (37.5 - 39.6) CRNN mean-teacher student HMM smoothing HMM smoothing
Cornell_UPM-INRIA_task4_SED_3 Cornell2020 42.6 (41.6 - 43.5) CRNN mean-teacher student HMM smoothing HMM smoothing
Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) CRNN mean-teacher student median filtering (class-dependent) attention layers CRNN
Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) CRNN mean-teacher student median filtering (class-dependent) attention layers CRNN
Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) CRNN mean-teacher student median filtering (class-dependent) attention layers CRNN
Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) CRNN mean-teacher student median filtering (class-dependent) attention layers CRNN
Liu_thinkit_task4_SED_1 Liu2020 40.7 (39.7 - 41.7) CRNN mean-teacher student median filtering (class-dependent) weighted mean
Liu_thinkit_task4_SED_1 Liu2020 41.8 (40.7 - 42.9) CRNN mean-teacher student median filtering (class-dependent) weighted mean
Liu_thinkit_task4_SED_1 Liu2020 45.2 (44.2 - 46.5) CRNN mean-teacher student median filtering (class-dependent) weighted mean
Liu_thinkit_task4_SED_4 Liu2020 43.1 (42.1 - 44.2) CRNN mean-teacher student median filtering (class-dependent) weighted mean
PARK_JHU_task4_SED_1 Park2020 35.8 (35.0 - 36.6) CRNN mean-teacher student median filtering (450ms) weighted mean
PARK_JHU_task4_SED_1 Park2020 26.5 (25.7 - 27.5) CRNN mean-teacher student median filtering (450ms) weighted mean
PARK_JHU_task4_SED_2 Park2020 36.9 (36.1 - 37.7) CRNN mean-teacher student median filtering (450ms) weighted mean
PARK_JHU_task4_SED_3 Park2020 34.7 (34.1 - 35.6) CRNN mean-teacher student median filtering (450ms) weighted mean
Chen_NTHU_task4_SS_SED_1 Chen2020 34.5 (33.5 - 35.3) CRNN mean-teacher student median filtering (93ms) P-norm
CTK_NU_task4_SED_2 Chan2020 44.4 (43.5 - 45.5) NMF, CNN consistency enforcement median filtering
CTK_NU_task4_SED_4 Chan2020 46.3 (45.3 - 47.4) NMF, CNN consistency enforcement median filtering
CTK_NU_task4_SED_3 Chan2020 45.8 (45.0 - 47.0) NMF, CNN consistency enforcement median filtering
CTK_NU_task4_SED_1 Chan2020 43.5 (42.6 - 44.7) NMF, CNN consistency enforcement median filtering
YenKu_NTU_task4_SED_4 Yen2020 42.7 (41.6 - 43.6) CRNN guided learning pseudo-labelling, mean-teacher student median filtering (with adaptive window size)
YenKu_NTU_task4_SED_2 Yen2020 42.6 (41.8 - 43.7) CRNN guided learning pseudo-labelling, mean-teacher student median filtering (with adaptive window size)
YenKu_NTU_task4_SED_3 Yen2020 41.6 (40.6 - 42.7) CRNN guided learning pseudo-labelling, mean-teacher student median filtering (with adaptive window size)
YenKu_NTU_task4_SED_1 Yen2020 43.6 (42.4 - 44.6) CRNN guided learning pseudo-labelling, mean-teacher student median filtering (with adaptive window size)
Tang_SCU_task4_SED_1 Tang2020 43.1 (42.3 - 44.1) CRNN mean-teacher student median filtering
Tang_SCU_task4_SED_4 Tang2020 44.1 (43.4 - 44.8) CRNN mean-teacher student median filtering
Tang_SCU_task4_SED_2 Tang2020 42.4 (41.4 - 43.4) CRNN mean-teacher student median filtering
Tang_SCU_task4_SED_3 Tang2020 44.1 (43.3 - 45.0) CRNN mean-teacher student median filtering
DCASE2020_SED_baseline_system turpault2020a 34.9 (34.0 - 35.7) CRNN mean-teacher student median filtering (93ms)
DCASE2020_SS_SED_baseline_system turpault2020b 36.5 (35.6 - 37.2) CRNN mean-teacher student median filtering (93ms) P-norm
Ebbers_UPB_task4_SED_1 Ebbers2020 47.2 (46.5 - 48.1) CRNN pseudo-labelling median filtering (200-400ms) small segment tagging + pseudo-labelled training mean
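Almost every system above turns frame-level posteriors into events by thresholding followed by median filtering. A minimal sketch of this post-processing is shown below; the frame hop, threshold, and window length are illustrative, and several submissions instead use class-dependent or adaptive window sizes.

```python
# Sketch: thresholding + median filtering of frame posteriors (toy parameters).
import numpy as np
from scipy.ndimage import median_filter

def probs_to_events(probs, labels, frame_hop_s=0.064, threshold=0.5, win_frames=7):
    """Turn (n_frames, n_classes) posteriors into (label, onset_s, offset_s) events."""
    events = []
    binarised = probs > threshold
    for c, label in enumerate(labels):
        smoothed = median_filter(binarised[:, c].astype(float), size=win_frames) > 0.5
        padded = np.concatenate(([False], smoothed, [False]))
        changes = np.flatnonzero(padded[1:] != padded[:-1])  # run boundaries
        for onset_f, offset_f in zip(changes[::2], changes[1::2]):
            events.append((label, onset_f * frame_hop_s, offset_f * frame_hop_s))
    return events
```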

Complexity

Rank | Code | Technical Report | Event-based F-score (Eval) | Model complexity | Ensemble subsystems | Training time
Xiaomi_task4_SED_1 Liang2020 36.0 (35.3 - 36.8) 1112420 3h (1 v100)
Rykaczewski_Samsung_taks4_SED_3 Rykaczewski2020 21.6 (21.0 - 22.4) 165460 8h (1 GTX 1080 Ti)
Rykaczewski_Samsung_taks4_SED_2 Rykaczewski2020 21.9 (21.3 - 22.7) 165460 8h (1 GTX 1080 Ti)
Rykaczewski_Samsung_taks4_SED_4 Rykaczewski2020 10.4 (9.7 - 11.1) 165460 8h (1 GTX 1080 Ti)
Rykaczewski_Samsung_taks4_SED_1 Rykaczewski2020 21.6 (20.8 - 22.4) 165460 8h (1 GTX 1080 Ti)
Hou_IPS_task4_SED_1 HouB2020 34.9 (34.0 - 35.7) 1112420 12h (1 k80)
Miyazaki_NU_task4_SED_1 Miyazaki2020 51.1 (50.1 - 52.3) 17097712 8 12h (1 TITAN Xp)
Miyazaki_NU_task4_SED_2 Miyazaki2020 46.4 (45.5 - 47.5) 72107714 7 12h (1 TITAN Xp)
Miyazaki_NU_task4_SED_3 Miyazaki2020 50.7 (49.6 - 51.9) 89205426 15 12h (1 TITAN Xp)
Huang_ICT-TOSHIBA_task4_SED_3 Huang2020 44.3 (43.4 - 45.4) 1145928 9h (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SED_1 Huang2020 44.6 (43.5 - 46.0) 6878788 6 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SS_SED_4 Huang2020 44.1 (42.9 - 45.4) 10316572 9 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SED_4 Huang2020 44.3 (43.2 - 45.6) 6878788 6 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SS_SED_1 Huang2020 44.7 (43.6 - 46.2) 10316572 9 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SED_2 Huang2020 44.3 (43.2 - 45.6) 6878788 6 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SS_SED_3 Huang2020 44.4 (43.2 - 45.8) 6878788 9 5h for each model (GTX 1080 Ti)
Huang_ICT-TOSHIBA_task4_SS_SED_2 Huang2020 44.5 (43.3 - 46.0) 10316572 9 5h for each model (GTX 1080 Ti)
Copiaco_UOW_task4_SED_2 Copiaco2020a 7.8 (7.3 - 8.2) 60000000 60h (Intel Core i7-5600U CPU @ 2.60GHz)
Copiaco_UOW_task4_SED_1 Copiaco2020a 7.5 (7.0 - 8.0) 60000000 60h (Intel Core i7-5600U CPU @ 2.60GHz)
Kim_AiTeR_GIST_SED_1 Kim2020 43.7 (42.8 - 44.7) 13049904 5 12h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_2 Kim2020 43.9 (43.0 - 44.7) 13049904 5 12h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_4 Kim2020 44.4 (43.5 - 45.2) 13049904 5 12h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_3 Kim2020 44.2 (43.4 - 45.1) 13049904 5 12h (1 GTX 1080 Ti)
Copiaco_UOW_task4_SS_SED_1 Copiaco2020b 6.9 (6.7 - 7.2) 6000000 60h (Intel Core i7-5600U CPU @ 2.60GHz)
LJK_PSH_task4_SED_3 JiaKai2020 38.6 (37.7 - 39.7) 1042132 5 2.5h (1 GTX 2080 Ti)
LJK_PSH_task4_SED_1 JiaKai2020 39.3 (38.4 - 40.4) 1042132 7 2.5h (1 GTX 2080 Ti)
LJK_PSH_task4_SED_2 JiaKai2020 41.2 (40.1 - 42.4) 1042132 10 2.5h (1 GTX 2080 Ti)
LJK_PSH_task4_SED_4 JiaKai2020 40.6 (39.6 - 41.6) 1042132 5 2.5h (1 GTX 2080 Ti)
Hao_CQU_task4_SED_2 Hao2020 47.0 (46.0 - 48.1) 1914974
Hao_CQU_task4_SED_3 Hao2020 46.3 (45.5 - 47.4) 2178270
Hao_CQU_task4_SED_1 Hao2020 44.9 (43.9 - 45.8) 1916264
Hao_CQU_task4_SED_4 Hao2020 47.8 (46.9 - 49.0) 2134091
Zhenwei_Hou_task4_SED_1 HouZ2020 45.1 (44.2 - 45.8) 1112420 6 9h (1 GTX 1080 Ti)
deBenito_AUDIAS_task4_SED_1 deBenito2020 38.2 (37.5 - 39.2) 5562100 5 10h (1 RTX 2080)
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2020 37.9 (37.0 - 39.1) 5562100 5 10h (1 RTX 2080)
Koh_NTHU_task4_SED_3 Koh2020 46.6 (45.8 - 47.6) 12807540 5 19h (1 GTX 1080 Ti)
Koh_NTHU_task4_SED_2 Koh2020 45.2 (44.3 - 46.3) 3337260 3 14h (1 GTX 1080 Ti)
Koh_NTHU_task4_SED_1 Koh2020 45.2 (44.2 - 46.1) 3337260 3 14h (1 GTX 1080 Ti)
Koh_NTHU_task4_SED_4 Koh2020 46.3 (45.4 - 47.2) 12807540 5 19h (1 GTX 1080 Ti)
Cornell_UPM-INRIA_task4_SED_2 Cornell2020 42.0 (40.9 - 43.1) 1112420 48h (1 GTX 1080)
Cornell_UPM-INRIA_task4_SED_1 Cornell2020 44.4 (43.3 - 45.5) 3337260 3 48h (1 GTX 1080)
Cornell_UPM-INRIA_task4_SED_4 Cornell2020 43.2 (42.1 - 44.4) 3337260 3 48h (1 GTX 1080)
Cornell_UPM-INRIA_task4_SS_SED_1 Cornell2020 38.6 (37.5 - 39.6) 1112420 3h (1 GTX 1080)
Cornell_UPM-INRIA_task4_SED_3 Cornell2020 42.6 (41.6 - 43.5) 1113082 6h (1 GTX 1080)
Yao_UESTC_task4_SED_1 Yao2020 44.1 (43.1 - 45.2) 2051582 4 17h (1 GTX 2070)
Yao_UESTC_task4_SED_3 Yao2020 46.4 (45.3 - 47.6) 6726008 6 21h (1 GTX 2070)
Yao_UESTC_task4_SED_2 Yao2020 45.7 (44.7 - 47.0) 6726008 6 21h (1 GTX 2070)
Yao_UESTC_task4_SED_4 Yao2020 46.2 (45.2 - 47.0) 12710274 4 24h (1 GTX 2070)
Liu_thinkit_task4_SED_1 Liu2020 40.7 (39.7 - 41.7) 1112421 1 48h (1 GTX 1080 Ti)
Liu_thinkit_task4_SED_1 Liu2020 41.8 (40.7 - 42.9) 1112421 6 48h (1 GTX 1080 Ti)
Liu_thinkit_task4_SED_1 Liu2020 45.2 (44.2 - 46.5) 1112421 6 48h (1 GTX 1080 Ti)
Liu_thinkit_task4_SED_4 Liu2020 43.1 (42.1 - 44.2) 1112421 6 48h (1 GTX 1080 Ti)
PARK_JHU_task4_SED_1 Park2020 35.8 (35.0 - 36.6) 1112420 5 3h (1 GTX 1080 Ti)
PARK_JHU_task4_SED_1 Park2020 26.5 (25.7 - 27.5) 1112420 5 3h (1 GTX 1080 Ti)
PARK_JHU_task4_SED_2 Park2020 36.9 (36.1 - 37.7) 1112420 5 3h (1 GTX 1080 Ti)
PARK_JHU_task4_SED_3 Park2020 34.7 (34.1 - 35.6) 1112420 5 3h (1 GTX 1080 Ti)
Chen_NTHU_task4_SS_SED_1 Chen2020 34.5 (33.5 - 35.3) 1112420 3 3h (1 GTX 1080 Ti)
CTK_NU_task4_SED_2 Chan2020 44.4 (43.5 - 45.5) 5038932 11h (1 GTX 1060)
CTK_NU_task4_SED_4 Chan2020 46.3 (45.3 - 47.4) 10077864 2 17h (1 GTX 1060)
CTK_NU_task4_SED_3 Chan2020 45.8 (45.0 - 47.0) 10077864 2 17h (1 GTX 1060)
CTK_NU_task4_SED_1 Chan2020 43.5 (42.6 - 44.7) 5038932 6h (1 GTX 1060)
YenKu_NTU_task4_SED_4 Yen2020 42.7 (41.6 - 43.6) 1403370 5h (1 GTX TITAN)
YenKu_NTU_task4_SED_2 Yen2020 42.6 (41.8 - 43.7) 1403370 5h (1 GTX TITAN)
YenKu_NTU_task4_SED_3 Yen2020 41.6 (40.6 - 42.7) 1403370 5h (1 GTX TITAN)
YenKu_NTU_task4_SED_1 Yen2020 43.6 (42.4 - 44.6) 1403370 5h (1 GTX TITAN)
Tang_SCU_task4_SED_1 Tang2020 43.1 (42.3 - 44.1) 1687466 5h (1 TITAN Xp)
Tang_SCU_task4_SED_4 Tang2020 44.1 (43.4 - 44.8) 3186890 17h (1 TITAN Xp)
Tang_SCU_task4_SED_2 Tang2020 42.4 (41.4 - 43.4) 1687466 6h (1 TITAN Xp)
Tang_SCU_task4_SED_3 Tang2020 44.1 (43.3 - 45.0) 3186890 15h (1 TITAN Xp)
DCASE2020_SED_baseline_system turpault2020a 34.9 (34.0 - 35.7) 1112420 3h (1 GTX 1080 Ti)
DCASE2020_SS_SED_baseline_system turpault2020b 36.5 (35.6 - 37.2) 1112420 3 3h (1 GTX 1080 Ti)
Ebbers_UPB_task4_SED_1 Ebbers2020 47.2 (46.5 - 48.1) 20319432 8 24h (4 RTX 2080)

Combined SS and SED

Rank | Code | Technical Report | Event-based F-score (Eval) | Sound separation method | SS Model complexity | SS Training time | Sources used for SED | SS/SED Integration type | SS/SED Integration method
Copiaco_UOW_task4_SS_SED_1 Copiaco2020b 6.9 (6.7 - 7.2) DNN 500000 3h (Intel Core i7-5600U CPU @ 2.60GHz) DESED foreground early average
Chen_NTHU_task4_SS_SED_1 Chen2020 34.5 (33.5 - 35.3) ConvTasNet 1112420 3h (1 GTX 1080 Ti) DESED foreground, FUSS both average
Cornell_UPM-INRIA_task4_SS_SED_1 Cornell2020 38.6 (37.5 - 39.6) ConvTasNet (X= 5, R= 3) 3203999 12h (1 GTX 1080 Ti) all sources late concat
Huang_ICT-TOSHIBA_task4_SS_SED_4 Huang2020 44.1 (42.9 - 45.4) TDCN++ 1112420 3h (1 GTX 1080 Ti) all sources late average
Huang_ICT-TOSHIBA_task4_SS_SED_3 Huang2020 44.4 (43.2 - 45.8) TDCN++ 1112420 3h (1 GTX 1080 Ti) all sources late average
Huang_ICT-TOSHIBA_task4_SS_SED_2 Huang2020 44.5 (43.3 - 46.0) TDCN++ 1112420 3h (1 GTX 1080 Ti) all sources late average
Huang_ICT-TOSHIBA_task4_SS_SED_1 Huang2020 44.7 (43.6 - 46.2) TDCN++ 1112420 3h (1 GTX 1080 Ti) all sources late average
DCASE2020_SS_SED_baseline_system 36.5 (35.6 - 37.2) TDCN++ 9179401 DESED sources late P-norm
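The "late average" integration listed above runs the SED model on the mixture and on each separated source and averages the frame-level posteriors. A hypothetical sketch follows; sed_model and separator are placeholder callables, not any submission's actual components.

```python
# Sketch: "late average" SS/SED integration (placeholder callables).
import numpy as np

def late_average_posteriors(mixture, sed_model, separator):
    sources = separator(mixture)                       # separated waveforms
    preds = [sed_model(mixture)] + [sed_model(s) for s in sources]
    return np.mean(preds, axis=0)                      # (n_frames, n_classes)
```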

SS systems

Rank | Code | Technical Report | Single-source SI-SNR (Eval) | Sound separation method | SS Model complexity | Input features | Data augmentation
Xiaomi_task4_SS_1 Liang2020 33.8 TDCN++ 27538206 STFT time shifting, pitch shifting
Xiaomi_task4_SS_2 Liang2020 31.8 TDCN++ 27636510 STFT, MFCC time shifting, pitch shifting
DCASE2020_SS_baseline_system kavalerov2019universal 37.6 TDCN++ 9179401 STFT (32ms / 8ms)
Hou_IPS_task4_SS_1 HouB2020 37.6 TDCN++ STFT (32ms / 8ms)

Technical reports

Semi-Supervised NMF-CNN For Sound Event Detection

Chan, Teck Kai1,2 and Chin, Cheng Siong1 and Li, Ye2
1Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, 2Xylem Water Solution Singapore Pte Ltd, 3A International Business Park, Singapore

Abstract

For the DCASE 2020 Challenge Task 4, this paper proposes a combinative approach using Nonnegative Matrix Factorization (NMF) and a Convolutional Neural Network (CNN). The main idea begins with utilizing NMF to approximate strong labels for the weakly labeled data. Subsequently, based on the approximated strongly labeled data, two different CNNs are trained using a semi-supervised framework, where one CNN is used for clip-level prediction and the other for frame-level prediction. Using this idea, the best model trained can achieve an event-based F1-score of 45.7% on the validation dataset. Using an ensemble of models, the event-based F1-score can be increased to 48.6%, outperforming the baseline model by a margin of over 8%.
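
As a rough illustration of the first step, the sketch below derives frame-level activity from the NMF activations of a mel spectrogram; the component count, the threshold, and assigning every clip-level tag the same time support are simplifying assumptions for illustration, not the exact procedure from the report.

```python
# Illustrative sketch only, not the authors' exact pseudo-labeling procedure.
import numpy as np
from sklearn.decomposition import NMF

def pseudo_strong_labels(mel_spec, clip_tags, n_components=16, threshold=0.5):
    """mel_spec: non-negative (n_mels, n_frames) array; clip_tags: weak labels."""
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    nmf.fit_transform(mel_spec)                    # spectral bases (unused here)
    H = nmf.components_                            # (n_components, n_frames) activations
    act = H / (H.max(axis=1, keepdims=True) + 1e-8)
    frame_active = (act > threshold).any(axis=0)   # frames with strong activity
    # crude assumption: every clip-level tag inherits the same time support
    return {tag: frame_active.copy() for tag in clip_tags}
```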

System characteristics
Input mono
Sampling rate 22.05 kHz
Data augmentation Gaussian noise
Features log-mel energies
Classifier NMF, CNN
PDF

Combined Sound Event Detection And Sound Event Separation Networks For DCASE 2020 Task 4

Chen, You-Siang1 and Lin, Zi Jie1 and Li, Shang-En2 and Koh, Chih-Yuan1 and Bai, Mingsian R.1 and Chien, Jen-Tzung2 and Liu, Yi-Wen1
1National Tsing Hua University, Hsinchu, 2National Chiao Tung University, Hsinchu, Taiwan

Abstract

In this paper, we propose a hybrid neural network (NN) to handle the tasks of sound event separation (SES) and sound event detection (SED) in Task 4 of the DCASE 2020 challenge. The convolutional time-domain audio separation network (Conv-TasNet) is employed to extract the foreground sound events defined in the DCASE challenge. By comparing the baseline SED network with various training strategies, we demonstrate that the SES network is capable of effectively enhancing SED performance in terms of several event-based metrics, including macro F1 and the polyphonic sound detection score (PSDS).
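
A minimal sketch of one way such a separation front-end can feed the detector: run the SED model on the mixture and on each Conv-TasNet foreground estimate, then average the posteriors. The `sed_model` interface is an assumption; the report compares several training strategies rather than prescribing this one.

```python
# Hedged sketch: assumes sed_model maps an input clip to frame-level class logits.
import torch

def fuse_predictions(sed_model, mixture, separated_sources):
    """Average frame-level class posteriors over the mixture and separated sources."""
    with torch.no_grad():
        preds = [torch.sigmoid(sed_model(mixture))]
        preds += [torch.sigmoid(sed_model(src)) for src in separated_sources]
    return torch.stack(preds).mean(dim=0)
```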

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation Sound event separation
Features log-mel energies
Classifier CRNN, ConvTasNet
Decision making P-norm
PDF

Sound Event Detection And Classification Using CWT Scalograms And Deep Learning

Copiaco, Abigail1 and Ritz, Christian2 and Fasciani, Stefano3 and Abdulaziz, Nidhal4
1Engineering and Information Sciences Dept., University of Wollongong, Northfields Wollongong, Australia, 2School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Northfields Wollongong, Australia, 3Musicology Dept., University of Oslo, Norway, 4University of Wollongong in Dubai, Dubai

Abstract

This report describes our proposed system for the DCASE 2020 Task 4 challenge. In this work, we examine the combination of signal energy and spectral centroid features with 0.05 s time windowing for the detection of sound events. Along with this, spectro-temporal features extracted from Fast Fourier Transform (FFT) based wavelet coefficients of the audio files were used for classification. These coefficients are mapped into images called scalograms, which are fed into the layers of AlexNet, a pre-trained Deep Convolutional Neural Network (DCNN), for transfer learning. On the validation set, this method achieved an average F1-score of 74% across the 10 classes of the DESED database for weak labelling. However, this technique is not deemed suitable for classification with strong timestamp labelling, achieving an F1-score of a mere 11.21%.
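
A hedged sketch of the scalogram feature: the report uses FFT-based wavelet coefficients, while the stand-in below uses SciPy's Ricker-wavelet CWT purely for accessibility. The resulting log-magnitude image is what would be resized and fed to AlexNet.

```python
# Sketch only: Ricker-wavelet CWT as a stand-in for the report's FFT-based wavelets.
import numpy as np
from scipy import signal

def scalogram_image(x, widths=np.arange(1, 129)):
    """Map a 1-D waveform to a (n_widths, n_samples) log-magnitude scalogram."""
    coeffs = signal.cwt(x, signal.ricker, widths)
    return 20.0 * np.log10(np.abs(coeffs) + 1e-10)
```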

System characteristics
Input mono
Sampling rate 16 kHz
Features scalogram, signal energy, spectral centroid
Classifier AlexNet, transfer learning CNN
Decision making P-norm
PDF

Detecting And Classifying Separated Sound Events Using Wavelet-Based Scalograms And Deep Learning

Copiaco, Abigail1 and Ritz, Christian2 and Fasciani, Stefano3 and Abdulaziz, Nidhal4
1Engineering and Information Sciences Dept., University of Wollongong, Northfields Wollongong, Australia, 2School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Northfields Wollongong, Australia, 3Musicology Dept., University of Oslo, Norway, 4University of Wollongong in Dubai, Dubai

Abstract

This report describes our proposed system submitted to the DCASE 2020 Task 4 challenge that includes sound separation and detection. In this work, we examine the use of a Deep Neural Network (DNN) based sound separation system as a preprocessing stage for the sound event classification technique used. For the detection of sound events, a combination of signal energy and spectral centroid features with 0.05 s time windowing was utilized. Along with this, spectro-temporal features extracted from Fast Fourier Transform (FFT) based wavelet coefficients of the audio files were used for classification. These coefficients are mapped into images called scalograms, which are fed into the layers of AlexNet, a pre-trained Deep Convolutional Neural Network (DCNN), for transfer learning. On the validation set, this method achieved an average F1-score of 70% across the 10 classes of the DESED database for weak labelling. However, this technique is not deemed suitable for classification with strong timestamp labelling, achieving an F1-score of a mere 8.73%.
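
For the detection stage described above, a minimal sketch of the two named features over non-overlapping 0.05 s windows; the decision logic applied on top of them is not shown, and the constants are illustrative.

```python
# Sketch of per-window signal energy and spectral centroid features.
import numpy as np

def energy_and_centroid(x, sr, win_s=0.05):
    n = int(sr * win_s)
    frames = x[: (len(x) // n) * n].reshape(-1, n)          # non-overlapping windows
    energy = (frames ** 2).sum(axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-10)
    return energy, centroid
```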

System characteristics
Input mono
Sampling rate 16 kHz
Features scalogram, signal energy, spectral centroid
Classifier AlexNet, transfer learning CNN
Decision making P-norm
PDF

The UNIVPM-INRIA Systems For The DCASE 2020 Task 4

Cornell, Samuele1 and Pepe, Giovanni1 and Principi, Emanuele1 and Pariente, Manuel2 and Olvera, Michel2 and Gabrielli, Leonardo1 and Squartini, Stefano1
1Dept. Information Engineering, Università Politecnica delle Marche, Ancona, Italy, 2Dept. Information and Communication Sciences and Technologies, INRIA Nancy Grand-Est, France

Abstract

In this technical report, we propose different Sound Event Detection (SED) systems for the 2020 DCASE Task 4 challenge. Given the mismatch between synthetic labelled data and target-domain data, we exploit domain adversarial training to improve the network's invariance to different types of background noise. Furthermore, we use dynamic mixing and augmentation of synthetic examples at training time, as well as prediction smoothing with Hidden Markov Models. In one system, we also show that a learnable dynamic-compression front-end, Per-Channel Energy Normalization (PCEN), improves robustness to background noise by Gaussianizing it. Finally, an ensemble of models proves beneficial for improving the prediction score. Concerning joint separation and sound event detection, we propose a permutation-invariant training scheme to directly optimize the sound event detection objective.
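
A small sketch of a PCEN front-end in place of log compression, using librosa's implementation; the submitted system makes the compression parameters learnable, which is beyond this snippet.

```python
# Sketch: PCEN over a mel spectrogram via librosa (fixed, non-learnable parameters).
import librosa

def pcen_frontend(y, sr=16000, n_mels=64):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, power=1.0)
    return librosa.pcen(S * (2 ** 31), sr=sr)   # input scaling as in the librosa docs
```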

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation contrast, overdrive, pitch shifting, highshelf, lowshelf
Features log-mel energies, mel-energies
Classifier CRNN, DPRNN
Decision making HMM smoothing
PDF

Multi-Resolution Mean Teacher For DCASE 2020 Task 4

de Benito-Gorron, Diego and Segovia, Sergio and Ramos, Daniel and Toledano, Doroteo T.
Universidad Autónoma de Madrid, Madrid, Spain

Abstract

In this technical report, we describe our participation in DCASE 2020 Task 4: Sound event detection and separation in domestic environments. A multi-resolution feature extraction approach is proposed, aiming to take advantage of the different lengths and spectral characteristics of each target category. The combination of up to five different time-frequency resolutions via model fusion is able to outperform the baseline results. In addition, we propose class-specific thresholds for the F1-score metric, further improving the results on the validation set.
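
A minimal sketch of the multi-resolution idea, assuming one model per resolution fused afterwards; the (n_fft, hop) pairs below are examples, not the five resolutions used in the report.

```python
# Sketch: mel spectrograms at several time-frequency resolutions for model fusion.
import librosa

RESOLUTIONS = [(2048, 1024), (1024, 512), (512, 256)]   # illustrative (n_fft, hop)

def multi_resolution_mels(y, sr=16000, n_mels=128):
    return [
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
        for n_fft, hop in RESOLUTIONS
    ]
```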

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
Decision making mean
PDF

Convolutional Recurrent Neural Networks For Weakly Labeled Semi-Supervised Sound Event Detection In Domestic Environments

Ebbers, Janek and Haeb-Umbach, Reinhold
Paderborn University, Paderborn, Germany

Abstract

In this report we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge Task 4: Sound event detection in domestic environments. We present a convolutional recurrent neural network (CRNN) with two recurrent neural network (RNN) classifiers sharing the same preprocessing convolutional neural network (CNN). Both recurrent networks perform audio tagging: one processes the input audio signal in the forward direction and the other in the backward direction. The networks are trained jointly using weakly labeled data, such that at each time step an active event is tagged by at least one of the networks, given that the event is either in the past, captured by the forward RNN, or in the future, captured by the backward RNN. This way the models are encouraged to output tags as soon as possible. After training, the networks can be used for sound event detection by applying them to smaller audio segments, e.g. 200 ms. Furthermore, we propose a tag-conditioned CNN as a second stage intended to refine sound event detection. Given its receptive field and the file tags as input, it performs strong-label prediction and is trained using pseudo strong labels provided by the CRNN system. By ensembling four CRNNs and four CNNs we obtain an event-based F-score of 48.3% on the validation set, which is 13.5% higher than the baseline. We are going to release the source code at https://github.com/fgnt/pb_sed.
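
A compact sketch of the forward/backward tagging idea under assumed sizes: two GRU taggers share the same feature sequence, and a frame is tagged if either the past (forward RNN) or the future (backward RNN) contains the event.

```python
# Sketch with illustrative dimensions; the shared CNN front-end is assumed upstream.
import torch
import torch.nn as nn

class FwdBwdTagger(nn.Module):
    def __init__(self, n_feats=128, n_classes=10, hidden=128):
        super().__init__()
        self.rnn_fwd = nn.GRU(n_feats, hidden, batch_first=True)
        self.rnn_bwd = nn.GRU(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                          # x: (batch, time, n_feats)
        h_fwd, _ = self.rnn_fwd(x)                 # sees the past at each step
        h_bwd, _ = self.rnn_bwd(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])        # sees the future at each step
        p_fwd = torch.sigmoid(self.head(h_fwd))
        p_bwd = torch.sigmoid(self.head(h_bwd))
        return torch.maximum(p_fwd, p_bwd)         # tagged if either side sees it
```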

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation scaling, mixup, frequency warping, blurring, frequency masking, time masking, gaussian noise
Features log-mel energies
Classifier CRNN
Decision making mean
PDF

Cross-domain sound event detection: from synthesized audio to real audio

Hao, Junyong and Hou, Zhenwei and Peng, Wang
Chongqing University, Chongqing, China

Abstract

This technical report describes the system submitted to DCASE2020 task 4 - Sound Event Detection in Domestic Environments. We use the dataset of DCASE2019 task 4 to train our model, which contains strongly labeled synthetic data, a large amount of unlabeled data, and weakly labeled data. There is a very large domain gap in the statistical distribution between synthesized audio and real audio, and the performance of an SED model trained on synthesized audio is greatly reduced when applied to real audio. To address this, we propose a DA-CRNN network for joint learning of sound event detection (SED) and domain adaptation (DA). We account for the impact of the distribution within a single sound on the generalization performance of the model by mitigating the effect of complex background noise on event detection and by applying a self-correlation consistency regularization to clip-level sound event classification, which makes the intra-domain distribution of a single sound smoother. For cross-domain adaptation, adversarial learning through the feature extraction network, with a frame-level domain discriminator and a clip-level domain discriminator, forces the feature extraction network to learn domain-invariant features and further improves the generalization performance of the model. Without using sound source separation, we achieved an F1 score of 48.25% on the validation dataset and an F1 score of 49.43% on the public evaluation dataset.
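
The domain adaptation part relies on adversarial learning against domain discriminators; the standard gradient-reversal layer below is a common way to implement this and is shown as a sketch, with the discriminators themselves omitted.

```python
# Standard gradient-reversal layer for domain adversarial training (sketch).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)                    # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None    # reversed gradient for the encoder
```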

System characteristics
Input mono
Sampling rate 22.05 kHz
Features log-mel energies
Classifier CRNN
PDF

Fine-Tuning Using Grid Search & Gradient Visualization

Hou, Bowei and Radzikowski, Kacper and Farid, Ahmed
Waseda University, Fukuoka, Japan

Abstract

In this technical report, we briefly describe the models used in the task 4 challenge of DCASE2020. We utilized previously available models and fine-tuned them using the grid search algorithm and gradient visualization. This is the first attempt by our team to enter a competition on sound source manipulation.

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
Decision making mean
PDF

AUTHOR GUIDELINES FOR DCASE 2020 CHALLENGE TECHNICAL REPORT

Hou, Zhenwei and Hao, Junyong and Peng, Wang
University of Chongqing, Chongqing, China

Abstract

In this technical report, we present the sound event detection system for task 4 (Sound event detection and separation in domestic environments) of the DCASE2020 challenge. We propose an improved CRNN in which Context Gating and a channel attention mechanism are co-embedded into the backbone network. The aim is to construct a general and efficient attention structure for extracting features of sound events, and to give full play to the advantages of the attention mechanism in event feature extraction. Replacing the CRNN in the baseline model with the structure we designed, while keeping the other parts unchanged, the macro F-score of our model on the validation set is 4 percentage points higher than the baseline.
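
Sketches of the two attention ingredients named above, with illustrative sizes; their exact placement in the backbone follows the report, not this snippet.

```python
# Illustrative sketches of Context Gating and squeeze-and-excitation-style
# channel attention; dimensions are assumptions.
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (..., dim)
        return x * torch.sigmoid(self.gate(x))   # input reweighted by its own gate

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (batch, channels, time, freq)
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool over T and F
        return x * w[:, :, None, None]
```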

System characteristics
Input mono
Sampling rate 22.05 kHz
Features log-mel energies
Classifier CRNN
Decision making mean
PDF

Guided Multi-Branch Learning Systems For DCASE 2020 Task 4

Huang, Yuxin1,2 and Lin, Liwei1,2 and Ma, Shuo1 and Wang, Xiangdong1 and Liu, Hong1 and Qian, Yueliang1 and Liu, Min3 and Ouchi, Kazushige3
1Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Toshiba China R&D Center, Beijing, China

Abstract

In this paper, we describe in detail our systems for DCASE2020 task 4. The systems are based on the first-place system of DCASE 2019 task 4, which adopts the multiple instance learning (MIL) framework with embedding-level attention pooling and a semi-supervised learning approach called guided learning. The multi-branch learning method is then incorporated into the system to further improve performance. Multiple branches with different pooling strategies (embedding-level or instance-level) and different pooling modules (attention pooling, global max pooling or global average pooling) are used and share the same feature encoder. To better exploit the synthetic data with strong labels, inspired by multi-task learning, a sound event detection (SED) branch is also added. Therefore, multiple branches pursuing different purposes and focusing on different characteristics of the data help the feature encoder model the feature space better and avoid over-fitting. To combine sound separation with sound event detection, we train models on the output of the sound separation baseline system and fuse the event detection results of models with or without sound separation.
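
A sketch of embedding-level attention pooling, one of the pooling modules mentioned above: frame-level probabilities are aggregated with learned, class-wise attention weights over time.

```python
# Sketch of embedding-level attention pooling; sizes are illustrative.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)
        self.att = nn.Linear(dim, n_classes)

    def forward(self, h):                            # h: (batch, time, dim)
        frame = torch.sigmoid(self.cls(h))           # frame-level probabilities
        weights = torch.softmax(self.att(h), dim=1)  # attention over time, per class
        clip = (frame * weights).sum(dim=1)          # clip-level prediction
        return clip, frame
```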

System characteristics
Input mono
Sampling rate 44.1 kHz
Data augmentation time shifting, frequency shifting
Features log-mel energies
Classifier CNN
PDF

Multi-Scale Residual CRNN With Data Augmentation For DCASE 2020 Task 4

JiaKai, Lu
PFU Shanghai Co., LTD, Shanghai, China

Abstract

In this paper, we present our neural network for Task 4 of the DCASE 2020 challenge (Sound event detection and separation in domestic environments). This task evaluates systems for the large-scale detection of sound events using weakly labeled data, and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve performance on audio tagging and sound event detection. We propose a mean-teacher model with a convolutional neural network (CNN) and a recurrent neural network (RNN) to maximize the use of the unlabeled in-domain dataset. The architecture is based on our 2018 competition model.
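
The core of the mean-teacher scheme is the teacher tracking an exponential moving average (EMA) of the student's weights; a minimal sketch, with the smoothing coefficient as an assumption:

```python
# Mean teacher EMA update: run after each student optimization step.
import torch

def ema_update(student, teacher, alpha=0.999):
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```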

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation pitch shifting
Features log-mel energies
Classifier CRNN
Decision making macro F1 vote
PDF

Universal sound separation

Kavalerov, Ilya1,2 and Wisdom, Scott1 and Erdogan, Hakan1 and Patton, Brian1 and Wilson, Kevin1 and Le Roux, Jonathan3 and Hershey, John R1
1Google Research, Cambridge MA, 2Department of Electrical and Computer Engineering, UMD, 3Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA

Abstract

Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
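
For reference, the scale-invariant metric reported for the separation systems above can be computed as follows; this is a textbook SI-SNR implementation, not code from the paper.

```python
# Standard scale-invariant SNR (dB) between an estimate and a reference waveform.
import torch

def si_snr(est, ref, eps=1e-8):
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10.0 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
```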

PDF

Polyphonic Sound Event Detection Based On Convolutional Recurrent Neural Networks With Semi-Supervised Loss Function For DCASE Challenge 2020 Task 4

Kim, Nam Kyun1 and Kim, Hong Kook1,2
1School of Electrical Engineering and Computer Science, Gwangju, South Korea, 2AI Graduate School (GIST), Gwangju Institute of Science and Technology, Gwangju, South Korea

Abstract

This report proposes a polyphonic sound event detection (SED) method for the DCASE 2020 Challenge Task 4. The proposed SED method is based on semi-supervised learning to deal with the different combinations of training datasets, such as the weakly labeled dataset, the unlabeled dataset, and the strongly labeled synthetic dataset. In particular, the target label of each audio clip from the weakly labeled or unlabeled dataset is first predicted using the mean teacher model that constitutes the DCASE 2020 baseline. The data with predicted labels are used for training the proposed SED model, which consists of CNNs with skip connections and a self-attention mechanism, followed by RNNs. To compensate for erroneous predictions on weakly labeled and unlabeled data, a semi-supervised loss function is employed for the proposed SED model. Several versions of the proposed SED model are implemented and evaluated on the validation set with different parameter settings for the semi-supervised loss function, and an ensemble that combines the five-fold validation models is selected as our final model.
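
One plausible form of such a loss, shown purely as an assumption-labeled sketch: binary cross-entropy against teacher-predicted targets, restricted to frames where the teacher is confident. The actual loss function and its parameter settings are defined in the report.

```python
# Assumption-labeled sketch, not the report's loss: confidence-masked BCE
# against teacher-predicted pseudo labels.
import torch
import torch.nn.functional as F

def confident_bce(student_logits, teacher_probs, conf_threshold=0.9):
    confident = ((teacher_probs > conf_threshold) |
                 (teacher_probs < 1.0 - conf_threshold)).float()
    targets = (teacher_probs > 0.5).float()
    loss = F.binary_cross_entropy_with_logits(student_logits, targets,
                                              reduction="none")
    return (loss * confident).sum() / confident.sum().clamp(min=1.0)
```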

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation mixup, SpecAugment, Gaussian noise
Features mel-spectrogram
Classifier CRNN
Decision making thresholding
PDF

Sound Event Detection By Consistency Training And Pseudo-Labeling With Feature-Pyramid Convolutional Recurrent Neural Networks

Koh, Chih-Yuan1 and Chen, You-Siang1 and Li, Shang-En2 and Liu, Yi-Wen1 and Chien, Jen-Tzung2 and Bai, Mingsian R.1
1National Tsing Hua University, Hsinchu, 2National Chiao Tung University, Hsinchu, Taiwan

Abstract

An event detection system for DCASE 2020 task 4 is presented. To efficiently utilize a large amount of unlabeled in-domain data, three semi-supervised learning strategies are applied: 1) interpolation consistency training (ICT), 2) shift consistency training (SCT), and 3) weak pseudo-labeling. In addition, we propose FP-CRNN, a convolutional recurrent neural network which contains feature-pyramid components and is based on the provided baseline. In terms of event-based F-measure, these approaches outperform the baseline (34.8%) by a large margin, with an F-measure of 48.4% for the baseline network trained with the combination of all three strategies and 49.6% for FP-CRNN with the same training strategies.
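
A sketch of the ICT term on unlabeled clips, with assumed model call signatures: the student's prediction on a mixed input is pushed toward the same mix of the teacher's predictions.

```python
# Sketch of interpolation consistency training on two unlabeled inputs.
import torch

def ict_loss(student, teacher, x1, x2, beta=0.5):
    lam = torch.distributions.Beta(beta, beta).sample().item()
    with torch.no_grad():
        y1 = torch.sigmoid(teacher(x1))
        y2 = torch.sigmoid(teacher(x2))
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return torch.mean((torch.sigmoid(student(x_mix)) - y_mix) ** 2)
```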

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation Gaussian noise, Mixup, time shifting, pitch shifting
Features log-mel energies
Classifier CRNN
Decision making mean probabilities, thresholding
PDF

Mean Teacher With Sound Source Separation And Data Augmentation For DCASE 2020 Task 4

Liang, Chuming and Ying, Haorong and Chen, Yueyi and Wang, Zhao
Xiaomi AI Lab, Wuhan, China

Abstract

In this paper, we present our system for the DCASE 2020 challenge Task 4 (Sound event detection and separation in domestic environments). The target of this task is to provide the time boundaries of multiple events in an audio recording using a system trained with unlabeled, weakly labeled and synthetic data. The use of sound source separation in the system is also encouraged. We propose a mean-teacher model with a convolutional recurrent neural network (CRNN) structure and adopt data augmentation and sound source separation techniques to improve sound event detection performance.

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation time stretching, pitch shifting, reverberation
Features log-mel energies
Classifier CRNN
PDF

Semi-Supervised Sound Event Detection Based On Mean Teacher With Power Pooling And Data Augmentation

Liu, Yuzhuo1,2 and Chen, Chengxin1,2 and Kuang, Jianzhong1,2 and Zhang, Pengyuan1,2
1Institute of Acoustics, Key Laboratory of Speech Acoustics & Content Understanding, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China

Abstract

In this technical report, we describe the details of the system submitted to DCASE2020 task 4: sound event detection (SED) and separation in domestic environments. We mainly focus on the scenario of recognizing sound events without source separation. The training set includes synthetic strongly labeled data, weakly labeled data and unlabeled data. For training SED models with weak labels, a power pooling function is introduced to generate clip-level predictions from frame-level ones. Additionally, three traditional data augmentation approaches are applied to all data. We also ensemble models with different strategies. Our best system achieves an F1 of 49.55% on the validation set.
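
A hedged sketch of a power pooling function in the auto-pooling family, where frames vote on the clip-level prediction in proportion to a power of their own probability; the exact definition and exponent used are given in the report.

```python
# Sketch only; the report defines the actual power pooling function.
import torch

def power_pooling(frame_probs, n=2.0, eps=1e-8):
    """frame_probs: (batch, time, n_classes) -> clip probs (batch, n_classes)."""
    w = frame_probs.pow(n)
    return (frame_probs * w).sum(dim=1) / (w.sum(dim=1) + eps)
```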

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation noise addition, pitch shifting, time rolling, dynamic range compression
Features log-mel energies
Classifier CRNN
Decision making weighted averaging
PDF

Convolution-Augmented Transformer For Semi-Supervised Sound Event Detection

Miyazaki, Koichi1 and Komatsu, Tatsuya2 and Hayashi, Tomoki1,3 and Watanabe, Shinji4 and Toda, Tomoki1 and Takeda, Kazuya1
1Nagoya University, Japan, 2LINE Corporation, Japan, 3Human Dataware Lab. Co., Japan, 4Johns Hopkins University, USA

Abstract

In this technical report, we describe our submission system for DCASE2020 Task 4: sound event detection and separation in domestic environments. Our model employs conformer blocks, which combine self-attention and depthwise convolution networks, to efficiently capture both the global and local context information of an audio feature sequence. In addition to this novel architecture, we further improve performance by utilizing the mean teacher semi-supervised learning technique, data augmentation, and post-processing optimized for each sound event class. We demonstrate that the proposed method achieves an event-based macro F1 score of 50.7% on the validation set, significantly outperforming the baseline score (34.8%).
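
A compact sketch of the convolution module that distinguishes a conformer block from a plain transformer block (pointwise convolution with a gated linear unit, a depthwise convolution for local context, and a residual connection); dimensions and kernel size are illustrative, and the self-attention half of the block is omitted.

```python
# Sketch of a conformer-style convolution module; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerConvModule(nn.Module):
    def __init__(self, dim, kernel=31):
        super().__init__()
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm = nn.BatchNorm1d(dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                         # x: (batch, time, dim)
        y = x.transpose(1, 2)                     # Conv1d expects (batch, dim, time)
        y = F.glu(self.pointwise_in(y), dim=1)    # gated linear unit halves channels
        y = F.silu(self.norm(self.depthwise(y)))  # depthwise conv: local context
        y = self.pointwise_out(y)
        return x + y.transpose(1, 2)              # residual connection
```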

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation time shifting, mixup
Features log-mel energies
Classifier Conformer, Transformer
Decision making mean
PDF

Joint Acoustic And Supervised Inference For Sound Event Detection

Park, Sangwook and Bellur, Ashwin and Kothinti, Sandeep and Kapurchali, Masoumeh Heidari and Elhilali, Mounya
Johns Hopkins University, Baltimore, USA

Abstract

This is a technical report on a sound event detection system for task 4 of DCASE2020. The purpose of sound event detection is to find the event class label as well as its time boundaries. To achieve this, we considered several methods such as signal enhancement and event boundary detection, and built five systems by integrating these methods with a supervised system trained using the Mean Teacher model. In particular, we estimate event boundaries for the weakly labeled data by performing event boundary detection, and then use the estimated strong labels to train the supervised system. In addition, we adopt a fusion method that computes a weighted average of the posteriors over the five outputs of the individual systems. In experiments on the validation set, we found that the final result of our system shows an improvement of about 11% in class-averaged F-score compared to the baseline performance.

System characteristics
Input mono
Sampling rate 44.1 kHz
Features log-mel energies
Classifier CRNN
Decision making weighted averaging
PDF

Multi-Task Learning Paradigm For Sound Event Detection

Rykaczewski, Krzysztof
Samsung R&D Institute Poland, Warsaw, Poland

Abstract

In this technical report, we describe our system submitted to DCASE2020 task 4. This task evaluates systems for the detection of sound events in domestic environments using large-scale weakly labeled data. To perform this task, we propose a residual convolutional recurrent neural network (CRNN) trained on datasets including strong and weak labels. We also use a mean-teacher model based on confidence thresholding and a smooth embedding method. In addition, we apply SpecAugment to address the labeled-data shortage problem. Finally, we achieve better performance than the DCASE2020 baseline system.
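
A minimal SpecAugment-style sketch with one frequency mask and one time mask; mask widths are illustrative.

```python
# Sketch: single frequency and time mask on a mel spectrogram.
import torch

def spec_augment(mel, f_width=8, t_width=20):
    """mel: (n_mels, n_frames) tensor; returns a masked copy."""
    n_mels, n_frames = mel.shape
    f0 = torch.randint(0, max(1, n_mels - f_width), (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t_width), (1,)).item()
    out = mel.clone()
    out[f0:f0 + f_width, :] = 0.0
    out[:, t0:t0 + t_width] = 0.0
    return out
```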

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
PDF

Multi-Scale Residual CRNN With Data Augmentation For DCASE 2020 Task 4

Tang, Maolin and Guo, Longyin and Zhang, Yanqiu and Yan, Weiran and Zhao, Qijun
Sichuan University, Chengdu, China
Sichuan University, ChengDu, China

Abstract

In this technical report, we present our method for task 4 of the DCASE 2020 challenge (Sound event detection and separation in domestic environments). The goal of the task is to evaluate systems for the detection of sound events using real data, either weakly labeled or unlabeled, and simulated data that is strongly labeled (with timestamps). We find that models perform well on synthetic data but may not perform well on real data. We therefore improve the baseline [1] by using a variety of data augmentation methods and by synthesizing more complex synthetic data for training. Moreover, we present a multi-scale residual convolutional recurrent neural network (CRNN) to address the problem of multi-scale detection. The promising results on the validation set demonstrate the effectiveness of our method.

System characteristics
Input mono
Sampling rate 22.05 kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CRNN
Decision making mean
PDF

Training Sound Event Detection On A Heterogeneous Dataset

Turpault, Nicolas and Serizel, Romain
Université de Lorraine, CNRS, Inria, Loria, France

Abstract

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes with various labeling granularities is a non-trivial task that requires several technical choices. These choices are often passed from one system to another without being questioned. We perform a detailed analysis of the DCASE 2020 task 4 sound event detection baseline with regard to several aspects, such as the type of data used for training, the parameters of the mean teacher, or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually kept at their default values to replicate other approaches are shown to be sub-optimal.

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
Decision making p-norm
PDF

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Turpault, Nicolas1 and Wisdom, Scott2 and Erdogan, Hakan2 and Hershey, John R.2 and Serizel, Romain1 and Fonseca, Eduardo3 and Seetharaman, Prem4 and Salamon, Justin5
1Université de Lorraine, CNRS, Inria, Loria, Nancy, France, 2Google Research, AI Perception, Cambridge, United States, 3Music Technology Group, Universitat Pompeu Fabra, Barcelona, 4Interactive Audio Lab, Northwestern University, Evanston, United States, 5Adobe Research, San Francisco, United States

Abstract

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.
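
The baseline's integration method is listed as "P-norm" in the tables above; the sketch below shows a generalized-mean (p-norm) combination of posteriors computed on the mixture and each separated source, with the exact formulation left to the paper.

```python
# Sketch of a p-norm (generalized mean) combination of per-source SED posteriors.
import torch

def p_norm_combine(per_source_probs, p=2.0):
    stacked = torch.stack(per_source_probs)        # (n_sources, ..., n_classes)
    return stacked.pow(p).mean(dim=0).pow(1.0 / p).clamp(max=1.0)
```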

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
Decision making p-norm
PDF

Sound Event Detection In Domestic Environments Using Dense Recurrent Neural Network

Yao, Tianchu and Shi, Chuang and Li, Huiyong
University of Electronic Science and Technology of China, Chengdu, China

Abstract

In this paper, we introduce our sound event detection system for DCASE 2020 Task 4, a mean-teacher model with a convolutional recurrent neural network (CRNN) that includes residual convolutional blocks and dense recurrent neural network (DRNN) blocks. To improve the performance of the system, we use several methods such as a multi-scale input layer, data augmentation, median window filters and model fusion. By combining these methods, our system achieves a 15% improvement in macro-averaged F-score on the development set, compared to the baseline.

System characteristics
Input mono
Sampling rate 16 kHz
Data augmentation mixup, time-shift
Features log-mel energies
Classifier CRNN
Decision making geometric-mean
PDF

The Academia Sinica System Of Sound Event Detection And Separation For DCASE 2020

Yen, Hao and Ku, Pin-Jui1,2 and Yen, Ming-Chi1,2 and Lee, Hung-Shin1 and Wang, Hsin-Min1,2
1Institute of Information Science, Academia Sinica, Taiwan, 2Dept. Electrical Engineering, National Taiwan University, Taiwan

Abstract

In this report, we present our system for sound event detection and separation in domestic environments for DCASE 2020. The task aims to determine which sound events appear in a clip and the temporal ranges they occupy. The system is trained using real data, which are either weakly labeled or unlabeled, and strongly annotated synthetic data. Our proposed model structure starts with a feature-level front-end based on convolutional neural networks (CNN), followed by both embedding-level and instance-level back-end attention modules. To take full advantage of the large amount of unlabeled data, we jointly adopt the guided learning mechanism and Mean Teacher, which averages model weights instead of label predictions, to carry out weakly supervised and semi-supervised learning. A group of adaptive median windows, one for each sound event, is also utilized in post-processing to smooth the frame-level predictions. On the public evaluation set of DCASE 2019, our best system achieves a 48.50% event-based F-score, much better than the official baseline performance (38.14%), a relative improvement of 27.16%. Moreover, on the development set of DCASE 2020, our system is also superior to the baseline when using the student model as the back-end classifier. The F1-score is relatively improved by 32.91%.
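
A sketch of the class-wise median smoothing described above, where each sound event class gets its own median window length; the window sizes here are placeholders for the adaptive values chosen per class.

```python
# Sketch: per-class median filtering of frame-level probabilities.
import numpy as np
from scipy.ndimage import median_filter

def adaptive_median(frame_probs, windows):
    """frame_probs: (n_frames, n_classes); windows: one odd length per class."""
    smoothed = np.empty_like(frame_probs)
    for c, w in enumerate(windows):
        smoothed[:, c] = median_filter(frame_probs[:, c], size=w)
    return smoothed
```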

System characteristics
Input mono
Sampling rate 16 kHz
Features log-mel energies
Classifier CRNN
PDF