Sound event detection in domestic environments


Challenge results

Task description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). Systems are expected to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. A further challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable.

A more detailed task description can be found on the task description page.
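Systems are ranked by event-based F-score. As a minimal sketch (not the official evaluation code), the metric can be computed with the sed_eval toolbox; the collar values below follow the common DCASE setup and, like the toy event lists, are assumptions here:

```python
# Minimal sketch of the ranking metric, assuming the sed_eval toolbox
# (pip install sed_eval); event lists and collar values are illustrative.
import sed_eval

reference = [
    {'filename': 'clip1.wav', 'event_label': 'Speech', 'onset': 1.0, 'offset': 2.5},
]
estimated = [
    {'filename': 'clip1.wav', 'event_label': 'Speech', 'onset': 1.1, 'offset': 2.4},
]

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=['Speech'],   # the task uses ten domestic sound classes
    t_collar=0.200,                # 200 ms tolerance on onsets
    percentage_of_length=0.2,      # offset tolerance: 20% of the event length
)
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
# the rankings use the class-wise (macro) average of the per-class F-scores
print(metrics.results_class_wise_average_metrics()['f_measure']['f_measure'])
```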

Systems ranking

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
Wang_NUDT_task4_4 NUDT System for DCASE2019 Task4 Wang2019 16.8 23.8
Wang_NUDT_task4_3 NUDT System for DCASE2019 Task4 Wang2019 17.5 22.4
Wang_NUDT_task4_2 NUDT System for DCASE2019 Task4 Wang2019 17.2 22.5
Wang_NUDT_task4_1 NUDT System for DCASE2019 Task4 Wang2019 17.2 22.7
Delphin_OL_task4_2 DCASE2019 mean-teacher with shifted and noisy data augmentation system Delphin-Poulat2019 42.1 43.6
Delphin_OL_task4_1 DCASE2019 mean-teacher with shifted data augmentation system Delphin-Poulat2019 38.3 42.1
Kong_SURREY_task4_1 CVSSP cross-task CNN baseline Kong2019 22.3 21.3
CTK_NU_task4_2 CTK_NU_task4_2 Chan2019 29.7 29.7
CTK_NU_task4_3 CTK_NU_task4_3 Chan2019 27.7 27.8
CTK_NU_task4_4 CTK_NU_task4_4 Chan2019 26.9 27.2
CTK_NU_task4_1 CTK_NU_task4_1 Chan2019 31.0 30.4
Mishima_NEC_task4_3 msm_ResNet_3_augmentation Mishima2019 18.3 25.9
Mishima_NEC_task4_4 msm_ResNet_4_augmentation_pseudo Mishima2019 19.8 24.7
Mishima_NEC_task4_2 msm_ResNet_2_pseudo Mishima2019 17.7 24.8
Mishima_NEC_task4_1 msm_ResNet_1_simple Mishima2019 16.7 24.0
CANCES_IRIT_task4_2 CANCES multi-task Cances2019 28.4 33.8
CANCES_IRIT_task4_1 CANCES multi-task Cances2019 26.1 28.8
PELLEGRINI_IRIT_task4_1 PELLEGRINI multi-task Cances2019 39.7 39.9
Lin_ICT_task4_2 Guiding_learning_2 Lin2019 40.9 44.0
Lin_ICT_task4_4 Guiding_learning_4 Lin2019 41.8 45.4
Lin_ICT_task4_3 Guiding_learning_3 Lin2019 42.7 45.3
Lin_ICT_task4_1 Guiding_learning_1 Lin2019 40.7 44.5
Baseline_dcase2019 DCASE2019 baseline system Turpault2019 25.8 23.7
bolun_NWPU_task4_1 DCASE2019 task4 system Bolun2019 21.7 25.0
bolun_NWPU_task4_4 DCASE2019 task4 system Bolun2019 25.3 31.9
bolun_NWPU_task4_3 DCASE2019 task4 system Bolun2019 23.8 25.0
bolun_NWPU_task4_2 DCASE2019 task4 system Bolun2019 27.8 31.9
Agnone_PDL_task4_1 Mean VAT Teacher Agnone2019 25.0 59.6
Kiyokawa_NEC_task4_1 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 27.8 31.6
Kiyokawa_NEC_task4_4 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 32.4 36.1
Kiyokawa_NEC_task4_3 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 29.4 34.5
Kiyokawa_NEC_task4_2 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 28.3 31.8
Kothinti_JHU_task4_2 JHU DCASE2019 task4 system Kothinti2019 30.5 35.3
Kothinti_JHU_task4_3 JHU DCASE2019 task4 system Kothinti2019 29.0 34.4
Kothinti_JHU_task4_4 JHU DCASE2019 task4 system Kothinti2019 29.4 35.0
Kothinti_JHU_task4_1 JHU DCASE2019 task4 system Kothinti2019 30.7 34.6
Shi_FRDC_task4_2 BossLee_FRDC_2 Shi2019 42.0 42.5
Shi_FRDC_task4_3 BossLee_FRDC_3 Shi2019 40.9 38.9
Shi_FRDC_task4_4 BossLee_FRDC_4 Shi2019 41.5 41.7
Shi_FRDC_task4_1 BossLee_FRDC_1 Shi2019 37.0 36.7
ZYL_UESTC_task4_1 UESTC_SICE_task4_1 Zhang2019 29.4 36.0
ZYL_UESTC_task4_2 UESTC_SICE_task4_2 Zhang2019 30.8 35.6
Wang_YSU_task4_1 Wang_YSU_task4_1 Yang2019 6.5 19.4
Wang_YSU_task4_2 Wang_YSU_task4_2 Yang2019 6.2 20.9
Wang_YSU_task4_3 Wang_YSU_task4_3 Yang2019 6.7 22.7
Yan_USTC_task4_1 USTC_CRNN_MT system1 Yan2019 35.8 41.4
Yan_USTC_task4_3 USTC_CRNN_MT system3 Yan2019 35.6 42.1
Yan_USTC_task4_4 USTC_CRNN_MT system4 Yan2019 33.5 39.4
Yan_USTC_task4_2 USTC_CRNN_MT system2 Yan2019 36.2 42.6
Lee_KNU_task4_2 KNUwaveCNN2 Lee2019 25.8 31.6
Lee_KNU_task4_4 KNUwaveCNN4 Lee2019 24.6 28.7
Lee_KNU_task4_3 KNUwaveCNN3 Lee2019 26.7 31.6
Lee_KNU_task4_1 KNUwaveCNN1 Lee2019 26.4 28.8
Rakowski_SRPOL_task4_1 Regularized Surrey9 Rakowski2019 24.2 24.3
Lim_ETRI_task4_1 Lim_task4_1 Lim2019 32.6 38.8
Lim_ETRI_task4_2 Lim_task4_2 Lim2019 33.2 39.5
Lim_ETRI_task4_3 Lim_task4_3 Lim2019 32.5 39.4
Lim_ETRI_task4_4 Lim_task4_4 Lim2019 34.4 40.9

Supplementary metrics

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (YouTube dataset) | Event-based F-score (Vimeo dataset) | Segment-based F-score (Evaluation dataset)
Wang_NUDT_task4_4 NUDT System for DCASE2019 Task4 Wang2019 16.8 18.3 13.2 64.8
Wang_NUDT_task4_3 NUDT System for DCASE2019 Task4 Wang2019 17.5 19.2 13.3 63.0
Wang_NUDT_task4_2 NUDT System for DCASE2019 Task4 Wang2019 17.2 18.4 14.4 65.0
Wang_NUDT_task4_1 NUDT System for DCASE2019 Task4 Wang2019 17.2 18.7 13.5 64.4
Delphin_OL_task4_2 DCASE2019 mean-teacher with shifted and noisy data augmentation system Delphin-Poulat2019 42.1 45.8 33.3 71.4
Delphin_OL_task4_1 DCASE2019 mean-teacher with shifted data augmentation system Delphin-Poulat2019 38.3 41.9 29.2 68.6
Kong_SURREY_task4_1 CVSSP cross-task CNN baseline Kong2019 22.3 24.1 17.0 59.4
CTK_NU_task4_2 CTK_NU_task4_2 Chan2019 29.7 33.2 21.0 55.6
CTK_NU_task4_3 CTK_NU_task4_3 Chan2019 27.7 30.8 19.8 50.5
CTK_NU_task4_4 CTK_NU_task4_4 Chan2019 26.9 30.1 18.8 48.7
CTK_NU_task4_1 CTK_NU_task4_1 Chan2019 31.0 34.7 21.6 58.2
Mishima_NEC_task4_3 msm_ResNet_3_augmentation Mishima2019 18.3 20.6 12.6 58.8
Mishima_NEC_task4_4 msm_ResNet_4_augmentation_pseudo Mishima2019 19.8 21.8 15.0 58.7
Mishima_NEC_task4_2 msm_ResNet_2_pseudo Mishima2019 17.7 19.0 14.1 56.1
Mishima_NEC_task4_1 msm_ResNet_1_simple Mishima2019 16.7 18.8 11.7 56.2
CANCES_IRIT_task4_2 CANCES multi-task Cances2019 28.4 31.1 21.3 61.2
CANCES_IRIT_task4_1 CANCES multi-task Cances2019 26.1 29.2 18.1 62.5
PELLEGRINI_IRIT_task4_1 PELLEGRINI multi-task Cances2019 39.7 43.0 30.9 64.7
Lin_ICT_task4_2 Guiding_learning_2 Lin2019 40.9 45.0 29.8 62.7
Lin_ICT_task4_4 Guiding_learning_4 Lin2019 41.8 46.7 28.6 64.5
Lin_ICT_task4_3 Guiding_learning_3 Lin2019 42.7 47.7 29.4 64.8
Lin_ICT_task4_1 Guiding_learning_1 Lin2019 40.7 45.5 27.6 61.5
Baseline_dcase2019 DCASE2019 baseline system Turpault2019 25.8 29.0 18.1 53.7
bolun_NWPU_task4_1 DCASE2019 task4 system Bolun2019 21.7 23.0 18.2 63.3
bolun_NWPU_task4_4 DCASE2019 task4 system Bolun2019 25.3 28.6 16.1 58.7
bolun_NWPU_task4_3 DCASE2019 task4 system Bolun2019 23.8 26.2 17.5 61.7
bolun_NWPU_task4_2 DCASE2019 task4 system Bolun2019 27.8 30.1 21.7 61.6
Agnone_PDL_task4_1 Mean VAT Teacher Agnone2019 25.0 27.1 20.0 60.4
Kiyokawa_NEC_task4_1 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 27.8 30.4 22.1 66.1
Kiyokawa_NEC_task4_4 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 32.4 36.2 23.8 65.3
Kiyokawa_NEC_task4_3 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 29.4 32.9 21.2 65.7
Kiyokawa_NEC_task4_2 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 28.3 32.1 19.3 62.4
Kothinti_JHU_task4_2 JHU DCASE2019 task4 system Kothinti2019 30.5 32.5 24.7 53.5
Kothinti_JHU_task4_3 JHU DCASE2019 task4 system Kothinti2019 29.0 31.2 23.0 52.0
Kothinti_JHU_task4_4 JHU DCASE2019 task4 system Kothinti2019 29.4 31.2 24.4 52.4
Kothinti_JHU_task4_1 JHU DCASE2019 task4 system Kothinti2019 30.7 33.2 23.8 53.1
Shi_FRDC_task4_2 BossLee_FRDC_2 Shi2019 42.0 46.1 31.5 69.8
Shi_FRDC_task4_3 BossLee_FRDC_3 Shi2019 40.9 45.5 29.8 68.7
Shi_FRDC_task4_4 BossLee_FRDC_4 Shi2019 41.5 46.4 29.3 67.8
Shi_FRDC_task4_1 BossLee_FRDC_1 Shi2019 37.0 40.2 28.9 63.0
ZYL_UESTC_task4_1 UESTC_SICE_task4_1 Zhang2019 29.4 31.9 23.3 62.0
ZYL_UESTC_task4_2 UESTC_SICE_task4_2 Zhang2019 30.8 34.5 21.1 60.9
Wang_YSU_task4_1 Wang_YSU_task4_1 Yang2019 6.5 7.4 4.1 26.1
Wang_YSU_task4_2 Wang_YSU_task4_2 Yang2019 6.2 7.2 4.0 25.4
Wang_YSU_task4_3 Wang_YSU_task4_3 Yang2019 6.7 7.6 4.6 26.3
Yan_USTC_task4_1 USTC_CRNN_MT system1 Yan2019 35.8 38.2 29.3 66.1
Yan_USTC_task4_3 USTC_CRNN_MT system3 Yan2019 35.6 38.2 28.2 64.6
Yan_USTC_task4_4 USTC_CRNN_MT system4 Yan2019 33.5 35.6 27.3 64.1
Yan_USTC_task4_2 USTC_CRNN_MT system2 Yan2019 36.2 38.8 28.7 65.2
Lee_KNU_task4_2 KNUwaveCNN2 Lee2019 25.8 27.4 21.5 49.0
Lee_KNU_task4_4 KNUwaveCNN4 Lee2019 24.6 26.1 20.5 48.3
Lee_KNU_task4_3 KNUwaveCNN3 Lee2019 26.7 28.1 22.9 50.2
Lee_KNU_task4_1 KNUwaveCNN1 Lee2019 26.4 27.8 22.6 49.0
Rakowski_SRPOL_task4_1 Regularized Surrey9 Rakowski2019 24.2 26.2 19.2 63.4
Lim_ETRI_task4_1 Lim_task4_1 Lim2019 32.6 35.3 25.8 67.1
Lim_ETRI_task4_2 Lim_task4_2 Lim2019 33.2 36.7 24.8 69.2
Lim_ETRI_task4_3 Lim_task4_3 Lim2019 32.5 36.3 22.4 63.2
Lim_ETRI_task4_4 Lim_task4_4 Lim2019 34.4 38.6 23.7 66.4

Teams ranking

The table includes only the best-performing system per submitting team.

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
Wang_NUDT_task4_3 NUDT System for DCASE2019 Task4 Wang2019 17.5 22.4
Delphin_OL_task4_2 DCASE2019 mean-teacher with shifted and noisy data augmentation system Delphin-Poulat2019 42.1 43.6
Kong_SURREY_task4_1 CVSSP cross-task CNN baseline Kong2019 22.3 21.3
CTK_NU_task4_1 CTK_NU_task4_1 Chan2019 31.0 30.4
Mishima_NEC_task4_4 msm_ResNet_4_augmentation_pseudo Mishima2019 19.8 24.7
PELLEGRINI_IRIT_task4_1 PELLEGRINI multi-task Cances2019 39.7 39.9
Lin_ICT_task4_3 Guiding_learning_3 Lin2019 42.7 45.3
Baseline_dcase2019 DCASE2019 baseline system Turpault2019 25.8 23.7
bolun_NWPU_task4_2 DCASE2019 task4 system Bolun2019 27.8 31.9
Agnone_PDL_task4_1 Mean VAT Teacher Agnone2019 25.0 59.6
Kiyokawa_NEC_task4_4 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 32.4 36.1
Kothinti_JHU_task4_1 JHU DCASE2019 task4 system Kothinti2019 30.7 34.6
Shi_FRDC_task4_2 BossLee_FRDC_2 Shi2019 42.0 42.5
ZYL_UESTC_task4_2 UESTC_SICE_task4_2 Zhang2019 30.8 35.6
Wang_YSU_task4_3 Wang_YSU_task4_3 Yang2019 6.7 22.7
Yan_USTC_task4_2 USTC_CRNN_MT system2 Yan2019 36.2 42.6
Lee_KNU_task4_3 KNUwaveCNN3 Lee2019 26.7 31.6
Rakowski_SRPOL_task4_1 Regularized Surrey9 Rakowski2019 24.2 24.3
Lim_ETRI_task4_4 Lim_task4_4 Lim2019 34.4 40.9

Supplementary metrics

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (YouTube dataset) | Event-based F-score (Vimeo dataset) | Segment-based F-score (Evaluation dataset)
Wang_NUDT_task4_3 NUDT System for DCASE2019 Task4 Wang2019 17.5 19.2 13.3 63.0
Delphin_OL_task4_2 DCASE2019 mean-teacher with shifted and noisy data augmentation system Delphin-Poulat2019 42.1 45.8 33.3 71.4
Kong_SURREY_task4_1 CVSSP cross-task CNN baseline Kong2019 22.3 24.1 17.0 59.4
CTK_NU_task4_1 CTK_NU_task4_1 Chan2019 31.0 34.7 21.6 58.2
Mishima_NEC_task4_4 msm_ResNet_4_augmentation_pseudo Mishima2019 19.8 21.8 15.0 58.7
PELLEGRINI_IRIT_task4_1 PELLEGRINI multi-task Cances2019 39.7 43.0 30.9 64.7
Lin_ICT_task4_3 Guiding_learning_3 Lin2019 42.7 47.7 29.4 64.8
Baseline_dcase2019 DCASE2019 baseline system Turpault2019 25.8 29.0 18.1 53.7
bolun_NWPU_task4_2 DCASE2019 task4 system Bolun2019 27.8 30.1 21.7 61.6
Agnone_PDL_task4_1 Mean VAT Teacher Agnone2019 25.0 27.1 20.0 60.4
Kiyokawa_NEC_task4_4 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 32.4 36.2 23.8 65.3
Kothinti_JHU_task4_1 JHU DCASE2019 task4 system Kothinti2019 30.7 33.2 23.8 53.1
Shi_FRDC_task4_2 BossLee_FRDC_2 Shi2019 42.0 46.1 31.5 69.8
ZYL_UESTC_task4_2 UESTC_SICE_task4_2 Zhang2019 30.8 34.5 21.1 60.9
Wang_YSU_task4_3 Wang_YSU_task4_3 Yang2019 6.7 7.6 4.6 26.3
Yan_USTC_task4_2 USTC_CRNN_MT system2 Yan2019 36.2 38.8 28.7 65.2
Lee_KNU_task4_3 KNUwaveCNN3 Lee2019 26.7 28.1 22.9 50.2
Rakowski_SRPOL_task4_1 Regularized Surrey9 Rakowski2019 24.2 26.2 19.2 63.4
Lim_ETRI_task4_4 Lim_task4_4 Lim2019 34.4 38.6 23.7 66.4

Class-wise performance

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Alarm/bell/ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
Wang_NUDT_task4_4 NUDT System for DCASE2019 Task4 Wang2019 16.8 14.0 21.5 0.4 0.2 0.3 21.5 25.0 24.6 10.7 50.2
Wang_NUDT_task4_3 NUDT System for DCASE2019 Task4 Wang2019 17.5 14.0 26.1 0.4 0.0 0.3 22.5 26.8 26.3 10.7 47.9
Wang_NUDT_task4_2 NUDT System for DCASE2019 Task4 Wang2019 17.2 13.2 22.1 0.4 0.2 0.3 23.6 27.8 23.1 11.6 49.8
Wang_NUDT_task4_1 NUDT System for DCASE2019 Task4 Wang2019 17.2 11.9 20.9 0.4 0.2 0.0 24.9 27.7 27.0 11.7 47.5
Delphin_OL_task4_2 DCASE2019 mean-teacher with shifted and noisy data augmentation system Delphin-Poulat2019 42.1 42.6 49.2 52.9 35.2 40.9 47.5 41.4 31.9 43.9 35.7
Delphin_OL_task4_1 DCASE2019 mean-teacher with shifted data augmentation system Delphin-Poulat2019 38.3 41.6 40.8 51.9 37.2 37.8 41.1 39.0 22.1 41.2 29.8
Kong_SURREY_task4_1 CVSSP cross-task CNN baseline Kong2019 22.3 6.2 14.2 41.7 11.1 17.1 28.7 3.0 20.8 50.3 30.3
CTK_NU_task4_2 CTK_NU_task4_2 Chan2019 29.7 20.5 42.9 40.3 0.7 22.9 37.4 30.5 20.0 39.7 41.8
CTK_NU_task4_3 CTK_NU_task4_3 Chan2019 27.7 32.5 38.8 33.8 0.0 17.6 40.2 29.7 19.5 23.0 42.1
CTK_NU_task4_4 CTK_NU_task4_4 Chan2019 26.9 24.0 38.9 30.7 0.7 17.2 35.3 27.6 18.5 36.3 39.9
CTK_NU_task4_1 CTK_NU_task4_1 Chan2019 31.0 25.1 38.2 27.2 7.7 25.6 50.0 35.0 24.2 26.6 50.7
Mishima_NEC_task4_3 msm_ResNet_3_augmentation Mishima2019 18.3 17.9 5.3 36.5 24.9 28.7 13.9 5.0 4.6 38.0 8.4
Mishima_NEC_task4_4 msm_ResNet_4_augmentation_pseudo Mishima2019 19.8 18.7 7.9 41.2 25.1 18.2 14.5 9.5 3.6 48.6 10.5
Mishima_NEC_task4_2 msm_ResNet_2_pseudo Mishima2019 17.7 22.4 0.8 40.9 18.1 31.5 7.5 0.5 1.3 51.7 2.4
Mishima_NEC_task4_1 msm_ResNet_1_simple Mishima2019 16.7 18.3 2.4 35.8 18.6 27.1 8.0 0.8 1.4 50.5 4.5
CANCES_IRIT_task4_2 CANCES multi-task Cances2019 28.4 23.2 24.8 38.0 22.0 24.5 25.2 29.6 21.3 44.0 31.1
CANCES_IRIT_task4_1 CANCES multi-task Cances2019 26.1 18.8 26.9 20.5 19.4 11.1 27.6 40.9 14.1 45.5 36.1
PELLEGRINI_IRIT_task4_1 PELLEGRINI multi-task Cances2019 39.7 35.8 35.1 60.2 32.5 35.5 35.9 37.5 27.7 47.4 49.1
Lin_ICT_task4_2 Guiding_learning_2 Lin2019 40.9 36.4 40.5 54.2 27.0 41.5 42.0 41.7 25.7 46.2 54.2
Lin_ICT_task4_4 Guiding_learning_4 Lin2019 41.8 42.5 36.8 55.1 26.5 43.1 41.8 39.3 20.4 54.2 57.9
Lin_ICT_task4_3 Guiding_learning_3 Lin2019 42.7 42.3 40.5 55.1 26.4 42.0 44.6 41.5 21.8 54.6 58.6
Lin_ICT_task4_1 Guiding_learning_1 Lin2019 40.7 40.2 35.8 55.3 24.6 38.7 43.1 42.0 23.6 55.9 48.3
Baseline_dcase2019 DCASE2019 baseline system Turpault2019 25.8 26.6 32.2 53.6 13.7 9.9 13.3 24.1 10.9 37.7 35.5
bolun_NWPU_task4_1 DCASE2019 task4 system Bolun2019 21.7 6.4 18.6 40.9 12.3 9.0 32.8 12.7 19.0 46.0 19.5
bolun_NWPU_task4_4 DCASE2019 task4 system Bolun2019 25.3 3.7 14.6 51.6 5.9 5.6 39.8 37.4 23.6 45.7 24.8
bolun_NWPU_task4_3 DCASE2019 task4 system Bolun2019 23.8 17.0 14.6 34.6 14.8 13.8 31.9 16.3 23.6 46.0 25.5
bolun_NWPU_task4_2 DCASE2019 task4 system Bolun2019 27.8 16.6 18.6 46.3 20.1 21.5 34.9 28.6 19.0 46.8 25.9
Agnone_PDL_task4_1 Mean VAT Teacher Agnone2019 25.0 33.9 34.9 44.0 19.5 2.8 12.6 23.4 11.8 39.4 27.4
Kiyokawa_NEC_task4_1 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 27.8 21.0 32.1 34.1 22.6 20.2 25.1 14.5 22.1 46.4 40.1
Kiyokawa_NEC_task4_4 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 32.4 32.4 33.2 38.3 24.3 27.6 32.7 17.0 25.0 45.8 48.1
Kiyokawa_NEC_task4_3 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 29.4 32.4 27.0 38.3 24.3 27.6 25.1 13.7 19.4 45.8 40.1
Kiyokawa_NEC_task4_2 DCASE2019 SED ResNet self-mask kiyo Kiyokawa2019 28.3 32.4 31.7 32.8 20.5 27.6 19.5 18.9 20.8 45.8 33.5
Kothinti_JHU_task4_2 JHU DCASE2019 task4 system Kothinti2019 30.5 23.3 46.9 29.5 16.4 41.6 20.9 30.2 19.4 43.6 32.8
Kothinti_JHU_task4_3 JHU DCASE2019 task4 system Kothinti2019 29.0 22.1 43.9 32.5 12.7 36.7 18.9 25.8 19.4 43.5 34.4
Kothinti_JHU_task4_4 JHU DCASE2019 task4 system Kothinti2019 29.4 23.1 46.0 32.0 14.8 37.4 19.7 28.2 17.8 44.6 30.4
Kothinti_JHU_task4_1 JHU DCASE2019 task4 system Kothinti2019 30.7 21.2 45.2 30.6 16.4 40.8 20.7 28.8 23.6 42.5 36.7
Shi_FRDC_task4_2 BossLee_FRDC_2 Shi2019 42.0 45.8 44.2 63.2 32.2 27.5 41.9 27.6 21.8 58.7 56.5
Shi_FRDC_task4_3 BossLee_FRDC_3 Shi2019 40.9 45.5 44.4 65.2 27.1 32.5 36.1 29.1 19.1 58.0 52.4
Shi_FRDC_task4_4 BossLee_FRDC_4 Shi2019 41.5 48.0 42.6 62.6 29.4 33.6 35.5 29.9 19.0 59.1 55.0
Shi_FRDC_task4_1 BossLee_FRDC_1 Shi2019 37.0 45.1 39.3 58.4 27.2 33.9 28.2 28.9 10.0 56.0 43.0
ZYL_UESTC_task4_1 UESTC_SICE_task4_1 Zhang2019 29.4 32.8 41.5 54.7 8.1 1.0 25.5 21.6 17.3 56.6 35.3
ZYL_UESTC_task4_2 UESTC_SICE_task4_2 Zhang2019 30.8 30.9 29.3 52.4 17.1 21.0 31.4 21.1 21.9 44.8 38.1
Wang_YSU_task4_1 Wang_YSU_task4_1 Yang2019 6.5 1.7 0.0 3.3 1.4 3.4 7.0 11.9 3.1 19.2 13.6
Wang_YSU_task4_2 Wang_YSU_task4_2 Yang2019 6.2 2.3 6.2 7.6 2.9 0.7 4.3 7.7 3.4 16.0 11.4
Wang_YSU_task4_3 Wang_YSU_task4_3 Yang2019 6.7 1.9 6.2 7.6 2.9 1.0 4.3 9.6 6.0 16.0 11.0
Yan_USTC_task4_1 USTC_CRNN_MT system1 Yan2019 35.8 23.3 36.6 52.8 28.0 41.4 22.6 42.6 34.9 37.0 39.2
Yan_USTC_task4_3 USTC_CRNN_MT system3 Yan2019 35.6 23.3 33.5 57.1 32.1 41.8 23.1 42.6 31.9 32.6 37.5
Yan_USTC_task4_4 USTC_CRNN_MT system4 Yan2019 33.5 23.3 33.5 57.1 32.1 41.8 23.1 22.2 31.9 32.6 37.5
Yan_USTC_task4_2 USTC_CRNN_MT system2 Yan2019 36.2 18.5 39.2 52.3 26.8 41.4 30.5 42.6 34.9 38.3 37.3
Lee_KNU_task4_2 KNUwaveCNN2 Lee2019 25.8 25.8 21.2 24.4 11.4 25.7 28.0 34.7 16.6 38.2 32.4
Lee_KNU_task4_4 KNUwaveCNN4 Lee2019 24.6 26.9 30.0 20.1 10.4 26.5 27.7 30.9 16.4 31.4 26.0
Lee_KNU_task4_3 KNUwaveCNN3 Lee2019 26.7 25.8 23.4 24.3 11.4 25.8 28.9 36.3 16.7 38.4 36.0
Lee_KNU_task4_1 KNUwaveCNN1 Lee2019 26.4 25.8 27.0 24.4 11.4 25.7 26.9 35.1 16.2 38.2 33.5
Rakowski_SRPOL_task4_1 Regularized Surrey9 Rakowski2019 24.2 25.6 21.6 25.1 14.4 12.9 25.7 24.9 17.5 50.6 23.7
Lim_ETRI_task4_1 Lim_task4_1 Lim2019 32.6 22.2 41.7 53.1 17.2 29.2 12.6 36.0 21.8 50.8 41.4
Lim_ETRI_task4_2 Lim_task4_2 Lim2019 33.2 26.9 36.7 53.7 19.3 27.1 14.0 35.9 23.0 52.4 42.9
Lim_ETRI_task4_3 Lim_task4_3 Lim2019 32.5 25.7 31.6 52.6 20.1 35.2 15.9 33.2 19.5 58.4 32.6
Lim_ETRI_task4_4 Lim_task4_4 Lim2019 34.4 26.2 35.5 57.2 24.1 33.1 17.4 33.3 21.5 58.5 37.1
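As a quick sanity check (using only numbers from the table above), the overall event-based F-score is consistent with the unweighted macro average of the ten class-wise scores:

```python
# e.g. Delphin_OL_task4_2: the class-wise scores average to the overall score
class_scores = [42.6, 49.2, 52.9, 35.2, 40.9, 47.5, 41.4, 31.9, 43.9, 35.7]
print(round(sum(class_scores) / len(class_scores), 1))  # 42.1, matching the table
```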

System characteristics

General characteristics

Code | Technical report | Event-based F-score (Eval) | Sampling rate | Data augmentation | Features
Wang_NUDT_task4_4 Wang2019 16.8 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_3 Wang2019 17.5 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_2 Wang2019 17.2 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_1 Wang2019 17.2 44.1kHz mixup log-mel energies, delta features
Delphin_OL_task4_2 Delphin-Poulat2019 42.1 22.05kHz noise addition, time shifting, frequency shifting log-mel energies
Delphin_OL_task4_1 Delphin-Poulat2019 38.3 22.05kHz time shifting, frequency shifting log-mel energies
Kong_SURREY_task4_1 Kong2019 22.3 32kHz log-mel energies
CTK_NU_task4_2 Chan2019 29.7 32kHz log-mel energies
CTK_NU_task4_3 Chan2019 27.7 32kHz log-mel energies
CTK_NU_task4_4 Chan2019 26.9 32kHz log-mel energies
CTK_NU_task4_1 Chan2019 31.0 32kHz log-mel energies
Mishima_NEC_task4_3 Mishima2019 18.3 44.1kHz block mixing,mixup log-mel energies
Mishima_NEC_task4_4 Mishima2019 19.8 44.1kHz block mixing,mixup log-mel energies
Mishima_NEC_task4_2 Mishima2019 17.7 44.1kHz log-mel energies
Mishima_NEC_task4_1 Mishima2019 16.7 44.1kHz log-mel energies
CANCES_IRIT_task4_2 Cances2019 28.4 22kHz pitch_shifting, time stretching, level, noise log-mel energies
CANCES_IRIT_task4_1 Cances2019 26.1 22kHz pitch_shifting, time stretching, level, noise log-mel energies
PELLEGRINI_IRIT_task4_1 Cances2019 39.7 22kHz pitch_shifting, time stretching, level log-mel energies
Lin_ICT_task4_2 Lin2019 40.9 44.1kHz log-mel energies
Lin_ICT_task4_4 Lin2019 41.8 44.1kHz log-mel energies
Lin_ICT_task4_3 Lin2019 42.7 44.1kHz log-mel energies
Lin_ICT_task4_1 Lin2019 40.7 44.1kHz log-mel energies
Baseline_dcase2019 Turpault2019 25.8 44.1kHz log-mel energies
bolun_NWPU_task4_1 Bolun2019 21.7 32kHz event adding log-mel energies
bolun_NWPU_task4_4 Bolun2019 25.3 32kHz event adding log-mel energies
bolun_NWPU_task4_3 Bolun2019 23.8 32kHz event adding log-mel energies
bolun_NWPU_task4_2 Bolun2019 27.8 32kHz event adding log-mel energies
Agnone_PDL_task4_1 Agnone2019 25.0 44.1kHz VAT log-mel energies
Kiyokawa_NEC_task4_1 Kiyokawa2019 27.8 44.1kHz mixup log-mel energies
Kiyokawa_NEC_task4_4 Kiyokawa2019 32.4 44.1kHz mixup log-mel energies
Kiyokawa_NEC_task4_3 Kiyokawa2019 29.4 44.1kHz mixup log-mel energies
Kiyokawa_NEC_task4_2 Kiyokawa2019 28.3 44.1kHz mixup log-mel energies
Kothinti_JHU_task4_2 Kothinti2019 30.5 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_3 Kothinti2019 29.0 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_4 Kothinti2019 29.4 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_1 Kothinti2019 30.7 44.1kHz log-mel energies, auditory spectrogram
Shi_FRDC_task4_2 Shi2019 42.0 44.1kHz Gaussian noise log-mel energies
Shi_FRDC_task4_3 Shi2019 40.9 44.1kHz Gaussian noise log-mel energies
Shi_FRDC_task4_4 Shi2019 41.5 44.1kHz Gaussian noise log-mel energies
Shi_FRDC_task4_1 Shi2019 37.0 44.1kHz Gaussian noise log-mel energies
ZYL_UESTC_task4_1 Zhang2019 29.4 44.1kHz mel-spectrogram
ZYL_UESTC_task4_2 Zhang2019 30.8 44.1kHz mel-spectrogram
Wang_YSU_task4_1 Yang2019 6.5 44.1kHz log-mel energies
Wang_YSU_task4_2 Yang2019 6.2 44.1kHz log-mel energies
Wang_YSU_task4_3 Yang2019 6.7 44.1kHz log-mel energies
Yan_USTC_task4_1 Yan2019 35.8 44.1kHz SpecAugment log-mel energies
Yan_USTC_task4_3 Yan2019 35.6 44.1kHz SpecAugment log-mel energies
Yan_USTC_task4_4 Yan2019 33.5 44.1kHz SpecAugment log-mel energies
Yan_USTC_task4_2 Yan2019 36.2 44.1kHz SpecAugment log-mel energies
Lee_KNU_task4_2 Lee2019 25.8 44.1kHz waveform
Lee_KNU_task4_4 Lee2019 24.6 44.1kHz notch filter waveform
Lee_KNU_task4_3 Lee2019 26.7 44.1kHz waveform
Lee_KNU_task4_1 Lee2019 26.4 44.1kHz waveform
Rakowski_SRPOL_task4_1 Rakowski2019 24.2 32kHz occlusions log-mel energies
Lim_ETRI_task4_1 Lim2019 32.6 44.1kHz SpecAugment log-mel energies
Lim_ETRI_task4_2 Lim2019 33.2 44.1kHz SpecAugment log-mel energies
Lim_ETRI_task4_3 Lim2019 32.5 44.1kHz SpecAugment log-mel energies
Lim_ETRI_task4_4 Lim2019 34.4 44.1kHz SpecAugment log-mel energies
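Nearly every entry above uses log-mel energies as the input representation. A minimal front-end sketch, assuming librosa; the frame and filterbank parameters are illustrative, not any particular team's settings:

```python
import librosa

# Illustrative log-mel front-end (parameter values are placeholders).
y, sr = librosa.load('clip.wav', sr=44100, mono=True)   # mono input, 44.1 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel)   # log-mel energies, shape (n_mels, n_frames)
```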



Machine learning characteristics

Code | Technical report | Event-based F-score (Eval) | Classifier | Semi-supervised approach | Post-processing | Segmentation method | Decision making
Wang_NUDT_task4_4 Wang2019 16.8 CRNN pseudo-labelling mean probabilities
Wang_NUDT_task4_3 Wang2019 17.5 CRNN pseudo-labelling mean probabilities
Wang_NUDT_task4_2 Wang2019 17.2 CRNN pseudo-labelling mean probabilities
Wang_NUDT_task4_1 Wang2019 17.2 CRNN pseudo-labelling mean probabilities
Delphin_OL_task4_2 Delphin-Poulat2019 42.1 CRNN mean-teacher student median filtering (class-dependent)
Delphin_OL_task4_1 Delphin-Poulat2019 38.3 CRNN mean-teacher student median filtering (class-dependent)
Kong_SURREY_task4_1 Kong2019 22.3 CNN supervised median filtering
CTK_NU_task4_2 Chan2019 29.7 NMF, CNN non-negative matrix factorization
CTK_NU_task4_3 Chan2019 27.7 NMF, CNN non-negative matrix factorization
CTK_NU_task4_4 Chan2019 26.9 NMF, CNN non-negative matrix factorization
CTK_NU_task4_1 Chan2019 31.0 NMF, CNN non-negative matrix factorization
Mishima_NEC_task4_3 Mishima2019 18.3 ResNet median filtering
Mishima_NEC_task4_4 Mishima2019 19.8 ResNet pseudo-labelling median filtering
Mishima_NEC_task4_2 Mishima2019 17.7 ResNet pseudo-labelling time aggregation, median filtering
Mishima_NEC_task4_1 Mishima2019 16.7 ResNet median filtering
CANCES_IRIT_task4_2 Cances2019 28.4 CRNN smoothed moving average
CANCES_IRIT_task4_1 Cances2019 26.1 CRNN smoothed moving average
PELLEGRINI_IRIT_task4_1 Cances2019 39.7 CRNN smoothed moving average
Lin_ICT_task4_2 Lin2019 40.9 CNN guiding learning with a more professional teacher median filtering (with adaptive window size) attention layer
Lin_ICT_task4_4 Lin2019 41.8 CNN guiding learning with a more professional teacher median filtering (with adaptive window size) attention layer
Lin_ICT_task4_3 Lin2019 42.7 CNN guiding learning with a more professional teacher median filtering (with adaptive window size) attention layer
Lin_ICT_task4_1 Lin2019 40.7 CNN guiding learning with a more professional teacher median filtering (with adaptive window size) attention layer
Baseline_dcase2019 Turpault2019 25.8 CRNN mean-teacher student median filtering
bolun_NWPU_task4_1 Bolun2019 21.7 CNN mean-teacher student median filtering
bolun_NWPU_task4_4 Bolun2019 25.3 CNN, RNN, ensemble mean-teacher student median filtering
bolun_NWPU_task4_3 Bolun2019 23.8 CNN mean-teacher student median filtering
bolun_NWPU_task4_2 Bolun2019 27.8 CNN, RNN, ensemble mean-teacher student median filtering
Agnone_PDL_task4_1 Agnone2019 25.0 CRNN mean-teacher student, VAT median filtering attention layer
Kiyokawa_NEC_task4_1 Kiyokawa2019 27.8 ResNet, SENet time thresholding
Kiyokawa_NEC_task4_4 Kiyokawa2019 32.4 ResNet, SENet time thresholding
Kiyokawa_NEC_task4_3 Kiyokawa2019 29.4 ResNet, SENet time thresholding
Kiyokawa_NEC_task4_2 Kiyokawa2019 28.3 ResNet, SENet time thresholding
Kothinti_JHU_task4_2 Kothinti2019 30.5 CRNN, RBM, CRBM, PCA mean-teacher student, pseudo-labelling saliency majority vote
Kothinti_JHU_task4_3 Kothinti2019 29.0 CRNN, RBM, CRBM, PCA, Kalman Filter mean-teacher student, pseudo-labelling saliency
Kothinti_JHU_task4_4 Kothinti2019 29.4 CRNN, RBM, CRBM, PCA, Kalman Filter mean-teacher student, pseudo-labelling saliency majority vote
Kothinti_JHU_task4_1 Kothinti2019 30.7 CRNN, RBM, CRBM, PCA mean-teacher student, pseudo-labelling saliency
Shi_FRDC_task4_2 Shi2019 42.0 CRNN interpolation consistency training median filtering
Shi_FRDC_task4_3 Shi2019 40.9 CRNN MixMatch median filtering
Shi_FRDC_task4_4 Shi2019 41.5 CRNN mean-teacher student, interpolation consistency training, MixMatch median filtering
Shi_FRDC_task4_1 Shi2019 37.0 CRNN mean-teacher student median filtering
ZYL_UESTC_task4_1 Zhang2019 29.4 CNN,ResNet,RNN mean-teacher student median filtering attention layer
ZYL_UESTC_task4_2 Zhang2019 30.8 CNN,ResNet,RNN mean-teacher student median filtering attention layer
Wang_YSU_task4_1 Yang2019 6.5 CMRANN-MT mean-teacher student median filtering
Wang_YSU_task4_2 Yang2019 6.2 CMRANN-MT mean-teacher student median filtering
Wang_YSU_task4_3 Yang2019 6.7 CMRANN-MT mean-teacher student median filtering
Yan_USTC_task4_1 Yan2019 35.8 CRNN mean-teacher student median filtering
Yan_USTC_task4_3 Yan2019 35.6 CRNN mean-teacher student median filtering
Yan_USTC_task4_4 Yan2019 33.5 CRNN mean-teacher student median filtering
Yan_USTC_task4_2 Yan2019 36.2 CRNN mean-teacher student median filtering
Lee_KNU_task4_2 Lee2019 25.8 CNN pseudo-labelling, mean-teacher student minimum gap/length compensation double thresholding
Lee_KNU_task4_4 Lee2019 24.6 CNN pseudo-labelling, mean-teacher student minimum gap/length compensation double thresholding
Lee_KNU_task4_3 Lee2019 26.7 CNN pseudo-labelling, mean-teacher student minimum gap/length compensation double thresholding
Lee_KNU_task4_1 Lee2019 26.4 CNN pseudo-labelling, mean-teacher student double thresholding
Rakowski_SRPOL_task4_1 Rakowski2019 24.2 CNN voice activity detection
Lim_ETRI_task4_1 Lim2019 32.6 CRNN, Ensemble median filtering mean probabilities, thresholding
Lim_ETRI_task4_2 Lim2019 33.2 CRNN, Ensemble median filtering mean probabilities, thresholding
Lim_ETRI_task4_3 Lim2019 32.5 CRNN, Ensemble mean-teacher student median filtering mean probabilities, thresholding
Lim_ETRI_task4_4 Lim2019 34.4 CRNN, Ensemble mean-teacher student median filtering mean probabilities, thresholding
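Median filtering dominates the post-processing column above. A minimal sketch of the usual threshold-then-smooth chain, assuming SciPy; the threshold and window length are illustrative, and several teams tune the window per class:

```python
import numpy as np
from scipy.ndimage import median_filter

# frame_probs: (n_frames, n_classes) frame-level posteriors from a network
frame_probs = np.random.rand(500, 10)                 # stand-in posteriors
binary = frame_probs > 0.5                            # frame-wise thresholding
# smooth each class track independently with a temporal median filter
smoothed = median_filter(binary.astype(float), size=(27, 1)) > 0.5
```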

Complexity

Code | Technical report | Event-based F-score (Eval) | Model complexity (parameters) | Ensemble subsystems | Training time
Wang_NUDT_task4_4 Wang2019 16.8 4034268 3 2h (1 GTX 1080 Ti)
Wang_NUDT_task4_3 Wang2019 17.5 4034268 3 2h (1 GTX 1080 Ti)
Wang_NUDT_task4_2 Wang2019 17.2 4034268 3 2h (1 GTX 1080 Ti)
Wang_NUDT_task4_1 Wang2019 17.2 4034268 3 2h (1 GTX 1080 Ti)
Delphin_OL_task4_2 Delphin-Poulat2019 42.1 1582036 21h (1 GTX 1080)
Delphin_OL_task4_1 Delphin-Poulat2019 38.3 1582036 17h (1 GTX 1080)
Kong_SURREY_task4_1 Kong2019 22.3 4686144 1h (1 Titan XP)
CTK_NU_task4_2 Chan2019 29.7 4309450 0.5h (1 GTX 1060)
CTK_NU_task4_3 Chan2019 27.7 4309450 0.5h (1 GTX 1060)
CTK_NU_task4_4 Chan2019 26.9 4309450 0.5h (1 GTX 1060)
CTK_NU_task4_1 Chan2019 31.0 4309450 0.5h (1 GTX 1060)
Mishima_NEC_task4_3 Mishima2019 18.3 23865546 6h (4 GTX 1080 Ti)
Mishima_NEC_task4_4 Mishima2019 19.8 23865546 6h (4 GTX 1080 Ti)
Mishima_NEC_task4_2 Mishima2019 17.7 23865546 8h (4 GTX 1080 Ti)
Mishima_NEC_task4_1 Mishima2019 16.7 23865546 6h (4 GTX 1080 Ti)
CANCES_IRIT_task4_2 Cances2019 28.4 420116 1h (1 GTX 1080 Ti)
CANCES_IRIT_task4_1 Cances2019 26.1 470036 1h (1 GTX 1080 Ti)
PELLEGRINI_IRIT_task4_1 Cances2019 39.7 165460 12h (1 GTX 1080 Ti)
Lin_ICT_task4_2 Lin2019 40.9 1209744 3h (1 GTX 1080 Ti)
Lin_ICT_task4_4 Lin2019 41.8 6048720 5 3h (1 GTX 1080 Ti)
Lin_ICT_task4_3 Lin2019 42.7 7258464 6 3h (1 GTX 1080 Ti)
Lin_ICT_task4_1 Lin2019 40.7 1209744 3h (1 GTX 1080 Ti)
Baseline_dcase2019 Turpault2019 25.8 214356 3h (1 GTX 1080 Ti)
bolun_NWPU_task4_1 Bolun2019 21.7
bolun_NWPU_task4_4 Bolun2019 25.3
bolun_NWPU_task4_3 Bolun2019 23.8
bolun_NWPU_task4_2 Bolun2019 27.8
Agnone_PDL_task4_1 Agnone2019 25.0 214356 2h (1 GTX 1080 Ti)
Kiyokawa_NEC_task4_1 Kiyokawa2019 27.8 11408962 8h (4 GTX 1080 Ti)
Kiyokawa_NEC_task4_4 Kiyokawa2019 32.4 11408962 12h (4 GTX 1080 Ti)
Kiyokawa_NEC_task4_3 Kiyokawa2019 29.4 11408962 12h (4 GTX 1080 Ti)
Kiyokawa_NEC_task4_2 Kiyokawa2019 28.3 11408962 12h (4 GTX 1080 Ti)
Kothinti_JHU_task4_2 Kothinti2019 30.5 1200000 3 2h (1 GTX 2080 Ti)
Kothinti_JHU_task4_3 Kothinti2019 29.0 520000 2h (1 GTX 2080 Ti)
Kothinti_JHU_task4_4 Kothinti2019 29.4 1200000 3 2h (1 GTX 2080 Ti)
Kothinti_JHU_task4_1 Kothinti2019 30.7 520000 2h (1 GTX 2080 Ti)
Shi_FRDC_task4_2 Shi2019 42.0 6878340 9 24h (1 TITAN Xp)
Shi_FRDC_task4_3 Shi2019 40.9 4585560 6 24h (1 TITAN Xp)
Shi_FRDC_task4_4 Shi2019 41.5 18342240 24 24h (1 TITAN Xp)
Shi_FRDC_task4_1 Shi2019 37.0 6878340 9 24h (1 TITAN Xp)
ZYL_UESTC_task4_1 Zhang2019 29.4 298122 2.5h (1 GTX 1080 Ti)
ZYL_UESTC_task4_2 Zhang2019 30.8 298698 4h (1 GTX 1080 Ti)
Wang_YSU_task4_1 Yang2019 6.5 126090 3h (1 GTX 1080 Ti)
Wang_YSU_task4_2 Yang2019 6.2 126090 3h (1 GTX 1080 Ti)
Wang_YSU_task4_3 Yang2019 6.7 126090 3h (1 GTX 1080 Ti)
Yan_USTC_task4_1 Yan2019 35.8 7068540 5 3.5h (1 GTX 1080 Ti)
Yan_USTC_task4_3 Yan2019 35.6 7068540 5 3.5h (1 GTX 1080 Ti)
Yan_USTC_task4_4 Yan2019 33.5 7068540 5 3.5h (1 GTX 1080 Ti)
Yan_USTC_task4_2 Yan2019 36.2 7068540 5 3.5h (1 GTX 1080 Ti)
Lee_KNU_task4_2 Lee2019 25.8 3425776 6h (1 GTX Titan V)
Lee_KNU_task4_4 Lee2019 24.6 3425776 6h (1 GTX Titan V)
Lee_KNU_task4_3 Lee2019 26.7 3425776 6h (1 GTX Titan V)
Lee_KNU_task4_1 Lee2019 26.4 3425776 6h (1 GTX Titan V)
Rakowski_SRPOL_task4_1 Rakowski2019 24.2 4691274 6h (1 Tesla P40)
Lim_ETRI_task4_1 Lim2019 32.6 10572448 4
Lim_ETRI_task4_2 Lim2019 33.2 42289792 16
Lim_ETRI_task4_3 Lim2019 32.5 10572448 4
Lim_ETRI_task4_4 Lim2019 34.4 42289792 16

Technical reports

VIRTUAL ADVERSARIAL TRAINING SYSTEM FOR DCASE 2019 TASK 4

Agnone, Anthony and Altaf, Umair
Pindrop, Audio Research department, Atlanta, United States

Abstract

This paper describes the approach used for Task 4 of the DCASE 2019 Challenge. This task challenges systems to learn from a combination of labeled and unlabeled data. Furthermore, the labeled data is itself a combination of weakly-informed, coarse time-based real data and strongly-informed, fine time-based synthetic data. The baseline system builds on the winning solution from last year and adds the synthetic data, which was not provided in that iteration of the challenge. Our solution uses the semi-supervised virtual adversarial training method, in addition to the Mean Teacher consistency loss, to encourage generalization from weakly-labeled and unlabeled data. The chosen system parametrization achieves a 59.57% macro F1 score.
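As a rough illustration of the virtual adversarial training idea mentioned above, here is a PyTorch sketch of Miyato et al.'s method for a multi-label tagger; it is not the authors' implementation, and the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0):
    """Sketch of virtual adversarial training: find the input perturbation
    that most changes the prediction and penalize that change."""
    with torch.no_grad():
        pred = torch.sigmoid(model(x))                 # current posteriors
    d = torch.randn_like(x)                            # random direction
    d = (xi * F.normalize(d.flatten(1), dim=1)).view_as(x).requires_grad_()
    dist = F.binary_cross_entropy(torch.sigmoid(model(x + d)), pred)
    grad = torch.autograd.grad(dist, d)[0]             # most sensitive direction
    r_adv = (eps * F.normalize(grad.flatten(1), dim=1)).view_as(x)
    return F.binary_cross_entropy(torch.sigmoid(model(x + r_adv)), pred)
```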

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation VAT
Features log-mel energies
Classifier CRNN, mean-teacher student + VAT

CLASS WISE FUSION SYSTEM FOR DCASE 2019 TASK4

Wang, Bolun and Wu, Hao and Bai, Jisheng and Chen, Chen and Wang, Mou and Wang, Rui and Fu, Zhonghua and Chen, Jianfeng and Rahardja, Susanto and Zhang, Xiaolei
Northwestern Polytechnical University, School of Computer Science, Xi'an, China

Abstract

In this report, we introduce our system for Task 4 of the DCASE 2019 challenge (Sound event detection in domestic environments). The goal of the task is to evaluate systems for the detection of sound events using real data, either weakly labeled or unlabeled, along with strongly labeled simulated data. With the aim of improving performance using a large amount of unlabeled data and a small labeled training set, we focus on three parts: data augmentation, loss function, and network fusion.

System characteristics
Input mono
Sampling rate 32kHz
Data augmentation event adding
Features log-mel energies
Classifier CNN, mean-teacher student

MULTI TASK LEARNING AND POST PROCESSING OPTIMIZATION FOR SOUND EVENT DETECTION

Cances, Léo and Pellegrini, Thomas and Guyot, Patrice
IRIT, Université de Toulouse, CNRS, Toulouse, France

Abstract

In this paper, we report our experiments on Sound Event Detection in domestic environments in the framework of the DCASE 2019 Task 4 challenge. The novelty this year lies in the availability of three different subsets for development: a weakly annotated subset, a strongly annotated synthetic subset, and an unlabeled subset. The weak annotations, unlike the strong ones, provide tags for audio events but no temporal boundaries. The task objective is twofold: detecting audio events (multi-label classification at recording level), and localizing the events precisely within the recordings. First, we explore multi-task training to take advantage of the synthetic and unlabeled in-domain subsets. Then, we apply various temporal segmentation methods, using optimization algorithms to obtain the best performing segmentation parameters. For the multi-task models themselves, we explored two strategies based on convolutional recurrent neural networks (CRNN): 1) a single-branch model with two outputs, and 2) multi-branch models with two or three outputs. These approaches outperform the baseline of 23.7% in F-measure by a large margin, with values of 39.9% and 33.8% for the first and second strategy, respectively, on the official validation subset comprised of 1103 recordings.

System characteristics
Input mono
Sampling rate 22kHz
Data augmentation pitch_shifting, time stretching, level, noise
Features log-mel energies
Classifier CRNN

NON-NEGATIVE MATRIX FACTORIZATION-CONVOLUTION NEURAL NETWORK (NMF-CNN) FOR SOUND EVENT DETECTION

Chan, Teck Kai and Chin, Cheng Siong and Li, Ye
Newcastle University, Singapore

Abstract

The main scientific question of this year's DCASE challenge Task 4 (Sound Event Detection in Domestic Environments) is to investigate which types of data (strongly labeled synthetic data, weakly labeled data, unlabeled in-domain data) are required to achieve the best performing system. In this paper, we propose a deep learning model that integrates a Convolutional Neural Network (CNN) with Non-Negative Matrix Factorization (NMF). The best performing model achieves an event-based F1-score of 30.39%, compared to the baseline system's F1-score of 23.7% on the validation dataset. Based on the results, even though the synthetic data is strongly labeled, it cannot be used as the sole source of training data, and doing so resulted in the worst performance. Although a combination of weakly and strongly labeled data achieves the highest F1-score, the increment is not significant and it may not be worthwhile to include the synthetic data in the training set. The results also suggest that the quality of labels assigned to unlabeled in-domain data is essential: inaccurate labeling can harm accuracy rather than improve model performance.
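For illustration, the NMF building block can be sketched with scikit-learn; the spectrogram and component count below are stand-ins, not the authors' configuration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Factorize a non-negative mel spectrogram V ≈ W @ H as a front-end to a CNN.
V = np.random.rand(64, 500)                  # stand-in mel spectrogram
nmf = NMF(n_components=16, init='random', max_iter=200, random_state=0)
W = nmf.fit_transform(V)                     # spectral bases, shape (64, 16)
H = nmf.components_                          # time activations, shape (16, 500)
```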

System characteristics
Input mono
Sampling rate 32kHz
Features log-mel energies
Classifier NMF, CNN

MEAN TEACHER WITH DATA AUGMENTATION FOR DCASE 2019 TASK 4

Delphin-Poulat, Lionel and Plapous, Cyril
Orange Labs Lannion, France

Abstract

In this paper, we present our neural network for the DCASE 2019 challenge’s Task 4 (Sound event detection in domestic environments) [1]. The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled. We propose a mean-teacher model with convolutional neural network (CNN) and recurrent neural network (RNN) together with data augmentation and a median window tuned for each class based on prior knowledge.
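Since the mean-teacher scheme recurs throughout the submissions, a minimal PyTorch sketch of its two ingredients may help; this illustrates Tarvainen & Valpola's method in general, not this system's code:

```python
import torch
import torch.nn.functional as F

# The teacher's weights are an exponential moving average of the student's.
def ema_update(teacher, student, alpha=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(alpha).add_(s.data, alpha=1 - alpha)

# A consistency loss ties student and teacher predictions on (augmented) data.
def consistency_loss(student_logits, teacher_logits):
    return F.mse_loss(torch.sigmoid(student_logits),
                      torch.sigmoid(teacher_logits).detach())
```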

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation time shifting, frequency shifting
Features log-mel energies
Classifier CRNN, mean-teacher student

SOUND EVENT DETECTION WITH RESNET AND SELF-MASK MODULE FOR DCASE 2019 TASK 4

Kiyokawa, Yu and Mishima, Sakiko and Toizumi, Takahiro and Sagi, Kazutoshi and Kondo, Reishi and Nomura, Toshiyuki
Data Science Research Laboratories, NEC Corporation, Japan

Abstract

In this technical report, we propose a sound event detection system using a residual network (ResNet) with a self-mask module for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2019 (DCASE 2019) challenge. Our system is constructed with a convolutional neural network based on a ResNet. We introduce a self-mask module as a region proposal network in order to detect event time boundaries. The self-mask module constrains the time durations of silent segments and sound events by proposing candidates for the sound event regions. These constraints improve detection accuracy for the sound event regions. Evaluation results show that our system obtains an event-based F1-score of 36.09% for sound event detection on the validation dataset of Task 4.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier ResNet, SENet

CROSS-TASK LEARNING FOR AUDIO TAGGING, SOUND EVENT DETECTION AND SPATIAL LOCALIZATION: DCASE 2019 BASELINE SYSTEMS

Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Wenwu and Plumbley, Mark D.
Centre for Vision, Speech and Signal Processing, University of Surrey, UK

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.

System characteristics
Input mono
Sampling rate 32kHz
Features log-mel energies
Classifier CNN

INTEGRATED BOTTOM-UP AND TOP-DOWN INFERENCE FOR SOUND EVENT DETECTION

Kothinti, Sandeep and Sell, Gregory and Watanabe, Shinji and Elhilali, Mounya
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA

Abstract

While supervised methods have been highly effective at defining boundaries of sound events, the characteristics of the acoustic scene itself can provide complementary information about the changing profile of the scene and presence of new events. This work explores an integrated supervised and unsupervised approach to weakly labeled sound event detection by complementing a class-based inference system with a bottom-up, salience-based analysis. The two systems work conjointly in two ways: 1) Class information from the supervised model is used to tune the parameters of the bottom-up salience detection; and 2) Salience-based boundaries are leveraged to create pseudo-labels for weakly labeled data to generate more samples of strongly annotated data. These operations reflect the interplay between stimulus driven analysis and semantic driven analysis. The proposed method gives an absolute improvement of 11% on macro-averaged F-score on the development set.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies, auditory spectrogram
Classifier CRNN, RBM, CRBM, PCA, mean-teacher student

END-TO-END DEEP CONVOLUTIONAL NEURAL NETWORK WITH MULTI-SCALE STRUCTURE FOR WEAKLY LABELED SOUND EVENT DETECTION

Lee, Seokjin and Kim, Minhan and Jeong, Youngho
Kyungpook National University, School of Electronics Engineering, Daegu, Republic of Korea

Abstract

In this paper, an end-to-end sound event detection algorithm that detects and classifies sound events from the waveform itself is proposed. The proposed model consists of multi-scale time frames and networks to handle both short and long signal characteristics; the frame slides by 0.1 s to provide sufficiently fine resolution. The element network for each time frame consists of several one-dimensional convolutional neural networks (1D-CNNs) in a deeply stacked structure. The outputs of the element networks are averaged and gated by sound activity detection. The decision is made by double thresholding, and the results are enhanced by class-wise minimum gap/length compensation. To evaluate the proposed network, simulations were performed with the development data from DCASE 2019 Task 4; the results show that the proposed algorithm achieves a macro-averaged F1-score of 31.7% on the DCASE 2019 development dataset and 30.2% on the DCASE 2018 evaluation dataset.
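A minimal sketch of the double-thresholding decision step described above; the threshold values are illustrative, not the authors':

```python
import numpy as np

def double_threshold(probs, hi=0.75, lo=0.25):
    """Segments are seeded where probs exceed `hi`, then extended in both
    directions while probs stay above `lo` (illustrative thresholds)."""
    active = probs > lo
    events, start = [], None
    for i, a in enumerate(np.append(active, False)):  # sentinel closes last run
        if a and start is None:
            start = i
        elif not a and start is not None:
            if (probs[start:i] > hi).any():           # keep only seeded runs
                events.append((start, i))             # [onset, offset) in frames
            start = None
    return events
```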

System characteristics
Input mono
Sampling rate 44.1kHz
Features waveform
Classifier CNN, pseudo-labelling, mean-teacher student

SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS USING ENSEMBLE OF CONVOLUTIONAL RECURRENT NEURAL NETWORKS

Lim, Wootaek and Suh, Sangwon and Park, Sooyoung and Jeong, Youngho
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea

Abstract

In this paper, we present a method to detect sound events in domestic environments using small weakly labeled data, large unlabeled data, and strongly labeled synthetic data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4. To solve the problem, we use a convolutional recurrent neural network (CRNN), which stacks convolutional neural networks (CNNs) and a bi-directional gated recurrent unit (Bi-GRU). Moreover, we propose various methods such as data augmentation, event activity detection, multi-median filtering, a mean-teacher student model, and an ensemble of neural networks to improve performance. By combining the proposed methods, sound event detection performance can be enhanced compared with the baseline algorithm. Performance evaluation shows that the proposed method provides detection results of 40.89% for event-based metrics and 66.17% for segment-based metrics.
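The ensemble decision ("mean probabilities, thresholding" in the characteristics below) can be sketched in a few lines; the threshold is illustrative:

```python
import numpy as np

# Average frame-level posteriors over the ensemble, then threshold.
def ensemble_decision(prob_list, threshold=0.5):
    # prob_list: list of (n_frames, n_classes) arrays, one per subsystem
    return np.mean(prob_list, axis=0) > threshold   # binary activity mask
```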

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CRNN
Decision making mean probabilities, thresholding

GUIDED LEARNING CONVOLUTION SYSTEM FOR DCASE 2019 TASK 4

Lin, Liwei and Wang, Xiangdong
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Abstract

In this technical report, we describe in detail the system we submitted to DCASE 2019 Task 4: sound event detection (SED) in domestic environments. We approach SED as a multiple instance learning (MIL) problem and employ a convolutional neural network (CNN) with a class-wise attention pooling (cATP) module to solve it. Considering the interference caused by the co-occurrence of multiple events in the unbalanced dataset, we combine the cATP-MIL framework with the disentangled feature. To take advantage of the unlabeled data, we adopt guided learning with a more professional teacher for semi-supervised learning. A group of median filters with adaptive window sizes is utilized in post-processing. We also analyze the effect of the synthetic data on the performance of the model, and finally achieve an F-measure of 45.43% on the validation set.
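A generic PyTorch sketch of class-wise attention pooling in the spirit of the cATP module, not the authors' code; the dimensions are placeholders:

```python
import torch
import torch.nn as nn

class ClasswiseAttentionPooling(nn.Module):
    """Pool frame-level class scores to clip level with per-class attention."""
    def __init__(self, n_feat, n_classes):
        super().__init__()
        self.cla = nn.Linear(n_feat, n_classes)  # frame-level class scores
        self.att = nn.Linear(n_feat, n_classes)  # frame-level attention scores

    def forward(self, x):                        # x: (batch, frames, n_feat)
        frame_probs = torch.sigmoid(self.cla(x))          # (B, T, C)
        weights = torch.softmax(self.att(x), dim=1)       # normalized over time
        clip_probs = (frame_probs * weights).sum(dim=1)   # (B, C) clip level
        return clip_probs, frame_probs
```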

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN

TRAINING METHOD USING CLASS-FRAME PSEUDO LABEL FOR WEAKLY LABELED DATASET ON DCASE2019

Mishima, Sakiko and Kiyokawa, Yu and Toizumi, Takahiro and Sagi, Kazutoshi and Kondo, Reishi and Nomura, Toshiyuki
Data Science Research Laboratories, NEC Corporation, Japan

Abstract

We propose a training method using class-frame pseudo labels for the weakly labeled datasets provided by the IEEE AASP challenge on Detection and Classification of Acoustic Scenes and Events 2019 (DCASE 2019) Task 4. Our model is constructed on a residual network (ResNet) and trained with datasets including strong and weak labels. A strong label gives the event classes and their presence at each frame, while a weak label gives only the event classes. In order to train the model effectively, we propose class-frame pseudo labels for the weakly labeled datasets. The class-frame pseudo label improves the prediction of event presence at each frame by avoiding overfitting to the strongly labeled datasets. Results show that the F1-scores of our proposed method are 25.9% and 62.0% in the event-based and segment-based evaluations, respectively.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier ResNet

REGULARIZED CNN FOR SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS

Rakowski, Alexander
Samsung R&D Poland, Audio Intelligence Dept., Warsaw, Poland

Abstract

This report describes a system used for Task 4 of the DCASE 2019 Challenge: Sound Event Detection in Domestic Environments. The system consists of a 9-layer convolutional neural network which yields frame-level predictions. These are then aggregated using a Voice Activity Detection algorithm in order to extract sound events. To prevent the system from overfitting, two techniques are applied. The first consists of training the model with channel- and pixel-wise dropout. The second removes information from a randomly selected subset of frames.

System characteristics
Input mono
Sampling rate 32kHz
Data augmentation occlusions
Features log-mel energies
Classifier CNN

HODGEPODGE: SOUND EVENT DETECTION BASED ON ENSEMBLE OF SEMI-SUPERVISED LEARNING METHODS

Shi, Ziqiang
Fujitsu Research and Development Center, Beijing, China

Abstract

In this technical report, we present the techniques and models applied in our submission for DCASE 2019 Task 4: Sound event detection in domestic environments. We focus primarily on how to apply semi-supervised learning methods efficiently to deal with a large amount of unlabeled in-domain data. Three semi-supervised learning principles are used in our system: 1) consistency regularization applied to augmented data; 2) a MixUp regularizer requiring that the prediction for an interpolation of two inputs be close to the interpolation of the predictions for each individual input; 3) MixUp regularization applied to interpolations between data augmentations. We also tried an ensemble of various models trained using different semi-supervised learning principles.
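The MixUp interpolation underlying principles 2) and 3) can be sketched briefly; alpha is illustrative:

```python
import numpy as np

# Mix two examples and their (soft) labels with a Beta-distributed weight.
def mixup(x1, x2, y1, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```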

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation Gaussian noise
Features log-mel energies
Classifier CRNN, mean-teacher student

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Turpault, Nicolas and Serizel, Romain and Parag Shah, Ankit and Salamon, Justin
Université de Lorraine, CNRS, Inria, Loria, France

Abstract

This paper presents DCASE 2019 Task 4 and proposes a first analysis of the results. The task is the follow-up to DCASE 2018 Task 4 and evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The paper focuses in particular on the additional synthetic, strongly labeled dataset provided this year.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN, mean-teacher student

SOUND EVENT DETECTION USING WEAKLY LABELED AND UNLABELED DATA WITH SELF-ADAPTIVE EVENT THRESHOLD

Wang, Dezhi and Zhang, Lilun and Bao, Changchun and Wang, Yongxian and Xu, Kele and Zhu, Boqing
National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China

Abstract

The details of our method submitted to Task 4 of the DCASE 2019 challenge are described in this technical report. This task evaluates systems for the detection of sound events in domestic environments using large-scale weakly labeled data. In particular, an architecture based on the convolutional recurrent neural network (CRNN) framework is utilized to detect the timestamps of all the events in given audio clips, where the training audio files have only clip-level labels. In order to take advantage of the large-scale unlabeled in-domain training data, an audio tagging system using a deep residual network (ResNeXt) is first employed to predict weak labels for the unlabeled data before the sound event detection process. In addition, a self-adaptive search strategy for the best sound-event thresholds is applied in the model testing process, which is believed to benefit model performance and generalization capability. Finally, the system achieves a class-wise average F1-score of 23.79% for sound event detection on the provided testing dataset.
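The self-adaptive threshold search can be sketched as a per-class grid search; for brevity this illustration scores candidates with a simple frame-level F1 rather than the event-based challenge metric:

```python
import numpy as np

def frame_f1(pred, ref):
    tp = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2 * tp / denom if denom else 1.0

# probs, refs: (n_frames, n_classes) posteriors and binary references on a
# validation set; returns one tuned threshold per class.
def search_thresholds(probs, refs, grid=np.linspace(0.1, 0.9, 17)):
    return [max(grid, key=lambda t, c=c: frame_f1(probs[:, c] > t, refs[:, c]))
            for c in range(probs.shape[1])]
```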

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, delta feature
Classifier CRNN
Decision making mean probability

WEAKLY LABELED SOUND EVENT DETECTION WITH RESIDUAL CRNN USING SEMI-SUPERVISED METHOD

Yan, Jie and Song, Yan
University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China

Abstract

In this report, we present our system for Task 4 of the DCASE 2019 challenge (Sound event detection in domestic environments). The goal of the task is to evaluate systems using real data, either weakly labeled or unlabeled, and simulated data that is strongly labeled. To perform this task, we propose a residual CRNN as our system. We also use a mean-teacher model based on confidence thresholding and a smooth embedding method. In addition, we apply SpecAugment to address the labeled-data shortage problem. Finally, we achieve better performance than the DCASE 2019 baseline system.
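A minimal sketch of SpecAugment-style masking as used here for augmentation; this is a generic illustration, and the mask sizes are placeholders:

```python
import numpy as np

def spec_augment(spec, max_f=8, max_t=20):
    """Apply one frequency mask and one time mask to a (n_mels, n_frames)
    log-mel spectrogram (illustrative mask sizes)."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f0 = np.random.randint(0, n_mels - max_f)           # frequency mask
    spec[f0:f0 + np.random.randint(1, max_f + 1), :] = spec.mean()
    t0 = np.random.randint(0, n_frames - max_t)         # time mask
    spec[:, t0:t0 + np.random.randint(1, max_t + 1)] = spec.mean()
    return spec
```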

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CRNN, mean-teacher student

MEAN TEACHER MODEL BASED ON CMRANN NETWORK FOR SOUND EVENT DETECTION

Yang, Qian and Xia, Jing and Wang, Jinjia
College of Information Science and Engineering, Yanshan University, Qinhuangdao, China

Abstract

This paper proposes an improved mean-teacher model for sound event detection tasks in a domestic environment. The model consists of a CNN network, an ML-LoBCoD network, an RNN network and an attention mechanism. To evaluate our method, we tested it on the DCASE 2019 Challenge Task 4 dataset. The results show that the average F1 score on the 2018 evaluation dataset is 22.7%, and the F1 score on the 2019 validation dataset is 23.4%.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CMRANN-MT, mean-teacher student

AN IMPROVED SYSTEM FOR DCASE 2019 CHALLENGE TASK 4

Zhang, Zhenyuan and Yang, Mingxue and Liu, Li
University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China

Abstract

In this technical report, we present an improved system for DCASE 2019 challenge Task 4, with the goal of evaluating systems for the detection of sound events using real data, either weakly labeled or unlabeled, and simulated data that is strongly labeled. We use multi-scale Mel-spectra as features and perform detection with a 3-layer convolutional neural network (CNN) and a 2-layer recurrent neural network (RNN); after each CNN layer, we apply a ResNet (Residual Neural Network) block to increase the learning depth. Aiming to use data without labels or with weak labels, we apply the mean-teacher model for sound event detection.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN,ResNet,RNN, mean-teacher student