Task description
The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The systems are expected to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data, together with a small weakly annotated training set, to improve system performance. The labels in the annotated subset are verified and can be considered reliable.
A more detailed task description can be found on the task description page.
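Submissions are ranked below by the event-based F-score, macro-averaged over the ten target classes. As a rough illustration of how such a score can be computed, the following sketch uses the open-source sed_eval toolbox; the file names, event lists, label spellings and collar values are illustrative assumptions, not the official evaluation configuration.

```python
# Minimal sketch with the sed_eval toolbox; collar values and label spellings
# are illustrative assumptions, not the official evaluation configuration.
import sed_eval

event_labels = ['Alarm_bell_ringing', 'Blender', 'Cat', 'Dishes', 'Dog',
                'Electric_shaver_toothbrush', 'Frying', 'Running_water',
                'Speech', 'Vacuum_cleaner']

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=event_labels,
    t_collar=0.200,              # onset tolerance in seconds (assumed)
    percentage_of_length=0.2,    # offset tolerance relative to event length (assumed)
)

# Evaluate one clip: ground-truth events vs. system output.
reference = [
    {'file': 'Y_clip.wav', 'event_label': 'Speech', 'event_onset': 1.0, 'event_offset': 2.5},
    {'file': 'Y_clip.wav', 'event_label': 'Dog', 'event_onset': 4.0, 'event_offset': 5.0},
]
estimated = [
    {'file': 'Y_clip.wav', 'event_label': 'Speech', 'event_onset': 1.1, 'event_offset': 2.4},
]
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)

# Macro-averaged (class-wise average) event-based F-score, as in the tables below.
print(metrics.results_class_wise_average_metrics()['f_measure']['f_measure'])
```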
Systems ranking
Submission code | Submission name | Technical Report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
---|---|---|---|---
Avdeeva_ITMO_task4_1 | PPF_system | Avdveeva2018 | 20.1 | 28.1 | |
Avdeeva_ITMO_task4_2 | PPF_system | Avdveeva2018 | 19.5 | 28.1 | |
Wang_NUDT_task4_1 | NUDT-System | WangD2018 | 12.4 | 22.1 | |
Wang_NUDT_task4_2 | NUDT-System | WangD2018 | 12.6 | 22.0 | |
Wang_NUDT_task4_3 | NUDT-System | WangD2018 | 12.0 | 20.5 | |
Wang_NUDT_task4_4 | NUDT-System | WangD2018 | 12.2 | 20.1 | |
Dinkel_SJTU_task4_1 | SJTU-ASR-GRU | Dinkel2018 | 10.4 | 13.4 | |
Dinkel_SJTU_task4_2 | SJTU-ASR-CRNN | Dinkel2018 | 10.7 | 13.7 | |
Dinkel_SJTU_task4_3 | SJTU-ASR-GAUSS | Dinkel2018 | 13.4 | 19.4 | |
Dinkel_SJTU_task4_4 | SJTU-CRNN | Dinkel2018 | 11.2 | 14.9 | |
Guo_THU_task4_1 | THU_multiCRNN | Guo2018 | 21.3 | 29.2 | |
Guo_THU_task4_2 | THU_multiCRNN | Guo2018 | 20.6 | 29.2 | |
Guo_THU_task4_3 | THU_multiCRNN | Guo2018 | 19.1 | 29.2 | |
Guo_THU_task4_4 | THU_multiCRNN | Guo2018 | 19.0 | 29.2 | |
Harb_TUG_task4_1 | Harb_TUG | Harb2018 | 19.4 | 34.6 | |
Harb_TUG_task4_2 | Harb_TUG | Harb2018 | 15.7 | 34.6 | |
Harb_TUG_task4_3 | Harb_TUG | Harb2018 | 21.6 | 34.6 | |
Hou_BUPT_task4_1 | Hou_BUPT_1 | Hou2018 | 19.6 | 32.7 | |
Hou_BUPT_task4_2 | Hou_BUPT_2 | Hou2018 | 18.9 | 30.8 | |
Hou_BUPT_task4_3 | Hou_BUPT_3 | Hou2018 | 20.9 | 33.0 | |
Hou_BUPT_task4_4 | Hou_BUPT_4 | Hou2018 | 21.1 | 31.5 | |
CANCES_IRIT_task4_1 | IRIT_WGRU_GRU_fusion | Cances2018 | 8.4 | 16.3 | |
PELLEGRINI_IRIT_task4_2 | IRIT_MIL | Cances2018 | 16.6 | 24.6 | |
Kothinti_JHU_task4_1 | JHU_T4 | Kothinti2018 | 20.6 | 29.3 | |
Kothinti_JHU_task4_2 | JHU_T4 | Kothinti2018 | 20.9 | 29.8 | |
Kothinti_JHU_task4_3 | JHU_T4 | Kothinti2018 | 20.9 | 24.5 | |
Kothinti_JHU_task4_4 | JHU_T4 | Kothinti2018 | 22.4 | 30.1 | |
Koutini_JKU_task4_1 | JKU_rcnn_threshold | Koutini2018 | 21.5 | 40.9 | |
Koutini_JKU_task4_2 | JKU_rcnn_prec | Koutini2018 | 21.1 | 40.2 | |
Koutini_JKU_task4_3 | JKU_rcnn_prec2 | Koutini2018 | 20.6 | 40.2 | |
Koutini_JKU_task4_4 | JKU_rcnn_uth | Koutini2018 | 18.8 | 35.6 | |
Liu_USTC_task4_1 | USTC_NEL1 | Liu2018 | 27.3 | 42.4 | |
Liu_USTC_task4_2 | USTC_NEL2 | Liu2018 | 28.8 | 47.4 | |
Liu_USTC_task4_3 | USTC_NEL3 | Liu2018 | 28.1 | 50.3 | |
Liu_USTC_task4_4 | USTC_NEL4 | Liu2018 | 29.9 | 51.6 | |
LJK_PSH_task4_1 | LJK_PSH_task4_1 | Lu2018 | 24.1 | 28.6 | |
LJK_PSH_task4_2 | LJK_PSH_task4_2 | Lu2018 | 26.3 | 26.4 | |
LJK_PSH_task4_3 | LJK_PSH_task4_3 | Lu2018 | 29.5 | 27.2 | |
LJK_PSH_task4_4 | LJK_PSH_task4_4 | Lu2018 | 32.4 | 25.9 | |
Moon_YONSEI_task4_1 | Yonsei_str_1 | Moon2018 | 15.9 | 21.6 | |
Moon_YONSEI_task4_2 | Yonsei_str_2 | Moon2018 | 14.3 | 24.3 | |
Raj_IITKGP_task4_1 | Raj_IIT_KGP_Task4_1 | Raj2018 | 9.4 | 21.9 | |
Lim_ETRI_task4_1 | Lim_task4_1 | Lim2018 | 17.1 | 21.9 | |
Lim_ETRI_task4_2 | Lim_task4_2 | Lim2018 | 18.0 | 23.1 | |
Lim_ETRI_task4_3 | Lim_task4_3 | Lim2018 | 19.6 | 28.4 | |
Lim_ETRI_task4_4 | Lim_task4_4 | Lim2018 | 20.4 | 29.3 | |
WangJun_BUPT_task4_2 | BUPT_Attention | WangJ2018 | 17.9 | 27.0 | |
DCASE2018 baseline | Baseline | Serizel2018 | 10.8 | 14.1 | |
Baseline_Surrey_task4_1 | SurreyCNN8 | Kong2018 | 18.6 | 20.8 | |
Baseline_Surrey_task4_2 | SurreyCNN4 | Kong2018 | 16.7 | 20.8 | |
Baseline_Surrey_task4_3 | SurreyFuse | Kong2018 | 24.0 | 26.7 |
Teams ranking
Table including only the best performing system per submitting team.
Submission code | Submission name | Technical Report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
---|---|---|---|---
Avdeeva_ITMO_task4_1 | PPF_system | Avdveeva2018 | 20.1 | 28.1 | |
Wang_NUDT_task4_2 | NUDT-System | WangD2018 | 12.6 | 22.0 | |
Dinkel_SJTU_task4_3 | SJTU-ASR-GAUSS | Dinkel2018 | 13.4 | 19.4 | |
Guo_THU_task4_1 | THU_multiCRNN | Guo2018 | 21.3 | 29.2 | |
Harb_TUG_task4_3 | Harb_TUG | Harb2018 | 21.6 | 34.6 | |
Hou_BUPT_task4_4 | Hou_BUPT_4 | Hou2018 | 21.1 | 31.5 | |
PELLEGRINI_IRIT_task4_2 | IRIT_MIL | Cances2018 | 16.6 | 24.6 | |
Kothinti_JHU_task4_4 | JHU_T4 | Kothinti2018 | 22.4 | 30.1 | |
Koutini_JKU_task4_1 | JKU_rcnn_threshold | Koutini2018 | 21.5 | 40.9 | |
Liu_USTC_task4_4 | USTC_NEL4 | Liu2018 | 29.9 | 51.6 | |
LJK_PSH_task4_4 | LJK_PSH_task4_4 | Lu2018 | 32.4 | 25.9 | |
Moon_YONSEI_task4_1 | Yonsei_str_1 | Moon2018 | 15.9 | 21.6 | |
Raj_IITKGP_task4_1 | Raj_IIT_KGP_Task4_1 | Raj2018 | 9.4 | 21.9 | |
Lim_ETRI_task4_4 | Lim_task4_4 | Lim2018 | 20.4 | 29.3 | |
WangJun_BUPT_task4_2 | BUPT_Attention | WangJ2018 | 17.9 | 27.0 | |
DCASE2018 baseline | Baseline | Serizel2018 | 10.8 | 14.1 | |
Baseline_Surrey_task4_3 | SurreyFuse | Kong2018 | 24.0 | 26.7 |
Class-wise performance
Submission code | Submission name | Technical Report | Event-based F-score (Evaluation dataset) | Alarm/bell/ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Avdeeva_ITMO_task4_1 | PPF_system | Avdveeva2018 | 20.1 | 33.3 | 15.2 | 14.9 | 6.3 | 16.3 | 15.8 | 24.6 | 13.3 | 27.2 | 34.8 | |
Avdeeva_ITMO_task4_2 | PPF_system | Avdveeva2018 | 19.5 | 33.3 | 11.8 | 14.9 | 6.3 | 16.3 | 13.1 | 24.6 | 13.3 | 27.2 | 34.7 | |
Wang_NUDT_task4_1 | NUDT-System | WangD2018 | 12.4 | 6.8 | 14.1 | 2.6 | 0.8 | 2.7 | 29.3 | 20.2 | 11.2 | 1.3 | 35.0 | |
Wang_NUDT_task4_2 | NUDT-System | WangD2018 | 12.6 | 6.7 | 14.4 | 2.5 | 1.1 | 2.6 | 29.7 | 22.0 | 11.1 | 1.3 | 34.0 | |
Wang_NUDT_task4_3 | NUDT-System | WangD2018 | 12.0 | 7.2 | 17.8 | 4.2 | 2.3 | 3.0 | 26.2 | 13.7 | 10.0 | 2.7 | 32.5 | |
Wang_NUDT_task4_4 | NUDT-System | WangD2018 | 12.2 | 7.0 | 18.2 | 3.6 | 2.7 | 3.1 | 27.2 | 13.9 | 10.1 | 2.8 | 33.1 | |
Dinkel_SJTU_task4_1 | SJTU-ASR-GRU | Dinkel2018 | 10.4 | 12.2 | 17.1 | 2.0 | 2.7 | 5.4 | 12.2 | 0.0 | 6.0 | 23.7 | 22.6 | |
Dinkel_SJTU_task4_2 | SJTU-ASR-CRNN | Dinkel2018 | 10.7 | 12.9 | 15.9 | 0.6 | 4.4 | 5.3 | 7.5 | 0.0 | 9.9 | 30.6 | 20.0 | |
Dinkel_SJTU_task4_3 | SJTU-ASR-GAUSS | Dinkel2018 | 13.4 | 20.2 | 19.0 | 0.0 | 14.1 | 11.3 | 9.7 | 0.0 | 3.9 | 39.7 | 16.0 | |
Dinkel_SJTU_task4_4 | SJTU-CRNN | Dinkel2018 | 11.2 | 12.7 | 22.6 | 0.0 | 6.1 | 5.1 | 11.3 | 0.0 | 3.3 | 31.1 | 19.6 | |
Guo_THU_task4_1 | THU_multiCRNN | Guo2018 | 21.3 | 35.3 | 31.8 | 7.8 | 4.0 | 9.9 | 17.4 | 32.7 | 18.3 | 31.0 | 24.8 | |
Guo_THU_task4_2 | THU_multiCRNN | Guo2018 | 20.6 | 35.3 | 19.9 | 6.6 | 4.4 | 10.6 | 13.6 | 36.8 | 13.5 | 35.4 | 29.4 | |
Guo_THU_task4_3 | THU_multiCRNN | Guo2018 | 19.1 | 16.7 | 12.7 | 6.0 | 10.7 | 14.1 | 12.8 | 22.1 | 19.2 | 36.2 | 40.8 | |
Guo_THU_task4_4 | THU_multiCRNN | Guo2018 | 19.0 | 16.5 | 11.8 | 7.0 | 11.3 | 15.1 | 14.2 | 19.9 | 16.8 | 37.9 | 39.2 | |
Harb_TUG_task4_1 | Harb_TUG | Harb2018 | 19.4 | 21.6 | 23.7 | 6.6 | 0.4 | 4.8 | 26.4 | 34.8 | 18.1 | 33.0 | 25.0 | |
Harb_TUG_task4_2 | Harb_TUG | Harb2018 | 15.7 | 14.6 | 20.0 | 7.2 | 15.0 | 10.0 | 9.1 | 14.8 | 13.5 | 33.7 | 19.2 | |
Harb_TUG_task4_3 | Harb_TUG | Harb2018 | 21.6 | 15.4 | 30.0 | 8.1 | 17.5 | 9.7 | 21.0 | 34.7 | 17.3 | 31.1 | 31.5 | |
Hou_BUPT_task4_1 | Hou_BUPT_1 | Hou2018 | 19.6 | 38.6 | 18.4 | 3.5 | 22.2 | 20.4 | 31.5 | 1.4 | 14.4 | 37.6 | 8.5 | |
Hou_BUPT_task4_2 | Hou_BUPT_2 | Hou2018 | 18.9 | 38.9 | 15.0 | 5.7 | 16.5 | 16.5 | 35.1 | 2.0 | 15.5 | 35.4 | 8.7 | |
Hou_BUPT_task4_3 | Hou_BUPT_3 | Hou2018 | 20.9 | 43.8 | 12.2 | 10.0 | 23.4 | 18.3 | 9.2 | 10.9 | 15.6 | 37.3 | 28.4 | |
Hou_BUPT_task4_4 | Hou_BUPT_4 | Hou2018 | 21.1 | 41.4 | 16.4 | 6.4 | 23.5 | 20.2 | 9.8 | 6.2 | 14.0 | 40.6 | 32.3 | |
CANCES_IRIT_task4_1 | IRIT_WGRU_GRU_fusion | Cances2018 | 8.4 | 2.5 | 5.9 | 0.5 | 0.3 | 1.8 | 17.7 | 20.9 | 8.6 | 4.0 | 21.6 | |
PELLEGRINI_IRIT_task4_2 | IRIT_MIL | Cances2018 | 16.6 | 23.8 | 5.1 | 25.3 | 0.7 | 4.1 | 6.5 | 18.3 | 15.0 | 22.3 | 44.9 | |
Kothinti_JHU_task4_1 | JHU_T4 | Kothinti2018 | 20.6 | 36.0 | 13.0 | 20.0 | 13.1 | 24.4 | 22.0 | 0.0 | 10.4 | 34.5 | 32.7 | |
Kothinti_JHU_task4_2 | JHU_T4 | Kothinti2018 | 20.9 | 32.5 | 21.7 | 18.6 | 13.4 | 25.4 | 24.7 | 0.0 | 7.8 | 34.2 | 31.3 | |
Kothinti_JHU_task4_3 | JHU_T4 | Kothinti2018 | 20.9 | 37.2 | 20.4 | 17.8 | 12.4 | 24.5 | 16.9 | 0.0 | 10.4 | 34.0 | 35.1 | |
Kothinti_JHU_task4_4 | JHU_T4 | Kothinti2018 | 22.4 | 36.7 | 22.0 | 20.5 | 12.8 | 26.5 | 24.3 | 0.0 | 9.6 | 34.3 | 37.0 | |
Koutini_JKU_task4_1 | JKU_rcnn_threshold | Koutini2018 | 21.5 | 30.0 | 16.4 | 13.1 | 9.5 | 8.4 | 23.5 | 18.1 | 12.6 | 42.9 | 40.8 | |
Koutini_JKU_task4_2 | JKU_rcnn_prec | Koutini2018 | 21.1 | 30.0 | 15.8 | 13.1 | 9.5 | 8.4 | 23.5 | 17.6 | 12.1 | 42.0 | 39.2 | |
Koutini_JKU_task4_3 | JKU_rcnn_prec2 | Koutini2018 | 20.6 | 30.0 | 15.8 | 12.9 | 9.3 | 8.5 | 22.7 | 16.1 | 12.3 | 40.9 | 37.6 | |
Koutini_JKU_task4_4 | JKU_rcnn_uth | Koutini2018 | 18.8 | 29.2 | 15.1 | 12.6 | 9.5 | 9.4 | 22.1 | 15.2 | 12.2 | 41.1 | 21.4 | |
Liu_USTC_task4_1 | USTC_NEL1 | Liu2018 | 27.3 | 44.2 | 20.7 | 23.1 | 15.2 | 18.1 | 30.6 | 8.7 | 20.8 | 43.3 | 48.8 | |
Liu_USTC_task4_2 | USTC_NEL2 | Liu2018 | 28.8 | 46.0 | 27.1 | 21.6 | 10.8 | 26.5 | 42.0 | 11.0 | 20.9 | 33.5 | 48.6 | |
Liu_USTC_task4_3 | USTC_NEL3 | Liu2018 | 28.1 | 41.7 | 28.4 | 22.9 | 9.2 | 26.7 | 33.3 | 10.3 | 21.6 | 43.1 | 43.9 | |
Liu_USTC_task4_4 | USTC_NEL4 | Liu2018 | 29.9 | 46.0 | 27.1 | 20.3 | 13.0 | 26.5 | 37.6 | 10.9 | 23.9 | 43.1 | 50.0 | |
LJK_PSH_task4_1 | LJK_PSH_task4_1 | Lu2018 | 24.1 | 23.1 | 32.6 | 1.2 | 0.0 | 5.0 | 51.4 | 36.0 | 30.4 | 14.0 | 46.7 | |
LJK_PSH_task4_2 | LJK_PSH_task4_2 | Lu2018 | 26.3 | 25.1 | 36.1 | 1.9 | 0.4 | 3.1 | 52.1 | 42.4 | 36.2 | 16.7 | 49.1 | |
LJK_PSH_task4_3 | LJK_PSH_task4_3 | Lu2018 | 29.5 | 48.0 | 30.4 | 2.3 | 3.7 | 20.1 | 46.8 | 29.4 | 27.9 | 41.4 | 44.6 | |
LJK_PSH_task4_4 | LJK_PSH_task4_4 | Lu2018 | 32.4 | 49.9 | 38.2 | 3.6 | 3.2 | 18.1 | 48.7 | 35.4 | 31.2 | 46.8 | 48.3 | |
Moon_YONSEI_task4_1 | Yonsei_str_1 | Moon2018 | 15.9 | 26.3 | 14.0 | 9.8 | 6.3 | 15.7 | 10.4 | 8.7 | 11.0 | 29.6 | 27.5 | |
Moon_YONSEI_task4_2 | Yonsei_str_2 | Moon2018 | 14.3 | 17.8 | 14.9 | 8.1 | 2.0 | 10.3 | 14.6 | 13.7 | 12.7 | 17.3 | 31.7 | |
Raj_IITKGP_task4_1 | Raj_IIT_KGP_Task4_1 | Raj2018 | 9.4 | 5.1 | 7.2 | 1.0 | 0.3 | 2.3 | 15.9 | 20.4 | 6.6 | 0.3 | 34.9 | |
Lim_ETRI_task4_1 | Lim_task4_1 | Lim2018 | 17.1 | 10.0 | 20.8 | 4.8 | 0.6 | 6.2 | 29.1 | 18.3 | 16.4 | 11.2 | 53.1 | |
Lim_ETRI_task4_2 | Lim_task4_2 | Lim2018 | 18.0 | 12.9 | 22.5 | 4.9 | 0.6 | 7.0 | 30.5 | 19.7 | 16.5 | 11.9 | 53.2 | |
Lim_ETRI_task4_3 | Lim_task4_3 | Lim2018 | 19.6 | 10.2 | 20.5 | 6.8 | 5.9 | 16.9 | 25.4 | 13.5 | 13.2 | 20.2 | 63.3 | |
Lim_ETRI_task4_4 | Lim_task4_4 | Lim2018 | 20.4 | 11.6 | 21.6 | 7.9 | 5.9 | 17.4 | 27.8 | 14.9 | 15.5 | 21.0 | 60.0 | |
WangJun_BUPT_task4_2 | BUPT_Attention | WangJ2018 | 17.9 | 40.3 | 14.5 | 19.0 | 6.1 | 4.6 | 18.6 | 20.4 | 18.3 | 26.0 | 11.3 | |
DCASE2018 baseline | Baseline | Serizel2018 | 10.8 | 4.8 | 12.7 | 2.9 | 0.4 | 2.4 | 20.0 | 24.5 | 10.1 | 0.1 | 30.2 | |
Baseline_Surrey_task4_1 | SurreyCNN8 | Kong2018 | 18.6 | 6.0 | 18.9 | 2.4 | 0.0 | 3.6 | 46.4 | 43.6 | 15.2 | 0.0 | 50.0 | |
Baseline_Surrey_task4_2 | SurreyCNN4 | Kong2018 | 16.7 | 5.5 | 16.3 | 2.5 | 0.0 | 4.0 | 44.1 | 42.5 | 13.5 | 0.0 | 38.8 | |
Baseline_Surrey_task4_3 | SurreyFuse | Kong2018 | 24.0 | 24.5 | 18.9 | 7.8 | 7.7 | 5.6 | 46.4 | 43.6 | 15.2 | 19.9 | 50.0 |
System characteristics
General characteristics
Code | Technical Report | Event-based F-score (Eval) | Input | Sampling rate | Data augmentation | Features
---|---|---|---|---|---|---
Avdeeva_ITMO_task4_1 | Avdveeva2018 | 20.1 | mono | 16kHz | time stretching, pitch shifting | log-mel energies | |
Avdeeva_ITMO_task4_2 | Avdveeva2018 | 19.5 | mono | 16kHz | time stretching, pitch shifting | log-mel energies | |
Wang_NUDT_task4_1 | WangD2018 | 12.4 | mono | 44.1kHz | mixup | log-mel energies, delta features | |
Wang_NUDT_task4_2 | WangD2018 | 12.6 | mono | 44.1kHz | mixup | log-mel energies, delta features | |
Wang_NUDT_task4_3 | WangD2018 | 12.0 | mono | 44.1kHz | mixup | log-mel energies, delta features | |
Wang_NUDT_task4_4 | WangD2018 | 12.2 | mono | 44.1kHz | mixup | log-mel energies, delta features | |
Dinkel_SJTU_task4_1 | Dinkel2018 | 10.4 | mono | 44.1kHz | | MFCC, log-mel energies |
Dinkel_SJTU_task4_2 | Dinkel2018 | 10.7 | mono | 44.1kHz | | MFCC, log-mel energies |
Dinkel_SJTU_task4_3 | Dinkel2018 | 13.4 | mono | 44.1kHz | | MFCC, log-mel energies |
Dinkel_SJTU_task4_4 | Dinkel2018 | 11.2 | mono | 44.1kHz | | MFCC, log-mel energies |
Guo_THU_task4_1 | Guo2018 | 21.3 | mono | 44.1kHz | | log-mel energies |
Guo_THU_task4_2 | Guo2018 | 20.6 | mono | 44.1kHz | | log-mel energies |
Guo_THU_task4_3 | Guo2018 | 19.1 | mono | 44.1kHz | | log-mel energies |
Guo_THU_task4_4 | Guo2018 | 19.0 | mono | 44.1kHz | | log-mel energies |
Harb_TUG_task4_1 | Harb2018 | 19.4 | mono | 44.1kHz | | log-mel energies |
Harb_TUG_task4_2 | Harb2018 | 15.7 | mono | 44.1kHz | | log-mel energies |
Harb_TUG_task4_3 | Harb2018 | 21.6 | mono | 44.1kHz | | log-mel energies |
Hou_BUPT_task4_1 | Hou2018 | 19.6 | mono | 16kHz | | log-mel energies |
Hou_BUPT_task4_2 | Hou2018 | 18.9 | mono | 16kHz | | log-mel energies |
Hou_BUPT_task4_3 | Hou2018 | 20.9 | mono | 16kHz | | log-mel energies |
Hou_BUPT_task4_4 | Hou2018 | 21.1 | mono | 16kHz | | log-mel energies |
CANCES_IRIT_task4_1 | Cances2018 | 8.4 | mono | 44.1kHz | | log-mel energies |
PELLEGRINI_IRIT_task4_2 | Cances2018 | 16.6 | mono | 44.1kHz | | log-mel energies |
Kothinti_JHU_task4_1 | Kothinti2018 | 20.6 | mono | 44.1kHz | | log-mel energies, auditory spectrogram |
Kothinti_JHU_task4_2 | Kothinti2018 | 20.9 | mono | 44.1kHz | | log-mel energies, auditory spectrogram |
Kothinti_JHU_task4_3 | Kothinti2018 | 20.9 | mono | 44.1kHz | | log-mel energies, auditory spectrogram |
Kothinti_JHU_task4_4 | Kothinti2018 | 22.4 | mono | 44.1kHz | | log-mel energies, auditory spectrogram |
Koutini_JKU_task4_1 | Koutini2018 | 21.5 | mono | 44.1kHz | | log-mel energies |
Koutini_JKU_task4_2 | Koutini2018 | 21.1 | mono | 44.1kHz | | log-mel energies |
Koutini_JKU_task4_3 | Koutini2018 | 20.6 | mono | 44.1kHz | | log-mel energies |
Koutini_JKU_task4_4 | Koutini2018 | 18.8 | mono | 44.1kHz | | log-mel energies |
Liu_USTC_task4_1 | Liu2018 | 27.3 | mono | 44.1kHz | | log-mel energies |
Liu_USTC_task4_2 | Liu2018 | 28.8 | mono | 44.1kHz | | log-mel energies |
Liu_USTC_task4_3 | Liu2018 | 28.1 | mono | 44.1kHz | | log-mel energies |
Liu_USTC_task4_4 | Liu2018 | 29.9 | mono | 44.1kHz | | log-mel energies |
LJK_PSH_task4_1 | Lu2018 | 24.1 | mono | 22.05kHz | | log-mel energies |
LJK_PSH_task4_2 | Lu2018 | 26.3 | mono | 22.05kHz | | log-mel energies |
LJK_PSH_task4_3 | Lu2018 | 29.5 | mono | 22.05kHz | | log-mel energies |
LJK_PSH_task4_4 | Lu2018 | 32.4 | mono | 22.05kHz | | log-mel energies |
Moon_YONSEI_task4_1 | Moon2018 | 15.9 | mono | 22.05kHz | time stretching, pitch shifting, block mixing, DRC | raw waveforms | |
Moon_YONSEI_task4_2 | Moon2018 | 14.3 | mono | 22.05kHz | time stretching, pitch shifting, block mixing, DRC | raw waveforms | |
Raj_IITKGP_task4_1 | Raj2018 | 9.4 | mono | 44.1kHz | | CQT |
Lim_ETRI_task4_1 | Lim2018 | 17.1 | mono | 16kHz | time stretching, pitch shifting, reversing | log-mel energies | |
Lim_ETRI_task4_2 | Lim2018 | 18.0 | mono | 16kHz | time stretching, pitch shifting, reversing | log-mel energies | |
Lim_ETRI_task4_3 | Lim2018 | 19.6 | mono | 16kHz | time stretching, pitch shifting, reversing | log-mel energies | |
Lim_ETRI_task4_4 | Lim2018 | 20.4 | mono | 16kHz | time stretching, pitch shifting, reversing | log-mel energies | |
WangJun_BUPT_task4_2 | WangJ2018 | 17.9 | mono | 44.1kHz | | log-mel energies |
DCASE2018 baseline | Serizel2018 | 10.8 | mono | 44.1kHz | | log-mel energies |
Baseline_Surrey_task4_1 | Kong2018 | 18.6 | mono | 32kHz | | log-mel energies |
Baseline_Surrey_task4_2 | Kong2018 | 16.7 | mono | 32kHz | | log-mel energies |
Baseline_Surrey_task4_3 | Kong2018 | 24.0 | mono | 32kHz | | log-mel energies |
Machine learning characteristics
Code | Technical Report | Event-based F-score (Eval) | Model complexity | Classifier | Ensemble subsystems | Decision making
---|---|---|---|---|---|---
Avdeeva_ITMO_task4_1 | Avdveeva2018 | 20.1 | 200242 | CRNN, CNN | 2 | hierarchical | |
Avdeeva_ITMO_task4_2 | Avdveeva2018 | 19.5 | 200242 | CRNN, CNN | 2 | hierarchical | |
Wang_NUDT_task4_1 | WangD2018 | 12.4 | 24210492 | CRNN | 3 | mean probability | |
Wang_NUDT_task4_2 | WangD2018 | 12.6 | 24210492 | CRNN | 3 | mean probability | |
Wang_NUDT_task4_3 | WangD2018 | 12.0 | 24210492 | CRNN | 3 | mean probability | |
Wang_NUDT_task4_4 | WangD2018 | 12.2 | 24210492 | CRNN | 3 | mean probability | |
Dinkel_SJTU_task4_1 | Dinkel2018 | 10.4 | 1781259 | HMM-GMM, GRU | |||
Dinkel_SJTU_task4_2 | Dinkel2018 | 10.7 | 126219 | HMM-GMM, CRNN | |||
Dinkel_SJTU_task4_3 | Dinkel2018 | 13.4 | 126219 | HMM-GMM, CRNN | |||
Dinkel_SJTU_task4_4 | Dinkel2018 | 11.2 | 126090 | CRNN | |||
Guo_THU_task4_1 | Guo2018 | 21.3 | 970644 | multi-scale CRNN | 2 | ||
Guo_THU_task4_2 | Guo2018 | 20.6 | 970644 | multi-scale CRNN | 2 | ||
Guo_THU_task4_3 | Guo2018 | 19.1 | 970644 | multi-scale CRNN | 2 | ||
Guo_THU_task4_4 | Guo2018 | 19.0 | 970644 | multi-scale CRNN | 2 | ||
Harb_TUG_task4_1 | Harb2018 | 19.4 | 497428 | CRNN, VAT | |||
Harb_TUG_task4_2 | Harb2018 | 15.7 | 497428 | CRNN, VAT | |||
Harb_TUG_task4_3 | Harb2018 | 21.6 | 497428 | CRNN, VAT | |||
Hou_BUPT_task4_1 | Hou2018 | 19.6 | 1166484 | CRNN | |||
Hou_BUPT_task4_2 | Hou2018 | 18.9 | 1166484 | CRNN | |||
Hou_BUPT_task4_3 | Hou2018 | 20.9 | 1166484 | CRNN | |||
Hou_BUPT_task4_4 | Hou2018 | 21.1 | 1166484 | CRNN | |||
CANCES_IRIT_task4_1 | Cances2018 | 8.4 | 126090 | CRNN | |||
PELLEGRINI_IRIT_task4_2 | Cances2018 | 16.6 | 1040724 | CNN, CRNN with Multi-Instance Learning | |||
Kothinti_JHU_task4_1 | Kothinti2018 | 20.6 | 1540854 | CRNN, RBM, cRBM, PCA | |||
Kothinti_JHU_task4_2 | Kothinti2018 | 20.9 | 1540854 | CRNN, RBM, cRBM, PCA | |||
Kothinti_JHU_task4_3 | Kothinti2018 | 20.9 | 1189290 | CRNN, RBM, cRBM, PCA | |||
Kothinti_JHU_task4_4 | Kothinti2018 | 22.4 | 1540854 | CRNN, RBM, cRBM, PCA | |||
Koutini_JKU_task4_1 | Koutini2018 | 21.5 | 126090 | CRNN | |||
Koutini_JKU_task4_2 | Koutini2018 | 21.1 | 126090 | CRNN | |||
Koutini_JKU_task4_3 | Koutini2018 | 20.6 | 126090 | CRNN | |||
Koutini_JKU_task4_4 | Koutini2018 | 18.8 | 126090 | CRNN | |||
Liu_USTC_task4_1 | Liu2018 | 27.3 | 3478026 | Capsule-RNN, ensemble | 8 | dynamic threshold | |
Liu_USTC_task4_2 | Liu2018 | 28.8 | 534460 | Capsule-RNN, ensemble | 2 | dynamic threshold | |
Liu_USTC_task4_3 | Liu2018 | 28.1 | 4012486 | Capsule-RNN, CRNN, ensemble | 9 | dynamic threshold | |
Liu_USTC_task4_4 | Liu2018 | 29.9 | 4012486 | Capsule-RNN, CRNN, ensemble | 10 | dynamic threshold | |
LJK_PSH_task4_1 | Lu2018 | 24.1 | 1382246 | CRNN | 4 | mean probabilities | |
LJK_PSH_task4_2 | Lu2018 | 26.3 | 1382246 | CRNN | 2 | mean probabilities | |
LJK_PSH_task4_3 | Lu2018 | 29.5 | 1382246 | CRNN | |||
LJK_PSH_task4_4 | Lu2018 | 32.4 | 1382246 | CRNN | |||
Moon_YONSEI_task4_1 | Moon2018 | 15.9 | 10902218 | GLU, Bi-RNN, ResNet, SENet, Multi-level | |||
Moon_YONSEI_task4_2 | Moon2018 | 14.3 | 10902218 | GLU, Bi-RNN, ResNet, SENet, Multi-level | |||
Raj_IITKGP_task4_1 | Raj2018 | 9.4 | 215890 | CRNN | |||
Lim_ETRI_task4_1 | Lim2018 | 17.1 | 239338 | CRNN | |||
Lim_ETRI_task4_2 | Lim2018 | 18.0 | 239338 | CRNN | |||
Lim_ETRI_task4_3 | Lim2018 | 19.6 | 239338 | CRNN | |||
Lim_ETRI_task4_4 | Lim2018 | 20.4 | 239338 | CRNN | |||
WangJun_BUPT_task4_2 | WangJ2018 | 17.9 | 1263508 | RNN,BGRU,self-attention | |||
DCASE2018 baseline | Serizel2018 | 10.8 | 126090 | CRNN | |||
Baseline_Surrey_task4_1 | Kong2018 | 18.6 | 4691274 | VGGish 8 layer CNN with global max pooling | |||
Baseline_Surrey_task4_2 | Kong2018 | 16.7 | 4309450 | AlexNetish 4 layer CNN with global max pooling | |||
Baseline_Surrey_task4_3 | Kong2018 | 24.0 | 4691274 | VGGish 8 layer CNN with global max pooling, fuse SED and non-SED |
Technical reports
Sound Event Detection Using Weakly Labeled Dataset with Convolutional Recurrent Neural Network
Avdeeva, Anastasia and Agafonov, Iurii
Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia
Avdeeva_ITMO_task4_1 Avdeeva_ITMO_task4_2
Abstract
In this paper, a sound event detection system is proposed. The system uses a fusion of a CNN classifier and a CRNN segmenter.
System characteristics
Input | mono |
Sampling rate | 16kHz |
Data augmentation | time stretching, pitch shifting |
Features | log-mel energies |
Classifier | CRNN, CNN |
Decision making | hierarchical |
SOUND EVENT DETECTION FROM WEAK ANNOTATIONS: WEIGHTED GRU VERSUS MULTI-INSTANCE LEARNING
Cances, Léo and Pellegrini, Thomas and Guyot, Patrice
Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France
CANCES_IRIT_task4_1 PELLEGRINI_IRIT_task4_2
Abstract
In this paper, we address the detection of audio events in domestic environments in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but do not provide temporal boundaries. We report experiments in the framework of Task 4 of the DCASE 2018 challenge. The objective is twofold: detect audio events (multi-categorical classification at recording level) and localize the events precisely within the recordings. We explored two approaches: 1) a "weighted-GRU" (WGRU), in which we train a Convolutional Recurrent Neural Network (CRNN) for classification and then exploit its frame-based predictions at the output of the time-distributed dense layer to perform localization. We propose to lower the influence of the hidden states to avoid predicting the same score throughout a recording. 2) An approach inspired by Multi-Instance Learning (MIL), in which we train a CRNN to give predictions at frame level, using a custom loss function based on the weak label and statistics of the frame-based predictions. Both approaches outperform the baseline of 14.06% in F-measure by a large margin, with values of 16.77% and 24.58% for combined WGRUs and MIL, respectively, on a test set comprising 288 recordings.
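The custom MIL loss itself is not detailed in this listing; the sketch below only illustrates the general idea of training on weak labels through frame-level predictions (PyTorch, with max-pooling over time as an assumed aggregation statistic, which may differ from the authors' actual choice).

```python
import torch
import torch.nn.functional as F

def weak_label_mil_loss(frame_logits, weak_targets):
    """MIL-style loss for weak labels: aggregate frame predictions to clip level.

    frame_logits: (batch, time, classes) frame-level logits from a CRNN.
    weak_targets: (batch, classes) clip-level 0/1 tags.
    Max-pooling over time is an illustrative aggregation; the IRIT system uses
    its own statistics of the frame-based predictions.
    """
    frame_probs = torch.sigmoid(frame_logits)       # frame-level probabilities
    clip_probs = frame_probs.max(dim=1).values      # clip-level prediction per class
    return F.binary_cross_entropy(clip_probs, weak_targets)

# Toy usage: 4 clips, 500 frames, 10 classes
frame_logits = torch.randn(4, 500, 10, requires_grad=True)
weak_tags = torch.randint(0, 2, (4, 10)).float()
loss = weak_label_mil_loss(frame_logits, weak_tags)
loss.backward()
```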
System characteristics
Input | mono |
Sampling rate | 16kHz |
Features | log-mel energies |
Classifier | CRNN |
A HYBRID ASR MODEL APPROACH ON WEAKLY LABELED SCENE CLASSIFICATION
Dinkel, Heinrich and Qian, Yanmin and Yu, Kai
Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Dinkel_SJTU_task4_1 Dinkel_SJTU_task4_2 Dinkel_SJTU_task4_3 Dinkel_SJTU_task4_4
Abstract
This paper presents our submission to Task 4 of the DCASE 2018 challenge. Our approach focuses on refining the training labels by using an HMM-GMM to obtain a frame-wise alignment from the clip-wise labels. We then train a convolutional recurrent neural network (CRNN), as well as a single gated recurrent neural network, on those labels in standard cross-entropy fashion. Our approach utilizes a "blank" state which is treated as a junk collector for all uninteresting events. Moreover, Gaussian posterior filtering is introduced in order to enhance the connectivity between segments. Compared to the baseline result, the proposed framework significantly enhances the model's capability to detect short, impulsively occurring events such as speech, dog, dishes and alarm. Our best submission on the test set is a CRNN model with Gaussian posterior filtering, resulting in a 19.37% macro-average, as well as a 24.41% micro-average F-score.
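Gaussian posterior filtering as described above amounts to smoothing the frame-level class posteriors along time before thresholding them into segments. A minimal sketch, assuming illustrative values for the filter width, threshold and frame hop:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_and_segment(frame_posteriors, sigma=10, threshold=0.5, hop_seconds=0.02):
    """Gaussian-smooth per-class frame posteriors, then threshold into events.

    frame_posteriors: (time, classes) array of per-frame probabilities.
    sigma, threshold and hop_seconds are illustrative values only.
    Returns a list of (class_index, onset_seconds, offset_seconds).
    """
    smoothed = gaussian_filter1d(frame_posteriors, sigma=sigma, axis=0)
    active = smoothed > threshold
    events = []
    for c in range(active.shape[1]):
        onset = None
        for t in range(active.shape[0]):
            if active[t, c] and onset is None:
                onset = t
            elif not active[t, c] and onset is not None:
                events.append((c, onset * hop_seconds, t * hop_seconds))
                onset = None
        if onset is not None:  # event runs until the end of the clip
            events.append((c, onset * hop_seconds, active.shape[0] * hop_seconds))
    return events

posteriors = np.random.rand(500, 10)  # 500 frames, 10 classes
print(smooth_and_segment(posteriors)[:3])
```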
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, log-mel energies |
Classifier | HMM-GMM, GRU, CRNN |
MULTI-SCALE CONVOLUTIONAL RECURRENT NEURAL NETWORK WITH ENSEMBLE METHOD FOR WEAKLY LABELED SOUND EVENT DETECTION
Guo, Yingmei and Xu, Mingxing and Wu, Jianming and Wang, Yanan and Hoashi, Keiichiro
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Guo_THU_task4_1 Guo_THU_task4_2 Guo_THU_task4_3 Guo_THU_task4_4
Abstract
In this paper, we describe our contribution to the 2018 challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018). We propose the multi-scale convolutional recurrent neural network (multi-scale CRNN), a novel weakly-supervised learning framework for sound event detection. By integrating information from different time resolutions, the multi-scale method can capture both the fine-grained and coarse-grained features of sound events and model their temporal dependencies, including fine-grained dependencies and long-term dependencies. A CRNN using learnable gated linear units (GLUs) can also help to select the features most related to the audio labels. Furthermore, the ensemble method proposed in the paper helps to correct frame-level prediction errors using the clip-level classification results, since identifying the sound events that occur in an audio clip is much easier than providing the event time boundaries. The proposed method achieves a 29.2% event-based F1-score and a 1.40 event-based error rate on the development set of DCASE 2018 Task 4, compared to the baseline's 14.1% F-value and 1.54 error rate.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | multi-scale CRNN |
SOUND EVENT DETECTION USING WEAKLY LABELED SEMI-SUPERVISED DATA WITH GCRNNS, VAT AND SELF-ADAPTIVE LABEL REFINEMENT
Harb, Robert and Pernkopf, Franz
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria
Harb_TUG_task4_1 Harb_TUG_task4_2 Harb_TUG_task4_3
Abstract
In this paper, we present a gated convolutional recurrent neural network based approach to solve Task 4 of the DCASE 2018 challenge, large-scale weakly labeled semi-supervised sound event detection in domestic environments. Gated linear units and a temporal attention layer are used to predict the onset and offset of sound events in 10-second audio clips, whereby only weakly labeled data is used for training. Virtual adversarial training is used for regularization, utilizing both labeled and unlabeled data. Furthermore, we introduce self-adaptive label refinement, a method which allows unsupervised adaptation of our trained system to increase the quality of frame-level class predictions. The proposed system reaches an overall macro-averaged event-based F-score of 34.6%, an improvement of 20.5 percentage points over the baseline system.
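Virtual adversarial training regularizes the network by penalizing changes in its output under a small worst-case input perturbation, which also lets the unlabeled clips contribute to training. The sketch below is a generic single-power-iteration variant adapted to multi-label sigmoid outputs; the original formulation uses a KL divergence on class distributions, and the perturbation scales here are assumptions rather than the submitted settings.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0):
    """Single power-iteration VAT regularizer for a multi-label (sigmoid) model.

    model: maps a batch of features (batch, ...) to logits (batch, classes).
    xi, eps: perturbation scales (illustrative assumptions).
    """
    with torch.no_grad():
        target = torch.sigmoid(model(x))                 # current predictions

    d = torch.randn_like(x)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
    d.requires_grad_(True)

    adv_pred = torch.sigmoid(model(x + d))
    dist = F.binary_cross_entropy(adv_pred, target)      # divergence proxy
    grad = torch.autograd.grad(dist, d)[0]

    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)
    adv_pred = torch.sigmoid(model(x + r_adv))
    return F.binary_cross_entropy(adv_pred, target)

# Usage sketch: add vat_loss(model, unlabeled_batch) to the supervised loss term.
```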
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN, VAT |
Semi-supervised sound event detection with convolutional recurrent neural network using weakly labelled data
Hou, Yuanbo and Li, Shengchen
Institute of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing, China
Hou_BUPT_task4_1 Hou_BUPT_task4_2 Hou_BUPT_task4_3 Hou_BUPT_task4_4
Abstract
In this technical report, we present a polyphonic sound event detection (SED) system based on a convolutional recurrent neural network for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our framework. We use a learnable gating activation function for selecting informative local features. In summary, we obtain a 32.95% F-value and a 1.34 error rate (ER) for SED on the development set, while the baseline obtains only a 14.06% F-value and a 1.54 ER.
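The learnable gating activation mentioned above is typically a gated linear unit: one convolution's output is multiplied by the sigmoid of a second, parallel convolution, so the network learns which time-frequency regions to pass through. A minimal PyTorch sketch (channel counts and kernel size are placeholders, not the submitted configuration):

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """Convolution with a learnable gate: out = conv_a(x) * sigmoid(conv_b(x))."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.linear = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate(x))

# Example: a batch of 4 log-mel "images" (1 channel, 500 frames, 64 mel bands)
x = torch.randn(4, 1, 500, 64)
block = GLUConv2d(1, 32)
print(block(x).shape)  # torch.Size([4, 32, 500, 64])
```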
System characteristics
Input | mono |
Sampling rate | 16kHz |
Features | log-mel energies |
Classifier | CRNN |
DCASE 2018 Challenge Baseline with Convolutional Neural Networks
Kong, Qiuqiang and Iqbal, Turab and Xu, Yong and Wang, Wenwu and Plumbley, Mark D.
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Baseline_Surrey_task4_1 Baseline_Surrey_task4_2 Baseline_Surrey_task4_3
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly labeled semi-supervised sound event detection and 5) multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and VGGish networks from the image processing area. We investigated how performance varies from task to task when the configuration of the neural networks is the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, while on Task 1 the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4 and an F1 score of 87.75% on Task 5.
System characteristics
Input | mono |
Sampling rate | 32kHz |
Features | log-mel energies |
Classifier | VGGish 8 layer CNN with global max pooling; AlexNetish 4 layer CNN with global max pooling |
JOINT ACOUSTIC AND CLASS INFERENCE FOR WEAKLY SUPERVISED SOUND EVENT DETECTION
Kothinti, Sandeep and Imoto, Keisuke and Chakrabarty, Debmalya and Sell, Gregory and Watanabe, Shinji and Elhilali, Mounya
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA
Kothinti_JHU_task4_1 Kothinti_JHU_task4_2 Kothinti_JHU_task4_3 Kothinti_JHU_task4_4
Abstract
Sound event detection is a challenging task, especially for scenes with simultaneous presence of multiple events. Task 4 of the 2018 DCASE challenge presents an event detection task that requires accuracy in both segmentation and recognition of events. Supervised methods produce accurate event labels but are limited in event segmentation when the training data lacks event timestamps. On the other hand, unsupervised methods that model acoustic properties of the audio can produce accurate event boundaries but are not guided by the characteristics of event classes and sound categories. In this report, we present a hybrid approach that combines acoustic-driven event boundary detection and supervised label inference using a deep neural network. This framework leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it ideal for large-scale weakly labeled event detection. Compared to a baseline system, the proposed approach delivers a 15% absolute improvement in F1-score, demonstrating the benefits of the hybrid bottom-up, top-down approach.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies, auditory spectrogram |
Classifier | CRNN, RBM, cRBM, PCA |
ITERATIVE KNOWLEDGE DISTILLATION IN R-CNNS FOR WEAKLY-LABELED SEMI-SUPERVISED SOUND EVENT DETECTION
Koutini, Khaled and Eghbal-zadeh, Hamid and Widmer, Gerhard
Institute of Computational Perception, Johannes Kepler University, Linz, Austria
Koutini_JKU_task4_1 Koutini_JKU_task4_2 Koutini_JKU_task4_3 Koutini_JKU_task4_4
Abstract
In this technical report, we present the approach used for the CP-JKU submission in Task 4 of the DCASE 2018 challenge. We propose a novel iterative knowledge distillation technique for weakly-labeled semi-supervised event detection using neural networks, specifically recurrent convolutional neural networks (R-CNNs). The R-CNNs are used to tag the unlabeled data and predict strong labels. Further, we use the R-CNNs' strong pseudo-labels on the training datasets and train new models after applying label-smoothing techniques to the strong pseudo-labels. Our proposed approach significantly improves on the baseline, achieving an event-based F-measure of 40.86% compared to the baseline's 15.11% on the provided test set from the development dataset.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
WEAKLY LABELED SEMI-SUPERVISED SOUND EVENT DETECTION USING CRNN WITH INCEPTION MODULE
Lim, Wootaek and Suh, Sangwon and Jeong, Youngho
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea
Lim_task4_1 Lim_task4_2 Lim_task4_3 Lim_task4_4
Abstract
In this paper, we present a method for the large-scale detection of sound events using a small amount of weakly labeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge Task 4. To perform this task, we adopted a convolutional neural network (CNN) and a gated recurrent unit (GRU) based bidirectional recurrent neural network (RNN) as our proposed system. In addition, we introduced an Inception module for handling various receptive fields at once in each CNN layer. We also applied data augmentation to address the shortage of labeled data and used an event activity detection method for strong-label learning. Applying the proposed method to weakly labeled semi-supervised sound event detection, we verified that the proposed system provides better detection performance than the DCASE 2018 baseline system.
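An Inception-style block handles several receptive fields at once by running parallel convolutions with different kernel sizes on the same input and concatenating their outputs along the channel axis. A minimal sketch (branch widths and kernel sizes are placeholders, not the authors' configuration):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel convolutions with different receptive fields, concatenated."""

    def __init__(self, in_channels, branch_channels=16):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

x = torch.randn(4, 1, 500, 64)       # batch of log-mel spectrograms
print(InceptionBlock(1)(x).shape)    # torch.Size([4, 48, 500, 64])
```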
System characteristics
Input | mono |
Sampling rate | 16kHz |
Data augmentation | time stretching, pitch shifting, reversing |
Features | log-mel energies |
Classifier | CRNN |
USTC-NELSLIP SYSTEM FOR DCASE 2018 CHALLENGE TASK 4
Liu, Yaming and Yan, Jie and Song, Yan and Du, Jun
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
Liu_USTC_task4_1 Liu_USTC_task4_2 Liu_USTC_task4_3 Liu_USTC_task4_4
Abstract
In this technical report, we present a group of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 challenge (DCASE 2018). This task aims to detect sound events in domestic environments using weakly labeled training data in a semi-supervised way. In this report, an event activity detection technique is first performed to transform weak labels into strong labels before training. Then a capsule-based method and a gated convolutional neural network (CNN) are used to estimate event activity probabilities, respectively. Finally, the event activity probabilities of the two systems are combined to obtain the final sound event detection (SED) estimation. In addition, a tagging model based on the proposed CNN is used to tag the unlabeled in-domain training set; data with high confidence are added to the training data to further improve performance. Experiments on the validation dataset show that the proposed approach obtains an F1-score of 51.60% and an error rate of 0.93, outperforming the baseline's 14.06% and 1.54.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | Capsule-RNN, ensemble |
Decision making | dynamic threshold |
MEAN TEACHER CONVOLUTION SYSTEM FOR DCASE 2018 TASK 4
Lu, JiaKai
1T5K, PFU SHANGHAI Co., LTD, Shanghai, China
LJK_PSH_task4_1 LJK_PSH_task4_2 LJK_PSH_task4_3 LJK_PSH_task4_4
Abstract
In this paper, we present our neural network for Task 4 of the DCASE 2018 challenge (large-scale weakly labeled semi-supervised sound event detection in domestic environments). This task evaluates systems for the large-scale detection of sound events using weakly labeled data and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data, together with a small weakly annotated training set, to improve system performance on audio tagging and sound event detection. We propose a mean-teacher model with a context-gating convolutional neural network (CNN) and a recurrent neural network (RNN) to maximize the use of the unlabeled in-domain dataset.
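In a mean-teacher setup, the teacher model's weights are an exponential moving average (EMA) of the student's weights, and a consistency loss pulls the student's predictions on unlabeled clips towards the teacher's. A minimal sketch of the EMA update (the decay value is an assumption, not the submitted setting):

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, ema_decay=0.999):
    """Exponential-moving-average update of the teacher model (mean teacher).

    After each optimization step on the student, the teacher weights become
    teacher = ema_decay * teacher + (1 - ema_decay) * student.
    The decay value is an illustrative assumption.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

# Usage sketch: the consistency term compares student and teacher outputs on the
# unlabeled in-domain clips, e.g. ((student(x) - teacher(x).detach()) ** 2).mean()
```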
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean probabilities |
End-to-end CRNN Architectures for Weakly Supervised Sound Event Detection
Moon, Hyeongi and Byun, Joon and Kim, Bum-Jun and Jeon, Shin-hyuk and Jeong, Youngho and Park, Young-cheol and Park, Sung-wook
School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea
Yonsei_str_1 Yonsei_str_2
Abstract
This report describes our approach for large-scale weakly labeled semi-supervised sound event detection in domestic environments (Task 4) of DCASE 2018. Our structure is based on a Convolutional Recurrent Neural Network (CRNN) using raw waveforms. The conventional Convolutional Neural Network (CNN) is modified to adopt Gated Linear Units (GLU), ResNet, and Squeeze-and-Excitation (SE) networks. Three Recurrent Neural Networks (RNNs) follow; each RNN receives features from a different layer, and the outputs of the RNNs are concatenated for final classification by feed-forward (FC) layers. A simple data augmentation method is also applied to augment the small amount of labeled data. With this approach, an F1-score improvement of 5.5% is achieved.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | time stretching, pitch shifting, block mixing, DRC |
Features | raw waveforms |
Classifier | GLU, Bi-RNN, ResNet, SENet, Multi-level |
LARGE-SCALE WEAKLY LABELLED SEMI-SUPERVISED CQT BASED SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS
Raj, Rojin and Waldekar, Shefali and Saha, Goutam
Electronics and Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India
Raj_IITKGP_task4_1
Abstract
This paper proposes a constant-Q transform (CQT) based input feature for the baseline architecture, to learn the start and end times of sound events (strong labels) in an audio recording given only the list of sound events present in the audio without time information (weak labels). This is achieved by using CQT coefficients as the input feature for a convolutional recurrent neural network. The proposed method is a contribution to the DCASE 2018 challenge (Task 4) and is evaluated on a publicly available dataset of YouTube clips with 10 sound event classes. The method achieves a best error rate of 1.48 and an F-score of 14.55%. Based on the results obtained using a CPU-based system, the error rate decreases by 7.5% and the F-score increases by 11.5% compared to the baseline results.
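CQT coefficients of the kind used here can be computed with the librosa library; the sketch below extracts a log-scaled CQT matrix suitable as CRNN input (sampling rate, hop length and number of bins are placeholders, not necessarily the submission's settings):

```python
import librosa
import numpy as np

def cqt_features(wav_path, sr=44100, hop_length=512, n_bins=84):
    """Log-scaled constant-Q transform feature matrix (frames x bins).

    Parameter values are illustrative; the submission's settings may differ.
    """
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length, n_bins=n_bins))
    log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)
    return log_cqt.T  # time frames as rows

# features = cqt_features('example_10s_clip.wav')   # hypothetical file name
```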
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | CRNN |
LARGE-SCALE WEAKLY LABELED SEMI-SUPERVISED SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS
Serizel, Romain and Turpault, Nicolas and Eghbal-Zadeh, Hamid and Shah, Ankit Parag
Department of Natural Language Processing & Knowledge Discovery, University of Lorraine, Loria, Nancy, France
Task 4 baseline
Abstract
This paper presents DCASE 2018 Task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, etc.) and potential industrial applications.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
A CRNN-BASED SYSTEM WITH MIXUP TECHNIQUE FOR LARGE-SCALE WEAKLY LABELED SOUND EVENT DETECTION
Wang, Dezhi and Xu, Kele and Zhu, Boqing and Zhang, Lilun and Peng, Yuxing and Wang, Huaimin
School of Computer, National University of Defense Technology, Changsha, China
Wang_NUDT_task4_1 Wang_NUDT_task4_2 Wang_NUDT_task4_3 Wang_NUDT_task4_4
Abstract
The details of our method submitted to Task 4 of the DCASE 2018 challenge are described in this technical report. This task evaluates systems for the detection of sound events in domestic environments using large-scale weakly labeled data. In particular, an architecture based on the convolutional recurrent neural network (CRNN) framework is used to detect the timestamps of all events in given audio clips, where the training audio files have only clip-level labels. In order to take advantage of the large-scale unlabeled in-domain training data, a deep residual network based model (ResNeXt) is first employed to predict weak labels for the unlabeled data. In addition, a mixup technique is applied in the model training process, which is believed to benefit data augmentation and the model's generalization capability. Finally, the system achieves a 22.05% class-wise average F1-value for sound event detection on the provided testing dataset.
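Mixup builds virtual training examples by linearly interpolating pairs of inputs and their (weak) label vectors with a weight drawn from a Beta distribution. A minimal sketch for clip-level features and multi-hot tags (the alpha value is an assumption, not the submitted setting):

```python
import numpy as np

def mixup_batch(features, weak_labels, alpha=0.2, rng=np.random.default_rng()):
    """Mixup for a batch of feature matrices and multi-hot weak labels.

    features: (batch, time, mels) array; weak_labels: (batch, classes) in {0, 1}.
    alpha controls the Beta distribution of mixing weights (illustrative value).
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(features))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * weak_labels + (1.0 - lam) * weak_labels[perm]
    return mixed_x, mixed_y

x = np.random.rand(8, 500, 64)                       # 8 clips of log-mel energies
y = np.random.randint(0, 2, (8, 10)).astype(float)   # weak tags for 10 classes
mx, my = mixup_batch(x, y)
print(mx.shape, my.shape)
```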
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies, delta features |
Classifier | CRNN |
Decision making | mean probability |
SELF-ATTENTION MECHANISM BASED SYSTEM FOR DCASE2018 CHALLENGE TASK1 AND TASK4
Wang, Jun and Li, Shengchen
Beijing University of Posts and Telecommunications, Beijing, China
WangJun_BUPT_task4_2
Abstract
In this technical report, we present a self-attention mechanism based system for Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. We take a convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) as our basic systems for Task 1 and Task 4. In this convolutional recurrent neural network (CRNN), gated linear units (GLUs) are used as the non-linearity, implementing a gating mechanism over the output of the network for selecting informative local features. A self-attention mechanism called intra-attention is used for modeling relationships between different positions of a single sequence over the output of the CRNN. An attention-based pooling scheme is used for localizing the specific events in Task 4 and for obtaining the final labels in Task 1. In summary, we obtain 70.81% accuracy on subtask 1 of Task 1. On subtask 2 of Task 1, we obtain 70.1% accuracy for device A, 59.4% for device B, and 55.6% for device C. For Task 4, we obtain a 26.98% F1 value for sound event detection on the old test data of the development set.
System characteristics
Input | mono |
Sampling rate | 16kHz |
Data augmentation | time stretching, pitch shifting, reversing |
Features | log-mel energies |
Classifier | CRNN |