Large-scale weakly labeled semi-supervised sound event detection in domestic environments


Challenge results

Task description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. A further challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data, together with a small weakly annotated training set, to improve system performance. The labels in the annotated subset are verified and can be considered reliable.

A more detailed task description can be found on the task description page.
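The event-based F-scores reported below are computed by matching predicted events to reference events of the same class within a small onset collar (the challenge used the sed_eval toolbox with a 200 ms collar plus an offset condition). As an illustration only, here is a simplified, onset-only sketch of the matching logic; the event tuples and collar value are illustrative, not the official implementation:

```python
def event_based_f1(reference, estimated, collar=0.2):
    """Greedy onset-only matching: an estimated event counts as a true
    positive if an unmatched reference event of the same class has an
    onset within `collar` seconds. Events are (label, onset, offset)."""
    matched = set()
    tp = 0
    for label, onset, _ in estimated:
        for i, (r_label, r_onset, _) in enumerate(reference):
            if i not in matched and label == r_label and abs(onset - r_onset) <= collar:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [("Speech", 1.00, 2.50), ("Dog", 4.00, 4.80)]
est = [("Speech", 1.10, 2.40), ("Dog", 5.00, 5.50)]
print(event_based_f1(ref, est))  # 0.5: Speech matches, the Dog onset is 1 s off
```

The official metric additionally requires the offset to fall within the collar or a percentage of the event length; sed_eval implements the full definition.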

Systems ranking

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
Avdeeva_ITMO_task4_1 PPF_system Avdveeva2018 20.1 28.1
Avdeeva_ITMO_task4_2 PPF_system Avdveeva2018 19.5 28.1
Wang_NUDT_task4_1 NUDT-System WangD2018 12.4 22.1
Wang_NUDT_task4_2 NUDT-System WangD2018 12.6 22.0
Wang_NUDT_task4_3 NUDT-System WangD2018 12.0 20.5
Wang_NUDT_task4_4 NUDT-System WangD2018 12.2 20.1
Dinkel_SJTU_task4_1 SJTU-ASR-GRU Dinkel2018 10.4 13.4
Dinkel_SJTU_task4_2 SJTU-ASR-CRNN Dinkel2018 10.7 13.7
Dinkel_SJTU_task4_3 SJTU-ASR-GAUSS Dinkel2018 13.4 19.4
Dinkel_SJTU_task4_4 SJTU-CRNN Dinkel2018 11.2 14.9
Guo_THU_task4_1 THU_multiCRNN Guo2018 21.3 29.2
Guo_THU_task4_2 THU_multiCRNN Guo2018 20.6 29.2
Guo_THU_task4_3 THU_multiCRNN Guo2018 19.1 29.2
Guo_THU_task4_4 THU_multiCRNN Guo2018 19.0 29.2
Harb_TUG_task4_1 Harb_TUG Harb2018 19.4 34.6
Harb_TUG_task4_2 Harb_TUG Harb2018 15.7 34.6
Harb_TUG_task4_3 Harb_TUG Harb2018 21.6 34.6
Hou_BUPT_task4_1 Hou_BUPT_1 Hou2018 19.6 32.7
Hou_BUPT_task4_2 Hou_BUPT_2 Hou2018 18.9 30.8
Hou_BUPT_task4_3 Hou_BUPT_3 Hou2018 20.9 33.0
Hou_BUPT_task4_4 Hou_BUPT_4 Hou2018 21.1 31.5
CANCES_IRIT_task4_1 IRIT_WGRU_GRU_fusion Cances2018 8.4 16.3
PELLEGRINI_IRIT_task4_2 IRIT_MIL Cances2018 16.6 24.6
Kothinti_JHU_task4_1 JHU_T4 Kothinti2018 20.6 29.3
Kothinti_JHU_task4_2 JHU_T4 Kothinti2018 20.9 29.8
Kothinti_JHU_task4_3 JHU_T4 Kothinti2018 20.9 24.5
Kothinti_JHU_task4_4 JHU_T4 Kothinti2018 22.4 30.1
Koutini_JKU_task4_1 JKU_rcnn_threshold Koutini2018 21.5 40.9
Koutini_JKU_task4_2 JKU_rcnn_prec Koutini2018 21.1 40.2
Koutini_JKU_task4_3 JKU_rcnn_prec2 Koutini2018 20.6 40.2
Koutini_JKU_task4_4 JKU_rcnn_uth Koutini2018 18.8 35.6
Liu_USTC_task4_1 USTC_NEL1 Liu2018 27.3 42.4
Liu_USTC_task4_2 USTC_NEL2 Liu2018 28.8 47.4
Liu_USTC_task4_3 USTC_NEL3 Liu2018 28.1 50.3
Liu_USTC_task4_4 USTC_NEL4 Liu2018 29.9 51.6
LJK_PSH_task4_1 LJK_PSH_task4_1 Lu2018 24.1 28.6
LJK_PSH_task4_2 LJK_PSH_task4_2 Lu2018 26.3 26.4
LJK_PSH_task4_3 LJK_PSH_task4_3 Lu2018 29.5 27.2
LJK_PSH_task4_4 LJK_PSH_task4_4 Lu2018 32.4 25.9
Moon_YONSEI_task4_1 Yonsei_str_1 Moon2018 15.9 21.6
Moon_YONSEI_task4_2 Yonsei_str_2 Moon2018 14.3 24.3
Raj_IITKGP_task4_1 Raj_IIT_KGP_Task4_1 Raj2018 9.4 21.9
Lim_ETRI_task4_1 Lim_task4_1 Lim2018 17.1 21.9
Lim_ETRI_task4_2 Lim_task4_2 Lim2018 18.0 23.1
Lim_ETRI_task4_3 Lim_task4_3 Lim2018 19.6 28.4
Lim_ETRI_task4_4 Lim_task4_4 Lim2018 20.4 29.3
WangJun_BUPT_task4_2 BUPT_Attention WangJ2018 17.9 27.0
DCASE2018 baseline Baseline Serizel2018 10.8 14.1
Baseline_Surrey_task4_1 SurreyCNN8 Kong2018 18.6 20.8
Baseline_Surrey_task4_2 SurreyCNN4 Kong2018 16.7 20.8
Baseline_Surrey_task4_3 SurreyFuse Kong2018 24.0 26.7

Teams ranking

Table including only the best performing system per submitting team.

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Event-based F-score (Development dataset)
Avdeeva_ITMO_task4_1 PPF_system Avdveeva2018 20.1 28.1
Wang_NUDT_task4_2 NUDT-System WangD2018 12.6 22.0
Dinkel_SJTU_task4_3 SJTU-ASR-GAUSS Dinkel2018 13.4 19.4
Guo_THU_task4_1 THU_multiCRNN Guo2018 21.3 29.2
Harb_TUG_task4_3 Harb_TUG Harb2018 21.6 34.6
Hou_BUPT_task4_4 Hou_BUPT_4 Hou2018 21.1 31.5
PELLEGRINI_IRIT_task4_2 IRIT_MIL Cances2018 16.6 24.6
Kothinti_JHU_task4_4 JHU_T4 Kothinti2018 22.4 30.1
Koutini_JKU_task4_1 JKU_rcnn_threshold Koutini2018 21.5 40.9
Liu_USTC_task4_4 USTC_NEL4 Liu2018 29.9 51.6
LJK_PSH_task4_4 LJK_PSH_task4_4 Lu2018 32.4 25.9
Moon_YONSEI_task4_1 Yonsei_str_1 Moon2018 15.9 21.6
Raj_IITKGP_task4_1 Raj_IIT_KGP_Task4_1 Raj2018 9.4 21.9
Lim_ETRI_task4_4 Lim_task4_4 Lim2018 20.4 29.3
WangJun_BUPT_task4_2 BUPT_Attention WangJ2018 17.9 27.0
DCASE2018 baseline Baseline Serizel2018 10.8 14.1
Baseline_Surrey_task4_3 SurreyFuse Kong2018 24.0 26.7

Class-wise performance

Submission code | Submission name | Technical report | Event-based F-score (Evaluation dataset) | Alarm bell ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
Avdeeva_ITMO_task4_1 PPF_system Avdveeva2018 20.1 33.3 15.2 14.9 6.3 16.3 15.8 24.6 13.3 27.2 34.8
Avdeeva_ITMO_task4_2 PPF_system Avdveeva2018 19.5 33.3 11.8 14.9 6.3 16.3 13.1 24.6 13.3 27.2 34.7
Wang_NUDT_task4_1 NUDT-System WangD2018 12.4 6.8 14.1 2.6 0.8 2.7 29.3 20.2 11.2 1.3 35.0
Wang_NUDT_task4_2 NUDT-System WangD2018 12.6 6.7 14.4 2.5 1.1 2.6 29.7 22.0 11.1 1.3 34.0
Wang_NUDT_task4_3 NUDT-System WangD2018 12.0 7.2 17.8 4.2 2.3 3.0 26.2 13.7 10.0 2.7 32.5
Wang_NUDT_task4_4 NUDT-System WangD2018 12.2 7.0 18.2 3.6 2.7 3.1 27.2 13.9 10.1 2.8 33.1
Dinkel_SJTU_task4_1 SJTU-ASR-GRU Dinkel2018 10.4 12.2 17.1 2.0 2.7 5.4 12.2 0.0 6.0 23.7 22.6
Dinkel_SJTU_task4_2 SJTU-ASR-CRNN Dinkel2018 10.7 12.9 15.9 0.6 4.4 5.3 7.5 0.0 9.9 30.6 20.0
Dinkel_SJTU_task4_3 SJTU-ASR-GAUSS Dinkel2018 13.4 20.2 19.0 0.0 14.1 11.3 9.7 0.0 3.9 39.7 16.0
Dinkel_SJTU_task4_4 SJTU-CRNN Dinkel2018 11.2 12.7 22.6 0.0 6.1 5.1 11.3 0.0 3.3 31.1 19.6
Guo_THU_task4_1 THU_multiCRNN Guo2018 21.3 35.3 31.8 7.8 4.0 9.9 17.4 32.7 18.3 31.0 24.8
Guo_THU_task4_2 THU_multiCRNN Guo2018 20.6 35.3 19.9 6.6 4.4 10.6 13.6 36.8 13.5 35.4 29.4
Guo_THU_task4_3 THU_multiCRNN Guo2018 19.1 16.7 12.7 6.0 10.7 14.1 12.8 22.1 19.2 36.2 40.8
Guo_THU_task4_4 THU_multiCRNN Guo2018 19.0 16.5 11.8 7.0 11.3 15.1 14.2 19.9 16.8 37.9 39.2
Harb_TUG_task4_1 Harb_TUG Harb2018 19.4 21.6 23.7 6.6 0.4 4.8 26.4 34.8 18.1 33.0 25.0
Harb_TUG_task4_2 Harb_TUG Harb2018 15.7 14.6 20.0 7.2 15.0 10.0 9.1 14.8 13.5 33.7 19.2
Harb_TUG_task4_3 Harb_TUG Harb2018 21.6 15.4 30.0 8.1 17.5 9.7 21.0 34.7 17.3 31.1 31.5
Hou_BUPT_task4_1 Hou_BUPT_1 Hou2018 19.6 38.6 18.4 3.5 22.2 20.4 31.5 1.4 14.4 37.6 8.5
Hou_BUPT_task4_2 Hou_BUPT_2 Hou2018 18.9 38.9 15.0 5.7 16.5 16.5 35.1 2.0 15.5 35.4 8.7
Hou_BUPT_task4_3 Hou_BUPT_3 Hou2018 20.9 43.8 12.2 10.0 23.4 18.3 9.2 10.9 15.6 37.3 28.4
Hou_BUPT_task4_4 Hou_BUPT_4 Hou2018 21.1 41.4 16.4 6.4 23.5 20.2 9.8 6.2 14.0 40.6 32.3
CANCES_IRIT_task4_1 IRIT_WGRU_GRU_fusion Cances2018 8.4 2.5 5.9 0.5 0.3 1.8 17.7 20.9 8.6 4.0 21.6
PELLEGRINI_IRIT_task4_2 IRIT_MIL Cances2018 16.6 23.8 5.1 25.3 0.7 4.1 6.5 18.3 15.0 22.3 44.9
Kothinti_JHU_task4_1 JHU_T4 Kothinti2018 20.6 36.0 13.0 20.0 13.1 24.4 22.0 0.0 10.4 34.5 32.7
Kothinti_JHU_task4_2 JHU_T4 Kothinti2018 20.9 32.5 21.7 18.6 13.4 25.4 24.7 0.0 7.8 34.2 31.3
Kothinti_JHU_task4_3 JHU_T4 Kothinti2018 20.9 37.2 20.4 17.8 12.4 24.5 16.9 0.0 10.4 34.0 35.1
Kothinti_JHU_task4_4 JHU_T4 Kothinti2018 22.4 36.7 22.0 20.5 12.8 26.5 24.3 0.0 9.6 34.3 37.0
Koutini_JKU_task4_1 JKU_rcnn_threshold Koutini2018 21.5 30.0 16.4 13.1 9.5 8.4 23.5 18.1 12.6 42.9 40.8
Koutini_JKU_task4_2 JKU_rcnn_prec Koutini2018 21.1 30.0 15.8 13.1 9.5 8.4 23.5 17.6 12.1 42.0 39.2
Koutini_JKU_task4_3 JKU_rcnn_prec2 Koutini2018 20.6 30.0 15.8 12.9 9.3 8.5 22.7 16.1 12.3 40.9 37.6
Koutini_JKU_task4_4 JKU_rcnn_uth Koutini2018 18.8 29.2 15.1 12.6 9.5 9.4 22.1 15.2 12.2 41.1 21.4
Liu_USTC_task4_1 USTC_NEL1 Liu2018 27.3 44.2 20.7 23.1 15.2 18.1 30.6 8.7 20.8 43.3 48.8
Liu_USTC_task4_2 USTC_NEL2 Liu2018 28.8 46.0 27.1 21.6 10.8 26.5 42.0 11.0 20.9 33.5 48.6
Liu_USTC_task4_3 USTC_NEL3 Liu2018 28.1 41.7 28.4 22.9 9.2 26.7 33.3 10.3 21.6 43.1 43.9
Liu_USTC_task4_4 USTC_NEL4 Liu2018 29.9 46.0 27.1 20.3 13.0 26.5 37.6 10.9 23.9 43.1 50.0
LJK_PSH_task4_1 LJK_PSH_task4_1 Lu2018 24.1 23.1 32.6 1.2 0.0 5.0 51.4 36.0 30.4 14.0 46.7
LJK_PSH_task4_2 LJK_PSH_task4_2 Lu2018 26.3 25.1 36.1 1.9 0.4 3.1 52.1 42.4 36.2 16.7 49.1
LJK_PSH_task4_3 LJK_PSH_task4_3 Lu2018 29.5 48.0 30.4 2.3 3.7 20.1 46.8 29.4 27.9 41.4 44.6
LJK_PSH_task4_4 LJK_PSH_task4_4 Lu2018 32.4 49.9 38.2 3.6 3.2 18.1 48.7 35.4 31.2 46.8 48.3
Moon_YONSEI_task4_1 Yonsei_str_1 Moon2018 15.9 26.3 14.0 9.8 6.3 15.7 10.4 8.7 11.0 29.6 27.5
Moon_YONSEI_task4_2 Yonsei_str_2 Moon2018 14.3 17.8 14.9 8.1 2.0 10.3 14.6 13.7 12.7 17.3 31.7
Raj_IITKGP_task4_1 Raj_IIT_KGP_Task4_1 Raj2018 9.4 5.1 7.2 1.0 0.3 2.3 15.9 20.4 6.6 0.3 34.9
Lim_ETRI_task4_1 Lim_task4_1 Lim2018 17.1 10.0 20.8 4.8 0.6 6.2 29.1 18.3 16.4 11.2 53.1
Lim_ETRI_task4_2 Lim_task4_2 Lim2018 18.0 12.9 22.5 4.9 0.6 7.0 30.5 19.7 16.5 11.9 53.2
Lim_ETRI_task4_3 Lim_task4_3 Lim2018 19.6 10.2 20.5 6.8 5.9 16.9 25.4 13.5 13.2 20.2 63.3
Lim_ETRI_task4_4 Lim_task4_4 Lim2018 20.4 11.6 21.6 7.9 5.9 17.4 27.8 14.9 15.5 21.0 60.0
WangJun_BUPT_task4_2 BUPT_Attention WangJ2018 17.9 40.3 14.5 19.0 6.1 4.6 18.6 20.4 18.3 26.0 11.3
DCASE2018 baseline Baseline Serizel2018 10.8 4.8 12.7 2.9 0.4 2.4 20.0 24.5 10.1 0.1 30.2
Baseline_Surrey_task4_1 SurreyCNN8 Kong2018 18.6 6.0 18.9 2.4 0.0 3.6 46.4 43.6 15.2 0.0 50.0
Baseline_Surrey_task4_2 SurreyCNN4 Kong2018 16.7 5.5 16.3 2.5 0.0 4.0 44.1 42.5 13.5 0.0 38.8
Baseline_Surrey_task4_3 SurreyFuse Kong2018 24.0 24.5 18.9 7.8 7.7 5.6 46.4 43.6 15.2 19.9 50.0

System characteristics

General characteristics

Code | Technical report | Event-based F-score (Eval) | Input | Sampling rate | Data augmentation | Features
Avdeeva_ITMO_task4_1 Avdveeva2018 20.1 mono 16kHz time stretching, pitch shifting log-mel energies
Avdeeva_ITMO_task4_2 Avdveeva2018 19.5 mono 16kHz time stretching, pitch shifting log-mel energies
Wang_NUDT_task4_1 WangD2018 12.4 mono 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_2 WangD2018 12.6 mono 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_3 WangD2018 12.0 mono 44.1kHz mixup log-mel energies, delta features
Wang_NUDT_task4_4 WangD2018 12.2 mono 44.1kHz mixup log-mel energies, delta features
Dinkel_SJTU_task4_1 Dinkel2018 10.4 mono 44.1kHz MFCC, log-mel energies
Dinkel_SJTU_task4_2 Dinkel2018 10.7 mono 44.1kHz MFCC, log-mel energies
Dinkel_SJTU_task4_3 Dinkel2018 13.4 mono 44.1kHz MFCC, log-mel energies
Dinkel_SJTU_task4_4 Dinkel2018 11.2 mono 44.1kHz MFCC, log-mel energies
Guo_THU_task4_1 Guo2018 21.3 mono 44.1kHz log-mel energies
Guo_THU_task4_2 Guo2018 20.6 mono 44.1kHz log-mel energies
Guo_THU_task4_3 Guo2018 19.1 mono 44.1kHz log-mel energies
Guo_THU_task4_4 Guo2018 19.0 mono 44.1kHz log-mel energies
Harb_TUG_task4_1 Harb2018 19.4 mono 44.1kHz log-mel energies
Harb_TUG_task4_2 Harb2018 15.7 mono 44.1kHz log-mel energies
Harb_TUG_task4_3 Harb2018 21.6 mono 44.1kHz log-mel energies
Hou_BUPT_task4_1 Hou2018 19.6 mono 16kHz log-mel energies
Hou_BUPT_task4_2 Hou2018 18.9 mono 16kHz log-mel energies
Hou_BUPT_task4_3 Hou2018 20.9 mono 16kHz log-mel energies
Hou_BUPT_task4_4 Hou2018 21.1 mono 16kHz log-mel energies
CANCES_IRIT_task4_1 Cances2018 8.4 mono 44.1kHz log-mel energies
PELLEGRINI_IRIT_task4_2 Cances2018 16.6 mono 44.1kHz log-mel energies
Kothinti_JHU_task4_1 Kothinti2018 20.6 mono 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_2 Kothinti2018 20.9 mono 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_3 Kothinti2018 20.9 mono 44.1kHz log-mel energies, auditory spectrogram
Kothinti_JHU_task4_4 Kothinti2018 22.4 mono 44.1kHz log-mel energies, auditory spectrogram
Koutini_JKU_task4_1 Koutini2018 21.5 mono 44.1kHz log-mel energies
Koutini_JKU_task4_2 Koutini2018 21.1 mono 44.1kHz log-mel energies
Koutini_JKU_task4_3 Koutini2018 20.6 mono 44.1kHz log-mel energies
Koutini_JKU_task4_4 Koutini2018 18.8 mono 44.1kHz log-mel energies
Liu_USTC_task4_1 Liu2018 27.3 mono 44.1kHz log-mel energies
Liu_USTC_task4_2 Liu2018 28.8 mono 44.1kHz log-mel energies
Liu_USTC_task4_3 Liu2018 28.1 mono 44.1kHz log-mel energies
Liu_USTC_task4_4 Liu2018 29.9 mono 44.1kHz log-mel energies
LJK_PSH_task4_1 Lu2018 24.1 mono 22.05kHz log-mel energies
LJK_PSH_task4_2 Lu2018 26.3 mono 22.05kHz log-mel energies
LJK_PSH_task4_3 Lu2018 29.5 mono 22.05kHz log-mel energies
LJK_PSH_task4_4 Lu2018 32.4 mono 22.05kHz log-mel energies
Moon_YONSEI_task4_1 Moon2018 15.9 mono 22.05kHz time stretching, pitch shifting, block mixing, DRC raw waveforms
Moon_YONSEI_task4_2 Moon2018 14.3 mono 22.05kHz time stretching, pitch shifting, block mixing, DRC raw waveforms
Raj_IITKGP_task4_1 Raj2018 9.4 mono 44.1kHz CQT
Lim_ETRI_task4_1 Lim2018 17.1 mono 16kHz time stretching, pitch shifting, reversing log-mel energies
Lim_ETRI_task4_2 Lim2018 18.0 mono 16kHz time stretching, pitch shifting, reversing log-mel energies
Lim_ETRI_task4_3 Lim2018 19.6 mono 16kHz time stretching, pitch shifting, reversing log-mel energies
Lim_ETRI_task4_4 Lim2018 20.4 mono 16kHz time stretching, pitch shifting, reversing log-mel energies
WangJun_BUPT_task4_2 WangJ2018 17.9 mono 44.1kHz log-mel energies
DCASE2018 baseline Serizel2018 10.8 mono 44.1kHz log-mel energies
Baseline_Surrey_task4_1 Kong2018 18.6 mono 32kHz log-mel energies
Baseline_Surrey_task4_2 Kong2018 16.7 mono 32kHz log-mel energies
Baseline_Surrey_task4_3 Kong2018 24.0 mono 32kHz log-mel energies



Machine learning characteristics

Code | Technical report | Event-based F-score (Eval) | Model complexity | Classifier | Ensemble subsystems | Decision making
Avdeeva_ITMO_task4_1 Avdveeva2018 20.1 200242 CRNN, CNN 2 hierarchical
Avdeeva_ITMO_task4_2 Avdveeva2018 19.5 200242 CRNN, CNN 2 hierarchical
Wang_NUDT_task4_1 WangD2018 12.4 24210492 CRNN 3 mean probability
Wang_NUDT_task4_2 WangD2018 12.6 24210492 CRNN 3 mean probability
Wang_NUDT_task4_3 WangD2018 12.0 24210492 CRNN 3 mean probability
Wang_NUDT_task4_4 WangD2018 12.2 24210492 CRNN 3 mean probability
Dinkel_SJTU_task4_1 Dinkel2018 10.4 1781259 HMM-GMM, GRU
Dinkel_SJTU_task4_2 Dinkel2018 10.7 126219 HMM-GMM, CRNN
Dinkel_SJTU_task4_3 Dinkel2018 13.4 126219 HMM-GMM, CRNN
Dinkel_SJTU_task4_4 Dinkel2018 11.2 126090 CRNN
Guo_THU_task4_1 Guo2018 21.3 970644 multi-scale CRNN 2
Guo_THU_task4_2 Guo2018 20.6 970644 multi-scale CRNN 2
Guo_THU_task4_3 Guo2018 19.1 970644 multi-scale CRNN 2
Guo_THU_task4_4 Guo2018 19.0 970644 multi-scale CRNN 2
Harb_TUG_task4_1 Harb2018 19.4 497428 CRNN, VAT
Harb_TUG_task4_2 Harb2018 15.7 497428 CRNN, VAT
Harb_TUG_task4_3 Harb2018 21.6 497428 CRNN, VAT
Hou_BUPT_task4_1 Hou2018 19.6 1166484 CRNN
Hou_BUPT_task4_2 Hou2018 18.9 1166484 CRNN
Hou_BUPT_task4_3 Hou2018 20.9 1166484 CRNN
Hou_BUPT_task4_4 Hou2018 21.1 1166484 CRNN
CANCES_IRIT_task4_1 Cances2018 8.4 126090 CRNN
PELLEGRINI_IRIT_task4_2 Cances2018 16.6 1040724 CNN, CRNN with Multi-Instance Learning
Kothinti_JHU_task4_1 Kothinti2018 20.6 1540854 CRNN, RBM, cRBM, PCA
Kothinti_JHU_task4_2 Kothinti2018 20.9 1540854 CRNN, RBM, cRBM, PCA
Kothinti_JHU_task4_3 Kothinti2018 20.9 1189290 CRNN, RBM, cRBM, PCA
Kothinti_JHU_task4_4 Kothinti2018 22.4 1540854 CRNN, RBM, cRBM, PCA
Koutini_JKU_task4_1 Koutini2018 21.5 126090 CRNN
Koutini_JKU_task4_2 Koutini2018 21.1 126090 CRNN
Koutini_JKU_task4_3 Koutini2018 20.6 126090 CRNN
Koutini_JKU_task4_4 Koutini2018 18.8 126090 CRNN
Liu_USTC_task4_1 Liu2018 27.3 3478026 Capsule-RNN, ensemble 8 dynamic threshold
Liu_USTC_task4_2 Liu2018 28.8 534460 Capsule-RNN, ensemble 2 dynamic threshold
Liu_USTC_task4_3 Liu2018 28.1 4012486 Capsule-RNN, CRNN, ensemble 9 dynamic threshold
Liu_USTC_task4_4 Liu2018 29.9 4012486 Capsule-RNN, CRNN, ensemble 10 dynamic threshold
LJK_PSH_task4_1 Lu2018 24.1 1382246 CRNN 4 mean probabilities
LJK_PSH_task4_2 Lu2018 26.3 1382246 CRNN 2 mean probabilities
LJK_PSH_task4_3 Lu2018 29.5 1382246 CRNN
LJK_PSH_task4_4 Lu2018 32.4 1382246 CRNN
Moon_YONSEI_task4_1 Moon2018 15.9 10902218 GLU, Bi-RNN, ResNet, SENet, Multi-level
Moon_YONSEI_task4_2 Moon2018 14.3 10902218 GLU, Bi-RNN, ResNet, SENet, Multi-level
Raj_IITKGP_task4_1 Raj2018 9.4 215890 CRNN
Lim_ETRI_task4_1 Lim2018 17.1 239338 CRNN
Lim_ETRI_task4_2 Lim2018 18.0 239338 CRNN
Lim_ETRI_task4_3 Lim2018 19.6 239338 CRNN
Lim_ETRI_task4_4 Lim2018 20.4 239338 CRNN
WangJun_BUPT_task4_2 WangJ2018 17.9 1263508 RNN,BGRU,self-attention
DCASE2018 baseline Serizel2018 10.8 126090 CRNN
Baseline_Surrey_task4_1 Kong2018 18.6 4691274 VGGish 8 layer CNN with global max pooling
Baseline_Surrey_task4_2 Kong2018 16.7 4309450 AlexNetish 4 layer CNN with global max pooling
Baseline_Surrey_task4_3 Kong2018 24.0 4691274 VGGish 8 layer CNN with global max pooling, fuse SED and non-SED

Technical reports

Sound Event Detection Using Weakly Labeled Dataset with Convolutional Recurrent Neural Network

Avdeeva, Anastasia and Agafonov, Iurii
Speech Information Systems Department, University of Information Technologies, Mechanics and Optics (ITMO University), Saint Petersburg, Russia

Abstract

In this paper, a sound event detection system is proposed. The system fuses a CNN classifier with a CRNN segmenter.

System characteristics
Input: mono
Sampling rate: 16kHz
Data augmentation: time stretching, pitch shifting
Features: log-mel energies
Classifier: CRNN, CNN
Decision making: hierarchical

SOUND EVENT DETECTION FROM WEAK ANNOTATIONS: WEIGHTED GRU VERSUS MULTI-INSTANCE LEARNING

Cances, leo and Pellegrini, Thomas and Guyot, Patrice
Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France

Abstract

In this paper, we address the detection of audio events in domestic environments in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but no temporal boundaries. We report experiments in the framework of Task 4 of the DCASE 2018 challenge. The objective is twofold: detect audio events (multi-categorical classification at recording level) and localize the events precisely within the recordings. We explored two approaches: 1) a "weighted-GRU" (WGRU), in which we train a Convolutional Recurrent Neural Network (CRNN) for classification and then exploit its frame-based predictions at the output of the time-distributed dense layer to perform localization. We propose to lower the influence of the hidden states to avoid predicting the same score throughout a recording. 2) An approach inspired by Multi-Instance Learning (MIL), in which we train a CRNN to give predictions at frame level, using a custom loss function based on the weak label and statistics of the frame-based predictions. Both approaches outperform the baseline of 14.06% in F-measure by a large margin, with values of 16.77% and 24.58% for combined WGRUs and MIL respectively, on a test set comprising 288 recordings.
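The MIL idea of training on weak labels through statistics of frame-level predictions can be illustrated with a pooling function. The following is a generic sketch using linear-softmax pooling, not the authors' exact loss: the clip-level probability is a confidence-weighted average of frame probabilities, and the weak-label cross-entropy is applied to the pooled value.

```python
import numpy as np

def linear_softmax_pool(frame_probs):
    """Clip-level probability from frame-level probabilities (T, C):
    confident frames get proportionally more weight than in plain
    averaging, which suits events active in only part of a clip."""
    num = (frame_probs ** 2).sum(axis=0)
    den = frame_probs.sum(axis=0) + 1e-8
    return num / den

def weak_label_loss(frame_probs, clip_labels):
    """Binary cross-entropy between pooled clip probabilities and the
    weak (clip-level) labels; frames receive gradient through pooling."""
    p = np.clip(linear_softmax_pool(frame_probs), 1e-7, 1 - 1e-7)
    y = clip_labels
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

# One clip, two classes: class 0 active in half the frames, class 1 silent.
probs = np.array([[0.9, 0.1], [0.8, 0.1], [0.1, 0.1], [0.1, 0.1]])
print(linear_softmax_pool(probs))  # ~[0.77, 0.1]: above the plain mean for class 0
```

Max pooling and attention pooling are common alternatives; the choice mainly affects how much silent frames dilute the clip score.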

System characteristics
Input: mono
Sampling rate: 16kHz
Features: log-mel energies
Classifier: CRNN

A HYBRID ASR MODEL APPROACH ON WEAKLY LABELED SCENE CLASSIFICATION

Dinkel, Heinrich and Qian, Yanmin and Yu, Kai
Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China

Abstract

This paper presents our submission to Task 4 of the DCASE 2018 challenge. Our approach focuses on refining the training labels by using an HMM-GMM to obtain a frame-wise alignment from the clip-wise labels. We then train a convolutional recurrent neural network (CRNN), as well as a single gated recurrent neural network, on those labels in standard cross-entropy fashion. Our approach utilizes a "blank" state which is treated as a junk collector for all uninteresting events. Moreover, Gaussian posterior filtering is introduced in order to enhance the connectivity between segments. Compared to the baseline result, the proposed framework significantly enhances the model's capability to detect short, impulsively occurring events such as speech, dog, dishes and alarm. Our best submission on the test set is a CRNN model with Gaussian posterior filtering, resulting in a 19.37% macro-average and a 24.41% micro-average F-score.
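Gaussian posterior filtering smooths the frame-wise class posteriors before thresholding, so that a brief dip inside an event does not split it into several detections. A minimal sketch using SciPy (the sigma and threshold values are illustrative, not the authors' settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def posterior_to_events(probs, sigma=1.0, threshold=0.5):
    """Smooth frame posteriors with a 1-D Gaussian, then threshold and
    return (start_frame, end_frame) runs of activity."""
    smoothed = gaussian_filter1d(probs, sigma=sigma)
    active = smoothed > threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, len(active)))
    return events

# A single event with a one-frame dip: raw thresholding would yield two
# segments, while the smoothed posterior yields one connected event.
probs = np.array([0.0, 0.9, 0.9, 0.2, 0.9, 0.9, 0.0])
print(posterior_to_events(probs))  # → [(1, 6)]
```

Median filtering over a fixed window is a common alternative smoother with a similar effect on segment connectivity.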

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: MFCC, log-mel energies
Classifier: HMM-GMM, GRU, CRNN

MULTI-SCALE CONVOLUTIONAL RECURRENT NEURAL NETWORK WITH ENSEMBLE METHOD FOR WEAKLY LABELED SOUND EVENT DETECTION

Guo, Yingmei and Xu, Mingxing and Wu, Jianming and Wang, Yanan and Hoashi, Keiichiro
Department of Computer Science and Technology, Tsinghua University, Beijing, China

Abstract

In this paper, we describe our contributions to the challenge of Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018). We propose the multi-scale convolutional recurrent neural network (multi-scale CRNN), a novel weakly supervised learning framework for sound event detection. By integrating information from different time resolutions, the multi-scale method can capture both the fine-grained and coarse-grained features of sound events and model temporal dependencies, including fine-grained and long-term dependencies. The CRNN, using learnable gated linear units (GLUs), can also help to select the features most related to the audio labels. Furthermore, the proposed ensemble method can help to correct frame-level prediction errors using the classification results, as identifying the sound events that occur in an audio clip is much easier than providing the event time boundaries. The proposed method achieves a 29.2% event-based F1-score and a 1.40 event-based error rate on the development set of DCASE 2018 Task 4, compared to the baseline's 14.1% F-score and 1.54 error rate.
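The multi-scale idea, combining fine and coarse time resolutions of the same features, can be sketched by average-pooling a feature sequence with several window sizes and aligning the views for concatenation. This is a schematic illustration, not the authors' network; the scale set is an assumption:

```python
import numpy as np

def multi_scale_views(features, scales=(1, 2, 4)):
    """Produce several time resolutions of a (T, F) feature matrix by
    non-overlapping average pooling, then upsample each back to T frames
    so the views can be concatenated along the feature axis."""
    T, F = features.shape
    views = []
    for s in scales:
        pad = (-T) % s                     # pad T to a multiple of s
        padded = np.pad(features, ((0, pad), (0, 0)), mode="edge")
        pooled = padded.reshape(-1, s, F).mean(axis=1)
        views.append(np.repeat(pooled, s, axis=0)[:T])
    return np.concatenate(views, axis=1)

x = np.random.rand(10, 64)                 # 10 frames, 64 mel bands
y = multi_scale_views(x)
print(y.shape)                             # (10, 192): three resolutions stacked
```

In a CRNN the same effect is usually obtained with parallel convolution branches of different strides or kernel sizes rather than explicit pooling of the input.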

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: log-mel energies
Classifier: multi-scale CRNN

SOUND EVENT DETECTION USING WEAKLY LABELED SEMI-SUPERVISED DATA WITH GCRNNS, VAT AND SELF-ADAPTIVE LABEL REFINEMENT

Harb, Robert and Pernkopf, Franz
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria

Abstract

In this paper, we present a gated convolutional recurrent neural network based approach to solve Task 4 of the DCASE 2018 challenge, large-scale weakly labeled semi-supervised sound event detection in domestic environments. Gated linear units and a temporal attention layer are used to predict the onset and offset of sound events in 10 s long audio clips, whereby only weakly labeled data is used for training. Virtual adversarial training is used for regularization, utilizing both labeled and unlabeled data. Furthermore, we introduce self-adaptive label refinement, a method which allows unsupervised adaptation of our trained system to increase the quality of frame-level class predictions. The proposed system reaches an overall macro-averaged event-based F-score of 34.6%, an improvement of 20.5 percentage points over the baseline system.

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: log-mel energies
Classifier: CRNN, VAT

Semi-supervised sound event detection with convolutional recurrent neural network using weakly labelled data

Hou, Yuanbo and Li, Shengchen
Institute of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

In this technical report, we present a polyphonic sound event detection (SED) system based on a convolutional recurrent neural network for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our framework. We use a learnable gating activation function for selecting informative local features. In summary, we obtain a 32.95% F-score and a 1.34 error rate (ER) for SED on the development set, while the baseline obtained only a 14.06% F-score and a 1.54 ER.
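The learnable gating activation referred to here is typically a gated linear unit: one linear projection of the input is multiplied element-wise by a sigmoid of a second projection, letting the network suppress uninformative local features. A minimal numpy sketch (parameter shapes are illustrative; in the paper the gate sits inside convolutional layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, V, b, c):
    """Gated linear unit: (xW + b) modulated element-wise by the gate
    sigma(xV + c); the gate learns which features to pass through."""
    return (x @ W + b) * sigmoid(x @ V + c)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))          # 5 frames, 8 input features
W, V = rng.standard_normal((2, 8, 4))    # linear and gate weight matrices
b = np.zeros(4)
c = np.zeros(4)
out = glu(x, W, V, b, c)
print(out.shape)  # (5, 4)
```

Because the gate is strictly between 0 and 1, each output magnitude is bounded by the ungated linear response, which is what gives the unit its soft feature-selection behavior.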

System characteristics
Input: mono
Sampling rate: 16kHz
Features: log-mel energies
Classifier: CRNN

DCASE 2018 Challenge Baseline with Convolutional Neural Networks

Kong, Qiuqiang and Iqbal, Turab and Xu, Yong and Wang, Wenwu and Plumbley, Mark D.
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly labeled semi-supervised sound event detection and 5) multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and VGGish architectures from the image processing area. We investigated how performance varies from task to task when the configuration of the neural networks is the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, except on Task 1 where the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1-score of 20.8% on Task 4 and an F1-score of 87.75% on Task 5.
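The global max pooling used in these baselines reduces a frame-level class score sequence to one clip-level score, so the network can be trained with clip-level tags only; the pre-pooling frame scores then serve as a localization cue. A schematic numpy sketch of that reduction (not the actual VGGish/AlexNetish pipeline; the logits are toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clip_and_frame_probs(frame_logits):
    """Frame-level logits (T, C) -> clip-level tag probabilities via
    global max pooling over time, plus frame probabilities for SED."""
    frame_probs = sigmoid(frame_logits)
    clip_probs = frame_probs.max(axis=0)
    return clip_probs, frame_probs

logits = np.array([[ 3.0, -4.0],
                   [-2.0, -4.0],
                   [ 3.0, -4.0]])   # class 0 active in frames 0 and 2
clip, frames = clip_and_frame_probs(logits)
print(clip.round(2))  # class 0 tagged (~0.95), class 1 not (~0.02)
```

Max pooling makes the clip loss depend only on the most confident frame, which is why such systems tag well but can under-segment; the "fuse SED and non-SED" variant in the table combines the tagging and localization outputs.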

System characteristics
Input: mono
Sampling rate: 32kHz
Features: log-mel energies
Classifier: VGGish 8-layer CNN with global max pooling; AlexNetish 4-layer CNN with global max pooling

JOINT ACOUSTIC AND CLASS INFERENCE FOR WEAKLY SUPERVISED SOUND EVENT DETECTION

Kothinti, Sandeep and Imoto, Keisuke and Chakrabarty, Debmalya and Sell, Gregory and Watanabe, Shinji and Elhilali, Mounya
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA

Abstract

Sound event detection is a challenging task, especially for scenes with the simultaneous presence of multiple events. Task 4 of the 2018 DCASE challenge presents an event detection task that requires accuracy in both segmentation and recognition of events. Supervised methods produce accurate event labels but are limited in event segmentation when the training data lack event timestamps. On the other hand, unsupervised methods that model acoustic properties of the audio can produce accurate event boundaries but are not guided by the characteristics of event classes and sound categories. In this report, we present a hybrid approach that combines acoustic-driven event boundary detection with supervised label inference using a deep neural network. This framework leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it well suited to large-scale weakly labeled event detection. Compared to a baseline system, the proposed approach delivers a 15% absolute improvement in F1-score, demonstrating the benefits of the hybrid bottom-up, top-down approach.

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: log-mel energies, auditory spectrogram
Classifier: CRNN, RBM, cRBM, PCA

ITERATIVE KNOWLEDGE DISTILLATION IN R-CNNS FOR WEAKLY-LABELED SEMI-SUPERVISED SOUND EVENT DETECTION

Koutini, Khaled and Eghbal-zadeh, Hamid and Widmer, Gerhard
Institute of Computational Perception, Johannes Kepler University, Linz, Austria

Abstract

In this technical report, we present the approach used for the CP-JKU submission in Task 4 of the DCASE 2018 Challenge. We propose a novel iterative knowledge distillation technique for weakly labeled semi-supervised event detection using neural networks, specifically Recurrent Convolutional Neural Networks (R-CNNs). R-CNNs are used to tag the unlabeled data and predict strong labels. Further, we use the R-CNN strong pseudo-labels on the training datasets and train new models after applying label-smoothing techniques to the strong pseudo-labels. Our proposed approach significantly improved on the baseline, achieving an event-based F-measure of 40.86%, compared to the baseline's 15.11%, on the provided test set from the development dataset.
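One component of this pipeline that is easy to isolate is the label smoothing applied to the strong pseudo-labels before retraining: hard 0/1 frame targets are softened so the student model is not forced to reproduce the teacher's mistakes with full confidence. A generic multi-label sketch (the epsilon value is illustrative, not taken from the report):

```python
import numpy as np

def smooth_pseudo_labels(hard_labels, eps=0.1):
    """Soften per-frame multi-label 0/1 pseudo-labels: 1 -> 1 - eps and
    0 -> eps, reducing the penalty for confident teacher errors."""
    return hard_labels * (1.0 - eps) + (1.0 - hard_labels) * eps

frames = np.array([[1.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])   # strong pseudo-labels from the teacher tagger
print(smooth_pseudo_labels(frames))
```

For mutually exclusive (softmax) targets the usual variant is (1 - eps) * y + eps / K over K classes; the per-class form above fits the multi-label SED setting.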

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: log-mel energies
Classifier: CRNN

WEAKLY LABELED SEMI-SUPERVISED SOUND EVENT DETECTION USING CRNN WITH INCEPTION MODULE

Lim, Wootaek and Suh, Sangwon and Jeong, Youngho
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea

Abstract

In this paper, we present a method for the large-scale detection of sound events using a small amount of weakly labeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge Task 4. To perform this task, we adopted a convolutional neural network (CNN) and a gated recurrent unit (GRU) based bidirectional recurrent neural network (RNN) as our proposed system. In addition, we propose an Inception module for handling various receptive fields at once in each CNN layer. We also applied data augmentation to address the labeled-data shortage problem, and an event activity detection method for strong-label learning. By applying the proposed method to weakly labeled semi-supervised sound event detection, it was verified that the proposed system provides better detection performance than the DCASE 2018 baseline system.
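The augmentations listed for this system (time stretching, pitch shifting, reversing) are usually applied with an audio library such as librosa. A dependency-free sketch of the two simplest ones, resampling-based stretching (which, unlike a phase-vocoder stretch, also shifts pitch) and time reversal; the rate and signal are illustrative:

```python
import numpy as np

def stretch_by_resampling(wave, rate):
    """Linear-interpolation resampling: rate > 1 shortens (speeds up)
    the clip, rate < 1 lengthens it. This changes pitch too; a phase
    vocoder (e.g. librosa.effects.time_stretch) preserves pitch."""
    n_out = int(round(len(wave) / rate))
    old_t = np.arange(len(wave))
    new_t = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_t, old_t, wave)

def reverse(wave):
    """Time-reverse the waveform (label-preserving for many classes)."""
    return wave[::-1]

wave = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))  # 1 s toy signal
fast = stretch_by_resampling(wave, 1.25)
print(len(fast))  # 12800 samples: 20% shorter
```

When augmenting strongly labeled clips, remember that time stretching also rescales the event onset/offset annotations by the same rate.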

System characteristics
Input: mono
Sampling rate: 16kHz
Data augmentation: time stretching, pitch shifting, reversing
Features: log-mel energies
Classifier: CRNN

USTC-NELSLIP SYSTEM FOR DCASE 2018 CHALLENGE TASK 4

Liu, Yaming and Yan, Jie and Song, Yan and Du, Jun
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Abstract

In this technical report, we present a group of methods for Task 4 of Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018). This task aims to detect sound events in domestic environments using weakly labeled training data in a semi-supervised way. In this report, an event activity detection technique is first performed to transform weak labels into strong labels before training. Then a capsule-based method and a gated convolutional neural network (CNN) are used to estimate event activity probabilities, respectively. Finally, the event activity probabilities of the two systems are combined to obtain the final sound event detection (SED) estimation. In addition, a tagging model based on the proposed CNN is used to tag the unlabeled in-domain training set; data with high confidence are added to the training data to further improve performance. Experiments on the validation dataset show that the proposed approach obtains an F1-score of 51.60% and an error rate of 0.93, outperforming the baseline's 14.06% and 1.54.

System characteristics
Input: mono
Sampling rate: 44.1kHz
Features: log-mel energies
Classifier: Capsule-RNN, ensemble
Decision making: dynamic threshold

MEAN TEACHER CONVOLUTION SYSTEM FOR DCASE 2018 TASK 4

Lu, JiaKai
1T5K, PFU SHANGHAI Co., LTD, Shanghai, China

Abstract

In this paper, we present our neural network for Task 4 of the DCASE 2018 challenge (large-scale weakly labeled semi-supervised sound event detection in domestic environments). This task evaluates systems for the large-scale detection of sound events using weakly labeled data, and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance for audio tagging and sound event detection. We propose a mean-teacher model with a context-gating convolutional neural network (CNN) and a recurrent neural network (RNN) to maximize the use of the unlabeled in-domain dataset.
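The mean-teacher scheme maintains a teacher model whose weights are an exponential moving average (EMA) of the student's, and adds a consistency loss between teacher and student predictions on unlabeled clips. A minimal numpy sketch of those two ingredients (the alpha value and toy weights are illustrative; the paper's actual model is a context-gating CRNN):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.999):
    """Move each teacher weight toward the student weight: the teacher
    is a slow exponential moving average of student checkpoints."""
    for name in teacher:
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student[name]

def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher predictions,
    usable on unlabeled clips where no ground truth exists."""
    return float(np.mean((student_probs - teacher_probs) ** 2))

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student, alpha=0.9)
print(teacher["w"])  # [0.1 0.1 0.1]
```

During training the total loss is typically the supervised classification loss on labeled clips plus a ramped-up weight times this consistency term on all clips.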

System characteristics
Input: mono
Sampling rate: 22.05kHz
Features: log-mel energies
Classifier: CRNN
Decision making: mean probabilities

End-to-end CRNN Architectures for Weakly Supervised Sound Event Detection

Moon, Hyeongi and Byun, Joon and Kim, Bum-Jun and Jeon, Shin-hyuk and Jeong, Youngho and Park, Young-cheol and Park, Sung-wook
School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea

Abstract

This report describes our approach for large-scale weakly labeled semi-supervised sound event detection in domestic environments (Task 4) of DCASE 2018. Our structure is based on a Convolutional Recurrent Neural Network (CRNN) using raw waveforms. The conventional Convolutional Neural Network (CNN) is modified to adopt Gated Linear Units (GLU), ResNet, and Squeeze-and-Excitation (SE) networks. Three Recurrent Neural Networks (RNNs) then follow; each RNN receives features from a different layer, and the outputs of the RNNs are concatenated for final classification by fully connected (FC) layers. A simple data augmentation method is also applied to augment the small amount of labeled data. With this approach, an F1-score improvement of 5.5% is achieved.
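The GLU mentioned above pairs a linear path with a sigmoid gate, so the network can learn to suppress uninformative features. A minimal numpy sketch (not the submission's code; weights and the function name are illustrative):

```python
import numpy as np

def glu(x, w_lin, w_gate):
    """Gated linear unit: output = (x @ w_lin) * sigmoid(x @ w_gate).
    The sigmoid gate modulates each linear output element in [0, 1]."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return (x @ w_lin) * gate

x = np.array([[1.0, 2.0]])
w_lin = np.eye(2)            # identity linear path for illustration
w_gate = np.zeros((2, 2))    # zero gate weights -> sigmoid(0) = 0.5 everywhere
out = glu(x, w_lin, w_gate)  # each linear output is halved by the gate
```

In the CNN variant used for SED, the same idea is applied with convolutions instead of matrix products.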

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation time stretching, pitch shifting, block mixing, DRC
Features raw waveforms
Classifier GLU, Bi-RNN, ResNet, SENet, Multi-level
PDF

LARGE-SCALE WEAKLY LABELLED SEMI-SUPERVISED CQT BASED SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS

Raj, Rojin and Waldekar, Shefali and Saha, Goutam
Electronics and Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This paper proposes a constant-Q transform (CQT) based input feature for the baseline architecture to learn the start and end times of sound events (strong labels) in an audio recording, given only the list of sound events present in the audio without time information (weak labels). This is achieved by using constant-Q transform coefficients as the input feature for a convolutional recurrent neural network. The proposed method is a contribution to the challenge on detection and classification of acoustic scenes and events (DCASE 2018, Task 4) and is evaluated on a publicly available dataset from YouTube with 10 sound event classes. The method achieves a best error rate of 1.48 and an F-score of 14.55%. Based on the results obtained using a CPU-based system, this is a 7.5% decrease in error rate and an 11.5% increase in F-score compared to the baseline results.
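Unlike the mel filterbank of the baseline, a constant-Q filterbank places its center frequencies geometrically, so the ratio of center frequency to bandwidth (the Q factor) is the same for every bin. A small numpy sketch of that spacing (illustrative only; the report would compute full CQT coefficients, e.g. with a library such as librosa):

```python
import numpy as np

def cqt_center_freqs(f_min, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies of a constant-Q filterbank:
    f_k = f_min * 2**(k / bins_per_octave)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

freqs = cqt_center_freqs(f_min=32.7, bins_per_octave=12, n_bins=25)
# adjacent bins differ by a constant ratio 2**(1/12), so every
# 12 bins the frequency doubles: freqs[12] == 2 * freqs[0]
```

This logarithmic frequency resolution gives finer detail at low frequencies than a linearly spaced spectrogram.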

System characteristics
Input mono
Sampling rate 44.1kHz
Features CQT
Classifier CRNN
PDF

LARGE-SCALE WEAKLY LABELED SEMI-SUPERVISED SOUND EVENT DETECTION IN DOMESTIC ENVIRONMENTS

Serizel, Romain and Turpault, Nicolas and Eghbal-Zadeh, Hamid and Shah, Ankit Parag
Department of Natural Language Processing & Knowledge Discovery, University of Lorraine, Loria, Nancy, France

Abstract

This paper presents DCASE 2018 Task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, etc.) and potential industrial applications.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
PDF

A CRNN-BASED SYSTEM WITH MIXUP TECHNIQUE FOR LARGE-SCALE WEAKLY LABELED SOUND EVENT DETECTION

Wang, Dezhi and Xu, Kele and Zhu, Boqing and Zhang, Lilun and Peng, Yuxing and Wang, Huaimin
School of Computer, National University of Defense Technology, Changsha, China

Abstract

The details of our method submitted to Task 4 of the DCASE 2018 challenge are described in this technical report. This task evaluates systems for the detection of sound events in domestic environments using large-scale weakly labeled data. In particular, an architecture based on the convolutional recurrent neural network (CRNN) framework is utilized to detect the timestamps of all the events in given audio clips, where the training audio files have only clip-level labels. In order to take advantage of the large-scale unlabeled in-domain training data, a deep residual network based model (ResNeXt) is first employed to predict weak labels for the unlabeled data. In addition, a mixup technique is applied in the model training process, which is believed to benefit data augmentation and model generalization. Finally, the system achieves a 22.05% class-wise average F1-score for sound event detection on the provided testing dataset.
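The mixup technique mentioned above trains on convex combinations of pairs of examples and their labels rather than on the raw examples. A minimal numpy sketch (illustrative names; in training, the mixing coefficient is typically drawn from a Beta distribution rather than fixed):

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Mixup: the same convex combination is applied to the inputs
    and to their (one-hot or multi-hot) label vectors."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])  # example of class 0
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])  # example of class 1
x, y = mixup(x1, y1, x2, y2, lam=0.7)  # both become [0.7, 0.3]
```

The soft labels discourage the model from becoming overconfident, which is the regularization effect credited to mixup.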

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, delta features
Classifier CRNN
Decision making mean probability
PDF

SELF-ATTENTION MECHANISM BASED SYSTEM FOR DCASE2018 CHALLENGE TASK1 AND TASK4

Wang, Jun and Li, Shengchen
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea

Abstract

In this technical report, we present a self-attention mechanism for Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. We take a convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) as our basic systems in Task 1 and Task 4. In this convolutional recurrent neural network (CRNN), gated linear units (GLUs) are used as the non-linearity, implementing a gating mechanism over the output of the network that selects informative local features. A self-attention mechanism called intra-attention is used to model the relationships between different positions of a single sequence over the output of the CRNN. An attention-based pooling scheme is used for localizing the specific events in Task 4 and for obtaining the final labels in Task 1. In summary, we achieve 70.81% accuracy in subtask A of Task 1. In subtask B of Task 1, we achieve 70.1% accuracy for device A, 59.4% accuracy for device B, and 55.6% accuracy for device C. For Task 4, we achieve a 26.98% F1-value for sound event detection on the old test data of the development set.
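The attention-based pooling above collapses per-frame class probabilities into a single clip-level prediction by letting learned attention scores decide which frames matter. A minimal numpy sketch (illustrative names and hand-picked scores, not the submission's learned weights):

```python
import numpy as np

def attention_pool(frame_probs, scores):
    """Softmax the per-frame attention scores over time (axis 0), then use
    them to weight per-frame class probabilities into clip-level ones."""
    w = np.exp(scores - scores.max(axis=0, keepdims=True))  # stable softmax
    w = w / w.sum(axis=0, keepdims=True)
    return (w * frame_probs).sum(axis=0)

# 3 frames x 2 classes; each class attends mostly to one frame
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.7]])
scores = np.array([[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]])
clip = attention_pool(probs, scores)  # close to each class's attended frame
```

Because the attention weights are per-frame, the same mechanism localizes events in time (Task 4) while the pooled output provides clip-level tags (Task 1).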

System characteristics
Input mono
Sampling rate 16kHz
Data augmentation time stretching, pitch shifting, reversing
Features log-mel energies
Classifier CRNN
PDF