Task description
This task focused on the detection of rare sound events in artificially created mixtures. The target sound events are baby cry, glass break, and gunshot. The training material available to participants consisted of a set of pre-generated mixtures (1500 30-second audio mixtures, totalling 12 h 30 min), a set of isolated events (474 unique events), and background recordings (1121 30-second audio recordings, totalling 9 h 20 min). A further 1500 30-second audio mixtures (12 h 30 min of audio) were used for the challenge evaluation.
A more detailed task description can be found on the task description page.
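The mixtures were created by mixing isolated event instances into background recordings at different event-to-background ratios (EBR). As a rough illustration of how one such mixture could be synthesized, the sketch below mixes an event into a background at a target EBR; the RMS-based scaling, the random onset placement, and the 44.1 kHz rate are assumptions for illustration, not the organizers' exact recipe.

```python
import numpy as np

def mix_event_into_background(background, event, ebr_db, sr=44100, rng=None):
    """Mix an isolated event into a (longer) background recording at a target
    event-to-background ratio (EBR, in dB), placing the event at a random onset.
    Sketch only; not the official mixture-generation code."""
    rng = rng or np.random.default_rng()
    # Scale the event so that its RMS relative to the background RMS
    # matches the requested EBR.
    bg_rms = np.sqrt(np.mean(background ** 2)) + 1e-12
    ev_rms = np.sqrt(np.mean(event ** 2)) + 1e-12
    event = event * (bg_rms / ev_rms) * 10.0 ** (ebr_db / 20.0)

    # Choose a random onset such that the whole event fits in the clip.
    onset = rng.integers(0, len(background) - len(event))
    mixture = background.copy()
    mixture[onset:onset + len(event)] += event
    return mixture, onset / sr   # mixture signal and onset time in seconds
```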
Challenge results
A detailed description of the metrics used can be found here.
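The tables below report event-based error rate (ER) and F-score. For this task the evaluation considers event onsets only, within a fixed time collar (500 ms in the official setup). A minimal, self-contained sketch of that onset-only matching and of how ER and F1 follow from it is given below; it ignores substitutions and offset conditions and is not the official sed_eval implementation.

```python
def onset_only_metrics(reference_onsets, predicted_onsets, collar=0.5):
    """Event-based metrics with onset-only matching (sketch).

    reference_onsets, predicted_onsets: lists of onset times (seconds) for
    one sound class.  A prediction is a true positive if it lies within
    +/- `collar` seconds of an unmatched reference onset."""
    matched = [False] * len(reference_onsets)
    tp = 0
    for p in predicted_onsets:
        for i, r in enumerate(reference_onsets):
            if not matched[i] and abs(p - r) <= collar:
                matched[i] = True
                tp += 1
                break
    fp = len(predicted_onsets) - tp          # insertions
    fn = len(reference_onsets) - tp          # deletions
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    er = (fp + fn) / max(len(reference_onsets), 1)
    return er, f1

# Example: reference onset at 4.2 s, prediction at 4.4 s -> true positive with a 0.5 s collar.
print(onset_only_metrics([4.2], [4.4]))      # (0.0, 1.0)
```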
Systems ranking
Technical Report | Code | Name | ER (overall / evaluation dataset) | F1 (overall / evaluation dataset) | ER (overall / development dataset) | F1 (overall / development dataset) |
---|---|---|---|---|---|---|
Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | 0.1600 | 91.8 | |
Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1400 | 92.9 | |
Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | 0.1400 | 92.8 | |
Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | 0.1200 | 93.6 | |
Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | 0.2600 | 85.9 | |
Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.2500 | 86.4 | |
Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | 0.2700 | 85.6 | |
Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | 0.2700 | 85.6 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | 0.1700 | 91.2 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | 0.1600 | 92.1 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | 0.2100 | 89.6 | |
Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.5300 | 72.7 | |
Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.4600 | 76.9 | |
Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | 0.6100 | 69.6 | |
Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | 0.6000 | 68.1 | |
Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | 0.6400 | 67.8 | |
Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.5500 | 72.5 | |
Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.0700 | 96.3 | |
Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | 0.0700 | 96.1 | |
Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | 0.0700 | 96.1 | |
Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | 0.0900 | 95.5 | |
Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | 0.1800 | 90.3 | |
Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | 0.1600 | 91.4 | |
Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.1800 | 90.5 | |
Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.1900 | 89.8 | |
Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.1700 | 87.8 | |
Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.2000 | 89.8 | |
Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | 0.1800 | 90.8 | |
Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | 0.1800 | 90.8 | |
Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | 0.1900 | 90.4 | |
Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.2800 | 85.0 | |
Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.3800 | 78.3 | |
Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.2800 | 85.8 |
Teams ranking
The table below includes only the best-performing system from each submitting team.
Technical Report | Code | Name | ER (overall / evaluation dataset) | F1 (overall / evaluation dataset) | ER (overall / development dataset) | F1 (overall / development dataset) |
---|---|---|---|---|---|---|
Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1400 | 92.9 | |
Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.2500 | 86.4 | |
Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.5300 | 72.7 | |
Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.4600 | 76.9 | |
Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.5500 | 72.5 | |
Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.0700 | 96.3 | |
Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.1800 | 90.5 | |
Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.1900 | 89.8 | |
Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.1700 | 87.8 | |
Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.2000 | 89.8 | |
Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.2800 | 85.0 | |
Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.3800 | 78.3 | |
Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.2800 | 85.8 |
Class-wise performance
All figures are event-based metrics computed on the evaluation dataset.
Technical Report | Code | Name | ER (average) | F1 (average) | ER / Baby cry | F1 / Baby cry | ER / Glass break | F1 / Glass break | ER / Gunshot | F1 / Gunshot |
---|---|---|---|---|---|---|---|---|---|---|
Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | 0.2720 | 87.0 | 0.0720 | 96.4 | 0.2000 | 89.5 | |
Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1840 | 90.8 | 0.1040 | 94.7 | 0.2320 | 87.4 | |
Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | 0.2720 | 87.0 | 0.1360 | 92.9 | 0.4680 | 78.0 | |
Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | 0.2120 | 89.5 | 0.1120 | 94.2 | 0.2360 | 87.3 | |
Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | 0.4760 | 75.5 | 0.3880 | 79.3 | 0.5720 | 65.2 | |
Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.4400 | 80.6 | 0.2280 | 88.5 | 0.5640 | 68.2 | |
Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | 0.4400 | 80.6 | 0.3240 | 82.4 | 0.5720 | 65.2 | |
Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | 0.4400 | 80.6 | 0.2720 | 87.1 | 0.5640 | 68.2 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | 0.4080 | 78.8 | 0.1640 | 91.5 | 0.9280 | 52.3 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | 0.4320 | 78.0 | 0.2400 | 87.5 | 0.9760 | 49.8 | |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | 0.4600 | 74.7 | 0.2320 | 87.9 | 0.9760 | 49.8 | |
Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.8040 | 66.8 | 0.3800 | 79.1 | 0.7280 | 46.5 | |
Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.8840 | 65.3 | 0.3960 | 80.2 | 0.7520 | 51.8 | |
Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | 0.8280 | 65.8 | 0.4240 | 77.8 | 0.6480 | 52.9 | |
Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | 0.9160 | 61.8 | 0.5280 | 69.3 | 0.7680 | 41.1 | |
Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | 0.7400 | 68.2 | 0.4440 | 76.2 | 0.6800 | 55.3 | |
Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.7800 | 67.4 | 0.3240 | 82.4 | 0.6960 | 59.5 | |
Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.1520 | 92.2 | 0.0480 | 97.6 | 0.1920 | 89.6 | |
Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | 0.1520 | 92.4 | 0.0600 | 97.0 | 0.1920 | 89.6 | |
Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | 0.1520 | 92.5 | 0.1120 | 94.6 | 0.1920 | 89.6 | |
Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | 0.1720 | 91.7 | 0.1520 | 92.9 | 0.1920 | 89.6 | |
Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | 0.2760 | 86.4 | 0.1800 | 90.2 | 0.5640 | 62.0 | |
Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | 0.2840 | 86.5 | 0.1600 | 91.5 | 0.5440 | 65.7 | |
Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.2640 | 87.3 | 0.1600 | 91.5 | 0.5280 | 67.2 | |
Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.2840 | 85.7 | 0.2200 | 88.8 | 0.3280 | 81.6 | |
Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.5000 | 75.9 | 0.2360 | 87.8 | 0.5440 | 71.9 | |
Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.3560 | 83.0 | 0.3120 | 84.7 | 0.3120 | 84.0 | |
Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | 0.3680 | 82.4 | 0.3280 | 83.8 | 0.3360 | 82.3 | |
Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | 0.3240 | 84.3 | 0.2960 | 85.1 | 0.3600 | 80.3 | |
Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | 0.3240 | 84.3 | 0.2960 | 85.1 | 0.3600 | 80.3 | |
Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.4400 | 77.3 | 0.2120 | 89.1 | 0.6440 | 53.9 | |
Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.5680 | 70.7 | 0.3560 | 81.0 | 0.5680 | 66.0 | |
Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.1720 | 91.4 | 0.2200 | 89.1 | 0.5480 | 72.0 |
System characteristics
Technical Report | Code | Name | ER (overall / evaluation dataset) | F1 (overall / evaluation dataset) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making |
---|---|---|---|---|---|---|---|---|---|---|
Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | median filtering, same architecture in separate models for each class | |
Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | median filtering, ensemble of 7 best overall architectures | |
Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | median filtering, best architecture for each class | |
Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | median filtering, ensemble of 7 best architectures for each class | |
Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | mono | 44.1kHz |  | log-mel energies | CRNN | majority vote |
Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | mono | 44.1kHz |  | log-mel energies | CRNN | majority vote |
Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | mono | 44.1kHz |  | log-mel energies | CRNN | majority vote |
Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | mono | 44.1kHz |  | log-mel energies | CRNN | majority vote |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | mono | 44.1kHz |  | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | mono | 44.1kHz |  | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | mono | 44.1kHz |  | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | mono | 44.1kHz |  | log-mel energies | MLP | median filtering |
Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | mono | 44.1kHz | mixture generation | log-mel energies from NMF source separation | MLP | median filtering | |
Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | mono | 44.1kHz |  | DNN(MFCC) | Bi-LSTM | top output probability |
Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | mono | 44.1kHz |  | DNN(MFCC) | Bi-LSTM | top output probability |
Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | mono | 44.1kHz |  | DNN(MFCC) | DNN | top output probability |
Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | mono | 44.1kHz |  | DNN(MFCC) | Bi-LSTM | top output probability |
Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | thresholding | |
Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | thresholding | |
Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | thresholding | |
Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | mono | 44.1kHz | mixture generation | log-mel energies | CRNN | thresholding | |
Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | mono | 44.1kHz |  | spectrogram | CNN | majority vote |
Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | mono | 44.1kHz |  | spectrogram | CNN | majority vote |
Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | mono | 44.1kHz |  | spectrogram | CNN | majority vote |
Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | mono | 44.1kHz |  | log Gammatone cepstral coefficients | tailored-loss DNN+CNN | median filtering |
Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | mono | 44.1kHz |  | log-mel spectrograms, MFCC | MLP, CNN, RNN | median filtering, ensembling, hard thresholding |
Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | mono | 44.1kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | mono | 44.1kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | mono | 44.1kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | mono | 44.1kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | mono | 44.1kHz |  | log-mel energies | DNN | median filtering |
Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | mono | 44.1kHz | mixture generation | MFCC, log-mel energies | DNN, HMM | maxout | |
Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | mono | 44.1kHz |  | spectrogram | NMF | moving average filter |
Technical reports
Convolutional Recurrent Neural Networks for Rare Sound Event Detection
Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Cakir_TUT_task2_1 Cakir_TUT_task2_2 Cakir_TUT_task2_3 Cakir_TUT_task2_4
Abstract
Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content of samples from the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift-invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer-term temporal context from the extracted high-level features. In this paper, we propose combining the two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated on the DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene recordings. The CRNN provides a significant performance improvement over two other deep-learning-based methods, mainly due to its capability for longer-term temporal modeling.
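As a rough illustration of the CRNN idea described in the abstract (not the authors' exact configuration), the sketch below stacks convolutional blocks that pool only along the frequency axis, a recurrent (GRU) layer for longer-term temporal context, and a frame-wise sigmoid output; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Frame-wise event activity detector: CNN over (time, mel) -> GRU -> sigmoid."""
    def __init__(self, n_mels=40, n_classes=1):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 5)),        # pool over frequency only, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        freq_out = n_mels // 5 // 4      # frequency bins left after pooling
        self.rnn = nn.GRU(64 * freq_out, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (batch, 1, time, n_mels)
        h = self.cnn(x)                          # (batch, 64, time, freq_out)
        h = h.permute(0, 2, 1, 3).flatten(2)     # (batch, time, 64 * freq_out)
        h, _ = self.rnn(h)
        return torch.sigmoid(self.out(h))        # frame-wise activity probabilities

probs = CRNN()(torch.randn(2, 1, 256, 40))       # e.g. 2 clips, 256 frames, 40 mel bands
```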
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixture generation |
Features | log-mel energies |
Classifier | CRNN |
Decision making | median filtering, same architecture in separate models for each class; median filtering, ensemble of 7 best overall architectures; median filtering, best architecture for each class; median filtering, ensemble of 7 best architectures for each class |
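Several submissions, including these, report median filtering of the frame-wise activity probabilities as the decision-making step. A minimal sketch of that post-processing, with a hypothetical frame hop and threshold, is:

```python
import numpy as np
from scipy.signal import medfilt

def detect_onset(frame_probs, threshold=0.5, kernel=27, hop_seconds=0.02):
    """Smooth frame-wise probabilities with a median filter, binarize,
    and return the onset time of the first active frame (or None)."""
    smoothed = medfilt(frame_probs, kernel_size=kernel)   # kernel size must be odd
    active = smoothed > threshold
    if not active.any():
        return None
    return int(np.argmax(active)) * hop_seconds           # first active frame -> seconds
```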
Deep Learning for DCASE2017 Challenge
An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Dang_NCU_task2_1 Dang_NCU_task2_2 Dang_NCU_task2_3 Dang_NCU_task2_4
Abstract
This paper reports our results on all tasks of the DCASE 2017 challenge: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are built on two widely used neural network architectures, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | majority vote |
Bosch Rare Sound Events Detection Systems for DCASE2017 Challenge
Shabnam Ghaffarzadegan1, Asif Salekin2, Samarjit Das1 and Zhe Feng1
1Human Machine Interaction, Robert Bosch Research and Technology Center, Palo Alto, USA, 2Computer science, University of Virginia, Virginia, USA
Ghaffarzadegan_BOSCH_task2_1 Ghaffarzadegan_BOSCH_task2_2 Ghaffarzadegan_BOSCH_task2_3 Ravichandran_BOSCH_task2_4
Abstract
In this report, we describe three systems designed at BOSCH for the rare sound event detection task of the DCASE 2017 challenge. The first system performs end-to-end audio event segmentation using embeddings from a deep convolutional neural network (DCNN) and a deep recurrent neural network (DRNN) trained on mel-filterbank and spectrogram features. Systems 2 and 3 each consist of two parts: audio event tagging and audio event segmentation. Audio event tagging selects the positive audio recordings (those containing audio events), which are then processed by the segmentation part. A feature selection method is used to select a subset of features in both systems. System 2 applies a dilated convolutional neural network to the selected features for audio tagging, and an audio-codebook approach to convert audio features into audio vectors (Audio2vec), which are then passed to an LSTM network to predict audio event boundaries. System 3 casts audio event tagging and segmentation as a multiple-instance learning problem solved with a variational autoencoder (VAE). As in system 2, an LSTM network is used for audio segmentation. Finally, we use score fusion across the different systems to improve the final results.
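Systems 2 and 3 described above share a two-stage structure: a clip-level tagger first decides whether a recording contains the target event at all, and only positive recordings are passed to the segmentation model that localizes the event. A sketch of that control flow, with hypothetical `tagger` and `segmenter` stand-ins for the trained models, is:

```python
def detect(recording, tagger, segmenter, tag_threshold=0.5):
    """Two-stage rare sound event detection (sketch).

    tagger(recording)    -> clip-level probability that the event is present
    segmenter(recording) -> (onset_seconds, offset_seconds) within the clip
    Both callables are hypothetical stand-ins for the trained models."""
    if tagger(recording) < tag_threshold:
        return None                     # tagged as negative: no event reported
    return segmenter(recording)         # only positive clips are segmented
```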
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, ZCR, energy, spectral centroid, pitch; log-mel spectrograms, MFCC |
Classifier | ensemble; MLP, CNN, RNN |
Decision making | thresholding; median filtering, ensembling, hard thresholding |
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP |
Decision making | median filtering |
Nonnegative Matrix Factorization-Based Source Separation with Online Noise Learning for Detection of Rare Sound Events
Kwang Myung Jeon and Hong Kook Kim
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
Jeon_GIST_task2_1
Abstract
In this paper, a source separation method based on non-negative matrix factorization (NMF) with online noise learning (ONL) is proposed for the robust detection of rare sound events. The proposed method models each rare sound event as a combination of acoustic dictionaries consisting of multiple spectral bases. In addition, ONL is adopted during separation to improve robustness against unseen noises. The spectra of the sound event separated by the proposed method act as a feature vector for a deep neural network (DNN)-based binary classifier, which determines whether the event has occurred. Evaluation results on the DCASE 2017 Task 2 dataset show that the proposed source separation method improved the F-score of the baseline DNN classifier by 6.30% while decreasing the error rate by 14.81% on average.
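To make the separation step concrete, the sketch below factorizes a magnitude spectrogram with a fixed, pre-trained event dictionary while adapting a set of noise bases to the input signal, as a stand-in for the online noise learning described above. It uses plain Euclidean multiplicative updates and is not the authors' implementation.

```python
import numpy as np

def separate_event(V, W_event, n_noise=8, n_iter=100, eps=1e-9, rng=None):
    """Supervised NMF separation (sketch).

    V        : magnitude spectrogram, shape (n_freq, n_frames)
    W_event  : fixed event dictionary, shape (n_freq, n_event_bases)
    Returns the reconstructed event spectrogram W_event @ H_event."""
    rng = rng or np.random.default_rng(0)
    n_freq, n_frames = V.shape
    W_noise = rng.random((n_freq, n_noise))              # adapted to the input signal
    H = rng.random((W_event.shape[1] + n_noise, n_frames))

    for _ in range(n_iter):
        W = np.hstack([W_event, W_noise])
        H *= (W.T @ V) / (W.T @ W @ H + eps)             # update all activations
        Hn = H[W_event.shape[1]:]                        # noise activations only
        W_noise *= (V @ Hn.T) / (W @ H @ Hn.T + eps)     # adapt only the noise bases

    return W_event @ H[:W_event.shape[1]]
```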
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixture generation |
Features | log-mel energies from NMF source separation |
Classifier | MLP |
Decision making | median filtering |
Audio Events Detection and Classification Using Extended R-FCN Approach
Wang Kaiwu, Yang Liping and Yang Bin
Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China
Liping_CQU_task2_1 Liping_CQU_task2_2 Liping_CQU_task2_3
Abstract
In this study, we present a new audio event detection and classification approach based on R-FCN, a state-of-the-art fully convolutional network framework for visual object detection. Spectrogram features of the audio signals are used as input. Like R-FCN, the proposed approach consists of two parts. In the first part, audio events are detected by sliding a convolutional kernel along the time axis, and proposals that possibly contain audio events are generated by a Region Proposal Network (RPN). In the second part, time and frequency information are integrated to classify these proposals and refine their boundaries, as in R-FCN. Our approach can process audio signals of arbitrary length without any post-processing. Experiments on the IEEE DCASE Challenge 2017 Task 2 dataset show that the proposed approach achieves strong performance: an average F-score of 91.4% and an error rate of 0.16 on the event-based evaluation metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | CNN |
Decision making | majority vote |
The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification
Yanxiong Li and Xianku Li
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Li_SCUT_task2_1 Li_SCUT_task2_2 Li_SCUT_task2_3 Li_SCUT_task2_4
Abstract
In this report, we present our work on three tasks of the IEEE AASP challenge on DCASE 2017: task 1, Acoustic Scene Classification (ASC); task 2, detection of rare sound events in artificially created mixtures; and task 3, sound event detection in real-life recordings. Tasks 2 and 3 address the same problem, Sound Event Detection (SED). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built to generate the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and a Recurrent Neural Network (RNN) with Bi-directional Long Short-Term Memory (Bi-LSTM) fed by the DAF is then built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for tasks 1 and 2, and our system for task 3 performs as well as the baseline in terms of the predominant metrics.
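The DAF pipeline can be pictured as a frame-wise DNN whose bottleneck activations become the feature sequence fed to a Bi-LSTM classifier. A rough PyTorch sketch with placeholder layer sizes (not the authors' exact networks) is given below.

```python
import torch
import torch.nn as nn

class DAFExtractor(nn.Module):
    """MFCC frames -> deep audio feature (bottleneck activations)."""
    def __init__(self, n_mfcc=39, daf_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc, 256), nn.ReLU(),
            nn.Linear(256, daf_dim), nn.ReLU(),      # bottleneck = DAF
        )
    def forward(self, x):                            # x: (batch, time, n_mfcc)
        return self.net(x)                           # (batch, time, daf_dim)

class BiLSTMDetector(nn.Module):
    """DAF sequence -> frame-wise event probabilities."""
    def __init__(self, daf_dim=64, n_classes=1):
        super().__init__()
        self.lstm = nn.LSTM(daf_dim, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)
    def forward(self, daf):
        h, _ = self.lstm(daf)
        return torch.sigmoid(self.out(h))

daf = DAFExtractor()(torch.randn(2, 100, 39))        # 2 clips, 100 frames of MFCCs
probs = BiLSTMDetector()(daf)                        # (2, 100, 1)
```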
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | DNN(MFCC) |
Classifier | Bi-LSTM; DNN |
Decision making | top output probability |
Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks
Hyungui Lim1, Jeongsoo Park2,3 and Yoonchang Han1
1Cochlear.ai, Seoul, Korea, 2Cochlear.ai, Seoul, Korea, 3Music and Audio Research Group, Seoul National University, Seoul, Korea
Lim_COCAI_task2_1 Lim_COCAI_task2_2 Lim_COCAI_task2_3 Lim_COCAI_task2_4
Abstract
Rare sound event detection is a newly proposed task in IEEE DCASE 2017 that aims to identify the presence of a monophonic sound event classified as an emergency and to detect its onset time. In this paper, we introduce a rare sound event detection system that combines a 1D convolutional neural network (1D ConvNet) and a recurrent neural network (RNN) with long short-term memory units (LSTM). A log-amplitude mel-spectrogram is used as the input acoustic feature, and the 1D ConvNet is applied to each time segment to convert the spectral feature. The RNN-LSTM is then used to model the temporal dependency of the extracted features. The system is evaluated on the DCASE 2017 Challenge Task 2 dataset. Our best result on the test set of the development dataset is an error rate of 0.07 and an F-score of 96.26 on the event-based metric.
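The distinguishing detail here is that the convolution is one-dimensional and applied within each time segment, over the frequency axis, and the resulting per-segment embeddings are passed to an LSTM. A rough sketch with placeholder sizes (not the authors' exact network):

```python
import torch
import torch.nn as nn

class OneDCRNN(nn.Module):
    def __init__(self, n_mels=128, n_classes=1):
        super().__init__()
        # 1D convolution along the frequency axis of each time frame.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(8),                 # fixed-size per-frame embedding
        )
        self.rnn = nn.LSTM(32 * 8, 64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):                            # x: (batch, time, n_mels)
        b, t, m = x.shape
        h = self.conv(x.reshape(b * t, 1, m))        # convolution applied frame by frame
        h = h.reshape(b, t, -1)                      # (batch, time, 32 * 8)
        h, _ = self.rnn(h)
        return torch.sigmoid(self.out(h))            # frame-wise event probabilities

probs = OneDCRNN()(torch.randn(2, 200, 128))
```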
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixture generation |
Features | log-mel energies |
Classifier | CRNN |
Decision making | thresholding |
DNN and CNN with Weighted and Multi-Task Loss Functions for Audio Event Detection
Huy Phan1, Martin Krawczyk-Becker2, Timo Gerkmann2 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Department of Informatics, University of Hamburg, Hamburg, Germany
Phan_UniLuebeck_task2_1
Abstract
This report presents our audio event detection system submitted for Task 2, "Detection of rare sound events", of DCASE 2017 challenge. The proposed system is based on convolutional neural networks (CNNs) and deep neural networks (DNNs) coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement. The loss functions are tailored for audio event detection in audio streams. The weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition. Our proposed systems significantly outperform the challenge baseline, improving F-score from 72.7% to 89.8% and reducing detection error rate from 0.53 to 0.19 on average.
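The weighted loss addresses the heavy imbalance between background and event frames by weighting the two classes differently in the frame-wise cross-entropy. A minimal PyTorch sketch is shown below; the weighting scheme and the weight value are illustrative, not the authors' exact formulation.

```python
import torch

def weighted_bce(probs, targets, event_weight=10.0):
    """Frame-wise binary cross-entropy in which (rare) event frames are
    up-weighted relative to background frames (sketch).

    probs, targets: tensors of shape (batch, time) with values in [0, 1]."""
    eps = 1e-7
    weights = torch.where(targets > 0.5,
                          torch.full_like(targets, event_weight),
                          torch.ones_like(targets))
    loss = -(targets * torch.log(probs + eps)
             + (1.0 - targets) * torch.log(1.0 - probs + eps))
    return (weights * loss).mean()

# Example: mostly-background targets, uniform 0.5 predictions.
targets = torch.zeros(1, 100); targets[0, 40:60] = 1.0
print(weighted_bce(torch.full((1, 100), 0.5), targets))
```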
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log Gammatone cepstral coefficients |
Classifier | tailored-loss DNN+CNN |
Decision making | median filtering |
Bosch Rare Sound Events Detection Systems for DCASE2017 Challenge
Anravich Ravichandran1 and Samarjit Das2
1Computer science, University of California, San Diego, USA, 2Human Machine Interaction, Robert Bosch Research and Technology Center, Palo Alto, USA
Ghaffarzadegan_BOSCH_task2_1 Ghaffarzadegan_BOSCH_task2_2 Ghaffarzadegan_BOSCH_task2_3 Ravichandran_BOSCH_task2_4
Abstract
In this report, we describe three systems designed at BOSCH for the rare sound event detection task of the DCASE 2017 challenge. The first system performs end-to-end audio event segmentation using embeddings from a deep convolutional neural network (DCNN) and a deep recurrent neural network (DRNN) trained on mel-filterbank and spectrogram features. Systems 2 and 3 each consist of two parts: audio event tagging and audio event segmentation. Audio event tagging selects the positive audio recordings (those containing audio events), which are then processed by the segmentation part. A feature selection method is used to select a subset of features in both systems. System 2 applies a dilated convolutional neural network to the selected features for audio tagging, and an audio-codebook approach to convert audio features into audio vectors (Audio2vec), which are then passed to an LSTM network to predict audio event boundaries. System 3 casts audio event tagging and segmentation as a multiple-instance learning problem solved with a variational autoencoder (VAE). As in system 2, an LSTM network is used for audio segmentation. Finally, we use score fusion across the different systems to improve the final results.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, ZCR, energy, spectral centroid, pitch; log-mel spectrograms, MFCC |
Classifier | ensemble; MLP, CNN, RNN |
Decision making | thresholding; median filtering, ensembling, hard thresholding |
A Hierarchic Multi-Scaled Approach for Rare Sound Event Detection
Fabio Vesperini, Diego Droghini, Daniele Ferretti, Emanuele Principi, Leonardo Gabrielli, Stefano Squartini and Francesco Piazza
Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
Vesperini_UNIVPM_task2_1 Vesperini_UNIVPM_task2_2 Vesperini_UNIVPM_task2_3 Vesperini_UNIVPM_task2_4
Abstract
We propose a system for rare sound event detection using a hierarchical, multi-scaled approach based on a Multi-Layer Perceptron (MLP) and Convolutional Neural Networks (CNNs). It is our contribution to the rare sound event detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). The task consists of detecting event onsets in artificially generated mixtures. Acoustic features are extracted from the audio signals; a first event detection stage is then performed by an MLP-based neural network, which proposes contiguous blocks of frames to the second stage. A CNN refines the event detection of the prior network, intrinsically operating at a multi-scaled resolution and discarding blocks containing background that the MLP wrongly classified as events. Finally, the effective onset time of the active event is obtained. The overall error rate and F-measure achieved on the development test set are 0.18 and 90.9%, respectively.
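The first stage can be pictured as a frame-wise MLP whose binary decisions are grouped into contiguous blocks of candidate frames, which are then handed to the CNN stage for refinement. A small sketch of that block-proposal step, with a hypothetical frame hop and minimum block length, is:

```python
import numpy as np

def propose_blocks(frame_decisions, hop_seconds=0.02, min_frames=5):
    """Group contiguous positive frame decisions into candidate blocks.

    frame_decisions: boolean array, one MLP decision per frame.
    Returns a list of (start_seconds, end_seconds) blocks for the CNN stage."""
    blocks, start = [], None
    for i, active in enumerate(np.append(frame_decisions, False)):
        if active and start is None:
            start = i
        elif not active and start is not None:
            if i - start >= min_frames:               # discard very short runs
                blocks.append((start * hop_seconds, i * hop_seconds))
            start = None
    return blocks
```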
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixture generation |
Features | log-mel energies |
Classifier | MLP, CNN |
Decision making | thresholding |
Multi-Frame Concatenation for Detection of Rare Sound Events Based on Deep Neural Network
Jun Wang and Shengchen Li
Embedded Artificial Intelligence Laboratory, Beijing University of Posts and Telecommunications, Beijing, China
Wang_BUPT_task2_1
Abstract
This paper proposes a Sound Event Detection (SED) system based on Deep Neural Networks (DNNs). Three DNN-based classifiers are trained to detect the three target sound events, baby cry, glass break, and gunshot, in the provided audio streams. The paper investigates the influence of frame concatenation on sound event detection; our results show that the number of concatenated frames affects the accuracy of SED. The proposed SED system is tested on the development dataset of the "Detection of Rare Sound Events" task of the DCASE 2017 Challenge. On event-based metrics, the average F-score and Error Rate (ER) are 84.98% and 0.28, respectively.
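Frame concatenation here means stacking each frame together with a window of neighbouring frames so that the DNN sees local temporal context. A short sketch follows; the context size is a placeholder.

```python
import numpy as np

def concatenate_frames(features, context=5):
    """Stack each frame with `context` frames on each side.

    features: array of shape (n_frames, n_features)
    Returns an array of shape (n_frames, (2 * context + 1) * n_features);
    the edges are handled by repeating the first/last frame."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    stacked = [padded[i:i + len(features)] for i in range(2 * context + 1)]
    return np.concatenate(stacked, axis=1)

x = np.random.rand(100, 40)               # 100 frames of 40 log-mel energies
print(concatenate_frames(x).shape)        # (100, 440)
```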
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | DNN |
Decision making | median filtering |
Transfer Learning Based DNN-HMM Hybrid System for Rare Sound Event Detection
Jianfei Wang, Weiqiang Zhang and Jia Liu
Speech and Audio Technology Laboratory, Tsinghua University, Beijing, China
Wang_THU_task2_1
Abstract
In this paper, we propose an improved Deep Neural Network-Hidden Markov Model (DNN-HMM) hybrid system for rare sound event detection. The proposed system leverages transfer learning in the neural network training stage. Experimental results indicate that transfer learning is particularly effective when training samples are insufficient. We use a Multi-Layer Perceptron (MLP) system and a standard DNN-HMM system as baselines. Evaluation on the DCASE 2017 task 2 development dataset shows that the proposed system outperforms the MLP and DNN-HMM baselines, achieving an average error rate (ER) of 0.38 and an F1-score of 78.3% on the event-based evaluation. The average error rate of the proposed system is 15% and 8% (absolute) lower than that of the MLP and DNN-HMM systems, respectively.
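One common way to realize such transfer learning is to initialize from a network trained on related data, freeze its lower layers, and fine-tune only the upper layers on the scarce target-event data. The PyTorch sketch below shows that pattern; it is not necessarily the authors' exact procedure.

```python
import torch.nn as nn

def prepare_for_finetuning(pretrained: nn.Sequential, n_classes=2, n_frozen=2):
    """Freeze the first `n_frozen` layers of a pretrained feed-forward network
    and replace its output layer for the target task (sketch)."""
    for layer in list(pretrained.children())[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False              # keep transferred weights fixed
    layers = list(pretrained.children())
    in_features = layers[-1].in_features             # assume the last layer is nn.Linear
    layers[-1] = nn.Linear(in_features, n_classes)   # new task-specific output layer
    return nn.Sequential(*layers)

# Example: a pretrained MLP whose lower layers are reused for the rare-event task.
pretrained = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 256),
                           nn.ReLU(), nn.Linear(256, 10))
model = prepare_for_finetuning(pretrained)
```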
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixture generation |
Features | MFCC, log-mel energies |
Classifier | DNN, HMM |
Decision making | maxout |
Robust Sound Event Detection Through Noise Estimation and Source Separation Using NMF
Qing Zhou and Zuren Feng
Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China
Zhou_XJTU_task2_1
Abstract
This paper addresses the problem of sound event detection under non-stationary noises and various real-world acoustic scenes. An effective noise reduction strategy is proposed in this paper which can automatically adapt to background variations. The proposed method is based on supervised non-negative matrix factorization (NMF) for separating target events from noise. The event dictionary is trained offline using the training data of the target event class while the noise dictionary is learned online from the input signal by sparse and low-rank decomposition. Incorporating the estimated noise bases, this method can produce accurate source separation results by reducing noise residue and signal distortion of the reconstructed event spectrogram. Experimental results on DCASE 2017 task 2 dataset show that the proposed method outperforms the baseline system based on multi-layer perceptron classifiers and also another NMF-based method which employs a semi-supervised strategy for noise reduction.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | NMF |
Decision making | moving average filter |