Detection of rare sound events



Task description

This task focused on the detection of rare sound events in artificially created mixtures. The target sound events were baby cry, glass break, and gunshot. The training material available to participants comprised a set of ready-made mixtures (1500 30-second audio mixtures, totalling 12 h 30 min), a set of isolated events (474 unique events), and background recordings (1121 30-second recordings, totalling 9 h 20 min). A further 1500 30-second audio mixtures (12 h 30 min of audio) were used for the challenge evaluation.
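Several submissions below list "mixture generation" as data augmentation, i.e. synthesizing extra training mixtures by adding an isolated event to a background recording at a chosen event-to-background ratio (EBR), in the same spirit as the official training mixtures. A minimal numpy sketch of the idea (the RMS-based scaling and all parameter values are illustrative assumptions, not the official mixture synthesizer):

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2) + 1e-12)

def mix_event(background, event, sr, onset_s, ebr_db):
    """Add an isolated event to a background at the given
    event-to-background ratio (dB), starting at onset_s seconds.
    Returns the mixture and the (onset, offset) annotation."""
    gain = rms(background) / rms(event) * 10 ** (ebr_db / 20.0)
    start = int(onset_s * sr)
    end = min(start + len(event), len(background))
    mixture = background.copy()
    mixture[start:end] += gain * event[:end - start]
    return mixture, (onset_s, end / sr)

# Example: a 30 s background with an event at 12.3 s, EBR 0 dB.
sr = 44100
background = np.random.randn(30 * sr) * 0.01   # stand-in for a real recording
event = np.random.randn(sr) * 0.1              # stand-in for an isolated event
mixture, (onset, offset) = mix_event(background, event, sr, 12.3, 0.0)
```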

A more detailed task description can be found on the task description page.

Challenge results

A detailed description of the metrics used can be found here.
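The headline numbers below are the event-based error rate (ER, lower is better) and F-score (F1, in %, higher is better). For this task an output event counts as correct when its onset falls close enough to a reference onset (the official setup matches onsets within a 500 ms collar; the sed_eval toolbox is the reference implementation). A simplified, single-class Python sketch of that computation:

```python
def event_based_scores(ref_onsets, est_onsets, collar=0.5):
    """Simplified single-class, onset-only scoring.
    ER = (insertions + deletions) / n_ref; F1 = 2*TP / (2*TP + FP + FN)."""
    ref, est = sorted(ref_onsets), sorted(est_onsets)
    matched, tp = set(), 0
    for onset in est:
        # Greedily match to the closest unmatched reference onset in range.
        best, best_d = None, collar
        for i, r in enumerate(ref):
            if i not in matched and abs(onset - r) <= best_d:
                best, best_d = i, abs(onset - r)
        if best is not None:
            matched.add(best)
            tp += 1
    fp, fn = len(est) - tp, len(ref) - tp
    er = (fp + fn) / max(len(ref), 1)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    return er, f1

print(event_based_scores([1.2, 10.0], [1.4, 5.0]))  # (1.0, 0.5)
```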

Systems ranking

Event-based metrics (overall), on the evaluation and development datasets; F1 in %.

| Technical report | Code | Name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev) |
| --- | --- | --- | --- | --- | --- | --- |
| Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | 0.1600 | 91.8 |
| Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1400 | 92.9 |
| Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | 0.1400 | 92.8 |
| Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | 0.1200 | 93.6 |
| Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | 0.2600 | 85.9 |
| Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.2500 | 86.4 |
| Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | 0.2700 | 85.6 |
| Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | 0.2700 | 85.6 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | 0.1700 | 91.2 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | 0.1600 | 92.1 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | 0.2100 | 89.6 |
| Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.5300 | 72.7 |
| Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.4600 | 76.9 |
| Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | 0.6100 | 69.6 |
| Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | 0.6000 | 68.1 |
| Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | 0.6400 | 67.8 |
| Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.5500 | 72.5 |
| Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.0700 | 96.3 |
| Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | 0.0700 | 96.1 |
| Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | 0.0700 | 96.1 |
| Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | 0.0900 | 95.5 |
| Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | 0.1800 | 90.3 |
| Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | 0.1600 | 91.4 |
| Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.1800 | 90.5 |
| Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.1900 | 89.8 |
| Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.1700 | 87.8 |
| Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.2000 | 89.8 |
| Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | 0.1800 | 90.8 |
| Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | 0.1800 | 90.8 |
| Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | 0.1900 | 90.4 |
| Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.2800 | 85.0 |
| Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.3800 | 78.3 |
| Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.2800 | 85.8 |

Teams ranking

The table includes only the best-performing system from each submitting team.

Event-based metrics (overall), on the evaluation and development datasets; F1 in %.

| Technical report | Code | Name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev) |
| --- | --- | --- | --- | --- | --- | --- |
| Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1400 | 92.9 |
| Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.2500 | 86.4 |
| Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.5300 | 72.7 |
| Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.4600 | 76.9 |
| Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.5500 | 72.5 |
| Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.0700 | 96.3 |
| Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.1800 | 90.5 |
| Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.1900 | 89.8 |
| Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.1700 | 87.8 |
| Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.2000 | 89.8 |
| Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.2800 | 85.0 |
| Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.3800 | 78.3 |
| Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.2800 | 85.8 |

Class-wise performance

Event-based metrics on the evaluation dataset; the first ER/F1 pair is the class average, followed by per-class results. F1 in %.

| Technical report | Code | Name | ER (avg) | F1 (avg) | Baby cry ER | Baby cry F1 | Glass break ER | Glass break F1 | Gunshot ER | Gunshot F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | 0.2720 | 87.0 | 0.0720 | 96.4 | 0.2000 | 89.5 |
| Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | 0.1840 | 90.8 | 0.1040 | 94.7 | 0.2320 | 87.4 |
| Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | 0.2720 | 87.0 | 0.1360 | 92.9 | 0.4680 | 78.0 |
| Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | 0.2120 | 89.5 | 0.1120 | 94.2 | 0.2360 | 87.3 |
| Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | 0.4760 | 75.5 | 0.3880 | 79.3 | 0.5720 | 65.2 |
| Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | 0.4400 | 80.6 | 0.2280 | 88.5 | 0.5640 | 68.2 |
| Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | 0.4400 | 80.6 | 0.3240 | 82.4 | 0.5720 | 65.2 |
| Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | 0.4400 | 80.6 | 0.2720 | 87.1 | 0.5640 | 68.2 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | 0.4080 | 78.8 | 0.1640 | 91.5 | 0.9280 | 52.3 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | 0.4320 | 78.0 | 0.2400 | 87.5 | 0.9760 | 49.8 |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | 0.4600 | 74.7 | 0.2320 | 87.9 | 0.9760 | 49.8 |
| Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | 0.8040 | 66.8 | 0.3800 | 79.1 | 0.7280 | 46.5 |
| Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | 0.8840 | 65.3 | 0.3960 | 80.2 | 0.7520 | 51.8 |
| Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | 0.8280 | 65.8 | 0.4240 | 77.8 | 0.6480 | 52.9 |
| Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | 0.9160 | 61.8 | 0.5280 | 69.3 | 0.7680 | 41.1 |
| Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | 0.7400 | 68.2 | 0.4440 | 76.2 | 0.6800 | 55.3 |
| Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | 0.7800 | 67.4 | 0.3240 | 82.4 | 0.6960 | 59.5 |
| Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | 0.1520 | 92.2 | 0.0480 | 97.6 | 0.1920 | 89.6 |
| Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | 0.1520 | 92.4 | 0.0600 | 97.0 | 0.1920 | 89.6 |
| Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | 0.1520 | 92.5 | 0.1120 | 94.6 | 0.1920 | 89.6 |
| Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | 0.1720 | 91.7 | 0.1520 | 92.9 | 0.1920 | 89.6 |
| Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | 0.2760 | 86.4 | 0.1800 | 90.2 | 0.5640 | 62.0 |
| Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | 0.2840 | 86.5 | 0.1600 | 91.5 | 0.5440 | 65.7 |
| Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | 0.2640 | 87.3 | 0.1600 | 91.5 | 0.5280 | 67.2 |
| Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | 0.2840 | 85.7 | 0.2200 | 88.8 | 0.3280 | 81.6 |
| Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | 0.5000 | 75.9 | 0.2360 | 87.8 | 0.5440 | 71.9 |
| Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | 0.3560 | 83.0 | 0.3120 | 84.7 | 0.3120 | 84.0 |
| Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | 0.3680 | 82.4 | 0.3280 | 83.8 | 0.3360 | 82.3 |
| Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | 0.3240 | 84.3 | 0.2960 | 85.1 | 0.3600 | 80.3 |
| Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | 0.3240 | 84.3 | 0.2960 | 85.1 | 0.3600 | 80.3 |
| Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | 0.4400 | 77.3 | 0.2120 | 89.1 | 0.6440 | 53.9 |
| Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | 0.5680 | 70.7 | 0.3560 | 81.0 | 0.5680 | 66.0 |
| Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | 0.1720 | 91.4 | 0.2200 | 89.1 | 0.5480 | 72.0 |

System characteristics

ER and F1 are event-based (overall) on the evaluation dataset; F1 in %.

| Technical report | Code | Name | ER | F1 | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cakir2017 | Cakir_TUT_task2_1 | CRNN-1 | 0.1813 | 91.0 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | median filtering, same architecture in separate models for each class |
| Cakir2017 | Cakir_TUT_task2_2 | CRNN-2 | 0.1733 | 91.0 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | median filtering, ensemble of 7 best overall architectures |
| Cakir2017 | Cakir_TUT_task2_3 | CRNN-3 | 0.2920 | 86.0 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | median filtering, best architecture for each class |
| Cakir2017 | Cakir_TUT_task2_4 | CRNN-4 | 0.1867 | 90.3 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | median filtering, ensemble of 7 best architectures for each class |
| Dang2017 | Dang_NCU_task2_1 | CRNN | 0.4787 | 73.3 | mono | 44.1 kHz | | log-mel energies | CRNN | majority vote |
| Dang2017 | Dang_NCU_task2_2 | andang2 | 0.4107 | 79.1 | mono | 44.1 kHz | | log-mel energies | CRNN | majority vote |
| Dang2017 | Dang_NCU_task2_3 | andang2 | 0.4453 | 76.1 | mono | 44.1 kHz | | log-mel energies | CRNN | majority vote |
| Dang2017 | Dang_NCU_task2_4 | andang2 | 0.4253 | 78.6 | mono | 44.1 kHz | | log-mel energies | CRNN | majority vote |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_1 | BOSCH21 | 0.5000 | 74.2 | mono | 44.1 kHz | | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_2 | BOSCH22 | 0.5493 | 71.8 | mono | 44.1 kHz | | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
| Ghaffarzadegan2017 | Ghaffarzadegan_BOSCH_task2_3 | BOSCH23 | 0.5560 | 70.8 | mono | 44.1 kHz | | MFCC, ZCR, energy, spectral centroid, pitch | ensemble | thresholding |
| Heittola2017 | DCASE2017 baseline | Baseline | 0.6373 | 64.1 | mono | 44.1 kHz | | log-mel energies | MLP | median filtering |
| Jeon2017 | Jeon_GIST_task2_1 | NMF_SS+DNN | 0.6773 | 65.8 | mono | 44.1 kHz | mixture generation | log-mel energies from NMF source separation | MLP | median filtering |
| Li2017 | Li_SCUT_task2_1 | LiSCUTt2_1 | 0.6333 | 65.5 | mono | 44.1 kHz | | DNN(MFCC) | Bi-LSTM | top output probability |
| Li2017 | Li_SCUT_task2_2 | LiSCUTt2_2 | 0.7373 | 57.4 | mono | 44.1 kHz | | DNN(MFCC) | Bi-LSTM | top output probability |
| Li2017 | Li_SCUT_task2_3 | LiSCUTt2_3 | 0.6213 | 66.6 | mono | 44.1 kHz | | DNN(MFCC) | DNN | top output probability |
| Li2017 | Li_SCUT_task2_4 | LiSCUTt2_4 | 0.6000 | 69.8 | mono | 44.1 kHz | | DNN(MFCC) | Bi-LSTM | top output probability |
| Lim2017 | Lim_COCAI_task2_1 | 1dCRNN1 | 0.1307 | 93.1 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | thresholding |
| Lim2017 | Lim_COCAI_task2_2 | 1dCRNN2 | 0.1347 | 93.0 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | thresholding |
| Lim2017 | Lim_COCAI_task2_3 | 1dCRNN3 | 0.1520 | 92.2 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | thresholding |
| Lim2017 | Lim_COCAI_task2_4 | 1dCRNN4 | 0.1720 | 91.4 | mono | 44.1 kHz | mixture generation | log-mel energies | CRNN | thresholding |
| Kaiwu2017 | Liping_CQU_task2_1 | E-RFCN | 0.3400 | 79.5 | mono | 44.1 kHz | | spectrogram | CNN | majority vote |
| Kaiwu2017 | Liping_CQU_task2_2 | E-RFCN | 0.3293 | 81.2 | mono | 44.1 kHz | | spectrogram | CNN | majority vote |
| Kaiwu2017 | Liping_CQU_task2_3 | E-RFCN | 0.3173 | 82.0 | mono | 44.1 kHz | | spectrogram | CNN | majority vote |
| Phan2017 | Phan_UniLuebeck_task2_1 | AED-Net | 0.2773 | 85.3 | mono | 44.1 kHz | | log Gammatone cepstral coefficients | tailored-loss DNN+CNN | median filtering |
| Ravichandran2017 | Ravichandran_BOSCH_task2_4 | BOSCH24 | 0.4267 | 78.6 | mono | 44.1 kHz | | log-mel spectrograms, MFCC | MLP, CNN, RNN | median filtering, ensembling, hard thresholding |
| Vesperini2017 | Vesperini_UNIVPM_task2_1 | A3LAB | 0.3267 | 83.9 | mono | 44.1 kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
| Vesperini2017 | Vesperini_UNIVPM_task2_2 | A3LAB | 0.3440 | 82.8 | mono | 44.1 kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
| Vesperini2017 | Vesperini_UNIVPM_task2_3 | A3LAB | 0.3267 | 83.2 | mono | 44.1 kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
| Vesperini2017 | Vesperini_UNIVPM_task2_4 | A3LAB | 0.3267 | 83.2 | mono | 44.1 kHz | mixture generation | log-mel energies | MLP, CNN | thresholding |
| Wang2017 | Wang_BUPT_task2_1 | MFC_WJ | 0.4320 | 73.4 | mono | 44.1 kHz | | log-mel energies | DNN | median filtering |
| Wang2017a | Wang_THU_task2_1 | Baseline | 0.4973 | 72.6 | mono | 44.1 kHz | mixture generation | MFCC, log-mel energies | DNN, HMM | maxout |
| Zhou2017 | Zhou_XJTU_task2_1 | SLR-NMF | 0.3133 | 84.2 | mono | 44.1 kHz | | spectrogram | NMF | moving average filter |

Technical reports

Convolutional Recurrent Neural Networks for Rare Sound Event Detection

Abstract

Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content of samples from the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift-invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer-term temporal context from the extracted high-level features. In this paper, we propose combining the two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated on the DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. The CRNN provides a significant performance improvement over two other deep learning based methods, mainly due to its capability for longer-term temporal modeling.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Data augmentation: mixture generation
Features: log-mel energies
Classifier: CRNN
Decision making: median filtering, same architecture in separate models for each class; median filtering, ensemble of 7 best overall architectures; median filtering, best architecture for each class; median filtering, ensemble of 7 best architectures for each class
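As an illustration of the CRNN pattern this report describes (convolutional blocks over log-mel frames, then a recurrent layer and frame-wise sigmoid outputs), here is a minimal PyTorch sketch; layer sizes and pooling factors are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv blocks pool over frequency only, so the time resolution of the
    per-frame event activity output is preserved for onset detection."""
    def __init__(self, n_mels=40, n_classes=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 5)),                 # pool frequency, keep time
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        # After 5x and 4x frequency pooling, n_mels shrinks by a factor of 20.
        self.gru = nn.GRU(32 * (n_mels // 20), 32, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):                         # x: (batch, time, mels)
        z = self.conv(x.unsqueeze(1))             # (batch, ch, time, mels')
        z = z.permute(0, 2, 1, 3).flatten(2)      # (batch, time, ch * mels')
        z, _ = self.gru(z)
        return torch.sigmoid(self.out(z))         # frame-wise event activity

probs = CRNN()(torch.randn(2, 500, 40))           # (2, 500, 1)
```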

Deep Learning for DCASE2017 Challenge

Abstract

This paper reports our results on all tasks of the DCASE 2017 challenge: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are developed based on two widely used neural network families, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: log-mel energies
Classifier: CRNN
Decision making: majority vote

Bosch Rare Sound Events Detection Systems for DCASE2017 Challenge

Abstract

In this report, we describe three systems designed at BOSCH for the rare sound event detection task of the DCASE 2017 challenge. The first system performs end-to-end audio event segmentation using embeddings based on a deep convolutional neural network (DCNN) and a deep recurrent neural network (DRNN) trained on mel filter bank and spectrogram features. Systems 2 and 3 both contain two parts: audio event tagging and audio event segmentation. Audio event tagging selects the positive audio recordings (those containing audio events), which are then processed by the audio segmentation part. A feature selection method is deployed to select a subset of features in both systems. System 2 employs a dilated convolutional neural network on the selected features for audio tagging, and an audio-codebook approach to convert audio features to audio vectors (Audio2vec system), which are then passed to an LSTM network for audio event boundary prediction. System 3 treats the problem as multiple instance learning and uses a variational autoencoder (VAE) to perform audio event tagging and segmentation. As in system 2, an LSTM network is used for audio segmentation. Finally, we use score fusion among the different systems to improve the final results.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: MFCC, ZCR, energy, spectral centroid, pitch; log-mel spectrograms, MFCC
Classifier: ensemble; MLP, CNN, RNN
Decision making: thresholding; median filtering, ensembling, hard thresholding
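Systems 2 and 3 share a tag-then-segment control flow: a clip-level tagger first decides whether a recording contains the target event at all, and only positive clips reach the boundary-prediction stage. A minimal sketch of that flow, with both models as hypothetical placeholder callables (`tagger`, `segmenter`), not the authors' actual networks:

```python
import numpy as np

def detect(features, tagger, segmenter, tag_thr=0.5, seg_thr=0.5):
    """Two-stage detection: clip-level gate, then frame-level boundaries.
    tagger(features) returns one clip probability; segmenter(features)
    returns per-frame probabilities."""
    if tagger(features) < tag_thr:
        return []                          # tagged negative: no event output
    frame_probs = segmenter(features)
    active = frame_probs > seg_thr
    # Turn the binary frame mask into (onset_frame, offset_frame) pairs.
    edges = np.flatnonzero(np.diff(np.concatenate(([0], active, [0]))))
    return list(zip(edges[::2], edges[1::2]))

# Toy usage with stand-in models.
feats = np.random.randn(100, 40)
events = detect(feats,
                tagger=lambda f: 0.9,
                segmenter=lambda f: (np.arange(len(f)) // 25 % 2).astype(float))
```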

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using a multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision-making process, as well as in the evaluation of system output using task-specific metrics.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: log-mel energies
Classifier: MLP
Decision making: median filtering
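The baseline recipe (a frame-wise MLP on stacked log-mel energies, with median filtering of the binary frame decisions) is easy to sketch. Layer sizes, the context window, and the 0.5 threshold are assumptions in the spirit of the baseline, not the released code:

```python
import torch
import torch.nn as nn
from scipy.signal import medfilt

mlp = nn.Sequential(                       # frame-wise binary classifier
    nn.Linear(40 * 5, 50), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(50, 50), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(50, 1), nn.Sigmoid(),
)

frames = torch.randn(500, 200)             # 500 frames of stacked log-mels
probs = mlp(frames).detach().numpy().ravel()
# Median-filter the hard decisions to remove isolated spurious frames.
decisions = medfilt((probs > 0.5).astype(float), kernel_size=27)
```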

Nonnegative Matrix Factorization-Based Source Separation with Online Noise Learning for Detection of Rare Sound Events

Abstract

In this paper, a source separation method based on nonnegative matrix factorization (NMF) with online noise learning (ONL) is proposed for the robust detection of rare sound events. The proposed method models the rare sound event as combinations of acoustic dictionaries, which consist of multiple spectral bases. In addition, ONL is adopted during the separation to improve robustness against unseen noises. The spectra of the sound event separated by the proposed method act as a feature vector for the deep neural network (DNN)-based binary classifier, which determines whether the event has occurred. Evaluation results on the DCASE 2017 Task 2 dataset show that the proposed source separation method improved the F-score of the baseline DNN classifier by 6.30% while decreasing the error rate by 14.81% on average.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Data augmentation: mixture generation
Features: log-mel energies from NMF source separation
Classifier: MLP
Decision making: median filtering
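The core separation step can be sketched as factoring the mixture spectrogram V ≈ [W_event, W_noise] H, where the event dictionary W_event is trained offline and held fixed while the noise dictionary is updated on the input signal (the "online noise learning"). A minimal numpy sketch with Euclidean-cost multiplicative updates; the authors' exact cost function and update schedule are not reproduced here:

```python
import numpy as np

def separate(V, W_event, n_noise=10, n_iter=100, eps=1e-9):
    """Factor V (freq x time) as [W_event, W_noise] @ H with W_event fixed.
    Returns the reconstructed event spectrogram."""
    F, T = V.shape
    K = W_event.shape[1]
    W_noise = np.abs(np.random.rand(F, n_noise))
    H = np.abs(np.random.rand(K + n_noise, T))
    for _ in range(n_iter):
        W = np.hstack([W_event, W_noise])
        H *= (W.T @ V) / (W.T @ W @ H + eps)           # update all activations
        # Update only the noise bases; the event dictionary stays fixed.
        W_noise *= (V @ H[K:].T) / (W @ H @ H[K:].T + eps)
    return W_event @ H[:K]                              # event part of the mix

V = np.abs(np.random.rand(257, 400))                    # mixture magnitude
W_event = np.abs(np.random.rand(257, 20))               # trained offline
V_event = separate(V, W_event)
```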

Audio Events Detection and Classification Using Extended R-FCN Approach

Abstract

In this study, we present a new audio event detection and classification approach based on R-FCN, a state-of-the-art fully convolutional network framework for visual object detection. Spectrogram features of the audio signals are used as input. Like the R-FCN network, the proposed approach consists of two parts. In the first part, we detect whether audio events are present by sliding a convolutional kernel along the time axis, and proposals that possibly contain audio events are generated by a Region Proposal Network (RPN). In the second part, time and frequency information are integrated to classify these proposals and refine their boundaries with R-FCN. Our approach can process audio signals of arbitrary length without any post-processing. Experiments on the dataset of the IEEE DCASE 2017 Challenge Task 2 show that the proposed approach achieves strong performance: an average F-score of 91.4% and an error rate of 0.16 on the event-based evaluation metrics.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: spectrogram
Classifier: CNN
Decision making: majority vote
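The full R-FCN machinery is out of scope for a snippet, but its first stage, sliding a convolutional detector along the time axis of the spectrogram and merging positive windows into candidate event regions, can be illustrated in a deliberately simplified form (`score_fn` is a hypothetical window classifier; the actual RPN/R-FCN is far more elaborate):

```python
import numpy as np

def propose_events(spec, score_fn, win=50, hop=10, thr=0.5):
    """Slide a window over the time axis of spec (freq x time), score each
    window, and merge overlapping positive windows into event proposals
    given as (start_frame, end_frame) pairs."""
    T = spec.shape[1]
    hits = [(s, s + win) for s in range(0, T - win + 1, hop)
            if score_fn(spec[:, s:s + win]) > thr]
    proposals = []
    for s, e in hits:
        if proposals and s <= proposals[-1][1]:        # overlaps the previous
            proposals[-1][1] = max(proposals[-1][1], e)
        else:
            proposals.append([s, e])
    return [tuple(p) for p in proposals]

spec = np.abs(np.random.rand(257, 1000))
events = propose_events(spec, score_fn=lambda w: float(w.mean() > 0.5))
```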

The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification

Abstract

In this report, we present our work on three tasks of the IEEE AASP DCASE 2017 challenge, i.e. task 1: Acoustic Scene Classification (ASC), task 2: detection of rare sound events in artificially created mixtures, and task 3: sound event detection in real-life recordings. Tasks 2 and 3 address the same problem, Sound Event Detection (SED). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built to generate the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) with Bi-directional Long Short-Term Memory (Bi-LSTM) fed by the DAF is built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for tasks 1 and 2, and our system for task 3 performs as well as the baseline in terms of the predominant metrics.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: DNN(MFCC)
Classifier: Bi-LSTM; DNN
Decision making: top output probability
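The pipeline is: a DNN trained on MFCCs whose hidden activation serves as the Deep Audio Feature (DAF), followed by a Bi-LSTM over the DAF sequence. A minimal PyTorch sketch with illustrative sizes (not the authors' configuration):

```python
import torch
import torch.nn as nn

daf_net = nn.Sequential(                  # DNN whose hidden layer is the DAF
    nn.Linear(60, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),        # this activation is taken as DAF
)
bilstm = nn.LSTM(64, 32, batch_first=True, bidirectional=True)
head = nn.Linear(64, 1)

mfcc = torch.randn(2, 500, 60)            # (batch, frames, MFCC dim)
daf = daf_net(mfcc)                       # (2, 500, 64) deep audio features
h, _ = bilstm(daf)
frame_probs = torch.sigmoid(head(h))      # (2, 500, 1) event activity
```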

Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks

Abstract

Rare sound event detection is a newly proposed task in IEEE DCASE 2017: identifying the presence of a monophonic sound event classified as an emergency and detecting the onset time of the event. In this paper, we introduce a rare sound event detection system combining a 1D convolutional neural network (1D ConvNet) and a recurrent neural network (RNN) with long short-term memory units (LSTM). A log-amplitude mel-spectrogram is used as the input acoustic feature, and the 1D ConvNet is applied in each time-domain segment to transform the spectral feature. The RNN-LSTM is then used to model the temporal dependency of the extracted features. The system is evaluated on the DCASE 2017 Challenge Task 2 dataset. Our best result on the test set of the development dataset is an error rate of 0.07 and an F-score of 96.26 on the event-based metric.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Data augmentation: mixture generation
Features: log-mel energies
Classifier: CRNN
Decision making: thresholding
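The distinctive element is the 1D convolution applied across the frequency axis of each time segment before the LSTM models temporal dependency. A minimal PyTorch sketch (filter counts and strides are assumptions):

```python
import torch
import torch.nn as nn

class ConvNet1D_LSTM(nn.Module):
    """1D convolution over the mel-frequency axis of every frame,
    then an LSTM over the resulting frame sequence."""
    def __init__(self, n_mels=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv1d(32, 32, 8, stride=4), nn.ReLU(),
        )
        # Probe the conv stack once to get its flattened output size.
        conv_out = self.conv(torch.zeros(1, 1, n_mels)).numel()
        self.lstm = nn.LSTM(conv_out, 64, batch_first=True)
        self.out = nn.Linear(64, 1)

    def forward(self, x):                       # x: (batch, time, mels)
        b, t, m = x.shape
        z = self.conv(x.reshape(b * t, 1, m))   # conv along frequency
        z = z.reshape(b, t, -1)
        h, _ = self.lstm(z)
        return torch.sigmoid(self.out(h))       # frame-wise event activity

probs = ConvNet1D_LSTM()(torch.randn(2, 300, 128))   # (2, 300, 1)
```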

DNN and CNN with Weighted and Multi-Task Loss Functions for Audio Event Detection

Abstract

This report presents our audio event detection system submitted for Task 2, "Detection of rare sound events", of the DCASE 2017 challenge. The proposed system is based on convolutional neural networks (CNNs) and deep neural networks (DNNs) coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement. The loss functions are tailored for audio event detection in audio streams. The weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification, while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition. Our proposed systems significantly outperform the challenge baseline, improving the F-score from 72.7% to 89.8% and reducing the detection error rate from 0.53 to 0.19 on average.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: log Gammatone cepstral coefficients
Classifier: tailored-loss DNN+CNN
Decision making: median filtering
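Both loss ideas can be sketched directly: a class-weighted binary cross-entropy against the background/foreground imbalance, plus an auxiliary term so the network jointly models class distribution and temporal structure. The 0.9/0.1 weights and the regression form of the auxiliary task are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_multitask_loss(act_logits, act_target,
                            aux_pred, aux_target,
                            w_fg=0.9, w_bg=0.1, alpha=0.5):
    """Class-weighted BCE on frame-wise event activity, plus an
    auxiliary loss (here: regression of event boundaries)."""
    weights = act_target * w_fg + (1.0 - act_target) * w_bg
    act_loss = F.binary_cross_entropy_with_logits(
        act_logits, act_target, weight=weights)
    aux_loss = F.mse_loss(aux_pred, aux_target)
    return act_loss + alpha * aux_loss

loss = weighted_multitask_loss(
    torch.randn(4, 500), torch.randint(0, 2, (4, 500)).float(),
    torch.rand(4, 2), torch.rand(4, 2))
```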

A Hierarchic Multi-Scaled Approach for Rare Sound Event Detection

Abstract

We propose a system for rare sound event detection using a hierarchical and multi-scaled approach based on a Multilayer Perceptron (MLP) and Convolutional Neural Networks (CNNs). It is our contribution to the rare sound event detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). The task consists of detecting event onsets in artificially generated mixtures. Acoustic features are extracted from the acoustic signals, then a first event detection stage is performed by an MLP-based neural network, which proposes contiguous blocks of frames to the second stage. The CNN refines the event detection of the prior network, intrinsically operating at a multi-scaled resolution and discarding blocks containing background that the MLP wrongly classified as events. Finally, the effective onset time of the active event is obtained. The overall error rate and F-measure achieved on the development test set are 0.18 and 90.9%, respectively.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Data augmentation: mixture generation
Features: log-mel energies
Classifier: MLP, CNN
Decision making: thresholding
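The hierarchy is mostly control flow: the MLP cheaply flags candidate blocks of frames, and the CNN re-scores each contiguous block to discard background that was wrongly accepted. A minimal sketch with hypothetical placeholder scorers (`mlp_prob`, `cnn_prob`); the real models and block handling are in the report:

```python
import numpy as np

def hierarchical_detect(feats, mlp_prob, cnn_prob, thr=0.5):
    """Stage 1: MLP flags candidate frames. Stage 2: CNN re-scores each
    contiguous candidate block and keeps or discards its onset."""
    candidate = mlp_prob(feats) > thr
    onsets, start = [], None
    for i, flag in enumerate(np.append(candidate, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if cnn_prob(feats[start:i]) > thr:     # CNN refinement
                onsets.append(start)
            start = None
    return onsets

feats = np.random.randn(600, 40)
onsets = hierarchical_detect(feats,
                             mlp_prob=lambda f: np.random.rand(len(f)),
                             cnn_prob=lambda b: float(len(b) >= 8))
```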

Multi-Frame Concatenation for Detection of Rare Sound Events Based on Deep Neural Network

Abstract

This paper proposes a Sound Event Detection (SED) system based on a Deep Neural Network (DNN). Three DNN-based classifiers are trained to detect three target sound events, baby cry, glass break, and gunshot, in the provided audio streams. The paper investigates the influence of different frame concatenations when detecting sound events. Our results illustrate that the number of frames concatenated affects the accuracy of SED. The proposed SED system is tested on the development dataset provided for the rare sound event detection task of the DCASE 2017 Challenge. On average, the event-based F-score and Error Rate (ER) are 84.98% and 0.28, respectively.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: log-mel energies
Classifier: DNN
Decision making: median filtering
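Frame concatenation, the variable this paper studies, simply stacks each frame with its N neighbours on either side so the DNN sees temporal context. A numpy sketch (edge padding is an assumption):

```python
import numpy as np

def concat_frames(feats, n_context=5):
    """Stack each frame with n_context frames on each side.
    feats: (n_frames, n_dims) -> (n_frames, (2*n_context+1)*n_dims)."""
    padded = np.pad(feats, ((n_context, n_context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)]
                      for i in range(2 * n_context + 1)])

feats = np.random.randn(500, 40)
stacked = concat_frames(feats, n_context=5)   # (500, 440)
```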

Transfer Learning Based DNN-HMM Hybrid System for Rare Sound Event Detection

Abstract

In this paper, we propose an improved Deep Neural Network-Hidden Markov Model (DNN-HMM) hybrid system for rare sound event detection. The proposed system leverages transfer learning in the neural network training stage. Experimental results indicate that transfer learning is more efficient when training samples are insufficient. We use a Multilayer Perceptron (MLP) system and a standard DNN-HMM system as baselines. Evaluation on the DCASE 2017 Task 2 development dataset shows that our proposed system outperforms the MLP and DNN-HMM baselines, achieving an average error rate (ER) of 0.38 and an F1-score of 78.3% on the event-based evaluation. The average error rate of the proposed system is 15% and 8% (absolute) lower than that of the MLP and DNN-HMM systems, respectively.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Data augmentation: mixture generation
Features: MFCC, log-mel energies
Classifier: DNN, HMM
Decision making: maxout
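The transfer-learning step can be sketched as: initialize the DNN from a checkpoint trained on related data, freeze the lower layers, and fine-tune the rest on the scarce target-event data. Layer split, sizes, and the checkpoint path are illustrative assumptions; the HMM decoding stage is not shown:

```python
import torch
import torch.nn as nn

# DNN acoustic model; the input dim assumes stacked context frames.
dnn = nn.Sequential(
    nn.Linear(440, 256), nn.ReLU(),   # lower layers: transferable front end
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3),                # output layer: per-state scores
)
# dnn.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

# Freeze the first block; fine-tune only the upper layers.
for layer in list(dnn.children())[:2]:
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in dnn.parameters() if p.requires_grad), lr=1e-4)
```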

Robust Sound Event Detection Through Noise Estimation and Source Separation Using NMF

Abstract

This paper addresses the problem of sound event detection under non-stationary noises and various real-world acoustic scenes. An effective noise reduction strategy is proposed that can automatically adapt to background variations. The proposed method is based on supervised non-negative matrix factorization (NMF) for separating target events from noise. The event dictionary is trained offline using the training data of the target event class, while the noise dictionary is learned online from the input signal by sparse and low-rank decomposition. Incorporating the estimated noise bases, this method produces accurate source separation results by reducing noise residue and signal distortion in the reconstructed event spectrogram. Experimental results on the DCASE 2017 Task 2 dataset show that the proposed method outperforms the baseline system based on multilayer perceptron classifiers, as well as another NMF-based method that employs a semi-supervised strategy for noise reduction.

System characteristics
Input: mono
Sampling rate: 44.1 kHz
Features: spectrogram
Classifier: NMF
Decision making: moving average filter
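The listed decision stage, a moving-average filter over the frame-wise event activation before thresholding, fits in a few lines; the window length, threshold, and frame hop are assumptions:

```python
import numpy as np

def smooth_and_detect(activation, win=21, thr=0.5, hop_s=0.02):
    """Moving-average-smooth a frame activation curve, threshold it,
    and return the onset time (s) of the first active region, if any."""
    kernel = np.ones(win) / win
    smooth = np.convolve(activation, kernel, mode="same")
    active = np.flatnonzero(smooth > thr)
    return active[0] * hop_s if active.size else None

activation = np.clip(np.random.randn(1500) * 0.1, 0, None)
activation[600:700] += 1.0                       # injected event activity
print(smooth_and_detect(activation))             # roughly 12 s
```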