Sound event detection in synthetic audio



Task description

A detailed task description is available on the task description page.

Challenge results

Here you can find complete information on the submissions for Task 2: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.

A detailed description of the metrics used can be found here.
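For orientation, the segment-based ER and F1 reported in the tables below combine per-segment substitutions, deletions, and insertions over binary activity annotations. The following is a minimal numpy sketch of those definitions, not the official evaluation code:

```python
import numpy as np

def segment_based_metrics(ref, sys):
    """Segment-based ER and F1 from binary activity matrices.

    ref, sys: arrays of shape (n_segments, n_classes), 1 where a class
    is active in a segment. Per-segment substitutions S, deletions D and
    insertions I give ER = (S + D + I) / N, with N the total number of
    active reference entries; F1 comes from pooled TP/FP/FN counts.
    Assumes at least one active reference entry.
    """
    ref = np.asarray(ref, dtype=bool)
    sys = np.asarray(sys, dtype=bool)
    tp = np.logical_and(ref, sys).sum(axis=1)   # correct detections per segment
    fp = np.logical_and(~ref, sys).sum(axis=1)  # false alarms per segment
    fn = np.logical_and(ref, ~sys).sum(axis=1)  # misses per segment
    S = np.minimum(fn, fp).sum()                # substitutions
    D = np.maximum(0, fn - fp).sum()            # deletions
    I = np.maximum(0, fp - fn).sum()            # insertions
    N = ref.sum()
    er = (S + D + I) / max(N, 1)
    precision = tp.sum() / max(sys.sum(), 1)
    recall = tp.sum() / max(N, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return er, f1
```

For example, with two segments and two classes, a system that misses one reference entry and inserts one extra gets ER = 2/3 and F1 = 2/3.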



Systems ranking

Columns: segment-based ER and F1 (overall, evaluation dataset); event-based ER and F1 (overall, onset-only, evaluation dataset); segment-based ER and F1 (overall, development dataset). F1 values are percentages; a dash marks results not reported by the authors.

| Technical report | Code | Name | Seg. ER (eval) | Seg. F1 (eval) | Event ER (eval) | Event F1 (eval) | Seg. ER (dev) | Seg. F1 (dev) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | 0.6178 | 67.1 | 0.1379 | 92.6 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | 1.6852 | 24.2 | 0.7859 | 41.6 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | 1.3490 | 34.2 | – | – |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | 1.3064 | 25.7 | 0.4973 | 67.7 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | 0.6004 | 68.2 | 0.2424 | 87.7 |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.4958 | 76.0 | 0.6448 | 67.0 | 0.2591 | 87.2 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | 0.4624 | 73.8 | – | – |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | 2.5999 | 2.5 | – | 17.4 |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | 1.0123 | 39.8 | 0.1420 | 92.8 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | 0.7379 | 36.8 | 0.5075 | 64.0 |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | 3.1818 | 18.1 | 0.3412 | 81.2 |

Teams ranking

This table includes only the best-performing system per submitting team.

Columns as in the systems ranking above; a dash marks results not reported by the authors.

| Technical report | Code | Name | Seg. ER (eval) | Seg. F1 (eval) | Event ER (eval) | Event F1 (eval) | Seg. ER (dev) | Seg. F1 (dev) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | 0.6178 | 67.1 | 0.1379 | 92.6 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | 1.6852 | 24.2 | 0.7859 | 41.6 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | 1.3490 | 34.2 | – | – |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | 1.3064 | 25.7 | 0.4973 | 67.7 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | 0.6004 | 68.2 | 0.2424 | 87.7 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | 0.4624 | 73.8 | – | – |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | 2.5999 | 2.5 | – | 17.4 |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | 1.0123 | 39.8 | 0.1420 | 92.8 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | 0.7379 | 36.8 | 0.5075 | 64.0 |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | 3.1818 | 18.1 | 0.3412 | 81.2 |

Class-wise performance

Each cell gives segment-based ER / F1 (%); the Average column is the class-based average.

| Technical report | Code | Name | Average | Clearing throat | Coughing | Door knock | Door slam | Drawer | Keyboard | Keys | Human laughter | Page turning | Phone ringing | Speech |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Choi2016 | Choi_task2_1 | Choi | 0.4447 / 74.2 | 0.3373 / 83.1 | 0.4175 / 77.8 | 0.2294 / 88.2 | 1.1836 / 5.4 | 0.6000 / 65.0 | 0.1873 / 91.3 | 0.2933 / 84.8 | 0.4566 / 76.1 | 0.5609 / 75.6 | 0.2190 / 88.6 | 0.4066 / 81.0 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 1.1066 / 33.2 | 0.8956 / 49.4 | 1.0561 / 6.2 | 0.8674 / 63.1 | 2.9855 / 15.1 | 0.9833 / 3.3 | 0.9100 / 22.4 | 0.6400 / 62.7 | 1.3193 / 40.8 | 1.0032 / 2.5 | 0.7956 / 40.6 | 0.7163 / 59.3 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.8479 / 55.5 | 0.7510 / 54.9 | 0.9474 / 47.3 | 0.6667 / 61.6 | 2.3575 / 29.7 | 0.7867 / 38.5 | 0.4112 / 77.1 | 0.4400 / 74.3 | 0.9440 / 52.2 | 0.7083 / 51.2 | 0.5815 / 60.6 | 0.7329 / 62.6 |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.2537 / 34.2 | 0.7510 / 54.0 | 0.9053 / 25.0 | 1.8566 / 42.4 | 3.0628 / 4.5 | 0.9033 / 33.1 | 0.5426 / 67.5 | 1.1156 / 12.5 | 0.7647 / 47.6 | 0.9744 / 5.6 | 0.4793 / 71.3 | 13.4350 / 12.8 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.5228 / 78.6 | 0.5422 / 77.2 | 0.2947 / 84.7 | 0.2115 / 90.2 | 0.7295 / 67.8 | 0.8000 / 60.4 | 0.2822 / 87.5 | 0.3733 / 83.2 | 0.5042 / 78.4 | 1.6186 / 55.0 | 0.2238 / 88.5 | 0.1702 / 91.3 |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.6055 / 76.6 | 0.4458 / 78.2 | 0.3825 / 82.9 | 0.4875 / 80.1 | 1.3043 / 55.0 | 0.7867 / 64.9 | 0.2749 / 87.8 | 0.2356 / 88.8 | 0.7983 / 70.2 | 1.5032 / 56.9 | 0.1679 / 91.5 | 0.2742 / 86.0 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3851 / 77.1 | 0.4458 / 76.8 | 0.5719 / 62.5 | 0.1828 / 90.7 | 0.8309 / 37.7 | 0.4400 / 74.7 | 0.2311 / 87.9 | 0.3867 / 81.0 | 0.4090 / 77.3 | 0.2853 / 84.5 | 0.1290 / 93.1 | 0.3239 / 82.2 |
| Kong2016 | Kong_task2_1 | Kong | 3.8264 / 11.7 | 0.9880 / 3.1 | 0.9825 / 3.5 | 1.0072 / 15.6 | 1.1643 / 2.4 | 1.1900 / 11.8 | 14.4939 / 12.1 | 1.0000 / 0.0 | 0.9916 / 1.7 | 17.8109 / 9.7 | 1.0487 / 32.8 | 1.4137 / 36.1 |
| Phan2016 | Phan_task2_1 | Phan | 0.7051 / 59.7 | 0.6506 / 62.7 | 0.8491 / 47.2 | 0.6595 / 73.1 | 1.3092 / 15.6 | 0.7433 / 62.1 | 0.4599 / 79.5 | 0.7956 / 40.5 | 0.7199 / 64.1 | 0.5929 / 62.9 | 0.4136 / 77.8 | 0.5626 / 71.7 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 1.1604 / 35.5 | 1.4096 / 40.0 | 1.3088 / 21.5 | 0.8244 / 41.9 | 0.9855 / 37.8 | 1.6300 / 28.2 | 0.7956 / 44.5 | 1.0400 / 29.1 | 1.0476 / 52.4 | 1.5801 / 41.0 | 0.9513 / 10.5 | 1.1915 / 44.0 |
| Vu2016 | Vu_task2_1 | Vu | 1.1240 / 56.8 | 0.8635 / 61.0 | 0.9789 / 47.1 | 0.4875 / 74.2 | 1.0338 / 32.3 | 5.2467 / 24.5 | 0.3650 / 83.0 | 0.6889 / 62.1 | 0.7927 / 54.1 | 0.7051 / 60.1 | 0.6180 / 55.6 | 0.5839 / 71.1 |

System characteristics

Segment-based ER and F1 (%) are overall results on the evaluation dataset.

| Technical report | Code | Name | Seg. ER | Seg. F1 | Features | Classifier |
| --- | --- | --- | --- | --- | --- | --- |
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | mel energy | DNN |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | VQT | NMF |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | various | CNMF |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | MFCC | kNN |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | mel filterbank | BLSTM-PP |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.4958 | 76.0 | mel filterbank | BLSTM-HMM |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | VQT | NMF-MLD |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | mel filterbank | DNN |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | Gammatone cepstrum | Random forests |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | Bark scale coefficients | Template matching |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | CQT | RNN |

Technical reports

DCASE2016 Task 2 Baseline

Abstract

The Task 2 baseline system implements a basic approach for detecting overlapping acoustic events and provides a comparison point for participants while they develop their systems. It is based on supervised non-negative matrix factorization (NMF) and performs detection using a dictionary of spectral templates extracted during the training phase. The output of the NMF system is a non-binary event-activation matrix, which is post-processed into a list of detected events.
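The pipeline described above can be sketched as follows: a pre-learned dictionary W is held fixed while multiplicative updates solve for the activation matrix, which is then thresholded into events. This is a simplified numpy illustration of supervised NMF detection, not the actual baseline code; the threshold and hop values are placeholders:

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-9):
    """Solve V ≈ W @ H for H only, with the dictionary W held fixed.

    V: magnitude spectrogram (n_freq, n_frames); W: spectral templates
    (n_freq, n_atoms), e.g. atoms learned from isolated training events.
    Multiplicative updates minimising Euclidean distance.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def activations_to_events(h, threshold, hop=1.0):
    """Post-process one activation row into (onset, offset) pairs by
    thresholding and collecting contiguous active runs."""
    active = h > threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start * hop, t * hop))
            start = None
    if start is not None:
        events.append((start * hop, len(active) * hop))
    return events
```

With `hop` set to the analysis hop in seconds, each activation row yields a list of timed events for the corresponding template.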

System characteristics
Features VQT
Classifier NMF

DNN-Based Sound Event Detection with Exemplar-Based Approach for Noise Reduction

Abstract

In this paper, we present a sound event detection system based on a deep neural network (DNN). An exemplar-based noise reduction approach is proposed for enhancing the mel-band energy features, and a multi-label DNN classifier is trained for polyphonic event detection. The system is evaluated on the IEEE DCASE 2016 Challenge Task 2 train/development datasets. On the development set, it yields up to 0.9261 and 0.1379 in terms of F-score and error rate on the segment-based metric, respectively.
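One generic form of exemplar-based enhancement replaces each noisy feature frame with a combination of its nearest clean exemplars. The sketch below illustrates that idea with a plain k-nearest-neighbour average over mel-band energies; the submitted system's actual reduction scheme may differ:

```python
import numpy as np

def exemplar_enhance(noisy, exemplars, k=5):
    """Enhance mel-energy frames by exemplar-based reconstruction.

    Each noisy frame (rows of `noisy`, shape (n_frames, n_mels)) is
    replaced by the average of its k nearest neighbours in a set of
    clean exemplar frames (shape (n_exemplars, n_mels)).
    """
    noisy = np.asarray(noisy, float)
    exemplars = np.asarray(exemplars, float)
    # pairwise squared Euclidean distances, shape (n_frames, n_exemplars)
    d2 = ((noisy[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]   # indices of k nearest exemplars
    return exemplars[idx].mean(axis=1)    # average them per frame
```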

System characteristics
Features mel energy
Classifier DNN

Improved Dictionary Selection and Detection Schemes in Sparse-CNMF-Based Overlapping Acoustic Event Detection

Abstract

In this paper, we investigate sparse convolutive non-negative matrix factorization (sparse-CNMF) for detecting overlapped acoustic events in single-channel audio, within the experimental framework of Task 2 of the DCASE’16 challenge. In particular, our main focus lies on the efficient creation of the dictionary, as well as on the detection scheme associated with the CNMF approach. Specifically, we propose a shift-invariant dictionary reduction method that outperforms standard CNMF-based dictionary building. Further, we develop a novel detection algorithm that combines information from the CNMF activation matrix and atom-based reconstruction residuals, achieving significant improvement over the conventional approach based on the activations alone. The resulting system, evaluated on the development set of Task 2 of the DCASE’16 Challenge, also achieves large gains over the traditional NMF baseline provided by the Challenge organizers.

System characteristics
Features various
Classifier CNMF

Synthetic Sound Event Detection Based on MFCC

Abstract

This paper presents a sound event detection system based on mel-frequency cepstral coefficients and a non-parametric classifier. System performance is tested using the training and development datasets corresponding to the second task of the DCASE 2016 challenge. Results indicate that the most relevant spectral information for event detection is below 8000 Hz and that the general shape of the spectral envelope is much more relevant than its fine details.
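The non-parametric classification stage can be illustrated with a frame-wise k-nearest-neighbour vote over MFCC vectors. This is a minimal numpy sketch; the paper's exact configuration (value of k, distance measure, feature extraction) is not reproduced here:

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Frame-wise k-nearest-neighbour classification of MFCC vectors.

    train_X: (n_train, n_mfcc) reference frames with labels train_y;
    test_X: (n_test, n_mfcc) frames to classify. Returns, per test
    frame, the majority label among its k nearest training frames
    (Euclidean distance).
    """
    train_X = np.asarray(train_X, float)
    test_X = np.asarray(test_X, float)
    train_y = np.asarray(train_y)
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours
    preds = []
    for row in train_y[nn]:                     # labels of the neighbours
        vals, counts = np.unique(row, return_counts=True)
        preds.append(vals[np.argmax(counts)])   # majority vote
    return np.array(preds)
```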

System characteristics
Features MFCC
Classifier kNN

Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection

Abstract

In this study, we propose a new method of polyphonic sound event detection based on a bidirectional long short-term memory hidden Markov model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNN. We compare the performance of our proposed method to conventional methods such as non-negative matrix factorization (NMF) and standard BLSTM-RNN, using the DCASE2016 Task 2 dataset. Our proposed method outperformed conventional approaches in both monophonic and polyphonic tasks, achieving an average F1 score of 76.63 % (error rate of 51.11 %) on the event-based evaluation, and an average F1 score of 87.16 % (error rate of 25.91 %) on the segment-based evaluation.
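The post-processing ("PP") stage that turns frame-wise network posteriors into discrete events can be sketched generically as smoothing, thresholding, and a minimum-duration constraint; the BLSTM-HMM variant replaces such heuristics with an explicit HMM duration model. The values below (threshold, window, minimum length) are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def posteriors_to_events(post, threshold=0.5, win=3, min_len=2):
    """Turn per-frame posteriors for one class into events.

    post: (n_frames,) sigmoid outputs in [0, 1]. The posteriors are
    median-filtered with window `win`, thresholded, and active runs
    shorter than `min_len` frames are discarded. Returns a list of
    (onset_frame, offset_frame) pairs.
    """
    post = np.asarray(post, float)
    pad = win // 2
    padded = np.pad(post, pad, mode='edge')
    smoothed = np.array([np.median(padded[i:i + win])
                         for i in range(len(post))])
    active = smoothed > threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                events.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        events.append((start, len(active)))
    return events
```

Median filtering removes isolated single-frame spikes before thresholding, so a lone high posterior does not trigger a spurious event.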

System characteristics
Features mel filterbank
Classifier BLSTM-PP; BLSTM-HMM

Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with a Mixture of Local Dictionaries

Abstract

This paper proposes an acoustic event detection (AED) method using semi-supervised non-negative matrix factorization (NMF) with a mixture of local dictionaries (MLD). The proposed method, based on semi-supervised NMF, newly introduces a noise dictionary and a noise activation matrix, both dedicated to unknown acoustic atoms which are not included in the MLD. Because unknown acoustic atoms are better modeled by the new noise dictionary, learned upon classification, and the new activation matrix, the proposed method provides higher classification performance for event classes modeled by the MLD when a signal to be classified is contaminated by unknown acoustic atoms. Evaluation results using the DCASE2016 Task 2 dataset show that the F-measure of the proposed method with semi-supervised NMF is improved by as much as 11.1 % compared to that of the conventional method with supervised NMF.
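The key idea, an extra noise dictionary estimated on the test signal while the event dictionary stays fixed, can be sketched with Euclidean multiplicative updates. This is a simplified illustration of the semi-supervised scheme; the paper's actual cost function and update rules may differ:

```python
import numpy as np

def semi_supervised_nmf(V, W_event, n_noise=2, n_iter=200, eps=1e-9):
    """Semi-supervised NMF: fixed event dictionary plus a noise
    dictionary learned on the test signal.

    V ≈ [W_event, W_noise] @ H, where W_event holds pre-learned event
    templates (held fixed), while W_noise (n_noise atoms) and the full
    activation matrix H are estimated from V itself, so unknown
    acoustic atoms are absorbed by the noise part. Returns H (its
    first W_event.shape[1] rows are the event activations) and the
    learned noise dictionary.
    """
    rng = np.random.default_rng(0)
    n_freq, n_frames = V.shape
    k = W_event.shape[1]
    W_noise = rng.random((n_freq, n_noise))
    H = rng.random((k + n_noise, n_frames))
    for _ in range(n_iter):
        W = np.hstack([W_event, W_noise])
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # update only the noise atoms of the dictionary
        W_noise *= (V @ H[k:].T) / (W @ H @ H[k:].T + eps)
    return H, W_noise
```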

System characteristics
Features VQT
Classifier NMF-MLD

Deep Neural Network Baseline for DCASE Challenge 2016

Abstract

The DCASE Challenge 2016 contains tasks for acoustic scene classification (ASC), acoustic event detection (AED), and audio tagging. Since 2006, deep neural networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 mel filterbank features are used. Two kinds of mel banks, a same-area bank and a same-height bank, are discussed; experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4 % using mel + DNN against 72.5 % using mel-frequency cepstral coefficients (MFCC) + Gaussian mixture model (GMM). In Task 2 we obtained an F-value of 17.4 % using mel + DNN against 41.6 % using constant-Q transform (CQT) + non-negative matrix factorization (NMF). In Task 3 we obtained an F-value of 38.1 % using mel + DNN against 26.6 % using MFCC + GMM. In Task 4 we obtained an equal error rate (EER) of 20.9 % using mel + DNN against 21.0 % using MFCC + GMM. The DNN therefore improves on the baseline in Tasks 1 and 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
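The "same area" versus "same height" mel banks compared above correspond to two normalizations of the triangular filters: every triangle peaking at 1, or every triangle scaled to unit weight sum. A sketch of both variants (the paper's exact construction may differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, norm='height'):
    """Triangular mel filterbank over FFT bins.

    norm='height': every triangle peaks at 1 ("same height bank");
    norm='area':   each triangle is scaled so its weights sum to 1
                   ("same area bank").
    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0, sr / 2, n_bins)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)    # rising edge of the triangle
        falling = (hi - freqs) / (hi - ctr)   # falling edge
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
        if norm == 'area':
            fb[m] /= max(fb[m].sum(), 1e-12)  # scale to unit weight sum
    return fb
```

A same-area bank down-weights the wide high-frequency triangles relative to the narrow low-frequency ones, which changes the effective spectral balance the DNN sees.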

System characteristics
Features mel filterbank
Classifier DNN

Car-Forest: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection

Abstract

This report describes our submissions to Task 2 and Task 3 of the DCASE 2016 challenge [1]. The systems aim at detecting overlapping audio events in continuous streams, with detectors based on random decision forests. The proposed forests are jointly trained for classification and regression simultaneously. Initially, the training is classification-oriented to encourage the trees to select discriminative features from overlapping mixtures and separate positive audio segments from negative ones. The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and therefore model the temporal structure of audio events. One random decision forest is trained specifically for each event category of interest. Experimental results on the development data show that our systems outperform the DCASE 2016 challenge baselines with absolute gains of 64.4 % and 8.0 % on Task 2 and Task 3, respectively.
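The classification-then-regression idea can be approximated with two separate scikit-learn forests: one decides whether a sliding window contains an event, the other regresses the window's displacement from the event onset, and positive windows then vote for onset positions. This stand-in (with hypothetical feature and label arrays) only approximates the jointly trained Car-Forest described in the report:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def train_car_forest(X, is_event, onset_offsets, seed=0):
    """Two-stage stand-in for a joint classification-regression forest.

    X: (n_windows, n_feats) features of sliding windows; is_event: 1 for
    windows overlapping a target event; onset_offsets: for positive
    windows, the distance (in frames) from the window start back to the
    event onset.
    """
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X, is_event)
    pos = np.asarray(is_event, bool)
    reg = RandomForestRegressor(n_estimators=50, random_state=seed)
    reg.fit(X[pos], np.asarray(onset_offsets)[pos])
    return clf, reg

def predict_onsets(clf, reg, X, window_starts):
    """Positive windows vote for an onset position; votes are pooled
    into a histogram and the mode bin's centre is returned
    (None if no window is classified positive)."""
    pos = clf.predict(X).astype(bool)
    if not pos.any():
        return None
    votes = np.asarray(window_starts)[pos] - reg.predict(X[pos])
    hist, edges = np.histogram(votes, bins=20)
    best = np.argmax(hist)
    return 0.5 * (edges[best] + edges[best + 1])
```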

System characteristics
Features Gammatone cepstrum
Classifier Random forests

Dictionary Learning Assisted Template Matching for Audio Event Detection (Legato)

Abstract

We submit a two-stage scheme for the detection of audio events in synthetic audio. In the first stage, the endpoints of candidate events are located by means of an unsupervised method based on dictionary learning. In the second stage, each candidate event is matched against all provided event templates using a variant of the Smith-Waterman algorithm. This stage includes a hypothesis test against scores generated by random permutations of the feature sequences corresponding to the candidate event and each reference template. The unknown event is classified to the reference template that generates the highest computed score. When the method is tested on the provided development dataset of the 2016 DCASE Challenge, the segment-based values of the F-measure and error rate are 64.01 % and 50.75 %, respectively. The corresponding values for the event-based evaluation are 62.52 % and 51.85 %.
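The second-stage matching relies on Smith-Waterman local alignment between the candidate's feature sequence and each template. The core recursion is sketched below; the submission uses a variant of it, plus the permutation-based hypothesis test, neither of which is reproduced here:

```python
import numpy as np

def smith_waterman(a, b, sim, gap=1.0):
    """Smith-Waterman local-alignment score between two feature
    sequences a and b (sequences of frame vectors or symbols).

    sim(x, y) should return a positive score for similar frames and a
    negative one for dissimilar frames; `gap` is a linear gap penalty.
    Returns the best local-alignment score (0 if nothing aligns).
    """
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim(a[i - 1], b[j - 1]),
                          H[i - 1, j] - gap,   # gap in sequence b
                          H[i, j - 1] - gap)   # gap in sequence a
    return H.max()
```

The permutation test would then compare this score against scores obtained after randomly shuffling the two feature sequences, accepting a match only when the real score is significantly higher.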

System characteristics
Features Bark scale coefficients
Classifier Template matching

Acoustic Scene and Event Recognition Using Recurrent Neural Networks

Abstract

The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using recurrent neural networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.

System characteristics
Features CQT
Classifier RNN