Task description
A detailed description of the task can be found on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 2: results on the evaluation and development sets (where reported by the authors), class-wise results, technical reports, and BibTeX citations.
A detailed description of the metrics used can be found here.
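The tables below report the segment-based and event-based error rate (ER) and F1 score. As a rough illustration only, here is a minimal sketch of the segment-based computation based on the published metric definitions; it assumes the reference annotation and system output have already been rasterized into binary activity matrices over fixed-length segments, and the challenge's official evaluation code remains authoritative.

```python
import numpy as np

def segment_based_metrics(ref, est):
    """Segment-based ER and F1 from binary activity matrices of shape
    (n_segments, n_classes), where 1 = class active in that segment."""
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum(axis=1)  # false positives per segment
    fn = np.logical_and(ref == 1, est == 0).sum(axis=1)  # false negatives per segment

    # Substitutions, deletions, and insertions, counted per segment.
    S = np.minimum(fp, fn).sum()
    D = np.maximum(0, fn - fp).sum()
    I = np.maximum(0, fp - fn).sum()
    N = ref.sum()                      # total active reference labels

    er = (S + D + I) / N
    precision = tp / (tp + fp.sum())
    recall = tp / (tp + fn.sum())
    f1 = 2 * precision * recall / (precision + recall)
    return er, f1
```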
Systems ranking
Segment-based metrics are computed over the overall evaluation and development datasets; event-based metrics are computed on the evaluation dataset with onset-only matching. F1 is given in %; development-set results are shown where reported by the authors.

| Technical Report | Code | Name | Seg. ER (eval) | Seg. F1 (eval) | Event ER (eval) | Event F1 (eval) | Seg. ER (dev) | Seg. F1 (dev) |
|---|---|---|---|---|---|---|---|---|
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | 0.6178 | 67.1 | 0.1379 | 92.6 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | 1.6852 | 24.2 | 0.7859 | 41.6 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | 1.3490 | 34.2 | | |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | 1.3064 | 25.7 | 0.4973 | 67.7 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | 0.6004 | 68.2 | 0.2424 | 87.7 |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.4958 | 76.0 | 0.6448 | 67.0 | 0.2591 | 87.2 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | 0.4624 | 73.8 | | |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | 2.5999 | 2.5 | | 17.4 |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | 1.0123 | 39.8 | 0.1420 | 92.8 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | 0.7379 | 36.8 | 0.5075 | 64.0 |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | 3.1818 | 18.1 | 0.3412 | 81.2 |
Teams ranking
This table includes only the best-performing system from each submitting team.
| Technical Report | Code | Name | Seg. ER (eval) | Seg. F1 (eval) | Event ER (eval) | Event F1 (eval) | Seg. ER (dev) | Seg. F1 (dev) |
|---|---|---|---|---|---|---|---|---|
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | 0.6178 | 67.1 | 0.1379 | 92.6 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | 1.6852 | 24.2 | 0.7859 | 41.6 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | 1.3490 | 34.2 | | |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | 1.3064 | 25.7 | 0.4973 | 67.7 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | 0.6004 | 68.2 | 0.2424 | 87.7 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | 0.4624 | 73.8 | | |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | 2.5999 | 2.5 | | 17.4 |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | 1.0123 | 39.8 | 0.1420 | 92.8 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | 0.7379 | 36.8 | 0.5075 | 64.0 |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | 3.1818 | 18.1 | 0.3412 | 81.2 |
Class-wise performance
All figures are segment-based metrics on the evaluation dataset (F1 in %); the first pair of metric columns is the class-based average.

| Technical Report | Code | Name | Avg. ER | Avg. F1 | Clearing throat ER | Clearing throat F1 | Coughing ER | Coughing F1 | Door knock ER | Door knock F1 | Door slam ER | Door slam F1 | Drawer ER | Drawer F1 | Keyboard ER | Keyboard F1 | Keys ER | Keys F1 | Laughter ER | Laughter F1 | Page turning ER | Page turning F1 | Phone ringing ER | Phone ringing F1 | Speech ER | Speech F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Choi2016 | Choi_task2_1 | Choi | 0.4447 | 74.2 | 0.3373 | 83.1 | 0.4175 | 77.8 | 0.2294 | 88.2 | 1.1836 | 5.4 | 0.6000 | 65.0 | 0.1873 | 91.3 | 0.2933 | 84.8 | 0.4566 | 76.1 | 0.5609 | 75.6 | 0.2190 | 88.6 | 0.4066 | 81.0 |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 1.1066 | 33.2 | 0.8956 | 49.4 | 1.0561 | 6.2 | 0.8674 | 63.1 | 2.9855 | 15.1 | 0.9833 | 3.3 | 0.9100 | 22.4 | 0.6400 | 62.7 | 1.3193 | 40.8 | 1.0032 | 2.5 | 0.7956 | 40.6 | 0.7163 | 59.3 |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.8479 | 55.5 | 0.7510 | 54.9 | 0.9474 | 47.3 | 0.6667 | 61.6 | 2.3575 | 29.7 | 0.7867 | 38.5 | 0.4112 | 77.1 | 0.4400 | 74.3 | 0.9440 | 52.2 | 0.7083 | 51.2 | 0.5815 | 60.6 | 0.7329 | 62.6 |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.2537 | 34.2 | 0.7510 | 54.0 | 0.9053 | 25.0 | 1.8566 | 42.4 | 3.0628 | 4.5 | 0.9033 | 33.1 | 0.5426 | 67.5 | 1.1156 | 12.5 | 0.7647 | 47.6 | 0.9744 | 5.6 | 0.4793 | 71.3 | 13.4350 | 12.8 |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.5228 | 78.6 | 0.5422 | 77.2 | 0.2947 | 84.7 | 0.2115 | 90.2 | 0.7295 | 67.8 | 0.8000 | 60.4 | 0.2822 | 87.5 | 0.3733 | 83.2 | 0.5042 | 78.4 | 1.6186 | 55.0 | 0.2238 | 88.5 | 0.1702 | 91.3 |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.6055 | 76.6 | 0.4458 | 78.2 | 0.3825 | 82.9 | 0.4875 | 80.1 | 1.3043 | 55.0 | 0.7867 | 64.9 | 0.2749 | 87.8 | 0.2356 | 88.8 | 0.7983 | 70.2 | 1.5032 | 56.9 | 0.1679 | 91.5 | 0.2742 | 86.0 |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3851 | 77.1 | 0.4458 | 76.8 | 0.5719 | 62.5 | 0.1828 | 90.7 | 0.8309 | 37.7 | 0.4400 | 74.7 | 0.2311 | 87.9 | 0.3867 | 81.0 | 0.4090 | 77.3 | 0.2853 | 84.5 | 0.1290 | 93.1 | 0.3239 | 82.2 |
| Kong2016 | Kong_task2_1 | Kong | 3.8264 | 11.7 | 0.9880 | 3.1 | 0.9825 | 3.5 | 1.0072 | 15.6 | 1.1643 | 2.4 | 1.1900 | 11.8 | 14.4939 | 12.1 | 1.0000 | 0.0 | 0.9916 | 1.7 | 17.8109 | 9.7 | 1.0487 | 32.8 | 1.4137 | 36.1 |
| Phan2016 | Phan_task2_1 | Phan | 0.7051 | 59.7 | 0.6506 | 62.7 | 0.8491 | 47.2 | 0.6595 | 73.1 | 1.3092 | 15.6 | 0.7433 | 62.1 | 0.4599 | 79.5 | 0.7956 | 40.5 | 0.7199 | 64.1 | 0.5929 | 62.9 | 0.4136 | 77.8 | 0.5626 | 71.7 |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 1.1604 | 35.5 | 1.4096 | 40.0 | 1.3088 | 21.5 | 0.8244 | 41.9 | 0.9855 | 37.8 | 1.6300 | 28.2 | 0.7956 | 44.5 | 1.0400 | 29.1 | 1.0476 | 52.4 | 1.5801 | 41.0 | 0.9513 | 10.5 | 1.1915 | 44.0 |
| Vu2016 | Vu_task2_1 | Vu | 1.1240 | 56.8 | 0.8635 | 61.0 | 0.9789 | 47.1 | 0.4875 | 74.2 | 1.0338 | 32.3 | 5.2467 | 24.5 | 0.3650 | 83.0 | 0.6889 | 62.1 | 0.7927 | 54.1 | 0.7051 | 60.1 | 0.6180 | 55.6 | 0.5839 | 71.1 |
System characteristics
| Technical Report | Code | Name | Seg. ER (eval) | Seg. F1 (eval) | Features | Classifier |
|---|---|---|---|---|---|---|
| Choi2016 | Choi_task2_1 | Choi | 0.3660 | 78.7 | mel energy | DNN |
| Benetos2016 | DCASE2016 baseline | DCASE2016_Baseline | 0.8933 | 37.0 | VQT | NMF |
| Giannoulis2016 | Giannoulis_task2_1 | Giannoulis | 0.6774 | 55.8 | various | CNMF |
| Gutiérrez-Arriola2016 | Gutierrez_task2_1 | Gutierrez | 2.0870 | 25.0 | MFCC | kNN |
| Hayashi2016 | Hayashi_task2_1 | BLSTM-PP | 0.4082 | 78.1 | mel filterbank | BLSTM-PP |
| Hayashi2016 | Hayashi_task2_2 | BLSTM-HMM | 0.4958 | 76.0 | mel filterbank | BLSTM-HMM |
| Komatsu2016 | Komatsu_task2_1 | Komatsu | 0.3307 | 80.2 | VQT | NMF-MLD |
| Kong2016 | Kong_task2_1 | Kong | 3.5464 | 12.6 | mel filterbank | DNN |
| Phan2016 | Phan_task2_1 | Phan | 0.5901 | 64.8 | Gammatone cepstrum | Random forests |
| Pikrakis2016 | Pikrakis_task2_1 | Pikrakis | 0.7499 | 37.4 | Bark scale coefficients | Template matching |
| Vu2016 | Vu_task2_1 | Vu | 0.8979 | 52.8 | CQT | RNN |
Technical reports
DCASE2016 Task 2 Baseline
Emmanouil Benetos1, Grégoire Lafay2 and Mathieu Lagrange2
1Queen Mary University of London, London, United Kingdom, 2IRCCYN, Ecole Centrale de Nantes, France
DCASE2016_task2_1
DCASE2016 Task 2 Baseline
Abstract
The Task 2 baseline system is meant to implement a basic approach for detecting overlapping acoustic events and to provide a point of comparison for participants while they develop their systems. The baseline is based on supervised non-negative matrix factorization (NMF) and performs detection using a dictionary of spectral templates extracted during the training phase. The output of the NMF system is a non-binary matrix denoting event activations, which is post-processed into a list of detected events.
System characteristics
Features | VQT |
Classifier | NMF |
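As a loose illustration of the supervised NMF detection idea described in the abstract, the sketch below keeps a pre-learned template dictionary fixed and estimates activations with multiplicative KL-divergence updates; the thresholding shown in the closing comment is a simplification of the baseline's actual post-processing, and all parameter values are assumptions.

```python
import numpy as np

def nmf_event_activations(V, W, n_iter=200, eps=1e-9):
    """Estimate non-negative activations H with V ~ W @ H, keeping the
    pre-learned event templates W fixed (multiplicative updates for the
    KL divergence)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    denom = W.sum(axis=0)[:, None] + eps      # column sums of W
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / denom
    return H

# Simplified post-processing (an assumption): a template counts as active
# wherever its activation exceeds a per-template threshold, and runs of
# consecutive active frames become detected events.
# active = H > 0.3 * H.max(axis=1, keepdims=True)
```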
DNN-Based Sound Event Detection with Exemplar-Based Approach for Noise Reduction
Inkyu Choi, Kisoo Kwon, Soo Hyun Bae and Nam Soo Kim
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
Choi_task2_1
DNN-Based Sound Event Detection with Exemplar-Based Approach for Noise Reduction
Abstract
In this paper, we present a sound event detection system based on a deep neural network (DNN). An exemplar-based noise reduction approach is proposed for enhancing the mel-band energy features, and a multi-label DNN classifier is trained for polyphonic event detection. The system is evaluated on the IEEE DCASE 2016 Challenge Task 2 train/development datasets. On the development set, it achieves a segment-based F-score of 0.9261 and an error rate of 0.1379.
System characteristics
Features | mel energy |
Classifier | DNN |
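For orientation, a minimal multi-label DNN classifier of the kind the abstract describes might look as follows in PyTorch; the layer sizes and context window are assumptions, and the exemplar-based noise reduction stage is omitted.

```python
import torch.nn as nn

class MultiLabelDNN(nn.Module):
    """Multi-label frame classifier: one sigmoid output per event class."""
    def __init__(self, n_mels=40, context=5, n_classes=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels * (2 * context + 1), 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_classes),   # logits; sigmoid applied by the loss
        )

    def forward(self, x):                # x: (batch, stacked context frames)
        return self.net(x)

model = MultiLabelDNN()
criterion = nn.BCEWithLogitsLoss()       # independent per-class targets
```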
Improved Dictionary Selection and Detection Schemes in Sparse-CNMF-Based Overlapping Acoustic Event Detection
Panagiotis Giannoulis1,2, Gerasimos Potamianos2,3, Petros Maragos1,2 and Athanasios Katsamanis1,2
1School of ECE, National Technical University of Athens, Athens, Greece, 2Athena Research and Innovation Center, Maroussi, Greece, 3Department of ECE, University of Thessaly, Volos, Greece
Giannoulis_task2_1
Improved Dictionary Selection and Detection Schemes in Sparse-CNMF-Based Overlapping Acoustic Event Detection
Abstract
In this paper, we investigate sparse convolutive non-negative matrix factorization (sparse-CNMF) for detecting overlapped acoustic events in single-channel audio, within the experimental framework of Task 2 of the DCASE’16 challenge. In particular, our main focus lies on the efficient creation of the dictionary, as well as on the detection scheme associated with the CNMF approach. Specifically, we propose a shift-invariant dictionary reduction method that outperforms standard CNMF-based dictionary building. Further, we develop a novel detection algorithm that combines information from the CNMF activation matrix and atom-based reconstruction residuals, achieving significant improvement over the conventional approach based on the activations alone. The resulting system, evaluated on the development set of Task 2 of the DCASE’16 Challenge, also achieves large gains over the traditional NMF baseline provided by the Challenge organizers.
System characteristics
Features | various |
Classifier | CNMF |
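The detection scheme combines activation strength with atom-based reconstruction residuals. The sketch below conveys that fusion idea on plain (non-convolutive) NMF atoms; the weighting `alpha` and the exact fusion rule are our assumptions, not the authors' formulation.

```python
import numpy as np

def fused_event_scores(V, W, H, alpha=0.5, eps=1e-9):
    """Fuse activation strength with an atom-wise reconstruction residual.
    Sketch only: plain (non-convolutive) NMF atoms are assumed."""
    scores = np.zeros(W.shape[1])
    for k in range(W.shape[1]):
        recon_k = np.outer(W[:, k], H[k])          # contribution of atom k
        residual = np.linalg.norm(V - recon_k) / (np.linalg.norm(V) + eps)
        # Strong activation and a low residual both argue for the event.
        scores[k] = alpha * H[k].sum() + (1 - alpha) * (1.0 - residual)
    return scores
```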
Synthetic Sound Event Detection Based on MFCC
J.M. Gutiérrez-Arriola, R. Fraile, A. Camacho, T. Durand, J.L. Jarrin and S.R. Mendoza
Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicacíon, Universidad Politécnica de Madrid, Madrid, Spain
Gutierrez_task2_1
Synthetic Sound Event Detection Based on MFCC
Abstract
This paper presents a sound event detection system based on mel-frequency cepstral coefficients and a non-parametric classifier. System performance is tested using the training and development datasets corresponding to the second task of the DCASE 2016 challenge. Results indicate that the most relevant spectral information for event detection lies below 8000 Hz and that the general shape of the spectral envelope is much more relevant than its fine details.
System characteristics
Features | MFCC |
Classifier | kNN |
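A bare-bones version of the described pipeline, mel-frequency cepstral coefficients fed to a k-nearest-neighbour classifier, can be sketched as follows; the frame parameters and k are assumptions, while `fmax=8000` reflects the paper's finding that the useful spectral information lies below 8000 Hz.

```python
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_frames(path, sr=44100, n_mfcc=13):
    """Frame-wise MFCCs, limiting the mel filterbank to 8000 Hz."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, fmax=8000).T

clf = KNeighborsClassifier(n_neighbors=5)   # k is an assumption
# clf.fit(X_train, y_train)   # X_train: MFCC frames from isolated training
#                             # events; y_train: per-frame event labels.
# Frame-level predictions are then smoothed in time to form events.
```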
Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection
Tomoki Hayashi1, Shinji Watanabe2, Tomoki Toda1, Takaaki Hori2, Jonathan Le Roux2 and Kazuya Takeda1
1Nagoya University, Nagoya, Japan, 2Mitsubishi Electric Research Laboratories, Cambridge, USA
Hayashi_task2_1 Hayashi_task2_2
Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection
Abstract
In this study, we propose a new method of polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNN. We compare the performance of our proposed method to conventional methods such as non-negative matrix factorization (NMF) and standard BLSTM-RNN, using the DCASE2016 Task 2 dataset. Our proposed method outperformed conventional approaches in both monophonic and polyphonic tasks, and finally achieved an average F1 score of 76.63% (error rate of 51.11%) on the event-based evaluation, and an average F1 score of 87.16% (error rate of 25.91%) on the segment-based evaluation.
System characteristics
Features | mel filterbank |
Classifier | BLSTM-PP; BLSTM-HMM |
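A minimal BLSTM tagger of the kind used here can be sketched as follows in PyTorch; the hidden size and depth are assumptions, and the HMM duration model (or the post-processing of the BLSTM-PP variant) that the paper adds on top is omitted.

```python
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Bidirectional LSTM over mel-filterbank frames, one logit per class
    and frame; sigmoid + thresholding yields multi-label activity."""
    def __init__(self, n_mels=40, hidden=128, n_classes=11):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):          # x: (batch, frames, n_mels)
        h, _ = self.lstm(x)
        return self.out(h)         # per-frame logits, one per event class
```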
Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with a Mixture of Local Dictionaries
Tatsuya Komatsu, Takahiro Toizumi, Reishi Kondo and Yuzo Senda
Data Science Research Laboratories, NEC Corporation, Kawasaki, Japan
Komatsu_task2_1
Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with a Mixture of Local Dictionaries
Abstract
This paper proposes an acoustic event detection (AED) method using semi-supervised non-negative matrix factorization (NMF) with a mixture of local dictionaries (MLD). The proposed method, based on semi-supervised NMF, newly introduces a noise dictionary and a noise activation matrix, both dedicated to unknown acoustic atoms that are not included in the MLD. Because unknown acoustic atoms are better modeled by the new noise dictionary, learned at classification time, together with the new activation matrix, the proposed method provides higher classification performance for event classes modeled by the MLD when the signal to be classified is contaminated by unknown acoustic atoms. Evaluation results on the DCASE2016 Task 2 dataset show that the F-measure of the proposed method with semi-supervised NMF improves by as much as 11.1% over the conventional method with supervised NMF.
System characteristics
Features | VQT |
Classifier | NMF-MLD |
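The core semi-supervised idea, a fixed event dictionary augmented by a noise dictionary that is learned on the test signal itself, can be sketched as follows; dictionary sizes, iteration counts, and the KL multiplicative updates are assumptions.

```python
import numpy as np

def semi_supervised_nmf(V, W_event, n_noise=5, n_iter=100, eps=1e-9):
    """Semi-supervised NMF sketch: the event dictionary W_event stays fixed
    while a small noise dictionary and all activations are learned from the
    signal itself."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W_noise = rng.random((F, n_noise))
    H = rng.random((W_event.shape[1] + n_noise, T))
    for _ in range(n_iter):
        W = np.hstack([W_event, W_noise])
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        # Update only the noise atoms; the event templates are not touched.
        Hn = H[-n_noise:]
        W_noise *= ((V / (W @ H + eps)) @ Hn.T) / (Hn.sum(axis=1)[None, :] + eps)
    return H[:W_event.shape[1]], W_noise
```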
Deep Neural Network Baseline for DCASE Challenge 2016
Qiuqiang Kong, Iwona Sobieraj, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, United Kingdom
Kong_task2_1
Deep Neural Network Baseline for DCASE Challenge 2016
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 mel filterbank features are used. Two kinds of mel banks, the same-area bank and the same-height bank, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using mel + DNN against 72.5% using Mel Frequency Cepstral Coefficients (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using mel + DNN against 41.6% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using mel + DNN against 21.0% using MFCC + GMM. Therefore the DNN improves on the baseline in Tasks 1 and 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Features | mel filterbank |
Classifier | DNN |
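The same-area versus same-height distinction between mel filterbanks discussed in the abstract corresponds, roughly, to how the triangular filters are normalized; mapping it onto librosa's `norm` options, as below, is our assumption.

```python
import librosa

# 'slaney' area-normalizes each triangular filter (same-area bank), while
# norm=None leaves every triangle with the same peak height (same-height bank).
mel_same_area   = librosa.filters.mel(sr=44100, n_fft=1024, n_mels=40,
                                      norm='slaney')
mel_same_height = librosa.filters.mel(sr=44100, n_fft=1024, n_mels=40,
                                      norm=None)
```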
CaR-FOREST: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection
Huy Phan1,2, Lars Hertel1, Marco Maass1, Philipp Koch1 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Phan_task2_1
CaR-FOREST: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection
Abstract
This report describes our submissions to Task 2 and Task 3 of the DCASE 2016 challenge [1]. The systems aim at detecting overlapping audio events in continuous streams, with detectors based on random decision forests. The proposed forests are jointly trained for classification and regression. Initially, training is classification-oriented to encourage the trees to select discriminative features that separate positive audio segments from negative ones in overlapping mixtures. The regression phase then lets the positive audio segments vote for event onsets and offsets, thereby modeling the temporal structure of audio events. One random decision forest is trained for each event category of interest. Experimental results on the development data show that our systems outperform the DCASE 2016 challenge baselines with absolute gains of 64.4% and 8.0% on Task 2 and Task 3, respectively.
System characteristics
Features | Gammatone cepstrum |
Classifier | Random forests |
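Off-the-shelf forests cannot be trained jointly for both objectives, so the following sketch merely approximates the two-phase idea with separate scikit-learn models: a classifier to find positive segments and a regressor whose outputs vote for onset/offset positions.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=100)   # event vs. background
reg = RandomForestRegressor(n_estimators=100)    # (distance to onset, distance to offset)

# clf.fit(X_segments, y_is_event)
# reg.fit(X_segments[y_is_event == 1], onset_offset_targets)
# At test time, segments classified as positive cast regression votes that
# are accumulated into onset/offset histograms; peaks give event boundaries.
```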
Dictionary Learning Assisted Template Matching for Audio Event Detection (Legato)
Aggelos Pikrakis1 and Yannis Kopsinis2
1Department of Informatics, University of Piraeus, Piraeus, Greece, 2Libra MLI, Edinburgh, United Kingdom
Pikrakis_task2_1
Dictionary Learning Assisted Template Matching for Audio Event Detection (Legato)
Abstract
We submit a two-stage scheme for the detection of audio events in synthetic audio. At the first stage, the endpoints of candidate events are located by means of an unsupervised method based on dictionary learning. At the second stage, each candidate event is matched against all provided event templates using a variant of the Smith-Waterman algorithm. This stage includes a hypothesis test against scores generated by random permutations of the feature sequences corresponding to the candidate event and each reference template. The unknown event is classified to the reference template that generates the highest computed score. When the method is tested on the provided development dataset of the 2016 DCASE Challenge, the segment-based values of the F-measure and error rate are 64.01% and 50.75%, respectively. The corresponding values for the event-based evaluation are 62.52% and 51.85%.
System characteristics
Features | Bark scale coefficients |
Classifier | Template matching |
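For reference, a plain Smith-Waterman local alignment over discretized feature sequences looks as follows; the paper uses a variant of this algorithm, and the scoring values here are assumptions.

```python
import numpy as np

def smith_waterman(a, b, match=2.0, mismatch=-1.0, gap=-1.0):
    """Local alignment score between two symbol sequences (e.g.
    vector-quantized Bark-band frames)."""
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0, H[i - 1, j - 1] + s,
                          H[i - 1, j] + gap, H[i, j - 1] + gap)
    return H.max()

# Each candidate event is matched against every reference template; its
# score is compared with scores from randomly permuted sequences to form
# the hypothesis test described in the abstract.
```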
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task2_1
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieve superior performance compared with the baselines.
System characteristics
Features | CQT |
Classifier | RNN |