Task description
A detailed task description is available on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 4: results on the evaluation and development sets (the latter when reported by the authors), class-wise results, technical reports and BibTeX citations.
Systems ranking
| Rank | Submission code | Technical Report | Equal Error Rate (evaluation dataset) | Equal Error Rate (development dataset) |
|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | 17.1 |
| | DCASE2016 baseline | Foster2016 | 20.9 | 21.3 |
| | Hertel_task4_1 | Hertel2016 | 22.1 | 17.3 |
| | Kong_task4_1 | Kong2016 | 18.9 | 20.9 |
| | Lidy_task4_1 | Lidy2016 | 16.6 | 17.8 |
| | Vu_task4_1 | Vu2016 | 21.1 | 20.0 |
| | Xu_task4_1 | Xu2016 | 19.5 | 17.9 |
| | Xu_task4_2 | Xu2016 | 19.8 | |
| | Yun_task4_1 | Yun2016 | 17.4 | 17.6 |
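All figures above are equal error rates (EER) in percent: per tag, the EER is the operating point at which the false-positive rate equals the false-negative rate, and the tag-wise values are averaged for the overall score. For reference, the sketch below shows one common way to compute such a figure from chunk-level tag scores with scikit-learn; the function and the toy data are illustrative and this is not the official evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """EER for one tag: the point where the false-positive rate
    equals the false-negative rate (1 - true-positive rate)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # closest crossing point
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: binary ground truth and system scores for one tag.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.5])
print(f"EER: {equal_error_rate(y_true, y_score):.3f}")

# The overall challenge figure is the average of the per-tag EERs.
```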
Class-wise performance
Equal error rates (in %) on the evaluation dataset; Average is the mean over the seven sound classes.

| Rank | Submission code | Tech. Report | Average | Child speech | Adult male speech | Adult female speech | Video game / TV | Percussive sounds | Broadband noise | Other identifiable sounds |
|---|---|---|---|---|---|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | 25.0 | 15.9 | 25.0 | 2.7 | 20.8 | 2.2 | 25.8 |
| | DCASE2016 baseline | Foster2016 | 20.9 | 19.1 | 32.6 | 31.4 | 5.6 | 21.2 | 11.7 | 24.9 |
| | Hertel_task4_1 | Hertel2016 | 22.1 | 18.3 | 27.8 | 23.4 | 8.0 | 20.1 | 32.3 | 24.6 |
| | Kong_task4_1 | Kong2016 | 18.9 | 19.5 | 28.0 | 22.9 | 9.0 | 22.1 | 3.9 | 27.2 |
| | Lidy_task4_1 | Lidy2016 | 16.6 | 21.0 | 18.2 | 21.4 | 3.5 | 16.8 | 3.2 | 32.0 |
| | Vu_task4_1 | Vu2016 | 21.1 | 22.6 | 30.7 | 29.3 | 7.8 | 21.8 | 7.8 | 27.9 |
| | Xu_task4_1 | Xu2016 | 19.5 | 20.9 | 31.3 | 21.6 | 4.0 | 24.9 | 6.5 | 27.2 |
| | Xu_task4_2 | Xu2016 | 19.8 | 20.3 | 30.4 | 23.6 | 3.7 | 27.5 | 4.8 | 28.0 |
| | Yun_task4_1 | Yun2016 | 17.4 | 17.7 | 25.3 | 17.9 | 10.2 | 20.7 | 3.2 | 26.6 |
System characteristics
| Rank | Submission code | Tech. Report | Equal Error Rate (evaluation dataset) | Features | Classifier |
|---|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | Mel spectrogram | CNN |
| | DCASE2016 baseline | Foster2016 | 20.9 | MFCCs | GMM |
| | Hertel_task4_1 | Hertel2016 | 22.1 | Magnitude spectrogram | CNN |
| | Kong_task4_1 | Kong2016 | 18.9 | Mel spectrogram | DNN |
| | Lidy_task4_1 | Lidy2016 | 16.6 | CQT features | CNN |
| | Vu_task4_1 | Vu2016 | 21.1 | MFCCs | RNN |
| | Xu_task4_1 | Xu2016 | 19.5 | MFCCs | DNN |
| | Xu_task4_2 | Xu2016 | 19.8 | MFCCs | DNN |
| | Yun_task4_1 | Yun2016 | 17.4 | MFCCs | GMM |
Technical reports
Domestic Audio Tagging with Convolutional Neural Networks
Emre Cakir, Toni Heittola and Tuomas Virtanen
Tampere University of Technology, Tampere, Finland
Cakir_task4_1
Domestic Audio Tagging with Convolutional Neural Networks
Abstract
In this paper, the method used in our submission for DCASE2016 challenge task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated. A relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed on the development dataset of the challenge.
System characteristics
Features | Mel spectrogram |
Classifier | CNN |
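The abstract gives the main ingredients (mel spectrogram input, a CNN, chunk-level multi-label output) but not the exact architecture. The sketch below shows what such a pipeline can look like with librosa and PyTorch; the 40-band mel resolution, layer sizes and seven-tag output are illustrative assumptions, not the authors' configuration.

```python
import librosa
import torch
import torch.nn as nn

N_TAGS = 7  # child/male/female speech, video game/TV, percussive, broadband noise, other

def log_mel(path, sr=16000, n_mels=40):
    """Log mel spectrogram of one audio chunk as a (1, n_mels, frames) tensor."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).float().unsqueeze(0)

class TaggerCNN(nn.Module):
    """Small convolutional tagger with one sigmoid output per tag."""
    def __init__(self, n_tags=N_TAGS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # pool over frequency and time
        )
        self.fc = nn.Linear(64, n_tags)

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        return torch.sigmoid(self.fc(self.conv(x).flatten(1)))

# Forward pass on a random 40-band, 400-frame chunk; training would use
# nn.BCELoss() against the binary chunk-level tag vector.
probs = TaggerCNN()(torch.randn(1, 1, 40, 400))
print(probs.shape)                             # torch.Size([1, 7])
```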
DCASE2016 Baseline System
Peter Foster1 and Toni Heittola2
1Queen Mary University of London, London, United Kingdom, 2Tampere University of Technology, Tampere, Finland
DCASE2016_task4_1
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Lars Hertel1, Huy Phan1,2 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Hertel_task4_1
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Abstract
We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5 % on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5 %, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17 % and 19 %, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. In particular, it possesses neither fully-connected layers, apart from the fully-connected output layer, nor dropout layers.
System characteristics
Features | Magnitude spectrogram |
Classifier | CNN |
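Masked global pooling is the part of this design that lets a purely convolutional network handle variable-length recordings: padded frames are excluded from the global average. A minimal sketch of that idea follows; it is not the authors' implementation, and the convolution stack and sizes are placeholders.

```python
import torch
import torch.nn as nn

class MaskedGlobalPool(nn.Module):
    """Average convolutional features over time, ignoring padded frames.

    `lengths` holds the true number of frames per clip so that zero-padding
    added for batching does not dilute the average.
    """
    def forward(self, feats, lengths):
        # feats: (batch, channels, frames); lengths: (batch,)
        batch, _, frames = feats.shape
        mask = (torch.arange(frames, device=feats.device)[None, :]
                < lengths[:, None]).float()            # (batch, frames)
        masked = feats * mask.unsqueeze(1)
        return masked.sum(dim=2) / lengths[:, None].float()

# Toy usage with placeholder shapes: 1-D convolutions over spectrogram frames,
# then masked pooling to a fixed-size vector regardless of clip length.
conv = nn.Sequential(nn.Conv1d(40, 64, 5, padding=2), nn.ReLU())
pool = MaskedGlobalPool()
x = torch.randn(2, 40, 120)                            # two clips, padded to 120 frames
out = pool(conv(x), lengths=torch.tensor([120, 87]))   # shape (2, 64)
print(out.shape)
```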
Deep Neural Network Baseline for DCASE Challenge 2016
Kong_task4_1
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel-filter bank features are used. Two kinds of Mel banks, the same area bank and the same height bank, are discussed. Experimental results show that the same height bank is better than the same area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel Frequency Cepstral Coefficient (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Features | Mel spectrogram |
Classifier | DNN |
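The abstract contrasts "same area" and "same height" mel filter banks. That distinction maps onto the filter normalization option in librosa, as sketched below on a synthetic signal; the parameters are illustrative and this is not the authors' feature extraction code.

```python
import numpy as np
import librosa

# Two 40-band mel filter banks over a 1024-point FFT at 16 kHz:
# norm='slaney' scales each triangular filter to (roughly) equal area, while
# norm=None keeps every triangle at the same peak height.
sr, n_fft, n_mels = 16000, 1024, 40
same_area = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, norm='slaney')
same_height = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, norm=None)

# Stand-in for a 4-second audio chunk.
y = np.random.default_rng(0).standard_normal(4 * sr).astype(np.float32)
spec = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2       # power spectrogram
feat_area = np.log(same_area @ spec + 1e-10)           # (40, frames) DNN input
feat_height = np.log(same_height @ spec + 1e-10)
print(feat_area.shape, feat_height.shape)
```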
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Thomas Lidy1 and Alexander Schindler2
1Institute of Software Technology, Vienna University of Technology, Vienna, Austria, 2Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria
Lidy_task4_1
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Abstract
For the DCASE 2016 audio benchmarking contest, we submitted a parallel Convolutional Neural Network architecture for the tasks of 1) classifying acoustic scenes and urban soundscapes (task 1) and 2) domestic audio tagging (task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We, however, found that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to the DCASE 2016 tasks 1 and 4, with some slight alterations described in this paper. Our approach shows a 10.7 % relative improvement over the baseline system of the Acoustic Scene Classification task on the development set of task 1 [1].
System characteristics
Features | CQT features |
Classifier | CNN |
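A Constant-Q-transformed input can be produced directly with librosa. The sketch below only shows what such a CNN input looks like; the bin counts and hop length are chosen for illustration rather than taken from the paper.

```python
import numpy as np
import librosa

# Log-magnitude CQT with 80 bins spanning 8 octaves (10 bins per octave).
sr = 22050
y = np.random.default_rng(0).standard_normal(4 * sr).astype(np.float32)  # stand-in for a 4 s clip
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         n_bins=80, bins_per_octave=10))
log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)   # (80, frames) CNN input
print(log_cqt.shape)
```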
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task4_1
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks that span various problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.
System characteristics
Features | MFCCs |
Classifier | RNN |
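The report pairs MFCC features with a recurrent network. The sketch below shows one plausible shape of such a tagger in PyTorch; the choice of a GRU, the layer sizes and the seven-tag output are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """GRU over an MFCC sequence, sigmoid tag probabilities from the last state."""
    def __init__(self, n_mfcc=20, hidden=128, n_tags=7):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_tags)

    def forward(self, x):                  # x: (batch, frames, n_mfcc)
        _, h = self.rnn(x)                 # h: (num_layers, batch, hidden)
        return torch.sigmoid(self.fc(h[-1]))

model = RNNTagger()
scores = model(torch.randn(4, 200, 20))    # four clips of 200 MFCC frames
print(scores.shape)                        # torch.Size([4, 7])
```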
Fully DNN-Based Multi-Label Regression for Audio Tagging
Yong Xu, Qiang Huang, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom
Xu_task4_1 Xu_task4_2
Fully DNN-Based Multi-Label Regression for Audio Tagging
Abstract
Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable for some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all, or almost all, frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
System characteristics
Features | MFCCs |
Classifier | DNN |
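The core idea is to flatten (almost) the whole chunk of frame-level features into one long input vector and regress it onto the binary tag vector with sigmoid outputs. Below is a minimal sketch of that framing; the frame count, feature dimension, layer widths, dropout rate and use of a mean-squared-error loss are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ChunkDNN(nn.Module):
    """Maps a whole chunk of frame features, flattened into one long vector,
    to a 7-dimensional tag vector with sigmoid outputs."""
    def __init__(self, n_frames=400, n_feats=24, n_tags=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_frames * n_feats, 1000), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(500, n_tags), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, n_frames, n_feats)
        return self.net(x.flatten(1))

model = ChunkDNN()
tags = model(torch.randn(8, 400, 24))          # eight chunks of frame features
# Regression-style training target: the binary chunk-level tag vector.
loss = nn.MSELoss()(tags, torch.randint(0, 2, (8, 7)).float())
print(tags.shape, loss.item())
```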
Discriminative Training of GMM Parameters for Audio Scene Classification
Sungrack Yun, Sungwoong Kim, Sunkuck Moon, Juncheol Cho and Taesu Kim
Qualcomm Research, Seoul, South Korea
Yun_task4_1
Discriminative Training of GMM Parameters for Audio Scene Classification
Abstract
This report describes our algorithm for audio scene classification and audio tagging and its results on the DCASE 2016 challenge data. We propose a discriminative training algorithm to improve on the baseline GMM performance. The algorithm updates the baseline GMM parameters by maximizing the margin between classes to improve discriminative performance. For Task 1, we use a hierarchical classifier to maximize discriminative performance and achieve 84% accuracy on the given cross-validation data. For Task 4, we apply a binary classifier for each label and achieve a 16.71% EER on the given cross-validation data.
System characteristics
Features | MFCCs |
Classifier | GMM |
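The per-tag binary classification described above can be pictured as a pair of GMMs per tag, one trained on frames from chunks where the tag is present and one where it is absent, scored by a log-likelihood ratio. The sketch below shows only that maximum-likelihood starting point with scikit-learn and illustrative sizes; the report's discriminative, margin-maximizing parameter update is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_tag_gmms(pos_frames, neg_frames, n_components=8):
    """Fit one GMM on frames where the tag is present and one where it is absent."""
    gmm_pos = GaussianMixture(n_components, covariance_type='diag').fit(pos_frames)
    gmm_neg = GaussianMixture(n_components, covariance_type='diag').fit(neg_frames)
    return gmm_pos, gmm_neg

def tag_score(gmm_pos, gmm_neg, frames):
    """Chunk-level score: mean frame log-likelihood ratio; sweeping a threshold
    on this score trades false alarms against misses (and yields the EER)."""
    return np.mean(gmm_pos.score_samples(frames) - gmm_neg.score_samples(frames))

# Toy usage with random 13-dimensional MFCC-like frames.
rng = np.random.default_rng(0)
gmm_pos, gmm_neg = fit_tag_gmms(rng.normal(1, 1, (500, 13)),
                                rng.normal(0, 1, (500, 13)))
print(tag_score(gmm_pos, gmm_neg, rng.normal(1, 1, (100, 13))))
```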