Domestic audio tagging


Task description

A detailed task description is available on the task description page.

Challenge results

Here you can find complete information on the submissions for Task 4: results on the evaluation and development sets (where reported by the authors), class-wise results, and technical reports.

Systems ranking

Equal Error Rate (EER, in %) on the evaluation and development datasets; rank is determined by evaluation-set EER.

Rank  Submission code     Technical Report  EER % (eval)  EER % (dev)
----  ------------------  ----------------  ------------  -----------
2     Cakir_task4_1       Cakir2016         16.8          17.1
7     DCASE2016 baseline  Foster2016        20.9          21.3
9     Hertel_task4_1      Hertel2016        22.1          17.3
4     Kong_task4_1        Kong2016          18.9          20.9
1     Lidy_task4_1        Lidy2016          16.6          17.8
8     Vu_task4_1          Vu2016            21.1          20.0
5     Xu_task4_1          Xu2016            19.5          17.9
6     Xu_task4_2          Xu2016            19.8          -
3     Yun_task4_1         Yun2016           17.4          17.6

(Development-set figures are those reported by the authors; a dash marks a value that was not reported.)
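
For reference, the Equal Error Rate is the point on the ROC curve where the false positive rate equals the false negative (miss) rate. A minimal sketch of how a per-tag EER can be computed from chunk-level scores, assuming NumPy and scikit-learn are available; this is an illustration, not the official evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """EER for one tag: y_true holds 0/1 labels per chunk,
    y_score the classifier's scores for that tag."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr                        # false negative (miss) rate
    idx = np.nanargmin(np.abs(fpr - fnr))  # threshold where the curves cross
    return (fpr[idx] + fnr[idx]) / 2.0     # average smooths the discrete crossing

# e.g. equal_error_rate([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.7, 0.4, 0.8, 0.9]) -> ~0.33
```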

Class-wise performance

Per-class Equal Error Rate (in %) on the evaluation dataset. Column key: Child = child speech, Male = adult male speech, Female = adult female speech, TV = video game / TV, Perc. = percussive sounds, Noise = broadband noise, Other = other identifiable sounds.

Rank  Submission code     Tech. Report  Average  Child  Male  Female  TV    Perc.  Noise  Other
----  ------------------  ------------  -------  -----  ----  ------  ----  -----  -----  -----
2     Cakir_task4_1       Cakir2016     16.8     25.0   15.9  25.0    2.7   20.8   2.2    25.8
7     DCASE2016 baseline  Foster2016    20.9     19.1   32.6  31.4    5.6   21.2   11.7   24.9
9     Hertel_task4_1      Hertel2016    22.1     18.3   27.8  23.4    8.0   20.1   32.3   24.6
4     Kong_task4_1        Kong2016      18.9     19.5   28.0  22.9    9.0   22.1   3.9    27.2
1     Lidy_task4_1        Lidy2016      16.6     21.0   18.2  21.4    3.5   16.8   3.2    32.0
8     Vu_task4_1          Vu2016        21.1     22.6   30.7  29.3    7.8   21.8   7.8    27.9
5     Xu_task4_1          Xu2016        19.5     20.9   31.3  21.6    4.0   24.9   6.5    27.2
6     Xu_task4_2          Xu2016        19.8     20.3   30.4  23.6    3.7   27.5   4.8    28.0
3     Yun_task4_1         Yun2016       17.4     17.7   25.3  17.9    10.2  20.7   3.2    26.6
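
The Average column is consistent with an unweighted mean of the seven per-class EERs. A quick check, assuming NumPy (values copied from the Cakir_task4_1 row above):

```python
import numpy as np

# Per-class EERs (%) for Cakir_task4_1, copied from the table above.
per_class = {"child speech": 25.0, "adult male speech": 15.9,
             "adult female speech": 25.0, "video game / TV": 2.7,
             "percussive sounds": 20.8, "broadband noise": 2.2,
             "other identifiable sounds": 25.8}

average = np.mean(list(per_class.values()))  # unweighted macro-average
print(round(float(average), 1))              # 16.8, matching the Average column
```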

System characteristics

Rank  Submission code     Tech. Report  EER % (eval)  Features               Classifier
----  ------------------  ------------  ------------  ---------------------  ----------
2     Cakir_task4_1       Cakir2016     16.8          Mel spectrogram        CNN
7     DCASE2016 baseline  Foster2016    20.9          MFCCs                  GMM
9     Hertel_task4_1      Hertel2016    22.1          Magnitude spectrogram  CNN
4     Kong_task4_1        Kong2016      18.9          Mel spectrogram        DNN
1     Lidy_task4_1        Lidy2016      16.6          CQT features           CNN
8     Vu_task4_1          Vu2016        21.1          MFCCs                  RNN
5     Xu_task4_1          Xu2016        19.5          MFCCs                  DNN
6     Xu_task4_2          Xu2016        19.8          MFCCs                  DNN
3     Yun_task4_1         Yun2016       17.4          MFCCs                  GMM
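
The feature types in the table are all standard time-frequency representations. A minimal sketch of extracting each of them with librosa; the file name, sample rate, and bin counts here are illustrative placeholders, not the settings any submission actually used:

```python
import librosa

# Load one audio chunk (file name and sample rate are placeholders).
y, sr = librosa.load("chunk.wav", sr=16000)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)  # Mel spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)           # MFCCs
cqt = abs(librosa.cqt(y=y, sr=sr, n_bins=84))                # CQT magnitude
stft = abs(librosa.stft(y))                                  # magnitude spectrogram
```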

Technical reports

Domestic Audio Tagging with Convolutional Neural Networks (Cakir2016)

Abstract

In this paper, the method used in our submission for DCASE2016 challenge Task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNNs) to label audio signals recorded in a domestic (home) environment is investigated. A relative improvement of 23.8% over the Gaussian mixture model (GMM) baseline method is observed on the challenge's development dataset.

System characteristics
Features: Mel spectrogram
Classifier: CNN

DCASE2016 Baseline System (Foster2016)

System characteristics
Features: MFCCs
Classifier: GMM

Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling (Hertel2016)

Abstract

We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17% and 19%, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. In particular, it possesses neither fully-connected layers (apart from the fully-connected output layer) nor dropout layers.
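
As a rough illustration of the architecture the abstract describes (convolutional layers only, global pooling, no hidden fully-connected or dropout layers), here is a hypothetical PyTorch sketch; layer sizes are invented, and the paper's masked global pooling over padded frames is simplified here to plain adaptive average pooling:

```python
import torch
import torch.nn as nn

class AllConvTagger(nn.Module):
    # Hypothetical sketch: all-convolutional network with global pooling,
    # no hidden fully-connected layers and no dropout.
    def __init__(self, n_tags=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # strided conv instead of pooling
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling over time and frequency
        self.out = nn.Conv2d(64, n_tags, 1)  # 1x1 conv acts as the output layer

    def forward(self, x):            # x: (batch, 1, freq, time), variable time length
        h = self.pool(self.features(x))      # collapses variable-length input to a fixed size
        return torch.sigmoid(self.out(h)).flatten(1)  # one probability per tag
```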

System characteristics
Features: Magnitude spectrogram
Classifier: CNN

Deep Neural Network Baseline for DCASE Challenge 2016 (Kong2016)

Abstract

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel-filter bank features are used. Two kinds of Mel banks, the same-area bank and the same-height bank, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN, against 72.5% using Mel Frequency Cepstral Coefficients (MFCCs) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN, against 41.6% using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN, against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN, against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Tasks 1 and 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
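
The same-area versus same-height distinction the abstract draws corresponds to whether the triangular Mel filters are normalized by their bandwidth. A hypothetical illustration with librosa (the sample rate, FFT size, and band count are arbitrary, not the authors' settings):

```python
import librosa

# With norm=None the triangular filters all peak near 1 (same height);
# with norm='slaney' each filter is divided by its bandwidth (same area).
same_height = librosa.filters.mel(sr=16000, n_fft=1024, n_mels=40, norm=None)
same_area = librosa.filters.mel(sr=16000, n_fft=1024, n_mels=40, norm='slaney')

print(same_height.max(axis=1)[:3])  # peaks are all ~1.0
print(same_area.max(axis=1)[:3])    # peaks shrink as the bandwidth grows
```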

System characteristics
Features: Mel spectrogram
Classifier: DNN

CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging (Lidy2016)

Abstract

For the DCASE 2016 audio benchmarking contest, we submitted a parallel Convolutional Neural Network architecture for the tasks of 1) classifying acoustic scenes and urban soundscapes (Task 1) and 2) domestic audio tagging (Task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We found, however, that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and the filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to DCASE 2016 Tasks 1 and 4, with some slight alterations described in this paper. Our approach shows a 10.7% relative improvement over the baseline system of the Acoustic Scene Classification task on the development set of Task 1 [1].
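
As a rough sketch of the parallel (graph-based) idea, here is a hypothetical two-branch PyTorch model in which one branch uses filters extended along time and the other along frequency, with the pooled outputs concatenated; all dimensions are invented and do not reflect the authors' actual architecture:

```python
import torch
import torch.nn as nn

class ParallelCQTNet(nn.Module):
    # Hypothetical two-branch CNN; expects at least a 10x10 time-frequency input.
    def __init__(self, n_tags=7):
        super().__init__()
        self.time_branch = nn.Sequential(   # filters wide in time, narrow in frequency
            nn.Conv2d(1, 16, kernel_size=(3, 10), padding=(1, 0)), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),
        )
        self.freq_branch = nn.Sequential(   # filters wide in frequency, narrow in time
            nn.Conv2d(1, 16, kernel_size=(10, 3), padding=(0, 1)), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),
        )
        self.out = nn.Linear(32, n_tags)

    def forward(self, x):                   # x: (batch, 1, cqt_bins, frames)
        t = self.time_branch(x).flatten(1)  # (batch, 16)
        f = self.freq_branch(x).flatten(1)  # (batch, 16)
        return torch.sigmoid(self.out(torch.cat([t, f], dim=1)))
```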

System characteristics
Features: CQT features
Classifier: CNN

Acoustic Scene and Event Recognition Using Recurrent Neural Networks (Vu2016)

Abstract

The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks that span various problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.
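
The report does not detail its architecture in the abstract, but a minimal recurrent tagger over MFCC frames might look like this hypothetical PyTorch sketch (all dimensions are invented):

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    # Hypothetical sketch of a recurrent tagger over MFCC frames.
    def __init__(self, n_mfcc=20, hidden=64, n_tags=7):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, x):    # x: (batch, frames, n_mfcc)
        _, h = self.rnn(x)   # final hidden state summarizes the chunk
        return torch.sigmoid(self.out(h[-1]))  # one probability per tag
```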

System characteristics
Features: MFCCs
Classifier: RNN

Fully DNN-Based Multi-Label Regression for Audio Tagging (Xu2016)

Abstract

Acoustic event detection for content analysis in most cases relies on a large amount of labeled data. However, manually annotating data is a time-consuming task, which is why few annotated resources are available so far. Unlike acoustic event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data. This is highly desirable for some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all, or almost all, frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improvements, such as dropout and background-noise-aware training, were adopted to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) methods, the proposed fully DNN-based method can better utilize the long-term temporal information with the whole chunk as input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
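
The core idea, feeding (nearly) all frames of a chunk into a DNN at once and regressing a multi-tag vector through sigmoid outputs, can be sketched as follows; the frame count, feature size, and layer widths here are invented, not the authors':

```python
import torch
import torch.nn as nn

# Hypothetical sketch of chunk-level multi-label regression: the whole chunk's
# frames are flattened into one input vector and mapped to a 7-tag vector.
n_frames, n_feats, n_tags = 100, 24, 7
model = nn.Sequential(
    nn.Flatten(),                       # (batch, n_frames * n_feats)
    nn.Linear(n_frames * n_feats, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, n_tags), nn.Sigmoid(),  # one value in [0, 1] per tag
)

tags = model(torch.randn(8, n_frames, n_feats))  # (8, 7) predicted tag scores
```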

System characteristics
Features: MFCCs
Classifier: DNN

Discriminative Training of GMM Parameters for Audio Scene Classification (Yun2016)

Abstract

This report describes our algorithm for audio scene classification and audio tagging, and its results on the DCASE 2016 challenge data. We propose a discriminative training algorithm to improve on the baseline GMM performance. The algorithm updates the baseline GMM parameters by maximizing the margin between classes to improve discriminative performance. For Task 1, we use a hierarchical classifier to maximize discriminative performance, and achieve 84% accuracy on the given cross-validation data. For Task 4, we apply a binary classifier for each label, and achieve 16.71% EER on the given cross-validation data.
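
A hypothetical sketch of the per-label binary GMM setup the abstract describes, using scikit-learn; the discriminative margin-based parameter update itself is not reproduced here, only the maximum-likelihood starting point and likelihood-ratio scoring:

```python
from sklearn.mixture import GaussianMixture

def train_tag_models(X_pos, X_neg, n_components=8):
    """Fit one GMM on frames from chunks with the tag, one on frames without.
    X_pos, X_neg: arrays of shape (n_frames, n_features)."""
    pos = GaussianMixture(n_components).fit(X_pos)
    neg = GaussianMixture(n_components).fit(X_neg)
    return pos, neg

def tag_score(pos, neg, X_chunk):
    # Mean log-likelihood ratio; higher means the tag is more likely present.
    return pos.score(X_chunk) - neg.score(X_chunk)
```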

System characteristics
Features: MFCCs
Classifier: GMM