Task description
A detailed task description is available on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 4: results on the evaluation and development sets (the latter when reported by the authors), class-wise results, technical reports and BibTeX citations.
Systems ranking
| Rank | Submission code | Technical Report | Equal Error Rate (evaluation dataset) | Equal Error Rate (development dataset) |
|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | 17.1 |
| | DCASE2016 baseline | Foster2016 | 20.9 | 21.3 |
| | Hertel_task4_1 | Hertel2016 | 22.1 | 17.3 |
| | Kong_task4_1 | Kong2016 | 18.9 | 20.9 |
| | Lidy_task4_1 | Lidy2016 | 16.6 | 17.8 |
| | Vu_task4_1 | Vu2016 | 21.1 | 20.0 |
| | Xu_task4_1 | Xu2016 | 19.5 | 17.9 |
| | Xu_task4_2 | Xu2016 | 19.8 | |
| | Yun_task4_1 | Yun2016 | 17.4 | 17.6 |
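All figures above are equal error rates (EER) in percent: per tag, the EER is the operating point at which the false-positive rate equals the false-negative rate, and the tag-wise values are averaged for the overall score. For reference, the sketch below shows one common way to compute such a figure from chunk-level tag scores with scikit-learn; the function and the toy data are illustrative and this is not the official evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """EER for one tag: the point where the false-positive rate
    equals the false-negative rate (1 - true-positive rate)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # closest crossing point
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: binary ground truth and system scores for one tag.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.5])
print(f"EER: {equal_error_rate(y_true, y_score):.3f}")

# The overall challenge figure is the average of the per-tag EERs.
```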
Class-wise performance
Equal error rates (in %) on the evaluation dataset; Average is the mean over the seven sound classes.

| Rank | Submission code | Tech. Report | Average | Child speech | Adult male speech | Adult female speech | Video game / TV | Percussive sounds | Broadband noise | Other identifiable sounds |
|---|---|---|---|---|---|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | 25.0 | 15.9 | 25.0 | 2.7 | 20.8 | 2.2 | 25.8 |
| | DCASE2016 baseline | Foster2016 | 20.9 | 19.1 | 32.6 | 31.4 | 5.6 | 21.2 | 11.7 | 24.9 |
| | Hertel_task4_1 | Hertel2016 | 22.1 | 18.3 | 27.8 | 23.4 | 8.0 | 20.1 | 32.3 | 24.6 |
| | Kong_task4_1 | Kong2016 | 18.9 | 19.5 | 28.0 | 22.9 | 9.0 | 22.1 | 3.9 | 27.2 |
| | Lidy_task4_1 | Lidy2016 | 16.6 | 21.0 | 18.2 | 21.4 | 3.5 | 16.8 | 3.2 | 32.0 |
| | Vu_task4_1 | Vu2016 | 21.1 | 22.6 | 30.7 | 29.3 | 7.8 | 21.8 | 7.8 | 27.9 |
| | Xu_task4_1 | Xu2016 | 19.5 | 20.9 | 31.3 | 21.6 | 4.0 | 24.9 | 6.5 | 27.2 |
| | Xu_task4_2 | Xu2016 | 19.8 | 20.3 | 30.4 | 23.6 | 3.7 | 27.5 | 4.8 | 28.0 |
| | Yun_task4_1 | Yun2016 | 17.4 | 17.7 | 25.3 | 17.9 | 10.2 | 20.7 | 3.2 | 26.6 |
System characteristics
| Rank | Submission code | Tech. Report | Equal Error Rate (evaluation dataset) | Features | Classifier |
|---|---|---|---|---|---|
| | Cakir_task4_1 | Cakir2016 | 16.8 | Mel spectrogram | CNN |
| | DCASE2016 baseline | Foster2016 | 20.9 | MFCCs | GMM |
| | Hertel_task4_1 | Hertel2016 | 22.1 | Magnitude spectrogram | CNN |
| | Kong_task4_1 | Kong2016 | 18.9 | Mel spectrogram | DNN |
| | Lidy_task4_1 | Lidy2016 | 16.6 | CQT features | CNN |
| | Vu_task4_1 | Vu2016 | 21.1 | MFCCs | RNN |
| | Xu_task4_1 | Xu2016 | 19.5 | MFCCs | DNN |
| | Xu_task4_2 | Xu2016 | 19.8 | MFCCs | DNN |
| | Yun_task4_1 | Yun2016 | 17.4 | MFCCs | GMM |
Technical reports
Domestic Audio Tagging with Convolutional Neural Networks
Emre Cakir, Toni Heittola and Tuomas Virtanen
Tampere University of Technology, Tampere, Finland
Cakir_task4_1
Domestic Audio Tagging with Convolutional Neural Networks
Abstract
In this paper, the method used in our submission for DCASE2016 challenge task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated. A relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed on the development dataset of the challenge.
System characteristics
Features | Mel spectrogram |
Classifier | CNN |
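The abstract gives the main ingredients (mel spectrogram input, a CNN, chunk-level multi-label output) but not the exact architecture. The sketch below shows what such a pipeline can look like with librosa and PyTorch; the 40-band mel resolution, layer sizes and seven-tag output are illustrative assumptions, not the authors' configuration.

```python
import librosa
import torch
import torch.nn as nn

N_TAGS = 7  # child/male/female speech, video game/TV, percussive, broadband noise, other

def log_mel(path, sr=16000, n_mels=40):
    """Log mel spectrogram of one audio chunk as a (1, n_mels, frames) tensor."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).float().unsqueeze(0)

class TaggerCNN(nn.Module):
    """Small convolutional tagger with one sigmoid output per tag."""
    def __init__(self, n_tags=N_TAGS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # pool over frequency and time
        )
        self.fc = nn.Linear(64, n_tags)

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        return torch.sigmoid(self.fc(self.conv(x).flatten(1)))

# Forward pass on a random 40-band, 400-frame chunk; training would use
# nn.BCELoss() against the binary chunk-level tag vector.
probs = TaggerCNN()(torch.randn(1, 1, 40, 400))
print(probs.shape)                             # torch.Size([1, 7])
```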
DCASE2016 Baseline System
Peter Foster1 and Toni Heittola2
1Queen Mary University of London, London, United Kingdom, 2Tampere University of Technology, Tampere, Finland
DCASE2016_task4_1
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Lars Hertel1, Huy Phan1,2 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Hertel_task4_1
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Abstract
We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5 % on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5 %, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17 % and 19 %, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. In particular, it possesses neither fully-connected layers, apart from the fully-connected output layer, nor dropout layers.
System characteristics
Features | Magnitude spectrogram |
Classifier | CNN |
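Masked global pooling is the part of this design that lets a purely convolutional network handle variable-length recordings: padded frames are excluded from the global average. A minimal sketch of that idea follows; it is not the authors' implementation, and the convolution stack and sizes are placeholders.

```python
import torch
import torch.nn as nn

class MaskedGlobalPool(nn.Module):
    """Average convolutional features over time, ignoring padded frames.

    `lengths` holds the true number of frames per clip so that zero-padding
    added for batching does not dilute the average.
    """
    def forward(self, feats, lengths):
        # feats: (batch, channels, frames); lengths: (batch,)
        batch, _, frames = feats.shape
        mask = (torch.arange(frames, device=feats.device)[None, :]
                < lengths[:, None]).float()            # (batch, frames)
        masked = feats * mask.unsqueeze(1)
        return masked.sum(dim=2) / lengths[:, None].float()

# Toy usage with placeholder shapes: 1-D convolutions over spectrogram frames,
# then masked pooling to a fixed-size vector regardless of clip length.
conv = nn.Sequential(nn.Conv1d(40, 64, 5, padding=2), nn.ReLU())
pool = MaskedGlobalPool()
x = torch.randn(2, 40, 120)                            # two clips, padded to 120 frames
out = pool(conv(x), lengths=torch.tensor([120, 87]))   # shape (2, 64)
print(out.shape)
```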
Deep Neural Network Baseline for DCASE Challenge 2016
Kong_task4_1
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel-filter bank features are used. Two kinds of Mel banks, the same area bank and the same height bank, are discussed. Experimental results show that the same height bank is better than the same area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel Frequency Cepstral Coefficient (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Features | Mel spectrogram |
Classifier | DNN |
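The abstract contrasts "same area" and "same height" mel filter banks. That distinction maps onto the filter normalization option in librosa, as sketched below on a synthetic signal; the parameters are illustrative and this is not the authors' feature extraction code.

```python
import numpy as np
import librosa

# Two 40-band mel filter banks over a 1024-point FFT at 16 kHz:
# norm='slaney' scales each triangular filter to (roughly) equal area, while
# norm=None keeps every triangle at the same peak height.
sr, n_fft, n_mels = 16000, 1024, 40
same_area = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, norm='slaney')
same_height = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, norm=None)

# Stand-in for a 4-second audio chunk.
y = np.random.default_rng(0).standard_normal(4 * sr).astype(np.float32)
spec = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2       # power spectrogram
feat_area = np.log(same_area @ spec + 1e-10)           # (40, frames) DNN input
feat_height = np.log(same_height @ spec + 1e-10)
print(feat_area.shape, feat_height.shape)
```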
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Thomas Lidy1 and Alexander Schindler2
1Institute of Software Technology, Vienna University of Technology, Vienna, Austria, 2Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria
Lidy_task4_1
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Abstract
For the DCASE 2016 audio benchmarking contest, we submitted a parallel Convolutional Neural Network architecture for the tasks of 1) classifying acoustic scenes and urban soundscapes (task 1) and 2) domestic audio tagging (task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We, however, found that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to the DCASE 2016 tasks 1 and 4, with some slight alterations described in this paper. Our approach shows a 10.7 % relative improvement over the baseline system of the Acoustic Scene Classification task on the development set of task 1 [1].
System characteristics
Features | CQT features |
Classifier | CNN |
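A Constant-Q-transformed input can be produced directly with librosa. The sketch below only shows what such a CNN input looks like; the bin counts and hop length are chosen for illustration rather than taken from the paper.

```python
import numpy as np
import librosa

# Log-magnitude CQT with 80 bins spanning 8 octaves (10 bins per octave).
sr = 22050
y = np.random.default_rng(0).standard_normal(4 * sr).astype(np.float32)  # stand-in for a 4 s clip
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         n_bins=80, bins_per_octave=10))
log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)   # (80, frames) CNN input
print(log_cqt.shape)
```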
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task4_1
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks that span various problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.
System characteristics
Features | MFCCs |
Classifier | RNN |
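The report pairs MFCC features with a recurrent network. The sketch below shows one plausible shape of such a tagger in PyTorch; the choice of a GRU, the layer sizes and the seven-tag output are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """GRU over an MFCC sequence, sigmoid tag probabilities from the last state."""
    def __init__(self, n_mfcc=20, hidden=128, n_tags=7):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_tags)

    def forward(self, x):                  # x: (batch, frames, n_mfcc)
        _, h = self.rnn(x)                 # h: (num_layers, batch, hidden)
        return torch.sigmoid(self.fc(h[-1]))

model = RNNTagger()
scores = model(torch.randn(4, 200, 20))    # four clips of 200 MFCC frames
print(scores.shape)                        # torch.Size([4, 7])
```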
Fully DNN-Based Multi-Label Regression for Audio Tagging
Yong Xu, Qiang Huang, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom
Xu_task4_1 Xu_task4_2
Fully DNN-Based Multi-Label Regression for Audio Tagging
Abstract
Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is a time-consuming task, which thus makes few annotated resources available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable for some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all, or almost all, frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can well map the audio feature sequence to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could well utilize the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
System characteristics
Features | MFCCs |
Classifier | DNN |
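The core idea is to flatten (almost) the whole chunk of frame-level features into one long input vector and regress it onto the binary tag vector with sigmoid outputs. Below is a minimal sketch of that framing; the frame count, feature dimension, layer widths, dropout rate and use of a mean-squared-error loss are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ChunkDNN(nn.Module):
    """Maps a whole chunk of frame features, flattened into one long vector,
    to a 7-dimensional tag vector with sigmoid outputs."""
    def __init__(self, n_frames=400, n_feats=24, n_tags=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_frames * n_feats, 1000), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(500, n_tags), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, n_frames, n_feats)
        return self.net(x.flatten(1))

model = ChunkDNN()
tags = model(torch.randn(8, 400, 24))          # eight chunks of frame features
# Regression-style training target: the binary chunk-level tag vector.
loss = nn.MSELoss()(tags, torch.randint(0, 2, (8, 7)).float())
print(tags.shape, loss.item())
```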
Discriminative Training of GMM Parameters for Audio Scene Classification
Sungrack Yun, Sungwoong Kim, Sunkuck Moon, Juncheol Cho and Taesu Kim
Qualcomm Research, Seoul, South Korea
Yun_task4_1
Discriminative Training of GMM Parameters for Audio Scene Classification
Abstract
This report describes our algorithm for audio scene classification and audio tagging and its results on the DCASE 2016 challenge data. We propose a discriminative training algorithm to improve on the baseline GMM performance. The algorithm updates the baseline GMM parameters by maximizing the margin between classes to improve discriminative performance. For Task 1, we use a hierarchical classifier to maximize discriminative performance and achieve 84% accuracy on the given cross-validation data. For Task 4, we apply a binary classifier for each label and achieve a 16.71% EER on the given cross-validation data.
System characteristics
Features | MFCCs |
Classifier | GMM |
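The per-tag binary classification described above can be pictured as a pair of GMMs per tag, one trained on frames from chunks where the tag is present and one where it is absent, scored by a log-likelihood ratio. The sketch below shows only that maximum-likelihood starting point with scikit-learn and illustrative sizes; the report's discriminative, margin-maximizing parameter update is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_tag_gmms(pos_frames, neg_frames, n_components=8):
    """Fit one GMM on frames where the tag is present and one where it is absent."""
    gmm_pos = GaussianMixture(n_components, covariance_type='diag').fit(pos_frames)
    gmm_neg = GaussianMixture(n_components, covariance_type='diag').fit(neg_frames)
    return gmm_pos, gmm_neg

def tag_score(gmm_pos, gmm_neg, frames):
    """Chunk-level score: mean frame log-likelihood ratio; sweeping a threshold
    on this score trades false alarms against misses (and yields the EER)."""
    return np.mean(gmm_pos.score_samples(frames) - gmm_neg.score_samples(frames))

# Toy usage with random 13-dimensional MFCC-like frames.
rng = np.random.default_rng(0)
gmm_pos, gmm_neg = fit_tag_gmms(rng.normal(1, 1, (500, 13)),
                                rng.normal(0, 1, (500, 13)))
print(tag_score(gmm_pos, gmm_neg, rng.normal(1, 1, (100, 13))))
```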