Domestic audio tagging

Task description

The challenge has ended. Full results for this task can be found here


This task is based on audio recordings made in a domestic environment. The objective of the task is to perform multi-label classification on 4-second audio chunks (i.e. assign zero or more labels to each 4-second audio chunk). We motivate this task by applications such as human activity monitoring, where identifying the precise boundaries of acoustic events is of secondary importance compared to determining the presence of events in the acoustic scene. Furthermore, when obtaining annotations for this task we observed that manually tagging audio chunks was much less time-consuming than manually locating event boundaries within recordings. We believe our chosen approach carries substantial potential for reducing the time cost, and thus improving the tractability, of obtaining manual annotations for large audio databases.

Figure 1: Overview of audio tagging system.

Audio dataset

Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house [Christensen2010]. The audio data are provided as 4-second chunks at two sampling rates (48kHz and 16kHz) with the 48kHz data in stereo and with the 16kHz data in mono. The 16kHz recordings were obtained by downsampling the right-hand channel of the 48kHz recordings. Each audio file corresponds to a single chunk.

All available audio data may be used for system development; however, the evaluation will be performed using the monophonic audio data sampled at 16kHz, with the aim of approximating typical recording capabilities of commodity hardware.

Out of 6137 chunks, 4378 chunks are available for system development, based on partitioning at the level of 5-minute recording segments.


Heidi Christensen, Jon Barker, Ning Ma, and Phil D. Green. The CHiME corpus: a resource and a challenge for computational hearing in multisource environments. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 1918–1921. ISCA, 2010.





PLEASE NOTE: If you downloaded the CHiME-Home dataset prior to the release of this information, please make sure to use the latest version of the development dataset, which includes monophonic audio data sampled at 16kHz.


The annotations are based on a set of 7 label classes, listed in Table 1. For each chunk, multi-label annotations were first obtained from each of 3 annotators. A detailed description of the annotation procedure is provided in:


P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley. CHiME-Home: a dataset for sound source recognition in a domestic environment. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. Oct 2015. doi:10.1109/WASPAA.2015.7336899.




In this task, the evaluation is based on those chunks where 2 or more annotators agreed about label presence across label classes. There are 1946 such 'strong agreement' chunks in the development dataset, and 816 such 'strong agreement' chunks in the evaluation dataset. Based on a majority vote, annotations are combined across annotators to form a single multi-label annotation (referred to as CHiME-Home-refine in [Foster2015]).
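
As a rough illustration of the majority vote only, the following Python sketch combines per-annotator label sets into a single multi-label annotation. It assumes, purely for illustration, that each annotator's labels for a chunk are stored as a string of the single-character labels from Table 1; the function name majority_vote and this string representation are hypothetical. The sketch does not reproduce the selection of 'strong agreement' chunks, which is described in [Foster2015].

```python
# Single-character labels, in the order in which they appear in Table 1.
LABELS = "cmfvpbo"


def majority_vote(annotations, min_votes=2):
    """Combine per-annotator label strings into one multi-label annotation.

    annotations: one string per annotator, e.g. ["cf", "c", "cv"] (assumed format).
    A label is kept when at least `min_votes` annotators assigned it.
    """
    kept = []
    for label in LABELS:
        votes = sum(1 for labels in annotations if label in labels)
        if votes >= min_votes:
            kept.append(label)
    return "".join(kept)
```

With 3 annotators and min_votes=2, a label is kept exactly when a majority of annotators marked it as present; a chunk may end up with no labels at all.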

Table 1: Labels used in annotations.
Label  Description                                            Number of occurrences (development dataset 'strong agreement' chunks)
c      Child speech                                           1214
m      Adult male speech                                      174
f      Adult female speech                                    409
v      Video game / TV                                        1181
p      Percussive sounds, e.g. crash, bang, knock, footsteps  765
b      Broadband noise, e.g. household appliances             19
o      Other identifiable sounds                              361

Development dataset

For the 1946 'strong agreement' chunks in the development dataset, label occurrences are summarised in Table 1. These chunks have been partitioned at the level of 5-minute recording segments for 5-fold cross validation (please refer to the file development_chunks_refined_crossval_dcase2016.csv in the updated version of the CHiME-Home dataset). In the partition, the 5-minute recording constraint was omitted for chunks labelled 'b', owing to the low number of associated label occurrences. While not used for evaluation, the remaining 2432 chunks in the development dataset may be used to train models, for example for unsupervised learning.

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Prediction task

For each chunk, output a classification score for each of the 7 label classes listed in Table 1.


Label prediction performance is quantified using the equal error rate (EER), which is defined as the fixed point of the graph of false negative rate versus false positive rate [Murphy2012, p. 181], for which Python and Matlab implementations are provided. The EER is computed individually for each label. When using the development data, EERs are computed individually for each cross-validation fold, before averaging the obtained EERs across folds.
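
As an illustration only, the definition above can be sketched in plain Python. This is not the provided Python or Matlab implementation: the function names equal_error_rate and mean_eer_over_folds are hypothetical, and the linear interpolation between the two operating points that bracket the crossing of the false positive and false negative rates is an assumption of this sketch.

```python
def equal_error_rate(y_true, y_score):
    """Approximate the EER for one label.

    y_true:  binary targets (1 = label present, 0 = label absent).
    y_score: classification scores, higher meaning 'more likely present'.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("EER needs at least one positive and one negative example")

    # Sort by decreasing score; lowering the threshold admits one group of
    # tied scores at a time.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])

    tp = fp = 0
    # Threshold above every score: nothing is predicted positive.
    prev_fpr, prev_fnr = 0.0, 1.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / n_neg
        fnr = 1.0 - tp / n_pos
        if fpr >= fnr:
            # FNR - FPR changes sign between the previous and the current
            # operating point; interpolate to the point where both are equal.
            gap_prev = prev_fnr - prev_fpr  # strictly positive here
            gap_curr = fpr - fnr            # non-negative here
            weight = gap_prev / (gap_prev + gap_curr)
            return prev_fpr + weight * (fpr - prev_fpr)
        prev_fpr, prev_fnr = fpr, fnr
    return prev_fpr  # not reached: the last operating point has FPR = 1, FNR = 0


def mean_eer_over_folds(folds):
    """Average per-fold EERs; `folds` is a list of (y_true, y_score) pairs."""
    eers = [equal_error_rate(y_true, y_score) for y_true, y_score in folds]
    return sum(eers) / len(eers)
```

Under these assumptions, the function would be called once per label and per cross-validation fold, and the per-fold values averaged with mean_eer_over_folds.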

A detailed description of the equal error rate can be found in:


Kevin P. Murphy. Machine learning : a probabilistic perspective. MIT Press, Cambridge, Mass. [u.a.], 2013. ISBN 9780262018029 0262018020. URL:



Code                Author         Affiliation                                                                                               Technical report                       Equal error rate (%)
Cakir_task4_1       Emre Cakir     Tampere University of Technology, Tampere, Finland                                                        task-audio-tagging-results#Cakir2016   16.8
DCASE2016 baseline  Peter Foster   Queen Mary University of London, London, United Kingdom                                                   task-audio-tagging-results#Foster2016  20.9
Hertel_task4_1      Lars Hertel    Institute for Signal Processing, University of Luebeck, Luebeck, Germany                                  task-audio-tagging-results#Hertel2016  22.1
Kong_task4_1        Qiuqiang Kong  Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom             task-audio-tagging-results#Kong2016    18.9
Lidy_task4_1        Thomas Lidy    Institute of Software Technology, Vienna University of Technology, Vienna, Austria                        task-audio-tagging-results#Lidy2016    16.6
Vu_task4_1          Toan H. Vu     Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan  task-audio-tagging-results#Vu2016      21.1
Xu_task4_1          Yong Xu        Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom             task-audio-tagging-results#Xu2016      19.5
Xu_task4_2          Yong Xu        Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom             task-audio-tagging-results#Xu2016      19.8
Yun_task4_1         Sungrack Yun   Qualcomm Research, Seoul, South Korea                                                                     task-audio-tagging-results#Yun2016     17.4

Complete results and technical reports can be found on the Task 4 results page.


  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • A technical report with a sufficient description of the system has to be submitted along with the system outputs.

More information on the submission process.

Baseline system

A baseline system implemented in Python using MFCCs as features performs multi-label classification by associating a binary classifier with each label class. Classification scores are obtained as log-likelihood ratios using a pair of GMMs, respectively associated with label presence/absence.
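
The scoring step described above can be sketched as follows. This is a minimal illustration, not the baseline code: it assumes diagonal-covariance GMMs whose parameters have already been estimated, represents each GMM as a hypothetical list of (weight, mean, variance) tuples, and averages the per-frame log-likelihood ratio over the frames of a chunk, which is an assumption about how frame-level values are pooled.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # Log-sum-exp keeps the computation numerically stable.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def chunk_score(frames, gmm_present, gmm_absent):
    """Mean per-frame log-likelihood ratio: label present versus label absent."""
    total = sum(
        gmm_log_likelihood(f, gmm_present) - gmm_log_likelihood(f, gmm_absent)
        for f in frames
    )
    return total / len(frames)
```

In a setup like the one described, frames would hold the MFCC vectors of one chunk, and one such pair of GMMs would be kept for each of the 7 label classes; a positive score then favours label presence.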

Python implementation

PLEASE NOTE: The provided baseline system should attempt to download the dataset by default, prior to training and testing models using the provided cross-validation partition. The relevant script for invoking this procedure is

Results for CHiME-Home, development set

Evaluation setup

  • 5-fold cross-validation
  • 7 classes
  • Average EER across folds

System parameters

  • Frame size: 20 ms (50% hop size)
  • Number of Gaussians per audio tag model: 8
  • Features: 14 MFCC static coefficients (excluding 0th)
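
To make the frame parameters concrete, the short sketch below works out the frame length, hop length and frame count for one chunk. It assumes that the 16kHz monophonic audio is used and that a trailing partial frame is dropped rather than padded; the function name frame_layout is hypothetical, and the baseline may handle chunk edges differently.

```python
def frame_layout(sample_rate_hz, chunk_seconds, frame_seconds, hop_fraction):
    """Return (frame length, hop length, number of full frames), in samples."""
    frame_len = round(sample_rate_hz * frame_seconds)
    hop_len = round(frame_len * hop_fraction)
    n_samples = round(sample_rate_hz * chunk_seconds)
    # Count only frames that fit entirely inside the chunk (no padding).
    n_frames = 1 + (n_samples - frame_len) // hop_len
    return frame_len, hop_len, n_frames


# 4-second chunk at 16 kHz, 20 ms frames, 50% hop.
print(frame_layout(16000, 4, 0.020, 0.5))
```

Under these assumptions, a 20 ms frame is 320 samples, the hop is 160 samples, and a 4-second chunk of 64000 samples yields 399 full frames.
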

Audio tagging results over evaluation folds.
Audio tag            EER
adult female speech  0.29
adult male speech    0.30
broadband noise      0.09
child speech         0.20
other                0.29
percussive sound     0.25
video game/tv        0.07
Mean error           0.21


Classification scores should be output to a text file containing the score associated with each (chunk, label) combination. Each line should contain a file name identifying a chunk, followed by a comma-delimited character indicating the label, followed by the classification score, e.g. file1.wav,f,0.8751. (Since there are 7 label classes, there should be 7 lines output for each chunk.) The file should be ASCII-formatted and lines should be terminated by the newline (\n) character. Relative to the baseline script location, example output may be found at data/CHiMeHome-audiotag-development/evaluation_setup.
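
The output format described above could be produced along the following lines. This is a sketch under assumptions: the function names format_scores and write_scores are hypothetical, the scores are assumed to be held in a dictionary mapping each chunk file name to a dictionary of per-label scores, and the four decimal places simply mirror the example line rather than a stated requirement.

```python
# Single-character labels, in the order in which they appear in Table 1.
LABELS = "cmfvpbo"


def format_scores(scores):
    """Render {file name: {label: score}} as 'file,label,score' lines."""
    lines = []
    for name in sorted(scores):
        for label in LABELS:
            # A missing label raises KeyError, so every chunk gets all 7 lines.
            lines.append(f"{name},{label},{scores[name][label]:.4f}")
    return "\n".join(lines) + "\n"


def write_scores(path, scores):
    """Write the score lines as ASCII text with '\\n' line endings."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(format_scores(scores))
```

Passing newline="\n" to open keeps the line terminator fixed regardless of platform, and the ASCII encoding raises an error if a file name contains non-ASCII characters.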

Detailed information on the challenge submission can be found on the submission page.


When citing the challenge task and results, please cite the following paper:


A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.

