Challenge has ended.
This site collects information from the original DCASE2013 Challenge website in order to document DCASE challenges in a uniform way.
Challenge results and analysis of submitted systems have been published in:
D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, Oct 2015. doi:10.1109/TMM.2015.2428998.
Detection and Classification of Acoustic Scenes and Events
For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
acoustic signal processing;knowledge based systems;speech recognition;acoustic scenes detection;acoustic scenes classification;intelligent systems;audio modality;speech recognition;music;IEEE Audio and Acoustic Signal Processing Technical Committee;DCASE;Event detection;Speech;Speech recognition;Music;Microphones;Licenses;Audio databases;event detection;machine intelligence;pattern recognition
Results for each task are presented on the task-specific results pages.
We invite researchers in signal processing, machine learning and other fields to participate in our challenge, which consists of a set of related tasks on automatic detection and classification of acoustic scenes and acoustic events.
The tasks fall into the field of computational auditory scene analysis (CASA). Humans are able to follow specific sound sources in a complex audio environment with ease and the development of systems that try to mimic this behaviour is an open problem, especially in the case of overlapping sound events.
Acoustic scene classification
The goal of acoustic scene classification is to classify a test recording into one of a set of predefined classes that characterize the environment in which it was recorded -- for example "park", "busy street", "office". The acoustic data will include recordings from 10 contexts.
Sound event detection
The event detection challenge will address the problem of identifying individual sound events that are prominent in an acoustic scene. Two distinct experiments will take place, one for simple acoustic scenes without overlapping sounds and the other using complex scenes in a polyphonic scenario:
The first dataset for event detection will consist of 3 subsets (for development, training, and testing). The training set will contain instantiations of individual events for every class. The development and testing datasets, denoted as Office Live (OL), will consist of 1 min recordings of everyday audio events in a number of office environments. The audio events for these recordings will be annotated and they will include: door knock, door slam, speech, laughter, keyboard clicks, objects hitting table, keys clinging, phone ringing, turning page, cough, printer, short alert-beeping, clearing throat, mouse click, drawer, and switches.
The second dataset will contain artificially sequenced sounds provided by the Analysis-Synthesis team of IRCAM, termed Office Synthetic (OS). The training set will be identical to the one for the first dataset. The development and testing sets will consist of artificial scenes built by sequencing recordings of individual events (different recordings from the ones used for the training dataset) and background recordings provided by C4DM.
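The sequencing idea behind the Office Synthetic scenes can be illustrated with a minimal sketch. This is not the generator used by IRCAM: the function name `sequence_scene`, the plain-list sample representation, and the `gain` parameter are assumptions made purely for illustration.

```python
def sequence_scene(background, events, gain=1.0):
    """Mix isolated event recordings into a background recording.

    background: list of float samples.
    events: list of (label, onset_sample, samples) tuples (hypothetical format).
    Returns (mixture, annotations), where annotations holds one
    (label, onset_sample, offset_sample) tuple per placed event.
    """
    mixture = list(background)
    annotations = []
    for label, onset, samples in events:
        # Truncate events that would run past the end of the background.
        end = min(onset + len(samples), len(mixture))
        for i in range(onset, end):
            mixture[i] += gain * samples[i - onset]
        annotations.append((label, onset, end))
    return mixture, annotations


# Toy example: two events that overlap at sample 4 (a polyphonic moment).
background = [0.0] * 10
events = [("door knock", 2, [1.0, 1.0, 1.0]), ("cough", 4, [0.5, 0.5])]
mixture, annotations = sequence_scene(background, events)
```

Because the onsets are chosen by the generator, the annotations come for free; in this toy example the two events overlap in one sample, which is the kind of situation the polyphonic experiment targets.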
WASPAA2013 Poster session
Participants were encouraged to submit novel work as a regular paper at the WASPAA 2013 conference. Approved papers were presented in the D-CASE Poster Session.
J. T. Geiger, B. Schuller, and G. Rigoll. Large-scale audio feature extraction and SVM for acoustic scene classification. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701857.
This work describes a system for acoustic scene classification using large-scale audio feature extraction. It is our contribution to the Scene Classification track of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (D-CASE). The system classifies 30 second long recordings of 10 different acoustic scenes. From the highly variable recordings, a large number of spectral, cepstral, energy and voicing-related audio features are extracted. Using a sliding window approach, classification is performed on short windows. SVM are used to classify these short segments, and a majority voting scheme is employed to get a decision for longer recordings. On the official development set of the challenge, an accuracy of 73 % is achieved. SVM are compared with a nearest neighbour classifier and an approach called Latent Perceptual Indexing, whereby SVM achieve the best results. A feature analysis using the t-statistic shows that mainly Mel spectra are the most relevant features.
audio signals;feature extraction;signal classification;statistical analysis;support vector machines;Mel spectra;t-statistic;latent perceptual indexing;nearest neighbour classifier;short segments;sliding window approach;variable recordings;D-CASE;acoustic scenes;IEEE AASP;scene classification track;acoustic scene classification;SVM;support vector machines;large-scale audio feature extraction;Support vector machines;Mel frequency cepstral coefficient;Feature extraction;Accuracy;Training data;Training;Computational auditory scene analysis;acoustic scene recognition;feature extraction
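The window-level classification and majority voting described in the abstract above can be sketched as follows. This is only an illustration of the voting scheme, not the authors' system: `toy_classifier` is a hypothetical stand-in for the SVM, and the window length and hop are arbitrary.

```python
from collections import Counter


def sliding_windows(frames, window_len, hop):
    """Yield consecutive windows of `window_len` frames, advancing by `hop`."""
    for start in range(0, len(frames) - window_len + 1, hop):
        yield frames[start:start + window_len]


def classify_recording(frames, window_classifier, window_len, hop):
    """Classify each short window, then return the most frequent label."""
    votes = Counter(
        window_classifier(w) for w in sliding_windows(frames, window_len, hop)
    )
    if not votes:
        raise ValueError("recording is shorter than one window")
    # Ties are resolved in favour of the label that was seen first.
    return votes.most_common(1)[0][0]


def toy_classifier(window):
    """Hypothetical stand-in for a trained classifier: threshold on the mean."""
    return "busy street" if sum(window) / len(window) > 0.5 else "office"


frames = [0.1, 0.2, 0.9, 0.8, 0.9, 0.7]
label = classify_recording(frames, toy_classifier, window_len=2, hop=2)
```

Here the three windows vote "office", "busy street", "busy street", so the recording-level decision is "busy street".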
J. F. Gemmeke, L. Vuegen, P. Karsmakers, B. Vanrumste, and H. Van hamme. An exemplar-based NMF approach to audio event detection. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701847.
We present a novel, exemplar-based method for audio event detection based on non-negative matrix factorisation. Building on recent work in noise robust automatic speech recognition, we model events as a linear combination of dictionary atoms, and mixtures as a linear combination of overlapping events. The weights of activated atoms in an observation serve directly as evidence for the underlying event classes. The atoms in the dictionary span multiple frames and are created by extracting all possible fixed-length exemplars from the training data. To combat data scarcity of small training datasets, we propose to artificially augment the amount of training data by linear time warping in the feature domain at multiple rates. The method is evaluated on the Office Live and Office Synthetic datasets released by the AASP Challenge on Detection and Classification of Acoustic Scenes and Events.
acoustic signal detection;acoustic signal processing;audio signal processing;matrix decomposition;signal classification;speech recognition;exemplar-based NMF approach;nonnegative matrix factorisation;audio event detection;noise robust automatic speech recognition;dictionary atoms;linear overlapping event combination;dictionary span multiple frames;possible fixed-length exemplar extraction;linear time warping;Office Live datasets;Office Synthetic datasets;AASP Challenge;acoustic scene detection;acoustic scene classification;acoustic event detection;acoustic event classification;data scarcity;Acoustics;Dictionaries;Event detection;Hidden Markov models;Training data;Measurement;Noise;Audio event detection;exemplars;NMF
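The data augmentation mentioned in the abstract above, linear time warping in the feature domain, can be sketched as resampling a sequence of feature frames by linear interpolation. The exact procedure and warping rates used in the paper are not given here, so the function `warp_features` and the rates below are assumptions for illustration.

```python
def warp_features(frames, rate):
    """Linearly time-warp a sequence of feature vectors.

    frames: list of equal-length feature vectors (lists of floats).
    rate: playback rate; rate > 1 shortens the sequence, rate < 1 stretches it.
    Output frame j is interpolated at input position j * rate.
    """
    n = len(frames)
    n_out = int((n - 1) / rate) + 1
    warped = []
    for j in range(n_out):
        pos = j * rate
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Interpolate each feature dimension between the two nearest frames.
        warped.append(
            [(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])]
        )
    return warped


original = [[0.0], [2.0], [4.0]]
stretched = warp_features(original, 0.5)   # twice as long
compressed = warp_features(original, 2.0)  # half the rate positions
```

Applying several rates to each training exemplar yields additional exemplars of different durations from the same recordings.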
K. Lee, Z. Hyung, and J. Nam. Acoustic scene classification using sparse feature learning and event-based pooling. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701893.
Recently, unsupervised learning algorithms have been successfully used to represent data in many machine recognition tasks. In particular, sparse feature learning algorithms have shown that they can not only discover meaningful structures from raw data but also outperform many hand-engineered features. In this paper, we apply the sparse feature learning approach to acoustic scene classification. We use a sparse restricted Boltzmann machine to capture manyfold local acoustic structures from audio data and represent the data in a high-dimensional sparse feature space given the learned structures. For scene classification, we summarize the local features by pooling over audio scene data. While the feature pooling is typically performed over uniformly divided segments, we suggest a new pooling method, which first detects audio events and then performs pooling only over detected events, considering the irregular occurrence of audio events in acoustic scene data. We evaluate the learned features on the IEEE AASP Challenge development set, comparing them with a baseline model using mel-frequency cepstral coefficients (MFCCs). The results show that learned features outperform MFCCs, event-based pooling achieves higher accuracy than uniform pooling and, furthermore, a combination of the two methods performs even better than either one used alone.
acoustic signal processing;Boltzmann machines;cepstral analysis;learning (artificial intelligence);signal classification;acoustic scene classification;event-based pooling;unsupervised learning algorithm;machine recognition tasks;sparse feature learning algorithm;sparse restricted Boltzmann machine;IEEE AASP Challenge development set;mel-frequency cepstral coefficients;MFCC;Acoustics;Feature extraction;Training;Electron tubes;Accuracy;Mathematical model;Conferences;acoustic scene classification;environmental sound;feature learning;restricted Boltzmann machine;sparse feature representation;max-pooling;event detection
M. E. Niessen, T. L. M. Van Kasteren, and A. Merentitis. Hierarchical modeling using automated sub-clustering for sound event recognition. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701862.
The automatic recognition of sound events allows for novel applications in areas such as security, mobile and multimedia. In this work we present a hierarchical hidden Markov model for sound event detection that automatically clusters the inherent structure of the events into sub-events. We evaluate our approach on an IEEE audio challenge dataset consisting of office sound events and provide a systematic comparison of the various building blocks of our approach to demonstrate the effectiveness of incorporating certain dependencies in the model. The hierarchical hidden Markov model achieves an average frame-based F-measure recognition performance of 45.5% on a test dataset that was used to evaluate challenge submissions. We also show how the hierarchical model can be used as a meta-classifier, although in the particular application this did not lead to an increase in performance on the test dataset.
audio signal processing;hidden Markov models;metaclassifier;frame based F measure recognition;sound event detection;hierarchical hidden Markov model;automatic recognition;sound event recognition;automated subclustering;hierarchical modeling;Hidden Markov models;Data models;Acoustics;Feature extraction;Conferences;Event detection;Speech;sound event detection;hierarchical models;meta-classifier
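A frame-based F-measure of the kind reported in the abstract above can be sketched as follows. This is a simplified illustration rather than the official challenge evaluation code: the function names, the midpoint rule for deciding whether a frame is active, and the frame length are assumptions.

```python
def active_pairs(events, frame_len, n_frames):
    """Return the set of (frame_index, label) pairs covered by the events.

    events: list of (label, onset_sec, offset_sec) tuples.
    A frame counts as active for a label if its midpoint lies in the event.
    """
    pairs = set()
    for i in range(n_frames):
        midpoint = (i + 0.5) * frame_len
        for label, onset, offset in events:
            if onset <= midpoint < offset:
                pairs.add((i, label))
    return pairs


def frame_based_f_measure(reference, estimate, frame_len, n_frames):
    """Harmonic mean of frame-level precision and recall."""
    ref = active_pairs(reference, frame_len, n_frames)
    est = active_pairs(estimate, frame_len, n_frames)
    true_positives = len(ref & est)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(est)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [("speech", 0.0, 2.0), ("cough", 3.0, 4.0)]
estimate = [("speech", 0.5, 2.5), ("cough", 3.0, 4.0)]
f_measure = frame_based_f_measure(reference, estimate, frame_len=0.5, n_frames=8)
```

In this toy case 5 of the 6 estimated active frames are correct and 5 of the 6 reference frames are found, so precision, recall, and the F-measure are all 5/6.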
G. Roma, W. Nogueira, and P. Herrera. Recurrence quantification analysis features for environmental sound recognition. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701890.
This paper tackles the problem of feature aggregation for recognition of auditory scenes in unlabeled audio. We describe a new set of descriptors based on Recurrence Quantification Analysis (RQA), which can be extracted from the similarity matrix of a time series of audio descriptors. We analyze their usefulness for environmental audio recognition combined with traditional feature statistics in the context of the AASP D-CASE challenge. Our results show the potential of non-linear time series analysis techniques for dealing with environmental sounds.
audio signal processing;time series;nonlinear time series analysis technique;environmental audio recognition;unlabeled audio;auditory scene recognition;feature aggregation;environmental sound recognition;RQA;recurrence quantification analysis;Feature extraction;Mel frequency cepstral coefficient;Databases;Accuracy;Time series analysis;Conferences
J. Schröder, N. Moritz, M. R. Schädler, B. Cauchi, K. Adiloglu, J. Anemüller, S. Doclo, B. Kollmeier, and S. Goetze. On the use of spectro-temporal features for the IEEE AASP challenge ‘detection and classification of acoustic scenes and events’. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. Oct 2013. doi:10.1109/WASPAA.2013.6701868.
In this contribution, an acoustic event detection system based on spectro-temporal features and a two-layer hidden Markov model as back-end is proposed within the framework of the IEEE AASP challenge ‘Detection and Classification of Acoustic Scenes and Events’ (D-CASE). Noise reduction based on the log-spectral amplitude estimator by  and noise power density estimation by  is used for signal enhancement. Performance based on three different kinds of features is compared, i.e. for amplitude modulation spectrogram, Gabor filterbank-features and conventional Mel-frequency cepstral coefficients (MFCCs), all of them known from automatic speech recognition (ASR). The evaluation is based on the office live recordings provided within the D-CASE challenge. The influence of the signal enhancement is investigated and the increase in recognition rate by the proposed features in comparison to MFCC-features is shown. It is demonstrated that the proposed spectro-temporal features achieve a better recognition accuracy than MFCCs.
Gabor filters;hidden Markov models;signal classification;speech enhancement;speech recognition;spectro-temporal features;acoustic event detection system;two-layer hidden Markov model;IEEE AASP challenge;detection and classification of acoustic scenes and events;noise reduction;log-spectral amplitude estimator;noise power density estimation;signal enhancement;Gabor filterbank-features;Mel-frequency cepstral coefficients;automatic speech recognition;office live recordings;D-CASE challenge;Acoustics;Hidden Markov models;Feature extraction;Speech;Event detection;Noise;Frequency modulation;acoustic event detection;Gabor filterbank;amplitude modulation spectrogram;IEEE AASP D-CASE challenge