scene classification

Challenge results

Official results are shown on the original DCASE2013 Challenge website. The purpose of this page is to present the results in a uniform way, consistent with more recent editions of the DCASE Challenge.

Task description

The scene classification (SC) challenge will address the problem of identifying and classifying acoustic scenes and soundscapes.

A more detailed task description can be found on the task description page.

Systems ranking

| Submission code | Submission name | Technical report | Accuracy (%) with 95% confidence interval (Evaluation dataset) |
| --- | --- | --- | --- |
| DCASE2013 baseline | Baseline | | 55.0 (45.2 - 64.8) |
| CHR_1 | CHR_SVM | Chum2013 | 63.0 (53.5 - 72.5) |
| CHR_2 | CHR_HMM | Chum2013 | 65.0 (55.7 - 74.3) |
| ELF | ELF | Elizalde2013 | 55.0 (45.2 - 64.8) |
| GSR | GSR | Geiger2013 | 69.0 (59.9 - 78.1) |
| KH | KH | Krijnders2013 | 55.0 (45.2 - 64.8) |
| LTT_1 | LTT_1 | Li2013 | 72.0 (63.2 - 80.8) |
| LTT_2 | LTT_2 | Li2013 | 70.0 (61.0 - 79.0) |
| LTT_3 | LTT_2 | Li2013 | 67.0 (57.8 - 76.2) |
| NHL | NHL | Nam2013 | 60.0 (50.4 - 69.6) |
| NR1_1 | NR1_1 | Nogueira2013 | 60.0 (50.4 - 69.6) |
| NR1_2 | NR1_2 | Nogueira2013 | 60.0 (50.4 - 69.6) |
| NR1_3 | NR1_3 | Nogueira2013 | 59.0 (49.4 - 68.6) |
| OE | OE | Olivetti2013 | 14.0 (7.2 - 20.8) |
| PE | PE | Patil2013 | 58.0 (48.3 - 67.7) |
| RG | RG | Rakotomamonjy2013 | 69.0 (59.9 - 78.1) |
| RNH_1 | RNH1 | Roma2013 | 71.0 (62.1 - 79.9) |
| RNH_2 | RNH2 | Roma2013 | 76.0 (67.6 - 84.4) |
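
The intervals listed above are consistent with a normal-approximation (Wald) binomial interval computed over 100 evaluation recordings. Both the sample size of 100 and the choice of interval are assumptions inferred from the reported numbers, not something this page states, and the function name `accuracy_ci` is purely illustrative. A minimal sketch under those assumptions:

```python
import math


def accuracy_ci(accuracy_pct, n_items=100, z=1.96):
    """Normal-approximation 95% interval for an accuracy given in percent.

    n_items=100 is an assumption (inferred from the table, not stated on the page).
    """
    p = accuracy_pct / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n_items)
    lower = round((p - half_width) * 100.0, 1)
    upper = round((p + half_width) * 100.0, 1)
    return lower, upper


if __name__ == "__main__":
    for acc in (55.0, 76.0, 14.0):
        print(acc, accuracy_ci(acc))
```

With so few evaluation items the intervals are wide, so many neighbouring systems in the table overlap and their ordering should be read with caution.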

Teams ranking

Table including only the best performing system per submitting team.

| Submission code | Submission name | Technical report | Accuracy (%) with 95% confidence interval (Evaluation dataset) |
| --- | --- | --- | --- |
| DCASE2013 baseline | Baseline | | 55.0 (45.2 - 64.8) |
| CHR_2 | CHR_HMM | Chum2013 | 65.0 (55.7 - 74.3) |
| ELF | ELF | Elizalde2013 | 55.0 (45.2 - 64.8) |
| GSR | GSR | Geiger2013 | 69.0 (59.9 - 78.1) |
| KH | KH | Krijnders2013 | 55.0 (45.2 - 64.8) |
| LTT_1 | LTT_1 | Li2013 | 72.0 (63.2 - 80.8) |
| NHL | NHL | Nam2013 | 60.0 (50.4 - 69.6) |
| NR1_1 | NR1_1 | Nogueira2013 | 60.0 (50.4 - 69.6) |
| OE | OE | Olivetti2013 | 14.0 (7.2 - 20.8) |
| PE | PE | Patil2013 | 58.0 (48.3 - 67.7) |
| RG | RG | Rakotomamonjy2013 | 69.0 (59.9 - 78.1) |
| RNH_2 | RNH2 | Roma2013 | 76.0 (67.6 - 84.4) |
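
The team table can be derived from the systems table by keeping, for each technical report, the submission with the highest accuracy. The sketch below illustrates this on a subset of the rows; the tie-breaking rule (keep the first submission listed) is an assumption that happens to match the NR1_1 entry above, and the names `SYSTEMS` and `best_per_team` are illustrative.

```python
# (submission code, technical report, accuracy in %), copied from the systems table
SYSTEMS = [
    ("CHR_1", "Chum2013", 63.0),
    ("CHR_2", "Chum2013", 65.0),
    ("LTT_1", "Li2013", 72.0),
    ("LTT_2", "Li2013", 70.0),
    ("LTT_3", "Li2013", 67.0),
    ("NR1_1", "Nogueira2013", 60.0),
    ("NR1_2", "Nogueira2013", 60.0),
    ("NR1_3", "Nogueira2013", 59.0),
    ("RNH_1", "Roma2013", 71.0),
    ("RNH_2", "Roma2013", 76.0),
]


def best_per_team(systems):
    """Return {technical report: (submission code, accuracy)} for the best system."""
    best = {}
    for code, report, accuracy in systems:
        # Strict comparison keeps the first-listed submission when accuracies tie.
        if report not in best or accuracy > best[report][1]:
            best[report] = (code, accuracy)
    return best


if __name__ == "__main__":
    for report, (code, accuracy) in best_per_team(SYSTEMS).items():
        print(report, code, accuracy)
```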

Class-wise performance

| Submission code | Submission name | Technical report | Accuracy (Evaluation dataset) | Bus | Busy street | Office | Open air market | Park | Quiet street | Restaurant | Supermarket | Tube | Tube station |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DCASE2013 baseline | Baseline | | 55.0 | 90.0 | 30.0 | 80.0 | 70.0 | 90.0 | 40.0 | 30.0 | 20.0 | 60.0 | 40.0 |
| CHR_1 | CHR_SVM | Chum2013 | 63.0 | 100.0 | 80.0 | 40.0 | 70.0 | 30.0 | 40.0 | 70.0 | 50.0 | 80.0 | 70.0 |
| CHR_2 | CHR_HMM | Chum2013 | 65.0 | 90.0 | 90.0 | 80.0 | 60.0 | 90.0 | 50.0 | 40.0 | 30.0 | 80.0 | 40.0 |
| ELF | ELF | Elizalde2013 | 55.0 | 50.0 | 70.0 | 60.0 | 100.0 | 40.0 | 50.0 | 70.0 | 30.0 | 30.0 | 50.0 |
| GSR | GSR | Geiger2013 | 69.0 | 90.0 | 90.0 | 90.0 | 90.0 | 70.0 | 40.0 | 80.0 | 50.0 | 70.0 | 20.0 |
| KH | KH | Krijnders2013 | 55.0 | 60.0 | 90.0 | 70.0 | 50.0 | 30.0 | 50.0 | 50.0 | 20.0 | 80.0 | 50.0 |
| LTT_1 | LTT_1 | Li2013 | 72.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 50.0 | 70.0 | 70.0 | 70.0 | 40.0 |
| LTT_2 | LTT_2 | Li2013 | 70.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 60.0 | 70.0 | 40.0 | 70.0 | 40.0 |
| LTT_3 | LTT_2 | Li2013 | 67.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 50.0 | 60.0 | 30.0 | 70.0 | 40.0 |
| NHL | NHL | Nam2013 | 60.0 | 70.0 | 90.0 | 80.0 | 70.0 | 50.0 | 60.0 | 50.0 | 40.0 | 30.0 | 60.0 |
| NR1_1 | NR1_1 | Nogueira2013 | 60.0 | 80.0 | 80.0 | 60.0 | 70.0 | 70.0 | 30.0 | 60.0 | 80.0 | 20.0 | 50.0 |
| NR1_2 | NR1_2 | Nogueira2013 | 60.0 | 80.0 | 90.0 | 50.0 | 80.0 | 70.0 | 20.0 | 90.0 | 60.0 | 20.0 | 40.0 |
| NR1_3 | NR1_3 | Nogueira2013 | 59.0 | 80.0 | 80.0 | 50.0 | 70.0 | 70.0 | 20.0 | 90.0 | 70.0 | 20.0 | 40.0 |
| OE | OE | Olivetti2013 | 14.0 | 0.0 | 10.0 | 20.0 | 20.0 | 10.0 | 30.0 | 0.0 | 20.0 | 20.0 | 10.0 |
| PE | PE | Patil2013 | 58.0 | 90.0 | 90.0 | 50.0 | 70.0 | 40.0 | 60.0 | 60.0 | 20.0 | 40.0 | 60.0 |
| RG | RG | Rakotomamonjy2013 | 69.0 | 100.0 | 100.0 | 80.0 | 80.0 | 80.0 | 30.0 | 50.0 | 50.0 | 80.0 | 40.0 |
| RNH_1 | RNH1 | Roma2013 | 71.0 | 80.0 | 100.0 | 60.0 | 40.0 | 70.0 | 60.0 | 100.0 | 70.0 | 70.0 | 60.0 |
| RNH_2 | RNH2 | Roma2013 | 76.0 | 80.0 | 100.0 | 80.0 | 70.0 | 70.0 | 50.0 | 100.0 | 80.0 | 70.0 | 60.0 |
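
In every row above, the overall accuracy equals the unweighted mean of the ten class-wise accuracies, and each class-wise value is a multiple of 10%. This suggests an evaluation set with the same number of recordings per class, which is an inference from the numbers rather than something stated on this page. A short check on three rows (the names `CLASS_WISE` and `overall_accuracy` are illustrative):

```python
# Class-wise accuracies (%) copied from three rows of the table above, in the order:
# bus, busy street, office, open air market, park, quiet street,
# restaurant, supermarket, tube, tube station.
CLASS_WISE = {
    "DCASE2013 baseline": [90.0, 30.0, 80.0, 70.0, 90.0, 40.0, 30.0, 20.0, 60.0, 40.0],
    "OE": [0.0, 10.0, 20.0, 20.0, 10.0, 30.0, 0.0, 20.0, 20.0, 10.0],
    "RNH_2": [80.0, 100.0, 80.0, 70.0, 70.0, 50.0, 100.0, 80.0, 70.0, 60.0],
}


def overall_accuracy(class_accuracies):
    """Unweighted mean of class-wise accuracies (valid when classes are balanced)."""
    return sum(class_accuracies) / len(class_accuracies)


if __name__ == "__main__":
    for name, values in CLASS_WISE.items():
        print(name, overall_accuracy(values))
```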

System characteristics

| Code | Technical report | Accuracy (%) | Input | Sampling rate | Features | Classifier |
| --- | --- | --- | --- | --- | --- | --- |
| DCASE2013 baseline | | 55.0 | mono | 44.1kHz | MFCC | GMM |
| CHR_1 | Chum2013 | 63.0 | mono | 11.025kHz | Magnitude response, Loudness, Spectral sparsity, Temporal sparsity | SVM |
| CHR_2 | Chum2013 | 65.0 | mono | 11.025kHz | Magnitude response, Loudness, Spectral sparsity, Temporal sparsity | HMM |
| ELF | Elizalde2013 | 55.0 | left, right, difference, average | 44.1kHz | MFCC | i-vector, pLDA |
| GSR | Geiger2013 | 69.0 | mono | 44.1kHz | openSMILE / emo_large | SVM |
| KH | Krijnders2013 | 55.0 | mono | 44.1kHz | tone-fit representation | SVM |
| LTT_1 | Li2013 | 72.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
| LTT_2 | Li2013 | 70.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
| LTT_3 | Li2013 | 67.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
| NHL | Nam2013 | 60.0 | mono | 44.1kHz | Feature learning, max-pooling | SVM |
| NR1_1 | Nogueira2013 | 60.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
| NR1_2 | Nogueira2013 | 60.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
| NR1_3 | Nogueira2013 | 59.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
| OE | Olivetti2013 | 14.0 | mono | 44.1kHz | Normalized compression distance, Euclidean embedding | Random Forest |
| PE | Patil2013 | 58.0 | mono | 44.1kHz | Spectrotemporal modulation | SVM |
| RG | Rakotomamonjy2013 | 69.0 | mono | 44.1kHz | CQT, HOG | SVM |
| RNH_1 | Roma2013 | 71.0 | mono | 44.1kHz | MFCC, Recurrence Quantification Analysis | SVM |
| RNH_2 | Roma2013 | 76.0 | mono | 44.1kHz | MFCC, Recurrence Quantification Analysis | SVM |

Technical reports

IEEE AASP Scene Classification Challenge Using Hidden Markov Models and Frame Based Classification

May Chum, Ariel Habshush, Abrar Rahman and Christopher Sang
Electrical Engineering Department, The Cooper Union, New York, USA


The IEEE AASP Challenge involves the detection and classification of acoustic scenes and events. The scene classification (SC) challenge consists of 10 different scenes with 10 audio files of length 30 seconds each, for a total of 100 audio clips. The list of scenes is: busy street, quiet street, park, open-air market, bus, subway-train, restaurant, shop/supermarket, office, and subway station. The goal is to test on a development set that is composed of audio clips of the same scenes as the training set and determine what scene the audio clips originated from. One of the algorithms presented in this paper to discriminate between these different scenes uses hidden Markov models (HMMs) and Gaussian mixture models (GMMs). The features used are the following: short-time Fourier transform, loudness, and spectral sparsity. Using these features yielded 72% correct classification with 10-fold cross-validation. The other algorithm uses the same features plus temporal sparsity to classify individual frames of an audio clip, then votes on the class. This algorithm achieved 62% accuracy.

System characteristics
Input mono
Sampling rate 11.025kHz
Features Magnitude response, Loudness, Spectral sparsity, Temporal sparsity
Classifier SVM; HMM

An I-Vector Based Approach for Audio Scene Detection

Benjamin Elizalde¹, Howard Lei¹, Gerald Friedland¹ and Nils Peters²
¹International Computer Science Institute, Berkeley, USA, ²Qualcomm Technologies Inc., San Diego, USA


The IEEE AASP Scene Classification challenge on user-generated content (UGC) aims to classify an audio recording that belongs to a specific scene such as busystreet, office or supermarket. The difficulty of scene content analysis on UGC lies in the lack of structure and acoustic variability of the data. The i-vector system is state-of-the-art in Speaker Verification and Scene Detection, and is outperforming conventional Gaussian Mixture Model (GMM)-based approaches. The system compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a meaningful choice for detection on UGC. This paper reports our results in the challenge by using a hand-tuned i-vector system and MFCC features. Compared to the MFCC+GMM baseline system, our system increased the classification accuracy by 26.4% to about 65.8%. We discuss our approach and highlight parameters in our system that were shown to significantly improve our classification accuracy.

System characteristics
Input left, right, difference, average
Sampling rate 44.1kHz
Features MFCC
Classifier i-vector, pLDA

Recognising Acoustic Scenes with Large-Scale Audio Feature Extraction and SVM

Jürgen T. Geiger¹, Björn Schuller¹,² and Gerhard Rigoll¹
¹Institute for Human-Machine Communication, Technische Universität München, München, Germany, ²University of Passau, Institute for Sensor Systems, Passau, Germany


This work describes our contribution to the IEEE AASP Challenge on classification of acoustic scenes. From the 30-second-long, highly variable recordings, spectral, cepstral, energy and voicing-related audio features are extracted. A sliding window approach is used to obtain statistical functionals of the low-level features on short segments. SVMs are used for classification of these short segments, and a majority voting scheme is employed to get a decision for the whole recording. On the official development set of the challenge, an accuracy of 73% is achieved. A feature analysis using the t-statistic showed that Mel spectra were the most relevant features.

System characteristics
Input mono
Sampling rate 44.1kHz
Features openSMILE / emo_large
Classifier SVM
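
The segment-then-vote scheme described in this abstract can be sketched as follows. This is not the authors' implementation: the single one-dimensional feature track, the choice of mean and standard deviation as functionals, the window and hop sizes, and the function names are all assumptions made for illustration, and the per-segment classifier (an SVM in the report) is replaced by a list of already-predicted labels.

```python
from collections import Counter
from statistics import mean, pstdev


def segment_functionals(frames, win, hop):
    """Mean and standard deviation of a 1-D low-level feature over sliding windows."""
    return [
        (mean(frames[start:start + win]), pstdev(frames[start:start + win]))
        for start in range(0, len(frames) - win + 1, hop)
    ]


def majority_vote(segment_labels):
    """Recording-level decision: the most frequent segment-level label."""
    return Counter(segment_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy feature track; a real system would use many features per frame.
    print(segment_functionals([1, 2, 3, 4], win=2, hop=2))
    # Labels that a hypothetical per-segment classifier might have produced.
    print(majority_vote(["bus", "park", "bus", "office", "bus"]))
```

Voting over many short segments makes the recording-level decision less sensitive to individual misclassified segments.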

A Tone-Fit Feature Representation for Scene Classification

Johannes D. Krijnders and Gineke A. ten Holt
INCAS3, Assen, Netherlands


We present an algorithm that classifies environmental sound recordings using a feature representation based on human hearing. Specifically, we use a mathematical model of the human cochlea to transform a sound (wav) clip into a time-frequency representation called a cochleogram. From the cochleogram, we calculate the tone-fit of each time-frequency region by calculating the fit of the region to a pure tone. This gives us a representation of the ‘tonelikeness’ of the sound at various moments and frequencies. Finally, to arrive at a summarized representation for the entire clip, we calculate 20 statistic components over the tone-fit matrix. The resulting 20-dimensional feature representation is then classified using a support vector machine. The accuracy of the resulting method is 0.53 (SE = 0.06). Similar results are obtained by using MFCC features and voting by frame (0.60, SE = 0.04). Future directions include separately identifying sound events and representing scenes in terms of component events.

System characteristics
Input mono
Sampling rate 44.1kHz
Features tone-fit representation
Classifier SVM

Auditory Scene Classification Using Machine Learning Techniques

David Li, Jason Tam and Derek Toub
Cooper Union, New York, USA


Audio scene classification will play an important role in context-based organization of audio data in the future. With classified and labeled audio data, it will be possible to set up a searchable database where users can retrieve audio files based on their contents. In this paper, we introduce a system to extract features from such audio scenes and identify the environments in which they were recorded. This system makes use of wavelet and Mel-frequency cepstral coefficient (MFCC) features, and classifies scenes by first classifying segments of the scene, and deciding the overall classification with a vote. The system achieves a classification accuracy of 72% for the training dataset provided for the IEEE AASP CASA challenge [1].

System characteristics
Input mono
Sampling rate 44.1kHz
Features Wavelet, MFCC
Classifier Treebagger, majority vote

Acoustic Scene Classification Using Sparse Feature Learning and Selective Max-Pooling by Event Detection

Juhan Nam¹, Ziwon Hyung² and Kyogu Lee²
¹Stanford University, Stanford, USA, ²Music and Audio Research Group, Seoul National University, Seoul, South Korea


Feature representations by learning algorithms recently have shown promising results in music classification. In this work, we applied the feature learning approach to audio scene classification. Using a previously proposed method, we learn local acoustic features on a mel-frequency spectrogram and perform max-pooling to form a scene-level feature vector. In order to adapt the method to environmental scene classification, where acoustic events occur in an irregular manner, we suggest a new pooling technique that detects events using mean feature activation and then selectively performs max-pooling for the events. Our experiments show that this method is effective in acoustic scene classification.

System characteristics
Input mono
Sampling rate 44.1kHz
Features Feature learning, max-pooling
Classifier SVM
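
The selective pooling idea in this abstract can be illustrated with a small sketch. It is only one possible reading: frames are treated as events when their mean activation reaches the average over all frames, and that threshold rule, the toy activations, and the function name are assumptions rather than details taken from the report.

```python
def selective_max_pool(activations):
    """Max-pool each feature over frames whose mean activation marks them as events.

    activations: list of frames, each a list of learned-feature activations.
    Assumption: a frame is an event if its mean activation is at least the
    average of all frame means (the report's actual rule is not given here).
    """
    frame_means = [sum(frame) / len(frame) for frame in activations]
    threshold = sum(frame_means) / len(frame_means)
    event_frames = [
        frame for frame, m in zip(activations, frame_means) if m >= threshold
    ]
    # At least one frame always passes, since the largest mean is >= the average.
    return [max(column) for column in zip(*event_frames)]


if __name__ == "__main__":
    toy = [
        [1.0, 0.0, 0.0],  # one isolated spike, low mean: not treated as an event
        [0.5, 0.6, 0.7],
        [0.4, 0.8, 0.6],
        [0.0, 0.0, 0.1],
    ]
    print(selective_max_pool(toy))
```

In the toy input, plain max-pooling over all frames would keep the isolated spike of 1.0 in the first feature, whereas the selective variant ignores it.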

Sound Scene Identification Based on MFCC, Binaural Features and a Support Vector Machine Classifier

Waldo Nogueira, Gerard Roma and Perfecto Herrera
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain


This submission to the scene classification sub-task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events is based on a feature extraction module in three dimensions (spectral, temporal and spatial). Spectral features are based on Mel-frequency cepstral coefficients, temporal features are based on an event density extractor, and the spatial features are based on the extraction of inter-aural differences (level and temporal) and the coherence between the two channels of stereo recordings. After feature selection, the features are used in conjunction with a support vector machine for the classification of the sound scenes. In this short paper the impact of different features is analyzed.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC, temporal modulation, event density, binaural features
Classifier SVM
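
As an illustration of one of the spatial features mentioned in the abstract, the sketch below computes an inter-aural level difference as the energy ratio between the two channels in decibels. The exact definition used in the report is not given on this page, so this formula, the `eps` guard, and the function name are assumptions.

```python
import math


def interaural_level_difference(left, right, eps=1e-12):
    """Level difference between two channels in dB (positive: left is louder).

    eps avoids division by zero and log of zero for silent channels.
    """
    energy_left = sum(sample * sample for sample in left)
    energy_right = sum(sample * sample for sample in right)
    return 10.0 * math.log10((energy_left + eps) / (energy_right + eps))


if __name__ == "__main__":
    # Left channel with ten times the amplitude of the right channel.
    print(interaural_level_difference([1.0, 1.0], [0.1, 0.1]))
```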

The Wonders of the Normalized Compression Dissimilarity Representation

Emanuele Olivetti¹,²
¹NeuroInformatics Laboratory, Bruno Kessler Foundation, Trento, Italy, ²Center for Mind and Brain Sciences, Trento, Italy


We propose a method to effectively embed general objects, like audio samples, into a vectorial feature space, suitable for classification problems. From the practical point of view, the researcher adopting the proposed method is just required to provide two ingredients: an efficient compressor for those objects, and a way to combine two objects into a new one. The proposed method is based on two main elements: the dissimilarity representation and the normalized compression distance (NCD). The dissimilarity representation is a Euclidean embedding algorithm, i.e. a procedure to map generic objects into a vector space, which requires the definition of a distance function between the objects. The quality of the resulting embedding is strictly dependent on the choice of this distance. The NCD is a distance between objects based on the concept of Kolmogorov complexity. In practice the NCD is based on two building blocks: a compression function and a method to combine two objects into a new one. We claim that, as soon as a good compressor and a meaningful way to combine two objects are available, it is possible to build an effective feature space in which classification algorithms can be accurate. As our submission to the IEEE AASP Challenge, we show a practical application of the proposed method in the context of acoustic scene classification where the compressor is the free and open source Vorbis lossy audio compressor and the combination of two audio samples is their simple concatenation.

System characteristics
Input mono
Sampling rate 44.1kHz
Features Normalized compression distance, Euclidean embedding
Classifier Random Forest
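
The two building blocks named in the abstract, the normalized compression distance and the dissimilarity representation, can be sketched as follows. The NCD formula used here is the standard one from the literature. The report uses the Vorbis audio compressor; `zlib` is swapped in below only so that the example is self-contained, and the prototype set and function names are likewise assumptions.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the compressed object (zlib is a stand-in for Vorbis here)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, combining objects by concatenation."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_embedding(obj: bytes, prototypes) -> list:
    """Represent an object by its vector of distances to a set of prototypes."""
    return [ncd(obj, prototype) for prototype in prototypes]


if __name__ == "__main__":
    a = b"abc" * 200
    b = bytes(range(256)) * 3
    print(ncd(a, a), ncd(a, b))
    print(dissimilarity_embedding(a, [a, b]))
```

The resulting vectors can then be passed to any standard classifier; the system characteristics above list a Random Forest.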

Multiresolution Auditory Representations for Scene Classification

Kailash Patil and Mounya Elhilali
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA


Here, we propose a framework that provides a detailed analysis of the spectrotemporal modulations in the acoustic signal, augmented with a discriminative classifier using support vector machines. We have seen that such a representation is successful at capturing the nontrivial commonalities within a sound class and differences between different classes [1, 2, 3].

System characteristics
Input mono
Sampling rate 44.1kHz
Features Spectrotemporal modulation
Classifier SVM

Histogram of Gradients of Time-Frequency Representations for Audio Scene Classification

Alain Rakotomamonjy and Gilles Gasso
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, Normandie Universite, Rouen, France


This abstract presents our entry to the Detection and Classification of Acoustic Scenes challenge. The approach we propose for classifying acoustic scenes is based on transforming the audio signal into a time-frequency representation and then extracting relevant features about the shapes and evolutions of time-frequency structures. These features are based on histograms of gradients, which are subsequently fed to a multi-class linear support vector machine.

System characteristics
Input mono
Sampling rate 44.1kHz
Features CQT, HOG
Classifier SVM
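
To make the idea of gradient histograms on a time-frequency representation concrete, the sketch below computes a single magnitude-weighted histogram of gradient orientations over a small matrix. It is a deliberately simplified variant: there are no cells, blocks, or block normalization as in full HOG, the input is a toy matrix rather than a CQT, and the function name and bin count are assumptions.

```python
import math


def gradient_orientation_histogram(tf, n_bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    tf: 2-D list, rows are frequency bins and columns are time frames.
    Only interior points are used (central differences).
    """
    hist = [0.0] * n_bins
    for r in range(1, len(tf) - 1):
        for c in range(1, len(tf[0]) - 1):
            g_time = (tf[r][c + 1] - tf[r][c - 1]) / 2.0
            g_freq = (tf[r + 1][c] - tf[r - 1][c]) / 2.0
            magnitude = math.hypot(g_time, g_freq)
            angle = math.atan2(g_freq, g_time) % math.pi  # fold into [0, pi)
            bin_index = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[bin_index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


if __name__ == "__main__":
    ramp_in_time = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
    ramp_in_freq = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
    print(gradient_orientation_histogram(ramp_in_time, n_bins=4))
    print(gradient_orientation_histogram(ramp_in_freq, n_bins=4))
```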

Recurrence Quantification Analysis Features for Auditory Scene Classification

Gerard Roma, Waldo Nogueira and Perfecto Herrera
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain


This extended abstract describes our submission for the scene classification task of the IEEE AASP Challenge for Detection and Classification of Acoustic Scenes and Events. We explore the use of Recurrence Quantification Analysis (RQA) features for this task. These features are computed over a thresholded similarity matrix computed from windows of MFCC features. Added to traditional MFCC statistics, they improve accuracy when using a standard SVM classifier.

System characteristics
Input mono
Sampling rate 44.1kHz
Features MFCC, Recurrence Quantification Analysis
Classifier SVM
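
The thresholded similarity matrix mentioned in this abstract can be illustrated with a small sketch that builds a binary recurrence matrix from a sequence of feature vectors and computes the recurrence rate, one of the standard RQA measures. The Euclidean distance, the threshold value, and the toy vectors standing in for MFCC frames are assumptions; the report's exact settings are not given on this page.

```python
import math


def recurrence_matrix(frames, threshold):
    """Binary matrix: entry (i, j) is 1 when frames i and j are within threshold."""
    def distance(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    return [
        [1 if distance(u, v) <= threshold else 0 for v in frames]
        for u in frames
    ]


def recurrence_rate(matrix):
    """Fraction of recurrent points in the matrix."""
    n = len(matrix)
    return sum(sum(row) for row in matrix) / (n * n)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for MFCC frames.
    frames = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
    matrix = recurrence_matrix(frames, threshold=0.5)
    print(matrix)
    print(recurrence_rate(matrix))
```

Other RQA measures, such as those based on diagonal or vertical line structures in the matrix, are computed from the same binary matrix.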