Official results are shown on the original DCASE2013 Challenge website. The purpose of this page is to present the results in a format consistent with more recent editions of the DCASE Challenge.
Task description
The scene classification (SC) task addresses the problem of identifying and classifying acoustic scenes and soundscapes.
A more detailed description can be found on the task description page.
Systems ranking
| Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) |
| --- | --- | --- | --- | --- |
|  | DCASE2013 baseline | Baseline |  | 55.0 (45.2 - 64.8) |
|  | CHR_1 | CHR_SVM | Chum2013 | 63.0 (53.5 - 72.5) |
|  | CHR_2 | CHR_HMM | Chum2013 | 65.0 (55.7 - 74.3) |
|  | ELF | ELF | Elizalde2013 | 55.0 (45.2 - 64.8) |
|  | GSR | GSR | Geiger2013 | 69.0 (59.9 - 78.1) |
|  | KH | KH | Krijnders2013 | 55.0 (45.2 - 64.8) |
|  | LTT_1 | LTT_1 | Li2013 | 72.0 (63.2 - 80.8) |
|  | LTT_2 | LTT_2 | Li2013 | 70.0 (61.0 - 79.0) |
|  | LTT_3 | LTT_2 | Li2013 | 67.0 (57.8 - 76.2) |
|  | NHL | NHL | Nam2013 | 60.0 (50.4 - 69.6) |
|  | NR1_1 | NR1_1 | Nogueira2013 | 60.0 (50.4 - 69.6) |
|  | NR1_2 | NR1_2 | Nogueira2013 | 60.0 (50.4 - 69.6) |
|  | NR1_3 | NR1_3 | Nogueira2013 | 59.0 (49.4 - 68.6) |
|  | OE | OE | Olivetti2013 | 14.0 (7.2 - 20.8) |
|  | PE | PE | Patil2013 | 58.0 (48.3 - 67.7) |
|  | RG | RG | Rakotomamonjy2013 | 69.0 (59.9 - 78.1) |
|  | RNH_1 | RNH1 | Roma2013 | 71.0 (62.1 - 79.9) |
|  | RNH_2 | RNH2 | Roma2013 | 76.0 (67.6 - 84.4) |
Teams ranking
This table includes only the best-performing system per submitting team.
| Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) |
| --- | --- | --- | --- | --- |
|  | DCASE2013 baseline | Baseline |  | 55.0 (45.2 - 64.8) |
|  | CHR_2 | CHR_HMM | Chum2013 | 65.0 (55.7 - 74.3) |
|  | ELF | ELF | Elizalde2013 | 55.0 (45.2 - 64.8) |
|  | GSR | GSR | Geiger2013 | 69.0 (59.9 - 78.1) |
|  | KH | KH | Krijnders2013 | 55.0 (45.2 - 64.8) |
|  | LTT_1 | LTT_1 | Li2013 | 72.0 (63.2 - 80.8) |
|  | NHL | NHL | Nam2013 | 60.0 (50.4 - 69.6) |
|  | NR1_1 | NR1_1 | Nogueira2013 | 60.0 (50.4 - 69.6) |
|  | OE | OE | Olivetti2013 | 14.0 (7.2 - 20.8) |
|  | PE | PE | Patil2013 | 58.0 (48.3 - 67.7) |
|  | RG | RG | Rakotomamonjy2013 | 69.0 (59.9 - 78.1) |
|  | RNH_2 | RNH2 | Roma2013 | 76.0 (67.6 - 84.4) |
Class-wise performance
| Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Bus | Busy street | Office | Open air market | Park | Quiet street | Restaurant | Supermarket | Tube | Tube station |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | DCASE2013 baseline | Baseline |  | 55.0 | 90.0 | 30.0 | 80.0 | 70.0 | 90.0 | 40.0 | 30.0 | 20.0 | 60.0 | 40.0 |
|  | CHR_1 | CHR_SVM | Chum2013 | 63.0 | 100.0 | 80.0 | 40.0 | 70.0 | 30.0 | 40.0 | 70.0 | 50.0 | 80.0 | 70.0 |
|  | CHR_2 | CHR_HMM | Chum2013 | 65.0 | 90.0 | 90.0 | 80.0 | 60.0 | 90.0 | 50.0 | 40.0 | 30.0 | 80.0 | 40.0 |
|  | ELF | ELF | Elizalde2013 | 55.0 | 50.0 | 70.0 | 60.0 | 100.0 | 40.0 | 50.0 | 70.0 | 30.0 | 30.0 | 50.0 |
|  | GSR | GSR | Geiger2013 | 69.0 | 90.0 | 90.0 | 90.0 | 90.0 | 70.0 | 40.0 | 80.0 | 50.0 | 70.0 | 20.0 |
|  | KH | KH | Krijnders2013 | 55.0 | 60.0 | 90.0 | 70.0 | 50.0 | 30.0 | 50.0 | 50.0 | 20.0 | 80.0 | 50.0 |
|  | LTT_1 | LTT_1 | Li2013 | 72.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 50.0 | 70.0 | 70.0 | 70.0 | 40.0 |
|  | LTT_2 | LTT_2 | Li2013 | 70.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 60.0 | 70.0 | 40.0 | 70.0 | 40.0 |
|  | LTT_3 | LTT_2 | Li2013 | 67.0 | 100.0 | 100.0 | 70.0 | 80.0 | 70.0 | 50.0 | 60.0 | 30.0 | 70.0 | 40.0 |
|  | NHL | NHL | Nam2013 | 60.0 | 70.0 | 90.0 | 80.0 | 70.0 | 50.0 | 60.0 | 50.0 | 40.0 | 30.0 | 60.0 |
|  | NR1_1 | NR1_1 | Nogueira2013 | 60.0 | 80.0 | 80.0 | 60.0 | 70.0 | 70.0 | 30.0 | 60.0 | 80.0 | 20.0 | 50.0 |
|  | NR1_2 | NR1_2 | Nogueira2013 | 60.0 | 80.0 | 90.0 | 50.0 | 80.0 | 70.0 | 20.0 | 90.0 | 60.0 | 20.0 | 40.0 |
|  | NR1_3 | NR1_3 | Nogueira2013 | 59.0 | 80.0 | 80.0 | 50.0 | 70.0 | 70.0 | 20.0 | 90.0 | 70.0 | 20.0 | 40.0 |
|  | OE | OE | Olivetti2013 | 14.0 | 0.0 | 10.0 | 20.0 | 20.0 | 10.0 | 30.0 | 0.0 | 20.0 | 20.0 | 10.0 |
|  | PE | PE | Patil2013 | 58.0 | 90.0 | 90.0 | 50.0 | 70.0 | 40.0 | 60.0 | 60.0 | 20.0 | 40.0 | 60.0 |
|  | RG | RG | Rakotomamonjy2013 | 69.0 | 100.0 | 100.0 | 80.0 | 80.0 | 80.0 | 30.0 | 50.0 | 50.0 | 80.0 | 40.0 |
|  | RNH_1 | RNH1 | Roma2013 | 71.0 | 80.0 | 100.0 | 60.0 | 40.0 | 70.0 | 60.0 | 100.0 | 70.0 | 70.0 | 60.0 |
|  | RNH_2 | RNH2 | Roma2013 | 76.0 | 80.0 | 100.0 | 80.0 | 70.0 | 70.0 | 50.0 | 100.0 | 80.0 | 70.0 | 60.0 |
System characteristics
| Rank | Code | Technical Report | Accuracy (Eval) | Input | Sampling rate | Features | Classifier |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | DCASE2013 baseline |  | 55.0 | mono | 44.1kHz | MFCC | GMM |
|  | CHR_1 | Chum2013 | 63.0 | mono | 11.025kHz | Magnitude response, Loudness, Spectral sparsity, Temporal sparsity | SVM |
|  | CHR_2 | Chum2013 | 65.0 | mono | 11.025kHz | Magnitude response, Loudness, Spectral sparsity, Temporal sparsity | HMM |
|  | ELF | Elizalde2013 | 55.0 | left, right, difference, average | 44.1kHz | MFCC | i-vector, pLDA |
|  | GSR | Geiger2013 | 69.0 | mono | 44.1kHz | openSMILE / emo_large | SVM |
|  | KH | Krijnders2013 | 55.0 | mono | 44.1kHz | tone-fit representation | SVM |
|  | LTT_1 | Li2013 | 72.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
|  | LTT_2 | Li2013 | 70.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
|  | LTT_3 | Li2013 | 67.0 | mono | 44.1kHz | Wavelet, MFCC | Treebagger, majority vote |
|  | NHL | Nam2013 | 60.0 | mono | 44.1kHz | Feature learning, max-pooling | SVM |
|  | NR1_1 | Nogueira2013 | 60.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
|  | NR1_2 | Nogueira2013 | 60.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
|  | NR1_3 | Nogueira2013 | 59.0 | mono | 44.1kHz | MFCC, temporal modulation, event density, binaural features | SVM |
|  | OE | Olivetti2013 | 14.0 | mono | 44.1kHz | Normalized compression distance, Euclidean embedding | Random Forest |
|  | PE | Patil2013 | 58.0 | mono | 44.1kHz | Spectrotemporal modulation | SVM |
|  | RG | Rakotomamonjy2013 | 69.0 | mono | 44.1kHz | CQT, HOG | SVM |
|  | RNH_1 | Roma2013 | 71.0 | mono | 44.1kHz | MFCC, Recurrence Quantification Analysis | SVM |
|  | RNH_2 | Roma2013 | 76.0 | mono | 44.1kHz | MFCC, Recurrence Quantification Analysis | SVM |
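For reference, the DCASE2013 baseline in the table above is an MFCC+GMM system. The following is a minimal sketch of that kind of classifier, not the official baseline implementation; the MFCC dimensionality and mixture size are illustrative assumptions.

```python
# Minimal MFCC+GMM scene classifier sketch (one GMM per scene class).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=44100, n_mfcc=20):
    """Load a clip and return its MFCC frames as an (n_frames, n_mfcc) array."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train(train_clips):
    """train_clips: dict mapping scene label -> list of audio file paths."""
    models = {}
    for label, paths in train_clips.items():
        X = np.vstack([mfcc_frames(p) for p in paths])
        models[label] = GaussianMixture(n_components=32,
                                        covariance_type="diag").fit(X)
    return models

def classify(models, path):
    """Pick the scene whose GMM gives the highest average log-likelihood."""
    X = mfcc_frames(path)
    return max(models, key=lambda label: models[label].score(X))
```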
Technical reports
IEEE AASP Scene Classification Challenge Using Hidden Markov Models and Frame Based Classification
May Chum, Ariel Habshush, Abrar Rahman and Christopher Sang
Electrical Engineering Department, The Cooper Union, New York, USA
Abstract
The IEEE AASP Challenge involves the detection and classification of acoustic scenes and events. The scene classification (SC) challenge consists of 10 different scenes with 10 audio files of length 30 seconds each, totaling 100 audio clips. The list of scenes is: busy street, quiet street, park, open-air market, bus, subway-train, restaurant, shop/supermarket, office, and subway station. The goal is to test on a development set composed of audio clips of the same scenes as the training set and determine which scene each audio clip originated from. One of the algorithms presented in this paper to discriminate between these scenes uses hidden Markov models (HMMs) and Gaussian mixture models (GMMs). The features used include the short-time Fourier transform, loudness, and spectral sparsity. Using these features yielded 72% correct classification with 10-fold cross-validation. The other algorithm uses the same features plus temporal sparsity to classify individual frames of an audio clip and then votes on the class. This algorithm achieved 62% accuracy.
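A sketch of the frame-level features named in the abstract; the authors' exact definitions of loudness and sparsity are not given here, so max-to-sum ratios and log energy are assumed as stand-ins.

```python
# Hypothetical frame-level sparsity and loudness features from an STFT.
import numpy as np
import librosa

def sparsity_features(y, sr, n_fft=1024, hop=512):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))  # magnitude STFT
    eps = 1e-10
    # Spectral sparsity: per frame, how dominated the spectrum is by its peak.
    spectral = S.max(axis=0) / (S.sum(axis=0) + eps)
    # Temporal sparsity: per frequency bin, peakiness across time.
    temporal = S.max(axis=1) / (S.sum(axis=1) + eps)
    # Loudness proxy: log energy per frame (assumption, not a psychoacoustic model).
    loudness = np.log(np.sum(S ** 2, axis=0) + eps)
    return spectral, temporal, loudness
```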
System characteristics
Input | mono |
Sampling rate | 11.025kHz |
Features | Magnitude response, Loudness, Spectral sparsity, Temporal sparsity |
Classifier | SVM; HMM |
An I-Vector Based Approach for Audio Scene Detection
Benjamin Elizalde1, Howard Lei1, Gerald Friedland1 and Nils Peters2
1International Computer Science Institute, Berkeley, USA, 2Qualcomm Technologies Inc., San Diego, USA
Abstract
The IEEE AASP Scene Classification challenge on user-generated content (UGC) aims to classify an audio recording as belonging to a specific scene such as busy street, office or supermarket. The difficulty of scene content analysis on UGC lies in the lack of structure and the acoustic variability of the data. The i-vector system is state-of-the-art in speaker verification and scene detection, and outperforms conventional Gaussian mixture model (GMM)-based approaches. The system compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a meaningful choice for detection on UGC. This paper reports our results in the challenge using a hand-tuned i-vector system and MFCC features. Compared to the MFCC+GMM baseline system, our system increased the classification accuracy by 26.4% to about 65.8%. We discuss our approach and highlight parameters in our system that significantly improved our classification accuracy.
System characteristics
Input | left, right, difference, average |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | i-vector, pLDA |
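This is the only entry that uses multiple input channels. A sketch of how the four listed channels could be derived from a stereo recording; the file name and the simple sample-wise combinations are assumptions.

```python
# Derive left / right / difference / average channels from a stereo clip.
import librosa

y, sr = librosa.load("scene.wav", sr=44100, mono=False)  # hypothetical file; shape (2, n)
left, right = y[0], y[1]
difference = left - right
average = 0.5 * (left + right)
```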
Recognising Acoustic Scenes with Large-Scale Audio Feature Extraction and SVM
Jürgen T. Geiger1, Björn Schuller1,2 and Gerhard Rigoll1
1Institute for Human-Machine Communication, Technische Universität München, München, Germany, 2University of Passau, Institute for Sensor Systems, Passau, Germany
Abstract
This work describes our contribution to the IEEE AASP Challenge on classification of acoustic scenes. From the 30-second-long, highly variable recordings, spectral, cepstral, energy and voicing-related audio features are extracted. A sliding window approach is used to obtain statistical functionals of the low-level features on short segments. SVMs are used for classification of these short segments, and a majority voting scheme is employed to get a decision for the whole recording. On the official development set of the challenge, an accuracy of 73% is achieved. A feature analysis using the t-statistic showed that Mel spectra were the most relevant features.
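A sketch of the segment-wise pipeline the abstract describes: sliding-window functionals, per-segment SVM, majority vote per recording. The window sizes and the functional set are assumptions; the actual submission used the openSMILE emo_large feature set.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def functionals(window):
    """Statistical functionals over an (n_frames, n_feats) window."""
    return np.concatenate([window.mean(0), window.std(0),
                           window.min(0), window.max(0)])

def segment_features(frames, win=100, hop=50):
    """Sliding-window functionals over a whole recording's frame matrix."""
    return np.array([functionals(frames[i:i + win])
                     for i in range(0, len(frames) - win + 1, hop)])

def train(recordings, labels):
    """recordings: list of (n_frames, n_feats) arrays; labels: one scene per recording."""
    segs = [segment_features(r) for r in recordings]
    X = np.vstack(segs)
    y = np.concatenate([[l] * len(s) for l, s in zip(labels, segs)])
    return SVC().fit(X, y)

def classify(clf, frames):
    """Majority vote over per-segment SVM predictions."""
    return Counter(clf.predict(segment_features(frames))).most_common(1)[0][0]
```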
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | openSMILE / emo_large |
Classifier | SVM |
A Tone-Fit Feature Representation for Scene Classification
Johannes D. Krijnders and Gineke A. ten Holt
INCAS3, Assen, Netherlands
Abstract
We present an algorithm that classifies environmental sound recordings using a feature representation based on human hearing. Specifically, we use a mathematical model of the human cochlea to transform a sound (wav) clip into a time-frequency representation called a cochleogram. From the cochleogram, we calculate the tone-fit of each time-frequency region by calculating the fit of the region to a pure tone. This gives us a representation of the 'tone-likeness' of the sound at various moments and frequencies. Finally, to arrive at a summarized representation for the entire clip, we calculate 20 statistical components over the tone-fit matrix. The resulting 20-dimensional feature representation is then classified using a support vector machine. The accuracy of the resulting method is 0.53 (SE = 0.06). Similar results are obtained by using MFCC features and voting by frame (0.60, SE = 0.04). Future directions include separately identifying sound events and representing scenes in terms of component events.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | tone-fit representation |
Classifier | SVM |
Auditory Scene Classification Using Machine Learning Techniques
David Li, Jason Tam and Derek Toub
Cooper Union, New York, USA
Abstract
Audio scene classification will play an important role in context-based organization of audio data in the future. With classified and labeled audio data, it will be possible to set up a searchable database where users can retrieve audio files based on their contents. In this paper, we introduce a system to extract features from such audio scenes and identify the environments in which they were recorded. This system makes use of wavelet and Mel-frequency cepstral coefficient (MFCC) features, and classifies scenes by first classifying segments of the scene, and deciding the overall classification with a vote. The system achieves a classification accuracy of 72% for the training dataset provided for the IEEE AASP CASA challenge [1].
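A rough sketch of the described feature extraction and classifier; MATLAB's TreeBagger is approximated by scikit-learn's RandomForestClassifier, and the wavelet settings are assumptions.

```python
# Wavelet subband log-energies plus mean MFCCs per segment, bagged trees on top.
import numpy as np
import pywt
import librosa
from sklearn.ensemble import RandomForestClassifier

def segment_feature(seg, sr):
    """Feature vector for one audio segment of a 30-second clip."""
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)
    coeffs = pywt.wavedec(seg, "db4", level=5)            # multilevel DWT
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return np.concatenate([np.log(energies + 1e-10), mfcc])

# TreeBagger stand-in: a bagged ensemble of decision trees. After fitting on
# per-segment features, a majority vote over segment predictions labels the clip.
forest = RandomForestClassifier(n_estimators=100)
```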
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | Wavelet, MFCC |
Classifier | Treebagger, majority vote |
Acoustic Scene Classification Using Sparse Feature Learning and Selective Max-Pooling by Event Detection
Juhan Nam1, Ziwon Hyung2 and Kyogu Lee2
1Stanford University, Stanford, USA, 2Music and Audio Research Group, Seoul National University, Seoul, South Korea
Abstract
Feature representations obtained by learning algorithms have recently shown promising results in music classification. In this work, we apply the feature learning approach to audio scene classification. Using a previously proposed method, we learn local acoustic features on the mel-frequency spectrogram and perform max-pooling to form a scene-level feature vector. In order to adapt the method to environmental scene classification, where acoustic events occur in an irregular manner, we suggest a new pooling technique that detects events using mean feature activation and then selectively performs max-pooling over those events. Our experiments show that this method is effective in acoustic scene classification.
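The selective max-pooling idea can be sketched as follows; the event criterion (frames whose mean activation exceeds the clip average) is an assumption made for illustration.

```python
# Selective max-pooling: pool only over detected "event" frames.
import numpy as np

def selective_max_pool(H):
    """H: (n_frames, n_features) non-negative feature activations."""
    activation = H.mean(axis=1)                  # mean activation per frame
    events = activation > activation.mean()      # assumed event criterion
    if not events.any():                         # fall back to all frames
        events[:] = True
    return H[events].max(axis=0)                 # scene-level feature vector
```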
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | Feature learning, max-pooling |
Classifier | SVM |
Sound Scene Identification Based on MFCC, Binaural Features and a Support Vector Machine Classifier
Waldo Nogueira, Gerard Roma and Perfecto Herrera
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
This submission to the scene classification sub-task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events is based on a feature extraction module covering three dimensions (spectral, temporal and spatial). Spectral features are based on Mel-frequency cepstral coefficients, temporal features are based on an event density extractor, and the spatial features are based on the extraction of inter-aural differences (level and temporal) and the coherence between the two channels of stereo recordings. After feature selection, the features are used in conjunction with a support vector machine for the classification of the sound scenes. In this short paper the impact of different features is analyzed.
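A sketch of the binaural (spatial) features the abstract mentions, computed from the two channels of a stereo clip; the exact definitions are assumptions.

```python
# Inter-aural level difference, cross-correlation time lag, and coherence.
import numpy as np

def binaural_features(left, right, eps=1e-10):
    ild = 10 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))
    xcorr = np.correlate(left, right, mode="full")
    itd = np.argmax(xcorr) - (len(left) - 1)     # lag of peak correlation, in samples
    coherence = xcorr.max() / (np.linalg.norm(left) * np.linalg.norm(right) + eps)
    return ild, itd, coherence
```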
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, temporal modulation, event density, binaural features |
Classifier | SVM |
The Wonders of the Normalized Compression Dissimilarity Representation
Emanuele Olivetti1,2
1NeuroInformatics Laboratory, Bruno Kessler Foundation, Trento, Italy, 2Center for Mind and Brain Sciences, Trento, Italy
Abstract
We propose a method to effectively embed general objects, like audio samples, into a vectorial feature space suitable for classification problems. From the practical point of view, the researcher adopting the proposed method is just required to provide two ingredients: an efficient compressor for those objects, and a way to combine two objects into a new one. The proposed method is based on two main elements: the dissimilarity representation and the normalized compression distance (NCD). The dissimilarity representation is a Euclidean embedding algorithm, i.e. a procedure to map generic objects into a vector space, which requires the definition of a distance function between the objects. The quality of the resulting embedding is strictly dependent on the choice of this distance. The NCD is a distance between objects based on the concept of Kolmogorov complexity. In practice the NCD is based on two building blocks: a compression function and a method to combine two objects into a new one. We claim that, as soon as a good compressor and a meaningful way to combine two objects are available, it is possible to build an effective feature space in which classification algorithms can be accurate. As our submission to the IEEE AASP Challenge, we show a practical application of the proposed method in the context of acoustic scene classification, where the compressor is the free and open-source Vorbis lossy audio compressor and the combination of two audio samples is their simple concatenation.
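The NCD itself is easy to state in code. The sketch below uses zlib as a stand-in compressor (the submission used the Vorbis codec) and concatenation to combine two objects, as in the paper.

```python
# Normalized compression distance with zlib as an illustrative compressor.
import zlib

def ncd(x: bytes, y: bytes) -> float:
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))   # concatenation combines the two objects
    return (cxy - min(cx, cy)) / max(cx, cy)

# Usage: d = ncd(open("a.wav", "rb").read(), open("b.wav", "rb").read())
```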
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | Normalized compression distance, Euclidean embedding |
Classifier | Random Forest |
Multiresolution Auditory Representations for Scene Classification
Kailash Patil and Mounya Elhilali
Center for Language and Speech Processing, Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, USA
Abstract
Here, we propose a framework that provides a detailed analysis of the spectrotemporal modulations in the acoustic signal, augmented with a discriminative classifier using support vector machines. We have seen that such a representation is successful at capturing the nontrivial commonalities within a sound class and the differences between classes [1, 2, 3].
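As a simple illustration, spectrotemporal modulations can be exposed by a 2-D Fourier transform of a log spectrogram; this is an assumption made for illustration, not the authors' multiresolution auditory model.

```python
# Rate-scale (modulation) magnitudes via a 2-D FFT of a log-mel spectrogram.
import numpy as np
import librosa

def modulation_spectrum(y, sr):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    logS = np.log(S + 1e-10)
    return np.abs(np.fft.fft2(logS))   # temporal rate vs. spectral scale
```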
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | Spectrotemporal modulation |
Classifier | SVM |
Histogram of Gradients of Time-Frequency Representations for Audio Scene Classification
Alain Rakotomamonjy and Gilles Gasso
Normandie Université, Rouen, France
Abstract
This abstract presents our entry to the Detection and Classification of Acoustic Scenes challenge. The approach we propose for classifying acoustic scenes is based on transforming the audio signal into a time-frequency representation and then extracting relevant features about the shapes and evolution of time-frequency structures. These features are based on histograms of gradients, which are subsequently fed to a multi-class linear support vector machine.
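A sketch of the CQT+HOG feature chain listed for this system; the hop/bin counts and HOG cell sizes are illustrative assumptions.

```python
# Histogram-of-gradients features over a constant-Q time-frequency image.
import numpy as np
import librosa
from skimage.feature import hog

def cqt_hog(y, sr):
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=84))      # constant-Q transform
    img = librosa.amplitude_to_db(C, ref=np.max)      # dB-scaled image
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))                # gradient histograms
```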
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | CQT, HOG |
Classifier | SVM |
Recurrence Quantification Analysis Features for Auditory Scene Classification
Gerard Roma, Waldo Nogueira and Perfecto Herrera
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
This extended abstract describes our submission for the scene classification task of the IEEE AASP Challenge for Detection and Classification of Acoustic Scenes and Events. We explore the use of Recurrence Quantification Analysis (RQA) features for this task. These features are computed over a thresholded similarity matrix derived from windows of MFCC features. Added to traditional MFCC statistics, they improve accuracy when using a standard SVM classifier.
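A sketch of simple RQA measures over a thresholded self-similarity matrix of MFCC frames; the threshold rule and the two measures shown (recurrence rate and determinism) are illustrative assumptions.

```python
# Recurrence rate and determinism from a thresholded MFCC similarity matrix.
import numpy as np
from scipy.spatial.distance import cdist

def rqa_features(mfcc, eps_quantile=0.1, lmin=2):
    """mfcc: (n_frames, n_coeffs). Returns (recurrence_rate, determinism)."""
    D = cdist(mfcc, mfcc)                          # pairwise frame distances
    R = D < np.quantile(D, eps_quantile)           # recurrence matrix
    rr = R.mean()                                  # recurrence rate
    # Determinism: fraction of recurrent points lying on diagonal lines >= lmin.
    diag_pts, n = 0, len(R)
    for k in range(-(n - 1), n):
        d = np.diagonal(R, k).astype(int)
        runs = np.split(d, np.where(np.diff(d) != 0)[0] + 1)
        diag_pts += sum(len(r) for r in runs if r[0] == 1 and len(r) >= lmin)
    det = diag_pts / max(R.sum(), 1)
    return rr, det
```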
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, Recurrence Quantification Analysis |
Classifier | SVM |