Acoustic scene classification


Task description

The challenge has ended. Full results for this task can be found here.

Description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "home", or "office".

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The TUT Acoustic Scenes 2016 dataset is used for the task. The dataset consists of recordings from various acoustic scenes, all with distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into 30-second segments for the challenge.

Acoustic scenes for the task (15):

  • Bus - traveling by bus in the city (vehicle)
  • Cafe / Restaurant - small cafe/restaurant (indoor)
  • Car - driving or traveling as a passenger, in the city (vehicle)
  • City center (outdoor)
  • Forest path (outdoor)
  • Grocery store - medium size grocery store (indoor)
  • Home (indoor)
  • Lakeside beach (outdoor)
  • Library (indoor)
  • Metro station (indoor)
  • Office - multiple persons, typical work day (indoor)
  • Residential area (outdoor)
  • Train (traveling, vehicle)
  • Tram (traveling, vehicle)
  • Urban park (outdoor)

A detailed description of the acoustic scenes included in the dataset can be found here.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection received funding from the European Research Council.


Recording and annotation procedure

For all acoustic scenes, the recordings were each captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, with a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears, so the recorded audio is very similar to the sound that reaches the human auditory system of the person wearing the equipment.

Postprocessing of the recorded data addressed the privacy of recorded individuals and possible errors in the recording process. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content and privacy-infringing segments were eliminated. Microphone failures and audio distortions were annotated, and segments containing such errors were likewise eliminated.

After eliminating the problematic segments, the remaining audio material was cut into 30-second segments.

Challenge setup

The TUT Acoustic Scenes 2016 dataset consists of two subsets: a development dataset and an evaluation dataset. The data was partitioned into subsets based on the location of the original recordings: all segments obtained from the same original recording were included in a single subset, either the development dataset or the evaluation dataset. For each acoustic scene, 78 segments (39 minutes of audio) were included in the development dataset and 26 segments (13 minutes of audio) were kept for evaluation. The development set contains 9 h 45 min of audio in total, and the evaluation set 3 h 15 min.
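These totals follow directly from the segment counts: per scene, 78 × 30 s = 39 min of development audio and 26 × 30 s = 13 min of evaluation audio; over 15 scenes, 15 × 39 min = 585 min = 9 h 45 min and 15 × 13 min = 195 min = 3 h 15 min.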

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Download

Development dataset


Evaluation dataset


In publications using the datasets, cite as:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

PDF

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

PDF

Cross-validation with development dataset

A cross-validation setup is provided for the development dataset in order to make results reported on this dataset uniform. The setup consists of four folds that distribute the 78 available segments per scene based on recording location. The folds are provided with the dataset in the evaluation setup directory.

If the provided cross-validation setup is not used, pay attention to segments extracted from the same original recordings: make sure that all files recorded in the same location are placed on the same side of the evaluation split (see the sketch below).
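If a custom split is built instead of the provided folds, grouping by recording location can be enforced with a group-aware splitter. A minimal sketch follows, assuming scikit-learn is available; the file names, labels, and location identifiers are hypothetical placeholders:

    # Location-aware split sketch: all segments from one recording location
    # stay on the same side of the split. All values below are placeholders.
    from sklearn.model_selection import GroupKFold

    files = ["a1.wav", "a2.wav", "b1.wav", "b2.wav"]   # 30-second segments
    labels = ["park", "park", "office", "office"]      # scene labels
    locations = ["loc01", "loc01", "loc02", "loc02"]   # original recording location per segment

    for train_idx, test_idx in GroupKFold(n_splits=2).split(files, labels, groups=locations):
        train_files = [files[i] for i in train_idx]
        test_files = [files[i] for i in test_idx]
        # ...train on train_files, evaluate on test_files...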

Submission

Detailed information about the challenge submission can be found on the submission page.

One should submit a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
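As an illustration, a results file in this format could be written with a few lines of Python; the mapping of file names to predicted labels and the output file name below are hypothetical placeholders:

    # Minimal sketch for writing the submission file, one "[filename][tab][scene label]" row per test file.
    # The "results" dictionary and the output file name are hypothetical examples.
    results = {
        "audio/eval_0001.wav": "park",
        "audio/eval_0002.wav": "office",
    }

    with open("task1_results.txt", "w") as output_file:
        for filename, scene_label in results.items():
            output_file.write(f"{filename}\t{scene_label}\n")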

Task rules

  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); the use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • Technical report with sufficient description of the system has to be submitted along with the system outputs.

More information on the submission process.

Evaluation

The scoring of acoustic scene classification is based on classification accuracy: the number of correctly classified segments divided by the total number of segments. Each segment is considered an independent test sample.

Code for evaluation is available with the baseline system:

  • Python implementation: from src.evaluation import DCASE2016_SceneClassification_Metrics
  • Matlab implementation: use the class src/evaluation/DCASE2016_SceneClassification_Metrics.m
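Independently of the baseline code, the metric itself is simple to compute. The following is a minimal sketch, assuming the reference and predicted scene labels are available as parallel lists, one entry per 30-second test segment:

    # Classification accuracy: fraction of correctly classified segments.
    # "ground_truth" and "predictions" are hypothetical parallel lists of scene labels.
    def classification_accuracy(ground_truth, predictions):
        correct = sum(1 for ref, est in zip(ground_truth, predictions) if ref == est)
        return correct / len(ground_truth)

    # Example: two of three segments correct -> accuracy 0.67
    print(classification_accuracy(["park", "office", "bus"], ["park", "home", "bus"]))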

Results

Submission code | Author | Affiliation | Technical report | Classification accuracy (%)
Aggarwal_task1_1 Naveen Aggarwal UIET, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2016 74.4
Bae_task1_1 Soo Hyun Bae Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Bae2016 84.1
Bao_task1_1 Xiao Bao National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China task-acoustic-scene-classification-results#Bao2016 83.1
Battaglino_task1_1 Daniele Battaglino NXP Software, France; EURECOM, France task-acoustic-scene-classification-results#Battaglino2016 80.0
Bisot_task1_1 Victor Bisot Telecom ParisTech, Paris, France task-acoustic-scene-classification-results#Bisot2016 87.7
DCASE2016 baseline Toni Heittola Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland task-acoustic-scene-classification-results#Heittola2016 77.2
Duong_task1_1 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 76.4
Duong_task1_2 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 80.5
Duong_task1_3 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 73.1
Duong_task1_4 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 62.8
Eghbal-Zadeh_task1_1 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 86.4
Eghbal-Zadeh_task1_2 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 88.7
Eghbal-Zadeh_task1_3 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 83.3
Eghbal-Zadeh_task1_4 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 89.7
Foleiss_task1_1 Juliano Henrique Foleiss Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil task-acoustic-scene-classification-results#Foleiss2016 76.2
Hertel_task1_1 Alfred Mertins Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Hertel2016 79.5
Kim_task1_1 Alfred Mertins Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Yun2016 82.1
Ko_task1_1 Hanseok Ko School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea task-acoustic-scene-classification-results#Park2016 87.2
Ko_task1_2 Hanseok Ko School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea task-acoustic-scene-classification-results#Mun2016 82.3
Kong_task1_1 Qiuqiang Kong Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom task-acoustic-scene-classification-results#Kong2016 81.0
Kumar_task1_1 Anurag Kumar Carnegie Mellon University, Pittsburgh, USA task-acoustic-scene-classification-results#Elizalde2016 85.9
Lee_task1_1 Kyogu Lee Music and Audio Research Group, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Han2016 84.6
Lee_task1_2 Kyogu Lee Music and Audio Research Group, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Kim2016 85.4
Liu_task1_1 Jiaming Liu Department of Control Science and Engineering, Tongji University, Shanghai, China task-acoustic-scene-classification-results#Liu2016 83.8
Liu_task1_2 Jiaming Liu Department of Control Science and Engineering, Tongji University, Shanghai, China task-acoustic-scene-classification-results#Liu2016 83.6
Lostanlen_task1_1 Vincent Lostanlen Departement d’Informatique, Ecole normale superieure, Paris, France task-acoustic-scene-classification-results#Lostanlen2016 80.8
Marchi_task1_1 Erik Marchi Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany; audEERING GmbH, Gilching, Germany task-acoustic-scene-classification-results#Marchi2016 86.4
Marques_task1_1 Gonçalo Marques Electronic Telecom. and Comp. Dept., Instituto Superior de Engenharia de Lisboa, Lisboa, Portugal task-acoustic-scene-classification-results#Marques2016 83.1
Moritz_task1_1 Niko Moritz Project Group for Hearing, Speech, and Audio Processing, Fraunhofer IDMT, Oldenburg, Germany task-acoustic-scene-classification-results#Moritz2016 79.0
Mulimani_task1_1 Manjunath Mulimani Dept. of Computer Science & Engineering, National Institute of Technology, Karnataka, India task-acoustic-scene-classification-results#Mulimani2016 65.6
Nogueira_task1_1 Waldo Nogueira Medical University Hannover, Hannover, Germany; Cluster of Excellence Hearing4all, Hannover, Germany task-acoustic-scene-classification-results#Nogueira2016 81.0
Patiyal_task1_1 Rohit Patiyal School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India task-acoustic-scene-classification-results#Patiyal2016 78.5
Phan_task1_1 Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2016 83.3
Pugachev_task1_1 Alexei Pugachev Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia task-acoustic-scene-classification-results#Pugachev2016 73.1
Qu_task1_1 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 80.5
Qu_task1_2 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 84.1
Qu_task1_3 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 82.3
Qu_task1_4 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 80.5
Rakotomamonjy_task1_1 Alain Rakotomamonjy Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2016 82.1
Rakotomamonjy_task1_2 Alain Rakotomamonjy Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2016 79.2
Santoso_task1_1 Andri Santoso National Central University, Taiwan task-acoustic-scene-classification-results#Santoso2016 80.8
Schindler_task1_1 Alexander Schindler Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Lidy2016 81.8
Schindler_task1_2 Alexander Schindler Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Lidy2016 83.3
Takahashi_task1_1 Gen Takahashi University of Tsukuba, Tsukuba, Japan task-acoustic-scene-classification-results#Takahashi2016 85.6
Valenti_task1_1 Michele Valenti Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy task-acoustic-scene-classification-results#Valenti2016 86.2
Vikaskumar_task1_1 Ghodasara Vikaskumar Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India task-acoustic-scene-classification-results#Vikaskumar2016 81.3
Vu_task1_1 Toan H. Vu Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan task-acoustic-scene-classification-results#Vu2016 80.0
Xu_task1_1 Yong Xu Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom task-acoustic-scene-classification-results#Xu2016 73.3
Zoehrer_task1_1 Matthias Zöhrer Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria task-acoustic-scene-classification-results#Zoehrer2016 73.1


Complete results and technical reports can be found on the Task 1 results page.

Baseline system

A baseline system for the task is provided. The system implements a basic approach to acoustic scene classification and provides a comparison point for participants while they develop their systems. The baseline systems for task 1 and task 3 share the same code base and implement a rather similar approach for both tasks. The baseline system downloads the needed datasets and produces the results below when run with the default parameters.

The baseline system is based on MFCC acoustic features and a GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient included), delta coefficients, and acceleration coefficients. The system learns one acoustic model per acoustic scene class and performs classification using a maximum-likelihood scheme.
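The following sketch illustrates the same general approach (per-frame MFCC + delta + acceleration features, one GMM per scene class, maximum-likelihood decision). It is not the official baseline code; librosa and scikit-learn are used here as assumed stand-ins, and the file lists and labels are placeholders:

    # Illustrative MFCC + GMM scene classifier; a sketch, not the official baseline implementation.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_features(path, sr=44100, n_mfcc=20):
        # Static MFCCs (0th included) + delta + acceleration coefficients per frame.
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        feats = np.vstack([mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)])
        return feats.T  # shape: (frames, 3 * n_mfcc)

    def train_models(train_files_per_scene, n_components=16):
        # One GMM per acoustic scene class, trained on all frames of that class.
        models = {}
        for scene, files in train_files_per_scene.items():
            frames = np.vstack([extract_features(f) for f in files])
            models[scene] = GaussianMixture(n_components=n_components).fit(frames)
        return models

    def classify(path, models):
        # Maximum-likelihood decision: the class whose model gives the highest
        # total log-likelihood over the frames of the test segment wins.
        frames = extract_features(path)
        return max(models, key=lambda scene: models[scene].score_samples(frames).sum())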

The baseline system also provides a reference implementation of the evaluation metric. Baseline systems are provided for both Python and Matlab; the Python implementation is regarded as the main implementation.

Participants are allowed to build their systems on top of the given baseline systems. The systems include all the needed functionality for dataset handling, storing and accessing features and models, and evaluating results, making adaptation to one's needs rather easy. The baseline systems are also a good starting point for entry-level researchers.

In publications using the baseline, cite as:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.



Python implementation


Matlab implementation


Results for TUT Acoustic scenes 2016, development set

Evaluation setup

  • 4-fold cross-validation, average classification accuracy over folds
  • 15 acoustic scene classes
  • Classification unit: one file (30 seconds of audio).

System parameters

  • Frame size: 40 ms (with 50% hop size); see the parameter sketch after this list
  • Number of Gaussians per acoustic scene class model: 16
  • Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
  • Trained and tested on full audio
  • Python implementation
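As a rough guide, the parameters above translate into the following analysis settings at 44.1 kHz. This is an illustrative mapping with hypothetical constant names, not the baseline's own configuration file:

    # Illustrative mapping of the listed parameters to sample-based analysis settings.
    SAMPLE_RATE = 44100
    FRAME_LENGTH = int(0.040 * SAMPLE_RATE)   # 40 ms frame = 1764 samples
    HOP_LENGTH = FRAME_LENGTH // 2            # 50% hop = 882 samples
    N_MFCC = 20                               # static coefficients, 0th included
    N_COMPONENTS = 16                         # Gaussians per scene class model
    FEATURE_DIM = 3 * N_MFCC                  # 20 static + 20 delta + 20 acceleration = 60 values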
Acoustic scene classification results, averaged over evaluation folds.
Acoustic scene Accuracy
Beach 69.3 %
Bus 79.6 %
Cafe / Restaurant 83.2 %
Car 87.2 %
City center 85.5 %
Forest path 81.0 %
Grocery store 65.0 %
Home 82.1 %
Library 50.4 %
Metro station 94.7 %
Office 98.6 %
Park 13.9 %
Residential area 77.7 %
Train 33.6 %
Tram 85.4 %
Overall accuracy 72.5 %

Citation

If you are using the dataset or the baseline code, please cite the following paper:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.



When citing the challenge task and results, please cite the following paper:

Publication

A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.

PDF

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

Keywords

Acoustics;Event detection;Hidden Markov models;Speech;Speech processing;Tagging;Acoustic scene classification;audio datasets;pattern recognition;sound event detection

PDF


A follow-up paper assessing human and machine performance on the acoustic scene classification task:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 319–323. IEEE Computer Society, 2017. doi:10.1109/WASPAA.2017.8170047.

Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study

Abstract

Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.