Acoustic scene classification


Task description

The challenge has ended. Full results for this task can be found here.

Description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "home", or "office".

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The TUT Acoustic Scenes 2016 dataset is used for the task. The dataset consists of recordings from various acoustic scenes, all with distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into 30-second segments for the challenge.

Acoustic scenes for the task (15):

  • Bus - traveling by bus in the city (vehicle)
  • Cafe / Restaurant - small cafe/restaurant (indoor)
  • Car - driving or traveling as a passenger, in the city (vehicle)
  • City center (outdoor)
  • Forest path (outdoor)
  • Grocery store - medium size grocery store (indoor)
  • Home (indoor)
  • Lakeside beach (outdoor)
  • Library (indoor)
  • Metro station (indoor)
  • Office - multiple persons, typical work day (indoor)
  • Residential area (outdoor)
  • Train (traveling, vehicle)
  • Tram (traveling, vehicle)
  • Urban park (outdoor)

A detailed description of the acoustic scenes included in the dataset can be found here.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection received funding from the European Research Council.


Recording and annotation procedure

For all acoustic scenes, the recordings were each captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, with a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears, so the recorded audio is very similar to the sound that reaches the human auditory system of the person wearing the equipment.

Postprocessing of the recorded data addressed the privacy of recorded individuals and possible errors in the recording process. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content and privacy-infringing segments were eliminated. Microphone failures and audio distortions were annotated, and segments containing such errors were likewise eliminated.

After eliminating the problematic segments, the remaining audio material was cut into 30-second segments.

Challenge setup

The TUT Acoustic Scenes 2016 dataset consists of two subsets: a development dataset and an evaluation dataset. The data was partitioned into subsets based on the location of the original recordings: all segments obtained from the same original recording were included in a single subset, either the development dataset or the evaluation dataset. For each acoustic scene, 78 segments (39 minutes of audio) were included in the development dataset and 26 segments (13 minutes of audio) were kept for evaluation. The development set contains 9 h 45 min of audio in total, and the evaluation set 3 h 15 min.
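These totals follow directly from the segment counts: per scene, 78 × 30 s = 39 min of development audio and 26 × 30 s = 13 min of evaluation audio; over 15 scenes, 15 × 39 min = 585 min = 9 h 45 min and 15 × 13 min = 195 min = 3 h 15 min.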

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.

Download

Development dataset


Evaluation dataset


In publications using the datasets, cite as:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

PDF

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

PDF

Cross-validation with development dataset

A cross-validation setup is provided for the development dataset in order to make results reported on this dataset uniform. The setup consists of four folds that distribute the 78 available segments per scene based on recording location. The folds are provided with the dataset in the evaluation setup directory.

If the provided cross-validation setup is not used, pay attention to segments extracted from the same original recordings: make sure that all files recorded in the same location are placed on the same side of the evaluation split (see the sketch below).
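If a custom split is built instead of the provided folds, grouping by recording location can be enforced with a group-aware splitter. A minimal sketch follows, assuming scikit-learn is available; the file names, labels, and location identifiers are hypothetical placeholders:

    # Location-aware split sketch: all segments from one recording location
    # stay on the same side of the split. All values below are placeholders.
    from sklearn.model_selection import GroupKFold

    files = ["a1.wav", "a2.wav", "b1.wav", "b2.wav"]   # 30-second segments
    labels = ["park", "park", "office", "office"]      # scene labels
    locations = ["loc01", "loc01", "loc02", "loc02"]   # original recording location per segment

    for train_idx, test_idx in GroupKFold(n_splits=2).split(files, labels, groups=locations):
        train_files = [files[i] for i in train_idx]
        test_files = [files[i] for i in test_idx]
        # ...train on train_files, evaluate on test_files...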

Submission

Detailed information about the challenge submission can be found on the submission page.

One should submit a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
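As an illustration, a results file in this format could be written with a few lines of Python; the mapping of file names to predicted labels and the output file name below are hypothetical placeholders:

    # Minimal sketch for writing the submission file, one "[filename][tab][scene label]" row per test file.
    # The "results" dictionary and the output file name are hypothetical examples.
    results = {
        "audio/eval_0001.wav": "park",
        "audio/eval_0002.wav": "office",
    }

    with open("task1_results.txt", "w") as output_file:
        for filename, scene_label in results.items():
            output_file.write(f"{filename}\t{scene_label}\n")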

Task rules

  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); the use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • Technical report with sufficient description of the system has to be submitted along with the system outputs.

More information on the submission process.

Evaluation

The scoring of acoustic scene classification is based on classification accuracy: the number of correctly classified segments divided by the total number of segments. Each segment is considered an independent test sample.

Code for evaluation is available with the baseline system:

  • Python implementation: from src.evaluation import DCASE2016_SceneClassification_Metrics
  • Matlab implementation: use the class src/evaluation/DCASE2016_SceneClassification_Metrics.m
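Independently of the baseline code, the metric itself is simple to compute. The following is a minimal sketch, assuming the reference and predicted scene labels are available as parallel lists, one entry per 30-second test segment:

    # Classification accuracy: fraction of correctly classified segments.
    # "ground_truth" and "predictions" are hypothetical parallel lists of scene labels.
    def classification_accuracy(ground_truth, predictions):
        correct = sum(1 for ref, est in zip(ground_truth, predictions) if ref == est)
        return correct / len(ground_truth)

    # Example: two of three segments correct -> accuracy 0.67
    print(classification_accuracy(["park", "office", "bus"], ["park", "home", "bus"]))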

Results

Submission code | Author | Affiliation | Technical report | Classification accuracy (%)
Aggarwal_task1_1 Naveen Aggarwal UIET, Panjab University, Chandigarh, India task-acoustic-scene-classification-results#Vij2016 74.4
Bae_task1_1 Soo Hyun Bae Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Bae2016 84.1
Bao_task1_1 Xiao Bao National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China task-acoustic-scene-classification-results#Bao2016 83.1
Battaglino_task1_1 Daniele Battaglino NXP Software, France; EURECOM, France task-acoustic-scene-classification-results#Battaglino2016 80.0
Bisot_task1_1 Victor Bisot Telecom ParisTech, Paris, France task-acoustic-scene-classification-results#Bisot2016 87.7
DCASE2016 baseline Toni Heittola Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland task-acoustic-scene-classification-results#Heittola2016 77.2
Duong_task1_1 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 76.4
Duong_task1_2 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 80.5
Duong_task1_3 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 73.1
Duong_task1_4 Quang-Khanh-Ngoc Duong Technicolor, France task-acoustic-scene-classification-results#Sena_Mafra2016 62.8
Eghbal-Zadeh_task1_1 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 86.4
Eghbal-Zadeh_task1_2 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 88.7
Eghbal-Zadeh_task1_3 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 83.3
Eghbal-Zadeh_task1_4 Hamid Eghbal-Zadeh Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria task-acoustic-scene-classification-results#Eghbal-Zadeh2016 89.7
Foleiss_task1_1 Juliano Henrique Foleiss Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil task-acoustic-scene-classification-results#Foleiss2016 76.2
Hertel_task1_1 Alfred Mertins Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Hertel2016 79.5
Kim_task1_1 Alfred Mertins Institute for Signal Processing, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Yun2016 82.1
Ko_task1_1 Hanseok Ko School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea task-acoustic-scene-classification-results#Park2016 87.2
Ko_task1_2 Hanseok Ko School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea task-acoustic-scene-classification-results#Mun2016 82.3
Kong_task1_1 Qiuqiang Kong Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom task-acoustic-scene-classification-results#Kong2016 81.0
Kumar_task1_1 Anurag Kumar Carnegie Mellon University, Pittsburgh, USA task-acoustic-scene-classification-results#Elizalde2016 85.9
Lee_task1_1 Kyogu Lee Music and Audio Research Group, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Han2016 84.6
Lee_task1_2 Kyogu Lee Music and Audio Research Group, Seoul National University, Seoul, South Korea task-acoustic-scene-classification-results#Kim2016 85.4
Liu_task1_1 Jiaming Liu Department of Control Science and Engineering, Tongji University, Shanghai, China task-acoustic-scene-classification-results#Liu2016 83.8
Liu_task1_2 Jiaming Liu Department of Control Science and Engineering, Tongji University, Shanghai, China task-acoustic-scene-classification-results#Liu2016 83.6
Lostanlen_task1_1 Vincent Lostanlen Departement d’Informatique, Ecole normale superieure, Paris, France task-acoustic-scene-classification-results#Lostanlen2016 80.8
Marchi_task1_1 Erik Marchi Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany; audEERING GmbH, Gilching, Germany task-acoustic-scene-classification-results#Marchi2016 86.4
Marques_task1_1 Gonçalo Marques Electronic Telecom. and Comp. Dept., Instituto Superior de Engenharia de Lisboa, Lisboa, Portugal task-acoustic-scene-classification-results#Marques2016 83.1
Moritz_task1_1 Niko Moritz Project Group for Hearing, Speech, and Audio Processing, Fraunhofer IDMT, Oldenburg, Germany task-acoustic-scene-classification-results#Moritz2016 79.0
Mulimani_task1_1 Manjunath Mulimani Dept. of Computer Science & Engineering, National Institute of Technology, Karnataka, India task-acoustic-scene-classification-results#Mulimani2016 65.6
Nogueira_task1_1 Waldo Nogueira Medical University Hannover, Hannover, Germany; Cluster of Excellence Hearing4all, Hannover, Germany task-acoustic-scene-classification-results#Nogueira2016 81.0
Patiyal_task1_1 Rohit Patiyal School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India task-acoustic-scene-classification-results#Patiyal2016 78.5
Phan_task1_1 Huy Phan Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany task-acoustic-scene-classification-results#Phan2016 83.3
Pugachev_task1_1 Alexei Pugachev Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia task-acoustic-scene-classification-results#Pugachev2016 73.1
Qu_task1_1 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 80.5
Qu_task1_2 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 84.1
Qu_task1_3 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 82.3
Qu_task1_4 Shuhui Qu Stanford University, Stanford, USA task-acoustic-scene-classification-results#Dai2016 80.5
Rakotomamonjy_task1_1 Alain Rakotomamonjy Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2016 82.1
Rakotomamonjy_task1_2 Alain Rakotomamonjy Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France task-acoustic-scene-classification-results#Rakotomamonjy2016 79.2
Santoso_task1_1 Andri Santoso National Central University, Taiwan task-acoustic-scene-classification-results#Santoso2016 80.8
Schindler_task1_1 Alexander Schindler Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Lidy2016 81.8
Schindler_task1_2 Alexander Schindler Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria task-acoustic-scene-classification-results#Lidy2016 83.3
Takahashi_task1_1 Gen Takahashi University of Tsukuba, Tsukuba, Japan task-acoustic-scene-classification-results#Takahashi2016 85.6
Valenti_task1_1 Michele Valenti Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy task-acoustic-scene-classification-results#Valenti2016 86.2
Vikaskumar_task1_1 Ghodasara Vikaskumar Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India task-acoustic-scene-classification-results#Vikaskumar2016 81.3
Vu_task1_1 Toan H. Vu Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan task-acoustic-scene-classification-results#Vu2016 80.0
Xu_task1_1 Yong Xu Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom task-acoustic-scene-classification-results#Xu2016 73.3
Zoehrer_task1_1 Matthias Zöhrer Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria task-acoustic-scene-classification-results#Zoehrer2016 73.1


Complete results and technical reports can be found on the Task 1 results page.

Baseline system

A baseline system for the task is provided. The system implements a basic approach to acoustic scene classification and provides a comparison point for participants while they develop their systems. The baseline systems for task 1 and task 3 share the same code base and implement a rather similar approach for both tasks. The baseline system downloads the needed datasets and produces the results below when run with the default parameters.

The baseline system is based on MFCC acoustic features and a GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient included), delta coefficients, and acceleration coefficients. The system learns one acoustic model per acoustic scene class and performs classification using a maximum-likelihood scheme.
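The following sketch illustrates the same general approach (per-frame MFCC + delta + acceleration features, one GMM per scene class, maximum-likelihood decision). It is not the official baseline code; librosa and scikit-learn are used here as assumed stand-ins, and the file lists and labels are placeholders:

    # Illustrative MFCC + GMM scene classifier; a sketch, not the official baseline implementation.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_features(path, sr=44100, n_mfcc=20):
        # Static MFCCs (0th included) + delta + acceleration coefficients per frame.
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        feats = np.vstack([mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)])
        return feats.T  # shape: (frames, 3 * n_mfcc)

    def train_models(train_files_per_scene, n_components=16):
        # One GMM per acoustic scene class, trained on all frames of that class.
        models = {}
        for scene, files in train_files_per_scene.items():
            frames = np.vstack([extract_features(f) for f in files])
            models[scene] = GaussianMixture(n_components=n_components).fit(frames)
        return models

    def classify(path, models):
        # Maximum-likelihood decision: the class whose model gives the highest
        # total log-likelihood over the frames of the test segment wins.
        frames = extract_features(path)
        return max(models, key=lambda scene: models[scene].score_samples(frames).sum())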

The baseline system also provides a reference implementation of the evaluation metric. Baseline systems are provided for both Python and Matlab; the Python implementation is regarded as the main implementation.

Participants are allowed to build their systems on top of the given baseline systems. The systems include all the needed functionality for dataset handling, storing and accessing features and models, and evaluating results, making adaptation to one's needs rather easy. The baseline systems are also a good starting point for entry-level researchers.

In publications using the baseline, cite as:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.



Python implementation


Matlab implementation


Results for TUT Acoustic scenes 2016, development set

Evaluation setup

  • 4-fold cross-validation, average classification accuracy over folds
  • 15 acoustic scene classes
  • Classification unit: one file (30 seconds of audio).

System parameters

  • Frame size: 40 ms (with 50% hop size); see the parameter sketch after this list
  • Number of Gaussians per acoustic scene class model: 16
  • Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
  • Trained and tested on full audio
  • Python implementation
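As a rough guide, the parameters above translate into the following analysis settings at 44.1 kHz. This is an illustrative mapping with hypothetical constant names, not the baseline's own configuration file:

    # Illustrative mapping of the listed parameters to sample-based analysis settings.
    SAMPLE_RATE = 44100
    FRAME_LENGTH = int(0.040 * SAMPLE_RATE)   # 40 ms frame = 1764 samples
    HOP_LENGTH = FRAME_LENGTH // 2            # 50% hop = 882 samples
    N_MFCC = 20                               # static coefficients, 0th included
    N_COMPONENTS = 16                         # Gaussians per scene class model
    FEATURE_DIM = 3 * N_MFCC                  # 20 static + 20 delta + 20 acceleration = 60 values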
Acoustic scene classification results, averaged over evaluation folds.
Acoustic scene Accuracy
Beach 69.3 %
Bus 79.6 %
Cafe / Restaurant 83.2 %
Car 87.2 %
City center 85.5 %
Forest path 81.0 %
Grocery store 65.0 %
Home 82.1 %
Library 50.4 %
Metro station 94.7 %
Office 98.6 %
Park 13.9 %
Residential area 77.7 %
Train 33.6 %
Tram 85.4 %
Overall accuracy 72.5 %

Citation

If you are using the dataset or the baseline code, please cite the following paper:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.



When citing the challenge task and results, please cite the following paper:

Publication

A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.

PDF

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

Keywords

Acoustics;Event detection;Hidden Markov models;Speech;Speech processing;Tagging;Acoustic scene classification;audio datasets;pattern recognition;sound event detection

PDF


A follow-up paper assessing human and machine performance on the acoustic scene classification task:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 319–323. IEEE Computer Society, 2017. doi:10.1109/WASPAA.2017.8170047.

Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study

Abstract

Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.