The challenge has ended. Full results for this task can be found here
Description
The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "home", or "office".
Audio dataset
The TUT Acoustic scenes 2016 dataset will be used for the task. The dataset consists of recordings from various acoustic scenes, all with distinct recording locations. For each recording location, a 3-5 minute audio recording was captured. The original recordings were then split into 30-second segments for the challenge.
Acoustic scenes for the task (15):
- Bus - traveling by bus in the city (vehicle)
- Cafe / Restaurant - small cafe/restaurant (indoor)
- Car - driving or traveling as a passenger, in the city (vehicle)
- City center (outdoor)
- Forest path (outdoor)
- Grocery store - medium size grocery store (indoor)
- Home (indoor)
- Lakeside beach (outdoor)
- Library (indoor)
- Metro station (indoor)
- Office - multiple persons, typical work day (indoor)
- Residential area (outdoor)
- Train (traveling, vehicle)
- Tram (traveling, vehicle)
- Urban park (outdoor)
A detailed description of the acoustic scenes included in the dataset can be found here.
The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection received funding from the European Research Council.
Recording and annotation procedure
For all acoustic scenes, the recordings were each captured in a different location: different streets, different parks, different homes. Recordings were made using a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Roland Edirol R-09 wave recorder, at a 44.1 kHz sampling rate and 24-bit resolution. The microphones are specifically designed to look like headphones and are worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment.
Postprocessing of the recorded data addressed the privacy of recorded individuals and possible errors in the recording process. For audio material recorded in private places, written consent was obtained from all people involved. Material recorded in public places does not require such consent, but it was screened for content, and privacy-infringing segments were eliminated. Microphone failures and audio distortions were also annotated, and segments containing such errors were eliminated as well.
After eliminating the problematic segments, the remaining audio material was cut into 30-second segments.
Challenge setup
The TUT Acoustic scenes 2016 dataset consists of two subsets: a development dataset and an evaluation dataset. The partitioning of the data into subsets was done based on the location of the original recordings: all segments obtained from the same original recording were included in a single subset, either the development dataset or the evaluation dataset. For each acoustic scene, 78 segments (39 minutes of audio) were included in the development dataset and 26 segments (13 minutes of audio) were kept for evaluation. The development set contains 9 h 45 min of audio in total, and the evaluation set 3 h 15 min.
Participants are not allowed to use external data for system development. Manipulation of provided data is allowed.
Download
**Development dataset**
**Evaluation dataset**
In publications using the datasets, cite as:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection
Abstract
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
Cross-validation with development dataset
A cross-validation setup is provided for the development dataset in order to make results reported with this dataset uniform. The setup consists of four folds that distribute the 78 available segments per scene class based on recording location. The folds are provided with the dataset in the directory evaluation setup.
If you are not using the provided cross-validation setup, pay attention to segments extracted from the same original recordings: make sure that all files recorded in the same location are placed on the same side of the split, either all in training or all in testing.
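If you build your own folds instead, grouping segments by their original recording location, for example with scikit-learn's GroupKFold, achieves this. The sketch below is purely illustrative and uses invented file names and location IDs; it is not the official fold setup.

```python
# Illustrative sketch of a location-aware split using scikit-learn's GroupKFold
# (not the official DCASE 2016 fold setup; file names and location IDs are invented).
from sklearn.model_selection import GroupKFold

# (filename, scene_label, recording_location) metadata, normally read from the dataset
meta = [
    ("audio/beach_loc1_a.wav", "beach", "loc1"),
    ("audio/beach_loc1_b.wav", "beach", "loc1"),
    ("audio/beach_loc2_a.wav", "beach", "loc2"),
    ("audio/bus_loc3_a.wav",   "bus",   "loc3"),
    ("audio/bus_loc4_a.wav",   "bus",   "loc4"),
    ("audio/bus_loc5_a.wav",   "bus",   "loc5"),
]

files  = [m[0] for m in meta]
labels = [m[1] for m in meta]
groups = [m[2] for m in meta]  # recording location: keeps sibling segments together

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(files, labels, groups)):
    train_files = [files[i] for i in train_idx]
    test_files  = [files[i] for i in test_idx]
    # Segments from the same location never appear on both sides of the split.
    print(f"fold {fold}: {len(train_files)} train / {len(test_files)} test")
```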
Submission
Detailed information about the challenge submission can be found on the submission page.
Participants should submit a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:
[filename (string)][tab][scene label (string)]
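For illustration, such a tab-separated result file could be written with a few lines of Python. The file names and labels below are invented; the submission page defines the authoritative requirements.

```python
# Illustrative sketch: writing the tab-separated result file
# (file names and labels below are invented examples).
results = [
    ("a001_30_60.wav", "bus"),
    ("a002_0_30.wav", "residential_area"),
]

with open("results.txt", "w") as f:
    for filename, scene_label in results:
        f.write(f"{filename}\t{scene_label}\n")
```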
Task rules
- Only the provided development dataset can be used to train the submitted system.
- The development dataset can be augmented only by mixing in data sampled from a probability density function; the use of additional real recordings is forbidden.
- The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
- A technical report with a sufficient description of the system has to be submitted along with the system outputs.
More information on the submission process.
Evaluation
The scoring of acoustic scene classification will be based on classification accuracy: the proportion of correctly classified segments out of the total number of segments. Each segment is considered an independent test sample.
Code for evaluation is available with the baseline system:
- Python implementation: `from src.evaluation import DCASE2016_SceneClassification_Metrics`
- Matlab implementation: use the class `src/evaluation/DCASE2016_SceneClassification_Metrics.m`
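Since the metric is a plain ratio of correctly classified segments, a minimal sketch of the computation (not the official DCASE2016_SceneClassification_Metrics implementation) might look like this:

```python
# Minimal sketch of segment-level classification accuracy
# (not the official DCASE2016_SceneClassification_Metrics implementation).
def classification_accuracy(reference, estimated):
    """reference, estimated: dicts mapping filename -> scene label."""
    correct = sum(1 for f, label in reference.items() if estimated.get(f) == label)
    return correct / len(reference)

# Example with invented file names
ref = {"a.wav": "bus", "b.wav": "park", "c.wav": "home"}
est = {"a.wav": "bus", "b.wav": "car", "c.wav": "home"}
print(classification_accuracy(ref, est))  # 0.666...
```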
Results
Code | Author | Affiliation | Technical Report | Classification Accuracy (%)
---|---|---|---|---
Aggarwal_task1_1 | Naveen Aggarwal | UIET, Panjab University, Chandigarh, India | task-acoustic-scene-classification-results#Vij2016 | 74.4 | |
Bae_task1_1 | Soo Hyun Bae | Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea | task-acoustic-scene-classification-results#Bae2016 | 84.1 | |
Bao_task1_1 | Xiao Bao | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China | task-acoustic-scene-classification-results#Bao2016 | 83.1 | |
Battaglino_task1_1 | Daniele Battaglino | NXP Software, France; EURECOM, France | task-acoustic-scene-classification-results#Battaglino2016 | 80.0 | |
Bisot_task1_1 | Victor Bisot | Telecom ParisTech, Paris, France | task-acoustic-scene-classification-results#Bisot2016 | 87.7 | |
DCASE2016 baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results#Heittola2016 | 77.2 | |
Duong_task1_1 | Quang-Khanh-Ngoc Duong | Technicolor, France | task-acoustic-scene-classification-results#Sena_Mafra2016 | 76.4 | |
Duong_task1_2 | Quang-Khanh-Ngoc Duong | Technicolor, France | task-acoustic-scene-classification-results#Sena_Mafra2016 | 80.5 | |
Duong_task1_3 | Quang-Khanh-Ngoc Duong | Technicolor, France | task-acoustic-scene-classification-results#Sena_Mafra2016 | 73.1 | |
Duong_task1_4 | Quang-Khanh-Ngoc Duong | Technicolor, France | task-acoustic-scene-classification-results#Sena_Mafra2016 | 62.8 | |
Eghbal-Zadeh_task1_1 | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | task-acoustic-scene-classification-results#Eghbal-Zadeh2016 | 86.4 | |
Eghbal-Zadeh_task1_2 | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | task-acoustic-scene-classification-results#Eghbal-Zadeh2016 | 88.7 | |
Eghbal-Zadeh_task1_3 | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | task-acoustic-scene-classification-results#Eghbal-Zadeh2016 | 83.3 | |
Eghbal-Zadeh_task1_4 | Hamid Eghbal-Zadeh | Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria | task-acoustic-scene-classification-results#Eghbal-Zadeh2016 | 89.7 | |
Foleiss_task1_1 | Juliano Henrique Foleiss | Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil | task-acoustic-scene-classification-results#Foleiss2016 | 76.2 | |
Hertel_task1_1 | Alfred Mertins | Institute for Signal Processing, University of Luebeck, Luebeck, Germany | task-acoustic-scene-classification-results#Hertel2016 | 79.5 | |
Kim_task1_1 | Alfred Mertins | Institute for Signal Processing, University of Luebeck, Luebeck, Germany | task-acoustic-scene-classification-results#Yun2016 | 82.1 | |
Ko_task1_1 | Hanseok Ko | School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea | task-acoustic-scene-classification-results#Park2016 | 87.2 | |
Ko_task1_2 | Hanseok Ko | School of Electrical Engineering, Korea University, Seoul, South Korea; Department of Visual Information Processing, Korea University, Seoul, South Korea | task-acoustic-scene-classification-results#Mun2016 | 82.3 | |
Kong_task1_1 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | task-acoustic-scene-classification-results#Kong2016 | 81.0 | |
Kumar_task1_1 | Anurag Kumar | Carnegie Mellon University, Pittsburgh, USA | task-acoustic-scene-classification-results#Elizalde2016 | 85.9 | |
Lee_task1_1 | Kyogu Lee | Music and Audio Research Group, Seoul National University, Seoul, South Korea | task-acoustic-scene-classification-results#Han2016 | 84.6 | |
Lee_task1_2 | Kyogu Lee | Music and Audio Research Group, Seoul National University, Seoul, South Korea | task-acoustic-scene-classification-results#Kim2016 | 85.4 | |
Liu_task1_1 | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | task-acoustic-scene-classification-results#Liu2016 | 83.8 | |
Liu_task1_2 | Jiaming Liu | Department of Control Science and Engineering, Tongji University, Shanghai, China | task-acoustic-scene-classification-results#Liu2016 | 83.6 | |
Lostanlen_task1_1 | Vincent Lostanlen | Departement d’Informatique, Ecole normale superieure, Paris, France | task-acoustic-scene-classification-results#Lostanlen2016 | 80.8 | |
Marchi_task1_1 | Erik Marchi | Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany; audEERING GmbH, Gilching, Germany | task-acoustic-scene-classification-results#Marchi2016 | 86.4 | |
Marques_task1_1 | Gonçalo Marques | Electronic Telecom. and Comp. Dept., Instituto Superior de Engenharia de Lisboa, Lisboa, Portugal | task-acoustic-scene-classification-results#Marques2016 | 83.1 | |
Moritz_task1_1 | Niko Moritz | Project Group for Hearing, Speech, and Audio Processing, Fraunhofer IDMT, Oldenburg, Germany | task-acoustic-scene-classification-results#Moritz2016 | 79.0 | |
Mulimani_task1_1 | Manjunath Mulimani | Dept. of Computer Science & Engineering, National Institute of Technology, Karnataka, India | task-acoustic-scene-classification-results#Mulimani2016 | 65.6 | |
Nogueira_task1_1 | Waldo Nogueira | Medical University Hannover, Hannover, Germany; Cluster of Excellence Hearing4all, Hannover, Germany | task-acoustic-scene-classification-results#Nogueira2016 | 81.0 | |
Patiyal_task1_1 | Rohit Patiyal | School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India | task-acoustic-scene-classification-results#Patiyal2016 | 78.5 | |
Phan_task1_1 | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany | task-acoustic-scene-classification-results#Phan2016 | 83.3 | |
Pugachev_task1_1 | Alexei Pugachev | Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia | task-acoustic-scene-classification-results#Pugachev2016 | 73.1 | |
Qu_task1_1 | Shuhui Qu | Stanford University, Stanford, USA | task-acoustic-scene-classification-results#Dai2016 | 80.5 | |
Qu_task1_2 | Shuhui Qu | Stanford University, Stanford, USA | task-acoustic-scene-classification-results#Dai2016 | 84.1 | |
Qu_task1_3 | Shuhui Qu | Stanford University, Stanford, USA | task-acoustic-scene-classification-results#Dai2016 | 82.3 | |
Qu_task1_4 | Shuhui Qu | Stanford University, Stanford, USA | task-acoustic-scene-classification-results#Dai2016 | 80.5 | |
Rakotomamonjy_task1_1 | Alain Rakotomamonjy | Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France | task-acoustic-scene-classification-results#Rakotomamonjy2016 | 82.1 | |
Rakotomamonjy_task1_2 | Alain Rakotomamonjy | Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France | task-acoustic-scene-classification-results#Rakotomamonjy2016 | 79.2 | |
Santoso_task1_1 | Andri Santoso | National Central University, Taiwan | task-acoustic-scene-classification-results#Santoso2016 | 80.8 | |
Schindler_task1_1 | Alexander Schindler | Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria | task-acoustic-scene-classification-results#Lidy2016 | 81.8 | |
Schindler_task1_2 | Alexander Schindler | Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria | task-acoustic-scene-classification-results#Lidy2016 | 83.3 | |
Takahashi_task1_1 | Gen Takahashi | University of Tsukuba, Tsukuba, Japan | task-acoustic-scene-classification-results#Takahashi2016 | 85.6 | |
Valenti_task1_1 | Michele Valenti | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | task-acoustic-scene-classification-results#Valenti2016 | 86.2 | |
Vikaskumar_task1_1 | Ghodasara Vikaskumar | Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India | task-acoustic-scene-classification-results#Vikaskumar2016 | 81.3 | |
Vu_task1_1 | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results#Vu2016 | 80.0 | |
Xu_task1_1 | Yong Xu | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | task-acoustic-scene-classification-results#Xu2016 | 73.3 | |
Zoehrer_task1_1 | Matthias Zöhrer | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria | task-acoustic-scene-classification-results#Zoehrer2016 | 73.1 |
Complete results and technical reports can be found on the Task 1 results page.
Baseline system
A baseline system for the task is provided. The system implements a basic approach to acoustic scene classification and serves as a comparison point for participants while they develop their systems. The baseline systems for task 1 and task 3 share the same code base and implement a quite similar approach for both tasks. The baseline system will download the needed datasets and produce the results below when run with the default parameters.
The baseline system is based on MFCC acoustic features and a GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient included), delta coefficients and acceleration coefficients. The system learns one acoustic model per acoustic scene class and performs classification with a maximum likelihood classification scheme.
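For orientation, a heavily simplified sketch of this kind of MFCC + GMM pipeline, written with librosa and scikit-learn rather than the actual baseline code, could look as follows. Frame settings mirror the parameters listed further below, and the helper names are my own.

```python
# Illustrative MFCC + GMM scene classifier in the spirit of the baseline
# (simplified; not the official DCASE 2016 baseline implementation).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(path, sr=44100, n_mfcc=20):
    """MFCC static + delta + acceleration coefficients, one row per frame."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.04 * sr),        # 40 ms frames
                                hop_length=int(0.02 * sr))   # 50% hop
    delta = librosa.feature.delta(mfcc)
    accel = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, accel]).T  # shape: (frames, 60)

def train_models(train_items, n_components=16):
    """train_items: list of (audio_path, scene_label). One GMM per scene class."""
    models = {}
    for label in set(label for _, label in train_items):
        feats = np.vstack([extract_features(p) for p, l in train_items if l == label])
        models[label] = GaussianMixture(n_components=n_components).fit(feats)
    return models

def classify(path, models):
    """Maximum-likelihood decision: pick the class whose GMM scores the file highest."""
    feats = extract_features(path)
    return max(models, key=lambda label: models[label].score(feats))
```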
The baseline system also provides a reference implementation of the evaluation metric. Baseline systems are provided for both Python and Matlab; the Python implementation is regarded as the main implementation.
Participants are allowed to build their systems on top of the given baseline systems. The systems include all the needed functionality for dataset handling, storing and accessing features and models, and evaluating the results, making adaptation to one's own needs rather easy. The baseline systems are also a good starting point for entry-level researchers.
In publications using the baseline, cite as:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection
Abstract
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
Python implementation
Matlab implementation
Results for TUT Acoustic scenes 2016, development set
Evaluation setup
- 4-fold cross-validation, average classification accuracy over folds
- 15 acoustic scene classes
- Classification unit: one file (30 seconds of audio).
System parameters
- Frame size: 40 ms (with 50% hop size)
- Number of Gaussians per acoustic scene class model: 16
- Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
- Trained and tested on full audio
- Results computed with the Python implementation
Acoustic scene | Accuracy |
---|---|
Beach | 69.3 % |
Bus | 79.6 % |
Cafe / Restaurant | 83.2 % |
Car | 87.2 % |
City center | 85.5 % |
Forest path | 81.0 % |
Grocery store | 65.0 % |
Home | 82.1 % |
Library | 50.4 % |
Metro station | 94.7 % |
Office | 98.6 % |
Park | 13.9 % |
Residential area | 77.7 % |
Train | 33.6 % |
Tram | 85.4 % |
Overall accuracy | 72.5 % |
Citation
If you are using the dataset or baseline code, please cite the following paper:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection
Abstract
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
When citing the challenge task and results, please cite the following paper:
A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
Abstract
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
Keywords
Acoustics;Event detection;Hidden Markov models;Speech;Speech processing;Tagging;Acoustic scene classification;audio datasets;pattern recognition;sound event detection
Follow-up paper assessing human and machine performance in the acoustic scene classification task:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 319–323. IEEE Computer Society, 2017. doi:10.1109/WASPAA.2017.8170047.
Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study
Abstract
Human and machine performance in acoustic scene classification is examined through a parallel experiment using TUT Acoustic Scenes 2016 dataset. The machine learning perspective is presented based on the systems submitted for the 2016 challenge on Detection and Classification of Acoustic Scenes and Events. The human performance, assessed through a listening experiment, was found to be significantly lower than machine performance. Test subjects exhibited different behavior throughout the experiment, leading to significant differences in performance between groups of subjects. An expert listener trained for the task obtained similar accuracy to the average of submitted systems, comparable also to previous studies of human abilities in recognizing everyday acoustic scenes.