Challenge has ended. Full results for this task can be found here

Description

This task evaluates the performance of sound event detection systems in multisource conditions similar to everyday life, where sound sources are rarely heard in isolation. Contrary to task 2, there is no control over the number of overlapping sound events at each time, neither in the training nor in the testing audio data.

Figure 1: Overview of sound event detection system.

Audio dataset

The TUT Sound events 2016 dataset will be used for task 3. Audio in the dataset is a subset of the TUT Acoustic scenes 2016 dataset (used for task 1). The TUT Sound events 2016 dataset consists of recordings from two acoustic scenes:

Home (indoor)
Residential area (outdoor).

These acoustic scenes were selected to represent common environments of interest in applications for safety and surveillance (outside home) and human activity monitoring or home surveillance.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 and 01/2016. The data collection has received funding from the European Research Council.

Recording and annotation procedure

Each recording was captured in a different location: different streets, different homes. For each recording location, a 3-5 minute long audio recording was captured. The equipment used for recording consists of a binaural Soundman OKM II Klassik/studio A3 electret in-ear microphone and a Roland Edirol R-09 wave recorder using 44.1 kHz sampling rate and 24 bit resolution. For audio material recorded in private places, written consent was obtained from all people involved.

Individual sound events in each recording were annotated by two research assistants using freely chosen labels for sounds. Nouns were used to characterize each sound source, and verbs to characterize the sound production mechanism, whenever this was possible. Annotators were first trained on a few example recordings. They were instructed to annotate all audible sound events, decide the start time and end time of the sounds as they saw fit, and choose event labels freely. This resulted in a large set of raw labels. There was no verification of the annotations and no evaluation of inter-annotator agreement due to the high level of subjectivity inherent to the problem.

Target sound event classes were selected based on the frequency of the obtained labels, to ensure that the selected sounds are common for an acoustic scene, and there are sufficient examples for learning acoustic models. Mapping of the raw labels was performed, merging for example "car engine running" to "engine running", and grouping various impact sounds with only verb description such as "banging", "clacking" into "object impact".

Selected sound event classes:

Home

(object) Rustling
(object) Snapping
Cupboard
Cutlery
Dishes
Drawer
Glass jingling
Object impact
People walking
Washing dishes
Water tap running

Residential area

(object) Banging
Bird singing
Car passing by
Children shouting
People speaking
People walking
Wind blowing

For residential area, the sound event classes are mostly related to concrete physical sound sources - bird singing, car passing by. Home scenes are dominated by abstract object impact sounds, besides some more well defined sound events (still impact) like dishes, cutlery, etc.

Challenge setup

TUT Sound events 2016 dataset consists of two subsets: development dataset and evaluation dataset. Partitioning of data into these subsets was done based on the amount of examples available for each sound event class, while also taking into account recording location. Ideally the subsets should have the same amount of data for each class, or at least the same relative amount, such as a 70-30% split. Because the event instances belonging to different classes are distributed unevenly within the recordings, the partitioning of individual classes can be controlled only to a certain extent.

The split condition was relaxed from 70-30%. For home, 40-80% of instances of each class were selected into the development set. For residential area, 60-80% of instances of each class were selected into the development set.

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed. Acoustic scene label can be used as external information in the detection (acoustic scene-dependent sound event detection system).

Download

Development dataset

TUT Sound events 2016, development dataset (0.9 GB)

Evaluation dataset

TUT Sound events 2016, evaluation dataset (0.4 GB)

In publications using the datasets, cite as:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

Cross-validation with development dataset

A cross-validation setup is provided in order to make results reported with this dataset uniform. The setup consists of four folds, so that each recording is used exactly once as test data. While creating the cross-validation folds, the only condition imposed was that the test subset does not contain classes unavailable in training subset. The folds are provided with the dataset in the directory evaluation_setup.
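
The fold condition described above can be checked mechanically. The following is a minimal sketch, assuming each subset is held in memory as a list of (filename, event label) pairs; the function name and the toy data are hypothetical and do not reflect the file format used in evaluation_setup.

```python
def unseen_test_classes(train_items, test_items):
    """Return event labels that occur in the test subset but not in training."""
    train_labels = {label for _, label in train_items}
    test_labels = {label for _, label in test_items}
    return sorted(test_labels - train_labels)


# Toy fold with made-up file names (illustration only).
train = [("a.wav", "bird singing"), ("b.wav", "car passing by")]
test_ok = [("c.wav", "bird singing")]
test_bad = [("d.wav", "wind blowing")]

print(unseen_test_classes(train, test_ok))   # []
print(unseen_test_classes(train, test_bad))  # ['wind blowing']
```

An empty result means the fold satisfies the condition; any returned label is a class the system would be tested on without having seen it in training.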

Submission

Detailed information on the challenge submission can be found on the submission page.

One should submit a single text file (in CSV format) per evaluated acoustic scene (home and residential area), each file containing the detected sound events from each audio file. Events can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]
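
As an illustration of the format above, here is a minimal sketch that writes detected events as tab-separated rows with Python's csv module. The function name, the file names, and the example events are hypothetical; the submission page remains the authority on naming and packaging.

```python
import csv
import io


def write_events(stream, events):
    """Write (filename, onset, offset, label) rows as tab-separated text."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for filename, onset, offset, label in events:
        writer.writerow([filename, f"{onset:.2f}", f"{offset:.2f}", label])


# Made-up detections for illustration only.
events = [
    ("a001.wav", 1.5, 3.25, "bird singing"),
    ("a001.wav", 2.0, 7.0, "car passing by"),
]

buffer = io.StringIO()
write_events(buffer, events)
print(buffer.getvalue(), end="")
```

To produce a file instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="") as the stream argument.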

Task rules

Only the provided development dataset can be used to train the submitted system.
The development dataset can be augmented only by mixing data sampled from a pdf; use of real recordings is forbidden.
The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
A technical report with a sufficient description of the system has to be submitted along with the system outputs.

More information on submission process.

Evaluation

Total error rate (ER) is the main metric for this task. Error rate will be evaluated in one-second segments over the entire test set. Ranking of submitted systems will be done using this metric. Additionally, other metrics will be calculated.

Detailed description of metrics can be found here.
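
To make the metric concrete, the following is a minimal sketch of a segment-based error rate. It assumes the commonly used definition in which, for each one-second segment, substitutions are S = min(FN, FP), deletions are D = max(0, FN - FP), insertions are I = max(0, FP - FN), and ER is the sum of S, D and I over all segments divided by the number of active reference events N. The function names and toy events are hypothetical, and this is not the baseline implementation; reported results should come from the reference code distributed with the baseline.

```python
import math


def active_labels(events, start, end):
    """Labels of events overlapping the segment [start, end)."""
    return {label for onset, offset, label in events
            if onset < end and offset > start}


def segment_error_rate(reference, estimated, resolution=1.0):
    """Segment-based ER for one file; events are (onset, offset, label).

    Assumes at least one reference event is active, so N is nonzero.
    """
    last_offset = max(offset for _, offset, _ in reference + estimated)
    n_segments = math.ceil(last_offset / resolution)
    subs = dels = ins = n_ref = 0
    for k in range(n_segments):
        start, end = k * resolution, (k + 1) * resolution
        ref = active_labels(reference, start, end)
        est = active_labels(estimated, start, end)
        fn = len(ref - est)   # reference events that were missed
        fp = len(est - ref)   # estimated events with no reference match
        subs += min(fn, fp)
        dels += max(0, fn - fp)
        ins += max(0, fp - fn)
        n_ref += len(ref)
    return (subs + dels + ins) / n_ref


reference = [(0.0, 2.0, "bird singing"), (1.0, 3.0, "car passing by")]
estimated = [(0.0, 1.0, "bird singing"), (1.0, 3.0, "wind blowing")]
print(segment_error_rate(reference, estimated))  # 0.75
```

In the toy example the first segment is correct, the second contains one substitution and one deletion, and the third contains one substitution, giving 3 errors over 4 active reference events.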

Code for evaluation is available with the baseline system:

Python implementation: from src.evaluation import DCASE2016_EventDetection_SegmentBasedMetrics and from src.evaluation import DCASE2016_EventDetection_EventBasedMetrics.
Matlab implementation: use classes src/evaluation/DCASE2016_EventDetection_SegmentBasedMetrics.m and src/evaluation/DCASE2016_EventDetection_EventBasedMetrics.m.

sed_eval - Evaluation toolbox for Sound Event Detection

sed_eval contains the same metrics as the baseline system, and they are tested to give the same values. Use parameters time_resolution=1 and t_collar=0.250 to align it with the baseline system results.

Results

Rank	Code	Author	Affiliation	Technical Report	Segment-based ER (overall)	Segment-based F1 (overall, %)
	Adavanne_task3_1	Sharath Adavanne	Department of Signal Processing, Tampere University of Technology, Tampere, Finland	task-sound-event-detection-in-real-life-audio-results#Adavanne2016	0.8051	47.8
	Adavanne_task3_2	Sharath Adavanne	Department of Signal Processing, Tampere University of Technology, Tampere, Finland	task-sound-event-detection-in-real-life-audio-results#Adavanne2016	0.8887	37.9
	DCASE2016 baseline	Toni Heittola	Department of Signal Processing, Tampere University of Technology, Tampere, Finland	task-sound-event-detection-in-real-life-audio-results#Heittola2016	0.8773	34.3
	Elizalde_task3_1	Benjamin Elizalde	Carnegie Mellon University, Pittsburgh, USA	task-sound-event-detection-in-real-life-audio-results#Elizalde2016	1.0730	22.5
	Elizalde_task3_2	Benjamin Elizalde	Carnegie Mellon University, Pittsburgh, USA	task-sound-event-detection-in-real-life-audio-results#Elizalde2016	1.1056	20.8
	Elizalde_task3_3	Benjamin Elizalde	Carnegie Mellon University, Pittsburgh, USA	task-sound-event-detection-in-real-life-audio-results#Elizalde2016	0.9635	33.3
	Elizalde_task3_4	Benjamin Elizalde	Carnegie Mellon University, Pittsburgh, USA	task-sound-event-detection-in-real-life-audio-results#Elizalde2016	0.9613	33.6
	Gorin_task3_1	Arseniy Gorin	ACTechnologies LLC, Moscow, Russia	task-sound-event-detection-in-real-life-audio-results#Gorin2016	0.9799	41.1
	Kong_task3_1	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom	task-sound-event-detection-in-real-life-audio-results#Kong2016	0.9557	36.3
	Kroos_task3_1	Christian Kroos	Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, United Kingdom	task-sound-event-detection-in-real-life-audio-results#Kroos2016	1.1488	16.8
	Liu_task3_1	Christian Kroos	Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, United Kingdom	task-sound-event-detection-in-real-life-audio-results#Lai2016	0.9287	34.5
	Pham_task3_1	Phuong Pham	University of Pittsburgh, Pittsburgh, USA	task-sound-event-detection-in-real-life-audio-results#Dai2016	0.9583	11.6
	Phan_task3_1	Huy Phan	Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany	task-sound-event-detection-in-real-life-audio-results#Phan2016	0.9644	23.9
	Schroeder_task3_1	Jens Schröder	Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg, Germany; Cluster of Excellence, Hearing4all, Germany	task-sound-event-detection-in-real-life-audio-results#Schroeder2016	1.3092	33.6
	Ubskii_task3_1	Dmitrii Ubskii	Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia	task-sound-event-detection-in-real-life-audio-results#Ubskii2016	0.9971	39.6
	Vu_task3_1	Toan H. Vu	Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan	task-sound-event-detection-in-real-life-audio-results#Vu2016	0.9124	41.9
	Zoehrer_task3_1	Matthias Zöhrer	Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria	task-sound-event-detection-in-real-life-audio-results#Zoehrer2016	0.9056	39.6

Complete results and technical reports can be found at Task 3 result page

Baseline system

A baseline system for the task is provided. The system is meant to implement a basic approach for sound event detection and to provide a comparison point for participants while they develop their systems. The baseline systems for task 1 and task 3 share the code base and implement quite similar approaches for both tasks. The baseline system will download the needed datasets and produce the results below when run with the default parameters.

The baseline system is based on MFCC acoustic features and GMM classifier. The acoustic features include MFCC static coefficients (0th coefficient excluded), delta coefficients and acceleration coefficients. For each event class, a binary classifier is set up. The class model is trained using the audio segments annotated as belonging to the modeled event class, and a negative model is trained using the rest of the audio. The decision is based on likelihood ratio between the positive and negative models for each individual class, with a sliding window of one second.
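
The decision rule can be sketched as follows. This is a minimal illustration, not the baseline code: it assumes per-frame log-likelihoods from the positive and negative models are already available as lists, that a one-second window corresponds to 50 frames (a 20 ms hop), that the window advances one frame at a time, and that an event is marked active when the summed log-likelihood ratio in the window exceeds a threshold of 0. The function name, parameters, and toy values are all hypothetical.

```python
def detect_activity(pos_loglik, neg_loglik, window=50, threshold=0.0):
    """Frame-wise activity from windowed log-likelihood ratios.

    pos_loglik / neg_loglik: per-frame log-likelihoods under the event
    model and the negative model (hypothetical inputs).
    window: number of frames in the sliding window (assumed 50 = 1 s).
    """
    ratios = [p - n for p, n in zip(pos_loglik, neg_loglik)]
    activity = []
    for i in range(len(ratios)):
        chunk = ratios[i:i + window]      # window starting at frame i
        activity.append(sum(chunk) > threshold)
    return activity


# Toy values with a 3-frame window for readability.
pos = [-1.0, -1.0, -5.0, -5.0, -5.0]
neg = [-3.0, -3.0, -2.0, -2.0, -2.0]
print(detect_activity(pos, neg, window=3))
# [True, False, False, False, False]
```

One such detector would be run per event class, since each class has its own pair of positive and negative models.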

The baseline system also provides a reference implementation of the evaluation metrics. Baseline systems are provided for both Python and Matlab. The Python implementation is regarded as the main implementation.

Participants are allowed to build their system on top of the given baseline systems. The systems have all the needed functionality for dataset handling, storing and accessing features and models, and evaluating the results, which makes adapting them to one's needs rather easy. The baseline systems are also a good starting point for entry-level researchers.

In publications using the baseline, cite the paper "TUT database for acoustic scene classification and sound event detection" (Mesaros, Heittola, and Virtanen, EUSIPCO 2016).

Python implementation

DCASE2016 Task 1&3 Python baseline, repository

DCASE2016 Task 1&3 Python baseline, release
version 1.0.7 (.zip)

Matlab implementation

DCASE2016 Task 1&3 Matlab baseline, repository

DCASE2016 Task 1&3 Matlab baseline, release
version 1.0.6 (.zip)

Results for TUT Sound events 2016, development set

Evaluation setup

4-fold cross-validation

System parameters

Frame size: 40 ms (with 50% hop size)
Number of Gaussians per sound event model (positive and negative): 16
Feature vector: 20 MFCC static coefficients (excluding 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values

PLEASE NOTE: The four cross-validation folds are treated as a single experiment: metrics are calculated only after training and testing all folds, not fold by fold. Intermediate measures (insertions, deletions, substitutions) from all folds are accumulated before the error rate is calculated.
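
The difference between accumulating intermediate measures and averaging fold-wise metrics can be illustrated with a small sketch. The counts below are invented for illustration, and the dictionary keys are hypothetical names, not those used by the baseline code.

```python
def error_rate(counts):
    """ER = (substitutions + deletions + insertions) / reference events."""
    return (counts["S"] + counts["D"] + counts["I"]) / counts["N"]


# Invented per-fold intermediate counts (illustration only).
folds = [
    {"S": 10, "D": 20, "I": 5, "N": 50},
    {"S": 2, "D": 3, "I": 1, "N": 10},
    {"S": 30, "D": 40, "I": 20, "N": 100},
    {"S": 4, "D": 8, "I": 4, "N": 40},
]

# Accumulate counts over all folds, then compute the metric once.
total = {key: sum(fold[key] for fold in folds) for key in ("S", "D", "I", "N")}
accumulated_er = error_rate(total)

# For comparison only: the mean of fold-wise error rates.
foldwise_mean_er = sum(error_rate(fold) for fold in folds) / len(folds)

print(round(accumulated_er, 3), round(foldwise_mean_er, 3))  # 0.735 0.65
```

The two values differ because folds with more reference events weigh more in the accumulated figure, which is the one described above.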

Sound event detection results over evaluation folds.
	Segment-based overall metrics
Acoustic scene	ER	F-score
Home	0.96	15.9 %
Residential area	0.86	31.5 %
Average	0.91	23.7 %

Citation

If you are using the dataset or baseline code please cite the following paper:

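Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.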

When citing challenge task and results please cite the following paper:

Publication

A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

Keywords

Acoustics;Event detection;Hidden Markov models;Speech;Speech processing;Tagging;Acoustic scene classification;audio datasets;pattern recognition;sound event detection

Sound event detection
in real life audio

Coordinators

Annamaria Mesaros, Tampere University of Technology
Toni Heittola, Tampere University of Technology

Content

Description

Audio dataset

Recording and annotation procedure

Challenge setup

Download

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

Cross-validation with development dataset

Submission

Task rules

Evaluation

sed_eval - Evaluation toolbox for Sound Event Detection

Results

Baseline system

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

Python implementation

Matlab implementation

Results for TUT Sound events 2016, development set

Citation

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Abstract

Keywords