Challenge has ended. Full results for this task can be found here

Description

This task evaluates the performance of sound event detection systems in multisource conditions similar to everyday life, where sound sources are rarely heard in isolation. There is no control over the number of overlapping sound events at any given time, in either the training or the testing audio data.

Figure 1: Overview of sound event detection system.

Audio dataset

TUT Sound Events 2017 dataset will be used for task 3. Audio in the dataset is a subset of TUT Acoustic scenes 2017 dataset (used for task 1). The TUT Sound Events 2017 dataset consists of recordings of street acoustic scenes with various levels of traffic and other activity. The scene was selected as representing an environment of interest for detection of sound events related to human activities and hazard situations.

The dataset was collected in Finland by Tampere University of Technology between 06/2015 - 01/2016. The data collection has received funding from the European Research Council.

Recording and annotation procedure

Each recording was captured on a different street. For each recording location, a 3-5 minute long audio recording was captured. The recording equipment consisted of a binaural Soundman OKM II Klassik/studio A3 electret in-ear microphone and a Roland Edirol R-09 wave recorder using a 44.1 kHz sampling rate and 24 bit resolution. For audio material recorded in private places, written consent was obtained from all people involved.

Individual sound events in each recording were annotated by the same person using freely chosen labels for sounds. Nouns were used to characterize the sound source, and verbs to characterize the sound production mechanism, using a noun-verb pair whenever this was possible. The annotator was instructed to annotate all audible sound events, decide the start time and end time of the sounds as he sees fit, and choose event labels freely. This resulted in a large set of raw labels.

Target sound event classes were selected to represent common sounds related to human presence and traffic. Before the target classes were selected, the raw labels were mapped by merging sounds into classes described by their source: for example, “car passing by”, “car engine running”, “car idling”, etc. into “car”; sounds produced by buses and trucks into “large vehicle”; and “children yelling” and “children talking” into “children”. Target sound event classes were then selected based on the frequency of the obtained labels, resulting in the most common sounds for the street acoustic scene, in sufficient numbers for learning acoustic models.
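
The merging of raw labels can be pictured as a lookup from raw labels to source classes. The following is only a minimal sketch built from the examples quoted above; the dictionary, the function name map_raw_label, and the raw label "bus passing by" are illustrative assumptions and not part of the dataset tooling.

```python
# Illustrative raw-label -> class lookup. Only the quoted examples come from
# the text; "bus passing by" is a hypothetical raw label.
RAW_TO_CLASS = {
    "car passing by": "car",
    "car engine running": "car",
    "car idling": "car",
    "children yelling": "children",
    "children talking": "children",
    "bus passing by": "large vehicle",  # hypothetical
}


def map_raw_label(raw_label):
    """Return the merged class for a raw label, or None if it is unmapped."""
    return RAW_TO_CLASS.get(raw_label.strip().lower())


mapped = [map_raw_label(x) for x in ("Car idling", "children talking", "dog barking")]
```

Raw labels without an entry return None, which mirrors the fact that only the selected target classes are kept.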

Selected sound classes for the task are:

brakes squeaking
car
children
large vehicle
people speaking
people walking

Due to the high level of subjectivity inherent to the annotation process, a verification of the reference annotation was done using these mapped classes. Three persons (other than the annotator) listened to each audio segment annotated as belonging to one of these classes, marking agreement about the presence of the indicated sound within the segment. Agreement/disagreement did not take into account the sound event onset and offset, only the presence of the sound event within the annotated segment. Event instances that were confirmed by at least one person were kept, resulting in elimination of about 10% of the original event instances in the development set.

Make sure you are using version 2 of the dataset as this contains the verified annotations.

Download

In case you are using the provided baseline system, there is no need to download the datasets as the system will automatically download needed datasets for you.

Development dataset

TUT Sound events 2017, development dataset v2 (1.3 GB)

Evaluation dataset

TUT Sound events 2017, evaluation dataset (388.2 MB)

Task setup

TUT Sound Events 2017 dataset consists of two subsets: development dataset and evaluation dataset. Partitioning of data into these subsets was done based on the number of examples available for each sound event class, while also taking into account recording location. Because the event instances belonging to different classes are distributed unevenly within the recordings, the partitioning of individual classes can be controlled only to a certain extent; it was done so that the majority of events are in the development set.

A detailed description of the data recording and annotation procedure is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

Development dataset

A cross-validation setup is provided in order to make results reported with this dataset uniform. The setup consists of four folds, and is made so that each recording is used exactly once as test data. While creating the cross-validation folds, the only condition imposed was that the test subset does not contain classes unavailable in training subset. The folds are provided with the dataset.
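
The condition imposed on the folds can be checked with a few lines of code. This is a sketch under assumptions: events are held in memory as (filename, onset, offset, label) tuples, and the function name unseen_test_classes is hypothetical. It does not read the fold files shipped with the dataset.

```python
def unseen_test_classes(train_events, test_events):
    """Return labels that occur in the test events but not in the training events."""
    train_labels = {label for _, _, _, label in train_events}
    test_labels = {label for _, _, _, label in test_events}
    return test_labels - train_labels


# Hypothetical fold contents.
train_events = [("a.wav", 0.0, 1.0, "car"), ("a.wav", 1.0, 2.0, "children")]
test_events = [("b.wav", 0.0, 1.0, "car"), ("b.wav", 0.5, 1.5, "brakes squeaking")]
missing = unseen_test_classes(train_events, test_events)
```

A fold satisfies the condition when the returned set is empty; in this made-up example it is not.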

Evaluation dataset

Evaluation dataset without ground truth will be released one month before the submission deadline. Full ground truth meta data for it will be published after the DCASE 2017 challenge and workshop are concluded.

Submission

Detailed information for the challenge submission can be found on the submission page.

System output should be presented as a single text-file (in CSV format) containing a list of detected sound events from each audio file. Events can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]
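
The format above can be produced with the standard csv module. This is a minimal sketch, not the baseline system's writer: the function name write_events, the example file names, and the two-decimal time formatting are assumptions.

```python
import csv
import io


def write_events(handle, events):
    """Write (filename, onset, offset, label) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for filename, onset, offset, label in events:
        writer.writerow([filename, "%.2f" % onset, "%.2f" % offset, label])


# Hypothetical detections for one file; events may appear in any order.
events = [
    ("audio/street_01.wav", 1.0, 4.5, "car"),
    ("audio/street_01.wav", 3.2, 6.0, "people walking"),
]
buffer = io.StringIO()
write_events(buffer, events)
output_text = buffer.getvalue()
```

To write to disk instead of a string buffer, pass a file opened with open(path, "w", newline="") as the handle.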

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

Task rules

These are the general rules valid for all tasks. The same rules and additional information on technical report and submission requirements can be found here. Task specific rules are highlighted with green.

Participants are not allowed to use external data for system development. Data from another task is considered external data.
Manipulation of provided training and development data is allowed.
The development dataset can be augmented without use of external data (e.g. by mixing data sampled from a probability density function or using techniques such as pitch shifting or time stretching).
Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.

Evaluation

The evaluation metric for this task is segment-based error rate calculated in one-second segments over the entire test set. Additionally, segment-based F-score will be calculated. Ranking of submitted systems will be based on segment-based error rate.
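
As an illustration of how these two metrics are formed, the sketch below computes segment-based error rate and F-score from event lists, following the definitions in the metrics paper cited in this section: per segment, substitutions are min(FN, FP), deletions are max(0, FN - FP), insertions are max(0, FP - FN), and the error rate divides their total by the number of active reference events. The function names and the rule that any overlap marks an event as active in a segment are assumptions; for actual results, use the sed_eval toolbox.

```python
import math


def active_labels(events, n_segments, resolution=1.0):
    """Return one set of active labels per segment for (onset, offset, label) events."""
    segments = [set() for _ in range(n_segments)]
    for onset, offset, label in events:
        first = int(math.floor(onset / resolution))
        last = int(math.ceil(offset / resolution))
        for k in range(first, min(last, n_segments)):
            segments[k].add(label)
    return segments


def segment_counts(reference, estimated, n_segments, resolution=1.0):
    """Accumulate intermediate statistics over all segments."""
    ref = active_labels(reference, n_segments, resolution)
    est = active_labels(estimated, n_segments, resolution)
    c = {"tp": 0, "fp": 0, "fn": 0, "n_ref": 0, "s": 0, "d": 0, "i": 0}
    for r, e in zip(ref, est):
        tp, fp, fn = len(r & e), len(e - r), len(r - e)
        c["tp"] += tp
        c["fp"] += fp
        c["fn"] += fn
        c["n_ref"] += len(r)
        c["s"] += min(fn, fp)        # substitutions
        c["d"] += max(0, fn - fp)    # deletions
        c["i"] += max(0, fp - fn)    # insertions
    return c


def error_rate(c):
    return float(c["s"] + c["d"] + c["i"]) / c["n_ref"]


def f_score(c):
    return 2.0 * c["tp"] / (2.0 * c["tp"] + c["fp"] + c["fn"])


# Hypothetical three-second example.
reference = [(0.0, 2.0, "car"), (1.0, 3.0, "people walking")]
estimated = [(0.0, 1.0, "car"), (1.0, 2.0, "children"), (2.0, 3.0, "people walking")]
counts = segment_counts(reference, estimated, n_segments=3)
```

In this example the second segment contributes one substitution and one deletion, so the error rate is 2 / 4 = 0.5 and the F-score is 4 / 7.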

Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

Toolbox

A short description of metrics can be found here.

The evaluation is done automatically in the baseline system, using the sed_eval toolbox.

sed_eval - Evaluation toolbox for Sound Event Detection

If you use the toolbox directly, set the following parameter for the sed_eval.sound_event.SegmentBasedMetrics evaluator to align it with the baseline system:

one-second segment size: time_resolution=1.0

PLEASE NOTE: The four cross-validation folds are treated as a single experiment, meaning that metrics are calculated only after training and testing all folds, not as an average of the individual folds nor as an average of individual class performance. Intermediate measures (insertions, deletions, substitutions) from all folds are accumulated before calculating metrics. For the reasoning behind this, please refer to the following paper:

Publication

George Forman and Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. SIGKDD Explor. Newsl., 12(1):49–57, November 2010. URL: http://doi.acm.org/10.1145/1882471.1882479, doi:10.1145/1882471.1882479.

Apples-to-apples in Cross-validation Studies: Pitfalls in Classifier Performance Measurement

Abstract

Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages. This leads to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in aggregated results over many folds and datasets, without ever a person looking at the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance for how best to measure classification performance under cross-validation. In particular, there are several divergent methods used for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and in one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods leads to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and researchers focused on high class imbalance.
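
The difference between accumulating intermediate measures over folds and averaging per-fold metrics can be seen in a small numeric sketch. The fold counts below are made up for illustration, and the function names are hypothetical; each dictionary holds substitutions (s), deletions (d), insertions (i) and the number of active reference events (n_ref) for one fold.

```python
def pooled_error_rate(fold_counts):
    """Error rate computed after accumulating counts from all folds."""
    errors = sum(c["s"] + c["d"] + c["i"] for c in fold_counts)
    n_ref = sum(c["n_ref"] for c in fold_counts)
    return float(errors) / n_ref


def mean_fold_error_rate(fold_counts):
    """Average of per-fold error rates (not the procedure used in this task)."""
    rates = [float(c["s"] + c["d"] + c["i"]) / c["n_ref"] for c in fold_counts]
    return sum(rates) / len(rates)


# Two hypothetical folds of very different size.
folds = [
    {"s": 10, "d": 20, "i": 10, "n_ref": 100},  # per-fold ER 0.4
    {"s": 2, "d": 4, "i": 2, "n_ref": 10},      # per-fold ER 0.8
]
pooled = pooled_error_rate(folds)     # 48 / 110
averaged = mean_fold_error_rate(folds)  # (0.4 + 0.8) / 2
```

The small fold weighs as much as the large one in the averaged value, whereas the pooled value weighs every reference event equally.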

Results

Code	Author	Affiliation	Technical Report	Segment-based ER (overall)	Segment-based F1 (overall)
Adavanne_TUT_task3_1	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	Adavanne2017	0.7914	41.7
Adavanne_TUT_task3_2	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	Adavanne2017	0.8061	42.9
Adavanne_TUT_task3_3	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	Adavanne2017	0.8544	41.4
Adavanne_TUT_task3_4	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	Adavanne2017	0.8716	36.2
Chen_UR_task3_1	Yukun Chen	Electrical and Computer Engineering, University of Rochester, NY, US	Chen2017	0.8575	30.9
Dang_NCU_task3_1	Jia-Ching Wang	Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan	Dang2017	0.9529	42.6
Dang_NCU_task3_2	Jia-Ching Wang	Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan	Dang2017	0.9468	42.8
Dang_NCU_task3_3	Jia-Ching Wang	Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan	Dang2017	1.0318	44.2
Dang_NCU_task3_4	Jia-Ching Wang	Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan	Dang2017	1.1028	43.5
Feroze_IST_task3_1	Khizer Feroze	Electrical Engineering, Institute of Space Technology, Islamabad, Pakistan	Feroze2017	1.0942	42.6
Feroze_IST_task3_2	Khizer Feroze	Electrical Engineering, Institute of Space Technology, Islamabad, Pakistan	Feroze2017	1.0312	39.7
DCASE2017 baseline	Toni Heittola	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	Heittola2017	0.9358	42.8
Hou_BUPT_task3_1	Yuanbo Hou	Embedded Artificial Intelligence Laboratory, Beijing University Of Posts And Telecommunications, Beijing, China	Hou2017	1.0446	29.3
Hou_BUPT_task3_2	Yuanbo Hou	Embedded Artificial Intelligence Laboratory, Beijing University Of Posts And Telecommunications, Beijing, China	Hou2017	0.9248	34.1
Kroos_CVSSP_task3_1	Christian Kroos	Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	Kroos2017	0.8979	44.9
Kroos_CVSSP_task3_2	Christian Kroos	Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	Kroos2017	0.8911	41.6
Kroos_CVSSP_task3_3	Christian Kroos	Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	Kroos2017	1.0141	43.8
Lee_SNU_task3_1	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	Jeong2017	0.9260	42.0
Lee_SNU_task3_2	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	Jeong2017	0.8673	27.9
Lee_SNU_task3_3	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	Jeong2017	0.8080	40.8
Lee_SNU_task3_4	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	Jeong2017	0.8985	43.6
Li_SCUT_task3_1	Yanxiong Li	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	Li2017	0.9920	40.3
Li_SCUT_task3_2	Yanxiong Li	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	Li2017	0.9523	41.0
Li_SCUT_task3_3	Yanxiong Li	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	Li2017	1.0043	43.4
Li_SCUT_task3_4	Yanxiong Li	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	Li2017	0.9878	33.9
Lu_THU_task3_1	Rui Lu	Department of Automation, Tsinghua University, Beijing, China	Lu2017	0.8251	39.6
Lu_THU_task3_2	Rui Lu	Department of Automation, Tsinghua University, Beijing, China	Lu2017	0.8306	39.2
Lu_THU_task3_3	Rui Lu	Department of Automation, Tsinghua University, Beijing, China	Lu2017	0.8361	38.0
Lu_THU_task3_4	Rui Lu	Department of Automation, Tsinghua University, Beijing, China	Lu2017	0.8373	38.3
Wang_NTHU_task3_1	Chun-Hao Wang	Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan	Wang2017	0.9749	40.8
Xia_UWA_task3_1	Xianjun Xia	School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, Australia	Xia2017	0.9523	43.5
Xia_UWA_task3_2	Xianjun Xia	School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, Australia	Xia2017	0.9437	41.1
Xia_UWA_task3_3	Xianjun Xia	School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, Australia	Xia2017	0.8740	41.7
Yu_FZU_task3_1^*	Chun-Yan Yu	College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China	Yu2017	1.1963	3.9
Zhou_PKU_task3_1	Jianchao Zhou	Institute of Computer Science & Technology, Peking University, Beijing, China	Zhou2017	0.8526	39.1
Zhou_PKU_task3_2	Jianchao Zhou	Institute of Computer Science & Technology, Peking University, Beijing, China	Zhou2017	0.8526	37.3

Complete results and technical reports can be found here.

Baseline system

A baseline system is provided for the task. It implements a basic approach for sound event detection and gives participants a comparison point while developing their systems. The baseline systems for all tasks share the same code base and implement a similar approach for all tasks. The baseline system will download the needed datasets and produce the results below when run with the default parameters.

The baseline system is based on a multilayer perceptron architecture using log mel-band energies as features. A 5-frame context is used, resulting in a feature vector length of 200. Using these features, a neural network containing two dense layers of 50 hidden units per layer and 20% dropout is trained for 200 epochs for each class. Detection decision is based on the network output layer containing sigmoid units that can be active at the same time. A detailed description is available in the baseline system documentation. The baseline system includes evaluation of results using segment-based error rate and segment-based F-score as metrics.
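
To illustrate how frame-wise sigmoid outputs can become detected events, the sketch below thresholds the activity of one class and merges consecutive active frames into (onset, offset, label) events. It is an assumption-laden illustration and not the baseline's actual post-processing: the function name activity_to_events and the 0.5 threshold are hypothetical, and the 0.02 s hop follows from the 40 ms frames with 50% hop listed under the system parameters. Because each class is handled independently, events of different classes can overlap in time.

```python
def activity_to_events(probabilities, label, hop_seconds=0.02, threshold=0.5):
    """Turn frame-wise probabilities for one class into (onset, offset, label) events."""
    events = []
    onset = None
    for index, p in enumerate(probabilities):
        active = p > threshold
        if active and onset is None:
            onset = index                      # event starts at this frame
        elif not active and onset is not None:
            events.append((onset * hop_seconds, index * hop_seconds, label))
            onset = None
    if onset is not None:                      # event still active at the end
        events.append((onset * hop_seconds, len(probabilities) * hop_seconds, label))
    return events


# Hypothetical sigmoid outputs for the class "car" over five frames.
car_events = activity_to_events([0.1, 0.7, 0.9, 0.2, 0.8], "car")
```

Running the same function once per class and concatenating the results gives an event list in the shape expected by the submission format.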

The baseline system is implemented in Python (versions 2.7 and 3.6). Participants are allowed to build their system on top of the given baseline system. The system has all needed functionality for dataset handling, storing / accessing features and models, and evaluating the results, making adaptation to one's needs rather easy. The baseline system is also a good starting point for entry-level researchers.

Python implementation

DCASE2017 Baseline, repository

Results for TUT Sound events 2017, development dataset

Evaluation setup

4-fold cross-validation, error rate calculated after testing all folds
Python 2.7.13 used

System parameters

Frame size: 40 ms (with 50% hop size)
Feature vector: 40 log mel-band energies in 5 consecutive frames = 200 values
MLP: 2 layers x 50 hidden units, 20% dropout, 200 epochs (using early stopping criteria, monitoring started after 100 epochs, 10 epochs patience), learning rate 0.001, sigmoid output layer
Trained and tested on full audio
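
The arithmetic behind these parameters can be checked with a short sketch. It assumes features are computed at the 44.1 kHz recording rate and that context is formed by concatenating five consecutive frames without padding; the baseline may resample or align context differently, and the names below are hypothetical. The dummy frames stand in for log mel-band energies.

```python
SAMPLE_RATE = 44100      # Hz, taken from the recording setup (assumed unchanged)
FRAME_SECONDS = 0.040    # 40 ms frames
HOP_FRACTION = 0.5       # 50% hop size
N_MELS = 40              # log mel-band energies per frame
CONTEXT = 5              # consecutive frames per feature vector

frame_length = int(round(SAMPLE_RATE * FRAME_SECONDS))   # samples per frame
hop_length = int(round(frame_length * HOP_FRACTION))     # samples per hop


def stack_context(frames, context=CONTEXT):
    """Concatenate each run of `context` consecutive frames into one vector."""
    stacked = []
    for start in range(len(frames) - context + 1):
        vector = []
        for frame in frames[start:start + context]:
            vector.extend(frame)
        stacked.append(vector)
    return stacked


# Dummy features: 8 frames of 40 values each.
frames = [[float(t)] * N_MELS for t in range(8)]
vectors = stack_context(frames)
```

With these settings a frame spans 1764 samples, the hop is 882 samples, and each stacked vector holds 40 x 5 = 200 values.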

Segment-based overall metrics
ER	0.69
F-score	56.7 %

Citation

If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 85–92. November 2017.

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events

When citing the challenge task and results, please cite the following paper:

Publication

A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen. Sound event detection in the DCASE 2017 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019. In press. doi:10.1109/TASLP.2019.2907016.

Sound event detection in the DCASE 2017 Challenge

Abstract

Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency based representations, with only few approaches standing out as radically different. Analysis of the systems behavior reveals that task-specific optimization has a big role in producing good performance; however, often this optimization closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant difference in performance between the top systems and the baseline for all tasks.

Keywords

Event detection;Task analysis;Training;Acoustics;Speech processing;Glass;Hidden Markov models;Sound event detection;weak labels;pattern recognition;jackknife estimates;confidence intervals

Sound event detection in real life audio

Coordinators

	Annamaria Mesaros Tampere University of Technology
	Toni Heittola Tampere University of Technology
