The challenge has ended. Full results for this task can be found here.
Description
This task will focus on the detection of rare sound events in artificially created mixtures. This use of synthetic data allows creating mixtures of everyday background audio and sound events of interest at different event-to-background ratios, providing a larger variety of training conditions than would be available in real recordings.
Audio dataset
TUT Rare Sound Events 2017 consists of isolated sound events for each target class and recordings of everyday acoustic scenes that serve as background. Source code for creating mixtures at different event-to-background ratios will also be provided, along with a set of ready-generated mixtures. Participants are allowed to use any combination of the provided data for training their systems (isolated sounds, provided mixtures, or any other created mixtures). The evaluation set will consist of similar mixtures.
The target sound event categories are:
- Baby crying
- Glass breaking
- Gunshot
Target sound event categories will be treated independently; it is therefore allowed to use completely different detection approaches, tailored to the characteristics of each sound category.
The background audio material consists of recordings from 15 different audio scenes, and is part of TUT Acoustic Scenes 2016 dataset.
Recording and annotation procedure
The background audio is part of the TUT Acoustic Scenes 2016 dataset, and the isolated sound examples were collected from Freesound. Sounds from Freesound were selected based on the exact label, keeping only examples with a sampling frequency of 44.1 kHz or higher. Isolated examples in the training set were annotated with start and end times using SVM-based semi-supervised segmentation, followed by manual refinement.
Parameters controlling the synthesized material include the event-to-background ratio (EBR), the event occurrence probability (each mixture contains either zero or one target event), and the randomly selected temporal position of the target sound. Isolated sounds and background samples are selected at random. Annotations for the synthetic mixtures are produced automatically by the mixture synthesizer and contain only the temporal position of the target sound event, as the task is independent of the background acoustic scene.
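For illustration, below is a minimal sketch of how a single event can be mixed into a background recording at a given EBR. The function name and the exact EBR definition and scaling are assumptions for this sketch, not taken from the official mixture synthesizer.

```python
import numpy as np

def mix_event_into_background(background, event, ebr_db, rng=np.random):
    """Mix one isolated event into a background signal at a given
    event-to-background ratio (EBR, in dB) and a random temporal position.

    Illustrative sketch only; the official synthesizer may define and
    apply the EBR differently.
    """
    assert len(event) <= len(background), "event must fit inside the background"

    # RMS levels of the event and the background
    rms_event = np.sqrt(np.mean(event ** 2))
    rms_background = np.sqrt(np.mean(background ** 2))

    # Scale the event so that 20*log10(rms_event_scaled / rms_background) = ebr_db
    gain = (rms_background / (rms_event + 1e-12)) * 10.0 ** (ebr_db / 20.0)
    event_scaled = event * gain

    # Random onset such that the whole event fits inside the background
    onset = rng.randint(0, len(background) - len(event_scaled) + 1)

    mixture = background.copy()
    mixture[onset:onset + len(event_scaled)] += event_scaled
    return mixture, onset
```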
Download
If you are using the provided baseline system, there is no need to download the dataset separately, as the system will automatically download the needed datasets for you.
**Development dataset**
A minor bug was found in the development dataset after the initial release. A few background audio signals were found to also contain target sound events (babycry), affecting the evaluation results minimally. Make sure you update your dataset using the provided script. Detailed instructions on how to use the script can be found here.
Below is the most recent, updated version of the development dataset. If you downloaded it after June 30th 2017, no action is required on your side.
**Evaluation dataset**
Task setup
The dataset consists of two subsets: a development dataset and an evaluation dataset. The development dataset contains the original background and isolated sound event samples, and a set of generated mixture audio samples. Participants can generate any number of additional mixtures for use in system training, using the provided dataset.
The development dataset contains approximately 9 hours of background audio, around 100 isolated sound examples per target class, and 500 mixture audio examples per target class, with each mixture containing either zero or one target sound event. In addition, 500 mixtures per target class are provided for evaluating the system during development; these mixtures are created using different background audio and sound examples than the ones available in the development set. The evaluation set will be produced the same way, using original audio material not distributed in the development set, and will be provided close to the challenge deadline.
A detailed description of the dataset creation is available in:
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 85–92. November 2017.
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
Keywords
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events
Development dataset
A set of 500 mixtures per target class is designated and clearly marked as test material during system development. Participants are required to report the performance of the system on this subset for comparison.
Evaluation dataset
The evaluation dataset without ground truth will be released one month before the submission deadline. Full ground truth metadata for it will be published after the DCASE 2017 challenge and workshop are concluded.
Submission
Detailed information on the challenge submission can be found on the submission page.
System output should be presented as a single text file (in CSV format) containing a list of detected sound events for each audio file. Events can be in any order. The format when an event is detected in a mixture is:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]
and if no event is detected:
[filename (string)]
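For illustration, a minimal sketch of writing system output in this format is given below; the file names and detections are hypothetical and the output file name is arbitrary.

```python
# Hypothetical detections: (filename, onset, offset, label), or
# (filename, None, None, None) when no event was detected in the mixture.
detections = [
    ('mixture_devtest_babycry_001.wav', 2.35, 4.10, 'babycry'),
    ('mixture_devtest_gunshot_007.wav', None, None, None),
]

with open('task2_submission.txt', 'w') as output_file:
    for filename, onset, offset, label in detections:
        if label is None:
            # No event detected: write only the filename
            output_file.write('{}\n'.format(filename))
        else:
            # Detected event: filename, onset, offset, label separated by tabs
            output_file.write('{}\t{:.3f}\t{:.3f}\t{}\n'.format(
                filename, onset, offset, label))
```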
Multiple submissions are allowed (maximum 4 per participant). In the case of multiple submissions, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).
Task rules
These are the general rules valid for all tasks. The same rules, together with additional information on technical report and submission requirements, can be found here. Task-specific rules are marked as such below.
- Participants are not allowed to use external data for system development. Data from another task is considered external data.
- Task-specific: it is allowed to use different approaches for the specific target classes, i.e. the target class can be used as prior knowledge.
- Manipulation of the provided training and development data is allowed.
- Task-specific: participants are allowed to use any combination of the training and development datasets for training their systems; creating a large number of mixtures with various combinations of target sounds and background audio is encouraged.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
Evaluation
The evaluation metric for the task is event-based error rate calculated using onset-only condition with a collar of 500 ms. Additionally, event-based F-score with a 500 ms onset-only collar will be calculated. Ranking of submitted systems will be based on the event-based error rate.
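For reference, the event-based error rate follows the definition in the paper cited below:

$$ \mathrm{ER} = \frac{S + D + I}{N}, $$

where $N$ is the number of reference events, and $S$, $D$ and $I$ are the numbers of substitutions, deletions and insertions, counted after matching system output events to the reference within the 500 ms onset collar.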
Detailed information on metrics calculation is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.
Metrics for Polyphonic Sound Event Detection
Abstract
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
A short description of metrics can be found here.
The evaluation is done automatically in the baseline system, using the sed_eval toolbox. If you use the toolbox directly, use the following parameters for the sed_eval.EventBasedMetrics evaluator to align it with the baseline system (see the sketch below):
- Collar of 500 ms, or 50% of the event length: t_collar=0.5, percentage_of_length=0.5
- Evaluate onsets only: evaluate_onset=True, evaluate_offset=False
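A minimal sketch of such direct use is given below, assuming sed_eval is installed, that reference and estimated event lists are available as tab-separated files, and that the event label strings match those used in the dataset (only 'babycry' is confirmed on this page; the others are assumptions).

```python
import sed_eval

# Load reference annotations and system output (tab-separated event lists)
reference_event_list = sed_eval.io.load_event_list(filename='reference.txt')
estimated_event_list = sed_eval.io.load_event_list(filename='estimated.txt')

# Configure the evaluator to match the baseline system settings
event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=['babycry', 'glassbreak', 'gunshot'],  # assumed label strings
    t_collar=0.5,               # 500 ms collar ...
    percentage_of_length=0.5,   # ... or 50% of the event length
    evaluate_onset=True,
    evaluate_offset=False       # onset-only evaluation
)

event_based_metrics.evaluate(
    reference_event_list=reference_event_list,
    estimated_event_list=estimated_event_list
)

# Print the full metrics report (error rate, F-score, per-class results)
print(event_based_metrics)
```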
Results
Code | Author | Affiliation | Technical Report | ER (event-based, overall) | F1 (event-based, overall)
---|---|---|---|---|---
Cakir_TUT_task2_1 | Emre Cakir | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-rare-sound-event-detection-results#Cakir2017 | 0.1813 | 91.0 | |
Cakir_TUT_task2_2 | Emre Cakir | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-rare-sound-event-detection-results#Cakir2017 | 0.1733 | 91.0 | |
Cakir_TUT_task2_3 | Emre Cakir | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-rare-sound-event-detection-results#Cakir2017 | 0.2920 | 86.0 | |
Cakir_TUT_task2_4 | Emre Cakir | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-rare-sound-event-detection-results#Cakir2017 | 0.1867 | 90.3 | |
Dang_NCU_task2_1 | Jia-Ching Wang | Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-rare-sound-event-detection-results#Dang2017 | 0.4787 | 73.3 | |
Dang_NCU_task2_2 | Jia-Ching Wang | Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-rare-sound-event-detection-results#Dang2017 | 0.4107 | 79.1 | |
Dang_NCU_task2_3 | Jia-Ching Wang | Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-rare-sound-event-detection-results#Dang2017 | 0.4453 | 76.1 | |
Dang_NCU_task2_4 | Jia-Ching Wang | Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-rare-sound-event-detection-results#Dang2017 | 0.4253 | 78.6 | |
Ghaffarzadegan_BOSCH_task2_1 | Shabnam Ghaffarzadegan | Human Machine Interaction, Robert Bosch Research and Technology Center, Palo Alto, USA | task-rare-sound-event-detection-results#Ghaffarzadegan2017 | 0.5000 | 74.2 | |
Ghaffarzadegan_BOSCH_task2_2 | Shabnam Ghaffarzadegan | Robert Bosch Research and Technology Center, Palo Alto, USA | task-rare-sound-event-detection-results#Ghaffarzadegan2017 | 0.5493 | 71.8 | |
Ghaffarzadegan_BOSCH_task2_3 | Shabnam Ghaffarzadegan | Robert Bosch Research and Technology Center, Palo Alto, USA | task-rare-sound-event-detection-results#Ghaffarzadegan2017 | 0.5560 | 70.8 | |
DCASE2017 baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-rare-sound-event-detection-results#Heittola2017 | 0.6373 | 64.1 | |
Jeon_GIST_task2_1 | Hong Kook Kim | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea | task-rare-sound-event-detection-results#Jeon2017 | 0.6773 | 65.8 | |
Li_SCUT_task2_1 | Yanxiong Li | School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | task-rare-sound-event-detection-results#Li2017 | 0.6333 | 65.5 | |
Li_SCUT_task2_2 | Yanxiong Li | School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | task-rare-sound-event-detection-results#Li2017 | 0.7373 | 57.4 | |
Li_SCUT_task2_3 | Yanxiong Li | School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | task-rare-sound-event-detection-results#Li2017 | 0.6213 | 66.6 | |
Li_SCUT_task2_4 | Yanxiong Li | School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | task-rare-sound-event-detection-results#Li2017 | 0.6000 | 69.8 | |
Lim_COCAI_task2_1 | Yoonchang Han | Cochlear.ai, Seoul, Korea | task-rare-sound-event-detection-results#Lim2017 | 0.1307 | 93.1 | |
Lim_COCAI_task2_2 | Yoonchang Han | Cochlear.ai, Seoul, Korea | task-rare-sound-event-detection-results#Lim2017 | 0.1347 | 93.0 | |
Lim_COCAI_task2_3 | Yoonchang Han | Cochlear.ai, Seoul, Korea | task-rare-sound-event-detection-results#Lim2017 | 0.1520 | 92.2 | |
Lim_COCAI_task2_4 | Yoonchang Han | Cochlear.ai, Seoul, Korea | task-rare-sound-event-detection-results#Lim2017 | 0.1720 | 91.4 | |
Liping_CQU_task2_1 | Yang Liping | Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China | task-rare-sound-event-detection-results#Kaiwu2017 | 0.3400 | 79.5 | |
Liping_CQU_task2_2 | Yang Liping | Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China | task-rare-sound-event-detection-results#Kaiwu2017 | 0.3293 | 81.2 | |
Liping_CQU_task2_3 | Yang Liping | Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, ChongQing University, ChongQing, China | task-rare-sound-event-detection-results#Kaiwu2017 | 0.3173 | 82.0 | |
Phan_UniLuebeck_task2_1 | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany | task-rare-sound-event-detection-results#Phan2017 | 0.2773 | 85.3 | |
Ravichandran_BOSCH_task2_4 | Anravich Ravichandran | Computer Science, University of California, San Diego, USA | task-rare-sound-event-detection-results#Ravichandran2017 | 0.4267 | 78.6 | |
Vesperini_UNIVPM_task2_1 | Fabio Vesperini | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | task-rare-sound-event-detection-results#Vesperini2017 | 0.3267 | 83.9 | |
Vesperini_UNIVPM_task2_2 | Fabio Vesperini | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | task-rare-sound-event-detection-results#Vesperini2017 | 0.3440 | 82.8 | |
Vesperini_UNIVPM_task2_3 | Fabio Vesperini | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | task-rare-sound-event-detection-results#Vesperini2017 | 0.3267 | 83.2 | |
Vesperini_UNIVPM_task2_4 | Fabio Vesperini | Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy | task-rare-sound-event-detection-results#Vesperini2017 | 0.3267 | 83.2 | |
Wang_BUPT_task2_1 | Jun Wang | Embedded Artificial Intelligence Laboratory, Beijing University of Posts and Telecommunications, Beijing, China | task-rare-sound-event-detection-results#Wang2017 | 0.4320 | 73.4 | |
Wang_THU_task2_1 | Jianfei Wang | Speech and Audio Technology Laboratory, Tsinghua University, Beijing, China | task-rare-sound-event-detection-results#Wang2017a | 0.4973 | 72.6 | |
Zhou_XJTU_task2_1 | Zuren Feng | Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, China | task-rare-sound-event-detection-results#Zhou2017 | 0.3133 | 84.2 |
Complete results and technical reports can be found here.
Baseline system
A baseline system is provided for the task. The system implements a basic approach for rare sound event detection and provides a comparison point for participants while developing their systems. The baseline systems for all tasks share the same code base, implementing a fairly similar approach for each task. The baseline system will download the needed datasets and produces the results below when run with the default parameters.
The baseline system is based on a multilayer perceptron architecture using log mel-band energies as features. A 5-frame context is used, resulting in a feature vector of length 200. Using these features, a neural network with two dense layers of 50 hidden units per layer and 20% dropout is trained for 200 epochs for each class. The detection decision is based on a single sigmoid output neuron, indicating whether the target class is active or inactive. A detailed description is available in the baseline system documentation. The baseline system includes evaluation of results using event-based error rate and event-based F-score as metrics.
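A minimal Keras-style sketch of a comparable per-class network is given below; the layer sizes, dropout, and learning rate follow the description and parameter listing on this page, while the activation functions, optimizer, and loss are assumptions rather than details taken from the baseline code.

```python
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

def build_baseline_like_mlp(input_dim=200):
    """One binary detector per target class: two dense layers of 50 units
    with 20% dropout and a single sigmoid output (active / inactive).
    Activations, optimizer and loss are assumptions."""
    model = Sequential([
        Dense(50, activation='relu', input_shape=(input_dim,)),
        Dropout(0.2),
        Dense(50, activation='relu'),
        Dropout(0.2),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```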
The baseline system is implemented in Python (versions 2.7 and 3.6). Participants are allowed to build their systems on top of the given baseline system. The system provides all the needed functionality for dataset handling, storing and accessing features and models, and evaluating results, making adaptation to one's needs rather easy. The baseline system is also a good starting point for entry-level researchers.
Python implementation
Results for TUT Rare Sound Events 2017, development dataset
Evaluation setup
- three separate binary classifiers (one per class), average error rate over the three classes
- trained and tested using development dataset mixtures (provided train and test sets)
- Python 2.7.13 used
System parameters
- Frame size: 40 ms (with 50% hop size)
- Feature vector: 40 log mel-band energies in 5 consecutive frames = 200 values (see the feature-extraction sketch below)
- MLP: 2 layers x 50 hidden units, 20% dropout, 200 epochs (with early stopping: monitoring started after epoch 100, patience of 10 epochs), learning rate 0.001, one sigmoid unit as output
- Trained only on mixture audio
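A feature-extraction sketch matching the parameters listed above is shown below (librosa-based; the exact windowing, normalization, and mel filterbank settings of the baseline are assumptions).

```python
import numpy as np
import librosa

def extract_features(audio_path, sr=44100, n_mels=40, context=5):
    """40 log mel-band energies per 40 ms frame (50% hop), stacked over
    5 consecutive frames -> 200-dimensional feature vectors."""
    y, sr = librosa.load(audio_path, sr=sr)

    n_fft = int(0.040 * sr)   # 40 ms analysis frame
    hop_length = n_fft // 2   # 50% hop size

    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)   # shape: (n_mels, n_frames)

    # Stack a sliding context of `context` consecutive frames into one vector
    feature_vectors = [log_mel[:, i:i + context].flatten()
                       for i in range(log_mel.shape[1] - context + 1)]
    return np.array(feature_vectors)     # shape: (n_windows, n_mels * context)
```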
Event-based overall metrics (onset only, t_collar=500 ms)

Class | ER | F-score
---|---|---
Baby cry | 0.67 | 72.0 % |
Glass break | 0.22 | 88.5 % |
Gun shot | 0.69 | 57.4 % |
Average | 0.53 | 72.7 % |
Citation
If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following paper:
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 85–92. November 2017.
When citing the challenge task and results, please cite the following paper:
A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen. Sound event detection in the DCASE 2017 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019. In press. doi:10.1109/TASLP.2019.2907016.
Sound event detection in the DCASE 2017 Challenge
Abstract
Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency based representations, with only few approaches standing out as radically different. Analysis of the systems behavior reveals that task-specific optimization has a big role in producing good performance; however, often this optimization closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant difference in performance between the top systems and the baseline for all tasks.
Keywords
Event detection;Task analysis;Training;Acoustics;Speech processing;Glass;Hidden Markov models;Sound event detection;weak labels;pattern recognition;jackknife estimates;confidence intervals