Sound event detection in synthetic audio

Task description

Challenge has ended. Full results for this task can be found here


This task will focus on event detection of overlapping office sounds in synthetic mixtures. By using synthetic mixtures in testing, this task will study the behaviour of tested algorithms when facing different levels of complexity (noise, polyphony), with the added benefit of a very accurate ground truth.

Figure 1: Overview of sound event detection system.

Audio dataset

Training material for this task consists of isolated sound events for each class and synthetic mixtures of the same examples in multiple SNR and event density conditions. The participants are allowed to use any combination of them for training their system. The test data will consist of synthetic mixtures of (source-independent) sound examples at various SNR levels, event density conditions and polyphony.

The 11 provided sound event categories are:

  • Clearing throat
  • Coughing
  • Door knock
  • Door slam
  • Drawer
  • Human laughter
  • Keyboard
  • Keys (put on table)
  • Page turning
  • Phone ringing
  • Speech

There will be 20 samples provided for each sound event class in the training set, plus a development set consisting of 18 minutes of synthetic mixture material in 2-minute audio files. The test set will be provided close to the challenge deadline.

Recording and annotation procedure

Audio is provided by IRCCYN, École Centrale de Nantes. The material was recorded in a calm environment using an AT8035 shotgun microphone connected to a ZOOM H4n recorder. Audio files are sampled at 44.1 kHz and are monophonic. Parameters controlling the synthesized material include the event-to-background ratio (EBR), with values of -6, 0 and 6 dB; the presence or absence of overlapping events (monophonic/polyphonic scene); and the number of events per class. Isolated examples in the training set will be annotated with start time, end time and event label for all sound events, while annotations for the synthetic mixtures are provided automatically by the event sequence synthesizer.
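
To make the role of the EBR parameter concrete, the following is a minimal sketch of scaling an isolated event against a background before mixing. It assumes that EBR is the ratio of event RMS level to background RMS level, expressed in dB; the text above does not spell out the definition, so the synthesizer used for the task may compute it differently. The function names are hypothetical.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def scale_event_to_ebr(event, background, ebr_db):
    """Scale `event` so that 20*log10(rms(event) / rms(background)) == ebr_db.

    Assumption: EBR is an RMS-based level ratio in dB.
    """
    gain = rms(background) / rms(event) * 10 ** (ebr_db / 20.0)
    return [gain * s for s in event]

def mix(background, event, onset_sample):
    """Add `event` to a copy of `background`, starting at `onset_sample`.

    Samples of the event that would fall past the end of the background
    are dropped.
    """
    out = list(background)
    for i, s in enumerate(event):
        if onset_sample + i < len(out):
            out[onset_sample + i] += s
    return out
```

Under this assumption, an EBR of 0 dB gives the event the same RMS level as the background, while -6 dB roughly halves the event amplitude relative to it.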

Challenge setup

Task 2 consists of two public subsets: a training dataset and a development dataset. The training dataset consists of 20 isolated sound segments per event class. The development dataset consists of 18 two-minute recordings in various noise and event density conditions (see the README.txt file in the dataset folder for more details).

Participants are not allowed to use external data for system development. Manipulation of provided data is allowed. Participants are allowed to use any combination of the training and development datasets for training their systems.




Detailed information on the challenge submission can be found on the submission page. Submit a single .txt file per evaluated audio recording. The output file should contain a list of detected events, each specified by its onset, offset and event ID, separated by tabs. Format:

[event onset in seconds (float)][tab][event offset in seconds (float)][tab][event ID (string)]

Example file

1.387392290    3.262403627    pageturn
5.073560090    5.793378684    knock

There should be no additional tab characters anywhere, and there should be no whitespace added after the label, just the newline. The 11 event IDs to be used for the .txt output are: clearthroat, cough, doorslam, drawer, keyboard, keys, knock, laughter, pageturn, phone, speech.
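
The format rules above can be sketched as a small writer and parser. This is an illustration only: the function names are hypothetical, and the nine-decimal precision simply mirrors the example file rather than a stated requirement.

```python
# The 11 event IDs listed above.
EVENT_IDS = {
    "clearthroat", "cough", "doorslam", "drawer", "keyboard", "keys",
    "knock", "laughter", "pageturn", "phone", "speech",
}

def format_events(events):
    """Return the file content for a list of (onset, offset, event_id) tuples.

    Each line is onset<TAB>offset<TAB>event_id followed by a newline only,
    with no extra tabs or trailing whitespace.
    """
    lines = []
    for onset, offset, event_id in events:
        if event_id not in EVENT_IDS:
            raise ValueError(f"unknown event ID: {event_id!r}")
        lines.append(f"{onset:.9f}\t{offset:.9f}\t{event_id}\n")
    return "".join(lines)

def parse_events(text):
    """Parse file content back into (onset, offset, event_id) tuples."""
    events = []
    for line in text.splitlines():
        if not line:
            continue
        onset, offset, event_id = line.split("\t")
        events.append((float(onset), float(offset), event_id))
    return events
```

Writing the returned string to one .txt file per evaluated recording yields files in the layout shown in the example.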

Task rules

  • Only the provided development dataset can be used to train the submitted system.
  • The development dataset can be augmented only by mixing data sampled from a probability density function (pdf); use of real recordings is forbidden.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation dataset in the decision making is also forbidden.
  • Technical report with sufficient description of the system has to be submitted along with the system outputs.

More information on submission process.


Tasks 2 and 3 will use the same metrics. The main metric for the challenge will be the total error rate (ER), evaluated in one-second segments over the entire test set. Submitted systems will be ranked using this metric. The onset-only event-based F-measure (with a 200 ms tolerance) will be used as an additional metric.

Detailed description of metrics used can be found here.
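
As an illustration of the main metric, the sketch below computes a segment-based error rate for a single recording, using the commonly used definition in which, per segment, substitutions are S = min(FN, FP), deletions are D = max(0, FN - FP), insertions are I = max(0, FP - FN), and ER = (S + D + I) / N with N the number of active reference events summed over segments. It is not the official evaluation code (that is the Matlab implementation shipped with the baseline), the function names are hypothetical, and the official metric accumulates counts over the whole test set rather than one file.

```python
import math

def active_labels(events, start, end):
    """Set of event IDs with any activity inside the segment [start, end)."""
    return {label for onset, offset, label in events
            if onset < end and offset > start}

def segment_based_error_rate(reference, estimated, duration, segment_length=1.0):
    """Segment-based ER for one recording.

    `reference` and `estimated` are lists of (onset, offset, event_id).
    """
    n_segments = int(math.ceil(duration / segment_length))
    subs = dels = ins = n_ref = 0
    for k in range(n_segments):
        start, end = k * segment_length, (k + 1) * segment_length
        ref = active_labels(reference, start, end)
        est = active_labels(estimated, start, end)
        fn = len(ref - est)   # reference events missed in this segment
        fp = len(est - ref)   # estimated events with no reference match
        s = min(fn, fp)       # a miss paired with a false alarm is a substitution
        subs += s
        dels += fn - s
        ins += fp - s
        n_ref += len(ref)
    if n_ref == 0:
        raise ValueError("no active reference events; ER is undefined")
    return (subs + dels + ins) / n_ref
```

Because insertions are counted in the numerator but not in N, the error rate can exceed 1.0 for systems that output many false alarms.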

Code for evaluation is available with the baseline system. Use classes:

  • metrics/DCASE2016_EventDetection_SegmentBasedMetrics.m
  • metrics/DCASE2016_EventDetection_EventBasedMetrics.m


Code | Author | Affiliation | Technical report | Segment-based ER (overall) | Segment-based F-score (overall, %)
Choi_task2_1 | Inkyu Choi | Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea | task-sound-event-detection-in-synthetic-audio-results#Choi2016 | 0.3660 | 78.7
DCASE2016 baseline | Emmanouil Benetos | Queen Mary University of London, London, United Kingdom | task-sound-event-detection-in-synthetic-audio-results#Benetos2016 | 0.8933 | 37.0
Giannoulis_task2_1 | Panagiotis Giannoulis | School of ECE, National Technical University of Athens, Athens, Greece; Athena Research and Innovation Center, Maroussi, Greece | task-sound-event-detection-in-synthetic-audio-results#Giannoulis2016 | 0.6774 | 55.8
Gutierrez_task2_1 | J.M. Gutiérrez-Arriola | Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain | task-sound-event-detection-in-synthetic-audio-results#Gutirrez-Arriola2016 | 2.0870 | 25.0
Hayashi_task2_1 | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-sound-event-detection-in-synthetic-audio-results#Hayashi2016 | 0.4082 | 78.1
Hayashi_task2_2 | Tomoki Hayashi | Nagoya University, Nagoya, Japan | task-sound-event-detection-in-synthetic-audio-results#Hayashi2016 | 0.4958 | 76.0
Komatsu_task2_1 | Tatsuya Komatsu | Data Science Research Laboratories, NEC Corporation, Kawasaki, Japan | task-sound-event-detection-in-synthetic-audio-results#Komatsu2016 | 0.3307 | 80.2
Kong_task2_1 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom | task-sound-event-detection-in-synthetic-audio-results#Kong2016 | 3.5464 | 12.6
Phan_task2_1 | Huy Phan | Institute for Signal Processing, University of Luebeck, Luebeck, Germany; Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany | task-sound-event-detection-in-synthetic-audio-results#Phan2016 | 0.5901 | 64.8
Pikrakis_task2_1 | Aggelos Pikrakis | Department of Informatics, University of Piraeus, Piraeus, Greece | task-sound-event-detection-in-synthetic-audio-results#Pikrakis2016 | 0.7499 | 37.4
Vu_task2_1 | Toan H. Vu | Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan | task-sound-event-detection-in-synthetic-audio-results#Vu2016 | 0.8979 | 52.8

Complete results and technical reports can be found on the Task 2 results page.

Baseline system

A baseline system for the task is provided. The system is meant to implement a basic approach for detecting overlapping acoustic events, and provide some comparison point for the participants while developing their systems.

The baseline system is based on supervised non-negative matrix factorization (NMF), and uses a dictionary of spectral templates for performing detection, which is extracted during the training phase. The output of the NMF system is a non-binary matrix denoting event activation, which is post-processed into a list of detected events.
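
The core idea, estimating event activations against a fixed dictionary of spectral templates, can be sketched as follows. This is not the baseline's Matlab code: it is a minimal pure-Python illustration using the common heuristic multiplicative update for the beta-divergence, and the function name, the initialisation with ones and the numerical floor `eps` are all assumptions.

```python
def estimate_activations(V, W, beta=0.6, n_iter=30, eps=1e-12):
    """Estimate non-negative activations H such that V is approximated by W H.

    V: F x T non-negative spectrogram, given as a list of F rows.
    W: F x K fixed dictionary of spectral templates (not updated here).
    Returns H as a K x T list of rows.
    """
    F, T, K = len(V), len(V[0]), len(W[0])
    H = [[1.0] * T for _ in range(K)]
    for _ in range(n_iter):
        # Current model W H, floored so that negative powers stay finite.
        Vh = [[max(eps, sum(W[f][k] * H[k][t] for k in range(K)))
               for t in range(T)] for f in range(F)]
        for k in range(K):
            for t in range(T):
                num = sum(W[f][k] * V[f][t] * Vh[f][t] ** (beta - 2)
                          for f in range(F))
                den = sum(W[f][k] * Vh[f][t] ** (beta - 1)
                          for f in range(F))
                # Multiplicative update keeps H non-negative.
                H[k][t] *= num / max(den, eps)
    return H
```

Each row of the returned matrix is a non-binary activation track for one template; turning those tracks into a list of events is the job of the post-processing stage.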

The baseline system also provides a reference implementation of the evaluation metrics (provided by Toni Heittola). The baseline system is provided for Matlab.

Matlab implementation

Baseline results for development set

System parameters

  • Input: variable-Q transform spectrogram (60 bins/octave, 10ms step)
  • NMF with beta-divergence (30 iterations, beta=0.6, activation threshold=1.0)
  • Postprocessing: 90ms median filter span, up to 5 concurrent events, 60ms minimum event duration
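
The post-processing parameters above can be illustrated for a single activation track. The sketch assumes a particular order of operations (median filtering, then thresholding, then removal of short events), which the list does not specify, and it omits the limit of 5 concurrent events. With a 10 ms step, the 90 ms filter span corresponds to 9 frames and the 60 ms minimum duration to 6 frames. Function names are hypothetical.

```python
from statistics import median

def median_filter(x, span):
    """Sliding median of odd length `span`; windows are truncated at the edges."""
    half = span // 2
    return [median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]

def activations_to_events(activation, label, hop=0.01, threshold=1.0,
                          filter_span=9, min_frames=6):
    """Convert one activation track into (onset, offset, label) tuples in seconds."""
    smoothed = median_filter(activation, filter_span)
    active = [a > threshold for a in smoothed]
    events, start = [], None
    # The appended False closes any event still open at the end of the track.
    for i, is_on in enumerate(active + [False]):
        if is_on and start is None:
            start = i
        elif not is_on and start is not None:
            if i - start >= min_frames:
                events.append((start * hop, i * hop, label))
            start = None
    return events
```

In this sketch, a burst of only a few frames is removed by the median filter before thresholding, so the minimum-duration check mainly acts as a safeguard for events near the edges of the recording.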
Sound event detection results:
Segment-based ER (overall) | Segment-based F-score (overall) | Event-based F-score (onset-only, overall)
0.7859 | 41.6 % | 30.3 %


When citing the challenge task and results, please cite the following papers:


A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.


Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge


Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.




G. Lafay, E. Benetos, and M. Lagrange. Sound event detection in synthetic audio: analysis of the DCASE 2016 task results. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 11–15, Oct 2017. doi:10.1109/WASPAA.2017.8169985.


Sound event detection in synthetic audio: Analysis of the DCASE 2016 task results


As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), the second task focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the 'Event Detection-Office Synthetic' task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio complexity with respect to background noise and polyphony/density, with the added benefit of a very accurate ground truth. This paper presents the task formulation, evaluation metrics, submitted systems, and provides a statistical analysis of the results achieved, with respect to various aspects of the evaluation dataset.

