Sound event detection


Task description

Challenge has ended. Full results for this task can be found here
This page collects information from the original DCASE2013 Challenge website to document DCASE challenge tasks in a uniform way.

Description

The event detection challenge will address the problem of identifying individual sound events that are prominent in an acoustic scene. Two distinct experiments will take place: one for simple acoustic scenes without overlapping sounds, and the other for complex scenes in a polyphonic scenario. Three datasets will be used for the task.

Figure 1: Overview of sound event detection system.

Task setup

Subtask OL - Office live

The first dataset for event detection will consist of 3 subsets (for development, training, and testing). The training set will contain instantiations of individual events for every class. The development and testing datasets, denoted as office live (OL), will consist of 1 min recordings of everyday audio events in a number of office environments.

The test data consists of 11 stereo recordings (WAV, 44.1 kHz, 24-bit), lasting between 1 and 3 minutes, of scripted sequences containing non-overlapping acoustic events in an office environment. Recordings were made using a Soundfield microphone system, model SPS422B. The test dataset contains events from 16 different classes, which are as follows:

  • alarm (short alert (beep) sound)
  • clearthroat (clearing throat)
  • cough
  • doorslam (door slam)
  • drawer
  • keyboard (keyboard clicks)
  • keys (keys put on table)
  • knock (door knock)
  • laughter
  • mouse (mouse click)
  • pageturn (page turning)
  • pendrop (pen, pencil, or marker touching table surfaces)
  • phone
  • printer
  • speech
  • switch

Submitted event detection systems can be tuned and trained using the publicly released training and development datasets.

Datasets

Isolated events:

Event sequences:

Event sequences:


Subtask OS - Office synthetic

The second dataset will contain artificially sequenced sounds provided by the Analysis-Synthesis team of IRCAM, termed Office Synthetic (OS). The training set will be identical to the one for the first dataset. The development and testing sets will consist of artificial scenes built by sequencing recordings of individual events (different recordings from the ones used for the training dataset) and background recordings provided by C4DM.

The test data consists of mono recordings (WAV, 44.1 kHz) of sequences created by artificially concatenating overlapping acoustic events in an office environment. Original recordings of isolated acoustic events were made using a Soundfield microphone system, model SPS422B. The dataset contains various SNRs of events over background noise (+6, 0, and -6 dB) and different levels of event "density" (low, medium, and high). The distribution of events in the scene is random, following high-level directives that specify the desired event density. The average SNR of events over the background noise is also specified upon synthesis and, unlike in the natural scenes, is the same for all event types. The synthesized scenes are mixed down to mono in order to avoid spatialization inconsistencies between successive occurrences of the same event. The test dataset contains events from 16 different classes, which are as follows:

  • alarm (short alert (beep) sound)
  • clearthroat (clearing throat)
  • cough
  • doorslam (door slam)
  • drawer
  • keyboard (keyboard clicks)
  • keys (keys put on table)
  • knock (door knock)
  • laughter
  • mouse (mouse click)
  • pageturn (page turning)
  • pendrop (pen, pencil, or marker touching table surfaces)
  • phone
  • printer
  • speech
  • switch

Submitted event detection systems can be tuned and trained using the publicly released training and development datasets.
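
To give an intuition of the mixing described above, the sketch below scales an isolated event to a chosen SNR over the background and adds it at a given onset. This is not the IRCAM synthesis tool; the signal arrays, onset position, and target_snr_db are illustrative placeholders.

import numpy as np

def mix_event(background, event, onset_sample, target_snr_db):
    # Illustrative sketch only, not the actual OS synthesis code.
    # Scale the event so that its power relative to the overlapped background
    # segment matches the requested SNR, then add it at the given onset.
    out = background.astype(float).copy()
    segment = out[onset_sample:onset_sample + len(event)]
    event = np.asarray(event, dtype=float)[:len(segment)]  # clip if it overruns the scene
    bg_power = np.mean(segment ** 2) + 1e-12
    ev_power = np.mean(event ** 2) + 1e-12
    gain = np.sqrt(bg_power / ev_power * 10.0 ** (target_snr_db / 10.0))
    out[onset_sample:onset_sample + len(event)] += gain * event
    return out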

Datasets

Isolated events:

Event sequences:

Event sequences:


Submission

The challenge participants submit an executable for both subtasks.

Submission format

Command line calling format

Executables must accept command-line parameters which specify:

  • A path to an input .wav file.
  • A path to an output .txt file.

For example:

>./eventdetection /path/to/input.wav /path/to/output.txt

If parameters need to be set for the program, this can be done, provided the manner in which the parameters are set is well documented by the submitter. If, for example, your program needs a specified frame rate, set by a -fr flag, an example calling format could be of the form:

>./eventdetection -fr 1024 /path/to/input.wav /path/to/output.txt

where the calling format, and the desired parameter values to be used, are specified upon submission and in the corresponding README file bundled with the algorithm. Programs can use their working directory if they need to keep temporary cache files or internal debugging info.
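
As a purely illustrative sketch (not part of the official challenge materials), a Python submission could accept the required paths and an optional algorithm-specific flag as follows; the -fr flag and the detect_events helper are hypothetical placeholders.

import argparse

def detect_events(wav_path, frame_size):
    # Placeholder for the submitted detection algorithm; should return a list
    # of (onset, offset, label) tuples with times in seconds.
    return []

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Sound event detection submission")
    parser.add_argument("input_wav", help="path to the input .wav file")
    parser.add_argument("output_txt", help="path to the output .txt file")
    parser.add_argument("-fr", type=int, default=1024,
                        help="example algorithm-specific parameter (document any such flag in the README)")
    args = parser.parse_args()
    events = detect_events(args.input_wav, args.fr)
    # Writing the output file in the required format is shown in the next section.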

Output file

The output ASCII file should contain the onset, offset and event ID separated by tabs, with one event per line, ordered by onset time (onset/offset times in seconds):

<onset1>\t<offset1>\t<EventID1>
<onset2>\t<offset2>\t<EventID2>
...

E.g.

1.387392290 3.262403627 pageturn
5.073560090 5.793378684 knock
...

There should be no additional tab characters anywhere, and there should be no whitespace added after the label, just the newline.
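
A minimal sketch of producing this output from Python, assuming the detections are held as (onset, offset, label) tuples in seconds (the values below are the example ones from above):

detections = [(5.073560090, 5.793378684, "knock"),
              (1.387392290, 3.262403627, "pageturn")]

# Sort by onset time and write one tab-separated line per event,
# with nothing after the label except the newline.
with open("/path/to/output.txt", "w") as f:
    for onset, offset, label in sorted(detections):
        f.write("{:.9f}\t{:.9f}\t{}\n".format(onset, offset, label))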

Packaging submissions

For Python/R/C/C++/etc submissions, please ensure that the submission can run on the Linux disk image we provide, WITHOUT any additional configuration. You may have modified the virtual machine after downloading it, but we will not be using your modified disk image - we will be running your submission on the standard disk image. This means:

  • if you have used additional Python/R script libraries, they must be included in your submission bundle, and your script should be able to use them without installing them systemwide.
  • if you have used any additional C/C++ libraries, they must be statically-linked to your executable.

For Matlab submissions, ensure that the submission can run with the toolboxes and system that the organisers have specified. If you need any particular toolboxes or configuration, please contact the organisers as soon as you can. Please aim to make Matlab submissions compatible across operating systems (the usual problems concern file/path separators). All Matlab submissions should be written in the form of a function, e.g. eventdetection(input, output), so that the script can easily be called from the command line. Please provide some console output, which gives the challenge team a sanity check when running the code; this can be as simple as printing a line at each stage of your algorithm.

All submissions should include a README file with the following information:

  • Command line calling format for all executables including examples
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Approximately how much scratch disk space the submission will need to store any feature/cache files
  • Any special notice regarding running your algorithm

Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.

Time and Hardware limits

Due to the potentially high resource requirements across all participants, hard limits on the runtime of submissions will be imposed: each submission is limited to 48 hours of runtime.

Evaluation

Participating algorithms will be evaluated using frame-based, event-based, and class-wise event-based metrics. The computed metrics will consist of the AEER, precision, recall, and F-measure for the frame-based, event-based, and class-wise event-based evaluations. For the event-based evaluations, both onset-based and onset-offset-based metrics will be computed. In addition, computation times of each participating algorithm will be measured.

Frame-based evaluation uses a 10 ms step, and metrics are averaged over the duration of the recordings. The main metric is the acoustic event error rate (AEER), also used in the CLEAR evaluations:

\begin{equation*} AEER=\frac{D+I+S} {N} \end{equation*}

where \(N\) is the number of events to detect in the current frame, \(D\) is the number of deletions (missed events), \(I\) is the number of insertions (extra events), and \(S\) is the number of event substitutions, defined as \(S=\min(D,I)\).
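
As a rough illustration of this metric (the Matlab implementation linked below is the authoritative version), frame-based AEER could be accumulated over 10 ms frames as follows; ref_frames and det_frames are assumed to be per-frame sets of active event labels.

def frame_based_aeer(ref_frames, det_frames):
    # ref_frames / det_frames: one set of event labels per 10 ms frame.
    # A matched deletion/insertion pair is counted once as a substitution,
    # which is one common reading of S = min(D, I).
    D = I = S = N = 0
    for ref, det in zip(ref_frames, det_frames):
        N += len(ref)
        missed = len(ref - det)    # reference events not detected in this frame
        extra = len(det - ref)     # detected events not in the reference
        subs = min(missed, extra)
        D += missed - subs
        I += extra - subs
        S += subs
    return (D + I + S) / max(N, 1)  # guard against recordings with no events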

For a more detailed description of the metrics used, see:

Publication

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, Oct 2015. doi:10.1109/TMM.2015.2428998.


Detection and Classification of Acoustic Scenes and Events

Abstract

For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.

Keywords

acoustic signal processing;knowledge based systems;speech recognition;acoustic scenes detection;acoustic scenes classification;intelligent systems;audio modality;speech recognition;music;IEEE Audio and Acoustic Signal Processing Technical Committee;DCASE;Event detection;Speech;Speech recognition;Music;Microphones;Licenses;Audio databases;event detection;machine intelligence;pattern recognition



Matlab implementation of metrics:


Results

Subtask OL

Code | Author | Affiliation | Technical report | Frame-based AEER | Frame-based F1 (%)
DCASE2013 baseline | Dimitrios Giannoulis | Centre for Digital Music, Queen Mary University of London, London, UK | task-sound-event-detection-results-ol#Giannoulis2013 | 2.5900 | 10.7
CPS | Sameer Chauhan | Electrical Engineering, Cooper Union for the Advancement of Science and Art, New York, USA | task-sound-event-detection-results-ol#Chauhan2013 | 2.1160 | 3.8
DHV | Aleksandr Diment | Tampere University of Technology, Tampere, Finland | task-sound-event-detection-results-ol#Diment2013 | 3.1280 | 26.0
GVV | Jort F Gemmeke | ESAT-PSI, KU Leuven, Heverlee, Belgium | task-sound-event-detection-results-ol#Gemmeke2013 | 1.0840 | 31.9
NR2 | Waldo Nogueira | Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain | task-sound-event-detection-results-ol#Nogueira2013 | 1.8850 | 34.7
NVM_1 | Maria E. Niessen | AGT International, Darmstadt, Germany | task-sound-event-detection-results-ol#Niessen2013 | 1.1150 | 40.9
NVM_2 | Maria E. Niessen | AGT International, Darmstadt, Germany | task-sound-event-detection-results-ol#Niessen2013 | 1.1020 | 42.8
NVM_3 | Maria E. Niessen | AGT International, Darmstadt, Germany | task-sound-event-detection-results-ol#Niessen2013 | 1.2120 | 45.5
NVM_4 | Maria E. Niessen | AGT International, Darmstadt, Germany | task-sound-event-detection-results-ol#Niessen2013 | 1.3600 | 42.9
SCS_1 | Jens Schröder | Project Group Hearing, Speech and Audio Technology, Fraunhofer IDMT, Oldenburg, Germany | task-sound-event-detection-results-ol#Schroeder2013 | 1.1670 | 53.0
SCS_2 | Jens Schröder | Project Group Hearing, Speech and Audio Technology, Fraunhofer IDMT, Oldenburg, Germany | task-sound-event-detection-results-ol#Schroeder2013 | 1.0160 | 61.5
VVK | Lode Vuegen | ESAT-PSI, KU Leuven, Heverlee, Belgium; Future Health Department, iMinds, Heverlee, Belgium; MOBILAB, TM Kempen, Geel, Belgium | task-sound-event-detection-results-ol#Vuegen2013 | 1.0010 | 43.4


Complete results and technical reports can be found on the Subtask OL results page.

Subtask OS

Code | Author | Affiliation | Technical report | Frame-based AEER | Frame-based F1 (%)
DCASE2013 baseline | Dimitrios Giannoulis | Centre for Digital Music, Queen Mary University of London, London, UK | task-sound-event-detection-results-os#Giannoulis2013 | 2.8040 | 12.8
DHV | Aleksandr Diment | Tampere University of Technology, Tampere, Finland | task-sound-event-detection-results-os#Diment2013 | 7.9800 | 18.7
GVV | Jort F Gemmeke | ESAT-PSI, KU Leuven, Heverlee, Belgium | task-sound-event-detection-results-os#Gemmeke2013 | 1.3180 | 21.3
VVK | Lode Vuegen | ESAT-PSI, KU Leuven, Heverlee, Belgium; Future Health Department, iMinds, Heverlee, Belgium; MOBILAB, TM Kempen, Geel, Belgium | task-sound-event-detection-results-os#Vuegen2013 | 1.8880 | 13.5


Complete results and technical reports can be found on the Subtask OS results page.

Baseline system

Audio Event Detection baseline system using NMF (MATLAB).

This is an event detection system that can be trained on a set of labelled audio files containing isolated sound events of various classes; it can then detect and classify activity related to these events in audio files containing a series of different events and background noise. It is designed with two main aims:

  1. to provide a baseline against which to test more advanced systems;
  2. to provide a simple code example of a system which people are free to build on.

It follows a training/testing framework. A dictionary of spectral basis vectors is learned with NMF on the training data. This dictionary is subsequently kept fixed and used to obtain an activation matrix for unlabelled audio files from the development set via NMF decomposition. The activation vectors belonging to each class are summed and thresholded to give the activity of the different classes.
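
The sketch below gives a rough idea of this pipeline in Python, assuming magnitude spectrograms (frequency x time) have already been computed; it is not the released Matlab baseline, and the rank, threshold, and input variables are illustrative placeholders.

import numpy as np

def nmf(V, rank, n_iter=200, W=None):
    # Basic multiplicative-update NMF, V ~ W @ H; if W is given it is kept fixed.
    rng = np.random.default_rng(0)
    fixed_W = W is not None
    if W is None:
        W = rng.random((V.shape[0], rank)) + 1e-6
    H = rng.random((W.shape[1], V.shape[1])) + 1e-6
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        if not fixed_W:
            W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Training: learn a small dictionary per class from its isolated events and
# stack the per-class dictionaries into one fixed dictionary.
components_per_class = 5
dictionaries = [nmf(spec, components_per_class)[0]
                for spec in training_spectrograms_per_class]   # list of (freq x time) arrays
W_fixed = np.hstack(dictionaries)

# Testing: keep the dictionary fixed, estimate activations for a recording,
# then sum the activations of each class and threshold them.
_, H = nmf(test_spectrogram, W_fixed.shape[1], W=W_fixed)
per_class = H.reshape(len(dictionaries), components_per_class, -1).sum(axis=1)
activity = per_class > threshold    # boolean (class x time) activity matrix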

In publications using the baseline, cite as:

Publication

D. Giannoulis, D. Stowell, E. Benetos, M. Rossignol, M. Lagrange, and M. D. Plumbley. A database and challenge for acoustic scene classification and event detection. In 21st European Signal Processing Conference (EUSIPCO 2013), 1–5, Sep. 2013.


A database and challenge for acoustic scene classification and event detection

Abstract

An increasing number of researchers work in computational auditory scene analysis (CASA). However, a set of tasks, each with a well-defined evaluation framework and commonly used datasets do not yet exist. Thus, it is difficult for results and algorithms to be compared fairly, which hinders research on the field. In this paper we will introduce a newly-launched public evaluation challenge dealing with two closely related tasks of the field: acoustic scene classification and event detection. We give an overview of the tasks involved; describe the processes of creating the dataset; and define the evaluation metrics. Finally, illustrations on results for both tasks using baseline methods applied on this dataset are presented, accompanied by open-source code.

Keywords

acoustic signal processing;feature extraction;Gaussian processes;mixture models;signal classification;computational auditory scene analysis;CASA;public evaluation challenge;acoustic scene classification;event detection;dataset creation;evaluation metrics;baseline methods;open-source code;Event detection;Measurement;Music;Speech;Educational institutions;Hidden Markov models;Computational auditory scene analysis;acoustic scene classification;acoustic event detection


Matlab implementation


Citation

If you are using the dataset or baseline code please cite the following paper:

Publication

D. Giannoulis, D. Stowell, E. Benetos, M. Rossignol, M. Lagrange, and M. D. Plumbley. A database and challenge for acoustic scene classification and event detection. In 21st European Signal Processing Conference (EUSIPCO 2013), 1–5, Sep. 2013.



When citing challenge task and results please cite the following paper:

Publication

D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, Oct 2015. doi:10.1109/TMM.2015.2428998.
