Sound event detection in domestic environments

Task description

The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled (with time stamps).


This task is the follow-up to DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. As in 2018, the challenge is to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance; in addition, a training set with strongly annotated synthetic data is provided. The labels in all the annotated subsets are verified and can be considered reliable. An additional scientific question this task aims to investigate is whether real but partially and weakly annotated data is really needed, whether synthetic data alone is sufficient, or whether both are needed.

Figure 1: Overview of a sound event detection system.

Audio dataset

The dataset for this task is composed of 10-second audio clips recorded in a domestic environment or synthesized to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of Audioset (not all the classes are present in Audioset as such; some sound event classes group several Audioset classes). The label used in the annotation files is given in parentheses:

  • Speech (Speech)
  • Dog (Dog)
  • Cat (Cat)
  • Alarm/bell/ringing (Alarm_bell_ringing)
  • Dishes (Dishes)
  • Frying (Frying)
  • Blender (Blender)
  • Running water (Running_water)
  • Vacuum cleaner (Vacuum_cleaner)
  • Electric shaver/toothbrush (Electric_shaver_toothbrush)

The dataset for DCASE 2019 task 4 is composed of a subset with real recordings (from Audioset) and a subset with synthetic recordings. The datasets used to generate the dataset for DCASE 2019 task 4 are described below.

Audioset: Real recordings are extracted from Audioset. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

Freesound dataset: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology.


Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), 486–493. Suzhou, China, 2017.




Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.




SINS dataset: The derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.


Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.




Synthetic data generation procedure

The synthetic set is composed of 10-second audio clips generated with Scaper. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked whether the event onset and offset are present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class. The number of unique isolated events per class used to generate the synthetic set is given below:

Class                         # unique events
Speech                        128
Dog                           136
Cat                           88
Alarm/bell/ringing            190
Dishes                        109
Frying                        64
Blender                       98
Running water                 68
Vacuum cleaner                74
Electric shaver/toothbrush    56
Total                         1011

The background textures were obtained from the SINS dataset (class other). This particular class was selected because it contains few sound events from the 10 target sound event classes. However, there is no guarantee that these sound event classes are totally absent from the background clips. The number of unique background clips used to generate the synthetic dataset is given below:

Class    # background clips
Other    2060

Scaper scripts are designed such that the distribution of sound events per class, the number of sound events per clip (depending on the class) and the sound event class co-occurrence are similar to those of the validation set composed of real recordings.


Justin Salamon, Duncan MacConnell, Mark Cartwright, Peter Li, and Juan Pablo Bello. Scaper: a library for soundscape synthesis and augmentation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 344–348. IEEE, 2017.




Reference labels

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered weak labels. Google researchers conducted a quality assessment task in which experts were presented with 10 randomly selected clips for each class, and found that in most cases not all the clips contain the event related to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][class_label (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav  Alarm_bell_ringing,Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU identifies the YouTube video from which the 10-second clip was extracted; t=50 sec and t=60 sec correspond to the clip boundaries within the full video). The last column, Alarm_bell_ringing,Dog, lists the sound classes present in the clip, separated by a comma.
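
As an illustration, a line in this format could be parsed as sketched below. This is a minimal sketch, not the task's official loader: the names WeakAnnotation and parse_weak_line are hypothetical, and the regular expression assumes that the leading Y is a prefix placed in front of the video identifier and that the clip boundaries are written with a decimal point.

```python
import re
from typing import List, NamedTuple


class WeakAnnotation(NamedTuple):
    filename: str
    video_id: str      # identifier without the leading "Y" (assumption)
    start_s: float     # clip start within the full video, in seconds
    end_s: float       # clip end within the full video, in seconds
    labels: List[str]  # sound classes present in the clip


# Assumed layout: "Y" + video id + "_" + start + "_" + end + ".wav"
_FILENAME_RE = re.compile(
    r"^Y(?P<vid>.+)_(?P<start>\d+\.\d+)_(?P<end>\d+\.\d+)\.wav$"
)


def parse_weak_line(line: str) -> WeakAnnotation:
    """Split one tab-separated weak annotation line into its fields."""
    filename, label_field = line.rstrip("\n").split("\t")
    match = _FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    labels = [lab.strip() for lab in label_field.split(",") if lab.strip()]
    return WeakAnnotation(
        filename,
        match["vid"],
        float(match["start"]),
        float(match["end"]),
        labels,
    )


example = parse_weak_line(
    "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing,Dog"
)
print(example.video_id, example.start_s, example.end_s, example.labels)
```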

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation of the development set).

The synthetic subset of the development set is also annotated with strong annotations obtained from Scaper. Each sound clip from FSD was verified by humans to check that the event class given in the FSD annotation was indeed dominant in the audio clip.

In both cases, the minimum length of an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was shorter than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]

For example:

YOTsn73eqbfc_10.000_20.000.wav  0.163   0.665   Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column 0.163 is the onset time in seconds, the third column 0.665 is the offset time in seconds and the last column, Alarm_bell_ringing corresponds to the class of the sound event.
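
The merging rule described above (same-class events separated by less than 150 ms become one event) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the first row is the example from the text, and the remaining rows are made up to exercise the rule.

```python
from collections import defaultdict

MIN_GAP_S = 0.150  # minimum pause between two events of the same class


def parse_strong_line(line):
    """Split one tab-separated strong annotation line into its fields."""
    filename, onset, offset, label = line.rstrip("\n").split("\t")
    return filename, float(onset), float(offset), label


def merge_close_events(events, min_gap=MIN_GAP_S):
    """Merge same-class events of a file separated by less than min_gap."""
    spans_by_key = defaultdict(list)
    for filename, onset, offset, label in events:
        spans_by_key[(filename, label)].append((onset, offset))

    merged = []
    for (filename, label), spans in spans_by_key.items():
        spans.sort()
        cur_on, cur_off = spans[0]
        for onset, offset in spans[1:]:
            if onset - cur_off < min_gap:
                # Pause too short: extend the current event.
                cur_off = max(cur_off, offset)
            else:
                merged.append((filename, cur_on, cur_off, label))
                cur_on, cur_off = onset, offset
        merged.append((filename, cur_on, cur_off, label))
    return sorted(merged)


FILE = "YOTsn73eqbfc_10.000_20.000.wav"
lines = [
    f"{FILE}\t0.163\t0.665\tAlarm_bell_ringing",  # example from the text
    f"{FILE}\t0.700\t1.200\tAlarm_bell_ringing",  # made up: 35 ms pause
    f"{FILE}\t1.500\t1.900\tAlarm_bell_ringing",  # made up: 300 ms pause
    f"{FILE}\t2.000\t2.500\tDog",                 # made up: other class
]
merged_events = merge_close_events(parse_strong_line(l) for l in lines)
for event in merged_events:
    print(event)
```

In this example the first two Alarm_bell_ringing rows are merged into one event from 0.163 s to 1.2 s, while the third stays separate because the pause before it is longer than 150 ms.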


The content of the development set is structured in the following manner:

dataset root
└───metadata                          (directories containing the annotations files)
│   │
│   └───train                         (annotations for the training sets)
│   │     weak.csv                    (weakly labeled training set list and annotations)
│   │     unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │     synthetic.csv               (synthetic data training set list and annotations)
│   │
│   └───validation                    (annotations for the test set)
│         validation.csv              (validation set list with strong labels)
│         test_2018.csv               (test set list with strong labels - DCASE 2018)
│         eval_2018.csv               (eval set list with strong labels - DCASE 2018)
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the training sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───synthetic                 (synthetic data training set)
    └───validation                    (validation set)       
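
Given this layout, the audio directory that goes with an annotation file can be derived from the path of the csv file. The helper below is a hypothetical sketch: the root directory name "dataset" is a placeholder, and it assumes that all csv files under metadata/validation share the single audio/validation directory shown in the tree.

```python
from pathlib import PurePosixPath


def audio_dir_for(metadata_csv: str) -> str:
    """Map an annotation file to the directory expected to hold its audio."""
    csv_path = PurePosixPath(metadata_csv)
    parts = list(csv_path.parts)
    idx = parts.index("metadata")
    parts[idx] = "audio"
    if "validation" in parts[idx + 1:-1]:
        # Assumption: every validation list points to audio/validation.
        return str(PurePosixPath(*parts[:-1]))
    # Training lists: metadata/train/weak.csv -> audio/train/weak
    return str(PurePosixPath(*parts[:-1], csv_path.stem))


for csv_file in (
    "dataset/metadata/train/weak.csv",
    "dataset/metadata/train/unlabel_in_domain.csv",
    "dataset/metadata/validation/eval_2018.csv",
):
    print(csv_file, "->", audio_dir_for(csv_file))
```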

The dataset is composed of two subsets that can be downloaded independently. The procedure to download each subset is described below.

Real recordings

The annotation files and the script to download the audio files are available on the git repository for task 4. This subset is 23.4 GB; the download and extraction process can take approximately 4 hours.

If you experience problems during the download of this subset, please contact the task organizers (Nicolas Turpault and Romain Serizel in priority).

Synthetic clips

Task setup

The challenge consists of detecting sound events within web videos using training data from real recordings, both weakly labeled and unlabeled, and synthetic audio clips that are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps.

Development dataset

The development set is divided into two main partitions: training and validation.

Training set

To motivate the participants to come up with innovative solutions, we provide 3 different splits of training data in our training set: a labeled training set, an unlabeled in domain training set and a synthetic set with strong annotations. The first two sets are the same as in DCASE 2018 task 4.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The number of clips per class is the following:

Class                         # 10s clips containing the event
Speech                        550
Dog                           214
Cat                           173
Alarm/bell/ringing            205
Dishes                        184
Frying                        171
Blender                       134
Running water                 343
Vacuum cleaner                167
Electric shaver/toothbrush    103
Total                         2244
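
As a small worked example, the figures in this table give the average number of class labels per clip and the share of clips that contain a given class. The counts are copied from the table and from the text above; the variable names are only illustrative.

```python
# Clip counts per class, copied from the labeled training set table.
clips_per_class = {
    "Speech": 550,
    "Dog": 214,
    "Cat": 173,
    "Alarm_bell_ringing": 205,
    "Dishes": 184,
    "Frying": 171,
    "Blender": 134,
    "Running_water": 343,
    "Vacuum_cleaner": 167,
    "Electric_shaver_toothbrush": 103,
}
n_clips = 1578  # number of clips in the labeled training set

class_occurrences = sum(clips_per_class.values())
labels_per_clip = class_occurrences / n_clips
speech_share = clips_per_class["Speech"] / n_clips

print(f"class occurrences: {class_occurrences}")
print(f"average labels per clip: {labels_per_clip:.2f}")
print(f"share of clips with speech: {speech_share:.1%}")
```

The sum of the per-class counts matches the 2244 class occurrences stated above, which corresponds to roughly 1.4 labels per clip on average.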

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note however that given the uncertainty on Audioset labels this distribution might not be exactly similar.

Synthetic strongly labeled set:
This set is composed of 2045 clips generated with Scaper. The clips are generated such that the distribution per event is close to that of the validation set. Note that a 10-second clip may contain more than one sound event.

Class                         # events
Speech                        2132
Dog                           516
Cat                           547
Alarm/bell/ringing            755
Dishes                        814
Frying                        137
Blender                       540
Running water                 157
Vacuum cleaner                204
Electric shaver/toothbrush    230
Total                         6032

Validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. It is the fusion of the DCASE 2018 task 4 test set and evaluation set (note that, for comparison purposes, the original csv files are also provided). The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

Class                         # events
Speech                        261
Dog                           127
Cat                           97
Alarm/bell/ringing            112
Dishes                        122
Frying                        24
Blender                       40
Running water                 76
Vacuum cleaner                36
Electric shaver/toothbrush    28
Total                         4093

Evaluation dataset

The systems will be evaluated on 10-second clips extracted from YouTube videos and from YFCC100M. Additionally, a synthetic evaluation dataset will be provided for scientific evaluation purposes, but the performance on this dataset will not be taken into account in the final performance evaluation.

Task rules

  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Another example of external data is other materials related to the video such as the rest of audio from where the 10-sec clip was extracted, the video frames and metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • Only weak labels and none of the strong labels (timestamps) or original (Audioset) labels can be used for the training of the submitted system.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a probability density function or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.


Submissions will be evaluated with event-based measures with a 200 ms collar on onsets and, on offsets, a collar of 200 ms or 20% of the event's length. Submissions will be ranked according to the event-based F1-score computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the ranking). Additionally, the segment-based F1-score on 1 s segments will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox.
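
To make the collar rule concrete, the sketch below computes an event-based F1-score for a single clip. It is a simplified illustration and not the sed_eval implementation: it assumes the offset tolerance is the larger of 200 ms and 20% of the reference event length, matches events greedily in reference order, and pools all classes instead of averaging per class. The function name and the example events are hypothetical.

```python
def event_based_f1(reference, estimated, t_collar=0.200, percentage=0.20):
    """Event-based F1 for lists of (onset_s, offset_s, label) tuples."""
    unmatched = list(estimated)
    true_positives = 0
    for ref_on, ref_off, ref_label in reference:
        # Offset tolerance: 200 ms or 20% of the reference event length,
        # whichever is larger (assumed reading of the rule).
        offset_tol = max(t_collar, percentage * (ref_off - ref_on))
        for est in unmatched:
            est_on, est_off, est_label = est
            if (
                est_label == ref_label
                and abs(est_on - ref_on) <= t_collar
                and abs(est_off - ref_off) <= offset_tol
            ):
                unmatched.remove(est)  # each estimate is used at most once
                true_positives += 1
                break
    false_positives = len(estimated) - true_positives
    false_negatives = len(reference) - true_positives
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


reference_events = [
    (0.163, 0.665, "Alarm_bell_ringing"),
    (2.0, 6.0, "Speech"),
]
estimated_events = [
    (0.10, 0.80, "Alarm_bell_ringing"),  # inside both 200 ms collars
    (2.1, 6.7, "Speech"),                # offset within 20% of 4 s = 0.8 s
    (8.0, 9.0, "Dog"),                   # no matching reference event
]
f1 = event_based_f1(reference_events, estimated_events)
print(f"event-based F1: {f1:.2f}")
```

Here two of the three estimated events are matched and no reference event is missed, so precision is 2/3, recall is 1 and the F1-score is 0.8.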

Detailed information on metrics calculation is available in:


Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. doi:10.3390/app6060162.





System description

The baseline system is based on the idea of the best submission to DCASE 2018 task 4. The author provided his system code, and most of the hyper-parameters of this year's baseline are close to the hyper-parameters defined by last year's winner. However, the network architecture itself remains similar to last year's baseline, so it is much simpler than the networks used by Lu JiaKai.


Lu JiaKai. Mean teacher convolution system for dcase 2018 task 4. Technical Report, DCASE2018 Challenge, September 2018.




The baseline uses a mean-teacher model composed of two networks that are both the same CRNN. The implementation of the mean-teacher model is based on Tarvainen & Valpola from Curious AI.


Antti Tarvainen and Harri Valpola. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204. 2017.




The model is trained as follows:

  • The teacher model is trained on synthetic and weakly labeled data. The classification cost is computed at frame level on synthetic data and at clip level on weakly labeled data.
  • The student model is not trained, its weights are a moving average of the teacher model (at each epoch).
  • The inputs of the student model are the inputs of the teacher model + some Gaussian noise.
  • A cost for consistency between teacher and student model is applied (for weak and strong predictions).
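
The two ingredients that are specific to this scheme, the moving average of the weights and the consistency cost, can be sketched in plain Python as below. This is an illustration and not the baseline code: the smoothing factor, the noise level and the use of a mean squared error as consistency cost are assumptions, and the weights are represented as simple lists of floats. The sketch speaks of a trained network and an averaged network; the text above calls the trained network the teacher, whereas Tarvainen & Valpola use the name teacher for the averaged network.

```python
import random


def ema_update(averaged, trained, alpha=0.999):
    """Move the averaged weights towards the trained weights.

    Both arguments map a parameter name to a list of floats.
    alpha is an assumed smoothing factor.
    """
    for name, values in trained.items():
        averaged[name] = [
            alpha * a + (1.0 - alpha) * t
            for a, t in zip(averaged[name], values)
        ]


def add_gaussian_noise(features, sigma=0.1, rng=random):
    """Perturb the input features of the averaged network (sigma assumed)."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def consistency_cost(pred_trained, pred_averaged):
    """Mean squared error between the predictions of the two networks."""
    squared = [(a - b) ** 2 for a, b in zip(pred_trained, pred_averaged)]
    return sum(squared) / len(squared)


trained_weights = {"layer.weight": [1.0, 1.0]}
averaged_weights = {"layer.weight": [0.0, 1.0]}
ema_update(averaged_weights, trained_weights, alpha=0.9)
print(averaged_weights)

cost = consistency_cost([0.2, 0.8], [0.4, 0.6])
print(f"consistency cost: {cost:.3f}")
```

In a full training loop the consistency cost would be added to the classification cost, once for the clip-level (weak) predictions and once for the frame-level (strong) predictions.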

The baseline exploits unlabeled, weakly labeled and synthetic data for training and is trained for 100 epochs. Inputs are 864 frames long. The CRNN model pools in time to output 108 frames. Post-processing (median filtering over 5 frames) is used to obtain the event onsets and offsets for each file. The baseline system includes an evaluation of the results using the event-based F-score as metric.
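
The post-processing step can be illustrated with the following sketch, which smooths the binary frame decisions of one class with a 5-frame median filter and converts the result into onset and offset times. It is not the baseline implementation: it assumes that the 108 output frames cover the 10-second clip uniformly, that the decisions have already been thresholded, and that the sequence is padded at the edges by repeating the first and last values.

```python
from statistics import median

CLIP_DURATION_S = 10.0
N_OUTPUT_FRAMES = 108  # 864 input frames pooled in time by a factor of 8
FRAME_S = CLIP_DURATION_S / N_OUTPUT_FRAMES  # assumed uniform frame length


def median_filter(frames, width=5):
    """Median-filter a binary sequence, repeating the edge values."""
    if not frames:
        return []
    half = width // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [int(median(padded[i:i + width])) for i in range(len(frames))]


def frames_to_events(frames, frame_s=FRAME_S):
    """Turn runs of active frames into (onset_s, offset_s) pairs."""
    events, onset = [], None
    # The trailing 0 closes an event that lasts until the final frame.
    for i, active in enumerate(list(frames) + [0]):
        if active and onset is None:
            onset = i
        elif not active and onset is not None:
            events.append((onset * frame_s, i * frame_s))
            onset = None
    return events


decisions = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
smoothed = median_filter(decisions)
print(smoothed)
for onset_s, offset_s in frames_to_events(smoothed):
    print(f"{onset_s:.3f}\t{offset_s:.3f}")
```

In this example the isolated active frame is removed and the one-frame gap inside the longer run is filled, so a single event remains.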

Python Implementation

System Performance

Performance is reported on this year's validation set (Validation 2019) and on the evaluation set from DCASE 2018 task 4.

F-score metrics (macro averaged)

                 Validation 2019    Evaluation 2018
Event-based      23.46 %            20.29 %
Segment-based    54.66 %            50.65 %

Note: The performance might not be exactly reproducible on a GPU-based system. For this reason, you can download the weights of the networks used for the experiments and run with --model_path="Path_of_model" to reproduce the results.