Large-scale weakly labeled semi-supervised sound event detection in domestic environments

Task description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data. The challenge is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance.

Note to participants (July 6th): A small number of files were present in both the training set and the test set of the development data.

This should not have a major impact on system performance, but it could introduce a small bias when evaluating performance on the test set. Therefore, we decided to remove those files from the test set and replace them with newly annotated files from the same classes in order to keep a similar distribution.

The new test set is available on the task 4 repository, together with the list of files to remove from the old test set. To avoid any confusion, we simply removed the csv file corresponding to the old test set.

Note that this will not affect the training of the models you have built so far, as the files were removed from the test set and not from the training set. It will affect only marginally the performance reported on the test set (and only for the classes `speech`, `dog` and `cat`).

This will also not have any impact on the evaluation data, so for the development of your methods you can decide to keep running with the old test set. We have already updated the performance of the baseline on the challenge's webpage. We kindly ask you to report results on the new test set, not the old one, in the submitted technical reports.


The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable. The data are Youtube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications.

Figure 1: Overview of a sound event detection system.

Audio dataset

The task employs a subset of “Audioset: An Ontology And Human-Labeled Dataset For Audio Events” by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million Youtube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. We will focus on a subset of Audioset that consists of 10 classes of sound events:

  • Speech (label: `Speech`)
  • Dog (label: `Dog`)
  • Cat (label: `Cat`)
  • Alarm/bell/ringing (label: `Alarm_bell_ringing`)
  • Dishes (label: `Dishes`)
  • Frying (label: `Frying`)
  • Blender (label: `Blender`)
  • Running water (label: `Running_water`)
  • Vacuum cleaner (label: `Vacuum_cleaner`)
  • Electric shaver/toothbrush (label: `Electric_shaver_toothbrush`)

Recording and annotation procedure

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the annotations are considered as weak labels.

Audio clips are collected from Youtube videos uploaded by independent users, so the number of clips per class varies dramatically and the dataset is not balanced.

Google researchers conducted a quality assessment task in which experts were exposed to 10 randomly selected clips per class, and found that in most cases not all the clips contain the event corresponding to the given annotation. More information about the initial annotation process can be found in:


Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.




The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][class_label (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav  Alarm_bell_ringing;Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from Youtube (Y-BJNMHMZDcU is the Youtube ID of the video from which the 10-second clip was extracted; 50.000 and 60.000 are the clip boundaries, in seconds, within the full video), and the last column, Alarm_bell_ringing;Dog, corresponds to the sound classes present in the clip, separated by a semicolon.
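For illustration, a weak-annotation file in this format can be parsed with a few lines of Python. This is a minimal sketch using only the standard library; it assumes the file has no header row, and `parse_weak_labels` is a hypothetical helper name, not part of the provided tools:

```python
import csv
import io

def parse_weak_labels(fileobj):
    """Parse a weak-label annotation file: one clip per row,
    tab-separated, with class labels joined by semicolons."""
    labels = {}
    for row in csv.reader(fileobj, delimiter="\t"):
        if not row:
            continue
        filename, classes = row[0], row[1]
        labels[filename] = classes.split(";")
    return labels

# Example with the row shown above:
example = "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing;Dog\n"
parsed = parse_weak_labels(io.StringIO(example))
```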

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see below for a detailed explanation of the development set). The minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms; when the silence between two consecutive events from the same class was less than 150 ms, the events have been merged into a single event. The strong annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]

For example:

YOTsn73eqbfc_10.000_20.000.wav  0.163   0.665   Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file downloaded from Youtube; the second column, 0.163, is the onset time in seconds; the third column, 0.665, is the offset time in seconds; and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
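The event-merging rule described above (same-class events separated by less than 150 ms are merged) can be sketched as follows. `merge_close_events` is a hypothetical helper operating on the events of one class at a time:

```python
def merge_close_events(events, min_gap=0.150):
    """Merge same-class events separated by less than `min_gap` seconds.
    `events` is a list of (onset, offset) tuples, in seconds."""
    merged = []
    for onset, offset in sorted(events):
        if merged and onset - merged[-1][1] < min_gap:
            # Gap shorter than min_gap: extend the previous event.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged
```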


The annotation files and the script to download the audio files are available on the git repository for task 4. The final dataset is 82 GB and the download/extraction process can take approximately 12 hours. (If you are using the provided baseline system, there is no need to download the dataset manually, as the system will automatically download the needed data for you.)

If you experience problems during the download of the dataset, please contact the task organizers (Nicolas Turpault and Romain Serizel in priority).

The content of the development set is structured in the following manner:

dataset root
│                         (instructions to run the script that downloads the audio files and description of the annotations files)
│                  (script to download the files)
└───metadata                          (directories containing the annotations files)
│   │
│   └───train                         (annotations for the train sets)
│   │     weak.csv                    (weakly labeled training set list)
│   │     unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │     unlabel_out_of_domain.csv   (unlabeled out-of-domain training set list)
│   │
│   └───test                          (annotations for the test set)
│         test.csv                    (test set list with strong labels)
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the train sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───unlabel_out_of_domain     (unlabeled out-of-domain training set)
    └───test                          (test set)       

Task setup

The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed with start and end timestamps.

Development dataset

The development set is divided into two main partitions: a training set and a test set.

Training set

To motivate participants to come up with innovative solutions, we provide three different splits of training data: a labeled training set, an unlabeled in-domain training set and an unlabeled out-of-domain training set.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which the weak annotations have been verified and cross-checked. The number of clips per class is the following:

| Class | # 10-second clips containing the event |
| --- | --- |
| Speech | 550 |
| Dog | 214 |
| Cat | 173 |
| Alarm/bell/ringing | 205 |
| Dishes | 184 |
| Frying | 171 |
| Blender | 134 |
| Running water | 343 |
| Vacuum cleaner | 167 |
| Electric shaver/toothbrush | 103 |
| **Total** | **2244** |

Unlabeled in domain training set:
This set is considerably larger than the previous one: it contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty on Audioset labels, this distribution might not match exactly.

Unlabeled out of domain training set:
This set is composed of 39999 clips extracted from classes that are not considered in the task.

Test set

The test set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. Its size represents about 20% of the labeled training set: it contains 288 clips (906 events). The test set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

| Class | # events |
| --- | --- |
| Speech | 261 |
| Dog | 127 |
| Cat | 97 |
| Alarm/bell/ringing | 112 |
| Dishes | 122 |
| Frying | 24 |
| Blender | 40 |
| Running water | 76 |
| Vacuum cleaner | 36 |
| Electric shaver/toothbrush | 28 |
| **Total** | **906** |

Evaluation dataset

The evaluation set is a subset of Audioset including the classes mentioned above. Labels with timestamps (obtained by human annotators) will be released after the DCASE 2018 challenge is concluded. The evaluation dataset can be downloaded from the following location:


Detailed information about the challenge submission can be found on the submission page.

System output should be presented as a single text-file (in CSV format) containing predictions for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

If no event is detected for a particular audio file, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
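A minimal sketch of writing such an output file, including the filename-only row for clips with no detections. `write_submission` and the clip names are hypothetical, not part of the provided tools:

```python
import csv
import io

def write_submission(predictions, fileobj):
    """predictions maps filename -> list of (onset, offset, label) tuples;
    files with no detected events still get a row with only the filename."""
    writer = csv.writer(fileobj, delimiter="\t")
    for filename, events in predictions.items():
        if not events:
            writer.writerow([filename])
        else:
            for onset, offset, label in events:
                writer.writerow([filename, f"{onset:.3f}", f"{offset:.3f}", label])

buf = io.StringIO()
write_submission({
    "clip_a.wav": [(0.163, 0.665, "Alarm_bell_ringing")],
    "clip_b.wav": [],  # processed, but nothing detected
}, buf)
```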

Task rules

  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • Only the provided weak labels, and none of the strong labels (timestamps) or original (Audioset) labels, can be used for the training of the submitted system.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.


Submissions will be evaluated with event-based measures, with a 200 ms collar on onsets and a collar of 200 ms or 20% of the event length (whichever is larger) on offsets. Submissions will be ranked according to the event-based F1-score computed over the whole evaluation set. Additionally, the event-based error rate will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox:
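For intuition only, the collar logic can be illustrated as below. The official evaluation is done with sed_eval, which additionally handles event matching, class-wise averaging and error-rate computation, so this simplified per-pair check is not a substitute:

```python
def event_matches(ref, est, t_collar=0.200, pct_length=0.2):
    """Check whether an estimated event matches a reference event under
    the evaluation collars (simplified illustration, class match assumed).
    ref and est are (onset, offset) tuples in seconds."""
    onset_ok = abs(est[0] - ref[0]) <= t_collar
    # Offset collar: the larger of t_collar and 20% of the reference length.
    offset_collar = max(t_collar, pct_length * (ref[1] - ref[0]))
    offset_ok = abs(est[1] - ref[1]) <= offset_collar
    return onset_ok and offset_ok
```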

Detailed information on metrics calculation is available in:


Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016.




Baseline system

System description

The baseline system is based on two convolutional recurrent neural networks (CRNN) using 64 log mel-band magnitudes as features. The 10-second audio files are divided into 500 frames (i.e., 20 ms per frame).

Using these features, we train a first CRNN with three convolution layers (64 filters (3x3), max pooling (4) along the frequency axis and 30% dropout), one recurrent layer (64 gated recurrent units (GRU) with 30% dropout on the input), a dense layer (10 units with sigmoid activation) and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, of which 20% are used for validation). This model is trained at clip level (does the file contain the event or not), with inputs 500 frames long (a 10-second audio file) for a single output. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).

A second model based on the same architecture (3 convolutional layers and 1 recurrent layer) is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weakly labeled files, 1578 clips, are used to validate the model). The main difference with the first-pass model is that the output is the dense layer, in order to be able to predict events at frame level. Inputs are 500 frames long, each frame labeled identically according to the clip labels. The model outputs a decision for each frame. Post-processing (median filtering) is used to obtain event onsets and offsets for each file. The baseline system includes evaluation of the results using the event-based F-score as metric.
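The frame-to-event post-processing can be sketched as follows. This is a pure-Python illustration; the threshold, window size and smoothing details are assumptions, not necessarily the baseline's exact values. With 500 frames per 10-second clip, each frame covers 20 ms:

```python
def frames_to_events(probs, threshold=0.5, win=5, hop=10.0 / 500):
    """Binarize frame-wise probabilities, smooth them with a median filter,
    and convert runs of active frames to (onset, offset) pairs in seconds."""
    binary = [1 if p > threshold else 0 for p in probs]
    half = win // 2
    smoothed = []
    for i in range(len(binary)):
        window = binary[max(0, i - half): i + half + 1]
        # Majority vote == median of a 0/1 window.
        smoothed.append(1 if sum(window) * 2 > len(window) else 0)
    events, start = [], None
    for i, v in enumerate(smoothed):
        if v and start is None:
            start = i
        elif not v and start is not None:
            events.append((start * hop, i * hop))
            start = None
    if start is not None:
        events.append((start * hop, len(smoothed) * hop))
    return events

# 10 active frames followed by 10 inactive frames -> one short event.
events = frames_to_events([0.9] * 10 + [0.1] * 10)
```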

Script description

The baseline system implements a semi-supervised approach:

  • Download the data (only the first time)
  • First pass at clip level:
    • Train a CRNN on weak data (train/weak) - 20% of data used for validation
    • Predict unlabel (in domain) data (train/unlabel_in_domain)
  • Second pass at frame level:
    • Train a CRNN on the predicted unlabeled data from the first pass (train/unlabel_in_domain); weak data (train/weak) is used for validation. Note: labels are used at frame level but annotations are at clip level, so if an event is present in the 10-second clip, all frames carry this label during training
    • Predict strong test labels (test/). Note: each event is predicted with an onset and an offset
  • Evaluate the model between test annotations and second pass predictions (metric is event based F-measure)

Python implementation

System performance

Event-based metrics with a 200ms collar on onsets and a 200ms / 20% of the events length collar on offsets:

Performance on the latest test set version (**July 6th**):

Event-based overall metrics (macro-average)

| Metric | Value |
| --- | --- |
| F-score | 14.06 % |
| ER | 1.54 |

**Note:** This performance was obtained on a CPU-based system (Intel® Xeon E5-1630, 8 cores, 128 GB RAM). The total runtime was approximately 24 hours.

**Note:** The performance might not be exactly reproducible on a GPU-based system. However, the system runs in around 8 hours on a single Nvidia Geforce 1080 Ti GPU.