The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The systems are expected to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, ...) and its potential industrial applications.
The task employs a subset of “Audioset: An Ontology And Human-Labeled Dataset For Audio Events” by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. We will focus on a subset of Audioset that consists of 10 classes of sound events:
- Speech
- Dog
- Cat
- Alarm/bell/ringing
- Dishes
- Frying
- Blender
- Running water
- Vacuum cleaner
- Electric shaver/toothbrush
Recording and annotation procedure
Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the annotations are considered as weak labels.
Audio clips are collected from YouTube videos uploaded by independent users, so the number of clips per class varies dramatically and the dataset is not balanced (see also https://research.google.com/Audioset//dataset/index.html).
Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event referred to by the given annotation. More information about the initial annotation process can be found in:
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated CSV file in the following format:
[filename (string)][tab][class_label (strings)]
The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec and t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing;Dog, corresponds to the sound classes present in the clip, separated by a semicolon.
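Parsing this format is straightforward; the sketch below is a minimal example using an in-memory row (hypothetical content, and assuming the file has no header line) that maps each filename to its list of weak labels:

```python
import csv
import io

# Hypothetical in-memory example of a weak-label file; the real file is
# tab-separated with one [filename][tab][class_label] row per clip.
weak_csv = "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing;Dog\n"

def parse_weak_labels(fileobj):
    """Return a dict mapping filename -> list of class labels."""
    labels = {}
    for filename, class_label in csv.reader(fileobj, delimiter="\t"):
        # Multiple classes for one clip are separated by a semicolon.
        labels[filename] = class_label.split(";")
    return labels

labels = parse_weak_labels(io.StringIO(weak_csv))
# labels["Y-BJNMHMZDcU_50.000_60.000.wav"] == ["Alarm_bell_ringing", "Dog"]
```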
Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation of the development set). The minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events have been merged into a single event. The strong annotations are provided in a tab-separated CSV file in the following format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file downloaded from YouTube, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
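The strong-label rows can be parsed the same way; this minimal sketch (with hypothetical in-memory rows, assuming no header line) collects them as event tuples:

```python
import csv
import io

# Hypothetical in-memory example of a strong-label file
# ([filename][tab][onset][tab][offset][tab][class_label] per row).
strong_csv = (
    "YOTsn73eqbfc_10.000_20.000.wav\t0.163\t0.665\tAlarm_bell_ringing\n"
    "YOTsn73eqbfc_10.000_20.000.wav\t2.500\t4.000\tDog\n"
)

def parse_strong_labels(fileobj):
    """Return a list of events as (filename, onset, offset, label) tuples."""
    events = []
    for filename, onset, offset, label in csv.reader(fileobj, delimiter="\t"):
        events.append((filename, float(onset), float(offset), label))
    return events

events = parse_strong_labels(io.StringIO(strong_csv))
```

A single 10-second clip can contribute several rows, one per annotated event.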
The annotation files and the script to download the audio files are available on the git repository for task 4. The final dataset is 82 GB; the download/extraction process can take approximately 12 hours. (If you are using the provided baseline system, there is no need to download the dataset manually, as the system will automatically download the needed data for you.)
If you experience problems during the download of the dataset, please contact the task organizers (Nicolas Turpault and Romain Serizel, in priority).
The content of the development set is structured in the following manner:
```
dataset root
│   readme.md            (instructions to run the script that downloads the audio files and description of the annotations files)
│   download_data.py     (script to download the files)
│
└───metadata             (directories containing the annotations files)
│   │
│   └───train            (annotations for the train sets)
│   │       weak.csv                    (weakly labeled training set list)
│   │       unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │       unlabel_out_of_domain.csv   (unlabeled out-of-domain training set list)
│   │
│   └───test             (annotations for the test set)
│           test.csv     (test set list with strong labels)
│
└───audio                (directories where the audio files will be downloaded)
    └───train            (audio for the train sets)
    │   └───weak                        (weakly labeled training set)
    │   └───unlabel_in_domain           (unlabeled in domain training set)
    │   └───unlabel_out_of_domain       (unlabeled out-of-domain training set)
    └───test             (test set)
```
The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed with start and end timestamps.
The development set is divided into two main partitions: training and test.
To motivate the participants to come up with innovative solutions, we provide 3 different splits of training data in our training set: a labeled training set, an unlabeled in-domain training set, and an unlabeled out-of-domain training set.
Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The number of clips per class is the following:
| Class | # 10s clips containing the event |
|-------|----------------------------------|
Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty on Audioset labels, this distribution might not match exactly.
Unlabeled out of domain training set:
This set is composed of 39999 clips extracted from classes that are not considered in the task.
The test set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. Its size is such that it represents about 20% of the size of the labeled training set: it contains 288 clips (906 events). The test set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may correspond to more than one sound event. The number of events per class is the following:
The evaluation set is a subset of Audioset including the classes mentioned above. Labels with timestamps (obtained by human annotators) will be released after the DCASE 2018 challenge is concluded.
Detailed information about the challenge submission can be found on the submission page.
System output should be presented as a single text-file (in CSV format) containing predictions for each audio file in the evaluation set. Result items can be in any order. Format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]
Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).
If no event is detected for the particular audio signal, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
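A minimal writer for this output format might look as follows (`write_submission` and the filenames are illustrative sketches, not part of the challenge toolkit); note the filename-only row for files without detections:

```python
import csv

def write_submission(predictions, path):
    """Write system output in the required tab-separated format.

    `predictions` maps each evaluation filename to a (possibly empty)
    list of (onset, offset, label) tuples.  Files with no detected
    events still get a row containing only the filename, so the
    organizers can verify that every file was processed.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for filename, events in predictions.items():
            if not events:
                writer.writerow([filename])
            for onset, offset, label in events:
                writer.writerow([filename, f"{onset:.3f}", f"{offset:.3f}", label])

# Hypothetical example with one detected event and one empty file.
write_submission(
    {"a.wav": [(0.163, 0.665, "Alarm_bell_ringing")], "b.wav": []},
    "submission.csv",
)
```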
- Participants are not allowed to use external data for system development. Data from other tasks are considered external data.
- Other examples of external data are materials related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames, and the metadata.
- Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
- Only the weak labels can be used for training the submitted system; none of the strong labels (timestamps) or original (Audioset) labels can be used.
- Manipulation of provided training data is allowed.
- The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
Submissions will be evaluated with event-based measures using a 200 ms collar on onsets and a 200 ms / 20% of the event's length collar on offsets. Submissions will be ranked according to the event-based F1-score computed over the whole evaluation set. Additionally, the event-based error rate will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox:
Detailed information on metrics calculation is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016.
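The official scoring is computed with the sed_eval toolbox; the pure-Python sketch below (a simplified single-class-aware greedy matcher, not the official implementation) illustrates the collar logic described above:

```python
def match_events(reference, estimated, t_collar=0.2, perc_of_length=0.2):
    """Greedy one-to-one matching of estimated events to reference events
    of the same class.  An estimated event matches a reference event when
    the onsets are within `t_collar` seconds and the offsets are within
    max(t_collar, perc_of_length * reference event length).
    Events are (onset, offset, label) tuples.  Simplified sketch only.
    """
    matched = 0
    used = set()
    for est in estimated:
        for i, ref in enumerate(reference):
            if i in used or est[2] != ref[2]:
                continue
            offset_collar = max(t_collar, perc_of_length * (ref[1] - ref[0]))
            if abs(est[0] - ref[0]) <= t_collar and abs(est[1] - ref[1]) <= offset_collar:
                matched += 1
                used.add(i)
                break
    return matched

def event_based_f1(reference, estimated, **kwargs):
    """Event-based F1-score from matched events (simplified sketch)."""
    tp = match_events(reference, estimated, **kwargs)
    fp = len(estimated) - tp
    fn = len(reference) - tp
    return 2 * tp / (2 * tp + fp + fn) if reference or estimated else 1.0
```

In practice, use sed_eval's event-based metrics directly; this sketch only shows why a slightly shifted prediction still counts as a true positive when it falls within the collars.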
The baseline system is based on two convolutional recurrent neural networks (CRNN) using 64 log mel-band magnitudes as features. Each 10-second audio file is divided into 500 frames.
Using these features, we train a first CRNN with three convolution layers (64 filters (3x3), max pooling (4) along the frequency axis, and 30% dropout), one recurrent layer (64 gated recurrent units (GRU) with 30% dropout on the input), a dense layer (10 units, sigmoid activation), and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, of which 20% are used for validation). This model is trained at clip level (does the file contain the event or not); inputs are 500 frames long (a 10-second audio file) for a single output frame. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).
A second model, based on the same architecture (3 convolutional layers and 1 recurrent layer), is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weak files, 1578 clips, are used to validate the model). The main difference with the first-pass model is that the output is the dense layer, in order to be able to predict events at frame level. Inputs are 500 frames long, each frame labeled identically following the clip labels. The model outputs a decision for each frame. Post-processing (median filtering) is then used to obtain event onsets and offsets for each file. The baseline system includes an evaluation of results using the event-based F-score as metric.
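The frame-to-event step can be sketched as follows (a minimal illustration assuming 500 frames per 10-second clip, i.e. a 20 ms hop, and a window size of 5 frames; the baseline's actual filter parameters may differ):

```python
# Sketch of turning frame-level binary decisions into events.
FRAME_HOP = 10.0 / 500  # seconds per frame, assuming 500 frames per 10 s clip

def median_filter(decisions, win=5):
    """Smooth a list of 0/1 frame decisions with a sliding median."""
    half = win // 2
    out = []
    for i in range(len(decisions)):
        window = decisions[max(0, i - half):i + half + 1]
        out.append(sorted(window)[len(window) // 2])
    return out

def frames_to_events(decisions, label):
    """Merge consecutive active frames into (onset, offset, label) events."""
    events, start = [], None
    for i, d in enumerate(decisions + [0]):  # sentinel to close the last run
        if d and start is None:
            start = i
        elif not d and start is not None:
            events.append((start * FRAME_HOP, i * FRAME_HOP, label))
            start = None
    return events

# Noisy frame decisions: the median filter removes the isolated gap,
# yielding one contiguous event instead of two fragments.
smoothed = median_filter([0, 1, 1, 0, 1, 1, 1, 0, 0, 0])
events = frames_to_events(smoothed, "Dog")
```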
The baseline system is a semi-supervised approach:
- Download the data (only the first time)
- First pass at clip level:
  - Train a CRNN on weak data (train/weak); 20% of the data is used for validation
  - Predict the unlabeled (in domain) data (train/unlabel_in_domain)
- Second pass at frame level:
  - Train a CRNN on the unlabeled data predicted in the first pass (train/unlabel_in_domain); weak data (train/weak) is used for validation. Note: labels are used at frame level but annotations are at clip level, so if an event is present in the 10 sec, all frames carry this label during training
  - Predict strong test labels (test/). Note: events are predicted with an onset and an offset
- Evaluate the model between test annotations and second-pass predictions (metric is the event-based F-measure)
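The link between the two passes is the pseudo-labeling step: clip-level probabilities from the first model become weak labels for the unlabeled clips. A minimal sketch (the class subset, the 0.5 threshold, and `pseudo_weak_labels` are illustrative assumptions, not documented baseline values):

```python
# Subset of the task's classes, for illustration only.
CLASSES = ["Alarm_bell_ringing", "Dog", "Running_water"]

def pseudo_weak_labels(clip_probs, threshold=0.5):
    """Turn first-pass clip-level probabilities into pseudo weak labels.

    `clip_probs` maps filename -> list of per-class probabilities in the
    order of CLASSES.  The 0.5 threshold is an assumption, not the
    baseline's documented value.
    """
    labels = {}
    for filename, probs in clip_probs.items():
        labels[filename] = [c for c, p in zip(CLASSES, probs) if p >= threshold]
    return labels

pseudo = pseudo_weak_labels({"clip1.wav": [0.9, 0.2, 0.7]})
# pseudo["clip1.wav"] == ["Alarm_bell_ringing", "Running_water"]
```

The second model then trains on these pseudo weak labels exactly as if they were verified annotations, which is why noisy first-pass predictions limit the final performance.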
Event-based metrics with a 200 ms collar on onsets and a 200 ms / 20% of the event's length collar on offsets:
| Event-based overall metrics (macro-average) |
|---------------------------------------------|
Note: the performance might not be exactly reproducible on a GPU-based system. However, the system runs in around 8 hours on a single Nvidia GeForce 1080 Ti GPU.