Large-scale weakly labeled semi-supervised sound event detection in domestic environments


Task description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data. The challenge is to explore how a large amount of unbalanced and unlabeled training data can be exploited together with a small weakly annotated training set to improve system performance.

Note to participants (September 20th): Minor bug in the evaluation script.

We found a minor bug in the evaluation script: some files were not taken into account correctly. This has been fixed and the evaluation scores have been updated (resulting in at most a 0.3% decrease of the F-score). This had no impact on the team ranking. However, the system ranking may have varied very slightly (around ranks 15-25).

Description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore how a large amount of unbalanced and unlabeled training data can be exploited together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, etc.) and potential industrial applications.

Figure 1: Overview of a sound event detection system.


Audio dataset

The task employs a subset of “Audioset: An Ontology And Human-Labeled Dataset For Audio Events” by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. We focus on a subset of Audioset that consists of 10 classes of sound events (the label used in the annotation files is given in parentheses):

  • Speech (Speech)
  • Dog (Dog)
  • Cat (Cat)
  • Alarm/bell/ringing (Alarm_bell_ringing)
  • Dishes (Dishes)
  • Frying (Frying)
  • Blender (Blender)
  • Running water (Running_water)
  • Vacuum cleaner (Vacuum_cleaner)
  • Electric shaver/toothbrush (Electric_shaver_toothbrush)

Recording and annotation procedure

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the annotations are considered as weak labels.

Audio clips are collected from YouTube videos uploaded by independent users, so the number of clips per class varies dramatically and the dataset is not balanced (see also https://research.google.com/Audioset//dataset/index.html).

Google researchers conducted a quality assessment task in which experts were exposed to 10 randomly selected clips for each class and found that, in most cases, not all the clips contain the event referred to by the annotation. More information about the initial annotation process can be found in:

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.


Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.


The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][class_label (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav  Alarm_bell_ringing;Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted, and 50.000/60.000 are the clip boundaries within the full video, i.e. t=50 s to t=60 s). The second column, Alarm_bell_ringing;Dog, lists the sound classes present in the clip, separated by a semicolon.
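For illustration, the following Python sketch reads such a weak-label file into multi-hot clip-level targets. It assumes the 10 class labels listed above, a tab-separated file, and the metadata/train/weak.csv path from the dataset structure described below; adapt these assumptions as needed.

    import csv
    import numpy as np

    CLASSES = ["Alarm_bell_ringing", "Blender", "Cat", "Dishes", "Dog",
               "Electric_shaver_toothbrush", "Frying", "Running_water",
               "Speech", "Vacuum_cleaner"]

    def load_weak_labels(path="metadata/train/weak.csv"):
        """Return a dict mapping file names to multi-hot vectors over the 10 classes."""
        targets = {}
        with open(path) as f:
            for row in csv.reader(f, delimiter="\t"):
                if row[0] == "filename":  # skip a header row if the file has one
                    continue
                filename, labels = row[0], row[1]
                y = np.zeros(len(CLASSES), dtype=np.float32)
                for label in labels.split(";"):  # labels are separated by semicolons
                    y[CLASSES.index(label)] = 1.0
                targets[filename] = y
        return targets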

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see below for a detailed description of the development set). The minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms: when the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]

For example:

YOTsn73eqbfc_10.000_20.000.wav  0.163   0.665   Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file downloaded from YouTube, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, is the class of the sound event.
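The merging convention described above can be illustrated with a short sketch. This mimics the annotation rule for same-class events; it is not the official annotation tooling.

    def merge_same_class_events(events, min_gap=0.150, min_length=0.250):
        """events: list of (onset, offset) pairs in seconds for one class in one file.

        Merges events separated by less than min_gap seconds and drops events
        shorter than min_length seconds, mirroring the annotation convention."""
        merged = []
        for onset, offset in sorted(events):
            if merged and onset - merged[-1][1] < min_gap:
                # Gap to the previous event is too short: extend the previous event.
                merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
            else:
                merged.append((onset, offset))
        return [(on, off) for on, off in merged if off - on >= min_length]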

Download

The annotation files and the script to download the audio files are available in the git repository for task 4. The final dataset is 82 GB and the download/extraction process can take approximately 12 hours. (If you are using the provided baseline system, there is no need to download the dataset manually: the system will automatically download the needed data for you.)

If you experience problems during the download of the dataset, please contact the task organizers (Nicolas Turpault and Romain Serizel in priority).

The content of the development set is structured in the following manner:

dataset root
│   readme.md                         (instructions to run the download script and a description of the annotation files)
│   download_data.py                  (script to download the audio files)
│
└───metadata                          (directories containing the annotation files)
│   └───train                         (annotations for the train sets)
│   │       weak.csv                  (weakly labeled training set list)
│   │       unlabel_in_domain.csv     (unlabeled in domain training set list)
│   │       unlabel_out_of_domain.csv (unlabeled out-of-domain training set list)
│   └───test                          (annotations for the test set)
│           test.csv                  (test set list with strong labels)
│
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the train sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───unlabel_out_of_domain     (unlabeled out-of-domain training set)
    └───test                          (test set)

Task setup

The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed with start and end timestamps.

Development dataset

The development set is divided into two main partitions: a training set and a test set.

Training set

To encourage participants to come up with innovative solutions, we provide three different splits of training data within the training set: a labeled training set, an unlabeled in domain training set and an unlabeled out of domain training set.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which the weak annotations have been verified and cross-checked. The number of clips per class is the following:

Class                          # 10-second clips containing the event
Speech                         550
Dog                            214
Cat                            173
Alarm/bell/ringing             205
Dishes                         184
Frying                         171
Blender                        134
Running water                  343
Vacuum cleaner                 167
Electric shaver/toothbrush     103
Total                          2244

Unlabeled in domain training set:
This set is considerably larger than the labeled set. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty of the Audioset labels, this distribution might not match exactly.

Unlabeled out of domain training set:
This set is composed of 39999 clips extracted from classes that are not considered in the task. Note that these clips are chosen based on the Audioset labels, which are not verified and may therefore be noisy. Additionally, as speech is present in half of the Audioset segments, this set also contains almost 20000 clips with speech. This was the only way to obtain an unlabel_out_of_domain set that is somewhat representative of Audioset: discarding speech would also have meant discarding many other classes and would have reduced the variability of the set.

Test set

The test set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. Its size corresponds to about 20% of the size of the labeled training set: it contains 288 clips (906 events). The test set is annotated with strong labels, i.e. with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

Class                          # events
Speech                         261
Dog                            127
Cat                            97
Alarm/bell/ringing             112
Dishes                         122
Frying                         24
Blender                        40
Running water                  76
Vacuum cleaner                 36
Electric shaver/toothbrush     28
Total                          906

Evaluation dataset

The evaluation set is a subset of Audioset that includes the classes mentioned above. Labels with timestamps (obtained by human annotators) will be released after the DCASE 2018 challenge is concluded. The evaluation dataset can be downloaded from the challenge website.

Submission

Detailed information on the challenge submission can be found on the submission page.

System output should be presented as a single text-file (in CSV format) containing predictions for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

If no event is detected for the particular audio signal, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
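As an illustration, the submission file could be written as follows. The file names and events below are made up; only the output format matters.

    import csv

    # Hypothetical predictions: file name -> list of (onset, offset, label) events.
    predictions = {
        "Yexample_clip_a.wav": [(0.163, 0.665, "Alarm_bell_ringing"), (2.10, 4.35, "Dog")],
        "Yexample_clip_b.wav": [],  # processed, but no event detected
    }

    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for filename, events in predictions.items():
            if not events:
                writer.writerow([filename])  # a row with only the file name
            else:
                for onset, offset, label in events:
                    writer.writerow([filename, f"{onset:.3f}", f"{offset:.3f}", label])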

Task rules

  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Other material related to the videos is also considered external data: for example, the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • Only the provided weak labels can be used for training the submitted system; neither the strong labels (timestamps) nor the original Audioset labels can be used.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.

Evaluation

Submissions will be evaluated with event-based measures, with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets. Submissions will be ranked according to the event-based F1-score computed over the whole evaluation set. Additionally, the event-based error rate will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox.
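For reference, event-based metrics with these collars can be computed with sed_eval roughly as follows. This is a minimal sketch with a made-up estimated event for the reference file shown earlier; in practice, the reference and estimated lists are loaded from the annotation and output files, the label list covers the 10 task classes, and evaluation is typically run file by file.

    import sed_eval
    import dcase_util

    reference = dcase_util.containers.MetaDataContainer([
        {"filename": "YOTsn73eqbfc_10.000_20.000.wav", "onset": 0.163,
         "offset": 0.665, "event_label": "Alarm_bell_ringing"},
    ])
    estimated = dcase_util.containers.MetaDataContainer([
        {"filename": "YOTsn73eqbfc_10.000_20.000.wav", "onset": 0.200,
         "offset": 0.700, "event_label": "Alarm_bell_ringing"},
    ])

    event_based = sed_eval.sound_event.EventBasedMetrics(
        event_label_list=reference.unique_event_labels,  # the 10 task classes in practice
        t_collar=0.200,                # 200 ms collar on onsets (and offsets)
        percentage_of_length=0.20,     # or 20% of the event length for offsets
    )
    event_based.evaluate(reference_event_list=reference,
                         estimated_event_list=estimated)
    print(event_based)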


Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.


Baseline system

System description

The baseline system is based on two convolutional recurrent neural networks (CRNN) using 64 log mel-band magnitudes as features. The 10-second audio files are divided into 500 frames.

Using these features, we train a first CRNN with three convolution layers (64 filters (3x3), max pooling (4) along the frequency axis and 30% dropout), one recurrent layer (64 gated recurrent units (GRU) with 30% dropout on the input), a dense layer (10 units, sigmoid activation) and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, 20% of which are used for validation). This model is trained at clip level (whether the file contains the event or not): inputs are 500 frames long (a 10-second audio file) for a single clip-level output. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).
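A minimal Keras sketch of a CRNN along these lines is given below. The layer shapes follow the description above; the exact hyperparameters and training details of the official baseline may differ (see the task 4 repository).

    from tensorflow.keras import layers, models

    n_frames, n_mels, n_classes = 500, 64, 10

    inp = layers.Input(shape=(n_frames, n_mels, 1))
    x = inp
    for _ in range(3):
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 4))(x)  # pool along the frequency axis only
        x = layers.Dropout(0.3)(x)

    # After three (1, 4) poolings the 64 mel bands are reduced to 1; keep the time axis.
    x = layers.Reshape((n_frames, -1))(x)
    x = layers.GRU(64, return_sequences=True, dropout=0.3)(x)

    frame_probs = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(x)
    clip_probs = layers.GlobalAveragePooling1D()(frame_probs)  # clip-level output for weak labels

    model = models.Model(inp, clip_probs)
    model.compile(optimizer="adam", loss="binary_crossentropy")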

A second model, based on the same architecture (3 convolutional layers and 1 recurrent layer), is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weakly labeled files, 1578 clips, are used to validate the model). The main difference with the first-pass model is that the output is the dense layer, so that events can be predicted at frame level. Inputs are 500 frames long, each frame labeled identically according to the clip labels. The model outputs a decision for each frame. Post-processing (median filtering) is used to obtain event onsets and offsets for each file. The baseline system includes an evaluation of the results using the event-based F-score as metric.
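The post-processing step can be sketched as follows. The threshold and median filter length are illustrative choices, not necessarily those of the official baseline.

    import numpy as np
    from scipy.ndimage import median_filter

    FRAME_HOP = 10.0 / 500  # 500 frames per 10-second clip, i.e. 20 ms per frame

    def decode_events(frame_probs, class_labels, threshold=0.5, filter_frames=11):
        """frame_probs: array of shape (n_frames, n_classes) with per-frame probabilities.

        Returns a list of (onset, offset, label) tuples in seconds."""
        events = []
        for class_idx, label in enumerate(class_labels):
            active = (frame_probs[:, class_idx] > threshold).astype(int)
            active = median_filter(active, size=filter_frames)  # smooth spurious detections
            # Find contiguous runs of active frames.
            padded = np.concatenate(([0], active, [0]))
            onsets = np.where(np.diff(padded) == 1)[0]
            offsets = np.where(np.diff(padded) == -1)[0]
            for on, off in zip(onsets, offsets):
                events.append((on * FRAME_HOP, off * FRAME_HOP, label))
        return events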

Script description

The baseline system is a semi-supervised approach:

  • Download the data (only the first time).
  • First pass, at clip level:
    • Train a CRNN on the weak data (train/weak); 20% of the data is used for validation.
    • Predict labels for the unlabeled in-domain data (train/unlabel_in_domain).
  • Second pass, at frame level:
    • Train a CRNN on the unlabeled in-domain data using the labels predicted in the first pass (train/unlabel_in_domain); the weak data (train/weak) is used for validation. Note: labels are applied at frame level although annotations are at clip level, so if an event is present in the 10-second clip, all frames carry this label during training.
    • Predict strong labels on the test set (test/), i.e. events with an onset and an offset.
  • Evaluate the model by comparing the test annotations with the second-pass predictions (the metric is the macro-averaged event-based F-measure).

Python implementation


System performance

Event-based metrics with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets:

Performance on the latest test set version (July 6th)
Event-based overall metrics (macro-average):
  F-score   14.06 %
  ER        1.54

Note: This performance was obtained on a CPU-based system (Intel® Xeon E5-1630, 8 cores, 128 GB RAM). The total runtime was approximately 24 hours.
Note: The performance might not be exactly reproducible on a GPU-based system. However, it runs in around 8 hours on a single Nvidia GeForce 1080 Ti GPU.

Results

Each row lists the submission code, author, affiliation, a link to the technical report, and the event-based F-score (%) on the evaluation dataset.
Avdeeva_ITMO_task4_1 Anastasia Avdeeva Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018 20.1
Avdeeva_ITMO_task4_2 Anastasia Avdeeva Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018 19.5
Wang_NUDT_task4_1 Dezhi Wang School of Computer, National University of Defense Technology, Changsha, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 12.4
Wang_NUDT_task4_2 Dezhi Wang School of Computer, National University of Defense Technology, Changsha, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 12.6
Wang_NUDT_task4_3 Dezhi Wang School of Computer, National University of Defense Technology, Changsha, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 12.0
Wang_NUDT_task4_4 Dezhi Wang School of Computer, National University of Defense Technology, Changsha, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 12.2
Dinkel_SJTU_task4_1 Heinrich Dinkel Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 10.4
Dinkel_SJTU_task4_2 Heinrich Dinkel Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 10.7
Dinkel_SJTU_task4_3 Heinrich Dinkel Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 13.4
Dinkel_SJTU_task4_4 Heinrich Dinkel Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 11.2
Guo_THU_task4_1 Yingmei Guo Department of Computer Science and Technology, Tsinghua University, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 21.3
Guo_THU_task4_2 Yingmei Guo Department of Computer Science and Technology, Tsinghua University, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 20.6
Guo_THU_task4_3 Yingmei Guo Department of Computer Science and Technology, Tsinghua University, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 19.1
Guo_THU_task4_4 Yingmei Guo Department of Computer Science and Technology, Tsinghua University, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 19.0
Harb_TUG_task4_1 Robert Harb Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 19.4
Harb_TUG_task4_2 Robert Harb Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 15.7
Harb_TUG_task4_3 Robert Harb Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 21.6
Hou_BUPT_task4_1 Yuanbo Hou Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 19.6
Hou_BUPT_task4_2 Yuanbo Hou Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 18.9
Hou_BUPT_task4_3 Yuanbo Hou Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 20.9
Hou_BUPT_task4_4 Yuanbo Hou Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 21.1
CANCES_IRIT_task4_1 Cances Leo Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018 8.4
PELLEGRINI_IRIT_task4_2 Thomas Pellegrini Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018 16.6
Kothinti_JHU_task4_1 Sandeep Kothinti Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 20.6
Kothinti_JHU_task4_2 Sandeep Kothinti Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 20.9
Kothinti_JHU_task4_3 Sandeep Kothinti Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 20.9
Kothinti_JHU_task4_4 Sandeep Kothinti Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 22.4
Koutini_JKU_task4_1 Khaled Koutini Institute of Computational Perception, Johannes Kepler University, Linz, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 21.5
Koutini_JKU_task4_2 Khaled Koutini Institute of Computational Perception, Johannes Kepler University, Linz, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 21.1
Koutini_JKU_task4_3 Khaled Koutini Institute of Computational Perception, Johannes Kepler University, Linz, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 20.6
Koutini_JKU_task4_4 Khaled Koutini Institute of Computational Perception, Johannes Kepler University, Linz, Austria task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 18.8
Liu_USTC_task4_1 Yaming Liu National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 27.3
Liu_USTC_task4_2 Yaming Liu National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 28.8
Liu_USTC_task4_3 Yaming Liu National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 28.1
Liu_USTC_task4_4 Yaming Liu National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 29.9
LJK_PSH_task4_1 JaiKai Lu 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 24.1
LJK_PSH_task4_2 JaiKai Lu 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 26.3
LJK_PSH_task4_3 JaiKai Lu 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 29.5
LJK_PSH_task4_4 JaiKai Lu 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 32.4
Moon_YONSEI_task4_1 Moon Hyeongi School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018 15.9
Moon_YONSEI_task4_2 Moon Hyeongi School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018 14.3
Raj_IITKGP_task4_1 Rojin Raj Electronics and Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Raj2018 9.4
Lim_ETRI_task4_1 Wootaek Lim Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 17.1
Lim_ETRI_task4_2 Wootaek Lim Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 18.0
Lim_ETRI_task4_3 Wootaek Lim Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 19.6
Lim_ETRI_task4_4 Wootaek Lim Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 20.4
WangJun_BUPT_task4_2 Wang Jun Laboratory of Signal Processing & Knowledge Discovery, Beijing University of Posts and Telecommunications, Beijing, China task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangJ2018 17.9
DCASE2018 baseline Romain Serizel Department of Natural Language Processing & Knowledge Discovery, University of Lorraine, Loria, Nancy, France task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#SerizelJ2018 10.8
Baseline_Surrey_task4_1 Qiuqiang Kong Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 18.6
Baseline_Surrey_task4_2 Qiuqiang Kong Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 16.7
Baseline_Surrey_task4_3 Qiuqiang Kong Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 24.0

Complete results and technical reports can be found on the Task 4 results page.

Citation

If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following paper:
Publication

Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, and Ankit Parag Shah. Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 19–23, November 2018. URL: https://hal.inria.fr/hal-01850270.


Large-scale weakly labeled semi-supervised sound event detection in domestic environments

Abstract

This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are Youtube video excerpts from domestic context which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential applications.

Keywords

Sound event detection, Large scale, Weakly labeled data, Semi-supervised learning
