Sound event detection in domestic environments


Task description

The goal of the task is to evaluate systems for the detection of sound events using real data, either weakly labeled or unlabeled, and simulated data that is strongly labeled (with timestamps).

Challenge has ended. Full results for this task can be found in the Results page.

Description

This task is the follow-up to DCASE 2018 Task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). Systems are expected to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. The challenge of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance remains, but an additional training set with strongly annotated synthetic data is provided. The labels in all the annotated subsets are verified and can be considered reliable. An additional scientific question this task aims to investigate is whether real but partially and weakly annotated data is really needed, whether synthetic data is sufficient, or whether both are needed.

Figure 1: Overview of a sound event detection system.


Audio dataset

The dataset for this task is composed of 10-second audio clips recorded in a domestic environment or synthesized to simulate one. The task focuses on 10 classes of sound events that represent a subset of AudioSet (not all the classes are present in AudioSet as such; some of the sound event classes group several AudioSet classes):

  • Speech (Speech)
  • Dog (Dog)
  • Cat (Cat)
  • Alarm/bell/ringing (Alarm_bell_ringing)
  • Dishes (Dishes)
  • Frying (Frying)
  • Blender (Blender)
  • Running water (Running_water)
  • Vacuum cleaner (Vacuum_cleaner)
  • Electric shaver/toothbrush (Electric_shaver_toothbrush)

The dataset for DCASE 2019 Task 4 is composed of a subset with real recordings (from AudioSet) and a subset with synthetic recordings. The datasets used to generate the DCASE 2019 Task 4 dataset are described below.

AudioSet: Real recordings are extracted from AudioSet. It consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.


Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.


Freesound dataset: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology.

Publication

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), 486–493. Suzhou, China, 2017.


Freesound Datasets: a platform for the creation of open audio datasets

Abstract

Openly available datasets are a key factor in the advancement of data-driven research approaches, including many of the ones used in sound and music computing. In the last few years, quite a number of new audio datasets have been made available but there are still major shortcomings in many of them to have a significant research impact. Among the common shortcomings are the lack of transparency in their creation and the difficulty of making them completely open and sharable. They often do not include clear mechanisms to amend errors and many times they are not large enough for current machine learning needs. This paper introduces Freesound Datasets, an online platform for the collaborative creation of open audio datasets based on principles of transparency, openness, dynamic character, and sustainability. As a proof-of-concept, we present an early snapshot of a large-scale audio dataset built using this platform. It consists of audio samples from Freesound organised in a hierarchy based on the AudioSet Ontology. We believe that building and maintaining datasets following the outlined principles and using open tools and collaborative approaches like the ones presented here will have a significant impact in our research community.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.


SINS dataset: A derivative of the SINS dataset, used for DCASE 2018 Task 5, is used as the background for the synthetic subset of the dataset for DCASE 2019 Task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.

Publication

Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.


The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Abstract

There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks


Synthetic data generation procedure

The synthetic set is composed of 10-second audio clips generated with Scaper. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked whether the event onset and offset are present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class. The number of unique isolated events per class used to generate the synthetic set is given below:

Class # unique events
Speech 128
Dog 136
Cat 88
Alarm/bell/ringing 190
Dishes 109
Frying 64
Blender 98
Running water 68
Vacuum cleaner 74
Electric shaver/toothbrush 56
Total 1011

The background textures were obtained from the SINS dataset (class "other"). This particular class was selected because it contains few sound events from the 10 target sound event classes. However, there is no guarantee that these sound event classes are totally absent from the background clips. The number of unique background clips used to generate the synthetic dataset is given below:

Class # background clips
Other 2060

Scaper scripts are designed such that the distribution of sound events per class, the number of sound events per clip (depending on the class) and the sound event class co-occurrences are similar to those of the validation set composed of real recordings.
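The distribution matching described above can be sketched as follows. This is a hypothetical simplification of the organizers' Scaper scripts: the per-clip event-count probabilities below are illustrative, not the actual statistics of the validation set.

```python
import random

# Illustrative (not actual) per-clip event-count distribution for one class:
# P(n events in a clip) for n = 1, 2, 3.
EVENT_COUNT_DIST = {1: 0.6, 2: 0.3, 3: 0.1}

def sample_event_count(dist, rng):
    """Draw a number of events for one clip from a discrete distribution."""
    r = rng.random()
    cumulative = 0.0
    for n, p in sorted(dist.items()):
        cumulative += p
        if r < cumulative:
            return n
    return max(dist)  # guard against floating-point rounding

rng = random.Random(0)  # fixed seed for reproducibility
counts = [sample_event_count(EVENT_COUNT_DIST, rng) for _ in range(10000)]
# The empirical frequencies should be close to the target distribution.
freq = {n: counts.count(n) / len(counts) for n in EVENT_COUNT_DIST}
```

Generating many clips this way makes the empirical per-class distributions converge toward the target ones, which is the property the paragraph above requires of the synthetic set.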

Publication

Justin Salamon, Duncan MacConnell, Mark Cartwright, Peter Li, and Juan Pablo Bello. Scaper: a library for soundscape synthesis and augmentation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 344–348. IEEE, 2017.


Reference labels

AudioSet provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered weak labels. Google researchers conducted a quality assessment in which experts were exposed to 10 randomly selected clips for each class, and found that in most cases not all the clips contain the event corresponding to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][class_label (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav  Alarm_bell_ringing,Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec and t=60 sec correspond to the clip boundaries within the full video) and the second column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by a comma.
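A minimal parser for this weak-label format might look like the following sketch (the helper names are hypothetical; only the tab-separated layout and file-name convention described above are assumed):

```python
import csv
import io

def parse_weak_annotations(fileobj):
    """Parse a tab-separated weak-annotation file into
    {filename: [class_label, ...]} entries."""
    annotations = {}
    reader = csv.reader(fileobj, delimiter="\t")
    for row in reader:
        if not row:
            continue
        filename, labels = row[0], row[1]
        annotations[filename] = labels.split(",")
    return annotations

def parse_clip_boundaries(filename):
    """Recover the clip identifier and the start/end times (in seconds)
    encoded in the file name."""
    stem = filename.rsplit(".", 1)[0]             # drop the ".wav" extension
    clip_id, onset, offset = stem.rsplit("_", 2)  # times are the last two fields
    return clip_id, float(onset), float(offset)

example = io.StringIO("Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing,Dog\n")
weak = parse_weak_annotations(example)
```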

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see below for a detailed explanation of the development set).

The synthetic subset of the development set is also annotated with strong annotations obtained from Scaper. Each sound clip from FSD was verified by humans in order to check that the event class present in the FSD annotation was indeed dominant in the audio clip.

In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms; when the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]

For example:

YOTsn73eqbfc_10.000_20.000.wav  0.163   0.665   Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column (0.163) is the onset time in seconds, the third column (0.665) is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
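The merging rule above can be sketched as follows (a simplified illustration of the annotation post-processing, not the organizers' actual script):

```python
def merge_close_events(events, min_gap=0.150):
    """Merge same-class events separated by less than `min_gap` seconds.
    `events` is a list of (onset, offset) pairs for a single class."""
    merged = []
    for onset, offset in sorted(events):
        if merged and onset - merged[-1][1] < min_gap:
            # Silence shorter than min_gap: extend the previous event.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged
```

For example, two events at (0.0, 0.3) and (0.40, 0.8) are separated by 100 ms of silence, so they become a single event (0.0, 0.8); with a 200 ms gap they stay separate.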

Download

The content of the development set is structured in the following manner:

dataset root
└───metadata                          (directories containing the annotations files)
│   │
│   └───train                         (annotations for the training sets)
│   │     weak.csv                    (weakly labeled training set list and annotations)
│   │     unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │     synthetic.csv               (synthetic data training set list and annotations)
│   │
│   └───validation                    (annotations for the test set)
│         validation.csv              (validation set list with strong labels)
│         test_2018.csv               (test set list with strong labels - DCASE 2018)
│         eval_2018.csv               (eval set list with strong labels - DCASE 2018)
│    
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the training sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───synthetic                 (synthetic data training set)
    │
    └───validation                    (validation set)       

The dataset is composed of two subsets that can be downloaded independently. The procedure to download each subset is described below.

Real recordings

The annotation files and the script to download the audio files are available on the git repository for task 4. This subset is 23.4 GB; the download/extraction process can take approximately 4 hours.

If you experience problems during the download of this subset please contact the task organizers. (Nicolas Turpault and Romain Serizel in priority)

Synthetic clips

Task setup

The challenge consists of detecting sound events within web videos using training data from real recordings, both weakly labeled and unlabeled, and synthetic audio clips that are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps.

Development dataset

The development set is divided into two main partitions: training and validation.

Training set

To motivate the participants to come up with innovative solutions, we provide 3 different splits of training data in our training set: a labeled training set, an unlabeled in-domain training set and a synthetic set with strong annotations. The first two sets are the same as in DCASE 2018 Task 4.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The number of clips per class is as follows:

Class # 10s clips containing the event
Speech 550
Dog 214
Cat 173
Alarm/bell/ringing 205
Dishes 184
Frying 171
Blender 134
Running water 343
Vacuum cleaner 167
Electric shaver/toothbrush 103
Total 2244

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on AudioSet annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty of AudioSet labels this distribution might not match exactly.

Synthetic strongly labeled set:
This set is composed of 2045 clips generated with Scaper. The clips are generated such that the distribution per event is close to that of the validation set. Note that a 10-second clip may contain more than one sound event.

Class # events
Speech 2132
Dog 516
Cat 547
Alarm/bell/ringing 755
Dishes 814
Frying 137
Blender 540
Running water 157
Vacuum cleaner 204
Electric shaver/toothbrush 230
Total 6032

Validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. It is the fusion of the DCASE 2018 Task 4 test and evaluation sets (for comparison purposes, the original csv files are also provided). The validation set contains 1168 clips (4093 events). It is annotated with strong labels, with timestamps obtained by human annotators. Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

Class # events
Speech 1662
Dog 577
Cat 340
Alarm/bell/ringing 418
Dishes 492
Frying 91
Blender 96
Running water 230
Vacuum cleaner 92
Electric shaver/toothbrush 65
Total 4093

Evaluation dataset

The dataset is composed of 10-second audio clips.

  • A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes.

  • A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.

    • The foreground events are obtained from the Freesound dataset. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked whether the event onset and offset are present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.

    • Background sounds are extracted from YouTube videos under Creative Commons licenses and from the free-sound subset of the MUSAN dataset.

    • Audio clips are artificially degraded using the Audio Degradation Toolbox.

Publication

David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: a music, speech, and noise corpus. 2015. arXiv:1510.08484.


MUSAN: A Music, Speech, and Noise Corpus

Abstract

This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.

Publication

Matthias Mauch and Sebastian Ewert. The audio degradation toolbox and its application to robustness evaluation. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), 83–88. Curitiba, Brazil, 2013.


The Audio Degradation Toolbox and its Application to Robustness Evaluation

Abstract

We introduce the Audio Degradation Toolbox (ADT) for the controlled degradation of audio signals, and propose its usage as a means of evaluating and comparing the robustness of audio processing algorithms. Music recordings encountered in practical applications are subject to varied, sometimes unpredictable degradation. For example, audio is degraded by low-quality microphones, noisy recording environments, MP3 compression, dynamic compression in broadcasting or vinyl decay. In spite of this, no standard software for the degradation of audio exists, and music processing methods are usually evaluated against clean data. The ADT fills this gap by providing Matlab scripts that emulate a wide range of degradation types. We describe 14 degradation units, and how they can be chained to create more complex, ‘real-world’ degradations. The ADT also provides functionality to adjust existing ground-truth, correcting for temporal distortions introduced by degradation. Using four different music informatics tasks, we show that performance strongly depends on the combination of method and degradation applied. We demonstrate that specific degradations can reduce or even reverse the performance difference between two competing methods. ADT source code, sounds, impulse responses and definitions are freely available for download.


More detailed information will be provided after the evaluation period. This will include information about the dataset split between real audio clips and synthetic audio clips.

Task rules

  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Other examples of external data are materials related to the video from which the 10-sec clip was extracted, such as the rest of its audio, the video frames and the metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • For the real recordings (from Audioset), only weak labels and none of the strong labels (timestamps) or original (Audioset) labels can be used for the training of the submitted system. For the synthetic clips, strong labels provided can be used.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
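As an illustration of the kind of label-preserving augmentation permitted above, a naive time stretch by waveform resampling might look like the following sketch (real systems typically use phase-vocoder methods or dedicated audio libraries; this minimal version also shifts pitch):

```python
def time_stretch(signal, rate):
    """Naively time-stretch a waveform by linear interpolation.
    rate > 1 shortens the signal (faster); rate < 1 lengthens it.
    Note: plain resampling also shifts the pitch; this is only a sketch."""
    out_len = int(len(signal) / rate)
    stretched = []
    for i in range(out_len):
        pos = i * rate                         # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        stretched.append((1 - frac) * signal[lo] + frac * signal[hi])
    return stretched

clip = [0.0, 1.0, 0.0, -1.0] * 100  # toy 400-sample waveform
slower = time_stretch(clip, 0.5)    # twice as long
```

When augmenting strongly labeled data this way, the event onset and offset annotations must be rescaled by the same rate.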

Evaluation

Submissions will be evaluated with event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event's length collar on offsets. Submissions will be ranked according to the event-based F1-score computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the ranking). Additionally, a segment-based F1-score on 1 s segments will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox:
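The collar-based criterion can be sketched as follows. This is a simplified, greedy illustration of event-based matching for a single class, not a substitute for the sed_eval implementation:

```python
def events_match(ref, est, t_collar=0.200, pct_of_length=0.20):
    """Check whether an estimated event matches a reference event: onsets
    within t_collar, offsets within max(t_collar, pct * reference length)."""
    ref_on, ref_off = ref
    est_on, est_off = est
    offset_collar = max(t_collar, pct_of_length * (ref_off - ref_on))
    return abs(ref_on - est_on) <= t_collar and abs(ref_off - est_off) <= offset_collar

def event_f1(reference, estimated):
    """Greedy one-to-one matching, then F1 = 2TP / (2TP + FP + FN)."""
    unmatched = list(estimated)
    tp = 0
    for ref in reference:
        for est in unmatched:
            if events_match(ref, est):
                unmatched.remove(est)  # each estimate matches at most one reference
                tp += 1
                break
    fp = len(estimated) - tp
    fn = len(reference) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

For example, a reference event (0.0, 2.0) is matched by an estimate (0.1, 2.3): the onset error is 100 ms and the offset collar stretches to 400 ms (20% of the 2 s length).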


Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.


Awards

This task will offer two awards, not necessarily based on the evaluation set performance ranking. These awards aim to encourage contestants to openly publish their code, and to use novel and problem-specific approaches which leverage knowledge of the audio domain. We also highly encourage student authorship.

Reproducible system award

Reproducible system award of 500 USD will be offered for the highest scoring method that is open-source and fully reproducible. For full reproducibility, the authors must provide all the information needed to run the system and achieve the reported performance. The choice of licence is left to the author, but should ideally be selected among the ones approved by the Open Source Initiative.

Judges’ award

Judges’ award of 500 USD will be offered for the method considered by the judges to be the most interesting or innovative. Criteria considered for this award include but are not limited to: originality, complexity, student participation, open-source, etc. Single model approaches are strongly preferred over ensembles; occasionally, small ensembles of different models can be considered, if the approach is innovative.

More information can be found on the Award page.


The awards are sponsored by

Gold sponsor: Sonos
Silver sponsor: Harman
Bronze sponsors: Cochlear.ai, Oticon, Sound Intelligence
Technical sponsor: Inria

Baseline

System description

The baseline system is based on the idea of the best submission to DCASE 2018 Task 4. The author provided his system code, and most of the hyper-parameters of this year's baseline are close to those defined by last year's winner. However, the network architecture itself remains similar to last year's baseline, so it is much simpler than the networks used by Lu JiaKai.

Publication

Lu JiaKai. Mean teacher convolution system for dcase 2018 task 4. Technical Report, DCASE2018 Challenge, September 2018.


Mean Teacher Convolution System for DCASE 2018 Task 4

Abstract

In this paper, we present our neural network for the DCASE 2018 challenge’s Task 4 (Large-scale weakly labeled semi-supervised sound event detection in domestic environments). This task evaluates systems for the large-scale detection of sound events using weakly labeled data, and explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance to doing audio tagging and sound event detection. We propose a mean-teacher model with context-gating convolutional neural network (CNN) and recurrent neural network (RNN) to maximize the use of unlabeled in-domain dataset.


The baseline uses a mean-teacher model composed of two networks that share the same CRNN architecture. The implementation of the mean-teacher model is based on the one by Tarvainen & Valpola from Curious AI.

Publication

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204. 2017.


Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Abstract

The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.


The model is trained as follows:

  • The teacher model is trained on synthetic and weakly labeled data. The classification cost is computed at frame level on synthetic data and at clip level on weakly labeled data.
  • The student model is not trained; its weights are a moving average of the teacher model (updated at each epoch).
  • The inputs of the student model are the inputs of the teacher model plus some Gaussian noise.
  • A consistency cost between the teacher and student models is applied (for both weak and strong predictions).
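The moving-average update of the student weights can be sketched as follows (a minimal illustration with plain Python lists standing in for network parameters; the `ema_decay` value is an assumption, not the baseline's actual hyper-parameter):

```python
def ema_update(student_weights, teacher_weights, ema_decay=0.999):
    """Exponential moving average: move each student weight a small step
    toward the corresponding teacher weight."""
    return [ema_decay * s + (1.0 - ema_decay) * t
            for s, t in zip(student_weights, teacher_weights)]

student = [0.0, 0.0]
teacher = [1.0, -1.0]
for _ in range(5):  # one update per epoch, as in the training description
    student = ema_update(student, teacher)
```

Because the student is never updated by gradient descent, it provides a smoothed version of the teacher against which the consistency cost is computed.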

The baseline exploits unlabeled, weakly labeled and synthetic data for training and is trained for 100 epochs. Inputs are 864 frames long. The CRNN model pools in time down to 108 frames. Post-processing (median filtering over 5 frames) is used to obtain the event onsets and offsets for each file. The baseline system includes evaluation of the results using the event-based F-score as metric.
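The post-processing step can be sketched as follows (a simplified illustration: the 0.5 threshold is an assumption, not the baseline's actual setting; the hop assumes the 108 frames span a 10-second clip):

```python
def median_filter(probs, size=5):
    """Median-filter a sequence of frame-level probabilities."""
    half = size // 2
    out = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        out.append(sorted(window)[len(window) // 2])
    return out

def decode_events(probs, threshold=0.5, hop=10.0 / 108):
    """Turn smoothed frame probabilities into (onset, offset) pairs."""
    events, onset = [], None
    for i, p in enumerate(probs + [0.0]):  # sentinel closes a trailing event
        if p > threshold and onset is None:
            onset = i * hop
        elif p <= threshold and onset is not None:
            events.append((onset, i * hop))
            onset = None
    return events

# Smoothing removes the isolated low frame inside the event region.
smoothed = median_filter([0.1, 0.9, 0.2, 0.9, 0.8, 0.9, 0.1, 0.1])
```

Without the median filter, the dip at the third frame would split the event in two; with it, a single contiguous event is decoded.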

Python Implementation


System Performance

Performance is reported on this year's validation set (Validation 2019) and on the evaluation set from DCASE 2018 Task 4.

F-score metrics (macro averaged)
Validation 2019 Evaluation 2018
Event-based 23.7 % 20.6 %
Segment-based 55.2 % 51.4 %

Note: The performance might not be exactly reproducible on a GPU-based system. For this reason, the weights of the networks used for the experiments are provided; run TestModel.py --model_path="Path_of_model" to reproduce the results.

Results

Submission code  Author  Affiliation  Technical report  Event-based F-score (Evaluation dataset)
Wang_NUDT_task4_4 Dezhi Wang National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China task-sound-event-detection-in-domestic-environments-results#Wang2019 16.8
Wang_NUDT_task4_3 Dezhi Wang National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China task-sound-event-detection-in-domestic-environments-results#Wang2019 17.5
Wang_NUDT_task4_2 Dezhi Wang National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China task-sound-event-detection-in-domestic-environments-results#Wang2019 17.2
Wang_NUDT_task4_1 Dezhi Wang National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China task-sound-event-detection-in-domestic-environments-results#Wang2019 17.2
Delphin_OL_task4_2 Lionel Delphin-Poulat Orange Labs, HOME/CONTENT, Lannion, France task-sound-event-detection-in-domestic-environments-results#Delphin-Poulat2019 42.1
Delphin_OL_task4_1 Lionel Delphin-Poulat Orange Labs, HOME/CONTENT, Lannion, France task-sound-event-detection-in-domestic-environments-results#Delphin-Poulat2019 38.3
Kong_SURREY_task4_1 Qiuqiang Kong University of Surrey, Centre for Vision, Speech and Signal Processing (CVSSP), Guildford, England task-sound-event-detection-in-domestic-environments-results#Kong2019 22.3
CTK_NU_task4_2 Teck Kai Chan Newcastle University, Singapore, Faculty of Science, Agriculture, Engineering, Singapore task-sound-event-detection-in-domestic-environments-results#Chan2019 29.7
CTK_NU_task4_3 Teck Kai Chan Newcastle University, Singapore, Faculty of Science, Agriculture, Engineering, Singapore task-sound-event-detection-in-domestic-environments-results#Chan2019 27.7
CTK_NU_task4_4 Teck Kai Chan Newcastle University, Singapore, Faculty of Science, Agriculture, Engineering, Singapore task-sound-event-detection-in-domestic-environments-results#Chan2019 26.9
CTK_NU_task4_1 Teck Kai Chan Newcastle University, Singapore, Faculty of Science, Agriculture, Engineering, Singapore task-sound-event-detection-in-domestic-environments-results#Chan2019 31.0
Mishima_NEC_task4_3 Sakiko Mishima NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Mishima2019 18.3
Mishima_NEC_task4_4 Sakiko Mishima NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Mishima2019 19.8
Mishima_NEC_task4_2 Sakiko Mishima NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Mishima2019 17.7
Mishima_NEC_task4_1 Sakiko Mishima NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Mishima2019 16.7
CANCES_IRIT_task4_2 Thomas Pellegrini Université Paul Sabatier Toulouse III, Institut de Recherche en Informatique de Toulouse, Theme 1 - Signal Image, Toulouse, France task-sound-event-detection-in-domestic-environments-results#Cances2019 28.4
CANCES_IRIT_task4_1 Thomas Pellegrini Université Paul Sabatier Toulouse III, Institut de Recherche en Informatique de Toulouse, Theme 1 - Signal Image, Toulouse, France task-sound-event-detection-in-domestic-environments-results#Cances2019 26.1
PELLEGRINI_IRIT_task4_1 Thomas Pellegrini Université Paul Sabatier Toulouse III, Institut de Recherche en Informatique de Toulouse, Theme 1 - Signal Image, Toulouse, France task-sound-event-detection-in-domestic-environments-results#Cances2019 39.7
Lin_ICT_task4_2 Liwei Lin Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments-results#Lin2019 40.9
Lin_ICT_task4_4 Liwei Lin Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments-results#Lin2019 41.8
Lin_ICT_task4_3 Liwei Lin Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments-results#Lin2019 42.7
Lin_ICT_task4_1 Liwei Lin Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments-results#Lin2019 40.7
Baseline_dcase2019 Romain Serizel University of Lorraine, Loria, Department of Natural Language Processing & Knowledge Discovery, Nancy, France task-sound-event-detection-in-domestic-environments-results#Turpault2019 25.8
bolun_NWPU_task4_1 Bolun Wang Northwestern Polytechnical University, School of Computer Science, Xi'an, China task-sound-event-detection-in-domestic-environments-results#Bolun2019 21.7
bolun_NWPU_task4_4 Bolun Wang Northwestern Polytechnical University, School of Computer Science, Xi'an, China task-sound-event-detection-in-domestic-environments-results#Bolun2019 25.3
bolun_NWPU_task4_3 Bolun Wang Northwestern Polytechnical University, School of Computer Science, Xi'an, China task-sound-event-detection-in-domestic-environments-results#Bolun2019 23.8
bolun_NWPU_task4_2 Bolun Wang Northwestern Polytechnical University, School of Computer Science, Xi'an, China task-sound-event-detection-in-domestic-environments-results#Bolun2019 27.8
Agnone_PDL_task4_1 Anthony Agnone Pindrop, Audio Research, Atlanta, GA task-sound-event-detection-in-domestic-environments-results#Agnone2019 25.0
Kiyokawa_NEC_task4_1 Yu Kiyokawa NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Kiyokawa2019 27.8
Kiyokawa_NEC_task4_4 Yu Kiyokawa NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Kiyokawa2019 32.4
Kiyokawa_NEC_task4_3 Yu Kiyokawa NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Kiyokawa2019 29.4
Kiyokawa_NEC_task4_2 Yu Kiyokawa NEC Corporation, Central Research Laboratories, Kanagawa, Japan task-sound-event-detection-in-domestic-environments-results#Kiyokawa2019 28.3
Kothinti_JHU_task4_2 Sandeep Kothinti Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments-results#Kothinti2019 30.5
Kothinti_JHU_task4_3 Sandeep Kothinti Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments-results#Kothinti2019 29.0
Kothinti_JHU_task4_4 Sandeep Kothinti Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments-results#Kothinti2019 29.4
Kothinti_JHU_task4_1 Sandeep Kothinti Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments-results#Kothinti2019 30.7
Shi_FRDC_task4_2 Ziqiang Shi Fujitsu Research and Development Center, Information Technology Laboratory, Beijing, China task-sound-event-detection-in-domestic-environments-results#Shi2019 42.0
Shi_FRDC_task4_3 Ziqiang Shi Fujitsu Research and Development Center, Information Technology Laboratory, Beijing, China task-sound-event-detection-in-domestic-environments-results#Shi2019 40.9
Shi_FRDC_task4_4 Ziqiang Shi Fujitsu Research and Development Center, Information Technology Laboratory, Beijing, China task-sound-event-detection-in-domestic-environments-results#Shi2019 41.5
Shi_FRDC_task4_1 Ziqiang Shi Fujitsu Research and Development Center, Information Technology Laboratory, Beijing, China task-sound-event-detection-in-domestic-environments-results#Shi2019 37.0
ZYL_UESTC_task4_1 Zhang Zhenyuan University of Electronic Science and Technology of China, Department of Internet of Things Engineering, Chengdu, China task-sound-event-detection-in-domestic-environments-results#Zhang2019 29.4
ZYL_UESTC_task4_2 Zhang Zhenyuan University of Electronic Science and Technology of China, Department of Internet of Things Engineering, Chengdu, China task-sound-event-detection-in-domestic-environments-results#Zhang2019 30.8
Wang_YSU_task4_1 Qian Yang Yanshan University, Information Science and Engineering, Qinhuangdao, China task-sound-event-detection-in-domestic-environments-results#Yang2019 6.5
Wang_YSU_task4_2 Qian Yang Yanshan University, Information Science and Engineering, Qinhuangdao, China task-sound-event-detection-in-domestic-environments-results#Yang2019 6.2
Wang_YSU_task4_3 Qian Yang Yanshan University, Information Science and Engineering, Qinhuangdao, China task-sound-event-detection-in-domestic-environments-results#Yang2019 6.7
Yan_USTC_task4_1 Jie Yan University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments-results#Yan2019 35.8
Yan_USTC_task4_3 Jie Yan University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments-results#Yan2019 35.6
Yan_USTC_task4_4 Jie Yan University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments-results#Yan2019 33.5
Yan_USTC_task4_2 Jie Yan University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments-results#Yan2019 36.2
Lee_KNU_task4_2 Seokjin Lee Kyungpook National University, School of Electronics Engineering, Daegu, Republic of Korea task-sound-event-detection-in-domestic-environments-results#Lee2019 25.8
Lee_KNU_task4_4 Seokjin Lee Kyungpook National University, School of Electronics Engineering, Daegu, Republic of Korea task-sound-event-detection-in-domestic-environments-results#Lee2019 24.6
Lee_KNU_task4_3 Seokjin Lee Kyungpook National University, School of Electronics Engineering, Daegu, Republic of Korea task-sound-event-detection-in-domestic-environments-results#Lee2019 26.7
Lee_KNU_task4_1 Seokjin Lee Kyungpook National University, School of Electronics Engineering, Daegu, Republic of Korea task-sound-event-detection-in-domestic-environments-results#Lee2019 26.4
Rakowski_SRPOL_task4_1 Alexander Rakowski Samsung R&D Institute Poland, Audio Intelligence, Warsaw, Poland task-sound-event-detection-in-domestic-environments-results#Rakowski2019 24.2
Lim_ETRI_task4_1 Wootaek Lim Electronics and Telecommunications Research Institute, Realistic AV Research Group, Daejeon, Korea task-sound-event-detection-in-domestic-environments-results#Lim2019 32.6
Lim_ETRI_task4_2 Wootaek Lim Electronics and Telecommunications Research Institute, Realistic AV Research Group, Daejeon, Korea task-sound-event-detection-in-domestic-environments-results#Lim2019 33.2
Lim_ETRI_task4_3 Wootaek Lim Electronics and Telecommunications Research Institute, Realistic AV Research Group, Daejeon, Korea task-sound-event-detection-in-domestic-environments-results#Lim2019 32.5
Lim_ETRI_task4_4 Wootaek Lim Electronics and Telecommunications Research Institute, Realistic AV Research Group, Daejeon, Korea task-sound-event-detection-in-domestic-environments-results#Lim2019 34.4


Complete results and technical reports can be found on the Task 4 results page.
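The systems above are ranked by event-based F-score, which counts a predicted event as a true positive only when its class label matches a reference event and its onset and offset fall within tolerance collars of that reference event (the official evaluation uses a 200 ms onset collar and an offset collar of max(200 ms, 20% of the event duration), computed with the sed_eval toolbox). The following is a minimal self-contained sketch of that matching logic, not the official implementation; the tuple-based event representation is an assumption made for illustration:

```python
def event_based_f1(reference, estimated, t_collar=0.2, offset_ratio=0.2):
    """Simplified event-based F-score with onset/offset collars.

    Each event is a (label, onset, offset) tuple, times in seconds.
    An estimated event matches a reference event when the labels agree,
    onsets differ by at most t_collar, and offsets differ by at most
    max(t_collar, offset_ratio * reference event duration).
    """
    matched = set()  # indices of reference events already matched
    tp = 0
    for est_label, est_on, est_off in estimated:
        for i, (ref_label, ref_on, ref_off) in enumerate(reference):
            if i in matched:
                continue
            offset_collar = max(t_collar, offset_ratio * (ref_off - ref_on))
            if (est_label == ref_label
                    and abs(est_on - ref_on) <= t_collar
                    and abs(est_off - ref_off) <= offset_collar):
                matched.add(i)  # greedy one-to-one matching
                tp += 1
                break
    fp = len(estimated) - tp   # predictions with no matching reference
    fn = len(reference) - tp   # reference events never matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Example: one correct event (label and boundaries within collars),
# one label error, one missed reference event.
ref = [("Speech", 1.0, 3.0), ("Dog", 5.0, 6.0)]
est = [("Speech", 1.1, 2.9), ("Cat", 5.0, 6.0)]
print(round(event_based_f1(ref, est), 2))  # 0.5
```

Because segment boundaries only need to fall within the collars, systems with slightly imprecise onsets can still score well; this is why the metric is far less forgiving than a clip-level (tagging) F-score.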

Citation

If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. working paper or preprint, June 2019. URL: https://hal.inria.fr/hal-02160855.


Keywords

Sound event detection; Weakly labeled data; Semi-supervised learning; Synthetic data