The task evaluates systems for the large-scale detection of sound events using weakly labeled data. The challenge is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance.

Note to participants (September 20th): Minor bug in the evaluation script.

We found a minor bug in the evaluation script: some of the files were not taken into account correctly. This has been fixed and the evaluation scores have been updated (resulting in at most a 0.3% decrease of the F-score). This had no impact on the team ranking. However, the system ranking may have changed very slightly (around ranks 15~25).

Description

The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications.

Figure 1: Overview of a sound event detection system.

Audio dataset

The task employs a subset of “Audioset: An Ontology And Human-Labeled Dataset For Audio Events” by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. We will focus on a subset of Audioset that consists of 10 classes of sound events (each class name is followed by the label used in the annotation files):

Class	Label
Speech	Speech
Dog	Dog
Cat	Cat
Alarm/bell/ringing	Alarm_bell_ringing
Dishes	Dishes
Frying	Frying
Blender	Blender
Running water	Running_water
Vacuum cleaner	Vacuum_cleaner
Electric shaver/toothbrush	Electric_shaver_toothbrush

Recording and annotation procedure

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the annotations are considered as weak labels.

Audio clips are collected from YouTube videos uploaded by independent users, so the number of clips per class varies dramatically and the dataset is not balanced (see also https://research.google.com/Audioset//dataset/index.html).

Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation. More information about the initial annotation process can be found in:

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.


Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.


The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file under the following format:

[filename (string)][tab][class_label (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav  Alarm_bell_ringing;Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; 50.000 and 60.000 are the clip boundaries within the full video, from t=50 sec to t=60 sec). The last column, Alarm_bell_ringing;Dog, lists the sound classes present in the clip, separated by semicolons.
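As an illustration of the format above, a weak annotation file could be read along the following lines. This is a minimal sketch, not the organizers' code: the function names are hypothetical, and the check for a header row whose first field is `filename` is an assumption about the file layout.

```python
import csv
import io


def parse_clip_name(filename):
    """Split '<id>_<start>_<end>.wav' into (id, start_sec, end_sec)."""
    stem = filename[:-len(".wav")] if filename.endswith(".wav") else filename
    # The ID part may itself contain underscores, so split from the right.
    clip_id, start, end = stem.rsplit("_", 2)
    return clip_id, float(start), float(end)


def read_weak_annotations(handle):
    """Return {filename: [class labels]} from a tab-separated weak label file."""
    annotations = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and an (assumed) header row.
        if len(row) < 2 or row[0] == "filename":
            continue
        annotations[row[0]] = row[1].split(";")
    return annotations


# The example row from the text, held in memory instead of a file on disk.
sample = "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing;Dog\n"
weak = read_weak_annotations(io.StringIO(sample))
```

Splitting the file name from the right keeps the sketch robust to identifiers that contain underscores.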

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set). The minimum length for an event is 250ms. The minimum duration of the pause between two events from the same class is 150ms. When the silence between two consecutive events from the same class was less than 150ms, the events were merged into a single event. The strong annotations are provided in a tab separated csv file under the following format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]

For example:

YOTsn73eqbfc_10.000_20.000.wav  0.163   0.665   Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file downloaded from Youtube, the second column 0.163 is the onset time in seconds, the third column 0.665 is the offset time in seconds and the last column, Alarm_bell_ringing corresponds to the class of the sound event.
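The strong annotation format and the 150ms merging rule described above can be sketched as follows. This is an illustrative reconstruction, not the procedure the annotators actually ran: the function names are hypothetical, and the second and third rows of the sample are invented to show the merge.

```python
from collections import defaultdict

MIN_GAP = 0.150  # seconds; shorter pauses between same-class events are merged


def parse_strong_row(line):
    """Parse 'filename<TAB>onset<TAB>offset<TAB>label' into a typed tuple."""
    filename, onset, offset, label = line.rstrip("\n").split("\t")
    return filename, float(onset), float(offset), label


def merge_close_events(events, min_gap=MIN_GAP):
    """Merge events of the same file and class separated by less than min_gap."""
    grouped = defaultdict(list)
    for filename, onset, offset, label in events:
        grouped[(filename, label)].append((onset, offset))

    merged = []
    for (filename, label), spans in grouped.items():
        spans.sort()
        cur_on, cur_off = spans[0]
        for onset, offset in spans[1:]:
            if onset - cur_off < min_gap:
                cur_off = max(cur_off, offset)  # extend the current event
            else:
                merged.append((filename, cur_on, cur_off, label))
                cur_on, cur_off = onset, offset
        merged.append((filename, cur_on, cur_off, label))
    return sorted(merged)


rows = [
    "YOTsn73eqbfc_10.000_20.000.wav\t0.163\t0.665\tAlarm_bell_ringing",
    # Hypothetical rows added only to demonstrate the merging rule.
    "YOTsn73eqbfc_10.000_20.000.wav\t0.700\t1.200\tAlarm_bell_ringing",
    "YOTsn73eqbfc_10.000_20.000.wav\t2.000\t2.500\tAlarm_bell_ringing",
]
events = merge_close_events([parse_strong_row(r) for r in rows])
```

Here the first two events are 35ms apart and collapse into one, while the third stays separate.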

Download

The annotation files and the script to download the audio files are available in the git repository for task 4. The final dataset is 82 GB; the download/extraction process can take approximately 12 hours. (If you are using the provided baseline system, there is no need to download the dataset, as the system will automatically download the needed dataset for you.)

If you experience problems during the download of the dataset please contact the task organizers. (Nicolas Turpault and Romain Serizel in priority)

DCASE2018 Task 4 development dataset, repository

The content of the development set is structured in the following manner:

dataset root
│   readme.md                         (instructions to run the script that downloads the audio files and description of the annotations files)
│   download_data.py                  (script to download the files)
│
└───metadata                          (directories containing the annotations files)
│   │
│   └───train                         (annotations for the train sets)
│   │     weak.csv                    (weakly labeled training set list)
│   │     unlabel_in_domain.csv       (unlabeled in domain training set list)
│   │     unlabel_out_of_domain.csv   (unlabeled out-of-domain training set list)
│   │
│   └───test                          (annotations for the test set)
│         test.csv                    (test set list with strong labels)
│    
└───audio                             (directories where the audio files will be downloaded)
    └───train                         (audio files for the train sets)
    │   └───weak                      (weakly labeled training set)
    │   └───unlabel_in_domain         (unlabeled in domain training set)
    │   └───unlabel_out_of_domain     (unlabeled out-of-domain training set)
    │
    └───test                          (test set)

Task setup

The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed with start and end timestamps.

Development dataset

The development set is divided into two main partitions: training and test.

Training set

To motivate the participants to come up with innovative solutions, we provide 3 different splits of training data in our training set: Labeled training set, Unlabeled in domain training set and Unlabeled out of domain training set.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The number of clips per class is the following:

Class	# 10s clips containing the event
Speech	550
Dog	214
Cat	173
Alarm/bell/ringing	205
Dishes	184
Frying	171
Blender	134
Running water	343
Vacuum cleaner	167
Electric shaver/toothbrush	103
Total	2244

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note however that given the uncertainty on Audioset labels this distribution might not be exactly similar.

Unlabeled out of domain training set:
This set is composed of 39999 clips extracted from classes that are not considered in the task. Note that these clips are chosen based on the Audioset labels, which are not verified and therefore might be noisy. Additionally, as speech is present in half of the segments of Audioset, this set also contains almost 20000 clips with speech. This was the only way to have an unlabel_out_of_domain set which is somewhat representative of Audioset. Indeed, discarding speech would also have meant discarding many other classes, and the variability of the set would have suffered.

Test set

The test set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The size of the test set is such that it represents about 20% of the size of the labeled training set: it contains 288 clips (906 events). The test set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

Class	# events
Speech	261
Dog	127
Cat	97
Alarm/bell/ringing	112
Dishes	122
Frying	24
Blender	40
Running water	76
Vacuum cleaner	36
Electric shaver/toothbrush	28
Total	906

Evaluation dataset

The evaluation set is a subset of Audioset including the classes mentioned above. Labels with timestamps (obtained by human annotators) will be released after the DCASE 2018 challenge is concluded. The evaluation dataset can be downloaded from the following location:

DCASE2018 Task 4 evaluation dataset, repository

Submission

Detailed information for the challenge submission can be found on the submission page.

System output should be presented as a single text-file (in CSV format) containing predictions for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

If no event is detected for the particular audio signal, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
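A submission file in this format could be produced as sketched below. The function name, the example file names, and the three-decimal rounding are assumptions made for illustration; the task text only fixes the column order, the tab separator, and the filename-only row for clips without detections.

```python
import csv
import io


def write_submission(handle, predictions, all_files):
    """Write one tab-separated row per detected event.

    predictions: {filename: [(onset_sec, offset_sec, label), ...]}
    all_files:   every file of the evaluation set, so that files without
                 detections still get a row containing only the file name.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for filename in all_files:
        events = predictions.get(filename, [])
        if not events:
            writer.writerow([filename])  # processed, but nothing detected
            continue
        for onset, offset, label in events:
            writer.writerow([filename, f"{onset:.3f}", f"{offset:.3f}", label])


# Hypothetical file names and detections, for illustration only.
buffer = io.StringIO()
write_submission(
    buffer,
    {"a.wav": [(0.163, 0.665, "Alarm_bell_ringing")]},
    ["a.wav", "b.wav"],
)
submission_text = buffer.getvalue()
```

Iterating over the full file list, rather than over the predictions, is what guarantees that every evaluation file appears in the output.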

Task rules

Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and metadata.
Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
Only weak labels and none of the strong labels (timestamps) or original (Audioset) labels can be used for the training of the submitted system.
Manipulation of provided training data is allowed.
The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a probability density function (PDF) or using techniques such as pitch shifting or time stretching).
Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.

Evaluation

Submissions will be evaluated with event-based measures with a 200ms collar on onsets and a 200ms / 20% of the event's length collar on offsets. Submissions will be ranked according to the event-based F1-score computed over the whole evaluation set. Additionally, event-based error rate will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox:

sed_eval - Evaluation toolbox for Sound Event Detection
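To make the collar rule concrete, the sketch below checks whether one estimated event matches one reference event. It is a standalone illustration rather than a call into sed_eval; the function name is hypothetical, and it assumes the offset tolerance is the larger of 200ms and 20% of the reference event length.

```python
def event_matches(ref, est, t_collar=0.200, percentage_of_length=0.2):
    """Return True if est matches ref under the event-based collar rule.

    ref and est are (onset_sec, offset_sec, label) tuples.
    """
    ref_on, ref_off, ref_label = ref
    est_on, est_off, est_label = est
    if ref_label != est_label:
        return False
    onset_ok = abs(ref_on - est_on) <= t_collar
    # Assumption: the offset collar is the larger of the fixed collar and
    # a fraction of the reference event length.
    offset_collar = max(t_collar, percentage_of_length * (ref_off - ref_on))
    offset_ok = abs(ref_off - est_off) <= offset_collar
    return onset_ok and offset_ok


long_ref = (1.0, 6.0, "Dog")  # 5 s event, so the offset collar is 1.0 s
match_long = event_matches(long_ref, (1.1, 6.8, "Dog"))
late_onset = event_matches(long_ref, (1.3, 6.0, "Dog"))

short_ref = (0.163, 0.665, "Alarm_bell_ringing")  # offset collar stays 0.2 s
late_offset = event_matches(short_ref, (0.2, 0.9, "Alarm_bell_ringing"))
```

For long events the relative collar dominates, which makes the offset criterion more forgiving than the onset criterion.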

Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.


Baseline system

System description

The baseline system is based on two convolutional recurrent neural networks (CRNNs) using 64 log mel-band magnitudes as features. Each 10-second audio file is divided into 500 frames.

Using these features, we train a first CRNN with three convolution layers (64 filters (3x3), max pooling (4) along the frequency axis and 30% dropout), one recurrent layer (64 Gated Recurrent Units (GRU) with 30% dropout on the input), a dense layer (10 units, sigmoid activation) and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, 20% of which are used for validation). This model is trained at clip level (file containing the event or not): inputs are 500 frames long (10-second audio file) for a single output frame. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).

A second model based on the same architecture (3 convolutional layers and 1 recurrent layer) is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weak files, 1578 clips, are used to validate the model). The main difference with the first-pass model is that the output is taken from the dense layer, so that events can be predicted at frame level. Inputs are 500 frames long, each frame labeled identically following the clip labels. The model outputs a decision for each frame. Post-processing (median filtering) is used to obtain event onsets and offsets for each file. The baseline system includes evaluation of the results using the event-based F-score as metric.
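The step from per-frame outputs to event onsets and offsets can be sketched as follows for a single class. This is not the baseline's code: the function names, the 0.5 threshold and the 5-frame median window are assumptions chosen for illustration, and only the 500 frames per 10-second clip come from the description above.

```python
from statistics import median

FRAMES_PER_CLIP = 500
CLIP_SECONDS = 10.0  # so each frame covers 20 ms


def median_filter(values, window):
    """Sliding median with an odd window; edges are padded with zeros."""
    half = window // 2
    padded = [0] * half + list(values) + [0] * half
    return [median(padded[i:i + window]) for i in range(len(values))]


def frames_to_events(probabilities, threshold=0.5, window=5):
    """Turn per-frame probabilities of one class into (onset, offset) pairs."""
    decisions = [1 if p > threshold else 0 for p in probabilities]
    smoothed = median_filter(decisions, window)
    events, start = [], None
    for i, active in enumerate(smoothed + [0]):  # sentinel closes a last event
        if active and start is None:
            start = i
        elif not active and start is not None:
            onset = start * CLIP_SECONDS / FRAMES_PER_CLIP
            offset = i * CLIP_SECONDS / FRAMES_PER_CLIP
            events.append((onset, offset))
            start = None
    return events


# Synthetic output: one event on frames 100-149, with a one-frame dropout
# inside it and an isolated one-frame spike elsewhere.
probs = [0.1] * 100 + [0.9] * 50 + [0.1] * 350
probs[120] = 0.1
probs[300] = 0.9
detected = frames_to_events(probs)
```

The median filter fills the short dropout and removes the isolated spike, so a single event from 2 s to 3 s remains.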

Script description

The baseline system is a semi supervised approach:

Download the data (only the first time)
First pass at clip level:
- Train a CRNN on weak data (train/weak) - 20% of data used for validation
- Predict labels for the unlabeled in-domain data (train/unlabel_in_domain)
Second pass at frame level:
- Train a CRNN on the unlabeled data labeled by the first pass (train/unlabel_in_domain); weak data (train/weak) is used for validation. Note: labels are used at frame level but annotations are at clip level, so if an event is present in the 10-second clip, all frames carry this label during training
- Predict strong test labels (test/). Note: each event is predicted with an onset and an offset
Evaluate the model between test annotations and second pass predictions (metric is macro-averaged event based F-measure)
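The hand-over between the two passes can be illustrated with the sketch below: clip-level probabilities from the first pass are thresholded into pseudo-labels, which are then repeated on every frame to form the second-pass training targets. The function names and the 0.5 threshold are assumptions; the baseline may select pseudo-labels differently.

```python
CLASSES = [
    "Speech", "Dog", "Cat", "Alarm_bell_ringing", "Dishes",
    "Frying", "Blender", "Running_water", "Vacuum_cleaner",
    "Electric_shaver_toothbrush",
]
N_FRAMES = 500  # frames per 10-second clip


def clip_probabilities_to_labels(probabilities, threshold=0.5):
    """First-pass output (one probability per class) -> list of class labels."""
    return [label for label, p in zip(CLASSES, probabilities) if p > threshold]


def expand_to_frames(labels, n_frames=N_FRAMES):
    """Repeat the clip-level label vector on every frame of the clip."""
    row = [1 if name in labels else 0 for name in CLASSES]
    return [list(row) for _ in range(n_frames)]


# Hypothetical first-pass probabilities for one unlabeled clip.
clip_probs = [0.9, 0.1, 0.2, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pseudo_labels = clip_probabilities_to_labels(clip_probs)
frame_targets = expand_to_frames(pseudo_labels)
```

Every frame receives the same target vector, which is why the second model still has to learn the temporal extent of events on its own.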

Python implementation

DCASE2018 Task 4 Baseline (python implementation)

System performance

Event-based metrics with a 200ms collar on onsets and a 200ms / 20% of the events length collar on offsets:

Performance on the latest test set version (July 6th)

Event-based overall metrics (macro-average)
F-score	14.06 %
ER	1.54

Note: This performance was obtained on a CPU based system (Intel® Xeon E5-1630 -- 8 cores, 128Gb RAM). The total runtime was approximately 24h.
Note: The performance might not be exactly reproducible on a GPU based system. However, it runs in around 8 hours on a single Nvidia Geforce 1080 Ti GPU.

Results

Rank	Submission Information
Rank	Code	Author	Affiliation	Technical Report	Event-based F-score (Evaluation dataset)
	Avdeeva_ITMO_task4_1	Anastasia Avdeeva	Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018	20.1
	Avdeeva_ITMO_task4_2	Anastasia Avdeeva	Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018	19.5
	Wang_NUDT_task4_1	Dezhi Wang	School of Computer, National University of Defense Technology, Changsha, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018	12.4
	Wang_NUDT_task4_2	Dezhi Wang	School of Computer, National University of Defense Technology, Changsha, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018	12.6
	Wang_NUDT_task4_3	Dezhi Wang	School of Computer, National University of Defense Technology, Changsha, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018	12.0
	Wang_NUDT_task4_4	Dezhi Wang	School of Computer, National University of Defense Technology, Changsha, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018	12.2
	Dinkel_SJTU_task4_1	Heinrich Dinkel	Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018	10.4
	Dinkel_SJTU_task4_2	Heinrich Dinkel	Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018	10.7
	Dinkel_SJTU_task4_3	Heinrich Dinkel	Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018	13.4
	Dinkel_SJTU_task4_4	Heinrich Dinkel	Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018	11.2
	Guo_THU_task4_1	Yingmei Guo	Department of Computer Science and Technology, Tsinghua University, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018	21.3
	Guo_THU_task4_2	Yingmei Guo	Department of Computer Science and Technology, Tsinghua University, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018	20.6
	Guo_THU_task4_3	Yingmei Guo	Department of Computer Science and Technology, Tsinghua University, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018	19.1
	Guo_THU_task4_4	Yingmei Guo	Department of Computer Science and Technology, Tsinghua University, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018	19.0
	Harb_TUG_task4_1	Robert Harb	Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018	19.4
	Harb_TUG_task4_2	Robert Harb	Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018	15.7
	Harb_TUG_task4_3	Robert Harb	Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018	21.6
	Hou_BUPT_task4_1	Yuanbo Hou	Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018	19.6
	Hou_BUPT_task4_2	Yuanbo Hou	Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018	18.9
	Hou_BUPT_task4_3	Yuanbo Hou	Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018	20.9
	Hou_BUPT_task4_4	Yuanbo Hou	Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018	21.1
	CANCES_IRIT_task4_1	Cances Leo	Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018	8.4
	PELLEGRINI_IRIT_task4_2	Thomas Pellegrini	Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018	16.6
	Kothinti_JHU_task4_1	Sandeep Kothinti	Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018	20.6
	Kothinti_JHU_task4_2	Sandeep Kothinti	Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018	20.9
	Kothinti_JHU_task4_3	Sandeep Kothinti	Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018	20.9
	Kothinti_JHU_task4_4	Sandeep Kothinti	Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018	22.4
	Koutini_JKU_task4_1	Khaled Koutini	Institute of Computational Perception, Johannes Kepler University, Linz, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018	21.5
	Koutini_JKU_task4_2	Khaled Koutini	Institute of Computational Perception, Johannes Kepler University, Linz, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018	21.1
	Koutini_JKU_task4_3	Khaled Koutini	Institute of Computational Perception, Johannes Kepler University, Linz, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018	20.6
	Koutini_JKU_task4_4	Khaled Koutini	Institute of Computational Perception, Johannes Kepler University, Linz, Austria	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018	18.8
	Liu_USTC_task4_1	Yaming Liu	National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018	27.3
	Liu_USTC_task4_2	Yaming Liu	National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018	28.8
	Liu_USTC_task4_3	Yaming Liu	National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018	28.1
	Liu_USTC_task4_4	Yaming Liu	National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018	29.9
	LJK_PSH_task4_1	JaiKai Lu	1T5K, PFU SHANGHAI Co., LTD, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018	24.1
	LJK_PSH_task4_2	JaiKai Lu	1T5K, PFU SHANGHAI Co., LTD, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018	26.3
	LJK_PSH_task4_3	JaiKai Lu	1T5K, PFU SHANGHAI Co., LTD, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018	29.5
	LJK_PSH_task4_4	JaiKai Lu	1T5K, PFU SHANGHAI Co., LTD, Shanghai, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018	32.4
	Moon_YONSEI_task4_1	Moon Hyeongi	School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018	15.9
	Moon_YONSEI_task4_2	Moon Hyeongi	School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018	14.3
	Raj_IITKGP_task4_1	Rojin Raj	Electronics and Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Raj2018	9.4
	Lim_ETRI_task4_1	Wootaek Lim	Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018	17.1
	Lim_ETRI_task4_2	Wootaek Lim	Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018	18.0
	Lim_ETRI_task4_3	Wootaek Lim	Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018	19.6
	Lim_ETRI_task4_4	Wootaek Lim	Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018	20.4
	WangJun_BUPT_task4_2	Wang Jun	Laboratory of Signal Processing & Knowledge Discovery, Beijing University of Posts and Telecommunications, Beijing, China	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangJ2018	17.9
	DCASE2018 baseline	Romain Serizel	Department of Natural Language Processing & Knowledge Discovery, University of Lorraine, Loria, Nancy, France	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#SerizelJ2018	10.8
	Baseline_Surrey_task4_1	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018	18.6
	Baseline_Surrey_task4_2	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018	16.7
	Baseline_Surrey_task4_3	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK	task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018	24.0

Complete results and technical reports can be found on the Task 4 results page.

Citation

If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, and Ankit Parag Shah. Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 19–23. November 2018. URL: https://hal.inria.fr/hal-01850270.


Large-scale weakly labeled semi-supervised sound event detection in domestic environments

Abstract

This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are Youtube video excerpts from domestic context which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential applications.

Keywords

Sound event detection, Large scale, Weakly labeled data, Semi-supervised learning


Coordinators

Romain Serizel, University of Lorraine
Hamid Eghbal-zadeh, Johannes Kepler University
Nicolas Turpault, Inria Nancy Grand-Est
Ankit Parag Shah, Carnegie Mellon University
