The task evaluates systems for the large-scale detection of sound events using weakly labeled data. The challenge is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance.
Note to participants (September 20th): Minor bug in the evaluation script.
We found a minor bug in the evaluation script: some of the files were not taken into account correctly. This has been fixed and the evaluation scores have been updated (resulting in at most a 0.3% decrease of the F-score). This had no impact on the team ranking. However, the ranking of individual systems may have varied very slightly (around ranks 15-25).
Description
The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. The labels in the annotated subset are verified and can be considered reliable. The data are Youtube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications.
Audio dataset
The task employs a subset of “Audioset: An Ontology And Human-Labeled Dataset For Audio Events” by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million Youtube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. We will focus on a subset of Audioset that consists of 10 classes of sound events:
- Speech (class label: Speech)
- Dog (class label: Dog)
- Cat (class label: Cat)
- Alarm/bell/ringing (class label: Alarm_bell_ringing)
- Dishes (class label: Dishes)
- Frying (class label: Frying)
- Blender (class label: Blender)
- Running water (class label: Running_water)
- Vacuum cleaner (class label: Vacuum_cleaner)
- Electric shaver/toothbrush (class label: Electric_shaver_toothbrush)
Recording and annotation procedure
Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the annotations are considered as weak labels.
Audio clips are collected from Youtube videos uploaded by independent users, so the number of clips per class varies dramatically and the dataset is not balanced (see also https://research.google.com/Audioset//dataset/index.html).
Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation. More information about the initial annotation process can be found in:
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.
Audio Set: An ontology and human-labeled dataset for audio events
Abstract
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated CSV file with the following format:
[filename (string)][tab][class_label (strings)]
For example:
Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing;Dog
The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from Youtube (Y-BJNMHMZDcU is the Youtube ID of the video from which the 10-second clip was extracted; t=50 sec to t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing;Dog, corresponds to the sound classes present in the clip, separated by a semicolon.
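For reference, a minimal sketch of parsing such a file in Python is shown below (the file path and the handling of a possible header line are assumptions, not part of the provided tools):

```python
# Minimal sketch: parse the weak annotation file into a dict mapping
# filename -> list of class labels. The path follows the directory layout
# described below; adjust as needed.
import csv

weak_labels = {}
with open("metadata/train/weak.csv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 2 or row[0] == "filename":  # skip a possible header line
            continue
        filename, labels = row[0], row[1]
        weak_labels[filename] = labels.split(";")

# weak_labels["Y-BJNMHMZDcU_50.000_60.000.wav"] -> ["Alarm_bell_ringing", "Dog"]
```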
Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see below for a detailed explanation of the development set). The minimum length for an event is 250ms. The minimum duration of the pause between two events from the same class is 150ms: when the silence between two consecutive events from the same class was less than 150ms, the events were merged into a single event. The strong annotations are provided in a tab-separated CSV file with the following format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][class_label (strings)]
For example:
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file downloaded from Youtube, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
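As an illustration, the sketch below loads a strong annotation file with pandas and applies the 150ms merging rule described above to a list of same-class events (the file path and the presence of a header row are assumptions):

```python
# Minimal sketch: load strong annotations and merge same-class events whose
# gap is shorter than 150 ms, mirroring the annotation rule described above.
import pandas as pd

# Assumes the columns follow the format above; if the file has no header row,
# pass names=["filename", "onset", "offset", "event_label"] to read_csv.
strong = pd.read_csv("metadata/test/test.csv", sep="\t")

def merge_close_events(events, min_gap=0.150):
    """events: sorted list of (onset, offset) tuples for one file and one class."""
    merged = []
    for onset, offset in events:
        if merged and onset - merged[-1][1] < min_gap:
            # Gap shorter than 150 ms: extend the previous event.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged

# merge_close_events([(0.10, 0.50), (0.60, 1.20)]) -> [(0.10, 1.20)]
```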
Download
The annotation files and the script to download the audio files are available in the git repository for task 4. The final dataset is 82 GB and the download/extraction process can take approximately 12 hours. (In case you are using the provided baseline system, there is no need to download the dataset, as the system will automatically download the needed data for you.)
If you experience problems during the download of the dataset please contact the task organizers. (Nicolas Turpault and Romain Serizel in priority)
The content of the development set is structured in the following manner:
dataset root
│ readme.md (instructions to run the script that downloads the audio files and description of the annotations files)
│ download_data.py (script to download the files)
│
└───metadata (directories containing the annotations files)
│ │
│ └───train (annotations for the train sets)
│ │ weak.csv (weakly labeled training set list)
│ │ unlabel_in_domain.csv (unlabeled in domain training set list)
│ │ unlabel_out_of_domain.csv (unlabeled out-of-domain training set list)
│ │
│ └───test (annotations for the test set)
│ test.csv (test set list with strong labels)
│
└───audio (directories where the audio files will be downloaded)
└───train (audio files for the train sets)
│ └───weak (weakly labeled training set)
│ └───unlabel_in_domain (unlabeled in domain training set)
│ └───unlabel_out_of_domain (unlabeled out-of-domain training set)
│
└───test (test set)
Task setup
The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed with start and end timestamps.
Development dataset
The development set is divided into two main partitions: training and validation.
Training set
To motivate the participants to come up with innovative solutions, we provide 3 different splits of training data in our training set: Labeled training set, Unlabeled in domain training set and Unlabeled out of domain training set.
Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which the weak annotations have been verified and cross-checked. The number of clips per class is as follows:
Class | # 10s clips containing the event |
---|---|
Speech | 550 |
Dog | 214 |
Cat | 173 |
Alarm/bell/ringing | 205 |
Dishes | 184 |
Frying | 171 |
Blender | 134 |
Running water | 343 |
Vacuum cleaner | 167 |
Electric shaver/toothbrush | 103 |
Total | 2244 |
Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note however that, given the uncertainty on Audioset labels, this distribution might not match exactly.
Unlabeled out of domain training set:
This set is composed of 39999 clips extracted from classes that are not considered in the task. Note that these clips are chosen based on the Audioset labels, which are not verified and therefore might be noisy. Additionally, as speech is present in half of the segments of Audioset, this set also contains almost 20000 clips with speech. This was the only way to obtain an unlabel_out_of_domain set that is somewhat representative of Audioset: discarding speech would have also meant discarding many other classes and would have reduced the variability of the set.
Test set
The test set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. Its size corresponds to about 20% of the size of the labeled training set: it contains 288 clips (906 events). The test set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may correspond to more than one sound event. The number of events per class is as follows:
Class | # events |
---|---|
Speech | 261 |
Dog | 127 |
Cat | 97 |
Alarm/bell/ringing | 112 |
Dishes | 122 |
Frying | 24 |
Blender | 40 |
Running water | 76 |
Vacuum cleaner | 36 |
Electric shaver/toothbrush | 28 |
Total | 906 |
Evaluation dataset
The evaluation set is a subset of Audioset including the classes mentioned above. Labels with timestamps (obtained by human annotators) will be released after the DCASE 2018 challenge is concluded. The evaluation dataset can be downloaded from the following location:
Submission
Detailed information for the challenge submission can be found on the submission page.
System output should be presented as a single text-file (in CSV format) containing predictions for each audio file in the evaluation set. Result items can be in any order. Format:
[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]
Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).
If no event is detected for a particular audio file, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
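For illustration only (not an official tool), the following sketch writes detections in the required format, including a row with only the file name for files with no detected events (the detections variable is hypothetical):

```python
# Minimal sketch: write system output in the required submission format.
# `detections` maps each evaluation filename to a list of
# (onset_seconds, offset_seconds, event_label) tuples produced by your system.
def write_submission(detections, out_path="submission.csv"):
    with open(out_path, "w") as f:
        for filename, events in detections.items():
            if not events:
                # File processed but nothing detected: output the filename alone.
                f.write(f"{filename}\n")
            for onset, offset, label in events:
                f.write(f"{filename}\t{onset:.3f}\t{offset:.3f}\t{label}\n")

write_submission({
    "YOTsn73eqbfc_10.000_20.000.wav": [(0.163, 0.665, "Alarm_bell_ringing")],
    "Y-BJNMHMZDcU_50.000_60.000.wav": [],
})
```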
Task rules
- Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
- Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
- Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
- Only the weak labels can be used for the training of the submitted system; neither the strong labels (timestamps) nor the original (Audioset) labels can be used.
- Manipulation of provided training data is allowed.
- The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching); a minimal augmentation sketch is given after this list.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
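As an illustration of the augmentation techniques mentioned in the rules, the sketch below uses librosa (an assumed choice of library; file paths, sampling rate and parameters are illustrative):

```python
# Minimal sketch: augment development clips with pitch shifting, time
# stretching and mixing, without any external data.
import numpy as np
import librosa

# Hypothetical paths following the dataset layout described above.
y1, sr = librosa.load("audio/train/weak/Y-BJNMHMZDcU_50.000_60.000.wav", sr=44100)
y2, _ = librosa.load("audio/train/weak/YOTsn73eqbfc_10.000_20.000.wav", sr=44100)

shifted = librosa.effects.pitch_shift(y=y1, sr=sr, n_steps=2)   # up 2 semitones
stretched = librosa.effects.time_stretch(y=y1, rate=1.1)        # 10% faster
n = min(len(y1), len(y2))
mixed = 0.5 * y1[:n] + 0.5 * y2[:n]                             # simple mixture
```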
Evaluation
Submissions will be evaluated with event-based measures, using a 200ms collar on onsets and a collar on offsets of 200ms or 20% of the event length (whichever is larger). Submissions will be ranked according to the event-based F1-score computed over the whole evaluation set. Additionally, the event-based error rate will be provided as a secondary measure. Evaluation is done using the sed_eval toolbox.
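A minimal usage sketch, following the sed_eval documentation pattern, is shown below; reference.txt and estimated.txt are placeholder files in the strong-label format described earlier, and exact accessor names may vary slightly between sed_eval versions:

```python
# Minimal sketch (assumed usage, not the official evaluation script) of
# computing event-based metrics with sed_eval.
import sed_eval

reference = sed_eval.io.load_event_list("reference.txt")
estimated = sed_eval.io.load_event_list("estimated.txt")

event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference.unique_event_labels,
    t_collar=0.200,            # 200 ms collar on onsets (and offsets)
    percentage_of_length=0.2,  # offsets also tolerate 20% of the event length
)

# Evaluate file by file, accumulating intermediate statistics.
for filename in reference.unique_files:
    event_based_metrics.evaluate(
        reference_event_list=reference.filter(filename=filename),
        estimated_event_list=estimated.filter(filename=filename),
    )

print(event_based_metrics)  # full report, including the macro-averaged F-score
```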
Detailed information on metrics calculation is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.
Metrics for Polyphonic Sound Event Detection
Abstract
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
Baseline system
System description
The baseline system is based on two convolutional recurrent neural networks (CRNN) using 64 log mel-band magnitudes as features. Each 10-second audio file is divided into 500 frames.
Using these features, we train a first CRNN with three convolution layers (64 filters (3x3), max pooling (4) along the frequency axis and 30% dropout), one recurrent layer (64 gated recurrent units (GRU) with 30% dropout on the input), a dense layer (10 units, sigmoid activation) and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, 20% of which are used for validation). This model is trained at clip level (does the file contain the event or not): inputs are 500 frames long (a 10-second audio file) for a single output per clip. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).
A second model based on the same architecture (3 convolutional layers and 1 recurrent layer) is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weakly labeled files, 1578 clips, are used to validate the model). The main difference with the first-pass model is that the output is taken at the dense layer in order to predict events at frame level. Inputs are 500 frames long, each frame labeled identically following the clip labels. The model outputs a decision for each frame, and post-processing (median filtering) is used to obtain event onsets and offsets for each file. The baseline system includes an evaluation of the results using the event-based F-score as metric.
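The sketch below outlines a comparable CRNN in Keras. It follows the description above but is not the official baseline code; details such as padding, activation functions and the optimizer are assumptions:

```python
from tensorflow.keras import layers, models

def build_crnn(n_frames=500, n_mels=64, n_classes=10):
    inp = layers.Input(shape=(n_frames, n_mels, 1))
    x = inp
    # Three convolution blocks: 64 filters (3x3), max pooling (4) along the
    # frequency axis, 30% dropout.
    for _ in range(3):
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 4))(x)
        x = layers.Dropout(0.3)(x)
    # Collapse the remaining frequency dimension: (500, 1, 64) -> (500, 64).
    x = layers.Reshape((n_frames, -1))(x)
    # Recurrent layer: 64 GRU units, 30% dropout on the input.
    x = layers.GRU(64, return_sequences=True, dropout=0.3)(x)
    # Frame-level predictions (the output used by the second-pass model).
    frame_out = layers.TimeDistributed(
        layers.Dense(n_classes, activation="sigmoid"), name="frame")(x)
    # Clip-level predictions via global average pooling across frames
    # (the output used by the first-pass model trained on weak labels).
    clip_out = layers.GlobalAveragePooling1D(name="clip")(frame_out)
    return models.Model(inp, [frame_out, clip_out])

model = build_crnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```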
Script description
The baseline system follows a semi-supervised approach:
- Download the data (only the first time)
- First pass at clip level:
  - Train a CRNN on the weak data (train/weak); 20% of the data is used for validation
  - Predict labels for the unlabeled in domain data (train/unlabel_in_domain)
- Second pass at frame level:
  - Train a CRNN on the unlabeled in domain data (train/unlabel_in_domain) using the labels predicted during the first pass; the weak data (train/weak) is used for validation. Note: labels are used at frame level but annotations are at clip level, so if an event is present in the 10-second clip, all frames carry this label during training
  - Predict strong labels on the test set (test/). Note: each event is predicted with an onset and an offset
- Evaluate the model by comparing the test annotations with the second-pass predictions (the metric is the macro-averaged event-based F-measure); a sketch of the median-filtering post-processing used to derive onsets and offsets is given after this list
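The following sketch illustrates the median-filtering step mentioned above, turning frame-level probabilities into events with onsets and offsets (the threshold, filter width and frame rate are assumptions, not the baseline's exact values):

```python
# Minimal sketch: convert frame-level probabilities into events via
# thresholding and median filtering.
import numpy as np
from scipy.ndimage import median_filter

def probs_to_events(frame_probs, class_labels, threshold=0.5,
                    filter_width=11, frames_per_second=50.0):
    """frame_probs: array of shape (n_frames, n_classes) from the second-pass CRNN."""
    events = []
    for k, label in enumerate(class_labels):
        # Binarize the probabilities, then smooth the decisions with a median filter.
        decisions = median_filter((frame_probs[:, k] > threshold).astype(int),
                                  size=filter_width)
        # Find onset/offset frame indices of contiguous active regions.
        padded = np.concatenate(([0], decisions, [0]))
        changes = np.diff(padded)
        onsets = np.where(changes == 1)[0]
        offsets = np.where(changes == -1)[0]
        for on, off in zip(onsets, offsets):
            events.append((on / frames_per_second, off / frames_per_second, label))
    return events

# Example: probs_to_events(np.random.rand(500, 10), class_labels=list("ABCDEFGHIJ"))
```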
Python implementation
System performance
Event-based metrics with a 200ms collar on onsets and a collar of 200ms or 20% of the event length on offsets:

Event-based overall metrics (macro-average) | |
---|---|
F-score | 14.06 % |
ER | 1.54 |
Note: This performance was obtained on a CPU-based system (Intel® Xeon E5-1630, 8 cores, 128 GB RAM). The total runtime was approximately 24 hours.
Note: The performance might not be exactly reproducible on a GPU-based system. However, it runs in around 8 hours on a single Nvidia GeForce 1080 Ti GPU.
Results
Code | Author | Affiliation | Technical Report | Event-based F-score (Evaluation dataset) |
---|---|---|---|---|
Avdeeva_ITMO_task4_1 | Anastasia Avdeeva | Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018 | 20.1 | |
Avdeeva_ITMO_task4_2 | Anastasia Avdeeva | Speech Information Systems Department, University of Information Technology Mechanics and Optics, Saint-Petersburg, Russia | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Avdveeva2018 | 19.5 | |
Wang_NUDT_task4_1 | Dezhi Wang | School of Computer, National University of Defense Technology, Changsha, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 | 12.4 | |
Wang_NUDT_task4_2 | Dezhi Wang | School of Computer, National University of Defense Technology, Changsha, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 | 12.6 | |
Wang_NUDT_task4_3 | Dezhi Wang | School of Computer, National University of Defense Technology, Changsha, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 | 12.0 | |
Wang_NUDT_task4_4 | Dezhi Wang | School of Computer, National University of Defense Technology, Changsha, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangD2018 | 12.2 | |
Dinkel_SJTU_task4_1 | Heinrich Dinkel | Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 | 10.4 | |
Dinkel_SJTU_task4_2 | Heinrich Dinkel | Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 | 10.7 | |
Dinkel_SJTU_task4_3 | Heinrich Dinkel | Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 | 13.4 | |
Dinkel_SJTU_task4_4 | Heinrich Dinkel | Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Dinkel2018 | 11.2 | |
Guo_THU_task4_1 | Yingmei Guo | Department of Computer Science and Technology, Tsinghua University, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 | 21.3 | |
Guo_THU_task4_2 | Yingmei Guo | Department of Computer Science and Technology, Tsinghua University, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 | 20.6 | |
Guo_THU_task4_3 | Yingmei Guo | Department of Computer Science and Technology, Tsinghua University, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 | 19.1 | |
Guo_THU_task4_4 | Yingmei Guo | Department of Computer Science and Technology, Tsinghua University, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Guo2018 | 19.0 | |
Harb_TUG_task4_1 | Robert Harb | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 | 19.4 | |
Harb_TUG_task4_2 | Robert Harb | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 | 15.7 | |
Harb_TUG_task4_3 | Robert Harb | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Harb2018 | 21.6 | |
Hou_BUPT_task4_1 | Yuanbo Hou | Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 | 19.6 | |
Hou_BUPT_task4_2 | Yuanbo Hou | Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 | 18.9 | |
Hou_BUPT_task4_3 | Yuanbo Hou | Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 | 20.9 | |
Hou_BUPT_task4_4 | Yuanbo Hou | Institute of Information Photonics and Optical, Beijing University of Posts and Telecommunications, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Hou2018 | 21.1 | |
CANCES_IRIT_task4_1 | Cances Leo | Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018 | 8.4 | |
PELLEGRINI_IRIT_task4_2 | Thomas Pellegrini | Institut de Recherche en Informatique de Toulouse, Université Paul Sabatier Toulouse III, Toulouse, France | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Cances2018 | 16.6 | |
Kothinti_JHU_task4_1 | Sandeep Kothinti | Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 | 20.6 | |
Kothinti_JHU_task4_2 | Sandeep Kothinti | Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 | 20.9 | |
Kothinti_JHU_task4_3 | Sandeep Kothinti | Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 | 20.9 | |
Kothinti_JHU_task4_4 | Sandeep Kothinti | Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kothinti2018 | 22.4 | |
Koutini_JKU_task4_1 | Khaled Koutini | Institute of Computational Perception, Johannes Kepler University, Linz, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 | 21.5 | |
Koutini_JKU_task4_2 | Khaled Koutini | Institute of Computational Perception, Johannes Kepler University, Linz, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 | 21.1 | |
Koutini_JKU_task4_3 | Khaled Koutini | Institute of Computational Perception, Johannes Kepler University, Linz, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 | 20.6 | |
Koutini_JKU_task4_4 | Khaled Koutini | Institute of Computational Perception, Johannes Kepler University, Linz, Austria | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Koutini2018 | 18.8 | |
Liu_USTC_task4_1 | Yaming Liu | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 | 27.3 | |
Liu_USTC_task4_2 | Yaming Liu | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 | 28.8 | |
Liu_USTC_task4_3 | Yaming Liu | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 | 28.1 | |
Liu_USTC_task4_4 | Yaming Liu | National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Liu2018 | 29.9 | |
LJK_PSH_task4_1 | JaiKai Lu | 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 | 24.1 | |
LJK_PSH_task4_2 | JaiKai Lu | 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 | 26.3 | |
LJK_PSH_task4_3 | JaiKai Lu | 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 | 29.5 | |
LJK_PSH_task4_4 | JaiKai Lu | 1T5K, PFU SHANGHAI Co., LTD, Shanghai, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lu2018 | 32.4 | |
Moon_YONSEI_task4_1 | Moon Hyeongi | School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018 | 15.9 | 
Moon_YONSEI_task4_2 | Moon Hyeongi | School of Electrical and Electronic Engineering, Yonsei University, Seoul, Republic of Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Moon2018 | 14.3 | 
Raj_IITKGP_task4_1 | Rojin Raj | Electronics and Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Raj2018 | 9.4 | |
Lim_ETRI_task4_1 | Wootaek Lim | Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 | 17.1 | |
Lim_ETRI_task4_2 | Wootaek Lim | Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 | 18.0 | |
Lim_ETRI_task4_3 | Wootaek Lim | Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 | 19.6 | |
Lim_ETRI_task4_4 | Wootaek Lim | Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Lim2018 | 20.4 | |
WangJun_BUPT_task4_2 | Wang Jun | Laboratory of Signal Processing & Knowledge Discovery, Beijing University of Posts and Telecommunications, Beijing, China | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#WangJ2018 | 17.9 | |
DCASE2018 baseline | Romain Serizel | Department of Natural Language Processing & Knowledge Discovery, University of Lorraine, Loria, Nancy, France | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#SerizelJ2018 | 10.8 | |
Baseline_Surrey_task4_1 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 | 18.6 | 
Baseline_Surrey_task4_2 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 | 16.7 | 
Baseline_Surrey_task4_3 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-large-scale-weakly-labeled-semi-supervised-sound-event-detection-results#Kong2018 | 24.0 | 
Complete results and technical reports can be found on the Task 4 results page.
Citation
If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following paper: Romain Serizel, Nicolas Turpault, Hamid Eghbal-Zadeh, and Ankit Parag Shah. Large-scale weakly labeled semi-supervised sound event detection in domestic environments. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 19–23, November 2018. URL: https://hal.inria.fr/hal-01850270.
Large-scale weakly labeled semi-supervised sound event detection in domestic environments
Abstract
This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are Youtube video excerpts from domestic context which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential applications.
Keywords
Sound event detection, Large scale, Weakly labeled data, Semi-supervised learning