The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled (with time stamps).
The challenge has ended. Full results for this task can be found on the Results page.
Description
This task is the follow-up to DCASE 2019 task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording (see also Fig 1). The challenge remains to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered reliable.
This year, we also encourage participants to propose systems that use sound separation jointly with sound event detection. This step can be used to separate overlapping sound events and extract foreground sound events from the background sound. To motivate participants to explore that direction, we provide a baseline sound separation model that can be used for pre-processing (see also Fig 2).
Compared to previous years, this task aims to investigate how we can optimally exploit synthetic data. An additional scientific question is to what extent sound separation can improve sound event detection, and vice versa.
Audio dataset
The data for DCASE 2020 task 4 consist of several datasets designed for sound event detection and/or sound separation. The datasets are described below.
Audio material
Dataset | Subset | Type | Usage | Annotations | Sampling frequency |
---|---|---|---|---|---|
DESED | Real: weakly labeled | Real soundscapes | Training | Weak labels (no timestamps) | 44.1 kHz |
DESED | Real: unlabeled | Real soundscapes | Training | No annotations | 44.1 kHz |
DESED | Real: validation | Real soundscapes | Validation | Strong labels (with timestamps) | 44.1 kHz |
DESED | Real: public evaluation | Real soundscapes | Evaluation **(do not use this subset to tune hyperparameters)** | Strong labels (with timestamps) | 44.1 kHz |
DESED | Synthetic: training | Isolated events + synthetic soundscapes | Training/validation | Strong labels (with timestamps) | 16 kHz |
DESED | Synthetic: evaluation | Isolated events + backgrounds | Evaluation **(do not use this subset to tune hyperparameters)** | Event-level labels (no timestamps) | 16 kHz |
SINS | | Background | Training/validation | No annotations | 16 kHz |
TUT Acoustic scenes 2017, development dataset | | Background | Training/validation | No annotations | 44.1 kHz |
Source separation dataset (FUSS) | | Isolated events + synthetic soundscapes | Training/validation | No annotations | 16 kHz |
If you plan to perform source separation (or to use backgrounds from the SINS dataset), please resample your recorded data to 16 kHz. If you are using only recorded data and performing only sound event detection, you can use sampling rates up to 44.1 kHz.
Please note that the baselines work on 16 kHz data.
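For reference, a minimal resampling sketch (assuming the librosa and soundfile packages; the file names are placeholders) could look like this:

```python
import librosa
import soundfile as sf

# Placeholder 44.1 kHz input clip; adjust the path to your data.
in_path = "Y-BJNMHMZDcU_50.000_60.000.wav"
target_sr = 16000

# librosa.load resamples to the requested rate while loading.
audio, sr = librosa.load(in_path, sr=target_sr, mono=True)

# Write the 16 kHz version to a new file.
sf.write("Y-BJNMHMZDcU_50.000_60.000_16k.wav", audio, target_sr)
```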
DESED dataset
The DESED dataset is the dataset that was used in DCASE 2019 task 4. The dataset for this task is composed of 10-second audio clips recorded in domestic environments or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of AudioSet (not all the classes are present in AudioSet as such; some of the task's sound event classes group several AudioSet classes):
- Speech
- Dog
- Cat
- Alarm/bell/ringing (Alarm_bell_ringing)
- Dishes
- Frying
- Blender
- Running water (Running_water)
- Vacuum cleaner (Vacuum_cleaner)
- Electric shaver/toothbrush (Electric_shaver_toothbrush)
More information about this dataset and how to generate synthetic soundscapes can be found on the DESED website.
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data
Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.
Sound event detection in synthetic domestic environments
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Semi-supervised learning ; weakly labeled data ; synthetic data ; Sound event detection
Sound separation dataset (FUSS)
The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.
Overview: The audio data is sourced from a subset of a prerelease of FSD50k, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50k labels, these sound source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these sound source files, and are not considered part of the challenge. Thus, official challenge results should not use FSD50k labels, even though they may become available upon FSD50k release. To create mixtures, 10 second clips of sounds are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sounds. Sound source files longer than 10s are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and access to the original source audio.
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Abstract
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.
Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.
Freesound technical demo
Abstract
Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.
Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In in preparation. 2020.
What's All the FUSS About Free Universal Sound Separation Data?
Motivation: This dataset provides a platform to investigate how sound separation may help with event detection and vice versa. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, sound separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.
Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse corresponds to a rectangular room of random size with random wall materials, where a single microphone and up to 4 sound sources are placed at random spatial locations.
Recipe for data creation:
The data creation recipe starts with scripts, based on Scaper, to generate mixtures of events with random timing of sound events,
along with a background sound that spans the duration of the mixture clip.
The constituent sound files for each mixture are also generated for use as references for training and evaluation.
The data are reverberated using a different room simulation for each mixture.
In this simulation each sound source has its own reverberation corresponding to a different spatial location.
The reverberated mixtures are created by summing over the reverberated sound sources.
The data creation scripts support modification, so that participants may remix and augment the training data as desired.
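As a rough illustration of the reverberate-and-sum step (not the official recipe; the sources and room impulse responses below are random placeholders), the mixing amounts to:

```python
import numpy as np
from scipy.signal import fftconvolve

sr = 16000
clip_len = 10 * sr  # 10-second mixtures

# Placeholder dry sources and per-source room impulse responses (RIRs).
sources = [0.01 * np.random.randn(clip_len) for _ in range(3)]
rirs = [np.random.randn(4000) * np.exp(-np.arange(4000) / 800.0) for _ in range(3)]

# Reverberate each source with its own RIR, truncate to the clip length, then sum.
reverberant_sources = [fftconvolve(s, h)[:clip_len] for s, h in zip(sources, rirs)]
mixture = np.sum(reverberant_sources, axis=0)
```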
Additional (background) datasets
SINS dataset:
A part of the derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the
synthetic subset of the dataset for DCASE 2019 task 4.
The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week.
It was collected using a network of 13 microphone arrays distributed over the entire home.
Each microphone array consists of 4 linearly arranged microphones.
Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network
Abstract
There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.
Keywords
Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks
TUT Acoustic scenes 2017, development dataset: the TUT Acoustic Scenes 2017 dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided in individual files.
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection
Abstract
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
Generating your own training data using Scaper
Participants can use the provided isolated foreground and background sounds, in combination with the Scaper soundscape synthesis and augmentation library, to generate additional (potentially infinite!) training data.
Resources for getting started include:
- The Scaper scripts provided with the DESED dataset (link)
- The canonical Scaper script used to create the source separation dataset (link)
- The Scaper tutorial
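As an illustration, a minimal Scaper sketch along the lines of the provided scripts might look as follows; the soundbank paths, output names and event distributions are placeholders, not the official generation settings:

```python
import scaper

# Placeholder paths to the isolated foreground events and background sounds.
fg_folder = "path/to/soundbank/foreground"
bg_folder = "path/to/soundbank/background"

sc = scaper.Scaper(duration=10.0, fg_path=fg_folder, bg_path=bg_folder, random_state=42)
sc.sr = 16000    # match the 16 kHz baseline sampling rate
sc.ref_db = -50  # reference loudness of the background

# One background spanning the whole clip.
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# One foreground event with illustrative randomized timing, duration and SNR.
sc.add_event(label=('choose', []),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),
             event_duration=('uniform', 0.5, 4),
             snr=('uniform', 6, 30),
             pitch_shift=None,
             time_stretch=None)

# Render the audio plus the corresponding strong (timestamped) annotations.
sc.generate(audio_path='soundscape.wav',
            jams_path='soundscape.jams',
            txt_path='soundscape.txt',
            allow_repeated_label=True,
            allow_repeated_source=True,
            reverb=0.1,
            disable_sox_warnings=True)
```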
Reference labels
AudioSet provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and found that, in most cases, not all the clips contain the event corresponding to the given annotation.
Weak annotations
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file under the following format:
[filename (string)][tab][event_labels (strings)]
For example:
Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec and t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by a comma.
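For instance, a minimal pandas sketch to parse such a file (the file name is a placeholder) could be:

```python
import pandas as pd

# Placeholder file name for the weak-label metadata.
weak = pd.read_csv("weak.tsv", sep="\t")  # columns: filename, event_labels
# If the file has no header row, use: pd.read_csv(..., names=["filename", "event_labels"])

# Split the comma-separated class list into a Python list per clip.
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```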
Strong annotations
Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation of the development set).
The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50k was verified by humans in order to check that the event class present in the FSD50k annotation was indeed dominant in the audio clip.
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
In both cases, the minimum length of an event is 250 ms and the minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab separated csv file under the following format:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
For example:
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
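Similarly, the sketch below parses a strong-annotation file (placeholder name) and shows the 150 ms merging rule described above applied to a per-file, per-class event list:

```python
import pandas as pd

# Placeholder file name; columns: filename, onset, offset, event_label.
strong = pd.read_csv("synthetic_strong.tsv", sep="\t")
# If the file has no header row, pass names=["filename", "onset", "offset", "event_label"].

def merge_events(events, max_gap=0.150):
    """Merge consecutive events of the same class separated by less than max_gap seconds."""
    merged = []
    for _, ev in events.sort_values("onset").iterrows():
        if merged and ev["onset"] - merged[-1]["offset"] < max_gap:
            merged[-1]["offset"] = max(merged[-1]["offset"], ev["offset"])
        else:
            merged.append({"onset": float(ev["onset"]), "offset": float(ev["offset"])})
    return merged

for (fname, label), group in strong.groupby(["filename", "event_label"]):
    print(fname, label, merge_events(group))
```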
Download
The dataset is composed of two subsets that can be downloaded independently. The procedure to download each subset is described below.
DESED dataset
The instructions to download the audio files, the annotations and example scripts to generate synthetic soundscapes can be found in the DCASE 2020 section of the DESED website.
If you experience problems during the download of the recorded soundscapes, please contact the task organizers (Nicolas Turpault and Romain Serizel in priority).
Source separation (FUSS) dataset
Instructions to download the original audio data, the model parameters, the audio mixtures and the recipes to generate them are available on the FUSS repositories (Zenodo and GitHub).
Task setup
The challenge consists of detecting sound events within web videos using training data from real recordings, both weakly labeled and unlabeled, and synthetic audio clips that are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps. As we encourage participants to use a source separation algorithm together with sound event detection, there are three possible scenarios:
- You are working on sound event detection without source separation pre-processing (this is a direct follow-up to last year's task 4)
- You are working on both source separation and sound event detection (this includes the case where the source separation baseline is reused)
- You are working only on source separation and use the sound event detection baseline
In each case, participants are expected to provide the output of a sound event detection system (see also below). Note that an additional (optional) separate evaluation set designed to evaluate source separation performance will also be provided (see also below).
Development dataset
The development set is divided into six main partitions (four for sound event detection, two for source separation) that are detailed below.
Sound event detection training set
This dataset is a subset of DESED. We provide three different splits of training data: a labeled training set, an unlabeled in-domain training set and a synthetic set with strong annotations. The first two sets are the same as in DCASE 2019 task 4.
Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.
Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips.
The clips are selected such that the distribution per class (based on Audioset annotations)
is close to the distribution in the labeled set.
Note, however, that given the uncertainty of the AudioSet labels, this distribution might not match exactly.
Synthetic strongly labeled set:
This set is composed of 2584 clips generated with the Scaper soundscape synthesis and augmentation library.
The clips are generated such that the distribution per event is close to that of the validation set.
We used all the foreground files from the DESED synthetic soundbank and background files annotated as "other" from the SINS dataset subset.
The default parameters and the room impulse responses are the same as for the source separation training set (but the event distribution is different).
Note that a 10-second clip may contain more than one sound event, but the maximum polyphony is limited to 2.
This year we share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.
Sound event detection validation set
The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. It is the same as the DCASE 2019 task 4 validation set. The validation set contains 1168 clips (4093 events) and is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event.
Source separation training set
The source separation training set consists of 20000 mixture clips. The ground-truth reference sources are provided for each of these mixtures. Each 10 second mixture contains between 1 and 4 sources. Every mixture contains one background source, which is active for the entire duration.
Source separation validation set
The source separation validation set is generated in the same way as the training set, and consists of 1000 mixture clips, along with corresponding ground-truth reference sources. Raw source clips for this set come from different Freesound uploaders than the raw source clips used to generate the training set.
Evaluation datasets
As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios:
- You are working on sound event detection without source separation pre-processing
- You are working on both source separation and sound event detection
- You are working only on source separation and use the sound event detection baseline
Participants are allowed to submit up to 4 different systems for each scenario listed above. In each case, participants are expected to provide at least the output of a sound event detection system. Participants who want their sound separation submission to be evaluated can download the specific sound separation evaluation dataset (see below).
PSDS submissions are optional. If you want your system to be evaluated with PSDS, please strictly follow the format illustrated in the example submission package.
Before submission, please make sure you check that your submission package is correct with the validation script enclosed in the submission package:
python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2020/task4_test
Sound event detection evaluation dataset
The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.
- A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes and includes the public evaluation dataset.
- A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.
Sound separation evaluation dataset
The sound separation evaluation dataset is the evaluation part of the FUSS dataset.
Task rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.
Task specific rules:
- Participants are allowed to submit up to 4 different systems for each scenario listed in the task setup section.
- Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
- Other material related to the original videos (such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata) is also considered external data.
- Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
- For the real recordings (from AudioSet), only the weak labels can be used for training the submitted system; none of the strong labels (timestamps) or original AudioSet labels can be used. For the synthetic clips, the provided strong labels can be used.
- Manipulation of provided training data is allowed.
- The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
- Participants are not allowed to use the public evaluation dataset or the synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.
Evaluation
Sound event detection evaluation
All submissions will be evaluated with event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets. Submissions will be ranked according to the event-based F1-score computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the ranking). Additionally, multiple polyphonic sound event detection scores will be provided as contrastive measures.
F-scores are computed using a single operating point (decision threshold = 0.5) while the PSDS values are computed using 50 operating points (linearly distributed from 0.01 to 0.99). Evaluation with the PSDS metric is optional.
The parameters used for PSDS performances are:
- Detection Tolerance parameter (dtc): 0.5
- Ground Truth intersection parameter (gtc): 0.5
- Cross-Trigger Tolerance parameter (cttc): 0.3
- maximum False Positive rate (e_max): 100
The differences between the 3 reported PSDS values are:

 | alpha_ct | alpha_st |
---|---|---|
PSDS | 0 | 0 |
PSDS cross-trigger | 1 | 0 |
PSDS macro | 0 | 1 |

alpha_ct is the weight related to the cost of cross-triggers; alpha_st is the weight related to the cost of instability across classes.
Evaluation is done using the sed_eval and psds_eval toolboxes.
More information on how to use PSDS for task 4 can be found in the dedicated notebook.
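For orientation, a minimal sed_eval sketch (with hypothetical reference and estimated event lists for a single clip; the official evaluation additionally uses psds_eval and is run over the whole evaluation set) might look like this:

```python
import dcase_util
import sed_eval

# Hypothetical reference and system-output event lists for a single 10-second clip.
reference = dcase_util.containers.MetaDataContainer([
    {"filename": "clip1.wav", "event_label": "Dog", "onset": 0.5, "offset": 2.0},
])
estimated = dcase_util.containers.MetaDataContainer([
    {"filename": "clip1.wav", "event_label": "Dog", "onset": 0.6, "offset": 2.3},
])

# 200 ms collar on onsets, 200 ms / 20% of the event length collar on offsets.
metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference.unique_event_labels,
    t_collar=0.200,
    percentage_of_length=0.2,
)
# Evaluation is typically done file by file; here there is only one clip.
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
print(metrics.results_overall_metrics()["f_measure"])
```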
Detailed information on metrics calculation is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.
Metrics for Polyphonic Sound Event Detection
Abstract
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulovic. A framework for the robust evaluation of sound event detection. arXiv preprint arXiv:1910.08440, 2019. URL: https://arxiv.org/abs/1910.08440.
Source separation evaluation (optional)
Source separation approaches will optionally be evaluated on an additional evaluation set with standard source separation metrics, as follows. For each example mixture x containing J sources, performance will be measured with permutation-invariant scale-invariant signal-to-noise ratio improvement (SI-SNRi).
- SNR is defined as 10 * log10 of the ratio of source power to error power.
- Scale-invariant SNR (SI-SNR) allows scaling of the reference to best match the estimate. The optimal scaling factor given an estimate signal e and reference signal r is ⟨e, r⟩ / ⟨r, r⟩, where ⟨·, ·⟩ denotes the inner product.
- Permutation invariance allows the estimates to be permuted to match the reference signals such that the mean SI-SNR across sources is maximized.
- SI-SNR improvement (SI-SNRi) is the SI-SNR of the estimate with respect to the reference, minus the SI-SNR of the mixture with respect to the reference (a minimal computation sketch is given after this list).
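For reference, the following NumPy sketch illustrates these definitions on placeholder signals (permutation invariance, which would additionally search over source orderings, is omitted):

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR: scale the reference to best match the estimate, then compute SNR."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps))

# Placeholder signals: SI-SNRi is the SI-SNR of the estimate minus the SI-SNR of the mixture.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
mixture = reference + 0.5 * rng.standard_normal(16000)
estimate = reference + 0.1 * rng.standard_normal(16000)
si_snri = si_snr(estimate, reference) - si_snr(mixture, reference)
print(f"SI-SNRi: {si_snri:.1f} dB")
```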
A Python evaluation script for source separation is provided with the FUSS dataset (see the Download section).
Results
Code | Author | Affiliation | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sound Separation |
---|---|---|---|---|---|
Xiaomi_task4_SED_1 | Chuming Liang | Xiaomi Co., AI Lab, Wuhan, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Liang2020 | 36.0 (35.3 - 36.8) | ||
Rykaczewski_Samsung_taks4_SED_3 | Krzysztof Rykaczewski | Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland | task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 | 21.6 (21.0 - 22.4) | ||
Rykaczewski_Samsung_taks4_SED_2 | Krzysztof Rykaczewski | Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland | task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 | 21.9 (21.3 - 22.7) | ||
Rykaczewski_Samsung_taks4_SED_4 | Krzysztof Rykaczewski | Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland | task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 | 10.4 (9.7 - 11.1) | ||
Rykaczewski_Samsung_taks4_SED_1 | Krzysztof Rykaczewski | Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland | task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 | 21.6 (20.8 - 22.4) | ||
Hou_IPS_task4_SED_1 | Bowei Hou | Waseda University, Graduate School of Information, Production and Systems, Kitakyushu, Japan | task-sound-event-detection-and-separation-in-domestic-environments-results#HouB2020 | 34.9 (34.0 - 35.7) | ||
Miyazaki_NU_task4_SED_1 | Koichi Miyazaki | Nagoya University, Japan | task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 | 51.1 (50.1 - 52.3) | ||
Miyazaki_NU_task4_SED_2 | Koichi Miyazaki | Nagoya University, Japan | task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 | 46.4 (45.5 - 47.5) | ||
Miyazaki_NU_task4_SED_3 | Koichi Miyazaki | Nagoya University, Japan | task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 | 50.7 (49.6 - 51.9) | ||
Huang_ICT-TOSHIBA_task4_SED_3 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.3 (43.4 - 45.4) | ||
Huang_ICT-TOSHIBA_task4_SED_1 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.6 (43.5 - 46.0) | ||
Huang_ICT-TOSHIBA_task4_SS_SED_4 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.1 (42.9 - 45.4) | Sound Separation | |
Huang_ICT-TOSHIBA_task4_SED_4 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.3 (43.2 - 45.6) | ||
Huang_ICT-TOSHIBA_task4_SS_SED_1 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.7 (43.6 - 46.2) | Sound Separation | |
Huang_ICT-TOSHIBA_task4_SED_2 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.3 (43.2 - 45.6) | ||
Huang_ICT-TOSHIBA_task4_SS_SED_3 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.4 (43.2 - 45.8) | Sound Separation | |
Huang_ICT-TOSHIBA_task4_SS_SED_2 | Yuxin Huang | Institute of Computing Technology,Chinese Academy of Sciences, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 | 44.5 (43.3 - 46.0) | Sound Separation | |
Copiaco_UOW_task4_SED_2 | Abigail Copiaco | University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia | task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020a | 7.8 (7.3 - 8.2) | ||
Copiaco_UOW_task4_SED_1 | Abigail Copiaco | University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia | task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020a | 7.5 (7.0 - 8.0) | ||
Kim_AiTeR_GIST_SED_1 | Nam Kyun Kim | Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea | task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 | 43.7 (42.8 - 44.7) | ||
Kim_AiTeR_GIST_SED_2 | Nam Kyun Kim | Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea | task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 | 43.9 (43.0 - 44.7) | ||
Kim_AiTeR_GIST_SED_4 | Nam Kyun Kim | Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea | task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 | 44.4 (43.5 - 45.2) | ||
Kim_AiTeR_GIST_SED_3 | Nam Kyun Kim | Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea | task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 | 44.2 (43.4 - 45.1) | ||
Copiaco_UOW_task4_SS_SED_1 | Abigail Copiaco | University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia | task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020b | 6.9 (6.7 - 7.2) | Sound Separation | |
LJK_PSH_task4_SED_3 | Lu JiaKai | PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China | task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 | 38.6 (37.7 - 39.7) | ||
LJK_PSH_task4_SED_1 | Lu JiaKai | PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China | task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 | 39.3 (38.4 - 40.4) | ||
LJK_PSH_task4_SED_2 | Lu JiaKai | PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China | task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 | 41.2 (40.1 - 42.4) | ||
LJK_PSH_task4_SED_4 | Lu JiaKai | PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China | task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 | 40.6 (39.6 - 41.6) | ||
Hao_CQU_task4_SED_2 | junyong Hao | CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 | 47.0 (46.0 - 48.1) | ||
Hao_CQU_task4_SED_3 | junyong Hao | CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 | 46.3 (45.5 - 47.4) | ||
Hao_CQU_task4_SED_1 | junyong Hao | CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 | 44.9 (43.9 - 45.8) | ||
Hao_CQU_task4_SED_4 | junyong Hao | CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 | 47.8 (46.9 - 49.0) | ||
Zhenwei_Hou_task4_SED_1 | Hou Zhenwei | CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#HouZ2020 | 45.1 (44.2 - 45.8) | ||
deBenito_AUDIAS_task4_SED_1 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-and-separation-in-domestic-environments-results#deBenito2020 | 38.2 (37.5 - 39.2) | ||
deBenito_AUDIAS_task4_SED_1 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-and-separation-in-domestic-environments-results#de Benito-Gorron2020 | 37.9 (37.0 - 39.1) | ||
Koh_NTHU_task4_SED_3 | Chih-Yuan Koh | National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 | 46.6 (45.8 - 47.6) | ||
Koh_NTHU_task4_SED_2 | Chih-Yuan Koh | National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 | 45.2 (44.3 - 46.3) | ||
Koh_NTHU_task4_SED_1 | Chih-Yuan Koh | National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 | 45.2 (44.2 - 46.1) | ||
Koh_NTHU_task4_SED_4 | Chih-Yuan Koh | National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 | 46.3 (45.4 - 47.2) | ||
Cornell_UPM-INRIA_task4_SED_2 | Samuele Cornell | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 | 42.0 (40.9 - 43.1) | ||
Cornell_UPM-INRIA_task4_SED_1 | Samuele Cornell | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 | 44.4 (43.3 - 45.5) | ||
Cornell_UPM-INRIA_task4_SED_4 | Samuele Cornell | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 | 43.2 (42.1 - 44.4) | ||
Cornell_UPM-INRIA_task4_SS_SED_1 | Samuele Cornell | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 | 38.6 (37.5 - 39.6) | Sound Separation | |
Cornell_UPM-INRIA_task4_SED_3 | Samuele Cornell | Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy | task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 | 42.6 (41.6 - 43.5) | ||
Yao_UESTC_task4_SED_1 | Tianchu Yao | University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 | 44.1 (43.1 - 45.2) | ||
Yao_UESTC_task4_SED_3 | Tianchu Yao | University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 | 46.4 (45.3 - 47.6) | ||
Yao_UESTC_task4_SED_2 | Tianchu Yao | University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 | 45.7 (44.7 - 47.0) | ||
Yao_UESTC_task4_SED_4 | Tianchu Yao | University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 | 46.2 (45.2 - 47.0) | ||
Liu_thinkit_task4_SED_1 | Yuzhuo Liu | The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 | 40.7 (39.7 - 41.7) | ||
Liu_thinkit_task4_SED_1 | Yuzhuo Liu | The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 | 41.8 (40.7 - 42.9) | ||
Liu_thinkit_task4_SED_1 | Yuzhuo Liu | The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 | 45.2 (44.2 - 46.5) | ||
Liu_thinkit_task4_SED_4 | Yuzhuo Liu | The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 | 43.1 (42.1 - 44.2) | ||
PARK_JHU_task4_SED_1 | Sangwook Park | Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 | 35.8 (35.0 - 36.6) | ||
PARK_JHU_task4_SED_1 | Sangwook Park | Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 | 26.5 (25.7 - 27.5) | ||
PARK_JHU_task4_SED_2 | Sangwook Park | Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 | 36.9 (36.1 - 37.7) | ||
PARK_JHU_task4_SED_3 | Sangwook Park | Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 | 34.7 (34.1 - 35.6) | ||
Chen_NTHU_task4_SS_SED_1 | You Siang Chen | National Tsing Hua University, Department of Power Mechanical Engineering, Hsinchu, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Chen2020 | 34.5 (33.5 - 35.3) | Sound Separation | |
CTK_NU_task4_SED_2 | Teck Kai Chan | Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore | task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 | 44.4 (43.5 - 45.5) | ||
CTK_NU_task4_SED_4 | Teck Kai Chan | Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore | task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 | 46.3 (45.3 - 47.4) | ||
CTK_NU_task4_SED_3 | Teck Kai Chan | Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore | task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 | 45.8 (45.0 - 47.0) | ||
CTK_NU_task4_SED_1 | Teck Kai Chan | Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore | task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 | 43.5 (42.6 - 44.7) | ||
YenKu_NTU_task4_SED_4 | Hao Yen | National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 | 42.7 (41.6 - 43.6) | ||
YenKu_NTU_task4_SED_2 | Hao Yen | National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 | 42.6 (41.8 - 43.7) | ||
YenKu_NTU_task4_SED_3 | Hao Yen | National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 | 41.6 (40.6 - 42.7) | ||
YenKu_NTU_task4_SED_1 | Hao Yen | National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan | task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 | 43.6 (42.4 - 44.6) | ||
Tang_SCU_task4_SED_1 | Maolin Tang | Sichuan University, Computer Science and Technology, Sichuan, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 | 43.1 (42.3 - 44.1) | ||
Tang_SCU_task4_SED_4 | Maolin Tang | Sichuan University, Computer Science and Technology, Sichuan, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 | 44.1 (43.4 - 44.8) | ||
Tang_SCU_task4_SED_2 | Maolin Tang | Sichuan University, Computer Science and Technology, Sichuan, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 | 42.4 (41.4 - 43.4) | ||
Tang_SCU_task4_SED_3 | Maolin Tang | Sichuan University, Computer Science and Technology, Sichuan, China | task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 | 44.1 (43.3 - 45.0) | ||
DCASE2020_SED_baseline_system | Nicolas Turpault | Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France | task-sound-event-detection-in-domestic-environments#Turpault2020a | 34.9 (34.0 - 35.7) | ||
DCASE2020_SS_SED_baseline_system | Nicolas Turpault | Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France | task-sound-event-detection-in-domestic-environments#Turpault2020b | 36.5 (35.6 - 37.2) | Sound Separation | |
Ebbers_UPB_task4_SED_1 | Janek Ebbers | Paderborn University, Department of Communications Engineering, Paderborn, Germany | task-sound-event-detection-and-separation-in-domestic-environments-results#Ebbers2020 | 47.2 (46.5 - 48.1) |
Complete results and technical reports can be found on the Task 4 results page.
Baselines
There are three different baselines: one for sound event detection, one for sound separation and one combining sound separation and sound event detection.
Sound event detection baseline
The baseline model is inspired by the second-best submission to DCASE 2019 task 4 and is an improvement over the DCASE 2019 baseline. The model is a mean teacher model.
The main differences of the baseline system (without source separation) compared to DCASE 2019 are:
- The sampling rate is 16 kHz.
- Features: 2048-point FFT window, 255-sample hop size, 128 mel bins, 8 kHz maximum mel frequency (see the sketch after this list).
- A different synthetic dataset is used.
- The architecture (number of layers) is taken from L. Delphin-Poulat & C. Plapous.
- The learning rate is ramped up over the first 50 epochs.
- A median window of 0.45 s is used for post-processing.
- Early stopping (10 epochs).
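As an illustration, a minimal feature-extraction sketch with these parameters (using librosa and a placeholder waveform; this is not the exact baseline code) could be:

```python
import librosa
import numpy as np

# Placeholder 10-second waveform at the baseline sampling rate.
sr = 16000
y = np.zeros(10 * sr, dtype=np.float32)

# Log-mel features with the baseline parameters: 2048-point FFT, 255-sample hop,
# 128 mel bins, 8 kHz maximum mel frequency.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=255,
                                     n_mels=128, fmax=8000)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (128, n_frames)
```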
Performance
 | Event-based macro F-score | PSDS macro F-score | PSDS | PSDS cross-trigger | PSDS macro |
---|---|---|---|---|---|
Validation | 34.8 % | 60.0 % | 0.610 | 0.524 | 0.433 |
Download
Note: the performance might not be exactly reproducible on a GPU-based system.
That is why you can download the weights of the networks used for the experiments and run TestModel.py --model_path="Path_of_model"
to reproduce the results.
Nicolas Turpault and Romain Serizel. Training sound event detection on a heterogeneous dataset. working paper or preprint, 2020.
Training Sound Event Detection On A Heterogeneous Dataset
Abstract
Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.
Lionel Delphin-Poulat and Cyril Plapous. Mean teacher with data augmentation for dcase 2019 task 4. Technical Report, Orange Labs Lannion, France, June 2019.
MEAN TEACHER WITH DATA AUGMENTATION FOR DCASE 2019 TASK 4
Abstract
In this paper, we present our neural network for the DCASE 2019 challenge’s Task 4 (Sound event detection in domestic environments) [1]. The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled. We propose a mean-teacher model with convolutional neural network (CNN) and recurrent neural network (RNN) together with data augmentation and a median window tuned for each class based on prior knowledge.
Antti Tarvainen and Harri Valpola. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204. 2017.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
Abstract
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.
Sound separation baseline
This baseline model consists of a TDCN++ masking network using STFT analysis/synthesis and weighted mixture consistency, where the weights are predicted by the network, with one scalar per source. The training loss is a thresholded negative signal-to-noise ratio (SNR).
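As an illustration of the weighted mixture-consistency step, the short sketch below shows the projection described in the consistency-constraints paper cited below; the PyTorch framing, the softmax normalization of the predicted weights, and the tensor shapes are our assumptions, not the released baseline code.

```python
import torch

def weighted_mixture_consistency(est_sources, mixture, weights):
    """Illustrative weighted mixture-consistency projection (assumed shapes).

    est_sources: (n_sources, n_samples) separated estimates
    mixture:     (n_samples,) input mixture
    weights:     (n_sources,) scalars predicted by the network, one per source
    """
    # Normalize the predicted scalars so they sum to one (assumption: softmax).
    w = torch.softmax(weights, dim=0)
    # Residual that the current estimates fail to explain.
    residual = mixture - est_sources.sum(dim=0)
    # Redistribute the residual so the corrected estimates sum to the mixture.
    return est_sources + w.unsqueeze(1) * residual
```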
The model architecture is able to handle a variable number of sources by using different loss functions for active and inactive reference sources. For active reference sources (i.e. non-zero reference source signals), the threshold for the negative SNR is 30 dB, which corresponds to the error power being 30 dB below the reference power. For inactive reference sources (i.e. all-zero reference source signals), the threshold is 20 dB measured relative to the mixture power, which means gradients are clipped once the error power is 20 dB below the mixture power.
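To make the loss concrete, here is a minimal sketch of a thresholded negative SNR computed for one estimate/reference pair; the PyTorch framing, the function name and the per-clip reduction are illustrative assumptions, not the released baseline implementation.

```python
import torch

def thresholded_neg_snr(est, ref, mix, active_thresh_db=30.0, inactive_thresh_db=20.0, eps=1e-8):
    """Illustrative thresholded negative SNR for one (estimate, reference) pair.

    est, ref, mix are 1-D waveform tensors of equal length. The soft threshold
    makes the gradient vanish once the error power falls `thresh` dB below the
    reference power (active source) or the mixture power (inactive source).
    """
    err_pow = torch.sum((est - ref) ** 2) + eps
    mix_pow = torch.sum(mix ** 2) + eps
    ref_pow = torch.sum(ref ** 2)

    if ref_pow > eps:
        # Active reference source: negative SNR, soft-thresholded at 30 dB.
        tau = 10.0 ** (-active_thresh_db / 10.0)
        return 10.0 * torch.log10(err_pow + tau * ref_pow) - 10.0 * torch.log10(ref_pow)
    # All-zero reference source: penalize residual energy relative to the mixture,
    # soft-thresholded at 20 dB below the mixture power.
    tau = 10.0 ** (-inactive_thresh_db / 10.0)
    return 10.0 * torch.log10(err_pow + tau * mix_pow) - 10.0 * torch.log10(mix_pow)
```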
This model architecture achieves the following performance when trained and evaluated on the two variants of the FUSS dataset: reverberant and dry (i.e. non-reverberant):
  | Validation single-source SI-SNR | Validation multi-source SI-SNRi | Eval single-source SI-SNR | Eval multi-source SI-SNRi |
---|---|---|---|---|
Reverberant FUSS | 35.0 dB | 13.0 dB | 37.6 dB | 12.5 dB |
Dry FUSS | 30.6 dB | 10.5 dB | 31.8 dB | 10.2 dB |
Download
Reverberant FUSS baseline
Dry FUSS baseline
Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.
Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. Improving universal sound separation using sound classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). May 2020. URL: https://arxiv.org/abs/1911.07951.
Scott Wisdom, John R Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, and Rif A Saurous. Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. IEEE, 2019.
Differentiable consistency constraints for improved deep speech enhancement
Abstract
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.
Sound event detection and separation baseline
For the combined SED+SS baseline, we mix together audio from DESED and FUSS training and validation data, to create new mixtures with both in-domain (DESED) and open-domain (FUSS) sources. The sound separation model is trained to separate these mixtures into three output signals: DESED background, mixture of DESED foreground sounds, and mixture of FUSS sounds. This model is trained in the same way as the baseline SS model, except without permutation invariance. On the DESED+FUSS validation set, this model achieves an average of 18.6 dB SI-SNR improvement for the separated DESED foreground mixture, which is used as input to the SED model.
The baseline combining SS and SED relies on late integration of the SED baseline applied to the separated sound sources.
The sound separation baseline has been trained using 3 sources, so it returns:
- DESED background
- DESED foreground
- FUSS mixture
In our case, we use only the output of the second source (DESED foreground).
To obtain the predictions of the combined SED and SS system, we proceed as follows (a sketch is given after the list):
- Get the output (not binarized with threshold) on the original mixtures using the SED baseline
- Get the output (not binarized with threshold) of the SED baseline on the DESED foreground source produced by the SS model
- Take the average of both outputs
- Apply thresholds (different thresholds for the F-score and PSDS metrics)
- Apply median filtering (0.45s)
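The sketch below illustrates this late-fusion recipe; the array shapes, the per-class thresholds and the assumed frame hop are ours for illustration and do not come from the baseline code.

```python
import numpy as np
from scipy.ndimage import median_filter

def fuse_sed_outputs(probs_mixture, probs_foreground, thresholds,
                     frame_hop_s=0.08, median_window_s=0.45):
    """Illustrative late fusion of the two SED outputs.

    probs_mixture, probs_foreground: (n_frames, n_classes) un-thresholded class
    probabilities from the SED baseline run on the original mixture and on the
    separated DESED foreground. thresholds: one decision threshold per class
    (different thresholds would be used for the F-score and PSDS evaluations).
    frame_hop_s is a hypothetical frame resolution.
    """
    # 1) Average the two sets of frame-level probabilities.
    probs = 0.5 * (probs_mixture + probs_foreground)
    # 2) Binarize with the per-class thresholds.
    decisions = probs >= np.asarray(thresholds)[None, :]
    # 3) Smooth each class track with a ~0.45 s median filter along time.
    win = max(1, int(round(median_window_s / frame_hop_s)))
    win += 1 - win % 2  # keep the window length odd so it stays centered
    return median_filter(decisions.astype(float), size=(win, 1)) > 0.5
```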
Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John R. Hershey, Romain Serizel, Eduardo Fonseca, and Prem Seetharaman. Improving sound event detection in domestic environments using sound separation. working paper or preprint, 2020.
Improving Sound Event Detection In Domestic Environments Using Sound Separation
Abstract
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.
Performance
  | Event-based macro F-score | PSDS macro F-score | PSDS | PSDS cross-trigger | PSDS macro |
---|---|---|---|---|---|
Validation | 35.6 % | 60.5 % | 0.626 | 0.546 | 0.449 |
Download
Sound separation
Sound event detection
Citation
If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following papers:
Task and datasets
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data
Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.
What's All the FUSS About Free Universal Sound Separation Data?
Baselines
Sound event detection
Nicolas Turpault and Romain Serizel. Training sound event detection on a heterogeneous dataset. working paper or preprint, 2020.
Training Sound Event Detection On A Heterogeneous Dataset
Abstract
Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.
Sound separation
Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.
Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. Improving universal sound separation using sound classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). May 2020. URL: https://arxiv.org/abs/1911.07951.
Scott Wisdom, John R Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, and Rif A Saurous. Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. IEEE, 2019.
Differentiable consistency constraints for improved deep speech enhancement
Abstract
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.
Sound separation and sound event detection
Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John R. Hershey, Romain Serizel, Eduardo Fonseca, and Prem Seetharaman. Improving sound event detection in domestic environments using sound separation. working paper or preprint, 2020.
Improving Sound Event Detection In Domestic Environments Using Sound Separation
Abstract
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.