Sound event detection and separation in domestic environments


Task description

The goal of the task is to evaluate systems for the detection of sound events using real data that is either weakly labeled or unlabeled, and simulated data that is strongly labeled (with timestamps).

The challenge has ended. Full results for this task can be found on the Results page.

Description

This task is the follow-up to DCASE 2019 task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording (see also Fig 1). The challenge of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance remains. Isolated sound events, background sound files and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered reliable.

Figure 1: Overview of a sound event detection system.


This year, we also encourage participants to propose systems that use sound separation jointly with sound event detection. This step can be used to separate overlapping sound events and extract foreground sound events from the background sound. To motivate participants to explore that direction, we provide a baseline sound separation model that can be used for pre-processing (see also Fig 2).

Figure 2: Example of a sound event detection system with a sound separation pre-processing.


Compared to previous years, this task aims to investigate how synthetic data can be exploited optimally. An additional scientific question is to what extent sound separation can improve sound event detection, and vice versa.

Audio dataset

The data for DCASE 2020 task 4 consists of several datasets designed for sound event detection and/or sound separation. The datasets are described below.

Audio material

| Dataset | Subset | Type | Usage | Annotations | Sampling frequency |
|---|---|---|---|---|---|
| DESED | Real: weakly labeled | Real soundscapes | Training | Weak labels (no timestamps) | 44.1 kHz |
| DESED | Real: unlabeled | Real soundscapes | Training | No annotations | 44.1 kHz |
| DESED | Real: validation | Real soundscapes | Validation | Strong labels (with timestamps) | 44.1 kHz |
| DESED | Real: public evaluation | Real soundscapes | Evaluation **(do not use this subset to tune hyperparameters)** | Strong labels (with timestamps) | 44.1 kHz |
| DESED | Synthetic: training | Isolated events + synthetic soundscapes | Training/validation | Strong labels (with timestamps) | 16 kHz |
| DESED | Synthetic: evaluation | Isolated events + backgrounds | Evaluation **(do not use this subset to tune hyperparameters)** | Event-level labels (no timestamps) | 16 kHz |
| SINS | | Background | Training/validation | No annotations | 16 kHz |
| TUT Acoustic scenes 2017, development dataset | | Background | Training/validation | No annotations | 44.1 kHz |
| Source separation dataset | | Isolated events + synthetic soundscapes | Training/validation | No annotations | 16 kHz |

If you plan to perform source separation (or to use backgrounds from the SINS dataset), please resample your recorded data to 16 kHz. If you use only recorded data and perform only sound event detection, you can use sampling rates up to 44.1 kHz.
Please note that the baselines work on 16 kHz data.

DESED dataset

The DESED dataset is the dataset that was used in DCASE 2019 task 4. It is composed of 10-second audio clips recorded in domestic environments or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of Audioset (not all the classes are present in Audioset as such; some sound event classes group several Audioset classes). The classes are listed below, with the corresponding annotation label in parentheses:

  • Speech (Speech)
  • Dog (Dog)
  • Cat (Cat)
  • Alarm/bell/ringing (Alarm_bell_ringing)
  • Dishes (Dishes)
  • Frying (Frying)
  • Blender (Blender)
  • Running water (Running_water)
  • Vacuum cleaner (Vacuum_cleaner)
  • Electric shaver/toothbrush (Electric_shaver_toothbrush)

More information about this dataset and how to generate synthetic soundscapes can be found on DESED website.

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

PDF

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

PDF
Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.

PDF

Sound event detection in synthetic domestic environments


Keywords

Semi-supervised learning ; weakly labeled data ; synthetic data ; sound event detection

PDF

Sound separation dataset (FUSS)

The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.

Overview: The audio data is sourced from a subset of a prerelease of FSD50k, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50k labels, these sound source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these sound source files, and are not considered part of the challenge. Thus, official challenge results should not use FSD50k labels, even though they may become available upon FSD50k release. To create mixtures, 10 second clips of sounds are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sounds. Sound source files longer than 10s are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and access to the original source audio.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

PDF

FSD50K: an Open Dataset of Human-Labeled Sound Events

Abstract

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

PDF
Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.

PDF

Freesound technical demo

Abstract

Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.

PDF
Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.

What's All the FUSS About Free Universal Sound Separation Data?

Motivation: This dataset provides a platform to investigate how sound separation may help with event detection and vice versa. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, sound separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.

Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse response corresponds to a rectangular room of random size with random wall materials, in which a single microphone and up to 4 sound sources are placed at random spatial locations.

Recipe for data creation: The data creation recipe starts with scripts, based on Scaper, to generate mixtures of events with random timing of sound events, along with a background sound that spans the duration of the mixture clip.
The constituent sound files for each mixture are also generated for use as references for training and evaluation.
The data are reverberated using a different room simulation for each mixture.
In this simulation each sound source has its own reverberation corresponding to a different spatial location.
The reverberated mixtures are created by summing over the reverberated sound sources.
The data creation scripts support modification, so that participants may remix and augment the training data as desired.

Additional (background) datasets

SINS dataset: A part of the derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of the dataset for this task. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week.
It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.

Publication

Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.

PDF

The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Abstract

There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks

PDF

TUT Acoustic scenes 2017, development dataset: The TUT Acoustic Scenes 2017 dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided as individual files.

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

PDF

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

PDF

Generating your own training data using Scaper

Participants can use the provided isolated foreground and background sounds, in combination with the Scaper soundscape synthesis and augmentation library, to generate additional (potentially infinite!) training data.

Resources for getting started include:

  • The Scaper scripts provided with the DESED dataset (link)
  • The canonical Scaper script used to create the source separation dataset (link)
  • The Scaper tutorial
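As a concrete starting point, the sketch below generates one 10-second soundscape from the DESED soundbank with Scaper. The folder layout, output file names and parameter values are illustrative assumptions, not the official generation settings; use the provided DESED scripts to reproduce the official synthetic subsets.

```python
# Minimal Scaper sketch (assumed paths; adapt to your DESED soundbank download).
import scaper

FG_PATH = "DESED/soundbank/foreground"  # assumed foreground folder
BG_PATH = "DESED/soundbank/background"  # assumed background folder

sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH)
sc.ref_db = -50  # reference level for the background

# One background spanning the whole clip
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# Up to two overlapping foreground events (the task caps polyphony at 2)
for _ in range(2):
    sc.add_event(label=('choose', []),
                 source_file=('choose', []),
                 source_time=('const', 0),
                 event_time=('uniform', 0, 6),
                 event_duration=('uniform', 0.5, 4.0),
                 snr=('uniform', 6, 30),
                 pitch_shift=None,
                 time_stretch=None)

# Writes the audio, a JAMS annotation and a tab-separated file with strong labels
sc.generate('soundscape.wav', 'soundscape.jams', txt_path='soundscape.txt')
```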
Publication

J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: a library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 344–348. New Paltz, NY, USA, Oct. 2017.

PDF

Scaper: A Library for Soundscape Synthesis and Augmentation

PDF

Reference labels

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task in which experts were exposed to 10 randomly selected clips per class, and found that in most cases not all the clips contain the event corresponding to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file in the following format:

[filename (string)][tab][event_labels (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec to t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by a comma.
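For reference, a minimal sketch of loading the weak annotations with pandas (the file name weak.tsv is an assumption; use the path of the annotation file from your DESED download):

```python
import pandas as pd

# Tab-separated file with columns: filename, event_labels
weak = pd.read_csv("weak.tsv", sep="\t")

# Turn the comma-separated class string into a Python list per clip
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```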

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation of the development set).

The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50k was verified by humans in order to check that the event class present in the FSD50k annotation was indeed dominant in the audio clip.


In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms: when the silence between two consecutive events from the same class was shorter than 150 ms, the events were merged into a single event (see the sketch after the format example below). The strong annotations are provided in a tab-separated csv file in the following format:

  [filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
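As an illustration of the 150 ms merge rule described above, the sketch below fuses same-class events separated by less than 150 ms after loading a strong-label file (the file name is an assumption; the column names follow the format above):

```python
import pandas as pd

def merge_short_gaps(df, min_gap=0.150):
    """Fuse events of the same class in the same file when the gap between them is shorter than min_gap seconds."""
    merged = []
    for (fname, label), group in df.groupby(["filename", "event_label"]):
        group = group.sort_values("onset")
        cur_on, cur_off = None, None
        for _, row in group.iterrows():
            if cur_on is None:
                cur_on, cur_off = row.onset, row.offset
            elif row.onset - cur_off < min_gap:
                cur_off = max(cur_off, row.offset)  # fuse with the previous event
            else:
                merged.append((fname, cur_on, cur_off, label))
                cur_on, cur_off = row.onset, row.offset
        merged.append((fname, cur_on, cur_off, label))
    return pd.DataFrame(merged, columns=["filename", "onset", "offset", "event_label"])

annotations = pd.read_csv("synthetic_strong.tsv", sep="\t")  # assumed file name
clean = merge_short_gaps(annotations)
```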

Download

The dataset is composed of two subsets that can be downloaded independently. The procedure to download each subset is described below.

DESED dataset

The instructions to download the audio files, the annotations and example scripts to generate synthetic soundscapes can be found on the DCASE 2020 section of DESED website.


If you experience problems while downloading the recorded soundscapes, please contact the task organizers (Nicolas Turpault and Romain Serizel in priority).

Source separation (FUSS) dataset

Instructions to download the original audio data, the model parameters, the audio mixtures and the recipes to generate them are available on the FUSS repositories (Zenodo and GitHub).



Task setup

The challenge consists of detecting sound events within web videos using training data from real recordings (both weakly labeled and unlabeled) and strongly labeled synthetic audio clips. Detection within a 10-second clip should be performed with start and end timestamps. As we encourage participants to use a source separation algorithm together with sound event detection, there are three possible scenarios:

  • You are working on sound event detection without source separation pre-processing (this is a direct follow-up to last year's task 4)
  • You are working on both source separation and sound event detection (this includes the case where the source separation baseline is reused)
  • You are working only on source separation and use the sound event detection baseline

In each case, participants are expected to provide the output of a sound event detection system (see also below). Note that an additional (optional) separate evaluation set designed to evaluate source separation performance will also be provided (see also below).

Development dataset

The development set is divided into six main partitions (four for sound event detection, two for source separation) that are detailed below.

Sound event detection training set

This dataset is a subset of DESED. We provide three different splits of training data: the labeled training set, the unlabeled in-domain training set and the synthetic set with strong annotations. The first two sets are the same as in DCASE 2019 task 4.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The number of clips per class is the following:

Unlabeled in-domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty of the Audioset labels, this distribution might not match exactly.

Synthetic strongly labeled set:
This set is composed of 2584 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event class is close to that of the validation set. We used all the foreground files from the DESED synthetic soundbank and background files annotated as "other" from the subpart of the SINS dataset. The default parameters and the room impulse responses are the same as for the source separation training set (but the event distribution is different). Note that a 10-second clip may contain more than one sound event, but the maximum polyphony is limited to 2.

This year we share the original data and the scripts to generate soundscapes, and we encourage participants to create their own subsets. See the DESED github repository and the Scaper documentation for more information on how to create new soundscapes.

Sound event detection validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. It is the same as the DCASE 2019 task 4 validation set. The validation set contains 1168 clips (4093 events) and is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may contain more than one sound event. The number of events per class is the following:

Source separation training set

The source separation training set consists of 20000 mixture clips. The ground-truth reference sources are provided for each of these mixtures. Each 10 second mixture contains between 1 and 4 sources. Every mixture contains one background source, which is active for the entire duration.

Source separation validation set

The source separation validation set is generated in the same way as the training set and consists of 1000 mixture clips, along with corresponding ground-truth reference sources. Raw source clips for this set come from different Freesound uploaders than those used to generate the training set.

Evaluation datasets

As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios:

  1. You are working on sound event detection without source separation pre-processing
  2. You are working on both source separation and sound event detection
  3. You are working only on source separation and use the sound event detection baseline

Participants are allowed to submit up to 4 different systems for each scenario listed above. In each case, participants are expected to provide at least the output of a sound event detection system. Participants who want their sound separation submission to be evaluated can download the specific sound separation evaluation dataset (see below).

PSDS submissions are optional. If you want your system to be evaluated with PSDS, please strictly follow the format illustrated in the example submission package.

Before submission, please check that your submission package is correct with the validation script enclosed in the submission package: python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2020/task4_test


Sound event detection evaluation dataset

The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.

  • A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes and includes the public evaluation dataset.

  • A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.

Sound separation evaluation dataset

The sound separation evaluation dataset is the evaluation part of the FUSS dataset.

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Participants are allowed to submit up to 4 different systems for each scenario listed in the task setup section.
  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Other examples of external data are materials related to the original videos, such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • For the real recordings (from Audioset), only the weak labels can be used for training the submitted systems; neither the strong labels (timestamps) nor the original Audioset labels can be used. For the synthetic clips, the provided strong labels can be used.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.

Evaluation

Sound event detection evaluation

All submissions will be evaluated with event-based measures using a 200 ms collar on onsets and a collar on offsets of 200 ms or 20% of the event length. Submissions will be ranked according to the event-based F1-score computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the ranking). Additionally, multiple polyphonic sound event detection scores will be provided as contrastive measures.

F-scores are computed using a single operating point (decision threshold = 0.5), while PSDS values are computed using 50 operating points (decision thresholds linearly distributed from 0.01 to 0.99). The evaluation with the PSDS metric is optional.
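For reference, a minimal sketch of computing the event-based metrics with the sed_eval toolbox using the collars above (the toy event lists are placeholders; in practice the reference and system output are loaded from strong-label files):

```python
import sed_eval

# Toy reference and system output; in practice these come from the
# strong-label tsv files (same fields: filename, onset, offset, event_label).
reference = [
    {"filename": "clip1.wav", "onset": 0.50, "offset": 2.00, "event_label": "Dog"},
    {"filename": "clip1.wav", "onset": 3.10, "offset": 4.60, "event_label": "Speech"},
]
estimated = [
    {"filename": "clip1.wav", "onset": 0.60, "offset": 2.05, "event_label": "Dog"},
]

event_based = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=["Dog", "Speech"],
    t_collar=0.200,              # 200 ms collar on onsets
    percentage_of_length=0.20,   # offset collar: 200 ms or 20% of the event length
)
event_based.evaluate(reference_event_list=reference, estimated_event_list=estimated)

# Class-wise (macro) average F-score, as used for the ranking
print(event_based.results_class_wise_average_metrics()["f_measure"]["f_measure"])
```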

The parameters used for PSDS performances are:

  • Detection Tolerance parameter (dtc): 0.5
  • Ground Truth intersection parameter (gtc): 0.5
  • Cross-Trigger Tolerance parameter (cttc): 0.3
  • maximum False Positive rate (e_max): 100

The differences between the three reported PSDS values are:

| | alpha_ct | alpha_st |
|---|---|---|
| PSDS | 0 | 0 |
| PSDS cross-trigger | 1 | 0 |
| PSDS macro | 0 | 1 |

alpha_ct is the weight related to the cost of cross-triggers, and alpha_st is the weight related to the cost of instability across classes.
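A minimal sketch of computing one of the PSDS values above with the psds_eval toolbox, following its documented API (file names and the set of thresholds are placeholders):

```python
import pandas as pd
from psds_eval import PSDSEval

# Ground truth and clip durations (assumed file names; columns follow psds_eval conventions)
ground_truth = pd.read_csv("validation.tsv", sep="\t")        # filename, onset, offset, event_label
metadata = pd.read_csv("validation_durations.tsv", sep="\t")  # filename, duration

psds_metric = PSDSEval(dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,
                       ground_truth=ground_truth, metadata=metadata)

# One set of detections per decision threshold (50 operating points in the challenge setup)
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    detections = pd.read_csv(f"predictions_th_{threshold:.2f}.tsv", sep="\t")
    psds_metric.add_operating_point(detections)

# alpha_ct / alpha_st select between PSDS, PSDS cross-trigger and PSDS macro
psds = psds_metric.psds(alpha_ct=0.0, alpha_st=0.0, max_efpr=100)
print(psds.value)
```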

Evaluation is done using sed_eval and psds_eval toolboxes:



You can find more information on how to use PSDS for task 4 on the dedicated notebook:


Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.

PDF

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

PDF
Publication

Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulovic. A framework for the robust evaluation of sound event detection. arXiv preprint arXiv:1910.08440, 2019. URL: https://arxiv.org/abs/1910.08440.

PDF

A Framework for the Robust Evaluation of Sound Event Detection

PDF

Source separation evaluation (optional)

Source separation approaches will optionally be evaluated on an additional evaluation set with standard source separation metrics, as follows. For each example mixture x containing J sources, performance will be measured with permutation-invariant scale-invariant signal-to-noise ratio improvement (SI-SNRi); a small implementation sketch follows the list below.

  • SNR is defined as 10 * log10 of the ratio of source power to error power.
  • Scale-invariant SNR (SI-SNR) allows scaling of the reference to best match the estimate. The optimal scaling factor given an estimate signal e and reference signal r is ⟨e, r⟩ / ⟨r, r⟩, where ⟨·, ·⟩ denotes the inner product.
  • Permutation invariance allows the estimates to be permuted to match reference signals such that the mean SI-SNR across sources is maximized.
  • SI-SNR improvement (SI-SNRi) is the SI-SNR of the estimate with respect to the reference, minus SI-SNR of the mixture with respect to the reference.
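The sketch below implements SI-SNR and SI-SNRi for a single estimate/reference pair, following the definitions above (permutation invariance over sources is left out for brevity):

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB: the reference is rescaled to best match the estimate."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                    # optimally scaled reference
    error = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps))

def si_snr_improvement(estimate, reference, mixture):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```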

A python evaluation script for source separation can be found on:


Results

Each row in the table below lists: the submission code, the author, the affiliation, the technical report, the event-based F-score with its 95% confidence interval on the evaluation dataset, and whether the submission uses sound separation.
Xiaomi_task4_SED_1 Chuming Liang Xiaomi Co., AI Lab, Wuhan, China task-sound-event-detection-and-separation-in-domestic-environments-results#Liang2020 36.0 (35.3 - 36.8)
Rykaczewski_Samsung_taks4_SED_3 Krzysztof Rykaczewski Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 21.6 (21.0 - 22.4)
Rykaczewski_Samsung_taks4_SED_2 Krzysztof Rykaczewski Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 21.9 (21.3 - 22.7)
Rykaczewski_Samsung_taks4_SED_4 Krzysztof Rykaczewski Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 10.4 (9.7 - 11.1)
Rykaczewski_Samsung_taks4_SED_1 Krzysztof Rykaczewski Samsung R&D Institute Poland - Samsung Research, Audio Intelligence, Warsaw, Poland task-sound-event-detection-and-separation-in-domestic-environments-results#Rykaczewski2020 21.6 (20.8 - 22.4)
Hou_IPS_task4_SED_1 Bowei Hou Waseda University, Graduate School of Information, Production and Systems, Kitakyushu, Japan task-sound-event-detection-and-separation-in-domestic-environments-results#HouB2020 34.9 (34.0 - 35.7)
Miyazaki_NU_task4_SED_1 Koichi Miyazaki Nagoya University, Japan task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 51.1 (50.1 - 52.3)
Miyazaki_NU_task4_SED_2 Koichi Miyazaki Nagoya University, Japan task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 46.4 (45.5 - 47.5)
Miyazaki_NU_task4_SED_3 Koichi Miyazaki Nagoya University, Japan task-sound-event-detection-and-separation-in-domestic-environments-results#Miyazaki2020 50.7 (49.6 - 51.9)
Huang_ICT-TOSHIBA_task4_SED_3 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.3 (43.4 - 45.4)
Huang_ICT-TOSHIBA_task4_SED_1 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.6 (43.5 - 46.0)
Huang_ICT-TOSHIBA_task4_SS_SED_4 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.1 (42.9 - 45.4) Sound Separation
Huang_ICT-TOSHIBA_task4_SED_4 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.3 (43.2 - 45.6)
Huang_ICT-TOSHIBA_task4_SS_SED_1 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.7 (43.6 - 46.2) Sound Separation
Huang_ICT-TOSHIBA_task4_SED_2 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.3 (43.2 - 45.6)
Huang_ICT-TOSHIBA_task4_SS_SED_3 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.4 (43.2 - 45.8) Sound Separation
Huang_ICT-TOSHIBA_task4_SS_SED_2 Yuxin Huang Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Huang2020 44.5 (43.3 - 46.0) Sound Separation
Copiaco_UOW_task4_SED_2 Abigail Copiaco University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020a 7.8 (7.3 - 8.2)
Copiaco_UOW_task4_SED_1 Abigail Copiaco University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020a 7.5 (7.0 - 8.0)
Kim_AiTeR_GIST_SED_1 Nam Kyun Kim Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 43.7 (42.8 - 44.7)
Kim_AiTeR_GIST_SED_2 Nam Kyun Kim Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 43.9 (43.0 - 44.7)
Kim_AiTeR_GIST_SED_4 Nam Kyun Kim Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 44.4 (43.5 - 45.2)
Kim_AiTeR_GIST_SED_3 Nam Kyun Kim Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea task-sound-event-detection-and-separation-in-domestic-environments-results#Kim2020 44.2 (43.4 - 45.1)
Copiaco_UOW_task4_SS_SED_1 Abigail Copiaco University of Wollongong, Department of Engineering and Information Sciences, Wollongong, Australia task-sound-event-detection-and-separation-in-domestic-environments-results#Copiaco2020b 6.9 (6.7 - 7.2) Sound Separation
LJK_PSH_task4_SED_3 Lu JiaKai PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 38.6 (37.7 - 39.7)
LJK_PSH_task4_SED_1 Lu JiaKai PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 39.3 (38.4 - 40.4)
LJK_PSH_task4_SED_2 Lu JiaKai PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 41.2 (40.1 - 42.4)
LJK_PSH_task4_SED_4 Lu JiaKai PFU SHANGHAI Co., LTD, 1T3K, Shanghai, China task-sound-event-detection-and-separation-in-domestic-environments-results#JiaKai2020 40.6 (39.6 - 41.6)
Hao_CQU_task4_SED_2 junyong Hao CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 47.0 (46.0 - 48.1)
Hao_CQU_task4_SED_3 junyong Hao CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 46.3 (45.5 - 47.4)
Hao_CQU_task4_SED_1 junyong Hao CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 44.9 (43.9 - 45.8)
Hao_CQU_task4_SED_4 junyong Hao CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Hao2020 47.8 (46.9 - 49.0)
Zhenwei_Hou_task4_SED_1 Hou Zhenwei CHONGQING UNIVERSITY, Intelligent Information Technology and System Lab, Chongqing, China task-sound-event-detection-and-separation-in-domestic-environments-results#HouZ2020 45.1 (44.2 - 45.8)
deBenito_AUDIAS_task4_SED_1 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-and-separation-in-domestic-environments-results#deBenito2020 38.2 (37.5 - 39.2)
deBenito_AUDIAS_task4_SED_1 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-and-separation-in-domestic-environments-results#de Benito-Gorron2020 37.9 (37.0 - 39.1)
Koh_NTHU_task4_SED_3 Chih-Yuan Koh National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 46.6 (45.8 - 47.6)
Koh_NTHU_task4_SED_2 Chih-Yuan Koh National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 45.2 (44.3 - 46.3)
Koh_NTHU_task4_SED_1 Chih-Yuan Koh National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 45.2 (44.2 - 46.1)
Koh_NTHU_task4_SED_4 Chih-Yuan Koh National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Koh2020 46.3 (45.4 - 47.2)
Cornell_UPM-INRIA_task4_SED_2 Samuele Cornell Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 42.0 (40.9 - 43.1)
Cornell_UPM-INRIA_task4_SED_1 Samuele Cornell Università  Politecnica delle Marche, Department of Information Engineering, Ancona, Italy task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 44.4 (43.3 - 45.5)
Cornell_UPM-INRIA_task4_SED_4 Samuele Cornell Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 43.2 (42.1 - 44.4)
Cornell_UPM-INRIA_task4_SS_SED_1 Samuele Cornell Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 38.6 (37.5 - 39.6) Sound Separation
Cornell_UPM-INRIA_task4_SED_3 Samuele Cornell Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy task-sound-event-detection-and-separation-in-domestic-environments-results#Cornell2020 42.6 (41.6 - 43.5)
Yao_UESTC_task4_SED_1 Tianchu Yao University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 44.1 (43.1 - 45.2)
Yao_UESTC_task4_SED_3 Tianchu Yao University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 46.4 (45.3 - 47.6)
Yao_UESTC_task4_SED_2 Tianchu Yao University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 45.7 (44.7 - 47.0)
Yao_UESTC_task4_SED_4 Tianchu Yao University of Electronic Science and Technology of China, School of Information and Communication Engineering, Chengdu, China task-sound-event-detection-and-separation-in-domestic-environments-results#Yao2020 46.2 (45.2 - 47.0)
Liu_thinkit_task4_SED_1 Yuzhuo Liu The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 40.7 (39.7 - 41.7)
Liu_thinkit_task4_SED_1 Yuzhuo Liu The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 41.8 (40.7 - 42.9)
Liu_thinkit_task4_SED_1 Yuzhuo Liu The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 45.2 (44.2 - 46.5)
Liu_thinkit_task4_SED_4 Yuzhuo Liu The Institute of Acoustics of the Chinese Academy of Sciences, The Key Lab of Speech Acoustics and Content Understanding, Beijing, China task-sound-event-detection-and-separation-in-domestic-environments-results#Liu2020 43.1 (42.1 - 44.2)
PARK_JHU_task4_SED_1 Sangwook Park Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 35.8 (35.0 - 36.6)
PARK_JHU_task4_SED_1 Sangwook Park Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 26.5 (25.7 - 27.5)
PARK_JHU_task4_SED_2 Sangwook Park Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 36.9 (36.1 - 37.7)
PARK_JHU_task4_SED_3 Sangwook Park Johns Hopkins University., Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-and-separation-in-domestic-environments-results#Park2020 34.7 (34.1 - 35.6)
Chen_NTHU_task4_SS_SED_1 You Siang Chen National Tsing Hua University, Department of Power Mechanical Engineering, Hsinchu, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Chen2020 34.5 (33.5 - 35.3) Sound Separation
CTK_NU_task4_SED_2 Teck Kai Chan Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 44.4 (43.5 - 45.5)
CTK_NU_task4_SED_4 Teck Kai Chan Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 46.3 (45.3 - 47.4)
CTK_NU_task4_SED_3 Teck Kai Chan Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 45.8 (45.0 - 47.0)
CTK_NU_task4_SED_1 Teck Kai Chan Newcastle University Singapore, Faculty of Science, Agriculture, and Engineering, Singapore, Singapore task-sound-event-detection-and-separation-in-domestic-environments-results#Chan2020 43.5 (42.6 - 44.7)
YenKu_NTU_task4_SED_4 Hao Yen National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 42.7 (41.6 - 43.6)
YenKu_NTU_task4_SED_2 Hao Yen National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 42.6 (41.8 - 43.7)
YenKu_NTU_task4_SED_3 Hao Yen National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 41.6 (40.6 - 42.7)
YenKu_NTU_task4_SED_1 Hao Yen National Taiwan University, Department of Electrical Engineering, Taipei, Taiwan task-sound-event-detection-and-separation-in-domestic-environments-results#Yen2020 43.6 (42.4 - 44.6)
Tang_SCU_task4_SED_1 Maolin Tang Sichuan University, Computer Science and Technology, Sichuan, China task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 43.1 (42.3 - 44.1)
Tang_SCU_task4_SED_4 Maolin Tang Sichuan University, Computer Science and Technology, Sichuan, China task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 44.1 (43.4 - 44.8)
Tang_SCU_task4_SED_2 Maolin Tang Sichuan University, Computer Science and Technology, Sichuan, China task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 42.4 (41.4 - 43.4)
Tang_SCU_task4_SED_3 Maolin Tang Sichuan University, Computer Science and Technology, Sichuan, China task-sound-event-detection-and-separation-in-domestic-environments-results#Tang2020 44.1 (43.3 - 45.0)
DCASE2020_SED_baseline_system Nicolas Turpault Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France task-sound-event-detection-in-domestic-environments#Turpault2020a 34.9 (34.0 - 35.7)
DCASE2020_SS_SED_baseline_system Nicolas Turpault Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France task-sound-event-detection-in-domestic-environments#Turpault2020b 36.5 (35.6 - 37.2) Sound Separation
Ebbers_UPB_task4_SED_1 Janek Ebbers Paderborn University, Department of Communications Engineering, Paderborn, Germany task-sound-event-detection-and-separation-in-domestic-environments-results#Ebbers2020 47.2 (46.5 - 48.1)


Complete results and technical reports can be found on the Task 4 results page.

Baselines

There are three different baselines: one for sound event detection, one for sound separation, and one combining sound separation and sound event detection.

Sound event detection baseline

The baseline model is inspired by the second-best submission to DCASE 2019 task 4 and is an improvement over the DCASE 2019 baseline. The model is a mean-teacher model.

The main differences between the baseline system (without source separation) and the DCASE 2019 baseline are:

  • The sampling rate is 16 kHz.
  • Features: 2048-sample FFT window, 255-sample hop size, 128 mel bins with a maximum frequency of 8 kHz (see the feature extraction sketch after this list).
  • A different synthetic dataset is used.
  • The architecture (number of layers) is taken from L. Delphin-Poulat & C. Plapous.
  • The learning rate is ramped up over the first 50 epochs.
  • A median window of 0.45 s is used.
  • Early stopping (10 epochs).
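As an illustration of the feature settings listed above, a sketch of the corresponding log-mel extraction with librosa (the official baseline code may differ in details such as normalization):

```python
import librosa

y, sr = librosa.load("clip.wav", sr=16000)  # the baseline operates at 16 kHz
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,      # 2048-sample FFT window
    hop_length=255,  # 255-sample hop
    n_mels=128,      # 128 mel bins
    fmax=8000,       # maximum mel frequency of 8 kHz
)
log_mel = librosa.power_to_db(mel)  # log-mel features used as network input
print(log_mel.shape)                # (128, number_of_frames)
```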

Performance

| | Event-based macro F-score | PSDS macro F-score | PSDS | PSDS cross-trigger | PSDS macro |
|---|---|---|---|---|---|
| Validation | 34.8 % | 60.0 % | 0.610 | 0.524 | 0.433 |

Download


Note: The performance might not be exactly reproducible on a GPU-based system. That is why you can download the weights of the networks used for the experiments and run TestModel.py --model_path="Path_of_model" to reproduce the results.

Publication

Nicolas Turpault and Romain Serizel. Training sound event detection on a heterogeneous dataset. working paper or preprint, 2020.

Training Sound Event Detection On A Heterogeneous Dataset

Abstract

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.

Publication

Lionel Delphin-Poulat and Cyril Plapous. Mean teacher with data augmentation for dcase 2019 task 4. Technical Report, Orange Labs Lannion, France, June 2019.

PDF

MEAN TEACHER WITH DATA AUGMENTATION FOR DCASE 2019 TASK 4

Abstract

In this paper, we present our neural network for the DCASE 2019 challenge’s Task 4 (Sound event detection in domestic environments) [1]. The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled. We propose a mean-teacher model with convolutional neural network (CNN) and recurrent neural network (RNN) together with data augmentation and a median window tuned for each class based on prior knowledge.

PDF
Publication

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204. 2017.

PDF

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Abstract

The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.

PDF

Sound separation baseline

This baseline model consists of a TDCN++ masking network using STFT analysis/synthesis and weighted mixture consistency, where the weights are predicted by the network, with one scalar per source. The training loss is a thresholded negative signal-to-noise ratio (SNR).
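As an illustration, the weighted mixture-consistency projection can be sketched as follows (a minimal NumPy sketch, assuming per-source weights normalized to sum to one; the variable names are ours, not those of the baseline code):

```python
import numpy as np

def weighted_mixture_consistency(est_sources, mixture, weights):
    """Project estimated sources so that they sum to the input mixture.

    est_sources: (n_sources, n_samples) separated waveforms
    mixture:     (n_samples,) input mixture waveform
    weights:     (n_sources,) non-negative scalars predicted by the network
    """
    weights = weights / weights.sum()                # normalize so the shares sum to one
    residual = mixture - est_sources.sum(axis=0)     # part of the mixture not yet explained
    # Each source absorbs a share of the residual proportional to its weight,
    # so the corrected sources sum exactly to the mixture.
    return est_sources + weights[:, None] * residual
```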


The model architecture is able to handle a variable number of sources by using different loss functions for active and inactive reference sources. For active reference sources (i.e. non-zero reference source signals), the threshold for negative SNR is 30 dB, equivalent to the error power being 30 dB below the reference power. For inactive reference sources (i.e. all-zero reference source signals), the threshold is 20 dB measured relative to the mixture power, which means gradients are clipped when the error power is 20 dB below the mixture power. This model architecture achieves the following performance when trained and evaluated on the two variants of the FUSS dataset, reverberant and dry (i.e. non-reverberant):

|                  | Validation: Single-source SI-SNR | Validation: Multi-source SI-SNRi | Eval: Single-source SI-SNR | Eval: Multi-source SI-SNRi |
|------------------|----------------------------------|----------------------------------|----------------------------|----------------------------|
| Reverberant FUSS | 35.0 dB                          | 13.0 dB                          | 37.6 dB                    | 12.5 dB                    |
| Dry FUSS         | 30.6 dB                          | 10.5 dB                          | 31.8 dB                    | 10.2 dB                    |
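To make the loss description above concrete, here is a minimal NumPy sketch of a thresholded negative SNR using the 30 dB / 20 dB thresholds described above (an illustration under our own formulation, not the reference implementation of the baseline):

```python
import numpy as np

def thresholded_neg_snr(reference, estimate, mixture,
                        active_thresh_db=30.0, inactive_thresh_db=20.0):
    """Thresholded negative SNR for one source (lower is better).

    Active sources (non-zero reference): the error power is compared to the
    reference power, and the loss saturates once the error is 30 dB below it.
    Inactive sources (all-zero reference): the estimate power is compared to
    the mixture power, saturating once it is 20 dB below the mixture.
    """
    err_power = np.sum((reference - estimate) ** 2)
    if np.any(reference):                                    # active reference source
        ref_power = np.sum(reference ** 2)
        tau = 10.0 ** (-active_thresh_db / 10.0)             # soft-threshold term
        return 10.0 * np.log10(err_power + tau * ref_power) - 10.0 * np.log10(ref_power)
    else:                                                     # inactive (all-zero) reference
        mix_power = np.sum(mixture ** 2)
        tau = 10.0 ** (-inactive_thresh_db / 10.0)
        return 10.0 * np.log10(err_power + tau * mix_power) - 10.0 * np.log10(mix_power)
```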

Download


Reverberant FUSS baseline


Dry FUSS baseline


Publication

Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.

PDF

Universal Sound Separation

PDF
Publication

Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. Improving universal sound separation using sound classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). May 2020. URL: https://arxiv.org/abs/1911.07951.

PDF

Improving Universal Sound Separation Using Sound Classification

PDF
Publication

Scott Wisdom, John R Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, and Rif A Saurous. Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. IEEE, 2019.

PDF

Differentiable consistency constraints for improved deep speech enhancement

Abstract

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.

PDF

Sound event detection and separation baseline

For the combined SED+SS baseline, we mix together audio from DESED and FUSS training and validation data, to create new mixtures with both in-domain (DESED) and open-domain (FUSS) sources. The sound separation model is trained to separate these mixtures into three output signals: DESED background, mixture of DESED foreground sounds, and mixture of FUSS sounds. This model is trained in the same way as the baseline SS model, except without permutation invariance. On the DESED+FUSS validation set, this model achieves an average of 18.6 dB SI-SNR improvement for the separated DESED foreground mixture, which is used as input to the SED model.
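As an illustration of how such training examples can be assembled, the sketch below builds one mixture and its three reference targets from DESED and FUSS material (a minimal NumPy sketch under our own naming; the actual data pipeline of the baseline may differ):

```python
import numpy as np

def build_sed_ss_example(desed_background, desed_events, fuss_sources):
    """Build one training example for the 3-output separation model.

    desed_background: (n_samples,) DESED background waveform
    desed_events:     list of (n_samples,) DESED foreground event waveforms
    fuss_sources:     list of (n_samples,) FUSS source waveforms
    Returns (mixture, targets), where the targets keep a fixed source order
    [DESED background, DESED foreground mixture, FUSS mixture], so no
    permutation-invariant training is needed.
    """
    desed_foreground = np.sum(desed_events, axis=0) if desed_events else np.zeros_like(desed_background)
    fuss_mixture = np.sum(fuss_sources, axis=0) if fuss_sources else np.zeros_like(desed_background)
    targets = np.stack([desed_background, desed_foreground, fuss_mixture])
    mixture = targets.sum(axis=0)
    return mixture, targets
```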

The baseline combining SS and SED then relies on late integration of the SED baseline applied to the separated sound sources.

The sound separation baseline has been trained using 3 sources, so it returns:

  • DESED background
  • DESED foreground
  • FUSS mixture

In our case, we use only the output of the second source (DESED foreground).

To get the predictions of the combined SED and SS system, we proceed as follows (see the sketch after this list):

  • Get the output (not binarized with a threshold) of the SED baseline on the original mixtures
  • Get the output (not binarized with a threshold) of the SED baseline on the DESED foreground source produced by the SS model
  • Take the average of both outputs
  • Apply thresholds (different for F-scores and PSDS)
  • Apply median filtering (0.45 s)
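The fusion step can be sketched as follows (a minimal NumPy/SciPy sketch; the 64 ms frame hop and the 0.5 threshold are illustrative assumptions of ours, not the tuned values of the baseline):

```python
import numpy as np
from scipy.ndimage import median_filter

def fuse_and_smooth(probs_mixture, probs_foreground,
                    threshold=0.5, frame_hop_s=0.064, median_window_s=0.45):
    """Late integration of SED outputs on the mixture and on the separated foreground.

    probs_mixture, probs_foreground: (n_frames, n_classes) non-binarized class posteriors
    Returns a binary (n_frames, n_classes) activity matrix.
    """
    probs = 0.5 * (probs_mixture + probs_foreground)          # average the two SED outputs
    decisions = probs > threshold                              # apply a (class-wise) threshold
    win = max(1, int(round(median_window_s / frame_hop_s)))    # ~0.45 s median window, in frames
    # Median filtering along time, independently for each class.
    return median_filter(decisions.astype(float), size=(win, 1)) > 0.5
```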
Publication

Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John R. Hershey, Romain Serizel, Eduardo Fonseca, and Prem Seetharaman. Improving sound event detection in domestic environments using sound separation. working paper or preprint, 2020.

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Abstract

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.

Performance

|            | Macro F-score (event-based) | PSDS macro F-score | PSDS  | PSDS cross-trigger | PSDS macro |
|------------|-----------------------------|--------------------|-------|--------------------|------------|
| Validation | 35.6 %                      | 60.5 %             | 0.626 | 0.546              | 0.449      |

Download

Sound separation



Sound event detection


Citation

If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following papers:

Task and datasets

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

PDF

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection; Weakly labeled data; Semi-supervised learning; Synthetic data

PDF
Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.

What's All the FUSS About Free Universal Sound Separation Data?

Baselines

Sound event detection

Publication

Nicolas Turpault and Romain Serizel. Training sound event detection on a heterogeneous dataset. working paper or preprint, 2020.

Training Sound Event Detection On A Heterogeneous Dataset

Abstract

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.

Sound separation

Publication

Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.

PDF

Universal Sound Separation

PDF
Publication

Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. Improving universal sound separation using sound classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). May 2020. URL: https://arxiv.org/abs/1911.07951.

PDF

Improving Universal Sound Separation Using Sound Classification

PDF
Publication

Scott Wisdom, John R Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, and Rif A Saurous. Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 900–904. IEEE, 2019.

PDF

Differentiable consistency constraints for improved deep speech enhancement

Abstract

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.

PDF

Sound separation and sound event detection

Publication

Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John R. Hershey, Romain Serizel, Eduardo Fonseca, and Prem Seetharaman. Improving sound event detection in domestic environments using sound separation. working paper or preprint, 2020.

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Abstract

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.