Sound Event Detection and Separation in Domestic Environments

Coordinators

Romain Serizel, Nicolas Turpault, Francesca Ronchini, Scott Wisdom, Hakan Erdogan, John Hershey, Justin Salamon, Prem Seetharaman, Eduardo Fonseca, Samuele Cornell, Daniel P. W. Ellis

The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled (with time stamps).

Description

This task is the follow-up to DCASE 2020 task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time localization given that multiple events can be present in an audio recording (see also Fig 1). The challenge of exploring the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance remains. Isolated sound events, background sound files and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered as reliable.

As last year, we encourage participants to propose systems that use sound separation jointly with sound event detection. This step can be used to separate overlapping sound events and extract foreground sound events from the background sound. To motivate participants to explore that direction, we provide a baseline sound separation model that can be used for pre-processing (see also Fig 2).

Compared to previous years, this task aims to investigate how we can optimally exploit synthetic data, including non-target isolated events for both sound event detection and sound separation.

Audio dataset

The data for the DCASE 2020 task 4 consist of several datasets designed for sound event detection and/or sound separation. The datasets are described below.

Audio material

Dataset | Subset | Type | Usage | Annotations | Event type | Sampling frequency
DESED | Real: weakly labeled | Recorded soundscapes | Training | Weak labels (no timestamps) | Target | 44.1kHz
DESED | Real: unlabeled | Recorded soundscapes | Training | No annotations | Target | 44.1kHz
DESED | Real: validation | Recorded soundscapes | Validation | Strong labels (with timestamps) | Target | 44.1kHz
DESED | Real: public evaluation | Recorded soundscapes | Evaluation (do not use this subset to tune hyperparameters) | Strong labels (with timestamps) | Target | 44.1kHz
DESED | Synthetic: training | Isolated events + synthetic soundscapes | Training/validation | Strong labels (with timestamps) | Target | 16kHz
DESED | Synthetic: evaluation | Isolated events + backgrounds | Evaluation (do not use this subset to tune hyperparameters) | Event level labels (no timestamps) | Target | 16kHz
SINS | - | Background | Training/validation | No annotations | N/A | 16kHz
TUT Acoustic scenes 2017, development dataset | - | Background | Training/validation | No annotations | N/A | 44.1kHz
FUSS dataset | - | Isolated events + synthetic soundscapes | Training/validation | Weak annotations from FSD50K (no timestamps) | Target and non-target | 16kHz
FSD50K dataset | - | Isolated events + recorded soundscapes | Training/validation | Weak annotations (no timestamps) | Target and non-target | 44.1kHz
YFCC100M dataset | - | Recorded soundscapes | Training/validation | No annotations | Sound sources | 44.1kHz

If you plan to perform source separation (or to use backgrounds from the SINS dataset), please resample your recorded data to 16kHz. If you are using only recorded data and perform only sound event detection, you can use sampling rates up to 44.1kHz, yet we strongly encourage participants to use 16kHz as the sampling rate.
Please note that the baselines work on 16kHz data.
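
As a rough illustration of the resampling step, the sketch below derives the integer up/down factors for a 44.1kHz to 16kHz conversion and the resulting length of a 10-second clip. The function name is hypothetical and not part of the task tooling; a polyphase resampler that accepts such factors (for instance `scipy.signal.resample_poly`) could then perform the actual filtering.

```python
from fractions import Fraction


def resample_factors(orig_sr, target_sr):
    """Return the smallest integer (up, down) pair such that
    orig_sr * up / down == target_sr."""
    ratio = Fraction(target_sr, orig_sr)
    return ratio.numerator, ratio.denominator


up, down = resample_factors(44100, 16000)

# Length of a 10-second clip before and after resampling.
clip_samples_44k = 10 * 44100
clip_samples_16k = clip_samples_44k * up // down

print(up, down, clip_samples_16k)
```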

DESED dataset

DESED is the dataset that was used in DCASE 2020 task 4. The dataset for this task is composed of 10-second audio clips recorded in domestic environments or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of Audioset (not all the classes are present in Audioset; some classes of sound events include several classes from Audioset):

• Speech (label: Speech)
• Dog (label: Dog)
• Cat (label: Cat)
• Alarm/bell/ringing (label: Alarm_bell_ringing)
• Dishes (label: Dishes)
• Frying (label: Frying)
• Blender (label: Blender)
• Running water (label: Running_water)
• Vacuum cleaner (label: Vacuum_cleaner)
• Electric shaver/toothbrush (label: Electric_shaver_toothbrush)
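
Systems trained on clip-level labels commonly represent the classes present in a clip as a multi-hot vector. The sketch below shows one possible encoding over the 10 label strings listed above; the function name and the class ordering are assumptions made for illustration, not something prescribed by the task.

```python
# Label strings copied from the class list above; the ordering is arbitrary.
DESED_CLASSES = [
    "Speech", "Dog", "Cat", "Alarm_bell_ringing", "Dishes",
    "Frying", "Blender", "Running_water", "Vacuum_cleaner",
    "Electric_shaver_toothbrush",
]


def encode_weak_labels(event_labels):
    """Turn a comma-separated label string into a 10-dimensional multi-hot list."""
    present = {label.strip() for label in event_labels.split(",") if label.strip()}
    unknown = present - set(DESED_CLASSES)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in DESED_CLASSES]


print(encode_weak_labels("Alarm_bell_ringing,Dog"))
```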

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.

Sound event detection in synthetic domestic environments

Keywords

semi-supervised learning ; weakly labeled data ; synthetic data ; Sound event detection

This year we provide a file mapping from Audioset event class names to DESED target event class names in order to allow for using the FSD50K and FUSS datasets. We provide csv files for training/validation splits that are compatible with both FSD50K and FUSS. We also provide event distribution statistics for both target event classes and non-target event classes. These distributions are computed on annotations obtained by human annotators for ~90k clips from Audioset.

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.

Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

Publication

Shawn Hershey, Daniel P W Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021. accepted.

Sound separation dataset (FUSS)

The Free Universal Sound Separation (FUSS) Dataset is composed of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.

Overview: The audio data is sourced from a subset of FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these sound source files have been screened such that they likely only contain a single type of sound. As the file IDs have not been modified, the FSD50K labels are still valid for FUSS files. To create mixtures, 10 second clips of sounds are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sounds. Sound source files longer than 10s are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and access to the original source audio.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.

Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.

Motivation: This dataset provides a platform to investigate how sound separation may help with event detection and vice versa. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, sound separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.

Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse corresponds to a rectangular room of random size with random wall materials, where a single microphone and up to 4 sound sources are placed at random spatial locations.

Recipe for data creation: The data creation recipe starts with scripts, based on Scaper, to generate mixtures of events with random timing of sound events, along with a background sound that spans the duration of the mixture clip.
The constituent sound files for each mixture are also generated for use as references for training and evaluation.
The data are reverberated using a different room simulation for each mixture.
In this simulation each sound source has its own reverberation corresponding to a different spatial location.
The reverberated mixtures are created by summing over the reverberated sound sources.
The data creation scripts support modification, so that participants may remix and augment the training data as desired.
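
The reverberation and mixing steps of the recipe can be sketched as follows. This is a minimal pure-Python illustration, not the FUSS scripts themselves: the function names are hypothetical, the impulse responses are toy values rather than simulated rooms, and truncating each reverberated source to the clip length is an assumption.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of two sequences (pure Python, O(N*M))."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def reverberant_mixture(sources, rirs, length):
    """Convolve each source with its own impulse response, truncate to
    `length` samples, and sum the results into a mixture."""
    mixture = [0.0] * length
    references = []
    for source, rir in zip(sources, rirs):
        wet = convolve(source, rir)[:length]
        wet += [0.0] * (length - len(wet))  # zero-pad sources shorter than the clip
        references.append(wet)
        mixture = [m + w for m, w in zip(mixture, wet)]
    return mixture, references


# Toy example: two 4-sample sources, each with its own short impulse response.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
rirs = [[1.0, 0.5], [1.0, 0.0, 0.25]]
mixture, references = reverberant_mixture(sources, rirs, length=4)
print(mixture)
```

Keeping the per-source reverberated signals alongside the mixture mirrors the role of the reference files mentioned above, which are needed to train and evaluate a separation model.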

The background files were not screened to ensure that they contain no target class. Therefore, background files might include target events.

SINS dataset: A part of the derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week.
It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.

Publication

Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.

The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Abstract

There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks

TUT Acoustic scenes 2017, development dataset: The TUT Acoustic Scenes 2017 dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided in individual files.

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.

TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

FSD50K dataset: FSD50K is an open dataset of human-labeled sound events containing over 51k Freesound audio clips, totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. Clips are of variable length from 0.3 to 30s, containing both target and non-target events that can be used to generate more realistic soundscapes.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

YFCC100M: YFCC100M is a multimedia dataset comprising a total of 100 million media objects, including approximately 0.8 million videos, all of which have been uploaded to Flickr between 2004 and 2014 and published under a CC commercial or non-commercial licence. Within the task, this dataset is used for unsupervised training of the sound separation.

Publication

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Generating your own training data using Scaper

Participants are encouraged to use the provided isolated foreground and background sounds, in combination with the Scaper soundscape synthesis and augmentation library, to generate additional (potentially infinite!) training data.

Resources for getting started include:

• The Scaper scripts provided with the DESED dataset (link)
• The canonical Scaper script used to create the source separation dataset (link)
• The Scaper tutorial

Publication

J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: a library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 344–348. New Paltz, NY, USA, Oct. 2017.

Reference labels

Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file under the following format:

[filename (string)][tab][event_labels (strings)]


For example:

Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec to t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by a comma.
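
A row in this format can be parsed with a few lines of Python. The sketch below is illustrative only: the function name and the returned dictionary keys are hypothetical, and it assumes the filename always ends with `_<start>_<end>.wav` as in the example above.

```python
from pathlib import Path


def parse_weak_row(row):
    """Split one tab-separated row into filename, clip metadata and labels."""
    filename, event_labels = row.rstrip("\n").split("\t")
    # rsplit keeps any underscores that belong to the video ID itself.
    video_id, start, end = Path(filename).stem.rsplit("_", 2)
    return {
        "filename": filename,
        "video_id": video_id,
        "start": float(start),
        "end": float(end),
        "labels": event_labels.split(","),
    }


row = "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing,Dog"
parsed = parse_weak_row(row)
print(parsed)
```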

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).

The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from DESED was verified by humans in order to check that the event class present in the original Freesound tags was indeed dominant in the audio clip. Each clip was then segmented to remove silent parts and keep only isolated event(s).

In both cases, the minimum length for an event is 250ms. The minimum duration of the pause between two events from the same class is 150ms. When the silence between two consecutive events from the same class was less than 150ms the events have been merged to a single event. The strong annotations are provided in a tab separated csv file under the following format:

  [filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]


For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column 0.163 is the onset time in seconds, the third column 0.665 is the offset time in seconds and the last column, Alarm_bell_ringing corresponds to the class of the sound event.
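
The sketch below parses rows in this format and applies the merging rule described above (same-class events separated by less than 150ms become a single event). It is a hedged illustration: the function names are hypothetical, the second and third example rows are invented to exercise the rule, and the 250ms minimum event length is not enforced here.

```python
from collections import defaultdict

MIN_GAP = 0.150  # seconds: minimum pause between two events of the same class


def parse_strong_row(row):
    """Split one tab-separated row into (filename, onset, offset, label)."""
    filename, onset, offset, label = row.rstrip("\n").split("\t")
    return filename, float(onset), float(offset), label


def merge_close_events(events, min_gap=MIN_GAP):
    """Merge same-file, same-class events separated by less than `min_gap`."""
    grouped = defaultdict(list)
    for filename, onset, offset, label in events:
        grouped[(filename, label)].append((onset, offset))
    merged = []
    for (filename, label), spans in grouped.items():
        spans.sort()
        current_on, current_off = spans[0]
        for onset, offset in spans[1:]:
            if onset - current_off < min_gap:
                current_off = max(current_off, offset)
            else:
                merged.append((filename, current_on, current_off, label))
                current_on, current_off = onset, offset
        merged.append((filename, current_on, current_off, label))
    return sorted(merged)


rows = [
    "YOTsn73eqbfc_10.000_20.000.wav\t0.163\t0.665\tAlarm_bell_ringing",
    "YOTsn73eqbfc_10.000_20.000.wav\t0.700\t1.500\tAlarm_bell_ringing",  # invented row
    "YOTsn73eqbfc_10.000_20.000.wav\t3.000\t4.000\tAlarm_bell_ringing",  # invented row
]
events = merge_close_events(parse_strong_row(r) for r in rows)
print(events)
```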

FSD50K annotations

Labels are provided at clip-level (i.e., weak labels). However, some clips contain sound events such that the acoustic signal fills almost the entirety of the file, which can be considered strong labels. All existing labels are human-verified. Nonetheless, there could be some missing “Present” labels, mainly in the dev set. More details can be found in Section IV of the FSD50K paper. The annotation format is:

[filename],[labels],[mids],[split]


For example, for a clip containing the sound event "Purr":

181956,"Purr,Cat,Domestic_animals_and_pets,Animal","/m/02yds9,/m/01yrx,/m/068hy,/m/0jbk",train


Note that the ancestors of "Purr", as per the AudioSet Ontology, are also included. Mids are the class identifiers used in the AudioSet Ontology. We provide a file mapping from Audioset event class names to DESED target event class names in order to allow for using FSD50K annotations in the data generation process.
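
Because the label and mid fields are themselves comma-separated and quoted, a CSV parser is the safest way to read this format. The sketch below uses Python's `csv` module on the example line; the mapping dictionary is a hypothetical excerpt made up for illustration and is not the mapping file provided with the task.

```python
import csv
import io

# Hypothetical excerpt of an AudioSet-to-DESED class mapping (illustration only).
AUDIOSET_TO_DESED = {"Cat": "Cat", "Purr": "Cat", "Dog": "Dog"}


def parse_fsd50k_rows(text):
    """Yield one dictionary per annotation row of FSD50K-style CSV text."""
    for fname, labels, mids, split in csv.reader(io.StringIO(text)):
        yield {
            "fname": fname,
            "labels": labels.split(","),
            "mids": mids.split(","),
            "split": split,
        }


def desed_targets(labels, mapping=AUDIOSET_TO_DESED):
    """Return the sorted DESED target classes reachable from AudioSet labels."""
    return sorted({mapping[label] for label in labels if label in mapping})


line = (
    '181956,"Purr,Cat,Domestic_animals_and_pets,Animal",'
    '"/m/02yds9,/m/01yrx,/m/068hy,/m/0jbk",train\n'
)
clip = next(parse_fsd50k_rows(line))
print(clip["labels"], desed_targets(clip["labels"]))
```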

The dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the task 4 data generation script.

The table below summarizes the access information to the different datasets.

DESED DESED github repo DESED real (zenodo) DESED synthetic (zenodo) DESED public eval (zenodo) Yes
SINS SINS github repo SINS dev (zenodo) Yes
TUT Acoustic scenes 2017, development dataset TUT 2017 dev (zenodo) Yes
FUSS dataset FUSS github repo FUSS (zenodo) Yes
FSD50K dataset FSD50K (zenodo) Yes
YFCC100M dataset YFCC100M website No

The challenge consists of detecting sound events within audio clips using training data from real recordings, both weakly labeled and unlabeled, and synthetic audio clips that are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps. As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios:

• You are working on sound event detection without source separation pre-processing (this is a direct follow-up to last year's task 4)
• You are working on both source separation and sound event detection (this includes the case when reusing the source separation baseline)
• You are working only on source separation and use the sound event detection baseline

In each case, participants are expected to provide the output of a sound event detection system (see also below). Note that an additional (optional) separate evaluation set designed to evaluate source separation performance will also be provided (see also below).

Development dataset

The development set is divided into six main partitions (four for sound event detection, two for source separation) that are detailed below.

Sound event detection training set

We provide 3 different splits of training data in our training set: Labeled training set, Unlabeled in domain training set and Synthetic set with strong annotations. The first two sets are the same as in DCASE2019 task 4. The development dataset organization is described in Fig 3.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked. The amount of clips per class is the following:

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note however that given the uncertainty on Audioset labels this distribution might not be exactly similar.

Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to that of the validation set.

• We used all the foreground files from the DESED synthetic soundbank.
• We used background files annotated as "other" from the subpart of SINS dataset and files annotated as "xxx" from the TUT Acoustic scenes 2017, development dataset (https://zenodo.org/record/400515).
• We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The clips selected are in the training for both FSD50K and FUSS. We provide csv files corresponding to these splits.
• Event distribution statistics for both target event classes and non-target event classes are computed on annotations obtained by human annotators for ~90k clips from Audioset.

We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.

Sound event detection validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may correspond to more than one sound event.

TBA

TBA

Evaluation datasets

As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios (listed in the task setup). Participants are allowed to submit up to 4 different systems for each scenario. In each case, participants are expected to provide at least the output of a sound event detection system. Participants who want to get their sound separation submission evaluated can download the specific sound separation evaluation dataset (see below). The evaluation dataset organization is described in Fig 4.

Sound event detection evaluation dataset

The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.

• A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes.

• A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.

Sound separation evaluation dataset

The sound separation evaluation dataset is the evaluation part of the FUSS dataset.

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

• Participants are allowed to submit up to 4 different systems for each scenario listed in the task setup section.
• Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
• Another example of external data is other materials related to the video such as the rest of audio from where the 10-sec clip was extracted, the video frames and metadata.
• Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
• For the real recordings (from Audioset), only weak labels and none of the strong labels (timestamps) or original (Audioset) labels can be used for the training of the submitted system. For the synthetic clips, strong labels provided can be used.
• Manipulation of provided training data is allowed.
• The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
• Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.

Evaluation

Sound event detection evaluation

All submissions will be evaluated with polyphonic sound event detection scores (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). PSDS values are computed using 50 operating points (linearly distributed from 0.01 to 0.99). In order to better understand the behavior of each submission, PSDS is computed for two different scenarios that emphasize different system properties.
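
As a small illustration, the 50 linearly distributed operating points can be generated as below. The sketch assumes that an operating point is a decision threshold applied to the system's output scores; the variable names are hypothetical.

```python
NUM_POINTS = 50
LOW, HIGH = 0.01, 0.99

# Evenly spaced values from LOW to HIGH, both endpoints included.
step = (HIGH - LOW) / (NUM_POINTS - 1)
thresholds = [LOW + i * step for i in range(NUM_POINTS)]

print(len(thresholds), round(thresholds[0], 2), round(thresholds[-1], 2))
```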

Scenario 1

The system needs to react fast upon an event detection (e.g. to trigger an alarm, adapt home automation system...). The localization of the sound event is then really important. The PSDS parameters reflecting these needs are:

• Detection Tolerance criterion (DTC): 0.7
• Ground Truth intersection criterion (GTC): 0.7
• Cost of instability across class ($$\alpha_{ST}$$): 1
• Cost of CTs on user experience ($$\alpha_{CT}$$): 0
• Maximum False Positive rate (e_max): 100

Scenario 2

The system must avoid confusion between classes, but the reaction time is less crucial than in the first scenario. The PSDS parameters reflecting these needs are:

• Detection Tolerance criterion (DTC): 0.1
• Ground Truth intersection criterion (GTC): 0.1
• Cost of instability across class ($$\alpha_{ST}$$): 1
• Cross-Trigger Tolerance criterion (cttc): 0.3
• Cost of CTs on user experience ($$\alpha_{CT}$$): 0.5
• Maximum False Positive rate (e_max): 100
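
For reference, the two parameter sets can be kept side by side in a small configuration structure. The dictionary layout and key names below are hypothetical; only the values are taken from the two lists above. No CTTC value is listed for scenario 1, so none is set here.

```python
# Values copied from the scenario descriptions; key names are illustrative.
PSDS_SCENARIOS = {
    "scenario1": {
        "dtc": 0.7,       # Detection Tolerance criterion
        "gtc": 0.7,       # Ground Truth intersection criterion
        "alpha_st": 1,    # cost of instability across classes
        "alpha_ct": 0,    # cost of cross-triggers
        "e_max": 100,     # maximum false positive rate
    },
    "scenario2": {
        "dtc": 0.1,
        "gtc": 0.1,
        "cttc": 0.3,      # Cross-Trigger Tolerance criterion
        "alpha_st": 1,
        "alpha_ct": 0.5,
        "e_max": 100,
    },
}

for name, params in PSDS_SCENARIOS.items():
    print(name, params)
```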

The official ranking will be a team-wise ranking, not a system-wise ranking. The ranking criterion will be the aggregation of PSDS-scenario1 and PSDS-scenario2. Each separate metric considered in the final ranking criterion will be the best separate metric among all of the team's submissions (PSDS-scenario1 and PSDS-scenario2 can be obtained by two different systems from the same team, see also Fig 5). The setup is chosen in order to favor experiments on the systems' behavior, and adaptation to different metrics depending on the targeted scenario.

$$\mathrm{Ranking\ Score} = \overline{\mathrm{PSDS}_1} + \overline{\mathrm{PSDS}_2}$$

with $$\overline{\mathrm{PSDS}_1}$$ and $$\overline{\mathrm{PSDS}_2}$$ the PSDS on scenario 1 and 2 normalized by the baseline PSDS on these scenarios, respectively.
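As an illustration, the ranking score of one team could be computed as follows. This sketch assumes that "normalized by the baseline PSDS" means division by the baseline value; the function name and all numbers in the usage example are hypothetical.

```python
def ranking_score(team_psds1, team_psds2, baseline_psds1, baseline_psds2):
    """Sum of the team's best PSDS per scenario, each divided by the baseline PSDS.

    team_psds1 / team_psds2: PSDS of every submitted system on scenario 1 / 2.
    The best value for each scenario may come from a different system.
    """
    best1 = max(team_psds1)
    best2 = max(team_psds2)
    return best1 / baseline_psds1 + best2 / baseline_psds2


# Hypothetical team with two systems: A = (0.36, 0.50) and B = (0.30, 0.60).
score = ranking_score([0.36, 0.30], [0.50, 0.60], baseline_psds1=0.30, baseline_psds2=0.50)
```

Under this reading, a team that only matches the baseline on both scenarios obtains a ranking score of 2.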

Contrastive metric (collar-based F1-score)

Additionally, event-based measures with a 200 ms collar on onsets and a collar on offsets of 200 ms or 20% of the event length, whichever is larger, will be provided as a contrastive measure. Systems will be evaluated with a threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
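The collar rule can be sketched as a simple matching test between one reference event and one detected event of the same class. This is not the sed_eval implementation: the function name is illustrative, and the sketch assumes that the 20% is taken relative to the length of the reference event.

```python
def collar_match(reference, estimate, t_collar=0.2, pct_of_length=0.2):
    """Check whether an estimated event matches a reference event of the same class.

    reference, estimate: (onset, offset) tuples in seconds.
    The onset must fall within t_collar of the reference onset; the offset must
    fall within max(t_collar, pct_of_length * reference duration) of the reference offset.
    """
    ref_on, ref_off = reference
    est_on, est_off = estimate
    onset_ok = abs(est_on - ref_on) <= t_collar
    offset_collar = max(t_collar, pct_of_length * (ref_off - ref_on))
    offset_ok = abs(est_off - ref_off) <= offset_collar
    return onset_ok and offset_ok


# A 5 s reference event tolerates an offset error of up to 1 s.
print(collar_match((1.0, 6.0), (1.1, 6.8)))
```

Long events therefore tolerate larger offset errors than short ones, while the onset tolerance stays fixed.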

Evaluation is done using sed_eval and psds_eval toolboxes:

You can find more information on how to use PSDS for task 4 on the dedicated notebook:

Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

Publication

Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulovic. A framework for the robust evaluation of sound event detection. arXiv preprint arXiv:1910.08440, 2019. URL: https://arxiv.org/abs/1910.08440.

Source separation evaluation (optional)

Source separation approaches will optionally be evaluated in an additional evaluation with standard source separation metrics, as follows. For each example mixture x containing J sources, performance will be measured with permutation-invariant scale-invariant signal-to-noise ratio improvement (SI-SNRi).

• SNR is defined as 10 * log10 of the ratio of source power to error power.
• Scale-invariant SNR (SI-SNR) allows scaling of the reference to best match the estimate. The optimal scaling factor given an estimate signal e and reference signal r is $$\langle e, r \rangle / \langle r, r \rangle$$, where $$\langle \cdot, \cdot \rangle$$ indicates the inner product.
• Permutation invariance allows the estimates to be permuted to match reference signals such that the mean SI-SNR across sources is maximized.
• SI-SNR improvement (SI-SNRi) is the SI-SNR of the estimate with respect to the reference, minus SI-SNR of the mixture with respect to the reference.
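The four definitions above can be combined into a small reference computation. The sketch below is not the official evaluation script: function names are illustrative, signals are plain Python lists of samples, the number of estimates is assumed to equal the number of references, and the best permutation is found by brute force, which is only practical for a small number of sources.

```python
import itertools
import math

EPS = 1e-12  # guards against division by zero and log of zero


def dot(a, b):
    """Inner product of two equal-length signals."""
    return sum(x * y for x, y in zip(a, b))


def si_snr(estimate, reference):
    """SI-SNR in dB; the reference is rescaled by <e, r> / <r, r> before comparison."""
    scale = dot(estimate, reference) / (dot(reference, reference) + EPS)
    target = [scale * r for r in reference]
    error = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10((dot(target, target) + EPS) / (dot(error, error) + EPS))


def best_permutation(estimates, references):
    """Return (perm, mean SI-SNR), where estimates[perm[j]] is paired with references[j]."""
    best_perm, best_mean = None, -math.inf
    for perm in itertools.permutations(range(len(estimates))):
        total = sum(si_snr(estimates[i], ref) for i, ref in zip(perm, references))
        mean = total / len(references)
        if mean > best_mean:
            best_perm, best_mean = perm, mean
    return best_perm, best_mean


def si_snr_improvement(estimates, references, mixture):
    """Mean over sources of SI-SNR(estimate, reference) minus SI-SNR(mixture, reference)."""
    perm, _ = best_permutation(estimates, references)
    gains = [
        si_snr(estimates[i], ref) - si_snr(mixture, ref)
        for i, ref in zip(perm, references)
    ]
    return sum(gains) / len(gains)
```

For example, with two orthogonal references, estimates returned in swapped order and at a different scale are still matched to the right sources, because the score is invariant to both permutation and scaling.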

A python evaluation script for source separation can be found on:

Baseline systems

Sound event detection baseline

System description

The baseline model is an improved version of the DCASE 2020 baseline. The model is a mean-teacher model.
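In a mean-teacher setup, a student network is trained by gradient descent while a teacher network tracks an exponential moving average (EMA) of the student weights; the teacher predictions then serve as consistency targets on unlabeled data. The sketch below only illustrates the EMA step on a dictionary of scalar weights. The function name and the decay value are assumptions and are not taken from the baseline code.

```python
def ema_update(teacher, student, decay=0.999):
    """Return teacher weights moved toward the student weights by an EMA step.

    teacher, student: dicts mapping parameter names to floats (same keys).
    decay: hypothetical smoothing factor; values close to 1 give a slowly moving teacher.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# One update step with a deliberately low decay so the change is visible.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
```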

The main differences of the baseline system (without source separation) compared to DCASE 2020 are:

• Features: hop size of 256 instead of 255.
• A different synthetic dataset is used.
• No early stopping is used (training runs for 200 epochs), but the best model is kept.
• Per-instance normalisation using a min-max approach
• Mixup is used for weak and synthetic data by mixing data in a batch (50% chance of applying it).
• Batch size of 48 (still 1/4 synthetic, 1/4 weak, 1/2 unlabelled)
• Intersection-based F1 instead of event-based F1 for the synthetic validation score
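Three of the items above (per-instance min-max normalisation, the batch composition, and mixup applied with a 50% chance) can be sketched in plain Python. This is not the baseline implementation: function names are illustrative, features are nested lists rather than tensors, and the Beta-distributed mixing coefficient, its parameter, and the mixing of label vectors follow the usual mixup recipe as an assumption.

```python
import random


def min_max_normalise(features):
    """Rescale one instance (list of frames) to [0, 1] using its own min and max."""
    flat = [v for frame in features for v in frame]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # constant input: avoid division by zero
    return [[(v - lo) / span for v in frame] for frame in features]


def batch_composition(batch_size=48):
    """Split a batch into 1/4 synthetic, 1/4 weak and 1/2 unlabelled items."""
    synthetic = batch_size // 4
    weak = batch_size // 4
    return {
        "synthetic": synthetic,
        "weak": weak,
        "unlabelled": batch_size - synthetic - weak,
    }


def mixup_pair(x_a, y_a, x_b, y_b, rng, p_apply=0.5, beta_param=0.2):
    """With probability p_apply, blend two examples and their label vectors.

    beta_param is a hypothetical value for the Beta distribution of the mixing weight.
    """
    if rng.random() >= p_apply:
        return x_a, y_a
    lam = rng.betavariate(beta_param, beta_param)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


rng = random.Random(0)
print(batch_composition())
print(min_max_normalise([[1.0, 3.0], [2.0, 5.0]]))
```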

Results for the development dataset

| | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
|---|---|---|---|---|
| Dev-test | 0.342 | 0.527 | 76.6% | 40.1% |

Collar-based = event-based. The intersection-based F1 is computed using (dtc=gtc=0.5, cttc=0.3), and the event-based F1 is computed using collars (onset=200ms, offset=max(200ms, 20% event-duration)).

Note: The performance might not be exactly reproducible on a GPU-based system. For this reason, the checkpoint of the network can be downloaded along with the TensorBoard events. Launch python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt to test this model.

Citation

If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.