Sound Event Detection and Separation in Domestic Environments


Task description

The goal of the task is to evaluate systems for the detection of sound events using real data that is either weakly labeled or unlabeled, and simulated data that is strongly labeled (with timestamps).

Challenge has ended. Full results for this task can be found in the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task is the follow-up to DCASE 2020 task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The systems are expected to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see also Fig 1). The challenge remains to explore how a large amount of unbalanced and unlabeled training data can be exploited together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered as reliable.

Figure 1: Overview of a sound event detection system.


As last year, we encourage participants to propose systems that use sound separation jointly with sound event detection. This step can be used to separate overlapping sound events and extract foreground sound events from the background sound. To motivate participants to explore that direction, we provide a baseline sound separation model that can be used for pre-processing (see also Fig 2).

Figure 2: Example of a sound event detection system with a sound separation pre-processing.


Compared to previous years, this task aims to investigate how we can optimally exploit synthetic data, including non-target isolated events for both sound event detection and sound separation.

Audio dataset

The data for DCASE 2021 task 4 consist of several datasets designed for sound event detection and/or sound separation. The datasets are described below.

Audio material

| Dataset | Subset | Type | Usage | Annotations | Event type | Sampling frequency |
| --- | --- | --- | --- | --- | --- | --- |
| DESED | Real: weakly labeled | Recorded soundscapes | Training | Weak labels (no timestamps) | Target | 44.1 kHz |
| DESED | Real: unlabeled | Recorded soundscapes | Training | No annotations | Target | 44.1 kHz |
| DESED | Real: validation | Recorded soundscapes | Validation | Strong labels (with timestamps) | Target | 44.1 kHz |
| DESED | Real: public evaluation | Recorded soundscapes | Evaluation (do not use this subset to tune hyperparameters) | Strong labels (with timestamps) | Target | 44.1 kHz |
| DESED | Synthetic: training | Isolated events + synthetic soundscapes | Training/validation | Strong labels (with timestamps) | Target | 16 kHz |
| DESED | Synthetic: evaluation | Isolated events + backgrounds | Evaluation (do not use this subset to tune hyperparameters) | Event-level labels (no timestamps) | Target | 16 kHz |
| SINS | — | Background | Training/validation | No annotations | N/A | 16 kHz |
| TUT Acoustic scenes 2017, development dataset | — | Background | Training/validation | No annotations | N/A | 44.1 kHz |
| FUSS | — | Isolated events + synthetic soundscapes | Training/validation | Weak annotations from FSD50K (no timestamps) | Target and non-target | 16 kHz |
| FSD50K | — | Isolated events + recorded soundscapes | Training/validation | Weak annotations (no timestamps) | Target and non-target | 44.1 kHz |
| YFCC100M | — | Recorded soundscapes | Training/validation | No annotations | Sound sources | 44.1 kHz |

If you plan to perform source separation (or to use backgrounds from the SINS dataset), please resample your recorded data to 16 kHz. If you are using only recorded data and perform only sound event detection, you can use sampling rates up to 44.1 kHz, yet we strongly encourage participants to use a 16 kHz sampling rate.
Please note that the baselines work on 16 kHz data.
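
For illustration, here is a minimal resampling sketch (assuming the librosa and soundfile packages are available; the file names are placeholders):

```python
# Minimal sketch (not part of the official baseline): resample a recorded clip to 16 kHz.
import librosa
import soundfile as sf

def resample_to_16k(in_path, out_path, target_sr=16000):
    # librosa.load resamples to target_sr when sr is passed explicitly.
    audio, sr = librosa.load(in_path, sr=target_sr, mono=True)
    sf.write(out_path, audio, sr)

# Example (placeholder file names):
# resample_to_16k("Y-BJNMHMZDcU_50.000_60.000.wav", "Y-BJNMHMZDcU_50.000_60.000_16k.wav")
```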

DESED dataset

The DESED dataset is the dataset that was used in DCASE 2020 task 4. The dataset for this task is composed of 10-second audio clips recorded in domestic environments or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of AudioSet (not all the classes are present in AudioSet; some sound event classes group several AudioSet classes):

  • Speech (`Speech`)
  • Dog (`Dog`)
  • Cat (`Cat`)
  • Alarm/bell/ringing (`Alarm_bell_ringing`)
  • Dishes (`Dishes`)
  • Frying (`Frying`)
  • Blender (`Blender`)
  • Running water (`Running_water`)
  • Vacuum cleaner (`Vacuum_cleaner`)
  • Electric shaver/toothbrush (`Electric_shaver_toothbrush`)

More information about this dataset and how to generate synthetic soundscapes can be found on the DESED website.

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.


Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.


This year we provide a file mapping from AudioSet event class names to DESED target event class names in order to allow using the FSD50K and FUSS datasets. We provide csv files for training/validation splits that are compatible with both FSD50K and FUSS. We also provide event distribution statistics for both target and non-target event classes. These distributions are computed on annotations obtained by humans for ~90k clips from AudioSet.

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.


Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

Publication

Shawn Hershey, Daniel P W Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021. accepted.

The Benefit of Temporally-Strong Labels in Audio Event Classification

Sound separation dataset (FUSS)

The Free Universal Sound Separation (FUSS) Dataset is composed of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.

Overview: The audio data is sourced from a subset of FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these sound source files have been screened such that they likely only contain a single type of sound. As the file IDs have not been modified, FSD50K labels are still valid for FUSS files. To create mixtures, 10-second clips of sounds are convolved with simulated room impulse responses and added together. Each 10-second mixture contains between 1 and 4 sounds. Sound source files longer than 10 s are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and access to the original source audio.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.


FSD50K: an Open Dataset of Human-Labeled Sound Events

Abstract

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.


Freesound technical demo

Abstract

Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.

Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.

What's All the FUSS About Free Universal Sound Separation Data?

Motivation: This dataset provides a platform to investigate how sound separation may help with event detection and vice versa. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, sound separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.

Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse corresponds to a rectangular room of random size with random wall materials, where a single microphone and up to 4 sound sources are placed at random spatial locations.

Recipe for data creation: The data creation recipe starts with scripts, based on Scaper, to generate mixtures of events with random timing of sound events, along with a background sound that spans the duration of the mixture clip.
The constituent sound files for each mixture are also generated for use as references for training and evaluation.
The data are reverberated using a different room simulation for each mixture.
In this simulation each sound source has its own reverberation corresponding to a different spatial location.
The reverberated mixtures are created by summing over the reverberated sound sources.
The data creation scripts support modification, so that participants may remix and augment the training data as desired.

Additional (background) datasets

The background files were not screened to ensure that they contain no target class. Therefore, background files might include target events.

SINS dataset: A part of the derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of this task's dataset. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week.
It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.

Publication

Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.


The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Abstract

There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks


TUT Acoustic scenes 2017, development dataset: The TUT Acoustic Scenes 2017 dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided in individual files.

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.


TUT Database for Acoustic Scene Classification and Sound Event Detection

Abstract

We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.


Additional (event) datasets

FSD50K dataset: FSD50K is an open dataset of human-labeled sound events containing over 51k Freesound audio clips, totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. Clips are of variable length, from 0.3 to 30 s, and contain both target and non-target events that can be used to generate more realistic soundscapes.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.


Additional (sound sources) datasets

YFCC100M: YFCC100M is a multimedia dataset comprising a total of 100 million media objects, including approximately 0.8 million videos, all of which were uploaded to Flickr between 2004 and 2014 and published under a CC commercial or non-commercial licence. Within the task, this dataset is used for unsupervised training of the sound separation model.

Publication

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.


YFCC100M: The new data in multimedia research


Generating your own training data using Scaper

Participants are encouraged to use the provided isolated foreground and background sounds, in combination with the Scaper soundscape synthesis and augmentation library, to generate additional (potentially infinite!) training data. A minimal Scaper sketch is given after the list of resources below.

Resources for getting started include:

  • The Scaper scripts provided with the DESED dataset (link)
  • The canonical Scaper script used to create the source separation dataset (link)
  • The Scaper tutorial
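
For orientation, the sketch below shows what a minimal Scaper generation script might look like. It is not the official generation recipe (the provided scripts define the actual event distributions), the directory paths are placeholders, and parameter names may vary slightly between Scaper versions:

```python
import scaper

# Placeholder paths to the provided isolated foreground events and background files.
fg_path = "DESED_synth/soundbank/foreground"
bg_path = "DESED_synth/soundbank/background"

sc = scaper.Scaper(duration=10.0, fg_path=fg_path, bg_path=bg_path, random_state=42)
sc.ref_db = -50  # reference loudness of the background

# One background spanning the whole 10-second clip.
sc.add_background(label=("choose", []), source_file=("choose", []), source_time=("const", 0))

# A few foreground events with random class, onset, duration and SNR.
for _ in range(3):
    sc.add_event(
        label=("choose", []),
        source_file=("choose", []),
        source_time=("const", 0),
        event_time=("uniform", 0, 9.75),
        event_duration=("uniform", 0.25, 10.0),
        snr=("uniform", 6, 30),
        pitch_shift=None,
        time_stretch=None,
    )

# The JAMS file stores the strong (timestamped) annotations of the generated clip.
sc.generate(audio_path="soundscape.wav", jams_path="soundscape.jams",
            reverb=0.1, disable_sox_warnings=True)
```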
Publication

J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: a library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 344–348. New Paltz, NY, USA, Oct. 2017.


Scaper: A Library for Soundscape Synthesis and Augmentation


Reference labels

AudioSet provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most of the cases, not all the clips contain the event related to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file under the following format:

[filename (string)][tab][event_labels (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog

The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted, and 50.000-60.000 indicates the clip boundaries within the full video, i.e. t=50 s to t=60 s), and the last column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by commas.
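
For convenience, a minimal sketch of how such a weak-label file could be read with pandas (the file name is a placeholder, and a header row with the two column names is assumed):

```python
import pandas as pd

# Read the tab-separated weak annotations: [filename][tab][event_labels].
weak = pd.read_csv("weak.tsv", sep="\t")
# Turn the comma-separated class string into a Python list per clip.
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```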

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).

The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from DESED was verified by humans in order to check that the event class indicated by the original Freesound tags was indeed dominant in the audio clip. Each clip was then segmented to remove silent parts and keep only isolated events.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.


In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms; when the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event. The strong annotations are provided in a tab-separated csv file under the following format:

  [filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing

The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
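
As an example, a minimal sketch of how such a strong-label file could be inspected with pandas (the file name is a placeholder, and a header row with the four column names is assumed):

```python
import pandas as pd

# Read the tab-separated strong annotations: filename, onset, offset, event_label.
strong = pd.read_csv("validation.tsv", sep="\t")
strong["duration"] = strong["offset"] - strong["onset"]

# Per the annotation rules above, events last at least 250 ms and
# same-class events separated by less than 150 ms were already merged.
print("shortest annotated event (s):", strong["duration"].min())
print(strong.groupby("event_label")["duration"].describe())
```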

FSD50K annotations

Labels are provided at clip level (i.e., weak labels). However, some clips contain sound events whose acoustic signal fills almost the entire file, which can be considered strong labels. All existing labels are human-verified. Nonetheless, there could be some missing “Present” labels, mainly in the dev set. More details can be found in Section IV of the FSD50K paper. The annotation format is:

`[filename],[labels],[mids],[split]`

For example, for a clip containing the sound event "Purr":

`181956,"Purr,Cat,Domestic_animals_and_pets,Animal","/m/02yds9,/m/01yrx,/m/068hy,/m/0jbk",train`

Note that the ancestors of "Purr", as per the AudioSet Ontology, are also included. Mids are the class identifiers used in the AudioSet Ontology. We provide a file mapping from AudioSet event class names to DESED target event class names in order to allow using FSD50K annotations in the data generation process.
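
A hedged sketch of how the FSD50K annotations and the mapping file could be combined to select clips containing target classes (the mapping file name and its column names are assumptions, not the official ones):

```python
import pandas as pd

# FSD50K development ground truth: fname, labels, mids, split.
fsd = pd.read_csv("FSD50K.ground_truth/dev.csv")
# Hypothetical mapping file from AudioSet class names to DESED target class names.
mapping = pd.read_csv("audioset_to_desed_mapping.csv")
target_audioset_names = set(mapping["audioset_label"])  # column name is an assumption

# Keep clips whose label set intersects the target classes (via their AudioSet names).
fsd["label_list"] = fsd["labels"].str.split(",")
is_target = fsd["label_list"].apply(lambda ls: any(l in target_audioset_names for l in ls))
print("clips containing at least one target class:", int(is_target.sum()))
```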

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.


Download

The dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the task 4 data generation script.


The table below summarizes the access information to the different datasets.

| Dataset | Github | Download | Automatic download with the script |
| --- | --- | --- | --- |
| DESED | DESED github repo | DESED real (zenodo), DESED synthetic (zenodo), DESED public eval (zenodo) | Yes |
| SINS | SINS github repo | SINS dev (zenodo) | Yes |
| TUT Acoustic scenes 2017, development dataset | — | TUT 2017 dev (zenodo) | Yes |
| FUSS | FUSS github repo | FUSS (zenodo) | Yes |
| FSD50K | — | FSD50K (zenodo) | Yes |
| YFCC100M | — | YFCC100M website | No |

If you experience problems during the download of the recorded soundscapes, please contact the task organizers (Francesca Ronchini and Romain Serizel in priority).

Task setup

The challenge consists of detecting sound events within audio clips using training data from real recordings (both weakly labeled and unlabeled) and strongly labeled synthetic audio clips. The detection within a 10-second clip should be performed with start and end timestamps. As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios:

  • You are working on sound event detection without source separation pre-processing (this is a direct follow-up to last year's task 4)
  • You are working on both source separation and sound event detection (this includes the case where the source separation baseline is reused)
  • You are working only on source separation and use the sound event detection baseline

In each case, participants are expected to provide the output of a sound event detection system (see also below). Note that an additional (optional) separate evaluation set designed to evaluate source separation performance will also be provided (see also below).

Development dataset

The development set is divided into six main partitions (four for sound event detection, two for source separation) that are detailed below.

Sound event detection training set

We provide 3 different splits of training data in our training set: the labeled training set, the unlabeled in-domain training set and the synthetic set with strong annotations. The first two sets are the same as in DCASE 2019 task 4. The development dataset organization is described in Fig 3.

Figure 3: DCASE 2021 task 4 development dataset.

Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on AudioSet annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty on AudioSet labels this distribution might not be exactly the same.

Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to that of the validation set.

  • We used all the foreground files from the DESED synthetic soundbank.
  • We used background files annotated as "other" from the subpart of the SINS dataset and files annotated as "xxx" from the [TUT Acoustic scenes 2017, development dataset](https://zenodo.org/record/400515).
  • We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training split of both FSD50K and FUSS. We provide csv files corresponding to these splits.
  • Event distribution statistics for both target and non-target event classes are computed on annotations obtained by humans for ~90k clips from AudioSet.

We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.

Sound event detection validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may correspond to more than one sound event.

Source separation training set

TBA

Source separation validation set

TBA

Evaluation datasets

As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios (listed in the task setup). Participants are allowed to submit up to 4 different systems for each scenario. In each case, participants are expected to provide at least the output of a sound event detection system. Participants who want to get their sound separation submission evaluated can download the specific sound separation evaluation dataset (see below). The evaluation dataset organization is described in Fig 4.

Figure 4: DCASE 2021 task 4 evaluation dataset.


Before submission, please make sure you check that your submission package is correct with the validation script enclosed in the submission package: `python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2021/task4_test`


Sound event detection evaluation dataset

The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.

  • A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes. This subset includes the public evaluation dataset.

  • A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.

Sound separation evaluation dataset

The sound separation evaluation dataset is the evaluation part of the FUSS dataset.

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Participants are allowed to submit up to 4 different systems for each scenario listed in the task setup section.
  • Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
  • Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
  • Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
  • For the real recordings (from AudioSet), only the weak labels can be used for training the submitted systems; none of the strong labels (timestamps) or original (AudioSet) labels can be used. For the synthetic clips, the provided strong labels can be used.
  • Manipulation of provided training data is allowed.
  • The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.

Evaluation

Sound event detection evaluation

All submissions will be evaluated with the polyphonic sound event detection score (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). PSDS values are computed using 50 operating points (linearly distributed from 0.01 to 0.99). In order to better understand the behavior of each submission, the PSDS is computed for two different scenarios that emphasize different system properties (a sketch of the corresponding psds_eval calls is given after the scenario definitions below).

Scenario 1

The system needs to react fast upon an event detection (e.g., to trigger an alarm, adapt a home automation system...). The temporal localization of the sound event is therefore important. The PSDS parameters reflecting these needs are:

  • Detection Tolerance criterion (DTC): 0.7
  • Ground Truth intersection criterion (GTC): 0.7
  • Cost of instability across class (\(\alpha_{ST}\)): 1
  • Cost of CTs on user experience (\(\alpha_{CT}\)): 0
  • Maximum False Positive rate (e_max): 100

Scenario 2

The system must avoid confusion between classes but the reaction time is less crucial than in the first scenario. The PSDS parameters reflecting these needs are:

  • Detection Tolerance criterion (DTC): 0.1
  • Ground Truth intersection criterion (GTC): 0.1
  • Cost of instability across class (\(\alpha_{ST}\)): 1
  • Cross-Trigger Tolerance criterion (cttc): 0.3
  • Cost of CTs on user experience (\(\alpha_{CT}\)): 0.5
  • Maximum False Positive rate (e_max): 100
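
To make the two parameter sets concrete, here is a hedged sketch of how the two PSDS values can be computed with the psds_eval toolbox; the official scoring scripts remain the reference, and `det_dfs`, `gt_df` and `meta_df` are assumed to already be pandas DataFrames in the psds_eval format (detections and ground truth with filename/onset/offset/event_label, metadata with filename/duration):

```python
from psds_eval import PSDSEval

def compute_psds(det_dfs, gt_df, meta_df, dtc, gtc, cttc, alpha_ct, alpha_st, max_efpr=100):
    # One PSDSEval instance per scenario; one operating point per decision threshold
    # (50 thresholds linearly spaced between 0.01 and 0.99 in this task).
    psds_eval = PSDSEval(dtc_threshold=dtc, gtc_threshold=gtc, cttc_threshold=cttc,
                         ground_truth=gt_df, metadata=meta_df)
    for det in det_dfs:
        psds_eval.add_operating_point(det)
    return psds_eval.psds(alpha_ct=alpha_ct, alpha_st=alpha_st, max_efpr=max_efpr).value

# Scenario 1: psds1 = compute_psds(det_dfs, gt_df, meta_df, 0.7, 0.7, 0.3, alpha_ct=0, alpha_st=1)
# Scenario 2: psds2 = compute_psds(det_dfs, gt_df, meta_df, 0.1, 0.1, 0.3, alpha_ct=0.5, alpha_st=1)
```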

Task Ranking

The official ranking will be a team-wise ranking, not a system-wise ranking. The ranking criterion will be the aggregation of PSDS-scenario 1 and PSDS-scenario 2. Each separate metric considered in the final ranking criterion will be the best among all the team's submissions (PSDS-scenario 1 and PSDS-scenario 2 can be obtained by two different systems from the same team, see also Fig 5). This setup is chosen in order to favor experiments on system behavior and adaptation to different metrics depending on the targeted scenario.

$$ \mathrm{Ranking\ Score} = \overline{\mathrm{PSDS}_1} + \overline{\mathrm{PSDS}_2}$$

with \(\overline{\mathrm{PSDS}_1}\) and \(\overline{\mathrm{PSDS}_2}\) the PSDS on scenario 1 and 2 normalized by the baseline PSDS on these scenarios, respectively.
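
As a worked example, a minimal sketch using the Baseline_SED values reported in the results table below as the normalization constants:

```python
# Baseline PSDS values from the Baseline_SED row of the results table.
BASELINE_PSDS1, BASELINE_PSDS2 = 0.315, 0.547

def ranking_score(psds1, psds2):
    # Aggregation of the two baseline-normalized PSDS values.
    return psds1 / BASELINE_PSDS1 + psds2 / BASELINE_PSDS2

# Note: the ranking scores reported in the results table below are consistent with the
# mean of the two normalized terms, e.g. ranking_score(0.452, 0.669) / 2 ≈ 1.33
# for Zheng_USTC_task4_SED_1.
```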

Figure 5: PSDS combination for the final ranking.


Contrastive metric (collar-based F1-score)

Additionally, event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event's length collar on offsets will be provided as a contrastive measure. Systems will be evaluated with the threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
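
A hedged sketch of this contrastive metric using the sed_eval toolbox; `reference_events` and `estimated_events` are assumed to already be in sed_eval's event-list format (filename, onset, offset, event_label):

```python
import sed_eval

def collar_based_f1(reference_events, estimated_events, class_list):
    metrics = sed_eval.sound_event.EventBasedMetrics(
        event_label_list=class_list,
        t_collar=0.200,            # 200 ms collar on onsets
        percentage_of_length=0.2,  # 200 ms / 20% of the event length collar on offsets
    )
    metrics.evaluate(reference_event_list=reference_events,
                     estimated_event_list=estimated_events)
    return metrics.results_overall_metrics()["f_measure"]["f_measure"]
```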

Evaluation is done using the sed_eval and psds_eval toolboxes.



You can find more information on how to use PSDS for task 4 in the dedicated notebook.


Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

Publication

Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulovic. A framework for the robust evaluation of sound event detection. arXiv preprint arXiv:1910.08440, 2019. URL: https://arxiv.org/abs/1910.08440.


A Framework for the Robust Evaluation of Sound Event Detection


Source separation evaluation (optional)

Source separation approaches will optionally be evaluated on an additional evaluation set with standard source separation metrics, as follows. For each example mixture x containing J sources, performance will be measured with the permutation-invariant scale-invariant signal-to-noise ratio improvement (SI-SNRi); a minimal numpy sketch is given after the list below.

  • SNR is defined as 10 * log10 of the ratio of source power to error power.
  • Scale-invariant SNR (SI-SNR) allows scaling of the reference to best match the estimate. The optimal scaling factor given an estimate signal \(e\) and reference signal \(r\) is \(\langle e, r \rangle / \langle r, r \rangle\), where \(\langle \cdot, \cdot \rangle\) denotes the inner product.
  • Permutation invariance allows the estimates to be permuted to match reference signals such that the mean SI-SNR across sources is maximized.
  • SI-SNR improvement (SI-SNRi) is the SI-SNR of the estimate with respect to the reference, minus SI-SNR of the mixture with respect to the reference.
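
A minimal numpy sketch of these definitions for a single estimate/reference pair (permutation invariance, i.e. picking the source permutation that maximizes the mean SI-SNR, is left out for brevity):

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    # Scale the reference by <e, r> / <r, r> to best match the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(error**2) + eps))

def si_snri(estimate, mixture, reference):
    # Improvement over using the unprocessed mixture as the estimate.
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```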

A Python evaluation script for source separation is also provided with the task resources.


Results

Each row below lists: submission code, author, affiliation, technical report reference, ranking score (evaluation dataset), PSDS-scenario 1 (evaluation dataset) and PSDS-scenario 2 (evaluation dataset).
Na_BUPT_task4_SED_1 Tong Na Beijing, China task-sound-event-detection-in-domestic-environments#Na2021 0.80 0.245 0.452
Hafsati_TUITO_task4_SED_3 Mohammed Hafsati Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France task-sound-event-detection-in-domestic-environments#Hafsati2021 0.91 0.287 0.502
Hafsati_TUITO_task4_SED_4 Mohammed Hafsati Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France task-sound-event-detection-in-domestic-environments#Hafsati2021 0.91 0.287 0.502
Hafsati_TUITO_task4_SED_1 Mohammed Hafsati Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France task-sound-event-detection-in-domestic-environments#Hafsati2021 1.03 0.334 0.549
Hafsati_TUITO_task4_SED_2 Mohammed Hafsati Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France task-sound-event-detection-in-domestic-environments#Hafsati2021 1.04 0.336 0.550
Gong_TAL_task4_SED_3 Yaguang Gong Tomorrow Advancing Life Education Group, AI Department, Beijing, China task-sound-event-detection-in-domestic-environments#Gong2021 1.16 0.370 0.626
Gong_TAL_task4_SED_2 Yaguang Gong Tomorrow Advancing Life Education Group, AI Department, Beijing, China task-sound-event-detection-in-domestic-environments#Gong2021 1.15 0.367 0.616
Gong_TAL_task4_SED_1 Yaguang Gong Tomorrow Advancing Life Education Group, AI Department, Beijing, China task-sound-event-detection-in-domestic-environments#Gong2021 1.14 0.364 0.611
Park_JHU_task4_SED_2 Sangwook Park Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments#Park2021 1.07 0.327 0.603
Park_JHU_task4_SED_4 Sangwook Park Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments#Park2021 0.86 0.237 0.524
Park_JHU_task4_SED_1 Sangwook Park Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments#Park2021 1.01 0.305 0.579
Park_JHU_task4_SED_3 Sangwook Park Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA task-sound-event-detection-in-domestic-environments#Park2021 0.84 0.222 0.537
Zheng_USTC_task4_SED_4 Xu Zheng University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments#Zheng2021 1.30 0.389 0.742
Zheng_USTC_task4_SED_1 Xu Zheng University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments#Zheng2021 1.33 0.452 0.669
Zheng_USTC_task4_SED_3 Xu Zheng University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments#Zheng2021 1.29 0.386 0.746
Zheng_USTC_task4_SED_2 Xu Zheng University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China task-sound-event-detection-in-domestic-environments#Zheng2021 1.33 0.447 0.676
Nam_KAIST_task4_SED_2 Hyeonuk Nam Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea task-sound-event-detection-in-domestic-environments#Nam2021 1.19 0.399 0.609
Nam_KAIST_task4_SED_1 Hyeonuk Nam Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea task-sound-event-detection-in-domestic-environments#Nam2021 1.16 0.378 0.617
Nam_KAIST_task4_SED_3 Hyeonuk Nam Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea task-sound-event-detection-in-domestic-environments#Nam2021 1.09 0.324 0.634
Nam_KAIST_task4_SED_4 Hyeonuk Nam Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea task-sound-event-detection-in-domestic-environments#Nam2021 0.75 0.059 0.715
Koo_SGU_task4_SED_2 Hyejin Koo Sogang University, Department of Electronic Engineering, Seoul, South Korea task-sound-event-detection-in-domestic-environments#Koo2021 0.12 0.044 0.059
Koo_SGU_task4_SED_3 Hyejin Koo Sogang University, Department of Electronic Engineering, Seoul, South Korea task-sound-event-detection-in-domestic-environments#Koo2021 0.41 0.058 0.348
Koo_SGU_task4_SED_1 Hyejin Koo Sogang University, Department of Electronic Engineering, Seloul, Korea task-sound-event-detection-in-domestic-environments#Koo2021 0.74 0.258 0.364
deBenito_AUDIAS_task4_SED_4 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 1.10 0.361 0.577
deBenito_AUDIAS_task4_SED_1 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 1.07 0.343 0.571
deBenito_AUDIAS_task4_SED_2 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 1.10 0.363 0.574
deBenito_AUDIAS_task4_SED_3 Diego de Benito-Gorron Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 1.07 0.345 0.571
Baseline_SSep_SED Nicolas Turpault Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France task-sound-event-detection-in-domestic-environments#turpault2020b 1.11 0.364 0.580
Boes_KUL_task4_SED_4 Wim Boes KU Leuven, ESAT, Leuven, Belgium task-sound-event-detection-in-domestic-environments#Boes2021 0.60 0.117 0.457
Boes_KUL_task4_SED_3 Wim Boes KU Leuven, ESAT, Leuven, Belgium task-sound-event-detection-in-domestic-environments#Boes2021 0.68 0.121 0.531
Boes_KUL_task4_SED_2 Wim Boes KU Leuven, ESAT, Leuven, Belgium task-sound-event-detection-in-domestic-environments#Boes2021 0.77 0.233 0.440
Boes_KUL_task4_SED_1 Wim Boes KU Leuven, ESAT, Leuven, Belgium task-sound-event-detection-in-domestic-environments#Boes2021 0.81 0.253 0.442
Ebbers_UPB_task4_SED_2 Janek Ebbers Paderborn University, Department of Communications Engineering, Paderborn, Germany task-sound-event-detection-in-domestic-environments#Ebbers2021 1.10 0.335 0.621
Ebbers_UPB_task4_SED_4 Janek Ebbers Paderborn University, Department of Communications Engineering, Paderborn, Germany task-sound-event-detection-in-domestic-environments#Ebbers2021 1.16 0.363 0.637
Ebbers_UPB_task4_SED_3 Janek Ebbers Paderborn University, Department of Communications Engineering, Paderborn, Germany task-sound-event-detection-in-domestic-environments#Ebbers2021 1.24 0.416 0.635
Ebbers_UPB_task4_SED_1 Janek Ebbers Paderborn University, Department of Communications Engineering, Paderborn, Germany task-sound-event-detection-in-domestic-environments#Ebbers2021 1.16 0.373 0.621
Zhu_AIAL-XJU_task4_SED_2 Xiujuan Zhu XinJiang University, Key Laboratory of Signal Detection and Processing, Urumqi, China task-sound-event-detection-in-domestic-environments#Zhu2021 0.99 0.290 0.574
Zhu_AIAL-XJU_task4_SED_1 Xiujuan Zhu XinJiang University, Key Laboratory of Signal Detection and Processing, Urumqi, China task-sound-event-detection-in-domestic-environments#Zhu2021 1.04 0.318 0.583
Liu_BUPT_task4_4 Gang Liu Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu2021 0.37 0.102 0.231
Liu_BUPT_task4_1 Gang Liu Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu2021 0.30 0.090 0.169
Liu_BUPT_task4_2 Gang Liu Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu2021 0.54 0.152 0.322
Liu_BUPT_task4_3 Gang Liu Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu2021 0.24 0.068 0.146
Olvera_INRIA_task4_SED_2 Michel Olvera Inria Nancy Grand-Est, Department of Information and Communication Sciences and Technologies, Nancy, France task-sound-event-detection-in-domestic-environments#Olvera2021 0.98 0.338 0.481
Olvera_INRIA_task4_SED_1 Michel Olvera Inria Nancy Grand-Est, Department of Information and Communication Sciences and Technologies, Nancy, France task-sound-event-detection-in-domestic-environments#Olvera2021 0.95 0.332 0.462
Kim_AiTeR_GIST_SED_4 Nam Kyun Kim Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea task-sound-event-detection-in-domestic-environments#Kim2021 1.32 0.442 0.674
Kim_AiTeR_GIST_SED_2 Nam Kyun Kim Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea task-sound-event-detection-in-domestic-environments#Kim2021 1.31 0.439 0.667
Kim_AiTeR_GIST_SED_3 Nam Kyun Kim Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea task-sound-event-detection-in-domestic-environments#Kim2021 1.30 0.434 0.669
Kim_AiTeR_GIST_SED_1 Nam Kyun Kim Gwnagju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwnagju, South Korea task-sound-event-detection-in-domestic-environments#Kim2021 1.29 0.431 0.661
Cai_SMALLRICE_task4_SED_1 Heinrich Dinkel Xiaomi Corperation, Technology Comittee, Beijing, China task-sound-event-detection-in-domestic-environments#Dinkel2021 1.11 0.361 0.584
Cai_SMALLRICE_task4_SED_2 Heinrich Dinkel Xiaomi Corperation, Technology Comittee, Beijing, China task-sound-event-detection-in-domestic-environments#Dinkel2021 1.13 0.373 0.585
Cai_SMALLRICE_task4_SED_3 Heinrich Dinkel Xiaomi Corperation, Technology Comittee, Beijing, China task-sound-event-detection-in-domestic-environments#Dinkel2021 1.13 0.370 0.596
Cai_SMALLRICE_task4_SED_4 Heinrich Dinkel Xiaomi Corperation, Technology Comittee, Beijing, China task-sound-event-detection-in-domestic-environments#Dinkel2021 1.00 0.339 0.504
HangYuChen_Roal_task4_SED_2 Chen HangYu task-sound-event-detection-in-domestic-environments#HangYu2021 0.90 0.294 0.473
HangYuChen_Roal_task4_SED_1 Chen YuHang task-sound-event-detection-in-domestic-environments#YuHang2021 0.61 0.098 0.496
Yu_NCUT_task4_SED_1 Dongchi Yu North China University of Technology, Department of Electronic Information Engineering, Beijing, China task-sound-event-detection-in-domestic-environments#Yu2021 0.20 0.038 0.157
Yu_NCUT_task4_SED_2 Dongchi Yu North China University of Technology, Department of Electronic Information Engineering, Beijing, China task-sound-event-detection-in-domestic-environments#Yu2021 0.92 0.301 0.485
lu_kwai_task4_SED_1 Rui Lu Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China task-sound-event-detection-in-domestic-environments#Lu2021 1.27 0.419 0.660
lu_kwai_task4_SED_4 Rui Lu Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China task-sound-event-detection-in-domestic-environments#Lu2021 0.88 0.157 0.685
lu_kwai_task4_SED_3 Rui Lu Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China task-sound-event-detection-in-domestic-environments#Lu2021 0.86 0.148 0.686
lu_kwai_task4_SED_2 Rui Lu Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China task-sound-event-detection-in-domestic-environments#Lu2021 1.25 0.412 0.651
Liu_BUPT_task4_SS_SED_2 Gang Liu_SS Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu_SS2021 0.94 0.302 0.507
Liu_BUPT_task4_SS_SED_1 Gang Liu_SS Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China task-sound-event-detection-in-domestic-environments#Liu_SS2021 0.94 0.302 0.507
Tian_ICT-TOSHIBA_task4_SED_2 Gangyi Tian Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments#Tian2021 1.19 0.411 0.585
Tian_ICT-TOSHIBA_task4_SED_1 Gangyi Tian Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments#Tian2021 1.19 0.413 0.586
Tian_ICT-TOSHIBA_task4_SED_4 Gangyi Tian Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments#Tian2021 1.19 0.412 0.586
Tian_ICT-TOSHIBA_task4_SED_3 Gangyi Tian Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China task-sound-event-detection-in-domestic-environments#Tian2021 1.18 0.409 0.584
Yao_GUET_task4_SED_3 Yu Yao Guilin University of Electronic Technology, School of Information and Communication, Guilin, China task-sound-event-detection-in-domestic-environments#Yao2021 0.88 0.279 0.479
Yao_GUET_task4_SED_1 Yu Yao Guilin University of Electronic Technology, School of Information and Communication, Guilin, China task-sound-event-detection-in-domestic-environments#Yao2021 0.88 0.277 0.482
Yao_GUET_task4_SED_2 Yu Yao Guilin University of Electronic Technology, School of Information and Communication, Guilin, China task-sound-event-detection-in-domestic-environments#Yao2021 0.54 0.056 0.496
Liang_SHNU_task4_SED_4 Yunhao Liang Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-sound-event-detection-in-domestic-environments#Liang2021 0.99 0.313 0.543
Bajzik_UNIZA_task4_SED_2 Jakub Bajzik University of Zilina, Department of Mechatronics and Electronics, Zilina, Slovak Republic task-sound-event-detection-in-domestic-environments#Bajzik2021 1.02 0.330 0.544
Bajzik_UNIZA_task4_SED_1 Jakub Bajzik University of Zilina, Department of Mechatronics and Electronics, Zilina, Slovak Republic task-sound-event-detection-in-domestic-environments#Bajzik2021 0.45 0.133 0.266
Liang_SHNU_task4_SSep_SED_3 Yunhao Liang_SS Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-sound-event-detection-in-domestic-environments#Liang_SS2021 0.99 0.304 0.559
Liang_SHNU_task4_SSep_SED_1 Yunhao Liang_SS Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-sound-event-detection-in-domestic-environments#Liang_SS2021 1.03 0.313 0.588
Liang_SHNU_task4_SSep_SED_2 Yunhao Liang_SS Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-sound-event-detection-in-domestic-environments#Liang_SS2021 1.01 0.325 0.542
Baseline_SED Nicolas Turpault Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France task-sound-event-detection-in-domestic-environments#Turpault2020a 1.00 0.315 0.547
Wang_NSYSU_task4_SED_1 Yih Wen Wang National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan task-sound-event-detection-in-domestic-environments#Wang2021 1.13 0.336 0.646
Wang_NSYSU_task4_SED_4 Yih Wen Wang National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan task-sound-event-detection-in-domestic-environments#Wang2021 1.09 0.304 0.662
Wang_NSYSU_task4_SED_2 Yih Wen Wang National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan task-sound-event-detection-in-domestic-environments#Wang2021 0.69 0.070 0.636
Wang_NSYSU_task4_SED_3 Yih Wen Wang National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan task-sound-event-detection-in-domestic-environments#Wang2021 1.13 0.339 0.649


Complete results and technical reports can be found in the Results page.

Baseline systems

Sound event detection baseline

System description

The baseline model is an improved version of the DCASE 2020 baseline. It is a mean-teacher model: a student and a teacher network share the same architecture, the teacher's weights are an exponential moving average (EMA) of the student's weights, and the student is trained on both labeled and unlabeled data.
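
As a reminder of how mean-teacher training works, the sketch below is a generic PyTorch example (not the exact baseline code): the student is optimised with a supervised loss on labeled data plus a consistency loss towards the teacher's predictions on unlabeled data, and the teacher's weights are updated as an EMA of the student's weights.

```python
# Generic mean-teacher sketch (not the exact baseline code).
# Assumes the models output clip-level probabilities (after a sigmoid).
import copy
import torch
import torch.nn.functional as F

def update_teacher(student, teacher, alpha=0.999):
    """Teacher weights are an exponential moving average of the student's."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1 - alpha)

def train_step(student, teacher, optimizer, x_labeled, y_weak, x_unlabeled, cons_weight=1.0):
    optimizer.zero_grad()
    # Supervised loss on the labeled (weak/synthetic) part of the batch.
    sup_loss = F.binary_cross_entropy(student(x_labeled), y_weak)
    # Consistency loss between student and teacher predictions on unlabeled data.
    with torch.no_grad():
        teacher_pred = teacher(x_unlabeled)
    cons_loss = F.mse_loss(student(x_unlabeled), teacher_pred)
    loss = sup_loss + cons_weight * cons_loss  # cons_weight is usually ramped up over epochs
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()

# Initialisation: teacher = copy.deepcopy(student); teacher.requires_grad_(False)
```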

The main differences of the baseline system (without source separation) compared to DCASE 2020 are:

  • Features: hop size of 256 samples instead of 255.
  • A different synthetic dataset is used.
  • No early stopping: the model is trained for 200 epochs and the best checkpoint is kept.
  • Per-instance normalisation using a min-max approach.
  • Mixup is applied to weak and synthetic data by mixing examples within a batch, with a 50% chance of applying it (normalisation and mixup are illustrated in the sketch after this list).
  • Batch size of 48 (still 1/4 synthetic, 1/4 weak, 1/2 unlabelled).
  • Intersection-based F1 instead of event-based F1 for the synthetic validation score.
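
The following sketch (an assumed implementation, not the exact baseline code) illustrates the per-instance min-max normalisation and the within-batch mixup with 50% application probability mentioned above:

```python
# Assumed implementation sketch of per-instance min-max normalisation and
# within-batch mixup applied with a 50% probability.
import torch

def minmax_normalize(spec, eps=1e-8):
    """Scale one instance (e.g. a log-mel spectrogram) to the [0, 1] range."""
    return (spec - spec.min()) / (spec.max() - spec.min() + eps)

def batch_mixup(features, labels, alpha=0.2, p=0.5):
    """Mix each example with a randomly permuted example from the same batch."""
    if torch.rand(1).item() > p:  # mixup is applied with probability p
        return features, labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y
```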

Results for the development dataset

PSDS-scenario1 PSDS-scenario2 Intersection-based F1 Collar-based F1
Dev-test 0.342 0.527 76.6% 40.1%

Collar-based F1 is equivalent to event-based F1. The intersection-based F1 is computed with (dtc = gtc = 0.5, cttc = 0.3), and the event-based F1 is computed with collars (onset = 200 ms, offset = max(200 ms, 20% of the event duration)).
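
For reference, both the PSDS and the intersection-based F1 can be computed with the psds_eval package; the sketch below is a minimal example assuming TSV files with the usual filename/onset/offset/event_label columns (the paths, alpha values and max_efpr are placeholders, not the official challenge settings):

```python
# Minimal PSDS / intersection-based F1 sketch using the psds_eval package
# (paths, alpha values and max_efpr are placeholders).
import pandas as pd
from psds_eval import PSDSEval

ground_truth = pd.read_csv("ground_truth.tsv", sep="\t")  # filename, onset, offset, event_label
metadata = pd.read_csv("metadata.tsv", sep="\t")          # filename, duration
detections = pd.read_csv("detections.tsv", sep="\t")      # same columns as the ground truth

psds_eval = PSDSEval(dtc_threshold=0.5, gtc_threshold=0.5, cttc_threshold=0.3,
                     ground_truth=ground_truth, metadata=metadata)

# Intersection-based macro F1 for a single operating point.
macro_f1, per_class_f1 = psds_eval.compute_macro_f_score(detections)
print(f"Intersection-based macro F1: {macro_f1:.3f}")

# PSDS over one (or several) operating point(s).
psds_eval.add_operating_point(detections)
psds = psds_eval.psds(alpha_ct=0.0, alpha_st=0.0, max_efpr=100)
print(f"PSDS: {psds.value:.3f}")
```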

Note: the performance might not be exactly reproducible on a GPU-based system. For this reason, you can download the checkpoint of the network along with the TensorBoard events. Launch python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt to test this model.

Repositories


Sound separation and sound event detection

System description

The Sound separation and sound event detection baseline uses the pre-trained SED model together with a pre-trained sound separation model.

Figure 6: Combination of sound separation with sound event detection.


The SED model is fine-tuned on separated sound events obtained by pre-processing the data with the pre-trained sound separation model.

The sound separation model is based on TDCN++ and is trained in an unsupervised way with MixIT on the YFCC100m dataset.
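
To give an idea of what MixIT does (a schematic sketch only, not the TDCN++/YFCC100m training code): two reference mixtures are summed into a mixture of mixtures, the model estimates a set of sources from it, and the loss keeps the best remix of the estimated sources back onto the two references over all binary source-to-mixture assignments.

```python
# Schematic MixIT loss (illustrative only, not the baseline's TDCN++ code).
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    """est_sources: (M, T) sources estimated from the mixture of mixtures mix1 + mix2.
    mix1, mix2:     (T,) the two reference mixtures."""
    refs = torch.stack([mix1, mix2])                 # (2, T)
    m = est_sources.shape[0]
    best = None
    # Try every assignment of each estimated source to one of the two mixtures.
    for assignment in itertools.product([0, 1], repeat=m):
        a = torch.zeros(2, m)
        a[torch.tensor(assignment), torch.arange(m)] = 1.0  # one-hot columns
        remix = a @ est_sources                      # (2, T) remixed estimates
        loss = ((refs - remix) ** 2).mean()          # MSE here; the paper uses an SNR-based loss
        if best is None or loss < best:
            best = loss
    return best
```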

Predictions are obtained by ensembling the fine-tuned SED model with the original SED model. Ensembling is performed as a weighted average of the predictions of the two models; the weight is learned during training.
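
One possible implementation of such an ensembling is sketched below (an assumed implementation; the exact baseline code may differ): a single learnable weight blends the frame-level probabilities of the two models.

```python
# Assumed sketch of ensembling the original and fine-tuned SED models with a
# single learned weight (names and shapes are illustrative).
import torch
import torch.nn as nn

class WeightedEnsemble(nn.Module):
    def __init__(self, sed_original, sed_finetuned):
        super().__init__()
        self.sed_original = sed_original      # pre-trained SED model (run on mixtures)
        self.sed_finetuned = sed_finetuned    # SED model fine-tuned on separated audio
        self.logit_w = nn.Parameter(torch.zeros(1))  # learned mixing weight

    def forward(self, mixture, separated):
        w = torch.sigmoid(self.logit_w)               # keep the weight in [0, 1]
        p_orig = self.sed_original(mixture)           # (batch, frames, classes)
        p_sep = self.sed_finetuned(separated)         # same shape
        return w * p_sep + (1 - w) * p_orig
```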

Publication

Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.

Publication

Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J Weiss, Kevin Wilson, and John R Hershey. Unsupervised sound separation using mixtures of mixtures. arXiv preprint arXiv:2006.12701, 2020.

Publication

Nicolas Turpault, Scott Wisdom, Hakan Erdogan, John Hershey, Romain Serizel, Eduardo Fonseca, Prem Seetharaman, and Justin Salamon. Improving sound event detection in domestic environments using sound separation. arXiv preprint arXiv:2007.03932, 2020.


Running the baseline

You can run the SSEP + SED baseline from scratch by first downloading the pre-trained universal sound separation model trained on YFCC100m [3], following the instructions in the Sound separation baseline repository, using the Google Cloud SDK (installation instructions):

  • gsutil -m cp -r gs://gresearch/sound_separation/yfcc100m_mixit_model_checkpoint .

You also need the pre-trained SED system obtained from the SED baseline; you can train your own or use the pretrained system.

Be sure to check that the paths to the SED checkpoint, its YAML file, and the pre-trained sound separation model are set correctly in the configuration file ./confs/sep+sed.yaml.
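
As a quick sanity check, you can load the configuration and verify that the referenced files exist; the key names below are hypothetical placeholders, use the actual keys defined in ./confs/sep+sed.yaml:

```python
# Quick sanity check of the configuration paths. The key names below are
# hypothetical; replace them with the actual keys from ./confs/sep+sed.yaml.
import os
import yaml

with open("./confs/sep+sed.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("sed_checkpoint", "sed_yaml", "separation_model_dir"):  # hypothetical keys
    path = cfg.get(key)
    status = "OK" if path and os.path.exists(path) else "MISSING"
    print(f"{key}: {path} -> {status}")
```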

First, sound separation is applied to the data using:

  • python run_separation.py

The SED model is then fine-tuned on the separated data using:

  • python finetune_on_separated.py

We also provide a pre-trained checkpoint for this model, along with TensorBoard logs.

You can test it on the real-world validation data using:

  • python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt

Check the TensorBoard logs using tensorboard --logdir="path/to/exp_folder".

Results for the development dataset

PSDS-scenario1 PSDS-scenario2 Intersection-based F1 Collar-based F1
Dev-test 0.373 0.549 77.2% 44.3%

The results are from the "teacher" predictions (which are better than the student's predictions; note that this is the only choice cherry-picked on the dev-test set).

Repositories


Citation

If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following papers:

Task and datasets

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.


Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In in preparation. 2020.
