The goal of the task is to evaluate systems for the detection of sound events using real data that is either weakly labeled or unlabeled, together with simulated data that is strongly labeled (with timestamps).
Challenge has ended. Full results for this task can be found in the Results page.
Description
This task is the follow-up to DCASE 2020 task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see also Fig 1). The challenge remains to explore how a large amount of unbalanced and unlabeled training data can be exploited together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered as reliable.
Like last year, we encourage participants to propose systems that use sound separation jointly with sound event detection. This step can be used to separate overlapping sound events and extract foreground sound events from the background sound. To motivate participants to explore that direction, we provide a baseline sound separation model that can be used for pre-processing (see also Fig 2).
Compared to previous years, this task aims to investigate how we can optimally exploit synthetic data, including non-target isolated events for both sound event detection and sound separation.
Audio dataset
The data for this task consist of several datasets designed for sound event detection and/or sound separation. The datasets are described below.
Audio material
Dataset | Subset | Type | Usage | Annotations | Event type | Sampling frequency
---|---|---|---|---|---|---
DESED | Real: weakly labeled | Recorded soundscapes | Training | Weak labels (no timestamps) | Target | 44.1kHz
DESED | Real: unlabeled | Recorded soundscapes | Training | No annotations | Target | 44.1kHz
DESED | Real: validation | Recorded soundscapes | Validation | Strong labels (with timestamps) | Target | 44.1kHz
DESED | Real: public evaluation | Recorded soundscapes | Evaluation (do not use this subset to tune hyperparameters) | Strong labels (with timestamps) | Target | 44.1kHz
DESED | Synthetic: training | Isolated events + synthetic soundscapes | Training/validation | Strong labels (with timestamps) | Target | 16kHz
DESED | Synthetic: evaluation | Isolated events + backgrounds | Evaluation (do not use this subset to tune hyperparameters) | Event level labels (no timestamps) | Target | 16kHz
SINS | | Background | Training/validation | No annotations | N/A | 16kHz
TUT Acoustic scenes 2017, development dataset | | Background | Training/validation | No annotations | N/A | 44.1kHz
FUSS dataset | | Isolated events + synthetic soundscapes | Training/validation | Weak annotations from FSD50K (no timestamps) | Target and non-target | 16kHz
FSD50K dataset | | Isolated events + recorded soundscapes | Training/validation | Weak annotations (no timestamps) | Target and non-target | 44.1kHz
YFCC100M dataset | | Recorded soundscapes | Training/validation | No annotations | Sound sources | 44.1kHz
If you plan to perform source separation (or to use backgrounds from the SINS dataset), please resample your recorded data to 16kHz. If you are using only recorded data and perform only sound event detection, you can use sampling rates up to 44.1kHz, yet we strongly encourage participants to use a 16kHz sampling rate (see the resampling sketch below).
Please note that the baselines work on 16kHz data.
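For instance, a minimal resampling sketch using the librosa and soundfile Python libraries (the file names below are placeholders):

```python
import librosa
import soundfile as sf

# Load a recorded clip and resample it to 16 kHz on the fly (paths are placeholders).
audio, sr = librosa.load("Y-BJNMHMZDcU_50.000_60.000.wav", sr=16000, mono=True)

# Write the 16 kHz version, which is the rate expected by the baselines.
sf.write("Y-BJNMHMZDcU_50.000_60.000_16k.wav", audio, sr)
```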
DESED dataset
The DESED dataset is the dataset that was used in DCASE 2020 task 4. The dataset for this task is composed of 10-second audio clips recorded in domestic environments or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of Audioset (not all the classes have an exact counterpart in Audioset, and some sound event classes group several Audioset classes):
- Speech
- Dog
- Cat
- Alarm/bell/ringing (label: Alarm_bell_ringing)
- Dishes
- Frying
- Blender
- Running water (label: Running_water)
- Vacuum cleaner (label: Vacuum_cleaner)
- Electric shaver/toothbrush (label: Electric_shaver_toothbrush)
More information about this dataset and how to generate synthetic soundscapes can be found on the DESED website.
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data
Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.
Sound event detection in synthetic domestic environments
Keywords
Semi-supervised learning ; Weakly labeled data ; Synthetic data ; Sound event detection
This year we provide a file mapping Audioset event class names to DESED target event class names in order to allow for using the FSD50K and FUSS datasets. We provide csv files for training/validation splits that are compatible with both FSD50K and FUSS. We also provide event distribution statistics for both target and non-target event classes. These distributions are computed on annotations obtained by humans for ~90k clips from Audioset.
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.
Audio Set: An ontology and human-labeled dataset for audio events
Abstract
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
Shawn Hershey, Daniel P W Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021. accepted.
The Benefit of Temporally-Strong Labels in Audio Event Classification
Sound separation dataset (FUSS)
The Free Universal Sound Separation (FUSS) Dataset is composed of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.
Overview: The audio data is sourced from a subset of FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these sound source files have been screened such that they likely only contain a single type of sound. As the file IDs have not been modified, the FSD50K labels remain valid for the FUSS files. To create mixtures, 10-second clips of sounds are convolved with simulated room impulse responses and added together. Each 10-second mixture contains between 1 and 4 sounds. Sound source files longer than 10s are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and access to the original source audio.
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Abstract
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.
Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.
Freesound technical demo
Abstract
Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.
Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the FUSS about free universal sound separation data? In preparation, 2020.
What's All the FUSS About Free Universal Sound Separation Data?
Motivation: This dataset provides a platform to investigate how sound separation may help with event detection and vice versa. Event detection is more difficult in noisy environments, and so separation could be a useful pre-processing step. Data with strong labels for event detection are relatively scarce, especially when restricted to specific classes within a domain. In contrast, sound separation data needs no event labels for training, and may be more plentiful. In this setting, the idea is to utilize larger unlabeled separation data to train separation systems, which can serve as a front-end to event-detection systems trained on more limited data.
Room simulation: Room impulse responses are simulated using the image method with frequency-dependent walls. Each impulse corresponds to a rectangular room of random size with random wall materials, where a single microphone and up to 4 sound sources are placed at random spatial locations.
Recipe for data creation:
The data creation recipe starts with scripts, based on Scaper, that generate mixtures of sound events with random timing, along with a background sound that spans the duration of the mixture clip. The constituent sound files for each mixture are also generated for use as references for training and evaluation. The data are reverberated using a different room simulation for each mixture; in this simulation each sound source has its own reverberation corresponding to a different spatial location. The reverberated mixtures are created by summing the reverberated sound sources. The data creation scripts support modification, so that participants may remix and augment the training data as desired.
Additional (background) datasets
The background files were not screened to ensure that they contain no target class. Therefore, background files might include target events.
SINS dataset:
A part of the derivative of the SINS dataset used for DCASE 2018 task 5 is used as background for the synthetic subset of the dataset for this task. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones.
Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network
Abstract
There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.
Keywords
Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks
TUT Acoustic scenes 2017, development dataset: The TUT Acoustic Scenes 2017 dataset consists of recordings from various acoustic scenes, all having distinct recording locations. For each recording location, a 3-5 minute long audio recording was captured. The original recordings were then split into segments with a length of 10 seconds. These audio segments are provided in individual files.
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. TUT database for acoustic scene classification and sound event detection. In 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection
Abstract
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting ofbinaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
Additional (event) datasets
FSD50K dataset: FSD50K is an open dataset of human-labeled sound events containing over 51k Freesound audio clips, totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. Clips are of variable length from 0.3 to 30s and contain both target and non-target events that can be used to generate more realistic soundscapes.
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Additional (sound sources) datasets
YFCC100M: YFCC100M is a multimedia dataset comprising a total of 100 million media objects, including approximately 0.8 million videos, all of which were uploaded to Flickr between 2004 and 2014 and published under a Creative Commons commercial or non-commercial license. Within the task, this dataset is used for the unsupervised training of the sound separation model.
Generating your own training data using Scaper
Participants are encouraged to use the provided isolated foreground and background sounds, in combination with the Scaper soundscape synthesis and augmentation library, to generate additional (potentially infinite!) training data (a minimal example is given after the list of resources below).
Resources for getting started include:
- The Scaper scripts provided with the DESED dataset (link)
- The canonical Scaper script used to create the source separation dataset (link)
- The Scaper tutorial
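As a minimal sketch (not the official generation script), the snippet below generates one 10-second soundscape with Scaper; the folder paths, number of events and distribution parameters are placeholder assumptions:

```python
import scaper

# Placeholder folders containing one sub-folder per label
# (e.g. the DESED foreground events and the SINS/TUT backgrounds).
fg_folder = "soundbank/foreground"
bg_folder = "soundbank/background"

# 10-second soundscape, as in the DESED synthetic subset.
sc = scaper.Scaper(10.0, fg_folder, bg_folder, random_state=42)
sc.ref_db = -50  # reference loudness of the background

# One background sound spanning the whole clip.
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# Two foreground events with random label, timing and SNR.
for _ in range(2):
    sc.add_event(label=('choose', []),
                 source_file=('choose', []),
                 source_time=('const', 0),
                 event_time=('uniform', 0, 9),
                 event_duration=('uniform', 0.25, 10.0),
                 snr=('uniform', 6, 30),
                 pitch_shift=None,
                 time_stretch=None)

# Writes the mixture, a JAMS annotation and a simple tab-separated annotation file.
sc.generate('soundscape.wav', 'soundscape.jams', txt_path='soundscape.txt')
```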
Reference labels
Audioset provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation.
Weak annotations
The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab separated csv file under the following format:
[filename (string)][tab][event_labels (strings)]
For example:
Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
The first column, Y-BJNMHMZDcU_50.000_60.000.wav, is the name of the audio file downloaded from YouTube (Y-BJNMHMZDcU is the YouTube ID of the video from which the 10-second clip was extracted; t=50 sec and t=60 sec correspond to the clip boundaries within the full video) and the last column, Alarm_bell_ringing,Dog, corresponds to the sound classes present in the clip, separated by commas.
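For reference, a minimal way to parse such a file with pandas (the file path is a placeholder):

```python
import pandas as pd

# Tab-separated file with columns: filename, event_labels (comma-separated).
weak = pd.read_csv("metadata/train/weakly_labeled.tsv", sep="\t")

# Turn the comma-separated label string into a list of classes per clip.
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```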
Strong annotations
Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).
The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from DESED was verified by humans in order to check that the event class indicated by the original Freesound tags was indeed dominant in the audio clip. Each clip was then segmented to remove silence parts and keep only isolated event(s).
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.
Freesound technical demo
In both cases, the minimum length for an event is 250ms. The minimum duration of the pause between two events from the same class is 150ms. When the silence between two consecutive events from the same class was less than 150ms, the events were merged into a single event. The strong annotations are provided in a tab separated csv file under the following format:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
For example:
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
The first column, YOTsn73eqbfc_10.000_20.000.wav, is the name of the audio file, the second column, 0.163, is the onset time in seconds, the third column, 0.665, is the offset time in seconds, and the last column, Alarm_bell_ringing, corresponds to the class of the sound event.
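For illustration, the sketch below parses a strong annotation file and applies the 150 ms merging rule described above (the file path is a placeholder; the provided annotations are already merged):

```python
import pandas as pd

# Tab-separated file with columns: filename, onset, offset, event_label.
strong = pd.read_csv("metadata/validation/validation.tsv", sep="\t")

def merge_close_events(df, max_gap=0.150):
    """Merge consecutive events of the same class separated by less than max_gap seconds."""
    merged = []
    for (fname, label), grp in df.groupby(["filename", "event_label"]):
        grp = grp.sort_values("onset")
        cur_onset, cur_offset = None, None
        for _, row in grp.iterrows():
            if cur_onset is None:
                cur_onset, cur_offset = row.onset, row.offset
            elif row.onset - cur_offset < max_gap:
                # Gap shorter than 150 ms: extend the current event.
                cur_offset = max(cur_offset, row.offset)
            else:
                merged.append((fname, cur_onset, cur_offset, label))
                cur_onset, cur_offset = row.onset, row.offset
        merged.append((fname, cur_onset, cur_offset, label))
    return pd.DataFrame(merged, columns=["filename", "onset", "offset", "event_label"])

print(merge_close_events(strong).head())
```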
FSD50K annotations
Labels are provided at clip level (i.e., weak labels). However, in some clips the sound event fills almost the entirety of the file, in which case the clip-level label can be considered a strong label. All existing labels are human-verified. Nonetheless, there could be some missing “Present” labels, mainly in the dev set. More details can be found in Section IV of the FSD50K paper. The annotation format is:
`[filename],[labels],[mids],[split]`
For example, for a clip containing the sound event "Purr":
`181956,"Purr,Cat,Domestic_animals_and_pets,Animal","/m/02yds9,/m/01yrx,/m/068hy,/m/0jbk",train`
Note that the ancestors of "Purr", as per the AudioSet Ontology, are also included. Mids are the class identifiers used in the AudioSet Ontology. We provide a file mapping from Audioset event class names to DESED target event class names in order to allow for using FSD50K annotations in the data generation process.
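A minimal parsing sketch with the standard csv module (the file path and header names are assumptions based on the FSD50K release):

```python
import csv

# Assumed columns: fname, labels (comma-separated, quoted), mids, split.
with open("FSD50K.ground_truth/dev.csv", newline="") as f:
    for row in csv.DictReader(f):
        labels = row["labels"].split(",")  # e.g. ['Purr', 'Cat', 'Domestic_animals_and_pets', 'Animal']
        mids = row["mids"].split(",")      # matching AudioSet class identifiers
        if "Purr" in labels:
            print(row["fname"], row["split"], labels)
```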
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Download
The dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the task 4 data generation script.
The table below summarizes the access information to the different datasets.
Dataset | Github | Download | Automatic download with the script
---|---|---|---
DESED | DESED github repo | DESED real (zenodo), DESED synthetic (zenodo), DESED public eval (zenodo) | Yes
SINS | SINS github repo | SINS dev (zenodo) | Yes
TUT Acoustic scenes 2017, development dataset | | TUT 2017 dev (zenodo) | Yes
FUSS dataset | FUSS github repo | FUSS (zenodo) | Yes
FSD50K dataset | | FSD50K (zenodo) | Yes
YFCC100M dataset | | YFCC100M website | No
If you experience problems during the download of the recorded soundscapes, please contact the task organizers (Francesca Ronchini and Romain Serizel in priority).
Task setup
The challenge consists of detecting sound events within audio clips using training data from real recordings (both weakly labeled and unlabeled) and synthetic audio clips that are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps. As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios:
- You are working on sound event detection without source separation pre-processing (this is a direct follow-up to last year's task 4)
- You are working on both source separation and sound event detection (this includes the case when reusing the source separation baseline)
- You are working only on source separation and use the sound event detection baseline
In each case, participants are expected to provide the output of a sound event detection system (see also below). Note that an additional (optional) separate evaluation set designed to evaluate source separation performance will also be provided (see also below).
Development dataset
The development set is divided into six main partitions (four for sound event detection, two for source separation) that are detailed below.
Sound event detection training set
We provide 3 different splits of training data in our training set: the labeled training set, the unlabeled in domain training set, and the synthetic set with strong annotations. The first two sets are the same as in DCASE 2019 task 4. The development dataset organization is described in Fig 3.
Labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.
The number of clips per class is the following:
Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. Note however that, given the uncertainty on Audioset labels, this distribution might not be exactly the same.
Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library.
The clips are generated such that the distribution per event is close to that of the validation set.
- We used all the foreground files from the DESED synthetic soundbank.
- We used background files annotated as "other" from the subpart of the SINS dataset and files annotated as "xxx" from the [TUT Acoustic scenes 2017, development dataset](https://zenodo.org/record/400515).
- We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training split of both FSD50K and FUSS. We provide csv files corresponding to these splits.
- Event distribution statistics for both target and non-target event classes are computed on annotations obtained by humans for ~90k clips from Audioset.
We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.
Sound event detection validation set
The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators). Note that a 10-second clip may correspond to more than one sound event.
Source separation training set
TBA
Source separation validation set
TBA
Evaluation datasets
As we encourage participants to use a source separation algorithm together with the sound event detection, there are three possible scenarios (listed in the task setup). Participants are allowed to submit up to 4 different systems for each scenario. In each case, participants are expected to provide at least the output of a sound event detection system. Participants who want to get their sound separation submission evaluated can download the specific sound separation evaluation dataset (see below). The evaluation dataset organization is described in Fig 4.
Before submission, please make sure you check that your submission package is correct with the validation script enclosed in the submission package:
python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2021/task4_test
Sound event detection evaluation dataset
The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.
- A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes. This subset includes the public evaluation dataset.
- A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.
Sound separation evaluation dataset
The sound separation evaluation dataset is the evaluation part of the FUSS dataset.
Task rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.
Task specific rules:
- Participants are allowed to submit up to 4 different systems for each scenario listed in the task setup section.
- Participants are not allowed to use external data for system development. Data from other tasks is considered external data.
- Another example of external data is other material related to the video, such as the rest of the audio from which the 10-sec clip was extracted, the video frames and the metadata.
- Participants are not allowed to use the embeddings provided by Audioset or any other features that indirectly use external data.
- For the real recordings (from Audioset), only the weak labels can be used for training the submitted systems; neither the strong labels (timestamps) nor the original (Audioset) labels can be used. For the synthetic clips, the provided strong labels can be used.
- Manipulation of provided training data is allowed.
- The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a PDF or using techniques such as pitch shifting or time stretching).
- Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.
Evaluation
Sound event detection evaluation
All submissions will be evaluated with the polyphonic sound event detection score (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). PSDS values are computed using 50 operating points (linearly distributed from 0.01 to 0.99). In order to better understand the behavior of each submission, the PSDS is computed for two different scenarios that emphasize different system properties.
Scenario 1
The system needs to react fast upon an event detection (e.g. to trigger an alarm, adapt a home automation system...). The localization of the sound event is therefore really important. The PSDS parameters reflecting these needs are:
- Detection Tolerance criterion (DTC): 0.7
- Ground Truth intersection criterion (GTC): 0.7
- Cost of instability across class (\(\alpha_{ST}\)): 1
- Cost of CTs on user experience (\(\alpha_{CT}\)): 0
- Maximum False Positive rate (e_max): 100
Scenario 2
The system must avoid confusion between classes but the reaction time is less crucial than in the first scenario. The PSDS parameters reflecting these needs are (see also the psds_eval sketch after this list):
- Detection Tolerance criterion (DTC): 0.1
- Ground Truth intersection criterion (GTC): 0.1
- Cost of instability across class (\(\alpha_{ST}\)): 1
- Cross-Trigger Tolerance criterion (cttc): 0.3
- Cost of CTs on user experience (\(\alpha_{CT}\)): 0.5
- Maximum False Positive rate (e_max): 100
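As an illustration, the sketch below shows how these two parameter sets could be passed to the psds_eval toolbox (file names, thresholds and dataframe contents are placeholders; for scenario 1 the cross-trigger tolerance has no effect since \(\alpha_{CT} = 0\)). Please refer to the dedicated notebook for the official usage.

```python
import pandas as pd
from psds_eval import PSDSEval

# Ground truth and clip durations for the evaluation set (placeholder files).
ground_truth = pd.read_csv("ground_truth.tsv", sep="\t")  # filename, onset, offset, event_label
metadata = pd.read_csv("durations.tsv", sep="\t")         # filename, duration

# Scenario 1: reaction time / localization matters.
psds1 = PSDSEval(dtc_threshold=0.7, gtc_threshold=0.7, cttc_threshold=0.3,
                 ground_truth=ground_truth, metadata=metadata)

# Scenario 2: confusions between classes are penalized.
psds2 = PSDSEval(dtc_threshold=0.1, gtc_threshold=0.1, cttc_threshold=0.3,
                 ground_truth=ground_truth, metadata=metadata)

# One detection table per operating point (50 thresholds in the challenge setup).
for threshold in [round(0.02 * i, 2) for i in range(1, 50)]:  # placeholder operating points
    detections = pd.read_csv(f"detections_{threshold:.2f}.tsv", sep="\t")
    psds1.add_operating_point(detections)
    psds2.add_operating_point(detections)

print("PSDS scenario 1:", psds1.psds(alpha_ct=0.0, alpha_st=1.0, max_efpr=100).value)
print("PSDS scenario 2:", psds2.psds(alpha_ct=0.5, alpha_st=1.0, max_efpr=100).value)
```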
Task Ranking
The official ranking will be a team-wise ranking, not a system-wise ranking. The ranking criterion is an aggregation of PSDS-scenario1 and PSDS-scenario2:
\[ \text{ranking score} = \frac{\overline{\mathrm{PSDS}_1} + \overline{\mathrm{PSDS}_2}}{2}, \]
with \(\overline{\mathrm{PSDS}_1}\) and \(\overline{\mathrm{PSDS}_2}\) the PSDS on scenario 1 and 2 normalized by the baseline PSDS on these scenarios, respectively. Each separate metric considered in the final ranking criterion will be the best separate metric among all the team's submissions (PSDS-scenario1 and PSDS-scenario2 can be obtained by two different systems from the same team, see also Fig 5). The setup is chosen in order to favor experiments on the systems' behavior, and adaptation to different metrics depending on the targeted scenario.
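For illustration, using the baseline scores reported in the results table below (PSDS 1 = 0.315, PSDS 2 = 0.547), a submission reaching PSDS 1 = 0.452 and PSDS 2 = 0.669 obtains a ranking score of \((0.452/0.315 + 0.669/0.547)/2 \approx 1.33\), which matches the corresponding entry in the table; the baseline itself gets a ranking score of 1.00 by construction.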
Contrastive metric (collar-based F1-score)
Additionally, event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets will be provided as a contrastive measure. Systems will be evaluated with a threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
Evaluation is done using the sed_eval and psds_eval toolboxes.
You can find more information on how to use PSDS for task 4 in the dedicated notebook.
Detailed information on metrics calculation is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.
Metrics for Polyphonic Sound Event Detection
Abstract
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, and Sacha Krstulovic. A framework for the robust evaluation of sound event detection. arXiv preprint arXiv:1910.08440, 2019. URL: https://arxiv.org/abs/1910.08440.
Source separation evaluation (optional)
Source separation approaches will optionally be evaluated on an additional evaluation set with standard source separation metrics, as follows. For each example mixture x containing J sources, performance will be measured with the permutation-invariant scale-invariant signal-to-noise ratio improvement (SI-SNRi), defined by the following elements (a minimal sketch follows the list):
- SNR is defined as 10 * log10 of the ratio of source power to error power.
- Scale-invariant SNR (SI-SNR) allows scaling of the reference to best match the estimate. The optimal scaling factor given an estimate signal \(e\) and reference signal \(r\) is \(\langle e, r \rangle / \langle r, r \rangle\), where \(\langle \cdot , \cdot \rangle\) denotes the inner product.
- Permutation invariance allows the estimates to be permuted to match reference signals such that the mean SI-SNR across sources is maximized.
- SI-SNR improvement (SI-SNRi) is the SI-SNR of the estimate with respect to the reference, minus SI-SNR of the mixture with respect to the reference.
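A minimal numpy sketch of these definitions (not the official evaluation script):

```python
import numpy as np
from itertools import permutations

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB: the reference is rescaled to best match the estimate."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps))

def si_snr_improvement(estimate, reference, mixture):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

def permutation_invariant_si_snr(estimates, references):
    """Mean SI-SNR over sources, maximized over permutations of the estimates."""
    J = len(references)
    return max(
        np.mean([si_snr(estimates[p[j]], references[j]) for j in range(J)])
        for p in permutations(range(J))
    )
```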
A Python evaluation script for source separation is also provided.
Results
Code | Author | Affiliation | Technical Report | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset)
---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Tong Na | Beijing, China | task-sound-event-detection-in-domestic-environments#Na2021 | 0.80 | 0.245 | 0.452 | ||
Hafsati_TUITO_task4_SED_3 | Mohammed Hafsati | Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France | task-sound-event-detection-in-domestic-environments#Hafsati2021 | 0.91 | 0.287 | 0.502 | ||
Hafsati_TUITO_task4_SED_4 | Mohammed Hafsati | Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France | task-sound-event-detection-in-domestic-environments#Hafsati2021 | 0.91 | 0.287 | 0.502 | ||
Hafsati_TUITO_task4_SED_1 | Mohammed Hafsati | Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France | task-sound-event-detection-in-domestic-environments#Hafsati2021 | 1.03 | 0.334 | 0.549 | ||
Hafsati_TUITO_task4_SED_2 | Mohammed Hafsati | Tuito, Tuito R&D Speech Processing and AI team, La Ciotat, France | task-sound-event-detection-in-domestic-environments#Hafsati2021 | 1.04 | 0.336 | 0.550 | ||
Gong_TAL_task4_SED_3 | Yaguang Gong | Tomorrow Advancing Life Education Group, AI Department, Beijing, China | task-sound-event-detection-in-domestic-environments#Gong2021 | 1.16 | 0.370 | 0.626 | ||
Gong_TAL_task4_SED_2 | Yaguang Gong | Tomorrow Advancing Life Education Group, AI Department, Beijing, China | task-sound-event-detection-in-domestic-environments#Gong2021 | 1.15 | 0.367 | 0.616 | ||
Gong_TAL_task4_SED_1 | Yaguang Gong | Tomorrow Advancing Life Education Group, AI Department, Beijing, China | task-sound-event-detection-in-domestic-environments#Gong2021 | 1.14 | 0.364 | 0.611 | ||
Park_JHU_task4_SED_2 | Sangwook Park | Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-in-domestic-environments#Park2021 | 1.07 | 0.327 | 0.603 | ||
Park_JHU_task4_SED_4 | Sangwook Park | Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-in-domestic-environments#Park2021 | 0.86 | 0.237 | 0.524 | ||
Park_JHU_task4_SED_1 | Sangwook Park | Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-in-domestic-environments#Park2021 | 1.01 | 0.305 | 0.579 | ||
Park_JHU_task4_SED_3 | Sangwook Park | Johns Hopkins University, Department of Electrical and Computer Engineering, Baltimore, MD, USA | task-sound-event-detection-in-domestic-environments#Park2021 | 0.84 | 0.222 | 0.537 | ||
Zheng_USTC_task4_SED_4 | Xu Zheng | University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China | task-sound-event-detection-in-domestic-environments#Zheng2021 | 1.30 | 0.389 | 0.742 | ||
Zheng_USTC_task4_SED_1 | Xu Zheng | University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China | task-sound-event-detection-in-domestic-environments#Zheng2021 | 1.33 | 0.452 | 0.669 | ||
Zheng_USTC_task4_SED_3 | Xu Zheng | University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China | task-sound-event-detection-in-domestic-environments#Zheng2021 | 1.29 | 0.386 | 0.746 | ||
Zheng_USTC_task4_SED_2 | Xu Zheng | University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China | task-sound-event-detection-in-domestic-environments#Zheng2021 | 1.33 | 0.447 | 0.676 | ||
Nam_KAIST_task4_SED_2 | Hyeonuk Nam | Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea | task-sound-event-detection-in-domestic-environments#Nam2021 | 1.19 | 0.399 | 0.609 | ||
Nam_KAIST_task4_SED_1 | Hyeonuk Nam | Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea | task-sound-event-detection-in-domestic-environments#Nam2021 | 1.16 | 0.378 | 0.617 | ||
Nam_KAIST_task4_SED_3 | Hyeonuk Nam | Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea | task-sound-event-detection-in-domestic-environments#Nam2021 | 1.09 | 0.324 | 0.634 | ||
Nam_KAIST_task4_SED_4 | Hyeonuk Nam | Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, Daejeon, South Korea | task-sound-event-detection-in-domestic-environments#Nam2021 | 0.75 | 0.059 | 0.715 | ||
Koo_SGU_task4_SED_2 | Hyejin Koo | Sogang University, Department of Electronic Engineering, Seoul, South Korea | task-sound-event-detection-in-domestic-environments#Koo2021 | 0.12 | 0.044 | 0.059 | ||
Koo_SGU_task4_SED_3 | Hyejin Koo | Sogang University, Department of Electronic Engineering, Seoul, South Korea | task-sound-event-detection-in-domestic-environments#Koo2021 | 0.41 | 0.058 | 0.348 | ||
Koo_SGU_task4_SED_1 | Hyejin Koo | Sogang University, Department of Electronic Engineering, Seoul, South Korea | task-sound-event-detection-in-domestic-environments#Koo2021 | 0.74 | 0.258 | 0.364 | ||
deBenito_AUDIAS_task4_SED_4 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 | 1.10 | 0.361 | 0.577 | ||
deBenito_AUDIAS_task4_SED_1 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 | 1.07 | 0.343 | 0.571 | ||
deBenito_AUDIAS_task4_SED_2 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 | 1.10 | 0.363 | 0.574 | ||
deBenito_AUDIAS_task4_SED_3 | Diego de Benito-Gorron | Universidad Autónoma de Madrid, Escuela Politécnica Superior, Madrid, Spain | task-sound-event-detection-in-domestic-environments#de Benito-Gorron2021 | 1.07 | 0.345 | 0.571 | ||
Baseline_SSep_SED | Nicolas Turpault | Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France | task-sound-event-detection-in-domestic-environments#turpault2020b | 1.11 | 0.364 | 0.580 | ||
Boes_KUL_task4_SED_4 | Wim Boes | KU Leuven, ESAT, Leuven, Belgium | task-sound-event-detection-in-domestic-environments#Boes2021 | 0.60 | 0.117 | 0.457 | ||
Boes_KUL_task4_SED_3 | Wim Boes | KU Leuven, ESAT, Leuven, Belgium | task-sound-event-detection-in-domestic-environments#Boes2021 | 0.68 | 0.121 | 0.531 | ||
Boes_KUL_task4_SED_2 | Wim Boes | KU Leuven, ESAT, Leuven, Belgium | task-sound-event-detection-in-domestic-environments#Boes2021 | 0.77 | 0.233 | 0.440 | ||
Boes_KUL_task4_SED_1 | Wim Boes | KU Leuven, ESAT, Leuven, Belgium | task-sound-event-detection-in-domestic-environments#Boes2021 | 0.81 | 0.253 | 0.442 | ||
Ebbers_UPB_task4_SED_2 | Janek Ebbers | Paderborn University, Department of Communications Engineering, Paderborn, Germany | task-sound-event-detection-in-domestic-environments#Ebbers2021 | 1.10 | 0.335 | 0.621 | ||
Ebbers_UPB_task4_SED_4 | Janek Ebbers | Paderborn University, Department of Communications Engineering, Paderborn, Germany | task-sound-event-detection-in-domestic-environments#Ebbers2021 | 1.16 | 0.363 | 0.637 | ||
Ebbers_UPB_task4_SED_3 | Janek Ebbers | Paderborn University, Department of Communications Engineering, Paderborn, Germany | task-sound-event-detection-in-domestic-environments#Ebbers2021 | 1.24 | 0.416 | 0.635 | ||
Ebbers_UPB_task4_SED_1 | Janek Ebbers | Paderborn University, Department of Communications Engineering, Paderborn, Germany | task-sound-event-detection-in-domestic-environments#Ebbers2021 | 1.16 | 0.373 | 0.621 | ||
Zhu_AIAL-XJU_task4_SED_2 | Xiujuan Zhu | XinJiang University, Key Laboratory of Signal Detection and Processing, Urumqi, China | task-sound-event-detection-in-domestic-environments#Zhu2021 | 0.99 | 0.290 | 0.574 | ||
Zhu_AIAL-XJU_task4_SED_1 | Xiujuan Zhu | XinJiang University, Key Laboratory of Signal Detection and Processing, Urumqi, China | task-sound-event-detection-in-domestic-environments#Zhu2021 | 1.04 | 0.318 | 0.583 | ||
Liu_BUPT_task4_4 | Gang Liu | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu2021 | 0.37 | 0.102 | 0.231 | ||
Liu_BUPT_task4_1 | Gang Liu | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu2021 | 0.30 | 0.090 | 0.169 | ||
Liu_BUPT_task4_2 | Gang Liu | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu2021 | 0.54 | 0.152 | 0.322 | ||
Liu_BUPT_task4_3 | Gang Liu | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu2021 | 0.24 | 0.068 | 0.146 | ||
Olvera_INRIA_task4_SED_2 | Michel Olvera | Inria Nancy Grand-Est, Department of Information and Communication Sciences and Technologies, Nancy, France | task-sound-event-detection-in-domestic-environments#Olvera2021 | 0.98 | 0.338 | 0.481 | ||
Olvera_INRIA_task4_SED_1 | Michel Olvera | Inria Nancy Grand-Est, Department of Information and Communication Sciences and Technologies, Nancy, France | task-sound-event-detection-in-domestic-environments#Olvera2021 | 0.95 | 0.332 | 0.462 | ||
Kim_AiTeR_GIST_SED_4 | Nam Kyun Kim | Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea | task-sound-event-detection-in-domestic-environments#Kim2021 | 1.32 | 0.442 | 0.674 | ||
Kim_AiTeR_GIST_SED_2 | Nam Kyun Kim | Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea | task-sound-event-detection-in-domestic-environments#Kim2021 | 1.31 | 0.439 | 0.667 | ||
Kim_AiTeR_GIST_SED_3 | Nam Kyun Kim | Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea | task-sound-event-detection-in-domestic-environments#Kim2021 | 1.30 | 0.434 | 0.669 | ||
Kim_AiTeR_GIST_SED_1 | Nam Kyun Kim | Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, South Korea | task-sound-event-detection-in-domestic-environments#Kim2021 | 1.29 | 0.431 | 0.661 | ||
Cai_SMALLRICE_task4_SED_1 | Heinrich Dinkel | Xiaomi Corporation, Technology Committee, Beijing, China | task-sound-event-detection-in-domestic-environments#Dinkel2021 | 1.11 | 0.361 | 0.584 | ||
Cai_SMALLRICE_task4_SED_2 | Heinrich Dinkel | Xiaomi Corporation, Technology Committee, Beijing, China | task-sound-event-detection-in-domestic-environments#Dinkel2021 | 1.13 | 0.373 | 0.585 | ||
Cai_SMALLRICE_task4_SED_3 | Heinrich Dinkel | Xiaomi Corporation, Technology Committee, Beijing, China | task-sound-event-detection-in-domestic-environments#Dinkel2021 | 1.13 | 0.370 | 0.596 | ||
Cai_SMALLRICE_task4_SED_4 | Heinrich Dinkel | Xiaomi Corporation, Technology Committee, Beijing, China | task-sound-event-detection-in-domestic-environments#Dinkel2021 | 1.00 | 0.339 | 0.504 | ||
HangYuChen_Roal_task4_SED_2 | Chen HangYu | task-sound-event-detection-in-domestic-environments#HangYu2021 | 0.90 | 0.294 | 0.473 | |||
HangYuChen_Roal_task4_SED_1 | Chen YuHang | task-sound-event-detection-in-domestic-environments#YuHang2021 | 0.61 | 0.098 | 0.496 | |||
Yu_NCUT_task4_SED_1 | Dongchi Yu | North China University of Technology, Department of Electronic Information Engineering, Beijing, China | task-sound-event-detection-in-domestic-environments#Yu2021 | 0.20 | 0.038 | 0.157 | ||
Yu_NCUT_task4_SED_2 | Dongchi Yu | North China University of Technology, Department of Electronic Information Engineering, Beijing, China | task-sound-event-detection-in-domestic-environments#Yu2021 | 0.92 | 0.301 | 0.485 | ||
lu_kwai_task4_SED_1 | Rui Lu | Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China | task-sound-event-detection-in-domestic-environments#Lu2021 | 1.27 | 0.419 | 0.660 | ||
lu_kwai_task4_SED_4 | Rui Lu | Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China | task-sound-event-detection-in-domestic-environments#Lu2021 | 0.88 | 0.157 | 0.685 | ||
lu_kwai_task4_SED_3 | Rui Lu | Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China | task-sound-event-detection-in-domestic-environments#Lu2021 | 0.86 | 0.148 | 0.686 | ||
lu_kwai_task4_SED_2 | Rui Lu | Beijing Kuaishou Technology Co., Ltd, AI-Platform, Beijing, China | task-sound-event-detection-in-domestic-environments#Lu2021 | 1.25 | 0.412 | 0.651 | ||
Liu_BUPT_task4_SS_SED_2 | Gang Liu_SS | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu_SS2021 | 0.94 | 0.302 | 0.507 | ||
Liu_BUPT_task4_SS_SED_1 | Gang Liu_SS | Beijing University of Posts and Telecommunications, Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China | task-sound-event-detection-in-domestic-environments#Liu_SS2021 | 0.94 | 0.302 | 0.507 | ||
Tian_ICT-TOSHIBA_task4_SED_2 | Gangyi Tian | Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-in-domestic-environments#Tian2021 | 1.19 | 0.411 | 0.585 | ||
Tian_ICT-TOSHIBA_task4_SED_1 | Gangyi Tian | Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-in-domestic-environments#Tian2021 | 1.19 | 0.413 | 0.586 | ||
Tian_ICT-TOSHIBA_task4_SED_4 | Gangyi Tian | Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-in-domestic-environments#Tian2021 | 1.19 | 0.412 | 0.586 | ||
Tian_ICT-TOSHIBA_task4_SED_3 | Gangyi Tian | Institute of Computing Technology, Chinese Academy of Sciences, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China | task-sound-event-detection-in-domestic-environments#Tian2021 | 1.18 | 0.409 | 0.584 | ||
Yao_GUET_task4_SED_3 | Yu Yao | Guilin University of Electronic Technology, School of Information and Communication, Guilin, China | task-sound-event-detection-in-domestic-environments#Yao2021 | 0.88 | 0.279 | 0.479 | ||
Yao_GUET_task4_SED_1 | Yu Yao | Guilin University of Electronic Technology, School of Information and Communication, Guilin, China | task-sound-event-detection-in-domestic-environments#Yao2021 | 0.88 | 0.277 | 0.482 | ||
Yao_GUET_task4_SED_2 | Yu Yao | Guilin University of Electronic Technology, School of Information and Communication, Guilin, China | task-sound-event-detection-in-domestic-environments#Yao2021 | 0.54 | 0.056 | 0.496 | ||
Liang_SHNU_task4_SED_4 | Yunhao Liang | Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China | task-sound-event-detection-in-domestic-environments#Liang2021 | 0.99 | 0.313 | 0.543 | ||
Bajzik_UNIZA_task4_SED_2 | Jakub Bajzik | University of Zilina, Department of Mechatronics and Electronics, Zilina, Slovak Republic | task-sound-event-detection-in-domestic-environments#Bajzik2021 | 1.02 | 0.330 | 0.544 | ||
Bajzik_UNIZA_task4_SED_1 | Jakub Bajzik | University of Zilina, Department of Mechatronics and Electronics, Zilina, Slovak Republic | task-sound-event-detection-in-domestic-environments#Bajzik2021 | 0.45 | 0.133 | 0.266 | ||
Liang_SHNU_task4_SSep_SED_3 | Yunhao Liang_SS | Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China | task-sound-event-detection-in-domestic-environments#Liang_SS2021 | 0.99 | 0.304 | 0.559 | ||
Liang_SHNU_task4_SSep_SED_1 | Yunhao Liang_SS | Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China | task-sound-event-detection-in-domestic-environments#Liang_SS2021 | 1.03 | 0.313 | 0.588 | ||
Liang_SHNU_task4_SSep_SED_2 | Yunhao Liang_SS | Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China | task-sound-event-detection-in-domestic-environments#Liang_SS2021 | 1.01 | 0.325 | 0.542 | ||
Baseline_SED | Nicolas Turpault | Inria Nancy Grand-Est, Department of Natural Language Processing & Knowledge Discovery, Nancy, France | task-sound-event-detection-in-domestic-environments#Turpault2020a | 1.00 | 0.315 | 0.547 | ||
Wang_NSYSU_task4_SED_1 | Yih Wen Wang | National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan | task-sound-event-detection-in-domestic-environments#Wang2021 | 1.13 | 0.336 | 0.646 | ||
Wang_NSYSU_task4_SED_4 | Yih Wen Wang | National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan | task-sound-event-detection-in-domestic-environments#Wang2021 | 1.09 | 0.304 | 0.662 | ||
Wang_NSYSU_task4_SED_2 | Yih Wen Wang | National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan | task-sound-event-detection-in-domestic-environments#Wang2021 | 0.69 | 0.070 | 0.636 | ||
Wang_NSYSU_task4_SED_3 | Yih Wen Wang | National Sun Yat-sen University, Department of Computer Science and Engineering, Kaohsiung, Taiwan | task-sound-event-detection-in-domestic-environments#Wang2021 | 1.13 | 0.339 | 0.649 |
Complete results and technical reports can be found on the results page.
Baseline systems
Sound event detection baseline
System description
The baseline model is an improved version of the [DCASE 2020 baseline][dcase2020-baseline]. It is a mean-teacher model, in which a student network is trained and a teacher network is maintained as an exponential moving average of the student weights.
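As an illustration, a minimal sketch of the mean-teacher weight update is shown below, assuming a PyTorch student/teacher pair; the function name and decay value are illustrative assumptions, not the baseline's exact settings.

```python
# Minimal sketch of a mean-teacher update (hypothetical names, not the baseline code).
import torch

def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, ema_decay: float = 0.999):
    """Keep the teacher weights as an exponential moving average of the student weights."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            # teacher = decay * teacher + (1 - decay) * student
            t_param.data.mul_(ema_decay).add_(s_param.data, alpha=1 - ema_decay)

# Typical usage: the teacher starts as a deep copy of the student
# (e.g. teacher = copy.deepcopy(student)) and is updated after every optimiser step.
```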
The main differences of the baseline system (without source separation) compared to the DCASE 2020 baseline are:
- Features: hop size of 256 instead of 255.
- Different synthetic dataset is used.
- No early stopping; training runs for 200 epochs and the best model on the validation set is kept.
- Per-instance normalisation using a min-max approach.
- Mixup is applied to the weak and synthetic data by mixing samples within a batch, with a 50% chance of applying it (see the sketch after this list).
- Batch size of 48 (still 1/4 synthetic, 1/4 weak, 1/2 unlabelled)
- Intersection-based F1 instead of event-based F1 for the synthetic validation score
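As a rough illustration of the batch-level mixup item above, the sketch below mixes each example with a randomly permuted example from the same batch; the Beta parameter and the names are assumptions, not the baseline's exact implementation.

```python
# Hedged sketch of batch-level mixup with a 50% chance of being applied.
import torch

def mixup_batch(features: torch.Tensor, targets: torch.Tensor,
                alpha: float = 0.2, prob: float = 0.5):
    """Mix each example with a randomly permuted example from the same batch."""
    if torch.rand(1).item() > prob:
        return features, targets  # mixup applied only ~50% of the time
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(features.size(0))
    mixed_features = lam * features + (1 - lam) * features[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]
    return mixed_features, mixed_targets
```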
Results for the development dataset
  | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
---|---|---|---|---|
Dev-test | 0.342 | 0.527 | 76.6% | 40.1% |
Collar-based = event-based. The intersection-based F1 is computed using (dtc=gtc=0.5, cttc=0.3) and the event-based F1 is computed using collars (onset=200 ms, offset=max(200 ms, 20% of the event duration)).
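To make the two criteria concrete, here is a simplified sketch for a single (detection, ground-truth) pair; the official metrics (psds_eval, sed_eval) additionally handle cross-triggers and one-to-many overlaps, so this is illustrative only.

```python
# Simplified matching criteria for one detection against one ground-truth event
# (times in seconds; assumes non-zero event durations).

def collar_match(det_on, det_off, gt_on, gt_off,
                 onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """Collar-based (event-based): onset within 200 ms, offset within
    max(200 ms, 20% of the ground-truth event duration)."""
    offset_tol = max(offset_collar, offset_ratio * (gt_off - gt_on))
    return abs(det_on - gt_on) <= onset_collar and abs(det_off - gt_off) <= offset_tol

def intersection_match(det_on, det_off, gt_on, gt_off, dtc=0.5, gtc=0.5):
    """Intersection-based: the overlap must cover at least dtc of the detection
    and gtc of the ground-truth event."""
    overlap = max(0.0, min(det_off, gt_off) - max(det_on, gt_on))
    return overlap / (det_off - det_on) >= dtc and overlap / (gt_off - gt_on) >= gtc
```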
Note: the performance might not be exactly reproducible on a GPU-based system. For this reason, you can download the checkpoint of the network along with the TensorBoard event files.
To test this model, launch:
python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt
Repositories
Sound separation and sound event detection
System description
The Sound separation and sound event detection baseline uses the pre-trained SED model together with a pre-trained sound separation model.
The SED model is fine-tuned on separated sound events obtained by pre-processing the data with the pre-trained sound separation model.
The sound separation model is based on TDCN++ and is trained in an unsupervised way with MixIT on the YFCC100m dataset.
Predictions are obtained by ensembling the fine-tuned SED model with the original SED model. Ensembling is performed as a weighted average of the predictions of the two models, where the weight is learned during training.
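A minimal sketch of such a learned-weight ensemble is shown below, assuming PyTorch modules; the class and attribute names are illustrative and do not come from the baseline code.

```python
# Hedged sketch: weighted average of two SED models with a learnable weight.
import torch
import torch.nn as nn

class WeightedEnsemble(nn.Module):
    def __init__(self, original_sed: nn.Module, finetuned_sed: nn.Module):
        super().__init__()
        self.original_sed = original_sed    # SED model trained on the mixtures
        self.finetuned_sed = finetuned_sed  # SED model fine-tuned on separated audio
        self.logit_w = nn.Parameter(torch.zeros(1))  # learned mixing weight (0.5 after sigmoid at init)

    def forward(self, mixture_feats, separated_feats):
        w = torch.sigmoid(self.logit_w)  # keep the weight in [0, 1]
        return w * self.finetuned_sed(separated_feats) + (1 - w) * self.original_sed(mixture_feats)
```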
Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. Universal sound separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 175–179. October 2019. URL: https://arxiv.org/abs/1905.03330.
Running the baseline
You can run the SSEP + SED baseline from scratch by first downloading the pre-trained universal sound separation model trained on YFCC100m [3], following the instructions in the Sound separation baseline repository, using the Google Cloud SDK (see its installation instructions):
gsutil -m cp -r gs://gresearch/sound_separation/yfcc100m_mixit_model_checkpoint .
You also need the pre-trained SED system from the SED baseline: you can either train your own or use the provided pre-trained system.
Be sure to check that, in the configuration YAML file ./confs/sep+sed.yaml, the paths to the SED checkpoint and YAML file and to the pre-trained sound separation model are set correctly.
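One quick way to verify this is a small check like the sketch below; the key names are hypothetical and the actual keys are defined in the baseline repository's configuration file.

```python
# Hedged sketch: check that the paths set in the SSEP+SED configuration exist.
import os
import yaml  # PyYAML

with open("./confs/sep+sed.yaml") as f:
    conf = yaml.safe_load(f)

# "sed_checkpoint", "sed_yaml" and "separation_model" are hypothetical key names.
for key in ("sed_checkpoint", "sed_yaml", "separation_model"):
    path = conf.get(key)
    assert path and os.path.exists(path), f"Set a valid path for '{key}' in sep+sed.yaml"
```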
First, sound separation is applied to the data using:
python run_separation.py
The SED model is then fine-tuned on the separated data using:
python finetune_on_separated.py
We also provide a pre-trained checkpoint for this model along with TensorBoard logs.
You can test it on the real-world validation data using:
python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt
Check the TensorBoard logs using tensorboard --logdir="path/to/exp_folder"
Results for the development dataset
  | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
---|---|---|---|---|
Dev-test | 0.373 | 0.549 | 77.2% | 44.3% |
The results are from the "teacher" predictions (which are better than the student model's predictions; note that this is the only choice cherry-picked on the dev-test set).
Repositories
Citation
If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following papers:
Task and datasets
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data
Scott Wisdom, Hakan Erdogan, Daniel P. W. Ellis, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Justin Salamon, Prem Seetharaman, and John R. Hershey. What's all the fuss about free universal sound separation data? In preparation, 2020.