# Sound Event Detection in Domestic Environments

### Coordinators

Romain Serizel, Francesca Ronchini, Nicolas Turpault, Samuele Cornell, Eduardo Fonseca, Daniel P. W. Ellis

The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled (with time stamps).

The challenge has ended. Full results for this task can be found on the results page.

# Description

This task is the follow-up to DCASE 2021 Task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time localization given that multiple events can be present in an audio recording (see also Fig 1).

## Novelties for 2022 edition

• This year there is no separate track for sound separation
• The main focus of this edition is to evaluate the impact of external data. To this end:

## Scientific questions

This task highlights a number of specific research questions:

• What strategies work well when training a sound event detection system with a heterogeneous dataset, including:
    • A large amount of unbalanced and unlabeled training data
    • A small weakly annotated set
    • A synthetic set from isolated sound events and backgrounds
• What is the impact of using embeddings extracted from pre-trained models?
• What are the potential advantages of using external data?
• Can we define adapted sound event detection metrics to better analyse specific scientific problems?

# Audio dataset

This task is primarily based on the DESED dataset, which has been used since DCASE 2020 Task 4. DESED is composed of 10-second audio clips recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all the classes in DESED correspond to classes in AudioSet; for example, some classes in DESED group several classes from AudioSet).

• Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation.
• The weak annotations have been verified manually for a small subset of the training set.
• Another subset of the development set has been annotated manually with strong annotations, to be used as the validation set (see also below for a detailed explanation about the development set).
Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

#### Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

##### Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

##### Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.

#### Sound event detection in synthetic domestic environments


##### Keywords

Semi-supervised learning ; Weakly labeled data ; Synthetic data ; Sound event detection

## Reference labels

AudioSet provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation.

### Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated CSV file in the following format:

[filename (string)][tab][event_labels (strings)]

For example: Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
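As an illustration of the format above, here is a minimal Python sketch that reads weak annotations into a dictionary. The function name `parse_weak_annotations` is hypothetical, and the optional `filename` header row is an assumption about how the file may start; neither is taken from the official tooling.

```python
import csv
import io


def parse_weak_annotations(tsv_text):
    """Map each filename to the list of event labels present in the clip."""
    labels_by_file = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and an optional header row (assumed, not guaranteed).
        if len(row) < 2 or row[0] == "filename":
            continue
        filename, event_labels = row[0], row[1]
        # Labels of one clip are comma-separated inside the second column.
        labels_by_file[filename] = [lab for lab in event_labels.split(",") if lab]
    return labels_by_file


example = (
    "filename\tevent_labels\n"
    "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing,Dog\n"
)
weak = parse_weak_annotations(example)
print(weak)
```

A weak label only states that a class occurs somewhere in the clip, so each entry is a plain list of class names without timestamps.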

### Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the validation set (see also below for a detailed explanation about the development set).

The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50K was verified by humans to check that the event class present in the FSD50K annotation was indeed dominant in the audio clip.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

#### FSD50K: an Open Dataset of Human-Labeled Sound Events

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.

#### Freesound technical demo

##### Abstract

Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.

In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events have been merged into a single event.
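The merging rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the annotation tooling that was actually used; the function name `merge_close_events` and the tuple layout `(onset, offset, label)` are assumptions.

```python
def merge_close_events(events, min_gap=0.150):
    """Merge same-class events separated by less than `min_gap` seconds.

    `events` is a list of (onset, offset, label) tuples, in seconds.
    """
    merged = []
    last_index = {}  # label -> position in `merged` of that class's latest event
    for onset, offset, label in sorted(events):
        idx = last_index.get(label)
        if idx is not None and onset - merged[idx][1] < min_gap:
            # Pause is too short: extend the previous event of the same class.
            prev_onset, prev_offset, _ = merged[idx]
            merged[idx] = (prev_onset, max(prev_offset, offset), label)
        else:
            last_index[label] = len(merged)
            merged.append((onset, offset, label))
    return merged


events = [
    (0.0, 1.0, "Dog"),
    (1.1, 2.0, "Dog"),      # 100 ms after the previous Dog event: merged
    (2.5, 3.0, "Dog"),      # 500 ms after the merged Dog event: kept separate
    (1.05, 1.5, "Speech"),  # different class: never merged with Dog
]
print(merge_close_events(events))
```

Events of different classes are tracked independently, so overlapping events from distinct classes are left untouched.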

The strong annotations are provided in a tab-separated CSV file in the following format:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
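A strong annotation file can be read along the following lines. This is a hedged sketch: `parse_strong_annotations` is a hypothetical helper, and skipping a header row that starts with `filename` is an assumption about the file layout.

```python
import csv
import io
from collections import defaultdict


def parse_strong_annotations(tsv_text):
    """Map each filename to a list of (onset, offset, event_label) tuples."""
    events_by_file = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines and an optional header row (assumed, not guaranteed).
        if len(row) < 4 or row[0] == "filename":
            continue
        filename, onset, offset, label = row[:4]
        events_by_file[filename].append((float(onset), float(offset), label))
    return dict(events_by_file)


example = "YOTsn73eqbfc_10.000_20.000.wav\t0.163\t0.665\tAlarm_bell_ringing\n"
strong = parse_strong_annotations(example)
print(strong)
```

Unlike the weak format, each row describes a single event, so a clip with several events appears on several rows.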

The dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the Task 4 data generation script.

To access the datasets separately, please refer to last year's instructions.

The challenge consists of detecting sound events within web videos using training data from real recordings (both weakly labeled and unlabeled) and synthetic audio clips, which are strongly labeled. The detection within a 10-second clip should be performed with start and end timestamps. Note that a 10-second clip may contain more than one sound event.

This year, participants are allowed to use external datasets and embeddings extracted from pre-trained models. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until May 15th (as long as the corresponding resources are publicly available).

Note also that each participant should submit at least one system that is not using external data.

## Development set

We provide 3 different splits of the training data in our development set: "Weakly labeled training set", "Unlabeled in domain training set" and "Synthetic strongly labeled set" with strong annotations.

Weakly labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on AudioSet annotations) is close to the distribution in the labeled set. Note however that given the uncertainty on AudioSet labels this distribution might not be exactly similar.

Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to the one of the validation set.

• We used all the foreground files from the DESED synthetic soundbank (multiple times).
• We used background files annotated as "other" from the subpart of the SINS dataset and files from the [TUT Acoustic scenes 2017, development dataset](https://zenodo.org/record/400515).
• We used the clips from [FUSS](https://zenodo.org/record/4012661) containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training sets of both FSD50K and FUSS. We provide tsv files corresponding to these splits.
• Event distribution statistics for both target and non-target event classes are computed on annotations obtained by humans for [~90k clips from AudioSet](https://research.google.com/audioset/download_strong.html).
Publication

Shawn Hershey, Daniel PW Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 366–370. IEEE, 2021.

#### The benefit of temporally-strong labels in audio event classification

We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See the DESED GitHub repository and the Scaper documentation for more information about how to create new soundscapes.

### Sound event detection validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators).

## Evaluation set

The sound event detection evaluation dataset is composed of 10-second and 5-minute audio clips.

• A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes. This subset includes the public evaluation dataset.

• A second subset is composed of synthetic clips generated with Scaper. This subset is used for analysis purposes.

# External data resources

## Pre-trained models

• Pre-trained models using AudioSet:
• Supervised:
• Self-supervised:
• AST
• PaSST

## Allowed Datasets

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

• Participants are allowed to submit up to 4 different systems.
• Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
• Data from other tasks is considered external data.
• Embeddings extracted from models pre-trained on external data are considered external data.
• Another example of external data is other materials related to the video such as the rest of audio from where the 10-sec clip was extracted, the video frames and metadata.
• Manipulation of provided training data is allowed.
• Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.
• Submit at least one system that is not using external data.

# Submission

Instructions regarding the output submission format and the required metadata can be found in the example submission package.

Each participant is expected to submit at least one system without external data. Participants using external data or pre-trained models should make sure to fill in the corresponding fields in the yaml file (lines 109 and 112).

Please make sure to report the energy consumption in the yaml file (lines 102 and 103).

## Package validator

Before submission, please check that your submission package is correct using the validation script enclosed in the submission package: python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2021/task4_test

# Evaluation

All submissions will be evaluated with polyphonic sound detection scores (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). This metric is based on the intersection between events. PSDS values are computed using 50 operating points (linearly distributed from 0.01 to 0.99). To better understand the behavior of each submission in different scenarios, we propose a metric that evaluates the submissions on two different scenarios that emphasize different system properties.
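The 50 operating points can be generated as a simple linear grid of decision thresholds. The sketch below uses only the standard library; `operating_points` and `binarize` are hypothetical helper names, and the toy frame scores are made up for illustration.

```python
def operating_points(n_points=50, low=0.01, high=0.99):
    """Return `n_points` thresholds linearly spaced between `low` and `high`."""
    step = (high - low) / (n_points - 1)
    return [low + i * step for i in range(n_points)]


def binarize(frame_scores, threshold):
    """Mark a frame as active when its score exceeds the threshold."""
    return [score > threshold for score in frame_scores]


thresholds = operating_points()
scores = [0.05, 0.40, 0.80, 0.95]  # toy per-frame scores for one class
active_frames_per_threshold = [sum(binarize(scores, t)) for t in thresholds]
print(len(thresholds), active_frames_per_threshold[0], active_frames_per_threshold[-1])
```

Each threshold yields one set of detections, so a system is effectively evaluated as a family of 50 detectors rather than at a single operating point.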

## Scenario 1

The system needs to react quickly upon an event detection (e.g. to trigger an alarm or adapt a home automation system). The localization of the sound event is therefore very important. The PSDS parameters reflecting these needs are:

• Detection Tolerance criterion (DTC): 0.7
• Ground Truth intersection criterion (GTC): 0.7
• Cost of instability across class ($$\alpha_{ST}$$): 1
• Cost of CTs on user experience ($$\alpha_{CT}$$): 0
• Maximum False Positive rate (e_max): 100

## Scenario 2

The system must avoid confusion between classes, but the reaction time is less crucial than in the first scenario. The PSDS parameters reflecting these needs are:

• Detection Tolerance criterion (DTC): 0.1
• Ground Truth intersection criterion (GTC): 0.1
• Cost of instability across class ($$\alpha_{ST}$$): 1
• Cross-Trigger Tolerance criterion (cttc): 0.3
• Cost of CTs on user experience ($$\alpha_{CT}$$): 0.5
• Maximum False Positive rate (e_max): 100
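For reference, the two parameter sets can be written down as plain configuration objects. The `PSDSScenario` class and its field names are illustrative assumptions and do not mirror the psds_eval API; no CTTC value is listed for scenario 1, so it is left unset here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PSDSScenario:
    dtc: float                    # Detection Tolerance criterion
    gtc: float                    # Ground Truth intersection criterion
    alpha_st: float               # cost of instability across classes
    alpha_ct: float               # cost of cross-triggers (CTs) on user experience
    e_max: float                  # maximum false positive rate
    cttc: Optional[float] = None  # Cross-Trigger Tolerance criterion, if any


# Scenario 1: fast reaction, precise temporal localization.
SCENARIO_1 = PSDSScenario(dtc=0.7, gtc=0.7, alpha_st=1, alpha_ct=0, e_max=100)

# Scenario 2: looser localization, but confusion between classes is penalized.
SCENARIO_2 = PSDSScenario(
    dtc=0.1, gtc=0.1, alpha_st=1, alpha_ct=0.5, e_max=100, cttc=0.3
)

print(SCENARIO_1)
print(SCENARIO_2)
```

The contrast between the two is visible at a glance: scenario 1 demands high overlap (0.7) and ignores cross-triggers, while scenario 2 accepts low overlap (0.1) and assigns a cost of 0.5 to cross-triggers.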

The official ranking will be a team-wise ranking, not a system-wise ranking. The ranking criterion will be the aggregation of PSDS-scenario1 and PSDS-scenario2. Each separate metric considered in the final ranking criterion will be the best separate metric among all of a team's submissions (PSDS-scenario1 and PSDS-scenario2 can be obtained by two different systems from the same team, see also Fig 1). This setup is chosen to favor experiments on system behavior and adaptation to different metrics depending on the targeted scenario.

$$\mathrm{Ranking\ Score} = \overline{\mathrm{PSDS}_1} + \overline{\mathrm{PSDS}_2}$$

with $$\overline{\mathrm{PSDS}_1}$$ and $$\overline{\mathrm{PSDS}_2}$$ the PSDS on scenario 1 and 2 normalized by the baseline PSDS on these scenarios, respectively.
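A small worked example of the ranking criterion, using the evaluation-set baseline values from the Results section of this page (PSDS 1 = 0.315, PSDS 2 = 0.543) as normalizers. The function names and the example team are hypothetical. Note that the ranking scores listed in the Results table (1.00 for the baseline) appear to correspond to the sum divided by two; that reading is an assumption based on the arithmetic, not a statement from the organizers.

```python
BASELINE = {"psds1": 0.315, "psds2": 0.543}  # baseline row of the Results table


def ranking_score(psds1, psds2, baseline=BASELINE):
    """Sum of the two PSDS values, each normalized by the baseline PSDS."""
    return psds1 / baseline["psds1"] + psds2 / baseline["psds2"]


def team_ranking_score(systems):
    """Team-wise score: the best PSDS 1 and best PSDS 2 may come from different systems."""
    best_psds1 = max(s["psds1"] for s in systems)
    best_psds2 = max(s["psds2"] for s in systems)
    return ranking_score(best_psds1, best_psds2)


# "Baseline (AudioSet)" row of the Results table.
score_sum = ranking_score(0.345, 0.540)
score_mean = score_sum / 2  # matches the 1.04 shown in the table after rounding

# Hypothetical team with two systems, each tuned for one scenario.
team = [{"psds1": 0.40, "psds2": 0.55}, {"psds1": 0.30, "psds2": 0.70}]
print(round(score_sum, 3), round(score_mean, 2), round(team_ranking_score(team), 3))
```

Because the best value of each metric is taken independently, a team is free to submit one system tuned for scenario 1 and another tuned for scenario 2.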

## Contrastive metric (collar-based F1-score)

Additionally, event-based measures with a 200 ms collar on onsets and a collar on offsets of 200 ms or 20% of the event's length (whichever is larger) will be provided as a contrastive measure. Systems will be evaluated with a threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
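The collar rule can be illustrated with a simplified matching test between one reference event and one estimated event. This is a sketch of the criterion only, not the sed_eval implementation; the helper names and the `(onset, offset, label)` tuple layout are assumptions.

```python
def offset_collar(ref_onset, ref_offset, collar=0.200, percentage=0.20):
    """Offset tolerance: the larger of a fixed collar and a share of the event length."""
    return max(collar, percentage * (ref_offset - ref_onset))


def is_match(ref, est, collar=0.200, percentage=0.20):
    """Check whether an estimated event matches a reference event of the same class."""
    ref_onset, ref_offset, ref_label = ref
    est_onset, est_offset, est_label = est
    return (
        ref_label == est_label
        and abs(est_onset - ref_onset) <= collar
        and abs(est_offset - ref_offset)
        <= offset_collar(ref_onset, ref_offset, collar, percentage)
    )


reference = (1.0, 4.0, "Dog")  # 3 s event, so the offset collar is about 0.6 s
print(is_match(reference, (1.1, 4.5, "Dog")))  # onset and offset within tolerance
print(is_match(reference, (1.3, 4.0, "Dog")))  # onset is 300 ms late
```

For short events the fixed 200 ms collar dominates, while long events get a proportionally wider tolerance on the offset.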

Evaluation is done using the sed_eval and psds_eval toolboxes.

## Evaluation of the environmental impact

The computational cost can have an important ecological impact. An environmental metric is suggested to raise awareness around this subject. While this metric won't be used in the ranking system, it can be an important aspect during the award attribution. Energy consumption is computed using CodeCarbon (a running example is provided in the baseline).

In order to account for potential hardware differences, participants can normalize the energy consumption measured for their algorithm by the energy consumption measured for the baseline (on their hardware).

Besides the energy consumption figure, we propose a metric highlighting the PSDS improvement compared to the baseline performance, weighted by the increase in energy consumption:

$$\text{EW-PSDS} = \mathrm{PSDS} \times \frac{\mathrm{kWh}_{\mathrm{baseline}}}{\mathrm{kWh}_{\mathrm{submission}}}$$
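As a worked example, the formula can be applied to the development-set figures given for the baselines in the Baseline system section of this page. The function name `ew_psds` is hypothetical. The published EW-PSDS values appear to be reproduced when the dev-test (inference) energy is used, which is an observation from the arithmetic rather than a documented rule.

```python
def ew_psds(psds, kwh_baseline, kwh_submission):
    """Energy-weighted PSDS: PSDS scaled by baseline energy over submission energy."""
    return psds * kwh_baseline / kwh_submission


BASELINE_DEV_TEST_KWH = 0.030  # dev-test energy of the baseline

# Baseline (AudioSet strong): PSDS-scenario1 = 0.351, dev-test energy = 0.027 kWh
audioset_ew1 = ew_psds(0.351, BASELINE_DEV_TEST_KWH, 0.027)

# Baseline (AST): PSDS-scenario1 = 0.313, dev-test energy = 0.037 kWh
ast_ew1 = ew_psds(0.313, BASELINE_DEV_TEST_KWH, 0.037)

print(round(audioset_ew1, 3), round(ast_ew1, 3))
```

A system that uses less energy than the baseline sees its score scaled up, and one that uses more sees it scaled down, so the baseline's EW-PSDS equals its plain PSDS.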

## Propose your own PSDS setup

The choice of a relevant metric for SED depending on the target application remains an open subject in the community. In order to foster discussion on that topic, we will exploit the tuning possibilities offered by the PSDS. We encourage participants to suggest their own PSDS setup, together with a short text motivating the choices and the system aspects that their settings are supposed to highlight. The submitted systems will be evaluated with these contrastive metrics, to be presented in a separate table together with the metric motivation.

# Results

| Rank | Submission code (PSDS 1) | Submission code (PSDS 2) | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) |
|---|---|---|---|---|---|
| | Baseline | Baseline | 1.00 | 0.315 | 0.543 |
| | Baseline (AudioSet) | Baseline (AudioSet) | 1.04 | 0.345 | 0.540 |

Complete results and technical reports can be found on the results page.

# Baseline system

### System description

The baseline model is the same as in DCASE 2021 Task 4. The model is a mean-teacher model. The 2022 recipe includes a version of the baseline trained on DESED and strongly annotated clips from AudioSet.

### Results for the development dataset

| | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
|---|---|---|---|---|
| Baseline | 0.336 | 0.536 | 64.1% | 40.1% |
| Baseline (AudioSet strong) | 0.351 | 0.552 | 64.3% | 42.9% |
| Baseline (AST) | 0.313 | 0.722 | 90.0% | 37.2% |

Collar-based = event-based. Intersection-based F1 is computed using (dtc=gtc=0.5, cttc=0.3) and event-based F1 is computed using collars (onset=200ms, offset=max(200ms, 20% event-duration)).

Note: The performance might not be exactly reproducible on a GPU-based system. For this reason, you can download the checkpoint of the network along with the TensorBoard events. Launch python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt to test this model.

### Energy consumption during the training and evaluation phase

Energy consumption for 1 run on an NVIDIA A100 40GB for a training phase and an inference phase on the development set.

| | Training (kWh) | Dev-test (kWh) | EW-PSDS-scenario1 | EW-PSDS-scenario2 |
|---|---|---|---|---|
| Baseline | 1.717 | 0.030 | 0.336 | 0.536 |
| Baseline (AudioSet strong) | 2.418 | 0.027 | 0.390 | 0.613 |
| Baseline (AST) | | 0.037 | 0.254 | 0.585 |

# Citation

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Publication

Francesca Ronchini, Romain Serizel, Nicolas Turpault, and Samuele Cornell. The impact of non-target events in synthetic soundscapes for sound event detection. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 115–119. Barcelona, Spain, November 2021.

#### The Impact of Non-Target Events in Synthetic Soundscapes for Sound Event Detection

##### Abstract

Detection and Classification Acoustic Scene and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect the performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events alternatively during the training or validation phase (or none of them) helps the system to correctly detect target events. Secondly, we analyze to what extend adjusting the signal-to-noise ratio between target and non-target events at training improves the sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on non-target subset and acoustic similarity between target and non-target events which might confuse the system.