Sound Event Detection with Weak Labels and Synthetic Soundscapes


Task description

The goal of the task is to evaluate systems for the detection of sound events using real data that is either weakly labeled or unlabeled, together with simulated data that is strongly labeled (with timestamps).

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task is the follow-up to DCASE 2022 Task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see also Fig 1).

Figure 1: Overview of a sound event detection system.


Novelties for 2023 edition

  • We will evaluate the submissions using a threshold-independent implementation of the PSDS.
  • For each submitted system, we ask you to submit the output scores from three independent model trainings with different initializations, so that the standard deviation of the model performance can be evaluated.
  • Reporting the energy consumption is mandatory.
  • In order to account for potential hardware differences, participants have to report the energy consumption measured while training the baseline for 10 epochs (on their hardware).
  • We are introducing a new metric, complementary to the energy consumption metric: multiply-accumulate operations (MACs) for 10 seconds of audio prediction.
  • We will experiment with post-processing-invariant evaluation.
  • We require participants to submit at least one system without ensembling.
  • We propose a new baseline using BEATs embeddings.

Scientific questions

This task highlights a number of specific research questions:

  • What strategies work well when training a sound event detection system with a heterogeneous dataset, including:
    • A large amount of unbalanced and unlabeled training data
    • A small weakly annotated set
    • A synthetic set from isolated sound events and backgrounds
  • What is the impact of using embeddings extracted from pre-trained models?
  • What are the potential advantages of using external data?
  • What is the impact of model complexity/energy consumption on the performance?
  • What is the impact of the temporal post-processing on the performance?
  • Can we find more robust ways to evaluate systems (and take training variability into account)?

Audio dataset

This task is primarily based on the DESED dataset, which has been used since DCASE 2020 Task 4. DESED is composed of 10-second audio clips recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all the classes in DESED correspond to classes in AudioSet; for example, some classes in DESED group several classes from AudioSet).

More information about this dataset and how to generate synthetic soundscapes can be found on the DESED website.

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.


Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.


Keywords

semi-supervised learning ; weakly labeled data ; synthetic data ; Sound event detection


Reference labels

AudioSet provides annotations at clip level (without time boundaries for the events). Therefore, the original annotations are considered as weak labels. Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that, in most cases, not all the clips contain the event related to the given annotation.

Weak annotations

The weak annotations have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file with the following format:

[filename (string)][tab][event_labels (strings)]

For example: Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
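As an illustration, here is a minimal sketch of how such a weak-label file can be loaded and parsed with pandas; the file path and the presence of a header row matching the column names above are assumptions.

  import pandas as pd

  # Load the weak annotations (path and header row are assumptions).
  weak = pd.read_csv("metadata/train/weak.tsv", sep="\t")

  # Split the comma-separated label string into a list of event classes per clip.
  weak["event_labels"] = weak["event_labels"].str.split(",")

  # Map each clip to its list of weak labels.
  clip_to_labels = dict(zip(weak["filename"], weak["event_labels"]))
  print(clip_to_labels.get("Y-BJNMHMZDcU_50.000_60.000.wav"))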

Strong annotations

Another subset of the development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).

The synthetic subset of the development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50K was verified by humans in order to check that the event class present in the FSD50K annotation was indeed dominant in the audio clip.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.


Abstract

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.


Abstract

Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.


In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events have been merged into a single event.

The strong annotations are provided in a tab separated csv file under the following format:

  [filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
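To make the format and the merging rule above concrete, the sketch below loads a strong-annotation file and merges same-class events separated by less than 150 ms; the file path and the presence of a header row with the columns filename, onset, offset and event_label are assumptions.

  import pandas as pd

  # Load strong annotations (path and header row are assumptions).
  strong = pd.read_csv("metadata/validation/validation.tsv", sep="\t")

  def merge_close_events(df, max_gap=0.150):
      """Merge consecutive events of the same class within a clip when the silence
      between them is shorter than max_gap seconds (150 ms in the annotation rule)."""
      merged = []
      for (fname, label), group in df.groupby(["filename", "event_label"]):
          group = group.sort_values("onset")
          cur_onset = cur_offset = None
          for _, row in group.iterrows():
              if cur_onset is None:
                  cur_onset, cur_offset = row["onset"], row["offset"]
              elif row["onset"] - cur_offset < max_gap:
                  # Gap shorter than 150 ms: extend the current event.
                  cur_offset = max(cur_offset, row["offset"])
              else:
                  merged.append((fname, cur_onset, cur_offset, label))
                  cur_onset, cur_offset = row["onset"], row["offset"]
          merged.append((fname, cur_onset, cur_offset, label))
      return pd.DataFrame(merged, columns=["filename", "onset", "offset", "event_label"])

  merged = merge_close_events(strong)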

Download

The dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the Task 4 data generation script.


To access the datasets separately, please refer to last year's instructions.

Task setup

The challenge consists of detecting sound events within web videos using training data from real recordings (both weakly labeled and unlabeled) and synthetic audio clips (which are strongly labeled). The detection within a 10-second clip should be performed with start and end timestamps. Note that a 10-second clip may correspond to more than one sound event.

Participants are allowed to use external datasets and embeddings extracted from pre-trained models. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until April 15th (as long as the corresponding resources are publicly available).

Note also that each participant should submit at least one system that is not using external data.

Development set

We provide 3 different splits of the training data in our development set: a weakly labeled training set, an unlabeled in-domain training set, and a synthetic training set with strong annotations.

Weakly labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on AudioSet annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty of the AudioSet labels, this distribution might not match exactly.

Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to the one of the validation set.

  • We used all the foreground files from the DESED synthetic soundbank (multiple times).
  • We used background files annotated as "other" from a subpart of the SINS dataset and files from the TUT Acoustic Scenes 2017 development dataset.
  • We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training sets of both FSD50K and FUSS. We provide tsv files corresponding to these splits.
  • Event distribution statistics for both target and non-target event classes are computed on annotations obtained by human annotators for ~90k clips from AudioSet.
Publication

Shawn Hershey, Daniel PW Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 366–370. IEEE, 2021.


We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.
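To give an idea of how such soundscapes are generated, here is a minimal Scaper sketch; the soundbank paths, the chosen event class and the distribution parameters are illustrative assumptions, not the official generation recipe (see the Scaper documentation for the exact API).

  import scaper

  # Paths to a local copy of the DESED synthetic soundbank (these locations are assumptions).
  fg_path = "soundbank/foreground"
  bg_path = "soundbank/background"

  sc = scaper.Scaper(duration=10.0, fg_path=fg_path, bg_path=bg_path, random_state=42)
  sc.ref_db = -50    # reference level for the background
  sc.sr = 16000      # DESED clips are sampled at 16 kHz

  # One background drawn at random from the background soundbank.
  sc.add_background(label=("choose", []),
                    source_file=("choose", []),
                    source_time=("const", 0))

  # One target event with randomized position, duration and SNR.
  sc.add_event(label=("const", "Dog"),
               source_file=("choose", []),
               source_time=("const", 0),
               event_time=("uniform", 0, 9),
               event_duration=("uniform", 0.25, 5.0),
               snr=("uniform", 6, 30),
               pitch_shift=None,
               time_stretch=None)

  # Write the audio, a JAMS file and a txt file with the strong labels.
  sc.generate("soundscape.wav", "soundscape.jams",
              allow_repeated_label=True,
              allow_repeated_source=True,
              reverb=0.1,
              txt_path="soundscape.txt")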

Sound event detection validation set

The validation set is designed such that the distribution in terms of clips per class is similar to that of the weakly labeled training set. The validation set contains 1168 clips (4093 events). The validation set is annotated with strong labels, with timestamps (obtained by human annotators).

Evaluation set

The sound event detection evaluation dataset is composed of 10-second audio clips.

  • A first subset is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes. This subset includes the public evaluation dataset.

External data resources

List of external data resources allowed:

Name Type Added Link
YAMNet model 20.05.2021 https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition model 31.03.2021 https://zenodo.org/record/3987831
OpenL3 model 12.02.2020 https://openl3.readthedocs.io/
VGGish model 12.02.2020 https://github.com/tensorflow/models/tree/master/research/audioset/vggish
COLA model 25.02.2023 https://github.com/google-research/google-research/tree/master/cola
BYOL-A model 25.02.2023 https://github.com/nttcslab/byol-a
AST: Audio Spectrogram Transformer model 25.02.2023 https://github.com/YuanGongND/ast
PaSST: Efficient Training of Audio Transformers with Patchout model 13.05.2022 https://github.com/kkoutini/PaSST
BEATs: Audio Pre-Training with Acoustic Tokenizers model 01.03.2023 https://github.com/microsoft/unilm/tree/master/beats
AudioSet audio, video 04.03.2019 https://research.google.com/audioset/
FSD50K audio 10.03.2022 https://zenodo.org/record/4060432
ImageNet image 01.03.2021 http://www.image-net.org/
MUSAN audio 25.02.2023 https://www.openslr.org/17/
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset audio 25.02.2023 https://zenodo.org/record/1247102#.Y_oyRIBBx8s
Pre-trained desed embeddings (Panns, AST part 1) model 25.02.2023 https://zenodo.org/record/6642806#.Y_oy_oBBx8s
Audio Teacher-Student Transformer model 22.04.2024 https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view
TUT Acoustic scenes dataset audio 22.04.2024 https://zenodo.org/records/45739
MicIRP IR 28.03.2023 http://micirp.blogspot.com/?m=1


Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Participants are allowed to submit up to 4 different systems without ensembling and 4 different systems with ensembling.
  • Participants have to submit at least one system without ensembling.
  • Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
  • Data from other tasks is considered external data.
  • Embeddings extracted from models pre-trained on external data are considered external data.
  • Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and the metadata.
  • Manipulation of provided training data is allowed.
  • Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.

Submission

Instructions regarding the output submission format and the required metadata can be found in the example submission package.


Metadata file

Participants are allowed to submit up to 4 different systems without ensembling and 4 systems with ensembling. Participants have to submit at least one system without ensembling. Each participant is expected to submit at least one system without external data.

For each submission, participants should provide the outputs obtained with 3 different runs of the same system in order to compute confidence intervals.

Participants using external data/pretrained models, please make sure to fill the corresponding fields in the yaml file (lines 109 and 112).

Please make sure to report the energy consumption in the yaml file (lines 102 and 103).

Package validator

Before submission, please check that your submission package is correct with the validation script enclosed in the submission package: python validate_submissions.py -i /Users/nturpaul/Documents/code/dcase2021/task4_test

Evaluation

All submissions will be evaluated with polyphonic sound event detection scores (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). This metric is based on the intersection between events.

(NEW) This year we use sed_scores_eval for evaluation which computes the PSDS accurately from sound event detection scores. Hence, we require participants to submit timestamped scores rather than detected events. See https://github.com/fgnt/sed_scores_eval for details.

Note that this year's results can therefore not be directly compared with previous years' results, as the threshold-independent PSDS results in higher values (~1% for the baseline).

In order to better understand the behavior of each submission in different scenarios, we evaluate the submissions on two scenarios that emphasize different system properties (an evaluation sketch covering both scenarios is given after the parameter lists below).

Publication

Janek Ebbers, Reinhold Haeb-Umbach, and Romain Serizel. Threshold independent evaluation of sound event detection scores. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1021–1025. IEEE, 2022.


Scenario 1

The system needs to react quickly upon detecting an event (e.g., to trigger an alarm, adapt a home automation system, etc.). The temporal localization of the sound event is therefore very important. The PSDS parameters reflecting these needs are:

  • Detection Tolerance criterion (DTC): 0.7
  • Ground Truth intersection criterion (GTC): 0.7
  • Cost of instability across class (\(\alpha_{ST}\)): 1
  • Cost of CTs on user experience (\(\alpha_{CT}\)): 0
  • Maximum False Positive rate (e_max): 100

Scenario 2

The system must avoid confusion between classes, but the reaction time is less crucial than in the first scenario. The PSDS parameters reflecting these needs are:

  • Detection Tolerance criterion (DTC): 0.1
  • Ground Truth intersection criterion (GTC): 0.1
  • Cost of instability across class (\(\alpha_{ST}\)): 1
  • Cross-Trigger Tolerance criterion (cttc): 0.3
  • Cost of CTs on user experience (\(\alpha_{CT}\)): 0.5
  • Maximum False Positive rate (e_max): 100
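The following is a minimal sketch of how these two operating points could be computed with sed_scores_eval; the score, ground-truth and duration file paths are assumptions, and the exact argument types should be checked against the sed_scores_eval documentation.

  from sed_scores_eval import intersection_based

  # scores: directory of per-clip TSV files with timestamped class scores,
  # ground_truth / audio_durations: TSV files (paths are assumptions, see
  # https://github.com/fgnt/sed_scores_eval for the exact interface).
  common = dict(
      scores="scores/eval/",
      ground_truth="metadata/eval/ground_truth.tsv",
      audio_durations="metadata/eval/durations.tsv",
  )

  # Scenario 1: reaction time matters (dtc = gtc = 0.7, no cross-trigger cost).
  psds1, _, _ = intersection_based.psds(
      dtc_threshold=0.7, gtc_threshold=0.7, cttc_threshold=None,
      alpha_ct=0.0, alpha_st=1.0, max_efpr=100.0, **common)

  # Scenario 2: class confusions matter (dtc = gtc = 0.1, cttc = 0.3, alpha_ct = 0.5).
  psds2, _, _ = intersection_based.psds(
      dtc_threshold=0.1, gtc_threshold=0.1, cttc_threshold=0.3,
      alpha_ct=0.5, alpha_st=1.0, max_efpr=100.0, **common)

  print(psds1, psds2)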

Task Ranking

The official ranking will be a team-wise ranking, not a system-wise ranking. The ranking criterion will be the aggregation of PSDS-scenario1 and PSDS-scenario2. Each separate metric considered in the final ranking criterion will be the best separate metric among all the team's submissions (PSDS-scenario1 and PSDS-scenario2 can be obtained by two different systems from the same team; see also Fig 5). The setup is chosen in order to favor experiments on system behavior and adaptation to different metrics depending on the targeted scenario.

$$ \mathrm{Ranking\ Score} = \overline{\mathrm{PSDS}_1} + \overline{\mathrm{PSDS}_2}$$

with \(\overline{\mathrm{PSDS}_1}\) and \(\overline{\mathrm{PSDS}_2}\) the PSDS on scenario 1 and 2 normalized by the baseline PSDS on these scenarios, respectively.

Figure 5: PSDS combination for the final ranking.


Contrastive metric (collar-based F1-score)

Additionally, event-based measures with a 200 ms collar on onsets and a collar on offsets of 200 ms or 20% of the event length will be provided as a contrastive measure. Systems will be evaluated with a threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.

Evaluation is done using the sed_eval and psds_eval toolboxes.
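As an illustration of the collar-based evaluation, here is a minimal sed_eval sketch; the reference and system-output file names are assumptions, and the files are expected to follow the strong-annotation format described above.

  import sed_eval
  import dcase_util

  # Reference annotations and system output in the strong-annotation format
  # (file names are assumptions).
  reference = dcase_util.containers.MetaDataContainer().load("metadata/eval/ground_truth.tsv")
  estimated = dcase_util.containers.MetaDataContainer().load("system_output.tsv")

  # 200 ms collar on onsets, max(200 ms, 20% of the event length) on offsets.
  event_based = sed_eval.sound_event.EventBasedMetrics(
      event_label_list=reference.unique_event_labels,
      t_collar=0.200,
      percentage_of_length=0.20,
  )
  event_based.evaluate(reference_event_list=reference, estimated_event_list=estimated)
  print(event_based.results_overall_metrics()["f_measure"])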



Energy consumption

Note that this year the energy consumption reports are mandatory.

The computational cost can have an important ecological impact. An environmental metric is suggested to raise awareness around this subject. While this metric won't be used in the ranking system, it can be an important aspect during the award attribution. Energy consumption is computed using codecarbon (a running example is provided in the baseline).

(NEW) In order to account for potential hardware difference, participants have to report the energy consumption measured while training the baseline during 10 epochs (on their hardware).
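A minimal sketch of how energy consumption can be tracked with codecarbon is shown below; the project name and the placeholder training function are assumptions, and the running example provided in the baseline should be preferred.

  from codecarbon import EmissionsTracker

  def train_model():
      # Placeholder for your own training loop (e.g. 10 epochs of the baseline).
      pass

  tracker = EmissionsTracker(project_name="dcase2023_task4a_training")
  tracker.start()
  try:
      train_model()
  finally:
      tracker.stop()

  # The measured energy (kWh) is written to the generated emissions.csv file
  # (column "energy_consumed"); this is the value to report in the submission yaml.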

(New) Multiply–accumulate (MAC) operations

This year we are introducing a new metric, complementary to the energy consumption metric: the number of multiply-accumulate operations (MACs) for 10 seconds of audio prediction, so as to have information regarding the computational complexity of the network.

We use THOP: PyTorch-OpCounter as the framework to compute the number of multiply-accumulate operations (MACs). For more information regarding how to install and use THOP, the reader is referred to the THOP documentation.
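A minimal sketch of a MAC count with THOP is given below; the toy model and the input shape (10 s of mono audio at 16 kHz) are assumptions and must be replaced by the submitted system and its actual front-end input.

  import torch
  import torch.nn as nn
  from thop import profile

  # Placeholder model; replace with the submitted SED system.
  model = nn.Sequential(nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU())

  # One 10-second mono clip at 16 kHz (shape: batch, channels, samples).
  dummy_input = torch.randn(1, 1, 16000 * 10)
  macs, params = profile(model, inputs=(dummy_input,))
  print(f"MACs for 10 s of audio: {macs / 1e9:.3f} G, parameters: {params / 1e6:.2f} M")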

Results

All confidence intervals are computed based on the three runs per system and bootstrapping on the evaluation set. The table below includes only the best ranking score per submitting team without ensembling.

Submission code (PSDS 1)   Submission code (PSDS 2)   Ranking score (Evaluation dataset)   PSDS 1 (Evaluation dataset)   PSDS 2 (Evaluation dataset)
Kim_GIST-HanwhaVision_task4a_2 Kim_GIST-HanwhaVision_task4a_3 1.68 0.591 (0.574 - 0.611) 0.835 (0.826 - 0.846)
Zhang_IOA_task4a_6 Zhang_IOA_task4a_7 1.63 0.562 (0.552 - 0.575) 0.830 (0.820 - 0.842)
Wenxin_TJU_task4a_6 Wenxin_TJU_task4a_6 1.61 0.546 (0.536 - 0.556) 0.831 (0.823 - 0.842)
Xiao_FMSG_task4a_4 Xiao_FMSG_task4a_4 1.60 0.551 (0.543 - 0.562) 0.813 (0.802 - 0.827)
Guan_HIT_task4a_3 Guan_HIT_task4a_4 1.60 0.526 (0.513 - 0.539) 0.855 (0.844 - 0.867)
Chen_CHT_task4a_2 Chen_CHT_task4a_2 1.58 0.563 (0.550 - 0.574) 0.779 (0.768 - 0.792)
Li_USTC_task4a_6 Li_USTC_task4a_6 1.56 0.546 (0.529 - 0.562) 0.783 (0.771 - 0.796)
Liu_NSYSU_task4a_7 Liu_NSYSU_task4a_7 1.55 0.521 (0.510 - 0.531) 0.813 (0.796 - 0.831)
Cheimariotis_DUTH_task4a_1 Cheimariotis_DUTH_task4a_1 1.53 0.516 (0.504 - 0.529) 0.796 (0.784 - 0.808)
Baseline_BEATS Baseline_BEATS 1.52 0.510 (0.496 - 0.523) 0.798 (0.782 - 0.811)
Baseline Baseline 1.00 0.327 (0.317 - 0.339) 0.538 (0.515 - 0.566)
Wang_XiaoRice_task4a_1 Wang_XiaoRice_task4a_1 1.50 0.494 (0.477 - 0.510) 0.801 (0.789 - 0.815)
Lee_CAUET_task4a_1 Lee_CAUET_task4a_2 1.28 0.425 (0.415 - 0.440) 0.674 (0.661 - 0.690)
Liu_SRCN_task4a_4 Liu_SRCN_task4a_4 1.25 0.412 (0.400 - 0.424) 0.663 (0.652 - 0.676)
Barahona_AUDIAS_task4a_2 Barahona_AUDIAS_task4a_4 1.21 0.380 (0.361 - 0.406) 0.673 (0.652 - 0.700)
Wu_NCUT_task4a_1 Wu_NCUT_task4a_1 1.15 0.391 (0.379 - 0.405) 0.596 (0.584 - 0.610)
Gan_NCUT_task4a_1 Gan_NCUT_task4a_1 1.12 0.365 (0.353 - 0.377) 0.603 (0.589 - 0.617)

Complete results and technical reports can be found on the results page.

Baseline system

System description

The baseline model is the same as in DCASE 2021 Task 4. The model is a mean-teacher model. The 2023 recipe includes a version of the baseline trained on DESED and strongly annotated clips from AudioSet. Mixup is used as a data augmentation technique for the weak and synthetic data by mixing data within a batch (with a 50% chance of applying it); a sketch of this augmentation is given below. More details about the baseline are available on the baseline page.
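The following is a minimal, hedged sketch of batch-level mixup as described above; the Beta parameter and the way the targets are mixed are assumptions, not the exact baseline recipe.

  import torch

  def mixup(features, targets, alpha=0.2, apply_prob=0.5):
      """Sketch of batch-level mixup: with probability apply_prob, mix each example
      with a randomly permuted example from the same batch."""
      if torch.rand(1).item() > apply_prob:
          return features, targets
      lam = torch.distributions.Beta(alpha, alpha).sample().item()
      perm = torch.randperm(features.size(0))
      mixed_features = lam * features + (1 - lam) * features[perm]
      mixed_targets = lam * targets + (1 - lam) * targets[perm]
      return mixed_features, mixed_targets

  # Example usage on dummy log-mel features and weak labels for 10 classes.
  x = torch.randn(8, 128, 64)
  y = torch.randint(0, 2, (8, 10)).float()
  x_mix, y_mix = mixup(x, y)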

(New) Baseline using pre-trained embeddings from models (SEC/Tagging) trained on Audioset

We added a baseline which exploits the pre-trained model BEATs, the current state-of-the-art (as of March 2023) on the AudioSet classification task.

In the proposed baseline, the frame-level embeddings are used in a late-fusion fashion with the existing CRNN baseline classifier. The temporal resolution of the frame-level embeddings is matched to that of the CNN output using adaptive average pooling. We then feed their frame-level concatenation to the RNN + MLP classifier; a minimal sketch of this fusion is given below. See the baseline page for details.
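Here is a minimal PyTorch sketch of the late-fusion idea; the tensor shapes (CNN frames and channels, BEATs frames and embedding dimension) and the RNN size are placeholders, not the actual baseline configuration.

  import torch
  import torch.nn as nn

  # Dummy CNN output and dummy pre-trained frame-level embeddings.
  cnn_out = torch.randn(4, 156, 128)     # (batch, cnn_frames, cnn_channels)
  beats_emb = torch.randn(4, 496, 768)   # (batch, emb_frames, emb_dim)

  # Pool the embeddings over time to the CNN temporal resolution.
  pool = nn.AdaptiveAvgPool1d(cnn_out.shape[1])
  beats_pooled = pool(beats_emb.transpose(1, 2)).transpose(1, 2)   # (batch, cnn_frames, emb_dim)

  # Frame-level concatenation fed to the RNN (+ MLP) classifier of the CRNN.
  fused = torch.cat([cnn_out, beats_pooled], dim=-1)               # (batch, cnn_frames, 896)
  rnn = nn.GRU(input_size=fused.shape[-1], hidden_size=256,
               bidirectional=True, batch_first=True)
  rnn_out, _ = rnn(fused)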

Results for the development dataset

PSDS-scenario1 PSDS-scenario2 Intersection-based F1 Collar-based F1
Baseline 0.359 +/- 0.006 0.562 +/- 0.012 64.2 +/- 0.8 % 40.7 +/- 0.6 %
Baseline (AudioSet strong) 0.364 +/- 0.005 0.576 +/- 0.011 65.5 +/- 1.3 % 43.3 +/- 1.4 %
Baseline (BEATS) 0.500 +/- 0.004 0.762 +/- 0.008 80.7 +/- 0.4 % 57.1 +/- 1.3 %

Collar-based = event-based. The intersection-based F1 is computed using dtc = gtc = 0.5 and cttc = 0.3, and the event-based F1 is computed using collars (onset = 200 ms, offset = max(200 ms, 20% of the event duration)).

Note: the performance might not be exactly reproducible on a GPU-based system. That is why you can download the checkpoint of the network along with the TensorBoard events. Launch python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt to test this model.

Energy consumption during the training and evaluation phase

Energy consumption for 1 run on an NVIDIA A100 80GB for a training phase and an inference phase on the development set.

Training (kWh) Dev-test (kWh)
Baseline 1.390 +/- 0.019 0.019 +/- 0.001
Baseline (AudioSet strong) 1.418 +/- 0.016 0.020 +/- 0.001
Baseline (BEATS) 1.821 +/- 0.457 0.022 +/- 0.003

Total number of multiply–accumulate operation (MACs): 44.683 G

Repositories




Citation

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Publication

Francesca Ronchini, Romain Serizel, Nicolas Turpault, and Samuele Cornell. The impact of non-target events in synthetic soundscapes for sound event detection. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 115–119. Barcelona, Spain, November 2021.


Abstract

Detection and Classification Acoustic Scene and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect the performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events alternatively during the training or validation phase (or none of them) helps the system to correctly detect target events. Secondly, we analyze to what extend adjusting the signal-to-noise ratio between target and non-target events at training improves the sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on non-target subset and acoustic similarity between target and non-target events which might confuse the system.

Publication

Janek Ebbers, Reinhold Haeb-Umbach, and Romain Serizel. Threshold independent evaluation of sound event detection scores. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1021–1025. IEEE, 2022.
