Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels


Task description

The goal of this task is to evaluate systems for the detection of sound events in real recordings, when data with different types of annotation and corresponding labels are available for training.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task is the follow-up to Task 4 A and Task 4 B in 2023. We propose to unify the setup of both subtasks. The target of this task is to provide the event class together with the event time boundaries, given that multiple events can be present and may overlap in an audio recording. This task aims at exploring how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels).

Novelties for 2024 edition

  • Systems will be evaluated on labels with different granularity in order to get a broader view of the systems' behavior and to assess their robustness for different applications.
  • The target classes in the different datasets are different, and hence sound labels of one dataset may be present but not annotated in the other one. The systems will have to cope with potentially missing target labels at training.

Scientific questions

This task highlights a number of specific research questions:

  • What is the most efficient way to exploit different sources of data to train a sound event detection system?
  • Is annotation uncertainty useful in learning models for SED?
  • How to exploit training data with partially missing annotations? How can we evaluate SED systems in a robust way?

Audio dataset

This Task is based on the DESED dataset and the MAESTRO Real dataset.

DESED dataset

The DESED dataset has been used since DCASE 2020 Task 4. DESED is composed of 10-second audio clips either recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. DESED focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all the classes in DESED correspond to classes in AudioSet; for example, some classes in DESED group several classes from AudioSet).

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.

Keywords

Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data

Publication

Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.

Sound event detection in synthetic domestic environments

Keywords

Semi-supervised learning ; Weakly labeled data ; Synthetic data ; Sound event detection

MAESTRO Real

The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions.

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.

Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

Abstract

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.


Dataset overlap

The original training datasets have not been reannotated. Therefore, sound events from the target classes of one dataset may be present but not annotated in the other one, and systems will have to cope with potentially missing target labels during training. Additionally, there is some overlap in terms of classes between the datasets. The following classes have been collapsed:

  • People talking (MAESTRO) -> Speech (DESED)
  • Cutlery & dishes (MAESTRO) -> dishes (DESED)
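The collapsing of overlapping classes can be sketched as a simple lookup. This is only an illustration: the exact label strings (`people_talking`, `cutlery_and_dishes`, `Speech`, `Dishes`) and the helper name `collapse_label` are assumptions and may differ from the strings used in the annotation files.

```python
# Hypothetical mapping from MAESTRO class names to DESED class names.
# The label strings are assumptions; check them against the annotation files.
MAESTRO_TO_DESED = {
    "people_talking": "Speech",
    "cutlery_and_dishes": "Dishes",
}


def collapse_label(label: str) -> str:
    """Return the DESED name for an overlapping MAESTRO class.

    Labels without a DESED counterpart are returned unchanged.
    """
    return MAESTRO_TO_DESED.get(label, label)
```

Classes that exist in only one dataset keep their original name, which is where the missing-label problem described above arises.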

Reference labels

One of the challenges of the task is to train systems that exploit annotations at different granularities. We describe here the different annotations available.

Weak annotations

The weak annotations in DESED have been verified manually for a small subset of the training set. They are provided in a tab-separated CSV file in the following format:

[filename (string)][tab][event_labels (strings)]

For example:

Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
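A minimal sketch of reading this format with the Python standard library is given below. It assumes that the labels of a clip are comma-separated within the second field, as in the example, and that a header row (if present) starts with `filename`; the function name `parse_weak_annotations` is hypothetical.

```python
import csv
import io


def parse_weak_annotations(text: str) -> dict:
    """Map each filename to its list of clip-level event labels."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    annotations = {}
    for row in reader:
        # Skip blank lines and a header row, if any (assumed to start with "filename").
        if len(row) < 2 or row[0] == "filename":
            continue
        filename, labels = row[0], row[1]
        annotations[filename] = [label for label in labels.split(",") if label]
    return annotations


example = "Y-BJNMHMZDcU_50.000_60.000.wav\tAlarm_bell_ringing,Dog\n"
weak = parse_weak_annotations(example)
```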

Strong annotations

A subset of DESED development has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).

The synthetic subset of the DESED development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50K was verified by humans to check that the event class given in the FSD50K annotation was indeed dominant in the audio clip.

Publication

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.

FSD50K: an Open Dataset of Human-Labeled Sound Events

Abstract

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Publication

Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.

Freesound technical demo

Abstract

Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.

In both cases, the minimum length for an event is 250 ms. The minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event.
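The merging rule can be sketched as follows. This is an illustration of the rule as stated, not the script used to produce the annotations; the function name `merge_close_events` and the tuple layout are assumptions.

```python
def merge_close_events(events, min_gap=0.150):
    """Merge same-class events separated by a pause shorter than min_gap seconds.

    events: list of (onset, offset, label) tuples, with times in seconds.
    """
    merged = []
    last_by_label = {}  # label -> index in `merged` of that class's latest event
    for onset, offset, label in sorted(events):
        idx = last_by_label.get(label)
        if idx is not None and onset - merged[idx][1] < min_gap:
            # Pause is too short: extend the previous event of this class.
            prev_onset, prev_offset, _ = merged[idx]
            merged[idx] = (prev_onset, max(prev_offset, offset), label)
        else:
            last_by_label[label] = len(merged)
            merged.append((onset, offset, label))
    return merged
```

Events from different classes never affect each other, so overlapping events of distinct classes are left untouched.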

The strong annotations are provided in a tab-separated CSV file in the following format:

  [filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

For example:

YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
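One row of this format can be parsed as sketched below. The `StrongEvent` type and the function name `parse_strong_line` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple


class StrongEvent(NamedTuple):
    filename: str
    onset: float   # seconds
    offset: float  # seconds
    event_label: str


def parse_strong_line(line: str) -> StrongEvent:
    """Parse one tab-separated row of a strong annotation file."""
    filename, onset, offset, label = line.rstrip("\n").split("\t")
    return StrongEvent(filename, float(onset), float(offset), label)


event = parse_strong_line(
    "YOTsn73eqbfc_10.000_20.000.wav\t0.163\t0.665\tAlarm_bell_ringing"
)
```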

Soft labels

The reference labels in MAESTRO development data are available as soft labels, in the following format:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]

Example:

a1.wav       0  1   footsteps   0.6
a1.wav       0  1   people_talking      0.9
a1.wav       1  2   footsteps   0.8

These labels can be transformed into hard (binary) labels using a threshold of 0.5; the equivalent annotation would be:

Hard labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

Example:

a1.wav       0  2   footsteps
a1.wav       0  1   people_talking
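The conversion shown in the example can be sketched as follows: segments whose soft value exceeds the threshold are kept, and contiguous active segments of the same class are merged. Whether a value exactly equal to 0.5 counts as active is an assumption here (it is treated as inactive), and the function name `soft_to_hard` is hypothetical.

```python
from collections import defaultdict


def soft_to_hard(rows, threshold=0.5):
    """Convert soft-label rows to hard-label rows.

    rows: iterable of (filename, onset, offset, label, soft_value).
    Returns (filename, onset, offset, label) tuples, with contiguous
    active segments of the same class merged into one event.
    """
    active = defaultdict(list)
    for filename, onset, offset, label, value in rows:
        if value > threshold:  # assumption: exactly 0.5 is treated as inactive
            active[(filename, label)].append((onset, offset))

    hard = []
    for (filename, label), segments in active.items():
        segments.sort()
        cur_on, cur_off = segments[0]
        for onset, offset in segments[1:]:
            if onset <= cur_off:  # contiguous or overlapping: extend the event
                cur_off = max(cur_off, offset)
            else:
                hard.append((filename, cur_on, cur_off, label))
                cur_on, cur_off = onset, offset
        hard.append((filename, cur_on, cur_off, label))
    return sorted(hard, key=lambda r: (r[0], r[3], r[1]))


soft_rows = [
    ("a1.wav", 0, 1, "footsteps", 0.6),
    ("a1.wav", 0, 1, "people_talking", 0.9),
    ("a1.wav", 1, 2, "footsteps", 0.8),
]
hard_rows = soft_to_hard(soft_rows)
```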

The provided dataset contains 17 sound classes. Of these, only 15 have soft label values over 0.5, and 4 of those 15 are very rare. For this reason, the evaluation is conducted only against the following 11 classes:

  • Birds singing
  • Car
  • People talking
  • Footsteps
  • Children voices
  • Wind blowing
  • Brakes squeaking
  • Large vehicle
  • Cutlery and dishes
  • Metro approaching
  • Metro leaving

Download

The DESED dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the Task 4 data generation script.


To access the datasets separately, please refer to the Task 4 2021 instructions.

The MAESTRO dataset can be downloaded at the following repository.


Task setup

The challenge consists of detecting sound events within audio, using training data with varying types of annotation (weak hard labels, strong hard labels and strong soft labels). Each detection within an audio clip should be reported with start and end timestamps. Note that an audio clip may contain more than one sound event.

Participants are allowed to use external datasets and embeddings extracted from pre-trained models. Lists of the eligible datasets and pre-trained models are provided below.

Development set

We provide 4 different splits of the training data in our development set: "Weakly labeled training set", "Unlabeled in domain training set", "Synthetic strongly labeled set" and "Soft labeled training set".

Weakly labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.

Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips. The clips are selected such that the distribution per class (based on AudioSet annotations) is close to the distribution in the labeled set. Note, however, that given the uncertainty on AudioSet labels, the two distributions might not match exactly.

Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to the one of the validation set.

  • We used all the foreground files from the DESED synthetic soundbank (multiple times).
  • We used background files annotated as "other" from a subset of the SINS dataset and files from the TUT Acoustic Scenes 2017 development dataset.
  • We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training split of both FSD50K and FUSS. We provide tsv files corresponding to these splits.
  • Event distribution statistics for both target and non-target event classes are computed on annotations obtained by human annotators for ~90k clips from AudioSet.

Publication

Shawn Hershey, Daniel PW Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 366–370. IEEE, 2021.

We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See DESED github repo and Scaper documentation for more information about how to create new soundscapes.

Soft labeled training set:
This year the MAESTRO dataset is provided with a fixed training/validation split in which approximately 70% of the data (per class) is used for training and the rest is held out for validation. Participants are required to report results on the validation set using this setup.

Sound event detection validation set

The validation set is the combination of the DESED validation set and the MAESTRO validation split described above. The overall validation set contains 1184 clips. The DESED validation set is annotated with strong labels with timestamps (obtained by human annotators), while the MAESTRO validation set is annotated with soft labels.

External data resources

List of external data resources allowed:

Dataset name | Type | Added | Link
YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831
OpenL3 | model | 12.02.2020 | https://openl3.readthedocs.io/
VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish
COLA | model | 25.02.2023 | https://github.com/google-research/google-research/tree/master/cola
BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a
AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast
PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST
BEATs: Audio Pre-Training with Acoustic Tokenizers | model | 01.03.2023 | https://github.com/microsoft/unilm/tree/master/beats
AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/
FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432
ImageNet | image | 01.03.2021 | http://www.image-net.org/
MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s
Pre-trained desed embeddings (Panns, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s


Datasets and models can be added to the list upon request until May 1st (as long as the corresponding resources are publicly available).

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Participants are allowed to submit up to 4 different systems without ensembling and 4 different systems with ensembling.
  • Participants have to submit at least one system without ensembling.
  • Participants are allowed to use external data for system development.
  • Data from other tasks is considered external data.
  • Embeddings extracted from models pre-trained on external data are considered external data.
  • Another example of external data is other material related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames and metadata.
  • Datasets and models can be added to the list upon request until May 1st (as long as the corresponding resources are publicly available).
  • External datasets used during training should be listed in the YAML file describing the submission.
  • Manipulation of provided training data is allowed.
  • Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.

Evaluation set

The evaluation dataset is composed of the DESED evaluation set and the MAESTRO evaluation set.

DESED evaluation set:
The DESED evaluation set is composed of audio clips extracted from YouTube and Vimeo videos under Creative Commons licenses. This subset is used for ranking purposes. This subset includes the public evaluation dataset. Part of this evaluation set has also been reannotated with soft labels following the same procedure as in MAESTRO.

MAESTRO evaluation set:
The MAESTRO evaluation set consists of 26 files with a total length of 97 minutes. Only audio is provided for the evaluation set.

Submission

TBA

Evaluation

Segment based metrics

System evaluation will be based on the following metrics, calculated in 1s-segments:

  • micro-average F1 score \(F1_m\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
  • micro-average error rate \(ER_m\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
  • macro-average F1 score \(F1_M\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
  • macro-average F1 score with optimum threshold per class \(F1_{MO}\) calculated using sed_scores_eval, based on the best F1 score per class obtained with a class-specific threshold
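To make the segment-based scores concrete, the sketch below computes micro-average F1, macro-average F1 and micro-average error rate from binary activity matrices with one row per 1 s segment and one column per class. It follows the standard segment-based definitions (substitutions, deletions and insertions counted per segment) but is not the sed_eval implementation; returning 0 for a class with no reference or system activity is an assumption, and the function name is hypothetical.

```python
def segment_based_scores(reference, estimated):
    """Compute (F1_micro, F1_macro, ER_micro) from binary activity matrices.

    reference, estimated: lists of rows, one row per 1 s segment and one
    column per class, holding 0/1 activity after thresholding.
    """
    n_classes = len(reference[0])
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    subs = dels = ins = n_ref = 0

    for ref_row, est_row in zip(reference, estimated):
        fp_k = fn_k = 0  # false positives / negatives within this segment
        for c, (r, e) in enumerate(zip(ref_row, est_row)):
            if r and e:
                tp[c] += 1
            elif e:
                fp[c] += 1
                fp_k += 1
            elif r:
                fn[c] += 1
                fn_k += 1
        n_ref += sum(1 for r in ref_row if r)
        subs += min(fp_k, fn_k)        # a miss paired with a false alarm
        dels += max(0, fn_k - fp_k)    # remaining misses
        ins += max(0, fp_k - fn_k)     # remaining false alarms

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0  # assumption: empty class scores 0

    f1_micro = f1(sum(tp), sum(fp), sum(fn))
    f1_macro = sum(f1(t, p, n) for t, p, n in zip(tp, fp, fn)) / n_classes
    er_micro = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1_micro, f1_macro, er_micro


reference = [[1, 0], [1, 1], [0, 0]]
estimated = [[0, 1], [1, 1], [0, 1]]
f1_micro, f1_macro, er_micro = segment_based_scores(reference, estimated)
```

Micro-averaging pools the counts over all classes before computing the score, whereas macro-averaging computes F1 per class and then averages, so rare classes weigh more in the macro score.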

Polyphonic sound event detection scores

Submissions will also be evaluated with polyphonic sound event detection scores (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). This metric is based on the intersection between events. The PSDS parameters used for evaluation are the following:

  • Detection Tolerance criterion (DTC): 0.7
  • Ground Truth intersection criterion (GTC): 0.7
  • Cost of instability across class (\(\alpha_{ST}\)): 1
  • Cost of CTs on user experience (\(\alpha_{CT}\)): 0
  • Maximum False Positive rate (e_max): 100
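The role of the two intersection criteria can be illustrated with a simplified, single-class sketch: a detection is kept if the fraction of its duration that intersects ground truth events reaches the DTC, and a ground truth event counts as detected if the fraction of its duration covered by the kept detections reaches the GTC. This is a simplified reading of the criteria that assumes events within each list do not overlap each other; it is not the PSDS implementation used for the evaluation, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def detected_ground_truths(ground_truths, detections, dtc=0.7, gtc=0.7):
    """Return the ground truth events of one class that count as detected.

    ground_truths, detections: lists of (onset, offset) for a single class,
    assumed not to overlap within each list.
    """
    # Detection tolerance criterion: keep detections that lie mostly inside ground truth.
    relevant = [
        d for d in detections
        if sum(overlap(d, g) for g in ground_truths) / (d[1] - d[0]) >= dtc
    ]
    # Ground truth intersection criterion: enough of the event must be covered.
    return [
        g for g in ground_truths
        if sum(overlap(g, d) for d in relevant) / (g[1] - g[0]) >= gtc
    ]
```

For instance, a 10 s ground truth event covered by two detections of 4 s each (80% coverage) would count as detected, while a single 4 s detection (40% coverage) would not.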

Collar-based F1-score

Additionally, event-based measures with a 200 ms collar on onsets and a collar of 200 ms or 20% of the event length on offsets will be provided as a contrastive measure. Systems will be evaluated with the threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
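The collar test for a single pair of events can be sketched as below. It assumes that the offset tolerance is the larger of the fixed collar and the given fraction of the reference event length, which is the usual convention for this kind of measure; the function name `collar_match` is hypothetical and this is not the toolbox implementation.

```python
def collar_match(ref, est, onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """Check whether an estimated event matches a reference event.

    ref, est: (onset, offset, label) tuples, with times in seconds.
    """
    ref_on, ref_off, ref_label = ref
    est_on, est_off, est_label = est
    if ref_label != est_label:
        return False
    onset_ok = abs(est_on - ref_on) <= onset_collar
    # Assumption: the larger of the two offset tolerances applies.
    offset_tolerance = max(offset_collar, offset_ratio * (ref_off - ref_on))
    offset_ok = abs(est_off - ref_off) <= offset_tolerance
    return onset_ok and offset_ok
```

With these settings a 4 s reference event tolerates an offset error of up to 0.8 s, while a 0.5 s event tolerates only 0.2 s.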

Energy consumption

Note that this year the energy consumption reports are mandatory.

Computation can have a significant ecological impact. An environmental metric is suggested to raise awareness of this issue. While this metric won't be used in the ranking system, it can be an important aspect during the award attribution. Energy consumption is computed using CodeCarbon (a running example is provided in the baseline).

Participants need to provide, for each submitted system, the following energy consumption figures in kWh using CodeCarbon:

1) Whole system training
2) Devtest inference

You can refer to the baseline code in local/sed_trained.py for some hints on how we accomplish this for the baseline system.

Important!!

To normalize the energy consumption measurement, we require participants to report the energy consumption in kWh (using the same hardware as for the submission) for:

1) Training the baseline system for 10 epochs
2) Devtest inference for the baseline system

Both are computed by the python train_sed.py command. You just need to set 10 epochs in the confs/default.yaml. You can find the energy consumed in kWh in ./exp/2024_baseline/version_X/codecarbon/emissions_baseline_training.csv for training and ./exp/2024_baseline/version_X/codecarbon/emissions_baseline_inference.csv for devtest inference.

Besides reporting the metrics in the YAML file describing your submission, we require that you submit the complete .csv files, which detail the consumption for GPU, CPU and RAM usage. For more information, please refer to the submission package example.
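As a rough sketch, the kWh figure can be read back from such a .csv file with the standard library. The column name `energy_consumed` is an assumption about the CodeCarbon output and should be checked against the header of your own file; the CSV content below is dummy data for illustration only, and the function name is hypothetical.

```python
import csv
import io


def total_energy_kwh(csv_text: str, column: str = "energy_consumed") -> float:
    """Sum the energy column (assumed to be in kWh) over all rows of the file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)


# Dummy content, for illustration only (not real measurements).
dummy_csv = (
    "timestamp,project_name,energy_consumed\n"
    "2024-01-01T00:00:00,baseline,0.5\n"
    "2024-01-01T01:00:00,baseline,0.25\n"
)
energy = total_energy_kwh(dummy_csv)
```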

Multiply–accumulate (MAC) operations

This year we are introducing a new metric, complementary to the energy consumption metric: the number of multiply–accumulate operations (MACs) needed to predict 10 seconds of audio. This gives information about the computational complexity of the network.

We use THOP: PyTorch-OpCounter as the framework to compute the number of MACs. For more information on how to install and use THOP, the reader is referred to the THOP documentation.
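To illustrate what a MAC count measures, the sketch below counts the MACs of a 2D convolution and a fully connected layer by hand. It ignores bias terms and is only meant to show the order of magnitude involved; it is not THOP's implementation, and the layer sizes in the example are hypothetical.

```python
def conv2d_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w):
    """MACs of a dense 2D convolution (bias ignored).

    Each output value needs c_in * kernel_h * kernel_w multiply-accumulates.
    """
    return c_in * kernel_h * kernel_w * c_out * out_h * out_w


def linear_macs(in_features, out_features):
    """MACs of a fully connected layer applied to one input vector (bias ignored)."""
    return in_features * out_features


# Hypothetical toy network: one 3x3 convolution followed by one linear layer.
total_macs = conv2d_macs(1, 16, 3, 3, 10, 10) + linear_macs(128, 10)
```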

Evaluation toolboxes

Evaluation is done using the sed_eval and sed_scores_eval toolboxes.

Baseline system

TBA

Citations

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Publication

Francesca Ronchini, Romain Serizel, Nicolas Turpault, and Samuele Cornell. The impact of non-target events in synthetic soundscapes for sound event detection. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 115–119. Barcelona, Spain, November 2021.

The Impact of Non-Target Events in Synthetic Soundscapes for Sound Event Detection

Abstract

Detection and Classification Acoustic Scene and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect the performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events alternatively during the training or validation phase (or none of them) helps the system to correctly detect target events. Secondly, we analyze to what extend adjusting the signal-to-noise ratio between target and non-target events at training improves the sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on non-target subset and acoustic similarity between target and non-target events which might confuse the system.

Publication

Janek Ebbers, Reinhold Haeb-Umbach, and Romain Serizel. Threshold independent evaluation of sound event detection scores. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1021–1025. IEEE, 2022.
