The goal of this task is to evaluate systems for the detection of sound events using real data, with different types of annotated data and corresponding labels available for training.
Challenge has ended. Full results for this task can be found on the Results page.
Description
This task is the follow-up to Task 4 A and Task 4 B in 2023; we propose to unify the setup of both subtasks. The target of this task is to provide the event class together with the event time boundaries, given that multiple events can be present and may overlap in an audio recording. This task aims at exploring how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels).
Novelties for 2024 edition
- Systems will be evaluated on labels with different granularity in order to get a broader view of the systems' behavior and to assess their robustness for different applications.
- The target classes in the two datasets differ, hence sound events belonging to the target classes of one dataset may be present but not annotated in the other. The systems will have to cope with potentially missing target labels during training.
- The SED system will have to perform without knowing the origin of the audio clips at evaluation time.
Scientific questions
This task highlights a number of specific research questions:
- What is the most efficient way to exploit different sources of data to train a sound event detection system?
- Is annotation uncertainty useful in learning models for SED?
- How to exploit training data with partially missing annotations? How can we evaluate SED systems in a robust way?
- How can we train SED systems that should perform well under various sound event distributions with potential domain mismatch?
Audio dataset
This task is based on the DESED dataset and the MAESTRO Real dataset.
DESED dataset
DESED dataset has been used since DCASE 2020 Task 4. DESED is composed of 10 sec audio clips recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. DESED focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all the classes in DESED correspond to classes in AudioSet; for example, some classes in DESED group several classes from AudioSet).
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Abstract
This paper presents Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge and provides a first analysis of the challenge results. The task is a followup to Task 4 of DCASE 2018, and involves training systems for large-scale detection of sound events using a combination of weakly labeled data, i.e. training labels without time boundaries, and strongly-labeled synthesized data. The paper introduces Domestic Environment Sound Event Detection (DESED) dataset mixing a part of last year dataset and an additional synthetic, strongly labeled, dataset provided this year that we’ll describe more in detail. We also report the performance of the submitted systems on the official evaluation (test) and development sets as well as several additional datasets. The best systems from this year outperform last year’s winning system by about 10% points in terms of F-measure.
Keywords
Sound event detection ; Weakly labeled data ; Semi-supervised learning ; Synthetic data
Romain Serizel, Nicolas Turpault, Ankit Shah, and Justin Salamon. Sound event detection in synthetic domestic environments. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing. Barcelona, Spain, 2020. URL: https://hal.inria.fr/hal-02355573.
Sound event detection in synthetic domestic environments
MAESTRO Real
The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions.
Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.
Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation
Abstract
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.
Dataset overlap
The original training datasets have not been re-annotated. Therefore, sound events of one dataset may be present but not annotated in the other one, and the systems will have to cope with potentially missing target labels during training. Additionally, there is some overlap in terms of classes between the datasets. The following classes have been collapsed:
- People talking (MAESTRO) -> Speech (DESED)
- Cutlery and dishes (MAESTRO) -> Dishes (DESED)
Reference labels
One of the challenges of the task is to train systems exploiting annotations at different granularity. We describe here the different annotations available.
Weak annotations
The weak annotations in DESED have been verified manually for a small subset of the training set. The weak annotations are provided in a tab-separated csv file under the following format:
[filename (string)][tab][event_labels (strings)]
For example:
Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
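As a quick illustration of this format, here is a minimal sketch that reads the weak annotations with pandas and splits the comma-separated labels. The file name is hypothetical, and the sketch assumes the file has a header row with `filename` and `event_labels` columns (otherwise pass `names=[...]` to `read_csv`):

```python
import pandas as pd

# Hypothetical file name standing in for the DESED weak-annotation tsv.
weak = pd.read_csv("weak.tsv", sep="\t")

# Turn the comma-separated label string into a Python list per clip.
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```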
Strong annotations
A subset of the DESED development set has been annotated manually with strong annotations, to be used as the test set (see also below for a detailed explanation about the development set).
The synthetic subset of the DESED development set is generated and labeled with strong annotations using the Scaper soundscape synthesis and augmentation library. Each sound clip from FSD50K was verified by humans in order to check that the event class present in the FSD50K annotation was indeed dominant in the audio clip.
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events. In arXiv:2010.00475. 2020.
FSD50K: an Open Dataset of Human-Labeled Sound Events
Abstract
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on a massive amount of audio tracks from YouTube videos and encompassing over 500 classes of everyday sounds. However, AudioSet is not an open dataset---its release consists of pre-computed audio features (instead of waveforms), which limits the adoption of some SER methods. Downloading the original audio tracks is also problematic due to constituent YouTube videos gradually disappearing and usage rights issues, which casts doubts over the suitability of this resource for systems' benchmarking. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.
Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, 411–412. ACM, 2013.
Freesound technical demo
Abstract
Freesound is an online collaborative sound database where people with diverse interests share recorded sound samples under Creative Commons licenses. It was started in 2005 and it is being maintained to support diverse research projects and as a service to the overall research and artistic community. In this demo we want to introduce Freesound to the multimedia community and show its potential as a research resource. We begin by describing some general aspects of Freesound, its architecture and functionalities, and then explain potential usages that this framework has for research applications.
In both cases, the minimum length for an event is 250 ms and the minimum duration of the pause between two events from the same class is 150 ms. When the silence between two consecutive events from the same class was less than 150 ms, the events were merged into a single event.
The strong annotations are provided in a tab-separated csv file under the following format:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
For example:
YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
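To make the merging rule above concrete, here is a minimal, hypothetical sketch (not the script used to produce the official annotations) that merges consecutive events of the same class separated by less than 150 ms. It assumes a strong-annotation file with a header row containing `filename`, `onset`, `offset` and `event_label` columns:

```python
import pandas as pd

# Hypothetical input file in the strong-annotation format described above.
strong = pd.read_csv("strong.tsv", sep="\t")

merged_rows = []
for (fname, label), grp in strong.sort_values("onset").groupby(["filename", "event_label"]):
    cur_onset, cur_offset = None, None
    for _, row in grp.iterrows():
        if cur_onset is None:
            cur_onset, cur_offset = row.onset, row.offset
        elif row.onset - cur_offset < 0.150:      # gap shorter than 150 ms: merge the events
            cur_offset = max(cur_offset, row.offset)
        else:
            merged_rows.append((fname, cur_onset, cur_offset, label))
            cur_onset, cur_offset = row.onset, row.offset
    merged_rows.append((fname, cur_onset, cur_offset, label))

merged = pd.DataFrame(merged_rows, columns=["filename", "onset", "offset", "event_label"])
```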
Soft labels
The reference labels in MAESTRO development data are available as soft labels.
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]
Example:
a1.wav 0 1 footsteps 0.6
a1.wav 0 1 people_talking 0.9
a1.wav 1 2 footsteps 0.8
These labels can be transformed into hard (binary) labels using a 0.5 threshold, and the equivalent annotation would be:
Hard labels:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
Example:
a1.wav 0 2 footsteps
a1.wav 0 1 people_talking
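A minimal sketch of this conversion with pandas (thresholding at 0.5) could look as follows; the file and column names are assumptions, not the official MAESTRO metadata layout:

```python
import pandas as pd

# Hypothetical file name; assumes a header row with a soft_label column.
soft = pd.read_csv("maestro_soft.tsv", sep="\t")

# Thresholding at 0.5 turns the soft labels into hard (binary) ones.
hard = soft[soft["soft_label"] >= 0.5][["filename", "onset", "offset", "event_label"]]

# Temporally adjacent segments of the same class can then be merged into single events
# (e.g. footsteps 0-1 and 1-2 become footsteps 0-2), as in the merging sketch shown earlier.
```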
In the provided dataset there are 17 sound classes. Of these, only 15 classes have soft-label values over 0.5, and 4 of those are very rare. For this reason, the evaluation is conducted only against the following 11 classes:
- Birds singing
- Car
- People talking
- Footsteps
- Children voices
- Wind blowing
- Brakes squeaking
- Large vehicle
- Cutlery and dishes
- Metro approaching
- Metro leaving
Download
The DESED dataset is composed of several subsets that can be downloaded independently from the respective repositories or automatically with the Task 4 data generation script.
To access the datasets separately please refer to task 4 2021 instructions
The MAESTRO dataset can be downloaded at the following repository.
Task setup
The challenge consists of detecting sound events within audio using training data with varying types of annotation (weak hard labels, strong hard labels and strong soft labels). The detection within an audio clip should be performed with start and end timestamps. Note that an audio clip may contain more than one sound event.
Participants are allowed to use external datasets and embeddings extracted from pre-trained models. Lists of the eligible datasets and pre-trained models are provided below.
Development set
We provide 4 different splits of the training data in our development set: "Weakly labeled training set", "Unlabeled in domain training set", "Synthetic strongly labeled set" and "Soft labeled training set".
Weakly labeled training set:
This set contains 1578 clips (2244 class occurrences) for which weak annotations have been verified and cross-checked.
Unlabeled in domain training set:
This set is considerably larger than the previous one. It contains 14412 clips.
The clips are selected such that the distribution per class (based on AudioSet annotations)
is close to the distribution in the labeled set.
Note however that, given the uncertainty on AudioSet labels, this distribution might not be exactly the same.
Synthetic strongly labeled set:
This set is composed of 10000 clips generated with the Scaper soundscape synthesis and augmentation library.
The clips are generated such that the distribution per event class is close to that of the validation set.
- We used all the foreground files from the DESED synthetic soundbank (multiple times).
- We used background files annotated as "other" from a subset of the SINS dataset and files from the TUT Acoustic scenes 2017 development dataset.
- We used the clips from FUSS containing the non-target classes. The clip selection is based on FSD50K annotations. The selected clips are in the training sets of both FSD50K and FUSS. We provide tsv files corresponding to these splits.
- Event distribution statistics for both target and non-target event classes are computed on annotations obtained by human annotators for ~90k clips from AudioSet.
Shawn Hershey, Daniel PW Ellis, Eduardo Fonseca, Aren Jansen, Caroline Liu, R Channing Moore, and Manoj Plakal. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 366–370. IEEE, 2021.
We share the original data and scripts to generate soundscapes and encourage participants to create their own subsets. See the DESED github repo and the Scaper documentation for more information about how to create new soundscapes; a minimal sketch is shown below.
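The following Scaper sketch generates one 10-second soundscape together with its strong-label annotation (stored in a JAMS file). The soundbank paths and the timing/SNR distributions are placeholders for illustration, not the exact parameters used to build the official synthetic set:

```python
import scaper

# Placeholder paths to foreground/background soundbanks (e.g. the DESED synthetic soundbank).
sc = scaper.Scaper(duration=10.0, fg_path="soundbank/foreground", bg_path="soundbank/background")
sc.ref_db = -50

# One background drawn at random from the background soundbank.
sc.add_background(label=("choose", []), source_file=("choose", []), source_time=("const", 0))

# One foreground event with randomized timing, duration and SNR (illustrative distributions only).
sc.add_event(
    label=("choose", []),
    source_file=("choose", []),
    source_time=("const", 0),
    event_time=("uniform", 0, 9),
    event_duration=("uniform", 0.25, 5.0),
    snr=("uniform", 6, 30),
    pitch_shift=None,
    time_stretch=None,
)

# Writes the audio clip and a JAMS file containing the strong annotations.
sc.generate("soundscape.wav", "soundscape.jams")
```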
Soft labeled training set:
This year the MAESTRO dataset is provided with a fixed training/validation split in which approximately 70% of the data (per class) is used for training, and the rest is used for testing. Participants are required to report results on the validation set using this setup.
Sound event detection validation set
The validation set is the combination of the DESED validation set and the MAESTRO validation split described above. The overall validation set contains 1184 clips. The DESED validation set is annotated with strong labels with timestamps (obtained by human annotators), while the MAESTRO validation set is annotated with soft labels.
External data resources
List of external data resources allowed:
Dataset name | Type | Added | Link |
---|---|---|---|
YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831 |
OpenL3 | model | 12.02.2020 | https://openl3.readthedocs.io/ |
VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
COLA | model | 25.02.2023 | https://github.com/google-research/google-research/tree/master/cola |
BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a |
AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast |
PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST |
BEATs: Audio Pre-Training with Acoustic Tokenizers | model | 01.03.2023 | https://github.com/microsoft/unilm/tree/master/beats |
AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/ |
FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
ImageNet | image | 01.03.2021 | http://www.image-net.org/ |
MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/ |
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s |
Pre-trained desed embeddings (Panns, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s |
Audio Teacher-Student Transformer | model | 22.04.2024 | https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view |
TUT Acoustic scenes dataset | audio | 22.04.2024 | https://zenodo.org/records/45739 |
MicIRP | IR | 28.03.2023 | http://micirp.blogspot.com/?m=1 |
Datasets and models can be added to the list upon request until May 15th (as long as the corresponding resources are publicly available).
Task rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.
Task specific rules:
- Participants are allowed to submit up to 4 different systems.
- Participants have to submit at least one system without ensembling.
- Participants have to submit (post-processed and unprocessed) output scores from three independent model trainings with different initializations, so that the standard deviation of the model performance can be evaluated.
- Participants are allowed to use external data for system development.
- Data from other tasks is considered external data.
- Embeddings extracted from models pre-trained on external data are considered external data.
- Another example of external data is other material related to the original video, such as the rest of the audio from which the 10-second clip was extracted, the video frames, and the metadata.
- Datasets and models can be added to the list upon request until May 1st (as long as the corresponding resources are publicly available).
- The external dataset used during training should be listed in the YAML file describing the submission.
- Manipulation of provided training data is allowed.
- Participants are not allowed to use the public evaluation dataset and synthetic evaluation dataset (or part of them) to train their systems or tune hyper-parameters.
- Domain identification is prohibited: participants are not allowed to leverage, at inference time, information about whether the audio comes from MAESTRO or DESED.
Evaluation set
The evaluation dataset is composed of the DESED evaluation set and the MAESTRO evaluation set. The audio clips will be randomized and there will be no indication of the origin (DESED or MAESTRO) of the clips.
Submission
Instructions regarding the output submission format and the required metadata can be found in the example submission package.
Evaluation
Segment based metrics
System evaluation will be based on the following metrics, calculated on 1-second segments:
- micro-average F1 score \(F1_m\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- micro-average error rate \(ER_M\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- macro-average F1 score \(F1_M\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- macro-average pAUC score \(pAUC_M\), calculated using sed_scores_eval, with a false positive rate threshold of 0.1
- macro-average F1 score with optimum threshold per class \(F1_{MO}\) calculated using sed_scores_eval, based on the best F1 score per class obtained with a class-specific threshold
Polyphonic sound event detection scores
Submissions will also be evaluated with polyphonic sound event detection scores (PSDS) computed over the real recordings in the evaluation set (the performance on synthetic recordings is not taken into account in the metric). This metric is based on the intersection between events. The PSDS parameters used for evaluation are the following:
- Detection Tolerance criterion (DTC): 0.7
- Ground Truth intersection criterion (GTC): 0.7
- Cost of instability across classes (\(\alpha_{ST}\)): 1
- Cost of CTs on user experience (\(\alpha_{CT}\)): 0
- Maximum False Positive rate (e_max): 100
Collar-based F1-score
Additionally, event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets will be provided as a contrastive measure. Systems will be evaluated with the threshold fixed at 0.5 unless participants explicitly provide another operating point to be evaluated with the F1-score.
Ranking metric
The systems will be ranked according to the sum of \(pAUC_M\) (computed on the MAESTRO evaluation set, on MAESTRO classes) and \(PSDS\) (computed on the DESED evaluation set only, on DESED classes). For example, the baseline obtains a PSDS of 0.481 on DESED and a \(pAUC_M\) of 0.646 on MAESTRO, giving a ranking score of about 1.13 (see the Results section).
Multi-runs Evaluation
Further, we kindly ask participants to provide (post-processed and unprocessed) output scores from three independent model trainings with different initializations, so that the standard deviation of the model performance can be evaluated.
Energy Consumption (mandatory this year !)
Since last year, energy consumption (in kWh) is considered as an additional metric to rank the submitted systems; therefore, it is mandatory to report the energy consumption of the submitted models [11].
Participants need to provide, for each submitted system (or at least the best one), the following energy consumption figures in kWh using CodeCarbon:
1) whole system training
2) devtest inference
You can refer to Codecarbon on how to accomplish this (it's super simple! 😉).
Important Steps
- Initialize the tracker: when initializing the tracker, make sure to specify the GPU IDs that you want to track. You can do this with the parameter `gpu_ids=[torch.cuda.current_device()]`. Example:

```python
from codecarbon import OfflineEmissionsTracker
import torch  # needed for torch.cuda.current_device()

tracker = OfflineEmissionsTracker(
    gpu_ids=[torch.cuda.current_device()],
    country_iso_code="CAN",
)
tracker.start()
# Your code here
tracker.stop()
```

- Additional resources: for more hints on how we do this in the baseline system, check out `local/sed_trainer_pretrained.py`.
⚠️ In addition to this, we kindly suggest that participants also provide the energy consumption in kWh (using the same hardware as for the figures above) of:
1) training the baseline system for 10 epochs
2) devtest inference for the baseline system
Both are computed by the `python train_pretrained.py` command; you just need to set 10 epochs in `confs/default.yaml`.
You can find the energy consumed in kWh in `./exp/2024_baseline/version_X/codecarbon/emissions_baseline_training.csv` for training and `./exp/2024_baseline/version_X/codecarbon/emissions_baseline_test.csv` for devtest inference.
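Assuming the standard CodeCarbon CSV schema (with an `energy_consumed` column in kWh, which is an assumption about the column name), the figure can be read out, for instance, like this:

```python
import pandas as pd

# Path as produced by the baseline recipe; version_X is whatever version directory was created.
df = pd.read_csv("./exp/2024_baseline/version_X/codecarbon/emissions_baseline_training.csv")

# CodeCarbon logs one row per tracked run; energy_consumed is assumed to be reported in kWh.
print(df["energy_consumed"].sum(), "kWh")
```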
Energy consumption depends on hardware, and each participant uses different hardware. To account for this difference, we use the baseline training and inference kWh energy consumption as a common reference. It is important that the inference energy consumption figures for both submitted systems and baseline are computed on the same hardware under similar loading.
Multiply–accumulate (MAC) operations
This year we are introducing a new metric, complementary to the energy consumption metric: the number of multiply-accumulate operations (MACs) for 10 seconds of audio prediction, which provides information about the computational complexity of the network.
We use THOP: PyTorch-OpCounter as the framework to compute the number of multiply-accumulate operations (MACs). For more information on how to install and use THOP, the reader is referred to the THOP documentation; a minimal sketch is shown below.
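The sketch below uses a placeholder model and a dummy input standing in for 10 seconds of audio features; the shapes and layers are illustrative assumptions, not the baseline's actual ones:

```python
import torch
import torch.nn as nn
from thop import profile

# Placeholder model standing in for the submitted SED system.
model = nn.Sequential(
    nn.Conv1d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(128, 10, kernel_size=1),
)

# Dummy input representing 10 s of audio features, e.g. (batch, mel_bins, frames); shapes are illustrative.
dummy_input = torch.randn(1, 64, 1000)

macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs for 10 s of audio: {macs:.3e}, parameters: {params:.3e}")
```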
Evaluation toolboxes
Evaluation is done using the `sed_eval` and `sed_scores_eval` toolboxes; a minimal `sed_eval` sketch is shown below.
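For reference, a segment-based evaluation with `sed_eval` follows the pattern below; the reference/estimate file names are hypothetical, and the files are assumed to be in the [filename onset offset event_label] format described earlier:

```python
import sed_eval

# Hypothetical tab-separated event lists in the [filename onset offset event_label] format.
reference = sed_eval.io.load_event_list("reference.txt")
estimated = sed_eval.io.load_event_list("estimated.txt")

segment_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=reference.unique_event_labels,
    time_resolution=1.0,  # 1-second segments, as used for this task
)

# sed_eval accumulates statistics file by file.
for filename in reference.unique_files:
    segment_metrics.evaluate(
        reference_event_list=reference.filter(filename=filename),
        estimated_event_list=estimated.filter(filename=filename),
    )

print(segment_metrics.overall_f_measure())   # micro-averaged F1
print(segment_metrics.overall_error_rate())  # micro-averaged error rate
```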
Baseline system
We provide one baseline system for the task, which uses pre-trained BEATs embeddings and the strongly-annotated AudioSet data together with the DESED and MAESTRO data.
This baseline is built upon the 2023 pre-trained embedding baseline.
It exploits the pre-trained model BEATs, the current state of the art on the AudioSet classification task. In addition, it uses the strongly-annotated AudioSet data by default.
🆕 We made some changes in the loss computation as well as in the attention pooling to make sure that the baseline can now handle multiple datasets with potentially missing information.
In the proposed baseline, the frame-level embeddings are used in a late-fusion fashion with the existing CRNN baseline classifier. The temporal resolution of the frame-level embeddings is matched to that of the CNN output using adaptive average pooling, and the frame-level concatenation of the two is then fed to the RNN + MLP classifier (see the sketch below). See `desed_task/nnet/CRNN.py` for details.
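The following PyTorch sketch illustrates the late-fusion idea; shapes and layer sizes are illustrative assumptions rather than the actual baseline values (the real code lives in `desed_task/nnet/CRNN.py`):

```python
import torch
import torch.nn as nn

# Illustrative shapes: CNN output frames and BEATs embedding frames have different temporal resolutions.
cnn_out = torch.randn(8, 156, 128)     # (batch, cnn_frames, cnn_channels)
beats_emb = torch.randn(8, 496, 768)   # (batch, beats_frames, embedding_size)

# Adaptive average pooling matches the embedding resolution to the CNN output resolution.
pool = nn.AdaptiveAvgPool1d(cnn_out.shape[1])
beats_aligned = pool(beats_emb.transpose(1, 2)).transpose(1, 2)   # (batch, cnn_frames, 768)

# Late fusion: concatenate along the feature axis and feed the result to the RNN (+ MLP classifier).
fused = torch.cat([cnn_out, beats_aligned], dim=-1)               # (batch, cnn_frames, 128 + 768)
rnn = nn.GRU(input_size=fused.shape[-1], hidden_size=256, bidirectional=True, batch_first=True)
rnn_out, _ = rnn(fused)
```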
See the configuration file `./confs/pretrained.yaml`:

```yaml
pretrained:
  model: beats
  e2e: False
  freezed: True
  extracted_embeddings_dir: ./embeddings
net:
  use_embeddings: True
  embedding_size: 768
  embedding_type: frame
  aggregation_type: pool1d
```
The embeddings can be integrated using several aggregation methods: frame (the method from 2022: taking the last state of an RNN fed with the embedding sequence), interpolate (nearest-neighbour interpolation to adapt the temporal resolution) and pool1d (adaptive average pooling, as described above).
We provide pretrained checkpoints. The baseline can be tested on the development set of the dataset using the following command:
python train_pretrained.py --test_from_checkpoint /path/to/downloaded.ckpt
More detailed information about the baseline can be found in the DCASE 2024 Task 4 baseline repo.
Baseline Novelties Short Description
The baseline is the same as the pre-trained embedding DCASE 2023 Task 4 baseline, based on a Mean-Teacher model [1].
We made some changes here in order to handle both DESED and MAESTRO which can have partially missing labels (e.g. DESED events may not be annotated in MAESTRO and vice-versa).
In detail:
- We map certain classes in MAESTRO to some DESED classes (but not vice-versa) when training on MAESTRO data.
- See `local/classes_dict.py` and the function `process_tsvs` used in `train_pretrained.py`.
- When computing losses on MAESTRO and DESED, we mask the output logits that correspond to classes for which annotations are missing in the current dataset (a minimal masked-loss sketch is shown after this list).
- This masking is also applied to the attention pooling layer, see `desed_task/nnet/CRNN.py`.
- Mixup is performed only within the same dataset (i.e. only within MAESTRO or within DESED).
- To handle MAESTRO clips, which are long form, we perform overlap-add at the logit level over sliding windows, see `local/sed_trainer_pretrained.py`.
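A minimal sketch of the masked loss, with a made-up joint class list and mask (the actual mapping is defined in `local/classes_dict.py`), could look like this:

```python
import torch
import torch.nn as nn

# Made-up joint class list: the first three stand for DESED classes, the last two for MAESTRO-only classes.
classes = ["Speech", "Dog", "Dishes", "birds_singing", "car"]
# Mask for a batch of DESED clips: MAESTRO-only classes carry no annotation and must not contribute.
annotated_mask = torch.tensor([1.0, 1.0, 1.0, 0.0, 0.0])

logits = torch.randn(4, 156, len(classes))   # (batch, frames, classes) frame-level logits
targets = torch.rand(4, 156, len(classes))   # hard or soft targets in [0, 1]

# Element-wise BCE, then zero out the classes that are unannotated for this batch's dataset of origin,
# so that missing labels are never treated as negatives.
bce = nn.BCEWithLogitsLoss(reduction="none")
masked = bce(logits, targets) * annotated_mask               # mask broadcasts over batch and frames
loss = masked.sum() / (annotated_mask.sum() * logits.shape[0] * logits.shape[1])
```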
Results for the development dataset
Dataset | PSDS-scenario1 | PSDS-scenario1 (sed score) | mean pAUC |
---|---|---|---|
Dev-test | 0.50 +/- 0.01 | 0.52 +/- 0.007 | 0.637 +/- 0.04 |
Note: The Dev-test \(pAUC_M\) is computed only on MAESTRO data (and on MAESTRO classes) while the Dev-test \(PSDS_1\) is computed only on DESED data (and on DESED classes).
Note: The performance might not be exactly reproducible on a GPU-based system. That is why you can download the checkpoint of the network along with the TensorBoard events. Launch `python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt` to test this model.
Energy consumption during the training and evaluation phase
Energy consumption for one run on an NVIDIA A100 40GB GPU on a single DGX A100 machine, for a training phase and an inference phase on the development set.
Training (kWh) | Dev-test (kWh) |
---|---|
1.542 +/- 0.341 | 0.133 +/- 0.03 |
Repositories
Citations
Samuele Cornell, Janek Ebbers, Constance Douwes, Irene Martín-Morató, Manu Harju, Annamaria Mesaros, and Romain Serizel. DCASE 2024 Task 4: sound event detection with heterogeneous data and missing labels. 2024. arXiv:2406.08056.
DCASE 2024 Task 4: Sound Event Detection with Heterogeneous Data and Missing Labels
Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.
Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation
Francesca Ronchini, Romain Serizel, Nicolas Turpault, and Samuele Cornell. The impact of non-target events in synthetic soundscapes for sound event detection. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 115–119. Barcelona, Spain, November 2021.
The Impact of Non-Target Events in Synthetic Soundscapes for Sound Event Detection
Abstract
Detection and Classification Acoustic Scene and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect the performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. Firstly, we investigate to what extent using non-target events alternatively during the training or validation phase (or none of them) helps the system to correctly detect target events. Secondly, we analyze to what extend adjusting the signal-to-noise ratio between target and non-target events at training improves the sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on non-target subset and acoustic similarity between target and non-target events which might confuse the system.
Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.
Sound event detection in domestic environments with weakly labeled data and soundscape synthesis
Results
All confidence intervals are computed based on the three runs per system and bootstrapping on the evaluation set. The table below includes only the best-ranking system per submitting team, without ensembling.
Submission code | Ranking score (Evaluation dataset) | PSDS (DESED evaluation dataset) | mpAUC (MAESTRO evaluation dataset) |
---|---|---|---|
Schmid_CPJKU_task4_2 | 1.35 | 0.642 (0.612 - 0.675) | 0.711 (0.704 - 0.717) |
Nam_KAIST_task4_2 | 1.32 | 0.583 (0.560 - 0.601) | 0.738 (0.732 - 0.745) |
Zhang_BUPT_task4_1 | 1.23 | 0.528 (0.502 - 0.549) | 0.704 (0.704 - 0.705) |
Chen_CHT_task4_1 | 1.23 | 0.499 (0.474 - 0.523) | 0.733 (0.730 - 0.739) |
Kim_GIST-HanwhaVision_task4_1 | 1.23 | 0.564 (0.545 - 0.586) | 0.665 (0.646 - 0.677) |
Chen_NCUT_task4_3 | 1.20 | 0.529 (0.510 - 0.555) | 0.675 (0.675 - 0.675) |
Huang_SJTU_task4_4 | 1.20 | 0.523 (0.500 - 0.552) | 0.678 (0.669 - 0.685) |
LEE_KT_task4_1 | 1.19 | 0.503 (0.458 - 0.562) | 0.684 (0.672 - 0.693) |
Baseline | 1.13 | 0.481 (0.456 - 0.505) | 0.646 (0.641 - 0.653) |
XIAO_FMSG-JLESS_task4_3 | 1.12 | 0.571 (0.540 - 0.587) | 0.553 (0.553 - 0.553) |
Lyu_SCUT_task4_2 | 1.10 | 0.484 (0.461 - 0.505) | 0.612 (0.596 - 0.624) |
Niu_XJU_task4_1 | 1.07 | 0.469 (0.444 - 0.499) | 0.603 (0.599 - 0.610) |
Cai_USTC_task4_2 | 0.63 | 0.577 (0.559 - 0.598) | 0.050 (0.050 - 0.050) |