Sound Event Detection with Soft Labels


Task description

The goal of this task is to evaluate systems for sound event detection that use soft-labeled data for training, in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled data. The main focus of this subtask is to investigate whether using soft labels brings any improvement in performance.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task is a subtopic of the Sound event detection task (Task 4), which provides weakly labeled data (without timestamps), strongly labeled synthetic data (with timestamps), and unlabeled data for training. Systems must provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording.

Specific to this subtask is another type of training data:

  • Soft labels are provided as a number between 0 and 1 that characterizes the certainty of the human annotators that the sound is active at that specific time
  • The temporal resolution of the provided data is 1 second (due to the annotation procedure)
  • Systems will be evaluated against hard labels, obtained by thresholding the soft labels at 0.5; anything above 0.5 is considered 1 (sound active), anything below 0.5 is considered 0 (sound inactive)

Research question: Do soft labels contain any useful additional information to help train better sound event detection systems?

Audio dataset

The development set provided for this task is MAESTRO Real. The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions. The full procedure for annotation and aggregation of multiple opinions can be found in the publication provided below.
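The aggregation itself is described in the publication below; purely as an illustration of the idea (not the published procedure), a soft label for one class in one 1-second segment can be thought of as a competence-weighted fraction of the annotators who marked that class active. The function name and arguments below are hypothetical:

```python
import numpy as np

def aggregate_soft_label(votes, competences=None):
    """Toy aggregation of binary annotator votes for one class in one 1-second
    segment into a soft label in [0, 1].

    votes       : 0/1 opinions, one per annotator
    competences : optional per-annotator weights (e.g. MACE competence estimates);
                  uniform weighting is assumed when omitted
    """
    votes = np.asarray(votes, dtype=float)
    if competences is None:
        competences = np.ones_like(votes)
    competences = np.asarray(competences, dtype=float)
    return float(np.sum(competences * votes) / np.sum(competences))

# e.g. 5 annotators, 3 of whom marked "footsteps" active in this segment
print(aggregate_soft_label([1, 1, 0, 1, 0]))  # 0.6
```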

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.


Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

Abstract

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.



Reference labels

The reference labels for the development data are available as soft labels. Their format is as follows:

Soft labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]

Example:

a1.wav       0  1   footsteps   0.6
a1.wav       0  1   people_talking      0.9
a1.wav       1  2   footsteps   0.8

These labels can be transformed into hard (binary) labels using the 0.5 threshold; the equivalent annotation would be:

Hard labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]

Example:

a1.wav       0  2   footsteps
a1.wav       0  1   people_talking
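The conversion from soft-label files to hard-label files can be scripted directly. Below is a minimal sketch, assuming tab-separated files in the two formats shown above with no header row; the function name soft_to_hard and the exact rule for merging consecutive active segments are illustrative assumptions, not provided task code.

```python
import csv

def soft_to_hard(soft_tsv, hard_tsv, threshold=0.5):
    """Convert 1-second soft-label segments to hard labels by thresholding at 0.5
    and merging back-to-back active segments of the same class."""
    # keep (filename, onset, offset, label) for segments whose soft label exceeds the threshold
    active = []
    with open(soft_tsv) as f:
        for filename, onset, offset, label, soft in csv.reader(f, delimiter="\t"):
            if float(soft) > threshold:
                active.append((filename, float(onset), float(offset), label))

    # merge consecutive active segments of the same (file, class) into one event
    active.sort(key=lambda r: (r[0], r[3], r[1]))
    merged = []
    for filename, onset, offset, label in active:
        if merged and merged[-1][0] == filename and merged[-1][3] == label \
                and abs(merged[-1][2] - onset) < 1e-6:
            merged[-1][2] = offset  # extend the previous event
        else:
            merged.append([filename, onset, offset, label])

    with open(hard_tsv, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for row in merged:
            writer.writerow(row)
```

Applied to the example above, the two consecutive footsteps segments (0.6 and 0.8) become a single footsteps event from 0 to 2 seconds, and people_talking remains a single event from 0 to 1 second.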

The provided dataset contains 17 sound classes. Of these, only 15 have soft-label values over 0.5, and 4 of those are very rare. For this reason, the evaluation is conducted only against the following 11 classes:

  • Birds singing
  • Car
  • People talking
  • Footsteps
  • Children voices
  • Wind blowing
  • Brakes squeaking
  • Large vehicle
  • Cutlery and dishes
  • Metro approaching
  • Metro leaving

Download


Task setup

Participants must use the soft labels when training their system. However, they are also allowed to use external datasets and embeddings extracted from pre-trained models, in any combination. This means that it is possible to use both hard and soft labels in the same training setup, as well as other data. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until April 1st 2023 (as long as the corresponding resources are publicly available).

Note also that each participant should submit at least one system that does not use external data.

Development dataset

The development set consists of X files with a total duration of X minutes. The dataset is provided with a 5-fold cross-validation setup in which approximately 70% of the data (per class) is used for training and the rest for testing. Participants are required to report development set results using this setup. Please note that a correct calculation of performance requires running training and testing for each fold (so, 5 train/test rounds) and evaluating performance only after that. This allows the entire list of files to be evaluated at once, in contrast to evaluating each fold separately and averaging the 5 values; because of the data imbalance between folds, the pooled evaluation is more stable.
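As an illustration of this pooled evaluation, here is a toy numpy sketch (not the official evaluation code); the fold sizes, class count, and random data are placeholders.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary 1-second segment activities (all classes pooled)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# stand-ins for per-fold test references and binarized system outputs (segments x classes)
rng = np.random.default_rng(0)
y_true_folds = [rng.integers(0, 2, size=(100, 11)) for _ in range(5)]
y_pred_folds = [rng.integers(0, 2, size=(100, 11)) for _ in range(5)]

# pool the test outputs of all five folds, then evaluate once over the whole development set
pooled_f1 = micro_f1(np.concatenate(y_true_folds), np.concatenate(y_pred_folds))

# contrast: averaging five per-fold scores weights each fold equally regardless of its size
per_fold_mean = np.mean([micro_f1(t, p) for t, p in zip(y_true_folds, y_pred_folds)])
print(pooled_f1, per_fold_mean)
```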

Evaluation dataset

The evaluation dataset consists of 26 files with a total length of 97 minutes. Only audio is provided for the evaluation set.

External data resources

List of external data resources allowed:

Resource name | Type | Added | Link
YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831
OpenL3 | model | 12.02.2020 | https://openl3.readthedocs.io/
VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish
COLA | model | 25.02.2023 | https://github.com/google-research/google-research/tree/master/cola
BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a
AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast
PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST
AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/
FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432
ImageNet | image | 01.03.2021 | http://www.image-net.org/
MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s
Pre-trained desed embeddings (Panns, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s


Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task specific rules:

  • Participants are allowed to submit up to 4 different systems.
  • Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
  • Data from other tasks is considered external data.
  • Embeddings extracted from models pre-trained on external data are considered external data.
  • Hard labels of the same dataset are not considered external data.
  • Manipulation of provided training data is allowed.
  • Participants are not allowed to use the evaluation dataset (or part of it) to train their systems or tune hyper-parameters.

Submission

Instructions regarding the output submission format and the required metadata can be found in the example submission package.


Evaluation

System evaluation will be based on the following metrics, calculated in 1-second segments:

  • micro-average F1 score \(F1_m\), calculated using sed_eval with a decision threshold of 0.5 applied to the system output provided by participants
  • micro-average error rate \(ER_m\), calculated using sed_eval with the same 0.5 decision threshold
  • macro-average F1 score \(F1_M\), calculated using sed_eval with the same 0.5 decision threshold
  • macro-average F1 score with optimum threshold per class \(F1_{MO}\), calculated using sed_scores_eval from the best F1 score per class obtained with a class-specific threshold (see the sketch below)
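As a rough illustration of the per-class threshold search behind \(F1_{MO}\), here is a toy numpy sketch; the official metric is computed with sed_scores_eval, and the function name, threshold grid, and segment-matrix representation below are assumptions.

```python
import numpy as np

def best_f1_per_class(scores, targets, thresholds=np.linspace(0.01, 0.99, 99)):
    """For each class, pick the decision threshold that maximizes segment-based F1,
    then macro-average the best per-class scores.

    scores  : (n_segments, n_classes) system outputs in [0, 1]
    targets : (n_segments, n_classes) binary reference activities
    """
    best = []
    for c in range(scores.shape[1]):
        f1s = []
        for th in thresholds:
            pred = scores[:, c] >= th
            tp = np.sum(pred & (targets[:, c] == 1))
            fp = np.sum(pred & (targets[:, c] == 0))
            fn = np.sum(~pred & (targets[:, c] == 1))
            f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        best.append(max(f1s))
    return float(np.mean(best))
```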

Evaluation toolboxes

Evaluation is done using sed_eval and sed_scores_eval toolboxes:
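A condensed sketch of the segment-based evaluation pattern from the sed_eval documentation is given below; the file paths are placeholders, and the per-file loop follows the library's tutorial rather than any task-provided script.

```python
import sed_eval

# Placeholder paths; the reference and estimate files follow the hard-label
# format shown earlier (filename, onset, offset, event_label).
reference_event_list = sed_eval.io.load_event_list(filename='development_reference.txt')
estimated_event_list = sed_eval.io.load_event_list(filename='system_output.txt')

segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=reference_event_list.unique_event_labels,
    time_resolution=1.0,  # metrics computed in 1-second segments
)

# evaluate file by file, accumulating the intermediate statistics
for filename in reference_event_list.unique_files:
    segment_based_metrics.evaluate(
        reference_event_list=reference_event_list.filter(filename=filename),
        estimated_event_list=estimated_event_list.filter(filename=filename),
    )

print(segment_based_metrics)  # prints overall (micro) and class-wise (macro) results
```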




Task Ranking

Ranking of the systems will be done based on \(F1_{MO}\).

Results

Submission code | Author | Affiliation | Technical report | F1_MO
Bai_JLESS_task4b_1 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 58.21
Bai_JLESS_task4b_2 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 59.77
Bai_JLESS_task4b_3 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 58.00
Bai_JLESS_task4b_4 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 60.74
Cai_NCUT_task4b_1 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 43.60
Cai_NCUT_task4b_2 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 43.58
Cai_NCUT_task4b_3 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 42.14
Liu_NJUPT_task4b_1 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 19.82
Liu_NJUPT_task4b_2 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 20.83
Liu_NJUPT_task4b_3 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 22.53
Liu_NJUPT_task4b_4 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 22.46
Liu_SRCN_task4b_1 | Yangyang Liu | Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China | task-sound-event-detection-with-soft-labels-results#Jin2023 | 44.69
Liu_SRCN_task4b_2 | Yangyang Liu | Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China | task-sound-event-detection-with-soft-labels-results#Jin2023 | 52.03
DCASE2023 baseline | Irene Martin | Computing Sciences, Tampere University, Tampere, Finland | task-sound-event-detection-with-soft-labels-results#Martin2023 | 43.44
Min_KAIST_task4b_1 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 48.95
Min_KAIST_task4b_2 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 48.72
Min_KAIST_task4b_3 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 45.21
Min_KAIST_task4b_4 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 46.24
Nhan_VNUHCMUS_task4b_1 | Tri-Do Nhan | Computing Sciences, University of Science, Vietnam National University | task-sound-event-detection-with-soft-labels-results#Nhan2023 | 47.17
Xu_SJTU_task4b_1 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 46.13
Xu_SJTU_task4b_2 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 50.88
Xu_SJTU_task4b_3 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 51.13
Xu_SJTU_task4b_4 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 46.99

Complete results and technical reports can be found on the results page.

Baseline system

The baseline system is a CRNN with a linear output layer, trained against the soft labels with a mean squared error (MSE) loss. The architecture consists of three CNN layers and one bidirectional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted with a hop length of 200 ms and 64 mel filter banks.
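For reference, a possible feature-extraction sketch matching this description is shown below (librosa-based); the sample rate, FFT size, and log scaling are assumptions rather than the exact baseline settings.

```python
import librosa

def extract_mel_energies(audio_path, sr=44100, n_mels=64, hop_seconds=0.2):
    """Log mel-band energies roughly matching the baseline description
    (64 mel bands, 200 ms hop, i.e. 5 frames per 1-second annotation segment)."""
    y, sr = librosa.load(audio_path, sr=sr)            # sample rate is an assumption
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=2048,                                    # FFT size is an assumption
        hop_length=int(hop_seconds * sr),              # 200 ms hop as stated above
        n_mels=n_mels,
    )
    return librosa.power_to_db(mel).T                  # shape: (frames, n_mels)
```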

Repository


Parameters

Neural network:

  • Input shape: sequence_length * 64
  • Architecture:
  • CNN layer #1
    • 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 5)) + Dropout (rate: 20%)
  • CNN layer #2
    • 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
  • CNN layer #3
    • 2D Convolutional layer (filters: 32, kernel size: 3) + Batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
  • Permute
  • Bidirectional GRU layer #1
  • Dense layer #1
    • Dense layer (units: 64, activation: linear)
    • Dropout (rate: 30%)
  • Dense layer #2
    • Dense layer (units: 32, activation: linear)
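A Keras sketch roughly following this layer list is given below; the GRU size, the use of a Reshape in place of the Permute, and the output size (parameterized here by the number of target classes, whereas the list above gives 32 units) are assumptions, not the released baseline code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(sequence_length, n_mels=64, n_classes=11, gru_units=32):
    """CRNN sketch following the layer list above, trained with MSE against soft labels."""
    inputs = layers.Input(shape=(sequence_length, n_mels, 1))

    x = inputs
    for filters, freq_pool in [(128, 5), (128, 2), (32, 2)]:
        x = layers.Conv2D(filters, kernel_size=3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D(pool_size=(1, freq_pool))(x)   # pool frequency only
        x = layers.Dropout(0.2)(x)

    # collapse the remaining frequency axis into the feature dimension
    x = layers.Reshape((sequence_length, -1))(x)

    x = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(x)

    x = layers.Dense(64, activation='linear')(x)
    x = layers.Dropout(0.3)(x)
    # per-frame linear outputs; the task page lists 32 units for this layer
    outputs = layers.Dense(n_classes, activation='linear')(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')  # MSE against the soft labels
    return model
```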

Results for the development dataset

System | ER_m (micro-average) | F1_m (micro-average) | F1_M (macro-average) | F1_MO (macro-average, optimum threshold per class)
Baseline | 0.479 | 71.50 % | 35.21 % | 44.13 %

Citation

If you are using the audio dataset, please cite the following paper:

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.


Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

Abstract

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.



If you are using the baseline, please cite the following paper:

Publication

Irene Martín-Morató, Manu Harju, Paul Ahokas, and Annamaria Mesaros. Training sound event detection with soft labels from crowdsourced annotations. In Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2023.


Training Sound Event Detection with Soft Labels from Crowdsourced Annotations

Abstract

In this paper, we study the use of soft labels to train a system for sound event detection (SED). Soft labels can result from annotations which account for human uncertainty about categories, or emerge as a natural representation of multiple opinions in annotation. Converting annotations to hard labels results in unambiguous categories for training, at the cost of losing the details about the labels distribution. This work investigates how soft labels can be used, and what benefits they bring in training a SED system. The results show that the system is capable of learning information about the activity of the sounds which is reflected in the soft labels and is able to detect sounds that are missed in the typical binary target training setup. We also release a new dataset produced through crowdsourcing, containing temporally strong labels for sound events in real-life recordings, with both soft and hard labels.
