# Sound Event Detection with Soft Labels

### Coordinators

Annamaria Mesaros, Irene Martín-Morató, Toni Heittola

The goal of this task is to evaluate systems for the detection of sound events that use softly labeled data for training, in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled data. The main focus of this subtask is to investigate whether using soft labels brings any improvement in performance.

# Description

This task is a subtopic of the Sound event detection task (task 4), which provides weakly labeled data (without timestamps), strongly labeled synthetic data (with timestamps), and unlabeled data for training. The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see also Fig 1).

Specific to this subtask is another type of training data:

• Soft labels are provided as a number between 0 and 1 that characterizes the certainty of human annotators about the sound being present at that specific time
• The temporal resolution of the provided data is 1 second (due to the annotation procedure)
• Systems will be evaluated against hard labels, obtained by thresholding the soft labels at 0.5: anything above 0.5 is considered 1 (sound active), anything below 0.5 is considered 0 (sound inactive)

Research question: Do soft labels contain any useful additional information to help train better sound event detection systems?

# Audio dataset

The development set provided for this task is MAESTRO Real. The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions. The full procedure for annotation and aggregation of multiple opinions can be found in the publication provided below.

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.

#### Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

##### Abstract

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.

## Reference labels

The reference labels for the development data are available as soft labels. Their format is as follows:

Soft labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]


Example:

a1.wav       0  1   footsteps   0.6
a1.wav       0  1   people_talking      0.9
a1.wav       1  2   footsteps   0.8


These labels can be transformed into hard (binary) labels, using the 0.5 threshold, and the equivalent annotation would be

Hard labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]


Example:

a1.wav       0  2   footsteps
a1.wav       0  1   people_talking

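The conversion above can be sketched in a few lines: threshold each 1-second annotation at 0.5, then merge temporally adjacent active segments of the same class. This is a minimal illustration of the rule described in the text, not the official conversion script; the function name and row format are assumptions.

```python
from collections import defaultdict

def soft_to_hard(soft_rows, threshold=0.5):
    """Convert (filename, onset, offset, label, soft) rows to hard rows:
    keep segments whose soft label exceeds the threshold, then merge
    contiguous or overlapping active segments of the same class."""
    active = defaultdict(list)  # (filename, label) -> [(onset, offset), ...]
    for fname, onset, offset, label, soft in soft_rows:
        if soft > threshold:  # above 0.5 -> sound considered active
            active[(fname, label)].append((onset, offset))
    hard = []
    for (fname, label), segs in active.items():
        segs.sort()
        merged = [list(segs[0])]
        for on, off in segs[1:]:
            if on <= merged[-1][1]:  # touches or overlaps previous segment
                merged[-1][1] = max(merged[-1][1], off)
            else:
                merged.append([on, off])
        for on, off in merged:
            hard.append((fname, on, off, label))
    return sorted(hard)

rows = [
    ("a1.wav", 0.0, 1.0, "footsteps", 0.6),
    ("a1.wav", 0.0, 1.0, "people_talking", 0.9),
    ("a1.wav", 1.0, 2.0, "footsteps", 0.8),
]
hard = soft_to_hard(rows)
```

Run on the example above, the two footsteps segments (0.6 and 0.8, both above 0.5) merge into a single 0–2 s event, reproducing the hard-label listing.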

In the provided dataset there are 17 sound classes. Of these, only 15 classes have soft-label values over 0.5, and 4 of those are very rare. For this reason, the evaluation is conducted only against the following 11 classes:

• Birds singing
• Car
• People talking
• Footsteps
• Children voices
• Wind blowing
• Brakes squeaking
• Large vehicle
• Cutlery and dishes
• Metro approaching
• Metro leaving

Participants must use the soft labels in training their system. However, participants are also allowed to use external datasets and embeddings extracted from pre-trained models, in any combination. This means that it is possible to use both hard and soft labels in the same training setup, and other data as well. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until April 1st 2023 (as long as the corresponding resources are publicly available).

Note also that each participant should submit at least one system that is not using external data.

## Development dataset

The development set consists of X files with a total duration of X minutes. The dataset is provided with a 5-fold cross-validation setup in which approximately 70% of the data (per class) is used for training, and the rest is used for testing. Participants are required to report the development set results using this setup. Please note that, for a correct calculation of performance, training and testing must be run for each fold (so, 5 train/test rounds), with performance evaluated only after that, over the entire list of files at once, rather than per fold with the 5 values averaged. Due to data imbalance between folds, this pooled evaluation is more stable.
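The difference between pooled and per-fold-averaged evaluation can be shown with a toy micro-F1 computation. The fold contents below are invented numbers chosen only to illustrate why pooling is preferred when folds are imbalanced; they are not MAESTRO statistics.

```python
def micro_f1(pairs):
    """pairs: list of (reference, prediction) binary activity indicators."""
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Two imbalanced folds: fold A has many positives, fold B almost none.
fold_a = [(1, 1)] * 8 + [(1, 0)] * 2      # per-fold F1 = 16/18
fold_b = [(1, 0), (0, 1)] + [(0, 0)] * 8  # per-fold F1 = 0

pooled = micro_f1(fold_a + fold_b)          # evaluate all files at once
averaged = (micro_f1(fold_a) + micro_f1(fold_b)) / 2  # average of 5 (here 2) folds
```

Here the pooled score is 0.8, while averaging the fold scores gives roughly 0.44: the tiny, nearly positive-free fold B drags the average down, which is exactly the instability the pooled setup avoids.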

## Evaluation dataset

The evaluation dataset consists of X files with a total length of X minutes. Only audio is provided for the evaluation set.

# External data resources

List of external data resources allowed:

| Resource | Type | Added | URL |
| --- | --- | --- | --- |
| YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831 |
| VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
| BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a |
| AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast |
| PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST |
| FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
| ImageNet | image | 01.03.2021 | http://www.image-net.org/ |
| MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/ |
| DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s |
| Pre-trained DESED embeddings (PANNs, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s |

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

• Participants are allowed to submit up to 4 different systems.
• Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
• Data from other tasks is considered external data.
• Embeddings extracted from models pre-trained on external data are considered external data.
• Hard labels of the same dataset are not considered external data.
• Manipulation of provided training data is allowed.
• Participants are not allowed to use the evaluation dataset (or part of it) to train their systems or tune hyper-parameters.

# Submission

Instructions regarding the output submission format and the required metadata will be added later.

# Evaluation

System evaluation will be based on the following metrics, calculated in 1s-segments:

• micro-average F1 score $$F1_m$$, calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
• micro-average error rate $$ER_m$$, calculated using sed_eval, with the same 0.5 decision threshold
• macro-average F1 score $$F1_M$$, calculated using sed_eval, with the same 0.5 decision threshold
• macro-average F1 score with optimum threshold per class $$F1_{MO}$$, calculated using sed_scores_eval, based on the best F1 score per class obtained with a class-specific threshold
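For intuition, the segment-based micro-average F1 can be sketched from scratch: roll each event list out into 1-second segments of binary activity, then count segment-wise true positives, false positives, and false negatives. This is a simplified illustration of what the sed_eval toolbox computes (its `SegmentBasedMetrics` with `time_resolution=1.0` is the reference implementation); the helper names here are assumptions.

```python
def segment_activity(events, n_segments, classes, seg_len=1.0):
    """Roll a list of (onset, offset, label) events out into
    per-segment binary activity at a fixed segment length."""
    act = {(i, c): 0 for i in range(n_segments) for c in classes}
    for onset, offset, label in events:
        for i in range(n_segments):
            s, e = i * seg_len, (i + 1) * seg_len
            if onset < e and offset > s:  # event overlaps this segment
                act[(i, label)] = 1
    return act

def segment_micro_f1(ref_events, est_events, n_segments, classes):
    """Micro-averaged F1 over all (segment, class) decisions."""
    ref = segment_activity(ref_events, n_segments, classes)
    est = segment_activity(est_events, n_segments, classes)
    tp = sum(1 for k in ref if ref[k] and est[k])
    fp = sum(1 for k in ref if not ref[k] and est[k])
    fn = sum(1 for k in ref if ref[k] and not est[k])
    return 2 * tp / (2 * tp + fp + fn)

ref = [(0, 2, "footsteps"), (0, 1, "people_talking")]
est = [(0, 1, "footsteps"), (0, 1, "people_talking")]
f1 = segment_micro_f1(ref, est, 2, ["footsteps", "people_talking"])
```

With the reference annotation from the earlier example and a system that misses the second footsteps segment, the score is 2·2 / (2·2 + 0 + 1) = 0.8.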

## Evaluation toolboxes

Evaluation is done using the sed_eval and sed_scores_eval toolboxes.

Ranking of the systems will be done based on $$F1_{MO}$$.

# Baseline system

The baseline system is a CRNN with a linear output layer, trained on the soft labels with a mean squared error (MSE) loss. The system architecture consists of three CNN layers and one bidirectional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted using a hop length of 200 ms and 64 mel filter banks.

### Parameters

Neural network:

• Input shape: sequence_length × 64
• Architecture:
  • CNN layer #1
    • 2D convolutional layer (filters: 128, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 5)) + dropout (rate: 20%)
  • CNN layer #2
    • 2D convolutional layer (filters: 128, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + dropout (rate: 20%)
  • CNN layer #3
    • 2D convolutional layer (filters: 32, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + dropout (rate: 20%)
  • Permute
  • Bidirectional GRU layer #1
  • Dense layer #1
    • Dense layer (units: 64, activation: linear)
    • Dropout (rate: 30%)
  • Dense layer #2
    • Dense layer (units: 32, activation: linear)
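The architecture above can be sketched in PyTorch. This is an illustrative reconstruction from the listed parameters, not the official baseline code: the GRU width (32 units per direction) and the final linear projection to the 11 evaluated classes are assumptions, since the listing does not state them.

```python
import torch
import torch.nn as nn

class CRNNBaseline(nn.Module):
    """Sketch of the described CRNN: 3 CNN blocks, a bidirectional GRU,
    and linear dense layers trained with MSE against soft labels."""
    def __init__(self, n_classes=11, n_mels=64):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
                nn.MaxPool2d((1, pool)),   # pool only the frequency axis
                nn.Dropout(0.2),
            )
        self.cnn = nn.Sequential(
            block(1, 128, 5), block(128, 128, 2), block(128, 32, 2))
        feat = 32 * (n_mels // 5 // 2 // 2)  # 32 channels x 3 mel bins = 96
        # GRU width of 32 per direction is an assumption
        self.gru = nn.GRU(feat, 32, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(64, 64),   # dense #1, linear activation
            nn.Dropout(0.3),
            nn.Linear(64, 32),   # dense #2, linear activation
            nn.Linear(32, n_classes),  # assumed linear output layer
        )

    def forward(self, x):                      # x: (batch, time, n_mels)
        z = self.cnn(x.unsqueeze(1))           # -> (batch, 32, time, 3)
        z = z.permute(0, 2, 1, 3).flatten(2)   # -> (batch, time, 96)
        z, _ = self.gru(z)                     # -> (batch, time, 64)
        return self.head(z)                    # -> (batch, time, n_classes)

model = CRNNBaseline()
out = model(torch.randn(2, 10, 64))  # 2 clips, 10 frames, 64 mel bands
```

Training against the soft labels would then use something like `nn.MSELoss()(out, soft_targets)`, with `soft_targets` holding the per-second soft activity values; the time pooling needed to map 200 ms frames onto 1 s labels is omitted here for brevity.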

### Results for the development dataset

| System | Micro-average $$ER_m$$ | Micro-average $$F1_m$$ | Macro-average $$F1_M$$ | Macro-average $$F1_{MO}$$ |
| --- | --- | --- | --- | --- |
| Baseline | 0.479 | 71.50 % | 35.21 % | 44.13 % |

# Citation

If you are using the audio dataset, please cite the following paper:

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31():902–914, 2023. doi:10.1109/TASLP.2022.3233468.


If you are using the baseline, please cite the following paper:

Publication

Irene Martín-Morató, Manu Harju, Paul Ahokas, and Annamaria Mesaros. Training sound event detection with soft labels from crowdsourced annotations. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP). 2023.

#### Training Sound Event Detection with Soft Labels from Crowdsourced Annotations

##### Abstract

In this paper, we study the use of soft labels to train a system for sound event detection (SED). Soft labels can result from annotations which account for human uncertainty about categories, or emerge as a natural representation of multiple opinions in annotation. Converting annotations to hard labels results in unambiguous categories for training, at the cost of losing the details about the labels distribution. This work investigates how soft labels can be used, and what benefits they bring in training a SED system. The results show that the system is capable of learning information about the activity of the sounds which is reflected in the soft labels and is able to detect sounds that are missed in the typical binary target training setup. We also release a new dataset produced through crowdsourcing, containing temporally strong labels for sound events in real-life recordings, with both soft and hard labels.