The goal of this task is to evaluate sound event detection systems that use soft-labeled data for training, in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled data. The main focus of this subtask is to investigate whether using soft labels brings any improvement in performance.
The challenge has ended. Full results for this task can be found on the Results page.
Description
This task is a subtopic of the Sound event detection task (task 4), which provides weakly labeled data (without timestamps), strongly labeled synthetic data (with timestamps), and unlabeled data for training. The target systems must provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording.
Specific to this subtask is another type of training data:
- Soft labels are provided as a number between 0 and 1 that characterizes the certainty of the human annotators that the sound is active at that specific time
- The temporal resolution of the provided data is 1 second (due to the annotation procedure)
- Systems will be evaluated against hard labels, obtained by thresholding the soft labels at 0.5: anything above 0.5 is considered 1 (sound active), anything below 0.5 is considered 0 (sound inactive)
Research question: Do soft labels contain any useful additional information to help train better sound event detection systems?
Audio dataset
The development set provided for this task is MAESTRO Real. The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions. The full procedure for annotation and aggregation of multiple opinions can be found in the publication provided below.
Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.
Reference labels
The reference labels for the development data are available as soft labels. Their format is as follows:
Soft labels:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]
Example:
a1.wav 0 1 footsteps 0.6
a1.wav 0 1 people_talking 0.9
a1.wav 1 2 footsteps 0.8
These labels can be transformed into hard (binary) labels using the 0.5 threshold (a small conversion sketch is given after the example); the equivalent annotation would be:
Hard labels:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
Example:
a1.wav 0 2 footsteps
a1.wav 0 1 people_talking
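For illustration, a minimal Python sketch of this conversion is given below. It assumes the tab-separated soft-label format shown above; pandas is an assumed dependency and the file names are hypothetical. The sketch thresholds the soft labels at 0.5 and merges consecutive active 1-second segments of the same class into hard-label events.

```python
# Minimal sketch: threshold soft labels at 0.5 and merge consecutive
# active segments of the same class into hard-label events.
# Assumes a tab-separated file with the columns shown above (no header).
import pandas as pd

def soft_to_hard(soft_label_file, threshold=0.5):
    cols = ["filename", "onset", "offset", "event_label", "soft_label"]
    df = pd.read_csv(soft_label_file, sep="\t", names=cols)
    active = df[df["soft_label"] > threshold].sort_values(
        ["filename", "event_label", "onset"])

    events = []  # merged hard-label events
    for (filename, label), group in active.groupby(["filename", "event_label"]):
        current = None
        for _, row in group.iterrows():
            if current is not None and row["onset"] <= current["offset"]:
                # segment is contiguous with the current event: extend it
                current["offset"] = max(current["offset"], row["offset"])
            else:
                if current is not None:
                    events.append(current)
                current = {"filename": filename, "onset": row["onset"],
                           "offset": row["offset"], "event_label": label}
        if current is not None:
            events.append(current)
    return pd.DataFrame(events, columns=["filename", "onset", "offset", "event_label"])

# Example (hypothetical file names):
# hard = soft_to_hard("development_soft_labels.tsv")
# hard.to_csv("development_hard_labels.tsv", sep="\t", index=False, header=False)
```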
The provided dataset contains 17 sound classes. Of these, only 15 have soft-label values above 0.5, and 4 of those are very rare. For this reason, the evaluation is conducted only on the following 11 classes:
- Birds singing
- Car
- People talking
- Footsteps
- Children voices
- Wind blowing
- Brakes squeaking
- Large vehicle
- Cutlery and dishes
- Metro approaching
- Metro leaving
Task setup
Participants must use the soft labels when training their system. However, they are also allowed to use external datasets and embeddings extracted from pre-trained models, combined with the provided data in any way. This means that it is possible to use both hard and soft labels in the same training setup, as well as other data. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until April 1st 2023 (as long as the corresponding resources are publicly available).
Note also that each participant should submit at least one system that does not use external data.
Development dataset
The development set consists of X files with a total duration of X minutes. The dataset is provided with a 5-fold cross-validation setup in which approximately 70% of the data (per class) is used for training and the rest for testing. Participants are required to report development set results using this setup. Please note that a correct calculation of performance requires running training and testing for each fold (i.e., 5 train/test rounds) and evaluating performance only after that, over the entire list of files at once, rather than evaluating each fold separately and averaging the 5 values. Because of data imbalance between the folds, this overall evaluation is more stable.
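As a simple illustration of this protocol, the sketch below (Python; the per-fold output file names are hypothetical placeholders) pools the test outputs of the five folds into a single file, which can then be evaluated in one pass:

```python
# Minimal sketch: concatenate the per-fold test outputs so that the
# development-set metrics are computed once over all files, instead of
# averaging five per-fold values. File names are hypothetical placeholders.
with open("development_pooled_output.txt", "w") as pooled:
    for fold in range(1, 6):
        with open(f"fold{fold}_test_output.txt") as fold_output:
            pooled.write(fold_output.read())
```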
Evaluation dataset
The evaluation dataset consists of 26 files with a total length of 97 minutes. Only audio is provided for the evaluation set.
External data resources
List of external data resources allowed:
Dataset name | Type | Added | Link |
---|---|---|---|
YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831 |
OpenL3 | model | 12.02.2020 | https://openl3.readthedocs.io/ |
VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
COLA | model | 25.02.2023 | https://github.com/google-research/google-research/tree/master/cola |
BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a |
AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast |
PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST |
AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/ |
FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
ImageNet | image | 01.03.2021 | http://www.image-net.org/ |
MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/ |
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s |
Pre-trained desed embeddings (Panns, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s |
Task rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.
Task specific rules:
- Participants are allowed to submit up to 4 different systems.
- Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
- Data from other tasks is considered external data.
- Embeddings extracted from models pre-trained on external data are considered external data.
- Hard labels of the same dataset are not considered external data.
- Manipulation of provided training data is allowed.
- Participants are not allowed to use the evaluation dataset (or part of it) to train their systems or tune hyper-parameters.
Submission
Instructions regarding the output submission format and the required metadata can be found in the example submission package.
Evaluation
System evaluation will be based on the following metrics, calculated in 1-second segments:
- micro-average F1 score \(F1_m\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- micro-average error rate \(ER_m\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- macro-average F1 score \(F1_M\), calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
- macro-average F1 score with optimum threshold per class \(F1_{MO}\), calculated using sed_scores_eval, based on the best F1 score per class obtained with a class-specific threshold
Evaluation toolboxes
Evaluation is done using the sed_eval and sed_scores_eval toolboxes.
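As a rough illustration, the sketch below shows how the sed_eval-based metrics could be computed in 1-second segments from tab-separated reference and system-output files. It assumes the sed_eval and dcase_util packages with their standard metadata containers; the file names are hypothetical. \(F1_{MO}\) is computed separately with sed_scores_eval from the raw class-wise scores and is not shown here.

```python
# Minimal sketch, assuming sed_eval and dcase_util: segment-based metrics
# in 1-second segments. Reference and output file names are hypothetical.
import sed_eval
import dcase_util

reference = dcase_util.containers.MetaDataContainer(filename="reference.txt").load()
estimated = dcase_util.containers.MetaDataContainer(filename="system_output.txt").load()

metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=reference.unique_event_labels,
    time_resolution=1.0,  # 1-second segments
)

# Accumulate the evaluation file by file, then read the overall results
for audio_file in reference.unique_files:
    metrics.evaluate(
        reference_event_list=reference.filter(filename=audio_file),
        estimated_event_list=estimated.filter(filename=audio_file),
    )

overall = metrics.results_overall_metrics()
class_wise = metrics.results_class_wise_average_metrics()
print("F1_m:", overall["f_measure"]["f_measure"])     # micro-average F1
print("ER_m:", overall["error_rate"]["error_rate"])   # micro-average error rate
print("F1_M:", class_wise["f_measure"]["f_measure"])  # macro-average F1
```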
Task Ranking
Ranking of the systems will be done based on \(F1_{MO}\).
Results
Code | Author | Affiliation | Technical Report | F1_MO |
---|---|---|---|---|
Bai_JLESS_task4b_1 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xian, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 58.21 |
Bai_JLESS_task4b_2 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xian, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 59.77 |
Bai_JLESS_task4b_3 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xian, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 58.00 |
Bai_JLESS_task4b_4 | Jisheng Bai | Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xian, China | task-sound-event-detection-with-soft-labels-results#Yin2023 | 60.74 |
Cai_NCUT_task4b_1 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 43.60 |
Cai_NCUT_task4b_2 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 43.58 |
Cai_NCUT_task4b_3 | Xichang Cai | College of Information, North China University of Technology, Beijing, China | task-sound-event-detection-with-soft-labels-results#Zhang2023 | 42.14 |
Liu_NJUPT_task4b_1 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 19.82 |
Liu_NJUPT_task4b_2 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 20.83 |
Liu_NJUPT_task4b_3 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 22.53 |
Liu_NJUPT_task4b_4 | Xi Shao | Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China | task-sound-event-detection-with-soft-labels-results#Liu2023 | 22.46 |
Liu_SRCN_task4b_1 | Yangyang Liu | Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China | task-sound-event-detection-with-soft-labels-results#Jin2023 | 44.69 |
Liu_SRCN_task4b_2 | Yangyang Liu | Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China | task-sound-event-detection-with-soft-labels-results#Jin2023 | 52.03 |
DCASE2023 baseline | Irene Martin | Computing Sciences, Tampere University, Tampere, Finland | task-sound-event-detection-with-soft-labels-results#Martin2023 | 43.44 |
Min_KAIST_task4b_1 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 48.95 |
Min_KAIST_task4b_2 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 48.72 |
Min_KAIST_task4b_3 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 45.21 |
Min_KAIST_task4b_4 | Deokki Min | Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea | task-sound-event-detection-with-soft-labels-results#Min2023 | 46.24 |
Nhan_VNUHCMUS_task4b_1 | Tri-Do Nhan | Computing Sciences, University of Science, Vietnam National University | task-sound-event-detection-with-soft-labels-results#Nhan2023 | 47.17 |
Xu_SJTU_task4b_1 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 46.13 |
Xu_SJTU_task4b_2 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 50.88 |
Xu_SJTU_task4b_3 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 51.13 |
Xu_SJTU_task4b_4 | Xu Xuenan | X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China | task-sound-event-detection-with-soft-labels-results#Xuenan2023 | 46.99 |
Complete results and technical reports can be found on the results page.
Baseline system
The baseline system is a CRNN with a linear output layer, trained on the soft labels using a mean squared error (MSE) loss. The system architecture consists of three CNN layers and one bidirectional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted with a hop length of 200 ms and 64 mel filter banks. A code sketch of such a CRNN is given after the parameter listing below.
Parameters
Neural network:
- Input shape: sequence_length * 64
- Architecture:
- CNN layer #1
- 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
- 2D max pooling (pool size: (1, 5)) + Dropout (rate: 20%)
- CNN layer #2
- 2D Convolutional layer (filters: 128, kernel size: 3) + Batch normalization + ReLU activation
- 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
- CNN layer #3
- 2D Convolutional layer (filters: 32, kernel size: 3) + Batch normalization + ReLU activation
- 2D max pooling (pool size: (1, 2)) + Dropout (rate: 20%)
- Permute
- Bidirectional GRU layer #1
- Dense layer #1
- Dense layer (units: 64, activation: Linear)
- Dropout (rate: 30%)
- Dense layer #2
- Dense layer (units: 32, activation: Linear)
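A rough sketch of a CRNN along these lines is given below, assuming Keras. The GRU width (32 units), the fixed sequence length of 120 frames, and the final 11-class linear output layer are assumptions based on the description above, not the exact baseline configuration.

```python
# Minimal sketch of a CRNN of the kind described above, assuming Keras.
# GRU width, sequence length, and the 11-class linear output layer are
# assumptions, not the exact baseline settings.
from tensorflow.keras import layers, models

N_MELS = 64
N_CLASSES = 11
SEQ_LEN = 120  # number of input frames; hypothetical value

def build_crnn():
    inp = layers.Input(shape=(SEQ_LEN, N_MELS, 1))
    x = inp
    for n_filters, freq_pool in [(128, 5), (128, 2), (32, 2)]:
        x = layers.Conv2D(n_filters, kernel_size=3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, freq_pool))(x)  # pool frequency only
        x = layers.Dropout(0.2)(x)
    # Collapse the remaining frequency/channel axes into one feature axis per frame
    x = layers.Reshape((SEQ_LEN, -1))(x)
    x = layers.Bidirectional(layers.GRU(32, return_sequences=True))(x)
    x = layers.TimeDistributed(layers.Dense(64, activation="linear"))(x)
    x = layers.Dropout(0.3)(x)
    x = layers.TimeDistributed(layers.Dense(32, activation="linear"))(x)
    out = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="linear"))(x)
    return models.Model(inp, out)

model = build_crnn()
# Trained directly against the soft labels with an MSE loss, as described above
model.compile(optimizer="adam", loss="mse")
```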
Results for the development dataset
System | \(ER_m\) (micro-average) | \(F1_m\) (micro-average) | \(F1_M\) (macro-average) | \(F1_{MO}\) (macro-average) |
---|---|---|---|---|
Baseline | 0.479 | 71.50 % | 35.21 % | 44.13 % |
Citation
If you are using the audio dataset, please cite the following paper:
Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.
Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation
Abstract
Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.
If you are using the baseline, please cite the following paper:
Irene Martín-Morató, Manu Harju, Paul Ahokas, and Annamaria Mesaros. Training sound event detection with soft labels from crowdsourced annotations. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
Training Sound Event Detection with Soft Labels from Crowdsourced Annotations
Abstract
In this paper, we study the use of soft labels to train a system for sound event detection (SED). Soft labels can result from annotations which account for human uncertainty about categories, or emerge as a natural representation of multiple opinions in annotation. Converting annotations to hard labels results in unambiguous categories for training, at the cost of losing the details about the labels distribution. This work investigates how soft labels can be used, and what benefits they bring in training a SED system. The results show that the system is capable of learning information about the activity of the sounds which is reflected in the soft labels and is able to detect sounds that are missed in the typical binary target training setup. We also release a new dataset produced through crowdsourcing, containing temporally strong labels for sound events in real-life recordings, with both soft and hard labels.