# Sound Event Detection with Soft Labels

### Coordinators

Annamaria Mesaros, Irene Martín-Morató, Toni Heittola

The goal of this task is to evaluate systems for the detection of sound events that use softly labeled data for training, in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled data. The main focus of this subtask is to investigate whether using soft labels brings any improvement in performance.

# Description

This task is a subtopic of the Sound event detection task (task 4), which provides weakly labeled data (without timestamps), strongly labeled synthetic data (with timestamps), and unlabeled data for training. The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see also Fig 1).

Specific to this subtask is another type of training data:

• Soft labels are provided as a number between 0 and 1 that characterizes the certainty of human annotators about the sound being present at that specific time
• The temporal resolution of the provided data is 1 second (due to the annotation procedure)
• Systems will be evaluated against hard labels, obtained by thresholding the soft labels at 0.5: anything above 0.5 is considered 1 (sound active), anything below 0.5 is considered 0 (sound inactive)

Research question: Do soft labels contain any useful additional information to help train better sound event detection systems?

# Audio dataset

The development set provided for this task is MAESTRO Real. The dataset consists of real-life recordings with a length of approximately 3 minutes each, recorded in a few different acoustic scenes. The audio was annotated using Amazon Mechanical Turk, with a procedure that allows estimating soft labels from multiple annotator opinions. The full procedure for annotation and aggregation of multiple opinions can be found in the publication provided below.

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:902–914, 2023. doi:10.1109/TASLP.2022.3233468.

#### Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

##### Abstract

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong labels estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of crowdsourcing directly strong labels.

## Reference labels

The reference labels for the development data are available as soft labels. Their format is as follows:

Soft labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)][tab][soft label (float)]


Example:

a1.wav       0  1   footsteps   0.6
a1.wav       0  1   people_talking      0.9
a1.wav       1  2   footsteps   0.8


These labels can be transformed into hard (binary) labels, using the 0.5 threshold, and the equivalent annotation would be

Hard labels:

[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]


Example:

a1.wav       0  2   footsteps
a1.wav       0  1   people_talking

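The conversion above can be sketched in a few lines: threshold each 1-second annotation at 0.5, then merge temporally adjacent active segments of the same class. This is a minimal illustration of the rule described in the text, not the official conversion script; the function name and row format are assumptions.

```python
from collections import defaultdict

def soft_to_hard(soft_rows, threshold=0.5):
    """Convert (filename, onset, offset, label, soft) rows to hard rows:
    keep segments whose soft label exceeds the threshold, then merge
    contiguous or overlapping active segments of the same class."""
    active = defaultdict(list)  # (filename, label) -> [(onset, offset), ...]
    for fname, onset, offset, label, soft in soft_rows:
        if soft > threshold:  # above 0.5 -> sound considered active
            active[(fname, label)].append((onset, offset))
    hard = []
    for (fname, label), segs in active.items():
        segs.sort()
        merged = [list(segs[0])]
        for on, off in segs[1:]:
            if on <= merged[-1][1]:  # touches or overlaps previous segment
                merged[-1][1] = max(merged[-1][1], off)
            else:
                merged.append([on, off])
        for on, off in merged:
            hard.append((fname, on, off, label))
    return sorted(hard)

rows = [
    ("a1.wav", 0.0, 1.0, "footsteps", 0.6),
    ("a1.wav", 0.0, 1.0, "people_talking", 0.9),
    ("a1.wav", 1.0, 2.0, "footsteps", 0.8),
]
hard = soft_to_hard(rows)
```

Run on the example above, the two footsteps segments (0.6 and 0.8, both above 0.5) merge into a single 0–2 s event, reproducing the hard-label listing.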

In the provided dataset there are 17 sound classes. Of these, only 15 classes have soft-label values over 0.5, and 4 of those are very rare. For this reason, the evaluation is conducted only against the following 11 classes:

• Birds singing
• Car
• People talking
• Footsteps
• Children voices
• Wind blowing
• Brakes squeaking
• Large vehicle
• Cutlery and dishes
• Metro approaching
• Metro leaving

Participants must use the soft labels in training their system. However, participants are also allowed to use external datasets and embeddings extracted from pre-trained models, in any combination. This means that it is possible to use both hard and soft labels in the same training setup, and other data as well. Lists of the eligible datasets and pre-trained models are provided below. Datasets and models can be added to the list upon request until April 1st 2023 (as long as the corresponding resources are publicly available).

Note also that each participant should submit at least one system that is not using external data.

## Development dataset

The development set consists of X files with a total duration of X minutes. The dataset is provided with a 5-fold cross-validation setup in which approximately 70% of the data (per class) is used for training, and the rest is used for testing. Participants are required to report the development set results using this setup. Please note that, for a correct calculation of performance, training and testing must be run for each fold (so, 5 train/test rounds), with performance evaluated only after that, over the entire list of files at once, rather than per fold with the 5 values averaged. Due to data imbalance between folds, this pooled evaluation is more stable.
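The difference between pooled and per-fold-averaged evaluation can be shown with a toy micro-F1 computation. The fold contents below are invented numbers chosen only to illustrate why pooling is preferred when folds are imbalanced; they are not MAESTRO statistics.

```python
def micro_f1(pairs):
    """pairs: list of (reference, prediction) binary activity indicators."""
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Two imbalanced folds: fold A has many positives, fold B almost none.
fold_a = [(1, 1)] * 8 + [(1, 0)] * 2      # per-fold F1 = 16/18
fold_b = [(1, 0), (0, 1)] + [(0, 0)] * 8  # per-fold F1 = 0

pooled = micro_f1(fold_a + fold_b)          # evaluate all files at once
averaged = (micro_f1(fold_a) + micro_f1(fold_b)) / 2  # average of 5 (here 2) folds
```

Here the pooled score is 0.8, while averaging the fold scores gives roughly 0.44: the tiny, nearly positive-free fold B drags the average down, which is exactly the instability the pooled setup avoids.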

## Evaluation dataset

The evaluation dataset consists of X files with a total length of X minutes. Only audio is provided for the evaluation set.

# External data resources

List of external data resources allowed:

| Resource | Type | Added | URL |
| --- | --- | --- | --- |
| YAMNet | model | 20.05.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 31.03.2021 | https://zenodo.org/record/3987831 |
| VGGish | model | 12.02.2020 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
| BYOL-A | model | 25.02.2023 | https://github.com/nttcslab/byol-a |
| AST: Audio Spectrogram Transformer | model | 25.02.2023 | https://github.com/YuanGongND/ast |
| PaSST: Efficient Training of Audio Transformers with Patchout | model | 13.05.2022 | https://github.com/kkoutini/PaSST |
| FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
| ImageNet | image | 01.03.2021 | http://www.image-net.org/ |
| MUSAN | audio | 25.02.2023 | https://www.openslr.org/17/ |
| DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 25.02.2023 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s |
| Pre-trained DESED embeddings (PANNs, AST part 1) | model | 25.02.2023 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s |

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

• Participants are allowed to submit up to 4 different systems.
• Participants are allowed to use external data for system development. However, each participant should submit at least one system that is not using external data.
• Data from other tasks is considered external data.
• Embeddings extracted from models pre-trained on external data are considered external data.
• Hard labels of the same dataset are not considered external data.
• Manipulation of provided training data is allowed.
• Participants are not allowed to use the evaluation dataset (or part of it) to train their systems or tune hyper-parameters.

# Submission

Instructions regarding the output submission format and the required metadata will be added later.

# Evaluation

System evaluation will be based on the following metrics, calculated in 1s-segments:

• micro-average F1 score $$F1_m$$, calculated using sed_eval, with a decision threshold of 0.5 applied to the system output provided by participants
• micro-average error rate $$ER_m$$, calculated using sed_eval, with the same 0.5 decision threshold
• macro-average F1 score $$F1_M$$, calculated using sed_eval, with the same 0.5 decision threshold
• macro-average F1 score with optimum threshold per class $$F1_{MO}$$, calculated using sed_scores_eval, based on the best F1 score per class obtained with a class-specific threshold
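For intuition, the segment-based micro-average F1 can be sketched from scratch: roll each event list out into 1-second segments of binary activity, then count segment-wise true positives, false positives, and false negatives. This is a simplified illustration of what the sed_eval toolbox computes (its `SegmentBasedMetrics` with `time_resolution=1.0` is the reference implementation); the helper names here are assumptions.

```python
def segment_activity(events, n_segments, classes, seg_len=1.0):
    """Roll a list of (onset, offset, label) events out into
    per-segment binary activity at a fixed segment length."""
    act = {(i, c): 0 for i in range(n_segments) for c in classes}
    for onset, offset, label in events:
        for i in range(n_segments):
            s, e = i * seg_len, (i + 1) * seg_len
            if onset < e and offset > s:  # event overlaps this segment
                act[(i, label)] = 1
    return act

def segment_micro_f1(ref_events, est_events, n_segments, classes):
    """Micro-averaged F1 over all (segment, class) decisions."""
    ref = segment_activity(ref_events, n_segments, classes)
    est = segment_activity(est_events, n_segments, classes)
    tp = sum(1 for k in ref if ref[k] and est[k])
    fp = sum(1 for k in ref if not ref[k] and est[k])
    fn = sum(1 for k in ref if ref[k] and not est[k])
    return 2 * tp / (2 * tp + fp + fn)

ref = [(0, 2, "footsteps"), (0, 1, "people_talking")]
est = [(0, 1, "footsteps"), (0, 1, "people_talking")]
f1 = segment_micro_f1(ref, est, 2, ["footsteps", "people_talking"])
```

With the reference annotation from the earlier example and a system that misses the second footsteps segment, the score is 2·2 / (2·2 + 0 + 1) = 0.8.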

## Evaluation toolboxes

Evaluation is done using the sed_eval and sed_scores_eval toolboxes.

Ranking of the systems will be done based on $$F1_{MO}$$.

# Baseline system

The baseline system is a CRNN with a linear output layer, trained on the soft labels with a mean squared error (MSE) loss. The system architecture consists of three CNN layers and one bidirectional gated recurrent unit (GRU) layer. As input, the model uses mel-band energies extracted using a hop length of 200 ms and 64 mel filter banks.

### Parameters

Neural network:

• Input shape: sequence_length × 64
• Architecture:
  • CNN layer #1
    • 2D convolutional layer (filters: 128, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 5)) + dropout (rate: 20%)
  • CNN layer #2
    • 2D convolutional layer (filters: 128, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + dropout (rate: 20%)
  • CNN layer #3
    • 2D convolutional layer (filters: 32, kernel size: 3) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (1, 2)) + dropout (rate: 20%)
  • Permute
  • Bidirectional GRU layer #1
  • Dense layer #1
    • Dense layer (units: 64, activation: linear)
    • Dropout (rate: 30%)
  • Dense layer #2
    • Dense layer (units: 32, activation: linear)
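The architecture above can be sketched in PyTorch. This is an illustrative reconstruction from the listed parameters, not the official baseline code: the GRU width (32 units per direction) and the final linear projection to the 11 evaluated classes are assumptions, since the listing does not state them.

```python
import torch
import torch.nn as nn

class CRNNBaseline(nn.Module):
    """Sketch of the described CRNN: 3 CNN blocks, a bidirectional GRU,
    and linear dense layers trained with MSE against soft labels."""
    def __init__(self, n_classes=11, n_mels=64):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
                nn.MaxPool2d((1, pool)),   # pool only the frequency axis
                nn.Dropout(0.2),
            )
        self.cnn = nn.Sequential(
            block(1, 128, 5), block(128, 128, 2), block(128, 32, 2))
        feat = 32 * (n_mels // 5 // 2 // 2)  # 32 channels x 3 mel bins = 96
        # GRU width of 32 per direction is an assumption
        self.gru = nn.GRU(feat, 32, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(64, 64),   # dense #1, linear activation
            nn.Dropout(0.3),
            nn.Linear(64, 32),   # dense #2, linear activation
            nn.Linear(32, n_classes),  # assumed linear output layer
        )

    def forward(self, x):                      # x: (batch, time, n_mels)
        z = self.cnn(x.unsqueeze(1))           # -> (batch, 32, time, 3)
        z = z.permute(0, 2, 1, 3).flatten(2)   # -> (batch, time, 96)
        z, _ = self.gru(z)                     # -> (batch, time, 64)
        return self.head(z)                    # -> (batch, time, n_classes)

model = CRNNBaseline()
out = model(torch.randn(2, 10, 64))  # 2 clips, 10 frames, 64 mel bands
```

Training against the soft labels would then use something like `nn.MSELoss()(out, soft_targets)`, with `soft_targets` holding the per-second soft activity values; the time pooling needed to map 200 ms frames onto 1 s labels is omitted here for brevity.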

### Results for the development dataset

| System | Micro-average $$ER_m$$ | Micro-average $$F1_m$$ | Macro-average $$F1_M$$ | Macro-average $$F1_{MO}$$ |
| --- | --- | --- | --- | --- |
| Baseline | 0.479 | 71.50 % | 35.21 % | 44.13 % |

# Citation

If you are using the audio dataset, please cite the following paper:

Publication

Irene Martín-Morató and Annamaria Mesaros. Strong labeling of sound events using crowdsourced weak labels and annotator competence estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31():902–914, 2023. doi:10.1109/TASLP.2022.3233468.


If you are using the baseline, please cite the following paper:

Publication

Irene Martín-Morató, Manu Harju, Paul Ahokas, and Annamaria Mesaros. Training sound event detection with soft labels from crowdsourced annotations. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP). 2023.

#### Training Sound Event Detection with Soft Labels from Crowdsourced Annotations

##### Abstract

In this paper, we study the use of soft labels to train a system for sound event detection (SED). Soft labels can result from annotations which account for human uncertainty about categories, or emerge as a natural representation of multiple opinions in annotation. Converting annotations to hard labels results in unambiguous categories for training, at the cost of losing the details about the labels distribution. This work investigates how soft labels can be used, and what benefits they bring in training a SED system. The results show that the system is capable of learning information about the activity of the sounds which is reflected in the soft labels and is able to detect sounds that are missed in the typical binary target training setup. We also release a new dataset produced through crowdsourcing, containing temporally strong labels for sound events in real-life recordings, with both soft and hard labels.