The goal of this task is to evaluate systems for the detection of sound events using real data, with different types of annotations and corresponding labels available for training.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task is the follow-up to Task 4 A and Task 4 B in 2023, and it unifies the setup of both subtasks. The goal is to provide the event class together with the event time boundaries, given that multiple events can be present and may overlap in an audio recording. The task explores how to leverage training data with varying annotation granularity (temporal resolution, soft/hard labels).
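
To make the expected output concrete, the following minimal sketch converts frame-level binary activity into events with a class label and time boundaries, allowing events of different classes to overlap. The function name `frames_to_events`, the 0.5 s hop, and the two class names are illustrative assumptions, not part of the task definition or the baseline code.

```python
import numpy as np

def frames_to_events(activity, class_names, hop_seconds):
    """Turn a binary (frames x classes) matrix into (onset, offset, label) tuples."""
    events = []
    for c, name in enumerate(class_names):
        # Pad with zeros so events touching the clip edges are closed properly.
        padded = np.concatenate(([0], activity[:, c].astype(int), [0]))
        changes = np.diff(padded)
        onsets = np.flatnonzero(changes == 1)    # frame index where activity starts
        offsets = np.flatnonzero(changes == -1)  # frame index where activity ends
        for on, off in zip(onsets, offsets):
            events.append((float(on * hop_seconds), float(off * hop_seconds), name))
    return sorted(events)

# Hypothetical 5-frame clip with two classes; "Dog" overlaps the first "Speech" event.
activity = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 1],
    [1, 0],
])
events = frames_to_events(activity, ["Speech", "Dog"], hop_seconds=0.5)
for onset, offset, label in events:
    print(f"{onset:.1f}\t{offset:.1f}\t{label}")
```

Each class is processed independently, which is what lets overlapping events coexist in the output.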

Novelties for 2024 edition

Systems will be evaluated on labels with different granularity in order to get a broader view of the systems' behavior and to assess their robustness for different applications.
The target classes differ between the datasets, so sound events labeled in one dataset may be present but not annotated in the other. Systems will have to cope with potentially missing target labels during training.
The SED system will have to perform without knowing the origin of the audio clips at evaluation time.
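
One common way to cope with missing target labels during training is to mask the loss so that classes not annotated for a given clip contribute nothing. The sketch below illustrates that idea with a masked binary cross-entropy in NumPy. It is an assumption for illustration only and is not claimed to be the baseline's method; the name `masked_bce` and the example values are hypothetical.

```python
import numpy as np

def masked_bce(probs, targets, label_mask, eps=1e-7):
    """Binary cross-entropy averaged only over annotated (mask == 1) entries.

    probs, targets, label_mask: arrays of shape (clips, classes).
    targets may be hard (0/1) or soft (values in [0, 1]).
    """
    probs = np.clip(probs, eps, 1 - eps)
    bce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    # Entries with mask 0 are unannotated for that clip and are ignored.
    return float((bce * label_mask).sum() / max(label_mask.sum(), 1))

# Hypothetical clip from a dataset that annotates only the first two of three classes.
probs = np.array([[0.9, 0.2, 0.5]])
targets = np.array([[1.0, 0.0, 0.0]])
label_mask = np.array([[1.0, 1.0, 0.0]])
loss = masked_bce(probs, targets, label_mask)
print(loss)
```

Because the third class is masked out, its prediction can take any value without changing the loss, so the model is not penalized for detecting an event that was simply never annotated.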

Scientific questions

This task highlights a number of specific research questions:

What is the most efficient way to exploit different sources of data to train a sound event detection system?
Is annotation uncertainty useful in learning models for SED?
How can training data with partially missing annotations be exploited?
How can we evaluate SED systems in a robust way?
How can we train SED systems that perform well under various sound event distributions with potential domain mismatch?

Audio dataset

This task is based on the DESED dataset and the MAESTRO Real dataset.

DESED dataset

The DESED dataset has been used since DCASE 2020 Task 4. It is composed of 10-second audio clips that are either recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. DESED focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all DESED classes correspond one-to-one to AudioSet classes; for example, some DESED classes group several AudioSet classes).

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

Coordinators

Samuele Cornell, Carnegie Mellon University
Janek Ebbers, Mitsubishi Electric Research Lab
Manu Harju, Tampere University
Irene Martin Morato, Tampere University
Constance Douwes, Inria
Annamaria Mesaros, Tampere University
Romain Serizel, University of Lorraine

External data resources

Resource name	Type	Added	Link
YAMNet	model	20.05.2021	https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition	model	31.03.2021	https://zenodo.org/record/3987831
OpenL3	model	12.02.2020	https://openl3.readthedocs.io/
VGGish	model	12.02.2020	https://github.com/tensorflow/models/tree/master/research/audioset/vggish
COLA	model	25.02.2023	https://github.com/google-research/google-research/tree/master/cola
BYOL-A	model	25.02.2023	https://github.com/nttcslab/byol-a
AST: Audio Spectrogram Transformer	model	25.02.2023	https://github.com/YuanGongND/ast
PaSST: Efficient Training of Audio Transformers with Patchout	model	13.05.2022	https://github.com/kkoutini/PaSST
BEATs: Audio Pre-Training with Acoustic Tokenizers	model	01.03.2023	https://github.com/microsoft/unilm/tree/master/beats
AudioSet	audio, video	04.03.2019	https://research.google.com/audioset/
FSD50K	audio	10.03.2022	https://zenodo.org/record/4060432
ImageNet	image	01.03.2021	http://www.image-net.org/
MUSAN	audio	25.02.2023	https://www.openslr.org/17/
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset	audio	25.02.2023	https://zenodo.org/record/1247102#.Y_oyRIBBx8s
Pre-trained desed embeddings (Panns, AST part 1)	model	25.02.2023	https://zenodo.org/record/6642806#.Y_oy_oBBx8s
Audio Teacher-Student Transformer	model	22.04.2024	https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view
TUT Acoustic scenes dataset	audio	22.04.2024	https://zenodo.org/records/45739
MicIRP	IR	28.03.2023	http://micirp.blogspot.com/?m=1

Results for the development dataset (baseline system)

Dataset	PSDS-scenario1	PSDS-scenario1 (sed score)	mean pAUC
Dev-test	0.50 +/- 0.01	0.52 +/- 0.007	0.637 +/- 0.04

Energy consumption during the training and evaluation phase (baseline system)

Training (kWh)	Dev-test (kWh)
1.542 +/- 0.341	0.133 +/- 0.03

Results

Submission code	Ranking score (Evaluation dataset)	PSDS (DESED evaluation dataset)	mpAUC (MAESTRO evaluation dataset)
Schmid_CPJKU_task4_2	1.35	0.642 (0.612 - 0.675)	0.711 (0.704 - 0.717)
Nam_KAIST_task4_2	1.32	0.583 (0.560 - 0.601)	0.738 (0.732 - 0.745)
Zhang_BUPT_task4_1	1.23	0.528 (0.502 - 0.549)	0.704 (0.704 - 0.705)
Chen_CHT_task4_1	1.23	0.499 (0.474 - 0.523)	0.733 (0.730 - 0.739)
Kim_GIST-HanwhaVision_task4_1	1.23	0.564 (0.545 - 0.586)	0.665 (0.646 - 0.677)
Chen_NCUT_task4_3	1.20	0.529 (0.510 - 0.555)	0.675 (0.675 - 0.675)
LEE_KT_task4_1	1.19	0.503 (0.458 - 0.562)	0.684 (0.672 - 0.693)
Baseline	1.13	0.481 (0.456 - 0.505)	0.646 (0.641 - 0.653)
XIAO_FMSG-JLESS_task4_3	1.12	0.571 (0.540 - 0.587)	0.553 (0.553 - 0.553)
Lyu_SCUT_task4_2	1.10	0.484 (0.461 - 0.505)	0.612 (0.596 - 0.624)
Niu_XJU_task4_1	1.07	0.469 (0.444 - 0.499)	0.603 (0.599 - 0.610)
Cai_USTC_task4_2	0.63	0.577 (0.559 - 0.598)	0.050 (0.050 - 0.050)
Huang_SJTU_task4_4	1.20	0.523 (0.500 - 0.552)	0.678 (0.669 - 0.685)
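
The ranking scores listed above are consistent, up to rounding, with the sum of the PSDS and mpAUC columns. This is an observation drawn from the numbers in the table, not a definition quoted from the task rules. The short check below recomputes the sum for every row; the helper name `ranking_score` is hypothetical.

```python
# (submission, reported ranking score, PSDS, mpAUC), copied from the table above.
rows = [
    ("Schmid_CPJKU_task4_2", 1.35, 0.642, 0.711),
    ("Nam_KAIST_task4_2", 1.32, 0.583, 0.738),
    ("Zhang_BUPT_task4_1", 1.23, 0.528, 0.704),
    ("Chen_CHT_task4_1", 1.23, 0.499, 0.733),
    ("Kim_GIST-HanwhaVision_task4_1", 1.23, 0.564, 0.665),
    ("Chen_NCUT_task4_3", 1.20, 0.529, 0.675),
    ("LEE_KT_task4_1", 1.19, 0.503, 0.684),
    ("Baseline", 1.13, 0.481, 0.646),
    ("XIAO_FMSG-JLESS_task4_3", 1.12, 0.571, 0.553),
    ("Lyu_SCUT_task4_2", 1.10, 0.484, 0.612),
    ("Niu_XJU_task4_1", 1.07, 0.469, 0.603),
    ("Cai_USTC_task4_2", 0.63, 0.577, 0.050),
    ("Huang_SJTU_task4_4", 1.20, 0.523, 0.678),
]

def ranking_score(psds, mpauc):
    """Assumed aggregation: plain sum of the two per-dataset metrics."""
    return psds + mpauc

# Collect every row whose reported score differs from the rounded sum.
mismatches = [
    name for name, reported, psds, mpauc in rows
    if round(ranking_score(psds, mpauc), 2) != reported
]
print(mismatches)  # prints []
```

Under this reading, a system needs to do well on both datasets to rank highly: Cai_USTC_task4_2 has one of the higher PSDS values but ranks last because of its mpAUC of 0.050.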
