The goal of the task is to evaluate systems for the detection of sound events using real data either weakly labeled or unlabeled and simulated data that is strongly labeled (with time stamps).

Challenge has ended. Full results for this task can be found in the Results page.

If you are interested in the task, you can join us on the dedicated slack channel

Description

This task is the follow-up to [DCASE 2022 Task 4][dcase22_task4]. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time localization given that multiple events can be present in an audio recording (see also Fig 1).

Figure 1: Overview of a sound event detection system.

Novelties for 2023 edition

We will evaluated the submissions using a threshold-independant implementation of the PSDS.
For each submitted system we ask you to submit the output scores from three independent model trainings with different initialization to be able to evaluate the model performance's standard deviation.
Reporting the energy consumption is mandatory.
In order to account for potential hardware difference, participants have to report the energy consumption measured while training the baseline during 10 epochs (on their hardware).
We are introducing a new metric, complementary to the energy consumption metric: Multiply–accumulate operations (MACs) for 10 seconds of audio prediction.
We will experiment run post-processing-invariant evaluation.
We require to submit at least one system without ensembles.
We propose a new baseline using BEATS embeddings.

Scientific questions

This task highlights a number of specific research questions:

What strategies work well when training a sound event detection system with a heterogeneous dataset, including:
- A large amount of unbalanced and unlabeled training data
- A small weakly annotated set
- A synthetic set from isolated sound events and backgrounds
What is the impact of using embeddings extracted from pre-trained models?
What are the potential advantages of using external data?
What is the impact of model complexity/energy consumption on the performance?
What is the impact of the temporal post-processing on the performance?
Can we find more robust way to evaluate systems (and take training variabilities into account)?

Audio dataset

This Task is primarily based on the DESED dataset, which has been used since DCASE 2020 Task 4. DESED is composed of 10 sec audio clips recorded in domestic environments (taken from AudioSet) or synthesized using Scaper to simulate a domestic environment. The task focuses on 10 classes of sound events that represent a subset of AudioSet (note that not all the classes in DESED correspond to classes in AudioSet; for example, some classes in DESED group several classes from AudioSet):

More information about this dataset and how to generate synthetic soundscapes can be found on the DESED website.

Google researchers conducted a quality assessment task where experts were exposed to 10 randomly selected clips for each class and discovered that a in most of the cases not all the clips contains the event related to the given annotation
The weak annotations have been verified manually for a small subset of the training set.
Another subset of the development set has been annotated manually with strong annotations, to be used as the validation set (see also below for a detailed explanation about the development set).

Publication

Nicolas Turpault, Romain Serizel, Ankit Parag Shah, and Justin Salamon. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events. New York City, United States, October 2019. URL: https://hal.inria.fr/hal-02160855.

	Romain Serizel University of Lorraine
	Francesca Ronchini Politecnico di Milano
	Janek Ebbers Paderborn University
	Florian Angulo Telecom Paris
	David Perera Telecom Paris
	Slim Essid Telecom Paris

Dataset name	Type	Added	Link
YAMNet	model	20.05.2021	https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition	model	31.03.2021	https://zenodo.org/record/3987831
OpenL3	model	12.02.2020	https://openl3.readthedocs.io/
VGGish	model	12.02.2020	https://github.com/tensorflow/models/tree/master/research/audioset/vggish
COLA	model	25.02.2023	https://github.com/google-research/google-research/tree/master/cola
BYOL-A	model	25.02.2023	https://github.com/nttcslab/byol-a
AST: Audio Spectrogram Transformer	model	25.02.2023	https://github.com/YuanGongND/ast
PaSST: Efficient Training of Audio Transformers with Patchout	model	13.05.2022	https://github.com/kkoutini/PaSST
BEATs: Audio Pre-Training with Acoustic Tokenizers	model	01.03.2023	https://github.com/microsoft/unilm/tree/master/beats
AudioSet	audio, video	04.03.2019	https://research.google.com/audioset/
FSD50K	audio	10.03.2022	https://zenodo.org/record/4060432
ImageNet	image	01.03.2021	http://www.image-net.org/
MUSAN	audio	25.02.2023	https://www.openslr.org/17/
DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset	audio	25.02.2023	https://zenodo.org/record/1247102#.Y_oyRIBBx8s
Pre-trained desed embeddings (Panns, AST part 1)	model	25.02.2023	https://zenodo.org/record/6642806#.Y_oy_oBBx8s
Audio Teacher-Student Transformer	model	22.04.2024	https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view
TUT Acoustic scenes dataset	audio	22.04.2024	https://zenodo.org/records/45739
MicIRP	IR	28.03.2023	http://micirp.blogspot.com/?m=1

Submission code (PSDS 1)	Submission code (PSDS 2)	Ranking score (Evaluation dataset)	PSDS 1 (Evaluation dataset)	PSDS 2 (Evaluation dataset)
Kim_GIST-HanwhaVision_task4a_2	Kim_GIST-HanwhaVision_task4a_3	1.68	0.591 (0.574 - 0.611)	0.835 (0.826 - 0.846)
Zhang_IOA_task4a_6	Zhang_IOA_task4a_7	1.63	0.562 (0.552 - 0.575)	0.830 (0.820 - 0.842)
Wenxin_TJU_task4a_6	Wenxin_TJU_task4a_6	1.61	0.546 (0.536 - 0.556)	0.831 (0.823 - 0.842)
Xiao_FMSG_task4a_4	Xiao_FMSG_task4a_4	1.60	0.551 (0.543 - 0.562)	0.813 (0.802 - 0.827)
Guan_HIT_task4a_3	Guan_HIT_task4a_4	1.60	0.526 (0.513 - 0.539)	0.855 (0.844 - 0.867)
Chen_CHT_task4a_2	Chen_CHT_task4a_2	1.58	0.563 (0.550 - 0.574)	0.779 (0.768 - 0.792)
Li_USTC_task4a_6	Li_USTC_task4a_6	1.56	0.546 (0.529 - 0.562)	0.783 (0.771 - 0.796)
Liu_NSYSU_task4a_7	Liu_NSYSU_task4a_7	1.55	0.521 (0.510 - 0.531)	0.813 (0.796 - 0.831)
Cheimariotis_DUTH_task4a_1	Cheimariotis_DUTH_task4a_1	1.53	0.516 (0.504 - 0.529)	0.796 (0.784 - 0.808)
Baseline_BEATS	Baseline_BEATS	1.52	0.510 (0.496 - 0.523)	0.798 (0.782 - 0.811)
Baseline	Baseline	1.00	0.327 (0.317 - 0.339)	0.538 (0.515 - 0.566)
Wang_XiaoRice_task4a_1	Wang_XiaoRice_task4a_1	1.50	0.494 (0.477 - 0.510)	0.801 (0.789 - 0.815)
Lee_CAUET_task4a_1	Lee_CAUET_task4a_2	1.28	0.425 (0.415 - 0.440)	0.674 (0.661 - 0.690)
Liu_SRCN_task4a_4	Liu_SRCN_task4a_4	1.25	0.412 (0.400 - 0.424)	0.663 (0.652 - 0.676)
Barahona_AUDIAS_task4a_2	Barahona_AUDIAS_task4a_4	1.21	0.380 (0.361 - 0.406)	0.673 (0.652 - 0.700)
Wu_NCUT_task4a_1	Wu_NCUT_task4a_1	1.15	0.391 (0.379 - 0.405)	0.596 (0.584 - 0.610)
Gan_NCUT_task4a_1	Gan_NCUT_task4a_1	1.12	0.365 (0.353 - 0.377)	0.603 (0.589 - 0.617)

	PSDS-scenario1	PSDS-scenario2	Intersection-based F1	Collar-based F1
Baseline	0.359 +/- 0.006	0.562 +/- 0.012	64.2 +/- 0.8 %	40.7 +/- 0.6 %
Baseline (AudioSet strong)	0.364 +/- 0.005	0.576 +/- 0.011	65.5 +/- 1.3 %	43.3 +/- 1.4 %
Baseline (BEATS)	0.500 +/- 0.004	0.762 +/- 0.008	80.7 +/- 0.4 %	57.1 +/- 1.3 %

	Training (kWh)	Dev-test (kWh)
Baseline	1.390 +/- 0.019	0.019 +/- 0.001
Baseline (AudioSet strong)	1.418 +/- 0.016	0.020 +/- 0.001
Baseline (BEATS)	1.821 +/- 0.457	0.022 +/- 0.003

Coordinators

Content

Description

Novelties for 2023 edition

Scientific questions

Audio dataset

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

Keywords

Sound event detection in synthetic domestic environments

Abstract

Keywords

Reference labels

Weak annotations

Strong annotations

FSD50K: an Open Dataset of Human-Labeled Sound Events

Abstract

Freesound technical demo

Abstract

Download

Task setup

Development set

The benefit of temporally-strong labels in audio event classification

Sound event detection validation set

Evaluation set

External data resources

Task rules

Submission

Metadata file

Package validator

Evaluation

Threshold Independent Evaluation of Sound Event Detection Scores

Scenario 1

Scenario 2

Task Ranking

Contrastive metric (collar-based F1-score)

Energy consumption

(New) Multiply–accumulate (MAC) operations

Results

Baseline system

System description

(New) Baseline using pre-trained embeddings from models (SEC/Tagging) trained on Audioset

Results for the development dataset

Energy consumption during the training and evaluation phase

Repositories

Citation

Sound event detection in domestic environments with weakly labeled data and soundscape synthesis

Abstract

Keywords

The Impact of Non-Target Events in Synthetic Soundscapes for Sound Event Detection

Abstract

Threshold Independent Evaluation of Sound Event Detection Scores