Introduction

Sounds carry a large amount of information about our everyday environment and physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars etc., and intelligent monitoring systems to recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Challenge status

Task		Task description	Development dataset	Baseline system	Evaluation dataset	Results
Task 1, Low-Complexity Acoustic Scene Classification		Released	Released	Released	Released	Released
Task 2, First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring		Released	Released	Released	Released	Released
Task 3, Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes		Released	Released	Released	Released	Released
Task 4, Sound Event Detection with Weak Labels and Synthetic Soundscapes; Sound Event Detection with Soft Labels	A	Released	Released	Released	Released	Released
	B	Released	Released	Released	Released	Released
Task 5, Few-shot Bioacoustic Event Detection		Released	Released	Released	Released	Released
Task 6, Automated Audio Captioning and Language-Based Audio Retrieval	A	Released	Released	Released	Released	Released
	B	Released	Released	Released	Released	Released
Task 7, Foley Sound Synthesis		Released	Released	Released	No evaluation dataset	Released

updated 2023/06/01

Tasks

Low-Complexity Acoustic Scene Classification

Scenes Task 1

The task targets acoustic scene classification (ASC) on devices with low computational and memory allowance, which impose certain limits on model complexity. The task setup is based on limited model complexity, acoustically diverse data, and multiple mobile devices, reflecting a real-life application for ASC. The focus of the task is on training strategies for obtaining robust models that are also light enough to run on embedded systems. Specific implementation requirements include a maximum memory allowance (with no predefined parameter representation format), a maximum allowance of 30 MMACs, and the requirement to calculate energy consumption.

The task is a repeat task from DCASE 2022, with the added calculation of the energy consumption, which is a factor in the overall ranking.

Ranking of submissions will be done taking into account the model accuracy, memory, MMACs and energy needs, to create a measurement that looks at the low resources from multiple perspectives.
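As a rough illustration of the 30 MMACs allowance, the sketch below counts multiply-accumulate operations for a small network and compares the total against the limit. The layer shapes and the helper names (`conv2d_macs`, `linear_macs`) are hypothetical, and this is not the official complexity calculation used by the organizers; only the 30 MMACs figure comes from the task description.

```python
def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """Multiply-accumulate count for one 2-D convolution layer (bias ignored)."""
    k_h, k_w = kernel
    return out_h * out_w * out_ch * (in_ch // groups) * k_h * k_w


def linear_macs(in_features, out_features):
    """Multiply-accumulate count for one fully connected layer (bias ignored)."""
    return in_features * out_features


MAX_MMACS = 30  # limit stated in the task description

# Hypothetical three-layer model; the shapes are illustrative assumptions.
layers = [
    conv2d_macs(1, 16, (3, 3), out_h=64, out_w=22),
    conv2d_macs(16, 32, (3, 3), out_h=32, out_w=11),
    linear_macs(32, 10),
]

total_macs = sum(layers)
total_mmacs = total_macs / 1e6          # MMACs = millions of MACs
within_budget = total_mmacs <= MAX_MMACS

print(f"{total_mmacs:.3f} MMACs, within budget: {within_budget}")
```

In practice such a count would be taken over one full inference pass of the submitted model, typically with a profiling tool rather than by hand.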

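Organizers

Annamaria Mesaros

Irene Martin Morato

Francesco Paissan

Alberto Ancilotto

Elisabetta Farella

Alessio Brutti

Toni Heittola

Tuomas Virtanen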

First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Monitoring Task 2

The goal of this task is to identify whether a machine is normal or anomalous using only normal sound data under domain-shifted conditions. One major difference from DCASE 2022 Task 2 is that the set of machine types is completely different between the development dataset and the evaluation dataset. Participants are therefore expected to develop a system that can handle completely new machine types.
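To make the "normal data only" setting concrete, here is a minimal sketch of an unsupervised anomaly score: statistics are fitted on feature vectors from normal sounds, and a test vector is scored by its mean squared z-score. This is an illustrative assumption, not the task baseline; the feature vectors and function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_normal_stats(normal_features):
    """Per-dimension mean and standard deviation from normal-only training vectors."""
    dims = list(zip(*normal_features))
    means = [mean(d) for d in dims]
    stds = [pstdev(d) or 1e-8 for d in dims]  # avoid division by zero
    return means, stds


def anomaly_score(x, stats):
    """Mean squared z-score of x; larger values mean 'less like the normal data'."""
    means, stds = stats
    return mean(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds))


# Hypothetical 2-D features extracted from normal machine sounds.
normal = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
stats = fit_normal_stats(normal)

typical = anomaly_score([2.0, 12.0], stats)   # at the centre of the normal data
unusual = anomaly_score([6.0, 20.0], stats)   # far from the normal data
print(typical, unusual)
```

Because no anomalous examples are used for fitting, the same procedure can be applied to a machine type that was never seen during development, which is the situation the first-shot setting describes.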

Organizers

Kota Dohi

Hitachi, Ltd.

Keisuke Imoto

Doshisha University

Yuma Koizumi

Google, Inc.

Noboru Harada

NTT Corporation

Daisuke Niizumi

NTT Corporation

Tomoya Nishida

Hitachi, Ltd.

Harsh Purohit

Hitachi, Ltd.

Ryo Tanabe

Hitachi, Ltd.

Takashi Endo

Hitachi, Ltd.

Yohei Kawaguchi

Hitachi, Ltd.

Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

Localization Task 3

Sound event localization and detection (SELD) is the joint task of detecting the temporal activities of multiple sound event classes of interest and estimating their respective spatial trajectories when active. This task continues Task 3 of DCASE2022, in which SELD systems are evaluated on real, spatially annotated recordings. Testing on (and partial training with) real recordings brings new challenges and opportunities to improve real-world performance. To foster further innovation, this task will have, in addition to an audio-only setup similar to that of DCASE2022, a setup where participants also have access to 360° video of the recorded scenes.

Organizers

Archontis Politis

Tampere University

Kazuki Shimada

SONY

Yuki Mitsufuji

SONY

Tuomas Virtanen

Tampere University

Sharath Adavanne

Tampere University

Parthasaarathy Sudarsanam

Tampere University

Daniel Krause

Tampere University

Naoya Takahashi

SONY

Shusuke Takahashi

SONY

Yuichiro Koyama

SONY

Kengo Uchida

SONY

Aapo Hakala

Tampere University

Sound Event Detection with Weak Labels and Synthetic Soundscapes; Sound Event Detection with Soft Labels

Events Task 4

The goal of sound event detection is to provide the event class together with the event time boundaries, given that multiple events can be present in an audio recording. The target of this task is to perform sound event detection using training datasets where varying types of annotations are available, in order to explore how different types of annotations can be leveraged. Because strongly labeled data is costly to obtain, prone to annotator biases, and does not account for annotator uncertainty, in this task we propose to investigate weakly labeled data and strongly labeled synthetic soundscapes in subtask A (a follow-up to DCASE 2022 Task 4), and softly labeled data (non-binary activity for sounds) in subtask B. In both cases the labeled in-domain data can be used together with external datasets. To encourage participation across subtasks, one baseline will be common to both subtasks.

Subtask A: Sound event detection with weak labels and synthetic soundscapes

This task is a continuation of Task 4 at DCASE 2022. The main novelties this year are that additional sets of extracted embeddings will be made available, that the evaluation will include a systematic evaluation of energy consumption, and that alternative evaluation methods will be explored to assess the robustness of the systems to changes in the operating point.

Subtask B: Sound event detection with soft labels

This is a new subtask where annotations on the training data are given with a 1 s resolution. The annotation is based on multiple annotators, and the aggregation is a value between 0 and 1 per event class per 1 s segment. These values are considered soft labels that indicate the uncertainty of the pool of annotators about the content. The soft annotations are used to train a sound event detection system that should perform at a 1 s time resolution.
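The aggregation described above can be sketched as follows. The example assumes the simplest possible rule, an unweighted mean of binary annotator decisions, which is an assumption for illustration; the actual aggregation used for the task data may weight annotators differently. The function name and the toy annotations are hypothetical.

```python
def soft_labels(annotations):
    """Aggregate binary annotations into soft labels.

    annotations[a][t][c] is 1 if annotator a marked class c as active
    in 1 s segment t, and 0 otherwise. The result has one value in [0, 1]
    per segment and class (here: the fraction of annotators who agreed).
    """
    n_annotators = len(annotations)
    n_segments = len(annotations[0])
    n_classes = len(annotations[0][0])
    return [
        [sum(a[t][c] for a in annotations) / n_annotators for c in range(n_classes)]
        for t in range(n_segments)
    ]


# Three hypothetical annotators, two 1 s segments, two event classes.
annotations = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[0, 0], [0, 1]],
]

labels = soft_labels(annotations)
for t, row in enumerate(labels):
    print(f"segment {t}: {row}")
```

A value such as 2/3 signals partial agreement among annotators, whereas 0 or 1 signals that all annotators agreed the class was inactive or active.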

Organizers

Romain Serizel

University of Lorraine

Francesca Ronchini

Politecnico di Milano

Janek Ebbers

Paderborn University

Florian Angulo

Telecom Paris

David Perera

Telecom Paris

Slim Essid

Telecom Paris

Annamaria Mesaros

Tampere University

Irene Martin Morato

Tampere University

Toni Heittola

Tampere University

Few-shot Bioacoustic Event Detection

Bio Task 5

This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations in a long sound recording, and generalise from those exemplars to find the rest of the sound events in the audio. The bioacoustic (animal sound) setting addresses a real need from animal researchers, while also providing a well-defined, constrained yet highly variable domain in which to evaluate few-shot learning methodology. In 2021 and 2022, the task stimulated a lot of innovation, but there is still plenty of scope for technical innovations to achieve robust, strong performance. For this edition, the task will keep most details the same as in 2022 to ensure maximum comparability, but will introduce new, unseen evaluation recordings from new animal sound recording scenarios, so that the evaluation is as representative as possible and includes truly unseen datasets.

Organizers

Ines Nolasco

Queen Mary University of London

Shubhr Singh

Queen Mary University of London

Vincent Lostanlen

Centre National de la Recherche Scientifique (CNRS)
Laboratoire des Sciences du Numérique de Nantes (LS2N)

Ariana Strandburg-Peshkin

University of Konstanz
Max Planck Institute of Animal Behavior

Lisa Gill

BIOTOPIA Naturkundemuseum Bayern

Hanna Pamula

AGH University of Science and Technology

Ester Vidana Vila

La Salle, Universitat Ramon Llull

Helen Whitehead

University of Salford

Ivan Kiskin

University of Surrey

Frants Jensen

Syracuse University

Joe Morford

University of Oxford

Michael Emmerson

Queen Mary University of London

Veronica Morfi

Queen Mary University of London

Burooj Ghani

Tilburg University

Dan Stowell

Tilburg University

Automated Audio Captioning and Language-Based Audio Retrieval

Caption Task 6

This task approaches the problem of analysing audio signals by using natural language to represent their rich characteristics. The task setup is similar to the DCASE 2022 Challenge Task 6, with some changes that take into account the development of the field.

Subtask A: Automated Audio Captioning

This task is a continuation of Task 6 at DCASE 2022 Challenge and focuses on the research question “How can we make machines understand higher level and human-perceived information from general sounds?”.

The metric used to rank the submissions will combine SPIDEr with a fluency error detection model. Participants will still report METEOR, CIDEr, and SPICE metrics. In addition, FENSE and CB-score will be reported as contrastive metrics.

Subtask B: Language-Based Audio Retrieval

This task is a continuation of Task 6 at DCASE 2022 Challenge. Its goal is to evaluate methods where a retrieval system takes a free-form textual description as input and ranks audio signals in a fixed dataset based on their match to the given description.

In the 2022 Challenge, the ground truth relevance of audio files was considered binary, so that only audio files matching their corresponding caption were considered relevant, and all other audio files non-relevant. In the 2023 Challenge, this limitation is addressed by crowdsourced graded relevance scores and the use of the Normalized Discounted Cumulative Gain (NDCG) as the metric. mAP@10 (mean average precision at cut-off 10) and recall@k (i.e., recall@1, recall@5, and recall@10) will be used as secondary metrics.
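The difference between binary and graded relevance can be illustrated with a small sketch of the metrics named above. It uses the common linear-gain, log2-discount form of NDCG; the exact variant used in the challenge evaluation is not specified here, so treat the formula choice, the function names, and the toy relevance scores as assumptions. mAP@10 would be the mean of `average_precision_at_k` over all queries.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_ids, graded_relevance, k=None):
    """NDCG of a ranking, given a dict of graded relevance scores per item id."""
    k = len(ranked_ids) if k is None else k
    gains = [graded_relevance.get(i, 0.0) for i in ranked_ids[:k]]
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the (binary) relevant items found in the top k."""
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision at cut-off k for one query with binary relevance."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked_ids[:k], start=1):
        if i in relevant_ids:
            hits += 1
            total += hits / rank
    denom = min(len(relevant_ids), k)
    return total / denom if denom else 0.0


# Hypothetical query: system ranking and crowdsourced graded relevance in [0, 1].
ranking = ["a", "b", "c", "d"]
graded = {"a": 0.2, "b": 1.0, "c": 0.0, "d": 0.6}
binary_relevant = {"b"}  # 2022-style: only the file paired with the caption

print(ndcg(ranking, graded))
print(recall_at_k(ranking, binary_relevant, 1), average_precision_at_k(ranking, binary_relevant))
```

With binary relevance, file "d" counts as a plain miss; with graded relevance, ranking it highly still earns partial credit, which is the limitation the graded scores are meant to address.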

Organizers

Huang Xie

Tampere University

Felix Gontier

INRIA

Konstantinos Drossos

NOKIA Tech

Tuomas Virtanen

Tampere University

Romain Serizel

University of Lorraine

Foley Sound Synthesis

Synthesis Task 7

This task aims to build a foley sound synthesis system that can generate plausible audio signals fitting into given categories of foley sound. The foley sound categories are composed of sound events and environmental background sounds. The challenge has two subproblems: the development of models with and without external resources. Participants are expected to submit a system for one of the two problems, and each problem is evaluated independently. Submissions will be evaluated by Fréchet Audio Distance (FAD), followed by a subjective test.

Organizers

Keunwoo Choi

Gaudio Lab, Inc.

Jaekwon Im

Gaudio Lab, Inc. / Korea Advanced Institute of Science & Technology (KAIST)

Laurie Heller

Carnegie Mellon University

Brian McFee

New York University

Keisuke Imoto

Doshisha University

Yuki Okamoto

Ritsumeikan University

Mathieu Lagrange

CNRS, Ecole Centrale Nantes, Nantes Université

Shinnosuke Takamichi

The University of Tokyo

Schedule

16 Jan 2023

Challenge task descriptions

01 Mar 2023

Challenge launch

01 May 2023

Release of evaluation datasets

15 May 2023

Challenge deadline

31 May 2023

Challenge results

Recent news

DCASE2023 Challenge results published

DCASE2023 Challenge received 428 submission entries

DCASE2023 Challenge evaluation datasets available
