DCASE2023 Challenge

IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events
15 March - 1 July 2023

Introduction

Sounds carry a large amount of information about our everyday environment and physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching multimedia based on its audio content; making context-aware mobile devices, robots, and cars; and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Challenge status

Task                                                                                        | Task description | Development dataset | Baseline system | Evaluation dataset    | Results
Task 1: Low-Complexity Acoustic Scene Classification                                        | Released         | Released            | Released        | Released              | Released
Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring  | Released         | Released            | Released        | Released              | Released
Task 3: Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes       | Released         | Released            | Released        | Released              | Released
Task 4A: Sound Event Detection with Weak Labels and Synthetic Soundscapes                   | Released         | Released            | Released        | Released              | Released
Task 4B: Sound Event Detection with Soft Labels                                             | Released         | Released            | Released        | Released              | Released
Task 5: Few-shot Bioacoustic Event Detection                                                | Released         | Released            | Released        | Released              | Released
Task 6A: Automated Audio Captioning                                                         | Released         | Released            | Released        | Released              | Released
Task 6B: Language-Based Audio Retrieval                                                     | Released         | Released            | Released        | Released              | Released
Task 7: Foley Sound Synthesis                                                               | Released         | Released            | Released        | No evaluation dataset | Released

updated 2023/06/01

Tasks

Low-Complexity Acoustic Scene Classification

Task 1

The task targets acoustic scene classification on devices with low computational and memory allowance, which imposes certain limits on model complexity. The task setup is based on limited model complexity, acoustically diverse data, and multiple mobile devices, reflecting a real-life application for ASC. The focus of the task is on training strategies for obtaining models that are robust yet light enough to run on embedded systems. Specific implementation requirements include a maximum memory allowance (with no predefined parameter representation format), a maximum of 30 MMACs (million multiply-accumulate operations), and the requirement to calculate energy consumption.

This is a repeat of the DCASE 2022 task, with the added calculation of energy consumption, which is a factor in the overall ranking.

Submissions will be ranked taking into account model accuracy, memory usage, MMACs, and energy consumption, yielding a measure that considers the low-resource constraints from multiple perspectives.
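
For orientation, the MMAC budget can be checked by counting multiply-accumulate operations layer by layer. The sketch below uses a hypothetical toy CNN and simplified counting rules (biases, activations, and pooling are ignored), so the official complexity computation may differ:

    # Rough MAC counting for a CNN classifier (simplified sketch; the
    # official complexity rules of the task may count operations differently).

    def conv2d_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
        """MACs of a standard 2D convolution layer."""
        return out_h * out_w * out_ch * in_ch * k_h * k_w

    def dense_macs(in_features, out_features):
        """MACs of a fully connected layer."""
        return in_features * out_features

    # Hypothetical toy model: two conv layers on a 64x64x1 spectrogram
    # followed by a 10-class dense output.
    macs = (
        conv2d_macs(64, 64, 16, 1, 3, 3)     # conv1
        + conv2d_macs(32, 32, 32, 16, 3, 3)  # conv2 (after 2x2 pooling)
        + dense_macs(32 * 32 * 32, 10)       # classifier head
    )
    print(f"{macs / 1e6:.1f} MMACs")  # compare against the 30 MMAC budget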

Organizers

Annamaria Mesaros
Irene Martin Morato
Francesco Paissan
Alberto Ancilotto
Elisabetta Farella
Toni Heittola
Tuomas Virtanen




First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

Task 2

The goal of this task is to identify whether a machine is normal or anomalous using only normal sound data, under domain-shifted conditions. One major difference from DCASE 2022 Task 2 is that the set of machine types is completely different between the development dataset and the evaluation dataset. Therefore, participants are expected to develop a system that can handle completely new machine types.
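
A common approach in this unsupervised setting, and the basis of baselines in earlier editions of the task, is to train an autoencoder on normal sounds only and use the reconstruction error of a new clip as its anomaly score: sounds the model cannot reconstruct well are flagged as anomalous. The architecture and feature shapes below are illustrative assumptions, not the official baseline:

    import torch
    import torch.nn as nn

    # Minimal autoencoder over log-mel feature vectors (a sketch, assuming
    # 640-dimensional inputs, e.g. 5 stacked frames of 128 mel bands).
    model = nn.Sequential(
        nn.Linear(640, 128), nn.ReLU(),
        nn.Linear(128, 8), nn.ReLU(),     # bottleneck
        nn.Linear(8, 128), nn.ReLU(),
        nn.Linear(128, 640),
    )

    def anomaly_score(model, features):
        """Mean squared reconstruction error over all frames of a clip."""
        with torch.no_grad():
            recon = model(features)
        return torch.mean((features - recon) ** 2).item()

    # Training uses only normal data; at test time a clip is declared
    # anomalous when its score exceeds a threshold tuned on normal sounds.
    clip = torch.randn(100, 640)          # placeholder features
    print(anomaly_score(model, clip))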

Organizers

Kota Dohi (Hitachi, Ltd.)
Keisuke Imoto (Doshisha University)
Yuma Koizumi (Google, Inc.)
Noboru Harada
Daisuke Niizumi
Tomoya Nishida (Hitachi, Ltd.)
Harsh Purohit (Hitachi, Ltd.)
Ryo Tanabe (Hitachi, Ltd.)
Takashi Endo (Hitachi, Ltd.)
Yohei Kawaguchi (Hitachi, Ltd.)




Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

Task 3

SELD is the joint task of detecting the temporal activities of multiple sound event classes of interest and estimating their respective spatial trajectories when active. This task is a continuation of DCASE2022 Task 3, which evaluated SELD systems on real, spatially annotated recordings. Testing on (and partially training with) real recordings brings new challenges and opportunities to improve real-world performance. To foster further innovation, this task will have, in addition to an audio-only setup similar to that of DCASE2022, an additional setup where participants also have access to 360° video of the recorded scenes.
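
As an illustration of what a SELD system outputs, one common representation in recent baselines (ACCDOA, proposed by some of the task organizers, though systems are free to use other formats) encodes each class per frame as a 3D Cartesian vector whose direction is the DOA and whose norm is the activity. The sketch below decodes such an output; the threshold and class count are illustrative assumptions:

    import numpy as np

    def decode_accdoa(accdoa, threshold=0.5):
        """Decode ACCDOA output (frames x classes x 3) into detections.

        A class is active in a frame when the norm of its Cartesian
        vector exceeds the threshold; the normalized vector is the DOA.
        (A sketch; threshold and post-processing are design choices.)
        """
        detections = []
        norms = np.linalg.norm(accdoa, axis=-1)            # (frames, classes)
        for frame, cls in zip(*np.where(norms > threshold)):
            doa = accdoa[frame, cls] / norms[frame, cls]   # unit vector
            detections.append((frame, cls, doa))
        return detections

    output = np.random.randn(10, 13, 3) * 0.3  # placeholder network output, 13 classes
    print(len(decode_accdoa(output)))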

Organizers

Archontis Politis
Kazuki Shimada
Yuki Mitsufuji
Tuomas Virtanen
Sharath Adavanne
Parthasaarathy Sudarsanam
Daniel Krause
Naoya Takahashi
Shusuke Takahashi
Yuichiro Koyama
Kengo Uchida




Sound Event Detection with Weak Labels and Synthetic Soundscapes; Sound Event Detection with Soft Labels

Task 4

The goal of sound event detection is to provide the event class together with the event time boundaries, given that multiple events can be present in an audio recording. The target of this task is to perform sound event detection using training datasets in which varying types of annotations are available, exploring how to leverage the different annotation types. Because strongly labeled data is costly to obtain, prone to annotator biases, and does not account for annotator uncertainty, this task investigates weakly labeled data and strongly labeled synthetic soundscapes in subtask A (a follow-up to DCASE 2022 Task 4) and softly labeled data (non-binary activity values for sounds) in subtask B. In both cases the labeled in-domain data can be used together with external datasets. To encourage cross-participation between subtasks, one baseline will be common to both.

Subtask A: Sound event detection with weak labels and synthetic soundscapes

This subtask is a continuation of Task 4 at DCASE 2022. The main novelties this year: additional sets of extracted embeddings will be made available, the evaluation will include a systematic assessment of energy consumption, and alternative evaluation methods will be explored to probe the robustness of the systems to changes in the operating point.

Subtask B: Sound event detection with soft labels

This is a new subtask where annotations on the training data are given at a 1 s resolution. The annotation is based on multiple annotators, and the aggregation is a value between 0 and 1 per event class per 1 s segment. These values are treated as soft labels that indicate the uncertainty of the annotator pool about the content. The soft annotations are used to train a sound event detection system that should perform at a 1 s time resolution.
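
Training with such labels can use binary cross-entropy directly against the non-binary targets, so annotator uncertainty enters the loss instead of being thresholded away. A minimal sketch, assuming a generic segment-level model and illustrative shapes (not the official baseline):

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, features, soft_labels):
        logits = model(features)                    # (segments, classes)
        # BCE accepts non-binary targets, so the aggregated annotator
        # values in [0, 1] are used as-is.
        loss = F.binary_cross_entropy_with_logits(logits, soft_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    model = torch.nn.Linear(64, 10)               # placeholder segment-level model
    opt = torch.optim.Adam(model.parameters())
    x = torch.randn(8, 64)                        # 8 one-second segments
    y = torch.rand(8, 10)                         # soft labels in [0, 1]
    print(training_step(model, opt, x, y))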

Organizers

Francesca Ronchini
Florian Angulo
David Perera
Slim Essid
Annamaria Mesaros
Irene Martin Morato
Toni Heittola





Few-shot Bioacoustic Event Detection

Task 5

This task focuses on sound event detection (SED) in a few-shot learning setting for animal vocalisations. Participants are expected to create a method that can extract information from five exemplar vocalisations in a long sound recording and generalise from those exemplars to find the remaining sound events in the audio. The bioacoustic (animal sound) setting addresses a real need from animal researchers, while also providing a well-defined, constrained, yet highly variable domain in which to evaluate few-shot learning methodology. In 2021 and 2022, the task stimulated a lot of innovation, but there is still plenty of scope for technical advances towards robust, strong performance. For this edition, the task keeps most details the same as in 2022 to ensure maximum comparability, but introduces new unseen evaluation recordings from new animal sound recording scenarios, to make the evaluation as representative as possible and to ensure there are truly unseen datasets in the evaluation.
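
A typical few-shot approach here (one option among many) builds a prototype from the five exemplar events and scans the rest of the recording for segments whose embeddings are close to it. The embedding function below is a stand-in assumption; in practice it would be a pretrained or episodically trained network:

    import numpy as np

    def detect_events(embed, shots, recording_windows, threshold=0.7):
        """Prototype-based few-shot detection (illustrative sketch).

        embed: function mapping an audio window to a fixed-size vector
        shots: the five exemplar vocalisations from the recording
        recording_windows: candidate windows from the rest of the audio
        """
        prototype = np.mean([embed(s) for s in shots], axis=0)
        prototype /= np.linalg.norm(prototype)
        hits = []
        for i, win in enumerate(recording_windows):
            e = embed(win)
            score = float(np.dot(e / np.linalg.norm(e), prototype))  # cosine similarity
            if score > threshold:
                hits.append((i, score))
        return hits

    rng = np.random.default_rng(0)
    embed = lambda w: np.tanh(w @ rng.standard_normal((100, 16)))  # stand-in embedding
    shots = [rng.standard_normal(100) for _ in range(5)]
    windows = [rng.standard_normal(100) for _ in range(50)]
    print(len(detect_events(embed, shots, windows)))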

Organizers

Ester Vidana Vila (La Salle, Universitat Ramon Llull)
Helen Whitehead (University of Salford)
Frants Jensen (Syracuse University)
Joe Morford (University of Oxford)
Michael Emmerson (Queen Mary University of London)
Burooj Ghani
Dan Stowell




Automated Audio Captioning and Language-Based Audio Retrieval

Task 6

This task approaches the analysis of audio signals by using natural language to represent their rich characteristics. The task setup is otherwise similar to the DCASE 2022 Challenge Task 6, with some changes to take into account developments in the field.

Subtask A: Automated Audio Captioning

This subtask is a continuation of Task 6 at the DCASE 2022 Challenge and focuses on the research question “How can we make machines understand higher level and human-perceived information from general sounds?”.

The metric used to rank the submissions will combine SPIDEr with a fluency error detection model. Participants will still report METEOR, CIDEr, and SPICE metrics. In addition, FENSE and CB-score will be reported as contrastive metrics.
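
For reference, SPIDEr is the average of the CIDEr and SPICE scores. Combining it with a fluency error detector can be sketched as below; the penalty factor is a hypothetical illustration, not the official challenge setting:

    def spider(cider: float, spice: float) -> float:
        """SPIDEr is the mean of the CIDEr and SPICE scores."""
        return 0.5 * (cider + spice)

    def spider_fl(cider: float, spice: float, fluency_error: bool,
                  penalty: float = 0.1) -> float:
        """Down-weight captions flagged by a fluency error detector.

        The penalty factor here is a hypothetical value; the challenge
        defines its own penalization scheme.
        """
        score = spider(cider, spice)
        return score * penalty if fluency_error else score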

Subtask B: Language-Based Audio Retrieval

This subtask is a continuation of Task 6 at the DCASE 2022 Challenge. Its goal is to evaluate methods in which a retrieval system takes a free-form textual description as input and ranks the audio signals in a fixed dataset according to how well they match the description.

In the 2022 Challenge, the ground-truth relevance of audio files was considered binary: only audio files matching their corresponding caption were considered relevant, and all other audio files non-relevant. In the 2023 Challenge, this limitation is addressed by crowdsourced graded relevance scores and the use of normalized discounted cumulative gain (nDCG) as the metric. mAP@10 (mean average precision at cut-off 10) and recall@k (i.e., recall@1, recall@5, and recall@10) will be used as secondary metrics.
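
As an illustration, nDCG compares the discounted gain of the returned ranking with that of an ideal ranking. A minimal sketch, assuming graded relevance values are available for the retrieved files (the exact grading and cut-offs follow the official evaluation):

    import numpy as np

    def ndcg_at_k(relevances, k=10):
        """Normalized discounted cumulative gain at cut-off k.

        relevances: graded relevance of the retrieved audio files, in
        the order returned by the system (illustrative sketch).
        """
        rel = np.asarray(relevances, dtype=float)[:k]
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        dcg = float(np.sum(rel * discounts))
        ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
        idcg = float(np.sum(ideal * discounts[:ideal.size]))
        return dcg / idcg if idcg > 0 else 0.0

    print(ndcg_at_k([3, 0, 2, 1, 0, 0, 1, 0, 0, 0]))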

Organizers

Felix Gontier (INRIA)
Konstantinos Drossos
Tuomas Virtanen





Foley Sound Synthesis

Task 7

This task aims to build a foley sound synthesis system that can generate plausible audio signals fitting into given categories of foley sound. The foley sound categories are composed of sound events and environmental background sounds. The challenge has two subproblems: the development of models with and without external resources. Participants are expected to submit a system for one of the two problems, and each problem is evaluated independently. Submissions will be evaluated by Fréchet Audio Distance (FAD), followed by a subjective test.
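
FAD compares the statistics of embedding distributions of generated and reference audio: each set of clips is embedded (the common reference implementation uses a VGGish model), a Gaussian is fitted to each set, and the Fréchet distance between the two Gaussians is computed. A minimal sketch of that final computation, with the embedding step assumed:

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_audio_distance(emb_real, emb_gen):
        """Fréchet distance between Gaussians fitted to two embedding sets.

        emb_real, emb_gen: arrays of shape (n_clips, embedding_dim),
        e.g. embeddings of reference and generated audio.
        """
        mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
        cov_r = np.cov(emb_real, rowvar=False)
        cov_g = np.cov(emb_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):          # discard tiny imaginary parts
            covmean = covmean.real
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

    rng = np.random.default_rng(0)
    print(frechet_audio_distance(rng.standard_normal((200, 16)),
                                 rng.standard_normal((200, 16)) + 0.5))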

Organizers

Keunwoo Choi (Gaudio Lab, Inc.)
Jaekwon Im (Gaudio Lab, Inc. / Korea Advanced Institute of Science & Technology (KAIST))
Laurie Heller (Carnegie Mellon University)
Brian McFee (New York University)
Keisuke Imoto (Doshisha University)
Yuki Okamoto (Ritsumeikan University)
Mathieu Lagrange (CNRS, Ecole Centrale Nantes, Nantes Université)
Shinnosuke Takamichi (The University of Tokyo)
