DCASE2020 Challenge

IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events

1 March - 1 July 2020

The challenge has ended.

Results for some tasks are ready and are presented on the task-specific results pages:

Task 1A Task 1B

Task 2 Task 3 Task 4 Task 5 Task 6

Introduction

Sounds carry a large amount of information about our everyday environment and physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars etc., and intelligent monitoring systems to recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.

Challenge status

Task	Task description	Development dataset	Baseline system	Evaluation dataset	Results
Task 1, Acoustic Scene Classification	Released	Released	Released	Released	Released
Task 2, Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring	Released	Released	Released	Released	Released
Task 3, Sound Event Localization and Detection	Released	Released	Released	Released	Released
Task 4, Sound Event Detection and Separation in Domestic Environments	Released	Released	Released	Released	Released
Task 5, Urban Sound Tagging with Spatiotemporal Context	Released	Released	Released	Released	Released
Task 6, Automated Audio Captioning	Released	Released	Released	Released	Released

updated 2020/07/01

Tasks

Acoustic scene classification

Task 1

The goal of acoustic scene classification is to classify a test recording into one of the predefined acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from previous DCASE Challenge editions, with some changes that bring new research problems into focus.

We provide two different setups of the acoustic scene classification problem:

Subtask A: Acoustic Scene Classification with Multiple Devices

Classification of data from multiple devices (real and simulated) targeting generalization properties of systems across a number of different devices.

Subtask B: Low-Complexity Acoustic Scene Classification

Classification of data into three higher-level classes while focusing on low-complexity solutions.

Organizers

Annamaria Mesaros

Tampere University

Toni Heittola

Tampere University

Tuomas Virtanen

Tampere University

Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring

Task 2

This is a novel DCASE challenge task closely related to an industrial problem: automatically detecting mechanical failures to support factory automation. The goal of this task is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge is to detect unknown anomalous sounds when only normal sound samples are provided as training data. Although deciding between normal and anomalous looks like a two-class classification problem, it cannot be solved as one, because no anomalous examples are available for training. Prompt detection of machine anomalies from their sounds is useful for machine condition monitoring.
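To illustrate the normal-only training constraint described above, here is a minimal sketch in Python. It is not the challenge baseline. It assumes each recording has already been reduced to a fixed-length feature vector, and the names `fit_normal_model`, `anomaly_score`, and `fit_threshold` are hypothetical. The model stores per-dimension statistics of normal data and scores a new sample by how far it deviates from them.

```python
from statistics import mean, pstdev


def fit_normal_model(normal_features):
    """Estimate per-dimension mean and standard deviation from normal-only data."""
    dims = list(zip(*normal_features))  # transpose: one tuple per feature dimension
    means = [mean(d) for d in dims]
    # Floor the deviation so constant dimensions do not cause division by zero.
    stds = [max(pstdev(d), 1e-8) for d in dims]
    return means, stds


def anomaly_score(features, model):
    """Mean squared z-score: larger values mean 'less like the normal data'."""
    means, stds = model
    return mean(((x - m) / s) ** 2 for x, m, s in zip(features, means, stds))


def fit_threshold(normal_features, model):
    """Assumption: use the largest score seen on normal training data as the threshold."""
    return max(anomaly_score(f, model) for f in normal_features)


# Toy normal-only training set (hypothetical feature values).
normal = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
model = fit_normal_model(normal)
threshold = fit_threshold(normal, model)

# A sample far from the normal statistics scores above the threshold.
is_anomalous = anomaly_score([5.0, -3.0], model) > threshold
```

No anomalous sample is needed to fit the model or the threshold, which is the property the task requires. A practical system would likely use a richer model of normal sounds than independent per-dimension statistics.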

Organizers

Yuma Koizumi

NTT Corporation

Yohei Kawaguchi

Hitachi, Ltd.

Keisuke Imoto

Doshisha University

Toshiki Nakamura

Hitachi, Ltd.

Yuki Nikaido

Hitachi, Ltd.

Ryo Tanabe

Hitachi, Ltd.

Harsh Purohit

Hitachi, Ltd.

Kaori Suefusa

Hitachi, Ltd.

Takashi Endo

Hitachi, Ltd.

Masahiro Yasuda

NTT Corporation

Noboru Harada

NTT Corporation

Sound Event Localization and Detection

Task 3

Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, their respective onset-offset times, and their spatial locations in azimuth and elevation angles. Each individual sound event instance in the provided recordings is spatially stationary, with a fixed location during its entire duration. Successful implementation of such a SELD method will enable the automatic description of social and human activities and help machines to interact with the world more seamlessly. Specifically, SELD will enable people with hearing impairment to visualize sounds. Robots and smart video conference equipment can recognize and track the sound source of interest. Further, smart homes, smart cities, and smart industries can use SELD for audio surveillance.
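The kind of output described above (label, onset-offset times, azimuth and elevation) can be sketched as a small Python data structure. This is an illustration only, not the task's submission format. The class name `SoundEvent`, its field names, and the axis convention (x forward, y left, z up) are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str
    onset_s: float        # event start time in seconds
    offset_s: float       # event end time in seconds
    azimuth_deg: float    # horizontal angle
    elevation_deg: float  # vertical angle

    def direction(self):
        """Unit vector for the event's direction (assumed axes: x forward, y left, z up)."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        return (
            math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el),
        )


def angular_error_deg(a, b):
    """Angle in degrees between the directions of two events."""
    dot = sum(x * y for x, y in zip(a.direction(), b.direction()))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


reference = SoundEvent("footsteps", 1.0, 2.5, azimuth_deg=0.0, elevation_deg=0.0)
estimate = SoundEvent("footsteps", 1.1, 2.4, azimuth_deg=90.0, elevation_deg=0.0)
error = angular_error_deg(reference, estimate)
```

Because each event is stationary, a single direction per event is enough in this sketch. Moving sources would need a direction per time frame instead.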

Organizers

Archontis Politis

Tampere University

Sharath Adavanne

Tampere University

Tuomas Virtanen

Tampere University

Sound Event Detection and Separation in Domestic Environments

Task 4

This task is the follow-up to DCASE 2019 Task 4. The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. This year, we also encourage participants to propose systems that use source separation jointly with sound event detection. This task aims to investigate how we can optimally exploit synthetic data, and to what extent source separation can improve sound event detection and vice versa.
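The difference between a weak label (class only) and the time boundaries a system must output can be sketched in Python as follows. This assumes a detector that produces one probability per frame for a given class. The function names, the 0.5 threshold, and the hop size are hypothetical choices for the example, not part of the task definition.

```python
def frames_to_events(probs, threshold=0.5, hop_s=0.1):
    """Turn per-frame probabilities for one class into (onset_s, offset_s) pairs."""
    events = []
    onset = None  # index of the first frame of the currently open event, if any
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            onset = i
        elif not active and onset is not None:
            events.append((onset * hop_s, i * hop_s))
            onset = None
    if onset is not None:
        # The event is still active when the recording ends.
        events.append((onset * hop_s, len(probs) * hop_s))
    return events


def weak_label(probs, threshold=0.5):
    """Clip-level (weak) decision: is the class present anywhere in the clip?"""
    return any(p >= threshold for p in probs)


frame_probs = [0.1, 0.8, 0.9, 0.2, 0.7]
events = frames_to_events(frame_probs, threshold=0.5, hop_s=0.5)
present = weak_label(frame_probs)
```

The weak label keeps only the information in `present`, while evaluation also needs the boundaries in `events`. That gap is what makes training from weakly labeled data hard.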

Organizers

Romain Serizel

University of Lorraine

Nicolas Turpault

Inria Nancy Grand-Est

John Hershey

Scott Wisdom

Hakan Erdogan

Justin Salamon

Ankit Parag Shah

Carnegie Mellon University

Daniel P. W. Ellis

Prem Seetharaman

Northwestern University

Urban Sound Tagging with Spatiotemporal Context

Task 5

This task is a follow-up to DCASE 2019 Task 5. It aims to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags for 10-second recordings from an urban acoustic sensor network for which we know when and where the recordings were taken. This task is motivated by the real-world problem of building machine listening tools to aid in the monitoring, analysis, and mitigation of urban noise pollution. In this task, in addition to the recordings, we provide identifiers of the New York City block where each recording was taken, as well as the time the recording was taken, quantized to the hour. We encourage all submissions to exploit this provided information and to incorporate any external data (e.g. weather, land usage, traffic, Twitter) that can further help inform your system to predict tags.
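One possible way to turn the provided metadata (hour of recording and city block identifier) into model inputs is sketched below in Python. This is an assumption about feature design, not something the task prescribes. The function names and the block identifiers `"b1"`, `"b2"`, `"b3"` are hypothetical. The hour is encoded on a circle so that 23:00 and 00:00 end up close together, and the block is one-hot encoded.

```python
import math


def encode_hour(hour):
    """Cyclical encoding of an hour in 0..23 as a (sin, cos) pair."""
    angle = 2 * math.pi * (hour % 24) / 24
    return [math.sin(angle), math.cos(angle)]


def encode_block(block_id, known_blocks):
    """One-hot encoding of a block identifier over a fixed list of known blocks."""
    return [1.0 if block_id == b else 0.0 for b in known_blocks]


def encode_context(hour, block_id, known_blocks):
    """Concatenate temporal and spatial features into one context vector."""
    return encode_hour(hour) + encode_block(block_id, known_blocks)


known_blocks = ["b1", "b2", "b3"]  # hypothetical identifiers
context = encode_context(6, "b2", known_blocks)
```

A vector like `context` could be concatenated with audio features before the tagging model. A block that is not in `known_blocks` produces an all-zero spatial part in this sketch.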

Organizers

Mark Cartwright

New York University

Jason Cramer

New York University

Ana Elisa Mendez Mendez

New York University

Yu Wang

New York University

Ho-Hsiang Wu

New York University

Vincent Lostanlen

Cornell University

Magdalena Fuentes

New York University

Justin Salamon

Juan P. Bello

New York University

Automated Audio Captioning

Task 6

Automated audio captioning is the task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. Audio captioning methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content-oriented machine-to-machine interaction.

Organizers

Konstantinos Drossos

Tampere University

Samuel Lipping

Tampere University

Tuomas Virtanen

Tampere University
