Challenge results and analysis of submitted systems have been published in:
A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley. Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018. doi:10.1109/TASLP.2017.2778423.
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
Abstract
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) offered such an opportunity for the development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially over the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
Keywords
Acoustics; Event detection; Hidden Markov models; Speech; Speech processing; Tagging; Acoustic scene classification; Audio datasets; Pattern recognition; Sound event detection
Introduction
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are in (busy street, office, etc.) and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has great potential in several applications: for example, searching for multimedia based on its audio content; making mobile devices, robots, and cars context-aware; and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Previous challenges
Public evaluations are common in many areas of research, with some challenges running for many consecutive years. They help push the boundaries of algorithm development toward increasingly complex tasks. TRECVID Multimedia Event Detection is another evaluation with a long tradition, focusing on audiovisual, multi-modal event detection in video recordings. Such public evaluations provide a good opportunity for code dissemination, for unifying and defining terms and procedures, and for establishing benchmark datasets and evaluation metrics. It is our wish to provide a similar tool for computational auditory scene analysis, specifically for the detection and classification of sound scenes and events.
The previously organized DCASE 2013 challenge (sponsored by the IEEE AASP TC and held at WASPAA 2013) attracted the interest of the research community and had a good participation rate. It also contributed to creating benchmark datasets and fostered reproducible research (6 of the 18 participating teams released their source code through the challenge). Based on its success, we propose to organize a follow-up challenge on the performance evaluation of systems for the detection and classification of sound events. This challenge will move the DCASE setup closer to real-world applications by providing more complex problems. It will help define common ground for researchers actively pursuing research in this field, and offer a reference point for systems developed to perform parts of this task.
Tasks
Continuing from the previous DCASE challenge, the proposed tasks are acoustic scene classification and sound event detection within a scene.
Acoustic scene classification
The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "street", or "office". The acoustic data will include recordings from 15 contexts, with approximately one hour of data from each context. The setup is similar to the previous DCASE challenge, but with a higher number of classes and greater diversity of data.
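As a rough illustration of the task setup, the sketch below follows the spirit of a simple bag-of-frames classifier (MFCC features with one Gaussian mixture model per scene class), similar in flavor to the kind of baseline provided with the challenge; the function names, file layout, and parameter values are illustrative assumptions, not the actual baseline implementation:

```python
# Minimal sketch of a GMM-based acoustic scene classifier.
# Assumes librosa and scikit-learn; all names/parameters are illustrative.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=44100, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)

def train(class_to_paths, n_components=16):
    # One diagonal-covariance GMM per scene class, fit on pooled frames.
    models = {}
    for label, paths in class_to_paths.items():
        frames = np.vstack([mfcc_features(p) for p in paths])
        models[label] = GaussianMixture(
            n_components=n_components, covariance_type="diag").fit(frames)
    return models

def classify(models, path):
    # Score a test recording by summed frame log-likelihood per class.
    frames = mfcc_features(path)
    return max(models, key=lambda c: models[c].score_samples(frames).sum())
```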
Sound event detection in synthetic audio
The goal of sound event detection is to detect the sound events (for example “bird singing”, “car passing by”) that are present within an audio signal, estimate their start and end times, and give a class label to each of the events.
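To make the expected output concrete: a detection system typically produces frame-wise activity decisions per event class and then converts them into a list of (onset, offset) times. The helper below is a minimal sketch of that conversion; the frame hop value is an arbitrary assumption:

```python
# Turn a frame-wise activity curve for one event class into
# (onset, offset) pairs in seconds. Hop length is illustrative.
import numpy as np

def frames_to_events(activity, hop_seconds):
    """activity: 1-D array of 0/1 frame decisions for one class."""
    padded = np.concatenate(([False], activity.astype(bool), [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(on * hop_seconds, off * hop_seconds)
            for on, off in zip(changes[0::2], changes[1::2])]

print(frames_to_events(np.array([0, 1, 1, 1, 0, 0, 1, 1]), hop_seconds=0.02))
# -> [(0.02, 0.08), (0.12, 0.16)]
```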
The sound event detection challenge will consist of two distinct tasks. This task will focus on event detection of office sounds, and will use training material provided as isolated sound events for each class, and synthetic mixtures of the same examples in multiple SNR and event density conditions (sounds have been recorded at IRCCYN, École Centrale de Nantes). Participants will be allowed to use any combination of these for training their systems. The test data will consist of synthetic mixtures of (source-independent) sound examples at various SNR levels, event density conditions, and degrees of polyphony. The aim of this task is thus to study the behaviour of the tested algorithms when facing different levels of complexity, with the added benefit that the ground truth is highly accurate, even for polyphonic mixtures.
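As a sketch of how such synthetic mixtures can be produced (an assumption about the general approach, not the challenge's actual mixture-generation code), an isolated event can be scaled to a target SNR relative to the background segment it is placed on:

```python
# Mix an isolated event into a background at a target SNR (dB).
# A hedged, minimal sketch; real mixture generation involves more care
# (onset sampling, clipping checks, event density control).
import numpy as np

def mix_at_snr(background, event, snr_db, offset):
    out = background.copy()
    seg = out[offset:offset + len(event)]
    # Scale so that 10*log10(P_event / P_background_segment) == snr_db.
    gain = np.sqrt(np.mean(seg ** 2) / np.mean(event ** 2)) * 10 ** (snr_db / 20)
    out[offset:offset + len(event)] += gain * event
    return out
```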
Sound event detection in real life audio
The third task will use training and testing material recorded in real life environments. This task evaluates the performance of sound event detection systems in multisource conditions similar to our everyday life, where sound sources are rarely heard in isolation. In this case, there is no control over the number of overlapping sound events at each time, neither in the training nor in the testing audio data. The annotations of event activities are done manually, and can therefore be somewhat subjective.
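Systems for this kind of task are commonly scored with segment-based metrics computed over fixed-length segments (e.g., one second). The snippet below is a hedged sketch of a segment-based F-score under that assumption, not the challenge's official evaluation code:

```python
# Segment-based F-score over fixed-length segments.
# ref and est are boolean arrays of shape (n_segments, n_classes)
# marking class activity per segment; this layout is an assumption.
import numpy as np

def segment_f1(ref, est):
    tp = np.logical_and(ref, est).sum()
    fp = np.logical_and(~ref, est).sum()
    fn = np.logical_and(ref, ~est).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```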
Domestic audio tagging
This task will use binaural audio recordings made in a domestic environment. Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house. The audio data is provided as four-second chunks. The objective of the task is to label each chunk with one or more labels from a predefined set, such as "Child speech", "Adult male speech", and/or "Video game/TV".
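Because the labels are not mutually exclusive, a tagging system typically scores each label independently and thresholds the scores. The following sketch illustrates that multi-label step; the tag names follow the task description, while the scores and threshold are made up for the example:

```python
# Independent per-tag thresholding for multi-label audio tagging.
# Scores and threshold are illustrative, not from the challenge.
TAGS = ["Child speech", "Adult male speech", "Video game/TV"]

def tags_for_chunk(scores, threshold=0.5):
    """scores: per-tag probabilities for one chunk, aligned with TAGS."""
    return [tag for tag, s in zip(TAGS, scores) if s >= threshold]

print(tags_for_chunk([0.9, 0.2, 0.7]))  # -> ['Child speech', 'Video game/TV']
```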
Challenge setup
For each challenge task, a development dataset and baseline system will be provided. Challenge evaluation will be done using an evaluation dataset that will be published shortly before the deadline. Task-specific rules are available on the task pages.