We invite you to participate in the fifth edition of the Detection and Classification of Acoustic Scenes and Events challenge. DCASE 2019 continues to support the development of computational scene and event analysis methods by comparing different approaches using common publicly available datasets.
Sounds carry a large amount of information about our everyday environment and the physical events that take place in it. We can perceive the sound scene we are within (busy street, office, etc.), and recognize individual sound sources (car passing by, footsteps, etc.). Developing signal processing methods to automatically extract this information has huge potential in several applications, for example searching for multimedia based on its audio content, making context-aware mobile devices, robots, cars, etc., and building intelligent monitoring systems that recognize activities in their environments using acoustic information. However, a significant amount of research is still needed to reliably recognize sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present, often simultaneously, and distorted by the environment.
Acoustic scene classification
The goal of acoustic scene classification is to classify a test recording into one of the predefined acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from previous DCASE Challenge editions, with some changes that bring new research problems into focus.
We provide three different setups of the acoustic scene classification problem:
Basic closed-set classification, using high-quality audio data from a single device (similar to Task 1 / Subtask A in the DCASE2018 Challenge). Development data and evaluation data from the same device are provided.
Closed-set classification that uses data from multiple devices (similar to Task 1 / Subtask B in the DCASE2018 Challenge). The development data comes mostly from a different device than the evaluation data. The task encourages domain adaptation methods to cope with the mismatch.
A new setup in which the evaluation data will also contain recordings from acoustic scenes not encountered in the training data. To limit the number of research problems, this subtask uses single-device data.
The dataset for this task is an extension of TUT Urban Acoustic Scenes 2018, with recordings from more cities and acoustic scenes.
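The third setup above requires a system to reject recordings from scenes it was never trained on. One common baseline idea, shown here only as a minimal sketch and not as the challenge's official method, is to threshold the classifier's highest class probability. The function name `classify_open_set`, the `UNKNOWN` label, the threshold value, and the example class names are all assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical label returned for recordings that match no known scene.
UNKNOWN = "unknown"


def classify_open_set(
    probs: Sequence[float],
    classes: Sequence[str],
    threshold: float = 0.5,
) -> str:
    """Return the most probable known scene, or UNKNOWN if confidence is low.

    `probs` are per-class probabilities from any closed-set classifier;
    `threshold` is an illustrative value that would need tuning.
    """
    if len(probs) != len(classes):
        raise ValueError("probs and classes must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best] if probs[best] >= threshold else UNKNOWN


# Illustrative class names, not the official scene list.
scenes = ["busy street", "office", "park"]
print(classify_open_set([0.7, 0.2, 0.1], scenes))
print(classify_open_set([0.4, 0.35, 0.25], scenes))
```

The first call is confident enough to return a known scene; the second falls below the threshold and is rejected as unknown.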
Audio tagging with noisy labels and minimal supervision
Current machine learning techniques require large and varied datasets in order to provide good performance and generalization. However, manually labeling a dataset is expensive and time-consuming, which limits its size. Websites like YouTube, Freesound, or Flickr host large volumes of user-contributed audio and metadata (e.g., tags), and labels can be inferred automatically from these metadata. Nevertheless, these automatically inferred labels might include a substantial level of noise. The goal of this DCASE task is to address the question of how to adequately exploit a small amount of reliable, manually labeled audio data together with a larger quantity of noisy web audio data, in the context of multi-label audio tagging with a large vocabulary.
Sound Event Localization and Detection
Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, their respective onset-offset times, and their spatial locations in azimuth and elevation angles. Each individual sound event instance in the provided recordings is spatially stationary, with a fixed location during its entire duration. Successful implementation of such a SELD method will enable the automatic description of social and human activities and help machines to interact with the world more seamlessly. Specifically, SELD will enable people with hearing impairment to visualize sounds. Robots and smart video conference equipment will be able to recognize and track the sound source of interest. Further, smart homes, smart cities, and smart industries can use SELD for audio surveillance.
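To make the expected SELD output concrete, the sketch below models one detected event as a record holding a label, onset and offset times, and a fixed direction in azimuth and elevation. The class name `SoundEvent`, its field names, the units, and the helper `active_at` are hypothetical choices for illustration and do not reflect the challenge's official submission format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SoundEvent:
    """One spatially stationary sound event instance (illustrative schema)."""

    label: str
    onset_s: float        # onset time in seconds
    offset_s: float       # offset time in seconds
    azimuth_deg: float    # fixed for the whole event duration
    elevation_deg: float  # fixed for the whole event duration

    def __post_init__(self) -> None:
        if self.offset_s <= self.onset_s:
            raise ValueError("offset must come after onset")


def active_at(events: List[SoundEvent], t: float) -> List[SoundEvent]:
    """Return the events sounding at time t (events may overlap)."""
    return [e for e in events if e.onset_s <= t < e.offset_s]


# Made-up example values, not taken from the challenge data.
events = [
    SoundEvent("footsteps", 0.5, 2.0, azimuth_deg=30.0, elevation_deg=0.0),
    SoundEvent("car passing by", 1.5, 4.0, azimuth_deg=-90.0, elevation_deg=10.0),
]
print([e.label for e in active_at(events, 1.75)])
```

Because each event is stationary, a single azimuth and elevation pair per event is enough in this sketch; moving sources would need a trajectory instead.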
Sound event detection in domestic environments
This task evaluates systems for the detection of sound events using real data that is either weakly labeled or unlabeled, together with simulated data that is strongly labeled (with time stamps). The systems must provide not only the event class but also the event time boundaries. The main scientific question this task aims to investigate is: do we really need real but partially and weakly annotated data, is simulated data sufficient, or do we need both?
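The difference between a class label alone and a label with time boundaries can be illustrated with a small post-processing step. The sketch below assumes a hypothetical system that emits one activity probability per frame for a single event class, and turns those into onset and offset times by thresholding. The function name `frames_to_events`, the hop size, and the threshold are assumptions, not part of the task definition.

```python
from typing import List, Sequence, Tuple


def frames_to_events(
    probs: Sequence[float],
    hop_s: float,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Convert per-frame probabilities into (onset, offset) pairs in seconds.

    A run of consecutive frames at or above `threshold` becomes one event.
    """
    events = []
    start = None  # index of the first frame of the current active run
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        # The recording ended while the event was still active.
        events.append((start * hop_s, len(probs) * hop_s))
    return events


# Made-up frame probabilities with a 0.5 s hop between frames.
print(frames_to_events([0.1, 0.8, 0.9, 0.2, 0.7], hop_s=0.5))
# → [(0.5, 1.5), (2.0, 2.5)]
```

Weakly labeled clips only say that a class occurs somewhere in the clip, so boundaries like these can be checked directly only against strongly labeled data.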
Urban Sound Tagging
This task evaluates systems for tagging short audio recordings with urban sound tags related to urban noise pollution. All recordings come from an acoustic sensor network deployed in New York City. The set of tags was selected based on discussions with noise officials in New York City and inspection of the city's noise code. This task aims to investigate audio tagging system performance on a relevant, real-world task given limited, unbalanced data of varying reliability.