This task evaluates systems for general-purpose audio tagging with an increased number of categories, using data with annotations of varying reliability. It will provide insight into the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories.
This task addresses the problem of general-purpose automatic audio tagging and poses two main challenges. The first is to build models that can recognize an increased number of sound events of very diverse nature, including musical instruments, human sounds, domestic sounds, animals, etc. The second is to leverage subsets of training data whose annotations vary in reliability, which reflects the high cost of obtaining high-quality annotations. This task will provide insight into the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories. These models can be used, for example, in automatic description of multimedia or in acoustic monitoring applications.
This task is hosted on Kaggle, a platform that hosts machine learning competitions with a vibrant community of participants. Hence, the resources associated with this task (dataset download, leaderboard and submission) will be provided by Kaggle. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about how to participate and all other relevant aspects of the challenge. What follows on this page is a summary of the most important aspects of the challenge.
The dataset provided for this task is a reduced subset of FSD: a work-in-progress, large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology. FSD is being collected through the Freesound Datasets platform, which is a platform for the collaborative creation of open audio collections. We encourage participants of the DCASE challenge to check out the Freesound Datasets platform and, why not, contribute with some annotations for the FSD. More information about the Freesound Datasets platform and the creation of FSD is available in:
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pp 486-493. Suzhou, China, 2017.
Freesound Datasets: a platform for the creation of open audio datasets
Openly available datasets are a key factor in the advancement of data-driven research approaches, including many of the ones used in sound and music computing. In the last few years, quite a number of new audio datasets have been made available but there are still major shortcomings in many of them to have a significant research impact. Among the common shortcomings are the lack of transparency in their creation and the difficulty of making them completely open and sharable. They often do not include clear mechanisms to amend errors and many times they are not large enough for current machine learning needs. This paper introduces Freesound Datasets, an online platform for the collaborative creation of open audio datasets based on principles of transparency, openness, dynamic character, and sustainability. As a proof-of-concept, we present an early snapshot of a large-scale audio dataset built using this platform. It consists of audio samples from Freesound organised in a hierarchy based on the AudioSet Ontology. We believe that building and maintaining datasets following the outlined principles and using open tools and collaborative approaches like the ones presented here will have a significant impact in our research community.
Recording and annotation procedure
This task employs data drawn from content uploaded by the Freesound user community, encompassing sounds in a wide range of real-world environments. Recording scenarios and techniques can be very different as sounds are uploaded by users across the globe. All audio samples in this dataset are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories. Using this mapping, a number of Freesound audio samples were automatically annotated. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample. Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence or absence of an automatically assigned sound category, according to the AudioSet category description. More details about the annotation procedure can be found in Fonseca et al. (2017). More information about AudioSet can be found in:
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.
Audio Set: An ontology and human-labeled dataset for audio events
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
In the provided dataset, a number of the ground truth labels have been manually verified (some of these labels have inter-annotator agreement, but not all of them), while the rest have not been manually verified and therefore some of them could be inaccurate. All audio samples in this dataset have a single label (i.e., they are only annotated with one label).
The datasets for this task can be downloaded from the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. Details about usage restrictions and sound licenses are provided there.
The task consists of predicting the sound category to which every audio sample in the test set belongs. Each file in the test set has a single ground truth label (i.e., single-tag tagging), although up to three labels can be predicted for each file, since we evaluate with Mean Average Precision @ 3. Predictions are made at the audio sample level, i.e., no start/end timestamps for the events are required. The dataset for this task is split into a train set and a test set.
The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds.
The train set is composed of ~3.7k manually-verified annotations and ~5.8k non-verified annotations. The quality of the non-verified annotations has been roughly estimated to be at least 65-70% in each sound category. A flag for each annotation is provided which indicates whether or not that annotation has been manually verified. Participants can use this information during the development of their systems.
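As an illustration of how the verification flag might be used, the sketch below separates verified from non-verified annotations. It assumes the annotations arrive as a CSV with columns named `fname`, `label` and `manually_verified` (with `1` meaning verified); these column names, the file names and the helper `split_by_verification` are assumptions made for this example, so check the Kaggle competition page for the actual format.

```python
import csv
import io

# Hypothetical excerpt of an annotation file; column names are assumed.
SAMPLE_CSV = """fname,label,manually_verified
a.wav,Bark,1
b.wav,Flute,0
c.wav,Bark,0
"""

def split_by_verification(csv_text):
    """Return (verified, unverified) lists of (fname, label) pairs."""
    verified, unverified = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the flag is stored as the string "1" when verified.
        target = verified if row["manually_verified"] == "1" else unverified
        target.append((row["fname"], row["label"]))
    return verified, unverified

verified, unverified = split_by_verification(SAMPLE_CSV)
print(len(verified), "verified,", len(unverified), "non-verified")
```

One possible use is to validate only on the verified subset while training on both, but that is a design choice left to participants.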
The test set is composed of ~1.6k manually-verified annotations with a category distribution similar to that of the train set. These annotations are complemented with ~7.8k padding annotations, which are also included in the test set but will not be used for evaluating the systems.
Submission and evaluation
Submissions will be done through the Kaggle platform and will be evaluated with the Mean Average Precision @ 3 metric. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about submission and evaluation.
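Because every file has exactly one ground truth label, Mean Average Precision @ 3 reduces to a simple rule: a file scores 1, 1/2 or 1/3 if the correct label appears in first, second or third position of its ranked predictions, and 0 otherwise. The sketch below shows that computation under this single-label assumption; the function names and the example labels are illustrative, and the official scoring is the one described on the Kaggle competition page.

```python
def average_precision_at_3(true_label, ranked_predictions):
    """Score one file: 1/rank of the true label within the top 3, else 0."""
    for rank, label in enumerate(ranked_predictions[:3], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0

def map_at_3(true_labels, predictions):
    """Average the per-file scores over all files."""
    scores = [
        average_precision_at_3(truth, ranked)
        for truth, ranked in zip(true_labels, predictions)
    ]
    return sum(scores) / len(scores)

# Illustrative data: correct at rank 1, correct at rank 2, and a miss.
truths = ["Bark", "Flute", "Cough"]
preds = [
    ["Bark", "Meow", "Cough"],
    ["Oboe", "Flute", "Cello"],
    ["Bark", "Meow", "Oboe"],
]
print(map_at_3(truths, preds))
```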
A detailed description of the task rules can be found in the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. This is a summary of the most important points:
- Participants are allowed to use external data for system development, but external data cannot be sourced from Freesound (including any of the original sound's metadata or other sounds in Freesound).
- Participants are not allowed to make subjective judgements of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
- The top-3 winning teams are required to publish their systems under an open-source license in order to be considered winners.
The baseline system provides a simple entry-level state-of-the-art approach that gives a sense of the performance possible with the dataset of Task 2. It is a good starting point especially for entry-level researchers to get familiar with the task. Regardless of whether participants build their approaches on top of this baseline system or develop their own, DCASE organizers encourage all participants to open-source their code after the challenge. An overall description of the baseline system is included next. More detailed information can be found in the repository.
The baseline system implements a convolutional neural network (CNN) classifier similar to, but scaled down from, the deep CNN models that have been very successful in the vision domain. The model takes framed examples of log mel spectrogram as input and produces ranked predictions over the 41 classes in the dataset. The baseline system also allows training a simpler fully connected multi-layer perceptron (MLP) classifier. The baseline system is built on TensorFlow.
We use frames of log mel spectrogram as input features:
- computing spectrogram with a window size of 25ms and a hop size of 10ms
- mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz
- log mel spectrogram is computed by applying log(mel spectrogram + 0.001)
- log mel spectrogram is then framed into overlapping examples with a window size of 0.25s and a hop size of 0.125s
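The feature steps above can be sketched with NumPy alone. This is not the baseline's TensorFlow code: the Hann window, the power-of-two FFT size, the HTK-style mel formula, and the rounding of the 0.125 s example hop to 12 spectrogram frames are all assumptions made for this illustration, as are the function names.

```python
import numpy as np

SAMPLE_RATE = 44100             # dataset sample rate (Hz)
WIN_S, HOP_S = 0.025, 0.010     # STFT window and hop (s)
N_MELS, F_MIN, F_MAX = 64, 125.0, 7500.0
LOG_OFFSET = 0.001
EXAMPLE_FRAMES = 25             # 0.25 s / 10 ms
EXAMPLE_HOP = 12                # 0.125 s is 12.5 frames; 12 is an assumption

def hz_to_mel(freq_hz):
    # HTK-style mel scale (assumed variant).
    return 1127.0 * np.log1p(np.asarray(freq_hz, dtype=float) / 700.0)

def mel_filterbank(n_fft, sr):
    """Triangular filters, shape (n_fft // 2 + 1, N_MELS)."""
    fft_mels = hz_to_mel(np.linspace(0.0, sr / 2.0, n_fft // 2 + 1))
    edges = np.linspace(hz_to_mel(F_MIN), hz_to_mel(F_MAX), N_MELS + 2)
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_mels[:, None] - lower) / (center - lower)
    falling = (upper - fft_mels[:, None]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def log_mel_examples(waveform, sr=SAMPLE_RATE):
    """Return framed log mel examples, shape (n_examples, 25, 64)."""
    win, hop = int(round(WIN_S * sr)), int(round(HOP_S * sr))
    n_fft = 1 << (win - 1).bit_length()          # next power of two
    n_frames = 1 + (len(waveform) - win) // hop
    if n_frames < EXAMPLE_FRAMES:
        return np.empty((0, EXAMPLE_FRAMES, N_MELS))
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = np.asarray(waveform, dtype=float)[idx] * np.hanning(win)
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    log_mel = np.log(magnitude @ mel_filterbank(n_fft, sr) + LOG_OFFSET)
    n_examples = 1 + (n_frames - EXAMPLE_FRAMES) // EXAMPLE_HOP
    ex_idx = (np.arange(EXAMPLE_FRAMES)[None, :]
              + EXAMPLE_HOP * np.arange(n_examples)[:, None])
    return log_mel[ex_idx]

# One second of noise stands in for a real clip.
noise = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
print(log_mel_examples(noise).shape)
```

Each example has 25 frames by 64 mel bins, which matches the input shape the baseline model expects once a channel dimension is added.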
The baseline CNN model consists of three 2-D convolutional layers (with ReLU activations) and alternating 2-D max-pool layers, followed by a final max-reduction (to produce a single value per feature map), and a softmax layer. The Adam optimizer is used to train the model, with a learning rate of 1e-4. A batch size of 64 is used.
The layers are listed in the table below using notation Conv2D(kernel size, stride, # feature maps) and MaxPool2D(kernel size, stride). Both Conv2D and MaxPool2D use the SAME padding scheme. ReduceMax applies a maximum-value reduction across the first two dimensions. Activation shapes do not include the batch dimension.
|Layer|Activation shape|
|---|---|
|Input|(25, 64, 1)|
|Conv2D(7x7, 1, 100)|(25, 64, 100)|
|MaxPool2D(3x3, 2x2)|(13, 32, 100)|
|Conv2D(5x5, 1, 150)|(13, 32, 150)|
|MaxPool2D(3x3, 2x2)|(7, 16, 150)|
|Conv2D(3x3, 1, 200)|(7, 16, 200)|
|ReduceMax|(1, 1, 200)|
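The activation shapes in the table follow from the SAME padding rule, under which each spatial dimension becomes ceil(size / stride) regardless of kernel size. The short sketch below walks through the layers and reproduces the shapes; it only does the shape arithmetic and is not a model definition, and the layer encoding and function name are invented for this example.

```python
import math

INPUT_SHAPE = (25, 64, 1)

# (name, kind, stride, number of feature maps); encoding is illustrative.
LAYERS = [
    ("Conv2D(7x7, 1, 100)", "conv", 1, 100),
    ("MaxPool2D(3x3, 2x2)", "pool", 2, None),
    ("Conv2D(5x5, 1, 150)", "conv", 1, 150),
    ("MaxPool2D(3x3, 2x2)", "pool", 2, None),
    ("Conv2D(3x3, 1, 200)", "conv", 1, 200),
    ("ReduceMax", "reduce", None, None),
]

def activation_shapes(input_shape=INPUT_SHAPE, layers=LAYERS):
    """Return [(layer name, (height, width, channels)), ...]."""
    height, width, channels = input_shape
    shapes = [("Input", input_shape)]
    for name, kind, stride, maps in layers:
        if kind == "reduce":
            # Max over the two spatial dimensions: one value per feature map.
            height, width = 1, 1
        else:
            # SAME padding: output size depends only on the stride.
            height = math.ceil(height / stride)
            width = math.ceil(width / stride)
            if kind == "conv":
                channels = maps
        shapes.append((name, (height, width, channels)))
    return shapes

for name, shape in activation_shapes():
    print(name, shape)
```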
The classifier predicts 41 scores for individual 0.25s-wide examples. In order to produce a ranked list of predicted classes for an entire clip, we average the predictions from all framed examples generated from the clip, and take the top 3 classes by score.
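The clip-level aggregation just described can be sketched as follows. The function name `predict_clip` and the toy class names are illustrative; a real system would pass in the per-example scores for the dataset's 41 classes.

```python
import numpy as np

def predict_clip(example_scores, class_names, top_k=3):
    """Average per-example scores and return the top_k class names.

    example_scores: array of shape (n_examples, n_classes), one row per
    0.25 s example generated from the clip.
    """
    clip_scores = np.asarray(example_scores, dtype=float).mean(axis=0)
    # Sort by descending score; a stable sort keeps ties in class order.
    ranked = np.argsort(-clip_scores, kind="stable")[:top_k]
    return [class_names[i] for i in ranked]

# Toy example with four classes and two framed examples.
names = ["A", "B", "C", "D"]
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.2, 0.4, 0.1],
]
print(predict_clip(scores, names))
```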
The baseline system achieves a MAP@3 of ~0.7 on the public Kaggle leaderboard after training for ~5 epochs over the entire training set, which takes ~12 hours on an n1-standard-8 Google Compute Engine machine with a quad-core Intel Xeon E5 v3 (Haswell) @ 2.3 GHz.