General-purpose audio tagging of Freesound content with AudioSet labels


Task description

This task evaluates systems for general-purpose audio tagging with an increased number of categories, using data with annotations of varying reliability. It will provide insight into the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories.

Description

This task addresses the problem of general-purpose automatic audio tagging and poses two main challenges. The first is to build models that can recognize an increased number of sound events of very diverse nature, including musical instruments, human sounds, domestic sounds, and animals. The second is to leverage subsets of training data whose annotations vary in reliability, which reflects the cost of obtaining high-quality annotations. The resulting models can be used, for example, for automatic description of multimedia or in acoustic monitoring applications.

Figure 1: Overview of a single-tag tagging system.


This task is hosted on Kaggle, a platform that hosts machine learning competitions with a vibrant community of participants. Hence, the resources associated with this task (dataset download, leaderboard and submission) are provided by Kaggle. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about how to participate and all other relevant aspects of the challenge. The rest of this page summarizes the most important aspects of the challenge.

Kaggle competition page

Audio dataset

This task employs audio samples from Freesound annotated using a vocabulary of 41 labels from Google’s AudioSet Ontology:

  • Tearing
  • Bus
  • Shatter
  • Gunshot, gunfire
  • Fireworks
  • Writing
  • Computer keyboard
  • Scissors
  • Microwave oven
  • Keys jangling
  • Drawer open or close
  • Squeak
  • Knock
  • Telephone
  • Saxophone
  • Oboe
  • Flute
  • Clarinet
  • Acoustic guitar
  • Tambourine
  • Glockenspiel
  • Gong
  • Snare drum
  • Bass drum
  • Hi-hat
  • Electric piano
  • Harmonica
  • Trumpet
  • Violin, fiddle
  • Double bass
  • Cello
  • Chime
  • Cough
  • Laughter
  • Applause
  • Finger snapping
  • Fart
  • Burping, eructation
  • Cowbell
  • Bark
  • Meow

The dataset provided for this task is a reduced subset of FSD: a work-in-progress, large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology. FSD is being collected through the Freesound Datasets platform, a platform for the collaborative creation of open audio collections. We encourage participants of the DCASE challenge to explore the Freesound Datasets platform and to contribute annotations to FSD. More information about the Freesound Datasets platform and the creation of FSD is available in:

Publication

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), 486–493. Suzhou, China, 2017.

PDF

Freesound Datasets: a platform for the creation of open audio datasets

Abstract

Openly available datasets are a key factor in the advancement of data-driven research approaches, including many of the ones used in sound and music computing. In the last few years, quite a number of new audio datasets have been made available but there are still major shortcomings in many of them to have a significant research impact. Among the common shortcomings are the lack of transparency in their creation and the difficulty of making them completely open and sharable. They often do not include clear mechanisms to amend errors and many times they are not large enough for current machine learning needs. This paper introduces Freesound Datasets, an online platform for the collaborative creation of open audio datasets based on principles of transparency, openness, dynamic character, and sustainability. As a proof-of-concept, we present an early snapshot of a large-scale audio dataset built using this platform. It consists of audio samples from Freesound organised in a hierarchy based on the AudioSet Ontology. We believe that building and maintaining datasets following the outlined principles and using open tools and collaborative approaches like the ones presented here will have a significant impact in our research community.


Recording and annotation procedure

This task employs data drawn from content uploaded by the Freesound user community, encompassing sounds in a wide range of real-world environments. Recording scenarios and techniques can be very different as sounds are uploaded by users across the globe. All audio samples in this dataset are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories. Using this mapping, a number of Freesound audio samples were automatically annotated. These annotations can be understood as weak labels, since they express only the presence of a sound category in an audio sample, without timing information. Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence or absence of each automatically assigned sound category, according to the AudioSet category description. More details about the annotation procedure can be found in Fonseca et al. (2017). More information about AudioSet can be found in:

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.

PDF

Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.


In the provided dataset, a number of the ground truth labels have been manually verified (some, but not all, of these labels have inter-annotator agreement), while the rest have not been manually verified and could therefore be inaccurate. All audio samples in this dataset have a single label (i.e., they are annotated with only one label).

Download

The datasets for this task can be downloaded from the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. Details about usage restrictions and sound licenses are provided there.

Task setup

The task consists of predicting the sound category to which every audio sample in the test set belongs. Each file in the test set has a single ground truth label (i.e., single-tag tagging), although up to three labels can be predicted for each file, since systems are evaluated with Mean Average Precision @ 3. Predictions are made at the audio sample level, i.e., no start/end timestamps for the events are required. The dataset for this task is split into a train set and a test set.
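As an illustration of the metric, here is a minimal sketch of Mean Average Precision @ 3 for the single-label case. It assumes that, with exactly one true label per file, the average precision of a file reduces to the reciprocal rank of the correct label within the first three predictions, and to zero if the label is not among them. The function names are illustrative and are not taken from the official evaluation code.

```python
def average_precision_at_3(true_label, predicted_labels):
    """Return 1/rank of the true label within the first 3 predictions, else 0."""
    for rank, label in enumerate(predicted_labels[:3], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0


def map_at_3(true_labels, predictions):
    """Average the per-file scores over all files."""
    scores = [
        average_precision_at_3(true_label, predicted)
        for true_label, predicted in zip(true_labels, predictions)
    ]
    return sum(scores) / len(scores)


# Toy example with three files: correct at rank 1, at rank 2, and missed.
truth = ["Bark", "Meow", "Cello"]
guesses = [
    ["Bark", "Meow", "Oboe"],
    ["Bark", "Meow", "Oboe"],
    ["Flute", "Oboe", "Gong"],
]
print(map_at_3(truth, guesses))  # (1 + 1/2 + 0) / 3
```

Under this scoring, listing a second and third guess never hurts, which is why submitting three ranked labels per file is generally worthwhile.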

Train set

The train set is meant for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum is 300. The duration of the audio samples ranges from 300 ms to 30 s, due to the diversity of the sound categories and the preferences of Freesound users when recording sounds.

The train set is composed of ~3.7k manually-verified annotations and ~5.8k non-verified annotations. The quality of the non-verified annotations has been roughly estimated to be at least 65-70% in each sound category. A flag for each annotation is provided which indicates whether or not that annotation has been manually verified. Participants can use this information during the development of their systems.
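One possible way to use the verification flag is sketched below. The column names `fname`, `label` and `manually_verified`, the inline sample rows, and the weighting scheme are all assumptions made for illustration; the actual file layout should be checked on the Kaggle competition page. The weight of 0.65 for non-verified annotations is a hypothetical choice that mirrors the lower end of the estimated label quality and is not a recommendation from the organizers.

```python
import csv
import io

# Hypothetical excerpt of an annotation file; the real file is on Kaggle.
SAMPLE_CSV = """fname,label,manually_verified
a.wav,Bark,1
b.wav,Meow,0
c.wav,Bark,0
"""


def load_annotations(csv_text):
    """Parse rows into (file name, label, verified flag) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["fname"], row["label"], row["manually_verified"] == "1")
        for row in reader
    ]


def sample_weights(annotations, unverified_weight=0.65):
    """Give verified annotations full weight and the rest a reduced weight."""
    return {
        fname: 1.0 if verified else unverified_weight
        for fname, _label, verified in annotations
    }


annotations = load_annotations(SAMPLE_CSV)
weights = sample_weights(annotations)
verified_only = [a for a in annotations if a[2]]
print(weights)
print(verified_only)
```

Other strategies, such as training on the verified subset first and adding the non-verified samples later, fit the same flag-based split.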

Test set

The test set is composed of ~1.6k manually-verified annotations with a category distribution similar to that of the train set. These annotations are complemented with ~7.8k padding annotations, which are also included in the test set but are not used for evaluating the systems.

Submission and evaluation

Submissions will be done through the Kaggle platform and will be evaluated with the Mean Average Precision @ 3 metric. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about submission and evaluation.

Task rules

A detailed description of the task rules can be found on the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. This is a summary of the most important points:

  • Participants are allowed to use external data for system development, but external data cannot be sourced from Freesound (including any of the original sound's metadata or other sounds in Freesound).
  • Participants are not allowed to make subjective judgements of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
  • The top-3 winning teams are required to publish their systems under an open-source license in order to be considered winners.

Results

Code | Name | Author | Affiliation | Technical report | mAP@3 (private leaderboard)
DCASE2018 baseline | Baseline | Eduardo Fonseca | Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Barcelona, Spain. | task-general-purpose-audio-tagging-results#Fonseca2018 | 0.6943
Jeong_COCAI_task2_1 | Cochlear.ai_1 | Il-Young Jeong | COCAI, Cochlear.ai, Seoul, South Korea. | task-general-purpose-audio-tagging-results#Jeong2018 | 0.9538
Jeong_COCAI_task2_2 | Cochlear.ai_2 | Il-Young Jeong | COCAI, Cochlear.ai, Seoul, South Korea. | task-general-purpose-audio-tagging-results#Jeong2018 | 0.9506
Jeong_COCAI_task2_3 | Cochlear.ai_3 | Il-Young Jeong | COCAI, Cochlear.ai, Seoul, South Korea. | task-general-purpose-audio-tagging-results#Jeong2018 | 0.9405
Nguyen_NTU_task2_1 | NTU_ensemble8 | Thi Ngoc Tho Nguyen | Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. | task-general-purpose-audio-tagging-results#Nguyen2018 | 0.9496
Nguyen_NTU_task2_2 | NTU_labelsmoothing | Thi Ngoc Tho Nguyen | Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. | task-general-purpose-audio-tagging-results#Nguyen2018 | 0.9251
Nguyen_NTU_task2_3 | NTU_bgnormalization | Thi Ngoc Tho Nguyen | Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. | task-general-purpose-audio-tagging-results#Nguyen2018 | 0.9213
Nguyen_NTU_task2_4 | NTU_en8_augment_test | Thi Ngoc Tho Nguyen | Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. | task-general-purpose-audio-tagging-results#Nguyen2018 | 0.9478
Wilhelm_UKON_task2_1 | CNN on Raw-Audio and Spectrogram | Benjamin Wilhelm | Computer and Information Science (UKON), University of Konstanz, Constance, Germany. | task-general-purpose-audio-tagging-results#Wilhelm2018 | 0.9435
Wilhelm_UKON_task2_2 | CNN on Raw-Audio and Spectrogram | Benjamin Wilhelm | Computer and Information Science (UKON), University of Konstanz, Constance, Germany. | task-general-purpose-audio-tagging-results#Wilhelm2018 | 0.9416
Kim_GIST_task2_1 | ConResNet | Nam Kyun Kim | School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. | task-general-purpose-audio-tagging-results#Kim2018 | 0.9151
Kim_GIST_task2_2 | ConResNet | Nam Kyun Kim | School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. | task-general-purpose-audio-tagging-results#Kim2018 | 0.9133
Kim_GIST_task2_3 | ConResNet | Nam Kyun Kim | School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. | task-general-purpose-audio-tagging-results#Kim2018 | 0.9139
Kim_GIST_task2_4 | ConResNet | Nam Kyun Kim | School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. | task-general-purpose-audio-tagging-results#Kim2018 | 0.9174
Xu_Aalto_task2_1 | Multi-level attention model on fine-tuned AudioSet features | Zhicun Xu | Department of Signal Processing and Acoustics (Aalto), Aalto University, Finland, Espoo, Finland. | task-general-purpose-audio-tagging-results#Xu2018 | 0.9065
Xu_Aalto_task2_2 | Multi-level attention model on fine-tuned AudioSet features | Zhicun Xu | Department of Signal Processing and Acoustics (Aalto), Aalto University, Finland, Espoo, Finland. | task-general-purpose-audio-tagging-results#Xu2018 | 0.9081
Chakraborty_IBM_Task2_1 | 3 CNN L1 Stacked Fused Spectral XGBoost L2 | Ria Chakraborty | Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India. | task-general-purpose-audio-tagging-results#Chakraborty2018 | 0.9328
Chakraborty_IBM_Task2_2 | 2 CNN results geometrically averaged | Ria Chakraborty | Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India. | task-general-purpose-audio-tagging-results#Chakraborty2018 | 0.9320
Chakraborty_IBM_Task2_judges_award | VGG Style CNN with 3 channel input | Ria Chakraborty | Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India. | task-general-purpose-audio-tagging-results#Chakraborty2018 | 0.9079
Han_NPU_task2_1 | 2ModEnsem | Xueyu Han | Center of Intelligence Acoustics and Immersive Communication (NPU), Northwestern Polytechnical University, Xi'an, China. | task-general-purpose-audio-tagging-results#Han2018 | 0.8723
Zhesong_PKU_task2_1 | BCNN_WaveNet | Zhesong Yu | Institute of computer science & technology (PKU), Peking University, Beijing, China. | task-general-purpose-audio-tagging-results#Yu2018 | 0.8807
Hanyu_BUPT_task2 | CRNN | Zhang Hanyu | Embedded Artificial Intelligence Group (BUPT), University of Posts and Telecommunications, Beijing, Beijing, China. | task-general-purpose-audio-tagging-results#Hanyu2018 | 0.7877
Wei_Kuaiyu_task2_1 | Kuaiyu tagging system | Qingkai WEI | Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC. | task-general-purpose-audio-tagging-results#WEI2018 | 0.9409
Wei_Kuaiyu_task2_2 | Kuaiyu tagging system | Qingkai WEI | Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC. | task-general-purpose-audio-tagging-results#WEI2018 | 0.9423
Colangelo_RM3_task2_1 | DCASE2018 Task2 CRNN RM3 | Federico Colangelo | Department of Engineering (RM3), Universita degli studi Roma Tre, Rome, Italy. | task-general-purpose-audio-tagging-results#Colangelo2018 | 0.6978
Shan_DBSonics_task2_1 | Shan DBSonics approach | Yi Ren | DB Sonics, Beijing, China. | task-general-purpose-audio-tagging-results#Ren2018 | 0.9405
Kele_NUDT_task2_1 | DCASE2018 Meta-learning system | Xu Kele | Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China. | task-general-purpose-audio-tagging-results#Kele2018 | 0.9498
Kele_NUDT_task2_2 | DCASE2018 Meta-learning system | Xu Kele | Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China. | task-general-purpose-audio-tagging-results#Kele2018 | 0.9441
Agafonov_ITMO_task2_1 | Fusion of 4 CNN | Iurii Agafonov | Speech Information Systems (ITMO), ITMO University, Saint-Petersburg, Saint-Petersburg, Russia. | task-general-purpose-audio-tagging-results#Agafonov2018 | 0.9174
Agafonov_ITMO_task2_2 | Fusion of 4 CNN | Iurii Agafonov | Speech Information Systems (ITMO), ITMO University, Saint-Petersburg, Saint-Petersburg, Russia. | task-general-purpose-audio-tagging-results#Agafonov2018 | 0.9275
Wilkinghoff_FKIE_task2_1 | CNN Ensemble based on Multiple Features | Kevin Wilkinghoff | Communication Systems (FKIE), Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany. | task-general-purpose-audio-tagging-results#Wilkinghoff2018 | 0.9414
Pantic_ETF_task2_1 | Ensemble of convolutional neural networks for general purpose audio tagging | Bogdan Pantic | Signals and Systems Department (ETF), School of Electrical Engineering, Belgrade, Serbia. | task-general-purpose-audio-tagging-results#Pantic2018 | 0.9419
Khadkevich_FB_task2_1 | 2 average pooling | Maksim Khadkevich | AML (FB), Facebook, Menlo Park, CA, USA. | task-general-purpose-audio-tagging-results#Khadkevich2018 | 0.9188
Khadkevich_FB_task2_2 | 2 max pooling | Maksim Khadkevich | AML (FB), Facebook, Menlo Park, CA, USA. | task-general-purpose-audio-tagging-results#Khadkevich2018 | 0.9178
Iqbal_Surrey_task2_1 | Stacked CNN-CRNN (4 Models) | Turab Iqbal | Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, UK, Surrey, UK. | task-general-purpose-audio-tagging-results#Iqbal2018 | 0.9484
Iqbal_Surrey_task2_2 | Stacked CNN-CRNN (8 Models) | Turab Iqbal | Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, UK, Surrey, UK. | task-general-purpose-audio-tagging-results#Iqbal2018 | 0.9512
Baseline_Surrey_task2_1 | Surrey baseline CNN 8 layers | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK. | task-general-purpose-audio-tagging-results#Kong2018 | 0.9034
Baseline_Surrey_task2_2 | Surrey baseline CNN 4 layers | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK. | task-general-purpose-audio-tagging-results#Kong2018 | 0.8622
Dorfer_CPJKU_task2_1 | CNN - Iterative Self-Verification | Matthias Dorfer | Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria. | task-general-purpose-audio-tagging-results#Dorfer2018 | 0.9518


Complete results and technical reports can be found on the results page.

Baseline system

The baseline system provides a simple entry-level state-of-the-art approach that gives a sense of the performance possible with the dataset of Task 2. It is a good starting point especially for entry-level researchers to get familiar with the task. Regardless of whether participants build their approaches on top of this baseline system or develop their own, DCASE organizers encourage all participants to open-source their code after the challenge. An overall description of the baseline system is included next. More detailed information can be found in the repository.

Repository


System description

The baseline system implements a convolutional neural network (CNN) classifier similar to, but scaled down from, the deep CNN models that have been very successful in the vision domain. The model takes framed examples of log mel spectrogram as input and produces ranked predictions over the 41 classes in the dataset. The baseline system also allows training a simpler fully connected multi-layer perceptron (MLP) classifier. The baseline system is built on TensorFlow.

Input features

We use frames of log mel spectrogram as input features:

  • The spectrogram is computed with a window size of 25 ms and a hop size of 10 ms.
  • The spectrogram is mapped to 64 mel bins covering the range 125-7500 Hz.
  • The log mel spectrogram is computed as log(mel spectrogram + 0.001).
  • The log mel spectrogram is then framed into overlapping examples with a window size of 0.25 s and a hop size of 0.125 s.
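The bookkeeping implied by these settings can be sketched as follows. This is not the baseline's feature code: it only computes mel band edges, the log offset, and frame and example counts. Several details are assumptions, namely the HTK-style mel formula, truncating the 25 ms window to 1102 samples at 44.1 kHz, dropping incomplete frames at the end of a clip, and truncating the 0.125 s example hop (12.5 STFT hops) to 12 frames.

```python
import math

SAMPLE_RATE_HZ = 44100   # from the dataset description
STFT_WINDOW_MS = 25
STFT_HOP_MS = 10
EXAMPLE_WINDOW_MS = 250
EXAMPLE_HOP_MS = 125
LOG_OFFSET = 0.001


def hz_to_mel(hz):
    """HTK-style mel scale (assumed; the text does not name a formula)."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_band_edges(num_bins=64, low_hz=125.0, high_hz=7500.0):
    """Return num_bins + 2 edge frequencies equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_bins + 1)
    return [mel_to_hz(low + i * step) for i in range(num_bins + 2)]


def log_mel(mel_energy):
    """Log compression with the stabilizing offset from the list above."""
    return math.log(mel_energy + LOG_OFFSET)


def count_stft_frames(num_samples):
    """Number of complete STFT frames in a clip (no padding assumed)."""
    window = SAMPLE_RATE_HZ * STFT_WINDOW_MS // 1000   # 1102 samples
    hop = SAMPLE_RATE_HZ * STFT_HOP_MS // 1000         # 441 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def count_examples(num_frames):
    """Number of complete 0.25 s examples obtained from the STFT frames."""
    frames_per_example = EXAMPLE_WINDOW_MS // STFT_HOP_MS   # 25 frames
    hop_frames = EXAMPLE_HOP_MS // STFT_HOP_MS              # 12 (12.5 truncated)
    if num_frames < frames_per_example:
        return 0
    return 1 + (num_frames - frames_per_example) // hop_frames


one_second = count_stft_frames(SAMPLE_RATE_HZ)
print(one_second, count_examples(one_second))
```

Each example therefore spans 25 frames by 64 mel bins, and with these assumptions a clip shorter than roughly 0.27 s would yield no complete example without padding.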

Architecture

The baseline CNN model consists of three 2-D convolutional layers (with ReLU activations) and alternating 2-D max-pool layers, followed by a final max-reduction (to produce a single value per feature map), and a softmax layer. The Adam optimizer is used to train the model, with a learning rate of 1e-4. A batch size of 64 is used.

The layers are listed in the table below using notation Conv2D(kernel size, stride, # feature maps) and MaxPool2D(kernel size, stride). Both Conv2D and MaxPool2D use the SAME padding scheme. ReduceMax applies a maximum-value reduction across the first two dimensions. Activation shapes do not include the batch dimension.

Layer Activation shape
Input (25, 64, 1)
Conv2D(7x7, 1, 100) (25, 64, 100)
MaxPool2D(3x3, 2x2) (13, 32, 100)
Conv2D(5x5, 1, 150) (13, 32, 150)
MaxPool2D(3x3, 2x2) (7, 16, 150)
Conv2D(3x3, 1, 200) (7, 16, 200)
ReduceMax (1, 1, 200)
Softmax (41,)
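The activation shapes in the table can be checked with a short calculation. The sketch below assumes the usual behaviour of SAME padding, where each spatial output size equals the input size divided by the stride, rounded up; the kernel size then has no effect on the shape. The helper name and layer tuples are illustrative.

```python
import math

# (layer name, stride, number of feature maps; None keeps the channel count)
LAYERS = [
    ("Conv2D(7x7, 1, 100)", 1, 100),
    ("MaxPool2D(3x3, 2x2)", 2, None),
    ("Conv2D(5x5, 1, 150)", 1, 150),
    ("MaxPool2D(3x3, 2x2)", 2, None),
    ("Conv2D(3x3, 1, 200)", 1, 200),
]


def activation_shapes(frames=25, mel_bins=64, num_classes=41):
    """Propagate an input of (frames, mel_bins, 1) through the layer list."""
    shape = (frames, mel_bins, 1)
    shapes = [("Input", shape)]
    for name, stride, maps in LAYERS:
        height, width, channels = shape
        shape = (
            math.ceil(height / stride),   # SAME padding: ceil(size / stride)
            math.ceil(width / stride),
            maps if maps is not None else channels,
        )
        shapes.append((name, shape))
    # ReduceMax collapses the two spatial dimensions to a single value per map.
    shapes.append(("ReduceMax", (1, 1, shape[2])))
    shapes.append(("Softmax", (num_classes,)))
    return shapes


for layer_name, layer_shape in activation_shapes():
    print(layer_name, layer_shape)
```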

Clip prediction

The classifier predicts 41 scores for individual 0.25s-wide examples. In order to produce a ranked list of predicted classes for an entire clip, we average the predictions from all framed examples generated from the clip, and take the top 3 classes by score.
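A minimal sketch of this aggregation step follows, using a toy vocabulary of four labels instead of 41. The function name and the scores are made up for illustration; only the averaging and the top-3 selection come from the description above.

```python
def predict_clip(example_scores, labels, top_k=3):
    """Average per-example class scores and return the top_k labels."""
    num_examples = len(example_scores)
    # zip(*rows) groups the scores of each class across all examples.
    mean_scores = [sum(column) / num_examples for column in zip(*example_scores)]
    ranked = sorted(range(len(labels)), key=lambda i: mean_scores[i], reverse=True)
    return [labels[i] for i in ranked[:top_k]]


toy_labels = ["Bark", "Meow", "Cello", "Oboe"]
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],   # scores for the first 0.25 s example
    [0.5, 0.2, 0.2, 0.1],   # scores for the second 0.25 s example
]
print(predict_clip(toy_scores, toy_labels))
```

Averaging lets every example contribute to the ranking, so a class that scores moderately well throughout the clip can outrank one that dominates only a single example.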

System performance

The baseline system reaches an MAP@3 of ~0.7 on the public Kaggle leaderboard after ~5 epochs over the entire training set. Training takes ~12 hours on an n1-standard-8 Google Compute Engine machine with a quad-core Intel Xeon E5 v3 (Haswell) @ 2.3 GHz.

Citation

If you are participating in this task or using the dataset or baseline code, please cite the following paper:

Publication

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, and Xavier Serra. General-purpose tagging of freesound audio with audioset labels: task description, dataset, and baseline. Submitted to DCASE2018 Workshop, 2018. URL: https://arxiv.org/abs/1807.09902, arXiv:1807.09902.

PDF

General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline

Abstract

This paper describes Task 2 of the DCASE 2018 Challenge, titled "General-purpose audio tagging of Freesound content with AudioSet labels". This task was hosted on the Kaggle platform as "Freesound General-Purpose Audio Tagging Challenge". The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 heterogeneous categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.

Keywords

Audio tagging, audio dataset, data collection
