This task evaluates systems for general-purpose audio tagging with an increased number of categories, using data with annotations of varying reliability. It will provide insight into the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories.

Description

This task addresses the problem of general-purpose automatic audio tagging and poses two main challenges. The first is to build models that can recognize an increased number of sound events of very diverse nature, including musical instruments, human sounds, domestic sounds, animals, etc. The second is to leverage subsets of training data whose annotations vary in reliability, reflecting how expensive high-quality annotations are to obtain. This task will provide insight into the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories. These models can be used, for example, in automatic description of multimedia or in acoustic monitoring applications.

Figure 1: Overview of a single-tag tagging system.

This task is hosted on Kaggle, a platform that hosts machine learning competitions with a vibrant community of participants. Hence, the resources associated with this task (dataset download, leaderboard and submission) are provided by Kaggle. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about how to participate and all other relevant aspects of the challenge. What follows on this page is a summary of the most important aspects of the challenge.

Kaggle competition page

Audio dataset

This task uses the FSDKaggle2018 dataset, consisting of audio samples from Freesound annotated using a vocabulary of 41 labels from Google’s AudioSet Ontology:

Tearing
Bus
Shatter
Gunshot, gunfire
Fireworks
Writing
Computer keyboard
Scissors
Microwave oven
Keys jangling
Drawer open or close
Squeak
Knock
Telephone

Saxophone
Oboe
Flute
Clarinet
Acoustic guitar
Tambourine
Glockenspiel
Gong
Snare drum
Bass drum
Hi-hat
Electric piano
Harmonica
Trumpet

Violin, fiddle
Double bass
Cello
Chime
Cough
Laughter
Applause
Finger snapping
Fart
Burping, eructation
Cowbell
Bark
Meow

The FSDKaggle2018 dataset provided for this task is a reduced subset of FSD, a work-in-progress, large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology. FSD is being collected through Freesound Datasets, a platform for the collaborative creation of open audio collections. We encourage participants of the DCASE challenge to check out the Freesound Datasets platform and, why not, contribute some annotations to FSD. More information about the Freesound Datasets platform and the creation of FSD is available in:

Publication

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), 486–493. Suzhou, China, 2017.

Freesound Datasets: a platform for the creation of open audio datasets

Abstract

Openly available datasets are a key factor in the advancement of data-driven research approaches, including many of the ones used in sound and music computing. In the last few years, quite a number of new audio datasets have been made available but there are still major shortcomings in many of them to have a significant research impact. Among the common shortcomings are the lack of transparency in their creation and the difficulty of making them completely open and sharable. They often do not include clear mechanisms to amend errors and many times they are not large enough for current machine learning needs. This paper introduces Freesound Datasets, an online platform for the collaborative creation of open audio datasets based on principles of transparency, openness, dynamic character, and sustainability. As a proof-of-concept, we present an early snapshot of a large-scale audio dataset built using this platform. It consists of audio samples from Freesound organised in a hierarchy based on the AudioSet Ontology. We believe that building and maintaining datasets following the outlined principles and using open tools and collaborative approaches like the ones presented here will have a significant impact in our research community.

Recording and annotation procedure

This task employs data drawn from content uploaded by the Freesound user community, encompassing sounds in a wide range of real-world environments. Recording scenarios and techniques can be very different as sounds are uploaded by users across the globe. All audio samples in this dataset are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories. Using this mapping, a number of Freesound audio samples were automatically annotated. These annotations can be understood as weak labels, since they only express the presence of a sound category in an audio sample. Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence or absence of each automatically assigned sound category, according to the AudioSet category description. More details about the annotation procedure can be found in Fonseca et al. (2017). More information about AudioSet can be found in:

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.

Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

In the provided FSDKaggle2018 dataset, a number of the ground truth labels have been manually verified (some, but not all, of these labels have inter-annotator agreement), while the rest have not been manually verified and therefore some of them could be inaccurate. All audio samples in this dataset have a single label (i.e., they are only annotated with one label).

Download

The FSDKaggle2018 dataset can be downloaded from the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. Details about usage restrictions and sound licenses are provided there.

Task setup

The task consists of predicting the sound category to which every audio sample in the test set belongs. Each file in the test set has a single correct label (i.e., single-tag tagging), although up to three labels can be predicted for each file, as we evaluate with Mean Average Precision @ 3. The predictions are made at the audio sample level, i.e., no start/end timestamps for the events are required. The dataset for this task is split into a train set and a test set.
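As an illustration of the metric, the following minimal sketch assumes the usual single-label reading of Mean Average Precision @ 3: each clip scores 1/rank if its true label appears among the first three ranked predictions, and 0 otherwise. The function names are hypothetical, and the authoritative definition is the one given on the Kaggle competition page.

```python
from typing import Sequence


def average_precision_at_3(true_label: str, ranked_predictions: Sequence[str]) -> float:
    """Score one clip: 1/rank of the true label within the top 3, else 0.

    Assumption: each clip has exactly one true label, so average precision
    reduces to the reciprocal rank of the first hit.
    """
    for rank, predicted in enumerate(ranked_predictions[:3], start=1):
        if predicted == true_label:
            return 1.0 / rank
    return 0.0


def mean_average_precision_at_3(
    true_labels: Sequence[str], predictions: Sequence[Sequence[str]]
) -> float:
    """Average the per-clip scores over all clips."""
    if len(true_labels) != len(predictions):
        raise ValueError("true_labels and predictions must have the same length")
    if not true_labels:
        return 0.0
    scores = [
        average_precision_at_3(label, ranked)
        for label, ranked in zip(true_labels, predictions)
    ]
    return sum(scores) / len(scores)


# Illustrative clips: hit at rank 1, hit at rank 2, and a miss.
example_truth = ["Bark", "Meow", "Flute"]
example_predictions = [
    ["Bark", "Meow", "Cough"],
    ["Cough", "Meow", "Bark"],
    ["Oboe", "Clarinet", "Trumpet"],
]
example_score = mean_average_precision_at_3(example_truth, example_predictions)
```

Under this reading, a correct label in first position earns full credit, while second and third positions earn one half and one third respectively, so ordering the three guesses by confidence matters.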

Train set

The train set is intended for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum is 300. The duration of the audio samples ranges from 300 ms to 30 s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds.

The train set is composed of ~3.7k manually-verified annotations and ~5.8k non-verified annotations. The quality of the non-verified annotations has been roughly estimated to be at least 65-70% in each sound category. A flag for each annotation is provided which indicates whether or not that annotation has been manually verified. Participants can use this information during the development of their systems.
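One way to use the verification flag is to separate the two kinds of annotations before training. The sketch below assumes the annotations come as a CSV file with columns named fname, label and manually_verified (1 for verified, 0 otherwise); these column names and the sample rows are assumptions for illustration, so check the files distributed on the Kaggle competition page for the actual layout.

```python
import csv
import io
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_verification(
    csv_text: str, flag_column: str = "manually_verified"
) -> Tuple[List[Row], List[Row]]:
    """Split annotation rows into (verified, non_verified) lists.

    Assumption: the flag column holds "1" for manually verified annotations.
    """
    verified: List[Row] = []
    non_verified: List[Row] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[flag_column] == "1":
            verified.append(row)
        else:
            non_verified.append(row)
    return verified, non_verified


# Hypothetical rows, made up for this example only.
sample_csv = """fname,label,manually_verified
clip_a.wav,Bark,1
clip_b.wav,Meow,0
clip_c.wav,Flute,0
"""

verified_rows, non_verified_rows = split_by_verification(sample_csv)
```

With the two subsets separated, a system could, for example, weight the non-verified annotations differently or hold out verified samples for validation.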

Test set

The test set is composed of ~1.6k manually-verified annotations with a category distribution similar to that of the train set. These annotations are complemented with ~7.8k padding annotations, which are also included in the test set but are not used for evaluating the systems.

Submission and evaluation

Submissions will be done through the Kaggle platform and will be evaluated with the Mean Average Precision @ 3 metric. Please visit the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page for detailed information about submission and evaluation.

Task rules

A detailed description of the task rules can be found on the Freesound General-Purpose Audio Tagging Challenge Kaggle competition page. This is a summary of the most important points:

Participants are allowed to use external data for system development, but external data cannot be sourced from Freesound (including any of the original sounds' metadata or other sounds in Freesound).
Participants are not allowed to make subjective judgements of the evaluation data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The evaluation dataset cannot be used to train the submitted system.
The top-3 winning teams are required to publish their systems under an open-source license in order to be considered winners.

Results

Rank	Code	Author	Affiliation	Technical Report	mAP@3 (Private leaderboard)
	DCASE2018 baseline	Eduardo Fonseca	Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Barcelona, Spain.	task-general-purpose-audio-tagging-results#Fonseca2018	0.6943
	Jeong_COCAI_task2_1	Il-Young Jeong	COCAI, Cochlear.ai, Seoul, South Korea.	task-general-purpose-audio-tagging-results#Jeong2018	0.9538
	Jeong_COCAI_task2_2	Il-Young Jeong	COCAI, Cochlear.ai, Seoul, South Korea.	task-general-purpose-audio-tagging-results#Jeong2018	0.9506
	Jeong_COCAI_task2_3	Il-Young Jeong	COCAI, Cochlear.ai, Seoul, South Korea.	task-general-purpose-audio-tagging-results#Jeong2018	0.9405
	Nguyen_NTU_task2_1	Thi Ngoc Tho Nguyen	Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore.	task-general-purpose-audio-tagging-results#Nguyen2018	0.9496
	Nguyen_NTU_task2_2	Thi Ngoc Tho Nguyen	Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore.	task-general-purpose-audio-tagging-results#Nguyen2018	0.9251
	Nguyen_NTU_task2_3	Thi Ngoc Tho Nguyen	Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore.	task-general-purpose-audio-tagging-results#Nguyen2018	0.9213
	Nguyen_NTU_task2_4	Thi Ngoc Tho Nguyen	Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore.	task-general-purpose-audio-tagging-results#Nguyen2018	0.9478
	Wilhelm_UKON_task2_1	Benjamin Wilhelm	Computer and Information Science (UKON), University of Konstanz, Constance, Germany.	task-general-purpose-audio-tagging-results#Wilhelm2018	0.9435
	Wilhelm_UKON_task2_2	Benjamin Wilhelm	Computer and Information Science (UKON), University of Konstanz, Constance, Germany.	task-general-purpose-audio-tagging-results#Wilhelm2018	0.9416
	Kim_GIST_WisenetAI_task2_1	Nam Kyun Kim	School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea.	task-general-purpose-audio-tagging-results#Kim2018	0.9151
	Kim_GIST_WisenetAI_task2_2	Nam Kyun Kim	School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea.	task-general-purpose-audio-tagging-results#Kim2018	0.9133
	Kim_GIST_WisenetAI_task2_3	Nam Kyun Kim	School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea.	task-general-purpose-audio-tagging-results#Kim2018	0.9139
	Kim_GIST_WisenetAI_task2_4	Nam Kyun Kim	School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea.	task-general-purpose-audio-tagging-results#Kim2018	0.9174
	Xu_Aalto_task2_1	Zhicun Xu	Department of Signal Processing and Acoustics (Aalto), Aalto University, Finland, Espoo, Finland.	task-general-purpose-audio-tagging-results#Xu2018	0.9065
	Xu_Aalto_task2_2	Zhicun Xu	Department of Signal Processing and Acoustics (Aalto), Aalto University, Finland, Espoo, Finland.	task-general-purpose-audio-tagging-results#Xu2018	0.9081
	Chakraborty_IBM_Task2_1	Ria Chakraborty	Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India.	task-general-purpose-audio-tagging-results#Chakraborty2018	0.9328
	Chakraborty_IBM_Task2_2	Ria Chakraborty	Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India.	task-general-purpose-audio-tagging-results#Chakraborty2018	0.9320
	Chakraborty_IBM_Task2_judges_award	Ria Chakraborty	Cognitive Business Decision Services (IBM), International Business Machines, India, Kolkata, India.	task-general-purpose-audio-tagging-results#Chakraborty2018	0.9079
	Han_NPU_task2_1	Xueyu Han	Center of Intelligence Acoustics and Immersive Communication (NPU), Northwestern Polytechnical University, Xi'an, China.	task-general-purpose-audio-tagging-results#Han2018	0.8723
	Zhesong_PKU_task2_1	Zhesong Yu	Institute of computer science & technology (PKU), Peking University, Beijing, China.	task-general-purpose-audio-tagging-results#Yu2018	0.8807
	Hanyu_BUPT_task2	Zhang Hanyu	Embedded Artificial Intelligence Group (BUPT), University of Posts and Telecommunications, Beijing, Beijing, China.	task-general-purpose-audio-tagging-results#Hanyu2018	0.7877
	Wei_Kuaiyu_task2_1	Qingkai WEI	Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC.	task-general-purpose-audio-tagging-results#WEI2018	0.9409
	Wei_Kuaiyu_task2_2	Qingkai WEI	Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC.	task-general-purpose-audio-tagging-results#WEI2018	0.9423
	Colangelo_RM3_task2_1	Federico Colangelo	Department of Engineering (RM3), Universita degli studi Roma Tre, Rome, Italy.	task-general-purpose-audio-tagging-results#Colangelo2018	0.6978
	Shan_DBSonics_task2_1	Yi Ren	DB Sonics, Beijing, China.	task-general-purpose-audio-tagging-results#Ren2018	0.9405
	Kele_NUDT_task2_1	Xu Kele	Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China.	task-general-purpose-audio-tagging-results#Kele2018	0.9498
	Kele_NUDT_task2_2	Xu Kele	Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China.	task-general-purpose-audio-tagging-results#Kele2018	0.9441
	Agafonov_ITMO_task2_1	Iurii Agafonov	Speech Information Systems (ITMO), ITMO University, Saint-Petersburg, Saint-Petersburg, Russia.	task-general-purpose-audio-tagging-results#Agafonov2018	0.9174
	Agafonov_ITMO_task2_2	Iurii Agafonov	Speech Information Systems (ITMO), ITMO University, Saint-Petersburg, Saint-Petersburg, Russia.	task-general-purpose-audio-tagging-results#Agafonov2018	0.9275
	Wilkinghoff_FKIE_task2_1	Kevin Wilkinghoff	Communication Systems (FKIE), Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany.	task-general-purpose-audio-tagging-results#Wilkinghoff2018	0.9414
	Pantic_ETF_task2_1	Bogdan Pantic	Signals and Systems Department (ETF), School of Electrical Engineering, Belgrade, Serbia.	task-general-purpose-audio-tagging-results#Pantic2018	0.9419
	Khadkevich_FB_task2_1	Maksim Khadkevich	AML (FB), Facebook, Menlo Park, CA, USA.	task-general-purpose-audio-tagging-results#Khadkevich2018	0.9188
	Khadkevich_FB_task2_2	Maksim Khadkevich	AML (FB), Facebook, Menlo Park, CA, USA.	task-general-purpose-audio-tagging-results#Khadkevich2018	0.9178
	Iqbal_Surrey_task2_1	Turab Iqbal	Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, UK, Surrey, UK.	task-general-purpose-audio-tagging-results#Iqbal2018	0.9484
	Iqbal_Surrey_task2_2	Turab Iqbal	Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, UK, Surrey, UK.	task-general-purpose-audio-tagging-results#Iqbal2018	0.9512
	Baseline_Surrey_task2_1	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK.	task-general-purpose-audio-tagging-results#Kong2018	0.9034
	Baseline_Surrey_task2_2	Qiuqiang Kong	Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK.	task-general-purpose-audio-tagging-results#Kong2018	0.8622
	Dorfer_CPJKU_task2_1	Matthias Dorfer	Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria.	task-general-purpose-audio-tagging-results#Dorfer2018	0.9518

Complete results and technical reports can be found on the results page.

Baseline system

The baseline system provides a simple entry-level state-of-the-art approach that gives a sense of the performance possible with the FSDKaggle2018 dataset of Task 2. It is a good starting point especially for entry-level researchers to get familiar with the task. Regardless of whether participants build their approaches on top of this baseline system or develop their own, DCASE organizers encourage all participants to open-source their code after the challenge. An overall description of the baseline system is included next. More detailed information can be found in the repository.

Repository

DCASE2018 Task 2 Baseline

System description

The baseline system implements a convolutional neural network (CNN) classifier similar to, but scaled down from, the deep CNN models that have been very successful in the vision domain. The model takes framed examples of log mel spectrogram as input and produces ranked predictions over the 41 classes in the dataset. The baseline system also allows training a simpler fully connected multi-layer perceptron (MLP) classifier. The baseline system is built on TensorFlow.

Input features

We use frames of log mel spectrogram as input features:

compute the spectrogram with a window size of 25 ms and a hop size of 10 ms
map the spectrogram to 64 mel bins covering the range 125-7500 Hz
compute the log mel spectrogram as log(mel spectrogram + 0.001)
frame the log mel spectrogram into overlapping examples with a window size of 0.25 s and a hop size of 0.125 s
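The arithmetic behind these steps can be sketched as follows. This is not the baseline code: it only works out the window and hop sizes implied by the parameters above for 44.1 kHz audio, applies the log offset, and frames an already computed mel spectrogram (represented here as a list of 64-value frames). Flooring fractional sizes (1102.5 samples, 12.5 frames) is an assumption of this sketch; the baseline repository may round differently.

```python
import math
from typing import List

SAMPLE_RATE_HZ = 44100        # from the dataset description
STFT_WINDOW_MS = 25
STFT_HOP_MS = 10
EXAMPLE_WINDOW_MS = 250
EXAMPLE_HOP_MS = 125
NUM_MEL_BINS = 64
LOG_OFFSET = 0.001

# Sizes in samples (STFT) and in spectrogram frames (examples), floored.
STFT_WINDOW_SAMPLES = SAMPLE_RATE_HZ * STFT_WINDOW_MS // 1000   # 1102
STFT_HOP_SAMPLES = SAMPLE_RATE_HZ * STFT_HOP_MS // 1000         # 441
EXAMPLE_WINDOW_FRAMES = EXAMPLE_WINDOW_MS // STFT_HOP_MS        # 25
EXAMPLE_HOP_FRAMES = EXAMPLE_HOP_MS // STFT_HOP_MS              # 12

Frames = List[List[float]]


def log_mel(mel_frames: Frames) -> Frames:
    """Apply log(mel + 0.001) element-wise; the offset avoids log(0)."""
    return [[math.log(value + LOG_OFFSET) for value in frame] for frame in mel_frames]


def frame_examples(
    frames: Frames,
    window: int = EXAMPLE_WINDOW_FRAMES,
    hop: int = EXAMPLE_HOP_FRAMES,
) -> List[Frames]:
    """Cut a spectrogram into overlapping examples of `window` frames.

    Clips shorter than one window yield no examples in this sketch.
    """
    last_start = len(frames) - window
    return [frames[start:start + window] for start in range(0, last_start + 1, hop)]


# One second of silence: 100 frames of 64 mel bins, all zeros.
silent_mel = [[0.0] * NUM_MEL_BINS for _ in range(100)]
examples = frame_examples(log_mel(silent_mel))
```

Each resulting example has 25 frames of 64 mel bins, which matches the (25, 64, 1) input shape of the classifier once a channel dimension is added.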

Architecture

The baseline CNN model consists of three 2-D convolutional layers (with ReLU activations) and alternating 2-D max-pool layers, followed by a final max-reduction (to produce a single value per feature map), and a softmax layer. The Adam optimizer is used to train the model, with a learning rate of 1e-4. A batch size of 64 is used.

The layers are listed in the table below using notation Conv2D(kernel size, stride, # feature maps) and MaxPool2D(kernel size, stride). Both Conv2D and MaxPool2D use the SAME padding scheme. ReduceMax applies a maximum-value reduction across the first two dimensions. Activation shapes do not include the batch dimension.

Layer	Activation shape
Input	(25, 64, 1)
Conv2D(7x7, 1, 100)	(25, 64, 100)
MaxPool2D(3x3, 2x2)	(13, 32, 100)
Conv2D(5x5, 1, 150)	(13, 32, 150)
MaxPool2D(3x3, 2x2)	(7, 16, 150)
Conv2D(3x3, 1, 200)	(7, 16, 200)
ReduceMax	(1, 1, 200)
Softmax	(41,)
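The activation shapes in the table follow from the SAME padding scheme, under which each spatial dimension of a layer's output is the input size divided by the stride, rounded up, regardless of kernel size. The short sketch below traces the shapes through the listed layers to check them; the helper names are hypothetical and the code does not build or train any model.

```python
from typing import List, Optional, Tuple

Shape = Tuple[int, int, int]


def same_padding_output(size: int, stride: int) -> int:
    """Output length under SAME padding: ceil(size / stride)."""
    return -(-size // stride)


def trace_shapes(input_shape: Shape = (25, 64, 1)) -> List[Tuple[str, Shape]]:
    """Return (layer name, activation shape) pairs for the baseline CNN table."""
    # (name, stride, number of feature maps or None for pooling layers)
    layers: List[Tuple[str, int, Optional[int]]] = [
        ("Conv2D(7x7, 1, 100)", 1, 100),
        ("MaxPool2D(3x3, 2x2)", 2, None),
        ("Conv2D(5x5, 1, 150)", 1, 150),
        ("MaxPool2D(3x3, 2x2)", 2, None),
        ("Conv2D(3x3, 1, 200)", 1, 200),
    ]
    height, width, channels = input_shape
    shapes: List[Tuple[str, Shape]] = [("Input", input_shape)]
    for name, stride, feature_maps in layers:
        height = same_padding_output(height, stride)
        width = same_padding_output(width, stride)
        if feature_maps is not None:
            channels = feature_maps  # pooling keeps the channel count
        shapes.append((name, (height, width, channels)))
    # ReduceMax collapses the two spatial dimensions to a single value per map.
    shapes.append(("ReduceMax", (1, 1, channels)))
    return shapes


for layer_name, shape in trace_shapes():
    print(f"{layer_name}\t{shape}")
```

The final softmax layer then maps the 200 reduced values to scores over the 41 classes.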

Clip prediction

The classifier predicts 41 scores for individual 0.25s-wide examples. In order to produce a ranked list of predicted classes for an entire clip, we average the predictions from all framed examples generated from the clip, and take the top 3 classes by score.
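The averaging step can be sketched as below. The function name and the four-class toy scores are made up for illustration; in the baseline, each example would carry 41 scores.

```python
from typing import List, Sequence


def predict_clip(
    example_scores: Sequence[Sequence[float]],
    class_names: Sequence[str],
    top_k: int = 3,
) -> List[str]:
    """Average per-example class scores and return the top_k class names."""
    if not example_scores:
        raise ValueError("a clip needs at least one framed example")
    num_examples = len(example_scores)
    # zip(*...) groups the scores of each class across all examples.
    mean_scores = [sum(per_class) / num_examples for per_class in zip(*example_scores)]
    ranked = sorted(range(len(mean_scores)), key=lambda i: mean_scores[i], reverse=True)
    return [class_names[i] for i in ranked[:top_k]]


# Toy example with four classes and three framed examples from one clip.
toy_classes = ["Bark", "Meow", "Flute", "Oboe"]
toy_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
top3 = predict_clip(toy_scores, toy_classes)
```

Averaging lets every 0.25 s example contribute equally, so a class that scores highly in only one example can still be outranked by a class that scores consistently across the clip.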

System performance

The baseline system reaches a MAP@3 of ~0.7 on the public Kaggle leaderboard after ~5 epochs over the entire training set. Training takes ~12 hours on an n1-standard-8 Google Compute Engine machine with a quad-core Intel Xeon E5 v3 (Haswell) @ 2.3 GHz.

Citation

If you are using the FSDKaggle2018 dataset or baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, and Xavier Serra. General-purpose tagging of freesound audio with audioset labels: task description, dataset, and baseline. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 69–73. November 2018. URL: https://arxiv.org/abs/1807.09902.

General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline

Abstract

This paper describes Task 2 of the DCASE 2018 Challenge, titled "General-purpose audio tagging of Freesound content with AudioSet labels". This task was hosted on the Kaggle platform as "Freesound General-Purpose Audio Tagging Challenge". The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 heterogeneous categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.

Keywords

Audio tagging, audio dataset, data collection

Coordinators

Frederic Font Corbera, Universitat Pompeu Fabra
Eduardo Fonseca, Universitat Pompeu Fabra
Daniel P. W. Ellis, Google, Inc.
Manoj Plakal, Google, Inc.
