Audio tagging with noisy labels and minimal supervision

Task description

This task evaluates systems for multi-label audio tagging using a small set of manually-labeled data, and a larger set of noisy-labeled data, under a large vocabulary setting. This task will provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.


Current machine learning techniques require large and varied datasets in order to provide good performance and generalization. However, manually labeling a dataset is expensive and time-consuming, which limits its size. Websites like YouTube, Freesound, or Flickr host large volumes of user-contributed audio and metadata, and labels can be inferred automatically from the metadata and/or by making predictions with pre-trained models. Nevertheless, these automatically inferred labels may include a substantial level of label noise.

The main research question addressed in this task is how to adequately exploit a small amount of reliable, manually-labeled data, and a larger quantity of noisy web audio data in a multi-label audio tagging task with a large vocabulary setting. In addition, since the data comes from different sources, the task encourages domain adaptation approaches to deal with a potential domain mismatch.

Figure 1: Overview of a multi-label tagging system.

This task is hosted on Kaggle, a platform that hosts machine learning competitions with a vibrant community of participants. The resources associated with this task (datasets, leaderboard, and submission), as well as detailed information about how to participate, are provided on the corresponding Kaggle competition page (note this task is named Freesound Audio Tagging 2019 on Kaggle). What follows on this page is a summary of the most important aspects of the challenge. For full information, please refer to the Kaggle competition page.

Owing to Kaggle / US Government policy, teams from several countries are unable to participate in the Kaggle competition (see rules page, section B.2 Eligibility). We, the organizers, are committed to making this competition open to everybody; if the Kaggle restrictions are preventing you from participating via their site, please contact us directly (Task 2 organizers) so we can devise an alternative solution to take your submissions into consideration.

Audio dataset

This task employs audio clips from the following sources:

  • Freesound Dataset (FSD): audio clips from Freesound, manually labeled through the Freesound Annotator.
  • Soundtracks of a pool of Flickr videos taken from the YFCC dataset.

The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology, covering diverse topics as shown in the pie chart below.

The full list of categories can be inspected in the data section of the Kaggle competition page. Details on the AudioSet Ontology can be found in:


Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.



Ground truth labels

The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s, see more details below).
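Since the labels are clip-level tags, a common way to feed them to a multi-label classifier is to encode each clip's tags as a multi-hot target vector. A minimal sketch with a hypothetical four-label vocabulary (the real task uses the 80 AudioSet labels):

```python
import numpy as np

# Hypothetical vocabulary; the actual task vocabulary has 80 AudioSet labels.
vocab = ["Bark", "Meow", "Siren", "Applause"]
index = {label: i for i, label in enumerate(vocab)}

def encode(tags):
    """Turn a clip's list of tags into a multi-hot target vector."""
    y = np.zeros(len(vocab), dtype=np.float32)
    for t in tags:
        y[index[t]] = 1.0
    return y

encode(["Bark", "Siren"])  # multi-hot vector [1., 0., 1., 0.]
```

Each clip yields one such vector regardless of how many tags it carries, which is what makes the task multi-label rather than multi-class.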

The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most, but not all, labels have inter-annotator agreement. More details about the data labeling process and the Freesound Annotator can be found in:


Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. Freesound datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), 486–493. Suzhou, China, 2017.



The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in:


Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra. Learning sound event classifiers from web audio with noisy labels. In Proc. IEEE ICASSP 2019. Brighton, UK, 2019.



Further information on the YFCC dataset can be found in:


Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Commun. ACM, 59(2):64–73, January 2016. doi:10.1145/2812802.




The dataset for this task can be downloaded from the Kaggle competition page. The audio files of this dataset are released under Creative Commons (CC) licenses, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Please check the data section of the Kaggle competition page for more information about usage restrictions and sound licenses.

Task setup

The task consists of predicting the audio labels (tags) for every test clip. Some test clips bear one label while others bear several labels. The predictions are to be done at the clip level, i.e., no start/end timestamps for the sound events are required. The dataset for this task is split into a train set and a test set.

Train set

The train set is meant to be for system development. The idea is to limit the supervision provided (i.e., the manually-labeled data), thus promoting novel approaches to deal with label noise. The train set is composed of two subsets:

  • Curated subset: a small set of manually-labeled data from FSD.
    • Number of clips/class: 75, except for a few classes with fewer clips
    • Total number of clips: 4970
    • Avg. number of labels/clip: 1.2
    • Total duration: 10.5 hours
    • The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

  • Noisy subset: a larger set of noisy web audio data from Flickr videos taken from YFCC.
    • Number of clips/class: 300
    • Total number of clips: 19815
    • Avg. number of labels/clip: 1.2
    • Total duration: ~80 hours
    • The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s.

Considering the numbers above, the per-class training data for most classes amounts to 300 clips from the noisy subset and 75 clips from the curated subset, i.e., roughly 80% noisy and 20% curated at the clip level (not in terms of audio duration, given the variable-length clips).
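Clip-level labels of this kind are typically distributed as CSV files with one row per clip and comma-separated tags in a single column. A minimal parsing sketch, with hypothetical filenames and column names (check the data section of the Kaggle competition page for the actual file layout):

```python
import csv
import io

# Hypothetical excerpt in a "fname,labels" CSV style; the real files and
# column names should be checked on the competition's data page.
sample = (
    'fname,labels\n'
    '0006ae4e.wav,"Bark"\n'
    '0019ef41.wav,"Raindrop,Walk_and_footsteps"\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
# Map each clip to its list of tags (multi-label rows are quoted in the CSV).
labels = {r["fname"]: r["labels"].split(",") for r in rows}
```

Note that multi-label rows rely on CSV quoting, so a proper CSV reader is needed rather than a naive line split on commas.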

Test set

The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test sets can be expected. Barring human error, all the acoustic material present in the test set is labeled, considering the vocabulary of 80 classes used in the competition.

The test set is split into two subsets, for the public and private leaderboards of the Kaggle competition page. In this competition, the submission is to be made through Kaggle Kernels. Only the subset corresponding to the public leaderboard is provided (without ground truth).

Submission and evaluation

Submissions are to be made through the Kaggle platform using Kaggle Kernels, and are evaluated with the lwlrap metric (see below). Participants can choose to train their systems either in Kaggle Kernels or offline. More information about submission using Kaggle Kernels can be found on the competition page.

Evaluation metric

The primary competition metric is label-weighted label-ranking average precision (lwlrap, pronounced "Lol wrap"). This measures the average precision of retrieving a ranked list of relevant labels for each test clip (i.e., the system ranks all the available labels, then the precisions of the ranked lists down to each true label are averaged). This is a generalization of the mean reciprocal rank measure (used in last year’s edition of the challenge) for the case where there can be multiple true labels per test item. The novel "label-weighted" part means that the overall score is the average over all the labels in the test set, where each label receives equal weight (by contrast, plain lrap gives each test item equal weight, thereby discounting the contribution of individual labels when they appear on the same item as multiple other labels).

We use label weighting because it allows per-class values to be calculated, while the overall metric remains the average of the per-class metrics, weighted by each label's prior (i.e., its frequency) in the test set. For participants' convenience, a Python implementation of lwlrap is provided in this Google Colab.
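As an illustration of the definition above, here is a minimal NumPy sketch of lwlrap: for each true label of each clip, the precision of the ranked label list down to that label is computed, and the per-hit precisions are then averaged per class and combined with class-frequency weights. The official implementation is the one in the linked Colab; this version favors clarity over speed.

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_classes) binary matrix of ground-truth labels.
    scores: (n_samples, n_classes) matrix of predicted scores.
    """
    n_samples, n_classes = truth.shape
    precisions = np.zeros((n_samples, n_classes))
    for s in range(n_samples):
        pos = np.flatnonzero(truth[s])
        if len(pos) == 0:
            continue
        # 1-based rank of every class when sorted by descending score.
        ranks = np.argsort(np.argsort(-scores[s])) + 1
        for c in pos:
            # Precision at the rank of true label c: how many of the labels
            # ranked at or above c are themselves true labels.
            hits = np.sum(ranks[pos] <= ranks[c])
            precisions[s, c] = hits / ranks[c]
    # Per-class average precision over the clips where the class is true.
    class_counts = truth.sum(axis=0).astype(float)
    per_class = precisions.sum(axis=0) / np.maximum(class_counts, 1)
    # Weight each class by its prior (frequency) in the test set.
    weights = class_counts / class_counts.sum()
    return float(np.sum(per_class * weights))
```

With these weights, the overall score equals a uniform average over all individual (clip, true label) pairs, which is exactly the "each label receives equal weight" behavior described above.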

Task rules

The most important rules include:

  • Unlike last year's edition of this task, participants are not allowed to use external data for system development. This also excludes the use of pre-trained models.
  • Participants are not allowed to make subjective judgements of the test data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The test set cannot be used to train the submitted system.
  • The winning teams are required to publish their systems under an open-source license in order to be considered winners.

As usual, further details about competition rules are given in the Rules section of the competition page.

Baseline system

The baseline system provides a simple approach, built with current deep learning practice, that gives a sense of the performance achievable with the Task 2 dataset. It is a good starting point, especially for entry-level researchers, to get familiar with the task. Regardless of whether participants build their approaches on top of this baseline system or develop their own, DCASE organizers encourage all participants to open-source their code after the challenge, adding installation and running instructions similar to those of the baseline.

The baseline system implements an audio classifier using an efficient MobileNet v1 convolutional neural network, which takes log mel spectrogram features as input and produces predicted scores for the 80 classes in the dataset. The baseline is available in this source code repository.
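As an illustration of the feature extraction step, below is a minimal NumPy sketch of a log mel spectrogram: frame the waveform, take windowed magnitude spectra, project them through a triangular mel filterbank, and take the log. All parameter values (FFT size, hop, number of mel bands, sample rate) are assumptions for illustration, not the baseline's actual settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel(y, sr=44100, n_fft=1024, hop=512, n_mels=96):
    # Frame the signal, apply a Hann window, take power spectra.
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    # Project onto the mel bands and compress with a log.
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)
```

In practice a library such as librosa is normally used for this step; the sketch only shows the shape of the computation, producing one n_mels-dimensional feature vector per frame.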


If you are using the dataset or baseline code, or want to refer to this challenge task, please cite the following paper: