Audio tagging with noisy labels and minimal supervision

Task description

This task evaluates systems for multi-label audio tagging using a small set of manually-labeled data and a larger set of noisy-labeled data, under a large-vocabulary setting. It will provide insight into the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision.


Current machine learning techniques require large and varied datasets in order to provide good performance and generalization. However, manually labeling a dataset is expensive and time-consuming, which limits its size. Websites like YouTube, Freesound, or Flickr host large volumes of user-contributed audio and metadata (e.g., tags), and labels can be inferred automatically from these metadata. Nevertheless, such automatically inferred labels can include a substantial level of noise.

The main research question addressed in this task is how to adequately exploit these two different types of audio data (a small amount of manually-labeled data, and a larger quantity of noisy web audio data) in the context of multi-label audio tagging and for a large vocabulary setting. In addition, since the data comes from two different sources (Freesound and Flickr), the task encourages domain adaptation approaches to deal with a potential domain mismatch.

Figure 1: Overview of a multi-label tagging system.

This task is hosted on Kaggle, a platform that hosts machine learning competitions with a vibrant community of participants. Hence, the resources associated with this task (datasets, download, leaderboard and submission) as well as detailed information about how to participate will be provided on the Kaggle competition page. We expect to launch the competition in mid-March.

What follows on this page is a summary of the most important aspects of the challenge. More information about the following sections will be provided in the coming weeks.

Audio dataset

This task employs audio clips from the following sources:

  • Freesound Dataset (FSD): manually-labeled audio clips collected through the Freesound platform.
  • Yahoo Flickr Creative Commons 100M dataset (YFCC100M): audio from Flickr videos, with labels inferred automatically from the accompanying metadata.

The audio data is labeled using a vocabulary of around 70 labels from Google’s AudioSet Ontology, covering diverse topics (domestic sounds, natural sounds, human sounds, animal sounds, etc.). Audio clips have variable lengths (roughly from 0.3 to 30 s) due to the diversity of the sound categories and the preferences of users when recording/uploading sounds.


The dataset for this task will be downloadable from the Kaggle competition page. Details about usage restrictions and sound licenses are provided there.

Data will be published in mid-March.

Task setup

The task consists of predicting the audio labels (tags) for every test clip. Some test clips bear one label while others bear several labels. The predictions are to be done at the clip level, i.e., no start/end timestamps for the events are required. The dataset for this task is split into a train set and a test set.

Train set

The train set is meant for system development. The idea is to limit the supervision provided (i.e., the manually-labeled data), thus promoting novel approaches to dealing with label noise. The train set is composed of two subsets:

  • Curated subset: a small set of manually-labeled data from FSD. Number of clips/class: from 50 to 100.
  • Noisy subset: a larger set of noisy web audio data from Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset. Number of clips/class: we expect to have many more clips than in the curated subset.

Test set

The test set consists of around 50 clips per class taken from FSD. More information about the train and test sets will be provided once the dataset is released.

Submission and evaluation

Submissions will be done through the Kaggle platform and will be evaluated with the lwLRAP evaluation metric (see below). A link to the Kaggle competition page will be provided here once the competition starts.

Evaluation metric

The primary competition metric will be label-weighted label-ranking average precision (lwLRAP, pronounced "Lol wrap"). Treating each test clip as a query, the system ranks all available labels by score, and the precision of the ranked list down to each true label is computed; these precisions are then averaged. This generalizes the mean reciprocal rank measure (used in last year’s edition of the challenge) to the case where there can be multiple true labels per test item. The novel "label-weighted" part means that the overall score is the average over all the labels in the test set, with each label receiving equal weight (in contrast to giving each test item equal weight, which discounts the contribution of individual labels when they appear on the same item alongside multiple other labels). Label weighting allows per-class values to be calculated, while the overall metric remains a simple average of the per-class metrics, weighted by each label's prior in the test set. For participants’ convenience, a Python implementation of lwLRAP is provided here [link coming soon].
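The metric described above can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the official reference code: each (test clip, true label) pair contributes one precision value, and averaging all of those values uniformly is exactly the label-weighted average (classes are implicitly weighted by their prior in the test set).

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision (illustrative sketch).

    truth:  (n_clips, n_labels) binary matrix of reference labels.
    scores: (n_clips, n_labels) real-valued system scores.
    """
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    precisions = []
    for t, s in zip(truth, scores):
        if not t.any():
            continue  # clips with no true labels contribute nothing
        # 1-based rank of every label when scores are sorted descending.
        ranks = np.argsort(np.argsort(-s)) + 1
        true_ranks = np.sort(ranks[t])
        # Precision down to the rank of each true label: among the top-k
        # scored labels (k = that true label's rank), how many are true.
        hits = np.arange(1, len(true_ranks) + 1)
        precisions.extend(hits / true_ranks)
    # Uniform average over all (clip, true label) pairs = label weighting.
    return float(np.mean(precisions))
```

For example, a system that ranks every true label first scores 1.0, while misranking one of several true labels only partially penalizes that clip.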

Task rules

A detailed description of the task rules can be found on the Kaggle competition page. This is a summary of the most important points:

  • Participants are not allowed to make subjective judgements of the test data, nor to annotate it (this includes the use of statistics about the evaluation dataset in the decision making). The test set cannot be used to train the submitted system.
  • To be considered winners, the top teams are required to publish their systems under an open-source license.

Baseline system

The baseline system provides a simple entry-level approach that gives a sense of the performance achievable on the Task 2 dataset. It is a good starting point, especially for entry-level researchers, to get familiar with the task. Regardless of whether participants build their approaches on top of this baseline system or develop their own, DCASE organizers encourage all participants to open-source their code after the challenge.

The baseline system will implement a convolutional neural network (CNN) classifier similar to, but scaled down from, the deep CNN models that have been very successful in the vision domain. The model will take framed examples of log mel spectrogram as input and will produce ranked predictions over the classes in the dataset. The baseline system will be published after the dataset is released.
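Since clips have variable lengths, feeding a CNN "framed examples of log mel spectrogram" typically means slicing each clip's spectrogram into fixed-size windows. The sketch below illustrates one common way to do this; the frame length, hop, and padding strategy are our assumptions, not the baseline's published settings.

```python
import numpy as np

def frame_examples(log_mel, frame_len=100, hop=50):
    """Slice a (time, mel_bins) log mel spectrogram into fixed-size,
    overlapping examples suitable as CNN inputs.

    frame_len and hop (in spectrogram frames) are illustrative values.
    Clips shorter than frame_len are zero-padded to yield one example.
    """
    n_time, n_mels = log_mel.shape
    if n_time < frame_len:
        pad = np.zeros((frame_len - n_time, n_mels), dtype=log_mel.dtype)
        log_mel = np.vstack([log_mel, pad])
        n_time = frame_len
    starts = range(0, n_time - frame_len + 1, hop)
    # Shape: (n_examples, frame_len, n_mels)
    return np.stack([log_mel[s:s + frame_len] for s in starts])
```

At inference time, per-example class scores would then be aggregated (e.g., averaged) back to a single clip-level prediction, since the task requires no start/end timestamps.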