Detecting bird sounds in audio is an important task for automatic wildlife monitoring, as well as in citizen science and audio library management. Bird sound detection is very commonly a required first step before further analysis (e.g. classification, counting), and it makes work with large datasets (e.g. continuous 24h monitoring) feasible by filtering data down to regions of interest.
This task is an expanded version of the Bird Audio Detection challenge which ran in 2016/2017.
The task is to design a system that, given a short audio recording, returns a binary decision for the presence/absence of bird sound (bird sound of any kind). The output can be just "0" or "1", but we encourage weighted/probability outputs in the continuous range [0,1] for the purposes of evaluation. For the main assessment we will use the well-known "Area Under the ROC Curve" (AUC) measure of classification performance.
An important goal of this task is generalisation to new conditions. To explore this we provide three separate development datasets, and three evaluation datasets, each recorded under differing conditions. The datasets will have different balances of positive/negative cases, different bird species, different background sounds, and different recording equipment. To solve this task well, you will need an approach which either inherently generalises across conditions (including conditions not seen in the training data), or which can self-adapt to new datasets ("domain adaptation").
Note that for every audio clip, you will be told which dataset it belongs to. This means that adapting to the overall characteristics of each dataset separately is possible. The evaluation will also consider each dataset separately and combine the outcomes, rather than treating them as a single pooled dataset.
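As one illustration of what "adapting to the overall characteristics of each dataset" could mean in practice, the sketch below standardises a single scalar feature separately within each dataset. This is only an assumed example of per-dataset adaptation, not a method prescribed by the task; the function name `standardise_per_dataset` and the dataset identifiers "A" and "B" are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_per_dataset(rows):
    """Z-score a scalar feature using statistics of its own dataset only.

    `rows` is a list of (dataset_id, value) pairs. The result is a list of
    standardised values in the same order as the input.
    """
    grouped = defaultdict(list)
    for dataset_id, value in rows:
        grouped[dataset_id].append(value)

    # Mean and population standard deviation, computed per dataset.
    stats = {d: (mean(vs), pstdev(vs)) for d, vs in grouped.items()}

    standardised = []
    for dataset_id, value in rows:
        mu, sigma = stats[dataset_id]
        # A constant feature carries no information, so map it to 0.
        standardised.append((value - mu) / sigma if sigma > 0 else 0.0)
    return standardised


# Toy input: two hypothetical datasets with very different feature scales.
rows = [("A", 1.0), ("A", 3.0), ("B", 10.0), ("B", 30.0)]
scaled = standardise_per_dataset(rows)
```

After this step both toy datasets occupy the same range, which is the kind of dataset-level adjustment that knowing the dataset identity makes possible.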
We provide three development datasets, and three evaluation datasets, each from a separate bird sound monitoring project. All of the datasets contain 10-second-long WAV files (44.1 kHz mono PCM), and are manually labelled with a 0 or 1 to indicate the absence/presence of any birds within that 10-second audio clip.
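A quick sanity check of the stated clip format (10 seconds, 44.1 kHz, mono) could be sketched with Python's standard `wave` module as follows. The helper name `check_clip`, the duration tolerance, and the 16-bit sample width of the synthetic test clip are assumptions made for illustration; the text above does not state a bit depth.

```python
import io
import wave


def check_clip(fileobj, rate=44100, channels=1, seconds=10.0, tol=0.05):
    """Return True if a WAV file matches the expected rate, channels and length."""
    with wave.open(fileobj, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return (
            w.getframerate() == rate
            and w.getnchannels() == channels
            and abs(duration - seconds) <= tol
        )


# Build a synthetic 10-second silent clip in memory (16-bit is an assumption).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 441000)
buf.seek(0)

clip_ok = check_clip(buf)
```

The same function accepts a file path in place of the in-memory buffer, since `wave.open` takes either.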
Note that there will in general be some low level of error/disagreement in "ground truth" manual labelling. For BirdVox the accuracy of the labelling is estimated to be 99.5% or better, while for our other datasets it is estimated as 96.7%.
The sampling strategy differs across datasets. Crowdsourced recordings are opportunistic, having sampling bias of users (e.g. good weather conditions, daytime). The remote-monitoring sets contain clips extracted according to fixed recording schedules, and are typically evenly sampled across times of day and diverse weather conditions.
Download links are below, next to the description of each dataset.
Field recordings, worldwide ("freefield1010") - a collection of 7,690 excerpts from field recordings around the world, gathered by the FreeSound project, and then standardised for research. This collection is very diverse in location and environment, and for the BAD Challenge we have annotated it for the presence/absence of birds.
- Download: [data labels] • [audio files (5.8 Gb zip)] (or [via bittorrent])
Crowdsourced dataset, UK ("warblrb10k") - 8,000 smartphone audio recordings from around the UK, crowdsourced by users of Warblr, the bird recognition app. The audio covers a wide distribution of UK locations and environments, and includes weather noise, traffic noise, human speech and even human bird imitations.
- Download: [data labels] • [audio files (4.3 Gb zip)] (or [via bittorrent])
Remote monitoring flight calls, USA ("BirdVox-DCASE-20k") - 20,000 audio clips collected from remote monitoring units placed near Ithaca, NY, USA during the autumn of 2015, by the BirdVox project. More info about BirdVox-DCASE-20k
- Download: [data labels] • [audio files (15.4 Gb zip)]
The evaluation datasets will be released according to the main DCASE schedule.
Crowdsourced dataset, UK ("warblrb10k") - a held-out set of 2,000 recordings from the same conditions as the Warblr development dataset.
Remote monitoring dataset, Chernobyl ("Chernobyl") - 6,620 audio clips collected from unattended remote monitoring equipment in the Chernobyl Exclusion Zone (CEZ). This data was collected as part of the TREE (Transfer-Exposure-Effects) research project into the long-term effects of the Chernobyl accident on local ecology. The audio covers a range of birds and includes weather, large mammal and insect noise sampled across various CEZ environments, including abandoned village, grassland and forest areas.
Remote monitoring night-flight calls, Poland ("PolandNFC") - 4,000 recordings from Hanna Pamuła's PhD project of monitoring autumn nocturnal bird migration. The recordings were collected every night, from September to November 2016 on the Baltic Sea coast, Poland, using Song Meter SM2 units with microphones mounted on 3–5 m poles. For this challenge, we use a subset derived from 15 nights with different weather conditions and background noise including wind, rain, sea noise, insect calls, human voice and deer calls.
Recommended cross-validation procedure
In standard machine learning studies, you take only one dataset and divide it into "folds" for cross-validation. However, in this task, we recommend that you perform stratified 3-way cross-validation: in each fold, use two of the development datasets for training and the remaining one for testing. This allows you to study the way your method behaves when exposed to data from unseen conditions, just like in the final evaluation.
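The recommended procedure amounts to holding out one whole dataset per fold. A minimal sketch, assuming each clip is represented as an (item id, dataset id) pair, is given below; the function name `leave_one_dataset_out` and the toy item ids are hypothetical, and the dataset names are simply the ones quoted in the dataset descriptions.

```python
from collections import defaultdict


def leave_one_dataset_out(items):
    """Yield (held_out_dataset, train_ids, test_ids), one fold per dataset.

    `items` is an iterable of (item_id, dataset_id) pairs.
    """
    by_dataset = defaultdict(list)
    for item_id, dataset_id in items:
        by_dataset[dataset_id].append(item_id)

    names = sorted(by_dataset)
    for held_out in names:
        # Train on every dataset except the held-out one.
        train = [i for d in names if d != held_out for i in by_dataset[d]]
        test = list(by_dataset[held_out])
        yield held_out, train, test


# Toy item list standing in for the three development datasets.
items = [
    ("f1", "freefield1010"), ("f2", "freefield1010"),
    ("w1", "warblrb10k"), ("w2", "warblrb10k"),
    ("b1", "BirdVox-DCASE-20k"), ("b2", "BirdVox-DCASE-20k"),
]
folds = list(leave_one_dataset_out(items))
```

With three development datasets this yields three folds, and no clip from the test dataset ever appears in the corresponding training set.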
The interface for submitting results files will be a Kaggle-like interface designed for academic users.
Your system should output a results file in the following CSV format:
filename1,decision1
filename2,decision2
...
Two columns, separated by a comma (and rows separated by a "newline"). Filenames are without the .wav extension. Note that the public development data has only 0 or 1 as the annotation, while we encourage your submission to make use of the continuous range between 0 and 1, given as ordinary floating-point numbers. There should be no additional whitespace in the file, except for the newlines. The development data also has an extra column indicating which file belongs to which dataset, but this column is not needed for submission.
Note that you should submit a single CSV file, containing predictions for all of the evaluation datasets combined into one file. Participants must submit predictions for all of our testing datasets, since submitting for only some of them would allow less general solutions.
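The results format described above could be produced with a few lines of Python. This is a sketch under stated assumptions: the helper name `write_submission`, the six-decimal formatting, and the example filenames are illustrative choices, not requirements of the challenge.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (filename, score) pairs as two-column CSV with newline row ends.

    Any ".wav" extension is stripped, and scores must lie in [0, 1].
    """
    writer = csv.writer(out, lineterminator="\n")
    for filename, score in predictions:
        if filename.endswith(".wav"):
            filename = filename[: -len(".wav")]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {filename}: {score}")
        writer.writerow([filename, f"{score:.6f}"])


# Example with hypothetical filenames, written to an in-memory buffer.
predictions = [("clip_0001.wav", 0.93), ("clip_0002", 0.0125)]
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
```

To write a real file, open it with `open(path, "w", newline="")` and pass the handle as `out`, so that no extra carriage returns are added on Windows.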
During the challenge, a public leaderboard will be provided using a small amount of evaluation data to provide a "preview" of what the overall results might be like.
You may submit outputs from multiple variants of your system - one per day, with an overall maximum of 20 per team - until the challenge deadline. Each time, you will see how the results appear with a small subset of the evaluation data, which gives a clue about your performance. You will then be asked to select your preferred four submissions to go into the final evaluation.
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.
Task specific rules:
(a) Adapting to the evaluation data
Although the main rules state that "the use of statistics about the evaluation data in the decision making is also forbidden", for this task it is allowed and even encouraged that the algorithms developed can adapt their behaviour based on overall statistics of each dataset. However participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.
(b) Use of external data
Use of external data and transfer learning is allowed in this task under the following conditions:
- The used dataset must be public and freely available
- Participants must inform/suggest such data to be listed on the task webpage, so that all competitors know about them and have equal opportunity to use them
- Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
- Participants should list in their technical report the external data sources they used
As one specific example, Google's "AudioSet" includes labelled bird examples, and we have prepared a metadata file for participants who wish to use AudioSet. However there are some important warnings: (1) AudioSet does not provide raw audio - instead it provides pre-computed features, which you might not wish to work with; (2) the labelling is crowdsourced and not intended for bird detection, so there are likely to be many false negatives and unusual items; (3) even the "balanced" AudioSet is very unbalanced for our purposes: less than 1.6% of the items are tagged with bird tags.
The scoring on this task will use the "Area Under the ROC Curve" (AUC) measure of classification performance. More precisely, we will use a stratified AUC: we calculate AUC separately for each of the evaluation datasets, and then take the average (the harmonic mean). This gives two improvements over the "simple" AUC: firstly the calculation allows that the "detection threshold" for each dataset might be different; secondly it combines performance across datasets in an explicit weighted fashion, not merely influenced by the number of files in each dataset.
During development, you can perform a similar stratified AUC simply by calculating the AUC for each of the three "folds" as recommended above, and then taking the harmonic mean.
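A minimal sketch of this stratified AUC, using only the Python standard library, is shown below. It computes each dataset's AUC from the rank-sum (Mann-Whitney) formulation with average ranks for tied scores, then combines them with the harmonic mean. The function names `auc` and `stratified_auc`, the row layout, and the toy dataset ids "A" and "B" are assumptions for illustration; this is not the official scoring code.

```python
from collections import defaultdict
from statistics import harmonic_mean


def auc(labels, scores):
    """Area under the ROC curve via rank sums, with average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by tied scores
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_auc(rows):
    """Harmonic mean of per-dataset AUCs.

    `rows` is an iterable of (dataset_id, label, score) triples.
    """
    grouped = defaultdict(lambda: ([], []))
    for dataset_id, label, score in rows:
        grouped[dataset_id][0].append(label)
        grouped[dataset_id][1].append(score)
    per_dataset = {d: auc(ls, ss) for d, (ls, ss) in grouped.items()}
    return harmonic_mean(per_dataset.values()), per_dataset


# Toy predictions for two hypothetical datasets.
rows = [
    ("A", 0, 0.1), ("A", 0, 0.4), ("A", 1, 0.35), ("A", 1, 0.8),
    ("B", 0, 0.2), ("B", 1, 0.9),
]
overall, per_dataset = stratified_auc(rows)
```

Because the harmonic mean is dominated by the smallest value, a system that does poorly on one dataset is penalised more than it would be under an arithmetic mean.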
As a baseline, we have published a modified version of "bulbul", the strongest-scoring system in the original Bird Audio Detection challenge. Our modified version works with the updated 2018 metadata format, and also performs stratified cross-validation across datasets, in the manner we recommend.