Bird audio detection

Task description

Detecting bird sounds in audio is an important task for automatic wildlife monitoring, as well as in citizen science and audio library management. Bird sound detection is commonly a required first step before further analysis (e.g. classification, counting), and it makes work with large datasets (e.g. continuous 24h monitoring) feasible by filtering the data down to regions of interest.

This task is an expanded version of the Bird Audio Detection challenge which ran in 2016/2017.


The task is to design a system that, given a short audio recording, returns a binary decision for the presence/absence of bird sound (bird sound of any kind). The output can be just "0" or "1", but we encourage weighted/probability outputs in the continuous range [0,1] for the purposes of evaluation. For the main assessment we will use the well-known "Area Under the ROC Curve" (AUC) measure of classification performance.

An important goal of this task is generalisation to new conditions. To explore this we provide 3 separate development datasets and 3 evaluation datasets, each recorded under differing conditions. The datasets have different balances of positive/negative cases, different bird species, different background sounds, and different recording equipment. To solve this task well, you will need an approach which either inherently generalises across conditions (including conditions not seen in the training data), or which can self-adapt to new datasets ("domain adaptation").

Note that for every audio clip, you will be told which dataset it belongs to. This means that adapting to the overall characteristics of each dataset separately is possible. The evaluation will also consider each dataset separately and combine the outcomes, rather than treating them as a single pooled dataset.
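One simple way to exploit the known dataset membership is to standardise each clip's features using statistics computed over its own dataset only. The sketch below is illustrative, not part of the challenge tooling: the function name standardise_per_dataset, the single scalar feature per clip, and the toy values are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_per_dataset(features, dataset_ids):
    """Z-score each scalar feature using the mean/std of its own dataset.

    `features` is a list of floats (one hypothetical feature per clip) and
    `dataset_ids` is a parallel list of dataset names. No labels are used,
    so the adaptation relies only on overall dataset statistics.
    """
    grouped = defaultdict(list)
    for value, ds in zip(features, dataset_ids):
        grouped[ds].append(value)

    stats = {}
    for ds, values in grouped.items():
        sd = pstdev(values)
        # Guard against constant features to avoid division by zero.
        stats[ds] = (mean(values), sd if sd > 0 else 1.0)

    return [(v - stats[ds][0]) / stats[ds][1]
            for v, ds in zip(features, dataset_ids)]


# Toy example: two datasets recorded at very different levels.
feats = [1.0, 3.0, 100.0, 300.0]
tags = ["warblrb10k", "warblrb10k", "Chernobyl", "Chernobyl"]
normalised = standardise_per_dataset(feats, tags)
```

After this step the two toy datasets share the same scale, even though their raw values differ by two orders of magnitude.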

Audio datasets

We provide three development datasets, and three evaluation datasets, each from a separate bird sound monitoring project. All of the datasets contain 10-second-long WAV files (44.1 kHz mono PCM), and are manually labelled with a 0 or 1 to indicate the absence/presence of any birds within that 10-second audio clip.
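Before feature extraction it can be worth confirming that each downloaded clip matches the stated format. The following is a minimal sketch using Python's standard wave module; the function name check_clip and the half-second duration tolerance are assumptions made for illustration.

```python
import wave


def check_clip(path, expected_rate=44100, expected_channels=1,
               expected_seconds=10.0, tolerance=0.5):
    """Return (ok, info) for a PCM WAV file, based on its header only."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    info = {"rate": rate, "channels": channels, "seconds": duration}
    ok = (rate == expected_rate
          and channels == expected_channels
          and abs(duration - expected_seconds) <= tolerance)
    return ok, info
```

Calling check_clip on a clip path returns a flag plus the measured properties, so unexpected files can be logged rather than silently processed.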

Note that there will in general be some low level of error/disagreement in "ground truth" manual labelling. For BirdVox the accuracy of the labelling is estimated to be 99.5% or better, while for our other datasets it is estimated as 96.7%.

The sampling strategy differs across datasets. Crowdsourced recordings are opportunistic, and so reflect the sampling bias of their users (e.g. good weather conditions, daytime). The remote-monitoring sets contain clips extracted according to fixed recording schedules, and are typically evenly sampled across times of day and diverse weather conditions.

Download links are below, next to the description of each dataset.

All datasets are published under the Creative Commons Attribution licence CC-BY 4.0.

Development datasets

  1. Field recordings, worldwide ("freefield1010") - a collection of 7,690 excerpts from field recordings around the world, gathered by the FreeSound project, and then standardised for research. This collection is very diverse in location and environment, and for the BAD Challenge we have annotated it for the presence/absence of birds.
    - Download: [data labels] • [audio files (5.8 GB zip)] (or [via bittorrent])

  2. Crowdsourced dataset, UK ("warblrb10k") - 8,000 smartphone audio recordings from around the UK, crowdsourced by users of Warblr, the bird recognition app. The audio covers a wide distribution of UK locations and environments, and includes weather noise, traffic noise, human speech and even human bird imitations.
    - Download: [data labels] • [audio files (4.3 GB zip)] (or [via bittorrent])

  3. Remote monitoring flight calls, USA ("BirdVox-DCASE-20k") - 20,000 audio clips collected from remote monitoring units placed near Ithaca, NY, USA during the autumn of 2015, by the BirdVox project. More info about BirdVox-DCASE-20k
    - Download: [data labels] • [audio files (15.4 GB zip)]

(Thanks to Internet Archive, Zenodo and Figshare for dataset hosting)

Evaluation datasets

  1. Crowdsourced dataset, UK ("warblrb10k") - a held-out set of 2,000 recordings from the same conditions as the Warblr development dataset.
    - Download: audio files (1.3 GB zip)

  2. Remote monitoring dataset, Chernobyl ("Chernobyl") - 6,620 audio clips collected from unattended remote monitoring equipment in the Chernobyl Exclusion Zone (CEZ). This data was collected as part of the TREE (Transfer-Exposure-Effects) research project into the long-term effects of the Chernobyl accident on local ecology. The audio covers a range of birds and includes weather, large mammal and insect noise sampled across various CEZ environments, including abandoned village, grassland and forest areas.
    - Download: audio files (5.3 GB zip)

  3. Remote monitoring night-flight calls, Poland ("PolandNFC") - 4,000 recordings from Hanna Pamuła's PhD project of monitoring autumn nocturnal bird migration. The recordings were collected every night, from September to November 2016 on the Baltic Sea coast, Poland, using Song Meter SM2 units with microphones mounted on 3–5 m poles. For this challenge, we use a subset derived from 15 nights with different weather conditions and background noise including wind, rain, sea noise, insect calls, human voice and deer calls.
    - Download: audio files (2.3 GB zip)

(Thanks to Zenodo for dataset hosting)

In standard machine learning studies, you take only one dataset and divide it into "folds" for cross-validation. In this task, however, we recommend stratified 3-way cross-validation in which each fold trains on two of the development datasets and tests on the third. This lets you study how your method behaves when exposed to data from unseen conditions, just as in the final evaluation.
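The recommended split can be sketched as a leave-one-dataset-out loop. This is a minimal illustration, not the organisers' code: the helper name leave_one_dataset_out and the five-clip toy metadata are assumptions, and the dataset tags are simply the names used in the list above.

```python
def leave_one_dataset_out(dataset_ids):
    """Yield (held_out, train_idx, test_idx), one fold per dataset.

    Each fold tests on all clips of one dataset and trains on the rest,
    so the test conditions are never seen during training.
    """
    for held_out in sorted(set(dataset_ids)):
        train_idx = [i for i, ds in enumerate(dataset_ids) if ds != held_out]
        test_idx = [i for i, ds in enumerate(dataset_ids) if ds == held_out]
        yield held_out, train_idx, test_idx


# Toy metadata: one dataset tag per clip.
clip_datasets = ["freefield1010", "warblrb10k", "BirdVox-DCASE-20k",
                 "warblrb10k", "freefield1010"]
folds = list(leave_one_dataset_out(clip_datasets))
```

With three development datasets this yields exactly three folds, each holding out one complete dataset.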


Results files are submitted through a Kaggle-like interface designed for academic users.

Your system should output a results file in the following CSV format:


Two columns, separated by a comma (and rows separated by a "newline"). Filenames are without the .wav extension. Note that the public development data has only 0 or 1 as the annotation, while we encourage your submission to make use of the continuous range between 0 and 1, given as ordinary floating-point numbers. There should be no additional whitespace in the file, except for the newlines. The development data also has an extra column indicating which file belongs to which dataset, but this column is not needed for submission.

Note that you should submit a single CSV file, containing predictions for all of the evaluation datasets combined into one file. Participants must submit predictions for all of the evaluation datasets, since submitting for only some of them would allow less general solutions.
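The format described above could be produced along the following lines. This is a hedged sketch: the function name format_submission is hypothetical, and because the text does not say whether a header row is expected, the header is optional and off by default. Scores are written in fixed-point notation so that very small values do not appear in exponent form.

```python
def format_submission(predictions, header=None):
    """Build the CSV text: one 'filename,score' row per clip.

    `predictions` maps a filename (with or without '.wav') to a score in
    [0, 1]. `header` is an optional (col1, col2) tuple of column names;
    the names to use are an assumption left to the caller.
    """
    rows = []
    if header is not None:
        rows.append("{},{}".format(*header))
    for name, score in predictions.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError("score out of range for %s" % name)
        # Filenames are given without the .wav extension.
        item_id = name[:-4] if name.lower().endswith(".wav") else name
        rows.append("{},{:.6f}".format(item_id, score))
    # Newlines are the only whitespace in the file.
    return "\n".join(rows) + "\n"


csv_text = format_submission({"clip_a.wav": 0.93, "clip_b.wav": 0.0})
```

Predictions for all evaluation datasets would be merged into one mapping before calling the function, so that a single file is produced.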

During the challenge, a public leaderboard will be provided using a small amount of evaluation data to provide a "preview" of what the overall results might be like. (The preview score is calculated using approx 1000 files randomly selected from the Chernobyl and warblrb10k data.)

You may submit outputs from multiple variants of your system - one per day, with an overall maximum of 20 per team - until the challenge deadline. Each time, you will see the results on a small subset of the evaluation data, which gives an indication of your performance. You will then be asked to select your preferred four submissions to go into the final evaluation.

At the end of the challenge, you should also submit your final tech report and predictions using the submission system for the main DCASE challenge.

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

(a) Adapting to the evaluation data

Although the main rules state that "the use of statistics about the evaluation data in the decision making is also forbidden", for this task it is allowed and even encouraged that the algorithms developed can adapt their behaviour based on overall statistics of each dataset. However participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.

(b) Use of external data

Use of external data and transfer learning is allowed in this task under the following conditions:

  • The dataset used must be public and freely available
  • Participants must suggest such data to the organisers for listing on the task webpage, so that all competitors know about it and have equal opportunity to use it
  • Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
  • Participants should list in their technical report the external data sources they used

As one specific example, Google's "AudioSet" includes labelled bird examples, and we have prepared a metadata file for participants who wish to use AudioSet. However there are some important warnings: (1) AudioSet does not provide raw audio - instead it provides pre-computed features, which you might not wish to work with; (2) the labelling is crowdsourced and not intended for bird detection, so there are likely to be many false negatives and unusual items; (3) even the "balanced" AudioSet is very unbalanced for our purposes: less than 1.6% of the items are tagged with bird tags.

Here are external datasets that have been suggested by challenge participants:

Evaluation measure

The scoring on this task will use the "Area Under the ROC Curve" (AUC) measure of classification performance (we use the implementation from the sklearn.metrics Python package). More precisely, we will use a stratified AUC: we calculate AUC separately for each of the evaluation datasets, and then take the average (the harmonic mean). This gives two improvements over the "simple" AUC: firstly the calculation allows that the "detection threshold" for each dataset might be different; secondly it combines performance across datasets in an explicit weighted fashion, not merely influenced by the number of files in each dataset.

During development, you can perform a similar stratified AUC simply by calculating the AUC for each of the three "folds" as recommended above, and then taking the harmonic mean.
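The stratified measure can be sketched in a few lines. The challenge itself uses the sklearn.metrics implementation; to keep this example dependency-free, it substitutes a plain rank-based AUC (the probability that a positive clip scores higher than a negative one, with ties counted as one half), which should agree with the ROC-based value. The function names and toy data are assumptions.

```python
from statistics import harmonic_mean


def auc(labels, scores):
    """Rank-based AUC. Quadratic time, which is fine for a small sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(labels, scores, dataset_ids):
    """AUC per dataset, combined with the harmonic mean."""
    per_dataset = {}
    for ds in sorted(set(dataset_ids)):
        idx = [i for i, d in enumerate(dataset_ids) if d == ds]
        per_dataset[ds] = auc([labels[i] for i in idx],
                              [scores[i] for i in idx])
    return harmonic_mean(list(per_dataset.values())), per_dataset


# Toy example with two datasets of four clips each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.8, 0.2]
ds_ids = ["A"] * 4 + ["B"] * 4
overall, per_ds = stratified_auc(y_true, y_score, ds_ids)
```

Because each dataset is ranked on its own, a detector whose scores sit on a different scale in each dataset is not penalised, and the harmonic mean pulls the overall score towards the weakest dataset.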


For the bird detection task we offer two awards, with cash prizes, for the following:

  • £250: Highest-scoring reproducible method award for the highest-scoring submission that is open-source and fully reproducible.
  • £250: Judges' award for the method considered by the judges to be the most interesting or innovative.

The decisions will be made after the closing date, and will be based on the scores attained as well as the information supplied about how your method works.

Baseline system

As a baseline, we have published a modified version of "bulbul", the strongest-scoring system in the original Bird Audio Detection challenge. Our modified version works with the updated 2018 metadata format, and also performs stratified cross-validation across datasets, in the manner we recommend.


If you are participating in this task or using the datasets or code, please consider citing the following paper:


D. Stowell, Y. Stylianou, M. Wood, H. Pamuła, and H. Glotin. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods in Ecology and Evolution, 2018.


Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge


Assessing the presence and abundance of birds is important for monitoring specific species as well as overall ecosystem health. Many birds are most readily detected by their sounds, and thus passive acoustic monitoring is highly appropriate. Yet acoustic monitoring is often held back by practical limitations such as the need for manual configuration, reliance on example sound libraries, low accuracy, low robustness, and limited ability to generalise to novel acoustic conditions. Here we report outcomes from a collaborative data challenge showing that with modern machine learning including deep learning, general-purpose acoustic bird detection can achieve very high retrieval rates in remote monitoring data, with no manual recalibration, and no pre-training of the detector for the target species or the acoustic conditions in the target environment. Multiple methods were able to attain performance of around 88% AUC (area under the ROC curve), much higher performance than previous general-purpose methods. We present new acoustic monitoring datasets, summarise the machine learning techniques proposed by challenge teams, conduct detailed performance evaluation, and discuss how such approaches to detection can be integrated into remote monitoring projects.


The paper describes the previous bird challenge (2016-2017), but also specifies much of how this Task 3 was conducted.