Few-shot Bioacoustic Event Detection


Task description

This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Few-shot learning is a highly promising paradigm for sound event detection. It is also an extremely good fit to the needs of users in bioacoustics, in which increasingly large acoustic datasets commonly need to be labelled for events of an identified category (e.g. species or call-type), even though this category might not be known in other datasets or have any yet-known label. While satisfying user needs, this will also benchmark few-shot learning for the wider domain of sound event detection (SED).


Few-shot learning describes tasks in which an algorithm must make predictions given only a few instances of each class, in contrast to the standard supervised learning paradigm. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance and noisy/busy environments. Few-shot learning is usually studied using N-way-K-shot classification, where N denotes the number of classes and K the number of examples for each class.
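
As a rough illustration of the N-way-K-shot setup, the sketch below samples one training episode from a pool of labelled items. The function name `sample_episode`, the `n_query` parameter and the toy data are assumptions made for this example; they are not part of the task or its baselines.

```python
import random
from collections import defaultdict


def sample_episode(labelled_items, n_way, k_shot, n_query, seed=None):
    """Sample one N-way-K-shot episode from (item, label) pairs.

    Returns (support, query): dicts mapping each chosen class to a list of items.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in labelled_items:
        by_class[label].append(item)

    # Only classes with enough examples can supply K shots plus the queries.
    eligible = sorted(c for c, items in by_class.items()
                      if len(items) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = {}, {}
    for cls in rng.sample(eligible, n_way):
        chosen = rng.sample(by_class[cls], k_shot + n_query)
        support[cls] = chosen[:k_shot]   # the K "shots" for this class
        query[cls] = chosen[k_shot:]     # held-out items to classify
    return support, query


# Toy data: 4 hypothetical classes with 10 clip identifiers each.
toy = [(f"{c}_{i}", c) for c in ("A", "B", "C", "D") for i in range(10)]
support, query = sample_episode(toy, n_way=3, k_shot=5, n_query=2, seed=0)
```

Here N = 3 and K = 5, so the support set holds five examples for each of three classes, and the query items are kept disjoint from the shots.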

Some reasons why few-shot learning has been of increasing interest:

  • Scarcity of supervised data can lead to unreliable generalisations of machine learning models.
  • Explicitly labeling a huge dataset can be costly both in time and resources.
  • Fixed ontologies or class labels used in SED and other DCASE tasks are often a poor fit to a given user’s goal.

Development Set

The development set is pre-split into training and validation sets. The training set consists of four sub-folders, each deriving from a different source. Multi-class annotations are provided alongside each audio file. The validation set consists of two sub-folders, each deriving from a different source, with a single-class (class of interest) annotation file provided for each audio file.

Training Set

The training set contains four different sub-folders (BV, HT, JD, MT). Statistics are given overall and for each sub-folder.

Overall

Statistics Values
Number of audio recordings 11
Total duration 14 hours and 20 mins
Total classes (excl. UNK) 19
Total events (excl. UNK) 4,686

BV

The BirdVox-DCASE-10h (BV for short) contains five audio files from four different autonomous recording units, each lasting two hours. These autonomous recording units are all located in Tompkins County, New York, United States. Furthermore, they follow the same hardware specification: the Recording and Observing Bird Identification Node (ROBIN) developed by the Cornell Lab of Ornithology. Andrew Farnsworth, an expert ornithologist, has annotated these recordings for the presence of flight calls from migratory passerines, namely: American sparrows, cardinals, thrushes, and warblers. In total, the annotator found 2,662 flight calls from 11 different species. We estimate these flight calls to have a duration of 150 milliseconds and a fundamental frequency between 2 kHz and 10 kHz.

Statistics Values
Number of audio recordings 5
Total duration 10 hours
Total classes (excl. UNK) 11
Total events (excl. UNK) 2,662
Sampling rate 24,000 Hz

HT

Spotted hyenas are a highly social species that live in "fission-fusion" groups where group members range alone or in smaller subgroups that split and merge over time. Hyenas use a variety of types of vocalizations to coordinate with one another over both short and long distances. Spotted hyena vocalization data were recorded on custom-developed audio tags designed by Mark Johnson and integrated into combined GPS / acoustic collars (Followit Sweden AB) by Frants Jensen and Mark Johnson. Collars were deployed on female hyenas of the Talek West hyena clan at the MSU-Mara Hyena Project (directed by Kay Holekamp) in the Masai Mara, Kenya as part of a multi-species study on communication and collective behavior. Field work was carried out by Kay Holekamp, Andrew Gersick, Frants Jensen, Ariana Strandburg-Peshkin, and Benson Pion; labeling was done by Kenna Lehmann and colleagues.

Statistics Values
Number of audio recordings 3
Total duration 3 hours
Total classes (excl. UNK) 3
Total events (excl. UNK) 435
Sampling rate 6,000 Hz

JD

Jackdaws are corvid songbirds which usually breed, forage and sleep in large groups, but form a pair bond with the same partner for life. They produce thousands of vocalisations per day, but many aspects of their vocal behaviour have remained unexplored due to the difficulty of recording and assigning vocalisations to specific individuals, especially in natural settings. In a multi-year field study (Max-Planck-Institute for Ornithology, Seewiesen, Germany), wild jackdaws were equipped with small backpacks containing miniature voice recorders (Edic Mini Tiny A31, TS-Market Ltd., Russia) to investigate the vocal behaviour of individuals interacting normally with their group, and behaving freely in their natural environment. The jackdaw training dataset contains a 10-minute on-bird sound recording of one male jackdaw during the 2015 breeding season. Field work was conducted by Lisa Gill, Magdalena Pelayo van Buuren and Magdalena Maier. Sound files were annotated by Lisa Gill, based on a previously established video-validation in a captive setting.

Statistics Values
Number of audio recordings 1
Total duration 10 minutes
Total classes (excl. UNK) 1
Total events (excl. UNK) 355
Sampling rate 22,050 Hz

MT

Meerkats are a highly social mongoose species that live in stable social groups and use a variety of distinct vocalizations to communicate and coordinate with one another. Meerkat vocalization data were recorded at the Kalahari Meerkat Project (Kuruman River Reserve, South Africa; directed by Marta Manser and Tim Clutton-Brock), as part of a multi-species study on communication and collective behavior. Data in the training set were recorded on small audio devices (TS Market, Edic Mini Tiny+ A77, 8 kHz) integrated into combined GPS/audio collars which were deployed on multiple members of meerkat groups to monitor their movements and vocalizations simultaneously. Recordings were carried out during daytime hours while meerkats were primarily foraging (digging in the ground for small prey items). Field work was carried out by Ariana Strandburg-Peshkin, Baptiste Averly, Vlad Demartsev, Gabriella Gall, Rebecca Schaefer and Marta Manser. Audio recordings were labeled by Baptiste Averly, Vlad Demartsev, Ariana Strandburg-Peshkin, and colleagues.

Statistics Values
Number of audio recordings 2
Total duration 1 hour and 10 mins
Total classes (excl. UNK) 4
Total events (excl. UNK) 1,234
Sampling rate 8,000 Hz

Training annotation format

Annotation files have the same name as their corresponding audio files, with the extension *.csv. For the training set, multi-class annotations are provided, with positive (POS), negative (NEG) and unknown (UNK) values for each class. UNK indicates uncertainty about a class and participants can choose to ignore it. Example of an annotation file for audio.wav:

Audiofilename,Starttime,Endtime,CLASS_1,CLASS_2,...,CLASS_N
audio.wav,1.1,2.2,POS,NEG,...,NEG
.
.
.
audio.wav,99.9,100.0,UNK,UNK,...,NEG
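
One possible way to read this multi-class format is sketched below using only the Python standard library. The function name `positive_events_by_class` and the three-class example text are assumptions for illustration; the sketch keeps POS cells and skips both NEG and UNK, which is just one way of ignoring UNK.

```python
import csv
import io
from collections import defaultdict


def positive_events_by_class(csv_text):
    """Return {class_name: [(start, end), ...]} for cells marked POS.

    NEG and UNK cells are skipped.
    """
    events = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column other than the three fixed ones is treated as a class.
    class_columns = [c for c in reader.fieldnames
                     if c not in ("Audiofilename", "Starttime", "Endtime")]
    for row in reader:
        span = (float(row["Starttime"]), float(row["Endtime"]))
        for cls in class_columns:
            if row[cls] == "POS":
                events[cls].append(span)
    return dict(events)


# Hypothetical annotation text with three classes.
example = """Audiofilename,Starttime,Endtime,CLASS_1,CLASS_2,CLASS_3
audio.wav,1.1,2.2,POS,NEG,NEG
audio.wav,3.0,3.5,NEG,POS,UNK
audio.wav,99.9,100.0,UNK,UNK,NEG
"""
events = positive_events_by_class(example)
```

In practice the text would come from reading the *.csv file that sits next to each audio file.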

Validation Set

The validation set comprises two sub-folders (HV, PB). Specific information about the source of the recordings is not provided to participants for the duration of the challenge, so that the information available for the validation set is as similar as possible to that for the evaluation set (once that is also published). More information about both will be made available after the end of the challenge.

There is no overlap between the training set and validation set classes.

Overall

Statistics Values
Number of audio recordings 8
Total duration 5 hours
Total classes (excl. UNK) 4
Total events (excl. UNK) 310

HV

Statistics Values
Number of audio recordings 2
Total duration 2 hours
Total classes (excl. UNK) 2
Total events (excl. UNK) 50
Sampling rate 6,000 Hz

PB

Statistics Values
Number of audio recordings 6
Total duration 3 hours
Total classes (excl. UNK) 2
Total events (excl. UNK) 260
Sampling rate 44,100 Hz

Validation annotation format

Annotation files have the same name as their corresponding audio files, with the extension *.csv. For the validation set, single-class (class of interest) annotations are provided, with positive (POS) and unknown (UNK) values. UNK indicates uncertainty about a class and participants can choose to ignore it. Each audio file should be treated independently of the rest, as there is possible overlap between the classes of the validation set across different audio files.

Participants must treat the task as a 5-shot setting and only use the first five POS annotations for the class of interest for each file, when trying to predict the rest.

Example of an annotation file for audio_val.wav:

Audiofilename,Starttime,Endtime,Q
audio_val.wav,1.1,2.2,POS
.
.
.
audio_val.wav,99.9,100.0,UNK
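
The 5-shot rule can be sketched as follows: take the first five POS rows of a file as the support set and treat everything after the end of the fifth as the part to be predicted. The function name `split_support_and_rest`, the example rows, and the choice to keep only events that start at or after the cutoff are assumptions of this sketch rather than the official evaluation code.

```python
import csv
import io


def split_support_and_rest(csv_text, n_shots=5):
    """Split a single-class annotation file into support events and the rest.

    Returns (support, remaining, cutoff). Rows are (start, end, label) tuples
    sorted by start time; cutoff is the end time of the n_shots-th POS event.
    """
    rows = [(float(r["Starttime"]), float(r["Endtime"]), r["Q"])
            for r in csv.DictReader(io.StringIO(csv_text))]
    rows.sort(key=lambda r: r[0])
    positives = [r for r in rows if r[2] == "POS"]
    if len(positives) < n_shots:
        raise ValueError("file has fewer POS events than shots")
    support = positives[:n_shots]
    cutoff = support[-1][1]
    # Only events starting at or after the cutoff are left for evaluation.
    remaining = [r for r in rows if r[0] >= cutoff]
    return support, remaining, cutoff


# Hypothetical annotation text for one validation file.
example = """Audiofilename,Starttime,Endtime,Q
audio_val.wav,1.1,2.2,POS
audio_val.wav,3.0,3.4,UNK
audio_val.wav,4.0,4.5,POS
audio_val.wav,6.0,6.2,POS
audio_val.wav,7.5,8.0,POS
audio_val.wav,9.0,9.6,POS
audio_val.wav,12.0,12.3,POS
audio_val.wav,99.9,100.0,UNK
"""
support, remaining, cutoff = split_support_and_rest(example)
```

The UNK row at 3.0 s is not counted as a shot, so the fifth POS event is the one ending at 9.6 s.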

Evaluation Set

The evaluation set for task 5 of DCASE 2021 "Few-shot Bioacoustic Event Detection" consists of 31 audio files acquired from different bioacoustic sources.

The first 5 annotations are provided for each file, with events marked as positive (POS) for the class of interest.

The recordings in each subfolder denote different recording sources and there may or may not be overlap between classes of interest from different wav files. Hence each recording should be treated independently of the others. Each line of the annotation csv represents an event in the audio file. The column descriptions are as follows: [ Audiofilename, Starttime, Endtime, Q ]

There is no overlap between the development set and evaluation set classes.


Task setup

This few-shot task will run as a 5-shot task. Hence, five annotated calls from each recording in the evaluation set will be provided to the participants. Each recording of the evaluation set will have a single class of interest, which the participants will then need to detect throughout the recording. Each recording can have multiple types of calls or species present in it, as well as background noise; however, only the class of interest needs to be detected.

During the development period the participants are required to treat the validation set in the same way as the evaluation set by using the first five positive (POS) events for their models. Participants should keep in mind that our evaluation metric ignores anything before the end time of the fifth positive event, hence using randomly selected events from the validation set may lead to incorrect performance values.

Task rules

  • Use of external data (e.g. audio files, annotations) is allowed only after approval from the task coordinators (contact: g.v.morfi@qmul.ac.uk). Typically these external datasets should be public, open datasets.
  • Use of pre-trained models is allowed only after approval from the task coordinators (contact: g.v.morfi@qmul.ac.uk).
  • List of external datasets and models allowed:
Dataset name Type Added Link
AudioSet audio, video 14.05.2021 https://research.google.com/audioset/
OpenL3 model 14.05.2021 https://openl3.readthedocs.io/
ESC50 audio dataset 14.05.2021 http://www.cs.cmu.edu/~alnu/tlwled/esc50.htm
ImageNet image dataset 14.05.2021 http://www.image-net.org/
VoxCeleb audio,visual dataset 14.05.2021 https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
TUT Acoustic scenes 2016 audio dataset 14.05.2021 https://zenodo.org/record/45739#.YJ76v5NKidY


  • The development dataset (i.e. training and validation) can be augmented without the use of external data.
  • Participants are not allowed to use the VGG Sound dataset (https://www.robots.ox.ac.uk/~vgg/data/vggsound/) or the DCASE2018 Bird Audio Detection task dataset (http://dcase.community/challenge2018/task-bird-audio-detection).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.
  • Participants are not allowed to use extra annotations for the provided data.
  • Participants are only allowed to use the first five positive (POS) annotations from each validation set annotation file and use the rest for evaluation of their method.
  • Participants must treat each file in the validation set independently of the others (e.g. for prototypical networks do not save prototypes between audio files). This is due to the fact that the classes of the validation set are hidden and there is possible overlap between them inside the validation set.
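
The rule about treating each file independently can be illustrated with a minimal prototype-based sketch in which prototypes are rebuilt from scratch for every file. The 2-D embeddings are made up, the function names are hypothetical, and the use of a separate negative (background) prototype is an assumption of this example; it is not a description of the provided baseline.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify_segments(pos_support, neg_support, query):
    """Label each query embedding by its nearest prototype (Euclidean distance).

    Prototypes are computed from this file's support embeddings on every call,
    so nothing is carried over from one audio file to the next.
    """
    pos_proto = mean_vector(pos_support)
    neg_proto = mean_vector(neg_support)
    return [math.dist(q, pos_proto) < math.dist(q, neg_proto) for q in query]


# Hypothetical 2-D embeddings for one file: five POS shots and some background.
pos_support = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [0.9, 1.0]]
neg_support = [[-1.0, -1.0], [-0.8, -1.2], [-1.2, -0.8]]
predictions = classify_segments(pos_support, neg_support,
                                [[0.9, 1.2], [-1.1, -0.7]])
```

Calling `classify_segments` once per audio file, with only that file's five shots, keeps the prototypes from leaking between files.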


Submission

Official challenge submission consists of:

  • System output file (*.csv)
  • Metadata file (*.yaml)
  • Technical report explaining in sufficient detail the method (*.pdf)

System output should be presented as a single text-file (in CSV format, with a header row as shown in the example output below).

For each system, meta information should be provided in a separate file, containing the task-specific information. This meta information enables fast processing of the submissions and analysis of submitted systems. Participants are advised to fill the meta information carefully while making sure all information is correctly provided.

We allow up to 4 system output submissions per participant/team. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the *.yaml file), the submitted system output (the *.csv file), and the technical report. Detailed information regarding the submission process can be found on the Submission page. Finally, to support reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and pre-trained models available after the challenge is over.

Please note: automated procedures will be used for the evaluation of the submitted results. Therefore, the column names should be exactly as indicated in the example output below. Events in each file should be in order of start time.

Example output:

Audiofilename,Starttime,Endtime
BUK5_20161101_002104a.wav,356.0134694,356.1384127
BUK5_20161101_002104a.wav,356.18839,356.488254
BUK5_20161101_002104a.wav,356.5882086,356.7131519
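
A system output file in this layout could be written with a sketch like the one below. The function name `write_predictions` is hypothetical, and grouping rows by file name before sorting by start time is an assumption about how multiple files are laid out in the single CSV.

```python
import csv
import io


def write_predictions(predictions, out):
    """Write (filename, start, end) tuples as the system output CSV.

    Rows are grouped by file name and ordered by start time within each file.
    """
    writer = csv.writer(out, lineterminator="\n")
    # Column names must match the example output exactly.
    writer.writerow(["Audiofilename", "Starttime", "Endtime"])
    for name, start, end in sorted(predictions, key=lambda p: (p[0], p[1])):
        writer.writerow([name, start, end])


# Predictions deliberately given out of order to show the sorting.
buffer = io.StringIO()
write_predictions([
    ("BUK5_20161101_002104a.wav", 356.18839, 356.488254),
    ("BUK5_20161101_002104a.wav", 356.0134694, 356.1384127),
], buffer)
output_text = buffer.getvalue()
```

For a real submission, `out` would be a file opened with `open(path, "w", newline="")` instead of an in-memory buffer.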

Metadata file

Example meta information file for task 5 baseline system task5/Morfi_QMUL_task5_1/Morfi_QMUL_task5_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions, to avoid overlapping codes among submissions
  # use the following way to form your label:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Morfi_QMUL_task5_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: Cross-correlation baseline

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight, maximum 10 characters
  abbreviation: xcorr_base

  # Submission authors in order, mark one of the authors as corresponding author.
  authors:
    # First author
    - lastname: Morfi
      firstname: Veronica
      email: g.v.morfi@qmul.ac.uk                     # Contact email address
      corresponding: true                             # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: QMUL
        institute: Queen Mary University of London
        department: Centre for Digital Music
        location: London, UK

    # Second author
    - lastname: Stowell
      firstname: Dan
      email: dan.stowell@qmul.ac.uk                  # Contact email address

      # Affiliation information for the author
      affiliation:
        abbreviation: QMUL
        institute: Queen Mary University of London
        department: Centre for Digital Music
        location: London, UK

        #...


# System information
system:
  # SED system description, meta data provided here will be used to do
  # meta analysis of the submitted system. Use general level tags, if possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input
    input_sampling_rate: any               # In kHz

    # Acoustic representation
    acoustic_features: spectrogram   # e.g one or multiple [MFCC, log-mel energies, spectrogram, CQT, PCEN, ...]

    # Data augmentation methods
    data_augmentation: !!null             # [time stretching, block mixing, pitch shifting, ...]

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    machine_learning_method: template matching         # e.g one or multiple [GMM, HMM, SVM, kNN, MLP, CNN, RNN, CRNN, NMF, random forest, ensemble, transformer, ...]
    # the system adaptation for "few shot" scenario.
    # For example, if machine_learning_method is "CNN", the few_shot_method might use one of [fine tuning, prototypical, MAML] in addition to the standard CNN architecture.
    few_shot_method: template matching         # e.g [fine tuning, prototypical, MAML, nearest neighbours...]

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: !!null

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    ensemble_method_subsystem_count: !!null # [2, 3, 4, 5, ... ]

    # Decision making methods (for ensemble)
    decision_making: !!null                 # [majority vote, ...]

    # Post-processing, followed by the time span (in ms) in case of smoothing
    post-processing: peak picking, threshold   # [median filtering, time aggregation...]

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:

    # Total amount of parameters used in the acoustic model. For neural networks, this
    # information is usually given before training process in the network summary.
    # For other than neural networks, if parameter count information is not directly available,
    # try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    total_parameters: !!null    # note that for simple template matching, the "parameters"==the pixel count of the templates, plus 1 for each param such as thresholding. 
    # Approximate training time followed by the hardware used
    trainining_time: !!null
    # Model size in MB
    model_size: !!null


  # URL to the source code of the system [optional, highly recommended]
  source_code:   

  # List of external datasets used in the submission.
  # A previous DCASE development dataset is used here only as example! List only external datasets
  external_datasets:
    # Dataset name
    - name: !!null
      # Dataset access url
      url: !!null
      # Total audio length in minutes
      total_audio_length: !!null            # minutes

# System results 
results:
  # Full results are not mandatory, but are recommended for thorough analysis of the challenge submissions.
  # If you cannot provide all result details, also incomplete results can be reported.
  validation_set:
    overall:
      F-score: 2.01 # percentage

    # Per-dataset
    dataset_wise:
      HV:
        F-score: 1.22 # percentage
      PB:
        F-score: 5.84 # percentage

Evaluation Metric

We implemented an event-based, macro-averaged F-measure as the evaluation metric. Predicted and reference events are matched using intersection over union (IoU) followed by bipartite graph matching. The evaluation metric ignores the part of the file that contains the first five positive (POS) events; measures are computed after the end time of the fifth positive event for each file. Furthermore, real-world datasets contain a small number of ambiguous or unknown labels, marked as UNK in the annotation files provided. The evaluation metric treats these separately during evaluation, so as not to penalise algorithms that can perform better than a human annotator. Final ranking of methods will be based on the overall F-measure for the whole of the evaluation set.
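
The matching step can be sketched for a single file as below. This is not the official evaluation code: the IoU threshold of 0.3, the function names, and the rule of dropping events that start before the cutoff are assumptions of this example, and both UNK handling and macro-averaging are left out.

```python
def iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(reference, predicted, min_iou):
    """Maximum number of one-to-one pairs with IoU >= min_iou.

    Uses augmenting paths (a standard bipartite matching algorithm).
    """
    candidates = [[j for j, p in enumerate(predicted) if iou(r, p) >= min_iou]
                  for r in reference]
    matched_ref = {}  # predicted index -> reference index

    def try_assign(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take a free prediction, or re-route its current partner.
            if j not in matched_ref or try_assign(matched_ref[j], seen):
                matched_ref[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(reference)))


def event_f_measure(reference, predicted, cutoff, min_iou=0.3):
    """Event-based precision, recall and F-measure for one file."""
    reference = [e for e in reference if e[0] >= cutoff]
    predicted = [e for e in predicted if e[0] >= cutoff]
    tp = count_matches(reference, predicted, min_iou)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f


# Hypothetical events in seconds; the cutoff is the end of the fifth shot.
reference = [(10.0, 11.0), (20.0, 21.0), (30.0, 31.0)]
predicted = [(10.1, 11.1), (20.5, 22.5), (40.0, 41.0)]
precision, recall, f = event_f_measure(reference, predicted, cutoff=5.0)
```

In this toy case only the first prediction overlaps a reference event strongly enough to count, so one true positive is shared between three predictions and three reference events.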

Results

Submission information (code, author, affiliation, technical report) and event-based F-score with 95% confidence interval (evaluation dataset)
Baseline_TempMatch_task5_1 Veronica Morfi Centre for Digital Music, EECS, Queen Mary University of London, UK 34.8 (32.6 - 37.1)
Baseline_PROTO_task5_1 Shubhr Singh Centre for Digital Music, EECS, Queen Mary University of London, UK 20.1 (18.2 - 21.9)
Anderson_TCD_task5_1 Mark Anderson Trinity College Dublin, SIGMEDIA, Dublin, Ireland task-few-shot-bioacoustic-event-detection-results#Anderson2021 35.0 (33.1 - 37.0)
Bielecki_SMSNG_task5_1 Radoslaw Bielecki Audio Intelligence, Samsung R&D Institute, Warsaw, Poland task-few-shot-bioacoustic-event-detection-results#Bielecki2021 8.4 (7.1 - 9.6)
Bielecki_SMSNG_task5_2 Radoslaw Bielecki Audio Intelligence, Samsung R&D Institute, Warsaw, Poland task-few-shot-bioacoustic-event-detection-results#Bielecki2021 5.8 (4.9 - 6.7)
Bielecki_SMSNG_task5_3 Radoslaw Bielecki Audio Intelligence, Samsung R&D Institute, Warsaw, Poland task-few-shot-bioacoustic-event-detection-results#Bielecki2021 8.4 (7.1 - 9.7)
Bielecki_SMSNG_task5_4 Radoslaw Bielecki Audio Intelligence, Samsung R&D Institute, Warsaw, Poland task-few-shot-bioacoustic-event-detection-results#Bielecki2021 5.3 (4.4 - 6.2)
Cheng_BIT_task5_1 Hao Cheng Beijing Institute of Technology, School Of Information And Electronics, Beijing, China task-few-shot-bioacoustic-event-detection-results#Cheng2021 23.8 (21.9 - 25.7)
Cheng_BIT_task5_2 Hao Cheng Beijing Institute of Technology, School Of Information And Electronics, Beijing, China task-few-shot-bioacoustic-event-detection-results#Cheng2021 12.5 (11.0 - 14.1)
Cheng_BIT_task5_3 Hao Cheng Beijing Institute of Technology, School Of Information And Electronics, Beijing, China task-few-shot-bioacoustic-event-detection-results#Cheng2021 11.0 (9.4 - 12.6)
Cheng_BIT_task5_4 Hao Cheng Beijing Institute of Technology, School Of Information And Electronics, Beijing, China task-few-shot-bioacoustic-event-detection-results#Cheng2021 8.0 (6.7 - 9.3)
Johannsmeier_OVGU_task5_1 Jens Johannsmeier Otto-von-Guericke-Universität Magdeburg, Faculty of Computer Science, Magdeburg, Germany task-few-shot-bioacoustic-event-detection-results#Johannsmeier2021 5.5 (4.7 - 6.4)
Johannsmeier_OVGU_task5_2 Jens Johannsmeier Otto-von-Guericke-Universität Magdeburg, Faculty of Computer Science, Magdeburg, Germany task-few-shot-bioacoustic-event-detection-results#Johannsmeier2021 4.5 (3.7 - 5.4)
Johannsmeier_OVGU_task5_3 Jens Johannsmeier Otto-von-Guericke-Universität Magdeburg, Faculty of Computer Science, Magdeburg, Germany task-few-shot-bioacoustic-event-detection-results#Johannsmeier2021 15.2 (13.7 - 16.7)
Johannsmeier_OVGU_task5_4 Jens Johannsmeier Otto-von-Guericke-Universität Magdeburg, Faculty of Computer Science, Magdeburg, Germany task-few-shot-bioacoustic-event-detection-results#Johannsmeier2021 7.1 (5.9 - 8.3)
zhang_uestc_task5_1 Yue Zhang University of Electronic Science and Technology of China, Chengdu, China task-few-shot-bioacoustic-event-detection-results#Zhang2021 9.0 (7.8 - 10.2)
zhang_uestc_task5_2 Yue Zhang University of Electronic Science and Technology of China, Chengdu, China task-few-shot-bioacoustic-event-detection-results#Zhang2021 8.3 (7.1 - 9.4)
zhang_uestc_task5_3 Yue Zhang University of Electronic Science and Technology of China, Chengdu, China task-few-shot-bioacoustic-event-detection-results#Zhang2021 16.8 (15.5 - 18.2)
zhang_uestc_task5_4 Yue Zhang University of Electronic Science and Technology of China, Chengdu, China task-few-shot-bioacoustic-event-detection-results#Zhang2021 7.2 (6.0 - 8.4)
Zou_PKU_task5_1 Yuexian Zou Peking University, School of ECE, Shenzhen, China task-few-shot-bioacoustic-event-detection-results#Zou2021 33.2 (31.0 - 35.3)
Yang_PKU_task5_2 Yuexian Zou Peking University, School of ECE, Shenzhen, China task-few-shot-bioacoustic-event-detection-results#Zou2021 22.4 (20.7 - 24.1)
Zou_PKU_task5_3 Yuexian Zou Peking University, School of ECE, Shenzhen, China task-few-shot-bioacoustic-event-detection-results#Zou2021 38.4 (36.2 - 40.6)
Zou_PKU_task5_4 Yuexian Zou Peking University, School of ECE, Shenzhen, China task-few-shot-bioacoustic-event-detection-results#Zou2021 33.7 (31.7 - 35.8)
Tang_SHNU_task5_1 Tiantian Tang Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-few-shot-bioacoustic-event-detection-results#Tang2021 36.5 (34.0 - 38.9)
Tang_SHNU_task5_2 Tiantian Tang Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-few-shot-bioacoustic-event-detection-results#Tang2021 35.1 (31.7 - 38.4)
Tang_SHNU_task5_3 Tiantian Tang Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China task-few-shot-bioacoustic-event-detection-results#Tang2021 38.3 (36.1 - 40.5)


Complete results and technical reports can be found on the results page.

Baseline system

Two baselines are provided:

  • Spectrogram correlation template matching (common in bioacoustics)
  • Deep learning prototypical network (a good modern approach)

The result reported for the deep learning baseline is the best performance of the model.


Baseline Results

System F-measure Precision Recall
Template Matching 2.01% 1.08% 14.46%
Prototypical Network 41.48% 32.20% 58.27%
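
As a quick consistency check, the F-measure column can be recomputed from the precision and recall columns as their harmonic mean. The helper name `f_measure` is introduced here for illustration only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) taken from the table above.
template_f = f_measure(1.08, 14.46)
proto_f = f_measure(32.20, 58.27)
```

Rounded to two decimals, both values agree with the F-measure column of the table.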

Sound event detection via few-shot learning is a novel and challenging task, as reflected in the performance of the baseline systems. There is thus lots of scope for improving on these scores, and making a significant contribution to animal monitoring.

Contact

Participants can contact the task organisers via email (g.v.morfi@qmul.ac.uk) or in the slack channel: task-5-2021