Task description
The task is to design a system that, given a short audio recording, returns a binary decision for the presence or absence of bird sound (of any kind).
An important goal of this task is generalisation to new conditions. To explore this we provide 3 separate development datasets and 3 evaluation datasets, each recorded under differing conditions. The datasets have different balances of positive/negative cases, different bird species, different background sounds, and different recording equipment.
A more detailed task description can be found on the task description page.
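The rankings below report the area under the ROC curve (AUC) together with a 95% confidence interval. As a rough illustration of how such an interval can be estimated, here is a minimal bootstrap sketch; the challenge's exact procedure may differ, and `y_true`/`y_score` are hypothetical per-file labels and scores.

```python
# Minimal sketch: AUC with a bootstrap 95% confidence interval.
# The challenge's exact CI procedure may differ from this illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(y_true, y_score)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if np.unique(y_true[idx]).size < 2:              # AUC needs both classes present
            continue
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return auc, lo, hi
```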
Teams ranking
Table including only the best performing system per submitting team.
| Rank | Submission name | Technical Report | AUC with 95% confidence interval (Evaluation dataset) |
|---|---|---|---|
| 1 | Lasseck_MfN_1 | Lasseck_MfN | 89.0 (87.7 - 89.9) |
| 2 | bulbul_DCASE_1 | bulbul_DCASE | 88.5 (86.9 - 89.1) |
| 3 | SpeechLab_UKY_3 | SpeechLab_UKY | 83.9 (81.7 - 84.7) |
| 4 | JiananSong_BUPT_1 | JiananSong_BUPT | 82.1 (80.3 - 83.0) |
| 5 | Himawan_QUT_1 | Himawan_QUT | 81.7 (80.3 - 82.8) |
| 6 | Bai_NPU_1 | Bai_NPU | 81.5 (80.1 - 82.8) |
| 7 | Baseline_Surrey_1 | Baseline_Surrey | 80.9 (79.1 - 82.4) |
| 8 | Berger_JKU_1 | Berger_JKU | 80.8 (79.2 - 82.5) |
| 9 | Mukherjee_IITKgp_2 | Mukherjee_IITKgp | 80.7 (79.5 - 82.3) |
| 10 | Yu_LR_2 | Yu_LR | 80.6 (78.5 - 81.4) |
| 11 | Thakur_IITMANDI_1 | Thakur_IITMANDI | 79.2 (76.7 - 79.5) |
| 12 | Vesperini_A3Lab_1 | Vesperini_A3Lab | 78.8 (77.4 - 80.2) |
| 13 | Tao_IITLAB_2 | Tao_IITLAB | 75.4 (73.2 - 77.1) |
| 14 | skfl_DCASE_1 | skfl_DCASE | 73.4 (72.0 - 75.3) |
| 15 | smacpy_DCASE_1 | smacpy_DCASE | 51.7 (50.5 - 52.5) |
| 16 | Jamali_HUT_1 | Jamali_HUT | 48.9 (46.4 - 49.6) |
Prize winners
The two prize winners receive £250 in recognition of their contribution.
1: Highest-scoring open-source/reproducible method award
- Winner: Liaqat et al (University of Kentucky, USA) - This student team re-implemented the "bulbul" system (last year's winner) and then evaluated various ideas for improving it. Although the individual modifications did not improve the score, an ensemble of the resulting systems led to an improved final score. The tech report discusses the techniques tried, including a domain adaptation method and signal enhancement.
2: Judges' award for the method considered by the judges to be the most interesting or innovative.
- Winner: Vesperini et al (Università Politecnica delle Marche, Italy) - The authors use "capsule networks", a new idea for routing between modules in neural networks. The paper gives a clear introduction to the concept, and it's encouraging that this rather new idea gets respectable performance on the challenge data (78.8%).
Special mention: Berger et al (Johannes Kepler University, Austria) - The authors use a bulbul-like model, and they describe an interesting domain-adaptation technique, which gives them approximately a 1% boost over their base model.
Systems ranking
Table including all systems officially submitted (up to 4 per team).
| Submission name | Technical Report | AUC with 95% confidence interval (Evaluation datasets) |
|---|---|---|
| Lasseck_MfN_1 | Lasseck_MfN | 89.0 (87.7 - 89.9) |
| bulbul_DCASE_1 | bulbul_DCASE | 88.5 (86.9 - 89.1) |
| SpeechLab_UKY_1 | SpeechLab_UKY | 82.5 (81.0 - 83.5) |
| JiananSong_BUPT_1 | JiananSong_BUPT | 82.1 (80.3 - 83.0) |
| Himawan_QUT_1 | Himawan_QUT | 81.7 (80.3 - 82.8) |
| Bai_NPU_1 | Bai_NPU | 81.5 (80.1 - 82.8) |
| Baseline_Surrey_1 | Baseline_Surrey | 80.9 (79.1 - 82.4) |
| Berger_JKU_1 | Berger_JKU | 80.8 (79.2 - 82.5) |
| Yu_LR_1 | Yu_LR | 80.5 (78.6 - 81.5) |
| Mukherjee_IITKgp_1 | Mukherjee_IITKgp | 80.4 (79.0 - 82.0) |
| Thakur_IITMANDI_1 | Thakur_IITMANDI | 79.2 (76.7 - 79.5) |
| Vesperini_A3Lab_1 | Vesperini_A3Lab | 78.8 (77.4 - 80.2) |
| Tao_IITLAB_1 | Tao_IITLAB | 74.9 (73.4 - 76.7) |
| skfl_DCASE_1 | skfl_DCASE | 73.4 (72.0 - 75.3) |
| smacpy_DCASE_1 | smacpy_DCASE | 51.7 (50.5 - 52.5) |
| Jamali_HUT_1 | Jamali_HUT | 48.9 (46.4 - 49.6) |
| SpeechLab_UKY_2 | SpeechLab_UKY | 82.7 (79.8 - 83.6) |
| Himawan_QUT_2 | Himawan_QUT | 81.3 (80.0 - 82.7) |
| Bai_NPU_2 | Bai_NPU | 80.9 (79.5 - 82.2) |
| Mukherjee_IITKgp_2 | Mukherjee_IITKgp | 80.7 (79.5 - 82.3) |
| Yu_LR_2 | Yu_LR | 80.6 (78.5 - 81.4) |
| Vesperini_A3Lab_2 | Vesperini_A3Lab | 75.9 (73.0 - 78.0) |
| Tao_IITLAB_2 | Tao_IITLAB | 75.4 (73.2 - 77.1) |
| Thakur_IITMANDI_2 | Thakur_IITMANDI | 75.4 (72.1 - 77.6) |
| Baseline_Surrey_2 | Baseline_Surrey | 74.8 (72.8 - 76.3) |
| Berger_JKU_2 | Berger_JKU | 70.8 (68.2 - 71.8) |
| JiananSong_BUPT_2 | JiananSong_BUPT | 51.5 (49.2 - 52.6) |
| SpeechLab_UKY_3 | SpeechLab_UKY | 83.9 (81.7 - 84.7) |
| Bai_NPU_3 | Bai_NPU | 81.5 (80.1 - 82.8) |
| Himawan_QUT_3 | Himawan_QUT | 80.6 (78.7 - 81.5) |
| Yu_LR_3 | Yu_LR | 80.0 (77.7 - 80.6) |
| Tao_IITLAB_3 | Tao_IITLAB | 74.1 (72.3 - 76.0) |
| Thakur_IITMANDI_3 | Thakur_IITMANDI | 72.9 (70.0 - 74.1) |
| SpeechLab_UKY_4 | SpeechLab_UKY | 83.6 (81.4 - 84.6) |
| Bai_NPU_4 | Bai_NPU | 81.4 (80.0 - 82.7) |
| Himawan_QUT_4 | Himawan_QUT | 78.4 (76.8 - 79.9) |
| Thakur_IITMANDI_4 | Thakur_IITMANDI | 77.7 (76.2 - 79.7) |
Technical reports
CIAIC-BAD SYSTEM FOR DCASE2018 CHALLENGE TASK 3
Bai, Jisheng and Wu, Ru and Wang, Mou and Li, Dexin and Li, Di and Han, Xueyu and Wang, Qian and Liu, Qing and Wang, Bolun and Fu, Zhonghua
Northwestern Polytechnical University, Xi'an, China
Abstract
In this technical report, we present our system for Task 3 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge, i.e. bird audio detection (BAD). First, log mel-spectrograms and mel-frequency cepstral coefficients (MFCC) are extracted as features. In order to improve the quality of the original audio, some denoising methods are adopted, for example adaptive denoising in Adobe Audition. Then, a convolutional recurrent neural network (CRNN) with a customized activation function is used for detection. Finally, we use the aforementioned features as inputs to train our CRNN model and fuse three subsystems to further improve performance. We evaluate the proposed systems on the dataset with the area under the ROC curve (AUC) measure, and our best AUC score on the leaderboard dataset is 85.67.
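As a rough sketch of the front end described in this abstract, the two feature types can be extracted with librosa as below; the window, hop, and band settings are illustrative assumptions, not the authors' configuration.

```python
# Sketch: log mel-spectrogram and MFCC extraction with librosa.
# All parameter values here are illustrative, not the authors' settings.
import librosa

def extract_features(path, sr=44100, n_mels=80, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                     # log mel-spectrogram
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)  # MFCCs from the same mel bands
    return log_mel, mfcc
```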
Bird Audio Detection - DCASE 2018
Franz Berger and William Freillinger and Paul Primus and Wolfgang Reisinger
Johannes Kepler University, Linz
Abstract
In this paper we explore three approaches to bird audio detection. We establish a simple baseline, experiment with handcrafted features, and finally move to Convolutional Neural Networks.
3D CONVOLUTIONAL RECURRENT NEURAL NETWORKS FOR BIRD SOUND DETECTION
Ivan Himawan and Michael Towsey and Paul Roe
Queensland University of Technology
Abstract
With the increasing use of high-quality acoustic devices to monitor wildlife populations, it has become imperative to develop techniques for analyzing animals' calls automatically. Bird sound detection is one example of a long-term monitoring project where data are collected over continuous periods, often covering multiple sites at the same time. Inspired by the success of deep learning approaches in various audio classification tasks, this paper first reviews previous work exploiting deep learning for bird audio detection, and then proposes a novel 3-dimensional (3D) convolutional recurrent neural network. We employ 3D convolutions for extracting spatial and temporal information simultaneously. In order to leverage the powerful and compact features of 3D convolution, we employ separate RNNs, acting on each filter of the last convolutional layer rather than stacking the feature maps as in typical combined CNN and RNN architectures.
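To make the idea of separate RNNs per convolutional filter concrete, here is a minimal PyTorch sketch under assumed input shapes; all sizes (filters, hidden units, 40 mel bands, collapsing the context axis by averaging) are illustrative simplifications, not the authors' architecture.

```python
# Sketch (PyTorch): a 3D convolution followed by one GRU per feature map.
# Shapes and sizes are illustrative, not the authors' exact design.
import torch
import torch.nn as nn

class Conv3dPerFilterRNN(nn.Module):
    def __init__(self, n_filters=8, rnn_hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # one independent GRU per convolutional filter
        self.rnns = nn.ModuleList(
            nn.GRU(input_size=20, hidden_size=rnn_hidden, batch_first=True)
            for _ in range(n_filters))
        self.out = nn.Linear(n_filters * rnn_hidden, 1)

    def forward(self, x):            # x: (batch, 1, context, time, freq=40)
        z = self.conv(x)             # (batch, n_filters, context, time, 20)
        z = z.mean(dim=2)            # collapse context axis (a simplification)
        feats = []
        for c, rnn in enumerate(self.rnns):
            _, h = rnn(z[:, c])      # run this filter's GRU over time
            feats.append(h[-1])      # final hidden state: (batch, rnn_hidden)
        return torch.sigmoid(self.out(torch.cat(feats, dim=1)))
```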
Bird Audio Detection using Supervised Weighted NMF
Soroush Jamali and Juan Ahmadpanah and Ghasem Alipoor
Hamedan University of Technology
Abstract
This paper reports on the results of our bird audio detection system, developed for Task 3 of the DCASE 2018 challenge, which is defined as a binary classification problem. Our proposed method is based on supervised non-negative matrix factorization (NMF) of the constant-Q transform (CQT) spectrogram. Two dictionaries are trained over the training data available for the bird and environment classes. Test samples are then linearly decomposed using a combined dictionary, generated by concatenating these two dictionaries. Classification is performed based on the energy of the activations relevant to each class. To further improve classification performance, we propose to weight each activation coefficient according to the contribution of its corresponding basis in constructing each class. A scheme is proposed to extract these contribution weights from the activation coefficients of the training data. The developed system, evaluated on the development dataset of the challenge, achieves up to 80% accuracy.
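A minimal sketch of the dictionary-based decision might look as follows in scikit-learn; it omits the authors' activation weighting, works on generic nonnegative spectrogram frames rather than CQT, and the dictionary size `K` is an illustrative assumption.

```python
# Sketch: supervised NMF classification in the spirit of Jamali et al.
# (without their activation weighting); sizes are illustrative.
import numpy as np
from sklearn.decomposition import NMF, non_negative_factorization

K = 40  # atoms per class (illustrative)

def train_dictionary(frames):            # frames: (n_frames, n_freq), nonnegative
    return NMF(n_components=K, max_iter=400).fit(frames).components_

def classify(frames, H_bird, H_env):
    H = np.vstack([H_bird, H_env])       # combined dictionary
    W0 = np.full((frames.shape[0], H.shape[0]), 1e-2)
    W, _, _ = non_negative_factorization(
        frames, W=W0, H=H, n_components=H.shape[0],
        init="custom", update_H=False)   # decompose with the dictionary held fixed
    e_bird = np.sum(W[:, :K] ** 2)       # activation energy per class
    e_env = np.sum(W[:, K:] ** 2)
    return e_bird > e_env                # True -> "bird present"
```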
Bird Audio Detection using Convolutional Neural Networks and Binary Neural Networks
Jianan Song and Shengchen Li
Beijing University of Posts and Telecommunications
Abstract
For the bird audio detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2018), we propose an audio classification method for bird species identification using Convolutional Neural Networks (CNNs) and Binarized Neural Networks (BNNs). Although deep learning networks are currently popular in bird audio detection [1], the complex network structure makes it difficult to design the hardware of the detection system. Therefore, after the design of the CNNs, the convolutional layers and the fully connected layers are binarized on the basis of the original network, and both network structures are tested. Finally, the Area Under ROC Curve (AUC) score is used as the evaluation index. The preview scores using CNNs and BNNs are 88.75% and 68.60%, respectively.
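The core mechanism of a binarized layer, constraining weights to +/-1 in the forward pass while letting gradients flow through a straight-through estimator, can be sketched in PyTorch as follows; this is a generic illustration of the technique, not the authors' implementation.

```python
# Sketch (PyTorch): weight binarization with a straight-through estimator,
# the basic building block of binarized neural networks (generic illustration).
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # pass gradients only where |w| <= 1

binarize = BinarizeSTE.apply  # apply to layer weights before a conv/linear op
```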
DCASE 2018 Challenge Surrey Cross-Task convolutional neural network baseline
Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) Acoustic scene classification, 2) Audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1 - 5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and the VGGish networks from the image processing area. We investigated how performance varies from task to task when the configuration of the neural networks is kept the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2 - 5, except on Task 1 where the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4 and an F1 score of 87.75% on Task 5.
System characteristics
| Input | mono |
|---|---|
| Sampling rate | 44.1 kHz |
| Features | log-mel energies |
| Classifier | VGGish: 8-layer CNN with global max pooling; AlexNetish: 4-layer CNN with global max pooling |
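The authors' released baseline code is the authoritative reference; as a rough PyTorch sketch of the table's VGGish variant (8 convolutional layers followed by global max pooling), with illustrative layer widths:

```python
# Sketch (PyTorch): a VGGish-style CNN with global max pooling, loosely
# matching the baseline's description; layer widths are illustrative guesses.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.MaxPool2d(2))

class VGGishBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(          # 8 conv layers in 4 blocks
            conv_block(1, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512))
        self.fc = nn.Linear(512, 1)

    def forward(self, x):                       # x: (batch, 1, time, mel)
        z = self.features(x)
        z = torch.amax(z, dim=(2, 3))           # global max pooling
        return torch.sigmoid(self.fc(z))
```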
ACOUSTIC BIRD DETECTION WITH DEEP CONVOLUTIONAL NEURAL NETWORKS
Mario Lasseck
Museum fuer Naturkunde, Berlin
Abstract
This paper presents deep learning techniques for acoustic bird detection. Deep Convolutional Neural Networks (DCNNs), originally designed for image classification, are adapted and fine-tuned to detect the presence of birds in audio recordings. Various data augmentation techniques are applied to increase model performance and improve generalization to unknown recording conditions and new habitats. The proposed approach is evaluated in the Bird Audio Detection task which is part of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018. It provides the best system for the task and surpasses previous state-of-the-art, achieving an area under the curve (AUC) above 95% on the public challenge leaderboard [1].
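The report details the augmentation pipeline; as a flavour of spectrogram-level augmentation of this general kind, here are two simple NumPy examples (illustrative operations, not Lasseck's exact ones):

```python
# Sketch: two spectrogram-level augmentations of the general kind described
# in the report; parameters are illustrative.
import numpy as np

def random_time_shift(spec, rng):
    """Cyclically shift the spectrogram along the time axis."""
    return np.roll(spec, rng.integers(spec.shape[1]), axis=1)

def mix_background(spec, noise_spec, rng, max_gain=0.5):
    """Blend in a noise spectrogram taken from another recording."""
    g = rng.uniform(0.0, max_gain)
    t = min(spec.shape[1], noise_spec.shape[1])
    out = spec.copy()
    out[:, :t] = (1 - g) * spec[:, :t] + g * noise_spec[:, :t]
    return out

rng = np.random.default_rng(0)
```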
CONVOLUTIONAL RECURRENT NEURAL NETWORK BASED BIRD AUDIO DETECTION
Rajdeep Mukherjee and Dipyaman Banerjee and Kuntal Dey and Niloy Ganguly
Indian Institute of Technology, Kharagpur and IBM Research, New Delhi
Abstract
We propose a Convolutional Recurrent Neural Network (CRNN) based approach, implemented as a Convolutional Neural Network (CNN) followed by a Recurrent Neural Network (RNN), for the task of detecting the presence of birds in audio recordings. As part of the IEEE DCASE 2018 Challenge, we were provided with three separate development datasets containing recordings from three very different bird sound monitoring projects. We performed a stratified 3-way cross-validation for training our model, considering two datasets for training and the remaining one for validation in each fold, in order to generalize our model well when exposed to data from unseen conditions. We obtained an Area Under Curve (AUC) measure of 88.7% on the leaderboard test set. We compare our results with the CNN version of our model, which achieves an AUC measure of 87.74% on the same test set.
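The leave-one-dataset-out scheme described above can be summarized in a few lines; the dataset names below follow the challenge's development sets, and `train_model`/`evaluate` are placeholders standing in for the authors' CRNN pipeline.

```python
# Sketch: leave-one-dataset-out cross-validation across the three
# development sets; train_model/evaluate are placeholder stubs.
def train_model(train_data):   # placeholder for the CRNN training pipeline
    return None

def evaluate(model, data):     # placeholder for AUC evaluation
    return 0.0

datasets = {"freefield1010": [], "warblrb10k": [], "BirdVox-DCASE-20k": []}

for held_out in datasets:
    train_data = [v for k, v in datasets.items() if k != held_out]
    model = train_model(train_data)             # fit on two datasets
    auc = evaluate(model, datasets[held_out])   # validate on the unseen third
    print(f"held out {held_out}: AUC={auc:.3f}")
```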
DOMAIN TUNING METHODS FOR BIRD AUDIO DETECTION
Sidrah Liaqat and Narjes Bozorg and Neenu Jose and Patrick Conrey and Antony Tamasi and Michael T. Johnson
University of Kentucky Speech and Signal Processing Lab
Abstract
This paper presents several feature extraction and normalization methods implemented for the DCASE 2018 Bird Audio Detection challenge, a binary audio classification task to identify whether a ten-second audio segment from a specified dataset contains one or more bird vocalizations. Our baseline system is adapted from the Convolutional Neural Network system of last year's challenge winner, bulbul [1]. We introduce one feature modification, an increase in the temporal resolution of the Mel-spectrogram feature matrix, tailored to the fast-changing temporal structure of many song-bird vocalizations. Additionally, we introduce two feature normalization approaches: a front-end signal enhancement method to reduce differences in dataset noise characteristics, and an explicit domain adaptation method based on covariance normalization. Overall results show that none of these approaches gave significant improvements for either a within-dataset training/testing paradigm or a cross-dataset training/testing paradigm.
Awards: Highest-scoring open source / reproducible method
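Covariance-based domain adaptation of the general kind mentioned in the abstract can be sketched generically in the CORAL style; this illustrates the technique, not necessarily the authors' exact variant.

```python
# Sketch: CORAL-style covariance normalization for domain adaptation
# (generic illustration, not necessarily the authors' exact method).
import numpy as np
from scipy.linalg import sqrtm

def coral(source, target, eps=1e-5):
    """Align source feature covariance to the target domain.
    source, target: (n_examples, n_features) arrays."""
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    whiten = np.linalg.inv(sqrtm(cs).real)   # decorrelate source features
    recolor = sqrtm(ct).real                 # impose the target covariance
    return (source - source.mean(0)) @ whiten @ recolor + target.mean(0)
```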
BIRD AUDIO DETECTION FOR DCASE 2018 CHALLENGE TECHNICAL REPORT
Lianjie Tao and Xinxing Chen
Chongqing University
Abstract
The 2018 Bird Audio Detection (BAD) challenge [1] requires determining whether bird sound is present in 10-second audio clips. The organizers provided three development datasets for training our neural network, and three evaluation datasets for evaluating it. The goal of the challenge is to maximize the accuracy of bird audio detection.
LEARNED AGGREGATION IN CNN: ALL-CONV NET FOR BIRD ACTIVITY DETECTION
Anshul Thakur and Arjun Pankajakshan and Padmanabhan Rajan
Indian Institute of Technology Mandi
Abstract
Task 3 of DCASE 2018, i.e. bird activity detection (BAD), deals with identifying the presence or absence of bird vocalizations in a given audio recording. In this submission, we utilize an all-convolutional neural network (all-conv net) for BAD. The network is characterized by the use of convolutional operations to implement aggregation/pooling and dense layers. The aggregation operation implemented by convolution helps in capturing the inter-feature-map correlations which are ignored in traditional max/average pooling. This helps in learning a function which aggregates the complementary information in various feature maps, leading to better bird activity detection. Building on the all-conv net, we utilize four different derivative systems which provide good validation and preview scores.
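The two substitutions that make a network "all-convolutional" (strided convolutions in place of pooling, and a 1x1 convolution in place of a dense layer) can be sketched in PyTorch as follows; layer sizes are illustrative and this is not the authors' exact architecture.

```python
# Sketch (PyTorch): learned aggregation via strided convolutions and a
# 1x1 convolution replacing the dense layer; sizes are illustrative.
import torch
import torch.nn as nn

class AllConvBAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),  # learned pooling
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # learned pooling
            nn.Conv2d(32, 1, 1))                                   # 1x1 "dense" layer

    def forward(self, x):                          # x: (batch, 1, time, mel)
        z = self.net(x)                            # (batch, 1, t', m')
        return torch.sigmoid(z.mean(dim=(2, 3)))   # clip-level probability
```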
A CAPSULE NEURAL NETWORKS BASED APPROACH FOR BIRD AUDIO DETECTION
Fabio Vesperini and Leonardo Gabrielli and Emanuele Principi and Stefano Squartini
Università Politecnica delle Marche, Ancona
Abstract
We propose a system for bird audio detection based on the innovative CapsNet architecture. It is our contribution to the third task of the DCASE2018 Challenge. The task consists of binary detection of the presence/absence of bird sounds in audio files belonging to different datasets. Spectral acoustic features are extracted from the acoustic signals; subsequently, a deep neural network comprising capsule units is trained by means of supervised learning, using binary annotations of bird song activity as the target vector, in combination with the dynamic routing mechanism. This procedure aims to encourage the network to learn global coherence implicitly and to identify part-whole relationships between capsules, thereby improving generalization performance in detecting the presence of bird songs under various environmental conditions. We achieve a harmonic mean of the Area Under ROC Curve (AUC) scores equal to 85.08 in the cross-validation performed on the development dataset, while we obtain an AUC equal to 84.43 as the preview score on a subset of the unseen evaluation data.
Awards: Judges' award
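For readers new to capsules, the squashing nonlinearity and the routing-by-agreement update at the heart of the dynamic routing mechanism can be sketched as follows; this is a bare illustration of the mechanism, not the authors' model.

```python
# Sketch (PyTorch): the squash nonlinearity and routing-by-agreement update
# at the core of capsule networks (bare illustration).
import torch

def squash(s, dim=-1, eps=1e-8):
    """Scale vector length into [0, 1) while preserving its orientation."""
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (batch, n_in, n_out, dim) prediction vectors from lower capsules."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    v = None
    for _ in range(n_iter):
        c = torch.softmax(b, dim=2)                        # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))   # output capsules: (batch, n_out, dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)       # agreement update
    return v
```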
DCASE 2018 CHALLENGE TECHNICAL REPORT
Chenchen Yu and Yu Hao and Wenbo Yang and Bo Fu
AI Lab, Lenovo Research
Abstract
For the task of Bird Audio Detection in the DCASE Challenge 2018 [1], we present three approaches that all use convolutional neural networks on Mel-spectrograms. We obtained Area Under Curve (AUC) measures of 0.8610, 0.8548, and 0.8464 on the preview score, which is calculated using approximately 1000 files randomly selected from the Chernobyl and warblrb10k data.