Few-shot Bioacoustic Event Detection


Challenge results

Task description

A more detailed task description can be found on the task description page.
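The tables below report an event-based F-score. As a rough illustrative sketch only (this is not the official DCASE evaluation code, and the exact onset/offset matching criterion used by the challenge may differ from the IoU threshold assumed here), an event-based F-score can be computed by greedily matching predicted events to reference events:

```python
# Hedged sketch of an event-based F-score. The min_iou matching
# criterion is an assumption, not the official DCASE rule.
def f_score(ref_events, pred_events, min_iou=0.3):
    """Events are (onset, offset) pairs in seconds."""
    matched = set()
    tp = 0
    for p_on, p_off in pred_events:
        for i, (r_on, r_off) in enumerate(ref_events):
            if i in matched:
                continue
            inter = max(0.0, min(p_off, r_off) - max(p_on, r_on))
            union = max(p_off, r_off) - min(p_on, r_on)
            if union > 0 and inter / union >= min_iou:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_events) - tp
    fn = len(ref_events) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The per-dataset scores further down are this metric computed on each evaluation subset separately.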

Systems ranking

| Rank | Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Validation dataset) |
| --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | |
| | Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 41.6 (41.0 - 42.1) | 52.1 |
| | KAO_NTHU_task5_2 | cosine similarity_Chromagram | kao_nthu2024 | 30.27 (29.42 - 30.81) | 32.2 |
| | KAO_NTHU_task5_3 | cosine similarity_Mel-Spectrograms∪Chromagram | kao_nthu2024 | 40.1 (39.2 - 40.7) | 36.2 |
| | KAO_NTHU_task5_1 | cosine similarity_Mel-Spectrograms | kao_nthu2024 | 46.9 (45.6 - 47.9) | 44.2 |
| | Lu_AILab_task5_1 | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | 44.1 |
| | Lu_AILab_task5_2 | Lu_AILab_task5_2 | lu_ailab2024 | 31.4 (31.0 - 31.7) | 56.4 |
| | Lu_AILab_task5_3 | Lu_AILab_task5_3 | lu_ailab2024 | 37.7 (37.3 - 38.1) | 56.4 |
| | Lu_AILab_task5_4 | Lu_AILab_task5_4 | lu_ailab2024 | 43.0 (42.3 - 43.4) | 44.1 |
| | XF_NUDT_task5_1 | Unet-Frame-Level | xf_nudt2024 | 64.1 (63.8 - 64.5) | 66.7 |
| | XF_NUDT_task5_4 | AAPM-Seg-Level | xf_nudt2024 | 28.4 (27.4 - 29.0) | 63.2 |
| | XF_NUDT_task5_3 | pcenBase-Frame-Level | xf_nudt2024 | 61.5 (61.1 - 61.9) | 61.7 |
| | XF_NUDT_task5_2 | logmelBase-Frame-Level | xf_nudt2024 | 65.2 (64.9 - 65.5) | 70.6 |
| | Latifi_IDMT_task5_1 | Cross-correlation baseline | latifi_idmt2024 | 0.7 (0.6 - 0.7) | 47.0 |
| | QianHu_BHEBIT_task5_1 | QianHu_DYXS_task5_1 | qianhu_bhebit2024 | 21.7 (21.1 - 22.2) | 51.4 |
| | QianHu_BIT_task5_3 | QianHu_DYXS_task5_3 | qianhu_bit2024 | 21.7 (21.2 - 22.1) | 51.8 |
| | QianHu_BHEBIT_task5_4 | QianHu_DYXS_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | 55.5 |
| | QianHu_BHEBIT_task5_2 | QianHu_DYXS_task5_2 | qianhu_bhebit2024 | 34.1 (33.6 - 34.5) | 53.7 |
| | Hoffman_ESP_task5_4 | In-context learning | hoffman_esp2024 | 8.1 (7.6 - 8.4) | 44.4 |
| | Hoffman_ESP_task5_2 | In-context learning | hoffman_esp2024 | 3.3 (3.0 - 3.5) | 58.4 |
| | Hoffman_ESP_task5_3 | In-context learning | hoffman_esp2024 | 1.2 (0.9 - 1.3) | 55.9 |
| | Hoffman_ESP_task5_1 | In-context learning | hoffman_esp2024 | 6.5 (6.1 - 6.9) | 53.3 |
| | Bordoux_WUR_task5_1 | AVESproto direct | bordoux_wur2024 | 42.0 (41.3 - 42.5) | 39.3 |

Dataset-wise metrics

| Rank | Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | CHE | CT | MGE | MS | QU | DC | CHE23 | CW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | 21.1 | 7.2 | 44.1 | 8.0 | 9.7 | 34.9 | 36.1 | 44.2 |
| | Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 41.6 (41.0 - 42.1) | 37.9 | 28.6 | 41.3 | 58.7 | 26.3 | 47.2 | 68.3 | 62.7 |
| | KAO_NTHU_task5_2 | cosine similarity_Chromagram | kao_nthu2024 | 30.27 (29.42 - 30.81) | 78.4 | 14.8 | 52.6 | 41.3 | 16.2 | 45.1 | 33.9 | 47.2 |
| | KAO_NTHU_task5_3 | cosine similarity_Mel-Spectrograms∪Chromagram | kao_nthu2024 | 40.1 (39.2 - 40.7) | 64.8 | 16.0 | 60.0 | 52.9 | 32.8 | 54.5 | 54.9 | 57.5 |
| | KAO_NTHU_task5_1 | cosine similarity_Mel-Spectrograms | kao_nthu2024 | 46.9 (45.6 - 47.9) | 92.5 | 16.6 | 83.0 | 59.7 | 44.5 | 61.1 | 71.5 | 70.3 |
| | Lu_AILab_task5_1 | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | 54.6 | 47.6 | 70.5 | 62.8 | 47.7 | 52.1 | 78.0 | 69.0 |
| | Lu_AILab_task5_2 | Lu_AILab_task5_2 | lu_ailab2024 | 31.4 (31.0 - 31.7) | 48.4 | 40.2 | 33.0 | 39.5 | 39.7 | 18.5 | 55.8 | 19.4 |
| | Lu_AILab_task5_3 | Lu_AILab_task5_3 | lu_ailab2024 | 37.7 (37.3 - 38.1) | 38.6 | 48.1 | 63.0 | 59.5 | 47.5 | 21.2 | 71.9 | 21.2 |
| | Lu_AILab_task5_4 | Lu_AILab_task5_4 | lu_ailab2024 | 43.0 (42.3 - 43.4) | 56.7 | 30.1 | 35.5 | 39.5 | 39.6 | 48.9 | 65.8 | 68.2 |
| | XF_NUDT_task5_1 | Unet-Frame-Level | xf_nudt2024 | 64.1 (63.8 - 64.5) | 87.6 | 62.4 | 76.4 | 73.7 | 37.2 | 67.3 | 79.9 | 76.8 |
| | XF_NUDT_task5_4 | AAPM-Seg-Level | xf_nudt2024 | 28.4 (27.4 - 29.0) | 74.1 | 11.1 | 76.0 | 52.1 | 12.3 | 32.9 | 78.4 | 67.7 |
| | XF_NUDT_task5_3 | pcenBase-Frame-Level | xf_nudt2024 | 61.5 (61.1 - 61.9) | 85.9 | 59.7 | 83.8 | 68.6 | 41.1 | 60.5 | 59.9 | 72.4 |
| | XF_NUDT_task5_2 | logmelBase-Frame-Level | xf_nudt2024 | 65.2 (64.9 - 65.5) | 64.1 | 69.4 | 93.1 | 78.9 | 41.2 | 64.2 | 79.7 | 75.5 |
| | Latifi_IDMT_task5_1 | Cross-correlation baseline | latifi_idmt2024 | 0.7 (0.6 - 0.7) | 1.9 | 0.7 | 0.7 | 6.5 | 0.7 | 0.5 | 0.6 | 0.3 |
| | QianHu_BHEBIT_task5_1 | QianHu_DYXS_task5_1 | qianhu_bhebit2024 | 21.7 (21.1 - 22.2) | 36.4 | 21.9 | 62.5 | 38.8 | 5.5 | 40.1 | 58.3 | 59.9 |
| | QianHu_BIT_task5_3 | QianHu_DYXS_task5_3 | qianhu_bit2024 | 21.7 (21.2 - 22.1) | 50.3 | 31.5 | 69.9 | 29.2 | 5.5 | 34.9 | 64.3 | 32.6 |
| | QianHu_BHEBIT_task5_4 | QianHu_DYXS_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | 75.0 | 27.3 | 42.6 | 38.7 | 31.0 | 43.8 | 64.2 | 78.0 |
| | QianHu_BHEBIT_task5_2 | QianHu_DYXS_task5_2 | qianhu_bhebit2024 | 34.1 (33.6 - 34.5) | 28.9 | 26.1 | 60.8 | 40.3 | 18.6 | 39.8 | 54.8 | 59.9 |
| | Hoffman_ESP_task5_4 | In-context learning | hoffman_esp2024 | 8.1 (7.6 - 8.4) | 55.9 | 43.5 | 3.0 | 27.0 | 2.2 | 31.7 | 56.7 | 38.9 |
| | Hoffman_ESP_task5_2 | In-context learning | hoffman_esp2024 | 3.3 (3.0 - 3.5) | 76.4 | 42.1 | 1.8 | 26.4 | 0.6 | 51.7 | 65.9 | 39.9 |
| | Hoffman_ESP_task5_3 | In-context learning | hoffman_esp2024 | 1.2 (0.9 - 1.3) | 39.3 | | 2.1 | 21.6 | 0.2 | 45.0 | 60.6 | 47.1 |
| | Hoffman_ESP_task5_1 | In-context learning | hoffman_esp2024 | 6.5 (6.1 - 6.9) | 68.2 | 39.5 | 4.5 | 36.2 | 1.2 | 41.3 | 72.6 | 39.0 |
| | Bordoux_WUR_task5_1 | AVESproto direct | bordoux_wur2024 | 42.0 (41.3 - 42.5) | 60.6 | 18.1 | 66.4 | 68.4 | 36.0 | 51.0 | 53.9 | 52.5 |

All per-dataset columns report the event-based F-score on the named evaluation subset (CHE, CT, MGE, MS, QU, DC, CHE23, CW).

Teams ranking

Table including only the best-performing system per submitting team.

| Rank | Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset) |
| --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | |
| | Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 41.6 (41.0 - 42.1) | |
| | KAO_NTHU_task5_1 | cosine similarity_Mel-Spectrograms | kao_nthu2024 | 46.9 (45.6 - 47.9) | 44.2 |
| | Lu_AILab_task5_1 | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | 44.1 |
| | XF_NUDT_task5_2 | logmelBase-Frame-Level | xf_nudt2024 | 65.2 (64.9 - 65.5) | 70.6 |
| | Latifi_IDMT_task5_1 | Cross-correlation baseline | latifi_idmt2024 | 0.7 (0.6 - 0.7) | 47.0 |
| | QianHu_BHEBIT_task5_4 | QianHu_DYXS_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | 55.5 |
| | Hoffman_ESP_task5_4 | In-context learning | hoffman_esp2024 | 8.1 (7.6 - 8.4) | 44.4 |
| | Bordoux_WUR_task5_1 | AVESproto direct | bordoux_wur2024 | 42.0 (41.3 - 42.5) | 39.3 |

System characteristics

General characteristics

| Rank | Code | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features |
| --- | --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | any | | spectrogram |
| | Baseline_PROTO_task5_1 | | 41.6 (41.0 - 42.1) | 22.05 kHz | | PCEN |
| | KAO_NTHU_task5_2 | kao_nthu2024 | 30.27 (29.42 - 30.81) | | | melspectrogram, chromagram |
| | KAO_NTHU_task5_3 | kao_nthu2024 | 40.1 (39.2 - 40.7) | | | melspectrogram, chromagram |
| | KAO_NTHU_task5_1 | kao_nthu2024 | 46.9 (45.6 - 47.9) | | | melspectrogram, chromagram |
| | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | | add Gaussian noise, FilterAugment | log-mel |
| | Lu_AILab_task5_2 | lu_ailab2024 | 31.4 (31.0 - 31.7) | | add Gaussian noise, FilterAugment | PCEN |
| | Lu_AILab_task5_3 | lu_ailab2024 | 37.7 (37.3 - 38.1) | | add Gaussian noise, FilterAugment | PCEN |
| | Lu_AILab_task5_4 | lu_ailab2024 | 43.0 (42.3 - 43.4) | | add Gaussian noise, FilterAugment | log-mel |
| | XF_NUDT_task5_1 | xf_nudt2024 | 64.1 (63.8 - 64.5) | | mixup | log-mel |
| | XF_NUDT_task5_4 | xf_nudt2024 | 28.4 (27.4 - 29.0) | | | FBANK |
| | XF_NUDT_task5_3 | xf_nudt2024 | 61.5 (61.1 - 61.9) | | Mixup | PCEN |
| | XF_NUDT_task5_2 | xf_nudt2024 | 65.2 (64.9 - 65.5) | | Mixup | log-mel |
| | Latifi_IDMT_task5_1 | latifi_idmt2024 | 0.7 (0.6 - 0.7) | any | block mixing | spectrogram |
| | QianHu_BHEBIT_task5_1 | qianhu_bhebit2024 | 21.7 (21.1 - 22.2) | | | delta MFCC & PCEN |
| | QianHu_BIT_task5_3 | qianhu_bit2024 | 21.7 (21.2 - 22.1) | | | delta MFCC & PCEN |
| | QianHu_BHEBIT_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | | | delta MFCC & PCEN |
| | QianHu_BHEBIT_task5_2 | qianhu_bhebit2024 | 34.1 (33.6 - 34.5) | | | delta MFCC & PCEN |
| | Hoffman_ESP_task5_4 | hoffman_esp2024 | 8.1 (7.6 - 8.4) | | mixup, resampling, time reversal, block mixing | spectrogram |
| | Hoffman_ESP_task5_2 | hoffman_esp2024 | 3.3 (3.0 - 3.5) | | mixup, resampling, time reversal, block mixing | spectrogram |
| | Hoffman_ESP_task5_3 | hoffman_esp2024 | 1.2 (0.9 - 1.3) | | mixup, resampling, time reversal, block mixing | spectrogram |
| | Hoffman_ESP_task5_1 | hoffman_esp2024 | 6.5 (6.1 - 6.9) | | mixup, resampling, time reversal, block mixing | spectrogram |
| | Bordoux_WUR_task5_1 | bordoux_wur2024 | 42.0 (41.3 - 42.5) | | | waveform |



Machine learning characteristics

| Rank | Code | Technical report | Event-based F-score (Eval) | Classifier | Few-shot approach | Post-processing |
| --- | --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | template matching | template matching | peak picking, threshold |
| | Baseline_PROTO_task5_1 | | 41.6 (41.0 - 42.1) | ResNet | prototypical | threshold |
| | KAO_NTHU_task5_2 | kao_nthu2024 | 30.27 (29.42 - 30.81) | supervised learning | cosine similarity based | customized merging (median filtering, time aggregation, ...) |
| | KAO_NTHU_task5_3 | kao_nthu2024 | 40.1 (39.2 - 40.7) | supervised learning | cosine similarity based | customized merging (median filtering, time aggregation, ...) |
| | KAO_NTHU_task5_1 | kao_nthu2024 | 46.9 (45.6 - 47.9) | supervised learning | cosine similarity based | customized merging (median filtering, time aggregation, ...) |
| | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | CNN | fine tuning | peak picking, threshold |
| | Lu_AILab_task5_2 | lu_ailab2024 | 31.4 (31.0 - 31.7) | CNN | fine tuning | peak picking, threshold |
| | Lu_AILab_task5_3 | lu_ailab2024 | 37.7 (37.3 - 38.1) | CNN | fine tuning | peak picking, threshold |
| | Lu_AILab_task5_4 | lu_ailab2024 | 43.0 (42.3 - 43.4) | CNN | fine tuning | peak picking, threshold |
| | XF_NUDT_task5_1 | xf_nudt2024 | 64.1 (63.8 - 64.5) | CNN | fine tuning | median filtering, threshold |
| | XF_NUDT_task5_4 | xf_nudt2024 | 28.4 (27.4 - 29.0) | transformer | fine tuning | median filtering, threshold |
| | XF_NUDT_task5_3 | xf_nudt2024 | 61.5 (61.1 - 61.9) | CNN | fine tuning | median filtering, threshold |
| | XF_NUDT_task5_2 | xf_nudt2024 | 65.2 (64.9 - 65.5) | CNN | fine tuning | median filtering, threshold |
| | Latifi_IDMT_task5_1 | latifi_idmt2024 | 0.7 (0.6 - 0.7) | template matching | fine tuning | peak picking, threshold |
| | QianHu_BHEBIT_task5_1 | qianhu_bhebit2024 | 21.7 (21.1 - 22.2) | CNN | prototypical | threshold=0.15 |
| | QianHu_BIT_task5_3 | qianhu_bit2024 | 21.7 (21.2 - 22.1) | CNN | prototypical | threshold=0.16 |
| | QianHu_BHEBIT_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | CNN | prototypical | threshold=0.15 |
| | QianHu_BHEBIT_task5_2 | qianhu_bhebit2024 | 34.1 (33.6 - 34.5) | | prototypical | |
| | Hoffman_ESP_task5_4 | hoffman_esp2024 | 8.1 (7.6 - 8.4) | transformer | in-context | threshold (data dependent), smoothing (data dependent) |
| | Hoffman_ESP_task5_2 | hoffman_esp2024 | 3.3 (3.0 - 3.5) | transformer | in-context | threshold (data dependent), smoothing (data dependent) |
| | Hoffman_ESP_task5_3 | hoffman_esp2024 | 1.2 (0.9 - 1.3) | transformer | in-context | threshold (data dependent), smoothing (data dependent) |
| | Hoffman_ESP_task5_1 | hoffman_esp2024 | 6.5 (6.1 - 6.9) | transformer | in-context | threshold (data dependent), smoothing (data dependent) |
| | Bordoux_WUR_task5_1 | bordoux_wur2024 | 42.0 (41.3 - 42.5) | transformer-based autoencoder | prototypical | threshold at 70% of annotation length |

Complexity

| Rank | Code | Technical report | Event-based F-score (Eval) | Model complexity | Training time |
| --- | --- | --- | --- | --- | --- |
| | Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | | |
| | Baseline_PROTO_task5_1 | | 41.6 (41.0 - 42.1) | | |
| | KAO_NTHU_task5_2 | kao_nthu2024 | 30.27 (29.42 - 30.81) | | 120 hours |
| | KAO_NTHU_task5_3 | kao_nthu2024 | 40.1 (39.2 - 40.7) | | 120 hours |
| | KAO_NTHU_task5_1 | kao_nthu2024 | 46.9 (45.6 - 47.9) | | 120 hours |
| | Lu_AILab_task5_1 | lu_ailab2024 | 56.7 (56.2 - 57.1) | 27M | 1 hour (A6000-48G) |
| | Lu_AILab_task5_2 | lu_ailab2024 | 31.4 (31.0 - 31.7) | 27M | 1 hour (A6000-48G) |
| | Lu_AILab_task5_3 | lu_ailab2024 | 37.7 (37.3 - 38.1) | 27M | 1 hour (A6000-48G) |
| | Lu_AILab_task5_4 | lu_ailab2024 | 43.0 (42.3 - 43.4) | 27M | 1 hour (A6000-48G) |
| | XF_NUDT_task5_1 | xf_nudt2024 | 64.1 (63.8 - 64.5) | 880k | 56 minutes |
| | XF_NUDT_task5_4 | xf_nudt2024 | 28.4 (27.4 - 29.0) | 89M | 3 hours |
| | XF_NUDT_task5_3 | xf_nudt2024 | 61.5 (61.1 - 61.9) | 359k | 48 minutes |
| | XF_NUDT_task5_2 | xf_nudt2024 | 65.2 (64.9 - 65.5) | 359k | 48 minutes |
| | Latifi_IDMT_task5_1 | latifi_idmt2024 | 0.7 (0.6 - 0.7) | 12,046,462 | 12 hours (8 GB GPU) |
| | QianHu_BHEBIT_task5_1 | qianhu_bhebit2024 | 21.7 (21.1 - 22.2) | 724k | 4 hours on RTX 4070 |
| | QianHu_BIT_task5_3 | qianhu_bit2024 | 21.7 (21.2 - 22.1) | 726k | 4 hours on RTX 4070 |
| | QianHu_BHEBIT_task5_4 | qianhu_bhebit2024 | 42.5 (41.7 - 43.0) | 724k | 4 hours on RTX 4070 |
| | QianHu_BHEBIT_task5_2 | qianhu_bhebit2024 | 34.1 (33.6 - 34.5) | 724k | 4 hours on RTX 4070 |
| | Hoffman_ESP_task5_4 | hoffman_esp2024 | 8.1 (7.6 - 8.4) | 33.5M unfrozen + 80M frozen | 24 hours (A100 GPU) |
| | Hoffman_ESP_task5_2 | hoffman_esp2024 | 3.3 (3.0 - 3.5) | 33.5M unfrozen + 80M frozen | 24 hours (A100 GPU) |
| | Hoffman_ESP_task5_3 | hoffman_esp2024 | 1.2 (0.9 - 1.3) | 33.5M unfrozen + 80M frozen | 24 hours (A100 GPU) |
| | Hoffman_ESP_task5_1 | hoffman_esp2024 | 6.5 (6.1 - 6.9) | 33.5M unfrozen + 80M frozen | 24 hours (A100 GPU) |
| | Bordoux_WUR_task5_1 | bordoux_wur2024 | 42.0 (41.3 - 42.5) | 90M | |

Technical reports

ADAPTABLE INPUT LENGTH USING MODEL TRAINED ON WAVEFORM Technical Report

Bordoux, Valentin
Marine Animal Ecology, Wageningen University

Abstract

This report presents a method for bioacoustic sound event detection using few-shot learning, developed for DCASE 2024 Task 5. Our approach experiments with pretrained models that take waveforms as input. These models serve as feature extractors, and a prototypical loss is used for prediction. Initially, we employed direct predictions with openly available pretrained models. Subsequently, we attempted to fine-tune the models for each file, using only the first five annotations as the training set. The direct prediction system achieved a 40% F-measure, 12 points under the baseline system proposed by the organizers. Fine-tuning did not improve the model's performance over direct prediction. While our proposed method can be applied directly without extensive parameter tuning or additional training, the results indicate that it does not achieve the generalizability required for this challenge when compared to the baseline method. This work suggests that state-of-the-art models, despite their high performance on other datasets or benchmarks, may still perform suboptimally on few-shot sound event detection for certain taxonomic groups present in the DCASE challenge datasets.
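The prototypical prediction step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the embedding dimensionality, frame rate, and the use of Euclidean distance are assumptions, and the real system extracts embeddings with a pretrained waveform model (AVES) rather than receiving them as arrays.

```python
import numpy as np

def prototype_predict(support_pos, support_neg, query):
    """Classify query frames by distance to class prototypes.

    support_pos, support_neg: (n, d) embedding arrays for the labeled
    positive/negative support examples; query: (t, d) frame embeddings.
    Returns a boolean mask, True where a frame is closer to the
    positive prototype than to the negative one.
    """
    proto_pos = support_pos.mean(axis=0)  # class prototype = mean embedding
    proto_neg = support_neg.mean(axis=0)
    d_pos = np.linalg.norm(query - proto_pos, axis=1)
    d_neg = np.linalg.norm(query - proto_neg, axis=1)
    return d_pos < d_neg
```

Consecutive positive frames would then be merged into predicted events.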

System characteristics
System embeddings: AVES (wav2vec2)
Subsystem count: False
External data usage: False

Toward in-context bioacoustic sound event detection

Hoffman, Benjamin and Robinson, David
Earth Species Project

Abstract

We introduce an in-context learning approach to bioacoustic sound event detection. Our approach consists of a large pre-trained transformer model which, when prompted with a small amount of labeled audio, directly predicts detection labels on unlabeled audio. To train our model, we constructed a large audio database, which we used to generate acoustic scenes with temporally fine-grained detection labels. On the validation set for the 2024 DCASE Few-shot Bioacoustic Event Detection challenge, our best-performing submission achieves an average F1 score of 0.584, improving on the challenge baseline by 0.063.
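One way to picture the in-context setup is as a single sequence that interleaves the labeled support audio with the unlabeled query. The sketch below is purely illustrative and not from the report: the feature layout, the label channel, and the mask value are all assumptions about how such a prompt might be assembled.

```python
import numpy as np

def build_prompt(support_feats, support_labels, query_feats):
    """Concatenate support (features + labels) with the query.

    support_feats: (n, d) per-frame features of the labeled audio;
    support_labels: (n,) 0/1 frame labels; query_feats: (t, d).
    The query's label channel is filled with a mask value (-1), which
    the model is trained to replace with predicted detection labels.
    """
    support = np.concatenate([support_feats, support_labels[:, None]], axis=1)
    mask = np.full((query_feats.shape[0], 1), -1.0)  # "predict me" marker
    query = np.concatenate([query_feats, mask], axis=1)
    return np.concatenate([support, query], axis=0)
```

The transformer then emits frame-level labels for the masked positions in one forward pass, with no per-file fine-tuning.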

System characteristics
Data augmentation: mixup, resampling, time reversal, block mixing
System embeddings: ATST-FRAME
Subsystem count: False
External data usage: True

Cosine similarity based Few-shot Bioacoustic Event Detection with Automatic Frequency Range Identification in Mel-Spectrograms Technical Report

Kao, Sheng-Lun and Liu, Yi-Wen
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract

In response to the Few-shot Bioacoustic Event Detection challenge, we have developed a detection system comprising three key components. First, an algorithm has been devised for automatically identifying the frequency range of the positive (POS) signal within the mel-spectrogram. Second, the cosine similarity between POS and negative (NEG) events is computed across the entire audio file. Third, predictions of POS events are made based on the results of the cosine similarity. Remarkably, this approach does not rely on any training data from the development dataset, external data, or pretrained models. The proposed system achieved an F1-score of 44.187% on the 2023 validation set.
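The cosine-similarity step can be sketched as sliding a POS template over the mel-spectrogram. This is a minimal sketch under assumed shapes, not the submitted system: it omits the automatic frequency-range identification and the NEG comparison, and the threshold value is an arbitrary placeholder.

```python
import numpy as np

def cosine_detection(spec, template, threshold=0.8):
    """Slide a POS template over a (freq, time) spectrogram.

    Returns per-offset cosine similarities and a boolean detection mask.
    """
    w = template.shape[1]
    t_vec = template.ravel()
    t_vec = t_vec / (np.linalg.norm(t_vec) + 1e-9)
    sims = []
    for start in range(spec.shape[1] - w + 1):
        win = spec[:, start:start + w].ravel()
        win = win / (np.linalg.norm(win) + 1e-9)  # unit-normalize window
        sims.append(float(win @ t_vec))           # cosine similarity
    sims = np.array(sims)
    return sims, sims >= threshold
```

Because only similarities to the five given POS examples are needed, no training data, external data, or pretrained model is involved, matching the abstract's claim.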

System characteristics
System embeddings: CNN
Subsystem count: False
External data usage: False

CROSS-ADAPT: CROSS-DATASET GENERATION AND DOMAIN ADAPTATION TECHNIQUE FOR FEW-SHOT LEARNING Technical Report

Bidarouni, Amir Latifi and Abeßer, Jakob
Semantic Music Technologies Group, Fraunhofer IDMT, Ilmenau, Germany

Abstract

Bioacoustic monitoring is an invaluable tool for understanding wildlife well-being. However, the scarcity of annotated data for effective model training, coupled with domain shifts resulting from data recorded at various sensor locations with diverse acoustic environments, poses significant challenges for deep learning-based audio classification systems. In this paper, we propose a novel cross-dataset data augmentation technique designed to make effective use of the limited annotated data available, exemplified by the few-shot learning Task 5 of the DCASE challenge. Furthermore, we employ Instance-wise Feature Projection-based Domain Adaptation (IFPDA) to mitigate the domain shifts caused by variations in recording locations or devices. We use a modified ResNet model architecture in a multitask learning setting, which combines multi-class species classification at the patch level and binary classification for frame-level sound event detection.
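The block-mixing augmentation listed for this system can be sketched as swapping random time blocks between two spectrograms. This is a generic sketch, not the report's cross-dataset variant: the block count and swap probability are assumptions.

```python
import numpy as np

def block_mix(spec_a, spec_b, n_blocks=4, rng=None):
    """Randomly replace time blocks of spec_a with the same blocks of spec_b.

    Both inputs are (freq, time) arrays of the same shape; each of the
    n_blocks equal-width time blocks is swapped with probability 0.5.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    out = spec_a.copy()
    block = spec_a.shape[1] // n_blocks
    for i in range(n_blocks):
        if rng.random() < 0.5:
            s = i * block
            out[:, s:s + block] = spec_b[:, s:s + block]
    return out
```

In the cross-dataset setting described above, spec_a and spec_b would come from different recording sites, exposing the model to mixed acoustic environments.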

System characteristics
Data augmentation: Block mixing
System embeddings: Modified ResNet50
Subsystem count: False
External data usage: False

FEW-SHOT BIOACOUSTIC EVENT DETECTION AT THE DCASE 2024 CHALLENGE Technical Report

Liu, Wei and Liu, Hy and Lin, Fl and Liu, Hs and Gao, Tian and Fang, Xin and Liu, Jh
iFLYTEK Research Institute, Hefei, China, and National University of Defense Technology, Changsha, China

Abstract

In this technical report, we describe our submission system for DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection. In previous work, we proposed a frame-level embedding learning system that achieved the best performance in DCASE 2022 Task 5. For this task, we propose several methods to improve the representational capacity of embeddings under limited positive samples. Three methods are proposed based on the pre-training and fine-tuning process: the AAPM segment-level embedding learning method, the baseline frame-level embedding learning method, and the Unet-based frame-level embedding learning method. Compared to our previous work, our new system achieved better results on the official 2023 validation set (F-measure 76.8%, no ML). The proposed system was evaluated on the newly released official 2024 validation set, with a best overall F-measure score of 70.56%.
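The post-processing this team lists (median filtering, then thresholding) turns frame-level probabilities into timed events. A minimal sketch, assuming a fixed frame hop and kernel size (the report's actual values are not given here):

```python
import numpy as np

def frames_to_events(probs, frame_hop_s=0.1, threshold=0.5, kernel=5):
    """Median-filter frame probabilities, threshold, and merge runs into events.

    probs: (t,) frame-level detection probabilities.
    Returns a list of (onset_s, offset_s) event tuples.
    """
    pad = kernel // 2
    padded = np.pad(probs, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + kernel])
                         for i in range(len(probs))])  # median filter
    active = smoothed >= threshold
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                                   # event onset
        elif not a and start is not None:
            events.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:                               # event runs to the end
        events.append((start * frame_hop_s, len(active) * frame_hop_s))
    return events
```

The median filter suppresses isolated spurious frames before the threshold decides event boundaries.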

System characteristics
Data augmentation: Mixup
Subsystem count: False
External data usage: False

FEW-SHOT BIOACOUSTIC EVENT DETECTION WITH FRAME-LEVEL EMBEDDING LEARNING SYSTEM Technical Report

Zhao, Peng Yuan and Lu, Cheng Wei and Zou, Liang
China University of Mining and Technology, XuZhou, China

Abstract

In this technical report, we describe our submission system for few-shot bioacoustic event detection in DCASE 2022 Task 5. Participants are expected to develop a few-shot learning system for detecting mammal and bird sounds in audio recordings. In our system, Prototypical Networks are used to embed spectrograms into an embedding space and learn a non-linear mapping between data samples. We apply various data augmentation techniques to Mel-spectrograms and introduce a ResNet variant as the classifier. Our experiments demonstrate that the system achieves an F1-score of 47.88% on the validation data.
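One of this team's listed augmentations is additive Gaussian noise on the input features. A hedged sketch (the target-SNR formulation and its default value are assumptions; the report's exact noise parameters are not stated here):

```python
import numpy as np

def add_gaussian_noise(spec, snr_db=20.0, rng=None):
    """Add Gaussian noise to a (freq, time) spectrogram at a target SNR."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(spec ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))  # SNR in dB -> ratio
    noise = rng.normal(0.0, np.sqrt(noise_power), spec.shape)
    return spec + noise
```

Combined with FilterAugment, this perturbs the limited support examples so the embedding network sees more varied inputs.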

System characteristics
Data augmentation: add Gaussian noise, FilterAugment
System embeddings: False
Subsystem count: False
External data usage: False

LIF-PROTONET: PROTOTYPICAL NETWORK WITH LEAKY INTEGRATE-AND-FIRE NEURON AND SQUEEZE-AND-EXCITATION BLOCKS FOR BIOACOUSTIC EVENT DETECTION Technical report

Sun, Mengkai and Zhang, Haojie and Qian, Kun and Hu, Bin
Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education (Beijing Institute of Technology), P. R. China and School of Medical Technology, Beijing Institute of Technology, P. R. China

Abstract

In this technical report, we describe our submission system for DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection. We propose a metric learning method to construct a novel prototypical network based on Leaky Integrate-and-Fire neurons and Squeeze-and-Excitation (SE) blocks. We make better use of the negative data, which can be used to construct the loss function and provide much more semantic information. Most importantly, we propose to use SE blocks to adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels, which improves the F-measure to 53.72%. For the input feature, we use a combination of per-channel energy normalization (PCEN) and delta mel-frequency cepstral coefficients (ΔMFCC); the features are first transformed through Leaky Integrate-and-Fire neurons to mimic brain function. Our system performs better than the official baseline on the DCASE 2024 Task 5 validation set. Our final score reaches an F-measure of 55.49%, outperforming the baseline performance.
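PCEN, the normalization this team (and several others above) uses as an input feature, can be sketched directly from its standard formulation. This is a simplified reference implementation with commonly cited default parameters, not the submission's code; real pipelines typically use an optimized library routine.

```python
import numpy as np

def pcen(mel, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (freq, time) mel spectrogram.

    M is an IIR-smoothed version of the energy along time; the output is
    (E / (eps + M)^alpha + delta)^r - delta^r per time-frequency bin.
    """
    m = np.zeros_like(mel)
    m[:, 0] = mel[:, 0]
    for t in range(1, mel.shape[1]):
        m[:, t] = (1 - s) * m[:, t - 1] + s * mel[:, t]  # temporal smoothing
    return (mel / (eps + m) ** alpha + delta) ** r - delta ** r
```

By dividing each bin by its own smoothed history, PCEN suppresses slowly varying background noise, which is why it is a popular front end for bioacoustic detection.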

System characteristics
System embeddings: CNN
Subsystem count: False
External data usage: False