Few-shot Bioacoustic Event Detection


Challenge results

Task description

A more detailed task description can be found on the task description page.

Systems ranking

| Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Validation dataset) |
| --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | 3.4 |
| Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 2.92 (2.32 - 3.08) | |
| Moummad_IMT_task5_3 | BRAIn_LORIA_S3 | Moummad2023 | 38.3 (37.9 - 38.7) | 62.8 |
| Moummad_IMT_task5_4 | BRAIn_LORIA_S4 | Moummad2023 | 34.4 (33.9 - 34.8) | 58.3 |
| Moummad_IMT_task5_1 | BRAIn_LORIA_S1 | Moummad2023 | 35.6 (35.3 - 36.0) | 62.3 |
| Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 63.5 |
| Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 36.6 |
| Gelderblom_SINTEF_task5_1 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 23.4 (22.9 - 23.8) | 36.6 |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 63.9 |
| XuQianHu_NUDT_BIT_task5_1 | XuQianHu_DYXS_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 65.5 |
| XuQianHu_NUDT_BIT_task5_2 | XuQianHu_DYXS_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 63.1 |
| XuQianHu_NUDT_BIT_task5_4 | XuQianHu_DYXS_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 62.1 |
| Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 62.6 |
| Wilkinghoff_FKIE_task5_1 | FKIE system 1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 63.8 |
| Wilkinghoff_FKIE_task5_3 | FKIE system 3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 65.5 |
| Wilkinghoff_FKIE_task5_2 | FKIE system 2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 63.7 |
| Jung_KT_task5_4 | Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 83.1 |
| Jung_KT_task5_2 | Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 81.7 |
| Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 81.5 |
| Jung_KT_task5_1 | Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 79.8 |
| Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 75.6 |
| Du_NERCSLIP_task5_1 | Multi-task Frame-level embedding learning 1 | Du2023 | 61.2 (60.7 - 61.6) | 74.1 |
| Du_NERCSLIP_task5_3 | Multi-task Frame-level embedding learning 3 | Du2023 | 63.6 (63.2 - 64.0) | 76.4 |
| Du_NERCSLIP_task5_4 | Multi-task Frame-level embedding learning 4 | Du2023 | 63.6 (63.2 - 64.0) | 76.4 |
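
All tables on this page report the event-based F-score, in which each predicted event is matched to a reference event by temporal overlap before precision and recall are computed. The sketch below illustrates the idea in a simplified, greedy form; the official scorer uses its own matching procedure and minimum-overlap criterion, so the greedy matching and the IoU threshold of 0.3 here are illustrative assumptions, not the challenge's exact rules.

```python
def event_based_fscore(ref_events, pred_events, min_iou=0.3):
    """Toy event-based F-score: events are (onset, offset) pairs in seconds.
    A prediction counts as a true positive if it overlaps a not-yet-matched
    reference event with IoU >= min_iou (illustrative threshold)."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    unmatched = list(range(len(ref_events)))
    tp = 0
    for pred in pred_events:
        candidates = [(iou(ref_events[i], pred), i) for i in unmatched]
        if candidates:
            best_iou, best_idx = max(candidates)
            if best_iou >= min_iou:
                tp += 1
                unmatched.remove(best_idx)
    precision = tp / len(pred_events) if pred_events else 0.0
    recall = tp / len(ref_events) if ref_events else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```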

Dataset-wise metrics

The per-dataset columns give the event-based F-score on each evaluation dataset (CHE, CT, MGE, MS, QU, DC, CHE23, CW).

| Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | CHE | CT | MGE | MS | QU | DC | CHE23 | CW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | 21.1 | 7.2 | 44.1 | 8.0 | 9.7 | 34.9 | 36.1 | 44.2 |
| Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 2.92 (2.32 - 3.08) | 31.5 | 14.0 | 8.1 | 27.1 | 0.4 | 37.8 | 36.1 | 44.2 |
| Moummad_IMT_task5_3 | BRAIn_LORIA_S3 | Moummad2023 | 38.3 (37.9 - 38.7) | 60.0 | 43.9 | 41.5 | 71.1 | 13.6 | 40.7 | 83.7 | 70.7 |
| Moummad_IMT_task5_4 | BRAIn_LORIA_S4 | Moummad2023 | 34.4 (33.9 - 34.8) | 61.4 | 37.6 | 42.0 | 63.6 | 10.9 | 41.4 | 80.9 | 70.1 |
| Moummad_IMT_task5_1 | BRAIn_LORIA_S1 | Moummad2023 | 35.6 (35.3 - 36.0) | 52.0 | 39.7 | 66.5 | 31.4 | 13.7 | 40.9 | 81.5 | 62.6 |
| Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 60.3 | 36.2 | 61.3 | 67.8 | 17.7 | 41.5 | 83.2 | 72.3 |
| Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 58.3 | 15.4 | 70.9 | 58.0 | 12.5 | 39.8 | 58.4 | 66.3 |
| Gelderblom_SINTEF_task5_1 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 23.4 (22.9 - 23.8) | 52.0 | 14.1 | 58.5 | 56.3 | 6.6 | 40.7 | 64.2 | 57.3 |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 75.0 | 27.3 | 42.6 | 38.7 | 31.0 | 43.8 | 64.2 | 78.0 |
| XuQianHu_NUDT_BIT_task5_1 | XuQianHu_DYXS_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 36.4 | 21.9 | 62.5 | 38.8 | 5.5 | 40.1 | 58.3 | 59.9 |
| XuQianHu_NUDT_BIT_task5_2 | XuQianHu_DYXS_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 28.9 | 26.1 | 60.8 | 40.3 | 18.6 | 39.9 | 54.8 | 59.9 |
| XuQianHu_NUDT_BIT_task5_4 | XuQianHu_DYXS_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 50.3 | 31.5 | 69.9 | 29.2 | 5.5 | 34.9 | 64.3 | 32.5 |
| Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 30.9 | 43.2 | 41.9 | 15.1 | 3.9 | 34.9 | 31.5 | 59.4 |
| Wilkinghoff_FKIE_task5_1 | FKIE system 1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 29.7 | 40.5 | 32.3 | 15.2 | 1.9 | 32.4 | 26.6 | 63.1 |
| Wilkinghoff_FKIE_task5_3 | FKIE system 3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 31.2 | 39.2 | 25.4 | 9.9 | 1.9 | 29.9 | 29.4 | 73.3 |
| Wilkinghoff_FKIE_task5_2 | FKIE system 2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 31.9 | 38.1 | 31.6 | 12.5 | 1.9 | 31.5 | 34.6 | 73.1 |
| Jung_KT_task5_4 | Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 30.2 | 14.4 | 47.2 | 24.8 | 17.9 | 37.7 | 30.0 | 62.1 |
| Jung_KT_task5_2 | Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 46.3 | 10.9 | 4.1 | 25.7 | 29.6 | 38.1 | 48.9 | 77.3 |
| Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 32.2 | 14.3 | 20.8 | 26.7 | 25.8 | 47.7 | 41.2 | 58.8 |
| Jung_KT_task5_1 | Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 47.5 | 10.4 | 3.2 | 29.4 | 31.1 | 44.1 | 47.3 | 78.1 |
| Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 85.7 | 53.8 | 93.1 | 72.7 | 42.7 | 62.4 | 69.5 | 44.2 |
| Du_NERCSLIP_task5_1 | Multi-task Frame-level embedding learning 1 | Du2023 | 61.2 (60.7 - 61.6) | 85.7 | 53.8 | 93.1 | 72.7 | 38.5 | 53.7 | 69.1 | 71.3 |
| Du_NERCSLIP_task5_3 | Multi-task Frame-level embedding learning 3 | Du2023 | 63.6 (63.2 - 64.0) | 74.9 | 54.5 | 95.2 | 68.3 | 43.7 | 66.3 | 71.1 | 59.3 |
| Du_NERCSLIP_task5_4 | Multi-task Frame-level embedding learning 4 | Du2023 | 63.6 (63.2 - 64.0) | 86.0 | 51.4 | 95.2 | 75.0 | 43.7 | 60.8 | 69.1 | 72.3 |

Teams ranking

This table includes only the best-performing system per submitting team.

| Submission code | Submission name | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset) |
| --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | Baseline Template Matching | | 14.9 (14.0 - 15.3) | 3.4 |
| Baseline_PROTO_task5_1 | Baseline Prototypical Network | | 2.92 (2.32 - 3.08) | |
| Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 63.5 |
| Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 36.6 |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 63.9 |
| Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 62.6 |
| Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 81.5 |
| Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 75.6 |

System characteristics

General characteristics

| Code | Technical report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features |
| --- | --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | any | | spectrogram |
| Baseline_PROTO_task5_1 | | 2.92 (2.32 - 3.08) | 22.05 kHz | | PCEN |
| Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram |
| Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram |
| Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram |
| Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram |
| Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | 16 kHz | time stretching, denoising | mel spectrogram |
| Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | 16 kHz | time stretching, denoising | mel spectrogram |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 22.05 kHz | | delta MFCC and PCEN |
| XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 22.05 kHz | | delta MFCC and PCEN |
| XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 22.05 kHz | | delta MFCC and PCEN |
| XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 22.05 kHz | | delta MFCC and PCEN |
| Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 22.05 kHz | SpecAugment, Mixup | PCEN |
| Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 22.05 kHz | SpecAugment, Mixup | PCEN |
| Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 22.05 kHz | SpecAugment, Mixup | PCEN |
| Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 22.05 kHz | SpecAugment, Mixup | PCEN |
| Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 22.05 kHz | | delta MFCC, PCEN |
| Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 22.05 kHz | | delta MFCC, PCEN |
| Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 22.05 kHz | | delta MFCC, PCEN |
| Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 22.05 kHz | | delta MFCC, PCEN |
| Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | 22.05 kHz | SpecAugment | PCEN |
| Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | 22.05 kHz | SpecAugment | PCEN |
| Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | 22.05 kHz | SpecAugment, time stretching | PCEN |
| Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | 22.05 kHz | SpecAugment, time stretching | PCEN |
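
The Features column above is dominated by three front-ends: mel spectrograms, PCEN, and delta MFCCs. As a rough illustration of how these are typically computed with librosa (the parameter values such as n_mels, hop length, the PCEN scaling, and the number of MFCCs are assumptions, not taken from any submission):

```python
import librosa

def extract_features(path, sr=22050, n_mels=128, hop_length=256):
    """Compute a log-mel spectrogram, a PCEN-normalised mel spectrogram,
    and delta MFCCs for one recording (illustrative settings)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel)
    # PCEN operates on the unscaled mel energies; the 2**31 factor is the
    # scaling convention suggested in the librosa documentation.
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr, hop_length=hop_length)
    # Delta MFCCs, as listed for several systems above (hypothetical n_mfcc).
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=20)
    delta_mfcc = librosa.feature.delta(mfcc)
    return log_mel, pcen, delta_mfcc
```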



Machine learning characteristics

| Code | Technical report | Event-based F-score (Eval) | Classifier | Few-shot approach | Post-processing |
| --- | --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | template matching | template matching | peak picking, threshold |
| Baseline_PROTO_task5_1 | | 2.92 (2.32 - 3.08) | ResNet | prototypical | threshold |
| Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | CNN | softmax binary classifier + fine-tuning two last layers | |
| Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | CNN | softmax binary classifier + fine-tuning whole model | |
| Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | CNN | softmax binary classifier | |
| Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | CNN | softmax binary classifier + fine-tuning last layers | |
| Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | transformer | prototypical | threshold |
| Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | transformer | prototypical | threshold |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | CNN | prototypical | threshold (0.3) |
| XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | CNN | prototypical | threshold (0.1) |
| XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | CNN | prototypical | threshold (0.05) |
| XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | CNN | prototypical | threshold (0.15) |
| Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold |
| Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold |
| Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold |
| Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold |
| Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | CNN | prototypical, contrastive, fine-tuning, transductive inference | threshold, filter by length, remove long |
| Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | CNN | prototypical, contrastive, fine-tuning, transductive inference | threshold, filter by length, remove long |
| Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | CNN | prototypical, contrastive, fine-tuning, transductive inference | threshold, filter by length, remove long |
| Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | CNN | prototypical, contrastive, fine-tuning, transductive inference | threshold, filter by length, remove long |
| Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | CNN | fine-tuning | peak picking, threshold |
| Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | CNN | fine-tuning | peak picking, threshold |
| Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | CNN | fine-tuning | peak picking, threshold |
| Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | CNN | fine-tuning | peak picking, threshold |

Complexity

| Code | Technical report | Event-based F-score (Eval) | Model complexity | Training time |
| --- | --- | --- | --- | --- |
| Baseline_TempMatch_task5_1 | | 14.9 (14.0 - 15.3) | | |
| Baseline_PROTO_task5_1 | | 2.92 (2.32 - 3.08) | | |
| Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | 7.2 M | 50 min |
| Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | 7.2 M | 60 min |
| Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | 7.2 M | 25 min |
| Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | 7.2 M | 40 min |
| Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | 90 M | 2 hours |
| Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | 90 M | 2 hours |
| XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 724 K | 3 hours (GTX 1080 Ti) |
| XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 724 K | 3 hours (GTX 1080 Ti) |
| XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 724 K | 3 hours (GTX 1080 Ti) |
| XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 724 K | 3 hours (GTX 1080 Ti) |
| Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 3.1 M | 12 hours (Quadro RTX 5000) |
| Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 3.1 M | 12 hours (Quadro RTX 5000) |
| Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 3.1 M | 12 hours (Quadro RTX 5000) |
| Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 3.1 M | 12 hours (Quadro RTX 5000) |
| Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | | |
| Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | | |
| Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | | |
| Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | | |
| Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | 21,460,630 | 1 hour (Tesla V100, 32 GB) |
| Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | 21,460,630 | 1 hour (Tesla V100, 32 GB) |
| Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | 21,460,630 | 3 hours (Tesla V100, 32 GB) |
| Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | 21,460,630 | 3 hours (Tesla V100, 32 GB) |

Technical reports

MULTI-TASK FRAME LEVEL SYSTEM FOR FEW-SHOT BIOACOUSTIC EVENT DETECTION

Yan, Genwei and Wang, Ruoyu and Zou, Liang and Du, Jun and Wang, Qing and Gao, Tian and Fang, Xin
China University of Mining and Technology; University of Science and Technology of China; iFLYTEK Research Institute

Abstract

This technical report describes our new frame-level embedding learning system for DCASE 2023 Task 5: few-shot bioacoustic event detection. In previous work, we proposed a frame-level embedding learning system that achieved the best performance in DCASE 2022 Task 5. In this work, we apply several techniques to improve upon that system. Additionally, we introduce multi-task learning and Target Speaker Voice Activity Detection (TS-VAD) strategies to transform our previous system into a new multi-task frame-level embedding learning system. Compared to our previous work, the new system achieves a better result (F-measure 75.74%, No ML) on the official validation set.
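
As background for the frame-level idea (per-frame decisions rather than per-segment ones), the sketch below shows a CNN encoder that preserves the time axis and scores every frame against a query-event embedding, loosely in the spirit of the TS-VAD-style conditioning mentioned above. The architecture, layer sizes, and sigmoid head are illustrative assumptions, not the authors' system.

```python
import torch
import torch.nn as nn

class FrameLevelDetector(nn.Module):
    """Minimal frame-level sketch: the encoder pools only the frequency axis,
    so every output frame keeps its own embedding and receives its own score."""
    def __init__(self, n_mels=128, emb_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # halve frequency, keep all frames
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.project = nn.Linear(128 * (n_mels // 4), emb_dim)
        self.head = nn.Linear(2 * emb_dim, 1)  # frame embedding + query embedding

    def forward(self, spec, query_emb):
        # spec: (batch, 1, n_mels, n_frames); query_emb: (batch, emb_dim)
        h = self.encoder(spec)                  # (batch, 128, n_mels//4, n_frames)
        h = h.flatten(1, 2).transpose(1, 2)     # (batch, n_frames, 128 * n_mels//4)
        frames = self.project(h)                # (batch, n_frames, emb_dim)
        q = query_emb.unsqueeze(1).expand_as(frames)
        # Per-frame probability that the frame belongs to the query event.
        return torch.sigmoid(self.head(torch.cat([frames, q], dim=-1))).squeeze(-1)
```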

System characteristics
Data augmentation: SpecAugment, time stretching
System embeddings: False
Subsystem count: False
External data usage: False

FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS

Gelderblom, Femke and Cretois, Benjamin and Johnsen, Pal and Remonato, Filippo and Reinen, Tor Arne

Abstract

Our method for the DCASE Challenge 2023 combines BEATs with Prototypical Networks. BEATs, standing for Bidirectional Encoder representation from Audio Transformers, is a recently released architecture from Microsoft for audio tokenisation and classification. BEATs combines a tokenizer and a semi-supervised audio classifier that learn from each other to improve the classification of audio samples. Prototypical Networks, in turn, can be briefly described as a neural-network-based clustering algorithm. Somewhat resembling K-means clustering, Prototypical Networks classify samples based on their distance from the classes' prototypes (what would be the centroids in a K-means setting). Since the prototypes are constructed from a small set of examples from each class, called the support set, Prototypical Networks are well suited to few-shot learning settings like the DCASE Challenge. In our method, we combine the two by using BEATs as a feature extractor, constructing informative features which the Prototypical Network uses to build the prototypes and classify test audio samples. We obtain an F1 score of 0.36 on the validation dataset.
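
For readers unfamiliar with the prototypical step, the sketch below shows the generic recipe the abstract refers to: average the support embeddings of each class into a prototype and label queries by their nearest prototype. The embeddings could come from any frozen extractor (here they are just assumed tensors); nothing below is taken from the authors' code.

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb):
    """Generic Prototypical Network inference: class prototypes are mean
    support embeddings; queries are scored by squared Euclidean distance."""
    classes = torch.unique(support_labels)
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query_emb, prototypes, p=2) ** 2   # (n_query, n_class)
    probs = torch.softmax(-dists, dim=1)                   # closer -> higher prob
    return classes[probs.argmax(dim=1)], probs

# Example usage with random stand-in embeddings (768-dim, 2 classes x 5 shots):
# preds, probs = prototypical_predict(torch.randn(10, 768),
#                                     torch.tensor([0] * 5 + [1] * 5),
#                                     torch.randn(3, 768))
```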

System characteristics
Data augmentation: time stretching, denoising
System embeddings: BEATS
Subsystem count: False

FEW-SHOT BIOACOUSTIC DETECTION BOOSTING WITH FINE TUNING STRATEGY USING NEGATIVE BASED PROTOTYPICAL LEARNING

Lee, Yuna and Chung, HaeChun and Jung, JaeHoon
KT Corporation, Republic of Korea

Abstract

Few-shot sound event detection has always faced the challenge of detecting bioacoustic sound events with only a few labelled instances of the class of interest. In this technical report, we describe our submission system for DCASE 2023 Task 5: few-shot bioacoustic event detection. We propose a novel framework that trains on audio segments via contrastive learning and prototypical learning, making the network more robust to the variety of acoustic environments, even in unseen domains. In addition, a fine-tuning strategy based on novel loss functions is introduced. Our final system achieves an F-measure of 83.08 on the DCASE Task 5 validation set, outperforming the baseline performance and last year's first place by a large margin.
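
One way to picture negative-based prototypical learning is a two-prototype episode: one prototype built from the positive shots, one from sampled negative segments, with a cross-entropy over the (negative squared) distances. The sketch below is a generic version under those assumptions, not the submission's actual loss functions.

```python
import torch
import torch.nn.functional as F

def neg_proto_loss(pos_emb, neg_emb, query_emb, query_is_pos):
    """Two-prototype episode loss: prototype 0 from negative segments,
    prototype 1 from positive shots; queries are classified by distance."""
    protos = torch.stack([neg_emb.mean(dim=0), pos_emb.mean(dim=0)])  # (2, dim)
    logits = -torch.cdist(query_emb, protos) ** 2                     # (n_query, 2)
    return F.cross_entropy(logits, query_is_pos.long())
```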

System characteristics
System embeddings: False
Subsystem count: False
External data usage: False

SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINING BIOACOUSTIC FEW SHOT SYSTEMS

Moummad, Ilyass and Serizel, Romain and Farrugia, Nicolas
IMT Atlantique, UMR CNRS, and University of Lorraine, CNRS, France

Abstract

We show in this work that learning a rich feature extractor from scratch using only the official training data is feasible. We achieve this by learning representations with a supervised contrastive learning framework. We then transfer the learned feature extractor to the validation and test sets for few-shot evaluation. For few-shot validation, we simply train a linear classifier on the negative and positive shots and obtain an F-score of 63.46%, outperforming the baseline by a large margin. We do not use any external data or pretrained models. Our approach does not require choosing a prediction threshold or any post-processing technique.
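
The supervised contrastive pre-training stage can be sketched as the standard SupCon objective (Khosla et al.); the temperature value and the absence of a projection head below are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.06):
    """Supervised contrastive loss: for each anchor, pull same-label samples
    together and push all others apart in cosine-similarity space."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))        # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of the positives for each anchor that has any.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()
```

In the few-shot phase described above, the frozen extractor's embeddings of the positive and negative shots would then simply be fed to a linear classifier.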

System characteristics
Data augmentation: spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise
System embeddings: False
Subsystem count: False
External data usage: False

FEW-SHOT BIOACOUSTIC EVENT DETECTION

Wilkinghoff, Kevin and Cornaggia-Urrigshardt, Alessia
Fraunhofer FKIE, Wachtberg, Germany

Abstract

This report describes the Fraunhofer FKIE submission for Task 5, "Few-shot Bioacoustic Event Detection", of the DCASE 2023 Challenge. The submitted system is an adaptation of a few-shot keyword spotting system that uses embeddings with a temporal resolution suitable for template matching with dynamic time warping. The embedding model is trained to predict not only the sound event class but also the temporal position of a segment within a sound event, using the angular margin loss TempAdaCos. At inference, embeddings are extracted and segment-wise cosine distances between the recording to be searched and the provided templates are calculated. The resulting cost matrices are processed with a logistic regression model trained to discriminate between positive and negative frames. Lastly, dynamic time warping combined with peak picking and a decision threshold is applied to detect onsets and offsets of bioacoustic events. As a result, the presented system significantly outperforms both baseline systems.
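
Two of the building blocks mentioned above, segment-wise cosine distances and threshold-based peak picking, can be sketched generically as follows; the matrix orientation, score convention, and peak-picking parameters are assumptions, and the DTW and logistic regression stages are omitted.

```python
import numpy as np
from scipy.signal import find_peaks

def cosine_cost_matrix(template_emb, recording_emb):
    """Segment-wise cosine distances between a template's embedding sequence
    and a long recording's embedding sequence (rows: template segments,
    columns: recording segments)."""
    t = template_emb / np.linalg.norm(template_emb, axis=1, keepdims=True)
    r = recording_emb / np.linalg.norm(recording_emb, axis=1, keepdims=True)
    return 1.0 - t @ r.T

def pick_onsets(frame_scores, threshold=0.5, min_gap=5):
    """Peak picking over per-frame detection scores with a decision threshold,
    mirroring the last post-processing step described above."""
    peaks, _ = find_peaks(frame_scores, height=threshold, distance=min_gap)
    return peaks
```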

System characteristics
Data augmentation: SpecAugment, Mixup
System embeddings: False
Subsystem count: False
External data usage: False

SE-PROTONET: PROTOTYPICAL NETWORK WITH SQUEEZE-AND-EXCITATION BLOCKS FOR BIOACOUSTIC EVENT DETECTION

Liu, Junyan and Zhou, Zikai and Sun, Mengkai and Xu, Kele and Qian, Kun and Hu, Bian
National University of Defence Technology; Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Beijing Institute of Technology; School of Medical Technology, Beijing Institute of Technology, China

Abstract

In this technical report, we describe our submission system for DCASE 2023 Task 5: few-shot bioacoustic event detection. We propose a metric learning method to construct a novel prototypical network based on adaptive segment-level learning and Squeeze-and-Excitation (SE) blocks. We make better use of the negative data, which can be used to construct the loss function and provides much more semantic information. Most importantly, we propose to use SE blocks to adaptively recalibrate the channel-wise feature response by explicitly modelling interdependencies between channels, which improves the F-measure to 63.94%. As input features, we use a combination of per-channel energy normalization (PCEN) and delta mel-frequency cepstral coefficients (ΔMFCC). Our system performs better than the official baseline on the DCASE Task 5 validation set. Our final score reaches an F-measure of 65.49%, outperforming the baseline performance by 30.18%.
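
The Squeeze-and-Excitation block referred to above is a standard component (Hu et al., 2018); a minimal PyTorch version is shown below, with the usual reduction ratio of 16 as an assumed setting rather than the authors' choice.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: globally pool each channel, pass the channel
    statistics through a small bottleneck MLP, and rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                           # excitation: recalibrate channels
```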

System characteristics
System embeddings: False
Subsystem count: False
External data usage: False