Task description
A more detailed task description can be found on the task description page.
Systems ranking
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Validation dataset) |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 14.9 (14.0 - 15.3) | 3.4 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 2.92 ( 2.32 - 3.08 ) | |||
Moummad_IMT_task5_3 | BRAIn_LORIA_S3 | Moummad2023 | 38.3 (37.9 - 38.7) | 62.8 | |
Moummad_IMT_task5_4 | BRAIn_LORIA_S4 | Moummad2023 | 34.4 (33.9 - 34.8) | 58.3 | |
Moummad_IMT_task5_1 | BRAIn_LORIA_S1 | Moummad2023 | 35.6 (35.3 - 36.0) | 62.3 | |
Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 63.5 | |
Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 36.6 | |
Gelderblom_SINTEF_task5_1 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 23.4 (22.9 - 23.8) | 36.6 | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 63.9 | |
XuQianHu_NUDT_BIT_task5_1 | XuQianHu_DYXS_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 65.5 | |
XuQianHu_NUDT_BIT_task5_2 | XuQianHu_DYXS_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 63.1 | |
XuQianHu_NUDT_BIT_task5_4 | XuQianHu_DYXS_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 62.1 | |
Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 62.6 | |
Wilkinghoff_FKIE_task5_1 | FKIE system 1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 63.8 | |
Wilkinghoff_FKIE_task5_3 | FKIE system 3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 65.5 | |
Wilkinghoff_FKIE_task5_2 | FKIE system 2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 63.7 | |
Jung_KT_task5_4 | Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 83.1 | |
Jung_KT_task5_2 | Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 81.7 | |
Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 81.5 | |
Jung_KT_task5_1 | Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 79.8 | |
Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 75.6 | |
Du_NERCSLIP_task5_1 | Multi-task Frame-level embedding learning 1 | Du2023 | 61.2 (60.7 - 61.6) | 74.1 | |
Du_NERCSLIP_task5_3 | Multi-task Frame-level embedding learning 3 | Du2023 | 63.6 (63.2 - 64.0) | 76.4 | |
Du_NERCSLIP_task5_4 | Multi-task Frame-level embedding learning 4 | Du2023 | 63.6 (63.2 - 64.0) | 76.4 |
Dataset wise metrics
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (CHE dataset) | Event-based F-score (CT dataset) | Event-based F-score (MGE dataset) | Event-based F-score (MS dataset) | Event-based F-score (QU dataset) | Event-based F-score (DC dataset) | Event-based F-score (CHE23 dataset) | Event-based F-score (CW dataset) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 14.9 (14.0 - 15.3) | 21.1 | 7.2 | 44.1 | 8.0 | 9.7 | 34.9 | 36.1 | 44.2 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 2.92 ( 2.32 - 3.08 ) | 31.5 | 14.0 | 8.1 | 27.1 | 0.4 | 37.8 | 36.1 | 44.2 | ||
Moummad_IMT_task5_3 | BRAIn_LORIA_S3 | Moummad2023 | 38.3 (37.9 - 38.7) | 60.0 | 43.9 | 41.5 | 71.1 | 13.6 | 40.7 | 83.7 | 70.7 | |
Moummad_IMT_task5_4 | BRAIn_LORIA_S4 | Moummad2023 | 34.4 (33.9 - 34.8) | 61.4 | 37.6 | 42.0 | 63.6 | 10.9 | 41.4 | 80.9 | 70.1 | |
Moummad_IMT_task5_1 | BRAIn_LORIA_S1 | Moummad2023 | 35.6 (35.3 - 36.0) | 52.0 | 39.7 | 66.5 | 31.4 | 13.7 | 40.9 | 81.5 | 62.6 | |
Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 60.3 | 36.2 | 61.3 | 67.8 | 17.7 | 41.5 | 83.2 | 72.3 | |
Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 58.3 | 15.4 | 70.9 | 58.0 | 12.5 | 39.8 | 58.4 | 66.3 | |
Gelderblom_SINTEF_task5_1 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 23.4 (22.9 - 23.8) | 52.0 | 14.1 | 58.5 | 56.3 | 6.6 | 40.7 | 64.2 | 57.3 | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 75.0 | 27.3 | 42.6 | 38.7 | 31.0 | 43.8 | 64.2 | 78.0 | |
XuQianHu_NUDT_BIT_task5_1 | XuQianHu_DYXS_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 36.4 | 21.9 | 62.5 | 38.8 | 5.5 | 40.1 | 58.3 | 59.9 | |
XuQianHu_NUDT_BIT_task5_2 | XuQianHu_DYXS_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 28.9 | 26.1 | 60.8 | 40.3 | 18.6 | 39.9 | 54.8 | 59.9 | |
XuQianHu_NUDT_BIT_task5_4 | XuQianHu_DYXS_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 50.3 | 31.5 | 69.9 | 29.2 | 5.5 | 34.9 | 64.3 | 32.5 | |
Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 30.9 | 43.2 | 41.9 | 15.1 | 3.9 | 34.9 | 31.5 | 59.4 | |
Wilkinghoff_FKIE_task5_1 | FKIE system 1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 29.7 | 40.5 | 32.3 | 15.2 | 1.9 | 32.4 | 26.6 | 63.1 | |
Wilkinghoff_FKIE_task5_3 | FKIE system 3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 31.2 | 39.2 | 25.4 | 9.9 | 1.9 | 29.9 | 29.4 | 73.3 | |
Wilkinghoff_FKIE_task5_2 | FKIE system 2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 31.9 | 38.1 | 31.6 | 12.5 | 1.9 | 31.5 | 34.6 | 73.1 | |
Jung_KT_task5_4 | Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 30.2 | 14.4 | 47.2 | 24.8 | 17.9 | 37.7 | 30.0 | 62.1 | |
Jung_KT_task5_2 | Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 46.3 | 10.9 | 4.1 | 25.7 | 29.6 | 38.1 | 48.9 | 77.3 | |
Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 32.2 | 14.3 | 20.8 | 26.7 | 25.8 | 47.7 | 41.2 | 58.8 | |
Jung_KT_task5_1 | Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 47.5 | 10.4 | 3.2 | 29.4 | 31.1 | 44.1 | 47.3 | 78.1 | |
Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 85.7 | 53.8 | 93.1 | 72.7 | 42.7 | 62.4 | 69.5 | 44.2 | |
Du_NERCSLIP_task5_1 | Multi-task Frame-level embedding learning 1 | Du2023 | 61.2 (60.7 - 61.6) | 85.7 | 53.8 | 93.1 | 72.7 | 38.5 | 53.7 | 69.1 | 71.3 | |
Du_NERCSLIP_task5_3 | Multi-task Frame-level embedding learning 3 | Du2023 | 63.6 (63.2 - 64.0) | 74.9 | 54.5 | 95.2 | 68.3 | 43.7 | 66.3 | 71.1 | 59.3 | |
Du_NERCSLIP_task5_4 | Multi-task Frame-level embedding learning 4 | Du2023 | 63.6 (63.2 - 64.0) | 86.0 | 51.4 | 95.2 | 75.0 | 43.7 | 60.8 | 69.1 | 72.3 |
Teams ranking
Table including only the best performing system per submitting team.
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset) |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 14.9 (14.0 - 15.3) | 3.4 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 2.92 ( 2.32 - 3.08 ) | |||
Moummad_IMT_task5_2 | BRAIn_LORIA_S2 | Moummad2023 | 42.7 (42.2 - 43.1) | 63.5 | |
Gelderblom_SINTEF_task5_2 | FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS | Gelderblom2023 | 31.1 (30.5 - 31.6) | 36.6 | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu_DYXS_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 63.9 | |
Wilkinghoff_FKIE_task5_4 | FKIE system 4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 62.6 | |
Jung_KT_task5_3 | Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 81.5 | |
Du_NERCSLIP_task5_2 | Multi-task Frame-level embedding learning 2 | Du2023 | 63.8 (63.3 - 64.2) | 75.6 |
System characteristics
General characteristics
Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features |
---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 14.9 (14.0 - 15.3) | any | spectrogram | |||
Baseline_PROTO_task5_1 | 2.92 ( 2.32 - 3.08 ) | 22.05 kHz | PCEN | |||
Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram | |
Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram | |
Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram | |
Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | 22.05 kHz | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise | mel spectrogram | |
Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | 16 kHz | time stretching, denoising | mel spectrogram | |
Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | 16 kHz | time stretching, denoising | mel spectrogram | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 22.05 kHz | delta MFCC and PCEN | ||
XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 22.05 kHz | delta MFCC and PCEN | ||
XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 22.05 kHz | delta MFCC and PCEN | ||
XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 22.05 kHz | delta MFCC and PCEN | ||
Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 22.05 kHz | SpecAugment, Mixup | PCEN | |
Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 22.05 kHz | SpecAugment, Mixup | PCEN | |
Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 22.05 kHz | SpecAugment, Mixup | PCEN | |
Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 22.05 kHz | SpecAugment, Mixup | PCEN | |
Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | 22.05 kHz | delta MFCC, PCEN | ||
Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | 22.05 kHz | delta MFCC, PCEN | ||
Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | 22.05 kHz | delta MFCC, PCEN | ||
Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | 22.05 kHz | delta MFCC, PCEN | ||
Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | 22.05 kHz | SpecAugment | PCEN | |
Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | 22.05 kHz | SpecAugment | PCEN | |
Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | 22.05 kHz | SpecAugment, time stretching | PCEN | |
Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | 22.05 kHz | SpecAugment, time stretching | PCEN |
Machine learning characteristics
Rank | Code | Technical Report | Event-based F-score (Eval) | Classifier | Few-shot approach | Post-processing |
---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 14.9 (14.0 - 15.3) | template matching | template matching | peak picking, threshold | ||
Baseline_PROTO_task5_1 | 2.92 ( 2.32 - 3.08 ) | ResNet | prototypical | threshold | ||
Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | CNN | softmax binary classifier + finetuning two last layers | ||
Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | CNN | softmax binary classifier + finetuning whole model | ||
Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | CNN | softmax binary classifier | ||
Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | CNN | softmax binary classifier + finetuning last layers | ||
Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | transformer | prototypical | threshold | |
Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | transformer | prototypical | threshold | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | CNN | prototypical | threshold (0.3) | |
XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | CNN | prototypical | threshold (0.1) | |
XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | CNN | prototypical | threshold (0.05) | |
XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | CNN | prototypical | threshold (0.15) | |
Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold | |
Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold | |
Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold | |
Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | CNN, K-Means, Logistic Regression | TempArcFace, DTW | peak picking, threshold | |
Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | CNN | prototypical, contrastive, fine tuning, Transductive Inference | threshold, filter by length, remove long | |
Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | CNN | prototypical, contrastive, fine tuning, Transductive Inference | threshold, filter by length, remove long | |
Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | CNN | prototypical, contrastive, fine tuning, Transductive Inference | threshold, filter by length, remove long | |
Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | CNN | prototypical, contrastive, fine tuning, Transductive Inference | threshold, filter by length, remove long | |
Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | CNN | finetuning | peak picking, threshold | |
Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | CNN | finetuning | peak picking, threshold | |
Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | CNN | finetuning | peak picking, threshold | |
Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | CNN | finetuning | peak picking, threshold |
Complexity
Rank | Code | Technical Report | Event-based F-score (Eval) | Model complexity | Training time |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 14.9 (14.0 - 15.3) | ||||
Baseline_PROTO_task5_1 | 2.92 ( 2.32 - 3.08 ) | ||||
Moummad_IMT_task5_3 | Moummad2023 | 38.3 (37.9 - 38.7) | 7.2 M | 50 min | |
Moummad_IMT_task5_4 | Moummad2023 | 34.4 (33.9 - 34.8) | 7.2 M | 60 min | |
Moummad_IMT_task5_1 | Moummad2023 | 35.6 (35.3 - 36.0) | 7.2 M | 25 min | |
Moummad_IMT_task5_2 | Moummad2023 | 42.7 (42.2 - 43.1) | 7.2 M | 40 min | |
Gelderblom_SINTEF_task5_2 | Gelderblom2023 | 31.1 (30.5 - 31.6) | 90 M | 2 hours | |
Gelderblom_SINTEF_task5_1 | Gelderblom2023 | 23.4 (22.9 - 23.8) | 90 M | 2 hours | |
XuQianHu_NUDT_BIT_task5_3 | XuQianHu2023 | 42.5 (41.8 - 43.0) | 724 K | 3 hours (GTX 1080ti) | |
XuQianHu_NUDT_BIT_task5_1 | XuQianHu2023 | 21.7 (21.1 - 22.1) | 724 K | 3 hours (GTX 1080ti) | |
XuQianHu_NUDT_BIT_task5_2 | XuQianHu2023 | 34.1 (33.6 - 34.4) | 724 K | 3 hours (GTX 1080ti) | |
XuQianHu_NUDT_BIT_task5_4 | XuQianHu2023 | 21.7 (21.2 - 22.1) | 724 K | 3 hours (GTX 1080ti) | |
Wilkinghoff_FKIE_task5_4 | Wilkinghoff2023 | 16.0 (15.5 - 16.4) | 3.1 M | 12 hours (Quadro RTX 5000) | |
Wilkinghoff_FKIE_task5_1 | Wilkinghoff2023 | 10.1 (9.6 - 10.5) | 3.1 M | 12 hours (Quadro RTX 5000) | |
Wilkinghoff_FKIE_task5_3 | Wilkinghoff2023 | 9.4 (8.9 - 9.8) | 3.1 M | 12 hours (Quadro RTX 5000) | |
Wilkinghoff_FKIE_task5_2 | Wilkinghoff2023 | 9.9 (9.3 - 10.2) | 3.1 M | 12 hours (Quadro RTX 5000) | |
Jung_KT_task5_4 | Jung2023 | 26.3 (25.8 - 26.8) | |||
Jung_KT_task5_2 | Jung2023 | 15.7 (14.8 - 16.3) | |||
Jung_KT_task5_3 | Jung2023 | 27.1 (26.5 - 27.6) | |||
Jung_KT_task5_1 | Jung2023 | 14.1 (13.2 - 14.6) | |||
Du_NERCSLIP_task5_2 | Du2023 | 63.8 (63.3 - 64.2) | 21460630 | 1 hour (TeslaV100-32GB) | |
Du_NERCSLIP_task5_1 | Du2023 | 61.2 (60.7 - 61.6) | 21460630 | 1 hour (TeslaV100-32GB) | |
Du_NERCSLIP_task5_3 | Du2023 | 63.6 (63.2 - 64.0) | 21460630 | 3 hours (TeslaV100-32GB) | |
Du_NERCSLIP_task5_4 | Du2023 | 63.6 (63.2 - 64.0) | 21460630 | 3 hours (TeslaV100-32GB) |
Technical reports
MULTI-TASK FRAME LEVEL SYSTEM FOR FEW-SHOT BIOACOUSTIC EVENT DETECTION
Yan,Genwei and Wang,Ruoyu and Zou,Liang and Du,Jun and Wang,Qing and Gao,Tian and Fang,Xin
China University of Mining and Technology and University of Science and Technology of China and iFLYTEK Research Institute
Du_NERCSLIP_task5_1 Du_NERCSLIP_task5_2 Du_NERCSLIP_task5_3 Du_NERCSLIP_task5_4
Abstract
This technical report describes our new frame-level embedding learning system for DCASE2023 Task5: few-shot bioacoustic event detection. In previous work, we proposed the frame-level embedding learning system and achieved the best performance in DCASE 2022 Task5. In this work, we apply several techniques to improve upon that system. Additionally, we introduce multi-task learning and Target Speaker Voice Activity Detection (TS-VAD) strategies to transform our previous system into a new multi-task frame-level embedding learning system. Compared to our previous work, the new system achieves a better result (F-measure 75.74%, No ML) on the official validation set.
System characteristics
Data augmentation | SpecAugment, time stretching |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION USING BEATS
Gelderblom, Femke and Cretois,Benjamin and Johnsen,Pal and Remonato,Filippo and Reinen,Tor Arne
DCASE2023_Gelderblom_2
Abstract
Our method for the DCASE Challenge 2023 combines BEATs with Prototypical Networks. BEATs, standing for Bidirectional Encoder representation from Audio Transformers, is a newly-released architecture by Microsoft for audio tokenisation and classification. BEATs combines a tokenizer and a semi-supervised audio classifier which learn from each other to improve the classification of audio samples. Prototypical Networks, instead, can be briefly described as a neural network-based clustering algorithm. Somewhat resembling a K-means clustering, Prototypical Networks classify samples based on their distance from the classes’ prototypes (what would be the centroids in a K-means setting). Since the prototypes are constructed from a small set of examples from each class, called the support set, Prototypical Networks are well suited to handle few-shot learning settings like the DCASE Challenge. In our method, we combine the two by using BEATs as a feature extractor, constructing informative features which are used by the Prototypical Network to perform the prototypes’ construction and subsequent classification of test audio samples. We obtain an F1 score of 0.36 on the validation dataset.
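As a rough illustration of the prototype idea described in this abstract, the sketch below averages the support embeddings of each class into one prototype and labels a query by its nearest prototype. It is plain Python and not the authors' implementation: the function names, the squared Euclidean distance, the "pos"/"neg" labels, and the toy 2-D vectors (standing in for BEATs features) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into a single prototype."""
    grouped = defaultdict(list)
    for embedding, label in zip(support_embeddings, support_labels):
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label of the prototype closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy support set: two "neg" (background) and two "pos" (event) embeddings.
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
labels = ["neg", "neg", "pos", "pos"]
prototypes = build_prototypes(support, labels)

predictions = [classify(q, prototypes) for q in ([0.1, 0.1], [0.9, 1.0])]
```

In a detection setting, each query would be the embedding of one audio segment, so the sequence of predicted labels marks where the event of interest occurs.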
System characteristics
Data augmentation | time stretching, denoising |
System embeddings | BEATS |
Subsystem count | False |
FEW-SHOT BIOACOUSTIC DETECTION BOOSTING WITH FINE TUNING STRATEGY USING NEGATIVE BASED PROTOTYPICAL LEARNING
Lee, Yuna and Chung, HaeChun, and Jung, JaeHoon
KT Corporation, Republic of Korea
Jung_KT_task5_1 Jung_KT_task5_2 Jung_KT_task5_3 Jung_KT_task5_4
Abstract
Few-shot sound event detection has always faced the challenge of detecting bioacoustic sound events with only a few labelled instances of the class of interest. In this technical report, we describe our submission system for DCASE2023 Task5: few-shot bioacoustic event detection. We propose a novel framework of training audio segments via contrastive learning and prototypical learning, making the network more robust to the variety of acoustic environments, even in unseen domains. In addition, a finetuning strategy based on the novel loss functions is introduced. Our final system achieves an f-measure of 83.08 on the DCASE task 5 validation set, outperforming the baseline performance and last year’s first place by a large margin.
System characteristics
System embeddings | False |
Subsystem count | False |
External data usage | False |
SUPERVISED CONTRASTIVE LEARNING FOR PRE-TRAINING BIOACOUSTIC FEW SHOT SYSTEMS
Moummad, Ilyass and Serizel, Romain and Farrugia, Nicolas
IMT Atlantique, UMR CNRS, University of Lorraine,CNRS, France
Abstract
We show in this work that learning a rich feature extractor from scratch using only official training data is feasible. We achieve this by learning representations using a supervised contrastive learning framework. We then transfer the learned feature extractor to the validation and test sets for few-shot evaluation. For few-shot validation, we simply train a linear classifier on the negative and positive shots and obtain an F-score of 63.46%, outperforming the baseline by a large margin. We don’t use any external data or pretrained model. Our approach doesn’t require choosing a threshold for prediction or any post-processing technique.
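The step of training a linear classifier on the positive and negative shots can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses a logistic (two-class softmax equivalent) classifier trained by plain gradient descent, and the function names, learning rate, epoch count, and toy 2-D embeddings (standing in for outputs of the frozen feature extractor) are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability that embedding x belongs to the positive (event) class."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_linear_classifier(features, labels, learning_rate=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(x, weights, bias) - y
            for k, value in enumerate(x):
                grad_w[k] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy shots: label 1 = positive (event), label 0 = negative (background).
shots = [[2.0, 1.0], [1.5, 1.2], [1.8, 0.9], [-1.0, -0.5], [-1.2, -0.8], [-0.9, -1.1]]
shot_labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_linear_classifier(shots, shot_labels)

p_event = predict_proba([1.6, 1.0], weights, bias)
p_background = predict_proba([-1.1, -0.7], weights, bias)
```

Because the classifier outputs a class decision directly (the larger of the two class probabilities), no separately tuned detection threshold is needed, which matches the claim in the abstract.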
System characteristics
Data augmentation | spectrogram mixing, random crop, resized crop, compression, additive white gaussian noise |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION
Wilkinghoff, Kevin and Cornaggia-Urrigshardt, Alessia
Fraunhofer FKIE, Wachtberg, Germany
Wilkinghoff_FKIE_task5_4
Abstract
This report describes the Fraunhofer FKIE submission for task 5 “Few-shot Bioacoustic Event Detection” of the DCASE challenge 2023. The submitted system is an adaptation of a few-shot keyword spotting system that uses embeddings with a temporal resolution suitable for template matching with dynamic time warping. The embedding model is trained to not only predict the sound event class but also the temporal position of a segment in a sound event using the angular margin loss TempAdaCos. At inference, embeddings are extracted and segment-wise cosine distances between the recording to be searched in and the provided templates are calculated. The resulting cost matrices are processed by applying a logistic regression model that is trained to discriminate between positive and negative frames. Lastly, dynamic time warping in combination with peak-picking and using a decision threshold is applied to detect on- and offsets of bioacoustic events. As a result, the presented system significantly outperforms both baseline systems.
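The template-matching step in this abstract (segment-wise cosine distances followed by dynamic time warping) can be sketched roughly as below. This is a simplified illustration, not the FKIE system: it aligns a template against a whole candidate sequence with classic DTW, whereas searching inside a long recording would typically use a subsequence variant, and it omits the logistic regression and peak-picking stages. All function names and the toy 2-D embeddings are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cost_matrix(template, recording):
    """Pairwise cosine distances: rows are template frames, columns recording frames."""
    return [[cosine_distance(t, r) for r in recording] for t in template]


def dtw_cost(cost):
    """Accumulated cost of the best monotonic alignment through the cost matrix."""
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_previous
    return acc[n][m]


# Toy embeddings: the template has two frames pointing in orthogonal directions.
template = [[1.0, 0.0], [0.0, 1.0]]
matching = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # same order, stretched in time
mismatched = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]  # reversed order

cost_match = dtw_cost(cost_matrix(template, matching))
cost_mismatch = dtw_cost(cost_matrix(template, mismatched))
```

A low accumulated cost indicates that the candidate follows the same temporal pattern as the template even when it is stretched, which is why the embeddings need enough temporal resolution for this kind of alignment.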
System characteristics
Data augmentation | SpecAugment, Mixup |
System embeddings | False |
Subsystem count | False |
External data usage | False |
SE-PROTONET: PROTOTYPICAL NETWORK WITH SQUEEZE-AND-EXCITATION BLOCKS FOR BIOACOUSTIC EVENT DETECTION
Liu, Junyan and Zhou,Zikai and Sun,Mengkai and Xu,Kele and Qian,Kun and Hu,Bian
National University of Defence Technology and Key Laboratory of Brain Health Intelligent Evaluation and Intervention,Beijing Institute of Technology and School of Medical Technology,Beijing Institute of Technology, China
XuQianHu_NUDT_BIT_task5_3
Abstract
In this technical report, we describe our submission system for DCASE2023 Task5: Few-shot Bioacoustic Event Detection. We propose a metric learning method to construct a novel prototypical network, based on adaptive segment-level learning and Squeeze-and-Excitation (SE) blocks. We make better use of the negative data, which can be used to construct the loss function and provide much more semantic information. Most importantly, we propose to use SE blocks to adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels, which improves f-measure to 63.94 %. For the input feature, we use a combination of per-channel energy normalization (PCEN) and delta mel-frequency cepstral coefficients (∆MFCC). Our system outperforms the official baseline on the DCASE task 5 validation set. Our final score reaches an f-measure of 65.49 %, outperforming the baseline performance by 30.18 %.
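The channel-wise recalibration performed by an SE block can be sketched in a few lines. This is an illustrative plain-Python version under stated assumptions, not the submitted network: the weights are hand-picked rather than learned, the feature map is a tiny nested list, and all function names are hypothetical.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar descriptor per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(vector, weights, biases):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck MLP (ReLU then sigmoid) producing one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]


def se_block(feature_map, w1, b1, w2, b2):
    """Scale every channel of the feature map by its learned gate."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [[[g * v for v in row] for row in channel] for g, channel in zip(gates, feature_map)]


# Toy feature map with 2 channels of size 2x2, and hand-picked (not learned) weights.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1, b1 = [[1.0, 1.0]], [0.0]            # reduce 2 channels to 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]    # expand back to 2 per-channel gates

recalibrated = se_block(feature_map, w1, b1, w2, b2)
```

With these weights the first channel is kept almost unchanged while the second is strongly attenuated, showing how the gates let the network emphasize some channels over others based on a global summary of all channels.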
System characteristics
System embeddings | False |
Subsystem count | False |
External data usage | False |