Task description
This challenge focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance, and noisy/busy environments.
A more detailed task description can be found on the task description page.
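For reference, the event-based F-score reported below is the harmonic mean of precision and recall over matched events, and 95% confidence intervals of this kind are typically obtained by bootstrap resampling. The following is an illustrative sketch only; the challenge's exact event-matching criteria and resampling scheme may differ:

```python
import numpy as np

def event_f_score(tp, fp, fn):
    """Event-based F-score from matched-event counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(per_file_counts, n_boot=1000, seed=0):
    """95% CI: resample files with replacement, re-pool counts, re-score."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(per_file_counts)  # rows: (tp, fp, fn) per recording
    scores = []
    for _ in range(n_boot):
        sample = counts[rng.integers(0, len(counts), len(counts))]
        tp, fp, fn = sample.sum(axis=0)
        scores.append(event_f_score(tp, fp, fn))
    return np.percentile(scores, [2.5, 97.5])

# Toy per-recording (tp, fp, fn) counts.
counts = [(8, 2, 3), (5, 1, 4), (12, 6, 2)]
tp, fp, fn = np.sum(counts, axis=0)
score = event_f_score(tp, fp, fn)
lo, hi = bootstrap_ci(counts)
```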
Systems ranking
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Validation dataset) |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 34.8 (32.6 - 37.1) | 2.0 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 20.1 (18.2 - 21.9) | 41.5 | ||
Anderson_TCD_task5_1 | Prototypical Network with SpecAugment | Anderson2021 | 35.0 (33.1 - 37.0) | 26.2 | |
Bielecki_SMSNG_task5_1 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 8.4 (7.1 - 9.6) | 52.5 | |
Bielecki_SMSNG_task5_2 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 5.8 (4.9 - 6.7) | 51.8 | |
Bielecki_SMSNG_task5_3 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 8.4 (7.1 - 9.7) | 51.8 | |
Bielecki_SMSNG_task5_4 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 5.3 (4.4 - 6.2) | 51.1 | |
Cheng_BIT_task5_1 | ivector baseline | Cheng2021 | 23.8 (21.9 - 25.7) | 46.3 | |
Cheng_BIT_task5_2 | baseline_5w3s | Cheng2021 | 12.5 (11.0 - 14.1) | 47.8 | |
Cheng_BIT_task5_3 | baseline_5w5s | Cheng2021 | 11.0 (9.4 - 12.6) | 45.0 | |
Cheng_BIT_task5_4 | ivector-tripleloss baseline | Cheng2021 | 8.0 (6.7 - 9.3) | 44.9 | |
Johannsmeier_OVGU_task5_1 | Prototype Segmentation | Johannsmeier2021 | 5.5 (4.7 - 6.4) | 59.8 | |
Johannsmeier_OVGU_task5_2 | Prototype Segmentation | Johannsmeier2021 | 4.5 (3.7 - 5.4) | 56.0 | |
Johannsmeier_OVGU_task5_3 | Prototype Segmentation | Johannsmeier2021 | 15.2 (13.7 - 16.7) | 58.6 | |
Johannsmeier_OVGU_task5_4 | Prototype Segmentation | Johannsmeier2021 | 7.1 (5.9 - 8.3) | 58.8 | |
zhang_uestc_task5_1 | dcase2021-t5 prototypical network | Zhang2021 | 9.0 (7.8 - 10.2) | 52.9 | |
zhang_uestc_task5_2 | dcase2021-t5 prototypical network | Zhang2021 | 8.3 (7.1 - 9.4) | 53.8 | |
zhang_uestc_task5_3 | dcase2021-t5 prototypical network | Zhang2021 | 16.8 (15.5 - 18.2) | 54.4 | |
zhang_uestc_task5_4 | dcase2021-t5 prototypical network | Zhang2021 | 7.2 (6.0 - 8.4) | 57.1 | |
Zou_PKU_task5_1 | TIM | Zou2021 | 33.2 (31.0 - 35.3) | 55.3 | |
Yang_PKU_task5_2 | Contrast learning for few shot learning | Zou2021 | 22.4 (20.7 - 24.1) | 55.3 | |
Zou_PKU_task5_3 | TIM-ML | Zou2021 | 38.4 (36.2 - 40.6) | 55.3 | |
Zou_PKU_task5_4 | TIM-ML2 | Zou2021 | 33.7 (31.7 - 35.8) | 55.3 | |
Tang_SHNU_task5_1 | SHNU1 | Tang2021 | 36.5 (34.0 - 38.9) | 54.7 | |
Tang_SHNU_task5_2 | SHNU2 | Tang2021 | 35.1 (31.7 - 38.4) | 51.7 | |
Tang_SHNU_task5_3 | SHNU3 | Tang2021 | 38.3 (36.1 - 40.5) | 51.4 |
Dataset wise metrics
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (DC dataset) | Event-based F-score (ME dataset) | Event-based F-score (ML dataset) |
---|---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 34.8 (32.6 - 37.1) | 32.2 | 47.0 | 29.5 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 20.1 (18.2 - 21.9) | 8.5 | 72.7 | 55.7 | ||
Anderson_TCD_task5_1 | Prototypical Network with SpecAugment | Anderson2021 | 35.0 (33.1 - 37.0) | 19.9 | 56.6 | 56.8 | |
Bielecki_SMSNG_task5_1 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 8.4 (7.1 - 9.6) | 3.1 | 57.3 | 43.7 | |
Bielecki_SMSNG_task5_2 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 5.8 (4.9 - 6.7) | 2.1 | 74.4 | 32.9 | |
Bielecki_SMSNG_task5_3 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 8.4 (7.1 - 9.7) | 3.1 | 56.3 | 51.4 | |
Bielecki_SMSNG_task5_4 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 5.3 (4.4 - 6.2) | 1.9 | 44.3 | 45.0 | |
Cheng_BIT_task5_1 | ivector baseline | Cheng2021 | 23.8 (21.9 - 25.7) | 10.6 | 53.5 | 78.8 | |
Cheng_BIT_task5_2 | baseline_5w3s | Cheng2021 | 12.5 (11.0 - 14.1) | 4.8 | 80.8 | 57.8 | |
Cheng_BIT_task5_3 | baseline_5w5s | Cheng2021 | 11.0 (9.4 - 12.6) | 4.1 | 75.5 | 56.4 | |
Cheng_BIT_task5_4 | ivector-tripleloss baseline | Cheng2021 | 8.0 (6.7 - 9.3) | 2.9 | 70.5 | 53.1 | |
Johannsmeier_OVGU_task5_1 | Prototype Segmentation | Johannsmeier2021 | 5.5 (4.7 - 6.4) | 2.0 | 51.4 | 37.3 | |
Johannsmeier_OVGU_task5_2 | Prototype Segmentation | Johannsmeier2021 | 4.5 (3.7 - 5.4) | 1.7 | 60.8 | 17.9 | |
Johannsmeier_OVGU_task5_3 | Prototype Segmentation | Johannsmeier2021 | 15.2 (13.7 - 16.7) | 6.5 | 64.3 | 35.8 | |
Johannsmeier_OVGU_task5_4 | Prototype Segmentation | Johannsmeier2021 | 7.1 (5.9 - 8.3) | 2.7 | 61.5 | 29.4 | |
zhang_uestc_task5_1 | dcase2021-t5 prototypical network | Zhang2021 | 9.0 (7.8 - 10.2) | 3.5 | 49.3 | 32.4 | |
zhang_uestc_task5_2 | dcase2021-t5 prototypical network | Zhang2021 | 8.3 (7.1 - 9.4) | 3.4 | 41.8 | 23.9 | |
zhang_uestc_task5_3 | dcase2021-t5 prototypical network | Zhang2021 | 16.8 (15.5 - 18.2) | 8.1 | 45.1 | 29.9 | |
zhang_uestc_task5_4 | dcase2021-t5 prototypical network | Zhang2021 | 7.2 (6.0 - 8.4) | 2.8 | 45.1 | 24.7 | |
Zou_PKU_task5_1 | TIM | Zou2021 | 33.2 (31.0 - 35.3) | 16.1 | 72.7 | 67.9 | |
Yang_PKU_task5_2 | Contrast learning for few shot learning | Zou2021 | 22.4 (20.7 - 24.1) | 10.3 | 61.0 | 49.9 | |
Zou_PKU_task5_3 | TIM-ML | Zou2021 | 38.4 (36.2 - 40.6) | 20.6 | 68.0 | 67.3 | |
Zou_PKU_task5_4 | TIM-ML2 | Zou2021 | 33.7 (31.7 - 35.8) | 17.3 | 62.8 | 66.4 | |
Tang_SHNU_task5_1 | SHNU1 | Tang2021 | 36.5 (34.0 - 38.9) | 22.3 | 48.6 | 59.3 | |
Tang_SHNU_task5_2 | SHNU2 | Tang2021 | 35.1 (31.7 - 38.4) | 25.5 | 31.7 | 67.2 | |
Tang_SHNU_task5_3 | SHNU3 | Tang2021 | 38.3 (36.1 - 40.5) | 25.6 | 61.5 | 43.3 |
Teams ranking
Table including only the best-performing system per submitting team.
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset) |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | Baseline Template Matching | 34.8 (32.6 - 37.1) | 2.0 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 20.1 (18.2 - 21.9) | 41.5 | ||
Anderson_TCD_task5_1 | Prototypical Network with SpecAugment | Anderson2021 | 35.0 (33.1 - 37.0) | 26.2 | |
Bielecki_SMSNG_task5_3 | Prototypical network with knowledge distillation and attention loss | Bielecki2021 | 8.4 (7.1 - 9.7) | 51.8 | |
Cheng_BIT_task5_1 | ivector baseline | Cheng2021 | 23.8 (21.9 - 25.7) | 46.3 | |
Johannsmeier_OVGU_task5_3 | Prototype Segmentation | Johannsmeier2021 | 15.2 (13.7 - 16.7) | 58.6 | |
zhang_uestc_task5_3 | dcase2021-t5 prototypical network | Zhang2021 | 16.8 (15.5 - 18.2) | 54.4 | |
Zou_PKU_task5_3 | TIM-ML | Zou2021 | 38.4 (36.2 - 40.6) | 55.3 | |
Tang_SHNU_task5_3 | SHNU3 | Tang2021 | 38.3 (36.1 - 40.5) | 51.4 |
System characteristics
General characteristics
Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features |
---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 34.8 (32.6 - 37.1) | any | spectrogram | |||
Baseline_PROTO_task5_1 | 20.1 (18.2 - 21.9) | 22.05 KHz | PCEN | |||
Anderson_TCD_task5_1 | Anderson2021 | 35.0 (33.1 - 37.0) | 22.05 KHz | time warping, time masking, frequency masking | PCEN, Mel Spectrogram | |
Bielecki_SMSNG_task5_1 | Bielecki2021 | 8.4 (7.1 - 9.6) | 22.05 KHz | melspectrogram time, frequency masking | melspectrogram | |
Bielecki_SMSNG_task5_2 | Bielecki2021 | 5.8 (4.9 - 6.7) | 22.05 KHz | melspectrogram time, frequency masking | melspectrogram | |
Bielecki_SMSNG_task5_3 | Bielecki2021 | 8.4 (7.1 - 9.7) | 22.05 KHz | melspectrogram time, frequency masking | melspectrogram | |
Bielecki_SMSNG_task5_4 | Bielecki2021 | 5.3 (4.4 - 6.2) | 22.05 KHz | melspectrogram time, frequency masking | melspectrogram | |
Cheng_BIT_task5_1 | Cheng2021 | 23.8 (21.9 - 25.7) | 22.05 KHz | Specaugment | PCEN,i-vector | |
Cheng_BIT_task5_2 | Cheng2021 | 12.5 (11.0 - 14.1) | 22.05 KHz | Specaugment | PCEN | |
Cheng_BIT_task5_3 | Cheng2021 | 11.0 (9.4 - 12.6) | 22.05 KHz | Specaugment | PCEN | |
Cheng_BIT_task5_4 | Cheng2021 | 8.0 (6.7 - 9.3) | 22.05 KHz | Specaugment | PCEN, i-vector | |
Johannsmeier_OVGU_task5_1 | Johannsmeier2021 | 5.5 (4.7 - 6.4) | 22.05 KHz | time stretching, pitch shifting, time shifting | mel energies, PCEN | |
Johannsmeier_OVGU_task5_2 | Johannsmeier2021 | 4.5 (3.7 - 5.4) | 22.05 KHz | time stretching, pitch shifting, time shifting | mel energies, PCEN | |
Johannsmeier_OVGU_task5_3 | Johannsmeier2021 | 15.2 (13.7 - 16.7) | 22.05 KHz | time stretching, pitch shifting, time shifting | mel energies, PCEN | |
Johannsmeier_OVGU_task5_4 | Johannsmeier2021 | 7.1 (5.9 - 8.3) | 22.05 KHz | time stretching, pitch shifting, time shifting | mel energies, PCEN | |
zhang_uestc_task5_1 | Zhang2021 | 9.0 (7.8 - 10.2) | 25.6 KHz | Specaugment | PCEN | |
zhang_uestc_task5_2 | Zhang2021 | 8.3 (7.1 - 9.4) | 25.6 KHz | Specaugment | PCEN | |
zhang_uestc_task5_3 | Zhang2021 | 16.8 (15.5 - 18.2) | 25.6 KHz | Specaugment | PCEN | |
zhang_uestc_task5_4 | Zhang2021 | 7.2 (6.0 - 8.4) | 25.6 KHz | Specaugment | PCEN | |
Zou_PKU_task5_1 | Zou2021 | 33.2 (31.0 - 35.3) | 22.05 KHz | spectrogram | ||
Yang_PKU_task5_2 | Zou2021 | 22.4 (20.7 - 24.1) | 22.05 KHz | spectrogram | ||
Zou_PKU_task5_3 | Zou2021 | 38.4 (36.2 - 40.6) | 22.05 KHz | spectrogram | ||
Zou_PKU_task5_4 | Zou2021 | 33.7 (31.7 - 35.8) | 22.05 KHz | spectrogram | ||
Tang_SHNU_task5_1 | Tang2021 | 36.5 (34.0 - 38.9) | any | Specaugment, inference-time augmentation | PCEN | |
Tang_SHNU_task5_2 | Tang2021 | 35.1 (31.7 - 38.4) | any | Specaugment, inference-time augmentation | PCEN | |
Tang_SHNU_task5_3 | Tang2021 | 38.3 (36.1 - 40.5) | any | Specaugment, inference-time augmentation | PCEN |
Machine learning characteristics
Rank | Code | Technical Report | Event-based F-score (Eval) | Classifier | Few-shot approach | Post-processing |
---|---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 34.8 (32.6 - 37.1) | template matching | template matching | peak picking, threshold | ||
Baseline_PROTO_task5_1 | 20.1 (18.2 - 21.9) | CNN | prototypical | threshold | ||
Anderson_TCD_task5_1 | Anderson2021 | 35.0 (33.1 - 37.0) | CNN | prototypical | probability averaging, median filtering, minimum event length | |
Bielecki_SMSNG_task5_1 | Bielecki2021 | 8.4 (7.1 - 9.6) | CNN | prototypical | minimum time length threshold, prediction frames elongation | |
Bielecki_SMSNG_task5_2 | Bielecki2021 | 5.8 (4.9 - 6.7) | CNN | prototypical | min time length threshold, prediction frames elongation | |
Bielecki_SMSNG_task5_3 | Bielecki2021 | 8.4 (7.1 - 9.7) | CNN | prototypical | min time length threshold, prediction frames elongation | |
Bielecki_SMSNG_task5_4 | Bielecki2021 | 5.3 (4.4 - 6.2) | CNN | prototypical | min time length threshold, prediction frames elongation | |
Cheng_BIT_task5_1 | Cheng2021 | 23.8 (21.9 - 25.7) | CNN | prototypical | threshold | |
Cheng_BIT_task5_2 | Cheng2021 | 12.5 (11.0 - 14.1) | CNN | prototypical | threshold | |
Cheng_BIT_task5_3 | Cheng2021 | 11.0 (9.4 - 12.6) | CNN | prototypical | threshold | |
Cheng_BIT_task5_4 | Cheng2021 | 8.0 (6.7 - 9.3) | CNN | prototypical | threshold | |
Johannsmeier_OVGU_task5_1 | Johannsmeier2021 | 5.5 (4.7 - 6.4) | CNN | prototypical | threshold, gaussian smoothing (adaptive) | |
Johannsmeier_OVGU_task5_2 | Johannsmeier2021 | 4.5 (3.7 - 5.4) | CNN | prototypical | threshold, gaussian smoothing (adaptive) | |
Johannsmeier_OVGU_task5_3 | Johannsmeier2021 | 15.2 (13.7 - 16.7) | CNN | prototypical | threshold, gaussian smoothing (adaptive) | |
Johannsmeier_OVGU_task5_4 | Johannsmeier2021 | 7.1 (5.9 - 8.3) | CNN | prototypical | threshold, gaussian smoothing (adaptive) | |
zhang_uestc_task5_1 | Zhang2021 | 9.0 (7.8 - 10.2) | ResNet | prototypical | threshold | |
zhang_uestc_task5_2 | Zhang2021 | 8.3 (7.1 - 9.4) | ResNet | prototypical | threshold | |
zhang_uestc_task5_3 | Zhang2021 | 16.8 (15.5 - 18.2) | ResNet | prototypical | threshold | |
zhang_uestc_task5_4 | Zhang2021 | 7.2 (6.0 - 8.4) | ResNet | prototypical | threshold | |
Zou_PKU_task5_1 | Zou2021 | 33.2 (31.0 - 35.3) | CNN | Transductive inference | peak picking, threshold | |
Yang_PKU_task5_2 | Zou2021 | 22.4 (20.7 - 24.1) | CNN | Prototypical network | peak picking, threshold | |
Zou_PKU_task5_3 | Zou2021 | 38.4 (36.2 - 40.6) | CNN | Transductive inference | peak picking, threshold | |
Zou_PKU_task5_4 | Zou2021 | 33.7 (31.7 - 35.8) | CNN | Transductive inference | peak picking, threshold | |
Tang_SHNU_task5_1 | Tang2021 | 36.5 (34.0 - 38.9) | CNN | prototypical | peak picking, median filtering | |
Tang_SHNU_task5_2 | Tang2021 | 35.1 (31.7 - 38.4) | CNN | prototypical | peak picking, median filtering | |
Tang_SHNU_task5_3 | Tang2021 | 38.3 (36.1 - 40.5) | ResNet | fine tuning, prototypical | peak picking, median filtering |
Complexity
Rank | Code | Technical Report | Event-based F-score (Eval) | Model complexity | Training time |
---|---|---|---|---|---|
Baseline_TempMatch_task5_1 | 34.8 (32.6 - 37.1) | ||||
Baseline_PROTO_task5_1 | 20.1 (18.2 - 21.9) | ||||
Anderson_TCD_task5_1 | Anderson2021 | 35.0 (33.1 - 37.0) | 132000 | 30m34s (Nvidia V100 (1) Intel Xeon Gold 5122 @ 3.60GHz 32GB RAM) | |
Bielecki_SMSNG_task5_1 | Bielecki2021 | 8.4 (7.1 - 9.6) | 813600 | 3h (Generation) | |
Bielecki_SMSNG_task5_2 | Bielecki2021 | 5.8 (4.9 - 6.7) | 1084200 | 3h (Generation) | |
Bielecki_SMSNG_task5_3 | Bielecki2021 | 8.4 (7.1 - 9.7) | 813600 | 3h (Generation) | |
Bielecki_SMSNG_task5_4 | Bielecki2021 | 5.3 (4.4 - 6.2) | 813600 | 3h (Generation) | |
Cheng_BIT_task5_1 | Cheng2021 | 23.8 (21.9 - 25.7) | 6762757 | 1h | |
Cheng_BIT_task5_2 | Cheng2021 | 12.5 (11.0 - 14.1) | 6762757 | 1h | |
Cheng_BIT_task5_3 | Cheng2021 | 11.0 (9.4 - 12.6) | 6762757 | 1h | |
Cheng_BIT_task5_4 | Cheng2021 | 8.0 (6.7 - 9.3) | 6762757 | 1h | |
Johannsmeier_OVGU_task5_1 | Johannsmeier2021 | 5.5 (4.7 - 6.4) | 389804 | 300 seconds (single NVIDIA Geforce 1080Ti) | |
Johannsmeier_OVGU_task5_2 | Johannsmeier2021 | 4.5 (3.7 - 5.4) | 389804 | 300 seconds (single NVIDIA Geforce 1080Ti) | |
Johannsmeier_OVGU_task5_3 | Johannsmeier2021 | 15.2 (13.7 - 16.7) | 389804 | 300 seconds (single NVIDIA Geforce 1080Ti) | |
Johannsmeier_OVGU_task5_4 | Johannsmeier2021 | 7.1 (5.9 - 8.3) | 1169412 | 900 seconds (single NVIDIA Geforce 1080Ti), 300 seconds (3GPUs parallel training) | |
zhang_uestc_task5_1 | Zhang2021 | 9.0 (7.8 - 10.2) | 2889984 | ||
zhang_uestc_task5_2 | Zhang2021 | 8.3 (7.1 - 9.4) | 2889984 | ||
zhang_uestc_task5_3 | Zhang2021 | 16.8 (15.5 - 18.2) | 2889984 | ||
zhang_uestc_task5_4 | Zhang2021 | 7.2 (6.0 - 8.4) | 2889984 | ||
Zou_PKU_task5_1 | Zou2021 | 33.2 (31.0 - 35.3) | 468627 | 403.5 seconds | |
Yang_PKU_task5_2 | Zou2021 | 22.4 (20.7 - 24.1) | 464531 | 403.5 seconds | |
Zou_PKU_task5_3 | Zou2021 | 38.4 (36.2 - 40.6) | 468627 | 403.5 seconds | |
Zou_PKU_task5_4 | Zou2021 | 33.7 (31.7 - 35.8) | 468627 | 403.5 seconds | |
Tang_SHNU_task5_1 | Tang2021 | 36.5 (34.0 - 38.9) | 2950000 | 1h (GeForce RTX 2080 Ti) | |
Tang_SHNU_task5_2 | Tang2021 | 35.1 (31.7 - 38.4) | 2950000 | 45 min (GeForce RTX 2080 Ti) | |
Tang_SHNU_task5_3 | Tang2021 | 38.3 (36.1 - 40.5) | 4750000 | 45 min (GeForce RTX 2080 Ti) |
Technical reports
Bioacoustic Event Detection with Prototypical Networks and Data Augmentation
Mark Anderson and Naomi Harte
Trinity College Dublin, SIGMEDIA, Dublin, Ireland
Abstract
This report presents deep learning and data augmentation techniques used by a system entered into the Few-Shot Bioacoustic Event Detection task of the DCASE2021 Challenge. The remit was to develop a few-shot learning system for animal (mammal and bird) vocalisations: participants were tasked with developing a method that can extract information from five exemplar vocalisations, or shots, of mammals or birds and detect and classify sounds in field recordings. In the system described in this report, prototypical networks are used to learn a metric space, in which classification is performed by computing the distance of a query point to each class prototype and assigning the class of the nearest one. We describe the architecture of this network, feature extraction methods, and data augmentation performed on the given dataset, and compare our work to the challenge's baseline networks.
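The nearest-prototype classification rule described in the abstract can be sketched in a few lines. This is a generic illustration of prototypical classification on toy 2-D embeddings, not the authors' implementation:

```python
import numpy as np

def prototypes(support, labels):
    """Class prototypes = mean embedding of each class's support examples."""
    classes = np.unique(labels)
    return classes, np.stack([support[labels == c].mean(axis=0) for c in classes])

def classify(queries, protos, classes):
    """Assign each query embedding to the nearest prototype (Euclidean)."""
    d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]

# Toy 2-D embeddings: class 0 clustered near (0, 0), class 1 near (5, 5).
support = np.array([[0.1, 0.0], [0.0, 0.2], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 1, 1])
classes, protos = prototypes(support, labels)
preds = classify(np.array([[0.3, 0.1], [5.0, 5.0]]), protos, classes)  # → [0 1]
```

In the full system, the embeddings come from a trained CNN rather than raw features, but the distance-based decision rule is the same.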
System characteristics
Data augmentation | time warping, time masking, frequency masking |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION WITH PROTOTYPICAL NETWORKS, KNOWLEDGE DISTILLATION AND ATTENTION TRANSFER LOSS
Radoslaw Bielecki
Audio Intelligence, Samsung R&D Institute, Warsaw, Poland
Bielecki_SMSNG_task5_1 Bielecki_SMSNG_task5_2 Bielecki_SMSNG_task5_3 Bielecki_SMSNG_task5_4
Abstract
The report presents the results of our submission to Task 5 (Few-shot Bioacoustic Event Detection) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge. This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalizations; its main difficulty is the very limited number of training instances. The presented approach is based on prototypical networks built from convolutional layers. The main techniques used during model development are knowledge distillation, attention transfer loss, and spectrogram augmentation. The best of the presented models achieved a 55.5% F-measure on the challenge validation set, an improvement of more than 10 percentage points over the baseline model.
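Of the techniques named above, knowledge distillation can be illustrated generically as training a student against temperature-softened teacher outputs. This sketch is not the authors' exact formulation (their attention transfer loss is omitted), just the standard distillation term:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    return T * T * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
```

The loss is zero when the student matches the teacher exactly and positive otherwise; in practice it is combined with the usual hard-label loss.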
System characteristics
Data augmentation | melspectrogram time masking, frequency masking |
System embeddings | False |
Subsystem count | False |
External data usage | directly as additional training data |
PROTOTYPICAL NETWORK FOR BIOACOUSTIC EVENT DETECTION VIA I-VECTORS
Hao Cheng
Beijing Institute of Technology, School Of Information And Electronics, Beijing, China
Cheng_BIT_task5_1 Cheng_BIT_task5_2 Cheng_BIT_task5_3 Cheng_BIT_task5_4
Abstract
In this technical report, we present our system for Task 5 of the Detection and Classification of Acoustic Scenes and Events 2021 (DCASE2021) challenge, i.e. few-shot bioacoustic event detection. First, per-channel energy normalization (PCEN) features and i-vectors are extracted. To improve the diversity of the original audio, data augmentation methods such as SpecAugment are adopted. A prototypical network with convolutional neural networks (CNN) is then used for few-shot detection, taking the aforementioned features as inputs to train the CNN model. We evaluate the proposed systems with the overall F-measure over the whole evaluation set; our best F-measure score on the validation set is 46.28.
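PCEN, used by this and several other entries, replaces log compression with per-channel adaptive gain control followed by root compression. A simplified sketch of the standard PCEN recurrence follows; the parameter values are illustrative defaults, not necessarily those used by the authors (libraries such as librosa provide an equivalent `librosa.pcen`):

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalisation of a (freq, time) energy spectrogram.

    M[:, t] is an IIR-smoothed version of each frequency channel's energy;
    dividing by (eps + M)**alpha acts as automatic gain control that
    suppresses slowly varying background, and the final
    (x + delta)**r - delta**r stage applies root compression.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because the smoother tracks each channel's own background level, stationary noise is flattened while onsets of vocalisations stand out, which is why PCEN is popular for bioacoustic recordings.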
System characteristics
Data augmentation | Specaugment |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION VIA SEGMENTATION USING PROTOTYPICAL NETWORKS
Jens Johannsmeier and Sebastian Stober
Otto-von-Guericke-Universität Magdeburg, Faculty of Computer Science, Magdeburg, Germany
Johannsmeier_OVGU_task5_1 Johannsmeier_OVGU_task5_2 Johannsmeier_OVGU_task5_3 Johannsmeier_OVGU_task5_4
Abstract
This report describes our submission to Task 5 of the 2021 DCASE challenge. We detail how we processed the data, the model structure, and the training procedure. We may submit an extended version to the DCASE 2021 workshop.
System characteristics
Data augmentation | time stretching, pitch shifting, time shifting |
System embeddings | False |
Subsystem count | False |
External data usage | False |
TWO IMPROVED ARCHITECTURES BASED ON PROTOTYPE NETWORK FOR FEW-SHOT BIOACOUSTIC EVENT DETECTION
Tiantian Tang and Yunhao Liang and Yanhua Long
Shanghai Normal University, The College of Information, Mechanical and Electrical Engineering, Shanghai, China
Tang_SHNU_task5_1 Tang_SHNU_task5_2 Tang_SHNU_task5_3
Abstract
In this technical report, we describe our submission system for DCASE2021 Task 5: few-shot bioacoustic event detection. Several improvements over the deep-learning prototypical network baseline are investigated, including an N-way 5-shot classification prototypical network training strategy, data augmentation techniques, and the proposed embedding propagation and attention similarity approaches. We demonstrate that the proposed method achieves an overall F-measure score of 54.7% on the official validation set.
System characteristics
Data augmentation | Specaugment, inference-time augmentation |
System embeddings | False |
Subsystem count | 5 |
External data usage | AudioSet |
FEW-SHOT BIOACOUSTIC EVENT DETECTION USING PROTOTYPICAL NETWORK WITH BACKGROUND CLASS
Yue Zhang and Jun Wang and Dawei Zhang and Feng Deng
University of Electronic Science and Technology of China, Chengdu, China
zhang_uestc_task5_1 zhang_uestc_task5_2 zhang_uestc_task5_3 zhang_uestc_task5_4
Abstract
Few-shot bioacoustic event detection is the task of detecting and classifying bioacoustic events given only a few instances. The task was first introduced in DCASE2021 Task 5, which requires participants to create a method that can extract information from five sample sounds (shots) of mammals or birds and detect sounds in field recordings. In this paper, a prototypical network-based method is proposed for the few-shot bioacoustic event detection challenge. To detect the target event in the query sequence, we need to distinguish the target event, other events, and background noise with only a small support set. To solve this problem, we propose to sample background noise from the training dataset as a "NEG" class for few-shot learning. To better distinguish between events and background noise, the "NEG" class is used as a "way" in each training episode. Experimental results show that the proposed method can effectively distinguish target events from background noise. The F-measure of sound event detection (SED) on the DCASE2021 Task 5 dataset reaches 57.10%, which is higher than the baseline method (41.48%).
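The "NEG as an extra way" idea can be illustrated with a hypothetical episode sampler. All names and the data layout here are assumptions for illustration, not the authors' code:

```python
import random

def make_episode(event_segments, neg_segments, n_way=2, k_shot=5, q=2, seed=0):
    """Build an (n_way + 1)-way episode where background 'NEG' is its own way.

    event_segments: {class_name: [segment, ...]} of labelled event clips;
    neg_segments: background clips sampled from the training recordings.
    Returns (support, query) lists of (segment, label) pairs.
    """
    rng = random.Random(seed)
    ways = rng.sample(sorted(event_segments), n_way) + ["NEG"]
    pools = dict(event_segments, NEG=neg_segments)
    support, query = [], []
    for label in ways:
        picks = rng.sample(pools[label], k_shot + q)
        support += [(seg, label) for seg in picks[:k_shot]]
        query += [(seg, label) for seg in picks[k_shot:]]
    return support, query

# Hypothetical usage with integer stand-ins for audio segments.
events = {"meerkat_call": list(range(10)), "song_a": list(range(10, 20))}
support, query = make_episode(events, list(range(100, 110)), n_way=2, k_shot=3, q=2)
```

Treating background as a regular way means the network learns a prototype for noise as well, so query frames far from every event prototype are not forced into an event class.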
System characteristics
Data augmentation | Specaugment |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION = A GOOD TRANSDUCTIVE INFERENCE IS ALL YOU NEED
Dongchao Yang and Helin Wang and Zhongjie Ye and Yuexian Zou
Peking University, School of ECE, Shenzhen, China
Abstract
In this technical report, we describe our few-shot bioacoustic event detection methods submitted to the Detection and Classification of Acoustic Scenes and Events Challenge 2021 Task 5. We analyze why prototypical networks do not perform well and propose to use transductive inference for few-shot learning. Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task, in conjunction with a supervision loss based on the support set. Furthermore, we propose a mutual learning framework, which makes the feature extractor and classifier help each other. Experimental results indicate that our transductive inference method achieves better performance than the baseline, with an F1 score of about 50.8% on the evaluation set. Furthermore, our mutual learning framework brings about a 5% improvement over the transductive inference method. We will release our code at https://github.com/yangdongchao/DCASE2021Task5.
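The mutual-information term described above decomposes as the entropy of the marginal label distribution over queries minus the mean per-query prediction entropy: it is high when individual predictions are confident but the class assignments are balanced overall. A minimal sketch of this quantity (the full transductive method also includes a support-set cross-entropy term and gradient-based optimisation, omitted here):

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def tim_objective(query_probs):
    """Mutual information between queries and predicted labels.

    query_probs: (n_queries, n_classes) softmax predictions.
    I = H(marginal over queries) - mean per-query entropy.
    Transductive inference maximises this over the query set.
    """
    marginal = query_probs.mean(axis=0)
    return entropy(marginal) - entropy(query_probs, axis=1).mean()

# Confident, balanced predictions give high mutual information ...
confident = np.array([[0.99, 0.01], [0.01, 0.99]])
# ... while uniform predictions give mutual information near zero.
uniform = np.full((2, 2), 0.5)
```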
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |