Task description
This challenge focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants are expected to create a method that extracts information from five exemplar vocalisations (shots) of mammals or birds and then detects and classifies sounds in field recordings. The main objective is to find reliable algorithms capable of dealing with data sparsity, class imbalance, and noisy/busy environments.
A more detailed task description can be found on the task description page.
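Systems in the tables below are ranked by event-based F-score. As a rough illustration only (not the official scorer, and the matching criterion here — an onset tolerance in seconds — is an assumption for the sketch), predicted events can be greedily matched to reference events and scored like this:

```python
def event_f_score(pred, ref, tol=0.5):
    """Toy event-based F-score. `pred` and `ref` are lists of
    (onset, offset) tuples in seconds. A prediction counts as a true
    positive if its onset lies within `tol` seconds of a still-unmatched
    reference onset (greedy, first-come matching)."""
    matched = set()
    tp = 0
    for p_on, _ in pred:
        for i, (r_on, _) in enumerate(ref):
            if i not in matched and abs(p_on - r_on) <= tol:
                matched.add(i)
                tp += 1
                break
    fp = len(pred) - tp  # predictions with no matching reference event
    fn = len(ref) - tp   # reference events never matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, one of two predictions matching one of two reference events gives precision = recall = 0.5 and therefore an F-score of 0.5.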
Systems ranking
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Validation dataset)
---|---|---|---|---|---
Baseline_TempMatch_task5_1 | Baseline Template Matching | 12.3 (11.5 - 12.8) | 3.4 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 5.3 ( - ) | |||
Wu_SHNU_task5_1 | Continual_learning | Wu2022 | 40.9 (40.5 - 41.3) | 53.9 | |
Zhang_CQU_task5_1 | Zhang_CQU_task5_1 | Zhang2022 | 1.2 (0.9 - 1.3) | 46.5 | |
Zhang_CQU_task5_2 | Zhang_CQU_task5_2 | Zhang2022 | 0.9 (0.0 - 1.0) | 45.5 | |
Zhang_CQU_task5_3 | Zhang_CQU_task5_3 | Zhang2022 | 1.9 (1.0 - 2.0) | 44.2 | |
Zhang_CQU_task5_4 | Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | 44.2 | |
Kang_ET_task5_1 | FewShot_using_good_embedding_model | Kang2022 | 2.4 (2.4 - 2.4) | ||
Kang_ET_task5_2 | FewShot_using_good_embedding_model | Kang2022 | 2.8 (2.8 - 2.9) | ||
Hertkorn_ZF_task5_1 | ZF_CNN1 | Hertkorn2022 | 43.4 (42.9 - 43.8) | 60.6 | |
Hertkorn_ZF_task5_2 | ZF_CNN2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | 61.8 | |
Hertkorn_ZF_task5_3 | ZF_CNN3 | Hertkorn2022 | 41.4 (41.9 - 42.3) | 67.9 | |
Hertkorn_ZF_task5_4 | ZF_CNN4 | Hertkorn2022 | 33.8 (32.4 - 34.6) | 60.5 | |
Zou_PKU_task5_1 | TI_1 | Yang2022 | 19.2 (18.9 - 19.5) | 52.0 | |
Zou_PKU_task5_2 | TI_2 | Yang2022 | 18.7 (18.4 - 19.0) | 52.0 | |
Zou_PKU_task5_3 | TI_3 | Yang2022 | 18.9 (18.6 - 19.2) | 52.0 | |
Zou_PKU_task5_4 | TI_4 | Yang2022 | 15.8 (15.4 - 16.1) | 52.0 | |
Tan_WHU_task5_1 | Knowledge transfer 75% training 10 iteration adaptive (8) | Tan2022 | 8.1 (7.3 - 8.5) | 52.4 |
Tan_WHU_task5_2 | Knowledge transfer 90% training 15 iteration | Tan2022 | 16.9 (16.4 - 17.2) | 53.9 | |
Tan_WHU_task5_3 | Knowledge Transfer 90 training (4) | Tan2022 | 17.1 (16.7 - 17.4) | 54.9 | |
Tan_WHU_task5_4 | Knowledge Transfer 90 training adaptive (4) | Tan2022 | 17.2 (16.8 - 17.6) | 54.5 | |
Liu_BIT-SRCB_task5_1 | TI-PN ensemble | Liu2022 | 44.1 (43.6 - 44.5) | 61.2 | |
Liu_BIT-SRCB_task5_2 | TI-PN ensemble_2 | Liu2022 | 41.9 (41.6 - 42.2) | 63.3 | |
Liu_BIT-SRCB_task5_3 | TI_scalable | Liu2022 | 36.8 (36.5 - 37.2) | 43.5 | |
Liu_BIT-SRCB_task5_4 | pretrained TI-PN ensemble | Liu2022 | 44.3 (43.9 - 44.6) | 64.8 | |
Willbo_RISE_task5_1 | willbo_supervised_1 | Willbo2022 | 17.9 (17.6 - 18.2) | 51.4 | |
Willbo_RISE_task5_2 | willbo_supervised_2 | Willbo2022 | 20.4 (20.1 - 20.7) | 57.5 | |
Willbo_RISE_task5_3 | willbo_semi_1 | Willbo2022 | 20.2 (19.9 - 20.5) | 50.8 | |
Willbo_RISE_task5_4 | willbo_semi_2 | Willbo2022 | 21.7 (21.3 - 22.0) | 47.9 | |
ZGORZYNSKI_SRPOL_task5_1 | Siamese Network with fully connected head | Zgorzynski2022 | 28.1 (27.6 - 28.5) | 67.3 | |
ZGORZYNSKI_SRPOL_task5_2 | Siamese Network with fully connected head | Zgorzynski2022 | 16.3 (15.1 - 16.9) | 59.4 | |
ZGORZYNSKI_SRPOL_task5_3 | Siamese Network with fully connected head | Zgorzynski2022 | 29.9 (29.3 - 30.3) | 60.0 | |
ZGORZYNSKI_SRPOL_task5_4 | Siamese Network with fully connected head | Zgorzynski2022 | 33.2 (32.7 - 33.7) | 57.2 | |
Huang_SCUT_task5_1 | Transductive learning and modified central difference convolution | Huang2022 | 18.3 (18.0 - 18.6) | 54.6 | |
Martinsson_RISE_task5_1 | Adaptive prototypical ensemble | Martinsson2022 | 48.0 (47.5 - 48.4) | 60.0 | |
Martinsson_RISE_task5_2 | Adaptive prototypical ensemble | Martinsson2022 | 45.4 (44.9 - 45.9) | 30.6 | |
Martinsson_RISE_task5_3 | Adaptive prototypical ensemble | Martinsson2022 | 19.4 (18.6 - 20.0) | 44.6 | |
Martinsson_RISE_task5_4 | Adaptive prototypical ensemble | Martinsson2022 | 32.5 (31.7 - 33.1) | 13.3 | |
Liu_Surrey_task5_1 | Haohe_Liu_S1 | Liu2022a | 43.1 (42.7 - 43.4) | 58.5 | |
Liu_Surrey_task5_2 | Haohe_Liu_S2 | Liu2022a | 48.2 (48.5 - 48.9) | 50.0 | |
Liu_Surrey_task5_3 | Haohe_Liu_S3 | Liu2022a | 36.9 (36.5 - 37.2) | 40.7 | |
Liu_Surrey_task5_4 | Haohe_Liu_S4 | Liu2022a | 45.5 (45.8 - 46.2) | 60.2 | |
Li_QMUL_task5_1 | Prototypical Network with ResNet and SpecAugment | Li2022 | 15.5 (15.2 - 15.8) | 47.9 | |
Mariajohn_DSPC_task5_1 | Prototypical-1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | 43.9 | |
Du_NERCSLIP_task5_1 | Segment-level embedding learning | Du2022a | 36.5 (35.6 - 37.0) | 68.2 | |
Du_NERCSLIP_task5_2 | Frame-level embedding learning 1 | Du2022a | 60.2 (59.7 - 61.7) | 74.4 | |
Du_NERCSLIP_task5_3 | event filtering | Du2022a | 42.9 (42.4 - 43.4) | 53.4 | |
Du_NERCSLIP_task5_4 | Frame-level embedding learning 2 | Du2022a | 60.0 (58.5 - 61.5) | 74.4 |
Dataset wise metrics
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (CHE dataset) | Event-based F-score (CT dataset) | Event-based F-score (MGE dataset) | Event-based F-score (MS dataset) | Event-based F-score (QU dataset) | Event-based F-score (DC dataset)
---|---|---|---|---|---|---|---|---|---|---
Baseline_TempMatch_task5_1 | Baseline Template Matching | 12.3 (11.5 - 12.8) | 21.1 | 7.1 | 44.1 | 8.0 | 9.7 | 35.0 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 5.3 ( - ) | 42.6 | 8.0 | 3.8 | 11.6 | 1.6 | 40.1 | ||
Wu_SHNU_task5_1 | Continual_learning | Wu2022 | 40.9 (40.5 - 41.3) | 65.0 | 37.2 | 38.2 | 38.9 | 38.1 | 44.8 | |
Zhang_CQU_task5_1 | Zhang_CQU_task5_1 | Zhang2022 | 1.2 (0.9 - 1.3) | 30.3 | 24.6 | 5.8 | 1.1 | 0.3 | 25.4 | |
Zhang_CQU_task5_2 | Zhang_CQU_task5_2 | Zhang2022 | 0.9 (0.0 - 1.0) | 26.8 | 38.3 | 11.1 | 0.2 | 14.6 | 9.1 | |
Zhang_CQU_task5_3 | Zhang_CQU_task5_3 | Zhang2022 | 1.9 (1.0 - 2.0) | 29.2 | 26.0 | 55.6 | 0.4 | 15.8 | 17.5 | |
Zhang_CQU_task5_4 | Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | 29.6 | 17.6 | 55.3 | 0.9 | 18.2 | 30.2 | |
Kang_ET_task5_1 | FewShot_using_good_embedding_model | Kang2022 | 2.4 (2.4 - 2.4) | 11.0 | 0.7 | 3.3 | 3.5 | 4.3 | 4.7 | |
Kang_ET_task5_2 | FewShot_using_good_embedding_model | Kang2022 | 2.8 (2.8 - 2.9) | 8.7 | 0.9 | 3.3 | 3.9 | 5.3 | 4.7 | |
Hertkorn_ZF_task5_1 | ZF_CNN1 | Hertkorn2022 | 43.4 (42.9 - 43.8) | 70.2 | 37.8 | 68.4 | 64.1 | 22.5 | 51.2 | |
Hertkorn_ZF_task5_2 | ZF_CNN2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | 70.3 | 37.1 | 63.8 | 58.6 | 25.9 | 57.4 | |
Hertkorn_ZF_task5_3 | ZF_CNN3 | Hertkorn2022 | 41.4 (41.9 - 42.3) | 66.7 | 40.0 | 76.4 | 74.0 | 18.2 | 57.9 | |
Hertkorn_ZF_task5_4 | ZF_CNN4 | Hertkorn2022 | 33.8 (32.4 - 34.6) | 64.6 | 15.0 | 84.9 | 71.0 | 21.5 | 58.8 | |
Zou_PKU_task5_1 | TI_1 | Yang2022 | 19.2 (18.9 - 19.5) | 33.4 | 22.8 | 59.7 | 44.0 | 6.8 | 22.9 | |
Zou_PKU_task5_2 | TI_2 | Yang2022 | 18.7 (18.4 - 19.0) | 32.9 | 22.6 | 60.7 | 42.7 | 6.6 | 22.4 | |
Zou_PKU_task5_3 | TI_3 | Yang2022 | 18.9 (18.6 - 19.2) | 30.9 | 24.0 | 60.9 | 43.8 | 6.7 | 22.1 | |
Zou_PKU_task5_4 | TI_4 | Yang2022 | 15.8 (15.4 - 16.1) | 43.8 | 9.3 | 57.2 | 30.9 | 6.3 | 31.4 | |
Tan_WHU_task5_1 | Knowledge transfer 75% training 10 iteration adaptive (8) | Tan2022 | 8.1 (7.3 - 8.5) | 39.0 | 43.9 | 2.4 | 10.3 | 15.0 | 12.7 |
Tan_WHU_task5_2 | Knowledge transfer 90% training 15 iteration | Tan2022 | 16.9 (16.4 - 17.2) | 31.5 | 32.8 | 8.0 | 15.3 | 15.4 | 39.8 | |
Tan_WHU_task5_3 | Knowledge Transfer 90 training (4) | Tan2022 | 17.1 (16.7 - 17.4) | 25.5 | 40.3 | 8.4 | 15.7 | 18.0 | 28.6 | |
Tan_WHU_task5_4 | Knowledge Transfer 90 training adaptive (4) | Tan2022 | 17.2 (16.8 - 17.6) | 26.2 | 40.3 | 8.4 | 15.7 | 18.0 | 29.6 | |
Liu_BIT-SRCB_task5_1 | TI-PN ensemble | Liu2022 | 44.1 (43.6 - 44.5) | 54.6 | 45.7 | 47.3 | 51.5 | 32.4 | 48.5 | |
Liu_BIT-SRCB_task5_2 | TI-PN ensemble_2 | Liu2022 | 41.9 (41.6 - 42.2) | 54.6 | 56.3 | 47.3 | 51.5 | 24.0 | 48.5 | |
Liu_BIT-SRCB_task5_3 | TI_scalable | Liu2022 | 36.8 (36.5 - 37.2) | 52.2 | 41.0 | 51.6 | 49.3 | 22.2 | 33.6 | |
Liu_BIT-SRCB_task5_4 | pretrained TI-PN ensemble | Liu2022 | 44.3 (43.9 - 44.6) | 54.6 | 45.0 | 48.0 | 53.9 | 32.5 | 47.7 | |
Willbo_RISE_task5_1 | willbo_supervised_1 | Willbo2022 | 17.9 (17.6 - 18.2) | 43.8 | 19.1 | 24.6 | 20.9 | 12.2 | 12.8 | |
Willbo_RISE_task5_2 | willbo_supervised_2 | Willbo2022 | 20.4 (20.1 - 20.7) | 47.1 | 17.4 | 31.1 | 21.4 | 12.2 | 21.9 | |
Willbo_RISE_task5_3 | willbo_semi_1 | Willbo2022 | 20.2 (19.9 - 20.5) | 44.0 | 14.8 | 24.8 | 24.9 | 13.9 | 22.1 | |
Willbo_RISE_task5_4 | willbo_semi_2 | Willbo2022 | 21.7 (21.3 - 22.0) | 48.8 | 14.9 | 31.1 | 25.9 | 13.9 | 25.5 | |
ZGORZYNSKI_SRPOL_task5_1 | Siamese Network with fully connected head | Zgorzynski2022 | 28.1 (27.6 - 28.5) | 51.0 | 52.9 | 13.9 | 33.4 | 27.4 | 33.7 | |
ZGORZYNSKI_SRPOL_task5_2 | Siamese Network with fully connected head | Zgorzynski2022 | 16.3 (15.1 - 16.9) | 51.2 | 39.8 | 4.2 | 48.4 | 34.7 | 46.3 | |
ZGORZYNSKI_SRPOL_task5_3 | Siamese Network with fully connected head | Zgorzynski2022 | 29.9 (29.3 - 30.3) | 49.7 | 23.7 | 15.5 | 60.9 | 35.9 | 41.7 | |
ZGORZYNSKI_SRPOL_task5_4 | Siamese Network with fully connected head | Zgorzynski2022 | 33.2 (32.7 - 33.7) | 58.8 | 31.1 | 19.7 | 41.1 | 38.4 | 40.4 | |
Huang_SCUT_task5_1 | Transductive learning and modified central difference convolution | Huang2022 | 18.3 (18.0 - 18.6) | 17.9 | 20.6 | 65.6 | 56.0 | 7.4 | 22.1 | |
Martinsson_RISE_task5_1 | Adaptive prototypical ensemble | Martinsson2022 | 48.0 (47.5 - 48.4) | 71.7 | 48.4 | 77.6 | 70.6 | 24.6 | 53.1 | |
Martinsson_RISE_task5_2 | Adaptive prototypical ensemble | Martinsson2022 | 45.4 (44.9 - 45.9) | 56.3 | 37.6 | 61.5 | 70.7 | 29.5 | 49.4 | |
Martinsson_RISE_task5_3 | Adaptive prototypical ensemble | Martinsson2022 | 19.4 (18.6 - 20.0) | 67.1 | 4.7 | 65.5 | 73.3 | 34.7 | 45.0 | |
Martinsson_RISE_task5_4 | Adaptive prototypical ensemble | Martinsson2022 | 32.5 (31.7 - 33.1) | 50.9 | 13.4 | 47.8 | 71.2 | 34.1 | 42.5 | |
Liu_Surrey_task5_1 | Haohe_Liu_S1 | Liu2022a | 43.1 (42.7 - 43.4) | 81.9 | 58.4 | 46.4 | 48.4 | 22.8 | 52.0 | |
Liu_Surrey_task5_2 | Haohe_Liu_S2 | Liu2022a | 48.2 (48.5 - 48.9) | 76.9 | 57.4 | 48.0 | 60.7 | 28.9 | 56.8 | |
Liu_Surrey_task5_3 | Haohe_Liu_S3 | Liu2022a | 36.9 (36.5 - 37.2) | 83.0 | 52.2 | 29.1 | 53.5 | 18.5 | 53.7 | |
Liu_Surrey_task5_4 | Haohe_Liu_S4 | Liu2022a | 45.5 (45.8 - 46.2) | 80.5 | 61.8 | 38.8 | 47.7 | 30.3 | 53.8 | |
Li_QMUL_task5_1 | Prototypical Network with ResNet and SpecAugment | Li2022 | 15.5 (15.2 - 15.8) | 39.5 | 35.0 | 11.9 | 17.9 | 6.9 | 30.7 | |
Mariajohn_DSPC_task5_1 | Prototypical-1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | 27.4 | 23.6 | 55.4 | 65.5 | 19.4 | 14.9 | |
Du_NERCSLIP_task5_1 | Segment-level embedding learning | Du2022a | 36.5 (35.6 - 37.0) | 53.6 | 43.9 | 43.0 | 57.5 | 17.7 | 46.7 | |
Du_NERCSLIP_task5_2 | Frame-level embedding learning 1 | Du2022a | 60.2 (59.7 - 61.7) | 71.7 | 48.4 | 89.1 | 66.3 | 48.7 | 57.3 | |
Du_NERCSLIP_task5_3 | event filtering | Du2022a | 42.9 (42.4 - 43.4) | 57.4 | 48.6 | 62.3 | 42.4 | 23.5 | 52.2 | |
Du_NERCSLIP_task5_4 | Frame-level embedding learning 2 | Du2022a | 60.0 (58.5 - 61.5) | 73.3 | 49.6 | 91.3 | 64.4 | 46.3 | 57.7 |
Teams ranking
Table including only the best-performing system per submitting team.
Rank | Submission code | Submission name | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Event-based F-score (Development dataset)
---|---|---|---|---|---
Baseline_TempMatch_task5_1 | Baseline Template Matching | 12.3 (11.5 - 12.8) | 3.4 | ||
Baseline_PROTO_task5_1 | Baseline Prototypical Network | 5.3 ( - ) | |||
Wu_SHNU_task5_1 | Continual_learning | Wu2022 | 40.9 (40.5 - 41.3) | 53.9 | |
Zhang_CQU_task5_4 | Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | 44.2 | |
Kang_ET_task5_2 | FewShot_using_good_embedding_model | Kang2022 | 2.8 (2.8 - 2.9) | ||
Hertkorn_ZF_task5_2 | ZF_CNN2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | 61.8 | |
Zou_PKU_task5_1 | TI_1 | Yang2022 | 19.2 (18.9 - 19.5) | 52.0 | |
Tan_WHU_task5_4 | Knowledge Transfer 90 training adaptive (4) | Tan2022 | 17.2 (16.8 - 17.6) | 54.5 | |
Liu_BIT-SRCB_task5_4 | pretrained TI-PN ensemble | Liu2022 | 44.3 (43.9 - 44.6) | 64.8 | |
Willbo_RISE_task5_4 | willbo_semi_2 | Willbo2022 | 21.7 (21.3 - 22.0) | 47.9 | |
ZGORZYNSKI_SRPOL_task5_4 | Siamese Network with fully connected head | Zgorzynski2022 | 33.2 (32.7 - 33.7) | 57.2 | |
Huang_SCUT_task5_1 | Transductive learning and modified central difference convolution | Huang2022 | 18.3 (18.0 - 18.6) | 54.6 | |
Martinsson_RISE_task5_1 | Adaptive prototypical ensemble | Martinsson2022 | 48.0 (47.5 - 48.4) | 60.0 | |
Liu_Surrey_task5_2 | Haohe_Liu_S2 | Liu2022a | 48.2 (48.5 - 48.9) | 50.0 | |
Li_QMUL_task5_1 | Prototypical Network with ResNet and SpecAugment | Li2022 | 15.5 (15.2 - 15.8) | 47.9 | |
Mariajohn_DSPC_task5_1 | Prototypical-1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | 43.9 | |
Du_NERCSLIP_task5_2 | Frame-level embedding learning 1 | Du2022a | 60.2 (59.7 - 61.7) | 74.4 |
System characteristics
General characteristics
Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Sampling rate | Data augmentation | Features
---|---|---|---|---|---|---
Baseline_TempMatch_task5_1 | 12.3 (11.5 - 12.8) | any | spectrogram | |||
Baseline_PROTO_task5_1 | 5.3 ( - ) | 22.05 KHz | PCEN | |||
Wu_SHNU_task5_1 | Wu2022 | 40.9 (40.5 - 41.3) | any | Time masking, Frequency masking | PCEN | |
Zhang_CQU_task5_1 | Zhang2022 | 1.2 (0.9 - 1.3) | 22.05 KHz | Spectrogram | ||
Zhang_CQU_task5_2 | Zhang2022 | 0.9 (0.0 - 1.0) | 22.05 KHz | Spectrogram | ||
Zhang_CQU_task5_3 | Zhang2022 | 1.9 (1.0 - 2.0) | 22.05 KHz | Spectrogram | ||
Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | 22.05 KHz | Spectrogram | ||
Kang_ET_task5_1 | Kang2022 | 2.4 (2.4 - 2.4) | 16 KHz | specaugment | PCEN | |
Kang_ET_task5_2 | Kang2022 | 2.8 (2.8 - 2.9) | 16 KHz | Specaugment | PCEN | |
Hertkorn_ZF_task5_1 | Hertkorn2022 | 43.4 (42.9 - 43.8) | any | Spectrogram | ||
Hertkorn_ZF_task5_2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | any | Spectrogram | ||
Hertkorn_ZF_task5_3 | Hertkorn2022 | 41.4 (41.9 - 42.3) | any | Spectrogram | ||
Hertkorn_ZF_task5_4 | Hertkorn2022 | 33.8 (32.4 - 34.6) | any | Spectrogram | ||
Zou_PKU_task5_1 | Yang2022 | 19.2 (18.9 - 19.5) | 22.05 KHz | time and frequency masking, mixup | Spectrogram | |
Zou_PKU_task5_2 | Yang2022 | 18.7 (18.4 - 19.0) | 22.05 KHz | time and frequency masking, mixup | Spectrogram | |
Zou_PKU_task5_3 | Yang2022 | 18.9 (18.6 - 19.2) | 22.05 KHz | time and frequency masking, mixup | Spectrogram | |
Zou_PKU_task5_4 | Yang2022 | 15.8 (15.4 - 16.1) | 22.05 KHz | time masking, frequency masking, mixup | Spectrogram | |
Tan_WHU_task5_1 | Tan2022 | 8.1 (7.3 - 8.5) | 22.05 KHz | PCEN | ||
Tan_WHU_task5_2 | Tan2022 | 16.9 (16.4 - 17.2) | 22.05 KHz | PCEN | ||
Tan_WHU_task5_3 | Tan2022 | 17.1 (16.7 - 17.4) | 22.05 KHz | PCEN | ||
Tan_WHU_task5_4 | Tan2022 | 17.2 (16.8 - 17.6) | 22.05 KHz | PCEN | ||
Liu_BIT-SRCB_task5_1 | Liu2022 | 44.1 (43.6 - 44.5) | 22.05 KHz | Specaugment | PCEN | |
Liu_BIT-SRCB_task5_2 | Liu2022 | 41.9 (41.6 - 42.2) | 22.05 KHz | Specaugment | PCEN | |
Liu_BIT-SRCB_task5_3 | Liu2022 | 36.8 (36.5 - 37.2) | 22.05 KHz | PCEN | ||
Liu_BIT-SRCB_task5_4 | Liu2022 | 44.3 (43.9 - 44.6) | 22.05 KHz | Specaugment | PCEN | |
Willbo_RISE_task5_1 | Willbo2022 | 17.9 (17.6 - 18.2) | any | Mel-spectrogram, PCEN | ||
Willbo_RISE_task5_2 | Willbo2022 | 20.4 (20.1 - 20.7) | any | Mel-spectrogram, PCEN | ||
Willbo_RISE_task5_3 | Willbo2022 | 20.2 (19.9 - 20.5) | any | Mel-spectrogram, PCEN | ||
Willbo_RISE_task5_4 | Willbo2022 | 21.7 (21.3 - 22.0) | any | Mel-spectrogram, PCEN | ||
ZGORZYNSKI_SRPOL_task5_1 | Zgorzynski2022 | 28.1 (27.6 - 28.5) | 48 KHz | Noise mixing, Random Crop | Mel-spectrogram, PCEN | |
ZGORZYNSKI_SRPOL_task5_2 | Zgorzynski2022 | 16.3 (15.1 - 16.9) | 48 KHz | Noise mixing | Mel-spectrogram | |
ZGORZYNSKI_SRPOL_task5_3 | Zgorzynski2022 | 29.9 (29.3 - 30.3) | 48 KHz | Noise mixing | Mel-spectrogram | |
ZGORZYNSKI_SRPOL_task5_4 | Zgorzynski2022 | 33.2 (32.7 - 33.7) | 48 KHz | Noise mixing | Mel-spectrogram | |
Huang_SCUT_task5_1 | Huang2022 | 18.3 (18.0 - 18.6) | 22.05 KHz | Specaugment | PCEN | |
Martinsson_RISE_task5_1 | Martinsson2022 | 48.0 (47.5 - 48.4) | 22.05 KHz | Log-Mel energies, PCEN | ||
Martinsson_RISE_task5_2 | Martinsson2022 | 45.4 (44.9 - 45.9) | 22.05 KHz | Log-Mel energies, PCEN | ||
Martinsson_RISE_task5_3 | Martinsson2022 | 19.4 (18.6 - 20.0) | 22.05 KHz | PCEN | ||
Martinsson_RISE_task5_4 | Martinsson2022 | 32.5 (31.7 - 33.1) | 22.05 KHz | PCEN | ||
Liu_Surrey_task5_1 | Liu2022a | 43.1 (42.7 - 43.4) | 22.05 KHz | Dynamic dataloader | PCEN, Delta-MFCC | |
Liu_Surrey_task5_2 | Liu2022a | 48.2 (48.5 - 48.9) | 22.05 KHz | Dynamic dataloader | PCEN, Delta-MFCC | |
Liu_Surrey_task5_3 | Liu2022a | 36.9 (36.5 - 37.2) | 22.05 KHz | Dynamic dataloader | PCEN, Delta-MFCC | |
Liu_Surrey_task5_4 | Liu2022a | 45.5 (45.8 - 46.2) | 22.05 KHz | Dynamic dataloader | PCEN, Delta-MFCC | |
Li_QMUL_task5_1 | Li2022 | 15.5 (15.2 - 15.8) | any | time masking, frequency masking, time warping | PCEN, Spectrogram | |
Mariajohn_DSPC_task5_1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | any | time shifting, segment level mirroring | Log-Mel spectrogram | |
Du_NERCSLIP_task5_1 | Du2022a | 36.5 (35.6 - 37.0) | 22.05 KHz | SpecAugment | PCEN | |
Du_NERCSLIP_task5_2 | Du2022a | 60.2 (59.7 - 61.7) | 22.05 KHz | PCEN | ||
Du_NERCSLIP_task5_3 | Du2022a | 42.9 (42.4 - 43.4) | 22.05 KHz | PCEN | ||
Du_NERCSLIP_task5_4 | Du2022a | 60.0 (58.5 - 61.5) | 22.05 KHz | PCEN |
Machine learning characteristics
Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Classifier | Few-shot approach | Post-processing
---|---|---|---|---|---|---
Baseline_TempMatch_task5_1 | 12.3 (11.5 - 12.8) | template matching | template matching | peak picking, threshold | ||
Baseline_PROTO_task5_1 | 5.3 ( - ) | ResNet | prototypical | threshold | ||
Wu_SHNU_task5_1 | Wu2022 | 40.9 (40.5 - 41.3) | Continual Learning | prototypical, weight generator | threshold | |
Zhang_CQU_task5_1 | Zhang2022 | 1.2 (0.9 - 1.3) | CNN | prototypical | peak picking, threshold | |
Zhang_CQU_task5_2 | Zhang2022 | 0.9 (0.0 - 1.0) | CNN | prototypical | peak picking, threshold | |
Zhang_CQU_task5_3 | Zhang2022 | 1.9 (1.0 - 2.0) | CNN | prototypical | peak picking, threshold | |
Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | CNN | prototypical | peak picking, threshold | |
Kang_ET_task5_1 | Kang2022 | 2.4 (2.4 - 2.4) | TDNN | Fine tuning | ||
Kang_ET_task5_2 | Kang2022 | 2.8 (2.8 - 2.9) | TDNN | Fine tuning | ||
Hertkorn_ZF_task5_1 | Hertkorn2022 | 43.4 (42.9 - 43.8) | CNN | threshold, duration threshold, event stitching | ||
Hertkorn_ZF_task5_2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | CNN | threshold, duration threshold, event stitching | ||
Hertkorn_ZF_task5_3 | Hertkorn2022 | 41.4 (41.9 - 42.3) | CNN | threshold, duration threshold, event stitching | ||
Hertkorn_ZF_task5_4 | Hertkorn2022 | 33.8 (32.4 - 34.6) | CNN | threshold, duration threshold, event stitching | ||
Zou_PKU_task5_1 | Yang2022 | 19.2 (18.9 - 19.5) | CNN | prototypical | threshold, peak picking | |
Zou_PKU_task5_2 | Yang2022 | 18.7 (18.4 - 19.0) | CNN | prototypical | threshold, peak picking | |
Zou_PKU_task5_3 | Yang2022 | 18.9 (18.6 - 19.2) | CNN | prototypical | threshold, peak picking | |
Zou_PKU_task5_4 | Yang2022 | 15.8 (15.4 - 16.1) | CNN | prototypical | threshold, peak picking | |
Tan_WHU_task5_1 | Tan2022 | 8.1 (7.3 - 8.5) | CNN | prototypical, transductive inference | threshold, minimum event length | |
Tan_WHU_task5_2 | Tan2022 | 16.9 (16.4 - 17.2) | CNN | prototypical, transductive inference | threshold | |
Tan_WHU_task5_3 | Tan2022 | 17.1 (16.7 - 17.4) | CNN | prototypical, transductive inference | threshold | |
Tan_WHU_task5_4 | Tan2022 | 17.2 (16.8 - 17.6) | CNN | prototypical, transductive inference | threshold, minimum event length | |
Liu_BIT-SRCB_task5_1 | Liu2022 | 44.1 (43.6 - 44.5) | CNN | prototypical, transductive inference | peak picking, threshold, VAD | |
Liu_BIT-SRCB_task5_2 | Liu2022 | 41.9 (41.6 - 42.2) | CNN | prototypical, transductive inference | peak picking, threshold, VAD | |
Liu_BIT-SRCB_task5_3 | Liu2022 | 36.8 (36.5 - 37.2) | CNN | Transductive inference | peak picking, threshold | |
Liu_BIT-SRCB_task5_4 | Liu2022 | 44.3 (43.9 - 44.6) | CNN | prototypical, transductive inference | peak picking, threshold, VAD | |
Willbo_RISE_task5_1 | Willbo2022 | 17.9 (17.6 - 18.2) | ResNet | prototypical | median filtering, minimum event length, threshold | |
Willbo_RISE_task5_2 | Willbo2022 | 20.4 (20.1 - 20.7) | ResNet | prototypical, threshold fitting | median filtering, minimum event length, threshold | |
Willbo_RISE_task5_3 | Willbo2022 | 20.2 (19.9 - 20.5) | ResNet | prototypical | median filtering, minimum event length, threshold | |
Willbo_RISE_task5_4 | Willbo2022 | 21.7 (21.3 - 22.0) | ResNet | prototypical, threshold fitting | median filtering, minimum event length, threshold | |
ZGORZYNSKI_SRPOL_task5_1 | Zgorzynski2022 | 28.1 (27.6 - 28.5) | CNN | Siamese network with fully connected head, fine tuning | peak picking, threshold, Gaussian filter | |
ZGORZYNSKI_SRPOL_task5_2 | Zgorzynski2022 | 16.3 (15.1 - 16.9) | CNN | Siamese network with fully connected head, fine tuning | threshold, Gaussian filter | |
ZGORZYNSKI_SRPOL_task5_3 | Zgorzynski2022 | 29.9 (29.3 - 30.3) | CNN | Siamese network with fully connected head, fine tuning | threshold, Gaussian filter | |
ZGORZYNSKI_SRPOL_task5_4 | Zgorzynski2022 | 33.2 (32.7 - 33.7) | CNN | Siamese network with fully connected head, fine tuning | threshold, Gaussian filter | |
Huang_SCUT_task5_1 | Huang2022 | 18.3 (18.0 - 18.6) | transductive learning | transductive learning | peak picking, threshold | |
Martinsson_RISE_task5_1 | Martinsson2022 | 48.0 (47.5 - 48.4) | Ensemble, CNN | prototypical, input size | threshold, merging, filter too small, filter too big | |
Martinsson_RISE_task5_2 | Martinsson2022 | 45.4 (44.9 - 45.9) | Ensemble, CNN | prototypical, input size | threshold, merging, filter too small, filter too big | |
Martinsson_RISE_task5_3 | Martinsson2022 | 19.4 (18.6 - 20.0) | CNN | prototypical | threshold, merging, filter too small, filter too big | |
Martinsson_RISE_task5_4 | Martinsson2022 | 32.5 (31.7 - 33.1) | CNN | prototypical | threshold, merging, filter too small, filter too big | |
Liu_Surrey_task5_1 | Liu2022a | 43.1 (42.7 - 43.4) | CNN, ensemble | prototypical | threshold, filter by length, split long, remove long | |
Liu_Surrey_task5_2 | Liu2022a | 48.2 (48.5 - 48.9) | CNN | prototypical | threshold, filter by length, remove long, padding | |
Liu_Surrey_task5_3 | Liu2022a | 36.9 (36.5 - 37.2) | CNN | prototypical | threshold, filter by length, split long, remove long, merge short, padding | |
Liu_Surrey_task5_4 | Liu2022a | 45.5 (45.8 - 46.2) | CNN | prototypical | threshold, filter by length, remove long | |
Li_QMUL_task5_1 | Li2022 | 15.5 (15.2 - 15.8) | CNN | prototypical | peak picking, threshold | |
Mariajohn_DSPC_task5_1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | CNN | prototypical | threshold | |
Du_NERCSLIP_task5_1 | Du2022a | 36.5 (35.6 - 37.0) | CNN | fine tuning | peak picking, threshold | |
Du_NERCSLIP_task5_2 | Du2022a | 60.2 (59.7 - 61.7) | CNN | fine tuning | peak picking, threshold | |
Du_NERCSLIP_task5_3 | Du2022a | 42.9 (42.4 - 43.4) | CNN | fine tuning | peak picking, threshold | |
Du_NERCSLIP_task5_4 | Du2022a | 60.0 (58.5 - 61.5) | CNN | fine tuning | peak picking, threshold |
Complexity
Rank | Code | Technical Report | Event-based F-score with 95% confidence interval (Evaluation dataset) | Model complexity | Training time
---|---|---|---|---|---
Baseline_TempMatch_task5_1 | 12.3 (11.5 - 12.8) | ||||
Baseline_PROTO_task5_1 | 5.3 ( - ) | ||||
Wu_SHNU_task5_1 | Wu2022 | 40.9 (40.5 - 41.3) | 443520 | 2.5h | |
Zhang_CQU_task5_1 | Zhang2022 | 1.2 (0.9 - 1.3) | 90min | ||
Zhang_CQU_task5_2 | Zhang2022 | 0.9 (0.0 - 1.0) | 90min | ||
Zhang_CQU_task5_3 | Zhang2022 | 1.9 (1.0 - 2.0) | 90min | ||
Zhang_CQU_task5_4 | Zhang2022 | 4.3 (3.7 - 4.6) | 90min | ||
Kang_ET_task5_1 | Kang2022 | 2.4 (2.4 - 2.4) | |||
Kang_ET_task5_2 | Kang2022 | 2.8 (2.8 - 2.9) | |||
Hertkorn_ZF_task5_1 | Hertkorn2022 | 43.4 (42.9 - 43.8) | 54979 | 6 min/ wav file | |
Hertkorn_ZF_task5_2 | Hertkorn2022 | 44.4 (45.0 - 45.4) | 54979 | 6 min/ wav file | |
Hertkorn_ZF_task5_3 | Hertkorn2022 | 41.4 (41.9 - 42.3) | 54979 | 6 min/ wav file | |
Hertkorn_ZF_task5_4 | Hertkorn2022 | 33.8 (32.4 - 34.6) | 54979 | 6 min/ wav file | |
Zou_PKU_task5_1 | Yang2022 | 19.2 (18.9 - 19.5) | 468627 | 30 min | |
Zou_PKU_task5_2 | Yang2022 | 18.7 (18.4 - 19.0) | 468627 | 30 min | |
Zou_PKU_task5_3 | Yang2022 | 18.9 (18.6 - 19.2) | 468627 | 30 min | |
Zou_PKU_task5_4 | Yang2022 | 15.8 (15.4 - 16.1) | 468627 | 30 min | |
Tan_WHU_task5_1 | Tan2022 | 8.1 (7.3 - 8.5) | 700k | 1h | |
Tan_WHU_task5_2 | Tan2022 | 16.9 (16.4 - 17.2) | 700k | 1h | |
Tan_WHU_task5_3 | Tan2022 | 17.1 (16.7 - 17.4) | 700k | 1h | |
Tan_WHU_task5_4 | Tan2022 | 17.2 (16.8 - 17.6) | 700k | 1h | |
Liu_BIT-SRCB_task5_1 | Liu2022 | 44.1 (43.6 - 44.5) | 9627177 | 1.5h | |
Liu_BIT-SRCB_task5_2 | Liu2022 | 41.9 (41.6 - 42.2) | 9627177 | 1.5h | |
Liu_BIT-SRCB_task5_3 | Liu2022 | 36.8 (36.5 - 37.2) | 8757077 | 1.5h | |
Liu_BIT-SRCB_task5_4 | Liu2022 | 44.3 (43.9 - 44.6) | 9914068 | 1.5h | |
Willbo_RISE_task5_1 | Willbo2022 | 17.9 (17.6 - 18.2) | |||
Willbo_RISE_task5_2 | Willbo2022 | 20.4 (20.1 - 20.7) | |||
Willbo_RISE_task5_3 | Willbo2022 | 20.2 (19.9 - 20.5) | |||
Willbo_RISE_task5_4 | Willbo2022 | 21.7 (21.3 - 22.0) | |||
ZGORZYNSKI_SRPOL_task5_1 | Zgorzynski2022 | 28.1 (27.6 - 28.5) | 76700357 | 9h | |
ZGORZYNSKI_SRPOL_task5_2 | Zgorzynski2022 | 16.3 (15.1 - 16.9) | 76700357 | 9h | |
ZGORZYNSKI_SRPOL_task5_3 | Zgorzynski2022 | 29.9 (29.3 - 30.3) | 76700357 | 9h | |
ZGORZYNSKI_SRPOL_task5_4 | Zgorzynski2022 | 33.2 (32.7 - 33.7) | 76700357 | 9h | |
Huang_SCUT_task5_1 | Huang2022 | 18.3 (18.0 - 18.6) | 492206 | 50min, RTX3090 | |
Martinsson_RISE_task5_1 | Martinsson2022 | 48.0 (47.5 - 48.4) | 25994880 | ||
Martinsson_RISE_task5_2 | Martinsson2022 | 45.4 (44.9 - 45.9) | 25994880 | ||
Martinsson_RISE_task5_3 | Martinsson2022 | 19.4 (18.6 - 20.0) | 1732992 | ||
Martinsson_RISE_task5_4 | Martinsson2022 | 32.5 (31.7 - 33.1) | 1732992 | ||
Liu_Surrey_task5_1 | Liu2022a | 43.1 (42.7 - 43.4) | 724096 | 91 min, NVIDIA GeForce 3070 | |
Liu_Surrey_task5_2 | Liu2022a | 48.2 (48.5 - 48.9) | 724096 | 91 min, NVIDIA GeForce 3070 | |
Liu_Surrey_task5_3 | Liu2022a | 36.9 (36.5 - 37.2) | 724096 | 91 min, NVIDIA GeForce 3070 | |
Liu_Surrey_task5_4 | Liu2022a | 45.5 (45.8 - 46.2) | 724096 | 91 min, NVIDIA GeForce 3070 | |
Li_QMUL_task5_1 | Li2022 | 15.5 (15.2 - 15.8) | 40 min, Colab pro Tesla p100 | ||
Mariajohn_DSPC_task5_1 | Mariajohn2022 | 25.7 (25.4 - 25.9) | 2h | ||
Du_NERCSLIP_task5_1 | Du2022a | 36.5 (35.6 - 37.0) | 464531 | 5 minutes, TeslaP40-24GB | |
Du_NERCSLIP_task5_2 | Du2022a | 60.2 (59.7 - 61.7) | 469654 | 1 hour, TeslaV100-32GB | |
Du_NERCSLIP_task5_3 | Du2022a | 42.9 (42.4 - 43.4) | 12091947 | 1 hour, TeslaV100-32GB | |
Du_NERCSLIP_task5_4 | Du2022a | 60.0 (58.5 - 61.5) | 12091947 | 1 hour, TeslaV100-32GB |
Technical reports
BIOACOUSTIC FEW SHOT LEARNING WITH CLASS AUGMENTATION Technical Report
Mariajohn, Aaquila
Mariajohn_DSPC_task5_1
Abstract
This document details the results and techniques used for our submission to the DCASE 2022 Task 5 challenge. The goal is to identify positive events of the target class throughout an audio clip using few-shot learning. Prototypical networks are used for both few-shot training and inference, and the lack of data is compensated for with augmentations.
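The time-shifting and segment-level mirroring augmentations listed in the system characteristics can be sketched in a few lines; the exact shift ranges and segment boundaries used in the submission are not stated, so the versions below are illustrative:

```python
import numpy as np

def time_shift(x, shift):
    """Circularly shift a waveform (or a feature sequence) by `shift`
    samples, a cheap way to vary event position within a window."""
    return np.roll(x, shift)

def segment_mirror(x):
    """Mirror a segment along the time axis, yielding a time-reversed
    copy of the same event as an additional training example."""
    return x[::-1]
```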
System characteristics
Data augmentation | time shifting, segment level mirroring |
System embeddings | False |
Subsystem count | False |
External data usage | directly as additional training data |
FEW-SHOT EMBEDDING LEARNING AND EVENT FILTERING FOR BIOACOUSTIC EVENT DETECTION Technical Report
Tang, Jigang and Xueyang, Zhang and Gao, Tian and Liu, Diyuan and Fang, Xin and Pan, Jia and Wang, Qing and Du, Jan and Xu, Kele and Pan, Qinghua
iFLYTEK Research Institute
Du_NERCSLIP_task5_1 Du_NERCSLIP_task5_2 Du_NERCSLIP_task5_3 Du_NERCSLIP_task5_4
Judges’ award
Abstract
In this technical report, we describe our submission system for DCASE 2022 Task 5: few-shot bioacoustic event detection. We propose several methods to improve the representational ability of embeddings learned under limited positive samples, including a segment-level and a frame-level embedding learning strategy, a model adaptation technique, and an embedding-guided event filtering approach. The event filtering task is trained independently on each test file to improve the discrimination of embeddings between similar events. The proposed system is evaluated on the official validation set, and the best overall F-measure score is 74.4%.
Awards: Judges’ award
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION: DON'T WASTE INFORMATION Technical Report
Hertkorn, Michael
ZF Friedrichshafen AG
Hertkorn_ZF_task5_1 Hertkorn_ZF_task5_2 Hertkorn_ZF_task5_3 Hertkorn_ZF_task5_4
Abstract
In the past, a lot of attention has been dedicated to finding a good neural network architecture, mainly by adopting large architectures from image processing [1]. The parameters of the fixed preprocessing stage, which usually consists of a short-time Fourier transform (STFT) optionally followed by a Mel or Mel-frequency cepstral coefficient (MFCC) transformation, can be made trainable [2]; however, some major parameters stay fixed, such as the window size and the fact that the magnitude of the complex Fourier output is taken. Moreover, a learnable frontend is not desirable in a few-shot training setting. This investigation demonstrates the importance of choosing suitable parameters for the acoustic preprocessing. To this end, a standard CNN with a minor tweak is used and pretraining on the training data is skipped, meaning the model is trained only on the 5 shots provided in the validation and evaluation datasets, similar to the template-matching baseline.
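The two preprocessing choices the report singles out — the STFT window size and the magnitude step that discards phase — are both visible in a minimal spectrogram implementation. The parameter values below (22.05 kHz, a 1024-sample window, a 256-sample hop) are illustrative defaults, not the report's settings:

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=256):
    """Minimal STFT magnitude spectrogram of a 1-D waveform.
    `n_fft` fixes the time/frequency resolution trade-off (bin spacing
    is sr / n_fft), and np.abs discards the phase of the complex FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))
```

At 22.05 kHz, `n_fft=1024` gives roughly 21.5 Hz bin spacing; shrinking the window to 128 samples coarsens that to about 172 Hz but gives eight times finer time steps, which is the kind of trade-off the report argues should be tuned per dataset.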
System characteristics
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIO-ACOUSTIC EVENT DETECTION BASED ON TRANSDUCTIVE LEARNING AND ADAPTED CENTRAL DIFFERENCE CONVOLUTION Technical Report
Huang, Qisheng and Li, Yanxiong and Cao, Wenchang and Chen, Hao
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Huang_SCUT_task5_1
Abstract
In this technical report, we present our submitted system for DCASE 2022 Task 5: few-shot bio-acoustic event detection. Our system employs a transductive learning strategy, data augmentation, and an adapted version of central difference convolution (CDC). Evaluated on the validation set, our method achieves an overall F-measure score of 41.1%.
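Central difference convolution combines a vanilla convolution with a central-difference term that emphasises local contrast. A 1-D sketch of the standard CDC formulation follows (the report uses an *adapted* version whose details may differ; `theta` balances the two terms):

```python
import numpy as np

def cdc_1d(x, w, theta=0.7):
    """1-D central difference convolution (correlation form).
    Per position: y = sum(w * patch) - theta * sum(w) * x_center,
    i.e. vanilla convolution minus a weighted central-difference term.
    theta=0 recovers plain convolution; inputs are zero-padded."""
    pad = len(w) // 2
    xp = np.pad(x, pad)
    y = np.empty(len(x))
    for i in range(len(x)):
        patch = xp[i:i + len(w)]
        y[i] = np.dot(patch, w) - theta * w.sum() * x[i]
    return y
```

This decomposed form (vanilla output minus `theta * sum(w) * center`) is algebraically identical to summing `w * (neighbour - center)` differences, which is why CDC can be implemented as a cheap correction on top of an ordinary convolution.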
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION USING GOOD EMBEDDING MODEL Technical report
Kang, Taein
Chung-Ang University, Seoul, South Korea
Kang_ET_task5_1 Kang_ET_task5_2
Abstract
Few-shot learning is widely used as a benchmark for meta-learning: a few-shot learning algorithm attempts to show how quickly it adapts to test tasks with limited data. Unlike typical image few-shot learning, DCASE 2022 Task 5 [1] examines whether a system can detect the corresponding sound in the remainder of an audio recording when five annotations are given. In this paper, we demonstrate whether an embedding model that has learned bioacoustic information well can perform few-shot learning even with a simple classifier.
System characteristics
Data augmentation | Specaugment, inference-time augmentation |
System embeddings | False |
Subsystem count | 5 |
External data usage | AudioSet |
FEW-SHOT BIOACOUSTIC EVENT DETECTION USING PROTOTYPICAL NETWORKS WITH RESNET CLASSIFIER Technical Report
Li, Ren and Liang, Jinhua and Phan, Huy
Queen Mary University of London, United Kingdom
Li_QMUL_task5_1
Abstract
In this technical report, we describe our submission system for few-shot bioacoustic event detection in DCASE2022 Task 5. Participants are expected to develop a few-shot learning system for detecting mammal and bird sounds from audio recordings. In our system, prototypical networks are used to embed spectrograms into an embedding space and learn a non-linear mapping between data samples. We leverage various data augmentation techniques on mel-spectrograms and introduce a ResNet variant as the classifier. Our experiments demonstrate that the system can achieve an F1-score of 47.88% on the validation data.
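The prototypical-network inference step mentioned above can be sketched in a few lines of NumPy (names are ours; in the actual system a trained network first maps spectrograms to these embeddings):

```python
import numpy as np

def prototypes(support, labels):
    """Class prototype = mean of the support embeddings for that class."""
    classes = np.unique(labels)
    return classes, np.stack([support[labels == c].mean(axis=0)
                              for c in classes])

def classify(queries, protos, classes):
    """Assign each query to the nearest prototype (squared Euclidean
    distance), the standard prototypical-network decision rule."""
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# toy 2-D embeddings: two support samples per class
support = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
classes, protos = prototypes(support, labels)
preds = classify(np.array([[0.0, 0.0], [5.0, 6.0]]), protos, classes)
```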
System characteristics
Data augmentation | time warping, time masking, frequency masking |
System embeddings | False |
Subsystem count | False |
External data usage | False |
BIT SRCB TEAM'S SUBMISSION FOR DCASE2022 TASK5 - FEW-SHOT BIOACOUSTIC EVENT DETECTION Technical Report
Liu, Miao and Zhang, Jianqian and Wang, Lizhong and Peng, Jiawei and Hu, Chenguang and Li, Kaige and Wang, Jing and Ma, Qiuyue
Beijing Institute of Technology, Beijing, China; Samsung Research China-Beijing (SRC-B), Beijing, China
Liu_BIT-SRCB_task5_1 Liu_BIT-SRCB_task5_2 Liu_BIT-SRCB_task5_3
Abstract
In this technical report, we present our system for Task 5 of the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) challenge, i.e. few-shot bioacoustic event detection. First, per-channel energy normalization (PCEN) features are extracted. To improve the diversity of the original audio, data augmentation methods such as SpecAugment are adopted. Then, a prototypical network with convolutional neural networks (CNN) and a transductive inference method are used for few-shot detection in our systems. Finally, we use the aforementioned features as inputs to train our CNN model. Moreover, we merge the prediction results of the improved prototypical network and the transductive inference method for better performance. We evaluate the proposed systems with the overall F-measure over the whole evaluation set, and our best F-measure score on the validation set is 64.77%.
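A minimal sketch of the SpecAugment-style masking mentioned above (mask counts and widths here are illustrative guesses, not the submission's settings):

```python
import numpy as np

def spec_augment(spec, n_freq_masks=1, n_time_masks=1, max_width=8,
                 rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq, time) spectrogram."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        out[f0:f0 + w, :] = 0.0  # frequency mask
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        out[:, t0:t0 + w] = 0.0  # time mask
    return out

augmented = spec_augment(np.ones((40, 100)))
```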
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
SURREY SYSTEM FOR DCASE 2022 TASK 5 : FEW-SHOT BIOACOUSTIC EVENT DETECTION WITH SEGMENT-LEVEL METRIC LEARNING Technical Report
Liu, Haohe and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Wang, Wenwu and Plumbley, Mark D
University of Surrey
Liu_Surrey_task5_1 Liu_Surrey_task5_2 Liu_Surrey_task5_3 Liu_Surrey_task5_4
Abstract
Few-shot audio event detection is the task of detecting the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge few-shot bioacoustic event detection (Task 5). We make better use of the negative data within each sound class to build the loss function, and use transductive inference to gain better adaptation on the evaluation set. For the input feature, we find per-channel energy normalization concatenated with delta mel-frequency cepstral coefficients to be the most effective combination. We also introduce new data augmentation and post-processing procedures for this task. Our final system achieves an F-measure of 68.74 on the DCASE Task 5 validation set, outperforming the baseline performance of 29.5 by a large margin. Our system is fully open-sourced.
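The PCEN front-end the abstract refers to can be sketched directly from Wang et al.'s published formulation (first-order IIR smoother, adaptive gain, root compression). The constants below are common defaults, not necessarily those used by this submission:

```python
import numpy as np

def pcen(E, alpha=0.98, delta=2.0, r=0.5, s=0.025, eps=1e-6):
    """Per-channel energy normalization of a (freq, time) energy
    spectrogram: smooth each channel with a first-order IIR filter,
    divide by the smoothed envelope (adaptive gain), then apply
    root compression with bias delta.
    """
    M = np.zeros_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

On a constant-energy input the output is flat, illustrating PCEN's gain-normalizing behaviour that makes it attractive for recordings with varying background levels.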
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT BIOACOUSTIC EVENT DETECTION USING A PROTOTYPICAL NETWORK ENSEMBLE WITH ADAPTIVE EMBEDDING FUNCTIONS Technical Report
Martinsson, John and Willbo, Martin and Pirinen, Aleksis and Mogren, Olof and Sandsten, Maria
Computer Science, RISE Research Institutes of Sweden, Sweden, Centre for Mathematical Sciences, Lund University, Sweden
Martinsson_RISE_task5_1 Martinsson_RISE_task5_2 Martinsson_RISE_task5_3 Martinsson_RISE_task5_4
Abstract
In this report we present our method for the DCASE 2022 challenge on few-shot bioacoustic event detection. We use an ensemble of prototypical neural networks with adaptive embedding functions and show that both ensemble and adaptive embedding functions can be used to improve results from an average F-score of 41.3% to an average F-score of 60.0% on the validation dataset.
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
A NEW TRANSDUCTIVE FRAMEWORK FOR FEW-SHOT BIOACOUSTIC EVENT DETECTION TASK Technical Report
Tan, Yizhou and Xu, Lifan and Zhu, Chenyang and Li, Shengchen and Ai, Haojun and Shao, Xi
Wuhan University, School of Cyber Science and Engineering, Wuhan, China; Xi’an Jiaotong-Liverpool University, Department of Intelligent Science, School of Advanced Engineering, Suzhou, China; Jiangnan University, School of Artificial Intelligence and Computer Science, Wuxi, China; Nanjing University of Posts and Telecommunications, School of Communication and Information Engineering, Nanjing, China
Tan_WHU_task5_1 Tan_WHU_task5_2 Tan_WHU_task5_3 Tan_WHU_task5_4
Abstract
Few-shot learning is introduced to reduce the data requirements of machine learning, especially when labelling is labour-intensive. Few-shot learning algorithms usually suffer from the unusual feature distribution of the query class, especially in the few-shot bioacoustic event detection task. In this work, a knowledge transfer technique is introduced into the transductive inference process to restrict the feature distribution of a newly appeared class to a dedicated sub-space, while adapting the feature distribution of existing classes. The proposed system outperforms a traditional few-shot learning system on the development dataset provided by the bioacoustic event detection task (Task 5) of the DCASE 2022 data challenge. The F-measure score on the validation split of the development dataset reaches 57.40.
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
WIDE RESNET MODELS FOR FEW-SHOT SOUND EVENT DETECTION Technical report
Willbo, Martin and Martinsson, John and Pirinen, Aleksis and Mogren, Olof
Computer Science, RISE Research Institutes of Sweden, Sweden, Centre for Mathematical Sciences, Lund University, Sweden
Willbo_RISE_task5_1 Willbo_RISE_task5_2 Willbo_RISE_task5_3 Willbo_RISE_task5_4
Abstract
In this technical report we describe our few-shot sound event detection (SED) systems used to generate predictions for the DCASE 2022 Task 5 challenge. At the core of the SED systems is a wider variant of ResNet-18, i.e., each block throughout the depth of the network has more convolutional filters. In addition, for one of the submissions we include what we believe to be a novel approach to semi-supervised learning for prototypical networks. For both the fully supervised and semi-supervised methods we showcase the importance of calibrating the probability thresholds in the few-shot learning tasks, and provide a simple procedure for finding them.
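The threshold-calibration idea can be illustrated with a simple grid search that maximizes F1 on labelled segments (a stand-in sketch under our own assumptions; the authors' actual procedure may differ):

```python
import numpy as np

def calibrate_threshold(probs, labels, grid=None):
    """Pick the probability threshold that maximizes F1 on held-out
    (e.g. support-derived) segments. probs: per-segment event
    probabilities; labels: 1 = event, 0 = background.
    """
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = probs >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

best_t, best_f1 = calibrate_threshold(
    np.array([0.9, 0.8, 0.2, 0.1]), np.array([1, 1, 0, 0]))
```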
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
FEW-SHOT CONTINUAL LEARNING FOR BIOACOUSTIC EVENT DETECTION Technical Report
Wu, Xiaoxiao and Long, Yanhua
Shanghai Normal University, Shanghai, China
Wu_SHNU_task5_1
Abstract
In this technical report, we describe our submission system for DCASE2022 Task 5: few-shot bioacoustic event detection. In this submission, a few-shot continual learning framework is used for bioacoustic event detection, in which a trained base classifier can be continuously expanded to detect novel classes with only a few labelled examples at inference time. On the official validation set, the proposed continual learning framework achieves an overall F-measure of 53.876%.
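The idea of expanding a trained base classifier with novel classes at inference time can be sketched with a nearest-class-mean classifier (a toy illustration of the continual-learning mechanism, not the submission's actual neural architecture):

```python
import numpy as np

class NearestMeanClassifier:
    """A classifier that can be continually expanded at inference
    time: each new class is added by storing the mean embedding of
    its few labelled shots, without retraining existing classes.
    """
    def __init__(self):
        self.names, self.means = [], []

    def add_class(self, name, shots):
        # shots: a few labelled embeddings for the novel class
        self.names.append(name)
        self.means.append(np.asarray(shots, dtype=float).mean(axis=0))

    def predict(self, x):
        d = [np.sum((np.asarray(x, dtype=float) - m) ** 2)
             for m in self.means]
        return self.names[int(np.argmin(d))]

clf = NearestMeanClassifier()
clf.add_class("base_call", [[0, 0], [0, 1]])   # from base training
clf.add_class("novel_call", [[5, 5]])          # added from 1 shot
```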
System characteristics
Data augmentation | Time masking,Frequency masking |
System embeddings | False |
Subsystem count | False |
External data usage | False |
IMPROVED PROTOTYPICAL NETWORK WITH DATA AUGMENTATION Technical Report
Dongchao Yang and Helin Wang and Zhongjie Ye and Yuexian Zou
Peking University, School of ECE, Shenzhen, China; Xiaomi Corporation, Beijing, China
Zou_PKU_task5_1 Zou_PKU_task5_2 Zou_PKU_task5_3 Zou_PKU_task5_4
Abstract
In this technical report, we describe our few-shot bioacoustic event detection methods submitted to the Detection and Classification of Acoustic Scenes and Events Challenge 2022, Task 5. We follow our previous work and further improve our model through a data augmentation strategy. Specifically, we analyze why prototypical networks cannot perform well, and propose to use transductive inference for few-shot learning. Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task, in conjunction with a supervision loss based on the support set. Furthermore, we use multiple data augmentation strategies to improve the feature extractor, including time and frequency masking, mixup, and so on. Experimental results indicate that our model performs better than the baseline, with an F1 score of about 51.9% on the evaluation set.
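The mutual-information objective described above (confident per-query predictions that stay balanced across classes) can be written down directly. This NumPy sketch computes the empirical quantity being maximized; the full training loss also includes the support-set supervision term:

```python
import numpy as np

def mutual_information(probs, eps=1e-12):
    """Empirical mutual information between queries and predicted
    labels: entropy of the marginal prediction distribution minus
    the mean per-query prediction entropy.
    probs: (n_queries, n_classes) softmax outputs.
    """
    marginal = probs.mean(axis=0)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))
    h_cond = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return h_marginal - h_cond
```

Confident, class-balanced predictions score close to log(n_classes); uniform (maximally uncertain) predictions score close to zero.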
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
SIAMESE NETWORK FOR FEW-SHOT BIOACOUSTIC EVENT DETECTION Technical report
Zgorzynski, Bartlomiej and Matuszewski, Mateusz
Samsung R&D Institute, Poland
ZGORZYNSKI_SRPOL_task5_1 ZGORZYNSKI_SRPOL_task5_2 ZGORZYNSKI_SRPOL_task5_3 ZGORZYNSKI_SRPOL_task5_4
System characteristics
Data augmentation | False |
System embeddings | False |
Subsystem count | False |
External data usage | False |
A META-LEARNING FRAMEWORK FOR FEW-SHOT SOUND EVENT DETECTION Technical Report
Zhang, Tianyang and Wang, Yuyang and Wang, Ying
Chongqing University, Shapingba
Zhang_CQU_task5_1 Zhang_CQU_task5_2 Zhang_CQU_task5_3 Zhang_CQU_task5_4
Abstract
The report presents our submission to the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) challenge, Task 5. This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. The main issue of this task is that only five exemplar vocalisations (shots) of mammals or birds are available. In this paper, we propose a meta-learning framework for the few-shot bioacoustic event detection challenge. Maximizing inter-class distance and minimizing intra-class distance (MIMI) is used as the criterion to fine-tune the embedding network for few-shot tasks. Experimental results indicate that our framework performs better than the baseline, with an F1 score of about 46.51% on the evaluation set.
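One simple way to make the MIMI criterion concrete is a ratio of mean inter-class centroid distance to mean intra-class distance (our illustrative formulation; the paper's exact criterion may differ):

```python
import numpy as np

def mimi_criterion(embeddings, labels):
    """Contrast mean inter-class distance (between class centroids)
    with mean intra-class distance (samples to their own centroid).
    Larger values mean classes are tighter and further apart.
    Assumes at least two classes are present.
    """
    classes = np.unique(labels)
    cents = np.stack([embeddings[labels == c].mean(0) for c in classes])
    intra = np.mean([np.linalg.norm(embeddings[labels == c] - cents[i],
                                    axis=1).mean()
                     for i, c in enumerate(classes)])
    inter = np.mean([np.linalg.norm(cents[i] - cents[j])
                     for i in range(len(classes))
                     for j in range(i + 1, len(classes))])
    return inter / (intra + 1e-12)
```

Fine-tuning an embedding network to increase such a criterion pushes the few-shot classes apart while keeping each class compact, which is the stated goal of MIMI.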
System characteristics
System embeddings | False |
Subsystem count | False |
External data usage | False |