Task description
The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. The challenge remains to exploit a large amount of unbalanced and unlabeled training data together with a small weakly annotated training set to improve system performance. Isolated sound events, background sound files, and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered reliable.
A more detailed task description can be found on the task description page.
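Concretely, most submitted systems produce frame-level class posteriors and obtain the required time boundaries by thresholding and smoothing them; the median-filter window lengths quoted in the machine learning characteristics table below (e.g. 93 ms or 340 ms) refer to this step. A minimal sketch of the idea, assuming a 16 ms frame hop and a 0.5 threshold (illustrative values, not those of any particular submission):

```python
import numpy as np
from scipy.signal import medfilt

def probs_to_events(probs, threshold=0.5, kernel=7, hop_s=0.016):
    """Turn frame-wise posteriors for one class into (onset, offset) pairs in seconds."""
    smoothed = medfilt(probs, kernel_size=kernel)      # suppress spurious frame-level spikes
    active = smoothed > threshold                      # binary activity mask per frame
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0]                   # rising edges: event starts
    offsets = np.where(edges == -1)[0]                 # falling edges: event ends
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]
```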
Systems ranking
Rank | Submission code | Submission name | Technical Report | Sound Separation | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) | PSDS 1 (Development dataset) | PSDS 2 (Development dataset)
---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.245 | 0.452 | 0.313 | 0.535 | ||
Hafsati_TUITO_task4_SED_3 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.91 | 0.287 | 0.502 | 0.325 | 0.561 | ||
Hafsati_TUITO_task4_SED_4 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.91 | 0.287 | 0.502 | 0.325 | 0.561 | ||
Hafsati_TUITO_task4_SED_1 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.03 | 0.334 | 0.549 | 0.345 | 0.555 | ||
Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.04 | 0.336 | 0.550 | 0.345 | 0.555 | ||
Gong_TAL_task4_SED_3 | TAL SED system | Gong2021 | 1.16 | 0.370 | 0.626 | 0.407 | 0.653 | ||
Gong_TAL_task4_SED_2 | TAL SED system | Gong2021 | 1.15 | 0.367 | 0.616 | 0.407 | 0.648 | ||
Gong_TAL_task4_SED_1 | TAL SED system | Gong2021 | 1.14 | 0.364 | 0.611 | 0.398 | 0.642 | ||
Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park2021 | 1.07 | 0.327 | 0.603 | 0.524 | 0.674 | ||
Park_JHU_task4_SED_4 | Park_JHU_task4_SED_4 | Park2021 | 0.86 | 0.237 | 0.524 | 0.446 | 0.561 | ||
Park_JHU_task4_SED_1 | Park_JHU_task4_SED_1 | Park2021 | 1.01 | 0.305 | 0.579 | 0.508 | 0.668 | ||
Park_JHU_task4_SED_3 | Park_JHU_task4_SED_3 | Park2021 | 0.84 | 0.222 | 0.537 | 0.456 | 0.596 | ||
Zheng_USTC_task4_SED_4 | DCASE2020 SED Mean teacher system 4 | Zheng2021 | 1.30 | 0.389 | 0.742 | 0.402 | 0.786 | ||
Zheng_USTC_task4_SED_1 | DCASE2020 SED Mean teacher system 1 | Zheng2021 | 1.33 | 0.452 | 0.669 | 0.454 | 0.671 | ||
Zheng_USTC_task4_SED_3 | DCASE2020 SED Mean teacher system 3 | Zheng2021 | 1.29 | 0.386 | 0.746 | 0.397 | 0.788 | ||
Zheng_USTC_task4_SED_2 | DCASE2020 SED Mean teacher system 2 | Zheng2021 | 1.33 | 0.447 | 0.676 | 0.454 | 0.680 | ||
Nam_KAIST_task4_SED_2 | SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 | Nam2021 | 1.19 | 0.399 | 0.609 | 0.434 | 0.639 | ||
Nam_KAIST_task4_SED_1 | SED_default | Nam2021 | 1.16 | 0.378 | 0.617 | 0.423 | 0.658 | ||
Nam_KAIST_task4_SED_3 | SED_AFL | Nam2021 | 1.09 | 0.324 | 0.634 | 0.381 | 0.692 | ||
Nam_KAIST_task4_SED_4 | Weak_SED | Nam2021 | 0.75 | 0.059 | 0.715 | 0.064 | 0.816 | ||
Koo_SGU_task4_SED_2 | DCASE2021 SED system using wav2vec | Koo2021 | 0.12 | 0.044 | 0.059 | 0.316 | 0.337 | ||
Koo_SGU_task4_SED_3 | DCASE2021 SED system using wav2vec | Koo2021 | 0.41 | 0.058 | 0.348 | 0.249 | 0.711 | ||
Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo2021 | 0.74 | 0.258 | 0.364 | 0.295 | 0.503 | ||
deBenito_AUDIAS_task4_SED_4 | 5-Resolution Mean Teacher | de Benito-Gorron2021 | 1.10 | 0.361 | 0.577 | 0.386 | 0.600 | ||
deBenito_AUDIAS_task4_SED_1 | 3-Resolution Mean Teacher | de Benito-Gorron2021 | 1.07 | 0.343 | 0.571 | 0.380 | 0.589 | ||
deBenito_AUDIAS_task4_SED_2 | 3-Resolution Mean Teacher (Higher time resolutions) | de Benito-Gorron2021 | 1.10 | 0.363 | 0.574 | 0.386 | 0.578 | ||
deBenito_AUDIAS_task4_SED_3 | 4-Resolution Mean Teacher | de Benito-Gorron2021 | 1.07 | 0.345 | 0.571 | 0.372 | 0.600 | ||
Baseline_SSep_SED | DCASE2021 SSep SED baseline system | turpault2020b | 1.11 | 0.364 | 0.580 | 0.342 | 0.527 | ||
Boes_KUL_task4_SED_4 | CRNN with optimized pooling operations for scenario 2 (2) | Boes2021 | 0.60 | 0.117 | 0.457 | 0.154 | 0.729 | ||
Boes_KUL_task4_SED_3 | CRNN with optimized pooling operations for scenario 2 (1) | Boes2021 | 0.68 | 0.121 | 0.531 | 0.158 | 0.731 | ||
Boes_KUL_task4_SED_2 | CRNN with optimized pooling operations for scenario 1 (2) | Boes2021 | 0.77 | 0.233 | 0.440 | 0.359 | 0.601 | ||
Boes_KUL_task4_SED_1 | CRNN with optimized pooling operations for scenario 1 (1) | Boes2021 | 0.81 | 0.253 | 0.442 | 0.361 | 0.593 | ||
Ebbers_UPB_task4_SED_2 | UPB sytem 2 | Ebbers2021 | 1.10 | 0.335 | 0.621 | 0.377 | 0.748 | ||
Ebbers_UPB_task4_SED_4 | UPB sytem 4 | Ebbers2021 | 1.16 | 0.363 | 0.637 | 0.393 | 0.758 | ||
Ebbers_UPB_task4_SED_3 | UPB sytem 3 | Ebbers2021 | 1.24 | 0.416 | 0.635 | 0.454 | 0.726 | ||
Ebbers_UPB_task4_SED_1 | UPB sytem 1 | Ebbers2021 | 1.16 | 0.373 | 0.621 | 0.429 | 0.727 | ||
Zhu_AIAL-XJU_task4_SED_2 | Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.99 | 0.290 | 0.574 | 0.342 | 0.614 | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 0.318 | 0.583 | 0.354 | 0.613 | ||
Liu_BUPT_task4_4 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.37 | 0.102 | 0.231 | 0.348 | 0.551 | ||
Liu_BUPT_task4_1 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.30 | 0.090 | 0.169 | 0.348 | 0.551 | ||
Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.54 | 0.152 | 0.322 | 0.348 | 0.551 | ||
Liu_BUPT_task4_3 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.24 | 0.068 | 0.146 | 0.348 | 0.551 | ||
Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera2021 | 0.98 | 0.338 | 0.481 | ||||
Olvera_INRIA_task4_SED_1 | DA-SED + FG/BG | Olvera2021 | 0.95 | 0.332 | 0.462 | ||||
Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim2021 | 1.32 | 0.442 | 0.674 | 0.457 | 0.685 | ||
Kim_AiTeR_GIST_SED_2 | RCRNN-based noisy student SED | Kim2021 | 1.31 | 0.439 | 0.667 | 0.450 | 0.682 | ||
Kim_AiTeR_GIST_SED_3 | RCRNN-based noisy student SED | Kim2021 | 1.30 | 0.434 | 0.669 | 0.451 | 0.679 | ||
Kim_AiTeR_GIST_SED_1 | RCRNN-based noisy student SED | Kim2021 | 1.29 | 0.431 | 0.661 | 0.449 | 0.675 | ||
Cai_SMALLRICE_task4_SED_1 | DCASE2021_Cai_SED_CDur_Ensemble_1 | Dinkel2021 | 1.11 | 0.361 | 0.584 | 0.375 | 0.619 | ||
Cai_SMALLRICE_task4_SED_2 | DCASE2021_Cai_SED_CDur_Ensemble_2 | Dinkel2021 | 1.13 | 0.373 | 0.585 | 0.382 | 0.622 | ||
Cai_SMALLRICE_task4_SED_3 | DCASE2021_Cai_SED_CDur_Ensemble_3 | Dinkel2021 | 1.13 | 0.370 | 0.596 | 0.381 | 0.629 | ||
Cai_SMALLRICE_task4_SED_4 | DCASE2021_Cai_SED_CDur_Single_4 | Dinkel2021 | 1.00 | 0.339 | 0.504 | 0.369 | 0.571 | ||
HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYu2021 | 0.90 | 0.294 | 0.473 | 0.134 | 0.557 | ||
HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | YuHang2021 | 0.61 | 0.098 | 0.496 | 0.340 | 0.523 | ||
Yu_NCUT_task4_SED_1 | multi-scale CRNN | Yu2021 | 0.20 | 0.038 | 0.157 | 0.330 | 0.540 | ||
Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu2021 | 0.92 | 0.301 | 0.485 | 0.110 | 0.610 | ||
lu_kwai_task4_SED_1 | DCASE2021 SED CRNN Model1 | Lu2021 | 1.27 | 0.419 | 0.660 | 0.419 | 0.638 | ||
lu_kwai_task4_SED_4 | DCASE2021 SED Conformer Model2 | Lu2021 | 0.88 | 0.157 | 0.685 | 0.177 | 0.749 | ||
lu_kwai_task4_SED_3 | DCASE2021 SED Conformer Model1 | Lu2021 | 0.86 | 0.148 | 0.686 | 0.173 | 0.752 | ||
lu_kwai_task4_SED_2 | DCASE2021 SED CRNN Model2 | Lu2021 | 1.25 | 0.412 | 0.651 | 0.418 | 0.637 | ||
Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 0.302 | 0.507 | 0.360 | 0.550 | ||
Liu_BUPT_task4_SS_SED_1 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 0.302 | 0.507 | 0.360 | 0.550 | ||
Tian_ICT-TOSHIBA_task4_SED_2 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 0.411 | 0.585 | 0.396 | 0.587 | ||
Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 0.413 | 0.586 | 0.401 | 0.597 | ||
Tian_ICT-TOSHIBA_task4_SED_4 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 0.412 | 0.586 | 0.398 | 0.599 | ||
Tian_ICT-TOSHIBA_task4_SED_3 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.18 | 0.409 | 0.584 | 0.392 | 0.585 | ||
Yao_GUET_task4_SED_3 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.88 | 0.279 | 0.479 | 0.328 | 0.530 | ||
Yao_GUET_task4_SED_1 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.88 | 0.277 | 0.482 | 0.332 | 0.533 | ||
Yao_GUET_task4_SED_2 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.54 | 0.056 | 0.496 | 0.060 | 0.618 | ||
Liang_SHNU_task4_SED_4 | Guided Learning system | Liang2021 | 0.99 | 0.313 | 0.543 | 0.328 | 0.575 | ||
Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik2021 | 1.02 | 0.330 | 0.544 | 0.374 | 0.586 | ||
Bajzik_UNIZA_task4_SED_1 | CAM-based SED system | Bajzik2021 | 0.45 | 0.133 | 0.266 | 0.165 | 0.348 | ||
Liang_SHNU_task4_SSep_SED_3 | Mean teacher system | Liang_SS2021 | 0.99 | 0.304 | 0.559 | 0.426 | 0.726 | ||
Liang_SHNU_task4_SSep_SED_1 | Mean teacher system | Liang_SS2021 | 1.03 | 0.313 | 0.588 | 0.428 | 0.736 | ||
Liang_SHNU_task4_SSep_SED_2 | Mean teacher system | Liang_SS2021 | 1.01 | 0.325 | 0.542 | 0.418 | 0.721 | ||
Baseline_SED | DCASE2021 SED baseline system | turpault2020a | 1.00 | 0.315 | 0.547 | 0.342 | 0.527 | ||
Wang_NSYSU_task4_SED_1 | DCASE2021_SED_A | Wang2021 | 1.13 | 0.336 | 0.646 | 0.407 | 0.703 | ||
Wang_NSYSU_task4_SED_4 | DCASE2021_SED_D | Wang2021 | 1.09 | 0.304 | 0.662 | 0.370 | 0.724 | ||
Wang_NSYSU_task4_SED_2 | DCASE2021_SED_B | Wang2021 | 0.69 | 0.070 | 0.636 | 0.061 | 0.808 | ||
Wang_NSYSU_task4_SED_3 | DCASE2021_SED_C | Wang2021 | 1.13 | 0.339 | 0.649 | 0.388 | 0.672 |
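The ranking score is not defined on this page, but the values above are consistent with averaging each system's two PSDS values after normalising them by the corresponding Baseline_SED scores (0.315 and 0.547 on the evaluation dataset), which is why the baseline's ranking score is exactly 1.00. A minimal sketch under that assumption:

```python
BASELINE_PSDS1, BASELINE_PSDS2 = 0.315, 0.547   # Baseline_SED, evaluation dataset

def ranking_score(psds1: float, psds2: float) -> float:
    """Average of the two PSDS scenarios, each normalised by the baseline."""
    return 0.5 * (psds1 / BASELINE_PSDS1 + psds2 / BASELINE_PSDS2)

# Na_BUPT_task4_SED_1: 0.5 * (0.245/0.315 + 0.452/0.547) = 0.80, matching the table.
```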
Supplementary metrics
Rank | Submission code | Submission name | Technical Report | Sound Separation | PSDS 1 (Evaluation dataset) | PSDS 1 (Public evaluation) | PSDS 1 (Vimeo dataset) | PSDS 2 (Evaluation dataset) | PSDS 2 (Public evaluation) | PSDS 2 (Vimeo dataset) | F-score (Evaluation dataset) | F-score (Public evaluation) | F-score (Vimeo dataset)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na2021 | 0.245 | 0.269 | 0.185 | 0.452 | 0.485 | 0.354 | 25.0 | 27.5 | 19.5 | ||
Hafsati_TUITO_task4_SED_3 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.287 | 0.321 | 0.207 | 0.502 | 0.547 | 0.386 | 35.7 | 39.2 | 27.4 | ||
Hafsati_TUITO_task4_SED_4 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.287 | 0.322 | 0.209 | 0.502 | 0.547 | 0.387 | 37.2 | 40.9 | 28.0 | ||
Hafsati_TUITO_task4_SED_1 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.334 | 0.370 | 0.249 | 0.549 | 0.591 | 0.437 | 39.5 | 43.8 | 29.0 | ||
Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.336 | 0.374 | 0.249 | 0.550 | 0.591 | 0.440 | 40.9 | 44.9 | 31.3 | ||
Gong_TAL_task4_SED_3 | TAL SED system | Gong2021 | 0.370 | 0.419 | 0.273 | 0.626 | 0.672 | 0.509 | 41.9 | 45.1 | 34.0 | ||
Gong_TAL_task4_SED_2 | TAL SED system | Gong2021 | 0.367 | 0.409 | 0.271 | 0.616 | 0.654 | 0.512 | 42.7 | 45.9 | 34.8 | ||
Gong_TAL_task4_SED_1 | TAL SED system | Gong2021 | 0.364 | 0.409 | 0.266 | 0.611 | 0.661 | 0.486 | 41.5 | 44.9 | 33.0 | ||
Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park2021 | 0.327 | 0.371 | 0.240 | 0.603 | 0.644 | 0.492 | 38.4 | 42.2 | 28.9 | ||
Park_JHU_task4_SED_4 | Park_JHU_task4_SED_4 | Park2021 | 0.237 | 0.267 | 0.174 | 0.524 | 0.568 | 0.417 | 36.9 | 39.8 | 29.6 | ||
Park_JHU_task4_SED_1 | Park_JHU_task4_SED_1 | Park2021 | 0.305 | 0.344 | 0.214 | 0.579 | 0.617 | 0.471 | 34.7 | 37.8 | 26.9 | ||
Park_JHU_task4_SED_3 | Park_JHU_task4_SED_3 | Park2021 | 0.222 | 0.244 | 0.166 | 0.537 | 0.578 | 0.430 | 33.5 | 35.4 | 28.5 | ||
Zheng_USTC_task4_SED_4 | DCASE2020 SED Mean teacher system 4 | Zheng2021 | 0.389 | 0.438 | 0.261 | 0.742 | 0.775 | 0.644 | 49.5 | 54.2 | 36.9 | ||
Zheng_USTC_task4_SED_1 | DCASE2020 SED Mean teacher system 1 | Zheng2021 | 0.452 | 0.517 | 0.318 | 0.669 | 0.725 | 0.530 | 52.3 | 57.4 | 39.2 | ||
Zheng_USTC_task4_SED_3 | DCASE2020 SED Mean teacher system 3 | Zheng2021 | 0.386 | 0.429 | 0.270 | 0.746 | 0.778 | 0.650 | 49.7 | 55.0 | 36.3 | ||
Zheng_USTC_task4_SED_2 | DCASE2020 SED Mean teacher system 2 | Zheng2021 | 0.447 | 0.506 | 0.318 | 0.676 | 0.730 | 0.546 | 52.9 | 57.7 | 40.2 | ||
Nam_KAIST_task4_SED_2 | SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 | Nam2021 | 0.399 | 0.443 | 0.299 | 0.609 | 0.641 | 0.525 | 48.0 | 52.2 | 37.1 | ||
Nam_KAIST_task4_SED_1 | SED_default | Nam2021 | 0.378 | 0.426 | 0.285 | 0.617 | 0.666 | 0.506 | 44.2 | 47.8 | 34.5 | ||
Nam_KAIST_task4_SED_3 | SED_AFL | Nam2021 | 0.324 | 0.364 | 0.235 | 0.634 | 0.672 | 0.536 | 29.3 | 32.3 | 22.7 | ||
Nam_KAIST_task4_SED_4 | Weak_SED | Nam2021 | 0.059 | 0.069 | 0.022 | 0.715 | 0.750 | 0.616 | 12.5 | 13.1 | 11.4 | ||
Koo_SGU_task4_SED_2 | DCASE2021 SED system using wav2vec | Koo2021 | 0.044 | 0.050 | 0.024 | 0.059 | 0.057 | 0.047 | 12.4 | 13.8 | 9.4 | ||
Koo_SGU_task4_SED_3 | DCASE2021 SED system using wav2vec | Koo2021 | 0.058 | 0.060 | 0.048 | 0.348 | 0.406 | 0.257 | 8.5 | 9.0 | 7.3 | ||
Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo2021 | 0.258 | 0.282 | 0.183 | 0.364 | 0.401 | 0.241 | 20.5 | 22.2 | 16.2 | ||
deBenito_AUDIAS_task4_SED_4 | 5-Resolution Mean Teacher | de Benito-Gorron2021 | 0.361 | 0.405 | 0.262 | 0.577 | 0.635 | 0.443 | 42.7 | 46.7 | 32.7 | ||
deBenito_AUDIAS_task4_SED_1 | 3-Resolution Mean Teacher | de Benito-Gorron2021 | 0.343 | 0.387 | 0.245 | 0.571 | 0.628 | 0.439 | 42.6 | 46.4 | 33.2 | ||
deBenito_AUDIAS_task4_SED_2 | 3-Resolution Mean Teacher (Higher time resolutions) | de Benito-Gorron2021 | 0.363 | 0.406 | 0.265 | 0.574 | 0.630 | 0.449 | 43.1 | 47.0 | 33.6 | ||
deBenito_AUDIAS_task4_SED_3 | 4-Resolution Mean Teacher | de Benito-Gorron2021 | 0.345 | 0.383 | 0.255 | 0.571 | 0.628 | 0.438 | 42.2 | 46.4 | 31.6 | ||
Baseline_SSep_SED | DCASE2021 SSep SED baseline system | turpault2020b | 0.364 | 0.407 | 0.283 | 0.580 | 0.627 | 0.471 | 42.0 | 44.9 | 34.7 | ||
Boes_KUL_task4_SED_4 | CRNN with optimized pooling operations for scenario 2 (2) | Boes2021 | 0.117 | 0.131 | 0.078 | 0.457 | 0.500 | 0.346 | 10.6 | 11.8 | 7.9 | ||
Boes_KUL_task4_SED_3 | CRNN with optimized pooling operations for scenario 2 (1) | Boes2021 | 0.121 | 0.139 | 0.081 | 0.531 | 0.555 | 0.435 | 14.0 | 15.9 | 9.3 | ||
Boes_KUL_task4_SED_2 | CRNN with optimized pooling operations for scenario 1 (2) | Boes2021 | 0.233 | 0.266 | 0.143 | 0.440 | 0.489 | 0.310 | 31.2 | 34.4 | 22.6 | ||
Boes_KUL_task4_SED_1 | CRNN with optimized pooling operations for scenario 1 (1) | Boes2021 | 0.253 | 0.290 | 0.150 | 0.442 | 0.483 | 0.319 | 31.0 | 34.7 | 21.3 | ||
Ebbers_UPB_task4_SED_2 | UPB sytem 2 | Ebbers2021 | 0.335 | 0.369 | 0.269 | 0.621 | 0.661 | 0.519 | 54.1 | 57.2 | 46.7 | ||
Ebbers_UPB_task4_SED_4 | UPB sytem 4 | Ebbers2021 | 0.363 | 0.407 | 0.285 | 0.637 | 0.683 | 0.533 | 56.7 | 59.6 | 49.4 | ||
Ebbers_UPB_task4_SED_3 | UPB sytem 3 | Ebbers2021 | 0.416 | 0.455 | 0.328 | 0.635 | 0.684 | 0.519 | 56.7 | 59.6 | 49.4 | ||
Ebbers_UPB_task4_SED_1 | UPB sytem 1 | Ebbers2021 | 0.373 | 0.410 | 0.300 | 0.621 | 0.661 | 0.516 | 54.1 | 57.2 | 46.7 | ||
Zhu_AIAL-XJU_task4_SED_2 | Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.290 | 0.319 | 0.216 | 0.574 | 0.640 | 0.438 | 43.0 | 47.1 | 33.0 | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 0.318 | 0.357 | 0.238 | 0.583 | 0.641 | 0.451 | 40.2 | 43.5 | 32.3 | ||
Liu_BUPT_task4_4 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.102 | 0.123 | 0.043 | 0.231 | 0.244 | 0.165 | 17.5 | 19.4 | 12.9 | ||
Liu_BUPT_task4_1 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.090 | 0.101 | 0.040 | 0.169 | 0.176 | 0.110 | 18.1 | 19.6 | 13.8 | ||
Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.152 | 0.173 | 0.099 | 0.322 | 0.347 | 0.234 | 23.6 | 25.7 | 18.2 | ||
Liu_BUPT_task4_3 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.068 | 0.086 | 0.012 | 0.146 | 0.152 | 0.104 | 15.1 | 16.9 | 10.8 | ||
Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera2021 | 0.338 | 0.382 | 0.218 | 0.481 | 0.528 | 0.357 | 43.4 | 48.4 | 30.0 | ||
Olvera_INRIA_task4_SED_1 | DA-SED + FG/BG | Olvera2021 | 0.332 | 0.375 | 0.205 | 0.462 | 0.506 | 0.333 | 45.5 | 50.2 | 33.1 | ||
Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim2021 | 0.442 | 0.492 | 0.330 | 0.674 | 0.715 | 0.573 | 50.6 | 53.3 | 43.5 | ||
Kim_AiTeR_GIST_SED_2 | RCRNN-based noisy student SED | Kim2021 | 0.439 | 0.492 | 0.319 | 0.667 | 0.710 | 0.564 | 50.5 | 53.3 | 43.0 | ||
Kim_AiTeR_GIST_SED_3 | RCRNN-based noisy student SED | Kim2021 | 0.434 | 0.481 | 0.326 | 0.669 | 0.709 | 0.570 | 49.4 | 52.4 | 41.8 | ||
Kim_AiTeR_GIST_SED_1 | RCRNN-based noisy student SED | Kim2021 | 0.431 | 0.478 | 0.320 | 0.661 | 0.702 | 0.554 | 49.9 | 52.3 | 43.6 | ||
Cai_SMALLRICE_task4_SED_1 | DCASE2021_Cai_SED_CDur_Ensemble_1 | Dinkel2021 | 0.361 | 0.406 | 0.239 | 0.584 | 0.654 | 0.418 | 37.8 | 41.4 | 28.3 | ||
Cai_SMALLRICE_task4_SED_2 | DCASE2021_Cai_SED_CDur_Ensemble_2 | Dinkel2021 | 0.373 | 0.423 | 0.243 | 0.585 | 0.652 | 0.422 | 38.8 | 41.9 | 30.3 | ||
Cai_SMALLRICE_task4_SED_3 | DCASE2021_Cai_SED_CDur_Ensemble_3 | Dinkel2021 | 0.370 | 0.419 | 0.241 | 0.596 | 0.662 | 0.433 | 38.8 | 42.0 | 30.7 | ||
Cai_SMALLRICE_task4_SED_4 | DCASE2021_Cai_SED_CDur_Single_4 | Dinkel2021 | 0.339 | 0.386 | 0.212 | 0.504 | 0.561 | 0.356 | 38.4 | 42.0 | 29.5 | ||
HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYu2021 | 0.294 | 0.327 | 0.205 | 0.473 | 0.510 | 0.350 | 34.2 | 37.8 | 25.5 | ||
HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | YuHang2021 | 0.098 | 0.104 | 0.090 | 0.496 | 0.515 | 0.391 | 10.7 | 11.7 | 8.7 | ||
Yu_NCUT_task4_SED_1 | multi-scale CRNN | Yu2021 | 0.038 | 0.039 | 0.045 | 0.157 | 0.182 | 0.144 | 6.8 | 7.9 | 4.0 | ||
Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu2021 | 0.301 | 0.341 | 0.197 | 0.485 | 0.528 | 0.360 | 34.4 | 37.8 | 25.7 | ||
lu_kwai_task4_SED_1 | DCASE2021 SED CRNN Model1 | Lu2021 | 0.419 | 0.468 | 0.314 | 0.660 | 0.702 | 0.556 | 45.0 | 48.6 | 36.0 | ||
lu_kwai_task4_SED_4 | DCASE2021 SED Conformer Model2 | Lu2021 | 0.157 | 0.177 | 0.125 | 0.685 | 0.714 | 0.598 | 15.7 | 16.6 | 14.6 | ||
lu_kwai_task4_SED_3 | DCASE2021 SED Conformer Model1 | Lu2021 | 0.148 | 0.170 | 0.114 | 0.686 | 0.715 | 0.597 | 15.6 | 16.7 | 14.0 | ||
lu_kwai_task4_SED_2 | DCASE2021 SED CRNN Model2 | Lu2021 | 0.412 | 0.461 | 0.313 | 0.651 | 0.694 | 0.550 | 45.5 | 48.9 | 36.9 | ||
Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.302 | 0.328 | 0.235 | 0.507 | 0.537 | 0.410 | 37.6 | 40.5 | 30.5 | ||
Liu_BUPT_task4_SS_SED_1 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.302 | 0.328 | 0.235 | 0.507 | 0.537 | 0.410 | 38.4 | 40.9 | 32.2 | ||
Tian_ICT-TOSHIBA_task4_SED_2 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 0.411 | 0.462 | 0.307 | 0.585 | 0.639 | 0.473 | 38.3 | 41.2 | 31.6 | ||
Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 0.413 | 0.468 | 0.306 | 0.586 | 0.640 | 0.473 | 38.3 | 41.2 | 31.6 | ||
Tian_ICT-TOSHIBA_task4_SED_4 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 0.412 | 0.467 | 0.306 | 0.586 | 0.639 | 0.473 | 38.3 | 41.2 | 31.6 | ||
Tian_ICT-TOSHIBA_task4_SED_3 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 0.409 | 0.456 | 0.307 | 0.584 | 0.637 | 0.472 | 38.3 | 41.2 | 31.6 | ||
Yao_GUET_task4_SED_3 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.279 | 0.312 | 0.197 | 0.479 | 0.526 | 0.357 | 34.2 | 37.1 | 27.4 | ||
Yao_GUET_task4_SED_1 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.277 | 0.305 | 0.215 | 0.482 | 0.510 | 0.388 | 31.9 | 34.2 | 26.4 | ||
Yao_GUET_task4_SED_2 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.056 | 0.064 | 0.048 | 0.496 | 0.529 | 0.389 | 8.9 | 9.5 | 7.5 | ||
Liang_SHNU_task4_SED_4 | Guided Learning system | Liang2021 | 0.313 | 0.349 | 0.226 | 0.543 | 0.589 | 0.422 | 36.0 | 39.5 | 27.5 | ||
Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik2021 | 0.330 | 0.383 | 0.216 | 0.544 | 0.602 | 0.398 | 39.8 | 43.7 | 30.1 | ||
Bajzik_UNIZA_task4_SED_1 | CAM-based SED system | Bajzik2021 | 0.133 | 0.140 | 0.081 | 0.266 | 0.259 | 0.219 | 13.7 | 15.2 | 9.7 | ||
Liang_SHNU_task4_SSep_SED_3 | Mean teacher system | Liang_SS2021 | 0.304 | 0.345 | 0.218 | 0.559 | 0.604 | 0.441 | 34.2 | 37.0 | 27.8 | ||
Liang_SHNU_task4_SSep_SED_1 | Mean teacher system | Liang_SS2021 | 0.313 | 0.348 | 0.235 | 0.588 | 0.639 | 0.462 | 34.6 | 38.1 | 26.5 | ||
Liang_SHNU_task4_SSep_SED_2 | Mean teacher system | Liang_SS2021 | 0.325 | 0.371 | 0.240 | 0.542 | 0.600 | 0.408 | 37.0 | 40.5 | 28.7 | ||
Baseline_SED | DCASE2021 SED baseline system | turpault2020a | 0.315 | 0.359 | 0.222 | 0.547 | 0.596 | 0.407 | 37.3 | 40.8 | 29.7 | ||
Wang_NSYSU_task4_SED_1 | DCASE2021_SED_A | Wang2021 | 0.336 | 0.379 | 0.253 | 0.646 | 0.692 | 0.537 | 43.0 | 47.3 | 32.3 | ||
Wang_NSYSU_task4_SED_4 | DCASE2021_SED_D | Wang2021 | 0.304 | 0.340 | 0.233 | 0.662 | 0.710 | 0.554 | 38.2 | 41.3 | 30.4 | ||
Wang_NSYSU_task4_SED_2 | DCASE2021_SED_B | Wang2021 | 0.070 | 0.081 | 0.050 | 0.636 | 0.672 | 0.552 | 9.9 | 10.1 | 9.5 | ||
Wang_NSYSU_task4_SED_3 | DCASE2021_SED_C | Wang2021 | 0.339 | 0.384 | 0.251 | 0.649 | 0.698 | 0.540 | 43.0 | 46.4 | 34.4 |
Teams ranking
Table including only the best ranking score per submitting team; a team's best PSDS 1 and best PSDS 2 submissions may be different systems, hence the separate submission code columns for each scenario.
Rank | Submission code (PSDS 1) | Submission name (PSDS 1) | Submission code (PSDS 2) | Submission name (PSDS 2) | Technical Report | Sound Separation | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset)
---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.245 | 0.452 | ||
Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.04 | 0.336 | 0.550 | ||
Gong_TAL_task4_SED_3 | TAL SED system | Gong_TAL_task4_SED_3 | TAL SED system | Gong2021 | 1.16 | 0.370 | 0.626 | ||
Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park2021 | 1.07 | 0.327 | 0.603 | ||
Zheng_USTC_task4_SED_1 | DCASE2020 SED Mean teacher system 1 | Zheng_USTC_task4_SED_3 | DCASE2020 SED Mean teacher system 3 | Zheng2021 | 1.40 | 0.452 | 0.746 | ||
Nam_KAIST_task4_SED_2 | SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 | Nam_KAIST_task4_SED_4 | Weak_SED | Nam2021 | 1.29 | 0.399 | 0.715 | ||
Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo2021 | 0.74 | 0.258 | 0.364 | ||
deBenito_AUDIAS_task4_SED_2 | 3-Resolution Mean Teacher (Higher time resolutions) | deBenito_AUDIAS_task4_SED_4 | 5-Resolution Mean Teacher | de Benito-Gorron2021 | 1.10 | 0.363 | 0.577 | ||
Baseline_SSep_SED | DCASE2021 SSep SED baseline system | Baseline_SSep_SED | DCASE2021 SSep SED baseline system | turpault2020b | 1.11 | 0.364 | 0.580 | ||
Boes_KUL_task4_SED_1 | CRNN with optimized pooling operations for scenario 1 (1) | Boes_KUL_task4_SED_3 | CRNN with optimized pooling operations for scenario 2 (1) | Boes2021 | 0.89 | 0.253 | 0.531 | ||
Ebbers_UPB_task4_SED_3 | UPB sytem 3 | Ebbers_UPB_task4_SED_4 | UPB sytem 4 | Ebbers2021 | 1.24 | 0.416 | 0.637 | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 0.318 | 0.583 | ||
Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.54 | 0.152 | 0.322 | ||
Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera2021 | 0.98 | 0.338 | 0.481 | ||
Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim2021 | 1.32 | 0.442 | 0.674 | ||
Cai_SMALLRICE_task4_SED_2 | DCASE2021_Cai_SED_CDur_Ensemble_2 | Cai_SMALLRICE_task4_SED_3 | DCASE2021_Cai_SED_CDur_Ensemble_3 | Dinkel2021 | 1.14 | 0.373 | 0.596 | ||
HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYu2021 | 0.90 | 0.294 | 0.473 | ||
HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | YuHang2021 | 0.61 | 0.098 | 0.496 | ||
Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu2021 | 0.92 | 0.301 | 0.485 | ||
lu_kwai_task4_SED_1 | DCASE2021 SED CRNN Model1 | lu_kwai_task4_SED_3 | DCASE2021 SED Conformer Model1 | Lu2021 | 1.29 | 0.419 | 0.686 | ||
Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 0.302 | 0.507 | ||
Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 0.413 | 0.586 | ||
Yao_GUET_task4_SED_3 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao_GUET_task4_SED_2 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.90 | 0.279 | 0.496 | ||
Liang_SHNU_task4_SED_4 | Guided Learning system | Liang_SHNU_task4_SED_4 | Guided Learning system | Liang2021 | 0.99 | 0.313 | 0.543 | ||
Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik2021 | 1.02 | 0.330 | 0.544 | ||
Liang_SHNU_task4_SSep_SED_2 | Mean teacher system | Liang_SHNU_task4_SSep_SED_1 | Mean teacher system | Liang_SS2021 | 1.05 | 0.325 | 0.588 | ||
Baseline_SED | DCASE2021 SED baseline system | Baseline_SED | DCASE2021 SED baseline system | turpault2020a | 1.00 | 0.315 | 0.547 | ||
Wang_NSYSU_task4_SED_3 | DCASE2021_SED_C | Wang_NSYSU_task4_SED_4 | DCASE2021_SED_D | Wang2021 | 1.14 | 0.339 | 0.662 |
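Consistent with the note above, the per-team scores appear to apply the per-submission formula to a team's best PSDS 1 and best PSDS 2 values taken separately. A sketch under the same normalisation assumption:

```python
def team_ranking_score(psds1_scores, psds2_scores):
    """Best PSDS 1 and best PSDS 2 may come from different submissions of a team."""
    return 0.5 * (max(psds1_scores) / 0.315 + max(psds2_scores) / 0.547)

# Zheng_USTC: best PSDS 1 = 0.452 (SED_1), best PSDS 2 = 0.746 (SED_3)
# -> 0.5 * (0.452/0.315 + 0.746/0.547) = 1.40, matching the teams table.
```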
Supplementary metrics
Rank | Submission code (PSDS 1) | Submission name (PSDS 1) | Submission code (PSDS 2) | Submission name (PSDS 2) | Technical Report | Sound Separation | Ranking score (Evaluation dataset) | Ranking score (Public evaluation) | Ranking score (Vimeo dataset)
---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.78 | 0.85 | ||
Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.04 | 1.02 | 1.10 | ||
Gong_TAL_task4_SED_3 | TAL SED system | Gong_TAL_task4_SED_3 | TAL SED system | Gong2021 | 1.16 | 1.15 | 1.24 | ||
Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park2021 | 1.07 | 1.06 | 1.15 | ||
Zheng_USTC_task4_SED_1 | DCASE2020 SED Mean teacher system 1 | Zheng_USTC_task4_SED_3 | DCASE2020 SED Mean teacher system 3 | Zheng2021 | 1.40 | 1.37 | 1.52 | ||
Nam_KAIST_task4_SED_2 | SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 | Nam_KAIST_task4_SED_4 | Weak_SED | Nam2021 | 1.29 | 1.25 | 1.43 | ||
Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo2021 | 0.74 | 0.73 | 0.71 | ||
deBenito_AUDIAS_task4_SED_2 | 3-Resolution Mean Teacher (Higher time resolutions) | deBenito_AUDIAS_task4_SED_4 | 5-Resolution Mean Teacher | de Benito-Gorron2021 | 1.10 | 1.10 | 1.14 | ||
Baseline_SSep_SED | DCASE2021 SSep SED baseline system | Baseline_SSep_SED | DCASE2021 SSep SED baseline system | turpault2020b | 1.11 | 1.09 | 1.22 | ||
Boes_KUL_task4_SED_1 | CRNN with optimized pooling operations for scenario 1 (1) | Boes_KUL_task4_SED_3 | CRNN with optimized pooling operations for scenario 2 (1) | Boes2021 | 0.89 | 0.87 | 0.87 | ||
Ebbers_UPB_task4_SED_3 | UPB sytem 3 | Ebbers_UPB_task4_SED_4 | UPB sytem 4 | Ebbers2021 | 1.24 | 1.21 | 1.39 | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 1.03 | 1.09 | ||
Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.54 | 0.53 | 0.51 | ||
Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera2021 | 0.98 | 0.97 | 0.93 | ||
Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim2021 | 1.32 | 1.28 | 1.45 | ||
Cai_SMALLRICE_task4_SED_2 | DCASE2021_Cai_SED_CDur_Ensemble_2 | Cai_SMALLRICE_task4_SED_3 | DCASE2021_Cai_SED_CDur_Ensemble_3 | Dinkel2021 | 1.14 | 1.14 | 1.08 | ||
HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYu2021 | 0.90 | 0.88 | 0.89 | ||
HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | YuHang2021 | 0.61 | 0.58 | 0.68 | ||
Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu2021 | 0.92 | 0.92 | 0.89 | ||
lu_kwai_task4_SED_1 | DCASE2021 SED CRNN Model1 | lu_kwai_task4_SED_3 | DCASE2021 SED Conformer Model1 | Lu2021 | 1.29 | 1.25 | 1.44 | ||
Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 0.91 | 1.03 | ||
Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 1.19 | 1.27 | ||
Yao_GUET_task4_SED_3 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao_GUET_task4_SED_2 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.90 | 0.88 | 0.92 | ||
Liang_SHNU_task4_SED_4 | Guided Learning system | Liang_SHNU_task4_SED_4 | Guided Learning system | Liang2021 | 0.99 | 0.98 | 1.03 | ||
Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik2021 | 1.02 | 1.04 | 0.98 | ||
Liang_SHNU_task4_SSep_SED_2 | Mean teacher system | Liang_SHNU_task4_SSep_SED_1 | Mean teacher system | Liang_SS2021 | 1.05 | 1.05 | 1.11 | ||
Baseline_SED | DCASE2021 SED baseline system | Baseline_SED | DCASE2021 SED baseline system | turpault2020a | 1.00 | 1.00 | 1.00 | ||
Wang_NSYSU_task4_SED_3 | DCASE2021_SED_C | Wang_NSYSU_task4_SED_4 | DCASE2021_SED_D | Wang2021 | 1.14 | 1.13 | 1.25 |
Class-wise performance
Rank | Submission code | Submission name | Technical Report | Ranking score (Evaluation dataset) | Alarm/bell ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 23.5 | 28.6 | 42.5 | 25.0 | 16.5 | 15.7 | 19.3 | 19.9 | 35.4 | 23.2 | |
Hafsati_TUITO_task4_SED_3 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.91 | 25.0 | 39.7 | 53.7 | 19.4 | 28.3 | 39.7 | 38.3 | 25.1 | 49.3 | 38.2 | |
Hafsati_TUITO_task4_SED_4 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 0.91 | 25.7 | 40.8 | 50.7 | 26.5 | 28.8 | 39.7 | 42.6 | 26.4 | 49.7 | 41.0 | |
Hafsati_TUITO_task4_SED_1 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.03 | 30.4 | 38.1 | 63.7 | 27.6 | 29.1 | 35.6 | 37.0 | 28.4 | 52.7 | 52.3 | |
Hafsati_TUITO_task4_SED_2 | TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 | Hafsati2021 | 1.04 | 32.8 | 39.0 | 63.3 | 28.9 | 32.8 | 39.7 | 41.0 | 27.9 | 51.5 | 52.6 | |
Gong_TAL_task4_SED_3 | TAL SED system | Gong2021 | 1.16 | 33.3 | 49.8 | 61.9 | 34.6 | 31.6 | 39.8 | 41.8 | 26.9 | 45.0 | 54.3 | |
Gong_TAL_task4_SED_2 | TAL SED system | Gong2021 | 1.15 | 35.0 | 48.1 | 62.1 | 33.9 | 36.3 | 40.3 | 41.4 | 28.1 | 45.8 | 55.7 | |
Gong_TAL_task4_SED_1 | TAL SED system | Gong2021 | 1.14 | 33.9 | 48.7 | 61.3 | 34.7 | 29.3 | 42.4 | 39.8 | 27.7 | 44.1 | 53.1 | |
Park_JHU_task4_SED_2 | Park_JHU_task4_SED_2 | Park2021 | 1.07 | 25.7 | 41.8 | 52.2 | 10.1 | 27.2 | 40.4 | 47.7 | 36.7 | 58.0 | 44.3 | |
Park_JHU_task4_SED_4 | Park_JHU_task4_SED_4 | Park2021 | 0.86 | 25.9 | 42.1 | 33.4 | 34.0 | 17.2 | 38.9 | 50.0 | 35.9 | 50.6 | 41.5 | |
Park_JHU_task4_SED_1 | Park_JHU_task4_SED_1 | Park2021 | 1.01 | 22.8 | 40.3 | 43.6 | 8.2 | 22.9 | 33.5 | 43.2 | 34.8 | 58.5 | 39.0 | |
Park_JHU_task4_SED_3 | Park_JHU_task4_SED_3 | Park2021 | 0.84 | 21.8 | 40.2 | 24.4 | 32.2 | 13.9 | 35.1 | 44.4 | 33.3 | 51.5 | 37.8 | |
Zheng_USTC_task4_SED_4 | DCASE2020 SED Mean teacher system 4 | Zheng2021 | 1.30 | 36.1 | 53.3 | 70.4 | 18.8 | 45.7 | 58.2 | 40.4 | 32.5 | 70.6 | 68.7 | |
Zheng_USTC_task4_SED_1 | DCASE2020 SED Mean teacher system 1 | Zheng2021 | 1.33 | 41.4 | 54.1 | 72.5 | 29.4 | 47.8 | 60.1 | 49.2 | 33.7 | 69.5 | 65.5 | |
Zheng_USTC_task4_SED_3 | DCASE2020 SED Mean teacher system 3 | Zheng2021 | 1.29 | 36.4 | 52.5 | 70.9 | 20.9 | 42.9 | 59.0 | 43.3 | 34.1 | 68.7 | 68.7 | |
Zheng_USTC_task4_SED_2 | DCASE2020 SED Mean teacher system 2 | Zheng2021 | 1.33 | 36.6 | 55.1 | 75.3 | 29.8 | 45.6 | 55.7 | 53.6 | 38.6 | 69.3 | 69.5 | |
Nam_KAIST_task4_SED_2 | SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 | Nam2021 | 1.19 | 34.2 | 55.4 | 70.5 | 39.6 | 46.2 | 44.7 | 36.2 | 39.3 | 55.7 | 58.6 | |
Nam_KAIST_task4_SED_1 | SED_default | Nam2021 | 1.16 | 28.6 | 58.3 | 69.8 | 30.3 | 37.0 | 38.1 | 37.8 | 35.7 | 51.7 | 54.6 | |
Nam_KAIST_task4_SED_3 | SED_AFL | Nam2021 | 1.09 | 27.9 | 36.9 | 25.2 | 9.8 | 7.2 | 30.0 | 32.8 | 33.0 | 40.6 | 49.8 | |
Nam_KAIST_task4_SED_4 | Weak_SED | Nam2021 | 0.75 | 5.3 | 3.3 | 0.5 | 0.0 | 0.3 | 13.6 | 43.2 | 23.5 | 0.3 | 35.0 | |
Koo_SGU_task4_SED_2 | DCASE2021 SED system using wav2vec | Koo2021 | 0.12 | 0.0 | 20.5 | 9.9 | 1.0 | 2.3 | 12.5 | 20.0 | 15.7 | 26.8 | 14.8 | |
Koo_SGU_task4_SED_3 | DCASE2021 SED system using wav2vec | Koo2021 | 0.41 | 2.5 | 7.7 | 2.2 | 0.8 | 1.2 | 7.8 | 22.5 | 15.3 | 1.8 | 23.2 | |
Koo_SGU_task4_SED_1 | DCASE2021 SED system using wav2vec | Koo2021 | 0.74 | 15.4 | 23.5 | 30.5 | 15.1 | 20.6 | 21.1 | 21.1 | 18.5 | 19.0 | 20.0 | |
deBenito_AUDIAS_task4_SED_4 | 5-Resolution Mean Teacher | de Benito-Gorron2021 | 1.10 | 37.4 | 57.1 | 63.8 | 24.2 | 34.5 | 30.0 | 46.8 | 25.9 | 49.8 | 57.3 | |
deBenito_AUDIAS_task4_SED_1 | 3-Resolution Mean Teacher | de Benito-Gorron2021 | 1.07 | 37.6 | 58.1 | 63.1 | 23.9 | 34.2 | 35.4 | 43.5 | 29.8 | 49.3 | 51.3 | |
deBenito_AUDIAS_task4_SED_2 | 3-Resolution Mean Teacher (Higher time resolutions) | de Benito-Gorron2021 | 1.10 | 37.1 | 51.4 | 63.9 | 26.0 | 36.9 | 28.9 | 46.9 | 30.5 | 52.0 | 57.5 | |
deBenito_AUDIAS_task4_SED_3 | 4-Resolution Mean Teacher | de Benito-Gorron2021 | 1.07 | 36.2 | 57.6 | 63.1 | 24.4 | 34.8 | 35.0 | 41.9 | 27.1 | 48.2 | 53.5 | |
Baseline_SSep_SED | DCASE2021 SSep SED baseline system | turpault2020b | 1.11 | 36.7 | 47.4 | 66.3 | 33.1 | 40.5 | 34.8 | 37.2 | 21.5 | 53.0 | 49.3 | |
Boes_KUL_task4_SED_4 | CRNN with optimized pooling operations for scenario 2 (2) | Boes2021 | 0.60 | 3.7 | 24.2 | 1.4 | 0.0 | 0.6 | 13.9 | 23.7 | 12.6 | 6.2 | 19.9 | |
Boes_KUL_task4_SED_3 | CRNN with optimized pooling operations for scenario 2 (1) | Boes2021 | 0.68 | 6.6 | 16.5 | 1.4 | 0.0 | 0.3 | 21.3 | 33.0 | 18.7 | 6.6 | 35.1 | |
Boes_KUL_task4_SED_2 | CRNN with optimized pooling operations for scenario 1 (2) | Boes2021 | 0.77 | 16.9 | 32.9 | 63.1 | 7.7 | 19.4 | 25.6 | 32.6 | 14.8 | 51.8 | 47.7 | |
Boes_KUL_task4_SED_1 | CRNN with optimized pooling operations for scenario 1 (1) | Boes2021 | 0.81 | 19.0 | 29.0 | 59.1 | 7.7 | 20.9 | 34.5 | 24.8 | 13.0 | 54.0 | 47.9 | |
Ebbers_UPB_task4_SED_2 | UPB sytem 2 | Ebbers2021 | 1.10 | 37.2 | 60.8 | 73.0 | 24.2 | 45.6 | 58.5 | 65.9 | 36.9 | 65.0 | 73.5 | |
Ebbers_UPB_task4_SED_4 | UPB sytem 4 | Ebbers2021 | 1.16 | 39.2 | 61.7 | 74.2 | 33.4 | 46.6 | 57.1 | 64.0 | 45.4 | 67.6 | 77.6 | |
Ebbers_UPB_task4_SED_3 | UPB sytem 3 | Ebbers2021 | 1.24 | 39.2 | 61.7 | 74.2 | 33.4 | 46.6 | 57.1 | 64.0 | 45.4 | 67.6 | 77.6 | |
Ebbers_UPB_task4_SED_1 | UPB sytem 1 | Ebbers2021 | 1.16 | 37.2 | 60.8 | 73.0 | 24.2 | 45.6 | 58.5 | 65.9 | 36.9 | 65.0 | 73.5 | |
Zhu_AIAL-XJU_task4_SED_2 | Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.99 | 30.0 | 46.3 | 63.3 | 23.6 | 16.8 | 44.2 | 47.5 | 40.9 | 59.4 | 57.6 | |
Zhu_AIAL-XJU_task4_SED_1 | Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 31.8 | 48.2 | 58.3 | 28.5 | 26.7 | 37.4 | 48.6 | 36.0 | 51.1 | 35.3 | |
Liu_BUPT_task4_4 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.37 | 14.0 | 26.2 | 40.2 | 9.3 | 16.4 | 18.6 | 7.7 | 5.0 | 26.4 | 11.4 | |
Liu_BUPT_task4_1 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.30 | 13.1 | 18.1 | 49.0 | 8.6 | 19.3 | 20.0 | 6.4 | 6.2 | 28.6 | 11.4 | |
Liu_BUPT_task4_2 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.54 | 19.6 | 30.4 | 36.5 | 14.5 | 18.5 | 29.0 | 18.6 | 11.8 | 30.1 | 27.2 | |
Liu_BUPT_task4_3 | DCASE2020 liuliuliufangzhou system | Liu2021 | 0.24 | 16.5 | 19.0 | 37.2 | 6.2 | 19.3 | 13.7 | 3.5 | 6.8 | 25.0 | 3.8 | |
Olvera_INRIA_task4_SED_2 | SED ensemble 2 OT + FG/BG | Olvera2021 | 0.98 | 46.0 | 47.8 | 63.5 | 23.2 | 39.1 | 51.1 | 20.4 | 27.0 | 62.2 | 53.4 | |
Olvera_INRIA_task4_SED_1 | DA-SED + FG/BG | Olvera2021 | 0.95 | 43.7 | 52.3 | 63.6 | 30.0 | 40.8 | 52.6 | 24.4 | 26.6 | 63.9 | 56.9 | |
Kim_AiTeR_GIST_SED_4 | RCRNN-based noisy student SED | Kim2021 | 1.32 | 34.7 | 59.8 | 71.6 | 40.4 | 47.3 | 26.2 | 61.8 | 32.8 | 64.9 | 66.7 | |
Kim_AiTeR_GIST_SED_2 | RCRNN-based noisy student SED | Kim2021 | 1.31 | 37.9 | 57.4 | 72.9 | 41.8 | 46.8 | 25.2 | 60.5 | 36.9 | 64.3 | 60.8 | |
Kim_AiTeR_GIST_SED_3 | RCRNN-based noisy student SED | Kim2021 | 1.30 | 37.4 | 55.4 | 71.9 | 41.0 | 44.6 | 26.5 | 59.5 | 32.3 | 64.6 | 61.1 | |
Kim_AiTeR_GIST_SED_1 | RCRNN-based noisy student SED | Kim2021 | 1.29 | 33.0 | 57.1 | 70.0 | 42.5 | 49.6 | 28.2 | 60.6 | 31.3 | 65.0 | 62.3 | |
Cai_SMALLRICE_task4_SED_1 | DCASE2021_Cai_SED_CDur_Ensemble_1 | Dinkel2021 | 1.11 | 37.0 | 32.2 | 55.9 | 31.2 | 20.4 | 37.8 | 33.8 | 23.9 | 60.1 | 45.3 | |
Cai_SMALLRICE_task4_SED_2 | DCASE2021_Cai_SED_CDur_Ensemble_2 | Dinkel2021 | 1.13 | 37.8 | 37.4 | 53.8 | 31.8 | 22.1 | 35.9 | 32.3 | 28.9 | 61.3 | 46.6 | |
Cai_SMALLRICE_task4_SED_3 | DCASE2021_Cai_SED_CDur_Ensemble_3 | Dinkel2021 | 1.13 | 36.6 | 36.6 | 55.7 | 31.7 | 21.2 | 36.1 | 37.7 | 25.0 | 61.1 | 46.6 | |
Cai_SMALLRICE_task4_SED_4 | DCASE2021_Cai_SED_CDur_Single_4 | Dinkel2021 | 1.00 | 34.9 | 34.1 | 52.5 | 30.8 | 28.0 | 37.6 | 35.1 | 24.9 | 61.3 | 44.4 | |
HangYuChen_Roal_task4_SED_2 | DCASE2021 SED system | HangYu2021 | 0.90 | 29.0 | 30.7 | 59.3 | 24.5 | 31.8 | 35.3 | 30.2 | 26.0 | 49.3 | 25.9 | |
HangYuChen_Roal_task4_SED_1 | DCASE2021 SED system | YuHang2021 | 0.61 | 5.2 | 4.8 | 5.6 | 4.3 | 2.7 | 12.5 | 26.8 | 16.0 | 5.1 | 24.0 | |
Yu_NCUT_task4_SED_1 | multi-scale CRNN | Yu2021 | 0.20 | 0.5 | 7.1 | 0.7 | 2.0 | 1.7 | 9.5 | 24.4 | 1.2 | 1.6 | 19.7 | |
Yu_NCUT_task4_SED_2 | multi-scale CRNN | Yu2021 | 0.92 | 28.6 | 34.6 | 57.9 | 20.2 | 31.7 | 36.0 | 29.7 | 28.0 | 44.4 | 33.1 | |
lu_kwai_task4_SED_1 | DCASE2021 SED CRNN Model1 | Lu2021 | 1.27 | 37.1 | 41.4 | 62.5 | 40.6 | 39.7 | 46.5 | 46.5 | 34.5 | 54.5 | 46.9 | |
lu_kwai_task4_SED_4 | DCASE2021 SED Conformer Model2 | Lu2021 | 0.88 | 5.8 | 5.9 | 2.1 | 1.0 | 0.3 | 16.9 | 44.6 | 22.0 | 25.6 | 32.9 | |
lu_kwai_task4_SED_3 | DCASE2021 SED Conformer Model1 | Lu2021 | 0.86 | 6.3 | 5.1 | 1.3 | 0.8 | 0.3 | 15.7 | 43.8 | 21.8 | 29.3 | 31.9 | |
lu_kwai_task4_SED_2 | DCASE2021 SED CRNN Model2 | Lu2021 | 1.25 | 38.6 | 41.6 | 65.5 | 41.0 | 39.3 | 46.1 | 49.0 | 36.0 | 51.5 | 46.6 | |
Liu_BUPT_task4_SS_SED_2 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 31.7 | 38.2 | 63.5 | 19.9 | 30.1 | 46.6 | 32.0 | 21.1 | 49.4 | 43.4 | |
Liu_BUPT_task4_SS_SED_1 | DCASE2020 liuliuliufangzhou system | Liu_SS2021 | 0.94 | 34.3 | 38.8 | 63.1 | 25.7 | 27.3 | 45.3 | 31.1 | 25.8 | 49.4 | 43.7 | |
Tian_ICT-TOSHIBA_task4_SED_2 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 33.6 | 44.9 | 60.9 | 26.4 | 34.8 | 24.3 | 38.7 | 25.9 | 48.4 | 45.2 | |
Tian_ICT-TOSHIBA_task4_SED_1 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 33.6 | 44.9 | 60.9 | 26.4 | 34.8 | 24.3 | 38.7 | 25.9 | 48.4 | 45.2 | |
Tian_ICT-TOSHIBA_task4_SED_4 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.19 | 33.6 | 44.9 | 60.9 | 26.4 | 34.8 | 24.3 | 38.7 | 25.9 | 48.4 | 45.2 | |
Tian_ICT-TOSHIBA_task4_SED_3 | SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS | Tian2021 | 1.18 | 33.6 | 44.9 | 60.9 | 26.4 | 34.8 | 24.3 | 38.7 | 25.9 | 48.4 | 45.2 | |
Yao_GUET_task4_SED_3 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.88 | 32.2 | 32.4 | 58.2 | 21.7 | 17.6 | 36.2 | 26.8 | 24.6 | 49.4 | 42.9 | |
Yao_GUET_task4_SED_1 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.88 | 31.6 | 22.7 | 58.5 | 23.4 | 22.8 | 35.9 | 31.5 | 23.8 | 45.1 | 23.9 | |
Yao_GUET_task4_SED_2 | Adaptive Sequential Self Attention Span for Sound Event Detection | Yao2021 | 0.54 | 4.7 | 4.2 | 2.2 | 1.4 | 1.3 | 10.5 | 27.3 | 16.7 | 3.4 | 17.1 | |
Liang_SHNU_task4_SED_4 | Guided Learning system | Liang2021 | 0.99 | 38.0 | 40.7 | 48.3 | 26.0 | 24.2 | 22.6 | 35.6 | 30.0 | 44.6 | 50.0 | |
Bajzik_UNIZA_task4_SED_2 | CAM attention SED system | Bajzik2021 | 1.02 | 37.5 | 44.4 | 57.6 | 28.8 | 22.9 | 35.5 | 44.0 | 29.8 | 51.7 | 45.2 | |
Bajzik_UNIZA_task4_SED_1 | CAM-based SED system | Bajzik2021 | 0.45 | 17.1 | 7.5 | 34.2 | 7.1 | 15.2 | 8.9 | 1.2 | 4.8 | 34.1 | 7.1 | |
Liang_SHNU_task4_SSep_SED_3 | Mean teacher system | Liang_SS2021 | 0.99 | 33.6 | 38.7 | 47.2 | 22.5 | 17.6 | 21.1 | 38.6 | 28.3 | 44.7 | 50.0 | |
Liang_SHNU_task4_SSep_SED_1 | Mean teacher system | Liang_SS2021 | 1.03 | 29.4 | 25.7 | 60.7 | 20.5 | 29.1 | 30.6 | 38.3 | 24.9 | 55.0 | 32.3 | |
Liang_SHNU_task4_SSep_SED_2 | Mean teacher system | Liang_SS2021 | 1.01 | 33.1 | 37.1 | 52.0 | 26.8 | 32.8 | 31.7 | 41.0 | 28.0 | 49.2 | 37.8 | |
Baseline_SED | DCASE2021 SED baseline system | turpault2020a | 1.00 | 32.2 | 39.0 | 62.4 | 28.6 | 34.5 | 21.1 | 37.2 | 26.4 | 49.7 | 42.0 | |
Wang_NSYSU_task4_SED_1 | DCASE2021_SED_A | Wang2021 | 1.13 | 34.3 | 46.5 | 62.7 | 35.6 | 29.0 | 50.6 | 51.8 | 38.2 | 45.9 | 35.9 | |
Wang_NSYSU_task4_SED_4 | DCASE2021_SED_D | Wang2021 | 1.09 | 32.5 | 49.4 | 66.2 | 28.3 | 15.2 | 34.0 | 47.2 | 33.0 | 39.9 | 36.0 | |
Wang_NSYSU_task4_SED_2 | DCASE2021_SED_B | Wang2021 | 0.69 | 7.0 | 5.2 | 0.5 | 0.0 | 0.3 | 11.1 | 31.5 | 15.7 | 0.3 | 27.8 | |
Wang_NSYSU_task4_SED_3 | DCASE2021_SED_C | Wang2021 | 1.13 | 34.4 | 52.0 | 70.1 | 32.2 | 25.1 | 41.5 | 47.8 | 36.1 | 52.6 | 37.7 |
System characteristics
General characteristics
Rank | Code | Technical Report | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) | Data augmentation | Features
---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.245 | 0.452 | log-mel energies | ||
Hafsati_TUITO_task4_SED_3 | Hafsati2021 | 0.91 | 0.287 | 0.502 | pitch shifting, audio concatenation, volume changing | log-mel energies | |
Hafsati_TUITO_task4_SED_4 | Hafsati2021 | 0.91 | 0.287 | 0.502 | pitch shifting, audio concatenation, volume changing | log-mel energies | |
Hafsati_TUITO_task4_SED_1 | Hafsati2021 | 1.03 | 0.334 | 0.549 | log-mel energies | ||
Hafsati_TUITO_task4_SED_2 | Hafsati2021 | 1.04 | 0.336 | 0.550 | log-mel energies | ||
Gong_TAL_task4_SED_3 | Gong2021 | 1.16 | 0.370 | 0.626 | SpecAugment, time shift, mixup | log-mel energies | |
Gong_TAL_task4_SED_2 | Gong2021 | 1.15 | 0.367 | 0.616 | SpecAugment, time shift, mixup | log-mel energies | |
Gong_TAL_task4_SED_1 | Gong2021 | 1.14 | 0.364 | 0.611 | SpecAugment, time shift, mixup | log-mel energies | |
Park_JHU_task4_SED_2 | Park2021 | 1.07 | 0.327 | 0.603 | mixup, frame shifting | log-mel energies | |
Park_JHU_task4_SED_4 | Park2021 | 0.86 | 0.237 | 0.524 | mixup, frame shifting | log-mel energies | |
Park_JHU_task4_SED_1 | Park2021 | 1.01 | 0.305 | 0.579 | mixup, frame shifting | log-mel energies | |
Park_JHU_task4_SED_3 | Park2021 | 0.84 | 0.222 | 0.537 | mixup, frame shifting | log-mel energies | |
Zheng_USTC_task4_SED_4 | Zheng2021 | 1.30 | 0.389 | 0.742 | spec-augment, time-shifting, mixup | log-mel energies | |
Zheng_USTC_task4_SED_1 | Zheng2021 | 1.33 | 0.452 | 0.669 | spec-augment, time-shifting, mixup | log-mel energies | |
Zheng_USTC_task4_SED_3 | Zheng2021 | 1.29 | 0.386 | 0.746 | spec-augment, time-shifting, mixup | log-mel energies | |
Zheng_USTC_task4_SED_2 | Zheng2021 | 1.33 | 0.447 | 0.676 | spec-augment, time-shifting, mixup | log-mel energies | |
Nam_KAIST_task4_SED_2 | Nam2021 | 1.19 | 0.399 | 0.609 | time shifting, mixup, time masking, FilterAugment | log-mel energies |
Nam_KAIST_task4_SED_1 | Nam2021 | 1.16 | 0.378 | 0.617 | time shifting, mixup, time masking, FilterAugment | log-mel energies |
Nam_KAIST_task4_SED_3 | Nam2021 | 1.09 | 0.324 | 0.634 | time shifting, mixup, time masking, FilterAugment | log-mel energies |
Nam_KAIST_task4_SED_4 | Nam2021 | 0.75 | 0.059 | 0.715 | time shifting, mixup, time masking, FilterAugment | log-mel energies |
Koo_SGU_task4_SED_2 | Koo2021 | 0.12 | 0.044 | 0.059 | raw waveform | ||
Koo_SGU_task4_SED_3 | Koo2021 | 0.41 | 0.058 | 0.348 | raw waveform | ||
Koo_SGU_task4_SED_1 | Koo2021 | 0.74 | 0.258 | 0.364 | raw waveform | ||
deBenito_AUDIAS_task4_SED_4 | de Benito-Gorron2021 | 1.10 | 0.361 | 0.577 | log-mel energies | ||
deBenito_AUDIAS_task4_SED_1 | de Benito-Gorron2021 | 1.07 | 0.343 | 0.571 | log-mel energies | ||
deBenito_AUDIAS_task4_SED_2 | de Benito-Gorron2021 | 1.10 | 0.363 | 0.574 | log-mel energies | ||
deBenito_AUDIAS_task4_SED_3 | de Benito-Gorron2021 | 1.07 | 0.345 | 0.571 | log-mel energies | ||
Baseline_SSep_SED | turpault2020b | 1.11 | 0.364 | 0.580 | mixup | log-mel energies | |
Boes_KUL_task4_SED_4 | Boes2021 | 0.60 | 0.117 | 0.457 | time masking, frequency masking, mixup | log-mel energies | |
Boes_KUL_task4_SED_3 | Boes2021 | 0.68 | 0.121 | 0.531 | time masking, frequency masking, mixup | log-mel energies | |
Boes_KUL_task4_SED_2 | Boes2021 | 0.77 | 0.233 | 0.440 | time masking, frequency masking, mixup | log-mel energies | |
Boes_KUL_task4_SED_1 | Boes2021 | 0.81 | 0.253 | 0.442 | time masking, frequency masking, mixup | log-mel energies | |
Ebbers_UPB_task4_SED_2 | Ebbers2021 | 1.10 | 0.335 | 0.621 | frequency warping, time-/frequency-masking, shifted superposition, random noise | log-mel energies |
Ebbers_UPB_task4_SED_4 | Ebbers2021 | 1.16 | 0.363 | 0.637 | frequency warping, time-/frequency-masking, shifted superposition, random noise | log-mel energies |
Ebbers_UPB_task4_SED_3 | Ebbers2021 | 1.24 | 0.416 | 0.635 | frequency warping, time-/frequency-masking, shifted superposition, random noise | log-mel energies |
Ebbers_UPB_task4_SED_1 | Ebbers2021 | 1.16 | 0.373 | 0.621 | frequency warping, time-/frequency-masking, shifted superposition, random noise | log-mel energies |
Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.99 | 0.290 | 0.574 | mixup | log-mel spectrogram | |
Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 0.318 | 0.583 | mixup | log-mel spectrogram | |
Liu_BUPT_task4_4 | Liu2021 | 0.37 | 0.102 | 0.231 | log-mel energies | ||
Liu_BUPT_task4_1 | Liu2021 | 0.30 | 0.090 | 0.169 | log-mel energies | ||
Liu_BUPT_task4_2 | Liu2021 | 0.54 | 0.152 | 0.322 | log-mel energies | ||
Liu_BUPT_task4_3 | Liu2021 | 0.24 | 0.068 | 0.146 | log-mel energies | ||
Olvera_INRIA_task4_SED_2 | Olvera2021 | 0.98 | 0.338 | 0.481 | log-mel energies | ||
Olvera_INRIA_task4_SED_1 | Olvera2021 | 0.95 | 0.332 | 0.462 | log-mel energies | ||
Kim_AiTeR_GIST_SED_4 | Kim2021 | 1.32 | 0.442 | 0.674 | time-frequency shift, mixup, specaugment | log-mel energies | |
Kim_AiTeR_GIST_SED_2 | Kim2021 | 1.31 | 0.439 | 0.667 | time-frequency shift, mixup, specaugment | log-mel energies | |
Kim_AiTeR_GIST_SED_3 | Kim2021 | 1.30 | 0.434 | 0.669 | time-frequency shift, mixup, specaugment | log-mel energies | |
Kim_AiTeR_GIST_SED_1 | Kim2021 | 1.29 | 0.431 | 0.661 | time-frequency shift, mixup, specaugment | log-mel energies | |
Cai_SMALLRICE_task4_SED_1 | Dinkel2021 | 1.11 | 0.361 | 0.584 | time shifting, mixup, time masking, frequency masking | log-mel energies | |
Cai_SMALLRICE_task4_SED_2 | Dinkel2021 | 1.13 | 0.373 | 0.585 | time shifting, mixup, time masking, frequency masking | log-mel energies | |
Cai_SMALLRICE_task4_SED_3 | Dinkel2021 | 1.13 | 0.370 | 0.596 | time shifting, mixup, time masking, frequency masking | log-mel energies | |
Cai_SMALLRICE_task4_SED_4 | Dinkel2021 | 1.00 | 0.339 | 0.504 | time shifting, mixup, time masking, frequency masking | log-mel energies | |
HangYuChen_Roal_task4_SED_2 | HangYu2021 | 0.90 | 0.294 | 0.473 | minmax | log-mel energies | |
HangYuChen_Roal_task4_SED_1 | YuHang2021 | 0.61 | 0.098 | 0.496 | minmax | log-mel energies | |
Yu_NCUT_task4_SED_1 | Yu2021 | 0.20 | 0.038 | 0.157 | mixup | log-mel energies | |
Yu_NCUT_task4_SED_2 | Yu2021 | 0.92 | 0.301 | 0.485 | mixup | log-mel energies | |
lu_kwai_task4_SED_1 | Lu2021 | 1.27 | 0.419 | 0.660 | mixup, frame-shift | log-mel energies | |
lu_kwai_task4_SED_4 | Lu2021 | 0.88 | 0.157 | 0.685 | mixup, frame-shift | log-mel energies | |
lu_kwai_task4_SED_3 | Lu2021 | 0.86 | 0.148 | 0.686 | mixup, frame-shift | log-mel energies | |
lu_kwai_task4_SED_2 | Lu2021 | 1.25 | 0.412 | 0.651 | mixup, frame-shift | log-mel energies | |
Liu_BUPT_task4_SS_SED_2 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | source augmentation, random track mixing | raw waveform | |
Liu_BUPT_task4_SS_SED_1 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | source augmentation, random track mixing | raw waveform | |
Tian_ICT-TOSHIBA_task4_SED_2 | Tian2021 | 1.19 | 0.411 | 0.585 | mixup | log-mel energies | |
Tian_ICT-TOSHIBA_task4_SED_1 | Tian2021 | 1.19 | 0.413 | 0.586 | mixup | log-mel energies | |
Tian_ICT-TOSHIBA_task4_SED_4 | Tian2021 | 1.19 | 0.412 | 0.586 | mixup | log-mel energies | |
Tian_ICT-TOSHIBA_task4_SED_3 | Tian2021 | 1.18 | 0.409 | 0.584 | mixup | log-mel energies | |
Yao_GUET_task4_SED_3 | Yao2021 | 0.88 | 0.279 | 0.479 | mixup | log-mel energies |
Yao_GUET_task4_SED_1 | Yao2021 | 0.88 | 0.277 | 0.482 | mixup | log-mel energies |
Yao_GUET_task4_SED_2 | Yao2021 | 0.54 | 0.056 | 0.496 | mixup | log-mel energies |
Liang_SHNU_task4_SED_4 | Liang2021 | 0.99 | 0.313 | 0.543 | mixup, specAugment | log-mel energies | |
Bajzik_UNIZA_task4_SED_2 | Bajzik2021 | 1.02 | 0.330 | 0.544 | log-mel energies | ||
Bajzik_UNIZA_task4_SED_1 | Bajzik2021 | 0.45 | 0.133 | 0.266 | log-mel energies | ||
Liang_SHNU_task4_SSep_SED_3 | Liang_SS2021 | 0.99 | 0.304 | 0.559 | log-mel energies | ||
Liang_SHNU_task4_SSep_SED_1 | Liang_SS2021 | 1.03 | 0.313 | 0.588 | log-mel energies | ||
Liang_SHNU_task4_SSep_SED_2 | Liang_SS2021 | 1.01 | 0.325 | 0.542 | log-mel energies | ||
Baseline_SED | turpault2020a | 1.00 | 0.315 | 0.547 | mixup | log-mel energies | |
Wang_NSYSU_task4_SED_1 | Wang2021 | 1.13 | 0.336 | 0.646 | Mixup, Time Shift, Time Mask, Frequency Mask | log-mel energies | |
Wang_NSYSU_task4_SED_4 | Wang2021 | 1.09 | 0.304 | 0.662 | Mixup, Time Shift, Time Mask, Frequency Mask | log-mel energies | |
Wang_NSYSU_task4_SED_2 | Wang2021 | 0.69 | 0.070 | 0.636 | Mixup, Time Shift, Time Mask, Frequency Mask | log-mel energies | |
Wang_NSYSU_task4_SED_3 | Wang2021 | 1.13 | 0.339 | 0.649 | Mixup, Time Shift, Time Mask, Frequency Mask | log-mel energies |
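Nearly every system in the table above uses log-mel energies as input features, and mixup is the most common augmentation. A minimal sketch of both, assuming librosa and illustrative parameters (16 kHz audio, 128 mel bands, mixup α = 0.2; the exact settings vary per submission):

```python
import numpy as np
import librosa

def logmel(path, sr=16000, n_fft=2048, hop=256, n_mels=128):
    """Log-mel energies: mel-filtered power spectrogram on a dB scale."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)                 # shape: (n_mels, n_frames)

def mixup(x1, x2, y1, y2, alpha=0.2):
    """Mixup: convex combination of two examples and of their label vectors."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```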
Machine learning characteristics
Rank | Code | Technical Report | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) | Classifier | Semi-supervised approach | Post-processing | Segmentation method | Decision making
---|---|---|---|---|---|---|---|---|---|---
Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.245 | 0.452 | CNN, conformer | mean-teacher student | median filtering (93ms) | |||
Hafsati_TUITO_task4_SED_3 | Hafsati2021 | 0.91 | 0.287 | 0.502 | CRNN | mean-teacher student | median filtering (93ms) | |||
Hafsati_TUITO_task4_SED_4 | Hafsati2021 | 0.91 | 0.287 | 0.502 | CRNN | mean-teacher student | median filtering (93ms) | |||
Hafsati_TUITO_task4_SED_1 | Hafsati2021 | 1.03 | 0.334 | 0.549 | CRNN | mean-teacher student | median filtering (93ms) | |||
Hafsati_TUITO_task4_SED_2 | Hafsati2021 | 1.04 | 0.336 | 0.550 | CRNN | mean-teacher student | median filtering (93ms) | |||
Gong_TAL_task4_SED_3 | Gong2021 | 1.16 | 0.370 | 0.626 | CRNN | mean-teacher, pseudo-labelling | class-wise median filtering | attention layers | mean | |
Gong_TAL_task4_SED_2 | Gong2021 | 1.15 | 0.367 | 0.616 | CRNN | mean-teacher | class-wise median filtering | attention layers | mean | |
Gong_TAL_task4_SED_1 | Gong2021 | 1.14 | 0.364 | 0.611 | CRNN | mean-teacher | class-wise median filtering | attention layers | mean | |
Park_JHU_task4_SED_2 | Park2021 | 1.07 | 0.327 | 0.603 | RCRNN | cross-referencing self-training | median filtering | |||
Park_JHU_task4_SED_4 | Park2021 | 0.86 | 0.237 | 0.524 | RCRNN | cross-referencing self-training | median filtering | |||
Park_JHU_task4_SED_1 | Park2021 | 1.01 | 0.305 | 0.579 | RCRNN | cross-referencing self-training | median filtering | |||
Park_JHU_task4_SED_3 | Park2021 | 0.84 | 0.222 | 0.537 | RCRNN | cross-referencing self-training | median filtering | |||
Zheng_USTC_task4_SED_4 | Zheng2021 | 1.30 | 0.389 | 0.742 | CRNN | mean-teacher student | median filtering (340ms) | averaging | ||
Zheng_USTC_task4_SED_1 | Zheng2021 | 1.33 | 0.452 | 0.669 | CRNN | mean-teacher student | median filtering (340ms) | averaging | ||
Zheng_USTC_task4_SED_3 | Zheng2021 | 1.29 | 0.386 | 0.746 | CRNN | mean-teacher student | median filtering (340ms) | averaging | ||
Zheng_USTC_task4_SED_2 | Zheng2021 | 1.33 | 0.447 | 0.676 | CRNN | mean-teacher student | median filtering (340ms) | averaging | ||
Nam_KAIST_task4_SED_2 | Nam2021 | 1.19 | 0.399 | 0.609 | CRNN, ensemble | mean-teacher student | median filtering (329ms), weak prediction masking | mean | ||
Nam_KAIST_task4_SED_1 | Nam2021 | 1.16 | 0.378 | 0.617 | CRNN, ensemble | mean-teacher student | median filtering (461ms), weak prediction masking | mean | ||
Nam_KAIST_task4_SED_3 | Nam2021 | 1.09 | 0.324 | 0.634 | CRNN, ensemble | mean-teacher student | median filtering (461ms), weak prediction masking | mean | ||
Nam_KAIST_task4_SED_4 | Nam2021 | 0.75 | 0.059 | 0.715 | CRNN, ensemble | mean-teacher student | weak SED | mean | ||
Koo_SGU_task4_SED_2 | Koo2021 | 0.12 | 0.044 | 0.059 | Transformer, RNN | mean-teacher student | median filtering (93ms) | |||
Koo_SGU_task4_SED_3 | Koo2021 | 0.41 | 0.058 | 0.348 | Transformer, RNN | mean-teacher student | median filtering (93ms) | |||
Koo_SGU_task4_SED_1 | Koo2021 | 0.74 | 0.258 | 0.364 | Transformer | mean-teacher student | median filtering (93ms) | |||
deBenito_AUDIAS_task4_SED_4 | de Benito-Gorron2021 | 1.10 | 0.361 | 0.577 | CRNN | mean-teacher student | median filtering (45ms) | average | ||
deBenito_AUDIAS_task4_SED_1 | de Benito-Gorron2021 | 1.07 | 0.343 | 0.571 | CRNN | mean-teacher student | median filtering (45ms) | average | ||
deBenito_AUDIAS_task4_SED_2 | de Benito-Gorron2021 | 1.10 | 0.363 | 0.574 | CRNN | mean-teacher student | median filtering (45ms) | average | ||
deBenito_AUDIAS_task4_SED_3 | de Benito-Gorron2021 | 1.07 | 0.345 | 0.571 | CRNN | mean-teacher student | median filtering (45ms) | average | ||
Baseline_SSep_SED | turpault2020b | 1.11 | 0.364 | 0.580 | CRNN | mean-teacher student | ||||
Boes_KUL_task4_SED_4 | Boes2021 | 0.60 | 0.117 | 0.457 | CRNN | mean teacher | median filtering (3.7s) | |||
Boes_KUL_task4_SED_3 | Boes2021 | 0.68 | 0.121 | 0.531 | CRNN | mean teacher | median filtering (3.7s) | |||
Boes_KUL_task4_SED_2 | Boes2021 | 0.77 | 0.233 | 0.440 | CRNN | mean teacher | median filtering (460ms) | |||
Boes_KUL_task4_SED_1 | Boes2021 | 0.81 | 0.253 | 0.442 | CRNN | mean teacher | median filtering (460ms) | |||
Ebbers_UPB_task4_SED_2 | Ebbers2021 | 1.10 | 0.335 | 0.621 | FBCRNN, CRNN | self-training | median filtering (class dependent) | MIL | |
Ebbers_UPB_task4_SED_4 | Ebbers2021 | 1.16 | 0.363 | 0.637 | FBCRNN, CRNN, CTNN, CNN | self-training | median filtering (class dependent) | averaging | |
Ebbers_UPB_task4_SED_3 | Ebbers2021 | 1.24 | 0.416 | 0.635 | FBCRNN, CRNN, CTNN, CNN | self-training | median filtering (class dependent) | averaging | |
Ebbers_UPB_task4_SED_1 | Ebbers2021 | 1.16 | 0.373 | 0.621 | FBCRNN, CRNN | self-training | median filtering (class dependent) | MIL | |
Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.99 | 0.290 | 0.574 | CRNN | mean-teacher student | median filtering | LinearSoftmax | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 0.318 | 0.583 | CRNN | mean-teacher student | median filtering | LinearSoftmax | ||
Liu_BUPT_task4_4 | Liu2021 | 0.37 | 0.102 | 0.231 | CRNN | mean-teacher student | median filtering (93ms) | |||
Liu_BUPT_task4_1 | Liu2021 | 0.30 | 0.090 | 0.169 | CRNN | mean-teacher student | median filtering (93ms) | |||
Liu_BUPT_task4_2 | Liu2021 | 0.54 | 0.152 | 0.322 | CRNN | mean-teacher student | median filtering (93ms) | |||
Liu_BUPT_task4_3 | Liu2021 | 0.24 | 0.068 | 0.146 | CRNN | mean-teacher student | median filtering (93ms) | |||
Olvera_INRIA_task4_SED_2 | Olvera2021 | 0.98 | 0.338 | 0.481 | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Olvera_INRIA_task4_SED_1 | Olvera2021 | 0.95 | 0.332 | 0.462 | CRNN | mean-teacher student | HMM smoothing | HMM smoothing | ||
Kim_AiTeR_GIST_SED_4 | Kim2021 | 1.32 | 0.442 | 0.674 | RCRNN | mean-teacher student, self-training with noisy student | median filtering | mean | ||
Kim_AiTeR_GIST_SED_2 | Kim2021 | 1.31 | 0.439 | 0.667 | RCRNN | mean-teacher student, self-training with noisy student | median filtering | mean | ||
Kim_AiTeR_GIST_SED_3 | Kim2021 | 1.30 | 0.434 | 0.669 | RCRNN | mean-teacher student, self-training with noisy student | median filtering | mean | ||
Kim_AiTeR_GIST_SED_1 | Kim2021 | 1.29 | 0.431 | 0.661 | RCRNN | mean-teacher student, self-training with noisy student | median filtering | mean | ||
Cai_SMALLRICE_task4_SED_1 | Dinkel2021 | 1.11 | 0.361 | 0.584 | CRNN, ensemble | unsupervised data augmentation | average | |||
Cai_SMALLRICE_task4_SED_2 | Dinkel2021 | 1.13 | 0.373 | 0.585 | CRNN, ensemble | unsupervised data augmentation | average | |||
Cai_SMALLRICE_task4_SED_3 | Dinkel2021 | 1.13 | 0.370 | 0.596 | CRNN, ensemble | unsupervised data augmentation | average | |||
Cai_SMALLRICE_task4_SED_4 | Dinkel2021 | 1.00 | 0.339 | 0.504 | CRNN | unsupervised data augmentation | ||||
HangYuChen_Roal_task4_SED_2 | HangYu2021 | 0.90 | 0.294 | 0.473 | Transformer,CNN | mean-teacher student | median filtering (93ms) | attention layers | majority vote | |
HangYuChen_Roal_task4_SED_1 | YuHang2021 | 0.61 | 0.098 | 0.496 | CRNN | mean-teacher student | median filtering (93ms) | attention layers | majority vote | |
Yu_NCUT_task4_SED_1 | Yu2021 | 0.20 | 0.038 | 0.157 | Multi-scale CRNN | mean-teacher student | median filtering (93ms) | attention | ||
Yu_NCUT_task4_SED_2 | Yu2021 | 0.92 | 0.301 | 0.485 | Multi-scale CRNN | mean-teacher student | median filtering (93ms) | attention | ||
lu_kwai_task4_SED_1 | Lu2021 | 1.27 | 0.419 | 0.660 | CRNN | mean-teacher student | classwise median filtering | majority vote | ||
lu_kwai_task4_SED_4 | Lu2021 | 0.88 | 0.157 | 0.685 | Conformer | mean-teacher student | classwise median filtering | majority vote | ||
lu_kwai_task4_SED_3 | Lu2021 | 0.86 | 0.148 | 0.686 | Conformer | mean-teacher student | classwise median filtering | majority vote | ||
lu_kwai_task4_SED_2 | Lu2021 | 1.25 | 0.412 | 0.651 | CRNN | mean-teacher student | classwise median filtering | majority vote | ||
Liu_BUPT_task4_SS_SED_2 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | u-net, VGG | median filtering (93ms) | attention layers, d-vector | |||
Liu_BUPT_task4_SS_SED_1 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | u-net, VGG | median filtering (93ms) | attention layers, d-vector | |||
Tian_ICT-TOSHIBA_task4_SED_2 | Tian2021 | 1.19 | 0.411 | 0.585 | CNN | mean-teacher student | median filtering with adaptive window size | attention layers | |
Tian_ICT-TOSHIBA_task4_SED_1 | Tian2021 | 1.19 | 0.413 | 0.586 | CNN | mean-teacher student | median filtering with adaptive window size | attention layers | |
Tian_ICT-TOSHIBA_task4_SED_4 | Tian2021 | 1.19 | 0.412 | 0.586 | CNN | mean-teacher student | median filtering with adaptive window size | attention layers | |
Tian_ICT-TOSHIBA_task4_SED_3 | Tian2021 | 1.18 | 0.409 | 0.584 | CNN | mean-teacher student | median filtering with adaptive window size | attention layers | |
Yao_GUET_task4_SED_3 | Yao2021 | 0.88 | 0.279 | 0.479 | CRNN, Self Attention | mean-teacher student | median filtering (93ms) | |||
Yao_GUET_task4_SED_1 | Yao2021 | 0.88 | 0.277 | 0.482 | CRNN, Self Attention | mean-teacher student | median filtering (93ms) | |||
Yao_GUET_task4_SED_2 | Yao2021 | 0.54 | 0.056 | 0.496 | CRNN, Self Attention | mean-teacher student | median filtering (93ms) | |||
Liang_SHNU_task4_SED_4 | Liang2021 | 0.99 | 0.313 | 0.543 | CRNN | teacher student | median filtering (with adaptive window size) | |||
Bajzik_UNIZA_task4_SED_2 | Bajzik2021 | 1.02 | 0.330 | 0.544 | CRNN | mean-teacher student | median filtering (112ms) | |||
Bajzik_UNIZA_task4_SED_1 | Bajzik2021 | 0.45 | 0.133 | 0.266 | CNN | mean-teacher student | median filtering (112ms) | |||
Liang_SHNU_task4_SSep_SED_3 | Liang_SS2021 | 0.99 | 0.304 | 0.559 | CRNN | mean-teacher student | median filtering (with adaptive window size) | |||
Liang_SHNU_task4_SSep_SED_1 | Liang_SS2021 | 1.03 | 0.313 | 0.588 | CRNN | mean-teacher student | median filtering (with adaptive window size) | |||
Liang_SHNU_task4_SSep_SED_2 | Liang_SS2021 | 1.01 | 0.325 | 0.542 | CRNN | mean-teacher student | median filtering (with adaptive window size) | |||
Baseline_SED | turpault2020a | 1.00 | 0.315 | 0.547 | CRNN | mean-teacher student | ||||
Wang_NSYSU_task4_SED_1 | Wang2021 | 1.13 | 0.336 | 0.646 | CRNN | mean-teacher student | median filtering | attention layer | mean | |
Wang_NSYSU_task4_SED_4 | Wang2021 | 1.09 | 0.304 | 0.662 | CRNN, CNN-Transformer | mean-teacher student | median filtering | attention layer, exponential softmax layer | mean | |
Wang_NSYSU_task4_SED_2 | Wang2021 | 0.69 | 0.070 | 0.636 | CRNN | mean-teacher student | median filtering | exponential softmax layer | mean | |
Wang_NSYSU_task4_SED_3 | Wang2021 | 1.13 | 0.339 | 0.649 | CRNN, CNN-Transformer | mean-teacher student | median filtering | attention layer | mean |
Complexity
Rank | Code | Technical Report | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) | Model complexity | Ensemble subsystems | Training time |
---|---|---|---|---|---|---|---|---|
Na_BUPT_task4_SED_1 | Na2021 | 0.80 | 0.245 | 0.452 | 3900000 | 40h (1 Quadro K1200) | ||
Hafsati_TUITO_task4_SED_3 | Hafsati2021 | 0.91 | 0.287 | 0.502 | 1100000 | 20h (1 Tesla V100-SXM2-16GB) | ||
Hafsati_TUITO_task4_SED_4 | Hafsati2021 | 0.91 | 0.287 | 0.502 | 1100000 | 20h (1 Tesla V100-SXM2-16GB) | ||
Hafsati_TUITO_task4_SED_1 | Hafsati2021 | 1.03 | 0.334 | 0.549 | 1100000 | 6h (1 Tesla V100-SXM2-16GB) | ||
Hafsati_TUITO_task4_SED_2 | Hafsati2021 | 1.04 | 0.336 | 0.550 | 1100000 | 6h (1 Tesla V100-SXM2-16GB) | ||
Gong_TAL_task4_SED_3 | Gong2021 | 1.16 | 0.370 | 0.626 | 6674520 | 6 | 22.5h (1 V100) | |
Gong_TAL_task4_SED_2 | Gong2021 | 1.15 | 0.367 | 0.616 | 2224840 | 2 | 7.5h (1 V100) | |
Gong_TAL_task4_SED_1 | Gong2021 | 1.14 | 0.364 | 0.611 | 4449680 | 4 | 15h (1 V100) | |
Park_JHU_task4_SED_2 | Park2021 | 1.07 | 0.327 | 0.603 | 9000000 | 20h (1 GTX 1080 Ti) | ||
Park_JHU_task4_SED_4 | Park2021 | 0.86 | 0.237 | 0.524 | 9000000 | 20h (1 GTX 1080 Ti) | ||
Park_JHU_task4_SED_1 | Park2021 | 1.01 | 0.305 | 0.579 | 9000000 | 20h (1 GTX 1080 Ti) | ||
Park_JHU_task4_SED_3 | Park2021 | 0.84 | 0.222 | 0.537 | 9000000 | 20h (1 GTX 1080 Ti) | ||
Zheng_USTC_task4_SED_4 | Zheng2021 | 1.30 | 0.389 | 0.742 | 1112420 | 9 | 3h (1 GTX 3090) | |
Zheng_USTC_task4_SED_1 | Zheng2021 | 1.33 | 0.452 | 0.669 | 1112420 | 3 | 3h (1 GTX 3090) | |
Zheng_USTC_task4_SED_3 | Zheng2021 | 1.29 | 0.386 | 0.746 | 1112420 | 10 | 3h (1 GTX 3090) | |
Zheng_USTC_task4_SED_2 | Zheng2021 | 1.33 | 0.447 | 0.676 | 1112420 | 9 | 3h (1 GTX 3090) | |
Nam_KAIST_task4_SED_2 | Nam2021 | 1.19 | 0.399 | 0.609 | 4427956 | 9 | 4h (1 GTX 2080 Ti) | |
Nam_KAIST_task4_SED_1 | Nam2021 | 1.16 | 0.378 | 0.617 | 4427956 | 16 | 4h (1 GTX 2080 Ti) | |
Nam_KAIST_task4_SED_3 | Nam2021 | 1.09 | 0.324 | 0.634 | 4427956 | 11 | 4h (1 GTX 2080 Ti) | |
Nam_KAIST_task4_SED_4 | Nam2021 | 0.75 | 0.059 | 0.715 | 4427956 | 9 | 4h (1 GTX 2080 Ti) | |
Koo_SGU_task4_SED_2 | Koo2021 | 0.12 | 0.044 | 0.059 | 102000000 | 19h (1 Tesla M40) | ||
Koo_SGU_task4_SED_3 | Koo2021 | 0.41 | 0.058 | 0.348 | 196000000 | 2 | 48h (1 Tesla M40) | |
Koo_SGU_task4_SED_1 | Koo2021 | 0.74 | 0.258 | 0.364 | 95800000 | 19h (1 RTX 3080 Ti) | ||
deBenito_AUDIAS_task4_SED_4 | de Benito-Gorron2021 | 1.10 | 0.361 | 0.577 | 5562100 | 5 | 20h (1 RTX 2080) | |
deBenito_AUDIAS_task4_SED_1 | de Benito-Gorron2021 | 1.07 | 0.343 | 0.571 | 3337260 | 3 | 12h (1 RTX 2080) | |
deBenito_AUDIAS_task4_SED_2 | de Benito-Gorron2021 | 1.10 | 0.363 | 0.574 | 3337260 | 3 | 12h (1 RTX 2080) | |
deBenito_AUDIAS_task4_SED_3 | de Benito-Gorron2021 | 1.07 | 0.345 | 0.571 | 4449600 | 4 | 16h (1 RTX 2080) | |
Baseline_SSep_SED | turpault2020b | 1.11 | 0.364 | 0.580 | 2200000 | 6h (1 GTX 1080 Ti) | ||
Boes_KUL_task4_SED_4 | Boes2021 | 0.60 | 0.117 | 0.457 | 1038314 | 5h (1 GTX 1080 Ti) | ||
Boes_KUL_task4_SED_3 | Boes2021 | 0.68 | 0.121 | 0.531 | 1038314 | 5h (1 GTX 1080 Ti) | ||
Boes_KUL_task4_SED_2 | Boes2021 | 0.77 | 0.233 | 0.440 | 1038314 | 5h (1 GTX 1080 Ti) | ||
Boes_KUL_task4_SED_1 | Boes2021 | 0.81 | 0.253 | 0.442 | 1038314 | 5h (1 GTX 1080 Ti) | ||
Ebbers_UPB_task4_SED_2 | Ebbers2021 | 1.10 | 0.335 | 0.621 | 9568030 | 1 | 72h (4 RTX 2070) | |
Ebbers_UPB_task4_SED_4 | Ebbers2021 | 1.16 | 0.363 | 0.637 | 59853372 | 6 | 72h (4 RTX 2070) | |
Ebbers_UPB_task4_SED_3 | Ebbers2021 | 1.24 | 0.416 | 0.635 | 59853372 | 6 | 72h (4 RTX 2070) | |
Ebbers_UPB_task4_SED_1 | Ebbers2021 | 1.16 | 0.373 | 0.621 | 9568030 | 1 | 72h (4 RTX 2070) | |
Zhu_AIAL-XJU_task4_SED_2 | Zhu2021 | 0.99 | 0.290 | 0.574 | 3900000 | 12.5h (1 RTX 3090) | ||
Zhu_AIAL-XJU_task4_SED_1 | Zhu2021 | 1.04 | 0.318 | 0.583 | 3900000 | 13.5h (1 RTX 3090) | ||
Liu_BUPT_task4_4 | Liu2021 | 0.37 | 0.102 | 0.231 | 1112420 | 12h (1 GTX 1080 Ti) | ||
Liu_BUPT_task4_1 | Liu2021 | 0.30 | 0.090 | 0.169 | 1112420 | 12h (1 GTX 1080 Ti) | ||
Liu_BUPT_task4_2 | Liu2021 | 0.54 | 0.152 | 0.322 | 1112420 | 12h (1 GTX 1080 Ti) | ||
Liu_BUPT_task4_3 | Liu2021 | 0.24 | 0.068 | 0.146 | 1112420 | 12h (1 GTX 1080 Ti) | ||
Olvera_INRIA_task4_SED_2 | Olvera2021 | 0.98 | 0.338 | 0.481 | 2225868 | 2 | 24h (1 GTX 1080) | |
Olvera_INRIA_task4_SED_1 | Olvera2021 | 0.95 | 0.332 | 0.462 | 3338802 | 3 | 24h (1 GTX 1080) | |
Kim_AiTeR_GIST_SED_4 | Kim2021 | 1.32 | 0.442 | 0.674 | 2162412 | 10 | 5h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_2 | Kim2021 | 1.31 | 0.439 | 0.667 | 2162412 | 5 | 5h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_3 | Kim2021 | 1.30 | 0.434 | 0.669 | 2162412 | 5 | 5h (1 GTX 1080 Ti) | |
Kim_AiTeR_GIST_SED_1 | Kim2021 | 1.29 | 0.431 | 0.661 | 2162412 | 5 | 5h (1 GTX 1080 Ti) | |
Cai_SMALLRICE_task4_SED_1 | Dinkel2021 | 1.11 | 0.361 | 0.584 | 2043204 | 3 | 3h (1 GTX 2080 Ti) | |
Cai_SMALLRICE_task4_SED_2 | Dinkel2021 | 1.13 | 0.373 | 0.585 | 2724272 | 4 | 3h (1 GTX 2080 Ti) | |
Cai_SMALLRICE_task4_SED_3 | Dinkel2021 | 1.13 | 0.370 | 0.596 | 3405340 | 5 | 3h (1 GTX 2080 Ti) | |
Cai_SMALLRICE_task4_SED_4 | Dinkel2021 | 1.00 | 0.339 | 0.504 | 681068 | 3h (1 GTX 2080 Ti) | ||
HangYuChen_Roal_task4_SED_2 | HangYu2021 | 0.90 | 0.294 | 0.473 | 11312420 | 2 | 6h (1 GTX 1080 Ti) | |
HangYuChen_Roal_task4_SED_1 | YuHang2021 | 0.61 | 0.098 | 0.496 | 1112420 | 2 | 3h (1 GTX 1080 Ti) | |
Yu_NCUT_task4_SED_1 | Yu2021 | 0.20 | 0.038 | 0.157 | 1300000 | 5h (1 GTX 1080) | ||
Yu_NCUT_task4_SED_2 | Yu2021 | 0.92 | 0.301 | 0.485 | 1300000 | 5h (1 GTX 1080) | ||
lu_kwai_task4_SED_1 | Lu2021 | 1.27 | 0.419 | 0.660 | 10500000 | 5 | 5h (1 GTX 2080 Ti) | |
lu_kwai_task4_SED_4 | Lu2021 | 0.88 | 0.157 | 0.685 | 39500000 | 5 | 10h (1 GTX 2080 Ti) | |
lu_kwai_task4_SED_3 | Lu2021 | 0.86 | 0.148 | 0.686 | 39500000 | 5 | 10h (1 GTX 2080 Ti) | |
lu_kwai_task4_SED_2 | Lu2021 | 1.25 | 0.412 | 0.651 | 10500000 | 5 | 5h (1 GTX 2080 Ti) | |
Liu_BUPT_task4_SS_SED_2 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | 192905515 | 17h (1 RTX 3090) | ||
Liu_BUPT_task4_SS_SED_1 | Liu_SS2021 | 0.94 | 0.302 | 0.507 | 192905515 | 17h (1 RTX 3090) | ||
Tian_ICT-TOSHIBA_task4_SED_2 | Tian2021 | 1.19 | 0.411 | 0.585 | 8471847 | 4 | 6h for each model (GTX 2080 Ti) |
Tian_ICT-TOSHIBA_task4_SED_1 | Tian2021 | 1.19 | 0.413 | 0.586 | 8471847 | 4 | 6h for each model (GTX 2080 Ti) |
Tian_ICT-TOSHIBA_task4_SED_4 | Tian2021 | 1.19 | 0.412 | 0.586 | 8471847 | 4 | 6h for each model (GTX 2080 Ti) |
Tian_ICT-TOSHIBA_task4_SED_3 | Tian2021 | 1.18 | 0.409 | 0.584 | 8471847 | 4 | 6h for each model (GTX 2080 Ti) |
Yao_GUET_task4_SED_3 | Yao2021 | 0.88 | 0.279 | 0.479 | 2500000 | 6h (1 Titan RTX) | |
Yao_GUET_task4_SED_1 | Yao2021 | 0.88 | 0.277 | 0.482 | 2500000 | 6h (1 Titan RTX) | |
Yao_GUET_task4_SED_2 | Yao2021 | 0.54 | 0.056 | 0.496 | 2500000 | 6h (1 Titan RTX) | |
Liang_SHNU_task4_SED_4 | Liang2021 | 0.99 | 0.313 | 0.543 | 1431280 | 16h (Tesla-V100) | ||
Bajzik_UNIZA_task4_SED_2 | Bajzik2021 | 1.02 | 0.330 | 0.544 | 2200000 | 13h (1 GeForce GTX 1650) | ||
Bajzik_UNIZA_task4_SED_1 | Bajzik2021 | 0.45 | 0.133 | 0.266 | 1200000 | 5h (1 GeForce GTX 1650) | ||
Liang_SHNU_task4_SSep_SED_3 | Liang_SS2021 | 0.99 | 0.304 | 0.559 | 1112420 | 3h (1 GTX 1080 Ti) | ||
Liang_SHNU_task4_SSep_SED_1 | Liang_SS2021 | 1.03 | 0.313 | 0.588 | 1112420 | 3h (1 GTX 1080 Ti) | ||
Liang_SHNU_task4_SSep_SED_2 | Liang_SS2021 | 1.01 | 0.325 | 0.542 | 1112420 | 3h (1 GTX 1080 Ti) | ||
Baseline_SED | turpault2020a | 1.00 | 0.315 | 0.547 | 2200000 | 6h (1 GTX 1080 Ti) | ||
Wang_NSYSU_task4_SED_1 | Wang2021 | 1.13 | 0.336 | 0.646 | 47213260 | 10 | 480h (1 GPU 1080 Ti) | |
Wang_NSYSU_task4_SED_4 | Wang2021 | 1.09 | 0.304 | 0.662 | 118739112 | 24 | 864h (1 GPU 1080Ti), 360h (1 GPU V100) | |
Wang_NSYSU_task4_SED_2 | Wang2021 | 0.69 | 0.070 | 0.636 | 3350984 | 8 | 384h (1 GPU 1080Ti) | |
Wang_NSYSU_task4_SED_3 | Wang2021 | 1.13 | 0.339 | 0.649 | 115388128 | 16 | 480h (1 GPU 1080 Ti), 360h (1 GPU V100) |
Technical reports
Sound Event Detection System For DCASE 2021 Challenge
Bajzik, Jakub
University of Zilina, Department of Mechatronics and Electronics, Žilina 010 26, Slovak Republic
Bajzik_UNIZA_task4_SED_1 Bajzik_UNIZA_task4_SED_2
Abstract
This paper presents the systems proposed for the DCASE 2021 challenge Task 4 (Sound event detection and separation in domestic environments). The aim is to provide event time localization timestamps in addition to event class probabilities. Two systems are proposed. System 1 is a convolutional neural network trained for sound event classification using only weakly labeled and unlabeled data; the strong labels are obtained using the class activation mapping technique. System 1 does not reach the baseline performance. System 2 combines a convolutional neural network and a recurrent neural network and uses the class activation mapping technique as part of the attention mechanism to improve on the baseline performance. The second model was trained using weakly labeled, strongly labeled, and unlabeled data. Both architectures are based on the 2021 Mean Teacher baseline system.
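The class activation mapping step is the interesting part of System 1: the clip-level classifier's last convolutional activations are reweighted by the classification weights to localize events in time. A minimal sketch under that reading, assuming a PyTorch model whose last conv layer outputs (batch, channels, time) maps and a linear classification head; the names, shapes, and threshold are illustrative, not the authors' implementation.

```python
# Sketch: deriving frame-level ("strong") activity from a weakly trained CNN
# via class activation mapping (CAM). Shapes and threshold are assumptions.
import torch

def frame_level_cam(feature_maps, fc_weights, threshold=0.5):
    """feature_maps: (batch, channels, time) activations of the last conv layer.
    fc_weights: (n_classes, channels) weights of the clip-level linear head.
    Returns a binary (batch, n_classes, time) activity map."""
    # Weight each channel by its contribution to the class logit.
    cam = torch.einsum("bct,kc->bkt", feature_maps, fc_weights)
    # Normalize each clip's map to [0, 1], then threshold into event activity.
    cam_min = cam.amin(dim=-1, keepdim=True)
    cam_max = cam.amax(dim=-1, keepdim=True)
    cam = (cam - cam_min) / (cam_max - cam_min + 1e-8)
    return (cam > threshold).float()
```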
System characteristics
Optimizing Temporal Resolution Of Convolutional Recurrent Neural Networks For Sound Event Detection
Boes, Wim and Van Hamme, Hugo
ESAT, KU Leuven, Leuven, Belgium
Boes_KUL_task4_SED_1 Boes_KUL_task4_SED_2 Boes_KUL_task4_SED_3 Boes_KUL_task4_SED_4
Abstract
In this technical report, the systems we submitted for subtask 4 of the DCASE 2021 challenge, regarding sound event detection, are described in detail. These models are closely related to the baseline provided for this problem, as they are essentially convolutional recurrent neural networks trained in a mean teacher setting to deal with the heterogeneous annotation of the supplied data. However, the time resolution of the predictions was adapted to deal with the fact that these systems are evaluated using two intersection-based metrics involving different needs in terms of temporal localization. This was done by optimizing the pooling operations. For the first of the defined evaluation scenarios, imposing relatively strict requirements on the temporal localization accuracy, our best model achieved a PSDS score of 0.3609 on the validation data. This is only marginally better than the performance obtained by the baseline system (0.342): The amount of pooling in the baseline network already turned out to be optimal, and thus, no substantial changes were made, explaining this result. For the second evaluation scenario, imposing relatively lax restrictions on the localization accuracy, our best-performing system achieved a PSDS score of 0.7312 on the validation data. This is significantly better than the performance obtained by the baseline model (0.527), which can effectively be attributed to the changes that were applied to the pooling operations of the network.
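The report's central knob is the amount of time pooling, which fixes the granularity of the frame-level predictions. A back-of-the-envelope sketch of that relationship, with illustrative numbers rather than the submission's actual configuration:

```python
# How pooling factors set the temporal resolution of a CRNN's predictions.
def prediction_resolution_ms(hop_ms, time_pool_factors):
    """hop_ms: STFT hop size in ms; time_pool_factors: per-block time pooling."""
    total_pool = 1
    for factor in time_pool_factors:
        total_pool *= factor
    return hop_ms * total_pool

# With a 16 ms hop and 2x2x1 time pooling, predictions fall every 64 ms;
# dropping one pooling stage halves that to 32 ms (finer localization for PSDS-1).
print(prediction_resolution_ms(16, [2, 2, 1]))  # 64
```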
System characteristics
Convolution-Augmented Conformer For Sound Event Detection
Chen, YuHang
Royal Flush, 18, Tongshun Street, Wuchang Street, Yuhang District, Hangzhou 310000, China
HangYuChen_Royal_task4_SED_1 HangYuChen_Royal_task4_SED_2
Abstract
In this technical report, we describe our submission system for DCASE2021 Task4: sound event detection and separation in domestic environments. Our model employs conformer blocks, which combine self-attention and depth-wise convolution networks, to efficiently capture both the global and the local context of an audio feature sequence. In addition to this novel architecture, we further improve the performance by utilizing the mean teacher semi-supervised learning technique and data augmentation for each sound event class. We demonstrate that the proposed method achieves PSDS-1 and PSDS-2 scores of 34% and 55.7% on the validation set, outperforming the baseline scores.
System characteristics
Multi-Resolution Mean Teacher For DCASE 2021 Task 4
de Benito-Gorron, Diego and Segovia, Sergio and Ramos, Daniel and T. Toledano, Doroteo
AUDIAS Research Group, Universidad Autónoma de Madrid, Calle Francisco Tomás y Valiente, 11, 28049 Madrid, Spain
deBenito_AUDIAS_task4_SED_1 deBenito_AUDIAS_task4_SED_2 deBenito_AUDIAS_task4_SED_3 deBenito_AUDIAS_task4_SED_4
Abstract
This technical report describes our participation in DCASE 2021 Task 4: Sound event detection and separation in domestic environments. Aiming to take advantage of the different lengths and spectral characteristics of each target category, we follow the multiresolution feature extraction approach that we proposed for last year’s edition. It is found that each one of the proposed Polyphonic Sound Detection Score (PSDS) scenarios benefits from either a higher temporal resolution or a higher frequency resolution. Furthermore, combining several time-frequency resolutions via model fusion is able to improve the PSDS results in both scenarios.
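The multi-resolution idea can be sketched as extracting log-mel features under several time-frequency trade-offs and fusing the per-resolution models' outputs; the parameter pairs below are illustrative stand-ins, not the submission's configuration.

```python
# Sketch: log-mel features at several time-frequency resolutions.
import librosa

def multi_resolution_logmels(wav, sr=16000):
    feats = []
    # Larger n_fft -> finer frequency resolution; smaller hop -> finer time resolution.
    for n_fft, hop in [(2048, 256), (1024, 160), (512, 128)]:
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=128)
        feats.append(librosa.power_to_db(mel))
    # One model is trained per resolution; their frame-level scores are then fused.
    return feats
```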
System characteristics
The Smallrice Submission To The Dcase2021 Task 4 Challenge: A Lightweight Approach For Semi-Supervised Sound Event Detection With Unsupervised Data Augmentation
Dinkel, Heinrich and Cai, Xinyu and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun
Xiaomi Corporation, Beijing, China
Cai_SMALLRICE_task4_SED_1 Cai_SMALLRICE_task4_SED_2 Cai_SMALLRICE_task4_SED_3 Cai_SMALLRICE_task4_SED_4
Abstract
This paper describes our submission to the DCASE 2021 challenge. Unlike the baseline and most other approaches, our work focuses on training a lightweight, well-performing model that can be used in real-world applications. Compared to the baseline, our model contains only 600k parameters (15%), resulting in a size of 2.7 MB on disk, making it viable for low-resource devices such as mobile phones. Our model is trained using unsupervised data augmentation as its consistency criterion, which we show can achieve performance competitive with the more common mean teacher paradigm. On the validation set, our best single model achieves 36.91 PSDS-1 and 57.17 PSDS-2, outperforming the baseline by 2.7 and 5.0 absolute points, respectively. Notably, our approach achieves an Event-F1 score of 39.29 on the development set without post-processing. The best submitted ensemble system, a 4-way fusion, achieves a PSDS-1 of 38.23 and a PSDS-2 of 62.29 on the validation dataset.
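The consistency criterion the abstract mentions can be sketched as follows: the model's prediction on a clean clip serves as the target for its prediction on an augmented view of the same unlabeled clip. A generic formulation under that assumption, not the paper's exact recipe:

```python
# Sketch of an unsupervised-data-augmentation (UDA) style consistency loss.
import torch
import torch.nn.functional as F

def uda_consistency_loss(model, x_clean, x_augmented):
    # The prediction on the clean view is treated as a fixed target.
    with torch.no_grad():
        target = torch.sigmoid(model(x_clean))
    pred = torch.sigmoid(model(x_augmented))
    # Penalize disagreement between the two views of the same unlabeled clip.
    return F.mse_loss(pred, target)
```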
System characteristics
Self-Trained Audio Tagging And Sound Event Detection In Domestic Environments
Ebbers, Janek and Haeb-Umbach, Reinhold
Paderborn University, Department of Communications Engineering, Paderborn, Germany
Abstract
In this report we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments. Our presented solution is an advancement of our system used in the previous edition of the task. We use our previously proposed forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo-labeling, and tag-conditioned sound event detection (SED) models which are trained using the strong pseudo-labels provided by the FBCRNN. Our advancement over our previous model is threefold. Firstly, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training, which leads to both better tagging and detection performance. Secondly, we perform multiple iterations of self-training for both the FBCRNN and the tag-conditioned SED models. Thirdly, while we used only tag-conditioned CNNs as our SED model in the last edition, here we explore more sophisticated SED model architectures, namely tag-conditioned bidirectional CRNNs and tag-conditioned bidirectional convolutional transformer neural networks (CTNNs), and combine them. With scenario- and class-dependent tuning of median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves validation polyphonic sound event detection scores (PSDS) of 0.454 for scenario 1 and 0.758 for scenario 2 as well as a collar-based F1-score of 0.602, outperforming the baselines and our model from the last edition by far. Source code will be made publicly available at https://github.com/fgnt/pb_sed.
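The scenario- and class-dependent median filtering mentioned at the end is straightforward to sketch: each class's posterior track gets its own filter length, short for brief events and long for sustained ones. The filter lengths below are placeholders, not the tuned values.

```python
# Sketch: class-dependent median filtering of frame-level posteriors.
import numpy as np
from scipy.ndimage import median_filter

def classwise_median_filter(frame_probs, filter_lengths):
    """frame_probs: (time, n_classes); filter_lengths: one length (in frames) per class."""
    smoothed = np.empty_like(frame_probs)
    for k, length in enumerate(filter_lengths):
        smoothed[:, k] = median_filter(frame_probs[:, k], size=length)
    return smoothed

# e.g. a short filter for brief events (Dishes) and a long one for sustained
# ones (Vacuum_cleaner): classwise_median_filter(probs, [5, ..., 41])
```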
System characteristics
Improved Pseudo-Labeling Method For Semi-Supervised Sound Event Detection
Gong, Yaguang and Li, Changlong and Wang, Xintian and Ma, Lu and Yang, Song and Wu, Zhongqin
TAL Education Group, China
Gong_TAL_task4_SED_1 Gong_TAL_task4_SED_2 Gong_TAL_task4_SED_3
Abstract
This report illustrates a framework for DCASE2021 Task4 - Sound Event Detection. The proposed framework is built on the pseudo-labeling method widely applied in semi-supervised learning (SSL) tasks. The proposed method synthesizes weak pseudo-labels for the large amount of unlabeled data by utilizing the model's predictions on weakly augmented spectrograms. The weak pseudo-labels are then used as supervision for strongly augmented spectrograms of the same samples. Alongside this main contribution, this work introduces data augmentation techniques including random frequency masking and time shifting, training techniques such as a class-specific weighted loss, and model ensemble techniques. Experimental results demonstrate that the proposed method achieves PSDS of 0.407/0.653 (scenario1/scenario2) on the validation set, clearly outperforming the baseline scores of 0.342/0.527.
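The core of the method reads like a FixMatch-style scheme for audio tagging: confident predictions on a weakly augmented spectrogram become pseudo-labels for the strongly augmented view. A minimal sketch under that reading; the confidence thresholds are assumptions, not the paper's values.

```python
# Sketch: weak pseudo-labels from a weakly augmented view supervise the
# strongly augmented view of the same unlabeled clip.
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, x_weak_aug, x_strong_aug, pos_thr=0.9, neg_thr=0.1):
    with torch.no_grad():
        probs = torch.sigmoid(model(x_weak_aug))      # clip-level tag probabilities
        pseudo = (probs >= pos_thr).float()           # confident positives become labels
        confident = ((probs >= pos_thr) | (probs <= neg_thr)).float()
    logits = model(x_strong_aug)
    loss = F.binary_cross_entropy_with_logits(logits, pseudo, reduction="none")
    return (loss * confident).mean()                  # uncertain classes are ignored
```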
System characteristics
Task Aware Sound Event Detection Based On Semi-Supervised CRNN With Skip Connections: DCASE 2021 Challenge, Task 4
Hafsati, Mohammed and Bentounes, Kamil
1Beijing Kuaishou Technology Co., Ltd, China 2The State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing, China
Hafsati_TUITO_task4_SED_1 Hafsati_TUITO_task4_SED_2 Hafsati_TUITO_task4_SED_3 Hafsati_TUITO_task4_SED_4
Abstract
Sound Event Detection (SED) is the task of classifying the different sounds occurring in a recorded environment together with their onset and offset times. This is the primary goal of the fourth task of the DCASE challenge, which provides strongly labeled, partially labeled, and unlabeled datasets. In this paper, we describe our submitted approach for this challenge. Our neural network is based on sequential convolutional neural networks with skip connections over some layers, followed by a recurrent neural network. To overcome the challenge of using unlabeled data, we use semi-supervised learning, and to improve performance further, we use data augmentation techniques. With our model, we can slightly outperform the baseline with fewer filters and therefore fewer parameters; with a similar number of parameters to the baseline, we significantly outperform it.
System characteristics
Self-Training With Noisy Student Model And Semi-Supervised Loss Function For DCASE 2021 Challenge Task 4
Kim, Nam Kyun 1 and Kim, Hong Kook 1,2
1School of Electrical Engineering and Computer Science, 123 Cheomdangwagi-ro, Gwangju 61005, Republic of Korea 2 AI Graduate School Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Gwangju 61005, Republic of Korea
Kim_AiTeR_GIST_SED_1 Kim_AiTeR_GIST_SED_2 Kim_AiTeR_GIST_SED_3 Kim_AiTeR_GIST_SED_4
Abstract
This report proposes a polyphonic sound event detection (SED) method for the DCASE 2021 Challenge Task 4. The proposed SED model consists of two stages: a mean-teacher model that provides target labels for weakly labeled or unlabeled data, and a self-training-based noisy student model that predicts strong labels for sound events. The mean-teacher model, based on a residual convolutional recurrent neural network (RCRNN) for both the teacher and student models, is first trained using all the training data from a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. The trained mean-teacher model then predicts strong labels for the weakly labeled and unlabeled datasets, which are passed to the noisy student model in the second stage. The structure of the noisy student model is identical to that of the RCRNN-based student model in the first stage. It is self-trained with added feature noise, such as time-frequency shift, mixup, SpecAugment, and dropout-based model noise. In addition, a semi-supervised loss function is applied to train the noisy student model, which acts as label noise injection. The performance of the proposed SED model is evaluated on the validation set of the DCASE 2021 Challenge Task 4, and several ensemble models that combine five-fold validation models with different hyperparameters of the semi-supervised loss function are selected as our final models.
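The mean-teacher stage rests on one mechanism: the teacher's weights are an exponential moving average of the student's. A minimal sketch; the decay value is the common default rather than necessarily the submission's.

```python
# Sketch: EMA teacher update applied after each training step (mean-teacher framework).
import torch

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # teacher = decay * teacher + (1 - decay) * student
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)
```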
System characteristics
Sound Event Detection Based On Self-Supervised Learning Of Wav2vec 2.0
Koo, Hyejin1 and Park, Hyung-Min1 and Park, Jonghyeon2 and Oh, Myungwoo2
1Dept. of Electronic Engineering, Sogang University, Seoul 04107, South Korea, 2NAVER Corp. Gyeonggi-do 13561, South Korea
Koo_SGU_task4_SED_1 Koo_SGU_task4_SED_2 Koo_SGU_task4_SED_3
Abstract
In this report, we present our system for DCASE2021 Task4: Sound Event Detection (SED) and Separation in Domestic Environments. This task evaluates how to capture the information needed for SED from a relatively small amount of labeled data together with a large amount of unlabeled data. We apply wav2vec 2.0 to SED for the first time. Even though wav2vec 2.0 pre-training on the DCASE2021 Task4 dataset takes a long time to learn audio representations, the presented model achieved higher intersection F1 and PSDS2 scores. The baseline's mean-teacher model and dataset were used to compare wav2vec 2.0 and log-mel features. Under the same conditions, we show how wav2vec 2.0 features perform on the SED task.
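Feeding wav2vec 2.0 features into an SED front end can be sketched with the Hugging Face transformers API; the public checkpoint below is a stand-in, since the authors pre-trained on the Task 4 data themselves.

```python
# Sketch: wav2vec 2.0 frame-level features as a replacement for log-mel input.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def wav2vec2_features(waveform_16k):
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    # (1, frames, 768): roughly one frame every 20 ms, fed to the SED classifier.
    return out.last_hidden_state
```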
System characteristics
Adaptive Focal Loss With Data Augmentation For Semi-Supervised Sound Event Detection
Liang, Yunhao and Tang, Tiantian and Long, Yanhua
Shanghai Normal University, Shanghai, China
Liang_SHNU_task4_SSep_SED_1 Liang_SHNU_task4_SSep_SED_2 Liang_SHNU_task4_SSep_SED_3 Liang_SHNU_task4_SED_4
Abstract
In this technical report, we describe our submission system for DCASE2021 Task4: sound event detection and separation in domestic environments. In our submissions, two different deep models are investigated. The first is a mean-teacher model with a convolutional recurrent neural network (CRNN). The second is a joint framework with an adaptive focal loss based on the Guided Learning architecture. To improve the performance of the system, we use methods such as SpecAugment data augmentation, the adaptive focal loss, and event-specific post-processing. To combine sound separation with sound event detection, we train models on the outputs of the sound separation baseline system. We demonstrate that the proposed method achieves an event-based macro F1 score of 44.4%, 0.428 in PSDS1 and 0.736 in PSDS2 on the validation set.
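The adaptive focal loss builds on the standard binary focal loss, which down-weights easy examples so training concentrates on hard ones; making the focusing parameter adaptive is the submission's extension. A sketch of the standard form with the usual default constants:

```python
# Sketch: binary focal loss for multi-label SED (standard form, default constants).
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of well-classified frames toward zero.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```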
System characteristics
Combined Sound Event Detection And Sound Event Separation Networks For DCASE 2021 Task 4
Liu, Gang and Liu, Zhuang Zhuang and Fang, Jun Yan and Liu, Yi and Zhou, Ming Kun
Beijing University of Posts and Telecommunications, Beijing,China
Liu_BUPT_task4_SED_1 Liu_BUPT_task4_SED_2 Liu_BUPT_task4_SED_3 Liu_BUPT_task4_SED_4 Liu_BUPT_task4_SS_SED_1 Liu_BUPT_task4_SS_SED_2
Abstract
Audio tagging aims to assign one or more labels to an audio clip. In this paper, we present the solutions applied in our submission for DCASE2021 Task4. The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording. We present a convolutional recurrent neural network (CRNN) with two recurrent neural network (RNN) classifiers sharing the same preprocessing convolutional neural network (CNN). Both recurrent networks perform audio tagging: one processes the input audio signal in the forward direction and the other in the backward direction. We also use a spatial attention layer called FcaNet to improve our system. In addition, we built an independent system for sound event separation.
System characteristics
Integrating Advantages Of Recurrent And Transformer Structures For Sound Event Detection In Multiple Scenarios
Lu, Rui 1 and Hu, Wenzheng 2 and Duan, Zhiyao 1 and Liu, Ji 1
1Beijing Kuaishou Technology Co., Ltd, China 2The State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing, China
lu_kwai_task4_SED_1 lu_kwai_task4_SED_2 lu_kwai_task4_SED_3 lu_kwai_task4_SED_4
Abstract
In this technical report, we detail our submitted systems for task4 of DCASE2021: Sound Event Detection and Separation in Domestic Environments. Our systems exploit both recurrent and transformer structures to model the complicated dynamics in real-life domestic audio data. In addition to prevalent techniques such as semi-supervised mean-teacher learning, data augmentation and ensembling, we find that different models behave differently under the two scenarios, which emphasize different system properties. By integrating the advantages of both the recurrent and transformer structures, our proposed systems achieve an overall polyphonic sound event detection score (PSDS-scenario1 + PSDS-scenario2) of 1.171 on the hold-out test set of the development dataset, outperforming the baseline system by 34.8%.
System characteristics
Convolutional Network With Conformer For Semi-Supervised Sound Event Detection
Na, Tong and Zhang, Qinyi
Beijing University of Posts and Telecommunications, Beijing,China
Na_BUPT_task4_SED_1
Abstract
In this technical report, we describe our system submission for DCASE 2021 Task 4. Our model employs a convolutional network in conjunction with conformer blocks and utilizes the Mean-Teacher semi-supervised learning technique for further improvement.
System characteristics
Heavily Augmented Sound Event Detection utilizing Weak Predictions
Nam, Hyeonuk and Ko, Byeong-Yun and Lee, Gyeong-Tae and Kim, Seong-Hu and Jung, Won-Ho and Choi, Sang-Min and Park, Yong-Hwa
Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea
Nam_KAIST_task4_SED_1 Nam_KAIST_task4_SED_2 Nam_KAIST_task4_SED_3 Nam_KAIST_task4_SED_4
Abstract
The performance of Sound Event Detection (SED) systems is greatly limited by the difficulty of generating large strongly labeled datasets. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation to the input features, using not only conventional methods from the speech/audio domain but also our proposed method named FilterAugment. Second, we propose two methods that utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained a best PSDS1 of 0.4336 and a best PSDS2 of 0.8161 on the DESED real validation dataset.
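One plausible reading of utilizing weak predictions (listed as "weak prediction masking" in the system characteristics) is gating the frame-level posteriors with the clip-level ones, so a class is only detected in clips where it is tagged. A hedged sketch of that idea, not the authors' exact rule:

```python
# Sketch: clip-level (weak) probabilities mask frame-level (strong) probabilities.
import torch

def mask_strong_with_weak(strong_probs, weak_probs, weak_threshold=0.5):
    """strong_probs: (batch, n_classes, time); weak_probs: (batch, n_classes)."""
    mask = (weak_probs > weak_threshold).float().unsqueeze(-1)  # (batch, n_classes, 1)
    return strong_probs * mask  # zero out events in clips not tagged with the class
```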
System characteristics
Domain-Adapted Sound Event Detection System With Auxiliary Foreground-Background Classifier
Olvera, Michel1 and Vincent, Emmanuel1 and Gasso, Gilles2
1Université de Lorraine, Inria, Loria, F-54000 Nancy, France, 2LITIS EA 4108, Université & INSA Rouen Normandie, 76800 Saint-Étienne du Rouvray, France
Olvera_INRIA_task4_SED_1 Olvera_INRIA_task4_SED_2
Abstract
In this technical report, we propose a sound event detection system for the DCASE 2021 task 4 challenge, which consists of a foreground-background classification branch that is jointly trained with the baseline architecture. Furthermore, to account for the mismatch between synthetic annotated data and real unlabeled data used for training, we also propose a frame-level domain adaptation scheme to improve detection performance over real soundscapes. We show that these improvements to the baseline method help in the generalization of the sound event detection task.
System characteristics
Sound Event Detection with Cross-Referencing Self-Training
Park, Sangwook1 and Choi, Woohyun2 and Elhilali, Mounya1
1Department of Electrical and Computer Engineering, Johns Hopkins University, United States, 2LG Electronics
Park_JHU_task4_SED_1 Park_JHU_task4_SED_2 Park_JHU_task4_SED_3 Park_JHU_task4_SED_4
Abstract
This report describes a sound event detection method submitted to the DCASE2021 challenge, task 4. In this approach, we design a residual convolutional recurrent neural network and train it with a cross-referencing self-training approach that leverages extensive unlabeled data in combination with labeled data. The approach takes advantage of semi-supervised training using pseudo-labels from a balanced student-teacher model, and it outperforms the DCASE2021 challenge baseline in terms of the Polyphonic Sound Detection Score. Additionally, the proposed network yields more accurate predictions in class-wise collar-based F1, compared to the baseline.
System characteristics
Sound Event Detection Using Metric Learning And Focal Loss For DCASE 2021 Task 4
Tian, Gangyi1 and Huang, Yuxin1,2 and Ye, Zhirong1,2 and Ma, Shuo1,2 and Wang, Xiangdong1 and Liu, Hong1 and Qian, Yueliang1 and Tao, Rui3 and Yan, Long3 and Ouchi, Kazushige3 and Ebbers, Janek4 and Haeb-Umbach, Reinhold4
1Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2University of Chinese Academy of Sciences, Beijing, China, 3Toshiba China R&D Center, Beijing, China, Beijing University of Posts and Telecommunications, Beijing, China, 4Paderborn University, Germany
Tian_ICT-TOSHIBA_task4_SED_1 Tian_ICT-TOSHIBA_task4_SED_2 Tian_ICT-TOSHIBA_task4_SED_3 Tian_ICT-TOSHIBA_task4_SED_4
Abstract
In this technical report, we describe our system submission for DCASE 2021 Task 4. Our model employs a convolutional network in conjunction with conformer blocks and utilizes the Mean-Teacher semi-supervised learning technique for further improvement.
System characteristics
Training Sound Event Detection On A Heterogeneous Dataset
Turpault, Nicolas and Serizel, Romain
Université de Lorraine, CNRS, Inria, Loria, France
Abstract
Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | log-mel energies |
Decision making | p-norm |
Improving Sound Event Detection In Domestic Environments Using Sound Separation
Turpault, Nicolas1 and Wisdom, Scott2 and Erdogan, Hakan2 and Hershey, John R.2 and Serizel, Romain1 and Fonseca, Eduardo3 and Seetharaman, Prem4 and Salamon, Justin5
1Université de Lorraine, CNRS, Inria, Loria, Nancy, France, 2Google Research, AI Perception, Cambridge, United States, 3Music Technology Group, Universitat Pompeu Fabra, Barcelona, 4Interactive Audio Lab, Northwestern University, Evanston, United States, 5Adobe Research, San Francisco, United States
DCASE2020_SS_SED_baseline_system
Abstract
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | log-mel energies |
Decision making | p-norm |
CHT+NSYSU Sound Event Detection System With Multiscale Channel Attention And Multiple Consistency Training For DCASE 2021 Task 4
Wang, Yih-Wen 1 and Chen, Chia-Ping 1 and Lu, Chung-Li 2 and Chan, Bo-Cheng 1
1National Sun Yat-Sen University, Taiwan 2 Chunghwa Telecom Laboratories, Taiwan
Wang_NSYSU_task4_SED_1 Wang_NSYSU_task4_SED_2 Wang_NSYSU_task4_SED_3 Wang_NSYSU_task4_SED_4
Abstract
In this technical report, we describe our submission system for DCASE 2021 Task4: sound event detection and separation in domestic environments. The proposed system is based on the mean-teacher framework of semi-supervised learning and on CRNN and CNN-Transformer neural networks. We employ interpolation (ICT), shift (SCT), and clip-level (CCT) consistency training to enhance generalization and representation. A multiscale CNN block is applied to extract diverse features and mitigate the influence of event length diversity on the network. An efficient channel attention network (ECA-Net) and exponential softmax pooling enable the model to obtain definite sound event predictions. To further improve the performance, we use data augmentation including mixup, time shift, and time-frequency masks. Our ensemble system achieves a PSDS-scenario1 of 40.72% and a PSDS-scenario2 of 80.80% on the validation set, significantly outperforming the baseline scores of 34.2% and 52.7%, respectively.
System characteristics
Adaptive Memory Controlled Self Attention For Sound Event Detection
Yao, Yu and Song, Xiyu
Guilin University of Electronic Technology, Guilin, 541004, Guangxi, China
Yao_GUET_task4_SED_1 Yao_GUET_task4_SED_2 Yao_GUET_task4_SED_3
Abstract
Sound event detection is the task of detecting the time stamps and the class of sound events occurring in a recording. Real-life sound events overlap in recordings, and their durations vary far more than in synthetic data, making them even harder to recognize. In this paper we investigate how well an attention mechanism can improve real-life sound event detection (SED). Convolutional Recurrent Neural Networks (CRNN) have recently shown improved performance over established methods in various sound recognition tasks. In our work we use a CRNN to extract hidden-state feature representations; a self-attention mechanism is then introduced to memorize long-range dependencies of the features the CRNN extracts. Furthermore, we propose adaptive memory-controlled self-attention to explicitly compute the relations between time steps in the audio representation embedding. The proposed method is evaluated on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge Task4 dataset, which contains overlapping sound events from real life and from synthesis. We develop a self-attention SED model that uses a memory-controlled strategy with a heuristically chosen fixed attention width, achieving a PSDS-scenario2 of 60.72% on average, indicating that the attention mechanism is able to improve sound event detection. We show that the proposed adaptive memory-controlled model reaches the same level of results as the fixed-attention-width memory-controlled model.
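Memory-controlled self-attention with a fixed width amounts to masking the attention scores so each frame only attends within a window; the adaptive variant would learn that width. A generic sketch of the fixed-width mask, not the authors' code:

```python
# Sketch: fixed-width attention mask for memory-controlled self-attention.
import torch

def attention_width_mask(n_frames, width):
    idx = torch.arange(n_frames)
    # True where |i - j| <= width: frame i may only attend to nearby frames.
    return (idx[None, :] - idx[:, None]).abs() <= width

# Applied to raw attention scores before the softmax, e.g.:
# scores = scores.masked_fill(~attention_width_mask(T, 20), float("-inf"))
```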
System characteristics
Semi-Supervised Sound Event Detection Using Multi-Scale Convolutional Recurrent Neural Network And Weighted Pooling
Yu, Dongchi and Cai, Xichang and Liu, Duxin and Liu, Zihan
North China University of Technology, Beijing, China
Yu_NCUT_task4_SED_1 Yu_NCUT_task4_SED_2
Abstract
In this technical report, we describe our submission system for DCASE2021 Task4: sound event detection and separation in domestic environments. We mainly focus on the scenario of recognizing sound events without source separation. Since the durations of different sound events can be quite different, our model employs a multi-scale convolutional recurrent network to extract multi-scale features of an audio sequence. To utilize weakly labeled training data more efficiently, a global weighted pooling strategy is introduced to aggregate frame-level predictions into a clip-level prediction. Additionally, our model uses the mean teacher semi-supervised learning technique and data augmentation. We demonstrate that the proposed method achieves a PSDS2 score of 0.61 and an event-based macro F1 score of 42.15% on the validation set.
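The global weighted pooling can be illustrated by the common "linear softmax" variant, in which each frame weights itself by its own probability, so confident frames dominate the clip-level score. One plausible instance of the idea, not necessarily the submission's exact weighting:

```python
# Sketch: weighted pooling from frame-level to clip-level predictions.
import torch

def linear_softmax_pool(frame_probs, eps=1e-8):
    """frame_probs: (batch, time, n_classes) -> clip-level probs (batch, n_classes)."""
    weights = frame_probs                    # each frame is weighted by its own prob.
    return (frame_probs * weights).sum(dim=1) / (weights.sum(dim=1) + eps)
```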
System characteristics
Zheng USTC Team’s Submission For DCASE2021 Task4 – Semi-Supervised Sound Event Detection
Zheng, Xu and Chen, Han and Song, Yan
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China.
Zheng_USTC_task4_SED_1 Zheng_USTC_task4_SED_2 Zheng_USTC_task4_SED_3 Zheng_USTC_task4_SED_4
Abstract
In this technical report, we present our submitted system for DCASE2021 Task4: sound event detection and separation in domestic environments. Three main techniques are applied to improve the performance of the official baseline system using both synthetic and real data (weakly labeled and unlabeled). Firstly, to improve the localization ability of the CRNN model, we propose to use the selective kernel (SK) unit. By stacking SK units, each neuron can adaptively adjust its receptive field for both short- and long-duration events. Secondly, based on the observation that detection outputs are dominated by high-confidence predictions (lower than 0.1 or higher than 0.9), we propose to use a soft detection output by setting a proper temperature parameter in the sigmoid, which can effectively improve the PSDS2 score. Thirdly, several data augmentation techniques and score fusion mechanisms are applied to improve the stability and robustness of the system. Experiments on the DCASE2021 Task4 validation dataset demonstrate the effectiveness of these techniques: PSDS scores of 0.45 and 0.78 are achieved for scenario1 and scenario2 respectively, outperforming the baseline results of 0.34 and 0.53.
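The second technique is easy to make concrete: dividing the logits by a temperature before the sigmoid pulls the near-saturated scores away from 0 and 1, producing the softer output curve that benefits PSDS2. The temperature value below is illustrative, not the tuned parameter:

```python
# Sketch: soft detection outputs via a temperature in the sigmoid.
import torch

def soft_detection_output(logits, temperature=2.0):
    # T > 1 flattens the curve, so fewer posteriors saturate near 0 or 1.
    return torch.sigmoid(logits / temperature)
```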
System characteristics
Multi-Scale Convolution Based Attention Network For Semi-Supervised Sound Event Detection
Zhu, Xiujuan1,3 and Sun, Xinghao1,3 and Hu, Ying1,3 and Chen, Yadong1,3 and Qiu, Wenbo1,3 and Tang, Yuwu1,3 and He, Liang1,2 and Xu, Minqiang4
1School of Information Science and Engineering, Xinjiang University, Urumqi, China, 2Department of Electronic Engineering, Tsinghua University, China, 3Key Laboratory of Signal Detection and Processing in Xinjiang, China, 4SpeakIn Technology
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_2
Abstract
Deep Convolutional Recurrent Neural Networks (CRNN) have drawn great attention in sound event detection (SED). Because the variation in duration of acoustic events is relatively large, it is critically important to design a good operator that can extract multi-scale features efficiently for SED. However, most CRNN-based models lack discriminative ability for different types of acoustic events and treat them equally, which limits the representational capacity of the models. Inspired by this, we propose a Multi-Scale Convolution based Attention Network (MSCA). Multi-scale convolution yields a more effective feature representation, naturally learning coarse-to-fine multi-scale features that help the model recognize different sound events. In addition, a channel-wise attention module is designed, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
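The channel-wise attention module described here follows the squeeze-and-excitation pattern: global pooling summarizes each channel, a small bottleneck models inter-channel dependencies, and the result rescales the feature maps. A minimal sketch with illustrative sizes, not the authors' exact module:

```python
# Sketch: squeeze-and-excitation style channel-wise attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, freq, time)
        squeeze = x.mean(dim=(2, 3))         # global average pool per channel
        scale = self.fc(squeeze)             # channel weights in (0, 1)
        return x * scale[:, :, None, None]   # recalibrate the feature maps
```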