Sound Event Detection and Separation in Domestic Environments


Challenge results

Task description

The task evaluates systems for the detection of sound events using weakly labeled data (without timestamps). Systems must provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. The challenge remains to explore how a large amount of unbalanced and unlabeled training data can be exploited, together with a small weakly annotated training set, to improve system performance. Isolated sound events, background sound files, and scripts to design a training set with strongly annotated synthetic data are provided. The labels in all the annotated subsets are verified and can be considered reliable.

A more detailed task description can be found on the task description page.

Systems ranking

Rank | Submission code | Submission name | Technical report | Sound separation | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset) | PSDS 1 (Development dataset) | PSDS 2 (Development dataset)
Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na2021 0.80 0.245 0.452 0.313 0.535
Hafsati_TUITO_task4_SED_3 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.91 0.287 0.502 0.325 0.561
Hafsati_TUITO_task4_SED_4 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.91 0.287 0.502 0.325 0.561
Hafsati_TUITO_task4_SED_1 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.03 0.334 0.549 0.345 0.555
Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.04 0.336 0.550 0.345 0.555
Gong_TAL_task4_SED_3 TAL SED system Gong2021 1.16 0.370 0.626 0.407 0.653
Gong_TAL_task4_SED_2 TAL SED system Gong2021 1.15 0.367 0.616 0.407 0.648
Gong_TAL_task4_SED_1 TAL SED system Gong2021 1.14 0.364 0.611 0.398 0.642
Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park2021 1.07 0.327 0.603 0.524 0.674
Park_JHU_task4_SED_4 Park_JHU_task4_SED_4 Park2021 0.86 0.237 0.524 0.446 0.561
Park_JHU_task4_SED_1 Park_JHU_task4_SED_1 Park2021 1.01 0.305 0.579 0.508 0.668
Park_JHU_task4_SED_3 Park_JHU_task4_SED_3 Park2021 0.84 0.222 0.537 0.456 0.596
Zheng_USTC_task4_SED_4 DCASE2020 SED Mean teacher system 4 Zheng2021 1.30 0.389 0.742 0.402 0.786
Zheng_USTC_task4_SED_1 DCASE2020 SED Mean teacher system 1 Zheng2021 1.33 0.452 0.669 0.454 0.671
Zheng_USTC_task4_SED_3 DCASE2020 SED Mean teacher system 3 Zheng2021 1.29 0.386 0.746 0.397 0.788
Zheng_USTC_task4_SED_2 DCASE2020 SED Mean teacher system 2 Zheng2021 1.33 0.447 0.676 0.454 0.680
Nam_KAIST_task4_SED_2 SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 Nam2021 1.19 0.399 0.609 0.434 0.639
Nam_KAIST_task4_SED_1 SED_default Nam2021 1.16 0.378 0.617 0.423 0.658
Nam_KAIST_task4_SED_3 SED_AFL Nam2021 1.09 0.324 0.634 0.381 0.692
Nam_KAIST_task4_SED_4 Weak_SED Nam2021 0.75 0.059 0.715 0.064 0.816
Koo_SGU_task4_SED_2 DCASE2021 SED system using wav2vec Koo2021 0.12 0.044 0.059 0.316 0.337
Koo_SGU_task4_SED_3 DCASE2021 SED system using wav2vec Koo2021 0.41 0.058 0.348 0.249 0.711
Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo2021 0.74 0.258 0.364 0.295 0.503
deBenito_AUDIAS_task4_SED_4 5-Resolution Mean Teacher de Benito-Gorron2021 1.10 0.361 0.577 0.386 0.600
deBenito_AUDIAS_task4_SED_1 3-Resolution Mean Teacher de Benito-Gorron2021 1.07 0.343 0.571 0.380 0.589
deBenito_AUDIAS_task4_SED_2 3-Resolution Mean Teacher (Higher time resolutions) de Benito-Gorron2021 1.10 0.363 0.574 0.386 0.578
deBenito_AUDIAS_task4_SED_3 4-Resolution Mean Teacher de Benito-Gorron2021 1.07 0.345 0.571 0.372 0.600
Baseline_SSep_SED DCASE2021 SSep SED baseline system turpault2020b 1.11 0.364 0.580 0.342 0.527
Boes_KUL_task4_SED_4 CRNN with optimized pooling operations for scenario 2 (2) Boes2021 0.60 0.117 0.457 0.154 0.729
Boes_KUL_task4_SED_3 CRNN with optimized pooling operations for scenario 2 (1) Boes2021 0.68 0.121 0.531 0.158 0.731
Boes_KUL_task4_SED_2 CRNN with optimized pooling operations for scenario 1 (2) Boes2021 0.77 0.233 0.440 0.359 0.601
Boes_KUL_task4_SED_1 CRNN with optimized pooling operations for scenario 1 (1) Boes2021 0.81 0.253 0.442 0.361 0.593
Ebbers_UPB_task4_SED_2 UPB system 2 Ebbers2021 1.10 0.335 0.621 0.377 0.748
Ebbers_UPB_task4_SED_4 UPB system 4 Ebbers2021 1.16 0.363 0.637 0.393 0.758
Ebbers_UPB_task4_SED_3 UPB system 3 Ebbers2021 1.24 0.416 0.635 0.454 0.726
Ebbers_UPB_task4_SED_1 UPB system 1 Ebbers2021 1.16 0.373 0.621 0.429 0.727
Zhu_AIAL-XJU_task4_SED_2 Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.99 0.290 0.574 0.342 0.614
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 0.318 0.583 0.354 0.613
Liu_BUPT_task4_4 DCASE2020 liuliuliufangzhou system Liu2021 0.37 0.102 0.231 0.348 0.551
Liu_BUPT_task4_1 DCASE2020 liuliuliufangzhou system Liu2021 0.30 0.090 0.169 0.348 0.551
Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu2021 0.54 0.152 0.322 0.348 0.551
Liu_BUPT_task4_3 DCASE2020 liuliuliufangzhou system Liu2021 0.24 0.068 0.146 0.348 0.551
Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera2021 0.98 0.338 0.481
Olvera_INRIA_task4_SED_1 DA-SED + FG/BG Olvera2021 0.95 0.332 0.462
Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim2021 1.32 0.442 0.674 0.457 0.685
Kim_AiTeR_GIST_SED_2 RCRNN-based noisy student SED Kim2021 1.31 0.439 0.667 0.450 0.682
Kim_AiTeR_GIST_SED_3 RCRNN-based noisy student SED Kim2021 1.30 0.434 0.669 0.451 0.679
Kim_AiTeR_GIST_SED_1 RCRNN-based noisy student SED Kim2021 1.29 0.431 0.661 0.449 0.675
Cai_SMALLRICE_task4_SED_1 DCASE2021_Cai_SED_CDur_Ensemble_1 Dinkel2021 1.11 0.361 0.584 0.375 0.619
Cai_SMALLRICE_task4_SED_2 DCASE2021_Cai_SED_CDur_Ensemble_2 Dinkel2021 1.13 0.373 0.585 0.382 0.622
Cai_SMALLRICE_task4_SED_3 DCASE2021_Cai_SED_CDur_Ensemble_3 Dinkel2021 1.13 0.370 0.596 0.381 0.629
Cai_SMALLRICE_task4_SED_4 DCASE2021_Cai_SED_CDur_Single_4 Dinkel2021 1.00 0.339 0.504 0.369 0.571
HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYu2021 0.90 0.294 0.473 0.134 0.557
HangYuChen_Roal_task4_SED_1 DCASE2021 SED system YuHang2021 0.61 0.098 0.496 0.340 0.523
Yu_NCUT_task4_SED_1 multi-scale CRNN Yu2021 0.20 0.038 0.157 0.330 0.540
Yu_NCUT_task4_SED_2 multi-scale CRNN Yu2021 0.92 0.301 0.485 0.110 0.610
lu_kwai_task4_SED_1 DCASE2021 SED CRNN Model1 Lu2021 1.27 0.419 0.660 0.419 0.638
lu_kwai_task4_SED_4 DCASE2021 SED Conformer Model2 Lu2021 0.88 0.157 0.685 0.177 0.749
lu_kwai_task4_SED_3 DCASE2021 SED Conformer Model1 Lu2021 0.86 0.148 0.686 0.173 0.752
lu_kwai_task4_SED_2 DCASE2021 SED CRNN Model2 Lu2021 1.25 0.412 0.651 0.418 0.637
Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 0.302 0.507 0.360 0.550
Liu_BUPT_task4_SS_SED_1 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 0.302 0.507 0.360 0.550
Tian_ICT-TOSHIBA_task4_SED_2 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 0.411 0.585 0.396 0.587
Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 0.413 0.586 0.401 0.597
Tian_ICT-TOSHIBA_task4_SED_4 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 0.412 0.586 0.398 0.599
Tian_ICT-TOSHIBA_task4_SED_3 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.18 0.409 0.584 0.392 0.585
Yao_GUET_task4_SED_3 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.88 0.279 0.479 0.328 0.530
Yao_GUET_task4_SED_1 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.88 0.277 0.482 0.332 0.533
Yao_GUET_task4_SED_2 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.54 0.056 0.496 0.060 0.618
Liang_SHNU_task4_SED_4 Guided Learning system Liang2021 0.99 0.313 0.543 0.328 0.575
Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik2021 1.02 0.330 0.544 0.374 0.586
Bajzik_UNIZA_task4_SED_1 CAM-based SED system Bajzik2021 0.45 0.133 0.266 0.165 0.348
Liang_SHNU_task4_SSep_SED_3 Mean teacher system Liang_SS2021 0.99 0.304 0.559 0.426 0.726
Liang_SHNU_task4_SSep_SED_1 Mean teacher system Liang_SS2021 1.03 0.313 0.588 0.428 0.736
Liang_SHNU_task4_SSep_SED_2 Mean teacher system Liang_SS2021 1.01 0.325 0.542 0.418 0.721
Baseline_SED DCASE2021 SED baseline system turpault2020a 1.00 0.315 0.547 0.342 0.527
Wang_NSYSU_task4_SED_1 DCASE2021_SED_A Wang2021 1.13 0.336 0.646 0.407 0.703
Wang_NSYSU_task4_SED_4 DCASE2021_SED_D Wang2021 1.09 0.304 0.662 0.370 0.724
Wang_NSYSU_task4_SED_2 DCASE2021_SED_B Wang2021 0.69 0.070 0.636 0.061 0.808
Wang_NSYSU_task4_SED_3 DCASE2021_SED_C Wang2021 1.13 0.339 0.649 0.388 0.672
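The ranking scores in the table above are consistent with normalizing each system's two PSDS values by the corresponding baseline scores (Baseline_SED: PSDS 1 = 0.315, PSDS 2 = 0.547 on the evaluation dataset) and averaging, so the baseline lands exactly at 1.00. A minimal sketch of that computation, assuming this is indeed the formula used (the helper name `ranking_score` is illustrative):

```python
def ranking_score(psds1, psds2, baseline_psds1=0.315, baseline_psds2=0.547):
    """Average of the two PSDS values, each normalized by the baseline's score."""
    return (psds1 / baseline_psds1 + psds2 / baseline_psds2) / 2

# Na_BUPT_task4_SED_1: PSDS 1 = 0.245, PSDS 2 = 0.452 on the evaluation dataset
print(round(ranking_score(0.245, 0.452), 2))  # 0.8, matching the table
```

The same normalization reproduces the team-ranking scores below, where the best PSDS 1 and best PSDS 2 per team may come from different submitted systems.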

Supplementary metrics

Rank | Submission code | Submission name | Technical report | Sound separation | PSDS 1 (Evaluation dataset) | PSDS 1 (Public evaluation) | PSDS 1 (Vimeo dataset) | PSDS 2 (Evaluation dataset) | PSDS 2 (Public evaluation) | PSDS 2 (Vimeo dataset) | F-score (Evaluation dataset) | F-score (Public evaluation) | F-score (Vimeo dataset)
Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na2021 0.245 0.269 0.185 0.452 0.485 0.354 25.0 27.5 19.5
Hafsati_TUITO_task4_SED_3 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.287 0.321 0.207 0.502 0.547 0.386 35.7 39.2 27.4
Hafsati_TUITO_task4_SED_4 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.287 0.322 0.209 0.502 0.547 0.387 37.2 40.9 28.0
Hafsati_TUITO_task4_SED_1 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.334 0.370 0.249 0.549 0.591 0.437 39.5 43.8 29.0
Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.336 0.374 0.249 0.550 0.591 0.440 40.9 44.9 31.3
Gong_TAL_task4_SED_3 TAL SED system Gong2021 0.370 0.419 0.273 0.626 0.672 0.509 41.9 45.1 34.0
Gong_TAL_task4_SED_2 TAL SED system Gong2021 0.367 0.409 0.271 0.616 0.654 0.512 42.7 45.9 34.8
Gong_TAL_task4_SED_1 TAL SED system Gong2021 0.364 0.409 0.266 0.611 0.661 0.486 41.5 44.9 33.0
Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park2021 0.327 0.371 0.240 0.603 0.644 0.492 38.4 42.2 28.9
Park_JHU_task4_SED_4 Park_JHU_task4_SED_4 Park2021 0.237 0.267 0.174 0.524 0.568 0.417 36.9 39.8 29.6
Park_JHU_task4_SED_1 Park_JHU_task4_SED_1 Park2021 0.305 0.344 0.214 0.579 0.617 0.471 34.7 37.8 26.9
Park_JHU_task4_SED_3 Park_JHU_task4_SED_3 Park2021 0.222 0.244 0.166 0.537 0.578 0.430 33.5 35.4 28.5
Zheng_USTC_task4_SED_4 DCASE2020 SED Mean teacher system 4 Zheng2021 0.389 0.438 0.261 0.742 0.775 0.644 49.5 54.2 36.9
Zheng_USTC_task4_SED_1 DCASE2020 SED Mean teacher system 1 Zheng2021 0.452 0.517 0.318 0.669 0.725 0.530 52.3 57.4 39.2
Zheng_USTC_task4_SED_3 DCASE2020 SED Mean teacher system 3 Zheng2021 0.386 0.429 0.270 0.746 0.778 0.650 49.7 55.0 36.3
Zheng_USTC_task4_SED_2 DCASE2020 SED Mean teacher system 2 Zheng2021 0.447 0.506 0.318 0.676 0.730 0.546 52.9 57.7 40.2
Nam_KAIST_task4_SED_2 SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 Nam2021 0.399 0.443 0.299 0.609 0.641 0.525 48.0 52.2 37.1
Nam_KAIST_task4_SED_1 SED_default Nam2021 0.378 0.426 0.285 0.617 0.666 0.506 44.2 47.8 34.5
Nam_KAIST_task4_SED_3 SED_AFL Nam2021 0.324 0.364 0.235 0.634 0.672 0.536 29.3 32.3 22.7
Nam_KAIST_task4_SED_4 Weak_SED Nam2021 0.059 0.069 0.022 0.715 0.750 0.616 12.5 13.1 11.4
Koo_SGU_task4_SED_2 DCASE2021 SED system using wav2vec Koo2021 0.044 0.050 0.024 0.059 0.057 0.047 12.4 13.8 9.4
Koo_SGU_task4_SED_3 DCASE2021 SED system using wav2vec Koo2021 0.058 0.060 0.048 0.348 0.406 0.257 8.5 9.0 7.3
Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo2021 0.258 0.282 0.183 0.364 0.401 0.241 20.5 22.2 16.2
deBenito_AUDIAS_task4_SED_4 5-Resolution Mean Teacher de Benito-Gorron2021 0.361 0.405 0.262 0.577 0.635 0.443 42.7 46.7 32.7
deBenito_AUDIAS_task4_SED_1 3-Resolution Mean Teacher de Benito-Gorron2021 0.343 0.387 0.245 0.571 0.628 0.439 42.6 46.4 33.2
deBenito_AUDIAS_task4_SED_2 3-Resolution Mean Teacher (Higher time resolutions) de Benito-Gorron2021 0.363 0.406 0.265 0.574 0.630 0.449 43.1 47.0 33.6
deBenito_AUDIAS_task4_SED_3 4-Resolution Mean Teacher de Benito-Gorron2021 0.345 0.383 0.255 0.571 0.628 0.438 42.2 46.4 31.6
Baseline_SSep_SED DCASE2021 SSep SED baseline system turpault2020b 0.364 0.407 0.283 0.580 0.627 0.471 42.0 44.9 34.7
Boes_KUL_task4_SED_4 CRNN with optimized pooling operations for scenario 2 (2) Boes2021 0.117 0.131 0.078 0.457 0.500 0.346 10.6 11.8 7.9
Boes_KUL_task4_SED_3 CRNN with optimized pooling operations for scenario 2 (1) Boes2021 0.121 0.139 0.081 0.531 0.555 0.435 14.0 15.9 9.3
Boes_KUL_task4_SED_2 CRNN with optimized pooling operations for scenario 1 (2) Boes2021 0.233 0.266 0.143 0.440 0.489 0.310 31.2 34.4 22.6
Boes_KUL_task4_SED_1 CRNN with optimized pooling operations for scenario 1 (1) Boes2021 0.253 0.290 0.150 0.442 0.483 0.319 31.0 34.7 21.3
Ebbers_UPB_task4_SED_2 UPB system 2 Ebbers2021 0.335 0.369 0.269 0.621 0.661 0.519 54.1 57.2 46.7
Ebbers_UPB_task4_SED_4 UPB system 4 Ebbers2021 0.363 0.407 0.285 0.637 0.683 0.533 56.7 59.6 49.4
Ebbers_UPB_task4_SED_3 UPB system 3 Ebbers2021 0.416 0.455 0.328 0.635 0.684 0.519 56.7 59.6 49.4
Ebbers_UPB_task4_SED_1 UPB system 1 Ebbers2021 0.373 0.410 0.300 0.621 0.661 0.516 54.1 57.2 46.7
Zhu_AIAL-XJU_task4_SED_2 Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.290 0.319 0.216 0.574 0.640 0.438 43.0 47.1 33.0
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu2021 0.318 0.357 0.238 0.583 0.641 0.451 40.2 43.5 32.3
Liu_BUPT_task4_4 DCASE2020 liuliuliufangzhou system Liu2021 0.102 0.123 0.043 0.231 0.244 0.165 17.5 19.4 12.9
Liu_BUPT_task4_1 DCASE2020 liuliuliufangzhou system Liu2021 0.090 0.101 0.040 0.169 0.176 0.110 18.1 19.6 13.8
Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu2021 0.152 0.173 0.099 0.322 0.347 0.234 23.6 25.7 18.2
Liu_BUPT_task4_3 DCASE2020 liuliuliufangzhou system Liu2021 0.068 0.086 0.012 0.146 0.152 0.104 15.1 16.9 10.8
Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera2021 0.338 0.382 0.218 0.481 0.528 0.357 43.4 48.4 30.0
Olvera_INRIA_task4_SED_1 DA-SED + FG/BG Olvera2021 0.332 0.375 0.205 0.462 0.506 0.333 45.5 50.2 33.1
Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim2021 0.442 0.492 0.330 0.674 0.715 0.573 50.6 53.3 43.5
Kim_AiTeR_GIST_SED_2 RCRNN-based noisy student SED Kim2021 0.439 0.492 0.319 0.667 0.710 0.564 50.5 53.3 43.0
Kim_AiTeR_GIST_SED_3 RCRNN-based noisy student SED Kim2021 0.434 0.481 0.326 0.669 0.709 0.570 49.4 52.4 41.8
Kim_AiTeR_GIST_SED_1 RCRNN-based noisy student SED Kim2021 0.431 0.478 0.320 0.661 0.702 0.554 49.9 52.3 43.6
Cai_SMALLRICE_task4_SED_1 DCASE2021_Cai_SED_CDur_Ensemble_1 Dinkel2021 0.361 0.406 0.239 0.584 0.654 0.418 37.8 41.4 28.3
Cai_SMALLRICE_task4_SED_2 DCASE2021_Cai_SED_CDur_Ensemble_2 Dinkel2021 0.373 0.423 0.243 0.585 0.652 0.422 38.8 41.9 30.3
Cai_SMALLRICE_task4_SED_3 DCASE2021_Cai_SED_CDur_Ensemble_3 Dinkel2021 0.370 0.419 0.241 0.596 0.662 0.433 38.8 42.0 30.7
Cai_SMALLRICE_task4_SED_4 DCASE2021_Cai_SED_CDur_Single_4 Dinkel2021 0.339 0.386 0.212 0.504 0.561 0.356 38.4 42.0 29.5
HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYu2021 0.294 0.327 0.205 0.473 0.510 0.350 34.2 37.8 25.5
HangYuChen_Roal_task4_SED_1 DCASE2021 SED system YuHang2021 0.098 0.104 0.090 0.496 0.515 0.391 10.7 11.7 8.7
Yu_NCUT_task4_SED_1 multi-scale CRNN Yu2021 0.038 0.039 0.045 0.157 0.182 0.144 6.8 7.9 4.0
Yu_NCUT_task4_SED_2 multi-scale CRNN Yu2021 0.301 0.341 0.197 0.485 0.528 0.360 34.4 37.8 25.7
lu_kwai_task4_SED_1 DCASE2021 SED CRNN Model1 Lu2021 0.419 0.468 0.314 0.660 0.702 0.556 45.0 48.6 36.0
lu_kwai_task4_SED_4 DCASE2021 SED Conformer Model2 Lu2021 0.157 0.177 0.125 0.685 0.714 0.598 15.7 16.6 14.6
lu_kwai_task4_SED_3 DCASE2021 SED Conformer Model1 Lu2021 0.148 0.170 0.114 0.686 0.715 0.597 15.6 16.7 14.0
lu_kwai_task4_SED_2 DCASE2021 SED CRNN Model2 Lu2021 0.412 0.461 0.313 0.651 0.694 0.550 45.5 48.9 36.9
Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.302 0.328 0.235 0.507 0.537 0.410 37.6 40.5 30.5
Liu_BUPT_task4_SS_SED_1 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.302 0.328 0.235 0.507 0.537 0.410 38.4 40.9 32.2
Tian_ICT-TOSHIBA_task4_SED_2 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 0.411 0.462 0.307 0.585 0.639 0.473 38.3 41.2 31.6
Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 0.413 0.468 0.306 0.586 0.640 0.473 38.3 41.2 31.6
Tian_ICT-TOSHIBA_task4_SED_4 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 0.412 0.467 0.306 0.586 0.639 0.473 38.3 41.2 31.6
Tian_ICT-TOSHIBA_task4_SED_3 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 0.409 0.456 0.307 0.584 0.637 0.472 38.3 41.2 31.6
Yao_GUET_task4_SED_3 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.279 0.312 0.197 0.479 0.526 0.357 34.2 37.1 27.4
Yao_GUET_task4_SED_1 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.277 0.305 0.215 0.482 0.510 0.388 31.9 34.2 26.4
Yao_GUET_task4_SED_2 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.056 0.064 0.048 0.496 0.529 0.389 8.9 9.5 7.5
Liang_SHNU_task4_SED_4 Guided Learning system Liang2021 0.313 0.349 0.226 0.543 0.589 0.422 36.0 39.5 27.5
Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik2021 0.330 0.383 0.216 0.544 0.602 0.398 39.8 43.7 30.1
Bajzik_UNIZA_task4_SED_1 CAM-based SED system Bajzik2021 0.133 0.140 0.081 0.266 0.259 0.219 13.7 15.2 9.7
Liang_SHNU_task4_SSep_SED_3 Mean teacher system Liang_SS2021 0.304 0.345 0.218 0.559 0.604 0.441 34.2 37.0 27.8
Liang_SHNU_task4_SSep_SED_1 Mean teacher system Liang_SS2021 0.313 0.348 0.235 0.588 0.639 0.462 34.6 38.1 26.5
Liang_SHNU_task4_SSep_SED_2 Mean teacher system Liang_SS2021 0.325 0.371 0.240 0.542 0.600 0.408 37.0 40.5 28.7
Baseline_SED DCASE2021 SED baseline system turpault2020a 0.315 0.359 0.222 0.547 0.596 0.407 37.3 40.8 29.7
Wang_NSYSU_task4_SED_1 DCASE2021_SED_A Wang2021 0.336 0.379 0.253 0.646 0.692 0.537 43.0 47.3 32.3
Wang_NSYSU_task4_SED_4 DCASE2021_SED_D Wang2021 0.304 0.340 0.233 0.662 0.710 0.554 38.2 41.3 30.4
Wang_NSYSU_task4_SED_2 DCASE2021_SED_B Wang2021 0.070 0.081 0.050 0.636 0.672 0.552 9.9 10.1 9.5
Wang_NSYSU_task4_SED_3 DCASE2021_SED_C Wang2021 0.339 0.384 0.251 0.649 0.698 0.540 43.0 46.4 34.4

Teams ranking

This table includes only the best ranking score per submitting team.

Rank | Submission code (PSDS 1) | Submission name (PSDS 1) | Submission code (PSDS 2) | Submission name (PSDS 2) | Technical report | Sound separation | Ranking score (Evaluation dataset) | PSDS 1 (Evaluation dataset) | PSDS 2 (Evaluation dataset)
Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na2021 0.80 0.245 0.452
Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.04 0.336 0.550
Gong_TAL_task4_SED_3 TAL SED system Gong_TAL_task4_SED_3 TAL SED system Gong2021 1.16 0.370 0.626
Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park2021 1.07 0.327 0.603
Zheng_USTC_task4_SED_1 DCASE2020 SED Mean teacher system 1 Zheng_USTC_task4_SED_3 DCASE2020 SED Mean teacher system 3 Zheng2021 1.40 0.452 0.746
Nam_KAIST_task4_SED_2 SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 Nam_KAIST_task4_SED_4 Weak_SED Nam2021 1.29 0.399 0.715
Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo2021 0.74 0.258 0.364
deBenito_AUDIAS_task4_SED_2 3-Resolution Mean Teacher (Higher time resolutions) deBenito_AUDIAS_task4_SED_4 5-Resolution Mean Teacher de Benito-Gorron2021 1.10 0.363 0.577
Baseline_SSep_SED DCASE2021 SSep SED baseline system Baseline_SSep_SED DCASE2021 SSep SED baseline system turpault2020b 1.11 0.364 0.580
Boes_KUL_task4_SED_1 CRNN with optimized pooling operations for scenario 1 (1) Boes_KUL_task4_SED_3 CRNN with optimized pooling operations for scenario 2 (1) Boes2021 0.89 0.253 0.531
Ebbers_UPB_task4_SED_3 UPB system 3 Ebbers_UPB_task4_SED_4 UPB system 4 Ebbers2021 1.24 0.416 0.637
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 0.318 0.583
Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu2021 0.54 0.152 0.322
Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera2021 0.98 0.338 0.481
Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim2021 1.32 0.442 0.674
Cai_SMALLRICE_task4_SED_2 DCASE2021_Cai_SED_CDur_Ensemble_2 Cai_SMALLRICE_task4_SED_3 DCASE2021_Cai_SED_CDur_Ensemble_3 Dinkel2021 1.14 0.373 0.596
HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYu2021 0.90 0.294 0.473
HangYuChen_Roal_task4_SED_1 DCASE2021 SED system HangYuChen_Roal_task4_SED_1 DCASE2021 SED system YuHang2021 0.61 0.098 0.496
Yu_NCUT_task4_SED_2 multi-scale CRNN Yu_NCUT_task4_SED_2 multi-scale CRNN Yu2021 0.92 0.301 0.485
lu_kwai_task4_SED_1 DCASE2021 SED CRNN Model1 lu_kwai_task4_SED_3 DCASE2021 SED Conformer Model1 Lu2021 1.29 0.419 0.686
Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 0.302 0.507
Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 0.413 0.586
Yao_GUET_task4_SED_3 Adaptive Sequential Self Attention Span for Sound Event Detection Yao_GUET_task4_SED_2 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.90 0.279 0.496
Liang_SHNU_task4_SED_4 Guided Learning system Liang_SHNU_task4_SED_4 Guided Learning system Liang2021 0.99 0.313 0.543
Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik2021 1.02 0.330 0.544
Liang_SHNU_task4_SSep_SED_2 Mean teacher system Liang_SHNU_task4_SSep_SED_1 Mean teacher system Liang_SS2021 1.05 0.325 0.588
Baseline_SED DCASE2021 SED baseline system Baseline_SED DCASE2021 SED baseline system turpault2020a 1.00 0.315 0.547
Wang_NSYSU_task4_SED_3 DCASE2021_SED_C Wang_NSYSU_task4_SED_4 DCASE2021_SED_D Wang2021 1.14 0.339 0.662

Supplementary metrics

Rank | Submission code (PSDS 1) | Submission name (PSDS 1) | Submission code (PSDS 2) | Submission name (PSDS 2) | Technical report | Sound separation | Ranking score (Evaluation dataset) | Ranking score (Public evaluation) | Ranking score (Vimeo dataset)
Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na2021 0.80 0.78 0.85
Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.04 1.02 1.10
Gong_TAL_task4_SED_3 TAL SED system Gong_TAL_task4_SED_3 TAL SED system Gong2021 1.16 1.15 1.24
Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park2021 1.07 1.06 1.15
Zheng_USTC_task4_SED_1 DCASE2020 SED Mean teacher system 1 Zheng_USTC_task4_SED_3 DCASE2020 SED Mean teacher system 3 Zheng2021 1.40 1.37 1.52
Nam_KAIST_task4_SED_2 SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 Nam_KAIST_task4_SED_4 Weak_SED Nam2021 1.29 1.25 1.43
Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo2021 0.74 0.73 0.71
deBenito_AUDIAS_task4_SED_2 3-Resolution Mean Teacher (Higher time resolutions) deBenito_AUDIAS_task4_SED_4 5-Resolution Mean Teacher de Benito-Gorron2021 1.10 1.10 1.14
Baseline_SSep_SED DCASE2021 SSep SED baseline system Baseline_SSep_SED DCASE2021 SSep SED baseline system turpault2020b 1.11 1.09 1.22
Boes_KUL_task4_SED_1 CRNN with optimized pooling operations for scenario 1 (1) Boes_KUL_task4_SED_3 CRNN with optimized pooling operations for scenario 2 (1) Boes2021 0.89 0.87 0.87
Ebbers_UPB_task4_SED_3 UPB system 3 Ebbers_UPB_task4_SED_4 UPB system 4 Ebbers2021 1.24 1.21 1.39
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 1.03 1.09
Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu2021 0.54 0.53 0.51
Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera2021 0.98 0.97 0.93
Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim2021 1.32 1.28 1.45
Cai_SMALLRICE_task4_SED_2 DCASE2021_Cai_SED_CDur_Ensemble_2 Cai_SMALLRICE_task4_SED_3 DCASE2021_Cai_SED_CDur_Ensemble_3 Dinkel2021 1.14 1.14 1.08
HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYu2021 0.90 0.88 0.89
HangYuChen_Roal_task4_SED_1 DCASE2021 SED system HangYuChen_Roal_task4_SED_1 DCASE2021 SED system YuHang2021 0.61 0.58 0.68
Yu_NCUT_task4_SED_2 multi-scale CRNN Yu_NCUT_task4_SED_2 multi-scale CRNN Yu2021 0.92 0.92 0.89
lu_kwai_task4_SED_1 DCASE2021 SED CRNN Model1 lu_kwai_task4_SED_3 DCASE2021 SED Conformer Model1 Lu2021 1.29 1.25 1.44
Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 0.91 1.03
Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 1.19 1.27
Yao_GUET_task4_SED_3 Adaptive Sequential Self Attention Span for Sound Event Detection Yao_GUET_task4_SED_2 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.90 0.88 0.92
Liang_SHNU_task4_SED_4 Guided Learning system Liang_SHNU_task4_SED_4 Guided Learning system Liang2021 0.99 0.98 1.03
Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik2021 1.02 1.04 0.98
Liang_SHNU_task4_SSep_SED_2 Mean teacher system Liang_SHNU_task4_SSep_SED_1 Mean teacher system Liang_SS2021 1.05 1.05 1.11
Baseline_SED DCASE2021 SED baseline system Baseline_SED DCASE2021 SED baseline system turpault2020a 1.00 1.00 1.00
Wang_NSYSU_task4_SED_3 DCASE2021_SED_C Wang_NSYSU_task4_SED_4 DCASE2021_SED_D Wang2021 1.14 1.13 1.25

Class-wise performance

Rank | Submission code | Submission name | Technical report | Ranking score (Evaluation dataset) | Alarm bell ringing | Blender | Cat | Dishes | Dog | Electric shaver/toothbrush | Frying | Running water | Speech | Vacuum cleaner
Na_BUPT_task4_SED_1 Na_BUPT_task4_SED_1 Na2021 0.80 23.5 28.6 42.5 25.0 16.5 15.7 19.3 19.9 35.4 23.2
Hafsati_TUITO_task4_SED_3 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.91 25.0 39.7 53.7 19.4 28.3 39.7 38.3 25.1 49.3 38.2
Hafsati_TUITO_task4_SED_4 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 0.91 25.7 40.8 50.7 26.5 28.8 39.7 42.6 26.4 49.7 41.0
Hafsati_TUITO_task4_SED_1 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.03 30.4 38.1 63.7 27.6 29.1 35.6 37.0 28.4 52.7 52.3
Hafsati_TUITO_task4_SED_2 TASK AWARE SOUND EVENT DETECTION BASED ON SEMI-SUPERVISED CRNN WITH SKIP CONNECTIONS DCASE 2021 CHALLENGE, TASK 4 Hafsati2021 1.04 32.8 39.0 63.3 28.9 32.8 39.7 41.0 27.9 51.5 52.6
Gong_TAL_task4_SED_3 TAL SED system Gong2021 1.16 33.3 49.8 61.9 34.6 31.6 39.8 41.8 26.9 45.0 54.3
Gong_TAL_task4_SED_2 TAL SED system Gong2021 1.15 35.0 48.1 62.1 33.9 36.3 40.3 41.4 28.1 45.8 55.7
Gong_TAL_task4_SED_1 TAL SED system Gong2021 1.14 33.9 48.7 61.3 34.7 29.3 42.4 39.8 27.7 44.1 53.1
Park_JHU_task4_SED_2 Park_JHU_task4_SED_2 Park2021 1.07 25.7 41.8 52.2 10.1 27.2 40.4 47.7 36.7 58.0 44.3
Park_JHU_task4_SED_4 Park_JHU_task4_SED_4 Park2021 0.86 25.9 42.1 33.4 34.0 17.2 38.9 50.0 35.9 50.6 41.5
Park_JHU_task4_SED_1 Park_JHU_task4_SED_1 Park2021 1.01 22.8 40.3 43.6 8.2 22.9 33.5 43.2 34.8 58.5 39.0
Park_JHU_task4_SED_3 Park_JHU_task4_SED_3 Park2021 0.84 21.8 40.2 24.4 32.2 13.9 35.1 44.4 33.3 51.5 37.8
Zheng_USTC_task4_SED_4 DCASE2020 SED Mean teacher system 4 Zheng2021 1.30 36.1 53.3 70.4 18.8 45.7 58.2 40.4 32.5 70.6 68.7
Zheng_USTC_task4_SED_1 DCASE2020 SED Mean teacher system 1 Zheng2021 1.33 41.4 54.1 72.5 29.4 47.8 60.1 49.2 33.7 69.5 65.5
Zheng_USTC_task4_SED_3 DCASE2020 SED Mean teacher system 3 Zheng2021 1.29 36.4 52.5 70.9 20.9 42.9 59.0 43.3 34.1 68.7 68.7
Zheng_USTC_task4_SED_2 DCASE2020 SED Mean teacher system 2 Zheng2021 1.33 36.6 55.1 75.3 29.8 45.6 55.7 53.6 38.6 69.3 69.5
Nam_KAIST_task4_SED_2 SED_mixupratip=0.8_nband=(2,3)_medianfilter=5 Nam2021 1.19 34.2 55.4 70.5 39.6 46.2 44.7 36.2 39.3 55.7 58.6
Nam_KAIST_task4_SED_1 SED_default Nam2021 1.16 28.6 58.3 69.8 30.3 37.0 38.1 37.8 35.7 51.7 54.6
Nam_KAIST_task4_SED_3 SED_AFL Nam2021 1.09 27.9 36.9 25.2 9.8 7.2 30.0 32.8 33.0 40.6 49.8
Nam_KAIST_task4_SED_4 Weak_SED Nam2021 0.75 5.3 3.3 0.5 0.0 0.3 13.6 43.2 23.5 0.3 35.0
Koo_SGU_task4_SED_2 DCASE2021 SED system using wav2vec Koo2021 0.12 0.0 20.5 9.9 1.0 2.3 12.5 20.0 15.7 26.8 14.8
Koo_SGU_task4_SED_3 DCASE2021 SED system using wav2vec Koo2021 0.41 2.5 7.7 2.2 0.8 1.2 7.8 22.5 15.3 1.8 23.2
Koo_SGU_task4_SED_1 DCASE2021 SED system using wav2vec Koo2021 0.74 15.4 23.5 30.5 15.1 20.6 21.1 21.1 18.5 19.0 20.0
deBenito_AUDIAS_task4_SED_4 5-Resolution Mean Teacher de Benito-Gorron2021 1.10 37.4 57.1 63.8 24.2 34.5 30.0 46.8 25.9 49.8 57.3
deBenito_AUDIAS_task4_SED_1 3-Resolution Mean Teacher de Benito-Gorron2021 1.07 37.6 58.1 63.1 23.9 34.2 35.4 43.5 29.8 49.3 51.3
deBenito_AUDIAS_task4_SED_2 3-Resolution Mean Teacher (Higher time resolutions) de Benito-Gorron2021 1.10 37.1 51.4 63.9 26.0 36.9 28.9 46.9 30.5 52.0 57.5
deBenito_AUDIAS_task4_SED_3 4-Resolution Mean Teacher de Benito-Gorron2021 1.07 36.2 57.6 63.1 24.4 34.8 35.0 41.9 27.1 48.2 53.5
Baseline_SSep_SED DCASE2021 SSep SED baseline system turpault2020b 1.11 36.7 47.4 66.3 33.1 40.5 34.8 37.2 21.5 53.0 49.3
Boes_KUL_task4_SED_4 CRNN with optimized pooling operations for scenario 2 (2) Boes2021 0.60 3.7 24.2 1.4 0.0 0.6 13.9 23.7 12.6 6.2 19.9
Boes_KUL_task4_SED_3 CRNN with optimized pooling operations for scenario 2 (1) Boes2021 0.68 6.6 16.5 1.4 0.0 0.3 21.3 33.0 18.7 6.6 35.1
Boes_KUL_task4_SED_2 CRNN with optimized pooling operations for scenario 1 (2) Boes2021 0.77 16.9 32.9 63.1 7.7 19.4 25.6 32.6 14.8 51.8 47.7
Boes_KUL_task4_SED_1 CRNN with optimized pooling operations for scenario 1 (1) Boes2021 0.81 19.0 29.0 59.1 7.7 20.9 34.5 24.8 13.0 54.0 47.9
Ebbers_UPB_task4_SED_2 UPB system 2 Ebbers2021 1.10 37.2 60.8 73.0 24.2 45.6 58.5 65.9 36.9 65.0 73.5
Ebbers_UPB_task4_SED_4 UPB system 4 Ebbers2021 1.16 39.2 61.7 74.2 33.4 46.6 57.1 64.0 45.4 67.6 77.6
Ebbers_UPB_task4_SED_3 UPB system 3 Ebbers2021 1.24 39.2 61.7 74.2 33.4 46.6 57.1 64.0 45.4 67.6 77.6
Ebbers_UPB_task4_SED_1 UPB system 1 Ebbers2021 1.16 37.2 60.8 73.0 24.2 45.6 58.5 65.9 36.9 65.0 73.5
Zhu_AIAL-XJU_task4_SED_2 Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.99 30.0 46.3 63.3 23.6 16.8 44.2 47.5 40.9 59.4 57.6
Zhu_AIAL-XJU_task4_SED_1 Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 31.8 48.2 58.3 28.5 26.7 37.4 48.6 36.0 51.1 35.3
Liu_BUPT_task4_4 DCASE2020 liuliuliufangzhou system Liu2021 0.37 14.0 26.2 40.2 9.3 16.4 18.6 7.7 5.0 26.4 11.4
Liu_BUPT_task4_1 DCASE2020 liuliuliufangzhou system Liu2021 0.30 13.1 18.1 49.0 8.6 19.3 20.0 6.4 6.2 28.6 11.4
Liu_BUPT_task4_2 DCASE2020 liuliuliufangzhou system Liu2021 0.54 19.6 30.4 36.5 14.5 18.5 29.0 18.6 11.8 30.1 27.2
Liu_BUPT_task4_3 DCASE2020 liuliuliufangzhou system Liu2021 0.24 16.5 19.0 37.2 6.2 19.3 13.7 3.5 6.8 25.0 3.8
Olvera_INRIA_task4_SED_2 SED ensemble 2 OT + FG/BG Olvera2021 0.98 46.0 47.8 63.5 23.2 39.1 51.1 20.4 27.0 62.2 53.4
Olvera_INRIA_task4_SED_1 DA-SED + FG/BG Olvera2021 0.95 43.7 52.3 63.6 30.0 40.8 52.6 24.4 26.6 63.9 56.9
Kim_AiTeR_GIST_SED_4 RCRNN-based noisy student SED Kim2021 1.32 34.7 59.8 71.6 40.4 47.3 26.2 61.8 32.8 64.9 66.7
Kim_AiTeR_GIST_SED_2 RCRNN-based noisy student SED Kim2021 1.31 37.9 57.4 72.9 41.8 46.8 25.2 60.5 36.9 64.3 60.8
Kim_AiTeR_GIST_SED_3 RCRNN-based noisy student SED Kim2021 1.30 37.4 55.4 71.9 41.0 44.6 26.5 59.5 32.3 64.6 61.1
Kim_AiTeR_GIST_SED_1 RCRNN-based noisy student SED Kim2021 1.29 33.0 57.1 70.0 42.5 49.6 28.2 60.6 31.3 65.0 62.3
Cai_SMALLRICE_task4_SED_1 DCASE2021_Cai_SED_CDur_Ensemble_1 Dinkel2021 1.11 37.0 32.2 55.9 31.2 20.4 37.8 33.8 23.9 60.1 45.3
Cai_SMALLRICE_task4_SED_2 DCASE2021_Cai_SED_CDur_Ensemble_2 Dinkel2021 1.13 37.8 37.4 53.8 31.8 22.1 35.9 32.3 28.9 61.3 46.6
Cai_SMALLRICE_task4_SED_3 DCASE2021_Cai_SED_CDur_Ensemble_3 Dinkel2021 1.13 36.6 36.6 55.7 31.7 21.2 36.1 37.7 25.0 61.1 46.6
Cai_SMALLRICE_task4_SED_4 DCASE2021_Cai_SED_CDur_Single_4 Dinkel2021 1.00 34.9 34.1 52.5 30.8 28.0 37.6 35.1 24.9 61.3 44.4
HangYuChen_Roal_task4_SED_2 DCASE2021 SED system HangYu2021 0.90 29.0 30.7 59.3 24.5 31.8 35.3 30.2 26.0 49.3 25.9
HangYuChen_Roal_task4_SED_1 DCASE2021 SED system YuHang2021 0.61 5.2 4.8 5.6 4.3 2.7 12.5 26.8 16.0 5.1 24.0
Yu_NCUT_task4_SED_1 multi-scale CRNN Yu2021 0.20 0.5 7.1 0.7 2.0 1.7 9.5 24.4 1.2 1.6 19.7
Yu_NCUT_task4_SED_2 multi-scale CRNN Yu2021 0.92 28.6 34.6 57.9 20.2 31.7 36.0 29.7 28.0 44.4 33.1
lu_kwai_task4_SED_1 DCASE2021 SED CRNN Model1 Lu2021 1.27 37.1 41.4 62.5 40.6 39.7 46.5 46.5 34.5 54.5 46.9
lu_kwai_task4_SED_4 DCASE2021 SED Conformer Model2 Lu2021 0.88 5.8 5.9 2.1 1.0 0.3 16.9 44.6 22.0 25.6 32.9
lu_kwai_task4_SED_3 DCASE2021 SED Conformer Model1 Lu2021 0.86 6.3 5.1 1.3 0.8 0.3 15.7 43.8 21.8 29.3 31.9
lu_kwai_task4_SED_2 DCASE2021 SED CRNN Model2 Lu2021 1.25 38.6 41.6 65.5 41.0 39.3 46.1 49.0 36.0 51.5 46.6
Liu_BUPT_task4_SS_SED_2 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 31.7 38.2 63.5 19.9 30.1 46.6 32.0 21.1 49.4 43.4
Liu_BUPT_task4_SS_SED_1 DCASE2020 liuliuliufangzhou system Liu_SS2021 0.94 34.3 38.8 63.1 25.7 27.3 45.3 31.1 25.8 49.4 43.7
Tian_ICT-TOSHIBA_task4_SED_2 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 33.6 44.9 60.9 26.4 34.8 24.3 38.7 25.9 48.4 45.2
Tian_ICT-TOSHIBA_task4_SED_1 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 33.6 44.9 60.9 26.4 34.8 24.3 38.7 25.9 48.4 45.2
Tian_ICT-TOSHIBA_task4_SED_4 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.19 33.6 44.9 60.9 26.4 34.8 24.3 38.7 25.9 48.4 45.2
Tian_ICT-TOSHIBA_task4_SED_3 SOUND EVENT DETECTION USING METRIC LEARNING AND FOCAL LOSS Tian2021 1.18 33.6 44.9 60.9 26.4 34.8 24.3 38.7 25.9 48.4 45.2
Yao_GUET_task4_SED_3 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.88 32.2 32.4 58.2 21.7 17.6 36.2 26.8 24.6 49.4 42.9
Yao_GUET_task4_SED_1 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.88 31.6 22.7 58.5 23.4 22.8 35.9 31.5 23.8 45.1 23.9
Yao_GUET_task4_SED_2 Adaptive Sequential Self Attention Span for Sound Event Detection Yao2021 0.54 4.7 4.2 2.2 1.4 1.3 10.5 27.3 16.7 3.4 17.1
Liang_SHNU_task4_SED_4 Guided Learning system Liang2021 0.99 38.0 40.7 48.3 26.0 24.2 22.6 35.6 30.0 44.6 50.0
Bajzik_UNIZA_task4_SED_2 CAM attention SED system Bajzik2021 1.02 37.5 44.4 57.6 28.8 22.9 35.5 44.0 29.8 51.7 45.2
Bajzik_UNIZA_task4_SED_1 CAM-based SED system Bajzik2021 0.45 17.1 7.5 34.2 7.1 15.2 8.9 1.2 4.8 34.1 7.1
Liang_SHNU_task4_SSep_SED_3 Mean teacher system Liang_SS2021 0.99 33.6 38.7 47.2 22.5 17.6 21.1 38.6 28.3 44.7 50.0
Liang_SHNU_task4_SSep_SED_1 Mean teacher system Liang_SS2021 1.03 29.4 25.7 60.7 20.5 29.1 30.6 38.3 24.9 55.0 32.3
Liang_SHNU_task4_SSep_SED_2 Mean teacher system Liang_SS2021 1.01 33.1 37.1 52.0 26.8 32.8 31.7 41.0 28.0 49.2 37.8
Baseline_SED DCASE2021 SED baseline system turpault2020a 1.00 32.2 39.0 62.4 28.6 34.5 21.1 37.2 26.4 49.7 42.0
Wang_NSYSU_task4_SED_1 DCASE2021_SED_A Wang2021 1.13 34.3 46.5 62.7 35.6 29.0 50.6 51.8 38.2 45.9 35.9
Wang_NSYSU_task4_SED_4 DCASE2021_SED_D Wang2021 1.09 32.5 49.4 66.2 28.3 15.2 34.0 47.2 33.0 39.9 36.0
Wang_NSYSU_task4_SED_2 DCASE2021_SED_B Wang2021 0.69 7.0 5.2 0.5 0.0 0.3 11.1 31.5 15.7 0.3 27.8
Wang_NSYSU_task4_SED_3 DCASE2021_SED_C Wang2021 1.13 34.4 52.0 70.1 32.2 25.1 41.5 47.8 36.1 52.6 37.7

System characteristics

General characteristics

Rank
Code
Technical Report
Ranking score (Evaluation dataset)
PSDS 1 (Evaluation dataset)
PSDS 2 (Evaluation dataset)
Data augmentation
Features
Na_BUPT_task4_SED_1 Na2021 0.80 0.245 0.452 log-mel energies
Hafsati_TUITO_task4_SED_3 Hafsati2021 0.91 0.287 0.502 pitch shifting, audio concatenation, volume changing log-mel energies
Hafsati_TUITO_task4_SED_4 Hafsati2021 0.91 0.287 0.502 pitch shifting, audio concatenation, volume changing log-mel energies
Hafsati_TUITO_task4_SED_1 Hafsati2021 1.03 0.334 0.549 log-mel energies
Hafsati_TUITO_task4_SED_2 Hafsati2021 1.04 0.336 0.550 log-mel energies
Gong_TAL_task4_SED_3 Gong2021 1.16 0.370 0.626 SpecAugment, time shift, mixup log-mel energies
Gong_TAL_task4_SED_2 Gong2021 1.15 0.367 0.616 SpecAugment, time shift, mixup log-mel energies
Gong_TAL_task4_SED_1 Gong2021 1.14 0.364 0.611 SpecAugment, time shift, mixup log-mel energies
Park_JHU_task4_SED_2 Park2021 1.07 0.327 0.603 mixup, frame shifting log-mel energies
Park_JHU_task4_SED_4 Park2021 0.86 0.237 0.524 mixup, frame shifting log-mel energies
Park_JHU_task4_SED_1 Park2021 1.01 0.305 0.579 mixup, frame shifting log-mel energies
Park_JHU_task4_SED_3 Park2021 0.84 0.222 0.537 mixup, frame shifting log-mel energies
Zheng_USTC_task4_SED_4 Zheng2021 1.30 0.389 0.742 spec-augment, time-shifting, mixup log-mel energies
Zheng_USTC_task4_SED_1 Zheng2021 1.33 0.452 0.669 spec-augment, time-shifting, mixup log-mel energies
Zheng_USTC_task4_SED_3 Zheng2021 1.29 0.386 0.746 spec-augment, time-shifting, mixup log-mel energies
Zheng_USTC_task4_SED_2 Zheng2021 1.33 0.447 0.676 spec-augment, time-shifting, mixup log-mel energies
Nam_KAIST_task4_SED_2 Nam2021 1.19 0.399 0.609 time shifting, mixup, time masking, FilterAugment log-mel energies
Nam_KAIST_task4_SED_1 Nam2021 1.16 0.378 0.617 time shifting, mixup, time masking, FilterAugment log-mel energies
Nam_KAIST_task4_SED_3 Nam2021 1.09 0.324 0.634 time shifting, mixup, time masking, FilterAugment log-mel energies
Nam_KAIST_task4_SED_4 Nam2021 0.75 0.059 0.715 time shifting, mixup, time masking, FilterAugment log-mel energies
Koo_SGU_task4_SED_2 Koo2021 0.12 0.044 0.059 raw waveform
Koo_SGU_task4_SED_3 Koo2021 0.41 0.058 0.348 raw waveform
Koo_SGU_task4_SED_1 Koo2021 0.74 0.258 0.364 raw waveform
deBenito_AUDIAS_task4_SED_4 de Benito-Gorron2021 1.10 0.361 0.577 log-mel energies
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2021 1.07 0.343 0.571 log-mel energies
deBenito_AUDIAS_task4_SED_2 de Benito-Gorron2021 1.10 0.363 0.574 log-mel energies
deBenito_AUDIAS_task4_SED_3 de Benito-Gorron2021 1.07 0.345 0.571 log-mel energies
Baseline_SSep_SED turpault2020b 1.11 0.364 0.580 mixup log-mel energies
Boes_KUL_task4_SED_4 Boes2021 0.60 0.117 0.457 time masking, frequency masking, mixup log-mel energies
Boes_KUL_task4_SED_3 Boes2021 0.68 0.121 0.531 time masking, frequency masking, mixup log-mel energies
Boes_KUL_task4_SED_2 Boes2021 0.77 0.233 0.440 time masking, frequency masking, mixup log-mel energies
Boes_KUL_task4_SED_1 Boes2021 0.81 0.253 0.442 time masking, frequency masking, mixup log-mel energies
Ebbers_UPB_task4_SED_2 Ebbers2021 1.10 0.335 0.621 frequency warping, time-/frequency-masking, shifted superposition, random noise log-mel energies
Ebbers_UPB_task4_SED_4 Ebbers2021 1.16 0.363 0.637 frequency warping, time-/frequency-masking, shifted superposition, random noise log-mel energies
Ebbers_UPB_task4_SED_3 Ebbers2021 1.24 0.416 0.635 frequency warping, time-/frequency-masking, shifted superposition, random noise log-mel energies
Ebbers_UPB_task4_SED_1 Ebbers2021 1.16 0.373 0.621 frequency warping, time-/frequency-masking, shifted superposition, random noise log-mel energies
Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.99 0.290 0.574 mixup log-mel spectrogram
Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 0.318 0.583 mixup log-mel spectrogram
Liu_BUPT_task4_4 Liu2021 0.37 0.102 0.231 log-mel energies
Liu_BUPT_task4_1 Liu2021 0.30 0.090 0.169 log-mel energies
Liu_BUPT_task4_2 Liu2021 0.54 0.152 0.322 log-mel energies
Liu_BUPT_task4_3 Liu2021 0.24 0.068 0.146 log-mel energies
Olvera_INRIA_task4_SED_2 Olvera2021 0.98 0.338 0.481 log-mel energies
Olvera_INRIA_task4_SED_1 Olvera2021 0.95 0.332 0.462 log-mel energies
Kim_AiTeR_GIST_SED_4 Kim2021 1.32 0.442 0.674 time-frequency shift, mixup, specaugment log-mel energies
Kim_AiTeR_GIST_SED_2 Kim2021 1.31 0.439 0.667 time-frequency shift, mixup, specaugment log-mel energies
Kim_AiTeR_GIST_SED_3 Kim2021 1.30 0.434 0.669 time-frequency shift, mixup, specaugment log-mel energies
Kim_AiTeR_GIST_SED_1 Kim2021 1.29 0.431 0.661 time-frequency shift, mixup, specaugment log-mel energies
Cai_SMALLRICE_task4_SED_1 Dinkel2021 1.11 0.361 0.584 time shifting, mixup, time masking, frequency masking log-mel energies
Cai_SMALLRICE_task4_SED_2 Dinkel2021 1.13 0.373 0.585 time shifting, mixup, time masking, frequency masking log-mel energies
Cai_SMALLRICE_task4_SED_3 Dinkel2021 1.13 0.370 0.596 time shifting, mixup, time masking, frequency masking log-mel energies
Cai_SMALLRICE_task4_SED_4 Dinkel2021 1.00 0.339 0.504 time shifting, mixup, time masking, frequency masking log-mel energies
HangYuChen_Roal_task4_SED_2 HangYu2021 0.90 0.294 0.473 minmax log-mel energies
HangYuChen_Roal_task4_SED_1 YuHang2021 0.61 0.098 0.496 minmax log-mel energies
Yu_NCUT_task4_SED_1 Yu2021 0.20 0.038 0.157 mixup log-mel energies
Yu_NCUT_task4_SED_2 Yu2021 0.92 0.301 0.485 mixup log-mel energies
lu_kwai_task4_SED_1 Lu2021 1.27 0.419 0.660 mixup, frame-shift log-mel energies
lu_kwai_task4_SED_4 Lu2021 0.88 0.157 0.685 mixup, frame-shift log-mel energies
lu_kwai_task4_SED_3 Lu2021 0.86 0.148 0.686 mixup, frame-shift log-mel energies
lu_kwai_task4_SED_2 Lu2021 1.25 0.412 0.651 mixup, frame-shift log-mel energies
Liu_BUPT_task4_SS_SED_2 Liu_SS2021 0.94 0.302 0.507 source augmentation, random track mixing raw waveform
Liu_BUPT_task4_SS_SED_1 Liu_SS2021 0.94 0.302 0.507 source augmentation, random track mixing raw waveform
Tian_ICT-TOSHIBA_task4_SED_2 Tian2021 1.19 0.411 0.585 mixup log-mel energies
Tian_ICT-TOSHIBA_task4_SED_1 Tian2021 1.19 0.413 0.586 mixup log-mel energies
Tian_ICT-TOSHIBA_task4_SED_4 Tian2021 1.19 0.412 0.586 mixup log-mel energies
Tian_ICT-TOSHIBA_task4_SED_3 Tian2021 1.18 0.409 0.584 mixup log-mel energies
Yao_GUET_task4_SED_3 Yao2021 0.88 0.279 0.479 mixup log-mel energies
Yao_GUET_task4_SED_1 Yao2021 0.88 0.277 0.482 mixup log-mel energies
Yao_GUET_task4_SED_2 Yao2021 0.54 0.056 0.496 mixup log-mel energies
Liang_SHNU_task4_SED_4 Liang2021 0.99 0.313 0.543 mixup, specAugment log-mel energies
Bajzik_UNIZA_task4_SED_2 Bajzik2021 1.02 0.330 0.544 log-mel energies
Bajzik_UNIZA_task4_SED_1 Bajzik2021 0.45 0.133 0.266 log-mel energies
Liang_SHNU_task4_SSep_SED_3 Liang_SS2021 0.99 0.304 0.559 log-mel energies
Liang_SHNU_task4_SSep_SED_1 Liang_SS2021 1.03 0.313 0.588 log-mel energies
Liang_SHNU_task4_SSep_SED_2 Liang_SS2021 1.01 0.325 0.542 log-mel energies
Baseline_SED turpault2020a 1.00 0.315 0.547 mixup log-mel energies
Wang_NSYSU_task4_SED_1 Wang2021 1.13 0.336 0.646 Mixup, Time Shift, Time Mask, Frequency Mask log-mel energies
Wang_NSYSU_task4_SED_4 Wang2021 1.09 0.304 0.662 Mixup, Time Shift, Time Mask, Frequency Mask log-mel energies
Wang_NSYSU_task4_SED_2 Wang2021 0.69 0.070 0.636 Mixup, Time Shift, Time Mask, Frequency Mask log-mel energies
Wang_NSYSU_task4_SED_3 Wang2021 1.13 0.339 0.649 Mixup, Time Shift, Time Mask, Frequency Mask log-mel energies



Machine learning characteristics

Rank
Code
Technical Report
Ranking score (Evaluation dataset)
PSDS 1 (Evaluation dataset)
PSDS 2 (Evaluation dataset)
Classifier
Semi-supervised approach
Post-processing
Segmentation method
Decision making
Na_BUPT_task4_SED_1 Na2021 0.80 0.245 0.452 CNN, conformer mean-teacher student median filtering (93ms)
Hafsati_TUITO_task4_SED_3 Hafsati2021 0.91 0.287 0.502 CRNN mean-teacher student median filtering (93ms)
Hafsati_TUITO_task4_SED_4 Hafsati2021 0.91 0.287 0.502 CRNN mean-teacher student median filtering (93ms)
Hafsati_TUITO_task4_SED_1 Hafsati2021 1.03 0.334 0.549 CRNN mean-teacher student median filtering (93ms)
Hafsati_TUITO_task4_SED_2 Hafsati2021 1.04 0.336 0.550 CRNN mean-teacher student median filtering (93ms)
Gong_TAL_task4_SED_3 Gong2021 1.16 0.370 0.626 CRNN mean-teacher, pseudo-labelling class-wise median filtering attention layers mean
Gong_TAL_task4_SED_2 Gong2021 1.15 0.367 0.616 CRNN mean-teacher class-wise median filtering attention layers mean
Gong_TAL_task4_SED_1 Gong2021 1.14 0.364 0.611 CRNN mean-teacher class-wise median filtering attention layers mean
Park_JHU_task4_SED_2 Park2021 1.07 0.327 0.603 RCRNN cross-referencing self-training median filtering
Park_JHU_task4_SED_4 Park2021 0.86 0.237 0.524 RCRNN cross-referencing self-training median filtering
Park_JHU_task4_SED_1 Park2021 1.01 0.305 0.579 RCRNN cross-referencing self-training median filtering
Park_JHU_task4_SED_3 Park2021 0.84 0.222 0.537 RCRNN cross-referencing self-training median filtering
Zheng_USTC_task4_SED_4 Zheng2021 1.30 0.389 0.742 CRNN mean-teacher student median filtering (340ms) averaging
Zheng_USTC_task4_SED_1 Zheng2021 1.33 0.452 0.669 CRNN mean-teacher student median filtering (340ms) averaging
Zheng_USTC_task4_SED_3 Zheng2021 1.29 0.386 0.746 CRNN mean-teacher student median filtering (340ms) averaging
Zheng_USTC_task4_SED_2 Zheng2021 1.33 0.447 0.676 CRNN mean-teacher student median filtering (340ms) averaging
Nam_KAIST_task4_SED_2 Nam2021 1.19 0.399 0.609 CRNN, ensemble mean-teacher student median filtering (329ms), weak prediction masking mean
Nam_KAIST_task4_SED_1 Nam2021 1.16 0.378 0.617 CRNN, ensemble mean-teacher student median filtering (461ms), weak prediction masking mean
Nam_KAIST_task4_SED_3 Nam2021 1.09 0.324 0.634 CRNN, ensemble mean-teacher student median filtering (461ms), weak prediction masking mean
Nam_KAIST_task4_SED_4 Nam2021 0.75 0.059 0.715 CRNN, ensemble mean-teacher student weak SED mean
Koo_SGU_task4_SED_2 Koo2021 0.12 0.044 0.059 Transformer, RNN mean-teacher student median filtering (93ms)
Koo_SGU_task4_SED_3 Koo2021 0.41 0.058 0.348 Transformer, RNN mean-teacher student median filtering (93ms)
Koo_SGU_task4_SED_1 Koo2021 0.74 0.258 0.364 Transformer mean-teacher student median filtering (93ms)
deBenito_AUDIAS_task4_SED_4 de Benito-Gorron2021 1.10 0.361 0.577 CRNN mean-teacher student median filtering (45ms) average
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2021 1.07 0.343 0.571 CRNN mean-teacher student median filtering (45ms) average
deBenito_AUDIAS_task4_SED_2 de Benito-Gorron2021 1.10 0.363 0.574 CRNN mean-teacher student median filtering (45ms) average
deBenito_AUDIAS_task4_SED_3 de Benito-Gorron2021 1.07 0.345 0.571 CRNN mean-teacher student median filtering (45ms) average
Baseline_SSep_SED turpault2020b 1.11 0.364 0.580 CRNN mean-teacher student
Boes_KUL_task4_SED_4 Boes2021 0.60 0.117 0.457 CRNN mean teacher median filtering (3.7s)
Boes_KUL_task4_SED_3 Boes2021 0.68 0.121 0.531 CRNN mean teacher median filtering (3.7s)
Boes_KUL_task4_SED_2 Boes2021 0.77 0.233 0.440 CRNN mean teacher median filtering (460ms)
Boes_KUL_task4_SED_1 Boes2021 0.81 0.253 0.442 CRNN mean teacher median filtering (460ms)
Ebbers_UPB_task4_SED_2 Ebbers2021 1.10 0.335 0.621 FBCRNN,CRNN self-training median filtering (class dependent) MIL
Ebbers_UPB_task4_SED_4 Ebbers2021 1.16 0.363 0.637 FBCRNN,CRNN,CTNN,CNN self-training median filtering (class dependent) averaging
Ebbers_UPB_task4_SED_3 Ebbers2021 1.24 0.416 0.635 FBCRNN,CRNN,CTNN,CNN self-training median filtering (class dependent) averaging
Ebbers_UPB_task4_SED_1 Ebbers2021 1.16 0.373 0.621 FBCRNN,CRNN self-training median filtering (class dependent) MIL
Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.99 0.290 0.574 CRNN mean-teacher student median filtering LinearSoftmax
Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 0.318 0.583 CRNN mean-teacher student median filtering LinearSoftmax
Liu_BUPT_task4_4 Liu2021 0.37 0.102 0.231 CRNN mean-teacher student median filtering (93ms)
Liu_BUPT_task4_1 Liu2021 0.30 0.090 0.169 CRNN mean-teacher student median filtering (93ms)
Liu_BUPT_task4_2 Liu2021 0.54 0.152 0.322 CRNN mean-teacher student median filtering (93ms)
Liu_BUPT_task4_3 Liu2021 0.24 0.068 0.146 CRNN mean-teacher student median filtering (93ms)
Olvera_INRIA_task4_SED_2 Olvera2021 0.98 0.338 0.481 CRNN mean-teacher student HMM smoothing HMM smoothing
Olvera_INRIA_task4_SED_1 Olvera2021 0.95 0.332 0.462 CRNN mean-teacher student HMM smoothing HMM smoothing
Kim_AiTeR_GIST_SED_4 Kim2021 1.32 0.442 0.674 RCRNN mean-teacher student, self-training with noisy student median filtering mean
Kim_AiTeR_GIST_SED_2 Kim2021 1.31 0.439 0.667 RCRNN mean-teacher student, self-training with noisy student median filtering mean
Kim_AiTeR_GIST_SED_3 Kim2021 1.30 0.434 0.669 RCRNN mean-teacher student, self-training with noisy student median filtering mean
Kim_AiTeR_GIST_SED_1 Kim2021 1.29 0.431 0.661 RCRNN mean-teacher student, self-training with noisy student median filtering mean
Cai_SMALLRICE_task4_SED_1 Dinkel2021 1.11 0.361 0.584 CRNN, ensemble unsupervised data augmentation average
Cai_SMALLRICE_task4_SED_2 Dinkel2021 1.13 0.373 0.585 CRNN, ensemble unsupervised data augmentation average
Cai_SMALLRICE_task4_SED_3 Dinkel2021 1.13 0.370 0.596 CRNN, ensemble unsupervised data augmentation average
Cai_SMALLRICE_task4_SED_4 Dinkel2021 1.00 0.339 0.504 CRNN unsupervised data augmentation
HangYuChen_Roal_task4_SED_2 HangYu2021 0.90 0.294 0.473 Transformer,CNN mean-teacher student median filtering (93ms) attention layers majority vote
HangYuChen_Roal_task4_SED_1 YuHang2021 0.61 0.098 0.496 CRNN mean-teacher student median filtering (93ms) attention layers majority vote
Yu_NCUT_task4_SED_1 Yu2021 0.20 0.038 0.157 Multi-scale CRNN mean-teacher student median filtering (93ms) attention
Yu_NCUT_task4_SED_2 Yu2021 0.92 0.301 0.485 Multi-scale CRNN mean-teacher student median filtering (93ms) attention
lu_kwai_task4_SED_1 Lu2021 1.27 0.419 0.660 CRNN mean-teacher student classwise median filtering majority vote
lu_kwai_task4_SED_4 Lu2021 0.88 0.157 0.685 Conformer mean-teacher student classwise median filtering majority vote
lu_kwai_task4_SED_3 Lu2021 0.86 0.148 0.686 Conformer mean-teacher student classwise median filtering majority vote
lu_kwai_task4_SED_2 Lu2021 1.25 0.412 0.651 CRNN mean-teacher student classwise median filtering majority vote
Liu_BUPT_task4_SS_SED_2 Liu_SS2021 0.94 0.302 0.507 u-net, VGG median filtering (93ms) attention layers, d-vector
Liu_BUPT_task4_SS_SED_1 Liu_SS2021 0.94 0.302 0.507 u-net, VGG median filtering (93ms) attention layers, d-vector
Tian_ICT-TOSHIBA_task4_SED_2 Tian2021 1.19 0.411 0.585 CNN mean-teacher student median filtering with adaptive window size attention_layers
Tian_ICT-TOSHIBA_task4_SED_1 Tian2021 1.19 0.413 0.586 CNN mean-teacher student median filtering with adaptive window size attention_layers
Tian_ICT-TOSHIBA_task4_SED_4 Tian2021 1.19 0.412 0.586 CNN mean-teacher student median filtering with adaptive window size attention_layers
Tian_ICT-TOSHIBA_task4_SED_3 Tian2021 1.18 0.409 0.584 CNN mean-teacher student median filtering with adaptive window size attention_layers
Yao_GUET_task4_SED_3 Yao2021 0.88 0.279 0.479 CRNN,Self Attention mean-teacher student median filtering (93ms)
Yao_GUET_task4_SED_1 Yao2021 0.88 0.277 0.482 CRNN,Self Attention mean-teacher student median filtering (93ms)
Yao_GUET_task4_SED_2 Yao2021 0.54 0.056 0.496 CRNN,Self Attention mean-teacher student median filtering (93ms)
Liang_SHNU_task4_SED_4 Liang2021 0.99 0.313 0.543 CRNN teacher student median filtering (with adaptive window size)
Bajzik_UNIZA_task4_SED_2 Bajzik2021 1.02 0.330 0.544 CRNN mean-teacher student median filtering (112ms)
Bajzik_UNIZA_task4_SED_1 Bajzik2021 0.45 0.133 0.266 CNN mean-teacher student median filtering (112ms)
Liang_SHNU_task4_SSep_SED_3 Liang_SS2021 0.99 0.304 0.559 CRNN mean-teacher student median filtering (with adaptive window size)
Liang_SHNU_task4_SSep_SED_1 Liang_SS2021 1.03 0.313 0.588 CRNN mean-teacher student median filtering (with adaptive window size)
Liang_SHNU_task4_SSep_SED_2 Liang_SS2021 1.01 0.325 0.542 CRNN mean-teacher student median filtering (with adaptive window size)
Baseline_SED turpault2020a 1.00 0.315 0.547 CRNN mean-teacher student
Wang_NSYSU_task4_SED_1 Wang2021 1.13 0.336 0.646 CRNN mean-teacher student median filtering attention layer mean
Wang_NSYSU_task4_SED_4 Wang2021 1.09 0.304 0.662 CRNN, CNN-Transformer mean-teacher student median filtering attention layer, exponential softmax layer mean
Wang_NSYSU_task4_SED_2 Wang2021 0.69 0.070 0.636 CRNN mean-teacher student median filtering exponential softmax layer mean
Wang_NSYSU_task4_SED_3 Wang2021 1.13 0.339 0.649 CRNN, CNN-Transformer mean-teacher student median filtering attention layer mean
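Median filtering is by far the most common post-processing in the table above, with window lengths quoted in milliseconds (93 ms, 340 ms, 460 ms, and so on). The sketch below shows how such a filter could be applied to one class's frame-level probabilities; the function name and the `hop_ms` default are illustrative assumptions, not details of any submission:

```python
from statistics import median

def median_smooth(probs, filter_ms=93.0, hop_ms=15.5):
    """Median-filter a per-class probability sequence along time.

    probs: list of frame-level sigmoid outputs for one class.
    filter_ms / hop_ms are illustrative defaults; the real hop size
    depends on each system's feature-extraction settings.
    """
    win = max(1, int(round(filter_ms / hop_ms)))
    if win % 2 == 0:
        win += 1                  # keep the window centred on each frame
    half = win // 2
    # pad with edge values so the output keeps the input length
    padded = [probs[0]] * half + list(probs) + [probs[-1]] * half
    return [median(padded[i:i + win]) for i in range(len(probs))]
```

An isolated one-frame spike shorter than the window is suppressed, while activations longer than the window keep their onset and offset roughly intact; the smoothed curve is then typically binarized (e.g. at 0.5) to obtain event segments.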

Complexity

Rank
Code
Technical Report
Ranking score (Evaluation dataset)
PSDS 1 (Evaluation dataset)
PSDS 2 (Evaluation dataset)
Model complexity
Ensemble subsystems
Training time
Na_BUPT_task4_SED_1 Na2021 0.80 0.245 0.452 3900000 40h (1 Quadro K1200)
Hafsati_TUITO_task4_SED_3 Hafsati2021 0.91 0.287 0.502 1100000 20h (1 Tesla V100-SXM2-16GB)
Hafsati_TUITO_task4_SED_4 Hafsati2021 0.91 0.287 0.502 1100000 20h (1 Tesla V100-SXM2-16GB)
Hafsati_TUITO_task4_SED_1 Hafsati2021 1.03 0.334 0.549 1100000 6h (1 Tesla V100-SXM2-16GB)
Hafsati_TUITO_task4_SED_2 Hafsati2021 1.04 0.336 0.550 1100000 6h (1 Tesla V100-SXM2-16GB)
Gong_TAL_task4_SED_3 Gong2021 1.16 0.370 0.626 6674520 6 22.5h (1 V100)
Gong_TAL_task4_SED_2 Gong2021 1.15 0.367 0.616 2224840 2 7.5h (1 V100)
Gong_TAL_task4_SED_1 Gong2021 1.14 0.364 0.611 4449680 4 15h (1 V100)
Park_JHU_task4_SED_2 Park2021 1.07 0.327 0.603 9000000 20h (1 GTX 1080 Ti)
Park_JHU_task4_SED_4 Park2021 0.86 0.237 0.524 9000000 20h (1 GTX 1080 Ti)
Park_JHU_task4_SED_1 Park2021 1.01 0.305 0.579 9000000 20h (1 GTX 1080 Ti)
Park_JHU_task4_SED_3 Park2021 0.84 0.222 0.537 9000000 20h (1 GTX 1080 Ti)
Zheng_USTC_task4_SED_4 Zheng2021 1.30 0.389 0.742 1112420 9 3h (1 GTX 3090)
Zheng_USTC_task4_SED_1 Zheng2021 1.33 0.452 0.669 1112420 3 3h (1 GTX 3090)
Zheng_USTC_task4_SED_3 Zheng2021 1.29 0.386 0.746 1112420 10 3h (1 GTX 3090)
Zheng_USTC_task4_SED_2 Zheng2021 1.33 0.447 0.676 1112420 9 3h (1 GTX 3090)
Nam_KAIST_task4_SED_2 Nam2021 1.19 0.399 0.609 4427956 9 4h (1 GTX 2080 Ti)
Nam_KAIST_task4_SED_1 Nam2021 1.16 0.378 0.617 4427956 16 4h (1 GTX 2080 Ti)
Nam_KAIST_task4_SED_3 Nam2021 1.09 0.324 0.634 4427956 11 4h (1 GTX 2080 Ti)
Nam_KAIST_task4_SED_4 Nam2021 0.75 0.059 0.715 4427956 9 4h (1 GTX 2080 Ti)
Koo_SGU_task4_SED_2 Koo2021 0.12 0.044 0.059 102000000 19h (1 Tesla M40)
Koo_SGU_task4_SED_3 Koo2021 0.41 0.058 0.348 196000000 2 48h (1 Tesla M40)
Koo_SGU_task4_SED_1 Koo2021 0.74 0.258 0.364 95800000 19h (1 RTX 3080 Ti)
deBenito_AUDIAS_task4_SED_4 de Benito-Gorron2021 1.10 0.361 0.577 5562100 5 20h (1 RTX 2080)
deBenito_AUDIAS_task4_SED_1 de Benito-Gorron2021 1.07 0.343 0.571 3337260 3 12h (1 RTX 2080)
deBenito_AUDIAS_task4_SED_2 de Benito-Gorron2021 1.10 0.363 0.574 3337260 3 12h (1 RTX 2080)
deBenito_AUDIAS_task4_SED_3 de Benito-Gorron2021 1.07 0.345 0.571 4449600 4 16h (1 RTX 2080)
Baseline_SSep_SED turpault2020b 1.11 0.364 0.580 2200000 6h (1 GTX 1080 Ti)
Boes_KUL_task4_SED_4 Boes2021 0.60 0.117 0.457 1038314 5h (1 GTX 1080 Ti)
Boes_KUL_task4_SED_3 Boes2021 0.68 0.121 0.531 1038314 5h (1 GTX 1080 Ti)
Boes_KUL_task4_SED_2 Boes2021 0.77 0.233 0.440 1038314 5h (1 GTX 1080 Ti)
Boes_KUL_task4_SED_1 Boes2021 0.81 0.253 0.442 1038314 5h (1 GTX 1080 Ti)
Ebbers_UPB_task4_SED_2 Ebbers2021 1.10 0.335 0.621 9568030 1 72h (4 RTX 2070)
Ebbers_UPB_task4_SED_4 Ebbers2021 1.16 0.363 0.637 59853372 6 72h (4 RTX 2070)
Ebbers_UPB_task4_SED_3 Ebbers2021 1.24 0.416 0.635 59853372 6 72h (4 RTX 2070)
Ebbers_UPB_task4_SED_1 Ebbers2021 1.16 0.373 0.621 9568030 1 72h (4 RTX 2070)
Zhu_AIAL-XJU_task4_SED_2 Zhu2021 0.99 0.290 0.574 3900000 12.5h (1 RTX 3090)
Zhu_AIAL-XJU_task4_SED_1 Zhu2021 1.04 0.318 0.583 3900000 13.5h (1 RTX 3090)
Liu_BUPT_task4_4 Liu2021 0.37 0.102 0.231 1112420 12h (1 GTX 1080 Ti)
Liu_BUPT_task4_1 Liu2021 0.30 0.090 0.169 1112420 12h (1 GTX 1080 Ti)
Liu_BUPT_task4_2 Liu2021 0.54 0.152 0.322 1112420 12h (1 GTX 1080 Ti)
Liu_BUPT_task4_3 Liu2021 0.24 0.068 0.146 1112420 12h (1 GTX 1080 Ti)
Olvera_INRIA_task4_SED_2 Olvera2021 0.98 0.338 0.481 2225868 2 24h (1 GTX 1080)
Olvera_INRIA_task4_SED_1 Olvera2021 0.95 0.332 0.462 3338802 3 24h (1 GTX 1080)
Kim_AiTeR_GIST_SED_4 Kim2021 1.32 0.442 0.674 2162412 10 5h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_2 Kim2021 1.31 0.439 0.667 2162412 5 5h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_3 Kim2021 1.30 0.434 0.669 2162412 5 5h (1 GTX 1080 Ti)
Kim_AiTeR_GIST_SED_1 Kim2021 1.29 0.431 0.661 2162412 5 5h (1 GTX 1080 Ti)
Cai_SMALLRICE_task4_SED_1 Dinkel2021 1.11 0.361 0.584 2043204 3 3h (1 GTX 2080 Ti)
Cai_SMALLRICE_task4_SED_2 Dinkel2021 1.13 0.373 0.585 2724272 4 3h (1 GTX 2080 Ti)
Cai_SMALLRICE_task4_SED_3 Dinkel2021 1.13 0.370 0.596 3405340 5 3h (1 GTX 2080 Ti)
Cai_SMALLRICE_task4_SED_4 Dinkel2021 1.00 0.339 0.504 681068 3h (1 GTX 2080 Ti)
HangYuChen_Roal_task4_SED_2 HangYu2021 0.90 0.294 0.473 11312420 2 6h (1 GTX 1080 Ti)
HangYuChen_Roal_task4_SED_1 YuHang2021 0.61 0.098 0.496 1112420 2 3h (1 GTX 1080 Ti)
Yu_NCUT_task4_SED_1 Yu2021 0.20 0.038 0.157 1300000 5h (1 GTX 1080)
Yu_NCUT_task4_SED_2 Yu2021 0.92 0.301 0.485 1300000 5h (1 GTX 1080)
lu_kwai_task4_SED_1 Lu2021 1.27 0.419 0.660 10500000 5 5h (1 GTX 2080 Ti)
lu_kwai_task4_SED_4 Lu2021 0.88 0.157 0.685 39500000 5 10h (1 GTX 2080 Ti)
lu_kwai_task4_SED_3 Lu2021 0.86 0.148 0.686 39500000 5 10h (1 GTX 2080 Ti)
lu_kwai_task4_SED_2 Lu2021 1.25 0.412 0.651 10500000 5 5h (1 GTX 2080 Ti)
Liu_BUPT_task4_SS_SED_2 Liu_SS2021 0.94 0.302 0.507 192905515 17h (1 RTX 3090)
Liu_BUPT_task4_SS_SED_1 Liu_SS2021 0.94 0.302 0.507 192905515 17h (1 RTX 3090)
Tian_ICT-TOSHIBA_task4_SED_2 Tian2021 1.19 0.411 0.585 8471847 4 6h for each model (GTX 2080 Ti)
Tian_ICT-TOSHIBA_task4_SED_1 Tian2021 1.19 0.413 0.586 8471847 4 6h for each model (GTX 2080 Ti)
Tian_ICT-TOSHIBA_task4_SED_4 Tian2021 1.19 0.412 0.586 8471847 4 6h for each model (GTX 2080 Ti)
Tian_ICT-TOSHIBA_task4_SED_3 Tian2021 1.18 0.409 0.584 8471847 4 6h for each model (GTX 2080 Ti)
Yao_GUET_task4_SED_3 Yao2021 0.88 0.279 0.479 2500000 6h (1 Titan RTX)
Yao_GUET_task4_SED_1 Yao2021 0.88 0.277 0.482 2500000 6h (1 Titan RTX)
Yao_GUET_task4_SED_2 Yao2021 0.54 0.056 0.496 2500000 6h (1 Titan RTX)
Liang_SHNU_task4_SED_4 Liang2021 0.99 0.313 0.543 1431280 16h (Tesla-V100)
Bajzik_UNIZA_task4_SED_2 Bajzik2021 1.02 0.330 0.544 2200000 13h (1 GeForce GTX 1650)
Bajzik_UNIZA_task4_SED_1 Bajzik2021 0.45 0.133 0.266 1200000 5h (1 GeForce GTX 1650)
Liang_SHNU_task4_SSep_SED_3 Liang_SS2021 0.99 0.304 0.559 1112420 3h (1 GTX 1080 Ti)
Liang_SHNU_task4_SSep_SED_1 Liang_SS2021 1.03 0.313 0.588 1112420 3h (1 GTX 1080 Ti)
Liang_SHNU_task4_SSep_SED_2 Liang_SS2021 1.01 0.325 0.542 1112420 3h (1 GTX 1080 Ti)
Baseline_SED turpault2020a 1.00 0.315 0.547 2200000 6h (1 GTX 1080 Ti)
Wang_NSYSU_task4_SED_1 Wang2021 1.13 0.336 0.646 47213260 10 480h (1 GPU 1080 Ti)
Wang_NSYSU_task4_SED_4 Wang2021 1.09 0.304 0.662 118739112 24 864h (1 GPU 1080Ti), 360h (1 GPU V100)
Wang_NSYSU_task4_SED_2 Wang2021 0.69 0.070 0.636 3350984 8 384h (1 GPU 1080Ti)
Wang_NSYSU_task4_SED_3 Wang2021 1.13 0.339 0.649 115388128 16 480h (1 GPU 1080 Ti), 360h (1 GPU V100)

Technical reports

Sound Event Detection System For DCASE 2021 Challenge

Bajzik, Jakub
University of Zilina, Department of Mechatronics and Electronics, Žilina 010 26, Slovak Republic

Abstract

This paper presents the systems proposed for the DCASE 2021 challenge Task 4 (Sound event detection and separation in domestic environments). The aim is to provide event time localization timestamps in addition to event class probabilities. Two systems are proposed. System 1 is a convolutional neural network trained for sound event classification using only weakly labeled and unlabeled data; the strong labels are obtained using the class activation mapping technique. System 1 does not reach the baseline performance. System 2 is a convolutional recurrent neural network that uses the class activation mapping technique as part of the attention mechanism to improve on the baseline performance. The second model was trained using weakly labeled, strongly labeled, and unlabeled data. Both architectures are based on the 2021 Mean Teacher baseline system.
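The class activation mapping step used to derive strong labels from a weakly trained classifier can be sketched as follows; the shapes, names, and the 0.5 threshold are illustrative assumptions, not details of the submitted system:

```python
def class_activation_map(features, weights):
    """CAM over time: `features` is (frames x channels) from the last
    conv layer after frequency pooling; `weights` is (channels x classes)
    from the global-pooling classifier head. Returns per-frame class
    scores that can be thresholded into strong (timestamped) labels.
    """
    frames = []
    for frame in features:
        # dot product of the frame's feature vector with each class column
        frames.append([sum(f * w for f, w in zip(frame, col))
                       for col in zip(*weights)])
    return frames

def cam_to_segments(cam_column, threshold=0.5):
    """Threshold one class's CAM curve into (onset, offset) frame pairs."""
    segments, start = [], None
    for t, v in enumerate(cam_column):
        if v > threshold and start is None:
            start = t
        elif v <= threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(cam_column)))
    return segments
```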

System characteristics
PDF

Optimizing Temporal Resolution Of Convolutional Recurrent Neural Networks For Sound Event Detection

Boes, Wim and Van Hamme, Hugo
ESAT, KU Leuven, Leuven, Belgium

Abstract

In this technical report, the systems we submitted for subtask 4 of the DCASE 2021 challenge, regarding sound event detection, are described in detail. These models are closely related to the baseline provided for this problem, as they are essentially convolutional recurrent neural networks trained in a mean teacher setting to deal with the heterogeneous annotation of the supplied data. However, the time resolution of the predictions was adapted to deal with the fact that these systems are evaluated using two intersection-based metrics involving different needs in terms of temporal localization. This was done by optimizing the pooling operations. For the first of the defined evaluation scenarios, imposing relatively strict requirements on the temporal localization accuracy, our best model achieved a PSDS score of 0.3609 on the validation data. This is only marginally better than the performance obtained by the baseline system (0.342): The amount of pooling in the baseline network already turned out to be optimal, and thus, no substantial changes were made, explaining this result. For the second evaluation scenario, imposing relatively lax restrictions on the localization accuracy, our best-performing system achieved a PSDS score of 0.7312 on the validation data. This is significantly better than the performance obtained by the baseline model (0.527), which can effectively be attributed to the changes that were applied to the pooling operations of the network.
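The trade-off the authors tune can be made concrete with a small helper: the effective temporal resolution of a CRNN's frame predictions is the feature hop multiplied by the time-pooling factors of the convolutional blocks. The values below are illustrative, not the submission's actual configuration:

```python
def output_resolution_ms(hop_ms, time_pool_factors):
    """Temporal resolution of CRNN frame predictions after pooling.

    hop_ms: STFT hop of the input features (illustrative value).
    time_pool_factors: pooling applied along time in each conv block.
    Each pooling factor multiplies the effective hop between outputs.
    """
    res = hop_ms
    for f in time_pool_factors:
        res *= f
    return res

# e.g. a 16 ms hop with time pooling (2, 2, 1) yields 64 ms predictions;
# less time pooling keeps the finer localization needed for the strict
# evaluation scenario, more pooling suits the laxer one.
```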

System characteristics
PDF

Convolution-Augmented Conformer For Sound Event Detection

Chen, YuHang
Royal Flush, 18, Tongshun Street, Wuchang Street, Yuhang District. HangZhou 310000, CHINA

Abstract

In this technical report, we describe our submission system for DCASE 2021 Task 4: sound event detection and separation in domestic environments. Our model employs conformer blocks, which combine self-attention and depth-wise convolution networks, to efficiently capture the global and local context information of an audio feature sequence. In addition to this novel architecture, we further improve performance by utilizing a mean teacher semi-supervised learning technique and data augmentation for each sound event class. We demonstrate that the proposed method achieves PSDS-1 and PSDS-2 scores of 34% and 55.7% on the validation set, outperforming the baseline scores.
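The mean teacher technique mentioned in the abstract (and used by most systems in the tables above) keeps a second "teacher" copy of the model whose weights track an exponential moving average of the student's. A minimal sketch, with a dict of floats standing in for real parameter tensors and `alpha=0.999` as a typical, assumed decay value:

```python
def ema_update(teacher, student, alpha=0.999):
    """Mean-teacher weight update after each training step:
    teacher_param <- alpha * teacher_param + (1 - alpha) * student_param.
    The teacher then produces the consistency targets for unlabeled data.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}
```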

System characteristics
PDF

Multi-Resolution Mean Teacher For DCASE 2021 Task 4

de Benito-Gorron, Diego and Segovia, Sergio and Ramos, Daniel and T. Toledano, Doroteo
AUDIAS Research Group, Universidad Autónoma de Madrid, Calle Francisco Tomás y Valiente, 11, 28049 Madrid, Spain

Abstract

This technical report describes our participation in DCASE 2021 Task 4: Sound event detection and separation in domestic environments. Aiming to take advantage of the different lengths and spectral characteristics of each target category, we follow the multiresolution feature extraction approach that we proposed for last year’s edition. It is found that each one of the proposed Polyphonic Sound Detection Score (PSDS) scenarios benefits from either a higher temporal resolution or a higher frequency resolution. Furthermore, combining several time-frequency resolutions via model fusion is able to improve the PSDS results in both scenarios.

System characteristics
PDF

The Smallrice Submission To The Dcase2021 Task 4 Challenge: A Lightweight Approach For Semi-Supervised Sound Event Detection With Unsupervised Data Augmentation

Dinkel, Heinrich and Cai, Xinyu and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun
Xiaomi Corporation, Beijing, China

Abstract

This paper describes our submission to the DCASE 2021 challenge. Different from the baseline and most other approaches, our work focuses on training a lightweight, well-performing model which can be used in real-world applications. Compared to the baseline, our model contains only 600k (15%) parameters, resulting in a size of 2.7 MB on disk, making it viable for applications on low-resource devices such as mobile phones. Our model is trained using unsupervised data augmentation as its consistency criterion, which we show can achieve performance competitive with the more common mean teacher paradigm. Our submitted results on the validation set yield a single-model peak performance of 36.91 PSDS-1 and 57.17 PSDS-2, outperforming the baseline by 2.7 and 5.0 absolute points respectively. Notably, our approach achieves an Event-F1 score of 39.29 on the development set without post-processing. The best submitted ensemble system, using a 4-way fusion, achieves a PSDS-1 of 38.23 and a PSDS-2 of 62.29 on the validation dataset.
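The consistency criterion behind unsupervised data augmentation can be sketched as follows. This is our own hedged illustration of the general idea, not the authors' implementation; the function names and probabilities are ours.

```python
import numpy as np

# Sketch of the UDA-style consistency idea: on unlabeled clips, the model's
# class posterior for an augmented view is pulled toward its posterior for
# the clean view, here measured with a KL divergence.

def kl_consistency(p_clean: np.ndarray, p_aug: np.ndarray, eps: float = 1e-8) -> float:
    """KL divergence between clean-view and augmented-view class posteriors."""
    p = np.clip(p_clean, eps, 1.0)
    q = np.clip(p_aug, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

clean = np.array([0.7, 0.2, 0.1])
print(kl_consistency(clean, clean))                     # identical views: ~0
print(kl_consistency(clean, np.array([0.3, 0.4, 0.3])))  # disagreement penalized
```

Minimizing this term over unlabeled data encourages predictions that are stable under augmentation, which is the same role the mean-teacher consistency loss plays in the baseline.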

System characteristics
PDF

Self-Trained Audio Tagging And Sound Event Detection In Domestic Environments

Ebbers, Janek and Haeb-Umbach, Reinhold
Paderborn University, Department of Communications Engineering, Paderborn, Germany

Abstract

In this report we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments. Our presented solution is an advancement of our system used in the previous edition of the task. We use our previously proposed forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, plus tag-conditioned sound event detection (SED) models which are trained using the strong pseudo labels provided by the FBCRNN. Our advancement over our previous model is threefold. Firstly, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training, which leads to both better tagging and detection performance. Secondly, we perform multiple iterations of self-training for both the FBCRNN and the tag-conditioned SED models. Thirdly, while we used only tag-conditioned CNNs as our SED model in the last edition, we here explore more sophisticated SED model architectures, namely tag-conditioned bidirectional CRNNs and tag-conditioned bidirectional convolutional transformer neural networks (CTNNs), and combine them. With scenario- and class-dependent tuning of median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves validation polyphonic sound detection scores (PSDS) of 0.454 for scenario 1 and 0.758 for scenario 2 as well as a collar-based F1-score of 0.602, outperforming the baselines and our model from the last edition by far. Source code will be made publicly available at https://github.com/fgnt/pb_sed.
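The class-dependent median-filter post-processing mentioned at the end of the abstract can be sketched as follows. This is an illustrative toy, not the pb_sed code; the filter length and the decision sequence are made-up example values.

```python
import numpy as np

# Sketch of median-filter post-processing: each class's frame-level binary
# decisions are smoothed with a class-specific window length, e.g. long
# windows for long events (vacuum cleaner) and short ones for impulsive
# events (dishes). Isolated spurious frames are removed, short gaps filled.

def median_filter(decisions: np.ndarray, length: int) -> np.ndarray:
    """Median-smooth a binary decision sequence with an odd window length."""
    pad = length // 2
    padded = np.pad(decisions, pad, mode="edge")
    return np.array([np.median(padded[t:t + length])
                     for t in range(len(decisions))])

noisy = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0])
print(median_filter(noisy, 3))   # the isolated 1 is removed, the short gap filled
```

Tuning the window length per class and per scenario trades off fine boundaries (short windows, good for strict localization) against stable segments (long windows, good for coarse localization).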

System characteristics
PDF

Improved Pseudo-Labeling Method For Semi-Supervised Sound Event Detection

Gong, Yaguang and Li, Changlong and Wang, Xintian and Ma, Lu and Yang, Song and Wu, Zhongqin
TAL Education Group, China

Abstract

This report illustrates a framework for the DCASE2021 Task 4 challenge: sound event detection. The proposed framework is built on the pseudo-labeling method widely applied in semi-supervised learning (SSL) tasks. The proposed method synthesizes weak pseudo-labels for the large amount of unlabeled data by utilizing the model's predictions on weakly augmented spectrograms. These weak pseudo-labels are then used as supervision for strongly augmented spectrograms of the same sample. Alongside this main contribution, this work introduces data augmentation techniques including random frequency masking and time shifting, training techniques such as a class-specific weighted loss, and model ensemble techniques. Experimental results demonstrate that the proposed method achieves PSDS of 0.407/0.653 (scenario1/scenario2) on the validation set, clearly outperforming the baseline scores of 0.342/0.527.
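The weak-to-strong pseudo-labeling scheme described above can be sketched in a few lines. This is our simplification of the general idea, not the authors' code; the threshold value is an assumption for illustration.

```python
import numpy as np

# Sketch of weak/strong pseudo-labeling: predictions on a weakly augmented
# spectrogram are binarized into multi-label pseudo-labels, which then
# supervise the prediction on a strongly augmented view of the same clip.

def pseudo_label(p_weak: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize weak-view class probabilities into multi-label pseudo-labels."""
    return (p_weak >= threshold).astype(float)

def bce(target: np.ndarray, p_strong: np.ndarray, eps: float = 1e-8) -> float:
    """Binary cross-entropy between pseudo-labels and strong-view predictions."""
    p = np.clip(p_strong, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

p_weak = np.array([0.9, 0.1, 0.6])      # model output on weakly augmented view
labels = pseudo_label(p_weak)           # -> [1., 0., 1.]
print(bce(labels, np.array([0.8, 0.2, 0.7])))   # loss on strongly augmented view
```

The weak view keeps the pseudo-labels reliable, while the strong view forces the model to generalize to heavy distortions of the same content.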

System characteristics
PDF

Task Aware Sound Event Detection Based On Semi-Supervised CRNN With Skip Connections: DCASE 2021 Challenge, Task 4

Hafsati, Mohammed and Bentounes, Kamil
1 Beijing Kuaishou Technology Co., Ltd, China, 2 The State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing, China

Abstract

Sound Event Detection (SED) is the task of classifying the different sounds occurring in a recorded environment together with their onset and offset times. This assignment is the primary goal of the fourth task of the DCASE challenge, using strongly labeled, partially labeled, and unlabeled datasets. In this paper, we describe our submitted approach for this challenge. Our neural network is based on sequential convolutional neural networks with skip connections followed by a recurrent neural network. To overcome the challenge of using unlabeled data, we use semi-supervised learning, and to further improve performance, we propose to use data augmentation techniques. With our model, we can slightly outperform the baseline with fewer filters and therefore fewer parameters. Moreover, with a similar number of parameters to the baseline, we significantly outperform it.

System characteristics
PDF

Self-Training With Noisy Student Model And Semi-Supervised Loss Function For DCASE 2021 Challenge Task 4

Kim, Nam Kyun 1 and Kim, Hong Kook 1,2
1 School of Electrical Engineering and Computer Science, 2 AI Graduate School, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Gwangju 61005, Republic of Korea

Abstract

This report proposes a polyphonic sound event detection (SED) method for the DCASE 2021 Challenge Task 4. The proposed SED model consists of two stages: a mean-teacher model for providing target labels for weakly labeled or unlabeled data, and a self-training-based noisy student model for predicting strong labels for sound events. The mean-teacher model, which uses a residual convolutional recurrent neural network (RCRNN) for both the teacher and the student, is first trained using all the training data from a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. Then, the trained mean-teacher model predicts strong labels for the weakly labeled and unlabeled datasets, which are passed to the noisy student model in the second stage of the proposed SED model. Here, the structure of the noisy student model is identical to the RCRNN-based student model of the mean-teacher model in the first stage. It is then self-trained by adding feature noises, such as time-frequency shift, mixup, SpecAugment, and dropout-based model noise. In addition, a semi-supervised loss function is applied to train the noisy student model, which acts as label noise injection. The performance of the proposed SED model is evaluated on the validation set of the DCASE 2021 Challenge Task 4, and several ensemble models that combine five-fold validation models with different hyperparameters of the semi-supervised loss function are selected as our final models.
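One of the feature noises listed above, mixup, can be sketched directly. This is a generic illustration of the technique, not the authors' implementation; shapes and the Beta parameters are example values.

```python
import numpy as np

# Sketch of mixup: two clips and their label vectors are blended with a
# coefficient sampled from a Beta distribution, producing an interpolated
# training example that regularizes the student model.

def mixup(x1, y1, x2, y2, lam: float):
    """Blend two (feature, label) pairs with coefficient lam in [0, 1]."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 64, 100))     # two (mel, time) spectrograms
y1 = np.array([1.0, 0.0])
y2 = np.array([0.0, 1.0])
lam = float(rng.beta(0.2, 0.2))                # typically near 0 or 1
x_mix, y_mix = mixup(x1, y1, x2, y2, lam)
print(x_mix.shape)                             # spectrogram shape preserved
```

Because the labels are mixed with the same coefficient as the features, the target stays consistent with the blended audio content.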

System characteristics
PDF

Sound Event Detection Based On Self-Supervised Learning Of Wav2vec 2.0

Koo, Hyejin1 and Park, Hyung-Min1 and Park, Jonghyeon2 and Oh, Myungwoo2
1Dept. of Electronic Engineering, Sogang University, Seoul 04107, South Korea, 2NAVER Corp. Gyeonggi-do 13561, South Korea

Abstract

In this report, we present our system for DCASE2021 Task4: Sound Event Detection (SED) and Separation in Domestic Environments. This task evaluates how to capture information for SED with a relatively small amount of labeled data in addition to a large amount of unlabeled data. We apply wav2vec 2.0 to SED for the first time. Even though wav2vec 2.0 pre-training on the DCASE2021 Task4 dataset takes a long time to learn audio representations, the presented model achieved higher intersection F1 and PSDS2 scores. The baseline's mean-teacher model and dataset were used to compare wav2vec 2.0 and log-mel features. Under the same conditions, we show how wav2vec 2.0 features work on the SED task.

System characteristics
PDF

Adaptive Focal Loss With Data Augmentation For Semi-Supervised Sound Event Detection

Liang, Yunhao and Tang, Tiantian and Long, Yanhua
Shanghai Normal University, Shanghai, China

Abstract

In this technical report, we describe our submission system for DCASE2021 Task4: sound event detection and separation in domestic environments. In our submissions, two different deep models are investigated. The first one is a mean-teacher model with a convolutional recurrent neural network (CRNN). The second one is a joint framework with an adaptive focal loss based on the Guided Learning architecture. To improve the performance of the system, we use various methods such as the SpecAugment data augmentation method, the adaptive focal loss, and event-specific post-processing. To combine sound separation with sound event detection, we train models using the outputs of the sound separation baseline system. We demonstrate that the proposed method achieves an event-based macro F1 score of 44.4%, a PSDS1 of 0.428 and a PSDS2 of 0.736 on the validation set.
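The (non-adaptive) binary focal loss that this system builds on can be sketched as follows; the adaptive variant in the report presumably tunes the focusing parameter per class, which we do not reproduce here.

```python
import numpy as np

# Sketch of the binary focal loss: the (1 - p_t)**gamma factor down-weights
# easy, confidently classified examples so training focuses on hard ones.
# gamma = 2.0 is the commonly used default, an assumption on our part.

def focal_loss(y: np.ndarray, p: np.ndarray, gamma: float = 2.0,
               eps: float = 1e-8) -> float:
    """Mean binary focal loss for multi-label targets y and predictions p."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)     # probability assigned to the true label
    return float(-np.mean((1 - p_t) ** gamma * np.log(p_t)))

y = np.array([1.0, 0.0])
easy = focal_loss(y, np.array([0.95, 0.05]))   # confident and correct: tiny loss
hard = focal_loss(y, np.array([0.6, 0.4]))     # uncertain: much larger loss
print(easy, hard)
```

With gamma = 0 the expression reduces to ordinary binary cross-entropy; larger gamma shifts more of the gradient toward hard or rare event classes.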

System characteristics
PDF

Combined Sound Event Detection And Sound Event Separation Networks For DCASE 2021 Task 4

Liu, Gang and Liu, Zhuang Zhuang and Fang, Jun Yan and Liu, Yi and Zhou, Ming Kun
Beijing University of Posts and Telecommunications, Beijing, China

Abstract

Audio tagging aims to assign one or more labels to an audio clip. In this paper, we present the solutions applied in our submission for DCASE2021 Task4. The target of the systems is to provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording. We present a convolutional recurrent neural network (CRNN) with two recurrent neural network (RNN) classifiers sharing the same preprocessing convolutional neural network (CNN). Both recurrent networks perform audio tagging: one processes the input audio signal in the forward direction and the other in the backward direction. We also use a spatial attention layer called FcaNet to improve our system, and we build an independent system to achieve sound event separation.

System characteristics
PDF

Integrating Advantages Of Recurrent And Transformer Structures For Sound Event Detection In Multiple Scenarios

Lu, Rui 1 and Hu, Wenzheng 2 and Duan, Zhiyao 1 and Liu, Ji 1
1 Beijing Kuaishou Technology Co., Ltd, China, 2 The State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing, China

Abstract

In this technical report, we detail our submitted systems for Task4 of DCASE2021: Sound Event Detection and Separation in Domestic Environments. Our systems exploit both recurrent and transformer structures to model the complicated dynamics in real-life domestic audio data. In addition to prevalent tricks such as semi-supervised mean-teacher learning, data augmentation and ensembling, we find that different models behave differently under the two scenarios, which emphasize different system properties. By integrating the advantages of both the recurrent and transformer structures, our proposed systems achieve an overall polyphonic sound detection score (PSDS) of 1.171 (PSDS-scenario1 + PSDS-scenario2) on the held-out test set of the development dataset, outperforming the baseline system by 34.8%.

System characteristics
PDF

Convolutional Network With Conformer For Semi-Supervised Sound Event Detection

Na, Tong and Zhang, Qinyi
Beijing University of Posts and Telecommunications, Beijing, China

Abstract

In this technical report, we describe our system submission for DCASE 2021 Task 4. Our model employs a convolutional network in conjunction with conformer blocks and utilizes the Mean-Teacher semi-supervised learning technique for further improvement.

System characteristics
PDF

Heavily Augmented Sound Event Detection utilizing Weak Predictions

Nam, Hyeonuk and Ko, Byeong-Yun and Lee, Gyeong-Tae and Kim, Seong-Hu and Jung, Won-Ho and Choi, Sang-Min and Park, Yong-Hwa
Korea Advanced Institute of Science and Technology, Department of Mechanical Engineering, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea

Abstract

The performance of Sound Event Detection (SED) systems is greatly limited by the difficulty of generating large strongly labeled datasets. In this work, we used two main approaches to overcome the lack of strongly labeled data. First, we applied heavy data augmentation on the input features. The data augmentation methods used include not only conventional methods from the speech/audio domain but also our proposed method named FilterAugment. Second, we propose two methods that utilize weak predictions to enhance weakly supervised SED performance. As a result, we obtained a best PSDS1 of 0.4336 and a best PSDS2 of 0.8161 on the DESED real validation dataset.
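One of the conventional augmentations the report refers to, SpecAugment-style frequency masking, can be sketched directly; FilterAugment itself is the authors' own method and is not reproduced here. All sizes below are example values.

```python
import numpy as np

# Sketch of frequency masking: a random band of mel bins in a (mel, time)
# spectrogram is zeroed out, forcing the model not to rely on any single
# narrow frequency region.

def freq_mask(spec: np.ndarray, width: int, rng) -> np.ndarray:
    """Zero out a random band of `width` mel bins; returns a masked copy."""
    out = spec.copy()
    f0 = rng.integers(0, spec.shape[0] - width + 1)
    out[f0:f0 + width, :] = 0.0
    return out

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))      # 64 mel bins, 100 frames
masked = freq_mask(spec, width=8, rng=rng)
print((masked == 0).all(axis=1).sum())     # exactly 8 fully zeroed bins
```

Time masking works the same way on the other axis; combining several such masks per clip yields the "heavy" augmentation regime the abstract describes.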

System characteristics
PDF

Domain-Adapted Sound Event Detection System With Auxiliary Foreground-Background Classifier

Olvera, Michel1 and Vincent, Emmanuel1 and Gasso, Gilles2
1Université de Lorraine, Inria, Loria, F-54000 Nancy, France, 2LITIS EA 4108, Université & INSA Rouen Normandie, 76800 Saint-Étienne du Rouvray, France

Abstract

In this technical report, we propose a sound event detection system for the DCASE 2021 task 4 challenge, which consists of a foreground-background classification branch that is jointly trained with the baseline architecture. Furthermore, to account for the mismatch between synthetic annotated data and real unlabeled data used for training, we also propose a frame-level domain adaptation scheme to improve detection performance over real soundscapes. We show that these improvements to the baseline method help in the generalization of the sound event detection task.

System characteristics
PDF

Sound Event Detection with Cross-Referencing Self-Training

Park, Sangwook 1 and Choi, Woohyun 2 and Elhilali, Mounya 1
1 Department of Electrical and Computer Engineering, Johns Hopkins University, United States, 2 LG Electronics

Abstract

This report describes a sound event detection method submitted to the DCASE2021 challenge, Task 4. In this approach, we design a residual convolutional recurrent neural network and train it with a cross-referencing self-training approach that leverages extensive unlabeled data in combination with labeled data. This approach takes advantage of semi-supervised training using pseudo-labels from a balanced student-teacher model, and outperforms the DCASE2021 challenge baseline in terms of the Polyphonic Sound Detection Score. Additionally, the proposed network makes more accurate predictions in terms of class-wise collar-based F1, compared to the baseline.

System characteristics
PDF

Sound Event Detection Using Metric Learning And Focal Loss For DCASE 2021 Task 4

Tian, Gangyi 1 and Huang, Yuxin 1,2 and Ye, Zhirong 1,2 and Ma, Shuo 1,2 and Wang, Xiangdong 1 and Liu, Hong 1 and Qian, Yueliang 1 and Tao, Rui 3 and Yan, Long 3 and Ouchi, Kazushige 3 and Ebbers, Janek 4 and Haeb-Umbach, Reinhold 4
1 Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2 University of Chinese Academy of Sciences, Beijing, China, 3 Toshiba China R&D Center, Beijing, China, 4 Paderborn University, Germany

Abstract

In this technical report, we describe our system submission for DCASE 2021 Task 4. Our model employs a convolutional network in conjunction with conformer blocks and utilizes the Mean-Teacher semi-supervised learning technique for further improvement.

System characteristics
PDF

Training Sound Event Detection On A Heterogeneous Dataset

Turpault, Nicolas and Serizel, Romain
Université de Lorraine, CNRS, Inria, Loria, France

Abstract

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default to replicate other approaches are shown to be sub-optimal.

System characteristics
Input mono
Classifier CRNN
Acoustic features log-mel energies
Decision making p-norm
PDF

Improving Sound Event Detection In Domestic Environments Using Sound Separation

Turpault, Nicolas 1 and Wisdom, Scott 2 and Erdogan, Hakan 2 and Hershey, John R. 2 and Serizel, Romain 1 and Fonseca, Eduardo 3 and Seetharaman, Prem 4 and Salamon, Justin 5
1 Université de Lorraine, CNRS, Inria, Loria, Nancy, France, 2 Google Research, AI Perception, Cambridge, United States, 3 Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, 4 Interactive Audio Lab, Northwestern University, Evanston, United States, 5 Adobe Research, San Francisco, United States

Abstract

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing stage for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods of combining separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the universal sound separation model to the sound event detection data in terms of both separation and sound event detection performance.

System characteristics
Input mono
Classifier CRNN
Acoustic features log-mel energies
Decision making p-norm
PDF

CHT+NSYSU Sound Event Detection System With Multiscale Channel Attention And Multiple Consistency Training For DCASE 2021 Task 4

Wang, Yih-Wen 1 and Chen, Chia-Ping 1 and Lu, Chung-Li 2 and Chan, Bo-Cheng 1
1National Sun Yat-Sen University, Taiwan 2 Chunghwa Telecom Laboratories, Taiwan

Abstract

In this technical report, we describe our submission system for DCASE 2021 Task4: sound event detection and separation in domestic environments. The proposed system is based on the mean-teacher framework of semi-supervised learning and on CRNN and CNN-Transformer neural networks. We employ interpolation consistency training (ICT), shift consistency training (SCT), and clip-level consistency training (CCT) to enhance generalization and representation. A multiscale CNN block is applied to extract various features to mitigate the influence of event-length diversity on the network. An efficient channel attention network (ECA-Net) and exponential softmax pooling enable the model to obtain definite sound event predictions. To further improve performance, we use data augmentation including mixup, time shift, and time-frequency masks. Our ensemble system achieves a PSDS-scenario1 of 40.72% and a PSDS-scenario2 of 80.80% on the validation set, significantly outperforming the baseline scores of 34.2% and 52.7%, respectively.

System characteristics
PDF

Adaptive Memory Controlled Self Attention For Sound Event Detection

Yoa, Yu and Song, Xiyu
Guilin University of Electronic Technology, Guilin 541004, Guangxi, China

Abstract

Sound event detection is the task of detecting the time stamps and the classes of sound events occurring in a recording. Real-life sound events overlap in recordings and their durations vary far more than in synthetic data, making them even harder to recognize. In this paper we investigate how well attention mechanisms can improve real-life sound event detection (SED). Convolutional Recurrent Neural Networks (CRNN) have recently shown improved performance over established methods in various sound recognition tasks. In our work we use a CRNN to extract hidden-state feature representations; then, a self-attention mechanism is introduced to memorize long-range dependencies of the features that the CRNN extracts. Furthermore, we propose adaptive memory-controlled self-attention to explicitly compute the relations between time steps in the audio representation embedding. The proposed method is evaluated on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task4 dataset, which contains different overlapping sound events from both real-life and synthetic recordings. We develop a self-attention SED model that uses a memory-controlled strategy with a heuristically chosen fixed attention width, achieving a PSDS-scenario2 of 60.72% on average, indicating that the attention mechanism is able to improve sound event detection. We show that the proposed adaptive memory-controlled model reaches the same level of results as the fixed-attention-width memory-controlled model.
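The fixed-width ("memory-controlled") self-attention the abstract describes can be sketched as a masked attention. This is our own illustration of the general mechanism, not the authors' model; the adaptive variant would learn the width rather than fix it.

```python
import numpy as np

# Sketch of width-limited self-attention: each time step may only attend to
# neighbours within +/- `width` frames instead of the whole sequence,
# bounding how much temporal context ("memory") each frame can use.

def windowed_attention(x: np.ndarray, width: int) -> np.ndarray:
    """Self-attention over (T, D) features, masked to a +/- width window."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) > width
    scores[mask] = -np.inf                       # forbid far-away time steps
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over allowed frames
    return w @ x

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))
print(windowed_attention(x, width=2).shape)      # (10, 4), context capped at 5 frames
```

With width 0 each frame attends only to itself and the input passes through unchanged; growing the width interpolates toward full self-attention.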

System characteristics
PDF

Semi-Supervised Sound Event Detection Using Multi-Scale Convolutional Recurrent Neural Network And Weighted Pooling

Yu, Dongchi and Cai, Xichang and Liu, Duxin and Liu, Zihan
North China University of Technology, Beijing, China

Abstract

In this technical report, we describe our submission system for DCASE2021 Task4: sound event detection and separation in domestic environments. We mainly focus on the scenario that recognizes sound events without source separation. Since the durations of different sound events can be quite different, our model employs a multi-scale convolutional recurrent network to extract multi-scale features from an audio sequence. To utilize the weakly labeled training data more efficiently, a global weighted pooling strategy is introduced to aggregate frame-level predictions into a clip-level prediction. Additionally, our model uses the mean teacher semi-supervised learning technique and data augmentation. We demonstrate that the proposed method achieves a PSDS2 score of 0.61 and an event-based macro F1 score of 42.15% on the validation set.
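The frame-to-clip aggregation the abstract calls global weighted pooling can be sketched as follows. This is a generic illustration of one common weighting choice (activation-proportional weights); the exact weighting in the report may differ.

```python
import numpy as np

# Sketch of weighted pooling for weak labels: frame-level probabilities are
# aggregated into one clip-level probability, with frames that fire more
# strongly receiving larger weights. This sits between mean-pooling (which
# dilutes short events) and max-pooling (which ignores all but one frame).

def weighted_pool(p_frames: np.ndarray) -> float:
    """Clip-level probability as an activation-weighted mean over frames."""
    w = p_frames / (p_frames.sum() + 1e-8)
    return float((w * p_frames).sum())

p = np.array([0.05, 0.05, 0.9, 0.8, 0.1])   # a short event in a long clip
print(weighted_pool(p))                      # pulled toward the active frames
print(float(p.mean()))                       # plain mean-pooling, for comparison
```

For a short event, the weighted pool reports a clip-level score close to the active frames' probabilities, so the weak clip label still provides a useful training signal.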

System characteristics
PDF

Zheng USTC Team’s Submission For DCASE2021 Task4 – Semi-Supervised Sound Event Detection

Zheng, Xu and Chen, Han and Song, Yan
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China.

Abstract

In this technical report, we present our submitted system for DCASE2021 Task4: sound event detection and separation in domestic environments. Specifically, three main techniques are applied to improve the performance of the official baseline system with both synthetic and real data (weakly labeled and unlabeled). Firstly, in order to improve the localization ability of the CRNN model, we propose to use the selective kernel (SK) unit. By stacking SK units, each neuron can adaptively adjust its receptive field for both short- and long-duration events. Secondly, based on the observation that detection outputs are dominated by high-confidence predictions (lower than 0.1 or higher than 0.9), we propose to use a soft detection output by setting a proper temperature parameter in the sigmoid, which can effectively improve the PSDS2 score. Thirdly, several data augmentation techniques and score fusion mechanisms are applied to improve the stability and robustness of the system performance. Experiments on the DCASE2021 Task4 validation dataset demonstrate the effectiveness of the techniques used in our system. Specifically, PSDS scores of 0.45 and 0.78 are achieved for scenario1 and scenario2 respectively, outperforming the baseline results of 0.34 and 0.53.
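The soft detection output described in the second point can be sketched as a temperature-scaled sigmoid. This is our own illustration; the temperature value is an example, not the one chosen in the report.

```python
import numpy as np

# Sketch of a soft detection output: dividing logits by a temperature T > 1
# before the sigmoid keeps scores away from the 0/1 extremes, so downstream
# thresholds at many operating points (as PSDS2 evaluates) remain meaningful.

def soft_sigmoid(logits: np.ndarray, temperature: float = 2.0) -> np.ndarray:
    """Temperature-scaled sigmoid producing softer detection scores."""
    return 1.0 / (1.0 + np.exp(-logits / temperature))

logits = np.array([-4.0, 0.0, 4.0])
hard = soft_sigmoid(logits, temperature=1.0)   # close to 0 / 0.5 / 1
soft = soft_sigmoid(logits, temperature=2.0)   # pulled toward 0.5
print(hard)
print(soft)
```

With T = 1 this is the ordinary sigmoid; increasing T compresses the score distribution away from the saturated regions that the abstract identifies as dominating the detection outputs.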

System characteristics
PDF

Multi-Scale Convolution Based Attention Network For Semi-Supervised Sound Event Detection

Zhu, Xiujuan1,3 and Sun, Xinghao1,3 and Hu, Ying1,3 and Chen, Yadong1,3 and Qiu, Wenbo1,3 and Tang, Yuwu1,3 and He, Liang1,2 and Xu, Minqiang4
1School of Information Science and Engineering, Xinjiang University, Urumqi, China, 2Department of Electronic Engineering, Tsinghua University, China, 3Key Laboratory of Signal Detection and Processing in Xinjiang, China, 4SpeakIn Technology

Abstract

Deep Convolutional Recurrent Neural Networks (CRNN) have drawn great attention in sound event detection (SED). Because the variation in duration of acoustic events is relatively large, it is critically important to design a good operator that can extract multi-scale features more efficiently for SED. However, most CRNN-based models lack discriminative ability for different types of acoustic events and treat them equally, which limits the representational capacity of the models. Inspired by this, we propose a Multi-Scale Convolution based Attention Network (MSCA). By using multi-scale convolution, a more effective feature representation can be obtained, which naturally learns coarse-to-fine multi-scale features to help the model recognize different sound events. On the other hand, a channel-wise attention module is designed, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.

System characteristics
PDF