Acoustic Scene Classification with Multiple Devices


Challenge results

Task description

This subtask concerns the classification of data from multiple devices (real and simulated) and targets how well systems generalize across a number of different devices.

The development dataset consists of recordings from 10 European cities using 9 different devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings; these segments therefore all overlap with the data from device A, but not necessarily with each other. The total amount of audio in the development set is 64 hours.

The evaluation dataset contains data from 12 cities, 10 acoustic scenes, and 11 devices. There are five new devices (not available in the development set): real device D and simulated devices S7-S10. The evaluation data contains 33 hours of audio.

Device A consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using a 48 kHz sampling rate and 24-bit resolution. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

A more detailed task description can be found on the task description page.

Systems ranking
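The accuracy column in the systems table is reported with a 95% confidence interval, but this page does not say how the interval was obtained. As a rough illustration only, the sketch below uses a normal-approximation (Wald) binomial interval. The function name `wald_interval` and the sample count are assumptions: 11880 corresponds to 33 hours of audio cut into 10-second segments, and the segment length is not stated on this page.

```python
import math

# Assumed number of evaluation items: 33 h of audio in 10 s segments.
# The 10 s segment length is an assumption, not taken from this page.
N_EVAL = 33 * 3600 // 10  # 11880


def wald_interval(accuracy_percent: float, n: int, z: float = 1.96):
    """Normal-approximation interval for an accuracy given in percent."""
    p = accuracy_percent / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


# Accuracy reported for Abbasi_ARI_task1a_1 in the systems table.
low, high = wald_interval(59.7, N_EVAL)
print(f"{low:.1f} - {high:.1f}")
```

Under these assumptions the result lands close to the interval listed for that submission (58.8 - 60.6). The official procedure may differ, so treat this as a plausibility check rather than a description of how the organizers computed the intervals.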

Columns: Submission label; Name; Technical Report; Official system rank; Evaluation dataset accuracy with 95% confidence interval; Evaluation dataset logloss; Development dataset accuracy; Development dataset logloss
Abbasi_ARI_task1a_1 1a_CNN Abbasi2020 78 59.7 (58.8 - 60.6) 1.099 61.6 1.071
Abbasi_ARI_task1a_2 1a_CNN Abbasi2020 76 60.6 (59.7 - 61.5) 1.063 62.1
Cao_JNU_task1a_1 CaoJNU1 Fei2020 63 65.7 (64.9 - 66.6) 1.265 68.3 1.186
Cao_JNU_task1a_2 CaoJNU2 Fei2020 64 65.7 (64.8 - 66.5) 1.259 68.9 1.163
Cao_JNU_task1a_3 CaoJNU3 Fei2020 61 66.0 (65.1 - 66.8) 1.268 68.7 1.202
Cao_JNU_task1a_4 CaoJNU4 Fei2020 62 65.9 (65.1 - 66.8) 1.267 69.2 1.171
FanVaf__task1a_1 CRNN_4kHz Fanioudakis2020 72 63.4 (62.5 - 64.2) 1.106 65.4 2.070
FanVaf__task1a_2 CRNN_8kHz Fanioudakis2020 75 60.7 (59.9 - 61.6) 1.142 62.8
FanVaf__task1a_3 CRNN_ens Fanioudakis2020 66 64.8 (63.9 - 65.6) 1.298 67.4
FanVaf__task1a_4 CRNN_ens Fanioudakis2020 54 67.5 (66.6 - 68.3) 1.240
Gao_UNISA_task1a_1 Baseline Gao2020 9 75.0 (74.3 - 75.8) 1.225 71.7
Gao_UNISA_task1a_2 focal_ls Gao2020 12 74.1 (73.3 - 74.9) 1.242 71.8
Gao_UNISA_task1a_3 da Gao2020 11 74.7 (73.9 - 75.5) 1.231 71.4
Gao_UNISA_task1a_4 ensemble Gao2020 8 75.2 (74.4 - 76.0) 1.230 72.5
DCASE2020 baseline Baseline 51.4 (50.5 - 52.3) 1.902 51.6 1.405
Helin_ADSPLAB_task1a_1 Helin1 Wang2020_t1 14 73.4 (72.6 - 74.2) 0.850 84.2 0.569
Helin_ADSPLAB_task1a_2 Helin2 Wang2020_t1 49 68.4 (67.6 - 69.3) 0.991 81.8 0.694
Helin_ADSPLAB_task1a_3 Helin3 Wang2020_t1 18 73.1 (72.3 - 73.9) 0.889 84.5 0.611
Helin_ADSPLAB_task1a_4 Helin4 Wang2020_t1 24 72.3 (71.5 - 73.1) 0.899 84.2 0.601
Hu_GT_task1a_1 Hu_GT_1a_1 Hu2020 6 75.7 (74.9 - 76.4) 0.924
Hu_GT_task1a_2 Hu_GT_1a_2 Hu2020 4 75.9 (75.1 - 76.7) 0.895 81.9 0.936
Hu_GT_task1a_3 Hu_GT_1a_3 Hu2020 3 76.2 (75.4 - 77.0) 0.898
Hu_GT_task1a_4 Hu_GT_1a_4 Hu2020 5 75.8 (75.0 - 76.5) 0.900
JHKim_IVS_task1a_1 EF5+SFA Kim2020_t1 55 67.3 (66.5 - 68.2) 5.219 70.1 0.013
JHKim_IVS_task1a_2 EF2+SFA Kim2020_t1 60 66.2 (65.3 - 67.0) 4.766 68.6 0.019
Jie_Maxvision_task1a_1 maxvision Jie2020 10 75.0 (74.3 - 75.8) 1.209 72.1 1.370
Kim_SGU_task1a_1 5ch_m_2 Changmin2020 33 71.6 (70.8 - 72.4) 1.309 72.7 1.307
Kim_SGU_task1a_2 7ch_m_2 Changmin2020 38 70.7 (69.9 - 71.6) 1.304 71.0 1.301
Kim_SGU_task1a_3 7ch_m_4 Changmin2020 39 70.7 (69.8 - 71.5) 1.412 72.2 1.408
Kim_SGU_task1a_4 9ch_2 Changmin2020 57 66.4 (65.6 - 67.3) 1.428 71.7 1.292
Koutini_CPJKU_task1a_1 fdamp Koutini2020 29 71.9 (71.1 - 72.7) 0.800 71.0 0.820
Koutini_CPJKU_task1a_2 FDswd Koutini2020 32 71.6 (70.8 - 72.4) 0.862 72.5 0.820
Koutini_CPJKU_task1a_3 ensemble Koutini2020 13 73.6 (72.8 - 74.4) 0.796 73.3 0.820
Koutini_CPJKU_task1a_4 DAensem Koutini2020 15 73.4 (72.6 - 74.2) 0.814 73.0 0.820
Lee_CAU_task1a_1 CAUET Lee2020 47 69.2 (68.3 - 70.0) 0.885 67.1 0.939
Lee_CAU_task1a_2 CAUET Lee2020 41 69.6 (68.8 - 70.5) 0.859 67.1 0.939
Lee_CAU_task1a_3 CAUET Lee2020 27 72.0 (71.2 - 72.8) 0.944 67.1 0.939
Lee_CAU_task1a_4 CAUET Lee2020 20 72.9 (72.1 - 73.7) 0.919 67.1 0.939
Lee_GU_task1a_1 PRML Aryal2020 81 55.9 (55.0 - 56.8) 1.969 59.6
Lee_GU_task1a_2 PRML Aryal2020 85 55.6 (54.7 - 56.5) 1.818 59.6
Lee_GU_task1a_3 PRML Aryal2020 84 55.6 (54.7 - 56.5) 2.987 59.6
Lee_GU_task1a_4 PRML Aryal2020 86 54.9 (54.1 - 55.8) 2.847 59.6
Liu_SHNU_task1a_1 ResNet Liu2020 45 69.3 (68.5 - 70.1) 1.396 70.2 0.939
Liu_SHNU_task1a_2 CNN-9 Liu2020 50 68.0 (67.2 - 68.9) 4.510 68.8 1.792
Liu_SHNU_task1a_3 E-T Liu2020 83 55.7 (54.8 - 56.6) 9.403 58.1
Liu_SHNU_task1a_4 Fusion Liu2020 26 72.0 (71.2 - 72.8) 3.165 73.1
Liu_UESTC_task1a_1 Averag_8 Liu2020a 16 73.2 (72.4 - 74.0) 1.305 68.4 1.362
Liu_UESTC_task1a_2 Averag_18 Liu2020a 23 72.4 (71.6 - 73.2) 1.303 69.0 1.367
Liu_UESTC_task1a_3 Rforest_8 Liu2020a 21 72.5 (71.7 - 73.3) 0.755 68.6 0.841
Liu_UESTC_task1a_4 Rforest_18 Liu2020a 28 72.0 (71.2 - 72.8) 0.767 68.4 0.839
Lopez-Meyer_IL_task1a_1 CNNensem Lopez-Meyer2020_t1a 68 64.3 (63.4 - 65.1) 5.268 68.8
Lopez-Meyer_IL_task1a_2 CNNensem Lopez-Meyer2020_t1a 70 64.1 (63.3 - 65.0) 11.870 69.3
Lu_INTC_task1a_1 city_cv Hong2020 36 71.2 (70.4 - 72.0) 0.809
Lu_INTC_task1a_2 resnext Hong2020 69 64.1 (63.3 - 65.0) 1.383
Lu_INTC_task1a_3 2resnext Hong2020 58 66.4 (65.5 - 67.2) 1.192
Lu_INTC_task1a_4 all Hong2020 35 71.2 (70.4 - 72.1) 0.806
Monteiro_INRS_task1a_1 preResnet Joao2020 74 61.7 (60.8 - 62.6) 5.936
Monteiro_INRS_task1a_2 TDNN Joao2020 82 55.9 (55.0 - 56.8) 5.198
Monteiro_INRS_task1a_3 ModResNet Joao2020 88 50.8 (49.9 - 51.7) 2.766
Monteiro_INRS_task1a_4 FuseCNN Joao2020 59 66.3 (65.5 - 67.2) 2.226
Naranjo-Alcazar_Vfy_task1a_1 ASCCSSE Naranjo-Alcazar2020_t1 73 61.9 (61.0 - 62.7) 1.246 65.1 1.120
Naranjo-Alcazar_Vfy_task1a_2 ASCCSSE Naranjo-Alcazar2020_t1 77 59.7 (58.8 - 60.6) 1.314 65.1 1.120
Paniagua_UPM_task1a_1 Pan_UPM Paniagua2020 92 43.8 (42.9 - 44.7) 2.053 57.1
Shim_UOS_task1a_1 UOS_totens Shim2020 31 71.7 (70.9 - 72.5) 1.190 71.9
Shim_UOS_task1a_2 UOS_rbfens Shim2020 34 71.5 (70.7 - 72.4) 0.897 71.0
Shim_UOS_task1a_3 UOS_lcnn Shim2020 48 68.5 (67.6 - 69.3) 0.911 70.5
Shim_UOS_task1a_4 UOS_trgasc Shim2020 37 71.0 (70.2 - 71.8) 0.945 68.8
Suh_ETRI_task1a_1 TRN_Dev Suh2020 22 72.5 (71.7 - 73.3) 1.290 73.7 1.285
Suh_ETRI_task1a_2 TRN_Eval Suh2020 7 75.5 (74.7 - 76.2) 1.221 73.7 1.285
Suh_ETRI_task1a_3 TRN_Ensem Suh2020 1 76.5 (75.8 - 77.3) 1.219 74.2 1.289
Suh_ETRI_task1a_4 TRN_wEnsem Suh2020 2 76.5 (75.7 - 77.2) 1.219 74.4 1.288
Swiecicki_NON_task1a_1 b3_train Swiecicki2020 56 67.1 (66.2 - 67.9) 0.926 69.3 0.846
Swiecicki_NON_task1a_2 b3_all Swiecicki2020 42 69.5 (68.7 - 70.3) 0.851 69.3 0.846
Swiecicki_NON_task1a_3 b3_all_lr Swiecicki2020 40 70.3 (69.4 - 71.1) 0.970 68.9 0.973
Swiecicki_NON_task1a_4 b3_all_mix Swiecicki2020 30 71.8 (71.0 - 72.7) 0.793 71.9 0.790
Vilouras_AUTh_task1a_1 VilEnsemb1 Vilouras2020 53 67.7 (66.8 - 68.5) 0.929 68.1 0.908
Vilouras_AUTh_task1a_2 VilEnsemb2 Vilouras2020 52 67.8 (67.0 - 68.7) 0.931 69.2 0.890
Vilouras_AUTh_task1a_3 VilEnsemb3 Vilouras2020 44 69.3 (68.5 - 70.1) 0.883 70.3 0.872
Waldekar_IITKGP_task1a_1 MFDWC20 Waldekar2020 79 58.4 (57.5 - 59.2) 1.427 55.0
Wang_RoyalFlush_task1a_1 RoyalFlush Wang2020a 80 56.7 (55.8 - 57.6) 1.576 63.9 1.826
Wang_RoyalFlush_task1a_2 RoyalFlush Wang2020a 65 65.2 (64.3 - 66.0) 1.294 62.9 1.586
Wang_RoyalFlush_task1a_3 RoyalFlush Wang2020a 71 64.0 (63.1 - 64.8) 1.239 62.1 1.334
Wang_RoyalFlush_task1a_4 RoyalFlush Wang2020a 91 45.5 (44.6 - 46.4) 5.880 62.7 1.712
Wu_CUHK_task1a_1 CNN_RCE Wu2020_t1a 67 64.7 (63.9 - 65.6) 1.148 65.2
Wu_CUHK_task1a_2 ensemble_4 Wu2020_t1a 46 69.3 (68.4 - 70.1) 1.070 67.3
Wu_CUHK_task1a_3 ensemble_5 Wu2020_t1a 51 67.9 (67.1 - 68.8) 1.100 67.6
Wu_CUHK_task1a_4 ensemble_9 Wu2020_t1a 43 69.4 (68.6 - 70.2) 1.080 68.3
Zhang_THUEE_task1a_1 THUEE Shao2020 19 73.0 (72.2 - 73.8) 1.963 75.0 0.791
Zhang_THUEE_task1a_2 THUEE Shao2020 17 73.2 (72.4 - 74.0) 1.967 75.0 0.789
Zhang_THUEE_task1a_3 THUEE Shao2020 25 72.3 (71.5 - 73.1) 1.958 74.3 0.824
Zhang_UESTC_task1a_1 N1 Zhang2020 89 50.4 (49.5 - 51.3) 1.899 57.4 1.275
Zhang_UESTC_task1a_2 N2 Zhang2020 87 51.7 (50.8 - 52.6) 1.805 56.1 1.297
Zhang_UESTC_task1a_3 N3 Zhang2020 90 47.4 (46.5 - 48.3) 2.068 53.7 1.344

Teams ranking

This table includes only the best-performing system from each submitting team.
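One way to derive such a team table from the system ranking is to group submissions by team and keep the best-ranked one. The sketch below assumes the team key is the submission label with its trailing `_task1a_<n>` suffix removed; the helper names `team_of` and `best_per_team` are hypothetical, and only a handful of rows from the systems table are included.

```python
import re

# (submission label, official system rank) pairs copied from the systems table.
systems = [
    ("Suh_ETRI_task1a_3", 1),
    ("Suh_ETRI_task1a_4", 2),
    ("Hu_GT_task1a_3", 3),
    ("Hu_GT_task1a_2", 4),
    ("Gao_UNISA_task1a_4", 8),
    ("Gao_UNISA_task1a_1", 9),
    ("Jie_Maxvision_task1a_1", 10),
]


def team_of(label: str) -> str:
    """Drop the trailing '_task1a_<n>' suffix (assumed naming convention)."""
    return re.sub(r"_task1a_\d+$", "", label)


def best_per_team(rows):
    """Keep each team's best-ranked system, then number the teams from 1."""
    best = {}
    for label, rank in rows:
        team = team_of(label)
        if team not in best or rank < best[team][1]:
            best[team] = (label, rank)
    ordered = sorted(best.values(), key=lambda item: item[1])
    return [(i + 1, label, rank) for i, (label, rank) in enumerate(ordered)]


team_ranking = best_per_team(systems)
for team_rank, label, system_rank in team_ranking:
    print(team_rank, label, system_rank)
```

For this subset the resulting team ranks (1 to 4) agree with the team rank column of the table that follows.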

Columns: Submission label; Name; Technical Report; Official system rank; Team rank; Evaluation dataset accuracy with 95% confidence interval; Evaluation dataset logloss; Development dataset accuracy; Development dataset logloss
Abbasi_ARI_task1a_2 1a_CNN Abbasi2020 76 24 60.6 (59.7 - 61.5) 1.063 62.1
Cao_JNU_task1a_3 CaoJNU3 Fei2020 61 20 66.0 (65.1 - 66.8) 1.268 68.7 1.202
FanVaf__task1a_4 CRNN_ens Fanioudakis2020 54 17 67.5 (66.6 - 68.3) 1.240
Gao_UNISA_task1a_4 ensemble Gao2020 8 3 75.2 (74.4 - 76.0) 1.230 72.5
DCASE2020 baseline Baseline 51.4 (50.5 - 52.3) 1.902 51.6 1.405
Helin_ADSPLAB_task1a_1 Helin1 Wang2020_t1 14 6 73.4 (72.6 - 74.2) 0.850 84.2 0.569
Hu_GT_task1a_3 Hu_GT_1a_3 Hu2020 3 2 76.2 (75.4 - 77.0) 0.898
JHKim_IVS_task1a_1 EF5+SFA Kim2020_t1 55 18 67.3 (66.5 - 68.2) 5.219 70.1 0.013
Jie_Maxvision_task1a_1 maxvision Jie2020 10 4 75.0 (74.3 - 75.8) 1.209 72.1 1.370
Kim_SGU_task1a_1 5ch_m_2 Changmin2020 33 13 71.6 (70.8 - 72.4) 1.309 72.7 1.307
Koutini_CPJKU_task1a_3 ensemble Koutini2020 13 5 73.6 (72.8 - 74.4) 0.796 73.3 0.820
Lee_CAU_task1a_4 CAUET Lee2020 20 9 72.9 (72.1 - 73.7) 0.919 67.1 0.939
Lee_GU_task1a_1 PRML Aryal2020 81 26 55.9 (55.0 - 56.8) 1.969 59.6
Liu_SHNU_task1a_4 Fusion Liu2020 26 10 72.0 (71.2 - 72.8) 3.165 73.1
Liu_UESTC_task1a_1 Averag_8 Liu2020a 16 7 73.2 (72.4 - 74.0) 1.305 68.4 1.362
Lopez-Meyer_IL_task1a_1 CNNensem Lopez-Meyer2020_t1a 68 22 64.3 (63.4 - 65.1) 5.268 68.8
Lu_INTC_task1a_4 all Hong2020 35 14 71.2 (70.4 - 72.1) 0.806
Monteiro_INRS_task1a_4 FuseCNN Joao2020 59 19 66.3 (65.5 - 67.2) 2.226
Naranjo-Alcazar_Vfy_task1a_1 ASCCSSE Naranjo-Alcazar2020_t1 73 23 61.9 (61.0 - 62.7) 1.246 65.1 1.120
Paniagua_UPM_task1a_1 Pan_UPM Paniagua2020 92 28 43.8 (42.9 - 44.7) 2.053 57.1
Shim_UOS_task1a_1 UOS_totens Shim2020 31 12 71.7 (70.9 - 72.5) 1.190 71.9
Suh_ETRI_task1a_3 TRN_Ensem Suh2020 1 1 76.5 (75.8 - 77.3) 1.219 74.2 1.289
Swiecicki_NON_task1a_4 b3_all_mix Swiecicki2020 30 11 71.8 (71.0 - 72.7) 0.793 71.9 0.790
Vilouras_AUTh_task1a_3 VilEnsemb3 Vilouras2020 44 16 69.3 (68.5 - 70.1) 0.883 70.3 0.872
Waldekar_IITKGP_task1a_1 MFDWC20 Waldekar2020 79 25 58.4 (57.5 - 59.2) 1.427 55.0
Wang_RoyalFlush_task1a_2 RoyalFlush Wang2020a 65 21 65.2 (64.3 - 66.0) 1.294 62.9 1.586
Wu_CUHK_task1a_4 ensemble_9 Wu2020_t1a 43 15 69.4 (68.6 - 70.2) 1.080 68.3
Zhang_THUEE_task1a_2 THUEE Shao2020 17 8 73.2 (72.4 - 74.0) 1.967 75.0 0.789
Zhang_UESTC_task1a_2 N2 Zhang2020 87 27 51.7 (50.8 - 52.6) 1.805 56.1 1.297

Generalization performance

All results are reported on the evaluation dataset.
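The seen and unseen device accuracies appear to be plain averages of the per-device accuracies listed in the device-wise table. This is an observation from the published numbers, not a statement of the official evaluation procedure. The sketch below checks it for the DCASE2020 baseline; the function name `group_accuracy` is hypothetical.

```python
from statistics import mean

# Per-device accuracies of the DCASE2020 baseline, from the device-wise table.
baseline = {
    "D": 22.8, "S7": 49.8, "S8": 41.1, "S9": 31.0, "S10": 41.3,
    "A": 72.8, "B": 61.7, "C": 68.9, "S1": 62.7, "S2": 54.6, "S3": 58.2,
}

UNSEEN_DEVICES = ("D", "S7", "S8", "S9", "S10")
SEEN_DEVICES = ("A", "B", "C", "S1", "S2", "S3")


def group_accuracy(per_device, devices):
    """Unweighted mean accuracy over a group of devices."""
    return mean(per_device[d] for d in devices)


unseen = group_accuracy(baseline, UNSEEN_DEVICES)
seen = group_accuracy(baseline, SEEN_DEVICES)
print(f"unseen: {unseen:.2f}, seen: {seen:.2f}")
```

The two averages agree, to within rounding of the tabulated inputs, with the 37.2 (unseen) and 63.1 (seen) reported for the baseline, a gap of roughly 26 percentage points between device groups.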

Columns: Submission label; Technical Report; Official system rank; Accuracy (Evaluation dataset); Devices: Accuracy / unseen; Devices: Accuracy / seen; Cities: Accuracy / unseen; Cities: Accuracy / seen
Abbasi_ARI_task1a_1 Abbasi2020 78 59.7 56.1 62.7 60.2 59.5
Abbasi_ARI_task1a_2 Abbasi2020 76 60.6 58.9 62.0 58.9 60.8
Cao_JNU_task1a_1 Fei2020 63 65.7 63.0 68.0 65.4 66.0
Cao_JNU_task1a_2 Fei2020 64 65.7 62.9 68.0 65.6 65.9
Cao_JNU_task1a_3 Fei2020 61 66.0 62.9 68.5 64.6 66.4
Cao_JNU_task1a_4 Fei2020 62 65.9 63.0 68.4 65.2 66.3
FanVaf__task1a_1 Fanioudakis2020 72 63.4 57.3 68.5 59.1 64.3
FanVaf__task1a_2 Fanioudakis2020 75 60.7 53.4 66.9 60.4 61.4
FanVaf__task1a_3 Fanioudakis2020 66 64.8 58.1 70.3 61.9 65.8
FanVaf__task1a_4 Fanioudakis2020 54 67.5 60.8 73.1 64.6 68.3
Gao_UNISA_task1a_1 Gao2020 9 75.0 73.3 76.5 73.7 75.7
Gao_UNISA_task1a_2 Gao2020 12 74.1 71.9 75.9 73.0 74.9
Gao_UNISA_task1a_3 Gao2020 11 74.7 72.9 76.1 73.3 75.5
Gao_UNISA_task1a_4 Gao2020 8 75.2 73.1 77.0 73.9 75.9
DCASE2020 baseline 51.4 37.2 63.1 51.8 51.5
Helin_ADSPLAB_task1a_1 Wang2020_t1 14 73.4 70.1 76.2 71.2 74.1
Helin_ADSPLAB_task1a_2 Wang2020_t1 49 68.4 63.8 72.3 66.5 69.5
Helin_ADSPLAB_task1a_3 Wang2020_t1 18 73.1 70.2 75.5 70.8 74.0
Helin_ADSPLAB_task1a_4 Wang2020_t1 24 72.3 68.8 75.2 70.1 73.2
Hu_GT_task1a_1 Hu2020 6 75.7 74.3 76.8 73.0 76.3
Hu_GT_task1a_2 Hu2020 4 75.9 74.4 77.2 73.8 76.4
Hu_GT_task1a_3 Hu2020 3 76.2 74.7 77.5 74.1 76.9
Hu_GT_task1a_4 Hu2020 5 75.8 74.3 77.0 74.0 76.3
JHKim_IVS_task1a_1 Kim2020_t1 55 67.3 64.5 69.7 67.7 67.2
JHKim_IVS_task1a_2 Kim2020_t1 60 66.2 64.3 67.7 65.4 66.5
Jie_Maxvision_task1a_1 Jie2020 10 75.0 73.2 76.5 73.2 76.0
Kim_SGU_task1a_1 Changmin2020 33 71.6 69.2 73.5 69.5 72.5
Kim_SGU_task1a_2 Changmin2020 38 70.7 68.4 72.7 69.4 71.7
Kim_SGU_task1a_3 Changmin2020 39 70.7 68.3 72.6 70.1 71.4
Kim_SGU_task1a_4 Changmin2020 57 66.4 63.5 68.9 62.7 67.1
Koutini_CPJKU_task1a_1 Koutini2020 29 71.9 68.4 74.9 73.1 72.2
Koutini_CPJKU_task1a_2 Koutini2020 32 71.6 66.9 75.5 71.4 72.2
Koutini_CPJKU_task1a_3 Koutini2020 13 73.6 69.8 76.8 72.6 74.1
Koutini_CPJKU_task1a_4 Koutini2020 15 73.4 69.4 76.7 72.4 73.9
Lee_CAU_task1a_1 Lee2020 47 69.2 66.2 71.6 67.5 69.8
Lee_CAU_task1a_2 Lee2020 41 69.6 66.5 72.3 68.4 70.2
Lee_CAU_task1a_3 Lee2020 27 72.0 69.3 74.3 70.7 72.4
Lee_CAU_task1a_4 Lee2020 20 72.9 69.8 75.5 71.7 73.3
Lee_GU_task1a_1 Aryal2020 81 55.9 46.4 63.8 55.1 56.2
Lee_GU_task1a_2 Aryal2020 85 55.6 45.7 63.8 54.9 55.7
Lee_GU_task1a_3 Aryal2020 84 55.6 45.5 64.0 53.5 56.3
Lee_GU_task1a_4 Aryal2020 86 54.9 44.7 63.5 53.8 55.4
Liu_SHNU_task1a_1 Liu2020 45 69.3 65.1 72.7 67.6 70.0
Liu_SHNU_task1a_2 Liu2020 50 68.0 64.9 70.6 67.1 68.7
Liu_SHNU_task1a_3 Liu2020 83 55.7 46.8 63.1 49.4 57.0
Liu_SHNU_task1a_4 Liu2020 26 72.0 67.5 75.8 69.6 72.7
Liu_UESTC_task1a_1 Liu2020a 16 73.2 71.9 74.3 73.3 73.4
Liu_UESTC_task1a_2 Liu2020a 23 72.4 71.1 73.5 73.1 72.5
Liu_UESTC_task1a_3 Liu2020a 21 72.5 71.3 73.5 71.7 72.9
Liu_UESTC_task1a_4 Liu2020a 28 72.0 70.3 73.4 71.7 72.3
Lopez-Meyer_IL_task1a_1 Lopez-Meyer2020_t1a 68 64.3 60.9 67.1 62.9 64.4
Lopez-Meyer_IL_task1a_2 Lopez-Meyer2020_t1a 70 64.1 61.1 66.7 62.2 64.2
Lu_INTC_task1a_1 Hong2020 36 71.2 68.8 73.2 69.1 72.0
Lu_INTC_task1a_2 Hong2020 69 64.1 60.8 66.9 62.4 64.5
Lu_INTC_task1a_3 Hong2020 58 66.4 63.3 68.9 64.5 66.5
Lu_INTC_task1a_4 Hong2020 35 71.2 68.8 73.3 68.6 71.9
Monteiro_INRS_task1a_1 Joao2020 74 61.7 59.4 63.6 59.4 62.0
Monteiro_INRS_task1a_2 Joao2020 82 55.9 51.8 59.3 52.5 56.4
Monteiro_INRS_task1a_3 Joao2020 88 50.8 44.5 56.1 47.4 51.5
Monteiro_INRS_task1a_4 Joao2020 59 66.3 63.2 69.0 63.7 66.8
Naranjo-Alcazar_Vfy_task1a_1 Naranjo-Alcazar2020_t1 73 61.9 55.9 66.9 59.6 62.8
Naranjo-Alcazar_Vfy_task1a_2 Naranjo-Alcazar2020_t1 77 59.7 54.0 64.5 54.7 60.8
Paniagua_UPM_task1a_1 Paniagua2020 92 43.8 36.0 50.3 45.7 43.5
Shim_UOS_task1a_1 Shim2020 31 71.7 69.0 74.0 71.4 71.9
Shim_UOS_task1a_2 Shim2020 34 71.5 68.4 74.2 70.5 71.7
Shim_UOS_task1a_3 Shim2020 48 68.5 64.9 71.4 67.4 68.7
Shim_UOS_task1a_4 Shim2020 37 71.0 68.2 73.3 68.9 71.6
Suh_ETRI_task1a_1 Suh2020 22 72.5 69.9 74.6 70.4 73.1
Suh_ETRI_task1a_2 Suh2020 7 75.5 73.6 77.0 75.0 76.0
Suh_ETRI_task1a_3 Suh2020 1 76.5 74.6 78.1 75.8 77.3
Suh_ETRI_task1a_4 Suh2020 2 76.5 74.7 77.9 75.8 77.2
Swiecicki_NON_task1a_1 Swiecicki2020 56 67.1 64.0 69.6 65.7 66.9
Swiecicki_NON_task1a_2 Swiecicki2020 42 69.5 66.5 72.0 68.9 69.7
Swiecicki_NON_task1a_3 Swiecicki2020 40 70.3 68.2 72.0 66.5 71.1
Swiecicki_NON_task1a_4 Swiecicki2020 30 71.8 69.0 74.2 69.4 72.4
Vilouras_AUTh_task1a_1 Vilouras2020 53 67.7 63.5 71.2 65.8 68.1
Vilouras_AUTh_task1a_2 Vilouras2020 52 67.8 63.0 71.8 65.6 68.4
Vilouras_AUTh_task1a_3 Vilouras2020 44 69.3 65.3 72.6 66.9 70.1
Waldekar_IITKGP_task1a_1 Waldekar2020 79 58.4 52.9 62.9 52.8 59.6
Wang_RoyalFlush_task1a_1 Wang2020a 80 56.7 54.8 58.2 54.9 57.4
Wang_RoyalFlush_task1a_2 Wang2020a 65 65.2 63.0 67.0 64.1 65.5
Wang_RoyalFlush_task1a_3 Wang2020a 71 64.0 60.0 67.3 63.7 64.4
Wang_RoyalFlush_task1a_4 Wang2020a 91 45.5 42.9 47.7 45.2 45.7
Wu_CUHK_task1a_1 Wu2020_t1a 67 64.7 60.0 68.7 63.7 65.2
Wu_CUHK_task1a_2 Wu2020_t1a 46 69.3 63.0 74.5 65.1 70.4
Wu_CUHK_task1a_3 Wu2020_t1a 51 67.9 62.7 72.3 66.3 68.3
Wu_CUHK_task1a_4 Wu2020_t1a 43 69.4 63.6 74.3 66.1 70.3
Zhang_THUEE_task1a_1 Shao2020 19 73.0 69.9 75.6 71.8 73.8
Zhang_THUEE_task1a_2 Shao2020 17 73.2 70.0 75.8 71.6 74.1
Zhang_THUEE_task1a_3 Shao2020 25 72.3 68.8 75.2 70.2 73.2
Zhang_UESTC_task1a_1 Zhang2020 89 50.4 35.8 62.5 50.3 50.7
Zhang_UESTC_task1a_2 Zhang2020 87 51.7 37.5 63.5 50.8 52.5
Zhang_UESTC_task1a_3 Zhang2020 90 47.4 32.2 60.1 46.7 47.8

Class-wise performance
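The overall accuracy in the class-wise table is consistent with an unweighted (macro) average of the ten class-wise accuracies. The sketch below illustrates this for the top-ranked submission, Suh_ETRI_task1a_3, and also picks out its easiest and hardest scene. The variable names are illustrative only.

```python
from statistics import mean

# Class-wise accuracies of Suh_ETRI_task1a_3, copied from the class-wise table.
class_accuracy = {
    "Airport": 60.8,
    "Bus": 88.8,
    "Metro": 82.9,
    "Metro station": 76.6,
    "Park": 93.7,
    "Public square": 52.9,
    "Shopping mall": 81.4,
    "Street pedestrian": 50.9,
    "Street traffic": 92.3,
    "Tram": 84.8,
}

# Macro-average: every scene class contributes equally.
macro_accuracy = mean(class_accuracy.values())

easiest = max(class_accuracy, key=class_accuracy.get)
hardest = min(class_accuracy, key=class_accuracy.get)

print(f"macro-average accuracy: {macro_accuracy:.2f}")
print(f"easiest scene: {easiest}, hardest scene: {hardest}")
```

The macro-average agrees with the 76.5 listed for this submission to within rounding, and the spread between the easiest and hardest scene exceeds 40 percentage points.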

Columns: Submission label; Technical Report; Official system rank; Accuracy; Airport; Bus; Metro; Metro station; Park; Public square; Shopping mall; Street pedestrian; Street traffic; Tram
Abbasi_ARI_task1a_1 Abbasi2020 78 59.7 38.0 68.9 56.1 57.3 73.7 44.2 69.4 36.7 80.1 72.6
Abbasi_ARI_task1a_2 Abbasi2020 76 60.6 39.1 63.4 53.5 59.2 70.9 40.7 70.6 43.4 84.1 81.0
Cao_JNU_task1a_1 Fei2020 63 65.7 56.1 74.6 72.3 70.4 85.9 47.4 70.3 30.1 79.8 70.6
Cao_JNU_task1a_2 Fei2020 64 65.7 52.0 72.1 71.8 70.6 84.1 47.5 70.7 32.9 82.2 72.7
Cao_JNU_task1a_3 Fei2020 61 66.0 54.2 74.4 73.2 69.8 85.0 46.2 70.8 32.0 82.2 71.8
Cao_JNU_task1a_4 Fei2020 62 65.9 51.8 71.9 72.1 69.9 84.1 47.0 71.3 33.8 83.7 74.0
FanVaf__task1a_1 Fanioudakis2020 72 63.4 52.0 89.8 68.7 50.1 78.1 36.6 62.7 37.0 78.8 80.0
FanVaf__task1a_2 Fanioudakis2020 75 60.7 50.4 77.6 68.1 49.0 76.7 39.0 55.8 37.7 72.6 80.4
FanVaf__task1a_3 Fanioudakis2020 66 64.8 53.7 89.1 71.9 52.4 78.7 39.4 62.2 38.9 78.0 83.2
FanVaf__task1a_4 Fanioudakis2020 54 67.5 55.8 90.7 79.0 57.2 79.5 44.1 63.0 43.8 75.9 85.9
Gao_UNISA_task1a_1 Gao2020 9 75.0 58.2 87.8 76.2 75.6 92.7 57.5 75.3 52.3 90.7 84.3
Gao_UNISA_task1a_2 Gao2020 12 74.1 59.3 88.2 79.9 72.7 92.3 51.4 72.4 52.7 90.3 81.9
Gao_UNISA_task1a_3 Gao2020 11 74.7 57.3 86.7 78.4 74.5 93.0 57.1 73.3 52.3 90.3 83.8
Gao_UNISA_task1a_4 Gao2020 8 75.2 58.5 88.0 79.4 74.7 92.8 56.4 74.2 53.4 90.5 84.3
DCASE2020 baseline 51.4 26.3 82.3 45.4 53.8 67.3 34.7 40.3 30.4 69.4 63.7
Helin_ADSPLAB_task1a_1 Wang2020_t1 14 73.4 60.2 82.7 81.4 72.8 93.4 52.9 74.6 44.4 85.0 86.7
Helin_ADSPLAB_task1a_2 Wang2020_t1 49 68.4 51.8 75.2 76.8 68.6 89.0 48.5 61.4 44.1 83.0 86.2
Helin_ADSPLAB_task1a_3 Wang2020_t1 18 73.1 56.9 82.4 81.7 73.7 92.8 53.8 72.6 44.9 85.7 86.7
Helin_ADSPLAB_task1a_4 Wang2020_t1 24 72.3 57.2 82.0 80.6 72.1 92.5 52.2 71.0 44.3 84.6 86.6
Hu_GT_task1a_1 Hu2020 6 75.7 62.3 89.8 82.1 72.4 92.8 56.2 83.4 40.4 91.5 85.7
Hu_GT_task1a_2 Hu2020 4 75.9 60.7 91.8 83.2 75.5 93.8 52.4 81.1 39.4 92.3 88.6
Hu_GT_task1a_3 Hu2020 3 76.2 61.7 92.1 84.0 74.5 93.9 53.8 81.5 39.4 92.6 88.6
Hu_GT_task1a_4 Hu2020 5 75.8 59.5 91.6 83.4 75.0 93.9 52.4 81.4 39.1 92.6 88.7
JHKim_IVS_task1a_1 Kim2020_t1 55 67.3 68.9 83.1 79.2 65.2 84.6 32.7 56.6 40.1 85.4 77.7
JHKim_IVS_task1a_2 Kim2020_t1 60 66.2 57.6 80.9 67.8 63.6 86.1 41.9 63.9 46.2 82.3 71.2
Jie_Maxvision_task1a_1 Jie2020 10 75.0 62.4 88.6 75.2 70.4 93.8 58.4 76.8 48.3 90.4 86.1
Kim_SGU_task1a_1 Changmin2020 33 71.6 51.2 78.5 82.0 71.9 93.0 46.4 76.3 48.7 90.4 77.4
Kim_SGU_task1a_2 Changmin2020 38 70.7 47.7 79.6 79.2 72.2 92.2 50.2 72.9 47.1 89.7 76.5
Kim_SGU_task1a_3 Changmin2020 39 70.7 47.6 76.0 77.2 74.4 92.8 48.6 74.3 42.8 90.9 81.9
Kim_SGU_task1a_4 Changmin2020 57 66.4 31.8 75.8 80.0 66.9 95.9 36.4 77.4 42.5 88.5 69.1
Koutini_CPJKU_task1a_1 Koutini2020 29 71.9 60.2 90.6 75.8 72.5 89.5 51.9 64.9 44.9 84.0 85.1
Koutini_CPJKU_task1a_2 Koutini2020 32 71.6 60.4 90.2 74.9 70.9 86.1 52.4 66.7 43.9 84.3 86.0
Koutini_CPJKU_task1a_3 Koutini2020 13 73.6 59.3 92.5 77.0 75.9 89.8 53.9 66.9 47.5 84.7 88.6
Koutini_CPJKU_task1a_4 Koutini2020 15 73.4 59.9 92.3 75.8 75.5 89.8 53.0 67.5 46.1 84.8 89.1
Lee_CAU_task1a_1 Lee2020 47 69.2 53.5 82.2 69.5 69.4 83.9 50.1 66.8 47.9 84.4 84.1
Lee_CAU_task1a_2 Lee2020 41 69.6 54.0 82.7 68.5 70.4 85.7 48.9 66.5 50.1 84.6 85.2
Lee_CAU_task1a_3 Lee2020 27 72.0 62.8 87.5 69.4 71.2 87.5 53.0 65.7 50.2 86.4 86.2
Lee_CAU_task1a_4 Lee2020 20 72.9 65.4 87.3 72.8 72.1 87.2 56.1 65.5 49.7 85.9 87.1
Lee_GU_task1a_1 Aryal2020 81 55.9 37.7 83.2 50.4 49.3 82.9 36.4 56.8 37.0 62.5 62.7
Lee_GU_task1a_2 Aryal2020 85 55.6 29.7 80.1 43.9 50.1 81.1 34.5 54.7 49.8 62.8 68.9
Lee_GU_task1a_3 Aryal2020 84 55.6 40.5 81.4 51.6 48.0 77.1 38.3 58.6 32.9 68.0 59.6
Lee_GU_task1a_4 Aryal2020 86 54.9 30.0 77.9 52.9 51.1 76.3 34.3 59.4 36.4 68.8 62.4
Liu_SHNU_task1a_1 Liu2020 45 69.3 55.6 85.4 74.4 64.3 88.5 47.3 67.0 45.2 83.7 81.4
Liu_SHNU_task1a_2 Liu2020 50 68.0 55.6 84.3 74.5 64.5 90.1 46.9 63.5 36.9 82.0 82.0
Liu_SHNU_task1a_3 Liu2020 83 55.7 35.5 65.4 54.2 50.8 80.6 35.6 52.4 38.9 71.5 71.9
Liu_SHNU_task1a_4 Liu2020 26 72.0 57.6 90.3 77.2 69.7 91.8 51.4 68.7 42.3 85.1 86.1
Liu_UESTC_task1a_1 Liu2020a 16 73.2 55.1 79.1 80.2 71.5 86.3 58.3 85.9 46.0 90.6 79.1
Liu_UESTC_task1a_2 Liu2020a 23 72.4 55.3 78.1 79.7 69.8 85.6 57.5 84.8 45.9 90.4 76.9
Liu_UESTC_task1a_3 Liu2020a 21 72.5 55.1 80.4 78.4 74.2 86.2 60.8 77.4 50.3 83.3 78.8
Liu_UESTC_task1a_4 Liu2020a 28 72.0 55.1 78.6 77.9 73.0 87.3 58.8 78.3 49.7 82.6 78.6
Lopez-Meyer_IL_task1a_1 Lopez-Meyer2020_t1a 68 64.3 46.5 74.2 74.5 63.4 84.7 41.9 67.3 39.4 81.0 69.9
Lopez-Meyer_IL_task1a_2 Lopez-Meyer2020_t1a 70 64.1 48.8 75.3 75.5 61.1 85.4 42.9 65.6 38.9 79.8 68.0
Lu_INTC_task1a_1 Hong2020 36 71.2 51.1 79.6 77.9 73.1 89.7 43.6 75.4 49.6 87.3 84.7
Lu_INTC_task1a_2 Hong2020 69 64.1 50.6 77.2 69.6 65.4 85.1 31.6 65.9 41.0 82.8 72.2
Lu_INTC_task1a_3 Hong2020 58 66.4 49.6 79.1 71.1 69.4 84.8 33.7 72.0 44.9 85.2 73.8
Lu_INTC_task1a_4 Hong2020 35 71.2 51.5 80.1 77.9 73.8 90.0 42.1 75.5 50.5 87.3 83.7
Monteiro_INRS_task1a_1 Joao2020 74 61.7 46.1 71.5 55.7 56.2 83.2 32.1 72.3 47.3 85.6 67.0
Monteiro_INRS_task1a_2 Joao2020 82 55.9 42.1 60.4 53.4 47.9 77.2 36.6 54.5 37.3 83.4 66.2
Monteiro_INRS_task1a_3 Joao2020 88 50.8 43.1 49.5 46.0 48.1 72.9 33.0 52.5 24.5 80.4 58.3
Monteiro_INRS_task1a_4 Joao2020 59 66.3 49.2 73.1 64.8 61.5 84.8 41.5 77.3 47.6 88.1 75.4
Naranjo-Alcazar_Vfy_task1a_1 Naranjo-Alcazar2020_t1 73 61.9 47.8 75.5 60.1 60.3 83.9 35.4 60.9 46.2 70.6 77.9
Naranjo-Alcazar_Vfy_task1a_2 Naranjo-Alcazar2020_t1 77 59.7 51.1 49.6 68.8 59.3 82.5 39.1 64.8 29.5 75.4 77.2
Paniagua_UPM_task1a_1 Paniagua2020 92 43.8 32.5 80.0 65.8 46.5 62.4 28.6 45.9 31.6 44.4 0.0
Shim_UOS_task1a_1 Shim2020 31 71.7 53.7 82.2 72.1 67.4 89.5 49.9 74.7 49.4 89.6 88.3
Shim_UOS_task1a_2 Shim2020 34 71.5 53.9 82.4 71.9 71.0 89.3 51.9 73.1 49.3 87.1 85.5
Shim_UOS_task1a_3 Shim2020 48 68.5 49.1 77.9 69.2 69.2 85.8 53.0 66.9 49.2 86.3 78.2
Shim_UOS_task1a_4 Shim2020 37 71.0 56.2 82.8 74.2 65.0 89.9 46.6 73.2 49.2 87.8 85.4
Suh_ETRI_task1a_1 Suh2020 22 72.5 52.9 82.2 82.7 73.5 93.5 41.8 79.3 47.1 92.8 79.0
Suh_ETRI_task1a_2 Suh2020 7 75.5 59.2 88.0 83.4 76.0 93.4 49.8 78.5 51.3 92.1 82.7
Suh_ETRI_task1a_3 Suh2020 1 76.5 60.8 88.8 82.9 76.6 93.7 52.9 81.4 50.9 92.3 84.8
Suh_ETRI_task1a_4 Suh2020 2 76.5 60.7 88.6 83.2 76.5 93.7 52.4 81.2 51.2 92.3 84.9
Swiecicki_NON_task1a_1 Swiecicki2020 56 67.1 52.8 71.1 67.6 65.8 87.1 49.2 72.0 41.6 84.6 78.8
Swiecicki_NON_task1a_2 Swiecicki2020 42 69.5 54.7 77.9 72.1 67.6 85.5 52.9 73.7 45.3 85.7 79.9
Swiecicki_NON_task1a_3 Swiecicki2020 40 70.3 60.4 79.7 78.5 67.2 86.4 53.8 69.7 43.4 85.8 77.9
Swiecicki_NON_task1a_4 Swiecicki2020 30 71.8 59.4 81.6 78.2 69.9 88.6 55.6 72.7 44.2 86.8 81.5
Vilouras_AUTh_task1a_1 Vilouras2020 53 67.7 45.8 88.6 60.1 58.2 90.3 57.4 77.4 47.6 81.2 70.4
Vilouras_AUTh_task1a_2 Vilouras2020 52 67.8 50.5 87.0 59.8 67.3 85.9 39.6 79.0 49.8 84.3 75.1
Vilouras_AUTh_task1a_3 Vilouras2020 44 69.3 50.5 89.1 61.1 65.1 89.3 49.5 79.9 49.7 84.2 74.6
Waldekar_IITKGP_task1a_1 Waldekar2020 79 58.4 50.4 70.4 53.5 51.9 84.0 38.9 59.6 33.4 72.9 68.5
Wang_RoyalFlush_task1a_1 Wang2020a 80 56.7 39.7 82.4 69.9 41.1 82.9 37.2 53.5 27.9 75.3 56.6
Wang_RoyalFlush_task1a_2 Wang2020a 65 65.2 56.4 74.9 67.7 49.5 84.5 50.8 68.0 43.9 79.7 76.2
Wang_RoyalFlush_task1a_3 Wang2020a 71 64.0 60.4 63.1 65.7 49.5 83.9 46.1 59.8 49.2 77.1 84.7
Wang_RoyalFlush_task1a_4 Wang2020a 91 45.5 5.6 89.6 69.1 62.1 84.8 34.2 6.3 0.0 78.6 24.7
Wu_CUHK_task1a_1 Wu2020_t1a 67 64.7 51.4 84.1 56.0 55.7 86.0 43.9 59.7 48.0 83.9 78.5
Wu_CUHK_task1a_2 Wu2020_t1a 46 69.3 46.3 87.1 76.8 68.2 86.9 43.3 65.7 49.8 84.7 83.8
Wu_CUHK_task1a_3 Wu2020_t1a 51 67.9 46.8 85.1 66.5 63.9 87.4 45.0 65.7 50.1 86.0 82.9
Wu_CUHK_task1a_4 Wu2020_t1a 43 69.4 46.9 86.7 72.6 66.6 88.0 45.6 65.5 51.6 86.1 84.5
Zhang_THUEE_task1a_1 Shao2020 19 73.0 57.0 85.5 78.4 73.2 92.5 55.1 69.4 52.8 84.3 82.0
Zhang_THUEE_task1a_2 Shao2020 17 73.2 57.4 85.7 79.1 73.2 92.3 55.1 68.9 53.8 84.3 81.8
Zhang_THUEE_task1a_3 Shao2020 25 72.3 55.6 85.6 77.4 72.6 90.7 54.5 67.3 53.5 84.3 81.3
Zhang_UESTC_task1a_1 Zhang2020 89 50.4 30.1 66.6 48.5 51.9 72.3 28.7 43.4 28.5 62.0 71.6
Zhang_UESTC_task1a_2 Zhang2020 87 51.7 32.1 82.8 53.8 47.8 65.1 31.1 47.3 37.0 64.4 55.6
Zhang_UESTC_task1a_3 Zhang2020 90 47.4 33.2 84.0 39.8 37.0 66.8 29.8 44.2 28.6 52.9 57.8

Device-wise performance

Columns: Submission label; Technical Report; Official system rank; Accuracy; Accuracy / Unseen; Accuracy / Seen; Unseen devices: D, S7, S8, S9, S10; Seen devices: A, B, C, S1, S2, S3
Abbasi_ARI_task1a_1 Abbasi2020 78 59.7 56.1 62.7 42.4 61.1 61.9 62.8 52.3 69.3 59.8 64.0 60.5 61.3 61.3
Abbasi_ARI_task1a_2 Abbasi2020 76 60.6 58.9 62.0 52.4 63.2 59.2 63.9 55.6 65.5 61.0 60.8 59.0 64.6 61.2
Cao_JNU_task1a_1 Fei2020 63 65.7 63.0 68.0 56.8 70.0 66.0 64.3 58.0 74.4 65.7 70.5 65.5 65.6 66.6
Cao_JNU_task1a_2 Fei2020 64 65.7 62.9 68.0 56.8 69.9 66.0 63.7 57.9 73.9 66.6 70.3 66.1 65.8 65.5
Cao_JNU_task1a_3 Fei2020 61 66.0 62.9 68.5 56.7 69.8 66.3 63.3 58.5 74.4 66.5 71.1 66.3 65.2 67.4
Cao_JNU_task1a_4 Fei2020 62 65.9 63.0 68.4 57.3 70.0 65.0 63.3 59.2 73.9 66.1 70.7 66.9 65.4 67.5
FanVaf__task1a_1 Fanioudakis2020 72 63.4 57.3 68.5 40.9 66.6 62.6 60.6 55.7 75.1 64.7 70.2 67.7 64.9 68.1
FanVaf__task1a_2 Fanioudakis2020 75 60.7 53.4 66.9 40.7 61.9 59.8 54.6 49.7 74.7 63.2 68.3 66.4 64.1 64.5
FanVaf__task1a_3 Fanioudakis2020 66 64.8 58.1 70.3 41.1 67.5 64.0 61.9 56.2 75.4 65.4 72.6 70.8 67.4 70.2
FanVaf__task1a_4 Fanioudakis2020 54 67.5 60.8 73.1 40.5 69.7 69.4 63.3 61.0 78.1 69.5 73.9 71.7 72.7 72.7
Gao_UNISA_task1a_1 Gao2020 9 75.0 73.3 76.5 64.6 76.8 73.9 76.0 75.0 79.7 74.4 78.0 74.2 76.6 76.3
Gao_UNISA_task1a_2 Gao2020 12 74.1 71.9 75.9 61.0 76.6 72.9 74.6 74.5 80.1 73.5 78.1 73.3 75.8 74.8
Gao_UNISA_task1a_3 Gao2020 11 74.7 72.9 76.1 63.1 77.5 74.1 75.2 74.7 79.7 73.5 78.4 74.1 75.0 76.0
Gao_UNISA_task1a_4 Gao2020 8 75.2 73.1 77.0 62.5 77.6 74.4 75.6 75.4 80.3 74.9 78.9 74.7 76.5 76.7
DCASE2020 baseline 51.4 37.2 63.1 22.8 49.8 41.1 31.0 41.3 72.8 61.7 68.9 62.7 54.6 58.2
Helin_ADSPLAB_task1a_1 Wang2020_t1 14 73.4 70.1 76.2 65.0 76.3 74.5 66.9 67.6 81.7 74.6 79.8 72.9 73.8 74.4
Helin_ADSPLAB_task1a_2 Wang2020_t1 49 68.4 63.8 72.3 62.0 72.2 69.1 57.8 58.0 80.7 72.9 76.9 67.6 66.9 68.9
Helin_ADSPLAB_task1a_3 Wang2020_t1 18 73.1 70.2 75.5 66.6 77.3 73.9 66.6 66.8 82.4 74.2 78.8 73.1 71.7 73.1
Helin_ADSPLAB_task1a_4 Wang2020_t1 24 72.3 68.8 75.2 65.6 75.9 73.1 64.4 65.1 81.8 73.9 79.3 71.8 70.9 73.7
Hu_GT_task1a_1 Hu2020 6 75.7 74.3 76.8 68.0 76.0 76.4 76.6 74.4 80.0 74.7 77.7 75.0 76.6 76.9
Hu_GT_task1a_2 Hu2020 4 75.9 74.4 77.2 67.6 76.7 77.7 77.2 72.7 81.6 75.6 79.7 74.6 75.6 75.8
Hu_GT_task1a_3 Hu2020 3 76.2 74.7 77.5 68.2 76.6 77.1 77.9 73.6 81.5 75.4 79.8 75.0 76.2 76.9
Hu_GT_task1a_4 Hu2020 5 75.8 74.3 77.0 67.4 76.2 77.2 77.4 73.2 81.3 75.4 79.5 74.7 75.4 75.6
JHKim_IVS_task1a_1 Kim2020_t1 55 67.3 64.5 69.7 55.2 70.7 68.4 64.8 63.4 74.4 67.4 72.4 68.1 67.9 67.8
JHKim_IVS_task1a_2 Kim2020_t1 60 66.2 64.3 67.7 58.7 68.1 66.2 66.5 61.9 73.6 64.3 69.6 64.9 66.8 67.2
Jie_Maxvision_task1a_1 Jie2020 10 75.0 73.2 76.5 65.8 76.8 75.0 74.6 74.0 78.5 74.7 79.1 73.8 76.1 76.9
Kim_SGU_task1a_1 Changmin2020 33 71.6 69.2 73.5 60.1 73.7 70.7 73.9 67.8 77.6 70.7 78.5 71.6 70.6 72.0
Kim_SGU_task1a_2 Changmin2020 38 70.7 68.4 72.7 56.9 75.1 70.0 72.8 67.4 77.7 69.3 75.6 69.2 71.9 72.3
Kim_SGU_task1a_3 Changmin2020 39 70.7 68.3 72.6 60.1 72.5 70.2 71.9 66.9 77.0 70.1 75.9 70.4 70.6 71.7
Kim_SGU_task1a_4 Changmin2020 57 66.4 63.5 68.9 54.7 70.5 67.4 67.5 57.5 74.7 66.0 72.4 64.4 67.6 68.0
Koutini_CPJKU_task1a_1 Koutini2020 29 71.9 68.4 74.9 54.4 74.4 72.0 70.6 70.5 79.7 72.7 73.9 74.1 73.4 75.6
Koutini_CPJKU_task1a_2 Koutini2020 32 71.6 66.9 75.5 49.4 74.3 71.1 71.5 68.1 78.1 72.9 77.4 74.8 74.9 74.8
Koutini_CPJKU_task1a_3 Koutini2020 13 73.6 69.8 76.8 52.9 77.6 74.0 74.3 70.1 80.6 74.8 77.6 76.2 75.4 76.4
Koutini_CPJKU_task1a_4 Koutini2020 15 73.4 69.4 76.7 52.9 76.8 73.9 73.4 70.1 80.5 74.9 78.1 75.4 75.1 76.5
Lee_CAU_task1a_1 Lee2020 47 69.2 66.2 71.6 53.7 71.6 69.4 68.8 67.7 76.1 69.4 74.2 70.4 70.6 69.1
Lee_CAU_task1a_2 Lee2020 41 69.6 66.5 72.3 54.5 71.0 70.4 67.8 68.8 77.5 69.4 74.4 71.4 70.6 70.4
Lee_CAU_task1a_3 Lee2020 27 72.0 69.3 74.3 61.1 75.4 71.2 68.1 70.6 78.1 71.6 76.0 74.4 71.8 73.8
Lee_CAU_task1a_4 Lee2020 20 72.9 69.8 75.5 60.2 76.1 72.5 69.7 70.4 79.8 73.2 77.2 74.5 74.2 74.2
Lee_GU_task1a_1 Aryal2020 81 55.9 46.4 63.8 30.4 52.0 52.6 44.0 53.0 73.9 64.1 68.8 62.7 55.6 58.1
Lee_GU_task1a_2 Aryal2020 85 55.6 45.7 63.8 29.8 50.4 52.8 43.7 52.0 75.4 62.5 69.6 61.4 56.1 57.6
Lee_GU_task1a_3 Aryal2020 84 55.6 45.5 64.0 28.1 54.1 55.5 40.7 48.9 74.4 64.0 67.5 61.1 57.9 59.4
Lee_GU_task1a_4 Aryal2020 86 54.9 44.7 63.5 26.6 52.4 55.0 41.7 47.9 74.3 63.1 68.8 60.4 56.9 57.5
Liu_SHNU_task1a_1 Liu2020 45 69.3 65.1 72.7 50.7 74.4 69.4 67.3 63.8 76.5 70.1 72.1 73.0 71.8 73.0
Liu_SHNU_task1a_2 Liu2020 50 68.0 64.9 70.6 55.9 70.0 66.1 66.3 66.0 74.7 69.8 71.3 68.6 70.1 69.3
Liu_SHNU_task1a_3 Liu2020 83 55.7 46.8 63.1 40.3 55.7 56.3 35.0 46.8 74.4 65.6 70.6 55.3 56.8 55.6
Liu_SHNU_task1a_4 Liu2020 26 72.0 67.5 75.8 52.6 76.4 71.8 70.0 66.8 79.0 73.7 76.4 76.3 74.1 75.3
Liu_UESTC_task1a_1 Liu2020a 16 73.2 71.9 74.3 66.1 75.1 71.5 75.6 71.4 78.3 71.7 76.6 73.3 74.0 72.0
Liu_UESTC_task1a_2 Liu2020a 23 72.4 71.1 73.5 64.4 74.2 71.1 74.8 71.0 77.4 70.6 75.3 73.3 72.6 71.7
Liu_UESTC_task1a_3 Liu2020a 21 72.5 71.3 73.5 65.2 75.0 71.0 75.0 70.3 77.8 70.3 77.1 71.7 72.3 71.6
Liu_UESTC_task1a_4 Liu2020a 28 72.0 70.3 73.4 63.8 73.2 70.9 73.3 70.4 78.0 70.2 76.8 71.0 72.8 71.6
Lopez-Meyer_IL_task1a_1 Lopez-Meyer2020_t1a 68 64.3 60.9 67.1 54.2 68.2 65.1 60.0 57.0 76.4 64.4 68.5 63.8 62.7 66.8
Lopez-Meyer_IL_task1a_2 Lopez-Meyer2020_t1a 70 64.1 61.1 66.7 54.1 67.8 64.5 60.6 58.6 75.5 64.4 68.9 62.9 62.1 66.2
Lu_INTC_task1a_1 Hong2020 36 71.2 68.8 73.2 66.0 74.4 71.0 62.6 69.7 79.2 72.5 73.2 70.5 70.7 73.2
Lu_INTC_task1a_2 Hong2020 69 64.1 60.8 66.9 61.2 65.6 64.4 53.2 59.6 73.1 68.1 68.8 63.3 62.4 65.6
Lu_INTC_task1a_3 Hong2020 58 66.4 63.3 68.9 63.1 68.1 67.1 56.5 61.8 74.7 68.0 70.8 66.7 64.4 68.9
Lu_INTC_task1a_4 Hong2020 35 71.2 68.8 73.3 66.0 74.4 70.7 63.6 69.4 79.6 73.1 73.8 70.6 70.3 72.2
Monteiro_INRS_task1a_1 Joao2020 74 61.7 59.4 63.6 46.8 63.0 60.6 65.6 60.9 68.0 61.5 63.2 62.3 64.2 62.7
Monteiro_INRS_task1a_2 Joao2020 82 55.9 51.8 59.3 32.1 59.0 56.9 58.7 52.2 65.6 56.1 57.3 57.9 59.9 59.0
Monteiro_INRS_task1a_3 Joao2020 88 50.8 44.5 56.1 37.5 47.4 45.6 46.8 45.2 64.8 57.7 60.4 51.8 51.0 51.1
Monteiro_INRS_task1a_4 Joao2020 59 66.3 63.2 69.0 50.2 66.5 65.5 68.3 65.6 74.7 66.7 68.5 68.0 68.9 67.0
Naranjo-Alcazar_Vfy_task1a_1 Naranjo-Alcazar2020_t1 73 61.9 55.9 66.9 45.6 62.5 60.3 53.3 57.6 74.9 65.9 70.8 63.3 63.2 63.1
Naranjo-Alcazar_Vfy_task1a_2 Naranjo-Alcazar2020_t1 77 59.7 54.0 64.5 52.4 60.0 61.2 54.1 42.3 73.6 66.0 68.2 59.5 59.7 59.8
Paniagua_UPM_task1a_1 Paniagua2020 92 43.8 36.0 50.3 28.1 42.2 40.1 33.9 35.5 60.6 46.9 52.0 47.2 47.0 47.8
Shim_UOS_task1a_1 Shim2020 31 71.7 69.0 74.0 57.4 72.8 71.4 72.8 70.6 78.7 72.2 75.4 73.1 72.9 71.5
Shim_UOS_task1a_2 Shim2020 34 71.5 68.4 74.2 56.2 72.4 70.8 72.5 69.8 79.4 71.7 76.1 73.4 73.1 71.6
Shim_UOS_task1a_3 Shim2020 48 68.5 64.9 71.4 49.6 69.1 68.1 69.8 68.0 75.7 69.7 72.9 71.0 70.3 69.0
Shim_UOS_task1a_4 Shim2020 37 71.0 68.2 73.3 56.2 71.9 71.9 72.4 68.8 79.3 71.1 76.2 71.7 72.0 69.8
Suh_ETRI_task1a_1 Suh2020 22 72.5 69.9 74.6 62.1 73.9 71.3 72.3 69.8 78.5 73.5 78.2 71.2 72.2 74.1
Suh_ETRI_task1a_2 Suh2020 7 75.5 73.6 77.0 66.0 76.1 75.5 76.1 74.5 79.7 74.6 79.0 75.3 76.5 76.7
Suh_ETRI_task1a_3 Suh2020 1 76.5 74.6 78.1 65.6 78.0 75.6 77.6 76.3 81.1 75.6 80.0 76.3 77.6 77.9
Suh_ETRI_task1a_4 Suh2020 2 76.5 74.7 77.9 65.8 78.2 75.6 77.6 76.4 81.1 75.6 79.6 76.4 77.4 77.5
Swiecicki_NON_task1a_1 Swiecicki2020 56 67.1 64.0 69.6 49.6 68.0 67.6 70.4 64.6 72.7 69.0 70.5 67.5 70.8 67.0
Swiecicki_NON_task1a_2 Swiecicki2020 42 69.5 66.5 72.0 55.0 70.4 71.5 71.1 64.5 77.3 71.1 71.9 70.7 71.3 69.9
Swiecicki_NON_task1a_3 Swiecicki2020 40 70.3 68.2 72.0 56.4 72.6 73.0 73.7 65.5 75.6 68.9 72.3 71.2 72.8 71.0
Swiecicki_NON_task1a_4 Swiecicki2020 30 71.8 69.0 74.2 55.9 74.4 74.4 73.6 66.9 78.1 72.2 74.0 73.1 75.0 72.7
Vilouras_AUTh_task1a_1 Vilouras2020 53 67.7 63.5 71.2 60.6 68.1 62.9 59.9 65.8 78.1 69.2 75.6 67.4 68.6 68.3
Vilouras_AUTh_task1a_2 Vilouras2020 52 67.8 63.0 71.8 54.4 68.4 63.4 63.0 65.9 77.4 69.8 74.4 69.7 70.6 68.8
Vilouras_AUTh_task1a_3 Vilouras2020 44 69.3 65.3 72.6 59.9 70.3 64.9 63.2 68.2 79.1 69.4 75.5 70.0 70.7 71.0
Waldekar_IITKGP_task1a_1 Waldekar2020 79 58.4 52.9 62.9 50.8 57.8 59.2 47.5 49.3 68.2 59.8 66.9 59.3 62.0 61.0
Wang_RoyalFlush_task1a_1 Wang2020a 80 56.7 54.8 58.2 47.3 58.6 57.5 56.4 54.4 65.0 55.4 61.9 57.4 53.7 55.7
Wang_RoyalFlush_task1a_2 Wang2020a 65 65.2 63.0 67.0 56.8 67.2 63.0 65.4 62.6 74.1 63.3 70.8 64.5 62.6 66.6
Wang_RoyalFlush_task1a_3 Wang2020a 71 64.0 60.0 67.3 56.8 65.5 63.3 57.2 57.1 75.7 65.7 70.3 65.2 63.2 63.5
Wang_RoyalFlush_task1a_4 Wang2020a 91 45.5 42.9 47.7 40.4 46.7 47.3 40.3 39.8 52.2 47.0 50.5 42.6 46.6 47.2
Wu_CUHK_task1a_1 Wu2020_t1a 67 64.7 60.0 68.7 47.7 63.8 64.2 65.4 58.8 76.7 64.4 71.1 66.2 66.3 67.4
Wu_CUHK_task1a_2 Wu2020_t1a 46 69.3 63.0 74.5 48.1 70.7 70.1 68.6 57.3 80.4 73.0 77.7 71.3 71.9 72.8
Wu_CUHK_task1a_3 Wu2020_t1a 51 67.9 62.7 72.3 50.2 68.0 68.9 68.6 58.1 78.9 68.6 73.8 70.7 71.2 70.5
Wu_CUHK_task1a_4 Wu2020_t1a 43 69.4 63.6 74.3 49.4 70.4 69.8 69.2 59.2 80.6 71.0 75.9 72.4 73.8 72.0
Zhang_THUEE_task1a_1 Shao2020 19 73.0 69.9 75.6 59.2 75.8 71.7 72.4 70.4 80.4 73.7 79.3 74.4 71.9 74.1
Zhang_THUEE_task1a_2 Shao2020 17 73.2 70.0 75.8 59.0 75.9 71.3 72.7 71.0 80.6 73.9 79.4 74.1 72.3 74.6
Zhang_THUEE_task1a_3 Shao2020 25 72.3 68.8 75.2 55.6 75.4 71.3 71.9 69.9 80.0 72.7 78.7 72.9 73.4 73.3
Zhang_UESTC_task1a_1 Zhang2020 89 50.4 35.8 62.5 20.9 46.8 45.1 28.3 38.0 73.5 62.7 68.5 58.1 52.4 59.7
Zhang_UESTC_task1a_2 Zhang2020 87 51.7 37.5 63.5 21.6 52.0 45.1 28.3 40.6 72.6 63.6 68.3 61.5 55.4 59.6
Zhang_UESTC_task1a_3 Zhang2020 90 47.4 32.2 60.1 19.1 44.0 37.2 22.4 38.2 71.0 61.5 64.8 55.0 53.9 54.4

System characteristics

General characteristics

Rank | Submission label | Technical Report | Official system rank | Accuracy (Eval) | Sampling rate | Data augmentation | Features | Embeddings
Abbasi_ARI_task1a_1 Abbasi2020 78 59.7 44.1kHz mixup mel spectrogram
Abbasi_ARI_task1a_2 Abbasi2020 76 60.6 44.1kHz mixup mel spectrogram
Cao_JNU_task1a_1 Fei2020 63 65.7 22.05kHz mixup log-mel spectrogram, gamma-tone spectrogram, CQT
Cao_JNU_task1a_2 Fei2020 64 65.7 22.05kHz mixup log-mel spectrogram, gamma-tone spectrogram, CQT
Cao_JNU_task1a_3 Fei2020 61 66.0 22.05kHz mixup log-mel spectrogram, gamma-tone spectrogram, CQT
Cao_JNU_task1a_4 Fei2020 62 65.9 22.05kHz mixup log-mel spectrogram, gamma-tone spectrogram, CQT
FanVaf__task1a_1 Fanioudakis2020 72 63.4 4kHz mixup, time shifting spectrogram
FanVaf__task1a_2 Fanioudakis2020 75 60.7 8kHz mixup, time shifting spectrogram
FanVaf__task1a_3 Fanioudakis2020 66 64.8 4kHz, 8kHz mixup, time shifting spectrogram
FanVaf__task1a_4 Fanioudakis2020 54 67.5 4kHz, 8kHz mixup, time shifting spectrogram
Gao_UNISA_task1a_1 Gao2020 9 75.0 44.1kHz mixup, temporal cropping log-mel energies, deltas, delta-deltas
Gao_UNISA_task1a_2 Gao2020 12 74.1 44.1kHz mixup, temporal cropping log-mel energies, deltas, delta-deltas
Gao_UNISA_task1a_3 Gao2020 11 74.7 44.1kHz mixup, temporal cropping log-mel energies, deltas, delta-deltas
Gao_UNISA_task1a_4 Gao2020 8 75.2 44.1kHz mixup, temporal cropping log-mel energies, deltas, delta-deltas
DCASE2020 baseline 51.4 44.1kHz OpenL3
Helin_ADSPLAB_task1a_1 Wang2020_t1 14 73.4 44.1kHz mixup MFCC, log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1a_2 Wang2020_t1 49 68.4 44.1kHz mixup MFCC, log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1a_3 Wang2020_t1 18 73.1 44.1kHz mixup MFCC, log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1a_4 Wang2020_t1 24 72.3 44.1kHz mixup MFCC, log-mel energies, CQT, Gammatone
Hu_GT_task1a_1 Hu2020 6 75.7 44.1kHz mixup, random cropping, channel confusion, SpecAugment, spectrum correction, reverberation-drc, pitch shift, speed change, random noise, mix audios log-mel energies
Hu_GT_task1a_2 Hu2020 4 75.9 44.1kHz mixup, random cropping, channel confusion, SpecAugment, spectrum correction, reverberation-drc, pitch shift, speed change, random noise, mix audios log-mel energies
Hu_GT_task1a_3 Hu2020 3 76.2 44.1kHz mixup, random cropping, channel confusion, SpecAugment, spectrum correction, reverberation-drc, pitch shift, speed change, random noise, mix audios log-mel energies
Hu_GT_task1a_4 Hu2020 5 75.8 44.1kHz mixup, random cropping, channel confusion, SpecAugment, spectrum correction, reverberation-drc, pitch shift, speed change, random noise, mix audios log-mel energies
JHKim_IVS_task1a_1 Kim2020_t1 55 67.3 44.1kHz subtract filter HPSS, log-mel energies
JHKim_IVS_task1a_2 Kim2020_t1 60 66.2 44.1kHz subtract filter HPSS, log-mel energies
Jie_Maxvision_task1a_1 Jie2020 10 75.0 44.1kHz mixup, temporal cropping log-mel energies
Kim_SGU_task1a_1 Changmin2020 33 71.6 44.1kHz mixup, temporal cropping, class-wise random masking log-mel energies, deltas, delta-deltas, multiple channel feature
Kim_SGU_task1a_2 Changmin2020 38 70.7 44.1kHz mixup, temporal cropping, class-wise random masking log-mel energies, deltas, delta-deltas, multiple channel feature
Kim_SGU_task1a_3 Changmin2020 39 70.7 44.1kHz mixup, temporal cropping, class-wise random masking log-mel energies, deltas, delta-deltas, multiple channel feature
Kim_SGU_task1a_4 Changmin2020 57 66.4 44.1kHz mixup, temporal cropping, class-wise random masking log-mel energies, deltas, delta-deltas, multiple channel feature
Koutini_CPJKU_task1a_1 Koutini2020 29 71.9 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1a_2 Koutini2020 32 71.6 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1a_3 Koutini2020 13 73.6 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1a_4 Koutini2020 15 73.4 22.05kHz mixup Perceptually-weighted log-mel energies
Lee_CAU_task1a_1 Lee2020 47 69.2 44.1kHz mixup log-mel energies, deltas, delta-deltas, HPSS
Lee_CAU_task1a_2 Lee2020 41 69.6 44.1kHz mixup log-mel energies, deltas, delta-deltas, HPSS
Lee_CAU_task1a_3 Lee2020 27 72.0 44.1kHz mixup log-mel energies, deltas, delta-deltas, HPSS
Lee_CAU_task1a_4 Lee2020 20 72.9 44.1kHz mixup log-mel energies, deltas, delta-deltas, HPSS
Lee_GU_task1a_1 Aryal2020 81 55.9 44.1kHz mixup, time masking, frequency masking OpenL3 (env)
Lee_GU_task1a_2 Aryal2020 85 55.6 44.1kHz mixup, time masking, frequency masking OpenL3 (env)
Lee_GU_task1a_3 Aryal2020 84 55.6 44.1kHz mixup, time masking, frequency masking OpenL3 (music)
Lee_GU_task1a_4 Aryal2020 86 54.9 44.1kHz mixup, time masking, frequency masking OpenL3 (music)
Liu_SHNU_task1a_1 Liu2020 45 69.3 22.05kHz mixup, deviceaugment perceptual weighted power spectrogram
Liu_SHNU_task1a_2 Liu2020 50 68.0 22.05kHz mixup perceptual weighted power spectrogram
Liu_SHNU_task1a_3 Liu2020 83 55.7 44.1kHz SpecAugment log-mel energies
Liu_SHNU_task1a_4 Liu2020 26 72.0 22.05kHz, 44.1kHz mixup, deviceaugment perceptual weighted power spectrogram OpenL3
Liu_UESTC_task1a_1 Liu2020a 16 73.2 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Liu_UESTC_task1a_2 Liu2020a 23 72.4 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Liu_UESTC_task1a_3 Liu2020a 21 72.5 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Liu_UESTC_task1a_4 Liu2020a 28 72.0 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Lopez-Meyer_IL_task1a_1 Lopez-Meyer2020_t1a 68 64.3 16kHz random noise, random gain, random cropping, mixup, SpecAugment raw waveform, mel filterbank
Lopez-Meyer_IL_task1a_2 Lopez-Meyer2020_t1a 70 64.1 16kHz random noise, random gain, random cropping, mixup, SpecAugment raw waveform, mel filterbank
Lu_INTC_task1a_1 Hong2020 36 71.2 32kHz mixup, weight decay, dropout, SpecAugment mel spectrogram, CQT None
Lu_INTC_task1a_2 Hong2020 69 64.1 32kHz mixup, weight decay, dropout, SpecAugment mel spectrogram, CQT None
Lu_INTC_task1a_3 Hong2020 58 66.4 32kHz mixup, weight decay, dropout, SpecAugment mel spectrogram, CQT None
Lu_INTC_task1a_4 Hong2020 35 71.2 32kHz mixup, weight decay, dropout, SpecAugment mel spectrogram, CQT None
Monteiro_INRS_task1a_1 Joao2020 74 61.7 44.1kHz Sox distortions, SpecAugment log-mel energies
Monteiro_INRS_task1a_2 Joao2020 82 55.9 44.1kHz Sox distortions, SpecAugment log-mel energies
Monteiro_INRS_task1a_3 Joao2020 88 50.8 44.1kHz Sox distortions, SpecAugment modulation spectra
Monteiro_INRS_task1a_4 Joao2020 59 66.3 44.1kHz Sox distortions, SpecAugment log-mel energies, modulation spectra
Naranjo-Alcazar_Vfy_task1a_1 Naranjo-Alcazar2020_t1 73 61.9 44.1kHz mixup Gammatone
Naranjo-Alcazar_Vfy_task1a_2 Naranjo-Alcazar2020_t1 77 59.7 44.1kHz mixup HPSS, log-mel energies
Paniagua_UPM_task1a_1 Paniagua2020 92 43.8 44.1kHz LTAS, envelope modulation spectrum
Shim_UOS_task1a_1 Shim2020 31 71.7 44.1kHz mixup, SpecAugment mel spectrogram
Shim_UOS_task1a_2 Shim2020 34 71.5 44.1kHz mixup, SpecAugment mel spectrogram
Shim_UOS_task1a_3 Shim2020 48 68.5 44.1kHz mixup, SpecAugment mel spectrogram
Shim_UOS_task1a_4 Shim2020 37 71.0 44.1kHz mixup mel spectrogram
Suh_ETRI_task1a_1 Suh2020 22 72.5 44.1kHz temporal cropping, mixup log-mel energies, deltas, delta-deltas
Suh_ETRI_task1a_2 Suh2020 7 75.5 44.1kHz temporal cropping, mixup log-mel energies, deltas, delta-deltas
Suh_ETRI_task1a_3 Suh2020 1 76.5 44.1kHz temporal cropping, mixup log-mel energies, deltas, delta-deltas
Suh_ETRI_task1a_4 Suh2020 2 76.5 44.1kHz temporal cropping, mixup log-mel energies, deltas, delta-deltas
Swiecicki_NON_task1a_1 Swiecicki2020 56 67.1 44.1kHz mixup, SpecAugment, random resize, random cropping log-mel energies
Swiecicki_NON_task1a_2 Swiecicki2020 42 69.5 44.1kHz mixup, SpecAugment, random resize, random cropping log-mel energies
Swiecicki_NON_task1a_3 Swiecicki2020 40 70.3 44.1kHz mixup, SpecAugment, random resize, random cropping log-mel energies
Swiecicki_NON_task1a_4 Swiecicki2020 30 71.8 44.1kHz mixup, SpecAugment, random resize, random cropping log-mel energies
Vilouras_AUTh_task1a_1 Vilouras2020 53 67.7 44.1kHz log-mel energies, PCEN
Vilouras_AUTh_task1a_2 Vilouras2020 52 67.8 44.1kHz mixup, time stretching, frequency masking, shifting, clipping distortion log-mel energies, PCEN
Vilouras_AUTh_task1a_3 Vilouras2020 44 69.3 44.1kHz mixup, time stretching, frequency masking, shifting, clipping distortion log-mel energies, PCEN
Waldekar_IITKGP_task1a_1 Waldekar2020 79 58.4 44.1kHz MFDWC
Wang_RoyalFlush_task1a_1 Wang2020a 80 56.7 44.1kHz mixup, spectrum correction log-mel energies
Wang_RoyalFlush_task1a_2 Wang2020a 65 65.2 44.1kHz mixup, spectrum correction log-mel energies
Wang_RoyalFlush_task1a_3 Wang2020a 71 64.0 44.1kHz mixup, spectrum correction log-mel energies
Wang_RoyalFlush_task1a_4 Wang2020a 91 45.5 44.1kHz mixup, spectrum correction log-mel energies
Wu_CUHK_task1a_1 Wu2020_t1a 67 64.7 44.1kHz mixup wavelet filter-bank features
Wu_CUHK_task1a_2 Wu2020_t1a 46 69.3 44.1kHz mixup wavelet filter-bank features
Wu_CUHK_task1a_3 Wu2020_t1a 51 67.9 44.1kHz mixup wavelet filter-bank features
Wu_CUHK_task1a_4 Wu2020_t1a 43 69.4 44.1kHz mixup wavelet filter-bank features
Zhang_THUEE_task1a_1 Shao2020 19 73.0 44.1kHz mixup, ImageDataGenerator, temporal cropping log-mel energies
Zhang_THUEE_task1a_2 Shao2020 17 73.2 44.1kHz mixup, ImageDataGenerator, temporal cropping log-mel energies
Zhang_THUEE_task1a_3 Shao2020 25 72.3 44.1kHz mixup, ImageDataGenerator, temporal cropping log-mel energies
Zhang_UESTC_task1a_1 Zhang2020 89 50.4 44.1kHz log-mel energies OpenL3
Zhang_UESTC_task1a_2 Zhang2020 87 51.7 44.1kHz log-mel energies OpenL3
Zhang_UESTC_task1a_3 Zhang2020 90 47.4 44.1kHz log-mel energies OpenL3



Machine learning characteristics

Rank | Code | Technical Report | Official system rank | Accuracy (Eval) | External data usage | External data sources | Model complexity | Classifier | Ensemble subsystems | Decision making
Abbasi_ARI_task1a_1 Abbasi2020 78 59.7 180310 CNN, ensemble 5 average
Abbasi_ARI_task1a_2 Abbasi2020 76 60.6 180310 CNN, ensemble, XGBoost 5 average
Cao_JNU_task1a_1 Fei2020 63 65.7 2631282 CNN, 2-DenseNet 5 majority vote
Cao_JNU_task1a_2 Fei2020 64 65.7 2631282 CNN, 2-DenseNet 5 majority vote
Cao_JNU_task1a_3 Fei2020 61 66.0 5094806 CNN, 2-DenseNet 7 majority vote
Cao_JNU_task1a_4 Fei2020 62 65.9 5094806 CNN, 2-DenseNet 7 majority vote
FanVaf__task1a_1 Fanioudakis2020 72 63.4 20477140 CRNN sample-based average
FanVaf__task1a_2 Fanioudakis2020 75 60.7 20477140 CRNN sample-based average
FanVaf__task1a_3 Fanioudakis2020 66 64.8 20477140 CRNN 2 sample average with weights
FanVaf__task1a_4 Fanioudakis2020 54 67.5 20477140 CRNN 2 sample average with weights
Gao_UNISA_task1a_1 Gao2020 9 75.0 4311732 ResNet
Gao_UNISA_task1a_2 Gao2020 12 74.1 4311732 ResNet
Gao_UNISA_task1a_3 Gao2020 11 74.7 4312628 ResNet
Gao_UNISA_task1a_4 Gao2020 8 75.2 12936092 ResNet, ensemble 3 average
DCASE2020 baseline 51.4 embeddings 5012931 MLP
Helin_ADSPLAB_task1a_1 Wang2020_t1 14 73.4 directly AudioSet 341229835 CNN, ensemble 9 average
Helin_ADSPLAB_task1a_2 Wang2020_t1 49 68.4 839596544 CNN, ensemble 8 average
Helin_ADSPLAB_task1a_3 Wang2020_t1 18 73.1 directly AudioSet 361028107 CNN, ensemble 13 average
Helin_ADSPLAB_task1a_4 Wang2020_t1 24 72.3 directly AudioSet 380826379 CNN, ensemble 17 average
Hu_GT_task1a_1 Hu2020 6 75.7 62525968 CNN, ResNet, ensemble 4 average
Hu_GT_task1a_2 Hu2020 4 75.9 67763768 CNN, ResNet, ensemble 4 average
Hu_GT_task1a_3 Hu2020 3 76.2 130289736 CNN, ResNet, ensemble 8 average
Hu_GT_task1a_4 Hu2020 5 75.8 91251960 CNN, ResNet, ensemble 5 average
JHKim_IVS_task1a_1 Kim2020_t1 55 67.3 pre-trained model 115300 CNN
JHKim_IVS_task1a_2 Kim2020_t1 60 66.2 pre-trained model 31600 CNN
Jie_Maxvision_task1a_1 Jie2020 10 75.0 3584924 CNN
Kim_SGU_task1a_1 Changmin2020 33 71.6 3254028 Residual CNN 2 average
Kim_SGU_task1a_2 Changmin2020 38 70.7 3254908 Residual CNN 2 average
Kim_SGU_task1a_3 Changmin2020 39 70.7 6352740 Residual CNN 4 average
Kim_SGU_task1a_4 Changmin2020 57 66.4 3255788 Residual CNN 2 average
Koutini_CPJKU_task1a_1 Koutini2020 29 71.9 19702400 RF-regularized CNNs
Koutini_CPJKU_task1a_2 Koutini2020 32 71.6 36783360 RF-regularized CNNs
Koutini_CPJKU_task1a_3 Koutini2020 13 73.6 225943040 RF-regularized CNNs
Koutini_CPJKU_task1a_4 Koutini2020 15 73.4 225943040 RF-regularized CNNs
Lee_CAU_task1a_1 Lee2020 47 69.2 10088328 CNN, ResNet, LCNN, InceptionLike, ensemble 8 average
Lee_CAU_task1a_2 Lee2020 41 69.6 10088328 CNN 8 average
Lee_CAU_task1a_3 Lee2020 27 72.0 10088328 CNN, ResNet, LCNN, InceptionLike, ensemble 8 average
Lee_CAU_task1a_4 Lee2020 20 72.9 10088328 CNN, ResNet, LCNN, InceptionLike, ensemble 8 average
Lee_GU_task1a_1 Aryal2020 81 55.9 embeddings 15940046 ResNet, Attention
Lee_GU_task1a_2 Aryal2020 85 55.6 embeddings 15940046 ResNet, Attention
Lee_GU_task1a_3 Aryal2020 84 55.6 embeddings 15940046 ResNet, Attention
Lee_GU_task1a_4 Aryal2020 86 54.9 embeddings 15940046 ResNet, Attention
Liu_SHNU_task1a_1 Liu2020 45 69.3 3563412 ResNet, Receptive Field Regularization
Liu_SHNU_task1a_2 Liu2020 50 68.0 4691274 CNN
Liu_SHNU_task1a_3 Liu2020 83 55.7 8756749 Self-attention
Liu_SHNU_task1a_4 Liu2020 26 72.0 embeddings 13267617 ResNet, Receptive Field Regularization, CNN, MLP
Liu_UESTC_task1a_1 Liu2020a 16 73.2 26023864 ResNet 8 average
Liu_UESTC_task1a_2 Liu2020a 23 72.4 58559744 ResNet 18 average
Liu_UESTC_task1a_3 Liu2020a 21 72.5 26023864 ResNet 8 stacking
Liu_UESTC_task1a_4 Liu2020a 28 72.0 58559744 ResNet 18 average
Lopez-Meyer_IL_task1a_1 Lopez-Meyer2020_t1a 68 64.3 directly AudioSet 39998697 CNN, ResNet, VGG, ensemble 3 average
Lopez-Meyer_IL_task1a_2 Lopez-Meyer2020_t1a 70 64.1 directly AudioSet 39998697 CNN, ResNet, VGG, ensemble 3 average
Lu_INTC_task1a_1 Hong2020 36 71.2 pre-trained model AudioSet 27184858 ResNext 10 average
Lu_INTC_task1a_2 Hong2020 69 64.1 pre-trained model AudioSet 27184858 ResNext None softmax
Lu_INTC_task1a_3 Hong2020 58 66.4 pre-trained model AudioSet 27184858 ResNext 2 average
Lu_INTC_task1a_4 Hong2020 35 71.2 pre-trained model AudioSet 27184858 ResNext 12 average
Monteiro_INRS_task1a_1 Joao2020 74 61.7 4978634 ResNet
Monteiro_INRS_task1a_2 Joao2020 82 55.9 4522398 TDNN
Monteiro_INRS_task1a_3 Joao2020 88 50.8 20731100 ResNet
Monteiro_INRS_task1a_4 Joao2020 59 66.3 20731100 CNN, ResNet12, ResNet18, TDNN 5 average
Naranjo-Alcazar_Vfy_task1a_1 Naranjo-Alcazar2020_t1 73 61.9 425294 CNN
Naranjo-Alcazar_Vfy_task1a_2 Naranjo-Alcazar2020_t1 77 59.7 528014 CNN
Paniagua_UPM_task1a_1 Paniagua2020 92 43.8 11264 MLP average log-likelihood
Shim_UOS_task1a_1 Shim2020 31 71.7 embeddings 1115461 ensemble 16 score-sum
Shim_UOS_task1a_2 Shim2020 34 71.5 embeddings 1115461 ensemble 8 score-sum
Shim_UOS_task1a_3 Shim2020 48 68.5 856693 LCNN 4 score-sum
Shim_UOS_task1a_4 Shim2020 37 71.0 embeddings 594923 ResNet 8 score-sum
Suh_ETRI_task1a_1 Suh2020 22 72.5 13164184 ResNet
Suh_ETRI_task1a_2 Suh2020 7 75.5 13164184 ResNet
Suh_ETRI_task1a_3 Suh2020 1 76.5 39492555 Snapshot 3 average
Suh_ETRI_task1a_4 Suh2020 2 76.5 39492555 Snapshot 3 weighted score average
Swiecicki_NON_task1a_1 Swiecicki2020 56 67.1 10711602 EfficientNet average
Swiecicki_NON_task1a_2 Swiecicki2020 42 69.5 10711602 EfficientNet average
Swiecicki_NON_task1a_3 Swiecicki2020 40 70.3 10711602 EfficientNet average
Swiecicki_NON_task1a_4 Swiecicki2020 30 71.8 21423204 EfficientNet 2 average
Vilouras_AUTh_task1a_1 Vilouras2020 53 67.7 3343774 ResNet, ensemble 4 average
Vilouras_AUTh_task1a_2 Vilouras2020 52 67.8 3343774 ResNet, ensemble 4 average
Vilouras_AUTh_task1a_3 Vilouras2020 44 69.3 6687548 ResNet, ensemble 8 average
Waldekar_IITKGP_task1a_1 Waldekar2020 79 58.4 32400 SVM
Wang_RoyalFlush_task1a_1 Wang2020a 80 56.7 542190 CNN, ensemble 6 average
Wang_RoyalFlush_task1a_2 Wang2020a 65 65.2 542190 CNN, ensemble 5 average
Wang_RoyalFlush_task1a_3 Wang2020a 71 64.0 542190 CNN, ensemble 4 average
Wang_RoyalFlush_task1a_4 Wang2020a 91 45.5 650628 CNN, ensemble 6 average
Wu_CUHK_task1a_1 Wu2020_t1a 67 64.7 13143642 CNN
Wu_CUHK_task1a_2 Wu2020_t1a 46 69.3 53300328 CNN 4 average
Wu_CUHK_task1a_3 Wu2020_t1a 51 67.9 65718210 CNN 5 average
Wu_CUHK_task1a_4 Wu2020_t1a 43 69.4 119018538 CNN 9 average
Zhang_THUEE_task1a_1 Shao2020 19 73.0 3524258 ResNet, Mini-SegNet 11
Zhang_THUEE_task1a_2 Shao2020 17 73.2 2516564 ResNet, Mini-SegNet 13
Zhang_THUEE_task1a_3 Shao2020 25 72.3 2196170 ResNet, Mini-SegNet 8
Zhang_UESTC_task1a_1 Zhang2020 89 50.4 embeddings 329610 MLP, CNN maximum likelihood
Zhang_UESTC_task1a_2 Zhang2020 87 51.7 embeddings 329610 MLP, CNN maximum likelihood
Zhang_UESTC_task1a_3 Zhang2020 90 47.4 embeddings 518090 MLP, CNN maximum likelihood
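
The "Decision making" column above names fusion rules such as "average" and "majority vote". As a rough illustration of how these two rules can disagree, the following sketch fuses per-model class probabilities with NumPy. The function names and the array layout (models, clips, classes) are assumptions made for this example and are not taken from any submission.

```python
import numpy as np

def average_fusion(probs):
    """Average class probabilities over models, then pick the best class.

    probs: array of shape (n_models, n_clips, n_classes).
    Returns an array of shape (n_clips,) with predicted class indices.
    """
    return np.mean(probs, axis=0).argmax(axis=1)

def majority_vote(probs):
    """Let each model vote for its top class, then pick the most voted class.

    Ties are resolved in favour of the lowest class index.
    """
    n_classes = probs.shape[2]
    votes = probs.argmax(axis=2)  # (n_models, n_clips)
    counts = np.stack(
        [(votes == c).sum(axis=0) for c in range(n_classes)], axis=1
    )  # (n_clips, n_classes)
    return counts.argmax(axis=1)

# Three hypothetical models scoring one clip over three classes.
probs = np.array([
    [[0.9, 0.1, 0.0]],   # very confident in class 0
    [[0.4, 0.6, 0.0]],   # mildly prefers class 1
    [[0.4, 0.6, 0.0]],   # mildly prefers class 1
])
avg_pred = average_fusion(probs)
vote_pred = majority_vote(probs)
```

Averaging keeps the confidence of each model, so one very confident model can outweigh two lukewarm ones, while majority voting counts each model equally.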

Technical reports

Acoustic Scene Classification by the Snapshot Ensemble of CNNs with XGBoost

Reyhaneh Abbasi and Peter Balazs
Mathematics and Signal Processing in Acoustics, Acoustic Research Institute of OEAW, Vienna, Austria

Abstract

This is the report for DCASE challenge Task 1A. The aim is to classify audio recordings into 10 predefined acoustic scene classes: airport, shopping mall, metro station, street pedestrian, public square, street traffic, tram, bus, metro, and park. Two main difficulties of this task are that the recordings are provided by five devices of differing quality and that some of the classes are very close in terms of acoustic information. To bias-correct all instruments against the reference (here, instrument A), we used the XGBoost algorithm fed with standardized mel spectrograms. Our classifier consists of a CNN, mixup augmentation, and a snapshot ensemble (to decrease the total number of parameters and, consequently, the variance of the model prediction). Our model yielded an accuracy of 62.1% and a cross-entropy loss of 1.06, whereas the baseline model yielded an accuracy of 54.1% and a cross-entropy loss of 1.36.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features mel spectrogram
Classifier CNN, ensemble; CNN, ensemble, XGBoost
Decision making average
PDF
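
The abstract above mentions a snapshot ensemble. A common way to obtain snapshots is to restart a cosine-annealed learning rate several times and save the model at the end of each cycle. The sketch below shows only that schedule. The cosine form, the step counts, and the function names are assumptions for illustration and are not confirmed settings of this submission.

```python
import math

def snapshot_lr(step, total_steps, n_cycles, lr_max):
    """Cyclic cosine-annealed learning rate (assumed schedule).

    step is 0-indexed. The rate restarts at lr_max at the beginning of
    each cycle and decays towards 0 at the end of the cycle.
    """
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = step % cycle_len  # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)

def snapshot_steps(total_steps, n_cycles):
    """Steps at which a snapshot would be saved (last step of each cycle)."""
    cycle_len = math.ceil(total_steps / n_cycles)
    return [min((k + 1) * cycle_len, total_steps) - 1 for k in range(n_cycles)]

# Hypothetical run: 100 steps split into 5 cycles, giving 5 snapshots.
schedule = [snapshot_lr(s, 100, 5, 0.1) for s in range(100)]
save_at = snapshot_steps(100, 5)
```

Because all snapshots come from one training run, the ensemble members share a single training budget, and their predictions can then be averaged.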

Attention-Based Resnet-18 Model for Acoustic Scene Classification

Nisan Aryal and Sang Woong Lee
Gachon University, South Korea

Abstract

This technical report describes our approach to Task 1a of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 challenge. A ResNet-18 model with attention and OpenL3 embeddings are used to solve the acoustic scene classification problem. The model achieves 59.6% accuracy on the training and validation split of the development set, which is 5.5 percentage points higher than the baseline network.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, time masking, frequency masking
Embeddings OpenL3 (env); OpenL3 (music)
Classifier ResNet, Attention
PDF

Multi-Channel Feature Using Inter-Class and Inter-Device Standard Deviations for Acoustic Scene Classification

Kim Changmin, Seo Soonshin and Kim Ji-Hwan
Dept. of Computer Science and Engineering, Sogang University, Seoul, South Korea

Abstract

In this technical report, we describe the acoustic scene classification methods we submitted to Task 1a of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 challenge. Our proposed methods aim to maximize the differences between acoustic scene classes and minimize the differences between devices. We computed the inter-class and inter-device standard deviations of the training data and applied them to the log-mel spectrogram features. The resulting features are added as extra channels to the original log-mel spectrogram. In addition, we applied class-wise random masking to frequency regions with small standard deviations. The masked features are then divided into quarters along the frequency axis and used to train four-pathway residual convolutional neural networks. Our proposed methods achieved an overall accuracy of 72.7% on the official development dataset, an improvement of 18.6 percentage points over the official baseline.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, temporal cropping, class-wise random masking
Features log-mel energies, deltas, delta-deltas, multiple channel feature
Classifier Residual CNN
Decision making average
PDF

Investigating Temporal and Spectral Sequences Combining GRU-RNNs for Acoustic Scene Classification

Eleftherios Fanioudakis and Anastasios Vafeiadis
Greece

Abstract

This report describes our contribution to Task 1A of the 2020 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. We investigated the use of bi-directional Gated Recurrent Unit (GRU) Recurrent Neural Networks (RNNs) to capture the spectral and temporal information of the input signal. The GRU-RNNs are used as an ensemble during training, with equal weights for the time and frequency sequences. Our architecture is based on a Convolutional Recurrent Neural Network (CRNN) that takes the short-time Fourier magnitude spectrogram as input. By exploiting the mixup augmentation technique, randomly selecting the mixup coefficient α for every sample, and down-sampling the original signal from 44.1 kHz to 4 kHz, we achieved an average class accuracy of 65.4%. Since most of the information in the environmental sound signals was found in the lower frequencies, we also built a CRNN model ensemble combining 4 kHz and 8 kHz sampling frequencies. This ensemble boosted the accuracy to 67.3%, a 24.4% relative increase over the development set baseline.

System characteristics
Sampling rate 4kHz; 8kHz; 4kHz, 8kHz
Data augmentation mixup, time shifting
Features spectrogram
Classifier CRNN
Decision making sample-based average; sample average with weights
PDF
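
The abstract above describes mixup with a coefficient selected randomly for every sample. A minimal NumPy sketch of that idea follows. It assumes the mixing weight of each sample is drawn from a Beta(alpha, alpha) distribution, as in the usual mixup formulation; the function name, the value of alpha, and the batch shapes are illustrative assumptions, not details taken from the report.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.4, rng=None):
    """Mix each sample with a randomly chosen partner from the same batch.

    x: features, shape (n, ...); y: one-hot labels, shape (n, n_classes).
    A separate mixing weight is drawn for every sample (assumed Beta prior).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    lam = rng.beta(alpha, alpha, size=n)   # one weight per sample
    perm = rng.permutation(n)              # partner index for each sample
    lam_x = lam.reshape((n,) + (1,) * (x.ndim - 1))  # broadcast over features
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam[:, None] * y + (1.0 - lam[:, None]) * y[perm]
    return x_mix, y_mix

# Hypothetical batch: 4 spectrograms of 3 x 5 bins, 10 scene classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5))
y = np.eye(10)[[0, 3, 3, 7]]
x_mix, y_mix = mixup_batch(x, y, alpha=0.4, rng=rng)
```

The mixed labels stay valid probability vectors because each is a convex combination of two one-hot vectors.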

Acoustic Scene Classification Based on 2-Order Dense Convolutional Network

Hongbo Fei, Zilong Huang, Yi Cao and Chen Liu
Mechanical Engineering, Jiangnan University, Wuxi, China

Abstract

In this technical report, we describe our acoustic scene classification algorithm submitted to DCASE 2020 Task 1a. We focus on network innovation and propose a novel acoustic scene classification model based on a 2-order dense convolutional network, which targets the insufficient classification accuracy and adaptability of current models. Starting from the dense convolutional neural network and combining it with the N-order Markov model, we replace the traditional dense connection with an N-order correlation connection, which yields the N-order dense convolutional network model. For audio feature extraction, we stitch together log-mel spectrograms and gamma-tone spectrograms. To further improve system performance, virtual data generation is adopted. Finally, the trained model is used for transfer learning. With the proposed systems, we achieved a classification accuracy of 69.16% on the officially provided evaluation dataset, which is 15.06 percentage points higher than the baseline system.

System characteristics
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel spectrogram, gamma-tone spectrogram, CQT
Classifier CNN, 2-DenseNet
Decision making majority vote
PDF

Acoustic Scene Classification Using Deep Residual Networks with Focal Loss and Mild Domain Adaptation

Wei Gao and Mark McDonnell
UniSA STEM, University of South Australia, Adelaide, Australia

Abstract

This technical report describes our approach to Task 1a in the 2020 DCASE acoustic scene classification challenge. We have incorporated a few more training techniques on top of our previous contest entries. One was replacing the cross-entropy loss with focal loss, which aims to focus on poorly classified samples while reducing the loss on well-classified samples with high probability; another was adding an auxiliary binary classifier for the purpose of domain adaptation.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, temporal cropping
Features log-mel energies, deltas, delta-deltas
Classifier ResNet; ResNet, ensemble
Decision making average
PDF
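
The abstract above replaces cross-entropy with focal loss. In its common form, focal loss scales the cross-entropy of each sample by (1 - p_t)^gamma, where p_t is the probability assigned to the true class. The NumPy sketch below illustrates this. The value gamma = 2 and the function names are assumptions for the example; the report's actual settings are not stated here.

```python
import numpy as np

def focal_loss_per_sample(probs, labels, gamma=2.0, eps=1e-12):
    """Focal loss for each sample.

    probs: predicted class probabilities, shape (n, n_classes).
    labels: integer class indices, shape (n,).
    """
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)

def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss over the batch."""
    return float(np.mean(focal_loss_per_sample(probs, labels, gamma)))

# Two hypothetical samples of class 0: one well classified, one poorly classified.
probs = np.array([[0.9, 0.1], [0.3, 0.7]])
labels = np.array([0, 0])
per_sample = focal_loss_per_sample(probs, labels, gamma=2.0)
cross_entropy = focal_loss(probs, labels, gamma=0.0)  # gamma = 0 recovers cross-entropy
```

With gamma = 2, the well-classified sample (p_t = 0.9) keeps only 1% of its cross-entropy term, while the poorly classified one (p_t = 0.3) keeps 49%, so training concentrates on the harder sample.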

Acoustic Scene Classification Using Mel-Spectrum and CQT Based Neural Network Ensemble

Lu Hong
Intel Labs, Intel Corporation, Santa Clara, USA

Abstract

In our submission to DCASE 2020 Task 1a, we explored the use of the ResNeXt-50 architecture with log-mel spectrum and constant-Q transform (CQT) based front ends. To improve performance, we use transfer learning: the neural networks were pre-trained on AudioSet data and then fine-tuned on the DCASE Task 1a dataset. With the default DCASE 2020 Task 1a train/validation split, we obtained about 70% average accuracy across all 10 classes. To further improve performance, we applied a leave-one-city-out cross-validation (CV) method to train 10 more models, with one city’s data as the holdout set in each CV fold. These models were combined using different ensemble strategies to produce 4 final submission entries.

System characteristics
Sampling rate 32kHz
Data augmentation mixup, weight decay, dropout, SpecAugment
Features mel spectrogram, CQT
Embeddings None
Classifier ResNext
Decision making average; softmax
PDF
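
The leave-one-city-out cross-validation mentioned in the abstract above can be sketched in a few lines of plain Python. The clip identifiers, city names, and function name below are placeholders for illustration; the real split would be built from the dataset metadata.

```python
def leave_one_city_out(clips):
    """Yield one fold per city, holding that city out for validation.

    clips: list of (clip_id, city) pairs.
    Yields (held_out_city, train_ids, val_ids) tuples in sorted city order.
    """
    cities = sorted({city for _, city in clips})
    for held_out in cities:
        train_ids = [cid for cid, city in clips if city != held_out]
        val_ids = [cid for cid, city in clips if city == held_out]
        yield held_out, train_ids, val_ids

# Placeholder metadata: four clips recorded in three cities.
clips = [
    ("a1", "barcelona"),
    ("a2", "helsinki"),
    ("a3", "barcelona"),
    ("a4", "lisbon"),
]
folds = list(leave_one_city_out(clips))
```

With 10 cities in the development set, this procedure yields 10 folds, which matches the 10 additional models described in the abstract.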

Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Hu Hu1, Chao-Han Huck Yang1, Xianjun Xia2, Xue Bai3, Xin Tang3, Yajian Wang3, Shutong Niu3, Li Chai3, Juanjuan Li2, Hongning Zhu2, Feng Bao4, Yuanjun Zhao2, Sabato Marco Siniscalchi5, Yannan Wang2, Jun Du3 and Chin-Hui Lee1
1School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA, 2Tencent Media Lab, Shenzhen, China, 3University of Science and Technology of China, HeFei, China, 4Tencent Media Lab, Beijing, China, 5Computer Engineering School, University of Enna Kore, Italy

Abstract

In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system that leverages an ad-hoc score combination of two convolutional neural networks (CNNs), which classify the acoustic input into three classes and ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-class CNN-based architectures. On the Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On the Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500 KB.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, random cropping, channel confusion, SpecAugment, spectrum correction, reverberation-drc, pitch shift, speed change, random noise, mix audios
Features log-mel energies
Classifier CNN, ResNet, ensemble
Decision making average
PDF

Acoustic Scene Classification with Residual Networks and Attention Mechanism

Liu Jie
Maxvision, Wuhan, China

Abstract

This technical report describes our submission for Task 1A of the DCASE 2020 challenge. We use log-mel spectrograms and a residual network. We follow the idea of McDonnell [1] in DCASE 2019 and do not downsample along the frequency axis. In addition, we use an attention mechanism to improve the performance of the system.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, temporal cropping
Features log-mel energies
Classifier CNN
PDF

Development of the INRS-EMT Scene Classification Systems for the 2020 Edition of the DCASE Challenge

Monteiro Joao, Shruti Kshirsagar, Anderson Avila, Amr Aaballah, Parth Tiwari and Tiago Falk
EMT, Institut National de la Recherche Scientifique, Montreal, Canada

Abstract

In this report we provide a brief overview of a set of submissions for the scene classification sub-tasks of the 2020 edition of the DCASE challenge. Our submissions comprise efforts at the feature representation level, where we explored the use of modulation spectra and i-vectors (extracted from mel cepstral coefficients, as well as modulation spectra) and modeling strategies, where recent convolutional deep neural network models were used. Results on the Challenge validation set show several of the submitted methods outperforming the baseline model.

System characteristics
Sampling rate 44.1kHz
Data augmentation Sox distortions, SpecAugment
Features log-mel energies; modulation spectra; log-mel energies, modulation spectra
Classifier ResNet; TDNN; CNN, ResNet12, ResNet18, TDNN
Decision making average
PDF

Acoustic Scene Classification Using Multi-Channel Audio Feature with Convolutional Neural Networks and Subtract Filter Augmentation

Jaehun Kim
AI Research Lab, IVS Inc, Seoul, South Korea

Abstract

This paper presents a multi-channel audio feature that uses an ImageNet model based on convolutional neural networks for DCASE 2020 Task 1A, Acoustic Scene Classification with Multiple Devices. We use the TAU Urban Acoustic Scenes 2020 Mobile dataset, which consists of 10-second audio clips from 10 scenes. We propose a multi-channel audio feature so that ImageNet pre-trained model weights can be used. We also propose filtered augmentation for audio recorded by other devices. The multi-channel feature consists of the log-mel spectrograms of the raw signal and of its harmonic and percussive components (HPSS). We also use EfficientNet pre-trained model weights.

System characteristics
Sampling rate 44.1kHz
Data augmentation subtract filter
Features HPSS, log-mel energies
Classifier CNN
PDF

CP-JKU Submissions to DCASE’20: Low-Complexity Cross-Device Acoustic Scene Classification with RF-Regularized CNNs

Khaled Koutini, Florian Henkel, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes the CP-JKU team’s submission for Task 1 - Subtask A (Acoustic Scene Classification with Multiple Devices) and Subtask B (Low-Complexity Acoustic Scene Classification) of the DCASE-2020 challenge. For Subtask 1A, we provide our Receptive Field (RF) regularized CNN model as a baseline, and additionally explore the use of two different domain adaptation objectives in the form of the Maximum Mean Discrepancy (MMD) and the Sliced Wasserstein Distance (SWD). For Subtask 1B, we investigate different parameter reduction methods such as Pruning and Knowledge Distillation (KD). Additionally, we incorporate a decomposed convolutional layer that reduces the number of nonzero parameters in our models while only slightly decreasing the accuracy compared to the full-parameter baseline.
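As a rough illustration of the MMD objective named in the abstract, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between two small sets of embedding vectors. The RBF kernel, the `gamma` value, and the toy `source` and `target` vectors are assumptions made for this example; the abstract does not say which kernel or estimator the CP-JKU systems use.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors; gamma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical embeddings of clips recorded by two different devices.
source = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
target = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]

same_gap = mmd_squared(source, source)     # identical sets: no discrepancy
shifted_gap = mmd_squared(source, target)  # shifted sets: positive discrepancy
```

When used as a training objective, a term of this kind is typically added to the classification loss so that embeddings from different devices are pulled toward the same distribution.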

System characteristics
Sampling rate 22.05kHz
Data augmentation mixup
Features Perceptually-weighted log-mel energies
Classifier RF-regularized CNNs
PDF

The CAU-ET Acoustic Scenery Classification System for DCASE 2020 Challenge

Yerin Lee1, Soyoung Lim1 and Il-Youp Kwak2
1Statistics Dept., Chung-Ang University, Seoul, South Korea, 2Department of Applied Statistics, Chung-Ang University, Seoul, South Korea

Abstract

The acoustic scenery classification problem is an interesting topic that has been studied for a long time through the DCASE competition. This technical report presents the CAU-ET team’s scenery detection system submitted to the DCASE 2020 challenge, Task 1. In our method we generate mel-spectrograms from audio. From the log-mel spectrogram, we obtain deltas, delta-deltas and harmonic-percussive source separation (HPSS) features as inputs to our deep neural network models. The classification accuracy of the proposed system on the development dataset was 66.26% in subtask A and 95.27% in subtask B.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, deltas, delta-deltas, HPSS
Classifier CNN, ResNet, LCNN, InceptionLike, ensemble; CNN
Decision making average
PDF

Acoustic Scene Classification with Various Deep Classifiers

Yue Liu, XinYuan Zhou and YanHua Long
The College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China

Abstract

In this report, we describe the SHNU team’s submission to the DCASE-2020 challenge Task1-A (Acoustic Scene Classification with Multiple Devices). In our submissions, three different deep models are investigated. The first one is a ResNet-based model with receptive-field regularization. The second one is a common two-dimensional CNN model with a perceptually weighted power spectrogram as input. The third one is a self-attention based model using only the Transformer encoder architecture, specially designed for acoustic scene classification. In addition, we propose a device-enhancement data augmentation method, used together with conventional mixup and SpecAugment to improve the model's robustness to multiple devices. Experimental results on the fold1 validation set show that these models are complementary to some extent. We prepared all of our submissions without the use of any external data except for the official baseline embeddings. Logistic regression score fusion is used to fuse the softmax outputs of the single systems.

System characteristics
Sampling rate 22.05kHz; 44.1kHz; 22.05kHz,44.1kHz
Data augmentation mixup, deviceaugment; mixup; SpecAugment
Features perceptual weighted power spectrogram; log-mel energies
Embeddings OpenL3
Classifier ResNet, Receptive Field Regularization; CNN; Self-attention; ResNet, Receptive Field Regularization, CNN, MLP
PDF

Acoustic Scene Classification Using Ensembles of Deep Residual Networks and Spectrogram Decompositions

Yingzi Liu, Shengwang Jiang, Chuang Shi and Huiyong Li
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This technical report describes ensembles of convolutional neural networks (CNNs) for the task 1 / subtask B of the DCASE 2020 challenge, with emphasis on the use of a deep residual network applied to different spectrogram decompositions. The harmonic percussive source separation (HPSS), nearest neighbor filter (NNF), vocal separation and head-related transfer function (HRTF) are used to augment the acoustic features. Our system achieves higher classification accuracies and lower log loss on the development dataset than the baseline system.

System characteristics
Sampling rate 44.1kHz
Data augmentation HPSS, NNF, vocal separation, HRTF
Features log-mel energies
Classifier ResNet
Decision making average; stacking
PDF

Ensemble of Convolutional Neural Networks for the DCASE 2020 Acoustic Scene Classification Challenge

Paulo Lopez-Meyer1, Juan Antonio Del Hoyo Ontiveros1, Georg Stemmer2, Lama Nachman3 and Jonathan Huang4
1Intel Labs, Intel Corporation, Jalisco, Mexico, 2Intel Labs, Intel Corporation, Neubiberg, Germany, 3Intel Labs, Intel Corporation, California, USA, 4California, USA

Abstract

For the DCASE 2020 Task 1a, we propose the use of three different deep learning based convolutional neural network architectures: AclNet, AclResNet50, and Vgg12. These three neural network architectures were pre-trained with Audioset data for embedding generation, and then fine-tuned with an added classification layer, through the development dataset provided by the task. The outputs produced by these trained models proved to be complementary when ensembled, as expected, due to the different nature of the feature front-ends and the diversity of architectures. The ensemble average of these models’ outputs improved significantly from best single model classification accuracy of 67.55% to 69.74% on the evaluation dataset, when trained with the challenge suggested development partitioning.
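The ensemble averaging described in the abstract can be sketched as follows: each model produces a vector of class probabilities for a clip, the vectors are averaged, and the class with the highest averaged probability is chosen. The function name `average_ensemble` and the three probability vectors are hypothetical and serve only to illustrate the decision rule; they are not taken from the report.

```python
def average_ensemble(model_probs):
    """Average per-class probabilities across models and pick the top class.

    model_probs: list of probability vectors, one per model, all the same length.
    Returns (averaged_probs, predicted_class_index).
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Hypothetical softmax outputs of three models for one clip and three scene classes.
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
]
avg, pred = average_ensemble(outputs)
```

In this toy case the first model alone would pick class 0, while the averaged scores favour class 1, which is the kind of complementarity an ensemble is meant to exploit.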

System characteristics
Sampling rate 16kHz
Data augmentation random noise, random gain, random cropping, mixup, SpecAugment
Features raw waveform, mel filterbank
Classifier CNN, ResNet, VGG, ensemble
Decision making average
PDF

Task 1 DCASE 2020: ASC with Mismatch Devices and Reduced Size Model Using Residual Squeeze-Excitation CNNs

Javier Naranjo-Alcazar1,2, Sergi Perez-Castanos3, Pedro Zuccarello3 and Maximo Cobos2
1AI department, Visualfy, Benisano, Spain, 2Computer Science Department, Universitat de Valencia, Burjassot, Spain, 3AI department, Visualfy, Benisano, Valencia

Abstract

Acoustic Scene Classification (ASC) is a problem related to the field of machine listening whose objective is to classify/tag an audio clip with a predefined label describing a scene location such as park or airport, among others. Due to the emergence of more extensive audio datasets, solutions based on Deep Learning techniques have become the state-of-the-art. The most common choices are those that implement a convolutional neural network (CNN) having previously transformed the audio signal into a 2D representation. This two-dimensional audio representation is currently a subject of research. In addition, there are solutions that propose several concatenated 2D representations, thus creating a representation with several input channels. This article proposes two novel stereo audio representations to maximize the accuracy of an ASC framework. These representations correspond to the 3-channel representations such as the left channel, the right channel and the difference between channels (L − R) using the Gammatone filter bank and the harmonic, percussive and difference between channels sources using the Mel filter bank. Both representations are also concatenated creating a 6-channel representation with different audio filter banks. Furthermore, the proposed CNN is a residual network that employs squeeze-excitation techniques in its residual blocks in a novel way to force the network to extract meaningful features from the audio representation. The proposed network is used in both subtasks with different modifications to meet the requirements of each one. However, since stereo audio is not available in Subtask A, the representations are slightly modified in that task. This technical report first presents the overlaps of the two tasks and then makes the relevant changes to each task in one section per task. The baselines are surpassed in both tasks by approximately 10 percentage points.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features Gammatone; HPSS, log-mel energies
Classifier CNN
PDF

Classification of Acoustic Scenes Based on Modulation Spectra and the Cepstrum of the Cross Correlation Between Binaural Audio Channels

Arturo Paniagua, Rubén Fraile, Juana M. Gutiérrez-Arriola, Nicolás Sáenz-Lechón and Víctor J. Osma-Ruiz
CITSEM, Universidad Politécnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses one audio channel for calculating the spectral distribution of energy across auditory-relevant frequency bands, and some descriptors of the envelope modulation spectrum (EMS) obtained by means of the discrete cosine transform. When the stereophonic signal captured by a binaural microphone is available, this parameter set is augmented by including the first coefficients of the cepstrum of the cross-correlation between both audio channels. This cross-correlation contains information on the angular distribution of acoustic sources. These three types of features (energy spectrum, EMS and cepstrum of cross-correlation) are used as inputs for a multilayer perceptron with two hidden layers and a number of adjustable parameters below 15,000.

System characteristics
Sampling rate 44.1kHz
Features LTAS, envelope modulation spectrum
Classifier MLP
Decision making average log-likelihood
PDF

Thuee Submission for DCASE 2020 Challenge Task1a

Yunfei Shao1, Xinxin Ma2, Yong Ma2 and Wei-Qiang Zhang1
1Department of Electronic Engineering, Tsinghua University, Beijing, China, 2School of Physics and Electronic Engineering, Jiangsu Normal University, Xuzhou, China

Abstract

In this report, we describe our submission for Task 1a of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge: Acoustic Scene Classification with Multiple Devices. Our methods are mainly based on two types of deep learning models: ResNet and Mini-SegNet. In our submissions, we designed two classification systems. Firstly, we applied spectrum correction to combat mismatched frequency responses, and further proposed in log-mel domain. Then these features are fed to ResNet or Mini-SegNet models for feature learning. In order to prevent overfitting, we adopted mixup, ImageDataGenerator and temporal cropping for data augmentation. Besides, we tried an ensemble of multiple subsystems to enhance the generalization capability of our system. In our work, our final system achieved an average accuracy of 75.02% across the different devices in the development dataset.
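Mixup, which this abstract and many of the other submissions list as an augmentation, blends two training examples and their labels with a random weight. The following is a minimal sketch using only the standard library; the function name `mixup`, the Beta parameter `alpha=0.4`, and the toy feature vectors are assumptions for illustration, not settings reported by the authors.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Blend two examples and their one-hot labels with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Hypothetical flattened features and one-hot labels for two clips.
rng = random.Random(0)  # seeded generator so the sketch is repeatable
x_mix, y_mix, lam = mixup(
    [1.0, 2.0, 3.0], [1.0, 0.0],
    [3.0, 2.0, 1.0], [0.0, 1.0],
    rng=rng,
)
```

The mixed label is a soft target, so the training loss has to accept probability vectors rather than integer class indices.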

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, ImageDataGenerator, temporal cropping
Features log-mel energies
Classifier ResNet, Mini-SegNet
PDF

Audio Tagging and Deep Architectures for Acoustic Scene Classification: UOS Submission for the DCASE 2020 Challenge

Hye-jin Shim, Ju-ho Kim, Jee-weon Jung and Ha-jin Yu
School of Computer Science, University of Seoul, Seoul, South Korea

Abstract

In this technical report, we address the UOS submission for the Detection and Classification of Acoustic Scenes and Events 2020 Challenge Task 1-a. We propose to utilize the representation vectors, extracted from a pre-trained audio tagging system, for the acoustic scene classification task. Audio tagging denotes the existence of various sound events and is known to help the classification of acoustic scenes. To select a suitable feature for the acoustic scene classification task, we also explore deep architectures such as light convolutional neural networks and the convolutional block attention module. Experiments are conducted using the official fold-1 configuration test set. Results using audio tagging representation and deep architectures demonstrate accuracies of 68.8% and 70.5%, compared to that of 65.3% of the baseline. Additionally, a score-sum ensemble of the two proposed systems has an accuracy of 71.9%, which is a 10.1% relative improvement over the baseline.
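The relative improvement quoted in the abstract follows from the two accuracies it reports: an absolute gain of 6.6 percentage points over a 65.3% baseline. The short check below reproduces the arithmetic; the helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(baseline, system):
    """Relative gain of `system` over `baseline`, in percent of the baseline."""
    return (system - baseline) / baseline * 100.0


baseline_acc = 65.3  # baseline accuracy (%) from the abstract
ensemble_acc = 71.9  # score-sum ensemble accuracy (%) from the abstract

absolute_gain = ensemble_acc - baseline_acc                        # percentage points
relative_gain = relative_improvement(baseline_acc, ensemble_acc)   # about 10.1 %
```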

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, SpecAugment; mixup
Features mel spectrogram
Classifier ensemble; LCNN; ResNet
Decision making score-sum
PDF

Designing Acoustic Scene Classification Models with CNN Variants

Sangwon Suh, Sooyoung Park, Youngho Jeong and Taejin Lee
Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, South Korea

Abstract

This technical report describes our Acoustic Scene Classification systems for DCASE2020 challenge Task1. For subtask A, we designed a single model implemented with three parallel ResNets, which is named Trident ResNet. We have confirmed that this structure is beneficial when analyzing samples collected from minority or unseen devices, and confirmed 73.7% classification accuracy for the test split. For subtask B, we used the Inception module to build a Shallow Inception model that has fewer parameters than the CNN of the DCASE baseline system. Due to the sparse structure of the Inception module, we have enhanced the accuracy of the model up to 97.6%, while reducing the number of parameters.

System characteristics
Sampling rate 44.1kHz
Data augmentation temporal cropping, mixup
Features log-mel energies, deltas, delta-deltas
Classifier ResNet; Snapshot
Decision making average; weighted score average
PDF

Acoustic Scene Classification Using EfficientNet

Jakub Swiecicki
None, Warsaw, Poland

Abstract

This technical report describes our solution to task 1b of the DCASE 2020 acoustic scene classification challenge. Our primary focus was to develop a single efficient model. We decided to concentrate on a single model in order to reflect the typical business situation. In our solution we chose to use log-mel spectrograms with delta and delta-delta features as the sound sample representation. We augmented the data with multiple techniques: mixup, SpecAugment, and spectrogram resizing. Our final model used the EfficientNet [1] architecture.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, SpecAugment, random resize, random cropping
Features log-mel energies
Classifier EfficientNet
Decision making average
PDF

Acoustic Scene Classification Using Fully Convolutional Neural Networks and Per-Channel Energy Normalization

Konstantinos Vilouras
Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract

This technical report describes our approach to Task 1 “Acoustic Scene Classification” of the DCASE 2020 challenge. For subtask A, we introduce per-channel energy normalization (PCEN) as an additional preprocessing step along with log-Mel spectrograms. We also propose two residual network architectures utilizing “Shake-Shake” regularization and the “Squeeze-and-Excitation” block, respectively. Our best submission (ensemble of 8 classifiers) outperforms the corresponding baseline system by 16.2% in terms of macro-average accuracy. For subtask B, we mainly focus on a low complexity, fully convolutional neural network architecture, which leads to 5% relative improvement over baseline accuracy.
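Per-channel energy normalization, mentioned above as a preprocessing step, divides each band's energy by a smoothed version of itself and then applies a root compression. The sketch below implements the commonly cited formulation for a single frequency band using plain Python lists. The parameter defaults (`s`, `alpha`, `delta`, `r`, `eps`) and the choice to initialise the smoother with the first frame are assumptions for illustration, not values taken from the report.

```python
def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to one frequency band given as non-negative energies over time."""
    if not energies:
        return []
    out = []
    smoothed = energies[0]  # assumption: start the smoother at the first frame
    for e in energies:
        # First-order IIR smoother tracking the slowly varying band energy.
        smoothed = (1.0 - s) * smoothed + s * e
        # Adaptive gain control followed by root compression.
        out.append((e / (eps + smoothed) ** alpha + delta) ** r - delta ** r)
    return out


# Hypothetical band energies; with alpha = 1 the output barely depends on gain.
band = [1.0, 2.0, 0.5, 4.0]
quiet = pcen(band, alpha=1.0)
loud = pcen([10.0 * e for e in band], alpha=1.0)
```

For a full spectrogram the same function would be applied to every mel band independently. The near-identical `quiet` and `loud` outputs illustrate why this kind of normalization is attractive when recordings come from devices with different gains.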

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, time stretching, frequency masking, shifting, clipping distortion
Features log-mel energies, PCEN
Classifier ResNet, ensemble
Decision making average
PDF

Mel-Scaled Wavelet-Based Features for Sub-Task A and Texture Features for Sub-Task B of DCASE 2020 Task 1

Shefali Waldekar, Kishore Kumar A and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 for Task 1 (acoustic scene classification (ASC)), sub-task A (ASC with Multiple Devices) and sub-task B (Low-Complexity ASC). The systems exploit time-frequency representations of audio to obtain the scene labels. The system for Task 1A follows a simple pattern classification framework employing wavelet transform based mel-scaled features along with a support vector machine (SVM) as classifier. Texture features, namely Local Binary Patterns (LBP) extracted from the log of mel-band energies, are used in a similar classification framework for Task 1B. The proposed systems outperform the deep-learning based baseline system with the development dataset provided for the respective sub-tasks.

System characteristics
Sampling rate 44.1kHz
Features MFDWC
Classifier SVM
PDF

Acoustic Scene Classification with Multiple Decision Schemes

Helin Wang, Dading Chong and Yuexian Zou
School of ECE, Peking University, Shenzhen, China

Abstract

This technical report describes the ADSPLAB team’s submission for Task1 of DCASE2020 challenge. Our acoustic scene classification (ASC) system is based on the convolutional neural networks (CNN). Multiple decision schemes are proposed in our system, including the decision schemes in multiple representations, multiple frequency bands, and multiple temporal frames. The final system is the fusion of models with multiple decision schemes and models pre-trained on AudioSet. The experimental results show that our system could achieve the accuracy of 84.5% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation dataset of Task1A and Task1B, respectively.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features MFCC, log-mel energies, CQT, Gammatone
Classifier CNN, ensemble
Decision making average
PDF

Acoustic Scene Classification with Device Mismatch Using Data Augmentation by Spectrum Correction

Peiyao Wang, Zhiyuan Cheng and Xinkang Xu
Speech Group, Hithink RoyalFlush Information Network Co., Ltd, Hangzhou, China

Abstract

This report describes the submissions by RoyalFlush to DCASE 2020 Task 1a. Our aim is to find an audio scene classification system that is robust against multiple devices. We use log-mel energies and their first and second derivatives as input features. We use fully convolutional deep neural networks as the classification model, and strategies such as pre-activation, L2 regularization, dropout and feature normalization were applied. To reduce the data imbalance caused by the different devices, we tried to generate more training data by using a device-related spectrum correction method.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, spectrum correction
Features log-mel energies
Classifier CNN, ensemble
Decision making average
PDF

Robust Feature Learning for Acoustic Scene Classification with Multiple Devices

Yuzhong Wu and Tan Lee
Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

Abstract

This technical report describes our submission for Task 1A of the DCASE 2020 challenge. The objective of the task is to identify acoustic scenes from audio recorded by various recording devices. In our ASC systems, we use a sound-duration based decomposition method to decompose the time-frequency (TF) features into 3 components. Our observation shows that the low frequency bins of the long-duration component image are most easily affected by a change of recording device. We use an AlexNet-like CNN model with the decomposed TF features to build ASC systems. To prevent the CNN classifier from over-fitting to the seen recording devices in the training dataset, we apply an auxiliary classifier on the embedding feature extracted from the long-duration component image. We propose the regularized cross-entropy (RCE) loss to train the auxiliary classifier. Experimental results on the development dataset show that the use of the regularized cross-entropy loss significantly improves the CNN accuracy on audio from unseen devices.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features wavelet filter-bank features
Classifier CNN
Decision making average
PDF

Simple Convolutional Networks Attempting Acoustic Scene Classification Cross Devices

Chi Zhang1, Hanxin Zhu2 and Cheng Ting3
1Electronic Information Engineering, University of Electronic Science and Technology of China, Chengdu, China, 2Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China, 3University of Electronic Science and Technology of China, Chengdu, China

Abstract

This technical report describes our submission for task1a (Acoustic Scene Classification with Multiple Devices) of the DCASE 2020 Challenge. The results of DCASE 2019 show that convolutional neural networks (CNNs) can achieve excellent classification accuracies. Our work is likewise based on convolutional neural networks. We consider two feature extraction methods that are provided by the OpenL3 library. Finally, our method improves the classification accuracy by 2% compared to the baseline system.

System characteristics
Sampling rate 44.1kHz
Features log-mel energies
Embeddings OpenL3
Classifier MLP, CNN
Decision making maximum likelihood
PDF