Low-Complexity Acoustic
Scene Classification


Challenge results

Task description

This subtask is concerned with the classification of audio into three higher-level classes, with a focus on low-complexity solutions. All submitted systems had to comply with the task rules by keeping the acoustic model size under 500 KB (only classification-related non-zero parameters counted). See model size calculation examples here.
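As a rough sketch of the size rule, a model's reported size follows from its count of non-zero parameters and the bit width of their numeric representation. The helper below is illustrative only: it assumes 1 KB = 1024 bytes, which reproduces the baseline row of the complexity table, though individual submissions may have reported sizes under slightly different conventions.

```python
def model_size_kb(non_zero_params: int, bits_per_param: int = 32) -> float:
    """Approximate acoustic model size in KB.

    Counts only non-zero parameters, per the task rules; bits_per_param
    reflects the storage format (32 for float32, 16 for float16, 8 for int8).
    Assumes 1 KB = 1024 bytes.
    """
    return non_zero_params * (bits_per_param / 8) / 1024

# The DCASE2020 baseline stores 115219 float32 parameters:
print(round(model_size_kb(115_219, 32), 1))  # -> 450.1, matching the reported size
```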

The development set contains data from 10 cities. The total amount of audio in the development set is 40 hours. The evaluation set contains data from 12 cities (2 cities unseen in the development set). Evaluation data contains 30 hours of audio.

A more detailed task description can be found on the task description page.

Systems ranking

Submission label | Name | Technical report | Official system rank | Eval. accuracy (with 95% confidence interval) | Eval. logloss | Dev. accuracy | Dev. logloss
Chang_QTI_task1b_1 QTI1 Chang2020 12 95.0 (94.6 - 95.5) 0.228 97.9 0.172
Chang_QTI_task1b_2 QTI2 Chang2020 30 93.2 (92.9 - 93.5) 0.232 97.8 0.159
Chang_QTI_task1b_3 QTI3 Chang2020 15 94.8 (94.2 - 95.3) 0.224 98.0 0.144
Chang_QTI_task1b_4 QTI4 Chang2020 19 94.4 (93.8 - 95.1) 0.237 97.8 0.170
Dat_HCMUni_task1b_1 HCM_Group Dat2020 57 89.5 (89.5 - 89.5) 0.648 94.5
Farrugia_IMT-Atlantique-BRAIn_task1b_1 IMT_BRAINa Pajusco2020 77 85.4 (84.9 - 85.8) 0.379 87.6 0.360
Farrugia_IMT-Atlantique-BRAIn_task1b_2 IMT_BRAINb Pajusco2020 48 90.6 (90.0 - 91.2) 0.270 90.9 0.288
Farrugia_IMT-Atlantique-BRAIn_task1b_3 IMT_BRAINc Pajusco2020 73 86.6 (85.9 - 87.3) 0.384 87.6 0.380
Farrugia_IMT-Atlantique-BRAIn_task1b_4 IMT_BRAINd Pajusco2020 66 88.4 (87.9 - 88.9) 0.286 91.2 0.269
Feng_TJU_task1b_1 CNN-BDG Feng2020 86 72.3 (70.7 - 73.9) 1.728 91.8 0.469
Feng_TJU_task1b_2 CNN-BDG Feng2020 83 81.9 (81.6 - 82.2) 1.189 90.6 0.594
Feng_TJU_task1b_3 CNN-BDG Feng2020 84 80.7 (80.4 - 81.0) 1.302 90.6 0.644
Feng_TJU_task1b_4 CNN-BDG Feng2020 85 79.9 (79.3 - 80.4) 1.281 90.2 0.629
DCASE2020 baseline Baseline 89.5 (88.8 - 90.2) 0.401 88.0 0.481
Helin_ADSPLAB_task1b_1 Helin1b Wang2020_t1 42 91.6 (91.1 - 92.0) 0.227 92.1 0.312
Helin_ADSPLAB_task1b_2 Helin2b Wang2020_t1 41 91.6 (91.2 - 92.0) 0.233 92.1 0.312
Helin_ADSPLAB_task1b_3 Helin3b Wang2020_t1 43 91.6 (91.1 - 92.0) 0.230 92.1 0.312
Helin_ADSPLAB_task1b_4 Helin4b Wang2020_t1 44 91.3 (91.0 - 91.6) 0.264 92.1 0.312
Hu_GT_task1b_1 Hu_GT_1b_1 Hu2020 7 95.8 (95.5 - 96.1) 0.357 96.3 0.349
Hu_GT_task1b_2 Hu_GT_1b_2 Hu2020 10 95.5 (95.1 - 95.8) 0.367
Hu_GT_task1b_3 Hu_GT_1b_3 Hu2020 3 96.0 (95.5 - 96.5) 0.122 96.7
Hu_GT_task1b_4 Hu_GT_1b_4 Hu2020 5 95.8 (95.3 - 96.3) 0.131 96.3
Kalinowski_SRPOL_task1b_4 kalinowski Kalinowski2020 31 93.1 (92.7 - 93.5) 1.532 95.4 0.217
Koutini_CPJKU_task1b_1 decomposed Koutini2020 16 94.7 (94.5 - 94.9) 0.164 96.1 0.140
Koutini_CPJKU_task1b_2 RFD-prune Koutini2020 1 96.5 (96.2 - 96.8) 0.101 97.3 0.080
Koutini_CPJKU_task1b_3 RFDsmall Koutini2020 8 95.7 (95.5 - 95.9) 0.113 97.1 0.090
Koutini_CPJKU_task1b_4 RFensem Koutini2020 2 96.2 (95.9 - 96.5) 0.105 97.0 0.090
Kowaleczko_SRPOL_task1b_3 pkowdcase Kalinowski2020 52 90.1 (89.6 - 90.7) 0.356 92.8 0.256
Kwiatkowska_SRPOL_task1b_1 ens3-10mix Kalinowski2020 36 92.6 (92.0 - 93.2) 0.200 94.6 0.175
Kwiatkowska_SRPOL_task1b_2 ens3to10 Kalinowski2020 27 93.5 (93.0 - 94.0) 0.168 94.5 0.170
LamPham_Kent_task1b_1 LamPham Pham2020 59 89.4 (89.2 - 89.7) 0.332 93.0
LamPham_Kent_task1b_2 LamPham Pham2020 71 87.0 (86.1 - 87.8) 0.349 91.9
LamPham_Kent_task1b_3 LamPham Pham2020 79 84.7 (84.5 - 85.0) 0.402 90.5
Lee_CAU_task1b_1 CAUET Lee2020 47 90.7 (90.7 - 90.7) 0.302 95.3 0.167
Lee_CAU_task1b_2 CAUET Lee2020 23 93.9 (93.7 - 94.1) 0.156 95.3 0.167
Lee_CAU_task1b_3 CAUET Lee2020 46 91.1 (91.0 - 91.2) 0.246 93.7 0.193
Lee_CAU_task1b_4 CAUET Lee2020 45 91.2 (91.2 - 91.2) 0.864 92.8 0.500
Lopez-Meyer_IL_task1b_1 INT8CNN Lopez-Meyer2020_t1b 50 90.4 (89.6 - 91.1) 0.681 90.9 0.645
Lopez-Meyer_IL_task1b_2 PrunCNN Lopez-Meyer2020_t1b 53 90.1 (89.7 - 90.5) 0.677 91.5 0.637
Lopez-Meyer_IL_task1b_3 KD-CNN Lopez-Meyer2020_t1b 49 90.5 (89.8 - 91.2) 0.276 90.3 0.673
Lopez-Meyer_IL_task1b_4 GCC-CNN Lopez-Meyer2020_t1b 56 89.7 (88.8 - 90.5) 0.983 91.2 0.510
McDonnell_USA_task1b_1 UniSA_1b1 McDonnell2020 13 94.9 (94.9 - 95.0) 0.135 97.1 0.094
McDonnell_USA_task1b_2 UniSA_1b2 McDonnell2020 9 95.5 (95.3 - 95.7) 0.118 97.1 0.094
McDonnell_USA_task1b_3 UniSA_1b3 McDonnell2020 4 95.9 (95.7 - 96.1) 0.117 97.1 0.094
McDonnell_USA_task1b_4 UniSA_1b4 McDonnell2020 6 95.8 (95.6 - 96.0) 0.119 97.1 0.094
Monteiro_INRS_task1b_1 MelCNN Joao2020 69 87.4 (86.5 - 88.3) 0.327
Naranjo-Alcazar_Vfy_task1b_1 ASCCSSE Naranjo-Alcazar2020_t1 24 93.6 (93.4 - 93.7) 0.202 97.1 0.132
Naranjo-Alcazar_Vfy_task1b_2 ASCCSSE Naranjo-Alcazar2020_t1 25 93.6 (93.4 - 93.8) 0.190 97.0 0.104
NguyenHongDuc_SU_task1b_1 NHD_1B_1 Nguyen_Hong_Duc2020 32 93.1 (92.6 - 93.5) 0.215 92.4 0.230
NguyenHongDuc_SU_task1b_2 NHD_1B_2 Nguyen_Hong_Duc2020 37 92.3 (91.9 - 92.6) 0.214 92.3 0.230
Ooi_NTU_task1b_1 Ooi_model1 Ooi2020 67 87.8 (87.1 - 88.6) 0.337 89.4
Ooi_NTU_task1b_2 Ooi_model2 Ooi2020 70 87.3 (86.6 - 88.1) 0.367 88.6
Ooi_NTU_task1b_3 Ooi_model3 Ooi2020 55 89.8 (89.0 - 90.5) 0.257 91.5
Ooi_NTU_task1b_4 Ooi_model4 Ooi2020 54 89.8 (89.1 - 90.5) 0.305 90.5
Paniagua_UPM_task1b_1 Pan_UPM Paniagua2020 60 89.4 (89.0 - 89.8) 0.347 87.8
Patki_SELF_task1b_1 PATKI Patki2020 76 86.0 (85.8 - 86.3) 1.372 88.7 0.000
Patki_SELF_task1b_2 PATKI Patki2020 61 89.4 (89.0 - 89.7) 0.951 88.9 0.000
Patki_SELF_task1b_3 PATKI Patki2020 82 83.7 (81.8 - 85.7) 1.837 87.0 0.170
Phan_UIUC_task1b_1 DD_1b_1 Phan2020_t1 65 88.5 (87.8 - 89.2) 0.319 89.5 0.289
Phan_UIUC_task1b_2 DD_1b_2 Phan2020_t1 62 89.2 (88.7 - 89.8) 0.283 89.5 0.292
Phan_UIUC_task1b_3 DD_1b_3 Phan2020_t1 63 89.0 (88.7 - 89.3) 0.301 90.3 0.254
Phan_UIUC_task1b_4 DD_1b_4 Phan2020_t1 58 89.5 (88.9 - 90.0) 0.282 90.4 0.275
Sampathkumar_TUC_task1b_1 AALNet-94 Sampathkumar2020 68 87.5 (87.1 - 87.9) 0.864 89.4 0.635
Singh_IITMandi_task1b_1 IITMandi Singh2020 81 84.5 (83.5 - 85.6) 0.418 84.9 0.422
Singh_IITMandi_task1b_2 IITMandi Singh2020 80 84.7 (83.5 - 85.9) 0.420 85.9 0.416
Singh_IITMandi_task1b_3 IITMandi Singh2020 78 85.2 (84.6 - 85.8) 0.402 86.8 0.399
Singh_IITMandi_task1b_4 IITMandi Singh2020 75 86.4 (85.0 - 87.8) 0.385 87.2 0.378
Suh_ETRI_task1b_1 Incep_Dev Suh2020 29 93.3 (93.2 - 93.4) 0.302 97.6 0.259
Suh_ETRI_task1b_2 Incep_Eval Suh2020 18 94.6 (94.4 - 94.7) 0.270 97.6 0.259
Suh_ETRI_task1b_3 Incep_Ensb Suh2020 11 95.1 (94.9 - 95.2) 0.277 97.5 0.271
Suh_ETRI_task1b_4 Incep_wEsb Suh2020 17 94.6 (94.5 - 94.8) 0.271 97.7 0.260
Vilouras_AUTh_task1b_1 VilFCN Vilouras2020 40 91.8 (91.2 - 92.5) 0.215 92.3 0.211
Waldekar_IITKGP_task1b_1 LogMBE-LBP Waldekar2020 64 88.6 (88.2 - 89.1) 7.923 90.0
Wu_CUHK_task1b_1 CNN4Blocks Wu2020_t1b 22 94.2 (94.0 - 94.3) 0.188 95.8
Wu_CUHK_task1b_2 ensemble_2 Wu2020_t1b 21 94.2 (94.1 - 94.3) 0.201 96.2
Wu_CUHK_task1b_3 ensemble_3 Wu2020_t1b 20 94.3 (94.3 - 94.4) 0.185 96.3
Wu_CUHK_task1b_4 diff_feat2 Wu2020_t1b 14 94.9 (94.7 - 95.1) 0.218 96.5
Yang_UESTC_task1b_1 CNNs Haocong2020 38 92.1 (91.7 - 92.4) 0.272 94.9 0.237
Yang_UESTC_task1b_2 CNNs_PAE Haocong2020 28 93.5 (93.3 - 93.7) 0.247 95.9 0.200
Yang_UESTC_task1b_3 CNNs_Cyc Haocong2020 26 93.5 (93.3 - 93.8) 0.228 96.0 0.187
Yang_UESTC_task1b_4 CNNs_4CV Haocong2020 51 90.4 (88.7 - 92.0) 0.327 92.0 0.305
Zhang_BUPT_task1b_1 BUPTSystem Zhang2020 39 92.0 (91.6 - 92.4) 0.346 93.5 0.481
Zhang_BUPT_task1b_2 BUPTSystem Zhang2020 35 92.7 (92.1 - 93.2) 0.334 93.5 0.481
Zhang_BUPT_task1b_3 BUPTSystem Zhang2020 34 92.9 (92.3 - 93.5) 0.316 93.5 0.481
Zhang_BUPT_task1b_4 BUPTSystem Zhang2020 33 93.0 (92.4 - 93.6) 0.316 93.5 0.481
Zhao_JNU_task1b_1 DD-CNN Zhao2020 74 86.6 (86.5 - 86.7) 0.867 92.0 0.257
Zhao_JNU_task1b_2 DD-CNN Zhao2020 72 86.9 (86.8 - 87.0) 0.873 91.1 0.343
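The accuracy intervals above can be reproduced in spirit with a normal-approximation binomial confidence interval. The organizers' exact method and the number of evaluation segments are not stated here, so the counts below are hypothetical and the snippet is a sketch only.

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for accuracy = correct/total."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# Hypothetical counts yielding 95.0% accuracy over 10000 test segments:
lo, hi = accuracy_ci(9500, 10_000)
print(f"95.0 ({lo * 100:.1f} - {hi * 100:.1f})")  # -> 95.0 (94.6 - 95.4)
```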

Teams ranking

Table includes only the best-performing system per submitting team.

Submission label | Name | Technical report | Official system rank | Team rank | Eval. accuracy (with 95% confidence interval) | Eval. logloss | Dev. accuracy | Dev. logloss
Chang_QTI_task1b_1 QTI1 Chang2020 12 5 95.0 (94.6 - 95.5) 0.228 97.9 0.172
Dat_HCMUni_task1b_1 HCM_Group Dat2020 57 20 89.5 (89.5 - 89.5) 0.648 94.5
Farrugia_IMT-Atlantique-BRAIn_task1b_2 IMT_BRAINb Pajusco2020 48 16 90.6 (90.0 - 91.2) 0.270 90.9 0.288
Feng_TJU_task1b_2 CNN-BDG Feng2020 83 30 81.9 (81.6 - 82.2) 1.189 90.6 0.594
DCASE2020 baseline Baseline 89.5 (88.8 - 90.2) 0.401 88.0 0.481
Helin_ADSPLAB_task1b_2 Helin2b Wang2020_t1 41 15 91.6 (91.2 - 92.0) 0.233 92.1 0.312
Hu_GT_task1b_3 Hu_GT_1b_3 Hu2020 3 2 96.0 (95.5 - 96.5) 0.122 96.7
Kalinowski_SRPOL_task1b_4 kalinowski Kalinowski2020 31 11 93.1 (92.7 - 93.5) 1.532 95.4 0.217
Koutini_CPJKU_task1b_2 RFD-prune Koutini2020 1 1 96.5 (96.2 - 96.8) 0.101 97.3 0.080
Kowaleczko_SRPOL_task1b_3 pkowdcase Kalinowski2020 52 18 90.1 (89.6 - 90.7) 0.356 92.8 0.256
Kwiatkowska_SRPOL_task1b_2 ens3to10 Kalinowski2020 27 10 93.5 (93.0 - 94.0) 0.168 94.5 0.170
LamPham_Kent_task1b_1 LamPham Pham2020 59 22 89.4 (89.2 - 89.7) 0.332 93.0
Lee_CAU_task1b_2 CAUET Lee2020 23 7 93.9 (93.7 - 94.1) 0.156 95.3 0.167
Lopez-Meyer_IL_task1b_3 KD-CNN Lopez-Meyer2020_t1b 49 17 90.5 (89.8 - 91.2) 0.276 90.3 0.673
McDonnell_USA_task1b_3 UniSA_1b3 McDonnell2020 4 3 95.9 (95.7 - 96.1) 0.117 97.1 0.094
Monteiro_INRS_task1b_1 MelCNN Joao2020 69 27 87.4 (86.5 - 88.3) 0.327
Naranjo-Alcazar_Vfy_task1b_1 ASCCSSE Naranjo-Alcazar2020_t1 24 8 93.6 (93.4 - 93.7) 0.202 97.1 0.132
NguyenHongDuc_SU_task1b_1 NHD_1B_1 Nguyen_Hong_Duc2020 32 12 93.1 (92.6 - 93.5) 0.215 92.4 0.230
Ooi_NTU_task1b_4 Ooi_model4 Ooi2020 54 19 89.8 (89.1 - 90.5) 0.305 90.5
Paniagua_UPM_task1b_1 Pan_UPM Paniagua2020 60 23 89.4 (89.0 - 89.8) 0.347 87.8
Patki_SELF_task1b_2 PATKI Patki2020 61 24 89.4 (89.0 - 89.7) 0.951 88.9 0.000
Phan_UIUC_task1b_4 DD_1b_4 Phan2020_t1 58 21 89.5 (88.9 - 90.0) 0.282 90.4 0.275
Sampathkumar_TUC_task1b_1 AALNet-94 Sampathkumar2020 68 26 87.5 (87.1 - 87.9) 0.864 89.4 0.635
Singh_IITMandi_task1b_4 IITMandi Singh2020 75 29 86.4 (85.0 - 87.8) 0.385 87.2 0.378
Suh_ETRI_task1b_3 Incep_Ensb Suh2020 11 4 95.1 (94.9 - 95.2) 0.277 97.5 0.271
Vilouras_AUTh_task1b_1 VilFCN Vilouras2020 40 14 91.8 (91.2 - 92.5) 0.215 92.3 0.211
Waldekar_IITKGP_task1b_1 LogMBE-LBP Waldekar2020 64 25 88.6 (88.2 - 89.1) 7.923 90.0
Wu_CUHK_task1b_4 diff_feat2 Wu2020_t1b 14 6 94.9 (94.7 - 95.1) 0.218 96.5
Yang_UESTC_task1b_3 CNNs_Cyc Haocong2020 26 9 93.5 (93.3 - 93.8) 0.228 96.0 0.187
Zhang_BUPT_task1b_4 BUPTSystem Zhang2020 33 13 93.0 (92.4 - 93.6) 0.316 93.5 0.481
Zhao_JNU_task1b_2 DD-CNN Zhao2020 72 28 86.9 (86.8 - 87.0) 0.873 91.1 0.343
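The Logloss column is the standard multiclass log loss: the mean negative log-probability the system assigned to the correct class. A minimal sketch, using hypothetical three-class predictions (probabilities are clipped to avoid log(0)):

```python
import math

def log_loss(y_true, probs, eps: float = 1e-15) -> float:
    """Mean negative log-probability of the true class (multiclass log loss)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total += -math.log(min(max(p[label], eps), 1 - eps))
    return total / len(y_true)

# Three hypothetical predictions (indoor=0, outdoor=1, transportation=2):
y_true = [0, 1, 2]
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
print(round(log_loss(y_true, probs), 3))  # -> 0.268
```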

System complexity

Submission label | Technical report | Official system rank | Eval. accuracy | Eval. logloss | Parameters | Non-zero parameters | Sparsity | Size (KB) * | Complexity management
Chang_QTI_task1b_1 Chang2020 12 95.0 0.228 601866 245591 0.5919506999896986 491.2 sparsity
Chang_QTI_task1b_2 Chang2020 30 93.2 0.232 601866 245591 0.5919506999896986 491.2 sparsity
Chang_QTI_task1b_3 Chang2020 15 94.8 0.224 601866 245591 0.5919506999896986 491.2 sparsity
Chang_QTI_task1b_4 Chang2020 19 94.4 0.237 601866 245591 0.5919506999896986 491.2 sparsity
Dat_HCMUni_task1b_1 Dat2020 57 89.5 0.648 111366 111366 0.0 445.0
Farrugia_IMT-Atlantique-BRAIn_task1b_1 Pajusco2020 77 85.4 0.379 13632 12160 0.107981220657277 23.8 float16, quantization, pruning
Farrugia_IMT-Atlantique-BRAIn_task1b_2 Pajusco2020 48 90.6 0.270 29888 29888 0.0 58.4 float16, quantization
Farrugia_IMT-Atlantique-BRAIn_task1b_3 Pajusco2020 73 86.6 0.384 398400 130730 0.6718624497991967 255.3 float16, quantization, pruning
Farrugia_IMT-Atlantique-BRAIn_task1b_4 Pajusco2020 66 88.4 0.286 373696 238896 0.3607210138722384 466.6 float16, quantization, pruning
Feng_TJU_task1b_1 Feng2020 86 72.3 1.728 35059 35059 0.0 136.9 optimize the convolution operation and the network structure
Feng_TJU_task1b_2 Feng2020 83 81.9 1.189 60403 60403 0.0 235.9 optimize the convolution operation and the network structure
Feng_TJU_task1b_3 Feng2020 84 80.7 1.302 85747 85747 0.0 334.9 optimize the convolution operation and the network structure
Feng_TJU_task1b_4 Feng2020 85 79.9 1.281 111091 111091 0.0 433.9 optimize the convolution operation and the network structure
DCASE2020 baseline 89.5 0.401 115219 115219 0.0 450.1
Helin_ADSPLAB_task1b_1 Wang2020_t1 42 91.6 0.227 123576 123576 0.0 490.8
Helin_ADSPLAB_task1b_2 Wang2020_t1 41 91.6 0.233 123576 123576 0.0 490.8
Helin_ADSPLAB_task1b_3 Wang2020_t1 43 91.6 0.230 123576 123576 0.0 490.8
Helin_ADSPLAB_task1b_4 Wang2020_t1 44 91.3 0.264 123576 123576 0.0 490.8
Hu_GT_task1b_1 Hu2020 7 95.8 0.357 94028 94028 0.0 375.0 int8, quantization
Hu_GT_task1b_2 Hu2020 10 95.5 0.367 122900 122900 0.0 490.0 int8, quantization
Hu_GT_task1b_3 Hu2020 3 96.0 0.122 122900 122900 0.0 490.0 int8, quantization
Hu_GT_task1b_4 Hu2020 5 95.8 0.131 125121 125121 0.0 499.0 int8, quantization
Kalinowski_SRPOL_task1b_4 Kalinowski2020 31 93.1 1.532 110899 110899 0.0 433.2
Koutini_CPJKU_task1b_1 Koutini2020 16 94.7 0.164 17520 17520 0.0 34.2 float16, conv layers decomposition
Koutini_CPJKU_task1b_2 Koutini2020 1 96.5 0.101 345990 247562 0.28448221046851063 483.5 pruning, float16
Koutini_CPJKU_task1b_3 Koutini2020 8 95.7 0.113 242592 242592 0.0 473.8 float16, smaller width/depth
Koutini_CPJKU_task1b_4 Koutini2020 2 96.2 0.105 556480 249386 0.5518509200690052 487.1 float16, smaller width/depth
Kowaleczko_SRPOL_task1b_3 Kalinowski2020 52 90.1 0.356 110899 110899 0.0 433.2 using rectangular convolution kernels
Kwiatkowska_SRPOL_task1b_1 Kalinowski2020 36 92.6 0.200 107494 107494 0.0 421.0 constraints-aware modelling
Kwiatkowska_SRPOL_task1b_2 Kalinowski2020 27 93.5 0.168 107494 107494 0.0 421.0 constraints-aware modelling
LamPham_Kent_task1b_1 Pham2020 59 89.4 0.332 61636 61636 0.0 246.6
LamPham_Kent_task1b_2 Pham2020 71 87.0 0.349 61636 61636 0.0 61.6 quantization
LamPham_Kent_task1b_3 Pham2020 79 84.7 0.402 61636 30818 0.5 123.2 pruning
Lee_CAU_task1b_1 Lee2020 47 90.7 0.302 126979 126528 0.003551768402649258 494.2
Lee_CAU_task1b_2 Lee2020 23 93.9 0.156 126979 126528 0.003551768402649258 494.2
Lee_CAU_task1b_3 Lee2020 46 91.1 0.246 125827 125376 0.003584286361432709 489.8
Lee_CAU_task1b_4 Lee2020 45 91.2 0.864 127539 126864 0.005292498765083642 495.6
Lopez-Meyer_IL_task1b_1 Lopez-Meyer2020_t1b 50 90.4 0.681 317038 317038 0.0 309.6 quantization
Lopez-Meyer_IL_task1b_2 Lopez-Meyer2020_t1b 53 90.1 0.677 317038 255740 0.1933459080614942 499.5 pruning, quantization
Lopez-Meyer_IL_task1b_3 Lopez-Meyer2020_t1b 49 90.5 0.276 252712 252712 0.0 493.6 knowledge distillation, quantization
Lopez-Meyer_IL_task1b_4 Lopez-Meyer2020_t1b 56 89.7 0.983 252491 252491 0.0 493.1 quantization
McDonnell_USA_task1b_1 McDonnell2020 13 94.9 0.135 3987000 3987000 0.0 486.7 1-bit quantization
McDonnell_USA_task1b_2 McDonnell2020 9 95.5 0.118 3987000 3987000 0.0 486.7 1-bit quantization
McDonnell_USA_task1b_3 McDonnell2020 4 95.9 0.117 3987000 3987000 0.0 486.7 1-bit quantization
McDonnell_USA_task1b_4 McDonnell2020 6 95.8 0.119 3987000 3987000 0.0 486.7 1-bit quantization
Monteiro_INRS_task1b_1 Joao2020 69 87.4 0.327 54468 54468 0.0 218.8
Naranjo-Alcazar_Vfy_task1b_1 Naranjo-Alcazar2020_t1 24 93.6 0.202 127055 127055 0.0 496.3
Naranjo-Alcazar_Vfy_task1b_2 Naranjo-Alcazar2020_t1 25 93.6 0.190 126927 126927 0.0 495.8
NguyenHongDuc_SU_task1b_1 Nguyen_Hong_Duc2020 32 93.1 0.215 122999 122493 0.004113854584183563 478.5
NguyenHongDuc_SU_task1b_2 Nguyen_Hong_Duc2020 37 92.3 0.214 77028 77028 0.0 300.9
Ooi_NTU_task1b_1 Ooi2020 67 87.8 0.337 80839 17115 0.788282883261792 66.9 sparsity
Ooi_NTU_task1b_2 Ooi2020 70 87.3 0.367 167571 34181 0.7960207911870192 133.5 sparsity
Ooi_NTU_task1b_3 Ooi2020 55 89.8 0.257 571766 119756 0.7905506798235642 467.8 sparsity
Ooi_NTU_task1b_4 Ooi2020 54 89.8 0.305 571766 119756 0.7905506798235642 467.8 sparsity
Paniagua_UPM_task1b_1 Paniagua2020 60 89.4 0.347 13197 13197 0.0 103.1
Patki_SELF_task1b_1 Patki2020 76 86.0 1.372 9010 9010 0.0 17.5
Patki_SELF_task1b_2 Patki2020 61 89.4 0.951 18020 18020 0.0 26.3
Patki_SELF_task1b_3 Patki2020 82 83.7 1.837 9010 9010 0.0 8.8
Phan_UIUC_task1b_1 Phan2020_t1 65 88.5 0.319 6979 6944 0.005015045135406182 27.3
Phan_UIUC_task1b_2 Phan2020_t1 62 89.2 0.283 6979 6944 0.005015045135406182 27.3
Phan_UIUC_task1b_3 Phan2020_t1 63 89.0 0.301 17859 17792 0.003751609832577385 69.8
Phan_UIUC_task1b_4 Phan2020_t1 58 89.5 0.282 17859 17792 0.003751609832577385 69.8
Sampathkumar_TUC_task1b_1 Sampathkumar2020 68 87.5 0.864 123487 122387 0.008907820256383259 489.4
Singh_IITMandi_task1b_1 Singh2020 81 84.5 0.418 52467 52467 0.0 204.9
Singh_IITMandi_task1b_2 Singh2020 80 84.7 0.420 18611 18611 0.0 72.7
Singh_IITMandi_task1b_3 Singh2020 78 85.2 0.402 19763 19763 0.0 77.2
Singh_IITMandi_task1b_4 Singh2020 75 86.4 0.385 70947 70947 0.0 277.1
Suh_ETRI_task1b_1 Suh2020 29 93.3 0.302 103778 103778 0.0 405.4
Suh_ETRI_task1b_2 Suh2020 18 94.6 0.270 103778 103778 0.0 405.4
Suh_ETRI_task1b_3 Suh2020 11 95.1 0.277 207556 207556 0.0 413.0 float16
Suh_ETRI_task1b_4 Suh2020 17 94.6 0.271 207556 207556 0.0 413.0 float16
Vilouras_AUTh_task1b_1 Vilouras2020 40 91.8 0.215 127467 127021 0.0034989448249350685 496.2
Waldekar_IITKGP_task1b_1 Waldekar2020 64 88.6 7.923 10092 10092 0.0 40.0
Wu_CUHK_task1b_1 Wu2020_t1b 22 94.2 0.188 76611 76611 0.0 299.3
Wu_CUHK_task1b_2 Wu2020_t1b 21 94.2 0.201 187917 187917 0.0 367.0 float16
Wu_CUHK_task1b_3 Wu2020_t1b 20 94.3 0.185 229883 229883 0.0 449.0 float16
Wu_CUHK_task1b_4 Wu2020_t1b 14 94.9 0.218 153222 153222 0.0 299.3 float16
Yang_UESTC_task1b_1 Haocong2020 38 92.1 0.272 119382 119382 0.0 258.0 float16
Yang_UESTC_task1b_2 Haocong2020 28 93.5 0.247 119382 119382 0.0 258.0 float16
Yang_UESTC_task1b_3 Haocong2020 26 93.5 0.228 119382 119382 0.0 258.0 float16
Yang_UESTC_task1b_4 Haocong2020 51 90.4 0.327 182184 182184 0.0 448.0 float16
Zhang_BUPT_task1b_1 Zhang2020 39 92.0 0.346 83974 83974 0.0 83.4 8-bit quantization
Zhang_BUPT_task1b_2 Zhang2020 35 92.7 0.334 83974 83974 0.0 83.4 8-bit quantization
Zhang_BUPT_task1b_3 Zhang2020 34 92.9 0.316 83974 83974 0.0 83.4 8-bit quantization
Zhang_BUPT_task1b_4 Zhang2020 33 93.0 0.316 83974 83974 0.0 83.4 8-bit quantization
Zhao_JNU_task1b_1 Zhao2020 74 86.6 0.867 127491 127491 0.0 498.0 disout
Zhao_JNU_task1b_2 Zhao2020 72 86.9 0.873 127491 127491 0.0 498.0 disout


*) Model size is calculated according to the task-specific rules and may differ from the actual stored model size. See model size calculation examples here.
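The Sparsity column is simply the fraction of zero-valued parameters; a one-line check against the Chang_QTI_task1b_1 row above:

```python
def sparsity(total_params: int, non_zero: int) -> float:
    """Fraction of parameters that are zero: 1 - non_zero / total."""
    return 1 - non_zero / total_params

# Chang_QTI_task1b_1: 601866 parameters, 245591 of them non-zero
print(sparsity(601_866, 245_591))  # -> ~0.59195, as in the table
```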

Generalization performance

All results are computed on the evaluation dataset.

Submission label | Technical report | Official system rank | Accuracy (evaluation dataset) | Accuracy / unseen cities | Accuracy / seen cities
Chang_QTI_task1b_1 Chang2020 12 95.0 91.3 95.8
Chang_QTI_task1b_2 Chang2020 30 93.2 91.8 93.5
Chang_QTI_task1b_3 Chang2020 15 94.8 91.6 95.4
Chang_QTI_task1b_4 Chang2020 19 94.4 90.8 95.2
Dat_HCMUni_task1b_1 Dat2020 57 89.5 88.0 89.8
Farrugia_IMT-Atlantique-BRAIn_task1b_1 Pajusco2020 77 85.4 79.8 86.5
Farrugia_IMT-Atlantique-BRAIn_task1b_2 Pajusco2020 48 90.6 87.3 91.3
Farrugia_IMT-Atlantique-BRAIn_task1b_3 Pajusco2020 73 86.6 83.3 87.3
Farrugia_IMT-Atlantique-BRAIn_task1b_4 Pajusco2020 66 88.4 81.4 89.8
Feng_TJU_task1b_1 Feng2020 86 72.3 74.9 71.8
Feng_TJU_task1b_2 Feng2020 83 81.9 82.4 81.8
Feng_TJU_task1b_3 Feng2020 84 80.7 79.1 81.0
Feng_TJU_task1b_4 Feng2020 85 79.9 77.8 80.3
DCASE2020 baseline 89.5 84.9 90.4
Helin_ADSPLAB_task1b_1 Wang2020_t1 42 91.6 85.9 92.7
Helin_ADSPLAB_task1b_2 Wang2020_t1 41 91.6 86.1 92.7
Helin_ADSPLAB_task1b_3 Wang2020_t1 43 91.6 86.1 92.6
Helin_ADSPLAB_task1b_4 Wang2020_t1 44 91.3 85.9 92.4
Hu_GT_task1b_1 Hu2020 7 95.8 93.3 96.3
Hu_GT_task1b_2 Hu2020 10 95.5 92.1 96.1
Hu_GT_task1b_3 Hu2020 3 96.0 93.0 96.7
Hu_GT_task1b_4 Hu2020 5 95.8 93.5 96.3
Kalinowski_SRPOL_task1b_4 Kalinowski2020 31 93.1 90.1 93.7
Koutini_CPJKU_task1b_1 Koutini2020 16 94.7 91.1 95.4
Koutini_CPJKU_task1b_2 Koutini2020 1 96.5 95.3 96.7
Koutini_CPJKU_task1b_3 Koutini2020 8 95.7 94.7 95.9
Koutini_CPJKU_task1b_4 Koutini2020 2 96.2 94.4 96.6
Kowaleczko_SRPOL_task1b_3 Kalinowski2020 52 90.1 86.8 90.8
Kwiatkowska_SRPOL_task1b_1 Kalinowski2020 36 92.6 88.7 93.4
Kwiatkowska_SRPOL_task1b_2 Kalinowski2020 27 93.5 88.9 94.4
LamPham_Kent_task1b_1 Pham2020 59 89.4 85.5 90.2
LamPham_Kent_task1b_2 Pham2020 71 87.0 84.6 87.4
LamPham_Kent_task1b_3 Pham2020 79 84.7 82.1 85.3
Lee_CAU_task1b_1 Lee2020 47 90.7 87.4 91.3
Lee_CAU_task1b_2 Lee2020 23 93.9 90.0 94.6
Lee_CAU_task1b_3 Lee2020 46 91.1 87.3 91.8
Lee_CAU_task1b_4 Lee2020 45 91.2 87.5 91.9
Lopez-Meyer_IL_task1b_1 Lopez-Meyer2020_t1b 50 90.4 88.2 90.8
Lopez-Meyer_IL_task1b_2 Lopez-Meyer2020_t1b 53 90.1 87.1 90.7
Lopez-Meyer_IL_task1b_3 Lopez-Meyer2020_t1b 49 90.5 85.2 91.6
Lopez-Meyer_IL_task1b_4 Lopez-Meyer2020_t1b 56 89.7 89.1 89.8
McDonnell_USA_task1b_1 McDonnell2020 13 94.9 93.8 95.1
McDonnell_USA_task1b_2 McDonnell2020 9 95.5 92.9 96.0
McDonnell_USA_task1b_3 McDonnell2020 4 95.9 94.7 96.2
McDonnell_USA_task1b_4 McDonnell2020 6 95.8 93.8 96.2
Monteiro_INRS_task1b_1 Joao2020 69 87.4 83.9 88.1
Naranjo-Alcazar_Vfy_task1b_1 Naranjo-Alcazar2020_t1 24 93.6 90.8 94.1
Naranjo-Alcazar_Vfy_task1b_2 Naranjo-Alcazar2020_t1 25 93.6 91.3 94.0
NguyenHongDuc_SU_task1b_1 Nguyen_Hong_Duc2020 32 93.1 90.0 93.7
NguyenHongDuc_SU_task1b_2 Nguyen_Hong_Duc2020 37 92.3 90.0 92.7
Ooi_NTU_task1b_1 Ooi2020 67 87.8 81.1 89.1
Ooi_NTU_task1b_2 Ooi2020 70 87.3 85.7 87.7
Ooi_NTU_task1b_3 Ooi2020 55 89.8 86.0 90.5
Ooi_NTU_task1b_4 Ooi2020 54 89.8 87.7 90.2
Paniagua_UPM_task1b_1 Paniagua2020 60 89.4 89.4 89.4
Patki_SELF_task1b_1 Patki2020 76 86.0 89.7 85.3
Patki_SELF_task1b_2 Patki2020 61 89.4 89.7 89.3
Patki_SELF_task1b_3 Patki2020 82 83.7 84.9 83.5
Phan_UIUC_task1b_1 Phan2020_t1 65 88.5 84.2 89.4
Phan_UIUC_task1b_2 Phan2020_t1 62 89.2 86.8 89.7
Phan_UIUC_task1b_3 Phan2020_t1 63 89.0 85.4 89.7
Phan_UIUC_task1b_4 Phan2020_t1 58 89.5 85.4 90.3
Sampathkumar_TUC_task1b_1 Sampathkumar2020 68 87.5 85.7 87.8
Singh_IITMandi_task1b_1 Singh2020 81 84.5 81.8 85.1
Singh_IITMandi_task1b_2 Singh2020 80 84.7 81.1 85.4
Singh_IITMandi_task1b_3 Singh2020 78 85.2 80.8 86.1
Singh_IITMandi_task1b_4 Singh2020 75 86.4 82.8 87.1
Suh_ETRI_task1b_1 Suh2020 29 93.3 89.9 94.0
Suh_ETRI_task1b_2 Suh2020 18 94.6 91.6 95.2
Suh_ETRI_task1b_3 Suh2020 11 95.1 92.8 95.5
Suh_ETRI_task1b_4 Suh2020 17 94.6 91.6 95.2
Vilouras_AUTh_task1b_1 Vilouras2020 40 91.8 89.6 92.3
Waldekar_IITKGP_task1b_1 Waldekar2020 64 88.6 84.6 89.4
Wu_CUHK_task1b_1 Wu2020_t1b 22 94.2 92.9 94.4
Wu_CUHK_task1b_2 Wu2020_t1b 21 94.2 93.5 94.3
Wu_CUHK_task1b_3 Wu2020_t1b 20 94.3 93.1 94.5
Wu_CUHK_task1b_4 Wu2020_t1b 14 94.9 94.1 95.1
Yang_UESTC_task1b_1 Haocong2020 38 92.1 86.1 93.2
Yang_UESTC_task1b_2 Haocong2020 28 93.5 89.3 94.3
Yang_UESTC_task1b_3 Haocong2020 26 93.5 89.5 94.3
Yang_UESTC_task1b_4 Haocong2020 51 90.4 86.3 91.2
Zhang_BUPT_task1b_1 Zhang2020 39 92.0 86.3 93.1
Zhang_BUPT_task1b_2 Zhang2020 35 92.7 87.1 93.8
Zhang_BUPT_task1b_3 Zhang2020 34 92.9 87.5 94.0
Zhang_BUPT_task1b_4 Zhang2020 33 93.0 87.5 94.1
Zhao_JNU_task1b_1 Zhao2020 74 86.6 84.3 87.0
Zhao_JNU_task1b_2 Zhao2020 72 86.9 84.9 87.4
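The seen/unseen split above is just accuracy computed separately over segments recorded in development-set cities versus the two held-out cities. A sketch, assuming a hypothetical per-segment record format of (city, was_classified_correctly) pairs:

```python
def grouped_accuracy(records, seen_cities):
    """Accuracy split by whether the recording city appeared in the development set.

    records: iterable of (city, is_correct) pairs -- hypothetical input format.
    """
    buckets = {"seen": [0, 0], "unseen": [0, 0]}  # [correct, total]
    for city, ok in records:
        key = "seen" if city in seen_cities else "unseen"
        buckets[key][0] += int(ok)
        buckets[key][1] += 1
    return {k: c / t for k, (c, t) in buckets.items() if t}

recs = [("barcelona", True), ("barcelona", False), ("helsinki", True)]
print(grouped_accuracy(recs, seen_cities={"barcelona"}))  # -> {'seen': 0.5, 'unseen': 1.0}
```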

Class-wise performance

Submission label | Technical report | Official system rank | Accuracy | Indoor | Outdoor | Transportation
Chang_QTI_task1b_1 Chang2020 12 95.0 91.5 95.3 98.3
Chang_QTI_task1b_2 Chang2020 30 93.2 86.1 95.3 98.1
Chang_QTI_task1b_3 Chang2020 15 94.8 91.2 94.3 98.8
Chang_QTI_task1b_4 Chang2020 19 94.4 92.2 92.8 98.2
Dat_HCMUni_task1b_1 Dat2020 57 89.5 78.9 95.5 94.1
Farrugia_IMT-Atlantique-BRAIn_task1b_1 Pajusco2020 77 85.4 75.8 88.5 91.9
Farrugia_IMT-Atlantique-BRAIn_task1b_2 Pajusco2020 48 90.6 87.7 90.7 93.5
Farrugia_IMT-Atlantique-BRAIn_task1b_3 Pajusco2020 73 86.6 79.8 87.2 92.7
Farrugia_IMT-Atlantique-BRAIn_task1b_4 Pajusco2020 66 88.4 81.1 90.2 93.9
Feng_TJU_task1b_1 Feng2020 86 72.3 41.8 97.5 77.5
Feng_TJU_task1b_2 Feng2020 83 81.9 68.7 92.4 84.6
Feng_TJU_task1b_3 Feng2020 84 80.7 66.2 91.8 84.1
Feng_TJU_task1b_4 Feng2020 85 79.9 59.0 93.5 87.2
DCASE2020 baseline 89.5 84.5 89.1 94.9
Helin_ADSPLAB_task1b_1 Wang2020_t1 42 91.6 85.0 93.0 96.8
Helin_ADSPLAB_task1b_2 Wang2020_t1 41 91.6 84.7 93.5 96.7
Helin_ADSPLAB_task1b_3 Wang2020_t1 43 91.6 84.5 93.3 96.9
Helin_ADSPLAB_task1b_4 Wang2020_t1 44 91.3 83.1 94.3 96.6
Hu_GT_task1b_1 Hu2020 7 95.8 91.9 96.6 98.8
Hu_GT_task1b_2 Hu2020 10 95.5 92.8 96.3 97.3
Hu_GT_task1b_3 Hu2020 3 96.0 95.3 95.3 97.5
Hu_GT_task1b_4 Hu2020 5 95.8 94.8 95.1 97.5
Kalinowski_SRPOL_task1b_4 Kalinowski2020 31 93.1 88.7 94.5 96.2
Koutini_CPJKU_task1b_1 Koutini2020 16 94.7 89.1 97.0 98.0
Koutini_CPJKU_task1b_2 Koutini2020 1 96.5 92.7 97.4 99.2
Koutini_CPJKU_task1b_3 Koutini2020 8 95.7 90.1 97.8 99.2
Koutini_CPJKU_task1b_4 Koutini2020 2 96.2 92.2 97.3 99.0
Kowaleczko_SRPOL_task1b_3 Kalinowski2020 52 90.1 91.0 90.6 88.9
Kwiatkowska_SRPOL_task1b_1 Kalinowski2020 36 92.6 89.0 91.9 96.9
Kwiatkowska_SRPOL_task1b_2 Kalinowski2020 27 93.5 89.0 93.7 97.8
LamPham_Kent_task1b_1 Pham2020 59 89.4 77.2 93.2 97.9
LamPham_Kent_task1b_2 Pham2020 71 87.0 84.8 85.9 90.2
LamPham_Kent_task1b_3 Pham2020 79 84.7 67.4 94.9 91.9
Lee_CAU_task1b_1 Lee2020 47 90.7 78.1 96.9 97.0
Lee_CAU_task1b_2 Lee2020 23 93.9 86.7 97.1 97.9
Lee_CAU_task1b_3 Lee2020 46 91.1 81.1 96.3 95.9
Lee_CAU_task1b_4 Lee2020 45 91.2 80.3 96.7 96.5
Lopez-Meyer_IL_task1b_1 Lopez-Meyer2020_t1b 50 90.4 87.9 88.9 94.3
Lopez-Meyer_IL_task1b_2 Lopez-Meyer2020_t1b 53 90.1 82.5 92.6 95.2
Lopez-Meyer_IL_task1b_3 Lopez-Meyer2020_t1b 49 90.5 85.8 89.6 96.2
Lopez-Meyer_IL_task1b_4 Lopez-Meyer2020_t1b 56 89.7 84.4 88.0 96.6
McDonnell_USA_task1b_1 McDonnell2020 13 94.9 87.7 98.8 98.3
McDonnell_USA_task1b_2 McDonnell2020 9 95.5 90.4 97.5 98.7
McDonnell_USA_task1b_3 McDonnell2020 4 95.9 90.7 97.9 99.2
McDonnell_USA_task1b_4 McDonnell2020 6 95.8 90.2 98.1 99.1
Monteiro_INRS_task1b_1 Joao2020 69 87.4 82.6 85.8 93.7
Naranjo-Alcazar_Vfy_task1b_1 Naranjo-Alcazar2020_t1 24 93.6 85.0 97.0 98.7
Naranjo-Alcazar_Vfy_task1b_2 Naranjo-Alcazar2020_t1 25 93.6 86.0 96.7 98.0
NguyenHongDuc_SU_task1b_1 Nguyen_Hong_Duc2020 32 93.1 88.8 93.8 96.6
NguyenHongDuc_SU_task1b_2 Nguyen_Hong_Duc2020 37 92.3 85.8 94.4 96.5
Ooi_NTU_task1b_1 Ooi2020 67 87.8 82.6 87.2 93.6
Ooi_NTU_task1b_2 Ooi2020 70 87.3 86.5 86.8 88.6
Ooi_NTU_task1b_3 Ooi2020 55 89.8 86.1 89.1 94.1
Ooi_NTU_task1b_4 Ooi2020 54 89.8 85.6 89.3 94.4
Paniagua_UPM_task1b_1 Paniagua2020 60 89.4 82.5 91.7 94.0
Patki_SELF_task1b_1 Patki2020 76 86.0 75.9 90.7 91.6
Patki_SELF_task1b_2 Patki2020 61 89.4 84.2 92.4 91.4
Patki_SELF_task1b_3 Patki2020 82 83.7 82.1 72.3 96.9
Phan_UIUC_task1b_1 Phan2020_t1 65 88.5 82.8 88.6 94.1
Phan_UIUC_task1b_2 Phan2020_t1 62 89.2 84.0 89.9 93.9
Phan_UIUC_task1b_3 Phan2020_t1 63 89.0 78.8 92.5 95.6
Phan_UIUC_task1b_4 Phan2020_t1 58 89.5 82.4 90.2 95.8
Sampathkumar_TUC_task1b_1 Sampathkumar2020 68 87.5 76.5 90.2 95.7
Singh_IITMandi_task1b_1 Singh2020 81 84.5 77.0 81.8 94.7
Singh_IITMandi_task1b_2 Singh2020 80 84.7 79.9 80.3 93.7
Singh_IITMandi_task1b_3 Singh2020 78 85.2 75.4 86.7 93.6
Singh_IITMandi_task1b_4 Singh2020 75 86.4 85.0 79.9 94.3
Suh_ETRI_task1b_1 Suh2020 29 93.3 83.5 97.2 99.2
Suh_ETRI_task1b_2 Suh2020 18 94.6 87.0 97.6 99.2
Suh_ETRI_task1b_3 Suh2020 11 95.1 88.3 97.9 99.1
Suh_ETRI_task1b_4 Suh2020 17 94.6 87.0 97.7 99.2
Vilouras_AUTh_task1b_1 Vilouras2020 40 91.8 87.2 91.1 97.2
Waldekar_IITKGP_task1b_1 Waldekar2020 64 88.6 82.9 90.8 92.2
Wu_CUHK_task1b_1 Wu2020_t1b 22 94.2 86.1 97.9 98.5
Wu_CUHK_task1b_2 Wu2020_t1b 21 94.2 86.1 97.8 98.6
Wu_CUHK_task1b_3 Wu2020_t1b 20 94.3 85.9 98.5 98.5
Wu_CUHK_task1b_4 Wu2020_t1b 14 94.9 88.8 97.3 98.6
Yang_UESTC_task1b_1 Haocong2020 38 92.1 86.6 94.1 95.5
Yang_UESTC_task1b_2 Haocong2020 28 93.5 89.4 96.3 94.8
Yang_UESTC_task1b_3 Haocong2020 26 93.5 89.9 96.0 94.7
Yang_UESTC_task1b_4 Haocong2020 51 90.4 95.0 80.8 95.3
Zhang_BUPT_task1b_1 Zhang2020 39 92.0 84.3 93.5 98.2
Zhang_BUPT_task1b_2 Zhang2020 35 92.7 88.0 92.6 97.4
Zhang_BUPT_task1b_3 Zhang2020 34 92.9 88.3 92.6 97.8
Zhang_BUPT_task1b_4 Zhang2020 33 93.0 88.9 92.5 97.7
Zhao_JNU_task1b_1 Zhao2020 74 86.6 70.4 92.6 96.7
Zhao_JNU_task1b_2 Zhao2020 72 86.9 71.0 92.9 96.9
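Per-class figures like the Indoor/Outdoor/Transportation columns are class-conditional accuracies (recall): accuracy restricted to segments whose true label is the given class. A minimal sketch with hypothetical labels:

```python
def class_wise_accuracy(y_true, y_pred, classes):
    """Per-class recall: fraction of each true class that was predicted correctly."""
    out = {}
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        out[c] = sum(y_pred[i] == c for i in idx) / len(idx) if idx else float("nan")
    return out

y_true = ["indoor", "indoor", "outdoor", "transportation"]
y_pred = ["indoor", "outdoor", "outdoor", "transportation"]
print(class_wise_accuracy(y_true, y_pred, ["indoor", "outdoor", "transportation"]))
# -> {'indoor': 0.5, 'outdoor': 1.0, 'transportation': 1.0}
```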

System characteristics

General characteristics

Submission label | Technical report | Official system rank | Eval. accuracy | Input | Sampling rate | Data augmentation | Features
Chang_QTI_task1b_1 Chang2020 12 95.0 binaural 22.05kHz mixup+FreqMix perceptual weighted power spectrogram
Chang_QTI_task1b_2 Chang2020 30 93.2 binaural 22.05kHz mixup+FreqMix perceptual weighted power spectrogram
Chang_QTI_task1b_3 Chang2020 15 94.8 binaural 22.05kHz mixup+FreqMix perceptual weighted power spectrogram
Chang_QTI_task1b_4 Chang2020 19 94.4 binaural 22.05kHz mixup+FreqMix perceptual weighted power spectrogram
Dat_HCMUni_task1b_1 Dat2020 57 89.5 left, right, average of left+right 48kHz Random oversample & mixup Gammatone energy
Farrugia_IMT-Atlantique-BRAIn_task1b_1 Pajusco2020 77 85.4 binaural 18kHz temporal masking, filtering, additive noise raw waveform
Farrugia_IMT-Atlantique-BRAIn_task1b_2 Pajusco2020 48 90.6 binaural 18kHz temporal masking, filtering, additive noise raw waveform
Farrugia_IMT-Atlantique-BRAIn_task1b_3 Pajusco2020 73 86.6 binaural 18kHz cutmix raw waveform
Farrugia_IMT-Atlantique-BRAIn_task1b_4 Pajusco2020 66 88.4 binaural 18kHz cutmix raw waveform
Feng_TJU_task1b_1 Feng2020 86 72.3 mono 48kHz same class mix mel spectrogram
Feng_TJU_task1b_2 Feng2020 83 81.9 mono 48kHz same class mix mel spectrogram
Feng_TJU_task1b_3 Feng2020 84 80.7 mono 48kHz same class mix mel spectrogram
Feng_TJU_task1b_4 Feng2020 85 79.9 mono 48kHz same class mix mel spectrogram
DCASE2020 baseline 89.5 mono 48kHz log-mel energies
Helin_ADSPLAB_task1b_1 Wang2020_t1 42 91.6 mono 44.1kHz mixup log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1b_2 Wang2020_t1 41 91.6 mono 44.1kHz mixup log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1b_3 Wang2020_t1 43 91.6 mono 44.1kHz mixup log-mel energies, CQT, Gammatone
Helin_ADSPLAB_task1b_4 Wang2020_t1 44 91.3 mono 44.1kHz mixup log-mel energies, CQT, Gammatone
Hu_GT_task1b_1 Hu2020 7 95.8 binaural 48kHz mixup, channel confusion, SpecAugment log-mel energies
Hu_GT_task1b_2 Hu2020 10 95.5 binaural 48kHz mixup, channel confusion, SpecAugment log-mel energies
Hu_GT_task1b_3 Hu2020 3 96.0 binaural 48kHz mixup, channel confusion, SpecAugment log-mel energies
Hu_GT_task1b_4 Hu2020 5 95.8 binaural 48kHz mixup, channel confusion, SpecAugment log-mel energies
Kalinowski_SRPOL_task1b_4 Kalinowski2020 31 93.1 mono 48kHz time warping, frequency warping, loudness control, time length control, time masking, frequency masking log-mel spectrogram
Koutini_CPJKU_task1b_1 Koutini2020 16 94.7 stereo 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1b_2 Koutini2020 1 96.5 stereo 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1b_3 Koutini2020 8 95.7 stereo 22.05kHz mixup Perceptually-weighted log-mel energies
Koutini_CPJKU_task1b_4 Koutini2020 2 96.2 stereo 22.05kHz mixup Perceptually-weighted log-mel energies
Kowaleczko_SRPOL_task1b_3 Kalinowski2020 52 90.1 mono 48kHz log-mel spectrogram
Kwiatkowska_SRPOL_task1b_1 Kalinowski2020 36 92.6 mono 48kHz mixup log-mel energies
Kwiatkowska_SRPOL_task1b_2 Kalinowski2020 27 93.5 mono 48kHz log-mel energies
LamPham_Kent_task1b_1 Pham2020 59 89.4 left 48kHz mixup Gammatone energy
LamPham_Kent_task1b_2 Pham2020 71 87.0 left 48kHz mixup Gammatone energy
LamPham_Kent_task1b_3 Pham2020 79 84.7 left 48kHz mixup Gammatone energy
Lee_CAU_task1b_1 Lee2020 47 90.7 binaural 48kHz mixup log-mel energies, deltas, delta-deltas
Lee_CAU_task1b_2 Lee2020 23 93.9 binaural 48kHz mixup log-mel energies, deltas, delta-deltas
Lee_CAU_task1b_3 Lee2020 46 91.1 binaural 48kHz mixup HPSS
Lee_CAU_task1b_4 Lee2020 45 91.2 binaural 48kHz HPSS, log-mel energies, deltas, delta-deltas
Lopez-Meyer_IL_task1b_1 Lopez-Meyer2020_t1b 50 90.4 mono 16kHz random noise, random gain, random cropping, mixup raw waveform
Lopez-Meyer_IL_task1b_2 Lopez-Meyer2020_t1b 53 90.1 mono 16kHz random noise, random gain, random cropping, mixup raw waveform
Lopez-Meyer_IL_task1b_3 Lopez-Meyer2020_t1b 49 90.5 mono 48kHz SpecAugment mel filterbank
Lopez-Meyer_IL_task1b_4 Lopez-Meyer2020_t1b 56 89.7 binaural 16kHz SpecAugment log-mel filterbanks, GCC-grams
McDonnell_USA_task1b_1 McDonnell2020 13 94.9 left, right 48kHz mixup, temporal cropping, channel swapping log-mel energies
McDonnell_USA_task1b_2 McDonnell2020 9 95.5 left, right 48kHz mixup, temporal cropping, channel swapping log-mel energies
McDonnell_USA_task1b_3 McDonnell2020 4 95.9 left, right 48kHz mixup, temporal cropping, channel swapping log-mel energies
McDonnell_USA_task1b_4 McDonnell2020 6 95.8 left, right 48kHz mixup, temporal cropping, channel swapping log-mel energies
Monteiro_INRS_task1b_1 Joao2020 69 87.4 mono 44.1kHz Sox distortions, SpecAugment log-mel energies
Naranjo-Alcazar_Vfy_task1b_1 Naranjo-Alcazar2020_t1 24 93.6 left, right, difference 48kHz gammatone
Naranjo-Alcazar_Vfy_task1b_2 Naranjo-Alcazar2020_t1 25 93.6 left, right, difference, mono 48kHz gammatone, HPSS, log-mel energies
NguyenHongDuc_SU_task1b_1 Nguyen_Hong_Duc2020 32 93.1 mono, binaural 48kHz mixup RMS level, third-octave levels, Leq, interaural cross correlation coefficient, hardness, depth, brightness, roughness, warmth, sharpness, boominess, reverb, log-mel spectrogram
NguyenHongDuc_SU_task1b_2 Nguyen_Hong_Duc2020 37 92.3 mono, binaural 48kHz mixup RMS level, third-octave levels, Leq, interaural cross correlation coefficient, hardness, depth, brightness, roughness, warmth, sharpness, boominess, reverb, log-mel spectrogram
Ooi_NTU_task1b_1 Ooi2020 67 87.8 mono 48kHz log-mel energies
Ooi_NTU_task1b_2 Ooi2020 70 87.3 mono 48kHz log-mel energies
Ooi_NTU_task1b_3 Ooi2020 55 89.8 mono 48kHz log-mel energies
Ooi_NTU_task1b_4 Ooi2020 54 89.8 mono 48kHz block mixing log-mel energies
Paniagua_UPM_task1b_1 Paniagua2020 60 89.4 48kHz LTAS, envelope modulation spectrum, cepstrum of cross-correlation
Patki_SELF_task1b_1 Patki2020 76 86.0 left+right, left-right 48kHz log-mel spectrogram
Patki_SELF_task1b_2 Patki2020 61 89.4 left+right, left-right 48kHz log-mel spectrogram
Patki_SELF_task1b_3 Patki2020 82 83.7 mono 48kHz log-mel spectrogram
Phan_UIUC_task1b_1 Phan2020_t1 65 88.5 mono 48kHz log-mel energies
Phan_UIUC_task1b_2 Phan2020_t1 62 89.2 mono 48kHz log-mel energies
Phan_UIUC_task1b_3 Phan2020_t1 63 89.0 mono 48kHz log-mel energies
Phan_UIUC_task1b_4 Phan2020_t1 58 89.5 mono 48kHz log-mel energies
Sampathkumar_TUC_task1b_1 Sampathkumar2020 68 87.5 mono 48kHz log-mel energies
Singh_IITMandi_task1b_1 Singh2020 81 84.5 mono 16kHz raw waveform segment
Singh_IITMandi_task1b_2 Singh2020 80 84.7 mono 16kHz raw waveform segment
Singh_IITMandi_task1b_3 Singh2020 78 85.2 mono 16kHz raw waveform segment
Singh_IITMandi_task1b_4 Singh2020 75 86.4 mono 16kHz raw waveform segment
Suh_ETRI_task1b_1 Suh2020 29 93.3 stereo 48kHz temporal cropping, mixup log-mel energies
Suh_ETRI_task1b_2 Suh2020 18 94.6 stereo 48kHz temporal cropping, mixup log-mel energies
Suh_ETRI_task1b_3 Suh2020 11 95.1 stereo 48kHz temporal cropping, mixup log-mel energies
Suh_ETRI_task1b_4 Suh2020 17 94.6 stereo 48kHz temporal cropping, mixup log-mel energies
Vilouras_AUTh_task1b_1 Vilouras2020 40 91.8 mono 48kHz log-mel energies
Waldekar_IITKGP_task1b_1 Waldekar2020 64 88.6 mono 48kHz histogram of uniform LBP of log-mel energies
Wu_CUHK_task1b_1 Wu2020_t1b 22 94.2 binaural 48kHz mixup wavelet filter-bank features
Wu_CUHK_task1b_2 Wu2020_t1b 21 94.2 binaural 48kHz mixup wavelet filter-bank features
Wu_CUHK_task1b_3 Wu2020_t1b 20 94.3 binaural 48kHz mixup wavelet filter-bank features
Wu_CUHK_task1b_4 Wu2020_t1b 14 94.9 binaural 48kHz mixup wavelet filter-bank features
Yang_UESTC_task1b_1 Haocong2020 38 92.1 binaural 22.05kHz CQT
Yang_UESTC_task1b_2 Haocong2020 28 93.5 mixed 22.05kHz CQT
Yang_UESTC_task1b_3 Haocong2020 26 93.5 binaural 22.05kHz CQT
Yang_UESTC_task1b_4 Haocong2020 51 90.4 binaural 22.05kHz CQT
Zhang_BUPT_task1b_1 Zhang2020 39 92.0 mixed 44.1kHz mixup log-mel energies
Zhang_BUPT_task1b_2 Zhang2020 35 92.7 mixed 44.1kHz mixup log-mel energies
Zhang_BUPT_task1b_3 Zhang2020 34 92.9 mixed 44.1kHz mixup log-mel energies
Zhang_BUPT_task1b_4 Zhang2020 33 93.0 mixed 44.1kHz mixup log-mel energies
Zhao_JNU_task1b_1 Zhao2020 74 86.6 mono 48kHz SpecAugment log-mel energies
Zhao_JNU_task1b_2 Zhao2020 72 86.9 mono 48kHz SpecAugment log-mel energies



Machine learning characteristics

Rank | Code | Technical Report | Official system rank | Accuracy (Eval) | External data usage | External data sources | Classifier | Ensemble subsystems | Decision making
Chang_QTI_task1b_1 Chang2020 12 95.0 CNN
Chang_QTI_task1b_2 Chang2020 30 93.2 CNN
Chang_QTI_task1b_3 Chang2020 15 94.8 CNN
Chang_QTI_task1b_4 Chang2020 19 94.4 CNN
Dat_HCMUni_task1b_1 Dat2020 57 89.5 CNN
Farrugia_IMT-Atlantique-BRAIn_task1b_1 Pajusco2020 77 85.4 CNN
Farrugia_IMT-Atlantique-BRAIn_task1b_2 Pajusco2020 48 90.6 CNN
Farrugia_IMT-Atlantique-BRAIn_task1b_3 Pajusco2020 73 86.6 ResNet
Farrugia_IMT-Atlantique-BRAIn_task1b_4 Pajusco2020 66 88.4 ResNet
Feng_TJU_task1b_1 Feng2020 86 72.3 CNN
Feng_TJU_task1b_2 Feng2020 83 81.9 CNN
Feng_TJU_task1b_3 Feng2020 84 80.7 CNN
Feng_TJU_task1b_4 Feng2020 85 79.9 CNN
DCASE2020 baseline 89.5 embeddings CNN
Helin_ADSPLAB_task1b_1 Wang2020_t1 42 91.6 CNN
Helin_ADSPLAB_task1b_2 Wang2020_t1 41 91.6 CNN
Helin_ADSPLAB_task1b_3 Wang2020_t1 43 91.6 CNN
Helin_ADSPLAB_task1b_4 Wang2020_t1 44 91.3 CNN
Hu_GT_task1b_1 Hu2020 7 95.8 CNN
Hu_GT_task1b_2 Hu2020 10 95.5 CNN, MobileNet, ensemble 2 average
Hu_GT_task1b_3 Hu2020 3 96.0 CNN, MobileNet, ensemble 2 logistic regression
Hu_GT_task1b_4 Hu2020 5 95.8 CNN, ensemble 2 logistic regression
Kalinowski_SRPOL_task1b_4 Kalinowski2020 31 93.1 CNN, VGG softmax
Koutini_CPJKU_task1b_1 Koutini2020 16 94.7 RF-regularized CNNs
Koutini_CPJKU_task1b_2 Koutini2020 1 96.5 RF-regularized CNNs
Koutini_CPJKU_task1b_3 Koutini2020 8 95.7 RF-regularized CNNs
Koutini_CPJKU_task1b_4 Koutini2020 2 96.2 RF-regularized CNNs
Kowaleczko_SRPOL_task1b_3 Kalinowski2020 52 90.1 CNN softmax
Kwiatkowska_SRPOL_task1b_1 Kalinowski2020 36 92.6 CNN, ensemble 2 soft voting
Kwiatkowska_SRPOL_task1b_2 Kalinowski2020 27 93.5 CNN, ensemble 2 soft voting
LamPham_Kent_task1b_1 Pham2020 59 89.4 CNN
LamPham_Kent_task1b_2 Pham2020 71 87.0 CNN
LamPham_Kent_task1b_3 Pham2020 79 84.7 CNN
Lee_CAU_task1b_1 Lee2020 47 90.7 ResNet
Lee_CAU_task1b_2 Lee2020 23 93.9 ResNet
Lee_CAU_task1b_3 Lee2020 46 91.1 ResNet
Lee_CAU_task1b_4 Lee2020 45 91.2 Multi-input model, ResNet
Lopez-Meyer_IL_task1b_1 Lopez-Meyer2020_t1b 50 90.4 directly Audioset CNN maximum softmax
Lopez-Meyer_IL_task1b_2 Lopez-Meyer2020_t1b 53 90.1 directly Audioset CNN maximum softmax
Lopez-Meyer_IL_task1b_3 Lopez-Meyer2020_t1b 49 90.5 directly Audioset CNN maximum softmax
Lopez-Meyer_IL_task1b_4 Lopez-Meyer2020_t1b 56 89.7 CNN maximum softmax
McDonnell_USA_task1b_1 McDonnell2020 13 94.9 CNN
McDonnell_USA_task1b_2 McDonnell2020 9 95.5 CNN
McDonnell_USA_task1b_3 McDonnell2020 4 95.9 CNN
McDonnell_USA_task1b_4 McDonnell2020 6 95.8 CNN
Monteiro_INRS_task1b_1 Joao2020 69 87.4 CNN
Naranjo-Alcazar_Vfy_task1b_1 Naranjo-Alcazar2020_t1 24 93.6 CNN
Naranjo-Alcazar_Vfy_task1b_2 Naranjo-Alcazar2020_t1 25 93.6 CNN
NguyenHongDuc_SU_task1b_1 Nguyen_Hong_Duc2020 32 93.1 directly CNN, GRU, MLP 3 average
NguyenHongDuc_SU_task1b_2 Nguyen_Hong_Duc2020 37 92.3 directly CNN, GRU, MLP 2 average
Ooi_NTU_task1b_1 Ooi2020 67 87.8 VGGNet
Ooi_NTU_task1b_2 Ooi2020 70 87.3 InceptionNet
Ooi_NTU_task1b_3 Ooi2020 55 89.8 VGGNet, InceptionNet, ensemble 6 average
Ooi_NTU_task1b_4 Ooi2020 54 89.8 VGGNet, InceptionNet, ensemble 6 average
Paniagua_UPM_task1b_1 Paniagua2020 60 89.4 MLP average log-likelihood
Patki_SELF_task1b_1 Patki2020 76 86.0 embeddings SVM
Patki_SELF_task1b_2 Patki2020 61 89.4 embeddings SVM
Patki_SELF_task1b_3 Patki2020 82 83.7 embeddings SVM
Phan_UIUC_task1b_1 Phan2020_t1 65 88.5 CNN
Phan_UIUC_task1b_2 Phan2020_t1 62 89.2 CNN
Phan_UIUC_task1b_3 Phan2020_t1 63 89.0 CNN
Phan_UIUC_task1b_4 Phan2020_t1 58 89.5 CNN
Sampathkumar_TUC_task1b_1 Sampathkumar2020 68 87.5 CNN
Singh_IITMandi_task1b_1 Singh2020 81 84.5 CNN maximum likelihood
Singh_IITMandi_task1b_2 Singh2020 80 84.7 pre-trained weights of SoundNet CNN maximum likelihood
Singh_IITMandi_task1b_3 Singh2020 78 85.2 pre-trained weights of SoundNet for initialization SoundNet CNN maximum likelihood
Singh_IITMandi_task1b_4 Singh2020 75 86.4 pre-trained weights of SoundNet SoundNet CNN maximum likelihood
Suh_ETRI_task1b_1 Suh2020 29 93.3 CNN(Inception)
Suh_ETRI_task1b_2 Suh2020 18 94.6 CNN(Inception)
Suh_ETRI_task1b_3 Suh2020 11 95.1 CNN(Inception) 2 average
Suh_ETRI_task1b_4 Suh2020 17 94.6 CNN(Inception) 2 weighted score average
Vilouras_AUTh_task1b_1 Vilouras2020 40 91.8 CNN
Waldekar_IITKGP_task1b_1 Waldekar2020 64 88.6 SVM
Wu_CUHK_task1b_1 Wu2020_t1b 22 94.2 CNN
Wu_CUHK_task1b_2 Wu2020_t1b 21 94.2 CNN 2 average
Wu_CUHK_task1b_3 Wu2020_t1b 20 94.3 CNN 3 average
Wu_CUHK_task1b_4 Wu2020_t1b 14 94.9 CNN 3 average
Yang_UESTC_task1b_1 Haocong2020 38 92.1 CNN
Yang_UESTC_task1b_2 Haocong2020 28 93.5 CNN
Yang_UESTC_task1b_3 Haocong2020 26 93.5 CNN
Yang_UESTC_task1b_4 Haocong2020 51 90.4 CNN
Zhang_BUPT_task1b_1 Zhang2020 39 92.0 ResNet
Zhang_BUPT_task1b_2 Zhang2020 35 92.7 ResNet
Zhang_BUPT_task1b_3 Zhang2020 34 92.9 ResNet
Zhang_BUPT_task1b_4 Zhang2020 33 93.0 ResNet
Zhao_JNU_task1b_1 Zhao2020 74 86.6 embeddings CNN
Zhao_JNU_task1b_2 Zhao2020 72 86.9 embeddings CNN

Technical reports

QTI Submission to DCASE 2020: Model Efficient Acoustic Scene Classification

Simyung Chang, Janghoon Cho, Hyoungwoo Park, Hyunsin Park, Sungrack Yun and Kyuwoong Hwang
Qualcomm AI Research, Qualcomm Korea YH, Seoul, South Korea

Abstract

This technical report describes the details of our submission (the QAIR team’s submission) for Task 1B of the DCASE 2020 challenge. In this report, we introduce three methods for efficient acoustic scene classification with low model complexity. First, inspired by CutMix, which was proposed for image recognition tasks, we introduce FreqMix, a data augmentation that mixes specific frequency bands of two different samples instead of cutting and pasting box patches. Second, as a novel feature normalization, we consider SubSpectral Normalization, which can reduce the correlation between sub-spectral groups by performing the normalization on each separated group. Lastly, to reduce the number of model parameters, we propose a Shared Residual architecture in which the weights of all layers (except the normalization layers) are shared. All submission models were trained without any external data, and our model is not an ensemble of multiple models but a single model, to satisfy the model complexity condition.
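As a rough illustration of the FreqMix idea described above (mixing frequency bands rather than box patches), here is a minimal sketch; the band width, the sampling of the band position, and the label weighting are our assumptions, not details from the report:

```python
import numpy as np

def freqmix(spec_a, spec_b, band_frac=0.25, rng=None):
    """Hypothetical FreqMix-style augmentation: replace a contiguous
    block of frequency bins in spec_a (shape [freq, time]) with the
    same bins taken from spec_b."""
    rng = rng or np.random.default_rng(0)
    n_freq = spec_a.shape[0]
    width = max(1, int(band_frac * n_freq))
    start = int(rng.integers(0, n_freq - width + 1))
    mixed = spec_a.copy()
    mixed[start:start + width] = spec_b[start:start + width]
    lam = 1.0 - width / n_freq  # label weight kept by spec_a's class
    return mixed, lam

a, b = np.zeros((40, 100)), np.ones((40, 100))
mixed, lam = freqmix(a, b)
```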

System characteristics
Input binaural
Sampling rate 22.05kHz
Data augmentation mixup+FreqMix
Features perceptual weighted power spectrogram
Classifier CNN
Complexity management sparsity
PDF

CNN-Based Framework for DCASE 2020 Task 1B Challenge

Ngo Dat, Pham Lam, Nguyen Anh and Hoang Hao
Electrical & Electronic Engineering, Ho Chi Minh University of Technology, Ho Chi Minh, Vietnam

Abstract

This technical report presents a low-complexity CNN-based deep learning framework for acoustic scene classification. In particular, the proposed architecture consists of two main steps: front-end feature extraction and a back-end classification network. First, spectrogram representations are extracted from the audio as front-end features. Next, the extracted spectrograms are fed into a CNN-based architecture for classification. Experimental results obtained on the DCASE 2020 Task 1B dataset improve on the DCASE baseline by 7.2%.

System characteristics
Input left, right, average of left+right
Sampling rate 48kHz
Data augmentation Random oversample & mixup
Features Gammatone energy
Classifier CNN
PDF

Acoustic Scene Classification Based on Lightweight CNN with Efficient Convolutions

Guoqing Feng, Jinhua Liang and Biyun Ding
School of Electrical and Information Engineering, Tianjin University, Tianjin, China

Abstract

This technical report addresses Task 1B, acoustic scene classification, of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). Targeting low-complexity solutions in terms of model size, a lightweight Convolutional Neural Network (CNN) with efficient convolutions is designed. The network is built from an improved bottleneck block derived from the inverted residual linear bottleneck block. In the improved block, Depthwise Channel Ascent (DCA) and Group Channel Descent (GCD) operations replace pointwise convolutions to realize efficient channel transformation. The resulting network is denoted CNN-BDG in this report. CNN-BDG achieves an accuracy 4.46% higher than the baseline model on the validation set, while its parameter count is reduced to about 30% of the baseline model's.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation same class mix
Features mel spectrogram
Classifier CNN
Complexity management optimize the convolution operation and the network structure
PDF

Low-Complexity Acoustic Scene Classification Using Primary Ambient Extraction and Cyclegan

Yang Haocong, Shi Chuang and Li Huiyong
Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This report describes our submissions for DCASE 2020 Challenge Task 1B (Low-Complexity Acoustic Scene Classification). In each submission, the constant-Q transform is used as the acoustic feature, and the corresponding classifier is a fully convolutional neural network based on residual blocks. The classifier parameters use half-precision (16-bit) floating-point numbers to limit the model size and accelerate training. We use primary ambient extraction in the audio front-end processing and generate virtual samples according to the phase information of the binaural audio; these virtual samples are used in one of the submissions. We also use virtual samples generated by CycleGAN for another submission. Finally, we provide a 4-fold cross-validation submission that meets the complexity limit. The highest macro recognition accuracy of the above methods on the development dataset is 96.05%, and the log loss is 0.120.
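The half-precision trick above has a simple arithmetic payoff for the challenge's size limit; the figures below are just the raw byte math for a 500 KB budget, not numbers taken from the report:

```python
# Storing weights in float16 instead of float32 doubles the number of
# parameters that fit into the 500 KB model-size budget.
budget_bytes = 500 * 1024
params_fp32 = budget_bytes // 4  # 4 bytes per float32 weight
params_fp16 = budget_bytes // 2  # 2 bytes per float16 weight
print(params_fp32, params_fp16)  # 128000 256000
```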

System characteristics
Input binaural; mixed
Sampling rate 22.05kHz
Features CQT
Classifier CNN
Complexity management float16
PDF

Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Hu Hu1, Chao-Han Huck Yang1, Xianjun Xia2, Xue Bai3, Xin Tang3, Yajian Wang3, Shutong Niu3, Li Chai3, Juanjuan Li2, Hongning Zhu2, Feng Bao4, Yuanjun Zhao2, Sabato Marco Siniscalchi5, Yannan Wang2, Jun Du3 and Chin-Hui Lee1
1School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA, 2Tencent Media Lab, Shenzhen, China, 3University of Science and Technology of China, HeFei, China, 4Tencent Media Lab, Beijing, China, 5Computer Engineering School, University of Enna Kore, Italy

Abstract

In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging an ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-class CNN-based architectures. On the Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On the Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500 KB.
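The system characteristics below list int8 quantization as the complexity-management step. A minimal sketch of symmetric per-tensor int8 quantization (our own illustration, not the authors' code) looks like this:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 with a
    single scale so each weight costs 1 byte instead of 4."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The round trip is lossy but the error stays within one quantization step, while the stored model shrinks 4x.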

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation mixup, channel confusion, SpecAugment
Features log-mel energies
Classifier CNN; CNN, MobileNet, ensemble; CNN, ensemble
Decision making average; logistic regression
Complexity management int8, quantization
PDF

Development of the INRS-EMT Scene Classification Systems for the 2020 Edition of the DCASE Challenge

Monteiro Joao, Shruti Kshirsagar, Anderson Avila, Amr Aaballah, Parth Tiwari and Tiago Falk
EMT, Institut National de la Recherche Scientifique, Montreal, Canada

Abstract

In this report we provide a brief overview of a set of submissions for the scene classification sub-tasks of the 2020 edition of the DCASE challenge. Our submissions comprise efforts at the feature representation level, where we explored the use of modulation spectra and i-vectors (extracted from mel cepstral coefficients, as well as modulation spectra) and modeling strategies, where recent convolutional deep neural network models were used. Results on the Challenge validation set show several of the submitted methods outperforming the baseline model.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation Sox distortions, SpecAugment
Features log-mel energies
Classifier CNN
PDF

Low-Complexity Acoustic Scene Classification with Small Convolutional Neural Networks and Curriculum Learning

Beniamin Kalinowski
Audio Intelligence, Samsung R&D Poland, Warsaw, Poland

Abstract

This report presents the results of our submission to Task 1B of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 challenge. The main constraint in this task was the 500 KB size limit on each submitted model; such limits matter when a model is to be deployed on a device with little memory. For this task, four different models based on convolutional neural networks were developed, varying in data preprocessing methods, architectures, etc. The crucial complexity-management techniques were curriculum learning and the use of depthwise and separable convolutions, along with ensembling models trained on 3 and 10 classes to preserve performance. Our best models improved on the baseline with a 10% increase in accuracy and a 60% decrease in log loss.
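The depthwise and separable convolutions mentioned above shrink parameter counts dramatically; the layer sizes below are hypothetical, chosen only to show the ratio:

```python
def conv2d_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k filter per input channel + 1x1 pointwise mix."""
    return c_in * k * k + c_in * c_out

full = conv2d_params(64, 64, 3)   # 36864 weights
sep = separable_params(64, 64, 3) # 576 + 4096 = 4672 weights
print(full, sep)  # roughly an 8x reduction for this layer shape
```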

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation time warping, frequency warping, loudness control, time length control, time masking, frequency masking; mixup
Features log-mel spectrogram; log-mel energies
Classifier CNN, VGG; CNN; CNN, ensemble
Decision making softmax; soft voting
Complexity management using rectangular convolution kernels; constraints-aware modelling
PDF

CP-JKU Submissions to DCASE’20: Low-Complexity Cross-Device Acoustic Scene Classification with RF-Regularized CNNs

Khaled Koutini, Florian Henkel, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes the CP-JKU team’s submission for Task 1 - Subtask A (Acoustic Scene Classification with Multiple Devices) and Subtask B (Low-Complexity Acoustic Scene Classification) of the DCASE 2020 challenge. For Subtask 1A, we provide our Receptive Field (RF) regularized CNN model as a baseline, and additionally explore the use of two different domain adaptation objectives in the form of the Maximum Mean Discrepancy (MMD) and the Sliced Wasserstein Distance (SWD). For Subtask 1B, we investigate different parameter reduction methods such as pruning and knowledge distillation (KD). Additionally, we incorporate a decomposed convolutional layer that reduces the number of nonzero parameters in our models while only slightly decreasing accuracy compared to the full-parameter baseline.

System characteristics
Input stereo
Sampling rate 22.05kHz
Data augmentation mixup
Features Perceptually-weighted log-mel energies
Classifier RF-regularized CNNs
Complexity management float16, conv layers decomposition; pruning, float16; float16, smaller width/depth
PDF

The CAU-ET Acoustic Scenery Classification System for DCASE 2020 Challenge

Yerin Lee1, Soyoung Lim1 and Il-Youp Kwak2
1Statistics, Chung-Ang University, Seoul, South Korea, 2Department of Applied Statistics, Chung-Ang University, Seoul, South Korea

Abstract

The acoustic scenery classification problem is an interesting topic that has long been studied through the DCASE competition. This technical report presents CAU-ET's scene classification system submitted to the DCASE 2020 challenge, Task 1. In our method, we generate log-mel spectrograms from the audio, from which we derive deltas, delta-deltas, and harmonic-percussive source separation (HPSS) features as inputs to our deep neural network models. The proposed system achieved 66.26% on the development dataset in Subtask A and 95.27% in Subtask B.
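The deltas and delta-deltas mentioned above follow the standard regression formula over time frames; this sketch is our own illustration, with an assumed window width of 2:

```python
import numpy as np

def deltas(feat, width=2):
    """Delta features along time (axis=1) via the usual regression
    formula; edge frames are padded by repetition."""
    T = feat.shape[1]
    denom = 2 * sum(i * i for i in range(1, width + 1))
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    out = np.zeros(feat.shape, dtype=float)
    for i in range(1, width + 1):
        out += i * (padded[:, width + i:width + i + T]
                    - padded[:, width - i:width - i + T])
    return out / denom

# a linear ramp has a constant slope of 1 per frame in the interior
ramp = np.tile(np.arange(10.0), (3, 1))
d = deltas(ramp)
dd = deltas(d)  # delta-deltas: deltas applied twice
```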

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, deltas, delta-deltas; HPSS; HPSS, log-mel energies, deltas, delta-deltas
Classifier ResNet; Multi-input model, ResNet
PDF

Low-Memory Convolutional Neural Networks for Acoustic Scene Classification

Paulo Lopez-Meyer1, Juan Antonio Del Hoyo Ontiveros1, Hong Lu2, Hector Alfonso Cordourier Maruri1, Georg Stemmer3, Lama Nachman2 and Jonathan Huang4
1Intel Labs, Intel Corporation, Jalisco, Mexico, 2Intel Labs, Intel Corporation, California, USA, 3Intel Labs, Intel Corporation, Neubiberg, Germany, 4California, USA

Abstract

In this work, we describe the implementation of four different convolutional neural networks for acoustic scene classification, complying with the memory size restrictions defined in the DCASE 2020 Task 1B challenge guidelines. Quantization, pruning, knowledge distillation, and GCC-grams as input features were explored as means to achieve the highest accuracy possible while reducing the models' trainable parameters and memory footprint. Our experimental results exceed the 87.30% accuracy reported for the challenge baseline: all four of our submissions achieve > 90.00% acoustic scene classification accuracy using CNN models under 500 KB.
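The GCC-grams mentioned above stack generalized cross-correlation frames over time; one GCC-PHAT frame between the two channels can be sketched as follows (our own illustration, not the authors' implementation):

```python
import numpy as np

def gcc_phat(x, y, n_fft=256):
    """One GCC-PHAT frame: cross-spectrum of two channels with
    phase-transform whitening, returned in the lag domain."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # PHAT: keep phase, drop magnitude
    return np.fft.irfft(cross, n_fft)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y = np.roll(x, 5)  # channel 2 lags channel 1 by 5 samples
cc = gcc_phat(x, y)
delay = (-int(np.argmax(cc))) % 256  # circular peak -> 5-sample shift
```

Stacking such frames across time, one per analysis window, yields the 2D GCC-gram input feature.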

System characteristics
Input mono; binaural
Sampling rate 16kHz; 48kHz
Data augmentation random noise, random gain, random cropping, mixup; SpecAugment
Features raw waveform; mel filterbank; log-mel filterbanks, GCC-grams
Classifier CNN
Decision making maximum softmax
Complexity management quantization; pruning, quantization; knowledge distillation, quantization
PDF

Low-Complexity Acoustic Scene Classification Using One-Bit-Per-Weight Deep Convolutional Neural Networks

Mark McDonnell
Computational Learning Systems Laboratory, University of South Australia, Mawson Lakes, Australia

Abstract

This technical report describes a submission to Task 1b (“Low-Complexity Acoustic Scene Classification”) in the DCASE 2020 Acoustic Scene Challenge. Solutions for this task were required to be constrained to have parameters totalling no more than 500 KB. The strategy described in this report was to train a deep convolutional neural network applied to spectrograms formed from the acoustic scene files, such that each convolutional weight was set to one of two values following training, and hence could be stored using a single bit. This strategy allowed a single 36-layer all-convolutional deep neural network to be trained, consisting of a total of 3,987,000 binary weights, totalling 486.69 KB. The model achieved a macro-average accuracy (balanced accuracy score) across the three classes of 96.6±0.5% on the 2020 DCASE Task 1b validation set.
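The storage arithmetic behind the figures quoted above checks out directly:

```python
# One-bit-per-weight storage: 3,987,000 binary weights packed at
# 1 bit each, expressed in KB (1 KB = 1024 bytes).
n_weights = 3_987_000
size_kb = n_weights / 8 / 1024
print(round(size_kb, 2))  # 486.69, matching the report, under 500 KB
```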

System characteristics
Input left, right
Sampling rate 48kHz
Data augmentation mixup, temporal cropping, channel swapping
Features log-mel energies
Classifier CNN
Complexity management 1-bit quantization
PDF

Task 1 DCASE 2020: ASC with Mismatch Devices and Reduced Size Model Using Residual Squeeze-Excitation CNNs

Javier Naranjo-Alcazar1,2, Sergi Perez-Castanos3, Pedro Zuccarello3 and Maximo Cobos2
1AI department, Visualfy, Benisano, Spain, 2Computer Science Department, Universitat de Valencia, Burjassot, Spain, 3AI department, Visualfy, Benisano, Valencia

Abstract

Acoustic Scene Classification (ASC) is a problem in the field of machine listening whose objective is to classify/tag an audio clip with a predefined label describing a scene location, such as park or airport, among others. With the emergence of larger audio datasets, solutions based on Deep Learning techniques have become the state of the art. The most common choice is to implement a convolutional neural network (CNN) after transforming the audio signal into a 2D representation; this two-dimensional audio representation is currently a subject of research, and some solutions propose several concatenated 2D representations, creating an input with several channels. This report proposes two novel stereo audio representations to maximize the accuracy of an ASC framework. These are 3-channel representations: one consisting of the left channel, the right channel and the difference between channels (L − R) using the Gammatone filter bank, and one consisting of the harmonic source, the percussive source and the difference between channels using the Mel filter bank. Both representations are also concatenated, creating a 6-channel input that combines the two filter banks. Furthermore, the proposed CNN is a residual network that employs squeeze-excitation techniques in its residual blocks in a novel way, forcing the network to extract meaningful features from the audio representation. The network is used in both subtasks with different modifications to meet the requirements of each; since stereo audio is not available in Subtask A, the representations are slightly modified for that task. This technical report first presents what the two subtask systems have in common, then details the relevant changes in one section per task. The baselines are surpassed in both tasks by approximately 10 percentage points.

System characteristics
Input left, right, difference; left, right, difference, mono
Sampling rate 48kHz
Features gammatone; gammatone, HPSS, log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification Using Long-Term and Fine-Scale Audio Representations

Paul Nguyen Hong Duc1, Dorian Cazau2, Olivier Adam3, Odile Gerard4 and Paul R. White5
1Institut d’Alembert, Sorbonne Universite, Paris, France, 2Lab-Sticc, ENSTA Bretagne, Brest, France, 3Institut d'Alembert, Sorbonne Université, Paris, France, 4Techniques Navales, DGA-TN, Toulon, France, 5ISVR, University of Southampton, Southampton, United Kingdom

Abstract

Audio scene classification (ASC) is an emerging field of research in several scientific communities, such as urban soundscape characterization and bioacoustics. It has gained visibility and relevance through open challenges, especially the benchmark datasets and evaluations from DCASE. This paper presents our deep learning model for the ASC task of the DCASE 2020 challenge. The model exploits multiple long-term and fine-scale audio representations as inputs to the neural network. Each representation is fed into a different branch, and the audio embeddings of the branches are fused before a Multi-Layer Perceptron predicts the final classes.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features RMS level, third-octave levels, Leq, interaural cross correlation coefficient, hardness, depth, brightness, roughness, warmth, sharpness, boominess, reverb, log-mel spectrogram
Classifier CNN, GRU, MLP
Decision making average
PDF

Ensemble of Pruned Models for Low-Complexity Acoustic Scene Classification

Kenneth Ooi, Santi Peksi and Gan Woon-Seng
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore

Abstract

For the DCASE 2020 Challenge, the focus of Task 1B is to develop low-complexity models for classification of 3 different types of acoustic scenes, which have potential applications in resource-scarce edge devices deployed in a large-scale acoustic network. For this report, we present the training methodology for our submissions for the challenge, with the best-performing system consisting of an ensemble of VGGNet- and InceptionNet-based lightweight classification models. The subsystems in the ensemble classifier were trained with log-mel spectrograms of the raw audio data, and were subsequently pruned by setting low-magnitude weights periodically to zero with a polynomial decay schedule for an 80% reduction in individual subsystem size. The resultant ensemble classifier outperformed the baseline model on the validation set over 5 runs and had 119758 nonzero parameters which took up 468 KB of memory, thus showing the efficacy of the pruning technique. No external data was used, and source code for the submission can be found at https://github.com/kenowr/DCASE-2020-Task-1B.
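The polynomial-decay pruning schedule described above, together with the final size check, can be sketched as follows; the schedule mirrors the one in common model-optimization toolkits, and the step values are hypothetical:

```python
def sparsity_at(step, begin, end, final, initial=0.0, power=3):
    """Polynomial-decay pruning schedule: sparsity ramps from
    `initial` at step `begin` to `final` at step `end`, then holds."""
    if step <= begin:
        return initial
    if step >= end:
        return final
    frac = (step - begin) / (end - begin)
    return final + (initial - final) * (1.0 - frac) ** power

mid = sparsity_at(50, 0, 100, 0.8)  # 0.8 - 0.8 * 0.5**3 = 0.7

# the report's final model: 119,758 nonzero float32 parameters
size_kb = 119_758 * 4 / 1024  # ~467.8 KB, i.e. the 468 KB quoted
```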

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation block mixing
Features log-mel energies
Classifier VGGNet; InceptionNet; VGGNet, InceptionNet, ensemble
Decision making average
Complexity management sparsity
PDF

Lightweight Convolutional Neural Networks on Binaural Waveforms for Low Complexity Acoustic Scene Classification

Nicolas Pajusco, Richard Huang and Nicolas Farrugia
Electronics, IMT Atlantique, Brest, France

Abstract

This report describes our submission to DCASE 2020 Task 1, Subtask B, an acoustic scene classification task with the objective of minimizing parameter count. While the vast majority of proposed approaches rely on fixed feature extraction based on time-frequency representations such as spectrograms, we propose to fully exploit the information in binaural waveforms directly. To do so, we train one-dimensional Convolutional Neural Networks (1D-CNNs) on raw, subsampled binaural audio waveforms, thus exploiting phase information within and across the two input channels. In addition, our approach relies heavily on data augmentation in the temporal domain. Finally, we apply iterative structured parameter pruning to remove the least important convolutional kernels, and perform weight quantization in half-precision floating point. We apply this approach to two network architectures: a 1D-CNN built from VGG-like blocks, and a ResNet architecture with 1D convolutions. Our results show that we can train, prune and quantize a small VGG model to 20 times smaller than the 500 KB limit (model A) with accuracy at baseline level (87.6%), as well as a larger model achieving 91% accuracy while being 8 times smaller than the challenge limit. ResNets could likewise be trained, pruned and quantized to fit below the 500 KB limit, achieving up to 91.2% accuracy.

System characteristics
Input binaural
Sampling rate 18kHz
Data augmentation temporal masking, filtering, additive noise; cutmix
Features raw waveform
Classifier CNN; ResNet
Complexity management float16, quantization, pruning; float16, quantization
PDF

Classification of Acoustic Scenes Based on Modulation Spectra and the Cepstrum of the Cross Correlation Between Binaural Audio Channels

Arturo Paniagua, Rubén Fraile, Juana M. Gutiérrez-Arriola, Nicolás Sáenz-Lechón and Víctor J. Osma-Ruiz
CITSEM, Universidad Politécnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses one audio channel for calculating the spectral distribution of energy across auditory-relevant frequency bands, and some descriptors of the envelope modulation spectrum (EMS) obtained by means of the discrete cosine transform. When the stereophonic signal captured by a binaural microphone is available, this parameter set is augmented by including the first coefficients of the cepstrum of the cross-correlation between both audio channels. This cross-correlation contains information on the angular distribution of acoustic sources. These three types of features (energy spectrum, EMS and cepstrum of cross-correlation) are used as inputs for a multilayer perceptron with two hidden layers and a number of adjustable parameters below 15,000.

System characteristics
Sampling rate 48kHz
Features LTAS, envelope modulation spectrum, cepstrum of cross-correlation
Classifier MLP
Decision making average log-likelihood
PDF
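Since the spectrum of the cross-correlation between two channels is the cross-power spectrum X·conj(Y), its cepstrum can be taken directly from the log magnitude of that product. The sketch below is an illustrative numpy reading of that idea only; the report's actual feature pipeline (DCT-based EMS descriptors, coefficient counts) is not reproduced, and the function name and `n_coeffs` value are assumptions:

```python
import numpy as np

def cross_correlation_cepstrum(left, right, n_coeffs=12, eps=1e-10):
    """First coefficients of the real cepstrum of the cross-correlation
    between two audio channels. FFT(xcorr(left, right)) equals the
    cross-power spectrum, so the cepstrum is the inverse FFT of its
    log magnitude."""
    X = np.fft.rfft(left)
    Y = np.fft.rfft(right)
    cross_power = X * np.conj(Y)
    cepstrum = np.fft.irfft(np.log(np.abs(cross_power) + eps))
    return cepstrum[:n_coeffs]
```

As the abstract notes, the cross-correlation carries information on the angular distribution of sources, so these low-order cepstral coefficients summarize inter-channel structure compactly enough to feed a small MLP.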

Exploring Compact Alternatives to Deep Learning in Task 1B

Prachi Patki

Abstract

Task 1b appeared primarily geared toward finding compact deep learning models; however, our experience is that other methodologies may sometimes achieve similar accuracies with substantially smaller parameter counts. We focused on finding alternative classifier formulations that significantly reduce complexity while still achieving superior results. Our primary submission, based on a multi-channel SVM formulation, performs better than the reference design on test data, but requires only ~17.5 KB in parameter complexity.

System characteristics
Input left+right, left-right; mono
Sampling rate 48kHz
Features log-mel spectrogram
Classifier SVM
PDF
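To make the ~17.5 KB figure concrete under the challenge's parameter-count accounting, a linear one-vs-rest SVM's footprint is just weights plus biases. The sketch below is a generic illustration; the feature dimensionality shown is arbitrary and does not reconstruct the submission's actual breakdown:

```python
def linear_svm_size_kb(n_classes, n_features, bytes_per_param=4):
    """Parameter footprint of a one-vs-rest linear SVM: one weight
    vector plus one bias per class, stored as float32 by default."""
    n_params = n_classes * (n_features + 1)
    return n_params * bytes_per_param / 1024.0

# e.g. 3 scene classes on a hypothetical 1024-dimensional feature vector:
size = linear_svm_size_kb(3, 1024)   # ~12 KB, far under the 500 KB limit
```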

DCASE 2020 Challenge Task 1B: Low-Complexity CNN-Based Framework for Acoustic Scene Classification

Lam Pham1, Ngo Dat2, Phan Huy3 and Duong Ngoc4
1School of Computing, University of Kent, Kent, UK, 2Electrical & Electronic Engineering, Ho Chi Minh University of Technology, Ho Chi Minh, Vietnam, 3School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK, 4InterDigital R&D, InterDigital Company, Rennes, France

Abstract

This report presents a low-complexity CNN-based deep learning framework for the acoustic scene classification (ASC) task. In particular, the framework uses spectrogram representations as front-end feature extraction. The extracted spectrograms are fed into a CNN-based architecture for classification, referred to as the baseline. Next, quantization and pruning techniques are applied to the pre-trained baseline to fine-tune and further compress the network, eventually achieving low-complexity models with competitive performance.

System characteristics
Input left
Sampling rate 48kHz
Data augmentation mixup
Features Gammatone energy
Classifier CNN
Complexity management quantization; pruning
PDF

DCASE 2020 Task 1 Subtask B: Low-Complexity Acoustic Scene Classification

Duc Phan and Douglas Jones
ECE, University of Illinois at Urbana Champaign, Illinois, USA

Abstract

A deep network with depth-wise separable convolutions [1] and skip connections is introduced for low-complexity acoustic scene classification. The proposed network is not only more than 15 times smaller than the baseline convolutional neural network [2] but also outperforms the baseline by two percent on average.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
PDF
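The size reduction claimed above comes largely from replacing standard convolutions with depth-wise separable ones. The standard parameter-count comparison can be sketched as follows (layer dimensions here are illustrative, not taken from the report):

```python
def conv2d_params(in_ch, out_ch, k):
    """Weights of a standard k x k 2-D convolution (biases omitted)."""
    return in_ch * out_ch * k * k

def depthwise_separable_params(in_ch, out_ch, k):
    """Depth-wise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution mixing the channels."""
    return in_ch * k * k + in_ch * out_ch
```

For a 3x3 layer with 64 input and 64 output channels this gives 36,864 versus 4,672 weights, roughly an 8x saving per layer; compounded across a network, savings of this kind make a 15x smaller model plausible.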

Low-Complexity Acoustic Scene Classification Using AALNet-94

Arunodhayan Sampathkumar and Danny Kowerko
Juniorprofessur Media Computing, Technische Universität Chemnitz, Chemnitz, Germany

Abstract

One of the manifold application fields of Deep Neural Networks (DNN) is the classification of audio signals such as indoor, outdoor, transportation, human and animal sounds. DCASE 2020 provided a dataset consisting of 3 classes to perform classification using low-complexity solutions. The dataset was used to train AALNet-94 from our previous research work, which performed well on publicly available datasets such as ESC-50, UrbanSound8K and AudioSet. The results obtained compare favorably with the baseline.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
PDF

End2end CNN-Based Low-Complexity Acoustic Scene Classification

Arshdeep Singh, Dhanunjaya Varma Devalraju and Padmanabhan Rajan
School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Mandi, India

Abstract

This technical report describes the IITMandi AudioTeam's submission for ASC Task 1, Subtask B of the DCASE 2020 challenge. The report aims to design low-complexity systems for acoustic scene classification. We propose a convolutional neural network based end-to-end classification framework. The proposed framework learns from raw audio directly. We present a performance analysis of various frameworks with model size smaller than 500 KB. The three acoustic scenes, namely indoor, outdoor and transportation, are considered. Our experimental analysis shows that the proposed end-to-end framework, where features are learned from raw audio directly, achieves with a model size of approximately 77 KB performance on the development dataset similar to that of the baseline system proposed for the same task.

System characteristics
Input mono
Sampling rate 16kHz
Features raw waveform segment
Embeddings Singh_IITMandi_task1b_1, Singh_IITMandi_task1b_3
Classifier CNN
Decision making maximum likelihood
PDF

Designing Acoustic Scene Classification Models with CNN Variants

Sangwon Suh, Sooyoung Park, Youngho Jeong and Taejin Lee
Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, South Korea

Abstract

This technical report describes our Acoustic Scene Classification systems for DCASE2020 challenge Task1. For subtask A, we designed a single model implemented with three parallel ResNets, which is named Trident ResNet. We have confirmed that this structure is beneficial when analyzing samples collected from minority or unseen devices, and confirmed 73.7% classification accuracy for the test split. For subtask B, we used the Inception module to build a Shallow Inception model that has fewer parameters than the CNN of the DCASE baseline system. Due to the sparse structure of the Inception module, we have enhanced the accuracy of the model up to 97.6%, while reducing the number of parameters.

System characteristics
Input stereo
Sampling rate 48kHz
Data augmentation temporal cropping, mixup
Features log-mel energies
Classifier CNN(Inception)
Decision making average; weighted score average
Complexity management float16
PDF

Acoustic Scene Classification Using Fully Convolutional Neural Networks and Per-Channel Energy Normalization

Konstantinos Vilouras
Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract

This technical report describes our approach to Task 1 "Acoustic Scene Classification" of the DCASE 2020 challenge. For subtask A, we introduce per-channel energy normalization (PCEN) as an additional preprocessing step along with log-Mel spectrograms. We also propose two residual network architectures utilizing "Shake-Shake" regularization and the "Squeeze-and-Excitation" block, respectively. Our best submission (ensemble of 8 classifiers) outperforms the corresponding baseline system by 16.2% in terms of macro-average accuracy. For subtask B, we mainly focus on a low complexity, fully convolutional neural network architecture, which leads to 5% relative improvement over baseline accuracy.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
PDF
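The PCEN preprocessing mentioned above replaces the static log compression of a mel spectrogram with an adaptive gain. A minimal numpy sketch of the standard PCEN recursion follows; the smoothing and compression constants shown are common defaults from the PCEN literature, not necessarily the values used in this submission:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a non-negative mel
    spectrogram E with shape (n_mels, n_frames). A first-order IIR
    filter tracks a smoothed energy M, which then normalizes E
    before root compression."""
    M = np.zeros_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because M tracks the recent energy of each channel, slowly varying background levels are normalized away while transient foreground events are preserved, which is the usual motivation for PCEN in scene classification.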

Mel-Scaled Wavelet-Based Features for Sub-Task A and Texture Features for Sub-Task B of DCASE 2020 Task 1

Shefali Waldekar, Kishore Kumar A and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission to the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 for Task 1 (acoustic scene classification (ASC)), Subtask A (ASC with Multiple Devices) and Subtask B (Low-Complexity ASC). The systems exploit time-frequency representations of audio to obtain the scene labels. The system for Task 1A follows a simple pattern classification framework employing wavelet-transform-based mel-scaled features along with a support vector machine (SVM) as classifier. Texture features, namely Local Binary Patterns (LBP) extracted from the log of mel-band energies, are used in a similar classification framework for Task 1B. The proposed systems outperform the deep-learning based baseline system on the development dataset provided for the respective sub-tasks.

System characteristics
Input mono
Sampling rate 48kHz
Features histogram of uniform LBP of log-mel energies
Classifier SVM
PDF
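Treating a log-mel spectrogram as an image and summarizing it with a uniform-LBP histogram yields a fixed-length, very small feature vector for the SVM. A simplified numpy sketch of the standard 8-neighbour uniform LBP is shown below; the neighbourhood radius and 59-bin layout follow the common LBP convention and are not taken from the report:

```python
import numpy as np

def lbp_codes(img):
    """8-neighbour local binary pattern codes for the interior pixels
    of a 2-D array (e.g. a log-mel spectrogram)."""
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy : img.shape[0] - 1 + dy,
                 1 + dx : img.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.int32) << bit   # 1 if neighbour >= centre
    return codes

def uniform_lbp_histogram(img):
    """59-bin normalized histogram: one bin per 'uniform' pattern (at
    most two 0/1 transitions in the circular 8-bit code) plus one
    catch-all bin for all non-uniform patterns."""
    def transitions(code):
        bits = [(code >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    uniform = [c for c in range(256) if transitions(c) <= 2]   # 58 patterns
    index = {c: i for i, c in enumerate(uniform)}
    hist = np.zeros(59)
    for code in lbp_codes(img).ravel():
        hist[index.get(int(code), 58)] += 1
    return hist / hist.sum()
```

The histogram discards where patterns occur and keeps only how often each micro-texture appears, which is what makes it robust and compact enough for a low-complexity classifier.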

Acoustic Scene Classification with Multiple Decision Schemes

Helin Wang, Dading Chong and Yuexian Zou
School of ECE, Peking University, Shenzhen, China

Abstract

This technical report describes the ADSPLAB team's submission for Task 1 of the DCASE 2020 challenge. Our acoustic scene classification (ASC) system is based on convolutional neural networks (CNN). Multiple decision schemes are proposed in our system, including decision schemes over multiple representations, multiple frequency bands, and multiple temporal frames. The final system is a fusion of models with multiple decision schemes and models pre-trained on AudioSet. The experimental results show that our system achieves an accuracy of 84.5% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation dataset of Task 1A and Task 1B, respectively.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, CQT, Gammatone
Classifier CNN
PDF

Searching for Efficient Network Architectures for Acoustic Scene Classification

Yuzhong Wu and Tan Lee
Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

Abstract

This technical report describes our submission for Task 1B of the DCASE 2020 challenge. The objective of Task 1B is to construct an acoustic scene classification (ASC) system with low model complexity. In our ASC system, average-difference time-frequency features are extracted from binaural audio waveforms. A random search policy is used to find the best-performing CNN architecture while satisfying the model size requirement. The search is limited to several predefined efficient convolutional modules based on depth-wise convolution and the swish activation function to constrain the size of the search space. Experimental results on the development dataset show that the CNN model obtained by this search strategy has higher accuracy than an AlexNet-like CNN benchmark.

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation mixup
Features wavelet filter-bank features
Classifier CNN
Decision making average
Complexity management float16
PDF

BUPT Submissions to DCASE 2020: Low-Complexity Acoustic Scene Classification with Post-Training Static Quantization and Pruning

Jiawang Zhang, Chunxia Ren and Shengchen Li
BUPT, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes a method for Task 1B (Low-Complexity Acoustic Scene Classification) of the DCASE 2020 challenge, which targets low-complexity solutions to the classification problem. The proposed model has five residual blocks with average pooling. To improve performance, binaural features from the dataset are used together with log-mel spectrograms and mixup data augmentation. To reduce system complexity, the proposed method uses post-training static quantization and pruning: post-training static quantization performs 8-bit quantization, which reduces the model size by a factor of four, while pruning removes redundant low-magnitude weights, so that only a small fraction of the original weight parameters achieves performance close to that of the original network. The accuracy of the proposed method on the development dataset is 92.9%, which is 5.6% higher than the baseline, while the model is 81% smaller than the baseline model.

System characteristics
Input mixed
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier ResNet
Complexity management 8-bit quantization
PDF
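The four-fold size reduction quoted above follows directly from storing each float32 weight as one byte. As a rough illustration of 8-bit post-training quantization (a plain affine scheme, not the exact PyTorch static quantization machinery the report likely used), consider:

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) 8-bit quantization of a float weight
    tensor: w is approximated by scale * (q - zero_point), with q
    stored in uint8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0          # guard against constant tensors
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized tensor."""
    return scale * (q.astype(np.float32) - zero_point)
```

Each weight now occupies 1 byte instead of 4, and the reconstruction error is bounded by half the quantization step, which is why accuracy typically degrades little.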

DD-CNN: Depthwise Disout Convolutional Neural Network for Low-Complexity Acoustic Scene Classification

Jingqiao Zhao1, Xiao-Jun Wu1, Xiaoning Song1, Zhen-Hua Feng2 and Qiuqiang Kong3
1Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China, 2Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK, 3AI Lab, ByteDance, Shanghai, China

Abstract

This report presents our Depthwise Disout Convolutional Neural Network (DD-CNN) used for the detection and classification of urban acoustic scenes in the DCASE2020 Challenge (Task 1 - Subtask B). Specifically, we use log-mel as feature representations of acoustic signals for the inputs of our network. In DD-CNN, depthwise separable convolution is used to reduce the network complexity. Besides, SpecAugment and Disout are used for further performance boosting. Experimental results demonstrate that our DD-CNN can learn discriminative acoustic characteristics from audio fragments and effectively reduce the network complexity. Our method achieves 92.04% accuracy on the validation set.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CNN
Complexity management disout
PDF
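The SpecAugment step mentioned above masks random frequency bands and time spans of the log-mel input during training. A simplified numpy sketch is given below; mask counts, maximum widths, and the choice of filling masks with the spectrogram mean are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2,
                 max_f=8, max_t=20, rng=None):
    """SpecAugment-style masking of a log-mel spectrogram with shape
    (n_mels, n_frames): random frequency bands and time spans are
    replaced by the spectrogram mean."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    fill = spec.mean()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)          # mask height in mel bins
        f0 = rng.integers(0, n_mels - f + 1)
        out[f0 : f0 + f, :] = fill
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)          # mask width in frames
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0 : t0 + t] = fill
    return out
```

Because masks hide contiguous regions rather than individual bins, the network is pushed to rely on the full time-frequency context, complementing the Disout regularization used in DD-CNN.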