Acoustic scene classification


Challenge results

Task description

This subtask is concerned with the basic problem of acoustic scene classification, in which all available data (development and evaluation) are recorded with the same device, in this case device A.

The development dataset consists of recordings from ten cities; the training subset contains recordings from only nine of the cities, to test the generalization properties of the systems. The training/test subsets are created based on the recording location, such that the training subset contains approximately 70% of the recording locations from each city. The test subset contains recordings from the remaining locations, plus a few locations from the tenth city. The full data from the tenth city is provided but is partly unused in this setup, to reflect the final evaluation setup. The development set contains 40 hours of data, with 14400 segments (144 per city per acoustic scene class). The training/test setup includes segments from Milan only in the test subset. There are 9185 segments in the training set, 4185 in the test set, and an additional 1030 segments from Milan. For complete details on the dataset, see the readme file provided with the data.
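For readers who want to reproduce a comparable split on their own metadata, a grouped split that holds out roughly 30% of the recording locations per city captures the idea; a minimal sketch, assuming a per-segment metadata table with hypothetical column names `city` and `location_id` (the Milan data would be handled separately, as described above):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_location(meta: pd.DataFrame, train_fraction: float = 0.7, seed: int = 0):
    """Assign ~70% of the recording locations of each city to training.

    `meta` is assumed to hold one row per segment with hypothetical columns
    'city' and 'location_id'; this is not the official metadata schema.
    """
    train_idx, test_idx = [], []
    for _, city_meta in meta.groupby("city"):
        splitter = GroupShuffleSplit(n_splits=1, train_size=train_fraction, random_state=seed)
        tr, te = next(splitter.split(city_meta, groups=city_meta["location_id"]))
        train_idx.extend(city_meta.index[tr])
        test_idx.extend(city_meta.index[te])
    return meta.loc[train_idx], meta.loc[test_idx]
```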

A more detailed task description can be found on the task description page.

Systems ranking

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy / seen cities (Evaluation dataset) | Accuracy / unseen cities (Evaluation dataset)
Bilot_IDG_task1a_1 Base+Att Bilot2019 66.1 (65.0 - 67.2) 67.3 67.8 57.7
Bilot_IDG_task1a_2 Gates+Att Bilot2019 67.3 (66.3 - 68.4) 70.6 70.1 53.7
Bilot_IDG_task1a_3 Dilated Bilot2019 64.5 (63.4 - 65.6) 64.1 66.8 52.9
Bilot_IDG_task1a_4 Fusion Bilot2019 68.3 (67.3 - 69.4) 72.3 70.5 57.8
Chandrasekhar_IIITH_task1a_1 ASC by SFCC and DNN Paseddula2019 52.6 (51.4 - 53.7) 70.4 54.8 41.3
DSPLAB_TJU_task1a_1 DSPLAB_TJU_task1a_1 Ding2019 66.5 (65.4 - 67.6) 63.2 68.3 57.9
DSPLAB_TJU_task1a_2 DSPLAB_TJU_task1a_1 Ding2019 69.6 (68.5 - 70.6) 72.1 56.8
DSPLAB_TJU_task1a_3 DSPLAB_TJU_task1a_1 Ding2019 65.0 (63.9 - 66.1) 64.3 66.5 57.6
DSPLAB_TJU_task1a_4 DSPLAB_TJU_task1a_1 Ding2019 69.5 (68.4 - 70.5) 72.0 56.7
Fmta91_KNToosi_task1a_1 BWS_RSD Arabnezhad2019 76.2 (75.2 - 77.2) 88.0 76.6 74.3
Fraile_UPM_task1a_1 UPM_EMS Fraile2019 58.7 (57.6 - 59.9) 53.9 60.1 51.6
DCASE2019 baseline Baseline 63.3 (62.2 - 64.5) 62.5 66.7 46.1
Huang_IL_task1a_1 DLensemble Huang2019 80.5 (79.6 - 81.4) 81.6 75.3
Huang_IL_task1a_2 DLensemble Huang2019 81.1 (80.2 - 82.0) 82.1 76.1
Huang_IL_task1a_3 DLensemble Huang2019 81.3 (80.4 - 82.2) 82.3 76.6
Huang_IL_task1a_4 DLensemble Huang2019 79.5 (78.6 - 80.5) 81.1 71.5
Huang_SCNU_task1a_1 Huang_SCNU Huang2019a 79.2 (78.3 - 80.1) 73.9 80.7 71.6
JSNU_WDXY_task1a_1 task1a Ma2019 72.2 (71.1 - 73.2) 71.1 73.8 63.9
Jung_UOS_task1a_1 S_KD_1 Jung2019 81.1 (80.2 - 82.0) 76.1 82.5 74.2
Jung_UOS_task1a_2 S_KD_2 Jung2019 81.2 (80.3 - 82.1) 76.1 82.5 74.5
Jung_UOS_task1a_3 S_KD_3 Jung2019 81.0 (80.1 - 81.9) 76.2 82.5 73.6
Jung_UOS_task1a_4 S_KD_4 Jung2019 81.2 (80.3 - 82.1) 76.2 82.6 74.4
KK_I2R_task1a_1 system1 KK2019 76.6 (75.6 - 77.6) 78.3 68.3
KK_I2R_task1a_2 system2 KK2019 77.7 (76.7 - 78.6) 79.5 68.6
KK_I2R_task1a_3 system3 KK2019 77.2 (76.2 - 78.2) 78.7 69.8
Kong_SURREY_task1a_1 cvssp_cnn9 Kong2019 70.5 (69.5 - 71.6) 69.2 72.8 59.0
Koutini_CPJKU_task1a_1 cp_resnet Koutini2019 82.8 (82.0 - 83.7) 83.7 84.2 75.8
Koutini_CPJKU_task1a_2 cv_avg Koutini2019 83.7 (82.9 - 84.6) 83.7 84.8 78.5
Koutini_CPJKU_task1a_3 variants Koutini2019 83.5 (82.6 - 84.4) 83.7 84.7 77.5
Koutini_CPJKU_task1a_4 variants2 Koutini2019 83.8 (82.9 - 84.6) 83.7 84.9 78.1
LamPham_HCMGroup_task1a_1 HCM Group Pham2019 73.9 (72.9 - 74.9) 73.7 76.5 60.9
LamPham_KentGroup_task1a_1 Kent Group Pham2019a 76.8 (75.8 - 77.7) 76.2 78.9 66.2
Lei_CQU_task1a_1 model1 Lei2019 75.5 (74.5 - 76.5) 79.6 76.7 69.5
Li_NPU_task1a_1 NDHWD_1 FangLi2019 59.9 (58.8 - 61.0) 88.0 61.3 52.6
Li_NPU_task1a_2 NDHWD_2 FangLi2019 61.8 (60.7 - 62.9) 88.0 63.7 51.9
Liang_HUST_task1a_1 Liang_1 Liang2019 68.2 (67.1 - 69.2) 70.7 69.2 62.9
Liang_HUST_task1a_2 Liang_2 Liang2019 66.4 (65.3 - 67.5) 70.2 68.3 56.7
Liu_SCUT_task1a_1 HS Mingle2019 78.3 (77.4 - 79.3) 87.8 79.9 70.5
Liu_SCUT_task1a_2 HS Mingle2019 79.9 (79.0 - 80.8) 88.4 81.6 71.4
Liu_SCUT_task1a_3 HS Mingle2019 78.3 (77.3 - 79.2) 86.8 79.9 69.9
Liu_SCUT_task1a_4 HS Mingle2019 78.4 (77.4 - 79.3) 87.8 80.1 69.8
MaLiu_BIT_task1a_1 BIT_task1a_1 Ma2019a 72.8 (71.8 - 73.8) 84.7 73.8 67.8
MaLiu_BIT_task1a_2 BIT_task1a_2 Liu2019 76.0 (75.1 - 77.0) 88.1 77.0 71.4
MaLiu_BIT_task1a_3 BIT_task1a_3 Ma2019a 73.3 (72.3 - 74.3) 85.4 74.3 68.5
Mars_PRDCSG_task1a_1 PRDCSG Mars2019 79.3 (78.3 - 80.2) 79.4 80.2 74.6
McDonnell_USA_task1a_1 UniSA_1a1 Gao2019 80.0 (79.0 - 80.9) 82.3 82.4 67.8
McDonnell_USA_task1a_2 UniSA_1a2 Gao2019 80.5 (79.6 - 81.4) 81.4 82.8 69.4
McDonnell_USA_task1a_3 UniSA_1a3 Gao2019 80.4 (79.5 - 81.3) 82.8 68.5
McDonnell_USA_task1a_4 UniSA_1a4 Gao2019 80.3 (79.4 - 81.2) 82.8 67.8
Naranjo-Alcazar_VfyAI_task1a_1 Visualfy Naranjo-Alcazar2019 74.1 (73.1 - 75.2) 76.8 75.8 65.7
Naranjo-Alcazar_VfyAI_task1a_2 Visualfy Naranjo-Alcazar2019 74.2 (73.2 - 75.2) 77.1 75.9 65.8
Naranjo-Alcazar_VfyAI_task1a_3 Visualfy Naranjo-Alcazar2019 74.0 (73.0 - 75.0) 76.8 75.7 65.6
Naranjo-Alcazar_VfyAI_task1a_4 Visualfy Naranjo-Alcazar2019 74.1 (73.1 - 75.1) 76.9 75.8 65.7
Plata_SRPOL_task1a_1 CLUSTERS Plata2019 78.8 (77.9 - 79.8) 80.8 80.2 72.2
Plata_SRPOL_task1a_2 CLUSTERS Plata2019 79.2 (78.3 - 80.1) 81.6 80.6 72.3
Plata_SRPOL_task1a_3 CLUSTERS Plata2019 77.2 (76.3 - 78.2) 80.8 78.8 69.7
Plata_SRPOL_task1a_4 CLUSTERS Plata2019 77.9 (77.0 - 78.9) 81.6 79.2 71.4
SSW_ETRI_task1a_1 ETRI_base Sangwon2019 66.7 (65.6 - 67.8) 74.4 68.0 60.0
SSW_ETRI_task1a_2 ETRI_E8 Sangwon2019 67.0 (65.9 - 68.1) 75.5 68.9 57.5
SSW_ETRI_task1a_3 ETRI_E17 Sangwon2019 67.6 (66.5 - 68.7) 76.1 69.3 59.1
SSW_ETRI_task1a_4 ETRI_E25 Sangwon2019 67.6 (66.5 - 68.7) 76.1 69.3 58.8
Salvati_DMIF_task1a_1 RW-CNN Salvati2019 68.5 (67.5 - 69.6) 69.7 69.8 62.0
Seo_LGE_task1a_1 LGE_ROBOTICS Hyeji2019 81.6 (80.7 - 82.5) 80.5 82.8 75.3
Seo_LGE_task1a_2 LGE_ROBOTICS Hyeji2019 82.5 (81.6 - 83.4) 80.8 83.8 76.1
Seo_LGE_task1a_3 LGE_ROBOTICS Hyeji2019 81.1 (80.2 - 82.0) 79.5 82.4 74.6
Seo_LGE_task1a_4 LGE_ROBOTICS Hyeji2019 82.5 (81.7 - 83.4) 81.5 83.8 76.5
Waldekar_IITKGP_task1a_1 IITKGP_MFDWC19 Waldekar2019 65.9 (64.8 - 67.0) 67.4 67.8 56.0
Wang_BTBU_task1a_1 Wang_BTBU Wang2019 32.2 (31.1 - 33.3) 73.5 33.7 24.4
Wang_NWPU_task1a_1 Mou_task1A1 Wang2019a_t1 80.6 (79.7 - 81.5) 78.8 82.1 73.2
Wang_NWPU_task1a_2 Mou_task1A1 Wang2019a 80.1 (79.1 - 81.0) 78.3 81.7 71.9
Wang_NWPU_task1a_3 Mou_task1A1 Wang2019a 76.6 (75.6 - 77.6) 78.3 77.9 70.6
Wang_NWPU_task1a_4 Mou_task1A1 Wang2019a 76.8 (75.8 - 77.8) 78.8 78.1 70.6
Wang_SCUT_task1a_1 5c Wang2019b 76.4 (75.4 - 77.4) 86.9 79.1 62.4
Wang_SCUT_task1a_2 5c Wang2019b 76.6 (75.6 - 77.5) 86.9 79.2 63.2
Wang_SCUT_task1a_3 5c Wang2019b 75.9 (74.9 - 76.9) 86.9 78.6 62.2
Wang_SCUT_task1a_4 5c Wang2019b 76.5 (75.5 - 77.5) 86.9 79.2 62.9
Wilkinghoff_FKIE_task1a_1 FKIEsingle Wilkinghoff2019 74.6 (73.6 - 75.6) 76.0 67.8
Wilkinghoff_FKIE_task1a_2 FKIEensemb Wilkinghoff2019 76.2 (75.2 - 77.2) 77.5 69.6
Wu_CUHK_task1a_1 Wu_CUHK_1 Wu2019 80.1 (79.1 - 81.0) 76.6 82.0 70.4
Yang_UESTC_task1a_1 6models-4folds_avg Haocong2019 79.9 (78.9 - 80.8) 82.2 68.1
Yang_UESTC_task1a_2 6models-4folds_rf Haocong2019 81.6 (80.7 - 82.5) 83.8 70.7
Yang_UESTC_task1a_3 4models-4folds_rf Haocong2019 81.2 (80.3 - 82.1) 83.4 69.8
Zeinali_BUT_task1a_1 ScoreFuse Zeinali2019 78.9 (78.0 - 79.9) 77.0 80.5 71.2
Zeinali_BUT_task1a_2 ScoreFuse Zeinali2019 78.9 (77.9 - 79.8) 77.0 80.2 72.2
Zeinali_BUT_task1a_3 ScoreFuse Zeinali2019 79.1 (78.1 - 80.0) 77.0 80.6 71.3
Zhang_IOA_task1a_1 ZhangIOA1 Chen2019 84.9 (84.1 - 85.7) 85.1 86.4 77.3
Zhang_IOA_task1a_2 ZhangIOA2 Chen2019 84.9 (84.1 - 85.8) 85.1 86.5 77.0
Zhang_IOA_task1a_3 ZhangIOA3 Chen2019 85.2 (84.4 - 86.0) 85.1 86.7 77.9
Zhang_IOA_task1a_4 ZhangIOA4 Chen2019 84.8 (83.9 - 85.6) 85.3 86.3 77.1
Zheng_USTC_task1a_1 CNN_9Avg Zheng2019 75.7 (74.7 - 76.7) 78.7 78.1 63.8
Zheng_USTC_task1a_2 MyEnvNet Zheng2019 71.3 (70.3 - 72.4) 69.0 72.7 64.5
Zheng_USTC_task1a_3 1_2fusion Zheng2019 78.9 (77.9 - 79.8) 81.1 81.2 67.0
Zhou_Kuaiyu_task1a_1 Kuaiyu Zhou2019_t1 79.8 (78.8 - 80.7) 69.7 81.6 70.6
Zhou_Kuaiyu_task1a_2 Kuaiyu Zhou2019_t1 79.4 (78.5 - 80.4) 69.7 81.2 70.6
Zhou_Kuaiyu_task1a_3 Kuaiyu Zhou2019_t1 78.7 (77.7 - 79.6) 69.7 80.6 69.1
Zhu_SSLabBUPT_task1a_1 mult_ensem Zhu2019 79.2 (78.3 - 80.1) 81.0 70.0
Zhu_SSLabBUPT_task1a_2 mult_ensem Zhu2019 78.8 (77.9 - 79.7) 80.5 70.1
Zhu_SSLabBUPT_task1a_3 mult_ensem Zhu2019 79.1 (78.2 - 80.1) 81.0 69.9
Zhu_SSLabBUPT_task1a_4 mult_ensem Zhu2019 78.8 (77.8 - 79.7) 80.7 69.2
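The 95% confidence intervals reported above are consistent with a binomial interval over the evaluation segments; as an illustration only (the organizers' exact method and the evaluation-set size are not stated on this page), a normal-approximation interval looks like this:

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation (Wald) 95% interval for a classification accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p - half_width, p + half_width

# Illustrative numbers only: 4430 correct out of 7000 segments gives ~63.3% (62.2 - 64.4).
low, high = accuracy_confidence_interval(4430, 7000)
print(f"{100 * 4430 / 7000:.1f} ({100 * low:.1f} - {100 * high:.1f})")
```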

Teams ranking

Table including only the best performing system per submitting team.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy / seen cities (Evaluation dataset) | Accuracy / unseen cities (Evaluation dataset)
Bilot_IDG_task1a_4 Fusion Bilot2019 68.3 (67.3 - 69.4) 72.3 70.5 57.8
Chandrasekhar_IIITH_task1a_1 ASC by SFCC and DNN Paseddula2019 52.6 (51.4 - 53.7) 70.4 54.8 41.3
DSPLAB_TJU_task1a_2 DSPLAB_TJU_task1a_1 Ding2019 69.6 (68.5 - 70.6) 72.1 56.8
Fmta91_KNToosi_task1a_1 BWS_RSD Arabnezhad2019 76.2 (75.2 - 77.2) 88.0 76.6 74.3
Fraile_UPM_task1a_1 UPM_EMS Fraile2019 58.7 (57.6 - 59.9) 53.9 60.1 51.6
DCASE2019 baseline Baseline 63.3 (62.2 - 64.5) 62.5 66.7 46.1
Huang_IL_task1a_3 DLensemble Huang2019 81.3 (80.4 - 82.2) 82.3 76.6
Huang_SCNU_task1a_1 Huang_SCNU Huang2019a 79.2 (78.3 - 80.1) 73.9 80.7 71.6
JSNU_WDXY_task1a_1 task1a Ma2019 72.2 (71.1 - 73.2) 71.1 73.8 63.9
Jung_UOS_task1a_4 S_KD_4 Jung2019 81.2 (80.3 - 82.1) 76.2 82.6 74.4
KK_I2R_task1a_2 system2 KK2019 77.7 (76.7 - 78.6) 79.5 68.6
Kong_SURREY_task1a_1 cvssp_cnn9 Kong2019 70.5 (69.5 - 71.6) 69.2 72.8 59.0
Koutini_CPJKU_task1a_4 variants2 Koutini2019 83.8 (82.9 - 84.6) 83.7 84.9 78.1
LamPham_HCMGroup_task1a_1 HCM Group Pham2019 73.9 (72.9 - 74.9) 73.7 76.5 60.9
LamPham_KentGroup_task1a_1 Kent Group Pham2019a 76.8 (75.8 - 77.7) 76.2 78.9 66.2
Lei_CQU_task1a_1 model1 Lei2019 75.5 (74.5 - 76.5) 79.6 76.7 69.5
Li_NPU_task1a_2 NDHWD_2 FangLi2019 61.8 (60.7 - 62.9) 88.0 63.7 51.9
Liang_HUST_task1a_1 Liang_1 Liang2019 68.2 (67.1 - 69.2) 70.7 69.2 62.9
Liu_SCUT_task1a_2 HS Mingle2019 79.9 (79.0 - 80.8) 88.4 81.6 71.4
MaLiu_BIT_task1a_2 BIT_task1a_2 Liu2019 76.0 (75.1 - 77.0) 88.1 77.0 71.4
Mars_PRDCSG_task1a_1 PRDCSG Mars2019 79.3 (78.3 - 80.2) 79.4 80.2 74.6
McDonnell_USA_task1a_2 UniSA_1a2 Gao2019 80.5 (79.6 - 81.4) 81.4 82.8 69.4
Naranjo-Alcazar_VfyAI_task1a_2 Visualfy Naranjo-Alcazar2019 74.2 (73.2 - 75.2) 77.1 75.9 65.8
Plata_SRPOL_task1a_2 CLUSTERS Plata2019 79.2 (78.3 - 80.1) 81.6 80.6 72.3
SSW_ETRI_task1a_3 ETRI_E17 Sangwon2019 67.6 (66.5 - 68.7) 76.1 69.3 59.1
Salvati_DMIF_task1a_1 RW-CNN Salvati2019 68.5 (67.5 - 69.6) 69.7 69.8 62.0
Seo_LGE_task1a_4 LGE_ROBOTICS Hyeji2019 82.5 (81.7 - 83.4) 81.5 83.8 76.5
Waldekar_IITKGP_task1a_1 IITKGP_MFDWC19 Waldekar2019 65.9 (64.8 - 67.0) 67.4 67.8 56.0
Wang_BTBU_task1a_1 Wang_BTBU Wang2019 32.2 (31.1 - 33.3) 73.5 33.7 24.4
Wang_NWPU_task1a_1 Mou_task1A1 Wang2019a_t1 80.6 (79.7 - 81.5) 78.8 82.1 73.2
Wang_SCUT_task1a_2 5c Wang2019b 76.6 (75.6 - 77.5) 86.9 79.2 63.2
Wilkinghoff_FKIE_task1a_2 FKIEensemb Wilkinghoff2019 76.2 (75.2 - 77.2) 77.5 69.6
Wu_CUHK_task1a_1 Wu_CUHK_1 Wu2019 80.1 (79.1 - 81.0) 76.6 82.0 70.4
Yang_UESTC_task1a_2 6models-4folds_rf Haocong2019 81.6 (80.7 - 82.5) 83.8 70.7
Zeinali_BUT_task1a_3 ScoreFuse Zeinali2019 79.1 (78.1 - 80.0) 77.0 80.6 71.3
Zhang_IOA_task1a_3 ZhangIOA3 Chen2019 85.2 (84.4 - 86.0) 85.1 86.7 77.9
Zheng_USTC_task1a_3 1_2fusion Zheng2019 78.9 (77.9 - 79.8) 81.1 81.2 67.0
Zhou_Kuaiyu_task1a_1 Kuaiyu Zhou2019_t1 79.8 (78.8 - 80.7) 69.7 81.6 70.6
Zhu_SSLabBUPT_task1a_1 mult_ensem Zhu2019 79.2 (78.3 - 80.1) 81.0 70.0

Class-wise performance

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Bilot_IDG_task1a_1 Base+Att Bilot2019 66.1 47.2 77.4 70.1 52.4 95.6 31.9 68.3 62.9 81.8 73.6
Bilot_IDG_task1a_2 Gates+Att Bilot2019 67.3 58.3 69.0 71.7 61.5 90.0 36.0 61.4 54.6 86.2 84.7
Bilot_IDG_task1a_3 Dilated Bilot2019 64.5 56.9 76.7 69.7 41.5 87.5 37.5 71.8 63.1 80.8 59.6
Bilot_IDG_task1a_4 Fusion Bilot2019 68.3 54.7 77.9 72.4 60.3 89.3 37.4 70.6 65.4 86.4 69.2
Chandrasekhar_IIITH_task1a_1 ASC by SFCC and DNN Paseddula2019 52.6 21.7 53.5 78.1 32.8 90.0 22.1 39.0 33.9 81.2 73.6
DSPLAB_TJU_task1a_1 DSPLAB_TJU_task1a_1 Ding2019 66.5 41.2 86.0 63.1 48.2 91.1 41.1 62.5 74.7 85.6 71.8
DSPLAB_TJU_task1a_2 DSPLAB_TJU_task1a_1 Ding2019 69.6 46.2 79.6 71.1 57.6 93.5 43.6 69.3 71.2 87.4 76.2
DSPLAB_TJU_task1a_3 DSPLAB_TJU_task1a_1 Ding2019 65.0 32.1 75.7 68.1 44.9 88.3 42.4 66.7 70.8 87.5 73.8
DSPLAB_TJU_task1a_4 DSPLAB_TJU_task1a_1 Ding2019 69.5 45.7 80.3 71.2 56.2 93.3 43.8 69.0 72.5 87.1 75.6
Fmta91_KNToosi_task1a_1 BWS_RSD Arabnezhad2019 76.2 57.5 91.8 79.0 72.6 86.4 65.0 71.7 71.4 85.6 81.2
Fraile_UPM_task1a_1 UPM_EMS Fraile2019 58.7 45.1 70.4 60.1 54.7 74.9 37.4 51.0 48.5 81.9 63.3
DCASE2019 baseline Baseline 63.3 49.7 61.1 68.2 51.5 91.2 35.1 69.0 52.6 85.6 69.3
Huang_IL_task1a_1 DLensemble Huang2019 80.5 72.5 89.7 81.9 83.1 95.7 62.2 81.7 63.5 91.2 83.8
Huang_IL_task1a_2 DLensemble Huang2019 81.1 75.6 92.2 83.3 81.4 95.3 64.4 81.7 61.3 91.4 84.6
Huang_IL_task1a_3 DLensemble Huang2019 81.3 76.2 92.2 84.0 80.7 95.1 64.4 82.5 60.7 91.4 85.6
Huang_IL_task1a_4 DLensemble Huang2019 79.5 76.8 90.0 84.2 78.1 92.9 67.5 78.1 57.2 91.2 79.3
Huang_SCNU_task1a_1 Huang_SCNU Huang2019a 79.2 69.4 94.4 77.8 68.8 95.4 57.4 78.2 68.2 91.2 91.2
JSNU_WDXY_task1a_1 task1a Ma2019 72.2 64.0 67.1 73.3 68.9 92.8 67.6 60.6 52.4 91.4 83.6
Jung_UOS_task1a_1 S_KD_1 Jung2019 81.1 67.9 95.8 81.9 81.2 95.3 55.1 74.2 77.9 92.2 89.3
Jung_UOS_task1a_2 S_KD_2 Jung2019 81.2 68.5 95.7 84.3 82.1 95.6 55.6 72.6 76.8 91.5 89.3
Jung_UOS_task1a_3 S_KD_3 Jung2019 81.0 70.0 95.1 81.0 78.5 95.4 55.4 77.9 77.8 91.5 87.6
Jung_UOS_task1a_4 S_KD_4 Jung2019 81.2 67.1 96.1 84.2 81.7 95.7 54.4 74.9 79.3 90.4 88.6
KK_I2R_task1a_1 system1 KK2019 76.6 66.8 84.6 78.6 74.7 92.6 56.9 77.9 57.4 92.9 83.8
KK_I2R_task1a_2 system2 KK2019 77.7 63.3 86.0 76.7 80.1 94.3 58.3 78.1 59.0 92.5 88.5
KK_I2R_task1a_3 system3 KK2019 77.2 68.8 88.1 68.2 86.2 87.8 64.3 71.1 51.8 93.2 92.5
Kong_SURREY_task1a_1 cvssp_cnn9 Kong2019 70.5 63.9 72.4 72.6 66.2 90.3 59.0 63.3 49.4 91.0 76.9
Koutini_CPJKU_task1a_1 cp_resnet Koutini2019 82.8 80.8 92.6 85.3 79.3 95.7 63.6 78.5 72.8 92.1 87.6
Koutini_CPJKU_task1a_2 cv_avg Koutini2019 83.7 79.9 96.2 84.7 81.5 96.1 64.0 80.1 72.9 92.6 89.0
Koutini_CPJKU_task1a_3 variants Koutini2019 83.5 78.3 96.0 83.9 82.4 96.4 61.4 79.9 73.6 93.1 90.1
Koutini_CPJKU_task1a_4 variants2 Koutini2019 83.8 80.1 96.1 84.7 81.7 96.1 62.9 80.7 73.2 92.5 89.4
LamPham_HCMGroup_task1a_1 HCM Group Pham2019 73.9 65.1 93.5 78.2 73.8 93.8 46.2 62.5 54.0 86.8 85.4
LamPham_KentGroup_task1a_1 Kent Group Pham2019a 76.8 63.2 95.7 79.2 77.5 95.3 46.1 67.5 62.4 91.8 89.0
Lei_CQU_task1a_1 model1 Lei2019 75.5 62.1 85.0 76.0 76.1 95.8 49.6 75.4 65.4 90.7 79.2
Li_NPU_task1a_1 NDHWD_1 FangLi2019 59.9 51.4 71.2 68.9 44.3 85.6 35.0 63.9 38.8 79.6 60.3
Li_NPU_task1a_2 NDHWD_2 FangLi2019 61.8 49.3 74.3 57.4 47.8 81.9 37.2 69.3 47.1 80.6 72.9
Liang_HUST_task1a_1 Liang_1 Liang2019 68.2 49.6 82.9 71.5 62.8 92.4 45.6 65.3 65.6 82.8 63.3
Liang_HUST_task1a_2 Liang_2 Liang2019 66.4 49.6 78.1 67.6 64.0 86.7 39.6 69.7 55.0 86.5 66.9
Liu_SCUT_task1a_1 HS Mingle2019 78.3 69.4 94.7 82.1 76.8 95.6 51.8 72.4 65.0 91.4 84.2
Liu_SCUT_task1a_2 HS Mingle2019 79.9 71.9 93.8 81.1 79.9 96.0 59.4 72.2 65.6 91.5 87.5
Liu_SCUT_task1a_3 HS Mingle2019 78.3 70.4 92.1 79.9 76.4 95.6 54.3 74.7 61.9 91.0 86.7
Liu_SCUT_task1a_4 HS Mingle2019 78.4 69.4 93.5 81.4 77.1 95.7 56.1 72.5 63.1 91.0 84.2
MaLiu_BIT_task1a_1 BIT_task1a_1 Ma2019a 72.8 43.5 90.0 82.5 76.4 87.5 66.4 71.5 68.1 78.5 63.6
MaLiu_BIT_task1a_2 BIT_task1a_2 Liu2019 76.0 70.6 94.6 84.7 65.8 95.1 52.5 63.3 72.8 88.6 72.4
MaLiu_BIT_task1a_3 BIT_task1a_3 Ma2019a 73.3 45.1 90.7 82.5 76.9 87.5 66.0 71.9 68.8 79.0 64.3
Mars_PRDCSG_task1a_1 PRDCSG Mars2019 79.3 63.9 93.3 80.3 75.8 96.7 56.9 81.1 71.7 93.6 79.4
McDonnell_USA_task1a_1 UniSA_1a1 Gao2019 80.0 66.2 90.8 79.6 79.3 97.8 58.2 73.1 70.6 93.9 90.1
McDonnell_USA_task1a_2 UniSA_1a2 Gao2019 80.5 66.9 88.9 83.3 79.9 97.5 63.1 73.3 69.6 92.8 90.0
McDonnell_USA_task1a_3 UniSA_1a3 Gao2019 80.4 66.9 90.0 81.7 80.1 97.5 60.3 73.8 69.9 93.6 90.3
McDonnell_USA_task1a_4 UniSA_1a4 Gao2019 80.3 66.8 90.8 81.8 79.7 97.5 59.3 72.8 71.1 93.5 89.6
Naranjo-Alcazar_VfyAI_task1a_1 Visualfy Naranjo-Alcazar2019 74.1 59.7 83.5 74.0 73.3 94.9 53.9 72.8 66.2 88.8 74.3
Naranjo-Alcazar_VfyAI_task1a_2 Visualfy Naranjo-Alcazar2019 74.2 59.6 84.4 74.0 73.1 94.6 53.9 73.3 66.4 88.6 74.0
Naranjo-Alcazar_VfyAI_task1a_3 Visualfy Naranjo-Alcazar2019 74.0 59.9 84.2 73.5 73.5 94.7 53.1 72.6 65.8 88.2 74.3
Naranjo-Alcazar_VfyAI_task1a_4 Visualfy Naranjo-Alcazar2019 74.1 59.6 84.0 73.9 73.6 94.7 53.3 72.9 66.0 88.3 74.3
Plata_SRPOL_task1a_1 CLUSTERS Plata2019 78.8 63.7 87.1 76.4 80.7 95.6 55.1 80.0 74.9 87.4 87.5
Plata_SRPOL_task1a_2 CLUSTERS Plata2019 79.2 64.3 85.4 78.6 81.9 95.7 55.6 81.8 75.3 88.1 85.4
Plata_SRPOL_task1a_3 CLUSTERS Plata2019 77.3 64.9 87.2 75.6 79.4 96.2 37.5 78.8 77.1 92.9 82.9
Plata_SRPOL_task1a_4 CLUSTERS Plata2019 77.9 67.5 88.8 76.9 80.0 96.5 45.0 77.4 71.2 93.1 82.8
SSW_ETRI_task1a_1 ETRI_base Sangwon2019 66.7 55.3 77.8 63.9 68.3 80.6 52.4 64.3 56.2 79.4 68.5
SSW_ETRI_task1a_2 ETRI_E8 Sangwon2019 67.0 56.1 77.8 67.4 68.8 79.7 52.2 63.2 55.4 79.7 70.0
SSW_ETRI_task1a_3 ETRI_E17 Sangwon2019 67.6 56.9 78.8 67.4 69.0 80.3 51.7 63.3 57.9 80.3 70.6
SSW_ETRI_task1a_4 ETRI_E25 Sangwon2019 67.6 57.4 79.2 66.4 70.0 80.6 51.9 61.7 57.1 80.3 71.2
Salvati_DMIF_task1a_1 RW-CNN Salvati2019 68.5 57.5 88.3 67.9 63.2 91.0 35.6 56.0 63.7 88.8 73.3
Seo_LGE_task1a_1 LGE_ROBOTICS Hyeji2019 81.6 66.7 90.3 83.3 83.3 93.5 53.2 84.6 76.9 92.8 91.1
Seo_LGE_task1a_2 LGE_ROBOTICS Hyeji2019 82.5 74.9 95.0 82.9 80.0 92.4 60.7 85.4 68.6 93.2 91.9
Seo_LGE_task1a_3 LGE_ROBOTICS Hyeji2019 81.1 71.9 93.1 79.7 76.2 93.1 62.4 84.9 68.3 91.8 89.9
Seo_LGE_task1a_4 LGE_ROBOTICS Hyeji2019 82.5 76.8 93.2 82.5 81.4 92.8 61.4 86.7 64.7 93.5 92.5
Waldekar_IITKGP_task1a_1 IITKGP_MFDWC19 Waldekar2019 65.9 62.6 82.9 74.0 71.5 92.2 70.8 66.7 28.7 69.4 39.6
Wang_BTBU_task1a_1 Wang_BTBU Wang2019 32.2 38.6 17.1 42.6 59.3 26.4 44.0 29.3 8.5 40.7 15.6
Wang_NWPU_task1a_1 Mou_task1A1 Wang2019a_t1 80.6 66.2 91.8 82.4 81.0 96.2 62.5 79.4 63.9 93.1 89.7
Wang_NWPU_task1a_2 Mou_task1A1 Wang2019a 80.1 64.4 91.1 81.4 81.5 96.2 62.4 78.3 63.7 92.8 88.8
Wang_NWPU_task1a_3 Mou_task1A1 Wang2019a 76.6 59.9 87.8 73.2 78.2 96.2 53.6 79.0 61.3 92.1 84.9
Wang_NWPU_task1a_4 Mou_task1A1 Wang2019a 76.8 59.4 89.2 73.5 78.1 96.0 53.6 79.6 61.0 92.6 85.1
Wang_SCUT_task1a_1 5c Wang2019b 76.4 61.7 85.4 78.2 76.2 94.4 47.2 76.9 69.3 90.4 84.2
Wang_SCUT_task1a_2 5c Wang2019b 76.6 61.8 85.7 75.3 78.2 93.9 48.9 77.2 68.2 90.0 86.4
Wang_SCUT_task1a_3 5c Wang2019b 75.9 60.4 82.2 79.0 75.0 95.1 50.6 78.5 65.7 89.3 83.1
Wang_SCUT_task1a_4 5c Wang2019b 76.5 61.4 84.9 77.6 76.4 94.6 47.6 77.5 69.3 90.4 85.1
Wilkinghoff_FKIE_task1a_1 FKIEsingle Wilkinghoff2019 74.6 70.7 79.3 73.9 71.9 88.3 53.1 76.1 50.7 93.8 88.2
Wilkinghoff_FKIE_task1a_2 FKIEensemb Wilkinghoff2019 76.2 68.8 82.4 75.4 74.7 89.4 55.0 77.8 52.5 95.0 90.8
Wu_CUHK_task1a_1 Wu_CUHK_1 Wu2019 80.1 66.4 91.2 79.7 79.6 97.2 55.1 76.9 71.4 93.6 89.3
Yang_UESTC_task1a_1 6models-4folds_avg Haocong2019 79.9 73.5 90.8 80.4 79.6 95.6 54.3 77.2 67.5 93.6 86.1
Yang_UESTC_task1a_2 6models-4folds_rf Haocong2019 81.6 70.3 93.1 80.6 82.8 96.8 59.2 79.2 69.2 93.2 91.9
Yang_UESTC_task1a_3 4models-4folds_rf Haocong2019 81.2 70.3 92.2 79.2 81.9 96.2 57.6 80.3 68.8 93.9 91.4
Zeinali_BUT_task1a_1 ScoreFuse Zeinali2019 78.9 74.7 93.5 81.0 77.9 91.5 51.7 77.1 61.7 88.6 91.8
Zeinali_BUT_task1a_2 ScoreFuse Zeinali2019 78.9 74.9 94.7 79.4 80.1 92.4 46.0 74.6 63.2 91.5 92.1
Zeinali_BUT_task1a_3 ScoreFuse Zeinali2019 79.1 74.2 93.5 80.7 78.1 91.5 52.9 77.4 61.9 88.8 91.8
Zhang_IOA_task1a_1 ZhangIOA1 Chen2019 84.9 75.1 96.7 90.1 84.3 96.8 61.5 80.8 79.7 94.7 89.2
Zhang_IOA_task1a_2 ZhangIOA2 Chen2019 84.9 75.1 96.7 89.9 85.3 96.8 62.1 80.3 79.2 94.9 89.2
Zhang_IOA_task1a_3 ZhangIOA3 Chen2019 85.2 77.5 96.1 89.9 82.5 96.7 67.5 80.0 78.6 92.8 90.6
Zhang_IOA_task1a_4 ZhangIOA4 Chen2019 84.8 75.1 96.7 89.9 84.0 96.8 61.7 80.3 79.7 94.4 89.0
Zheng_USTC_task1a_1 CNN_9Avg Zheng2019 75.7 66.0 84.7 82.5 76.7 95.8 64.4 55.4 60.1 87.9 83.2
Zheng_USTC_task1a_2 MyEnvNet Zheng2019 71.3 62.9 91.7 76.5 66.8 92.1 42.1 62.9 48.6 92.4 77.4
Zheng_USTC_task1a_3 1_2fusion Zheng2019 78.9 67.5 93.1 84.0 78.2 97.5 63.9 63.5 64.2 91.8 85.0
Zhou_Kuaiyu_task1a_1 Kuaiyu Zhou2019_t1 79.8 69.6 85.7 81.5 79.4 95.4 62.9 79.3 58.6 95.3 89.7
Zhou_Kuaiyu_task1a_2 Kuaiyu Zhou2019_t1 79.4 73.2 86.9 78.6 76.9 95.4 63.5 77.4 56.9 94.7 90.8
Zhou_Kuaiyu_task1a_3 Kuaiyu Zhou2019_t1 78.7 70.1 90.7 77.9 75.1 95.7 57.2 79.7 59.3 93.1 88.1
Zhu_SSLabBUPT_task1a_1 mult_ensem Zhu2019 79.2 69.3 93.3 79.4 75.0 97.6 47.9 73.6 78.9 91.7 85.1
Zhu_SSLabBUPT_task1a_2 mult_ensem Zhu2019 78.8 67.9 93.2 81.7 71.9 97.5 46.7 73.9 78.1 92.1 85.1
Zhu_SSLabBUPT_task1a_3 mult_ensem Zhu2019 79.1 70.0 92.8 82.4 73.3 97.9 45.6 73.2 79.2 91.9 85.0
Zhu_SSLabBUPT_task1a_4 mult_ensem Zhu2019 78.8 70.0 91.9 80.4 73.5 97.6 45.7 72.4 80.4 91.2 84.3

System characteristics

General characteristics

Rank | Code | Technical Report | Accuracy (Eval) | Input | Sampling rate | Data augmentation | Features
Bilot_IDG_task1a_1 Bilot2019 66.1 binaural, difference 48kHz mixup log-mel energies
Bilot_IDG_task1a_2 Bilot2019 67.3 binaural, difference 48kHz mixup log-mel energies
Bilot_IDG_task1a_3 Bilot2019 64.5 binaural, difference 48kHz mixup log-mel energies
Bilot_IDG_task1a_4 Bilot2019 68.3 binaural, difference 48kHz mixup log-mel energies
Chandrasekhar_IIITH_task1a_1 Paseddula2019 52.6 mono 48kHz single frequency cepstral coefficients (SFCC), log-mel energies
DSPLAB_TJU_task1a_1 Ding2019 66.5 mono 48kHz log-mel energies
DSPLAB_TJU_task1a_2 Ding2019 69.6 mono, left, right, mixed 48kHz MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid
DSPLAB_TJU_task1a_3 Ding2019 65.0 mono 48kHz MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid
DSPLAB_TJU_task1a_4 Ding2019 69.5 mono, left, right, mixed 48kHz MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid
Fmta91_KNToosi_task1a_1 Arabnezhad2019 76.2 mono 48kHz wavelet scattering spectra
Fraile_UPM_task1a_1 Fraile2019 58.7 binaural 48kHz spectrogram, modulation spectrum, position-pitch maps
DCASE2019 baseline 63.3 mono 48kHz log-mel energies
Huang_IL_task1a_1 Huang2019 80.5 mono 16kHz mixup raw waveform, log-mel energies
Huang_IL_task1a_2 Huang2019 81.1 mono , binaural 48kHz, 16kHz mixup raw waveform, log-mel energies
Huang_IL_task1a_3 Huang2019 81.3 mono , binaural 48kHz, 16kHz mixup raw waveform, log-mel energies
Huang_IL_task1a_4 Huang2019 79.5 binaural 48kHz mixup log-mel energies
Huang_SCNU_task1a_1 Huang2019a 79.2 left, right, mixed 44.1kHz mixup MFCC, CQT
JSNU_WDXY_task1a_1 Ma2019 72.2 mono 44.1kHz log-mel energies
Jung_UOS_task1a_1 Jung2019 81.1 binaural 48kHz mixup raw waveform, log-mel energies
Jung_UOS_task1a_2 Jung2019 81.2 binaural 48kHz mixup raw waveform, log-mel energies
Jung_UOS_task1a_3 Jung2019 81.0 binaural 48kHz mixup raw waveform, log-mel energies
Jung_UOS_task1a_4 Jung2019 81.2 binaural 48kHz mixup raw waveform, log-mel energies
KK_I2R_task1a_1 KK2019 76.6 mono 44.1kHz log-mel energies, HPSS
KK_I2R_task1a_2 KK2019 77.7 mono 44.1kHz mixup log-mel energies, HPSS, subband power distribution
KK_I2R_task1a_3 KK2019 77.2 mono 44.1kHz mixup log-mel energies, HPSS, subband power distribution
Kong_SURREY_task1a_1 Kong2019 70.5 mono 32kHz log-mel energies
Koutini_CPJKU_task1a_1 Koutini2019 82.8 binaural 22.05kHz mixup perceptual weighted power spectrogram
Koutini_CPJKU_task1a_2 Koutini2019 83.7 binaural 22.05kHz mixup perceptual weighted power spectrogram
Koutini_CPJKU_task1a_3 Koutini2019 83.5 binaural 22.05kHz mixup perceptual weighted power spectrogram
Koutini_CPJKU_task1a_4 Koutini2019 83.8 binaural 22.05kHz mixup perceptual weighted power spectrogram
LamPham_HCMGroup_task1a_1 Pham2019 73.9 mono 48kHz mixup Gammatone, log-mel energies, CQT
LamPham_KentGroup_task1a_1 Pham2019a 76.8 mono 48kHz mixup Gammatone, log-mel energies, CQT
Lei_CQU_task1a_1 Lei2019 75.5 binaural 44.1kHz log-mel energies
Li_NPU_task1a_1 FangLi2019 59.9 mixed 48kHz DCGAN log-mel energies
Li_NPU_task1a_2 FangLi2019 61.8 mixed 48kHz DCGAN log-mel energies
Liang_HUST_task1a_1 Liang2019 68.2 mono 48kHz log-mel energies
Liang_HUST_task1a_2 Liang2019 66.4 mono 48kHz log-mel energies
Liu_SCUT_task1a_1 Mingle2019 78.3 mono,binaural 48kHz mixup log-mel energies
Liu_SCUT_task1a_2 Mingle2019 79.9 mono,binaural 48kHz mixup log-mel energies
Liu_SCUT_task1a_3 Mingle2019 78.3 mono,binaural 48kHz mixup log-mel energies
Liu_SCUT_task1a_4 Mingle2019 78.4 mono,binaural 48kHz mixup log-mel energies
MaLiu_BIT_task1a_1 Ma2019a 72.8 left,right 48kHz DSS
MaLiu_BIT_task1a_2 Liu2019 76.0 left,right 48kHz DSS
MaLiu_BIT_task1a_3 Ma2019a 73.3 left,right 48kHz DSS
Mars_PRDCSG_task1a_1 Mars2019 79.3 mono, left, right, mid, side 48kHz mixup log-mel energies
McDonnell_USA_task1a_1 Gao2019 80.0 left, right 48kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
McDonnell_USA_task1a_2 Gao2019 80.5 left, right 48kHz mixup, temporal cropping log-mel energies
McDonnell_USA_task1a_3 Gao2019 80.4 left, right 48kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
McDonnell_USA_task1a_4 Gao2019 80.3 left, right 48kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
Naranjo-Alcazar_VfyAI_task1a_1 Naranjo-Alcazar2019 74.1 mono, left, right, difference, harmonic, percussive 48kHz log-mel energies
Naranjo-Alcazar_VfyAI_task1a_2 Naranjo-Alcazar2019 74.2 mono, left, right, difference, harmonic, percussive 48kHz log-mel energies
Naranjo-Alcazar_VfyAI_task1a_3 Naranjo-Alcazar2019 74.0 mono, left, right, difference, harmonic, percussive 48kHz log-mel energies
Naranjo-Alcazar_VfyAI_task1a_4 Naranjo-Alcazar2019 74.1 mono, left, right, difference, harmonic, percussive 48kHz log-mel energies
Plata_SRPOL_task1a_1 Plata2019 78.8 mono, left, right 48kHz mixup log-mel energies, harmonic, percussive
Plata_SRPOL_task1a_2 Plata2019 79.2 mono, left, right 48kHz mixup log-mel energies, harmonic, percussive
Plata_SRPOL_task1a_3 Plata2019 77.3 mono, left, right 48kHz mixup log-mel energies, harmonic, percussive
Plata_SRPOL_task1a_4 Plata2019 77.9 mono, left, right 48kHz mixup log-mel energies, harmonic, percussive
SSW_ETRI_task1a_1 Sangwon2019 66.7 mono 48kHz SpecAugment log-mel energies
SSW_ETRI_task1a_2 Sangwon2019 67.0 mono 48kHz SpecAugment log-mel energies
SSW_ETRI_task1a_3 Sangwon2019 67.6 mono 48kHz SpecAugment log-mel energies
SSW_ETRI_task1a_4 Sangwon2019 67.6 mono 48kHz SpecAugment log-mel energies
Salvati_DMIF_task1a_1 Salvati2019 68.5 mono 48kHz raw waveform
Seo_LGE_task1a_1 Hyeji2019 81.6 mono, binaural 48kHz mixup log-mel energies, spectrogram, chromagram
Seo_LGE_task1a_2 Hyeji2019 82.5 mono, binaural 48kHz mixup log-mel energies, spectrogram, chromagram
Seo_LGE_task1a_3 Hyeji2019 81.1 mono, binaural 48kHz mixup log-mel energies, spectrogram, chromagram
Seo_LGE_task1a_4 Hyeji2019 82.5 mono, binaural 48kHz mixup log-mel energies, spectrogram, chromagram
Waldekar_IITKGP_task1a_1 Waldekar2019 65.9 mono 48kHz MFDWC
Wang_BTBU_task1a_1 Wang2019 32.2 one 22.05kHz MFCC
Wang_NWPU_task1a_1 Wang2019a_t1 80.6 mono 32kHz log-mel energies
Wang_NWPU_task1a_2 Wang2019a 80.1 mono 32kHz log-mel energies
Wang_NWPU_task1a_3 Wang2019a 76.6 mono 32kHz log-mel energies
Wang_NWPU_task1a_4 Wang2019a 76.8 mono 32kHz log-mel energies
Wang_SCUT_task1a_1 Wang2019b 76.4 mono,binaural 48kHz mixup log-mel energies,MFCC
Wang_SCUT_task1a_2 Wang2019b 76.6 mono,binaural 48kHz mixup log-mel energies,MFCC
Wang_SCUT_task1a_3 Wang2019b 75.9 mono,binaural 48kHz mixup log-mel energies,MFCC
Wang_SCUT_task1a_4 Wang2019b 76.5 mono,binaural 48kHz mixup log-mel energies,MFCC
Wilkinghoff_FKIE_task1a_1 Wilkinghoff2019 74.6 mono 48kHz mixup, cutout, width shift, height shift log-mel energies
Wilkinghoff_FKIE_task1a_2 Wilkinghoff2019 76.2 mono 48kHz mixup, cutout, width shift, height shift log-mel energies, harmonic part, percussive part
Wu_CUHK_task1a_1 Wu2019 80.1 mono 48kHz mixup log-mel energies
Yang_UESTC_task1a_1 Haocong2019 79.9 mono, binaural 44.1kHz mixup log-mel energies
Yang_UESTC_task1a_2 Haocong2019 81.6 mono, binaural 44.1kHz mixup log-mel energies
Yang_UESTC_task1a_3 Haocong2019 81.2 mono, binaural 44.1kHz mixup log-mel energies
Zeinali_BUT_task1a_1 Zeinali2019 78.9 mono 22.05kHz mixup log-mel energies
Zeinali_BUT_task1a_2 Zeinali2019 78.9 mono 22.05kHz mixup log-mel energies
Zeinali_BUT_task1a_3 Zeinali2019 79.1 mono 22.05kHz mixup log-mel energies
Zhang_IOA_task1a_1 Chen2019 84.9 binaural 48kHz generative neural network log-mel energies, CQT
Zhang_IOA_task1a_2 Chen2019 84.9 binaural 48kHz generative neural network log-mel energies, CQT
Zhang_IOA_task1a_3 Chen2019 85.2 binaural 48kHz generative neural network log-mel energies, CQT
Zhang_IOA_task1a_4 Chen2019 84.8 binaural 48kHz generative neural network, variational autoencoder log-mel energies, CQT
Zheng_USTC_task1a_1 Zheng2019 75.7 binaural 22.05kHz SpecAugment, RandomCrop log-mel energies
Zheng_USTC_task1a_2 Zheng2019 71.3 mono 16kHz Between-Class learning raw waveform
Zheng_USTC_task1a_3 Zheng2019 78.9 binaural 22.05kHz SpecAugment, RandomCrop, Between-Class learning log-mel energies
Zhou_Kuaiyu_task1a_1 Zhou2019_t1 79.8 mono 48kHz mixup log-mel energies
Zhou_Kuaiyu_task1a_2 Zhou2019_t1 79.4 mono 48kHz mixup log-mel energies
Zhou_Kuaiyu_task1a_3 Zhou2019_t1 78.7 mono 48kHz mixup log-mel energies
Zhu_SSLabBUPT_task1a_1 Zhu2019 79.2 multiple 48kHz mixup log-mel energies
Zhu_SSLabBUPT_task1a_2 Zhu2019 78.8 multiple 48kHz mixup log-mel energies
Zhu_SSLabBUPT_task1a_3 Zhu2019 79.1 multiple 48kHz mixup log-mel energies
Zhu_SSLabBUPT_task1a_4 Zhu2019 78.8 multiple 48kHz mixup log-mel energies
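Mixup is by far the most common augmentation listed in the table above; a minimal sketch of the technique on a batch of features and one-hot labels (the alpha value is a typical default, not a value reported by any particular team):

```python
import numpy as np

def mixup(features: np.ndarray, labels_onehot: np.ndarray, alpha: float = 0.2, rng=None):
    """Mixup: replace the batch by convex combinations of random example pairs.

    features:      (batch, ...) array, e.g. log-mel spectrograms.
    labels_onehot: (batch, num_classes) one-hot targets.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # mixing coefficient shared by the batch
    perm = rng.permutation(len(features))     # random partner for each example
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y
```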



Machine learning characteristics

Rank | Code | Technical Report | Accuracy (Eval) | External data usage | External data sources | Model complexity | Classifier | Ensemble subsystems | Decision making
Bilot_IDG_task1a_1 Bilot2019 66.1 282762 CNN
Bilot_IDG_task1a_2 Bilot2019 67.3 3654342 CNN
Bilot_IDG_task1a_3 Bilot2019 64.5 924374 CNN
Bilot_IDG_task1a_4 Bilot2019 68.3 4862583 CNN 3 MLP
Chandrasekhar_IIITH_task1a_1 Paseddula2019 52.6 DNN maxrule
DSPLAB_TJU_task1a_1 Ding2019 66.5 3840 GMM
DSPLAB_TJU_task1a_2 Ding2019 69.6 922920 GMM, CNN 13 majority vote
DSPLAB_TJU_task1a_3 Ding2019 65.0 3090 GMM
DSPLAB_TJU_task1a_4 Ding2019 69.5 926760 GMM, CNN 14 majority vote
Fmta91_KNToosi_task1a_1 Arabnezhad2019 76.2 random subspace 30 highest average score
Fraile_UPM_task1a_1 Fraile2019 58.7 27328 GMM average log-likelihood
DCASE2019 baseline 63.3 116118 CNN
Huang_IL_task1a_1 Huang2019 80.5 pre-trained model AudioSet 1264400984 CNN 40 Max value of soft ensemble
Huang_IL_task1a_2 Huang2019 81.1 pre-trained model AudioSet 921761716 CNN 31 Max value of soft ensemble
Huang_IL_task1a_3 Huang2019 81.3 pre-trained model AudioSet 798650092 CNN 26 Max value of soft ensemble
Huang_IL_task1a_4 Huang2019 79.5 pre-trained model AudioSet 374743680 CNN 20 Max value of soft ensemble
Huang_SCNU_task1a_1 Huang2019a 79.2 46118408 CNN 6 majority vote
JSNU_WDXY_task1a_1 Ma2019 72.2 CNN
Jung_UOS_task1a_1 Jung2019 81.1 101727776 CNN 16 majority vote
Jung_UOS_task1a_2 Jung2019 81.2 101727776 CNN 16 score-sum
Jung_UOS_task1a_3 Jung2019 81.0 101727776 CNN 16 majority vote
Jung_UOS_task1a_4 Jung2019 81.2 101727776 CNN 16 score-sum
KK_I2R_task1a_1 KK2019 76.6 4692426 CNN
KK_I2R_task1a_2 KK2019 77.7 14077278 CNN 3 weighted averaging vote
KK_I2R_task1a_3 KK2019 77.2 14077278 CNN 3 multi-class linear logistic regression
Kong_SURREY_task1a_1 Kong2019 70.5 4686144 CNN
Koutini_CPJKU_task1a_1 Koutini2019 82.8 3566612 CNN, Receptive Field Regularization
Koutini_CPJKU_task1a_2 Koutini2019 83.7 17833060 CNN, Receptive Field Regularization average
Koutini_CPJKU_task1a_3 Koutini2019 83.5 71332240 CNN, Receptive Field Regularization average
Koutini_CPJKU_task1a_4 Koutini2019 83.8 71332240 CNN, Receptive Field Regularization average
LamPham_HCMGroup_task1a_1 Pham2019 73.9 12346325 CNN, RNN
LamPham_KentGroup_task1a_1 Pham2019a 76.8 12346325 CNN, DNN 2
Lei_CQU_task1a_1 Lei2019 75.5 CNN
Li_NPU_task1a_1 FangLi2019 59.9 1353372 CNN 3 majority vote
Li_NPU_task1a_2 FangLi2019 61.8 1353372 CNN 3 majority vote
Liang_HUST_task1a_1 Liang2019 68.2 220906 CNN
Liang_HUST_task1a_2 Liang2019 66.4 310362 CNN
Liu_SCUT_task1a_1 Mingle2019 78.3 116118 ResNet 4 vote
Liu_SCUT_task1a_2 Mingle2019 79.9 116118 ResNet 4 vote
Liu_SCUT_task1a_3 Mingle2019 78.3 116118 ResNet 4 vote
Liu_SCUT_task1a_4 Mingle2019 78.4 116118 ResNet 4 vote
MaLiu_BIT_task1a_1 Ma2019a 72.8 164994 CNN,DNN
MaLiu_BIT_task1a_2 Liu2019 76.0 116118 CNN,DNN
MaLiu_BIT_task1a_3 Ma2019a 73.3 164994 CNN,DNN
Mars_PRDCSG_task1a_1 Mars2019 79.3 13593896 CNN 4 max probability
McDonnell_USA_task1a_1 Gao2019 80.0 3254468 CNN
McDonnell_USA_task1a_2 Gao2019 80.5 3252708 CNN
McDonnell_USA_task1a_3 Gao2019 80.4 6507176 CNN 2 average
McDonnell_USA_task1a_4 Gao2019 80.3 6508936 CNN 2 average
Naranjo-Alcazar_VfyAI_task1a_1 Naranjo-Alcazar2019 74.1 1485450 ensemble, CNN 3 arithmetic mean
Naranjo-Alcazar_VfyAI_task1a_2 Naranjo-Alcazar2019 74.2 1485450 ensemble, CNN 3 geometric mean
Naranjo-Alcazar_VfyAI_task1a_3 Naranjo-Alcazar2019 74.0 1485450 ensemble, CNN 3 orness weighted average
Naranjo-Alcazar_VfyAI_task1a_4 Naranjo-Alcazar2019 74.1 1485450 ensemble, CNN 3 orness weighted average
Plata_SRPOL_task1a_1 Plata2019 78.8 60000000 CNN, random forest 15 random forest
Plata_SRPOL_task1a_2 Plata2019 79.2 50000000 CNN, random forest 15 random forest
Plata_SRPOL_task1a_3 Plata2019 77.3 60000000 CNN 15 majority vote
Plata_SRPOL_task1a_4 Plata2019 77.9 50000000 CNN 15 majority vote
SSW_ETRI_task1a_1 Sangwon2019 66.7 839126 CNN
SSW_ETRI_task1a_2 Sangwon2019 67.0 6713017 ensemble 8
SSW_ETRI_task1a_3 Sangwon2019 67.6 14265160 ensemble 17
SSW_ETRI_task1a_4 Sangwon2019 67.6 20978176 ensemble 25
Salvati_DMIF_task1a_1 Salvati2019 68.5 9053546 CNN
Seo_LGE_task1a_1 Hyeji2019 81.6 8475574 CNN 8 average
Seo_LGE_task1a_2 Hyeji2019 82.5 18348450 CNN 21 average
Seo_LGE_task1a_3 Hyeji2019 81.1 8475574 CNN 21 average
Seo_LGE_task1a_4 Hyeji2019 82.5 18348450 CNN 21 average
Waldekar_IITKGP_task1a_1 Waldekar2019 65.9 9000 SVM
Wang_BTBU_task1a_1 Wang2019 32.2 CNN
Wang_NWPU_task1a_1 Wang2019a_t1 80.6 CNN
Wang_NWPU_task1a_2 Wang2019a 80.1 CNN
Wang_NWPU_task1a_3 Wang2019a 76.6 CNN
Wang_NWPU_task1a_4 Wang2019a 76.8 CNN
Wang_SCUT_task1a_1 Wang2019b 76.4 116118 VGG,Inception,ResNet 4 vote
Wang_SCUT_task1a_2 Wang2019b 76.6 116118 VGG,Inception,ResNet 4 vote
Wang_SCUT_task1a_3 Wang2019b 75.9 116118 VGG,Inception,ResNet 4 vote
Wang_SCUT_task1a_4 Wang2019b 76.5 116118 VGG,Inception,ResNet 4 vote
Wilkinghoff_FKIE_task1a_1 Wilkinghoff2019 74.6 1880578 CNN maximum likelihood
Wilkinghoff_FKIE_task1a_2 Wilkinghoff2019 76.2 5642274 CNN, ensemble 3 geometric mean, maximum likelihood
Wu_CUHK_task1a_1 Wu2019 80.1 52574568 CNN 4 majority vote
Yang_UESTC_task1a_1 Haocong2019 79.9 31984112 CNN 24 average
Yang_UESTC_task1a_2 Haocong2019 81.6 31984112 CNN 24 random forest
Yang_UESTC_task1a_3 Haocong2019 81.2 21310368 CNN 16 random forest
Zeinali_BUT_task1a_1 Zeinali2019 78.9 25000000 CNN, ensemble 12 score fusion
Zeinali_BUT_task1a_2 Zeinali2019 78.9 25000000 CNN, ensemble 12 majority vote
Zeinali_BUT_task1a_3 Zeinali2019 79.1 25000000 CNN, ensemble 12 majority vote, score fusion
Zhang_IOA_task1a_1 Chen2019 84.9 94841565 CNN 7 average vote
Zhang_IOA_task1a_2 Chen2019 84.9 48257054 CNN 7 average vote
Zhang_IOA_task1a_3 Chen2019 85.2 48257054 CNN 7 average vote
Zhang_IOA_task1a_4 Chen2019 84.8 53266772 CNN 7 average vote
Zheng_USTC_task1a_1 Zheng2019 75.7 116118 CNN
Zheng_USTC_task1a_2 Zheng2019 71.3 116118 CNN
Zheng_USTC_task1a_3 Zheng2019 78.9 116118 CNN
Zhou_Kuaiyu_task1a_1 Zhou2019_t1 79.8 4700000 CNN
Zhou_Kuaiyu_task1a_2 Zhou2019_t1 79.4 4700000 CNN
Zhou_Kuaiyu_task1a_3 Zhou2019_t1 78.7 4700000 CNN
Zhu_SSLabBUPT_task1a_1 Zhu2019 79.2 48003724 CNN, BGRU, self-attention, ensemble 6
Zhu_SSLabBUPT_task1a_2 Zhu2019 78.8 51908898 CNN, BGRU, self-attention, ensemble 7
Zhu_SSLabBUPT_task1a_3 Zhu2019 79.1 51908898 CNN, BGRU, self-attention, ensemble 7
Zhu_SSLabBUPT_task1a_4 Zhu2019 78.8 55814072 CNN, BGRU, self-attention, ensemble 8

Public leaderboard

Scores

Date | Top team | Top-10 team median (range)
2019-05-14 74.7 71.3 (69.3 - 74.7)
2019-05-15 77.3 73.1 (70.0 - 77.3)
2019-05-16 83.5 75.4 (72.7 - 83.5)
2019-05-17 83.5 77.0 (75.3 - 83.5)
2019-05-18 83.5 78.2 (75.5 - 83.5)
2019-05-19 83.5 78.2 (75.8 - 83.5)
2019-05-20 83.5 78.2 (75.8 - 83.5)
2019-05-21 83.5 78.7 (76.3 - 83.5)
2019-05-22 83.5 78.7 (77.3 - 83.5)
2019-05-23 83.5 79.7 (77.3 - 83.5)
2019-05-24 83.5 79.8 (77.5 - 83.5)
2019-05-25 83.5 79.8 (78.5 - 83.5)
2019-05-26 83.5 80.1 (78.5 - 83.5)
2019-05-27 83.5 80.1 (78.5 - 83.5)
2019-05-28 83.5 80.1 (78.8 - 83.5)
2019-05-29 84.3 81.2 (79.0 - 84.3)
2019-05-30 84.3 81.2 (79.0 - 84.3)
2019-05-31 84.3 81.2 (79.0 - 84.3)
2019-06-01 84.5 81.3 (79.0 - 84.5)
2019-06-02 84.5 81.3 (79.0 - 84.5)
2019-06-03 84.5 81.5 (79.0 - 84.5)
2019-06-04 84.5 81.5 (79.2 - 84.5)
2019-06-05 86.2 81.5 (79.2 - 86.2)
2019-06-06 86.2 82.4 (79.8 - 86.2)
2019-06-07 86.2 82.7 (79.8 - 86.2)
2019-06-08 86.5 82.7 (80.7 - 86.5)
2019-06-09 86.5 82.8 (80.7 - 86.5)
2019-06-10 86.5 82.8 (81.3 - 86.5)

Entries

Total entries

Date Entries
2019-05-14 6
2019-05-15 22
2019-05-16 37
2019-05-17 59
2019-05-18 73
2019-05-19 82
2019-05-20 92
2019-05-21 113
2019-05-22 132
2019-05-23 149
2019-05-24 164
2019-05-25 180
2019-05-26 189
2019-05-27 200
2019-05-28 223
2019-05-29 242
2019-05-30 262
2019-05-31 288
2019-06-01 307
2019-06-02 320
2019-06-03 334
2019-06-04 353
2019-06-05 374
2019-06-06 398
2019-06-07 424
2019-06-08 453
2019-06-09 480
2019-06-10 503

Entries per day

Date Entries per day
2019-05-14 6
2019-05-15 16
2019-05-16 15
2019-05-17 22
2019-05-18 14
2019-05-19 9
2019-05-20 10
2019-05-21 21
2019-05-22 19
2019-05-23 17
2019-05-24 15
2019-05-25 16
2019-05-26 9
2019-05-27 11
2019-05-28 23
2019-05-29 19
2019-05-30 20
2019-05-31 26
2019-06-01 19
2019-06-02 13
2019-06-03 14
2019-06-04 19
2019-06-05 21
2019-06-06 24
2019-06-07 26
2019-06-08 29
2019-06-09 27
2019-06-10 23

Teams

Date Teams
2019-05-14 4
2019-05-15 16
2019-05-16 21
2019-05-17 29
2019-05-18 31
2019-05-19 32
2019-05-20 35
2019-05-21 38
2019-05-22 42
2019-05-23 43
2019-05-24 44
2019-05-25 45
2019-05-26 46
2019-05-27 48
2019-05-28 50
2019-05-29 53
2019-05-30 55
2019-05-31 61
2019-06-01 62
2019-06-02 63
2019-06-03 66
2019-06-04 67
2019-06-05 70
2019-06-06 72
2019-06-07 75
2019-06-08 77
2019-06-09 80
2019-06-10 80

Technical reports

Urban Acoustic Scene Classification Using Binaural Wavelet Scattering and Random Subspace Discrimination Method

Fateme Arabnezhad and Babak Nasersharif
Computer Engineering Department, Khaje Nasir Toosi University of Technology, Tehran, Iran

Abstract

This report describes our contribution to the Detection and Classification of Urban Acoustic Scenes task of the DCASE 2019 challenge (Task 1, Subtask A). We propose to use the wavelet scattering spectrum as the feature representation, extracted from both the average of the two recorded audio channels (mono) and the difference of the two recorded channels (side). The concatenation of these two sets of wavelet scattering spectra is used as a feature vector, which is fed into a classifier based on the random subspace method. In this work, Regularized Linear Discriminant Analysis (RLDA) is used as the base learner for the random subspace method. The experimental results show that the proposed structure learns acoustic characteristics from audio segments. This structure achieved 87.98% accuracy on the whole development set (without cross-validation) and 78.83% on the leaderboard dataset.
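A rough analogue of this classifier can be assembled from standard tools: regularized LDA as the base learner inside a random-subspace ensemble, here approximated with scikit-learn's BaggingClassifier drawing random feature subsets over the concatenated mono and side features. The 30 estimators match the ensemble size listed in the system characteristics table; the feature fraction is a placeholder, not the authors' setting:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

def build_random_subspace_rlda(n_estimators: int = 30, feature_fraction: float = 0.5):
    """Random-subspace ensemble of regularized LDA base learners."""
    base = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")  # shrinkage-regularized LDA
    return BaggingClassifier(
        base,
        n_estimators=n_estimators,
        max_features=feature_fraction,  # each learner sees a random subset of features
        bootstrap=False,                # keep all segments, vary only the feature subspace
        bootstrap_features=False,
    )

# X would be the per-segment concatenation of mono and side scattering features:
# X = np.hstack([scatter_mono, scatter_side]); build_random_subspace_rlda().fit(X, y)
```

Prediction with this ensemble averages the per-learner class probabilities, which matches the "highest average score" decision rule listed for the submission.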

System characteristics
Input mono
Sampling rate 48kHz
Features wavelet scattering spectra
Classifier random subspace
Decision making highest average score
PDF

Acoustic Scene Classification with Multiple Instance Learning and Fusion

Valentin Bilot and Quang Khanh Ngoc Duong
Audio R&D, InterDigital R&D, Rennes, France

Abstract

Audio classification has been an emerging topic in the last few years, especially with the benchmark datasets and evaluations from DCASE. This paper presents our deep learning models for the acoustic scene classification (ASC) task of DCASE 2019. The models exploit the multiple instance learning (MIL) method as a way of guiding the network's attention to different temporal segments of a recording. We then propose a simple late fusion of the results obtained by the three investigated MIL-based models. The fusion system uses a multi-layer perceptron (MLP) to predict the final classes from the initial class probability predictions and obtains better results on the development and leaderboard datasets.
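The late-fusion step can be sketched as stacking each model's class-probability outputs and training a small MLP on top; the sizes below are illustrative, not the configuration from the report:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_late_fusion_mlp(per_model_probs, labels):
    """Fit an MLP fuser on concatenated class-probability outputs.

    per_model_probs: list of (num_segments, num_classes) arrays, one per MIL model.
    labels:          (num_segments,) scene labels.
    """
    fusion_features = np.hstack(per_model_probs)
    fuser = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    fuser.fit(fusion_features, labels)
    return fuser
```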

System characteristics
Input binaural, difference
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making MLP
PDF

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

Hangting Chen, Zuozhen Liu, Zongming Liu, Pengyuan Zhang and Yonghong Yan
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China

Abstract

This technical report describes the IOA team's submission for Task 1A of the DCASE 2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversarial networks. Two major classifiers, a 1D deep convolutional neural network using scalogram features and a 2D fully convolutional neural network using Mel filter bank features, are deployed in the scheme. Other approaches, such as adversarial city adaptation, a temporal module based on the discrete cosine transform, and hybrid architectures, have been developed for further fusion. The results of our experiments indicate that the final fusion systems A-D achieve accuracies above 85% on the officially provided fold 1 evaluation dataset.

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation generative neural network; generative neural network, variational autoencoder
Features log-mel energies, CQT
Classifier CNN
Decision making average vote
PDF

Acoustic Scene Classification Based on Ensemble System

Biyun Ding, Ganjun Liu and Jinhua Liang
School of Electrical and Information Engineering, TianJin University, Tianjin, China

Abstract

This technical report addresses Task 1A, acoustic scene classification, of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this task, the choice of audio features affects the performance. To improve performance, we implement acoustic scene classification using multiple features and an ensemble system composed of CNN and GMM classifiers. In experiments performed on the DCASE 2019 challenge development dataset, the class-average accuracy of the GMM with 103 features is 64.3%, an improvement of 4.2% over the baseline CNN. The class-average accuracy of the ensemble system is 66.3%, an improvement of 7.4% over the baseline CNN.

System characteristics
Input mono; mono, left, right, mixed
Sampling rate 48kHz
Features log-mel energies; MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid
Classifier GMM; GMM, CNN
Decision making majority vote
PDF

Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs

Hamid Eghbal-zadeh, Khaled Koutini and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

In this report, we detail the CP-JKU submissions to the DCASE-2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural network architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of ResNet and DenseNet architectures to best fit the various audio processing tasks that use spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels, to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year's submissions is to provide the best-performing single-model submission, using our proposed approaches.
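The receptive field that these submissions adjust can be tracked layer by layer through a stack of convolutions and poolings; a small bookkeeping helper (the example layer list is made up and is not the CP-JKU architecture):

```python
def receptive_field(layers):
    """Receptive field (in input frames/bins) of stacked (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump   # growth scaled by the cumulative stride
        jump *= stride
    return rf

# Example: four 3x3 convs with one stride-2 downsampling in the middle.
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))  # -> 14
```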

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features perceptual weighted power spectrogram
Classifier CNN, Receptive Field Regularization
Decision making average
PDF

Acoustic Scene Classification Based on the Dataset with Deep Convolutional Generative Adversarial Network

Ning FangLi and Duan Shuang
Mechanical Engineering, Northwestern Polytechnical University School, 127 West Youyi Road, Xi'an, 710072, China

Abstract

As is well known, Convolutional Neural Networks have been the most successful solution for image classification challenges. The results of DCASE 2018 [1] show that Convolutional Neural Networks have also achieved excellent results in the classification of acoustic scenes. Therefore, our team also adopted Convolutional Neural Networks for DCASE 2019 Task 1a. In order to expose more of the audio features, our team used multiple Mel-spectrograms to characterize the audio, trained multiple classifiers, and finally weighted the prediction results of each classifier to form an ensemble. The performance of a classifier is largely limited by the quality and quantity of the data. According to the results of the technical report [2], using a GAN to augment the dataset can play a vital role in the final performance, and our team therefore introduced Deep Convolutional GANs (DCGAN) [3] in our solution to the Task 1a challenge. Our model ultimately achieved an accuracy of 0.846 on the development set and an accuracy of 0.671 on the leaderboard dataset.

System characteristics
Input mixed
Sampling rate 48kHz
Data augmentation DCGAN
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps

Ruben Fraile, Juan Carlos Reina, Juana Gutierrez-Arriola and Elena Blanco
CITSEM, Universidad Politecnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard Gaussian Mixture Model with 64 components.
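The decision rule listed for this system (average log-likelihood under a 64-component GMM per scene) can be sketched with scikit-learn; the diagonal covariance and the frame-level feature layout are assumptions for the example, not details from the report:

```python
from sklearn.mixture import GaussianMixture

def train_scene_gmms(frames_per_scene, n_components=64):
    """Fit one GMM per acoustic scene on its frame-level feature vectors."""
    return {scene: GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)
            for scene, frames in frames_per_scene.items()}

def classify_segment(gmms, segment_frames):
    """Pick the scene whose GMM gives the highest average frame log-likelihood."""
    # GaussianMixture.score() returns the mean log-likelihood over the given frames.
    scores = {scene: gmm.score(segment_frames) for scene, gmm in gmms.items()}
    return max(scores, key=scores.get)
```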

System characteristics
Input binaural
Sampling rate 48kHz
Features spectrogram, modulation spectrum, position-pitch maps
Classifier GMM
Decision making average log-likelihood
PDF

Acoustic Scene Classification Using Deep Residual Networks with Late Fusion of Separated High and Low Frequency Paths

Wei Gao and Mark McDonnell
School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, Australia

Abstract

This technical report describes our approach to Tasks 1a, 1b and 1c in the 2019 DCASE acoustic scene classification challenge. Our focus was on developing strong single models, without use of any supplementary data. We investigated the use of a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We designed the network to take into account that the temporal and frequency axes in spectrograms represent fundamentally different information. In particular, we used two pathways in the residual network: one for high frequencies and one for low frequencies, that were fused just two convolutional layers prior to the network output.
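The log-mel energies with deltas and delta-deltas mentioned above are standard librosa operations; a minimal front-end sketch on a single channel (the frame and mel settings are placeholders, not the authors' exact configuration):

```python
import librosa
import numpy as np

def logmel_with_deltas(path, sr=48000, n_mels=128, n_fft=2048, hop_length=1024):
    """Return a (3, n_mels, frames) array: log-mel energies, deltas, delta-deltas."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    return np.stack([logmel,
                     librosa.feature.delta(logmel),
                     librosa.feature.delta(logmel, order=2)])
```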

System characteristics
Input left, right; mono
Sampling rate 48kHz; 44.1kHz
Data augmentation mixup, temporal cropping
Features log-mel energies, deltas and delta-deltas; log-mel energies
Classifier CNN
Decision making average
PDF

Acoustic Scene Classification Using CNN Ensembles and Primary Ambient Extraction

Yang Haocong, Shi Chuang and Li Huiyong
Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This report describes our submission for Task 1a (acoustic scene classification) of the DCASE 2019 challenge. The results of the DCASE 2018 challenge demonstrate that convolutional neural networks (CNNs) and their ensembles can achieve excellent classification accuracies. Inspired by the previous works, our method continues to work on ensembles of CNNs, whereas primary ambient extraction is newly introduced to decompose a binaural audio sample into four channels by using the spatial information. The feature extraction is still carried out with mel spectrograms. Six CNN models are trained using 4-fold cross validation. Ensembling is applied to further improve the performance. Finally, our method achieved a classification accuracy of 0.84 on the public leaderboard.

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making average; random forest
PDF

Acoustic Scene Classification Using Deep Learning-Based Ensemble Averaging

Jonathan Huang1, Paulo Lopez Meyer2, Hong Lu1, Hector Cordourier Maruri2 and Juan Del Hoyo2
1Intel Labs, Intel Corporation, Santa Clara, CA, USA, 2Intel Labs, Intel Corporation, Zapopan, Jalisco, Mexico

Abstract

In our submission to DCASE 2019 Task 1a, we have explored the use of four different deep learning based neural network architectures: Vgg12, ResNet50, AclNet, and AclSincNet. In order to improve performance, these four network architectures were pre-trained with AudioSet data and then fine-tuned on the development set for the task. The outputs produced by these networks, due to the diversity of feature front-ends and architecture differences, proved to be complementary when fused together. The ensemble of these models' outputs improved the best single-model accuracy of 77.9% to 83.0% on the validation set, trained with the challenge's default development split.
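The fusion rule listed for these submissions ("max value of soft ensemble") amounts to averaging per-model class probabilities and taking the arg-max; a small sketch over illustrative arrays:

```python
import numpy as np

def soft_ensemble_predict(model_probs):
    """Average per-model class probabilities and pick the best class per segment.

    model_probs: list of (num_segments, num_classes) arrays, one per model.
    """
    mean_probs = np.mean(np.stack(model_probs), axis=0)
    return np.argmax(mean_probs, axis=1)
```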

System characteristics
Input mono; mono , binaural; binaural
Sampling rate 16kHz; 48kHz, 16kHz; 48kHz
Data augmentation mixup
Features raw waveform, log-mel energies; log-mel energies
Classifier CNN
Decision making Max value of soft ensemble
PDF

Acoustic Scene Classification Based on Deep Convolutional Neuralnetwork with Spatial-Temporal Attention Pooling

Zhenyi Huang and Dacan Jiang
School of Computer, South China Normal University, Guangzhou, China

Abstract

Acoustic scene classification is a challenging task in machine learning with limited data sets. In this report, several different spectrograms are applied to classify the acoustic scenes using a deep convolutional neural network with spatial-temporal attention pooling. In addition, mixup augmentation is performed to further improve the classification performance. Finally, majority voting is performed on six different models and an accuracy of 73.86% is achieved, which is 11.36 percentage points higher than that of the baseline system.

System characteristics
Input left, right, mixed
Sampling rate 44.1kHz
Data augmentation mixup
Features MFCC, CQT
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification Using Various Pre-Processed Features and Convolutional Neural Networks

Seo Hyeji and Park Jihwan
Advanced Robotics Lab, LG Electronics, Seoul, Korea

Abstract

In this technical report, we describe our acoustic scene classification algorithm submitted to DCASE 2019 Task 1a. We focus on various pre-processed features to categorize the class of acoustic scenes using only the stereo microphone input signal. In the front-end system, pre-processed features and spatial information are extracted from the stereo microphone input. A residual network, a subspectral network, and a conventional convolutional neural network (CNN) are used as back-end systems. Finally, we ensemble all of the models to take advantage of each algorithm. Using the proposed systems, we achieved a classification accuracy of 80.4%, which is 17.9 percentage points higher than the baseline system.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, spectrogram, chromagram
Classifier CNN
Decision making average
PDF

Acoustic Scene Classification Using Ensembles of Convolutional Neural Networks and Spectrogram Decompositions

Shengwang Jiang and Chuang Shi
School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This technical report proposes ensembles of convolutional neural networks (CNNs) for Task 1 / Subtask B of the DCASE 2019 challenge, with emphasis on using different spectrogram decompositions. Harmonic percussive source separation (HPSS), nearest neighbor filtering (NNF), and vocal separation are applied to the monaural samples. The head-related transfer function (HRTF) is also proposed as a way to transform monaural samples into binaural ones with augmented spatial information. Finally, 16 neural networks are trained and combined. The classification accuracy of the proposed system reaches 0.70166 on the public leaderboard.
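The harmonic-percussive source separation (HPSS) used as one of the spectrogram decompositions is available directly in librosa; a minimal sketch that turns one mono clip into harmonic and percussive log-mel spectrograms (sampling rate and mel settings are placeholders):

```python
import librosa

def harmonic_percussive_logmels(path, sr=44100, n_mels=128):
    """Split a clip with HPSS and compute a log-mel spectrogram for each component."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    harmonic, percussive = librosa.effects.hpss(audio)
    def to_logmel(y):
        return librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return to_logmel(harmonic), to_logmel(percussive)
```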

System characteristics
Sampling rate 44.1kHz
Data augmentation HPSS, NNF, vocal separation, HRTF
Features log-mel energies
Classifier CNN
Decision making stacking; averaging
PDF

Knowledge Distillation with Specialist Models in Acoustic Scene Classification

Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim and Ha-Jin Yu
Computing Sciences, University of Seoul, Seoul, Republic of Korea

Abstract

In this technical report, we describe our submission for the Detection and Classification of Acoustic Scenes and Events 2019 Task 1a competition, which exploits knowledge distillation with specialist models. Different acoustic scenes that share common properties are one of the main obstacles that hinder successful acoustic scene classification. We found that confusion between scenes sharing common properties causes most of the errors in acoustic scene classification. For example, confusing scene pairs such as airport-shopping mall and metro-tram have caused the most errors in various systems. We applied knowledge distillation based on specialist models to address the errors from the most confusing scene pairs. Specialist models, each of which concentrates on discriminating one pair of similar scenes, are exploited to provide soft labels. We expected that knowledge distillation from multiple specialist models and a pre-trained generalist model to a single model could give more emphasis to discriminating specific acoustic scene pairs. Through knowledge distillation from the well-trained model and the specialist models to a single model, we report improved accuracy on the validation set.
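A generic distillation objective of the kind described above combines a temperature-softened teacher distribution with the usual hard-label cross-entropy; a minimal numpy sketch (the temperature and weighting are textbook defaults, not values from the report):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, onehot_labels,
                      temperature=3.0, soft_weight=0.5):
    """Weighted sum of soft-target cross-entropy (teacher) and hard-label cross-entropy."""
    soft_targets = softmax(teacher_logits, temperature)
    log_student_soft = np.log(softmax(student_logits, temperature) + 1e-12)
    soft_ce = -np.mean(np.sum(soft_targets * log_student_soft, axis=-1)) * temperature ** 2
    log_student = np.log(softmax(student_logits) + 1e-12)
    hard_ce = -np.mean(np.sum(onehot_labels * log_student, axis=-1))
    return soft_weight * soft_ce + (1.0 - soft_weight) * hard_ce
```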

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation mixup
Features raw waveform, log-mel energies
Classifier CNN
Decision making majority vote; score-sum
PDF

The I2R Submission to the DCASE 2019 Challenge

Teh KK1, Sun HW2 and Tran Huy Dat2
1I2R, A-star, Singapore, 2I2R, A-Star, Singapore

Abstract

This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in Task 1A of the DCASE 2019 challenge. In this approach, various pre-processing methods (mel-filterbank and delta feature vectors, harmonic-percussive separation, and subband power distribution) are used to train the CNN models. We also used score fusion of the features to find an optimal feature configuration. On the official leaderboard dataset of the Task 1A challenge, an accuracy of 79.67% is achieved.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, HPSS; log-mel energies, HPSS, subband power distribution
Classifier CNN
Decision making weighted averaging vote; multi-class linear logistic regression
PDF

Calibrating Neural Networks for Secondary Recording Devices

Michał Kośmider
Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

This report describes the solution to Task 1B of the DCASE 2019 challenge proposed by Samsung R&D Institute Poland. The primary focus of the system for Task 1B was a novel technique designed to address issues with learning from microphones with different frequency responses in settings with limited examples for the targeted secondary devices. This technique is independent of the architecture of the predictive model and requires just a few examples to become effective.
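
The report does not give code here, but the core idea can be sketched roughly as follows (the aggregation and smoothing choices are assumptions, not the paper's exact recipe): estimate a per-frequency gain from a few time-aligned recording pairs and apply it to the secondary-device spectrogram before the mel filter bank:

```python
# Rough sketch of a per-frequency correction between a reference and a secondary device.
import numpy as np
import librosa

def device_correction(ref_files, sec_files, sr=44100, n_fft=2048):
    """Average magnitude-spectrum ratio reference/secondary over aligned file pairs."""
    ratios = []
    for ref_path, sec_path in zip(ref_files, sec_files):  # hypothetical aligned pairs
        ref, _ = librosa.load(ref_path, sr=sr, mono=True)
        sec, _ = librosa.load(sec_path, sr=sr, mono=True)
        R = np.abs(librosa.stft(ref, n_fft=n_fft)).mean(axis=1)
        S = np.abs(librosa.stft(sec, n_fft=n_fft)).mean(axis=1)
        ratios.append(R / np.maximum(S, 1e-10))
    return np.mean(ratios, axis=0)  # shape: (1 + n_fft // 2,)

def apply_correction(sec_spectrogram, correction):
    # Scale every frequency bin of the secondary-device magnitude spectrogram.
    return sec_spectrogram * correction[:, None]
```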

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation Spectrum Correction, SpecAugment, mixup
Features log-mel energies
Classifier CNN
Decision making isotonic-calibrated soft-voting; soft-voting
PDF

Cross-Task Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.
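
As a rough illustration of the kind of model described (not the official baseline code; channel widths are assumptions), a 9-layer CNN with average pooling after the convolutional layers and global average pooling before the classifier can be sketched in PyTorch as:

```python
# Sketch of a 9-layer CNN (8 conv layers + 1 fully connected layer) for log-mel input.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.AvgPool2d(2),  # average pooling after the convolutional layers
        )

    def forward(self, x):
        return self.net(x)

class Cnn9(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, 256), ConvBlock(256, 512)
        )
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):          # x: (batch, 1, mel_bins, time)
        h = self.blocks(x)
        h = h.mean(dim=(2, 3))     # global average pooling
        return self.fc(h)
```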

System characteristics
Input mono
Sampling rate 32kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs

Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

In this report, we detail the CP-JKU submissions to the DCASE-2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural networks architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of Resnet and Densenet architectures to best fit the various audio processing tasks that use the spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year’s submissions is to provide the best-performing single-model submission, using our proposed approaches.

System characteristics
Input binaural
Sampling rate 22.05kHz
Data augmentation mixup
Features perceptual weighted power spectrogram
Classifier CNN, Receptive Field Regularization
Decision making average
PDF

Acoustic Scene Classification with Reject Option Based on Resnets

Bernhard Lehner1 and Khaled Koutini2
1Silicon Austria Labs, JKU, Linz, Austria, 2Institute of Computational Perception, JKU, Linz, Austria

Abstract

This technical report describes the submissions from the SAL/CP JKU team for Task 1 - Subtask C (classification on data that includes classes not encountered in the training data) of the DCASE-2019 challenge. Our method uses a ResNet variant specifically adapted to be used along with spectrograms in the context of Acoustic Scene Classification (ASC). The reject option is based on the logit values of the same networks. We do not use any of the provided external data sets, and perform data augmentation only with the mixup technique [1]. The result of our experiments is a system that achieves classification accuracies of up to around 60% on the public Kaggle-Leaderboard. This is an improvement of around 14 percentage points compared to the official DCASE 2019 baseline.
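
The logit-based reject option can be sketched as follows (the threshold value and plain logit averaging are illustrative assumptions):

```python
# Reject option on averaged ensemble logits: label a clip "unknown" when even the
# largest logit stays below a threshold.
import numpy as np

def classify_with_reject(logits_per_model, threshold=0.0, unknown_label="unknown"):
    """logits_per_model: array of shape (n_models, n_classes) for one clip."""
    avg_logits = np.mean(logits_per_model, axis=0)   # logit averaging
    if avg_logits.max() < threshold:
        return unknown_label
    return int(np.argmax(avg_logits))
```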

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making logit averaging
PDF

Multi-Scale Recalibrated Features Fusion for Acoustic Scene Classification

Chongqin Lei and Zixu Wang
Intelligent Information Technology and System Lab, Chongqing University, Chongqing, China

Abstract

We investigate the effectiveness of multi-scale recalibrated feature fusion for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019). A general problem in the acoustic scene classification task is that audio signal segments contain little effective information. To further exploit features with less effective information and improve classification accuracy, we embed Squeeze-and-Excitation units into the backbone structure of Xception to recalibrate the channel weights of the feature maps in each block. In addition, the recalibrated multi-scale features are fused and finally fed into the fully connected layer to obtain more useful information. Furthermore, we use the mixup method to augment the data during training to reduce overfitting of the network. The proposed method attains a recognition accuracy of 77.5%, which is 13% higher than the baseline system of the DCASE 2019 Acoustic Scene Classification task.
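
A minimal Squeeze-and-Excitation block of the kind embedded into each backbone block (standard formulation, not taken from the authors' code) looks like this in PyTorch:

```python
# Squeeze-and-Excitation: learn per-channel gates from globally pooled statistics
# and use them to recalibrate the feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):               # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))          # squeeze: global average pooling
        w = self.fc(w)                  # excitation: channel-wise gates in (0, 1)
        return x * w[:, :, None, None]  # recalibrate feature maps
```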

System characteristics
Input binaural
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification Using Attention-Based Convolutional Neural Network

Han Liang and Yaxiong Ma
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China

Abstract

This technical report addresses Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge, whose goal is to classify a test audio recording into one of the predefined classes that characterize the environment. We use mel-spectrograms as audio features and deep convolutional neural networks (CNNs) as the classifier for acoustic scenes. In our method, the spectrogram of every audio clip is divided in two ways. In addition, we introduce an attention mechanism to further improve the performance. Experimental results show that our best model achieves a classification accuracy of around 70.7% on the development dataset, which is superior to the baseline system accuracy of 62.5%.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
PDF

Jsnu_wdxy Submission for DCASE-2019: Acoustic Scene Classification with Convolution Neural Networks

Xinixn Ma and Mingliang Gu
School of Physics and Electronic, Jiangsu Normal University, Xuzhou, China

Abstract

Acoustic Scene Classification (ASC) is the task of identifying the scene from which an audio signal was recorded. It is one of the core research problems in the field of computational sound scene analysis. Most current best-performing acoustic scene classification systems use mel-scale spectrograms with convolutional neural networks (CNNs). In this paper, we demonstrate how we applied convolutional neural networks to DCASE 2019 Task 1, acoustic scene classification. First, we applied mel-scale spectrograms to extract acoustic features. The mel scale is a common way to approximate the frequency warping of human hearing, with frequency resolution strictly decreasing from the low to the high frequency range. Second, we generate mel spectrograms from the binaural audio and adaptively train 5 convolutional neural networks. The best classification result of the proposed system was 71.1% on the development dataset and 73.16% on the leaderboard dataset.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with Neural Network

Sifan Ma and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China

Abstract

This technical report presents our approach for the acoustic scene classification of DCASE2019 Task 1a. Compared to traditional audio features such as Mel-frequency Cepstral Coefficients (MFCC) and the Constant-Q Transform (CQT), we choose Deep Scattering Spectra (DSS) features, which are more suitable for characterizing acoustic scenes. DSS is a good way to preserve high-frequency information. Based on DSS features, we choose a network model combining a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) to classify acoustic scenes. The experimental results show that our approach increases the classification accuracy from 62.5% (DCASE2019 baseline) to 85%.

System characteristics
Input left, right
Sampling rate 48kHz
Features DSS
Classifier CNN, DNN
PDF

Acoustic Scene Classification From Binaural Signals Using Convolutional Neural Networks

Rohith Mars, Pranay Pratik, Srikanth Nagisetty and Chong Soon Lim
Core Technology Group, Panasonic R&D Center, Singapore, Singapore

Abstract

In this report, we present the technical details of our proposed framework and solution for the DCASE 2019 Task 1A - Acoustic Scene Classification challenge. We describe the audio pre-processing, feature extraction steps and the time-frequency (TF) representations used for acoustic scene classification using binaural audio recordings. We employ two distinct architectures of convolutional neural networks (CNNs) for processing the extracted audio features for classification and compare their relative performance in terms of both accuracy and model complexity. Using an ensemble of the predictions from multiple models based on the above CNNs, we achieved an average classification accuracy of 79.35% on the test split of the development dataset for this task and a system score of 82.33% in the Kaggle public leaderboard, which is an improvement of ≈ 18% over the baseline system.
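
The channel variants listed under the system characteristics below (mono, left, right, mid, side) can be derived from a binaural recording as in this small librosa-based sketch (parameters are assumptions; note that a plain average downmix makes the mono and mid signals coincide):

```python
# Derive mono/left/right/mid/side signals from a stereo file and turn each into a
# log-mel map, one input channel per signal.
import numpy as np
import librosa

y, sr = librosa.load("scene.wav", sr=48000, mono=False)  # y: (2, n_samples), hypothetical file
left, right = y[0], y[1]
mono = 0.5 * (left + right)
mid = 0.5 * (left + right)   # mid/side encoding; equals the average downmix here
side = 0.5 * (left - right)

def logmel(x):
    m = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=2048, hop_length=1024, n_mels=128)
    return librosa.power_to_db(m)

channels = np.stack([logmel(c) for c in (mono, left, right, mid, side)])
```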

System characteristics
Input mono, left, right, mid, side
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making max probability
PDF

The System for Acoustic Scene Classification Using Resnet

Liu Mingle and Li Yanxiong
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province, China

Abstract

In this report, we present our work concerning Task 1a of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a strategy of classifier voting for ASC. Specifically, an audio feature, such as the logarithmic filter-bank (LFB), is first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to obtain a classifier ensemble. Finally, the classification result for each test sample is based on the voting of all classifiers.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier ResNet
Decision making vote
PDF

DCASE 2019: CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification

Javier Naranjo-Alcazar1, Sergi Perez-Castanos1, Pedro Zuccarello1 and Maximo Cobos2
1Visualfy AI, Visualfy, Benisano, Spain, 2Computer Science, Universitat de Valencia, Burjassot, Spain

Abstract

The objective of this technical report is to describe the framework used in Task 1, Acoustic Scene Classification (ASC), of the DCASE 2019 challenge. The presented approach is based on log-mel spectrogram representations and VGG-based convolutional neural networks (CNNs). Three different CNNs, with very similar architectures, have been implemented. The main difference is the number of filters in their convolutional blocks. Experiments show that the depth of the network is not the most relevant factor for improving accuracy; the performance seems to be more sensitive to the input audio representation. This conclusion is important for the implementation of real-time audio recognition and classification systems on edge devices. In the presented experiments, the best audio representation is the log-mel spectrogram of the harmonic and percussive sources plus the log-mel spectrogram of the difference between the left and right stereo channels (L − R). Also, in order to improve accuracy, ensemble methods combining predictions from different models with different inputs are explored. Besides geometric and arithmetic means, ensembles aggregated with the Orness Weighted Averaged (OWA) operator have shown interesting and novel results. The proposed framework outperforms the baseline system by 14.34 percentage points. For Task 1a, the obtained development accuracy is 76.84%, compared to the 62.5% baseline, whereas the accuracy obtained on the public leaderboard is 77.33%, compared to the 64.33% baseline.
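
To illustrate the aggregation operators mentioned above, here is a small sketch (the weight vector is an assumption) comparing the arithmetic mean, the geometric mean, and an OWA-style aggregation of ensemble probabilities:

```python
# Ordered weighted averaging: weights are applied to the values sorted in
# descending order, so the operator can emphasise the most confident models.
import numpy as np

def owa(values, weights):
    return np.sum(np.sort(values)[::-1] * weights)

def aggregate(probs, weights):
    """probs: (n_models, n_classes) class probabilities from each ensemble member."""
    arith = probs.mean(axis=0)
    geom = np.exp(np.log(np.clip(probs, 1e-12, 1.0)).mean(axis=0))
    owa_agg = np.array([owa(probs[:, c], weights) for c in range(probs.shape[1])])
    return arith, geom, owa_agg

# Example weights for a 4-model ensemble, biased toward the highest predictions.
weights = np.array([0.4, 0.3, 0.2, 0.1])
```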

System characteristics
Input mono, left, right, difference, harmonic, percussive
Sampling rate 48kHz
Features log-mel energies
Classifier ensemble, CNN
Decision making arithmetic mean; geometric mean; orness weighted average
PDF

DCASE 2019 Task 1a: Acoustic Scene Classification by Sffcc and DNN

Chandrasekhar Paseddula1 and Suryakanth V. Gangashetty2
1Electronics and Communication Engineering, International Institute of Information Technology, Hyderabad, Hyderabad, India, 2Computer Science Engineering, International Institute of Information Technology, Hyderabad, Hyderabad, India

Abstract

In this study, we address the acoustic scene classification (ASC) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge, Task 1A. Single frequency filtering cepstral coefficient (SFFCC) features and a deep neural network (DNN) model are proposed for ASC. We adopt a late fusion mechanism to further improve the performance and, finally, validate the performance of the model and compare it to the baseline system. We used the TAU Urban Acoustic Scenes 2019 development dataset for training and cross-validation, resulting in a 7.9% improvement over the baseline system.

System characteristics
Input mono
Sampling rate 48kHz
Features single frequency cepstral coefficients (SFCC), log-mel energies
Classifier DNN
Decision making maxrule
PDF

Cdnn-CRNN Joined Model for Acoustic Scene Classification

Lam Pham1, Tan Doan2, Dat Thanh Ngo2, Hung Nguyen2 and Ha Hoang Kha2
1School of Computing, University of Kent, Chatham, United Kingdom, 2Electrical and Electronics Engineering, HoChiMinh City University of Technology, HoChiMinh City, Vietnam

Abstract

This work proposes a deep learning framework for Acoustic Scene Classification (ASC), targeting DCASE2019 Task 1A. In general, the front-end process combines three types of spectrograms: Gammatone (GAM), log-mel and Constant-Q Transform (CQT). The back-end classification presents a joined learning model between a CDNN and a CRNN. Our experiments on the development dataset of DCASE2019 challenge Task 1A show a significant improvement of 11.2% over the DCASE2019 baseline of 62.5%. The Kaggle leaderboard reports a classification accuracy of 74.6% when we train on the full development dataset.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup
Features Gammatone, log-mel energies, CQT
Classifier CNN, RNN
PDF

A Multi-Spectrogram Deep Neural Network for Acoustic Scene Classification

Lam Pham, Ian McLoughlin, Huy Phan and Ramaswamy Palaniappan
School of Computing, University of Kent, Chatham, United Kingdom

Abstract

This work targets Tasks 1A and 1B of the DCASE2019 challenge, i.e. Acoustic Scene Classification (ASC) over ten different classes recorded by the same device (Task 1A) and by mismatched devices (Task 1B). For the front-end feature extraction, this work proposes a combination of three types of spectrograms: Gammatone (GAM), log-mel and Constant-Q Transform (CQT). The back-end classification involves two training processes, namely a pre-trained CNN and a post-trained DNN, and the result of the post-trained DNN is reported. Our experiments on the development datasets of DCASE2019 1A and 1B show significant improvements of 14% and 17.4% over the DCASE2019 baselines of 62.5% and 41.4%, respectively. The Kaggle leaderboard also confirms classification accuracies of 79% and 69.2% for Tasks 1A and 1B.

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Data augmentation mixup
Features Gammatone, log-mel energies, CQT
Classifier CNN, DNN
PDF

Deep Neural Networks with Supported Clusters Preclassification Procedure for Acoustic Scene Recognition

Marcin Plata
Data Intelligence Group, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

In this technical report, we present a system for acoustic scene classification that focuses on a deeper analysis of the data. We analyzed the impact of various combinations of parameters for the short-time Fourier transform (STFT) and the mel filter bank. We also used the harmonic and percussive source separation (HPSS) algorithm as an additional feature extractor. Finally, in addition to common classification neural networks trained on divided, non-overlapping spectrograms, we present an out-of-the-box solution with one main neural network trained on clustered labels and a few supporting neural networks to distinguish between the most difficult scenes, e.g. street pedestrian and public square.

System characteristics
Input mono, left, right
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, harmonic, percussive
Classifier CNN, random forest; CNN
Decision making random forest; majority vote
PDF

Acoustic Scene Classification with Mismatched Recording Devices

Paul Primus and David Eitelsebner
Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes the CP-JKU Student team's approach for Task 1 - Subtask B of the DCASE 2019 challenge. In this context, we propose two loss functions for domain adaptation to learn invariant representations given time-aligned recordings. We show that these methods improve the classification performance on our cross-validation setup, as well as the performance on the Kaggle leaderboard, by up to three percentage points compared to our baseline model. Our best scoring submission is an ensemble of eight classifiers.

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, ensemble
Decision making average
PDF

Frequency-Aware CNN for Open Set Acoustic Scene Classification

Alexander Rakowski1 and Michał Kośmider2
1Audio Intelligence, Samsung R&D Institute Poland, Warsaw, Poland, 2Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

This report describes the systems used for Task 1c of the DCASE 2019 Challenge - Open Set Acoustic Scene Classification. The main system consists of a 5-layer convolutional neural network which preserves the location of features on the frequency axis. This is in contrast to the standard approach, where global pooling is applied along the frequency-related dimension. Additionally, the main system is combined with an ensemble of calibrated neural networks in order to improve generalization.

System characteristics
Input mono
Sampling rate 32kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making soft-voting
PDF

Urban Acoustic Scene Classification Using Raw Waveform Convolutional Neural Networks

Daniele Salvati, Carlo Drioli and Gian Luca Foresti
Mathematics, Computer Science and Physics, University of Udine, Udine, Italy

Abstract

We present the signal processing framework and the results obtained with the development dataset (Task 1, Subtask A) for the Detection and Classification of Acoustic Scenes and Events (DCASE 2019) challenge. The framework for the classification of urban acoustic scenes consists of a raw waveform (RW) end-to-end computational scheme based on convolutional neural networks (CNNs). The RW-CNN operates on a time-domain signal segment of 0.5 s and consists of 5 one-dimensional convolutional layers and 3 fully connected layers. The overall classification accuracy of the proposed RW-CNN on the development dataset is 69.7%.
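
A rough PyTorch sketch of such an end-to-end raw-waveform network follows (filter counts, kernel sizes, and the use of adaptive pooling before the fully connected layers are assumptions, not the exact architecture of the report):

```python
# End-to-end 1D CNN on raw waveform segments: 5 conv layers followed by 3 FC layers.
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv1d(c_in, c_out, kernel_size=9, stride=2, padding=4),
                      nn.BatchNorm1d(c_out), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool1d(1)           # collapse the time axis
        self.fc = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):            # x: (batch, 1, 24000) = 0.5 s at 48 kHz
        h = self.pool(self.convs(x)).squeeze(-1)
        return self.fc(h)
```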

System characteristics
Input mono
Sampling rate 48kHz
Features raw waveform
Classifier CNN
PDF

Acoustic Scene Classification Using Specaugment and Convolutional Neural Network with Inception Modules

Suh Sangwon, Jeong Youngho, Lim Wootaek and Park Sooyoung
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea

Abstract

This paper describes the system submitted to Task 1a (Acoustic Scene Classification, ASC). By analyzing the major systems submitted in 2017 and 2018, we selected a two-dimensional convolutional neural network (CNN) as the most suitable model for this task. The proposed model is composed of four convolution blocks; two of them are conventional CNN structures, while the following two blocks consist of Inception modules. We constructed a meta-learning problem with this model in order to train a super learner. For each base model, we applied different validation split methods to obtain a better generalized result with the ensemble method. In addition, we applied data augmentation in real time with SpecAugment for each base model. With our final system, with all of the above techniques applied, we achieved an accuracy of 76.1% on the development dataset and 81.3% on the leaderboard set.
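
The on-the-fly SpecAugment-style masking can be sketched as follows (mask sizes and counts are illustrative assumptions; the original SpecAugment also includes time warping, omitted here):

```python
# Frequency and time masking applied to a log-mel patch during training.
import numpy as np

def spec_augment(logmel, n_freq_masks=2, n_time_masks=2, F=12, T=30, rng=np.random):
    """logmel: (mel_bins, frames); masked regions are set to the mean value."""
    x = logmel.copy()
    fill = x.mean()
    n_mels, n_frames = x.shape
    for _ in range(n_freq_masks):
        f = rng.randint(0, F + 1)
        f0 = rng.randint(0, max(1, n_mels - f))
        x[f0:f0 + f, :] = fill
    for _ in range(n_time_masks):
        t = rng.randint(0, T + 1)
        t0 = rng.randint(0, max(1, n_frames - t))
        x[:, t0:t0 + t] = fill
    return x
```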

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CNN; ensemble
PDF

Feature Enhancement for Robust Acoustic Scene Classification with Device Mismatch

Hongwei Song and Hao Yang
Computer Sciences and Technology, Harbin Institute of Technology, Harbin, China

Abstract

This technical report describes our system for DCASE2019 Task 1 Subtask B. We focus on analyzing how device distortions affect the classic log-mel feature, which is the most widely adopted feature for convolutional neural network (CNN) based models. We demonstrate mathematically that, for the log-mel feature, the influence of device distortion appears as an additive constant vector over the log-mel spectrogram. Based on this analysis, we propose to use feature enhancement methods such as spectrogram-wise mean subtraction and median filtering to remove the additive term of channel distortions. The information loss introduced by the enhancement methods is discussed. We also motivate the use of the mixup technique to generate virtual samples with various device distortions. Combining the proposed techniques, we rank second on the public Kaggle leaderboard.
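
A small sketch of the described feature enhancement (not the authors' exact pipeline): if the device response adds a constant vector to the log-mel spectrogram, per-recording mean subtraction or a slowly varying median estimate along time can remove it:

```python
# Remove an (approximately) additive per-frequency channel term from log-mel features.
import numpy as np
from scipy.ndimage import median_filter

def mean_subtract(logmel):
    """Subtract the per-frequency mean over time from a (mel_bins, frames) map."""
    return logmel - logmel.mean(axis=1, keepdims=True)

def median_enhance(logmel, width=101):
    """Subtract a slowly varying per-band background estimated by median filtering along time."""
    background = median_filter(logmel, size=(1, width), mode="nearest")
    return logmel - background
```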

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making probability aggregation; majority vote
PDF

Wavelet Based Mel-Scaled Features for DCASE 2019 Task 1a and Task 1b

Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission to the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 for Task 1 (acoustic scene classification (ASC)), Subtask A (basic ASC) and Subtask B (ASC with mismatched recording devices). The system exploits a time-frequency representation of audio to obtain the scene labels. It follows a simple pattern classification framework employing wavelet-transform-based mel-scaled features along with a support vector machine as the classifier. The proposed system outperforms the deep-learning-based baseline system by almost 8% (relative) for Subtask A and 26% for Subtask B on the development datasets provided for the respective subtasks.

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features MFDWC
Classifier SVM
PDF

Acoustic Scene Classification Based on CNN System

Zhuhe Wang, Jingkai Ma and Chunyang Li
Noise and Vibration Laboratory, Beijing Technology and Business University, Beijing, China

Abstract

In this study, we present a solution for acoustic scene classification Task 1A in the DCASE 2019 Challenge. Our model uses a convolutional neural network and makes some improvements on the basic CNN. We extract MFCC (mel-frequency cepstral coefficient) features from the official audio files, recreate the dataset, and use it as the input to the neural network. Finally, comparing our model to the baseline system, our results were 12% more accurate.

System characteristics
Input one
Sampling rate 22.05kHz
Features MFCC
Classifier CNN
PDF

Ciaic-ASC System for DCASE 2019 Challenge Task1

Mou Wang and Rui Wang
School of Marine Sciences and Technology, Northwestern Polytechnical University, Xi'an, China

Abstract

In this report, we present our systems for Subtask A and Subtask B of DCASE 2019 Task 1, i.e. acoustic scene classification. Subtask A is a basic closed-set classification problem with data from a single device. In our system, we first extracted several acoustic features such as the mel-spectrogram, hybrid constant-Q transform, and harmonic-percussive source separation. Convolutional neural networks (CNNs) with average pooling are used to classify acoustic scenes. We averaged the outputs of CNNs fed with different features to ensemble these methods. Subtask B is a classification problem with mismatched devices, so we introduce a Domain Adaptation Neural Network (DANN) to extract features that are uncorrelated with the domain. We further ensemble the DANN with the CNN methods to obtain better performance. The accuracy of our system for Subtask A is 0.783 on the validation dataset and 0.816 on the leaderboard dataset. The accuracy for Subtask B reaches 0.717 on the leaderboard dataset, which shows that our method can solve such a cross-domain problem and outperforms the baseline system.

System characteristics
Input mono
Sampling rate 32kHz; 44.1kHz
Features log-mel energies
Classifier CNN; CNN, DNN
Decision making average
PDF

The SEIE-SCUT Systems for Acoustic Scene Classification Using CNN Ensemble

Wucheng Wang and Mingle Liu
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province, China

Abstract

In this report, we present our work concerning Task 1b of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a strategy of CNN ensembling for ASC. Specifically, an audio feature, such as Mel-frequency cepstral coefficients (MFCCs) or logarithmic filter-bank (LFB) energies, is first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to obtain a CNN ensemble. Finally, the classification result for each test sample is based on the voting of all CNNs contained in the ensemble.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, MFCC
Classifier VGG, Inception, ResNet
Decision making vote
PDF

Open-Set Acoustic Scene Classification with Deep Convolutional Autoencoders

Kevin Wilkinghoff and Frank Kurth
Communication Systems, Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany

Abstract

Acoustic scene classification is the task of determining the environment in which a given audio file has been recorded. If it is a priori not known whether all possible environments that may be encountered during test time are also known when training the system, the task is referred to as open-set classification. This paper contains a description of an open-set acoustic scene classification system submitted to Task 1C of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Our system consists of a combination of convolutional neural networks for closed-set identification and deep convolutional autoencoders for outlier detection. In evaluations conducted on the leaderboard dataset of the challenge, the proposed system significantly outperforms the baseline systems and improves the score by 35.4% from 0.46666 to 0.63166.
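
The outlier-detection step can be sketched schematically as follows (threshold selection and the combination with the closed-set CNN are assumptions here):

```python
# Autoencoder-based outlier detection: clips whose reconstruction error exceeds a
# threshold are flagged as "unknown"; otherwise the closed-set label is kept.
import numpy as np

def reconstruction_error(autoencoder, x):
    """x: (mel_bins, frames) log-mel patch; autoencoder: any callable returning x_hat."""
    x_hat = autoencoder(x)
    return float(np.mean((x - x_hat) ** 2))

def open_set_decision(closed_set_label, error, threshold):
    return "unknown" if error > threshold else closed_set_label
```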

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup, cutout, width shift, height shift
Features log-mel energies; log-mel energies, harmonic part, percussive part
Classifier CNN; CNN, ensemble; CNN, DCAE, logistic regression; CNN, DCAE, logistic regression, ensemble
Decision making maximum likelihood; geometric mean, maximum likelihood; threshold, maximum likelihood; geometric mean, threshold, maximum likelihood
PDF

Stratified Time-Frequency Features for CNN-Based Acoustic Scene Classification

Yuzhong Wu and Tan Lee
Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China

Abstract

An acoustic scene signal is a mixture of diverse sound events, which frequently overlap with each other. CNN models for acoustic scene classification usually suffer from overfitting because they may memorize the overlapped sounds as the representative patterns for acoustic scenes, and may fail to recognize the scene when only one of the sounds is present. Based on a standard CNN setup with log-mel features as input, we propose to stratify the log-mel image into several component images based on sound duration, so that each component image contains a specific type of time-frequency pattern. We then emphasize the independent modeling of time-frequency patterns to better utilize the stratified features. The experimental results on the TAU Urban Acoustic Scenes 2019 development dataset [1] show that the use of stratified features can significantly improve the classification performance.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for Dcase2019 Challenge

Hossein Zeinali, Lukas Burget and Honza Cernocky
Information Technology, Brno University of Technology, Brno, Czech Republic

Abstract

In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described, along with an analysis of different methods. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first is a VGG-like two-dimensional CNN. The second is again a two-dimensional CNN, which uses the Max-Feature-Map activation and is called Light-CNN (LCNN). The third is a one-dimensional CNN, mainly used for speaker verification and known as the x-vector topology. All proposed networks use a self-attention mechanism for statistics pooling. As features, we use 256-dimensional log mel-spectrograms. Our submissions are fusions of several networks trained on a generated 4-fold evaluation setup, using different fusion strategies.
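
The self-attention statistics pooling mentioned above corresponds to a standard attentive statistics pooling layer, sketched here in PyTorch (the hidden size is an assumption; this is not the authors' code):

```python
# Attentive statistics pooling: attention weights over time give a weighted mean and
# standard deviation, which replace plain average pooling.
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv1d(channels, hidden, 1), nn.Tanh(), nn.Conv1d(hidden, channels, 1)
        )

    def forward(self, x):                       # x: (batch, channels, time)
        w = torch.softmax(self.att(x), dim=2)   # attention over time frames
        mean = (x * w).sum(dim=2)
        var = (x ** 2 * w).sum(dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)    # (batch, 2 * channels)
```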

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, ensemble
Decision making score fusion; majority vote; majority vote, score fusion
PDF

Acoustic Scene Classification Combining Log-Mel CNN Model and End-To-End Model

Xu Zheng and Jie Yan
Computing Sciences, University of Science and Technology of China, Hefei, Anhui, China

Abstract

This technical report describes the Zheng-USTC team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge. In this paper, two different models for acoustic scene classification are presented. The first is a common two-dimensional CNN model in which the log-mel energy spectrogram is treated as an image. The second is an end-to-end model, in which features are extracted from the raw audio by a 3-layer CNN with 64 filters. Experimental results on the fold-1 validation set of 4185 samples and the leaderboard showed that the class-wise accuracies of the two models are complementary in some way. Finally, we fused the softmax output scores of the two systems using a simple unweighted average.

System characteristics
Input binaural; mono
Sampling rate 22.05kHz; 16kHz
Data augmentation SpecAugment, RandomCrop; Between-Class learning; SpecAugment, RandomCrop, Between-Class learning
Features log-mel energies; raw waveform
Classifier CNN
PDF

Audio Scene Classification Based on Deeper CNN and Mixed Mono Channel Feature

Nai Zhou, Yanfang Liu and Qingkai Wei
Beijing Kuaiyu Electronics Co., Ltd., Beijing, China

Abstract

This technical report describes the Kuaiyu team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge. Referring to the results of DCASE 2018, a convolutional neural network and log-mel spectrograms generated from mono audio are used; the log-mel spectrum is converted into a multi-channel spectrogram and used as the input to a neural network with 8 convolutional layers. The result of our experiments is a classification system that achieves classification accuracies of around 75.5% on the public Kaggle leaderboard.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
PDF

DCASE 2019 Challenge Task1 Technical Report

Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes our methods for DCASE 2019 Task 1a and Task 1c of Acoustic Scene Classification (ASC), especially Task 1c, in which unknown scenes not included in the training data set must be handled. We use less training data and a threshold to classify the known and unknown scenes. In our method, we use log mel-spectrograms with different divisions as the input to multiple neural networks, and the ensemble learning output shows good accuracy. For Task 1a we use VGG and Xception networks with an ensemble over 3 different divisions; the accuracy is 0.807 on the leaderboard dataset. For Task 1c we use a Convolutional Recurrent Neural Network (CRNN) and a self-attention mechanism with an ensemble over 2 different feature divisions, and 0.4 as the threshold for unknown judgment; the leaderboard accuracy is 0.648.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, BGRU, self-attention
PDF

DCASE 2019 Challenge Task1 Technical Report

Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes our methods for DCASE 2019 Task 1a and Task 1c of Acoustic Scene Classification (ASC), especially Task 1c, in which unknown scenes not included in the training data set must be handled. We use less training data and a threshold to classify the known and unknown scenes. In our method, we use log mel-spectrograms with different divisions as the input to multiple neural networks, and the ensemble learning output shows good accuracy. For Task 1a we use VGG and Xception networks with an ensemble over 3 different divisions; the accuracy is 0.807 on the leaderboard dataset. For Task 1c we use a Convolutional Recurrent Neural Network (CRNN) and a self-attention mechanism with an ensemble over 2 different feature divisions, and 0.4 as the threshold for unknown judgment; the leaderboard accuracy is 0.648.

System characteristics
Input multiple
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, BGRU, self-attention, ensemble
PDF