Acoustic Scene Classification with use of external data


Challenge results

Task description

A more detailed task description can be found on the task description page.

Systems ranking

Systems using external data

Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset)
DCASE2018 baseline Baseline Heittola2018 61.0 (59.4 - 62.6) 59.7 62.5
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 (70.2 - 73.2) 71.8
Khadkevich_FB_task1c_2 1cmaxpool Khadkevich2018 69.0 (67.5 - 70.5) 70.2
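
The page does not say how the 95% confidence intervals in the table above were obtained. A minimal sketch, assuming a normal-approximation (Wald) binomial interval and an evaluation set of 3600 items (both are assumptions, not stated on this page), gives intervals that round to the values reported for the baseline and for Khadkevich_FB_task1c_1:

```python
import math


def accuracy_confidence_interval(accuracy_percent, n_items, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion, in percent."""
    p = accuracy_percent / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n_items)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


# Assumption: 3600 evaluation items; this number is not given on the page.
N_EVAL = 3600

for name, acc in [("DCASE2018 baseline", 61.0), ("Khadkevich_FB_task1c_1", 71.7)]:
    low, high = accuracy_confidence_interval(acc, N_EVAL)
    print(f"{name}: {acc:.1f} ({low:.1f} - {high:.1f})")
```

Because the interval width depends only on the accuracy and the number of items, two systems whose intervals overlap (for example the two Khadkevich submissions) cannot be cleanly separated by this measure alone.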

All submitted systems

This table includes all submissions from subtask A and subtask C. External data usage is highlighted in green. Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset)
Baseline_Surrey_task1a_1 SurreyCNN8 Kong2018 70.4 (68.9 - 71.9) 68.0 70.7
Baseline_Surrey_task1a_2 SurreyCNN4 Kong2018 69.7 (68.2 - 71.2) 68.0 70.7
Dang_NCU_task1a_1 AnD_NCU Dang2018 73.3 (71.9 - 74.8) 76.7 72.5
Dang_NCU_task1a_2 AnD_NCU Dang2018 74.5 (73.1 - 76.0) 76.7 72.5
Dang_NCU_task1a_3 AnD_NCU Dang2018 74.1 (72.7 - 75.5) 76.7 72.5
Dorfer_CPJKU_task1a_1 DNN Dorfer2018 79.7 (78.4 - 81.0) 77.1 80.0
Dorfer_CPJKU_task1a_2 i-vectors Dorfer2018 67.8 (66.3 - 69.3) 65.8
Dorfer_CPJKU_task1a_3 calib-avg Dorfer2018 80.5 (79.2 - 81.8) 80.5
Dorfer_CPJKU_task1a_4 calib-sep Dorfer2018 77.2 (75.8 - 78.5)
Fraile_UPM_task1a_1 UPMg Fraile2018 62.7 (61.1 - 64.3) 62.3 57.7
Gil-jin_KNU_task1a_1 ECDCNN Sangwon2018 74.4 (73.0 - 75.8) 72.4 75.5
Golubkov_SPCH_task1a_1 spch_fusion Golubkov2018 60.2 (58.7 - 61.8) 80.1 69.3
DCASE2018 baseline Baseline Heittola2018 61.0 (59.4 - 62.6) 59.7 62.5
Jung_UOS_task1a_1 4cl_nw Jung2018 74.8 (73.4 - 76.2) 73.5
Jung_UOS_task1a_2 4cl_w Jung2018 74.2 (72.8 - 75.7) 72.9
Jung_UOS_task1a_3 GM_w Jung2018 73.8 (72.4 - 75.2) 72.7
Jung_UOS_task1a_4 SVM_w Jung2018 73.8 (72.4 - 75.3) 72.4
Khadkevich_FB_task1a_1 1aavpool Khadkevich2018 67.8 (66.3 - 69.3)
Khadkevich_FB_task1a_2 1amaxpool Khadkevich2018 67.2 (65.7 - 68.8)
Li_BIT_task1a_1 BIT_task1a_1 Li2018 73.0 (71.5 - 74.4) 74.3
Li_BIT_task1a_2 BIT_task1a_2 Li2018 75.3 (73.9 - 76.7) 75.2
Li_BIT_task1a_3 BIT_task1a_3 Li2018 75.3 (73.9 - 76.7) 76.6
Li_BIT_task1a_4 BIT_task1a_4 Li2018 75.0 (73.6 - 76.4) 76.4
Li_SCUT_task1a_1 Li_SCUT Li2018a 43.4 (41.8 - 45.0) 66.9
Li_SCUT_task1a_2 Li_SCUT Li2018a 50.2 (48.6 - 51.9) 72.9
Li_SCUT_task1a_3 Li_SCUT Li2018a 44.5 (42.9 - 46.2) 69.1
Li_SCUT_task1a_4 Li_SCUT Li2018a 46.7 (45.1 - 48.3) 71.2
Liping_CQU_task1a_1 Xception Liping2018 70.4 (69.0 - 71.9) 79.8 72.2
Liping_CQU_task1a_2 Xception Liping2018 74.0 (72.6 - 75.4) 79.8 73.8
Liping_CQU_task1a_3 Xception Liping2018 74.7 (73.3 - 76.1) 79.8 74.2
Liping_CQU_task1a_4 Xception Liping2018 75.4 (74.0 - 76.8) 79.8 73.0
Maka_ZUT_task1a_1 asa_dev Maka2018 65.8 (64.3 - 67.4) 66.2 63.5
Mariotti_lip6_task1a_1 MP_all Mariotti2018 75.0 (73.6 - 76.4) 78.4
Mariotti_lip6_task1a_2 MP_no50 Mariotti2018 72.8 (71.3 - 74.2) 79.1
Mariotti_lip6_task1a_3 NN_all Mariotti2018 72.8 (71.3 - 74.2) 76.4
Mariotti_lip6_task1a_4 NN_no50 Mariotti2018 74.9 (73.4 - 76.3) 79.3
Nguyen_TUGraz_task1a_1 NNF_CNNEns Nguyen2018 69.8 (68.3 - 71.3) 69.3 66.8
Ren_UAU_task1a_1 ABCNN Ren2018 69.0 (67.5 - 70.5) 72.6 69.7
Roletscheck_UNIA_task1a_1 DeepSAGA Roletscheck2018 69.2 (67.7 - 70.7) 74.7 69.3
Roletscheck_UNIA_task1a_2 DeepSAGA Roletscheck2018 67.3 (65.7 - 68.8) 72.8
Sakashita_TUT_task1a_1 Sakashita_1 Sakashita2018 81.0 (79.7 - 82.3) 79.7
Sakashita_TUT_task1a_2 Sakashita_2 Sakashita2018 81.0 (79.7 - 82.3) 79.3
Sakashita_TUT_task1a_3 Sakashita_3 Sakashita2018 80.7 (79.4 - 82.0) 79.2
Sakashita_TUT_task1a_4 Sakashita_4 Sakashita2018 79.3 (78.0 - 80.6) 76.9 77.2
Tilak_IIITB_task1a_1 CNN_raw Purohit2018 59.5 (57.9 - 61.1) 63.9 59.7
Tilak_IIITB_task1a_2 DCNN_raw Purohit2018 58.3 (56.7 - 59.9) 61.0 59.3
Tilak_IIITB_task1a_3 DCNN_raw Purohit2018 55.0 (53.4 - 56.6) 60.0 58.8
Waldekar_IITKGP_task1a_1 IITKGP_ABSP_Fusion18 Waldekar2018 69.7 (68.2 - 71.2) 69.8
WangJun_BUPT_task1a_1 Attention Jun2018 70.9 (69.4 - 72.4) 70.8 70.8
WangJun_BUPT_task1a_2 Attention Jun2018 70.5 (69.0 - 72.0) 70.8 70.8
WangJun_BUPT_task1a_3 Attention Jun2018 73.2 (71.7 - 74.6) 70.8 70.8
Yang_GIST_task1a_1 SEResNet Yang2018 71.7 (70.2 - 73.2) 72.5
Yang_GIST_task1a_2 GAN_CNN Yang2018 70.0 (68.5 - 71.5) 70.5
Zeinali_BUT_task1a_1 BUT_1 Zeinali2018 78.4 (77.0 - 79.7) 69.3 74.0
Zeinali_BUT_task1a_2 BUT_2 Zeinali2018 78.1 (76.8 - 79.5) 69.0 74.5
Zeinali_BUT_task1a_3 BUT_3 Zeinali2018 74.5 (73.1 - 76.0) 70.3 74.5
Zeinali_BUT_task1a_4 BUT_4 Zeinali2018 75.1 (73.7 - 76.6) 69.8 73.7
Zhang_HIT_task1a_1 CNN_MLTP Zhang2018 73.4 (72.0 - 74.9) 75.3 73.3
Zhang_HIT_task1a_2 CNN_MLTP Zhang2018 70.9 (69.4 - 72.3) 75.1 71.8
Zhao_DLU_task1a_1 BiLstm-CNN Hao2018 69.8 (68.3 - 71.3) 73.6 70.2
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 (70.2 - 73.2) 71.8
Khadkevich_FB_task1c_2 1cmaxpool Khadkevich2018 69.0 (67.5 - 70.5) 70.2

Teams ranking

Systems using external data

This table includes only the best-performing system per submitting team. Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset)
DCASE2018 baseline Baseline Heittola2018 61.0 (59.4 - 62.6) 59.7 62.5
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 (70.2 - 73.2) 71.8

All submitted systems

This table includes all submissions from subtask A and subtask C. External data usage is highlighted in green. Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset)
Baseline_Surrey_task1a_1 SurreyCNN8 Kong2018 70.4 (68.9 - 71.9) 68.0 70.7
Dang_NCU_task1a_2 AnD_NCU Dang2018 74.5 (73.1 - 76.0) 76.7 72.5
Dorfer_CPJKU_task1a_3 calib-avg Dorfer2018 80.5 (79.2 - 81.8) 80.5
Fraile_UPM_task1a_1 UPMg Fraile2018 62.7 (61.1 - 64.3) 62.3 57.7
Gil-jin_KNU_task1a_1 ECDCNN Sangwon2018 74.4 (73.0 - 75.8) 72.4 75.5
Golubkov_SPCH_task1a_1 spch_fusion Golubkov2018 60.2 (58.7 - 61.8) 80.1 69.3
DCASE2018 baseline Baseline Heittola2018 61.0 (59.4 - 62.6) 59.7 62.5
Jung_UOS_task1a_1 4cl_nw Jung2018 74.8 (73.4 - 76.2) 73.5
Khadkevich_FB_task1a_1 1aavpool Khadkevich2018 67.8 (66.3 - 69.3)
Li_BIT_task1a_3 BIT_task1a_3 Li2018 75.3 (73.9 - 76.7) 76.6
Li_SCUT_task1a_2 Li_SCUT Li2018a 50.2 (48.6 - 51.9) 72.9
Liping_CQU_task1a_4 Xception Liping2018 75.4 (74.0 - 76.8) 79.8 73.0
Maka_ZUT_task1a_1 asa_dev Maka2018 65.8 (64.3 - 67.4) 66.2 63.5
Mariotti_lip6_task1a_1 MP_all Mariotti2018 75.0 (73.6 - 76.4) 78.4
Nguyen_TUGraz_task1a_1 NNF_CNNEns Nguyen2018 69.8 (68.3 - 71.3) 69.3 66.8
Ren_UAU_task1a_1 ABCNN Ren2018 69.0 (67.5 - 70.5) 72.6 69.7
Roletscheck_UNIA_task1a_1 DeepSAGA Roletscheck2018 69.2 (67.7 - 70.7) 74.7 69.3
Sakashita_TUT_task1a_2 Sakashita_2 Sakashita2018 81.0 (79.7 - 82.3) 79.3
Tilak_IIITB_task1a_1 CNN_raw Purohit2018 59.5 (57.9 - 61.1) 63.9 59.7
Waldekar_IITKGP_task1a_1 IITKGP_ABSP_Fusion18 Waldekar2018 69.7 (68.2 - 71.2) 69.8
WangJun_BUPT_task1a_3 Attention Jun2018 73.2 (71.7 - 74.6) 70.8 70.8
Yang_GIST_task1a_1 SEResNet Yang2018 71.7 (70.2 - 73.2) 72.5
Zeinali_BUT_task1a_1 BUT_1 Zeinali2018 78.4 (77.0 - 79.7) 69.3 74.0
Zhang_HIT_task1a_1 CNN_MLTP Zhang2018 73.4 (72.0 - 74.9) 75.3 73.3
Zhao_DLU_task1a_1 BiLstm-CNN Hao2018 69.8 (68.3 - 71.3) 73.6 70.2
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 (70.2 - 73.2) 71.8

Class-wise performance

Systems using external data

This table includes only the best-performing system per submitting team. Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
DCASE2018 baseline Baseline Heittola2018 61.0 55.3 66.1 60.8 52.8 79.4 33.9 64.2 55.3 81.9 60.0
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 67.5 86.1 74.4 69.2 93.1 33.3 75.3 49.4 86.9 81.7
Khadkevich_FB_task1c_2 1cmaxpool Khadkevich2018 69.0 66.1 87.5 69.4 65.0 93.1 36.7 66.9 40.3 87.5 77.8
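
As a quick consistency check on the class-wise table above, the overall evaluation accuracy of each row appears to equal the unweighted mean of its ten class-wise accuracies. The sketch below assumes this macro-average relationship (it is an observation from the numbers, not something the page states) and also reports the weakest class per system:

```python
CLASSES = [
    "Airport", "Bus", "Metro", "Metro station", "Park", "Public square",
    "Shopping mall", "Street pedestrian", "Street traffic", "Tram",
]

# Class-wise accuracies copied from the table above, in the column order of CLASSES.
CLASS_WISE = {
    "DCASE2018 baseline": [55.3, 66.1, 60.8, 52.8, 79.4, 33.9, 64.2, 55.3, 81.9, 60.0],
    "Khadkevich_FB_task1c_1": [67.5, 86.1, 74.4, 69.2, 93.1, 33.3, 75.3, 49.4, 86.9, 81.7],
    "Khadkevich_FB_task1c_2": [66.1, 87.5, 69.4, 65.0, 93.1, 36.7, 66.9, 40.3, 87.5, 77.8],
}


def macro_average(per_class):
    """Unweighted mean over classes."""
    return sum(per_class) / len(per_class)


def weakest_class(per_class):
    """Name of the class with the lowest accuracy."""
    index = min(range(len(per_class)), key=per_class.__getitem__)
    return CLASSES[index]


for system, scores in CLASS_WISE.items():
    print(f"{system}: mean {macro_average(scores):.1f}, weakest class: {weakest_class(scores)}")
```

For all three systems the weakest class is Public square, which is also where the external-data systems gain little or nothing over the baseline.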

All submitted systems

This table includes all submissions from subtask A and subtask C. External data usage is highlighted in green. Please note that the baseline does not use external data.

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Baseline_Surrey_task1a_1 SurreyCNN8 Kong2018 70.4 78.6 71.4 71.4 72.8 92.5 33.9 59.2 51.4 85.8 87.2
Baseline_Surrey_task1a_2 SurreyCNN4 Kong2018 69.7 71.7 70.8 65.0 69.7 92.5 34.4 68.9 52.8 86.7 84.2
Dang_NCU_task1a_1 AnD_NCU Dang2018 73.3 80.8 80.0 76.1 75.8 94.4 36.7 65.0 55.0 86.1 83.3
Dang_NCU_task1a_2 AnD_NCU Dang2018 74.5 82.8 80.6 76.4 72.5 96.1 30.3 68.6 58.1 88.6 91.4
Dang_NCU_task1a_3 AnD_NCU Dang2018 74.1 83.3 79.2 76.1 70.6 95.8 29.2 68.1 56.7 88.3 93.9
Dorfer_CPJKU_task1a_1 DNN Dorfer2018 79.7 93.1 89.4 79.2 81.7 93.6 54.2 69.7 55.0 87.5 93.6
Dorfer_CPJKU_task1a_2 i-vectors Dorfer2018 67.8 61.7 81.9 61.9 60.0 88.3 48.1 61.9 43.1 85.6 85.3
Dorfer_CPJKU_task1a_3 calib-avg Dorfer2018 80.5 88.3 89.4 85.3 85.3 89.7 50.6 73.3 58.6 93.3 91.4
Dorfer_CPJKU_task1a_4 calib-sep Dorfer2018 77.2 82.8 85.8 73.6 77.5 90.0 55.8 65.3 64.4 88.1 88.3
Fraile_UPM_task1a_1 UPMg Fraile2018 62.7 65.8 82.5 41.4 56.1 87.5 46.1 62.2 41.4 81.9 61.9
Gil-jin_KNU_task1a_1 ECDCNN Sangwon2018 74.4 83.9 76.1 69.7 80.3 96.1 30.3 73.3 56.7 89.2 88.6
Golubkov_SPCH_task1a_1 spch_fusion Golubkov2018 60.2 70.3 59.7 69.2 45.8 86.4 36.7 40.0 30.6 85.3 78.6
DCASE2018 baseline Baseline Heittola2018 61.0 55.3 66.1 60.8 52.8 79.4 33.9 64.2 55.3 81.9 60.0
Jung_UOS_task1a_1 4cl_nw Jung2018 74.8 61.4 87.2 75.8 76.9 96.9 42.2 67.5 67.5 87.8 84.7
Jung_UOS_task1a_2 4cl_w Jung2018 74.2 61.1 89.4 76.7 74.4 98.9 33.6 69.7 64.7 91.7 81.9
Jung_UOS_task1a_3 GM_w Jung2018 73.8 57.8 89.7 76.1 75.6 98.1 33.3 69.7 66.7 91.1 80.0
Jung_UOS_task1a_4 SVM_w Jung2018 73.8 60.8 89.2 76.9 71.7 99.2 35.3 71.1 60.6 91.1 82.5
Khadkevich_FB_task1a_1 1aavpool Khadkevich2018 67.8 76.1 75.3 67.5 58.3 90.3 26.7 58.3 50.0 85.6 89.7
Khadkevich_FB_task1a_2 1amaxpool Khadkevich2018 67.2 73.9 77.5 69.7 54.7 91.4 28.3 55.0 48.6 87.8 85.6
Li_BIT_task1a_1 BIT_task1a_1 Li2018 73.0 61.1 87.5 80.0 81.4 93.3 38.9 73.9 59.2 82.2 72.2
Li_BIT_task1a_2 BIT_task1a_2 Li2018 75.3 57.5 86.9 86.4 79.7 92.5 46.7 79.4 64.4 85.0 74.2
Li_BIT_task1a_3 BIT_task1a_3 Li2018 75.3 62.2 85.8 86.7 79.7 96.7 46.4 76.7 60.0 85.8 73.1
Li_BIT_task1a_4 BIT_task1a_4 Li2018 75.0 62.2 85.3 86.4 78.1 95.8 44.7 77.5 59.7 86.4 73.9
Li_SCUT_task1a_1 Li_SCUT Li2018a 43.4 61.1 69.4 17.8 18.3 72.5 23.3 41.7 11.9 75.3 42.8
Li_SCUT_task1a_2 Li_SCUT Li2018a 50.3 40.8 70.6 46.9 44.7 44.4 29.7 53.1 37.5 77.8 56.9
Li_SCUT_task1a_3 Li_SCUT Li2018a 44.5 36.9 69.7 27.2 19.2 74.4 16.1 61.4 25.0 79.2 36.1
Li_SCUT_task1a_4 Li_SCUT Li2018a 46.7 49.2 68.6 31.7 24.4 80.6 25.8 51.9 15.3 76.7 42.8
Liping_CQU_task1a_1 Xception Liping2018 70.4 69.2 69.4 76.7 72.8 97.2 31.9 69.2 49.4 92.2 76.4
Liping_CQU_task1a_2 Xception Liping2018 74.0 75.0 77.8 71.9 84.2 93.6 40.3 74.2 48.1 90.0 85.0
Liping_CQU_task1a_3 Xception Liping2018 74.7 79.2 73.1 78.3 83.3 93.9 40.3 69.7 53.3 88.3 87.8
Liping_CQU_task1a_4 Xception Liping2018 75.4 78.9 73.3 77.2 81.7 95.8 40.8 71.9 56.9 89.4 87.8
Maka_ZUT_task1a_1 asa_dev Maka2018 65.8 52.8 82.2 62.2 50.3 94.4 38.1 66.4 59.7 81.7 70.3
Mariotti_lip6_task1a_1 MP_all Mariotti2018 75.0 82.5 75.6 82.2 76.7 97.8 34.2 70.8 59.2 89.2 81.9
Mariotti_lip6_task1a_2 MP_no50 Mariotti2018 72.8 83.3 75.6 72.2 74.2 97.5 31.4 68.1 57.8 89.4 78.1
Mariotti_lip6_task1a_3 NN_all Mariotti2018 72.8 83.6 75.6 73.1 74.7 97.8 31.1 67.2 57.5 89.4 77.8
Mariotti_lip6_task1a_4 NN_no50 Mariotti2018 74.9 81.1 76.7 83.3 73.9 97.5 37.5 75.3 52.2 89.2 81.9
Nguyen_TUGraz_task1a_1 NNF_CNNEns Nguyen2018 69.8 83.3 85.0 62.8 70.0 95.0 35.0 59.4 45.6 84.7 77.5
Ren_UAU_task1a_1 ABCNN Ren2018 69.0 64.4 70.3 56.4 68.6 95.8 42.8 71.1 45.3 85.8 89.7
Roletscheck_UNIA_task1a_1 DeepSAGA Roletscheck2018 69.2 73.6 72.5 63.3 63.3 94.2 36.7 70.8 59.2 78.6 79.7
Roletscheck_UNIA_task1a_2 DeepSAGA Roletscheck2018 67.3 70.3 69.7 65.0 58.1 92.8 35.0 66.4 58.9 81.1 75.6
Sakashita_TUT_task1a_1 Sakashita_1 Sakashita2018 81.0 90.8 81.7 83.9 82.5 92.2 66.4 78.6 58.3 78.9 96.4
Sakashita_TUT_task1a_2 Sakashita_2 Sakashita2018 81.0 90.8 81.9 83.6 82.8 92.2 66.4 78.3 57.8 79.4 96.9
Sakashita_TUT_task1a_3 Sakashita_3 Sakashita2018 80.7 90.3 81.7 83.3 82.2 92.2 65.3 77.5 58.3 79.2 96.7
Sakashita_TUT_task1a_4 Sakashita_4 Sakashita2018 79.3 90.8 83.3 74.4 84.7 97.8 46.4 77.8 52.5 90.8 94.2
Tilak_IIITB_task1a_1 CNN_raw Purohit2018 59.5 50.3 87.2 44.2 52.2 81.4 23.6 61.7 41.7 85.3 67.8
Tilak_IIITB_task1a_2 DCNN_raw Purohit2018 58.3 41.1 88.3 52.2 55.0 68.3 25.8 61.9 35.8 80.0 74.7
Tilak_IIITB_task1a_3 DCNN_raw Purohit2018 55.0 48.9 86.9 46.9 52.5 53.3 29.2 53.9 35.3 71.9 71.1
Waldekar_IITKGP_task1a_1 IITKGP_ABSP_Fusion18 Waldekar2018 69.7 63.3 81.4 70.8 65.6 94.2 40.0 68.1 55.3 83.3 75.0
WangJun_BUPT_task1a_1 Attention Jun2018 70.9 68.9 80.0 70.3 77.8 96.1 23.9 60.3 61.1 81.4 89.4
WangJun_BUPT_task1a_2 Attention Jun2018 70.5 74.4 75.8 75.3 56.7 95.3 44.2 82.5 27.2 88.1 85.8
WangJun_BUPT_task1a_3 Attention Jun2018 73.2 73.3 80.3 73.6 75.6 96.4 33.1 69.7 54.2 85.6 90.3
Yang_GIST_task1a_1 SEResNet Yang2018 71.7 75.0 74.4 63.9 69.2 96.1 38.6 71.7 53.9 89.4 84.7
Yang_GIST_task1a_2 GAN_CNN Yang2018 70.0 68.6 76.7 62.5 71.9 92.8 33.3 70.0 56.1 86.4 81.7
Zeinali_BUT_task1a_1 BUT_1 Zeinali2018 78.4 82.8 90.3 81.4 71.7 95.6 55.6 75.0 52.8 88.3 90.3
Zeinali_BUT_task1a_2 BUT_2 Zeinali2018 78.1 82.8 90.3 85.8 73.9 93.1 54.7 74.2 51.9 89.2 85.6
Zeinali_BUT_task1a_3 BUT_3 Zeinali2018 74.5 67.8 87.5 92.2 61.1 98.9 17.2 77.5 80.8 85.3 76.9
Zeinali_BUT_task1a_4 BUT_4 Zeinali2018 75.1 66.4 83.3 88.9 66.7 98.9 22.5 77.2 80.6 86.1 80.8
Zhang_HIT_task1a_1 CNN_MLTP Zhang2018 73.4 76.7 78.1 73.3 72.2 94.4 41.1 69.2 54.7 85.8 88.9
Zhang_HIT_task1a_2 CNN_MLTP Zhang2018 70.9 71.7 76.7 70.8 67.8 95.3 36.7 68.1 58.1 81.4 82.2
Zhao_DLU_task1a_1 BiLstm-CNN Hao2018 69.8 60.8 84.4 72.5 64.7 97.8 37.8 73.6 39.7 89.2 77.8
Khadkevich_FB_task1c_1 1cavpool Khadkevich2018 71.7 67.5 86.1 74.4 69.2 93.1 33.3 75.3 49.4 86.9 81.7
Khadkevich_FB_task1c_2 1cmaxpool Khadkevich2018 69.0 66.1 87.5 69.4 65.0 93.1 36.7 66.9 40.3 87.5 77.8

System characteristics

General characteristics

Rank | Code | Technical Report | Accuracy (Eval) | Input | Sampling rate | Data augmentation | Features
DCASE2018 baseline Heittola2018 61.0 mono 48kHz log-mel energies
Khadkevich_FB_task1c_1 Khadkevich2018 71.7 mono 16kHz log-mel energies
Khadkevich_FB_task1c_2 Khadkevich2018 69.0 mono 16kHz log-mel energies



Machine learning characteristics

Rank | Code | Technical Report | Accuracy (Eval) | Model complexity | Classifier | Ensemble subsystems | Decision making
DCASE2018 baseline Heittola2018 61.0 116118 CNN
Khadkevich_FB_task1c_1 Khadkevich2018 71.7 CNN
Khadkevich_FB_task1c_2 Khadkevich2018 69.0 CNN

Technical reports

Acoustic Scene Classification Using Ensemble of Convnets

An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan

Abstract

This technical report presents our system for the acoustic scene classification problem in Task 1A of the DCASE2018 challenge, whose goal is to classify audio recordings into predefined types of environments. The overall system is an ensemble of ConvNet models working on different audio features separately. Audio signals are processed both as a mono channel and as two channels before we extract mel-spectrogram and gammatone-based spectrogram features as inputs to the models. All models are implemented with almost the same ConvNet structure. Experimental results show that the ensemble system outperforms the baseline by a large margin of 17% on the test data.

System characteristics
Input stereo, mono
Sampling rate 48kHz
Features log-mel energies
Classifier Ensemble of Convnet
Decision making average
PDF

Acoustic Scene Classification with Fully Convolutional Neural Networks and I-Vectors

Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes the CP-JKU team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge. Our approach is still related to the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on spectrograms. However, for our 2018 submission we have put a stronger focus on tuning and pushing the performance of our CNNs. The result of our experiments is a classification system that achieves classification accuracies of around 80% on the public Kaggle-Leaderboard.

System characteristics
Input left, right, difference; left, right
Sampling rate 22.5kHz
Data augmentation mixup; pitch shifting; mixup, pitch shifting
Features perceptual weighted power spectrogram; MFCC; perceptual weighted power spectrogram, MFCC
Classifier CNN, ensemble; i-vector, late fusion; CNN i-vector ensemble; CNN i-vector late fusion ensemble
Decision making average; fusion; late calibrated fusion of averaged i-vector and CNN models; late calibrated fusion
PDF
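
The system characteristics above list mixup as a data augmentation. As a rough illustration only (this is a generic sketch of the mixup idea, not the CP-JKU implementation; the `alpha` value and the toy inputs are hypothetical), two training examples and their one-hot labels are blended with a weight drawn from a Beta distribution:

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels.

    alpha is a hypothetical setting; the mixing weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Toy feature vectors standing in for flattened spectrogram patches.
rng = random.Random(0)
features_a, label_a = [0.2, 0.8, 0.5], [1.0, 0.0]
features_b, label_b = [0.9, 0.1, 0.4], [0.0, 1.0]
mixed_x, mixed_y, lam = mixup(features_a, label_a, features_b, label_b, rng=rng)
print(lam, mixed_x, mixed_y)
```

The mixed label stays a valid probability vector, so the usual cross-entropy loss can be applied to it unchanged.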

Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps

Ruben Fraile, Elena Blanco-Martin, Juana M. Gutierrez-Arriola, Nicolas Saenz-Lechon and Victor J. Osma-Ruiz
Research Center on Software Technologies and Multimedia Systems for Sustainability (CITSEM), Universidad Politecnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard multilayer perceptron with two hidden layers.

System characteristics
Input binaural
Sampling rate 48kHz
Features LTAS, Modulation spectrum, position-pitch maps
Classifier MLP
Decision making sum of log-probabilities
PDF
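
The decision-making rule listed above is a sum of log-probabilities. The page does not say what the probabilities are summed over (for example segments or feature streams), so the following is only a generic sketch with made-up numbers: per-class log-probabilities from several partial decisions are added, and the class with the highest total wins.

```python
import math


def fuse_log_probabilities(probability_sets, floor=1e-12):
    """Sum per-class log-probabilities over several decisions and pick the best class.

    probability_sets: list of per-class probability lists, one list per partial decision.
    floor: small constant that avoids log(0).
    """
    n_classes = len(probability_sets[0])
    scores = [
        sum(math.log(max(probs[c], floor)) for probs in probability_sets)
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=scores.__getitem__)
    return best_class, scores


# Hypothetical outputs of three partial decisions over three classes.
partial_outputs = [
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.5, 0.4, 0.1],
]
best, scores = fuse_log_probabilities(partial_outputs)
print(best, scores)
```

Summing log-probabilities ranks classes in the same order as multiplying the probabilities (or taking their geometric mean), so a single confident vote against a class weighs more heavily than it would under plain averaging.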

Acoustic Scene Classification Using Convolutional Neural Networks and Different Channels Representations and Its Fusion

Alexander Golubkov and Alexander Lavrentyev
Saint Petersburg, Russia

Abstract

Deep convolutional neural networks (DCNNs) achieve great results in image classification tasks. In this paper, we used different DCNN architectures for image classification. As images, we used spectrograms of different signal representations, such as MFCC, mel-spectrograms and CQT-spectrograms. The result was obtained using the geometric mean of all the models.

System characteristics
Input left, right, mono, mixed
Sampling rate 48kHz
Features CQT, spectrogram, log-mel, MFCC
Classifier CNN
Decision making mean
PDF

DCASE 2018 Task 1a: Acoustic Scene Classification by Bi-LSTM-CNN-Net Multichannel Fusion

WenJie Hao, Lasheng Zhao, Qiang Zhang, HanYu Zhao and JiaHua Wang
Key Laboratory of Advanced Design and Intelligent Computing (Dalian University), Ministry of Education, Dalian University, Liaoning, China

Abstract

In this study, we provide a solution for the acoustic scene classification task in the DCASE 2018 challenge. A system consisting of bidirectional long short-term memory and convolutional neural networks (BI-LSTM-CNN) is proposed, with improved logarithmically scaled mel spectra as its input. In addition, we adopt a new model fusion mechanism. Finally, to validate the performance of the model and compare it to the baseline system, we used the TUT Acoustic Scene 2018 dataset for training and cross-validation, resulting in a 13.93% improvement over the baseline system.

System characteristics
Input multichannel
Sampling rate 48kHz
Features log-mel energies
Classifier CNN, Bi-Lstm
Decision making max of precision
PDF

A Multi-Device Dataset for Urban Acoustic Scene Classification

Toni Heittola, Annamaria Mesaros and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Self-Attention Mechanism Based System for Dcase2018 Challenge Task1 and Task4

Wang Jun (1) and Li Shengchen (2)
(1) Institute of Information Photonics and Optical Communication, c, Beijing, China; (2) Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

In this technical report, we describe a self-attention mechanism for Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. We take a convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) as our basic systems in Task 1 and Task 4. In this convolutional recurrent neural network (CRNN), gated linear units (GLUs) are used as the non-linearity; they implement a gating mechanism over the output of the network for selecting informative local features. A self-attention mechanism called intra-attention is used for modeling relationships between different positions of a single sequence over the output of the CRNN. An attention-based pooling scheme is used for localizing the specific events in Task 4 and for obtaining the final labels in Task 1. In summary, we get 70.81% accuracy in subtask 1 of Task 1. In subtask 2 of Task 1, we get 70.1% accuracy for device a, 59.4% accuracy for device b, and 55.6% accuracy for device c. For Task 4, we get a 26.98% F1 value for sound event detection on the old test data of the development data.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, BGRU, self-attention
PDF
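
The abstract above mentions an attention-based pooling scheme for turning frame-level outputs into a clip-level label. The following is a generic sketch of that idea, not the authors' network: each frame gets a relevance score (learned in a real system, hard-coded here as hypothetical values), the scores are normalized with a softmax, and the clip-level class probabilities are the resulting weighted average of the frame-level probabilities.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, frame_scores):
    """Weighted average of frame-level class probabilities.

    frame_probs: one per-class probability list per frame.
    frame_scores: one unnormalized attention score per frame.
    """
    weights = softmax(frame_scores)
    n_classes = len(frame_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, frame_probs))
        for c in range(n_classes)
    ]


# Hypothetical frame-level outputs for a two-class problem.
frames = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]
print(attention_pool(frames, [0.0, 0.0, 0.0]))   # equal scores: plain mean over frames
print(attention_pool(frames, [5.0, 0.0, 0.0]))   # first frame dominates the clip decision
```

With equal scores the scheme reduces to mean pooling; unequal scores let the model emphasize the frames it considers informative.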

DNN Based Multi-Level Features Ensemble for Acoustic Scene Classification

Jee-weon Jung, Hee-soo Heo, Hye-jin Shim and Ha-jin Yu
School of Computer Science, University of Seoul, Seoul, South Korea

Abstract

Acoustic scenes are defined by various characteristics such as long-term context or short-term events, making it difficult to select input features or pre-processing methods suitable for acoustic scene classification. In this paper, we propose an ensemble model which exploits various input features that vary in their degree of pre-processing: raw waveform without pre-processing, spectrogram, and i-vector, a segment-level low-dimensional representation. We combine deep neural networks that handle different types of input features through a separate scoring phase, in which Gaussian models and support vector machines extract scores from each individual system that can be used as a confidence measure. The validity of the proposed framework is tested using the Detection and Classification of Acoustic Scenes and Events 2018 dataset. The proposed framework showed an accuracy of 73.82% on the validation set.

System characteristics
Input binaural
Sampling rate 48kHz
Features raw-waveform, spectrogram, i-vector
Classifier CNN, DNN, GMM, SVM
Decision making score-sum; weighted score-sum
PDF

Acoustic Scene and Event Detection Systems Submitted to DCASE 2018 Challenge

Maksim Khadkevich
AML, Facebook, Menlo Park, CA, USA

Abstract

In this technical report we describe the systems that have been submitted to the DCASE 2018 [1] challenge. Feature extraction and the convolutional neural network (CNN) architecture are outlined. For tasks 1c and 2 we describe the transfer learning approach that has been applied. Finally, model training and inference are presented.

System characteristics
Input mono
Sampling rate 16kHz
Features log-mel energies
Classifier CNN
PDF

DCASE 2018 Challenge Surrey Cross-Task Convolutional Neural Network Baseline

Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier VGGish 8 layer CNN with global max pooling; AlexNetish 4 layer CNN with global max pooling
PDF

Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with CNN and LSTM

Zhitong Li, Liqiang Zhang, Shixuan Du and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China

Abstract

This technical report presents the solutions proposed by the Beijing Institute of Technology Modern Communications Technology Laboratory for the acoustic scene classification of DCASE2018 task1a. Compared to previous years, the data is more diverse, making such tasks more difficult. In order to solve this problem, we use the Deep Scattering Spectra (DSS) features. The traditional features, such as Mel-frequency Cepstral Coefficients (MFCC), often lose information at high frequencies. DSS is a good way to preserve high frequency information. Based on this feature, we propose a network model of Convolutional Neural Network (CNN) and Long Short-term Memory (LSTM) to classify sound scenes. The experimental results show that the proposed feature extraction method and network structure have a good effect on this classification task. From the experimental data, the accuracy increased from 59% to 76%.

System characteristics
Input left, right
Sampling rate 48kHz
Features DSS
Classifier CNN; CNN, DNN
PDF

The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification

YangXiong Li, Xianku Li and Yuhan Zhang
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China

Abstract

In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1a: Acoustic Scene Classification (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development datasets of DCASE 2018, our systems are superior to the corresponding baselines for task 1a.

System characteristics
Input mono
Sampling rate 48kHz
Features MFCC
Classifier LSTM
PDF

The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification

YangXiong Li, Yuhan Zhang and Xianku Li
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China

Abstract

In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1b: Acoustic Scene Classification with mismatched recording devices (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development datasets of DCASE 2018, our systems are superior to the corresponding baselines for task 1b.

System characteristics
Input mono
Sampling rate 48kHz
Features MFCC
Classifier LSTM
PDF

Acoustic Scene Classification Using Multi-Scale Features

Yang Liping, Chen Xinxing and Tao Lianjie
College of Optoelectronic Engineering, Chongqing University, Chongqing, China

Abstract

Convolutional neural networks (CNNs) have shown tremendous ability in classification problems, because they can extract abstract features that improve classification performance. In this paper, we use a CNN to compute a feature hierarchy layer by layer. As the layers deepen, the extracted features become more abstract, but the shallow features are also very useful for classification. We therefore propose a method that fuses multi-scale features from different layers, which can improve the performance of acoustic scene classification. In our method, the log-mel features of the audio signal are used as the input of the CNN. In order to reduce the number of parameters, we use Xception as the foundation network, which is a CNN with depthwise separable convolution operations (a depthwise convolution followed by a pointwise convolution). We modify Xception to fuse multi-scale features. We also introduce the focal loss to further improve classification performance. This method achieves commendable results, whether the audio recordings are collected by the same device (subtask A) or by different devices (subtask B).

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies
Classifier Xception
PDF

Auditory Scene Classification Using Ensemble Learning with Small Audio Feature Space

Tomasz Maka
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland

Abstract

The report presents the results of an analysis of audio feature space for auditory scene classification. The final small feature set was determined by selecting attributes from various representations. Feature importance was calculated using a Gradient Boosting Machine. A number of classifiers were employed to build the ensemble classification scheme, and majority voting was performed to obtain the final decision. As a result, the proposed solution uses 223 attributes and outperforms the baseline system by over 6 per cent.
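Majority voting over an ensemble, as described above, can be sketched in a few lines. This is only an illustration: the scene labels are examples, and the tie-breaking rule (the first tied label in classifier order wins) is an assumption, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the tied label that appears first in `predictions`
    (an assumed rule; the report does not specify one).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Three hypothetical classifiers voting on one audio segment.
print(majority_vote(["park", "tram", "park"]))
```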

System characteristics
Input binaural
Sampling rate 48kHz
Features various
Classifier ensemble
Decision making majority vote
PDF

Exploring Deep Vision Models for Acoustic Scene Classification

Octave Mariotti, Matthieu Cord and Olivier Schwander
Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France

Abstract

This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition. In the context of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events 2018, we trained several of these architectures on the task 1 dataset to perform acoustic scene classification. Then, in order to produce more robust predictions, we explored two ensemble methods to aggregate the different model outputs. Our results show a final accuracy of 79% on the development dataset, outperforming the baseline by almost 20 percentage points.
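One simple way to aggregate model outputs is to average the class probabilities, which matches the "mean probability" decision rule listed for this system. The sketch below is a generic illustration with made-up probabilities, not the authors' implementation.

```python
def mean_probability(prob_rows):
    """Average per-class probabilities over several models.

    prob_rows: one list of class probabilities per model,
    all of the same length.
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models
            for c in range(n_classes)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of three models for a 3-class problem.
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
]
fused = mean_probability(outputs)
print(fused, argmax(fused))
```

In this made-up case the first model alone would pick class 0, while the averaged probabilities favour class 1.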

System characteristics
Input mono, binaural
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
Decision making mean probability; neural network
PDF

Acoustic Scene Classification Using a Convolutional Neural Network Ensemble and Nearest Neighbor Filters

Truc Nguyen and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria

Abstract

This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in subtasks 1A and 1B of the DCASE 2018 challenge. We introduce a nearest neighbor filter applied to the spectrogram, which makes it possible to emphasize and smooth similar patterns of sound events in a scene. We also propose a variety of CNN models for single-input (SI) and multi-input (MI) channels and three different methods for building a network ensemble. The experimental results show that for subtask 1A the combination of the MI-CNN structures using both log-mel features and their nearest neighbor filtered versions is slightly more effective than single-input channel CNN models using log-mel features only. The opposite holds for subtask 1B. In addition, the ensemble methods improve the accuracy of the system significantly; the best ensemble method is ensemble selection, which achieves 69.3% for subtask 1A and 63.6% for subtask 1B. This improves on the baseline system by 8.9% and 14.4% for subtask 1A and 1B, respectively.

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies and their nearest neighbor filtered version
Classifier CNN
Decision making averaging vote
PDF

Acoustic Scene Classification Using Deep CNN on Raw-Waveform

Tilak Purohit and Atul Agarwal
Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bengaluru, India

Abstract

For acoustic scene classification problems, conventionally Convolutional Neural Networks (CNNs) have been used on handcrafted features like Mel Frequency Cepstral Coefficients, filterbank energies, scaled spectrograms etc. However, recently CNNs have been used on raw waveform for acoustic modeling in speech recognition, though the time-scales of these waveforms are short (of the order of typical phoneme durations - 80-120 ms). In this work, we have exploited the representation learning power of CNNs by using them directly on very long raw acoustic sound waveforms (of durations 0.5-10 sec) for the acoustic scene classification (ASC) task of DCASE and have shown that deep CNNs (of 8-34 layers) can outperform CNNs with similar architecture on handcrafted features.

System characteristics
Input mono
Sampling rate 8kHz
Features raw-waveform
Classifier CNN; DCNN
PDF

Attention-Based Convolutional Neural Networks for Acoustic Scene Classification

Zhao Ren1, Qiuqiang Kong2, Kun Qian1, Mark Plumbley2 and Björn Schuller3
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, UK, 3ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing / GLAM (Group on Language, Audio & Music), University of Augsburg, Imperial College London, Augsburg, Germany / London, UK

Abstract

We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log-mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From the attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6 %, which is an improvement of 12.9 % when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is a significant improvement over the baseline as well, in which the accuracies are 71.8 %, 58.3 %, and 58.3 % for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
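The idea of replacing max or average pooling with a weighted evaluation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' layer: it pools over time frames only, and the attention logits are given as plain inputs (in a real network they would be produced by a learned layer).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_scores, attention_logits):
    """Weighted sum of per-frame class scores.

    frame_scores: T lists, each with C class scores.
    attention_logits: T floats, one per frame (assumed to come
    from a learned attention branch).
    """
    weights = softmax(attention_logits)
    n_classes = len(frame_scores[0])
    return [sum(w * frame[c] for w, frame in zip(weights, frame_scores))
            for c in range(n_classes)]


# Two frames, two classes; the second frame gets more attention.
print(attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 2.0]))
```

With equal logits the weights are uniform and the layer reduces to average pooling; with one very large logit it approaches selecting a single frame.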

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel spectrogram
Classifier CNN
PDF

Using an Evolutionary Approach to Explore Convolutional Neural Networks for Acoustic Scene Classification

Christian Roletscheck and Tobias Watzka
Human Centered Multimedia, Augsburg University, Augsburg, Germany

Abstract

The successful application of modern deep neural networks is heavily reliant on the chosen architecture and the selection of the appropriate hyperparameters. Due to the large number of parameters and the complex inner workings of a neural network, finding a suitable configuration for a given problem turns out to be a rather complex task for a human. In this paper we propose an evolutionary approach to automatically generate a suitable neural network architecture for any given problem. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. We take the DCASE 2018 Challenge as an opportunity to evaluate our algorithm on the task of acoustic scene classification. The best accuracy achieved by our approach was 74.7% on the development dataset.
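A genetic search over network configurations can be sketched as below. Everything in this example is a hypothetical stand-in: the search space, the mutation and crossover rules, and especially `toy_fitness`, which replaces the expensive step of training a CNN and measuring its validation accuracy. It shows the general select, recombine, mutate loop, not the authors' algorithm.

```python
import random

# Hypothetical search space (not taken from the report).
SEARCH_SPACE = {
    "conv_layers": [2, 3, 4, 5],
    "filters": [16, 32, 64, 128],
    "kernel_size": [3, 5, 7],
}


def random_genome(rng):
    return {name: rng.choice(opts) for name, opts in SEARCH_SPACE.items()}


def mutate(genome, rng, rate=0.3):
    """Resample each gene with probability `rate`."""
    child = dict(genome)
    for name, opts in SEARCH_SPACE.items():
        if rng.random() < rate:
            child[name] = rng.choice(opts)
    return child


def crossover(a, b, rng):
    """Pick each gene from one of the two parents at random."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def evolve(fitness, generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(genome):
    # Stand-in for validation accuracy; a real run would train a CNN here.
    return (-abs(genome["conv_layers"] - 4)
            - abs(genome["filters"] - 64) / 32
            - abs(genome["kernel_size"] - 3))


print(evolve(toy_fitness))
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.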

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel spectrogram
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification by Ensemble of Spectrograms Based on Adaptive Temporal Divisions

Yuma Sakashita and Masaki Aono
Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan

Abstract

Many classification tasks using deep learning have improved classification accuracy by using a large amount of training data. However, it is difficult to collect audio data and build a large database. Since training data is restricted in DCASE 2018 Task 1a, unknown acoustic scenes must be predicted from limited training data. From the results of DCASE 2017[1], we determine that using convolutional neural networks and ensembling multiple networks is an effective means of classifying acoustic scenes. In our method we generate mel-spectrograms from binaural audio, mono audio, and harmonic-percussive source separation (HPSS) audio, adaptively divide the spectrograms in multiple ways, and train 9 neural networks. We further improve accuracy by ensemble learning using these outputs. The classification accuracy of the proposed system was 0.769 on the development dataset and 0.796 on the leaderboard dataset.

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making random forest
PDF

CNN Based System for Acoustic Scene Classification

Lee Sangwon, Kang Seungtae and Jang Gil-jin
School of Electronics Engineering, Kyungpook National University, Daegu, Korea

Abstract

Convolutional neural networks (CNNs) have achieved great success in many machine learning tasks such as classifying visual objects or various audio sounds. In this report, we describe our CNN-based system implementation for the acoustic scene classification task of DCASE 2018. The classification accuracies of the proposed system are 72.4% and 75.5% on the development and leaderboard datasets, respectively.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Combination of Amplitude Modulation Spectrogram Features and MFCCs for Acoustic Scene Classification

Juergen Tchorz
Institute for Acoustics, University of Applied Sciences Luebeck, Luebeck, Germany

Abstract

This report describes an approach for acoustic scene classification and its results for the development data set of the DCASE 2018 challenge. Amplitude modulation spectrograms (AMS), which mimic important aspects of the auditory system, are used as features, in combination with mel-scale cepstral coefficients, which have been shown to be complementary to AMS features. For classification, a long short-term memory deep neural network is used. The proposed system outperforms the baseline system by 6.3-9.3 % for the development data test subset, depending on the recording device.

System characteristics
Sampling rate 44.1kHz
Features amplitude modulation spectrogram, MFCC
Classifier LSTM
PDF

Wavelet-Based Audio Features for Acoustic Scene Classification

Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 for Task 1 (acoustic scene classification (ASC)), sub-task A (basic ASC) and sub-task B (ASC with mismatched recording devices). We use two wavelet-based features in a score-fusion framework to achieve the goal. The first feature applies a wavelet transform to log mel-band energies, while the second applies a high-Q wavelet transform to the frames of the raw signal. The two features are found to be complementary, so that the fused system outperforms the deep-learning based baseline system by a relative 17% for sub-task A and 26% for sub-task B on the development dataset provided for the respective sub-tasks.

System characteristics
Input mono
Sampling rate 48kHz
Features MFDWC, CQCC
Classifier SVM
Decision making fusion
PDF

SE-ResNet with GAN-Based Data Augmentation Applied to Acoustic Scene Classification

Jeong Hyeon Yang, Nam Kyun Kim and Hong Kook Kim
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea

Abstract

This report describes our contribution to the development of audio scene classification methods for the DCASE 2018 Challenge Task 1A. The proposed systems for this task are based on generative adversarial network (GAN)-based data augmentation and various convolutional networks such as residual networks (ResNets) and squeeze-and-excitation residual networks (SE-ResNets). In addition to data augmentation, SE-ResNets are revised so that they operate on the log-mel spectrogram domain, and the numbers of layers and kernels are adjusted to provide better performance on the task. Finally, the ensemble method is applied using a four-fold cross-validated training dataset. Consequently, the proposed audio scene classification system improves class-wise accuracy by 10% compared to the baseline system in the Kaggle competition on acoustic scene classification.

System characteristics
Input mixed
Sampling rate 48kHz
Data augmentation GAN
Features log-mel spectrogram
Classifier CNN, ensemble
Decision making mean probability
PDF

Convolutional Neural Networks and X-Vector Embedding for DCASE2018 Acoustic Scene Classification Challenge

Hossein Zeinali, Lukas Burget and Honza Cernocky
BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic

Abstract

In this report, the BUT team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described. An analysis of the performance of different methods on the development set is also provided. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first one is the common two-dimensional CNN, which is mainly used in image classification tasks. The second one is a one-dimensional CNN for extracting embeddings from the neural network, which is very common in speech processing, especially for speaker recognition. In addition to the topologies, two types of features are used in this task: Mel-spectrograms in the log domain and CQT features, which are explained in detail in the report. Finally, the outputs of the different systems are fused using a weighted average.
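Fusing system outputs with a weighted average can be sketched as follows. The scores and weights are made up for illustration; the abstract does not state how the fusion weights were chosen, so treating them as fixed inputs is an assumption.

```python
def weighted_average(system_scores, weights):
    """Fuse per-class scores from several systems.

    system_scores: one list of class scores per system.
    weights: one non-negative weight per system; they are
    normalised here, so they need not sum to one.
    """
    total = sum(weights)
    n_classes = len(system_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, system_scores)) / total
        for c in range(n_classes)
    ]


# Two hypothetical systems, two classes; the first system is trusted more.
print(weighted_average([[0.7, 0.3], [0.4, 0.6]], [3, 1]))
```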

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation block mixing
Features log-mel energies, CQT
Classifier CNN, x-vector, ensemble
Decision making weighted average
PDF

Acoustic Scene Classification Using Multi-Layered Temporal Pooling Based on Deep Convolutional Neural Network

Liwen Zhang and Jiqing Han
Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China

Abstract

The performance of an Acoustic Scene Classification (ASC) system depends heavily on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layered temporal pooling method using a CNN feature sequence as input, which can effectively capture the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence and its time indexes. We applied our novel framework to DCASE 2018 task 1, ASC. For evaluation, we trained a Support Vector Machine (SVM) with the features learned by the proposed Multi-Layered Temporal Pooling (MLTP). Experimental results on the development dataset show that using the MLTP features significantly improves ASC performance. The best performance, 75.28% accuracy, was achieved with the optimal setting found in our experiments.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN, SVR, SVM
Decision making only one SVM
PDF