Task description
The goal of the acoustic scene classification task was to classify test recordings into one of 15 predefined classes characterizing the environment in which they were recorded, for example park, home, or office. The participants used 4680 10-second audio excerpts (13 h of audio) to train their systems, and 1620 10-second audio excerpts (4 h 30 min of audio) were used for the challenge evaluation.
A more detailed task description can be found on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 1: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.
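The evaluation-set accuracies in the tables below are reported together with a 95% confidence interval over the 1620 evaluation excerpts. A minimal sketch of how such an interval can be computed with a normal approximation of the binomial proportion is shown below; the exact procedure used by the organizers may differ slightly.

```python
import math

def accuracy_with_ci(n_correct, n_total, z=1.96):
    """Classification accuracy with an approximate 95% confidence interval
    (normal approximation of the binomial proportion)."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, p - half_width, p + half_width

# Example: 989 correctly classified excerpts out of the 1620 evaluation excerpts
p, lo, hi = accuracy_with_ci(989, 1620)
print(f"{100 * p:.1f} ({100 * lo:.1f} - {100 * hi:.1f})")   # 61.0 (58.7 - 63.4)
```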
Systems ranking
Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset)
---|---|---|---|---
Abrol_IITM_task1_1 | Baseline | Abrol2017 | 65.7 (63.4 - 68.0) | 88.1 | |
Amiriparian_AU_task1_1 | S2S-AE | Amiriparian2017 | 67.5 (65.3 - 69.8) | 88.0 | |
Amiriparian_AU_task1_2 | Shahin_APTI | Amiriparian2017a | 59.1 (56.7 - 61.5) | 90.1 | |
Biho_Sogang_task1_1 | Biho1 | Kim2017 | 56.5 (54.1 - 59.0) | 75.9 | |
Biho_Sogang_task1_2 | Biho2 | Kim2017 | 60.5 (58.1 - 62.9) | 75.9 | |
Bisot_TPT_task1_1 | TPT1 | Bisot2017 | 69.8 (67.6 - 72.1) | 90.1 | |
Bisot_TPT_task1_2 | TPT2 | Bisot2017 | 69.6 (67.3 - 71.8) | 89.1 | |
Chandrasekhar_IIITH_task1_1 | | Chandrasekhar2017 | 45.9 (43.4 - 48.3) | 77.6 |
Chou_SINICA_task1_1 | TP_CNN_cv1 | Chou2017 | 57.1 (54.7 - 59.5) | ||
Chou_SINICA_task1_2 | SINICA | Chou2017 | 61.5 (59.2 - 63.9) | ||
Chou_SINICA_task1_3 | SINICA | Chou2017 | 59.8 (57.4 - 62.1) | ||
Chou_SINICA_task1_4 | SINICA | Chou2017 | 57.1 (54.7 - 59.5) | ||
Dang_NCU_task1_1 | andang1 | Dang2017 | 62.7 (60.4 - 65.1) | 82.0 | |
Dang_NCU_task1_2 | andang1 | Dang2017 | 62.7 (60.4 - 65.1) | 79.1 | |
Dang_NCU_task1_3 | andang1 | Dang2017 | 63.7 (61.4 - 66.0) | 81.6 | |
Duppada_Seernet_task1_1 | Seernet | Duppada2017 | 57.0 (54.6 - 59.4) | 79.9 | |
Duppada_Seernet_task1_2 | Seernet | Duppada2017 | 59.9 (57.5 - 62.3) | 81.9 | |
Duppada_Seernet_task1_3 | Seernet | Duppada2017 | 64.1 (61.7 - 66.4) | 81.6 | |
Duppada_Seernet_task1_4 | Seernet | Duppada2017 | 63.0 (60.7 - 65.4) | 84.8 | |
Foleiss_UTFPR_task1_1 | MLPFeats | Foleiss2017 | 64.5 (62.2 - 66.8) | 78.0 | |
Foleiss_UTFPR_task1_2 | MLPFeatRF | Foleiss2017 | 66.9 (64.6 - 69.2) | 80.0 | |
Fonseca_MTG_task1_1 | MTG | Fonseca2017 | 67.3 (65.1 - 69.6) | 83.0 | |
Fraile_UPM_task1_1 | GAMMA-UPM | Fraile2017 | 58.3 (55.9 - 60.7) | 79.8 | |
Gong_MTG_task1_1 | MTG_GBMVGG | Gong2017 | 61.2 (58.8 - 63.5) | 86.8 | |
Gong_MTG_task1_2 | MTG_GBM | Gong2017 | 61.5 (59.1 - 63.9) | 86.1 | |
Gong_MTG_task1_3 | MTG_VGG | Gong2017 | 61.9 (59.5 - 64.2) | 84.0 | |
Han_COCAI_task1_1 | 4fEnsemSel | Han2017 | 79.9 (78.0 - 81.9) | 91.9 | |
Han_COCAI_task1_2 | 4fMeanAll | Han2017 | 79.6 (77.7 - 81.6) | 91.7 | |
Han_COCAI_task1_3 | FlEnsemSel | Han2017 | 80.4 (78.4 - 82.3) | 91.9 | |
Han_COCAI_task1_4 | flMeanAll | Han2017 | 80.3 (78.4 - 82.2) | 91.7 | |
Hasan_BUET_task1_1 | BUETBOSCH1 | Hyder2017 | 74.1 (72.0 - 76.3) | 88.1 | |
Hasan_BUET_task1_2 | BUETBOSCH2 | Hyder2017 | 72.2 (70.0 - 74.3) | 83.3 | |
Hasan_BUET_task1_3 | BUETBOSCH3 | Hyder2017 | 68.6 (66.3 - 70.8) | 89.8 | |
Hasan_BUET_task1_4 | BUETBOSCH4 | Hyder2017 | 72.0 (69.8 - 74.2) | 89.6 | |
DCASE2017 baseline | Baseline | Heittola2017 | 61.0 (58.7 - 63.4) | 74.8 | |
Huang_THU_task1_1 | wjhta | Huang2017 | 65.5 (63.2 - 67.8) | 83.4 | |
Huang_THU_task1_2 | wjhta | Huang2017 | 65.4 (63.1 - 67.7) | 84.4 | |
Hussain_NUCES_task1_1 | | Hussain2017 | 56.7 (54.3 - 59.1) | 90.7 |
Hussain_NUCES_task1_2 | | Hussain2017 | 59.5 (57.1 - 61.9) | 90.4 |
Hussain_NUCES_task1_3 | | Hussain2017 | 59.9 (57.5 - 62.3) | 90.0 |
Hussain_NUCES_task1_4 | | Hussain2017 | 55.4 (52.9 - 57.8) | 88.9 |
Jallet_TUT_task1_1 | CRNN-1 | Jallet2017 | 60.7 (58.4 - 63.1) | 78.9 | |
Jallet_TUT_task1_2 | CRNN-2 | Jallet2017 | 61.2 (58.8 - 63.5) | 80.8 | |
Jimenez_CMU_task1_1 | LapKernel | Jimenez2017 | 59.9 (57.6 - 62.3) | 78.7 | |
Kukanov_UEF_task1_1 | K-CRNN | Kukanov2017 | 71.7 (69.5 - 73.9) | 85.8 | |
Kun_TUM_UAU_UP_task1_1 | Wav_SVMs | Kun2017 | 64.2 (61.9 - 66.5) | 83.2 | |
Kun_TUM_UAU_UP_task1_2 | Wav_GRUs | Kun2017 | 64.0 (61.7 - 66.3) | 82.6 | |
Lehner_JKU_task1_1 | JKU_IVEC | Lehner2017 | 68.7 (66.4 - 71.0) | 84.5 | |
Lehner_JKU_task1_2 | JKU_ALL_av | Lehner2017 | 66.8 (64.5 - 69.1) | 87.7 | |
Lehner_JKU_task1_3 | JKU_CNN | Lehner2017 | 64.8 (62.5 - 67.1) | 89.0 | |
Lehner_JKU_task1_4 | JKU_All_ca | Lehner2017 | 73.8 (71.7 - 76.0) | 91.3 | |
Li_SCUT_task1_1 | LiSCUTt1_1 | Li2017 | 53.7 (51.3 - 56.1) | 91.0 | |
Li_SCUT_task1_2 | LiSCUTt1_2 | Li2017 | 63.6 (61.3 - 66.0) | 83.9 | |
Li_SCUT_task1_3 | LiSCUTt1_3 | Li2017 | 61.7 (59.4 - 64.1) | 83.1 | |
Li_SCUT_task1_4 | LiSCUTt1_4 | Li2017 | 57.8 (55.4 - 60.2) | 87.5 | |
Maka_ZUT_task1_1 | ASAWI | Maka2017 | 47.5 (45.1 - 50.0) | 70.6 | |
Mun_KU_task1_1 | GAN_SKMUN | Mun2017 | 83.3 (81.5 - 85.1) | 87.1 | |
Park_ISPL_task1_1 | ISPL | Park2017 | 72.6 (70.4 - 74.8) | 83.6 | |
Phan_UniLuebeck_task1_1 | CNN | Phan2017 | 59.0 (56.6 - 61.4) | 83.8 | |
Phan_UniLuebeck_task1_2 | ACNN | Phan2017 | 55.9 (53.5 - 58.3) | 82.3 | |
Phan_UniLuebeck_task1_3 | CNN+ | Phan2017 | 58.3 (55.9 - 60.7) | 83.8 | |
Phan_UniLuebeck_task1_4 | ACNN+ | Phan2017 | 58.0 (55.6 - 60.4) | 82.3 | |
Piczak_WUT_task1_1 | amb200 | Piczak2017 | 70.6 (68.4 - 72.8) | 82.3 | |
Piczak_WUT_task1_2 | dishes | Piczak2017 | 69.6 (67.3 - 71.8) | 82.7 | |
Piczak_WUT_task1_3 | amb100 | Piczak2017 | 67.7 (65.4 - 69.9) | 80.2 | |
Piczak_WUT_task1_4 | amb60 | Piczak2017 | 62.0 (59.6 - 64.3) | 79.0 | |
Rakotomamonjy_UROUEN_task1_1 | HBGS CNN | Rakotomamonjy2017 | 61.5 (59.2 - 63.9) | 85.9 | |
Rakotomamonjy_UROUEN_task1_2 | HBGS CNN-4 | Rakotomamonjy2017 | 62.7 (60.3 - 65.0) | 85.3 | |
Rakotomamonjy_UROUEN_task1_3 | HBGS CNN-19 | Rakotomamonjy2017 | 62.8 (60.4 - 65.1) | 84.6 | |
Schindler_AIT_task1_1 | multires | Schindler2017 | 61.7 (59.4 - 64.1) | 87.3 | |
Schindler_AIT_task1_2 | multires-p | Schindler2017 | 61.7 (59.4 - 64.1) | 90.5 | |
Vafeiadis_CERTH_task1_1 | CERTH_1 | Vafeiadis2017 | 61.0 (58.6 - 63.4) | 80.4 | |
Vafeiadis_CERTH_task1_2 | CERTH_2 | Vafeiadis2017 | 49.5 (47.1 - 51.9) | 95.9 | |
Vij_UIET_task1_1 | Vij_UIET_1 | Vij2017 | 61.2 (58.9 - 63.6) | 77.3 | |
Vij_UIET_task1_2 | Vij_UIET_2 | Vij2017 | 57.5 (55.1 - 59.9) | 79.0 | |
Vij_UIET_task1_3 | Vij_UIET_3 | Vij2017 | 59.6 (57.2 - 62.0) | 78.0 | |
Vij_UIET_task1_4 | Vij_UIET_4 | Vij2017 | 65.0 (62.7 - 67.3) | 82.7 | |
Waldekar_IITKGP_task1_1 | IITKGP_ABSP_Fusion | Waldekar2017 | 67.0 (64.7 - 69.3) | 86.3 | |
Waldekar_IITKGP_task1_2 | IITKGP_ABSP_Hierarchical | Waldekar2017 | 64.9 (62.6 - 67.2) | 88.8 | |
Xing_SCNU_task1_1 | DCNN_vote | Weiping2017 | 74.8 (72.6 - 76.9) | 87.6 | |
Xing_SCNU_task1_2 | DCNN_SVM | Weiping2017 | 77.7 (75.7 - 79.7) | 89.9 | |
Xu_NUDT_task1_1 | XuCnnMFCC | Xu2017 | 68.5 (66.2 - 70.7) | 85.3 | |
Xu_NUDT_task1_2 | XuCnnMFCC | Xu2017 | 67.5 (65.3 - 69.8) | 87.4 | |
Xu_PKU_task1_1 | autolog1 | Xu2017a | 65.9 (63.6 - 68.2) | 84.4 | |
Xu_PKU_task1_2 | autolog2 | Xu2017a | 66.7 (64.4 - 69.0) | 84.4 | |
Xu_PKU_task1_3 | autolog3 | Xu2017a | 64.6 (62.3 - 67.0) | 84.4 | |
Yang_WHU_TASK1_1 | MFS | Lu2017 | 61.5 (59.2 - 63.9) | 81.3 | |
Yang_WHU_TASK1_2 | STD | Lu2017 | 65.2 (62.9 - 67.6) | 80.3 | |
Yang_WHU_TASK1_3 | MFS+STD | Lu2017 | 62.8 (60.5 - 65.2) | 82.0 | |
Yang_WHU_TASK1_4 | Pre-training | Lu2017 | 63.6 (61.3 - 66.0) | 82.3 | |
Yu_UOS_task1_1 | UOS_DualIn | Jee-Weon2017 | 67.0 (64.7 - 69.3) | 85.5 | |
Yu_UOS_task1_2 | UOS_BalCos | Jee-Weon2017 | 66.2 (63.9 - 68.5) | 85.1 | |
Yu_UOS_task1_3 | UOS_DatDup | Jee-Weon2017 | 67.3 (65.1 - 69.6) | 95.4 | |
Yu_UOS_task1_4 | UOS_res | Jee-Weon2017 | 70.6 (68.3 - 72.8) | 95.8 | |
Zhao_ADSC_task1_1 | MResNet-34 | Zhao2017 | 70.0 (67.8 - 72.2) | 85.6 | |
Zhao_ADSC_task1_2 | Conv | Zhao2017 | 67.9 (65.6 - 70.2) | 85.4 | |
Zhao_UAU_UP_task1_1 | GRNN | Zhao2017a | 63.8 (61.5 - 66.2) | 83.3 |
Teams ranking
Table including only the best performing system per submitting team.
Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset)
---|---|---|---|---
Abrol_IITM_task1_1 | Baseline | Abrol2017 | 65.7 (63.4 - 68.0) | 88.1 | |
Amiriparian_AU_task1_1 | S2S-AE | Amiriparian2017 | 67.5 (65.3 - 69.8) | 88.0 | |
Amiriparian_AU_task1_2 | Shahin_APTI | Amiriparian2017a | 59.1 (56.7 - 61.5) | 90.1 | |
Biho_Sogang_task1_2 | Biho2 | Kim2017 | 60.5 (58.1 - 62.9) | 75.9 | |
Bisot_TPT_task1_1 | TPT1 | Bisot2017 | 69.8 (67.6 - 72.1) | 90.1 | |
Chandrasekhar_IIITH_task1_1 | | Chandrasekhar2017 | 45.9 (43.4 - 48.3) | 77.6 |
Chou_SINICA_task1_2 | SINICA | Chou2017 | 61.5 (59.2 - 63.9) | ||
Dang_NCU_task1_3 | andang1 | Dang2017 | 63.7 (61.4 - 66.0) | 81.6 | |
Duppada_Seernet_task1_3 | Seernet | Duppada2017 | 64.1 (61.7 - 66.4) | 81.6 | |
Foleiss_UTFPR_task1_2 | MLPFeatRF | Foleiss2017 | 66.9 (64.6 - 69.2) | 80.0 | |
Fonseca_MTG_task1_1 | MTG | Fonseca2017 | 67.3 (65.1 - 69.6) | 83.0 | |
Fraile_UPM_task1_1 | GAMMA-UPM | Fraile2017 | 58.3 (55.9 - 60.7) | 79.8 | |
Gong_MTG_task1_3 | MTG_VGG | Gong2017 | 61.9 (59.5 - 64.2) | 84.0 | |
Han_COCAI_task1_3 | FlEnsemSel | Han2017 | 80.4 (78.4 - 82.3) | 91.9 | |
Hasan_BUET_task1_1 | BUETBOSCH1 | Hyder2017 | 74.1 (72.0 - 76.3) | 88.1 | |
DCASE2017 baseline | Baseline | Heittola2017 | 61.0 (58.7 - 63.4) | 74.8 | |
Huang_THU_task1_1 | wjhta | Huang2017 | 65.5 (63.2 - 67.8) | 83.4 | |
Hussain_NUCES_task1_3 | | Hussain2017 | 59.9 (57.5 - 62.3) | 90.0 |
Jallet_TUT_task1_2 | CRNN-2 | Jallet2017 | 61.2 (58.8 - 63.5) | 80.8 | |
Jimenez_CMU_task1_1 | LapKernel | Jimenez2017 | 59.9 (57.6 - 62.3) | 78.7 | |
Kukanov_UEF_task1_1 | K-CRNN | Kukanov2017 | 71.7 (69.5 - 73.9) | 85.8 | |
Kun_TUM_UAU_UP_task1_1 | Wav_SVMs | Kun2017 | 64.2 (61.9 - 66.5) | 83.2 | |
Lehner_JKU_task1_4 | JKU_All_ca | Lehner2017 | 73.8 (71.7 - 76.0) | 91.3 | |
Li_SCUT_task1_2 | LiSCUTt1_2 | Li2017 | 63.6 (61.3 - 66.0) | 83.9 | |
Maka_ZUT_task1_1 | ASAWI | Maka2017 | 47.5 (45.1 - 50.0) | 70.6 | |
Mun_KU_task1_1 | GAN_SKMUN | Mun2017 | 83.3 (81.5 - 85.1) | 87.1 | |
Park_ISPL_task1_1 | ISPL | Park2017 | 72.6 (70.4 - 74.8) | 83.6 | |
Phan_UniLuebeck_task1_1 | CNN | Phan2017 | 59.0 (56.6 - 61.4) | 83.8 | |
Piczak_WUT_task1_1 | amb200 | Piczak2017 | 70.6 (68.4 - 72.8) | 82.3 | |
Rakotomamonjy_UROUEN_task1_3 | HBGS CNN-19 | Rakotomamonjy2017 | 62.8 (60.4 - 65.1) | 84.6 | |
Schindler_AIT_task1_1 | multires | Schindler2017 | 61.7 (59.4 - 64.1) | 87.3 | |
Vafeiadis_CERTH_task1_1 | CERTH_1 | Vafeiadis2017 | 61.0 (58.6 - 63.4) | 80.4 | |
Vij_UIET_task1_4 | Vij_UIET_4 | Vij2017 | 65.0 (62.7 - 67.3) | 82.7 | |
Waldekar_IITKGP_task1_1 | IITKGP_ABSP_Fusion | Waldekar2017 | 67.0 (64.7 - 69.3) | 86.3 | |
Xing_SCNU_task1_2 | DCNN_SVM | Weiping2017 | 77.7 (75.7 - 79.7) | 89.9 | |
Xu_NUDT_task1_1 | XuCnnMFCC | Xu2017 | 68.5 (66.2 - 70.7) | 85.3 | |
Xu_PKU_task1_2 | autolog2 | Xu2017a | 66.7 (64.4 - 69.0) | 84.4 | |
Yang_WHU_TASK1_2 | STD | Lu2017 | 65.2 (62.9 - 67.6) | 80.3 | |
Yu_UOS_task1_4 | UOS_res | Jee-Weon2017 | 70.6 (68.3 - 72.8) | 95.8 | |
Zhao_ADSC_task1_1 | MResNet-34 | Zhao2017 | 70.0 (67.8 - 72.2) | 85.6 | |
Zhao_UAU_UP_task1_1 | GRNN | Zhao2017a | 63.8 (61.5 - 66.2) | 83.3 |
Class-wise performance
Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Beach | Bus | Cafe / Restaurant | Car | City center | Forest path | Grocery store | Home | Library | Metro station | Office | Park | Residential area | Train | Tram
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Abrol_IITM_task1_1 | Baseline | Abrol2017 | 65.7 | 73.1 | 61.1 | 88.9 | 81.5 | 82.4 | 44.4 | 73.1 | 72.2 | 35.2 | 75.0 | 86.1 | 32.4 | 49.1 | 75.0 | 55.6 | |
Amiriparian_AU_task1_1 | S2S-AE | Amiriparian2017 | 67.5 | 44.4 | 75.0 | 63.0 | 95.4 | 94.4 | 97.2 | 73.1 | 60.2 | 43.5 | 79.6 | 62.0 | 16.7 | 64.8 | 82.4 | 61.1 | |
Amiriparian_AU_task1_2 | Shahin_APTI | Amiriparian2017a | 59.1 | 24.1 | 62.0 | 58.3 | 82.4 | 91.7 | 97.2 | 69.4 | 51.9 | 39.8 | 66.7 | 43.5 | 7.4 | 62.0 | 78.7 | 50.9 | |
Biho_Sogang_task1_1 | Biho1 | Kim2017 | 56.5 | 24.1 | 33.3 | 33.3 | 75.9 | 61.1 | 80.6 | 50.9 | 88.9 | 27.8 | 99.1 | 57.4 | 17.6 | 88.0 | 55.6 | 54.6 | |
Biho_Sogang_task1_2 | Biho2 | Kim2017 | 60.5 | 37.0 | 41.7 | 30.6 | 74.1 | 74.1 | 88.0 | 50.9 | 86.1 | 39.8 | 96.3 | 57.4 | 41.7 | 83.3 | 55.6 | 50.9 | |
Bisot_TPT_task1_1 | TPT1 | Bisot2017 | 69.8 | 5.6 | 81.5 | 51.9 | 80.6 | 76.9 | 86.1 | 75.0 | 88.0 | 45.4 | 99.1 | 85.2 | 26.9 | 80.6 | 95.4 | 69.4 | |
Bisot_TPT_task1_2 | TPT2 | Bisot2017 | 69.6 | 23.1 | 75.9 | 54.6 | 75.9 | 78.7 | 84.3 | 75.0 | 88.9 | 39.8 | 100.0 | 87.0 | 27.8 | 75.9 | 94.4 | 62.0 | |
Chandrasekhar_IIITH_task1_1 | | Chandrasekhar2017 | 45.9 | 6.5 | 47.2 | 21.3 | 88.9 | 96.3 | 69.4 | 42.6 | 92.6 | 61.1 | 68.5 | 0.0 | 0.0 | 3.7 | 73.1 | 16.7 |
Chou_SINICA_task1_1 | TP_CNN_cv1 | Chou2017 | 57.1 | 25.9 | 40.7 | 48.1 | 75.0 | 80.6 | 88.9 | 58.3 | 67.6 | 19.4 | 80.6 | 62.0 | 21.3 | 61.1 | 69.4 | 57.4 | |
Chou_SINICA_task1_2 | SINICA | Chou2017 | 61.5 | 19.4 | 48.1 | 66.7 | 68.5 | 77.8 | 86.1 | 65.7 | 57.4 | 25.0 | 97.2 | 81.5 | 28.7 | 68.5 | 66.7 | 65.7 | |
Chou_SINICA_task1_3 | SINICA | Chou2017 | 59.8 | 32.4 | 50.0 | 49.1 | 74.1 | 88.9 | 88.9 | 62.0 | 59.3 | 36.1 | 92.6 | 57.4 | 20.4 | 50.0 | 69.4 | 65.7 | |
Chou_SINICA_task1_4 | SINICA | Chou2017 | 57.1 | 25.9 | 40.7 | 48.1 | 75.0 | 80.6 | 88.9 | 58.3 | 67.6 | 19.4 | 80.6 | 62.0 | 21.3 | 61.1 | 69.4 | 57.4 | |
Dang_NCU_task1_1 | andang1 | Dang2017 | 62.7 | 32.4 | 49.1 | 61.1 | 65.7 | 76.9 | 87.0 | 57.4 | 90.7 | 26.9 | 95.4 | 82.4 | 24.1 | 75.0 | 70.4 | 46.3 | |
Dang_NCU_task1_2 | andang1 | Dang2017 | 62.7 | 24.1 | 38.9 | 68.5 | 66.7 | 76.9 | 71.3 | 65.7 | 67.6 | 20.4 | 99.1 | 95.4 | 30.6 | 77.8 | 69.4 | 68.5 | |
Dang_NCU_task1_3 | andang1 | Dang2017 | 63.7 | 28.7 | 49.1 | 61.1 | 71.3 | 69.4 | 88.9 | 59.3 | 83.3 | 34.3 | 100.0 | 84.3 | 25.0 | 83.3 | 72.2 | 45.4 | |
Duppada_Seernet_task1_1 | Seernet | Duppada2017 | 57.0 | 13.0 | 35.2 | 51.9 | 88.0 | 85.2 | 86.1 | 52.8 | 68.5 | 25.0 | 28.7 | 72.2 | 35.2 | 82.4 | 71.3 | 60.2 | |
Duppada_Seernet_task1_2 | Seernet | Duppada2017 | 59.9 | 8.3 | 39.8 | 57.4 | 96.3 | 75.9 | 88.0 | 58.3 | 79.6 | 34.3 | 23.1 | 86.1 | 40.7 | 78.7 | 74.1 | 57.4 | |
Duppada_Seernet_task1_3 | Seernet | Duppada2017 | 64.1 | 10.2 | 49.1 | 45.4 | 77.8 | 89.8 | 85.2 | 54.6 | 81.5 | 38.9 | 97.2 | 94.4 | 25.0 | 80.6 | 75.0 | 56.5 | |
Duppada_Seernet_task1_4 | Seernet | Duppada2017 | 63.0 | 13.9 | 42.6 | 57.4 | 85.2 | 85.2 | 87.0 | 57.4 | 83.3 | 35.2 | 63.9 | 88.9 | 31.5 | 81.5 | 72.2 | 60.2 | |
Foleiss_UTFPR_task1_1 | MLPFeats | Foleiss2017 | 64.5 | 18.5 | 47.2 | 65.7 | 75.0 | 86.1 | 84.3 | 63.9 | 89.8 | 52.8 | 99.1 | 54.6 | 15.7 | 77.8 | 65.7 | 71.3 | |
Foleiss_UTFPR_task1_2 | MLPFeatRF | Foleiss2017 | 66.9 | 13.9 | 49.1 | 68.5 | 75.9 | 87.0 | 91.7 | 69.4 | 99.1 | 50.9 | 99.1 | 63.0 | 18.5 | 78.7 | 69.4 | 69.4 | |
Fonseca_MTG_task1_1 | MTG | Fonseca2017 | 67.3 | 36.1 | 41.7 | 62.0 | 75.9 | 75.0 | 92.6 | 57.4 | 84.3 | 41.7 | 99.1 | 89.8 | 38.9 | 76.9 | 76.9 | 62.0 | |
Fraile_UPM_task1_1 | GAMMA-UPM | Fraile2017 | 58.3 | 61.1 | 46.3 | 47.2 | 76.9 | 88.9 | 65.7 | 48.1 | 95.4 | 35.2 | 63.0 | 24.1 | 29.6 | 63.9 | 75.0 | 53.7 | |
Gong_MTG_task1_1 | MTG_GBMVGG | Gong2017 | 61.2 | 50.0 | 45.4 | 66.7 | 67.6 | 66.7 | 89.8 | 62.0 | 81.5 | 27.8 | 85.2 | 35.2 | 34.3 | 68.5 | 80.6 | 56.5 | |
Gong_MTG_task1_2 | MTG_GBM | Gong2017 | 61.5 | 41.7 | 43.5 | 66.7 | 70.4 | 64.8 | 93.5 | 51.9 | 95.4 | 32.4 | 88.9 | 37.0 | 43.5 | 67.6 | 71.3 | 53.7 | |
Gong_MTG_task1_3 | MTG_VGG | Gong2017 | 61.9 | 64.8 | 46.3 | 66.7 | 71.3 | 68.5 | 84.3 | 71.3 | 76.9 | 24.1 | 55.6 | 84.3 | 22.2 | 57.4 | 76.9 | 57.4 | |
Han_COCAI_task1_1 | 4fEnsemSel | Han2017 | 79.9 | 75.9 | 66.7 | 82.4 | 92.6 | 86.1 | 98.1 | 80.6 | 93.5 | 54.6 | 100.0 | 87.0 | 47.2 | 75.0 | 96.3 | 63.0 | |
Han_COCAI_task1_2 | 4fMeanAll | Han2017 | 79.6 | 75.0 | 65.7 | 82.4 | 92.6 | 86.1 | 98.1 | 78.7 | 92.6 | 55.6 | 100.0 | 85.2 | 49.1 | 75.0 | 96.3 | 62.0 | |
Han_COCAI_task1_3 | FlEnsemSel | Han2017 | 80.4 | 78.7 | 71.3 | 83.3 | 93.5 | 88.9 | 98.1 | 79.6 | 94.4 | 53.7 | 100.0 | 86.1 | 44.4 | 75.9 | 90.7 | 66.7 | |
Han_COCAI_task1_4 | flMeanAll | Han2017 | 80.3 | 77.8 | 73.1 | 82.4 | 92.6 | 90.7 | 98.1 | 76.9 | 93.5 | 52.8 | 100.0 | 84.3 | 48.1 | 76.9 | 90.7 | 66.7 | |
Hasan_BUET_task1_1 | BUETBOSCH1 | Hyder2017 | 74.1 | 87.0 | 59.3 | 91.7 | 92.6 | 94.4 | 91.7 | 81.5 | 97.2 | 47.2 | 76.9 | 49.1 | 38.0 | 58.3 | 81.5 | 65.7 | |
Hasan_BUET_task1_2 | BUETBOSCH2 | Hyder2017 | 72.2 | 69.4 | 61.1 | 65.7 | 94.4 | 81.5 | 93.5 | 66.7 | 91.7 | 38.9 | 100.0 | 83.3 | 36.1 | 61.1 | 77.8 | 61.1 | |
Hasan_BUET_task1_3 | BUETBOSCH3 | Hyder2017 | 68.6 | 77.8 | 70.4 | 95.4 | 86.1 | 86.1 | 84.3 | 71.3 | 98.1 | 50.0 | 40.7 | 22.2 | 41.7 | 68.5 | 83.3 | 52.8 | |
Hasan_BUET_task1_4 | BUETBOSCH4 | Hyder2017 | 72.0 | 83.3 | 72.2 | 94.4 | 85.2 | 88.0 | 88.0 | 71.3 | 98.1 | 54.6 | 60.2 | 26.9 | 44.4 | 75.0 | 83.3 | 54.6 | |
DCASE2017 baseline | Baseline | Heittola2017 | 61.0 | 40.7 | 38.9 | 43.5 | 64.8 | 79.6 | 85.2 | 49.1 | 76.9 | 30.6 | 93.5 | 73.1 | 32.4 | 77.8 | 72.2 | 57.4 | |
Huang_THU_task1_1 | wjhta | Huang2017 | 65.5 | 22.2 | 50.9 | 57.4 | 60.2 | 77.8 | 96.3 | 65.7 | 90.7 | 46.3 | 99.1 | 77.8 | 21.3 | 75.9 | 73.1 | 67.6 | |
Huang_THU_task1_2 | wjhta | Huang2017 | 65.4 | 30.6 | 48.1 | 63.9 | 65.7 | 76.9 | 95.4 | 63.9 | 91.7 | 37.0 | 99.1 | 77.8 | 10.2 | 75.9 | 79.6 | 64.8 | |
Hussain_NUCES_task1_1 | | Hussain2017 | 56.7 | 25.9 | 27.8 | 49.1 | 42.6 | 73.1 | 88.9 | 57.4 | 88.0 | 4.6 | 100.0 | 66.7 | 29.6 | 83.3 | 51.9 | 61.1 |
Hussain_NUCES_task1_2 | | Hussain2017 | 59.5 | 28.7 | 37.0 | 37.0 | 73.1 | 67.6 | 79.6 | 55.6 | 84.3 | 27.8 | 100.0 | 67.6 | 24.1 | 85.2 | 59.3 | 65.7 |
Hussain_NUCES_task1_3 | | Hussain2017 | 59.9 | 22.2 | 36.1 | 39.8 | 71.3 | 74.1 | 78.7 | 57.4 | 85.2 | 45.4 | 97.2 | 67.6 | 24.1 | 85.2 | 55.6 | 58.3 |
Hussain_NUCES_task1_4 | | Hussain2017 | 55.4 | 38.9 | 21.3 | 59.3 | 40.7 | 69.4 | 92.6 | 54.6 | 75.0 | 14.8 | 80.6 | 67.6 | 20.4 | 81.5 | 53.7 | 60.2 |
Jallet_TUT_task1_1 | CRNN-1 | Jallet2017 | 60.7 | 15.7 | 51.9 | 61.1 | 75.0 | 88.0 | 88.9 | 56.5 | 65.7 | 27.8 | 87.0 | 91.7 | 21.3 | 55.6 | 80.6 | 44.4 | |
Jallet_TUT_task1_2 | CRNN-2 | Jallet2017 | 61.2 | 24.1 | 55.6 | 62.0 | 70.4 | 88.9 | 90.7 | 63.9 | 70.4 | 29.6 | 87.0 | 84.3 | 23.1 | 55.6 | 72.2 | 39.8 | |
Jimenez_CMU_task1_1 | LapKernel | Jimenez2017 | 59.9 | 69.4 | 43.5 | 65.7 | 72.2 | 62.0 | 79.6 | 47.2 | 73.1 | 26.9 | 76.9 | 81.5 | 25.9 | 63.0 | 62.0 | 50.0 | |
Kukanov_UEF_task1_1 | K-CRNN | Kukanov2017 | 71.7 | 43.5 | 47.2 | 77.8 | 79.6 | 85.2 | 99.1 | 73.1 | 76.9 | 35.2 | 100.0 | 95.4 | 46.3 | 74.1 | 83.3 | 59.3 | |
Kun_TUM_UAU_UP_task1_1 | Wav_SVMs | Kun2017 | 64.2 | 61.1 | 44.4 | 72.2 | 68.5 | 76.9 | 83.3 | 48.1 | 64.8 | 28.7 | 92.6 | 90.7 | 39.8 | 56.5 | 75.9 | 59.3 | |
Kun_TUM_UAU_UP_task1_2 | Wav_GRUs | Kun2017 | 64.0 | 50.0 | 49.1 | 67.6 | 67.6 | 89.8 | 88.0 | 62.0 | 81.5 | 24.1 | 88.0 | 65.7 | 36.1 | 58.3 | 73.1 | 59.3 | |
Lehner_JKU_task1_1 | JKU_IVEC | Lehner2017 | 68.7 | 91.7 | 65.7 | 79.6 | 76.9 | 70.4 | 90.7 | 65.7 | 88.0 | 58.3 | 76.9 | 50.9 | 22.2 | 75.9 | 71.3 | 46.3 | |
Lehner_JKU_task1_2 | JKU_ALL_av | Lehner2017 | 66.8 | 57.4 | 64.8 | 73.1 | 80.6 | 91.7 | 88.9 | 79.6 | 77.8 | 35.2 | 64.8 | 71.3 | 36.1 | 38.0 | 83.3 | 59.3 | |
Lehner_JKU_task1_3 | JKU_CNN | Lehner2017 | 64.8 | 47.2 | 59.3 | 73.1 | 78.7 | 88.0 | 87.0 | 75.0 | 74.1 | 31.5 | 63.0 | 69.4 | 48.1 | 37.0 | 83.3 | 57.4 | |
Lehner_JKU_task1_4 | JKU_All_ca | Lehner2017 | 73.8 | 87.0 | 66.7 | 88.9 | 80.6 | 92.6 | 92.6 | 76.9 | 88.9 | 49.1 | 79.6 | 65.7 | 45.4 | 55.6 | 84.3 | 53.7 | |
Li_SCUT_task1_1 | LiSCUTt1_1 | Li2017 | 53.7 | 14.8 | 38.0 | 50.9 | 55.6 | 83.3 | 68.5 | 60.2 | 95.4 | 20.4 | 80.6 | 34.3 | 17.6 | 70.4 | 54.6 | 61.1 | |
Li_SCUT_task1_2 | LiSCUTt1_2 | Li2017 | 63.6 | 55.6 | 45.4 | 55.6 | 53.7 | 87.0 | 81.5 | 75.0 | 99.1 | 26.9 | 97.2 | 62.0 | 11.1 | 79.6 | 56.5 | 68.5 | |
Li_SCUT_task1_3 | LiSCUTt1_3 | Li2017 | 61.7 | 51.9 | 33.3 | 48.1 | 64.8 | 83.3 | 82.4 | 70.4 | 99.1 | 24.1 | 99.1 | 50.0 | 14.8 | 78.7 | 53.7 | 72.2 | |
Li_SCUT_task1_4 | LiSCUTt1_4 | Li2017 | 57.8 | 35.2 | 38.9 | 48.1 | 60.2 | 84.3 | 81.5 | 65.7 | 97.2 | 25.9 | 80.6 | 38.0 | 15.7 | 70.4 | 55.6 | 69.4 | |
Maka_ZUT_task1_1 | ASAWI | Maka2017 | 47.5 | 60.2 | 40.7 | 61.1 | 57.4 | 31.5 | 65.7 | 44.4 | 78.7 | 16.7 | 33.3 | 45.4 | 0.9 | 69.4 | 59.3 | 48.1 | |
Mun_KU_task1_1 | GAN_SKMUN | Mun2017 | 83.3 | 83.3 | 74.1 | 88.0 | 93.5 | 94.4 | 95.4 | 82.4 | 88.0 | 75.9 | 88.0 | 92.6 | 75.9 | 86.1 | 67.6 | 63.9 | |
Park_ISPL_task1_1 | ISPL | Park2017 | 72.6 | 54.6 | 59.3 | 71.3 | 79.6 | 91.7 | 85.2 | 75.0 | 98.1 | 44.4 | 98.1 | 84.3 | 23.1 | 76.9 | 82.4 | 64.8 | |
Phan_UniLuebeck_task1_1 | CNN | Phan2017 | 59.0 | 38.9 | 48.1 | 61.1 | 82.4 | 60.2 | 80.6 | 65.7 | 73.1 | 38.9 | 85.2 | 34.3 | 32.4 | 58.3 | 71.3 | 54.6 | |
Phan_UniLuebeck_task1_2 | ACNN | Phan2017 | 55.9 | 41.7 | 45.4 | 51.9 | 79.6 | 56.5 | 67.6 | 62.0 | 70.4 | 35.2 | 88.9 | 33.3 | 31.5 | 52.8 | 72.2 | 50.0 | |
Phan_UniLuebeck_task1_3 | CNN+ | Phan2017 | 58.3 | 41.7 | 44.4 | 68.5 | 74.1 | 57.4 | 94.4 | 66.7 | 66.7 | 27.8 | 68.5 | 76.9 | 21.3 | 40.7 | 71.3 | 54.6 | |
Phan_UniLuebeck_task1_4 | ACNN+ | Phan2017 | 58.0 | 53.7 | 47.2 | 64.8 | 75.0 | 59.3 | 91.7 | 61.1 | 70.4 | 28.7 | 75.9 | 69.4 | 14.8 | 34.3 | 68.5 | 55.6 | |
Piczak_WUT_task1_1 | amb200 | Piczak2017 | 70.6 | 29.6 | 66.7 | 71.3 | 71.3 | 91.7 | 80.6 | 46.3 | 88.0 | 56.5 | 99.1 | 69.4 | 49.1 | 75.9 | 81.5 | 82.4 | |
Piczak_WUT_task1_2 | dishes | Piczak2017 | 69.6 | 32.4 | 63.9 | 65.7 | 77.8 | 91.7 | 84.3 | 49.1 | 76.9 | 67.6 | 99.1 | 56.5 | 56.5 | 67.6 | 82.4 | 72.2 | |
Piczak_WUT_task1_3 | amb100 | Piczak2017 | 67.7 | 22.2 | 66.7 | 65.7 | 74.1 | 90.7 | 86.1 | 35.2 | 81.5 | 59.3 | 98.1 | 78.7 | 41.7 | 64.8 | 81.5 | 68.5 | |
Piczak_WUT_task1_4 | amb60 | Piczak2017 | 62.0 | 19.4 | 63.9 | 51.9 | 65.7 | 89.8 | 88.9 | 21.3 | 67.6 | 43.5 | 92.6 | 81.5 | 43.5 | 73.1 | 63.9 | 63.0 | |
Rakotomamonjy_UROUEN_task1_1 | HBGS CNN | Rakotomamonjy2017 | 61.5 | 9.3 | 74.1 | 41.7 | 83.3 | 84.3 | 87.0 | 64.8 | 96.3 | 40.7 | 87.0 | 26.9 | 37.0 | 50.9 | 81.5 | 58.3 | |
Rakotomamonjy_UROUEN_task1_2 | HBGS CNN-4 | Rakotomamonjy2017 | 62.7 | 6.5 | 77.8 | 47.2 | 82.4 | 88.9 | 87.0 | 68.5 | 92.6 | 38.0 | 95.4 | 35.2 | 33.3 | 48.1 | 85.2 | 53.7 | |
Rakotomamonjy_UROUEN_task1_3 | HBGS CNN-19 | Rakotomamonjy2017 | 62.8 | 5.6 | 78.7 | 48.1 | 83.3 | 88.9 | 84.3 | 65.7 | 93.5 | 38.9 | 93.5 | 40.7 | 29.6 | 49.1 | 87.0 | 54.6 | |
Schindler_AIT_task1_1 | multires | Schindler2017 | 61.7 | 47.2 | 55.6 | 65.7 | 69.4 | 98.1 | 87.0 | 46.3 | 74.1 | 18.5 | 47.2 | 71.3 | 55.6 | 74.1 | 82.4 | 33.3 | |
Schindler_AIT_task1_2 | multires-p | Schindler2017 | 61.7 | 56.5 | 56.5 | 62.0 | 66.7 | 99.1 | 91.7 | 45.4 | 75.9 | 25.0 | 37.0 | 79.6 | 40.7 | 63.0 | 88.9 | 38.0 | |
Vafeiadis_CERTH_task1_1 | CERTH_1 | Vafeiadis2017 | 61.0 | 23.1 | 42.6 | 58.3 | 66.7 | 77.8 | 86.1 | 64.8 | 94.4 | 39.8 | 92.6 | 54.6 | 20.4 | 72.2 | 81.5 | 39.8 | |
Vafeiadis_CERTH_task1_2 | CERTH_2 | Vafeiadis2017 | 49.5 | 35.2 | 23.1 | 58.3 | 63.0 | 90.7 | 90.7 | 57.4 | 61.1 | 20.4 | 38.0 | 53.7 | 25.9 | 45.4 | 59.3 | 20.4 | |
Vij_UIET_task1_1 | Vij_UIET_1 | Vij2017 | 61.2 | 22.2 | 39.8 | 43.5 | 73.1 | 77.8 | 90.7 | 64.8 | 83.3 | 43.5 | 95.4 | 52.8 | 28.7 | 77.8 | 59.3 | 65.7 | |
Vij_UIET_task1_2 | Vij_UIET_2 | Vij2017 | 57.5 | 21.3 | 32.4 | 36.1 | 64.8 | 73.1 | 79.6 | 50.9 | 71.3 | 35.2 | 99.1 | 66.7 | 30.6 | 83.3 | 54.6 | 63.9 | |
Vij_UIET_task1_3 | Vij_UIET_3 | Vij2017 | 59.6 | 10.2 | 42.6 | 36.1 | 53.7 | 75.0 | 79.6 | 54.6 | 88.0 | 48.1 | 98.1 | 57.4 | 39.8 | 88.0 | 58.3 | 63.9 | |
Vij_UIET_task1_4 | Vij_UIET_4 | Vij2017 | 65.0 | 16.7 | 38.9 | 65.7 | 74.1 | 84.3 | 98.1 | 64.8 | 85.2 | 40.7 | 98.1 | 84.3 | 25.9 | 69.4 | 70.4 | 58.3 | |
Waldekar_IITKGP_task1_1 | IITKGP_ABSP_Fusion | Waldekar2017 | 67.0 | 13.9 | 61.1 | 76.9 | 70.4 | 86.1 | 90.7 | 63.0 | 85.2 | 49.1 | 98.1 | 81.5 | 19.4 | 80.6 | 73.1 | 56.5 | |
Waldekar_IITKGP_task1_2 | IITKGP_ABSP_Hierarchical | Waldekar2017 | 64.9 | 15.7 | 58.3 | 78.7 | 63.9 | 82.4 | 84.3 | 63.0 | 88.0 | 50.0 | 97.2 | 84.3 | 15.7 | 70.4 | 70.4 | 50.9 | |
Xing_SCNU_task1_1 | DCNN_vote | Weiping2017 | 74.8 | 77.8 | 88.0 | 71.3 | 81.5 | 78.7 | 73.1 | 76.9 | 67.6 | 49.1 | 95.4 | 82.4 | 57.4 | 73.1 | 88.0 | 61.1 | |
Xing_SCNU_task1_2 | DCNN_SVM | Weiping2017 | 77.7 | 71.3 | 84.3 | 79.6 | 85.2 | 82.4 | 78.7 | 80.6 | 73.1 | 59.3 | 97.2 | 81.5 | 57.4 | 85.2 | 92.6 | 57.4 | |
Xu_NUDT_task1_1 | XuCnnMFCC | Xu2017 | 68.5 | 27.8 | 43.5 | 70.4 | 84.3 | 88.0 | 96.3 | 66.7 | 91.7 | 40.7 | 100.0 | 85.2 | 13.9 | 82.4 | 72.2 | 63.9 | |
Xu_NUDT_task1_2 | XuCnnMFCC | Xu2017 | 67.5 | 26.9 | 43.5 | 68.5 | 85.2 | 88.0 | 94.4 | 66.7 | 86.1 | 42.6 | 100.0 | 85.2 | 11.1 | 82.4 | 72.2 | 60.2 | |
Xu_PKU_task1_1 | autolog1 | Xu2017a | 65.9 | 29.6 | 42.6 | 58.3 | 80.6 | 79.6 | 98.1 | 67.6 | 51.9 | 53.7 | 100.0 | 90.7 | 32.4 | 70.4 | 75.0 | 58.3 | |
Xu_PKU_task1_2 | autolog2 | Xu2017a | 66.7 | 28.7 | 32.4 | 59.3 | 84.3 | 77.8 | 99.1 | 69.4 | 50.0 | 36.1 | 100.0 | 99.1 | 38.9 | 72.2 | 74.1 | 79.6 | |
Xu_PKU_task1_3 | autolog3 | Xu2017a | 64.6 | 25.0 | 37.0 | 60.2 | 84.3 | 74.1 | 98.1 | 64.8 | 43.5 | 33.3 | 100.0 | 94.4 | 25.0 | 68.5 | 84.3 | 76.9 | |
Yang_WHU_TASK1_1 | MFS | Lu2017 | 61.5 | 10.2 | 55.6 | 52.8 | 76.9 | 79.6 | 94.4 | 50.0 | 79.6 | 30.6 | 94.4 | 55.6 | 33.3 | 68.5 | 75.9 | 65.7 | |
Yang_WHU_TASK1_2 | STD | Lu2017 | 65.2 | 45.4 | 47.2 | 57.4 | 74.1 | 86.1 | 88.0 | 55.6 | 75.0 | 49.1 | 98.1 | 68.5 | 29.6 | 66.7 | 75.0 | 63.0 | |
Yang_WHU_TASK1_3 | MFS+STD | Lu2017 | 62.8 | 53.7 | 42.6 | 54.6 | 78.7 | 88.9 | 88.9 | 61.1 | 75.9 | 47.2 | 90.7 | 48.1 | 15.7 | 61.1 | 71.3 | 63.9 | |
Yang_WHU_TASK1_4 | Pre-training | Lu2017 | 63.6 | 42.6 | 45.4 | 57.4 | 71.3 | 97.2 | 89.8 | 51.9 | 81.5 | 38.0 | 99.1 | 62.0 | 20.4 | 67.6 | 70.4 | 60.2 | |
Yu_UOS_task1_1 | UOS_DualIn | Jee-Weon2017 | 67.0 | 53.7 | 57.4 | 53.7 | 73.1 | 76.9 | 82.4 | 65.7 | 94.4 | 42.6 | 99.1 | 75.0 | 29.6 | 79.6 | 69.4 | 52.8 | |
Yu_UOS_task1_2 | UOS_BalCos | Jee-Weon2017 | 66.2 | 55.6 | 57.4 | 47.2 | 72.2 | 75.9 | 83.3 | 65.7 | 92.6 | 43.5 | 99.1 | 75.0 | 27.8 | 77.8 | 69.4 | 50.0 | |
Yu_UOS_task1_3 | UOS_DatDup | Jee-Weon2017 | 67.3 | 60.2 | 58.3 | 56.5 | 69.4 | 76.9 | 84.3 | 68.5 | 90.7 | 46.3 | 94.4 | 72.2 | 28.7 | 79.6 | 72.2 | 51.9 | |
Yu_UOS_task1_4 | UOS_res | Jee-Weon2017 | 70.6 | 72.2 | 51.9 | 68.5 | 76.9 | 77.8 | 86.1 | 74.1 | 93.5 | 38.9 | 95.4 | 77.8 | 34.3 | 84.3 | 68.5 | 58.3 | |
Zhao_ADSC_task1_1 | MResNet-34 | Zhao2017 | 70.0 | 41.7 | 69.4 | 69.4 | 93.5 | 63.9 | 98.1 | 71.3 | 79.6 | 32.4 | 100.0 | 81.5 | 37.0 | 84.3 | 68.5 | 59.3 | |
Zhao_ADSC_task1_2 | Conv | Zhao2017 | 67.9 | 13.0 | 55.6 | 67.6 | 95.4 | 70.4 | 100.0 | 73.1 | 90.7 | 45.4 | 99.1 | 83.3 | 20.4 | 69.4 | 80.6 | 54.6 | |
Zhao_UAU_UP_task1_1 | GRNN | Zhao2017a | 63.8 | 47.2 | 46.3 | 70.4 | 66.7 | 77.8 | 88.9 | 65.7 | 85.2 | 28.7 | 86.1 | 70.4 | 38.0 | 56.5 | 74.1 | 55.6 |
System characteristics
Code | Name | Technical Report | Accuracy (Eval) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making
---|---|---|---|---|---|---|---|---|---
Abrol_IITM_task1_1 | Baseline | Abrol2017 | 65.7 | mono | 44.1kHz | CQT | GMM, Archetypal Analysis, SVM | majority vote on audio segments of a file | ||
Amiriparian_AU_task1_1 | S2S-AE | Amiriparian2017 | 67.5 | mixed | 44.1kHz | log-mel energies | MLP | |||
Amiriparian_AU_task1_2 | Shahin_APTI | Amiriparian2017a | 59.1 | mixed | 44.1kHz | log-mel energies | MLP+SVM | weighted late fusion | ||
Biho_Sogang_task1_1 | Biho1 | Kim2017 | 56.5 | mono | 44.1kHz | log-mel energies | CNN | majority vote | ||
Biho_Sogang_task1_2 | Biho2 | Kim2017 | 60.5 | mono | 44.1kHz | log-mel energies | CNN | majority vote | ||
Bisot_TPT_task1_1 | TPT1 | Bisot2017 | 69.8 | left, right | 44.1kHz | CQT | NMF, MLP | average log-probability | ||
Bisot_TPT_task1_2 | TPT2 | Bisot2017 | 69.6 | left, right | 44.1kHz | CQT | NMF | average log-probability | ||
Chandrasekhar_IIITH_task1_1 | | Chandrasekhar2017 | 45.9 | mono | 44.1kHz | MFCC, inverse mel-frequency cepstral coefficients (IMFCC) | DNN | majority vote | |
Chou_SINICA_task1_1 | TP_CNN_cv1 | Chou2017 | 57.1 | mono | 44.1kHz | spectrogram | CNN | majority vote | ||
Chou_SINICA_task1_2 | SINICA | Chou2017 | 61.5 | mono | 44.1kHz | spectrogram | CNN | majority vote | ||
Chou_SINICA_task1_3 | SINICA | Chou2017 | 59.8 | mono | 44.1kHz | spectrogram | CNN | majority vote | ||
Chou_SINICA_task1_4 | SINICA | Chou2017 | 57.1 | mono | 44.1kHz | spectrogram | ensemble | majority vote | ||
Dang_NCU_task1_1 | andang1 | Dang2017 | 62.7 | mono | 44.1kHz | MFCC | CRNN | majority vote | ||
Dang_NCU_task1_2 | andang1 | Dang2017 | 62.7 | mono | 44.1kHz | log-mel energies | CNN | majority vote | ||
Dang_NCU_task1_3 | andang1 | Dang2017 | 63.7 | mono | 44.1kHz | log-mel energies, MFCC | CNN | majority vote | ||
Duppada_Seernet_task1_1 | Seernet | Duppada2017 | 57.0 | mono | 44.1kHz | log-mel spectrogram | CNN | mean | ||
Duppada_Seernet_task1_2 | Seernet | Duppada2017 | 59.9 | mono | 16kHz | log-mel spectrogram | CNN | mean | ||
Duppada_Seernet_task1_3 | Seernet | Duppada2017 | 64.1 | mono | 16kHz | log-mel spectrogram | CNN | mean | ||
Duppada_Seernet_task1_4 | Seernet | Duppada2017 | 63.0 | mono | 44.1kHz, 16kHz | log-mel spectrogram | CNN, ensemble | mean | ||
Foleiss_UTFPR_task1_1 | MLPFeats | Foleiss2017 | 64.5 | mono | 44.1kHz | STFT | MLP | probability sum | ||
Foleiss_UTFPR_task1_2 | MLPFeatRF | Foleiss2017 | 66.9 | mono | 44.1kHz | STFT | MLP, random forest | majority vote | ||
Fonseca_MTG_task1_1 | MTG | Fonseca2017 | 67.3 | mono | 44.1kHz | various | ensemble | max of average score | ||
Fraile_UPM_task1_1 | GAMMA-UPM | Fraile2017 | 58.3 | binaural | 44.1kHz | modulation spectrum | MLP | a posteriori probability | |
Gong_MTG_task1_1 | MTG_GBMVGG | Gong2017 | 61.2 | multichannel | 44.1kHz | various | GBM CNN fusion | maximum | ||
Gong_MTG_task1_2 | MTG_GBM | Gong2017 | 61.5 | multichannel | 44.1kHz | various | GBM fusion | maximum | ||
Gong_MTG_task1_3 | MTG_VGG | Gong2017 | 61.9 | multichannel | 44.1kHz | log-mel energies | CNN fusion | maximum | ||
Han_COCAI_task1_1 | 4fEnsemSel | Han2017 | 79.9 | mono, binaural | 44.1kHz | log-mel energies | CNN, ensemble | mean probability | ||
Han_COCAI_task1_2 | 4fMeanAll | Han2017 | 79.6 | mono, binaural | 44.1kHz | log-mel energies | CNN, ensemble | mean probability | ||
Han_COCAI_task1_3 | FlEnsemSel | Han2017 | 80.4 | mono, binaural | 44.1kHz | log-mel energies | CNN, ensemble | mean probability | ||
Han_COCAI_task1_4 | flMeanAll | Han2017 | 80.3 | mono, binaural | 44.1kHz | log-mel energies | CNN, ensemble | mean probability | ||
Hasan_BUET_task1_1 | BUETBOSCH1 | Hyder2017 | 74.1 | mono | 44.1kHz | MFCC, log-mel energies | GMM-SV, CNN-SV, Multiband CNN-SV | majority vote | ||
Hasan_BUET_task1_2 | BUETBOSCH2 | Hyder2017 | 72.2 | mono | 44.1kHz | log-mel energies | CNN-SV | majority vote | ||
Hasan_BUET_task1_3 | BUETBOSCH3 | Hyder2017 | 68.6 | mono | 44.1kHz | MFCC, log-mel energies | GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN | majority vote | ||
Hasan_BUET_task1_4 | BUETBOSCH4 | Hyder2017 | 72.0 | mono | 44.1kHz | MFCC, log-mel energies, different functionals of various spectral and prosodic features | GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN, DNN | majority vote | ||
DCASE2017 baseline | Baseline | Heittola2017 | 61.0 | mono | 44.1kHz | log-mel energies | MLP | majority vote | ||
Huang_THU_task1_1 | wjhta | Huang2017 | 65.5 | mono | 44.1kHz | MFCC, CQT | CNN | majority vote | ||
Huang_THU_task1_2 | wjhta | Huang2017 | 65.4 | mono | 44.1kHz | pitch shifting | MFCC, CQT | CNN | majority vote | |
Hussain_NUCES_task1_1 | | Hussain2017 | 56.7 | binaural | 44.1kHz | log-mel energies | CNN | | |
Hussain_NUCES_task1_2 | | Hussain2017 | 59.5 | binaural | 44.1kHz | log-mel energies | DNN | | |
Hussain_NUCES_task1_3 | | Hussain2017 | 59.9 | binaural | 44.1kHz | log-mel energies | DNN | | |
Hussain_NUCES_task1_4 | | Hussain2017 | 55.4 | binaural | 44.1kHz | log-mel energies | CNN | | |
Jallet_TUT_task1_1 | CRNN-1 | Jallet2017 | 60.7 | mono | 44.1kHz | log-mel energies | CRNN | maximum | ||
Jallet_TUT_task1_2 | CRNN-2 | Jallet2017 | 61.2 | mono | 44.1kHz | log-mel energies | CRNN | majority vote | ||
Jimenez_CMU_task1_1 | LapKernel | Jimenez2017 | 59.9 | mono | 44.1kHz | emo_conf (opensmile) | SVM | highest score | ||
Kukanov_UEF_task1_1 | K-CRNN | Kukanov2017 | 71.7 | mono | 44.1kHz | log-mel energies | CRNN | majority vote | ||
Kun_TUM_UAU_UP_task1_1 | Wav_SVMs | Kun2017 | 64.2 | mono | 44.1kHz | wavelets, ComParE (openSMILE) | SVM | margin sampling value | ||
Kun_TUM_UAU_UP_task1_2 | Wav_GRUs | Kun2017 | 64.0 | mono | 44.1kHz | wavelets, ComParE (openSMILE) | GRNN | margin sampling value | ||
Lehner_JKU_task1_1 | JKU_IVEC | Lehner2017 | 68.7 | binaural | 22.05kHz | pitch shifting | MFCC based i-vectors | i-vector | min. cosine distance | |
Lehner_JKU_task1_2 | JKU_ALL_av | Lehner2017 | 66.8 | mono, binaural | 22.05kHz | pitch shifting | MFCC, log-scaled spectrogram | CNN, i-vector, ensemble | model averaging | |
Lehner_JKU_task1_3 | JKU_CNN | Lehner2017 | 64.8 | mono | 22.05kHz | log-scaled spectrogram | CNN, ensemble | fusion w/ logistic linear regression | ||
Lehner_JKU_task1_4 | JKU_All_ca | Lehner2017 | 73.8 | mono, binaural | 22.05kHz | pitch shifting | mel-scaled spectrograms, i-vectors | i-vector, CNN, ensemble | fusion w/ logistic linear regression | |
Li_SCUT_task1_1 | LiSCUTt1_1 | Li2017 | 53.7 | mono | 44.1kHz | DNN(MFCC) | Bi-LSTM | majority vote | ||
Li_SCUT_task1_2 | LiSCUTt1_2 | Li2017 | 63.6 | mono | 44.1kHz | DNN(MFCC) | Bi-LSTM | majority vote | ||
Li_SCUT_task1_3 | LiSCUTt1_3 | Li2017 | 61.7 | mono | 44.1kHz | DNN(MFCC) | DNN | majority vote | ||
Li_SCUT_task1_4 | LiSCUTt1_4 | Li2017 | 57.8 | mono | 44.1kHz | DNN(MFCC) | Bi-LSTM | majority vote | ||
Maka_ZUT_task1_1 | ASAWI | Maka2017 | 47.5 | binaural | 44.1kHz | cochleagram, onset map, binaural cues, low-level feature contours | random forest | |||
Mun_KU_task1_1 | GAN_SKMUN | Mun2017 | 83.3 | left, right, mixed | 22.05kHz | GAN | log-mel energies, spectrogram | MLP, RNN, CNN, SVM | majority vote | |
Park_ISPL_task1_1 | ISPL | Park2017 | 72.6 | binaural | 44.1kHz | block mixing | covariance of gammachirp energies, double FFT of gammachirp energies | CNN | maximum posterior | |
Phan_UniLuebeck_task1_1 | CNN | Phan2017 | 59.0 | binaural | 44.1kHz | cross-validation with different data splits | generalized label tree embedding | CNN | entire-signal classification | |
Phan_UniLuebeck_task1_2 | ACNN | Phan2017 | 55.9 | binaural | 44.1kHz | cross-validation with different data splits | generalized label tree embedding | Attentive CNN | entire-signal classification | |
Phan_UniLuebeck_task1_3 | CNN+ | Phan2017 | 58.3 | binaural | 44.1kHz | cross-validation with different data splits | generalized label tree embedding | CNN | entire-signal classification | |
Phan_UniLuebeck_task1_4 | ACNN+ | Phan2017 | 58.0 | binaural | 44.1kHz | cross-validation with different data splits | generalized label tree embedding | Attentive CNN | entire-signal classification | |
Piczak_WUT_task1_1 | amb200 | Piczak2017 | 70.6 | mono | 44.1kHz | time delay, block mixing | spectrogram | CNN | majority vote | |
Piczak_WUT_task1_2 | dishes | Piczak2017 | 69.6 | mono | 44.1kHz | time delay, block mixing | spectrogram | CNN | majority vote | |
Piczak_WUT_task1_3 | amb100 | Piczak2017 | 67.7 | mono | 44.1kHz | time delay, block mixing | spectrogram | CNN | majority vote | |
Piczak_WUT_task1_4 | amb60 | Piczak2017 | 62.0 | mono | 44.1kHz | time delay, block mixing | spectrogram | CNN | majority vote | |
Rakotomamonjy_UROUEN_task1_1 | HBGS CNN | Rakotomamonjy2017 | 61.5 | mono | 44.1kHz | CQT | CNN | average prediction | ||
Rakotomamonjy_UROUEN_task1_2 | HBGS CNN-4 | Rakotomamonjy2017 | 62.7 | mono | 44.1kHz | CQT | CNN | average prediction over 4 models | ||
Rakotomamonjy_UROUEN_task1_3 | HBGS CNN-19 | Rakotomamonjy2017 | 62.8 | mono | 44.1kHz | CQT | CNN | average prediction over 19 models | ||
Schindler_AIT_task1_1 | multires | Schindler2017 | 61.7 | mono | 44.1kHz | time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise | log-mel spectrogram | CNN | argmax of average softmax response per file | |
Schindler_AIT_task1_2 | multires-p | Schindler2017 | 61.7 | mono | 44.1kHz | time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise | log-mel spectrogram | CNN | argmax of average softmax response per file | |
Vafeiadis_CERTH_task1_1 | CERTH_1 | Vafeiadis2017 | 61.0 | mono | 44.1kHz | MFCC, MFCC delta, MFCC acceleration, centroid, rolloff, ZCR | SVM-HMM | majority vote | ||
Vafeiadis_CERTH_task1_2 | CERTH_2 | Vafeiadis2017 | 49.5 | mono | 44.1kHz | speed and pitch change (downsampling), amplitude change (dynamic), gaussian noise | log-mel spectrogram | CNN | majority vote | |
Vij_UIET_task1_1 | Vij_UIET_1 | Vij2017 | 61.2 | binaural | 44.1kHz | feature frame concatenation | log mel-filter bank | RNN | majority vote | |
Vij_UIET_task1_2 | Vij_UIET_2 | Vij2017 | 57.5 | binaural | 44.1kHz | feature frame concatenation | log mel-filter bank | LSTM | majority vote | |
Vij_UIET_task1_3 | Vij_UIET_3 | Vij2017 | 59.6 | binaural | 44.1kHz | feature frame concatenation | log mel-filter bank | GRU | majority vote | |
Vij_UIET_task1_4 | Vij_UIET_4 | Vij2017 | 65.0 | binaural | 44.1kHz | feature frame concatenation | log mel-filter bank | CNN | majority vote | |
Waldekar_IITKGP_task1_1 | IITKGP_ABSP_Fusion | Waldekar2017 | 67.0 | binaural | 44.1kHz | combination [block-based MFCC; SCFC; CQCC] | SVM | fusion | ||
Waldekar_IITKGP_task1_2 | IITKGP_ABSP_Hierarchical | Waldekar2017 | 64.9 | binaural | 44.1kHz | combination [block-based MFCC; SCFC; CQCC] | SVM | fusion | ||
Xing_SCNU_task1_1 | DCNN_vote | Weiping2017 | 74.8 | binaural | 22.05kHz | spectrogram, CQT | CNN | majority vote | ||
Xing_SCNU_task1_2 | DCNN_SVM | Weiping2017 | 77.7 | binaural | 22.05kHz | spectrogram, CQT | CNN | SVM | ||
Xu_NUDT_task1_1 | XuCnnMFCC | Xu2017 | 68.5 | left, right, mixed | 44.1kHz | pitch shifting | MFCC, spectrogram | CNN | majority vote | |
Xu_NUDT_task1_2 | XuCnnMFCC | Xu2017 | 67.5 | left, right, mixed | 44.1kHz | pitch shifting | MFCC, spectrogram | CNN | majority vote | |
Xu_PKU_task1_1 | autolog1 | Xu2017a | 65.9 | binaural | 44.1kHz | CQT | Autoencoder and Logistic Regression | majority vote | ||
Xu_PKU_task1_2 | autolog2 | Xu2017a | 66.7 | binaural | 44.1kHz | CQT | Autoencoder and Logistic Regression | majority vote | ||
Xu_PKU_task1_3 | autolog3 | Xu2017a | 64.6 | binaural | 44.1kHz | CQT | Autoencoder and Logistic Regression | majority vote | ||
Yang_WHU_TASK1_1 | MFS | Lu2017 | 61.5 | mono | 44.1kHz | log-mel energies | CNN | logsum | ||
Yang_WHU_TASK1_2 | STD | Lu2017 | 65.2 | mono | 44.1kHz | log-mel energies | CNN | logsum | ||
Yang_WHU_TASK1_3 | MFS+STD | Lu2017 | 62.8 | mono | 44.1kHz | log-mel energies | CNN | logsum | ||
Yang_WHU_TASK1_4 | Pre-training | Lu2017 | 63.6 | mono | 44.1kHz | log-mel energies | CNN | logsum | ||
Yu_UOS_task1_1 | UOS_DualIn | Jee-Weon2017 | 67.0 | left, right, mixed | 44.1kHz | mel-filterbank features | MLP, ensemble | score sum | ||
Yu_UOS_task1_2 | UOS_BalCos | Jee-Weon2017 | 66.2 | left, right, mixed | 44.1kHz | mel-filterbank features | MLP, ensemble | score sum | ||
Yu_UOS_task1_3 | UOS_DatDup | Jee-Weon2017 | 67.3 | left, right, mixed | 44.1kHz | stochastic duplication | mel-filterbank features | MLP, ensemble | score sum | |
Yu_UOS_task1_4 | UOS_res | Jee-Weon2017 | 70.6 | left, right, mixed | 44.1kHz | stochastic duplication | mel-filterbank features | MLP, ensemble | score sum | |
Zhao_ADSC_task1_1 | MResNet-34 | Zhao2017 | 70.0 | binaural | 44.1kHz | log-mel spectrogram | CNN | majority vote | ||
Zhao_ADSC_task1_2 | Conv | Zhao2017 | 67.9 | binaural | 44.1kHz | log-mel spectrogram | CNN | majority vote | ||
Zhao_UAU_UP_task1_1 | GRNN | Zhao2017a | 63.8 | mono | 44.1kHz | spectrogram, scalogram, wavelets, ComParE (openSMILE) | GRNN | margin sampling value |
Technical reports
GMM-AA System for Acoustic Scene Classification
Vinayak Abrol, Pulkit Sharma and Anshul Thakur
Multimedia Analytics and Systems Lab, SCEE, Indian Institute of Technology Mandi, Mandi, India
Abrol_IITM_task1_1
GMM-AA System for Acoustic Scene Classification
Abstract
In this submission we propose a Gaussian mixture modelling (GMM) and Archetypal Analysis (AA) based system for the DCASE 2017 acoustic scene classification task. We propose a feature learning approach based on decomposing time-frequency (TF) representations with Archetypal Analysis. In order to process the large number of TF frames and capture the variations efficiently, a class-specific GMM is first built on the frames of the TF representations, followed by AA on the GMM means to build class-specific local dictionaries. Next, the TF representations are projected onto the concatenated AA dictionary to obtain non-negative sparse activations. Finally, the TF frames are reconstructed from the computed activation vectors and used to train an SVM classifier. The proposed method significantly outperforms the baseline system.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | GMM, Archetypal Analysis, SVM |
Decision making | majority vote on audio segments of a file |
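As a rough illustration of the dictionary-projection idea in this submission, the sketch below builds class-wise dictionaries from GMM means fitted on CQT frames, computes non-negative activations with least squares, and trains an SVM on pooled activations. It is a simplified stand-in (the archetypal analysis refinement of the dictionaries and the frame reconstruction step are omitted), with illustrative component counts.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def class_dictionary(frames_per_class, n_components=20):
    """Fit one GMM per scene class on its CQT frames; stack the means as atoms."""
    atoms = []
    for frames in frames_per_class:                   # frames: (n_frames, n_cqt_bins)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        atoms.append(gmm.means_)
    return np.vstack(atoms)                           # (n_classes * n_components, n_cqt_bins)

def excerpt_activation(frames, dictionary):
    """Pool non-negative frame activations on the joint dictionary over one excerpt."""
    D = dictionary.T                                  # (n_cqt_bins, n_atoms)
    H = np.array([nnls(D, f)[0] for f in frames])     # (n_frames, n_atoms)
    return H.mean(axis=0)

# dictionary = class_dictionary(frames_per_class)     # training CQT frames grouped by class
# X = np.array([excerpt_activation(f, dictionary) for f in train_excerpt_frames])
# clf = SVC(kernel="rbf").fit(X, train_labels)
```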
Sequence to Sequence Autoencoders for Unsupervised Representation Learning From Audio
Shahin Amiriparian1,2,3, Michael Freitag1, Nicholas Cummins1,2 and Björn Schuller2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial College London, London, UK
Amiriparian_AU_task1_1
Sequence to Sequence Autoencoders for Unsupervised Representation Learning From Audio
Abstract
This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task that uses a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, which are treated as sequences of time-dependent frequency vectors. Then, from a fully connected layer between the decoder and encoder units, we extract the learnt representations of the spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. An accuracy of 88.0 % is achieved on the official development set of the challenge – a relative improvement of 17.7 % over the challenge baseline.
System characteristics
Input | mixed |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP |
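A minimal PyTorch sketch of the sequence-to-sequence autoencoder idea, with assumed layer sizes (the submitted system's exact architecture and training setup are described in the technical report): a GRU encoder compresses a mel-spectrogram sequence into a fixed-length code, a GRU decoder reconstructs the sequence, and the code later serves as the feature vector for the MLP classifier.

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    """Recurrent autoencoder over mel-spectrogram sequences of shape (time, n_mels)."""

    def __init__(self, n_mels=128, hidden=256, code_dim=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_code = nn.Linear(hidden, code_dim)        # learnt representation
        self.from_code = nn.Linear(code_dim, hidden)
        self.decoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.output = nn.Linear(hidden, n_mels)

    def forward(self, x):                                 # x: (batch, time, n_mels)
        _, h = self.encoder(x)
        code = self.to_code(h[-1])                        # (batch, code_dim)
        h0 = torch.tanh(self.from_code(code)).unsqueeze(0).contiguous()
        shifted = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        recon, _ = self.decoder(shifted, h0)              # teacher forcing on the input
        return self.output(recon), code

# Unsupervised training minimises reconstruction error; the codes then feed an MLP.
model = Seq2SeqAutoencoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.rand(8, 431, 128)                           # dummy batch of mel-spectrograms
recon, code = model(batch)
loss = nn.functional.mse_loss(recon, batch)
loss.backward()
optimiser.step()
```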
The Combined Augsburg / Passau / TUM / ICL System for DCASE 2017
Shahin Amiriparian1,2,3, Nicholas Cummins1,2, Michael Freitag1, Kun Qian1,2,3, Ren Zhao1,2, Vedhas Pandit1,2 and Björn Schuller1,2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial College London, London, UK
Amiriparian_AU_task1_2
The Combined Augsburg / Passau / TUM / ICL System for DCASE 2017
Abstract
This technical report covers the fusion of two approaches towards the Acoustic Scene Classification sub-task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). The first system uses a novel recurrent sequence to sequence autoencoder approach for unsupervised representation learning. The second system is based on the late fusion of support vector machines trained on either wavelet features or an archetypal acoustic feature set. A weighted late-fusion combination of these two systems achieved an accuracy of 90.1 % on the official development set of the challenge – a relative percentage improvement of 20.2 % over the challenge baseline.
System characteristics
Input | mixed |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP+SVM |
Decision making | weighted late fusion |
Nonnegative Feature Learning Methods for Acoustic Scene Classification
Victor Bisot1, Romain Serizel2,3,4, Slim Essid1 and Gaël Richard1
1Image Data and Signal, Telecom ParisTech, Paris, France, 2Université de Lorraine, Loria, Nancy, France, 3Inria, Nancy, France, 4CNRS, LORIA, Nancy, France
Bisot_TPT_task1_1 Bisot_TPT_task1_2
Nonnegative Feature Learning Methods for Acoustic Scene Classification
Abstract
This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit a simple deep neural network architecture to classify, independently, both low-level time-frequency representations and unsupervised nonnegative matrix factorization activation features. Moreover, we also propose a deep neural network architecture that jointly exploits unsupervised nonnegative matrix factorization activation features and low-level time-frequency representations as inputs. Finally, we present a fusion of the proposed systems in order to further improve performance. The resulting systems are our submission for task 1 of the DCASE 2017 challenge.
System characteristics
Input | left, right |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | NMF, MLP; NMF |
Decision making | average log-probability |
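A compact scikit-learn sketch of the unsupervised variant described above (the task-driven and joint-input variants are more involved), with illustrative component counts and layer sizes: an NMF basis is learnt on CQT frames pooled from the training excerpts, per-excerpt activation statistics form the feature vector, and an MLP performs the classification.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.neural_network import MLPClassifier

# cqt_train: list of non-negative CQT magnitude matrices, one per excerpt, (n_bins, n_frames)
def fit_nmf(cqt_train, n_components=128):
    V = np.hstack(cqt_train).T                        # pooled training frames, rows = frames
    return NMF(n_components=n_components, init="nndsvda", max_iter=400).fit(V)

def excerpt_features(nmf, cqt):
    H = nmf.transform(cqt.T)                          # frame-wise activations
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])

# nmf = fit_nmf(cqt_train)
# X_train = np.array([excerpt_features(nmf, c) for c in cqt_train])
# clf = MLPClassifier(hidden_layer_sizes=(512, 512)).fit(X_train, y_train)
```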
Acoustic Scene Classification Using Deep Neural Network
Paseddula Chandrasekhar1 and Suryakanth V. Gangashetty2
1Speech Processing Lab, International Institute of Information Technology, Hyderabad, Hyderabad, India, 2Speech processing Lab, International Institute of Information Technology, Hyderabad, Hyderabad, India
Chandrasekhar_IIITH_task1_1
Acoustic Scene Classification Using Deep Neural Network
Abstract
In this paper, deep neural networks (DNN) are applied to the acoustic scene classification task of the DCASE 2017 challenge. We perform experiments on a dataset consisting of 15 types of acoustic scenes, using the given development and evaluation data of task 1. We propose a DNN architecture for utterance-level classification. Models were evaluated on the given evaluation data of task 1 and on the 4 folds of the development data. In this approach, MFCC and IMFCC feature vectors are used to train DNN models, and their DNN scores were combined to test the system. On the official development data set of the task 1 challenge, an accuracy of 81.28% is achieved.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, inverse mel-frequency cepstral coefficients (IMFCC) |
Classifier | DNN |
Decision making | majority vote |
FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification
Szu-Yu Chou1,2, Jyh-Shing Jang1 and Yi-Hsuan Yang2
1Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan, 2Research Center for IT innovation, Academia Sinica, Taipei, Taiwan
Chou_SINICA_task1_1 Chou_SINICA_task1_2 Chou_SINICA_task1_3 Chou_SINICA_task1_4
FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification
Abstract
In this paper, we describe our contribution to the challenge on detection and classification of acoustic scenes and events (DCASE 2017). We propose FrameCNN, a novel weakly-supervised learning framework that improves the performance of convolutional neural networks (CNN) for acoustic event detection by attending to details of each sound at various temporal levels. Most existing weakly-supervised frameworks replace the fully-connected network with global average pooling after the final convolution layer. Such a method tends to identify only a few discriminative parts, leading to sub-optimal localization and classification accuracy. The key idea of our approach is to consciously classify the sound of each frame given the corresponding label. The idea is general and can be applied to any network for achieving sound event detection and improving the performance of sound event classification. In acoustic scene classification (Task 1), our approach obtained an average accuracy of 99.2% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 74.8%. In the large-scale weakly supervised sound event detection for smart cars (Task 4), we obtained an F-score of 53.8% for sound event audio tagging (subtask A), compared to the baseline of 19.8%, and an F-score of 32.8% for sound event detection (subtask B), compared to the baseline of 11.4%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | CNN; ensemble |
Decision making | majority vote |
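A minimal PyTorch sketch of the frame-wise idea behind FrameCNN, with assumed filter counts: the last convolution emits a class score for every time frame, and averaging those scores over time yields the clip-level prediction, so the same network can both classify a scene and localize sounds in time.

```python
import torch
import torch.nn as nn

class FrameWiseCNN(nn.Module):
    """CNN whose final conv layer produces per-frame class scores;
    the clip-level score is their average over time."""

    def __init__(self, n_mels=128, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),                      # pool frequency, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.frame_scores = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 16, 1))

    def forward(self, x):                              # x: (batch, 1, n_mels, time)
        h = self.features(x)
        frames = self.frame_scores(h).squeeze(2)       # (batch, n_classes, time)
        clip = frames.mean(dim=2)                      # global average pooling over time
        return clip, frames

clip_logits, frame_logits = FrameWiseCNN()(torch.rand(4, 1, 128, 431))
```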
Deep Learning for DCASE2017 Challenge
An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Dang_NCU_task1_1 Dang_NCU_task1_2 Dang_NCU_task1_3
Deep Learning for DCASE2017 Challenge
Abstract
This paper reports our results on all tasks of the DCASE 2017 challenge, namely acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are built on two popular neural network architectures: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC; log-mel energies; log-mel energies, MFCC |
Classifier | CRNN; CNN |
Decision making | majority vote |
Ensemble of Deep Neural Networks for Acoustic Scene Classification
Venkatesh Duppada and Sushant Hiray
Data Science, Seernet Technologies, LLC, Mumbai, India
Duppada_Seernet_task1_1 Duppada_Seernet_task1_2 Duppada_Seernet_task1_3 Duppada_Seernet_task1_4
Ensemble of Deep Neural Networks for Acoustic Scene Classification
Abstract
Deep neural networks (DNNs) have recently achieved great success in a multitude of classification tasks, and ensembles of DNNs have been shown to improve performance further. In this paper, we explore recent state-of-the-art DNNs used for image classification, modify them, and apply them to the task of acoustic scene classification. We conducted a number of experiments on the TUT Acoustic Scenes 2017 dataset to empirically compare these methods. Finally, we show that the ensemble of these DNNs improves the baseline score for DCASE 2017 Task 1 by 10%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz; 16kHz; 44.1kHz, 16kHz |
Features | log-mel spectrogram |
Classifier | CNN; CNN, ensemble |
Decision making | mean |
MLP-Based Feature Learning for Automatic Acoustic Scene Classification
Juliano Foleiss1 and Tiago Tavares2
1Computing Department, Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil, 2School of Electrical and Computer Engineering, University of Campinas, Campinas, Brazil
Foleiss_UTFPR_task1_1 Foleiss_UTFPR_task1_2
MLP-Based Feature Learning for Automatic Acoustic Scene Classification
Abstract
This paper presents an experimental setup for feature learning in the context of Automatic Acoustic Scene Classification. The setup presented in this paper has been successfully used for automatic music genre classification by Sigtia and Dixon (2014). First, an MLP is trained on audio frames calculated from a 2048-sample STFT, with one-hot encoded labels. Then, the activations of each hidden layer of the MLP are stored as learned features for the entire dataset. These features are then used to train random forests in order to increase classification performance. Our results on the DCASE 2017 development dataset reach 80% accuracy across the supplied folds.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | STFT |
Classifier | MLP; MLP, random forest |
Decision making | probability sum; majority vote |
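A scikit-learn sketch of the feature-learning step (illustrative layer and forest sizes, stand-in data shapes): an MLP is trained on STFT magnitude frames, its hidden-layer activations are reused as learnt features, and a random forest is trained on excerpt-level summaries of those activations.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in frame-level training data: STFT magnitude frames and their scene labels
X_frames = np.abs(np.random.randn(2000, 1025)).astype(np.float32)
y_frames = np.random.randint(0, 15, size=2000)

mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
mlp.fit(X_frames, y_frames)

def hidden_activations(mlp, X):
    """First hidden-layer activations of a fitted MLPClassifier (ReLU units)."""
    return np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

def excerpt_vector(mlp, frames):
    """Summarise one excerpt's frame activations as mean and std over time."""
    H = hidden_activations(mlp, frames)
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

# X_clips = np.array([excerpt_vector(mlp, f) for f in frames_per_excerpt])
# rf = RandomForestClassifier(n_estimators=500).fit(X_clips, y_clips)
```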
Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Fonseca_MTG_task1_1
Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
Abstract
This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled mel-spectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | various |
Classifier | ensemble |
Decision making | max of average score |
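The late-fusion step itself is straightforward. Below is a sketch under the assumption that both branches expose per-class probabilities for each excerpt; the equal weighting is illustrative.

```python
import numpy as np

def late_fusion(prob_gbm, prob_cnn, weight=0.5):
    """Weighted average of two (n_excerpts, n_classes) probability matrices,
    followed by picking the highest-scoring class per excerpt."""
    fused = weight * prob_gbm + (1.0 - weight) * prob_cnn
    return fused.argmax(axis=1)

# prob_gbm = gbm.predict_proba(X_handcrafted)     # feature-engineering branch
# prob_cnn = cnn_scores                           # log-mel spectrogram CNN branch
# y_pred = late_fusion(prob_gbm, prob_cnn)
```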
Classification of Acoustic Scenes Based on the Modulation Spectrum
Ruben Fraile, Juana M. Gutierrez-Arriola, Nicolas Saenz-Lechon and Victor J. Osma-Ruiz
Group on Acoustics and Multimedia Applications, Universidad Politecnica de Madrid, Madrid, Spain
Fraile_UPM_task1_1
Classification of Acoustic Scenes Based on the Modulation Spectrum
Abstract
A system for the automatic classification of acoustic scenes is proposed. This system calculates the spectral distribution of energy across auditory-relevant frequency bands and obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. This parametrisation scheme achieves good separation among scene classes, since it gets good classification results with a simple classifier consisting of a multilayer perceptron with only one hidden layer.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | modulation spectrum |
Classifier | MLP |
Decision making | a posteriori probability |
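A rough numpy/librosa sketch of the envelope modulation spectrum (EMS) parametrisation described above, under assumed band counts and window settings (a mel filterbank stands in for the auditory-relevant bands): each band's energy envelope is transformed to a modulation spectrum, and the DCT of its logarithm gives the descriptors.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def ems_features(y, sr=44100, n_bands=24, n_dct=12):
    """Envelope modulation spectrum descriptors for one audio excerpt."""
    # Band energy envelopes over time (mel filterbank as a stand-in for auditory bands)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands,
                                       n_fft=2048, hop_length=512)
    feats = []
    for envelope in S:
        ems = np.abs(np.fft.rfft(envelope - envelope.mean()))   # modulation spectrum
        feats.append(dct(np.log(ems + 1e-10), norm="ortho")[:n_dct])
    return np.concatenate(feats)                                 # (n_bands * n_dct,)

# y, sr = librosa.load("excerpt.wav", sr=44100, mono=True)       # hypothetical file
# x = ems_features(y, sr)
```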
Acoustic Scene Classification by Fusing LightGBM and VGG-Net Multichannel Predictions
Rong Gong, Eduardo Fonseca, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Gong_MTG_task1_1 Gong_MTG_task1_2 Gong_MTG_task1_3
Acoustic Scene Classification by Fusing LightGBM and VGG-Net Multichannel Predictions
Abstract
This report provides a solution for task 1 of the DCASE 2017 challenge. We build two parallel audio scene classification systems -- LightGBM and VGG-net. The prediction scores are obtained on the multichannel version of the TUT Acoustic Scenes 2017 dataset. Finally, we use linear logistic regression to fuse the LightGBM, VGG-net and LightGBM+VGG-net scores, respectively. The evaluation is done on the development set, and three outputs are submitted for the challenge.
System characteristics
Input | multichannel |
Sampling rate | 44.1kHz |
Features | various; log-mel energies |
Classifier | GBM CNN fusion; GBM fusion; CNN fusion |
Decision making | maximum |
Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification
Yoonchang Han1 and Jeongsoo Park1,2
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea
Han_COCAI_task1_1 Han_COCAI_task1_2 Han_COCAI_task1_3 Han_COCAI_task1_4
Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification
Abstract
In this paper, we demonstrate how we applied a convolutional neural network to DCASE 2017 Task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics, such as binaural representations, harmonic-percussive source separation, and background subtraction. We also present a network structure that can simultaneously analyse paired input, which allows the system to benefit from spatial information. The experimental results show that the proposed network structure and preprocessing methods effectively learn acoustic characteristics from the audio recordings, and combining these with an ensemble model further reduces the error rate significantly, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development set using a mean ensemble.
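As a hedged illustration of one possible background subtraction step on a log-mel representation, the sketch below subtracts a per-band median estimate of the stationary background. The authors' actual preprocessing (and their binaural and harmonic-percussive variants) may well differ; this only shows the idea.

```python
import numpy as np
import librosa

def background_subtracted_logmel(y, sr, n_mels=128):
    """Subtract a crude per-band background estimate from a log-mel spectrogram."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                      # (n_mels, n_frames)
    background = np.median(log_mel, axis=1, keepdims=True)  # per-band median over time
    return log_mel - background

y = np.random.randn(44100 * 2)            # any mono signal works; noise as stand-in
print(background_subtracted_logmel(y, 44100).shape)
```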
System characteristics
Input | mono, binaural |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN, ensemble |
Decision making | mean probability |
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
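A minimal sketch of the frame-wise classification plus majority-vote decision scheme described above, using scikit-learn's MLP on toy features; the feature dimensions, network size, and random data are illustrative stand-ins rather than the official baseline implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for frame-level log-mel energy features (the real baseline uses
# 40 mel bands with temporal context; the shapes here are illustrative only).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(3000, 40))          # frames x features
y_train = rng.integers(0, 15, size=3000)       # 15 scene classes

clf = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200)
clf.fit(X_train, y_train)

# Classify each frame of a test clip, then take a majority vote over frames.
test_clip_frames = rng.normal(size=(500, 40))
frame_preds = clf.predict(test_clip_frames)
clip_label = np.bincount(frame_preds, minlength=15).argmax()
print(clip_label)
```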
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP |
Decision making | majority vote |
A Multi-Scale Deep Convolutional Neural Network for Acoustic Scene Classification
Taoan Huang and Jianhao Wang
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Huang_THU_task1_1 Huang_THU_task1_2
A Multi-Scale Deep Convolutional Neural Network for Acoustic Scene Classification
Abstract
Deep neural networks have shown great classification performance in a number of applications. We applied a multi-scale deep convolutional neural network to acoustic scene classification (ASC), which has been submitted to Task 1 of the DCASE 2017 challenge. In this report, we show our model for classifying short sequences of audio, represented by their Mel-Frequency Cepstral Coefficients and Constant-Q values. The system is evaluated on the public dataset provided by the organizers. The best accuracy we obtained on a 4-fold cross-validation setup is 84.4%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | pitch shifting |
Features | MFCC, CQT |
Classifier | CNN |
Decision making | majority vote |
Improved Acoustic Scene Classification with DNN and CNN
Khalid Hussain1, Mazhar Hussain2 and Muhammad Khan1
1Department of Electrical Engineering, National University of Computer and Emerging Sciences, Pakistan, 2Department of Computer Science, National University of Computer and Emerging Sciences, Pakistan
Hussain_NUCES_task1_1 Hussain_NUCES_task1_2 Hussain_NUCES_task1_3 Hussain_NUCES_task1_4
Improved Acoustic Scene Classification with DNN and CNN
Abstract
This paper presents acoustic scene classification (ASC) to differentiate between different acoustic environments, corresponding to DCASE 2017 challenge Task 1. In this contribution we have applied two classification techniques, i.e. the Deep Neural Network (DNN) and the Convolutional Neural Network (CNN). DNNs and CNNs are widely used in speech recognition, computer vision, and natural language processing applications, and have recently achieved great success in audio classification for various applications. We achieved higher accuracy than previous work on the benchmark datasets provided in the DCASE 2016 challenge. We used frame-level randomization of the training dataset and log mel energy features to achieve higher accuracy with the DNN and CNN. It is observed that the DNN achieved 90.41% and 90.03% accuracy and the CNN achieved 90.71% and 88.86% accuracy on randomized data based on 80 and 60 mel energy features, respectively.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN; DNN |
BUET Bosch Consortium (B2C) Acoustic Scene Classification Systems for DCASE 2017
Rakib Hyder1, Shabnam Ghaffarzadegan2, Zhe Feng2 and Taufiq Hasan3
1Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, 2Robert Bosch Research and Technology Center, Palo Alto, CA, USA, 3Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Hasan_BUET_task1_1 Hasan_BUET_task1_2 Hasan_BUET_task1_3 Hasan_BUET_task1_4
BUET Bosch Consortium (B2C) Acoustic Scene Classification Systems for DCASE 2017
Abstract
This technical report describes the systems jointly submitted by Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, and Robert Bosch Research and Technology Center, Palo Alto, CA, USA, for the Acoustic Scene Classification (ASC) task of the DCASE 2017 challenge. Our sub-systems mainly consist of Convolutional Neural Network (CNN) based models trained on Spectrogram Image Features (SIF) using Mel and log-scaled filter-banks. We also use a novel multi-band approach that learns CNN models separately on different frequency bands of a single spectrogram. In a variant of the CNN sub-systems, high-dimensional audio segment level feature vectors are extracted from the flattening layer of a trained CNN model and later classified using a Probabilistic Linear Discriminant Analysis (PLDA) model. This sub-system is termed the CNN-SuperVector (SV) system. We also implemented a GMM SuperVector system with a PLDA classifier and a feed-forward Neural Network (NN) classifier trained on an acoustic feature ensemble. Finally, we utilized linear score fusion to combine the class-wise scores obtained from the different sub-systems.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC, log-mel energies; log-mel energies; MFCC, log-mel energies, different functionals of various spectral and prosodic features |
Classifier | GMM-SV, CNN-SV, Multiband CNN-SV; CNN-SV; GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN; GMM-SV, CNN-SV, Multiband CNN-SV, CNN, Multiband CNN, DNN |
Decision making | majority vote |
Acoustic Scene Classification Using CRNN
Hugo Jallet, Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Jallet_TUT_task1_1 Jallet_TUT_task1_2
Acoustic Scene Classification Using CRNN
Abstract
This paper presents an application of a convolutional recurrent neural network (CRNN) to the task of Acoustic Scene Classification (ASC). This is the first attempt, to the authors' knowledge, to use this kind of network for ASC, even though simple convolutional neural networks (CNNs) have already been applied and proven for this specific task. The submitted methods have been developed for the 2017 edition of the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenge and consequently tested on the datasets provided for the ASC task. In this paper, we use two CRNN-based methods which score overall accuracies of 78.9% and 80.8%.
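To make the CRNN idea concrete, here is a small illustrative PyTorch model in which convolutional layers summarise local spectro-temporal patterns and a GRU models their evolution over time. Layer sizes and input dimensions are arbitrary assumptions, not those of the submitted systems.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CRNN: conv layers extract local patterns from a log-mel
    spectrogram, a GRU models the sequence over time, and a linear layer maps
    to 15 scene classes."""
    def __init__(self, n_mels=40, n_classes=15):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                 # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                          # x: (batch, 1, n_mels, frames)
        h = self.conv(x)                           # (batch, 32, n_mels//4, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)       # (batch, frames, 32 * n_mels//4)
        out, _ = self.gru(h)
        return self.fc(out[:, -1])                 # classify from the last time step

logits = TinyCRNN()(torch.randn(2, 1, 40, 500))    # 2 clips, 40 mel bands, 500 frames
print(logits.shape)                                # torch.Size([2, 15])
```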
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | maximum; majority vote |
DNN-Based Audio Scene Classification for DCASE 2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication
Jung Jee-Weon, Heo Hee-Soo, Yang IL-Ho, Yoon Sung-Hyun, Shim Hye-Jin and Yu Ha-Jin
School of Computer Science, University of Seoul, Seoul, Republic of Korea
Yu_UOS_task1_1 Yu_UOS_task1_2 Yu_UOS_task1_3 Yu_UOS_task1_4
DNN-Based Audio Scene Classification for DCASE 2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication
Abstract
In this study, we explored DNN-based audio scene classification systems with dual input features. Dual input features take advantage of simultaneously utilizing two features with different levels of abstraction as inputs: a frame-level mel-filterbank feature and an utterance-level identity vector. A new fine-tuning cost that addresses the drawback of dual input features was developed, as well as a data duplication method that enables the DNN to clearly discriminate frequently misclassified classes. Combining the proposed methods with the latest DNN techniques, such as residual learning, achieved a fold-wise accuracy of 95.8% on the validation set provided by the Detection and Classification of Acoustic Scenes and Events community.
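The data duplication idea can be sketched as follows: training examples from classes that are frequently confused are duplicated at random so the network sees them more often. The class list and duplication probability below are placeholders, not the authors' settings.

```python
import numpy as np

def stochastic_duplication(X, y, confusable_classes, dup_prob=0.5, rng=None):
    """Randomly duplicate training examples of frequently misclassified classes."""
    rng = rng or np.random.default_rng()
    mask = np.isin(y, confusable_classes) & (rng.random(len(y)) < dup_prob)
    return np.concatenate([X, X[mask]]), np.concatenate([y, y[mask]])

X = np.random.randn(100, 40)                      # toy feature matrix
y = np.random.randint(0, 15, size=100)            # toy labels for 15 classes
X_aug, y_aug = stochastic_duplication(X, y, confusable_classes=[3, 7])
print(len(y), "->", len(y_aug))
```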
System characteristics
Input | left, right, mixed |
Sampling rate | 44.1kHz |
Data augmentation | stochastic duplication |
Features | mel-filterbank features |
Classifier | MLP, ensemble |
Decision making | score sum |
DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features
Abelino Jimenez, Benjamin Elizalde and Bhiksha Raj
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
Jimenez_CMU_task1_1
DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features
Abstract
The recordings from acoustic scenes contain information from multiple sound sources that can be captured by different types of handcrafted features. These features can be classified using kernel machines, such as Support Vector Machines, which can approximate decision boundaries arbitrarily well. However, the complexity of training these methods increases with the dimensionality of the features and the size of the dataset. A solution is to take advantage of shift-invariant kernels to map the input features to a randomized low-dimensional feature space, and then use the resulting random features to approximate non-linear kernels with linear kernel computation. In this work, we compared shift-invariant kernels such as Gaussian, Laplacian, and Cauchy and their corresponding random features. Experiments show that the kernels outperformed the DCASE baseline by an absolute 4%. More importantly, the dimensionality of the random features is more than three times smaller than that of the input features, from 6,553 to 2,048, with minimal loss of performance, and a reduction of more than ten times still outperformed the baseline. Random feature approaches provide a strong alternative for performing acoustic scene classification with small or large numbers of instances. Moreover, they provide other benefits such as privacy preservation.
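A compact sketch of the random feature idea for a Gaussian (shift-invariant) kernel, following the standard random Fourier feature construction: project onto random directions, add random phases, take cosines, and train a linear classifier on the result. The dimensions, gamma value, and toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

def random_fourier_features(X, n_components=2048, gamma=0.01, seed=0):
    """Map inputs to a random feature space whose inner products approximate
    a Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_components))  # random directions
    b = rng.uniform(0, 2 * np.pi, size=n_components)                  # random phases
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

# Toy data standing in for high-dimensional handcrafted features.
X = np.random.randn(200, 6553)
y = np.random.randint(0, 15, size=200)
Z = random_fourier_features(X)            # (200, 2048) random features
clf = LinearSVC().fit(Z, y)               # linear classifier on the mapped data
print(clf.score(Z, y))
```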
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | emo_conf (opensmile) |
Classifier | SVM |
Decision making | highest score |
DCASE 2017 Acoustic Scene Classification Using Convolutional Neural Network in Time Series
Biho Kim
Sogang university, Seoul, Korea
Biho_Sogang_task1_1 Biho_Sogang_task1_2
DCASE 2017 Acoustic Scene Classification Using Convolutional Neural Network in Time Series
Abstract
This technical paper presents our approach for the acoustic scene classification (ASC) task of the DCASE 2017 challenge. We propose a combination of recent deep learning algorithms for classifying audio sequences. We stack dilated causal convolutions, which are efficient for time-series signals without a recurrent structure, and use SELU activation units instead of batch normalization. Based on this, various experiments were evaluated on the ASC development dataset. The results were analyzed from different perspectives, and the best accuracy obtained by our system is 75.9%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | majority vote |
Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection
Ivan Kukanov1,2, Ville Hautamäki1 and Kong Aik Lee2
1School of Computing, University of Eastern Finland, Joensuu, Finland, 2Institute for Infocomm Research, A*Star, Singapore
Kukanov_UEF_task1_1
Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection
Abstract
In this report, we describe the systems submitted to the DCASE 2017 challenge. In particular, we explored a convolutional recurrent neural network (CRNN) for acoustic scene classification (Task 1). For weakly supervised sound event detection (Task 4), we utilized a CRNN with the maximal figure-of-merit embedded into the binary cross-entropy objective function (CRNN-MFoM). On the development data set, the CRNN model achieves an average 14.7% relative accuracy improvement on the classification Task 1, and the CRNN-MFoM improves the F1-score from 10.9% to 33.5% on the detection Task 4, compared to the baseline system.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | majority vote |
Wavelets Revisited for the Classification of Acoustic Scenes
Qian Kun1,2,3, Ren Zhao2,3, Pandit Vedhas2,3, Yang Zijiang2,3, Zhang Zixing3 and Schuller Björn2,3,4
1MISP group, Technische Universität München, Munich, Germany, 2Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 3Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 4Group on Language, Audio and Music, Imperial College London, London, UK
Kun_TUM_UAU_UP_task1_1 Kun_TUM_UAU_UP_task1_2
Wavelets Revisited for the Classification of Acoustic Scenes
Abstract
We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the corresponding subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features perform comparably to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of models trained on wavelets and on typical acoustic features reaches the best averaged 4-fold cross-validation accuracies of 83.2% and 82.6% with SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8%) on the official development set (p<0.001, one-tailed z-test).
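One common reading of the margin-sampling-value fusion used here (and in the related Zhao_UAU_UP submission later in this list) is to trust, per clip, whichever model's top two class probabilities are furthest apart. The sketch below implements that interpretation on toy scores; it should not be taken as the authors' exact rule.

```python
import numpy as np

def margin_sampling_fusion(prob_list):
    """Pick, per clip, the prediction of the model with the largest margin
    between its top two class probabilities.

    prob_list: list of (n_clips, n_classes) probability arrays, one per model.
    """
    probs = np.stack(prob_list)                    # (n_models, n_clips, n_classes)
    top2 = np.sort(probs, axis=-1)[..., -2:]       # two largest values per model/clip
    margins = top2[..., 1] - top2[..., 0]          # (n_models, n_clips)
    best_model = margins.argmax(axis=0)            # most confident model per clip
    return probs[best_model, np.arange(probs.shape[1])].argmax(axis=-1)

rng = np.random.default_rng(2)
svm_probs = rng.dirichlet(np.ones(15), size=4)     # 4 clips, 15 classes, toy scores
grnn_probs = rng.dirichlet(np.ones(15), size=4)
print(margin_sampling_fusion([svm_probs, grnn_probs]))
```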
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | wavelets, ComParE (openSMILE) |
Classifier | SVM; GRNN |
Decision making | margin sampling value |
Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task
Bernhard Lehner, Hamid Eghbal-Zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria
Lehner_JKU_task1_1 Lehner_JKU_task1_2 Lehner_JKU_task1_3 Lehner_JKU_task1_4
Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task
Abstract
This report describes the CP-JKU team's submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2017 challenge, and discusses some observations we made about the data and the classification setup. Our approach is based on the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on raw spectrograms. The data provided for the 2017 ASC task presented some new challenges -- in particular, audio stimuli of very short duration. These will be discussed in detail, and our measures for addressing them will be described. The result of our experiments is a classification system that achieves classification accuracies of around 90% on the provided development data, as estimated via the prescribed four-fold cross-validation scheme (which, we suspect, may be rather optimistic in relation to new data).
System characteristics
Input | binaural; mono, binaural; mono |
Sampling rate | 22.05kHz |
Data augmentation | pitch shifting |
Features | MFCC based i-vectors; MFCC, log-scaled spectrogram; log-scaled spectrogram; mel-scaled spectrograms, i-vectors |
Classifier | i-vector; CNN, i-vector, ensemble; CNN, ensemble; i-vector, CNN, ensemble |
Decision making | min. cosine distance; model averaging; fusion w/ logistic linear regression |
The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification
Yanxiong Li and Xianku Li
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Li_SCUT_task1_1 Li_SCUT_task1_2 Li_SCUT_task1_3 Li_SCUT_task1_4
The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification
Abstract
In this report, we present our work on three tasks of the IEEE AASP challenge on DCASE 2017, i.e. Task 1: Acoustic Scene Classification (ASC), Task 2: detection of rare sound events in artificially created mixtures, and Task 3: sound event detection in real-life recordings. Tasks 2 and 3 belong to the same problem, i.e. Sound Event Detection (SED). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bi-directional Long Short-Term Memory (Bi-LSTM) fed by the DAF is built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for Tasks 1 and 2, and our system for Task 3 performs as well as the baseline in terms of the predominant metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | DNN(MFCC) |
Classifier | Bi-LSTM; DNN |
Decision making | majority vote |
Acoustic Scene Classifications
Lu Lu1,2, Yuzhi Jiang1,3, Huiyu Zhang1, Yuhong Yang1,2, Ruiming Hu1,3, Weiping Tu1 and Weiyi Huang1
1National Engineering Research Center for Multimedia Software, Wuhan University, Hubei, China, 2Collaborative Innovation Center of Geospatial Technology, Wuhan, China, 3The Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Hubei, China
Yang_WHU_TASK1_1 Yang_WHU_TASK1_2 Yang_WHU_TASK1_3 Yang_WHU_TASK1_4
Acoustic Scene Classifications
Abstract
In this paper, we present three approaches to Task 1, Acoustic Scene Classification (ASC): a simple CNN with low time complexity, a novel feature extraction method, and feature fusion. First, we propose a simplified CNN architecture with only two convolutional layers to avoid overfitting; the model strikes a balance between higher accuracy and lower time complexity. Second, we extract identifiable audio features by a data-driven spectrogram down-sampling. Third, we perform feature fusion by combining the data-driven features with the Mel-frequency spectrogram (MFS) as the network input. All three approaches improve classification accuracy compared with the baseline on the development set.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | logsum |
Auditory Scene Classification Based on the Spectro-Temporal Structure Analysis
Tomasz Maka
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland
Maka_ZUT_task1_1
Auditory Scene Classification Based on the Spectro-Temporal Structure Analysis
Abstract
In this report, we present a modular system for acoustic scene classification. Our proposed system contains four modules that compute representations describing spectro-temporal properties of the audio data. Frequency components are extracted from the cochleagram and low-level audio feature contours. An onset map is used to determine the properties of the temporal structure, and binaural cues are additional components in the final feature space. The computed features are formed into a vector and fed to a random forest classifier. The results were submitted to the 2017 IEEE AASP DCASE challenge.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | cochleagram, onset map, binaural cues, low-level feature contours |
Classifier | random forest |
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane
Seongkyu Mun1, Sangwook Park1, David Han2 and Hanseok Ko1
1Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea, 2Office of Naval Research, Arlington, VA, USA
Mun_KU_task1_1
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane
Abstract
Although it is typically expected that using a large amount of labeled training data would lead to improved performance in deep learning, it is generally difficult to obtain such a DataBase (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained by rule to use a relatively small DB, which raises the same issue. To improve Acoustic Scene Classification (ASC) performance without employing an additional DB, this paper proposes a Generative Adversarial Network (GAN) based method for generating an additional training DB. Since it is not clear whether every sample generated by a GAN would have an equal impact on classification performance, this paper proposes to use the Support Vector Machine (SVM) hyperplane for each class as a reference for selecting samples that carry class-discriminative information. Based on cross-validated experiments on the development DB, the use of the generated features could improve ASC performance.
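A plausible sketch of the hyperplane-based selection step: fit an SVM on the real training data and keep only those generated samples that fall on the correct side of the decision boundary for their intended class. The specific criterion, threshold, and toy data are assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_generated_samples(X_real, y_real, X_gen, y_gen, min_margin=0.0):
    """Keep generated samples whose SVM score for their intended class exceeds
    min_margin (assumes labels 0..n_classes-1 all appear in the real data)."""
    svm = LinearSVC().fit(X_real, y_real)
    scores = svm.decision_function(X_gen)                 # (n_gen, n_classes)
    margin_for_label = scores[np.arange(len(y_gen)), y_gen]
    keep = margin_for_label > min_margin
    return X_gen[keep], y_gen[keep]

rng = np.random.default_rng(3)
X_real = rng.normal(size=(300, 60)); y_real = rng.integers(0, 15, 300)
X_gen = rng.normal(size=(100, 60));  y_gen = rng.integers(0, 15, 100)
X_sel, y_sel = select_generated_samples(X_real, y_real, X_gen, y_gen)
print(len(y_sel), "of", len(y_gen), "generated samples kept")
```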
System characteristics
Input | left, right, mixed |
Sampling rate | 22.05kHz |
Data augmentation | GAN |
Features | log-mel energies, spectrogram |
Classifier | MLP, RNN, CNN, SVM |
Decision making | majority vote |
Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features
Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea, 2Department of Visual Information Processing, Korea University, Seoul, Republic of Korea
Park_ISPL_task1_1
Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features
Abstract
This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In the classification of acoustic scenes, identical sounds observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents the energy density for each subband, and a double Fourier transform image, which represents the energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, a Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to experiments performed with the DCASE 2017 challenge development dataset, the proposed method outperformed several baseline methods. Specifically, the class-average accuracy is 83.6%, an improvement of 8.8%, 9.5%, and 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.
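The two image-like features can be sketched roughly as follows, computed here from a mel spectrogram instead of gammachirp energies purely for simplicity: a subband covariance matrix capturing energy co-variation across bands, and the magnitude of a second Fourier transform along time capturing energy modulation per band. Shapes and parameters are illustrative assumptions.

```python
import numpy as np
import librosa

def double_image_features(y, sr, n_mels=64):
    """Covariance image and double-FFT image from a (log) band-energy matrix."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (bands, frames)
    log_S = np.log(S + 1e-10)
    cov_image = np.cov(log_S)                           # (n_mels, n_mels) covariance
    double_fft = np.abs(np.fft.rfft(log_S, axis=1))     # FFT along time per band
    return cov_image, double_fft

y = np.random.randn(44100 * 2)                          # stand-in 2 s signal
cov_img, dfft_img = double_image_features(y, 44100)
print(cov_img.shape, dfft_img.shape)
```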
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Data augmentation | block mixing |
Features | covariance of gammachirp energies, double FFT of gammachirp energies |
Classifier | CNN |
Decision making | maximum posterior |
Attention-Based CNN with Generalized Label Tree Embedding for Audio Scene Classification
Huy Phan, Philipp Koch, Fabrice Katzberg, Marco Maass, Radoslaw Mazur and Alfred Mertins
Institute for Signal Processing, University of Luebeck, Luebeck, Germany
Phan_UniLuebeck_task1_1 Phan_UniLuebeck_task1_2 Phan_UniLuebeck_task1_3 Phan_UniLuebeck_task1_4
Attention-Based CNN with Generalized Label Tree Embedding for Audio Scene Classification
Abstract
This report presents our audio scene classification systems submitted for Task 1 ("acoustic scene classification") of the DCASE 2017 challenge. The systems rely on combinations of a generalized label tree embedding representation, convolutional neural networks (CNNs), and an attention mechanism. Our experimental results on the development data of the challenge show that our proposed systems significantly outperform the challenge's baseline, improving the average classification accuracy from the baseline's 74.8% to 83.8%.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Data augmentation | cross-validation with different data splits |
Features | generalized label tree embedding |
Classifier | CNN; Attentive CNN |
Decision making | entire-signal classification |
The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification
Abstract
This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.
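The frequency-resolution comparison boils down to recomputing the input mel spectrogram with different numbers of mel bands, as in the short sketch below; the specific resolutions and STFT settings are illustrative assumptions, not the values evaluated in the study.

```python
import numpy as np
import librosa

# Compare input "images" at different frequency resolutions by changing the
# number of mel bands; a random signal stands in for a 10-second clip.
y = np.random.randn(44100 * 10)
for n_mels in (40, 80, 160):
    S = librosa.feature.melspectrogram(y=y, sr=44100, n_fft=4096,
                                       hop_length=1024, n_mels=n_mels)
    log_S = librosa.power_to_db(S)
    print(n_mels, "mel bands ->", log_S.shape)   # (n_mels, n_frames)
```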
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | time delay, block mixing |
Features | spectrogram |
Classifier | CNN |
Decision making | majority vote |
Human-Based Greedy Search of CNN Architecture
Alain Rakotomamonjy
LITIS EA4108, Université de Rouen, Saint Etienne du Rouvray, France
Rakotomamonjy_UROUEN_task1_1 Rakotomamonjy_UROUEN_task1_2 Rakotomamonjy_UROUEN_task1_3
Human-Based Greedy Search of CNN Architecture
Abstract
This paper presents the methodology we have followed for our submission to the DCASE 2017 competition on acoustic scene classification (Task 1). The approach is based on convolutional neural networks. There is nothing original about this contribution, as we have just applied a human-based search of the best CNN architecture and hyper-parameters, using a 4-fold cross-validation for selecting the best model. We hope that this approach will not reach the top entry of the challenge and that it will be outperformed by clever and beautiful methods.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | CNN |
Decision making | average prediction; average prediction over 4 models; average prediction over 19 models |
Multi-Temporal Resolution Convolutional Neural Networks for the DCASE Acoustic Scene Classification Task
Alexander Schindler1, Thomas Lidy2 and Andreas Rauber2
1Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria, 2Institute for Software and Interactive Systems, Technical University of Vienna, Vienna, Austria
Schindler_AIT_task1_1 Schindler_AIT_task1_2
Multi-Temporal Resolution Convolutional Neural Networks for the DCASE Acoustic Scene Classification Task
Abstract
In this paper we present our contributions to the DCASE 2017 Challenge on Detection and Classification of Acoustic Scenes and Events. We propose a parallel Convolutional Neural Network architecture for the task of classifying acoustic scenes and urban soundscapes. The proposed Deep Neural Network architecture harnesses information from increasing temporal resolutions of Mel-spectrogram segments and is composed of separate parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene's spectral texture as well as its distribution of acoustic events. The best performing variant of the proposed model scores 90.54% accuracy on the development dataset. This is a 6.81% improvement over the best performing single-resolution model and 15.74% over the DCASE 2017 Acoustic Scene Classification task baseline.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | time stretching, block mixing, pitch shifting, mixing files of same class, gaussian noise |
Features | log-mel spectrogram |
Classifier | CNN |
Decision making | argmax of average softmax response per file |
Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning
Anastasios Vafeiadis1, Dimitrios Kalatzis1, Konstantinos Votis1, Dimitrios Giakoumis1, Dimitrios Tzovaras1, Liming Chen2 and Raouf Hamzaoui2
1Information Technologies Institute, Center for Research & Technology Hellas, Thessaloniki, Greece, 2Faculty of Technology, De Montfort University, Leicester, UK
Vafeiadis_CERTH_task1_1 Vafeiadis_CERTH_task1_2
Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning
Abstract
This report provides our contribution to the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. We investigated two approaches for the acoustic scene classification task. Firstly, we used a combination of features in the time and frequency domains and a hybrid Support Vector Machine - Hidden Markov Model (SVM-HMM) classifier to achieve an average accuracy over 4 folds of 80.9%. Secondly, we used the log-mel spectrogram for feature extraction and a Convolutional Neural Network (CNN) to achieve an average accuracy over 4 folds of 83.7%. Moreover, by exploiting data-augmentation techniques and using the whole segment (as opposed to splitting into sub-sequences) as an input, the accuracy of our CNN system was boosted to 95.9%. Our two approaches outperformed the DCASE baseline method, which uses log-mel band energies for feature extraction and a MultiLayer Perceptron (MLP) to achieve an average accuracy over 4 folds of 74.8%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | speed and pitch change (downsampling), amplitude change (dynamic), gaussian noise |
Features | MFCC, MFCC delta, MFCC acceleration, centroid, rolloff, ZCR; log-mel spectrogram |
Classifier | SVM-HMM; CNN |
Decision making | majority vote |
Performance Evaluation of Deep Learning Architectures for Acoustic Scene Classification
Dinesh Vij and Naveen Aggarwal
Computer Science and Engineering, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Vij_UIET_task1_1 Vij_UIET_task1_2 Vij_UIET_task1_3 Vij_UIET_task1_4
Performance Evaluation of Deep Learning Architectures for Acoustic Scene Classification
Abstract
This paper is a submission to the sub-task Acoustic Scene Classification of the IEEE Audio and Acoustic Signal Processing challenge: Detection and Classification of Acoustic Scenes and Events 2017. The aim of the sub-task is to correctly detect 15 different acoustic scenes, which consist of indoor, outdoor, and vehicle categories. This work is based on log mel-filter bank features and deep learning. In this short paper, the impact of different parameters while applying a basic Deep Neural Network (DNN) architecture is first analyzed. The accuracy gains obtained by the different types of deep learning architectures such as Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) are then reported. It has been observed that the overall best scene classification accuracy was obtained with CNN.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Data augmentation | feature frame concatenation |
Features | log mel-filter bank |
Classifier | RNN; LSTM; GRU; CNN |
Decision making | majority vote |
IIT Kharagpur Submissions for DCASE2017 ASC Task: Audio Features in a Fusion-Based Framework
Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India
Waldekar_IITKGP_task1_1 Waldekar_IITKGP_task1_2
IIT Kharagpur Submissions for DCASE2017 ASC Task: Audio Features in a Fusion-Based Framework
Abstract
This report describes two submissions for Acoustic Scene Classification (ASC) task of the IEEE AASP challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The first system follows an approach based on a score-level fusion of some well-known spectral features of audio processing. The second system uses the first proposed system in a two-stage hierarchical classification framework. On the DCASE 2017 development dataset, the two systems respectively show 18% and 21% better performance relative to that of the MLP-based baseline system.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | combination [block-based MFCC; SCFC; CQCC] |
Classifier | SVM |
Decision making | fusion |
Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion
Zheng Weiping1, Yi Jiantao1, Xing Xiaotao1, Liu Xiangtao2 and Peng Shaohu3
1School of Computer, South China Normal University, Guangzhou, China, 2Shenzhen Chinasfan Information Technology Co., Ltd., Shenzhen, China, 3School of Mechanical and Electrical Engineering, Guangzhou University, Guangzhou, China
Xing_SCNU_task1_1 Xing_SCNU_task1_2
Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion
Abstract
Making sense of the environment through sounds is an important research topic in the machine learning community. In this work, a Deep Convolutional Neural Network (DCNN) model is presented to classify acoustic scenes, along with a multiple-spectrogram fusion method. First, the generation of the raw spectrogram and the CQT spectrogram is introduced separately. Corresponding features can then be extracted by feeding these spectrogram data into the proposed DCNN model. To fuse these multiple spectrogram features, two fusing mechanisms, namely the voting and the SVM methods, are designed. By fusing DCNN features of the raw and CQT spectrograms, the accuracy is significantly improved in our experiments compared with the single-spectrogram schemes. This proves the effectiveness of the proposed multi-spectrogram fusion method.
System characteristics
Input | binaural |
Sampling rate | 22.05kHz |
Features | spectrogram, CQT |
Classifier | CNN |
Decision making | majority vote; SVM |
Fusion Model Based on Convolutional Neural Networks with Two Features for Acoustic Scene Classification
Jinwei Xu, Yang Zhao, Jingfei Jiang, Yong Dou, Zhiqiang Liu and Kai Chen
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China
Xu_NUDT_task1_1 Xu_NUDT_task1_2
Fusion Model Based on Convolutional Neural Networks with Two Features for Acoustic Scene Classification
Abstract
This report describes two submissions for Task 1 (audio scene classification) of the DCASE-2017 challenge by the PDL team. We propose two different approaches for Task 1. First, we propose a new convolutional neural network (CNN) architecture trained on frame-level features such as mel-frequency cepstral coefficients (MFCCs) of the audio data. Second, we propose a late fusion of the proposed CNN trained with two different features, namely MFCCs and spectrograms. We report the performance of our proposed methods on the cross-validation setup for Task 1 of the DCASE-2017 challenge.
System characteristics
Input | left, right, mixed |
Sampling rate | 44.1kHz |
Data augmentation | pitch shifting |
Features | MFCC, spectrogram |
Classifier | CNN |
Decision making | majority vote |
Acoustic Scene Classification Using Autoencoder
Xiaoshuo Xu, Xiaoou Chen and Deshun Yang
Institute of Computer Science and Technology, Peking University, Beijing, China
Xu_PKU_task1_1 Xu_PKU_task1_2 Xu_PKU_task1_3
Acoustic Scene Classification Using Autoencoder
Abstract
This report describes our contribution to the Acoustic Scene Classification (ASC) task of the 2017 IEEE AASP DCASE challenge. We apply an Autoencoder to capture the discriminative information underlying the audio. Then, a Logistic Regression model is employed to recognize different scenes from the compressed representation. In order to boost performance, we train models based on different channels of the original recordings and simply apply a majority voting method to the predictions. Our final system achieves 84.31% on a four-fold cross-validation setting, which outperforms the baseline system by 9.5%.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | Autoencoder and Logistic Regression |
Decision making | majority vote |
ADSC Submission for DCASE 2017: Acoustic Scene Classification Using Deep Residual Convolutional Neural Networks
Shengkui Zhao1, Thi Ngoc Tho Nguyen1, Woon-Seng Gan2 and Douglas L. Jones1
1Illinois at Singapore, Advanced Digital Sciences Center, Singapore, 2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Zhao_ADSC_task1_1 Zhao_ADSC_task1_2
ADSC Submission for DCASE 2017: Acoustic Scene Classification Using Deep Residual Convolutional Neural Networks
Abstract
This report describes our two submissions to the DCASE-2017 challenge for Task 1 (acoustic scene classification). The first submission is motivated by the superior performance of deep residual networks for both image and audio classification. We propose a modified deep residual architecture trained on log-mel spectrogram patches in an end-to-end fashion for acoustic scene classification. We configure the number of layers and kernels of the deep residual nets and find that a modified deep residual net of 34 layers using binaural input features performs well on the DCASE-2017 development dataset. In the second submission, we implement a shallower network that consists of 3 convolutional layers and 2 fully connected layers to benchmark the performance of the residual network. Our two approaches improve the accuracy of the baseline by 10.8% and 10.6%, respectively, on the 4-fold cross-validation. We suggest that the dataset for Task 1 is relatively small for deep networks to outperform shallower ones.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | log-mel spectrogram |
Classifier | CNN |
Decision making | majority vote |
A System for 2017 DCASE Challenge Using Deep Sequential Image and Wavelet Features
Ren Zhao1,2, Qian Kun1,2,3, Pandit Vedhas1,2, Zhang Zixing2, Yang Zijiang1,2 and Schuller Björn1,2,4
1Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 3MISP group, Technische Universität München, Munich, Germany, 4Group on Language, Audio and Music, Imperial College London, London, UK
Zhao_UAU_UP_task1_1
A System for 2017 DCASE Challenge Using Deep Sequential Image and Wavelet Features
Abstract
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning. First, deep representations extracted from the spectrogram and two types of scalograms using Convolutional Neural Networks, the ComParE features, and two types of wavelet features are fed into Gated Recurrent Neural Networks for classification separately. Predictions from the six models are then combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 83.3%, which is an increase of 8.5% over the baseline (p<.001 by one-tailed z-test).
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram, scalogram, wavelets, ComParE (openSMILE) |
Classifier | GRNN |
Decision making | margin sampling value |