Task description
The goal of the acoustic scene classification task was to classify test recordings into one of 15 predefined classes characterizing the environment in which they were recorded, for example "park", "home", or "office".
A more detailed description can be found on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 1: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.
Systems ranking
Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Accuracy (Development dataset)
---|---|---|---|---|---
Aggarwal_task1_1 | Vij2016 | 74.4 | 74.1 | ||
Bae_task1_1 | CLC | Bae2016 | 84.1 | 79.2 | |
Bao_task1_1 | Bao2016 | 83.1 | |||
Battaglino_task1_1 | Battaglino2016 | 80.0 | |||
Bisot_task1_1 | Bisot2016 | 87.7 | 86.2 | ||
DCASE2016 baseline | DCASE2016_baseline | Heittola2016 | 77.2 | 72.5 | |
Duong_task1_1 | Tec_SVM_A | Sena_Mafra2016 | 76.4 | 80.0 | |
Duong_task1_2 | Tec_SVM_V | Sena_Mafra2016 | 80.5 | 78.0 | |
Duong_task1_3 | Tec_MLP | Sena_Mafra2016 | 73.1 | 75.0 | |
Duong_task1_4 | Tec_CNN | Sena_Mafra2016 | 62.8 | 59.0 | |
Eghbal-Zadeh_task1_1 | CPJKU16_BMBI | Eghbal-Zadeh2016 | 86.4 | 80.8 | |
Eghbal-Zadeh_task1_2 | CPJKU16_CBMBI | Eghbal-Zadeh2016 | 88.7 | 83.9 | |
Eghbal-Zadeh_task1_3 | CPJKU16_DCNN | Eghbal-Zadeh2016 | 83.3 | 79.5 | |
Eghbal-Zadeh_task1_4 | CPJKU16_LFCBI | Eghbal-Zadeh2016 | 89.7 | 89.9 | |
Foleiss_task1_1 | JFTT | Foleiss2016 | 76.2 | 71.8 | |
Hertel_task1_1 | All-ConvNet | Hertel2016 | 79.5 | 84.5 | |
Kim_task1_1 | QRK | Yun2016 | 82.1 | 84.0 | |
Ko_task1_1 | KU_ISPL1_2016 | Park2016 | 87.2 | 76.3 | |
Ko_task1_2 | KU_ISPL2_2016 | Mun2016 | 82.3 | 72.7 | |
Kong_task1_1 | QK | Kong2016 | 81.0 | 76.4 | |
Kumar_task1_1 | Gauss | Elizalde2016 | 85.9 | 78.9 | |
Lee_task1_1 | MARGNet_MWFD | Han2016 | 84.6 | 83.1 | |
Lee_task1_2 | MARGNet_ZENS | Kim2016 | 85.4 | 81.6 | |
Liu_task1_1 | liu-re | Liu2016 | 83.8 | ||
Liu_task1_2 | liu-pre | Liu2016 | 83.6 | ||
Lostanlen_task1_1 | LostanlenAnden_2016 | Lostanlen2016 | 80.8 | 79.4 | |
Marchi_task1_1 | Marchi_2016 | Marchi2016 | 86.4 | 81.4 | |
Marques_task1_1 | DRKNN_2016 | Marques2016 | 83.1 | 78.2 | |
Moritz_task1_1 | Moritz2016 | 79.0 | 76.5 | ||
Mulimani_task1_1 | Mulimani2016 | 65.6 | 66.8 | ||
Nogueira_task1_1 | Nogueira2016 | 81.0 | |||
Patiyal_task1_1 | IITMandi_2016 | Patiyal2016 | 78.5 | 97.6 | |
Phan_task1_1 | CNN-LTE | Phan2016 | 83.3 | 81.2 | |
Pugachev_task1_1 | Pugachev2016 | 73.1 | 82.9 | ||
Qu_task1_1 | Dai2016 | 80.5 | |||
Qu_task1_2 | Dai2016 | 84.1 | |||
Qu_task1_3 | Dai2016 | 82.3 | |||
Qu_task1_4 | Dai2016 | 80.5 | |||
Rakotomamonjy_task1_1 | RAK_2016_1 | Rakotomamonjy2016 | 82.1 | 81.2 | |
Rakotomamonjy_task1_2 | RAK_2016_2 | Rakotomamonjy2016 | 79.2 | ||
Santoso_task1_1 | SWW | Santoso2016 | 80.8 | 78.8 | |
Schindler_task1_1 | CQTCNN_1 | Lidy2016 | 81.8 | 80.8 | |
Schindler_task1_2 | CQTCNN_2 | Lidy2016 | 83.3 | ||
Takahashi_task1_1 | UTNII_2016 | Takahashi2016 | 85.6 | 77.5 | |
Valenti_task1_1 | Valenti2016 | 86.2 | 79.0 | ||
Vikaskumar_task1_1 | ABSP_IITKGP_2016 | Vikaskumar2016 | 81.3 | 80.4 | |
Vu_task1_1 | Vu2016 | 80.0 | 82.1 | ||
Xu_task1_1 | HL-DNN-ASC_2016 | Xu2016 | 73.3 | 81.4 | |
Zoehrer_task1_1 | Zoehrer2016 | 73.1 |
Teams ranking
This table includes only the best-performing system from each submitting team.
Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Accuracy (Development dataset)
---|---|---|---|---|---
Aggarwal_task1_1 | Vij2016 | 74.4 | 74.1 | ||
Bae_task1_1 | CLC | Bae2016 | 84.1 | 79.2 | |
Bao_task1_1 | Bao2016 | 83.1 | |||
Battaglino_task1_1 | Battaglino2016 | 80.0 | |||
Bisot_task1_1 | Bisot2016 | 87.7 | 86.2 | ||
DCASE2016 baseline | DCASE2016_baseline | Heittola2016 | 77.2 | 72.5 | |
Duong_task1_2 | Tec_SVM_V | Sena_Mafra2016 | 80.5 | 78.0 | |
Eghbal-Zadeh_task1_4 | CPJKU16_LFCBI | Eghbal-Zadeh2016 | 89.7 | 89.9 | |
Foleiss_task1_1 | JFTT | Foleiss2016 | 76.2 | 71.8 | |
Hertel_task1_1 | All-ConvNet | Hertel2016 | 79.5 | 84.5 | |
Kim_task1_1 | QRK | Yun2016 | 82.1 | 84.0 | |
Ko_task1_1 | KU_ISPL1_2016 | Park2016 | 87.2 | 76.3 | |
Ko_task1_2 | KU_ISPL2_2016 | Mun2016 | 82.3 | 72.7 | |
Kong_task1_1 | QK | Kong2016 | 81.0 | 76.4 | |
Kumar_task1_1 | Gauss | Elizalde2016 | 85.9 | 78.9 | |
Lee_task1_1 | MARGNet_MWFD | Han2016 | 84.6 | 83.1 | |
Lee_task1_2 | MARGNet_ZENS | Kim2016 | 85.4 | 81.6 | |
Liu_task1_1 | liu-re | Liu2016 | 83.8 | ||
Lostanlen_task1_1 | LostanlenAnden_2016 | Lostanlen2016 | 80.8 | 79.4 | |
Marchi_task1_1 | Marchi_2016 | Marchi2016 | 86.4 | 81.4 | |
Marques_task1_1 | DRKNN_2016 | Marques2016 | 83.1 | 78.2 | |
Moritz_task1_1 | Moritz2016 | 79.0 | 76.5 | ||
Mulimani_task1_1 | Mulimani2016 | 65.6 | 66.8 | ||
Nogueira_task1_1 | Nogueira2016 | 81.0 | |||
Patiyal_task1_1 | IITMandi_2016 | Patiyal2016 | 78.5 | 97.6 | |
Phan_task1_1 | CNN-LTE | Phan2016 | 83.3 | 81.2 | |
Pugachev_task1_1 | Pugachev2016 | 73.1 | 82.9 | ||
Qu_task1_2 | Dai2016 | 84.1 | |||
Rakotomamonjy_task1_1 | RAK_2016_1 | Rakotomamonjy2016 | 82.1 | 81.2 | |
Santoso_task1_1 | SWW | Santoso2016 | 80.8 | 78.8 | |
Schindler_task1_2 | CQTCNN_2 | Lidy2016 | 83.3 | ||
Takahashi_task1_1 | UTNII_2016 | Takahashi2016 | 85.6 | 77.5 | |
Valenti_task1_1 | Valenti2016 | 86.2 | 79.0 | ||
Vikaskumar_task1_1 | ABSP_IITKGP_2016 | Vikaskumar2016 | 81.3 | 80.4 | |
Vu_task1_1 | Vu2016 | 80.0 | 82.1 | ||
Xu_task1_1 | HL-DNN-ASC_2016 | Xu2016 | 73.3 | 81.4 | |
Zoehrer_task1_1 | Zoehrer2016 | 73.1 |
Class-wise performance
Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Beach | Bus | Cafe / Restaurant | Car | City center | Forest path | Grocery store | Home | Library | Metro station | Office | Park | Residential area | Train | Tram
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Aggarwal_task1_1 | Vij2016 | 74.4 | 80.8 | 84.6 | 69.2 | 88.5 | 80.8 | 84.6 | 84.6 | 92.3 | 38.5 | 96.2 | 92.3 | 65.4 | 42.3 | 34.6 | 80.8 | ||
Bae_task1_1 | CLC | Bae2016 | 84.1 | 84.6 | 100.0 | 61.5 | 88.5 | 92.3 | 100.0 | 96.2 | 88.5 | 46.2 | 88.5 | 100.0 | 96.2 | 65.4 | 53.8 | 100.0 | |
Bao_task1_1 | Bao2016 | 83.1 | 84.6 | 96.2 | 57.7 | 100.0 | 76.9 | 92.3 | 84.6 | 88.5 | 46.2 | 96.2 | 100.0 | 96.2 | 76.9 | 50.0 | 100.0 | ||
Battaglino_task1_1 | Battaglino2016 | 80.0 | 84.6 | 73.1 | 76.9 | 84.6 | 96.2 | 100.0 | 96.2 | 84.6 | 34.6 | 80.8 | 84.6 | 96.2 | 65.4 | 53.8 | 88.5 | ||
Bisot_task1_1 | Bisot2016 | 87.7 | 88.5 | 100.0 | 76.9 | 100.0 | 100.0 | 88.5 | 88.5 | 96.2 | 50.0 | 100.0 | 96.2 | 80.8 | 76.9 | 73.1 | 100.0 | ||
DCASE2016 baseline | DCASE2016_baseline | Heittola2016 | 77.2 | 84.6 | 88.5 | 69.2 | 96.2 | 80.8 | 65.4 | 88.5 | 92.3 | 26.9 | 100.0 | 96.2 | 53.8 | 88.5 | 30.8 | 96.2 | |
Duong_task1_1 | Tec_SVM_A | Sena_Mafra2016 | 76.4 | 88.5 | 100.0 | 69.2 | 88.5 | 84.6 | 100.0 | 96.2 | 38.5 | 46.2 | 80.8 | 100.0 | 61.5 | 34.6 | 57.7 | 100.0 | |
Duong_task1_2 | Tec_SVM_V | Sena_Mafra2016 | 80.5 | 80.8 | 100.0 | 84.6 | 92.3 | 92.3 | 100.0 | 96.2 | 57.7 | 46.2 | 96.2 | 100.0 | 50.0 | 53.8 | 57.7 | 100.0 | |
Duong_task1_3 | Tec_MLP | Sena_Mafra2016 | 73.1 | 73.1 | 92.3 | 50.0 | 84.6 | 88.5 | 100.0 | 80.8 | 34.6 | 26.9 | 92.3 | 100.0 | 84.6 | 46.2 | 50.0 | 92.3 | |
Duong_task1_4 | Tec_CNN | Sena_Mafra2016 | 62.8 | 80.8 | 88.5 | 53.8 | 80.8 | 69.2 | 96.2 | 76.9 | 50.0 | 15.4 | 46.2 | 92.3 | 42.3 | 34.6 | 19.2 | 96.2 | |
Eghbal-Zadeh_task1_1 | CPJKU16_BMBI | Eghbal-Zadeh2016 | 86.4 | 92.3 | 92.3 | 76.9 | 96.2 | 92.3 | 96.2 | 100.0 | 88.5 | 69.2 | 73.1 | 100.0 | 96.2 | 76.9 | 46.2 | 100.0 | |
Eghbal-Zadeh_task1_2 | CPJKU16_CBMBI | Eghbal-Zadeh2016 | 88.7 | 96.2 | 100.0 | 84.6 | 100.0 | 92.3 | 96.2 | 100.0 | 92.3 | 69.2 | 69.2 | 100.0 | 96.2 | 84.6 | 50.0 | 100.0 | |
Eghbal-Zadeh_task1_3 | CPJKU16_DCNN | Eghbal-Zadeh2016 | 83.3 | 92.3 | 96.2 | 42.3 | 88.5 | 84.6 | 100.0 | 100.0 | 100.0 | 53.8 | 100.0 | 96.2 | 46.2 | 80.8 | 69.2 | 100.0 | |
Eghbal-Zadeh_task1_4 | CPJKU16_LFCBI | Eghbal-Zadeh2016 | 89.7 | 96.2 | 100.0 | 61.5 | 96.2 | 96.2 | 96.2 | 100.0 | 96.2 | 69.2 | 100.0 | 96.2 | 88.5 | 88.5 | 61.5 | 100.0 | |
Foleiss_task1_1 | JFTT | Foleiss2016 | 76.2 | 84.6 | 84.6 | 61.5 | 80.8 | 96.2 | 84.6 | 96.2 | 88.5 | 46.2 | 57.7 | 84.6 | 65.4 | 42.3 | 80.8 | 88.5 | |
Hertel_task1_1 | All-ConvNet | Hertel2016 | 79.5 | 84.6 | 92.3 | 53.8 | 100.0 | 80.8 | 80.8 | 76.9 | 76.9 | 69.2 | 100.0 | 100.0 | 84.6 | 46.2 | 53.8 | 92.3 | |
Kim_task1_1 | QRK | Yun2016 | 82.1 | 76.9 | 100.0 | 76.9 | 100.0 | 84.6 | 100.0 | 88.5 | 100.0 | 0.0 | 92.3 | 96.2 | 76.9 | 69.2 | 69.2 | 100.0 | |
Ko_task1_1 | KU_ISPL1_2016 | Park2016 | 87.2 | 88.5 | 96.2 | 84.6 | 96.2 | 100.0 | 96.2 | 96.2 | 88.5 | 53.8 | 80.8 | 100.0 | 57.7 | 80.8 | 88.5 | 100.0 | |
Ko_task1_2 | KU_ISPL2_2016 | Mun2016 | 82.3 | 92.3 | 84.6 | 65.4 | 92.3 | 100.0 | 84.6 | 96.2 | 92.3 | 53.8 | 65.4 | 84.6 | 92.3 | 84.6 | 53.8 | 92.3 | |
Kong_task1_1 | QK | Kong2016 | 81.0 | 84.6 | 100.0 | 57.7 | 92.3 | 88.5 | 96.2 | 92.3 | 76.9 | 34.6 | 80.8 | 100.0 | 96.2 | 69.2 | 46.2 | 100.0 | |
Kumar_task1_1 | Gauss | Elizalde2016 | 85.9 | 84.6 | 92.3 | 73.1 | 88.5 | 92.3 | 96.2 | 96.2 | 92.3 | 50.0 | 96.2 | 96.2 | 80.8 | 88.5 | 73.1 | 88.5 | |
Lee_task1_1 | MARGNet_MWFD | Han2016 | 84.6 | 84.6 | 96.2 | 61.5 | 100.0 | 88.5 | 96.2 | 92.3 | 96.2 | 42.3 | 84.6 | 96.2 | 84.6 | 76.9 | 69.2 | 100.0 | |
Lee_task1_2 | MARGNet_ZENS | Kim2016 | 85.4 | 84.6 | 92.3 | 61.5 | 100.0 | 96.2 | 100.0 | 96.2 | 96.2 | 46.2 | 84.6 | 100.0 | 92.3 | 69.2 | 61.5 | 100.0 | |
Liu_task1_1 | liu-re | Liu2016 | 83.8 | 84.6 | 96.2 | 69.2 | 84.6 | 92.3 | 96.2 | 88.5 | 92.3 | 46.2 | 92.3 | 96.2 | 88.5 | 76.9 | 53.8 | 100.0 | |
Liu_task1_2 | liu-pre | Liu2016 | 83.6 | 88.5 | 92.3 | 69.2 | 84.6 | 96.2 | 92.3 | 92.3 | 88.5 | 46.2 | 88.5 | 96.2 | 92.3 | 76.9 | 50.0 | 100.0 | |
Lostanlen_task1_1 | LostanlenAnden_2016 | Lostanlen2016 | 80.8 | 80.8 | 92.3 | 50.0 | 96.2 | 84.6 | 96.2 | 84.6 | 80.8 | 65.4 | 96.2 | 100.0 | 65.4 | 69.2 | 53.8 | 96.2 | |
Marchi_task1_1 | Marchi_2016 | Marchi2016 | 86.4 | 88.5 | 92.3 | 80.8 | 100.0 | 96.2 | 100.0 | 100.0 | 76.9 | 50.0 | 96.2 | 100.0 | 92.3 | 84.6 | 42.3 | 96.2 | |
Marques_task1_1 | DRKNN_2016 | Marques2016 | 83.1 | 88.5 | 96.2 | 65.4 | 84.6 | 84.6 | 96.2 | 80.8 | 84.6 | 69.2 | 84.6 | 92.3 | 96.2 | 65.4 | 57.7 | 100.0 | |
Moritz_task1_1 | Moritz2016 | 79.0 | 88.5 | 100.0 | 19.2 | 100.0 | 92.3 | 100.0 | 88.5 | 92.3 | 38.5 | 80.8 | 100.0 | 61.5 | 76.9 | 46.2 | 100.0 | ||
Mulimani_task1_1 | Mulimani2016 | 65.6 | 73.1 | 96.2 | 69.2 | 100.0 | 73.1 | 50.0 | 65.4 | 76.9 | 7.7 | 76.9 | 96.2 | 96.2 | 23.1 | 15.4 | 65.4 | ||
Nogueira_task1_1 | Nogueira2016 | 81.0 | 88.5 | 88.5 | 65.4 | 92.3 | 73.1 | 96.2 | 84.6 | 92.3 | 38.5 | 96.2 | 100.0 | 73.1 | 80.8 | 53.8 | 92.3 | ||
Patiyal_task1_1 | IITMandi_2016 | Patiyal2016 | 78.5 | 84.6 | 96.2 | 61.5 | 92.3 | 92.3 | 92.3 | 80.8 | 92.3 | 34.6 | 96.2 | 96.2 | 92.3 | 69.2 | 11.5 | 84.6 | |
Phan_task1_1 | CNN-LTE | Phan2016 | 83.3 | 84.6 | 96.2 | 53.8 | 100.0 | 100.0 | 96.2 | 84.6 | 88.5 | 46.2 | 84.6 | 100.0 | 88.5 | 84.6 | 46.2 | 96.2 | |
Pugachev_task1_1 | Pugachev2016 | 73.1 | 84.6 | 69.2 | 61.5 | 92.3 | 80.8 | 96.2 | 92.3 | 80.8 | 26.9 | 96.2 | 88.5 | 57.7 | 42.3 | 34.6 | 92.3 | ||
Qu_task1_1 | Dai2016 | 80.5 | 84.6 | 100.0 | 73.1 | 88.5 | 96.2 | 84.6 | 100.0 | 88.5 | 23.1 | 76.9 | 96.2 | 73.1 | 76.9 | 46.2 | 100.0 | ||
Qu_task1_2 | Dai2016 | 84.1 | 88.5 | 100.0 | 80.8 | 92.3 | 96.2 | 84.6 | 100.0 | 88.5 | 42.3 | 76.9 | 96.2 | 76.9 | 80.8 | 57.7 | 100.0 | ||
Qu_task1_3 | Dai2016 | 82.3 | 88.5 | 100.0 | 76.9 | 92.3 | 96.2 | 84.6 | 92.3 | 88.5 | 30.8 | 88.5 | 96.2 | 76.9 | 76.9 | 46.2 | 100.0 | ||
Qu_task1_4 | Dai2016 | 80.5 | 80.8 | 100.0 | 84.6 | 88.5 | 92.3 | 84.6 | 92.3 | 92.3 | 42.3 | 76.9 | 96.2 | 76.9 | 76.9 | 23.1 | 100.0 | ||
Rakotomamonjy_task1_1 | RAK_2016_1 | Rakotomamonjy2016 | 82.1 | 80.8 | 96.2 | 46.2 | 92.3 | 84.6 | 100.0 | 96.2 | 88.5 | 42.3 | 80.8 | 96.2 | 88.5 | 73.1 | 65.4 | 100.0 | |
Rakotomamonjy_task1_2 | RAK_2016_2 | Rakotomamonjy2016 | 79.2 | 92.3 | 92.3 | 69.2 | 84.6 | 80.8 | 96.2 | 84.6 | 88.5 | 38.5 | 96.2 | 100.0 | 73.1 | 57.7 | 34.6 | 100.0 | |
Santoso_task1_1 | SWW | Santoso2016 | 80.8 | 84.6 | 84.6 | 61.5 | 96.2 | 84.6 | 100.0 | 80.8 | 100.0 | 42.3 | 92.3 | 100.0 | 80.8 | 65.4 | 42.3 | 96.2 | |
Schindler_task1_1 | CQTCNN_1 | Lidy2016 | 81.8 | 88.5 | 100.0 | 34.6 | 92.3 | 96.2 | 100.0 | 92.3 | 88.5 | 46.2 | 96.2 | 100.0 | 65.4 | 73.1 | 53.8 | 100.0 | |
Schindler_task1_2 | CQTCNN_2 | Lidy2016 | 83.3 | 88.5 | 100.0 | 34.6 | 92.3 | 96.2 | 100.0 | 92.3 | 92.3 | 46.2 | 96.2 | 100.0 | 65.4 | 76.9 | 69.2 | 100.0 | |
Takahashi_task1_1 | UTNII_2016 | Takahashi2016 | 85.6 | 92.3 | 100.0 | 61.5 | 100.0 | 88.5 | 88.5 | 96.2 | 84.6 | 57.7 | 80.8 | 100.0 | 92.3 | 80.8 | 61.5 | 100.0 | |
Valenti_task1_1 | Valenti2016 | 86.2 | 84.6 | 100.0 | 76.9 | 100.0 | 96.2 | 100.0 | 92.3 | 92.3 | 42.3 | 96.2 | 96.2 | 76.9 | 76.9 | 65.4 | 96.2 | ||
Vikaskumar_task1_1 | ABSP_IITKGP_2016 | Vikaskumar2016 | 81.3 | 84.6 | 92.3 | 61.5 | 100.0 | 84.6 | 84.6 | 80.8 | 88.5 | 65.4 | 92.3 | 69.2 | 80.8 | 73.1 | 73.1 | 88.5 | |
Vu_task1_1 | Vu2016 | 80.0 | 88.5 | 76.9 | 61.5 | 100.0 | 92.3 | 100.0 | 80.8 | 73.1 | 46.2 | 92.3 | 100.0 | 92.3 | 50.0 | 46.2 | 100.0 | ||
Xu_task1_1 | HL-DNN-ASC_2016 | Xu2016 | 73.3 | 84.6 | 96.2 | 23.1 | 96.2 | 84.6 | 100.0 | 84.6 | 69.2 | 23.1 | 57.7 | 100.0 | 73.1 | 69.2 | 38.5 | 100.0 | |
Zoehrer_task1_1 | Zoehrer2016 | 73.1 | 80.8 | 92.3 | 38.5 | 92.3 | 65.4 | 96.2 | 84.6 | 65.4 | 23.1 | 84.6 | 100.0 | 61.5 | 69.2 | 42.3 | 100.0 |
System characteristics
Rank | Code | Name | Technical Report | Accuracy (Eval) | Input | Features | Classifier
---|---|---|---|---|---|---|---
Aggarwal_task1_1 | Vij2016 | 74.4 | binaural | various | SVM | ||
Bae_task1_1 | CLC | Bae2016 | 84.1 | monophonic | spectrogram | CNN-RNN | |
Bao_task1_1 | Bao2016 | 83.1 | monophonic | MFCC+mel energy | fusion | ||
Battaglino_task1_1 | Battaglino2016 | 80.0 | binaural | mel energy | CNN | ||
Bisot_task1_1 | Bisot2016 | 87.7 | monophonic | spectrogram | NMF | ||
DCASE2016 baseline | DCASE2016_baseline | Heittola2016 | 77.2 | monophonic | MFCC | GMM | |
Duong_task1_1 | Tec_SVM_A | Sena_Mafra2016 | 76.4 | monophonic | mel energy | SVM | |
Duong_task1_2 | Tec_SVM_V | Sena_Mafra2016 | 80.5 | monophonic | mel energy | SVM | |
Duong_task1_3 | Tec_MLP | Sena_Mafra2016 | 73.1 | monophonic | mel energy | DNN | |
Duong_task1_4 | Tec_CNN | Sena_Mafra2016 | 62.8 | monophonic | mel energy | DNN | |
Eghbal-Zadeh_task1_1 | CPJKU16_BMBI | Eghbal-Zadeh2016 | 86.4 | binaural | MFCC | I-vector | |
Eghbal-Zadeh_task1_2 | CPJKU16_CBMBI | Eghbal-Zadeh2016 | 88.7 | binaural | MFCC | I-vector | |
Eghbal-Zadeh_task1_3 | CPJKU16_DCNN | Eghbal-Zadeh2016 | 83.3 | monophonic | spectrogram | CNN | |
Eghbal-Zadeh_task1_4 | CPJKU16_LFCBI | Eghbal-Zadeh2016 | 89.7 | mono+binaural | MFCC+spectrograms | fusion | |
Foleiss_task1_1 | JFTT | Foleiss2016 | 76.2 | monophonic | various | SVM | |
Hertel_task1_1 | All-ConvNet | Hertel2016 | 79.5 | left | spectrogram | CNN | |
Kim_task1_1 | QRK | Yun2016 | 82.1 | mono | MFCC | GMM | |
Ko_task1_1 | KU_ISPL1_2016 | Park2016 | 87.2 | mono | various | fusion | |
Ko_task1_2 | KU_ISPL2_2016 | Mun2016 | 82.3 | left+right+mono | various | DNN | |
Kong_task1_1 | QK | Kong2016 | 81.0 | mono | mel energy | DNN | |
Kumar_task1_1 | Gauss | Elizalde2016 | 85.9 | mono | MFCC distribution | SVM | |
Lee_task1_1 | MARGNet_MWFD | Han2016 | 84.6 | mono | mel energy | CNN | |
Lee_task1_2 | MARGNet_ZENS | Kim2016 | 85.4 | mono | unsupervised | CNN ensemble | |
Liu_task1_1 | liu-re | Liu2016 | 83.8 | mono | MFCC+mel energy | fusion | |
Liu_task1_2 | liu-pre | Liu2016 | 83.6 | mono | MFCC+mel energy | fusion | |
Lostanlen_task1_1 | LostanlenAnden_2016 | Lostanlen2016 | 80.8 | mixed | gammatone scattering | SVM | |
Marchi_task1_1 | Marchi_2016 | Marchi2016 | 86.4 | mono | various | fusion | |
Marques_task1_1 | DRKNN_2016 | Marques2016 | 83.1 | mono | MFCC | kNN | |
Moritz_task1_1 | Moritz2016 | 79.0 | left+right+mono | amplitude modulation filter bank | TDNN | ||
Mulimani_task1_1 | Mulimani2016 | 65.6 | mono | MFCC+matching pursuit | GMM | ||
Nogueira_task1_1 | Nogueira2016 | 81.0 | binaural | various | SVM | ||
Patiyal_task1_1 | IITMandi_2016 | Patiyal2016 | 78.5 | mono | MFCC | DNN | |
Phan_task1_1 | CNN-LTE | Phan2016 | 83.3 | mono | label tree embedding | CNN | |
Pugachev_task1_1 | Pugachev2016 | 73.1 | mono | MFCC | DNN | ||
Qu_task1_1 | Dai2016 | 80.5 | mono | various | ensemble | ||
Qu_task1_2 | Dai2016 | 84.1 | mono | various | ensemble | ||
Qu_task1_3 | Dai2016 | 82.3 | mono | various | ensemble | ||
Qu_task1_4 | Dai2016 | 80.5 | mono | various | ensemble | ||
Rakotomamonjy_task1_1 | RAK_2016_1 | Rakotomamonjy2016 | 82.1 | mono | various | SVM | |
Rakotomamonjy_task1_2 | RAK_2016_2 | Rakotomamonjy2016 | 79.2 | mono | various | SVM | |
Santoso_task1_1 | SWW | Santoso2016 | 80.8 | mono | MFCC | CNN | |
Schindler_task1_1 | CQTCNN_1 | Lidy2016 | 81.8 | mono | CQT | CNN | |
Schindler_task1_2 | CQTCNN_2 | Lidy2016 | 83.3 | mono | CQT | CNN | |
Takahashi_task1_1 | UTNII_2016 | Takahashi2016 | 85.6 | mono | MFCC | DNN-GMM | |
Valenti_task1_1 | Valenti2016 | 86.2 | mono | mel energy | CNN | ||
Vikaskumar_task1_1 | ABSP_IITKGP_2016 | Vikaskumar2016 | 81.3 | mono | MFCC | SVM | |
Vu_task1_1 | Vu2016 | 80.0 | mono | MFCC | RNN | ||
Xu_task1_1 | HL-DNN-ASC_2016 | Xu2016 | 73.3 | mono | mel energy | DNN | |
Zoehrer_task1_1 | Zoehrer2016 | 73.1 | mono | spectrogram | GRNN |
Technical reports
Acoustic Scene Classification Using Parallel Combination of LSTM and CNN
Soo Hyun Bae, Inkyu Choi and Nam Soo Kim
Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
Bae_task1_1
Acoustic Scene Classification Using Parallel Combination of LSTM and CNN
Abstract
Deep neural networks (DNNs) have recently achieved great success in various learning tasks, and have also been used for the classification of environmental sounds. While DNNs show their potential in this classification task, they cannot fully utilize temporal information. In this paper, we propose a neural network architecture designed to exploit sequential information. The proposed structure is composed of two separate lower networks and one upper network, which we refer to as the LSTM layers, the CNN layers, and the connected layers, respectively. The LSTM layers extract sequential information from consecutive audio features. The CNN layers learn the spectro-temporal locality from spectrogram images. Finally, the connected layers summarize the outputs of the two networks, taking advantage of the complementary features of the LSTM and CNN by combining them. To compare the proposed method with other neural networks, we conducted a number of experiments on the TUT Acoustic Scenes 2016 dataset, which consists of recordings from various acoustic scenes. Using the proposed combination structure, we achieved higher performance compared to conventional DNN, CNN, and LSTM architectures.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | CNN-RNN |
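As an illustration of the parallel combination described in the abstract, the following is a minimal Keras sketch, not the authors' implementation: a CNN branch over the spectrogram image and an LSTM branch over the same data viewed as a frame sequence, merged by fully connected layers. Input size, layer widths, and excerpt length are assumptions.

```python
from tensorflow.keras import layers, models

# Assumed dimensions: 128 frequency bins, 43 frames per excerpt, 15 scene classes.
n_freq, n_frames, n_classes = 128, 43, 15

spec_in = layers.Input(shape=(n_freq, n_frames, 1), name="spectrogram")

# CNN branch: learns spectro-temporal locality from the spectrogram image.
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(spec_in)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
x = layers.GlobalAveragePooling2D()(x)

# LSTM branch: the same input viewed as a sequence of frame-wise feature vectors.
seq = layers.Permute((2, 1, 3))(spec_in)          # (frames, freq, 1)
seq = layers.Reshape((n_frames, n_freq))(seq)
y = layers.LSTM(64)(seq)

# "Connected" layers combine the two complementary representations.
z = layers.Concatenate()([x, y])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(n_classes, activation="softmax")(z)

model = models.Model(spec_in, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```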
Technical Report of USTC System for Acoustic Scene Classification
Xiao Bao, Tian Gao and Jun Du
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
Bao_task1_1
Technical Report of USTC System for Acoustic Scene Classification
Abstract
This technical report describes our submission for the acoustic scene classification task of DCASE 2016. We first explore the use of Gaussian mixture models (GMM) and ergodic hidden Markov models (HMM). Next, we combine neural-network-based discriminative models (DNN, CNN) with the generative models to build hybrid systems, including DNN-GMM, CNN-GMM, DNN-HMM and CNN-HMM. Finally, a system combination method is used to obtain the best overall performance from the multiple systems.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC+mel energy |
Classifier | fusion |
Acoustic Scene Classification Using Convolutional Neural Networks
Daniele Battaglino1,2, Ludovick Lepauloux1 and Nicholas Evans2
1NXP Software, France, 2EURECOM, France
Battaglino_task1_1
Acoustic Scene Classification Using Convolutional Neural Networks
Abstract
Acoustic scene classification (ASC) aims to distinguish between different acoustic environments and is a technology which can be used by smart devices for contextualization and personalization. Standard algorithms exploit hand-crafted features which are unlikely to offer the best potential for reliable classification. This paper reports the first application of convolutional neural networks (CNNs) to ASC, an approach which learns discriminant features automatically from spectral representations of raw acoustic data. A principal influence on performance comes from the specific convolutional filters which can be adjusted to capture different spectrotemporal, recurrent acoustic structure. The proposed CNN approach is shown to outperform a Gaussian mixture model baseline for the DCASE 2016 database even though training data is sparse.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | CNN |
Supervised Nonnegative Matrix Factorization for Acoustic Scene Classification
Victor Bisot, Romain Serizel, Slim Essid and Gaël Richard
Telecom ParisTech, Paris, France
Bisot_task1_1
Supervised Nonnegative Matrix Factorization for Acoustic Scene Classification
Abstract
This report describes our contribution to the 2016 IEEE AASP DCASE challenge for the acoustic scene classification task. We propose a feature learning approach following the idea of decomposing time-frequency representations with nonnegative matrix factorization. We aim to learn a common dictionary representing the data and use projections onto this dictionary as features for classification. Our system is based on a novel supervised extension of nonnegative matrix factorization. In the approach we propose, the dictionary and the classifier are optimized jointly in order to find a suitable representation that minimizes the classification cost. The proposed method significantly outperforms the baseline and provides improved results compared to unsupervised nonnegative matrix factorization.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | NMF |
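The supervised NMF in this submission optimizes the dictionary jointly with the classifier; as a simpler stand-in, the sketch below uses ordinary (unsupervised) NMF from scikit-learn to learn a dictionary and then trains a separate classifier on the projections. Data shapes and component counts are placeholders, not the submitted configuration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

# Placeholder non-negative time-frequency representations (one row per recording,
# e.g. a flattened or time-averaged spectrogram) and scene labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 512))
y_train = rng.integers(0, 15, size=200)
X_test = rng.random((50, 512))

# Learn a common dictionary and use the activations (projections) as features.
nmf = NMF(n_components=32, init="nndsvda", max_iter=400, random_state=0)
H_train = nmf.fit_transform(X_train)
H_test = nmf.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(H_train, y_train)
scene_predictions = clf.predict(H_test)
```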
Acoustic Scene Recognition with Deep Neural Networks (DCASE Challenge 2016)
Wei Dai1, Juncheng Li2, Phuong Pham3, Samarjit Das2 and Shuhui Qu4
1Carnegie Mellon University, Pittsburgh, USA, 2Robert Bosch Research and Technology Center, USA, 3University of Pittsburgh, Pittsburgh, USA, 4Stanford University, Stanford, USA
Qu_task1_1 Qu_task1_2 Qu_task1_3 Qu_task1_4
Acoustic Scene Recognition with Deep Neural Networks (DCASE Challenge 2016)
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | various |
Classifier | ensemble |
CP-JKU Submissions for DCASE-2016: a Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks
Hamid Eghbal-Zadeh, Bernhard Lehner, Matthias Dorfer and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University of Linz, Linz, Austria
Eghbal-Zadeh_task1_1 Eghbal-Zadeh_task1_2 Eghbal-Zadeh_task1_3 Eghbal-Zadeh_task1_4
CP-JKU Submissions for DCASE-2016: a Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks
Abstract
This report describes the CP-JKU team's 4 submissions for Task 1 (audio scene classification) of the DCASE-2016 challenge. We propose 4 different approaches for Audio Scene Classification (ASC). First, we propose a novel i-vector extraction scheme for ASC using both left and right audio channels. Second, we propose a Deep Convolutional Neural Network (DCNN) architecture trained on spectrograms of audio excerpts in end-to-end fashion. Third, we use a calibration transformation to improve the performance of our binaural i-vector system. Finally, we propose a late fusion of our binaural i-vector system and the DCNN. We report the performance of our proposed methods on the provided cross-validation setup of the DCASE-2016 challenge. Using the late-fusion approach, we improve the performance of the baseline by 17% in accuracy.
System characteristics
Input | binaural; monophonic; mono+binaural |
Sampling rate | 44.1kHz |
Features | MFCC; spectrogram; MFCC+spectrograms |
Classifier | I-vector; CNN; fusion |
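The highest-scoring CP-JKU entry is a late fusion of the calibrated binaural i-vector system and the DCNN. A minimal sketch of such a late fusion is shown below, with hypothetical per-system class posteriors and a fixed fusion weight standing in for the calibration and fusion actually used.

```python
import numpy as np

# Hypothetical class-probability matrices from the two subsystems for the same
# test recordings (rows = recordings, columns = 15 scene classes).
rng = np.random.default_rng(0)
p_ivector = rng.dirichlet(np.ones(15), size=390)   # binaural i-vector backend
p_cnn = rng.dirichlet(np.ones(15), size=390)       # spectrogram DCNN

# Late fusion: weighted average of the posteriors, then argmax per recording.
w = 0.5                                            # fusion weight (an assumption)
p_fused = w * p_ivector + (1.0 - w) * p_cnn
predicted_class = p_fused.argmax(axis=1)
```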
Experiments on The DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
Benjamin Elizalde1, Anurag Kumar1, Ankit Shah2, Rohan Badlani3, Emmanuel Vincent4, Bhiksha Raj1 and Ian Lane1
1Carnegie Mellon University, Pittsburgh, USA, 2NIT Surathkal, India, 3BITS, Pilani, India, 4Inria, Villers-les-Nancy, France
Kumar_task1_1
Experiments on The DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
Abstract
In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance on both tasks improved over the DCASE baselines: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC distribution |
Classifier | SVM |
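The "MFCC distribution" feature can be read as summarizing each recording's frame-level MFCCs by a single Gaussian. The sketch below is a guess at that idea rather than the submitted code: it concatenates the mean vector and the upper triangle of the covariance matrix into one descriptor per recording and feeds it to an SVM.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_descriptor(mfcc):
    """Describe a recording's MFCC distribution by a single Gaussian:
    per-coefficient means plus the upper triangle of the covariance matrix."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False)
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mu, cov[iu]])

# Placeholder data: 60 recordings of ~1300 frames x 20 MFCCs, with random labels.
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(1300, 20)) for _ in range(60)]
labels = rng.integers(0, 15, size=60)

X = np.stack([gaussian_descriptor(m) for m in recordings])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
```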
Mel-Band Features for DCASE 2016 Acoustic Scene Classification Task
Juliano Henrique Foleiss1 and Tiago Fernandes Tavares2
1Universidade Tecnologica Federal do Parana, Campo Mourao, Brazil, 2Universidade Estadual de Campinas, Campinas, Brazil
Foleiss_task1_1
Mel-Band Features for DCASE 2016 Acoustic Scene Classification Task
Abstract
In this work we propose to separately calculate spectral low-level features in each frequency band, as is commonly done in the problem of beat tracking and tempo estimation [1]. We base this approach on the same auditory models that inspired the use of Mel-Frequency Cepstral Coefficients (MFCCs) [2] or energy through a filter bank [3] for audio genre classification. They rely on a model of the cochlea in which similar regions of the inner ear are stimulated by similar frequencies and are processed independently. Both the MFCC and the filter-bank energy approaches generate only an energy spectrum. In our approach, we expand this idea to incorporate other perceptually inspired features.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | various |
Classifier | SVM |
Convolutional Neural Network with Multiple-Width Frequency-Delta Data Augmentation for Acoustic Scene Classification
Yoonchang Han and Kyogu Lee
Music and Audio Research Group, Seoul National University, Seoul, South Korea
Lee_task1_1
Convolutional Neural Network with Multiple-Width Frequency-Delta Data Augmentation for Acoustic Scene Classification
Abstract
In this paper, we apply a convolutional neural network to the acoustic scene classification task of DCASE 2016. We propose multi-width frequency-delta data augmentation, which uses the static mel-spectrogram as well as frequency-delta features as individual examples with the same labels for the network input; the experimental results show that this method significantly improves performance compared to using the static mel-spectrogram input only. In addition, we propose folded mean aggregation, which multiplies the output probabilities of the static and delta augmentation data from the same window prior to audio clip-wise aggregation, and we found that this method further reduces the error rate. The system exhibited a classification accuracy of 0.831 when classifying 15 acoustic scenes.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | CNN |
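A minimal sketch of the multi-width frequency-delta augmentation idea: besides the static log-mel spectrogram, delta features are computed along the frequency axis with several widths, and each resulting array is used as an additional training example with the same label. The widths and shapes below are assumptions, not the submitted configuration.

```python
import numpy as np
import librosa

def multi_width_freq_delta(log_mel, widths=(3, 11, 19)):
    """Return the static log-mel spectrogram plus frequency-delta versions
    computed with several (odd) widths, stacked as separate examples."""
    examples = [log_mel]
    for w in widths:
        # axis=0: deltas across mel bands (frequency), not across time
        examples.append(librosa.feature.delta(log_mel, width=w, axis=0))
    return np.stack(examples)

log_mel = np.random.default_rng(0).normal(size=(128, 430))  # placeholder (bands x frames)
augmented = multi_width_freq_delta(log_mel)                  # shape (4, 128, 430)
```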
DCASE2016 Baseline System
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | GMM |
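The baseline models frame-level MFCCs with one GMM per scene class and picks the scene whose model gives the highest total log-likelihood. The following is a minimal sketch of that pipeline using librosa and scikit-learn; file handling, feature settings and the number of Gaussian components are assumptions rather than the official baseline code.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=44100, n_mfcc=20):
    """Frame-level MFCCs (frames x coefficients) for one recording."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_scene_gmms(train_files, n_components=16):
    """train_files: hypothetical dict mapping scene label -> list of wav paths."""
    gmms = {}
    for scene, paths in train_files.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        gmms[scene] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(feats)
    return gmms

def classify(path, gmms):
    feats = mfcc_frames(path)
    # Sum of frame log-likelihoods under each scene model; pick the best scene.
    scores = {scene: g.score_samples(feats).sum() for scene, g in gmms.items()}
    return max(scores, key=scores.get)
```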
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Lars Hertel1, Huy Phan1,2 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Hertel_task1_1
Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling
Abstract
We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves the baselines by a relative amount of 17% and 19%, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. In particular, it contains neither fully-connected layers (apart from the fully-connected output layer) nor dropout layers.
System characteristics
Input | left |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | CNN |
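A minimal Keras sketch in the spirit of the abstract: only convolutional layers followed by one global pooling layer and a softmax output, with the time axis left unspecified so that variable-length inputs are accepted. The masking used in the submission is omitted, and all layer sizes are assumptions.

```python
from tensorflow.keras import layers, models

n_freq, n_classes = 257, 15   # assumed STFT bins and number of scenes

model = models.Sequential([
    layers.Input(shape=(None, n_freq, 1)),                 # variable-length time axis
    layers.Conv2D(32, (5, 5), strides=2, activation="relu", padding="same"),
    layers.Conv2D(64, (5, 5), strides=2, activation="relu", padding="same"),
    layers.Conv2D(128, (3, 3), strides=2, activation="relu", padding="same"),
    layers.Conv2D(n_classes, (1, 1)),                       # one feature map per class
    layers.GlobalAveragePooling2D(),                        # global pooling (unmasked here)
    layers.Softmax(),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```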
Empirical Study on Ensemble Method of Deep Neural Networks for Acoustic Scene Classification
Jaehun Kim and Kyogu Lee
Music and Audio Research Group, Seoul National University, Seoul, South Korea
Lee_task1_2
Empirical Study on Ensemble Method of Deep Neural Networks for Acoustic Scene Classification
Abstract
Deep neural networks have shown superior classification and regression performance in a wide range of applications. In particular, ensembles of deep machines have been reported to effectively decrease test errors in many studies. In this work, we extend the scale of deep machines to include hundreds of networks and apply this to acoustic scene classification. In doing so, several recent learning techniques are employed to accelerate the training process, and a novel stochastic feature diversification method is proposed to allow different contributions from each constituent network. Experimental results with the DCASE2016 dataset indicate that an ensemble of deep machines leads to better performance on acoustic scene classification.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | unsupervised |
Classifier | CNN ensemble |
Deep Neural Network Baseline for DCASE Challenge 2016
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel filter bank features are used. Two kinds of Mel banks, a same-area bank and a same-height bank, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel Frequency Cepstral Coefficient (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | DNN |
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Thomas Lidy1 and Alexander Schindler2
1Institute of Software Technology, Vienna University of Technology, Vienna, Austria, 2Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria
Schindler_task1_1 Schindler_task1_2
CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging
Abstract
For the DCASE 2016 audio benchmarking contest, we submitted a parallel Convolutional Neural Network architecture for the tasks of 1) classifying acoustic scenes and urban soundscapes (task 1) and 2) domestic audio tagging (task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is Mel-transformed spectrograms. We, however, found that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to DCASE 2016 tasks 1 and 4, with some slight alterations described in this paper. Our approach shows a 10.7% relative improvement over the baseline system of the acoustic scene classification task on the development set of task 1 [1].
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | CQT |
Classifier | CNN |
Acoustic Scene Classification by Feed Forward Neural Network with Class Dependent Attention Mechanism
Jiaming Liu1, Hihui Wang1, Mingyu You1, Ruiwei Zhao1 and Guozheng Li2
1Department of Control Science and Engineering, Tongji University, Shanghai, China, 2Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Science, Beijing, China
Liu_task1_1 Liu_task1_2
Acoustic Scene Classification by Feed Forward Neural Network with Class Dependent Attention Mechanism
Abstract
For the acoustic scene classification task, we propose a novel attention mechanism embedded in feed-forward networks. On top of a shared input layer, 15 separate attention modules are computed, one per class, which output 15 class-dependent feature vectors. The feature vectors are then mapped to class labels by 15 subnetworks, and a softmax layer is employed on the very top of the network. In our experiments, the default features, MFCCs and mel filterbanks with delta and acceleration coefficients, are used to represent each segment. We split each 30 s audio recording into 1 s segments, predict a label for each segment, and output the most frequent label for the 30 s recording. The best single neural network achieves 77.4% cross-validation accuracy without further feature engineering or any data augmentation. We train 5 models with MFCC features and 5 models with mel filterbank features, then form a majority-vote ensemble, obtaining a 78.6% final cross-validation result. For submission, the 10 models are retrained on the full dataset, and the final submission is a majority-vote ensemble of the 10 models' outputs.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC+mel energy |
Classifier | fusion |
Binaural Scene Classification with Wavelet Scattering
Vincent Lostanlen1 and Joakim Andén2
1Departement d’Informatique, Ecole normale superieure, Paris, France, 2Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ, USA
Lostanlen_task1_1
Binaural Scene Classification with Wavelet Scattering
Abstract
This technical report describes our contribution to the scene classification task of the 2016 edition of the IEEE AASP Challenge for Detection and Classification of Acoustic Scenes and Events (DCASE). Our computational pipeline consists of a gammatone scattering transform, logarithmically compressed and coupled with a per-frame linear support vector machine. At test time, frame-level labels are aggregated over the whole recording by majority vote. During the training phase, we propose a novel data augmentation technique, where left and right channels are mixed at different proportions to introduce invariance to sound direction in the training data.
System characteristics
Input | mixed |
Sampling rate | 44.1kHz |
Features | gammatone scattering |
Classifier | SVM |
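The data augmentation described in the abstract mixes the left and right channels at different proportions to make the classifier invariant to sound direction. A small sketch of that step, with placeholder audio and an assumed number of augmented copies:

```python
import numpy as np

def mix_channels(stereo, rng, n_augmented=4):
    """Create mono mixtures alpha*left + (1-alpha)*right at random proportions,
    so the training data varies in apparent sound direction.
    `stereo` is an (n_samples, 2) array."""
    left, right = stereo[:, 0], stereo[:, 1]
    mixes = []
    for _ in range(n_augmented):
        alpha = rng.uniform(0.0, 1.0)
        mixes.append(alpha * left + (1.0 - alpha) * right)
    return np.stack(mixes)

rng = np.random.default_rng(0)
stereo = rng.normal(size=(44100 * 30, 2))   # placeholder 30 s stereo recording
augmented = mix_channels(stereo, rng)        # 4 mono signals sharing the same label
```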
The UP System for the 2016 DCASE Challenge Using Deep Recurrent Neural Network and Multiscale Kernel Subspace Learning
Erik Marchi1,2, Dario Tonelli3, Xinzhou Xu1, Fabien Ringeval1,2, Jun Deng1, Stefano Squartini3 and Björn Schuller1,2,4
1Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 2audEERING GmbH, Gilching, Germany, 3A3LAB, Department of Information Engineering, Universita Politecnica delle Marche, Italy, 4Department of Computing, Imperial College London, London, United Kingdom
Marchi_task1_1
The UP System for the 2016 DCASE Challenge Using Deep Recurrent Neural Network and Multiscale Kernel Subspace Learning
Abstract
We propose a system for acoustic scene classification using pairwise decomposition with deep neural networks and dimensionality reduction by multiscale kernel subspace learning. It is our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016). The system classifies 15 different acoustic scenes. First, auditory spectral features are extracted and fed into 15 binary deep multilayer perceptron (MLP) neural networks. The MLPs are trained with the one-against-all paradigm to perform a pairwise decomposition. In a second stage, a large number of spectral, cepstral, energy and voicing-related audio features are extracted. Multiscale Gaussian kernels are then used to construct an optimal linear combination of Gram matrices for multiple kernel subspace learning. The reduced feature set is fed into a nearest-neighbour classifier. Predictions from the two systems are then combined by a threshold-based decision function. On the official development set of the challenge, an accuracy of 81.5% is achieved. In this technical report, we provide a description of the actual system submitted to the challenge.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | various |
Classifier | fusion |
TUT Acoustic Scene Classification Submission
Gonçalo Marques1 and Thibault Langlois2
1Electronic Telecom. and Comp. Dept., Instituto Superior de Engenharia de Lisboa, Lisboa, Portugal, 2Informatics Dept., Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Marques_task1_1
TUT Acoustic Scene Classification Submission
Abstract
This technical report presents the details of our submission to the D-CASE classification challenge, Task 1: Acoustic Scene Classification. The method consists of a feature extraction phase followed by two dimensionality reduction steps (PCA and LDA), with classification performed using the k-nearest-neighbours algorithm.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | kNN |
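The classification back-end described in the abstract (two dimensionality reduction steps followed by k-nearest neighbours) maps directly onto a scikit-learn pipeline. A minimal sketch with placeholder features and assumed component counts:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Placeholder recording-level feature vectors and scene labels (15 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 400))
y = np.tile(np.arange(15), 14)[:200]

clf = make_pipeline(
    PCA(n_components=60),                           # first reduction step
    LinearDiscriminantAnalysis(n_components=14),    # second step (<= n_classes - 1)
    KNeighborsClassifier(n_neighbors=5),            # k-nearest-neighbours classifier
)
clf.fit(X, y)
```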
Acoustic Scene Classification Using Time-Delay Neural Networks and Amplitude Modulation Filter Bank Features
Niko Moritz1, Jens Schröder1, Stefan Goetze1, Jörn Anemüller2 and Birger Kollmeier2
1Project Group for Hearing, Speech, and Audio Processing, Fraunhofer IDMT, Oldenburg, Germany, 2Medizinische Physik & Hearing4all, University of Oldenburg, Oldenburg, Germany
Moritz_task1_1
Acoustic Scene Classification Using Time-Delay Neural Networks and Amplitude Modulation Filter Bank Features
Abstract
This paper presents a system for acoustic scene classification (SC) that is applied to data of the SC task of the DCASE'16 challenge (Task 1). The proposed method is based on extracting acoustic features that employ a relatively long temporal context, i.e., amplitude modulation filter bank (AMFB) features, prior to detection of acoustic scenes using a neural network (NN) based classification approach. Recurrent neural networks (RNNs) are well suited to model long-term acoustic dependencies that are known to encode important information for SC tasks. However, RNNs require a relatively large amount of training data in comparison to feed-forward deep neural networks (DNNs). Hence, the time-delay neural network (TDNN) approach is used in the present work, which enables analysis of long contextual information similar to RNNs but with training effort comparable to conventional DNNs. The proposed SC system attains a recognition accuracy of 76.5%, which is 4.0% higher than that of the DCASE'16 baseline system.
System characteristics
Input | left+right+mono |
Sampling rate | 16kHz |
Features | amplitude modulation filter bank |
Classifier | TDNN |
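A time-delay neural network analyses a long temporal context with feed-forward training cost; one common way to express it is a stack of 1-D convolutions over time with growing (dilated) context, as in the hedged Keras sketch below. The feature dimensionality, layer sizes, and context widths are assumptions, not the submitted configuration.

```python
from tensorflow.keras import layers, models

n_features, n_classes = 60, 15   # assumed AMFB feature dimension and scene count

# TDNN-style stack: each 1-D convolution splices a wider temporal context
# (here via dilation), followed by pooling over the whole sequence.
model = models.Sequential([
    layers.Input(shape=(None, n_features)),
    layers.Conv1D(256, kernel_size=5, dilation_rate=1, activation="relu", padding="same"),
    layers.Conv1D(256, kernel_size=3, dilation_rate=2, activation="relu", padding="same"),
    layers.Conv1D(256, kernel_size=3, dilation_rate=4, activation="relu", padding="same"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```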
Acoustic Scene Classification Using MFCC and MP Features
Manjunath Mulimani and Shashidhar G. Koolagudi
Dept. of Computer Science & Engineering, National Institute of Technology, Karnataka, India
Mulimani_task1_1
Acoustic Scene Classification Using MFCC and MP Features
Abstract
This paper describes our experiments for the acoustic scene classification task, part of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE-2016) IEEE Audio and Acoustic Signal Processing (AASP) challenge. Identifying features from the given audio clips for appropriate acoustic scene classification is a challenging task because of their heterogeneous nature. To identify such features, we implemented several methods using the Matching Pursuit (MP) algorithm to extract time-frequency (TF) based features. The MP algorithm iteratively selects atoms among a set of parameterized waveforms in the dictionary that best correlate with the original signal structure. From the selected atoms, the mean and standard deviation of the amplitude and frequency parameters of the first few (n) atoms are calculated separately, resulting in four MP feature sets. The combination of twenty MFCCs along with the four MP features enhanced the recognition accuracy of acoustic scenes using a GMM classifier.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC+matching pursuit |
Classifier | GMM |
Deep Neural Network Bottleneck Feature for Acoustic Scene Classification
Seongkyu Mun1, Sangwook Park2, Younglo Lee2 and Hanseok Ko1,2
1Department of Visual Information Processing, Korea University, Seoul, South Korea, 2School of Electrical Engineering, Korea University, Seoul, South Korea
Ko_task1_2
Deep Neural Network Bottleneck Feature for Acoustic Scene Classification
Abstract
Bottleneck features have been shown to be effective in improving the accuracy of speaker recognition, language identification and automatic speech recognition. However, few works have focused on bottleneck features for acoustic scene classification. This report proposes a novel acoustic scene feature extraction using bottleneck features derived from a Deep Neural Network (DNN). On the official development set with our settings, a feature set that includes bottleneck features and Perceptual Linear Prediction (PLP) features shows the best accuracy.
System characteristics
Input | left+right+mono |
Sampling rate | 16kHz |
Features | various |
Classifier | DNN |
Sound Scene Identification Based on Monaural and Binaural Features
Waldo Nogueira1,2
1Medical University Hannover, Hannover, Germany, 2Cluster of Excellence Hearing4all, Hannover, Germany
Nogueira_task1_1
Sound Scene Identification Based on Monaural and Binaural Features
Abstract
This submission to the acoustic scene classification sub-task of the IEEE DCASE 2016 Challenge is based on a feature extraction module that concatenates monaural and binaural features. Monaural features are based on Mel-frequency cepstra summarized using recurrence quantification analysis. Binaural features are based on the extraction of inter-aural differences (level and time) and the coherence between the two channels of the stereo recordings. These features are used in conjunction with a support vector machine for the classification of the acoustic sound scenes. In this short paper the impact of the different features is analyzed.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | various |
Classifier | SVM |
Score Fusion of Classification Systems for Acoustic Scene Classification
Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1,2
1School of Electrical Engineering, Korea University, Seoul, South Korea, 2Department of Visual Information Processing, Korea University, Seoul, South Korea
Ko_task1_1
Score Fusion of Classification Systems for Acoustic Scene Classification
Abstract
This technical report describes our study on acoustic scene classification, one of the tasks of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. We investigated several methods in three respects: feature extraction, generative/discriminative machine learning, and score fusion for the final decision. To find an appropriate frame-based feature, a new feature was devised after investigating several alternatives. Models based on both generative and discriminative learning were then applied to classify the features, yielding several systems each composed of a feature and a classifier. The final result was determined by fusing the individual results. Experiment results are summarized in Section 3, and concluding remarks are presented in Section 4.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | various |
Classifier | fusion |
Acoustic Scene Classification Using Deep Learning
Rohit Patiyal and Padmanabhan Rajan
School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India
Patiyal_task1_1
Acoustic Scene Classification Using Deep Learning
Abstract
Acoustic Scene Classification (ASC) is the task of classifying audio samples on the basis of their soundscapes. It is one of the tasks taken up by the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE-2016) challenge, for which a labeled dataset of audio samples from various scenes is provided and solutions are invited. In this paper, the use of Deep Neural Networks (DNNs) is proposed for the ASC task. Different methods for extracting features with different classification algorithms are explored. It is observed that the DNN works significantly better than other methods trained on the same set of features, and performs on par with the state-of-the-art techniques presented in DCASE-2013. It is concluded that the use of MFCC features with a DNN works best, giving a 97.6% cross-validation score on the 2016 development dataset for a particular set of DNN parameters. Training the DNN also does not take longer run times compared to the other methods.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
CNN-LTE: a Class of 1-X Pooling Convolutional Neural Networks on Label Tree Embeddings for Audio Scene Recognition
Huy Phan1,2, Lars Hertel1, Marco Maass1, Philipp Koch1 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Phan_task1_1
CNN-LTE: a Class of 1-X Pooling Convolutional Neural Networks on Label Tree Embeddings for Audio Scene Recognition
Abstract
We describe in this report our audio scene recognition system submitted to the DCASE 2016 challenge [1]. Firstly, given the label set of the scenes, a label tree is automatically constructed. This category taxonomy is then used in the feature extraction step in which an audio scene instance is represented by a label tree embedding image. Different convolutional neural networks, which are tailored for the task at hand, are finally learned on top of the image features for scene recognition. Our system reaches an overall recognition accuracy of 81.2% and outperforms the DCASE 2016 baseline with an absolute improvement of 8.7% on the development data.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | label tree embedding |
Classifier | CNN |
Deep Neural Network for Acoustic Scene Detection
Alexei Pugachev and Dmitrii Ubskii
Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia
Pugachev_task1_1
Deep Neural Network for Acoustic Scene Detection
Abstract
The DCASE 2016 challenge comprised the task of acoustic scene classification. The goal of this task was to classify test recordings into one of the predefined classes that characterize the environment in which they were recorded.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
Enriched Supervised Feature Learning for Acoustic Scene Classification
Alain Rakotomamonjy
Music and Audio Research Group, Normandie Université, Saint Etienne du Rouvray, France
Rakotomamonjy_task1_1 Rakotomamonjy_task1_2
Enriched Supervised Feature Learning for Acoustic Scene Classification
Abstract
This paper presents the methodology we followed for our submission to the DCASE 2016 competition on acoustic scene classification (Task 1). The approach is based on a supervised feature learning technique built upon matrix factorization of time-frequency representations of an audio scene. As an original contribution, we introduce a non-negative supervised matrix factorization that helps in learning discriminative codes. Our experiments have shown that these supervised features perform slightly better than convolutional neural networks for this challenge. In addition, when they are coupled with hand-crafted features such as histograms of gradients, their performance is further boosted.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | various |
Classifier | SVM |
Acoustic Scene Classification Using Network-In-Network Based Convolutional Neural Network
Andri Santoso, Chien-Yao Wang and Jia-Ching Wang
National Central University, Taiwan
Santoso_task1_1
Acoustic Scene Classification Using Network-In-Network Based Convolutional Neural Network
Abstract
In this paper, we present our entry to the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). The submission is for the task of automatic audio scene classification. Our approach is based on a deep learning method adopted from the computer vision research field: a convolutional neural network is used to solve the problem of audio-based scene classification, and specifically the network-in-network architecture is utilized to build the classifier. For the feature extraction part, Mel-frequency cepstral coefficients (MFCC) are used as the input vector for the classifier. Differing from the original network-in-network architecture, in this work we perform 1-D convolution operations instead of 2-D convolutions. The classifier is trained using every frame from the MFCC feature set, and the frame-level results are then thresholded and voted on to choose the final scene label of the audio data. The proposed work shows better performance than the provided baseline system of the DCASE challenge.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | CNN |
Acoustic Scene Classification: an Evaluation of an Extremely Compact Feature Representation
Gustavo Sena Mafra1, Quang-Khanh-Ngoc Duong2, Alexey Ozerov2 and Patrick Perez2
1Universidade Federal de Santa Catarina, Santa Catarina, Brazil, 2Technicolor, France
Duong_task1_1 Duong_task1_2 Duong_task1_3 Duong_task1_4
Acoustic Scene Classification: an Evaluation of an Extremely Compact Feature Representation
Abstract
This paper investigates several approaches to address the acoustic scene classification (ASC) task. We start from low-level feature representations of segmented audio frames and investigate different time granularities for feature aggregation. We study the use of a support vector machine (SVM), as a well-known classifier, together with two popular neural network (NN) architectures, namely the multilayer perceptron (MLP) and the convolutional neural network (CNN), for higher-level feature learning and classification. We evaluate the performance of these approaches on benchmark datasets provided by the 2013 and 2016 Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. We observe that a simple approach exploiting the averaged Mel-log-spectrogram, as an extremely compact feature, together with an SVM, can obtain even better results than NN-based approaches and comparable performance to the best systems in the DCASE 2013 challenge.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | SVM; DNN |
Acoustic Scene Classification Using Deep Neural Network and Frame-Concatenated Acoustic Feature
Gen Takahashi1, Takeshi Yamada1, Shoji Makino1 and Nobutaka Ono2
1University of Tsukuba, Tsukuba, Japan, 2National Institute of Informatics / SOKENDAI, Japan
Takahashi_task1_1
Acoustic Scene Classification Using Deep Neural Network and Frame-Concatenated Acoustic Feature
Abstract
This paper describes our contribution to the task of acoustic scene classification in the DCASE2016 (Detection and Classification of Acoustic Scenes and Events 2016) Challenge set by IEEE AASP. In this work, we applied the DNN-GMM (Deep Neural Network-Gaussian Mixture Model) to acoustic scene classification. We introduced high-dimensional features that are concatenated with acoustic features in temporally adjacent frames. As a result, it was confirmed that the classification accuracy of the DNN-GMM was improved by 5.0% in comparison with that of the GMM, which was used as the baseline classifier.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN-GMM |
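The "frame-concatenated acoustic feature" mentioned in the abstract can be sketched as stacking each frame with its temporally adjacent frames into one high-dimensional input vector for the DNN. A small illustration with an assumed context width:

```python
import numpy as np

def stack_context(frames, context=5):
    """Concatenate each frame with `context` neighbours on both sides
    (edge-padded), yielding high-dimensional frame vectors.
    `frames` has shape (n_frames, n_features)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.stack([padded[i:i + window].ravel() for i in range(len(frames))])

mfcc = np.random.default_rng(0).normal(size=(1300, 20))  # placeholder MFCC frames
X = stack_context(mfcc)                                   # shape (1300, 220)
```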
DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks
Michele Valenti1, Aleksandr Diment2, Giambattista Parascandolo2, Stefano Squartini1 and Tuomas Virtanen2
1Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy, 2Department of Signal Processing, Tampere University of Technology, Tampere, Finland
Valenti_task1_1
DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks
Abstract
This workshop paper presents our contribution to the task of acoustic scene classification proposed for the Detection and Classification of Acoustic Scenes and Events (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrogram. In addition we use a training method that can be applied when the validation performance of the system saturates as training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%, which constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | CNN |
Acoustic Scene Classification Based on Spectral Analysis and Feature-Level Channel Combination
Dinesh Vij1, Naveen Aggarwal1, Bhaskaran Raman2, K.K. Ramakrishnan3 and Divya Bansal4
1UIET, Panjab University, Chandigarh, India, 2IIT, Bombay, India, 3University of California, USA, 4PEC University of Technology, Chandigarh, India
Aggarwal_task1_1
Acoustic Scene Classification Based on Spectral Analysis and Feature-Level Channel Combination
Abstract
This paper is a submission to the sub-task Acoustic Scene Classification of the IEEE Audio and Acoustic Signal Processing challenge: Detection and Classification of Acoustic Scenes and Events 2016. The aim of the sub-task is to correctly detect 15 different acoustic scenes, which consist of indoor, outdoor, and vehicle categories. This work is based on spectral analysis, feature-level channel combination, and support vector machine classifier. In this short paper, the impact of different parameters while extracting features is analyzed. The accuracy gain obtained by feature-level channel combination is then reported.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | various |
Classifier | SVM |
Acoustic Scene Classification Using Block Based MFCC Features
Ghodasara Vikaskumar, Shefali Waldekar, Dipjyoti Paul and Goutam Saha
Electronics & Electrical Communication Engineering Department, Indian Institute of Technology Kharagpur, Kharagpur, India
Vikaskumar_task1_1
Acoustic Scene Classification Using Block Based MFCC Features
Abstract
Acoustic Scene Classification (ASC) is receiving widespread attention due to its wide variety of applications in smart wearable devices, surveillance, life-log diarization, etc. This work describes our contribution to the acoustic scene classification task of the DCASE2016 Challenge for Detection and Classification of Acoustic Scenes and Events. In this work, we apply block-based MFCC along with a few traditional short-term audio features, using mean and standard deviation as statistics and a Support Vector Machine (SVM) as the classifier for ASC. It is observed that the block-based MFCC feature performs better than classical MFCC. For evaluation purposes, we used three different datasets.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | SVM |
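One plausible reading of "block-based MFCC with mean and standard deviation as statistics" is sketched below: the frame-level MFCC matrix is split into consecutive time blocks, each block is summarized by per-coefficient mean and standard deviation, and the concatenated statistics feed an SVM. The number of blocks and all data are placeholders, not the submitted configuration.

```python
import numpy as np
from sklearn.svm import SVC

def block_mfcc_stats(mfcc, n_blocks=10):
    """Split frame-level MFCCs into consecutive time blocks and describe each
    block by its per-coefficient mean and standard deviation."""
    blocks = np.array_split(mfcc, n_blocks, axis=0)
    stats = [np.concatenate([b.mean(axis=0), b.std(axis=0)]) for b in blocks]
    return np.concatenate(stats)

rng = np.random.default_rng(0)
recordings = [rng.normal(size=(1300, 20)) for _ in range(60)]   # placeholder MFCCs
labels = rng.integers(0, 15, size=60)

X = np.stack([block_mfcc_stats(m) for m in recordings])
clf = SVC(kernel="rbf").fit(X, labels)
```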
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task1_1
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning various problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using Recurrent Neural Networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | RNN |
Hierarchical Learning for DNN-Based Acoustic Scene Classification
Yong Xu, Qiang Huang, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Surrey, United Kingdom
Xu_task1_1
Hierarchical Learning for DNN-Based Acoustic Scene Classification
Abstract
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. First, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement in average scene classification error compared with the Gaussian Mixture Model (GMM)-based benchmark system across the four standard folds.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | DNN |
Discriminative Training of GMM Parameters for Audio Scene Classification
Sungrack Yun, Sungwoong Kim, Sunkuk Moon, Juncheol Cho and Taesu Kim
Qualcomm Research, Seoul, South Korea
Kim_task1_1
Discriminative Training of GMM Parameters for Audio Scene Classification
Abstract
This report describes our algorithms for audio scene classification and audio tagging and the results on the DCASE 2016 challenge data. We propose a discriminative training algorithm to improve the baseline GMM performance. The algorithm updates the baseline GMM parameters by maximizing the margin between classes to improve discriminative performance. For Task 1, we use a hierarchical classifier to maximize discriminative performance and achieve 84% accuracy on the given cross-validation data. For Task 4, we apply a binary classifier for each label and achieve 16.71% EER on the given cross-validation data.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | GMM |
Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection
Matthias Zöhrer and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria
Zoehrer_task1_1
Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection
Abstract
We present two resource-efficient frameworks for acoustic scene classification and acoustic event detection. In particular, we combine gated recurrent neural networks (GRNNs) and linear discriminant analysis (LDA) for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on DCASE 2016 task 1 development data, resulting in a relative improvement of 8.34% compared to the baseline GMM system. By applying GRNNs to the DCASE2016 real event detection data using an MSE objective, we obtain a segment-based error rate (ER) of 0.73, which is a relative improvement of 19.8% compared to the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis. In particular, we evaluate the effects of a hybrid, i.e. generative-discriminative, objective function.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | GRNN |