Task description
This subtask addresses the situation in which an application is tested with devices different from the ones used to record the development data. In this case, the evaluation data contains more devices than the development data.
The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:
- Device A: 40 hours (14400 segments, same as subtask A, but resampled and single-channel)
- Device B: 3 hours (108 segments per acoustic scene)
- Device C: 3 hours (108 segments per acoustic scene)
A more detailed task description can be found on the task description page
Systems ranking
Rank | Submission code | Submission name | Technical Report | Accuracy (B/C) with 95% confidence interval (Evaluation dataset) | Accuracy (B/C) (Development dataset)
---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_1 | mmd_shake_res_snapi | Eghbal-zadeh2019 | 74.5 (73.5 - 75.5) | ||
Eghbal-zadeh_CPJKU_task1b_2 | mmd_shake_res | Eghbal-zadeh2019 | 74.5 (73.5 - 75.5) | ||
Eghbal-zadeh_CPJKU_task1b_3 | mmd_shake_snapi | Eghbal-zadeh2019 | 73.4 (72.4 - 74.5) | ||
Eghbal-zadeh_CPJKU_task1b_4 | mmd_shake | Eghbal-zadeh2019 | 73.4 (72.3 - 74.4) | ||
DCASE2019 baseline | Baseline | 47.7 (46.5 - 48.8) | 41.4 | ||
Jiang_UESTC_task1b_1 | Randomforest_16 | Jiang2019 | 70.3 (69.2 - 71.3) | 62.2 | |
Jiang_UESTC_task1b_2 | Randomforest_8 | Jiang2019 | 69.9 (68.9 - 71.0) | 64.2 | |
Jiang_UESTC_task1b_3 | Averaging_16 | Jiang2019 | 69.0 (68.0 - 70.1) | 63.2 | |
Jiang_UESTC_task1b_4 | Averaging_8 | Jiang2019 | 69.6 (68.6 - 70.7) | 64.0 | |
Kong_SURREY_task1b_1 | cvssp_cnn9 | Kong2019 | 61.6 (60.4 - 62.7) | 52.7 | |
Kosmider_SRPOL_task1b_1 | SC+IC+RCV | Kosmider2019 | 75.1 (74.1 - 76.1) | ||
Kosmider_SRPOL_task1b_2 | SC+ALL+SV | Kosmider2019 | 75.3 (74.3 - 76.3) | ||
Kosmider_SRPOL_task1b_3 | SC+IC+RCV | Kosmider2019 | 74.9 (73.9 - 75.9) | ||
Kosmider_SRPOL_task1b_4 | SC+FULL+SV | Kosmider2019 | 75.2 (74.3 - 76.2) | ||
LamPham_KentGroup_task1b_1 | Kent | Pham2019 | 72.8 (71.8 - 73.8) | 72.9 | |
McDonnell_USA_task1b_1 | UniSA_1b1 | Gao2019 | 74.2 (73.2 - 75.2) | 66.3 | |
McDonnell_USA_task1b_2 | UniSA_1b2 | Gao2019 | 74.1 (73.1 - 75.2) | 62.5 | |
McDonnell_USA_task1b_3 | UniSA_1b3 | Gao2019 | 74.9 (73.9 - 75.9) | 64.2 | |
McDonnell_USA_task1b_4 | UniSA_1b4 | Gao2019 | 74.4 (73.4 - 75.4) | 66.3 | |
Primus_CPJKU_task1b_1 | CPR-NoDA | Primus2019 | 71.3 (70.2 - 72.3) | 61.2 | |
Primus_CPJKU_task1b_2 | CPR-MSE | Primus2019 | 73.4 (72.4 - 74.4) | 64.3 | |
Primus_CPJKU_task1b_3 | CPR-MI | Primus2019 | 71.6 (70.6 - 72.7) | 62.5 | |
Primus_CPJKU_task1b_4 | CPR-Ensemble | Primus2019 | 74.2 (73.2 - 75.2) | 65.1 | |
Song_HIT_task1b_1 | hitsplab_1 | Song2019 | 67.3 (66.2 - 68.3) | 65.6 | |
Song_HIT_task1b_2 | hitsplab_2 | Song2019 | 72.2 (71.2 - 73.3) | 41.4 | |
Song_HIT_task1b_3 | hitsplab_3 | Song2019 | 72.1 (71.1 - 73.1) | 70.3 | |
Waldekar_IITKGP_task1b_1 | IITKGP_MFDWC19 | Waldekar2019 | 62.1 (60.9 - 63.2) | 52.3 | |
Wang_NWPU_task1b_1 | Rui_task1b | Wang2019 | 65.7 (64.6 - 66.8) | 54.8 | |
Wang_NWPU_task1b_2 | Rui_task1b | Wang2019 | 68.5 (67.4 - 69.6) | 55.2 | |
Wang_NWPU_task1b_3 | Rui_task1b | Wang2019 | 70.3 (69.3 - 71.4) | 54.8 |
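The evaluation accuracies above are reported with 95% confidence intervals computed on segment-level accuracy. A minimal sketch of how such a binomial interval can be reproduced, assuming a normal-approximation (Wald) interval and a hypothetical segment count; the organizers' exact interval method and evaluation-set size may differ:

```python
import math

def accuracy_confidence_interval(accuracy, n_segments, z=1.96):
    """Normal-approximation (Wald) confidence interval for classification accuracy.

    accuracy   : proportion of correctly classified segments (0..1)
    n_segments : number of evaluated segments (hypothetical value below)
    z          : 1.96 gives a 95% interval
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_segments)
    return accuracy - half_width, accuracy + half_width

# Example: the baseline's 47.7% accuracy with an assumed 7200 evaluation segments.
low, high = accuracy_confidence_interval(0.477, 7200)
print(f"{100 * low:.1f} - {100 * high:.1f}")
```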
Teams ranking
Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset)
---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_2 | mmd_shake_res | Eghbal-zadeh2019 | 74.5 (73.5 - 75.5) | ||
DCASE2019 baseline | Baseline | 47.7 (46.5 - 48.8) | 41.4 | ||
Jiang_UESTC_task1b_1 | Randomforest_16 | Jiang2019 | 70.3 (69.2 - 71.3) | 62.2 | |
Kong_SURREY_task1b_1 | cvssp_cnn9 | Kong2019 | 61.6 (60.4 - 62.7) | 52.7 | |
Kosmider_SRPOL_task1b_2 | SC+ALL+SV | Kosmider2019 | 75.3 (74.3 - 76.3) | ||
LamPham_KentGroup_task1b_1 | Kent | Pham2019 | 72.8 (71.8 - 73.8) | 72.9 | |
McDonnell_USA_task1b_3 | UniSA_1b3 | Gao2019 | 74.9 (73.9 - 75.9) | 64.2 | |
Primus_CPJKU_task1b_4 | CPR-Ensemble | Primus2019 | 74.2 (73.2 - 75.2) | 65.1 | |
Song_HIT_task1b_2 | hitsplab_2 | Song2019 | 72.2 (71.2 - 73.3) | 41.4 | |
Waldekar_IITKGP_task1b_1 | IITKGP_MFDWC19 | Waldekar2019 | 62.1 (60.9 - 63.2) | 52.3 | |
Wang_NWPU_task1b_3 | Rui_task1b | Wang2019 | 70.3 (69.3 - 71.4) | 54.8 |
Class-wise performance
Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_1 | mmd_shake_res_snapi | Eghbal-zadeh2019 | 74.5 | 71.2 | 90.3 | 74.9 | 66.0 | 84.6 | 59.9 | 81.8 | 45.0 | 89.7 | 81.5 | |
Eghbal-zadeh_CPJKU_task1b_2 | mmd_shake_res | Eghbal-zadeh2019 | 74.5 | 70.1 | 90.6 | 74.4 | 66.2 | 82.1 | 58.3 | 82.6 | 48.1 | 90.3 | 82.2 | |
Eghbal-zadeh_CPJKU_task1b_3 | mmd_shake_snapi | Eghbal-zadeh2019 | 73.4 | 73.6 | 89.6 | 71.0 | 66.1 | 86.1 | 56.5 | 78.8 | 44.0 | 87.2 | 81.4 | |
Eghbal-zadeh_CPJKU_task1b_4 | mmd_shake | Eghbal-zadeh2019 | 73.4 | 71.7 | 90.1 | 72.5 | 65.1 | 85.8 | 58.9 | 79.3 | 43.1 | 88.1 | 79.0 | |
DCASE2019 baseline | Baseline | 47.7 | 36.5 | 57.8 | 57.8 | 35.8 | 54.9 | 15.8 | 76.8 | 28.3 | 67.8 | 45.1 | ||
Jiang_UESTC_task1b_1 | Randomforest_16 | Jiang2019 | 70.3 | 61.3 | 68.2 | 76.7 | 71.9 | 79.6 | 59.4 | 80.7 | 47.5 | 86.8 | 70.7 | |
Jiang_UESTC_task1b_2 | Randomforest_8 | Jiang2019 | 69.9 | 62.6 | 69.2 | 74.0 | 71.7 | 79.0 | 58.9 | 79.3 | 46.5 | 87.4 | 70.8 | |
Jiang_UESTC_task1b_3 | Averaging_16 | Jiang2019 | 69.0 | 54.7 | 67.1 | 74.9 | 71.1 | 80.7 | 61.9 | 85.3 | 39.7 | 89.2 | 65.8 | |
Jiang_UESTC_task1b_4 | Averaging_8 | Jiang2019 | 69.6 | 54.2 | 71.2 | 71.9 | 70.8 | 80.3 | 62.4 | 84.9 | 40.0 | 89.6 | 71.0 | |
Kong_SURREY_task1b_1 | cvssp_cnn9 | Kong2019 | 61.6 | 50.4 | 63.7 | 69.7 | 52.2 | 77.4 | 41.1 | 56.8 | 60.7 | 84.3 | 59.2 | |
Kosmider_SRPOL_task1b_1 | SC+IC+RCV | Kosmider2019 | 75.1 | 64.0 | 82.4 | 78.6 | 65.0 | 92.1 | 62.4 | 85.0 | 49.3 | 87.4 | 84.6 |
Kosmider_SRPOL_task1b_2 | SC+ALL+SV | Kosmider2019 | 75.3 | 68.3 | 85.8 | 81.2 | 65.6 | 94.3 | 53.6 | 86.4 | 45.1 | 90.1 | 82.4 |
Kosmider_SRPOL_task1b_3 | SC+IC+RCV | Kosmider2019 | 74.9 | 64.0 | 82.2 | 79.0 | 65.4 | 92.2 | 61.1 | 84.4 | 49.0 | 86.8 | 84.4 |
Kosmider_SRPOL_task1b_4 | SC+FULL+SV | Kosmider2019 | 75.2 | 67.9 | 85.8 | 80.8 | 65.0 | 94.4 | 54.7 | 86.4 | 43.2 | 89.9 | 84.3 |
LamPham_KentGroup_task1b_1 | Kent | Pham2019 | 72.8 | 69.9 | 91.9 | 66.1 | 56.7 | 87.6 | 44.9 | 73.3 | 64.3 | 89.6 | 83.9 | |
McDonnell_USA_task1b_1 | UniSA_1b1 | Gao2019 | 74.2 | 68.9 | 88.1 | 77.1 | 67.4 | 84.3 | 55.6 | 85.3 | 49.3 | 91.9 | 73.9 | |
McDonnell_USA_task1b_2 | UniSA_1b2 | Gao2019 | 74.1 | 73.9 | 87.5 | 73.6 | 70.0 | 81.5 | 55.6 | 82.2 | 52.8 | 91.1 | 73.2 | |
McDonnell_USA_task1b_3 | UniSA_1b3 | Gao2019 | 74.9 | 72.1 | 88.6 | 75.4 | 70.1 | 83.8 | 56.2 | 84.0 | 52.5 | 92.1 | 74.2 | |
McDonnell_USA_task1b_4 | UniSA_1b4 | Gao2019 | 74.4 | 70.4 | 86.8 | 77.8 | 69.2 | 83.1 | 55.0 | 86.8 | 49.6 | 92.1 | 73.5 | |
Primus_CPJKU_task1b_1 | CPR-NoDA | Primus2019 | 71.3 | 78.8 | 86.0 | 66.7 | 64.0 | 79.3 | 51.2 | 74.4 | 40.7 | 90.0 | 81.8 | |
Primus_CPJKU_task1b_2 | CPR-MSE | Primus2019 | 73.4 | 75.4 | 86.1 | 71.9 | 71.7 | 87.8 | 57.1 | 74.6 | 36.0 | 91.0 | 82.2 | |
Primus_CPJKU_task1b_3 | CPR-MI | Primus2019 | 71.6 | 76.1 | 83.1 | 76.0 | 61.8 | 78.8 | 59.3 | 70.3 | 36.4 | 91.1 | 83.3 | |
Primus_CPJKU_task1b_4 | CPR-Ensemble | Primus2019 | 74.2 | 77.5 | 86.2 | 74.4 | 72.4 | 86.4 | 59.9 | 78.1 | 36.0 | 89.9 | 81.5 | |
Song_HIT_task1b_1 | hitsplab_1 | Song2019 | 67.3 | 41.4 | 74.9 | 59.2 | 70.0 | 86.8 | 45.0 | 86.2 | 46.1 | 88.2 | 74.9 | |
Song_HIT_task1b_2 | hitsplab_2 | Song2019 | 72.2 | 63.1 | 80.6 | 76.5 | 73.5 | 86.9 | 37.8 | 87.5 | 51.5 | 92.5 | 72.6 | |
Song_HIT_task1b_3 | hitsplab_3 | Song2019 | 72.1 | 56.2 | 82.1 | 70.7 | 74.3 | 87.2 | 40.3 | 86.9 | 51.1 | 93.2 | 79.0 | |
Waldekar_IITKGP_task1b_1 | IITKGP_MFDWC19 | Waldekar2019 | 62.1 | 55.6 | 69.7 | 55.0 | 51.8 | 83.8 | 43.2 | 66.2 | 42.4 | 86.2 | 66.7 | |
Wang_NWPU_task1b_1 | Rui_task1b | Wang2019 | 65.7 | 55.7 | 68.2 | 69.7 | 60.0 | 81.8 | 45.3 | 62.6 | 55.6 | 90.3 | 67.6 | |
Wang_NWPU_task1b_2 | Rui_task1b | Wang2019 | 68.5 | 58.8 | 70.1 | 71.5 | 64.3 | 82.1 | 51.9 | 68.6 | 53.1 | 89.7 | 74.9 | |
Wang_NWPU_task1b_3 | Rui_task1b | Wang2019 | 70.3 | 60.3 | 72.6 | 72.8 | 65.8 | 81.1 | 54.4 | 73.2 | 53.3 | 90.6 | 78.9 |
Device-wise performance
Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset): Average Dev B / Dev C | Dev B | Dev C | Dev A | Dev D
---|---|---|---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_1 | mmd_shake_res_snapi | Eghbal-zadeh2019 | 74.5 | 73.8 | 75.2 | 81.3 | 54.4 | |
Eghbal-zadeh_CPJKU_task1b_2 | mmd_shake_res | Eghbal-zadeh2019 | 74.5 | 74.0 | 75.0 | 81.2 | 53.1 | |
Eghbal-zadeh_CPJKU_task1b_3 | mmd_shake_snapi | Eghbal-zadeh2019 | 73.4 | 72.8 | 74.1 | 80.3 | 55.3 | |
Eghbal-zadeh_CPJKU_task1b_4 | mmd_shake | Eghbal-zadeh2019 | 73.4 | 72.6 | 74.2 | 79.9 | 55.5 | |
DCASE2019 baseline | Baseline | 47.7 | 48.9 | 46.4 | 63.2 | 26.7 | ||
Jiang_UESTC_task1b_1 | Randomforest_16 | Jiang2019 | 70.3 | 69.1 | 71.4 | 75.1 | 53.0 | |
Jiang_UESTC_task1b_2 | Randomforest_8 | Jiang2019 | 69.9 | 68.5 | 71.4 | 75.1 | 52.0 | |
Jiang_UESTC_task1b_3 | Averaging_16 | Jiang2019 | 69.0 | 68.4 | 69.6 | 74.3 | 53.2 | |
Jiang_UESTC_task1b_4 | Averaging_8 | Jiang2019 | 69.6 | 68.8 | 70.5 | 73.9 | 54.2 | |
Kong_SURREY_task1b_1 | cvssp_cnn9 | Kong2019 | 61.6 | 60.3 | 62.8 | 70.2 | 40.8 | |
Kosmider_SRPOL_task1b_1 | SC+IC+RCV | Kosmider2019 | 75.1 | 74.5 | 75.7 | 79.8 | 36.1 |
Kosmider_SRPOL_task1b_2 | SC+ALL+SV | Kosmider2019 | 75.3 | 74.3 | 76.2 | 80.8 | 38.6 |
Kosmider_SRPOL_task1b_3 | SC+IC+RCV | Kosmider2019 | 74.9 | 74.4 | 75.3 | 78.9 | 35.5 |
Kosmider_SRPOL_task1b_4 | SC+FULL+SV | Kosmider2019 | 75.2 | 74.3 | 76.2 | 80.1 | 40.0 |
LamPham_KentGroup_task1b_1 | Kent | Pham2019 | 72.8 | 71.8 | 73.8 | 78.2 | 24.6 | |
McDonnell_USA_task1b_1 | UniSA_1b1 | Gao2019 | 74.2 | 73.2 | 75.1 | 79.3 | 63.4 | |
McDonnell_USA_task1b_2 | UniSA_1b2 | Gao2019 | 74.1 | 73.6 | 74.7 | 79.9 | 63.8 | |
McDonnell_USA_task1b_3 | UniSA_1b3 | Gao2019 | 74.9 | 74.2 | 75.6 | 79.8 | 65.2 | |
McDonnell_USA_task1b_4 | UniSA_1b4 | Gao2019 | 74.4 | 73.8 | 75.1 | 80.1 | 63.6 | |
Primus_CPJKU_task1b_1 | CPR-NoDA | Primus2019 | 71.3 | 70.9 | 71.7 | 78.1 | 49.4 | |
Primus_CPJKU_task1b_2 | CPR-MSE | Primus2019 | 73.4 | 73.6 | 73.1 | 72.1 | 47.9 | |
Primus_CPJKU_task1b_3 | CPR-MI | Primus2019 | 71.6 | 71.4 | 71.8 | 72.8 | 49.3 | |
Primus_CPJKU_task1b_4 | CPR-Ensemble | Primus2019 | 74.2 | 74.1 | 74.3 | 73.7 | 47.4 | |
Song_HIT_task1b_1 | hitsplab_1 | Song2019 | 67.3 | 65.3 | 69.2 | 73.1 | 47.7 | |
Song_HIT_task1b_2 | hitsplab_2 | Song2019 | 72.2 | 71.7 | 72.8 | 79.9 | 59.4 | |
Song_HIT_task1b_3 | hitsplab_3 | Song2019 | 72.1 | 71.1 | 73.1 | 78.4 | 59.1 | |
Waldekar_IITKGP_task1b_1 | IITKGP_MFDWC19 | Waldekar2019 | 62.1 | 59.7 | 64.4 | 71.4 | 39.8 | |
Wang_NWPU_task1b_1 | Rui_task1b | Wang2019 | 65.7 | 64.9 | 66.4 | 75.4 | 39.9 | |
Wang_NWPU_task1b_2 | Rui_task1b | Wang2019 | 68.5 | 67.8 | 69.2 | 76.9 | 46.8 | |
Wang_NWPU_task1b_3 | Rui_task1b | Wang2019 | 70.3 | 68.8 | 71.9 | 79.6 | 47.2 |
System characteristics
General characteristics
Rank | Code | Technical Report | Accuracy (Eval) | Sampling rate | Data augmentation | Features
---|---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_1 | Eghbal-zadeh2019 | 74.5 | 22.05kHz | mixup | perceptual weighted power spectrogram | |
Eghbal-zadeh_CPJKU_task1b_2 | Eghbal-zadeh2019 | 74.5 | 22.05kHz | mixup | perceptual weighted power spectrogram | |
Eghbal-zadeh_CPJKU_task1b_3 | Eghbal-zadeh2019 | 73.4 | 22.05kHz | mixup | perceptual weighted power spectrogram | |
Eghbal-zadeh_CPJKU_task1b_4 | Eghbal-zadeh2019 | 73.4 | 22.05kHz | mixup | perceptual weighted power spectrogram | |
DCASE2019 baseline | 47.7 | 44.1kHz | log-mel energies | |||
Jiang_UESTC_task1b_1 | Jiang2019 | 70.3 | 44.1kHz | HPSS, NNF, vocal separation, HRTF | log-mel energies | |
Jiang_UESTC_task1b_2 | Jiang2019 | 69.9 | 44.1kHz | HPSS, NNF, vocal separation, HRTF | log-mel energies | |
Jiang_UESTC_task1b_3 | Jiang2019 | 69.0 | 44.1kHz | HPSS, NNF, vocal separation, HRTF | log-mel energies | |
Jiang_UESTC_task1b_4 | Jiang2019 | 69.6 | 44.1kHz | HPSS, NNF, vocal separation, HRTF | log-mel energies | |
Kong_SURREY_task1b_1 | Kong2019 | 61.6 | 32kHz | log-mel energies | ||
Kosmider_SRPOL_task1b_1 | Kosmider2019 | 75.1 | 44.1kHz | Spectrum Correction, SpecAugment, mixup | log-mel energies |
Kosmider_SRPOL_task1b_2 | Kosmider2019 | 75.3 | 44.1kHz | Spectrum Correction, SpecAugment, mixup | log-mel energies |
Kosmider_SRPOL_task1b_3 | Kosmider2019 | 74.9 | 44.1kHz | Spectrum Correction, SpecAugment, mixup | log-mel energies |
Kosmider_SRPOL_task1b_4 | Kosmider2019 | 75.2 | 44.1kHz | Spectrum Correction, SpecAugment, mixup | log-mel energies |
LamPham_KentGroup_task1b_1 | Pham2019 | 72.8 | 44.1kHz | mixup | Gammatone, log-mel energies, CQT | |
McDonnell_USA_task1b_1 | Gao2019 | 74.2 | 44.1kHz | mixup, temporal cropping | log-mel energies, deltas and delta-deltas | |
McDonnell_USA_task1b_2 | Gao2019 | 74.1 | 44.1kHz | mixup, temporal cropping | log-mel energies | |
McDonnell_USA_task1b_3 | Gao2019 | 74.9 | 44.1kHz | mixup, temporal cropping | log-mel energies, deltas and delta-deltas | |
McDonnell_USA_task1b_4 | Gao2019 | 74.4 | 44.1kHz | mixup, temporal cropping | log-mel energies, deltas and delta-deltas | |
Primus_CPJKU_task1b_1 | Primus2019 | 71.3 | 22.05kHz | mixup | log-mel energies | |
Primus_CPJKU_task1b_2 | Primus2019 | 73.4 | 22.05kHz | mixup | log-mel energies | |
Primus_CPJKU_task1b_3 | Primus2019 | 71.6 | 22.05kHz | mixup | log-mel energies | |
Primus_CPJKU_task1b_4 | Primus2019 | 74.2 | 22.05kHz | mixup | log-mel energies | |
Song_HIT_task1b_1 | Song2019 | 67.3 | 44.1kHz | mixup | log-mel energies | |
Song_HIT_task1b_2 | Song2019 | 72.2 | 44.1kHz | mixup | log-mel energies | |
Song_HIT_task1b_3 | Song2019 | 72.1 | 44.1kHz | mixup | log-mel energies | |
Waldekar_IITKGP_task1b_1 | Waldekar2019 | 62.1 | 44.1kHz | MFDWC | ||
Wang_NWPU_task1b_1 | Wang2019 | 65.7 | 44.1kHz | log-mel energies | ||
Wang_NWPU_task1b_2 | Wang2019 | 68.5 | 32kHz | log-mel energies | ||
Wang_NWPU_task1b_3 | Wang2019 | 70.3 | 32kHz | log-mel energies |
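Most submissions in the table above use log-mel energies as input features. A minimal extraction sketch with librosa; the frame size, hop length, and number of mel bands are illustrative assumptions, not the settings of any particular submission:

```python
import librosa
import numpy as np

def log_mel_energies(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=128):
    """Compute a log-scaled mel spectrogram from a mono audio file."""
    y, sr = librosa.load(path, sr=sr, mono=True)  # resample to the target rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)
```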
Machine learning characteristics
Rank | Code | Technical Report | Accuracy (Eval) | Model complexity | Classifier | Ensemble subsystems | Decision making | Device mismatch handling
---|---|---|---|---|---|---|---|---
Eghbal-zadeh_CPJKU_task1b_1 | Eghbal-zadeh2019 | 74.5 | 747596080 | CNN, Receptive Field Regularization | 220 | average | maximum mean discrepancy, domain adaptation, transfer learning | |
Eghbal-zadeh_CPJKU_task1b_2 | Eghbal-zadeh2019 | 74.5 | 37379804 | CNN, Receptive Field Regularization | 11 | average | maximum mean discrepancy, domain adaptation, transfer learning | |
Eghbal-zadeh_CPJKU_task1b_3 | Eghbal-zadeh2019 | 73.4 | 286137920 | CNN, Receptive Field Regularization | 80 | average | maximum mean discrepancy, domain adaptation, transfer learning | |
Eghbal-zadeh_CPJKU_task1b_4 | Eghbal-zadeh2019 | 73.4 | 12878416 | CNN, Receptive Field Regularization | 4 | average | maximum mean discrepancy, domain adaptation, transfer learning | |
DCASE2019 baseline | 47.7 | 116118 | CNN | |||||
Jiang_UESTC_task1b_1 | Jiang2019 | 70.3 | 1448794 | CNN | 16 | stacking | ||
Jiang_UESTC_task1b_2 | Jiang2019 | 69.9 | 1448794 | CNN | 8 | stacking | ||
Jiang_UESTC_task1b_3 | Jiang2019 | 69.0 | 1448794 | CNN | 16 | averaging | ||
Jiang_UESTC_task1b_4 | Jiang2019 | 69.6 | 1448794 | CNN | 8 | averaging | ||
Kong_SURREY_task1b_1 | Kong2019 | 61.6 | 4686144 | CNN | ||||
Kosmider_SRPOL_task1b_1 | Kosmider2019 | 75.1 | 6100840 | CNN | 36 | isotonic-calibrated soft-voting | spectrum correction |
Kosmider_SRPOL_task1b_2 | Kosmider2019 | 75.3 | 18095576 | CNN | 124 | soft-voting | spectrum correction |
Kosmider_SRPOL_task1b_3 | Kosmider2019 | 74.9 | 3077046 | CNN | 31 | soft-voting | spectrum correction |
Kosmider_SRPOL_task1b_4 | Kosmider2019 | 75.2 | 10768964 | CNN | 58 | soft-voting | spectrum correction |
LamPham_KentGroup_task1b_1 | Pham2019 | 72.8 | 12346325 | CNN, DNN | 2 | |||
McDonnell_USA_task1b_1 | Gao2019 | 74.2 | 3253148 | CNN | aggressive regularization and augmentation | |||
McDonnell_USA_task1b_2 | Gao2019 | 74.1 | 3252268 | CNN | aggressive regularization and augmentation | |||
McDonnell_USA_task1b_3 | Gao2019 | 74.9 | 6505416 | CNN | 2 | average | aggressive regularization and augmentation | |
McDonnell_USA_task1b_4 | Gao2019 | 74.4 | 6506296 | CNN | 2 | average | aggressive regularization and augmentation | |
Primus_CPJKU_task1b_1 | Primus2019 | 71.3 | 13047888 | CNN, ensemble | 4 | average | ||
Primus_CPJKU_task1b_2 | Primus2019 | 73.4 | 13047888 | CNN, ensemble | 4 | average | domain adaptation | |
Primus_CPJKU_task1b_3 | Primus2019 | 71.6 | 13047888 | CNN, ensemble | 8 | average | domain adaptation | |
Primus_CPJKU_task1b_4 | Primus2019 | 74.2 | 26095776 | CNN, ensemble | 8 | average | domain adaptation | |
Song_HIT_task1b_1 | Song2019 | 67.3 | 22758197 | CNN | feature transform | |||
Song_HIT_task1b_2 | Song2019 | 72.2 | 68274591 | CNN | 3 | probability aggregation | feature transform | |
Song_HIT_task1b_3 | Song2019 | 72.1 | 68274591 | CNN | 3 | majority vote | feature transform | |
Waldekar_IITKGP_task1b_1 | Waldekar2019 | 62.1 | 9000 | SVM | ||||
Wang_NWPU_task1b_1 | Wang2019 | 65.7 | 116118 | CNN, DNN | 7 | domain adaptation | ||
Wang_NWPU_task1b_2 | Wang2019 | 68.5 | 116118 | CNN, DNN | 7 | average | domain adaptation | |
Wang_NWPU_task1b_3 | Wang2019 | 70.3 | 116118 | CNN, DNN | 7 | average | domain adaptation |
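Many of the ensembles above combine subsystems by averaging class probabilities ("average" or "soft-voting" in the decision-making column). A minimal sketch of such probability averaging; names and shapes are illustrative:

```python
import numpy as np

def soft_vote(probabilities):
    """Average class probabilities from several subsystems and pick the argmax.

    probabilities : list of arrays, each of shape (n_segments, n_classes)
    returns       : predicted class index per segment
    """
    mean_probs = np.mean(np.stack(probabilities, axis=0), axis=0)
    return np.argmax(mean_probs, axis=1)
```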
Public leaderboard
Scores
Date | Top team | Top 10 team median
---|---|---
2019-05-14 | 64.8 | 64.8 (64.8 - 64.8) |
2019-05-15 | 64.8 | 62.4 (60.0 - 64.8) |
2019-05-16 | 66.3 | 65.6 (64.8 - 66.3) |
2019-05-17 | 66.7 | 65.8 (64.8 - 66.7) |
2019-05-18 | 66.7 | 64.8 (60.5 - 66.7) |
2019-05-19 | 68.5 | 66.7 (64.8 - 68.5) |
2019-05-20 | 73.3 | 67.8 (64.8 - 73.3) |
2019-05-21 | 73.3 | 64.8 (56.7 - 73.3) |
2019-05-22 | 73.3 | 67.8 (59.3 - 73.3) |
2019-05-23 | 73.3 | 66.3 (53.2 - 73.3) |
2019-05-24 | 73.3 | 66.3 (58.3 - 73.3) |
2019-05-25 | 73.3 | 66.3 (60.3 - 73.3) |
2019-05-26 | 73.3 | 66.3 (60.3 - 73.3) |
2019-05-27 | 73.3 | 66.3 (60.3 - 73.3) |
2019-05-28 | 73.3 | 66.3 (60.7 - 73.3) |
2019-05-29 | 73.3 | 68.2 (60.7 - 73.3) |
2019-05-30 | 73.3 | 66.3 (44.0 - 73.3) |
2019-05-31 | 73.3 | 66.9 (58.3 - 73.3) |
2019-06-01 | 73.3 | 68.2 (62.5 - 73.3) |
2019-06-02 | 73.7 | 68.2 (62.5 - 73.7) |
2019-06-03 | 73.7 | 69.0 (62.5 - 73.7) |
2019-06-04 | 73.7 | 69.0 (62.5 - 73.7) |
2019-06-05 | 76.5 | 69.7 (64.8 - 76.5) |
2019-06-06 | 76.5 | 69.7 (66.7 - 76.5) |
2019-06-07 | 76.5 | 69.7 (66.7 - 76.5) |
2019-06-08 | 76.5 | 69.7 (67.7 - 76.5) |
2019-06-09 | 76.5 | 69.7 (68.3 - 76.5) |
2019-06-10 | 76.5 | 70.4 (69.0 - 76.5) |
Entries
Total entries
Date | Entries |
---|---|
2019-05-14 | 1 |
2019-05-15 | 2 |
2019-05-16 | 3 |
2019-05-17 | 4 |
2019-05-18 | 6 |
2019-05-19 | 7 |
2019-05-20 | 9 |
2019-05-21 | 11 |
2019-05-22 | 13 |
2019-05-23 | 16 |
2019-05-24 | 19 |
2019-05-25 | 21 |
2019-05-26 | 21 |
2019-05-27 | 22 |
2019-05-28 | 23 |
2019-05-29 | 27 |
2019-05-30 | 32 |
2019-05-31 | 39 |
2019-06-01 | 44 |
2019-06-02 | 49 |
2019-06-03 | 53 |
2019-06-04 | 56 |
2019-06-05 | 63 |
2019-06-06 | 67 |
2019-06-07 | 74 |
2019-06-08 | 80 |
2019-06-09 | 88 |
2019-06-10 | 97 |
Entries per day
Date | Entries per day |
---|---|
2019-05-14 | 1 |
2019-05-15 | 1 |
2019-05-16 | 1 |
2019-05-17 | 1 |
2019-05-18 | 2 |
2019-05-19 | 1 |
2019-05-20 | 2 |
2019-05-21 | 2 |
2019-05-22 | 2 |
2019-05-23 | 3 |
2019-05-24 | 3 |
2019-05-25 | 2 |
2019-05-26 | 0 |
2019-05-27 | 1 |
2019-05-28 | 1 |
2019-05-29 | 4 |
2019-05-30 | 5 |
2019-05-31 | 7 |
2019-06-01 | 5 |
2019-06-02 | 5 |
2019-06-03 | 4 |
2019-06-04 | 3 |
2019-06-05 | 7 |
2019-06-06 | 4 |
2019-06-07 | 7 |
2019-06-08 | 6 |
2019-06-09 | 8 |
2019-06-10 | 9 |
Technical reports
Urban Acoustic Scene Classification Using Binaural Wavelet Scattering and Random Subspace Discrimination Method
Fateme Arabnezhad and Babak Nasersharif
Computer Engineering Department, Khaje Nasir Toosi, Tehran, Iran
Fmta91_KNToosi_task1a_1
Urban Acoustic Scene Classification Using Binaural Wavelet Scattering and Random Subspace Discrimination Method
Fateme Arabnezhad and Babak Nasersharif
Computer Engineering Department, Khaje Nasir Toosi, Tehran, Iran
Abstract
This report describes our contribution to the Detection and Classification of Urban Acoustic Scenes task of the DCASE 2019 challenge (Task 1, Subtask A). We propose to use the wavelet scattering spectrum as the representation, extracted both from the average of the two recorded channels (mono) and from the difference of the two recorded channels (side). The concatenation of these two sets of wavelet scattering spectra is used as a feature vector, which is fed into a classifier based on the random subspace method. In this work, Regularized Linear Discriminant Analysis (RLDA) is used as the base learner and classification approach for the random subspace method. The experimental results show that the proposed structure learns acoustic characteristics from audio segments. This structure achieved 87.98% accuracy on the whole development set (without cross-validation) and 78.83% on the leaderboard dataset.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | wavelet scattering spectra |
Classifier | random subspace |
Decision making | highest average score |
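The random subspace classifier described in the abstract can be sketched with scikit-learn: a bagging ensemble over random feature subsets with a regularized LDA base learner. The subspace fraction and number of learners are illustrative assumptions, and the scattering features are assumed to be precomputed:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

def random_subspace_rlda(features, labels, n_learners=100, subspace_fraction=0.5):
    """Random subspace ensemble of regularized LDA classifiers.

    features : (n_segments, n_dims) wavelet scattering feature vectors (assumed precomputed)
    labels   : (n_segments,) acoustic scene labels
    """
    base = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")  # regularized LDA
    ensemble = BaggingClassifier(base,
                                 n_estimators=n_learners,
                                 max_features=subspace_fraction,  # random feature subset per learner
                                 bootstrap=False,
                                 bootstrap_features=False)
    return ensemble.fit(features, labels)
```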
Acoustic Scene Classification with Multiple Instance Learning and Fusion
Valentin Bilot and Quang Khanh Ngoc Duong
Audio R&D, InterDigital R&D, Rennes, France
Bilot_IDG_task1a_1 Bilot_IDG_task1a_2 Bilot_IDG_task1a_3 Bilot_IDG_task1a_4
Acoustic Scene Classification with Multiple Instance Learning and Fusion
Valentin Bilot and Quang Khanh Ngoc Duong
Audio R&D, InterDigital R&D, Rennes, France
Abstract
Audio classification has been an emerging topic in the last few years, especially with the benchmark datasets and evaluations from DCASE. This paper presents our deep learning models for the acoustic scene classification (ASC) task of DCASE 2019. The models exploit the multiple instance learning (MIL) method as a way of guiding the network's attention to different temporal segments of a recording. We then propose a simple late fusion of the results obtained by the three investigated MIL-based models. This fusion system uses a multi-layer perceptron (MLP) to predict the final classes from the initial class probability predictions and obtains better results on the development and leaderboard datasets.
System characteristics
Input | binaural, difference |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | MLP |
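The late fusion step described in the abstract can be sketched as an MLP trained on the concatenated class-probability outputs of the individual MIL models. The hidden-layer size and training settings below are illustrative assumptions, not the authors' configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_fusion_mlp(model_probs, labels):
    """Train an MLP mapping concatenated per-model class probabilities to scene labels.

    model_probs : list of arrays, each (n_segments, n_classes), one per MIL model
    labels      : (n_segments,) ground-truth scene labels
    """
    fused_input = np.concatenate(model_probs, axis=1)  # (n_segments, n_models * n_classes)
    fusion = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    return fusion.fit(fused_input, labels)
```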
Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling
Hangting Chen, Zuozhen Liu, Zongming Liu, Pengyuan Zhang and Yonghong Yan
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China
Zhang_IOA_task1a_1 Zhang_IOA_task1a_2 Zhang_IOA_task1a_3 Zhang_IOA_task1a_4
Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling
Hangting Chen, Zuozhen Liu, Zongming Liu, Pengyuan Zhang and Yonghong Yan
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China
Abstract
This technical report describes the IOA team's submission for Task 1A of the DCASE 2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversarial networks. Two major classifiers, a 1D deep convolutional neural network integrated with scalogram features and a 2D fully convolutional neural network integrated with Mel filter bank features, are deployed in the scheme. Other approaches, such as adversarial city adaptation, a temporal module based on the discrete cosine transform, and hybrid architectures, have been developed for further fusion. The results of our experiments indicate that the final fusion systems A-D achieve accuracies above 85% on the officially provided fold 1 evaluation dataset.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Data augmentation | generative neural network; generative neural network, variational autoencoder |
Features | log-mel energies, CQT |
Classifier | CNN |
Decision making | average vote |
Acoustic Scene Classification Based on Ensemble System
Biyun Ding, Ganjun Liu and Jinhua Liang
School of Electrical and Information Engineering, TianJin University, Tianjin, China
DSPLAB_TJU_task1a_1 DSPLAB_TJU_task1a_2 DSPLAB_TJU_task1a_3 DSPLAB_TJU_task1a_4
Acoustic Scene Classification Based on Ensemble System
Biyun Ding, Ganjun Liu and Jinhua Liang
School of Electrical and Information Engineering, TianJin University, Tianjin, China
Abstract
This technical report addresses Task 1A, acoustic scene classification, of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this task, the audio features used affect the performance. To improve the performance, we implement acoustic scene classification using multiple features and an ensemble system composed of a CNN and a GMM. According to the experiments performed on the DCASE 2019 challenge development dataset, the class-average accuracy of the GMM with 103 features is 64.3%, an improvement of 4.2% compared to the baseline CNN. In addition, the class-average accuracy of the ensemble system is 66.3%, an improvement of 7.4% compared to the baseline CNN.
System characteristics
Input | mono; mono, left, right, mixed |
Sampling rate | 48kHz |
Features | log-mel energies; MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid |
Classifier | GMM; GMM, CNN |
Decision making | majority vote |
Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs
Hamid Eghbal-zadeh, Khaled Koutini and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Eghbal-zadeh_CPJKU_task1b_1 Eghbal-zadeh_CPJKU_task1b_2 Eghbal-zadeh_CPJKU_task1b_3 Eghbal-zadeh_CPJKU_task1b_4
Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs
Hamid Eghbal-zadeh, Khaled Koutini and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Abstract
In this report, we detail the CP-JKU submissions to the DCASE 2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural network architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of ResNet and DenseNet architectures to best fit the various audio processing tasks that use spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels, to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year's submissions is to provide the best-performing single-model submission, using our proposed approaches.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | perceptual weighted power spectrogram |
Classifier | CNN, Receptive Field Regularization |
Decision making | average |
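These submissions list maximum mean discrepancy (MMD) as the device-mismatch handling mechanism. A minimal sketch of a Gaussian-kernel MMD penalty between embeddings of device-A and device-B/C segments (PyTorch; the kernel bandwidth is an arbitrary assumption and this is not the authors' exact formulation):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel values between two batches of embeddings."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_loss(source, target, sigma=1.0):
    """Squared maximum mean discrepancy between source- and target-device embeddings.

    source, target : tensors of shape (batch, embedding_dim)
    """
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st
```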
Acoustic Scene Classification Based on the Dataset with Deep Convolutional Generated Against Network
Ning FangLi and Duan Shuang
Mechanical Engineering, Northwestern Polytechnical University School, 127 West Youyi Road, Xi'an, 710072, China
Li_NPU_task1a_1 Li_NPU_task1a_2
Acoustic Scene Classification Based on the Dataset with Deep Convolutional Generated Against Network
Ning FangLi and Duan Shuang
Mechanical Engineering, Northwestern Polytechnical University School, 127 West Youyi Road, Xi'an, 710072, China
Abstract
As is widely known, convolutional neural networks have been the most successful solution for image classification challenges. From the results of DCASE 2018 [1], convolutional neural networks have also achieved excellent results in the classification of acoustic scenes. Therefore, our team also adopted convolutional neural networks for DCASE 2019 Task 1a. In order to expose more of the audio features, our team used multiple Mel-spectrograms to characterize the audio, trained multiple classifiers, and finally weighted the prediction results of each classifier to form an ensemble. The performance of a classifier is largely limited by the quality and quantity of the data. From the results of the technical report [2], using a GAN to augment the dataset can play a vital role in the final performance, so our team also introduced Deep Convolutional GANs (DCGAN) [3] in our solution to the Task 1a challenge. Our model ultimately achieved an accuracy of 0.846 on the development set and an accuracy of 0.671 on the leaderboard dataset.
System characteristics
Input | mixed |
Sampling rate | 48kHz |
Data augmentation | DCGAN |
Features | log-mel energies |
Classifier | CNN |
Decision making | majority vote |
Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps
Ruben Fraile, Juan Carlos Reina, Juana Gutierrez-Arriola and Elena Blanco
CITSEM, Universidad Politecnica de Madrid, Madrid, Spain
Fraile_UPM_task1a_1
Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps
Ruben Fraile, Juan Carlos Reina, Juana Gutierrez-Arriola and Elena Blanco
CITSEM, Universidad Politecnica de Madrid, Madrid, Spain
Abstract
A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard Gaussian Mixture Model with 64 components.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Features | spectrogram, modulation spectrum, position-pitch maps |
Classifier | GMM |
Decision making | average log-likelihood |
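A minimal sketch of the classifier described above: one 64-component Gaussian Mixture Model per scene class, with the decision taken by the highest average log-likelihood. The diagonal covariance choice and feature layout are illustrative assumptions:

```python
from sklearn.mixture import GaussianMixture

def train_scene_gmms(features_per_scene, n_components=64):
    """Fit one GMM per acoustic scene on its frame-level feature vectors.

    features_per_scene : dict mapping scene name -> array (n_frames, n_dims)
    """
    return {scene: GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)
            for scene, X in features_per_scene.items()}

def classify_segment(gmms, segment_frames):
    """Pick the scene whose GMM yields the highest average frame log-likelihood."""
    scores = {scene: gmm.score(segment_frames) for scene, gmm in gmms.items()}
    return max(scores, key=scores.get)
```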
Acoustic Scene Classification Using Deep Residual Networks with Late Fusion of Separated High and Low Frequency Paths
Wei Gao and Mark McDonnell
School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, Australia
McDonnell_USA_task1a_1 McDonnell_USA_task1a_2 McDonnell_USA_task1a_3 McDonnell_USA_task1a_4 McDonnell_USA_task1b_1 McDonnell_USA_task1b_2 McDonnell_USA_task1b_3 McDonnell_USA_task1b_4 McDonnell_USA_task1c_1 McDonnell_USA_task1c_2 McDonnell_USA_task1c_3 McDonnell_USA_task1c_4
Acoustic Scene Classification Using Deep Residual Networks with Late Fusion of Separated High and Low Frequency Paths
Wei Gao and Mark McDonnell
School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, Australia
Abstract
This technical report describes our approach to Tasks 1a, 1b and 1c in the 2019 DCASE acoustic scene classification challenge. Our focus was on developing strong single models, without use of any supplementary data. We investigated the use of a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We designed the network to take into account that the temporal and frequency axes in spectrograms represent fundamentally different information. In particular, we used two pathways in the residual network: one for high frequencies and one for low frequencies, that were fused just two convolutional layers prior to the network output.
System characteristics
Input | left, right; mono |
Sampling rate | 48kHz; 44.1kHz |
Data augmentation | mixup, temporal cropping |
Features | log-mel energies, deltas and delta-deltas; log-mel energies |
Classifier | CNN |
Decision making | average |
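The two-pathway idea described in the abstract (separate processing of low and high frequencies, fused near the output) can be sketched as follows in PyTorch. This toy network only illustrates the frequency split and late fusion; it does not reproduce the authors' residual architecture, layer counts, or training setup:

```python
import torch
import torch.nn as nn

class TwoPathCNN(nn.Module):
    """Toy CNN with separate convolutional paths for low and high mel bands."""

    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        def conv_path():
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
        self.low_path = conv_path()    # lower half of the mel axis
        self.high_path = conv_path()   # upper half of the mel axis
        self.classifier = nn.Linear(64, n_classes)
        self.split = n_mels // 2

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        low = self.low_path(x[:, :, :self.split, :]).flatten(1)
        high = self.high_path(x[:, :, self.split:, :]).flatten(1)
        return self.classifier(torch.cat([low, high], dim=1))
```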
Acoustic Scene Classification Using CNN Ensembles and Primary Ambient Extraction
Yang Haocong, Shi Chuang and Li Huiyong
Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
Yang_UESTC_task1a_1 Yang_UESTC_task1a_2 Yang_UESTC_task1a_3
Acoustic Scene Classification Using CNN Ensembles and Primary Ambient Extraction
Yang Haocong, Shi Chuang and Li Huiyong
Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
Abstract
This report describes our submission for Task 1a (acoustic scene classification) of the DCASE 2019 challenge. The results of the DCASE 2018 challenge demonstrate that convolutional neural networks (CNNs) and their ensembles can achieve excellent classification accuracies. Inspired by previous works, our method continues to work on ensembles of CNNs, whereas primary ambient extraction is newly introduced to decompose a binaural audio sample into four channels by using the spatial information. The feature extraction is still carried out with mel spectrograms. Six CNN models are trained using 4-fold cross-validation. An ensemble is applied to further improve the performance. Finally, our method has achieved a classification accuracy of 0.84 on the public leaderboard.
System characteristics
Input | mono, binaural |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | average; random forest |
Acoustic Scene Classification Using Deep Learning-Based Ensemble Averaging
Jonathan Huang1, Paulo Lopez Meyer2, Hong Lu1, Hector Cordourier Maruri2 and Juan Del Hoyo2
1Intel Labs, Intel Corporation, Santa Clara, CA, USA, 2Intel Labs, Intel Corporation, Zapopan, Jalisco, Mexico
Huang_IL_task1a_1 Huang_IL_task1a_2 Huang_IL_task1a_3 Huang_IL_task1a_4
Acoustic Scene Classification Using Deep Learning-Based Ensemble Averaging
Jonathan Huang1, Paulo Lopez Meyer2, Hong Lu1, Hector Cordourier Maruri2 and Juan Del Hoyo2
1Intel Labs, Intel Corporation, Santa Clara, CA, USA, 2Intel Labs, Intel Corporation, Zapopan, Jalisco, Mexico
Abstract
In our submission to DCASE 2019 Task 1a, we have explored the use of four different deep-learning-based neural network architectures: Vgg12, ResNet50, AclNet, and AclSincNet. In order to improve performance, these four network architectures were pretrained on Audioset data and then fine-tuned on the development set for the task. The outputs produced by these networks, due to the diversity of feature front-ends and architectural differences, proved to be complementary when fused together. The ensemble of these models' outputs improved accuracy from the best single model's 77.9% to 83.0% on the validation set, trained with the challenge's default development split.
System characteristics
Input | mono; mono , binaural; binaural |
Sampling rate | 16kHz; 48kHz, 16kHz; 48kHz |
Data augmentation | mixup |
Features | raw waveform, log-mel energies; log-mel energies |
Classifier | CNN |
Decision making | Max value of soft ensemble |
Acoustic Scene Classification Based on Deep Convolutional Neuralnetwork with Spatial-Temporal Attention Pooling
Zhenyi Huang and Dacan Jiang
School of Computer, South China Normal University, Guangzhou, China
Huang_SCNU_task1a_1
Acoustic Scene Classification Based on Deep Convolutional Neuralnetwork with Spatial-Temporal Attention Pooling
Zhenyi Huang and Dacan Jiang
School of Computer, South China Normal University, Guangzhou, China
Abstract
Acoustic scene classification is a challenging task in machine learning with limited data sets. In this report, several different spectrograms are applied to classify acoustic scenes using a deep convolutional neural network with spatial-temporal attention pooling. In addition, mixup augmentation is performed to further improve the classification performance. Finally, majority voting is performed over six different models and an accuracy of 73.86% is achieved, which is 11.36 percentage points higher than that of the baseline system.
System characteristics
Input | left, right, mixed |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | MFCC, CQT |
Classifier | CNN |
Decision making | majority vote |
Acoustic Scene Classification Using Various Pre-Processed Features and Convolutional Neural Networks
Seo Hyeji and Park Jihwan
Advanced Robotics Lab, LG Electronics, Seoul, Korea
Seo_LGE_task1a_1 Seo_LGE_task1a_2 Seo_LGE_task1a_3 Seo_LGE_task1a_4
Acoustic Scene Classification Using Various Pre-Processed Features and Convolutional Neural Networks
Seo Hyeji and Park Jihwan
Advanced Robotics Lab, LG Electronics, Seoul, Korea
Abstract
In this technical report, we describe our acoustic scene classification algorithm submitted to DCASE 2019 Task 1a. We focus on various pre-processed features to categorize the class of acoustic scenes using only the stereo microphone input signal. In the front-end system, the pre-processed and spatial information is extracted from the stereo microphone input. A residual network, a subspectral network, and a conventional convolutional neural network (CNN) are used as back-end systems. Finally, we ensemble all of the models to take advantage of each algorithm. Using the proposed systems, we achieved a classification accuracy of 80.4%, which is 17.9 percentage points higher than the baseline system.
System characteristics
Input | mono, binaural |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies, spectrogram, chromagram |
Classifier | CNN |
Decision making | average |
Acoustic Scene Classification Using Ensembles of Convolutional Neural Networks and Spectrogram Decompositions
Shengwang Jiang and Chuang Shi
School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China
Jiang_UESTC_task1b_1 Jiang_UESTC_task1b_2 Jiang_UESTC_task1b_3 Jiang_UESTC_task1b_4
Acoustic Scene Classification Using Ensembles of Convolutional Neural Networks and Spectrogram Decompositions
Shengwang Jiang and Chuang Shi
School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China
Abstract
This technical report proposes ensembles of convolutional neural networks (CNNs) for the task 1 / subtask B of the DCASE 2019 challenge, with emphasis on using different spectrogram decompositions. The harmonic percussive source separation (HPSS), nearest neighbor filter (NNF), and vocal separation are applied to the monaural samples. Head-related transfer function (HRTF) is also proposed to transform monaural samples to binaural ones with augmented spatial information. Finally, 16 neural networks are trained and put together. The classification accuracy of the proposed system achieves 0.70166 on the public leaderboard.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | HPSS, NNF, vocal separation, HRTF |
Features | log-mel energies |
Classifier | CNN |
Decision making | stacking; averaging |
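The spectrogram decompositions used as augmentation here (HPSS, nearest neighbor filtering, vocal separation) are available in librosa. A minimal sketch of deriving harmonic and percussive variants of a clip before feature extraction; the sampling rate and default separation parameters are illustrative:

```python
import librosa

def hpss_variants(path, sr=44100):
    """Split a recording into harmonic and percussive components for augmentation."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y_harmonic, y_percussive = librosa.effects.hpss(y)  # time-domain HPSS
    return y_harmonic, y_percussive
```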
Knowledge Distillation with Specialist Models in Acoustic Scene Classification
Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim and Ha-Jin Yu
Computing Sciences, University of Seoul, Seoul, Republic of Korea
Jung_UOS_task1a_1 Jung_UOS_task1a_2 Jung_UOS_task1a_3 Jung_UOS_task1a_4
Knowledge Distillation with Specialist Models in Acoustic Scene Classification
Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim and Ha-Jin Yu
Computing Sciences, University of Seoul, Seoul, Republic of Korea
Abstract
In this technical report, we describe our submission for the Detection and Classification of Acoustic Scenes and Events 2019 Task 1a competition, which exploits knowledge distillation with specialist models. Different acoustic scenes that share common properties are one of the main obstacles that hinder successful acoustic scene classification. We found that confusion between scenes sharing common properties causes most of the errors in acoustic scene classification. For example, confusing scene pairs such as airport-shopping mall and metro-tram have caused the most errors in various systems. We applied knowledge distillation based on specialist models to address the errors from the most confusing scene pairs. Specialist models, each of which concentrates on discriminating between a pair of similar scenes, are exploited to provide soft labels. We expected that knowledge distillation from multiple specialist models and a pre-trained generalist model to a single model could train an ensemble of models that gives more emphasis to discriminating specific acoustic scene pairs. Through knowledge distillation from the well-trained model and the specialist models to a single model, we report improved accuracy on the validation set.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | raw waveform, log-mel energies |
Classifier | CNN |
Decision making | majority vote; score-sum |
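A minimal sketch of the knowledge-distillation loss the abstract describes: the student matches temperature-softened soft labels from a teacher (the generalist or a specialist) in addition to the hard labels (PyTorch; the temperature and weighting are illustrative assumptions, not the authors' settings):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence from a teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                    F.softmax(teacher_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```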
The I2r Submission to DCASE 2019 Challenge
Teh KK1, Sun HW2 and Tran Huy Dat2
1I2R, A-star, Singapore, 2I2R, A-Star, Singapore
KK_I2R_task1a_1 KK_I2R_task1a_2 KK_I2R_task1a_3
The I2r Submission to DCASE 2019 Challenge
Teh KK1, Sun HW2 and Tran Huy Dat2
1I2R, A-star, Singapore, 2I2R, A-Star, Singapore
Abstract
This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in Task 1A of the DCASE 2019 challenge. In this approach, various pre-processing feature methods (mel-filterbank and delta feature vectors, harmonic-percussive source separation, and subband power distribution) are used to train CNN models. We also used score fusion of the features to find an optimum feature configuration. On the official leaderboard dataset of the Task 1A challenge, an accuracy of 79.67% is achieved.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies, HPSS; log-mel energies, HPSS, subband power distribution |
Classifier | CNN |
Decision making | weighted averaging vote; multi-class linear logistic regression |
Calibrating Neural Networks for Secondary Recording Devices
Michał Kośmider
Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland
Kosmider_SRPOL_task1b_1 Kosmider_SRPOL_task1b_2 Kosmider_SRPOL_task1b_3 Kosmider_SRPOL_task1b_4
Calibrating Neural Networks for Secondary Recording Devices
Michał Kośmider
Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland
Abstract
This report describes the solution to Task 1B of the DCASE 2019 challenge proposed by Samsung R&D Institute Poland. The primary focus of the system for Task 1B was a novel technique designed to address issues with learning from microphones with different frequency responses in settings with limited examples for the targeted secondary devices. This technique is independent of the architecture of the predictive model and requires just a few examples to become effective.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | Spectrum Correction, SpecAugment, mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | isotonic-calibrated soft-voting; soft-voting |
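A minimal sketch of one plausible reading of the spectrum-correction idea: estimate per-frequency correction factors from time-aligned parallel recordings of the reference device (A) and a secondary device (B or C), then rescale the secondary-device spectrogram bin by bin. This is an interpretation for illustration, not the author's exact procedure:

```python
import numpy as np

def correction_coefficients(ref_spectra, dev_spectra, eps=1e-8):
    """Per-frequency correction factors from parallel reference/secondary spectrograms.

    ref_spectra, dev_spectra : arrays of shape (n_clips, n_freq, n_frames)
                               from time-aligned recordings of the same scenes
    """
    ref_mean = ref_spectra.mean(axis=(0, 2))
    dev_mean = dev_spectra.mean(axis=(0, 2))
    return ref_mean / (dev_mean + eps)

def apply_correction(spectrogram, coeffs):
    """Rescale each frequency bin of a secondary-device spectrogram."""
    return spectrogram * coeffs[:, None]
```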
Cross-Task Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems
Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England
Kong_SURREY_task1a_1 Kong_SURREY_task1b_1 Kong_SURREY_task1c_1
Cross-Task Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems
Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.
System characteristics
Input | mono |
Sampling rate | 32kHz |
Features | log-mel energies |
Classifier | CNN |
Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs
Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Koutini_CPJKU_task1a_1 Koutini_CPJKU_task1a_2 Koutini_CPJKU_task1a_3 Koutini_CPJKU_task1a_4
Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs
Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Abstract
In this report, we detail the CP-JKU submissions to the DCASE 2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural network architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of ResNet and DenseNet architectures to best fit the various audio processing tasks that use spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels, to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year's submissions is to provide the best-performing single-model submission, using our proposed approaches.
System characteristics
Input | binaural |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | perceptual weighted power spectrogram |
Classifier | CNN, Receptive Field Regularization |
Decision making | average |
Acoustic Scene Classification with Reject Option Based on Resnets
Bernhard Lehner1 and Khaled Koutini2
1Silicon Austria Labs, JKU, Linz, Austria, 2Institute of Computational Perception, JKU, Linz, Austria
Lehner_SAL_task1c_1 Lehner_SAL_task1c_2 Lehner_SAL_task1c_3 Lehner_SAL_task1c_4
Acoustic Scene Classification with Reject Option Based on Resnets
Bernhard Lehner1 and Khaled Koutini2
1Silicon Austria Labs, JKU, Linz, Austria, 2Institute of Computational Perception, JKU, Linz, Austria
Abstract
This technical report describes the submissions from the SAL/CP JKU team for Task 1 - Subtask C (classification on data that includes classes not encountered in the training data) of the DCASE-2019 challenge. Our method uses a ResNet variant specifically adapted to be used along with spectrograms in the context of Acoustic Scene Classification (ASC). The reject option is based on the logit values of the same networks. We do not use any of the provided external data sets, and perform data augmentation only with the mixup technique [1]. The result of our experiments is a system that achieves classification accuracies of up to around 60% on the public Kaggle-Leaderboard. This is an improvement of around 14 percentage points compared to the official DCASE 2019 baseline.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | logit averaging |
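Mixup, the only augmentation used by this submission and one of the most common across the challenge, forms convex combinations of pairs of training examples and their labels. A minimal sketch; the Beta parameter is an illustrative assumption:

```python
import numpy as np

def mixup_batch(features, one_hot_labels, alpha=0.2):
    """Mix a batch of spectrograms and one-hot labels with a Beta-distributed weight."""
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(features))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y
```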
Multi-Scale Recalibrated Features Fusion for Acoustic Scene Classification
Chongqin Lei and Zixu Wang
Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China
Lei_CQU_task1a_1
Multi-Scale Recalibrated Features Fusion for Acoustic Scene Classification
Chongqin Lei and Zixu Wang
Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China
Abstract
We investigate the effectiveness of multi-scale recalibrated feature fusion for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019). A general problem in the acoustic scene classification task is that an audio signal segment contains little effective information. In order to further utilize features with less effective information to improve classification accuracy, we introduce the Squeeze-and-Excitation unit, embedded in the backbone structure of Xception, to recalibrate the channel weights of the feature maps in each block. In addition, the recalibrated features of multiple scales are fused and finally fed into the fully connected layer to obtain more useful information. Furthermore, we introduce the mixup method to augment the data in the training stage to reduce the degree of over-fitting of the network. The proposed method attains a recognition accuracy of 77.5%, which is 13% higher than the baseline system of the DCASE 2019 acoustic scene classification task.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Acoustic Scene Classification Using Attention-Based Convolutional Neural Network
Han Liang and Yaxiong Ma
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
Liang_HUST_task1a_1 Liang_HUST_task1a_2
Acoustic Scene Classification Using Attention-Based Convolutional Neural Network
Han Liang and Yaxiong Ma
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
Abstract
This technical report describes our submission to Task 1, Subtask A (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge, whose goal is to classify a test audio recording into one of the predefined classes that characterize the environment. We decided to use the mel-spectrogram as the audio feature and deep convolutional neural networks (CNNs) as the classifier. In our method, the spectrogram of every audio clip is divided in two ways. In addition, we introduce an attention mechanism to further improve the performance. Experimental results show that our best model achieves a classification accuracy of around 70.7% on the development dataset, which is superior to the baseline system's accuracy of 62.5%.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | CNN |
Jsnu_wdxy Submission for DCASE-2019: Acoustic Scene Classification with Convolution Neural Networks
Xinixn Ma and Mingliang Gu
School of Physics and Electronic, Jiangsu Normal University, Xuzhou, China
JSNU_WDXY_task1a_1
Jsnu_wdxy Submission for DCASE-2019: Acoustic Scene Classification with Convolution Neural Networks
Xinixn Ma and Mingliang Gu
School of Physics and Electronic, Jiangsu Normal University, Xuzhou, China
Abstract
Acoustic Scene Classification (ASC) is the task of identifying the scene from which an audio signal was recorded. It is one of the core research problems in the field of computational sound scene analysis. Most of the current best-performing acoustic scene classification systems use Mel-scale spectrograms with Convolutional Neural Networks (CNNs). In this paper, we demonstrate how we applied convolutional neural networks to DCASE 2019 Task 1, acoustic scene classification. First, we apply the Mel-scale spectrogram to extract acoustic features; the Mel scale is a common way to model the frequency warping of human ears, with strictly decreasing frequency resolution from low to high frequencies. Second, we generate Mel spectrograms from the binaural audio and adaptively learn 5 convolutional neural networks. The best classification result of the proposed system was 71.1% on the development dataset and 73.16% on the leaderboard dataset.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with Neural Network
Sifan Ma and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China
Abstract
This technical report presents our approach for the acoustic scene classification of DCASE 2019 Task 1a. Compared to traditional audio features such as Mel-frequency Cepstral Coefficients (MFCC) and the Constant-Q Transform (CQT), we choose Deep Scattering Spectra (DSS) features, which are more suitable for characterizing acoustic scenes. DSS is a good way to preserve high-frequency information. Based on DSS features, we choose a network model combining a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) to classify acoustic scenes. The experimental results show that our approach increases the classification accuracy from 62.5% (DCASE 2019 baseline) to 85%.
System characteristics
Input | left, right |
Sampling rate | 48kHz |
Features | DSS |
Classifier | CNN, DNN |
Acoustic Scene Classification From Binaural Signals Using Convolutional Neural Networks
Rohith Mars, Pranay Pratik, Srikanth Nagisetty and Chong Soon Lim
Core Technology Group, Panasonic R&D Center, Singapore, Singapore
Mars_PRDCSG_task1a_1
Acoustic Scene Classification From Binaural Signals Using Convolutional Neural Networks
Rohith Mars, Pranay Pratik, Srikanth Nagisetty and Chong Soon Lim
Core Technology Group, Panasonic R&D Center, Singapore, Singapore
Abstract
In this report, we present the technical details of our proposed framework and solution for the DCASE 2019 Task 1A - Acoustic Scene Classification challenge. We describe the audio pre-processing, feature extraction steps and the time-frequency (TF) representations used for acoustic scene classification using binaural audio recordings. We employ two distinct architectures of convolutional neural networks (CNNs) for processing the extracted audio features for classification and compare their relative performance in terms of both accuracy and model complexity. Using an ensemble of the predictions from multiple models based on the above CNNs, we achieved an average classification accuracy of 79.35% on the test split of the development dataset for this task and a system score of 82.33% in the Kaggle public leaderboard, which is an improvement of ≈ 18% over the baseline system.
System characteristics
Input | mono, left, right, mid, side |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | max probability |
The System for Acoustic Scene Classification Using Resnet
Liu Mingle and Li Yanxiong
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province
Liu_SCUT_task1a_1 Liu_SCUT_task1a_2 Liu_SCUT_task1a_3 Liu_SCUT_task1a_4
The System for Acoustic Scene Classification Using Resnet
Liu Mingle and Li Yanxiong
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province
Abstract
In this report, we present our work concerning Task 1a of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a strategy of classifier voting for ASC. Specifically, an audio feature, such as the logarithmic filter-bank (LFB), is first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to obtain a classifier ensemble. Finally, the classification result for each test sample is based on the voting of all classifiers.
System characteristics
Input | mono, binaural |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | ResNet |
Decision making | vote |
DCASE 2019: CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification
Javier Naranjo-Alcazar1, Sergi Perez-Castanos1, Pedro Zuccarello1 and Maximo Cobos2
1Visualfy AI, Visualfy, Benisano, Spain, 2Computer Science, Universitat de Valencia, Burjassot, Spain
Naranjo-Alcazar_VfyAI_task1a_1 Naranjo-Alcazar_VfyAI_task1a_2 Naranjo-Alcazar_VfyAI_task1a_3 Naranjo-Alcazar_VfyAI_task1a_4
DCASE 2019: CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification
Javier Naranjo-Alcazar1, Sergi Perez-Castanos1, Pedro Zuccarello1 and Maximo Cobos2
1Visualfy AI, Visualfy, Benisano, Spain, 2Computer Science, Universitat de Valencia, Burjassot, Spain
Abstract
The objective of this technical report is to describe the framework used in Task 1, Acoustic Scene Classification (ASC), of the DCASE 2019 challenge. The presented approach is based on log-Mel spectrogram representations and VGG-based Convolutional Neural Networks (CNNs). Three different CNNs with very similar architectures have been implemented; the main difference is the number of filters in their convolutional blocks. Experiments show that the depth of the network is not the most relevant factor for improving accuracy: performance appears to be more sensitive to the input audio representation. This conclusion is important for implementing real-time audio recognition and classification systems on edge devices. In the presented experiments, the best audio representation is the log-Mel spectrogram of the harmonic and percussive sources plus the log-Mel spectrogram of the difference between the left and right stereo channels (L − R). In addition, to improve accuracy, ensemble methods combining predictions from different models with different inputs are explored. Besides geometric and arithmetic means, ensembles aggregated with the Orness Weighted Average (OWA) operator have shown interesting and novel results. The proposed framework outperforms the baseline system by 14.34 percentage points: for Task 1a, the development accuracy is 76.84% against a 62.5% baseline, and the public leaderboard accuracy is 77.33% against a 64.33% baseline.
System characteristics
Input | mono, left, right, difference, harmonic, percussive |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | ensemble, CNN |
Decision making | arithmetic mean; geometric mean; orness weighted average |
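The best-performing input described above stacks log-Mel spectrograms of the harmonic and percussive components with a log-Mel spectrogram of the L-R channel difference. A rough librosa sketch of such a front-end, with assumed STFT/mel parameters and a placeholder file name:

```python
import numpy as np
import librosa

def logmel(y, sr, n_mels=128):
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=1024, n_mels=n_mels)
    return librosa.power_to_db(m)

# load a stereo recording: shape (2, samples); "scene.wav" is illustrative
stereo, sr = librosa.load("scene.wav", sr=48000, mono=False)
left, right = stereo[0], stereo[1]
mono = 0.5 * (left + right)

harmonic, percussive = librosa.effects.hpss(mono)   # source separation on the mono mix
channels = np.stack([logmel(harmonic, sr),
                     logmel(percussive, sr),
                     logmel(left - right, sr)])      # (3, n_mels, frames) CNN input
```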
DCASE 2019 Task 1a: Acoustic Scene Classification by SFFCC and DNN
Chandrasekhar Paseddula1 and Suryakanth V.Gangashetty2
1Electronics and Communication Engineering, International Institute of Information Technology Hyderabad, Hyderabad, India, 2Computer Science Engineering, International Institute of Information Technology Hyderabad, Hyderabad, India
Chandrasekhar_IIITH_task1a_1
DCASE 2019 Task 1a: Acoustic Scene Classification by SFFCC and DNN
Chandrasekhar Paseddula1 and Suryakanth V.Gangashetty2
1Electronics and Communication Engineering, International Institute of Information Technology Hyderabad, Hyderabad, India, 2Computer Science Engineering, International Institute of Information Technology Hyderabad, Hyderabad, India
Abstract
In this study, we address the acoustic scene classification (ASC) task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge, Task 1A. Single frequency filtering cepstral coefficient (SFFCC) features and a deep neural network (DNN) model are proposed for ASC. We adopt a late fusion mechanism to further improve performance, and validate the model against the baseline system. We used the TAU Urban Acoustic Scenes 2019 development dataset for training and cross-validation, obtaining a 7.9% improvement over the baseline system.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | single frequency cepstral coefficients (SFCC), log-mel energies |
Classifier | DNN |
Decision making | maxrule |
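The late fusion by the max rule can be read as taking the element-wise maximum of the per-class posteriors produced by the SFFCC-based and log-mel-based DNNs before the final argmax; a toy numpy sketch with made-up scores:

```python
import numpy as np

def max_rule_fusion(posteriors_a, posteriors_b):
    """Late fusion: element-wise maximum of two systems' class posteriors."""
    fused = np.maximum(posteriors_a, posteriors_b)
    return fused.argmax(axis=-1)

# e.g. posteriors for 5 test segments over 10 scene classes
p_sffcc = np.random.dirichlet(np.ones(10), size=5)
p_logmel = np.random.dirichlet(np.ones(10), size=5)
predicted_scene = max_rule_fusion(p_sffcc, p_logmel)
```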
CDNN-CRNN Joined Model for Acoustic Scene Classification
Lam Pham1, Tan Doan2, Dat Thanh Ngo2, Hung Nguyen2 and Ha Hoang Kha2
1School of Computing, University of Kent, Chatham, United Kingdom, 2Electrical and Electronics Engineering, HoChiMinh City University of Technology, HoChiMinh City, Vietnam
LamPham_HCMGroup_task1a_1
CDNN-CRNN Joined Model for Acoustic Scene Classification
Lam Pham1, Tan Doan2, Dat Thanh Ngo2, Hung Nguyen2 and Ha Hoang Kha2
1School of Computing, University of Kent, Chatham, United Kingdom, 2Electrical and Electronics Engineering, HoChiMinh City University of Technology, HoChiMinh City, Vietnam
Abstract
This work proposes a deep learning framework for Acoustic Scene Classification (ASC), targeting DCASE2019 task 1A. The front-end combines three types of spectrograms: Gammatone (GAM), log-Mel and Constant Q Transform (CQT). The back-end is a joint learning model between a CDNN and a CRNN. Our experiments on the development dataset of DCASE2019 task 1A show a significant improvement of 11.2% over the DCASE2019 baseline of 62.5%. The Kaggle leaderboard reports a classification accuracy of 74.6% when we train on the full development dataset.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | Gammatone, log-mel energies, CQT |
Classifier | CNN, RNN |
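Two of the three spectrogram types (log-Mel and CQT) can be computed directly with librosa; a Gammatone spectrogram requires a separate filterbank toolbox and is only marked as a placeholder in this illustrative sketch (parameters and file name are assumptions):

```python
import numpy as np
import librosa

y, sr = librosa.load("scene.wav", sr=48000, mono=True)   # file name is illustrative

logmel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=1024))
cqt = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=1024, n_bins=96, bins_per_octave=12)))
# gammatone = ...  # from an external gammatone filterbank package, omitted here

# each spectrogram feeds its own network; per-spectrogram posteriors are then combined
```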
A Multi-Spectrogram Deep Neural Network for Acoustic Scene Classification
Lam Pham, Ian McLoughlin, Huy Phan and Ramaswamy Palaniappan
School of Computing, University of Kent, Chatham, United Kingdom
LamPham_KentGroup_task1a_1 LamPham_KentGroup_task1b_1
A Multi-Spectrogram Deep Neural Network for Acoustic Scene Classification
Lam Pham, Ian McLoughlin, Huy Phan and Ramaswamy Palaniappan
School of Computing, University of Kent, Chatham, United Kingdom
Abstract
This work targets tasks 1A and 1B of the DCASE2019 challenge: Acoustic Scene Classification (ASC) over ten classes recorded with a single device (task 1A) and with mismatched devices (task 1B). For the front-end feature extraction, this work proposes a combination of three types of spectrograms: Gammatone (GAM), log-Mel and Constant Q Transform (CQT). The back-end classification uses two training processes, a pre-trained CNN and a post-trained DNN, and the result of the post-trained DNN is reported. Our experiments on the DCASE2019 1A and 1B development datasets show significant improvements, increasing accuracy by 14% and 17.4% compared to the DCASE2019 baselines of 62.5% and 41.4%, respectively. The Kaggle leaderboard also confirms classification accuracies of 79% and 69.2% for tasks 1A and 1B.
System characteristics
Input | mono |
Sampling rate | 48kHz; 44.1kHz |
Data augmentation | mixup |
Features | Gammatone, log-mel energies, CQT |
Classifier | CNN, DNN |
Deep Neural Networks with Supported Clusters Preclassification Procedure for Acoustic Scene Recognition
Marcin Plata
Data Intelligence Group, Samsung R&D Institute Poland, Warsaw, Poland
Plata_SRPOL_task1a_1 Plata_SRPOL_task1a_2 Plata_SRPOL_task1a_3 Plata_SRPOL_task1a_4
Deep Neural Networks with Supported Clusters Preclassification Procedure for Acoustic Scene Recognition
Marcin Plata
Data Intelligence Group, Samsung R&D Institute Poland, Warsaw, Poland
Abstract
In this technical report, we present a system for acoustic scene classification that focuses on a deeper analysis of the data. We analyze the impact of various combinations of parameters for the short-time Fourier transform (STFT) and the Mel filter bank. We also use the harmonic and percussive source separation (HPSS) algorithm as an additional feature extractor. Finally, alongside common classification networks operating on divided, non-overlapping spectrograms, we present an out-of-the-box solution with one main neural network trained on clustered labels and a few supporting neural networks to distinguish between the most difficult scenes, e.g. street pedestrian and public square.
System characteristics
Input | mono, left, right |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies, harmonic, percussive |
Classifier | CNN, random forest; CNN |
Decision making | random forest; majority vote |
Acoustic Scene Classification with Mismatched Recording Devices
Paul Primus and David Eitelsebner
Computational Perception, Johannes Kepler University Linz, Linz, Austria
Abstract
This technical report describes the CP-JKU student team's approach for Task 1 - Subtask B of the DCASE 2019 challenge. In this context, we propose two loss functions for domain adaptation that learn invariant representations from time-aligned recordings. We show that these methods improve classification performance on our cross-validation setup, as well as on the Kaggle leaderboard, by up to three percentage points compared to our baseline model. Our best scoring submission is an ensemble of eight classifiers.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, ensemble |
Decision making | average |
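The MSE variant of the proposed domain-adaptation losses (submission CPR-MSE above) can be read as pulling together the embeddings of time-aligned recordings of the same segment captured with different devices. A minimal PyTorch sketch under that reading; `model.embed`, `model.classify` and the loss weighting are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, x_device_a, x_device_b, labels, da_weight=1.0):
    """Classification loss plus an MSE alignment term between paired embeddings.

    x_device_a / x_device_b hold time-aligned recordings of the same segments
    captured with two different devices; `model.embed` and `model.classify`
    are assumed hooks into the shared network.
    """
    emb_a = model.embed(x_device_a)
    emb_b = model.embed(x_device_b)
    logits = model.classify(emb_a)
    cls_loss = F.cross_entropy(logits, labels)
    da_loss = F.mse_loss(emb_a, emb_b)            # encourages device invariance
    return cls_loss + da_weight * da_loss
```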
Frequency-Aware CNN for Open Set Acoustic Scene Classification
Alexander Rakowski1 and Michał Kośmider2
1Audio Intelligence, Samsung R&D Institute Poland, Warsaw, Poland, 2Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland
Rakowski_SRPOL_task1c_1 Rakowski_SRPOL_task1c_2 Rakowski_SRPOL_task1c_3 Rakowski_SRPOL_task1c_4
Frequency-Aware CNN for Open Set Acoustic Scene Classification
Alexander Rakowski1 and Michał Kośmider2
1Audio Intelligence, Samsung R&D Institute Poland, Warsaw, Poland, 2Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland
Abstract
This report describes systems used for Task 1c of the DCASE 2019 Challenge - Open Set Acoustic Scene Classification. The main system consists of a 5-layer convolutional neural network which preserves the location of features on the frequency axis. This is in contrast to the standard approach where global pooling is applied along the frequency-related dimension. Additionally the main system is combined with an ensemble of calibrated neural networks in order to improve generalization.
System characteristics
Input | mono |
Sampling rate | 32kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | soft-voting |
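The frequency-aware design keeps the position of learned features along the frequency axis instead of pooling it away. A toy PyTorch classification head contrasting the two choices (channel and band counts are assumptions):

```python
import torch
import torch.nn as nn

class FrequencyAwareHead(nn.Module):
    """Pool only over time, then let a linear layer see each frequency position."""
    def __init__(self, channels=128, n_freq=16, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(channels * n_freq, n_classes)

    def forward(self, feature_map):               # (B, C, F, T)
        pooled = feature_map.mean(dim=3)          # average over time only -> (B, C, F)
        return self.fc(pooled.flatten(1))         # frequency location is preserved

# a standard head would instead do feature_map.mean(dim=(2, 3)),
# discarding where on the frequency axis a pattern occurred
head = FrequencyAwareHead()
logits = head(torch.randn(4, 128, 16, 50))
```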
Urban Acoustic Scene Classification Using Raw Waveform Convolutional Neural Networks
Daniele Salvati, Carlo Drioli and Gian Luca Foresti
Mathematics, Computer Science and Physics, University of Udine, Udine, Italy
Salvati_DMIF_task1a_1
Urban Acoustic Scene Classification Using Raw Waveform Convolutional Neural Networks
Daniele Salvati, Carlo Drioli and Gian Luca Foresti
Mathematics, Computer Science and Physics, University of Udine, Udine, Italy
Abstract
We present the signal processing framework and the results obtained with the development dataset (task 1, subtask A) for the detection and classification of acoustic scenes and events (DCASE 2019) challenge. The framework for the classification of urban acoustic scenes consists of a raw waveform (RW) end-to-end computational scheme based on convolutional neural networks (CNNs). The RW-CNN operates on a time-domain signal segment of 0.5 s and consists of 5 one-dimensional convolutional layers and 3 fully connected layers. The overall classification accuracy with the development dataset of the proposed RW-CNN is 69.7 %.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | raw waveform |
Classifier | CNN |
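The abstract specifies a raw-waveform CNN with five one-dimensional convolutional layers and three fully connected layers operating on 0.5 s segments (24000 samples at 48 kHz), but not the layer sizes; the PyTorch sketch below only mirrors that overall shape with assumed hyperparameters.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, stride):
    return nn.Sequential(nn.Conv1d(c_in, c_out, k, stride=stride),
                         nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(4))

class RawWaveformCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16, 64, 2), conv_block(16, 32, 16, 2),
            conv_block(32, 64, 8, 1), conv_block(64, 64, 4, 1),
            conv_block(64, 128, 4, 1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, x):               # x: (B, 1, 24000) raw samples
        return self.classifier(self.features(x))

logits = RawWaveformCNN()(torch.randn(2, 1, 24000))
```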
Acoustic Scene Classification Using Specaugment and Convolutional Neural Network with Inception Modules
Suh Sangwon, Jeong Youngho, Lim Wootaek and Park Sooyoung
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea
SSW_ETRI_task1a_1 SSW_ETRI_task1a_2 SSW_ETRI_task1a_3 SSW_ETRI_task1a_4
Acoustic Scene Classification Using Specaugment and Convolutional Neural Network with Inception Modules
Suh Sangwon, Jeong Youngho, Lim Wootaek and Park Sooyoung
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea
Abstract
This paper describes the system submitted to Task 1a (Acoustic Scene Classification, ASC). By analyzing the major systems submitted in 2017 and 2018, we selected a two-dimensional convolutional neural network (CNN) as the most suitable model for this task. The proposed model is composed of four convolution blocks; the first two are conventional CNN structures, while the following two consist of Inception modules. We constructed a meta-learning problem with this model in order to train a super learner. For each base model, we applied a different validation split to obtain better-generalized results from the ensemble. In addition, we applied data augmentation in real time with SpecAugment for each base model. With our final system, which applies all of the above techniques, we achieved an accuracy of 76.1% on the development dataset and 81.3% on the leaderboard set.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | SpecAugment |
Features | log-mel energies |
Classifier | CNN; ensemble |
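SpecAugment, used here for real-time augmentation, masks random frequency bands and time spans of each log-mel patch. A minimal numpy sketch; time warping is omitted and the mask sizes are illustrative.

```python
import numpy as np

def spec_augment(logmel, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20,
                 rng=np.random.default_rng()):
    """Zero out random frequency bands and time spans of a (bands, frames) patch."""
    out = logmel.copy()
    n_bands, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, n_bands - f + 1)
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out

augmented = spec_augment(np.random.randn(128, 431))
```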
Feature Enhancement for Robust Acoustic Scene Classification with Device Mismatch
Hongwei Song and Hao Yang
Computer Sciences and Technology, Harbin Institute of Technology, Harbin, China
Abstract
This technical report describes our system for DCASE2019 Task 1 Subtask B. We focus on analyzing how device distortions affect the classic log Mel feature, the most widely adopted feature for convolutional neural network (CNN) based models. We show mathematically that, for the log Mel feature, the influence of device distortion appears as an additive constant vector over the log Mel spectrogram. Based on this analysis, we propose feature enhancement methods such as spectrogram-wise mean subtraction and median filtering to remove the additive term of the channel distortion. The information loss introduced by these enhancement methods is discussed. We also motivate the use of the mixup technique to generate virtual samples with various device distortions. Combining the proposed techniques, we rank second on the public Kaggle leaderboard.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | probability aggregation; majority vote |
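If the device's channel response adds an approximately constant vector to every frame of the log Mel spectrogram, a per-recording statistic taken along the time axis can remove it. A numpy/scipy sketch of the two enhancement ideas named in the abstract; the median-filter length is an assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def mean_subtraction(logmel):
    """Remove the per-band mean over time, cancelling an additive channel term."""
    return logmel - logmel.mean(axis=1, keepdims=True)

def median_smoothing(logmel, length=5):
    """Median-filter each mel band along time to suppress channel artefacts."""
    return median_filter(logmel, size=(1, length))

x = np.random.randn(128, 431)          # (mel bands, frames) of one recording
enhanced = mean_subtraction(x)
smoothed = median_smoothing(x)
```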
Wavelet Based Mel-Scaled Features for DCASE 2019 Task 1a and Task 1b
Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India
Waldekar_IITKGP_task1a_1 Waldekar_IITKGP_task1b_1
Wavelet Based Mel-Scaled Features for DCASE 2019 Task 1a and Task 1b
Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India
Abstract
This report describes a submission to the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 for Task 1 (acoustic scene classification (ASC)), sub-task A (basic ASC) and sub-task B (ASC with mismatched recording devices). The system exploits a time-frequency representation of audio to obtain the scene labels. It follows a simple pattern classification framework employing wavelet-transform-based mel-scaled features with a support vector machine as the classifier. The proposed system outperforms the deep-learning-based baseline system by almost 8% (relative) for sub-task A and 26% for sub-task B on the development datasets provided for the respective sub-tasks.
System characteristics
Input | mono |
Sampling rate | 48kHz; 44.1kHz |
Features | MFDWC |
Classifier | SVM |
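One common reading of mel-scaled wavelet features is to replace the DCT step of MFCC extraction with a discrete wavelet transform of the log mel-band energies. The sketch below follows that reading with librosa, PyWavelets and scikit-learn; it is not necessarily the authors' exact MFDWC pipeline, and the wavelet, decomposition level and pooling are assumptions.

```python
import numpy as np
import librosa
import pywt
from sklearn.svm import SVC

def mel_wavelet_features(y, sr, n_mels=40, wavelet="db4"):
    """Log mel-band energies per frame, decorrelated with a DWT instead of a DCT."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=1024)
    log_mel = librosa.power_to_db(mel)                      # (n_mels, frames)
    frames = [np.concatenate(pywt.wavedec(frame, wavelet, level=2))
              for frame in log_mel.T]
    return np.mean(frames, axis=0)                          # one vector per recording

# train_paths / train_labels are placeholders for the development-set file list
# X = np.stack([mel_wavelet_features(*librosa.load(p, sr=48000)) for p in train_paths])
# clf = SVC(kernel="rbf").fit(X, train_labels)
```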
Acoustic Scene Classification Based on CNN System
Zhuhe Wang, Jingkai Ma and Chunyang Li
Noise and Vibration Laboratory, Beijing Technology and Business University, Beijing, China
Wang_BTBU_task1a_1
Acoustic Scene Classification Based on CNN System
Zhuhe Wang, Jingkai Ma and Chunyang Li
Noise and Vibration Laboratory, Beijing Technology and Business University, Beijing, China
Abstract
In this study, we present a solution for acoustic scene classification task 1A in the DCASE 2019 Challenge. Our model uses a convolutional neural network with some improvements over a basic CNN. We extract MFCC (Mel-frequency cepstral coefficient) features from the official audio files and rebuild the dataset from them, which serves as the input to the neural network. Finally, comparing our model to the baseline system, our results were 12% more accurate.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Features | MFCC |
Classifier | CNN |
Ciaic-ASC System for DCASE 2019 Challenge Task1
Mou Wang and Rui Wang
School of Marine Sciences and Technology, Northwestern Polytechnical University, Xi'an, China
Wang_NWPU_task1a_1 Wang_NWPU_task1a_2 Wang_NWPU_task1a_3 Wang_NWPU_task1a_4 Wang_NWPU_task1b_1 Wang_NWPU_task1b_2 Wang_NWPU_task1b_3
Ciaic-ASC System for DCASE 2019 Challenge Task1
Mou Wang and Rui Wang
School of Marine Sciences and Technology, Northwestern Polytechnical University, Xi'an, China
Abstract
In this report, we present our systems for subtask A and subtask B of DCASE 2019 Task 1, i.e. acoustic scene classification. Subtask A is a basic closed-set classification problem with data from a single device. In our system, we first extract several acoustic features, such as the mel-spectrogram, the hybrid constant-Q transform, and harmonic-percussive source separation. Convolutional neural networks (CNNs) with average pooling are used to classify the acoustic scenes, and we average the outputs of CNNs fed with the different features to form an ensemble. Subtask B is a classification problem with mismatched devices, so we introduce a Domain Adaptation Neural Network (DANN) to extract features that are uncorrelated with the recording domain, and further ensemble the DANN with the CNN methods to obtain better performance. The accuracy of our system for subtask A is 0.783 on the validation dataset and 0.816 on the leaderboard dataset. The accuracy for subtask B reaches 0.717 on the leaderboard dataset, which shows that our method can handle such a cross-domain problem and outperforms the baseline system.
System characteristics
Input | mono |
Sampling rate | 32kHz; 44.1kHz |
Features | log-mel energies |
Classifier | CNN; CNN, DNN |
Decision making | average |
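DANN-style adaptation trains a device (domain) discriminator through a gradient-reversal layer so that the shared feature extractor is pushed towards device-invariant features. A compact PyTorch sketch of the reversal layer; the surrounding scene and device classifiers are assumed.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: the scene classifier sees the features directly, while the device
# classifier sees the reversed-gradient copy, pushing features towards invariance
features = torch.randn(8, 128, requires_grad=True)
device_classifier_input = grad_reverse(features, lam=0.5)
```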
The SEIE-SCUT Systems for Acoustic Scene Classification Using CNN Ensemble
Wucheng Wang and Mingle Liu
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province, China
Wang_SCUT_task1a_1 Wang_SCUT_task1a_2 Wang_SCUT_task1a_3 Wang_SCUT_task1a_4
The SEIE-SCUT Systems for Acoustic Scene Classification Using CNN Ensemble
Wucheng Wang and Mingle Liu
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province, China
Abstract
In this report, we present our work on task 1b of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a CNN ensemble strategy for ASC. Specifically, audio features such as Mel-frequency cepstral coefficients (MFCCs) and logarithmic filter-bank (LFB) energies are first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to form a CNN ensemble. Finally, the classification result for each test sample is based on the voting of all CNNs contained in the ensemble.
System characteristics
Input | mono, binaural |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies, MFCC |
Classifier | VGG, Inception, ResNet |
Decision making | vote |
Open-Set Acoustic Scene Classification with Deep Convolutional Autoencoders
Kevin Wilkinghoff and Frank Kurth
Communication Systems, Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany
Wilkinghoff_FKIE_task1a_1 Wilkinghoff_FKIE_task1a_2 Wilkinghoff_FKIE_task1c_1 Wilkinghoff_FKIE_task1c_2
Open-Set Acoustic Scene Classification with Deep Convolutional Autoencoders
Kevin Wilkinghoff and Frank Kurth
Communication Systems, Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany
Abstract
Acoustic scene classification is the task of determining the environment in which a given audio file has been recorded. If it is a priori not known whether all possible environments that may be encountered during test time are also known when training the system, the task is referred to as open-set classification. This paper contains a description of an open-set acoustic scene classification system submitted to Task 1C of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Our system consists of a combination of convolutional neural networks for closed-set identification and deep convolutional autoencoders for outlier detection. In evaluations conducted on the leaderboard dataset of the challenge, the proposed system significantly outperforms the baseline systems and improves the score by 35.4% from 0.46666 to 0.63166.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | mixup, cutout, width shift, height shift |
Features | log-mel energies; log-mel energies, harmonic part, percussive part |
Classifier | CNN; CNN, ensemble; CNN, DCAE, logistic regression; CNN, DCAE, logistic regression, ensemble |
Decision making | maximum likelihood; geometric mean, maximum likelihood; threshold, maximum likelihood; geometric mean, threshold, maximum likelihood |
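The open-set decision rests on reconstruction error: a deep convolutional autoencoder trained only on known scenes should reconstruct unknown scenes poorly. The PyTorch sketch below illustrates that decision rule with a stand-in autoencoder; the architecture and the threshold value are placeholders, not the submitted system.

```python
import torch
import torch.nn as nn

class TinyConvAutoencoder(nn.Module):
    """Illustrative stand-in for the deep convolutional autoencoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def classify_open_set(logmel_batch, autoencoder, closed_set_probs, threshold=0.1):
    """Flag a segment as 'unknown' (-1) when its reconstruction error exceeds the threshold."""
    recon = autoencoder(logmel_batch)
    err = ((recon - logmel_batch) ** 2).mean(dim=(1, 2, 3))   # per-example MSE
    labels = closed_set_probs.argmax(dim=1)
    labels[err > threshold] = -1
    return labels
```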
Stratified Time-Frequency Features for CNN-Based Acoustic Scene Classification
Yuzhong Wu and Tan Lee
Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China
Abstract
An acoustic scene signal is a mixture of diverse sound events, which frequently overlap with each other. CNN models for acoustic scene classification often suffer from overfitting because they may memorize the overlapped sounds as the representative patterns for acoustic scenes, and may fail to recognize the scene when only one of the sounds is present. Based on a standard CNN setup with log-Mel features as input, we propose to stratify the log-Mel image into several component images based on sound duration, so that each component image contains a specific type of time-frequency pattern. We then emphasize independent modeling of the time-frequency patterns to better utilize the stratified features. The experimental results on the TAU Urban Acoustic Scenes 2019 development dataset [1] show that the use of stratified features can significantly improve classification performance.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | majority vote |
Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for Dcase2019 Challenge
Hossein Zeinali, Lukas Burget and Honza Cernocky
Information Technology, Brno University of Technology, Brno, Czech Republic
Zeinali_BUT_task1a_1 Zeinali_BUT_task1a_2 Zeinali_BUT_task1a_3
Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for Dcase2019 Challenge
Hossein Zeinali, Lukas Burget and Honza Cernocky
Information Technology, Brno University of Technology, Brno, Czech Republic
Abstract
In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described, along with an analysis of the different methods. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first is a VGG-like two-dimensional CNN. The second is also a two-dimensional CNN, which uses Max-Feature-Map activations and is called Light-CNN (LCNN). The third network is a one-dimensional CNN mainly used for speaker verification, known as the x-vector topology. All proposed networks use a self-attention mechanism for statistics pooling. As features, we use 256-dimensional log Mel-spectrograms. Our submissions are fusions of several networks trained on a generated 4-fold evaluation setup, using different fusion strategies.
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, ensemble |
Decision making | score fusion; majority vote; majority vote, score fusion |
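Self-attention statistics pooling replaces the plain mean over frames with an attention-weighted mean and standard deviation. A minimal PyTorch sketch of such a pooling layer (dimension sizes are assumptions):

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and std over the time axis of (B, T, D) frame features."""
    def __init__(self, dim=256, attn_hidden=128):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, attn_hidden), nn.Tanh(),
                                       nn.Linear(attn_hidden, 1))

    def forward(self, frames):                                    # frames: (B, T, D)
        weights = torch.softmax(self.attention(frames), dim=1)    # (B, T, 1), sums to 1 over T
        mean = (weights * frames).sum(dim=1)
        var = (weights * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)                      # (B, 2*D) utterance embedding

pooled = AttentiveStatsPooling()(torch.randn(4, 300, 256))
```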
Acoustic Scene Classification Combining Log-Mel CNN Model and End-To-End Model
Xu Zheng and Jie Yan
Computing Sciences, University of Science and Technology of China, Hefei, Anhui, China
Zheng_USTC_task1a_1 Zheng_USTC_task1a_2 Zheng_USTC_task1a_3
Acoustic Scene Classification Combining Log-Mel CNN Model and End-To-End Model
Xu Zheng and Jie Yan
Computing Sciences, University of Science and Technology of China, Hefei, Anhui, China
Abstract
This technical report describes the Zheng-USTC team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge. Two different models for acoustic scene classification are provided. The first is a common two-dimensional CNN model in which the log-mel energy spectrogram is treated as an image. The second is an end-to-end model in which features are extracted from the audio by a 3-layer CNN with 64 filters. The experimental results on the fold-1 validation set of 4185 samples and on the leaderboard show that the class-wise accuracies of the two models are complementary to some extent. Finally, we fused the softmax output scores of the two systems using a simple unweighted average.
System characteristics
Input | binaural; mono |
Sampling rate | 22.05kHz; 16kHz |
Data augmentation | SpecAugment, RandomCrop; Between-Class learning; SpecAugment, RandomCrop, Between-Class learning |
Features | log-mel energies; raw waveform |
Classifier | CNN |
Audio Scene Classification Based on Deeper CNN and Mixed Mono Channel Feature
Nai Zhou, Yanfang Liu and Qingkai Wei
Beijing Kuaiyu Electronics Co., Ltd., Beijing, China
Zhou_Kuaiyu_task1a_1 Zhou_Kuaiyu_task1a_2 Zhou_Kuaiyu_task1a_3
Audio Scene Classification Based on Deeper CNN and Mixed Mono Channel Feature
Nai Zhou, Yanfang Liu and Qingkai Wei
Beijing Kuaiyu Electronics Co., Ltd., Beijing, China
Abstract
This technical report describes the Kuaiyu team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge. Following the results of DCASE 2018, a convolutional neural network and log-mel spectrograms generated from mono audio are used: the log-mel spectrogram is converted into a multi-channel spectrogram and fed into a neural network with 8 convolutional layers. The result of our experiments is a classification system that achieves accuracies of around 75.5% on the public Kaggle leaderboard.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
DCASE 2019 Challenge Task1 Technical Report
Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China
Zhu_SRCBBUPT_task1c_1 Zhu_SRCBBUPT_task1c_2 Zhu_SRCBBUPT_task1c_3 Zhu_SRCBBUPT_task1c_4
DCASE 2019 Challenge Task1 Technical Report
Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China
Abstract
This report describes our methods for DCASE 2019 task 1A and task 1C of Acoustic Scene Classification (ASC). Task 1C in particular contains unknown scenes that are not included in the training data set; we use less training data and a threshold to separate known and unknown scenes. In our method, we use log Mel spectrograms with different divisions as the input to multiple neural networks, and the ensemble of their outputs shows good accuracy. For task 1A we use VGG and Xception networks with an ensemble over 3 different divisions, reaching an accuracy of 0.807 on the leaderboard dataset. For task 1C we use a Convolutional Recurrent Neural Network (CRNN) with a self-attention mechanism, an ensemble over 2 different feature divisions, and a threshold of 0.4 for the unknown-scene decision, giving a leaderboard accuracy of 0.648.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, BGRU, self-attention |
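The 0.4 threshold turns the closed-set softmax output into an open-set decision: if no class is confident enough, the segment is labelled unknown. A one-function numpy sketch:

```python
import numpy as np

def open_set_decision(softmax_scores, threshold=0.4, unknown_label="unknown"):
    """Return the best class index, or 'unknown' if its probability is below the threshold."""
    best = softmax_scores.argmax()
    return unknown_label if softmax_scores[best] < threshold else int(best)

print(open_set_decision(np.array([0.35, 0.30, 0.20, 0.15])))   # -> "unknown"
```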
DCASE 2019 Challenge Task1 Technical Report
Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China
Zhu_SSLabBUPT_task1a_1 Zhu_SSLabBUPT_task1a_2 Zhu_SSLabBUPT_task1a_3 Zhu_SSLabBUPT_task1a_4
DCASE 2019 Challenge Task1 Technical Report
Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China
Abstract
This report describes our methods for DCASE 2019 task 1A and task 1C of Acoustic Scene Classification (ASC). Task 1C in particular contains unknown scenes that are not included in the training data set; we use less training data and a threshold to separate known and unknown scenes. In our method, we use log Mel spectrograms with different divisions as the input to multiple neural networks, and the ensemble of their outputs shows good accuracy. For task 1A we use VGG and Xception networks with an ensemble over 3 different divisions, reaching an accuracy of 0.807 on the leaderboard dataset. For task 1C we use a Convolutional Recurrent Neural Network (CRNN) with a self-attention mechanism, an ensemble over 2 different feature divisions, and a threshold of 0.4 for the unknown-scene decision, giving a leaderboard accuracy of 0.648.
System characteristics
Input | multiple |
Sampling rate | 48kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, BGRU, self-attention, ensemble |