Acoustic Scene Classification with Mismatched Recording Devices


Challenge results

Task description

This subtask is concerned with the situation in which an application is tested on devices that are possibly different from the ones used to record the development data. Accordingly, the evaluation data contains recordings from more devices than the development data.

The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:

  • Device A: 40 hours (14400 segments, same as subtask A, but resampled and single-channel)
  • Device B: 3 hours (108 segments per acoustic scene)
  • Device C: 3 hours (108 segments per acoustic scene)

A more detailed task description can be found on the task description page.

Systems ranking

Rank | Submission code | Submission name | Technical report | Accuracy (B/C) with 95% confidence interval (evaluation dataset) | Accuracy (B/C) (development dataset)
Eghbal-zadeh_CPJKU_task1b_1 mmd_shake_res_snapi Eghbal-zadeh2019 74.5 (73.5 - 75.5)
Eghbal-zadeh_CPJKU_task1b_2 mmd_shake_res Eghbal-zadeh2019 74.5 (73.5 - 75.5)
Eghbal-zadeh_CPJKU_task1b_3 mmd_shake_snapi Eghbal-zadeh2019 73.4 (72.4 - 74.5)
Eghbal-zadeh_CPJKU_task1b_4 mmd_shake Eghbal-zadeh2019 73.4 (72.3 - 74.4)
DCASE2019 baseline Baseline 47.7 (46.5 - 48.8) 41.4
Jiang_UESTC_task1b_1 Randomforest_16 Jiang2019 70.3 (69.2 - 71.3) 62.2
Jiang_UESTC_task1b_2 Randomforest_8 Jiang2019 69.9 (68.9 - 71.0) 64.2
Jiang_UESTC_task1b_3 Averaging_16 Jiang2019 69.0 (68.0 - 70.1) 63.2
Jiang_UESTC_task1b_4 Averaging_8 Jiang2019 69.6 (68.6 - 70.7) 64.0
Kong_SURREY_task1b_1 cvssp_cnn9 Kong2019 61.6 (60.4 - 62.7) 52.7
Kosmider_SRPOL_task1b_1 SC+IC+RCV Kosmider2019 75.1 (74.1 - 76.1)
Kosmider_SRPOL_task1b_2 SC+ALL+SV Kosmider2019 75.3 (74.3 - 76.3)
Kosmider_SRPOL_task1b_3 SC+IC+RCV Kosmider2019 74.9 (73.9 - 75.9)
Kosmider_SRPOL_task1b_4 SC+FULL+SV Kosmider2019 75.2 (74.3 - 76.2)
LamPham_KentGroup_task1b_1 Kent Pham2019 72.8 (71.8 - 73.8) 72.9
McDonnell_USA_task1b_1 UniSA_1b1 Gao2019 74.2 (73.2 - 75.2) 66.3
McDonnell_USA_task1b_2 UniSA_1b2 Gao2019 74.1 (73.1 - 75.2) 62.5
McDonnell_USA_task1b_3 UniSA_1b3 Gao2019 74.9 (73.9 - 75.9) 64.2
McDonnell_USA_task1b_4 UniSA_1b4 Gao2019 74.4 (73.4 - 75.4) 66.3
Primus_CPJKU_task1b_1 CPR-NoDA Primus2019 71.3 (70.2 - 72.3) 61.2
Primus_CPJKU_task1b_2 CPR-MSE Primus2019 73.4 (72.4 - 74.4) 64.3
Primus_CPJKU_task1b_3 CPR-MI Primus2019 71.6 (70.6 - 72.7) 62.5
Primus_CPJKU_task1b_4 CPR-Ensemble Primus2019 74.2 (73.2 - 75.2) 65.1
Song_HIT_task1b_1 hitsplab_1 Song2019 67.3 (66.2 - 68.3) 65.6
Song_HIT_task1b_2 hitsplab_2 Song2019 72.2 (71.2 - 73.3) 41.4
Song_HIT_task1b_3 hitsplab_3 Song2019 72.1 (71.1 - 73.1) 70.3
Waldekar_IITKGP_task1b_1 IITKGP_MFDWC19 Waldekar2019 62.1 (60.9 - 63.2) 52.3
Wang_NWPU_task1b_1 Rui_task1b Wang2019 65.7 (64.6 - 66.8) 54.8
Wang_NWPU_task1b_2 Rui_task1b Wang2019 68.5 (67.4 - 69.6) 55.2
Wang_NWPU_task1b_3 Rui_task1b Wang2019 70.3 (69.3 - 71.4) 54.8
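
The ranking metric above is segment-level accuracy on devices B and C, reported with a 95% confidence interval. As a rough illustration of how such an interval can be computed, the sketch below uses a normal-approximation (Wald) binomial interval; the counts are hypothetical, since the exact evaluation-set size and the organizers' interval method are not stated here.

```python
# Hedged sketch: accuracy with a 95% normal-approximation (Wald) interval,
# reported in percent as "acc (low - high)". Counts below are hypothetical.
import math

def accuracy_with_ci(n_correct: int, n_total: int, z: float = 1.96):
    p = n_correct / n_total
    se = math.sqrt(p * (1.0 - p) / n_total)          # standard error of p
    return 100 * p, 100 * (p - z * se), 100 * (p + z * se)

acc, low, high = accuracy_with_ci(n_correct=5364, n_total=7200)
print(f"{acc:.1f} ({low:.1f} - {high:.1f})")          # -> 74.5 (73.5 - 75.5)
```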

Teams ranking

Rank | Submission code | Submission name | Technical report | Accuracy with 95% confidence interval (evaluation dataset) | Accuracy (development dataset)
Eghbal-zadeh_CPJKU_task1b_2 mmd_shake_res Eghbal-zadeh2019 74.5 (73.5 - 75.5)
DCASE2019 baseline Baseline 47.7 (46.5 - 48.8) 41.4
Jiang_UESTC_task1b_1 Randomforest_16 Jiang2019 70.3 (69.2 - 71.3) 62.2
Kong_SURREY_task1b_1 cvssp_cnn9 Kong2019 61.6 (60.4 - 62.7) 52.7
Kosmider_SRPOL_task1b_2 SC+ALL+SV Kosmider2019 75.3 (74.3 - 76.3)
LamPham_KentGroup_task1b_1 Kent Pham2019 72.8 (71.8 - 73.8) 72.9
McDonnell_USA_task1b_3 UniSA_1b3 Gao2019 74.9 (73.9 - 75.9) 64.2
Primus_CPJKU_task1b_4 CPR-Ensemble Primus2019 74.2 (73.2 - 75.2) 65.1
Song_HIT_task1b_2 hitsplab_2 Song2019 72.2 (71.2 - 73.3) 41.4
Waldekar_IITKGP_task1b_1 IITKGP_MFDWC19 Waldekar2019 62.1 (60.9 - 63.2) 52.3
Wang_NWPU_task1b_3 Rui_task1b Wang2019 70.3 (69.3 - 71.4) 54.8

Class-wise performance

Rank | Submission code | Submission name | Technical report | Accuracy (evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Eghbal-zadeh_CPJKU_task1b_1 mmd_shake_res_snapi Eghbal-zadeh2019 74.5 71.2 90.3 74.9 66.0 84.6 59.9 81.8 45.0 89.7 81.5
Eghbal-zadeh_CPJKU_task1b_2 mmd_shake_res Eghbal-zadeh2019 74.5 70.1 90.6 74.4 66.2 82.1 58.3 82.6 48.1 90.3 82.2
Eghbal-zadeh_CPJKU_task1b_3 mmd_shake_snapi Eghbal-zadeh2019 73.4 73.6 89.6 71.0 66.1 86.1 56.5 78.8 44.0 87.2 81.4
Eghbal-zadeh_CPJKU_task1b_4 mmd_shake Eghbal-zadeh2019 73.4 71.7 90.1 72.5 65.1 85.8 58.9 79.3 43.1 88.1 79.0
DCASE2019 baseline Baseline 47.7 36.5 57.8 57.8 35.8 54.9 15.8 76.8 28.3 67.8 45.1
Jiang_UESTC_task1b_1 Randomforest_16 Jiang2019 70.3 61.3 68.2 76.7 71.9 79.6 59.4 80.7 47.5 86.8 70.7
Jiang_UESTC_task1b_2 Randomforest_8 Jiang2019 69.9 62.6 69.2 74.0 71.7 79.0 58.9 79.3 46.5 87.4 70.8
Jiang_UESTC_task1b_3 Averaging_16 Jiang2019 69.0 54.7 67.1 74.9 71.1 80.7 61.9 85.3 39.7 89.2 65.8
Jiang_UESTC_task1b_4 Averaging_8 Jiang2019 69.6 54.2 71.2 71.9 70.8 80.3 62.4 84.9 40.0 89.6 71.0
Kong_SURREY_task1b_1 cvssp_cnn9 Kong2019 61.6 50.4 63.7 69.7 52.2 77.4 41.1 56.8 60.7 84.3 59.2
Kosmider_SRPOL_task1b_1 SC+IC+RCV Kosmider2019 75.1 64.0 82.4 78.6 65.0 92.1 62.4 85.0 49.3 87.4 84.6
Kosmider_SRPOL_task1b_2 SC+ALL+SV Kosmider2019 75.3 68.3 85.8 81.2 65.6 94.3 53.6 86.4 45.1 90.1 82.4
Kosmider_SRPOL_task1b_3 SC+IC+RCV Kosmider2019 74.9 64.0 82.2 79.0 65.4 92.2 61.1 84.4 49.0 86.8 84.4
Kosmider_SRPOL_task1b_4 SC+FULL+SV Kosmider2019 75.2 67.9 85.8 80.8 65.0 94.4 54.7 86.4 43.2 89.9 84.3
LamPham_KentGroup_task1b_1 Kent Pham2019 72.8 69.9 91.9 66.1 56.7 87.6 44.9 73.3 64.3 89.6 83.9
McDonnell_USA_task1b_1 UniSA_1b1 Gao2019 74.2 68.9 88.1 77.1 67.4 84.3 55.6 85.3 49.3 91.9 73.9
McDonnell_USA_task1b_2 UniSA_1b2 Gao2019 74.1 73.9 87.5 73.6 70.0 81.5 55.6 82.2 52.8 91.1 73.2
McDonnell_USA_task1b_3 UniSA_1b3 Gao2019 74.9 72.1 88.6 75.4 70.1 83.8 56.2 84.0 52.5 92.1 74.2
McDonnell_USA_task1b_4 UniSA_1b4 Gao2019 74.4 70.4 86.8 77.8 69.2 83.1 55.0 86.8 49.6 92.1 73.5
Primus_CPJKU_task1b_1 CPR-NoDA Primus2019 71.3 78.8 86.0 66.7 64.0 79.3 51.2 74.4 40.7 90.0 81.8
Primus_CPJKU_task1b_2 CPR-MSE Primus2019 73.4 75.4 86.1 71.9 71.7 87.8 57.1 74.6 36.0 91.0 82.2
Primus_CPJKU_task1b_3 CPR-MI Primus2019 71.6 76.1 83.1 76.0 61.8 78.8 59.3 70.3 36.4 91.1 83.3
Primus_CPJKU_task1b_4 CPR-Ensemble Primus2019 74.2 77.5 86.2 74.4 72.4 86.4 59.9 78.1 36.0 89.9 81.5
Song_HIT_task1b_1 hitsplab_1 Song2019 67.3 41.4 74.9 59.2 70.0 86.8 45.0 86.2 46.1 88.2 74.9
Song_HIT_task1b_2 hitsplab_2 Song2019 72.2 63.1 80.6 76.5 73.5 86.9 37.8 87.5 51.5 92.5 72.6
Song_HIT_task1b_3 hitsplab_3 Song2019 72.1 56.2 82.1 70.7 74.3 87.2 40.3 86.9 51.1 93.2 79.0
Waldekar_IITKGP_task1b_1 IITKGP_MFDWC19 Waldekar2019 62.1 55.6 69.7 55.0 51.8 83.8 43.2 66.2 42.4 86.2 66.7
Wang_NWPU_task1b_1 Rui_task1b Wang2019 65.7 55.7 68.2 69.7 60.0 81.8 45.3 62.6 55.6 90.3 67.6
Wang_NWPU_task1b_2 Rui_task1b Wang2019 68.5 58.8 70.1 71.5 64.3 82.1 51.9 68.6 53.1 89.7 74.9
Wang_NWPU_task1b_3 Rui_task1b Wang2019 70.3 60.3 72.6 72.8 65.8 81.1 54.4 73.2 53.3 90.6 78.9

Device-wise performance

Rank | Submission code | Submission name | Technical report | Accuracy on the evaluation dataset: Average (Dev B / Dev C) | Dev B | Dev C | Dev A | Dev D
Eghbal-zadeh_CPJKU_task1b_1 mmd_shake_res_snapi Eghbal-zadeh2019 74.5 73.8 75.2 81.3 54.4
Eghbal-zadeh_CPJKU_task1b_2 mmd_shake_res Eghbal-zadeh2019 74.5 74.0 75.0 81.2 53.1
Eghbal-zadeh_CPJKU_task1b_3 mmd_shake_snapi Eghbal-zadeh2019 73.4 72.8 74.1 80.3 55.3
Eghbal-zadeh_CPJKU_task1b_4 mmd_shake Eghbal-zadeh2019 73.4 72.6 74.2 79.9 55.5
DCASE2019 baseline Baseline 47.7 48.9 46.4 63.2 26.7
Jiang_UESTC_task1b_1 Randomforest_16 Jiang2019 70.3 69.1 71.4 75.1 53.0
Jiang_UESTC_task1b_2 Randomforest_8 Jiang2019 69.9 68.5 71.4 75.1 52.0
Jiang_UESTC_task1b_3 Averaging_16 Jiang2019 69.0 68.4 69.6 74.3 53.2
Jiang_UESTC_task1b_4 Averaging_8 Jiang2019 69.6 68.8 70.5 73.9 54.2
Kong_SURREY_task1b_1 cvssp_cnn9 Kong2019 61.6 60.3 62.8 70.2 40.8
Kosmider_SRPOL_task1b_1 SC+IC+RCV Kosmider2019 75.1 74.5 75.7 79.8 36.1
Kosmider_SRPOL_task1b_2 SC+ALL+SV Kosmider2019 75.3 74.3 76.2 80.8 38.6
Kosmider_SRPOL_task1b_3 SC+IC+RCV Kosmider2019 74.9 74.4 75.3 78.9 35.5
Kosmider_SRPOL_task1b_4 SC+FULL+SV Kosmider2019 75.2 74.3 76.2 80.1 40.0
LamPham_KentGroup_task1b_1 Kent Pham2019 72.8 71.8 73.8 78.2 24.6
McDonnell_USA_task1b_1 UniSA_1b1 Gao2019 74.2 73.2 75.1 79.3 63.4
McDonnell_USA_task1b_2 UniSA_1b2 Gao2019 74.1 73.6 74.7 79.9 63.8
McDonnell_USA_task1b_3 UniSA_1b3 Gao2019 74.9 74.2 75.6 79.8 65.2
McDonnell_USA_task1b_4 UniSA_1b4 Gao2019 74.4 73.8 75.1 80.1 63.6
Primus_CPJKU_task1b_1 CPR-NoDA Primus2019 71.3 70.9 71.7 78.1 49.4
Primus_CPJKU_task1b_2 CPR-MSE Primus2019 73.4 73.6 73.1 72.1 47.9
Primus_CPJKU_task1b_3 CPR-MI Primus2019 71.6 71.4 71.8 72.8 49.3
Primus_CPJKU_task1b_4 CPR-Ensemble Primus2019 74.2 74.1 74.3 73.7 47.4
Song_HIT_task1b_1 hitsplab_1 Song2019 67.3 65.3 69.2 73.1 47.7
Song_HIT_task1b_2 hitsplab_2 Song2019 72.2 71.7 72.8 79.9 59.4
Song_HIT_task1b_3 hitsplab_3 Song2019 72.1 71.1 73.1 78.4 59.1
Waldekar_IITKGP_task1b_1 IITKGP_MFDWC19 Waldekar2019 62.1 59.7 64.4 71.4 39.8
Wang_NWPU_task1b_1 Rui_task1b Wang2019 65.7 64.9 66.4 75.4 39.9
Wang_NWPU_task1b_2 Rui_task1b Wang2019 68.5 67.8 69.2 76.9 46.8
Wang_NWPU_task1b_3 Rui_task1b Wang2019 70.3 68.8 71.9 79.6 47.2

System characteristics

General characteristics

Rank | Code | Technical report | Accuracy (eval) | Sampling rate | Data augmentation | Features
Eghbal-zadeh_CPJKU_task1b_1 Eghbal-zadeh2019 74.5 22.05kHz mixup perceptual weighted power spectrogram
Eghbal-zadeh_CPJKU_task1b_2 Eghbal-zadeh2019 74.5 22.05kHz mixup perceptual weighted power spectrogram
Eghbal-zadeh_CPJKU_task1b_3 Eghbal-zadeh2019 73.4 22.05kHz mixup perceptual weighted power spectrogram
Eghbal-zadeh_CPJKU_task1b_4 Eghbal-zadeh2019 73.4 22.05kHz mixup perceptual weighted power spectrogram
DCASE2019 baseline 47.7 44.1kHz log-mel energies
Jiang_UESTC_task1b_1 Jiang2019 70.3 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Jiang_UESTC_task1b_2 Jiang2019 69.9 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Jiang_UESTC_task1b_3 Jiang2019 69.0 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Jiang_UESTC_task1b_4 Jiang2019 69.6 44.1kHz HPSS, NNF, vocal separation, HRTF log-mel energies
Kong_SURREY_task1b_1 Kong2019 61.6 32kHz log-mel energies
Kosmider_SRPOL_task1b_1 Kosmider2019 75.1 44.1kHz Spectrum Correction, SpecAugment, mixup log-mel energies
Kosmider_SRPOL_task1b_2 Kosmider2019 75.3 44.1kHz Spectrum Correction, SpecAugment, mixup log-mel energies
Kosmider_SRPOL_task1b_3 Kosmider2019 74.9 44.1kHz Spectrum Correction, SpecAugment, mixup log-mel energies
Kosmider_SRPOL_task1b_4 Kosmider2019 75.2 44.1kHz Spectrum Correction, SpecAugment, mixup log-mel energies
LamPham_KentGroup_task1b_1 Pham2019 72.8 44.1kHz mixup Gammatone, log-mel energies, CQT
McDonnell_USA_task1b_1 Gao2019 74.2 44.1kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
McDonnell_USA_task1b_2 Gao2019 74.1 44.1kHz mixup, temporal cropping log-mel energies
McDonnell_USA_task1b_3 Gao2019 74.9 44.1kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
McDonnell_USA_task1b_4 Gao2019 74.4 44.1kHz mixup, temporal cropping log-mel energies, deltas and delta-deltas
Primus_CPJKU_task1b_1 Primus2019 71.3 22.05kHz mixup log-mel energies
Primus_CPJKU_task1b_2 Primus2019 73.4 22.05kHz mixup log-mel energies
Primus_CPJKU_task1b_3 Primus2019 71.6 22.05kHz mixup log-mel energies
Primus_CPJKU_task1b_4 Primus2019 74.2 22.05kHz mixup log-mel energies
Song_HIT_task1b_1 Song2019 67.3 44.1kHz mixup log-mel energies
Song_HIT_task1b_2 Song2019 72.2 44.1kHz mixup log-mel energies
Song_HIT_task1b_3 Song2019 72.1 44.1kHz mixup log-mel energies
Waldekar_IITKGP_task1b_1 Waldekar2019 62.1 44.1kHz MFDWC
Wang_NWPU_task1b_1 Wang2019 65.7 44.1kHz log-mel energies
Wang_NWPU_task1b_2 Wang2019 68.5 32kHz log-mel energies
Wang_NWPU_task1b_3 Wang2019 70.3 32kHz log-mel energies
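
Mixup is the most frequently listed augmentation in the table above. A minimal sketch of the standard formulation, assuming NumPy batches of log-mel spectrograms `x` with one-hot labels `y`; this is a generic illustration, not any particular team's implementation.

```python
import numpy as np

def mixup(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """Mix each example with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = np.random.permutation(len(x))         # partner indices
    x_mix = lam * x + (1.0 - lam) * x[perm]      # convex combination of inputs
    y_mix = lam * y + (1.0 - lam) * y[perm]      # same combination of targets
    return x_mix, y_mix
```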



Machine learning characteristics

Rank | Code | Technical report | Accuracy (eval) | Model complexity | Classifier | Ensemble subsystems | Decision making | Device mismatch handling
Eghbal-zadeh_CPJKU_task1b_1 Eghbal-zadeh2019 74.5 747596080 CNN, Receptive Field Regularization 220 average maximum mean discrepancy, domain adaptation, transfer learning
Eghbal-zadeh_CPJKU_task1b_2 Eghbal-zadeh2019 74.5 37379804 CNN, Receptive Field Regularization 11 average maximum mean discrepancy, domain adaptation, transfer learning
Eghbal-zadeh_CPJKU_task1b_3 Eghbal-zadeh2019 73.4 286137920 CNN, Receptive Field Regularization 80 average maximum mean discrepancy, domain adaptation, transfer learning
Eghbal-zadeh_CPJKU_task1b_4 Eghbal-zadeh2019 73.4 12878416 CNN, Receptive Field Regularization 4 average maximum mean discrepancy, domain adaptation, transfer learning
DCASE2019 baseline 47.7 116118 CNN
Jiang_UESTC_task1b_1 Jiang2019 70.3 1448794 CNN 16 stacking
Jiang_UESTC_task1b_2 Jiang2019 69.9 1448794 CNN 8 stacking
Jiang_UESTC_task1b_3 Jiang2019 69.0 1448794 CNN 16 averaging
Jiang_UESTC_task1b_4 Jiang2019 69.6 1448794 CNN 8 averaging
Kong_SURREY_task1b_1 Kong2019 61.6 4686144 CNN
Kosmider_SRPOL_task1b_1 Kosmider2019 75.1 6100840 CNN 36 isotonic-calibrated soft-voting spectrum correction
Kosmider_SRPOL_task1b_2 Kosmider2019 75.3 18095576 CNN 124 soft-voting spectrum correction
Kosmider_SRPOL_task1b_3 Kosmider2019 74.9 3077046 CNN 31 soft-voting spectrum correction
Kosmider_SRPOL_task1b_4 Kosmider2019 75.2 10768964 CNN 58 soft-voting spectrum correction
LamPham_KentGroup_task1b_1 Pham2019 72.8 12346325 CNN, DNN 2
McDonnell_USA_task1b_1 Gao2019 74.2 3253148 CNN aggressive regularization and augmentation
McDonnell_USA_task1b_2 Gao2019 74.1 3252268 CNN aggressive regularization and augmentation
McDonnell_USA_task1b_3 Gao2019 74.9 6505416 CNN 2 average aggressive regularization and augmentation
McDonnell_USA_task1b_4 Gao2019 74.4 6506296 CNN 2 average aggressive regularization and augmentation
Primus_CPJKU_task1b_1 Primus2019 71.3 13047888 CNN, ensemble 4 average
Primus_CPJKU_task1b_2 Primus2019 73.4 13047888 CNN, ensemble 4 average domain adaptation
Primus_CPJKU_task1b_3 Primus2019 71.6 13047888 CNN, ensemble 8 average domain adaptation
Primus_CPJKU_task1b_4 Primus2019 74.2 26095776 CNN, ensemble 8 average domain adaptation
Song_HIT_task1b_1 Song2019 67.3 22758197 CNN feature transform
Song_HIT_task1b_2 Song2019 72.2 68274591 CNN 3 probability aggregation feature transform
Song_HIT_task1b_3 Song2019 72.1 68274591 CNN 3 majority vote feature transform
Waldekar_IITKGP_task1b_1 Waldekar2019 62.1 9000 SVM
Wang_NWPU_task1b_1 Wang2019 65.7 116118 CNN, DNN 7 domain adaptation
Wang_NWPU_task1b_2 Wang2019 68.5 116118 CNN, DNN 7 average domain adaptation
Wang_NWPU_task1b_3 Wang2019 70.3 116118 CNN, DNN 7 average domain adaptation
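
The most common entry in the decision-making column above is simple averaging: class probabilities from each ensemble member are averaged and the highest-scoring class is chosen. A generic sketch of that soft-voting step, not tied to any particular submission:

```python
import numpy as np

def ensemble_decision(member_probs: np.ndarray) -> np.ndarray:
    """member_probs: array of shape (n_members, n_segments, n_classes)."""
    mean_probs = member_probs.mean(axis=0)   # average softmax outputs (soft voting)
    return mean_probs.argmax(axis=-1)        # predicted class index per segment
```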

Public leaderboard

Scores

Date | Top team | Top 10 team median (range)
2019-05-14 64.8 64.8 (64.8 - 64.8)
2019-05-15 64.8 62.4 (60.0 - 64.8)
2019-05-16 66.3 65.6 (64.8 - 66.3)
2019-05-17 66.7 65.8 (64.8 - 66.7)
2019-05-18 66.7 64.8 (60.5 - 66.7)
2019-05-19 68.5 66.7 (64.8 - 68.5)
2019-05-20 73.3 67.8 (64.8 - 73.3)
2019-05-21 73.3 64.8 (56.7 - 73.3)
2019-05-22 73.3 67.8 (59.3 - 73.3)
2019-05-23 73.3 66.3 (53.2 - 73.3)
2019-05-24 73.3 66.3 (58.3 - 73.3)
2019-05-25 73.3 66.3 (60.3 - 73.3)
2019-05-26 73.3 66.3 (60.3 - 73.3)
2019-05-27 73.3 66.3 (60.3 - 73.3)
2019-05-28 73.3 66.3 (60.7 - 73.3)
2019-05-29 73.3 68.2 (60.7 - 73.3)
2019-05-30 73.3 66.3 (44.0 - 73.3)
2019-05-31 73.3 66.9 (58.3 - 73.3)
2019-06-01 73.3 68.2 (62.5 - 73.3)
2019-06-02 73.7 68.2 (62.5 - 73.7)
2019-06-03 73.7 69.0 (62.5 - 73.7)
2019-06-04 73.7 69.0 (62.5 - 73.7)
2019-06-05 76.5 69.7 (64.8 - 76.5)
2019-06-06 76.5 69.7 (66.7 - 76.5)
2019-06-07 76.5 69.7 (66.7 - 76.5)
2019-06-08 76.5 69.7 (67.7 - 76.5)
2019-06-09 76.5 69.7 (68.3 - 76.5)
2019-06-10 76.5 70.4 (69.0 - 76.5)

Entries

Total entries

Date Entries
2019-05-14 1
2019-05-15 2
2019-05-16 3
2019-05-17 4
2019-05-18 6
2019-05-19 7
2019-05-20 9
2019-05-21 11
2019-05-22 13
2019-05-23 16
2019-05-24 19
2019-05-25 21
2019-05-26 21
2019-05-27 22
2019-05-28 23
2019-05-29 27
2019-05-30 32
2019-05-31 39
2019-06-01 44
2019-06-02 49
2019-06-03 53
2019-06-04 56
2019-06-05 63
2019-06-06 67
2019-06-07 74
2019-06-08 80
2019-06-09 88
2019-06-10 97

Entries per day

Date Entries per day
2019-05-14 1
2019-05-15 1
2019-05-16 1
2019-05-17 1
2019-05-18 2
2019-05-19 1
2019-05-20 2
2019-05-21 2
2019-05-22 2
2019-05-23 3
2019-05-24 3
2019-05-25 2
2019-05-26 0
2019-05-27 1
2019-05-28 1
2019-05-29 4
2019-05-30 5
2019-05-31 7
2019-06-01 5
2019-06-02 5
2019-06-03 4
2019-06-04 3
2019-06-05 7
2019-06-06 4
2019-06-07 7
2019-06-08 6
2019-06-09 8
2019-06-10 9

Technical reports

Urban Acoustic Scene Classification Using Binaural Wavelet Scattering and Random Subspace Discrimination Method

Fateme Arabnezhad and Babak Nasersharif
Computer Engineering Department, Khaje Nasir Toosi, Tehran, Iran

Abstract

This report describes our contribution to the DCASE 2019 challenge on Detection and Classification of Urban Acoustic Scenes (Task 1, Subtask A). We propose the wavelet scattering spectrum as a representation and feature, extracted both from the average of the two recorded audio channels (mono) and from their difference (side). The concatenation of these two sets of wavelet scattering spectra is used as a feature vector, which is fed into a classifier based on the random subspace method. In this work, Regularized Linear Discriminant Analysis (RLDA) is used as the base learner and classification approach for the random subspace method. The experimental results show that the proposed structure learns acoustic characteristics from audio segments. This structure achieved 87.98% accuracy on the whole development set (without cross-validation) and 78.83% on the leaderboard dataset.

System characteristics
Input mono
Sampling rate 48kHz
Features wavelet scattering spectra
Classifier random subspace
Decision making highest average score
PDF

Acoustic Scene Classification with Multiple Instance Learning and Fusion

Valentin Bilot and Quang Khanh Ngoc Duong
Audio R&D, InterDigital R&D, Rennes, France

Abstract

Audio classification has been an emerging topic in the last few years, especially with the benchmark datasets and evaluations from DCASE. This paper presents our deep learning models to address the acoustic scene classification (ASC) task of DCASE 2019. The models exploit the multiple instance learning (MIL) method as a way of guiding the network attention to different temporal segments of a recording. We then propose a simple late fusion of the results obtained by the three investigated MIL-based models. This fusion system uses a multi-layer perceptron (MLP) to predict the final classes from the initial class probability predictions and obtains a better result on the development and leaderboard datasets.

System characteristics
Input binaural, difference
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making MLP
PDF

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling

Hangting Chen, Zuozhen Liu, Zongming Liu, Pengyuan Zhang and Yonghong Yan
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China

Abstract

This technical report describes the IOA team's submission for Task 1A of the DCASE 2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversarial networks. Two major classifiers, a 1D deep convolutional neural network integrated with scalogram features and a 2D fully convolutional neural network integrated with mel filter bank features, are deployed in the scheme. Other approaches, such as adversarial city adaptation, a temporal module based on the discrete cosine transform, and hybrid architectures, have been developed for further fusion. The results of our experiments indicate that the final fusion systems A-D achieve accuracies above 85% on the officially provided fold 1 evaluation dataset.

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation generative neural network; generative neural network, variational autoencoder
Features log-mel energies, CQT
Classifier CNN
Decision making average vote
PDF

Acoustic Scene Classification Based on Ensemble System

Biyun Ding, Ganjun Liu and Jinhua Liang
School of Electrical and Information Engineering, TianJin University, Tianjin, China

Abstract

This technical report addresses Task 1A, acoustic scene classification, of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this task, the choice of audio features affects performance. To improve performance, we implement acoustic scene classification using multiple features and an ensemble system composed of a CNN and GMMs. In experiments performed on the DCASE 2019 challenge development dataset, the class-average accuracy of the GMM with 103 features is 64.3%, an improvement of 4.2% over the baseline CNN, and the class-average accuracy of the ensemble system is 66.3%, an improvement of 7.4% over the baseline CNN.

System characteristics
Input mono; mono, left, right, mixed
Sampling rate 48kHz
Features log-mel energies; MFCC, log-mel energies, ZRC, RMSE, spectrogram centroid
Classifier GMM; GMM, CNN
Decision making majority vote
PDF

Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs

Hamid Eghbal-zadeh, Khaled Koutini and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

In this report, we detail the CP-JKU submissions to the DCASE-2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural networks architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of Resnet and Densenet architectures to best fit the various audio processing tasks that use the spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year’s submissions is to provide the best-performing single-model submission, using our proposed approaches.

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features perceptual weighted power spectrogram
Classifier CNN, Receptive Field Regularization
Decision making average
PDF

Acoustic Scene Classification Based on the Dataset with Deep Convolutional Generated Against Network

Ning FangLi and Duan Shuang
Mechanical Engineering, Northwestern Polytechnical University School, 127 West Youyi Road, Xi'an, 710072, China

Abstract

Convolutional neural networks have been a highly effective solution for image classification challenges, and the results of DCASE 2018 [1] show that they also achieve excellent results in acoustic scene classification. Our team therefore adopted convolutional neural networks for DCASE 2019 Task 1a. To expose more of the audio characteristics, we used multiple mel-spectrograms to characterize the audio, trained multiple classifiers, and weighted the predictions of each classifier to form an ensemble. Classifier performance is largely limited by the quality and quantity of the data; as reported in the technical report [2], using GANs to augment the dataset can play a vital role in the final performance, so we also introduced Deep Convolutional GANs (DCGAN) [3] into our solution for Task 1a. Our model ultimately achieved an accuracy of 0.846 on the development set and 0.671 on the leaderboard dataset.

System characteristics
Input mixed
Sampling rate 48kHz
Data augmentation DCGAN
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps

Ruben Fraile, Juan Carlos Reina, Juana Gutierrez-Arriola and Elena Blanco
CITSEM, Universidad Politecnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard Gaussian Mixture Model with 64 components.

System characteristics
Input binaural
Sampling rate 48kHz
Features spectrogram, modulation spectrum, position-pitch maps
Classifier GMM
Decision making average log-likelihood
PDF

Acoustic Scene Classification Using Deep Residual Networks with Late Fusion of Separated High and Low Frequency Paths

Wei Gao and Mark McDonnell
School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, Australia

Abstract

This technical report describes our approach to Tasks 1a, 1b and 1c in the 2019 DCASE acoustic scene classification challenge. Our focus was on developing strong single models, without use of any supplementary data. We investigated the use of a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We designed the network to take into account that the temporal and frequency axes in spectrograms represent fundamentally different information. In particular, we used two pathways in the residual network: one for high frequencies and one for low frequencies, that were fused just two convolutional layers prior to the network output.

System characteristics
Input left, right; mono
Sampling rate 48kHz; 44.1kHz
Data augmentation mixup, temporal cropping
Features log-mel energies, deltas and delta-deltas; log-mel energies
Classifier CNN
Decision making average
PDF

Acoustic Scene Classification Using CNN Ensembles and Primary Ambient Extraction

Yang Haocong, Shi Chuang and Li Huiyong
Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This report describes our submission for Task 1a (acoustic scene classification) of the DCASE 2019 challenge. The results of the DCASE 2018 challenge demonstrate that convolutional neural networks (CNNs) and their ensembles can achieve excellent classification accuracies. Inspired by previous work, our method continues to work on ensembles of CNNs, while primary ambient extraction is newly introduced to decompose a binaural audio sample into four channels using the spatial information. Feature extraction is still carried out with mel spectrograms. Six CNN models are trained using 4-fold cross-validation, and an ensemble is applied to further improve performance. Finally, our method achieved a classification accuracy of 0.84 on the public leaderboard.

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making average; random forest
PDF

Acoustic Scene Classification Using Deep Learning-Based Ensemble Averaging

Jonathan Huang1, Paulo Lopez Meyer2, Hong Lu1, Hector Cordourier Maruri2 and Juan Del Hoyo2
1Intel Labs, Intel Corporation, Santa Clara, CA, USA, 2Intel Labs, Intel Corporation, Zapopan, Jalisco, Mexico

Abstract

In our submission to DCASE 2019 Task 1a, we explored four different deep learning network architectures: Vgg12, ResNet50, AclNet, and AclSincNet. To improve performance, these four architectures were pretrained on AudioSet data and then fine-tuned on the development set for the task. The outputs produced by these networks, owing to the diversity of feature front-ends and architectures, proved to be complementary when fused together. The ensemble of these models' outputs improved accuracy from the best single model's 77.9% to 83.0% on the validation set, trained with the challenge's default development split.

System characteristics
Input mono; mono, binaural; binaural
Sampling rate 16kHz; 48kHz, 16kHz; 48kHz
Data augmentation mixup
Features raw waveform, log-mel energies; log-mel energies
Classifier CNN
Decision making Max value of soft ensemble
PDF

Acoustic Scene Classification Based on Deep Convolutional Neuralnetwork with Spatial-Temporal Attention Pooling

Zhenyi Huang and Dacan Jiang
School of Computer, South China Normal University, Guangzhou, China

Abstract

Acoustic scene classification is a challenging task in machine learning with limited data sets. In this report, several different spectrograms are applied to classify the acoustic scenes using a deep convolutional neural network with spatial-temporal attention pooling. In addition, mixup augmentation is performed to further improve the classification performance. Finally, majority voting is performed on six different models, and an accuracy of 73.86% is achieved, which is 11.36 percentage points higher than that of the baseline system.

System characteristics
Input left, right, mixed
Sampling rate 44.1kHz
Data augmentation mixup
Features MFCC, CQT
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification Using Various Pre-Processed Features and Convolutional Neural Networks

Seo Hyeji and Park Jihwan
Advanced Robotics Lab, LG Electronics, Seoul, Korea

Abstract

In this technical report, we describe our acoustic scene classification algorithm submitted to DCASE 2019 Task 1a. We focus on various pre-processed features to categorize the class of acoustic scenes using only the stereo microphone input signal. In the front-end system, the pre-processed and spatial information is extracted from the stereo microphone input. A residual network, a subspectral network, and a conventional convolutional neural network (CNN) are used as back-end systems. Finally, we ensemble all of the models to take advantage of each algorithm. Using the proposed systems, we achieved a classification accuracy of 80.4%, which is 17.9 percentage points above the baseline system.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, spectrogram, chromagram
Classifier CNN
Decision making average
PDF

Acoustic Scene Classification Using Ensembles of Convolutional Neural Networks and Spectrogram Decompositions

Shengwang Jiang and Chuang Shi
School of Communication and Information Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract

This technical report proposes ensembles of convolutional neural networks (CNNs) for the task 1 / subtask B of the DCASE 2019 challenge, with emphasis on using different spectrogram decompositions. The harmonic percussive source separation (HPSS), nearest neighbor filter (NNF), and vocal separation are applied to the monaural samples. Head-related transfer function (HRTF) is also proposed to transform monaural samples to binaural ones with augmented spatial information. Finally, 16 neural networks are trained and put together. The classification accuracy of the proposed system achieves 0.70166 on the public leaderboard.

System characteristics
Sampling rate 44.1kHz
Data augmentation HPSS, NNF, vocal separation, HRTF
Features log-mel energies
Classifier CNN
Decision making stacking; averaging
PDF
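
Of the decompositions listed above, harmonic-percussive source separation (HPSS) has a standard implementation in librosa; the sketch below illustrates only that step (the NNF, vocal-separation and HRTF processing are not shown), with default parameters and a hypothetical file name rather than the authors' settings.

```python
import librosa

# Hypothetical input file; parameters are librosa defaults, not the authors' choices.
y, sr = librosa.load("example_segment.wav", sr=44100, mono=True)
harmonic, percussive = librosa.effects.hpss(y)         # time-domain H/P components
mel_h = librosa.power_to_db(librosa.feature.melspectrogram(y=harmonic, sr=sr))
mel_p = librosa.power_to_db(librosa.feature.melspectrogram(y=percussive, sr=sr))
# mel_h and mel_p can then be used as separate log-mel input channels.
```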

Knowledge Distillation with Specialist Models in Acoustic Scene Classification

Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim and Ha-Jin Yu
Computing Sciences, University of Seoul, Seoul, Republic of Korea

Abstract

In this technical report, we describe our submission for the Detection and Classification of Acoustic Scenes and Events 2019 Task 1a competition, which exploits knowledge distillation with specialist models. Different acoustic scenes that share common properties are one of the main obstacles that hinder successful acoustic scene classification. We found that confusion between scenes sharing common properties causes most of the errors in acoustic scene classification; for example, confusing scene pairs such as airport-shopping mall and metro-tram have caused the most errors in various systems. We applied knowledge distillation based on specialist models to address the errors from the most confusing scene pairs. Specialist models, each of which concentrates on discriminating a pair of similar scenes, are exploited to provide soft labels. We expected that knowledge distillation from multiple specialist models and a pre-trained generalist model to a single model could train an ensemble of models that gives more emphasis to discriminating specific acoustic scene pairs. Through knowledge distillation from a well-trained model and specialist models to a single model, we report improved accuracy on the validation set.

System characteristics
Input binaural
Sampling rate 48kHz
Data augmentation mixup
Features raw waveform, log-mel energies
Classifier CNN
Decision making majority vote; score-sum
PDF

The I2r Submission to DCASE 2019 Challenge

Teh KK1, Sun HW2 and Tran Huy Dat2
1I2R, A-star, Singapore, 2I2R, A-Star, Singapore

Abstract

This paper proposes convolutional neural network (CNN) ensembles for acoustic scene classification in Task 1A of the DCASE 2019 challenge. In this approach, various pre-processed features (mel filterbank with delta feature vectors, harmonic-percussive separation, and subband power distribution) are used to train the CNN models. We also used score fusion of the features to find an optimum feature configuration. On the official leaderboard dataset of the Task 1A challenge, an accuracy of 79.67% is achieved.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies, HPSS; log-mel energies, HPSS, subband power distribution
Classifier CNN
Decision making weighted averaging vote; multi-class linear logistic regression
PDF

Calibrating Neural Networks for Secondary Recording Devices

Michał Kośmider
Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

This report describes the solution to Task 1B of the DCASE 2019 challenge proposed by Samsung R&D Institute Poland. The primary focus of the system for Task 1B was a novel technique designed to address the difficulty of learning from microphones with different frequency responses in settings with limited examples for the targeted secondary devices. This technique is independent of the architecture of the predictive model and requires just a few examples to become effective.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation Spectrum Correction, SpecAugment, mixup
Features log-mel energies
Classifier CNN
Decision making isotonic-calibrated soft-voting; soft-voting
PDF
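
The report above does not spell out the correction itself; the sketch below only illustrates the general idea of a per-frequency spectrum correction estimated from a few time-aligned recordings of the reference device (A) and a secondary device (B or C). The bin handling, smoothing, and where the correction is applied in the actual SRPOL system may differ.

```python
import numpy as np
import librosa

def estimate_correction(ref_files, sec_files, sr=44100, n_fft=2048):
    """Average per-bin magnitude ratio between reference and secondary device."""
    ratios = []
    for ref_path, sec_path in zip(ref_files, sec_files):   # time-aligned pairs
        ref, _ = librosa.load(ref_path, sr=sr, mono=True)
        sec, _ = librosa.load(sec_path, sr=sr, mono=True)
        ref_mag = np.abs(librosa.stft(ref, n_fft=n_fft)).mean(axis=1)
        sec_mag = np.abs(librosa.stft(sec, n_fft=n_fft)).mean(axis=1)
        ratios.append(ref_mag / (sec_mag + 1e-10))
    return np.mean(ratios, axis=0)                          # shape: (n_fft // 2 + 1,)

def apply_correction(audio, correction, n_fft=2048):
    """Scale each STFT bin of a secondary-device recording toward device A."""
    spec = librosa.stft(audio, n_fft=n_fft)
    return librosa.istft(spec * correction[:, None])
```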

Cross-Task Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.

System characteristics
Input mono
Sampling rate 32kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification and Audio Tagging with Receptive-Field-Regularized CNNs

Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

In this report, we detail the CP-JKU submissions to the DCASE-2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural networks architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of Resnet and Densenet architectures to best fit the various audio processing tasks that use the spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year’s submissions is to provide the best-performing single-model submission, using our proposed approaches.

System characteristics
Input binaural
Sampling rate 22.05kHz
Data augmentation mixup
Features perceptual weighted power spectrogram
Classifier CNN, Receptive Field Regularization
Decision making average
PDF

Acoustic Scene Classification with Reject Option Based on Resnets

Bernhard Lehner1 and Khaled Koutini2
1Silicon Austria Labs, JKU, Linz, Austria, 2Institute of Computational Perception, JKU, Linz, Austria

Abstract

This technical report describes the submissions from the SAL/CP JKU team for Task 1 - Subtask C (classification on data that includes classes not encountered in the training data) of the DCASE-2019 challenge. Our method uses a ResNet variant specifically adapted to be used along with spectrograms in the context of Acoustic Scene Classification (ASC). The reject option is based on the logit values of the same networks. We do not use any of the provided external data sets, and perform data augmentation only with the mixup technique [1]. The result of our experiments is a system that achieves classification accuracies of up to around 60% on the public Kaggle-Leaderboard. This is an improvement of around 14 percentage points compared to the official DCASE 2019 baseline.

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making logit averaging
PDF

Multi-Scale Recalibrated Features Fusion for Acoustic Scene Classification

Chongqin Lei and Zixu Wang
Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China

Abstract

We investigate the effectiveness of multi-scale recalibrated feature fusion for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019). A general problem in the acoustic scene classification task is that audio signal segments contain little effective information. In order to better exploit features with little effective information and improve classification accuracy, we embed Squeeze-and-Excitation units into the backbone structure of Xception to recalibrate the channel weights of the feature maps in each block. In addition, the recalibrated multi-scale features are fused and finally fed into the fully connected layer to obtain more useful information. Furthermore, we use the mixup method to augment the data in the training stage to reduce the degree of over-fitting of the network. The proposed method attains a recognition accuracy of 77.5%, which is 13% higher than the baseline system of the DCASE 2019 acoustic scene classification task.

System characteristics
Input binaural
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification Using Attention-Based Convolutional Neural Network

Han Liang and Yaxiong Ma
Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China

Abstract

This technical report addresses Task 1, Subtask A (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge, whose goal is to classify a test audio recording into one of the predefined classes that characterize the environment. We use mel-spectrograms as the audio feature and deep convolutional neural networks (CNNs) as the classifier for acoustic scenes. In our method, the spectrogram of every audio clip is divided in two ways. In addition, we introduce an attention mechanism to further improve the performance. Experimental results illustrate that our best model can achieve a classification accuracy of around 70.7% on the development dataset, which is superior to the baseline system with its accuracy of 62.5%.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
PDF

Jsnu_wdxy Submission for DCASE-2019: Acoustic Scene Classification with Convolution Neural Networks

Xinixn Ma and Mingliang Gu
School of Physics and Electronic, Jiangsu Normal University, Xuzhou, China

Abstract

Acoustic Scene Classification (ASC) is the task of identifying the scene in which an audio signal is recorded. It is one of the core research problems in the field of computational sound scene analysis. Most of the current best-performing ASC systems utilize mel-scale spectrograms with Convolutional Neural Networks (CNNs). In this paper, we demonstrate how we applied convolutional neural networks to DCASE 2019 Task 1, acoustic scene classification. First, we applied mel-scale spectrograms to extract acoustic features; the mel scale is a common way to approximate the frequency warping of human ears, with strictly decreasing frequency resolution from low to high frequencies. Second, we generate mel spectrograms from the binaural audio and adaptively learn 5 Convolutional Neural Networks. The best classification result of the proposed system was 71.1% for the development dataset and 73.16% for the leaderboard dataset.

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with Neural Network

Sifan Ma and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China

Abstract

This technical report presents our approach for the acoustic scene classification of DCASE2019 task 1a. Compared to traditional audio features such as Mel-frequency Cepstral Coefficients (MFCC) and Constant-Q Transform (CQT), we choose Deep Scattering Spectra (DSS) features, which are more suitable for characterizing acoustic scenes. DSS is a good way to preserve high-frequency information. Based on DSS features, we choose a network model of a Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU) to classify acoustic scenes. The experimental results show that our approach increases the classification accuracy from 62.5% (DCASE2019 baseline) to 85%.

System characteristics
Input left, right
Sampling rate 48kHz
Features DSS
Classifier CNN, DNN
PDF

Acoustic Scene Classification From Binaural Signals Using Convolutional Neural Networks

Rohith Mars, Pranay Pratik, Srikanth Nagisetty and Chong Soon Lim
Core Technology Group, Panasonic R&D Center, Singapore, Singapore

Abstract

In this report, we present the technical details of our proposed framework and solution for the DCASE 2019 Task 1A - Acoustic Scene Classification challenge. We describe the audio pre-processing, feature extraction steps and the time-frequency (TF) representations used for acoustic scene classification using binaural audio recordings. We employ two distinct architectures of convolutional neural networks (CNNs) for processing the extracted audio features for classification and compare their relative performance in terms of both accuracy and model complexity. Using an ensemble of the predictions from multiple models based on the above CNNs, we achieved an average classification accuracy of 79.35% on the test split of the development dataset for this task and a system score of 82.33% in the Kaggle public leaderboard, which is an improvement of ≈ 18% over the baseline system.

System characteristics
Input mono, left, right, mid, side
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making max probability
PDF

The System for Acoustic Scene Classification Using Resnet

Liu Mingle and Li Yanxiong
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province

Abstract

In this report, we present our work concerning task 1a of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a strategy of classifier voting for ASC. Specifically, an audio feature, such as the logarithmic filter-bank (LFB), is first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to obtain an ensemble of classifiers. Finally, the classification result for each test sample is based on the voting of all classifiers.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies
Classifier ResNet
Decision making vote
PDF

DCASE 2019: CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification

Javier Naranjo-Alcazar1, Sergi Perez-Castanos1, Pedro Zuccarello1 and Maximo Cobos2
1Visualfy AI, Visualfy, Benisano, Spain, 2Computer Science, Universitat de Valencia, Burjassot, Spain

Abstract

The objective of this technical report is to describe the framework used in Task 1, Acoustic scene classification (ASC), of the DCASE 2019 challenge. The presented approach is based on Log-Mel spectrogram representations and VGG-based Convolutional Neural Networks (CNNs). Three different CNNs, with very similar architectures, have been implemented. The main difference is the number of filters in their convolutional blocks. Experiments show that the depth of the network is not the most relevant factor for improving the accuracy of the results. The performance seems to be more sensitive to the input audio representation. This conclusion is important for the implementation of real-time audio recognition and classification system on edge devices. In the presented experiments the best audio representation is the Log-Mel spectrogram of the harmonic and percussive sources plus the Log-Mel spectrogram of the difference between left and right stereo-channels (L − R). Also, in order to improve accuracy, ensemble methods combining different model predictions with different inputs are explored. Besides geometric and arithmetic means, ensembles aggregated with the Orness Weighted Averaged (OWA) operator have shown interesting and novel results. The proposed framework outperforms the baseline system by 14.34 percentage points. For Task 1a, the obtained development accuracy is 76.84%, being 62.5% the baseline, whereas the accuracy obtained in public leaderboard is 77.33%, being 64.33% the baseline.

System characteristics
Input mono, left, right, difference, harmonic, percussive
Sampling rate 48kHz
Features log-mel energies
Classifier ensemble, CNN
Decision making arithmetic mean; geometric mean; orness weighted average
PDF

DCASE 2019 Task 1a: Acoustic Scene Classification by Sffcc and DNN

Chandrasekhar Paseddula1 and Suryakanth V.Gangashetty2
1Electronics and Communication Engineering, International Institute of Information Technology, Hyderabad, Hyderabad, India, 2Computer Science Engineering, International Institute of Information Technology, Hyderabad, Hyderabad, India

Abstract

In this study, we address the acoustic scene classification (ASC) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge, Task 1A. Single frequency filtering cepstral coefficient (SFFCC) features and a deep neural network (DNN) model are proposed for ASC. We adopt a late fusion mechanism to further improve performance, and we validate the performance of the model against the baseline system. We used the TAU Urban Acoustic Scenes 2019 development dataset for training and cross-validation, resulting in a 7.9% improvement compared to the baseline system.

System characteristics
Input mono
Sampling rate 48kHz
Features single frequency cepstral coefficients (SFCC), log-mel energies
Classifier DNN
Decision making maxrule
PDF

Cdnn-CRNN Joined Model for Acoustic Scene Classification

Lam Pham1, Tan Doan2, Dat Thanh Ngo2, Hung Nguyen2 and Ha Hoang Kha2
1School of Computing, University of Kent, Chatham, United Kingdom, 2Electrical and Electronics Engineering, HoChiMinh City University of Technology, HoChiMinh City, Vietnam

Abstract

This work proposes a deep learning framework applied for Acoustic Scene Classification (ASC), targeting DCASE2019 task 1A. In general, the front-end process shows a combination of three types of spectrograms: Gammatone (GAM), log-Mel and Constant Q Transform (CQT). The back-end classification presents a joined learning model between CDNN and CRNN. Our experiments over the development dataset of DCASE2019 challenge task 1A show a significant improvement, increasing 11.2% compared to DCASE2019 baseline of 62.5%. The Kaggle reports the classification accuracy of 74.6% when we train all development dataset.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation mixup
Features Gammatone, log-mel energies, CQT
Classifier CNN, RNN
PDF

A Multi-Spectrogram Deep Neural Network for Acoustic Scene Classification

Lam Pham, Ian McLoughlin, Huy Phan and Ramaswamy Palaniappan
School of Computing, University of Kent, Chatham, United Kingdom

Abstract

This work targets tasks 1A and 1B of the DCASE2019 challenge, i.e. Acoustic Scene Classification (ASC) over ten different classes recorded by the same device (task 1A) and by mismatched devices (task 1B). For the front-end feature extraction, this work proposes a combination of three types of spectrograms: Gammatone (GAM), log-Mel and Constant Q Transform (CQT). The back-end classification comprises two training processes, namely a pre-trained CNN and a post-trained DNN, and the result of the post-trained DNN is reported. Our experiments over the development datasets of DCASE2019 1A and 1B show significant improvement, increasing 14% and 17.4% compared to the DCASE2019 baselines of 62.5% and 41.4%, respectively. The Kaggle report also confirms classification accuracies of 79% and 69.2% for tasks 1A and 1B.

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Data augmentation mixup
Features Gammatone, log-mel energies, CQT
Classifier CNN, DNN
PDF

Deep Neural Networks with Supported Clusters Preclassification Procedure for Acoustic Scene Recognition

Marcin Plata
Data Intelligence Group, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

In this technical report, we present a system for acoustic scene classification that focuses on a deeper analysis of the data. We analyzed the impact of various combinations of parameters for the short-time Fourier transform (STFT) and the mel filter bank. We also used the harmonic-percussive source separation (HPSS) algorithm as an additional feature extractor. Finally, alongside common divided, non-overlapping spectrogram classification networks, we present an out-of-the-box solution with one main neural network trained on clustered labels and a few supporting neural networks to distinguish between the most difficult scenes, e.g. street pedestrian and public square.

System characteristics
Input mono, left, right
Sampling rate 48kHz
Data augmentation mixup
Features log-mel energies, harmonic, percussive
Classifier CNN, random forest; CNN
Decision making random forest; majority vote
PDF

Acoustic Scene Classification with Mismatched Recording Devices

Paul Primus and David Eitelsebner
Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes CP-JKU Student team’s approach for Task 1 - Subtask B of the DCASE 2019 challenge. In this context, we propose two loss functions for domain adaptation to learn invariant representations given time-aligned recordings. We show that these methods improve the classification performance on our cross-validation, as well as performance on the Kaggle leader board, up to three percentage points compared to our baseline model. Our best scoring submission is an ensemble of eight classifiers.

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, ensemble
Decision making average
PDF
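
The MSE variant suggested by the submission names above (CPR-MSE) presumably penalizes the distance between embeddings of time-aligned device A and device B/C recordings on top of the usual classification loss. The PyTorch sketch below shows only that general idea; the loss weighting and the assumption that `model` returns an (embedding, logits) pair are illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F

def training_loss(model, x_a, x_bc, labels, lam: float = 1.0):
    """x_a, x_bc: time-aligned batches from device A and from devices B/C."""
    emb_a, logits_a = model(x_a)        # assumed interface: (embedding, logits)
    emb_bc, logits_bc = model(x_bc)
    cls = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_bc, labels)
    align = F.mse_loss(emb_bc, emb_a)   # push device B/C embeddings toward device A
    return cls + lam * align
```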

Frequency-Aware CNN for Open Set Acoustic Scene Classification

Alexander Rakowski1 and Michał Kośmider2
1Audio Intelligence, Samsung R&D Institute Poland, Warsaw, Poland, 2Artificial Intelligence, Samsung R&D Institute Poland, Warsaw, Poland

Abstract

This report describes systems used for Task 1c of the DCASE 2019 Challenge - Open Set Acoustic Scene Classification. The main system consists of a 5-layer convolutional neural network which preserves the location of features on the frequency axis. This is in contrast to the standard approach where global pooling is applied along the frequency-related dimension. Additionally the main system is combined with an ensemble of calibrated neural networks in order to improve generalization.

System characteristics
Input mono
Sampling rate 32kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making soft-voting
PDF

Urban Acoustic Scene Classification Using Raw Waveform Convolutional Neural Networks

Daniele Salvati, Carlo Drioli and Gian Luca Foresti
Mathematics, Computer Science and Physics, University of Udine, Udine, Italy

Abstract

We present the signal processing framework and the results obtained with the development dataset (task 1, subtask A) for the detection and classification of acoustic scenes and events (DCASE 2019) challenge. The framework for the classification of urban acoustic scenes consists of a raw waveform (RW) end-to-end computational scheme based on convolutional neural networks (CNNs). The RW-CNN operates on a time-domain signal segment of 0.5 s and consists of 5 one-dimensional convolutional layers and 3 fully connected layers. The overall classification accuracy with the development dataset of the proposed RW-CNN is 69.7 %.

System characteristics
Input mono
Sampling rate 48kHz
Features raw waveform
Classifier CNN
PDF

Acoustic Scene Classification Using Specaugment and Convolutional Neural Network with Inception Modules

Suh Sangwon, Jeong Youngho, Lim Wootaek and Park Sooyoung
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea

Abstract

This paper describes the system submitted to Task 1a (Acoustic Scene Classification, ASC). By analyzing the major systems submitted in 2017 and 2018, we selected a two-dimensional convolutional neural network (CNN) as the most suitable model for this task. The proposed model is composed of four convolution blocks; two of them are conventional CNN structures, while the following two consist of Inception modules. We constructed a meta-learning problem with this model in order to train a super learner. For each base model, we applied a different validation split to obtain a more generalized result with the ensemble method. In addition, we applied data augmentation in real time with SpecAugment, performed for each base model. With our final system, with all of the above techniques applied, we achieved an accuracy of 76.1% on the development dataset and 81.3% on the leaderboard set.

System characteristics
Input mono
Sampling rate 48kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CNN; ensemble
PDF
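
SpecAugment is referred to above only by name; the sketch below shows its basic time- and frequency-masking steps on a single log-mel spectrogram, with mask sizes picked arbitrarily for illustration rather than taken from the report.

```python
import numpy as np

def spec_augment(spec: np.ndarray, freq_mask: int = 8, time_mask: int = 20):
    """spec: log-mel spectrogram of shape (n_mels, n_frames)."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f0 = np.random.randint(0, n_mels - freq_mask)     # start of frequency mask
    t0 = np.random.randint(0, n_frames - time_mask)   # start of time mask
    spec[f0:f0 + freq_mask, :] = spec.mean()          # mask a band of mel bins
    spec[:, t0:t0 + time_mask] = spec.mean()          # mask a block of frames
    return spec
```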

Feature Enhancement for Robust Acoustic Scene Classification with Device Mismatch

Hongwei Song and Hao Yang
Computer Sciences and Technology, Harbin Institute of Technology, Harbin, China

Abstract

This technical report describes our system for DCASE 2019 Task 1, Subtask B. We focus on analyzing how device distortions affect the classic log-mel feature, which is the most widely adopted feature for convolutional neural network (CNN) based models. We demonstrate mathematically that, for the log-mel feature, the influence of device distortion appears as an additive constant vector over the log-mel spectrogram. Based on this analysis, we propose feature enhancement methods such as spectrogram-wise mean subtraction and median filtering to remove the additive term introduced by channel distortions. The information loss introduced by these enhancement methods is discussed. We also motivate the use of the mixup technique to generate virtual samples with various device distortions. Combining the proposed techniques, we ranked second on the public Kaggle leaderboard.
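
The additive-offset argument lends itself to a small worked example. The NumPy sketch below (function and variable names are ours, not the authors') applies spectrogram-wise mean subtraction and checks that a constant per-band offset, such as a device frequency response in the log domain, is removed exactly.

```python
import numpy as np

def remove_device_offset(log_mel):
    """Spectrogram-wise mean subtraction: subtract the per-band mean over
    time, which removes any additive constant c[f] introduced by the device
    (together with the scene's average spectrum)."""
    return log_mel - log_mel.mean(axis=1, keepdims=True)

# Tiny check of the additive-constant claim:
rng = np.random.default_rng(0)
scene = rng.normal(size=(64, 100))      # "true" log-mel of a scene
offset = rng.normal(size=(64, 1))       # device response, constant over time
assert np.allclose(remove_device_offset(scene + offset),
                   remove_device_offset(scene))
```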

System characteristics
Sampling rate: 44.1 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN
Decision making: probability aggregation; majority vote

Wavelet Based Mel-Scaled Features for DCASE 2019 Task 1a and Task 1b

Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission to the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 for Task 1 (acoustic scene classification (ASC)), Subtask A (basic ASC) and Subtask B (ASC with mismatched recording devices). The system exploits a time-frequency representation of audio to obtain the scene labels. It follows a simple pattern-classification framework employing wavelet-transform-based mel-scaled features along with a support vector machine as the classifier. The proposed system outperforms the deep-learning-based baseline system by a relative margin of almost 8% for Subtask A and 26% for Subtask B on the development datasets provided for the respective subtasks.

System characteristics
Input: mono
Sampling rate: 48 kHz; 44.1 kHz
Features: MFDWC
Classifier: SVM

Acoustic Scene Classification Based on CNN System

Zhuhe Wang, Jingkai Ma and Chunyang Li
Noise and Vibration Laboratory, Beijing Technology and Business University, Beijing, China

Abstract

In this study, we present a solution for the acoustic scene classification Task 1a of the DCASE 2019 Challenge. Our model uses a convolutional neural network with several improvements over a basic CNN. We extract MFCC (Mel-frequency cepstral coefficient) features from the official audio files, recreate the dataset, and use it as input to the neural network. Finally, comparing our model against the baseline system, our results are 12% more accurate than the baseline.
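
A minimal sketch of such an MFCC front end, using librosa, is shown below; the sampling rate, number of coefficients and the per-coefficient normalization are illustrative choices, not the authors' exact configuration.

```python
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40):
    """Load a mono audio segment and return a normalized MFCC matrix
    of shape (n_mfcc, n_frames) for use as CNN input."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True) + 1e-8
    return (mfcc - mean) / std
```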

System characteristics
Input: mono
Sampling rate: 22.05 kHz
Features: MFCC
Classifier: CNN

CIAIC-ASC System for DCASE 2019 Challenge Task 1

Mou Wang and Rui Wang
School of Marine Sciences and Technology, Northwestern Polytechnical University, Xi'an, China

Abstract

In this report, we present our systems for Subtask A and Subtask B of DCASE 2019 Task 1, i.e. acoustic scene classification. Subtask A is a basic closed-set classification problem with data from a single device. In our system, we first extracted several acoustic features, such as the mel-spectrogram, hybrid constant-Q transform and harmonic-percussive source separation. Convolutional neural networks (CNNs) with average pooling are used to classify acoustic scenes. We averaged the outputs of CNNs fed with different features to ensemble those methods. Subtask B is a classification problem with mismatched devices, so we introduce a Domain Adaptation Neural Network (DANN) to extract features that are uncorrelated with the recording device. We further ensemble the DANN with the CNN methods to obtain better performance. The accuracy of our system for Subtask A is 0.783 on the validation dataset and 0.816 on the leaderboard dataset. The accuracy for Subtask B reaches 0.717 on the leaderboard dataset, which shows that our method can solve such a cross-domain problem and outperforms the baseline system.
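
The core of a DANN is the gradient reversal layer: the domain (device) classifier is trained normally, while the reversed gradient pushes the feature extractor toward device-invariant features. The PyTorch sketch below shows that layer; it illustrates the general DANN mechanism rather than the authors' specific network.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient in the
    backward pass (the gradient reversal layer used in DANN)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features feed the scene classifier directly and the device
# classifier through the reversal layer, e.g.
#   domain_logits = domain_classifier(grad_reverse(features, lam))
```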

System characteristics
Input: mono
Sampling rate: 32 kHz; 44.1 kHz
Features: log-mel energies
Classifier: CNN; CNN, DNN
Decision making: average

The SEIE-SCUT Systems for Acoustic Scene Classification Using CNN Ensemble

Wucheng Wang and Mingle Liu
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong Province, China

Abstract

In this report, we present our work on Task 1b of DCASE 2019, i.e. acoustic scene classification (ASC) with mismatched recording devices. We propose a CNN-ensemble strategy for ASC. Specifically, audio features such as Mel-frequency cepstral coefficients (MFCCs) and logarithmic filter-bank (LFB) energies are first extracted from the audio recordings. Then a series of convolutional neural networks (CNNs) is built to form a CNN ensemble. Finally, the classification result for each test sample is based on the votes of all CNNs contained in the ensemble.

System characteristics
Input: mono, binaural
Sampling rate: 48 kHz
Data augmentation: mixup
Features: log-mel energies, MFCC
Classifier: VGG, Inception, ResNet
Decision making: vote

Open-Set Acoustic Scene Classification with Deep Convolutional Autoencoders

Kevin Wilkinghoff and Frank Kurth
Communication Systems, Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany

Abstract

Acoustic scene classification is the task of determining the environment in which a given audio file has been recorded. If it is not known a priori whether all possible environments that may be encountered at test time are also known when training the system, the task is referred to as open-set classification. This paper describes an open-set acoustic scene classification system submitted to Task 1C of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Our system consists of a combination of convolutional neural networks for closed-set identification and deep convolutional autoencoders for outlier detection. In evaluations conducted on the leaderboard dataset of the challenge, the proposed system significantly outperforms the baseline systems and improves the score by 35.4%, from 0.46666 to 0.63166.
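
A minimal PyTorch sketch of the outlier-detection idea, comparing the reconstruction error of a convolutional autoencoder against a threshold calibrated on known scenes, is shown below. The layer sizes and the threshold are illustrative assumptions, and the sketch assumes spectrogram inputs whose height and width are divisible by 4.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Toy convolutional autoencoder for log-mel spectrogram patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def is_unknown(model, log_mel, threshold):
    """Flag a segment as an unknown scene if its reconstruction error
    exceeds a threshold calibrated on known-scene training data."""
    with torch.no_grad():
        err = torch.mean((model(log_mel) - log_mel) ** 2).item()
    return err > threshold
```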

System characteristics
Input: mono
Sampling rate: 48 kHz
Data augmentation: mixup, cutout, width shift, height shift
Features: log-mel energies; log-mel energies, harmonic part, percussive part
Classifier: CNN; CNN, ensemble; CNN, DCAE, logistic regression; CNN, DCAE, logistic regression, ensemble
Decision making: maximum likelihood; geometric mean, maximum likelihood; threshold, maximum likelihood; geometric mean, threshold, maximum likelihood

Stratified Time-Frequency Features for CNN-Based Acoustic Scene Classification

Yuzhong Wu and Tan Lee
Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China

Abstract

An acoustic scene signal is a mixture of diverse sound events, which frequently overlap with each other. CNN models for acoustic scene classification often suffer from overfitting because they may memorize the overlapped sounds as the representative patterns for acoustic scenes and then fail to recognize a scene when only one of those sounds is present. Based on a standard CNN setup with the log-mel feature as input, we propose to stratify the log-mel image into several component images according to sound duration, so that each component image contains a specific type of time-frequency pattern. We then emphasize independent modeling of these time-frequency patterns to better utilize the stratified features. Experimental results on the TAU Urban Acoustic Scenes 2019 development dataset [1] show that using the stratified features can significantly improve classification performance.
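
The abstract summarizes rather than specifies the stratification procedure; the sketch below shows one plausible way to split a log-mel spectrogram into long-, medium- and short-duration components using median filters of different lengths along the time axis. The window lengths and the three-way split are our assumptions, not the authors' recipe.

```python
import numpy as np
from scipy.ndimage import median_filter

def stratify_by_duration(log_mel, win_long=51, win_short=11):
    """Split a log-mel spectrogram (n_mels x n_frames) into components that
    emphasize long-, medium- and short-duration time-frequency patterns."""
    long_part = median_filter(log_mel, size=(1, win_long))
    medium_part = median_filter(log_mel, size=(1, win_short)) - long_part
    short_part = log_mel - long_part - medium_part
    return np.stack([long_part, medium_part, short_part])  # (3, n_mels, n_frames)
```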

System characteristics
Input: mono
Sampling rate: 48 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN
Decision making: majority vote

Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE 2019 Challenge

Hossein Zeinali, Lukas Burget and Honza Cernocky
Information Technology, Brno University of Technology, Brno, Czech Republic

Abstract

In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge are described, along with an analysis of different methods. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first is a VGG-like two-dimensional CNN. The second is also a two-dimensional CNN, which uses Max-Feature-Map activations and is called Light-CNN (LCNN). The third is a one-dimensional CNN, mainly used for speaker verification, known as the x-vector topology. All proposed networks use a self-attention mechanism for statistics pooling. As features, we use 256-dimensional log Mel-spectrograms. Our submissions are fusions of several networks trained on a 4-fold generated evaluation setup, using different fusion strategies.
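
A minimal PyTorch sketch of self-attentive statistics pooling, i.e. the weighted mean and standard deviation of frame-level features under learned attention weights, is shown below; the hidden size is illustrative.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Self-attentive statistics pooling over a sequence of frame features."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):                            # h: (batch, frames, feat_dim)
        w = torch.softmax(self.attention(h), dim=1)  # (batch, frames, 1)
        mean = torch.sum(w * h, dim=1)
        var = torch.sum(w * h ** 2, dim=1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)         # (batch, 2 * feat_dim)
```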

System characteristics
Input: mono
Sampling rate: 22.05 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN, ensemble
Decision making: score fusion; majority vote; majority vote, score fusion

Acoustic Scene Classification Combining Log-Mel CNN Model and End-To-End Model

Xu Zheng and Jie Yan
Computing Sciences, University of Science and Technology of China, Hefei, Anhui, China

Abstract

This technical report describes the Zheng-USTC team’s submissions for Task 1, Subtask A (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge. Two different models for acoustic scene classification are presented. The first is a conventional two-dimensional CNN model in which the log-mel energy spectrogram is treated as an image. The second is an end-to-end model in which features are extracted directly from the audio by a 3-layer CNN with 64 filters. Experimental results on the fold-1 validation set of 4185 samples and on the leaderboard show that the class-wise accuracies of the two models are complementary to some extent. Finally, we fused the softmax output scores of the two systems using a simple non-weighted average.
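
The fusion step is straightforward; a minimal NumPy sketch of the non-weighted average of the two systems' softmax outputs is shown below (the array names are ours).

```python
import numpy as np

def fuse_softmax(scores_cnn, scores_e2e):
    """Average the class-probability matrices of the two systems
    (each of shape n_segments x n_classes) and return the fused labels."""
    fused = (scores_cnn + scores_e2e) / 2.0
    return fused.argmax(axis=1)
```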

System characteristics
Input: binaural; mono
Sampling rate: 22.05 kHz; 16 kHz
Data augmentation: SpecAugment, RandomCrop; Between-Class learning; SpecAugment, RandomCrop, Between-Class learning
Features: log-mel energies; raw waveform
Classifier: CNN

Audio Scene Classification Based on Deeper CNN and Mixed Mono Channel Feature

Nai Zhou, Yanfang Liu and Qingkai Wei
Beijing Kuaiyu Electronics Co., Ltd., Beijing, China

Abstract

This technical report describes the Kuaiyu team’s submission for Task 1, Subtask A (Acoustic Scene Classification, ASC) of the DCASE 2019 challenge. Following the results of DCASE 2018, we use a convolutional neural network with log-mel spectrograms generated from mono audio; the log-mel spectrogram is converted into a multi-channel spectrogram and used as input to a neural network with 8 convolutional layers. The result of our experiments is a classification system that achieves a classification accuracy of around 75.5% on the public Kaggle leaderboard.

System characteristics
Input: mono
Sampling rate: 48 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN

DCASE 2019 Challenge Task1 Technical Report

Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes our methods for DCASE 2019 Task 1a and Task 1c of Acoustic Scene Classification (ASC), especially Task 1c, where unknown scenes not included in the training dataset must be handled. We use less training data and a threshold to separate known and unknown scenes. In our method, log Mel-spectrograms with different divisions are used as inputs to multiple neural networks, and the ensemble output shows good accuracy. For Task 1a we use VGG and Xception networks with an ensemble over 3 different divisions; the accuracy is 0.807 on the leaderboard dataset. For Task 1c we use a Convolutional Recurrent Neural Network (CRNN) with a self-attention mechanism, an ensemble over 2 different feature divisions, and a threshold of 0.4 for the unknown decision; the leaderboard accuracy is 0.648.
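
A minimal sketch of the unknown-scene decision, labelling a segment as unknown when its highest class probability falls below the 0.4 threshold, is shown below; the exact decision rule used by the authors may differ.

```python
import numpy as np

def classify_with_unknown(probs, class_names, threshold=0.4):
    """Return the most likely known scene, or 'unknown' if the maximum
    class probability does not reach the threshold."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "unknown"
    return class_names[best]
```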

System characteristics
Input: mono
Sampling rate: 48 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN, BGRU, self-attention

DCASE 2019 Challenge Task1 Technical Report

Houwei Zhu1, Chunxia Ren2, Jun Wang2, Shengchen Li2, Lizhong Wang1 and Lei Yang1
1Speech Lab, Samsung Research China-Beijing, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

This report describes our methods for DCASE 2019 Task 1a and Task 1c of Acoustic Scene Classification (ASC), especially Task 1c, where unknown scenes not included in the training dataset must be handled. We use less training data and a threshold to separate known and unknown scenes. In our method, log Mel-spectrograms with different divisions are used as inputs to multiple neural networks, and the ensemble output shows good accuracy. For Task 1a we use VGG and Xception networks with an ensemble over 3 different divisions; the accuracy is 0.807 on the leaderboard dataset. For Task 1c we use a Convolutional Recurrent Neural Network (CRNN) with a self-attention mechanism, an ensemble over 2 different feature divisions, and a threshold of 0.4 for the unknown decision; the leaderboard accuracy is 0.648.

System characteristics
Input: multiple
Sampling rate: 48 kHz
Data augmentation: mixup
Features: log-mel energies
Classifier: CNN, BGRU, self-attention, ensemble