Task description
This subtask addresses the classification of daily activities performed in a home environment (e.g. cooking). The provided samples are multi-channel audio segments acquired by multiple microphone arrays at different positions, which means that spatial properties can be exploited as input features for the classification problem. However, using the absolute localization of sound sources as input to the detection model is bound to generalize poorly when the position of the microphone array is altered. The focus of this task is therefore on systems that exploit spatial cues independent of sensor location using multi-channel audio.
The development data consists of recordings obtained by four microphone arrays at different positions. The evaluation dataset contains data from seven microphone arrays: the four arrays available in the development set plus three unknown arrays. The former are used to provide quantitative numbers on spatial overfitting, while the latter are used to determine the winner of Task 5.
A more detailed task description can be found on the task description page.
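Nearly every system in the tables below starts from per-channel log-mel energies. As a point of reference, here is a minimal sketch of such a front-end, assuming librosa is available and the input is a (channels, samples) array at 16 kHz; the frame parameters are illustrative, not the challenge settings:

```python
import numpy as np
import librosa

def logmel_per_channel(audio, sr=16000, n_mels=40):
    """audio: (n_channels, n_samples) -> (n_channels, n_mels, n_frames)."""
    feats = []
    for ch in audio:
        # mel-filtered power spectrogram, then conversion to dB (log-mel)
        mel = librosa.feature.melspectrogram(
            y=ch, sr=sr, n_fft=1024, hop_length=512, n_mels=n_mels)
        feats.append(librosa.power_to_db(mel))
    return np.stack(feats)
```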
Systems ranking
Submission code | Submission name | Technical Report | F1-score on Eval. set (Unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | F1-score on Dev. set
---|---|---|---|---|---
DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 85.0 | 84.5 | |
Delphin_OL_task5_1 | GCNN_PTS | Delphin-Poulat2018 | 80.7 | 86.1 | 88.5 | |
Delphin_OL_task5_2 | GCNN_FTS | Delphin-Poulat2018 | 80.8 | 85.0 | 88.6 | |
Delphin_OL_task5_3 | GCNN_ATS | Delphin-Poulat2018 | 81.6 | 84.9 | 86.0 | |
Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 86.5 | 88.7 | |
Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 90.4 | 90.0 | |
Inoue_IBM_task5_2 | InouetMilk | Inoue2018 | 88.3 | 90.5 | 90.0 | |
Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 87.6 | 87.8 | |
Kong_Surrey_task5_2 | SurreyCNN4 | Kong2018 | 82.4 | 86.2 | 87.8 | |
Li_NPU_task5_1 | CIAICSys1 | Li2018 | 79.0 | 90.7 | 89.7 | |
Li_NPU_task5_2 | CIAICSys2 | Li2018 | 78.6 | 90.4 | 89.7 | |
Li_NPU_task5_3 | CIAICSys3 | Li2018 | 84.8 | 91.3 | 90.5 | |
Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 91.4 | 90.7 | |
Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 88.6 | 88.7 | |
Liao_NTHU_task5_2 | NTHU_sub_MVDR | Liao2018 | 72.1 | 87.1 | 87.1 | |
Liao_NTHU_task5_3 | NTHU_sub_MVDRMMSE | Liao2018 | 76.7 | 85.7 | 85.5 | |
Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 89.4 | 89.8 | |
Liu_THU_task5_2 | Liu_THU | Liu2018 | 87.4 | 89.5 | 89.8 | |
Liu_THU_task5_3 | Liu_THU | Liu2018 | 86.8 | 89.3 | 88.9 | |
Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 89.9 | 89.9 | |
Raveh_INRC_task5_1 | INRC_1D | Raveh2018 | 80.4 | 87.7 | 87.2 | |
Raveh_INRC_task5_2 | INRC_1DSVD | Raveh2018 | 80.2 | 86.3 | 85.7 | |
Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 87.7 | 86.8 | |
Raveh_INRC_task5_4 | INRC_2DSVD | Raveh2018 | 81.2 | 86.4 | 85.8 | |
Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 78.5 | 92.2 | |
Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 89.7 | 89.8 | |
Tanabe_HIT_task5_2 | HITrf | Tanabe2018 | 82.2 | 86.0 | 90.0 | |
Tanabe_HIT_task5_3 | HITsvm | Tanabe2018 | 86.3 | 89.2 | 90.3 | |
Tanabe_HIT_task5_4 | HITfweight | Tanabe2018 | 88.4 | 89.8 | 89.8 | |
Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 85.8 | 85.8 | |
Zhang_THU_task5_1 | THUEE | Shen2018 | 85.9 | 87.6 | 89.7 | |
Zhang_THU_task5_2 | THUEE | Shen2018 | 84.3 | 86.2 | 91.2 | |
Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 87.5 | 90.5 | |
Zhang_THU_task5_4 | THUEE | Shen2018 | 85.9 | 87.6 | 90.4 |
Teams ranking
Table including only the best-performing system per submitting team.
Submission code | Submission name | Technical Report | F1-score on Eval. set (Unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | F1-score on Dev. set
---|---|---|---|---|---
DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 85.0 | 84.5 | |
Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 86.5 | 88.7 | |
Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 90.4 | 90.0 | |
Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 87.6 | 87.8 | |
Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 91.4 | 90.7 | |
Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 88.6 | 88.7 | |
Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 89.4 | 89.8 | |
Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 89.9 | 89.9 | |
Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 87.7 | 86.8 | |
Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 78.5 | 92.2 | |
Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 89.7 | 89.8 | |
Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 85.8 | 85.8 | |
Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 87.5 | 90.5 |
Class-wise performance
Class-wise F1-scores are listed twice: first for the evaluation set recorded with the unknown microphone arrays, then for the evaluation set recorded with the development-set microphone arrays.
Submission code | Submission name | Technical Report | F1-score on Eval. set (Unknown mic.) | Absence | Cooking | Dishwashing | Eating | Other | Social activity | Vacuum cleaning | Watching TV | Working | F1-score on Eval. set (dev. set mic. arrays) | Absence | Cooking | Dishwashing | Eating | Other | Social activity | Vacuum cleaning | Watching TV | Working
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 87.7 | 93.0 | 77.2 | 81.2 | 35.0 | 96.6 | 95.8 | 99.9 | 81.4 | 85.0 | 89.4 | 96.3 | 79.5 | 82.0 | 44.1 | 96.4 | 95.9 | 99.9 | 81.5 | |
Delphin_OL_task5_1 | GCNN_PTS | Delphin-Poulat2018 | 80.7 | 79.9 | 85.5 | 70.1 | 79.3 | 45.5 | 96.0 | 95.7 | 99.9 | 74.5 | 86.1 | 91.0 | 96.1 | 79.6 | 82.5 | 48.4 | 96.5 | 96.4 | 99.9 | 84.7 | |
Delphin_OL_task5_2 | GCNN_FTS | Delphin-Poulat2018 | 80.8 | 79.1 | 88.5 | 71.4 | 79.9 | 42.7 | 94.6 | 97.0 | 99.9 | 74.2 | 85.0 | 90.5 | 96.0 | 78.5 | 81.0 | 44.1 | 95.1 | 97.0 | 99.9 | 83.2 | |
Delphin_OL_task5_3 | GCNN_ATS | Delphin-Poulat2018 | 81.6 | 83.8 | 90.5 | 71.3 | 78.6 | 44.0 | 94.7 | 96.1 | 99.9 | 76.0 | 84.9 | 91.3 | 95.3 | 75.9 | 81.8 | 45.5 | 94.5 | 96.5 | 99.9 | 83.1 | |
Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 82.2 | 89.5 | 73.8 | 81.1 | 47.4 | 95.5 | 96.7 | 100.0 | 76.2 | 86.5 | 92.0 | 96.2 | 80.4 | 83.1 | 49.3 | 95.8 | 96.8 | 99.9 | 85.3 | |
Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 93.7 | 91.5 | 86.5 | 87.0 | 54.2 | 97.0 | 97.1 | 99.9 | 88.7 | 90.4 | 94.2 | 96.8 | 88.4 | 89.9 | 59.7 | 97.4 | 97.3 | 100.0 | 90.0 | |
Inoue_IBM_task5_2 | InouetMilk | Inoue2018 | 88.3 | 93.6 | 91.7 | 86.1 | 87.0 | 53.6 | 97.0 | 97.1 | 99.9 | 88.7 | 90.5 | 94.2 | 96.9 | 89.4 | 90.2 | 59.5 | 97.4 | 97.2 | 99.9 | 90.2 | |
Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 90.4 | 82.9 | 75.0 | 82.4 | 42.6 | 96.6 | 96.4 | 99.9 | 82.5 | 87.6 | 92.7 | 95.0 | 82.7 | 85.9 | 51.5 | 96.7 | 97.1 | 99.9 | 87.3 | |
Kong_Surrey_task5_2 | SurreyCNN4 | Kong2018 | 82.4 | 87.4 | 84.2 | 74.3 | 78.4 | 45.4 | 96.4 | 96.6 | 99.9 | 79.1 | 86.2 | 90.7 | 94.6 | 81.0 | 83.0 | 48.9 | 96.5 | 97.5 | 99.9 | 83.4 | |
Li_NPU_task5_1 | CIAICSys1 | Li2018 | 79.0 | 79.6 | 84.6 | 76.4 | 80.8 | 20.3 | 95.6 | 96.4 | 99.9 | 77.3 | 90.7 | 93.0 | 97.3 | 91.0 | 91.6 | 61.1 | 96.6 | 97.0 | 100.0 | 89.1 | |
Li_NPU_task5_2 | CIAICSys2 | Li2018 | 78.6 | 81.5 | 85.7 | 78.2 | 74.1 | 24.4 | 92.1 | 95.4 | 99.7 | 76.1 | 90.4 | 92.4 | 97.2 | 91.3 | 92.0 | 59.4 | 96.2 | 96.8 | 100.0 | 88.6 | |
Li_NPU_task5_3 | CIAICSys3 | Li2018 | 84.8 | 88.3 | 91.0 | 81.1 | 84.4 | 40.5 | 97.2 | 97.0 | 99.9 | 83.6 | 91.3 | 94.4 | 97.2 | 89.9 | 91.6 | 62.9 | 97.4 | 97.4 | 100.0 | 91.0 | |
Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 88.1 | 91.3 | 82.9 | 84.7 | 42.2 | 96.6 | 97.1 | 100.0 | 83.3 | 91.4 | 94.3 | 97.4 | 90.3 | 91.8 | 63.1 | 97.5 | 97.3 | 100.0 | 90.8 | |
Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 91.0 | 95.1 | 81.7 | 82.1 | 52.3 | 97.9 | 95.3 | 100.0 | 85.3 | 88.6 | 92.6 | 96.7 | 88.0 | 85.3 | 55.7 | 97.8 | 95.0 | 100.0 | 86.7 | |
Liao_NTHU_task5_2 | NTHU_sub_MVDR | Liao2018 | 72.1 | 69.8 | 64.7 | 63.5 | 67.1 | 17.7 | 95.9 | 96.9 | 99.8 | 73.5 | 87.1 | 91.6 | 92.2 | 84.3 | 84.6 | 52.2 | 96.5 | 97.3 | 99.9 | 85.3 | |
Liao_NTHU_task5_3 | NTHU_sub_MVDRMMSE | Liao2018 | 76.7 | 77.4 | 68.3 | 63.6 | 77.2 | 38.2 | 96.0 | 96.0 | 99.8 | 73.9 | 85.7 | 91.7 | 88.2 | 77.4 | 84.7 | 50.1 | 96.6 | 96.7 | 99.9 | 86.0 | |
Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 92.7 | 89.9 | 84.7 | 85.6 | 53.5 | 96.6 | 97.4 | 100.0 | 87.4 | 89.4 | 93.9 | 95.6 | 87.4 | 86.6 | 56.8 | 97.1 | 97.6 | 100.0 | 89.6 | |
Liu_THU_task5_2 | Liu_THU | Liu2018 | 87.4 | 92.9 | 90.5 | 84.2 | 85.0 | 52.1 | 97.0 | 97.3 | 100.0 | 87.5 | 89.5 | 93.9 | 96.3 | 87.3 | 86.9 | 56.3 | 97.4 | 97.8 | 100.0 | 89.6 | |
Liu_THU_task5_3 | Liu_THU | Liu2018 | 86.8 | 92.1 | 90.3 | 82.1 | 84.2 | 51.9 | 97.0 | 96.7 | 100.0 | 86.5 | 89.3 | 93.3 | 95.9 | 87.5 | 87.5 | 55.3 | 97.6 | 97.6 | 100.0 | 88.7 | |
Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 84.6 | 92.7 | 81.6 | 84.5 | 51.1 | 97.3 | 97.0 | 100.0 | 80.0 | 89.9 | 93.4 | 96.8 | 88.3 | 89.0 | 57.7 | 97.3 | 96.8 | 100.0 | 89.4 | |
Raveh_INRC_task5_1 | INRC_1D | Raveh2018 | 80.4 | 74.8 | 84.1 | 71.9 | 81.5 | 47.6 | 95.1 | 97.1 | 99.9 | 71.5 | 87.7 | 89.3 | 96.0 | 85.9 | 86.1 | 53.5 | 97.0 | 97.8 | 99.9 | 84.2 | |
Raveh_INRC_task5_2 | INRC_1DSVD | Raveh2018 | 80.2 | 69.9 | 91.1 | 75.6 | 79.1 | 44.9 | 95.2 | 97.7 | 99.8 | 68.6 | 86.3 | 88.1 | 95.1 | 81.8 | 83.8 | 51.6 | 96.0 | 97.8 | 99.9 | 82.5 | |
Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 79.7 | 86.9 | 73.8 | 82.2 | 42.7 | 97.1 | 97.4 | 99.9 | 75.5 | 87.7 | 90.8 | 95.3 | 82.8 | 87.2 | 51.4 | 97.5 | 97.8 | 99.9 | 86.2 | |
Raveh_INRC_task5_4 | INRC_2DSVD | Raveh2018 | 81.2 | 75.8 | 87.5 | 73.2 | 80.0 | 48.4 | 95.9 | 96.6 | 99.9 | 73.4 | 86.4 | 89.8 | 94.1 | 78.6 | 84.8 | 51.9 | 96.8 | 96.8 | 99.9 | 84.9 | |
Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 74.9 | 85.5 | 70.5 | 68.5 | 35.2 | 92.9 | 94.7 | 99.8 | 69.5 | 78.5 | 81.3 | 92.7 | 72.1 | 72.2 | 30.5 | 93.9 | 94.5 | 99.7 | 69.6 | |
Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 91.6 | 97.0 | 83.0 | 84.2 | 57.7 | 98.2 | 97.7 | 100.0 | 86.1 | 89.7 | 92.4 | 97.2 | 86.1 | 86.0 | 61.6 | 98.1 | 97.9 | 100.0 | 87.7 | |
Tanabe_HIT_task5_2 | HITrf | Tanabe2018 | 82.2 | 59.1 | 96.1 | 81.5 | 85.7 | 53.7 | 97.7 | 97.7 | 100.0 | 68.6 | 86.0 | 74.6 | 96.9 | 85.5 | 88.2 | 57.9 | 97.7 | 97.9 | 100.0 | 75.4 | |
Tanabe_HIT_task5_3 | HITsvm | Tanabe2018 | 86.3 | 86.1 | 95.8 | 81.6 | 85.2 | 54.6 | 95.9 | 96.7 | 100.0 | 81.3 | 89.2 | 92.6 | 96.7 | 85.4 | 88.1 | 57.3 | 96.6 | 97.2 | 100.0 | 88.7 | |
Tanabe_HIT_task5_4 | HITfweight | Tanabe2018 | 88.4 | 91.3 | 97.0 | 83.0 | 84.1 | 58.3 | 98.2 | 97.7 | 100.0 | 85.8 | 89.8 | 92.6 | 97.2 | 86.4 | 86.1 | 62.1 | 98.2 | 97.9 | 100.0 | 87.9 | |
Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 79.8 | 88.7 | 71.8 | 78.9 | 17.6 | 96.2 | 94.4 | 99.7 | 64.6 | 85.8 | 90.8 | 93.6 | 77.5 | 83.2 | 50.5 | 97.4 | 94.1 | 100.0 | 85.0 | |
Zhang_THU_task5_1 | THUEE | Shen2018 | 85.9 | 92.8 | 88.6 | 78.7 | 81.9 | 50.3 | 97.5 | 96.3 | 99.9 | 87.5 | 87.6 | 93.1 | 94.6 | 80.8 | 85.1 | 52.8 | 97.5 | 96.6 | 99.9 | 88.0 | |
Zhang_THU_task5_2 | THUEE | Shen2018 | 84.3 | 93.6 | 85.1 | 76.8 | 76.6 | 46.6 | 97.1 | 96.5 | 99.9 | 86.9 | 86.2 | 94.1 | 89.7 | 79.0 | 80.7 | 50.6 | 96.7 | 97.1 | 99.9 | 88.0 | |
Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 93.6 | 87.4 | 79.7 | 80.1 | 50.8 | 97.6 | 96.7 | 99.9 | 87.7 | 87.5 | 94.2 | 92.2 | 81.5 | 83.2 | 53.3 | 97.4 | 97.1 | 100.0 | 88.8 | |
Zhang_THU_task5_4 | THUEE | Shen2018 | 85.9 | 93.5 | 87.4 | 79.0 | 79.9 | 51.3 | 97.6 | 96.7 | 99.9 | 87.7 | 87.6 | 94.1 | 92.4 | 81.4 | 83.5 | 53.7 | 97.6 | 97.1 | 100.0 | 88.8 |
System characteristics
Input characteristics
Code | Technical Report | F1-score on Eval. set (Unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | Acoustic features | Spatial features | Data augmentation | Pre-trained model
---|---|---|---|---|---|---|---
DCASE2018 baseline | Dekkers2018 | 83.1 | 85.0 | log-mel energies | ||||
Delphin_OL_task5_1 | Delphin-Poulat2018 | 80.7 | 86.1 | log-mel energies | ||||
Delphin_OL_task5_2 | Delphin-Poulat2018 | 80.8 | 85.0 | log-mel energies | ||||
Delphin_OL_task5_3 | Delphin-Poulat2018 | 81.6 | 84.9 | log-mel energies | | Gaussian Additive Noise | |
Delphin_OL_task5_4 | Delphin-Poulat2018 | 82.5 | 86.5 | log-mel energies | | Gaussian Additive Noise | |
Inoue_IBM_task5_1 | Inoue2018 | 88.4 | 90.4 | log-mel energies | | shuffling, mixing | |
Inoue_IBM_task5_2 | Inoue2018 | 88.3 | 90.5 | log-mel energies | | shuffling, mixing | |
Kong_Surrey_task5_1 | Kong2018 | 83.2 | 87.6 | log-mel energies | ||||
Kong_Surrey_task5_2 | Kong2018 | 82.4 | 86.2 | log-mel energies | ||||
Li_NPU_task5_1 | Li2018 | 79.0 | 90.7 | log-mel energies | coherence | |||
Li_NPU_task5_2 | Li2018 | 78.6 | 90.4 | log-mel energies | coherence | |||
Li_NPU_task5_3 | Li2018 | 84.8 | 91.3 | log-mel energies | coherence | |||
Li_NPU_task5_4 | Li2018 | 85.1 | 91.4 | log-mel energies | coherence | |||
Liao_NTHU_task5_1 | Liao2018 | 86.7 | 88.6 | log-mel energies | | time shifting | |
Liao_NTHU_task5_2 | Liao2018 | 72.1 | 87.1 | log-mel energies | MVDR | time shifting | ||
Liao_NTHU_task5_3 | Liao2018 | 76.7 | 85.7 | log-mel energies | MVDR with MMSE | time shifting | ||
Liu_THU_task5_1 | Liu2018 | 87.5 | 89.4 | log-mel energies, MFCC | | | VGGish |
Liu_THU_task5_2 | Liu2018 | 87.4 | 89.5 | log-mel energies, MFCC | | | VGGish |
Liu_THU_task5_3 | Liu2018 | 86.8 | 89.3 | log-mel energies, MFCC | | | VGGish |
Nakadai_HRI-JP_task5_1 | Nakadai2018 | 85.4 | 89.9 | log-mel energies | ||||
Raveh_INRC_task5_1 | Raveh2018 | 80.4 | 87.7 | Scattering Transform | ||||
Raveh_INRC_task5_2 | Raveh2018 | 80.2 | 86.3 | Scattering Transform, SVD | ||||
Raveh_INRC_task5_3 | Raveh2018 | 81.7 | 87.7 | Scattering Transform | ||||
Raveh_INRC_task5_4 | Raveh2018 | 81.2 | 86.4 | Scattering Transform, SVD | ||||
Sun_SUTD_task5_1 | Chew2018 | 76.8 | 78.5 | MFCC, spectrogram | ||||
Tanabe_HIT_task5_1 | Tanabe2018 | 88.4 | 89.7 | log-mel energies, MFCC | multi-channel front-end processing | | VGG16 |
Tanabe_HIT_task5_2 | Tanabe2018 | 82.2 | 86.0 | log-mel energies, MFCC | multi-channel front-end processing | | VGG16 |
Tanabe_HIT_task5_3 | Tanabe2018 | 86.3 | 89.2 | log-mel energies, MFCC | Blind Source Separation, Blind dereverberation, Beamformer | | VGG16 |
Tanabe_HIT_task5_4 | Tanabe2018 | 88.4 | 89.8 | log-mel energies, MFCC | Blind Source Separation, Blind dereverberation, Beamformer | | VGG16 |
Tiraboschi_UNIMI_task5_1 | Tiraboschi2018 | 76.9 | 85.8 | log-mel energies | ||||
Zhang_THU_task5_1 | Shen2018 | 85.9 | 87.6 | log-mel energies, Time-Frequency Cepstral | ||||
Zhang_THU_task5_2 | Shen2018 | 84.3 | 86.2 | log-mel energies, Time-Frequency Cepstral | ||||
Zhang_THU_task5_3 | Shen2018 | 86.0 | 87.5 | log-mel energies, Time-Frequency Cepstral | ||||
Zhang_THU_task5_4 | Shen2018 | 85.9 | 87.6 | log-mel energies, Time-Frequency Cepstral |
Machine learning characteristics
Code | Technical Report | F1-score on Eval. set (Unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | Classifier | Fusion level | Fusion method | Ensemble subsystems | Decision making
---|---|---|---|---|---|---|---|---
DCASE2018 baseline | Dekkers2018 | 83.1 | 85.0 | CNN | decision | average | |||
Delphin_OL_task5_1 | Delphin-Poulat2018 | 80.7 | 86.1 | CNN | decision | average | |||
Delphin_OL_task5_2 | Delphin-Poulat2018 | 80.8 | 85.0 | CNN | decision | average | |||
Delphin_OL_task5_3 | Delphin-Poulat2018 | 81.6 | 84.9 | CNN | decision | average | |||
Delphin_OL_task5_4 | Delphin-Poulat2018 | 82.5 | 86.5 | CNN | decision | average | |||
Inoue_IBM_task5_1 | Inoue2018 | 88.4 | 90.4 | CNN | decision | average | 4 | average | |
Inoue_IBM_task5_2 | Inoue2018 | 88.3 | 90.5 | CNN | decision | average | 4 | average | |
Kong_Surrey_task5_1 | Kong2018 | 83.2 | 87.6 | AlexNetish 8 layer CNN with global max pooling | decision | average | |||
Kong_Surrey_task5_2 | Kong2018 | 82.4 | 86.2 | AlexNetish 4 layer CNN with global max pooling | decision | average | |||
Li_NPU_task5_1 | Li2018 | 79.0 | 90.7 | CNN, VGG10, ensemble | decision | average | 2 | average | |
Li_NPU_task5_2 | Li2018 | 78.6 | 90.4 | CNN, VGG10, GLU, ensemble | decision | average | 2 | average | |
Li_NPU_task5_3 | Li2018 | 84.8 | 91.3 | CNN, VGG10, ensemble | decision | average | 3 | average | |
Li_NPU_task5_4 | Li2018 | 85.1 | 91.4 | CNN, VGG10, GLU, ensemble | decision | average | 3 | average | |
Liao_NTHU_task5_1 | Liao2018 | 86.7 | 88.6 | CNN | decision | average | |||
Liao_NTHU_task5_2 | Liao2018 | 72.1 | 87.1 | CNN | decision | average | |||
Liao_NTHU_task5_3 | Liao2018 | 76.7 | 85.7 | CNN | decision | average | |||
Liu_THU_task5_1 | Liu2018 | 87.5 | 89.4 | CNN, RNN, ensemble | decision | average | 3 | average | |
Liu_THU_task5_2 | Liu2018 | 87.4 | 89.5 | CNN, RNN, ensemble | decision | average | 3 | average | |
Liu_THU_task5_3 | Liu2018 | 86.8 | 89.3 | CNN, RNN, ensemble | decision | average | 3 | average | |
Nakadai_HRI-JP_task5_1 | Nakadai2018 | 85.4 | 89.9 | Partially Shared CNN | decision | majority vote | |||
Raveh_INRC_task5_1 | Raveh2018 | 80.4 | 87.7 | LSTM, CNN, ResNet | feature | average | |||
Raveh_INRC_task5_2 | Raveh2018 | 80.2 | 86.3 | LSTM, CNN, ResNet | feature | average | |||
Raveh_INRC_task5_3 | Raveh2018 | 81.7 | 87.7 | LSTM, CNN, ResNet | feature | average | |||
Raveh_INRC_task5_4 | Raveh2018 | 81.2 | 86.4 | LSTM, CNN, ResNet | feature | average | |||
Sun_SUTD_task5_1 | Chew2018 | 76.8 | 78.5 | CNN, LSTM, ensemble | decision | average | 3 | average | |
Tanabe_HIT_task5_1 | Tanabe2018 | 88.4 | 89.7 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, average | 89 | average |
Tanabe_HIT_task5_2 | Tanabe2018 | 82.2 | 86.0 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, Random Forest | 89 | Random Forest |
Tanabe_HIT_task5_3 | Tanabe2018 | 86.3 | 89.2 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, SVM | 89 | SVM |
Tanabe_HIT_task5_4 | Tanabe2018 | 88.4 | 89.8 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, F1-score-weighted average | 89 | F1-score-weighted average |
Tiraboschi_UNIMI_task5_1 | Tiraboschi2018 | 76.9 | 85.8 | CNN | classifier | CNN | |||
Zhang_THU_task5_1 | Shen2018 | 85.9 | 87.6 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
Zhang_THU_task5_2 | Shen2018 | 84.3 | 86.2 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
Zhang_THU_task5_3 | Shen2018 | 86.0 | 87.5 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
Zhang_THU_task5_4 | Shen2018 | 85.9 | 87.6 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
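Almost every row above lists decision-level fusion with averaging (Nakadai_HRI-JP uses majority voting instead). A minimal sketch of both schemes, assumed for illustration rather than taken from any submission:

```python
import numpy as np

def average_fusion(posteriors):
    """posteriors: (n_models, n_classes) class probabilities -> label index."""
    return int(np.argmax(np.asarray(posteriors).mean(axis=0)))

def majority_vote(labels):
    """labels: iterable of hard label indices -> most frequent label."""
    return int(np.bincount(np.asarray(labels)).argmax())
```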
Technical reports
DCASE 2018 Challenge: Solution for Task 5
Jeremy Chew, Yingxiang Sun, Lahiru Jayasinghe and Chau Yuen
Engineering Product Development, Singapore University of Technology and Design, Singapore
Sun_SUTD_task5_1
Abstract
To address Task 5 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge, we propose an ensemble learning system in this paper. The proposed system consists of three different models based on convolutional neural networks and long short-term memory recurrent neural networks. Using features such as spectrograms and mel-frequency cepstrum coefficients extracted from the different channels, the proposed system can classify the domestic activities effectively. Experimental results on the provided development dataset show that good performance, with an F1-score of 92.19%, can be achieved. Compared with the baseline system, our proposed system improves the F1-score significantly, by 7.69%.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | MFCC, spectrogram |
Fusion level | decision |
Fusion method | average |
Classifier | CNN, LSTM, ensemble |
Decision making | average |
DCASE 2018 Challenge - Task 5: Monitoring of Domestic Activities Based on Multi-Channel Acoustics
Gert Dekkers and Peter Karsmakers
Computer Science, KU Leuven - ADVISE, Geel, Belgium
Abstract
The DCASE 2018 Challenge consists of five tasks related to the automatic classification and detection of sound events and scenes. This paper presents the setup of Task 5, including a description of the task, the dataset and the baseline system. The task investigates to what extent multi-channel acoustic recordings are beneficial for classifying domestic activities. The goal is to exploit spectral and spatial cues independent of sensor location using multi-channel audio. For this purpose we provided a development and an evaluation dataset, both derivatives of the SINS database, containing domestic activities recorded by multiple microphone arrays. The baseline system, a neural network architecture with convolutional and dense layers, is intended to lower the hurdle of participating in the challenge and to provide a reference performance.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies |
Fusion level | decision |
Fusion method | average |
Classifier | CNN |
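The baseline classifier is described only as a network of convolutional and dense layers. The following PyTorch sketch illustrates that general shape; all layer sizes are chosen for illustration and are not the report's values:

```python
import torch
import torch.nn as nn

class BaselineStyleCNN(nn.Module):
    """Convolutional feature extractor followed by dense layers."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))           # pool over mel x time
        self.fc = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_classes))

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))   # class logits
```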
GCNN for Classification of Domestic Activities
Lionel Delphin-Poulat, Cyril Plapous and Rozenn Nicol
HOME/CONTENT, Orange Labs, Lannion, France
Delphin_OL_task5_1 Delphin_OL_task5_2 Delphin_OL_task5_3 Delphin_OL_task5_4
Abstract
A classifier that assigns multi-channel audio segments to nine classes of daily activities (Task 5 of the DCASE 2018 Challenge) is presented. Its framework is based on a Gated Convolutional Neural Network (GCNN). Four models with different learning strategies are proposed; they achieve macro-averaged F1-scores between 88.50% and 88.72%.
System characteristics
Input | all |
Sampling rate | 16kHz |
Data augmentation | Gaussian Additive Noise |
Acoustic features | log-mel energies |
Fusion level | decision |
Fusion method | average |
Classifier | CNN |
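The gating behind a GCNN can be sketched as two parallel convolutions, one producing features and one producing a sigmoid gate that modulates them. This is a generic gated-linear-unit convolution, not the authors' exact block:

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: feature path modulated by a learned sigmoid gate."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.feat = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        return self.feat(x) * torch.sigmoid(self.gate(x))
```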
Domestic Activities Classification Based on CNN Using Shuffling and Mixing Data Augmentation
Tadanobu Inoue1, Phongtharin Vinayavekhin1, Shiqiang Wang2, David Wood2, Nancy Greco2 and Ryuki Tachibana3
1AI, IBM Research, Tokyo, Japan, 2AI, Research, Yorktown Heights, NY, USA, 3AI, Research, Tokyo, Japan
Inoue_IBM_task5_1 Inoue_IBM_task5_2
Abstract
This technical report describes our proposed design and implementation of the system used for the DCASE 2018 Challenge submission. The work focuses on Task 5 of the challenge, which is about monitoring and classifying domestic activities based on multi-channel acoustics. We propose data augmentation techniques that shuffle and mix two sounds of the same class to mitigate the unbalanced training dataset. This augmentation can generate new variations in both the sequence and the density of sound events. The experimental results show that the proposed system achieves an average macro-averaged F1-score of 89.95% over the four folds of the development dataset, a significant improvement over the baseline result of 84.50%. For the final evaluation submission, four proposed classifiers are trained on the four training/validation folds of the development dataset, and the four models are ensembled by averaging their predictions.
System characteristics
Input | all |
Sampling rate | 16kHz |
Data augmentation | shuffling, mixing |
Acoustic features | log-mel energies |
Fusion level | decision |
Fusion method | average |
Classifier | CNN |
Decision making | average |
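The shuffle-and-mix augmentation described in the abstract could look roughly like the following numpy sketch; the block count and mixing-weight range are assumptions, not the authors' settings:

```python
import numpy as np

def shuffle_and_mix(clip_a, clip_b, n_blocks=4, rng=np.random):
    """clip_a, clip_b: same-class 1-D waveforms of equal length."""
    def shuffled(x):
        # cut the clip into blocks and permute their order
        blocks = np.array_split(x, n_blocks)
        rng.shuffle(blocks)
        return np.concatenate(blocks)
    w = rng.uniform(0.3, 0.7)                 # mixing weight (assumed range)
    return w * shuffled(clip_a) + (1.0 - w) * shuffled(clip_b)
```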
DCASE 2018 Challenge Baseline with Convolutional Neural Networks
Qiuqiang Kong, Iqbal Turab, Xu Yong, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection and 5) multi-channel audio tagging. In this paper we open-source the Python code for Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and VGGish architectures from the image processing area. We researched how the performance varies from task to task when the configuration of the neural networks is the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, except for Task 1 where the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1-score of 20.8% on Task 4 and an F1-score of 87.75% on Task 5.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies |
Fusion level | decision |
Fusion method | average |
Classifier | AlexNetish 8 layer CNN with global max pooling; AlexNetish 4 layer CNN with global max pooling |
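The global max pooling named in the classifier description reduces each convolutional feature map to its single maximum over time and frequency before the classification layer. A minimal PyTorch illustration with assumed shapes:

```python
import torch
import torch.nn as nn

conv_out = torch.randn(8, 128, 10, 25)     # (batch, maps, mel, frames)
pooled = torch.amax(conv_out, dim=(2, 3))  # global max pool -> (batch, maps)
logits = nn.Linear(128, 9)(pooled)         # 9 domestic-activity classes
```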
Ciaic-Moda System for Dcase2018 Challenge Task5
Dexin Li and Mou Wang
Speech Signal Processing, CIAIC, Xi'an, China
Li_NPU_task5_1 Li_NPU_task5_2 Li_NPU_task5_3 Li_NPU_task5_4
Abstract
In this technical report, we present several systems for Task 5 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. The task is to classify multi-channel audio segments into one of several daily activities performed in a home environment. We develop three methods for the task. First, a log mel-spectrogram is extracted from each segment and fed to the baseline system's CNN extended with gated linear units (GLU). Second, we use a VGGNet to improve the network. In addition, to exploit spatial information, we extract coherence features among all channels and classify them with a 1D-CNN with GLU. Finally, we fuse the posteriors of the three subsystems to further improve performance. The experimental results show the proposed systems achieve at least a 5% F1-score improvement over the baseline system.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies |
Spatial features | coherence |
Fusion level | decision |
Fusion method | average |
Classifier | CNN, VGG10, ensemble; CNN, VGG10, GLU, ensemble |
Decision making | average |
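The coherence spatial feature listed above can be sketched with scipy's magnitude-squared coherence over all microphone pairs; this is an assumed illustration, not the authors' exact feature pipeline:

```python
import numpy as np
from scipy.signal import coherence

def pairwise_coherence(audio, fs=16000, nperseg=512):
    """audio: (n_channels, n_samples) -> (n_pairs, n_freqs) coherence curves."""
    n_ch = audio.shape[0]
    feats = []
    for i in range(n_ch):
        for j in range(i + 1, n_ch):
            # magnitude-squared coherence between channels i and j
            _, cxy = coherence(audio[i], audio[j], fs=fs, nperseg=nperseg)
            feats.append(cxy)
    return np.stack(feats)
```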
DCASE 2018 Task 5 Challenge Technical Report: Sound Event Classification by a Deep Neural Network with Attention and Minimum Variance Distortionless Response Enhancement
Hsueh-Wei Liao1, Jong-Yi Huang2, Shih-Syuan Lan2, Tsung-Han Lee2, Yi-Wen Liu1 and Ming-Sian Bai2
1Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2Power Mechanical Engineering, National Tsing Hua University, Hsinchu, Taiwan
Liao_NTHU_task5_1 Liao_NTHU_task5_2 Liao_NTHU_task5_3
Abstract
In this technical report, we propose a sub-band convolutional neural network with residual building blocks as a sound event detection system. Our system performs not only the clip-wise prediction required by Task 5 but also frame-wise prediction, which can be regarded as multi-task learning. The frame-wise labels are all derived from the original weak labels by label smoothing with the energy of the frames. With multi-task learning, we believe such frame-wise prediction can concentrate on the most important parts of the weakly-labeled dataset. In addition, we preprocessed the input signals with array-based methods and, depending on the sound class, report mixed results in terms of F1-score.
System characteristics
Input | all; mixed |
Sampling rate | 16kHz |
Data augmentation | time shifting |
Acoustic features | log-mel energies |
Spatial features | MVDR; MVDR with MMSE |
Fusion level | decision |
Fusion method | average |
Classifier | CNN |
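One plausible reading of the energy-based label smoothing described in the abstract is to turn the clip-level weak label into frame-wise soft targets scaled by normalized frame energy. The following numpy sketch is an assumption, not the authors' code:

```python
import numpy as np

def frame_targets(frames, clip_label, n_classes=9):
    """frames: (n_frames, frame_len) waveform frames -> (n_frames, n_classes)."""
    energy = (frames ** 2).sum(axis=1)
    weight = energy / (energy.max() + 1e-12)  # per-frame weight in [0, 1]
    targets = np.zeros((len(frames), n_classes))
    targets[:, clip_label] = weight           # soft target for the weak label
    return targets
```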
An Ensemble System for Domestic Activity Recognition
Huaping Liu1, Feng Wang1, Xinzhu Liu2 and Di Guo1
1Computer Science and Technology, Beijing, China, 2Computer Science and Technology, Changchun, China
Liu_THU_task5_1 Liu_THU_task5_2 Liu_THU_task5_3
Abstract
As one of the most important sensing modalities, acoustics is becoming increasingly popular for realizing smart environments. In this challenge, we tackle the task of monitoring domestic activities based on multi-channel acoustics. Several acoustic features, including mel-spectrograms, MFCCs and VGGish features, are extracted from the raw audio and fused to train different deep neural networks. An ensemble system is then built from the trained models. Experimental results on the development dataset demonstrate that the proposed system achieves superior performance in recognizing domestic activities.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies, MFCC |
External model | VGGish |
Fusion level | decision |
Fusion method | average |
Classifier | CNN, RNN, ensemble |
Decision making | average |
Partially-Shared Convolutional Neural Network for Classification of Multi-Channel Recorded Audio Signals
Kazuhiro Nakadai and Danilo R. Onishi
Research Div., Honda Research Institute Japan, Wako, Japan
Nakadai_HRI-JP_task5_1
Abstract
This technical paper presents the system used in our submission for Task 5 of the DCASE 2018 challenge. We propose a partially-shared convolutional neural network: a multi-task system with a common input (the multi-channel log-mel features) and two output branches, a classification branch that outputs the predicted class and a regression branch that outputs a single-channel representation of the multi-channel input data. Since the system has a network shared between classification and regression, training for regression is expected to enhance training for classification and vice versa. Because Task 5 targets classification based on multi-channel audio input, we tried to improve classification performance by training classification and regression together. By applying the proposed system together with parameter tuning of the baseline CNN, we confirmed that the classification F1-score increased to 89.94% in four-fold cross-validation, while the baseline system achieved 84.50%.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies |
Fusion level | decision |
Fusion method | majority vote |
Classifier | Partially Shared CNN |
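The partially-shared structure can be sketched as a shared convolutional trunk feeding a classification head and a regression head that reconstructs a single-channel representation. All shapes and sizes below are illustrative assumptions, not the authors' values:

```python
import torch
import torch.nn as nn

class PartiallySharedCNN(nn.Module):
    def __init__(self, n_mics=4, n_classes=9):
        super().__init__()
        self.shared = nn.Sequential(             # trunk shared by both tasks
            nn.Conv2d(n_mics, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.classify = nn.Sequential(           # classification branch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))
        self.regress = nn.Conv2d(32, 1, 1)       # single-channel reconstruction

    def forward(self, x):                        # x: (batch, mics, mel, frames)
        h = self.shared(x)
        return self.classify(h), self.regress(h)
```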
Multi-Channel Audio Classification with Neural Network Using Scattering Transform
Alon Raveh1 and Alon Amar2
1Signal Processing Department, National Research Center, Haifa, Israel, 2EE, Technion, Haifa, Israel
Raveh_INRC_task5_1 Raveh_INRC_task5_2 Raveh_INRC_task5_3 Raveh_INRC_task5_4
Abstract
This technical paper presents an approach for Task 5 of the 2018 acoustic scene classification challenge (DCASE 2018). A sequence of audio segments is observed by an array of 4 microphones. The task is to design multichannel processing that classifies the audio signals into one of 9 predefined classes. The proposed approach combines a deep neural network with the scattering transform. Each audio segment is first represented by two layers of scattering transform coefficients. The 4 denoised transforms of each of the two layers are combined. Each of the fused layers is processed in parallel by two neural network (NN) architectures, a ResNet and a long short-term memory (LSTM) network, with a joint fully connected layer.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | Scattering Transform; Scattering Transform, SVD |
Fusion level | feature |
Fusion method | average |
Classifier | LSTM, CNN, ResNet |
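The SVD step in the INRC_1DSVD/INRC_2DSVD variants suggests low-rank denoising of the scattering coefficients. A generic sketch of rank-k truncation, where the rank is an assumption:

```python
import numpy as np

def svd_denoise(features, k=8):
    """features: (n_coeffs, n_frames) -> rank-k approximation."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    # keep only the k strongest singular components
    return (u[:, :k] * s[:k]) @ vt[:k]
```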
Home Activity Monitoring Based on Gated Convolutional Neural Networks and System Fusion
Yuhan Shen, Kexin He and Weiqiang Zhang
Electronic Engineering, Tsinghua University, Beijing, China
Zhang_THU_task5_1 Zhang_THU_task5_2 Zhang_THU_task5_3 Zhang_THU_task5_4
Abstract
In this technical report, we propose a method for Task 5 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. The task is to classify multi-channel audio segments into one of the provided predefined classes, all of which are daily activities performed in a home environment. We adopt a model based on gated convolutional neural networks for domestic activity classification and use several methods to improve its performance. First, we replace the normal convolutional and recurrent neural networks with a gated convolutional neural network in order to extract more temporal features and improve efficiency. Second, we mitigate the problem of data imbalance using a weighted loss function. Besides, we adopt a model-ensembling strategy to make our system stronger and more effective. Finally, we use a fusion of two systems to further improve performance. In summary, we obtain an 89.73% F1-score on the development dataset, while the official baseline system achieves 84.50%.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies, Time-Frequency Cepstral |
Fusion level | classifier |
Fusion method | stacking |
Classifier | GCNN, GSV-SVM, ensemble |
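The weighted loss function mentioned in the abstract can be realized with inverse-frequency class weights in a standard cross-entropy loss. In this PyTorch sketch the per-class counts are hypothetical placeholders, not the dataset's real counts:

```python
import torch
import torch.nn as nn

# hypothetical per-class training-clip counts (the real counts differ)
class_counts = torch.tensor([120., 30., 45., 60., 25., 80., 20., 110., 95.])
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)   # rare classes weigh more

logits = torch.randn(16, 9)                       # a fake batch of class logits
labels = torch.randint(0, 9, (16,))
loss = criterion(logits, labels)
```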
Multichannel Acoustic Scene Classification by Blind Dereverberation, Blind Source Separation, Data Augmentation, and Model Ensembling
Ryo Tanabe, Takashi Endo, Yuki Nikaido, Takeshi Ichige, Phong Nguyen, Yohei Kawaguchi and Koichi Hamada
R&D Group, Hitachi, Ltd., Tokyo, Japan
Tanabe_HIT_task5_1 Tanabe_HIT_task5_2 Tanabe_HIT_task5_3 Tanabe_HIT_task5_4
Abstract
Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5 can be regarded as a type of multichannel acoustic scene classification. An important characteristic of Task 5 is that a microphone array may be placed at different locations in the development and evaluation datasets, so location-independent rather than location-dependent spatial cues should be exploited to avoid overfitting. The proposed system is a combination of front-end modules based on blind signal processing and back-end modules based on machine learning. The front-end modules employ blind dereverberation, blind source separation and related techniques, which use spatial cues without machine learning, so overfitting is avoided. The back-end modules employ one-dimensional-convolutional-neural-network (1D-CNN) and VGG16-based architectures for the individual front-end modules, and all the probability outputs are ensembled. Mixup-based data augmentation further reduces overfitting in the back-end modules.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies, MFCC |
Spatial features | multi-channel front-end processing; Blind Source Separation, Blind dereverberation, Beamformer |
External model | VGG16 |
Fusion level | audio, decision |
Fusion method | Blind Source Separation, Blind dereverberation, Beamformer, average; Blind Source Separation, Blind dereverberation, Beamformer, Random Forest; Blind Source Separation, Blind dereverberation, Beamformer, SVM; Blind Source Separation, Blind dereverberation, Beamformer, F1-score-weighted average |
Classifier | CNN, SVM, VGG16, ensemble |
Decision making | average; Random Forest; SVM; F1-score-weighted average |
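The Mixup-based augmentation named in the abstract forms convex combinations of two training examples and their labels. A standard sketch; the Beta(0.2, 0.2) parameter is a common choice assumed here, not the authors' value:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random):
    """x*: feature arrays of equal shape; y*: one-hot label vectors."""
    lam = rng.beta(alpha, alpha)              # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```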
Monitoring of Domestic Activities Based on Multi-Channel Acoustics: A Time-Channel 2D-Convolutional Approach
Marco Tiraboschi
Computer Science, Università degli Studi di Milano, Milan, Italy
Tiraboschi_UNIMI_task5_1
Abstract
This approach is meant to be an extension of the DCASE 2018 Task 5 baseline system for domestic activity recognition exploiting multichannel audio: the convolutional neural network model has been slightly restructured by using two-dimensional convolutions along the dimensions of time and channel.
System characteristics
Input | all |
Sampling rate | 16kHz |
Acoustic features | log-mel energies |
Fusion level | classifier |
Fusion method | CNN |
Classifier | CNN |
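The time-channel convolution can be sketched by moving the mel axis into the 2-D convolution's channel slot, so the kernel slides over microphone channel and time instead of frequency and time. All sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

batch, mics, n_mels, frames = 8, 4, 40, 501
x = torch.randn(batch, mics, n_mels, frames)   # multi-channel log-mel input

# reorder to (batch, mel, mic-channel, time) so Conv2d spans (mic, time)
x = x.permute(0, 2, 1, 3)
conv = nn.Conv2d(n_mels, 64, kernel_size=(4, 5), padding=(0, 2))
h = conv(x)                                    # (batch, 64, 1, frames)
```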