Monitoring of domestic activities based on multi-channel acoustics


Challenge results

Task description

This subtask is concerned with the classification of daily activities performed in a home environment (e.g., cooking). The provided samples are multi-channel audio segments acquired by multiple microphone arrays at different positions, so spatial properties can be exploited as input features for the classification problem. However, using the absolute localization of sound sources as input to the detection model will not generalize to cases where the position of the microphone array is altered. This task therefore focuses on systems that exploit spatial cues independent of sensor location using multi-channel audio.

The development data consists of recordings obtained by four microphone arrays at different positions. The evaluation dataset contains data from seven microphone arrays: the four microphone arrays available in the development set plus three unknown microphone arrays. The former are used to quantify spatial overfitting, while the latter determine the winner of task 5.

A more detailed task description can be found on the task description page.
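
For reference, systems are ranked by macro-averaged F1-score. A minimal sketch of the metric, assuming scikit-learn and purely illustrative labels:

```python
# Minimal sketch of the ranking metric: macro-averaged F1-score over the
# nine activity classes. Labels below are illustrative only.
from sklearn.metrics import f1_score

y_true = ["Cooking", "Absence", "Working", "Cooking"]
y_pred = ["Cooking", "Absence", "Cooking", "Cooking"]

# 'macro' averages the per-class F1 scores with equal weight, so a rare
# class such as "Other" counts as much as a frequent one such as "Absence".
print(f1_score(y_true, y_pred, average="macro"))
```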

Systems ranking

| Submission code | Submission name | Technical report | F1-score on Eval. set (unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | F1-score on Dev. set |
|---|---|---|---|---|---|
| DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 85.0 | 84.5 |
| Delphin_OL_task5_1 | GCNN_PTS | Delphin-Poulat2018 | 80.7 | 86.1 | 88.5 |
| Delphin_OL_task5_2 | GCNN_FTS | Delphin-Poulat2018 | 80.8 | 85.0 | 88.6 |
| Delphin_OL_task5_3 | GCNN_ATS | Delphin-Poulat2018 | 81.6 | 84.9 | 86.0 |
| Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 86.5 | 88.7 |
| Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 90.4 | 90.0 |
| Inoue_IBM_task5_2 | InouetMilk | Inoue2018 | 88.3 | 90.5 | 90.0 |
| Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 87.6 | 87.8 |
| Kong_Surrey_task5_2 | SurreyCNN4 | Kong2018 | 82.4 | 86.2 | 87.8 |
| Li_NPU_task5_1 | CIAICSys1 | Li2018 | 79.0 | 90.7 | 89.7 |
| Li_NPU_task5_2 | CIAICSys2 | Li2018 | 78.6 | 90.4 | 89.7 |
| Li_NPU_task5_3 | CIAICSys3 | Li2018 | 84.8 | 91.3 | 90.5 |
| Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 91.4 | 90.7 |
| Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 88.6 | 88.7 |
| Liao_NTHU_task5_2 | NTHU_sub_MVDR | Liao2018 | 72.1 | 87.1 | 87.1 |
| Liao_NTHU_task5_3 | NTHU_sub_MVDRMMSE | Liao2018 | 76.7 | 85.7 | 85.5 |
| Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 89.4 | 89.8 |
| Liu_THU_task5_2 | Liu_THU | Liu2018 | 87.4 | 89.5 | 89.8 |
| Liu_THU_task5_3 | Liu_THU | Liu2018 | 86.8 | 89.3 | 88.9 |
| Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 89.9 | 89.9 |
| Raveh_INRC_task5_1 | INRC_1D | Raveh2018 | 80.4 | 87.7 | 87.2 |
| Raveh_INRC_task5_2 | INRC_1DSVD | Raveh2018 | 80.2 | 86.3 | 85.7 |
| Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 87.7 | 86.8 |
| Raveh_INRC_task5_4 | INRC_2DSVD | Raveh2018 | 81.2 | 86.4 | 85.8 |
| Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 78.5 | 92.2 |
| Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 89.7 | 89.8 |
| Tanabe_HIT_task5_2 | HITrf | Tanabe2018 | 82.2 | 86.0 | 90.0 |
| Tanabe_HIT_task5_3 | HITsvm | Tanabe2018 | 86.3 | 89.2 | 90.3 |
| Tanabe_HIT_task5_4 | HITfweight | Tanabe2018 | 88.4 | 89.8 | 89.8 |
| Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 85.8 | 85.8 |
| Zhang_THU_task5_1 | THUEE | Shen2018 | 85.9 | 87.6 | 89.7 |
| Zhang_THU_task5_2 | THUEE | Shen2018 | 84.3 | 86.2 | 91.2 |
| Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 87.5 | 90.5 |
| Zhang_THU_task5_4 | THUEE | Shen2018 | 85.9 | 87.6 | 90.4 |

Teams ranking

Table including only the best performing system per submitting team.

| Submission code | Submission name | Technical report | F1-score on Eval. set (unknown mic.) | F1-score on Eval. set (dev. set mic. arrays) | F1-score on Dev. set |
|---|---|---|---|---|---|
| DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 85.0 | 84.5 |
| Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 86.5 | 88.7 |
| Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 90.4 | 90.0 |
| Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 87.6 | 87.8 |
| Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 91.4 | 90.7 |
| Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 88.6 | 88.7 |
| Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 89.4 | 89.8 |
| Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 89.9 | 89.9 |
| Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 87.7 | 86.8 |
| Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 78.5 | 92.2 |
| Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 89.7 | 89.8 |
| Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 85.8 | 85.8 |
| Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 87.5 | 90.5 |

Class-wise performance

Class-wise F1-scores on the evaluation set. The first block of class columns refers to the unknown microphone arrays; the columns marked "(2)" repeat the class-wise scores for the development-set microphone arrays.

| Submission code | Submission name | Technical report | F1 (unknown mic.) | Absence | Cooking | Dishwashing | Eating | Other | Social activity | Vacuum cleaning | Watching TV | Working | F1 (dev. set mic. arrays) | Absence (2) | Cooking (2) | Dishwashing (2) | Eating (2) | Other (2) | Social activity (2) | Vacuum cleaning (2) | Watching TV (2) | Working (2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCASE2018 baseline | Baseline | Dekkers2018 | 83.1 | 87.7 | 93.0 | 77.2 | 81.2 | 35.0 | 96.6 | 95.8 | 99.9 | 81.4 | 85.0 | 89.4 | 96.3 | 79.5 | 82.0 | 44.1 | 96.4 | 95.9 | 99.9 | 81.5 |
| Delphin_OL_task5_1 | GCNN_PTS | Delphin-Poulat2018 | 80.7 | 79.9 | 85.5 | 70.1 | 79.3 | 45.5 | 96.0 | 95.7 | 99.9 | 74.5 | 86.1 | 91.0 | 96.1 | 79.6 | 82.5 | 48.4 | 96.5 | 96.4 | 99.9 | 84.7 |
| Delphin_OL_task5_2 | GCNN_FTS | Delphin-Poulat2018 | 80.8 | 79.1 | 88.5 | 71.4 | 79.9 | 42.7 | 94.6 | 97.0 | 99.9 | 74.2 | 85.0 | 90.5 | 96.0 | 78.5 | 81.0 | 44.1 | 95.1 | 97.0 | 99.9 | 83.2 |
| Delphin_OL_task5_3 | GCNN_ATS | Delphin-Poulat2018 | 81.6 | 83.8 | 90.5 | 71.3 | 78.6 | 44.0 | 94.7 | 96.1 | 99.9 | 76.0 | 84.9 | 91.3 | 95.3 | 75.9 | 81.8 | 45.5 | 94.5 | 96.5 | 99.9 | 83.1 |
| Delphin_OL_task5_4 | GCNN_F | Delphin-Poulat2018 | 82.5 | 82.2 | 89.5 | 73.8 | 81.1 | 47.4 | 95.5 | 96.7 | 100.0 | 76.2 | 86.5 | 92.0 | 96.2 | 80.4 | 83.1 | 49.3 | 95.8 | 96.8 | 99.9 | 85.3 |
| Inoue_IBM_task5_1 | InouetMilk | Inoue2018 | 88.4 | 93.7 | 91.5 | 86.5 | 87.0 | 54.2 | 97.0 | 97.1 | 99.9 | 88.7 | 90.4 | 94.2 | 96.8 | 88.4 | 89.9 | 59.7 | 97.4 | 97.3 | 100.0 | 90.0 |
| Inoue_IBM_task5_2 | InouetMilk | Inoue2018 | 88.3 | 93.6 | 91.7 | 86.1 | 87.0 | 53.6 | 97.0 | 97.1 | 99.9 | 88.7 | 90.5 | 94.2 | 96.9 | 89.4 | 90.2 | 59.5 | 97.4 | 97.2 | 99.9 | 90.2 |
| Kong_Surrey_task5_1 | SurreyCNN8 | Kong2018 | 83.2 | 90.4 | 82.9 | 75.0 | 82.4 | 42.6 | 96.6 | 96.4 | 99.9 | 82.5 | 87.6 | 92.7 | 95.0 | 82.7 | 85.9 | 51.5 | 96.7 | 97.1 | 99.9 | 87.3 |
| Kong_Surrey_task5_2 | SurreyCNN4 | Kong2018 | 82.4 | 87.4 | 84.2 | 74.3 | 78.4 | 45.4 | 96.4 | 96.6 | 99.9 | 79.1 | 86.2 | 90.7 | 94.6 | 81.0 | 83.0 | 48.9 | 96.5 | 97.5 | 99.9 | 83.4 |
| Li_NPU_task5_1 | CIAICSys1 | Li2018 | 79.0 | 79.6 | 84.6 | 76.4 | 80.8 | 20.3 | 95.6 | 96.4 | 99.9 | 77.3 | 90.7 | 93.0 | 97.3 | 91.0 | 91.6 | 61.1 | 96.6 | 97.0 | 100.0 | 89.1 |
| Li_NPU_task5_2 | CIAICSys2 | Li2018 | 78.6 | 81.5 | 85.7 | 78.2 | 74.1 | 24.4 | 92.1 | 95.4 | 99.7 | 76.1 | 90.4 | 92.4 | 97.2 | 91.3 | 92.0 | 59.4 | 96.2 | 96.8 | 100.0 | 88.6 |
| Li_NPU_task5_3 | CIAICSys3 | Li2018 | 84.8 | 88.3 | 91.0 | 81.1 | 84.4 | 40.5 | 97.2 | 97.0 | 99.9 | 83.6 | 91.3 | 94.4 | 97.2 | 89.9 | 91.6 | 62.9 | 97.4 | 97.4 | 100.0 | 91.0 |
| Li_NPU_task5_4 | CIAICSys4 | Li2018 | 85.1 | 88.1 | 91.3 | 82.9 | 84.7 | 42.2 | 96.6 | 97.1 | 100.0 | 83.3 | 91.4 | 94.3 | 97.4 | 90.3 | 91.8 | 63.1 | 97.5 | 97.3 | 100.0 | 90.8 |
| Liao_NTHU_task5_1 | NTHU_sub_4 | Liao2018 | 86.7 | 91.0 | 95.1 | 81.7 | 82.1 | 52.3 | 97.9 | 95.3 | 100.0 | 85.3 | 88.6 | 92.6 | 96.7 | 88.0 | 85.3 | 55.7 | 97.8 | 95.0 | 100.0 | 86.7 |
| Liao_NTHU_task5_2 | NTHU_sub_MVDR | Liao2018 | 72.1 | 69.8 | 64.7 | 63.5 | 67.1 | 17.7 | 95.9 | 96.9 | 99.8 | 73.5 | 87.1 | 91.6 | 92.2 | 84.3 | 84.6 | 52.2 | 96.5 | 97.3 | 99.9 | 85.3 |
| Liao_NTHU_task5_3 | NTHU_sub_MVDRMMSE | Liao2018 | 76.7 | 77.4 | 68.3 | 63.6 | 77.2 | 38.2 | 96.0 | 96.0 | 99.8 | 73.9 | 85.7 | 91.7 | 88.2 | 77.4 | 84.7 | 50.1 | 96.6 | 96.7 | 99.9 | 86.0 |
| Liu_THU_task5_1 | Liu_THU | Liu2018 | 87.5 | 92.7 | 89.9 | 84.7 | 85.6 | 53.5 | 96.6 | 97.4 | 100.0 | 87.4 | 89.4 | 93.9 | 95.6 | 87.4 | 86.6 | 56.8 | 97.1 | 97.6 | 100.0 | 89.6 |
| Liu_THU_task5_2 | Liu_THU | Liu2018 | 87.4 | 92.9 | 90.5 | 84.2 | 85.0 | 52.1 | 97.0 | 97.3 | 100.0 | 87.5 | 89.5 | 93.9 | 96.3 | 87.3 | 86.9 | 56.3 | 97.4 | 97.8 | 100.0 | 89.6 |
| Liu_THU_task5_3 | Liu_THU | Liu2018 | 86.8 | 92.1 | 90.3 | 82.1 | 84.2 | 51.9 | 97.0 | 96.7 | 100.0 | 86.5 | 89.3 | 93.3 | 95.9 | 87.5 | 87.5 | 55.3 | 97.6 | 97.6 | 100.0 | 88.7 |
| Nakadai_HRI-JP_task5_1 | PS-CNN | Nakadai2018 | 85.4 | 84.6 | 92.7 | 81.6 | 84.5 | 51.1 | 97.3 | 97.0 | 100.0 | 80.0 | 89.9 | 93.4 | 96.8 | 88.3 | 89.0 | 57.7 | 97.3 | 96.8 | 100.0 | 89.4 |
| Raveh_INRC_task5_1 | INRC_1D | Raveh2018 | 80.4 | 74.8 | 84.1 | 71.9 | 81.5 | 47.6 | 95.1 | 97.1 | 99.9 | 71.5 | 87.7 | 89.3 | 96.0 | 85.9 | 86.1 | 53.5 | 97.0 | 97.8 | 99.9 | 84.2 |
| Raveh_INRC_task5_2 | INRC_1DSVD | Raveh2018 | 80.2 | 69.9 | 91.1 | 75.6 | 79.1 | 44.9 | 95.2 | 97.7 | 99.8 | 68.6 | 86.3 | 88.1 | 95.1 | 81.8 | 83.8 | 51.6 | 96.0 | 97.8 | 99.9 | 82.5 |
| Raveh_INRC_task5_3 | INRC_2D | Raveh2018 | 81.7 | 79.7 | 86.9 | 73.8 | 82.2 | 42.7 | 97.1 | 97.4 | 99.9 | 75.5 | 87.7 | 90.8 | 95.3 | 82.8 | 87.2 | 51.4 | 97.5 | 97.8 | 99.9 | 86.2 |
| Raveh_INRC_task5_4 | INRC_2DSVD | Raveh2018 | 81.2 | 75.8 | 87.5 | 73.2 | 80.0 | 48.4 | 95.9 | 96.6 | 99.9 | 73.4 | 86.4 | 89.8 | 94.1 | 78.6 | 84.8 | 51.9 | 96.8 | 96.8 | 99.9 | 84.9 |
| Sun_SUTD_task5_1 | SUTD | Chew2018 | 76.8 | 74.9 | 85.5 | 70.5 | 68.5 | 35.2 | 92.9 | 94.7 | 99.8 | 69.5 | 78.5 | 81.3 | 92.7 | 72.1 | 72.2 | 30.5 | 93.9 | 94.5 | 99.7 | 69.6 |
| Tanabe_HIT_task5_1 | HITavg | Tanabe2018 | 88.4 | 91.6 | 97.0 | 83.0 | 84.2 | 57.7 | 98.2 | 97.7 | 100.0 | 86.1 | 89.7 | 92.4 | 97.2 | 86.1 | 86.0 | 61.6 | 98.1 | 97.9 | 100.0 | 87.7 |
| Tanabe_HIT_task5_2 | HITrf | Tanabe2018 | 82.2 | 59.1 | 96.1 | 81.5 | 85.7 | 53.7 | 97.7 | 97.7 | 100.0 | 68.6 | 86.0 | 74.6 | 96.9 | 85.5 | 88.2 | 57.9 | 97.7 | 97.9 | 100.0 | 75.4 |
| Tanabe_HIT_task5_3 | HITsvm | Tanabe2018 | 86.3 | 86.1 | 95.8 | 81.6 | 85.2 | 54.6 | 95.9 | 96.7 | 100.0 | 81.3 | 89.2 | 92.6 | 96.7 | 85.4 | 88.1 | 57.3 | 96.6 | 97.2 | 100.0 | 88.7 |
| Tanabe_HIT_task5_4 | HITfweight | Tanabe2018 | 88.4 | 91.3 | 97.0 | 83.0 | 84.1 | 58.3 | 98.2 | 97.7 | 100.0 | 85.8 | 89.8 | 92.6 | 97.2 | 86.4 | 86.1 | 62.1 | 98.2 | 97.9 | 100.0 | 87.9 |
| Tiraboschi_UNIMI_task5_1 | TC2DCNN | Tiraboschi2018 | 76.9 | 79.8 | 88.7 | 71.8 | 78.9 | 17.6 | 96.2 | 94.4 | 99.7 | 64.6 | 85.8 | 90.8 | 93.6 | 77.5 | 83.2 | 50.5 | 97.4 | 94.1 | 100.0 | 85.0 |
| Zhang_THU_task5_1 | THUEE | Shen2018 | 85.9 | 92.8 | 88.6 | 78.7 | 81.9 | 50.3 | 97.5 | 96.3 | 99.9 | 87.5 | 87.6 | 93.1 | 94.6 | 80.8 | 85.1 | 52.8 | 97.5 | 96.6 | 99.9 | 88.0 |
| Zhang_THU_task5_2 | THUEE | Shen2018 | 84.3 | 93.6 | 85.1 | 76.8 | 76.6 | 46.6 | 97.1 | 96.5 | 99.9 | 86.9 | 86.2 | 94.1 | 89.7 | 79.0 | 80.7 | 50.6 | 96.7 | 97.1 | 99.9 | 88.0 |
| Zhang_THU_task5_3 | THUEE | Shen2018 | 86.0 | 93.6 | 87.4 | 79.7 | 80.1 | 50.8 | 97.6 | 96.7 | 99.9 | 87.7 | 87.5 | 94.2 | 92.2 | 81.5 | 83.2 | 53.3 | 97.4 | 97.1 | 100.0 | 88.8 |
| Zhang_THU_task5_4 | THUEE | Shen2018 | 85.9 | 93.5 | 87.4 | 79.0 | 79.9 | 51.3 | 97.6 | 96.7 | 99.9 | 87.7 | 87.6 | 94.1 | 92.4 | 81.4 | 83.5 | 53.7 | 97.6 | 97.1 | 100.0 | 88.8 |

System characteristics

Input characteristics

| Code | Technical report | F1-score (unknown mic.) | F1-score (dev. set mic. arrays) | Acoustic features | Spatial features | Data augmentation | Pre-trained model |
|---|---|---|---|---|---|---|---|
| DCASE2018 baseline | Dekkers2018 | 83.1 | 85.0 | log-mel energies | | | |
| Delphin_OL_task5_1 | Delphin-Poulat2018 | 80.7 | 86.1 | log-mel energies | | | |
| Delphin_OL_task5_2 | Delphin-Poulat2018 | 80.8 | 85.0 | log-mel energies | | | |
| Delphin_OL_task5_3 | Delphin-Poulat2018 | 81.6 | 84.9 | log-mel energies | | Gaussian Additive Noise | |
| Delphin_OL_task5_4 | Delphin-Poulat2018 | 82.5 | 86.5 | log-mel energies | | Gaussian Additive Noise | |
| Inoue_IBM_task5_1 | Inoue2018 | 88.4 | 90.4 | log-mel energies | | shuffling, mixing | |
| Inoue_IBM_task5_2 | Inoue2018 | 88.3 | 90.5 | log-mel energies | | shuffling, mixing | |
| Kong_Surrey_task5_1 | Kong2018 | 83.2 | 87.6 | log-mel energies | | | |
| Kong_Surrey_task5_2 | Kong2018 | 82.4 | 86.2 | log-mel energies | | | |
| Li_NPU_task5_1 | Li2018 | 79.0 | 90.7 | log-mel energies | coherence | | |
| Li_NPU_task5_2 | Li2018 | 78.6 | 90.4 | log-mel energies | coherence | | |
| Li_NPU_task5_3 | Li2018 | 84.8 | 91.3 | log-mel energies | coherence | | |
| Li_NPU_task5_4 | Li2018 | 85.1 | 91.4 | log-mel energies | coherence | | |
| Liao_NTHU_task5_1 | Liao2018 | 86.7 | 88.6 | log-mel energies | | time shifting | |
| Liao_NTHU_task5_2 | Liao2018 | 72.1 | 87.1 | log-mel energies | MVDR | time shifting | |
| Liao_NTHU_task5_3 | Liao2018 | 76.7 | 85.7 | log-mel energies | MVDR with MMSE | time shifting | |
| Liu_THU_task5_1 | Liu2018 | 87.5 | 89.4 | log-mel energies, MFCC | | | VGGish |
| Liu_THU_task5_2 | Liu2018 | 87.4 | 89.5 | log-mel energies, MFCC | | | VGGish |
| Liu_THU_task5_3 | Liu2018 | 86.8 | 89.3 | log-mel energies, MFCC | | | VGGish |
| Nakadai_HRI-JP_task5_1 | Nakadai2018 | 85.4 | 89.9 | log-mel energies | | | |
| Raveh_INRC_task5_1 | Raveh2018 | 80.4 | 87.7 | Scattering Transform | | | |
| Raveh_INRC_task5_2 | Raveh2018 | 80.2 | 86.3 | Scattering Transform, SVD | | | |
| Raveh_INRC_task5_3 | Raveh2018 | 81.7 | 87.7 | Scattering Transform | | | |
| Raveh_INRC_task5_4 | Raveh2018 | 81.2 | 86.4 | Scattering Transform, SVD | | | |
| Sun_SUTD_task5_1 | Chew2018 | 76.8 | 78.5 | MFCC, spectrogram | | | |
| Tanabe_HIT_task5_1 | Tanabe2018 | 88.4 | 89.7 | log-mel energies, MFCC | multi-channel front-end processing | | VGG16 |
| Tanabe_HIT_task5_2 | Tanabe2018 | 82.2 | 86.0 | log-mel energies, MFCC | multi-channel front-end processing | | VGG16 |
| Tanabe_HIT_task5_3 | Tanabe2018 | 86.3 | 89.2 | log-mel energies, MFCC | Blind Source Separation, Blind dereverberation, Beamformer | | VGG16 |
| Tanabe_HIT_task5_4 | Tanabe2018 | 88.4 | 89.8 | log-mel energies, MFCC | Blind Source Separation, Blind dereverberation, Beamformer | | VGG16 |
| Tiraboschi_UNIMI_task5_1 | Tiraboschi2018 | 76.9 | 85.8 | log-mel energies | | | |
| Zhang_THU_task5_1 | Shen2018 | 85.9 | 87.6 | log-mel energies, Time-Frequency Cepstral | | | |
| Zhang_THU_task5_2 | Shen2018 | 84.3 | 86.2 | log-mel energies, Time-Frequency Cepstral | | | |
| Zhang_THU_task5_3 | Shen2018 | 86.0 | 87.5 | log-mel energies, Time-Frequency Cepstral | | | |
| Zhang_THU_task5_4 | Shen2018 | 85.9 | 87.6 | log-mel energies, Time-Frequency Cepstral | | | |



Machine learning characteristics

| Code | Technical report | F1-score (unknown mic.) | F1-score (dev. set mic. arrays) | Classifier | Fusion level | Fusion method | Ensemble subsystems | Decision making |
|---|---|---|---|---|---|---|---|---|
| DCASE2018 baseline | Dekkers2018 | 83.1 | 85.0 | CNN | decision | average | | |
| Delphin_OL_task5_1 | Delphin-Poulat2018 | 80.7 | 86.1 | CNN | decision | average | | |
| Delphin_OL_task5_2 | Delphin-Poulat2018 | 80.8 | 85.0 | CNN | decision | average | | |
| Delphin_OL_task5_3 | Delphin-Poulat2018 | 81.6 | 84.9 | CNN | decision | average | | |
| Delphin_OL_task5_4 | Delphin-Poulat2018 | 82.5 | 86.5 | CNN | decision | average | | |
| Inoue_IBM_task5_1 | Inoue2018 | 88.4 | 90.4 | CNN | decision | average | 4 | average |
| Inoue_IBM_task5_2 | Inoue2018 | 88.3 | 90.5 | CNN | decision | average | 4 | average |
| Kong_Surrey_task5_1 | Kong2018 | 83.2 | 87.6 | AlexNetish 8-layer CNN with global max pooling | decision | average | | |
| Kong_Surrey_task5_2 | Kong2018 | 82.4 | 86.2 | AlexNetish 4-layer CNN with global max pooling | decision | average | | |
| Li_NPU_task5_1 | Li2018 | 79.0 | 90.7 | CNN, VGG10, ensemble | decision | average | 2 | average |
| Li_NPU_task5_2 | Li2018 | 78.6 | 90.4 | CNN, VGG10, GLU, ensemble | decision | average | 2 | average |
| Li_NPU_task5_3 | Li2018 | 84.8 | 91.3 | CNN, VGG10, ensemble | decision | average | 3 | average |
| Li_NPU_task5_4 | Li2018 | 85.1 | 91.4 | CNN, VGG10, GLU, ensemble | decision | average | 3 | average |
| Liao_NTHU_task5_1 | Liao2018 | 86.7 | 88.6 | CNN | decision | average | | |
| Liao_NTHU_task5_2 | Liao2018 | 72.1 | 87.1 | CNN | decision | average | | |
| Liao_NTHU_task5_3 | Liao2018 | 76.7 | 85.7 | CNN | decision | average | | |
| Liu_THU_task5_1 | Liu2018 | 87.5 | 89.4 | CNN, RNN, ensemble | decision | average | 3 | average |
| Liu_THU_task5_2 | Liu2018 | 87.4 | 89.5 | CNN, RNN, ensemble | decision | average | 3 | average |
| Liu_THU_task5_3 | Liu2018 | 86.8 | 89.3 | CNN, RNN, ensemble | decision | average | 3 | average |
| Nakadai_HRI-JP_task5_1 | Nakadai2018 | 85.4 | 89.9 | Partially Shared CNN | decision | majority vote | | |
| Raveh_INRC_task5_1 | Raveh2018 | 80.4 | 87.7 | LSTM, CNN, ResNet | feature | average | | |
| Raveh_INRC_task5_2 | Raveh2018 | 80.2 | 86.3 | LSTM, CNN, ResNet | feature | average | | |
| Raveh_INRC_task5_3 | Raveh2018 | 81.7 | 87.7 | LSTM, CNN, ResNet | feature | average | | |
| Raveh_INRC_task5_4 | Raveh2018 | 81.2 | 86.4 | LSTM, CNN, ResNet | feature | average | | |
| Sun_SUTD_task5_1 | Chew2018 | 76.8 | 78.5 | CNN, LSTM, ensemble | decision | average | 3 | average |
| Tanabe_HIT_task5_1 | Tanabe2018 | 88.4 | 89.7 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, average | 89 | average |
| Tanabe_HIT_task5_2 | Tanabe2018 | 82.2 | 86.0 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, Random Forest | 89 | Random Forest |
| Tanabe_HIT_task5_3 | Tanabe2018 | 86.3 | 89.2 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, SVM | 89 | SVM |
| Tanabe_HIT_task5_4 | Tanabe2018 | 88.4 | 89.8 | CNN, SVM, VGG16, ensemble | audio, decision | Blind Source Separation, Blind dereverberation, Beamformer, F1-score-weighted average | 89 | F1-score-weighted average |
| Tiraboschi_UNIMI_task5_1 | Tiraboschi2018 | 76.9 | 85.8 | CNN | classifier | CNN | | |
| Zhang_THU_task5_1 | Shen2018 | 85.9 | 87.6 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
| Zhang_THU_task5_2 | Shen2018 | 84.3 | 86.2 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
| Zhang_THU_task5_3 | Shen2018 | 86.0 | 87.5 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |
| Zhang_THU_task5_4 | Shen2018 | 85.9 | 87.6 | GCNN, GSV-SVM, ensemble | classifier | stacking | 4 | |

Technical reports

DCASE 2018 Challenge: Solution for Task 5

Jeremy Chew, Yingxiang Sun, Lahiru Jayasinghe and Chau Yuen
Engineering Product Development, Singapore University of Technology and Design, Singapore

Abstract

To address Task 5 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge, we propose an ensemble learning system in this paper. The proposed system consists of three different models based on convolutional neural networks and long short-term memory recurrent neural networks. With features such as spectrograms and mel-frequency cepstral coefficients extracted from different channels, the proposed system can classify different domestic activities effectively. Experimental results obtained on the provided development dataset show that good performance, with an F1-score of 92.19%, can be achieved. Compared with the baseline system, our proposed system significantly improves the F1-score, by 7.69%.
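
A hedged sketch of the decision-level average fusion this ensemble uses: each model emits class probabilities for a segment, and the ensemble averages them and takes the argmax. The probability vectors and class count below are illustrative, not the authors' exact setup.

```python
# Hedged sketch of decision-level average fusion over ensemble members.
import numpy as np

def fuse_decisions(prob_list):
    """Average the per-model class-probability vectors, return the argmax."""
    probs = np.mean(np.stack(prob_list), axis=0)
    return int(np.argmax(probs))

p_cnn = np.array([0.10, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.03, 0.02])
p_lstm = np.array([0.20, 0.50, 0.05, 0.05, 0.05, 0.05, 0.05, 0.03, 0.02])
print(fuse_decisions([p_cnn, p_lstm]))  # -> 1
```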

System characteristics
Input all
Sampling rate 16kHz
Acoustic features MFCC, spectrogram
Fusion level decision
Fusion method average
Classifier CNN, LSTM, ensemble
Decision making average

DCASE 2018 Challenge - Task 5: Monitoring of Domestic Activities Based on Multi-Channel Acoustics

Gert Dekkers and Peter Karsmakers
Computer Science, KU Leuven - ADVISE, Geel, Belgium

Abstract

The DCASE 2018 Challenge consists of five tasks related to the automatic classification and detection of sound events and scenes. This paper presents the setup of Task 5, including a description of the task, the dataset and the baseline system. The task investigates to what extent multi-channel acoustic recordings are beneficial for classifying domestic activities. The goal is to exploit spectral and spatial cues independent of sensor location using multi-channel audio. For this purpose we provide a development and an evaluation dataset, both derivatives of the SINS database, containing domestic activities recorded by multiple microphone arrays. The baseline system, a neural network architecture using convolutional and dense layers, is intended to lower the hurdle to participating in the challenge and to provide a reference performance.
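
A rough sketch of a log-mel energy front end of the kind the baseline builds on, assuming librosa; the frame, hop, and mel-band settings are illustrative, not the official baseline configuration.

```python
# Hedged sketch of a log-mel energy feature extractor (librosa assumed).
import librosa
import numpy as np

def log_mel(wav_path, sr=16000, n_mels=40):
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    return np.log(mel + 1e-10)  # shape: (n_mels, n_frames)
```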

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies
Fusion level decision
Fusion method average
Classifier CNN

GCNN for Classification of Domestic Activities

Lionel Delphin-Poulat, Cyril Plapous and Rozenn Nicol
HOME/CONTENT, Orange Labs, Lannion, France

Abstract

A classifier for processing multi-channel audio segments into nine classes of daily activities (Task 5 of the DCASE 2018 Challenge) is presented. Its framework is based on a Gated Convolutional Neural Network (GCNN). Four models with different learning strategies are proposed. They achieve macro-averaged F1-scores between 88.50% and 88.72%.
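
A minimal sketch of a gated convolutional block of the kind the abstract describes, assuming PyTorch; channel counts and kernel size are illustrative.

```python
# Hedged sketch of a gated convolutional (GCNN) block: a feature
# convolution modulated element-wise by a sigmoid-gated convolution.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.feat = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        self.gate = nn.Conv2d(c_in, c_out, k, padding=k // 2)

    def forward(self, x):
        # GLU-style gating: features * sigmoid(gate)
        return self.feat(x) * torch.sigmoid(self.gate(x))
```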

System characteristics
Input all
Sampling rate 16kHz
Data augmentation Gaussian Additive Noise
Acoustic features log-mel energies
Fusion level decision
Fusion method average
Classifier CNN

Domestic Activities Classification Based on CNN Using Shuffling and Mixing Data Augmentation

Tadanobu Inoue1, Phongtharin Vinayavekhin1, Shiqiang Wang2, David Wood2, Nancy Greco2 and Ryuki Tachibana3
1AI, IBM Research, Tokyo, Japan, 2AI, IBM Research, Yorktown Heights, NY, USA, 3AI, IBM Research, Tokyo, Japan

Abstract

This technical report describes our proposed design and implementation of the system used for our DCASE 2018 Challenge submission. The work focuses on Task 5 of the challenge, which is about monitoring and classifying domestic activities based on multi-channel acoustics. We propose data augmentation techniques that shuffle and mix two sounds of the same class to mitigate the imbalanced training dataset. This data augmentation generates new variations in both the sequence and the density of sound events. The experimental results show that the proposed system achieves an average macro-averaged F1-score of 89.95% over the 4 folds of the development dataset, a significant improvement over the baseline result of 84.50%. For the final evaluation submission, four proposed classifiers are trained on the four folds of training and validation data in the development dataset. We then ensemble these four models by averaging their predictions.
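
A hedged sketch of the within-class shuffle-and-mix augmentation described above: two clips of the same class are block-shuffled and then mixed, yielding new event sequences and densities. The block count and mixing gain are assumptions, not the authors' exact recipe.

```python
# Hedged sketch of within-class shuffling-and-mixing data augmentation.
import numpy as np

def shuffle_blocks(x, n_blocks=4, rng=np.random):
    blocks = np.array_split(x, n_blocks)
    rng.shuffle(blocks)  # reorder the time blocks
    return np.concatenate(blocks)

def mix_same_class(x1, x2, alpha=0.5):
    n = min(len(x1), len(x2))
    return alpha * shuffle_blocks(x1[:n]) + (1 - alpha) * shuffle_blocks(x2[:n])
```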

System characteristics
Input all
Sampling rate 16kHz
Data augmentation shuffling, mixing
Acoustic features log-mel energies
Fusion level decision
Fusion method average
Classifier CNN
Decision making average

DCASE 2018 Challenge Baseline with Convolutional Neural Networks

Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection, and 5) multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and VGGish architectures from the image processing area. We investigated how performance varies from task to task when the configuration of the neural networks is kept the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, except on Task 1, where the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1-score of 20.8% on Task 4, and an F1-score of 87.75% on Task 5.
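
A minimal sketch of the "CNN with global max pooling" classifier head mentioned in the abstract, assuming PyTorch; layer sizes are illustrative.

```python
# Hedged sketch of a classifier head with global max pooling over the
# convolutional feature map.
import torch
import torch.nn as nn

class GlobalMaxPoolHead(nn.Module):
    def __init__(self, c_in, n_classes=9):
        super().__init__()
        self.fc = nn.Linear(c_in, n_classes)

    def forward(self, feat_map):                   # (batch, c_in, T, F)
        pooled = torch.amax(feat_map, dim=(2, 3))  # max over time and freq
        return self.fc(pooled)                     # class logits
```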

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies
Fusion level decision
Fusion method average
Classifier AlexNetish 8-layer CNN with global max pooling; AlexNetish 4-layer CNN with global max pooling

CIAIC-MODA System for DCASE2018 Challenge Task 5

Dexin Li and Mou Wang
Speech Signal Processing, CIAIC, Xi'an, China

Abstract

In this technical report, we present several systems for task 5 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. The task is to classify multi-channel audio segments into one of the daily activities performed in a home environment. We develop three methods for the task. First, a log mel-spectrogram is extracted from each segment and fed to the baseline CNN extended with gated linear units (GLU). Second, we use a VGGNet to improve the network. In addition, to exploit spatial information, we extract coherence features among all channels and classify them with a 1D-CNN with GLU. Finally, we fuse the posteriors of the three subsystems to further improve performance. The experimental results show that the proposed systems achieve at least a 5% F1-score improvement over the baseline system.
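
A hedged sketch of the inter-channel coherence feature described above, computing the magnitude-squared coherence for every microphone pair with SciPy; the parameters are illustrative.

```python
# Hedged sketch of pairwise inter-channel coherence features.
import itertools
import numpy as np
from scipy.signal import coherence

def pairwise_coherence(channels, fs=16000, nperseg=1024):
    """channels: array of shape (n_ch, n_samples); returns (n_pairs, n_freqs)."""
    feats = []
    for i, j in itertools.combinations(range(len(channels)), 2):
        _, cxy = coherence(channels[i], channels[j], fs=fs, nperseg=nperseg)
        feats.append(cxy)
    return np.stack(feats)
```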

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies
Spatial features coherence
Fusion level decision
Fusion method average
Classifier CNN, VGG10, ensemble; CNN, VGG10, GLU, ensemble
Decision making average

DCASE 2018 Task 5 Challenge Technical Report: Sound Event Classification by a Deep Neural Network with Attention and Minimum Variance Distortionless Response Enhancement

Hsueh-Wei Liao1, Jong-Yi Huang2, Shih-Syuan Lan2, Tsung-Han Lee2, Yi-Wen Liu1 and Ming-Sian Bai2
1Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2Power Mechanical Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract

In this technical report, we propose a sub-band convolutional neural network with residual building blocks as a sound event detection system. Our system performs not only the clip-wise prediction required by task 5 but also frame-wise prediction, which can be regarded as multi-task learning. The frame-wise labels are all derived from the original weak labels by label smoothing with the energy of the frames. With this multi-task learning, we believe the frame-wise prediction can concentrate on the most important parts of the weakly-labeled dataset. In addition, we preprocessed the input signals with array-based methods and, depending on the sound class, mixed results are reported in terms of the F1-score.
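
A hedged sketch of deriving frame-wise soft targets from a weak clip label by weighting with frame energy, in the spirit of the smoothing the abstract describes; the authors' exact scheme may differ.

```python
# Hedged sketch: energy-weighted frame targets from a weak clip label.
import numpy as np

def frame_targets(frames, clip_label, n_classes=9):
    """frames: (n_frames, frame_len); clip_label: integer class index."""
    energy = np.sum(frames ** 2, axis=1)
    w = energy / (energy.max() + 1e-10)  # per-frame weight in [0, 1]
    y = np.zeros((len(frames), n_classes))
    y[:, clip_label] = w                 # energetic frames get stronger targets
    return y
```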

System characteristics
Input all; mixed
Sampling rate 16kHz
Data augmentation time shifting
Acoustic features log-mel energies
Spatial features MVDR; MVDR with MMSE
Fusion level decision
Fusion method average
Classifier CNN

An Ensemble System for Domestic Activity Recognition

Huaping Liu1, Feng Wang1, Xinzhu Liu2 and Di Guo1
1Computer Science and Technology, Beijing, China, 2Computer Science and Technology, Changchun, China

Abstract

As one of the most important sensing modalities, acoustics is becoming more and more popular for realizing today's smart environments. In this challenge, we tackle the task of monitoring domestic activities based on multi-channel acoustics. Several acoustic features, including mel-spectrograms, MFCCs and VGGish features, are extracted from the raw audio and fused to train different deep neural networks. An ensemble system is then built from the trained models. The experimental results on the development dataset demonstrate that the proposed system achieves superior performance in recognizing domestic activities.

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies, MFCC
External model VGGish
Fusion level decision
Fusion method average
Classifier CNN, RNN, ensemble
Decision making average

Partially-Shared Convolutional Neural Network for Classification of Multi-Channel Recorded Audio Signals

Kazuhiro Nakadai and Danilo R. Onishi
Research Div., Honda Research Institute Japan, Wako, Japan

Abstract

This technical paper presents the system used in our submission for task 5 of the DCASE 2018 challenge. We propose a partially-shared convolutional neural network, a multi-task system with a common input (the multi-channel log-mel features) and two output branches: a classification branch, which outputs the predicted class, and a regression branch, which outputs a single-channel representation of the multi-channel input data. Since the system has a network shared between classification and regression, training for regression is expected to enhance the training for classification and vice versa. Because task 5 aims at classification based on multi-channel audio input, we tried to improve classification performance by training classification and regression together. By applying the proposed system along with parameter tuning of the baseline CNN system, we confirmed that the classification F1-score increased to 89.94% in four-fold cross-validation, while the baseline system achieved 84.50%.
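
A minimal sketch of the partially-shared, two-branch idea described above: a shared trunk feeds both a classification head and a regression head that reconstructs a single-channel representation. Assumed PyTorch; all layer sizes are illustrative.

```python
# Hedged sketch of a partially-shared multi-task CNN.
import torch
import torch.nn as nn

class PartiallySharedCNN(nn.Module):
    def __init__(self, n_classes=9):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten())
        self.classify = nn.Linear(32 * 8 * 8, n_classes)  # activity logits
        self.regress = nn.Linear(32 * 8 * 8, 40 * 64)     # mono representation

    def forward(self, x):  # x: (batch, 4 mics, mel bands, frames)
        h = self.shared(x)
        return self.classify(h), self.regress(h)
```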

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies
Fusion level decision
Fusion method majority vote
Classifier Partially Shared CNN

Multi-Channel Audio Classification with Neural Network Using Scattering Transform

Alon Raveh1 and Alon Amar2
1Signal Processing Department, National Research Center, Haifa, Israel, 2EE, Technion, Haifa, Israel

Abstract

This technical paper presents an approach to task 5 of the 2018 acoustic scene classification challenge (DCASE 2018). A sequence of audio segments is observed by an array of 4 microphones. The task is to devise multichannel processing that classifies the audio signals into one of 9 predefined classes. The proposed approach combines a deep neural network with the scattering transform. Each audio segment is first represented by two layers of the scattering transform. The 4 denoised transforms of each of the two layers are combined together. Each of the fused layers is processed in parallel by two neural network (NN) architectures, a ResNet and a long short-term memory (LSTM) network, with a joint fully-connected layer.
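
A minimal sketch of a first- and second-order scattering front end, assuming the kymatio package; the J and Q settings are illustrative, not the authors' configuration.

```python
# Hedged sketch of a 1D scattering transform front end (kymatio assumed).
import numpy as np
from kymatio.numpy import Scattering1D

T = 16000                               # one second at 16 kHz
scattering = Scattering1D(J=8, shape=T, Q=12)
x = np.random.randn(T).astype(np.float32)
Sx = scattering(x)                      # stacked order-0/1/2 coefficients
print(Sx.shape)
```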

System characteristics
Input all
Sampling rate 16kHz
Acoustic features Scattering Transform; Scattering Transform, SVD
Fusion level feature
Fusion method average
Classifier LSTM, CNN, ResNet

Home Activity Monitoring Based on Gated Convolutional Neural Networks and System Fusion

Yuhan Shen, Kexin He and Weiqiang Zhang
Electronic Engineering, Tsinghua University, Beijing, China

Abstract

In this technical report, we propose a method for task 5 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. This task aims to classify multi-channel audio segments into one of the provided predefined classes, all of which are daily activities performed in a home environment. This paper adopts a model based on gated convolutional neural networks for domestic activity classification, and we utilize multiple methods to improve its performance. Firstly, we use a gated convolutional neural network in place of the usual combination of convolutional and recurrent neural networks in order to extract more temporal features and improve efficiency. Secondly, we mitigate the problem of data imbalance using a weighted loss function. In addition, we adopt a model-ensembling strategy to make our system stronger and more effective. Finally, we use a fusion of two systems to improve performance. In summary, we obtain an 89.73% F1-score on the development dataset, while the official baseline system achieves an 84.50% F1-score.
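
A hedged sketch of a class-weighted loss of the kind used above to counter data imbalance, assuming PyTorch; the class counts shown are invented for illustration.

```python
# Hedged sketch of a class-weighted cross-entropy loss (inverse frequency).
import torch
import torch.nn as nn

class_counts = torch.tensor([1000., 200., 150., 300., 100., 400., 50., 900., 1200.])
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```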

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies, Time-Frequency Cepstral
Fusion level classifier
Fusion method stacking
Classifier GCNN, GSV-SVM, ensemble

Multichannel Acoustic Scene Classification by Blind Dereverberation, Blind Source Separation, Data Augmentation, and Model Ensembling

Ryo Tanabe, Takashi Endo, Yuki Nikaido, Takeshi Ichige, Phong Nguyen, Yohei Kawaguchi and Koichi Hamada
R&D Group, Hitachi, Ltd., Tokyo, Japan

Abstract

Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5 can be regarded as a type of multichannel acoustic scene classification. An important characteristic of Task 5 is that a microphone array may be placed at different locations in the development dataset and the evaluation dataset, so we should exploit not location-dependent spatial cues but location-independent ones to avoid overfitting. The proposed system is a combination of front-end modules based on blind signal processing and back-end modules based on machine learning. The front-end modules employ blind dereverberation, blind source separation, etc., which use the spatial cues without machine learning, so overfitting is avoided. The back-end modules employ one-dimensional-convolutional-neural-network (1DCNN)-based architectures and VGG16-based architectures for the individual front-end modules, and all the probability outputs are ensembled. In addition, Mixup-based data augmentation avoids overfitting in the back-end modules.
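
A minimal sketch of the Mixup augmentation mentioned in the abstract: convex combinations of two examples and their one-hot labels. The Beta parameter is an assumption.

```python
# Hedged sketch of Mixup data augmentation.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random):
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```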

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies, MFCC
Spatial features multi-channel front-end processing; Blind Source Separation, Blind dereverberation, Beamformer
External model VGG16
Fusion level audio, decision
Fusion method Blind Source Separation, Blind dereverberation, Beamformer, average; Blind Source Separation, Blind dereverberation, Beamformer, Random Forest; Blind Source Separation, Blind dereverberation, Beamformer, SVM; Blind Source Separation, Blind dereverberation, Beamformer, F1-score-weighted average
Classifier CNN, SVM, VGG16, ensemble
Decision making average; Random Forest; SVM; F1-score-weighted average

Monitoring of Domestic Activities Based on Multi-Channel Acoustics: A Time-Channel 2D-Convolutional Approach

Marco Tiraboschi
Computer Science, Università degli Studi di Milano, Milan, Italy

Abstract

This approach is meant to be an extension of the DCASE 2018 task 5 baseline system for domestic activity recognition exploiting multichannel audio: the Convolutional Neural Network model has been slightly restructured for this purpose by using two-dimensional convolutions along the dimensions of time and channel.
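
A minimal sketch of the time-channel 2D convolution described above: the kernel slides over time and microphone channels, with mel bands acting as the input depth. Assumed PyTorch; all sizes are illustrative.

```python
# Hedged sketch of a 2D convolution over (time, channel).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=40, out_channels=64, kernel_size=(5, 2))
x = torch.randn(1, 40, 501, 4)  # (batch, mel bands, frames, microphones)
h = conv(x)                     # convolves jointly over time and channel
print(h.shape)                  # torch.Size([1, 64, 497, 3])
```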

System characteristics
Input all
Sampling rate 16kHz
Acoustic features log-mel energies
Fusion level classifier
Fusion method CNN
Classifier CNN