Monitoring of domestic activities based on multi-channel acoustics

Task description

The goal of this task is to classify multi-channel audio segments (i.e. segmented data is given), acquired by a microphone array, into one of the provided predefined classes. These classes are daily activities performed in a home environment (e.g. "Cooking").

updated 30/06/2018


There is a rising interest in smart environments that enhance the quality of life for humans in terms of, for example, safety, security, comfort, and home care. Smart functionality requires situational awareness, which might be obtained by interpreting a multitude of sensing modalities including acoustics. Acoustic sensing is already used in voice assistants such as Google Home, Apple HomePod, and Amazon Echo. While these devices focus on speech, they could be extended to identify domestic activities carried out by humans. The recognition of activities based on acoustics has already been touched upon in the literature. Yet the acoustic models are typically based on single-channel, single-location recordings. This task investigates to what extent multi-channel acoustic recordings are beneficial for detecting domestic activities.


The goal of this task is to classify multi-channel audio segments (i.e. segmented data is given), acquired by a microphone array, into one of the provided predefined classes, as illustrated by Figure 1. These classes are daily activities performed in a home environment, for example “Cooking”, “Watching TV” and “Working”. Because such activities can be composed of different sound events, they are considered acoustic scenes. The differences from Task 1 (Acoustic scene classification) are the type of scenes and the possibility to use multi-channel audio.

Figure 1: Conceptual overview of the task.

In this challenge, a person living alone at home is considered. This reduces the complexity of the problem since the number of overlapping activities is expected to be small. In fact, the considered dataset contains no overlapping activities. These conditions were chosen to focus on the main goal of this task, which is to investigate to what extent multi-channel acoustic recordings are beneficial for detecting domestic activities. Multi-channel recordings allow spatial properties to be exploited as input features for the classification problem. However, a detection model that uses the absolute location of sound sources as input is unlikely to generalize to cases where the position of the microphone array is altered. Therefore, the focus of this task is on systems that exploit spatial cues independent of sensor location using multi-channel audio.



The dataset used in this task is a derivative of the SINS dataset. It contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of 4 linearly arranged microphones. For this task, the 7 microphone arrays in the combined living room and kitchen area are used. Figure 2 shows the floorplan of the recorded environment along with the position of the used sensor nodes.

Figure 2: 2D floorplan of the combined kitchen and living room with the used sensor nodes.

The continuous recordings were split into audio segments of 10 s. Segments containing more than one active class (e.g. a transition between two activities) were left out, so each segment represents one activity. Subsampling was then performed, starting from the largest classes, to make the dataset easier to use for a challenge. These audio segments are provided as individual files along with the ground truth. Each audio segment contains 4 channels (i.e. the 4 microphone channels from a particular node). The 9 daily activities for this task are shown in Table 1, along with the number of available 10 s multi-channel segments in the development set and the number of full sessions of each activity (e.g. a cooking session).

Activity                                              # 10s segments   # sessions
Absence (nobody present in the room)                  18860            42
Cooking                                               5124             13
Dishwashing                                           1424             10
Eating                                                2308             13
Other (present but not doing any relevant activity)   2060             118
Social activity (visit, phone call)                   4944             21
Vacuum cleaning                                       972              9
Watching TV                                           18648            9
Working (typing, mouse click, ...)                    18644            33
Total                                                 72984            268

As development set, approximately 200 hours of data from 4 sensor nodes is given along with the ground truth. As evaluation set, data is provided from all the sensor nodes (i.e. also sensor nodes not present in the development set). The evaluation will be based on the sensor nodes not present in the development set. The data from the same nodes as in training are provided to give insight into overfitting on those positions. The partitioning of the data was done randomly, keeping together the segments belonging to one particular consecutive activity (e.g. a full session of cooking). The data provided for each sensor node contain recordings of the same time period, so the performed activities are observed by multiple microphone arrays at the same time instant. Because the segments of the largest classes were subsampled, the sensor nodes do not fully overlap in time for a particular consecutive activity of those classes.

Recording and annotation procedure

The sensor node configuration used in this setup is a control board together with a linear microphone array. The control board contains an EFM32 ARM Cortex-M4 microcontroller from Silicon Labs (EFM32WG980) used for sampling the analog audio. The microphone array contains four Sonion N8AC03 MEMS low-power (±17µW) microphones with an inter-microphone distance of 5 cm. The sampling for each audio channel is done sequentially at a rate of 16 kHz with a bit depth of 12.

The annotation was performed in two phases. First, during the data collection, a smartphone application was used to let the monitored person(s) annotate the activities while being recorded. The person could only select from a fixed set of activities. The application was easy to use and did not significantly influence the transition between activities. Second, the start and stop timestamps of each activity were refined using our own annotation software.

Postprocessing and sharing the database involves privacy-related aspects. Besides the person(s) living there, multiple people visited the home. Moreover, during a phone call, one can partially hear the person on the other end. Written informed consent was obtained from all participants.

More information about the full dataset can be found in:


Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp 32–36. November 2017.


Abstract


There is a rising interest in monitoring and improving human wellbeing at home using different types of sensors including microphones. In the context of Ambient Assisted Living (AAL) persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking their activities being performed at home. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and led to numerous public databases for a wide range of applications. However, no public databases exist which a) focus on daily activities in a home environment, b) contain activities being performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in one living home, over a period of one week. The recording setup is an acoustic sensor network containing thirteen sensor nodes, with four low-cost microphones each, distributed over five rooms. Annotation is available on an activity level. In this paper we present the recording and annotation procedure, the database content and a discussion on a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, Support Vector Machine and a majority vote late-fusion scheme. The database is publicly released to provide a common ground for future research.


Keywords: Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks


Download: Development dataset

If you are using the provided baseline system, there is no need to download the dataset manually, as the system will automatically download the needed dataset for you.

Task 5, development dataset (42.6 GB download - 87.0 GB when extracted)

An inconsistency in the dataset was reported here. The issue is fixed in the current release of the dataset (v1.0.3, 15/05/2018). If you have an earlier version, you can either download the entire dataset again or overwrite a subset of the files using this archive.

The content of the development set is structured in the following manner:

dataset root
│   EULA.pdf                End user license agreement
│   meta.txt                meta data, tsv-format, [audio file (str)][tab][label (str)][tab][session (str)]
│               Dataset description (markdown)
│   README.html             Dataset description (HTML)
├───audio                   72984 audio segments, 16-bit 16kHz
│   │   DevNode1_ex1_1.wav  name format DevNode{NodeID}_ex{sessionID}_{segmentID}.wav
│   │   DevNode2_ex1_2.wav
│   │   ...
└───evaluation_setup        cross-validation setup, 4 folds
    │   fold1_train.txt     training file list, tsv-format, [audio file (str)][tab][label (str)][tab][session (str)]
    │   fold1_test.txt      test file list, tsv-format, [audio file (str)]
    │   fold1_evaluate.txt  evaluation file list, tsv-format, [audio file (str)][tab][label (str)]  
    │   ...        

The multi-channel audio files can be found under the directory audio and are named DevNode{NodeID}_ex{sessionID}_{segmentID}.wav, where:

  • {NodeID} (1-4) identifies the node that recorded the segment. In total 4 nodes are given (1-4). The location of each node is not disclosed to the participants.
  • {sessionID} indicates a full session of a certain activity.
  • {segmentID} indicates a segment belonging to a certain {sessionID}. A session of a certain activity (e.g. cooking) can have multiple 10 s segments. Keep in mind that segmentIDs are not shared between nodes (e.g. DevNode1_ex1_1 is not necessarily recorded over the same time range as DevNode2_ex1_1, but it does belong to the same session).
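As an illustration of the naming scheme above, the following minimal sketch splits a development-set file name into its three identifiers. The helper name `parse_segment_name` and the `SegmentName` tuple are hypothetical, and the pattern assumes that all three identifiers are plain decimal integers, which the examples in the listing suggest but the description does not state explicitly.

```python
import re
from typing import NamedTuple

# Pattern for the documented format DevNode{NodeID}_ex{sessionID}_{segmentID}.wav.
# Assumption: all three identifiers are plain decimal integers.
_NAME_RE = re.compile(r"^DevNode(\d+)_ex(\d+)_(\d+)\.wav$")


class SegmentName(NamedTuple):
    """Identifiers encoded in a development-set file name (hypothetical helper type)."""
    node_id: int
    session_id: int
    segment_id: int


def parse_segment_name(filename: str) -> SegmentName:
    """Split a development-set file name into node, session and segment IDs."""
    # Drop any leading directory such as "audio/".
    basename = filename.replace("\\", "/").rsplit("/", 1)[-1]
    match = _NAME_RE.match(basename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return SegmentName(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_segment_name("audio/DevNode1_ex1_1.wav"))
```

Because segmentIDs are not shared between nodes, only `node_id` and `session_id` should be used to relate recordings from different nodes.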

The file meta.txt and the files in the folder evaluation_setup contain filenames, optionally followed by the ground truth label and an identifier of the session to which the segment belongs. These are arranged in the following manner:

[filename (str)][tab][activity label (str)][tab][session (str)]
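A minimal sketch of reading such a tab-separated list with the Python standard library is given below. The function name `read_file_list` is hypothetical, and the label and session strings in the demo are made up for illustration; the actual values should be taken from the dataset files. Columns that are absent (as in the test lists, which contain only filenames) are returned as `None`.

```python
import csv
from typing import Iterable, List, Optional, Tuple

# (filename, activity label or None, session or None)
Row = Tuple[str, Optional[str], Optional[str]]


def read_file_list(lines: Iterable[str]) -> List[Row]:
    """Parse lines of the form filename[tab]label[tab]session.

    The label and session columns are optional, so the same function can
    read meta.txt as well as the train, test and evaluate lists.
    """
    rows: List[Row] = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        fields = [field.strip() for field in fields[:3]]
        padded = fields + [None] * (3 - len(fields))
        rows.append((padded[0], padded[1], padded[2]))
    return rows


if __name__ == "__main__":
    # Made-up example lines; real label and session strings may differ.
    demo = [
        "audio/DevNode1_ex1_1.wav\tcooking\tsession_a\n",
        "audio/DevNode2_ex1_2.wav\n",
    ]
    for row in read_file_list(demo):
        print(row)
```

To read a file from disk, an open file object can be passed directly, for example `read_file_list(open("meta.txt", newline=""))`.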

The directory evaluation_setup provides cross-validation folds for the development dataset. More information on their usage can be found here.

Download: Evaluation dataset

If you are using the provided baseline system, there is no need to download the dataset manually, as the system will automatically download the needed dataset for you.

Task 5, evaluation dataset (42.2 GB download - 87.0 GB when extracted)

The content of the dataset is structured in the following manner:

dataset root
│   EULA.pdf                End user license agreement
│   meta.txt                meta data, tsv-format, [audio file (str)]\n
│               Dataset description (markdown)
│   readme.html             Dataset description (HTML)
├───audio                   72972 audio segments, 16-bit 16kHz
│   │   1.wav               name format {segmentID}.wav
│   │   100.wav
│   │   ...
└───evaluation_setup        evaluation files
    │   test.txt            test file list, tsv-format, [audio file (str)]\n

The multi-channel audio files can be found under the directory audio and are named {segmentID}.wav.


The file meta.txt and the files in the folder evaluation_setup contain filenames only. Ground truth will be made available after the challenge results have been made public. Additionally, a filename mapping will be made available that maps each filename to a filename in the same format as in the development dataset.

Task setup

The task is split into two phases. First, the development dataset is provided. A month before the challenge deadline, the evaluation set is provided. At the challenge deadline, submissions must include the system output on the evaluation set, system meta information and one technical report. The goal of the task is to obtain the highest score on the evaluation set.

Development set

The development set includes multi-channel audio segments, recorded by four different sensor nodes, along with the ground truth and cross-validation folds. Cross-validation folds are provided for the development dataset in order to make results reported with this dataset uniform. Results on these subsets are used for comparison in the initial internal report and also need to be reported in the submitted system meta information. The setup consists of four folds distributing the available files. Segments belonging to a particular session of an activity (e.g. a session of cooking collected by multiple sensor nodes) are kept together to minimize leakage between folds. The folds are provided with the dataset in the directory evaluation_setup. For each fold, a training, testing and evaluation subset is provided.

Important: If you are not using the provided cross-validation setup, pay attention to the segments extracted from the same sessions. Make sure that, for each fold, ALL segments from the same session are either in the training subset OR in the test subset.
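For participants who build their own folds, the constraint above can be sketched as follows. This is an illustrative, standard-library-only example and not the procedure used to create the provided folds: the function name `session_folds` is hypothetical, sessions are assigned to folds round-robin after shuffling, and no attempt is made to balance classes across folds.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def session_folds(
    sessions: Sequence[Hashable], n_folds: int = 4, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) per fold without splitting a session.

    `sessions[i]` is the session identifier of segment i. All segments that
    share an identifier end up on the same side of every split.
    """
    unique = sorted(set(sessions), key=str)
    if len(unique) < n_folds:
        raise ValueError("need at least as many sessions as folds")
    random.Random(seed).shuffle(unique)
    # Round-robin assignment of whole sessions to folds.
    fold_of = {session: i % n_folds for i, session in enumerate(unique)}
    folds = []
    for k in range(n_folds):
        test = [i for i, s in enumerate(sessions) if fold_of[s] == k]
        train = [i for i, s in enumerate(sessions) if fold_of[s] != k]
        folds.append((train, test))
    return folds


if __name__ == "__main__":
    # Made-up session identifiers, one per segment.
    demo_sessions = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s5"]
    for train_idx, test_idx in session_folds(demo_sessions):
        print("train:", train_idx, "test:", test_idx)
```

The session identifier should be the one from the third column of meta.txt, so that segments of the same session recorded by different nodes stay together.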

External data sources/pre-trained models

List of external datasets/pre-trained models allowed:

Dataset name        Type    Added       Link
AudioSet            audio   29.6.2018
VGGish              model   5.7.2018
Xception            model   5.7.2018
VGG16               model   5.7.2018
VGG19               model   5.7.2018
ResNet50            model   5.7.2018
InceptionV3         model   5.7.2018
InceptionResNetV2   model   5.7.2018
MobileNet           model   5.7.2018
DenseNet            model   5.7.2018
MobileNetV2         model   5.7.2018

Participants can suggest additions to this list by email until the evaluation dataset is published, after which the list is locked.

Evaluation set

The evaluation set includes multi-channel audio segments, recorded by seven different sensor nodes. Three sensor nodes are not in the development set and will be used for the final evaluation score. The segments obtained by the same nodes as in the development set are used to check for overfitting. Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth.


A challenge submission consists of one zip package containing the system outputs and system meta information, and one technical report (pdf file). Detailed information on the challenge submission can be found on the submission page. System output should be presented as a single text file (in tab-separated format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. The format is as follows:

[filename (string)][tab][activity label (string)]
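A minimal sketch of writing predictions in this format is shown below. The function name `write_system_output` is hypothetical, and the filenames and label strings in the demo are placeholders; the exact filename form and label spelling expected by the organizers should be checked against the submission page and the dataset files.

```python
import csv
import io
from typing import Mapping, TextIO


def write_system_output(predictions: Mapping[str, str], stream: TextIO) -> None:
    """Write one 'filename[tab]activity label' line per classified segment."""
    writer = csv.writer(
        stream,
        delimiter="\t",
        lineterminator="\n",
        quoting=csv.QUOTE_NONE,  # raises csv.Error if a field contains a tab
    )
    for filename, label in predictions.items():
        writer.writerow([filename, label])


if __name__ == "__main__":
    # Placeholder predictions; the order of the lines does not matter.
    demo = {"audio/1.wav": "cooking", "audio/100.wav": "absence"}
    buffer = io.StringIO()
    write_system_output(demo, buffer)
    print(buffer.getvalue(), end="")
```

To write to disk, the same function can be called with a file opened in text mode, for example `open("output.txt", "w", newline="")`, where the output file name is again only an example.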

A template for the system meta information (.yaml file) is available on the submissions page.

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

Task rules

These are the general rules valid for all tasks. The same rules and additional information on technical report and submission requirements can be found here. Task specific rules are highlighted in bold.

  • Participants are allowed to use external data for system development taking into account the following principles:
    • The used data must be publicly available without cost before 29th of March 2018
    • External data includes pre-trained models
    • Participants should inform/suggest such data to be listed on the task webpage, so all competitors know about them and have equal opportunity to use them
    • Once the evaluation set is published, the list of external datasets allowed is locked (no further external sources allowed)
  • Manipulation of provided training data is allowed (e.g. data augmentation).
  • Participants are not allowed to make subjective judgements on the evaluation data, nor to annotate it.
  • The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
  • The system outputs that do not respect the challenge rules will be evaluated on request, but they will not be officially included in the challenge rankings.


The scoring of this task will be based on the macro-averaged F1-score. The F1-score is calculated for each class separately and then averaged over all classes. A full 10 s multi-channel audio segment is considered to be one sample. The winner of the task is the submission with the highest macro-averaged F1-score on the evaluation set. Participants send in their system output, and the scores are calculated by the task coordinators.
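The metric can be sketched in a few lines of Python, which may be useful for checking results on the development folds. This is an illustrative implementation, not the official scoring code: the function name `macro_f1` is hypothetical, and it assumes that a class with no true positives, false positives or false negatives contributes an F1-score of 0.

```python
from typing import Hashable, Optional, Sequence


def macro_f1(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    classes: Optional[Sequence[Hashable]] = None,
) -> float:
    """Unweighted mean of the per-class F1-scores; one item per 10 s segment."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if classes is None:
        labels = sorted(set(y_true) | set(y_pred), key=str)
    else:
        labels = list(classes)
    if not labels:
        raise ValueError("no classes to score")
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when undefined.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    guess = ["a", "b", "b", "b"]
    print(macro_f1(truth, guess))
```

Because every class has the same weight in the average, small classes such as vacuum cleaning influence the score as much as large classes such as absence.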

Baseline system


The baseline system is intended to lower the hurdle to participate in the challenge. It provides an entry-level approach which is simple but relatively close to state-of-the-art systems. High-end performance is left for the challenge participants to find. Participants are allowed to build their system on top of the given baseline system. The system has all the needed functionality for dataset handling, storing and accessing features and models, and evaluating the results, which makes adapting it to one's needs rather easy. The baseline system is also a good starting point for entry-level researchers.

If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system will make their code more accessible to the community. The DCASE organizers strongly encourage participants to share their code in any form after the challenge to push the research further.

During the recording campaign, data was measured simultaneously using multiple microphone arrays (nodes) each containing 4 microphones. Hence, each domestic activity is recorded as many times as there were microphones. The baseline system trains a single classifier model that takes a single channel as input. Each parallel recording of a single activity is considered as a different example during training. The learner in the baseline system is based on a Neural Network architecture using convolutional and dense layers. As input, log mel-band energies are provided to the network for each microphone channel separately. In the prediction stage a single outcome is computed for each node by averaging the 4 model outcomes (posteriors) that were computed by evaluating the trained classifier model on all 4 microphones.
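The prediction stage described above, averaging the per-channel posteriors of one node and picking the most probable class, can be sketched as follows. The function names and the posterior values in the demo are made up for illustration, and the sketch does not reproduce the baseline's actual code.

```python
from typing import List, Sequence


def fuse_channels(posteriors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-channel class posteriors into one node-level posterior.

    `posteriors[m][c]` is the model output for microphone channel m and class c.
    """
    n_channels = len(posteriors)
    if n_channels == 0:
        raise ValueError("at least one channel is required")
    n_classes = len(posteriors[0])
    if any(len(p) != n_classes for p in posteriors):
        raise ValueError("all channels must cover the same classes")
    return [sum(p[c] for p in posteriors) / n_channels for c in range(n_classes)]


def predict_label(
    posteriors: Sequence[Sequence[float]], class_names: Sequence[str]
) -> str:
    """Return the class with the highest averaged posterior."""
    fused = fuse_channels(posteriors)
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best]


if __name__ == "__main__":
    # Made-up posteriors for 4 channels and 3 classes.
    demo = [
        [0.6, 0.3, 0.1],
        [0.2, 0.7, 0.1],
        [0.5, 0.4, 0.1],
        [0.5, 0.3, 0.2],
    ]
    print(fuse_channels(demo), predict_label(demo, ["a", "b", "c"]))
```

In the demo, one channel favours the second class, but the averaged posterior still selects the first, which illustrates how the fusion smooths out a single disagreeing channel.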

The baseline system parameters are as follows:

  • Frame size: 40 ms (50% hop size)
  • Feature matrix:
    • 40 log mel-band energies in 501 successive frames (10 s)
  • Neural Network:
    • Input data: 40x501 (each microphone channel is considered to be a separate example for the learner)
    • Architecture:
      • 1D Convolutional layer (filters: 32, kernel size: 5, stride: 1, axis: time) + Batch Normalization + ReLU activation
      • 1D Max Pooling (pool size: 5, stride: 5) + Dropout (rate: 20%)
      • 1D Convolutional layer (filters: 64, kernel size: 3, stride: 1, axis: time) + Batch Normalization + ReLU activation
      • 1D Global Max Pooling + Dropout (rate: 20%)
      • Dense layer (neurons: 64) + ReLU activation + Dropout (rate: 20%)
      • Softmax output layer (classes: 9)
    • Learning:
      • Optimizer: Adam (learning rate: 0.0001)
      • Epochs: 500
      • On each epoch, the training dataset is randomly subsampled so that the number of examples for each class match the size of the smallest class
      • Batch size: 256 * 4 channels (each channel is considered as a different example for the learner)
  • Fusion: Output probabilities from the four microphones in a particular node under test are averaged to obtain the final posterior probability.
  • Model selection: The performance of the model is evaluated every 10 epochs on a validation subset (30% subsampled from the training set). The model with the highest Macro-averaged F1-score is picked.
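Two of the parameters above lend themselves to a short sketch: the number of frames per segment and the per-epoch class balancing. The code below is illustrative only and is not taken from the baseline implementation. The function names are hypothetical, the frame count assumes centered (padded) framing, which is what yields 501 frames for a 10 s segment at 16 kHz, and the subsampling assumes drawing without replacement.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def n_frames(
    duration_s: float = 10.0,
    sample_rate: int = 16000,
    frame_ms: float = 40.0,
    hop_fraction: float = 0.5,
) -> int:
    """Number of analysis frames, assuming centered (padded) framing."""
    n_samples = round(duration_s * sample_rate)
    hop = round(sample_rate * frame_ms / 1000 * hop_fraction)
    return 1 + n_samples // hop


def balanced_epoch_indices(
    labels: Sequence[Hashable], rng: random.Random
) -> List[int]:
    """Subsample every class down to the size of the smallest class."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    if not by_class:
        return []
    smallest = min(len(indices) for indices in by_class.values())
    chosen: List[int] = []
    for indices in by_class.values():
        # Assumption: sampling without replacement within each class.
        chosen.extend(rng.sample(indices, smallest))
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    print(n_frames())  # 160000 samples, hop of 320 samples
    demo_labels = ["a"] * 5 + ["b"] * 2 + ["c"] * 3
    print(balanced_epoch_indices(demo_labels, random.Random(0)))
```

Calling `balanced_epoch_indices` once per epoch with the same `rng` draws a different subset of the large classes each time, so over many epochs most of their examples are eventually seen.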

The baseline system is built on the dcase_util toolbox. The machine learning part of the code is built on Keras (v2.1.5), using TensorFlow (v1.4.0) as backend.


An inconsistency in the dataset was reported here. The issue is fixed in the current release of the dataset (v1.0.3, 15/05/2018). The repository has been updated to use the latest release of the dcase_util library (v0.2.3). Using an older version of the library will download an older version of the dataset. If you prefer not to download all files again, you can overwrite a subset of the files using this archive.

Results for the development dataset

When the code is run in development mode, the baseline system provides results for the 4-fold cross-validation setup. The table below shows the class-wise F1-scores and the macro-averaged F1-score, averaged over these 4 folds.

Activity                  F1-score
Absence                   85.41 %
Cooking                   95.14 %
Dishwashing               76.73 %
Eating                    83.64 %
Other                     44.76 %
Social activity           93.92 %
Vacuum cleaning           99.31 %
Watching TV               99.59 %
Working                   82.03 %
Macro-averaged F1-score   84.50 %

Note: The performance might not be exactly reproducible but similar results should be obtainable.