Acoustic scene classification


Task description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded.

This task comprises three subtasks that involve system development for three different situations:

  • Subtask A (Acoustic Scene Classification): classification of data from the same device as the available training data.
  • Subtask B (Acoustic Scene Classification with mismatched recording devices): classification of data recorded with devices different from those used for the training data.
  • Subtask C (Acoustic Scene Classification with use of external data): use of external data in training.

Description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "pedestrian street", or "metro station".

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. The dataset was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5-6 minutes of audio. The original recordings were split into segments with a length of 10 seconds, provided as individual files. Available information about the recordings includes the acoustic scene class, the city, and the recording location.

There are two different versions of the dataset, TUT Urban Acoustic Scenes 2018 and TUT Urban Acoustic Scenes 2018 Mobile, used for subtasks A and B, respectively. More details about them can be found below.

Acoustic scenes for the task (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

The dataset was collected by Tampere University of Technology between 01/2018 - 03/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Recording procedure

Recordings were made using four devices that captured audio simultaneously.

The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically designed to look like headphones and are worn in the ears; as a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment. This equipment is further referred to as device A.

Three other commonly available consumer devices (e.g. smartphones, cameras) were used, handled in typical ways by the person doing the recording (e.g. hand held). We further refer to these devices as B, C, and D. The audio recordings from these devices are of different quality than those from device A. All simultaneous recordings are time synchronized.

The TUT Urban Acoustic Scenes 2018 development dataset contains only material recorded with device A, with 864 segments (144 minutes of audio) per acoustic scene. The dataset contains in total 8640 segments, i.e. 24 hours of audio.

The TUT Urban Acoustic Scenes 2018 Mobile development dataset contains material recorded with devices A, B and C. For each acoustic scene there are 864 segments recorded with device A, and parallel audio consisting of 72 segments each recorded with devices B and C. Data from device A was resampled and averaged into a single channel to align with the properties of the data recorded with devices B and C. The dataset contains in total 28 hours of audio.

Download

Subtask A




Subtask B




Subtask C

This subtask uses the same dataset as subtask A.




Task setup

For each subtask, a development set is provided, together with a training/test partitioning for system development. Participants are required to report performance of their system using this train/test setup in order to allow comparison of systems on the development set.

Subtask A

Acoustic Scene Classification

This subtask is concerned with the basic problem of acoustic scene classification, in which all available data (development and evaluation) are recorded with the same device, in this case device A.

Development dataset


The development data consists of recordings from all six cities, and is partitioned so that, for each city and each scene class, the training subset contains recordings from approximately 70% of the recording locations and the test subset contains recordings from the remaining locations. Of the total 8640 segments, 6122 segments were included in the training subset and 2518 segments in the test subset. For complete details on the dataset, check the readme file provided with the data.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id is always a.
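
To illustrate such a location-aware split, a minimal sketch is given below. It parses the location identifier from the file names and uses it as a grouping key; the audio directory, the number of folds, and the helper function are hypothetical, and scikit-learn's GroupKFold is just one possible way to respect the location constraint.

```python
# Hedged sketch: location-aware cross-validation folds built from the naming scheme
# [scene label]-[city]-[location id]-[segment id]-[device id].wav
import glob
import os
from sklearn.model_selection import GroupKFold

def parse_filename(path):
    """Return (scene_label, city, location_id) parsed from a dataset file name."""
    name = os.path.splitext(os.path.basename(path))[0]
    scene_label, city, location_id, segment_id, device_id = name.split('-')
    return scene_label, city, location_id

files = sorted(glob.glob('audio/*.wav'))                  # hypothetical directory
labels = [parse_filename(f)[0] for f in files]
# Group by city + location id so all segments from one location stay together.
groups = ['{}-{}'.format(*parse_filename(f)[1:]) for f in files]

for train_idx, test_idx in GroupKFold(n_splits=4).split(files, labels, groups):
    train_files = [files[i] for i in train_idx]
    test_files = [files[i] for i in test_idx]
    # ... train and evaluate on this fold ...
```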

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The amount of data in the evaluation set is 10 hours.

Subtask B

Acoustic Scene Classification with mismatched recording devices

This subtask is concerned with the situation in which an application will be tested with a few different types of devices, possibly not the same as the ones used to record the development data.

Development dataset


The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:

  • Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
  • Device B: 2 hours (72 segments per acoustic scene)
  • Device C: 2 hours (72 segments per acoustic scene)

The 2 hours of data recorded with devices B and C are parallel recordings, also available as recorded with device A. The training/test setup was created such that approximately 70% of the recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C. Please report development set results using the provided data partitioning.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id can be a, b, or c.

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The evaluation data consists of audio recorded with all four devices, of which device D was not encountered in development. Ranking of systems will be done using only devices B and C, but accuracy will also be calculated for devices A and D.

Subtask C

Acoustic Scene Classification with use of external data

This subtask is meant to test whether the use of external data in system development brings a significant improvement in performance. The task is identical to subtask A, with the only difference being that the use of external data and transfer learning is allowed under the following conditions:

  • The dataset used must be public and freely available before the 29th of March 2018
  • Participants must inform/suggest such data to be listed on the task webpage, so that all competitors know about it and have an equal opportunity to use it; please let us know by sending email to dcase.challenge@gmail.com and we will update the list of external datasets accordingly
  • Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
  • Participants should list in their technical report the external data sources they used

Participants in this subtask are encouraged to also submit to subtask A, to provide a comparison point for their system without external data.

External data sources

List of external datasets allowed:

Dataset name Type Added Link
TUT Acoustic scenes 2017, development dataset audio 29.3.2018 https://zenodo.org/record/400515
TUT Acoustic scenes 2017, evaluation dataset audio 29.3.2018 https://zenodo.org/record/1040168
TUT Acoustic scenes 2016, development dataset audio 29.3.2018 https://zenodo.org/record/45739
TUT Acoustic scenes 2016, evaluation dataset audio 29.3.2018 https://zenodo.org/record/165995
LITIS Rouen audio scene dataset audio 29.3.2018 https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task audio 29.3.2018 https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task audio 29.3.2018 https://archive.org/details/dcase2013_scene_classification_testset
Dares G1 audio 29.3.2018 http://www.daresounds.org/
AudioSet audio 25.4.2018 https://research.google.com/audioset/


Participants can suggest data for this list by sending email to dcase.challenge@gmail.com until the evaluation dataset is published, after which the list is locked.

Submission

Official challenge submission consists of a technical report and system output for the evaluation data.

System output should be presented as a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
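
For illustration, a minimal sketch of writing the system output in this format is shown below; the prediction mapping and the output file name are hypothetical.

```python
# Hedged sketch: write the system output as tab-separated [filename][tab][scene label].
# `predictions` is a hypothetical dict mapping evaluation file names to predicted scene labels.
predictions = {
    'audio/1.wav': 'park',
    'audio/2.wav': 'street_traffic',
}

with open('task1_subtaskA_mysystem_output.csv', 'w') as out:   # hypothetical file name
    for filename, scene_label in predictions.items():
        out.write('{}\t{}\n'.format(filename, scene_label))
```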

Multiple system outputs can be submitted (maximum 4 per participant). For each system, meta information should be provided in a separate file, containing the task-specific information as given in the example here. All files should be packaged into a zip file for submission. Please clearly mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text files appropriately).

Detailed information for the submission can be found on the Submission page.

Public leaderboards

During the challenge, a public leaderboard will be provided using a separate public evaluation set for each subtask. The leaderboards are organized through Kaggle InClass competitions and are meant to serve as a development tool for participants during system development.

  • Subtask A Leaderboard
  • Subtask B Leaderboard
  • Subtask C Leaderboard

The official DCASE challenge submission will not be done through these Kaggle InClass competitions.

Datasets

For public leaderboard submissions, participants should use the challenge development datasets to train their system, as in the DCASE challenge. Separate leaderboard datasets are released to be used as evaluation datasets in these competitions. The leaderboard datasets consist of material similar to the official evaluation dataset in the DCASE challenge, but in a considerably smaller amount. It is not allowed to use the leaderboard datasets to train systems in any DCASE challenge subtask or leaderboard competition.



Task rules

There are general rules valid for all tasks; these, along with information on the technical report and submission requirements, can be found here.

Task specific rules:

  • Use of external data for system development is allowed only in subtask C. Data from another task or subtask is considered external data.
  • Manipulation of the provided training and development data is allowed in all subtasks. The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a probability density function or by using techniques such as pitch shifting or time stretching; see the sketch after this list).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
  • Classification decision must be done independently for each test sample.
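
As an illustration of such augmentation (not part of the provided baseline), the sketch below applies pitch shifting and time stretching to a development segment with librosa; the file name is hypothetical and the parameter values are arbitrary.

```python
# Hedged sketch: simple augmentation of a development segment using librosa.
import librosa

# Hypothetical file name following the dataset naming scheme.
y, sr = librosa.load('audio/airport-barcelona-0-0-a.wav', sr=48000, mono=True)

# Shift the pitch up by two semitones; the duration is unchanged.
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Stretch time by about 10%; the result must be cropped or padded back to 10 s
# before feature extraction, since the segment length changes.
y_stretch = librosa.effects.time_stretch(y, rate=1.1)
```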

Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments among the total number of segments. Each segment is considered an independent test sample. Accuracy will be calculated as the average of the class-wise accuracies.

Participants can use the sed_eval toolbox for the evaluation.
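
For reference, a minimal sketch of the metric itself (classification accuracy averaged over classes) is shown below; this is a plain Python illustration rather than the sed_eval API, and the example labels are hypothetical.

```python
# Hedged sketch: classification accuracy averaged over classes.
from collections import defaultdict

def class_wise_average_accuracy(reference, estimated):
    """reference, estimated: equal-length lists of scene labels, one per segment."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, est in zip(reference, estimated):
        total[ref] += 1
        correct[ref] += int(ref == est)
    per_class = {label: correct[label] / total[label] for label in total}
    return sum(per_class.values()) / len(per_class), per_class

# Hypothetical example: park 1/2 correct, bus 2/2 correct -> class-wise average 0.75.
avg, per_class = class_wise_average_accuracy(
    ['park', 'park', 'bus', 'bus'],
    ['park', 'tram', 'bus', 'bus'])
```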


Baseline system

The baseline system provides a simple entry-level approach that gives reasonable results in the subtasks of Task 1. The baseline system is built on the dcase_util toolbox.

Participants are strongly encouraged to build their own systems by extending the provided baseline system. The system has all the needed functionality for dataset handling, storing and accessing acoustic features, training and storing acoustic models, and evaluation. The modular structure of the system enables participants to modify it to their needs. The baseline system is a good starting point, especially for entry-level researchers, to familiarize themselves with the acoustic scene classification problem.

If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system will make their code more accessible to the community. DCASE organizers strongly encourage participants to share their code in any form after the challenge.

Repository


System description

The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.

The baseline system is built on the dcase_util toolbox. The machine learning part of the code is built on Keras (v2.1.5), using TensorFlow (v1.4.0) as backend.

Parameters

Acoustic features

  • Analysis frame: 40 ms (50% hop size)
  • Log mel-band energies (40 bands); see the extraction sketch after this list
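
A hedged sketch of this feature extraction is given below, using librosa for illustration; the actual baseline extracts features with the dcase_util toolbox, so details such as the window function, normalization, and the log floor may differ.

```python
# Hedged sketch: 40 log mel-band energies, 40 ms frames with 50% hop, for a 10 s clip.
import numpy as np
import librosa

def log_mel_energies(path, sr=48000, n_mels=40):
    y, _ = librosa.load(path, sr=sr, mono=True)
    n_fft = int(0.040 * sr)               # 40 ms analysis frame -> 1920 samples at 48 kHz
    hop_length = n_fft // 2               # 50% hop -> 960 samples
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    feat = np.log(mel + 1e-10)            # log mel-band energies, shape (40, ~500)
    return feat[:, :500]                  # trim to the 40 x 500 network input
```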

Neural network

  • Input shape: 40 * 500 (10 seconds)
  • Architecture (a Keras sketch of this network follows the parameter list below):

    • CNN layer #1
      • 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
    • CNN layer #2
      • 2D Convolutional layer (filters: 64, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
    • Flatten
    • Dense layer #1
      • Dense layer (units: 100, activation: ReLU)
      • Dropout (rate: 30%)
    • Output layer (activation: softmax)
  • Learning (epochs: 200, batch size: 16, data shuffling between epochs)

    • Optimizer: Adam (learning rate: 0.001)
  • Model selection:

    • Approximately 30% of the original training data is assigned to a validation set; the split is done such that the training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and the best performing model is selected
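
A minimal Keras sketch of the network described above follows. It is an approximation reconstructed from the listed parameters, not the baseline code itself; in particular, 'same' padding, the channels-last input shape, and the categorical cross-entropy loss are assumptions.

```python
# Hedged sketch of the baseline CNN, reconstructed from the listed parameters.
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Dropout, Flatten, Dense)
from keras.optimizers import Adam

model = Sequential([
    # CNN layer #1: 40 x 500 x 1 input -> 8 x 100 x 32 after pooling
    Conv2D(32, kernel_size=7, padding='same', input_shape=(40, 500, 1)),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D(pool_size=(5, 5)),
    Dropout(0.3),
    # CNN layer #2: -> 2 x 1 x 64 after pooling
    Conv2D(64, kernel_size=7, padding='same'),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D(pool_size=(4, 100)),
    Dropout(0.3),
    # Classifier
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.3),
    Dense(10, activation='softmax'),      # 10 acoustic scene classes
])

model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy',   # assumption: standard multi-class loss
              metrics=['accuracy'])

# Training as listed: 200 epochs, batch size 16, shuffling between epochs, e.g.
# model.fit(X_train, y_train, epochs=200, batch_size=16, shuffle=True,
#           validation_data=(X_val, y_val))
```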

Results for the development dataset

Results were calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU card). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

Subtask A

Scene label Accuracy
Airport 72.9 %
Bus 62.9 %
Metro 51.2 %
Metro station 55.4 %
Park 79.1 %
Public square 40.4 %
Shopping mall 49.6 %
Street, pedestrian 50.0 %
Street, traffic 80.5 %
Tram 55.1 %
Average 59.7 % (± 0.7)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

Material from device A (high-quality) is used for training, while testing is done with material from all three devices. This highlights the problem of mismatched recording devices. Results are calculated the same way as for subtask A, with mean and standard deviation of the performance from 10 independent trials shown in the results table.

Remember that ranking in this subtask will be based on devices B and C (the Average (B,C) column in this table).

Scene label Device B Device C Average (B,C) Device A
Airport 68.9 % 76.1 % 72.5 % 73.4 %
Bus 70.6 % 86.1 % 78.3 % 56.7 %
Metro 23.9 % 17.2 % 20.6 % 46.6 %
Metro station 33.9 % 31.7 % 32.8 % 52.9 %
Park 67.2 % 51.1 % 59.2 % 80.8 %
Public square 22.8 % 26.7 % 24.7 % 37.9 %
Shopping mall 58.3 % 63.9 % 61.1 % 46.4 %
Street, pedestrian 16.7 % 25.0 % 20.8 % 55.5 %
Street, traffic 69.4 % 63.3 % 66.4 % 82.5 %
Tram 18.9 % 20.6 % 19.7 % 56.5 %
Average 45.1 % (± 3.6) 46.2 % (± 4.2) 45.6 % (± 3.6) 58.9 % (± 0.8)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask C

The baseline result for subtask A applies.