# Acoustic scene classification

### Coordinators

Annamaria Mesaros, Toni Heittola, Tuomas Virtanen

The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded.

This task comprises three different subtasks that involve system development for three different situations:

### Acoustic Scene Classification Subtask A

Classification of data from the same device as the available training data.

### Acoustic Scene Classification with mismatched recording devices Subtask B

Classification of data recorded with devices different from those used for the training data.

### Open set Acoustic Scene Classification Subtask C

Classification on data that includes classes not encountered in the training data.

# Description

The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded — for example "park", "pedestrian street", "metro station" — or to indicate it is from a different, unknown environment.

# Audio dataset

The dataset for this task is the TAU Urban Acoustic Scenes 2019 dataset, consisting of recordings from various acoustic scenes. This dataset extends the TUT Urban Acoustic Scenes 2018 dataset with six additional cities, to a total of 12 large European cities. For each scene class, recordings were made in different locations; for each recording location there are 5-6 minutes of audio. The original recordings were split into segments with a length of 10 seconds that are provided in individual files. Available information about the recordings includes the following: acoustic scene class, city, and recording location.

Acoustic scenes (10):

• Airport - airport
• Indoor shopping mall - shopping_mall
• Metro station - metro_station
• Pedestrian street - street_pedestrian
• Public square - public_square
• Street with medium level of traffic - street_traffic
• Travelling by a tram - tram
• Travelling by a bus - bus
• Travelling by an underground metro - metro
• Urban park - park

Data was recorded in the following cities:

• Amsterdam
• Barcelona
• Helsinki
• Lisbon
• London
• Lyon
• Milan
• Prague
• Paris
• Stockholm
• Vienna

## Recording procedure

Recordings were made using four devices that captured audio simultaneously.

The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment. This equipment is further referred to as device A.

The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session. All simultaneous recordings are time-synchronized.

The dataset was collected by Tampere University of Technology between 05/2018 and 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

## Development and evaluation datasets

Different versions of the dataset are provided depending on the task.

TAU Urban Acoustic Scenes 2019 development dataset contains only material recorded with device A, containing 40 hours of audio, balanced between classes. The data comes from 10 of the 12 cities. TAU Urban Acoustic Scenes 2019 evaluation dataset contains data from all 12 cities.

TAU Urban Acoustic Scenes 2019 Mobile development dataset contains material recorded with devices A, B and C. It is composed of TAU Urban Acoustic Scenes 2019 data recorded with device A, and some amount of parallel audio recorded with devices B and C. Data from device A was resampled and averaged into a single channel, to align with the properties of the data recorded with devices B and C. The dataset contains in total 46 hours of audio (40h + 3h + 3h). TAU Urban Acoustic Scenes 2019 Mobile evaluation dataset also contains data from device D.

TAU Urban Acoustic Scenes 2019 Open set development dataset contains only material recorded with device A, being composed of TAU Urban Acoustic Scenes 2019 and additional audio examples for the open classification problem. The "open" data consists of the "beach" and "office" classes of TUT Acoustic Scenes 2017 dataset and other material recorded in 2019. The dataset contains in total 46 hours of audio (40h + 6h). TAU Urban Acoustic Scenes 2019 Open set evaluation dataset contains data from the 10 known classes, and other unknown ones.

## Reference labels

Reference labels are provided only for the development datasets. Reference labels for the evaluation dataset or leaderboard dataset will not be released. For publications based on the DCASE challenge data, please use the provided training/test setup of the development set, to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

The dataset was updated on 12 March 2019 to include the train/test setup (version 2). To update an already downloaded version 1 of the dataset, replace only the TAU-urban-acoustic-scenes-2019-openset-development.meta.zip file.

For each subtask, a development set is provided, together with a training/test partitioning for system development. Participants are required to report performance of their system using this train/test setup in order to allow comparison of systems on the development set.

## Subtask A: Acoustic Scene Classification

This subtask is concerned with the basic problem of acoustic scene classification, in which all data (development and evaluation) are recorded with the same device, in this case device A, and contains only data from the 10 known acoustic scene classes. The subtask uses TAU Urban Acoustic Scenes 2019 dataset.

#### Development dataset

The development dataset consists of recordings from ten cities; the training subset contains recordings from only nine of them, to test the generalization properties of the systems. The training/test subsets are created based on the recording location such that the training subset contains approximately 70% of the recording locations from each city. The test subset contains recordings from the rest of the locations, plus a few locations from the tenth city. Full data from the tenth city is provided, but is partly unused in this setup, to reflect the final evaluation setup.

The development set contains 40 hours of data, with 14400 segments (144 per city per acoustic scene class). In the training/test setup, segments from Milan appear only in the test subset. There are 9185 segments in the training set, 4185 in the test set, and an additional 1030 segments from Milan. For complete details on the dataset, check the readme file provided with the data.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav


Make sure that all files with the same location id are placed on the same side of the evaluation. In this subtask, device id is always a.
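The location-aware split described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used by the organizers: the helper names `parse_filename` and `split_by_location` are hypothetical, the 30% validation fraction is an arbitrary default, and the sketch assumes that a location is uniquely identified by the combination of scene label, city and location id taken from the file name.

```python
import random
from collections import defaultdict


def parse_filename(filename):
    """Parse '[scene label]-[city]-[location id]-[segment id]-[device id].wav'."""
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".wav"):
        stem = stem[:-4]
    scene, city, location, segment, device = stem.split("-")
    return {"scene": scene, "city": city, "location": location,
            "segment": segment, "device": device}


def split_by_location(filenames, validation_fraction=0.3, seed=0):
    """Split files so that no recording location appears on both sides."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_filename(name)
        # Assumption: (scene, city, location id) identifies one location.
        groups[(info["scene"], info["city"], info["location"])].append(name)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_validation = max(1, round(len(keys) * validation_fraction))
    validation_keys = keys[:n_validation]
    training_keys = keys[n_validation:]

    training = [f for key in training_keys for f in groups[key]]
    validation = [f for key in validation_keys for f in groups[key]]
    return training, validation
```

Because whole locations are assigned to one side, the resulting validation fraction is only approximate in terms of segments; a stricter variant could also balance the split per city and per scene class.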

#### Evaluation dataset

The evaluation dataset contains 20 hours of audio data from 12 cities (2 cities not encountered in the development set), and it is provided without ground truth. Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2019 Challenge.

## Subtask B: Acoustic Scene Classification with mismatched recording devices

This subtask is concerned with the situation in which an application will be tested with different devices, possibly not the same as the ones used to record the development data. In this case, evaluation data contains more devices than the development data. The subtask uses TAU Urban Acoustic Scenes 2019 Mobile dataset.

#### Development dataset

The development set consists of data recorded with 3 devices: A, B and C. This includes all data from the development set of subtask A (40 hours), partitioned in the same way. In addition, parallel recordings are provided from devices B and C, amounting to 3 hours for each. For devices B and C, half of the data is included in the training subset and half in the test subset. The development set contains in total 46 hours of data, with 16560 segments, of which 14400 are from device A, 1080 from device B, and 1080 from device C. There are 10265 segments in the training set (9185 for device A, 540 for device B, and 540 for device C), 5265 in the test set (4185 for device A, 540 for device B, and 540 for device C), and an additional 1030 segments from Milan. For complete details on the dataset, check the readme file provided with the data.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav


Make sure that all files with the same location id are placed on the same side of the evaluation. In this subtask, device id can be a, b or c.

#### Evaluation dataset

The evaluation dataset contains data from all 4 devices, including device D, which was not available in the development set. It contains 30 hours of audio and is provided without ground truth. Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2019 Challenge.

## Subtask C: Open set Acoustic Scene Classification

This subtask is concerned with acoustic scene classification where the test recording may be from a different environment than the 10 target classes, in which case it should be classified as "unknown", in a so-called open-set classification setup. The subtask uses TAU Urban Acoustic Scenes 2019 Openset dataset and some additional data providing examples of "unknown" acoustic scenes.

Participants should make good use of external data in order to model the case of scenes not encountered within the training data. The provided examples allow only limited generalization, and may overfit to their original dataset due to lack of sufficient variety.

#### Development dataset

The development dataset consists of data from the 10 target classes and additional "unknown" class examples. The dataset includes all data from the development set of Subtask A (40 hours), partitioned in the same way. In addition, recordings are provided for modeling and testing the open-set classification task. The unknown class consists of audio examples from TUT Acoustic Scenes 2017 dataset and new material recorded during the collection of TAU Urban Acoustic Scenes 2019 dataset. The development set contains 44 hours of data (40+4), with 15850 segments (14400 of ten scene classes + 1450 unknown class). Complete details on the dataset are provided in the readme file. In addition, correspondence of "unknown" class examples with their original acoustic scenes and file names is provided in meta_unknown.csv.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav


Make sure that all files with the same location id are placed on the same side of the evaluation. In this subtask, device id is always a.

#### Evaluation dataset

The evaluation dataset contains 20 hours of audio data, part of which is recorded in one of the 10 known classes, and part in other, unknown environments, different from the ones in the development set. The evaluation dataset is provided without ground truth. Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2019 Challenge.

# External data resources

Use of external data is allowed in all subtasks under the following conditions:

• The used external resource is clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets or trained models. The dataset/models must be public and freely available before 1st of April 2019.
• Participants submit at least one system without external training data so that we can study the contribution of such resources. The list of external data sources used in training must be clearly indicated in the technical report.
• Participants inform the organizers in advance about such data sources, so that all competitors know about them and have equal opportunity to use them; please send an email to the task coordinators, and we will update the list of external datasets on the webpage accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources allowed).
• It is not allowed to use TUT Acoustic Scenes 2016, TUT Acoustic Scenes 2017 and TUT Urban Acoustic Scenes 2018. These datasets are partially included in the current setup, and additional usage will lead to overfitting.

List of external datasets allowed:

| Dataset name | Type | Date | Link |
|---|---|---|---|
| LITIS Rouen audio scene dataset | audio | 04.03.2019 | https://sites.google.com/site/alainrakotomamonjy/home/audio-scene |
| DCASE2013 Challenge - Public Dataset for Scene Classification Task | audio | 04.03.2019 | https://archive.org/details/dcase2013_scene_classification |
| DCASE2013 Challenge - Private Dataset for Scene Classification Task | audio | 04.03.2019 | https://archive.org/details/dcase2013_scene_classification_testset |
| Dares G1 | audio | 04.03.2019 | http://www.daresounds.org/ |

Participants can suggest data for this list by sending an email to dcase.challenge@gmail.com until the evaluation dataset is published, after which the list is locked.

# Submission

Participants can choose which subtasks they participate in; there is no requirement to participate in all of them. The official challenge submission consists of a technical report and system output for the evaluation data.

System output should be presented as a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
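A minimal sketch of writing a system output file in the tab-separated format above, using only the Python standard library. The function name `write_system_output` and the example file names are hypothetical; the exact file naming expected in the submission package is defined by the challenge submission instructions, not by this sketch.

```python
import csv


def write_system_output(path, predictions):
    """Write one '[filename][tab][scene label]' row per evaluated audio file.

    `predictions` is an iterable of (filename, scene_label) pairs;
    the rows may be given in any order.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for filename, scene_label in predictions:
            writer.writerow([filename, scene_label])
```

For example, `write_system_output("output.csv", [("audio/0.wav", "airport"), ("audio/1.wav", "tram")])` produces two tab-separated lines (the file names here are placeholders).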


Multiple system outputs can be submitted (maximum 4 per participant per subtask). For each system, meta information should be provided in a separate file, containing the task specific information as given in the example here. All files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

When training the final system for submission, participants can of course use the entire development set. In the technical report, participants should include system results on the training/test setup provided with the development set.

Detailed information for the submission will be provided in due time.

During the challenge, a public leaderboard will be provided using a separate public evaluation dataset for each subtask. The leaderboards are organized through Kaggle InClass competitions. Leaderboards are meant to serve as a development tool for participants, and do not have an official role in the challenge.

Due to Kaggle / US Government policy, people who are residents of certain countries (Cuba, Iran, Syria, North Korea, and Sudan) are unable to participate in the Kaggle competitions (see Kaggle terms, section 7 What are the rules for competitions on Kaggle?). As DCASE is committed to open science open to everybody, in case these Kaggle restrictions are preventing you from using the Kaggle based leaderboard during the development, please contact task 1 organizers and we will provide similar service outside Kaggle.

The official DCASE challenge submission will not be done through these Kaggle InClass competitions.

## Datasets

For public leaderboard submissions, participants should use the official challenge development datasets to train their system as in DCASE challenge. Separate datasets, leaderboard datasets, are released to be used as evaluation datasets in the competitions. These leaderboard datasets consist of a small subset of the official evaluation dataset, with similar properties (distribution). The material amount in the leaderboard dataset is considerably lower than the official evaluation material in the DCASE challenge.

It is not allowed to use the leaderboard datasets to train the systems in any DCASE challenge subtasks or leaderboard competitions.

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

• Use of external data is allowed, except TUT Acoustic Scenes 2016, TUT Acoustic Scenes 2017, TUT Urban Acoustic Scenes 2018 and leaderboard datasets (DCASE2018 and DCASE2019).
• Manipulation of provided training and development data is allowed (e.g. by mixing data sampled from a pdf or using techniques such as pitch shifting or time stretching).
• Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
• The classification decision must be made independently for each test sample.

# Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments among the total number of segments. Each segment is considered an independent test sample. Accuracy will be calculated as the average of the class-wise accuracies.
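To make the metric concrete, the following sketch computes the class-wise accuracy and its average over classes from two label lists. It is an illustration of the definition above, not the official scoring code; the function names are hypothetical.

```python
from collections import defaultdict


def class_wise_accuracy(reference, estimated):
    """Return {scene label: fraction of its segments classified correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref_label, est_label in zip(reference, estimated):
        total[ref_label] += 1
        if ref_label == est_label:
            correct[ref_label] += 1
    return {label: correct[label] / total[label] for label in total}


def average_class_accuracy(reference, estimated):
    """Average of the class-wise accuracies (each class weighted equally)."""
    per_class = class_wise_accuracy(reference, estimated)
    return sum(per_class.values()) / len(per_class)
```

With a balanced test set this equals the plain fraction of correct segments; with unbalanced classes the two differ, because every class contributes equally to the average regardless of its size.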

Participants can use the sed_eval toolbox for the evaluation.

# Ranking

• Subtask A will use the overall accuracy on the evaluation data.
• Subtask B will use the overall accuracy on data from devices B and C.
• Subtask C will use the weighted average of the accuracy on the known classes and on the unknown class:
$$ACC_{weighted} = 0.5 \cdot ACC_{known~classes} + 0.5 \cdot ACC_{unknown~classes}$$
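As a quick numerical illustration of the Subtask C ranking formula, the sketch below evaluates it for the baseline figures reported further down this page (54.2 % class average on the known classes, 43.1 % on the unknown class). The function name and the `weight_known` parameter are hypothetical; the challenge fixes both weights at 0.5.

```python
def weighted_accuracy(acc_known, acc_unknown, weight_known=0.5):
    """Weighted average of known-class accuracy and unknown-class accuracy."""
    return weight_known * acc_known + (1.0 - weight_known) * acc_unknown


# Baseline figures from the Subtask C development results, in percent.
baseline_score = weighted_accuracy(54.2, 43.1)  # approximately 48.65
```

The result, about 48.65 %, matches the 48.7 % reported for the baseline after rounding.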

# Baseline system

The baseline system provides a simple entry-level state-of-the-art approach that gives reasonable results in the subtasks of Task 1. The baseline system is built on dcase_util toolbox.

The system has all the functionality needed for dataset handling, acoustic feature storing and accessing, acoustic model training and storing, and evaluation. The modular structure of the system enables participants to modify the system to their needs. The baseline system is a good starting point, especially for entry-level researchers, to become familiar with the acoustic scene classification problem.

## System description

The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.

The baseline system is built on dcase_util toolbox. The machine learning part of the code is built on Keras (v2.2.2), using TensorFlow (v1.9.0) as backend.

### Parameters

#### Acoustic features

• Analysis frame 40 ms (50% hop size)
• Log mel-band energies (40 bands)

#### Neural network

• Input shape: 40 * 500 (10 seconds)
• Architecture:
  • CNN layer #1
    • 2D convolutional layer (filters: 32, kernel size: 7) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (5, 5)) + dropout (rate: 30%)
  • CNN layer #2
    • 2D convolutional layer (filters: 64, kernel size: 7) + batch normalization + ReLU activation
    • 2D max pooling (pool size: (4, 100)) + dropout (rate: 30%)
  • Flatten
  • Dense layer #1
    • Dense layer (units: 100, activation: ReLU)
    • Dropout (rate: 30%)
  • Output layer (activation: softmax/sigmoid)
• Learning (epochs: 200, batch size: 16, data shuffling between epochs)
• Optimizer: Adam (learning rate: 0.001)
• Model selection:
  • Approximately 30% of the original training data is assigned to the validation set; the split is done such that the training and validation sets do not have segments from the same location and both sets have data from each city
  • Model performance after each epoch is evaluated on the validation set, and the best performing model is selected
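To see how the listed pooling sizes reduce the 40 * 500 input, the following sketch traces tensor shapes through the architecture. It assumes that the convolutions use "same" padding (so they keep height and width) and that max pooling is non-overlapping with stride equal to the pool size; neither detail is stated in the parameter list above, and the helper names are hypothetical.

```python
def conv_same(shape, filters):
    """'same'-padded 2D convolution: keeps height and width, sets channel count."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool_size):
    """Non-overlapping 2D max pooling: integer-divides height and width."""
    height, width, channels = shape
    return (height // pool_size[0], width // pool_size[1], channels)


def trace_baseline_shapes(n_mels=40, n_frames=500):
    """Return the (assumed) output shape after each block of the baseline CNN."""
    shapes = {}
    x = (n_mels, n_frames, 1)
    shapes["input"] = x
    x = max_pool(conv_same(x, 32), (5, 5))
    shapes["cnn_1"] = x
    x = max_pool(conv_same(x, 64), (4, 100))
    shapes["cnn_2"] = x
    shapes["flatten"] = (x[0] * x[1] * x[2],)
    shapes["dense_1"] = (100,)
    return shapes
```

Under these assumptions the first block yields 8 * 100 * 32, the second 2 * 1 * 64, and the flattened vector fed to the dense layer has 128 elements.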

For Task 1A and 1B systems, the activation function for the output layer is softmax and the decision is made based on the maximum output. For Task 1C, the activation function for the output layer is sigmoid and the decision is made based on a threshold value (0.5): if at least one of the class values is over the threshold, the most probable target scene class is chosen; if all values are under the threshold, the unknown scene class is chosen.
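The two decision rules can be sketched as follows, operating on a mapping from scene label to network output. This is an illustration of the rule described above rather than the baseline code; the function names are hypothetical, and a score exactly equal to the threshold is treated here as not exceeding it.

```python
def decide_closed_set(scores):
    """Subtask A/B rule: pick the class with the maximum (softmax) output."""
    return max(scores, key=scores.get)


def decide_open_set(scores, threshold=0.5, unknown_label="unknown"):
    """Subtask C rule: pick the most probable class if any (sigmoid) output
    exceeds the threshold, otherwise return the unknown label."""
    best_label = max(scores, key=scores.get)
    if scores[best_label] > threshold:
        return best_label
    return unknown_label
```

Checking only the maximum score is sufficient, because at least one output exceeds the threshold exactly when the largest one does.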

## Results for the development dataset

Results are calculated using TensorFlow in GPU mode (using Nvidia Titan XP GPU card). Because results produced with GPU card are generally non-deterministic, the system was trained and tested 10 times; mean and standard deviation of the performance from these 10 independent trials are shown in the results tables.

### Subtask A

| Scene label | Accuracy |
|---|---|
| Airport | 48.4 % |
| Bus | 62.3 % |
| Metro | 65.1 % |
| Metro station | 54.5 % |
| Park | 83.1 % |
| Public square | 40.7 % |
| Shopping mall | 59.4 % |
| Street, pedestrian | 60.9 % |
| Street, traffic | 86.7 % |
| Tram | 64.0 % |
| Average | 62.5 % (± 0.6) |

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

### Subtask B

Material from all three devices (A, B and C) is used for training and testing. Results are calculated the same way as for subtask A, with mean and standard deviation of the performance from 10 independent trials shown in the results table.

Remember that ranking in this subtask will be done by devices B and C (third column in this table).

| Scene label | Device B | Device C | Average (B,C) | Device A |
|---|---|---|---|---|
| Airport | 18.3 % | 24.1 % | 21.2 % | 51.2 % |
| Bus | 40.4 % | 70.0 % | 55.2 % | 68.0 % |
| Metro | 50.7 % | 36.1 % | 43.4 % | 62.4 % |
| Metro station | 28.7 % | 36.1 % | 30.0 % | 54.4 % |
| Park | 45.2 % | 57.0 % | 51.1 % | 80.4 % |
| Public square | 22.8 % | 11.3 % | 17.0 % | 35.4 % |
| Shopping mall | 63.5 % | 64.8 % | 64.2 % | 64.4 % |
| Street, pedestrian | 37.0 % | 37.6 % | 37.3 % | 63.3 % |
| Street, traffic | 77.0 % | 86.5 % | 81.8 % | 85.8 % |
| Tram | 12.0 % | 12.6 % | 12.3 % | 52.2 % |
| Average | 39.6 % (± 2.7) | 43.1 % (± 2.2) | 41.4 % (± 1.7) | 61.9 % (± 0.8) |

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

### Subtask C

| Scene label | Accuracy |
|---|---|
| Airport | 44.1 % |
| Bus | 59.2 % |
| Metro | 51.5 % |
| Metro station | 41.3 % |
| Park | 74.0 % |
| Public square | 34.7 % |
| Shopping mall | 50.9 % |
| Street, pedestrian | 47.5 % |
| Street, traffic | 78.4 % |
| Tram | 60.7 % |
| Class Average | 54.2 % |
| Unknown | 43.1 % |
| Accuracy (Class Average \| Unknown) | 48.7 % (± 3.2) |

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

# Citation

If you are participating in this task or using the dataset or baseline code, please cite the following paper:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.

#### A multi-device dataset for urban acoustic scene classification

##### Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

##### Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data