Acoustic scene classification


Task description

The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded.

This task comprises three different subtasks that involve system development for three different situations:

Subtask A: Acoustic Scene Classification

Classification of data from the same device as the available training data.

Subtask B: Acoustic Scene Classification with mismatched recording devices

Classification of data recorded with devices different from those used for the training data.

Subtask C: Acoustic Scene Classification with use of external data

Use of external data in training.

Description

The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded — for example "park", "pedestrian street", "metro station".

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. The dataset was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5-6 minutes of audio. The original recordings were split into 10-second segments that are provided as individual files. Available information about the recordings includes the following: acoustic scene class, city, and recording location.

There are two versions of the dataset, TUT Urban Acoustic Scenes 2018 and TUT Urban Acoustic Scenes 2018 Mobile, used for subtasks A and B, respectively. More details about them can be found below.

Acoustic scenes for the task (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

The dataset was collected by Tampere University of Technology between 01/2018 and 03/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Recording procedure

Recordings were made using four devices that captured audio simultaneously.

The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically made to look like headphones and are worn in the ears. As a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment. This equipment is further referred to as device A.

Three other commonly available consumer devices (e.g. smartphones, cameras) were used, handled in typical ways by the person doing the recording (e.g. hand-held). We further refer to these devices as B, C, and D. The audio recordings from these devices differ in quality from those of device A. All simultaneous recordings are time-synchronized.

The TUT Urban Acoustic Scenes 2018 development dataset contains only material recorded with device A, with 864 segments for each acoustic scene (144 minutes of audio). The dataset contains 8640 segments in total, i.e. 24 hours of audio.

The TUT Urban Acoustic Scenes 2018 Mobile development dataset contains material recorded with devices A, B and C. For each acoustic scene there are 864 segments recorded with device A, and parallel audio consisting of 72 segments recorded with each of devices B and C. Data from device A was resampled and averaged into a single channel to align with the properties of the data recorded with devices B and C. The dataset contains 28 hours of audio in total.
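The dataset sizes quoted above follow directly from the 10-second segment length and the 10 scene classes. The short check below is only an arithmetic sketch; the names (SEGMENT_SECONDS, hours, and so on) are introduced here for illustration and are not part of the dataset or its tooling.

```python
SEGMENT_SECONDS = 10   # each provided file is a 10-second segment
NUM_SCENES = 10        # number of acoustic scene classes in the task


def hours(num_segments, segment_seconds=SEGMENT_SECONDS):
    """Total duration in hours of a number of fixed-length segments."""
    return num_segments * segment_seconds / 3600


# Device A: 864 segments per scene.
per_scene_minutes = 864 * SEGMENT_SECONDS / 60   # 144.0 minutes per scene
device_a_segments = 864 * NUM_SCENES             # 8640 segments

# Devices B and C: 72 parallel segments per scene, for each device.
device_b_segments = 72 * NUM_SCENES              # 720 segments
device_c_segments = 72 * NUM_SCENES              # 720 segments

development_hours = hours(device_a_segments)
mobile_hours = hours(device_a_segments + device_b_segments + device_c_segments)

print(per_scene_minutes, development_hours, mobile_hours)
```

Run as written, this gives 144 minutes per scene, 24 hours for the device A material, and 28 hours for the Mobile development dataset, matching the figures in the text.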

Reference labels

Reference labels are provided only for the development datasets. Currently, there is no plan to release the reference labels for the evaluation or public leaderboard datasets. If you are preparing a publication based on the DCASE challenge setup and want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

Download

Subtask A




Subtask B




Subtask C

This subtask is using the same dataset as subtask A.




Task setup

For each subtask, a development set is provided, together with a training/test partitioning for system development. Participants are required to report performance of their system using this train/test setup in order to allow comparison of systems on the development set.

Subtask A

Acoustic Scene Classification

This subtask is concerned with the basic problem of acoustic scene classification, in which all available data (development and evaluation) are recorded with the same device, in this case device A.

Development dataset


The development data consists of recordings from all six cities, and is partitioned so that the training subset contains for each city and each class recordings from approximately 70% of recording locations, and the test subset contains recordings from the rest of the locations. Of the total 8640 segments, 6122 segments were included in the training subset and 2518 segments in the test subset. For complete details on the dataset, check the readme file provided with the data.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files with the same location id are placed on the same side of the training/validation split. In this subtask, device id is always a.
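A location-aware split can be sketched as follows. This is a minimal illustration and not the procedure used to create the official partitioning: the helper names (parse_filename, location_split) and the example file names are hypothetical, and the sketch assumes that a location is identified by the combination of scene label, city and location id. It also does not balance the split per city and class, which the official setup does.

```python
import random
from collections import defaultdict


def parse_filename(filename):
    """Split '[scene label]-[city]-[location id]-[segment id]-[device id].wav' into fields."""
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".wav"):
        stem = stem[:-4]
    parts = stem.split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {filename!r}")
    scene, city, location_id, segment_id, device_id = parts
    return {"scene": scene, "city": city, "location": location_id,
            "segment": segment_id, "device": device_id}


def location_split(filenames, validation_fraction=0.3, seed=0):
    """Split files so that every file from one location lands on the same side."""
    groups = defaultdict(list)
    for name in filenames:
        fields = parse_filename(name)
        # Assumption: (scene, city, location id) identifies a recording location.
        groups[(fields["scene"], fields["city"], fields["location"])].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_validation = max(1, round(len(keys) * validation_fraction))
    validation_keys = set(keys[:n_validation])
    train = [n for k in keys if k not in validation_keys for n in groups[k]]
    validation = [n for k in keys if k in validation_keys for n in groups[k]]
    return train, validation


# Hypothetical file names that follow the naming pattern above.
files = [
    "airport-barcelona-0-0-a.wav",
    "airport-barcelona-0-1-a.wav",
    "airport-barcelona-1-2-a.wav",
    "park-helsinki-3-7-a.wav",
]
train_files, validation_files = location_split(files)
```

Splitting on whole locations rather than on individual segments keeps segments cut from the same original recording out of both sides, which would otherwise inflate validation accuracy.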

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The evaluation set contains 10 hours of data.

Subtask B

Acoustic Scene Classification with mismatched recording devices

This subtask is concerned with the situation in which an application will be tested with a few different types of devices, possibly not the same as the ones used to record the development data.

Development dataset


The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:

  • Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
  • Device B: 2 hours (72 segments per acoustic scene)
  • Device C: 2 hours (72 segments per acoustic scene)

The 2 hours of data recorded with each of devices B and C are parallel data, i.e. the same segments are also available as recorded with device A. The training/test setup was created such that approximately 70% of recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C. Please report development set results using the provided data partitioning.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files with the same location id are placed on the same side of the training/validation split. In this subtask, device id can be a, b, or c.

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The evaluation data consists of audio recorded with all four devices, of which device D was not encountered in development. Systems will be ranked using only devices B and C, but accuracy will also be calculated for devices A and D.
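During development, the same device-wise view can be approximated on the development test subset, where the device id is part of the file name. The sketch below is an illustration only: the helper names and the example data are hypothetical, and pooling devices B and C into a single score is an assumption, since the text does not say how the two devices are combined for ranking.

```python
from collections import defaultdict


def device_id(filename):
    """Return the device id from '[scene]-[city]-[location]-[segment]-[device].wav'."""
    stem = filename[:-4] if filename.endswith(".wav") else filename
    return stem.rsplit("-", 1)[-1]


def class_wise_accuracy(pairs):
    """Average of per-class accuracies over (reference, estimated) label pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for reference_label, estimated_label in pairs:
        total[reference_label] += 1
        correct[reference_label] += int(reference_label == estimated_label)
    if not total:
        raise ValueError("no segments to score")
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy_for_devices(reference, estimated, devices):
    """Score only the files whose device id is in `devices`."""
    pairs = [(label, estimated[name]) for name, label in reference.items()
             if device_id(name) in devices]
    return class_wise_accuracy(pairs)


# Hypothetical reference labels and system output, keyed by file name.
reference = {
    "park-helsinki-1-0-a.wav": "park",
    "park-helsinki-1-0-b.wav": "park",
    "tram-helsinki-2-5-b.wav": "tram",
    "tram-helsinki-2-5-c.wav": "tram",
}
estimated = {
    "park-helsinki-1-0-a.wav": "park",
    "park-helsinki-1-0-b.wav": "park",
    "tram-helsinki-2-5-b.wav": "bus",
    "tram-helsinki-2-5-c.wav": "tram",
}

ranking_score = accuracy_for_devices(reference, estimated, {"b", "c"})  # assumed pooling of B and C
device_a_score = accuracy_for_devices(reference, estimated, {"a"})
```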

Subtask C

Acoustic Scene Classification with use of external data

This subtask is meant to test if use of external data in system development brings a significant improvement to the performance. The task is identical to subtask A, with the only difference that use of external data and transfer learning is allowed under the following conditions:

  • The used dataset must be public and freely available before 29th of March 2018
  • Participants must inform/suggest such data to be listed on the task webpage, so that all competitors know about them and have equal opportunity to use them; please let us know by sending email to dcase.challenge@gmail.com and we will update the list of external datasets accordingly
  • Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
  • Participants should list in their technical report the external data sources they used

Participants in this subtask are encouraged to also submit to subtask A, to provide a comparison point for their system without external data.

External data sources

List of external datasets allowed:

Dataset name | Type | Added | Link
TUT Acoustic scenes 2017, development dataset | audio | 29.3.2018 | https://zenodo.org/record/400515
TUT Acoustic scenes 2017, evaluation dataset | audio | 29.3.2018 | https://zenodo.org/record/1040168
TUT Acoustic scenes 2016, development dataset | audio | 29.3.2018 | https://zenodo.org/record/45739
TUT Acoustic scenes 2016, evaluation dataset | audio | 29.3.2018 | https://zenodo.org/record/165995
LITIS Rouen audio scene dataset | audio | 29.3.2018 | https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task | audio | 29.3.2018 | https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task | audio | 29.3.2018 | https://archive.org/details/dcase2013_scene_classification_testset
Dares G1 | audio | 29.3.2018 | http://www.daresounds.org/
AudioSet | audio | 25.4.2018 | https://research.google.com/audioset/


Participants can suggest data for this list by sending email to dcase.challenge@gmail.com until the evaluation dataset is published, after which the list is locked.

Submission

Official challenge submission consists of a technical report and system output for the evaluation data.

System output should be presented as a single text file (in CSV format, tab-separated) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
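A system output file in this format can be produced with Python's standard csv module. The sketch below is illustrative: the function names and the example file names are hypothetical, and checking labels against the ten scene labels listed for the task is an added safeguard, not a stated submission requirement.

```python
import csv
import io

# The ten scene labels listed for this task.
SCENE_LABELS = {
    "airport", "shopping_mall", "metro_station", "street_pedestrian",
    "public_square", "street_traffic", "tram", "bus", "metro", "park",
}


def format_system_output(results):
    """Return '[filename][tab][scene label]' lines for (filename, label) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for filename, scene_label in results:
        if scene_label not in SCENE_LABELS:
            raise ValueError(f"unknown scene label: {scene_label!r}")
        writer.writerow([filename, scene_label])
    return buffer.getvalue()


def write_system_output(path, results):
    """Write the formatted system output to a text file."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(format_system_output(results))


# Hypothetical evaluation file names and decisions.
text = format_system_output([("audio/0.wav", "bus"), ("audio/1.wav", "park")])
```

Rejecting unknown labels before writing catches spelling differences such as "shopping mall" versus "shopping_mall" early, before a file is submitted.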

Multiple system outputs can be submitted (maximum 4 per participant). For each system, meta information should be provided in a separate file, containing the task specific information as given in the example here. All files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

Detailed information for the submission can be found on the Submission page.

Public leaderboards

During the challenge, a public leaderboard will be provided using a separate public evaluation set for each subtask. The leaderboards are organized through Kaggle InClass competitions, and they are meant to serve as a development tool for participants during system development.

Subtask A Leaderboard

Subtask B Leaderboard

Subtask C Leaderboard

The official DCASE challenge submission will not be done through these Kaggle InClass competitions.

Datasets

For public leaderboard submissions, participants should use the challenge development datasets to train their system, as in the DCASE challenge. Separate leaderboard datasets are released to be used as evaluation datasets in the competitions. These leaderboard datasets consist of material similar to the official evaluation dataset of the DCASE challenge, but contain considerably less of it. The leaderboard datasets must not be used to train systems in any DCASE challenge subtask or leaderboard competition.



Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Use of external data for system development is allowed only in subtask C. Data from another task or subtask is considered external data.
  • Manipulation of provided training and development data is allowed in all subtasks. The development dataset can be augmented without use of external data (e.g. by mixing data sampled from a probability density function (pdf) or using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
  • The classification decision must be made independently for each test sample.

Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments divided by the total number of segments. Each segment is considered an independent test sample. Accuracy will be calculated as the average of the class-wise accuracies.
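The difference between overall accuracy and the average of class-wise accuracies shows up when classes are unevenly represented. The following is a minimal sketch with hypothetical labels; it is not the sed_eval implementation, and the function name is introduced here for illustration.

```python
from collections import Counter


def class_wise_accuracy(reference, estimated):
    """Return (average of per-class accuracies, per-class accuracies)."""
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same length")
    total = Counter(reference)
    correct = Counter(ref for ref, est in zip(reference, estimated) if ref == est)
    per_class = {label: correct[label] / total[label] for label in total}
    average = sum(per_class.values()) / len(per_class)
    return average, per_class


# Hypothetical example: four "park" segments and one "tram" segment,
# with a system that answers "park" for everything.
reference = ["park"] * 4 + ["tram"]
estimated = ["park"] * 5

average, per_class = class_wise_accuracy(reference, estimated)
overall = sum(r == e for r, e in zip(reference, estimated)) / len(reference)

print(per_class)   # {'park': 1.0, 'tram': 0.0}
print(average)     # 0.5
print(overall)     # 0.8
```

Here the overall accuracy is 0.8, while the class-wise average is 0.5, because every class contributes equally regardless of how many segments it has.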

Participants can use the sed_eval toolbox for the evaluation.


Results

Subtask A

Submission code | Submission name | Author | Affiliation | Technical report | Accuracy (%) with 95% confidence interval
Baseline_Surrey_task1a_1 | SurreyCNN8 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-a#Kong2018 | 70.4 (68.9 - 71.9)
Baseline_Surrey_task1a_2 | SurreyCNN4 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-a#Kong2018 | 69.7 (68.2 - 71.2)
Dang_NCU_task1a_1 | AnD_NCU | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 73.3 (71.9 - 74.8)
Dang_NCU_task1a_2 | AnD_NCU | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 74.5 (73.1 - 76.0)
Dang_NCU_task1a_3 | AnD_NCU | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 74.1 (72.7 - 75.5)
Dorfer_CPJKU_task1a_1 | DNN | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 79.7 (78.4 - 81.0)
Dorfer_CPJKU_task1a_2 | i-vectors | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 67.8 (66.3 - 69.3)
Dorfer_CPJKU_task1a_3 | calib-avg | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 80.5 (79.2 - 81.8)
Dorfer_CPJKU_task1a_4 | calib-sep | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 77.2 (75.8 - 78.5)
Fraile_UPM_task1a_1 | UPMg | Ruben Fraile | Research Center on Software Technologies and Multimedia Systems for Sustainability (CITSEM), Universidad Politecnica de Madrid, Madrid, Spain | task-acoustic-scene-classification-results-a#Fraile2018 | 62.7 (61.1 - 64.3)
Gil-jin_KNU_task1a_1 | ECDCNN | Jang Gin-jin | School of Electronics Engineering, Kyungpook National University, Daegu, Korea | task-acoustic-scene-classification-results-a#Sangwon2018 | 74.4 (73.0 - 75.8)
Golubkov_SPCH_task1a_1 | spch_fusion | Alexander Golubkov | Saint Petersburg, Russia | task-acoustic-scene-classification-results-a#Golubkov2018 | 60.2 (58.7 - 61.8)
DCASE2018 baseline | Baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-a#Heittola2018 | 61.0 (59.4 - 62.6)
Jung_UOS_task1a_1 | 4cl_nw | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 74.8 (73.4 - 76.2)
Jung_UOS_task1a_2 | 4cl_w | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 74.2 (72.8 - 75.7)
Jung_UOS_task1a_3 | GM_w | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 73.8 (72.4 - 75.2)
Jung_UOS_task1a_4 | SVM_w | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 73.8 (72.4 - 75.3)
Khadkevich_FB_task1a_1 | 1aavpool | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-a#Khadkevich2018 | 67.8 (66.3 - 69.3)
Khadkevich_FB_task1a_2 | 1amaxpool | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-a#Khadkevich2018 | 67.2 (65.7 - 68.8)
Li_BIT_task1a_1 | BIT_task1a_1 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 73.0 (71.5 - 74.4)
Li_BIT_task1a_2 | BIT_task1a_2 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.3 (73.9 - 76.7)
Li_BIT_task1a_3 | BIT_task1a_3 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.3 (73.9 - 76.7)
Li_BIT_task1a_4 | BIT_task1a_4 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.0 (73.6 - 76.4)
Li_SCUT_task1a_1 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 43.4 (41.8 - 45.0)
Li_SCUT_task1a_2 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 50.2 (48.6 - 51.9)
Li_SCUT_task1a_3 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 44.5 (42.9 - 46.2)
Li_SCUT_task1a_4 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 46.7 (45.1 - 48.3)
Liping_CQU_task1a_1 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 70.4 (69.0 - 71.9)
Liping_CQU_task1a_2 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 74.0 (72.6 - 75.4)
Liping_CQU_task1a_3 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 74.7 (73.3 - 76.1)
Liping_CQU_task1a_4 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 75.4 (74.0 - 76.8)
Maka_ZUT_task1a_1 | asa_dev | Tomasz Maka | Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland | task-acoustic-scene-classification-results-a#Maka2018 | 65.8 (64.3 - 67.4)
Mariotti_lip6_task1a_1 | MP_all | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 75.0 (73.6 - 76.4)
Mariotti_lip6_task1a_2 | MP_no50 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 72.8 (71.3 - 74.2)
Mariotti_lip6_task1a_3 | NN_all | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 72.8 (71.3 - 74.2)
Mariotti_lip6_task1a_4 | NN_no50 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 74.9 (73.4 - 76.3)
Nguyen_TUGraz_task1a_1 | NNF_CNNEns | Truc Nguyen | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe | task-acoustic-scene-classification-results-a#Nguyen2018 | 69.8 (68.3 - 71.3)
Ren_UAU_task1a_1 | ABCNN | Zhao Ren | ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany | task-acoustic-scene-classification-results-a#Ren2018 | 69.0 (67.5 - 70.5)
Roletscheck_UNIA_task1a_1 | DeepSAGA | Christian Roletscheck | Human Centered Multimedia, Augsburg University, Augsburg, Germany | task-acoustic-scene-classification-results-a#Roletscheck2018 | 69.2 (67.7 - 70.7)
Roletscheck_UNIA_task1a_2 | DeepSAGA | Christian Roletscheck | Human Centered Multimedia, Augsburg University, Augsburg, Germany | task-acoustic-scene-classification-results-a#Roletscheck2018 | 67.3 (65.7 - 68.8)
Sakashita_TUT_task1a_1 | Sakashita_1 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 81.0 (79.7 - 82.3)
Sakashita_TUT_task1a_2 | Sakashita_2 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 81.0 (79.7 - 82.3)
Sakashita_TUT_task1a_3 | Sakashita_3 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 80.7 (79.4 - 82.0)
Sakashita_TUT_task1a_4 | Sakashita_4 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 79.3 (78.0 - 80.6)
Tilak_IIITB_task1a_1 | CNN_raw | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 59.5 (57.9 - 61.1)
Tilak_IIITB_task1a_2 | DCNN_raw | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 58.3 (56.7 - 59.9)
Tilak_IIITB_task1a_3 | DCNN_raw | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 55.0 (53.4 - 56.6)
Waldekar_IITKGP_task1a_1 | IITKGP_ABSP_Fusion18 | Shefali Waldekar | Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India | task-acoustic-scene-classification-results-a#Waldekar2018 | 69.7 (68.2 - 71.2)
WangJun_BUPT_task1a_1 | Attention | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 70.9 (69.4 - 72.4)
WangJun_BUPT_task1a_2 | Attention | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 70.5 (69.0 - 72.0)
WangJun_BUPT_task1a_3 | Attention | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 73.2 (71.7 - 74.6)
Yang_GIST_task1a_1 | SEResNet | Jeong Hyeon Yang | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea | task-acoustic-scene-classification-results-a#Yang2018 | 71.7 (70.2 - 73.2)
Yang_GIST_task1a_2 | GAN_CNN | Jeong Hyeon Yang | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea | task-acoustic-scene-classification-results-a#Yang2018 | 70.0 (68.5 - 71.5)
Zeinali_BUT_task1a_1 | BUT_1 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 78.4 (77.0 - 79.7)
Zeinali_BUT_task1a_2 | BUT_2 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 78.1 (76.8 - 79.5)
Zeinali_BUT_task1a_3 | BUT_3 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 74.5 (73.1 - 76.0)
Zeinali_BUT_task1a_4 | BUT_4 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 75.1 (73.7 - 76.6)
Zhang_HIT_task1a_1 | CNN_MLTP | Liwen Zhang | Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China | task-acoustic-scene-classification-results-a#Zhang2018 | 73.4 (72.0 - 74.9)
Zhang_HIT_task1a_2 | CNN_MLTP | Liwen Zhang | Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China | task-acoustic-scene-classification-results-a#Zhang2018 | 70.9 (69.4 - 72.3)
Zhao_DLU_task1a_1 | BiLstm-CNN | Lasheng Zhao | Key Laboratory of Advanced Design and Intelligent Computing (Dalian University), Ministry of Education, Dalian University, Liaoning, China | task-acoustic-scene-classification-results-a#Hao2018 | 69.8 (68.3 - 71.3)


Complete results and technical reports can be found on the subtask A results page.

Subtask B

Submission code | Submission name | Author | Affiliation | Technical report | Accuracy (%) with 95% confidence interval
Baseline_Surrey_task1b_1 | SurreyCNN8 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-b#Kong2018 | 59.6 (58.5 - 60.7)
Baseline_Surrey_task1b_2 | SurreyCNN4 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-b#Kong2018 | 58.8 (57.7 - 59.9)
DCASE2018 baseline | Baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-b#Heittola2018 | 46.5 (45.4 - 47.6)
Li_SCUT_task1b_1 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 41.1 (40.0 - 42.2)
Li_SCUT_task1b_2 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 39.5 (38.4 - 40.6)
Li_SCUT_task1b_3 | Li_SCUT | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 42.3 (41.2 - 43.4)
Liping_CQU_task1b_1 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.0 (66.0 - 68.1)
Liping_CQU_task1b_2 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 63.2 (62.1 - 64.3)
Liping_CQU_task1b_3 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.7 (66.6 - 68.7)
Liping_CQU_task1b_4 | Xception | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.1 (66.1 - 68.2)
Nguyen_TUGraz_task1b_1 | NNF_CNNEns | Truc Nguyen | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe | task-acoustic-scene-classification-results-b#Nguyen2018 | 69.0 (68.0 - 70.0)
Ren_UAU_task1b_1 | ABCNN | Zhao Ren | ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany | task-acoustic-scene-classification-results-b#Ren2018 | 60.5 (59.4 - 61.5)
Tchorz_THL_task1b_1 | AMS_MFCC | Juergen Tchorz | Institute for Acoustics, University of Applied Sciences Luebeck, Luebeck, Germany | task-acoustic-scene-classification-results-b#Tchorz2018 | 54.0 (52.9 - 55.1)
Waldekar_IITKGP_task1b_1 | IITKGP_ABSP_Fusion18 | Shefali Waldekar | Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India | task-acoustic-scene-classification-results-b#Waldekar2018 | 56.2 (55.1 - 57.3)
WangJun_BUPT_task1b_1 | Attention | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 48.8 (47.7 - 49.9)
WangJun_BUPT_task1b_2 | Attention | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 52.5 (51.4 - 53.6)
WangJun_BUPT_task1b_3 | Attention | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 52.3 (51.2 - 53.4)


Complete results and technical reports can be found on the subtask B results page.

Subtask C

Submission code | Submission name | Author | Affiliation | Technical report | Accuracy (%) with 95% confidence interval
DCASE2018 baseline | Baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-c#Heittola2018 | 61.0 (59.4 - 62.6)
Khadkevich_FB_task1c_1 | 1cavpool | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-c#Khadkevich2018 | 71.7 (70.2 - 73.2)
Khadkevich_FB_task1c_2 | 1cmaxpool | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-c#Khadkevich2018 | 69.0 (67.5 - 70.5)


Complete results and technical reports can be found on the subtask C results page.

Submissions

Subtask | Teams | Entries | Authors | Affiliations
Subtask A | 24 | 59 | 71 | 25
Subtask B | 8 | 16 | 21 | 9
Subtask C | 1 | 2 | 1 | 1
Overall | 25 | 77 | 72 | 26

Baseline system

The baseline system provides a simple, entry-level, state-of-the-art approach that gives reasonable results in the subtasks of Task 1. It is built on the dcase_util toolbox.

Participants are strongly encouraged to build their own systems by extending the provided baseline system. The system has all the functionality needed for dataset handling, storing and accessing acoustic features, training and storing acoustic models, and evaluation. The modular structure of the system enables participants to modify it to their needs. The baseline system is a good starting point, especially for entry-level researchers who want to familiarize themselves with the acoustic scene classification problem.

If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system will make their code more accessible to the community. DCASE organizers strongly encourage participants to share their code in any form after the challenge.

Repository


System description

The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.

The baseline system is built on the dcase_util toolbox. The machine learning part of the code is built on Keras (v2.1.5), using TensorFlow (v1.4.0) as the backend.

Parameters

Acoustic features

  • Analysis frame 40 ms (50% hop size)
  • Log mel-band energies (40 bands)
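
As a quick check of how these framing parameters lead to a feature matrix of 40 bands by roughly 500 frames per 10-second segment, the arithmetic can be sketched as follows. The variable names are illustrative, and the exact frame count can differ by a frame or two depending on how the signal edges are padded, which the description does not specify.

```python
# Framing parameters from the list above, expressed in integer milliseconds
# to avoid floating-point rounding. Names are illustrative only.
FRAME_MS = 40          # analysis frame length
HOP_PERCENT = 50       # hop size as a percentage of the frame length
SEGMENT_MS = 10_000    # each audio segment is 10 seconds long
N_MEL_BANDS = 40       # number of log mel-band energies per frame

hop_ms = FRAME_MS * HOP_PERCENT // 100   # 20 ms between frame starts
n_frames = SEGMENT_MS // hop_ms          # 500 frames (edge padding ignored)

feature_shape = (N_MEL_BANDS, n_frames)
print(f"hop: {hop_ms} ms, feature matrix: {feature_shape}")
```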

Neural network

  • Input shape: 40 * 500 (10 seconds)
  • Architecture:

    • CNN layer #1
      • 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
    • CNN layer #2
      • 2D Convolutional layer (filters: 64, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
    • Flatten
    • Dense layer #1
      • Dense layer (units: 100, activation: ReLU)
      • Dropout (rate: 30%)
    • Output layer (activation: softmax)
  • Learning (epochs: 200, batch size: 16, data shuffling between epochs)

    • Optimizer: Adam (learning rate: 0.001)
  • Model selection:

    • Approximately 30% of the original training data is assigned to the validation set; the split is done such that the training and validation sets do not share segments from the same location and both sets contain data from each city
    • Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected
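
To see how the pooling sizes reduce the 40 * 500 input to a small vector before the dense layer, the feature-map shapes can be traced layer by layer. This is only a sketch under two assumptions not stated in the list above: the convolutions use "same" padding (so they keep height and width), and pooling uses non-overlapping windows with floor division. The helper names are hypothetical, and the 10 output units correspond to the 10 scene classes of the task.

```python
# Shape trace through the baseline architecture (height, width, channels).
# ASSUMPTIONS: "same"-padded convolutions, non-overlapping max pooling.

def conv_same(shape, filters):
    """A 'same'-padded 2D convolution keeps height/width, sets channels."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, pool):
    """Non-overlapping max pooling divides height/width by the pool size."""
    height, width, channels = shape
    return (height // pool[0], width // pool[1], channels)

shape = (40, 500, 1)                  # 40 mel bands x 500 frames, 1 channel
trace = [("input", shape)]

shape = conv_same(shape, 32)          # CNN layer #1 convolution
shape = max_pool(shape, (5, 5))       # -> (8, 100, 32)
trace.append(("cnn1", shape))

shape = conv_same(shape, 64)          # CNN layer #2 convolution
shape = max_pool(shape, (4, 100))     # -> (2, 1, 64)
trace.append(("cnn2", shape))

flat_units = shape[0] * shape[1] * shape[2]   # Flatten -> 128 values
dense_units = 100                             # Dense layer #1
output_units = 10                             # one unit per scene class

for name, s in trace:
    print(name, s)
print("flatten", flat_units, "-> dense", dense_units, "-> output", output_units)
```

Under these assumptions the second pooling layer collapses the whole time axis (100 -> 1), so the dense layer sees a fixed-length 128-value summary of each 10-second segment.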

Results for the development dataset

Results are calculated using TensorFlow in GPU mode (using an Nvidia Titan XP GPU card). Because results produced with a GPU card are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.
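
The "mean (± standard deviation)" summary over repeated trials can be sketched as below. The per-trial accuracies are hypothetical placeholders, not values from the baseline runs, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# HYPOTHETICAL per-trial accuracies in percent, for illustration only.
# They are NOT the accuracies obtained by the baseline system.
trial_accuracies = [58.5, 59.0, 59.5, 60.0, 60.5, 61.0, 59.0, 60.0, 60.5, 59.5]

def summarize(values):
    """Format the mean and sample standard deviation like the results tables."""
    return f"{mean(values):.1f} % (± {stdev(values):.1f})"

summary = summarize(trial_accuracies)
print(summary)
```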

Subtask A

Scene label Accuracy
Airport 72.9 %
Bus 62.9 %
Metro 51.2 %
Metro station 55.4 %
Park 79.1 %
Public square 40.4 %
Shopping mall 49.6 %
Street, pedestrian 50.0 %
Street, traffic 80.5 %
Tram 55.1 %
Average 59.7 % (± 0.7)
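
The "Average" row is consistent with an unweighted mean of the ten class-wise accuracies, as the following check on the table values illustrates. The dictionary simply copies the numbers above; the variable names are illustrative.

```python
# Class-wise accuracies (%) copied from the subtask A table above.
class_accuracy = {
    "Airport": 72.9,
    "Bus": 62.9,
    "Metro": 51.2,
    "Metro station": 55.4,
    "Park": 79.1,
    "Public square": 40.4,
    "Shopping mall": 49.6,
    "Street, pedestrian": 50.0,
    "Street, traffic": 80.5,
    "Tram": 55.1,
}

# Unweighted (macro) average: every scene class counts equally.
macro_average = sum(class_accuracy.values()) / len(class_accuracy)
print(f"Average {macro_average:.1f} %")
```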

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

Material from device A (high-quality) is used for training, while testing is done with material from all three devices. This highlights the problem of mismatched recording devices. Results are calculated the same way as for subtask A, with mean and standard deviation of the performance from 10 independent trials shown in the results table.

Remember that ranking in this subtask will be based on devices B and C (the "Average (B,C)" column in this table).

Scene label Device B Device C Average (B,C) Device A
Airport 68.9 % 76.1 % 72.5 % 73.4 %
Bus 70.6 % 86.1 % 78.3 % 56.7 %
Metro 23.9 % 17.2 % 20.6 % 46.6 %
Metro station 33.9 % 31.7 % 32.8 % 52.9 %
Park 67.2 % 51.1 % 59.2 % 80.8 %
Public square 22.8 % 26.7 % 24.7 % 37.9 %
Shopping mall 58.3 % 63.9 % 61.1 % 46.4 %
Street, pedestrian 16.7 % 25.0 % 20.8 % 55.5 %
Street, traffic 69.4 % 63.3 % 66.4 % 82.5 %
Tram 18.9 % 20.6 % 19.7 % 56.5 %
Average 45.1 % (± 3.6) 46.2 % (± 4.2) 45.6 % (± 3.6) 58.9 % (± 0.8)
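
The "Average (B,C)" column can be checked as the mean of the device B and device C accuracies for each row. Because the table shows values rounded to one decimal, recomputing from the rounded inputs can differ from the printed figure by up to 0.05 percentage points, so the sketch below compares with a small tolerance. The data are copied from the table; the names are illustrative.

```python
# (device B, device C, reported average) in percent, copied from the table.
rows = {
    "Airport": (68.9, 76.1, 72.5),
    "Bus": (70.6, 86.1, 78.3),
    "Metro": (23.9, 17.2, 20.6),
    "Metro station": (33.9, 31.7, 32.8),
    "Park": (67.2, 51.1, 59.2),
    "Public square": (22.8, 26.7, 24.7),
    "Shopping mall": (58.3, 63.9, 61.1),
    "Street, pedestrian": (16.7, 25.0, 20.8),
    "Street, traffic": (69.4, 63.3, 66.4),
    "Tram": (18.9, 20.6, 19.7),
}

TOLERANCE = 0.1  # allows for rounding of the printed table values

max_difference = 0.0
for label, (dev_b, dev_c, reported) in rows.items():
    recomputed = (dev_b + dev_c) / 2
    max_difference = max(max_difference, abs(recomputed - reported))
    print(f"{label}: recomputed {recomputed:.2f} %, reported {reported:.1f} %")

all_consistent = max_difference < TOLERANCE
```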

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask C

The results for subtask A apply.

Citation

If you are participating in this task or using the dataset or baseline code, please cite the following paper:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. Submitted to DCASE2018 Workshop, 2018. URL: https://arxiv.org/abs/1807.09840, arXiv:1807.09840.

A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data