Acoustic scene classification


Task description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded.

This task comprises three subtasks that involve system development for three different situations:

  • Subtask A (Acoustic Scene Classification): classification of data from the same device as the available training data.
  • Subtask B (Acoustic Scene Classification with mismatched recording devices): classification of data recorded with devices different from those used for the training data.
  • Subtask C (Acoustic Scene Classification with use of external data): use of external data in training.

Description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "pedestrian street", or "metro station".

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. The dataset was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5-6 minutes of audio. The original recordings were split into segments with a length of 10 seconds, provided as individual files. Available information about the recordings includes the acoustic scene class, the city, and the recording location.

There are two different versions of the dataset, TUT Urban Acoustic Scenes 2018 and TUT Urban Acoustic Scenes 2018 Mobile, used for subtasks A and B, respectively. More details about them can be found below.

Acoustic scenes for the task (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

The dataset was collected by Tampere University of Technology between 01/2018 - 03/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Recording procedure

Recordings were made using four devices that captured audio simultaneously.

The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically designed to look like headphones and are worn in the ears; as a result, the recorded audio is very similar to the sound that reaches the auditory system of the person wearing the equipment. This equipment is further referred to as device A.

Three other commonly available consumer devices (e.g. smartphones, cameras) were used, handled in typical ways by the person doing the recording (e.g. hand held). We further refer to these devices as B, C, and D. The audio recordings from these devices are of different quality than those from device A. All simultaneous recordings are time synchronized.

The TUT Urban Acoustic Scenes 2018 development dataset contains only material recorded with device A, with 864 segments (144 minutes of audio) per acoustic scene. The dataset contains in total 8640 segments, i.e. 24 hours of audio.

The TUT Urban Acoustic Scenes 2018 Mobile development dataset contains material recorded with devices A, B and C. For each acoustic scene there are 864 segments recorded with device A, and parallel audio consisting of 72 segments each recorded with devices B and C. Data from device A was resampled and averaged into a single channel to align with the properties of the data recorded with devices B and C. The dataset contains in total 28 hours of audio.

Download

Subtask A




Subtask B




Subtask C

This subtask uses the same dataset as subtask A.




Task setup

For each subtask, a development set is provided, together with a training/test partitioning for system development. Participants are required to report performance of their system using this train/test setup in order to allow comparison of systems on the development set.

Subtask A

Acoustic Scene Classification

This subtask is concerned with the basic problem of acoustic scene classification, in which all available data (development and evaluation) are recorded with the same device, in this case device A.

Development dataset


The development data consists of recordings from all six cities, and is partitioned so that, for each city and each scene class, the training subset contains recordings from approximately 70% of the recording locations and the test subset contains recordings from the remaining locations. Of the total 8640 segments, 6122 segments were included in the training subset and 2518 segments in the test subset. For complete details on the dataset, check the readme file provided with the data.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id is always a.
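
To illustrate such a location-aware split, a minimal sketch is given below. It parses the location identifier from the file names and uses it as a grouping key; the audio directory, the number of folds, and the helper function are hypothetical, and scikit-learn's GroupKFold is just one possible way to respect the location constraint.

```python
# Hedged sketch: location-aware cross-validation folds built from the naming scheme
# [scene label]-[city]-[location id]-[segment id]-[device id].wav
import glob
import os
from sklearn.model_selection import GroupKFold

def parse_filename(path):
    """Return (scene_label, city, location_id) parsed from a dataset file name."""
    name = os.path.splitext(os.path.basename(path))[0]
    scene_label, city, location_id, segment_id, device_id = name.split('-')
    return scene_label, city, location_id

files = sorted(glob.glob('audio/*.wav'))                  # hypothetical directory
labels = [parse_filename(f)[0] for f in files]
# Group by city + location id so all segments from one location stay together.
groups = ['{}-{}'.format(*parse_filename(f)[1:]) for f in files]

for train_idx, test_idx in GroupKFold(n_splits=4).split(files, labels, groups):
    train_files = [files[i] for i in train_idx]
    test_files = [files[i] for i in test_idx]
    # ... train and evaluate on this fold ...
```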

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The amount of data in the evaluation set is 10 hours.

Subtask B

Acoustic Scene Classification with mismatched recording devices

This subtask is concerned with the situation in which an application will be tested with a few different types of devices, possibly not the same as the ones used to record the development data.

Development dataset


The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:

  • Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
  • Device B: 2 hours (72 segments per acoustic scene)
  • Device C: 2 hours (72 segments per acoustic scene)

The 2 hours of data recorded with devices B and C are parallel recordings, also available as recorded with device A. The training/test setup was created such that approximately 70% of the recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C. Please report development set results using the provided data partitioning.

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id can be a, b, or c.

In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!

Evaluation dataset


Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The evaluation data consists of audio recorded with all four devices, of which device D was not encountered in development. Ranking of systems will be done using only devices B and C, but accuracy will also be calculated for devices A and D.

Subtask C

Acoustic Scene Classification with use of external data

This subtask is meant to test whether the use of external data in system development brings a significant improvement in performance. The task is identical to subtask A, with the only difference being that the use of external data and transfer learning is allowed under the following conditions:

  • The dataset used must be public and freely available before the 29th of March 2018
  • Participants must inform/suggest such data to be listed on the task webpage, so that all competitors know about it and have an equal opportunity to use it; please let us know by sending email to dcase.challenge@gmail.com and we will update the list of external datasets accordingly
  • Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
  • Participants should list in their technical report the external data sources they used

Participants in this subtask are encouraged to also submit to subtask A, to provide a comparison point for their system without external data.

External data sources

List of external datasets allowed:

Dataset name Type Added Link
TUT Acoustic scenes 2017, development dataset audio 29.3.2018 https://zenodo.org/record/400515
TUT Acoustic scenes 2017, evaluation dataset audio 29.3.2018 https://zenodo.org/record/1040168
TUT Acoustic scenes 2016, development dataset audio 29.3.2018 https://zenodo.org/record/45739
TUT Acoustic scenes 2016, evaluation dataset audio 29.3.2018 https://zenodo.org/record/165995
LITIS Rouen audio scene dataset audio 29.3.2018 https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task audio 29.3.2018 https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task audio 29.3.2018 https://archive.org/details/dcase2013_scene_classification_testset
Dares G1 audio 29.3.2018 http://www.daresounds.org/
AudioSet audio 25.4.2018 https://research.google.com/audioset/


Participants can suggest data for this list by sending email to dcase.challenge@gmail.com until the evaluation dataset is published, after which the list is locked.

Submission

Official challenge submission consists of a technical report and system output for the evaluation data.

System output should be presented as a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][scene label (string)]
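
For illustration, a minimal sketch of writing the system output in this format is shown below; the prediction mapping and the output file name are hypothetical.

```python
# Hedged sketch: write the system output as tab-separated [filename][tab][scene label].
# `predictions` is a hypothetical dict mapping evaluation file names to predicted scene labels.
predictions = {
    'audio/1.wav': 'park',
    'audio/2.wav': 'street_traffic',
}

with open('task1_subtaskA_mysystem_output.csv', 'w') as out:   # hypothetical file name
    for filename, scene_label in predictions.items():
        out.write('{}\t{}\n'.format(filename, scene_label))
```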

Multiple system outputs can be submitted (maximum 4 per participant). For each system, meta information should be provided in a separate file, containing the task-specific information as given in the example here. All files should be packaged into a zip file for submission. Please clearly mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text files appropriately).

Detailed information for the submission can be found on the Submission page.

Public leaderboards

During the challenge, a public leaderboard will be provided using a separate public evaluation set for each subtask. The leaderboards are organized through Kaggle InClass competitions and are meant to serve as a development tool for participants during system development.

  • Subtask A Leaderboard
  • Subtask B Leaderboard
  • Subtask C Leaderboard

The official DCASE challenge submission will not be done through these Kaggle InClass competitions.

Datasets

For public leaderboard submissions, participants should use the challenge development datasets to train their system, as in the DCASE challenge. Separate leaderboard datasets are released to be used as evaluation datasets in these competitions. The leaderboard datasets consist of material similar to the official evaluation dataset in the DCASE challenge, but in a considerably smaller amount. It is not allowed to use the leaderboard datasets to train systems in any DCASE challenge subtask or leaderboard competition.



Task rules

There are general rules valid for all tasks; these, along with information on the technical report and submission requirements, can be found here.

Task specific rules:

  • Use of external data for system development is allowed only in subtask C. Data from another task or subtask is considered external data.
  • Manipulation of the provided training and development data is allowed in all subtasks. The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a probability density function or by using techniques such as pitch shifting or time stretching; see the sketch after this list).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
  • Classification decision must be done independently for each test sample.
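
As an illustration of such augmentation (not part of the provided baseline), the sketch below applies pitch shifting and time stretching to a development segment with librosa; the file name is hypothetical and the parameter values are arbitrary.

```python
# Hedged sketch: simple augmentation of a development segment using librosa.
import librosa

# Hypothetical file name following the dataset naming scheme.
y, sr = librosa.load('audio/airport-barcelona-0-0-a.wav', sr=48000, mono=True)

# Shift the pitch up by two semitones; the duration is unchanged.
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Stretch time by about 10%; the result must be cropped or padded back to 10 s
# before feature extraction, since the segment length changes.
y_stretch = librosa.effects.time_stretch(y, rate=1.1)
```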

Evaluation

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments among the total number of segments. Each segment is considered an independent test sample. Accuracy will be calculated as the average of the class-wise accuracies.

Participants can use the sed_eval toolbox for the evaluation.
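
For reference, a minimal sketch of the metric itself (classification accuracy averaged over classes) is shown below; this is a plain Python illustration rather than the sed_eval API, and the example labels are hypothetical.

```python
# Hedged sketch: classification accuracy averaged over classes.
from collections import defaultdict

def class_wise_average_accuracy(reference, estimated):
    """reference, estimated: equal-length lists of scene labels, one per segment."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, est in zip(reference, estimated):
        total[ref] += 1
        correct[ref] += int(ref == est)
    per_class = {label: correct[label] / total[label] for label in total}
    return sum(per_class.values()) / len(per_class), per_class

# Hypothetical example: park 1/2 correct, bus 2/2 correct -> class-wise average 0.75.
avg, per_class = class_wise_average_accuracy(
    ['park', 'park', 'bus', 'bus'],
    ['park', 'tram', 'bus', 'bus'])
```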


Baseline system

The baseline system provides a simple entry-level approach that gives reasonable results in the subtasks of Task 1. The baseline system is built on the dcase_util toolbox.

Participants are strongly encouraged to build their own systems by extending the provided baseline system. The system has all the needed functionality for dataset handling, storing and accessing acoustic features, training and storing acoustic models, and evaluation. The modular structure of the system enables participants to modify it to their needs. The baseline system is a good starting point, especially for entry-level researchers, to familiarize themselves with the acoustic scene classification problem.

If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system will make their code more accessible to the community. DCASE organizers strongly encourage participants to share their code in any form after the challenge.

Repository


System description

The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.

The baseline system is built on the dcase_util toolbox. The machine learning part of the code is built on Keras (v2.1.5), using TensorFlow (v1.4.0) as backend.

Parameters

Acoustic features

  • Analysis frame: 40 ms (50% hop size)
  • Log mel-band energies (40 bands); see the extraction sketch after this list
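
A hedged sketch of this feature extraction is given below, using librosa for illustration; the actual baseline extracts features with the dcase_util toolbox, so details such as the window function, normalization, and the log floor may differ.

```python
# Hedged sketch: 40 log mel-band energies, 40 ms frames with 50% hop, for a 10 s clip.
import numpy as np
import librosa

def log_mel_energies(path, sr=48000, n_mels=40):
    y, _ = librosa.load(path, sr=sr, mono=True)
    n_fft = int(0.040 * sr)               # 40 ms analysis frame -> 1920 samples at 48 kHz
    hop_length = n_fft // 2               # 50% hop -> 960 samples
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    feat = np.log(mel + 1e-10)            # log mel-band energies, shape (40, ~500)
    return feat[:, :500]                  # trim to the 40 x 500 network input
```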

Neural network

  • Input shape: 40 * 500 (10 seconds)
  • Architecture (a Keras sketch of this network follows the parameter list below):

    • CNN layer #1
      • 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
    • CNN layer #2
      • 2D Convolutional layer (filters: 64, kernel size: 7) + Batch normalization + ReLU activation
      • 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
    • Flatten
    • Dense layer #1
      • Dense layer (units: 100, activation: ReLU)
      • Dropout (rate: 30%)
    • Output layer (activation: softmax)
  • Learning (epochs: 200, batch size: 16, data shuffling between epochs)

    • Optimizer: Adam (learning rate: 0.001)
  • Model selection:

    • Approximately 30% of the original training data is assigned to a validation set; the split is done such that the training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and the best performing model is selected
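
A minimal Keras sketch of the network described above follows. It is an approximation reconstructed from the listed parameters, not the baseline code itself; in particular, 'same' padding, the channels-last input shape, and the categorical cross-entropy loss are assumptions.

```python
# Hedged sketch of the baseline CNN, reconstructed from the listed parameters.
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Dropout, Flatten, Dense)
from keras.optimizers import Adam

model = Sequential([
    # CNN layer #1: 40 x 500 x 1 input -> 8 x 100 x 32 after pooling
    Conv2D(32, kernel_size=7, padding='same', input_shape=(40, 500, 1)),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D(pool_size=(5, 5)),
    Dropout(0.3),
    # CNN layer #2: -> 2 x 1 x 64 after pooling
    Conv2D(64, kernel_size=7, padding='same'),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D(pool_size=(4, 100)),
    Dropout(0.3),
    # Classifier
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.3),
    Dense(10, activation='softmax'),      # 10 acoustic scene classes
])

model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy',   # assumption: standard multi-class loss
              metrics=['accuracy'])

# Training as listed: 200 epochs, batch size 16, shuffling between epochs, e.g.
# model.fit(X_train, y_train, epochs=200, batch_size=16, shuffle=True,
#           validation_data=(X_val, y_val))
```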

Results for the development dataset

Results were calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU card). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

Subtask A

Scene label Accuracy
Airport 72.9 %
Bus 62.9 %
Metro 51.2 %
Metro station 55.4 %
Park 79.1 %
Public square 40.4 %
Shopping mall 49.6 %
Street, pedestrian 50.0 %
Street, traffic 80.5 %
Tram 55.1 %
Average 59.7 % (± 0.7)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

Material from device A (high-quality) is used for training, while testing is done with material from all three devices. This highlights the problem of mismatched recording devices. Results are calculated the same way as for subtask A, with mean and standard deviation of the performance from 10 independent trials shown in the results table.

Remember that ranking in this subtask will be based on devices B and C (the Average (B,C) column in this table).

Scene label Device B Device C Average (B,C) Device A
Airport 68.9 % 76.1 % 72.5 % 73.4 %
Bus 70.6 % 86.1 % 78.3 % 56.7 %
Metro 23.9 % 17.2 % 20.6 % 46.6 %
Metro station 33.9 % 31.7 % 32.8 % 52.9 %
Park 67.2 % 51.1 % 59.2 % 80.8 %
Public square 22.8 % 26.7 % 24.7 % 37.9 %
Shopping mall 58.3 % 63.9 % 61.1 % 46.4 %
Street, pedestrian 16.7 % 25.0 % 20.8 % 55.5 %
Street, traffic 69.4 % 63.3 % 66.4 % 82.5 %
Tram 18.9 % 20.6 % 19.7 % 56.5 %
Average 45.1 % (± 3.6) 46.2 % (± 4.2) 45.6 % (± 3.6) 58.9 % (± 0.8)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask C

The baseline result for subtask A applies.