The challenge has ended. Full results for this task can be found on the subtask-specific results pages: Task 1A, Task 1B, Task 1C.
The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded.
This task comprises three subtasks, each addressing a different system development situation:
- Subtask A (Acoustic Scene Classification): classification of data from the same device as the available training data.
- Subtask B (Acoustic Scene Classification with mismatched recording devices): classification of data recorded with devices different from those used for the training data.
- Subtask C (Acoustic Scene Classification with use of external data): use of external data is allowed in training.
Description
The goal of acoustic scene classification is to classify a test recording into one of the provided predefined classes that characterizes the environment in which it was recorded — for example "park", "pedestrian street", "metro station".
Audio dataset
The dataset for this task is the TUT Urban Acoustic Scenes 2018 dataset, consisting of recordings from various acoustic scenes. The dataset was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5-6 minutes of audio. The original recordings were split into segments with a length of 10 seconds that are provided in individual files. Available information about the recordings includes the following: acoustic scene class, city, and recording location.
There are two different versions, TUT Urban Acoustic Scenes 2018 and TUT Urban Acoustic Scenes 2018 Mobile, used for subtasks A and B, respectively. More details about them can be found below.
Acoustic scenes for the task (10):
- Airport (airport)
- Indoor shopping mall (shopping_mall)
- Metro station (metro_station)
- Pedestrian street (street_pedestrian)
- Public square (public_square)
- Street with medium level of traffic (street_traffic)
- Travelling by a tram (tram)
- Travelling by a bus (bus)
- Travelling by an underground metro (metro)
- Urban park (park)
The dataset was collected by Tampere University of Technology between January and March 2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
Recording procedure
Recordings were made using four devices that captured audio simultaneously.
The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The microphones are specifically designed to look like headphones and are worn in the ears, so the recorded audio is very similar to the sound reaching the human auditory system of the person wearing the equipment. This equipment is further referred to as device A.
Three other commonly available consumer devices (e.g. smartphones, cameras) were used, handled in typical ways by the person doing the recordings (e.g. hand-held). We further refer to these devices as B, C, and D. The audio recordings from these devices are of different quality than those from device A. All simultaneous recordings are time-synchronized.
TUT Urban Acoustic Scenes 2018 development dataset contains only material recorded with device A, having 864 segments for each acoustic scene (144 minutes of audio). The dataset contains in total 8640 segments, i.e. 24 hours of audio.
TUT Urban Acoustic Scenes 2018 Mobile development dataset contains material recorded with devices A, B and C. For each acoustic scene there are 864 segments recorded with device A, and parallel audio consisting of 72 segments each recorded with devices B and C. Data from device A was resampled and averaged into a single channel to align with the properties of the data recorded with devices B and C. The dataset contains in total 28 hours of audio.
Reference labels
Reference labels are provided only for the development datasets. Currently, there are no plans to release the reference labels for the evaluation or public leaderboard datasets. If you are preparing a publication based on the DCASE challenge setup and you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.
Download
Subtask A
Subtask B
Subtask C
This subtask uses the same dataset as subtask A.
Task setup
For each subtask, a development set is provided, together with a training/test partitioning for system development. Participants are required to report performance of their system using this train/test setup in order to allow comparison of systems on the development set.
Subtask A
Acoustic Scene Classification
This subtask is concerned with the basic problem of acoustic scene classification, in which all available data (development and evaluation) are recorded with the same device, in this case device A.
Development dataset
The development data consists of recordings from all six cities, and is partitioned so that, for each city and each class, the training subset contains recordings from approximately 70% of the recording locations and the test subset contains recordings from the remaining locations. Of the total 8640 segments, 6122 segments were included in the training subset and 2518 segments in the test subset. For complete details on the dataset, check the readme file provided with the data.
Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:
[scene label]-[city]-[location id]-[segment id]-[device id].wav
Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id is always "a".
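As an illustration, the snippet below parses the location identifier from the file names and builds a location-disjoint train/validation split. It is a minimal sketch and not part of the official baseline; the split ratio, random seed, and example file names are assumptions, and per-city/per-class stratification is omitted for brevity.

```python
import os
import random
from collections import defaultdict


def location_key(filename):
    # File name format: [scene label]-[city]-[location id]-[segment id]-[device id].wav
    scene, city, location_id, _segment_id, _device = os.path.basename(filename)[:-4].split('-')
    return scene, city, location_id


def location_disjoint_split(filenames, validation_ratio=0.3, seed=0):
    """Split files so that all segments from one recording location end up on the same side."""
    groups = defaultdict(list)
    for name in filenames:
        groups[location_key(name)].append(name)

    locations = sorted(groups)
    random.Random(seed).shuffle(locations)

    n_val = int(round(validation_ratio * len(locations)))
    val_locations = set(locations[:n_val])

    train = [f for loc in locations if loc not in val_locations for f in groups[loc]]
    val = [f for loc in val_locations for f in groups[loc]]
    return train, val


# Example usage with hypothetical file names:
train_files, val_files = location_disjoint_split([
    'airport-barcelona-0-0-a.wav',
    'airport-barcelona-0-1-a.wav',
    'airport-helsinki-3-7-a.wav',
])
```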
In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!
Evaluation dataset
Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The amount of data in the evaluation set is 10 hours.
Subtask B
Acoustic Scene Classification with mismatched recording devices
This subtask is concerned with the situation in which an application will be tested with a few different types of devices, possibly not the same as the ones used to record the development data.
Development dataset
The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:
- Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
- Device B: 2 hours (72 segments per acoustic scene)
- Device C: 2 hours (72 segments per acoustic scene)
The 2 hours of data recorded with devices B and C are parallel recordings, also available as recorded with device A. The training/test setup was created such that approximately 70% of the recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C. Please report development set results using the provided data partitioning.
Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset or in the audio file names:
[scene label]-[city]-[location id]-[segment id]-[device id].wav
Make sure that all files having the same location id are placed on the same side of the evaluation split. In this subtask, the device id can be "a", "b", or "c".
In this subtask, use of external data is forbidden. Data from another task or subtask is considered external data!
Evaluation dataset
Participants should run their system on this dataset and submit the classification results (system output) to the DCASE2018 Challenge. The evaluation dataset is provided without ground truth. The evaluation data consists of audio recorded with all four devices, of which device D was not encountered in development. Ranking of systems will be done using only devices B and C, but accuracy will also be calculated for devices A and D.
Subtask C
Acoustic Scene Classification with use of external data
This subtask is meant to test whether the use of external data in system development brings a significant improvement in performance. The task is identical to subtask A, the only difference being that the use of external data and transfer learning is allowed under the following conditions:
- The used dataset must be public and freely available before 29th of March 2018
- Participants must suggest such data for listing on the task webpage, so that all competitors know about it and have an equal opportunity to use it; please let us know by sending an email to dcase.challenge@gmail.com and we will update the list of external datasets accordingly
- Once the evaluation set is published, the list of allowed external datasets is locked (no further external sources allowed)
- Participants should list in their technical report the external data sources they used
Participants in this subtask are also encouraged to submit to subtask A, to provide a comparison point for their system without external data.
External data sources
List of external datasets allowed:
Dataset name | Type | Added | Link |
---|---|---|---|
TUT Acoustic scenes 2017, development dataset | audio | 29.3.2018 | https://zenodo.org/record/400515 |
TUT Acoustic scenes 2017, evaluation dataset | audio | 29.3.2018 | https://zenodo.org/record/1040168 |
TUT Acoustic scenes 2016, development dataset | audio | 29.3.2018 | https://zenodo.org/record/45739 |
TUT Acoustic scenes 2016, evaluation dataset | audio | 29.3.2018 | https://zenodo.org/record/165995 |
LITIS Rouen audio scene dataset | audio | 29.3.2018 | https://sites.google.com/site/alainrakotomamonjy/home/audio-scene |
DCASE2013 Challenge - Public Dataset for Scene Classification Task | audio | 29.3.2018 | https://archive.org/details/dcase2013_scene_classification |
DCASE2013 Challenge - Private Dataset for Scene Classification Task | audio | 29.3.2018 | https://archive.org/details/dcase2013_scene_classification_testset |
Dares G1 | audio | 29.3.2018 | http://www.daresounds.org/ |
AudioSet | audio | 25.4.2018 | https://research.google.com/audioset/ |
Participants can suggest additions to this list by sending an email to dcase.challenge@gmail.com until the evaluation dataset is published, after which the list is locked.
Submission
Official challenge submission consists of a technical report and system output for the evaluation data.
System output should be presented as a single text file (in CSV format) containing the classification result for each audio file in the evaluation set. Result items can be in any order. Format:
[filename (string)][tab][scene label (string)]
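As an illustration, a minimal sketch for writing a file in this format is shown below; it is not part of the official submission tooling, and the example file names and output path are hypothetical.

```python
import csv


def write_system_output(results, output_path='task1a_output.txt'):
    """Write (filename, scene label) pairs as tab-separated lines."""
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        for filename, scene_label in results:
            writer.writerow([filename, scene_label])


# Example usage (hypothetical evaluation file names):
write_system_output([('1.wav', 'airport'), ('2.wav', 'bus')])
```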
Multiple system outputs can be submitted (maximum 4 per participant). For each system, meta information should be provided in a separate file, containing the task specific information as given in the example here. All files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).
Detailed information for the submission can be found on the Submission page.
Public leaderboards
During the challenge, a public leaderboard will be provided using a separate public evaluation set for each subtask. The leaderboards are organized through Kaggle InClass competitions and are meant to serve as a development tool for participants during system development.
Subtask B Leaderboard
Subtask C Leaderboard
The official DCASE challenge submission will not be done through these Kaggle InClass competitions.
Datasets
For public leaderboard submissions, participants should use the challenge development datasets to train their systems, as in the DCASE challenge. Separate leaderboard datasets are released to be used as evaluation data in these competitions. The leaderboard datasets consist of material similar to the official evaluation dataset of the DCASE challenge, but the amount of material is considerably smaller. It is not allowed to use the leaderboard datasets to train systems in any DCASE challenge subtask or leaderboard competition.
Task rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.
Task specific rules:
- Use of external data for system development is allowed only in subtask C. Data from another task or subtask is considered external data.
- Manipulation of the provided training and development data is allowed in all subtasks. The development dataset can be augmented without the use of external data (e.g. by mixing data sampled from a pdf or using techniques such as pitch shifting or time stretching; a minimal augmentation sketch is shown after this list).
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
- The classification decision must be made independently for each test sample.
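As a sketch of the kind of augmentation permitted (no external data involved), the snippet below applies pitch shifting and time stretching with librosa. The parameter values are arbitrary examples, not recommendations from the organizers.

```python
import librosa


def augment(audio_path, n_steps=2, rate=1.1):
    """Create pitch-shifted and time-stretched variants of one training segment."""
    y, sr = librosa.load(audio_path, sr=None)  # keep the original sampling rate
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # shift up by n_steps semitones
    stretched = librosa.effects.time_stretch(y, rate=rate)            # speed up by factor `rate`
    return shifted, stretched
```

Note that time stretching changes the segment length, so stretched audio would need trimming or padding back to 10 seconds before feature extraction.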
Evaluation
The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments divided by the total number of segments. Each segment is considered an independent test sample. The overall accuracy is calculated as the average of the class-wise accuracies.
Participants can use the sed_eval toolbox for the evaluation.
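The class-wise average (macro-averaged) accuracy can also be computed directly; a minimal sketch, independent of sed_eval, is shown below.

```python
from collections import defaultdict


def class_wise_average_accuracy(reference, estimated):
    """Macro-averaged accuracy: mean over classes of the per-class accuracy.

    reference, estimated: lists of scene labels, aligned per segment.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, est in zip(reference, estimated):
        total[ref] += 1
        if ref == est:
            correct[ref] += 1
    return sum(correct[label] / total[label] for label in total) / len(total)


# Example: the 'airport' class has one of two segments correct, 'bus' has both correct
print(class_wise_average_accuracy(
    ['airport', 'airport', 'bus', 'bus'],
    ['airport', 'bus', 'bus', 'bus'],
))  # (0.5 + 1.0) / 2 = 0.75
```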
Results
Subtask A
Code | Author | Affiliation | Technical Report | Accuracy with 95% confidence interval |
---|---|---|---|---|
Baseline_Surrey_task1a_1 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-a#Kong2018 | 70.4 (68.9 - 71.9) |
Baseline_Surrey_task1a_2 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-a#Kong2018 | 69.7 (68.2 - 71.2) |
Dang_NCU_task1a_1 | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 73.3 (71.9 - 74.8) | |
Dang_NCU_task1a_2 | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 74.5 (73.1 - 76.0) | |
Dang_NCU_task1a_3 | An Dang | Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan | task-acoustic-scene-classification-results-a#Dang2018 | 74.1 (72.7 - 75.5) | |
Dorfer_CPJKU_task1a_1 | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 79.7 (78.4 - 81.0) | |
Dorfer_CPJKU_task1a_2 | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 67.8 (66.3 - 69.3) | |
Dorfer_CPJKU_task1a_3 | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 80.5 (79.2 - 81.8) | |
Dorfer_CPJKU_task1a_4 | Matthias Dorfer | Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria | task-acoustic-scene-classification-results-a#Dorfer2018 | 77.2 (75.8 - 78.5) | |
Fraile_UPM_task1a_1 | Ruben Fraile | Research Center on Software Technologies and Multimedia Systems for Sustainability (CITSEM), Universidad Politecnica de Madrid, Madrid, Spain | task-acoustic-scene-classification-results-a#Fraile2018 | 62.7 (61.1 - 64.3) | |
Gil-jin_KNU_task1a_1 | Jang Gil-jin | School of Electronics Engineering, Kyungpook National University, Daegu, Korea | task-acoustic-scene-classification-results-a#Sangwon2018 | 74.4 (73.0 - 75.8) |
Golubkov_SPCH_task1a_1 | Alexander Golubkov | Saint Petersburg, Russia | task-acoustic-scene-classification-results-a#Golubkov2018 | 60.2 (58.7 - 61.8) | |
DCASE2018 baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-a#Heittola2018 | 61.0 (59.4 - 62.6) | |
Jung_UOS_task1a_1 | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 74.8 (73.4 - 76.2) | |
Jung_UOS_task1a_2 | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 74.2 (72.8 - 75.7) | |
Jung_UOS_task1a_3 | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 73.8 (72.4 - 75.2) | |
Jung_UOS_task1a_4 | Ha-jin Yu | School of Computer Science, University of Seoul, Seoul, South Korea | task-acoustic-scene-classification-results-a#Jung2018 | 73.8 (72.4 - 75.3) | |
Khadkevich_FB_task1a_1 | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-a#Khadkevich2018 | 67.8 (66.3 - 69.3) | |
Khadkevich_FB_task1a_2 | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-a#Khadkevich2018 | 67.2 (65.7 - 68.8) | |
Li_BIT_task1a_1 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 73.0 (71.5 - 74.4) | |
Li_BIT_task1a_2 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.3 (73.9 - 76.7) | |
Li_BIT_task1a_3 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.3 (73.9 - 76.7) | |
Li_BIT_task1a_4 | Zhitong Li | Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China | task-acoustic-scene-classification-results-a#Li2018 | 75.0 (73.6 - 76.4) | |
Li_SCUT_task1a_1 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 43.4 (41.8 - 45.0) | |
Li_SCUT_task1a_2 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 50.2 (48.6 - 51.9) | |
Li_SCUT_task1a_3 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 44.5 (42.9 - 46.2) | |
Li_SCUT_task1a_4 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-a#Li2018a | 46.7 (45.1 - 48.3) | |
Liping_CQU_task1a_1 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 70.4 (69.0 - 71.9) | |
Liping_CQU_task1a_2 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 74.0 (72.6 - 75.4) | |
Liping_CQU_task1a_3 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 74.7 (73.3 - 76.1) | |
Liping_CQU_task1a_4 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-a#Liping2018 | 75.4 (74.0 - 76.8) | |
Maka_ZUT_task1a_1 | Tomasz Maka | Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland | task-acoustic-scene-classification-results-a#Maka2018 | 65.8 (64.3 - 67.4) | |
Mariotti_lip6_task1a_1 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 75.0 (73.6 - 76.4) | |
Mariotti_lip6_task1a_2 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 72.8 (71.3 - 74.2) | |
Mariotti_lip6_task1a_3 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 72.8 (71.3 - 74.2) | |
Mariotti_lip6_task1a_4 | Octave Mariotti | Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France | task-acoustic-scene-classification-results-a#Mariotti2018 | 74.9 (73.4 - 76.3) | |
Nguyen_TUGraz_task1a_1 | Truc Nguyen | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe | task-acoustic-scene-classification-results-a#Nguyen2018 | 69.8 (68.3 - 71.3) | |
Ren_UAU_task1a_1 | Zhao Ren | ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany | task-acoustic-scene-classification-results-a#Ren2018 | 69.0 (67.5 - 70.5) | |
Roletscheck_UNIA_task1a_1 | Christian Roletscheck | Human Centered Multimedia, Augsburg University, Augsburg, Germany | task-acoustic-scene-classification-results-a#Roletscheck2018 | 69.2 (67.7 - 70.7) | |
Roletscheck_UNIA_task1a_2 | Christian Roletscheck | Human Centered Multimedia, Augsburg University, Augsburg, Germany | task-acoustic-scene-classification-results-a#Roletscheck2018 | 67.3 (65.7 - 68.8) | |
Sakashita_TUT_task1a_1 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 81.0 (79.7 - 82.3) | |
Sakashita_TUT_task1a_2 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 81.0 (79.7 - 82.3) | |
Sakashita_TUT_task1a_3 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 80.7 (79.4 - 82.0) | |
Sakashita_TUT_task1a_4 | Yuma Sakashita | Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan | task-acoustic-scene-classification-results-a#Sakashita2018 | 79.3 (78.0 - 80.6) | |
Tilak_IIITB_task1a_1 | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 59.5 (57.9 - 61.1) | |
Tilak_IIITB_task1a_2 | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 58.3 (56.7 - 59.9) | |
Tilak_IIITB_task1a_3 | Tilak Purohit | Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India | task-acoustic-scene-classification-results-a#Purohit2018 | 55.0 (53.4 - 56.6) | |
Waldekar_IITKGP_task1a_1 | Shefali Waldekar | Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India | task-acoustic-scene-classification-results-a#Waldekar2018 | 69.7 (68.2 - 71.2) | |
WangJun_BUPT_task1a_1 | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 70.9 (69.4 - 72.4) |
WangJun_BUPT_task1a_2 | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 70.5 (69.0 - 72.0) | |
WangJun_BUPT_task1a_3 | Wang Jun | Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China | task-acoustic-scene-classification-results-a#Jun2018 | 73.2 (71.7 - 74.6) | |
Yang_GIST_task1a_1 | Jeong Hyeon Yang | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea | task-acoustic-scene-classification-results-a#Yang2018 | 71.7 (70.2 - 73.2) | |
Yang_GIST_task1a_2 | Jeong Hyeon Yang | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea | task-acoustic-scene-classification-results-a#Yang2018 | 70.0 (68.5 - 71.5) | |
Zeinali_BUT_task1a_1 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 78.4 (77.0 - 79.7) | |
Zeinali_BUT_task1a_2 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 78.1 (76.8 - 79.5) | |
Zeinali_BUT_task1a_3 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 74.5 (73.1 - 76.0) | |
Zeinali_BUT_task1a_4 | Hossein Zeinali | BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic | task-acoustic-scene-classification-results-a#Zeinali2018 | 75.1 (73.7 - 76.6) | |
Zhang_HIT_task1a_1 | Liwen Zhang | Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China | task-acoustic-scene-classification-results-a#Zhang2018 | 73.4 (72.0 - 74.9) | |
Zhang_HIT_task1a_2 | Liwen Zhang | Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China | task-acoustic-scene-classification-results-a#Zhang2018 | 70.9 (69.4 - 72.3) | |
Zhao_DLU_task1a_1 | Lasheng Zhao | Key Laboratory of Advanced Design and Intelligent Computing (Dalian University), Ministry of Education, Dalian University, Liaoning, China | task-acoustic-scene-classification-results-a#Hao2018 | 69.8 (68.3 - 71.3)
Complete results and technical reports can be found on the subtask A results page.
Subtask B
Code | Author | Affiliation | Technical Report | Accuracy with 95% confidence interval |
---|---|---|---|---|
Baseline_Surrey_task1b_1 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-b#Kong2018 | 59.6 (58.5 - 60.7) |
Baseline_Surrey_task1b_2 | Qiuqiang Kong | Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK | task-acoustic-scene-classification-results-b#Kong2018 | 58.8 (57.7 - 59.9) |
DCASE2018 baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-b#Heittola2018 | 46.5 (45.4 - 47.6) | |
Li_SCUT_task1b_1 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 41.1 (40.0 - 42.2) | |
Li_SCUT_task1b_2 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 39.5 (38.4 - 40.6) | |
Li_SCUT_task1b_3 | YangXiong Li | Laboratory of Signal Processing, South China University of Technology, Guangzhou, China | task-acoustic-scene-classification-results-b#Li2018 | 42.3 (41.2 - 43.4) | |
Liping_CQU_task1b_1 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.0 (66.0 - 68.1) | |
Liping_CQU_task1b_2 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 63.2 (62.1 - 64.3) | |
Liping_CQU_task1b_3 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.7 (66.6 - 68.7) | |
Liping_CQU_task1b_4 | Chen Xinxing | College of Optoelectronic Engineering, Chongqing University, Chongqing, China | task-acoustic-scene-classification-results-b#Liping2018 | 67.1 (66.1 - 68.2) | |
Nguyen_TUGraz_task1b_1 | Truc Nguyen | Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe | task-acoustic-scene-classification-results-b#Nguyen2018 | 69.0 (68.0 - 70.0) | |
Ren_UAU_task1b_1 | Zhao Ren | ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany | task-acoustic-scene-classification-results-b#Ren2018 | 60.5 (59.4 - 61.5) | |
Tchorz_THL_task1b_1 | Juergen Tchorz | Institute for Acoustics, University of Applied Sciences Luebeck, Luebeck, Germany | task-acoustic-scene-classification-results-b#Tchorz2018 | 54.0 (52.9 - 55.1) | |
Waldekar_IITKGP_task1b_1 | Shefali Waldekar | Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India | task-acoustic-scene-classification-results-b#Waldekar2018 | 56.2 (55.1 - 57.3) | |
WangJun_BUPT_task1b_1 | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 48.8 (47.7 - 49.9) | |
WangJun_BUPT_task1b_2 | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 52.5 (51.4 - 53.6) | |
WangJun_BUPT_task1b_3 | Wang Jun | Laboratory of Signal Processing, Institute of Information Photonics and Optical Communication, Beijing, China | task-acoustic-scene-classification-results-b#Jun2018 | 52.3 (51.2 - 53.4) |
Complete results and technical reports can be found on the subtask B results page.
Subtask C
Code | Author | Affiliation | Technical Report | Accuracy with 95% confidence interval |
---|---|---|---|---|
DCASE2018 baseline | Toni Heittola | Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland | task-acoustic-scene-classification-results-c#Heittola2018 | 61.0 (59.4 - 62.6) | |
Khadkevich_FB_task1c_1 | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-c#Khadkevich2018 | 71.7 (70.2 - 73.2) | |
Khadkevich_FB_task1c_2 | Maksim Khadkevich | AML, Facebook, Menlo Park, CA, USA | task-acoustic-scene-classification-results-c#Khadkevich2018 | 69.0 (67.5 - 70.5) |
Complete results and technical reports can be found on the subtask C results page.
Submissions
Subtask | Teams | Entries | Authors | Affiliations |
---|---|---|---|---|
Subtask A | 24 | 59 | 71 | 25 |
Subtask B | 8 | 16 | 21 | 9 |
Subtask C | 1 | 2 | 1 | 1 |
Overall | 25 | 77 | 72 | 26 |
Baseline system
The baseline system provides a simple, entry-level approach that gives reasonable results in the subtasks of Task 1. The baseline system is built on the dcase_util toolbox.
Participants are strongly encouraged to build their own systems by extending the provided baseline system. The system has all the needed functionality for dataset handling, storing and accessing acoustic features, training and storing acoustic models, and evaluation. The modular structure of the system enables participants to modify it to their needs. The baseline system is a good starting point, especially for entry-level researchers, to familiarize themselves with the acoustic scene classification problem.
If participants plan to publish their code to the DCASE community after the challenge, building their approach on the baseline system will make their code more accessible to the community. DCASE organizers strongly encourage participants to share their code in any form after the challenge.
Repository
System description
The baseline system implements a convolutional neural network (CNN) based approach, where log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals.
The baseline system is built on the dcase_util toolbox. The machine learning part of the code is built on Keras (v2.1.5), using TensorFlow (v1.4.0) as the backend.
Parameters
Acoustic features
- Analysis frame 40 ms (50% hop size)
- Log mel-band energies (40 bands)
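A minimal feature-extraction sketch consistent with these parameters is shown below. It uses librosa rather than the dcase_util implementation in the baseline, so details such as windowing and normalization may differ from the actual baseline code.

```python
import numpy as np
import librosa


def extract_log_mel_energies(audio_path, n_mels=40, win_s=0.04, hop_s=0.02):
    """Log mel-band energies: 40 ms frames with 50% overlap, 40 bands."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    n_fft = int(win_s * sr)        # 1920 samples at 48 kHz
    hop_length = int(hop_s * sr)   # 960 samples at 48 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, power=2.0)
    return np.log(mel + 1e-10)     # shape: (40, ~500) for a 10-second segment
```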
Neural network
- Input shape: 40 × 500 (10 seconds)
- Architecture:
  - CNN layer #1
    - 2D convolutional layer (filters: 32, kernel size: 7) + batch normalization + ReLU activation
    - 2D max pooling (pool size: (5, 5)) + dropout (rate: 30%)
  - CNN layer #2
    - 2D convolutional layer (filters: 64, kernel size: 7) + batch normalization + ReLU activation
    - 2D max pooling (pool size: (4, 100)) + dropout (rate: 30%)
  - Flatten
  - Dense layer #1
    - Dense layer (units: 100, activation: ReLU)
    - Dropout (rate: 30%)
  - Output layer (activation: softmax)
- Learning (epochs: 200, batch size: 16, data shuffling between epochs)
  - Optimizer: Adam (learning rate: 0.001)
- Model selection:
  - Approximately 30% of the original training data is assigned to a validation set; the split is done such that the training and validation sets do not share segments from the same location and both sets contain data from each city
  - Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected
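A minimal Keras sketch consistent with the listed architecture is given below; the actual baseline implementation in the repository may differ in details such as padding, initialization, or data layout (here a channels-last input of shape (40, 500, 1) and 'same' padding are assumptions).

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Dropout, Flatten, Dense)
from keras.optimizers import Adam


def build_baseline_cnn(n_classes=10, input_shape=(40, 500, 1)):
    model = Sequential()

    # CNN layer #1
    model.add(Conv2D(32, kernel_size=7, padding='same', input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(5, 5)))
    model.add(Dropout(0.3))

    # CNN layer #2
    model.add(Conv2D(64, kernel_size=7, padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(4, 100)))
    model.add(Dropout(0.3))

    # Dense layer #1 and softmax output
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(n_classes, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])
    return model


# Training would then call model.fit(..., epochs=200, batch_size=16, shuffle=True),
# selecting the best epoch on a location-disjoint validation set.
```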
Results for the development dataset
Results are calculated using TensorFlow in GPU mode (using an Nvidia Titan XP GPU card). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.
Subtask A
Scene label | Accuracy |
---|---|
Airport | 72.9 % |
Bus | 62.9 % |
Metro | 51.2 % |
Metro station | 55.4 % |
Park | 79.1 % |
Public square | 40.4 % |
Shopping mall | 49.6 % |
Street, pedestrian | 50.0 % |
Street, traffic | 80.5 % |
Tram | 55.1 % |
Average | 59.7 % (± 0.7) |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Subtask B
Material from device A (high-quality) is used for training, while testing is done with material from all three devices. This highlights the problem of mismatched recording devices. Results are calculated the same way as for subtask A, with mean and standard deviation of the performance from 10 independent trials shown in the results table.
Remember that ranking in this subtask will be done using devices B and C (the Average (B,C) column in the table below).
Scene label | Device B | Device C | Average (B,C) | Device A |
---|---|---|---|---|
Airport | 68.9 % | 76.1 % | 72.5 % | 73.4 % |
Bus | 70.6 % | 86.1 % | 78.3 % | 56.7 % |
Metro | 23.9 % | 17.2 % | 20.6 % | 46.6 % |
Metro station | 33.9 % | 31.7 % | 32.8 % | 52.9 % |
Park | 67.2 % | 51.1 % | 59.2 % | 80.8 % |
Public square | 22.8 % | 26.7 % | 24.7 % | 37.9 % |
Shopping mall | 58.3 % | 63.9 % | 61.1 % | 46.4 % |
Street, pedestrian | 16.7 % | 25.0 % | 20.8 % | 55.5 % |
Street, traffic | 69.4 % | 63.3 % | 66.4 % | 82.5 % |
Tram | 18.9 % | 20.6 % | 19.7 % | 56.5 % |
Average | 45.1 % (± 3.6) | 46.2 % (± 4.2) | 45.6 % (± 3.6) | 58.9 % (± 0.8) |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Subtask C
The result for subtask A applies (the baseline system does not use external data).
Citation
If you are using the dataset or the baseline code, or want to refer to the challenge task, please cite the following paper:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pages 9–13, November 2018. URL: https://arxiv.org/abs/1807.09840.
A multi-device dataset for urban acoustic scene classification
Abstract
This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.
Keywords
Acoustic scene classification, DCASE challenge, public datasets, multi-device data