Acoustic Scene Classification with Mismatched Recording Devices


Challenge results

Task description

This subtask is concerned with the situation in which an application will be tested with a few different types of devices, possibly not the same as the ones used to record the development data.

The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:

  • Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
  • Device B: 2 hours (72 segments per acoustic scene)
  • Device C: 2 hours (72 segments per acoustic scene)

The 2 hours of data recorded with devices B and C are parallel, i.e. the same content is also available as recorded with device A. The training/test setup was created such that approximately 70% of recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C.
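A quick arithmetic check of these figures (a minimal sketch; the 10-second segment length is derived from 24 hours / 8640 segments):

```python
# Sanity check of the segment counts quoted above.
SEGMENT_SECONDS = 24 * 3600 / 8640           # 24 h of device A audio -> 10.0 s per segment
assert SEGMENT_SECONDS == 10.0

# Devices B and C: 72 segments per scene x 10 scene classes = 720 segments = 2 hours.
print(72 * 10 * SEGMENT_SECONDS / 3600)      # -> 2.0

# Train/test split sizes quoted above.
print(6122 + 2518)                           # -> 8640 device A segments
print(540 + 180)                             # -> 720 segments each for devices B and C
```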

A more detailed task description can be found on the task description page.

Systems ranking

Rank | Submission code | Submission name | Technical Report | Accuracy (B/C) with 95% confidence interval (Evaluation dataset) | Accuracy (B/C) (Development dataset) | Accuracy (B/C) (Leaderboard dataset)
Baseline_Surrey_task1b_1 SurreyCNN8 Kong2018 59.6 (58.5 - 60.7) 57.2 56.8
Baseline_Surrey_task1b_2 SurreyCNN4 Kong2018 58.8 (57.7 - 59.9) 57.5 57.8
DCASE2018 baseline Baseline Heittola2018 46.5 (45.4 - 47.6) 45.6 45.0
Li_SCUT_task1b_1 Li_SCUT Li2018 41.1 (40.0 - 42.2) 51.7
Li_SCUT_task1b_2 Li_SCUT Li2018 39.5 (38.4 - 40.6) 53.9
Li_SCUT_task1b_3 Li_SCUT Li2018 42.3 (41.2 - 43.4) 51.7
Liping_CQU_task1b_1 Xception Liping2018 67.0 (66.0 - 68.1) 77.6 65.8
Liping_CQU_task1b_2 Xception Liping2018 63.2 (62.1 - 64.3) 77.6 63.1
Liping_CQU_task1b_3 Xception Liping2018 67.7 (66.6 - 68.7) 77.6 67.2
Liping_CQU_task1b_4 Xception Liping2018 67.1 (66.1 - 68.2) 77.6 66.5
Nguyen_TUGraz_task1b_1 NNF_CNNEns Nguyen2018 69.0 (68.0 - 70.0) 63.6 67.3
Ren_UAU_task1b_1 ABCNN Ren2018 60.5 (59.4 - 61.5) 58.3 58.9
Tchorz_THL_task1b_1 AMS_MFCC Tchorz2018 54.0 (52.9 - 55.1) 63.8
Waldekar_IITKGP_task1b_1 IITKGP_ABSP_Fusion18 Waldekar2018 56.2 (55.1 - 57.3) 57.8
WangJun_BUPT_task1b_1 Attention Jun2018 48.8 (47.7 - 49.9) 69.0 49.4
WangJun_BUPT_task1b_2 Attention Jun2018 52.5 (51.4 - 53.6) 69.0 49.4
WangJun_BUPT_task1b_3 Attention Jun2018 52.3 (51.2 - 53.4) 69.0 49.4
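The accuracies above are reported with 95% confidence intervals; the exact interval method used by the organizers is not stated on this page. A minimal sketch of a normal-approximation (Wald) binomial interval for a classification accuracy, with an illustrative evaluation-set size, would look like this:

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% confidence interval for classification accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# Illustrative only: assume roughly 7500 evaluated segments and 59.6% accuracy.
low, high = accuracy_ci(correct=int(0.596 * 7500), total=7500)
print(f"{100 * low:.1f} - {100 * high:.1f}")
```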

Teams ranking

Rank | Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset)
Baseline_Surrey_task1b_1 SurreyCNN8 Kong2018 59.6 (58.5 - 60.7) 57.2 56.8
DCASE2018 baseline Baseline Heittola2018 46.5 (45.4 - 47.6) 45.6 45.0
Li_SCUT_task1b_3 Li_SCUT Li2018 42.3 (41.2 - 43.4) 51.7
Liping_CQU_task1b_3 Xception Liping2018 67.7 (66.6 - 68.7) 77.6 67.2
Nguyen_TUGraz_task1b_1 NNF_CNNEns Nguyen2018 69.0 (68.0 - 70.0) 63.6 67.3
Ren_UAU_task1b_1 ABCNN Ren2018 60.5 (59.4 - 61.5) 58.3 58.9
Tchorz_THL_task1b_1 AMS_MFCC Tchorz2018 54.0 (52.9 - 55.1) 63.8
Waldekar_IITKGP_task1b_1 IITKGP_ABSP_Fusion18 Waldekar2018 56.2 (55.1 - 57.3) 57.8
WangJun_BUPT_task1b_2 Attention Jun2018 52.5 (51.4 - 53.6) 69.0 49.4

Class-wise performance

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Baseline_Surrey_task1b_1 SurreyCNN8 Kong2018 59.6 47.7 65.0 51.0 58.2 86.4 36.7 58.5 49.9 80.2 62.1
Baseline_Surrey_task1b_2 SurreyCNN4 Kong2018 58.8 49.6 63.9 56.9 51.9 74.5 36.7 64.4 41.8 78.4 69.6
DCASE2018 baseline Baseline Heittola2018 46.5 61.6 56.7 45.3 40.0 61.1 15.4 51.8 32.4 69.8 30.4
Li_SCUT_task1b_1 Li_SCUT Li2018 41.1 33.3 53.0 33.1 30.1 48.9 43.3 38.1 31.4 47.5 52.1
Li_SCUT_task1b_2 Li_SCUT Li2018 39.5 27.4 66.5 17.3 35.7 52.5 40.7 41.0 23.2 50.6 39.9
Li_SCUT_task1b_3 Li_SCUT Li2018 42.3 34.7 66.8 27.8 32.2 51.0 52.3 48.4 23.0 47.6 39.3
Liping_CQU_task1b_1 Xception Liping2018 67.0 57.6 71.3 65.7 69.6 82.4 57.1 73.4 39.8 82.3 71.1
Liping_CQU_task1b_2 Xception Liping2018 63.2 40.9 73.7 59.8 68.4 84.2 34.8 77.8 42.9 87.5 61.9
Liping_CQU_task1b_3 Xception Liping2018 67.7 63.1 72.0 59.5 71.7 86.0 52.3 74.0 42.2 80.4 75.4
Liping_CQU_task1b_4 Xception Liping2018 67.1 62.6 71.7 59.8 69.9 88.9 48.2 71.1 44.6 81.7 72.7
Nguyen_TUGraz_task1b_1 NNF_CNNEns Nguyen2018 69.0 67.0 86.9 57.6 56.9 93.9 45.6 69.8 53.3 85.1 73.9
Ren_UAU_task1b_1 ABCNN Ren2018 60.5 44.6 79.3 52.3 61.4 81.2 29.2 64.0 58.8 81.3 52.7
Tchorz_THL_task1b_1 AMS_MFCC Tchorz2018 54.0 44.4 64.5 45.1 43.9 76.6 42.6 57.6 37.2 70.8 57.1
Waldekar_IITKGP_task1b_1 IITKGP_ABSP_Fusion18 Waldekar2018 56.2 39.3 62.4 51.1 54.9 73.1 40.2 72.0 41.4 78.4 49.6
WangJun_BUPT_task1b_1 Attention Jun2018 48.8 37.0 57.2 40.5 60.9 86.9 23.5 50.4 16.2 67.2 48.0
WangJun_BUPT_task1b_2 Attention Jun2018 52.5 70.7 55.4 59.8 44.6 76.3 46.6 48.6 2.1 73.5 47.0
WangJun_BUPT_task1b_3 Attention Jun2018 52.3 51.4 58.5 47.7 59.3 87.8 30.2 52.7 12.8 71.5 50.9

Device-wise performance

Rank | Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset): Average Dev B / Dev C | Dev B | Dev C | Dev A | Dev D
Baseline_Surrey_task1b_1 SurreyCNN8 Kong2018 59.6 59.5 59.6 69.1 32.1
Baseline_Surrey_task1b_2 SurreyCNN4 Kong2018 58.8 58.7 58.8 70.6 33.8
DCASE2018 baseline Baseline Heittola2018 46.5 45.9 47.0 63.6 27.5
Li_SCUT_task1b_1 Li_SCUT Li2018 41.1 42.2 39.9 54.2 20.3
Li_SCUT_task1b_2 Li_SCUT Li2018 39.5 39.8 39.1 55.7 13.2
Li_SCUT_task1b_3 Li_SCUT Li2018 42.3 43.3 41.3 55.7 18.5
Liping_CQU_task1b_1 Xception Liping2018 67.0 66.8 67.3 73.7 45.4
Liping_CQU_task1b_2 Xception Liping2018 63.2 63.9 62.5 72.2 45.8
Liping_CQU_task1b_3 Xception Liping2018 67.7 67.8 67.5 73.9 48.8
Liping_CQU_task1b_4 Xception Liping2018 67.1 67.6 66.7 73.6 47.8
Nguyen_TUGraz_task1b_1 NNF_CNNEns Nguyen2018 69.0 68.9 69.1 73.8 37.6
Ren_UAU_task1b_1 ABCNN Ren2018 60.5 60.6 60.3 71.2 30.1
Tchorz_THL_task1b_1 AMS_MFCC Tchorz2018 54.0 55.2 52.8 65.1 12.7
Waldekar_IITKGP_task1b_1 IITKGP_ABSP_Fusion18 Waldekar2018 56.2 54.8 57.7 58.9 29.5
WangJun_BUPT_task1b_1 Attention Jun2018 48.8 47.8 49.8 31.1 33.5
WangJun_BUPT_task1b_2 Attention Jun2018 52.5 48.0 56.9 50.1 38.5
WangJun_BUPT_task1b_3 Attention Jun2018 52.3 49.7 54.8 35.7 36.2

System characteristics

General characteristics

Rank | Code | Technical Report | Accuracy (Eval) | Sampling rate | Data augmentation | Features
Baseline_Surrey_task1b_1 Kong2018 59.6 44.1kHz log-mel energies
Baseline_Surrey_task1b_2 Kong2018 58.8 44.1kHz log-mel energies
DCASE2018 baseline Heittola2018 46.5 44.1kHz log-mel energies
Li_SCUT_task1b_1 Li2018 41.1 48kHz MFCC
Li_SCUT_task1b_2 Li2018 39.5 48kHz MFCC
Li_SCUT_task1b_3 Li2018 42.3 48kHz MFCC
Liping_CQU_task1b_1 Liping2018 67.0 44.1kHz log-mel energies
Liping_CQU_task1b_2 Liping2018 63.2 44.1kHz log-mel energies
Liping_CQU_task1b_3 Liping2018 67.7 44.1kHz log-mel energies
Liping_CQU_task1b_4 Liping2018 67.1 44.1kHz log-mel energies
Nguyen_TUGraz_task1b_1 Nguyen2018 69.0 44.1kHz log-mel energies and their nearest neighbor filtered version
Ren_UAU_task1b_1 Ren2018 60.5 44.1kHz log-mel spectrogram
Tchorz_THL_task1b_1 Tchorz2018 54.0 44.1kHz amplitude modulation spectrogram, MFCC
Waldekar_IITKGP_task1b_1 Waldekar2018 56.2 48kHz MFDWC, CQCC
WangJun_BUPT_task1b_1 Jun2018 48.8 44.1kHz mixup log-mel energies
WangJun_BUPT_task1b_2 Jun2018 52.5 44.1kHz mixup log-mel energies
WangJun_BUPT_task1b_3 Jun2018 52.3 44.1kHz mixup log-mel energies



Machine learning characteristics

Rank | Code | Technical Report | Accuracy (Eval) | Model complexity | Classifier | Ensemble subsystems | Decision making
Baseline_Surrey_task1b_1 Kong2018 59.6 4691274 VGGish 8 layer CNN with global max pooling
Baseline_Surrey_task1b_2 Kong2018 58.8 4309450 VGGish 8 layer CNN with global max pooling
DCASE2018 baseline Heittola2018 46.5 116118 CNN
Li_SCUT_task1b_1 Li2018 41.1 116118 LSTM
Li_SCUT_task1b_2 Li2018 39.5 116118 LSTM
Li_SCUT_task1b_3 Li2018 42.3 116118 LSTM
Liping_CQU_task1b_1 Liping2018 67.0 22758194 Xception
Liping_CQU_task1b_2 Liping2018 63.2 22758194 Xception
Liping_CQU_task1b_3 Liping2018 67.7 22758194 Xception
Liping_CQU_task1b_4 Liping2018 67.1 22758194 Xception
Nguyen_TUGraz_task1b_1 Nguyen2018 69.0 12278040 CNN 12 averaging vote
Ren_UAU_task1b_1 Ren2018 60.5 616800 CNN
Tchorz_THL_task1b_1 Tchorz2018 54.0 15395500 LSTM
Waldekar_IITKGP_task1b_1 Waldekar2018 56.2 20973 SVM 3 fusion
WangJun_BUPT_task1b_1 Jun2018 48.8 4634004 CNN,BGRU,self-attention
WangJun_BUPT_task1b_2 Jun2018 52.5 4634004 CNN,BGRU,self-attention
WangJun_BUPT_task1b_3 Jun2018 52.3 4634004 CNN,BGRU,self-attention

Technical reports

Acoustic Scene Classification Using Ensemble of Convnets

An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan

Abstract

This technical report presents our system for the acoustic scene classification problem in task 1A of the DCASE2018 challenge, whose goal is to classify audio recordings into predefined types of environments. The overall system is an ensemble of ConvNet models working on different audio features separately. Audio signals are processed in both mono and two-channel form before we extract mel-spectrogram and gammatone-based spectrogram features as inputs to the models. All models share almost the same ConvNet structure. Experimental results show that the ensemble system outperforms the baseline by a large margin of 17% on the test data.
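Since the abstract describes late fusion of several ConvNets trained on different features, here is a minimal, hypothetical sketch of that kind of prediction averaging (the models, class count and weighting are illustrative, not the authors' exact configuration):

```python
import numpy as np

# Hypothetical softmax outputs of three ConvNets (e.g. mel, gammatone and stereo
# variants) for one test segment over the 10 scene classes.
rng = np.random.default_rng(0)
model_outputs = rng.dirichlet(np.ones(10), size=3)   # shape: (n_models, n_classes)

# Late fusion: average the class probabilities, then take the arg-max.
fused = model_outputs.mean(axis=0)
predicted_class = int(np.argmax(fused))
print(predicted_class, round(float(fused[predicted_class]), 3))
```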

System characteristics
Input stereo, mono
Sampling rate 48kHz
Features log-mel energies
Classifier Ensemble of Convnet
Decision making average
PDF

Acoustic Scene Classification with Fully Convolutional Neural Networks and I-Vectors

Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer and Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria

Abstract

This technical report describes the CP-JKU team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge. Our approach is still related to the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on spectrograms. However, for our 2018 submission we have put a stronger focus on tuning and pushing the performance of our CNNs. The result of our experiments is a classification system that achieves classification accuracies of around 80% on the public Kaggle-Leaderboard.

System characteristics
Input left, right, difference; left, right
Sampling rate 22.5kHz
Data augmentation mixup; pitch shifting; mixup, pitch shifting
Features perceptual weighted power spectrogram; MFCC; perceptual weighted power spectrogram, MFCC
Classifier CNN, ensemble; i-vector, late fusion; CNN i-vector ensemble; CNN i-vector late fusion ensemble
Decision making average; fusion; late calibrated fusion of averaged i-vector and CNN models; late calibrated fusion
PDF

Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps

Ruben Fraile, Elena Blanco-Martin, Juana M. Gutierrez-Arriola, Nicolas Saenz-Lechon and Victor J. Osma-Ruiz
Research Center on Software Technologies and Multimedia Systems for Sustainability (CITSEM), Universidad Politecnica de Madrid, Madrid, Spain

Abstract

A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard multilayer perceptron with two hidden layers.

System characteristics
Input binaural
Sampling rate 48kHz
Features LTAS, Modulation spectrum, position-pitch maps
Classifier MLP
Decision making sum of log-probabilities
PDF

Acoustic Scene Classification Using Convolutional Neural Networks and Different Channels Representations and Its Fusion

Alexander Golubkov and Alexander Lavrentyev
Saint Petersburg, Russia

Abstract

Deep convolutional neural networks have achieved great results in image classification tasks. In this paper, we used different DCNN architectures for image classification, with spectrograms of different signal representations, such as MFCC, mel-spectrograms and CQT-spectrograms, serving as the images. The final result was obtained using the geometric mean of all the models.

System characteristics
Input left, right, mono, mixed
Sampling rate 48kHz
Features CQT, spectrogram, log-mel, MFCC
Classifier CNN
Decision making mean
PDF

DCASE 2018 Task 1a: Acoustic Scene Classification by Bi-LSTM-CNN-Net Multichannel Fusion

WenJie Hao, Lasheng Zhao, Qiang Zhang, HanYu Zhao and JiaHua Wang
Key Laboratory of Advanced Design and Intelligent Computing (Dalian University), Ministry of Education, Dalian University, Liaoning, China

Abstract

In this study, we provide a solution for the acoustic scene classification task in the DCASE 2018 challenge. A system consisting of bidirectional long short-term memory and convolutional neural networks (Bi-LSTM-CNN) is proposed, with improved logarithmic-scaled mel spectra as input to our system. Besides this, we have adopted a new model fusion mechanism. Finally, to validate the performance of the model and compare it to the baseline system, we used the TUT Acoustic Scene 2018 dataset for training and cross-validation, resulting in a 13.93% improvement over the baseline system.

System characteristics
Input multichannel
Sampling rate 48kHz
Features log-mel energies
Classifier CNN,Bi-Lstm
Decision making max of precision
PDF

A Multi-Device Dataset for Urban Acoustic Scene Classification

Toni Heittola, Annamaria Mesaros and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.
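Most systems in the tables above, including this baseline, take log-mel energies as input. A minimal extraction sketch with librosa follows; the file name, frame length, hop size and number of mel bands are placeholders rather than the baseline's exact settings:

```python
import librosa

# Load one 10-second segment; 44.1 kHz matches the subtask B material.
y, sr = librosa.load("example_segment.wav", sr=44100, mono=True)  # placeholder path

# Mel energy spectrogram followed by log compression.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024, n_mels=40)
log_mel = librosa.power_to_db(mel)   # shape: (n_mels, n_frames)
print(log_mel.shape)
```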

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies
Classifier CNN
PDF

Self-Attention Mechanism Based System for DCASE2018 Challenge Task 1 and Task 4

Wang Jun and Li Shengchen
Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China

Abstract

In this technical report, we apply a self-attention mechanism to Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018) challenge. We take a convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) as our basic systems in Task 1 and Task 4. In this convolutional recurrent neural network (CRNN), gated linear units (GLUs) are used as the non-linearity, implementing a gating mechanism over the output of the network for selecting informative local features. A self-attention mechanism, called intra-attention, is used for modeling the relationship between different positions of a single sequence over the output of the CRNN. An attention-based pooling scheme is used for localizing the specific events in Task 4 and for obtaining the final labels in Task 1. In summary, we obtain 70.81% accuracy on subtask A of Task 1. On subtask B of Task 1, we obtain 70.1% accuracy for device A, 59.4% accuracy for device B, and 55.6% accuracy for device C. For Task 4, we obtain a 26.98% F1 score for sound event detection on the old test data of the development set.
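Several submissions to this subtask (see the Data augmentation column above) use mixup. A minimal sketch of mixup on a batch of feature/label pairs, assuming one-hot labels and a freely chosen alpha (not necessarily the authors' settings):

```python
import numpy as np

def mixup_batch(features, one_hot_labels, alpha=0.2, seed=0):
    """Mix each example with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(features))
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y

# Toy batch: 8 segments, 40 mel bands x 500 frames, 10 scene classes.
x = np.random.rand(8, 40, 500)
y = np.eye(10)[np.random.randint(0, 10, size=8)]
mx, my = mixup_batch(x, y)
print(mx.shape, my.shape)
```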

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN,BGRU,self-attention
PDF

DNN Based Multi-Level Features Ensemble for Acoustic Scene Classification

Jee-weon Jung, Hee-soo Heo, Hye-jin Shim and Ha-jin Yu
School of Computer Science, University of Seoul, Seoul, South Korea

Abstract

Acoustic scenes are defined by various characteristics such as long-term context or short-term events, making it difficult to select input features or pre-processing methods suitable for acoustic scene classification. In this paper, we propose an ensemble model which exploits various input features that vary in their degree of pre-processing: raw waveform without pre-processing, spectrogram, and i-vector, a segment-level low-dimensional representation. To effectively combine deep neural networks that handle different types of input features, we use a separate scoring phase with Gaussian models and support vector machines to extract scores from each individual system, which can be used as a confidence measure. The validity of the proposed framework is tested using the Detection and Classification of Acoustic Scenes and Events 2018 dataset. The proposed framework showed an accuracy of 73.82% using the validation set.

System characteristics
Input binaural
Sampling rate 48kHz
Features raw-waveform, spectrogram, i-vector
Classifier CNN, DNN, GMM, SVM
Decision making score-sum; weighted score-sum
PDF

Acoustic Scene and Event Detection Systems Submitted to DCASE 2018 Challenge

Maksim Khadkevich
AML, Facebook, Menlo Park, CA, USA

Abstract

In this technical report we describe systems that have been submitted to DCASE 2018 [1] challenge. Feature extraction and convolutional neural network (CNN) architecture are outlined. For tasks 1c and 2 we describe transfer learning approach that has been applied. Model training and inference are finally presented.

System characteristics
Input mono
Sampling rate 16kHz
Features log-mel energies
Classifier CNN
PDF

DCASE 2018 Challenge Surrey Cross-Task Convolutional Neural Network Baseline

Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
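As an illustration of the kind of VGG-like CNN with global max pooling that the abstract describes, here is a minimal PyTorch sketch; the layer widths, input shape and pooling choices are assumptions, not the authors' released configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and ReLU, followed by 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.net(x)

class Cnn8(nn.Module):
    """Eight convolutional layers over a log-mel spectrogram, then global max pooling."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(1, 64), ConvBlock(64, 128),
            ConvBlock(128, 256), ConvBlock(256, 512),
        )
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):                    # x: (batch, 1, mel_bins, frames)
        h = self.blocks(x)
        h = torch.amax(h, dim=(2, 3))        # global max pooling over frequency and time
        return self.fc(h)

logits = Cnn8()(torch.randn(2, 1, 64, 431))
print(logits.shape)                          # torch.Size([2, 10])
```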

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier VGGish 8 layer CNN with global max pooling; AlexNetish 4 layer CNN with global max pooling
PDF

Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with CNN and LSTM

Zhitong Li, Liqiang Zhang, Shixuan Du and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China

Abstract

This technical report presents the solutions proposed by the Beijing Institute of Technology Modern Communications Technology Laboratory for the acoustic scene classification of DCASE2018 task1a. Compared to previous years, the data is more diverse, making such tasks more difficult. In order to solve this problem, we use the Deep Scattering Spectra (DSS) features. The traditional features, such as Mel-frequency Cepstral Coefficients (MFCC), often lose information at high frequencies. DSS is a good way to preserve high frequency information. Based on this feature, we propose a network model of Convolutional Neural Network (CNN) and Long Short-term Memory (LSTM) to classify sound scenes. The experimental results show that the proposed feature extraction method and network structure have a good effect on this classification task. From the experimental data, the accuracy increased from 59% to 76%.

System characteristics
Input left,right
Sampling rate 48kHz
Features DSS
Classifier CNN; CNN,DNN
PDF

The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification

YangXiong Li, Xianku Li and Yuhan Zhang
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China

Abstract

In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1a: Acoustic Scene Classification (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short-Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development datasets of DCASE 2018, our systems are superior to the corresponding baseline for task 1a.

System characteristics
Input mono
Sampling rate 48kHz
Features MFCC
Classifier LSTM
PDF

The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification

YangXiong Li, Yuhan Zhang and Xianku Li
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China

Abstract

In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1b: Acoustic Scene Classification with mismatched recording devices (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short-Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development datasets of DCASE 2018, our systems are superior to the corresponding baseline for task 1b.

System characteristics
Input mono
Sampling rate 48kHz
Features MFCC
Classifier LSTM
PDF

Acoustic Scene Classification Using Multi-Scale Features

Yang Liping, Chen Xinxing and Tao Lianjie
College of Optoelectronic Engineering, Chongqing University, Chongqing, China

Abstract

Convolutional neural networks (CNNs) have shown tremendous ability in classification problems, because they can extract abstract features that improve classification performance. In this paper, we use a CNN to compute a feature hierarchy layer by layer. As the layers deepen, the extracted features become more abstract, but the shallow features are also very useful for classification. We therefore propose a method that fuses multi-scale features from different layers, which can improve the performance of acoustic scene classification. In our method, the log-mel features of the audio signal are used as the input of the CNN. In order to reduce the number of parameters, we use Xception as the foundation network, which is a CNN with depthwise separable convolution operations (a depthwise convolution followed by a pointwise convolution), and we modify Xception to fuse multi-scale features. We also introduce the focal loss to further improve classification performance. This method achieves commendable results, whether the audio recordings are collected by the same device (subtask A) or by different devices (subtask B).

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies
Classifier Xception
PDF

Auditory Scene Classification Using Ensemble Learning with Small Audio Feature Space

Tomasz Maka
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland

Abstract

The report presents the results of an analysis of the audio feature space for auditory scene classification. The final small feature set was determined by selecting attributes from various representations. Feature importance was calculated using a Gradient Boosting Machine. A number of classifiers were employed to build the ensemble classification scheme, and majority voting was performed to obtain the final decision. As a result, the proposed solution uses 223 attributes and outperforms the baseline system by over 6 percent.

System characteristics
Input binaural
Sampling rate 48kHz
Features various
Classifier ensemble
Decision making majority vote
PDF

Exploring Deep Vision Models for Acoustic Scene Classification

Octave Mariotti, Matthieu Cord and Olivier Schwander
Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France

Abstract

This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition. In the context of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events 2018, we trained several of these architectures on the task 1 dataset to perform acoustic scene classification. Then, in order to produce more robust predictions, we explored two ensemble methods to aggregate the different model outputs. Our results show a final accuracy of 79% on the development dataset, outperforming the baseline by almost 20%.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
Decision making mean probability; neural network
PDF

Acoustic Scene Classification Using a Convolutional Neural Network Ensemble and Nearest Neighbor Filters

Truc Nguyen and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe

Abstract

This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in subtasks 1A and 1B of the DCASE 2018 challenge. We introduce a nearest neighbor filter applied to the spectrogram, which allows us to emphasize and smooth similar patterns of sound events in a scene. We also propose a variety of CNN models for single-input (SI) and multi-input (MI) channels and three different methods for building a network ensemble. The experimental results show that for subtask 1A the combination of the MI-CNN structures, using both log-mel features and their nearest neighbor filtered version, is slightly more effective than single-input channel CNN models using log-mel features only. The opposite holds for subtask 1B. In addition, the ensemble methods improve the accuracy of the system significantly; the best ensemble method is ensemble selection, which achieves 69.3% for subtask 1A and 63.6% for subtask 1B. This improves over the baseline system by 8.9% and 14.4% for subtasks 1A and 1B, respectively.
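As a rough illustration of nearest-neighbor filtering of a spectrogram, the sketch below replaces each time frame by the mean of its most similar frames, which smooths recurring patterns; the similarity measure and neighbor count are assumptions, not necessarily the authors' choices:

```python
import numpy as np

def nearest_neighbor_filter(spec, n_neighbors=5):
    """Replace each frame of spec (freq_bins x frames) by the mean of its
    n_neighbors most similar frames (cosine similarity), excluding itself."""
    frames = spec / (np.linalg.norm(spec, axis=0, keepdims=True) + 1e-8)
    similarity = frames.T @ frames                # (frames x frames)
    np.fill_diagonal(similarity, -np.inf)         # never pick the frame itself
    filtered = np.empty_like(spec)
    for t in range(spec.shape[1]):
        idx = np.argsort(similarity[t])[-n_neighbors:]
        filtered[:, t] = spec[:, idx].mean(axis=1)
    return filtered

# Toy example: 40 mel bands x 500 frames.
smoothed = nearest_neighbor_filter(np.random.rand(40, 500))
print(smoothed.shape)
```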

System characteristics
Input mono
Sampling rate 48kHz; 44.1kHz
Features log-mel energies and their nearest neighbor filtered version
Classifier CNN
Decision making averaging vote
PDF

Acoustic Scene Classification Using Deep CNN on Raw-Waveform

Tilak Purohit and Atul Agarwal
Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bengaluru, India

Abstract

For acoustic scene classification problems, conventionally Convolutional Neural Networks (CNNs) have been used on handcrafted features like Mel Frequency Cepstral Coefficients, filterbank energies, scaled spectrograms etc. However, recently CNNs have been used on raw waveform for acoustic modeling in speech recognition, though the time-scales of these waveforms are short (of the order of typical phoneme durations - 80-120 ms). In this work, we have exploited the representation learning power of CNNs by using them directly on very long raw acoustic sound waveforms (of durations 0.5-10 sec) for the acoustic scene classification (ASC) task of DCASE and have shown that deep CNNs (of 8-34 layers) can outperform CNNs with similar architecture on handcrafted features.

System characteristics
Input mono
Sampling rate 8kHz
Features raw-waveform
Classifier CNN; DCNN
PDF

Attention-Based Convolutional Neural Networks for Acoustic Scene Classification

Zhao Ren1, Qiuqiang Kong2, Kun Qian1, Mark Plumbley2 and Björn Schuller3
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, UK, 3ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing / GLAM -- Group on Language, Audio \& Music, University of Augsburg, Imperial College London, Augsburg, Germany / London, UK

Abstract

We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log-mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and perform classification. From an attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy on subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is also a significant improvement over the baseline, with accuracies of 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
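In the spirit of the attention pooling described above, here is a minimal PyTorch sketch in which per-frame features receive learned weights and the clip-level representation is their weighted sum; the dimensions and single-head design are illustrative, not the authors' configuration:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted temporal pooling: softmax attention weights over the frames."""
    def __init__(self, feat_dim, n_classes=10):
        super().__init__()
        self.attention = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):                                  # x: (batch, frames, feat_dim)
        weights = torch.softmax(self.attention(x), dim=1)  # (batch, frames, 1)
        clip = (weights * x).sum(dim=1)                    # (batch, feat_dim)
        return self.classifier(clip)

logits = AttentionPooling(feat_dim=128)(torch.randn(4, 250, 128))
print(logits.shape)   # torch.Size([4, 10])
```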

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel spectrogram
Classifier CNN
PDF

Using an Evolutionary Approach to Explore Convolutional Neural Networks for Acoustic Scene Classification

Christian Roletscheck and Tobias Watzka
Human Centered Multimedia, Augsburg University, Augsburg, Germany

Abstract

The successful application of modern deep neural networks is heavily reliant on the chosen architecture and the selection of appropriate hyperparameters. Due to the large number of parameters and the complex inner workings of a neural network, finding a suitable configuration for a given problem turns out to be a rather complex task for a human. In this paper we propose an evolutionary approach to automatically generate a suitable neural network architecture for any given problem. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. We take the DCASE 2018 Challenge as an opportunity to evaluate our algorithm on the task of acoustic scene classification. The best accuracy achieved by our approach was 74.7% on the development dataset.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel spectrogram
Classifier CNN
Decision making majority vote
PDF

Acoustic Scene Classification by Ensemble of Spectrograms Based on Adaptive Temporal Divisions

Yuma Sakashita and Masaki Aono
Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan

Abstract

Many classification tasks using deep learning have improved classification accuracy by using a large amount of training data. However, it is difficult to collect audio data and build a large database. Since training data is restricted in DCASE 2018 Task 1a, unknown acoustic scenes must be predicted from limited training data. From the results of DCASE 2017 [1], we determined that using convolutional neural networks and ensembling multiple networks is an effective means of classifying acoustic scenes. In our method we generate mel-spectrograms from binaural audio, mono audio, and harmonic-percussive source separation (HPSS) audio, adaptively divide the spectrogram in multiple ways, and train 9 neural networks. We further improve accuracy by ensemble learning using these outputs. The classification result of the proposed system was 0.769 for the development dataset and 0.796 for the leaderboard dataset.
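Since the abstract mentions harmonic-percussive source separation (HPSS) as one of the input variants, here is a minimal librosa sketch; the file name and mel parameters are placeholders:

```python
import librosa

# Split a segment into harmonic and percussive components and compute a
# log-mel spectrogram for each, as two of the ensemble's input variants.
y, sr = librosa.load("example_segment.wav", sr=44100, mono=True)  # placeholder path
y_harmonic, y_percussive = librosa.effects.hpss(y)

log_mel_h = librosa.power_to_db(librosa.feature.melspectrogram(y=y_harmonic, sr=sr, n_mels=64))
log_mel_p = librosa.power_to_db(librosa.feature.melspectrogram(y=y_percussive, sr=sr, n_mels=64))
print(log_mel_h.shape, log_mel_p.shape)
```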

System characteristics
Input mono, binaural
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making random forest
PDF

CNN Based System for Acoustic Scene Classification

Lee Sangwon, Kang Seungtae and Jang Gin-jin
School of Electronics Engineering, Kyungpook National University, Daegu, Korea

Abstract

Convolutional neural networks (CNNs) have achieved great success in many machine learning tasks such as classifying visual objects or various audio sounds. In this report, we describe our CNN-based system implementation for the acoustic scene classification task of DCASE 2018. The classification accuracies of the proposed system are 72.4% and 75.5% on the development and leaderboard datasets, respectively.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN
Decision making majority vote
PDF

Combination of Amplitude Modulation Spectrogram Features and MFCCs for Acoustic Scene Classification

Juergen Tchorz
Institute for Acoustics, University of Applied Sciences Luebeck, Luebeck, Germany

Abstract

This report describes an approach for acoustic scene classification and its results on the development data set of the DCASE 2018 challenge. Amplitude modulation spectrograms (AMS), which mimic important aspects of the auditory system, are used as features, in combination with mel-scale cepstral coefficients, which have been shown to be complementary to AMS features. For classification, a long short-term memory deep neural network is used. The proposed system outperforms the baseline system by 6.3-9.3% on the development data test subset, depending on the recording device.

System characteristics
Sampling rate 44.1kHz
Features amplitude modulation spectrogram, MFCC
Classifier LSTM
PDF

Wavelet-Based Audio Features for Acoustic Scene Classification

Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract

This report describes a submission to the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 for Task 1 (acoustic scene classification (ASC)), sub-task A (basic ASC) and sub-task B (ASC with mismatched recording devices). We use two wavelet-based features in a score-fusion framework to achieve this goal. The first feature applies a wavelet transform to log mel-band energies, while the second applies a high-Q wavelet transform to frames of the raw signal. The two features are found to be complementary, such that the fused system relatively outperforms the deep-learning based baseline system by 17% for sub-task A and 26% for sub-task B on the development dataset provided for the respective sub-tasks.

System characteristics
Input mono
Sampling rate 48kHz
Features MFDWC, CQCC
Classifier SVM
Decision making fusion
PDF

SE-ResNet with GAN-Based Data Augmentation Applied to Acoustic Scene Classification

Jeong Hyeon Yang, Nam Kyun Kim and Hong Kook Kim
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea

Abstract

This report describes our contribution to the development of audio scene classification methods for the DCASE 2018 Challenge Task 1A. The proposed systems for this task are based on generative adversarial network (GAN)-based data augmentation and various convolutional networks such as residual networks (ResNets) and squeeze-and-excitation residual networks (SE-ResNets). In addition to data augmentation, the SE-ResNets are revised so that they operate on the log-mel spectrogram domain, and the numbers of layers and kernels are adjusted to provide better performance on the task. Finally, an ensemble method is applied using a four-fold cross-validated training dataset. Consequently, the proposed audio scene classification system improves class-wise accuracy by 10% compared to the baseline system in the Kaggle competition for acoustic scene classification.

System characteristics
Input mixed
Sampling rate 48kHz
Data augmentation GAN
Features log-mel spectrogram
Classifier CNN, ensemble
Decision making mean probability
PDF

Convolutional Neural Networks and X-Vector Embedding for DCASE2018 Acoustic Scene Classification Challenge

Hossein Zeinali, Lukas Burget and Honza Cernocky
BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic

Abstract

In this report, the BUT team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described, along with an analysis of the performance of the different methods on the development set. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is the common two-dimensional CNN, which is mainly used in image classification tasks. The second is a one-dimensional CNN for extracting embeddings from the neural network, which is very common in speech processing, especially for speaker recognition. In addition to the topologies, two types of features are used in this task, log-domain Mel-spectrograms and CQT features, which are explained in detail in the report. Finally, the outputs of the different systems are fused using a weighted average.

System characteristics
Input mono, binaural
Sampling rate 48kHz
Data augmentation block mixing
Features log-mel energies, CQT
Classifier CNN, x-vector, ensemble
Decision making weighted average
PDF

Acoustic Scene Classification Using Multi-Layered Temporal Pooling Based on Deep Convolutional Neural Network

Liwen Zhang and Jiqing Han
Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China

Abstract

The performance of an Acoustic Scene Classification (ASC) system depends strongly on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layered temporal pooling method that uses a CNN feature sequence as input and can effectively capture the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence and its time indexes. We applied our novel framework to DCASE 2018 task 1, ASC. For evaluation, we trained a Support Vector Machine (SVM) with the proposed Multi-Layered Temporal Pooling (MLTP) learned features. Experimental results on the development dataset show that usage of the MLTP features significantly improved ASC performance. The best performance, with 75.28% accuracy, was achieved using the optimal setting found in our experiments.

System characteristics
Input mono
Sampling rate 48kHz
Features log-mel energies
Classifier CNN, SVR, SVM
Decision making only one SVM
PDF