Task description
This subtask addresses the situation in which an application will be tested with several different types of devices, possibly not the same as the ones used to record the development data.
The development data consists of the same recordings as in subtask A, and a small amount of parallel data recorded with devices B and C. The amount of data is as follows:
- Device A: 24 hours (8640 segments, same as subtask A, but resampled and single-channel)
- Device B: 2 hours (72 segments per acoustic scene)
- Device C: 2 hours (72 segments per acoustic scene)
The 2 hours of data recorded with devices B and C are parallel recordings, and the same content is also available as recorded with device A. The training/test setup was created such that approximately 70% of recording locations for each city and each scene class are in the training subset, considering only device A. The training subset contains 6122 segments from device A, 540 segments from device B, and 540 segments from device C. The test subset contains 2518 segments from device A, 180 segments from device B, and 180 segments from device C.
A more detailed task description can be found on the task description page.
Systems ranking
| Submission code | Submission name | Technical Report | Accuracy (B/C) with 95% confidence interval (Evaluation dataset) | Accuracy (B/C) (Development dataset) | Accuracy (B/C) (Leaderboard dataset) |
| --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | SurreyCNN8 | Kong2018 | 59.6 (58.5 - 60.7) | 57.2 | 56.8 |
| Baseline_Surrey_task1b_2 | SurreyCNN4 | Kong2018 | 58.8 (57.7 - 59.9) | 57.5 | 57.8 |
| DCASE2018 baseline | Baseline | Heittola2018 | 46.5 (45.4 - 47.6) | 45.6 | 45.0 |
| Li_SCUT_task1b_1 | Li_SCUT | Li2018 | 41.1 (40.0 - 42.2) | 51.7 | |
| Li_SCUT_task1b_2 | Li_SCUT | Li2018 | 39.5 (38.4 - 40.6) | 53.9 | |
| Li_SCUT_task1b_3 | Li_SCUT | Li2018 | 42.3 (41.2 - 43.4) | 51.7 | |
| Liping_CQU_task1b_1 | Xception | Liping2018 | 67.0 (66.0 - 68.1) | 77.6 | 65.8 |
| Liping_CQU_task1b_2 | Xception | Liping2018 | 63.2 (62.1 - 64.3) | 77.6 | 63.1 |
| Liping_CQU_task1b_3 | Xception | Liping2018 | 67.7 (66.6 - 68.7) | 77.6 | 67.2 |
| Liping_CQU_task1b_4 | Xception | Liping2018 | 67.1 (66.1 - 68.2) | 77.6 | 66.5 |
| Nguyen_TUGraz_task1b_1 | NNF_CNNEns | Nguyen2018 | 69.0 (68.0 - 70.0) | 63.6 | 67.3 |
| Ren_UAU_task1b_1 | ABCNN | Ren2018 | 60.5 (59.4 - 61.5) | 58.3 | 58.9 |
| Tchorz_THL_task1b_1 | AMS_MFCC | Tchorz2018 | 54.0 (52.9 - 55.1) | 63.8 | |
| Waldekar_IITKGP_task1b_1 | IITKGP_ABSP_Fusion18 | Waldekar2018 | 56.2 (55.1 - 57.3) | 57.8 | |
| WangJun_BUPT_task1b_1 | Attention | Jun2018 | 48.8 (47.7 - 49.9) | 69.0 | 49.4 |
| WangJun_BUPT_task1b_2 | Attention | Jun2018 | 52.5 (51.4 - 53.6) | 69.0 | 49.4 |
| WangJun_BUPT_task1b_3 | Attention | Jun2018 | 52.3 (51.2 - 53.4) | 69.0 | 49.4 |
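The reported confidence intervals are binomial proportion intervals over the evaluation segments. A minimal sketch of how such an interval can be computed, assuming a normal-approximation (Wald) interval and a hypothetical segment count; the exact method and evaluation-set size used by the challenge may differ:

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for a classification accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p - half_width, p + half_width

# Hypothetical example: 4000 correct decisions out of 6710 evaluation segments.
low, high = accuracy_confidence_interval(correct=4000, total=6710)
print(f"accuracy = {4000 / 6710:.1%}, 95% CI = ({low:.1%}, {high:.1%})")
```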
Teams ranking
| Submission code | Submission name | Technical Report | Accuracy with 95% confidence interval (Evaluation dataset) | Accuracy (Development dataset) | Accuracy (Leaderboard dataset) |
| --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | SurreyCNN8 | Kong2018 | 59.6 (58.5 - 60.7) | 57.2 | 56.8 |
| DCASE2018 baseline | Baseline | Heittola2018 | 46.5 (45.4 - 47.6) | 45.6 | 45.0 |
| Li_SCUT_task1b_3 | Li_SCUT | Li2018 | 42.3 (41.2 - 43.4) | 51.7 | |
| Liping_CQU_task1b_3 | Xception | Liping2018 | 67.7 (66.6 - 68.7) | 77.6 | 67.2 |
| Nguyen_TUGraz_task1b_1 | NNF_CNNEns | Nguyen2018 | 69.0 (68.0 - 70.0) | 63.6 | 67.3 |
| Ren_UAU_task1b_1 | ABCNN | Ren2018 | 60.5 (59.4 - 61.5) | 58.3 | 58.9 |
| Tchorz_THL_task1b_1 | AMS_MFCC | Tchorz2018 | 54.0 (52.9 - 55.1) | 63.8 | |
| Waldekar_IITKGP_task1b_1 | IITKGP_ABSP_Fusion18 | Waldekar2018 | 56.2 (55.1 - 57.3) | 57.8 | |
| WangJun_BUPT_task1b_2 | Attention | Jun2018 | 52.5 (51.4 - 53.6) | 69.0 | 49.4 |
Class-wise performance
| Submission code | Submission name | Technical Report | Accuracy (Evaluation dataset) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | SurreyCNN8 | Kong2018 | 59.6 | 47.7 | 65.0 | 51.0 | 58.2 | 86.4 | 36.7 | 58.5 | 49.9 | 80.2 | 62.1 |
| Baseline_Surrey_task1b_2 | SurreyCNN4 | Kong2018 | 58.8 | 49.6 | 63.9 | 56.9 | 51.9 | 74.5 | 36.7 | 64.4 | 41.8 | 78.4 | 69.6 |
| DCASE2018 baseline | Baseline | Heittola2018 | 46.5 | 61.6 | 56.7 | 45.3 | 40.0 | 61.1 | 15.4 | 51.8 | 32.4 | 69.8 | 30.4 |
| Li_SCUT_task1b_1 | Li_SCUT | Li2018 | 41.1 | 33.3 | 53.0 | 33.1 | 30.1 | 48.9 | 43.3 | 38.1 | 31.4 | 47.5 | 52.1 |
| Li_SCUT_task1b_2 | Li_SCUT | Li2018 | 39.5 | 27.4 | 66.5 | 17.3 | 35.7 | 52.5 | 40.7 | 41.0 | 23.2 | 50.6 | 39.9 |
| Li_SCUT_task1b_3 | Li_SCUT | Li2018 | 42.3 | 34.7 | 66.8 | 27.8 | 32.2 | 51.0 | 52.3 | 48.4 | 23.0 | 47.6 | 39.3 |
| Liping_CQU_task1b_1 | Xception | Liping2018 | 67.0 | 57.6 | 71.3 | 65.7 | 69.6 | 82.4 | 57.1 | 73.4 | 39.8 | 82.3 | 71.1 |
| Liping_CQU_task1b_2 | Xception | Liping2018 | 63.2 | 40.9 | 73.7 | 59.8 | 68.4 | 84.2 | 34.8 | 77.8 | 42.9 | 87.5 | 61.9 |
| Liping_CQU_task1b_3 | Xception | Liping2018 | 67.7 | 63.1 | 72.0 | 59.5 | 71.7 | 86.0 | 52.3 | 74.0 | 42.2 | 80.4 | 75.4 |
| Liping_CQU_task1b_4 | Xception | Liping2018 | 67.1 | 62.6 | 71.7 | 59.8 | 69.9 | 88.9 | 48.2 | 71.1 | 44.6 | 81.7 | 72.7 |
| Nguyen_TUGraz_task1b_1 | NNF_CNNEns | Nguyen2018 | 69.0 | 67.0 | 86.9 | 57.6 | 56.9 | 93.9 | 45.6 | 69.8 | 53.3 | 85.1 | 73.9 |
| Ren_UAU_task1b_1 | ABCNN | Ren2018 | 60.5 | 44.6 | 79.3 | 52.3 | 61.4 | 81.2 | 29.2 | 64.0 | 58.8 | 81.3 | 52.7 |
| Tchorz_THL_task1b_1 | AMS_MFCC | Tchorz2018 | 54.0 | 44.4 | 64.5 | 45.1 | 43.9 | 76.6 | 42.6 | 57.6 | 37.2 | 70.8 | 57.1 |
| Waldekar_IITKGP_task1b_1 | IITKGP_ABSP_Fusion18 | Waldekar2018 | 56.2 | 39.3 | 62.4 | 51.1 | 54.9 | 73.1 | 40.2 | 72.0 | 41.4 | 78.4 | 49.6 |
| WangJun_BUPT_task1b_1 | Attention | Jun2018 | 48.8 | 37.0 | 57.2 | 40.5 | 60.9 | 86.9 | 23.5 | 50.4 | 16.2 | 67.2 | 48.0 |
| WangJun_BUPT_task1b_2 | Attention | Jun2018 | 52.5 | 70.7 | 55.4 | 59.8 | 44.6 | 76.3 | 46.6 | 48.6 | 2.1 | 73.5 | 47.0 |
| WangJun_BUPT_task1b_3 | Attention | Jun2018 | 52.3 | 51.4 | 58.5 | 47.7 | 59.3 | 87.8 | 30.2 | 52.7 | 12.8 | 71.5 | 50.9 |
Device-wise performance
Accuracy on the evaluation dataset, per recording device:

| Submission code | Submission name | Technical Report | Average (Devices B/C) | Device B | Device C | Device A | Device D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | SurreyCNN8 | Kong2018 | 59.6 | 59.5 | 59.6 | 69.1 | 32.1 |
| Baseline_Surrey_task1b_2 | SurreyCNN4 | Kong2018 | 58.8 | 58.7 | 58.8 | 70.6 | 33.8 |
| DCASE2018 baseline | Baseline | Heittola2018 | 46.5 | 45.9 | 47.0 | 63.6 | 27.5 |
| Li_SCUT_task1b_1 | Li_SCUT | Li2018 | 41.1 | 42.2 | 39.9 | 54.2 | 20.3 |
| Li_SCUT_task1b_2 | Li_SCUT | Li2018 | 39.5 | 39.8 | 39.1 | 55.7 | 13.2 |
| Li_SCUT_task1b_3 | Li_SCUT | Li2018 | 42.3 | 43.3 | 41.3 | 55.7 | 18.5 |
| Liping_CQU_task1b_1 | Xception | Liping2018 | 67.0 | 66.8 | 67.3 | 73.7 | 45.4 |
| Liping_CQU_task1b_2 | Xception | Liping2018 | 63.2 | 63.9 | 62.5 | 72.2 | 45.8 |
| Liping_CQU_task1b_3 | Xception | Liping2018 | 67.7 | 67.8 | 67.5 | 73.9 | 48.8 |
| Liping_CQU_task1b_4 | Xception | Liping2018 | 67.1 | 67.6 | 66.7 | 73.6 | 47.8 |
| Nguyen_TUGraz_task1b_1 | NNF_CNNEns | Nguyen2018 | 69.0 | 68.9 | 69.1 | 73.8 | 37.6 |
| Ren_UAU_task1b_1 | ABCNN | Ren2018 | 60.5 | 60.6 | 60.3 | 71.2 | 30.1 |
| Tchorz_THL_task1b_1 | AMS_MFCC | Tchorz2018 | 54.0 | 55.2 | 52.8 | 65.1 | 12.7 |
| Waldekar_IITKGP_task1b_1 | IITKGP_ABSP_Fusion18 | Waldekar2018 | 56.2 | 54.8 | 57.7 | 58.9 | 29.5 |
| WangJun_BUPT_task1b_1 | Attention | Jun2018 | 48.8 | 47.8 | 49.8 | 31.1 | 33.5 |
| WangJun_BUPT_task1b_2 | Attention | Jun2018 | 52.5 | 48.0 | 56.9 | 50.1 | 38.5 |
| WangJun_BUPT_task1b_3 | Attention | Jun2018 | 52.3 | 49.7 | 54.8 | 35.7 | 36.2 |
System characteristics
General characteristics
| Code | Technical Report | Accuracy (Eval) | Sampling rate | Data augmentation | Features |
| --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | Kong2018 | 59.6 | 44.1kHz | | log-mel energies |
| Baseline_Surrey_task1b_2 | Kong2018 | 58.8 | 44.1kHz | | log-mel energies |
| DCASE2018 baseline | Heittola2018 | 46.5 | 44.1kHz | | log-mel energies |
| Li_SCUT_task1b_1 | Li2018 | 41.1 | 48kHz | | MFCC |
| Li_SCUT_task1b_2 | Li2018 | 39.5 | 48kHz | | MFCC |
| Li_SCUT_task1b_3 | Li2018 | 42.3 | 48kHz | | MFCC |
| Liping_CQU_task1b_1 | Liping2018 | 67.0 | 44.1kHz | | log-mel energies |
| Liping_CQU_task1b_2 | Liping2018 | 63.2 | 44.1kHz | | log-mel energies |
| Liping_CQU_task1b_3 | Liping2018 | 67.7 | 44.1kHz | | log-mel energies |
| Liping_CQU_task1b_4 | Liping2018 | 67.1 | 44.1kHz | | log-mel energies |
| Nguyen_TUGraz_task1b_1 | Nguyen2018 | 69.0 | 44.1kHz | | log-mel energies and their nearest neighbor filtered version |
| Ren_UAU_task1b_1 | Ren2018 | 60.5 | 44.1kHz | | log-mel spectrogram |
| Tchorz_THL_task1b_1 | Tchorz2018 | 54.0 | 44.1kHz | | amplitude modulation spectrogram, MFCC |
| Waldekar_IITKGP_task1b_1 | Waldekar2018 | 56.2 | 48kHz | | MFDWC, CQCC |
| WangJun_BUPT_task1b_1 | Jun2018 | 48.8 | 44.1kHz | mixup | log-mel energies |
| WangJun_BUPT_task1b_2 | Jun2018 | 52.5 | 44.1kHz | mixup | log-mel energies |
| WangJun_BUPT_task1b_3 | Jun2018 | 52.3 | 44.1kHz | mixup | log-mel energies |
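Mixup, the only data augmentation reported in the table above, blends pairs of training examples and their labels. A minimal sketch of the technique under standard assumptions, independent of any particular submission:

```python
import numpy as np

def mixup_batch(features, labels, alpha=0.2, rng=None):
    """Blend a batch with a shuffled copy of itself.

    features: array of shape (N, ...); labels: one-hot array of shape (N, n_classes).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient drawn from Beta(alpha, alpha)
    perm = rng.permutation(len(features))   # random pairing of examples
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y
```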
Machine learning characteristics
| Code | Technical Report | Accuracy (Eval) | Model complexity (parameters) | Classifier | Ensemble subsystems | Decision making |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline_Surrey_task1b_1 | Kong2018 | 59.6 | 4691274 | VGGish 8 layer CNN with global max pooling | | |
| Baseline_Surrey_task1b_2 | Kong2018 | 58.8 | 4309450 | VGGish 8 layer CNN with global max pooling | | |
| DCASE2018 baseline | Heittola2018 | 46.5 | 116118 | CNN | | |
| Li_SCUT_task1b_1 | Li2018 | 41.1 | 116118 | LSTM | | |
| Li_SCUT_task1b_2 | Li2018 | 39.5 | 116118 | LSTM | | |
| Li_SCUT_task1b_3 | Li2018 | 42.3 | 116118 | LSTM | | |
| Liping_CQU_task1b_1 | Liping2018 | 67.0 | 22758194 | Xception | | |
| Liping_CQU_task1b_2 | Liping2018 | 63.2 | 22758194 | Xception | | |
| Liping_CQU_task1b_3 | Liping2018 | 67.7 | 22758194 | Xception | | |
| Liping_CQU_task1b_4 | Liping2018 | 67.1 | 22758194 | Xception | | |
| Nguyen_TUGraz_task1b_1 | Nguyen2018 | 69.0 | 12278040 | CNN | 12 | averaging vote |
| Ren_UAU_task1b_1 | Ren2018 | 60.5 | 616800 | CNN | | |
| Tchorz_THL_task1b_1 | Tchorz2018 | 54.0 | 15395500 | LSTM | | |
| Waldekar_IITKGP_task1b_1 | Waldekar2018 | 56.2 | 20973 | SVM | 3 | fusion |
| WangJun_BUPT_task1b_1 | Jun2018 | 48.8 | 4634004 | CNN, BGRU, self-attention | | |
| WangJun_BUPT_task1b_2 | Jun2018 | 52.5 | 4634004 | CNN, BGRU, self-attention | | |
| WangJun_BUPT_task1b_3 | Jun2018 | 52.3 | 4634004 | CNN, BGRU, self-attention | | |
Technical reports
Acoustic Scene Classification Using Ensemble of Convnets
An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, Deep Learning and Media System Laboratory, National Central University, Taoyuan, Taiwan
Dang_NCU_task1a_1 Dang_NCU_task1a_2 Dang_NCU_task1a_3
Abstract
This technical report presents our system for the acoustic scene classification problem in Task 1A of the DCASE 2018 challenge, whose goal is to classify audio recordings into predefined types of environments. The overall system is an ensemble of ConvNet models working separately on different audio features. Audio signals are processed in both mono and two-channel form before we extract mel-spectrogram and gammatone-based spectrogram features as inputs to the models. All models share almost the same ConvNet structure. Experimental results show that the ensemble system outperforms the baseline by a large margin of 17% on the test data.
System characteristics
Input | stereo, mono |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | Ensemble of Convnet |
Decision making | average |
Acoustic Scene Classification with Fully Convolutional Neural Networks and I-Vectors
Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Heindl Christop, Paischer Fabian and Widmer Gerhard
Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Dorfer_CPJKU_task1a_1 Dorfer_CPJKU_task1a_2 Dorfer_CPJKU_task1a_3 Dorfer_CPJKU_task1a_4
Abstract
This technical report describes the CP-JKU team's submissions for Task 1 - Subtask A (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge. Our approach is still related to the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on spectrograms. However, for our 2018 submission we have put a stronger focus on tuning and pushing the performance of our CNNs. The result of our experiments is a classification system that achieves classification accuracies of around 80% on the public Kaggle-Leaderboard.
System characteristics
Input | left, right, difference; left, right |
Sampling rate | 22.5kHz |
Data augmentation | mixup; pitch shifting; mixup, pitch shifting |
Features | perceptual weighted power spectrogram; MFCC; perceptual weighted power spectrogram, MFCC |
Classifier | CNN, ensemble; i-vector, late fusion; CNN i-vector ensemble; CNN i-vector late fusion ensemble |
Decision making | average; fusion; late calibrated fusion of averaged i-vector and CNN models; late calibrated fusion |
Classification of Acoustic Scenes Based on Modulation Spectra and Position-Pitch Maps
Ruben Fraile, Elena Blanco-Martin, Juana M. Gutierrez-Arriola, Nicolas Saenz-Lechon and Victor J. Osma-Ruiz
Research Center on Software Technologies and Multimedia Systems for Sustainability (CITSEM), Universidad Politecnica de Madrid, Madrid, Spain
Fraile_UPM_task1a_1
Abstract
A system for the automatic classification of acoustic scenes is proposed that uses the stereophonic signal captured by a binaural microphone. This system uses one channel for calculating the spectral distribution of energy across auditory-relevant frequency bands. It further obtains some descriptors of the envelope modulation spectrum (EMS) by applying the discrete cosine transform to the logarithm of the EMS. The availability of the two-channel binaural recordings is used for representing the spatial distribution of acoustic sources by means of position-pitch maps. These maps are further parametrized using the two-dimensional Fourier transform. These three types of features (energy spectrum, EMS and position-pitch maps) are used as inputs for a standard multilayer perceptron with two hidden layers.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Features | LTAS, Modulation spectrum, position-pitch maps |
Classifier | MLP |
Decision making | sum of log-probabilities |
Acoustic Scene Classification Using Convolutional Neural Networks and Different Channels Representations and Its Fusion
Alexander Golubkov and Alexander Lavrentyev
Saint Petersburg, Russia
Golubkov_SPCH_task1a_1
Abstract
Deep convolutional neural networks (DCNNs) have achieved great results in image classification tasks. In this paper, we used different DCNN architectures from image classification. As images we used spectrograms of different signal representations, such as MFCCs, mel-spectrograms and CQT-spectrograms. The final result was obtained using the geometric mean of all the models.
System characteristics
Input | left, right, mono, mixed |
Sampling rate | 48kHz |
Features | CQT, spectrogram, log-mel, MFCC |
Classifier | CNN |
Decision making | mean |
DCASE 2018 Task 1a: Acoustic Scene Classification by Bi-LSTM-CNN-Net Multichannel Fusion
WenJie Hao, Lasheng Zhao, Qiang Zhang, HanYu Zhao and JiaHua Wang
Key Laboratory of Advanced Design and Intelligent Computing(Dalian University), Ministry of Education, Dalian University, Liaoning, China
Zhao_DLU_task1a_1
Abstract
In this study, we provide a solution for the acoustic scene classification task in the DCASE 2018 challenge. A system consisting of bidirectional long short-term memory and convolutional neural networks (Bi-LSTM-CNN) is proposed, with improved logarithmically scaled mel spectra as input. In addition, we adopt a new model fusion mechanism. Finally, to validate the performance of the model and compare it to the baseline system, we used the TUT Acoustic Scene 2018 dataset for training and cross-validation, resulting in a 13.93% improvement over the baseline system.
System characteristics
Input | multichannel |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | CNN,Bi-Lstm |
Decision making | max of precision |
A Multi-Device Dataset for Urban Acoustic Scene Classification
Toni Heittola, Annamaria Mesaros and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.
System characteristics
Input | mono |
Sampling rate | 48kHz; 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
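Log-mel energies are the dominant feature among the systems in this subtask. A minimal extraction sketch using librosa; the frame length, hop size, and number of mel bands below are placeholder values and may differ from the baseline configuration:

```python
import librosa
import numpy as np

def log_mel_energies(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=40):
    """Load a mono clip and compute a log-scaled mel-band energy matrix (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```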
Self-Attention Mechanism Based System for DCASE2018 Challenge Task 1 and Task 4
Wang Jun1 and Li Shengchen2
1Institute of Information Photonics and Optical Communication, Beijing, China, 2Institute of Information Photonics and Optical Communication, Beijing University of Posts and Telecommunications, Beijing, China
WangJun_BUPT_task1a_1 WangJun_BUPT_task1a_2 WangJun_BUPT_task1a_3 WangJun_BUPT_task1b_1 WangJun_BUPT_task1b_2 WangJun_BUPT_task1b_3
Abstract
In this technical report, we present a self-attention mechanism for Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. We take a convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) as our basic systems in Task 1 and Task 4. In this convolutional recurrent neural network (CRNN), gated linear units (GLUs) are used as the non-linearity, implementing a gating mechanism over the output of the network for selecting informative local features. A self-attention mechanism called intra-attention is used for modeling relationships between different positions of a single sequence over the output of the CRNN. An attention-based pooling scheme is used for localizing the specific events in Task 4 and for obtaining the final labels in Task 1. In summary, we obtain 70.81% accuracy in subtask 1 of Task 1. In subtask 2 of Task 1, we obtain 70.1% accuracy for device A, 59.4% for device B, and 55.6% for device C. For Task 4, we obtain a 26.98% F1 score for sound event detection on the old test data of the development set.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN,BGRU,self-attention |
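The attention-based pooling described in the abstract above aggregates a frame-level feature sequence into a single clip-level prediction by weighting frames. A minimal PyTorch sketch of such pooling with hypothetical dimensions, illustrating the general mechanism rather than the exact submitted architecture:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a (batch, time, features) sequence into clip-level class scores."""
    def __init__(self, n_features=256, n_classes=10):
        super().__init__()
        self.attention = nn.Linear(n_features, n_classes)   # frame-wise attention scores
        self.classifier = nn.Linear(n_features, n_classes)  # frame-wise class scores

    def forward(self, x):                                    # x: (batch, time, features)
        weights = torch.softmax(self.attention(x), dim=1)    # normalize over the time axis
        frame_probs = torch.sigmoid(self.classifier(x))
        return (weights * frame_probs).sum(dim=1)            # (batch, n_classes)

# Hypothetical usage on a CRNN output of 100 frames with 256 features per frame.
pooled = AttentionPooling()(torch.randn(8, 100, 256))
```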
DNN Based Multi-Level Features Ensemble for Acoustic Scene Classification
Jee-weon Jung, Hee-soo Heo, Hye-jin Shim and Ha-jin Yu
School of Computer Science, University of Seoul, Seoul, South Korea
Jung_UOS_task1a_1 Jung_UOS_task1a_2 Jung_UOS_task1a_3 Jung_UOS_task1a_4
Abstract
Acoustic scenes are defined by various characteristics such as long-term context or short-term events, making it difficult to select input features or pre-processing methods suitable for acoustic scene classification. In this paper, we propose an ensemble model which exploits various input features that vary in their degree of pre-processing: raw waveform without pre-processing, spectrogram, and i-vector, a segment-level low-dimensional representation. To effectively combine deep neural networks that handle different types of input features, we use a separate scoring phase in which Gaussian models and support vector machines extract scores from each individual system that can be used as a confidence measure. The validity of the proposed framework is tested using the Detection and Classification of Acoustic Scenes and Events 2018 dataset. The proposed framework showed an accuracy of 73.82% using the validation set.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Features | raw-waveform, spectrogram, i-vector |
Classifier | CNN, DNN, GMM, SVM |
Decision making | score-sum; weighted score-sum |
Acoustic Scene and Event Detection Systems Submitted to DCASE 2018 Challenge
Maksim Khadkevich
AML, Facebook, Menlo Park, CA, USA
Khadkevich_FB_task1a_1 Khadkevich_FB_task1a_2 Khadkevich_FB_task1c_1 Khadkevich_FB_task1c_2
Abstract
In this technical report we describe systems that have been submitted to the DCASE 2018 [1] challenge. Feature extraction and the convolutional neural network (CNN) architecture are outlined. For tasks 1c and 2 we describe the transfer learning approach that has been applied. Model training and inference are finally presented.
System characteristics
Input | mono |
Sampling rate | 16kHz |
Features | log-mel energies |
Classifier | CNN |
DCASE 2018 Challenge Surrey Cross-Task Convolutional Neural Network Baseline
Qiuqiang Kong, Iqbal Turab, Xu Yong, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Baseline_Surrey_task1a_1 Baseline_Surrey_task1a_2 Baseline_Surrey_task1b_1 Baseline_Surrey_task1b_2
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | VGGish 8 layer CNN with global max pooling; AlexNetish 4 layer CNN with global max pooling |
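The 8-layer VGG-style CNN with global max pooling listed above can be pictured as four blocks of paired 3x3 convolutions followed by pooling, then a global max over the remaining time-frequency map. A hedged sketch of this layout; the filter counts and pooling choices are assumptions and may not match the released Surrey baseline exactly:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with batch norm and ReLU, followed by 2x2 pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.AvgPool2d(2),
    )

class Cnn8(nn.Module):
    """VGG-like 8-layer CNN over (batch, 1, mel_bands, frames) log-mel inputs."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128), conv_block(128, 256), conv_block(256, 512),
        )
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):
        h = self.features(x)                 # (batch, 512, mel', frames')
        h = torch.amax(h, dim=(2, 3))        # global max pooling over time and frequency
        return self.fc(h)

# Hypothetical 64-band, 320-frame log-mel input.
logits = Cnn8()(torch.randn(4, 1, 64, 320))
```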
Acoustic Scene Classification Based on Binaural Deep Scattering Spectra with CNN and LSTM
Zhitong Li, Liqiang Zhang, Shixuan Du and Wei Liu
Laboratory of Modern Communication, Beijing Institute of Technology, Beijing, China
Li_BIT_task1a_1 Li_BIT_task1a_2 Li_BIT_task1a_3 Li_BIT_task1a_4
Abstract
This technical report presents the solutions proposed by the Beijing Institute of Technology Modern Communications Technology Laboratory for the acoustic scene classification of DCASE2018 task1a. Compared to previous years, the data is more diverse, making such tasks more difficult. In order to solve this problem, we use the Deep Scattering Spectra (DSS) features. The traditional features, such as Mel-frequency Cepstral Coefficients (MFCC), often lose information at high frequencies. DSS is a good way to preserve high frequency information. Based on this feature, we propose a network model of Convolutional Neural Network (CNN) and Long Short-term Memory (LSTM) to classify sound scenes. The experimental results show that the proposed feature extraction method and network structure have a good effect on this classification task. From the experimental data, the accuracy increased from 59% to 76%.
System characteristics
Input | left,right |
Sampling rate | 48kHz |
Features | DSS |
Classifier | CNN; CNN,DNN |
The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification
YangXiong Li, Xianku Li and Yuhan Zhang
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China
Li_SCUT_task1a_1 Li_SCUT_task1a_2 Li_SCUT_task1a_3 Li_SCUT_task1a_4
Abstract
In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1A: Acoustic Scene Classification (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short-Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development dataset of DCASE 2018, our systems are superior to the corresponding baseline for task 1A.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | MFCC |
Classifier | LSTM |
The SEIE-SCUT Systems for Challenge on DCASE 2018: Deep Learning Techniques for Audio Representation and Classification
YangXiong Li, Yuhan Zhang and Xianku Li
Laboratory of Signal Processing, South China University of Technology, Guangzhou, China
Li_SCUT_task1b_1 Li_SCUT_task1b_2 Li_SCUT_task1b_3
Abstract
In this report, we present our work on one task of the DCASE 2018 challenge, i.e. task 1B: Acoustic Scene Classification with mismatched recording devices (ASC). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bidirectional Long Short-Term Memory (BLSTM) fed by the DAF is built for ASC. Evaluated on the development dataset of DCASE 2018, our systems are superior to the corresponding baseline for task 1B.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | MFCC |
Classifier | LSTM |
Acoustic Scene Classification Using Multi-Scale Features
Yang Liping, Chen Xinxing and Tao Lianjie
College of Optoelectronic Engineering, Chongqing University, Chongqing, China
Liping_CQU_task1a_1 Liping_CQU_task1a_2 Liping_CQU_task1a_3 Liping_CQU_task1a_4 Liping_CQU_task1b_1 Liping_CQU_task1b_2 Liping_CQU_task1b_3 Liping_CQU_task1b_4
Abstract
Convolutional neural networks (CNNs) have shown tremendous ability in classification problems, because they can extract abstract features that improve classification performance. In this paper, we use a CNN to compute a feature hierarchy layer by layer. As the layers deepen, the extracted features become more abstract, but the shallow features are also very useful for classification. We therefore propose a method that fuses multi-scale features from different layers, which can improve the performance of acoustic scene classification. In our method, log-mel features of the audio signal are used as the input to the CNN. To reduce the number of parameters, we use Xception as the base network, which is a CNN with depthwise separable convolutions (a depthwise convolution followed by a pointwise convolution), and we modify Xception to fuse multi-scale features. We also introduce focal loss to further improve classification performance. This method achieves commendable results, whether the audio recordings are collected with the same device (subtask A) or with different devices (subtask B).
System characteristics
Input | mono |
Sampling rate | 48kHz; 44.1kHz |
Features | log-mel energies |
Classifier | Xception |
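The focal loss mentioned in the abstract above down-weights well-classified examples so that training focuses on hard ones. A minimal multi-class sketch, assuming the standard formulation with a hypothetical focusing parameter gamma:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-probability of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# Hypothetical usage with 10 scene classes and a batch of 8 clips.
loss = focal_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```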
Auditory Scene Classification Using Ensemble Learning with Small Audio Feature Space
Tomasz Maka
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Szczecin, Poland
Maka_ZUT_task1a_1
Abstract
The report presents the results of an analysis of the audio feature space for auditory scene classification. The final small feature set was determined by selecting attributes from various representations. Feature importance was calculated using a Gradient Boosting Machine. A number of classifiers were employed to build the ensemble classification scheme, and majority voting was performed to obtain the final decision. As a result, the proposed solution uses 223 attributes and outperforms the baseline system by over 6 per cent.
System characteristics
Input | binaural |
Sampling rate | 48kHz |
Features | various |
Classifier | ensemble |
Decision making | majority vote |
Exploring Deep Vision Models for Acoustic Scene Classification
Octave Mariotti, Matthieu Cord and Olivier Schwander
Laboratoire d'informatique de Paris 6, Sorbonne Université, Paris, France
Mariotti_lip6_task1a_1 Mariotti_lip6_task1a_2 Mariotti_lip6_task1a_3 Mariotti_lip6_task1a_4
Abstract
This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition. In the context of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events 2018, we trained several of these architectures on the Task 1 dataset to perform acoustic scene classification. Then, in order to produce more robust predictions, we explored two ensemble methods to aggregate the different model outputs. Our results show a final accuracy of 79% on the development dataset, outperforming the baseline by almost 20%.
System characteristics
Input | mono, binaural |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | mean probability; neural network |
Acoustic Scene Classification Using a Convolutional Neural Network Ensemble and Nearest Neighbor Filters
Truc Nguyen and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria/ Europe
Nguyen_TUGraz_task1a_1 Nguyen_TUGraz_task1b_1
Abstract
This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in subtasks 1A and 1B of the DCASE 2018 challenge. We introduce a nearest neighbor filter applied to the spectrogram, which emphasizes and smooths similar patterns of sound events in a scene. We also propose a variety of CNN models for single-input (SI) and multi-input (MI) channels and three different methods for building a network ensemble. The experimental results show that for subtask 1A the combination of the MI-CNN structures using both log-mel features and their nearest neighbor filtering is slightly more effective than single-input channel CNN models using log-mel features only. The opposite holds for subtask 1B. In addition, the ensemble methods improve the accuracy of the system significantly; the best ensemble method is ensemble selection, which achieves 69.3% for subtask 1A and 63.6% for subtask 1B. This improves on the baseline system by 8.9% and 14.4% for subtasks 1A and 1B, respectively.
System characteristics
Input | mono |
Sampling rate | 48kHz; 44.1kHz |
Features | log-mel energies and their nearest neighbor filtered version |
Classifier | CNN |
Decision making | averaging vote |
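The nearest-neighbor-filtered features listed above smooth the spectrogram by replacing each frame with a combination of its most similar frames, which emphasizes recurring patterns. A rough sketch of one such filter, using cosine similarity and a median over k neighbors; this illustrates the general idea rather than the exact filter used in the submission:

```python
import numpy as np

def nearest_neighbor_filter(spec, k=5):
    """Replace each frame (column) of a (bands, frames) spectrogram with the median of its k most similar frames."""
    frames = spec.T                                            # (frames, bands)
    norm = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-9)
    similarity = norm @ norm.T                                 # cosine similarity between frames
    np.fill_diagonal(similarity, -np.inf)                      # exclude each frame itself
    neighbors = np.argsort(-similarity, axis=1)[:, :k]         # indices of the k nearest frames
    filtered = np.median(frames[neighbors], axis=1)            # (frames, bands)
    return filtered.T
```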
Acoustic Scene Classification Using Deep CNN on Raw-Waveform
Tilak Purohit and Atul Agarwal
Signal Processing and Pattern Recognition Lab, International Institute of Information Technology, Bangaluru, India
Tilak_IIITB_task1a_1 Tilak_IIITB_task1a_2 Tilak_IIITB_task1a_3
Abstract
For acoustic scene classification problems, conventionally Convolutional Neural Networks (CNNs) have been used on handcrafted features like Mel Frequency Cepstral Coefficients, filterbank energies, scaled spectrograms etc. However, recently CNNs have been used on raw waveform for acoustic modeling in speech recognition, though the time-scales of these waveforms are short (of the order of typical phoneme durations - 80-120 ms). In this work, we have exploited the representation learning power of CNNs by using them directly on very long raw acoustic sound waveforms (of durations 0.5-10 sec) for the acoustic scene classification (ASC) task of DCASE and have shown that deep CNNs (of 8-34 layers) can outperform CNNs with similar architecture on handcrafted features.
System characteristics
Input | mono |
Sampling rate | 8kHz |
Features | raw-waveform |
Classifier | CNN; DCNN |
Attention-Based Convolutional Neural Networks for Acoustic Scene Classification
Zhao Ren1, Qiuqiang Kong2, Kun Qian1, Mark Plumbley2 and Björn Schuller3
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, UK, 3ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing / GLAM -- Group on Language, Audio & Music, University of Augsburg, Imperial College London, Augsburg, Germany / London, UK
Ren_UAU_task1a_1 Ren_UAU_task1b_1
Abstract
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log-mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From an attention perspective, we build a weighted evaluation of the features, instead of simple max or average pooling. On the official development set of the challenge, the best accuracy of subtask A is 72.6%, which is an improvement of 12.9% when compared with the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is also a significant improvement over the baseline, with accuracies of 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel spectrogram |
Classifier | CNN |
Using an Evolutionary Approach to Explore Convolutional Neural Networks for Acoustic Scene Classification
Christian Roletscheck and Tobias Watzka
Human Centered Multimedia, Augsburg University, Augsburg, Germany
Roletscheck_UNIA_task1a_1 Roletscheck_UNIA_task1a_2
Abstract
The successful application of modern deep neural networks is heavily reliant on the chosen architecture and the selection of the appropriate hyperparameters. Due to the large amount of parameters and the complex inner workings of a neural network, finding a suitable configuration for a respective problem turns out to be a rather complex task for a human. In this paper we propose an evolutionary approach to automatically generate a suitable neural network architecture for any given problem. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. We take the DCASE 2018 Challenge as an opportunity to evaluate our algorithm on the task of acoustic scene classification. The best accuracy achieved by our approach was 74.7% on the development dataset.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | log-mel spectrogram |
Classifier | CNN |
Decision making | majority vote |
Acoustic Scene Classification by Ensemble of Spectrograms Based on Adaptive Temporal Divisions
Yuma Sakashita and Masaki Aono
Knowledge Data Engineering Laboratory, Toyohashi University of Technology, Aichi, Japan
Sakashita_TUT_task1a_1 Sakashita_TUT_task1a_2 Sakashita_TUT_task1a_3 Sakashita_TUT_task1a_4
Abstract
Many classification tasks using deep learning have improved classification accuracy by using a large amount of training data. However, it is difficult to collect audio data and build a large database. Since training data is restricted in DCASE 2018 Task 1A, unknown acoustic scenes must be predicted from limited training data. From the results of DCASE 2017 [1], we determined that using convolutional neural networks and ensembling multiple networks is an effective means of classifying acoustic scenes. In our method we generate mel-spectrograms from binaural audio, mono audio, and harmonic-percussive source separation (HPSS) audio, adaptively divide the spectrograms in multiple ways, and train 9 neural networks. We further improve accuracy by ensemble learning over these outputs. The classification result of the proposed system was 0.769 on the development dataset and 0.796 on the leaderboard dataset.
System characteristics
Input | mono, binaural |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | random forest |
CNN Based System for Acoustic Scene Classification
Lee Sangwon, Kang Seungtae and Jang Gin-jin
School of Electronics Engineering, Kyungpook National University, Daegu, Korea
Gil-jin_KNU_task1a_1
Abstract
Convolutional neural networks (CNNs) have achieved great success in many machine learning tasks such as classifying visual objects or various audio sounds. In this report, we describe our system implementation for the acoustic scene classification task of DCASE 2018, based on a CNN. The classification accuracies of the proposed system are 72.4% and 75.5% on the development and leaderboard datasets, respectively.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | majority vote |
Combination of Amplitude Modulation Spectrogram Features and MFCCs for Acoustic Scene Classification
Juergen Tchorz
Institute for Acoustics, University of Applied Sciences Luebeck, Luebeck, Germany
Tchorz_THL_task1b_1
Abstract
This report describes an approach for acoustic scene classification and its results for the development data set of the DCASE 2018 challenge. Amplitude modulation spectrograms (AMS), which mimic important aspects of the auditory system, are used as features, in combination with mel-scale cepstral coefficients, which have been shown to be complementary to AMS features. For classification, a long short-term memory deep neural network is used. The proposed system outperforms the baseline system by 6.3-9.3% on the development data test subset, depending on the recording device.
System characteristics
Sampling rate | 44.1kHz |
Features | amplitude modulation spectrogram, MFCC |
Classifier | LSTM |
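Amplitude modulation spectrograms characterize how the envelope in each frequency band fluctuates over time. A rough sketch of the idea, assuming band envelopes taken from a mel spectrogram and an FFT over each band's envelope; the actual AMS front end in the submission is likely more elaborate:

```python
import numpy as np
import librosa

def amplitude_modulation_spectrogram(path, sr=44100, n_mels=20, n_fft=1024,
                                     hop_length=512, mod_bins=16):
    """Return a (mel_bands, modulation_bins) map of envelope-fluctuation magnitudes."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    env = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)  # band envelopes over frames
    env = env - env.mean(axis=1, keepdims=True)    # remove the DC component per band
    mod = np.abs(np.fft.rfft(env, axis=1))         # modulation spectrum of each band envelope
    return mod[:, :mod_bins]
```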
Wavelet-Based Audio Features for Acoustic Scene Classification
Shefali Waldekar and Goutam Saha
Electronics and Electrical Communication Engineering Dept., Indian Institute of Technology Kharagpur, Kharagpur, India
Waldekar_IITKGP_task1a_1 Waldekar_IITKGP_task1b_1
Abstract
This report describes a submission for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 for Task 1 (acoustic scene classification (ASC)), sub-task A (basic ASC) and sub-task B (ASC with mismatched recording devices). We use two wavelet-based features in a score-fusion framework to achieve this goal. The first feature applies a wavelet transform to log mel-band energies, while the second applies a high-Q wavelet transform to the frames of the raw signal. The two features are found to be complementary, so that the fused system outperforms the deep-learning-based baseline system by a relative 17% for sub-task A and 26% for sub-task B on the development datasets provided for the respective sub-tasks.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | MFDWC, CQCC |
Classifier | SVM |
Decision making | fusion |
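Score fusion here means combining per-class scores from the two feature streams before the final decision. A minimal sketch of weighted late fusion, with a hypothetical weight; the fusion rule actually used may differ:

```python
import numpy as np

def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted late fusion of per-class scores (N, n_classes) from two subsystems; returns predicted class indices."""
    fused = weight * scores_a + (1.0 - weight) * scores_b
    return np.argmax(fused, axis=1)
```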
SE-ResNet with GAN-Based Data Augmentation Applied to Acoustic Scene Classification
Jeong Hyeon Yang, Nam Kyun Kim and Hong Kook Kim
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea
Yang_GIST_task1a_1 Yang_GIST_task1a_2
Abstract
This report describes our contribution to the development of audio scene classification methods for the DCASE 2018 Challenge Task 1A. The proposed systems for this task are based on generative adversarial network (GAN)-based data augmentation and various convolutional networks such as residual networks (ResNets) and squeeze-and-excitation residual networks (SE-ResNets). In addition to data augmentation, the SE-ResNets are revised so that they operate in the log-mel spectrogram domain, and the numbers of layers and kernels are adjusted to provide better performance on the task. Finally, an ensemble method is applied using a four-fold cross-validated training dataset. Consequently, the proposed audio scene classification system improves class-wise accuracy by 10% compared to the baseline system in the Kaggle acoustic scene classification competition.
System characteristics
Input | mixed |
Sampling rate | 48kHz |
Data augmentation | GAN |
Features | log-mel spectrogram |
Classifier | CNN, ensemble |
Decision making | mean probability |
Convolutional Neural Networks and X-Vector Embedding for DCASE2018 Acoustic Scene Classification Challenge
Hossein Zeinali, Lukas Burget and Honza Cernocky
BUT Speech, Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
Zeinali_BUT_task1a_1 Zeinali_BUT_task1a_2 Zeinali_BUT_task1a_3 Zeinali_BUT_task1a_4
Abstract
In this report, the BUT team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE 2018 challenge are described, together with an analysis of the performance of the different methods on the development set. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is a common two-dimensional CNN, as mainly used in image classification tasks. The second is a one-dimensional CNN for extracting embeddings from the neural network, which is common in speech processing, especially for speaker recognition. In addition to the topologies, two types of features are used in this task, log-domain mel-spectrograms and CQT features, which are explained in detail in the report. Finally, the outputs of the different systems are fused using a weighted average.
System characteristics
Input | mono, binaural |
Sampling rate | 48kHz |
Data augmentation | block mixing |
Features | log-mel energies, CQT |
Classifier | CNN, x-vector, ensemble |
Decision making | weighted average |
Acoustic Scene Classification Using Multi-Layered Temporal Pooling Based on Deep Convolutional Neural Network
Liwen Zhang and Jiqing Han
Laboratory of Speech Signal Processing, Harbin Institute of Technology, Harbin, China
Zhang_HIT_task1a_1 Zhang_HIT_task1a_2
Abstract
The performance of an Acoustic Scene Classification (ASC) system is highly dependent on the latent temporal dynamics of the audio signal. In this paper, we propose a multi-layered temporal pooling method that takes a CNN feature sequence as input and can effectively capture the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence and its time indexes. We applied our framework to DCASE 2018 Task 1, ASC. For evaluation, we trained a Support Vector Machine (SVM) on the features learned by the proposed Multi-Layered Temporal Pooling (MLTP). Experimental results on the development dataset show that using the MLTP features significantly improves ASC performance. The best performance, 75.28% accuracy, was achieved using the optimal setting found in our experiments.
System characteristics
Input | mono |
Sampling rate | 48kHz |
Features | log-mel energies |
Classifier | CNN, SVR, SVM |
Decision making | only one SVM |