Proceedings - DCASE

The proceedings of the DCASE2017 Workshop have been published as an electronic publication in the Tampere University of Technology series:

Virtanen, T., Mesaros, A., Heittola, T., Diment, A., Vincent, E., Benetos, E. & Elizalde, B. (Eds.) (2017). Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017).

ISBN (Electronic): 978-952-15-4042-4

Link PDF

Total cites: 2186 (updated 26.11.2024)

Acoustic Scene Classification by Combining Autoencoder-Based Dimensionality Reduction and Convolutional Neural Networks

Jakob Abeßer, Stylianos Ioannis Mimilakis, Robert Grafe, and Hanna Lukashevich

Fraunhofer IDMT, Ilmenau, Germany

36 cites

PDF

Abstract

Motivated by the recent success of deep learning techniques in various audio analysis tasks, this work presents a distributed sensor-server system for acoustic scene classification in urban environments based on deep convolutional neural networks (CNN). Stacked autoencoders are used to compress extracted spectrogram patches on the sensor side before they are transmitted to and classified on the server side. In our experiments, we compare two state-of-the-art CNN architectures with respect to their classification accuracy under three conditions: the presence of environmental noise, the dimensionality reduction in the encoding stage, and a reduced number of filters in the convolution layers. Our results show that the best model configuration leads to a classification accuracy of 75% for 5 acoustic scenes. We furthermore discuss which confusions between particular classes can be ascribed to particular sound event types that are present in multiple acoustic scene classes.

Keywords

Acoustic Scene Classification, Convolutional Neural Networks, Stacked Denoising Autoencoder, Smart City

Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network

Sharath Adavanne and Tuomas Virtanen

Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

71 cites

PDF Slides

Abstract

This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording, given just the list of sound events present in the audio without time information (weak labels). We achieve this using a stacked convolutional and recurrent neural network with two prediction layers in sequence: one for the strong labels, followed by one for the weak labels. The network is trained using frame-wise log mel-band energy as the input audio feature, and the weak labels provided in the dataset as targets for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the losses computed in the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves a best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
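The weak-to-strong label replication and the weighted two-layer loss described above can be illustrated with a small stdlib-only Python sketch. It assumes binary cross-entropy for both prediction layers and uses hypothetical weights `w_strong` and `w_weak`; the abstract states neither the loss function nor the weight values, and the network itself is omitted.

```python
import math

def replicate_weak_labels(weak, n_frames):
    """Copy a clip-level (weak) multi-hot label vector to every frame to obtain
    pseudo frame-level (strong) targets."""
    return [list(weak) for _ in range(n_frames)]

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(pred)

def combined_loss(strong_pred, weak_pred, weak_target, w_strong=0.5, w_weak=1.0):
    """Weighted sum of the strong (frame-level) and weak (clip-level) losses.

    strong_pred: n_frames x n_classes probabilities from the strong-label layer
    weak_pred:   n_classes probabilities from the weak-label layer
    w_strong, w_weak: hypothetical weights controlling what the network learns
    """
    strong_target = replicate_weak_labels(weak_target, len(strong_pred))
    flat_pred = [p for frame in strong_pred for p in frame]
    flat_true = [y for frame in strong_target for y in frame]
    return w_strong * bce(flat_pred, flat_true) + w_weak * bce(weak_pred, weak_target)

# Example: 3 classes, 2 frames; classes 0 and 2 occur somewhere in the clip.
weak = [1, 0, 1]
strong_pred = [[0.9, 0.1, 0.8], [0.7, 0.2, 0.6]]
weak_pred = [0.95, 0.05, 0.9]
print(combined_loss(strong_pred, weak_pred, weak))
```

Lowering `w_strong` reduces the influence of the replicated frame targets, which are only approximately correct because an event rarely spans the whole clip.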

Keywords

sound event detection, weak labels, deep neural network, CNN, GRU

Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio

Shahin Amiriparian¹,²,³, Michael Freitag¹, Nicholas Cummins¹,² and Björn Schuller²,⁴

¹Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, ²Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, ³Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, ⁴Group of Language, Audio & Music, Imperial College London, London, UK

116 cites

PDF Slides

Abstract

This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task that uses a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, which are treated as time-dependent frequency vectors. Then, from a fully connected layer between the decoder and encoder units, we extract the learnt representations of the spectrograms as feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74.8% to 88.0% on the development set, and from 61.0% to 67.5% on the test set.
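The reported accuracies can also be restated as absolute gains and as relative reductions in error rate. The short Python helper below does this arithmetic; the relative error reduction is a derived figure added here for illustration and is not reported in the abstract.

```python
def summarize_gain(baseline_acc, system_acc):
    """Return (absolute gain in percentage points, relative error reduction in %)."""
    abs_gain = system_acc - baseline_acc
    base_err = 100.0 - baseline_acc   # baseline error rate in %
    sys_err = 100.0 - system_acc      # proposed system error rate in %
    rel_err_reduction = 100.0 * (base_err - sys_err) / base_err
    return abs_gain, rel_err_reduction

# Accuracies (%) as reported in the abstract above.
for split, base, ours in [("development", 74.8, 88.0), ("test", 61.0, 67.5)]:
    gain, rer = summarize_gain(base, ours)
    print(f"{split}: +{gain:.1f} points, {rer:.1f}% relative error reduction")
# development: +13.2 points, 52.4% relative error reduction
# test: +6.5 points, 16.7% relative error reduction
```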

Keywords

deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing, acoustic scene classification

Nonnegative Feature Learning Methods for Acoustic Scene Classification

Victor Bisot¹, Romain Serizel²,³,⁴, Slim Essid¹ and Gaël Richard¹

¹Image Data and Signal, Telecom ParisTech, Paris, France, ²Université de Lorraine, Loria, Nancy, France, ³Inria, Nancy, France, ⁴CNRS, LORIA, Nancy, France

15 cites

PDF

Abstract

This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit a simple deep neural network architecture to classify low-level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we propose a deep neural network architecture that jointly exploits unsupervised nonnegative matrix factorization activation features and low-level time-frequency representations as inputs. Finally, we present a fusion of the proposed systems to further improve performance. The resulting systems constitute our submission to Task 1 of the DCASE 2017 challenge.
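For readers unfamiliar with NMF activation features, the sketch below factorizes a nonnegative time-frequency matrix V ≈ WH with the standard Lee-Seung multiplicative updates (Euclidean cost) in plain Python, then summarizes each component's activations by their time average. This is ordinary unsupervised NMF, not the task-driven variant or the adapted scaling algorithm proposed in the paper, and the time-averaging step is an assumed, simple pooling choice.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF, V ~ W H, via multiplicative updates minimizing ||V - WH||_F.
    V: F x T nonnegative matrix (e.g. a magnitude spectrogram)."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(F)]
    H = [[rng.random() + eps for _ in range(T)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def activation_features(H):
    """Average each component's activation over time: one feature per component."""
    return [sum(row) / len(row) for row in H]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for vr, xr in zip(V, WH) for v, x in zip(vr, xr)) ** 0.5

# Toy 4 x 6 "spectrogram" built from two nonnegative spectral patterns.
V = matmul([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]],
           [[1.0, 2.0, 0.0, 1.0, 3.0, 0.0], [0.0, 1.0, 2.0, 1.0, 0.0, 3.0]])
W, H = nmf(V, rank=2)
print(activation_features(H), frobenius_error(V, W, H))
```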

Keywords

Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks

Convolutional Recurrent Neural Networks for Rare Sound Event Detection

Emre Cakir and Tuomas Virtanen

Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

73 cites

PDF

Abstract

Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content of samples from the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift-invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn longer-term temporal context from the extracted high-level features. In this paper, we propose combining the two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated on the DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. The CRNN provides a significant performance improvement over two other deep learning based methods, mainly due to its capability for longer-term temporal modeling.

Keywords

Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning

The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network

Gert Dekkers¹,², Steven Lauwereins², Bart Thoen¹, Mulu Weldegebreal Adhana¹, Henk Brouckxon³, Bertold Van den Bergh², Toon van Waterschoot¹,², Bart Vanrumste¹,²,⁴, Marian Verhelst², Peter Karsmakers¹

¹ KU Leuven, Department of Electrical Engineering, Engineering Technology Cluster, Geel, Belgium, ² KU Leuven, Department of Electrical Engineering, Leuven, Belgium, ³ Vrije Universiteit Brussel, Department ETRO-DSSP, Brussels, Belgium, ⁴ IMEC, Leuven, Belgium

145 cites

PDF

Abstract

There is rising interest in monitoring and improving human wellbeing at home using different types of sensors, including microphones. In the context of Ambient Assisted Living (AAL), persons are monitored by tracking the activities they perform at home, e.g. to support patients with a chronic illness and older persons. When considering an acoustic sensing modality, a performed activity can be seen as an acoustic scene. Recently, acoustic detection and classification of scenes and events has gained interest in the scientific community and has led to numerous public databases for a wide range of applications. However, no public databases exist that a) focus on daily activities in a home environment, b) contain activities performed in a spontaneous manner, c) make use of an acoustic sensor network, and d) are recorded as a continuous stream. In this paper we introduce a database recorded in a single home over a period of one week. The recording setup is an acoustic sensor network of thirteen sensor nodes, each with four low-cost microphones, distributed over five rooms. Annotation is available at the activity level. We present the recording and annotation procedure, the database content and a discussion of a baseline detection benchmark. The baseline consists of Mel-Frequency Cepstral Coefficients, a Support Vector Machine and a majority-vote late-fusion scheme. The database is publicly released to provide a common ground for future research.
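The majority-vote late-fusion step of the baseline can be sketched as follows: each sensor node produces its own class decision for a segment, and the fused decision is the most frequent one. The tie-breaking rule (the tied label that appears first in node order wins) and the activity labels in the example are assumptions for illustration; the abstract specifies only that a majority vote is used.

```python
from collections import Counter

def majority_vote(node_predictions):
    """Return the most frequent label among per-node decisions for one segment.
    Ties go to the tied label that appears first in node order (an assumed rule)."""
    counts = Counter(node_predictions)
    top = max(counts.values())
    for label in node_predictions:
        if counts[label] == top:
            return label

def fuse_segments(per_node_sequences):
    """Fuse equal-length per-node label sequences segment by segment."""
    return [majority_vote(list(segment)) for segment in zip(*per_node_sequences)]

# Example: three nodes, two segments (illustrative labels).
nodes = [
    ["cooking", "eating"],
    ["cooking", "dishwashing"],
    ["dishwashing", "eating"],
]
print(fuse_segments(nodes))  # ['cooking', 'eating']
```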

Keywords

Database, Acoustic Scene Classification, Acoustic Event Detection, Acoustic Sensor Networks

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra

Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

33 cites

PDF Slides

Abstract

This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of an ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and an approach based on learning representations from data, where log-scaled mel-spectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report the classification accuracy of each method alone and of the ensemble system on the provided cross-validation setup of the TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves on the provided baseline system by 8.2%.
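The abstract describes the late fusion only as simple. One common choice, assumed here rather than taken from the paper, is a weighted average of the per-class probabilities produced by the two systems (standing in for the Gradient Boosting Machine and the CNN), followed by an argmax. The weight `weight_a` and the class names are hypothetical.

```python
def late_fusion(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two per-class probability vectors.
    Returns the fused vector and the index of the winning class."""
    if len(prob_a) != len(prob_b):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best

# Example with three hypothetical scene classes.
classes = ["beach", "bus", "park"]
gbm_probs = [0.5, 0.2, 0.3]
cnn_probs = [0.2, 0.1, 0.7]
fused, best = late_fusion(gbm_probs, cnn_probs)
print(classes[best], fused)
```

Because two systems of different nature may produce differently calibrated probabilities, tuning `weight_a` on the cross-validation folds is one way to account for this.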

Keywords

acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling

Acoustic Scene Classification Using Spatial Features

Marc C. Green and Damian Murphy

Audio Lab, Department of Electronic Engineering, University of York, York, UK

17 cites

PDF Slides

Abstract

Due to various factors, the vast majority of research in the field of Acoustic Scene Classification has used monaural or binaural datasets. This paper introduces EigenScape, a new dataset of 4th-order Ambisonic acoustic scene recordings, and presents a preliminary analysis of this dataset. The data is classified using a standard Mel-Frequency Cepstral Coefficient (MFCC) and Gaussian Mixture Model (GMM) system, and the performance of this system is compared to that of a new system using spatial features extracted with Directional Audio Coding (DirAC) techniques. The DirAC features are shown to perform well in scene classification, with some subsets of these features outperforming the MFCC-based classification. The differences in label confusion between the two systems are especially interesting, as they suggest that certain scenes that are spectrally similar might not be spatially similar.
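DirAC analysis estimates, among other parameters, a direction of arrival from the first-order (B-format) components of an Ambisonic recording. The sketch below is a deliberately simplified, broadband version, assuming only the first-order channels are used: it averages the products of the omnidirectional channel W with the dipole channels X, Y and Z over the whole signal and converts the resulting vector to azimuth and elevation. Real DirAC works per time-frequency tile and also estimates diffuseness; the encoding helper, its unscaled W channel and the sign convention (vector pointing toward the source) are assumptions for illustration, not the feature extraction used in the paper.

```python
import math

def broadband_direction(w, x, y, z):
    """Estimate (azimuth, elevation) in degrees from first-order B-format signals
    via time-averaged products of the omni channel W with the dipoles X, Y, Z.
    Simplified: no time-frequency analysis and no diffuseness estimate."""
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation

def encode_plane_wave(signal, azimuth_deg, elevation_deg):
    """Hypothetical ideal first-order encoding of a plane wave (W left unscaled)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    w = list(signal)
    x = [s * math.cos(az) * math.cos(el) for s in signal]
    y = [s * math.sin(az) * math.cos(el) for s in signal]
    z = [s * math.sin(el) for s in signal]
    return w, x, y, z

# A synthetic source at 60 degrees azimuth, 20 degrees elevation.
signal = [math.sin(0.1 * k) for k in range(200)]
print(broadband_direction(*encode_plane_wave(signal, 60.0, 20.0)))
```

Scaling W (for example by a normalization convention factor) changes the magnitude of the averaged vector but not its direction, which is why the sketch can ignore it.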

Keywords

Acoustic scene classification, MFCC, Gaussian mixture model, ambisonics, directional audio coding, multichannel, Eigenmike

Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification

Yoonchang Han¹ and Jeongsoo Park¹,²

¹Cochlear.ai, Seoul, Korea, ²Music and Audio Research Group, Seoul National University, Seoul, Korea

169 cites

Audio Event Detection Using Multiple-Input Convolutional Neural Network

DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features

DNN-Based Audio Scene Classification for DCASE2017: Dual Input Features, Balancing Cost, and Stochastic Data Duplication

Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study

Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection

Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input

Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks

DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System

Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane

Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features

The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification

Wavelets Revisited for the Classification of Acoustic Scenes

Deep Sequential Image Features on Acoustic Scene Classification

Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification

Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning

Audio Events Detection and Classification Using Extended R-FCN Approach

Acoustic Scene Classification Using Deep Convolutional Neural Network and Multiple Spectrograms Fusion

Robust Sound Event Detection Through Noise Estimation and Source Separation Using NMF