Proceedings

Workshop on Detection and Classification of Acoustic Scenes and Events
3rd of September 2016, Budapest, Hungary

The proceedings of the DCASE2016 workshop have been published as an electronic publication in the Tampere University of Technology series:

Virtanen, T., Mesaros, A., Heittola, T., Plumbley, M. D., Foster, P., Benetos, E., & Lagrange, M. (Eds.) (2016). Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016).

ISBN (Electronic): 978-952-15-3807-0


Link PDF
Total cites: 1280 (updated 30.11.2023)
Abstract

In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real-life sound recordings typically contain many overlapping sound events, which are hard to recognize from mono-channel audio alone. Human listeners successfully recognize mixtures of overlapping sound events by using pitch cues and by exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally, SED systems have used only mono-channel audio; motivated by the human listener, we propose to extend them to multichannel audio. The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
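
A minimal sketch of the kind of model described above, i.e. a frame-wise multi-label (sigmoid-output) LSTM detector over stacked multichannel features; the feature dimension, layer sizes, and training setup are illustrative assumptions, not the authors' configuration:

    # Multi-label LSTM sound event detector (illustrative sketch).
    import numpy as np
    from tensorflow.keras import layers, models

    T, F, N_CLASSES = 256, 120, 18   # frames, feature dim, event classes (assumed)

    model = models.Sequential([
        layers.Input(shape=(T, F)),  # e.g. per-channel log-mels + TDOA + pitch, concatenated
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        # Sigmoid outputs let several overlapping events be active per frame.
        layers.TimeDistributed(layers.Dense(N_CLASSES, activation="sigmoid")),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    x = np.random.rand(8, T, F).astype("float32")                  # dummy batch
    y = (np.random.rand(8, T, N_CLASSES) > 0.9).astype("float32")  # dummy labels
    model.fit(x, y, epochs=1, verbose=0)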

Keywords

Sound event detection, multichannel, time difference of arrival, pitch, recurrent neural networks, long short term memory

Cites: 135 (see at Google Scholar)

PDF
Abstract

Deep neural networks (DNNs) have recently achieved great success in various learning tasks and have also been used for the classification of environmental sounds. While DNNs show their potential in the classification task, they cannot fully utilize temporal information. In this paper, we propose a neural network architecture for the purpose of using sequential information. The proposed structure is composed of two separate lower networks and one upper network, which we refer to as the LSTM layers, CNN layers, and connected layers, respectively. The LSTM layers extract sequential information from consecutive audio features. The CNN layers learn the spectro-temporal locality from spectrogram images. Finally, the connected layers summarize the outputs of the two networks to take advantage of the complementary features of the LSTM and CNN by combining them. To compare the proposed method with other neural networks, we conducted a number of experiments on the TUT acoustic scenes 2016 dataset, which consists of recordings from various acoustic scenes. With the proposed combination structure, we achieved higher performance than conventional DNN, CNN, and LSTM architectures.
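
A rough sketch of the two-branch structure (layer sizes and feature shapes are assumptions): an LSTM branch over frame-level feature sequences, a CNN branch over the spectrogram image, and fully connected layers that combine the two:

    import numpy as np
    from tensorflow.keras import layers, models

    T, F, N_SCENES = 128, 40, 15   # frames, mel bands, scene classes (assumed)

    seq_in = layers.Input(shape=(T, F))            # consecutive audio features
    h_seq = layers.LSTM(64)(seq_in)                # "LSTM layers": sequential information

    img_in = layers.Input(shape=(T, F, 1))         # spectrogram image
    h_img = layers.Conv2D(16, (3, 3), activation="relu")(img_in)
    h_img = layers.MaxPooling2D((2, 2))(h_img)
    h_img = layers.Conv2D(32, (3, 3), activation="relu")(h_img)
    h_img = layers.GlobalMaxPooling2D()(h_img)     # "CNN layers": spectro-temporal locality

    h = layers.concatenate([h_seq, h_img])         # "connected layers" summarize both
    h = layers.Dense(64, activation="relu")(h)
    out = layers.Dense(N_SCENES, activation="softmax")(h)

    model = models.Model([seq_in, img_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")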

Keywords

Deep learning, sequence learning, combination of LSTM and CNN, acoustic scene classification

Cites: 188 (see at Google Scholar)

PDF
Abstract

In this paper, we present a sound event detection system based on a deep neural network (DNN). An exemplar-based noise reduction approach is proposed for enhancing the mel-band energy features, and a multi-label DNN classifier is trained for polyphonic event detection. The system is evaluated on the IEEE DCASE 2016 Challenge Task 2 train/development datasets. On the development set, the result yields up to 0.9261 F-score and 0.1379 error rate on the segment-based metrics.
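
For reference, the segment-based F-score and error rate quoted above follow the usual DCASE definitions (the sed_eval toolbox implements the official versions); a minimal NumPy sketch:

    import numpy as np

    def segment_metrics(ref, est):
        # ref, est: 0/1 activity arrays of shape (n_segments, n_classes).
        ref, est = np.asarray(ref, float), np.asarray(est, float)
        tp = (ref * est).sum()
        fp = ((1 - ref) * est).sum()
        fn = (ref * (1 - est)).sum()
        f_score = 2 * tp / (2 * tp + fp + fn)
        fn_k = (ref * (1 - est)).sum(axis=1)    # misses per segment
        fp_k = ((1 - ref) * est).sum(axis=1)    # false alarms per segment
        s = np.minimum(fn_k, fp_k).sum()        # substitutions
        d = np.maximum(0.0, fn_k - fp_k).sum()  # deletions
        i = np.maximum(0.0, fp_k - fn_k).sum()  # insertions
        error_rate = (s + d + i) / ref.sum()
        return f_score, error_rate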

Keywords

Sound event detection, deep neural network, exemplar-based noise reduction

Cites: 42 (see at Google Scholar)

PDF
Abstract

In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization, and other heuristics specific to each task. Our performance on both tasks improved on the DCASE baseline: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.
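
The bag-of-audio-words representation listed in the keywords can be sketched in a few lines; the codebook size, MFCC settings, and the train_paths file list are illustrative assumptions:

    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    def mfcc_frames(path):
        y, sr = librosa.load(path, sr=22050)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # (frames, 20)

    # train_paths: list of training audio files (assumed to exist)
    train_frames = np.vstack([mfcc_frames(p) for p in train_paths])
    codebook = KMeans(n_clusters=128, n_init=4).fit(train_frames)

    def bag_of_audio_words(path):
        words = codebook.predict(mfcc_frames(path))
        hist = np.bincount(words, minlength=128).astype(float)
        return hist / hist.sum()   # clip-level feature for, e.g., a GMM or SVM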

Keywords

audio, scenes, events, features, segmentation, DCASE, bag of audio words, GMMs

Cites: 47 (see at Google Scholar)

PDF
Abstract

In this paper, we investigate sparse convolutive non-negative matrix factorization (sparse-CNMF) for detecting overlapped acoustic events in single-channel audio, within the experimental framework of Task 2 of the DCASE’16 challenge. In particular, our main focus lies on the efficient creation of the dictionary, as well as on the detection scheme associated with the CNMF approach. Specifically, we propose a shift-invariant dictionary reduction method that outperforms standard CNMF-based dictionary building. Further, we develop a novel detection algorithm that combines information from the CNMF activation matrix and atom-based reconstruction residuals, achieving significant improvement over the conventional approach based on the activations alone. The resulting system, evaluated on the development set of Task 2 of the DCASE’16 Challenge, also achieves large gains over the traditional NMF baseline provided by the Challenge organizers.
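
As a simplified stand-in for the convolutive model (plain NMF with a fixed, shift-less dictionary; iteration count and threshold are assumptions), detection from the activation matrix can be sketched as:

    import numpy as np

    def activations(V, W, n_iter=200, eps=1e-9):
        # Estimate H >= 0 with V ~= W @ H, dictionary W held fixed
        # (multiplicative updates, Euclidean cost).
        H = np.random.rand(W.shape[1], V.shape[1])
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        return H

    def detect(V, W, atom_class, n_classes, rel_thr=0.1):
        # Sum the activations of each class's atoms per frame, then threshold.
        H = activations(V, W)
        act = np.vstack([H[atom_class == c].sum(axis=0) for c in range(n_classes)])
        return act > rel_thr * act.max()   # (n_classes, n_frames) event roll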

Keywords

Convolutive Non-Negative Matrix Factorization, Dictionary Building, Overlapped Acoustic Event Detection

Cites: 17 (see at Google Scholar)

PDF
Abstract

This paper presents a sound event detection system based on mel-frequency cepstral coefficients and a non-parametric classifier. System performance is tested using the training and development datasets corresponding to the second task of the DCASE 2016 challenge. Results indicate that the most relevant spectral information for event detection is below 8000 Hz and that the general shape of the spectral envelope is much more relevant than its fine details.
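
A minimal realization of this setup (sampling rate, coefficient count, and classifier settings are assumptions): MFCCs from spectra limited to 8 kHz, classified with a non-parametric k-nearest-neighbour model:

    import numpy as np
    import librosa
    from sklearn.neighbors import KNeighborsClassifier

    def features(path):
        y, sr = librosa.load(path, sr=16000)   # 16 kHz sampling keeps content below 8 kHz
        # Few cepstral coefficients keep the general envelope shape, not fine detail.
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, fmax=8000).T

    clf = KNeighborsClassifier(n_neighbors=5)
    # clf.fit(np.vstack([features(p) for p in train_paths]), frame_labels)
    # train_paths and frame_labels are assumed to be available.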

Keywords

Sound event detection, spectral envelope, cepstral analysis

Cites: 19 (see at Google Scholar)

PDF
Abstract

In this study, we propose a new method for polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNN. We compare the performance of our proposed method to conventional methods such as non-negative matrix factorization (NMF) and standard BLSTM-RNN, using the DCASE2016 Task 2 dataset. Our proposed method outperformed the conventional approaches in both the monophonic and polyphonic tasks, finally achieving an average F1 score of 76.63% (error rate of 51.11%) on the event-based evaluation and an average F1 score of 87.16% (error rate of 25.91%) on the segment-based evaluation.
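
The effect of the HMM stage can be illustrated by a per-class two-state (off/on) Viterbi smoothing of the network's frame posteriors, which discourages implausibly short events; the transition probabilities here are assumptions, not the authors' trained values:

    import numpy as np

    def viterbi_smooth(post, p_stay=0.98, eps=1e-12):
        # post: (T,) frame-wise posterior P(event active). Returns a 0/1 path.
        T = len(post)
        log_a = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))
        log_b = np.log(np.stack([1 - post, post], axis=1) + eps)  # (T, 2)
        delta = log_b[0].copy()
        psi = np.zeros((T, 2), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_a        # (previous state, current state)
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_b[t]
        path = np.zeros(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path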

Keywords

Polyphonic Sound Event Detection, Bidirectional Long Short-Term Memory, Hidden Markov Model, multilabel classification

Cites: 47 (see at Google Scholar)

PDF
Abstract

In this paper, non-negative matrix factorization is applied to isolate the contribution of road traffic from acoustic measurements of urban sound mixtures. The method is tested on simulated scenes to enable better control of the presence of the different sound sources. The first results presented show the potential of the method.

Keywords

Non-negative Matrix Factorization, road traffic noise mapping, urban measurements

Cites: 19 (see at Google Scholar)

PDF
Abstract

This paper proposes an acoustic event detection (AED) method using semi-supervised non-negative matrix factorization (NMF) with a mixture of local dictionaries (MLD). The proposed method, based on semi-supervised NMF, newly introduces a noise dictionary and a noise activation matrix, both dedicated to unknown acoustic atoms that are not included in the MLD. Because unknown acoustic atoms are better modeled by the new noise dictionary, learned upon classification, and the new activation matrix, the proposed method provides higher classification performance for event classes modeled by the MLD when the signal to be classified is contaminated by unknown acoustic atoms. Evaluation results using the DCASE2016 Task 2 dataset show that the F-measure of the proposed method with semi-supervised NMF improves by as much as 11.1% compared to that of the conventional method with supervised NMF.
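
A sketch of the semi-supervised decomposition described above (Euclidean cost; dictionary sizes and iteration count are assumptions): the event dictionary W_e stays fixed while the noise dictionary W_n and all activations are updated, so unknown acoustic atoms are absorbed by the noise part rather than by the event activations:

    import numpy as np

    def semi_supervised_nmf(V, W_e, n_noise=5, n_iter=200, eps=1e-9):
        # V: (n_freq, n_frames) magnitude spectrogram; W_e: fixed event dictionary.
        k = W_e.shape[1]
        W_n = np.random.rand(V.shape[0], n_noise)     # noise dictionary (learned)
        H = np.random.rand(k + n_noise, V.shape[1])   # all activations (learned)
        for _ in range(n_iter):
            W = np.hstack([W_e, W_n])
            H *= (W.T @ V) / (W.T @ W @ H + eps)             # update all activations
            W_n *= (V @ H[k:].T) / (W @ H @ H[k:].T + eps)   # update noise atoms only
        return H[:k], H[k:], W_n   # event activations, noise activations, noise dictionary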

Keywords

Acoustic event detection, Non-negative matrix factorization, Semi-supervised NMF, Mixture of local dictionaries

Cites: 85 (see at Google Scholar)

PDF
Abstract

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, deep neural networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel filter bank features are used. Two kinds of Mel banks, a same-area bank and a same-height bank, are discussed; experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks of the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN, against 72.5% using Mel-frequency cepstral coefficients (MFCC) + Gaussian mixture model (GMM). In Task 2 we obtained an F-score of 17.4% using Mel + DNN, against 41.6% using the constant-Q transform (CQT) + non-negative matrix factorization (NMF). In Task 3 we obtained an F-score of 38.1% using Mel + DNN, against 26.6% using MFCC + GMM. In Task 4 we obtained an equal error rate (EER) of 20.9% using Mel + DNN, against 21.0% using MFCC + GMM. The DNN thus improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
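
One way to realize the two Mel-bank variants is through the filterbank normalization options in librosa: 'slaney' normalization scales each triangle to (approximately) equal area, whereas norm=None leaves all triangles at the same height (this mapping is our reading, not necessarily the authors' exact construction):

    import numpy as np
    import librosa

    same_area   = librosa.filters.mel(sr=44100, n_fft=2048, n_mels=40, norm="slaney")
    same_height = librosa.filters.mel(sr=44100, n_fft=2048, n_mels=40, norm=None)

    # Applied to a power spectrogram S of shape (1 + n_fft // 2, frames):
    # feats = np.log(same_height @ S + 1e-10)   # 40 log Mel-filter-bank features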

Keywords

Mel-filter bank, Deep Neural Network (DNN), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Audio Tagging

Cites: 82 (see at Google Scholar)

PDF
Abstract

In this paper a novel approach for acoustic event detection in sensor networks is presented. Improved and more robust recognition is achieved by making use of the signals from multiple sensors. To this end, various known fusion strategies are evaluated along with a novel method using classifier stacking. A detailed comparative evaluation is performed on two different datasets using 32 distributed microphones: the ITC-Irst database and a set of smart-room recordings. Classifier stacking yields a significant improvement. The effect of using events at previously observed as well as unobserved locations is also investigated, and the performance of recognizing events at previously unobserved locations can be improved by sorting the channels according to their posterior probabilities.
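
An illustrative version of the stacking scheme (the base and meta classifiers, and the sorting rule, are assumptions in this sketch): each channel's classifier emits class posteriors, the per-channel posterior vectors are sorted by confidence and concatenated, and a meta-classifier is trained on the stacked representation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stacked_features(channel_posteriors):
        # channel_posteriors: (n_channels, n_classes) for one event.
        order = np.argsort(-channel_posteriors.max(axis=1))  # most confident channel first
        return channel_posteriors[order].ravel()

    # X_meta rows are stacked posteriors for each event, y the event labels
    # (all_posteriors and y are assumed to be available):
    # X_meta = np.array([stacked_features(p) for p in all_posteriors])
    # meta = LogisticRegression(max_iter=1000).fit(X_meta, y)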

Keywords

Bag-of-Features, Acoustic Event Detection, Sensor Arrays, Robustness, Acoustic Sensor Networks

Cites: 30 (see at Google Scholar)

Abstract

In this paper, we propose a parallel convolutional neural network architecture for the task of classifying acoustic scenes and urban soundscapes. A popular choice of input to a convolutional neural network in audio classification problems is the Mel-transformed spectrogram; we show in this paper, however, that a constant-Q-transformed input improves results. Furthermore, we evaluate critical parameters such as the number of necessary frequency bands and the filter sizes in a convolutional neural network, which are non-trivial in audio tasks due to the different semantics of the two axes of the input data: time vs. frequency. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency. Our approach shows a 10.7% relative improvement over the baseline system of the DCASE 2016 acoustic scene classification task.
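
A sketch of the parallel architecture on a CQT input (filter shapes and sizes are assumptions): one branch uses filters that are wide in time, the other filters that are wide in frequency, and their outputs are merged:

    import numpy as np
    import librosa
    from tensorflow.keras import layers, models

    def cqt_image(path):
        y, sr = librosa.load(path, sr=22050)
        return librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, n_bins=84)))

    inp = layers.Input(shape=(84, 256, 1))                         # (bands, frames, 1)
    t_branch = layers.Conv2D(16, (3, 15), activation="relu")(inp)  # wide in time
    f_branch = layers.Conv2D(16, (15, 3), activation="relu")(inp)  # wide in frequency
    merged = layers.concatenate([layers.GlobalMaxPooling2D()(t_branch),
                                 layers.GlobalMaxPooling2D()(f_branch)])
    out = layers.Dense(15, activation="softmax")(merged)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")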

Keywords

Deep Learning, Constant-Q-Transform, Convolutional Neural Networks, Audio Event Classification

Cites: 124 (see at Google Scholar)

PDF
Abstract

We propose a system for acoustic scene classification using pairwise decomposition with deep neural networks and dimensionality reduction by multiscale kernel subspace learning. It is our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016). The system classifies 15 different acoustic scenes. First, auditory spectral features are extracted and fed into 15 binary deep multilayer perceptron neural networks (MLPs). The MLPs are trained with the 'one-against-all' paradigm to perform a pairwise decomposition. In a second stage, a large number of spectral, cepstral, energy, and voicing-related audio features are extracted. Multiscale Gaussian kernels are then used to construct an optimal linear combination of Gram matrices for multiple kernel subspace learning. The reduced feature set is fed into a nearest-neighbour classifier. Predictions from the two systems are finally combined by a threshold-based decision function. On the official development set of the challenge, an accuracy of 81.4% is achieved.

Keywords

Computational Acoustic Scene Analysis, Acoustic Scene Classification, Multilayer Perceptron, Deep Neural Networks, Multiscale Kernel Analysis

Cites: 37 (see at Google Scholar)

PDF
Abstract

This paper presents a system for acoustic scene classification (SC) that is applied to data of the SC task of the DCASE'16 challenge (Task 1). The proposed method is based on extracting acoustic features that employ a relatively long temporal context, i.e., amplitude modulation filter bank (AMFB) features, prior to detection of acoustic scenes using a neural network (NN) based classification approach. Recurrent neural networks (RNNs) are well suited to model long-term acoustic dependencies that are known to encode important information for SC tasks. However, RNNs require a relatively large amount of training data in comparison to feed-forward deep neural networks (DNNs). Hence, the time-delay neural network (TDNN) approach is used in the present work, which enables analysis of long contextual information similarly to RNNs but with training efforts comparable to conventional DNNs. The proposed SC system attains a recognition accuracy of 75%, which is 2.5% higher than that of the DCASE'16 baseline system.

Keywords

Time-delay neural networks, acoustic scene classification, DCASE, amplitude modulation filter bank features

Cites: 12 (see at Google Scholar)

PDF
Abstract

Sounds around us convey the context of daily life activities. There are 360 million individuals worldwide who experience some form of deafness. For them, missing contexts such as a fire alarm can be not only inconvenient but also life threatening. In this paper, we explore a combination of audio feature extraction algorithms that increases the accuracy of identifying environmental sounds while also reducing power consumption. We also design a simple approach that alleviates some privacy concerns, and we evaluate the implemented real-time environmental sound recognition system on Android mobile devices. Our solution works in embedded mode, where sound processing and recognition are performed directly on the mobile device in a way that conserves battery power. Sound signals were detected using the standard deviation of normalized power sequences. Multiple feature extraction techniques, such as zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), spectral flatness, and spectral centroid, were applied to the raw sound signal. A multi-layer perceptron classifier was used to identify the sound. Experimental results show improvements over the state of the art.
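
A desktop-side sketch of the described feature set (frame sizes and the input path are placeholders; the paper computes equivalents on-device):

    import numpy as np
    import librosa

    y, sr = librosa.load("example.wav", sr=16000)   # placeholder input
    frames = librosa.util.frame(y, frame_length=1024, hop_length=512)
    power = (frames ** 2).mean(axis=0)
    activity = power.std() / (power.mean() + 1e-12)  # detection cue from normalized power

    feats = np.vstack([
        librosa.feature.zero_crossing_rate(y, frame_length=1024, hop_length=512),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=512),
        librosa.feature.spectral_flatness(y=y, n_fft=1024, hop_length=512),
        librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=1024, hop_length=512),
    ]).T   # (frames, features), ready for a multi-layer perceptron classifier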

Keywords

Environmental sound recognition, signal processing, machine learning, Android OS

Cites: 36 (see at Google Scholar)

PDF
Abstract

This contribution reports on the performance of systems for polyphonic acoustic event detection (AED) compared within the framework of the “detection and classification of acoustic scenes and events 2016” (DCASE'16) challenge. State-of-the-art Gaussian mixture model (GMM) and GMM-hidden Markov model (HMM) approaches are applied using Mel-frequency cepstral coefficient (MFCC) and Gabor filterbank (GFB) features, together with a non-negative matrix factorization (NMF) based system. Furthermore, tandem and hybrid deep neural network (DNN)-HMM systems are adopted. All HMM systems, which usually are of multiclass type, i.e., output just one label per time segment from a set of possible classes, are extended to binary classification systems that are composed of single binary classifiers discriminating between target and non-target classes and thus are capable of multi-labeling. These systems are evaluated on the residential-area data of Task 3 of the DCASE'16 challenge. It is shown that the DNN-based system performs worse than the traditional systems for this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM approach.
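
A minimal sketch of the binary-classifier extension (component counts and the decision threshold are assumptions): per event class, one GMM for target frames and one for non-target frames; the class is marked active wherever the log-likelihood ratio exceeds a threshold, so several classes can be active at once:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_binary_gmms(X_target, X_rest, n_components=8):
        g_target = GaussianMixture(n_components).fit(X_target)
        g_rest = GaussianMixture(n_components).fit(X_rest)
        return g_target, g_rest

    def is_active(X, g_target, g_rest, threshold=0.0):
        # Per-frame decision; run one such detector per event class.
        return (g_target.score_samples(X) - g_rest.score_samples(X)) > threshold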

Keywords

acoustic event detection, DCASE2016, Gabor filterbank, deep neural network

Cites: 24 (see at Google Scholar)

PDF
Abstract

This paper investigates several approaches to the acoustic scene classification (ASC) task. We start from low-level feature representations for segmented audio frames and investigate different time granularities for feature aggregation. We study the use of the support vector machine (SVM), as a well-known classifier, together with two popular neural network (NN) architectures, namely the multilayer perceptron (MLP) and the convolutional neural network (CNN), for higher-level feature learning and classification. We evaluate the performance of these approaches on benchmark datasets provided by the 2013 and 2016 Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. We observe that a simple approach exploiting the averaged Mel-log-spectrogram, an extremely compact feature, together with an SVM can obtain even better results than the NN-based approaches, and performance comparable to the best systems in the DCASE 2013 challenge.
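
The "extremely compact feature" mentioned above reduces to a few lines (the Mel and SVM parameters are assumptions): time-averaged log-Mel energies per clip, fed to an SVM:

    import numpy as np
    import librosa
    from sklearn.svm import SVC

    def averaged_logmel(path, n_mels=40):
        y, sr = librosa.load(path, sr=22050)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return np.log(S + 1e-10).mean(axis=1)   # one (n_mels,) vector per clip

    # clf = SVC(kernel="rbf", C=1.0).fit(np.vstack(train_vectors), scene_labels)
    # train_vectors and scene_labels are assumed to be available.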

Keywords

Acoustic scene classification, Audio features, Multilayer Perceptron, Convolutional Neural Network, Support Vector Machine

Cites: 17 (see at Google Scholar)

PDF
Abstract

Coupled non-negative matrix factorization (NMF) of spectral representations and class activity annotations has shown promising results for acoustic event detection (AED) in real-life environments. Recently, a new dataset has been proposed for the development of algorithms for real-life AED. In this paper we propose two methods for real-life polyphonic AED, coupled sparse non-negative matrix factorization (CSNMF) of time-frequency patches with class activity annotations and multi-class random forest (MRF) classification of time-frequency patches, and compare their performance on this new dataset. Both of our methods outperform the DCASE2016 baseline in terms of F-score. Moreover, we show that, as the dataset is unbalanced, a classifier that recognizes a few of the most frequent classes may outperform the sparse NMF approach and a baseline based on Gaussian mixture models.

Keywords

Acoustic event detection, random forest classifier, non-negative matrix factorization, sparse representation

Cites: 6 (see at Google Scholar)

PDF
Abstract

This workshop paper presents our contribution to the acoustic scene classification task proposed for the “detection and classification of acoustic scenes and events” (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrograms. In addition, we use a training method that can be applied when the validation performance of the system saturates as training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%, which constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.

Keywords

Acoustic scene classification, convolutional neural networks, DCASE, computational audio processing

Cites: 191 (see at Google Scholar)

Abstract

This paper outlines preliminary steps towards the development of an audio-based room-occupancy analysis model. Our approach borrows from the speech recognition tradition and is based on Gaussian mixtures and hidden Markov models. We analyze possible challenges encountered in the development of such a model and offer several solutions, including feature design and prediction strategies. We provide results obtained from experiments with audio data from a retail store in Palo Alto, California. Model assessment is done via the leave-two-out bootstrap, and the converged models achieve good accuracy, thus representing a contribution to multimodal people-counting algorithms.

Keywords

Acoustic Traffic Monitoring, Audio Forensics, Retail Analytics

Cites: 13 (see at Google Scholar)

PDF
Abstract

In this paper, we present a deep neural network (DNN) based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. First, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments was conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement in average scene classification error compared with the Gaussian mixture model (GMM) based benchmark system across the four standard folds.
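
A hedged sketch of the multi-level objective (the taxonomy size and loss weights are assumptions): the loss combines cross-entropy over the fine scene classes with cross-entropy over coarser parent categories:

    from tensorflow.keras import layers, models

    N_FINE, N_COARSE, F = 15, 3, 200   # scene classes, parent categories, feature dim

    inp = layers.Input(shape=(F,))
    h = layers.Dense(256, activation="relu")(inp)
    fine = layers.Dense(N_FINE, activation="softmax", name="fine")(h)
    coarse = layers.Dense(N_COARSE, activation="softmax", name="coarse")(h)

    model = models.Model(inp, [fine, coarse])
    model.compile(optimizer="adam",
                  loss={"fine": "categorical_crossentropy",
                        "coarse": "categorical_crossentropy"},
                  loss_weights={"fine": 1.0, "coarse": 0.5})   # weights are assumed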

Keywords

Acoustic scene classification, deep neural network, hierarchical pre-training, multi-level objective function

Cites: 37 (see at Google Scholar)

PDF
Abstract

Acoustic event detection for content analysis in most cases relies on a large amount of labeled data. However, manually annotating data is a time-consuming task, and few annotated resources are thus available so far. Unlike acoustic event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data, which is highly desirable for some practical applications of audio analysis. In this paper we propose a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all, or almost all, frames of the chunk were fed into the DNN to perform multi-label regression for the expected tags. The fully DNN, regarded as an encoding function, can map the audio feature sequence to a multi-tag vector well. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improvements, such as dropout and background-noise-aware training, were adopted to enhance the generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian mixture model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method can well utilize long-term temporal information with the whole chunk as input. The results show that our approach obtains a 15% relative improvement over the official GMM-based baseline of the DCASE 2016 challenge.

Keywords

Audio tagging, deep neural networks, multilabel regression, dropout, DCASE 2016

Cites: 23 (see at Google Scholar)

PDF
Abstract

We present a resource-efficient framework for acoustic scene classification. In particular, we combine gated recurrent neural networks (GRNNs) and a linear discriminant analysis (LDA) objective for efficiently classifying the environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on the development data, a relative improvement of 8.34% over the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis; in particular, we evaluate the effects of virtual adversarial training (VAT) and of a hybrid, i.e. generative-discriminative, objective function.

Keywords

Acoustic Scene Labeling, Gated Recurrent Networks, Deep Linear Discriminant Analysis, Semi-Supervised Learning

Cites: 49 (see at Google Scholar)

PDF