The proceedings of the DCASE2016 workshop have been published as an electronic publication in the Tampere University of Technology series:
Virtanen, T., Mesaros, A., Heittola, T., Plumbley, M. D., Foster, P., Benetos, E., & Lagrange, M. (Eds.) (2016). Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016).
ISBN (Electronic): 978-952-15-3807-0
Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features
Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola and Tuomas Virtanen
Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland
In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real-life sound recordings typically contain many overlapping sound events, making them hard to recognize with just mono-channel audio. Human listeners successfully recognize mixtures of overlapping sound events by using pitch cues and by exploiting the stereo (multichannel) audio signal available at their ears to spatially localize these events. Traditionally, SED systems have used only mono-channel audio; motivated by the human listener, we propose to extend them to use multichannel audio. The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
Sound event detection, multichannel, time difference of arrival, pitch, recurrent neural networks, long short term memory
Cites: 132
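The spatial cue named in the keywords above, time difference of arrival, can be illustrated with a brute-force cross-correlation between two channels. This is only a minimal sketch: the abstract does not say how the delay is estimated, so plain cross-correlation is an assumption, and `estimate_delay` and the toy signals are hypothetical names and data.

```python
def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive result means `right` is a delayed copy of `left`.
    Plain cross-correlation is assumed here for illustration only.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += x * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical two-channel frame: the right channel lags the left by 3 samples.
left = [0.0, 1.0, 0.5, -0.3, 0.8, -1.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0] + left[:-3]
delay_samples = estimate_delay(left, right, max_lag=5)
```

Dividing `delay_samples` by the sampling rate gives the delay in seconds, which could then be used as a per-frame spatial feature.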
Acoustic Scene Classification Using Parallel Combination of LSTM and CNN
Soo Hyun Bae, Inkyu Choi and Nam Soo Kim
Seoul National University, Department of Electrical and Computer Engineering and INMC, Seoul, Korea
Deep neural networks (DNNs) have recently achieved great success in various learning tasks, and have also been used for classification of environmental sounds. While DNNs are showing their potential in the classification task, they cannot fully utilize the temporal information. In this paper, we propose a neural network architecture for the purpose of using sequential information. The proposed structure is composed of two separate lower networks and one upper network. We refer to these as LSTM layers, CNN layers and connected layers, respectively. The LSTM layers extract the sequential information from consecutive audio features. The CNN layers learn the spectro-temporal locality from spectrogram images. Finally, the connected layers summarize the outputs of the two networks to take advantage of the complementary features of the LSTM and CNN by combining them. To compare the proposed method with other neural networks, we conducted a number of experiments on the TUT acoustic scenes 2016 dataset, which consists of recordings from various acoustic scenes. By using the proposed combination structure, we achieved higher performance compared to the conventional DNN, CNN and LSTM architectures.
Deep learning, sequence learning, combination of LSTM and CNN, acoustic scene classification
Cites: 163
DNN-Based Sound Event Detection with Exemplar-Based Approach for Noise Reduction
Inkyu Choi, Kisoo Kwon, Soo Hyun Bae and Nam Soo Kim
Seoul National University, Department of Electrical and Computer Engineering and INMC, Seoul, Korea
In this paper, we present a sound event detection system based on a deep neural network (DNN). An exemplar-based noise reduction approach is proposed for enhancing the mel-band energy feature. A multi-label DNN classifier is trained for polyphonic event detection. The system is evaluated on the IEEE DCASE 2016 Challenge Task 2 Train/Development Datasets. The result on the development set yields up to 0.9261 and 0.1379 in terms of F-score and error rate on the segment-based metric, respectively.
Sound event detection, deep neural network, exemplar-based noise reduction
Cites: 37
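Several abstracts in this list report a segment-based F-score and error rate. The sketch below shows how these two numbers are commonly computed for polyphonic detection, with per-segment substitutions, deletions and insertions. It assumes the usual definitions and is not taken from any of the papers; `segment_based_scores` and the example labels are hypothetical.

```python
def segment_based_scores(reference, estimated):
    """Compute segment-based F-score and error rate.

    `reference` and `estimated` are lists of sets holding the event labels
    active in each fixed-length segment (annotation vs. system output).
    """
    tp = fp = fn = 0
    subs = dels = ins = 0
    n_ref = 0
    for ref, est in zip(reference, estimated):
        seg_tp = len(ref & est)
        seg_fp = len(est - ref)
        seg_fn = len(ref - est)
        tp, fp, fn = tp + seg_tp, fp + seg_fp, fn + seg_fn
        # A false positive paired with a false negative counts as a substitution.
        subs += min(seg_fn, seg_fp)
        dels += max(0, seg_fn - seg_fp)
        ins += max(0, seg_fp - seg_fn)
        n_ref += len(ref)
    f_score = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    error_rate = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f_score, error_rate


# Hypothetical four-segment example.
reference = [{"door"}, {"door", "speech"}, set(), {"phone"}]
estimated = [{"door"}, {"speech"}, {"cough"}, {"keys"}]
f_score, error_rate = segment_based_scores(reference, estimated)
```

Note that the error rate is normalized by the number of reference events, so it can exceed 1 when a system produces many insertions.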
Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
Benjamin Elizalde1, Anurag Kumar1, Ankit Shah2, Rohan Badlani3, Emmanuel Vincent4, Bhiksha Raj1, and Ian Lane1
1Carnegie Mellon University, Pittsburgh, USA, 2NIT Surathkal, India, 3BITS, Pilani, India, 4Inria, Villers-les-Nancy, France
In this paper we present our work on Task 1 Acoustic Scene Classification and Task 3 Sound Event Detection in Real Life Recordings. Among our experiments we have low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved on the baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6% and for Task 3 we achieved a Segment-Based Error Rate of 0.48 compared to the baseline of 0.91.
audio, scenes, events, features, segmentation, DCASE, bag of audio words, GMMs
Cites: 45
Improved Dictionary Selection and Detection Schemes in Sparse-CNMF-Based Overlapping Acoustic Event Detection
Panagiotis Giannoulis1,3, Gerasimos Potamianos2,3, Petros Maragos1,3, and Athanasios Katsamanis1,3
1School of ECE, National Technical University of Athens, Athens, Greece, 2Department of ECE, University of Thessaly, Volos, Greece, 3Athena Research and Innovation Center, Maroussi, Greece
In this paper, we investigate sparse convolutive non-negative matrix factorization (sparse-CNMF) for detecting overlapped acoustic events in single-channel audio, within the experimental framework of Task 2 of the DCASE’16 challenge. In particular, our main focus lies on the efficient creation of the dictionary, as well as on the detection scheme associated with the CNMF approach. Specifically, we propose a shift-invariant dictionary reduction method that outperforms standard CNMF-based dictionary building. Further, we develop a novel detection algorithm that combines information from the CNMF activation matrix and atom-based reconstruction residuals, achieving significant improvement over the conventional approach based on the activations alone. The resulting system, evaluated on the development set of Task 2 of the DCASE’16 Challenge, also achieves large gains over the traditional NMF baseline provided by the Challenge organizers.
Convolutive Non-Negative Matrix Factorization, Dictionary Building, Overlapped Acoustic Event Detection
Cites: 14
Synthetic Sound Event Detection based on MFCC
Juana M. Gutiérrez-Arriola, Rubén Fraile, Alexander Camacho, Thibaut Durand, Jaime L. Jarrín, and Shirley R. Mendoza
Escuela Técnica Superior de Ingeniería y Sistemas de Telecomunicación, Universidad Politécnica de Madrid, Spain
This paper presents a sound event detection system based on mel-frequency cepstral coefficients and a non-parametric classifier. System performance is tested using the training and development datasets corresponding to the second task of the DCASE 2016 challenge. Results indicate that the most relevant spectral information for event detection is below 8000 Hz and that the general shape of the spectral envelope is much more relevant than its fine details.
Sound event detection, spectral envelope, cepstral analysis
Cites: 15
Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection
Tomoki Hayashi1, Shinji Watanabe2, Tomoki Toda1, Takaaki Hori2, Jonathan Le Roux2, and Kazuya Takeda1
1Nagoya University, Nagoya, Japan, 2Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA
In this study, we propose a new method of polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNN. We compare the performance of our proposed method to conventional methods such as non-negative matrix factorization (NMF) and standard BLSTM-RNN, using the DCASE2016 task 2 dataset. Our proposed method outperformed conventional approaches in both monophonic and polyphonic tasks, and finally achieved an average F1-score of 76.63% (error rate of 51.11%) on the event-based evaluation, and an average F1-score of 87.16% (error rate of 25.91%) on the segment-based evaluation.
Polyphonic Sound Event Detection, Bidirectional Long Short-Term Memory, Hidden Markov Model, multilabel classification
Cites: 43
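To illustrate why pairing frame-wise network outputs with an HMM helps, the sketch below runs Viterbi decoding over a two-state (inactive/active) chain for a single event class. It is a deliberately simplified stand-in, not the BLSTM-HMM of the paper: the two-state topology, the `stay_prob` value, and the direct use of posteriors as emission scores are all assumptions made for illustration.

```python
import math


def viterbi_binary(posteriors, stay_prob=0.9):
    """Decode the most likely inactive(0)/active(1) path for one event class.

    `posteriors` are per-frame probabilities that the event is active
    (for example, network outputs); `stay_prob` is an assumed probability
    of staying in the same state from one frame to the next.
    """
    eps = 1e-12
    log_trans = [
        [math.log(stay_prob), math.log(1 - stay_prob)],
        [math.log(1 - stay_prob), math.log(stay_prob)],
    ]

    def emit(p):
        # Log score of observing posterior p in state 0 and in state 1.
        return [math.log(1 - p + eps), math.log(p + eps)]

    score = emit(posteriors[0])  # uniform initial prior, constant omitted
    back = []
    for p in posteriors[1:]:
        e = emit(p)
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[prev] + log_trans[prev][s] for prev in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_prev)
            new_score.append(cands[best_prev] + e[s])
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the final frame.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical posteriors with a one-frame dip in the middle of an event.
posteriors = [0.1, 0.2, 0.9, 0.4, 0.8, 0.9, 0.1, 0.1]
thresholded = [int(p > 0.5) for p in posteriors]
smoothed = viterbi_binary(posteriors)
```

Simple thresholding splits the event at the dip, while the transition penalty lets the decoded path bridge it, which is one informal way to see the benefit of modeling label duration.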
Estimating Traffic Noise Levels Using Acoustic Monitoring: A Preliminary Study
Jean-Remy Gloaguen1, Arnaud Can1, Mathieu Lagrange2, and Jean-François Petiot2
1Ifsttar - LAE, Bouguenais, France, 2IRCCyN, École Centrale de Nantes, Nantes, France
In this paper, Non-negative Matrix Factorization is applied to isolate the contribution of road traffic from acoustic measurements in urban sound mixtures. This method is tested on simulated scenes to enable better control of the presence of different sound sources. The first results presented show the potential of the method.
Non-negative Matrix Factorization, road traffic noise mapping, urban measurements
Cites: 13
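Non-negative matrix factorization appears in several of the abstracts above. As a rough illustration of the basic technique, the sketch below factorizes a small non-negative matrix V into W (spectral templates) and H (activations) with the standard multiplicative updates for the Euclidean cost. It is a generic textbook variant, not the sparse, convolutive, or semi-supervised versions used in the papers, and the toy "traffic" and "other" templates are invented for the example.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize v (rows x cols) into w (rows x rank) and h (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + eps for _ in range(cols)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


# Toy spectrogram: 4 frequency bins x 5 frames, built from two invented sources.
traffic, other = [1.0, 0.8, 0.1, 0.0], [0.0, 0.1, 0.9, 1.0]
act_traffic, act_other = [1.0, 0.5, 0.0, 0.8, 0.2], [0.0, 0.6, 1.0, 0.3, 0.9]
v = [[traffic[f] * act_traffic[t] + other[f] * act_other[t] for t in range(5)]
     for f in range(4)]
w, h, errors = nmf(v, rank=2)
```

Once the factors are learned, the contribution of one source can be estimated by multiplying only its template column of `w` with its activation row of `h`.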
Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with Mixtures of Local Dictionaries
Tatsuya Komatsu, Takahiro Toizumi, Reishi Kondo, and Yuzo Senda
NEC Corporation, Data Science Research Laboratories, Kawasaki, Japan
This paper proposes an acoustic event detection (AED) method using semi-supervised non-negative matrix factorization (NMF) with a mixture of local dictionaries (MLD). The proposed method, based on semi-supervised NMF, newly introduces a noise dictionary and a noise activation matrix, both dedicated to unknown acoustic atoms which are not included in the MLD. Because unknown acoustic atoms are better modeled by the new noise dictionary learned upon classification and the new activation matrix, the proposed method provides a higher classification performance for event classes modeled by the MLD when a signal to be classified is contaminated by unknown acoustic atoms. Evaluation results using the DCASE2016 task 2 dataset show that the F-measure of the proposed method with semi-supervised NMF is improved by as much as 11.1% compared to that of the conventional method with supervised NMF.
Acoustic event detection, Non-negative matrix factorization, Semi-supervised NMF, Mixture of local dictionaries
Cites: 79
Deep Neural Network Baseline for DCASE Challenge 2016
Qiuqiang Kong, Iwona Sobieraj, Wenwu Wang, and Mark D. Plumbley
University of Surrey, Centre for Vision, Speech and Signal Processing, UK
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel-filter bank features are used. Two kinds of Mel banks, same area bank and same height bank, are discussed. Experimental results show that the same height bank is better than the same area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained accuracy of 76.4% using Mel + DNN against 72.5% by using Mel Frequency Cepstral Coefficient (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained F value of 17.4% using Mel + DNN against 41.6% by using Constant Q Transform (CQT) + Nonnegative Matrix Factorization (NMF). In Task 3 we obtained F value of 38.1% using Mel + DNN against 26.6% by using MFCC + GMM. In Task 4 we obtained Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% by using MFCC + GMM. Therefore the DNN improves on the baseline in Task 1 and Task 3, and is similar to the baseline in Task 4, although it is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
Mel-filter bank, Deep Neural Network (DNN), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Audio Tagging
Cites: 78
In this paper a novel approach for acoustic event detection in sensor networks is presented. Improved and more robust recognition is achieved by making use of the signals from multiple sensors. To this end, various known fusion strategies are evaluated along with a novel method using classifier stacking. Detailed comparative evaluation is performed on two different datasets using 32 distributed microphones: the ITC-Irst database, and a set of smart room recordings. The stacking yields a significant improvement. The effect of using events at previously observed as well as unobserved locations is investigated. The performance of recognizing events at previously unobserved locations can be improved by sorting the channels according to their posterior probabilities.
Bag-of-Features, Acoustic Event Detection, Sensor Arrays, Robustness, Acoustic Sensor Networks
Cites: 26
CQT-based Convolutional Neural Networks for Audio Scene Classification
Thomas Lidy1 and Alexander Schindler2
1Vienna University of Technology, Institute of Software Technology, Vienna, Austria, 2Austrian Institute of Technology, Digital Safety and Security, Vienna, Austria
In this paper, we propose a parallel Convolutional Neural Network architecture for the task of classifying acoustic scenes and urban soundscapes. A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We, however, show in this paper that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and filter sizes in a Convolutional Neural Network. These are non-trivial in audio tasks due to the different semantics of the two axes of the input data: time vs. frequency. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency. Our approach shows a 10.7% relative improvement over the baseline system of the DCASE 2016 Acoustic Scenes Classification task.
Deep Learning, Constant-Q-Transform, Convolutional Neural Networks, Audio Event Classification
Cites: 115
Pairwise Decomposition with Deep Neural Networks and Multiscale Kernel Subspace Learning for Acoustic Scene Classification
Erik Marchi1,3, Dario Tonelli2, Xinzhou Xu1, Fabien Ringeval1,3, Jun Deng1, Stefano Squartini2, and Björn Schuller1,3,4
1University of Passau, Chair of Complex and Intelligent Systems, Germany, 2A3LAB, Department of Information Engineering, Università Politecnica delle Marche, Italy, 3audEERING GmbH, Gilching, Germany, 4Imperial College London, Department of Computing, London, United Kingdom
We propose a system for acoustic scene classification using pairwise decomposition with deep neural networks and dimensionality reduction by multiscale kernel subspace learning. It is our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016). The system classifies 15 different acoustic scenes. First, auditory spectral features are extracted and fed into 15 binary deep multilayer perceptron neural networks (MLPs). The MLPs are trained with the 'one-against-all' paradigm to perform a pairwise decomposition. In a second stage, a large number of spectral, cepstral, energy and voicing-related audio features are extracted. Multiscale Gaussian kernels are then used in constructing an optimal linear combination of Gram matrices for multiple kernel subspace learning. The reduced feature set is fed into a nearest-neighbour classifier. Predictions from the two systems are then combined by a threshold-based decision function. On the official development set of the challenge, an accuracy of 81.4% is achieved.
Computational Acoustic Scene Analysis, Acoustic Scene Classification, Multilayer Perceptron, Deep Neural Networks, Multiscale Kernel Analysis
Cites: 34
Acoustic Scene Classification using Time-Delay Neural Networks and Amplitude Modulation Filter Bank Features
Niko Moritz1, Jens Schröder1, Stefan Goetze1, Jörn Anemüller2, and Birger Kollmeier2
1Fraunhofer IDMT, Project Group for Hearing, Speech, and Audio Technology, Oldenburg, Germany, 2University of Oldenburg, Medizinische Physik & Hearing4all, Oldenburg, Germany
This paper presents a system for acoustic scene classification (SC) that is applied to data of the SC task of the DCASE’16 challenge (Task 1). The proposed method is based on extracting acoustic features that employ a relatively long temporal context, i.e., amplitude modulation filter bank (AMFB) features, prior to detection of acoustic scenes using a neural network (NN) based classification approach. Recurrent neural networks (RNNs) are well suited to model long-term acoustic dependencies that are known to encode important information for SC tasks. However, RNNs require a relatively large amount of training data in comparison to feed-forward deep neural networks (DNNs). Hence, the time-delay neural network (TDNN) approach is used in the present work, which enables analysis of long contextual information similar to RNNs but with training efforts comparable to conventional DNNs. The proposed SC system attains a recognition accuracy of 75%, which is 2.5% higher compared to the DCASE’16 baseline system.
Time-delay neural networks, acoustic scene classification, DCASE, amplitude modulation filter bank features
Cites: 10
A Real-Time Environmental Sound Recognition System for the Android OS
Angelos Pillos1, Khalid Alghamidi1, Noura Alzamel1, Veselin Pavlov1, Swetha Machanavajhala2
1Computer Science Department, University College London, UK, 2Microsoft, Redmond, USA
Sounds around us convey the context of daily life activities. There are 360 million individuals worldwide who experience some form of deafness. For them, missing these contexts, such as a fire alarm, can be not only inconvenient but also life threatening. In this paper, we explore a combination of different audio feature extraction algorithms that would aid in increasing the accuracy of identifying environmental sounds and also reduce power consumption. We also design a simple approach that alleviates some of the privacy concerns, and evaluate the implemented real-time environmental sound recognition system on Android mobile devices. Our solution works in embedded mode, where sound processing and recognition are performed directly on a mobile device in a way that conserves battery power. Sound signals were detected using the standard deviation of normalized power sequences. Multiple feature extraction techniques such as zero crossing rate, Mel-frequency cepstral coefficients (MFCC), spectral flatness, and spectral centroid were applied to the raw sound signal. A multi-layer perceptron classifier was used to identify the sound. Experimental results show improvement over the state of the art.
Environmental sound recognition, signal processing, machine learning, Android OS
Cites: 35
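Three of the lightweight features named in the abstract above (zero crossing rate, spectral centroid, spectral flatness) can be sketched in a few lines. These are common textbook definitions, assumed here for illustration and not taken from the paper. The naive DFT is only suitable for short demonstration frames, and all function names are hypothetical.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    changes = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return changes / (len(frame) - 1)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), for short frames only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz."""
    mags = magnitude_spectrum(frame)
    freqs = [k * sample_rate / len(frame) for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    power = [m * m + eps for m in magnitude_spectrum(frame)]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    return geometric / (sum(power) / len(power))


# Hypothetical test signals: a 1 kHz tone and a single impulse, 64 samples at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
impulse = [1.0] + [0.0] * 63
tone_centroid = spectral_centroid(tone, sample_rate)
tone_flatness = spectral_flatness(tone)
impulse_flatness = spectral_flatness(impulse)
```

A pure tone concentrates its energy in one bin, so its flatness is close to 0, while an impulse has a flat spectrum and a flatness close to 1. This contrast is what makes flatness useful for separating tonal alarms from broadband noise.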
Performance comparison of GMM, HMM and DNN based approaches for acoustic event detection within Task 3 of the DCASE 2016 challenge
Jens Schröder1,3, Jörn Anemüller2,3, and Stefan Goetze1,3
1Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg, Germany, 2University of Oldenburg, Department of Medical Physics and Acoustics, Oldenburg, Germany, 3Cluster of Excellence, Hearing4all, Germany
This contribution reports on the performance of systems for polyphonic acoustic event detection (AED) compared within the framework of the “detection and classification of acoustic scenes and events 2016” (DCASE’16) challenge. State-of-the-art Gaussian mixture model (GMM) and GMM-hidden Markov model (HMM) approaches are applied using Mel-frequency cepstral coefficients (MFCCs) and Gabor filterbank (GFB) features and a non-negative matrix factorization (NMF) based system. Furthermore, tandem and hybrid deep neural network (DNN)-HMM systems are adopted. All HMM systems that usually are of multiclass type, i.e., systems that just output one label per time segment from a set of possible classes, are extended to binary classification systems that are composed of single binary classifiers classifying between target and non-target classes and, thus, are capable of multi-labeling. These systems are evaluated on the data of residential areas of Task 3 from the DCASE’16 challenge. It is shown that the DNN-based system performs worse than the traditional systems for this task. Best results are achieved using GFB features in combination with a multiclass GMM-HMM approach.
acoustic event detection, DCASE2016, Gabor filterbank, deep neural network
Cites: 21
Acoustic Scene Classification: An evaluation of an extremely compact feature representation
Gustavo Sena Mafra1, Ngoc Q. K. Duong2, Alexey Ozerov2, and Patrick Perez2
1Universidade Federal de Santa Catarina, Santa Catarina, Brazil, 2Technicolor, Cesson-Sévigné, France
This paper investigates several approaches to address the acoustic scene classification (ASC) task. We start from low-level feature representation for segmented audio frames and investigate different time granularity for feature aggregation. We study the use of support vector machine (SVM), as a well-known classifier, together with two popular neural network (NN) architectures, namely multilayer perceptron (MLP) and convolutional neural network (CNN), for higher level feature learning and classification. We evaluate the performance of these approaches on benchmark datasets provided by the 2013 and 2016 Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. We observe that a simple approach exploiting the averaged Mel-log-spectrogram, as an extremely compact feature, together with an SVM can obtain even better results than NN-based approaches and comparable performance with the best systems in the DCASE 2013 challenge.
Acoustic scene classification, Audio features, Multilayer Perceptron, Convolutional Neural Network, Support Vector Machine
Cites: 17
Coupled Sparse NMF vs. Random Forest Classification for Real Life Acoustic Event Detection
Iwona Sobieraj and Mark D. Plumbley
University of Surrey, Centre for Vision, Speech and Signal Processing, Surrey, United Kingdom
Coupled non-negative matrix factorization (NMF) of spectral representations and class activity annotations has shown promising results for acoustic event detection (AED) in real life environments. Recently, a new dataset has been proposed for development of algorithms for real life AED. In this paper we propose two methods for real life polyphonic AED: Coupled Sparse Non-negative Matrix Factorization (CSNMF) of time-frequency patches with class activity annotations and Multi-class Random Forest classification (MRF) of time-frequency patches, and compare their performance on this new dataset. Both our methods outperform the DCASE2016 baseline in terms of F-score. Moreover, we show that as the dataset is unbalanced, a classifier that recognizes a few most frequent classes may outperform the sparse NMF-approach and a baseline based on Gaussian Mixture Models.
Acoustic event detection, random forest classifier, non-negative matrix factorization, sparse representation
Cites: 6
DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks
Michele Valenti1, Aleksandr Diment2, Giambattista Parascandolo2, Stefano Squartini1, Tuomas Virtanen2
1Università Politecnica delle Marche, Department of Information Engineering, Ancona, Italy, 2Tampere University of Technology, Department of Signal Processing, Tampere, Finland
This workshop paper presents our contribution for the task of acoustic scene classification proposed for the “detection and classification of acoustic scenes and events” (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrogram. In addition, we use a training method that can be used when the validation performance of the system saturates as the training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%. It constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.
Acoustic scene classification, convolutional neural networks, DCASE, computational audio processing
Cites: 171
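Several abstracts quote relative rather than absolute improvements, which are easy to misread. The short sketch below shows the arithmetic using the two figures given in the abstract above (79.0% accuracy, 8.8% relative improvement). The baseline value it backs out is only implied by those two numbers and is not stated in the abstract; the function names are hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def implied_baseline(new, relative_gain_percent):
    """Baseline value that would yield the given relative gain."""
    return new / (1.0 + relative_gain_percent / 100.0)


# Figures from the abstract: 79.0% accuracy, 8.8% relative improvement.
baseline = implied_baseline(79.0, 8.8)           # roughly 72.6% accuracy
absolute_gain = 79.0 - baseline                  # in percentage points
check = relative_improvement(79.0, baseline)     # recovers 8.8
```

The same relative figure corresponds to a smaller absolute gain in percentage points, so it is worth checking which of the two a paper reports before comparing systems.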
ABROA: Audio-Based Room-Occupancy Analysis Using Gaussian Mixtures and Hidden Markov Models
UC Berkeley, Center for New Music and Audio Technologies (CNMAT), Berkeley, USA
This paper outlines preliminary steps towards the development of an audio-based room-occupancy analysis model. Our approach borrows from speech recognition tradition and is based on Gaussian Mixtures and Hidden Markov Models. We analyze possible challenges encountered in the development of such a model, and offer several solutions including feature design and prediction strategies. We provide results obtained from experiments with audio data from a retail store in Palo Alto, California. Model assessment is done via leave-two-out Bootstrap and model convergence achieves good accuracy, thus representing a contribution to multimodal people counting algorithms.
Acoustic Traffic Monitoring, Audio Forensics, Retail Analytics
Cites: 11
Hierarchical Learning for DNN-Based Acoustic Scene Classification
Yong Xu, Qiang Huang, Wenwu Wang, and Mark D. Plumbley
University of Surrey, Centre for Vision, Speech and Signal Processing, Surrey, UK
In this paper, we present a deep neural network (DNN)-based acoustic scene classification framework. Two hierarchical learning methods are proposed to improve the DNN baseline performance by incorporating the hierarchical taxonomy information of environmental sounds. Firstly, the parameters of the DNN are initialized by the proposed hierarchical pre-training. A multi-level objective function is then adopted to add more constraints to the cross-entropy based loss function. A series of experiments were conducted on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge. The final DNN-based system achieved a 22.9% relative improvement on average scene classification error as compared with the Gaussian Mixture Model (GMM)-based benchmark system across four standard folds.
Acoustic scene classification, deep neural network, hierarchical pre-training, multi-level objective function
Cites: 32
Fully DNN-Based Multi-Label Regression for Audio Tagging
Yong Xu, Qiang Huang, Wenwu Wang, Philip J. B. Jackson, and Mark D. Plumbley
University of Surrey, Centre for Vision, Speech and Signal Processing, Surrey, UK
Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is a time-consuming task, so few annotated resources are available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, only relies on weakly labeled data. This is highly desirable for some practical applications using audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task in a regression way. Considering that only chunk-level rather than frame-level labels are available, all or almost all frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can map the audio feature sequence to a multi-tag vector well. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improved methods were adopted, such as Dropout and background noise aware training, to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian Mixture Model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could make good use of the long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.
Audio tagging, deep neural networks, multilabel regression, dropout, DCASE 2016
Cites: 21
Gated Recurrent Networks applied to Acoustic Scene Classification
Matthias Zöhrer and Franz Pernkopf
Graz University of Technology, Signal Processing and Speech Communication Laboratory, Austria
We present a resource-efficient framework for acoustic scene classification. In particular, we combine gated recurrent neural networks (GRNNs) and a linear discriminant analysis (LDA) objective for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on development data, resulting in a relative improvement of 8.34% compared to the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis. In particular, we evaluate the effects of virtual adversarial training (VAT), the use of a hybrid, i.e. generative-discriminative, objective function.
Acoustic Scene Labeling, Gated Recurrent Networks, Deep Linear Discriminant Analysis, Semi-Supervised Learning
Cites: 40