8:45 |
Registration |
|
8:45 |
Coffee |
|
9:10 |
Welcome |
|
Welcome
Annamaria Mesaros, Tampere University of Technology, Finland
|
|
9:20 |
Keynote |
|
Keynote
Session chair Sacha Krstulović
General-Purpose Sound Event Recognition
Shawn Hershey
Google Research
Abstract
Inspired by the success of general-purpose object recognition in images, we have been working on automatic, real-time systems for recognizing sound events regardless of domain. Our goal is a system that can tag or describe an arbitrary soundtrack - as might be found on a media sharing site like YouTube - using terms that make sense to a human. I will cover the process of defining this task, our deep learning approach, our efforts to collect training data, and our current results. I'll discuss some factors important for accurate models, and some ideas about how to get the best return from manual labeling investment.
Biography
Shawn Hershey is a software engineer at Google Research, working in the Machine Hearing Group on machine learning for speech and audio processing. He is currently working on soundtrack classification and audio event detection. Before Google he worked as the first Software Engineer at Lyric Semiconductors, building tools to aid the development of hardware accelerators for AI. On the side, Shawn travels the world teaching Lindy Hop and blues dancing and playing in swing and blues bands. Long ago Shawn graduated from the University of Rochester with a BA in Computer Science and half of a degree from the Eastman School of Music.
Shawn Hershey
Google Research
|
|
10:10 |
Break |
|
10:30 |
Presentations |
|
Oral Session I
Session chair Axel Plinge
|
10:30 |
DCASE2017 Challenge Summary
Tuomas Virtanen
Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering & Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
Keywords
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels
@inproceedings{Mesaros2017,
author = "Mesaros, Annamaria and Heittola, Toni and Diment, Aleksandr and Elizalde, Benjamin and Shah, Ankit and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
title = "{DCASE2017} Challenge Setup: Tasks, Datasets and Baseline System",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "85--92",
keywords = "Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels",
abstract = "DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics."
}
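For readers who want a concrete starting point, the following minimal Python sketch reproduces the flavour of the baseline described above: frame-wise log mel-band energies with a small context window, classified by a multilayer perceptron. The feature settings, context size, placeholder file lists and the scikit-learn MLP stand-in are illustrative assumptions, not the official baseline implementation.

import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def log_mel_energies(path, n_mels=40, n_fft=2048, hop_length=1024):
    # Log-scaled mel-band energies, frames as rows.
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel).T

def stack_context(frames, context=2):
    # Concatenate +/- `context` neighbouring frames around each frame.
    pad = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([pad[i:i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

# train_files / train_labels are placeholders for the development-set audio
# paths and scene labels; training is frame-wise, clip decisions by majority vote.
# X = np.vstack([stack_context(log_mel_energies(f)) for f in train_files])
# y = np.concatenate([[lab] * len(stack_context(log_mel_energies(f)))
#                     for f, lab in zip(train_files, train_labels)])
clf = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200)
# clf.fit(X, y)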
|
11:00 |
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane
Seongkyu Mun1, Sangwook Park1, David Han2 and Hanseok Ko1
1Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea, 2Office of Naval Research, Arlington, VA, USA
Abstract
Although it is typically expected that using a large amount of labeled training data would lead to improved performance in deep learning, it is generally difficult to obtain such a database (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are, as a rule, constrained to use a relatively small DB, which raises the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing an additional DB, this paper proposes to use a Generative Adversarial Network (GAN) based method for generating an additional training DB. Since it is not clear whether every sample generated by the GAN would have an equal impact on classification performance, this paper proposes to use a Support Vector Machine (SVM) hyper-plane for each class as a reference for selecting samples that carry class-discriminative information. Based on cross-validated experiments on the development DB, the use of the generated features can improve ASC performance.
Keywords
acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane
@inproceedings{Mun2017,
author = "Mun, Seongkyu and Park, Sangwook and Han, David K and Ko, Hanseok",
title = "Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using {SVM} Hyper-Plane",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "93--102",
keywords = "acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane",
abstract = "Although it is typically expected that using a large amount of labeled training data would lead to improve performance in deep learning, it is generally difficult to obtain such DataBase (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained to use a relatively small DB as a rule, which is similar to the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing additional DB, this paper proposes to use Generative Adversarial Networks (GAN) based method for generating additional training DB. Since it is not clear whether every sample generated by GAN would have equal impact in classification performance, this paper proposes to use Support Vector Machine (SVM) hyper plane for each class as reference for selecting samples, which have class discriminative information. Based on the crossvalidated experiments on development DB, the usage of the generated features could improve ASC performance."
}
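One plausible reading of the selection step, sketched below: fit per-class SVM hyper-planes on real features, then keep only those GAN-generated samples that the SVM places confidently on the side of their own class. The linear SVM, one-vs-rest setup and margin threshold are assumptions for illustration rather than the authors' exact criterion.

import numpy as np
from sklearn.svm import LinearSVC

def select_generated(real_X, real_y, gen_X, gen_y, margin=0.5):
    # Assumes more than two classes, so decision_function is (n_samples, n_classes).
    svm = LinearSVC(C=1.0).fit(real_X, real_y)
    scores = svm.decision_function(gen_X)
    cls_idx = np.searchsorted(svm.classes_, gen_y)
    own = scores[np.arange(len(gen_X)), cls_idx]   # signed distance to own-class hyper-plane
    keep = own > margin                            # retain confidently in-class samples
    return gen_X[keep], gen_y[keep]

# kept_X, kept_y = select_generated(real_X, real_y, gen_X, gen_y)
# The augmented training set is the real data stacked with the kept generated samples.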
|
11:20 |
Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input
Donmoon Lee1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea
Abstract
In this paper, we use an ensemble of convolutional neural network models with various analysis windows to detect audio events in the automotive environment. When detecting the presence of audio events, a global-input model that uses the entire audio clip works better. On the other hand, segmented-input models work better at finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of the input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate.
@inproceedings{Lee2017b,
author = "Lee, Donmoon and Lee, Subin and Han, Yoonchang and Lee, Kyogu",
title = "Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
pages = "74--79",
abstract = "In this paper, we use ensemble of convolutional neural network models that use the various analysis window to detect audio events in the automotive environment. When detecting the presence of audio events, global input based model that uses the entire audio clip works better. On the other hand, segmented input based models works better in finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate."
}
|
11:40 |
Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording given just the list of sound events present in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energy as the input audio feature, and the weak labels provided in the dataset as targets for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the losses computed in the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves a best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
Keywords
sound event detection, weak labels, deep neural network, CNN, GRU
@inproceedings{Adavanne2017,
author = "Adavanne, Sharath and Virtanen, Tuomas",
title = "Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "12--16",
keywords = "sound event detection, weak labels, deep neural network, CNN, GRU",
abstract = "This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence one for the strong followed by the weak label. The network is trained using frame-wise log melband energy as the input audio feature, and weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many number of times as the frames in the input audio feature, and used for strong label layer during training. We propose to control what the network learns from the weak and strong labels by different weighting for the loss computed in the two prediction layers. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves the best error rate of 0.84 for strong labels and F-score of 43.3\% for weak labels on the unseen test split."
}
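The two-output idea can be written down compactly in Keras, as in the sketch below: a CRNN produces frame-level (strong) sigmoid predictions, a max-pooling over time derives clip-level (weak) tags, and the two binary cross-entropy losses are weighted differently. Layer sizes, the pooling used for the weak output and the 0.2/1.0 weighting are illustrative assumptions, not the paper's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels, n_classes = 256, 40, 17

inp = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(inp)
x = layers.MaxPooling2D((1, 5))(x)                     # pool along frequency only
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((1, 4))(x)
x = layers.Reshape((n_frames, -1))(x)                  # back to (time, features)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
strong = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"),
                                name="strong")(x)      # frame-level activity
weak = layers.GlobalMaxPooling1D(name="weak")(strong)  # clip-level tags

model = Model(inp, [strong, weak])
model.compile(optimizer="adam",
              loss={"strong": "binary_crossentropy", "weak": "binary_crossentropy"},
              loss_weights={"strong": 0.2, "weak": 1.0})
# model.fit(X, {"strong": frame_labels, "weak": clip_labels}, ...)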
|
12:00 |
Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study
Christian Kroos and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Abstract
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.
Keywords
Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution
@inproceedings{Kroos2017,
author = "Kroos, Christian and Plumbley, Mark D.",
title = "Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "64--68",
keywords = "Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution",
abstract = "Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44:9\% and achieved rank 15 out of 34 on the ER error metric with a value of 0:891. We discuss the question of evolving versus learning for supervised tasks."
}
|
|
12:30 |
Break |
|
14:00 |
Presentations |
|
Oral Session II
Session chair Romain Serizel
|
14:00 |
Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled mel-spectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of the TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.
Keywords
acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling
@inproceedings{Fonseca2017,
author = "Fonseca, Eduardo and Gong, Rong and Bogdanov, Dmitry and Slizovskaia, Olga and Gomez, Emilia and Serra, Xavier",
title = "Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "37--41",
keywords = "acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling",
abstract = "This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled melspectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2\%."
}
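The late-fusion step itself is simple; a hedged sketch follows, with a scikit-learn gradient boosting model standing in for the hand-crafted-feature branch and equal fusion weights assumed rather than the authors' exact setting.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=200)
# gbm.fit(handcrafted_train, y_train)             # hand-crafted feature branch
# p_gbm = gbm.predict_proba(handcrafted_test)     # (n_clips, n_classes)
# p_cnn = cnn_model.predict(melspec_test)         # softmax output of the CNN branch

def late_fusion(p_gbm, p_cnn, w=0.5):
    # Weighted average of the two class-probability matrices, then argmax per clip.
    return (w * p_gbm + (1.0 - w) * p_cnn).argmax(axis=1)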
|
14:20 |
DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features
Abelino Jimenez, Benjamin Elizalde and Bhiksha Raj
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
Abstract
Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of the input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels: Gaussian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both input and random features outperformed the DCASE baseline by an absolute 4%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance, and by more than six times while still outperforming the baseline. Hence, random features could be employed by state of the art approaches to compute low-storage features and perform faster kernel computations.
Keywords
Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features
@inproceedings{Jimenez2017,
author = "Jimenez, Abelino and Elizalde, Benjamin and Raj, Bhiksha",
title = "{DCASE} 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "55--58",
keywords = "Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features",
abstract = "Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as the Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of these input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels, Guassian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both, input and random features outperformed the DCASE baseline by an absolute 4\%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance and by more than six times and still outperformed the baseline. Hence, random features could be employed by state of the art approaches to compute low-storage features and perform faster kernel computations."
}
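A minimal sketch of the random-feature map for the three shift-invariant kernels named in the abstract: projection directions are drawn from the spectral density of each kernel (normal draws for the Gaussian kernel, Cauchy draws for the Laplacian kernel, Laplace draws for the Cauchy kernel), and a linear SVM is trained on the resulting features. The dimension D, bandwidth gamma and fixed seed are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def random_features(X, D=512, kernel="gaussian", gamma=1.0, seed=0):
    # Random Fourier features z(x) = sqrt(2/D) * cos(Wx + b); use the same seed
    # for training and test data so that W and b stay identical.
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    if kernel == "gaussian":
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    elif kernel == "laplacian":
        W = gamma * rng.standard_cauchy(size=(d, D))
    else:  # "cauchy"
        W = rng.laplace(scale=gamma, size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Z_train = random_features(X_train, kernel="laplacian")
# clf = LinearSVC().fit(Z_train, y_train)
# acc = clf.score(random_features(X_test, kernel="laplacian"), y_test)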
|
14:40 |
Nonnegative Feature Learning Methods for Acoustic Scene Classification
Victor Bisot1, Romain Serizel2,3,4, Slim Essid1 and Gaël Richard1
1Image Data and Signal, Telecom ParisTech, Paris, France, 2Université de Lorraine, Loria, Nancy, France, 3Inria, Nancy, France, 4CNRS, LORIA, Nancy, France
Abstract
This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit simple deep neural network architecture to classify both low level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that exploits jointly unsupervised nonnegative matrix factorization activation features and low-level time frequency representations as inputs. Finally, we present a fusion of proposed systems in order to further improve performance. The resulting systems are our submission for the task 1 of the DCASE 2017 challenge.
Keywords
Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks
@inproceedings{Bisot2017,
author = "Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gaël",
title = "Nonnegative Feature Learning Methods for Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "22--26",
keywords = "Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks",
abstract = "This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit simple deep neural network architecture to classify both low level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that exploits jointly unsupervised nonnegative matrix factorization activation features and low-level time frequency representations as inputs. Finally, we present a fusion of proposed systems in order to further improve performance. The resulting systems are our submission for the task 1 of the DCASE 2017 challenge."
}
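The unsupervised branch can be sketched with scikit-learn, assuming nonnegative spectrogram frames as input; the component count, the average pooling over time and the MLP classifier are assumptions standing in for the paper's architecture.

import numpy as np
from sklearn.decomposition import NMF
from sklearn.neural_network import MLPClassifier

nmf = NMF(n_components=64, init="nndsvd", max_iter=400)
# S_train: nonnegative time-frequency frames from all training clips,
# shape (total_frames, n_freq_bins), e.g. magnitude spectrogram frames.
# nmf.fit(S_train)                        # learn the spectral dictionary

def clip_feature(S_clip):
    H = nmf.transform(S_clip)             # frame-wise activations (frames, 64)
    return H.mean(axis=0)                 # average-pool over time into a clip vector

# X = np.stack([clip_feature(S) for S in clip_spectrograms])
# MLPClassifier(hidden_layer_sizes=(128,)).fit(X, scene_labels)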
|
15:00 |
Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
Shahin Amiriparian1,2,3, Michael Freitag1, Nicholas Cummins1,2 and Björn Schuller2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial College London, London, UK
Abstract
This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, that are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74.8% to 88.0% on the development set, and from 61.0% to 67.5% on the test set.
Keywords
deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing acoustic scene classification
@inproceedings{Amiriparian2017,
author = "Amiriparian, Shahin and Freitag, Michael and Cummins, Nicholas and Schuller, Björn",
title = "Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "17--21",
keywords = "deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing acoustic scene classification",
abstract = "This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, that are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74:8\% to 88:0\% on the development set, and from 61:0\% to 67:5\% on the test set."
}
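A hedged Keras sketch of the idea: a recurrent encoder compresses the mel-spectrogram sequence into a fully connected bottleneck, a recurrent decoder reconstructs the sequence, and the bottleneck activations serve as the clip representation. All sizes are illustrative, and GRUs stand in for whichever recurrent cells the authors used.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels, latent = 500, 128, 256

inp = layers.Input(shape=(n_frames, n_mels))
h = layers.GRU(256)(inp)                                  # encoder summary of the sequence
z = layers.Dense(latent, activation="tanh", name="bottleneck")(h)
d = layers.RepeatVector(n_frames)(z)                      # feed the latent vector to every step
d = layers.GRU(256, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(n_mels))(d)     # reconstruct the spectrogram

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_spec, X_spec, epochs=..., batch_size=...)   # unsupervised training

encoder = Model(inp, z)
# features = encoder.predict(X_spec)   # then train an MLP on these, as in the paper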
|
|
15:20 |
Coffee |
|
Coffee
Coffee served during the poster session.
|
|
15:20 |
Posters |
|
Poster Session I
|
|
Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features
Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea, 2Department of Visual Information Processing, Korea University, Seoul, Republic of Korea
Abstract
This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6%, which is an improvement of 8.8%, 9.5%, 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.
Keywords
Acoustic scene classification, covariance learning, double FFT, convolutional neural network
@inproceedings{Park2017,
author = "Park, Sangwook and Mun, Seongkyu and Lee, Younglo and Ko, Hanseok",
title = "Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "98--102",
keywords = "Acoustic scene classification, covariance learning, double FFT, convolutional neural network",
abstract = "This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6\%, which is an improvement of 8.8\%, 9.5\%, 8.2\% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively."
}
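Read literally from the abstract, the two "image" features can be computed in a few lines; the mel front-end, band count and file name below are assumptions for illustration.

import numpy as np
import librosa

def double_image_features(path, n_mels=64):
    y, sr = librosa.load(path, sr=None)
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (bands, frames)
    cov_img = np.cov(S)                          # subband covariance, (n_mels, n_mels)
    dfft_img = np.abs(np.fft.rfft(S, axis=1))    # second Fourier transform along time
    return cov_img, dfft_img                     # two image-like inputs for the CNN

# cov, dfft = double_image_features("scene_clip.wav")   # hypothetical file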
|
|
The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification
Karol Piczak
Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
Abstract
This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.
Keywords
acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017
@inproceedings{Piczak2017,
author = "Piczak, Karol Jerzy",
title = "The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "103--107",
keywords = "acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017",
abstract = "This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration."
}
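The experimental variable is easy to reproduce: the same clip rendered at several mel-band resolutions, as in this short sketch (the file name and n_mels values are examples, not the paper's grid).

import librosa

y, sr = librosa.load("scene_clip.wav", sr=44100)      # hypothetical 10-second scene clip
for n_mels in (40, 60, 100, 200):
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=512))
    print(n_mels, S.shape)   # finer frequency resolution, same number of frames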
|
|
Wavelets Revisited for the Classification of Acoustic Scenes
Kun Qian1,2,3, Zhao Ren2,3, Vedhas Pandit2,3, Zijiang Yang1,2, Zixing Zhang2 and Björn Schuller2,3,4
1MISP group, Technische Universität München, Munich, Germany, 2Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 3Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
Abstract
We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features behave comparably to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of models trained on wavelets and typical acoustic features reaches the best averaged 4-fold cross-validation accuracies of 83.2% and 82.6% by SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8%) of the official development set (p < 0.001, one-tailed z-test).
Keywords
Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks
@inproceedings{Qian2017,
author = "Qian, Kun and Ren, Zhao and Pandit, Vedhas and Yang, Zijiang and Zhang, Zixing and Schuller, Björn",
title = "Wavelets Revisited for the Classification of Acoustic Scenes",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "108--112",
keywords = "Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks",
abstract = "We investigate the effectiveness of wavelet features for acoustic scene classification as contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that, the proposed wavelet features behave comparable to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of trained models with wavelets and typical acoustic features reach the best averaged 4-fold cross validation accuracy of 83.2 \\%, and 82.6 \\% by SVMs, and GRNNs, respectively; both significantly outperform the baseline (74.8 \\%) of the official development set (p < 0:001, one-tailed z-test)."
}
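A bare-bones illustration of a wavelet front-end with an SVM back-end: per-level log energies from a multilevel discrete wavelet decomposition. The wavelet family, decomposition depth and energy summary are assumptions; the paper's wavelet features are considerably richer than this.

import numpy as np
import pywt
import librosa
from sklearn.svm import SVC

def wavelet_energies(path, wavelet="db4", level=8):
    y, _ = librosa.load(path, sr=44100)
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [cA_n, cD_n, ..., cD_1]
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

# X = np.stack([wavelet_energies(f) for f in train_files])
# SVC(kernel="linear").fit(X, scene_labels)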
|
|
Deep Sequential Image Features on Acoustic Scene Classification
Zhao Ren1,2, Vedhas Pandit1,2, Kun Qian1,2,3, Zijiang Yang1,2, Zixing Zhang2 and Björn Schuller1,2,4
1Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 3MISP group, Technische Universität München, Munich, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
Abstract
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from the Short-Time Fourier Transform and scalograms of the audio scenes using Convolutional Neural Networks. This is the first investigation of the performance of bump and Morse scalograms for acoustic scene classification in this context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed separately into Gated Recurrent Neural Networks for classification. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80.9%, which increases by 6.1% when compared with the official baseline (p < .001 by one-tailed z-test).
Keywords
Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks
@inproceedings{Ren2017,
author = "Ren, Zhao and Pandit, Vedhas and Qian, Kun and Yang, Zijiang and Zhang, Zixing and Schuller, Björn",
title = "Deep Sequential Image Features on Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "113--117",
keywords = "Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks",
abstract = "For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from Short-Time Fourier Transform and scalogram of the audio scenes using Convolutional Neural Networks. It is the first time to investigate the performance of bump and morse scalograms for acoustic scene classification in an according context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed into Gated Recurrent Neural Networks for classification separately. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80:9\%, which increases by 6:1\% when compared with the official baseline (p < :001 by one-tailed z-test)."
}
|
|
Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification
Alexander Schindler1, Thomas Lidy2 and Andreas Rauber2
1Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria, 2Institute for Software and Interactive Systems, Technical University of Vienna, Vienna, Austria
Abstract
In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement over the best performing single-resolution model and a 12.49% improvement over the DCASE 2017 Acoustic Scenes Classification task baseline [1].
Keywords
Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis
@inproceedings{Schindler2017,
author = "Schindler, Alexander and Lidy, Thomas and Rauber, Andreas",
title = "Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "118--122",
keywords = "Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis",
abstract = "In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56\% absolute improvement of the best performing single resolution model and 12.49\% of the DCASE 2017 Acoustic Scenes Classification task baseline [1]."
}
|
|
Convolutional Recurrent Neural Networks for Rare Sound Event Detection
Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling.
Keywords
Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning
@inproceedings{Cakir2017,
author = "Cakir, Emre and Virtanen, Tuomas",
title = "Convolutional Recurrent Neural Networks for Rare Sound Event Detection",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "27--31",
keywords = "Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning",
abstract = "Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling."
}
|
|
Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks
Hyungui Lim1, Jeongsoo Park1,2 and Yoonchang Han1
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea
Abstract
Rare sound event detection is a newly proposed task in IEEE DCASE 2017 that aims to identify the presence of a monophonic sound event classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using a combination of a 1D convolutional neural network (1D ConvNet) and a recurrent neural network (RNN) with long short-term memory (LSTM) units. A log-amplitude mel-spectrogram is used as the input acoustic feature, and the 1D ConvNet is applied to each time-frequency frame to convert the spectral feature. The RNN-LSTM is then utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using the DCASE 2017 Challenge Task 2 dataset. Our best result on the test set of the development dataset shows an error rate of 0.07 and an F-score of 96.26 on the event-based metric. The proposed system achieved 1st place in the challenge with an error rate of 0.13 and an F-score of 93.1 on the evaluation dataset.
Keywords
Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory
@inproceedings{Lim2017,
author = "Lim, Hyungui and Park, Jeongsoo and Han, Yoonchang",
title = "Rare Sound Event Detection Using {1D} Convolutional Recurrent Neural Networks",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "80--84",
keywords = "Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory",
abstract = "Rare sound event detection is a newly proposed task in IEEE DCASE 2017 to identify the presence of monophonic sound event that is classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using combination of 1D convolutional neural network (1D ConvNet) and recurrent neural network (RNN) with long shortterm memory units (LSTM). A log-amplitude mel-spectrogram is used as an input acoustic feature and the 1D ConvNet is applied in each time-frequency frame to convert the spectral feature. Then the RNN-LSTM is utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using DCASE 2017 Challenge Task 2 Dataset. Our best result on the test set of the development dataset shows 0.07 and 96.26 of error rate and F-score on the event-based metric, respectively. The proposed system has achieved the 1st place in the challenge with an error rate of 0.13 and an F-Score of 93.1 on the evaluation dataset."
}
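The frame-wise 1D ConvNet followed by an LSTM can be sketched in Keras as below: the convolution runs along the frequency axis of every frame via TimeDistributed, and the LSTM models the temporal dependency; a frame-wise sigmoid gives the event activity from which the onset is read off. Sizes and the thresholding are illustrative assumptions, not the authors' configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels = 240, 128

inp = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.TimeDistributed(layers.Conv1D(64, 7, padding="same", activation="relu"))(inp)
x = layers.TimeDistributed(layers.MaxPooling1D(4))(x)
x = layers.TimeDistributed(layers.Flatten())(x)           # per-frame spectral embedding
x = layers.LSTM(64, return_sequences=True)(x)             # temporal context across frames
out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)   # event activity

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# onset = first frame whose smoothed activity exceeds a fixed threshold (e.g. 0.5)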
|
|
DCASE2017 Challenge posters |
|
DCASE2017 Challenge Results
Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering & Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France
Keywords
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels
|
|
Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task
Bernhard Lehner, Hamid Eghbal-Zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria
Abstract
This report describes the CP-JKU team's submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2017 challenge, and discusses some observations we made about the data and the classification setup. Our approach is based on the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on raw spectrograms. The data provided for the 2017 ASC task presented some new challenges -- in particular, audio stimuli of very short duration. These will be discussed in detail, and our measures for addressing them will be described. The result of our experiments is a classification system that achieves classification accuracies of around 90% on the provided development data, as estimated via the prescribed four-fold cross-validation scheme (which, we suspect, may be rather optimistic in relation to new data).
|
|
A Report on Sound Event Detection with Different Binaural Features
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to the error rate metric.
|
|
Surrey-CVSSP System for DCASE2017 Challenge Task4
Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley
Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Abstract
In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data unbalancing problem. Fusion of posteriors from different systems is found to be effective in improving the performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.72 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and 1.02 ER for sound event detection.
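A hedged sketch of the attention-based localization for weakly labelled training: a frame-level classification branch is pooled over time by a learned attention branch, so only clip-level tags are needed for the loss while frame-level scores remain available for localization. Layer sizes and the sigmoid attention are illustrative assumptions, not the Surrey-CVSSP configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_feats, n_classes = 240, 64, 17

inp = layers.Input(shape=(n_frames, n_feats))
h = layers.Bidirectional(layers.GRU(128, return_sequences=True))(inp)
cla = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)  # frame scores
att = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)  # attention weights
att = layers.Lambda(lambda a: a / tf.reduce_sum(a, axis=1, keepdims=True))(att) # normalise over time
clip = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([cla, att])  # clip-level tags

model = Model(inp, clip)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X, clip_tags); a second Model(inp, cla) exposes the frame-level localisation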
|
|
17:00 |
Discussion |
|
Open Discussion
Moderated by Mark Plumbley, University of Surrey, United Kingdom
|
|