8:45 |
Registration |
|
8:45 |
Coffee |
|
9:10 |
Welcome |
|
Welcome
Annamaria Mesaros, Tampere University of Technology, Finland
|
|
9:20 |
Keynote |
|
Keynote
Session chair Sacha Krstulović
General-Purpose Sound Event Recognition
Shawn Hershey
Google Research
Abstract
Inspired by the success of general-purpose object recognition in images, we have been working on automatic, real-time systems for recognizing sound events regardless of domain. Our goal is a system that can tag or describe an arbitrary soundtrack - as might be found on a media sharing site like YouTube - using terms that make sense to a human. I will cover the process of defining this task, our deep learning approach, our efforts to collect training data, and our current results. I'll discuss some factors important for accurate models, and some ideas about how to get the best return from manual labeling investment.
Biography
Shawn Hershey is a software engineer at Google Research, working in the Machine Hearing Group on machine learning for speech and audio processing. He is currently working on soundtrack classification and audio event detection. Before Google he worked as the first Software Engineer at Lyric Semiconductors, building tools to aid the development of hardware accelerators for AI. On the side, Shawn travels the world teaching Lindy Hop and blues dancing and playing in swing and blues bands. Long ago Shawn graduated from the University of Rochester with a BA in Computer Science and half of a degree from the Eastman School of Music.
Shawn Hershey
Google Research
|
|
10:10 |
Break |
|
10:30 |
Presentations |
|
Oral Session I
Session chair Axel Plinge
|
10:30 |
DCASE2017 Challenge Summary
Tuomas Virtanen
Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering & Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
Keywords
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels
@inproceedings{Mesaros2017,
author = "Mesaros, Annamaria and Heittola, Toni and Diment, Aleksandr and Elizalde, Benjamin and Shah, Ankit and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
title = "{DCASE2017} Challenge Setup: Tasks, Datasets and Baseline System",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "85--92",
keywords = "Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels",
abstract = "DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics."
}
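For readers who want a concrete starting point, the following minimal Python sketch reproduces the flavour of the baseline described above: frame-wise log mel-band energies with a small context window, classified by a multilayer perceptron. The feature settings, context size, placeholder file lists and the scikit-learn MLP stand-in are illustrative assumptions, not the official baseline implementation.

import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def log_mel_energies(path, n_mels=40, n_fft=2048, hop_length=1024):
    # Log-scaled mel-band energies, frames as rows.
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel).T

def stack_context(frames, context=2):
    # Concatenate +/- `context` neighbouring frames around each frame.
    pad = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([pad[i:i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

# train_files / train_labels are placeholders for the development-set audio
# paths and scene labels; training is frame-wise, clip decisions by majority vote.
# X = np.vstack([stack_context(log_mel_energies(f)) for f in train_files])
# y = np.concatenate([[lab] * len(stack_context(log_mel_energies(f)))
#                     for f, lab in zip(train_files, train_labels)])
clf = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200)
# clf.fit(X, y)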
|
11:00 |
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyper-Plane
Seongkyu Mun1, Sangwook Park1, David Han2 and Hanseok Ko1
1Intelligent Signal Processing Laboratory, Korea University, Seoul, South Korea, 2Office of Naval Research, Arlington, VA, USA
Abstract
Although it is typically expected that using a large amount of labeled training data would lead to improved performance in deep learning, it is generally difficult to obtain such a database (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are, as a rule, constrained to use a relatively small DB, which raises the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing an additional DB, this paper proposes to use a Generative Adversarial Network (GAN) based method for generating an additional training DB. Since it is not clear whether every sample generated by the GAN would have an equal impact on classification performance, this paper proposes to use a Support Vector Machine (SVM) hyper-plane for each class as a reference for selecting samples that carry class-discriminative information. Based on cross-validated experiments on the development DB, the use of the generated features can improve ASC performance.
Keywords
acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane
@inproceedings{Mun2017,
author = "Mun, Seongkyu and Park, Sangwook and Han, David K and Ko, Hanseok",
title = "Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using {SVM} Hyper-Plane",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "93--102",
keywords = "acoustic scene classification, generative adversarial networks, support vector machine, data augmentation, decision hyper-plane",
abstract = "Although it is typically expected that using a large amount of labeled training data would lead to improve performance in deep learning, it is generally difficult to obtain such DataBase (DB). In competitions such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task 1, participants are constrained to use a relatively small DB as a rule, which is similar to the aforementioned issue. To improve Acoustic Scene Classification (ASC) performance without employing additional DB, this paper proposes to use Generative Adversarial Networks (GAN) based method for generating additional training DB. Since it is not clear whether every sample generated by GAN would have equal impact in classification performance, this paper proposes to use Support Vector Machine (SVM) hyper plane for each class as reference for selecting samples, which have class discriminative information. Based on the crossvalidated experiments on development DB, the usage of the generated features could improve ASC performance."
}
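One plausible reading of the selection step, sketched below: fit per-class SVM hyper-planes on real features, then keep only those GAN-generated samples that the SVM places confidently on the side of their own class. The linear SVM, one-vs-rest setup and margin threshold are assumptions for illustration rather than the authors' exact criterion.

import numpy as np
from sklearn.svm import LinearSVC

def select_generated(real_X, real_y, gen_X, gen_y, margin=0.5):
    # Assumes more than two classes, so decision_function is (n_samples, n_classes).
    svm = LinearSVC(C=1.0).fit(real_X, real_y)
    scores = svm.decision_function(gen_X)
    cls_idx = np.searchsorted(svm.classes_, gen_y)
    own = scores[np.arange(len(gen_X)), cls_idx]   # signed distance to own-class hyper-plane
    keep = own > margin                            # retain confidently in-class samples
    return gen_X[keep], gen_y[keep]

# kept_X, kept_y = select_generated(real_X, real_y, gen_X, gen_y)
# The augmented training set is the real data stacked with the kept generated samples.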
|
11:20 |
Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input
Donmoon Lee1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea
Abstract
In this paper, we use an ensemble of convolutional neural network models with various analysis windows to detect audio events in the automotive environment. When detecting the presence of audio events, a global-input model that uses the entire audio clip works better. On the other hand, segmented-input models work better at finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of the input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate.
@inproceedings{Lee2017b,
author = "Lee, Donmoon and Lee, Subin and Han, Yoonchang and Lee, Kyogu",
title = "Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
pages = "74--79",
abstract = "In this paper, we use ensemble of convolutional neural network models that use the various analysis window to detect audio events in the automotive environment. When detecting the presence of audio events, global input based model that uses the entire audio clip works better. On the other hand, segmented input based models works better in finding the accurate position of the event. Experimental results for weakly-labeled audio data confirm the performance trade-off between the two tasks, depending on the length of input audio. By combining the predictions of various models, the proposed system achieved 0.4762 in the clip-based F1-score and 0.7167 in the segment-based error rate."
}
|
11:40 |
Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording given just the list of sound events present in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energy as the input audio feature, and the weak labels provided in the dataset as targets for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the losses computed in the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves a best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
Keywords
sound event detection, weak labels, deep neural network, CNN, GRU
@inproceedings{Adavanne2017,
author = "Adavanne, Sharath and Virtanen, Tuomas",
title = "Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "12--16",
keywords = "sound event detection, weak labels, deep neural network, CNN, GRU",
abstract = "This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence one for the strong followed by the weak label. The network is trained using frame-wise log melband energy as the input audio feature, and weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many number of times as the frames in the input audio feature, and used for strong label layer during training. We propose to control what the network learns from the weak and strong labels by different weighting for the loss computed in the two prediction layers. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves the best error rate of 0.84 for strong labels and F-score of 43.3\% for weak labels on the unseen test split."
}
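The two-output idea can be written down compactly in Keras, as in the sketch below: a CRNN produces frame-level (strong) sigmoid predictions, a max-pooling over time derives clip-level (weak) tags, and the two binary cross-entropy losses are weighted differently. Layer sizes, the pooling used for the weak output and the 0.2/1.0 weighting are illustrative assumptions, not the paper's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels, n_classes = 256, 40, 17

inp = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(inp)
x = layers.MaxPooling2D((1, 5))(x)                     # pool along frequency only
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((1, 4))(x)
x = layers.Reshape((n_frames, -1))(x)                  # back to (time, features)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
strong = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"),
                                name="strong")(x)      # frame-level activity
weak = layers.GlobalMaxPooling1D(name="weak")(strong)  # clip-level tags

model = Model(inp, [strong, weak])
model.compile(optimizer="adam",
              loss={"strong": "binary_crossentropy", "weak": "binary_crossentropy"},
              loss_weights={"strong": 0.2, "weak": 1.0})
# model.fit(X, {"strong": frame_labels, "weak": clip_labels}, ...)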
|
12:00 |
Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study
Christian Kroos and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Abstract
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44.9% and achieved rank 15 out of 34 on the ER error metric with a value of 0.891. We discuss the question of evolving versus learning for supervised tasks.
Keywords
Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution
@inproceedings{Kroos2017,
author = "Kroos, Christian and Plumbley, Mark D.",
title = "Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "64--68",
keywords = "Sound event detection, neuroevolution, NEAT, deep scattering transform, wavelets, clustering, co-evolution",
abstract = "Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing dynamic cooperative co-evolution, and applied it to sound event detection in real life audio (Task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. The results show that for the development data set J-NEAT was capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. In the challenge, J-NEAT took first place overall according to the F1 error metric with an F1 of 44:9\% and achieved rank 15 out of 34 on the ER error metric with a value of 0:891. We discuss the question of evolving versus learning for supervised tasks."
}
|
|
12:30 |
Break |
|
14:00 |
Presentations |
|
Oral Session II
Session chair Romain Serizel
|
14:00 |
Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks
Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Abstract
This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled mel-spectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of the TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2%.
Keywords
acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling
@inproceedings{Fonseca2017,
author = "Fonseca, Eduardo and Gong, Rong and Bogdanov, Dmitry and Slizovskaia, Olga and Gomez, Emilia and Serra, Xavier",
title = "Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "37--41",
keywords = "acoustic scene classification, gradient boosting machine, convolutional neural networks, ensembling",
abstract = "This work describes our contribution to the acoustic scene classification task of the DCASE 2017 challenge. We propose a system that consists of the ensemble of two methods of different nature: a feature engineering approach, where a collection of hand-crafted features is input to a Gradient Boosting Machine, and another approach based on learning representations from data, where log-scaled melspectrograms are input to a Convolutional Neural Network. This CNN is designed with multiple filter shapes in the first layer. We use a simple late fusion strategy to combine both methods. We report classification accuracy of each method alone and the ensemble system on the provided cross-validation setup of TUT Acoustic Scenes 2017 dataset. The proposed system outperforms each of its component methods and improves the provided baseline system by 8.2\%."
}
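The late-fusion step itself is simple; a hedged sketch follows, with a scikit-learn gradient boosting model standing in for the hand-crafted-feature branch and equal fusion weights assumed rather than the authors' exact setting.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=200)
# gbm.fit(handcrafted_train, y_train)             # hand-crafted feature branch
# p_gbm = gbm.predict_proba(handcrafted_test)     # (n_clips, n_classes)
# p_cnn = cnn_model.predict(melspec_test)         # softmax output of the CNN branch

def late_fusion(p_gbm, p_cnn, w=0.5):
    # Weighted average of the two class-probability matrices, then argmax per clip.
    return (w * p_gbm + (1.0 - w) * p_cnn).argmax(axis=1)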
|
14:20 |
DCASE 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features
Abelino Jimenez, Benjamin Elizalde and Bhiksha Raj
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
Abstract
Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of the input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels: Gaussian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both input and random features outperformed the DCASE baseline by an absolute 4%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance, and by more than six times while still outperforming the baseline. Hence, random features could be employed by state of the art approaches to compute low-storage features and perform faster kernel computations.
Keywords
Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features
@inproceedings{Jimenez2017,
author = "Jimenez, Abelino and Elizalde, Benjamin and Raj, Bhiksha",
title = "{DCASE} 2017 Task 1: Acoustic Scene Classification Using Shift-Invariant Kernels and Random Features",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "55--58",
keywords = "Acoustic Scene Classification, Laplacian Kernel, Kernel Machines, Random Features",
abstract = "Acoustic scene recordings are represented by different types of handcrafted or Neural Network features. These features, typically of thousands of dimensions, are classified in state of the art approaches using kernel machines, such as the Support Vector Machines (SVM). However, the complexity of training these methods increases with the dimensionality of these input features and the size of the dataset. A solution is to map the input features to a randomized low-dimensional feature space. The resulting random features can approximate non-linear kernels with faster linear kernel computation. In this work, we computed a set of 6,553 input features and used them to compute random features to approximate three types of kernels, Guassian, Laplacian and Cauchy. We compared their performance using an SVM in the context of the DCASE Task 1 - Acoustic Scene Classification. Experiments show that both, input and random features outperformed the DCASE baseline by an absolute 4\%. Moreover, the random features reduced the dimensionality of the input by more than three times with minimal loss of performance and by more than six times and still outperformed the baseline. Hence, random features could be employed by state of the art approaches to compute low-storage features and perform faster kernel computations."
}
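A minimal sketch of the random-feature map for the three shift-invariant kernels named in the abstract: projection directions are drawn from the spectral density of each kernel (normal draws for the Gaussian kernel, Cauchy draws for the Laplacian kernel, Laplace draws for the Cauchy kernel), and a linear SVM is trained on the resulting features. The dimension D, bandwidth gamma and fixed seed are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def random_features(X, D=512, kernel="gaussian", gamma=1.0, seed=0):
    # Random Fourier features z(x) = sqrt(2/D) * cos(Wx + b); use the same seed
    # for training and test data so that W and b stay identical.
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    if kernel == "gaussian":
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    elif kernel == "laplacian":
        W = gamma * rng.standard_cauchy(size=(d, D))
    else:  # "cauchy"
        W = rng.laplace(scale=gamma, size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Z_train = random_features(X_train, kernel="laplacian")
# clf = LinearSVC().fit(Z_train, y_train)
# acc = clf.score(random_features(X_test, kernel="laplacian"), y_test)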
|
14:40 |
Nonnegative Feature Learning Methods for Acoustic Scene Classification
Victor Bisot1, Romain Serizel2,3,4, Slim Essid1 and Gaël Richard1
1Image Data and Signal, Telecom ParisTech, Paris, France, 2Université de Lorraine, Loria, Nancy, France, 3Inria, Nancy, France, 4CNRS, LORIA, Nancy, France
Abstract
This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit simple deep neural network architecture to classify both low level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that exploits jointly unsupervised nonnegative matrix factorization activation features and low-level time frequency representations as inputs. Finally, we present a fusion of proposed systems in order to further improve performance. The resulting systems are our submission for the task 1 of the DCASE 2017 challenge.
Keywords
Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks
@inproceedings{Bisot2017,
author = "Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gaël",
title = "Nonnegative Feature Learning Methods for Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "22--26",
keywords = "Feature learning, Nonnegative Matrix Factorization, Deep Neural Networks",
abstract = "This paper introduces improvements to nonnegative feature learning-based methods for acoustic scene classification. We start by introducing modifications to the task-driven nonnegative matrix factorization algorithm. The proposed adapted scaling algorithm improves the generalization capability of task-driven nonnegative matrix factorization for the task. We then propose to exploit simple deep neural network architecture to classify both low level time-frequency representations and unsupervised nonnegative matrix factorization activation features independently. Moreover, we also propose a deep neural network architecture that exploits jointly unsupervised nonnegative matrix factorization activation features and low-level time frequency representations as inputs. Finally, we present a fusion of proposed systems in order to further improve performance. The resulting systems are our submission for the task 1 of the DCASE 2017 challenge."
}
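The unsupervised branch can be sketched with scikit-learn, assuming nonnegative spectrogram frames as input; the component count, the average pooling over time and the MLP classifier are assumptions standing in for the paper's architecture.

import numpy as np
from sklearn.decomposition import NMF
from sklearn.neural_network import MLPClassifier

nmf = NMF(n_components=64, init="nndsvd", max_iter=400)
# S_train: nonnegative time-frequency frames from all training clips,
# shape (total_frames, n_freq_bins), e.g. magnitude spectrogram frames.
# nmf.fit(S_train)                        # learn the spectral dictionary

def clip_feature(S_clip):
    H = nmf.transform(S_clip)             # frame-wise activations (frames, 64)
    return H.mean(axis=0)                 # average-pool over time into a clip vector

# X = np.stack([clip_feature(S) for S in clip_spectrograms])
# MLPClassifier(hidden_layer_sizes=(128,)).fit(X, scene_labels)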
|
15:00 |
Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
Shahin Amiriparian1,2,3, Michael Freitag1, Nicholas Cummins1,2 and Björn Schuller2,4
1Chair of Complex & Intelligent Systems, Universität Passau, Passau, Germany, 2Chair of Embedded Intelligence for Health Care, Augsburg University, Augsburg, Germany, 3Machine Intelligence & Signal Processing Group, Technische Universität München, München, Germany, 4Group of Language, Audio & Music, Imperial College London, London, UK
Abstract
This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, that are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74.8% to 88.0% on the development set, and from 61.0% to 67.5% on the test set.
Keywords
deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing acoustic scene classification
@inproceedings{Amiriparian2017,
author = "Amiriparian, Shahin and Freitag, Michael and Cummins, Nicholas and Schuller, Björn",
title = "Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "17--21",
keywords = "deep feature learning, sequence to sequence learning, recurrent autoencoders, audio processing acoustic scene classification",
abstract = "This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence to sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence to sequence autoencoder on these spectrograms, that are considered as time-dependent frequency vectors. Then, we extract, from a fully connected layer between the decoder and encoder units, the learnt representations of spectrograms as the feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. In comparison to the baseline, the accuracy is increased from 74:8\% to 88:0\% on the development set, and from 61:0\% to 67:5\% on the test set."
}
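A hedged Keras sketch of the idea: a recurrent encoder compresses the mel-spectrogram sequence into a fully connected bottleneck, a recurrent decoder reconstructs the sequence, and the bottleneck activations serve as the clip representation. All sizes are illustrative, and GRUs stand in for whichever recurrent cells the authors used.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels, latent = 500, 128, 256

inp = layers.Input(shape=(n_frames, n_mels))
h = layers.GRU(256)(inp)                                  # encoder summary of the sequence
z = layers.Dense(latent, activation="tanh", name="bottleneck")(h)
d = layers.RepeatVector(n_frames)(z)                      # feed the latent vector to every step
d = layers.GRU(256, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(n_mels))(d)     # reconstruct the spectrogram

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_spec, X_spec, epochs=..., batch_size=...)   # unsupervised training

encoder = Model(inp, z)
# features = encoder.predict(X_spec)   # then train an MLP on these, as in the paper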
|
|
15:20 |
Coffee |
|
Coffee
Coffee served during the poster session.
|
|
15:20 |
Posters |
|
Poster Session I
|
|
Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features
Sangwook Park1, Seongkyu Mun2, Younglo Lee1 and Hanseok Ko1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea, 2Department of Visual Information Processing, Korea University, Seoul, Republic of Korea
Abstract
This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6%, which is an improvement of 8.8%, 9.5%, 8.2% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively.
Keywords
Acoustic scene classification, covariance learning, double FFT, convolutional neural network
@inproceedings{Park2017,
author = "Park, Sangwook and Mun, Seongkyu and Lee, Younglo and Ko, Hanseok",
title = "Acoustic Scene Classification Based on Convolutional Neural Network Using Double Image Features",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "98--102",
keywords = "Acoustic scene classification, covariance learning, double FFT, convolutional neural network",
abstract = "This paper proposes new image features for the acoustic scene classification task of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events. In classification of acoustic scenes, identical sounds being observed in different places may affect performance. To resolve this issue, a covariance matrix, which represents energy density for each subband, and a double Fourier transform image, which represents energy variation for each subband, were defined as features. To classify the acoustic scenes with these features, Convolutional Neural Network has been applied with several techniques to reduce training time and to resolve initialization and local optimum problems. According to the experiments which were performed with the DCASE2017 challenge development dataset it is claimed that the proposed method outperformed several baseline methods. Specifically, the class average accuracy is shown as 83.6\%, which is an improvement of 8.8\%, 9.5\%, 8.2\% compared to MFCC-MLP, MFCC-GMM, and CepsCom-GMM, respectively."
}
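Read literally from the abstract, the two "image" features can be computed in a few lines; the mel front-end, band count and file name below are assumptions for illustration.

import numpy as np
import librosa

def double_image_features(path, n_mels=64):
    y, sr = librosa.load(path, sr=None)
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (bands, frames)
    cov_img = np.cov(S)                          # subband covariance, (n_mels, n_mels)
    dfft_img = np.abs(np.fft.rfft(S, axis=1))    # second Fourier transform along time
    return cov_img, dfft_img                     # two image-like inputs for the CNN

# cov, dfft = double_image_features("scene_clip.wav")   # hypothetical file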
|
|
The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification
Karol Piczak
Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
Abstract
This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration.
Keywords
acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017
@inproceedings{Piczak2017,
author = "Piczak, Karol Jerzy",
title = "The Details That Matter: Frequency Resolution of Spectrograms in Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "103--107",
keywords = "acoustic scene classification, spectrogram, frequency resolution, convolutional neural network, DCASE 2017",
abstract = "This study describes a convolutional neural network model submitted to the acoustic scene classification task of the DCASE 2017 challenge. The performance of this model is evaluated with different frequency resolutions of the input spectrogram showing that a higher number of mel bands improves accuracy with negligible impact on the learning time. Additionally, apart from the convolutional model focusing solely on the ambient characteristics of the audio scene, a proposed extension with pretrained event detectors shows potential for further exploration."
}
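The experimental variable is easy to reproduce: the same clip rendered at several mel-band resolutions, as in this short sketch (the file name and n_mels values are examples, not the paper's grid).

import librosa

y, sr = librosa.load("scene_clip.wav", sr=44100)      # hypothetical 10-second scene clip
for n_mels in (40, 60, 100, 200):
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=512))
    print(n_mels, S.shape)   # finer frequency resolution, same number of frames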
|
|
Wavelets Revisited for the Classification of Acoustic Scenes
Kun Qian1,2,3, Zhao Ren2,3, Vedhas Pandit2,3, Zijiang Yang1,2, Zixing Zhang2 and Björn Schuller2,3,4
1MISP group, Technische Universität München, Munich, Germany, 2Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 3Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
Abstract
We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features behave comparably to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of models trained on wavelets and typical acoustic features reaches the best averaged 4-fold cross-validation accuracies of 83.2% and 82.6% by SVMs and GRNNs, respectively; both significantly outperform the baseline (74.8%) of the official development set (p < 0.001, one-tailed z-test).
Keywords
Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks
@inproceedings{Qian2017,
author = "Qian, Kun and Ren, Zhao and Pandit, Vedhas and Yang, Zijiang and Zhang, Zixing and Schuller, Björn",
title = "Wavelets Revisited for the Classification of Acoustic Scenes",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "108--112",
keywords = "Acoustic Scene Classification, Wavelets, Support Vector Machines, Sequence Modelling, Gated Recurrent Neural Networks",
abstract = "We investigate the effectiveness of wavelet features for acoustic scene classification as contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that, the proposed wavelet features behave comparable to the typically-used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of trained models with wavelets and typical acoustic features reach the best averaged 4-fold cross validation accuracy of 83.2 \\%, and 82.6 \\% by SVMs, and GRNNs, respectively; both significantly outperform the baseline (74.8 \\%) of the official development set (p < 0:001, one-tailed z-test)."
}
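A bare-bones illustration of a wavelet front-end with an SVM back-end: per-level log energies from a multilevel discrete wavelet decomposition. The wavelet family, decomposition depth and energy summary are assumptions; the paper's wavelet features are considerably richer than this.

import numpy as np
import pywt
import librosa
from sklearn.svm import SVC

def wavelet_energies(path, wavelet="db4", level=8):
    y, _ = librosa.load(path, sr=44100)
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [cA_n, cD_n, ..., cD_1]
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

# X = np.stack([wavelet_energies(f) for f in train_files])
# SVC(kernel="linear").fit(X, scene_labels)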
|
|
Deep Sequential Image Features on Acoustic Scene Classification
Zhao Ren1,2, Vedhas Pandit1,2, Kun Qian1,2,3, Zijiang Yang1,2, Zixing Zhang2 and Björn Schuller1,2,4
1Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany, 2Chair of Complex and Intelligent Systems, University of Passau, Passau, Germany, 3MISP group, Technische Universität München, Munich, Germany, 4GLAM - Group on Language, Audio and Music, Imperial College London, London, UK
Abstract
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from the Short-Time Fourier Transform and scalograms of the audio scenes using Convolutional Neural Networks. This is the first investigation of the performance of bump and Morse scalograms for acoustic scene classification in this context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed separately into Gated Recurrent Neural Networks for classification. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80.9%, which increases by 6.1% when compared with the official baseline (p < .001 by one-tailed z-test).
Keywords
Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks
@inproceedings{Ren2017,
author = "Ren, Zhao and Pandit, Vedhas and Qian, Kun and Yang, Zijiang and Zhang, Zixing and Schuller, Björn",
title = "Deep Sequential Image Features on Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "113--117",
keywords = "Audio Scene Classification, Deep Sequential Learning, Scalogram, Convolutional Neural Networks, Gated Recurrent Neural Networks",
abstract = "For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from Short-Time Fourier Transform and scalogram of the audio scenes using Convolutional Neural Networks. It is the first time to investigate the performance of bump and morse scalograms for acoustic scene classification in an according context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, ‘deep features’ are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are then fed into Gated Recurrent Neural Networks for classification separately. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80:9\%, which increases by 6:1\% when compared with the official baseline (p < :001 by one-tailed z-test)."
}
|
|
Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification
Alexander Schindler1, Thomas Lidy2 and Andreas Rauber2
1Center for Digital Safety and Security, Austrian Institute of Technology, Vienna, Austria, 2Institute for Software and Interactive Systems, Technical University of Vienna, Vienna, Austria
Abstract
In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement over the best performing single-resolution model and a 12.49% improvement over the DCASE 2017 Acoustic Scenes Classification task baseline [1].
Keywords
Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis
@inproceedings{Schindler2017,
author = "Schindler, Alexander and Lidy, Thomas and Rauber, Andreas",
title = "Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "118--122",
keywords = "Deep Learning, Convolutional Neural Networks, Acoustic Scene Classification, Audio Analysis",
abstract = "In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separated parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene’s spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56\% absolute improvement of the best performing single resolution model and 12.49\% of the DCASE 2017 Acoustic Scenes Classification task baseline [1]."
}
|
|
Convolutional Recurrent Neural Networks for Rare Sound Event Detection
Emre Cakir and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling.
Keywords
Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning
@inproceedings{Cakir2017,
author = "Cakir, Emre and Virtanen, Tuomas",
title = "Convolutional Recurrent Neural Networks for Rare Sound Event Detection",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "27--31",
keywords = "Sound Event Detection, Convolutional Neural Network, Recurrent Neural Network, Machine learning",
abstract = "Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling."
}
|
|
Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks
Hyungui Lim1, Jeongsoo Park1,2 and Yoonchang Han1
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea
Abstract
Rare sound event detection is a newly proposed task in IEEE DCASE 2017 that aims to identify the presence of a monophonic sound event classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using a combination of a 1D convolutional neural network (1D ConvNet) and a recurrent neural network (RNN) with long short-term memory (LSTM) units. A log-amplitude mel-spectrogram is used as the input acoustic feature, and the 1D ConvNet is applied to each time-frequency frame to convert the spectral feature. The RNN-LSTM is then utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using the DCASE 2017 Challenge Task 2 dataset. Our best result on the test set of the development dataset shows an error rate of 0.07 and an F-score of 96.26 on the event-based metric. The proposed system achieved 1st place in the challenge with an error rate of 0.13 and an F-score of 93.1 on the evaluation dataset.
Keywords
Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory
@inproceedings{Lim2017,
author = "Lim, Hyungui and Park, Jeongsoo and Han, Yoonchang",
title = "Rare Sound Event Detection Using {1D} Convolutional Recurrent Neural Networks",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
year = "2017",
month = "November",
pages = "80--84",
keywords = "Rare sound event detection, deep learning, convolutional neural network, recurrent neural network, long short-term memory",
abstract = "Rare sound event detection is a newly proposed task in IEEE DCASE 2017 to identify the presence of monophonic sound event that is classified as an emergency and to detect the onset time of the event. In this paper, we introduce a rare sound event detection system using combination of 1D convolutional neural network (1D ConvNet) and recurrent neural network (RNN) with long shortterm memory units (LSTM). A log-amplitude mel-spectrogram is used as an input acoustic feature and the 1D ConvNet is applied in each time-frequency frame to convert the spectral feature. Then the RNN-LSTM is utilized to incorporate the temporal dependency of the extracted features. The system is evaluated using DCASE 2017 Challenge Task 2 Dataset. Our best result on the test set of the development dataset shows 0.07 and 96.26 of error rate and F-score on the event-based metric, respectively. The proposed system has achieved the 1st place in the challenge with an error rate of 0.13 and an F-Score of 93.1 on the evaluation dataset."
}
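The frame-wise 1D ConvNet followed by an LSTM can be sketched in Keras as below: the convolution runs along the frequency axis of every frame via TimeDistributed, and the LSTM models the temporal dependency; a frame-wise sigmoid gives the event activity from which the onset is read off. Sizes and the thresholding are illustrative assumptions, not the authors' configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_mels = 240, 128

inp = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.TimeDistributed(layers.Conv1D(64, 7, padding="same", activation="relu"))(inp)
x = layers.TimeDistributed(layers.MaxPooling1D(4))(x)
x = layers.TimeDistributed(layers.Flatten())(x)           # per-frame spectral embedding
x = layers.LSTM(64, return_sequences=True)(x)             # temporal context across frames
out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)   # event activity

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# onset = first frame whose smoothed activity exceeds a fixed threshold (e.g. 0.5)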
|
|
DCASE2017 Challenge posters |
|
DCASE2017 Challenge Results
Annamaria Mesaros1, Toni Heittola1, Aleksandr Diment1, Benjamin Elizalde2, Ankit Shah2, Emmanuel Vincent3, Bhiksha Raj2 and Tuomas Virtanen1
1Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland, 2Carnegie Mellon University, Department of Electrical and Computer Engineering & Language Technologies Institute, Pittsburgh, USA, 3Inria, F-54600 Villers-les-Nancy, France
Keywords
Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events, Weak Labels
|
|
Classifying Short Acoustic Scenes with I-Vectors and CNNs: Challenges and Optimisations for the 2017 DCASE ASC Task
Bernhard Lehner, Hamid Eghbal-Zadeh, Matthias Dorfer, Filip Korzeniowski, Khaled Koutini and Gerhard Widmer
Department of Computational Perception, Johannes Kepler University, Linz, Austria
Abstract
This report describes the CP-JKU team's submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2017 challenge, and discusses some observations we made about the data and the classification setup. Our approach is based on the methodology that achieved ranks 1 and 2 in the 2016 ASC challenge: a fusion of i-vector modelling using MFCC features derived from left and right audio channels, and deep convolutional neural networks (CNNs) trained on raw spectrograms. The data provided for the 2017 ASC task presented some new challenges -- in particular, audio stimuli of very short duration. These will be discussed in detail, and our measures for addressing them will be described. The result of our experiments is a classification system that achieves classification accuracies of around 90% on the provided development data, as estimated via the prescribed four-fold cross-validation scheme (which, we suspect, may be rather optimistic in relation to new data).
|
|
A Report on Sound Event Detection with Different Binaural Features
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Abstract
In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to the error rate metric.
|
|
Surrey-CVSSP System for DCASE2017 Challenge Task4
Yong Xu, Qiuqiang Kong, Wenwu Wang and Mark D. Plumbley
Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Abstract
In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing the specific events in a weakly-supervised mode. A new batch-level balancing strategy is also proposed to tackle the data unbalancing problem. Fusion of posteriors from different systems is found to be effective in improving the performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.72 error rate (ER) for the sound event detection subtask on the development set, while the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and 1.02 ER for sound event detection.
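A hedged sketch of the attention-based localization for weakly labelled training: a frame-level classification branch is pooled over time by a learned attention branch, so only clip-level tags are needed for the loss while frame-level scores remain available for localization. Layer sizes and the sigmoid attention are illustrative assumptions, not the Surrey-CVSSP configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_frames, n_feats, n_classes = 240, 64, 17

inp = layers.Input(shape=(n_frames, n_feats))
h = layers.Bidirectional(layers.GRU(128, return_sequences=True))(inp)
cla = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)  # frame scores
att = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)  # attention weights
att = layers.Lambda(lambda a: a / tf.reduce_sum(a, axis=1, keepdims=True))(att) # normalise over time
clip = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([cla, att])  # clip-level tags

model = Model(inp, clip)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X, clip_tags); a second Model(inp, cla) exposes the frame-level localisation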
|
|
17:00 |
Discussion |
|
Open Discussion
Moderated by Mark Plumbley, University of Surrey, United Kingdom
|
|