The proceedings of the DCASE2018 Workshop have been published as an electronic publication in the Tampere University of Technology series:
An extensible cluster-graph taxonomy for open set sound scene analysis
Helen L. Bear and Emmanouil Benetos
Queen Mary University of London, School of Electronic Engineering and Computer Science, UK
We present a new extensible and divisible taxonomy for open set sound scene analysis. This new model allows complex scene analysis with tangible descriptors and perception labels. Its novel structure is a cluster graph such that each cluster (or subset) can stand alone for targeted analyses such as office sound event detection, whilst maintaining integrity over the whole graph (superset) of labels. The key design benefit is its extensibility as new labels are needed during new data capture. Furthermore, datasets which use the same taxonomy are easily augmented, saving future data collection effort. We balance the detail needed for complex scene analysis against building ‘the taxonomy of everything’: our framework ensures no duplication in the superset of labels, which we demonstrate with DCASE challenge classifications.
Taxonomy, ontology, sound scenes, sound events, sound scene analysis, open set
Sound event detection from weak annotations: weighted-GRU versus multi-instance-learning
Léo Cances, Thomas Pellegrini, and Patrice Guyot
IRIT, Universite de Toulouse, CNRS, Toulouse, France
In this paper, we address the detection of audio events in domestic environments in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but no temporal boundaries. We report experiments in the framework of Task 4 of the DCASE 2018 challenge. The objective is twofold: detect audio events (multi-category classification at recording level) and localize the events precisely within the recordings. We explored two approaches: 1) a “weighted-GRU” (WGRU), in which we train a Convolutional Recurrent Neural Network (CRNN) for classification and then exploit its frame-based predictions at the output of the time-distributed dense layer to perform localization. We propose to lower the influence of the hidden states to avoid predicting the same score throughout a recording. 2) An approach inspired by Multi-Instance Learning (MIL), in which we train a CRNN to give predictions at frame level, using a custom loss function based on the weak labels and statistics of the frame-based predictions. Both approaches outperform the baseline of 14.06% in F-measure by a large margin, with values of 16.77% and 24.58%, respectively, for combined WGRUs and MIL, on a test set comprising 288 recordings.
Sound event detection, weakly supervised learning, multi-instance learning, convolutional neural networks, weighted gated recurrent unit
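The MIL-inspired objective described in this abstract derives a clip-level loss from statistics of the frame-level predictions. A minimal sketch of that idea follows; the top-k mean pooling and all names here are illustrative assumptions, not the authors' actual loss function.

```python
import numpy as np

def mil_clip_loss(frame_probs, weak_label, eps=1e-7):
    """Binary cross-entropy between a weak (clip-level) tag and an
    aggregate of frame-level predictions for one class.

    frame_probs: (T,) array of per-frame probabilities.
    weak_label: 0 or 1, the clip-level tag (no temporal boundaries).
    The aggregation (mean over the top 25% of frames) is one common
    MIL pooling choice, used here only for illustration.
    """
    k = max(1, len(frame_probs) // 4)          # pool over the top 25% of frames
    clip_prob = np.sort(frame_probs)[-k:].mean()
    clip_prob = np.clip(clip_prob, eps, 1 - eps)
    return -(weak_label * np.log(clip_prob)
             + (1 - weak_label) * np.log(1 - clip_prob))
```

Training against such a loss lets the network learn frame-level activity from clip-level tags alone, which is what enables the localization step.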
Training general-purpose audio tagging networks with noisy labels and iterative self-verification
Matthias Dorfer and Gerhard Widmer
Johannes Kepler University, Institute of Computational Perception, Linz, Austria
This paper describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural network that predicts one out of 41 possible audio class labels when given an audio spectrogram excerpt as an input. What makes this classification dataset, and the task in general, special is the fact that only 3,700 of the 9,500 provided training examples are delivered with manually verified ground truth labels. The remaining non-verified observations are expected to contain a substantial amount of label noise (up to 30-35% in the “worst” categories). We propose to address this issue by a simple, iterative self-verification process, which gradually shifts unverified labels into the verified, trusted training set. The decision criterion for self-verifying a training example is the prediction consensus of a previous snapshot of the network on multiple short sliding window excerpts of the training example at hand. On the unseen test data, an ensemble of three networks trained with this self-verification approach achieves a mean average precision (MAP@3) of 0.951. This is the second best out of 558 submissions to the corresponding Kaggle challenge.
Audio-tagging, Fully Convolutional Neural Networks, Noisy Labels, Label Self-Verification
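The self-verification criterion above — promote a noisy label when a network snapshot's predictions agree on it across sliding-window excerpts — can be sketched as follows. The function name and the 0.7 agreement threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def self_verify(window_preds, noisy_label, agreement=0.7):
    """Decide whether a noisy-labelled clip joins the trusted set.

    window_preds: (num_windows, num_classes) class probabilities from an
    earlier network snapshot on sliding-window excerpts of the clip.
    The clip is verified if a sufficient fraction of window-level
    predictions vote for the noisy label.
    """
    votes = window_preds.argmax(axis=1)        # predicted class per window
    consensus = np.mean(votes == noisy_label)  # fraction agreeing with the label
    return consensus >= agreement
```

Repeating this check after each retraining round gradually grows the verified set, which is the iterative part of the scheme.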
General-purpose tagging of Freesound audio with AudioSet labels: task description, dataset, and baseline
Eduardo Fonseca1, Manoj Plakal2, Frederic Font1, Daniel P. W. Ellis2, Xavier Favory1, Jordi Pons1, and Xavier Serra1
1Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, 2Google, Inc., New York, NY, USA
This paper describes Task 2 of the DCASE 2018 Challenge, titled “General-purpose audio tagging of Freesound content with AudioSet labels”. This task was hosted on the Kaggle platform as “Freesound General-Purpose Audio Tagging Challenge”. The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.
Audio tagging, audio dataset, data collection
Unsupervised adversarial domain adaptation for acoustic scene classification
Shayan Gharib1, Konstantinos Drossos1, Emre Cakir1, Dmitriy Serdyuk2, and Tuomas Virtanen1
1Laboratory of Signal Processing, Tampere University of Technology, Finland, 2Montreal Institute for Learning Algorithms, Canada
A general problem in the acoustic scene classification task is the mismatch between training and testing conditions, which significantly reduces the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, using data from another set of conditions, adapt the model so that its output cannot be used to classify which set of conditions the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, Subtask B, that contains data from mismatched recording devices. We consider the scenario where annotations are available for the data recorded with one device, but not for the rest. Our results show that with our model-agnostic method we can achieve an increase of approximately 10% in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
Adversarial domain adaptation, acoustic scene classification
Towards perceptual soundscape characterization using event detection algorithms
Félix Gontier1, Pierre Aumond2,3, Mathieu Lagrange1, Catherine Lavandier2, and Jean-Francois Petiot1
1LS2N, UMR 6004, Ecole Centrale de Nantes, Nantes, France, 2ETIS, UMR 8051, Universite Paris Seine, Université de Cergy-Pontoise, ENSEA, CNRS, France, 3IFSTTAR, CEREMA, UMRAE, Bouguenais, France
Assessing the properties of specific sound sources is important to better characterize the perception of urban sound environments. In order to produce perceptually motivated noise maps, we argue that it is possible to use the data produced by acoustic sensor networks to gather information about sources of interest and predict their perceptual attributes. To validate this important assumption, this paper reports on a perceptual test on simulated sound scenes for which both perceptual and acoustic source properties are known. Results show that it is indeed feasible to predict perceptual source-specific quantities of interest from recordings, leading to the introduction of two predictors of perceptual judgments from acoustic data. The use of those predictors in the new task of automatic soundscape characterization is finally discussed.
Soundscape, urban acoustic monitoring, event detection
Multi-scale convolutional recurrent neural network with ensemble method for weakly labeled sound event detection
Yingmei Guo1, Mingxing Xu1, Jianming Wu2, Yanan Wang2, and Keiichiro Hoashi2
1Tsinghua University, Department of Computer Science and Technology, Beijing, China, 2KDDI Research, Inc., Saitama, Japan
In this paper, we describe our contributions to the Detection and Classification of Acoustic Scenes and Events 2018 challenge (DCASE2018). We propose a multi-scale convolutional recurrent neural network (Multi-scale CRNN), a novel weakly-supervised learning framework for sound event detection. By integrating information from different time resolutions, the multi-scale method can capture both the fine-grained and coarse-grained features of sound events and model temporal dependencies, including fine-grained and long-term dependencies. The CRNN, using learnable gated linear units (GLUs), also helps to select the features most relevant to the audio labels. Furthermore, the ensemble method proposed in the paper helps to correct frame-level prediction errors using classification results, as identifying the sound events occurring in the audio is much easier than providing the event time boundaries. The proposed method achieves an event-based F1-score of 29.2% and an event-based error rate of 1.40 on the development set of DCASE2018 Task 4, compared to the baseline's 14.1% F1-score and 1.54 error rate.
Sound event detection, Weakly-supervised learning, Deep learning, Convolutional recurrent neural network, Multi-scale model
Robust median-plane binaural sound source localization
Benjamin R. Hammond and Philip J.B. Jackson
University of Surrey, Centre for Vision, Speech and Signal Processing, Guildford, UK
For a sound source on the median plane of a binaural system, interaural localization cues are absent. Robust binaural localization of sound sources on the median plane therefore requires methods designed with this in mind. We compare four median-plane binaural sound source localization methods. Where appropriate, adjustments have been made to the methods to improve their robustness to real-world recording conditions. The methods are tested using different HRTF datasets to generate the test data and training data. Each method uses a different combination of spectral and interaural localization cues, allowing for a comparison of the effect of spectral and interaural cues on median-plane localization. The methods are tested for their robustness to different levels of additive noise and different categories of sound.
Binaural, Localization, Median-Plane
Sound event detection using weakly labelled semi-supervised data with GCRNNs, VAT and self-adaptive label refinement
Robert Harb and Franz Pernkopf
Graz University of Technology, Signal Processing and Speech Communication Laboratory, Austria
In this paper, we present a gated convolutional recurrent neural network based approach to solve Task 4 of the DCASE 2018 challenge: large-scale weakly labelled semi-supervised sound event detection in domestic environments. Gated linear units and a temporal attention layer are used to predict the onset and offset of sound events in 10 s long audio clips; for training, only weakly-labelled data is used. Virtual adversarial training is used for regularization, utilizing both labelled and unlabelled data. Furthermore, we introduce self-adaptive label refinement, a method which allows unsupervised adaptation of our trained system to refine the accuracy of frame-level class predictions. The proposed system reaches an overall macro-averaged event-based F-score of 34.6%, a relative improvement of 20.5% over the baseline system.
DCASE 2018, Convolutional neural networks, Sound event detection, Weakly-supervised learning, Semi-supervised learning
3D convolutional recurrent neural networks for bird sound detection
Ivan Himawan, Michael Towsey, and Paul Roe
Queensland University of Technology, Science and Engineering Faculty, Brisbane, Australia
With the increasing use of high-quality acoustic devices to monitor wildlife populations, it has become imperative to develop techniques for analyzing animal calls automatically. Bird sound detection is one example of a long-term monitoring project in which data are collected over continuous periods, often covering multiple sites at the same time. Inspired by the success of deep learning approaches in various audio classification tasks, this paper first reviews previous work exploiting deep learning for bird audio detection, and then proposes a novel 3-dimensional (3D) convolutional recurrent neural network. We propose 3D convolutions for extracting long-term and short-term information in frequency simultaneously. In order to leverage the powerful and compact features of 3D convolution, we employ separate recurrent neural networks (RNNs), acting on each filter of the last convolutional layer rather than stacking the feature maps as in typical combined convolutional and recurrent architectures. Our best model achieved an 88.70% Area Under ROC Curve (AUC) score on the preview of the unseen evaluation data in the second edition of the bird audio detection challenge. Further improvement with model adaptation led to an 89.58% AUC score.
bird sound detection, deep learning, 3D CNN, GRU, biodiversity
Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units
Yuanbo Hou1, Qiuqiang Kong2, Jun Wang1, and Shengchen Li1
1 Beijing University of Posts and Telecommunications, Beijing, China, 2 Centre for Vision, Speech and Signal Processing, University of Surrey, UK
Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use a Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level probability of labels to the clip-level probability of labels. To compare the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with the baseline system, a CRNN trained using the CTC loss function without GLU. The experiments show that GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP at 0.803, GLU-GAP at 0.766 and the baseline system at 0.837. This means that, based on the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping. When both are based on CTC mapping, the CRNN with GLU outperforms the CRNN without GLU.
Audio tagging, Convolutional Recurrent Neural Network (CRNN), Gated Linear Units (GLU), Connectionist Temporal Classification (CTC), Sequentially Labelled Data (SLD)
Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds
Shota Ikawa1 and Kunio Kashino1,2
1Graduate School of Information Science and Technology, University of Tokyo, Japan, 2NTT Communication Science Laboratories, NTT Corporation, Japan
As a means of searching for desired audio signals stored in a database, we consider using an onomatopoeic word, namely a word that imitates a sound, as a query. This allows the user to specify the desired sound by verbally mimicking it or by typing a word that sounds similar to it. However, it is generally difficult to realize such a system based on text similarities between the onomatopoeic query and the onomatopoeic tags associated with each section of the audio signals in the database. In this paper, we propose a novel audio signal search method that uses a latent variable space obtained through a learning process. By employing an encoder-decoder onomatopoeia generation model and an encoder model for the onomatopoeias, both audio signals and onomatopoeias are mapped into this space, allowing us to directly measure the distance between them. Subjective tests show that the search results obtained with the proposed method correspond to the onomatopoeic queries reasonably well, and that the method generalizes when searching. We also confirm that users preferred the audio signals obtained with this approach to those obtained with a text-based similarity search.
audio signal search, onomatopoeia, latent variable, encoder-decoder model
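Once audio signals and onomatopoeic queries live in a shared latent space, search reduces to nearest-neighbour ranking by distance. A minimal sketch, assuming embeddings are already available as arrays (the encoder models themselves are not reproduced here):

```python
import numpy as np

def rank_by_query(query_emb, sound_embs):
    """Rank database sounds by Euclidean distance to a query embedding
    in a shared latent space.

    query_emb: (D,) embedding of the onomatopoeic query.
    sound_embs: (N, D) embeddings of the database audio signals.
    Returns database indices ordered nearest-first.
    """
    d = np.linalg.norm(sound_embs - query_emb, axis=1)
    return np.argsort(d)
```

Any distance could be substituted here; Euclidean distance is used purely for illustration.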
General-purpose audio tagging from noisy labels using convolutional neural networks
Turab Iqbal, Qiuqiang Kong, Mark D. Plumbley, and Wenwu Wang
University of Surrey, Centre for Vision, Speech and Signal Processing, UK
General-purpose audio tagging refers to classifying sounds of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels is noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.
Audio classification, convolutional network, recurrent network, deep learning, data augmentation, label noise
Audio tagging system using densely connected convolutional networks
Il-Young Jeong and Hyungui Lim
Cochlear.ai, Seoul, Korea
In this paper, we describe the techniques and models applied in our submission for DCASE 2018 Task 2: general-purpose audio tagging of Freesound content with AudioSet labels. We mainly focus on how to train deep learning models efficiently against strong augmentation and label noise. First, we constructed a single-block DenseNet architecture with a multi-head softmax classifier for efficient learning with mixup augmentation. For label noise, we applied batch-wise loss masking to eliminate the loss of outliers in a mini-batch. We also used an ensemble of various models, trained using different sampling rates or audio representations.
Audio tagging, DenseNet, Mixup, Multi-head softmax, Batch-wise loss masking
DNN based multi-level feature ensemble for acoustic scene classification
Jee-weon Jung, Hee-soo Heo, Hye-jin Shim, and Ha-jin Yu
University of Seoul, School of Computer Science, South Korea
Various characteristics can be used to define an acoustic scene, such as long-term context information and short-term events. This makes it difficult to select input features and pre-processing methods suitable for acoustic scene classification. In this paper, we propose an ensemble model that exploits various input features whose strengths for classifying an acoustic scene vary: i-vectors are used for segment-level representations of long-term context, spectrograms are used for frame-level short-term events, and raw waveforms are used to extract features that could be missed by existing methods. For each feature, we used deep neural network based models to extract a representation from an input segment. A separate scoring phase was then used to extract class-wise scores on a scale of 0 to 1 that could serve as confidence measures. Scores were extracted using Gaussian models and support vector machines. We tested the validity of the proposed framework using Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 dataset. The proposed framework had an accuracy of 73.82% on the pre-defined fold-1 validation setup and 74.8% on the evaluation setup, ranking 7th among teams.
Acoustic scene classification, DNN, raw waveform, i-vector
Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology
Bongjun Kim1, Madhav Ghei1, Bryan Pardo1, and Zhiyao Duan2
1Northwestern University, Department of Electrical Engineering and Computer Science, Evanston, USA, 2University of Rochester, Department of Electrical and Computer Engineering, Rochester, USA
Query-By-Vocal Imitation (QBV) search systems enable searching a collection of audio files using a vocal imitation as a query. This can be useful when sounds do not have commonly agreed-upon text labels, or when many sounds share a label. As deep learning approaches have been successfully applied to QBV systems, datasets to build models have become more important. We present Vocal Imitation Set, a new vocal imitation dataset containing 11,242 crowd-sourced vocal imitations of 302 sound event classes in the AudioSet sound event ontology. It is the largest publicly-available dataset of vocal imitations, as well as the first to adopt the widely-used AudioSet ontology for a vocal imitation dataset. Each imitation recording in Vocal Imitation Set was rated by a human listener on how similar the imitation is to the recording it imitated. Vocal Imitation Set also has an average of 10 different original recordings per sound class. Since each sound class has about 19 listener-vetted imitations and 10 original sound files, the dataset is suited to training models for fine-grained vocal imitation-based search within sound classes. We provide an example of using the dataset to measure how well the existing state of the art in QBV search performs on fine-grained search.
Vocal imitation datasets, audio retrieval, query-by-vocal imitation search
DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline
Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley
University of Surrey, Centre for Vision, Speech and Signal Processing, UK
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers, originating from AlexNet and VGG in computer vision. We investigated how performance varies from task to task with the same neural network configuration. Experiments show that the deeper 8-layer CNN performs better than the 4-layer CNN on all tasks except Task 1. Using the 8-layer CNN, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.
DCASE 2018 challenge, convolutional neural networks, open source
Iterative knowledge distillation in R-CNNs for weakly-labeled semi-supervised sound event detection
Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer
Johannes Kepler University, Institute of Computational Perception, Linz, Austria
In this paper, we present our approach used for the CP-JKU submission in Task 4 of the DCASE 2018 Challenge. We propose a novel iterative knowledge distillation technique for weakly-labeled semi-supervised event detection using neural networks, specifically Recurrent Convolutional Neural Networks (R-CNNs). R-CNNs are used to tag the unlabeled data and predict strong labels. Further, we use the R-CNN strong pseudo-labels on the training datasets and train new models after applying label-smoothing techniques to the strong pseudo-labels. Our proposed approach significantly improved on the baseline, achieving an event-based F-measure of 40.86% compared to the baseline's 15.11% on the provided test set from the development dataset.
Weakly-labeled, Semi-supervised, Knowledge Distillation, Recurrent Neural Network, Convolutional Neural Network
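The label smoothing applied to the strong pseudo-labels above can be sketched as mixing each one-hot frame label with the uniform distribution; eps=0.1 is a common default, not necessarily the CP-JKU setting, and the function name is illustrative.

```python
import numpy as np

def smooth_pseudo_labels(pseudo, eps=0.1):
    """Label smoothing for strong (frame-level) pseudo-labels.

    pseudo: (T, num_classes) one-hot pseudo-labels produced by a tagger.
    Each row is blended with the uniform distribution over classes,
    softening overconfident pseudo-labels before retraining.
    """
    num_classes = pseudo.shape[1]
    return (1.0 - eps) * pseudo + eps / num_classes
```

The smoothed targets then replace the hard pseudo-labels when training the next model in the distillation loop.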
Acoustic bird detection with deep convolutional neural networks
Museum fuer Naturkunde, Leibniz Institute for Evolution and Biodiversity Science, Berlin, Germany
This paper presents deep learning techniques for acoustic bird detection. Deep Convolutional Neural Networks (DCNNs), originally designed for image classification, are adapted and fine-tuned to detect the presence of birds in audio recordings. Various data augmentation techniques are applied to increase model performance and improve generalization to unknown recording conditions and new habitats. The proposed approach is evaluated on the dataset of the Bird Audio Detection task, part of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018. It surpasses the previous state of the art, achieving an area under the curve (AUC) above 95% on the public challenge leaderboard.
Bird Detection, Deep Learning, Deep Convolutional Neural Networks, Data Augmentation
Combining high-level features of raw audio waves and mel-spectrograms for audio tagging
Marcel Lederle and Benjamin Wilhelm
University of Konstanz, Germany
In this paper, we describe our contribution to Task 2 of the DCASE 2018 Audio Challenge. While it has become ubiquitous to utilize an ensemble of machine learning methods for classification tasks to obtain better predictive performance, the majority of ensemble methods combine predictions rather than learned features. We propose a single-model method that combines learned high-level features computed from log-scaled mel-spectrograms and raw audio data. These features are learned separately by two Convolutional Neural Networks, one for each input type, and then combined by densely connected layers within a single network. This relatively simple approach, along with data augmentation, ranks within the best two percent in the Freesound General-Purpose Audio Tagging Challenge on Kaggle.
audio-tagging, convolutional neural network, raw audio, mel-spectrogram
Fast mosquito acoustic detection with field cup recordings: an initial investigation
Yunpeng Li1, Ivan Kiskin1, Marianne Sinka2, Davide Zilli1,3, Henry Chan1, Eva Herreros-Moya2, Theeraphap Chareonviriyaphap4, Rungarun Tisgratog4, Kathy Willis2,5, and Stephen Roberts1,3
1University of Oxford, Machine Learning Research Group, Department of Engineering Science, UK, 2University of Oxford, Department of Zoology, UK, 3Mind Foundry Ltd., UK, 4Kasetsart University, Department of Entomology, Faculty of Agriculture, Bangkok, Thailand, 5Royal Botanic Gardens, Kew, UK
As vectors of disease, mosquitoes are the world's deadliest animals. A fast and efficient mosquito survey tool is crucial for vectored disease intervention programmes to reduce mosquito-induced deaths. Standard mosquito sampling techniques, such as human landing catches, are time-consuming, expensive and can put the collectors at risk of disease. Mosquito acoustic detection aims to provide a cost-effective automated detection tool based on mosquitoes' characteristic flight tones. We propose a simple, yet highly effective, classification pipeline based on the mel-frequency spectrum combined with convolutional neural networks. This detection pipeline is computationally efficient not only in detecting mosquitoes, but also in classifying species. Many previous assessments of mosquito acoustic detection techniques have relied only upon lab recordings of mosquito colonies. We illustrate in this paper our proposed algorithm's performance on an extensive dataset, consisting of cup recordings of more than 1000 individual mosquitoes from 6 species captured in field studies in Thailand.
Mosquito detection, acoustic signal processing, multi-species classification, convolutional neural networks
Domain tuning methods for bird audio detection
Sidrah Liaqat, Narjes Bozorg, Neenu Jose, Patrick Conrey, Antony Tamasi, and Michael T. Johnson
University of Kentucky, Speech and Signal Processing Lab, Electrical Engineering Department, Lexington, USA
This paper presents several feature extraction and normalization methods implemented for the DCASE 2018 Bird Audio Detection challenge, a binary audio classification task, to identify whether a ten-second audio segment from a specified dataset contains one or more bird vocalizations. Our baseline system is adapted from “bulbul”, the Convolutional Neural Network system of last year's challenge winner. We introduce one feature modification, an increase in the temporal resolution of the Mel-spectrogram feature matrix, tailored to the fast-changing temporal structure of many songbird vocalizations. Additionally, we introduce two feature normalization approaches: a front-end signal enhancement method to reduce differences in dataset noise characteristics, and an explicit domain adaptation method based on covariance normalization. Results show that none of these approaches gave significant benefit individually, but that combining the methods led to overall improvement. Despite the modest improvement, this system won the award for “Highest-scoring open-source/reproducible method” for this task.
audio classification, convolutional neural network, bioacoustic vocalization analysis, domain adaptation
Weakly labeled semi-supervised sound event detection using CRNN with inception module
Wootaek Lim, Sangwon Suh, and Youngho Jeong
Realistic AV Research Group, Electronics and Telecommunications Research Institute, Daejeon, Korea
In this paper, we present a method for large-scale detection of sound events using small amounts of weakly labeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge Task 4. For this task, we adopted a convolutional neural network (CNN) and a gated recurrent unit (GRU) based bidirectional recurrent neural network (RNN) as our proposed system. In addition, we applied an Inception module to handle various receptive fields at once in each CNN layer. We also applied a data augmentation method to address the labeled data shortage, and an event activity detection method for strong label learning. Applying the proposed method to weakly labeled semi-supervised sound event detection, we verified that the proposed system provides better performance than the DCASE 2018 baseline system.
DCASE 2018, Sound event detection, Weakly labeled semi-supervised learning, Deep learning, Inception module
Acoustic scene classification using multi-scale features
Liping Yang, Xinxing Chen, and Lianjie Tao
Chongqing University, Key Laboratory of Optoelectronic Technology and System, China
Convolutional neural networks (CNNs) have shown tremendous ability in many classification problems, because they can improve classification performance by extracting abstract features. In this paper, we use CNNs to compute features layer by layer. As the layers deepen, the extracted features become more abstract, but the shallow features are also very useful for classification. We therefore propose a method that fuses features from different layers (called multi-scale features), which can improve the performance of acoustic scene classification. In our method, the log-Mel features of the audio signal are used as the input to the CNNs. In order to reduce the number of parameters, we use Xception as the base network, a CNN with depthwise separable convolutions (a depthwise convolution followed by a pointwise convolution), and modify it to fuse multi-scale features. We also introduce the focal loss to further improve classification performance. This method achieves commendable results whether the audio recordings are collected by the same device (Subtask A) or by different devices (Subtask B).
Multi-scale features, acoustic scene classification, convolutional neural network, Xception, log Mel features
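The focal loss mentioned in the abstract above down-weights well-classified examples so that training concentrates on hard ones. A minimal NumPy sketch, as an illustrative reimplementation rather than the authors' code:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss. probs: (N, C) softmax outputs; labels: (N,) class ids.

    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    suppresses the contribution of examples the model already gets right.
    """
    pt = probs[np.arange(len(labels)), labels]      # probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))
```

The confident example (pt = 0.9) contributes almost nothing at gamma = 2, while the uncertain one (pt = 0.6) dominates the batch loss.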
Audio feature space analysis for acoustic scene classification
West Pomeranian University of Technology, Faculty of Computer Science and Information Technology, Szczecin, Poland
The paper presents a study of audio feature analysis for acoustic scene classification. Various feature sets and many classifiers were employed to build a scene classification system by determining a compact feature space and using ensemble learning. The input feature space, containing different sets and representations, was reduced to 223 attributes using the importance of individual features computed by a gradient boosting trees algorithm. The resulting set of features was split into distinct groups partly reflecting auditory cues, and their contribution to discriminative power was analysed. Also, to determine the influence of the pattern recognition system on the final efficacy, accuracy tests were performed using several classifiers. Finally, the conducted experiments show that the proposed solution with a dedicated feature set outperformed the baseline system by 6%.
audio features, auditory scene analysis, ensemble learning, majority voting
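The majority voting named in the keywords above fuses the label predictions of several classifiers by taking, per sample, the most frequent vote. A minimal sketch; the classifier outputs and label names are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse hard label predictions by per-sample majority vote.

    predictions: list of per-classifier label lists, one label per sample.
    Ties resolve in favour of the label seen first among the classifiers.
    """
    n_samples = len(predictions[0])
    fused = []
    for i in range(n_samples):
        votes = Counter(clf[i] for clf in predictions)
        fused.append(votes.most_common(1)[0][0])
    return fused
```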
Applying triplet loss to siamese-style networks for audio similarity ranking
Brian Margolis, Madhav Ghei, and Bryan Pardo
Northwestern University, Electrical Engineering and Computer Science, Evanston, USA
Query by vocal imitation (QBV) systems let users search a library of general non-speech audio files using a vocal imitation of the desired sound as the query. The best existing system for QBV uses a similarity measure between vocal imitations and general audio files that is learned by a two-tower semi-Siamese deep neural network architecture. This approach typically uses pairwise training examples and error measurement. In this work, we show that this pairwise error signal does not correlate well with improved search rankings and instead describe how triplet loss can be used to train a two-tower network designed to work with pairwise loss, resulting in better correlation with search rankings. This approach can be used to train any two-tower architecture using triplet loss. Empirical results on a dataset of vocal imitations and general audio files show that low triplet loss is much better correlated with improved search ranking than low pairwise loss.
vocal imitation, information retrieval, convolutional Siamese-style networks, triplet loss
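The triplet loss described above hinges on the gap between the anchor-positive and anchor-negative distances. A minimal NumPy sketch, with Euclidean distance assumed for illustration (the actual network learns its own embedding-space similarity):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the distance gap: the positive should sit closer to the
    anchor than the negative by at least `margin`, else a penalty accrues."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

When the negative is already farther away than the positive by more than the margin, the loss is zero, so training effort goes to triplets whose ranking is still wrong, which is exactly the property that correlates with search ranking quality.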
Exploring deep vision models for acoustic scene classification
Octave Mariotti, Matthieu Cord, and Olivier Schwander
Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, Paris, France
This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition. In the context of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events 2018, we trained several of these architectures on the task 1 dataset to perform acoustic scene classification. Then, in order to produce more robust predictions, we explored two ensemble methods to aggregate the different model outputs. Our results show a final accuracy of 79% on the development dataset for subtask A, outperforming the baseline by almost 20%.
Acoustic scene classification, DCASE 2018, Vision, VGG, Residual networks, Ensemble
A multi-device dataset for urban acoustic scene classification
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen
Tampere University of Technology, Laboratory of Signal Processing, Tampere, Finland
This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.
Acoustic scene classification, DCASE challenge, public datasets, multi-device data
Data-efficient weakly supervised learning for low-resource audio event detection using deep learning
Veronica Morfi and Dan Stowell
Queen Mary University of London, Machine Listening Lab, Centre for Digital Music (C4DM), UK
We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are “weakly labelled”, having only a list of the events present in each recording, without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose data-efficient training of a stacked convolutional and recurrent neural network. This neural network is trained in a multi instance learning setting, for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on two low-resource datasets that lack temporal labels.
Multi instance learning, deep learning, weak labels, audio event detection
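As context for the multi instance learning setting above, the standard clip-level baseline pools the network's frame predictions (max over time) and scores them against the weak tags. This is the common MIL formulation, not the new loss function the paper introduces:

```python
import numpy as np

def mil_clip_loss(frame_probs, weak_labels, eps=1e-7):
    """Binary cross-entropy between max-pooled frame predictions and weak tags.

    frame_probs: (T, C) per-frame event probabilities from the network.
    weak_labels: (C,) binary clip-level tags (no temporal information).
    """
    clip_probs = frame_probs.max(axis=0)            # max-pooling over time
    clip_probs = np.clip(clip_probs, eps, 1.0 - eps)
    bce = -(weak_labels * np.log(clip_probs)
            + (1.0 - weak_labels) * np.log(1.0 - clip_probs))
    return float(bce.mean())
```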
Acoustic scene classification using a convolutional neural network ensemble and nearest neighbor filters
Truc Nguyen and Franz Pernkopf
Graz University of Technology, Signal Processing and Speech Communication Lab., Austria
This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in tasks 1A and 1B of the DCASE 2018 challenge. We introduce a nearest neighbor filter applied to spectrograms, which emphasizes and smooths similar patterns of sound events in a scene. We also propose a variety of CNN models for single-input (SI) and multi-input (MI) channels, and three different methods for building a network ensemble. The experimental results show that for task 1A, the combination of the MI-CNN structures using both log-mel features and their nearest neighbor filtering is slightly more effective than the single-input channel CNN models using log-mel features only; the opposite holds for task 1B. In addition, the ensemble methods improve the accuracy of the system significantly. The best ensemble method is ensemble selection, which achieves 69.3% for task 1A and 63.6% for task 1B, improving on the baseline system by 8.9% and 14.4%, respectively.
DCASE 2018, acoustic scene classification, convolution neural network, nearest neighbor filter
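One plausible reading of the nearest neighbor filter above is to replace each spectrogram frame with the average of itself and its most similar frames, smoothing recurring event patterns while suppressing transient noise. The exact formulation (similarity measure, neighbourhood size) is an assumption for illustration, not taken from the paper:

```python
import numpy as np

def nn_filter(spec, k=2):
    """Smooth a (T, F) spectrogram by averaging each frame with its k
    most cosine-similar other frames (hypothetical formulation)."""
    norms = np.linalg.norm(spec, axis=1, keepdims=True) + 1e-12
    unit = spec / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)                  # exclude the frame itself
    out = np.empty_like(spec, dtype=float)
    for t in range(len(spec)):
        nn = np.argsort(sim[t])[-k:]                # indices of k nearest frames
        out[t] = (spec[t] + spec[nn].sum(axis=0)) / (k + 1)
    return out
```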
DCASE 2018 task 2: iterative training, label smoothing, and background noise normalization for audio event tagging
Thi Ngoc Tho Nguyen1, Ngoc Khanh Nguyen2, Douglas L. Jones3, and Woon Seng Gan1
1Nanyang Technological University, Electrical and Electronic Engineering Dept., Singapore, 2SWAT, Singapore, 3University of Illinois at Urbana-Champaign, Dept. of Electrical and Computer Engineering, USA
This paper describes the approach behind our submissions for DCASE 2018 Task 2: general-purpose audio tagging of Freesound content with AudioSet labels. To tackle the problem of diverse recording environments, we propose background noise normalization. To tackle the problem of noisy labels, we propose pseudo-labels for automatic label verification, and label smoothing to reduce over-fitting. We train several convolutional neural networks with data augmentation and different input sizes for the automatic label verification process. The label verification procedure is a promising way to improve the quality of datasets for audio classification. Our ensemble model ranked fifth on the private leaderboard of the competition with an mAP@3 score of 0.9496.
Audio event tagging, Background noise normalization, Convolutional neural networks, DCASE 2018, Label smoothing, Pseudo-label
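Label smoothing, used above against noisy labels, replaces hard one-hot targets with softened ones by redistributing a small probability mass uniformly over all classes. A minimal sketch; the epsilon value is illustrative:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Move eps of the probability mass uniformly onto all classes,
    discouraging over-confident fits to possibly mislabeled examples."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes
```

The smoothed target still sums to one, so it remains a valid distribution for cross-entropy training.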
To bee or not to bee: Investigating machine learning approaches for beehive sound recognition
Inês Nolasco and Emmanouil Benetos
Queen Mary University of London, School of Electronic Engineering and Computer Science, UK
In this work, we aim to explore the potential of machine learning methods for the problem of beehive sound recognition. A major contribution of this work is the creation and release of annotations for a selection of beehive recordings. By experimenting with both support vector machines and convolutional neural networks, we explore important aspects to be considered in the development of beehive sound recognition systems using machine learning approaches.
Computational bioacoustic scene analysis, ecoacoustics, beehive sound recognition
Ensemble of convolutional neural networks for general-purpose audio tagging
School of Electrical Engineering, Signals and Systems Department, Belgrade, Serbia
This work describes our solution for the general-purpose audio tagging task of the DCASE 2018 challenge. We propose an ensemble of several Convolutional Neural Networks (CNNs) with different properties. Logistic regression is used as a meta-classifier to produce the final predictions. Experiments demonstrate that the ensemble outperforms each CNN individually. Finally, the proposed system achieves a Mean Average Precision (MAP) score of 0.945 on the test set, which is a significant improvement over the baseline.
audio tagging, DCASE 2018, convolutional neural networks, ensembling
Attention-based convolutional neural networks for acoustic scene classification
Zhao Ren1, Qiuqiang Kong2, Kun Qian1, Mark D. Plumbley2, and Björn W. Schuller 1,3
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany, 2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK, 3GLAM – Group on Language, Audio & Music, Imperial College London, UK
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From the attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy for subtask A is 72.6%, an improvement of 12.9% over the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is also a significant improvement over the baseline, with accuracies of 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
Acoustic Scene Classification, Convolutional Neural Network, Attention Pooling, Log Mel Spectrogram
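Attention pooling, as described above, replaces max or average pooling with a learned weighted average over frames. A minimal NumPy sketch; in the real system the attention logits come from a learned layer, not hand-set values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, scores):
    """Weighted average of (T, D) frame features under softmax-normalized
    attention logits (T,): informative frames contribute more to the
    clip-level representation than in plain average pooling."""
    w = softmax(scores)
    return w @ features
```

Uniform logits recover average pooling; a strongly peaked logit approaches max pooling, so the learned weighting interpolates between the two classical schemes.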
Using an evolutionary approach to explore convolutional neural networks for acoustic scene classification
Christian Roletscheck, Tobias Watzka, Andreas Seiderer, Dominik Schiller, Elisabeth André
Augsburg University, Human Centered Multimedia, Germany
The successful application of modern deep neural networks is heavily reliant on the chosen architecture and the selection of appropriate hyperparameters. Due to the large number of parameters and the complex inner workings of a neural network, finding a suitable configuration for a given problem turns out to be a rather complex task for a human. In this paper, we propose an evolutionary approach to automatically generate a suitable neural network architecture and hyperparameters for any given classification problem. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. We take the DCASE 2018 Challenge as an opportunity to evaluate our algorithm on the task of acoustic scene classification. The best accuracy achieved by our approach was 74.7% on the development dataset.
Evolutionary algorithm, genetic algorithm, convolutional neural networks, acoustic scene classification
Large-scale weakly labeled semi-supervised sound event detection in domestic environments
Romain Serizel1, Nicolas Turpault1, Hamid Eghbal-Zadeh2, and Ankit Parag Shah3
1Université de Lorraine, CNRS, Inria, Loria, Nancy, France, 2Institute of Computational Perception, Johannes Kepler University of Linz, Austria, 3Language Technologies Institute, Carnegie Mellon University, Pittsburgh PA, United States
This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, etc.) and potential applications.
Sound event detection, Large scale, Weakly labeled data, Semi-supervised learning
A report on audio tagging with deeper CNN, 1D-ConvNet and 2D-ConvNet
Qingkai Wei, Yanfang Liu, and Xiaohui Ruan
Beijing Kuaiyu Electronics Co. Ltd., Beijing, China
General-purpose audio tagging is a newly proposed task in DCASE 2018, which can provide insight towards broadly-applicable sound event classifiers. In this paper, two systems (named 1D-ConvNet and 2D-ConvNet) with small kernel sizes, multiple functional modules, and deeper CNNs (convolutional neural networks) are developed to improve performance on this task. In detail, different audio features are used: raw waveforms for the 1D-ConvNet, while frequency-domain features, such as MFCC, log-mel spectrogram, multi-resolution log-mel spectrogram, and spectrogram, are utilized as the 2D-ConvNet input. Using the DCASE 2018 Challenge task 2 dataset for training and evaluation, the best single models with 1D-ConvNet and 2D-ConvNet were chosen, whose Kaggle public leaderboard scores are 0.877 and 0.961, respectively. In addition, an ensemble using rank-averaged predictions scores 0.967 on the public leaderboard, ranking 5/558, and 0.942 on the private leaderboard, ranking 11/558.
DCASE 2018, Audio tagging, Convolutional neural networks, 1D-ConvNet, 2D-ConvNet, Model ensemble
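Rank averaging, used for the ensemble above, converts each model's scores to ranks before averaging, which makes the fusion insensitive to per-model score calibration. A minimal sketch:

```python
import numpy as np

def rank_average(score_lists):
    """Fuse (n_models, n_samples) score matrices on different scales by
    averaging per-model ranks instead of raw scores (0 = lowest)."""
    scores = np.asarray(score_lists, dtype=float)
    ranks = scores.argsort(axis=1).argsort(axis=1)  # double argsort yields ranks
    return ranks.mean(axis=0)
```

Two models that agree on the ordering of samples produce identical fused ranks even if one outputs probabilities and the other unbounded logits.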
Sample mixed-based data augmentation for domestic audio tagging
Shengyun Wei1, Kele Xu1,2, Dezhi Wang3, Feifan Liao1, Huaimin Wang2, and Qiuqiang Kong4
1National University of Defense Technology, Information and Communication Dept., Wuhan, China, 2National University of Defense Technology, Computer Dept., Changsha, China, 3National University of Defense Technology, College of Meteorology and Oceanography, Changsha, China, 4University of Surrey, Center for Vision, Speech and Signal Processing, Guildford, UK
Audio tagging has attracted increasing attention over the last decade and has potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging data such as the DCASE data, the trained models tend to overfit, which results in poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching, and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample-mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing, and extrapolation. We apply a convolutional recurrent neural network (CRNN) with an attention module, taking log-scaled mel spectra as input, as the baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 task 4 dataset with the mixup approach, outperforming the baseline system without data augmentation.
Audio tagging, data augmentation, sample mixed, convolutional recurrent neural network
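The mixup augmentation explored above interpolates pairs of training examples and their labels using a Beta-distributed mixing weight. A minimal NumPy sketch; the alpha value is illustrative:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Draw lambda ~ Beta(alpha, alpha) and linearly interpolate both the
    inputs and the (one-hot or multi-hot) labels of two examples."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

Small alpha values push lambda towards 0 or 1, so most mixed examples stay close to a real example with a faint admixture of another, which is what makes the regularization gentle.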
General-purpose audio tagging by ensembling convolutional neural networks based on multiple features
Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Germany
This paper describes an audio tagging system that participated in Task 2 “General-purpose audio tagging of Freesound content with AudioSet labels” of the “Detection and Classification of Acoustic Scenes and Events (DCASE)” Challenge 2018. The system is an ensemble consisting of five convolutional neural networks based on Mel-frequency Cepstral Coefficients, Perceptual Linear Prediction features, Mel-spectrograms and the raw audio data. For ensembling all models, score-based fusion via Logistic Regression is performed with another neural network. In experimental evaluations, it is shown that ensembling the models significantly improves upon the performances obtained with the individual models. As a final result, the system achieved a Mean Average Precision with Cutoff 3 of 0.9414 on the private leaderboard of the challenge.
audio tagging, acoustic event classification, convolutional neural network, score-based fusion
The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging
Zhicun Xu, Peter Smit, and Mikko Kurimo
Aalto University, Department of Signal Processing and Acoustics, Espoo, Finland
In this paper, we present a neural network system for DCASE 2018 task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes, on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate 128-dimensional features for each 0.96-second audio segment. We tried different neural network structures including LSTM and multi-level attention models; in our experiments, the multi-level attention model showed its superiority over the others. Truncating silent parts, repeating and splitting the audio into fixed lengths, pitch-shifting augmentation, and mixup techniques are all used in our experiments. The proposed system achieved an mAP@3 score of 0.936, which outperforms the baseline result of 0.704 and ranks in the top 8% of the public leaderboard.
audio tagging, AudioSet, multi-level attention model
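The mAP@3 metric reported by several of these Task 2 systems credits a correct label at rank r (among the top three predictions) with 1/r. A minimal sketch:

```python
def map_at_3(predictions, targets):
    """Mean average precision with cutoff 3.

    predictions: per-clip lists of up to 3 labels, most confident first.
    targets: the single correct label per clip. A correct label at rank r
    contributes 1/r; an absent or wrong label contributes 0.
    """
    total = 0.0
    for preds, target in zip(predictions, targets):
        for r, label in enumerate(preds[:3], start=1):
            if label == target:
                total += 1.0 / r
                break
    return total / len(targets)
```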
Meta learning based audio tagging
Kele Xu1, Boqing Zhu1, Dezhi Wang2, Yuxing Peng1, Huaimin Wang1, Lilun Zhang2, Bo Li3
1National University of Defense Technology, Computer Department, Changsha, China, 2National University of Defense Technology, Computer Department, Changsha, China, 3Beijing University of Posts and Telecommunications, Automation Department, Beijing, China
In this paper, we describe our solution for the general-purpose audio tagging task, one of the subtasks of the DCASE 2018 challenge. For the solution, we employed both deep learning methods and shallow learners based on statistical features. For single models, different deep convolutional neural network architectures are tested with different kinds of input, ranging from the raw signal to log-scaled Mel-spectrograms (log Mel) and Mel-frequency cepstral coefficients (MFCC). For log Mel and MFCC, the delta and delta-delta information is also used to form three-channel features, while mixup is used for data augmentation. Using ResNeXt, our best single convolutional neural network architecture provides an mAP@3 of 0.967 on the public Kaggle leaderboard and 0.939 on the private leaderboard. Moreover, to improve accuracy further, we also propose a meta-learning-based ensemble method. By exploiting the diversity between different architectures, the meta-learning-based model provides higher prediction accuracy and robustness in comparison to single models. Our solution achieves an mAP@3 of 0.977 on the public leaderboard and 0.951 at best on the private leaderboard, while the baseline gives an mAP@3 of 0.704.
Audio tagging, convolutional neural networks, meta-learning, mixup
Multi-level attention model for weakly supervised audio classification
Changsong Yu1, Karim Said Barsim1, Qiuqiang Kong2, Bin Yang1
1University of Stuttgart, Institute of Signal Processing and System Theory, Germany, 2University of Surrey, Center for Vision, Speech and Signal Processing, UK
In this paper, we propose a multi-level attention model for the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of sound events in an audio clip. Recently, Google published the large-scale, weakly labelled AudioSet dataset, containing 2 million audio clips with only presence or absence labels for the sound events, without their onset and offset times. Previously proposed attention models applied only a single attention module on the last layer of a neural network, which limited the capacity of the attention model. In this paper, we propose a multi-level attention model which consists of multiple attention modules applied to intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, followed by a fully connected layer, to obtain the final prediction for each class. Experiments show that the proposed multi-level attention model achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system, which achieve 0.327 and 0.314, respectively.
AudioSet, audio classification, attention model
Convolutional neural networks and x-vector embedding for DCASE2018 Acoustic Scene Classification challenge
Hossein Zeinali, Lukáš Burget and Jan Honza Černocký
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Czech Republic
In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE 2018 challenge are described, along with an analysis of different methods on the leaderboard set. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is a common two-dimensional CNN of the kind mainly used in image classification. The second is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so-called x-vectors, which have also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, in the best performing system, the outputs of the different systems are fused using simple output averaging. Our submissions ranked third among 24 teams in ASC sub-task A (task 1a).
Audio scene classification, Convolutional neural networks, Deep learning, x-vectors, Regularized LDA