Proceedings

The proceedings of the DCASE2018 Workshop have been published as an electronic publication in the Tampere University of Technology series:

Plumbley, M. D., Kroos, C., Bello, J. P., Richard, G., Ellis, D.P.W. & Mesaros, A. (Eds.) (2018). Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018).

ISBN (Electronic): 978-952-15-4262-6


PDF
Total cites: 1682 (updated 30.11.2023)
Abstract

We present a new extensible and divisible taxonomy for open set sound scene analysis. This new model allows complex scene analysis with tangible descriptors and perception labels. Its novel structure is a cluster graph such that each cluster (or subset) can stand alone for targeted analyses such as office sound event detection, whilst maintaining integrity over the whole graph (superset) of labels. The key design benefit is its extensibility as new labels are needed during new data capture. Furthermore, datasets which use the same taxonomy are easily augmented, saving future data collection effort. With our framework we balance the detail needed for complex scene analysis against becoming ‘the taxonomy of everything’, ensure there is no duplication in the superset of labels, and demonstrate this with DCASE challenge classifications.

Keywords

Taxonomy, ontology, sound scenes, sound events, sound scene analysis, open set

Cites: 3 ( see at Google Scholar )

PDF
Abstract

In this paper, we address the detection of audio events in domestic environments in the case where a weakly annotated dataset is available for training. The weak annotations provide tags of audio events but do not provide temporal boundaries. We report experiments in the framework of task 4 of the DCASE 2018 challenge. The objective is twofold: detect audio events (multi-categorical classification at recording level) and localize the events precisely within the recordings. We explored two approaches: 1) a “weighted-GRU” (WGRU), in which we train a Convolutional Recurrent Neural Network (CRNN) for classification and then exploit its frame-based predictions at the output of the time-distributed dense layer to perform localization. We propose to lower the influence of the hidden states to avoid predicting the same score throughout a recording. 2) An approach inspired by Multi-Instance Learning (MIL), in which we train a CRNN to give predictions at frame level, using a custom loss function based on the weak label and statistics of the frame-based predictions. Both approaches outperform the baseline of 14.06% in F-measure by a large margin, with values of 16.77% and 24.58% for combined WGRUs and MIL, respectively, on a test set comprised of 288 recordings.
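
As an illustration of MIL-style training from weak labels, the sketch below (not the authors' implementation; the mix of max and mean statistics is an assumption) pools frame-level sigmoid outputs of a CRNN into clip-level scores and applies a binary cross-entropy loss against the weak tags:

import torch
import torch.nn as nn

class WeakClipLoss(nn.Module):
    """Clip-level BCE built from frame-level predictions (MIL-style sketch)."""
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha          # assumed mix of max and mean statistics
        self.bce = nn.BCELoss()

    def forward(self, frame_probs, weak_labels):
        # frame_probs: (batch, time, classes), sigmoid outputs of a CRNN
        # weak_labels: (batch, classes), 0/1 tags without time boundaries
        clip_max = frame_probs.max(dim=1).values
        clip_mean = frame_probs.mean(dim=1)
        clip_probs = self.alpha * clip_max + (1 - self.alpha) * clip_mean
        return self.bce(clip_probs.clamp(1e-7, 1 - 1e-7), weak_labels)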

Keywords

Sound event detection, weakly supervised learning, multi-instance learning, convolutional neural networks, weighted gated recurrent unit

Cites: 9 ( see at Google Scholar )

PDF
Abstract

This paper describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural network that predicts one out of 41 possible audio class labels when given an audio spectrogram excerpt as an input. What makes this classification dataset and the task in general special is the fact that only 3,700 of the 9,500 provided training examples are delivered with manually verified ground truth labels. The remaining non-verified observations are expected to contain a substantial amount of label noise (up to 30-35% in the “worst” categories). We propose to address this issue with a simple, iterative self-verification process, which gradually shifts unverified labels into the verified, trusted training set. The decision criterion for self-verifying a training example is the prediction consensus of a previous snapshot of the network on multiple short sliding-window excerpts of the training example at hand. On the unseen test data, an ensemble of three networks trained with this self-verification approach achieves a mean average precision (MAP@3) of 0.951. This is the second best out of 558 submissions to the corresponding Kaggle challenge.
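
A minimal sketch of such an iterative self-verification loop (train_fn and predict_windows are hypothetical callables; the consensus threshold and number of rounds are assumptions, not the authors' exact criterion):

def self_verify(model, trusted, unverified, train_fn, predict_windows,
                n_rounds=3, agree=0.7):
    """Iteratively promote unverified clips to the trusted set when the model's
    sliding-window predictions agree with the clip's (noisy) label."""
    for _ in range(n_rounds):
        model = train_fn(model, trusted)                  # retrain on trusted data only
        still_unverified = []
        for clip, noisy_label in unverified:
            window_preds = predict_windows(model, clip)   # per-window predicted labels
            consensus = sum(p == noisy_label for p in window_preds) / len(window_preds)
            if consensus >= agree:
                trusted.append((clip, noisy_label))       # label considered self-verified
            else:
                still_unverified.append((clip, noisy_label))
        unverified = still_unverified
    return model, trusted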

Keywords

Audio-tagging, Fully Convolutional Neural Networks, Noisy Labels, Label Self-Verification

Cites: 34 ( see at Google Scholar )

PDF
Abstract

This paper describes Task 2 of the DCASE 2018 Challenge, titled “General-purpose audio tagging of Freesound content with AudioSet labels”. This task was hosted on the Kaggle platform as “Freesound General-Purpose Audio Tagging Challenge”. The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.

Keywords

Audio tagging, audio dataset, data collection

Cites: 163 ( see at Google Scholar )

PDF
Abstract

A general problem in the acoustic scene classification task is the mismatch in conditions between training and testing data, which significantly reduces the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, using data from another set of conditions, we adapt the model so that its output cannot be used to classify which set of conditions the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model-agnostic method we can achieve an increase of approximately 10% in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
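
One common way to implement unsupervised adversarial domain adaptation is a gradient reversal layer feeding a domain discriminator; the sketch below is a generic example of that idea, not necessarily the authors' exact formulation (the 128-dimensional feature size is an assumption):

import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Features from the labeled device and the unlabeled devices pass through the same
# feature extractor; the domain classifier learns to tell devices apart, while the
# reversed gradient pushes the extractor to make them indistinguishable.
domain_clf = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))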

Keywords

Adversarial domain adaptation, acoustic scene classification

Cites: 56 ( see at Google Scholar )

PDF
Abstract

Assessing properties about specific sound sources is important to characterize better the perception of urban sound environments. In order to produce perceptually motivated noise maps, we argue that it is possible to consider the data produced by acoustic sensor networks to gather information about sources of interest and predict their perceptual attributes. To validate this important assumption, this paper reports on a perceptual test on simulated sound scenes for which both perceptual and acoustic source properties are known. Results show that it is indeed feasible to predict perceptual source-specific quantities of interest from recordings, leading to the introduction of two predictors of perceptual judgments from acoustic data. The use of those predictors in the new task of automatic soundscape characterization is finally discussed.

Keywords

Soundscape, urban acoustic monitoring, event detection

Cites: 3 ( see at Google Scholar )

PDF
Abstract

In this paper, we describe our contributions to the challenge on detection and classification of acoustic scenes and events 2018 (DCASE2018). We propose a multi-scale convolutional recurrent neural network (multi-scale CRNN), a novel weakly-supervised learning framework for sound event detection. By integrating information from different time resolutions, the multi-scale method can capture both the fine-grained and coarse-grained features of sound events and model their temporal dependencies, including fine-grained and long-term dependencies. A CRNN using learnable gated linear units (GLUs) can also help to select the features most related to the audio labels. Furthermore, the ensemble method proposed in the paper helps to correct frame-level prediction errors using the classification results, as identifying the sound events that occur in an audio clip is much easier than providing the event time boundaries. The proposed method achieves an event-based F1-score of 29.2% and an event-based error rate of 1.40 on the development set of DCASE2018 task 4, compared to the baseline F1-score of 14.1% and error rate of 1.54 [1].
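
A gated linear unit over convolutional feature maps can be sketched as follows (a generic gated-convolution block; channel counts and kernel size are assumptions, not the paper's configuration):

import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Gated convolution: a linear path multiplied by a sigmoid gate."""
    def __init__(self, in_ch=64, out_ch=64, k=3):
        super().__init__()
        self.lin = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):                 # x: (batch, channels, time, freq)
        return self.lin(x) * torch.sigmoid(self.gate(x))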

Keywords

Sound event detection, Weakly-supervised learning, Deep learning, Convolutional recurrent neural network, Multi-scale model

Cites: 17 ( see at Google Scholar )

PDF
Abstract

For a sound source on the median-plane of a binaural system, interaural localization cues are absent. So, for robust binaural localization of sound sources on the median-plane, localization methods need to be designed with this in consideration. We compare four median-plane binaural sound source localization methods. Where appropriate, adjustments to the methods have been made to improve their robustness to real world recording conditions. The methods are tested using different HRTF datasets to generate the test data and training data. Each method uses a different combination of spectral and interaural localization cues, allowing for a comparison of the effect of spectral and interaural cues on median-plane localization. The methods are tested for their robustness to different levels of additive noise and different categories of sound.

Keywords

Binaural, Localization, Median-Plane

Cites: 2 ( see at Google Scholar )

PDF
Abstract

In this paper, we present a gated convolutional recurrent neural network based approach to solve task 4, large-scale weakly labelled semi-supervised sound event detection in domestic environments, of the DCASE 2018 challenge. Gated linear units and a temporal attention layer are used to predict the onset and offset of sound events in 10-second audio clips, whereby only weakly-labelled data is used for training. Virtual adversarial training is used for regularization, utilizing both labelled and unlabelled data. Furthermore, we introduce self-adaptive label refinement, a method which allows unsupervised adaptation of our trained system to refine the accuracy of frame-level class predictions. The proposed system reaches an overall macro-averaged event-based F-score of 34.6%, resulting in a relative improvement of 20.5% over the baseline system.

Keywords

DCASE 2018, Convolutional neural networks, Sound event detection, Weakly-supervised learning, Semi-supervised learning

Cites: 11 ( see at Google Scholar )

PDF
Abstract

With the increasing use of high-quality acoustic devices to monitor wildlife populations, it has become imperative to develop techniques for analyzing animal calls automatically. Bird sound detection is one example of a long-term monitoring project where data are collected in continuous periods, often covering multiple sites at the same time. Inspired by the success of deep learning approaches in various audio classification tasks, this paper first reviews previous work exploiting deep learning for bird audio detection, and then proposes a novel 3-dimensional (3D) convolutional and recurrent neural network. We propose 3D convolutions for extracting long-term and short-term information in frequency simultaneously. In order to leverage the powerful and compact features of 3D convolution, we employ separate recurrent neural networks (RNNs), acting on each filter of the last convolutional layers rather than stacking the feature maps as in typical combined convolutional and recurrent architectures. Our best model achieved a preview Area Under ROC Curve (AUC) score of 88.70% on the unseen evaluation data in the second edition of the bird audio detection challenge. Further improvement with model adaptation led to an 89.58% AUC score.

Keywords

bird sound detection, deep learning, 3D CNN, GRU, biodiversity

Cites: 24 ( see at Google Scholar )

PDF
Abstract

Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag polyphonic audio recordings, we propose to use a Connectionist Temporal Classification (CTC) loss function on top of a Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLU-CTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, the CTC objective function maps the frame-level probability of labels to the clip-level probability of labels. To compare the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). We also compare the proposed GLU-CTC system with the baseline system, which is a CRNN trained using the CTC loss function without GLU. The experiments show that GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming GLU-GMP at 0.803, GLU-GAP at 0.766 and the baseline system at 0.837. This means that, based on the same CRNN model with GLU, CTC mapping performs better than GMP and GAP mapping, and that, given the same CTC mapping, the CRNN with GLU outperforms the CRNN without GLU.
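
To illustrate how a CTC objective maps frame-level posteriors to a clip-level label sequence, here is a minimal PyTorch sketch (tensor shapes, class count and the blank index are illustrative assumptions, not the paper's configuration):

import torch
import torch.nn as nn

num_classes = 10                        # assumed number of event classes
ctc = nn.CTCLoss(blank=num_classes)     # blank symbol appended after the classes

# frame_logits: (time, batch, num_classes + 1), e.g. the output of a CRNN with GLUs
frame_logits = torch.randn(240, 4, num_classes + 1)
log_probs = frame_logits.log_softmax(dim=-1)

# targets: clip-level label sequences (SLD-style), concatenated over the batch
targets = torch.tensor([1, 3, 0, 2, 5, 7, 4])
input_lengths = torch.full((4,), 240, dtype=torch.long)
target_lengths = torch.tensor([2, 1, 3, 1])          # sums to len(targets)

loss = ctc(log_probs, targets, input_lengths, target_lengths)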

Keywords

Audio tagging, Convolutional Recurrent Neural Network (CRNN), Gated Linear Units (GLU), Connectionist Temporal Classification (CTC), Sequentially Labelled Data (SLD)

Cites: 19 ( see at Google Scholar )

PDF
Abstract

As a means of searching for desired audio signals stored in a database, we consider using a string of an onomatopoeic word, namely a word that imitates a sound, as a query. This allows the user to specify the desired sound by verbally mimicking it, typing the sound word, or typing a word containing sounds similar to the desired sound. However, it is generally difficult to realize such a system based on text similarities between the onomatopoeic query and the onomatopoeic tags associated with each section of the audio signals in the database. In this paper, we propose a novel audio signal search method that uses a latent variable space obtained through a learning process. By employing an encoder-decoder onomatopoeia generation model and an encoder model for the onomatopoeias, both audio signals and onomatopoeias are mapped within the space, allowing us to directly measure the distance between them. Subjective tests show that the search results obtained with the proposed method correspond to the onomatopoeic queries reasonably well, and that the method has a generalization capability when searching. We also confirm that users preferred the audio signals obtained with this approach to those obtained with a text-based similarity search.

Keywords

audio signal search, onomatopoeia, latent variable, encoder-decoder model

Cites: 11 ( see at Google Scholar )

PDF
Abstract

General-purpose audio tagging refers to classifying sounds that are of a diverse nature, and is relevant in many applications where domain-specific information cannot be exploited. The DCASE 2018 challenge introduces Task 2 for this very problem. In this task, there are a large number of classes and the audio clips vary in duration. Moreover, a subset of the labels are noisy. In this paper, we propose a system to address these challenges. The basis of our system is an ensemble of convolutional neural networks trained on log-scaled mel spectrograms. We use preprocessing and data augmentation methods to improve the performance further. To reduce the effects of label noise, two techniques are proposed: loss function weighting and pseudo-labeling. Experiments on the private test set of this task show that our system achieves state-of-the-art performance with a mean average precision score of 0.951.
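
The two noise-handling ideas, loss function weighting and pseudo-labeling, can be sketched as follows (the weights and confidence threshold are illustrative assumptions, not the paper's settings):

import torch.nn.functional as F

def weighted_ce(logits, labels, verified_mask, w_verified=1.0, w_noisy=0.5):
    """Down-weight the loss of samples whose labels were not manually verified."""
    per_sample = F.cross_entropy(logits, labels, reduction='none')
    weights = verified_mask.float() * w_verified + (~verified_mask).float() * w_noisy
    return (weights * per_sample).mean()

def pseudo_label(logits, threshold=0.9):
    """Keep only confident model predictions as training labels for unverified clips."""
    conf, labels = logits.softmax(dim=-1).max(dim=-1)
    keep = conf >= threshold
    return labels[keep], keep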

Keywords

Audio classification, convolutional network, recurrent network, deep learning, data augmentation, label noise

Cites: 15 ( see at Google Scholar )

PDF
Abstract

In this paper, we describe the techniques and models applied in our submission for DCASE 2018 task 2: general-purpose audio tagging of Freesound content with AudioSet labels. We mainly focus on how to train deep learning models efficiently against strong augmentation and label noise. First, we employed a single-block DenseNet architecture and a multi-head softmax classifier for efficient learning with mixup augmentation. For the label noise, we applied batch-wise loss masking to eliminate the loss contribution of outliers in a mini-batch. We also tried an ensemble of various models trained using different sampling rates or audio representations.

Keywords

Audio tagging, DenseNet, Mixup, Multi-head softmax, Batch-wise loss masking

Cites: 12 ( see at Google Scholar )

PDF
Abstract

Various characteristics can be used to define an acoustic scene, such as long-term context information and short-term events. This makes it difficult to select input features and pre-processing methods suitable for acoustic scene classification. In this paper, we propose an ensemble model that exploits various input features whose strengths for classifying an acoustic scene differ: i-vectors are used for segment-level representations of long-term context, spectrograms are used for frame-level short-term events, and raw waveforms are used to extract features that could be missed by existing methods. For each feature, we used deep neural network based models to extract a representation from an input segment. A separate scoring phase was then used to extract class-wise scores on a scale of 0 to 1 that could serve as confidence measures. Scores were extracted using Gaussian models and support vector machines. We tested the validity of the proposed framework using task 1 of the Detection and Classification of Acoustic Scenes and Events 2018 dataset. The proposed framework achieved an accuracy of 73.82% for the pre-defined fold-1 validation setup and 74.8% for the evaluation setup, ranking 7th among teams.

Keywords

Acoustic scene classification, DNN, raw waveform, i-vector

Cites: 19 ( see at Google Scholar )

PDF
Abstract

Query-By-Vocal Imitation (QBV) search systems enable searching a collection of audio files using a vocal imitation as a query. This can be useful when sounds do not have commonly agreed-upon textlabels, or many sounds share a label. As deep learning approaches have been successfully applied to QBV systems, datasets to build models have become more important. We present Vocal Imitation Set, a new vocal imitation dataset containing 11,242 crowd-sourced vocal imitations of 302 sound event classes in the AudioSet sound event ontology. It is the largest publicly-available dataset of vocal imitations as well as the first to adopt the widely-used AudioSet ontology for a vocal imitation dataset. Each imitation recording in Vocal Imitation Set was rated by a human listener on how similar the imitation is to the recording it was an imitation of. Vocal Imitation Set also has an average of 10 different original recordings per sound class. Since each sound class has about 19 listener-vetted imitations and 10 original sound files, the data set is suited for training models to do fine-grained vocal imitation-based search within sound classes. We provide an example of using the dataset to measure how well the existing state-of-the-art in QBV search performs on fine-grained search.

Keywords

Vocal imitation datasets, audio retrieval, query-by-vocal imitation search

Cites: 26 ( see at Google Scholar )

PDF
Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge consists of five audio classification and sound event detection tasks: 1) Acoustic scene classification, 2) General-purpose audio tagging of Freesound, 3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event detection and 5) Multi-channel audio classification. In this paper, we create a cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system. We implemented CNNs with 4 layers and 8 layers originating from AlexNet and VGG from computer vision. We investigated how the performance varies from task to task with the same configuration of neural networks. Experiments show that the deeper CNN with 8 layers performs better than the CNN with 4 layers on all tasks except Task 1. Using the CNN with 8 layers, we achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code of the baseline systems under the MIT license for further research.

Keywords

DCASE 2018 challenge, convolutional neural networks, open source

Cites: 69 ( see at Google Scholar )

PDF
Abstract

In this paper, we present the approach used for the CP-JKU submission in Task 4 of the DCASE-2018 Challenge. We propose a novel iterative knowledge distillation technique for weakly-labeled semi-supervised event detection using neural networks, specifically Recurrent Convolutional Neural Networks (R-CNNs). R-CNNs are used to tag the unlabeled data and predict strong labels. Further, we use the R-CNN strong pseudo-labels on the training datasets and train new models after applying label-smoothing techniques to the strong pseudo-labels. Our proposed approach significantly improves on the baseline, achieving an event-based F-measure of 40.86% compared to the baseline's 15.11% on the provided test set from the development dataset.
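
A small sketch of smoothing strong pseudo-labels before retraining a student model (the smoothing factor and binarization threshold are assumptions; this is not the authors' exact distillation schedule):

import numpy as np

def smooth_pseudo_labels(frame_probs, eps=0.1, threshold=0.5):
    """Binarize teacher frame-level probabilities, then soften them.

    frame_probs: (time, classes) array of teacher R-CNN predictions.
    Returns soft targets in [eps/2, 1 - eps/2] for training the next model."""
    hard = (frame_probs >= threshold).astype(np.float32)
    return hard * (1.0 - eps) + eps / 2.0

teacher = np.random.rand(500, 10)         # hypothetical teacher output
targets = smooth_pseudo_labels(teacher)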

Keywords

Weakly-labeled, Semi-supervised, Knowledge Distillation, Recurrent Neural Network, Convolutional Neural Network

Cites: 22 ( see at Google Scholar )

PDF
Abstract

This paper presents deep learning techniques for acoustic bird detection. Deep Convolutional Neural Networks (DCNNs), originally designed for image classification, are adapted and fine-tuned to detect the presence of birds in audio recordings. Various data augmentation techniques are applied to increase model performance and improve generalization to unknown recording conditions and new habitats. The proposed approach is evaluated on the dataset of the Bird Audio Detection task which is part of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018. It surpasses the previous state-of-the-art, achieving an area under the curve (AUC) above 95% on the public challenge leaderboard.

Keywords

Bird Detection, Deep Learning, Deep Convolutional Neural Networks, Data Augmentation

Cites: 61 ( see at Google Scholar )

PDF
Abstract

In this paper, we describe our contribution to Task 2 of the DCASE 2018 Audio Challenge [1]. While it has become ubiquitous to utilize an ensemble of machine learning methods for classification tasks to obtain better predictive performance, the majority of ensemble methods combine predictions rather than learned features. We propose a single-model method that combines learned high-level features computed from log-scaled mel-spectrograms and raw audio data. These features are learned separately by two Convolutional Neural Networks, one for each input type, and then combined by densely connected layers within a single network. This relatively simple approach, along with data augmentation, ranks among the best two percent in the Freesound General-Purpose Audio Tagging Challenge on Kaggle.

Keywords

audio-tagging, convolutional neural network, raw audio, mel-spectrogram

Cites: 7 ( see at Google Scholar )

PDF
Abstract

In terms of vectoring disease, mosquitoes are the world's deadliest animals. A fast and efficient mosquito survey tool is crucial for vector-borne disease intervention programmes to reduce mosquito-induced deaths. Standard mosquito sampling techniques, such as human landing catches, are time consuming, expensive and can put the collectors at risk of disease. Mosquito acoustic detection aims to provide a cost-effective automated detection tool, based on mosquitoes' characteristic flight tones. We propose a simple, yet highly effective, classification pipeline based on the mel-frequency spectrum allied with convolutional neural networks. This detection pipeline is computationally efficient, not only in detecting mosquitoes but also in classifying species. Many previous assessments of mosquito acoustic detection techniques have relied only upon lab recordings of mosquito colonies. We illustrate in this paper our proposed algorithm's performance over an extensive dataset, consisting of cup recordings of more than 1000 individual mosquitoes from 6 species captured in field studies in Thailand.

Keywords

Mosquito detection, acoustic signal processing, multi-species classification, convolutional neural networks

Cites: 7 ( see at Google Scholar )

PDF
Abstract

This paper presents several feature extraction and normalization methods implemented for the DCASE 2018 Bird Audio Detection challenge, a binary audio classification task to identify whether a ten-second audio segment from a specified dataset contains one or more bird vocalizations. Our baseline system is adapted from the Convolutional Neural Network system of last year's challenge winner, bulbul [1]. We introduce one feature modification, an increase in the temporal resolution of the Mel-spectrogram feature matrix, tailored to the fast-changing temporal structure of many songbird vocalizations. Additionally, we introduce two feature normalization approaches: a front-end signal enhancement method to reduce differences in dataset noise characteristics, and an explicit domain adaptation method based on covariance normalization. Results show that none of these approaches gave a significant benefit individually, but that combining the methods led to an overall improvement. Despite the modest improvement, this system won the award for “Highest-scoring open-source/reproducible method” for this task.
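
The covariance-normalization idea can be sketched as a whitening transform estimated per dataset (a generic ZCA-style sketch in NumPy; the paper's exact normalization may differ):

import numpy as np

def whitening_transform(feats, eps=1e-5):
    """Estimate a ZCA-style whitening matrix from (frames, mel_bins) features."""
    mean = feats.mean(axis=0)
    cov = np.cov(feats - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mean, W

def apply_whitening(feats, mean, W):
    return (feats - mean) @ W

# Fit the transform on one dataset's features and apply it to both, so their
# second-order statistics are matched before classification.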

Keywords

audio classification, convolutional neural network, bioacoustic vocalization analysis, domain adaptation

Cites: 4 ( see at Google Scholar )

PDF
Abstract

In this paper, we present a method for large-scale detection of sound events using a small amount of weakly labeled data, as proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge Task 4. To perform this task, we adopted a convolutional neural network (CNN) combined with a gated recurrent unit (GRU) based bidirectional recurrent neural network (RNN) as our proposed system. In addition, we propose using Inception modules to handle various receptive fields at once in each CNN layer. We also applied a data augmentation method to address the labeled data shortage problem and applied an event activity detection method for strong-label learning. By applying the proposed method to weakly labeled semi-supervised sound event detection, we verified that the proposed system provides better performance compared to the DCASE 2018 baseline system.

Keywords

DCASE 2018, Sound event detection, Weakly labeled semi-supervised learning, Deep learning, Inception module

Cites: 16 ( see at Google Scholar )

PDF
Abstract

Convolutional neural networks (CNNs) have shown tremendous ability in many classification problems, because they can improve classification performance by extracting abstract features. In this paper, we use CNNs to compute features layer by layer. As the layers deepen, the extracted features become more abstract, but the shallow features are also very useful for classification. We therefore propose a method that fuses features from different layers (called multi-scale features), which can improve the performance of acoustic scene classification. In our method, the log-Mel features of the audio signal are used as the input to the CNNs. In order to reduce the number of parameters, we use Xception as the foundation network, which is a CNN with depthwise separable convolution operations (a depthwise convolution followed by a pointwise convolution), and we modify Xception to fuse multi-scale features. We also introduce the focal loss to further improve classification performance. This method achieves commendable results, whether the audio recordings are collected by the same device (subtask A) or by different devices (subtask B).
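
The focal loss mentioned above can be written as a modulated cross-entropy, as in the generic sketch below (the gamma and alpha values are illustrative, not the paper's settings):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=1.0):
    """Cross-entropy scaled by (1 - p_t)^gamma so easy examples contribute less."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction='none')   # -log p_t per sample
    pt = torch.exp(-ce)                                      # p_t
    return (alpha * (1.0 - pt) ** gamma * ce).mean()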

Keywords

Multi-scale features, acoustic scene classification, convolutional neural network, Xception, log Mel features

Cites: 48 ( see at Google Scholar )

PDF
Abstract

The paper presents a study of audio feature analysis for acoustic scene classification. Various feature sets and many classifiers were employed to build a system for scene classification by determining a compact feature space and using ensemble learning. The input feature space, containing different sets and representations, was reduced to 223 attributes using the importance of individual features computed by a gradient boosting trees algorithm. The resulting set of features was split into distinct groups partly reflecting auditory cues, and then their contribution to discriminative power was analysed. Also, to determine the influence of the pattern recognition system on the final efficacy, accuracy tests were performed using several classifiers. Finally, the conducted experiments show that the proposed solution with a dedicated feature set outperformed the baseline system by 6%.

Keywords

audio features, auditory scene analysis, ensemble learning, majority voting

Cites: 4 ( see at Google Scholar )

PDF
Abstract

Query by vocal imitation (QBV) systems let users search a library of general non-speech audio files using a vocal imitation of the desired sound as the query. The best existing system for QBV uses a similarity measure between vocal imitations and general audio files that is learned by a two-tower semi-Siamese deep neural network architecture. This approach typically uses pairwise training examples and error measurement. In this work, we show that this pairwise error signal does not correlate well with improved search rankings and instead describe how triplet loss can be used to train a two-tower network designed to work with pairwise loss, resulting in better correlation with search rankings. This approach can be used to train any two-tower architecture using triplet loss. Empirical results on a dataset of vocal imitations and general audio files show that low triplet loss is much better correlated with improved search ranking than low pairwise loss.
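
A two-tower model can be trained with a standard triplet objective as sketched below (the margin and the embedding variables are assumptions; the two towers stand in for the imitation and recording encoders described above):

import torch
import torch.nn.functional as F

def triplet_loss(imit_emb, pos_rec_emb, neg_rec_emb, margin=0.3):
    """Pull an imitation towards its matching recording, push it away from a mismatched one."""
    d_pos = F.pairwise_distance(imit_emb, pos_rec_emb)
    d_neg = F.pairwise_distance(imit_emb, neg_rec_emb)
    return F.relu(d_pos - d_neg + margin).mean()

# imit_emb comes from the vocal-imitation tower, the recording embeddings from the
# audio tower; both towers are trained jointly with this loss.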

Keywords

vocal imitation, information retrieval, convolutional Siamese-style networks, triplet loss

Cites: 2 ( see at Google Scholar )

PDF
Abstract

This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition. In the context of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events 2018, we trained several of these architectures on the task 1 dataset to perform acoustic scene classification. Then, in order to produce more robust predictions, we explored two ensemble methods to aggregate the different model outputs. Our results show a final accuracy of 79% on the development dataset for subtask A, outperforming the baseline by almost 20%.

Keywords

Acoustic scene classification, DCASE 2018, Vision, VGG, Residual networks, Ensemble

Cites: 25 ( see at Google Scholar )

PDF
Abstract

This paper introduces the acoustic scene classification task of the DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system on the task. As in previous years of the challenge, the task is defined as classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities; it therefore has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system, consisting of a convolutional neural network, and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data

Cites: 409 ( see at Google Scholar )

PDF
Abstract

We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are “weakly labelled”, having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose data-efficient training of a stacked convolutional and recurrent neural network. This neural network is trained in a multi-instance learning setting, for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on two low-resource datasets that lack temporal labels.

Keywords

Multi-instance learning, deep learning, weak labels, audio event detection

Cites: 20 ( see at Google Scholar )

PDF
Abstract

This paper proposes Convolutional Neural Network (CNN) ensembles for acoustic scene classification in tasks 1A and 1B of the DCASE 2018 challenge. We introduce a nearest neighbor filter applied to spectrograms, which allows us to emphasize and smooth similar patterns of sound events in a scene. We also propose a variety of CNN models for single-input (SI) and multi-input (MI) channels and three different methods for building a network ensemble. The experimental results show that for task 1A the combination of the MI-CNN structures using both log-mel features and their nearest neighbor filtering is slightly more effective than the single-input channel CNN models using log-mel features only. The opposite holds for task 1B. In addition, the ensemble methods improve the accuracy of the system significantly; the best ensemble method is ensemble selection, which achieves 69.3% for task 1A and 63.6% for task 1B. This improves on the baseline system by 8.9% and 14.4% for tasks 1A and 1B, respectively.
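
The nearest neighbor filtering of spectrogram frames can be illustrated with the plain NumPy sketch below (the number of neighbors and the cosine similarity measure are assumptions, not necessarily the paper's settings):

import numpy as np

def nn_filter(spec, k=5):
    """Replace each frame by the average of its k most similar frames.

    spec: (mel_bins, time) log-mel spectrogram."""
    norm = spec / (np.linalg.norm(spec, axis=0, keepdims=True) + 1e-8)
    sim = norm.T @ norm                       # (time, time) cosine similarity between frames
    np.fill_diagonal(sim, -np.inf)            # a frame is not its own neighbour
    idx = np.argsort(sim, axis=1)[:, -k:]     # k nearest frames per frame
    return np.stack([spec[:, nbrs].mean(axis=1) for nbrs in idx], axis=1)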

Keywords

DCASE 2018, acoustic scene classification, convolutional neural network, nearest neighbor filter

Cites: 59 ( see at Google Scholar )

PDF
Abstract

This paper describes an approach from our submissions for DCASE 2018 Task 2: general-purpose audio tagging of Freesound content with AudioSet labels. To tackle the problem of diverse recording environments, we propose to use background noise normalization. To tackle the problem of noisy labels, we propose to use pseudo-labels for automatic label verification and label smoothing to reduce over-fitting. We train several convolutional neural networks with data augmentation and different input sizes for the automatic label verification process. The label verification procedure is a promising way to improve the quality of datasets for audio classification. Our ensemble model ranked fifth on the private leaderboard of the competition with an mAP@3 score of 0.9496.

Keywords

Audio event tagging, Background noise normalization, Convolutional neural networks, DCASE 2018, Label smoothing, Pseudo-label

Cites: 6 ( see at Google Scholar )

PDF
Abstract

In this work, we aim to explore the potential of machine learning methods to the problem of beehive sound recognition. A major contribution of this work is the creation and release of annotations for a selection of beehive recordings. By experimenting with both support vector machines and convolutional neural networks, we explore important aspects to be considered in the development of beehive sound recognition systems using machine learning approaches.

Keywords

Computational bioacoustic scene analysis, ecoacoustics, beehive sound recognition

Cites: 46 ( see at Google Scholar )

PDF
Abstract

This work describes our solution for the general-purpose audio tagging task of the DCASE 2018 challenge. We propose an ensemble of several Convolutional Neural Networks (CNNs) with different properties. Logistic regression is used as a meta-classifier to produce the final predictions. Experiments demonstrate that the ensemble outperforms each CNN individually. Finally, the proposed system achieves a Mean Average Precision (MAP) score of 0.945 on the test set, which is a significant improvement compared to the baseline.
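
Stacking CNN outputs with a logistic regression meta-classifier can be sketched with scikit-learn as below (array names and shapes are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

# cnn_probs: list of (n_clips, n_classes) validation-set probability matrices,
# one per base CNN; y_val: ground-truth class indices for the same clips.
def fit_meta(cnn_probs, y_val):
    X = np.hstack(cnn_probs)                    # concatenate per-model class probabilities
    meta = LogisticRegression(max_iter=1000)
    return meta.fit(X, y_val)

def predict_meta(meta, cnn_probs_test):
    return meta.predict_proba(np.hstack(cnn_probs_test))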

Keywords

audio tagging, DCASE 2018, convolutional neural networks, ensembling

PDF
Abstract

We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data from three different devices (subtask B). The log mel spectrogram images of the audio waves are first forwarded to convolutional layers, and then fed into an attention pooling layer to reduce the feature dimension and achieve classification. From the attention perspective, we build a weighted evaluation of the features, instead of simple max pooling or average pooling. On the official development set of the challenge, the best accuracy for subtask A is 72.6%, an improvement of 12.9% over the official baseline (p < .001 in a one-tailed z-test). For subtask B, the best result of our attention-based CNN is a significant improvement over the baseline as well, with accuracies of 71.8%, 58.3%, and 58.3% for the three devices A to C (p < .001 for device A, p < .01 for device B, and p < .05 for device C).
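
An attention pooling layer that replaces simple max or average pooling can be sketched as follows (a generic formulation; feature dimension and class count are assumptions):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Weighted average over time frames with learned attention weights."""
    def __init__(self, feat_dim=128, n_classes=10):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)       # scalar relevance per frame
        self.cls = nn.Linear(feat_dim, n_classes)

    def forward(self, h):                        # h: (batch, time, feat_dim)
        w = torch.softmax(self.attn(h), dim=1)   # (batch, time, 1), sums to 1 over time
        pooled = (w * h).sum(dim=1)              # attention-weighted summary
        return self.cls(pooled)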

Keywords

Acoustic Scene Classification, Convolutional Neural Network, Attention Pooling, Log Mel Spectrogram

Cites: 77 ( see at Google Scholar )

PDF
Abstract

The successful application of modern deep neural networks is heavily reliant on the chosen architecture and the selection of appropriate hyperparameters. Due to the large number of parameters and the complex inner workings of a neural network, finding a suitable configuration for a given problem turns out to be a rather complex task for a human. In this paper, we propose an evolutionary approach to automatically generate a suitable neural network architecture and hyperparameters for any given classification problem. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. We take the DCASE 2018 Challenge as an opportunity to evaluate our algorithm on the task of acoustic scene classification. The best accuracy achieved by our approach was 74.7% on the development dataset.

Keywords

Evolutionary algorithm, genetic algorithm, convolutional neural networks, acoustic scene classification

Cites: 15 ( see at Google Scholar )

PDF
Abstract

This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries, given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from a domestic context, which has many applications such as ambient assisted living. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, etc.) and potential applications.

Keywords

Sound event detection, Large scale, Weakly labeled data, Semi-supervised learning

Cites: 158 ( see at Google Scholar )

PDF
Abstract

General-purpose audio tagging is a newly proposed task in DCASE 2018, which can provide insight towards broadly-applicable sound event classifiers. In this paper, two systems (named 1D-ConvNet and 2D-ConvNet) with small kernel sizes, multiple functional modules, and deeper CNNs (convolutional neural networks) are developed to improve performance in this task. In detail, different audio features are used: raw waveforms for the 1D-ConvNet, while frequency domain features, such as MFCCs, log-mel spectrograms, multi-resolution log-mel spectrograms and spectrograms, are utilized as the 2D-ConvNet input. Using the DCASE 2018 Challenge task 2 dataset for training and evaluation, the best single 1D-ConvNet and 2D-ConvNet models are chosen, whose Kaggle public leaderboard scores are 0.877 and 0.961, respectively. In addition, an ensemble rank-averaging prediction achieves a score of 0.967 on the public leaderboard, ranking 5/558, and 0.942 on the private leaderboard, ranking 11/558.

Keywords

DCASE 2018, Audio tagging, Convolutional neural networks, 1D-ConvNet, 2D-ConvNet, Model ensemble

Cites: 3 ( see at Google Scholar )

PDF
Abstract

Audio tagging has attracted increasing attention over the last decade and has various potential applications in many fields. The objective of audio tagging is to predict the labels of an audio clip. Recently, deep learning methods have been applied to audio tagging and have achieved state-of-the-art performance. However, due to the limited size of audio tagging data such as the DCASE data, the trained models tend to overfit, which results in poor generalization on new data. Previous data augmentation methods such as pitch shifting, time stretching and adding background noise do not show much improvement in audio tagging. In this paper, we explore sample-mixed data augmentation for the domestic audio tagging task, including mixup, SamplePairing and extrapolation. We use a convolutional recurrent neural network (CRNN) with an attention module, taking the log-scaled mel spectrum as input, as our baseline system. In our experiments, we achieve a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 task 4 dataset with the mixup approach, outperforming the baseline system without data augmentation.
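
The mixup variant of sample mixing can be sketched in a few lines (a generic formulation; the Beta parameter is an assumption, not necessarily the setting used in the paper):

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Convexly combine two training examples and their label vectors."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2          # e.g. log-mel spectrogram patches
    y = lam * y1 + (1.0 - lam) * y2          # multi-hot tag vectors
    return x, y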

Keywords

Audio tagging, data augmentation, sample mixed, convolutional recurrent neural network

Cites: 40 ( see at Google Scholar )

PDF
Abstract

This paper describes an audio tagging system that participated in Task 2 “General-purpose audio tagging of Freesound content with AudioSet labels” of the “Detection and Classification of Acoustic Scenes and Events (DCASE)” Challenge 2018. The system is an ensemble consisting of five convolutional neural networks based on Mel-frequency Cepstral Coefficients, Perceptual Linear Prediction features, Mel-spectrograms and the raw audio data. For ensembling all models, score-based fusion via Logistic Regression is performed with another neural network. In experimental evaluations, it is shown that ensembling the models significantly improves upon the performances obtained with the individual models. As a final result, the system achieved a Mean Average Precision with Cutoff 3 of 0.9414 on the private leaderboard of the challenge.

Keywords

audio tagging, acoustic event classification, convolutional neural network, score-based fusion

Cites: 3 ( see at Google Scholar )

PDF
Abstract

In this paper, we present a neural network system for DCASE 2018 task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes, on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate 128-dimensional features for each 0.960 s audio segment. We tried different neural network structures, including LSTM and multi-level attention models. In our experiments, the multi-level attention model showed its superiority over the others. Truncating silent parts, repeating and splitting the audio into fixed lengths, pitch-shifting augmentation, and mixup techniques are all used in our experiments. The proposed system achieved an MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and ranks in the top 8% of the public leaderboard.

Keywords

audio tagging, AudioSet, multi-level attention model

Cites: 2 ( see at Google Scholar )

PDF
Abstract

In this paper, we describe our solution for the general-purpose audio tagging task, one of the subtasks in the DCASE 2018 challenge. For this solution, we employed both deep learning methods and shallow learners based on statistical features. For single models, different deep convolutional neural network architectures are tested with different kinds of input, ranging from the raw signal and log-scaled Mel-spectrograms (log Mel) to Mel Frequency Cepstral Coefficients (MFCC). For log Mel and MFCC, the delta and delta-delta information are also used to formulate three-channel features, while mixup is used for data augmentation. Using ResNeXt, our best single convolutional neural network architecture provides an mAP@3 of 0.967 on the public Kaggle leaderboard and 0.939 on the private leaderboard. Moreover, to further improve accuracy, we also propose a meta-learning-based ensemble method. By exploiting the diversity between different architectures, the meta-learning-based model can provide higher prediction accuracy and robustness compared to a single model. Our solution achieves an mAP@3 of 0.977 on the public leaderboard and 0.951 at best on the private leaderboard, while the baseline gives an mAP@3 of 0.704.

Keywords

Audio tagging, convolutional neural networks, meta-learning, mixup

Cites: 4 ( see at Google Scholar )

PDF
Abstract

In this paper, we propose a multi-level attention model for the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of sound events in an audio clip. Recently, Google published the large-scale weakly labelled AudioSet dataset containing 2 million audio clips with only presence or absence labels for the sound events, without their onset and offset times. Previously proposed attention models only applied a single attention module on the last layer of a neural network, which limited the capacity of the attention model. In this paper, we propose a multi-level attention model which consists of multiple attention modules applied on the intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, followed by a fully connected layer to obtain the final prediction for each class. Experiments show that the proposed multi-level attention model achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model (0.327) and the Google baseline system (0.314).

Keywords

AudioSet, audio classification, attention model

Cites: 79 ( see at Google Scholar )

PDF
Abstract

In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described, and an analysis of different methods on the leaderboard set is provided. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is a common two-dimensional CNN, of the kind mainly used in image classification. The second is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so-called x-vectors, which have also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, the outputs of the different systems are fused using simple output averaging in the best performing system. Our submissions ranked third among 24 teams in the ASC sub-task A (task1a).

Keywords

Audio scene classification, Convolutional neural networks, Deep learning, x-vectors, Regularized LDA

Cites: 72 ( see at Google Scholar )

PDF