The proceedings of the DCASE2020 Workshop have been published as an electronic publication:
Nobutaka Ono, Noboru Harada, Yohei Kawaguchi, Annamaria Mesaros, Keisuke Imoto, Yuma Koizumi, and Tatsuya Komatsu (eds.), Proceedings of the 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2020), Nov. 2020.
ISBN (Electronic): 978-4-600-00566-5
Microphone Array Optimization for Autonomous-Vehicle Audio Localization Based on the Radon Transform
Ohad Barak1, Nizar Sallem1, and Marc Fischer1
1Siemens Digital Industries Software Corporation
Beamforming is a standard method of determining the Direction-of-Arrival (DoA) of wave energy to an array of receivers. In the case of acoustic waves in an air medium, the array would comprise microphones. The angular resolution of an array depends on the frequency of the data, the number of microphones, the size of the array relative to the wavelengths in the medium, and the geometry of the array, i.e., the positions of the microphones in relation to each other. The task of finding the right balance between the aforementioned parameters is microphone-array optimization. This task is rendered even more complicated in the particular context of sound classification and localization for self-driving cars as a result of the design limitations imposed by the automotive industry. We present a microphone-array optimization method suitable for designing arrays to be placed on vehicles, which applies beamforming using the Radon transform. We show how our method produces an array geometry with reasonable angular resolution for audio frequencies that are in the range of interest for a road scenario.
Multi-Task Regularization Based on Infrequent Classes for Audio Captioning
Emre Çakır1, Konstantinos Drossos1, and Tuomas Virtanen1
1Audio Research Group, Tampere University
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. “a”, “the”), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weight each word's contribution to the training loss in inverse proportion to its number of occurrences in the whole dataset. Secondly, in addition to the multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show a 37% relative improvement in the SPIDEr metric over the baseline method.
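The first idea in the abstract above, weighting each word's loss in inverse proportion to its count, can be illustrated with a minimal sketch. The function names, the toy captions, and the choice to scale weights so that the rarest word gets weight 1.0 are assumptions made for illustration only; the abstract does not specify the normalization the authors use.

```python
from collections import Counter


def inverse_frequency_weights(captions):
    """Map each word to a weight inversely proportional to its corpus count."""
    counts = Counter(word for caption in captions for word in caption.split())
    # Assumed normalization: the rarest word receives weight 1.0.
    min_count = min(counts.values())
    return {word: min_count / count for word, count in counts.items()}


def weighted_caption_loss(word_losses, words, weights):
    """Weighted average of the per-word losses of one caption."""
    total_weight = sum(weights[w] for w in words)
    weighted_sum = sum(weights[w] * loss for w, loss in zip(words, word_losses))
    return weighted_sum / total_weight


# Toy corpus (hypothetical): "a" is frequent, content words are rare.
captions = ["a dog barks", "a car passes", "a door slams"]
weights = inverse_frequency_weights(captions)

# Hypothetical per-word losses for the first caption.
loss = weighted_caption_loss([3.0, 1.0, 1.0], ["a", "dog", "barks"], weights)
```

With this weighting, a large loss on the function word "a" contributes less to the caption loss than the same loss on a content word would.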
Event-Independent Network for Polyphonic Sound Event Localization and Detection
Yin Cao1, Turab Iqbal1, Qiuqiang Kong2, Yue Zhong1, Wenwu Wang1, and Mark D. Plumbley1
1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, 2ByteDance Shanghai
Polyphonic sound event localization and detection involves not only detecting which sound events are happening but also localizing the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are then fed into a 1-D convolutional layer to extract acoustic features. The network is then split into two parallel branches. The first branch is for sound event detection (SED), and the second branch is for DoA estimation. There are three types of predictions from the network: SED predictions, DoA predictions, and event activity detection (EAD) predictions that are used to combine the SED and DoA features for onset and offset estimation. All of these predictions have the format of two tracks, indicating that there are at most two overlapping events. Within each track, there could be at most one event happening. This architecture introduces a problem of track permutation. To address this problem, a frame-level permutation invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly increased as compared with that of the baseline method.
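The track-permutation problem mentioned in the abstract above can be illustrated with a small sketch of frame-level permutation invariant training for two tracks. This is not the authors' implementation: the squared-error loss, the function names, and the nested-list data layout `[track][frame][class]` are assumptions chosen to keep the example self-contained.

```python
def frame_loss(pred, target):
    """Squared error between two equal-length vectors (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def frame_level_pit_loss(pred_tracks, target_tracks):
    """Average over frames of the lower loss among both track assignments.

    Both arguments are indexed as [track][frame][class] with two tracks.
    """
    n_frames = len(pred_tracks[0])
    total = 0.0
    for f in range(n_frames):
        p0, p1 = pred_tracks[0][f], pred_tracks[1][f]
        t0, t1 = target_tracks[0][f], target_tracks[1][f]
        straight = frame_loss(p0, t0) + frame_loss(p1, t1)
        swapped = frame_loss(p0, t1) + frame_loss(p1, t0)
        # The assignment is chosen independently for every frame.
        total += min(straight, swapped)
    return total / n_frames


# Hypothetical example: in the second frame the predicted tracks are swapped
# relative to the targets, which a permutation-invariant loss should ignore.
pred = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
target = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
pit = frame_level_pit_loss(pred, target)
```

Because the minimum is taken per frame, a prediction that places the right events on the "wrong" tracks is not penalized.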
SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context
Mark Cartwright1, Jason Cramer1, Ana Elisa Mendez Mendez1, Yu Wang1, Ho-Hsiang Wu1, Vincent Lostanlen1,2, Magdalena Fuentes1, Graham Dove1, Charlie Mydlarz1, Justin Salamon3, Oded Nov1, and Juan Pablo Bello1
1New York University, 2Cornell Lab of Ornithology, 3Adobe Research, San Francisco
We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed at the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2 consists of 18510 audio recordings from the “Sounds of New York City” (SONYC) acoustic sensor network, including the timestamp of audio acquisition (at the hour scale) and location of the sensor (at the urban block level). The dataset contains annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification with our team. In this article, we describe our data collection procedure and propose evaluation metrics for multilabel classification of urban sound tags. We report the results of a simple baseline model that exploits temporal information.
Audio Captioning Based on Transformer and Pre-Trained CNN
Kun Chen1, Yusong Wu1, Ziyue Wang2, Xuan Zhang2, Fudong Nian3, Shengchen Li1, and Xi Shao2
1Beijing University of Posts and Telecommunications, 2Nanjing University of Posts and Telecommunications, 3Anhui University
Automated audio captioning is the task of generating a text description of a piece of audio. This paper proposes a solution for automated audio captioning based on a combination of pre-trained CNN layers and a Transformer-based sequence-to-sequence architecture. The pre-trained CNN layers are adopted from a CNN-based neural network for acoustic event tagging, which makes the resulting latent variable more efficient for generating captions. A Transformer decoder is used in the sequence-to-sequence architecture following a comparison with the performance of the more classical LSTM layers. The proposed system achieves a SPIDEr score of 0.227 for the DCASE challenge 2020 Task 6 with data augmentation and label smoothing applied.
Domain-Adversarial Training and Trainable Parallel Front-End for the DCASE 2020 Task 4 Sound Event Detection Challenge
Samuele Cornell1, Michel Olvera2, Manuel Pariente2, Giovanni Pepe1, Emanuele Principi1, Leonardo Gabrielli1, and Stefano Squartini1
1Università Politecnica delle Marche, Dept. Information Engineering, 2INRIA Nancy Grand-Est, Dept. Information and Communication Sciences and Technologies
In this paper, we propose several methods for improving Sound Event Detection systems performance in the context of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Task 4 challenge. Our main contributions are in the training techniques, feature pre-processing and prediction post-processing. Given the mismatch between synthetic labelled data and target domain data, we exploit domain adversarial training to improve the network generalization. We show that such a technique is especially effective when coupled with dynamic mixing and data augmentation. Together with Hidden Markov Models prediction smoothing, by coupling the challenge baseline with the aforementioned techniques we are able to improve event-based macro F1 score by more than 10% on the development set, without computational overhead at inference time. Moreover, we propose a novel, effective Parallel Per-Channel Energy Normalization front-end layer and show that it brings an additional improvement of more than one percent with minimal computational overhead.
Task-Aware Separation for the DCASE 2020 Task 4 Sound Event Detection and Separation Challenge
Samuele Cornell1, Michel Olvera2, Manuel Pariente2, Giovanni Pepe1, Emanuele Principi1, Leonardo Gabrielli1, and Stefano Squartini1
1Università Politecnica delle Marche, Dept. Information Engineering, 2INRIA Nancy Grand-Est, Dept. Information and Communication Sciences and Technologies
Source Separation is often used as a pre-processing step in many signal-processing tasks. In this work we propose a novel approach for combined Source Separation and Sound Event Detection in which a Source Separation algorithm is used to enhance the Sound Event Detection back-end performance. In particular, we present a permutation-invariant training scheme for optimizing the Source Separation system directly with the back-end Sound Event Detection objective without requiring joint training or fine-tuning of the two systems. We show that such an approach has significant advantages over the more standard approach of training the Source Separation system separately using only a Source Separation based objective such as Scale-Invariant Signal-To-Distortion Ratio. On the 2020 Detection and Classification of Acoustic Scenes and Events Task 4 Challenge our proposed approach is able to outperform the baseline source separation system by more than one percent in event-based macro F1 score on the development set with significantly lower computational requirements.
A Multi-Resolution Approach to Sound Event Detection in DCASE 2020 Task4
Diego de Benito-Gorron1, Daniel Ramos1, and Doroteo T. Toledano1
1AUDIAS Research Group, Universidad Autónoma de Madrid
In this paper, we propose a multi-resolution analysis for feature extraction in Sound Event Detection. Because of the specific temporal and spectral characteristics of the different acoustic events, we hypothesize that different time-frequency resolutions can be more appropriate to locate each sound category. We carry out our experiments using the DESED dataset in the context of the DCASE 2020 Task 4 challenge, where the combination of up to five different time-frequency resolutions via model fusion is able to outperform the baseline results. In addition, we propose class-specific thresholds for the F1-score metric, further improving the results over the Validation and Public Evaluation sets.
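The class-specific thresholds mentioned in the abstract above can be illustrated with a short sketch that picks, for each class, the decision threshold maximizing F1 on held-out data. This is a simplification under stated assumptions: it uses a plain binary F1 over clips rather than the event-based F1 of the challenge, and the function names, candidate grid, class names, and scores are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from two equal-length sequences of truthy/falsy values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1 (first on ties)."""
    return max(
        candidates,
        key=lambda th: f1_score(labels, [s >= th for s in scores]),
    )


# Hypothetical validation scores and labels for two sound classes.
validation = {
    "speech": ([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
    "dog": ([0.1, 0.35, 0.2, 0.9], [0, 1, 0, 1]),
}
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# One threshold per class instead of a single global value.
thresholds = {
    name: best_threshold(scores, labels, candidates)
    for name, (scores, labels) in validation.items()
}
```

In this toy data the two classes end up with different thresholds, which a single global threshold could not provide.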
Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-Supervised Sound Event Detection
Janek Ebbers1 and Reinhold Haeb-Umbach1
In this paper we present our system for the detection and classification of acoustic scenes and events (DCASE) 2020 Challenge Task 4: Sound event detection and separation in domestic environments. We introduce two new models: the forward-backward convolutional recurrent neural network (FBCRNN) and the tag-conditioned convolutional neural network (CNN). The FBCRNN employs two recurrent neural network (RNN) classifiers sharing the same CNN for preprocessing. With one RNN processing a recording in forward direction and the other in backward direction, the two networks are trained to jointly predict audio tags, i.e., weak labels, at each time step within a recording, given that at each time step they have jointly processed the whole recording. The proposed training encourages the classifiers to tag events as soon as possible. Therefore, after training, the networks can be applied to shorter audio segments of, e.g., 200 ms, allowing sound event detection (SED). Further, we propose a tag-conditioned CNN to complement SED. It is trained to predict strong labels while using (predicted) tags, i.e., weak labels, as additional input. For training, pseudo strong labels from an FBCRNN ensemble are used. The presented system scored the fourth and third place in the systems and teams rankings, respectively. Subsequent improvements allow our system to even outperform the challenge baseline and winner systems on average by, respectively, 18.0% and 2.2% event-based F1-score on the validation set. Source code is publicly available at https://github.com/fgnt/pb_sed.
Self-Supervised Classification for Detecting Anomalous Sounds
Ritwik Giri1, Srikanth V. Tenneti1, Fangzhou Cheng1, Karim Helwani1, Umut Isik1, and Arvindh Krishnaswamy1
1Amazon Web Services
Representation learning using self-supervised classification has recently been shown to give state-of-the-art accuracies for anomaly detection on computer vision datasets. Geometric transformations on images such as rotations, translations and flipping have been used in these recent works to create auxiliary classification tasks for feature learning. This paper introduces a new self-supervised classification framework for anomaly detection in audio signals. Classification tasks are set up based on differences in the metadata associated with the audio files. Synthetic augmentations such as linearly combining and warping audio spectrograms are also used to increase the complexity of the classification task, to learn finer features. The proposed approach is validated using the publicly available DCASE 2020 challenge task 2: Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring dataset. We demonstrate the effectiveness of our approach by comparing against the baseline autoencoder model, showing an improvement of over 12.5% in the average AUC metrics.
Group Masked Autoencoder Based Density Estimator for Audio Anomaly Detection
Ritwik Giri1, Fangzhou Cheng1, Karim Helwani1, Srikanth V. Tenneti1, Umut Isik1, and Arvindh Krishnaswamy1
1Amazon Web Services
In this paper, we address the problem of detecting previously unseen anomalous audio events, when the training dataset itself does not contain any examples of anomalies. While traditional density estimation techniques, such as the Gaussian Mixture Model (GMM), showed promise in the past for the problem at hand, recent advances in neural density estimation techniques have made them suitable for the anomaly detection task. In this work, we develop a novel neural density estimation technique based on the Group-Masked Autoencoder that estimates the density of an audio time series by taking into account the intra-frame statistics of the signal. Our proposed approach has been validated using the DCASE 2020 challenge dataset (Task 2 - Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring). We demonstrate the effectiveness of our approach by comparing against the baseline autoencoder model, and also against the recently proposed Interpolating Deep Neural Network (IDNN) model.
Acoustic Scene Classification in DCASE 2020 Challenge: Generalization Across Devices and Low Complexity Solutions
Toni Heittola1, Annamaria Mesaros1, and Tuomas Virtanen1
1Computing Sciences, Tampere University
This paper presents the details of Task 1, Acoustic Scene Classification, in the DCASE 2020 Challenge. The task consisted of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Each subtask received around 90 submissions, and most of them outperformed the baseline system. The most used techniques among the submissions were data augmentation in Subtask A, to compensate for the device mismatch, and post-training quantization of neural network weights in Subtask B, to bring the model size under the required limit. The maximum classification accuracy on the evaluation set in Subtask A was 76.5%, compared to the baseline performance of 51.4%. In Subtask B, many systems are just below the size limit, and the maximum classification accuracy was 96.5%, compared to the baseline performance of 89.5%.
Guided Multi-Branch Learning Systems for Sound Event Detection with Sound Separation
Yuxin Huang1,2, Liwei Lin1,2, Shuo Ma1,2, Xiangdong Wang1, Hong Liu1, Yueliang Qian1, Min Liu3, and Kazushige Ouchi3
1Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3Toshiba China R&D Center
In this paper, we describe in detail our systems for DCASE 2020 Task 4. The systems are based on the 1st-place system of DCASE 2019 Task 4, which adopts a weakly-supervised framework with an attention-based embedding-level pooling module and a semi-supervised learning approach named guided learning. This year, we incorporate multi-branch learning (MBL) into the original system to further improve its performance. MBL uses different branches with different pooling strategies (including instance-level and embedding-level strategies) and different pooling modules (including attention pooling, global max pooling or global average pooling modules), which share the same feature encoder of the model. Therefore, multiple branches pursuing different purposes and focusing on different characteristics of the data can help the feature encoder model the feature space better and avoid over-fitting. To better exploit the strongly-labeled synthetic data, inspired by multi-task learning, we also employ a sound event detection branch. To combine sound separation (SS) with sound event detection (SED), we fuse the results of SED systems with SS-SED systems which are trained using separated sound output by an SS system. The experimental results prove that MBL can improve the model performance and that using SS has great potential to improve the performance of the SED ensemble system.
Detection of Anomalous Sounds for Machine Condition Monitoring using Classification Confidence
Tadanobu Inoue1, Phongtharin Vinayavekhin1, Shu Morikuni1, Shiqiang Wang1, Tuan Hoang Trong1, David Wood1, Michiaki Tatsubori1, and Ryuki Tachibana1
Anomaly-detection methods based on classification confidence are applied to the DCASE 2020 Task 2 Challenge on Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring. The final systems for submitting to the challenge are ensembles of two classification-based detectors. Both classifiers are trained with either known or generated properties of normal sounds as labels: one is a model to classify sounds into machine type and ID; the other is a model to classify transformed sounds into data-augmentation type. As for the latter model, the normal sound is augmented by using sound-transformation techniques such as pitch shifting, and data-augmentation type is used as a label. For both classifiers, classification confidence is used as the normality score for an input sample at runtime. An ensemble of these approaches is created by using probability aggregation of their anomaly scores. The experimental results on AUC show superior performance by each detector in relation to the baseline provided by the DCASE organizer. Moreover, the proposed ensemble of two detectors generally shows further improvement on the anomaly detection performance. The proposed anomaly-detection system was ranked fourth in the team ranking according to the metrics of the DCASE Challenge, and it achieves 90.93% in terms of average of AUC and pAUC scores for all the machine types, and that score is the highest of those scores achieved by all of the submitted systems.
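The use of classification confidence as a normality score, described in the abstract above, can be sketched as follows. This is an illustrative sketch only: the softmax-based confidence, the definition of the anomaly score as one minus the confidence in the expected class, and the averaging used to combine two detectors are assumptions, since the abstract does not spell out the exact scoring or aggregation rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits, expected_class):
    """One minus the classifier's confidence in the expected (known) class."""
    return 1.0 - softmax(logits)[expected_class]


def ensemble_score(scores):
    """Assumed aggregation: arithmetic mean of the detectors' scores."""
    return sum(scores) / len(scores)


# Hypothetical logits from two classifiers for one test clip whose known
# machine ID is class 0 and whose applied augmentation type is class 1.
id_logits = [4.0, 0.5, 0.2]
aug_logits = [0.3, 0.4, 0.2]

score = ensemble_score(
    [anomaly_score(id_logits, 0), anomaly_score(aug_logits, 1)]
)
```

A clip that the classifiers assign confidently to its known label gets a low score, while an uncertain or wrong classification pushes the score towards 1.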
ID-Conditioned Auto-Encoder for Unsupervised Anomaly Detection
1Samsung R&D Institute Poland
In this paper, we introduce the ID-Conditioned Auto-Encoder for unsupervised anomaly detection. Our method is an adaptation of the Class-Conditioned Auto-Encoder (C2AE) designed for open-set recognition. Assuming that non-anomalous samples consist of distinct IDs, we apply the Conditioned Auto-Encoder with labels provided by these IDs. As opposed to C2AE, our approach omits the classification subtask and reduces the learning process to a single run. We simplify the learning process further by fixing a constant vector as the target for non-matching labels. We apply our method in the context of sounds for machine condition monitoring. We evaluate our method on the ToyADMOS and MIMII datasets from the DCASE 2020 Challenge Task 2. We conduct an ablation study to indicate which steps of our method influence the results the most.
Audio Tag Representation Guided Dual Attention Network for Acoustic Scene Classification
Ju-Ho Kim1, Jee-Weon Jung1, Hye-Jin Shim1, and Ha-Jin Yu1
1University of Seoul, School of Computer Science
Sound events are crucial to discern a specific acoustic scene, which establishes a close relationship between audio tagging and acoustic scene classification (ASC). In this study, we explore the role and application of sound events based on the ASC task and propose the use of the last hidden layer's output of an audio tagging system (tag representation), rather than the output itself (tag vector), in ASC. We hypothesize that the tag representation contains sound event information that can improve the classification accuracy of acoustic scenes. The dual attention mechanism is investigated to adequately emphasize the frequency-time and channel dimensions of the feature map of an ASC system using tag representation. Experiments are conducted using the Detection and Classification of Acoustic Scenes and Events 2020 task1-a dataset. The proposed system demonstrates an overall classification accuracy of 69.3%, compared to 65.3% of the baseline.
Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Yuma Koizumi1, Yohei Kawaguchi2, Keisuke Imoto3, Toshiki Nakamura2, Yuki Nikaido2, Ryo Tanabe2, Harsh Purohit2, Kaori Suefusa2, Takashi Endo2, Masahiro Yasuda1, and Noboru Harada1
1NTT Corporation, 2Hitachi, Ltd., 3Doshisha University
In this paper, we present the task description and discuss the results of the DCASE 2020 Challenge Task 2: Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring. The goal of anomalous sound detection (ASD) is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data. We have designed this challenge as the first benchmark of ASD research, which includes a large-scale dataset, evaluation metrics, and a simple baseline system. We received 117 submissions from 40 teams, and several novel approaches have been developed as a result of this challenge. On the basis of the analysis of the evaluation results, we discuss two new approaches and their problems.
Low-Complexity Models for Acoustic Scene Classification Based on Receptive Field Regularization and Frequency Damping
Khaled Koutini1, Florian Henkel1, Hamid Eghbal-Zadeh1,2, and Gerhard Widmer1,2
1Institute of Computational Perception (CP-JKU), Johannes Kepler University Linz, 2LIT Artificial Intelligence Lab, Johannes Kepler University Linz
Deep Neural Networks are known to be very demanding in terms of computing and memory requirements. Due to the ever increasing use of embedded systems and mobile devices with a limited resource budget, designing low-complexity models without sacrificing too much of their predictive performance has gained great importance. In this work, we investigate and compare several well-known methods to reduce the number of parameters in neural networks. We further put these into the context of a recent study on the effect of the Receptive Field (RF) on a model's performance, and empirically show that we can achieve high-performing low-complexity models by applying specific restrictions on the RFs, in combination with parameter reduction methods. Additionally, we propose a filter-damping technique for regularizing the RF of models, without altering their architecture or changing their parameter counts. We show that incorporating this technique improves the performance in various low-complexity settings such as pruning and decomposed convolution. Using our proposed filter damping, we achieved the 1st rank at the DCASE-2020 Challenge in the task of Low-Complexity Acoustic Scene Classification.
Model Selection for Deep Audio Source Separation via Clustering Analysis
Alisa Liu1, Prem Seetharaman1, and Bryan Pardo1
1Northwestern University, Computer Science Department
Audio source separation is the process of separating a mixture into isolated sounds from individual sources. Deep learning models are the state-of-the-art in source separation, given that the mixture to be separated is similar to the mixtures the deep model was trained on. This requires the end user to know enough about each model's training to select the correct model for a given audio mixture. In this work, we propose a confidence measure that can be broadly applied to any clustering-based separation model. The proposed confidence measure does not require ground truth to estimate the quality of a separated source. We use our confidence measure to automate selection of the appropriate deep clustering model for an audio mixture. Results show that our confidence measure can reliably select the highest-performing model for an audio mixture without knowledge of the domain the audio mixture came from, enabling automatic selection of deep models.
A Speaker Recognition Approach to Anomaly Detection
Jose A. Lopez1, Hong Lu1, Paulo Lopez-Meyer1, Lama Nachman1, Georg Stemmer1, and Jonathan Huang2
1Intel Corp, Intel Labs, 2Work done at Intel
We present our submission to the DCASE 2020 Challenge Task 2, which aims to promote research in anomalous sound detection. We found that a speaker recognition approach enables the use of all the training data, even from different machine types, to detect anomalies in specific machines. Using this approach, we obtained good results for 5 out of 6 machines on the development data. We also discuss the modifications needed to surpass the baseline score for the remaining (ToyConveyor) machine which we found to be particularly difficult. On the challenge evaluation test data, our results were skewed by the system's uninspiring performance on the Toy machines. However, we placed 18th in the challenge due to our results on the industrial machine data where we reached the top 5 in team pAUC scores.
Conformer-Based Sound Event Detection with Semi-Supervised Learning and Data Augmentation
Koichi Miyazaki1, Tatsuya Komatsu2, Tomoki Hayashi1,3, Shinji Watanabe4, Tomoki Toda1, and Kazuya Takeda1
1Nagoya University, 2LINE Corporation, 3Human Dataware Lab. Co., Ltd., 4Johns Hopkins University
This paper presents a Conformer-based sound event detection (SED) method, which uses semi-supervised learning and data augmentation. The proposed method employs Conformer, a convolution-augmented Transformer that is able to exploit local features of audio data more effectively using CNNs, while global features are captured with Transformer. For SED, both global information on background sound and local information on foreground sound events are essential for modeling and identifying various types of sounds. Since Conformer can capture both global and local features using a single architecture, our proposed method is able to model various characteristics of sound events effectively. In addition to this novel architecture, we further improve performance by utilizing a semi-supervised learning technique, data augmentation, and post-processing optimized for each sound event class. We demonstrate the performance of our proposed method through experimental evaluation using the DCASE2020 Task4 dataset. Our experimental results show that the proposed method can achieve an event-based macro F1 score of 50.6% when using the validation set, significantly outperforming the baseline method score (34.8%). Our system achieved a score of 51.1% when using the DCASE2020 challenge’s evaluation set, the best results among the 72 submissions.
Embedded Acoustic Scene Classification for Low Power Microcontroller Devices
Filippo Naccari1, Ivana Guarneri1, Salvatore Curti1, and Alberto Amilcare Savi1
Automatic sound understanding tasks have been very popular within the research community in recent years. The success of deep learning data-driven applications in many signal understanding fields is now moving from centralized cloud services to the edge of the network, close to the nodes where raw data are generated from different types of sensors. In this paper we show a complete workflow for a context-awareness acoustic scene classification (ASC) application and its effective embedding process into an ultra-low power microcontroller (MCU). It can widen the capabilities of edge AI applications, from environmental and inertial sensors up to acoustic signals, which require more bandwidth and generate more data. In the paper the entire workflow of such development is described in terms of dataset collection, selection and annotations, acoustic features representation, neural net modeling and optimization, as well as the efficient embedding step of the whole application into the target low power 32-bit microcontroller device. Moreover, the overall accuracy of the proposed model and its capability to be executed in real time together with an audio feature extraction process show that this kind of audio understanding application can be efficiently deployed on power-constrained battery-operated devices.
Temporal Sub-Sampling of Audio Feature Sequences for Automated Audio Captioning
Khoa Nguyen1, Konstantinos Drossos2, and Tuomas Virtanen2
13D Media Group, Tampere University, 2Audio Research Group, Tampere University
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ the baseline method of the DCASE 2020 audio captioning task, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and comparing the performance of our method with the performance of the DCASE 2020 baseline method. Our results show an improvement in all considered metrics.
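The temporal sub-sampling idea in the abstract above can be sketched in a few lines. The sketch assumes that sub-sampling means keeping every `factor`-th time step, and it omits the recurrent layers entirely; the function name, the factor of 2, and the toy feature sequence are hypothetical.

```python
def temporal_subsample(sequence, factor):
    """Keep every `factor`-th time step of a feature sequence."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return sequence[::factor]


# Hypothetical input: 12 time steps of 2-dimensional feature vectors.
features = [[float(t), float(t) * 0.5] for t in range(12)]

# Sub-sampling by 2 after each of two (omitted) recurrent layers
# shortens the sequence by a factor of 4 overall.
after_first = temporal_subsample(features, 2)
after_second = temporal_subsample(after_first, 2)
```

Each application shortens the sequence that the following layer has to process, bringing its length closer to that of the caption.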
On the Effectiveness of Spatial and Multi-Channel Features for Multi-Channel Polyphonic Sound Event Detection
Thi Ngoc Tho Nguyen1, Douglas L. Jones2, and Woon Seng Gan1
1Nanyang Technological University, School of Electrical and Electronic Engineering, 2University of Illinois at Urbana-Champaign, Dept. of Electrical and Computer Engineering
Multi-channel log-mel spectrograms and spatial features such as generalized cross-correlation with phase transform have been demonstrated to be useful for multi-channel polyphonic sound event detection for static-source cases. The multi-channel log-mel spectrograms and spatial features are often stacked along the channel dimension, similar to RGB images, before being passed to a convolutional model to detect sound events better in multi-source cases. In this paper, we investigate the usage of multi-channel log-mel spectrograms and spatial features for polyphonic sound event detection in both static and dynamic-source cases using DCASE2019 and DCASE2020 sound event localization and detection datasets. Our experimental results show that multi-channel log-mel spectrogram and spatial features are more useful for static-source cases than for dynamic-source cases. The best use of multi-channel audio inputs for polyphonic sound event detection in both static and dynamic scenarios is to train a model that uses all the single-channel log-mel spectrograms separately as input features, with the final prediction during the inference stage obtained by taking the arithmetic mean of the model's output predictions over all the input channels.
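The inference-time averaging described above can be sketched as follows. The per-channel probabilities, the function name `average_channel_predictions`, and the 0.5 decision threshold are assumptions made for illustration only.

```python
def average_channel_predictions(per_channel_probs):
    """Arithmetic mean over channels of per-class probabilities."""
    n_channels = len(per_channel_probs)
    n_classes = len(per_channel_probs[0])
    return [
        sum(channel[c] for channel in per_channel_probs) / n_channels
        for c in range(n_classes)
    ]


# Hypothetical model outputs for one frame: 4 input channels, 3 event classes.
probs = [
    [0.9, 0.2, 0.1],
    [0.7, 0.4, 0.1],
    [0.8, 0.2, 0.3],
    [0.6, 0.4, 0.1],
]

mean_probs = average_channel_predictions(probs)

# Assumed threshold; the abstract does not specify how decisions are made.
active = [p >= 0.5 for p in mean_probs]
print(mean_probs, active)
```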
Ensemble of Sequence Matching Networks for Dynamic Sound Event Localization, Detection, and Tracking
Thi Ngoc Tho Nguyen1, Douglas L. Jones2, and Woon Seng Gan1
1Nanyang Technological University, School of Electrical and Electronic Engineering, 2University of Illinois at Urbana-Champaign, Dept. of Electrical and Computer Engineering
Sound event localization and detection consists of two subtasks which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Therefore, it is often difficult to jointly train two subtasks simultaneously. Our previous sequence matching approach solved sound event detection and direction-of-arrival separately and trained a convolutional recurrent neural network to associate the sound classes with the directions-of-arrival using onsets and offsets of the sound events. This approach achieved better performance than other state-of-the-art networks such as the SELDnet, and the two-stage networks for static sources. In order to estimate directions-of-arrival of moving sound sources with higher required spatial resolutions than those of static sources, we propose to separate the directional estimates into azimuth and elevation estimates before passing them to the sequence matching network. Experimental results on the new DCASE dataset for sound event localization, detection, and tracking of multiple moving sound sources show that the sequence matching network with separated azimuth and elevation inputs outperforms the sequence matching network with joint azimuth and elevation input. We combined several sequence matching networks with the new proposed directional inputs into an ensemble to boost the system performance. Our proposed ensemble achieves localization error of 9.3 degrees, localization recall of 90%, and ranked 2nd in the team category of the DCASE2020 sound event localization and detection challenge.
RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis
Yuki Okamoto1, Keisuke Imoto2,1, Shinnosuke Takamichi3, Ryosuke Yamanishi1,4, Takahiro Fukumori1, and Yoichi Yamashita1
1Ritsumeikan University, 2Doshisha University, 3The University of Tokyo, 4Kansai University
Environmental sound synthesis is a technique for generating a natural environmental sound. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, their pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for explaining the features of sounds. We believe that using onomatopoeic words enables us to control the fine time-frequency structure of synthesized sounds. However, there is no dataset available for environmental sound synthesis using onomatopoeic words. In this paper, we thus present RWCP-SSD-Onomatopoeia, a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis. We have also collected self-reported confidence scores and others-reported acceptance scores of onomatopoeic words, which help to select the words.
Ensemble of Pruned Low-Complexity Models for Acoustic Scene Classification
Kenneth Ooi1, Santi Peksi1, and Woon-Seng Gan1
1Nanyang Technological University, School of Electrical and Electronic Engineering
For the DCASE 2020 Challenge, the focus of Task 1B is to develop low-complexity models for classification of 3 different types of acoustic scenes, which have potential applications in resource-scarce edge devices deployed in a large-scale acoustic network. In this paper, we present the training methodology for our submissions for the challenge, with the best-performing system consisting of an ensemble of VGGNet- and InceptionNet-based lightweight classification models. The subsystems in the ensemble classifier were pruned by setting low-magnitude weights periodically to zero with a polynomial decay schedule to achieve an 80% reduction in individual subsystem size. The resultant ensemble classifier outperformed the baseline model on the validation set over 10 runs and had 119758 non-zero parameters taking up 468KB of memory. This shows the efficacy of the pruning technique used. We also performed experiments to compare the performance of various data augmentation schemes, input feature representations, and model architectures in our training methodology. No external data was used, and source code for the submission can be found at https://github.com/kenowr/DCASE-2020-Task-1B.
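A minimal sketch of magnitude pruning under a polynomial decay schedule, assuming the commonly used form in which the target sparsity moves from an initial to a final value as a power of the remaining training fraction. The cubic exponent, the step counts, and the toy weights are assumptions; only the 80% final sparsity comes from the abstract.

```python
def polynomial_sparsity(step, begin_step, end_step,
                        initial_sparsity=0.0, final_sparsity=0.8, power=3):
    """Target sparsity at `step` under a polynomial decay schedule."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


def prune_by_magnitude(weights, sparsity):
    """Set the `sparsity` fraction of weights with smallest magnitude to zero."""
    n_prune = int(round(len(weights) * sparsity))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Toy weight vector; a real model would be pruned layer by layer.
weights = [0.05, -0.9, 0.3, -0.01, 0.6, 0.02, -0.4, 0.1, 0.7, -0.2]

# Prune periodically (here at three hypothetical steps) during training.
for step in (0, 50, 100):
    weights = prune_by_magnitude(weights, polynomial_sparsity(step, 0, 100))

print(weights)
```

In practice the surviving weights would keep being trained between pruning steps, which this sketch omits.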
Lightweight Convolutional Neural Networks on Binaural Waveforms for Low Complexity Acoustic Scene Classification
Nicolas Pajusco1, Richard Huang1, and Nicolas Farrugia1
1IMT Atlantique, Lab-STICC, Department of Electronics
In this paper, we investigate the feasibility of training low complexity convolutional neural networks directly from waveforms. While the vast majority of proposed approaches perform fixed feature extraction based on time-frequency representations such as spectrograms, we propose to fully exploit the information in waveforms directly and to minimize the model size. To do so, we train one dimensional Convolutional Neural Networks (1D-CNN) on raw, subsampled binaural audio waveforms, thus exploiting phase information within and across the two input channels. In addition, our approach relies heavily on data augmentation in the temporal domain. Finally, we apply iterative structured parameter pruning to remove the least important convolutional kernels, and perform weight quantization in floating point half precision. We apply this approach on the TAU Urban Acoustic Scenes 2020 3class dataset, with two network architectures: a 1D-CNN based on VGG-like blocks, as well as a ResNet architecture with 1D convolutions, and compare our results with the baseline model from the DCASE 2020 challenge, task 1 subtask B. We report four models that constitute our submission to the DCASE 2020 challenge, task 1 subtask B. Our results show that we can train, prune and quantize a small VGG model to make it 20 times smaller than the 500 KB challenge limit with an accuracy at baseline level (87.6%), as well as a larger model achieving 91% accuracy while being 8 times smaller than the challenge limit. ResNets could be successfully trained, pruned and quantized in order to be below the 500 KB limit, achieving up to 91.2% accuracy. We also report the stability of these results according to data augmentation and monaural versus binaural inputs.
DCASE 2020 Task2: Anomalous Sound Detection using Relevant Spectral Feature and Focusing Techniques in the Unsupervised Learning Scenario
Jihwan Park1, and Sooyeon Yoo1
1Advanced Robot Research Laboratory, LG Electronics
In this paper, we propose an improved version of the anomalous sound detection (ASD) system for noisy and reverberant conditions, which was submitted to DCASE 2020 Challenge Task 2. The improved system consists of three phases: feature extraction, autoencoder (AE) model, and focusing techniques. In the feature extraction phase, we used spectrograms instead of log-mel energies for more effective distinction of normal and abnormal machine sounds, and validated this feature for the baseline autoencoder model and interpolation DNN (IDNN). We also applied the focusing techniques in both training and evaluation phases, which focus on machine-adaptive ranges of reconstruction errors for performance improvements. Through experiments, we found that our proposed ASD system outperforms baseline methods under the unsupervised learning scenario. The performance improvement was especially remarkable for non-stationary sounds; an AUC score above 95% was achieved for slider and valve sounds with the proposed system.
Anomalous Sound Detection using Unsupervised and Semi-Supervised Autoencoders and Gammatone Audio Representation
Sergi Perez-Castanos1, Javier Naranjo-Alcazar1, Pedro Zuccarello1, and Maximo Cobos2
1Visualfy, 2Universitat de València
Anomalous sound detection (ASD) is currently one of the topical subjects in the machine listening discipline. Unsupervised detection is attracting a lot of interest due to its immediate applicability in many fields. For example, in industrial processes, the early detection of malfunctions or damage in machines can mean great savings and an improvement in efficiency. This problem can be addressed with an unsupervised ASD solution, since industrial machines need not be damaged merely to obtain anomalous audio data for the training stage. This paper proposes a novel framework based on convolutional autoencoders (both unsupervised and semi-supervised) and a Gammatone-based representation of the audio. The results obtained by these architectures substantially exceed the results presented as a baseline.
Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation
Sergi Perez-Castanos, Javier Naranjo-Alcazar1, Pedro Zuccarello, and Maximo Cobos1
1Universitat de València
Automated audio captioning is a machine listening task whose goal is to describe an audio signal using free text. An automated audio captioning system accepts audio as input and outputs a textual description, that is, the caption of the signal. This task can be useful in many applications such as automatic content description or machine-to-machine interaction. In this work, an automatic audio captioning system based on residual learning in the encoder phase is proposed. The encoder phase is implemented via different Residual Network configurations. The decoder phase (which creates the caption) is run using recurrent layers plus an attention mechanism. The audio representation chosen is Gammatone. Results show that the framework proposed in this work surpasses the baseline system in the challenge results.
Papafil: A Low Complexity Sound Event Localization and Detection Method with Parametric Particle Filtering and Gradient Boosting
Andrés Pérez-López1,2, and Rafael Ibáñez-Usach3
1Music Technology Group, Universitat Pompeu Fabra, 2Eurecat, Centre Tecnòlogic de Catalunya, 3STRATIO
The present article describes the architecture of a system submitted to the DCASE 2020 Challenge - Task 3: Sound Event Localization and Detection. The proposed method constitutes a low complexity solution for the task. It is based on four building blocks: a spatial parametric analysis to find single-source spectrogram bins, a particle tracker to estimate trajectories and temporal activities, a spatial filter, and a single-class classifier implemented with a gradient boosting machine. Results from the development dataset show that the proposed method outperforms a deep learning baseline in three out of the four evaluation metrics considered in the challenge, and obtains an overall score almost ten points above the baseline.
On Multitask Loss Function for Audio Event Detection and Localization
Huy Phan1, Lam Pham2, Philipp Koch3, Ngoc Q. K. Duong4, Ian McLoughlin5, and Alfred Mertins3
1School of Electric Engineering and Computer Science, Queen Mary University of London, 2School of Computing, University of Kent, 3Institute for Signal Processing, University of Lübeck, 4InterDigital R&D France, 5Singapore Institute of Technology
Audio event localization and detection (SELD) has commonly been tackled using multitask models. Such a model usually consists of a multi-label event classification branch with sigmoid cross-entropy loss for event activity detection and a regression branch with mean squared error loss for direction-of-arrival estimation. In this work, we propose a multitask regression model, in which both (multi-label) event detection and localization are formulated as regression problems and use the mean squared error loss homogeneously for model training. We show that the common combination of heterogeneous loss functions causes the network to underfit the data whereas the homogeneous mean squared error loss leads to better convergence and performance. Experiments on the development and validation sets of the DCASE 2020 SELD task demonstrate that the proposed system also outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute.
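A minimal sketch of a homogeneous multitask loss, in which the event-activity branch and the direction-of-arrival branch are both scored with mean squared error. The flat vector layout, the equal branch weighting (`doa_weight=1.0`), and the toy values are assumptions; the abstract does not state how the two terms are combined.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def multitask_mse_loss(activity_pred, activity_true, doa_pred, doa_true,
                       doa_weight=1.0):
    """MSE on event activities plus weighted MSE on direction-of-arrival."""
    detection_loss = mean_squared_error(activity_pred, activity_true)
    localization_loss = mean_squared_error(doa_pred, doa_true)
    return detection_loss + doa_weight * localization_loss


# Toy frame with two event classes and one 3-D direction vector.
activity_pred, activity_true = [0.8, 0.1], [1.0, 0.0]
doa_pred, doa_true = [0.5, 0.5, 0.0], [1.0, 0.0, 0.0]

loss = multitask_mse_loss(activity_pred, activity_true, doa_pred, doa_true)
print(loss)
```

Treating the activity targets as regression targets in [0, 1] is what lets one loss type serve both branches.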
A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection
Archontis Politis1, Sharath Adavanne1, and Tuomas Virtanen1
1Audio and Speech Processing Research Group, Tampere University
This report details the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. Training and testing SELD systems requires datasets of diverse sound events occurring under realistic acoustic conditions. A significantly more complex dataset is created for DCASE 2020 compared to the previous challenge. The two key differences are a more diverse range of acoustical conditions, and dynamic conditions, i.e. moving sources. The spatial sound scene recordings for all conditions are generated using real room impulse responses, while ambient noise recorded on location is added to the spatialized sound events. Additionally, an improved version of the SELD baseline used in the previous challenge is included, providing benchmark scores for the task.
Anomalous Sound Detection as a Simple Binary Classification Problem with Careful Selection of Proxy Outlier Examples
Paul Primus1, Verena Haunschmid1, Patrick Praher3, and Gerhard Widmer1,2
1Institute of Computational Perception, Johannes Kepler University, 2LIT Artificial Intelligence Lab, Johannes Kepler University, 3Software Competence Center Hagenberg GmbH
Unsupervised anomalous sound detection is concerned with identifying sounds that deviate from what is defined as “normal”, without explicitly specifying the types of anomalies. A significant obstacle is the diversity and rareness of outliers, which typically prevent us from collecting a representative set of anomalous sounds. As a consequence, most anomaly detection methods use unsupervised rather than supervised machine learning methods. Nevertheless, we will show that anomalous sound detection can be effectively framed as a supervised classification problem if the set of anomalous samples is carefully substituted with what we call proxy outliers. Candidates for proxy outliers are available in abundance as they potentially include all recordings that are neither normal nor abnormal sounds. We experiment with the machine condition monitoring data set of the 2020 DCASE Challenge and find proxy outliers with matching recording conditions and high similarity to the target sounds particularly informative. If no data with similar sounds and matching recording conditions is available, data sets with a larger diversity in these two dimensions are preferable. Our models based on supervised training with proxy outliers achieved rank three in Task 2 of the DCASE2020 Challenge.
Deep Autoencoding GMM-Based Unsupervised Anomaly Detection in Acoustic Signals and its Hyper-Parameter Optimization
Harsh Purohit1, Ryo Tanabe1, Takashi Endo1, Kaori Suefusa1, Yuki Nikaido1, and Yohei Kawaguchi1
1Research and Development Group, Hitachi, Ltd.
Failures or breakdowns in factory machinery can cause a significant cost to companies. Therefore, there is an increasing demand for automatic machine inspection. In this work, our aim is to develop an acoustic signal based unsupervised anomaly detection method. Existing approaches such as deep autoencoder (DA) and Gaussian mixture model (GMM) have poor anomaly-detection performance. We propose a new method based on deep autoencoding Gaussian mixture model with hyper-parameter optimization (DAGMM-HO). The DAGMM-HO applies the conventional DAGMM to the audio domain for the first time, expecting that its total optimization on reduction of dimensions and statistical modelling improves anomaly-detection performance. In addition, the DAGMM-HO solves the hyper-parameter sensitivity problem of the conventional DAGMM by hyper-parameter optimization based on the gap statistic and the cumulative eigenvalues. We evaluated the proposed method with experimental data of the industrial fans. We found that it significantly outperforms previous approaches, and achieves up to a 20% improvement based on the standard AUC score.
Sound Event Localization and Detection Based on CRNN using Rectangular Filters and Channel Rotation Data Augmentation
Francesca Ronchini1, Daniel Arteaga1,2, and Andrés Pérez-López1,3
1Universitat Pompeu Fabra, 2Dolby Iberia, SL, 3Eurecat, Centre Tecnologic de Catalunya
Sound Event Localization and Detection refers to the problem of identifying the presence of independent or temporally-overlapped sound sources, correctly identifying the sound class to which each belongs, and estimating their spatial directions while they are active. In recent years, neural networks have become the prevailing method for the Sound Event Localization and Detection task, with convolutional recurrent neural networks being among the most used systems. This paper presents a system submitted to the Detection and Classification of Acoustic Scenes and Events 2020 Challenge Task 3. The algorithm consists of a convolutional recurrent neural network using rectangular filters, specialized in recognizing significant spectral features related to the task. In order to further improve the score and to generalize the system performance to unseen data, the training dataset size has been increased using data augmentation. The technique used for that is based on channel rotations and reflection on the xy plane in the First Order Ambisonic domain, which allows improving Direction of Arrival labels while keeping the physical relationships between channels. Evaluation results on the development dataset show that the proposed system outperforms the baseline results, considerably improving Error Rate and F-score for location-aware detection.
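One of the channel rotations mentioned above can be sketched for a single transformation, a 90-degree rotation about the vertical axis. The sketch assumes a plain first-order encoding with W = s, X = s·cos(az)·cos(el), Y = s·sin(az)·cos(el), Z = s·sin(el), and azimuth measured counter-clockwise; the helper names and the exact set of transformations used in the paper are not taken from the source.

```python
import math


def encode_foa(azimuth_deg, elevation_deg, signal):
    """Encode a mono signal into first-order channels (assumed convention)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    gains = {
        "W": 1.0,
        "X": math.cos(az) * math.cos(el),
        "Y": math.sin(az) * math.cos(el),
        "Z": math.sin(el),
    }
    return {name: [g * s for s in signal] for name, g in gains.items()}


def rotate_z_90(foa, azimuth_deg):
    """Rotate the sound field by +90 degrees in azimuth and update the label."""
    rotated = {
        "W": list(foa["W"]),
        "X": [-y for y in foa["Y"]],  # new X channel is the negated old Y
        "Y": list(foa["X"]),          # new Y channel is the old X
        "Z": list(foa["Z"]),          # elevation is unchanged
    }
    new_azimuth = (azimuth_deg + 90.0 + 180.0) % 360.0 - 180.0
    return rotated, new_azimuth


signal = [1.0, -0.5, 0.25]
original = encode_foa(30.0, 10.0, signal)
augmented, new_az = rotate_z_90(original, 30.0)
print(new_az)
```

Because the transformation only swaps and negates channels, the augmented recording stays physically consistent with its updated label.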
Open-Window: A Sound Event Dataset for Window State Detection and Recognition
Saeid Safavi1, Turab Iqbal1, Wenwu Wang1, Philip Coleman2, and Mark D. Plumbley1
1Centre for Vision, Speech and Signal Processing, University of Surrey, 2The Institute of Sound Recording, University of Surrey
Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations and four different window states: two stationary states (open and closed) and two transitional states (open to closed and closed to open). We have then built our machine recognition baselines for different scenarios (open set versus closed set) using a deep learning framework. A human listening test is also performed to compare human and machine performance in detecting the window state using only acoustic cues. Our experimental results reveal that when using a simple machine baseline system, humans and machines achieve similar average performance for closed set experiments.
Effects of Word-Frequency Based Pre- and Post- Processings for Audio Captioning
Daiki Takeuchi1, Yuma Koizumi1, Yasunori Ohishi1, Noboru Harada1, and Kunio Kashino1
This paper proposes the use of three elements, namely, data augmentation, multi-task learning, and post-processing, in combination for audio captioning and clarifies their individual effectiveness. The system was used for our submission to Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge and obtained the highest evaluation scores, but it had not yet been clarified which of those elements contributed to its performance. Therefore, we first conduct an element-wise ablation study on our system in order to estimate to what extent each element is effective. We then conduct a detailed module-wise ablation study to further clarify the key processing modules for improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively.
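A minimal sketch of generic mix-up augmentation, in which two training examples are blended with a coefficient drawn from a Beta distribution. How the authors apply mix-up to caption targets is not stated in the abstract, so the one-hot label vectors, the `alpha=0.2` value, and the function name `mixup` are illustrative assumptions.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, label vector) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y, lam


# Two toy examples: 4-dimensional features with one-hot labels over 3 classes.
x_a, y_a = [0.2, 0.4, 0.6, 0.8], [1.0, 0.0, 0.0]
x_b, y_b = [1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mixed_x, mixed_y, lam = mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=rng)
print(lam, mixed_x, mixed_y)
```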
Evaluation Metric of Sound Event Detection Considering Severe Misdetection by Scenes
Noriyuki Tonami1, Keisuke Imoto2, Takahiro Fukumori1, and Yoichi Yamashita1
1Ritsumeikan University, 2Doshisha University
In this paper, we propose a new evaluation metric for sound event detection (SED) and discuss a problem frequently encountered in conventional metrics. In conventional evaluation metrics, misdetected sound events are treated equally, e.g., the misdetected sound event “car” in the acoustic scenes “office” and “street” are regarded as the same type of misdetection. However, the misdetected event “car” in “office” is a severe mistake compared with its misdetection in “street.” The event “car” rarely occurs in the “office.” SED systems that are evaluated using conventional metrics may cause severe/catastrophic problems and lead to confusion in practice owing to lack of consideration of the relationship between sound events and scenes. Our evaluation metric for SED considers severe misdetections on the basis of the relationship between sound events and scenes. We demonstrate the utility of our proposed method by comparing it with the conventional evaluation metrics on two datasets with events and scenes. Experimental results show that the proposed metric can accurately evaluate whether SED systems appropriately consider the relationship between sound events and scenes, i.e., realistic situations.
Training Sound Event Detection on a Heterogeneous Dataset
Nicolas Turpault1, and Romain Serizel1
1Université de Lorraine, CNRS, Inria, Loria
Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default are shown to be sub-optimal.
Improving Sound Event Detection in Domestic Environments using Sound Separation
Nicolas Turpault1, Scott Wisdom2, Hakan Erdogan2, John R. Hershey2, Romain Serizel1, Eduardo Fonseca3, Prem Seetharaman4, and Justin Salamon5
1Université de Lorraine, CNRS, Inria, Loria, 2Google Research, AI Perception, 3Music Technology Group, Universitat Pompeu Fabra, 4Interactive Audio Lab, Northwestern University, 5Adobe Research
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods to combine separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the sound separation model to the sound event detection data on both the sound separation and the sound event detection.
Acoustic Scene Classification with Spectrogram Processing Strategies
Helin Wang1, Yuexian Zou1,2, and DaDing Chong1
1ADSPLAB, School of ECE, Peking University, 2Peng Cheng Laboratory
Recently, convolutional neural networks (CNN) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first contribution is exploring the impact of the combination of multiple spectrogram representations at different stages, which provides a meaningful reference for effective spectrogram fusion. The second contribution is that processing strategies in multiple frequency bands and multiple temporal frames are proposed to make full use of a single spectrogram representation. The proposed spectrogram processing strategies can be easily transferred to any network structure. The experiments are carried out on the DCASE 2020 Task1 datasets, and the results show that our method achieves an accuracy of 81.8% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation dataset of Task1A and Task1B, respectively.
Using Look, Listen, and Learn Embeddings for Detecting Anomalous Sounds in Machine Condition Monitoring
1Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE
The goal of anomalous sound detection is to unsupervisedly train a system to distinguish normal from anomalous sounds that substantially differ from the normal sounds used for training. In this paper, a system based on Look, Listen, and Learn embeddings, which participated in task 2 “Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring” of the DCASE challenge 2020 and is adapted from an open-set machine listening system, is presented. The experimental results show that the presented system significantly outperforms the baseline system of the challenge both in detecting outliers and in recognizing the correct machine type or exact machine id. Moreover, it is shown that an ensemble consisting of the presented system and the baseline system performs even better than both of its components.
Searching for Efficient Network Architectures for Acoustic Scene Classification
Yuzhong Wu1, and Tan Lee1
1The Chinese University of Hong Kong, Department of Electronic Engineering
Acoustic scene classification (ASC) is the task of classifying recorded audio signal into one of the predefined acoustic environment classes. While previous studies reported ASC systems with high accuracy, the computation cost and system complexity may not be optimal for practical mobile applications. Inspired by the success of neural architecture search (NAS) and the efficacy of MobileNets in vision applications, we propose a simple yet effective random search policy to obtain high accuracy ASC models under strict model size constraint. The search policy allows automatic discovery of the best trade-off between model depth and width, and statistical analysis of model design can be carried out using the evaluation results of randomly sampled architectures. To enable fast search, the search space is limited to several predefined efficient convolutional modules based on depth-wise convolution and swish activation function. Experimental results show that the CNN model found by this search policy gives higher accuracy compared to an AlexNet-like CNN benchmark.
A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning
Xuenan Xu1, Heinrich Dinkel1, Mengyue Wu1, and Kai Yu1
1MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University
Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).
Two-Stage Domain Adaptation for Sound Event Detection
Liping Yang1, Junyong Hao1, Zhenwei Hou1, and Wang Peng1
1Key Laboratory of Optoelectronic Technology and Systems, MOE, Chongqing University
Sound event detection under real scenarios is a challenging task. Due to the great distribution mismatch between synthetic and real audio data, the performance of a sound event detection model trained on strong-labeled synthetic data degrades dramatically when it is applied in a real environment. To tackle this issue and improve the robustness of the sound event detection model, we propose a two-stage domain adaptation sound event detection approach in this paper. The backbone convolutional recurrent neural network (CRNN) learned using strong-labeled synthetic data is updated by weak-label supervised adaptation and frame-level adversarial domain adaptation. As a result, the parameters of the CRNN are renewed for real audio data, and the input space distribution mismatch between synthetic and real audio data is mitigated in the feature space of the CRNN. Moreover, a context clip-level consistency regularization between the classification outputs of the CNN and CRNN is introduced to improve the feature representation ability of the convolutional layers in the CRNN. Experiments on the DCASE 2019 sound event detection in domestic environments task demonstrate the superiority of our proposed domain adaptation approach. Our approach achieves F1 scores of 48.3% on the validation set and 49.4% on the evaluation set, which are state-of-the-art sound event detection performances for a CRNN model without data augmentation.
Joint Training of Guided Learning and Mean Teacher Models for Sound Event Detection
Hao Yen1,2, Pin-Jui Ku1,2, Ming-Chi Yen1, Hung-Shin Lee1,2, and Hsin-Min Wang1
1Institute of Information Science, Academia Sinica, 2Department of Electrical Engineering, National Taiwan University
In this paper, we present our system for sound event detection and separation in domestic environments for DCASE 2020. The task aims to determine which sound events appear in a clip and the detailed temporal ranges they occupy. The system is trained using weakly labeled and unlabeled real data, together with synthetic data with strongly annotated labels. Our proposed model structure includes a feature-level front-end based on convolutional neural networks (CNN), followed by both embedding-level and instance-level back-end attention modules. In order to make full use of the large amount of unlabeled data, we jointly adopt the Guided Learning and Mean Teacher approaches to carry out weakly supervised learning and semi-supervised learning. In addition, a set of adaptive median windows for individual sound events is used to smooth the frame-level predictions in post-processing. On the public evaluation set of DCASE 2019, the best event-based F1-score achieved by our system is 48.50%, which is a relative improvement of 27.16% over the official baseline (38.14%). In addition, on the development set of DCASE 2020, our best system also achieves a relative improvement of 32.91% over the baseline (45.68% vs. 34.37%).
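The post-processing step mentioned in this abstract, median filtering with a window length chosen per sound event, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the class labels, and the window lengths below are all hypothetical, and the abstract does not state how the adaptive windows are derived. The sketch only shows how a longer window suppresses short gaps in long events while a shorter window preserves brief events.

```python
from statistics import median


def median_smooth(frames, window):
    """Sliding median filter over a sequence of binary frame decisions.

    `window` must be odd; edges are padded by repeating the first and
    last frame so the output has the same length as the input.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    if not frames:
        return []
    half = window // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(frames))]


def smooth_per_class(predictions, windows):
    """Apply a class-specific median window to each event's predictions."""
    return {
        label: median_smooth(frames, windows[label])
        for label, frames in predictions.items()
    }


# Hypothetical window lengths (in frames), one per event class.
WINDOWS = {"speech": 3, "vacuum_cleaner": 7}

# Hypothetical thresholded frame-level predictions (1 = event active).
predictions = {
    "speech": [0, 1, 0, 0, 1, 1, 1, 0],
    "vacuum_cleaner": [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
}

smoothed = smooth_per_class(predictions, WINDOWS)
print(smoothed["speech"])          # isolated single-frame detection removed
print(smoothed["vacuum_cleaner"])  # single-frame gaps filled
```

A longer window suits events that typically last many frames, since it bridges brief dropouts; a short window avoids erasing events that are themselves only a few frames long.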
DCASE-Models: A Python Library for Computational Environmental Sound Analysis using Deep-Learning Models
Pablo Zinemanas1, Ignacio Hounie2, Pablo Cancela2, Frederic Font1, Martín Rocamora2, and Xavier Serra1
1Music Technology Group, Universitat Pompeu Fabra, 2Facultad de Ingeniería, Universidad de la República
This document presents DCASE-models, an open-source Python library for rapid prototyping of environmental sound analysis systems, with an emphasis on deep-learning models. Together with a collection of functions for dataset handling, data preparation, feature extraction, and evaluation, it includes a model interface to standardize the interaction of machine learning methods with the other system components. This interface also provides an abstraction layer that allows the use of different machine learning backends. The package includes Python scripts, Jupyter Notebooks, and a web application to illustrate its usefulness. The library seeks to ease the process of releasing and maintaining the code of new models, improve research reproducibility, and simplify the comparison of methods. We expect it to become a valuable resource for the community.
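The idea of a model interface that decouples machine learning backends from the rest of the system can be sketched generically as follows. This is not the DCASE-models API: `ModelInterface`, `MajorityClassModel`, and `accuracy` are hypothetical names invented for illustration, and the toy backend stands in for a real deep-learning model. The point is only that evaluation code written against the interface does not need to know which backend sits behind it.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ModelInterface(ABC):
    """Hypothetical backend-agnostic interface (not the library's API)."""

    @abstractmethod
    def fit(self, features, labels):
        """Train the model on feature vectors and their labels."""

    @abstractmethod
    def predict(self, features):
        """Return one predicted label per feature vector."""


class MajorityClassModel(ModelInterface):
    """Toy backend: always predicts the most frequent training label."""

    def __init__(self):
        self.majority = None

    def fit(self, features, labels):
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, features):
        return [self.majority for _ in features]


def accuracy(model, features, labels):
    """Evaluation code that depends only on the interface."""
    predictions = model.predict(features)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


model = MajorityClassModel()
model.fit([[0.1], [0.2], [0.3]], ["dog", "dog", "siren"])
score = accuracy(model, [[0.4], [0.5]], ["dog", "siren"])
print(score)
```

Swapping in another backend would mean writing a new subclass with the same two methods; `accuracy` and any other component written against the interface would stay unchanged.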