Urban Sound Tagging


Challenge results

Task description

The goal of urban sound tagging (UST) is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene. These sources of noise are also grouped into 8 coarse-level categories. All of the recordings are from an urban acoustic sensor network in New York City. The training set was annotated by volunteers on the Zooniverse citizen-science platform, and the validation and test sets were annotated by the task organizers.

A more detailed task description can be found on the task description page.

Note that only teams that have open-sourced their systems are included in the system and team rankings.
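
The rankings below report Macro-AUPRC, Micro-F1, and Micro-AUPRC. As a rough illustration of how these multilabel metrics differ, here is a minimal sketch using scikit-learn; note that scikit-learn's average precision is only an approximation of the AUPRC computed by the official evaluation code, and all variable names are placeholders.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# y_true, y_score: (n_clips, n_classes) arrays of binary labels and
# predicted probabilities for one taxonomy level (coarse or fine).

def micro_auprc(y_true, y_score):
    # Micro-averaging pools every (clip, class) decision into a single curve.
    return average_precision_score(y_true.ravel(), y_score.ravel())

def macro_auprc(y_true, y_score):
    # Macro-averaging computes AUPRC per class, then averages the classes.
    per_class = [average_precision_score(y_true[:, k], y_score[:, k])
                 for k in range(y_true.shape[1])]
    return float(np.mean(per_class))

def micro_f1(y_true, y_score, threshold=0.5):
    # Micro-F1 binarizes the scores at a fixed threshold first.
    y_pred = (y_score >= threshold).astype(int)
    return f1_score(y_true.ravel(), y_pred.ravel())
```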

Coarse-level prediction

System ranking

These results include only systems for which the source code has been released.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.718 0.631 0.860
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.723 0.745 0.847
Bai_NPU_task5_1 multifeat1 Bai2019 0.618 0.696 0.763
Bai_NPU_task5_2 multifeat2 Bai2019 0.649 0.701 0.769
Bai_NPU_task5_3 multifeat3 Bai2019 0.558 0.631 0.680
Bai_NPU_task5_4 multifeat4 Bai2019 0.647 0.709 0.782
DCASE2019 baseline Baseline Cartwright2019 0.619 0.664 0.742
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.674 0.525 0.807
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.650 0.667 0.745
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.612 0.639 0.748
Kim_NU_task5_1 BK_CNN1 Kim2019 0.653 0.686 0.761
Kim_NU_task5_2 BK_CNN2 Kim2019 0.696 0.734 0.825
Kim_NU_task5_3 BK_CNN3 Kim2019 0.697 0.730 0.809
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.567 0.467 0.674
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.613 0.516 0.777
Ng_NTU_task5_1 Ng_1 Ng2019 0.657 0.670 0.759
Ng_NTU_task5_2 Ng_2 Ng2019 0.666 0.677 0.767
Ng_NTU_task5_3 Ng_3 Ng2019 0.660 0.671 0.762
Ng_NTU_task5_4 Ng_4 Ng2019 0.660 0.666 0.770
Orga_URL_task5_1 AugNet Orga2019 0.501 0.557 0.562
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.646 0.631 0.779
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.666 0.552 0.788
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.646 0.631 0.779

Team ranking

This table includes only the best-performing reproducible system per submitting team.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.718 0.631 0.860
Bai_NPU_task5_4 multifeat4 Bai2019 0.647 0.709 0.782
DCASE2019 baseline Baseline Cartwright2019 0.619 0.664 0.742
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.674 0.525 0.807
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.612 0.639 0.748
Kim_NU_task5_2 BK_CNN2 Kim2019 0.696 0.734 0.825
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.613 0.516 0.777
Ng_NTU_task5_4 Ng_4 Ng2019 0.660 0.666 0.770
Orga_URL_task5_1 AugNet Orga2019 0.501 0.557 0.562
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.666 0.552 0.788

Class-wise performance

Submission code | Submission name | Technical report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.860 0.888 0.627 0.361 0.684 0.897 0.404 0.947 0.937
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.847 0.878 0.578 0.344 0.643 0.875 0.586 0.949 0.931
Bai_NPU_task5_1 multifeat1 Bai2019 0.763 0.787 0.632 0.287 0.578 0.732 0.105 0.907 0.918
Bai_NPU_task5_2 multifeat2 Bai2019 0.769 0.792 0.602 0.363 0.658 0.804 0.171 0.909 0.896
Bai_NPU_task5_3 multifeat3 Bai2019 0.680 0.792 0.111 0.071 0.658 0.771 0.225 0.922 0.911
Bai_NPU_task5_4 multifeat4 Bai2019 0.782 0.809 0.637 0.347 0.628 0.781 0.151 0.916 0.912
DCASE2019 baseline Baseline Cartwright2019 0.742 0.832 0.454 0.170 0.709 0.727 0.246 0.886 0.929
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.807 0.859 0.598 0.405 0.739 0.773 0.268 0.883 0.863
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.745 0.793 0.598 0.282 0.703 0.802 0.218 0.867 0.934
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.748 0.813 0.403 0.253 0.698 0.778 0.166 0.871 0.916
Kim_NU_task5_1 BK_CNN1 Kim2019 0.761 0.863 0.548 0.202 0.717 0.791 0.276 0.910 0.918
Kim_NU_task5_2 BK_CNN2 Kim2019 0.825 0.849 0.643 0.308 0.686 0.850 0.358 0.944 0.931
Kim_NU_task5_3 BK_CNN3 Kim2019 0.809 0.831 0.650 0.290 0.674 0.856 0.402 0.934 0.941
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.674 0.786 0.455 0.272 0.640 0.552 0.181 0.765 0.883
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.777 0.824 0.526 0.172 0.722 0.785 0.057 0.893 0.926
Liu_CU_task5_1 Liu_CU_1 Liu2019 0.700 0.746 0.528 0.318 0.742 0.826 0.000 0.772 0.898
Ng_NTU_task5_1 Ng_1 Ng2019 0.759 0.832 0.525 0.268 0.693 0.786 0.403 0.874 0.877
Ng_NTU_task5_2 Ng_2 Ng2019 0.767 0.843 0.535 0.249 0.739 0.760 0.425 0.882 0.895
Ng_NTU_task5_3 Ng_3 Ng2019 0.762 0.852 0.529 0.197 0.767 0.775 0.399 0.875 0.888
Ng_NTU_task5_4 Ng_4 Ng2019 0.770 0.854 0.545 0.209 0.749 0.764 0.412 0.878 0.867
Orga_URL_task5_1 AugNet Orga2019 0.562 0.653 0.411 0.131 0.704 0.544 0.223 0.672 0.668
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.779 0.844 0.519 0.227 0.730 0.770 0.316 0.883 0.877
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.788 0.855 0.538 0.188 0.744 0.812 0.418 0.886 0.886
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.779 0.844 0.519 0.227 0.730 0.770 0.316 0.883 0.877

Fine-level prediction

System ranking

These results include only systems for which the source code has been released.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.645 0.484 0.751
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.622 0.575 0.721
Bai_NPU_task5_1 multifeat1 Bai2019 0.534 0.514 0.572
Bai_NPU_task5_2 multifeat2 Bai2019 0.523 0.594 0.615
Bai_NPU_task5_3 multifeat3 Bai2019 0.553 0.600 0.639
Bai_NPU_task5_4 multifeat4 Bai2019 0.554 0.571 0.623
DCASE2019 baseline Baseline Cartwright2019 0.531 0.450 0.619
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.552 0.286 0.637
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.000 0.000 0.000
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.500 0.560 0.621
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000
Kim_NU_task5_2 BK_CNN2 Kim2019 0.000 0.000 0.000
Kim_NU_task5_3 BK_CNN3 Kim2019 0.000 0.000 0.000
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.378 0.391 0.496
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.462 0.206 0.584
Ng_NTU_task5_1 Ng_1 Ng2019 0.560 0.551 0.639
Ng_NTU_task5_2 Ng_2 Ng2019 0.564 0.540 0.638
Ng_NTU_task5_3 Ng_3 Ng2019 0.564 0.521 0.632
Ng_NTU_task5_4 Ng_4 Ng2019 0.571 0.534 0.646
Orga_URL_task5_1 AugNet Orga2019 0.391 0.457 0.428
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.521 0.444 0.618
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.555 0.381 0.649
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.522 0.461 0.599

Team ranking

This table includes only the best-performing reproducible system per submitting team.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.645 0.484 0.751
Bai_NPU_task5_3 multifeat3 Bai2019 0.553 0.600 0.639
DCASE2019 baseline Baseline Cartwright2019 0.531 0.450 0.619
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.552 0.286 0.637
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.500 0.560 0.621
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.462 0.206 0.584
Ng_NTU_task5_4 Ng_4 Ng2019 0.571 0.534 0.646
Orga_URL_task5_1 AugNet Orga2019 0.391 0.457 0.428
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.555 0.381 0.649

Class-wise performance

Submission code | Submission name | Technical report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.751 0.665 0.718 0.362 0.486 0.858 0.289 0.841 0.936
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.721 0.673 0.604 0.374 0.378 0.832 0.351 0.833 0.931
Bai_NPU_task5_1 multifeat1 Bai2019 0.572 0.394 0.560 0.470 0.351 0.648 0.129 0.735 0.981
Bai_NPU_task5_2 multifeat2 Bai2019 0.615 0.524 0.517 0.346 0.364 0.687 0.083 0.720 0.944
Bai_NPU_task5_3 multifeat3 Bai2019 0.639 0.545 0.536 0.489 0.418 0.679 0.082 0.763 0.911
Bai_NPU_task5_4 multifeat4 Bai2019 0.623 0.511 0.565 0.430 0.407 0.696 0.115 0.739 0.973
DCASE2019 baseline Baseline Cartwright2019 0.619 0.638 0.539 0.182 0.478 0.543 0.168 0.777 0.922
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.637 0.632 0.566 0.359 0.444 0.652 0.116 0.731 0.913
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.621 0.606 0.270 0.253 0.398 0.694 0.103 0.756 0.916
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kim_NU_task5_2 BK_CNN2 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kim_NU_task5_3 BK_CNN3 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.496 0.506 0.279 0.230 0.239 0.464 0.015 0.638 0.652
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.584 0.534 0.440 0.089 0.437 0.535 0.009 0.738 0.913
Liu_CU_task5_1 Liu_CU_1 Liu2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Ng_NTU_task5_1 Ng_1 Ng2019 0.639 0.668 0.576 0.235 0.557 0.583 0.221 0.765 0.873
Ng_NTU_task5_2 Ng_2 Ng2019 0.638 0.667 0.562 0.265 0.535 0.627 0.213 0.760 0.881
Ng_NTU_task5_3 Ng_3 Ng2019 0.632 0.666 0.513 0.267 0.532 0.632 0.258 0.757 0.890
Ng_NTU_task5_4 Ng_4 Ng2019 0.646 0.665 0.538 0.280 0.545 0.666 0.222 0.764 0.886
Orga_URL_task5_1 AugNet Orga2019 0.428 0.417 0.396 0.137 0.536 0.346 0.083 0.566 0.644
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.618 0.621 0.531 0.225 0.351 0.620 0.190 0.755 0.877
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.649 0.638 0.552 0.189 0.466 0.680 0.266 0.759 0.886
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.599 0.553 0.458 0.225 0.572 0.608 0.199 0.683 0.877

System characteristics

Code | Technical report | Coarse Micro-AUPRC | Fine Micro-AUPRC | Input | Sampling rate | Data augmentation | Features | External data | External data sources | Model complexity | Classifier | Ensemble subsystems | Used annotator ID | Used proximity | Used sensor ID | Aggregation method | Target level | Target method | System relabeling
Adapa_FH_task5_1 Adapa2019 0.860 0.751 mono 44.1kHz mixup, random erase, scaling, shifting log-mel energies pre-trained model ImageNet based trained weights of MobileNetV2 2896726 CNN False False False mean both average manual
Adapa_FH_task5_2 Adapa2019 0.847 0.721 mono 44.1kHz mixup, random erase, scaling, shifting log-mel energies pre-trained model ImageNet based trained weights of MobileNetV2 2899804 CNN False False False mean both average automatic
Bai_NPU_task5_1 Bai2019 0.763 0.572 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_2 Bai2019 0.769 0.615 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_3 Bai2019 0.680 0.639 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_4 Bai2019 0.782 0.623 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
DCASE2019 baseline Cartwright2019 0.742 0.619 mono 44.1kHz vggish 2967 logistic regression False False False fine minority vote
Cui_YSU_task5_1 Cui2019 0.807 0.637 mono 32kHz log-mel spectrogram 583336 CNN False False False both minority vote
Gousseau_OL_task5_1 Gousseau2019 0.745 0.000 mono 44.1kHz mixup log-mel energies 120753440 CNN 4 False False False coarse minority vote
Gousseau_OL_task5_2 Gousseau2019 0.748 0.621 mono 44.1kHz mixup log-mel energies 120753440 CNN 4 False False False coarse minority vote
Kim_NU_task5_1 Kim2019 0.761 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 12193928 CNN False False False max coarse minority vote
Kim_NU_task5_2 Kim2019 0.825 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 24387856 CNN, ensemble False False False max coarse minority vote
Kim_NU_task5_3 Kim2019 0.809 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 12193928 CNN False False False max coarse minority vote
Kong_SURREY_task5_1 Kong2019 0.674 0.496 mono 32kHz log-mel 4686144 CNN False False False both minority vote
Kong_SURREY_task5_2 Kong2019 0.777 0.584 mono 32kHz log-mel AudioSet pretrained model AudioSet 4686144 CNN False False False both minority vote
Liu_CU_task5_1 Liu2019 0.700 0.000 mono 44.1kHz log-mel spectrogram pre-trained model pre-trained model 2967 CNN False False False mean coarse minority vote
Ng_NTU_task5_1 Ng2019 0.759 0.639 mono 44.1kHz openl3 120215 logistic regression, neural network True False False both minority vote manual
Ng_NTU_task5_2 Ng2019 0.767 0.638 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018 103191 logistic regression, neural network True False True both minority vote manual
Ng_NTU_task5_3 Ng2019 0.762 0.632 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master 103191 logistic regression, neural network True False True both minority vote automatic, manual
Ng_NTU_task5_4 Ng2019 0.770 0.646 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master 271895 logistic regression, neural network True False True both minority vote automatic, manual
Orga_URL_task5_1 Orga2019 0.562 0.428 mono 44.1kHz emphasis, compression, mixing vggish 286208677 CNN False False False mean both minority vote
Tompkins_MS_task5_1 Tompkins2019 0.779 0.618 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 169518141 CNN False False False both majority vote
Tompkins_MS_task5_2 Tompkins2019 0.788 0.649 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 169518141 CNN False False False both majority vote
Tompkins_MS_task5_3 Tompkins2019 0.779 0.599 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 1186626987 CNN, ensemble 7 False False False both majority vote
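
Several systems in the table above list "minority vote" as the target method, i.e. how the crowdsourced training annotations are aggregated into a single label per clip. A minimal sketch of one plausible reading of this scheme follows: a tag counts as present if at least one annotator marked it. This interpretation is an assumption based on the baseline's documentation, not on the individual reports.

```python
import numpy as np

def minority_vote(annotations):
    # annotations: (n_annotators, n_classes) binary matrix for one clip.
    # A class is labeled present if any single annotator tagged it.
    return (np.asarray(annotations).sum(axis=0) >= 1).astype(int)
```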

Technical reports

Urban Sound Tagging Using Convolutional Neural Networks

Sainath Adapa
Customer Acquisition, FindHotel, Amsterdam, Netherlands

Abstract

This technical report outlines our solution to Task 5 of the DCASE 2019 challenge, titled Urban Sound Tagging. The objective of the task is to label different sources of noise from raw audio data. A modified form of MobileNetV2, a convolutional neural network (CNN), was trained to label both coarse and fine tags jointly. The proposed model uses the log-scaled mel spectrogram as the representation format for the audio data. Mixup, random erasing, scaling, and shifting are used as data augmentation techniques. A second model that uses scaled labels was built to account for human errors in the annotations. The solution code is available on GitHub.
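
As an illustration of one of the augmentations named above, here is a minimal mixup sketch; the alpha hyperparameter and the batch interface are assumptions, not the authors' settings.

```python
import numpy as np

def mixup(batch_x, batch_y, alpha=0.2):
    # Draw a mixing coefficient and blend each example with a random partner.
    rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(batch_x))
    mixed_x = lam * batch_x + (1.0 - lam) * batch_x[perm]
    mixed_y = lam * batch_y + (1.0 - lam) * batch_y[perm]  # soft multilabel targets
    return mixed_x, mixed_y
```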


Urban Sound Tagging with Multi-Feature Fusion System

Jisheng Bai and Chen Chen
School of Marine Science, Northwestern Polytechnical University, Xi'an, China

Abstract

This paper presents a multi-feature fusion system for DCASE 2019 Task 5, Urban Sound Tagging (UST), which focuses on predicting whether each of 23 sources of noise pollution is present or absent in a 10-second scene [1]. The dataset provides both coarse-level and fine-level taxonomies for training; we mainly focus on the coarse level and use the best coarse-level model architecture to train the fine-level model. Various features are extracted from the original urban sound recordings, and convolutional neural networks (CNNs) are applied. Log-mel, harmonic, short-time Fourier transform (STFT), and Mel-frequency cepstral coefficient (MFCC) spectrograms are fed into a 5-layer or 9-layer CNN, and a type of gated activation [2] is also used. Different features are adopted for different urban sound classes according to the results of our experiments. We obtain at least a 0.14 macro-AUPRC improvement over the baseline system at the coarse level. Finally, we fuse several models and evaluate on the evaluation dataset.
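
For readers unfamiliar with the feature types named above, a hedged sketch of extracting them with librosa follows; the file path, sample rate, and FFT parameters are illustrative, not the authors' settings.

```python
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000, mono=True)  # placeholder path

# Magnitude STFT spectrogram.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Log-mel spectrogram.
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))

# MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Harmonic/percussive source separation (HPSS) components.
harmonic, percussive = librosa.effects.hpss(y)
```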


SONYC Urban Sound Tagging: A Multilabel Dataset from an Urban Acoustic Sensor Network

Mark Cartwright1, Jason Cramer2, Ana Elisa Mendez Mendez3, Ho-Hsiang Wu3, Vincent Lostanlen4, Juan P. Bello1 and Justin Salamon5
1Music and Audio Research Laboratory, Department of Computer Science and Engineering, Center for Urban Science and Progress, New York University, New York, New York, USA, 2Music and Audio Research Laboratory, Department of Electrical and Computer Engineering, New York University, New York, New York, USA, 3Music and Audio Research Laboratory, Department of Music and Performing Arts Professions, New York University, New York, New York, USA, 4Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA, 5Machine Perception Team, Adobe Research, San Francisco, CA, USA

Abstract

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring. The audio was recorded from an acoustic sensor network named "Sounds of New York City" (SONYC). Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 classes that had been chosen beforehand in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes.

Time-Frequency Segmentation Attention Neural Network for Urban Sound Tagging

Lin Cui, Shaonan Ji, Xinyuan Han and Jinjia Wang
School of Information Science and Engineering, Department of Electronic Communication, Yanshan University, Qinhuangdao, Hebei, China

Abstract

Audio tagging aims to assign one or more labels to an audio clip. For this task, we used the Time-Frequency Segmentation Attention Neural Network (TFSANN) for urban sound tagging. During training, the log-mel spectrogram of the audio clip is used as the input feature, and a time-frequency segmentation mask is obtained by the time-frequency segmentation network. This mask can be used to separate time-frequency-domain sound events from the background scene and to enhance the sound events that occur in the audio clip. Global Weighted Rank Pooling (GWRP) allows existing event categories to occupy a significant part of the spectrogram, letting the network focus on more significant features, and it can also estimate the probability that a sound event is present. The proposed TFSANN model is validated on the development dataset of DCASE 2019 Task 5. Results are reported for both the coarse-grained and fine-grained taxonomies in terms of micro AUPRC (area under the precision-recall curve), micro F1 score, and macro AUPRC.
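
As a sketch of the pooling operator named above: Global Weighted Rank Pooling sorts the per-bin scores in descending order and applies geometrically decaying rank weights, interpolating between max pooling (r = 0) and average pooling (r = 1). The decay rate r below is an assumed hyperparameter.

```python
import numpy as np

def gwrp(scores, r=0.9):
    # scores: 1-D array of per-time-frequency-bin probabilities for one class.
    s = np.sort(np.asarray(scores).ravel())[::-1]  # descending rank order
    w = r ** np.arange(len(s))                     # geometric rank weights
    return float(np.sum(w * s) / np.sum(w))
```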


VGG CNN for Urban Sound Tagging

Clément Gousseau
Ambient Intelligence and Mobility, Orange Labs (host company of the author's master's thesis internship), Lannion, France

Abstract

A model for urban sound tagging is presented (Task 5 of DCASE 2019 [1][2]). The task is to detect activities in 10-second audio segments recorded in the streets of New York City (SONYC dataset). The model is based on the one presented in the book Hands-On Transfer Learning with Python [3], which performs urban sound classification on the UrbanSound dataset. This model has been adapted and optimized to address Task 5 of DCASE 2019. It achieved an AUPRC of 82.6 for the coarse-grained model, where the baseline achieves an AUPRC of 76.2.


Convolutional Neural Networks with Transfer Learning for Urban Sound Tagging

Bongjun Kim
Department of Computer Science, Northwestern University, Evanston, Illinois, USA

Abstract

This technical report describes the sound classification models from our submissions for DCASE 2019 Challenge Task 5. The task is to build a system that performs audio tagging of urban sound. The dataset has 23 fine-grained tags and 8 coarse-grained tags. In this report, we present a model for coarse-grained tagging only. Our model is a convolutional neural network (CNN) that consists of 6 convolutional layers and 3 fully connected layers. We apply transfer learning by utilizing the VGGish model, which has been pre-trained on a large-scale dataset. We also apply an ensemble technique to boost the performance over a single model. We compare the performance of our models and the baseline approach on the provided validation dataset. The results show that our models outperform the baseline system.
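
The report does not specify how the ensemble combines its members; a common choice, shown here purely as an assumed illustration, is to average the per-model probabilities.

```python
import numpy as np

def ensemble_predict(models, batch):
    # models: objects with a predict(batch) -> (n_clips, n_classes) method
    # (an assumed interface, not the authors' API).
    probs = np.stack([m.predict(batch) for m in models])
    return probs.mean(axis=0)  # average the per-model probabilities
```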


Cross-Talk Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.
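
As a sketch of the architectural pattern described above (convolutional layers followed by average pooling), here is one plausible PyTorch block; the channel counts and kernel sizes are assumptions, not the exact CVSSP configuration.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Conv -> BatchNorm -> ReLU, then 2x2 average pooling, as in a
    # "CNN with average pooling after convolutional layers".
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(kernel_size=2),
    )
```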


Improved Residual Network Based on Deformable Convolution for Urban Sound Tagging

Fuling Liu
College of Photoelectric Engineering, Chongqing University, 174 Shazhengjie, Shapingba, Chongqing, 400030, China

Abstract

To address the urban sound tagging problem, we use deformable convolution to improve the residual module: offset variables are added to the input feature map of the residual module, and the improved residual modules are combined to form a new residual network. Compared with the usual way of adding deformable convolution, the improved method in this paper achieves better results.
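
A hedged sketch of a deformable-convolution residual block in the spirit of this abstract, using torchvision's DeformConv2d; the offset-prediction side branch and all layer sizes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # 2 * 3 * 3 offset channels: one (x, y) offset per kernel location.
        self.offset = nn.Conv2d(ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(ch, ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        # Offsets are predicted from the input feature map itself.
        out = self.deform(x, self.offset(x))
        return torch.relu(x + self.bn(out))
```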


Urban Sound Tagging: DCASE 2019 Challenge Task 5

Linus Ng and Kenneth Ooi
Smart Nation Translational Lab, Centre for Infocomm Technology (INFINITUS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Abstract

Identifying urban noises and sounds is a challenging but important problem in the field of machine listening [1]. It enables a realistic use case for detecting noise in an urbanised city, from noise complaints to sounds or unusual noises that may indicate possible emergencies. The Urban Sound Tagging challenge, part of the DCASE 2019 challenge [2][3], addresses the problem of urban noise control [1]. For this challenge, we are tasked with building an audio classifier to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene, as recorded by an acoustic sensor network. In this technical report, we examine in some detail the performance of audio classification models trained with different open external datasets.
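
The Ng systems use OpenL3 embeddings as input features (see the system characteristics table above). A minimal sketch of extracting such embeddings with the openl3 package follows; the content type and embedding size are assumptions, not necessarily the authors' settings.

```python
import openl3
import soundfile as sf

audio, sr = sf.read("clip.wav")  # placeholder path
emb, ts = openl3.get_audio_embedding(audio, sr, content_type="env",
                                     embedding_size=512)
# emb: (n_frames, 512) embedding matrix; ts: frame timestamps in seconds.
```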


An Augmented Neural Network for the DCASE 2019 Urban Sound Tagging Challenge

Ferran Orga1, Joan Serrà2 and Carlos Segura Perales2
1GTM - Grup de recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull (URL), C/Quatre Camins, 30, 08022 Barcelona (Spain), 2Telefónica Research, Barcelona (Spain)

Abstract

The Sounds of New York City (SONYC) research project aims to mitigate urban noise pollution in the context of a megacity. The project has deployed 50 sensors, installed back in 2015 in various areas of New York City, to monitor the overall sound pressure level. However, this is not enough to determine the noise sources, which is needed to detect noise-code violations. In Task 5 of the DCASE 2019 challenge, participants are asked to develop a machine listening system that distinguishes between 23 sources of noise pollution. The system must predict whether each source is present or absent in 10-second scenes recorded by the SONYC network. Moreover, annotations are also provided at a higher level, grouping the 23 fine labels into 8 coarser labels. In this report, the authors present a machine listening approach based on an augmented neural network in which both coarse- and fine-level annotations are used to predict event presence within the same network. This approach obtains a classification accuracy on the validation split of 87% at the coarse level and 92% at the fine level.
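
A minimal sketch of the joint coarse/fine prediction idea described above, with both output heads sharing one trunk; all layer sizes are illustrative placeholders, not the authors' architecture.

```python
import torch.nn as nn

class TwoHeadTagger(nn.Module):
    def __init__(self, feat_dim=128, n_coarse=8, n_fine=23):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.coarse_head = nn.Linear(256, n_coarse)
        self.fine_head = nn.Linear(256, n_fine)

    def forward(self, x):
        # Returns logits for both taxonomy levels; apply sigmoids for tags.
        h = self.trunk(x)
        return self.coarse_head(h), self.fine_head(h)
```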


DCASE 2019 Challenge Task 5: CNN+VGGish

Daniel Tompkins and Eric Nichols
Dynamics 365 AI Research, Microsoft, Redmond, WA

Abstract

We trained a model for multi-label audio classification on Task 5 of the DCASE 2019 Challenge [1]. The model is composed of a preprocessing layer that converts audio to a log-mel spectrogram, a VGG-inspired convolutional neural network (CNN) that generates an embedding for the spectrogram, the pre-trained VGGish network [2] that generates a separate audio embedding, and finally a series of fully connected layers that converts these two embeddings (concatenated) into a multi-label classification. This model directly outputs both fine and coarse labels; it treats the task as a 37-way multi-label classification problem. One version of this network did better on the coarse labels (submission 1); another did better on the fine labels in terms of Micro-AUPRC (submission 2). A separate family of CNN models, one per coarse label, was also trained to take into account the hierarchical nature of the labels (submission 3), but the single-model solution performed slightly better.
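
A hedged sketch of the embedding-concatenation step described above: a CNN embedding and a VGGish embedding are concatenated and passed through fully connected layers for 37-way multi-label classification. The embedding dimensions are placeholders, not the authors' values.

```python
import torch
import torch.nn as nn

class ConcatClassifier(nn.Module):
    def __init__(self, cnn_dim=512, vggish_dim=128, n_labels=37):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(cnn_dim + vggish_dim, 256), nn.ReLU(),
            nn.Linear(256, n_labels))

    def forward(self, cnn_emb, vggish_emb):
        # Concatenate the two embeddings, then classify jointly.
        return self.fc(torch.cat([cnn_emb, vggish_emb], dim=-1))
```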
