Urban Sound Tagging


Challenge results

Task description

The goal of urban sound tagging (UST) is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene. These sources of noise are also grouped into 8 coarse-level categories. All of the recordings are from an urban acoustic sensor network in New York City. The training set was annotated by volunteers on the Zooniverse citizen-science platform, and the validation and test sets were annotated by the task organizers.

A more detailed task description can be found on the task description page.

Note that only teams that have open-sourced their systems are included in the system and team rankings.
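
The rankings below report Macro-AUPRC, Micro-F1, and Micro-AUPRC. As a rough illustration of how these multilabel metrics differ, here is a minimal sketch using scikit-learn; note that scikit-learn's average precision is only an approximation of the AUPRC computed by the official evaluation code, and all variable names are placeholders.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# y_true, y_score: (n_clips, n_classes) arrays of binary labels and
# predicted probabilities for one taxonomy level (coarse or fine).

def micro_auprc(y_true, y_score):
    # Micro-averaging pools every (clip, class) decision into a single curve.
    return average_precision_score(y_true.ravel(), y_score.ravel())

def macro_auprc(y_true, y_score):
    # Macro-averaging computes AUPRC per class, then averages the classes.
    per_class = [average_precision_score(y_true[:, k], y_score[:, k])
                 for k in range(y_true.shape[1])]
    return float(np.mean(per_class))

def micro_f1(y_true, y_score, threshold=0.5):
    # Micro-F1 binarizes the scores at a fixed threshold first.
    y_pred = (y_score >= threshold).astype(int)
    return f1_score(y_true.ravel(), y_pred.ravel())
```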

Coarse-level prediction

System ranking

These results include only systems for which the source code has been released.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.718 0.631 0.860
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.723 0.745 0.847
Bai_NPU_task5_1 multifeat1 Bai2019 0.618 0.696 0.763
Bai_NPU_task5_2 multifeat2 Bai2019 0.649 0.701 0.769
Bai_NPU_task5_3 multifeat3 Bai2019 0.558 0.631 0.680
Bai_NPU_task5_4 multifeat4 Bai2019 0.647 0.709 0.782
DCASE2019 baseline Baseline Cartwright2019 0.619 0.664 0.742
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.674 0.525 0.807
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.650 0.667 0.745
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.612 0.639 0.748
Kim_NU_task5_1 BK_CNN1 Kim2019 0.653 0.686 0.761
Kim_NU_task5_2 BK_CNN2 Kim2019 0.696 0.734 0.825
Kim_NU_task5_3 BK_CNN3 Kim2019 0.697 0.730 0.809
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.567 0.467 0.674
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.613 0.516 0.777
Ng_NTU_task5_1 Ng_1 Ng2019 0.657 0.670 0.759
Ng_NTU_task5_2 Ng_2 Ng2019 0.666 0.677 0.767
Ng_NTU_task5_3 Ng_3 Ng2019 0.660 0.671 0.762
Ng_NTU_task5_4 Ng_4 Ng2019 0.660 0.666 0.770
Orga_URL_task5_1 AugNet Orga2019 0.501 0.557 0.562
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.646 0.631 0.779
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.666 0.552 0.788
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.646 0.631 0.779

Team ranking

This table includes only the best-performing reproducible system per submitting team.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.718 0.631 0.860
Bai_NPU_task5_4 multifeat4 Bai2019 0.647 0.709 0.782
DCASE2019 baseline Baseline Cartwright2019 0.619 0.664 0.742
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.674 0.525 0.807
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.612 0.639 0.748
Kim_NU_task5_2 BK_CNN2 Kim2019 0.696 0.734 0.825
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.613 0.516 0.777
Ng_NTU_task5_4 Ng_4 Ng2019 0.660 0.666 0.770
Orga_URL_task5_1 AugNet Orga2019 0.501 0.557 0.562
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.666 0.552 0.788

Class-wise performance

Submission code | Submission name | Technical report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.860 0.888 0.627 0.361 0.684 0.897 0.404 0.947 0.937
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.847 0.878 0.578 0.344 0.643 0.875 0.586 0.949 0.931
Bai_NPU_task5_1 multifeat1 Bai2019 0.763 0.787 0.632 0.287 0.578 0.732 0.105 0.907 0.918
Bai_NPU_task5_2 multifeat2 Bai2019 0.769 0.792 0.602 0.363 0.658 0.804 0.171 0.909 0.896
Bai_NPU_task5_3 multifeat3 Bai2019 0.680 0.792 0.111 0.071 0.658 0.771 0.225 0.922 0.911
Bai_NPU_task5_4 multifeat4 Bai2019 0.782 0.809 0.637 0.347 0.628 0.781 0.151 0.916 0.912
DCASE2019 baseline Baseline Cartwright2019 0.742 0.832 0.454 0.170 0.709 0.727 0.246 0.886 0.929
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.807 0.859 0.598 0.405 0.739 0.773 0.268 0.883 0.863
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.745 0.793 0.598 0.282 0.703 0.802 0.218 0.867 0.934
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.748 0.813 0.403 0.253 0.698 0.778 0.166 0.871 0.916
Kim_NU_task5_1 BK_CNN1 Kim2019 0.761 0.863 0.548 0.202 0.717 0.791 0.276 0.910 0.918
Kim_NU_task5_2 BK_CNN2 Kim2019 0.825 0.849 0.643 0.308 0.686 0.850 0.358 0.944 0.931
Kim_NU_task5_3 BK_CNN3 Kim2019 0.809 0.831 0.650 0.290 0.674 0.856 0.402 0.934 0.941
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.674 0.786 0.455 0.272 0.640 0.552 0.181 0.765 0.883
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.777 0.824 0.526 0.172 0.722 0.785 0.057 0.893 0.926
Liu_CU_task5_1 Liu_CU_1 Liu2019 0.700 0.746 0.528 0.318 0.742 0.826 0.000 0.772 0.898
Ng_NTU_task5_1 Ng_1 Ng2019 0.759 0.832 0.525 0.268 0.693 0.786 0.403 0.874 0.877
Ng_NTU_task5_2 Ng_2 Ng2019 0.767 0.843 0.535 0.249 0.739 0.760 0.425 0.882 0.895
Ng_NTU_task5_3 Ng_3 Ng2019 0.762 0.852 0.529 0.197 0.767 0.775 0.399 0.875 0.888
Ng_NTU_task5_4 Ng_4 Ng2019 0.770 0.854 0.545 0.209 0.749 0.764 0.412 0.878 0.867
Orga_URL_task5_1 AugNet Orga2019 0.562 0.653 0.411 0.131 0.704 0.544 0.223 0.672 0.668
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.779 0.844 0.519 0.227 0.730 0.770 0.316 0.883 0.877
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.788 0.855 0.538 0.188 0.744 0.812 0.418 0.886 0.886
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.779 0.844 0.519 0.227 0.730 0.770 0.316 0.883 0.877

Fine-level prediction

System ranking

These results include only systems for which the source code has been released.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.645 0.484 0.751
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.622 0.575 0.721
Bai_NPU_task5_1 multifeat1 Bai2019 0.534 0.514 0.572
Bai_NPU_task5_2 multifeat2 Bai2019 0.523 0.594 0.615
Bai_NPU_task5_3 multifeat3 Bai2019 0.553 0.600 0.639
Bai_NPU_task5_4 multifeat4 Bai2019 0.554 0.571 0.623
DCASE2019 baseline Baseline Cartwright2019 0.531 0.450 0.619
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.552 0.286 0.637
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.000 0.000 0.000
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.500 0.560 0.621
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000
Kim_NU_task5_2 BK_CNN2 Kim2019 0.000 0.000 0.000
Kim_NU_task5_3 BK_CNN3 Kim2019 0.000 0.000 0.000
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.378 0.391 0.496
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.462 0.206 0.584
Ng_NTU_task5_1 Ng_1 Ng2019 0.560 0.551 0.639
Ng_NTU_task5_2 Ng_2 Ng2019 0.564 0.540 0.638
Ng_NTU_task5_3 Ng_3 Ng2019 0.564 0.521 0.632
Ng_NTU_task5_4 Ng_4 Ng2019 0.571 0.534 0.646
Orga_URL_task5_1 AugNet Orga2019 0.391 0.457 0.428
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.521 0.444 0.618
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.555 0.381 0.649
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.522 0.461 0.599

Team ranking

This table includes only the best-performing reproducible system per submitting team.

Submission code | Submission name | Technical report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.645 0.484 0.751
Bai_NPU_task5_3 multifeat3 Bai2019 0.553 0.600 0.639
DCASE2019 baseline Baseline Cartwright2019 0.531 0.450 0.619
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.552 0.286 0.637
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.500 0.560 0.621
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.462 0.206 0.584
Ng_NTU_task5_4 Ng_4 Ng2019 0.571 0.534 0.646
Orga_URL_task5_1 AugNet Orga2019 0.391 0.457 0.428
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.555 0.381 0.649

Class-wise performance

Submission code | Submission name | Technical report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Adapa_FH_task5_1 MNv2_1 Adapa2019 0.751 0.665 0.718 0.362 0.486 0.858 0.289 0.841 0.936
Adapa_FH_task5_2 MNv2_2 Adapa2019 0.721 0.673 0.604 0.374 0.378 0.832 0.351 0.833 0.931
Bai_NPU_task5_1 multifeat1 Bai2019 0.572 0.394 0.560 0.470 0.351 0.648 0.129 0.735 0.981
Bai_NPU_task5_2 multifeat2 Bai2019 0.615 0.524 0.517 0.346 0.364 0.687 0.083 0.720 0.944
Bai_NPU_task5_3 multifeat3 Bai2019 0.639 0.545 0.536 0.489 0.418 0.679 0.082 0.763 0.911
Bai_NPU_task5_4 multifeat4 Bai2019 0.623 0.511 0.565 0.430 0.407 0.696 0.115 0.739 0.973
DCASE2019 baseline Baseline Cartwright2019 0.619 0.638 0.539 0.182 0.478 0.543 0.168 0.777 0.922
Cui_YSU_task5_1 YSU_TFSANN Cui2019 0.637 0.632 0.566 0.359 0.444 0.652 0.116 0.731 0.913
Gousseau_OL_task5_1 Gousseau1 Gousseau2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Gousseau_OL_task5_2 Gousseau2 Gousseau2019 0.621 0.606 0.270 0.253 0.398 0.694 0.103 0.756 0.916
Kim_NU_task5_1 BK_CNN1 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kim_NU_task5_2 BK_CNN2 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kim_NU_task5_3 BK_CNN3 Kim2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kong_SURREY_task5_1 cvssp_cnn9 Kong2019 0.496 0.506 0.279 0.230 0.239 0.464 0.015 0.638 0.652
Kong_SURREY_task5_2 cvssp_plus Kong2019 0.584 0.534 0.440 0.089 0.437 0.535 0.009 0.738 0.913
Liu_CU_task5_1 Liu_CU_1 Liu2019 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Ng_NTU_task5_1 Ng_1 Ng2019 0.639 0.668 0.576 0.235 0.557 0.583 0.221 0.765 0.873
Ng_NTU_task5_2 Ng_2 Ng2019 0.638 0.667 0.562 0.265 0.535 0.627 0.213 0.760 0.881
Ng_NTU_task5_3 Ng_3 Ng2019 0.632 0.666 0.513 0.267 0.532 0.632 0.258 0.757 0.890
Ng_NTU_task5_4 Ng_4 Ng2019 0.646 0.665 0.538 0.280 0.545 0.666 0.222 0.764 0.886
Orga_URL_task5_1 AugNet Orga2019 0.428 0.417 0.396 0.137 0.536 0.346 0.083 0.566 0.644
Tompkins_MS_task5_1 MS D365 AI 1 Tompkins2019 0.618 0.621 0.531 0.225 0.351 0.620 0.190 0.755 0.877
Tompkins_MS_task5_2 MS D365 AI 2 Tompkins2019 0.649 0.638 0.552 0.189 0.466 0.680 0.266 0.759 0.886
Tompkins_MS_task5_3 MS D365 AI 3 Tompkins2019 0.599 0.553 0.458 0.225 0.572 0.608 0.199 0.683 0.877

System characteristics

Code | Technical report | Coarse Micro-AUPRC | Fine Micro-AUPRC | Input | Sampling rate | Data augmentation | Features | External data | External data sources | Model complexity | Classifier | Ensemble subsystems | Used annotator ID | Used proximity | Used sensor ID | Aggregation method | Target level | Target method | System relabeling
Adapa_FH_task5_1 Adapa2019 0.860 0.751 mono 44.1kHz mixup, random erase, scaling, shifting log-mel energies pre-trained model ImageNet based trained weights of MobileNetV2 2896726 CNN False False False mean both average manual
Adapa_FH_task5_2 Adapa2019 0.847 0.721 mono 44.1kHz mixup, random erase, scaling, shifting log-mel energies pre-trained model ImageNet based trained weights of MobileNetV2 2899804 CNN False False False mean both average automatic
Bai_NPU_task5_1 Bai2019 0.763 0.572 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_2 Bai2019 0.769 0.615 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_3 Bai2019 0.680 0.639 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
Bai_NPU_task5_4 Bai2019 0.782 0.623 mono 16kHz MFCC, log-mel, STFT, HPSS CNN False False False fine fusion
DCASE2019 baseline Cartwright2019 0.742 0.619 mono 44.1kHz vggish 2967 logistic regression False False False fine minority vote
Cui_YSU_task5_1 Cui2019 0.807 0.637 mono 32kHz log-mel spectrogram 583336 CNN False False False both minority vote
Gousseau_OL_task5_1 Gousseau2019 0.745 0.000 mono 44.1kHz mixup log-mel energies 120753440 CNN 4 False False False coarse minority vote
Gousseau_OL_task5_2 Gousseau2019 0.748 0.621 mono 44.1kHz mixup log-mel energies 120753440 CNN 4 False False False coarse minority vote
Kim_NU_task5_1 Kim2019 0.761 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 12193928 CNN False False False max coarse minority vote
Kim_NU_task5_2 Kim2019 0.825 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 24387856 CNN, ensemble False False False max coarse minority vote
Kim_NU_task5_3 Kim2019 0.809 0.000 mono 44.1kHz mel spectrogram pre-trained model vggish 12193928 CNN False False False max coarse minority vote
Kong_SURREY_task5_1 Kong2019 0.674 0.496 mono 32kHz log-mel 4686144 CNN False False False both minority vote
Kong_SURREY_task5_2 Kong2019 0.777 0.584 mono 32kHz log-mel AudioSet pretrained model AudioSet 4686144 CNN False False False both minority vote
Liu_CU_task5_1 Liu2019 0.700 0.000 mono 44.1kHz log-mel spectrogram pre-trained model pre-trained model 2967 CNN False False False mean coarse minority vote
Ng_NTU_task5_1 Ng2019 0.759 0.639 mono 44.1kHz openl3 120215 logistic regression, neural network True False False both minority vote manual
Ng_NTU_task5_2 Ng2019 0.767 0.638 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018 103191 logistic regression, neural network True False True both minority vote manual
Ng_NTU_task5_3 Ng2019 0.762 0.632 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master 103191 logistic regression, neural network True False True both minority vote automatic, manual
Ng_NTU_task5_4 Ng2019 0.770 0.646 mono 44.1kHz openl3 audio data Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master 271895 logistic regression, neural network True False True both minority vote automatic, manual
Orga_URL_task5_1 Orga2019 0.562 0.428 mono 44.1kHz emphasis, compression, mixing vggish 286208677 CNN False False False mean both minority vote
Tompkins_MS_task5_1 Tompkins2019 0.779 0.618 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 169518141 CNN False False False both majority vote
Tompkins_MS_task5_2 Tompkins2019 0.788 0.649 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 169518141 CNN False False False both majority vote
Tompkins_MS_task5_3 Tompkins2019 0.779 0.599 mono 44.1kHz pitch shifting, volume, white noise addition log-mel spectrogram, vggish pre-trained model vggish 1186626987 CNN, ensemble 7 False False False both majority vote
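
Several systems in the table above list "minority vote" as the target method, i.e. how the crowdsourced training annotations are aggregated into a single label per clip. A minimal sketch of one plausible reading of this scheme follows: a tag counts as present if at least one annotator marked it. This interpretation is an assumption based on the baseline's documentation, not on the individual reports.

```python
import numpy as np

def minority_vote(annotations):
    # annotations: (n_annotators, n_classes) binary matrix for one clip.
    # A class is labeled present if any single annotator tagged it.
    return (np.asarray(annotations).sum(axis=0) >= 1).astype(int)
```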

Technical reports

Urban Sound Tagging Using Convolutional Neural Networks

Sainath Adapa
Customer Acquisition, FindHotel, Amsterdam, Netherlands

Abstract

This technical report outlines our solution to Task 5 of the DCASE 2019 challenge, titled Urban Sound Tagging. The objective of the task is to label different sources of noise from raw audio data. A modified form of MobileNetV2, a convolutional neural network (CNN), was trained to label both coarse and fine tags jointly. The proposed model uses the log-scaled mel spectrogram as the representation format for the audio data. Mixup, random erasing, scaling, and shifting are used as data augmentation techniques. A second model that uses scaled labels was built to account for human errors in the annotations. The solution code is available on GitHub.
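
As an illustration of one of the augmentations named above, here is a minimal mixup sketch; the alpha hyperparameter and the batch interface are assumptions, not the authors' settings.

```python
import numpy as np

def mixup(batch_x, batch_y, alpha=0.2):
    # Draw a mixing coefficient and blend each example with a random partner.
    rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(batch_x))
    mixed_x = lam * batch_x + (1.0 - lam) * batch_x[perm]
    mixed_y = lam * batch_y + (1.0 - lam) * batch_y[perm]  # soft multilabel targets
    return mixed_x, mixed_y
```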


Urban Sound Tagging with Multi-Feature Fusion System

Jisheng Bai and Chen Chen
School of Marine Science, Northwestern Polytechnical University, Xi'an, China

Abstract

This paper presents a multi-feature fusion system for DCASE 2019 Task 5, Urban Sound Tagging (UST), which focuses on predicting whether each of 23 sources of noise pollution is present or absent in a 10-second scene [1]. The dataset provides both coarse-level and fine-level taxonomies for training; we mainly focus on the coarse level and use the best coarse-level model architecture to train the fine-level model. Various features are extracted from the original urban sound recordings, and convolutional neural networks (CNNs) are applied. Log-mel, harmonic, short-time Fourier transform (STFT), and Mel-frequency cepstral coefficient (MFCC) spectrograms are fed into a 5-layer or 9-layer CNN, and a type of gated activation [2] is also used. Different features are adopted for different urban sound classes according to the results of our experiments. We obtain at least a 0.14 macro-AUPRC improvement over the baseline system at the coarse level. Finally, we fuse several models and evaluate on the evaluation dataset.
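
For readers unfamiliar with the feature types named above, a hedged sketch of extracting them with librosa follows; the file path, sample rate, and FFT parameters are illustrative, not the authors' settings.

```python
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000, mono=True)  # placeholder path

# Magnitude STFT spectrogram.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Log-mel spectrogram.
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))

# MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Harmonic/percussive source separation (HPSS) components.
harmonic, percussive = librosa.effects.hpss(y)
```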


SONYC Urban Sound Tagging: A Multilabel Dataset from an Urban Acoustic Sensor Network

Mark Cartwright1, Jason Cramer2, Ana Elisa Mendez Mendez3, Ho-Hsiang Wu3, Vincent Lostanlen4, Juan P. Bello1 and Justin Salamon5
1Music and Audio Research Laboratory, Department of Computer Science and Engineering, Center for Urban Science and Progress, New York University, New York, New York, USA, 2Music and Audio Research Laboratory, Department of Electrical and Computer Engineering, New York University, New York, New York, USA, 3Music and Audio Research Laboratory, Department of Music and Performing Arts Professions, New York University, New York, New York, USA, 4Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA, 5Machine Perception Team, Adobe Research, San Francisco, CA, USA

Abstract

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring. The audio was recorded from an acoustic sensor network named "Sounds of New York City" (SONYC). Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 classes that had been chosen beforehand in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes.

Time-Frequency Segmentation Attention Neural Network for Urban Sound Tagging

Lin Cui, Shaonan Ji, Xinyuan Han and Jinjia Wang
School of Information Science and Engineering, Department of Electronic Communication, Yanshan University, Qinhuangdao, Hebei, China

Abstract

Audio tagging aims to assign one or more labels to an audio clip. For this task, we used the Time-Frequency Segmentation Attention Neural Network (TFSANN) for urban sound tagging. During training, the log-mel spectrogram of the audio clip is used as the input feature, and a time-frequency segmentation mask is obtained by the time-frequency segmentation network. This mask can be used to separate time-frequency-domain sound events from the background scene and to enhance the sound events that occur in the audio clip. Global Weighted Rank Pooling (GWRP) allows existing event categories to occupy a significant part of the spectrogram, letting the network focus on more significant features, and it can also estimate the probability that a sound event is present. The proposed TFSANN model is validated on the development dataset of DCASE 2019 Task 5. Results are reported for both the coarse-grained and fine-grained taxonomies in terms of micro AUPRC (area under the precision-recall curve), micro F1 score, and macro AUPRC.
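
As a sketch of the pooling operator named above: Global Weighted Rank Pooling sorts the per-bin scores in descending order and applies geometrically decaying rank weights, interpolating between max pooling (r = 0) and average pooling (r = 1). The decay rate r below is an assumed hyperparameter.

```python
import numpy as np

def gwrp(scores, r=0.9):
    # scores: 1-D array of per-time-frequency-bin probabilities for one class.
    s = np.sort(np.asarray(scores).ravel())[::-1]  # descending rank order
    w = r ** np.arange(len(s))                     # geometric rank weights
    return float(np.sum(w * s) / np.sum(w))
```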


VGG CNN for Urban Sound Tagging

Clément Gousseau
Ambient Intelligence and Mobility, Orange Labs (host company of the author's master's thesis internship), Lannion, France

Abstract

A model for urban sound tagging is presented (Task 5 of DCASE 2019 [1][2]). The task is to detect activities in 10-second audio segments recorded in the streets of New York City (SONYC dataset). The model is based on the one presented in the book Hands-On Transfer Learning with Python [3], which performs urban sound classification on the UrbanSound dataset. This model has been adapted and optimized to address Task 5 of DCASE 2019. It achieved an AUPRC of 82.6 for the coarse-grained model, where the baseline achieves an AUPRC of 76.2.


Convolutional Neural Networks with Transfer Learning for Urban Sound Tagging

Bongjun Kim
Department of Computer Science, Northwestern University, Evanston, Illinois, USA

Abstract

This technical report describes the sound classification models from our submissions for DCASE 2019 Challenge Task 5. The task is to build a system that performs audio tagging of urban sound. The dataset has 23 fine-grained tags and 8 coarse-grained tags. In this report, we present a model for coarse-grained tagging only. Our model is a convolutional neural network (CNN) that consists of 6 convolutional layers and 3 fully connected layers. We apply transfer learning by utilizing the VGGish model, which has been pre-trained on a large-scale dataset. We also apply an ensemble technique to boost the performance over a single model. We compare the performance of our models and the baseline approach on the provided validation dataset. The results show that our models outperform the baseline system.
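
The report does not specify how the ensemble combines its members; a common choice, shown here purely as an assumed illustration, is to average the per-model probabilities.

```python
import numpy as np

def ensemble_predict(models, batch):
    # models: objects with a predict(batch) -> (n_clips, n_classes) method
    # (an assumed interface, not the authors' API).
    probs = np.stack([m.predict(batch) for m in models])
    return probs.mean(axis=0)  # average the per-model probabilities
```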


Cross-Talk Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.
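
As a sketch of the architectural pattern described above (convolutional layers followed by average pooling), here is one plausible PyTorch block; the channel counts and kernel sizes are assumptions, not the exact CVSSP configuration.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Conv -> BatchNorm -> ReLU, then 2x2 average pooling, as in a
    # "CNN with average pooling after convolutional layers".
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(kernel_size=2),
    )
```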


Improved Residual Network Based on Deformable Convolution for Urban Sound Tagging

Fuling Liu
College of Photoelectric Engineering, Chongqing University, 174 Shazhengjie, Shapingba, Chongqing, 400030, China

Abstract

To address the urban sound tagging problem, we use deformable convolution to improve the residual module: offset variables are added to the input feature map of the residual module, and the improved residual modules are combined to form a new residual network. Compared with the usual way of adding deformable convolution, the improved method in this paper achieves better results.
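
A hedged sketch of a deformable-convolution residual block in the spirit of this abstract, using torchvision's DeformConv2d; the offset-prediction side branch and all layer sizes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # 2 * 3 * 3 offset channels: one (x, y) offset per kernel location.
        self.offset = nn.Conv2d(ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(ch, ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        # Offsets are predicted from the input feature map itself.
        out = self.deform(x, self.offset(x))
        return torch.relu(x + self.bn(out))
```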


Urban Sound Tagging: DCASE 2019 Challenge Task 5

Linus Ng and Kenneth Ooi
Smart Nation Translational Lab, Centre for Infocomm Technology (INFINITUS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Abstract

Identifying urban noises and sounds is a challenging but important problem in the field of machine listening [1]. It enables a realistic use case for detecting noise in an urbanised city, from noise complaints to sounds or unusual noises that may indicate possible emergencies. The Urban Sound Tagging challenge, part of the DCASE 2019 challenge [2][3], addresses the problem of urban noise control [1]. For this challenge, we are tasked with building an audio classifier to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene, as recorded by an acoustic sensor network. In this technical report, we examine in some detail the performance of audio classification models trained with different open external datasets.
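
The Ng systems use OpenL3 embeddings as input features (see the system characteristics table above). A minimal sketch of extracting such embeddings with the openl3 package follows; the content type and embedding size are assumptions, not necessarily the authors' settings.

```python
import openl3
import soundfile as sf

audio, sr = sf.read("clip.wav")  # placeholder path
emb, ts = openl3.get_audio_embedding(audio, sr, content_type="env",
                                     embedding_size=512)
# emb: (n_frames, 512) embedding matrix; ts: frame timestamps in seconds.
```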


An Augmented Neural Network for the DCASE 2019 Urban Sound Tagging Challenge

Ferran Orga1, Joan Serrà2 and Carlos Segura Perales2
1GTM - Grup de recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull (URL), C/Quatre Camins, 30, 08022 Barcelona (Spain), 2Telefónica Research, Barcelona (Spain)

Abstract

The Sounds of New York City (SONYC) research project aims to mitigate urban noise pollution in the context of a megacity. The project has deployed 50 sensors, installed back in 2015 in various areas of New York City, to monitor the overall sound pressure level. However, this is not enough to determine the noise sources, which is needed to detect noise-code violations. In Task 5 of the DCASE 2019 challenge, participants are asked to develop a machine listening system that distinguishes between 23 sources of noise pollution. The system must predict whether each source is present or absent in 10-second scenes recorded by the SONYC network. Moreover, annotations are also provided at a higher level, grouping the 23 fine labels into 8 coarser labels. In this report, the authors present a machine listening approach based on an augmented neural network in which both coarse- and fine-level annotations are used to predict event presence within the same network. This approach obtains a classification accuracy on the validation split of 87% at the coarse level and 92% at the fine level.
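
A minimal sketch of the joint coarse/fine prediction idea described above, with both output heads sharing one trunk; all layer sizes are illustrative placeholders, not the authors' architecture.

```python
import torch.nn as nn

class TwoHeadTagger(nn.Module):
    def __init__(self, feat_dim=128, n_coarse=8, n_fine=23):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.coarse_head = nn.Linear(256, n_coarse)
        self.fine_head = nn.Linear(256, n_fine)

    def forward(self, x):
        # Returns logits for both taxonomy levels; apply sigmoids for tags.
        h = self.trunk(x)
        return self.coarse_head(h), self.fine_head(h)
```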


DCASE 2019 Challenge Task 5: CNN+VGGish

Daniel Tompkins and Eric Nichols
Dynamics 365 AI Research, Microsoft, Redmond, WA

Abstract

We trained a model for multi-label audio classification on Task 5 of the DCASE 2019 Challenge [1]. The model is composed of a preprocessing layer that converts audio to a log-mel spectrogram, a VGG-inspired convolutional neural network (CNN) that generates an embedding for the spectrogram, the pre-trained VGGish network [2] that generates a separate audio embedding, and finally a series of fully connected layers that converts these two embeddings (concatenated) into a multi-label classification. This model directly outputs both fine and coarse labels; it treats the task as a 37-way multi-label classification problem. One version of this network did better on the coarse labels (submission 1); another did better on the fine labels in terms of Micro-AUPRC (submission 2). A separate family of CNN models, one per coarse label, was also trained to take into account the hierarchical nature of the labels (submission 3), but the single-model solution performed slightly better.
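
A hedged sketch of the embedding-concatenation step described above: a CNN embedding and a VGGish embedding are concatenated and passed through fully connected layers for 37-way multi-label classification. The embedding dimensions are placeholders, not the authors' values.

```python
import torch
import torch.nn as nn

class ConcatClassifier(nn.Module):
    def __init__(self, cnn_dim=512, vggish_dim=128, n_labels=37):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(cnn_dim + vggish_dim, 256), nn.ReLU(),
            nn.Linear(256, n_labels))

    def forward(self, cnn_emb, vggish_emb):
        # Concatenate the two embeddings, then classify jointly.
        return self.fc(torch.cat([cnn_emb, vggish_emb], dim=-1))
```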
