Urban Sound Tagging with Spatiotemporal Context


Challenge results

Task description

The goal of urban sound tagging with spatiotemporal context is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene, given both the audio and the time and location of the recording. These sources of noise are also grouped into 8 coarse-level categories. All of the recordings come from an urban acoustic sensor network in New York City. The training set was annotated by volunteers on the Zooniverse citizen-science platform, and the verified subsets (including the test set) were annotated by the task organizers.

A more detailed task description can be found on the task description page.
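Because the taxonomy is two-level, fine-level and coarse-level predictions are linked: a coarse category is typically counted as present whenever any of its fine-level tags is present. A minimal sketch of that aggregation rule (the tag names here are hypothetical placeholders, not the exact SONYC-UST labels):

```python
# Hypothetical fine-to-coarse mapping for illustration only; the real
# SONYC-UST taxonomy defines 23 fine tags under 8 coarse categories.
TAXONOMY = {
    "engine": ["small-engine", "medium-engine", "large-engine"],
    "alert-signal": ["car-horn", "siren", "reverse-beeper"],
}

def coarse_from_fine(fine_tags):
    """A coarse category counts as present if any of its fine tags is present."""
    return {coarse: any(t in fine_tags for t in fine)
            for coarse, fine in TAXONOMY.items()}
```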

Coarse-level prediction

System ranking

Submission code | Submission name | Technical report | Micro-AUPRC | LWLRAP | Macro-AUPRC
Arnault_MULT_task5_1 UrbanNet_1 Arnault2020 0.767 0.846 0.593
Arnault_MULT_task5_2 UrbanNet_2 Arnault2020 0.815 0.881 0.632
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.835 0.898 0.649
Bai_NWPU_task5_1 Bai_1 Bai2020 0.768 0.860 0.600
Bai_NWPU_task5_2 Bai_2 Bai2020 0.542 0.720 0.547
Bai_NWPU_task5_3 Bai_3 Bai2020 0.562 0.760 0.537
Bai_NWPU_task5_4 Bai_4 Bai2020 0.601 0.745 0.576
DCASE2020 baseline Baseline Cartwright2020 0.749 0.857 0.510
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.640 0.776 0.449
Diez_Noismart_task5_2 AholabUSC2 Diez2020 0.615 0.775 0.382
Diez_Noismart_task5_3 AholabUSC3 Diez2020 0.579 0.771 0.387
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.858 0.915 0.649
Iqbal_Surrey_task5_2 PANN Iqbal2020 0.839 0.903 0.632
Iqbal_Surrey_task5_3 PANN_Pseud Iqbal2020 0.846 0.910 0.624
Iqbal_Surrey_task5_4 GCNN_Pseud Iqbal2020 0.825 0.901 0.604
JHKim_IVS_task5_1 EF_1 Kim2020 0.788 0.871 0.578
JHKim_IVS_task5_2 EF_2 Kim2020 0.792 0.880 0.586
JHKim_IVS_task5_3 EF_3 Kim2020 0.788 0.875 0.581
JHKim_IVS_task5_4 EF_4 Kim2020 0.781 0.879 0.569
Liu_BUPT_task5_1 LLHF1 Liu2020 0.748 0.847 0.593
Liu_BUPT_task5_2 LLHF2 Liu2020 0.748 0.853 0.594
Liu_BUPT_task5_3 LLHF3 Liu2020 0.755 0.862 0.599
Liu_BUPT_task5_4 LLHF4 Liu2020 0.744 0.851 0.594
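The LWLRAP column above is the label-weighted label-ranking average precision. A minimal numpy sketch of the overall metric, written as the mean precision-at-rank over all positive (sample, label) pairs, which is equivalent to the usual per-class-weighted formulation:

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_labels) binary ground-truth matrix
    scores: (n_samples, n_labels) predicted scores

    For every positive label of every sample, take the precision at the
    rank that label attains when the sample's labels are sorted by
    descending score; return the mean over all such pairs.
    """
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    precisions = []
    for t, s in zip(truth, scores):
        if not t.any():
            continue                          # samples with no labels are skipped
        order = np.argsort(-s)                # labels sorted by descending score
        hit = t[order]                        # which ranked labels are positives
        ranks = np.arange(1, len(s) + 1)
        cum_hits = np.cumsum(hit)
        precisions.extend(cum_hits[hit] / ranks[hit])
    return float(np.mean(precisions))
```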

Teams ranking

Submission code | Submission name | Technical report | Micro-AUPRC | LWLRAP | Macro-AUPRC
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.835 0.898 0.649
Bai_NWPU_task5_1 Bai_1 Bai2020 0.768 0.860 0.600
DCASE2020 baseline Baseline Cartwright2020 0.749 0.857 0.510
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.640 0.776 0.449
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.858 0.915 0.649
JHKim_IVS_task5_2 EF_2 Kim2020 0.792 0.880 0.586
Liu_BUPT_task5_3 LLHF3 Liu2020 0.755 0.862 0.599

Class-wise performance

Submission code | Submission name | Technical report | Macro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Arnault_MULT_task5_1 UrbanNet_1 Arnault2020 0.593 0.869 0.374 0.760 0.080 0.809 0.539 0.939 0.372
Arnault_MULT_task5_2 UrbanNet_2 Arnault2020 0.632 0.880 0.338 0.740 0.074 0.843 0.701 0.948 0.529
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.649 0.899 0.440 0.737 0.060 0.861 0.718 0.956 0.519
Bai_NWPU_task5_1 Bai_1 Bai2020 0.600 0.859 0.365 0.710 0.050 0.784 0.668 0.943 0.423
Bai_NWPU_task5_2 Bai_2 Bai2020 0.547 0.676 0.322 0.712 0.097 0.755 0.499 0.931 0.382
Bai_NWPU_task5_3 Bai_3 Bai2020 0.537 0.676 0.233 0.556 0.117 0.755 0.644 0.931 0.382
Bai_NWPU_task5_4 Bai_4 Bai2020 0.576 0.789 0.295 0.713 0.114 0.778 0.582 0.941 0.400
DCASE2020 baseline Baseline Cartwright2020 0.510 0.823 0.230 0.576 0.265 0.519 0.483 0.935 0.248
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.449 0.780 0.133 0.586 0.055 0.597 0.356 0.818 0.263
Diez_Noismart_task5_2 AholabUSC2 Diez2020 0.382 0.749 0.145 0.454 0.025 0.591 0.112 0.782 0.199
Diez_Noismart_task5_3 AholabUSC3 Diez2020 0.387 0.690 0.169 0.506 0.009 0.547 0.289 0.718 0.164
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.649 0.903 0.306 0.762 0.200 0.845 0.658 0.961 0.561
Iqbal_Surrey_task5_2 PANN Iqbal2020 0.632 0.889 0.263 0.756 0.202 0.834 0.640 0.956 0.515
Iqbal_Surrey_task5_3 PANN_Pseud Iqbal2020 0.624 0.884 0.258 0.746 0.196 0.836 0.640 0.955 0.475
Iqbal_Surrey_task5_4 GCNN_Pseud Iqbal2020 0.604 0.866 0.290 0.719 0.108 0.815 0.640 0.935 0.459
JHKim_IVS_task5_1 EF_1 Kim2020 0.578 0.852 0.258 0.712 0.152 0.754 0.560 0.932 0.407
JHKim_IVS_task5_2 EF_2 Kim2020 0.586 0.847 0.286 0.710 0.150 0.796 0.556 0.928 0.413
JHKim_IVS_task5_3 EF_3 Kim2020 0.581 0.845 0.248 0.703 0.131 0.785 0.576 0.932 0.428
JHKim_IVS_task5_4 EF_4 Kim2020 0.569 0.840 0.314 0.702 0.050 0.764 0.569 0.921 0.391
Liu_BUPT_task5_1 LLHF1 Liu2020 0.593 0.848 0.345 0.654 0.140 0.763 0.606 0.934 0.454
Liu_BUPT_task5_2 LLHF2 Liu2020 0.594 0.826 0.369 0.665 0.095 0.767 0.653 0.928 0.451
Liu_BUPT_task5_3 LLHF3 Liu2020 0.599 0.834 0.388 0.644 0.122 0.774 0.628 0.929 0.472
Liu_BUPT_task5_4 LLHF4 Liu2020 0.594 0.825 0.375 0.670 0.082 0.765 0.659 0.927 0.450

Fine-level prediction

System ranking

Submission code | Submission name | Technical report | Micro-AUPRC | Macro-AUPRC
Arnault_MULT_task5_1 UrbanNet_1 Arnault2020 0.724 0.532
Arnault_MULT_task5_2 UrbanNet_2 Arnault2020 0.726 0.561
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.755 0.581
Bai_NWPU_task5_1 Bai_1 Bai2020 0.596 0.484
Bai_NWPU_task5_2 Bai_2 Bai2020 0.515 0.468
Bai_NWPU_task5_3 Bai_3 Bai2020 0.548 0.483
Bai_NWPU_task5_4 Bai_4 Bai2020 0.655 0.509
DCASE2020 baseline Baseline Cartwright2020 0.618 0.432
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.527 0.370
Diez_Noismart_task5_2 AholabUSC2 Diez2020 0.466 0.318
Diez_Noismart_task5_3 AholabUSC3 Diez2020 0.498 0.332
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.768 0.573
Iqbal_Surrey_task5_2 PANN Iqbal2020 0.733 0.552
Iqbal_Surrey_task5_3 PANN_Pseud Iqbal2020 0.747 0.546
Iqbal_Surrey_task5_4 GCNN_Pseud Iqbal2020 0.723 0.524
JHKim_IVS_task5_1 EF_1 Kim2020 0.653 0.503
JHKim_IVS_task5_2 EF_2 Kim2020 0.658 0.516
JHKim_IVS_task5_3 EF_3 Kim2020 0.654 0.514
JHKim_IVS_task5_4 EF_4 Kim2020 0.673 0.490
Liu_BUPT_task5_1 LLHF1 Liu2020 0.676 0.518
Liu_BUPT_task5_2 LLHF2 Liu2020 0.681 0.523
Liu_BUPT_task5_3 LLHF3 Liu2020 0.663 0.503
Liu_BUPT_task5_4 LLHF4 Liu2020 0.680 0.523

Teams ranking

Submission code | Submission name | Technical report | Micro-AUPRC | Macro-AUPRC
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.755 0.581
Bai_NWPU_task5_4 Bai_4 Bai2020 0.655 0.509
DCASE2020 baseline Baseline Cartwright2020 0.618 0.432
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.527 0.370
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.768 0.573
JHKim_IVS_task5_2 EF_2 Kim2020 0.658 0.516
Liu_BUPT_task5_4 LLHF4 Liu2020 0.680 0.523

Class-wise performance

Submission code | Submission name | Technical report | Macro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
Arnault_MULT_task5_1 UrbanNet_1 Arnault2020 0.532 0.684 0.172 0.766 0.108 0.794 0.447 0.910 0.372
Arnault_MULT_task5_2 UrbanNet_2 Arnault2020 0.561 0.609 0.162 0.747 0.087 0.814 0.594 0.922 0.550
Arnault_MULT_task5_3 UrbanNet_3 Arnault2020 0.581 0.659 0.225 0.737 0.108 0.811 0.637 0.924 0.548
Bai_NWPU_task5_1 Bai_1 Bai2020 0.484 0.517 0.186 0.717 0.045 0.654 0.390 0.888 0.478
Bai_NWPU_task5_2 Bai_2 Bai2020 0.468 0.547 0.217 0.645 0.044 0.689 0.396 0.885 0.323
Bai_NWPU_task5_3 Bai_3 Bai2020 0.483 0.547 0.147 0.725 0.033 0.689 0.513 0.885 0.323
Bai_NWPU_task5_4 Bai_4 Bai2020 0.509 0.655 0.158 0.718 0.036 0.750 0.462 0.909 0.383
DCASE2020 baseline Baseline Cartwright2020 0.432 0.573 0.184 0.576 0.144 0.442 0.405 0.885 0.248
Diez_Noismart_task5_1 AholabUSC1 Diez2020 0.370 0.561 0.140 0.632 0.023 0.471 0.083 0.736 0.315
Diez_Noismart_task5_2 AholabUSC2 Diez2020 0.318 0.492 0.049 0.498 0.011 0.522 0.069 0.697 0.204
Diez_Noismart_task5_3 AholabUSC3 Diez2020 0.332 0.496 0.030 0.591 0.004 0.513 0.183 0.670 0.167
Iqbal_Surrey_task5_1 PANN_Ens Iqbal2020 0.573 0.710 0.175 0.762 0.156 0.797 0.501 0.924 0.561
Iqbal_Surrey_task5_2 PANN Iqbal2020 0.552 0.666 0.184 0.756 0.151 0.762 0.472 0.913 0.515
Iqbal_Surrey_task5_3 PANN_Pseud Iqbal2020 0.546 0.651 0.124 0.746 0.198 0.786 0.477 0.914 0.475
Iqbal_Surrey_task5_4 GCNN_Pseud Iqbal2020 0.524 0.632 0.123 0.719 0.084 0.763 0.511 0.901 0.459
JHKim_IVS_task5_1 EF_1 Kim2020 0.503 0.632 0.161 0.712 0.100 0.700 0.477 0.835 0.407
JHKim_IVS_task5_2 EF_2 Kim2020 0.516 0.621 0.187 0.710 0.114 0.736 0.486 0.856 0.413
JHKim_IVS_task5_3 EF_3 Kim2020 0.514 0.627 0.160 0.703 0.100 0.730 0.513 0.851 0.428
JHKim_IVS_task5_4 EF_4 Kim2020 0.490 0.614 0.194 0.702 0.043 0.701 0.427 0.851 0.391
Liu_BUPT_task5_1 LLHF1 Liu2020 0.518 0.656 0.170 0.646 0.192 0.703 0.423 0.902 0.454
Liu_BUPT_task5_2 LLHF2 Liu2020 0.523 0.642 0.207 0.640 0.108 0.715 0.500 0.896 0.473
Liu_BUPT_task5_3 LLHF3 Liu2020 0.503 0.617 0.136 0.621 0.170 0.697 0.451 0.890 0.444
Liu_BUPT_task5_4 LLHF4 Liu2020 0.523 0.643 0.203 0.651 0.091 0.715 0.508 0.897 0.479

System characteristics

Code | Technical report | Coarse Macro-AUPRC | Fine Macro-AUPRC | Input | Sampling rate | Data augmentation | Features | STC external data and sources | Other external data and sources | Model complexity | Classifier | Ensemble subsystems | Used annotator ID | Used proximity | Used sensor ID | Used borough | Used block | Used latitude | Used longitude | Used year | Used week | Used day | Used hour | Aggregation method | Target level | Target method | System relabeling
Arnault_MULT_task5_1 Arnault2020 0.593 0.532 mono 44.1kHz shift-scale-rotate, grid-distortion, cutout, mixup spectrogram 13244589 CRNN False False False False False True True False True True True att pooling both minority vote automatic
Arnault_MULT_task5_2 Arnault2020 0.632 0.561 mono 44.1kHz spec-augment, shift-scale-rotate, grid-distortion, cutout spectrogram (audio_data; AudioSet) 22748557 CRNN False False False False False True True False True True True att pooling both minority vote automatic
Arnault_MULT_task5_3 Arnault2020 0.649 0.581 mono 44.1kHz shift-scale-rotate, grid-distortion, cutout spectrogram (audio_data; AudioSet) 22776985 CRNN False False False False False True True False True True True att pooling both minority vote automatic
Bai_NWPU_task5_1 Bai2020 0.600 0.484 mono 22.05kHz log-mel spectrogram, log-linear spectrogram, log-mel-h 2416848 CNN 3 False False False False False True True False True True True auto both minority vote
Bai_NWPU_task5_2 Bai2020 0.547 0.468 mono 22.05kHz mixup log-mel spectrogram 2416848 CNN False False False False False True True False True True True auto both minority vote
Bai_NWPU_task5_3 Bai2020 0.537 0.483 mono 22.05kHz mixup log-mel spectrogram, log-linear spectrogram, log-mel-h 2416848 CNN 3 False False False False False True True False True True True auto both minority vote
Bai_NWPU_task5_4 Bai2020 0.576 0.509 mono 22.05kHz mixup log-mel spectrogram 2416848 CNN, CRNN 2 False False False False False True True False True True True auto both minority vote
DCASE2020 baseline Cartwright2020 0.510 0.432 mono 48kHz openl3 79534 MLP False False False False False True True False True True True fine minority vote
Diez_Noismart_task5_1 Diez2020 0.449 0.370 mono 48kHz log-mel spectrogram 697.856 CNN False False False False False True True False True True True both minority vote
Diez_Noismart_task5_2 Diez2020 0.382 0.318 mono 48kHz log-mel spectrogram 716.936 CNN, MLP False False False False False True True False True True True both minority vote
Diez_Noismart_task5_3 Diez2020 0.387 0.332 mono 48kHz log-mel spectrogram 717.056 CNN, MLP False False True True True True True False True True True both minority vote
Iqbal_Surrey_task5_1 Iqbal2020 0.649 0.573 mono 32kHz tf-masking log-mel spectrogram (audio_data; AudioSet) 19866868 CNN 4 False False False False False False False False True True True fine minority vote automatic
Iqbal_Surrey_task5_2 Iqbal2020 0.632 0.552 mono 32kHz tf-masking log-mel spectrogram (audio_data; AudioSet) 4966717 CNN False False False False False False False False True True True fine minority vote
Iqbal_Surrey_task5_3 Iqbal2020 0.624 0.546 mono 32kHz tf-masking log-mel spectrogram (audio_data; AudioSet) 4966717 CNN False False False False False False False False True True True fine minority vote automatic
Iqbal_Surrey_task5_4 Iqbal2020 0.604 0.524 mono 32kHz tf-masking log-mel spectrogram 18825469 GCNN False False False False False False False False True True True fine minority vote automatic
JHKim_IVS_task5_1 Kim2020 0.578 0.503 mono 44.1kHz HPSS, log-mel spectrogram (pre-trained model; ImageNet based trained weights of EfficientNet) 71367 CNN False False False False False True True False True True True minority vote
JHKim_IVS_task5_2 Kim2020 0.586 0.516 mono 44.1kHz HPSS, log-mel spectrogram (pre-trained model; ImageNet based trained weights of EfficientNet) 71367 CNN False False False False False True True False True True True minority vote
JHKim_IVS_task5_3 Kim2020 0.581 0.514 mono 44.1kHz HPSS, log-mel spectrogram (pre-trained model; ImageNet based trained weights of EfficientNet) 71367 CNN False False False False False True True False True True True minority vote
JHKim_IVS_task5_4 Kim2020 0.569 0.490 mono 44.1kHz HPSS, log-mel spectrogram (pre-trained model; ImageNet based trained weights of EfficientNet) 71367 CNN False False False False False True True False True True True minority vote
Liu_BUPT_task5_1 Liu2020 0.593 0.518 mono 32kHz mixup log-mel spectrogram 57.9M CNN 11 False False False False False True True True True True True both minority vote
Liu_BUPT_task5_2 Liu2020 0.594 0.523 mono 32kHz mixup log-mel spectrogram 31.5M CNN 6 False False False False False True True True True True True both minority vote
Liu_BUPT_task5_3 Liu2020 0.599 0.503 mono 32kHz mixup log-mel spectrogram 15.8M CNN 3 False False False False False True True True True True True both minority vote
Liu_BUPT_task5_4 Liu2020 0.594 0.523 mono 32kHz mixup log-mel spectrogram 21M CNN 11 False False False False False True True True True True True both minority vote

Technical reports

CRNNs for Urban Sound Tagging with Spatiotemporal Context

Augustin Arnault and Nicolas Riche
Department of Artificial Intelligence, Mons, Belgium

Abstract

This paper describes the CRNNs we used to participate in Task 5 of the DCASE 2020 challenge. This task focuses on hierarchical multi-label urban sound tagging with spatiotemporal context. The code is available in our GitHub repository at https://github.com/multitelai/urban-sound-tagging.
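The system characteristics table reports attention pooling ("att pooling") as this submission's aggregation method. A minimal numpy sketch of attention pooling over per-frame CRNN outputs; the shapes and the softmax-over-time form are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def attention_pool(frame_logits, frame_attn):
    """Aggregate per-frame class logits into clip-level probabilities.

    frame_logits: (T, C) per-frame class logits
    frame_attn:   (T, C) unnormalised per-frame attention scores
    """
    attn = np.exp(frame_attn - frame_attn.max(axis=0))   # softmax over time,
    attn /= attn.sum(axis=0)                             # one weight per frame/class
    probs = 1.0 / (1.0 + np.exp(-frame_logits))          # per-frame sigmoid
    return (attn * probs).sum(axis=0)                    # (C,) clip-level output
```

With uniform attention scores this reduces to plain mean pooling over frames, which makes the mechanism easy to sanity-check.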

PDF

Data Augmentation Based System for Urban Sound Tagging

Jisheng Bai1, Chen Chen1, Jianfeng Chen1, Mou Wang1, Xiaolei Zhang1 and Qingli Yan2
1School of Marine Science and Technology, Xi'an, China, 2School of Computer Science and Technology, Xi'an, China

Abstract

In this report, we introduce our system for Task 5 of the DCASE 2020 challenge (Urban Sound Tagging with Spatiotemporal Context). We present a fusion system based on different features and data augmentation. The task focuses on predicting whether each of 23 sources of noise pollution is present or absent in a 10-second scene, given the original recordings and additional spatiotemporal context [1]. There are two levels of taxonomy on which to train a model. To explore features for detecting various sources of urban sound, we extracted four different features from the original recordings. We applied a 9-layer convolutional neural network (CNN) as our primary classifier. To mitigate the imbalance between classes, we applied data augmentation methods. The experimental results show that our system performed better than the baseline on the validation data.
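Mixup, the augmentation this and several other submissions list, blends pairs of training examples and their multi-hot labels. A minimal sketch; the mixing coefficient is normally drawn from a Beta(alpha, alpha) distribution, but here it is passed explicitly so the example stays deterministic:

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Convexly combine two examples (e.g. log-mel patches) and their
    multi-hot label vectors with mixing coefficient lam in [0, 1]."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

During training, `lam = np.random.beta(alpha, alpha)` with a small alpha (e.g. 0.2) keeps most mixed examples close to one of the two originals.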

PDF

An Urban Sound Tagging Dataset with Spatiotemporal Context

Mark Cartwright1, Jason Cramer2, Ana Elisa Mendez Mendez3, Yu Wang3, Ho-Hsiang Wu3, Vincent Lostanlen4, Magdalena Fuentes3, Justin Salamon5 and Juan P. Bello1
1Music and Audio Research Laboratory, Department of Computer Science and Engineering, Center for Urban Science and Progress, New York, New York, USA, 2Music and Audio Research Laboratory, Department of Electrical and Computer Engineering, New York, New York, USA, 3Music and Audio Research Laboratory, Department of Music and Performing Arts Professions, New York, New York, USA, 4Cornell Lab of Ornithology, Ithaca, New York, USA, 5Machine Perception Team, San Francisco, CA, USA

Abstract

We present SONYC-UST v2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed at the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. It consists of 18510 audio recordings from the 'Sounds of New York City' (SONYC) acoustic sensor network, including relevant metadata about when and where the data were recorded, at the hour and block level. The dataset has annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification by our team. In this work we describe the data collection, the metrics used to evaluate tagging systems, and the results of a simple baseline model that exploits temporal information.
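One common way to feed periodic temporal metadata (hour of day, day of week, week of year) to a model is a cyclic sine/cosine encoding, so that e.g. hour 23 lands near hour 0. The baseline may encode time differently (e.g. one-hot), so treat this as a generic sketch rather than the baseline's exact scheme:

```python
import math

def cyclic_encode(value, period):
    """Map a periodic quantity to a point on the unit circle.

    value:  e.g. hour in [0, 24), day in [0, 7), week in [0, 52)
    period: the length of the cycle
    Returns (sin, cos) so that value and value + period coincide.
    """
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)
```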

PDF

Urban Sound Classification Using Convolutional Neural Networks for DCASE 2020 Challenge

Itxasne Diez1, Peio Gonzalez2 and Ibon Gonzalez2
1Getxo, Basque Country, Spain, 2HiTZ Center – Aholab Signal Processing Laboratory, Bilbao, Basque Country, Spain

Abstract

This technical report describes our system proposed for Task 5 - Urban Sound Tagging. The system has a core architecture based on convolutional neural networks. This neural network uses log mel-spectrogram features as input, which are processed by two CNN layers. The output of the convolutional stack is processed by several fully connected layers plus an output layer to produce the classification decision. Since spatiotemporal context data is also available, we propose a multi-input architecture with two input branches that are merged for the final processing. The spatiotemporal context information is processed by an additional neural network of two fully connected layers. Its output is merged with the output of the CNN stack, and the resulting data is fed to the fully connected output block. In this report, we describe the proposed models in detail and compare them to the baseline approach using the provided development datasets. Finally, we present the results obtained with the validation split of the dataset.
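The two-branch idea described above can be sketched in plain numpy. The weights are random and the layer sizes are illustrative; the sketch only shows how an audio embedding and a metadata embedding are concatenated before the output layer, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS, N_CTX, N_CLASSES = 64, 11, 8                 # illustrative sizes

W_audio = rng.standard_normal((N_MELS, 32))          # audio-branch dense layer
W_ctx = rng.standard_normal((N_CTX, 8))              # metadata-branch dense layer
W_out = rng.standard_normal((32 + 8, N_CLASSES))     # shared output layer

def two_branch_forward(spec, context):
    """spec: (N_MELS, n_frames) log-mel patch; context: (N_CTX,) metadata.

    The audio branch is mean-pooled over time and embedded, the metadata
    branch is embedded separately, the two embeddings are concatenated,
    and a sigmoid output layer yields one probability per tag."""
    audio_emb = np.tanh(spec.mean(axis=1) @ W_audio)   # (32,)
    ctx_emb = np.tanh(context @ W_ctx)                 # (8,)
    merged = np.concatenate([audio_emb, ctx_emb])      # (40,)
    return 1.0 / (1.0 + np.exp(-(merged @ W_out)))     # (N_CLASSES,)
```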

PDF

Incorporating Auxiliary Data for Urban Sound Tagging

Turab Iqbal, Yin Cao, Mark D. Plumbley and Wenwu Wang
Centre for Vision, Speech and Signal Processing, Guildford, Surrey, UK

Abstract

DCASE 2020 Task 5 presents a multi-label sound tagging problem for the detection of urban noises in acoustic scenes. The main theme is the use of auxiliary data to facilitate sound tagging. In this report, we provide a detailed description of our submission for Task 5 and present experimental results for the development set. Two different network choices are described: a pre-trained convolutional neural network (CNN) and a randomly-initialised gated CNN. To make use of the auxiliary information, we construct a feature vector based on the spatiotemporal metadata and use it in parallel with log-mel spectrogram features. Moreover, we address the presence of multiple annotations per recording by using a pseudo-labelling technique to estimate the true labels. Mean ensembling is also used with one of the proposed systems to combine several models.
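The authors' pseudo-labelling estimates the true labels with the help of model predictions. A much simpler way to aggregate multiple annotations, the "minority vote" rule that most submissions in the characteristics table report as their target method, can be sketched as:

```python
def minority_vote(annotations):
    """annotations: list of per-annotator binary tag vectors for one clip.

    Under minority vote, a tag counts as present if ANY annotator marked
    it, i.e. a single positive vote is enough to set the target label."""
    return [int(any(votes)) for votes in zip(*annotations)]
```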

PDF

Urban Sound Tagging Using Multi-Channel Audio Feature with Convolutional Neural Networks

Jaehun Kim
AI Research Lab, Seoul, South Korea

Abstract

This paper presents a multi-channel audio feature using an ImageNet-based convolutional neural network model for DCASE 2020 Task 5, Urban Sound Tagging (UST) with Spatiotemporal Context (STC). We used the SONYC (Sounds of New York City) Urban Sound Tagging dataset, which consists of audio clips and STC information. We propose a multi-channel audio feature in order to use ImageNet pre-trained model weights: the feature consists of log-mel spectrograms of the raw audio and of its harmonic and percussive components obtained via harmonic-percussive source separation (HPSS). We used pre-trained EfficientNet weights.
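The HPSS step can be illustrated with the classic median-filtering formulation (Fitzgerald, 2010): harmonic energy is smooth along time, percussive energy along frequency. In practice a library routine such as librosa's HPSS would be used; this numpy version only sketches the idea:

```python
import numpy as np

def hpss_masks(S, kernel=17):
    """Median-filtering HPSS on a magnitude spectrogram S of shape
    (freq, time). Returns soft (harmonic, percussive) masks."""
    pad = kernel // 2
    # Median over time for each frequency bin -> harmonic-enhanced copy.
    St = np.pad(S, ((0, 0), (pad, pad)), mode="edge")
    H = np.stack([np.median(St[:, i:i + kernel], axis=1)
                  for i in range(S.shape[1])], axis=1)
    # Median over frequency for each frame -> percussive-enhanced copy.
    Sf = np.pad(S, ((pad, pad), (0, 0)), mode="edge")
    P = np.stack([np.median(Sf[i:i + kernel, :], axis=0)
                  for i in range(S.shape[0])], axis=0)
    mask_h = H / (H + P + 1e-10)
    return mask_h, 1.0 - mask_h
```

Stacking log-mel spectrograms of the raw, harmonic-masked, and percussive-masked signals then yields a 3-channel input compatible with ImageNet-style networks.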

PDF

Multisystem Fusion Model Based on Tag Relationship

Gang Liu, Zhuangzhuang Liu, Junyan Fang and Xiaofeng Hong
Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China

Abstract

Audio tagging aims to assign one or more labels to an audio clip. In this paper, we propose the solutions applied in our submission for DCASE 2020 Task 5, which focuses on predicting whether each of the sources of noise pollution is present or absent in a 10-second scene [1], taking the spatiotemporal context (STC) into account. We used VGG as our basic model and multi-task learning as the method to train our models. We introduced the relationship between fine labels and coarse labels into our system. Finally, the coarse-grained and fine-grained taxonomy results are evaluated with the Micro Area Under the Precision-Recall Curve (AUPRC), Micro F1 score and Macro AUPRC.
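One simple way to encode a fine/coarse tag relationship at inference time is a consistency rule: a coarse tag should be at least as likely as its most likely fine child. This is a generic post-processing sketch with a hypothetical index mapping, not necessarily the multi-task formulation used by the authors:

```python
import numpy as np

# Hypothetical mapping from coarse tag indices to fine child indices.
COARSE_TO_FINE = {0: [0, 1, 2], 1: [3, 4]}

def enforce_hierarchy(p_fine, p_coarse):
    """Raise each coarse probability to at least the maximum probability
    among its fine children, so predictions never contradict the taxonomy."""
    p_coarse = p_coarse.copy()
    for c, children in COARSE_TO_FINE.items():
        p_coarse[c] = max(p_coarse[c], p_fine[children].max())
    return p_coarse
```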

PDF