Task description
The goal of urban sound tagging with spatiotemporal context is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene, given both the audio and the time and location of the recording. These sources of noise are also grouped into 8 coarse-level categories. All of the recordings come from an urban acoustic sensor network in New York City. The training set was annotated by volunteers on the Zooniverse citizen-science platform, and the verified subsets (including the test set) were annotated by the task organizers.
A more detailed task description can be found on the task description page.
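The ranking tables below report Micro-AUPRC and Macro-AUPRC (area under the precision-recall curve, micro- and macro-averaged over tags) and, at the coarse level, LWLRAP. The official evaluation code ships with the task; the sketch below only illustrates how such multi-label metrics are commonly computed with scikit-learn on invented toy arrays, and its LRAP call is the unweighted relative of the label-weighted LWLRAP used in the tables.

```python
import numpy as np
from sklearn.metrics import (average_precision_score,
                             label_ranking_average_precision_score)

# Toy arrays for illustration: one row per 10-second recording, one column
# per tag (8 columns at the coarse level, 23 at the fine level).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # binary reference tags
y_score = np.array([[0.9, 0.2, 0.7],                  # sigmoid outputs
                    [0.1, 0.8, 0.3],
                    [0.6, 0.7, 0.2]])

micro_auprc = average_precision_score(y_true, y_score, average="micro")
macro_auprc = average_precision_score(y_true, y_score, average="macro")
# Related to LWLRAP, but the challenge metric additionally weights each
# label by its frequency, which this call does not do.
lrap = label_ranking_average_precision_score(y_true, y_score)
print(micro_auprc, macro_auprc, lrap)
```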
Coarse-level prediction
System ranking
| Submission code | Submission name | Technical Report | Micro-AUPRC | LWLRAP | Macro-AUPRC |
| --- | --- | --- | --- | --- | --- |
| Arnault_MULT_task5_1 | UrbanNet_1 | Arnault2020 | 0.767 | 0.846 | 0.593 |
| Arnault_MULT_task5_2 | UrbanNet_2 | Arnault2020 | 0.815 | 0.881 | 0.632 |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.835 | 0.898 | 0.649 |
| Bai_NWPU_task5_1 | Bai_1 | Bai2020 | 0.768 | 0.860 | 0.600 |
| Bai_NWPU_task5_2 | Bai_2 | Bai2020 | 0.542 | 0.720 | 0.547 |
| Bai_NWPU_task5_3 | Bai_3 | Bai2020 | 0.562 | 0.760 | 0.537 |
| Bai_NWPU_task5_4 | Bai_4 | Bai2020 | 0.601 | 0.745 | 0.576 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.749 | 0.857 | 0.510 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.640 | 0.776 | 0.449 |
| Diez_Noismart_task5_2 | AholabUSC2 | Diez2020 | 0.615 | 0.775 | 0.382 |
| Diez_Noismart_task5_3 | AholabUSC3 | Diez2020 | 0.579 | 0.771 | 0.387 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.858 | 0.915 | 0.649 |
| Iqbal_Surrey_task5_2 | PANN | Iqbal2020 | 0.839 | 0.903 | 0.632 |
| Iqbal_Surrey_task5_3 | PANN_Pseud | Iqbal2020 | 0.846 | 0.910 | 0.624 |
| Iqbal_Surrey_task5_4 | GCNN_Pseud | Iqbal2020 | 0.825 | 0.901 | 0.604 |
| JHKim_IVS_task5_1 | EF_1 | Kim2020 | 0.788 | 0.871 | 0.578 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.792 | 0.880 | 0.586 |
| JHKim_IVS_task5_3 | EF_3 | Kim2020 | 0.788 | 0.875 | 0.581 |
| JHKim_IVS_task5_4 | EF_4 | Kim2020 | 0.781 | 0.879 | 0.569 |
| Liu_BUPT_task5_1 | LLHF1 | Liu2020 | 0.748 | 0.847 | 0.593 |
| Liu_BUPT_task5_2 | LLHF2 | Liu2020 | 0.748 | 0.853 | 0.594 |
| Liu_BUPT_task5_3 | LLHF3 | Liu2020 | 0.755 | 0.862 | 0.599 |
| Liu_BUPT_task5_4 | LLHF4 | Liu2020 | 0.744 | 0.851 | 0.594 |
Teams ranking
| Submission code | Submission name | Technical Report | Micro-AUPRC | LWLRAP | Macro-AUPRC |
| --- | --- | --- | --- | --- | --- |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.835 | 0.898 | 0.649 |
| Bai_NWPU_task5_1 | Bai_1 | Bai2020 | 0.768 | 0.860 | 0.600 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.749 | 0.857 | 0.510 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.640 | 0.776 | 0.449 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.858 | 0.915 | 0.649 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.792 | 0.880 | 0.586 |
| Liu_BUPT_task5_3 | LLHF3 | Liu2020 | 0.755 | 0.862 | 0.599 |
Class-wise performance
| Submission code | Submission name | Technical Report | Macro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arnault_MULT_task5_1 | UrbanNet_1 | Arnault2020 | 0.593 | 0.869 | 0.374 | 0.760 | 0.080 | 0.809 | 0.539 | 0.939 | 0.372 |
| Arnault_MULT_task5_2 | UrbanNet_2 | Arnault2020 | 0.632 | 0.880 | 0.338 | 0.740 | 0.074 | 0.843 | 0.701 | 0.948 | 0.529 |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.649 | 0.899 | 0.440 | 0.737 | 0.060 | 0.861 | 0.718 | 0.956 | 0.519 |
| Bai_NWPU_task5_1 | Bai_1 | Bai2020 | 0.600 | 0.859 | 0.365 | 0.710 | 0.050 | 0.784 | 0.668 | 0.943 | 0.423 |
| Bai_NWPU_task5_2 | Bai_2 | Bai2020 | 0.547 | 0.676 | 0.322 | 0.712 | 0.097 | 0.755 | 0.499 | 0.931 | 0.382 |
| Bai_NWPU_task5_3 | Bai_3 | Bai2020 | 0.537 | 0.676 | 0.233 | 0.556 | 0.117 | 0.755 | 0.644 | 0.931 | 0.382 |
| Bai_NWPU_task5_4 | Bai_4 | Bai2020 | 0.576 | 0.789 | 0.295 | 0.713 | 0.114 | 0.778 | 0.582 | 0.941 | 0.400 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.510 | 0.823 | 0.230 | 0.576 | 0.265 | 0.519 | 0.483 | 0.935 | 0.248 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.449 | 0.780 | 0.133 | 0.586 | 0.055 | 0.597 | 0.356 | 0.818 | 0.263 |
| Diez_Noismart_task5_2 | AholabUSC2 | Diez2020 | 0.382 | 0.749 | 0.145 | 0.454 | 0.025 | 0.591 | 0.112 | 0.782 | 0.199 |
| Diez_Noismart_task5_3 | AholabUSC3 | Diez2020 | 0.387 | 0.690 | 0.169 | 0.506 | 0.009 | 0.547 | 0.289 | 0.718 | 0.164 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.649 | 0.903 | 0.306 | 0.762 | 0.200 | 0.845 | 0.658 | 0.961 | 0.561 |
| Iqbal_Surrey_task5_2 | PANN | Iqbal2020 | 0.632 | 0.889 | 0.263 | 0.756 | 0.202 | 0.834 | 0.640 | 0.956 | 0.515 |
| Iqbal_Surrey_task5_3 | PANN_Pseud | Iqbal2020 | 0.624 | 0.884 | 0.258 | 0.746 | 0.196 | 0.836 | 0.640 | 0.955 | 0.475 |
| Iqbal_Surrey_task5_4 | GCNN_Pseud | Iqbal2020 | 0.604 | 0.866 | 0.290 | 0.719 | 0.108 | 0.815 | 0.640 | 0.935 | 0.459 |
| JHKim_IVS_task5_1 | EF_1 | Kim2020 | 0.578 | 0.852 | 0.258 | 0.712 | 0.152 | 0.754 | 0.560 | 0.932 | 0.407 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.586 | 0.847 | 0.286 | 0.710 | 0.150 | 0.796 | 0.556 | 0.928 | 0.413 |
| JHKim_IVS_task5_3 | EF_3 | Kim2020 | 0.581 | 0.845 | 0.248 | 0.703 | 0.131 | 0.785 | 0.576 | 0.932 | 0.428 |
| JHKim_IVS_task5_4 | EF_4 | Kim2020 | 0.569 | 0.840 | 0.314 | 0.702 | 0.050 | 0.764 | 0.569 | 0.921 | 0.391 |
| Liu_BUPT_task5_1 | LLHF1 | Liu2020 | 0.593 | 0.848 | 0.345 | 0.654 | 0.140 | 0.763 | 0.606 | 0.934 | 0.454 |
| Liu_BUPT_task5_2 | LLHF2 | Liu2020 | 0.594 | 0.826 | 0.369 | 0.665 | 0.095 | 0.767 | 0.653 | 0.928 | 0.451 |
| Liu_BUPT_task5_3 | LLHF3 | Liu2020 | 0.599 | 0.834 | 0.388 | 0.644 | 0.122 | 0.774 | 0.628 | 0.929 | 0.472 |
| Liu_BUPT_task5_4 | LLHF4 | Liu2020 | 0.594 | 0.825 | 0.375 | 0.670 | 0.082 | 0.765 | 0.659 | 0.927 | 0.450 |
Fine-level prediction
System ranking
| Submission code | Submission name | Technical Report | Micro-AUPRC | Macro-AUPRC |
| --- | --- | --- | --- | --- |
| Arnault_MULT_task5_1 | UrbanNet_1 | Arnault2020 | 0.724 | 0.532 |
| Arnault_MULT_task5_2 | UrbanNet_2 | Arnault2020 | 0.726 | 0.561 |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.755 | 0.581 |
| Bai_NWPU_task5_1 | Bai_1 | Bai2020 | 0.596 | 0.484 |
| Bai_NWPU_task5_2 | Bai_2 | Bai2020 | 0.515 | 0.468 |
| Bai_NWPU_task5_3 | Bai_3 | Bai2020 | 0.548 | 0.483 |
| Bai_NWPU_task5_4 | Bai_4 | Bai2020 | 0.655 | 0.509 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.618 | 0.432 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.527 | 0.370 |
| Diez_Noismart_task5_2 | AholabUSC2 | Diez2020 | 0.466 | 0.318 |
| Diez_Noismart_task5_3 | AholabUSC3 | Diez2020 | 0.498 | 0.332 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.768 | 0.573 |
| Iqbal_Surrey_task5_2 | PANN | Iqbal2020 | 0.733 | 0.552 |
| Iqbal_Surrey_task5_3 | PANN_Pseud | Iqbal2020 | 0.747 | 0.546 |
| Iqbal_Surrey_task5_4 | GCNN_Pseud | Iqbal2020 | 0.723 | 0.524 |
| JHKim_IVS_task5_1 | EF_1 | Kim2020 | 0.653 | 0.503 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.658 | 0.516 |
| JHKim_IVS_task5_3 | EF_3 | Kim2020 | 0.654 | 0.514 |
| JHKim_IVS_task5_4 | EF_4 | Kim2020 | 0.673 | 0.490 |
| Liu_BUPT_task5_1 | LLHF1 | Liu2020 | 0.676 | 0.518 |
| Liu_BUPT_task5_2 | LLHF2 | Liu2020 | 0.681 | 0.523 |
| Liu_BUPT_task5_3 | LLHF3 | Liu2020 | 0.663 | 0.503 |
| Liu_BUPT_task5_4 | LLHF4 | Liu2020 | 0.680 | 0.523 |
Teams ranking
| Submission code | Submission name | Technical Report | Micro-AUPRC | Macro-AUPRC |
| --- | --- | --- | --- | --- |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.755 | 0.581 |
| Bai_NWPU_task5_4 | Bai_4 | Bai2020 | 0.655 | 0.509 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.618 | 0.432 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.527 | 0.370 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.768 | 0.573 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.658 | 0.516 |
| Liu_BUPT_task5_4 | LLHF4 | Liu2020 | 0.680 | 0.523 |
Class-wise performance
| Submission code | Submission name | Technical Report | Macro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arnault_MULT_task5_1 | UrbanNet_1 | Arnault2020 | 0.532 | 0.684 | 0.172 | 0.766 | 0.108 | 0.794 | 0.447 | 0.910 | 0.372 |
| Arnault_MULT_task5_2 | UrbanNet_2 | Arnault2020 | 0.561 | 0.609 | 0.162 | 0.747 | 0.087 | 0.814 | 0.594 | 0.922 | 0.550 |
| Arnault_MULT_task5_3 | UrbanNet_3 | Arnault2020 | 0.581 | 0.659 | 0.225 | 0.737 | 0.108 | 0.811 | 0.637 | 0.924 | 0.548 |
| Bai_NWPU_task5_1 | Bai_1 | Bai2020 | 0.484 | 0.517 | 0.186 | 0.717 | 0.045 | 0.654 | 0.390 | 0.888 | 0.478 |
| Bai_NWPU_task5_2 | Bai_2 | Bai2020 | 0.468 | 0.547 | 0.217 | 0.645 | 0.044 | 0.689 | 0.396 | 0.885 | 0.323 |
| Bai_NWPU_task5_3 | Bai_3 | Bai2020 | 0.483 | 0.547 | 0.147 | 0.725 | 0.033 | 0.689 | 0.513 | 0.885 | 0.323 |
| Bai_NWPU_task5_4 | Bai_4 | Bai2020 | 0.509 | 0.655 | 0.158 | 0.718 | 0.036 | 0.750 | 0.462 | 0.909 | 0.383 |
| DCASE2020 baseline | Baseline | Cartwright2020 | 0.432 | 0.573 | 0.184 | 0.576 | 0.144 | 0.442 | 0.405 | 0.885 | 0.248 |
| Diez_Noismart_task5_1 | AholabUSC1 | Diez2020 | 0.370 | 0.561 | 0.140 | 0.632 | 0.023 | 0.471 | 0.083 | 0.736 | 0.315 |
| Diez_Noismart_task5_2 | AholabUSC2 | Diez2020 | 0.318 | 0.492 | 0.049 | 0.498 | 0.011 | 0.522 | 0.069 | 0.697 | 0.204 |
| Diez_Noismart_task5_3 | AholabUSC3 | Diez2020 | 0.332 | 0.496 | 0.030 | 0.591 | 0.004 | 0.513 | 0.183 | 0.670 | 0.167 |
| Iqbal_Surrey_task5_1 | PANN_Ens | Iqbal2020 | 0.573 | 0.710 | 0.175 | 0.762 | 0.156 | 0.797 | 0.501 | 0.924 | 0.561 |
| Iqbal_Surrey_task5_2 | PANN | Iqbal2020 | 0.552 | 0.666 | 0.184 | 0.756 | 0.151 | 0.762 | 0.472 | 0.913 | 0.515 |
| Iqbal_Surrey_task5_3 | PANN_Pseud | Iqbal2020 | 0.546 | 0.651 | 0.124 | 0.746 | 0.198 | 0.786 | 0.477 | 0.914 | 0.475 |
| Iqbal_Surrey_task5_4 | GCNN_Pseud | Iqbal2020 | 0.524 | 0.632 | 0.123 | 0.719 | 0.084 | 0.763 | 0.511 | 0.901 | 0.459 |
| JHKim_IVS_task5_1 | EF_1 | Kim2020 | 0.503 | 0.632 | 0.161 | 0.712 | 0.100 | 0.700 | 0.477 | 0.835 | 0.407 |
| JHKim_IVS_task5_2 | EF_2 | Kim2020 | 0.516 | 0.621 | 0.187 | 0.710 | 0.114 | 0.736 | 0.486 | 0.856 | 0.413 |
| JHKim_IVS_task5_3 | EF_3 | Kim2020 | 0.514 | 0.627 | 0.160 | 0.703 | 0.100 | 0.730 | 0.513 | 0.851 | 0.428 |
| JHKim_IVS_task5_4 | EF_4 | Kim2020 | 0.490 | 0.614 | 0.194 | 0.702 | 0.043 | 0.701 | 0.427 | 0.851 | 0.391 |
| Liu_BUPT_task5_1 | LLHF1 | Liu2020 | 0.518 | 0.656 | 0.170 | 0.646 | 0.192 | 0.703 | 0.423 | 0.902 | 0.454 |
| Liu_BUPT_task5_2 | LLHF2 | Liu2020 | 0.523 | 0.642 | 0.207 | 0.640 | 0.108 | 0.715 | 0.500 | 0.896 | 0.473 |
| Liu_BUPT_task5_3 | LLHF3 | Liu2020 | 0.503 | 0.617 | 0.136 | 0.621 | 0.170 | 0.697 | 0.451 | 0.890 | 0.444 |
| Liu_BUPT_task5_4 | LLHF4 | Liu2020 | 0.523 | 0.643 | 0.203 | 0.651 | 0.091 | 0.715 | 0.508 | 0.897 | 0.479 |
System characteristics
| Code | Technical Report | Coarse Macro-AUPRC | Fine Macro-AUPRC | Input | Sampling rate | Data augmentation | Features | STC External data and sources | Other External data and sources | Model complexity | Classifier | Ensemble subsystems | Used annotator ID | Used proximity | Used sensor ID | Used borough | Used block | Used latitude | Used longitude | Used year | Used week | Used day | Used hour | Aggregation method | Target level | Target method | System relabeling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arnault_MULT_task5_1 | Arnault2020 | 0.593 | 0.532 | mono | 44.1kHz | shift-scale-rotate, grid-distortion, cutout, mixup | spectrogram | | | 13244589 | CRNN | | False | False | False | False | False | True | True | False | True | True | True | att pooling | both | minority vote | automatic |
| Arnault_MULT_task5_2 | Arnault2020 | 0.632 | 0.561 | mono | 44.1kHz | spec-augment, shift-scale-rotate, grid-distortion, cutout | spectrogram | | (audio_data; AudioSet) | 22748557 | CRNN | | False | False | False | False | False | True | True | False | True | True | True | att pooling | both | minority vote | automatic |
| Arnault_MULT_task5_3 | Arnault2020 | 0.649 | 0.581 | mono | 44.1kHz | shift-scale-rotate, grid-distortion, cutout | spectrogram | | (audio_data; AudioSet) | 22776985 | CRNN | | False | False | False | False | False | True | True | False | True | True | True | att pooling | both | minority vote | automatic |
| Bai_NWPU_task5_1 | Bai2020 | 0.600 | 0.484 | mono | 22.05kHz | | log-mel spectrogram, log-linear spectrogram, log-mel-h | | | 2416848 | CNN | 3 | False | False | False | False | False | True | True | False | True | True | True | auto | both | minority vote | |
| Bai_NWPU_task5_2 | Bai2020 | 0.547 | 0.468 | mono | 22.05kHz | mixup | log-mel spectrogram | | | 2416848 | CNN | | False | False | False | False | False | True | True | False | True | True | True | auto | both | minority vote | |
| Bai_NWPU_task5_3 | Bai2020 | 0.537 | 0.483 | mono | 22.05kHz | mixup | log-mel spectrogram, log-linear spectrogram, log-mel-h | | | 2416848 | CNN | 3 | False | False | False | False | False | True | True | False | True | True | True | auto | both | minority vote | |
| Bai_NWPU_task5_4 | Bai2020 | 0.576 | 0.509 | mono | 22.05kHz | mixup | log-mel spectrogram | | | 2416848 | CNN, CRNN | 2 | False | False | False | False | False | True | True | False | True | True | True | auto | both | minority vote | |
| DCASE2020 baseline | Cartwright2020 | 0.510 | 0.432 | mono | 48kHz | | openl3 | | | 79534 | MLP | | False | False | False | False | False | True | True | False | True | True | True | | fine | minority vote | |
| Diez_Noismart_task5_1 | Diez2020 | 0.449 | 0.370 | mono | 48kHz | | log-mel spectrogram | | | 697.856 | CNN | | False | False | False | False | False | True | True | False | True | True | True | | both | minority vote | |
| Diez_Noismart_task5_2 | Diez2020 | 0.382 | 0.318 | mono | 48kHz | | log-mel spectrogram | | | 716.936 | CNN, MLP | | False | False | False | False | False | True | True | False | True | True | True | | both | minority vote | |
| Diez_Noismart_task5_3 | Diez2020 | 0.387 | 0.332 | mono | 48kHz | | log-mel spectrogram | | | 717.056 | CNN, MLP | | False | False | True | True | True | True | True | False | True | True | True | | both | minority vote | |
| Iqbal_Surrey_task5_1 | Iqbal2020 | 0.649 | 0.573 | mono | 32kHz | tf-masking | log-mel spectrogram | | (audio_data; AudioSet) | 19866868 | CNN | 4 | False | False | False | False | False | False | False | False | True | True | True | | fine | minority vote | automatic |
| Iqbal_Surrey_task5_2 | Iqbal2020 | 0.632 | 0.552 | mono | 32kHz | tf-masking | log-mel spectrogram | | (audio_data; AudioSet) | 4966717 | CNN | | False | False | False | False | False | False | False | False | True | True | True | | fine | minority vote | |
| Iqbal_Surrey_task5_3 | Iqbal2020 | 0.624 | 0.546 | mono | 32kHz | tf-masking | log-mel spectrogram | | (audio_data; AudioSet) | 4966717 | CNN | | False | False | False | False | False | False | False | False | True | True | True | | fine | minority vote | automatic |
| Iqbal_Surrey_task5_4 | Iqbal2020 | 0.604 | 0.524 | mono | 32kHz | tf-masking | log-mel spectrogram | | | 18825469 | GCNN | | False | False | False | False | False | False | False | False | True | True | True | | fine | minority vote | automatic |
| JHKim_IVS_task5_1 | Kim2020 | 0.578 | 0.503 | mono | 44.1kHz | | HPSS, log-mel spectrogram | | (pre-trained model; ImageNet based trained weights of EfficientNet) | 71367 | CNN | | False | False | False | False | False | True | True | False | True | True | True | | | minority vote | |
| JHKim_IVS_task5_2 | Kim2020 | 0.586 | 0.516 | mono | 44.1kHz | | HPSS, log-mel spectrogram | | (pre-trained model; ImageNet based trained weights of EfficientNet) | 71367 | CNN | | False | False | False | False | False | True | True | False | True | True | True | | | minority vote | |
| JHKim_IVS_task5_3 | Kim2020 | 0.581 | 0.514 | mono | 44.1kHz | | HPSS, log-mel spectrogram | | (pre-trained model; ImageNet based trained weights of EfficientNet) | 71367 | CNN | | False | False | False | False | False | True | True | False | True | True | True | | | minority vote | |
| JHKim_IVS_task5_4 | Kim2020 | 0.569 | 0.490 | mono | 44.1kHz | | HPSS, log-mel spectrogram | | (pre-trained model; ImageNet based trained weights of EfficientNet) | 71367 | CNN | | False | False | False | False | False | True | True | False | True | True | True | | | minority vote | |
| Liu_BUPT_task5_1 | Liu2020 | 0.593 | 0.518 | mono | 32kHz | mixup | log-mel spectrogram | | | 57.9M | CNN | 11 | False | False | False | False | False | True | True | True | True | True | True | | both | minority vote | |
| Liu_BUPT_task5_2 | Liu2020 | 0.594 | 0.523 | mono | 32kHz | mixup | log-mel spectrogram | | | 31.5M | CNN | 6 | False | False | False | False | False | True | True | True | True | True | True | | both | minority vote | |
| Liu_BUPT_task5_3 | Liu2020 | 0.599 | 0.503 | mono | 32kHz | mixup | log-mel spectrogram | | | 15.8M | CNN | 3 | False | False | False | False | False | True | True | True | True | True | True | | both | minority vote | |
| Liu_BUPT_task5_4 | Liu2020 | 0.594 | 0.523 | mono | 32kHz | mixup | log-mel spectrogram | | | 21M | CNN | 11 | False | False | False | False | False | True | True | True | True | True | True | | both | minority vote | |
Technical reports
CRNNs for Urban Sound Tagging with Spatiotemporal Context
Augustin Arnault and Nicolas Riche
Department of Artificial Intelligence, Mons, Belgium
Abstract
This paper describes the CRNNs we used to participate in Task 5 of the DCASE 2020 Challenge. The task focuses on hierarchical multi-label urban sound tagging with spatiotemporal context. The code is available in our GitHub repository at https://github.com/multitelai/urban-sound-tagging.
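The abstract gives no architectural detail, but the system characteristics table above lists a CRNN with attention pooling for these submissions. As a loose, hypothetical illustration of that model family (not UrbanNet itself; all layer sizes are invented), here is a PyTorch sketch:

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Toy CRNN with per-class attention pooling; sizes are illustrative."""
    def __init__(self, n_mels=64, n_classes=23):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), 128,
                          batch_first=True, bidirectional=True)
        self.att = nn.Linear(256, n_classes)  # attention logits per frame
        self.cls = nn.Linear(256, n_classes)  # per-frame class scores

    def forward(self, spec):                   # spec: (batch, 1, n_mels, time)
        x = self.conv(spec)                    # (batch, 64, n_mels//4, time//4)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (batch, time//4, features)
        x, _ = self.gru(x)                     # (batch, time//4, 256)
        w = torch.softmax(self.att(x), dim=1)  # attention over time, per class
        return (w * torch.sigmoid(self.cls(x))).sum(dim=1)  # pooled tag probs
```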
Data Augmentation Based System for Urban Sound Tagging
Jisheng Bai1, Chen Chen1, Jianfeng Chen1, Mou Wang1, Xiaolei Zhang1 and Qingli Yan2
1School of Marine Science and Technology, Xi'an, China, 2School of Computer Science and Technology, Xi'an, China
Abstract
In this report, we introduce our system for Task 5 of the DCASE 2020 Challenge (Urban Sound Tagging with Spatiotemporal Context). We present a fusion system based on different features and data augmentation. The task focuses on predicting whether each of 23 sources of noise pollution is present or absent in a 10-second scene, given the original recordings and additional spatiotemporal context [1]. There are two levels of taxonomy on which to train a model. To explore features for detecting various sources of urban sound, we extracted four different features from the original recordings. We applied a 9-layer convolutional neural network (CNN) as our primary classifier. To mitigate the imbalance between classes, we applied data augmentation methods. The experimental results show that our system performed better than the baseline on the validation data.
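Per the abstract and the system characteristics table, mixup is the main augmentation in these submissions. A minimal sketch of mixup for multi-label audio tagging, where mixed binary tag vectors become soft targets (function and parameter names here are illustrative, not from the report):

```python
import numpy as np

def mixup(batch_x, batch_y, alpha=0.2, rng=None):
    """Mix random pairs of examples within a batch.

    batch_x: (batch, ...) feature arrays, e.g. log-mel spectrograms.
    batch_y: (batch, n_classes) binary tag vectors; the mix makes them soft.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient in (0, 1)
    perm = rng.permutation(len(batch_x))    # random pairing of examples
    mixed_x = lam * batch_x + (1 - lam) * batch_x[perm]
    mixed_y = lam * batch_y + (1 - lam) * batch_y[perm]
    return mixed_x, mixed_y
```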
An Urban Sound Tagging Dataset with Spatiotemporal Context
Mark Cartwright1, Jason Cramer2, Ana Elisa Mendez Mendez3, Yu Wang3, Ho-Hsiang Wu3, Vincent Lostanlen4, Magdalena Fuentes3, Justin Salamon5 and Juan P. Bello1
1Music and Audio Research Laboratory, Department of Computer Science and Engineering, Center for Urban Science and Progress, New York, New York, USA, 2Music and Audio Research Laboratory, Department of Electrical and Computer Engineering, New York, New York, USA, 3Music and Audio Research Laboratory, Department of Music and Performing Arts Professions, New York, New York, USA, 4Cornell Lab of Ornithology, Ithaca, New York, USA, 5Machine Perception Team, San Francisco, CA, USA
Abstract
We present SONYC-UST v2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed at the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. It consists of 18510 audio recordings from the 'Sounds of New York City' (SONYC) acoustic sensor network, including relevant metadata about when and where the data were recorded, at the hour and block level. The dataset has annotations by volunteers from the Zooniverse citizen science platform, as well as a two-stage verification by our team. In this work, we describe the data collection, the metrics used to evaluate tagging systems, and the results of a simple baseline model that exploits temporal information.
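The baseline's exact implementation ships with the task; purely as a hedged sketch of "exploiting temporal information", one common recipe is to one-hot encode the hour/day/week metadata and append it to a pooled audio embedding before a small MLP. All names and sizes below are illustrative:

```python
import numpy as np

def temporal_features(hour, day, week):
    """One-hot encode recording time (hour of day, day of week, week of year).

    Assumes weeks numbered 1..52; the actual metadata encoding may differ.
    """
    h = np.eye(24)[hour]
    d = np.eye(7)[day]
    w = np.eye(52)[week - 1]
    return np.concatenate([h, d, w])

# e.g. a mean-pooled OpenL3 clip embedding (512-d); zeros here as a stand-in
embedding = np.zeros(512)
x = np.concatenate([embedding, temporal_features(hour=14, day=2, week=30)])
# x then feeds a small MLP with 23 sigmoid outputs (the fine-level tags)
```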
Urban Sound Classification Using Convolutional Neural Networks for DCASE 2020 Challenge
Itxasne Diez1, Peio Gonzalez2 and Ibon Gonzalez2
1Getxo, Basque Country, Spain, 2HiTZ Center – Aholab Signal Processing Laboratory, Bilbao, Basque Country, Spain
Abstract
This technical report describes our system proposed for Task 5 (Urban Sound Tagging). The system's core architecture is based on convolutional neural networks. The network takes log mel-spectrogram features as input, which are processed by two CNN layers. The output of the convolutional stack is processed by several fully connected layers plus an output layer to produce the classification decision. Since spatiotemporal context data is also available, we propose a multi-input architecture with two input branches that are merged for the final processing. The spatiotemporal context information is processed by an additional neural network of two fully connected layers. Its output is merged with the output of the CNN stack, and the resulting data is fed to the fully connected output block. In this report, we describe the proposed models in detail and compare them to the baseline approach using the provided development datasets. Finally, we present the results obtained on the validation split of the dataset.
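As a rough, hypothetical sketch of the two-branch architecture the abstract describes (a CNN stack on log mel-spectrograms merged with a two-layer fully connected branch on spatiotemporal context), with every layer size below invented rather than taken from the report:

```python
import torch
import torch.nn as nn

class TwoBranchSketch(nn.Module):
    """Illustrative two-branch net: CNN on log mel-spectrograms, MLP on STC."""
    def __init__(self, n_mels=64, n_frames=431, stc_dim=16, n_classes=23):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Flatten(),
        )
        cnn_out = 32 * (n_mels // 16) * (n_frames // 16)
        self.stc = nn.Sequential(nn.Linear(stc_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(cnn_out + 32, 128), nn.ReLU(),
                                  nn.Linear(128, n_classes))

    def forward(self, spec, stc):
        # spec: (batch, 1, n_mels, n_frames); stc: (batch, stc_dim)
        merged = torch.cat([self.cnn(spec), self.stc(stc)], dim=1)
        return torch.sigmoid(self.head(merged))
```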
Incorporating Auxiliary Data for Urban Sound Tagging
Turab Iqbal, Yin Cao, Mark D. Plumbley and Wenwu Wang
Centre for Vision, Speech and Signal Processing, Guildford, Surrey, UK
Abstract
DCASE 2020 Task 5 presents a multi-label sound tagging problem for the detection of urban noises in acoustic scenes. The main theme is the use of auxiliary data to facilitate sound tagging. In this report, we provide a detailed description of our submission for Task 5 and present experimental results for the development set. Two different network choices are described: a pre-trained convolutional neural network (CNN) and a randomly-initialised gated CNN. To make use of the auxiliary information, we construct a feature vector based on the spatiotemporal metadata and use it in parallel with log-mel spectrogram features. Moreover, we address the presence of multiple annotations per recording by using a pseudo-labelling technique to estimate the true labels. Mean ensembling is also used with one of the proposed systems to combine several models.
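Of the techniques described, mean ensembling is the simplest to make concrete. The pseudo-labelling step is sketched only as a naive stand-in, not necessarily the method of the report:

```python
import numpy as np

def mean_ensemble(model_outputs):
    """Average the sigmoid outputs of several models (mean ensembling).

    model_outputs: list of arrays, each (n_clips, n_classes).
    """
    return np.mean(np.stack(model_outputs), axis=0)

def soft_targets(annotations):
    """Naive stand-in for label estimation from multiple crowdsourced
    annotations: average annotator votes into a soft label vector.

    annotations: (n_annotators, n_classes) binary votes.
    """
    return np.asarray(annotations).mean(axis=0)
```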
Urban Sound Tagging Using Multi-Channel Audio Feature with Convolutional Neural Networks
Jaehun Kim
AI Research Lab, Seoul, South Korea
Abstract
This paper presents a multi-channel audio feature, used with an ImageNet-pre-trained convolutional neural network, for DCASE 2020 Task 5, Urban Sound Tagging (UST) with Spatiotemporal Context (STC). We used the SONYC (Sounds of New York City) Urban Sound Tagging dataset, which consists of audio clips and STC information. We propose a multi-channel audio feature so that ImageNet pre-trained model weights can be reused: the channels consist of log-mel spectrograms of the raw signal and of its harmonic and percussive (HPSS) components. We used ImageNet pre-trained EfficientNet weights.
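A plausible reading of this multi-channel feature, sketched with librosa (the actual submission's parameters may differ): log-mel spectrograms of the raw, harmonic, and percussive signals stacked as three image channels, matching the 3-channel input that ImageNet-pre-trained networks such as EfficientNet expect.

```python
import librosa
import numpy as np

def hpss_logmel_channels(path, sr=44100, n_mels=128):
    """Stack log-mel spectrograms of the raw, harmonic, and percussive
    signals into a 3-channel 'image'. Parameter values are illustrative."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    harmonic, percussive = librosa.effects.hpss(y)
    channels = []
    for sig in (y, harmonic, percussive):
        mel = librosa.feature.melspectrogram(y=sig, sr=sr, n_mels=n_mels)
        channels.append(librosa.power_to_db(mel))
    return np.stack(channels)  # shape: (3, n_mels, n_frames)
```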
Multisystem Fusion Model Based on Tag Relationship
Gang Liu, Zhuangzhuang Liu, Junyan Fang and Xiaofeng Hong
Pattern Recognition and Intelligent System Laboratory (PRIS Lab), Beijing, China
Abstract
Audio tagging aims to assign one or more labels to an audio clip. In this paper, we propose the solutions applied in our submission for DCASE 2020 Task 5, which focuses on predicting whether each of the sources of noise pollution is present or absent in a 10-second scene [1], taking the spatiotemporal context (STC) into account. We used VGG as our basic model and multi-task learning as the method to train our models. We introduced the relationship between fine labels and coarse labels into our system. Finally, the coarse-grained and fine-grained taxonomy results are evaluated with the Micro Area Under the Precision-Recall Curve (AUPRC), the Micro F1 score, and the Macro AUPRC.
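One simple way to encode the fine/coarse tag relationship the abstract mentions is to derive each coarse probability from its fine children, shown here with a purely hypothetical index mapping (the real taxonomy groups the 23 fine tags into the 8 coarse categories, but not necessarily with these indices or this rule):

```python
import torch

# Hypothetical grouping of the 23 fine tags into the 8 coarse categories;
# the indices are for illustration only.
COARSE_TO_FINE = {0: [0, 1, 2, 3], 1: [4, 5], 2: [6], 3: [7, 8],
                  4: [9, 10, 11, 12], 5: [13, 14, 15, 16],
                  6: [17, 18, 19, 20, 21], 7: [22]}

def coarse_from_fine(fine_probs):
    """A coarse tag is present if any of its fine children is (noisy-OR
    approximated via max). fine_probs: (batch, 23) -> returns (batch, 8)."""
    cols = [fine_probs[:, idx].max(dim=1).values
            for idx in (COARSE_TO_FINE[k] for k in sorted(COARSE_TO_FINE))]
    return torch.stack(cols, dim=1)
```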