Task description
The goal of urban sound tagging (UST) is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene. These sources of noise are also grouped into 8 coarse-level categories. All of the recordings are from an urban acoustic sensor network in New York City. The training set was annotated by volunteers on the Zooniverse citizen-science platform, and the validation and test sets were annotated by the task organizers.
A more detailed task description can be found on the task description page.
Note that only teams that have open-sourced their systems are included in the system and team rankings.
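The tables below report Micro-F1, Micro-AUPRC, and Macro-AUPRC, which are standard multi-label tagging metrics. As a rough illustration only, the snippet below is a minimal scikit-learn sketch on toy data; it is not the official challenge evaluation code.

```python
# Minimal sketch of the tagging metrics reported below (Micro-F1, Micro-/Macro-AUPRC),
# computed with scikit-learn on toy data. This is NOT the official DCASE evaluation code.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Toy ground truth and predicted scores: 4 clips x 3 tags (multi-label).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 1, 1]])
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.8, 0.7, 0.3],
                    [0.1, 0.4, 0.6],
                    [0.2, 0.9, 0.8]])

micro_auprc = average_precision_score(y_true, y_score, average="micro")
macro_auprc = average_precision_score(y_true, y_score, average="macro")
# F1 needs binary decisions; threshold the scores at 0.5 for this toy example.
micro_f1 = f1_score(y_true, (y_score >= 0.5).astype(int), average="micro")
print(f"Micro-AUPRC={micro_auprc:.3f}  Macro-AUPRC={macro_auprc:.3f}  Micro-F1={micro_f1:.3f}")
```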
Coarse-level prediction
System ranking
These results only include systems for which the source code has been released.
Submission code | Submission name | Technical Report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.718 | 0.631 | 0.860 | |
Adapa_FH_task5_2 | MNv2_2 | Adapa2019 | 0.723 | 0.745 | 0.847 | |
Bai_NPU_task5_1 | multifeat1 | Bai2019 | 0.618 | 0.696 | 0.763 | |
Bai_NPU_task5_2 | multifeat2 | Bai2019 | 0.649 | 0.701 | 0.769 | |
Bai_NPU_task5_3 | multifeat3 | Bai2019 | 0.558 | 0.631 | 0.680 | |
Bai_NPU_task5_4 | multifeat4 | Bai2019 | 0.647 | 0.709 | 0.782 | |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.619 | 0.664 | 0.742 | |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.674 | 0.525 | 0.807 | |
Gousseau_OL_task5_1 | Gousseau1 | Gousseau2019 | 0.650 | 0.667 | 0.745 | |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.612 | 0.639 | 0.748 | |
Kim_NU_task5_1 | BK_CNN1 | Kim2019 | 0.653 | 0.686 | 0.761 | |
Kim_NU_task5_2 | BK_CNN2 | Kim2019 | 0.696 | 0.734 | 0.825 | |
Kim_NU_task5_3 | BK_CNN3 | Kim2019 | 0.697 | 0.730 | 0.809 | |
Kong_SURREY_task5_1 | cvssp_cnn9 | Kong2019 | 0.567 | 0.467 | 0.674 | |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.613 | 0.516 | 0.777 | |
Ng_NTU_task5_1 | Ng_1 | Ng2019 | 0.657 | 0.670 | 0.759 | |
Ng_NTU_task5_2 | Ng_2 | Ng2019 | 0.666 | 0.677 | 0.767 | |
Ng_NTU_task5_3 | Ng_3 | Ng2019 | 0.660 | 0.671 | 0.762 | |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.660 | 0.666 | 0.770 | |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.501 | 0.557 | 0.562 | |
Tompkins_MS_task5_1 | MS D365 AI 1 | Tompkins2019 | 0.646 | 0.631 | 0.779 | |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.666 | 0.552 | 0.788 | |
Tompkins_MS_task5_3 | MS D365 AI 3 | Tompkins2019 | 0.646 | 0.631 | 0.779 |
Teams ranking
This table includes only the best-performing reproducible system from each submitting team.
Submission code | Submission name | Technical Report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.718 | 0.631 | 0.860 | |
Bai_NPU_task5_4 | multifeat4 | Bai2019 | 0.647 | 0.709 | 0.782 | |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.619 | 0.664 | 0.742 | |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.674 | 0.525 | 0.807 | |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.612 | 0.639 | 0.748 | |
Kim_NU_task5_2 | BK_CNN2 | Kim2019 | 0.696 | 0.734 | 0.825 | |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.613 | 0.516 | 0.777 | |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.660 | 0.666 | 0.770 | |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.501 | 0.557 | 0.562 | |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.666 | 0.552 | 0.788 |
Class-wise performance
Submission code | Submission name | Technical Report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
---|---|---|---|---|---|---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.860 | 0.888 | 0.627 | 0.361 | 0.684 | 0.897 | 0.404 | 0.947 | 0.937 |
Adapa_FH_task5_2 | MNv2_2 | Adapa2019 | 0.847 | 0.878 | 0.578 | 0.344 | 0.643 | 0.875 | 0.586 | 0.949 | 0.931 |
Bai_NPU_task5_1 | multifeat1 | Bai2019 | 0.763 | 0.787 | 0.632 | 0.287 | 0.578 | 0.732 | 0.105 | 0.907 | 0.918 |
Bai_NPU_task5_2 | multifeat2 | Bai2019 | 0.769 | 0.792 | 0.602 | 0.363 | 0.658 | 0.804 | 0.171 | 0.909 | 0.896 |
Bai_NPU_task5_3 | multifeat3 | Bai2019 | 0.680 | 0.792 | 0.111 | 0.071 | 0.658 | 0.771 | 0.225 | 0.922 | 0.911 |
Bai_NPU_task5_4 | multifeat4 | Bai2019 | 0.782 | 0.809 | 0.637 | 0.347 | 0.628 | 0.781 | 0.151 | 0.916 | 0.912 |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.742 | 0.832 | 0.454 | 0.170 | 0.709 | 0.727 | 0.246 | 0.886 | 0.929 |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.807 | 0.859 | 0.598 | 0.405 | 0.739 | 0.773 | 0.268 | 0.883 | 0.863 |
Gousseau_OL_task5_1 | Gousseau1 | Gousseau2019 | 0.745 | 0.793 | 0.598 | 0.282 | 0.703 | 0.802 | 0.218 | 0.867 | 0.934 |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.748 | 0.813 | 0.403 | 0.253 | 0.698 | 0.778 | 0.166 | 0.871 | 0.916 |
Kim_NU_task5_1 | BK_CNN1 | Kim2019 | 0.761 | 0.863 | 0.548 | 0.202 | 0.717 | 0.791 | 0.276 | 0.910 | 0.918 |
Kim_NU_task5_2 | BK_CNN2 | Kim2019 | 0.825 | 0.849 | 0.643 | 0.308 | 0.686 | 0.850 | 0.358 | 0.944 | 0.931 |
Kim_NU_task5_3 | BK_CNN3 | Kim2019 | 0.809 | 0.831 | 0.650 | 0.290 | 0.674 | 0.856 | 0.402 | 0.934 | 0.941 |
Kong_SURREY_task5_1 | cvssp_cnn9 | Kong2019 | 0.674 | 0.786 | 0.455 | 0.272 | 0.640 | 0.552 | 0.181 | 0.765 | 0.883 |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.777 | 0.824 | 0.526 | 0.172 | 0.722 | 0.785 | 0.057 | 0.893 | 0.926 |
Liu_CU_task5_1 | Liu_CU_1 | Liu2019 | 0.700 | 0.746 | 0.528 | 0.318 | 0.742 | 0.826 | 0.000 | 0.772 | 0.898 |
Ng_NTU_task5_1 | Ng_1 | Ng2019 | 0.759 | 0.832 | 0.525 | 0.268 | 0.693 | 0.786 | 0.403 | 0.874 | 0.877 |
Ng_NTU_task5_2 | Ng_2 | Ng2019 | 0.767 | 0.843 | 0.535 | 0.249 | 0.739 | 0.760 | 0.425 | 0.882 | 0.895 |
Ng_NTU_task5_3 | Ng_3 | Ng2019 | 0.762 | 0.852 | 0.529 | 0.197 | 0.767 | 0.775 | 0.399 | 0.875 | 0.888 |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.770 | 0.854 | 0.545 | 0.209 | 0.749 | 0.764 | 0.412 | 0.878 | 0.867 |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.562 | 0.653 | 0.411 | 0.131 | 0.704 | 0.544 | 0.223 | 0.672 | 0.668 |
Tompkins_MS_task5_1 | MS D365 AI 1 | Tompkins2019 | 0.779 | 0.844 | 0.519 | 0.227 | 0.730 | 0.770 | 0.316 | 0.883 | 0.877 |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.788 | 0.855 | 0.538 | 0.188 | 0.744 | 0.812 | 0.418 | 0.886 | 0.886 |
Tompkins_MS_task5_3 | MS D365 AI 3 | Tompkins2019 | 0.779 | 0.844 | 0.519 | 0.227 | 0.730 | 0.770 | 0.316 | 0.883 | 0.877 |
Fine-level prediction
System ranking
These results only include systems for which the source code has been released.
Submission code | Submission name | Technical Report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.645 | 0.484 | 0.751 | |
Adapa_FH_task5_2 | MNv2_2 | Adapa2019 | 0.622 | 0.575 | 0.721 | |
Bai_NPU_task5_1 | multifeat1 | Bai2019 | 0.534 | 0.514 | 0.572 | |
Bai_NPU_task5_2 | multifeat2 | Bai2019 | 0.523 | 0.594 | 0.615 | |
Bai_NPU_task5_3 | multifeat3 | Bai2019 | 0.553 | 0.600 | 0.639 | |
Bai_NPU_task5_4 | multifeat4 | Bai2019 | 0.554 | 0.571 | 0.623 | |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.531 | 0.450 | 0.619 | |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.552 | 0.286 | 0.637 | |
Gousseau_OL_task5_1 | Gousseau1 | Gousseau2019 | 0.000 | 0.000 | 0.000 | |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.500 | 0.560 | 0.621 | |
Kim_NU_task5_1 | BK_CNN1 | Kim2019 | 0.000 | 0.000 | 0.000 | |
Kim_NU_task5_2 | BK_CNN2 | Kim2019 | 0.000 | 0.000 | 0.000 | |
Kim_NU_task5_3 | BK_CNN3 | Kim2019 | 0.000 | 0.000 | 0.000 | |
Kong_SURREY_task5_1 | cvssp_cnn9 | Kong2019 | 0.378 | 0.391 | 0.496 | |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.462 | 0.206 | 0.584 | |
Ng_NTU_task5_1 | Ng_1 | Ng2019 | 0.560 | 0.551 | 0.639 | |
Ng_NTU_task5_2 | Ng_2 | Ng2019 | 0.564 | 0.540 | 0.638 | |
Ng_NTU_task5_3 | Ng_3 | Ng2019 | 0.564 | 0.521 | 0.632 | |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.571 | 0.534 | 0.646 | |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.391 | 0.457 | 0.428 | |
Tompkins_MS_task5_1 | MS D365 AI 1 | Tompkins2019 | 0.521 | 0.444 | 0.618 | |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.555 | 0.381 | 0.649 | |
Tompkins_MS_task5_3 | MS D365 AI 3 | Tompkins2019 | 0.522 | 0.461 | 0.599 |
Teams ranking
This table includes only the best-performing reproducible system from each submitting team.
Submission code | Submission name | Technical Report | Macro-AUPRC | Micro-F1 | Micro-AUPRC
---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.645 | 0.484 | 0.751 | |
Bai_NPU_task5_3 | multifeat3 | Bai2019 | 0.553 | 0.600 | 0.639 | |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.531 | 0.450 | 0.619 | |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.552 | 0.286 | 0.637 | |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.500 | 0.560 | 0.621 | |
Kim_NU_task5_1 | BK_CNN1 | Kim2019 | 0.000 | 0.000 | 0.000 | |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.462 | 0.206 | 0.584 | |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.571 | 0.534 | 0.646 | |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.391 | 0.457 | 0.428 | |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.555 | 0.381 | 0.649 |
Class-wise performance
Submission code | Submission name | Technical Report | Micro-AUPRC | Engine | Machinery impact | Non-machinery impact | Powered saw | Alert signal | Music | Human voice | Dog
---|---|---|---|---|---|---|---|---|---|---|---
Adapa_FH_task5_1 | MNv2_1 | Adapa2019 | 0.751 | 0.665 | 0.718 | 0.362 | 0.486 | 0.858 | 0.289 | 0.841 | 0.936 |
Adapa_FH_task5_2 | MNv2_2 | Adapa2019 | 0.721 | 0.673 | 0.604 | 0.374 | 0.378 | 0.832 | 0.351 | 0.833 | 0.931 |
Bai_NPU_task5_1 | multifeat1 | Bai2019 | 0.572 | 0.394 | 0.560 | 0.470 | 0.351 | 0.648 | 0.129 | 0.735 | 0.981 |
Bai_NPU_task5_2 | multifeat2 | Bai2019 | 0.615 | 0.524 | 0.517 | 0.346 | 0.364 | 0.687 | 0.083 | 0.720 | 0.944 |
Bai_NPU_task5_3 | multifeat3 | Bai2019 | 0.639 | 0.545 | 0.536 | 0.489 | 0.418 | 0.679 | 0.082 | 0.763 | 0.911 |
Bai_NPU_task5_4 | multifeat4 | Bai2019 | 0.623 | 0.511 | 0.565 | 0.430 | 0.407 | 0.696 | 0.115 | 0.739 | 0.973 |
DCASE2019 baseline | Baseline | Cartwright2019 | 0.619 | 0.638 | 0.539 | 0.182 | 0.478 | 0.543 | 0.168 | 0.777 | 0.922 |
Cui_YSU_task5_1 | YSU_TFSANN | Cui2019 | 0.637 | 0.632 | 0.566 | 0.359 | 0.444 | 0.652 | 0.116 | 0.731 | 0.913 |
Gousseau_OL_task5_1 | Gousseau1 | Gousseau2019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Gousseau_OL_task5_2 | Gousseau2 | Gousseau2019 | 0.621 | 0.606 | 0.270 | 0.253 | 0.398 | 0.694 | 0.103 | 0.756 | 0.916 |
Kim_NU_task5_1 | BK_CNN1 | Kim2019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Kim_NU_task5_2 | BK_CNN2 | Kim2019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Kim_NU_task5_3 | BK_CNN3 | Kim2019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Kong_SURREY_task5_1 | cvssp_cnn9 | Kong2019 | 0.496 | 0.506 | 0.279 | 0.230 | 0.239 | 0.464 | 0.015 | 0.638 | 0.652 |
Kong_SURREY_task5_2 | cvssp_plus | Kong2019 | 0.584 | 0.534 | 0.440 | 0.089 | 0.437 | 0.535 | 0.009 | 0.738 | 0.913 |
Liu_CU_task5_1 | Liu_CU_1 | Liu2019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Ng_NTU_task5_1 | Ng_1 | Ng2019 | 0.639 | 0.668 | 0.576 | 0.235 | 0.557 | 0.583 | 0.221 | 0.765 | 0.873 |
Ng_NTU_task5_2 | Ng_2 | Ng2019 | 0.638 | 0.667 | 0.562 | 0.265 | 0.535 | 0.627 | 0.213 | 0.760 | 0.881 |
Ng_NTU_task5_3 | Ng_3 | Ng2019 | 0.632 | 0.666 | 0.513 | 0.267 | 0.532 | 0.632 | 0.258 | 0.757 | 0.890 |
Ng_NTU_task5_4 | Ng_4 | Ng2019 | 0.646 | 0.665 | 0.538 | 0.280 | 0.545 | 0.666 | 0.222 | 0.764 | 0.886 |
Orga_URL_task5_1 | AugNet | Orga2019 | 0.428 | 0.417 | 0.396 | 0.137 | 0.536 | 0.346 | 0.083 | 0.566 | 0.644 |
Tompkins_MS_task5_1 | MS D365 AI 1 | Tompkins2019 | 0.618 | 0.621 | 0.531 | 0.225 | 0.351 | 0.620 | 0.190 | 0.755 | 0.877 |
Tompkins_MS_task5_2 | MS D365 AI 2 | Tompkins2019 | 0.649 | 0.638 | 0.552 | 0.189 | 0.466 | 0.680 | 0.266 | 0.759 | 0.886 |
Tompkins_MS_task5_3 | MS D365 AI 3 | Tompkins2019 | 0.599 | 0.553 | 0.458 | 0.225 | 0.572 | 0.608 | 0.199 | 0.683 | 0.877 |
System characteristics
Code | Technical Report | Coarse Micro-AUPRC | Fine Micro-AUPRC | Input | Sampling rate | Data augmentation | Features | External data | External data sources | Model complexity | Classifier | Ensemble subsystems | Used annotator ID | Used proximity | Used sensor ID | Aggregation method | Target level | Target method | System relabeling
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Adapa_FH_task5_1 | Adapa2019 | 0.860 | 0.751 | mono | 44.1kHz | mixup, random erase, scaling, shifting | log-mel energies | pre-trained model | ImageNet based trained weights of MobileNetV2 | 2896726 | CNN | | False | False | False | mean | both | average | manual
Adapa_FH_task5_2 | Adapa2019 | 0.847 | 0.721 | mono | 44.1kHz | mixup, random erase, scaling, shifting | log-mel energies | pre-trained model | ImageNet based trained weights of MobileNetV2 | 2899804 | CNN | | False | False | False | mean | both | average | automatic
Bai_NPU_task5_1 | Bai2019 | 0.763 | 0.572 | mono | 16kHz | | MFCC, log-mel, STFT, HPSS | | | | CNN | | False | False | False | | fine | fusion |
Bai_NPU_task5_2 | Bai2019 | 0.769 | 0.615 | mono | 16kHz | | MFCC, log-mel, STFT, HPSS | | | | CNN | | False | False | False | | fine | fusion |
Bai_NPU_task5_3 | Bai2019 | 0.680 | 0.639 | mono | 16kHz | | MFCC, log-mel, STFT, HPSS | | | | CNN | | False | False | False | | fine | fusion |
Bai_NPU_task5_4 | Bai2019 | 0.782 | 0.623 | mono | 16kHz | | MFCC, log-mel, STFT, HPSS | | | | CNN | | False | False | False | | fine | fusion |
DCASE2019 baseline | Cartwright2019 | 0.742 | 0.619 | mono | 44.1kHz | | vggish | | | 2967 | logistic regression | | False | False | False | | fine | minority vote |
Cui_YSU_task5_1 | Cui2019 | 0.807 | 0.637 | mono | 32kHz | | log-mel spectrogram | | | 583336 | CNN | | False | False | False | | both | minority vote |
Gousseau_OL_task5_1 | Gousseau2019 | 0.745 | 0.000 | mono | 44.1kHz | mixup | log-mel energies | | | 120753440 | CNN | 4 | False | False | False | | coarse | minority vote |
Gousseau_OL_task5_2 | Gousseau2019 | 0.748 | 0.621 | mono | 44.1kHz | mixup | log-mel energies | | | 120753440 | CNN | 4 | False | False | False | | coarse | minority vote |
Kim_NU_task5_1 | Kim2019 | 0.761 | 0.000 | mono | 44.1kHz | | mel spectrogram | pre-trained model | vggish | 12193928 | CNN | | False | False | False | max | coarse | minority vote |
Kim_NU_task5_2 | Kim2019 | 0.825 | 0.000 | mono | 44.1kHz | | mel spectrogram | pre-trained model | vggish | 24387856 | CNN, ensemble | | False | False | False | max | coarse | minority vote |
Kim_NU_task5_3 | Kim2019 | 0.809 | 0.000 | mono | 44.1kHz | | mel spectrogram | pre-trained model | vggish | 12193928 | CNN | | False | False | False | max | coarse | minority vote |
Kong_SURREY_task5_1 | Kong2019 | 0.674 | 0.496 | mono | 32kHz | | log-mel | | | 4686144 | CNN | | False | False | False | | both | minority vote |
Kong_SURREY_task5_2 | Kong2019 | 0.777 | 0.584 | mono | 32kHz | | log-mel | AudioSet pretrained model | AudioSet | 4686144 | CNN | | False | False | False | | both | minority vote |
Liu_CU_task5_1 | Liu2019 | 0.700 | 0.000 | mono | 44.1kHz | | log-mel spectrogram | pre-trained model | pre-trained model | 2967 | CNN | | False | False | False | mean | coarse | minority vote |
Ng_NTU_task5_1 | Ng2019 | 0.759 | 0.639 | mono | 44.1kHz | | openl3 | | | 120215 | logistic regression, neural network | | True | False | False | | both | minority vote | manual
Ng_NTU_task5_2 | Ng2019 | 0.767 | 0.638 | mono | 44.1kHz | | openl3 | audio data | Urban-SED, FSDKaggle2018 | 103191 | logistic regression, neural network | | True | False | True | | both | minority vote | manual
Ng_NTU_task5_3 | Ng2019 | 0.762 | 0.632 | mono | 44.1kHz | | openl3 | audio data | Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master | 103191 | logistic regression, neural network | | True | False | True | | both | minority vote | automatic, manual
Ng_NTU_task5_4 | Ng2019 | 0.770 | 0.646 | mono | 44.1kHz | | openl3 | audio data | Urban-SED, FSDKaggle2018, UrbanSound8k, FSDnoisy18k, ESC-50-master | 271895 | logistic regression, neural network | | True | False | True | | both | minority vote | automatic, manual
Orga_URL_task5_1 | Orga2019 | 0.562 | 0.428 | mono | 44.1kHz | emphasis, compression, mixing | vggish | | | 286208677 | CNN | | False | False | False | mean | both | minority vote |
Tompkins_MS_task5_1 | Tompkins2019 | 0.779 | 0.618 | mono | 44.1kHz | pitch shifting, volume, white noise addition | log-mel spectrogram, vggish | pre-trained model | vggish | 169518141 | CNN | | False | False | False | | both | majority vote |
Tompkins_MS_task5_2 | Tompkins2019 | 0.788 | 0.649 | mono | 44.1kHz | pitch shifting, volume, white noise addition | log-mel spectrogram, vggish | pre-trained model | vggish | 169518141 | CNN | | False | False | False | | both | majority vote |
Tompkins_MS_task5_3 | Tompkins2019 | 0.779 | 0.599 | mono | 44.1kHz | pitch shifting, volume, white noise addition | log-mel spectrogram, vggish | pre-trained model | vggish | 1186626987 | CNN, ensemble | 7 | False | False | False | | both | majority vote |
Technical reports
Urban Sound Tagging Using Convolutional Neural Networks
Sainath Adapa
Customer Acquisition, FindHotel, Amsterdam, Netherlands
Abstract
This technical report outlines our solution to Task 5 of the DCASE 2019 challenge, titled Urban Sound Tagging. The objective of the task is to label different sources of noise from raw audio data. A modified form of MobileNetV2, a convolutional neural network (CNN), was trained to predict both coarse and fine tags jointly. The proposed model uses the log-scaled mel spectrogram as the representation format for the audio data. Mixup, random erasing, scaling, and shifting are used as data augmentation techniques. A second model that uses scaled labels was built to account for human errors in the annotations. The solution code is available on GitHub.
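As a rough illustration of the front end and augmentation described in this abstract, the sketch below computes a log-scaled mel spectrogram with librosa and applies mixup to a pair of training examples. All parameter values (n_mels, the mixup alpha, file handling) are illustrative assumptions, not the authors' settings.

```python
# Sketch of a log-mel front end and mixup augmentation (illustrative, not the authors' code).
import numpy as np
import librosa

def log_mel(path, sr=44100, n_mels=128):
    """Load a 10-second clip and compute a log-scaled mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True, duration=10.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mix two (spectrogram, tag-vector) pairs with a Beta-distributed weight."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```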
Urban Sound Tagging with Multi-Feature Fusion System
Jisheng Bai and Chen Chen
School of Marine Science, Northwestern Polytechnical University, Xi'an, China
Abstract
This paper presents a multi-feature fusion system for DCASE 2019 Task 5, Urban Sound Tagging (UST). The task is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene [1]. Both coarse-level and fine-level taxonomies are available for training; we focus mainly on the coarse level and reuse the best coarse-level architecture to train the fine-level model. Various features are extracted from the original urban sound recordings and convolutional neural networks (CNNs) are applied in this system. Log-mel, harmonic, short-time Fourier transform (STFT), and Mel-frequency cepstral coefficient (MFCC) spectrograms are fed into a 5-layer or 9-layer CNN, and a type of gated activation [2] is also used. Different features are adopted for different urban sound classes according to the results of our experiments. We obtain at least a 0.14 Macro-AUPRC improvement over the baseline system at the coarse level. Finally, we fuse several models and evaluate on the evaluation dataset.
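The sketch below shows, under assumed parameter values, how the four input representations named in the abstract (log-mel, HPSS, STFT, MFCC) could be extracted with librosa at 16 kHz; it is not the authors' feature-extraction code.

```python
# Illustrative extraction of the four representations (log-mel, HPSS, STFT, MFCC).
# "clip.wav" and all parameters are placeholders, not the authors' configuration.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000, mono=True)

stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))        # STFT magnitude
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))        # log-mel spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)                # MFCC
harmonic, percussive = librosa.decompose.hpss(stft)               # HPSS components

# Each representation feeds its own CNN; the per-model predictions are then
# fused (e.g. averaged) to produce the final tags.
```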
SONYC Urban Sound Tagging: A Multilabel Dataset From an Urban Acoustic Sensor Network
Mark Cartwright1, Jason Cramer2, Ana Elisa Mendez Mendez3, Ho-Hsiang Wu3, Vincent Lostanlen4, Juan P. Bello1 and Justin Salamon5
1Music and Audio Research Laboratory, Department of Computer Science and Engineering, Center for Urban Science and Progress, New York University, New York, New York, USA, 2Music and Audio Research Laboratory, Department of Electrical and Computer Engineering, New York University, New York, New York, USA, 3Music and Audio Research Laboratory, Department of Music and Performing Arts Professions, New York University, New York, New York, USA, 4Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA, 5Machine Perception Team, Adobe Research, San Francisco, CA, USA
Cartwright_NYU_task5_1
Abstract
SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for realistic urban noise monitoring. The audio was recorded from an acoustic sensor network named "Sounds of New York City" (SONYC). Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 classes that were chosen in advance in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes.
Time-Frequency Segmentation Attention Neural Network for Urban Sound Tagging
Lin Cui, Shaonan Ji, Xinyuan Han and Jinjia Wang
School of Information Science and Engineering, Department of Electronic Communication, Yanshan University, Qinhuangdao, Hebei, China
Abstract
Audio tagging aims to assign one or more labels to an audio clip. In this task, we used the Time-Frequency Segmentation Attention Neural Network (TFSANN) for urban sound tagging. During training, the log-mel spectrogram of the audio clip is used as the input feature, and a time-frequency segmentation mask is produced by the time-frequency segmentation network. The mask can be used to separate sound events from the background scene in the time-frequency domain and to enhance the sound events that occur in the audio clip. Global Weighted Rank Pooling (GWRP) lets present event categories occupy a significant part of the spectrogram, allowing the network to focus on the most salient features, and it can also estimate the probability that a sound event is present. In this paper, the proposed TFSANN model is validated on the development dataset of DCASE 2019 Task 5. Results for the coarse-grained and fine-grained taxonomies are reported in terms of Micro AUPRC (area under the precision-recall curve), Micro F1 score, and Macro AUPRC.
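Global Weighted Rank Pooling is commonly defined as a rank-weighted average of the per-bin class scores that interpolates between max pooling and average pooling. The sketch below shows that general formulation; the decay value and score-map shape are assumptions, and this is not the authors' implementation.

```python
# Global Weighted Rank Pooling (GWRP) in its common formulation: sort the per-bin
# scores in descending order and take a geometrically discounted weighted average.
import numpy as np

def gwrp(scores, d=0.99):
    """scores: array of per-time-frequency-bin probabilities for one class."""
    s = np.sort(scores.ravel())[::-1]       # sort descending
    w = d ** np.arange(s.size)              # rank weights 1, d, d^2, ...
    return float(np.dot(w, s) / w.sum())    # normalized weighted sum

# d -> 1 recovers global average pooling; d -> 0 recovers global max pooling.
clip_prob = gwrp(np.random.rand(64, 400))   # e.g. a 64 x 400 class score map
```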
VGG CNN for Urban Sound Tagging
Clément Gousseau
Ambient Intelligence and Mobility, Orange Labs (company where I do my master thesis internship), Lannion, France
Abstract
A model for urban sound tagging is presented (Task 5 of DCASE 2019 [1][2]). The task is to detect activities in 10-second audio segments recorded in the streets of New York City (SONYC dataset). The model is based on the one presented in the book Hands-On Transfer Learning with Python [3], which performs urban sound classification on the UrbanSound dataset. This model has been adapted and optimized to address Task 5 of DCASE 2019. It achieves an AUPRC of 82.6 for the coarse-grained model, where the baseline achieves an AUPRC of 76.2.
Convolutional Neural Networks with Transfer Learning for Urban Sound Tagging
Bongjun Kim
Department of Computer Science, Northwestern University, Evanston, Illinois, USA
Abstract
This technical report describes the sound classification models in our submissions for DCASE 2019 Challenge Task 5. The task is to build a system that performs audio tagging on urban sound. The dataset has 23 fine-grained tags and 8 coarse-grained tags. In this report, we only present a model for coarse-grained tagging. Our model is a Convolutional Neural Network (CNN)-based model consisting of 6 convolutional layers and 3 fully connected layers. We apply transfer learning by utilizing the VGGish model, which has been pre-trained on a large-scale dataset. We also apply an ensemble technique to boost the performance of a single model. We compare the performance of our models and the baseline approach on the provided validation dataset. The results show that our models outperform the baseline system.
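As a rough sketch of a 6-convolutional-layer, 3-fully-connected-layer coarse tagger of the kind described above, the PyTorch module below outputs 8 coarse-tag probabilities. Channel widths and layer sizes are assumptions; the report additionally transfers weights from the pre-trained VGGish model and ensembles several such models.

```python
# Illustrative 6-conv / 3-FC coarse tagger (assumed layer sizes, not the authors' model).
import torch
import torch.nn as nn

class CoarseTagger(nn.Module):
    def __init__(self, n_tags=8):
        super().__init__()
        chans = [1, 64, 64, 128, 128, 256, 256]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):           # 6 conv blocks
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(                               # 3 fully connected layers
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_tags))

    def forward(self, x):                  # x: (batch, 1, mel_bins, frames)
        return torch.sigmoid(self.head(self.features(x)))

# An ensemble (cf. submission 2) could then simply average model outputs:
# probs = (model_a(x) + model_b(x)) / 2
```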
Cross-Talk Learning for Audio Tagging, Sound Event Detection and Spatial Localization: DCASE 2019 Baseline Systems
Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, England
Kong_SURREY_task5_1 Kong_SURREY_task5_2
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after the convolutional layers is a good model for the majority of the DCASE 2019 tasks.
Improved Residual Network Based on Deformable Convolution for Urban Sound Tagging
Fuling Liu
College of Photoelectric Engineering, Chongqing University, 174 Shazhengjie, Shapingba, Chongqing, 400030, China
Liu_CU_task5_1
Abstract
To address the urban sound tagging problem, we use deformable convolution to improve the residual module: offset variables are added for the input feature map of the residual module, and the improved residual modules are assembled into a new residual network. Compared with the usual way of adding deformable convolution, the improved method in this paper achieves better results.
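One common way to realize this idea, sketched below under assumed channel counts, is a residual block whose 3x3 convolution is replaced by torchvision's DeformConv2d, with a small convolution predicting the sampling offsets for the input feature map. This is an illustrative sketch, not the author's network.

```python
# Residual block with a deformable 3x3 convolution (illustrative layout).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Offsets: 2 values (dx, dy) per position of the 3x3 kernel -> 18 channels.
        self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.deform(x, self.offset(x))))
        out = self.bn2(self.conv(out))
        return self.relu(out + x)                      # residual connection

block = DeformableResBlock(64)
y = block(torch.randn(2, 64, 64, 128))                 # (batch, channels, mel_bins, frames)
```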
Urban Sound Tagging DCASE 2019 Challenge Task 5
Linus Ng and Kenneth Ooi
Smart Nation Translational Lab, Centre for Infocomm Technology (INFINITUS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Ng_NTU_task5_1 Ng_NTU_task5_2 Ng_NTU_task5_3 Ng_NTU_task5_4
Abstract
Identifying urban noises and sounds is a challenging but important problem in the field of machine listening [1]. It provides a realistic use case for detecting noises in an urbanised city, from noise complaints to sounds or unusual noises that may indicate possible emergencies. The Urban Sound Tagging challenge, part of the DCASE 2019 challenge [2][3], addresses the problem of urban noise control [1]. For this challenge, we are tasked to build an audio classifier that predicts whether each of 23 sources of noise pollution is present or absent in a 10-second scene, as recorded by an acoustic sensor network. In this technical report, we examine in some detail the performance of audio classification models trained with different open external datasets.
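A minimal sketch of the embedding-plus-shallow-classifier pipeline described above: clip-level OpenL3 embeddings fed to per-tag logistic regression. The file list, labels, pooling, and OpenL3 parameters below are placeholders and assumptions, not the authors' configuration.

```python
# Illustrative OpenL3-embedding + logistic-regression tagger (placeholder data).
import numpy as np
import soundfile as sf
import openl3
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def clip_embedding(path):
    audio, sr = sf.read(path)
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="env",
                                        embedding_size=512)
    return emb.mean(axis=0)                     # average frame embeddings -> one vector

train_paths = ["clip_0.wav", "clip_1.wav"]      # placeholder file list
train_labels = np.array([[1, 0, 1],             # placeholder (clips x tags) binary matrix
                         [0, 1, 0]])

X = np.stack([clip_embedding(p) for p in train_paths])
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, train_labels)
scores = clf.predict_proba(X)                   # per-tag probabilities for ranking/AUPRC
```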
An Augmented Neural Network for the DCASE 2019 Urban Sound Tagging Challenge
Ferran Orga1, Joan Serrà2 and Carlos Segura Perales2
1GTM - Grup de recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull (URL), C/Quatre Camins, 30, 08022 Barcelona (Spain), 2Telefónica Research, Barcelona (Spain)
Orga_URL_task5_1
Abstract
The Sounds of New York City (SONYC) research project aims to mitigate urban noise pollution in the context of a megacity. The project has deployed 50 sensors in various areas of New York City, installed back in 2015, to monitor the overall sound pressure level. However, this is not enough to determine the noise sources, which is needed to detect noise code violations. In Task 5 of the DCASE 2019 challenge, an urban sound tagging task is proposed in which participants are asked to develop a machine listening system that distinguishes between 23 sources of noise pollution. The system must predict whether each source is present or absent in 10-second scenes recorded by the SONYC sensors. Moreover, annotations are also provided at a higher level, grouping the 23 fine labels into 8 coarse labels. In this report, the authors present a machine listening approach based on an augmented neural network in which both coarse- and fine-level annotations are used to predict event presence within the same network. This approach obtains a classification accuracy on the validation split of 87% at the coarse level and 92% at the fine level.
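A minimal sketch of a single network supervised at both label granularities, as described above: a shared trunk with an 8-way coarse head and a 23-way fine head, trained with the sum of two binary cross-entropy losses. The trunk, input features, and dimensions are assumptions, not the authors' architecture.

```python
# Illustrative dual-head (coarse + fine) tagger trained with joint BCE losses.
import torch
import torch.nn as nn

class DualLevelTagger(nn.Module):
    def __init__(self, in_dim=128, n_coarse=8, n_fine=23):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.coarse_head = nn.Linear(128, n_coarse)
        self.fine_head = nn.Linear(128, n_fine)

    def forward(self, x):                          # x: clip-level feature vector
        h = self.trunk(x)
        return self.coarse_head(h), self.fine_head(h)

model = DualLevelTagger()
bce = nn.BCEWithLogitsLoss()
x = torch.randn(4, 128)                            # 4 clips of (assumed) 128-dim features
y_coarse, y_fine = torch.zeros(4, 8), torch.zeros(4, 23)
logit_c, logit_f = model(x)
loss = bce(logit_c, y_coarse) + bce(logit_f, y_fine)   # joint coarse + fine supervision
```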
DCASE 2019 Challenge Task 5: CNN+VGGish
Daniel Tompkins and Eric Nichols
Dynamics 365 AI Research, Microsoft, Redmond, WA
Abstract
We trained a model for multi-label audio classification on Task 5 of the DCASE 2019 Challenge [1]. The model is composed of a preprocessing layer that converts audio to a log-mel spectrogram, a VGG-inspired Convolutional Neural Network (CNN) that generates an embedding for the spectrogram, the pre-trained VGGish network [2] that generates a separate audio embedding, and finally a series of fully connected layers that converts these two concatenated embeddings into a multi-label classification. This model directly outputs both fine and coarse labels; it treats the task as a 37-way multi-label classification problem. One version of this network did better on the coarse labels (submission 1); another did better on the fine labels in Micro AUPRC (submission 2). A separate family of CNN models, one per coarse label, was also trained to take into account the hierarchical nature of the labels (submission 3), but the single-model solution performed slightly better.
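A minimal sketch of the two-branch design described above: a CNN embedding of the log-mel spectrogram concatenated with a (pre-computed) VGGish embedding, followed by fully connected layers producing all 37 tag probabilities jointly. Layer sizes and the embedding dimension are assumptions, not the authors' exact model.

```python
# Illustrative two-branch tagger: log-mel CNN embedding + VGGish embedding -> 37 tags.
import torch
import torch.nn as nn

class TwoBranchTagger(nn.Module):
    def __init__(self, vggish_dim=128, n_tags=37):
        super().__init__()
        self.cnn = nn.Sequential(                       # VGG-style branch on the log-mel input
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> 64-dim embedding
        self.fc = nn.Sequential(                        # classifier on the concatenated embeddings
            nn.Linear(64 + vggish_dim, 256), nn.ReLU(),
            nn.Linear(256, n_tags))

    def forward(self, log_mel, vggish_emb):
        z = torch.cat([self.cnn(log_mel), vggish_emb], dim=1)
        return torch.sigmoid(self.fc(z))

model = TwoBranchTagger()
probs = model(torch.randn(2, 1, 64, 400), torch.randn(2, 128))   # (batch, 37) tag probabilities
```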