Task description
This subtask is concerned with classification using audio and video modalities. Since audio-visual machine learning has gained popularity in recent years, we aim to provide a multidisciplinary task that may attract researchers from the machine vision community.
We imposed no restrictions on the modality or combinations of modalities used in the system. We encouraged participants to also submit single-modality systems (audio-only or video-only methods for scene classification).
The development set contains synchronized audio and video recordings from 10 European cities in 10 different scenes. The total amount of audio in the development set is 34 hours. The evaluation set contains data from 12 cities (2 cities unseen in the development set). Evaluation data contains 20 hours of audio.
A more detailed task description can be found on the task description page.
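Systems below are ranked by multi-class cross-entropy (log loss), averaged over the ten scene classes, with accuracy reported as a secondary metric. As a rough, unofficial illustration of how these two numbers are computed from system outputs, the following sketch (using NumPy and hypothetical predictions, not challenge data or the official evaluation code) averages the per-class log loss and computes overall accuracy:

```python
import numpy as np

def class_wise_log_loss(y_true, y_prob, eps=1e-15):
    """Macro-averaged multi-class cross-entropy: the negative log-probability of
    the correct class, averaged within each class and then across classes."""
    y_prob = np.clip(y_prob, eps, 1.0)
    per_class = [-np.mean(np.log(y_prob[y_true == c, c])) for c in np.unique(y_true)]
    return float(np.mean(per_class))

def accuracy(y_true, y_prob):
    """Fraction of clips whose highest-probability class matches the label."""
    return float(np.mean(np.argmax(y_prob, axis=1) == y_true))

# Hypothetical predictions for 6 clips and 3 classes (not challenge data).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.5, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.4, 0.3, 0.3]])
print(class_wise_log_loss(y_true, y_prob), accuracy(y_true, y_prob))
```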
Systems ranking
Submission label | Name | Technical Report | Official system rank | Logloss (Evaluation dataset) | Accuracy with 95% confidence interval (Evaluation dataset) | Logloss (Development dataset) | Accuracy (Development dataset) |
---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | muls_tr(1) | Boes2021 | 23 | 0.653 | 74.5 (74.2 - 74.8) | 0.620 | 75.9 | |
Boes_KUL_task1b_2 | muls_tr(2) | Boes2021 | 25 | 0.683 | 76.0 (75.7 - 76.3) | 0.652 | 76.6 | |
Boes_KUL_task1b_3 | muls_tr(3) | Boes2021 | 26 | 0.701 | 76.3 (76.0 - 76.6) | 0.665 | 77.0 | |
Boes_KUL_task1b_4 | muls_tr(4) | Boes2021 | 24 | 0.681 | 76.0 (75.6 - 76.3) | 0.682 | 77.1 | |
Diez_Noismart_task1b_1 | AholabASC1 | Diez2021 | 35 | 1.061 | 65.2 (64.8 - 65.5) | 1.038 | 65.6 | |
Diez_Noismart_task1b_2 | AhonoiseASC1 | Diez2021 | 38 | 1.096 | 64.4 (64.1 - 64.8) | 1.006 | 63.3 | |
Diez_Noismart_task1b_3 | AhonoiseASC1 | Diez2021 | 34 | 1.060 | 64.7 (64.4 - 65.1) | 1.023 | 63.2 | |
Du_USTC_task1b_1 | USTC_t1b_1 | Wang2021 | 8 | 0.241 | 92.9 (92.7 - 93.1) | 0.147 | 94.7 | |
Du_USTC_task1b_2 | USTC_t1b_2 | Wang2021 | 7 | 0.238 | 92.7 (92.5 - 92.9) | 0.145 | 95.1 | |
Du_USTC_task1b_3 | USTC_t1b_3 | Wang2021 | 6 | 0.222 | 93.2 (93.0 - 93.4) | 0.143 | 95.2 | |
Du_USTC_task1b_4 | USTC_t1b_4 | Wang2021 | 5 | 0.221 | 93.2 (93.0 - 93.4) | 0.141 | 95.5 | |
Fedorishin_UB_task1b_1 | WS Small | Fedorishin2021 | 37 | 1.077 | 67.2 (66.8 - 67.5) | 0.907 | 69.6 | |
Fedorishin_UB_task1b_2 | WS Fusion | Fedorishin2021 | 33 | 1.028 | 68.7 (68.4 - 69.1) | 0.990 | 68.5 | |
Hou_UGent_task1b_1 | HTCH_1 | Hou2021 | 20 | 0.555 | 81.5 (81.2 - 81.8) | 0.346 | 87.4 | |
Hou_UGent_task1b_2 | HTCH_2 | Hou2021 | 29 | 0.771 | 81.8 (81.6 - 82.1) | 0.351 | 89.5 | |
Hou_UGent_task1b_3 | HTCH_3 | Hou2021 | 19 | 0.523 | 84.0 (83.7 - 84.3) | 0.318 | 88.6 | |
Hou_UGent_task1b_4 | HTCH_4 | Hou2021 | 16 | 0.416 | 85.6 (85.3 - 85.8) | 0.328 | 91.7 | |
Naranjo-Alcazar_UV_task1b_1 | AVSC_SE_CRNN | Naranjo-Alcazar2021_t1b | 18 | 0.495 | 86.5 (86.3 - 86.8) | 0.556 | 90.0 | |
Naranjo-Alcazar_UV_task1b_2 | AVSC_SE_CRNN | Naranjo-Alcazar2021_t1b | 22 | 0.640 | 83.2 (82.9 - 83.4) | 0.616 | 87.0 | |
Naranjo-Alcazar_UV_task1b_3 | AVSC_SE_CRNN | Naranjo-Alcazar2021_t1b | 32 | 1.006 | 66.8 (66.5 - 67.1) | 0.969 | 69.0 | |
Okazaki_LDSLVision_task1b_1 | S01 | Okazaki2021 | 12 | 0.312 | 91.6 (91.4 - 91.8) | 0.260 | 92.5 | |
Okazaki_LDSLVision_task1b_2 | S02 | Okazaki2021 | 13 | 0.320 | 93.2 (93.0 - 93.3) | 0.149 | 96.1 | |
Okazaki_LDSLVision_task1b_3 | S03 | Okazaki2021 | 11 | 0.303 | 93.5 (93.3 - 93.7) | 0.238 | 95.8 | |
Okazaki_LDSLVision_task1b_4 | S04 | Okazaki2021 | 9 | 0.257 | 93.5 (93.3 - 93.7) | 0.149 | 96.1 | |
Peng_CQU_task1b_1 | CRFDS | Peng2021 | 45 | 1.395 | 68.2 (67.9 - 68.5) | 1.614 | 71.8 | |
Peng_CQU_task1b_2 | CRFDS | Peng2021 | 40 | 1.172 | 67.8 (67.5 - 68.1) | 1.627 | 71.0 | |
Peng_CQU_task1b_3 | CRFDS | Peng2021 | 41 | 1.172 | 67.8 (67.5 - 68.1) | 1.635 | 70.5 | |
Peng_CQU_task1b_4 | CRFDS | Peng2021 | 43 | 1.233 | 68.5 (68.1 - 68.8) | 1.824 | 70.2 | |
Pham_AIT_task1b_1 | Pham_AIT | Pham2021 | 44 | 1.311 | 73.0 (72.7 - 73.3) | | 93.9 |
Pham_AIT_task1b_2 | Pham_AIT | Pham2021 | 21 | 0.589 | 88.3 (88.1 - 88.6) | | 93.9 |
Pham_AIT_task1b_3 | Pham_AIT | Pham2021 | 17 | 0.434 | 88.4 (88.2 - 88.7) | | 93.9 |
Pham_AIT_task1b_4 | Pham_AIT | Pham2021 | 28 | 0.738 | 91.5 (91.3 - 91.7) | | 93.9 |
Triantafyllopoulos_AUD_task1b_1 | WT | Triantafyllopoulos2021 | 39 | 1.157 | 58.4 (58.1 - 58.8) | 1.153 | 59.4 | |
Triantafyllopoulos_AUD_task1b_2 | MWT-FiLM | Triantafyllopoulos2021 | 27 | 0.735 | 73.6 (73.3 - 73.9) | 0.568 | 79.5 | |
Triantafyllopoulos_AUD_task1b_3 | MWT-Bias | Triantafyllopoulos2021 | 30 | 0.785 | 73.7 (73.3 - 74.0) | 0.704 | 76.2 | |
Triantafyllopoulos_AUD_task1b_4 | MWT-Wave | Triantafyllopoulos2021 | 31 | 0.872 | 70.3 (70.0 - 70.7) | 0.796 | 72.6 | |
Wang_BIT_task1b_1 | Wang_BIT1 | Wang2021a | 36 | 1.061 | 74.1 (73.7 - 74.4) | 1.072 | 91.5 | |
Wang_BIT_task1b_2 | Wang_BIT2 | Liang2021 | 42 | 1.180 | 62.4 (62.0 - 62.7) | 0.744 | 80.6 | |
DCASE2021 baseline | Baseline | | | 0.662 | 77.1 (76.8 - 77.5) | 0.658 | 77.0 |
Yang_THU_task1b_1 | cnn14_cvt | Yang2021 | 15 | 0.332 | 90.8 (90.6 - 91.1) | 0.261 | 92.6 | |
Yang_THU_task1b_2 | trans_cvt | Yang2021 | 14 | 0.321 | 90.8 (90.6 - 91.0) | 0.230 | 93.1 | |
Yang_THU_task1b_3 | 2trans_cnn | Yang2021 | 10 | 0.279 | 92.1 (91.9 - 92.3) | 0.223 | 93.9 | |
Zhang_IOA_task1b_1 | ZhangIOA1 | Wang2021b | 3 | 0.201 | 93.5 (93.3 - 93.7) | 0.146 | 95.0 | |
Zhang_IOA_task1b_2 | ZhangIOA2 | Wang2021b | 4 | 0.205 | 93.6 (93.4 - 93.8) | 0.153 | 95.0 | |
Zhang_IOA_task1b_3 | ZhangIOA3 | Wang2021b | 1 | 0.195 | 93.8 (93.6 - 93.9) | 0.145 | 95.1 | |
Zhang_IOA_task1b_4 | ZhangIOA4 | Wang2021b | 2 | 0.199 | 93.9 (93.7 - 94.1) | 0.156 | 95.0 |
Teams ranking
The table below includes only the best-performing system from each submitting team.
Submission label | Name | Technical Report | Official system rank | Team rank | Logloss (Evaluation dataset) | Accuracy with 95% confidence interval (Evaluation dataset) | Logloss (Development dataset) | Accuracy (Development dataset) |
---|---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | muls_tr(1) | Boes2021 | 23 | 8 | 0.653 | 74.5 (74.2 - 74.8) | 0.620 | 75.9 | |
Diez_Noismart_task1b_3 | AhonoiseASC1 | Diez2021 | 34 | 11 | 1.060 | 64.7 (64.4 - 65.1) | 1.023 | 63.2 | |
Du_USTC_task1b_4 | USTC_t1b_4 | Wang2021 | 5 | 2 | 0.221 | 93.2 (93.0 - 93.4) | 0.141 | 95.5 | |
Fedorishin_UB_task1b_2 | WS Fusion | Fedorishin2021 | 33 | 10 | 1.028 | 68.7 (68.4 - 69.1) | 0.990 | 68.5 | |
Hou_UGent_task1b_4 | HTCH_4 | Hou2021 | 16 | 5 | 0.416 | 85.6 (85.3 - 85.8) | 0.328 | 91.7 | |
Naranjo-Alcazar_UV_task1b_1 | AVSC_SE_CRNN | Naranjo-Alcazar2021_t1b | 18 | 7 | 0.495 | 86.5 (86.3 - 86.8) | 0.556 | 90.0 | |
Okazaki_LDSLVision_task1b_4 | S04 | Okazaki2021 | 9 | 3 | 0.257 | 93.5 (93.3 - 93.7) | 0.149 | 96.1 | |
Peng_CQU_task1b_2 | CRFDS | Peng2021 | 40 | 13 | 1.172 | 67.8 (67.5 - 68.1) | 1.627 | 71.0 | |
Pham_AIT_task1b_3 | Pham_AIT | Pham2021 | 17 | 6 | 0.434 | 88.4 (88.2 - 88.7) | | 93.9 |
Triantafyllopoulos_AUD_task1b_2 | MWT-FiLM | Triantafyllopoulos2021 | 27 | 9 | 0.735 | 73.6 (73.3 - 73.9) | 0.568 | 79.5 | |
Wang_BIT_task1b_1 | Wang_BIT1 | Wang2021a | 36 | 12 | 1.061 | 74.1 (73.7 - 74.4) | 1.072 | 91.5 | |
DCASE2021 baseline | Baseline | | | | 0.662 | 77.1 (76.8 - 77.5) | 0.658 | 77.0 |
Yang_THU_task1b_3 | 2trans_cnn | Yang2021 | 10 | 4 | 0.279 | 92.1 (91.9 - 92.3) | 0.223 | 93.9 | |
Zhang_IOA_task1b_3 | ZhangIOA3 | Wang2021b | 1 | 1 | 0.195 | 93.8 (93.6 - 93.9) | 0.145 | 95.1 |
Modality
Submission label | Technical Report | Official system rank | Logloss (Eval) | Accuracy (Eval) | Used modalities | Method to combine information from modalities |
---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | Boes2021 | 23 | 0.653 | 74.5 | audio + video | early fusion | |
Boes_KUL_task1b_2 | Boes2021 | 25 | 0.683 | 76.0 | audio + video | early fusion | |
Boes_KUL_task1b_3 | Boes2021 | 26 | 0.701 | 76.3 | audio + video | early fusion | |
Boes_KUL_task1b_4 | Boes2021 | 24 | 0.681 | 76.0 | audio + video | early fusion | |
Diez_Noismart_task1b_1 | Diez2021 | 35 | 1.061 | 65.2 | audio | audio only | |
Diez_Noismart_task1b_2 | Diez2021 | 38 | 1.096 | 64.4 | audio | audio only | |
Diez_Noismart_task1b_3 | Diez2021 | 34 | 1.060 | 64.7 | audio | audio only | |
Du_USTC_task1b_1 | Wang2021 | 8 | 0.241 | 92.9 | audio + video | early fusion | |
Du_USTC_task1b_2 | Wang2021 | 7 | 0.238 | 92.7 | audio + video | early fusion | |
Du_USTC_task1b_3 | Wang2021 | 6 | 0.222 | 93.2 | audio + video | early fusion | |
Du_USTC_task1b_4 | Wang2021 | 5 | 0.221 | 93.2 | audio + video | early fusion | |
Fedorishin_UB_task1b_1 | Fedorishin2021 | 37 | 1.077 | 67.2 | audio | audio only | |
Fedorishin_UB_task1b_2 | Fedorishin2021 | 33 | 1.028 | 68.7 | audio | audio only | |
Hou_UGent_task1b_1 | Hou2021 | 20 | 0.555 | 81.5 | audio + video | late fusion | |
Hou_UGent_task1b_2 | Hou2021 | 29 | 0.771 | 81.8 | audio + video | late fusion | |
Hou_UGent_task1b_3 | Hou2021 | 19 | 0.523 | 84.0 | audio + video | late fusion | |
Hou_UGent_task1b_4 | Hou2021 | 16 | 0.416 | 85.6 | audio + video | late fusion | |
Naranjo-Alcazar_UV_task1b_1 | Naranjo-Alcazar2021_t1b | 18 | 0.495 | 86.5 | audio + video | early fusion, late fusion | |
Naranjo-Alcazar_UV_task1b_2 | Naranjo-Alcazar2021_t1b | 22 | 0.640 | 83.2 | video | video only | |
Naranjo-Alcazar_UV_task1b_3 | Naranjo-Alcazar2021_t1b | 32 | 1.006 | 66.8 | audio | audio only | |
Okazaki_LDSLVision_task1b_1 | Okazaki2021 | 12 | 0.312 | 91.6 | video | video only | |
Okazaki_LDSLVision_task1b_2 | Okazaki2021 | 13 | 0.320 | 93.2 | audio + video | audio-visual | |
Okazaki_LDSLVision_task1b_3 | Okazaki2021 | 11 | 0.303 | 93.5 | audio + video | audio-visual | |
Okazaki_LDSLVision_task1b_4 | Okazaki2021 | 9 | 0.257 | 93.5 | audio + video | audio-visual | |
Peng_CQU_task1b_1 | Peng2021 | 45 | 1.395 | 68.2 | audio | audio only | |
Peng_CQU_task1b_2 | Peng2021 | 40 | 1.172 | 67.8 | audio | audio only | |
Peng_CQU_task1b_3 | Peng2021 | 41 | 1.172 | 67.8 | audio | audio only | |
Peng_CQU_task1b_4 | Peng2021 | 43 | 1.233 | 68.5 | audio | audio only | |
Pham_AIT_task1b_1 | Pham2021 | 44 | 1.311 | 73.0 | | PROD late fusion |
Pham_AIT_task1b_2 | Pham2021 | 21 | 0.589 | 88.3 | | PROD late fusion |
Pham_AIT_task1b_3 | Pham2021 | 17 | 0.434 | 88.4 | | PROD late fusion |
Pham_AIT_task1b_4 | Pham2021 | 28 | 0.738 | 91.5 | | PROD late fusion |
Triantafyllopoulos_AUD_task1b_1 | Triantafyllopoulos2021 | 39 | 1.157 | 58.4 | audio | ||
Triantafyllopoulos_AUD_task1b_2 | Triantafyllopoulos2021 | 27 | 0.735 | 73.6 | audio + video | FiLM conditioning | |
Triantafyllopoulos_AUD_task1b_3 | Triantafyllopoulos2021 | 30 | 0.785 | 73.7 | audio + video | Bias conditioning | |
Triantafyllopoulos_AUD_task1b_4 | Triantafyllopoulos2021 | 31 | 0.872 | 70.3 | audio + video | Wave conditioning | |
Wang_BIT_task1b_1 | Wang2021a | 36 | 1.061 | 74.1 | video | video only | |
Wang_BIT_task1b_2 | Liang2021 | 42 | 1.180 | 62.4 | audio | audio only | |
DCASE2021 baseline | | | 0.662 | 77.1 | audio + video | early fusion |
Yang_THU_task1b_1 | Yang2021 | 15 | 0.332 | 90.8 | audio + video | early fusion | |
Yang_THU_task1b_2 | Yang2021 | 14 | 0.321 | 90.8 | audio + video | early fusion | |
Yang_THU_task1b_3 | Yang2021 | 10 | 0.279 | 92.1 | audio + video | early fusion | |
Zhang_IOA_task1b_1 | Wang2021b | 3 | 0.201 | 93.5 | audio + video | early fusion | |
Zhang_IOA_task1b_2 | Wang2021b | 4 | 0.205 | 93.6 | audio + video | early fusion | |
Zhang_IOA_task1b_3 | Wang2021b | 1 | 0.195 | 93.8 | audio + video | early fusion | |
Zhang_IOA_task1b_4 | Wang2021b | 2 | 0.199 | 93.9 | audio + video | early fusion |
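Most submissions combine the two modalities with either early fusion (joining audio and video features before a shared classifier) or late fusion (combining per-modality predictions). The sketch below illustrates the difference in a generic form; the embedding dimensions, layer sizes, and the use of OpenL3-style clip embeddings are assumptions for illustration and do not correspond to any particular submission:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10  # airport, bus, metro, ...

class EarlyFusion(nn.Module):
    """Concatenate audio and video embeddings, then classify jointly."""
    def __init__(self, audio_dim=512, video_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES))

    def forward(self, audio_emb, video_emb):
        return self.head(torch.cat([audio_emb, video_emb], dim=-1))

class LateFusion(nn.Module):
    """Run a separate classifier per modality and average their probabilities."""
    def __init__(self, audio_dim=512, video_dim=512):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, NUM_CLASSES)
        self.video_head = nn.Linear(video_dim, NUM_CLASSES)

    def forward(self, audio_emb, video_emb):
        p_audio = self.audio_head(audio_emb).softmax(dim=-1)
        p_video = self.video_head(video_emb).softmax(dim=-1)
        return 0.5 * (p_audio + p_video)

audio_emb = torch.randn(4, 512)  # stand-in audio clip embeddings
video_emb = torch.randn(4, 512)  # stand-in video clip embeddings
print(EarlyFusion()(audio_emb, video_emb).shape, LateFusion()(audio_emb, video_emb).shape)
```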
Class-wise performance
Log loss
Submission label | Technical Report | Official system rank | Logloss | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | Boes2021 | 23 | 0.653 | 0.769 | 0.600 | 0.724 | 0.705 | 0.175 | 0.751 | 0.495 | 0.679 | 0.452 | 1.175 | |
Boes_KUL_task1b_2 | Boes2021 | 25 | 0.683 | 0.615 | 0.759 | 0.815 | 0.517 | 0.214 | 0.835 | 0.621 | 0.860 | 0.421 | 1.177 | |
Boes_KUL_task1b_3 | Boes2021 | 26 | 0.701 | 0.578 | 0.838 | 0.785 | 0.440 | 0.158 | 1.128 | 0.617 | 0.917 | 0.417 | 1.133 | |
Boes_KUL_task1b_4 | Boes2021 | 24 | 0.681 | 0.573 | 0.836 | 0.868 | 0.493 | 0.192 | 0.870 | 0.643 | 0.821 | 0.405 | 1.110 | |
Diez_Noismart_task1b_1 | Diez2021 | 35 | 1.061 | 2.085 | 0.962 | 1.087 | 1.203 | 0.340 | 1.113 | 1.203 | 1.272 | 0.581 | 0.763 | |
Diez_Noismart_task1b_2 | Diez2021 | 38 | 1.096 | 1.654 | 0.807 | 1.524 | 1.740 | 0.384 | 0.924 | 1.362 | 1.158 | 0.496 | 0.909 | |
Diez_Noismart_task1b_3 | Diez2021 | 34 | 1.060 | 1.266 | 0.843 | 1.413 | 1.531 | 0.299 | 1.148 | 1.585 | 1.204 | 0.450 | 0.858 | |
Du_USTC_task1b_1 | Wang2021 | 8 | 0.241 | 0.267 | 0.113 | 0.257 | 0.018 | 0.025 | 0.453 | 0.241 | 0.620 | 0.092 | 0.325 | |
Du_USTC_task1b_2 | Wang2021 | 7 | 0.238 | 0.237 | 0.116 | 0.232 | 0.017 | 0.031 | 0.469 | 0.279 | 0.578 | 0.085 | 0.339 | |
Du_USTC_task1b_3 | Wang2021 | 6 | 0.222 | 0.234 | 0.136 | 0.211 | 0.023 | 0.028 | 0.456 | 0.223 | 0.553 | 0.082 | 0.273 | |
Du_USTC_task1b_4 | Wang2021 | 5 | 0.221 | 0.250 | 0.133 | 0.214 | 0.023 | 0.028 | 0.432 | 0.220 | 0.542 | 0.088 | 0.277 | |
Fedorishin_UB_task1b_1 | Fedorishin2021 | 37 | 1.077 | 2.391 | 0.552 | 1.287 | 1.139 | 0.241 | 1.272 | 0.786 | 1.522 | 0.526 | 1.052 | |
Fedorishin_UB_task1b_2 | Fedorishin2021 | 33 | 1.028 | 1.671 | 0.615 | 1.195 | 1.118 | 0.200 | 1.745 | 1.008 | 1.291 | 0.524 | 0.910 | |
Hou_UGent_task1b_1 | Hou2021 | 20 | 0.555 | 0.703 | 0.581 | 0.373 | 0.211 | 0.021 | 1.113 | 0.610 | 0.680 | 0.271 | 0.985 | |
Hou_UGent_task1b_2 | Hou2021 | 29 | 0.771 | 0.917 | 0.902 | 0.477 | 0.231 | 0.006 | 1.810 | 0.704 | 1.458 | 0.339 | 0.863 | |
Hou_UGent_task1b_3 | Hou2021 | 19 | 0.523 | 0.571 | 0.459 | 0.300 | 0.231 | 0.054 | 0.931 | 0.671 | 0.730 | 0.367 | 0.911 | |
Hou_UGent_task1b_4 | Hou2021 | 16 | 0.416 | 0.524 | 0.424 | 0.257 | 0.161 | 0.019 | 0.846 | 0.474 | 0.641 | 0.219 | 0.592 | |
Naranjo-Alcazar_UV_task1b_1 | Naranjo-Alcazar2021_t1b | 18 | 0.495 | 0.771 | 0.475 | 0.628 | 0.223 | 0.102 | 0.664 | 0.371 | 0.714 | 0.191 | 0.810 | |
Naranjo-Alcazar_UV_task1b_2 | Naranjo-Alcazar2021_t1b | 22 | 0.640 | 0.809 | 0.424 | 0.646 | 0.128 | 0.087 | 0.831 | 0.526 | 0.960 | 0.305 | 1.677 | |
Naranjo-Alcazar_UV_task1b_3 | Naranjo-Alcazar2021_t1b | 32 | 1.006 | 1.777 | 0.605 | 0.979 | 1.177 | 0.345 | 1.396 | 1.093 | 1.039 | 0.656 | 0.990 | |
Okazaki_LDSLVision_task1b_1 | Okazaki2021 | 12 | 0.312 | 0.227 | 0.314 | 0.308 | 0.144 | 0.107 | 0.477 | 0.180 | 0.532 | 0.182 | 0.645 | |
Okazaki_LDSLVision_task1b_2 | Okazaki2021 | 13 | 0.320 | 0.284 | 0.161 | 0.307 | 0.102 | 0.043 | 0.657 | 0.228 | 0.880 | 0.124 | 0.411 | |
Okazaki_LDSLVision_task1b_3 | Okazaki2021 | 11 | 0.303 | 0.279 | 0.282 | 0.304 | 0.172 | 0.105 | 0.506 | 0.213 | 0.544 | 0.171 | 0.457 | |
Okazaki_LDSLVision_task1b_4 | Okazaki2021 | 9 | 0.257 | 0.257 | 0.181 | 0.222 | 0.078 | 0.040 | 0.472 | 0.165 | 0.662 | 0.106 | 0.387 | |
Peng_CQU_task1b_1 | Peng2021 | 45 | 1.395 | 2.363 | 0.934 | 1.157 | 1.568 | 0.194 | 3.260 | 1.239 | 1.561 | 0.585 | 1.090 | |
Peng_CQU_task1b_2 | Peng2021 | 40 | 1.172 | 1.414 | 0.722 | 0.907 | 1.140 | 0.198 | 2.190 | 1.515 | 1.662 | 0.620 | 1.353 | |
Peng_CQU_task1b_3 | Peng2021 | 41 | 1.172 | 1.414 | 0.722 | 0.907 | 1.140 | 0.198 | 2.190 | 1.515 | 1.662 | 0.620 | 1.353 | |
Peng_CQU_task1b_4 | Peng2021 | 43 | 1.233 | 2.224 | 0.668 | 1.004 | 1.340 | 0.278 | 2.258 | 1.269 | 1.803 | 0.517 | 0.966 | |
Pham_AIT_task1b_1 | Pham2021 | 44 | 1.311 | 1.588 | 0.667 | 1.989 | 1.202 | 0.321 | 2.089 | 1.738 | 1.346 | 1.094 | 1.074 | |
Pham_AIT_task1b_2 | Pham2021 | 21 | 0.589 | 0.444 | 0.472 | 0.538 | 0.044 | 0.080 | 1.265 | 0.286 | 0.980 | 0.412 | 1.367 | |
Pham_AIT_task1b_3 | Pham2021 | 17 | 0.434 | 0.455 | 0.225 | 0.394 | 0.107 | 0.067 | 0.830 | 0.457 | 0.833 | 0.273 | 0.704 | |
Pham_AIT_task1b_4 | Pham2021 | 28 | 0.738 | 0.389 | 0.393 | 0.695 | 0.034 | 0.004 | 1.952 | 0.515 | 1.554 | 0.483 | 1.361 | |
Triantafyllopoulos_AUD_task1b_1 | Triantafyllopoulos2021 | 39 | 1.157 | 1.442 | 0.678 | 1.336 | 1.959 | 0.283 | 1.492 | 0.677 | 1.430 | 0.757 | 1.520 | |
Triantafyllopoulos_AUD_task1b_2 | Triantafyllopoulos2021 | 27 | 0.735 | 1.038 | 0.702 | 0.931 | 0.306 | 0.075 | 0.978 | 0.730 | 1.400 | 0.401 | 0.792 | |
Triantafyllopoulos_AUD_task1b_3 | Triantafyllopoulos2021 | 30 | 0.785 | 1.503 | 0.604 | 0.512 | 0.832 | 0.164 | 0.623 | 0.502 | 0.921 | 0.979 | 1.211 | |
Triantafyllopoulos_AUD_task1b_4 | Triantafyllopoulos2021 | 31 | 0.872 | 1.556 | 0.815 | 0.956 | 1.130 | 0.427 | 1.009 | 0.232 | 1.572 | 0.521 | 0.503 | |
Wang_BIT_task1b_1 | Wang2021a | 36 | 1.061 | 2.558 | 0.529 | 2.405 | 1.250 | 0.007 | 1.965 | 0.047 | 0.699 | 0.591 | 0.562 | |
Wang_BIT_task1b_2 | Liang2021 | 42 | 1.180 | 1.114 | 1.121 | 1.223 | 1.516 | 0.722 | 1.516 | 1.035 | 1.516 | 0.887 | 1.151 | |
DCASE2021 baseline | | | 0.662 | 1.313 | 0.496 | 0.683 | 0.517 | 0.151 | 1.002 | 0.586 | 0.907 | 0.215 | 0.751 |
Yang_THU_task1b_1 | Yang2021 | 15 | 0.332 | 0.774 | 0.270 | 0.288 | 0.098 | 0.032 | 0.536 | 0.207 | 0.664 | 0.080 | 0.375 | |
Yang_THU_task1b_2 | Yang2021 | 14 | 0.321 | 0.699 | 0.169 | 0.331 | 0.052 | 0.026 | 0.553 | 0.219 | 0.419 | 0.095 | 0.642 | |
Yang_THU_task1b_3 | Yang2021 | 10 | 0.279 | 0.602 | 0.155 | 0.292 | 0.042 | 0.030 | 0.549 | 0.212 | 0.442 | 0.074 | 0.395 | |
Zhang_IOA_task1b_1 | Wang2021b | 3 | 0.201 | 0.463 | 0.099 | 0.113 | 0.043 | 0.009 | 0.417 | 0.053 | 0.373 | 0.066 | 0.369 | |
Zhang_IOA_task1b_2 | Wang2021b | 4 | 0.205 | 0.493 | 0.086 | 0.141 | 0.045 | 0.013 | 0.392 | 0.063 | 0.388 | 0.066 | 0.358 | |
Zhang_IOA_task1b_3 | Wang2021b | 1 | 0.195 | 0.454 | 0.092 | 0.113 | 0.041 | 0.009 | 0.402 | 0.047 | 0.379 | 0.064 | 0.352 | |
Zhang_IOA_task1b_4 | Wang2021b | 2 | 0.199 | 0.454 | 0.079 | 0.138 | 0.044 | 0.015 | 0.376 | 0.061 | 0.389 | 0.069 | 0.363 |
Accuracy
Submission label | Technical Report | Official system rank | Accuracy | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | Boes2021 | 23 | 74.5 | 63.2 | 80.4 | 76.7 | 75.5 | 94.2 | 70.1 | 83.4 | 79.7 | 85.1 | 36.5 | |
Boes_KUL_task1b_2 | Boes2021 | 25 | 76.0 | 79.7 | 72.4 | 73.8 | 83.6 | 92.4 | 70.1 | 77.8 | 74.1 | 85.5 | 50.6 | |
Boes_KUL_task1b_3 | Boes2021 | 26 | 76.3 | 80.8 | 71.8 | 74.4 | 86.5 | 94.6 | 63.4 | 79.8 | 71.4 | 87.6 | 53.0 | |
Boes_KUL_task1b_4 | Boes2021 | 24 | 76.0 | 81.8 | 69.4 | 69.8 | 84.3 | 94.2 | 68.5 | 77.6 | 73.5 | 86.8 | 53.7 | |
Diez_Noismart_task1b_1 | Diez2021 | 35 | 65.2 | 37.9 | 67.8 | 59.7 | 59.4 | 89.4 | 58.0 | 73.8 | 53.2 | 80.7 | 71.8 | |
Diez_Noismart_task1b_2 | Diez2021 | 38 | 64.4 | 44.1 | 74.9 | 44.8 | 47.9 | 88.9 | 64.3 | 71.8 | 55.1 | 83.7 | 68.8 | |
Diez_Noismart_task1b_3 | Diez2021 | 34 | 64.7 | 53.2 | 73.0 | 47.4 | 52.1 | 91.2 | 55.4 | 62.9 | 57.6 | 85.3 | 69.3 | |
Du_USTC_task1b_1 | Wang2021 | 8 | 92.9 | 90.2 | 96.8 | 93.2 | 99.7 | 99.9 | 87.3 | 93.6 | 81.4 | 96.8 | 89.7 | |
Du_USTC_task1b_2 | Wang2021 | 7 | 92.7 | 90.9 | 96.4 | 94.2 | 99.8 | 99.6 | 86.1 | 93.0 | 81.0 | 97.2 | 89.2 | |
Du_USTC_task1b_3 | Wang2021 | 6 | 93.2 | 91.1 | 95.9 | 95.5 | 99.4 | 99.8 | 85.2 | 93.8 | 81.7 | 97.5 | 92.3 | |
Du_USTC_task1b_4 | Wang2021 | 5 | 93.2 | 90.7 | 96.2 | 95.3 | 99.5 | 99.9 | 85.4 | 94.1 | 81.5 | 97.2 | 92.1 | |
Fedorishin_UB_task1b_1 | Fedorishin2021 | 37 | 67.2 | 42.3 | 81.4 | 58.3 | 63.4 | 92.9 | 55.0 | 77.2 | 51.2 | 83.7 | 66.1 | |
Fedorishin_UB_task1b_2 | Fedorishin2021 | 33 | 68.7 | 54.2 | 80.9 | 63.7 | 65.2 | 94.6 | 45.9 | 68.6 | 57.8 | 84.6 | 71.7 | |
Hou_UGent_task1b_1 | Hou2021 | 20 | 81.5 | 79.5 | 75.9 | 85.5 | 91.9 | 99.1 | 63.4 | 80.6 | 80.1 | 90.7 | 68.5 | |
Hou_UGent_task1b_2 | Hou2021 | 29 | 81.8 | 70.4 | 75.7 | 86.2 | 93.1 | 99.8 | 65.1 | 83.7 | 73.9 | 93.2 | 77.3 | |
Hou_UGent_task1b_3 | Hou2021 | 19 | 84.0 | 82.1 | 88.1 | 88.1 | 92.5 | 99.3 | 74.0 | 78.4 | 81.6 | 88.8 | 67.0 | |
Hou_UGent_task1b_4 | Hou2021 | 16 | 85.6 | 82.8 | 83.5 | 88.1 | 95.1 | 100.0 | 70.4 | 86.7 | 81.8 | 94.1 | 73.3 | |
Naranjo-Alcazar_UV_task1b_1 | Naranjo-Alcazar2021_t1b | 18 | 86.5 | 78.6 | 86.9 | 82.6 | 97.7 | 99.4 | 78.2 | 91.8 | 80.2 | 95.7 | 74.3 | |
Naranjo-Alcazar_UV_task1b_2 | Naranjo-Alcazar2021_t1b | 22 | 83.2 | 75.9 | 87.5 | 84.4 | 95.1 | 95.5 | 82.5 | 86.4 | 77.0 | 92.6 | 54.6 | |
Naranjo-Alcazar_UV_task1b_3 | Naranjo-Alcazar2021_t1b | 32 | 66.8 | 45.8 | 79.4 | 67.0 | 63.4 | 89.3 | 51.1 | 63.2 | 66.0 | 78.8 | 64.0 | |
Okazaki_LDSLVision_task1b_1 | Okazaki2021 | 12 | 91.6 | 98.1 | 89.9 | 91.2 | 97.8 | 100.0 | 85.9 | 97.0 | 85.6 | 96.6 | 73.5 | |
Okazaki_LDSLVision_task1b_2 | Okazaki2021 | 13 | 93.2 | 96.1 | 96.6 | 92.8 | 98.7 | 99.9 | 85.6 | 95.4 | 81.6 | 97.9 | 87.0 | |
Okazaki_LDSLVision_task1b_3 | Okazaki2021 | 11 | 93.5 | 96.0 | 95.1 | 96.0 | 99.0 | 100.0 | 85.2 | 96.8 | 85.0 | 97.9 | 84.3 | |
Okazaki_LDSLVision_task1b_4 | Okazaki2021 | 9 | 93.5 | 96.0 | 95.1 | 96.0 | 99.0 | 100.0 | 85.2 | 96.8 | 85.0 | 97.9 | 84.3 | |
Peng_CQU_task1b_1 | Peng2021 | 45 | 68.2 | 49.5 | 78.6 | 69.5 | 61.2 | 94.8 | 46.8 | 66.2 | 57.9 | 84.4 | 73.1 | |
Peng_CQU_task1b_2 | Peng2021 | 40 | 67.8 | 53.8 | 79.5 | 72.8 | 65.9 | 94.4 | 49.3 | 58.6 | 54.3 | 82.6 | 66.7 | |
Peng_CQU_task1b_3 | Peng2021 | 41 | 67.8 | 53.8 | 79.5 | 72.8 | 65.9 | 94.4 | 49.3 | 58.6 | 54.3 | 82.6 | 66.7 | |
Peng_CQU_task1b_4 | Peng2021 | 43 | 68.5 | 45.0 | 81.0 | 70.5 | 66.3 | 92.0 | 47.8 | 67.8 | 58.6 | 84.8 | 70.9 | |
Pham_AIT_task1b_1 | Pham2021 | 44 | 73.0 | 64.5 | 87.9 | 63.2 | 74.7 | 93.2 | 54.4 | 67.3 | 70.2 | 78.9 | 75.7 | |
Pham_AIT_task1b_2 | Pham2021 | 21 | 88.3 | 88.0 | 88.1 | 87.6 | 98.8 | 96.0 | 79.0 | 93.1 | 84.4 | 93.5 | 75.0 | |
Pham_AIT_task1b_3 | Pham2021 | 17 | 88.4 | 84.8 | 92.3 | 88.7 | 96.7 | 97.3 | 77.4 | 89.6 | 82.9 | 93.7 | 81.0 | |
Pham_AIT_task1b_4 | Pham2021 | 28 | 91.5 | 94.2 | 94.1 | 89.6 | 99.2 | 99.9 | 78.1 | 95.0 | 85.1 | 95.4 | 84.1 | |
Triantafyllopoulos_AUD_task1b_1 | Triantafyllopoulos2021 | 39 | 58.4 | 37.2 | 77.4 | 52.1 | 36.1 | 93.5 | 41.9 | 76.8 | 48.8 | 78.8 | 41.9 | |
Triantafyllopoulos_AUD_task1b_2 | Triantafyllopoulos2021 | 27 | 73.6 | 59.9 | 75.2 | 64.7 | 90.9 | 99.0 | 65.6 | 72.8 | 55.0 | 87.3 | 65.3 | |
Triantafyllopoulos_AUD_task1b_3 | Triantafyllopoulos2021 | 30 | 73.7 | 51.7 | 79.4 | 82.0 | 78.3 | 95.1 | 76.6 | 82.1 | 66.1 | 74.4 | 51.1 | |
Triantafyllopoulos_AUD_task1b_4 | Triantafyllopoulos2021 | 31 | 70.3 | 45.8 | 71.6 | 60.9 | 69.5 | 86.9 | 71.3 | 92.0 | 41.4 | 84.7 | 79.2 | |
Wang_BIT_task1b_1 | Wang2021a | 36 | 74.1 | 45.6 | 80.2 | 39.9 | 64.5 | 99.9 | 62.8 | 98.2 | 81.7 | 84.9 | 82.9 | |
Wang_BIT_task1b_2 | Liang2021 | 42 | 62.4 | 63.3 | 62.6 | 66.2 | 49.9 | 90.9 | 35.3 | 65.8 | 51.6 | 79.0 | 59.0 | |
DCASE2021 baseline | | | 77.1 | 56.6 | 83.4 | 75.8 | 83.9 | 95.8 | 63.6 | 78.2 | 70.6 | 93.3 | 70.4 |
Yang_THU_task1b_1 | Yang2021 | 15 | 90.8 | 83.5 | 90.6 | 90.6 | 97.0 | 99.3 | 85.1 | 93.4 | 82.0 | 97.8 | 89.0 | |
Yang_THU_task1b_2 | Yang2021 | 14 | 90.8 | 81.5 | 94.8 | 89.4 | 98.4 | 98.7 | 85.0 | 92.7 | 89.8 | 97.4 | 80.6 | |
Yang_THU_task1b_3 | Yang2021 | 10 | 92.1 | 84.8 | 94.8 | 90.5 | 98.6 | 98.8 | 84.6 | 93.2 | 89.0 | 97.7 | 88.5 | |
Zhang_IOA_task1b_1 | Wang2021b | 3 | 93.5 | 88.0 | 96.2 | 97.7 | 99.3 | 100.0 | 83.1 | 98.0 | 88.3 | 97.7 | 86.7 | |
Zhang_IOA_task1b_2 | Wang2021b | 4 | 93.6 | 87.0 | 96.9 | 97.8 | 99.3 | 100.0 | 84.9 | 97.6 | 88.0 | 97.9 | 86.4 | |
Zhang_IOA_task1b_3 | Wang2021b | 1 | 93.8 | 88.1 | 96.6 | 97.9 | 99.3 | 100.0 | 83.7 | 98.5 | 88.2 | 97.8 | 87.6 | |
Zhang_IOA_task1b_4 | Wang2021b | 2 | 93.9 | 87.4 | 97.4 | 98.2 | 99.4 | 100.0 | 85.3 | 98.0 | 88.0 | 98.0 | 87.2 |
System characteristics
General characteristics
Submission label | Technical Report | Official system rank | Logloss (Eval) | Accuracy (Eval) | Sampling rate | Data augmentation | Features | Embeddings / audio | Embeddings / visual |
---|---|---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | Boes2021 | 23 | 0.653 | 74.5 | 22.05kHz | mixup | log-mel energies | VGGish | VGG16 | |
Boes_KUL_task1b_2 | Boes2021 | 25 | 0.683 | 76.0 | 22.05kHz | mixup | log-mel energies | VGGish | VGG16 | |
Boes_KUL_task1b_3 | Boes2021 | 26 | 0.701 | 76.3 | 22.05kHz | mixup | log-mel energies | VGGish | VGG16 | |
Boes_KUL_task1b_4 | Boes2021 | 24 | 0.681 | 76.0 | 22.05kHz | mixup | log-mel energies | VGGish | VGG16 | |
Diez_Noismart_task1b_1 | Diez2021 | 35 | 1.061 | 65.2 | 48.0kHz | mixup | log-mel spectrogram | OpenL3 | ||
Diez_Noismart_task1b_2 | Diez2021 | 38 | 1.096 | 64.4 | 48.0kHz | mixup | log-mel spectrogram | OpenL3 | ||
Diez_Noismart_task1b_3 | Diez2021 | 34 | 1.060 | 64.7 | 48.0kHz | mixup | log-mel spectrogram | OpenL3 | ||
Du_USTC_task1b_1 | Wang2021 | 8 | 0.241 | 92.9 | 48.0kHz | mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness | log-mel energies | FCNN, ResNet17, HMM | DenseNet161, ResNet50, ResNeSt50, HMM, VGG19 | |
Du_USTC_task1b_2 | Wang2021 | 7 | 0.238 | 92.7 | 48.0kHz | mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness | log-mel energies | FCNN, ResNet17, HMM | DenseNet161, ResNet50, ResNeSt50, HMM, VGG19 | |
Du_USTC_task1b_3 | Wang2021 | 6 | 0.222 | 93.2 | 48.0kHz | mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness | log-mel energies | FCNN, ResNet17 | DenseNet161, ResNet50, ResNeSt50 | |
Du_USTC_task1b_4 | Wang2021 | 5 | 0.221 | 93.2 | 48.0kHz | mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness | log-mel energies | FCNN, ResNet17 | DenseNet161, ResNet50, ResNeSt50 | |
Fedorishin_UB_task1b_1 | Fedorishin2021 | 37 | 1.077 | 67.2 | 48.0kHz | mixup, time-shifting | log-mel energies, raw waveform | |||
Fedorishin_UB_task1b_2 | Fedorishin2021 | 33 | 1.028 | 68.7 | 48.0kHz | mixup, time-shifting | log-mel energies, raw waveform | |||
Hou_UGent_task1b_1 | Hou2021 | 20 | 0.555 | 81.5 | 48.0kHz | OpenL3 | OpenL3 | ResNet50 | ||
Hou_UGent_task1b_2 | Hou2021 | 29 | 0.771 | 81.8 | 48.0kHz | OpenL3 | OpenL3 | ResNet50 | ||
Hou_UGent_task1b_3 | Hou2021 | 19 | 0.523 | 84.0 | 48.0kHz | OpenL3 | OpenL3 | ResNet50 | ||
Hou_UGent_task1b_4 | Hou2021 | 16 | 0.416 | 85.6 | 48.0kHz | OpenL3 | OpenL3 | ResNet50 | ||
Naranjo-Alcazar_UV_task1b_1 | Naranjo-Alcazar2021_t1b | 18 | 0.495 | 86.5 | 44.1kHz | mixup | gammatone spectrogram | |||
Naranjo-Alcazar_UV_task1b_2 | Naranjo-Alcazar2021_t1b | 22 | 0.640 | 83.2 | ||||||
Naranjo-Alcazar_UV_task1b_3 | Naranjo-Alcazar2021_t1b | 32 | 1.006 | 66.8 | 44.1kHz | mixup | gammatone spectrogram | |||
Okazaki_LDSLVision_task1b_1 | Okazaki2021 | 12 | 0.312 | 91.6 | | (Video) random affine, color jitter, gaussian blur, random erasing | | | CNN, CLIP CNN/ViT |
Okazaki_LDSLVision_task1b_2 | Okazaki2021 | 13 | 0.320 | 93.2 | 48.0kHz | (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing | log-mel spectrogram | CNN | CNN, CLIP CNN/ViT | |
Okazaki_LDSLVision_task1b_3 | Okazaki2021 | 11 | 0.303 | 93.5 | 48.0kHz | (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing | log-mel spectrogram | CNN | CNN, CLIP CNN/ViT | |
Okazaki_LDSLVision_task1b_4 | Okazaki2021 | 9 | 0.257 | 93.5 | 48.0kHz | (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing | log-mel spectrogram | CNN | CNN, CLIP CNN/ViT | |
Peng_CQU_task1b_1 | Peng2021 | 45 | 1.395 | 68.2 | 44.1kHz | log-mel energies | CRFDS | |||
Peng_CQU_task1b_2 | Peng2021 | 40 | 1.172 | 67.8 | 44.1kHz | log-mel energies | CRFDS | |||
Peng_CQU_task1b_3 | Peng2021 | 41 | 1.172 | 67.8 | 44.1kHz | log-mel energies | CRFDS | |||
Peng_CQU_task1b_4 | Peng2021 | 43 | 1.233 | 68.5 | 44.1kHz | log-mel energies | CRFDS | |||
Pham_AIT_task1b_1 | Pham2021 | 44 | 1.311 | 73.0 | 48.0kHz | mixup | CQT, Gammatonegram, log-mel | |||
Pham_AIT_task1b_2 | Pham2021 | 21 | 0.589 | 88.3 | 48.0kHz | mixup | CQT, Gammatonegram, log-mel | |||
Pham_AIT_task1b_3 | Pham2021 | 17 | 0.434 | 88.4 | 48.0kHz | mixup | CQT, Gammatonegram, log-mel | |||
Pham_AIT_task1b_4 | Pham2021 | 28 | 0.738 | 91.5 | 48.0kHz | mixup | CQT, Gammatonegram, log-mel | |||
Triantafyllopoulos_AUD_task1b_1 | Triantafyllopoulos2021 | 39 | 1.157 | 58.4 | 48.0kHz | frequency masking, time masking, random cropping | log-mel spectrogram | |||
Triantafyllopoulos_AUD_task1b_2 | Triantafyllopoulos2021 | 27 | 0.735 | 73.6 | 48.0kHz | frequency masking, time masking, random cropping | log-mel spectrogram | OpenL3 | OpenL3 | |
Triantafyllopoulos_AUD_task1b_3 | Triantafyllopoulos2021 | 30 | 0.785 | 73.7 | 48.0kHz | frequency masking, time masking, random cropping | log-mel spectrogram | OpenL3 | OpenL3 | |
Triantafyllopoulos_AUD_task1b_4 | Triantafyllopoulos2021 | 31 | 0.872 | 70.3 | 48.0kHz | frequency masking, time masking, random cropping | log-mel spectrogram | OpenL3 | OpenL3 | |
Wang_BIT_task1b_1 | Wang2021a | 36 | 1.061 | 74.1 | ||||||
Wang_BIT_task1b_2 | Liang2021 | 42 | 1.180 | 62.4 | 44.1kHz | log-mel energies | ||||
DCASE2021 baseline | | | 0.662 | 77.1 | 48.0kHz | | log-mel energies | OpenL3 | OpenL3 |
Yang_THU_task1b_1 | Yang2021 | 15 | 0.332 | 90.8 | 44.1kHz | mixup, specaugment | log-mel energies | VGGish | Transformer | |
Yang_THU_task1b_2 | Yang2021 | 14 | 0.321 | 90.8 | 44.1kHz | mixup, specaugment | log-mel energies | Transformer | Transformer | |
Yang_THU_task1b_3 | Yang2021 | 10 | 0.279 | 92.1 | 44.1kHz | mixup, specaugment | log-mel energies | Transformer, VGGish | Transformer | |
Zhang_IOA_task1b_1 | Wang2021b | 3 | 0.201 | 93.5 | 48.0kHz | mixup | log-mel energies, CQT, bark | |||
Zhang_IOA_task1b_2 | Wang2021b | 4 | 0.205 | 93.6 | 48.0kHz | mixup | log-mel energies, CQT, bark | |||
Zhang_IOA_task1b_3 | Wang2021b | 1 | 0.195 | 93.8 | 48.0kHz | mixup | log-mel energies, CQT, bark | |||
Zhang_IOA_task1b_4 | Wang2021b | 2 | 0.199 | 93.9 | 48.0kHz | mixup | log-mel energies, CQT, bark |
Machine learning characteristics
Submission label | Technical Report | Official system rank | Logloss (Eval) | Accuracy (Eval) | External data usage | External data sources | Model complexity | Model complexity / Audio | Model complexity / Visual | Classifier | Ensemble subsystems | Decision making |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Boes_KUL_task1b_1 | Boes2021 | 23 | 0.653 | 74.5 | embeddings | 2302441 | CNN, transformer | ||||||
Boes_KUL_task1b_2 | Boes2021 | 25 | 0.683 | 76.0 | embeddings | 2302441 | CNN, transformer | ||||||
Boes_KUL_task1b_3 | Boes2021 | 26 | 0.701 | 76.3 | embeddings | 2302441 | CNN, transformer | ||||||
Boes_KUL_task1b_4 | Boes2021 | 24 | 0.681 | 76.0 | embeddings | 2302441 | CNN, transformer | ||||||
Diez_Noismart_task1b_1 | Diez2021 | 35 | 1.061 | 65.2 | embeddings | 970294 | 970294 | CNN | maximum likelihood | ||||
Diez_Noismart_task1b_2 | Diez2021 | 38 | 1.096 | 64.4 | embeddings | 751690 | 751690 | CNN | maximum likelihood | ||||
Diez_Noismart_task1b_3 | Diez2021 | 34 | 1.060 | 64.7 | embeddings | 972598 | 972598 | CNN | maximum likelihood | ||||
Du_USTC_task1b_1 | Wang2021 | 8 | 0.241 | 92.9 | pre-trained model | Places365 | 933932599 | 250090842 | 678276533 | GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble | 4 | average | |
Du_USTC_task1b_2 | Wang2021 | 7 | 0.238 | 92.7 | pre-trained model | Places365 | 1185866410 | 404955720 | 780910690 | GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble | 8 | average | |
Du_USTC_task1b_3 | Wang2021 | 6 | 0.222 | 93.2 | pre-trained model | Places365 | 263064259 | 154864878 | 102634157 | CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble | 4 | average | |
Du_USTC_task1b_4 | Wang2021 | 5 | 0.221 | 93.2 | pre-trained model | Places365, ImageNet | 373738849 | 215535992 | 149650221 | CNN, ResNet, DNN, DenseNet, ResNeSt, ensemble | 6 | average | |
Fedorishin_UB_task1b_1 | Fedorishin2021 | 37 | 1.077 | 67.2 | 1351562 | 1351562 | 0 | CNN | maximum likelihood | ||||
Fedorishin_UB_task1b_2 | Fedorishin2021 | 33 | 1.028 | 68.7 | 5422730 | 5422730 | 0 | CNN | maximum likelihood | ||||
Hou_UGent_task1b_1 | Hou2021 | 20 | 0.555 | 81.5 | embeddings, pre-trained model | 28329010 | 2113024 | 26082088 | CNN, ResNet | maximum likelihood | |||
Hou_UGent_task1b_2 | Hou2021 | 29 | 0.771 | 81.8 | embeddings, pre-trained model | 28329010 | 2113024 | 26082088 | CNN, ResNet | 3 | maximum likelihood | ||
Hou_UGent_task1b_3 | Hou2021 | 19 | 0.523 | 84.0 | embeddings, pre-trained model | 28329010 | 2113024 | 26082088 | CNN, ResNet | 2 | maximum likelihood | ||
Hou_UGent_task1b_4 | Hou2021 | 16 | 0.416 | 85.6 | embeddings, pre-trained model | 28329010 | 2113024 | 26082088 | CNN, ResNet | 3 | maximum likelihood | ||
Naranjo-Alcazar_UV_task1b_1 | Naranjo-Alcazar2021_t1b | 18 | 0.495 | 86.5 | pre-trained model Places365 | Places365 | 15416148 | 323274 | 14820170 | CRNN | |||
Naranjo-Alcazar_UV_task1b_2 | Naranjo-Alcazar2021_t1b | 22 | 0.640 | 83.2 | pre-trained model Places365 | Places365 | 14820170 | 14820170 | CRNN | ||||
Naranjo-Alcazar_UV_task1b_3 | Naranjo-Alcazar2021_t1b | 32 | 1.006 | 66.8 | 323274 | 323274 | CRNN | ||||||
Okazaki_LDSLVision_task1b_1 | Okazaki2021 | 12 | 0.312 | 91.6 | pre-trained model | ImageNet pre-trained models, CLIP models | 70676012 | 0 | 70676012 | CNN, ViT, ensemble | 4 | average | |
Okazaki_LDSLVision_task1b_2 | Okazaki2021 | 13 | 0.320 | 93.2 | pre-trained model | ImageNet pre-trained models, CLIP models | 127257242 | 56581230 | 70676012 | CNN, ViT, ensemble | 7 | average | |
Okazaki_LDSLVision_task1b_3 | Okazaki2021 | 11 | 0.303 | 93.5 | pre-trained model | ImageNet pre-trained models, CLIP models | 636286210 | 282906150 | 353380060 | CNN, ViT, ensemble | 35 | average | |
Okazaki_LDSLVision_task1b_4 | Okazaki2021 | 9 | 0.257 | 93.5 | pre-trained model | ImageNet pre-trained models, CLIP models | 636286210 | 282906150 | 353380060 | CNN, ViT, ensemble | 35 | average | |
Peng_CQU_task1b_1 | Peng2021 | 45 | 1.395 | 68.2 | directly | 1817152 | 1817152 | 0 | CNN | average | |||
Peng_CQU_task1b_2 | Peng2021 | 40 | 1.172 | 67.8 | directly | 1817152 | 1817152 | 0 | CNN | average | |||
Peng_CQU_task1b_3 | Peng2021 | 41 | 1.172 | 67.8 | directly | 1817152 | 1817152 | 0 | CNN | average | |||
Peng_CQU_task1b_4 | Peng2021 | 43 | 1.233 | 68.5 | directly | 1817152 | 1817152 | 0 | CNN | average | |||
Pham_AIT_task1b_1 | Pham2021 | 44 | 1.311 | 73.0 | 180626842 | 47582256 | 133044586 | CNN | 6 | Average | |||
Pham_AIT_task1b_2 | Pham2021 | 21 | 0.589 | 88.3 | 180626842 | 47582256 | 133044586 | CNN | 6 | Average | |||
Pham_AIT_task1b_3 | Pham2021 | 17 | 0.434 | 88.4 | 180626842 | 47582256 | 133044586 | CNN | 6 | Average | |||
Pham_AIT_task1b_4 | Pham2021 | 28 | 0.738 | 91.5 | 180626842 | 47582256 | 133044586 | CNN | 6 | Average | |||
Triantafyllopoulos_AUD_task1b_1 | Triantafyllopoulos2021 | 39 | 1.157 | 58.4 | 1468139 | 1468139 | WaveTransformer | maximum likelihood | |||||
Triantafyllopoulos_AUD_task1b_2 | Triantafyllopoulos2021 | 27 | 0.735 | 73.6 | 2747339 | 11897999 | 4691020 | MultimodalWaveTransformer | maximum likelihood | ||||
Triantafyllopoulos_AUD_task1b_3 | Triantafyllopoulos2021 | 30 | 0.785 | 73.7 | 2107739 | 11258399 | 4691020 | MultimodalWaveTransformer | maximum likelihood | ||||
Triantafyllopoulos_AUD_task1b_4 | Triantafyllopoulos2021 | 31 | 0.872 | 70.3 | 2107739 | 11258399 | 4691020 | MultimodalWaveTransformer | maximum likelihood | ||||
Wang_BIT_task1b_1 | Wang2021a | 36 | 1.061 | 74.1 | Transformer | maximum likelihood | |||||||
Wang_BIT_task1b_2 | Liang2021 | 42 | 1.180 | 62.4 | embeddings | 14553134 | 9489294 | 5029654 | CNN | maximum likelihood | |||
DCASE2021 baseline | | | 0.662 | 77.1 | embeddings | | 711454 | 338634 | 338634 | CNN | | maximum likelihood |
Yang_THU_task1b_1 | Yang2021 | 15 | 0.332 | 90.8 | directly | ImageNet | 102000000 | 81000000 | 19800000 | CNN | maximum likelihood | ||
Yang_THU_task1b_2 | Yang2021 | 14 | 0.321 | 90.8 | directly | ImageNet | 40000000 | 18974474 | 19800000 | Multi-head Attention, MLP | maximum likelihood | ||
Yang_THU_task1b_3 | Yang2021 | 10 | 0.279 | 92.1 | directly | ImageNet | 121000000 | 99974474 | 19800000 | Multi-head Attention, MLP, CNN | maximum likelihood | ||
Zhang_IOA_task1b_1 | Wang2021b | 3 | 0.201 | 93.5 | directly, pre-trained model | ImageNet, Places365, EfficientNet, PyTorch Image Models | 100865990 | 17479760 | 77985102 | CNN, EfficientNet, Swin Transformer, ensemble | 4 | weighted vote | |
Zhang_IOA_task1b_2 | Wang2021b | 4 | 0.205 | 93.6 | directly, pre-trained model | ImageNet, Places365, EfficientNet, PyTorch Image Models | 103370202 | 17479760 | 77985102 | CNN, EfficientNet, Swin Transformer, ensemble | 6 | average vote | |
Zhang_IOA_task1b_3 | Wang2021b | 1 | 0.195 | 93.8 | directly, pre-trained model | ImageNet, Places365, EfficientNet, PyTorch Image Models | 110848636 | 25882876 | 77985102 | CNN, EfficientNet, Swin Transformer, ensemble | 5 | weighted vote | |
Zhang_IOA_task1b_4 | Wang2021b | 2 | 0.199 | 93.9 | directly, pre-trained model | ImageNet, Places365, EfficientNet, PyTorch Image Models | 115725988 | 25882876 | 77985102 | CNN, EfficientNet, Swin Transformer, ensemble | 9 | average vote |
Technical reports
Multi-Source Transformer Architectures for Audiovisual Scene Classification
Wim Boes and Hugo Van hamme
ESAT, KU Leuven, Leuven, Belgium
Boes_KUL_task1b_1 Boes_KUL_task1b_2 Boes_KUL_task1b_3 Boes_KUL_task1b_4
Abstract
In this technical report, the systems we submitted for Subtask 1B of the DCASE 2021 challenge, concerning audiovisual scene classification, are described in detail. They are essentially multi-source transformers employing a combination of auditory and visual features to make predictions. These models are evaluated using the macro-averaged multi-class cross-entropy and accuracy metrics. In terms of the macro-averaged multi-class cross-entropy, our best model achieved a score of 0.620 on the validation data. This is slightly better than the performance of the baseline system (0.658). With regard to the accuracy measure, our best model achieved a score of 77.1% on the validation data, which is about the same as the performance obtained by the baseline system (77.0%).
System characteristics
Input | mono |
Sampling rate | 22.05kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, transformer |
Audio Scene Classification Using Enhanced Convolutional Neural Networks for DCASE 2021 Challenge
Itxasne Diez1 and Ibon Saratxaga2
1Getxo, Basque Country, Spain, 2HiTZ Center – Aholab Signal Processing Laboratory, University of the Basque Country, Bilbao, Basque Country, Spain
Diez_Noismart_task1b_1 Diez_Noismart_task1b_2 Diez_Noismart_task1b_3
Abstract
This technical report describes our system proposed for Task 1B (Audio-Visual Scene Classification) of the DCASE 2021 Challenge. Our system focuses on audio-signal-based classification. The architecture is based on the combination of convolutional neural networks and OpenL3 embeddings. The CNN consists of three stacked 2D convolutional layers that process the log-mel spectrogram parameters obtained from the input signals. Additionally, OpenL3 embeddings of the input signals are calculated and merged with the output of the CNN stack. The resulting vector is fed to a classification block consisting of two fully connected layers. The mixup augmentation technique is applied to the training data, and binaural data is also used as input to provide additional information. In this report, we describe the proposed systems in detail and compare them to the baseline approach using the provided development datasets.
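As a minimal sketch of the architecture outlined above, assuming hypothetical layer widths and embedding sizes (the report does not give exact dimensions here), a small PyTorch module could stack three 2D convolutional layers over the log-mel spectrogram, concatenate the pooled output with a pre-computed OpenL3 clip embedding, and classify with two fully connected layers:

```python
import torch
import torch.nn as nn

class CNNWithOpenL3(nn.Module):
    """Three stacked 2D conv layers over the log-mel spectrogram; the pooled CNN
    output is merged with a pre-computed OpenL3 embedding and classified by two
    fully connected layers. All layer widths are illustrative."""
    def __init__(self, n_classes=10, openl3_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Sequential(
            nn.Linear(128 + openl3_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, log_mel, openl3_emb):
        x = self.cnn(log_mel).flatten(1)         # (batch, 128)
        x = torch.cat([x, openl3_emb], dim=-1)   # merge with OpenL3 embedding
        return self.classifier(x)

model = CNNWithOpenL3()
logits = model(torch.randn(2, 1, 64, 500), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 10])
```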
System characteristics
Input | mixed; binaural |
Sampling rate | 48.0kHz |
Data augmentation | mixup |
Features | log-mel spectrogram |
Classifier | CNN |
Decision making | maximum likelihood |
Investigating Waveform and Spectrogram Feature Fusion for Audio Classification
Dennis Fedorishin1, Nishant Sankaran1, Deen Mohan1, Justas Birgiolas2, Philip Schneider2, Srirangaraj Setlur1 and Venu Govindaraju1
1Computer Science, Center for Unified Biometrics and Sensors, University at Buffalo, New York, USA, 2ACV Auctions, LLC., New York, USA
Fedorishin_UB_task1b_1 Fedorishin_UB_task1b_2
Abstract
This technical report presents our submitted system for the DCASE 2021 Challenge Task 1B: Audio-Visual Scene Classification. Focusing on the audio modality only, we investigate the use of two common feature representations within the audio understanding domain, the raw waveform and the Mel-spectrogram, and measure their degree of complementarity when using both representations in a fusion setting. We introduce a new model paradigm for audio classification by fusing features learned from Mel-spectrograms and the raw waveform in separate feature extraction branches. Our experimental results show that our proposed fusion model yields a 4.5% increase in validation accuracy and a reduction of 0.14 in validation loss over the Task 1B baseline audio-only subnetwork. We further show that learned features of raw waveforms and Mel-spectrograms are indeed complementary to each other and that there is a consistent classification performance improvement over models trained on Mel-spectrograms alone.
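A rough sketch of the described fusion paradigm, with invented channel counts and kernel sizes (not the authors' configuration): one branch of 1D convolutions over the raw waveform and one branch of 2D convolutions over the Mel-spectrogram, whose pooled features are concatenated before a shared classifier:

```python
import torch
import torch.nn as nn

class WaveSpecFusion(nn.Module):
    """Fuse features learned from the raw waveform (1D conv branch) and from the
    Mel-spectrogram (2D conv branch). Sizes are illustrative only."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, 64, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, 16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64 + 64, n_classes)

    def forward(self, waveform, mel_spec):
        w = self.wave_branch(waveform).flatten(1)
        s = self.spec_branch(mel_spec).flatten(1)
        return self.head(torch.cat([w, s], dim=-1))

model = WaveSpecFusion()
logits = model(torch.randn(2, 1, 48000), torch.randn(2, 1, 64, 94))
print(logits.shape)  # torch.Size([2, 10])
```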
System characteristics
Input | mono |
Sampling rate | 48.0kHz |
Data augmentation | mixup, time-shifting |
Features | log-mel energies, raw waveform |
Classifier | CNN |
Decision making | maximum likelihood |
CNN-Based Dual-Stream Network for Audio-Visual Scene Classification
Yuanbo Hou1, Yizhou Tan2, Yue Chang3, Tianyang Huang3, Shengchen Li4, Xi Shao3 and Dick Botteldooren1
1Ghent University, Gent, Belgium, 2International School, Beijing University of Posts and Telecommunications, Beijing, China, 3Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 4Xi’an Jiaotong-Liverpool University, Suzhou, China
Hou_UGent_task1b_1 Hou_UGent_task1b_2 Hou_UGent_task1b_3 Hou_UGent_task1b_4
Abstract
This technical report presents the CNN-based dual-stream network for audio-visual scene classification in the DCASE 2021 Challenge (Task 1 Subtask B). The proposed method is trained only on the development dataset of Task 1 Subtask B and does not use any external dataset. In terms of performance, the proposed model achieves a 0.318 log loss and 90.0% accuracy for scene classification on the development dataset, whereas the log loss and accuracy of the baseline are 0.658 and 77.0%, respectively. Our results are reproducible; the source code is available at https://github.com/Yuanbo2020/DCASE2021-T1B.
System characteristics
Input | mono |
Sampling rate | 48.0kHz |
Features | OpenL3 |
Classifier | CNN, ResNet |
Decision making | maximum likelihood |
Task 1B DCASE 2021: Audio-Visual Scene Classification with Squeeze-Excitation Convolutional Recurrent Neural Networks
Javier Naranjo-Alcazar1,2, Sergi Perez-Castanos1, Maximo Cobos1, Francesc J. Ferri1 and Pedro Zuccarello2
1Computer Science, Universitat de Valencia, Burjassot, Spain, 2Intituto Tecnológico de Informática, Valencia, Spain
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar_UV_task1b_3
Abstract
Automatic scene classification has always been one of the core tasks in every edition of the DCASE challenge. Until this edition, such classification was performed using only audio data, and so the problem was defined as Acoustic Scene Classification (ASC). In this 2021 edition, audio data is accompanied by visual data, providing additional information that can be jointly exploited to achieve higher recognition accuracy. The proposed approach makes use of two separate networks which are respectively trained in isolation on audio and visual data, so that each network specializes in a given modality. After training each network, the fusion of information from the audio and visual subnetworks is performed at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. For the visual subnetwork, a VGG16 architecture pretrained on the Places365 dataset is used, applying a fine-tuning strategy over the challenge dataset. On the other hand, the audio subnetwork is trained from scratch and uses squeeze-excitation techniques as in previous contributions from this team. As a result, the final accuracy of the system is 92% on the development split, outperforming the baseline by 15 percentage points.
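A simplified sketch of this two-stage fusion, with hypothetical feature dimensions and a bidirectional GRU standing in for the recurrent structure: per-time-step audio and visual features are concatenated and fed to the recurrent layer (early fusion), and its prediction is then averaged with the independent predictions of the two subnetworks (late fusion):

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Early fusion: concatenate per-time-step audio/visual features and feed a
    bidirectional GRU. Late fusion: average the GRU prediction with the
    independent audio-only and video-only predictions. Sizes are illustrative."""
    def __init__(self, audio_dim=128, video_dim=128, n_classes=10):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + video_dim, 64, bidirectional=True,
                          batch_first=True)
        self.fused_head = nn.Linear(2 * 64, n_classes)
        self.audio_head = nn.Linear(audio_dim, n_classes)
        self.video_head = nn.Linear(video_dim, n_classes)

    def forward(self, audio_seq, video_seq):
        fused, _ = self.rnn(torch.cat([audio_seq, video_seq], dim=-1))
        p_fused = self.fused_head(fused.mean(dim=1)).softmax(-1)
        p_audio = self.audio_head(audio_seq.mean(dim=1)).softmax(-1)
        p_video = self.video_head(video_seq.mean(dim=1)).softmax(-1)
        return (p_fused + p_audio + p_video) / 3.0

model = HybridFusion()
probs = model(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
print(probs.shape)  # torch.Size([2, 10])
```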
System characteristics
Input | left, right, difference |
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | gammatone spectrogram |
Classifier | CRNN |
LDSLVision Submissions to DCASE'21: A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants
Soichiro Okazaki, Kong Quan and Tomoaki Yoshinaga
Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan
Okazaki_LDSLVision_task1b_1 Okazaki_LDSLVision_task1b_2 Okazaki_LDSLVision_task1b_3 Okazaki_LDSLVision_task1b_4
Abstract
In this report, we describe our solution for the audio-visual scene classification task of the DCASE 2021 Challenge Task 1B. Our solution is based on a multi-modal fusion approach consisting of three different domain features: (1) log-mel spectrogram audio features extracted by CNN variants from audio files; (2) frame-wise image features extracted by CNN variants from video files; (3) text-guided frame-wise image features extracted by CLIP variants from video files. We trained the three domain models separately and created the final submissions by ensembling the class-wise confidences of the three domain models' outputs. With ensembling and post-processing of the confidences, our model reached 0.149 log loss (official baseline: 0.658 log loss) and 96.1% accuracy (official baseline: 77.0% accuracy) on the officially provided fold1 evaluation dataset of Task 1B.
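The final submissions are built by ensembling the class-wise confidences of the three domain models. A minimal sketch of that averaging step, using random stand-in confidences rather than real model outputs:

```python
import numpy as np

def ensemble_confidences(confidence_list, weights=None):
    """Average class-wise confidences from several models.
    confidence_list: list of arrays shaped (n_clips, n_classes)."""
    stacked = np.stack(confidence_list)                 # (n_models, clips, classes)
    avg = np.average(stacked, axis=0, weights=weights)  # optional per-model weights
    return avg / avg.sum(axis=1, keepdims=True)         # renormalize to probabilities

# Stand-in confidences from three domain models (audio CNN, video CNN, CLIP).
rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(10), size=5) for _ in range(3)]
final = ensemble_confidences(models)
print(final.shape, final.argmax(axis=1))  # (5, 10) and predicted scene indices
```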
System characteristics
Input | mono+delta+delta-delta (3 channels) |
Sampling rate | 48.0kHz |
Data augmentation | (Video) random affine, color jitter, gaussian blur, random erasing; (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing |
Features | log-mel spectrogram |
Classifier | CNN, ViT, ensemble |
Decision making | average |
Convolutional Receptive Field Dual Selection Mechanism for Acoustic Scene Classification
Wang Peng1, Tianyang Zhang1 and Zehua Zou2
1Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China, 2Image Information Processing Lab, CHONGQING UNIVERSITY, Chongqing, China
Peng_CQU_task1b_1 Peng_CQU_task1b_2 Peng_CQU_task1b_3 Peng_CQU_task1b_4
Abstract
Convolutional neural networks (CNNs), which can extract rich semantic information from a signal, are representative feature learning networks in acoustic scene classification (ASC). However, since the receptive field (RF) of a CNN is fixed, it is inefficient at capturing the dynamically changing time-frequency characteristics of the input log-mel spectrogram. In addition, although the log-mel spectrogram can be treated as an image, the time and frequency dimensions, which respectively represent acoustic event duration and frequency information, have different physical meanings. Therefore, existing receptive field adaptation methods, which derive same-sized optimal receptive fields in both dimensions, are not suitable for ASC. To tackle this problem, we propose a convolutional receptive field dual selection mechanism (CRFDS). Acoustic scene classification experiments conducted on the DCASE 2021 Subtask B audio-only data show that CRFDS can achieve an accuracy of 71.45%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | average |
DCASE 2021 Task 1B: Technique Report
Lam Pham, Alexander Schindler, Mina Schutz, Jasmin Lampert and Ross King
Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria
Pham_AIT_task1b_1 Pham_AIT_task1b_2 Pham_AIT_task1b_3 Pham_AIT_task1b_4
Abstract
This report presents a deep learning framework for audio-visual scene classification (SC). Our extensive experiments, conducted on the DCASE Task 1B development dataset, achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio-only input, visual-only input, and combined audio-visual input, respectively.
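The system summary tables above list "PROD late fusion" as this team's method for combining modalities, i.e. a product rule over per-modality class probabilities. A hedged sketch of such a rule (an interpretation for illustration, not the authors' code):

```python
import numpy as np

def prod_late_fusion(prob_list, eps=1e-12):
    """Product-rule late fusion: multiply per-modality class probabilities
    element-wise and renormalize. prob_list holds (n_clips, n_classes) arrays."""
    fused = np.ones_like(prob_list[0])
    for p in prob_list:
        fused *= np.clip(p, eps, 1.0)
    return fused / fused.sum(axis=1, keepdims=True)

# Toy example: the fused distribution concentrates on the class both modalities favor.
audio_probs = np.array([[0.6, 0.3, 0.1]])
video_probs = np.array([[0.5, 0.1, 0.4]])
print(prod_late_fusion([audio_probs, video_probs]))
```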
System characteristics
Input | left, right |
Sampling rate | 48.0kHz |
Data augmentation | mixup |
Features | CQT, Gammatonegram, log-mel |
Classifier | CNN |
Decision making | Average |
A Multimodal WaveTransformer Architecture Conditioned on OpenL3 Embeddings for Audio-Visual Scene Classification
Andreas Triantafyllopoulos1, Konstantinos Drossos2, Alexander Gebhard3, Alice Baird3 and Björn Schuller3
1audEERING GmbH, Gilching, Germany, 2Computing Sciences, Tampere University, Tampere, Finland, 3University of Augsburg, Augsburg, Germany
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos_AUD_task1b_4
Abstract
In this report, we present our submission systems for Task 1B of the DCASE 2021 Challenge. We submit a total of four systems: one purely audio-based and three multimodal variants of the same architecture. The main module consists of the WaveTransformer architecture, which was recently introduced for automatic audio captioning (AAC). We first adapt the architecture to the task of acoustic scene classification (ASC), and then extend it to handle multimodal signals by globally conditioning all layers on multimodal OpenL3 embeddings. As data augmentation, we apply time- and frequency-bin masking, as well as random cropping. Our best system achieves a log loss of 0.568 and an accuracy of 79.5%.
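The FiLM variant conditions layers on multimodal OpenL3 embeddings via feature-wise linear modulation, in which a conditioning vector predicts a per-channel scale and shift applied to intermediate feature maps. A minimal, generic sketch with hypothetical dimensions (not the authors' implementation):

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Feature-wise linear modulation: a conditioning embedding predicts a
    per-channel scale (gamma) and shift (beta) applied to the feature map."""
    def __init__(self, channels=64, cond_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        gamma, beta = self.film(cond).chunk(2, dim=-1)  # (batch, C) each
        x = self.conv(x)
        return torch.relu(gamma[:, :, None, None] * x + beta[:, :, None, None])

block = FiLMConditionedBlock()
feat = torch.randn(2, 64, 32, 32)   # intermediate audio feature map
cond = torch.randn(2, 512)          # e.g. an OpenL3-style clip embedding
print(block(feat, cond).shape)      # torch.Size([2, 64, 32, 32])
```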
System characteristics
Input | mono |
Sampling rate | 48.0kHz |
Data augmentation | frequency masking, time masking, random cropping |
Features | log-mel spectrogram |
Classifier | WaveTransformer; MultimodalWaveTransformer |
Decision making | maximum likelihood |
A Model Ensemble Approach for Audio-Visual Scene Classification
1NELSLIP, University of Science and Technology of China, Hefei, China, 2DSP & Speech Technology Laboratory, The Chinese University of Hong Kong, Hong Kong, China, 3School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA, 4Kore University of Enna, Italy, 5Tencent Ethereal Audio Lab, Tencent Corporation, Shenzhen, China
1NELSLIP, University of Science and Technology of China, Heifei, China, 2DSP & Speech Technology Laboratory, The Chinese University of Hong Kong, Hong Kong, China, 3School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA, 4Kore University of Enna, Italy, 5Tencent Ethereal Audio Lab, Tencent Corporation, Shenzhen, China
Du_USTC_task1b_1 Du_USTC_task1b_2 Du_USTC_task1b_3 Du_USTC_task1b_4
Abstract
In this technical report, we present our approach to Task 1B (Audio-Visual Scene Classification, AVSC) in the DCASE 2021 Challenge. We employ pre-trained networks trained on image datasets to extract video embeddings, whereas for audio embeddings, models trained from scratch are more appropriate for feature extraction. We propose several models for the AVSC task based on different audio and video embeddings using an early fusion strategy. Besides, we propose to use an audio-visual segment model (AVSM) to extract text embeddings. Data augmentation methods are used during training. Furthermore, a two-stage classification strategy is adopted by leveraging score fusion of two classifiers. Finally, a model ensemble of two-stage AVSC classifiers is used to obtain more robust predictions. The proposed systems are evaluated on the development set of TAU Urban Audio Visual Scenes 2021. Compared with the official baseline, our approach achieves a much lower log loss of 0.141 and a much higher accuracy of 95.3%.
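A rough sketch of the two-stage, score-fusion idea mentioned above, with placeholder data, dimensions, and classifier choices (not the authors' configuration): two first-stage classifiers produce class scores from a fused audio-visual embedding, and a second-stage classifier is trained on the concatenated scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_clips, emb_dim, n_classes = 200, 64, 10

# Stand-in fused audio-visual embeddings and scene labels (random, for shape only).
X = rng.normal(size=(n_clips, emb_dim))
y = rng.integers(0, n_classes, size=n_clips)

# Stage 1: two different classifiers produce class-score vectors.
clf_a = LogisticRegression(max_iter=1000).fit(X, y)
clf_b = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
scores = np.hstack([clf_a.predict_proba(X), clf_b.predict_proba(X)])

# Stage 2: a classifier trained on the fused (concatenated) scores.
stage2 = LogisticRegression(max_iter=1000).fit(scores, y)
print(stage2.predict_proba(scores).shape)  # (200, 10)
```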
System characteristics
Input | binaural |
Sampling rate | 48.0kHz |
Data augmentation | mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness |
Features | log-mel energies |
Classifier | GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble; CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble; CNN, ResNet, DNN, DenseNet, ResNeSt, ensemble |
Decision making | average |
BIT Submission for DCASE 2021 Challenge Task 1
Yuxiang Wang and Shuang Liang
Electronic engineering, Beijing Institute of Technology, Beijing, China
Abstract
The DCASE 2021 challenge Task 1 contains two subtasks: (i) Low-Complexity Acoustic Scene Classification with Multiple Devices and (ii) Audio-Visual Scene Classification. In our submission systems, different methods are used for the different subtasks. For Task 1A, we explore the fsFCNN (frequency sub-sampling controlled fully convolutional network) with two-stage training. In order to reduce the model size, a knowledge distillation approach is used. For Task 1B, different models are used for the different modalities. For audio classification, the same fsFCNN structures with two-stage training are applied. For video classification, the TimeSformer model is used. Experimental results show that our final Task 1A model obtains an accuracy of 64.6% with a 128 KB model size. On the Task 1B development set, the audio modality achieves 80.6% accuracy and the video modality 92%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | Transformer; CNN |
Decision making | maximum likelihood |
Audio-Visual Scene Classification Using Transfer Learning and Hybrid Fusion Strategy
Meng Wang, Chengxin Chen, Yuan Xie, Hangting Chen, Yuzhuo Liu and Pengyuan Zhang
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China
Zhang_IOA_task1b_1 Zhang_IOA_task1b_2 Zhang_IOA_task1b_3 Zhang_IOA_task1b_4
Abstract
In this technical report, we describe the details of our submission for DCASE 2021 Task 1B. This task focuses on audio-visual scene classification. We use a 1D deep convolutional neural network integrated with three different acoustic features in our audio system, and perform two-stage fine-tuning on pre-trained models such as ResNet-50 and EfficientNet-b5 in our image system. In model-level fusion, the extracted audio and image embeddings are concatenated and fed into a classifier. We also use decision-level fusion to make our system more robust. On the official train/test setup of the development dataset, our best single audio-visual system obtained a 0.159 log loss and 94.1% accuracy, compared to 0.623 and 78.5% for the audio-only system and 0.270 and 91.8% for the image-only system. Our final fusion system achieves a 0.143 log loss and 95.2% accuracy.
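A small sketch of the two fusion levels described above, with made-up dimensions and weights: model-level fusion concatenates audio and image embeddings before a (here, linear softmax) classifier, and decision-level fusion combines subsystem outputs with a weighted vote:

```python
import numpy as np

def model_level_fusion(audio_emb, image_emb, weight, bias):
    """Concatenate audio and image embeddings and apply a linear softmax classifier."""
    fused = np.concatenate([audio_emb, image_emb], axis=1)
    logits = fused @ weight + bias
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def weighted_vote(prob_list, weights):
    """Decision-level fusion: weighted average of subsystem class probabilities."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_list)                       # (n_systems, clips, classes)
    return np.tensordot(weights / weights.sum(), stacked, axes=1)

rng = np.random.default_rng(0)
probs = model_level_fusion(rng.normal(size=(2, 128)), rng.normal(size=(2, 256)),
                           rng.normal(size=(384, 10)) * 0.01, np.zeros(10))
final = weighted_vote([probs, rng.dirichlet(np.ones(10), size=2)], [0.6, 0.4])
print(final.shape)  # (2, 10)
```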
System characteristics
Input | binaural |
Sampling rate | 48.0kHz |
Data augmentation | mixup |
Features | log-mel energies, CQT, bark |
Classifier | CNN, EfficientNet, Swin Transformer, ensemble |
Decision making | weighted vote; average vote |
Scene Classification Using Acoustic and Visual Feature
Yujie Yang1 and Yanan Luo2
1Tsinghua University, Shenzhen, China, 2Tencent, Shenzhen, China
Yang_THU_task1b_1 Yang_THU_task1b_2 Yang_THU_task1b_3
Abstract
In this report, we provide a brief overview of our submission for the audio-visual scene classification task of the DCASE 2021 challenge. This report focuses on the joint use of audio and video features to improve the performance of scene classification. To extract audio features, we train a convolutional neural network similar to VGG to classify the log-mel spectra. To extract video features, we use ResNeXt to train an image classifier. Subsequently, we use both features for classification, which achieves better performance than using only one of them.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | mixup, specaugment |
Features | log-mel energies |
Classifier | CNN; Multi-head Attention, MLP; Multi-head Attention, MLP, CNN |
Decision making | maximum likelihood |