Audio-Visual Scene Classification


Challenge results

Task description

This subtask is concerned with scene classification using audio and video modalities. Since audio-visual machine learning has gained popularity in recent years, we aim to provide a multidisciplinary task that may also attract researchers from the machine vision community.

We imposed no restrictions on the modality or combinations of modalities used in the system. We encouraged participants to also submit single-modality systems (audio-only or video-only methods for scene classification).

The development set contains synchronized audio and video recordings from 10 European cities covering 10 different scene classes, for a total of 34 hours of audio. The evaluation set contains data from 12 cities (2 of which are unseen in the development set) and comprises 20 hours of audio.

A more detailed task description can be found on the task description page.
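
Submissions are ranked by multi-class cross-entropy (log loss), with accuracy reported alongside a 95% confidence interval. For illustration, a minimal sketch of these metrics in Python/numpy; the official evaluator may differ in detail, e.g. in how the confidence interval is computed:

```python
import numpy as np

def macro_log_loss(y_true, probs, eps=1e-15):
    """Macro-averaged multi-class cross-entropy: per-class mean negative
    log-probability of the true class, averaged over classes."""
    probs = np.clip(probs, eps, 1.0)
    nll = -np.log(probs[np.arange(len(y_true)), y_true])
    return float(np.mean([nll[y_true == c].mean() for c in np.unique(y_true)]))

def accuracy_with_ci(y_true, probs, z=1.96):
    """Accuracy plus a binomial normal-approximation 95% confidence interval."""
    correct = probs.argmax(axis=1) == y_true
    p, n = correct.mean(), len(correct)
    half = z * np.sqrt(p * (1.0 - p) / n)
    return p, (p - half, p + half)

# Toy example: 6 clips, 10 scene classes
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=6)      # each row sums to 1
y_true = np.array([2, 5, 5, 0, 7, 2])
print(macro_log_loss(y_true, probs), accuracy_with_ci(y_true, probs))
```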

Systems ranking

Submission label | Name | Technical report | Official system rank | Logloss (eval) | Accuracy % (eval, with 95% CI) | Logloss (dev) | Accuracy % (dev)
Boes_KUL_task1b_1 muls_tr(1) Boes2021 23 0.653 74.5 (74.2 - 74.8) 0.620 75.9
Boes_KUL_task1b_2 muls_tr(2) Boes2021 25 0.683 76.0 (75.7 - 76.3) 0.652 76.6
Boes_KUL_task1b_3 muls_tr(3) Boes2021 26 0.701 76.3 (76.0 - 76.6) 0.665 77.0
Boes_KUL_task1b_4 muls_tr(4) Boes2021 24 0.681 76.0 (75.6 - 76.3) 0.682 77.1
Diez_Noismart_task1b_1 AholabASC1 Diez2021 35 1.061 65.2 (64.8 - 65.5) 1.038 65.6
Diez_Noismart_task1b_2 AhonoiseASC1 Diez2021 38 1.096 64.4 (64.1 - 64.8) 1.006 63.3
Diez_Noismart_task1b_3 AhonoiseASC1 Diez2021 34 1.060 64.7 (64.4 - 65.1) 1.023 63.2
Du_USTC_task1b_1 USTC_t1b_1 Wang2021 8 0.241 92.9 (92.7 - 93.1) 0.147 94.7
Du_USTC_task1b_2 USTC_t1b_2 Wang2021 7 0.238 92.7 (92.5 - 92.9) 0.145 95.1
Du_USTC_task1b_3 USTC_t1b_3 Wang2021 6 0.222 93.2 (93.0 - 93.4) 0.143 95.2
Du_USTC_task1b_4 USTC_t1b_4 Wang2021 5 0.221 93.2 (93.0 - 93.4) 0.141 95.5
Fedorishin_UB_task1b_1 WS Small Fedorishin2021 37 1.077 67.2 (66.8 - 67.5) 0.907 69.6
Fedorishin_UB_task1b_2 WS Fusion Fedorishin2021 33 1.028 68.7 (68.4 - 69.1) 0.990 68.5
Hou_UGent_task1b_1 HTCH_1 Hou2021 20 0.555 81.5 (81.2 - 81.8) 0.346 87.4
Hou_UGent_task1b_2 HTCH_2 Hou2021 29 0.771 81.8 (81.6 - 82.1) 0.351 89.5
Hou_UGent_task1b_3 HTCH_3 Hou2021 19 0.523 84.0 (83.7 - 84.3) 0.318 88.6
Hou_UGent_task1b_4 HTCH_4 Hou2021 16 0.416 85.6 (85.3 - 85.8) 0.328 91.7
Naranjo-Alcazar_UV_task1b_1 AVSC_SE_CRNN Naranjo-Alcazar2021_t1b 18 0.495 86.5 (86.3 - 86.8) 0.556 90.0
Naranjo-Alcazar_UV_task1b_2 AVSC_SE_CRNN Naranjo-Alcazar2021_t1b 22 0.640 83.2 (82.9 - 83.4) 0.616 87.0
Naranjo-Alcazar_UV_task1b_3 AVSC_SE_CRNN Naranjo-Alcazar2021_t1b 32 1.006 66.8 (66.5 - 67.1) 0.969 69.0
Okazaki_LDSLVision_task1b_1 S01 Okazaki2021 12 0.312 91.6 (91.4 - 91.8) 0.260 92.5
Okazaki_LDSLVision_task1b_2 S02 Okazaki2021 13 0.320 93.2 (93.0 - 93.3) 0.149 96.1
Okazaki_LDSLVision_task1b_3 S03 Okazaki2021 11 0.303 93.5 (93.3 - 93.7) 0.238 95.8
Okazaki_LDSLVision_task1b_4 S04 Okazaki2021 9 0.257 93.5 (93.3 - 93.7) 0.149 96.1
Peng_CQU_task1b_1 CRFDS Peng2021 45 1.395 68.2 (67.9 - 68.5) 1.614 71.8
Peng_CQU_task1b_2 CRFDS Peng2021 40 1.172 67.8 (67.5 - 68.1) 1.627 71.0
Peng_CQU_task1b_3 CRFDS Peng2021 41 1.172 67.8 (67.5 - 68.1) 1.635 70.5
Peng_CQU_task1b_4 CRFDS Peng2021 43 1.233 68.5 (68.1 - 68.8) 1.824 70.2
Pham_AIT_task1b_1 Pham_AIT Pham2021 44 1.311 73.0 (72.7 - 73.3) 93.9
Pham_AIT_task1b_2 Pham_AIT Pham2021 21 0.589 88.3 (88.1 - 88.6) 93.9
Pham_AIT_task1b_3 Pham_AIT Pham2021 17 0.434 88.4 (88.2 - 88.7) 93.9
Pham_AIT_task1b_4 Pham_AIT Pham2021 28 0.738 91.5 (91.3 - 91.7) 93.9
Triantafyllopoulos_AUD_task1b_1 WT Triantafyllopoulos2021 39 1.157 58.4 (58.1 - 58.8) 1.153 59.4
Triantafyllopoulos_AUD_task1b_2 MWT-FiLM Triantafyllopoulos2021 27 0.735 73.6 (73.3 - 73.9) 0.568 79.5
Triantafyllopoulos_AUD_task1b_3 MWT-Bias Triantafyllopoulos2021 30 0.785 73.7 (73.3 - 74.0) 0.704 76.2
Triantafyllopoulos_AUD_task1b_4 MWT-Wave Triantafyllopoulos2021 31 0.872 70.3 (70.0 - 70.7) 0.796 72.6
Wang_BIT_task1b_1 Wang_BIT1 Wang2021a 36 1.061 74.1 (73.7 - 74.4) 1.072 91.5
Wang_BIT_task1b_2 Wang_BIT2 Liang2021 42 1.180 62.4 (62.0 - 62.7) 0.744 80.6
DCASE2021 baseline Baseline 0.662 77.1 (76.8 - 77.5) 0.658 77.0
Yang_THU_task1b_1 cnn14_cvt Yang2021 15 0.332 90.8 (90.6 - 91.1) 0.261 92.6
Yang_THU_task1b_2 trans_cvt Yang2021 14 0.321 90.8 (90.6 - 91.0) 0.230 93.1
Yang_THU_task1b_3 2trans_cnn Yang2021 10 0.279 92.1 (91.9 - 92.3) 0.223 93.9
Zhang_IOA_task1b_1 ZhangIOA1 Wang2021b 3 0.201 93.5 (93.3 - 93.7) 0.146 95.0
Zhang_IOA_task1b_2 ZhangIOA2 Wang2021b 4 0.205 93.6 (93.4 - 93.8) 0.153 95.0
Zhang_IOA_task1b_3 ZhangIOA3 Wang2021b 1 0.195 93.8 (93.6 - 93.9) 0.145 95.1
Zhang_IOA_task1b_4 ZhangIOA4 Wang2021b 2 0.199 93.9 (93.7 - 94.1) 0.156 95.0

Teams ranking

The table below includes only the best-performing system from each submitting team.

Submission label | Name | Technical report | Official system rank | Team rank | Logloss (eval) | Accuracy % (eval, with 95% CI) | Logloss (dev) | Accuracy % (dev)
Boes_KUL_task1b_1 muls_tr(1) Boes2021 23 8 0.653 74.5 (74.2 - 74.8) 0.620 75.9
Diez_Noismart_task1b_3 AhonoiseASC1 Diez2021 34 11 1.060 64.7 (64.4 - 65.1) 1.023 63.2
Du_USTC_task1b_4 USTC_t1b_4 Wang2021 5 2 0.221 93.2 (93.0 - 93.4) 0.141 95.5
Fedorishin_UB_task1b_2 WS Fusion Fedorishin2021 33 10 1.028 68.7 (68.4 - 69.1) 0.990 68.5
Hou_UGent_task1b_4 HTCH_4 Hou2021 16 5 0.416 85.6 (85.3 - 85.8) 0.328 91.7
Naranjo-Alcazar_UV_task1b_1 AVSC_SE_CRNN Naranjo-Alcazar2021_t1b 18 7 0.495 86.5 (86.3 - 86.8) 0.556 90.0
Okazaki_LDSLVision_task1b_4 S04 Okazaki2021 9 3 0.257 93.5 (93.3 - 93.7) 0.149 96.1
Peng_CQU_task1b_2 CRFDS Peng2021 40 13 1.172 67.8 (67.5 - 68.1) 1.627 71.0
Pham_AIT_task1b_3 Pham_AIT Pham2021 17 6 0.434 88.4 (88.2 - 88.7) 93.9
Triantafyllopoulos_AUD_task1b_2 MWT-FiLM Triantafyllopoulos2021 27 9 0.735 73.6 (73.3 - 73.9) 0.568 79.5
Wang_BIT_task1b_1 Wang_BIT1 Wang2021a 36 12 1.061 74.1 (73.7 - 74.4) 1.072 91.5
DCASE2021 baseline Baseline 0.662 77.1 (76.8 - 77.5) 0.658 77.0
Yang_THU_task1b_3 2trans_cnn Yang2021 10 4 0.279 92.1 (91.9 - 92.3) 0.223 93.9
Zhang_IOA_task1b_3 ZhangIOA3 Wang2021b 1 1 0.195 93.8 (93.6 - 93.9) 0.145 95.1

Modality

Submission label | Technical report | Official system rank | Logloss (eval) | Accuracy % (eval) | Used modalities | Method to combine information from modalities
Boes_KUL_task1b_1 Boes2021 23 0.653 74.5 audio + video early fusion
Boes_KUL_task1b_2 Boes2021 25 0.683 76.0 audio + video early fusion
Boes_KUL_task1b_3 Boes2021 26 0.701 76.3 audio + video early fusion
Boes_KUL_task1b_4 Boes2021 24 0.681 76.0 audio + video early fusion
Diez_Noismart_task1b_1 Diez2021 35 1.061 65.2 audio audio only
Diez_Noismart_task1b_2 Diez2021 38 1.096 64.4 audio audio only
Diez_Noismart_task1b_3 Diez2021 34 1.060 64.7 audio audio only
Du_USTC_task1b_1 Wang2021 8 0.241 92.9 audio + video early fusion
Du_USTC_task1b_2 Wang2021 7 0.238 92.7 audio + video early fusion
Du_USTC_task1b_3 Wang2021 6 0.222 93.2 audio + video early fusion
Du_USTC_task1b_4 Wang2021 5 0.221 93.2 audio + video early fusion
Fedorishin_UB_task1b_1 Fedorishin2021 37 1.077 67.2 audio audio only
Fedorishin_UB_task1b_2 Fedorishin2021 33 1.028 68.7 audio audio only
Hou_UGent_task1b_1 Hou2021 20 0.555 81.5 audio + video late fusion
Hou_UGent_task1b_2 Hou2021 29 0.771 81.8 audio + video late fusion
Hou_UGent_task1b_3 Hou2021 19 0.523 84.0 audio + video late fusion
Hou_UGent_task1b_4 Hou2021 16 0.416 85.6 audio + video late fusion
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar2021_t1b 18 0.495 86.5 audio + video early fusion, late fusion
Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar2021_t1b 22 0.640 83.2 video video only
Naranjo-Alcazar_UV_task1b_3 Naranjo-Alcazar2021_t1b 32 1.006 66.8 audio audio only
Okazaki_LDSLVision_task1b_1 Okazaki2021 12 0.312 91.6 video video only
Okazaki_LDSLVision_task1b_2 Okazaki2021 13 0.320 93.2 audio + video audio-visual
Okazaki_LDSLVision_task1b_3 Okazaki2021 11 0.303 93.5 audio + video audio-visual
Okazaki_LDSLVision_task1b_4 Okazaki2021 9 0.257 93.5 audio + video audio-visual
Peng_CQU_task1b_1 Peng2021 45 1.395 68.2 audio audio only
Peng_CQU_task1b_2 Peng2021 40 1.172 67.8 audio audio only
Peng_CQU_task1b_3 Peng2021 41 1.172 67.8 audio audio only
Peng_CQU_task1b_4 Peng2021 43 1.233 68.5 audio audio only
Pham_AIT_task1b_1 Pham2021 44 1.311 73.0 PROD late fusion
Pham_AIT_task1b_2 Pham2021 21 0.589 88.3 PROD late fusion
Pham_AIT_task1b_3 Pham2021 17 0.434 88.4 PROD late fusion
Pham_AIT_task1b_4 Pham2021 28 0.738 91.5 PROD late fusion
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos2021 39 1.157 58.4 audio
Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos2021 27 0.735 73.6 audio + video FiLM conditioning
Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos2021 30 0.785 73.7 audio + video Bias conditioning
Triantafyllopoulos_AUD_task1b_4 Triantafyllopoulos2021 31 0.872 70.3 audio + video Wave conditioning
Wang_BIT_task1b_1 Wang2021a 36 1.061 74.1 video video only
Wang_BIT_task1b_2 Liang2021 42 1.180 62.4 audio audio only
DCASE2021 baseline 0.662 77.1 audio + video early fusion
Yang_THU_task1b_1 Yang2021 15 0.332 90.8 audio + video early fusion
Yang_THU_task1b_2 Yang2021 14 0.321 90.8 audio + video early fusion
Yang_THU_task1b_3 Yang2021 10 0.279 92.1 audio + video early fusion
Zhang_IOA_task1b_1 Wang2021b 3 0.201 93.5 audio + video early fusion
Zhang_IOA_task1b_2 Wang2021b 4 0.205 93.6 audio + video early fusion
Zhang_IOA_task1b_3 Wang2021b 1 0.195 93.8 audio + video early fusion
Zhang_IOA_task1b_4 Wang2021b 2 0.199 93.9 audio + video early fusion
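
Two fusion strategies dominate the table above: early fusion (combining per-modality features before a joint classifier) and late fusion (combining per-modality predictions). A minimal PyTorch sketch contrasting the two, with all dimensions hypothetical:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate per-modality embeddings, then classify jointly."""
    def __init__(self, d_audio=512, d_video=512, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_audio + d_video, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, a, v):                      # a: (B, d_audio), v: (B, d_video)
        return self.head(torch.cat([a, v], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then average class probabilities."""
    def __init__(self, d_audio=512, d_video=512, n_classes=10):
        super().__init__()
        self.audio_head = nn.Linear(d_audio, n_classes)
        self.video_head = nn.Linear(d_video, n_classes)

    def forward(self, a, v):
        pa = self.audio_head(a).softmax(-1)
        pv = self.video_head(v).softmax(-1)
        return 0.5 * (pa + pv)                    # averaged class posteriors

a, v = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(a, v).shape, LateFusion()(a, v).shape)  # (4, 10) twice
```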



Class-wise performance

Log loss

Submission label | Technical report | Official system rank | Logloss (overall) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Boes_KUL_task1b_1 Boes2021 23 0.653 0.769 0.600 0.724 0.705 0.175 0.751 0.495 0.679 0.452 1.175
Boes_KUL_task1b_2 Boes2021 25 0.683 0.615 0.759 0.815 0.517 0.214 0.835 0.621 0.860 0.421 1.177
Boes_KUL_task1b_3 Boes2021 26 0.701 0.578 0.838 0.785 0.440 0.158 1.128 0.617 0.917 0.417 1.133
Boes_KUL_task1b_4 Boes2021 24 0.681 0.573 0.836 0.868 0.493 0.192 0.870 0.643 0.821 0.405 1.110
Diez_Noismart_task1b_1 Diez2021 35 1.061 2.085 0.962 1.087 1.203 0.340 1.113 1.203 1.272 0.581 0.763
Diez_Noismart_task1b_2 Diez2021 38 1.096 1.654 0.807 1.524 1.740 0.384 0.924 1.362 1.158 0.496 0.909
Diez_Noismart_task1b_3 Diez2021 34 1.060 1.266 0.843 1.413 1.531 0.299 1.148 1.585 1.204 0.450 0.858
Du_USTC_task1b_1 Wang2021 8 0.241 0.267 0.113 0.257 0.018 0.025 0.453 0.241 0.620 0.092 0.325
Du_USTC_task1b_2 Wang2021 7 0.238 0.237 0.116 0.232 0.017 0.031 0.469 0.279 0.578 0.085 0.339
Du_USTC_task1b_3 Wang2021 6 0.222 0.234 0.136 0.211 0.023 0.028 0.456 0.223 0.553 0.082 0.273
Du_USTC_task1b_4 Wang2021 5 0.221 0.250 0.133 0.214 0.023 0.028 0.432 0.220 0.542 0.088 0.277
Fedorishin_UB_task1b_1 Fedorishin2021 37 1.077 2.391 0.552 1.287 1.139 0.241 1.272 0.786 1.522 0.526 1.052
Fedorishin_UB_task1b_2 Fedorishin2021 33 1.028 1.671 0.615 1.195 1.118 0.200 1.745 1.008 1.291 0.524 0.910
Hou_UGent_task1b_1 Hou2021 20 0.555 0.703 0.581 0.373 0.211 0.021 1.113 0.610 0.680 0.271 0.985
Hou_UGent_task1b_2 Hou2021 29 0.771 0.917 0.902 0.477 0.231 0.006 1.810 0.704 1.458 0.339 0.863
Hou_UGent_task1b_3 Hou2021 19 0.523 0.571 0.459 0.300 0.231 0.054 0.931 0.671 0.730 0.367 0.911
Hou_UGent_task1b_4 Hou2021 16 0.416 0.524 0.424 0.257 0.161 0.019 0.846 0.474 0.641 0.219 0.592
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar2021_t1b 18 0.495 0.771 0.475 0.628 0.223 0.102 0.664 0.371 0.714 0.191 0.810
Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar2021_t1b 22 0.640 0.809 0.424 0.646 0.128 0.087 0.831 0.526 0.960 0.305 1.677
Naranjo-Alcazar_UV_task1b_3 Naranjo-Alcazar2021_t1b 32 1.006 1.777 0.605 0.979 1.177 0.345 1.396 1.093 1.039 0.656 0.990
Okazaki_LDSLVision_task1b_1 Okazaki2021 12 0.312 0.227 0.314 0.308 0.144 0.107 0.477 0.180 0.532 0.182 0.645
Okazaki_LDSLVision_task1b_2 Okazaki2021 13 0.320 0.284 0.161 0.307 0.102 0.043 0.657 0.228 0.880 0.124 0.411
Okazaki_LDSLVision_task1b_3 Okazaki2021 11 0.303 0.279 0.282 0.304 0.172 0.105 0.506 0.213 0.544 0.171 0.457
Okazaki_LDSLVision_task1b_4 Okazaki2021 9 0.257 0.257 0.181 0.222 0.078 0.040 0.472 0.165 0.662 0.106 0.387
Peng_CQU_task1b_1 Peng2021 45 1.395 2.363 0.934 1.157 1.568 0.194 3.260 1.239 1.561 0.585 1.090
Peng_CQU_task1b_2 Peng2021 40 1.172 1.414 0.722 0.907 1.140 0.198 2.190 1.515 1.662 0.620 1.353
Peng_CQU_task1b_3 Peng2021 41 1.172 1.414 0.722 0.907 1.140 0.198 2.190 1.515 1.662 0.620 1.353
Peng_CQU_task1b_4 Peng2021 43 1.233 2.224 0.668 1.004 1.340 0.278 2.258 1.269 1.803 0.517 0.966
Pham_AIT_task1b_1 Pham2021 44 1.311 1.588 0.667 1.989 1.202 0.321 2.089 1.738 1.346 1.094 1.074
Pham_AIT_task1b_2 Pham2021 21 0.589 0.444 0.472 0.538 0.044 0.080 1.265 0.286 0.980 0.412 1.367
Pham_AIT_task1b_3 Pham2021 17 0.434 0.455 0.225 0.394 0.107 0.067 0.830 0.457 0.833 0.273 0.704
Pham_AIT_task1b_4 Pham2021 28 0.738 0.389 0.393 0.695 0.034 0.004 1.952 0.515 1.554 0.483 1.361
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos2021 39 1.157 1.442 0.678 1.336 1.959 0.283 1.492 0.677 1.430 0.757 1.520
Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos2021 27 0.735 1.038 0.702 0.931 0.306 0.075 0.978 0.730 1.400 0.401 0.792
Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos2021 30 0.785 1.503 0.604 0.512 0.832 0.164 0.623 0.502 0.921 0.979 1.211
Triantafyllopoulos_AUD_task1b_4 Triantafyllopoulos2021 31 0.872 1.556 0.815 0.956 1.130 0.427 1.009 0.232 1.572 0.521 0.503
Wang_BIT_task1b_1 Wang2021a 36 1.061 2.558 0.529 2.405 1.250 0.007 1.965 0.047 0.699 0.591 0.562
Wang_BIT_task1b_2 Liang2021 42 1.180 1.114 1.121 1.223 1.516 0.722 1.516 1.035 1.516 0.887 1.151
DCASE2021 baseline 0.662 1.313 0.496 0.683 0.517 0.151 1.002 0.586 0.907 0.215 0.751
Yang_THU_task1b_1 Yang2021 15 0.332 0.774 0.270 0.288 0.098 0.032 0.536 0.207 0.664 0.080 0.375
Yang_THU_task1b_2 Yang2021 14 0.321 0.699 0.169 0.331 0.052 0.026 0.553 0.219 0.419 0.095 0.642
Yang_THU_task1b_3 Yang2021 10 0.279 0.602 0.155 0.292 0.042 0.030 0.549 0.212 0.442 0.074 0.395
Zhang_IOA_task1b_1 Wang2021b 3 0.201 0.463 0.099 0.113 0.043 0.009 0.417 0.053 0.373 0.066 0.369
Zhang_IOA_task1b_2 Wang2021b 4 0.205 0.493 0.086 0.141 0.045 0.013 0.392 0.063 0.388 0.066 0.358
Zhang_IOA_task1b_3 Wang2021b 1 0.195 0.454 0.092 0.113 0.041 0.009 0.402 0.047 0.379 0.064 0.352
Zhang_IOA_task1b_4 Wang2021b 2 0.199 0.454 0.079 0.138 0.044 0.015 0.376 0.061 0.389 0.069 0.363

Accuracy

Submission label | Technical report | Official system rank | Accuracy % (overall) | Airport | Bus | Metro | Metro station | Park | Public square | Shopping mall | Street pedestrian | Street traffic | Tram
Boes_KUL_task1b_1 Boes2021 23 74.5 63.2 80.4 76.7 75.5 94.2 70.1 83.4 79.7 85.1 36.5
Boes_KUL_task1b_2 Boes2021 25 76.0 79.7 72.4 73.8 83.6 92.4 70.1 77.8 74.1 85.5 50.6
Boes_KUL_task1b_3 Boes2021 26 76.3 80.8 71.8 74.4 86.5 94.6 63.4 79.8 71.4 87.6 53.0
Boes_KUL_task1b_4 Boes2021 24 76.0 81.8 69.4 69.8 84.3 94.2 68.5 77.6 73.5 86.8 53.7
Diez_Noismart_task1b_1 Diez2021 35 65.2 37.9 67.8 59.7 59.4 89.4 58.0 73.8 53.2 80.7 71.8
Diez_Noismart_task1b_2 Diez2021 38 64.4 44.1 74.9 44.8 47.9 88.9 64.3 71.8 55.1 83.7 68.8
Diez_Noismart_task1b_3 Diez2021 34 64.7 53.2 73.0 47.4 52.1 91.2 55.4 62.9 57.6 85.3 69.3
Du_USTC_task1b_1 Wang2021 8 92.9 90.2 96.8 93.2 99.7 99.9 87.3 93.6 81.4 96.8 89.7
Du_USTC_task1b_2 Wang2021 7 92.7 90.9 96.4 94.2 99.8 99.6 86.1 93.0 81.0 97.2 89.2
Du_USTC_task1b_3 Wang2021 6 93.2 91.1 95.9 95.5 99.4 99.8 85.2 93.8 81.7 97.5 92.3
Du_USTC_task1b_4 Wang2021 5 93.2 90.7 96.2 95.3 99.5 99.9 85.4 94.1 81.5 97.2 92.1
Fedorishin_UB_task1b_1 Fedorishin2021 37 67.2 42.3 81.4 58.3 63.4 92.9 55.0 77.2 51.2 83.7 66.1
Fedorishin_UB_task1b_2 Fedorishin2021 33 68.7 54.2 80.9 63.7 65.2 94.6 45.9 68.6 57.8 84.6 71.7
Hou_UGent_task1b_1 Hou2021 20 81.5 79.5 75.9 85.5 91.9 99.1 63.4 80.6 80.1 90.7 68.5
Hou_UGent_task1b_2 Hou2021 29 81.8 70.4 75.7 86.2 93.1 99.8 65.1 83.7 73.9 93.2 77.3
Hou_UGent_task1b_3 Hou2021 19 84.0 82.1 88.1 88.1 92.5 99.3 74.0 78.4 81.6 88.8 67.0
Hou_UGent_task1b_4 Hou2021 16 85.6 82.8 83.5 88.1 95.1 100.0 70.4 86.7 81.8 94.1 73.3
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar2021_t1b 18 86.5 78.6 86.9 82.6 97.7 99.4 78.2 91.8 80.2 95.7 74.3
Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar2021_t1b 22 83.2 75.9 87.5 84.4 95.1 95.5 82.5 86.4 77.0 92.6 54.6
Naranjo-Alcazar_UV_task1b_3 Naranjo-Alcazar2021_t1b 32 66.8 45.8 79.4 67.0 63.4 89.3 51.1 63.2 66.0 78.8 64.0
Okazaki_LDSLVision_task1b_1 Okazaki2021 12 91.6 98.1 89.9 91.2 97.8 100.0 85.9 97.0 85.6 96.6 73.5
Okazaki_LDSLVision_task1b_2 Okazaki2021 13 93.2 96.1 96.6 92.8 98.7 99.9 85.6 95.4 81.6 97.9 87.0
Okazaki_LDSLVision_task1b_3 Okazaki2021 11 93.5 96.0 95.1 96.0 99.0 100.0 85.2 96.8 85.0 97.9 84.3
Okazaki_LDSLVision_task1b_4 Okazaki2021 9 93.5 96.0 95.1 96.0 99.0 100.0 85.2 96.8 85.0 97.9 84.3
Peng_CQU_task1b_1 Peng2021 45 68.2 49.5 78.6 69.5 61.2 94.8 46.8 66.2 57.9 84.4 73.1
Peng_CQU_task1b_2 Peng2021 40 67.8 53.8 79.5 72.8 65.9 94.4 49.3 58.6 54.3 82.6 66.7
Peng_CQU_task1b_3 Peng2021 41 67.8 53.8 79.5 72.8 65.9 94.4 49.3 58.6 54.3 82.6 66.7
Peng_CQU_task1b_4 Peng2021 43 68.5 45.0 81.0 70.5 66.3 92.0 47.8 67.8 58.6 84.8 70.9
Pham_AIT_task1b_1 Pham2021 44 73.0 64.5 87.9 63.2 74.7 93.2 54.4 67.3 70.2 78.9 75.7
Pham_AIT_task1b_2 Pham2021 21 88.3 88.0 88.1 87.6 98.8 96.0 79.0 93.1 84.4 93.5 75.0
Pham_AIT_task1b_3 Pham2021 17 88.4 84.8 92.3 88.7 96.7 97.3 77.4 89.6 82.9 93.7 81.0
Pham_AIT_task1b_4 Pham2021 28 91.5 94.2 94.1 89.6 99.2 99.9 78.1 95.0 85.1 95.4 84.1
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos2021 39 58.4 37.2 77.4 52.1 36.1 93.5 41.9 76.8 48.8 78.8 41.9
Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos2021 27 73.6 59.9 75.2 64.7 90.9 99.0 65.6 72.8 55.0 87.3 65.3
Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos2021 30 73.7 51.7 79.4 82.0 78.3 95.1 76.6 82.1 66.1 74.4 51.1
Triantafyllopoulos_AUD_task1b_4 Triantafyllopoulos2021 31 70.3 45.8 71.6 60.9 69.5 86.9 71.3 92.0 41.4 84.7 79.2
Wang_BIT_task1b_1 Wang2021a 36 74.1 45.6 80.2 39.9 64.5 99.9 62.8 98.2 81.7 84.9 82.9
Wang_BIT_task1b_2 Liang2021 42 62.4 63.3 62.6 66.2 49.9 90.9 35.3 65.8 51.6 79.0 59.0
DCASE2021 baseline 77.1 56.6 83.4 75.8 83.9 95.8 63.6 78.2 70.6 93.3 70.4
Yang_THU_task1b_1 Yang2021 15 90.8 83.5 90.6 90.6 97.0 99.3 85.1 93.4 82.0 97.8 89.0
Yang_THU_task1b_2 Yang2021 14 90.8 81.5 94.8 89.4 98.4 98.7 85.0 92.7 89.8 97.4 80.6
Yang_THU_task1b_3 Yang2021 10 92.1 84.8 94.8 90.5 98.6 98.8 84.6 93.2 89.0 97.7 88.5
Zhang_IOA_task1b_1 Wang2021b 3 93.5 88.0 96.2 97.7 99.3 100.0 83.1 98.0 88.3 97.7 86.7
Zhang_IOA_task1b_2 Wang2021b 4 93.6 87.0 96.9 97.8 99.3 100.0 84.9 97.6 88.0 97.9 86.4
Zhang_IOA_task1b_3 Wang2021b 1 93.8 88.1 96.6 97.9 99.3 100.0 83.7 98.5 88.2 97.8 87.6
Zhang_IOA_task1b_4 Wang2021b 2 93.9 87.4 97.4 98.2 99.4 100.0 85.3 98.0 88.0 98.0 87.2

System characteristics

General characteristics

Submission label | Technical report | Official system rank | Logloss (eval) | Accuracy % (eval) | Sampling rate | Data augmentation | Features | Embeddings (audio) | Embeddings (visual)
Boes_KUL_task1b_1 Boes2021 23 0.653 74.5 22.05kHz mixup log-mel energies VGGish VGG16
Boes_KUL_task1b_2 Boes2021 25 0.683 76.0 22.05kHz mixup log-mel energies VGGish VGG16
Boes_KUL_task1b_3 Boes2021 26 0.701 76.3 22.05kHz mixup log-mel energies VGGish VGG16
Boes_KUL_task1b_4 Boes2021 24 0.681 76.0 22.05kHz mixup log-mel energies VGGish VGG16
Diez_Noismart_task1b_1 Diez2021 35 1.061 65.2 48.0kHz mixup log-mel spectrogram OpenL3
Diez_Noismart_task1b_2 Diez2021 38 1.096 64.4 48.0kHz mixup log-mel spectrogram OpenL3
Diez_Noismart_task1b_3 Diez2021 34 1.060 64.7 48.0kHz mixup log-mel spectrogram OpenL3
Du_USTC_task1b_1 Wang2021 8 0.241 92.9 48.0kHz mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness log-mel energies FCNN, ResNet17, HMM DenseNet161, ResNet50, ResNeSt50, HMM, VGG19
Du_USTC_task1b_2 Wang2021 7 0.238 92.7 48.0kHz mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness log-mel energies FCNN, ResNet17, HMM DenseNet161, ResNet50, ResNeSt50, HMM, VGG19
Du_USTC_task1b_3 Wang2021 6 0.222 93.2 48.0kHz mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness log-mel energies FCNN, ResNet17 DenseNet161, ResNet50, ResNeSt50
Du_USTC_task1b_4 Wang2021 5 0.221 93.2 48.0kHz mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness log-mel energies FCNN, ResNet17 DenseNet161, ResNet50, ResNeSt50
Fedorishin_UB_task1b_1 Fedorishin2021 37 1.077 67.2 48.0kHz mixup, time-shifting log-mel energies, raw waveform
Fedorishin_UB_task1b_2 Fedorishin2021 33 1.028 68.7 48.0kHz mixup, time-shifting log-mel energies, raw waveform
Hou_UGent_task1b_1 Hou2021 20 0.555 81.5 48.0kHz OpenL3 OpenL3 ResNet50
Hou_UGent_task1b_2 Hou2021 29 0.771 81.8 48.0kHz OpenL3 OpenL3 ResNet50
Hou_UGent_task1b_3 Hou2021 19 0.523 84.0 48.0kHz OpenL3 OpenL3 ResNet50
Hou_UGent_task1b_4 Hou2021 16 0.416 85.6 48.0kHz OpenL3 OpenL3 ResNet50
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar2021_t1b 18 0.495 86.5 44.1kHz mixup gammatone spectrogram
Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar2021_t1b 22 0.640 83.2
Naranjo-Alcazar_UV_task1b_3 Naranjo-Alcazar2021_t1b 32 1.006 66.8 44.1kHz mixup gammatone spectrogram
Okazaki_LDSLVision_task1b_1 Okazaki2021 12 0.312 91.6 (Video) random affine, color jitter, gaussian blur, random erasing CNN, CLIP CNN/ViT
Okazaki_LDSLVision_task1b_2 Okazaki2021 13 0.320 93.2 48.0kHz (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing log-mel spectrogram CNN CNN, CLIP CNN/ViT
Okazaki_LDSLVision_task1b_3 Okazaki2021 11 0.303 93.5 48.0kHz (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing log-mel spectrogram CNN CNN, CLIP CNN/ViT
Okazaki_LDSLVision_task1b_4 Okazaki2021 9 0.257 93.5 48.0kHz (Audio) frequency masking, random gain (Video) random affine, color jitter, gaussian blur, random erasing log-mel spectrogram CNN CNN, CLIP CNN/ViT
Peng_CQU_task1b_1 Peng2021 45 1.395 68.2 44.1kHz log-mel energies CRFDS
Peng_CQU_task1b_2 Peng2021 40 1.172 67.8 44.1kHz log-mel energies CRFDS
Peng_CQU_task1b_3 Peng2021 41 1.172 67.8 44.1kHz log-mel energies CRFDS
Peng_CQU_task1b_4 Peng2021 43 1.233 68.5 44.1kHz log-mel energies CRFDS
Pham_AIT_task1b_1 Pham2021 44 1.311 73.0 48.0kHz mixup CQT, Gammatonegram, log-mel
Pham_AIT_task1b_2 Pham2021 21 0.589 88.3 48.0kHz mixup CQT, Gammatonegram, log-mel
Pham_AIT_task1b_3 Pham2021 17 0.434 88.4 48.0kHz mixup CQT, Gammatonegram, log-mel
Pham_AIT_task1b_4 Pham2021 28 0.738 91.5 48.0kHz mixup CQT, Gammatonegram, log-mel
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos2021 39 1.157 58.4 48.0kHz frequency masking, time masking, random cropping log-mel spectrogram
Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos2021 27 0.735 73.6 48.0kHz frequency masking, time masking, random cropping log-mel spectrogram OpenL3 OpenL3
Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos2021 30 0.785 73.7 48.0kHz frequency masking, time masking, random cropping log-mel spectrogram OpenL3 OpenL3
Triantafyllopoulos_AUD_task1b_4 Triantafyllopoulos2021 31 0.872 70.3 48.0kHz frequency masking, time masking, random cropping log-mel spectrogram OpenL3 OpenL3
Wang_BIT_task1b_1 Wang2021a 36 1.061 74.1
Wang_BIT_task1b_2 Liang2021 42 1.180 62.4 44.1kHz log-mel energies
DCASE2021 baseline 0.662 77.1 48.0kHz log-mel energies OpenL3 OpenL3
Yang_THU_task1b_1 Yang2021 15 0.332 90.8 44.1kHz mixup, specaugment log-mel energies VGGish Transformer
Yang_THU_task1b_2 Yang2021 14 0.321 90.8 44.1kHz mixup, specaugment log-mel energies Transformer Transformer
Yang_THU_task1b_3 Yang2021 10 0.279 92.1 44.1kHz mixup, specaugment log-mel energies Transformer, VGGish Transformer
Zhang_IOA_task1b_1 Wang2021b 3 0.201 93.5 48.0kHz mixup log-mel energies, CQT, bark
Zhang_IOA_task1b_2 Wang2021b 4 0.205 93.6 48.0kHz mixup log-mel energies, CQT, bark
Zhang_IOA_task1b_3 Wang2021b 1 0.195 93.8 48.0kHz mixup log-mel energies, CQT, bark
Zhang_IOA_task1b_4 Wang2021b 2 0.199 93.9 48.0kHz mixup log-mel energies, CQT, bark
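
Mixup is by far the most common augmentation in the table above. A minimal sketch of a batch-level mixup step, assuming one-hot targets (the alpha hyperparameter is chosen for illustration):

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Blend random pairs of examples and their one-hot targets (Zhang et al., 2018)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy batch: 8 log-mel patches, 10 scene classes
x = torch.randn(8, 1, 64, 100)
y = torch.eye(10)[torch.randint(0, 10, (8,))]
x_mix, y_mix = mixup(x, y)
```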



Machine learning characteristics

Code | Technical report | Official system rank | Logloss (eval) | Accuracy % (eval) | External data usage | External data sources | Model complexity (total params) | Model complexity (audio params) | Model complexity (visual params) | Classifier | Ensemble subsystems | Decision making
Boes_KUL_task1b_1 Boes2021 23 0.653 74.5 embeddings 2302441 CNN, transformer
Boes_KUL_task1b_2 Boes2021 25 0.683 76.0 embeddings 2302441 CNN, transformer
Boes_KUL_task1b_3 Boes2021 26 0.701 76.3 embeddings 2302441 CNN, transformer
Boes_KUL_task1b_4 Boes2021 24 0.681 76.0 embeddings 2302441 CNN, transformer
Diez_Noismart_task1b_1 Diez2021 35 1.061 65.2 embeddings 970294 970294 CNN maximum likelihood
Diez_Noismart_task1b_2 Diez2021 38 1.096 64.4 embeddings 751690 751690 CNN maximum likelihood
Diez_Noismart_task1b_3 Diez2021 34 1.060 64.7 embeddings 972598 972598 CNN maximum likelihood
Du_USTC_task1b_1 Wang2021 8 0.241 92.9 pre-trained model Places365 933932599 250090842 678276533 GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble 4 average
Du_USTC_task1b_2 Wang2021 7 0.238 92.7 pre-trained model Places365 1185866410 404955720 780910690 GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble 8 average
Du_USTC_task1b_3 Wang2021 6 0.222 93.2 pre-trained model Places365 263064259 154864878 102634157 CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble 4 average
Du_USTC_task1b_4 Wang2021 5 0.221 93.2 pre-trained model Places365, ImageNet 373738849 215535992 149650221 CNN, ResNet, DNN, DenseNet, ResNeSt, ensemble 6 average
Fedorishin_UB_task1b_1 Fedorishin2021 37 1.077 67.2 1351562 1351562 0 CNN maximum likelihood
Fedorishin_UB_task1b_2 Fedorishin2021 33 1.028 68.7 5422730 5422730 0 CNN maximum likelihood
Hou_UGent_task1b_1 Hou2021 20 0.555 81.5 embeddings, pre-trained model 28329010 2113024 26082088 CNN, ResNet maximum likelihood
Hou_UGent_task1b_2 Hou2021 29 0.771 81.8 embeddings, pre-trained model 28329010 2113024 26082088 CNN, ResNet 3 maximum likelihood
Hou_UGent_task1b_3 Hou2021 19 0.523 84.0 embeddings, pre-trained model 28329010 2113024 26082088 CNN, ResNet 2 maximum likelihood
Hou_UGent_task1b_4 Hou2021 16 0.416 85.6 embeddings, pre-trained model 28329010 2113024 26082088 CNN, ResNet 3 maximum likelihood
Naranjo-Alcazar_UV_task1b_1 Naranjo-Alcazar2021_t1b 18 0.495 86.5 pre-trained model Places365 Places365 15416148 323274 14820170 CRNN
Naranjo-Alcazar_UV_task1b_2 Naranjo-Alcazar2021_t1b 22 0.640 83.2 pre-trained model Places365 Places365 14820170 14820170 CRNN
Naranjo-Alcazar_UV_task1b_3 Naranjo-Alcazar2021_t1b 32 1.006 66.8 323274 323274 CRNN
Okazaki_LDSLVision_task1b_1 Okazaki2021 12 0.312 91.6 pre-trained model ImageNet pre-trained models, CLIP models 70676012 0 70676012 CNN, ViT, ensemble 4 average
Okazaki_LDSLVision_task1b_2 Okazaki2021 13 0.320 93.2 pre-trained model ImageNet pre-trained models, CLIP models 127257242 56581230 70676012 CNN, ViT, ensemble 7 average
Okazaki_LDSLVision_task1b_3 Okazaki2021 11 0.303 93.5 pre-trained model ImageNet pre-trained models, CLIP models 636286210 282906150 353380060 CNN, ViT, ensemble 35 average
Okazaki_LDSLVision_task1b_4 Okazaki2021 9 0.257 93.5 pre-trained model ImageNet pre-trained models, CLIP models 636286210 282906150 353380060 CNN, ViT, ensemble 35 average
Peng_CQU_task1b_1 Peng2021 45 1.395 68.2 directly 1817152 1817152 0 CNN average
Peng_CQU_task1b_2 Peng2021 40 1.172 67.8 directly 1817152 1817152 0 CNN average
Peng_CQU_task1b_3 Peng2021 41 1.172 67.8 directly 1817152 1817152 0 CNN average
Peng_CQU_task1b_4 Peng2021 43 1.233 68.5 directly 1817152 1817152 0 CNN average
Pham_AIT_task1b_1 Pham2021 44 1.311 73.0 180626842 47582256 133044586 CNN 6 Average
Pham_AIT_task1b_2 Pham2021 21 0.589 88.3 180626842 47582256 133044586 CNN 6 Average
Pham_AIT_task1b_3 Pham2021 17 0.434 88.4 180626842 47582256 133044586 CNN 6 Average
Pham_AIT_task1b_4 Pham2021 28 0.738 91.5 180626842 47582256 133044586 CNN 6 Average
Triantafyllopoulos_AUD_task1b_1 Triantafyllopoulos2021 39 1.157 58.4 1468139 1468139 WaveTransformer maximum likelihood
Triantafyllopoulos_AUD_task1b_2 Triantafyllopoulos2021 27 0.735 73.6 2747339 11897999 4691020 MultimodalWaveTransformer maximum likelihood
Triantafyllopoulos_AUD_task1b_3 Triantafyllopoulos2021 30 0.785 73.7 2107739 11258399 4691020 MultimodalWaveTransformer maximum likelihood
Triantafyllopoulos_AUD_task1b_4 Triantafyllopoulos2021 31 0.872 70.3 2107739 11258399 4691020 MultimodalWaveTransformer maximum likelihood
Wang_BIT_task1b_1 Wang2021a 36 1.061 74.1 Transformer maximum likelihood
Wang_BIT_task1b_2 Liang2021 42 1.180 62.4 embeddings 14553134 9489294 5029654 CNN maximum likelihood
DCASE2021 baseline 0.662 77.1 embeddings 711454 338634 338634 CNN maximum likelihood
Yang_THU_task1b_1 Yang2021 15 0.332 90.8 directly ImageNet 102000000 81000000 19800000 CNN maximum likelihood
Yang_THU_task1b_2 Yang2021 14 0.321 90.8 directly ImageNet 40000000 18974474 19800000 Multi-head Attention, MLP maximum likelihood
Yang_THU_task1b_3 Yang2021 10 0.279 92.1 directly ImageNet 121000000 99974474 19800000 Multi-head Attention, MLP, CNN maximum likelihood
Zhang_IOA_task1b_1 Wang2021b 3 0.201 93.5 directly, pre-trained model ImageNet, Places365, EfficientNet, PyTorch Image Models 100865990 17479760 77985102 CNN, EfficientNet, Swin Transformer, ensemble 4 weighted vote
Zhang_IOA_task1b_2 Wang2021b 4 0.205 93.6 directly, pre-trained model ImageNet, Places365, EfficientNet, PyTorch Image Models 103370202 17479760 77985102 CNN, EfficientNet, Swin Transformer, ensemble 6 average vote
Zhang_IOA_task1b_3 Wang2021b 1 0.195 93.8 directly, pre-trained model ImageNet, Places365, EfficientNet, PyTorch Image Models 110848636 25882876 77985102 CNN, EfficientNet, Swin Transformer, ensemble 5 weighted vote
Zhang_IOA_task1b_4 Wang2021b 2 0.199 93.9 directly, pre-trained model ImageNet, Places365, EfficientNet, PyTorch Image Models 115725988 25882876 77985102 CNN, EfficientNet, Swin Transformer, ensemble 9 average vote

Technical reports

Multi-Source Transformer Architectures for Audiovisual Scene Classification

Wim Boes and Hugo Van hamme
ESAT, KU Leuven, Leuven, Belgium

Abstract

In this technical report, the systems we submitted for subtask 1B of the DCASE 2021 challenge, regarding audiovisual scene classification, are described in detail. They are essentially multi-source transformers employing a combination of auditory and visual features to make predictions. These models are evaluated utilizing the macro-averaged multi-class cross-entropy and accuracy metrics. In terms of the macro-averaged multi-class cross-entropy, our best model achieved a score of 0.620 on the validation data. This is slightly better than the performance of the baseline system (0.658). With regard to the accuracy measure, our best model achieved a score of 77.1% on the validation data, which is about the same as the performance obtained by the baseline system (77.0%).
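
For illustration, a rough sketch of the multi-source idea: audio and visual token sequences, each tagged with a learned modality embedding, are concatenated and processed by a single transformer encoder. All layer sizes are assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class MultiSourceTransformer(nn.Module):
    """Joint transformer over concatenated audio and visual token sequences."""
    def __init__(self, d=128, n_classes=10):
        super().__init__()
        self.mod_emb = nn.Embedding(2, d)               # 0 = audio, 1 = visual
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, audio_tokens, visual_tokens):     # (B, Ta, d), (B, Tv, d)
        a = audio_tokens + self.mod_emb.weight[0]
        v = visual_tokens + self.mod_emb.weight[1]
        h = self.encoder(torch.cat([a, v], dim=1))      # (B, Ta + Tv, d)
        return self.head(h.mean(dim=1))                 # pool over tokens, classify

logits = MultiSourceTransformer()(torch.randn(2, 10, 128), torch.randn(2, 5, 128))
```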

System characteristics
Input mono
Sampling rate 22.05kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, transformer
PDF

Audio Scene Classification Using Enhanced Convolutional Neural Networks for DCASE 2021 Challenge

Itxasne Diez1 and Ibon Saratxaga2
1Getxo, Basque Country, Spain, 2HiTZ Center – Aholab Signal Processing Laboratory, University of the Basque Country, Bilbao, Basque Country, Spain

Abstract

This technical report describes our system proposed for Task 1B (Audio-Visual Scene Classification) of the DCASE 2021 Challenge. Our system focuses on classification based on the audio signal. Its architecture combines convolutional neural networks and OpenL3 embeddings. The CNN consists of three stacked 2D convolutional layers that process the log-mel spectrogram parameters obtained from the input signals. Additionally, OpenL3 embeddings of the input signals are calculated and merged with the output of the CNN stack. The resulting vector is fed to a classification block consisting of two fully connected layers. The mixup augmentation technique is applied to the training data, and binaural data is also used as input to provide additional information. In this report, we describe the proposed systems in detail and compare them to the baseline approach using the provided development datasets.
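
A minimal sketch of the described architecture: three stacked 2D convolutional layers over the log-mel spectrogram, whose pooled output is merged with an OpenL3 clip embedding and fed to two fully connected layers. All sizes are hypothetical:

```python
import torch
import torch.nn as nn

class CNNPlusOpenL3(nn.Module):
    def __init__(self, n_classes=10, d_openl3=512):
        super().__init__()
        self.cnn = nn.Sequential(                       # 3 stacked conv blocks
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> (B, 64)
        self.fc = nn.Sequential(                        # classification block
            nn.Linear(64 + d_openl3, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, logmel, openl3):                  # (B,1,F,T), (B,512)
        return self.fc(torch.cat([self.cnn(logmel), openl3], dim=-1))

out = CNNPlusOpenL3()(torch.randn(2, 1, 64, 500), torch.randn(2, 512))
```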

System characteristics
Input mixed; binaural
Sampling rate 48.0kHz
Data augmentation mixup
Features log-mel spectrogram
Classifier CNN
Decision making maximum likelihood
PDF

Investigating Waveform and Spectrogram Feature Fusion for Audio Classification

Dennis Fedorishin1, Nishant Sankaran1, Deen Mohan1, Justas Birgiolas2, Philip Schneider2, Srirangaraj Setlur1 and Venu Govindaraju1
1Computer Science, Center for Unified Biometrics and Sensors, University at Buffalo, New York, USA, 2ACV Auctions, LLC., New York, USA

Abstract

This technical report presents our submitted system for the DCASE 2021 Challenge Task 1B: Audio-Visual Scene Classification. Focusing on the audio modality only, we investigate the use of two common feature representations within the audio understanding domain, the raw waveform and the Mel-spectrogram, and measure their degree of complementarity when using both representations in a fusion setting. We introduce a new model paradigm for audio classification by fusing features learned from Mel-spectrograms and the raw waveform from separate feature extraction branches. Our experimental results show that our proposed fusion model has a 4.5% increase in validation accuracy and a reduction of 0.14 in validation loss over the Task 1B baseline audio-only subnetwork. We further show that learned features of raw waveforms and Mel-spectrograms are indeed complementary to each other and that there is a consistent classification performance improvement over models trained on Mel-spectrograms alone.
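
A minimal sketch of the two-branch idea: a 1D convolutional branch on the raw waveform and a 2D convolutional branch on the Mel-spectrogram, fused by concatenation before the classifier (sizes hypothetical, not the authors' model):

```python
import torch
import torch.nn as nn

class WaveSpecFusion(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.wave = nn.Sequential(                      # raw-waveform branch
            nn.Conv1d(1, 32, 64, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, 16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())      # -> (B, 64)
        self.spec = nn.Sequential(                      # Mel-spectrogram branch
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> (B, 64)
        self.head = nn.Linear(64 + 64, n_classes)       # fused classifier

    def forward(self, wav, mel):                        # (B,1,N), (B,1,F,T)
        return self.head(torch.cat([self.wave(wav), self.spec(mel)], dim=-1))

out = WaveSpecFusion()(torch.randn(2, 1, 48000), torch.randn(2, 1, 64, 100))
```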

System characteristics
Input mono
Sampling rate 48.0kHz
Data augmentation mixup, time-shifting
Features log-mel energies, raw waveform
Classifier CNN
Decision making maximum likelihood
PDF

CNN-Based Dual-Stream Network for Audio-Visual Scene Classification

Yuanbo Hou1, Yizhou Tan2, Yue Chang3, Tianyang Huang3, Shengchen Li4, Xi Shao3 and Dick Botteldooren1
1Ghent University, Gent, Belgium, 2International School, Beijing University of Posts and Telecommunications, Beijing, China, 3Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 4Xi’an Jiaotong-Liverpool University, Suzhou, China

Abstract

This technical report presents the CNN-based dual-stream network for audio-visual scene classification in the DCASE 2021 Challenge (Task 1 Subtask B). The proposed method is trained only on the development dataset of Task 1 Subtask B and does not use any external dataset. In terms of performance, the proposed model achieves 0.318 log loss and 90.0% accuracy for scene classification on the development dataset, while the baseline achieves 0.658 log loss and 77.0% accuracy. Our results are reproducible; the source code is available at https://github.com/Yuanbo2020/DCASE2021-T1B.

System characteristics
Input mono
Sampling rate 48.0kHz
Features OpenL3
Classifier CNN, ResNet
Decision making maximum likelihood
PDF

Task 1B DCASE 2021: Audio-Visual Scene Classification with Squeeze-Excitation Convolutional Recurrent Neural Networks

Javier Naranjo-Alcazar1,2, Sergi Perez-Castanos1, Maximo Cobos1, Francesc J. Ferri1 and Pedro Zuccarello2
1Computer Science, Universitat de Valencia, Burjassot, Spain, 2Instituto Tecnológico de Informática, Valencia, Spain

Abstract

Automatic scene classification has always been one of the core tasks in every edition of the DCASE challenge. Until this edition, such classification was performed using only audio data, so the problem was defined as Acoustic Scene Classification (ASC). In this 2021 edition, audio data is accompanied by visual data, providing additional information that can be jointly exploited to achieve higher recognition accuracy. The proposed approach makes use of two separate networks which are respectively trained in isolation on audio and visual data, so that each network specializes in a given modality. After training each network, the fusion of information from the audio and visual subnetworks is performed at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. For the visual subnetwork, a VGG16 architecture pretrained on the Places365 dataset is used, applying a fine-tuning strategy over the challenge dataset. On the other hand, the audio subnetwork is trained from scratch and uses squeeze-excitation techniques as in previous contributions from this team. As a result, the final accuracy of the system is 92% on the development split, outperforming the baseline by 15 percentage points.
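
The audio subnetwork relies on squeeze-excitation; for reference, a standard squeeze-and-excitation block (Hu et al., 2018), independent of the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                               # (B, C, F, T)
        w = self.fc(x.mean(dim=(2, 3)))                 # squeeze -> (B, C)
        return x * w[:, :, None, None]                  # excite: per-channel scale

out = SEBlock(64)(torch.randn(2, 64, 32, 32))
```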

System characteristics
Input left, right, difference
Sampling rate 44.1kHz
Data augmentation mixup
Features gammatone spectrogram
Classifier CRNN
PDF

LDSLVision Submissions to DCASE'21: A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants

Soichiro Okazaki, Kong Quan and Tomoaki Yoshinaga
Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan

Abstract

In this report, we describe our solution for the audio-visual scene classification task (Task 1B) of the DCASE 2021 challenge. Our solution is based on a multi-modal fusion approach consisting of three different domain features: (1) log-mel spectrogram audio features extracted by CNN variants from audio files; (2) frame-wise image features extracted by CNN variants from video files; (3) text-guided frame-wise image features extracted by CLIP variants from video files. We trained the three domain models separately and created the final submissions by ensembling the class-wise confidences of the three domain models' outputs. With ensembling and post-processing of the confidences, our model reached 0.149 log loss (official baseline: 0.658) and 96.1% accuracy (official baseline: 77.0%) on the officially provided fold1 evaluation dataset of Task 1B.
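
The final submissions average the class-wise confidences of the three domain models. A minimal sketch of such confidence ensembling (uniform weights assumed; the authors' post-processing is not reproduced here):

```python
import numpy as np

def ensemble_confidences(prob_list, weights=None):
    """Weighted average of per-model class-probability matrices of shape (N, C)."""
    probs = np.stack(prob_list)                         # (M, N, C)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)               # (N, C)

# Three domain models (audio CNN, image CNN, CLIP) over 4 clips, 10 classes
rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(10), size=4) for _ in range(3)]
pred = ensemble_confidences(models).argmax(axis=1)
```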

System characteristics
Input mono+delta+delta-delta (3 channels)
Sampling rate 48.0kHz
Data augmentation (Audio) frequency masking, random gain; (Video) random affine, color jitter, gaussian blur, random erasing
Features log-mel spectrogram
Classifier CNN, ViT, ensemble
Decision making average
PDF

Convolutional Receptive Field Dual Selection Mechanism for Acoustic Scene Classification

Wang Peng1, Tianyang Zhang1 and Zehua Zou2
1Intelligent Information Technology and System Lab, Chongqing University, Chongqing, China, 2Image Information Processing Lab, Chongqing University, Chongqing, China

Abstract

A convolutional neural network (CNN), which can extract rich semantic information from a signal, is a representative feature learning network in acoustic scene classification (ASC). However, since the receptive field (RF) of a CNN is fixed, it is inefficient at capturing the dynamically changing time-frequency characteristics of the input log-mel spectrogram. In addition, although the log-mel spectrogram can be treated as an image, the time and frequency dimensions, which respectively represent acoustic event duration and frequency information, have different physical meanings. Therefore, existing receptive-field-adaptive methods, which obtain same-sized optimal receptive fields in the two dimensions, are not well suited for ASC. To tackle this problem, we propose a convolutional receptive field dual selection mechanism (CRFDS) in this paper. Acoustic scene classification experiments conducted on DCASE 2021 Subtask B with audio only show that CRFDS can achieve an accuracy of 71.45%.
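
For intuition, a loose sketch of a dual receptive-field selection over the frequency and time dimensions, in the spirit of selective-kernel networks: two branches with frequency- and time-oriented kernels are combined by learned per-channel attention. This illustrates the general mechanism only, not the authors' CRFDS implementation:

```python
import torch
import torch.nn as nn

class DualRFSelect(nn.Module):
    """Select between frequency- and time-oriented receptive fields per channel."""
    def __init__(self, channels):
        super().__init__()
        self.freq = nn.Conv2d(channels, channels, (5, 1), padding=(2, 0))
        self.time = nn.Conv2d(channels, channels, (1, 5), padding=(0, 2))
        self.attn = nn.Linear(channels, 2 * channels)   # branch attention logits

    def forward(self, x):                               # (B, C, F, T)
        f, t = self.freq(x), self.time(x)
        s = (f + t).mean(dim=(2, 3))                    # fuse + global pool: (B, C)
        a = self.attn(s).view(x.size(0), 2, -1).softmax(dim=1)  # (B, 2, C)
        return a[:, 0, :, None, None] * f + a[:, 1, :, None, None] * t

out = DualRFSelect(32)(torch.randn(2, 32, 64, 100))
```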

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Decision making average
PDF

DCASE 2021 Task 1B: Technical Report

Lam Pham, Alexander Schindler, Mina Schutz, Jasmin Lampert and Ross King
Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria

Abstract

This report presents a deep learning framework for audio-visual scene classification (SC). Our extensive experiments, conducted on the DCASE Task 1B development dataset, achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and both audio and visual input, respectively.
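
The submission names indicate PROD late fusion, i.e. combining per-network class posteriors by the product rule. A minimal sketch, assuming the posteriors are renormalized after multiplication:

```python
import numpy as np

def prod_fusion(prob_list, eps=1e-12):
    """Product-rule late fusion: multiply per-model posteriors, renormalize."""
    fused = np.prod(np.stack(prob_list), axis=0)        # (N, C)
    return fused / (fused.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(10), size=4) for _ in range(6)]  # 6 subsystems
pred = prod_fusion(models).argmax(axis=1)
```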

System characteristics
Input left, right
Sampling rate 48.0kHz
Data augmentation mixup
Features CQT, Gammatonegram, log-mel
Classifier CNN
Decision making Average
PDF

A Multimodal WaveTransformer Architecture Conditioned on OpenL3 Embeddings for Audio-Visual Scene Classification

Andreas Triantafyllopoulos1, Konstantinos Drossos2, Alexander Gebhard3, Alice Baird3 and Björn Schuller3
1audEERING GmbH, Gilching, Germany, 2Computing Sciences, Tampere University, Tampere, Finland, 3University of Augsburg, Augsburg, Germany

Abstract

In this report, we present our submission systems for Task 1B of the DCASE 2021 Challenge. We submit a total of four systems: one purely audio-based and three multimodal variants of the same architecture. The main module consists of the WaveTransformer architecture, which was recently introduced for automated audio captioning (AAC). We first adapt the architecture to the task of acoustic scene classification (ASC), and then extend it to handle multimodal signals by globally conditioning all layers on multimodal OpenL3 embeddings. As data augmentation, we apply time- and frequency-bin masking, as well as random cropping. Our best-effort system achieves a log loss of 0.568 and an accuracy of 79.5%.
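
One multimodal variant conditions the network via FiLM. For reference, a standard feature-wise linear modulation layer (Perez et al., 2018), with the conditioning vector standing in for the OpenL3 embedding and all dimensions hypothetical:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: per-channel scale and shift of features,
    with gamma and beta predicted from a conditioning embedding."""
    def __init__(self, d_cond, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(d_cond, 2 * channels)

    def forward(self, x, cond):                         # x: (B,C,F,T), cond: (B,d_cond)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]

film = FiLM(d_cond=512, channels=64)                    # e.g. OpenL3 embedding as cond
out = film(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```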

System characteristics
Input mono
Sampling rate 48.0kHz
Data augmentation frequency masking, time masking, random cropping
Features log-mel spectrogram
Classifier WaveTransformer; MultimodalWaveTransformer
Decision making maximum likelihood
PDF

A Model Ensemble Approach for Audio-Visual Scene Classification

Qing Wang1, Siyuan Zheng1, Yunqing Li1, Yajian Wang1, Yuzhong Wu2, Hu Hu3, Chao-Han Huck Yang3, Sabato Marco Siniscalchi4, Yannan Wang5, Jun Du1 and Chin-Hui Lee3
1NELSLIP, University of Science and Technology of China, Hefei, China, 2DSP & Speech Technology Laboratory, The Chinese University of Hong Kong, Hong Kong, China, 3School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA, 4Kore University of Enna, Italy, 5Tencent Ethereal Audio Lab, Tencent Corporation, Shenzhen, China

Abstract

In this technical report, we present our approach to Task 1B, Audio-Visual Scene Classification (AVSC), in the DCASE 2021 Challenge. We employ pre-trained networks trained on image datasets to extract video embeddings, whereas for audio embeddings, models trained from scratch are more appropriate for feature extraction. We propose several models for the AVSC task based on different audio and video embeddings using an early fusion strategy. In addition, we propose to use an audio-visual segment model (AVSM) to extract text embeddings. Data augmentation methods are used during training. Furthermore, a two-stage classification strategy is adopted by leveraging score fusion of two classifiers. Finally, a model ensemble of two-stage AVSC classifiers is used to obtain more robust predictions. The proposed systems are evaluated on the development set of TAU Urban Audio Visual Scenes 2021. Compared with the official baseline, our approach achieves a much lower log loss of 0.141 and a much higher accuracy of 95.3%.

System characteristics
Input binaural
Sampling rate 48.0kHz
Data augmentation mixup, channel confusion, SpecAugment, pitch shifting, speed change, random noise, mix audios, contrast, sharpness
Features log-mel energies
Classifier GMM, HMM, VGG, CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble; CNN, ResNet, DNN, ResNeSt, DenseNet, ensemble; CNN, ResNet, DNN, DenseNet, ResNeSt, ensemble
Decision making average
PDF

BIT Submission for DCASE 2021 Challenge Task 1

Yuxiang Wang and Shuang Liang
Electronic engineering, Beijing Institute of Technology, Beijing, China

Abstract

The DCASE 2021 challenge Task 1 contains two subtasks: (i) Low-Complexity Acoustic Scene Classification with Multiple Devices and (ii) Audio-Visual Scene Classification. In our submission systems, different methods are used for the different tasks. For Task 1A, we explore fsFCNN (frequency sub-sampling controlled fully convolutional network) with two-stage training. In order to reduce the model size, a knowledge distillation approach is used. For Task 1B, different models are used for the different modalities. For audio classification, the same fsFCNN structure with two-stage training is applied. For video classification, the TimeSformer model is used. Experimental results show that our final Task 1A model obtains an accuracy of 64.6% with a 128 KB model size. On the Task 1B development set, the audio modality achieves 80.6% accuracy and the video modality 92%.
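
The report uses knowledge distillation to reduce model size. For reference, the standard distillation loss (Hinton et al., 2015), combining temperature-softened teacher targets with hard labels (temperature and weighting are illustrative, not the authors' settings):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Distillation: KL to a temperature-softened teacher plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)                # gradient-scale correction
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

loss = kd_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```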

System characteristics
Input mono
Sampling rate 44.1kHz
Features log-mel energies
Classifier Transformer; CNN
Decision making maximum likelihood
PDF

Audio-Visual Scene Classification Using Transfer Learning and Hybrid Fusion Strategy

Meng Wang, Chengxin Chen, Yuan Xie, Hangting Chen, Yuzhuo Liu and Pengyuan Zhang
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China

Abstract

In this technical report, we describe the details of our submission for DCASE 2021 Task 1B. This task focuses on audio-visual scene classification. We use a 1D deep convolutional neural network integrated with three different acoustic features in our audio system, and perform two-stage fine-tuning of pre-trained models such as ResNet-50 and EfficientNet-b5 in our image system. In model-level fusion, the extracted audio and image embeddings are concatenated and used as input to a classifier. We also use decision-level fusion to make our system more robust. On the official train/test setup of the development dataset, our best single audio-visual system obtained a 0.159 log loss and 94.1% accuracy, compared to 0.623 and 78.5% for the audio-only system and 0.270 and 91.8% for the image-only system. Our final fusion system achieves a 0.143 log loss and 95.2% accuracy.

System characteristics
Input binaural
Sampling rate 48.0kHz
Data augmentation mixup
Features log-mel energies, CQT, bark
Classifier CNN, EfficientNet, Swin Transformer, ensemble
Decision making weighted vote; average vote
PDF

Scene Classification Using Acoustic and Visual Features

Yujie Yang1 and Yanan Luo2
1Tsinghua University, Shenzhen, China, 2Tencent, Shenzhen, China

Abstract

In this report, we provide a brief overview of our submission for the audio-visual scene classification task of the DCASE 2021 challenge. The report focuses on the joint use of audio and video features to improve the performance of scene classification. To extract audio features, we train a convolutional neural network similar to VGG to classify the log-mel spectra. To extract video features, we use ResNeXt to train an image classifier. Subsequently, we use both features for classification, which achieves better performance than using either feature alone.

System characteristics
Input mono
Sampling rate 44.1kHz
Data augmentation mixup, specaugment
Features log-mel energies
Classifier CNN; Multi-head Attention, MLP; Multi-head Attention, MLP, CNN
Decision making maximum likelihood
PDF