Task description
This task aims to build a Foley sound synthesis system that can generate plausible audio signals fitting given categories of Foley sound. The Foley sound categories comprise sound events and environmental background sounds. The challenge has two subproblems: the development of models with and without external resources. Participants are expected to submit a system for one of the two problems, and each problem is evaluated independently. Submissions are evaluated by Fréchet Audio Distance (FAD), followed by a subjective test.
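For reference, FAD fits one Gaussian to embeddings of the reference audio and another to embeddings of the generated audio, then computes the Fréchet distance between the two. The sketch below is illustrative only: it assumes clip-level embeddings from a pretrained audio model (the metric is typically computed with VGGish embeddings) and uses random vectors just to stay self-contained.

```python
# Illustrative FAD sketch: Frechet distance between Gaussians fitted to
# embeddings of reference and generated audio.
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """emb_*: (n_clips, emb_dim) arrays of clip-level embeddings."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # sqrtm may leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
print(frechet_audio_distance(rng.normal(size=(200, 128)),
                             rng.normal(0.3, 1.1, size=(200, 128))))
```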
Systems ranking
Track A
A big THANK YOU to the DCASE community members and the contestants who spent several hours rating other teams' anonymized sounds for the perceptual evaluation stage (see column '# Categories Rated by Team Members' in the FAD table).
Perceptual Evaluation Score
The weighted average of the three ratings was based on an audio quality : category fit : diversity ratio of 2:2:1 (a worked check of this weighting appears after the table).
All ratings are MOS scores on a 10-step scale. WA = weighted average score of audio quality, category fit, and diversity (2:2:1); AQ = audio quality; CF = category fit; Div = diversity (weighted 0.5).

| Submission Code | Technical Report | Official Rank | WA: Avg | WA: Dog Bark | WA: Footstep | WA: Gun Shot | WA: Keyboard | WA: Moving Motor Vehicle | WA: Rain | WA: Sneeze/Cough | AQ: Avg | AQ: Dog Bark | AQ: Footstep | AQ: Gun Shot | AQ: Keyboard | AQ: Moving Motor Vehicle | AQ: Rain | AQ: Sneeze/Cough | CF: Avg | CF: Dog Bark | CF: Footstep | CF: Gun Shot | CF: Keyboard | CF: Moving Motor Vehicle | CF: Rain | CF: Sneeze/Cough | Div: Avg | Div: Dog Bark | Div: Footstep | Div: Gun Shot | Div: Keyboard | Div: Moving Motor Vehicle | Div: Rain | Div: Sneeze/Cough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCASE2023_baseline_task7 | DCASE2023baseline2023 | 6 | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 | | | | | | | | |
| Chon_Gaudio_task7_trackA_1 | ChonGLI2023 | 2 | 6.967 | 7.984 | 6.865 | 7.255 | 6.989 | 6.881 | 6.243 | 6.553 | 6.657 | 7.612 | 6.455 | 6.814 | 6.814 | 6.446 | 5.928 | 6.528 | 7.154 | 8.223 | 7.082 | 7.573 | 7.157 | 7.131 | 6.306 | 6.606 | 7.214 | 8.250 | 7.250 | 7.500 | 7.000 | 7.250 | 6.750 | 6.500 |
| Yi_SURREY_task7_trackA_1 | YiSURREY2023 | 1 | 7.056 | 7.742 | 6.466 | 6.189 | 7.433 | 7.448 | 6.441 | 7.675 | 6.723 | 7.309 | 6.143 | 5.532 | 7.243 | 7.315 | 6.067 | 7.454 | 7.578 | 8.297 | 6.646 | 6.689 | 8.089 | 8.181 | 6.911 | 8.233 | 6.679 | 7.500 | 6.750 | 6.500 | 6.500 | 6.250 | 6.250 | 7.000 |
| Guan_HEU_task7_trackA_2 | GuanHEU2023 | 4 | 5.157 | 4.877 | 4.450 | 6.413 | 5.479 | 5.822 | 5.201 | 3.856 | 4.670 | 3.800 | 4.164 | 5.800 | 5.339 | 5.365 | 4.972 | 3.250 | 5.293 | 5.142 | 3.836 | 7.482 | 5.232 | 6.315 | 5.656 | 3.389 | 5.857 | 6.500 | 6.250 | 5.500 | 6.250 | 5.750 | 4.750 | 6.000 |
| Scheibler_LINE_task7_trackA_1 | ScheiblerLINE2023 | 3 | 6.887 | 7.333 | 6.832 | 7.317 | 7.199 | 6.474 | 5.222 | 7.834 | 6.355 | 6.479 | 6.263 | 6.771 | 6.886 | 6.131 | 4.780 | 7.180 | 7.327 | 7.479 | 7.192 | 7.896 | 7.861 | 7.054 | 5.150 | 8.655 | 7.071 | 8.750 | 7.250 | 7.250 | 6.500 | 6.000 | 6.250 | 7.500 |
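As a quick sanity check of the 2:2:1 weighting, the per-metric averages from the Yi_SURREY_task7_trackA_1 row above reproduce the reported weighted score exactly:

```python
# Recompute the weighted average from the Yi_SURREY_task7_trackA_1 row:
# audio quality 6.723, category fit 7.578, diversity 6.679 -> 7.056.
def weighted_score(audio_quality: float, category_fit: float, diversity: float) -> float:
    return (2 * audio_quality + 2 * category_fit + 1 * diversity) / 5

print(round(weighted_score(6.723, 7.578, 6.679), 3))  # 7.056
```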
FAD Score
Eval = evaluation dataset; Dev = development dataset. All values are FAD (lower is better).

| Submission Code | Technical Report | # Categories Rated by Team Members | Official Rank | FAD Rank | Eval: Avg FAD | Eval: Dog Bark | Eval: Footstep | Eval: Gun Shot | Eval: Keyboard | Eval: Moving Motor Vehicle | Eval: Rain | Eval: Sneeze/Cough | Dev: Avg FAD | Dev: Dog Bark | Dev: Footstep | Dev: Gun Shot | Dev: Keyboard | Dev: Moving Motor Vehicle | Dev: Rain | Dev: Sneeze/Cough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCASE2023_baseline_task7 | DCASE2023baseline2023 | | 6 | 6 | 9.702 | 13.412 | 8.108 | 7.952 | 5.230 | 16.107 | 13.338 | 3.771 | 8.701 | 13.614 | 6.826 | 6.152 | 5.065 | 11.239 | 14.449 | 3.563 |
| Chon_Gaudio_task7_trackA_1 | ChonGLI2023 | 7 | 2 | 3 | 5.540 | 11.456 | 5.959 | 3.021 | 4.090 | 6.173 | 5.738 | 2.340 | 5.522 | 11.464 | 4.575 | 3.782 | 6.190 | 5.814 | 4.746 | 2.083 |
| Lee_maum_task7_trackA_1 | Leemaum2023 | 4 | 9 | 9 | 12.937 | 9.265 | 6.924 | 10.451 | 6.488 | 37.748 | 7.778 | 11.903 | 11.331 | 9.716 | 4.858 | 8.672 | 5.227 | 29.206 | 10.450 | 11.187 |
| Lee_maum_task7_trackA_2 | Leemaum2023 | 4 | 10 | 10 | 12.946 | 10.549 | 7.747 | 7.643 | 9.922 | 38.558 | 6.585 | 9.620 | 10.900 | 10.854 | 5.751 | 5.588 | 7.413 | 29.562 | 8.140 | 8.992 |
| Lee_maum_task7_trackA_3 | Leemaum2023 | 4 | 8 | 8 | 12.429 | 11.719 | 6.903 | 7.287 | 9.292 | 35.209 | 6.787 | 9.804 | 10.586 | 12.056 | 5.742 | 5.420 | 7.242 | 26.474 | 8.043 | 9.126 |
| Lee_maum_task7_trackA_4 | Leemaum2023 | 4 | 7 | 7 | 9.883 | 9.287 | 6.910 | 7.881 | 6.603 | 22.310 | 6.750 | 9.436 | 8.964 | 9.700 | 5.566 | 6.037 | 5.370 | 19.305 | 7.946 | 8.827 |
| Yi_SURREY_task7_trackA_1 | YiSURREY2023 | 7 | 1 | 2 | 5.025 | 3.621 | 5.104 | 5.748 | 3.038 | 9.801 | 5.964 | 1.901 | 4.051 | 3.355 | 3.434 | 5.796 | 3.483 | 4.674 | 5.994 | 1.621 |
| Guan_HEU_task7_trackA_1 | GuanHEU2023 | 7 | 5 | 5 | 8.623 | 5.583 | 10.143 | 8.428 | 5.403 | 17.984 | 7.561 | 5.258 | 7.941 | 5.893 | 9.118 | 7.485 | 7.706 | 12.818 | 7.874 | 4.692 |
| Guan_HEU_task7_trackA_2 | GuanHEU2023 | 7 | 4 | 4 | 7.799 | 5.685 | 7.685 | 8.532 | 4.165 | 17.258 | 7.795 | 3.475 | 7.015 | 6.020 | 7.297 | 7.628 | 4.049 | 12.216 | 8.446 | 3.452 |
| Scheibler_LINE_task7_trackA_1 | ScheiblerLINE2023 | 6 | 3 | 1 | 4.777 | 3.679 | 8.073 | 3.655 | 2.775 | 7.422 | 5.225 | 2.609 | 4.156 | 3.726 | 5.713 | 3.226 | 3.415 | 5.453 | 5.308 | 2.253 |
Track B
A big THANK YOU to the DCASE community members and the contestants who spent several hours rating other teams' anonymized sounds for the perceptual evaluation stage (see column '# Categories Rated by Team Members' in the FAD table).
Perceptual Evaluation Score
In the case that multiple systems were submitted by one team, only the system with the best (lowest) FAD score per team was perceptually evaluated. The weighted average of the three ratings was based on an audio quality : category fit : diversity ratio of 2:2:1.
All ratings are MOS scores on a 10-step scale. WA = weighted average score of audio quality, category fit, and diversity (2:2:1); AQ = audio quality; CF = category fit; Div = diversity (weighted 0.5).

| Submission Code | Technical Report | Official Rank | WA: Avg | WA: Dog Bark | WA: Footstep | WA: Gun Shot | WA: Keyboard | WA: Moving Motor Vehicle | WA: Rain | WA: Sneeze/Cough | AQ: Avg | AQ: Dog Bark | AQ: Footstep | AQ: Gun Shot | AQ: Keyboard | AQ: Moving Motor Vehicle | AQ: Rain | AQ: Sneeze/Cough | CF: Avg | CF: Dog Bark | CF: Footstep | CF: Gun Shot | CF: Keyboard | CF: Moving Motor Vehicle | CF: Rain | CF: Sneeze/Cough | Div: Avg | Div: Dog Bark | Div: Footstep | Div: Gun Shot | Div: Keyboard | Div: Moving Motor Vehicle | Div: Rain | Div: Sneeze/Cough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCASE2023_baseline_task7 | DCASE2023baseline2023 | 18 | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 | | | | | | | | |
| Kamath_NUS_task7_trackB_2 | KamathNUS2023 | 3 | 4.647 | 4.807 | 4.073 | 5.010 | 4.276 | 5.248 | 4.013 | 5.102 | 3.988 | 3.789 | 3.554 | 4.346 | 3.911 | 4.642 | 3.378 | 4.295 | 4.612 | 4.979 | 3.629 | 5.054 | 4.029 | 5.727 | 3.406 | 5.459 | 6.036 | 6.500 | 6.000 | 6.250 | 5.500 | 5.500 | 6.500 | 6.000 |
| Chang_HYU_task7_trackB_1 | ChangHYU2023 | 1 | 6.515 | 5.659 | 7.111 | 6.557 | 7.384 | 6.155 | 7.042 | 5.699 | 6.085 | 4.882 | 6.738 | 5.879 | 7.296 | 6.069 | 6.860 | 4.873 | 6.845 | 6.014 | 7.288 | 7.013 | 7.789 | 6.442 | 7.370 | 6.000 | 6.714 | 6.500 | 7.500 | 7.000 | 6.750 | 5.750 | 6.750 | 6.750 |
| Jung_KT_task7_trackB_2 | JungKT2023 | 2 | 5.534 | 5.321 | 5.033 | 6.022 | 5.614 | 6.021 | 5.902 | 4.826 | 5.082 | 4.432 | 4.933 | 5.579 | 5.139 | 5.623 | 5.600 | 4.270 | 5.610 | 5.371 | 4.775 | 6.100 | 5.646 | 6.554 | 5.906 | 4.920 | 6.286 | 7.000 | 5.750 | 6.750 | 6.500 | 5.750 | 6.500 | 5.750 |
| Lee_MARG_task7_trackB_1 | LeeMARG2023 | 4 | 4.427 | 3.273 | 4.843 | 3.941 | 5.409 | 4.942 | 4.210 | 4.374 | 3.929 | 2.530 | 4.204 | 3.531 | 5.311 | 4.542 | 3.735 | 3.650 | 4.443 | 3.153 | 4.654 | 3.696 | 5.336 | 5.312 | 4.040 | 4.910 | 5.393 | 5.000 | 6.500 | 5.250 | 5.750 | 5.000 | 5.500 | 4.750 |
FAD Score
Eval = evaluation dataset; Dev = development dataset. All values are FAD (lower is better).

| Submission Code | Technical Report | # Categories Rated by Team Members | Official Rank | FAD Rank | Eval: Avg FAD | Eval: Dog Bark | Eval: Footstep | Eval: Gun Shot | Eval: Keyboard | Eval: Moving Motor Vehicle | Eval: Rain | Eval: Sneeze/Cough | Dev: Avg FAD | Dev: Dog Bark | Dev: Footstep | Dev: Gun Shot | Dev: Keyboard | Dev: Moving Motor Vehicle | Dev: Rain | Dev: Sneeze/Cough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCASE2023_baseline_task7 | DCASE2023baseline2023 | | 18 | 18 | 9.702 | 13.412 | 8.108 | 7.952 | 5.230 | 16.107 | 13.338 | 3.771 | 8.701 | 13.614 | 6.826 | 6.152 | 5.065 | 11.239 | 14.449 | 3.563 |
| Kamath_NUS_task7_trackB_1 | KamathNUS2023 | 7 | 15 | 15 | 9.081 | 6.468 | 6.348 | 10.665 | 5.656 | 24.674 | 6.498 | 3.259 | 7.341 | 6.455 | 4.875 | 7.922 | 4.521 | 16.567 | 8.023 | 3.026 |
| Kamath_NUS_task7_trackB_2 | KamathNUS2023 | 7 | 3 | 6 | 6.754 | 3.870 | 7.223 | 7.561 | 3.884 | 13.564 | 7.045 | 4.129 | 5.348 | 3.438 | 5.906 | 5.648 | 4.192 | 7.234 | 7.383 | 3.632 |
| Pillay_CMU_task7_trackB_1 | PillayCMU2023 | 4 | 22 | 22 | 12.034 | 14.607 | 6.656 | 16.268 | 5.279 | 16.471 | 9.451 | 15.506 | 11.257 | 14.436 | 5.505 | 12.523 | 5.355 | 13.252 | 12.766 | 14.964 |
| Qianbin_BIT_task7_trackB_1 | QianbinBIT2023 | 4 | 10 | 10 | 7.154 | 10.681 | 5.679 | 6.960 | 4.283 | 11.485 | 9.502 | 1.489 | 6.280 | 10.729 | 3.106 | 5.613 | 3.269 | 8.837 | 10.854 | 1.555 |
| Lee_maum_task7_trackB_1 | Leemaum2023 | 4 | 26 | 26 | 12.862 | 9.692 | 6.948 | 9.263 | 6.341 | 37.965 | 8.098 | 11.729 | 11.267 | 10.197 | 4.868 | 7.472 | 5.289 | 29.329 | 10.744 | 10.973 |
| Lee_maum_task7_trackB_2 | Leemaum2023 | 4 | 25 | 25 | 12.858 | 9.754 | 7.411 | 7.458 | 9.687 | 38.361 | 6.905 | 10.429 | 10.849 | 10.057 | 5.335 | 5.516 | 7.141 | 29.502 | 8.564 | 9.829 |
| Lee_maum_task7_trackB_3 | Leemaum2023 | 4 | 23 | 23 | 12.276 | 11.651 | 7.373 | 7.606 | 9.407 | 34.061 | 6.267 | 9.566 | 10.366 | 11.890 | 6.002 | 5.418 | 7.333 | 25.451 | 7.558 | 8.913 |
| Lee_maum_task7_trackB_4 | Leemaum2023 | 4 | 20 | 20 | 9.964 | 9.701 | 6.837 | 7.789 | 6.591 | 22.998 | 6.825 | 9.008 | 9.143 | 10.218 | 5.634 | 5.868 | 5.353 | 20.414 | 8.190 | 8.323 |
| Chang_HYU_task7_trackB_1 | ChangHYU2023 | 4 | 1 | 7 | 6.898 | 4.677 | 5.736 | 6.407 | 4.753 | 18.859 | 5.892 | 1.965 | 4.422 | 4.317 | 3.597 | 5.311 | 2.432 | 10.177 | 3.398 | 1.722 |
| Chang_HYU_task7_trackB_2 | ChangHYU2023 | 4 | 12 | 12 | 7.356 | 5.098 | 5.877 | 8.000 | 4.623 | 19.926 | 5.796 | 2.169 | 4.871 | 4.948 | 3.448 | 6.538 | 2.457 | 11.320 | 3.502 | 1.885 |
| Xie_SJTU_task7_trackB_1 | XieSJTU2023 | 6 | 13 | 13 | 7.407 | 8.035 | 6.987 | 8.185 | 3.495 | 13.565 | 9.267 | 2.315 | 6.050 | 7.564 | 4.761 | 6.237 | 2.176 | 9.853 | 9.592 | 2.167 |
| Xie_SJTU_task7_trackB_2 | XieSJTU2023 | 6 | 9 | 9 | 6.998 | 6.817 | 6.894 | 7.815 | 3.495 | 12.536 | 9.265 | 2.164 | 6.232 | 6.809 | 5.236 | 6.877 | 2.176 | 9.587 | 10.983 | 1.958 |
| Xie_SJTU_task7_trackB_3 | XieSJTU2023 | 6 | 8 | 8 | 6.992 | 7.017 | 6.949 | 7.913 | 3.600 | 11.621 | 9.350 | 2.492 | 6.458 | 6.991 | 5.300 | 7.286 | 2.569 | 9.716 | 11.071 | 2.271 |
| Xie_SJTU_task7_trackB_4 | XieSJTU2023 | 6 | 11 | 11 | 7.177 | 6.660 | 7.763 | 8.199 | 3.703 | 11.443 | 9.817 | 2.654 | 6.904 | 6.598 | 6.079 | 7.992 | 3.718 | 9.456 | 11.941 | 2.546 |
| QianXu_BIT_NUDT_task7_trackB_1 | QianXuBIT2023 | 4 | 21 | 21 | 10.644 | 17.956 | 6.526 | 10.180 | 4.901 | 14.348 | 5.616 | 14.979 | 8.817 | 18.385 | 6.301 | 6.729 | 3.130 | 7.759 | 5.229 | 14.186 |
| QianXu_BIT_NUDT_task7_trackB_2 | QianXuBIT2023 | 4 | 17 | 17 | 9.645 | 12.148 | 5.899 | 10.771 | 5.380 | 14.004 | 5.534 | 13.777 | 7.705 | 12.672 | 5.735 | 6.473 | 3.083 | 7.751 | 5.205 | 13.018 |
| QianXu_BIT_NUDT_task7_trackB_3 | QianXuBIT2023 | 4 | 19 | 19 | 9.959 | 13.526 | 6.064 | 10.615 | 5.574 | 16.127 | 5.864 | 11.944 | 7.857 | 14.248 | 5.662 | 6.395 | 3.400 | 8.971 | 5.184 | 11.136 |
| QianXu_BIT_NUDT_task7_trackB_4 | QianXuBIT2023 | 4 | 24 | 24 | 12.319 | 12.883 | 12.139 | 8.442 | 8.173 | 22.671 | 14.680 | 7.243 | 12.601 | 13.295 | 12.775 | 7.077 | 10.839 | 18.117 | 19.511 | 6.595 |
| Bai_JLESS_task7_trackB_1 | BaiJLESS2023 | 0 | 27 | 27 | 13.583 | 15.958 | 8.663 | 18.485 | 6.728 | 24.094 | 15.193 | 5.958 | 12.437 | 17.510 | 8.497 | 16.824 | 6.956 | 18.737 | 12.874 | 5.662 |
| Chun_Chosun_task7_trackB_2 | ChunChosun2023 | 5 | 14 | 14 | 8.351 | 8.690 | 7.265 | 10.764 | 5.602 | 13.941 | 9.512 | 2.684 | 7.376 | 8.382 | 6.203 | 8.294 | 3.748 | 10.974 | 11.562 | 2.467 |
| Wendner_JKU_task7_trackB_1 | WendnerJKU2023 | 5 | 28 | 28 | 15.736 | 8.979 | 9.950 | 15.354 | 12.564 | 31.160 | 21.753 | 10.388 | 15.669 | 10.093 | 9.682 | 11.984 | 13.334 | 26.435 | 28.391 | 9.763 |
| Jung_KT_task7_trackB_1 | JungKT2023 | 7 | 7 | 4 | 5.480 | 2.784 | 4.370 | 4.667 | 3.555 | 17.511 | 3.899 | 1.577 | 3.373 | 2.771 | 2.514 | 2.960 | 2.246 | 8.776 | 2.947 | 1.397 |
| Jung_KT_task7_trackB_2 | JungKT2023 | 7 | 2 | 1 | 5.023 | 3.348 | 3.990 | 3.495 | 4.074 | 14.861 | 3.529 | 1.865 | 3.181 | 3.087 | 2.580 | 2.560 | 2.255 | 7.540 | 2.626 | 1.617 |
| Jung_KT_task7_trackB_3 | JungKT2023 | 7 | 6 | 3 | 5.230 | 2.616 | 3.739 | 6.322 | 4.089 | 14.172 | 4.304 | 1.371 | 3.088 | 2.477 | 2.588 | 3.722 | 2.220 | 6.867 | 2.349 | 1.395 |
| Jung_KT_task7_trackB_4 | JungKT2023 | 7 | 5 | 2 | 5.026 | 4.854 | 3.103 | 4.790 | 3.665 | 13.604 | 3.727 | 1.435 | 3.215 | 4.673 | 2.045 | 3.614 | 2.450 | 6.018 | 2.322 | 1.380 |
| Lee_MARG_task7_trackB_1 | LeeMARG2023 | 4 | 4 | 5 | 6.409 | 6.947 | 4.563 | 10.657 | 3.900 | 11.602 | 5.491 | 1.699 | 4.766 | 7.778 | 3.712 | 8.208 | 3.584 | 4.359 | 4.386 | 1.332 |
| Chung_KAIST_task7_trackB_1 | ChungKAIST2023 | 5 | 16 | 16 | 9.192 | 10.389 | 6.832 | 7.572 | 5.188 | 15.653 | 13.348 | 5.359 | 7.841 | 11.783 | 6.283 | 6.668 | 5.168 | 10.830 | 9.498 | 4.655 |
System characteristics
Summary of the submitted system characteristics.
Track A
Rank | Submission Code | Technical Report | System input | ML method | Phase reconstruction | Acoustic feature | System Complexity (parameters) | Data Augmentation | Subsystem Count
---|---|---|---|---|---|---|---|---|---
6 | DCASE2023_baseline_task7 | DCASE2023baseline2023 | sound event label | VQ-VAE, PixelSNAIL | HiFi-GAN | spectrogram | 269992 | ||
2 | Chon_Gaudio_task7_trackA_1 | ChonGLI2023 | sound event label | diffusion model | modified HiFi-GAN | spectrogram | 642000000 | mixup, time stretching | |
9 | Lee_maum_task7_trackA_1 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
10 | Lee_maum_task7_trackA_2 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
8 | Lee_maum_task7_trackA_3 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
7 | Lee_maum_task7_trackA_4 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo, ensemble | HiFi-GAN | Gaussian latent variables | 369279688 | PhaseAug | 4 |
1 | Yi_SURREY_task7_trackA_1 | YiSURREY2023 | sound event label | diffusion model, VQ-VAE | HiFi-GAN | spectrogram | 1173847474 | 2 | |
5 | Guan_HEU_task7_trackA_1 | GuanHEU2023 | sound event label, caption | AudioLDM | 421000000 | ||||
4 | Guan_HEU_task7_trackA_2 | GuanHEU2023 | sound event label, caption | AudioLDM, Baseline | 421269992 | ||||
3 | Scheibler_LINE_task7_trackA_1 | ScheiblerLINE2023 | sound event label | VQ-VAE, diffusion model | HiFi-GAN | log-mel spectrogram | 977116210 |
Track B
Rank | Submission Code | Technical Report | System input | ML method | Phase reconstruction | Acoustic feature | System Complexity (parameters) | Data Augmentation | Subsystem Count
---|---|---|---|---|---|---|---|---|---
18 | DCASE2023_baseline_task7 | DCASE2023baseline2023 | sound event label | VQ-VAE, PixelSNAIL | HiFi-GAN | spectrogram | 269992 | ||
15 | Kamath_NUS_task7_trackB_1 | KamathNUS2023 | sound event label | StyleGAN2 | phase gradient heap integration | log-magnitude spectrogram | 62010138 | ||
3 | Kamath_NUS_task7_trackB_2 | KamathNUS2023 | sound event label | StyleGAN2 | phase gradient heap integration | log-magnitude spectrogram | 376959933 | time shifting, sound wrapping | 7 |
22 | Pillay_CMU_task7_trackB_1 | PillayCMU2023 | sound event label | VQ-VAE, PixelSNAIL | HiFi-GAN | spectrogram | 103316216 | time masking, frequency masking | 3 |
10 | Qianbin_BIT_task7_trackB_1 | QianbinBIT2023 | sound event label | VQ-VAE, PixelSNAIL, Bit-diffusion | HiFi-GAN | spectrogram | 112857385 | 2 | |
26 | Lee_maum_task7_trackB_1 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
25 | Lee_maum_task7_trackB_2 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
23 | Lee_maum_task7_trackB_3 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo | HiFi-GAN | Gaussian latent variables | 92319922 | PhaseAug | |
20 | Lee_maum_task7_trackB_4 | Leemaum2023 | sound event label | VAE, GAN, flow, VITS, PhaseAug, Avocodo, ensemble | HiFi-GAN | Gaussian latent variables | 369279688 | PhaseAug | 4 |
1 | Chang_HYU_task7_trackB_1 | ChangHYU2023 | sound event label | diffusion model | HiFi-GAN | log-mel spectrogram | 23374056 | ||
12 | Chang_HYU_task7_trackB_2 | ChangHYU2023 | sound event label | diffusion model | HiFi-GAN | log-mel spectrogram | 23374056 | ||
13 | Xie_SJTU_task7_trackB_1 | XieSJTU2023 | sound event label | VQ-VAE, Transformer | HiFi-GAN | spectrogram | 28224194 | ||
9 | Xie_SJTU_task7_trackB_2 | XieSJTU2023 | sound event label | VQ-VAE, Transformer, TransformerDecoder | HiFi-GAN | spectrogram | 40843458 | mixup | 3 |
8 | Xie_SJTU_task7_trackB_3 | XieSJTU2023 | sound event label | VQ-VAE, Transformer, TransformerDecoder, TransformerEncoder Discriminator | HiFi-GAN | spectrogram | 44037827 | mixup | 3
11 | Xie_SJTU_task7_trackB_4 | XieSJTU2023 | sound event label | VQ-VAE, Transformer, TransformerDecoder, TransformerEncoder Discriminator | HiFi-GAN | spectrogram | 44037827 | mixup | 3
21 | QianXu_BIT_NUDT_task7_trackB_1 | QianXuBIT2023 | sound | diffusion model | spectrogram | 113668609 | |||
17 | QianXu_BIT_NUDT_task7_trackB_2 | QianXuBIT2023 | sound | diffusion model | spectrogram | 113668609 | |||
19 | QianXu_BIT_NUDT_task7_trackB_3 | QianXuBIT2023 | sound | diffusion model | spectrogram | 113668609 | |||
24 | QianXu_BIT_NUDT_task7_trackB_4 | QianXuBIT2023 | sound | diffusion model | spectrogram | 113668609 | wavelet domain denoise | ||
27 | Bai_JLESS_task7_trackB_1 | BaiJLESS2023 | sound event label | CVAE-GAN | HiFi-GAN | spectrogram | 8760000 | gain, pitch shifting, time shifting, peak normalization | 7 |
14 | Chun_Chosun_task7_trackB_2 | ChunChosun2023 | sound event label | VQ-VAE, PixelSNAIL | HiFi-GAN | spectrogram | 386598842 | 2 | |
28 | Wendner_JKU_task7_trackB_1 | WendnerJKU2023 | sound event label | diffusion model, ensemble | 7167405 | gain reduction, time shifting | 7 | ||
7 | Jung_KT_task7_trackB_1 | JungKT2023 | sound event label, random noise | C-SupConGAN | HiFi-GAN | mel spectrogram | 21398259 | fade in/out, time masking | |
2 | Jung_KT_task7_trackB_2 | JungKT2023 | sound event label, random noise | C-SupConGAN | HiFi-GAN | mel spectrogram | 21398259 | fade in/out, time masking | |
6 | Jung_KT_task7_trackB_3 | JungKT2023 | sound event label, random noise | C-SupConGAN | HiFi-GAN | mel spectrogram | 21398259 | fade in/out, time masking | |
5 | Jung_KT_task7_trackB_4 | JungKT2023 | sound event label, random noise | C-SupConGAN | HiFi-GAN | mel spectrogram | 21398259 | fade in/out, time masking | |
4 | Lee_MARG_task7_trackB_1 | LeeMARG2023 | sound event label | VQ-VAE, PixelSNAIL, StyleGAN2-ADA | HiFi-GAN, Griffin-Lim | spectrogram | 116202572 | time stretching, time shifting, RoomSimulator, TanhDistortion, resample, time masking, pitch shift | 6 |
16 | Chung_KAIST_task7_trackB_1 | ChungKAIST2023 | sound event label | diffusion model | 87330433 |
WAV files used for the evaluation experiment
Technical reports
JLESS Submission to DCASE2023 Task7: Foley Sound Synthesis Using Non-Autoregressive Generative Model
Siwei Huang, Jisheng Bai, Yafei Jia, Jianfeng Chen
School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China, LianFeng Acoustic Technologies Co., Ltd. Xi'an, China
Bai_JLESS_task7_trackB_1
Abstract
This technical report describes our proposed system for DCASE2023 Task 7: Foley Sound Synthesis. We propose a GAN-based mel-spectrogram synthesis system that takes a conditional variational autoencoder (CVAE) as the generator, built from densely connected dilated convolution blocks, and a simple CNN as the discriminator. The CVAE decoder synthesizes fake mel-spectrograms by sampling from prior noise and a class condition, and the discriminator determines whether they are real. We also train a classifier to help the CVAE preserve the class-wise distribution. Finally, the audio is rendered with the HiFi-GAN vocoder.
System characteristics
System input | sound event label |
Machine learning method | CVAE-GAN |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Data augmentation | gain, pitch shifting, time shifting, peak normalization |
Subsystem count | 7 |
System complexity | 8760000 parameters |
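A minimal sketch of the adversarial setup the abstract describes, assuming a class-conditioned decoder as generator and a small CNN discriminator; module sizes and names are illustrative stand-ins, not the authors' implementation.

```python
# Hypothetical CVAE-GAN sketch: a class-conditioned decoder generates
# mel-spectrograms from noise, and a small CNN discriminates real vs. fake.
import torch
import torch.nn as nn

N_CLASSES, Z_DIM, MEL = 7, 128, 80

class Decoder(nn.Module):  # CVAE decoder used as the generator
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES, Z_DIM)
        self.net = nn.Sequential(nn.Linear(2 * Z_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, MEL * 32))
    def forward(self, z, y):
        h = torch.cat([z, self.emb(y)], dim=-1)
        return self.net(h).view(-1, 1, MEL, 32)  # (B, 1, mel, frames)

disc = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                     nn.Conv2d(16, 1, 3, 2, 1),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
gen, bce = Decoder(), nn.BCEWithLogitsLoss()

z = torch.randn(4, Z_DIM)
y = torch.randint(0, N_CLASSES, (4,))
fake = gen(z, y)
# The generator tries to make the discriminator output "real" (1) on fakes.
g_loss = bce(disc(fake), torch.ones(4, 1))
g_loss.backward()
```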
HYU Submission For The DCASE 2023 Task 7: Diffusion Probabilistic Model With Adversarial Training For Foley Sound Synthesis
Won-Gook Choi, Joon-Hyuk Chang
Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea, Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea
Chang_HYU_task7_trackB_1 Chang_HYU_task7_trackB_2
Abstract
This report describes the Hanyang University team's submission for the DCASE 2023 challenge Task 7, Foley Sound Synthesis. The goal of the task is to build a generative model that can synthesize high-quality and varied Foley sounds: dog barking, footsteps, gunshots, keyboards, moving motor vehicles, rainy scenes, and sneezing. The core of the submissions is a diffusion probabilistic model-based acoustic model, and we adopted adversarial training on the evidence lower bound (ELBO) of the diffusion model for higher quality. The submissions did not use any external dataset and achieved lower Fréchet audio distance (FAD) scores than the DCASE baseline, except for the sounds of moving motor vehicles.
System characteristics
System input | sound event label |
Machine learning method | diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | log-mel spectrogram |
System complexity | 23374056, 23374056 parameters |
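A minimal sketch of a DDPM-style noise-prediction training step of the kind the abstract describes; the adversarial ELBO term is omitted, and the tiny MLP stands in for the actual acoustic model.

```python
# Simplified diffusion training step: corrupt clean data with noise at a
# random timestep, then train the network to predict the added noise.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative noise schedule

model = nn.Sequential(nn.Linear(80 + 1, 256), nn.ReLU(), nn.Linear(256, 80))

x0 = torch.randn(16, 80)                        # stand-in for mel frames
t = torch.randint(0, T, (16,))
noise = torch.randn_like(x0)
xt = (alpha_bar[t].sqrt().unsqueeze(1) * x0
      + (1 - alpha_bar[t]).sqrt().unsqueeze(1) * noise)

# The network predicts the added noise, conditioned (crudely) on t.
pred = model(torch.cat([xt, t.float().unsqueeze(1) / T], dim=1))
loss = ((pred - noise) ** 2).mean()             # simplified ELBO (epsilon-MSE)
loss.backward()
```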
FALL-E: Gaudio Foley Synthesis System
Minsung Kang, Sangshin Oh, Hyeongi Moon, Kyungyun Lee, Ben Sangbae Chon
Gaudio Lab, Inc., Seoul, South Korea
Chon_Gaudio_task7_trackA_1
Abstract
This paper introduces FALL-E, Gaudio's Foley Synthesis System, which is submitted to the DCASE 2023 Task 7 Foley Synthesis Challenge (Track A). The system employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch using our extensive datasets, and we utilized a pre-trained language model. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on the text input. Moreover, we leveraged external language models to improve text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. We report the objective measure with respect to the official evaluation set, although our focus is on developing generally working sound generation models beyond the challenge.
System characteristics
System input | sound event label |
Machine learning method | diffusion model |
Phase reconstruction method | modified HiFi-GAN |
Acoustic features | spectrogram |
Data augmentation | mixup, time stretching |
System complexity | 642000000 parameters |
High-Quality Foley Sound Synthesis Using Monte Carlo Dropout
Chae-Woon Bang, Nam Kyun Kim, Chanjun Chun
Chosun University, Gwangju, South Korea, Korea Automotive Technology Institute, Gwangju, South Korea
Chun_Chosun_task7_trackB_2
Abstract
This technical report describes our Foley sound synthesis system for DCASE2023 Task 7, which aims to create Foley sound, widely used as sound effects in multimedia content. The system generates a 4-second audio clip for one of seven classes. Specifically, we fine-tuned the baseline model to improve its performance, and then ensembled models using Monte Carlo Dropout. The performance of the proposed system was compared with the baseline using Fréchet Audio Distance (FAD) as the audio evaluation metric. The results confirm that both the single model and the ensemble outperform the baseline system.
System characteristics
System input | sound event label |
Machine learning method | PixelSNAIL,VQ-VAE |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Subsystem count | 2 |
System complexity | 386598842 parameters |
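A minimal sketch of Monte Carlo Dropout as used for the ensemble above: dropout is kept active at inference time and several stochastic forward passes are averaged. The model here is a placeholder, not the fine-tuned baseline.

```python
# Monte Carlo Dropout: average multiple stochastic passes of a dropout model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Dropout(0.2), nn.Linear(128, 64))

def mc_dropout_forward(model, x, n_samples=8):
    model.train()  # .train() keeps Dropout stochastic at inference time
    with torch.no_grad():
        return torch.stack([model(x) for _ in range(n_samples)]).mean(dim=0)

out = mc_dropout_forward(model, torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 64])
```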
Foley Sound Synthesis In Waveform Domain With Diffusion Model
Yoonjin Chung, Junwon Lee, Juhan Nam
Graduate School of AI, KAIST, Graduate School of Culture Technology, KAIST
Chung_KAIST_task7_trackB_1
Abstract
Foley sound synthesis has become an important task due to the growing popularity of multimedia content, and it is an industrial use case of general audio synthesis. As participants of DCASE 2023 challenge Task 7 [1], we propose a diffusion-based model that generates class-conditioned general audio with classifier-free guidance. Our model follows a UNet-like structure while incorporating an LSTM [2] inside the encoder block. We report the FAD (Fréchet Audio Distance) scores of the generated results for each of the 7 sound classes.
System characteristics
System input | sound event label |
Machine learning method | diffusion model |
System complexity | 87330433 parameters |
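A minimal sketch of classifier-free guidance at sampling time, as mentioned in the abstract; the network, the null-label convention, and the guidance scale are illustrative assumptions, not the authors' configuration.

```python
# Classifier-free guidance: combine conditional and unconditional noise
# predictions from the same network.
import torch
import torch.nn as nn

N_CLASSES = 7
NULL = N_CLASSES                      # reserved "no condition" label index

class Eps(nn.Module):                 # toy noise-prediction network
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES + 1, 32)
        self.net = nn.Linear(80 + 32, 80)
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=-1))

model = Eps()
x = torch.randn(4, 80)
y = torch.full((4,), 2)               # e.g. class "gun shot"

w = 3.0                               # guidance scale (assumed)
eps_cond = model(x, y)
eps_uncond = model(x, torch.full((4,), NULL))
eps = eps_uncond + w * (eps_cond - eps_uncond)  # guided prediction
```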
Foley Sound Synthesis With AudioLDM For DCASE2023 Task 7
Shitong Fan, Qiaoxi Zhu, Feiyang Xiao, Haiyan Lan, Wenwu Wang, Jian Guan
Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia, Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Guan_HEU_task7_trackA_1 Guan_HEU_task7_trackA_2
Abstract
This report describes our submission for DCASE2023 Challenge Task 7, a system for Foley sound synthesis. Our system is based on AudioLDM, which offers high generation quality and computational efficiency for the text-to-audio task. Experiments are conducted on the dataset of DCASE2023 Challenge Task 7. The Fréchet audio distance (FAD) between the sound generated by our system and the actual sound samples is 5.120 in the category "DogBark" and 8.102 in the category "Rain", better than the baseline, being an FAD of 7.256 and 4.901 closer to the actual samples, respectively.
System characteristics
System input | sound event label, caption |
Machine learning method | AudioLDM,Baseline |
System complexity | 421000000, 421269992 parameters |
Foley Sound Synthesis Based On GAN Using Contrastive Learning Without Label Information
Hae Chun Chung, Yuna Lee, Jae Hoon Jung
KT Corporation, Republic of Korea
Jung_KT_task7_trackB_1 Jung_KT_task7_trackB_2 Jung_KT_task7_trackB_3 Jung_KT_task7_trackB_4
Abstract
Sound effects such as Foley sounds, used in radio or movies, have been difficult to create without the help of experts. Furthermore, while audio synthesis for speech has progressed actively, there has been little research on audio sounds that can be obtained in real life. In this technical report, we present our submission system for DCASE2023 Task 7: Foley sound synthesis. We participate in Track B, which forbids the use of external resources. We propose a framework that employs the loss functions of ContraGAN and C-SupConGAN on top of the Self-Attention GAN (SAGAN) structure. Our final system outperforms the baseline by a large margin.
System characteristics
System input | sound event label, random noise |
Machine learning method | C-SupConGAN |
Phase reconstruction method | HiFi-GAN |
Acoustic features | mel spectrogram |
Data augmentation | fade in/out, time masking |
System complexity | 21398259, 21398259, 21398259, 21398259 parameters |
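For context, a minimal supervised contrastive (SupCon) loss of the kind C-SupConGAN builds on; this is the generic loss, not the authors' exact GAN objective.

```python
# Supervised contrastive loss: pull together embeddings that share a label,
# push apart the rest, measured with temperature-scaled cosine similarity.
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """features: (B, D) L2-normalized embeddings; labels: (B,) class ids."""
    sim = features @ features.T / temperature            # (B, B) similarities
    mask_self = torch.eye(len(labels), dtype=torch.bool)
    sim.masked_fill_(mask_self, float('-inf'))           # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # Average log-probability over each anchor's positives, then over anchors.
    per_anchor = torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(1)
    per_anchor = per_anchor / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()

feats = F.normalize(torch.randn(8, 16), dim=1)
print(supcon_loss(feats, torch.randint(0, 7, (8,))))
```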
DCASE Task-7: StyleGAN2-Based Foley Sound Synthesis
Purnima Kamath, Tasnim Nishat Islam, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara
National University of Singapore, Singapore and Bangladesh University of Engineering and Technology, Bangladesh and Universitat Pompeu Fabra, Barcelona, Spain
Kamath_NUS_task7_trackB_1 Kamath_NUS_task7_trackB_2
Abstract
For the DCASE 2023 Task 7 (Track B), Foley Sound Synthesis, we submit two systems: (1) a StyleGAN conditioned on the class ID, and (2) an ensemble of StyleGANs, each trained unconditionally on a single class. We quantitatively find that both systems outperform the Task 7 baseline models in terms of FAD scores. Given the high inter-class and intra-class variance in the development datasets, the system conditioned on class ID is able to generate a smooth and homogeneous latent space, indicated by the subjective quality of its generated samples. The unconditionally trained ensemble generates more categorically recognizable samples than system 1, but tends to generate more instances of out-of-distribution or noisy samples.
System characteristics
System input | sound event label |
Machine learning method | StyleGAN2 |
Phase reconstruction method | phase gradient heap integration |
Acoustic features | log-magnitude spectrogram |
Data augmentation | time shifting, sound wrapping |
Subsystem count | 7 |
System complexity | 62010138, 376959933 parameters |
Foley Sound Synthesis at the DCASE 2023 Challenge
Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi
Gaudio Lab, Inc., Seoul, South Korea; KAIST, Daejeon, South Korea; Carnegie Mellon University, Pennsylvania, USA; New York University, New York, USA; Doshisha University, Kyoto, Japan; Ritsumeikan University, Kyoto, Japan; CNRS, Ecole Centrale Nantes, Nantes Universite, Nantes, France; The University of Tokyo, Tokyo, Japan
DCASE2023_baseline_task7
Abstract
The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been produced by human Foley artists, which involves manual recording and mixing of sound. However, recent advances in sound synthesis and generative models have generated interest in machine-assisted or automatic Foley synthesis techniques. To promote further research in this area, we have organized a challenge in DCASE 2023: Task 7 - Foley Sound Synthesis. Our challenge aims to provide a standardized evaluation framework that is both rigorous and efficient, allowing for the evaluation of different Foley synthesis systems. Through this challenge, we hope to encourage active participation from the research community and advance the state-of-the-art in automatic Foley synthesis. In this technical report, we provide a detailed overview of the Foley sound synthesis challenge, including task definition, dataset, baseline, evaluation scheme and criteria, and discussion.
System characteristics
System input | sound event label |
Machine learning method | VQ-VAE, PixelSNAIL |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
System complexity | 269992 parameters |
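A minimal sketch of the VQ-VAE quantization step at the core of the baseline pipeline: encoder outputs are snapped to their nearest codebook entries, and PixelSNAIL is trained over the resulting discrete token grid. Sizes are illustrative.

```python
# VQ-VAE quantization: map each encoder vector to its nearest codebook entry.
import torch

codebook = torch.randn(512, 64)            # (n_codes, code_dim)
z_e = torch.randn(1, 20, 64)               # encoder output: (B, positions, dim)

# Nearest codebook entry per position (Euclidean distance).
dists = torch.cdist(z_e, codebook.unsqueeze(0))    # (1, 20, 512)
tokens = dists.argmin(dim=-1)                      # discrete indices for PixelSNAIL
z_q = codebook[tokens]                             # quantized vectors, (1, 20, 64)
print(tokens.shape, z_q.shape)
```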
Conditional Foley Sound Synthesis With Limited Data: Two-Stage Data Augmentation Approach With StyleGAN2-ADA
Kyungsu Kim, Jinwoo Lee, Hayoon Kim, Kyogu Lee
Seoul National University Department of Intelligence and Information
Lee_MARG_task7_trackB_1
Abstract
This report introduces an audio synthesis system designed to tackle the task of Foley Sound Synthesis in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. Our proposed system comprises an ensemble of a baseline model and StyleGAN2-ADA. To optimize the system with limited data, without relying on external datasets and pretrained systems, we propose a two-stage data augmentation strategy. This approach involves augmenting input waveforms to expand the training dataset, as well as employing adaptive discriminator augmentation (ADA) to alleviate overfitting of the discriminator and ensure stable training. Experimental results demonstrate that our proposed ensemble system achieves an FAD (Fréchet Audio Distance) of 5.84 on the evaluation dataset.
System characteristics
System input | sound event label |
Machine learning method | PixelSNAIL,StyleGAN2-ADA,VQ-VAE |
Phase reconstruction method | HiFi-GAN, Griffin-Lim |
Acoustic features | spectrogram |
Data augmentation | time stretching, time shifting, RoomSimulator, TanhDistortion, resample, time masking, pitch shifting |
Subsystem count | 6 |
System complexity | 116202572 parameters |
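A minimal sketch of waveform-level augmentations of the kind listed above (time shifting, gain, time masking); parameter ranges are assumptions, and the authors' full chain (RoomSimulator, TanhDistortion, etc.) is not reproduced.

```python
# Simple waveform augmentation chain: circular time shift, random gain,
# and a short time mask, followed by clipping to [-1, 1].
import numpy as np

rng = np.random.default_rng(0)

def augment(wav: np.ndarray) -> np.ndarray:
    wav = np.roll(wav, rng.integers(-len(wav) // 10, len(wav) // 10))  # time shift
    wav = wav * rng.uniform(0.7, 1.3)                                  # random gain
    i = rng.integers(0, len(wav) - len(wav) // 20)                     # time mask
    wav[i : i + len(wav) // 20] = 0.0
    return np.clip(wav, -1.0, 1.0)

wav = rng.uniform(-0.5, 0.5, 16000 * 4)   # 4 s of audio at 16 kHz
print(augment(wav).shape)
```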
VIFS: An End-To-End Variational Inference For Foley Sound Synthesis
Junhyeok Lee, Hyeonuk Nam, Yong-Hwa Park
maum.ai Inc., Republic of Korea and Korea Advanced Institute of Science and Technology, Republic of Korea
Lee_maum_task7_trackA_1 Lee_maum_task7_trackA_2 Lee_maum_task7_trackA_3 Lee_maum_task7_trackA_4 Lee_maum_task7_trackB_1 Lee_maum_task7_trackB_2 Lee_maum_task7_trackB_3 Lee_maum_task7_trackB_4
Abstract
Foley sound synthesis (FSS) is a task to generate a sound for specific conditions. In this work, FSS is defined as a "category-to-sound" problem: generating various sounds for a given category. To address this diversity problem, we adopt VITS, a text-to-speech (TTS) model with variational inference. In addition, we apply various techniques from speech synthesis, including PhaseAug and Avocodo. Unlike TTS models, which generate short pronunciations from phonemes and a speaker identity, the category-to-sound problem requires generating diverse sounds from just a category class. To compensate for this difference between TTS and category-to-sound while maintaining consistency within each inference, we heavily modified the prior encoder to enhance consistency with posterior latent variables. This introduces an additional Gaussian on the prior encoder, which promotes variance within the category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which is able to generate high-quality sounds with diversity.
System characteristics
System input | sound event label |
Machine learning method | Avocodo,GAN,PhaseAug,VAE,VITS,ensemble,flow |
Phase reconstruction method | HiFi-GAN |
Acoustic features | Gaussian latent variables |
Data augmentation | PhaseAug |
Subsystem count | 4 |
System complexity | 92319922, 92319922, 92319922, 369279688, 92319922, 92319922, 92319922, 369279688 parameters |
DCASE Task 7: Foley Sound Synthesis
Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah
Carnegie Mellon University, Pittsburgh, USA
Pillay_CMU_task7_trackB_1
Abstract
Foley sound synthesis refers to the creation of realistic, diegetic sound effects for a piece of media, such as film or radio. We propose a deep learning system for Task 7 of the DCASE 2023 challenge that can generate original mono audio clips belonging to one of seven Foley sound categories. Our training dataset consists of 4,850 sound clips from the UrbanSound8K, FSD50K, and BBC Sound Effects datasets. We aim to improve the subjective and objective quality of generated sounds by passing as much meaningful information about the input data into latent representations as possible. The primary innovation in our submission is the change from using mel-spectrograms to using CEmbeddings (combined embeddings), which are input to the VQ-VAE and consist of mel-spectrograms concatenated with latent representations of audio produced by a pre-trained MERT model. Our submission to Track A utilizes the pre-trained MERT model; as such, PixelSNAIL was trained on CEmbeddings. Our submission to Track B uses PixelSNAIL retrained only on mel-spectrograms. Our code can be found here: https://github.com/ankitshah009/foley-sound-synthesis_DCASE_2023.
System characteristics
System input | sound event label |
Machine learning method | PixelSNAIL,VQ-VAE |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Data augmentation | time masking, frequency masking |
Subsystem count | 3 |
System complexity | 103316216 parameters |
Auto-Bit for DCASE2023 Task7 Technical Reports: Assemble System of BitDiffusion and PixelSNAIL
Anbin Qi
School Information and Electronics, Beijing Institute of Technology, Beijing, China
Qianbin_BIT_task7_trackB_1
Abstract
This technical report for DCASE 2023 Task 7 proposes using different methods and models for sound synthesis in different scene events. For the dog bark and sneeze/cough categories, a non-autoregressive model based on conditional Bit Diffusion was used for sound synthesis. For the other five types of sounds, an autoregressive model based on PixelSNAIL was used.
System characteristics
System input | sound event label |
Machine learning method | VQ-VAE, PixelSNAIL, Bit-diffusion |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Subsystem count | 2 |
System complexity | 112857385 parameters |
From Noise To Sound: Audio Synthesis Via Diffusion Models
Haojie Zhang, Kun Qian, Lin Shen, Lujundong Li, Kele Xu, Bin Hu
Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education (Beijing Institute of Technology), P. R. China, School of Medical Technology, Beijing Institute of Technology, P. R. China, National University of Defense Technology, P. R. China
QianXu_BIT_NUDT_task7_trackB_1 QianXu_BIT_NUDT_task7_trackB_2 QianXu_BIT_NUDT_task7_trackB_3 QianXu_BIT_NUDT_task7_trackB_4
Abstract
In this technical report, we describe our submission system for DCASE2023 Task 7: Foley Sound Synthesis (Track B). A Sound Pixelate Diffuse model is proposed to realize Foley sound synthesis. The model includes data format conversion and audio synthesis through the diffusion model. The synthesised audio is evaluated on the DCASE2023 Task 7 FAD evaluation set, and the best FAD score among all categories is 8.429.
System characteristics
System input | sound |
Machine learning method | diffusion model |
Acoustic features | spectrogram |
Data augmentation | wavelet domain denoise |
System complexity | 113668609, 113668609, 113668609, 113668609 parameters |
Class-Conditioned Latent Diffusion Model For DCASE 2023 Foley Sound Synthesis Challenge
Robin Scheibler, Takuya Hasumi, Yusuke Fujita, Tatsuya Komatsu, Ryuichi Yamamoto, Kentaro Tachibana
LINE Corporation, Tokyo, Japan
Scheibler_LINE_task7_trackA_1
Abstract
This report describes our submission to the DCASE 2023 Task 7: Foley sound synthesis challenge. We use a latent diffusion model (LDM) that generates a latent representation of audio conditioned on a specified audio class, a variational autoencoder that converts the latent representation to a mel-spectrogram, and a universal neural vocoder based on HiFi-GAN that reconstructs a natural waveform from the mel-spectrogram. We trained the LDM on the development set, with its audio class indices as conditioners for generating class-specific latent representations.
System characteristics
System input | sound event label |
Machine learning method | VQ-VAE,diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | log-mel spectrogram |
System complexity | 977116210 parameters |
Audio Diffusion For Foley Sound Synthesis
Timo Wendner, Patricia Hu, Tara Jadidi, Alexander Neuhauser
Johannes Kepler University, Linz, Austria
Wendner_JKU_task7_trackB_1
Abstract
This technical report describes our approach for Task 7 (Foley Sound Synthesis), Track B (using no external resources other than the ones provided) of the DCASE2023 Challenge. This work was carried out as part of an elective course in the Artificial Intelligence curriculum at Johannes Kepler University Linz by a student group. We use an ensemble of U-Net based diffusion models for waveform generation in seven predefined sound categories. We apply gain reduction to normalize and time shifting to augment the provided training data and test different noise schedulers and U-Net architectures. Applying different training strategies, we achieve competitive results for the majority of the sound classes while being more parameter efficient and allowing end-to-end generation on audio waveforms. Evaluated on the task's evaluation metric, i.e., the mean FAD score over all classes, we achieve a final score of 12.42 as compared to the score of the challenge baseline model of 9.68.
System characteristics
System input | sound event label |
Machine learning method | diffusion model,ensemble |
Data augmentation | gain reduction, time shifting |
Subsystem count | 7 |
System complexity | 7167405 parameters |
The X-LANCE System For DCASE2023 Challenge Task 7: Foley Sound Synthesis Track B
Zeyu Xie, Xuenan Xu, Baihan Li, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China
Xie_SJTU_task7_trackB_1 Xie_SJTU_task7_trackB_2 Xie_SJTU_task7_trackB_3 Xie_SJTU_task7_trackB_4
Abstract
This report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge Task 7: Foley sound synthesis, Track B. We first train a VQ-VAE model to learn a discrete representation of the audio spectrogram. An autoregressive model is then trained to predict discrete tokens based on input conditions. Finally, a trained vocoder converts the generated spectrogram, restored from the predicted tokens by the VQ-VAE decoder, into a waveform. To achieve higher accuracy, fidelity, and diversity, we introduce several training schemes: (1) a discriminator model to filter audio; (2) the mixup method for data augmentation; (3) clustering methods for better training. Our best system achieved an FAD score of 6.99 averaged over all categories.
System characteristics
System input | sound event label |
Machine learning method | Transformer,TransformerDecoder,TransformerEncoder Discriminator,VQ-VAE |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Data augmentation | mixup |
Subsystem count | 3 |
System complexity | 28224194, 40843458, 44037827, 44037827 parameters |
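A minimal sketch of the autoregressive stage the abstract describes: given a class condition, tokens are sampled one at a time and later decoded back to a spectrogram. A GRU stands in for the authors' Transformer, and all sizes are illustrative.

```python
# Autoregressive VQ-token sampling conditioned on a class id.
import torch
import torch.nn as nn

N_TOKENS, N_CLASSES, SEQ = 512, 7, 20

class ARStub(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(N_TOKENS + N_CLASSES, 64)  # class ids share the table
        self.rnn = nn.GRU(64, 64, batch_first=True)        # stand-in for a Transformer
        self.out = nn.Linear(64, N_TOKENS)
    def forward(self, seq):
        h, _ = self.rnn(self.tok(seq))
        return self.out(h[:, -1])                          # logits for the next token

model = ARStub()
seq = torch.tensor([[N_TOKENS + 3]])                       # condition = class 3
for _ in range(SEQ):                                       # sample token by token
    probs = model(seq).softmax(dim=-1)
    nxt = torch.multinomial(probs, 1)
    seq = torch.cat([seq, nxt], dim=1)
print(seq[:, 1:])                                          # generated VQ tokens
```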
Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7
Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang
University of Surrey, Guildford, United Kingdom
Yi_SURREY_task7_trackA_1
Abstract
Foley sound generation aims to synthesise the background sound for multimedia content, which involves computationally modelling sound effects with specialized techniques. In this work, we propose a diffusion-based generative model for DCASE 2023 challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data scarcity of the Task 7 training set, our model is initially trained with large-scale datasets and then adapted to this DCASE task via transfer learning. We have observed that the features extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by pairing the input label with related text embedding features obtained from contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering strategy to further refine the output, i.e., by selecting the best results from the generated candidate clips in terms of the similarity score between the sound and target labels. The overall system achieves a Fréchet audio distance (FAD) score of 4.765 on average among all seven classes, outperforming the baseline system, which achieves an FAD score of 9.7.
System characteristics
System input | sound event label |
Machine learning method | VQ-VAE,diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | spectrogram |
Subsystem count | 2 |
System complexity | 1173847474 parameters |
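A minimal sketch of the filtering strategy the abstract describes: several candidate clips are generated per category, and the one most similar to the label embedding is kept. The embedding functions are stubs standing in for a pretrained CLAP model.

```python
# Candidate filtering by label similarity: keep the generated clip whose
# embedding is closest (cosine similarity) to the label's text embedding.
import torch
import torch.nn.functional as F

def embed_audio(clips):           # stub: would call CLAP's audio encoder
    return F.normalize(torch.randn(len(clips), 512), dim=1)

def embed_text(label):            # stub: would call CLAP's text encoder
    return F.normalize(torch.randn(1, 512), dim=1)

candidates = [f"candidate_{i}.wav" for i in range(16)]
sims = embed_audio(candidates) @ embed_text("dog bark").T   # cosine similarity
best = candidates[int(sims.argmax())]
print("kept:", best)
```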