Foley Sound Synthesis


Challenge results

Task description

This task aims to build a foley sound synthesis system that can generate plausible audio signals fitting given categories of foley sound. The foley sound categories are composed of sound events and environmental background sounds. The challenge has two subproblems: the development of models with and without external resources. Participants are expected to submit a system for one of the two problems, and each problem is evaluated independently. Submissions are evaluated by Fréchet Audio Distance (FAD), followed by a subjective test.
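
FAD fits a multivariate Gaussian to embeddings of the reference audio and of the generated audio and measures the Fréchet distance between the two; the reference implementation extracts the embeddings with a VGGish model. A minimal sketch of the distance computation, assuming the embedding matrices have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) embedding sets."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower is better: a score of 0 means the generated set is statistically indistinguishable from the reference set under this embedding.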

Systems ranking

Track A

A big THANK YOU to the DCASE community members and the contestants who spent several hours rating other teams' anonymized sounds for the perceptual evaluation stage (see column '# Categories Rated by Team Members' in the FAD table).

Perceptual Evaluation Score

The weighted average of the three ratings was based on an audio quality : category fit : diversity ratio of 2:2:1.
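
Concretely, the combined score is (2·quality + 2·fit + diversity) / 5, which reproduces the table's averages (e.g. Yi_SURREY: quality 6.723, fit 7.578, diversity 6.679 gives 7.056). A minimal sketch:

```python
def weighted_perceptual_score(audio_quality: float, category_fit: float,
                              diversity: float) -> float:
    # 2:2:1 weighting: diversity counts half as much as quality and fit
    return (2 * audio_quality + 2 * category_fit + diversity) / 5

# spot-check against the Yi_SURREY row below
assert round(weighted_perceptual_score(6.723, 7.578, 6.679), 3) == 7.056
```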

Each row lists: Submission Code | Technical Report | Official Rank | then four blocks of eight scores, each ordered Average, Dog Bark, Footstep, Gun Shot, Keyboard, Moving Motor Vehicle, Rain, Sneeze/Cough: (1) Weighted Average Score of Audio Quality, Category Fit, and Diversity; (2) Audio Quality (MOS, 10 steps); (3) Category Fit (MOS, 10 steps); (4) Diversity (MOS, 10 steps; weight 0.5). The baseline row lists blocks (1)-(3) only.
DCASE2023_baseline_task7 DCASE2023baseline2023 6 3.810 2.688 4.160 3.237 5.150 3.862 4.175 3.400 3.831 2.930 4.158 3.504 5.137 3.543 4.115 3.432 3.789 2.447 4.162 2.969 5.163 4.182 4.235 3.368
Chon_Gaudio_task7_trackA_1 ChonGLI2023 2 6.967 7.984 6.865 7.255 6.989 6.881 6.243 6.553 6.657 7.612 6.455 6.814 6.814 6.446 5.928 6.528 7.154 8.223 7.082 7.573 7.157 7.131 6.306 6.606 7.214 8.250 7.250 7.500 7.000 7.250 6.750 6.500
Yi_SURREY_task7_trackA_1 YiSURREY2023 1 7.056 7.742 6.466 6.189 7.433 7.448 6.441 7.675 6.723 7.309 6.143 5.532 7.243 7.315 6.067 7.454 7.578 8.297 6.646 6.689 8.089 8.181 6.911 8.233 6.679 7.500 6.750 6.500 6.500 6.250 6.250 7.000
Guan_HEU_task7_trackA_2 GuanHEU2023 4 5.157 4.877 4.450 6.413 5.479 5.822 5.201 3.856 4.670 3.800 4.164 5.800 5.339 5.365 4.972 3.250 5.293 5.142 3.836 7.482 5.232 6.315 5.656 3.389 5.857 6.500 6.250 5.500 6.250 5.750 4.750 6.000
Scheibler_LINE_task7_trackA_1 ScheiblerLINE2023 3 6.887 7.333 6.832 7.317 7.199 6.474 5.222 7.834 6.355 6.479 6.263 6.771 6.886 6.131 4.780 7.180 7.327 7.479 7.192 7.896 7.861 7.054 5.150 8.655 7.071 8.750 7.250 7.250 6.500 6.000 6.250 7.500



FAD Score

Each row lists: Submission Code | Technical Report | # Categories Rated by Team Members | Official Rank | FAD Rank | Evaluation Dataset FAD scores (Average, Dog Bark, Footstep, Gun Shot, Keyboard, Moving Motor Vehicle, Rain, Sneeze/Cough) | Development Dataset FAD scores (same eight columns). The baseline row omits the categories-rated count.
DCASE2023_baseline_task7 DCASE2023baseline2023 6 6 9.702 13.412 8.108 7.952 5.230 16.107 13.338 3.771 8.701 13.614 6.826 6.152 5.065 11.239 14.449 3.563
Chon_Gaudio_task7_trackA_1 ChonGLI2023 7 2 3 5.540 11.456 5.959 3.021 4.090 6.173 5.738 2.340 5.522 11.464 4.575 3.782 6.190 5.814 4.746 2.083
Lee_maum_task7_trackA_1 Leemaum2023 4 9 9 12.937 9.265 6.924 10.451 6.488 37.748 7.778 11.903 11.331 9.716 4.858 8.672 5.227 29.206 10.450 11.187
Lee_maum_task7_trackA_2 Leemaum2023 4 10 10 12.946 10.549 7.747 7.643 9.922 38.558 6.585 9.620 10.900 10.854 5.751 5.588 7.413 29.562 8.140 8.992
Lee_maum_task7_trackA_3 Leemaum2023 4 8 8 12.429 11.719 6.903 7.287 9.292 35.209 6.787 9.804 10.586 12.056 5.742 5.420 7.242 26.474 8.043 9.126
Lee_maum_task7_trackA_4 Leemaum2023 4 7 7 9.883 9.287 6.910 7.881 6.603 22.310 6.750 9.436 8.964 9.700 5.566 6.037 5.370 19.305 7.946 8.827
Yi_SURREY_task7_trackA_1 YiSURREY2023 7 1 2 5.025 3.621 5.104 5.748 3.038 9.801 5.964 1.901 4.051 3.355 3.434 5.796 3.483 4.674 5.994 1.621
Guan_HEU_task7_trackA_1 GuanHEU2023 7 5 5 8.623 5.583 10.143 8.428 5.403 17.984 7.561 5.258 7.941 5.893 9.118 7.485 7.706 12.818 7.874 4.692
Guan_HEU_task7_trackA_2 GuanHEU2023 7 4 4 7.799 5.685 7.685 8.532 4.165 17.258 7.795 3.475 7.015 6.020 7.297 7.628 4.049 12.216 8.446 3.452
Scheibler_LINE_task7_trackA_1 ScheiblerLINE2023 6 3 1 4.777 3.679 8.073 3.655 2.775 7.422 5.225 2.609 4.156 3.726 5.713 3.226 3.415 5.453 5.308 2.253



Track B

A big THANK YOU to the DCASE community members and the contestants who spent several hours rating other teams' anonymized sounds for the perceptual evaluation stage (see column '# Categories Rated by Team Members' in the FAD table).

Perceptual Evaluation Score

Where one team submitted multiple systems, only the team's system with the best (lowest) FAD was perceptually evaluated. The weighted average of the three ratings was based on an audio quality : category fit : diversity ratio of 2:2:1.

Each row lists: Submission Code | Technical Report | Official Rank | then four blocks of eight scores, each ordered Average, Dog Bark, Footstep, Gun Shot, Keyboard, Moving Motor Vehicle, Rain, Sneeze/Cough: (1) Weighted Average Score of Audio Quality, Category Fit, and Diversity; (2) Audio Quality (MOS, 10 steps); (3) Category Fit (MOS, 10 steps); (4) Diversity (MOS, 10 steps; weight 0.5). The baseline row lists blocks (1)-(3) only.
DCASE2023_baseline_task7 DCASE2023baseline2023 18 3.810 2.688 4.160 3.237 5.150 3.862 4.175 3.400 3.831 2.930 4.158 3.504 5.137 3.543 4.115 3.432 3.789 2.447 4.162 2.969 5.163 4.182 4.235 3.368
Kamath_NUS_task7_trackB_2 KamathNUS2023 3 4.647 4.807 4.073 5.010 4.276 5.248 4.013 5.102 3.988 3.789 3.554 4.346 3.911 4.642 3.378 4.295 4.612 4.979 3.629 5.054 4.029 5.727 3.406 5.459 6.036 6.500 6.000 6.250 5.500 5.500 6.500 6.000
Chang_HYU_task7_trackB_1 ChangHYU2023 1 6.515 5.659 7.111 6.557 7.384 6.155 7.042 5.699 6.085 4.882 6.738 5.879 7.296 6.069 6.860 4.873 6.845 6.014 7.288 7.013 7.789 6.442 7.370 6.000 6.714 6.500 7.500 7.000 6.750 5.750 6.750 6.750
Jung_KT_task7_trackB_2 JungKT2023 2 5.534 5.321 5.033 6.022 5.614 6.021 5.902 4.826 5.082 4.432 4.933 5.579 5.139 5.623 5.600 4.270 5.610 5.371 4.775 6.100 5.646 6.554 5.906 4.920 6.286 7.000 5.750 6.750 6.500 5.750 6.500 5.750
Lee_MARG_task7_trackB_1 LeeMARG2023 4 4.427 3.273 4.843 3.941 5.409 4.942 4.210 4.374 3.929 2.530 4.204 3.531 5.311 4.542 3.735 3.650 4.443 3.153 4.654 3.696 5.336 5.312 4.040 4.910 5.393 5.000 6.500 5.250 5.750 5.000 5.500 4.750



FAD Score

Each row lists: Submission Code | Technical Report | # Categories Rated by Team Members | Official Rank | FAD Rank | Evaluation Dataset FAD scores (Average, Dog Bark, Footstep, Gun Shot, Keyboard, Moving Motor Vehicle, Rain, Sneeze/Cough) | Development Dataset FAD scores (same eight columns). The baseline row omits the categories-rated count.
DCASE2023_baseline_task7 DCASE2023baseline2023 18 18 9.702 13.412 8.108 7.952 5.230 16.107 13.338 3.771 8.701 13.614 6.826 6.152 5.065 11.239 14.449 3.563
Kamath_NUS_task7_trackB_1 KamathNUS2023 7 15 15 9.081 6.468 6.348 10.665 5.656 24.674 6.498 3.259 7.341 6.455 4.875 7.922 4.521 16.567 8.023 3.026
Kamath_NUS_task7_trackB_2 KamathNUS2023 7 3 6 6.754 3.870 7.223 7.561 3.884 13.564 7.045 4.129 5.348 3.438 5.906 5.648 4.192 7.234 7.383 3.632
Pillay_CMU_task7_trackB_1 PillayCMU2023 4 22 22 12.034 14.607 6.656 16.268 5.279 16.471 9.451 15.506 11.257 14.436 5.505 12.523 5.355 13.252 12.766 14.964
Qianbin_BIT_task7_trackB_1 QianbinBIT2023 4 10 10 7.154 10.681 5.679 6.960 4.283 11.485 9.502 1.489 6.280 10.729 3.106 5.613 3.269 8.837 10.854 1.555
Lee_maum_task7_trackB_1 Leemaum2023 4 26 26 12.862 9.692 6.948 9.263 6.341 37.965 8.098 11.729 11.267 10.197 4.868 7.472 5.289 29.329 10.744 10.973
Lee_maum_task7_trackB_2 Leemaum2023 4 25 25 12.858 9.754 7.411 7.458 9.687 38.361 6.905 10.429 10.849 10.057 5.335 5.516 7.141 29.502 8.564 9.829
Lee_maum_task7_trackB_3 Leemaum2023 4 23 23 12.276 11.651 7.373 7.606 9.407 34.061 6.267 9.566 10.366 11.890 6.002 5.418 7.333 25.451 7.558 8.913
Lee_maum_task7_trackB_4 Leemaum2023 4 20 20 9.964 9.701 6.837 7.789 6.591 22.998 6.825 9.008 9.143 10.218 5.634 5.868 5.353 20.414 8.190 8.323
Chang_HYU_task7_trackB_1 ChangHYU2023 4 1 7 6.898 4.677 5.736 6.407 4.753 18.859 5.892 1.965 4.422 4.317 3.597 5.311 2.432 10.177 3.398 1.722
Chang_HYU_task7_trackB_2 ChangHYU2023 4 12 12 7.356 5.098 5.877 8.000 4.623 19.926 5.796 2.169 4.871 4.948 3.448 6.538 2.457 11.320 3.502 1.885
Xie_SJTU_task7_trackB_1 XieSJTU2023 6 13 13 7.407 8.035 6.987 8.185 3.495 13.565 9.267 2.315 6.050 7.564 4.761 6.237 2.176 9.853 9.592 2.167
Xie_SJTU_task7_trackB_2 XieSJTU2023 6 9 9 6.998 6.817 6.894 7.815 3.495 12.536 9.265 2.164 6.232 6.809 5.236 6.877 2.176 9.587 10.983 1.958
Xie_SJTU_task7_trackB_3 XieSJTU2023 6 8 8 6.992 7.017 6.949 7.913 3.600 11.621 9.350 2.492 6.458 6.991 5.300 7.286 2.569 9.716 11.071 2.271
Xie_SJTU_task7_trackB_4 XieSJTU2023 6 11 11 7.177 6.660 7.763 8.199 3.703 11.443 9.817 2.654 6.904 6.598 6.079 7.992 3.718 9.456 11.941 2.546
QianXu_BIT_NUDT_task7_trackB_1 QianXuBIT2023 4 21 21 10.644 17.956 6.526 10.180 4.901 14.348 5.616 14.979 8.817 18.385 6.301 6.729 3.130 7.759 5.229 14.186
QianXu_BIT_NUDT_task7_trackB_2 QianXuBIT2023 4 17 17 9.645 12.148 5.899 10.771 5.380 14.004 5.534 13.777 7.705 12.672 5.735 6.473 3.083 7.751 5.205 13.018
QianXu_BIT_NUDT_task7_trackB_3 QianXuBIT2023 4 19 19 9.959 13.526 6.064 10.615 5.574 16.127 5.864 11.944 7.857 14.248 5.662 6.395 3.400 8.971 5.184 11.136
QianXu_BIT_NUDT_task7_trackB_4 QianXuBIT2023 4 24 24 12.319 12.883 12.139 8.442 8.173 22.671 14.680 7.243 12.601 13.295 12.775 7.077 10.839 18.117 19.511 6.595
Bai_JLESS_task7_trackB_1 BaiJLESS2023 0 27 27 13.583 15.958 8.663 18.485 6.728 24.094 15.193 5.958 12.437 17.510 8.497 16.824 6.956 18.737 12.874 5.662
Chun_Chosun_task7_trackB_2 ChunChosun2023 5 14 14 8.351 8.690 7.265 10.764 5.602 13.941 9.512 2.684 7.376 8.382 6.203 8.294 3.748 10.974 11.562 2.467
Wendner_JKU_task7_trackB_1 WendnerJKU2023 5 28 28 15.736 8.979 9.950 15.354 12.564 31.160 21.753 10.388 15.669 10.093 9.682 11.984 13.334 26.435 28.391 9.763
Jung_KT_task7_trackB_1 JungKT2023 7 7 4 5.480 2.784 4.370 4.667 3.555 17.511 3.899 1.577 3.373 2.771 2.514 2.960 2.246 8.776 2.947 1.397
Jung_KT_task7_trackB_2 JungKT2023 7 2 1 5.023 3.348 3.990 3.495 4.074 14.861 3.529 1.865 3.181 3.087 2.580 2.560 2.255 7.540 2.626 1.617
Jung_KT_task7_trackB_3 JungKT2023 7 6 3 5.230 2.616 3.739 6.322 4.089 14.172 4.304 1.371 3.088 2.477 2.588 3.722 2.220 6.867 2.349 1.395
Jung_KT_task7_trackB_4 JungKT2023 7 5 2 5.026 4.854 3.103 4.790 3.665 13.604 3.727 1.435 3.215 4.673 2.045 3.614 2.450 6.018 2.322 1.380
Lee_MARG_task7_trackB_1 LeeMARG2023 4 4 5 6.409 6.947 4.563 10.657 3.900 11.602 5.491 1.699 4.766 7.778 3.712 8.208 3.584 4.359 4.386 1.332
Chung_KAIST_task7_trackB_1 ChungKAIST2023 5 16 16 9.192 10.389 6.832 7.572 5.188 15.653 13.348 5.359 7.841 11.783 6.283 6.668 5.168 10.830 9.498 4.655



System characteristics

Summary of the submitted system characteristics.

Track A

Each row lists: Rank | Submission Code | Technical Report | System input | ML method | Phase reconstruction | Acoustic feature | System complexity (parameters) | Data augmentation | Subsystem count. Empty fields are omitted.
6 DCASE2023_baseline_task7 DCASE2023baseline2023 sound event label VQ-VAE, PixelSNAIL HiFi-GAN spectrogram 269992
2 Chon_Gaudio_task7_trackA_1 ChonGLI2023 sound event label diffusion model modified HiFi-GAN spectrogram 642000000 mixup, time stretching
9 Lee_maum_task7_trackA_1 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
10 Lee_maum_task7_trackA_2 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
8 Lee_maum_task7_trackA_3 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
7 Lee_maum_task7_trackA_4 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo, ensemble HiFi-GAN Gaussian latent variables 369279688 PhaseAug 4
1 Yi_SURREY_task7_trackA_1 YiSURREY2023 sound event label diffusion model, VQ-VAE HiFi-GAN spectrogram 1173847474 2
5 Guan_HEU_task7_trackA_1 GuanHEU2023 sound event label, caption AudioLDM 421000000
4 Guan_HEU_task7_trackA_2 GuanHEU2023 sound event label, caption AudioLDM, Baseline 421269992
3 Scheibler_LINE_task7_trackA_1 ScheiblerLINE2023 sound event label VQ-VAE, diffusion model HiFi-GAN log-mel spectrogram 977116210



Track B

Each row lists: Rank | Submission Code | Technical Report | System input | ML method | Phase reconstruction | Acoustic feature | System complexity (parameters) | Data augmentation | Subsystem count. Empty fields are omitted.
18 DCASE2023_baseline_task7 DCASE2023baseline2023 sound event label VQ-VAE, PixelSNAIL HiFi-GAN spectrogram 269992
15 Kamath_NUS_task7_trackB_1 KamathNUS2023 sound event label StyleGAN2 phase gradient heap integration log-magnitude spectrogram 62010138
3 Kamath_NUS_task7_trackB_2 KamathNUS2023 sound event label StyleGAN2 phase gradient heap integration log-magnitude spectrogram 376959933 time shifting, sound wrapping 7
22 Pillay_CMU_task7_trackB_1 PillayCMU2023 sound event label VQ-VAE, PixelSNAIL HiFi-GAN spectrogram 103316216 time masking, frequency masking 3
10 Qianbin_BIT_task7_trackB_1 QianbinBIT2023 sound event label VQ-VAE, PixelSNAIL, Bit-diffusion HiFi-GAN spectrogram 112857385 2
26 Lee_maum_task7_trackB_1 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
25 Lee_maum_task7_trackB_2 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
23 Lee_maum_task7_trackB_3 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo HiFi-GAN Gaussian latent variables 92319922 PhaseAug
20 Lee_maum_task7_trackB_4 Leemaum2023 sound event label VAE, GAN, flow, VITS, PhaseAug, Avocodo, ensemble HiFi-GAN Gaussian latent variables 369279688 PhaseAug 4
1 Chang_HYU_task7_trackB_1 ChangHYU2023 sound event label diffusion model HiFi-GAN log-mel spectrogram 23374056
12 Chang_HYU_task7_trackB_2 ChangHYU2023 sound event label diffusion model HiFi-GAN log-mel spectrogram 23374056
13 Xie_SJTU_task7_trackB_1 XieSJTU2023 sound event label VQ-VAE, Transformer HiFi-GAN spectrogram 28224194
9 Xie_SJTU_task7_trackB_2 XieSJTU2023 sound event label VQ-VAE, Transformer, TransformerDecoder HiFi-GAN spectrogram 40843458 mixup 3
8 Xie_SJTU_task7_trackB_3 XieSJTU2023 sound event label VQ-VAE, Transformer, TransformerDecoder, TransformerEncoder Discriminator HiFi-GAN spectrogram 44037827 mixup 3
11 Xie_SJTU_task7_trackB_4 XieSJTU2023 sound event label VQ-VAE, Transformer, TransformerDecoder, TransformerEncoder Discriminator HiFi-GAN spectrogram 44037827 mixup 3
21 QianXu_BIT_NUDT_task7_trackB_1 QianXuBIT2023 sound diffusion model spectrogram 113668609
17 QianXu_BIT_NUDT_task7_trackB_2 QianXuBIT2023 sound diffusion model spectrogram 113668609
19 QianXu_BIT_NUDT_task7_trackB_3 QianXuBIT2023 sound diffusion model spectrogram 113668609
24 QianXu_BIT_NUDT_task7_trackB_4 QianXuBIT2023 sound diffusion model spectrogram 113668609 wavelet domain denoise
27 Bai_JLESS_task7_trackB_1 BaiJLESS2023 sound event label CVAE-GAN HiFi-GAN spectrogram 8760000 gain, pitch shifting, time shifting, peak normalization 7
14 Chun_Chosun_task7_trackB_2 ChunChosun2023 sound event label VQ-VAE, PixelSNAIL HiFi-GAN spectrogram 386598842 2
28 Wendner_JKU_task7_trackB_1 WendnerJKU2023 sound event label diffusion model, ensemble 7167405 gain reduction, time shifting 7
7 Jung_KT_task7_trackB_1 JungKT2023 sound event label, random noise C-SupConGAN HiFi-GAN mel spectrogram 21398259 fade in/out, time masking
2 Jung_KT_task7_trackB_2 JungKT2023 sound event label, random noise C-SupConGAN HiFi-GAN mel spectrogram 21398259 fade in/out, time masking
6 Jung_KT_task7_trackB_3 JungKT2023 sound event label, random noise C-SupConGAN HiFi-GAN mel spectrogram 21398259 fade in/out, time masking
5 Jung_KT_task7_trackB_4 JungKT2023 sound event label, random noise C-SupConGAN HiFi-GAN mel spectrogram 21398259 fade in/out, time masking
4 Lee_MARG_task7_trackB_1 LeeMARG2023 sound event label VQ-VAE, PixelSNAIL, StyleGAN2-ADA HiFi-GAN, Griffin-Lim spectrogram 116202572 time stretching, time shifting, RoomSimulator, TanhDistortion, resample, time masking, pitch shift 6
16 Chung_KAIST_task7_trackB_1 ChungKAIST2023 sound event label diffusion model 87330433



Technical reports

JLESS Submission to DCASE2023 Task7: Foley Sound Synthesis Using Non-Autoregressive Generative Model

Siwei Huang, Jisheng Bai, Yafei Jia, Jianfeng Chen
School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China; LianFeng Acoustic Technologies Co., Ltd., Xi'an, China

Abstract

This technical report describes our proposed system for DCASE2023 Task 7: Foley Sound Synthesis. We propose a GAN-based mel-spectrogram synthesis system. We take a conditional variational auto-encoder (CVAE) as the generator, which consists of densely-connected dilated convolution blocks, and a simple CNN as the discriminator. The decoder of the CVAE synthesizes fake mel-spectrograms by sampling from prior noise and a class condition, and the discriminator determines whether each one is real or not. Furthermore, we also train a classifier to help the CVAE keep the class-wise distribution. Finally, the audio is generated from the mel-spectrogram using the HiFi-GAN vocoder.
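
As a rough illustration of the generation path described above (decoding prior noise plus a class condition into a mel-spectrogram), here is a minimal sketch; the module and variable names are hypothetical, not the authors' code:

```python
import torch

@torch.no_grad()
def sample_mel(decoder: torch.nn.Module, class_emb: torch.Tensor,
               n: int, latent_dim: int) -> torch.Tensor:
    # draw prior noise and pair it with the class embedding; the CVAE
    # decoder maps the pair to a synthetic mel-spectrogram
    z = torch.randn(n, latent_dim)
    cond = class_emb.unsqueeze(0).expand(n, -1)
    return decoder(torch.cat([z, cond], dim=-1))
```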

System characteristics
System input sound event label
Machine learning method CVAE-GAN
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Data augmentation gain, pitch shifting, time shifting, peak normalization
Subsystem count 7
System complexity 8760000 parameters
PDF

HYU Submission For The DCASE 2023 Task 7: Diffusion Probabilistic Model With Adversarial Training For Foley Sound Synthesis

Won-Gook Choi, Joon-Hyuk Chang
Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea

Abstract

This paper is a technical report of the Hanyang University team's submission for the DCASE 2023 challenge Task 7, Foley Sound Synthesis. The goal of the task is to build a generative model that can synthesize high-quality and diverse foley sounds: the sounds of dog barking, footsteps, gunshots, keyboards, moving motor vehicles, rainy scenes, and sneezing. The core strategy of the submissions is a diffusion probabilistic model-based acoustic model. We also adopted adversarial training on the evidence lower bound (ELBO) of the diffusion model for higher quality. The submissions did not use any external dataset and achieved lower Fréchet audio distance (FAD) scores than the DCASE baseline, except for the sounds of moving motor vehicles.
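
For context, the standard epsilon-prediction objective underlying such a diffusion acoustic model is sketched below; the adversarial ELBO term the authors add is omitted, and shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(eps_model, x0: torch.Tensor, t: torch.Tensor,
                       alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # corrupt clean features x0 (B, M, T) to diffusion step t, then train
    # the network to predict the injected noise
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(eps_model(x_t, t), noise)
```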

System characteristics
System input sound event label
Machine learning method diffusion model
Phase reconstruction method HiFi-GAN
Acoustic features log-mel spectrogram
System complexity 23374056, 23374056 parameters
PDF

FALL-E: Gaudio Foley Synthesis System

Minsung Kang, Sangshin Oh, Hyeongi Moon, Kyungyun Lee, Ben Sangbae Chon
Gaudio Lab, Inc., Seoul, South Korea

Abstract

This paper introduces FALL-E, Gaudio's Foley Synthesis System, which is submitted to the DCASE 2023 Task 7 Foley Synthesis Challenge (Track A). The system employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch using our extensive datasets, and we utilized a pre-trained language model. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on the text input. Moreover, we leveraged external language models to improve text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. We report the objective measure with respect to the official evaluation set, although our focus is on developing generally working sound generation models beyond the challenge.

System characteristics
System input sound event label
Machine learning method diffusion model
Phase reconstruction method modified HiFi-GAN
Acoustic features spectrogram
Data augmentation mixup, time stretching
System complexity 642000000 parameters
PDF

High-Quality Foley Sound Synthesis Using Monte Carlo Dropout

Chae-Woon Bang, Nam Kyun Kim, Chanjun Chun
Chosun University, Gwangju, South Korea; Korea Automotive Technology Institute, Gwangju, South Korea

Abstract

This technical report describes a foley sound synthesis system for DCASE2023 Task 7. The system aims to create foley sound, which is widely used for sound effects in multimedia content. To accomplish this, it uses a sound synthesis technique that generates a 4-second audio clip for one of seven classes. Specifically, we fine-tuned the baseline model to improve its performance. After that, we ensembled the models using Monte Carlo Dropout. The performance of the proposed system was compared with the baseline using Fréchet Audio Distance (FAD), an audio evaluation metric. As a result, we confirmed that both the single model and the ensemble model outperform the existing baseline system.
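
Monte Carlo Dropout keeps dropout active at inference time and averages several stochastic forward passes, which is what the ensembling above refers to. A minimal PyTorch sketch (names illustrative):

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    # switch everything to eval mode, then re-enable only the dropout layers
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor,
                       n_samples: int = 8) -> torch.Tensor:
    enable_mc_dropout(model)
    return torch.stack([model(x) for _ in range(n_samples)]).mean(dim=0)
```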

System characteristics
System input sound event label
Machine learning method PixelSNAIL,VQ-VAE
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Subsystem count 2
System complexity 386598842 parameters
PDF

Foley Sound Synthesis In Waveform Domain With Diffusion Model

Yoonjin Chung, Junwon Lee, Juhan Nam
Graduate School of AI, KAIST, Graduate School of Culture Technology, KAIST

Abstract

Foley sound synthesis has become an important task due to the growing popularity of multimedia content, and it is an industrial use case of general audio synthesis. As a participant in the DCASE 2023 challenge Task 7 [1], we propose a diffusion-based model that generates class-conditioned general audio in a classifier-free guidance manner. Our model follows a UNet-like structure while incorporating an LSTM [2] inside the encoder block. We report the FAD (Fréchet Audio Distance) scores of the generated results for each of the seven sound classes.
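
At sampling time, classifier-free guidance blends the conditional and unconditional noise estimates from the same network. A minimal sketch of that step (the model signature is hypothetical):

```python
import torch

def cfg_noise_estimate(model, x_t: torch.Tensor, t: torch.Tensor,
                       class_id, guidance_scale: float) -> torch.Tensor:
    # `None` stands for the learned "null" condition used during training
    eps_cond = model(x_t, t, class_id)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```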

System characteristics
System input sound event label
Machine learning method diffusion model
System complexity 87330433 parameters
PDF

Foley Sound Synthesis With AudioLDM For DCASE2023 Task 7

Shitong Fan, Qiaoxi Zhu, Feiyang Xiao, Haiyan Lan, Wenwu Wang, Jian Guan
Group of Intelligent Signal Processing (GISP), College of Computer Science and Technology, Harbin Engineering University, Harbin, China, Centre for Audio, Acoustic and Vibration (CAAV), University of Technology Sydney, Ultimo, Australia, Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK

Abstract

This report describes our submission for DCASE2023 Challenge Task 7, a system for foley sound synthesis. Our system is based on AudioLDM, which offers high generation quality and computational efficiency for the text-to-audio task. Experiments are conducted on the dataset of DCASE2023 Challenge Task 7. The Fréchet audio distance (FAD) between the sound generated by our system and the actual sound samples is 5.120 in the category “DogBark”, better than the baseline's 7.256, and 8.102 in the category “Rain”, an FAD distance 4.901 closer to the actual samples than the baseline.

System characteristics
System input sound event label, caption
Machine learning method AudioLDM,Baseline
System complexity 421000000, 421269992 parameters
PDF

Foley Sound Synthesis Based On GAN Using Contrastive Learning Without Label Information

Hae Chun Chung, Yuna Lee, Jae Hoon Jung
KT Corporation, Republic of Korea

Abstract

Sound effects used in radio or film, such as foley sound, have been difficult to create without the help of experts. Furthermore, in the field of audio synthesis, research on speech has progressed actively, but there has been little research on sounds encountered in everyday life. In this technical report, we present our submission system for DCASE2023 Task 7: Foley Sound Synthesis. We participate in Track B, which forbids the use of external resources. We propose a framework that employs the loss functions of ContraGAN and C-SupConGAN based on the structure of the Self-Attention GAN (SAGAN). Our final system outperforms the baseline by a large margin.

System characteristics
System input sound event label, random noise
Machine learning method C-SupConGAN
Phase reconstruction method HiFi-GAN
Acoustic features mel spectrogram
Data augmentation fade in/out, time masking
System complexity 21398259, 21398259, 21398259, 21398259 parameters
PDF

DCASE Task-7: StyleGAN2-Based Foley Sound Synthesis

Purnima Kamath, Tasnim Nishat Islam, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara
National University of Singapore, Singapore; Bangladesh University of Engineering and Technology, Bangladesh; Universitat Pompeu Fabra, Barcelona, Spain

Abstract

For the DCASE 2023 Task 7 (Track B), Foley Sound Synthesis, we submit two systems: (1) a StyleGAN conditioned on the class ID, and (2) an ensemble of StyleGANs, each trained unconditionally on one class. We quantitatively find that both systems outperform the Task 7 baseline models in terms of FAD scores. Given the high inter-class and intra-class variance in the development datasets, the system conditioned on class ID is able to generate a smooth and homogeneous latent space, indicated by the subjective quality of its generated samples. The unconditionally trained ensemble generates more categorically recognizable samples than system 1, but tends to generate more instances of out-of-distribution or noisy samples.

System characteristics
System input sound event label
Machine learning method StyleGAN2
Phase reconstruction method phase gradient heap integration
Acoustic features log-magnitude spectrogram
Data augmentation time shifting, sound wrapping
Subsystem count 7
System complexity 62010138, 376959933 parameters
PDF

Foley Sound Synthesis at the DCASE 2023 Challenge

Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi
Gaudio Lab, Inc., Seoul, South Korea; KAIST, Daejeon, South Korea; Carnegie Mellon University, Pennsylvania, USA; New York University, New York, USA; Doshisha University, Kyoto, Japan; Ritsumeikan University, Kyoto, Japan; CNRS, Ecole Centrale Nantes, Nantes Université, Nantes, France; The University of Tokyo, Tokyo, Japan

Abstract

The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been produced by human Foley artists, which involves manual recording and mixing of sound. However, recent advances in sound synthesis and generative models have generated interest in machine-assisted or automatic Foley synthesis techniques. To promote further research in this area, we have organized a challenge in DCASE 2023: Task 7 - Foley Sound Synthesis. Our challenge aims to provide a standardized evaluation framework that is both rigorous and efficient, allowing for the evaluation of different Foley synthesis systems. Through this challenge, we hope to encourage active participation from the research community and advance the state-of-the-art in automatic Foley synthesis. In this technical report, we provide a detailed overview of the Foley sound synthesis challenge, including task definition, dataset, baseline, evaluation scheme and criteria, and discussion.

System characteristics
System input sound event label
Machine learning method VQ-VAE, PixelSNAIL
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
System complexity 269992 parameters
PDF

Conditional Foley Sound Synthesis With Limited Data: Two-Stage Data Augmentation Approach With StyleGAN2-ADA

Kyungsu Kim, Jinwoo Lee, Hayoon Kim, Kyogu Lee
Department of Intelligence and Information, Seoul National University

Abstract

This report introduces an audio synthesis system designed to tackle the task of Foley Sound Synthesis in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge. Our proposed system comprises an ensemble of a baseline model and StyleGAN2-ADA. To optimize the system with the limited data, without relying on external datasets or pretrained systems, we propose a two-stage data augmentation strategy. This approach involves augmenting input waveforms to expand the size of the training dataset, as well as employing adaptive discriminator augmentation (ADA) to alleviate overfitting of the discriminator and ensure stable training. Experimental results demonstrate that our proposed ensemble system achieves an FAD (Fréchet Audio Distance) of 5.84 on the evaluation dataset.
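
The first augmentation stage operates on input waveforms. A sketch of such a chain with the audiomentations library is below; the transform choice and probabilities are illustrative defaults, not the authors' exact configuration:

```python
import numpy as np
from audiomentations import Compose, PitchShift, Shift, TimeMask, TimeStretch

augment = Compose([
    TimeStretch(p=0.5),  # random tempo change
    PitchShift(p=0.5),   # random pitch shift
    Shift(p=0.5),        # random time shift
    TimeMask(p=0.3),     # mask a random time span
])

wav = np.random.randn(22050 * 4).astype(np.float32)  # stand-in 4-second clip
wav_aug = augment(samples=wav, sample_rate=22050)
```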

System characteristics
System input sound event label
Machine learning method PixelSNAIL,StyleGAN2-ADA,VQ-VAE
Phase reconstruction method HiFi-GAN, Griffin-Lim
Acoustic features spectrogram
Data augmentation time stretching, time shifting, RoomSimulator, TanhDistortion, resample, time masking, pitch shifting
Subsystem count 6
System complexity 116202572 parameters
PDF

VIFS: An End-To-End Variational Inference For Foley Sound Synthesis

Junhyeok Lee, Hyeonuk Nam, Yong-Hwa Park
maum.ai Inc., Republic of Korea and Korea Advanced Institute of Science and Technology, Republic of Korea

Abstract

Foley sound synthesis (FSS) is the task of generating a sound for specific conditions. In this work, FSS is defined as a "category-to-sound" problem: generating various sounds for a given category. To address this diversity problem, we adopt VITS, a text-to-speech (TTS) model with variational inference. In addition, we apply various techniques from speech synthesis, including PhaseAug and Avocodo. Unlike TTS models, which generate a short pronunciation from phonemes and a speaker identity, the category-to-sound problem requires generating diverse sounds from just a category class. To compensate for this difference between TTS and category-to-sound while maintaining consistency within each inference, we heavily modified the prior encoder to enhance consistency with the posterior latent variables. This introduces an additional Gaussian in the prior encoder, which promotes variance within each category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which is able to generate high-quality sounds with diversity.

System characteristics
System input sound event label
Machine learning method Avocodo,GAN,PhaseAug,VAE,VITS,ensemble,flow
Phase reconstruction method HiFi-GAN
Acoustic features Gaussian latent variables
Data augmentation PhaseAug
Subsystem count 4
System complexity 92319922, 92319922, 92319922, 369279688, 92319922, 92319922, 92319922, 369279688 parameters
PDF

DCASE Task 7: Foley Sound Synthesis

Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah
Carnegie Mellon University, Pittsburgh, USA

Abstract

Foley sound synthesis refers to the creation of realistic, diegetic sound effects for a piece of media, such as film or radio. We propose a deep learning system for Task 7 of the DCASE 2023 challenge that can generate original mono audio clips belonging to one of seven foley sound categories. Our training dataset consists of 4,850 sound clips from the UrbanSound8K, FSD50K, and BBC Sound Effects datasets. We aim to improve the subjective and objective quality of generated sounds by passing as much meaningful information about the input data into latent representations as possible. The primary innovation in our submission is the change from using mel-spectrograms to using CEmbeddings (combined embeddings), which are input to the VQ-VAE and consist of mel-spectrograms concatenated with latent representations of audio produced by a pre-trained MERT model. Our submission to Track A utilizes the pre-trained MERT model; as such, PixelSNAIL was trained on CEmbeddings. Our submission to Track B utilizes PixelSNAIL retrained only on mel-spectrograms. Our code can be found here: https://github.com/ankitshah009/foley-sound-synthesis_DCASE_2023.
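
A minimal sketch of forming such a combined embedding, assuming the mel-spectrogram and the MERT latents have already been computed and time-aligned to the same number of frames (names hypothetical):

```python
import torch

def make_cembedding(mel: torch.Tensor, mert: torch.Tensor) -> torch.Tensor:
    # mel: (B, n_mels, T), mert: (B, d_mert, T) -> (B, n_mels + d_mert, T)
    assert mel.shape[-1] == mert.shape[-1], "frame counts must match"
    return torch.cat([mel, mert], dim=1)
```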

System characteristics
System input sound event label
Machine learning method PixelSNAIL,VQ-VAE
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Data augmentation time masking, frequency masking
Subsystem count 3
System complexity 103316216 parameters
PDF

Auto-Bit for DCASE2023 Task7 Technical Reports: Assemble System of BitDiffusion and PixelSNAIL

Anbin Qi
School of Information and Electronics, Beijing Institute of Technology, Beijing, China

Abstract

This paper is a technical report on DCASE Task 7, which proposes using different methods and models for sound synthesis in different scene events. For dog bark and sneeze/cough, a non-autoregressive model based on conditional Bit Diffusion was used for sound synthesis. For the other five types of sounds, an autoregressive model based on PixelSNAIL was used.

System characteristics
System input sound event label
Machine learning method VQ-VAE, PixelSNAIL, Bit-diffusion
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Subsystem count 2
System complexity 112857385 parameters
PDF

From Noise To Sound: Audio Synthesis Via Diffusion Models

Haojie Zhang, Kun Qian, Lin Shen, Lujundong Li, Kele Xu, Bin Hu
Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education (Beijing Institute of Technology), P. R. China, School of Medical Technology, Beijing Institute of Technology, P. R. China, National University of Defense Technology, P. R. China

Abstract

In this technical report, we describe our submission system for DCASE2023 Task 7: Foley Sound Synthesis (Track B). A Sound Pixelate Diffuse model is proposed to realize foley sound synthesis. The model includes data format conversion and audio synthesis through the diffusion model. The synthesized audio is evaluated on the DCASE2023 Task 7 evaluation set with FAD, and the best FAD score across all categories is 8.429.
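
The system characteristics below list "wavelet domain denoise" as data augmentation. A generic soft-thresholding sketch of wavelet-domain denoising with PyWavelets, not the authors' exact recipe:

```python
import numpy as np
import pywt

def wavelet_denoise(x: np.ndarray, wavelet: str = "db8", level: int = 4) -> np.ndarray:
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # estimate the noise level from the finest-scale detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)  # may be one sample longer for odd inputs
```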

System characteristics
System input sound
Machine learning method diffusion model
Acoustic features spectrogram
Data augmentation wavelet domain denoise
System complexity 113668609, 113668609, 113668609, 113668609 parameters
PDF

Class-Conditioned Latent Diffusion Model For DCASE 2023 Foley Sound Synthesis Challenge

Robin Scheibler, Takuya Hasumi, Yusuke Fujita, Tatsuya Komatsu, Ryuichi Yamamoto, Kentaro Tachibana
LINE Corporation, Tokyo, Japan

Abstract

This report describes our submission to the DCASE 2023 Task 7: Foley sound synthesis challenge. We use a latent diffusion model (LDM) that generates a latent representation of audio conditioned on a specified audio class, a variational autoencoder that converts the latent representation to a mel-spectrogram, and a universal neural vocoder based on HiFi-GAN that reconstructs a natural waveform from the mel-spectrogram. We trained the LDM on the development set, with its audio class indices as conditioners for generating class-specific latent representations.
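
At inference the three stages compose directly; a minimal sketch, with all module names as hypothetical placeholders for the components the abstract describes:

```python
import torch

@torch.no_grad()
def synthesize(ldm_sample, vae_decode, vocoder, class_idx: int) -> torch.Tensor:
    latent = ldm_sample(cond=class_idx)  # class-conditional latent from the LDM
    mel = vae_decode(latent)             # latent -> mel-spectrogram
    return vocoder(mel)                  # mel -> waveform (HiFi-GAN-based)
```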

System characteristics
System input sound event label
Machine learning method VQ-VAE,diffusion model
Phase reconstruction method HiFi-GAN
Acoustic features log-mel spectrogram
System complexity 977116210 parameters
PDF

Audio Diffusion For Foley Sound Synthesis

Timo Wendner, Patricia Hu, Tara Jadidi, Alexander Neuhauser
Johannes Kepler University, Linz, Austria

Abstract

This technical report describes our approach for Task 7 (Foley Sound Synthesis), Track B (using no external resources other than the ones provided) of the DCASE2023 Challenge. This work was carried out by a student group as part of an elective course in the Artificial Intelligence curriculum at Johannes Kepler University Linz. We use an ensemble of U-Net-based diffusion models for waveform generation in seven predefined sound categories. We apply gain reduction to normalize and time shifting to augment the provided training data, and we test different noise schedulers and U-Net architectures. Applying different training strategies, we achieve competitive results for the majority of the sound classes while being more parameter-efficient and allowing end-to-end generation of audio waveforms. Evaluated on the task's evaluation metric, i.e., the mean FAD score over all classes, we achieve a final score of 12.42, compared to the challenge baseline model's 9.68.
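
The two preprocessing steps named above amount to something like the following sketch (parameters illustrative):

```python
import numpy as np

def normalize_and_shift(wav: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    wav = wav / max(1e-8, float(np.abs(wav).max()))  # gain reduction / peak-normalize
    return np.roll(wav, rng.integers(0, len(wav)))   # random circular time shift
```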

System characteristics
System input sound event label
Machine learning method diffusion model,ensemble
Data augmentation gain reduction, time shifting
Subsystem count 7
System complexity 7167405 parameters
PDF

The X-LANCE System For DCASE2023 Challenge Task 7: Foley Sound Synthesis Track B

Zeyu Xie, Xuenan Xu, Baihan Li, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence X-LANCE Lab, Department of Computer Science and Engineering AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

This report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 challenge Task 7: Foley sound synthesis, Track B. We first train a VQ-VAE model to learn a discrete representation of the audio spectrogram. Then an autoregressive model is trained to predict discrete tokens based on input conditions. Finally, a trained vocoder converts the generated spectrogram into a waveform, where the spectrogram is restored from the predicted tokens by the VQ-VAE decoder. To achieve higher accuracy, fidelity and diversity, we introduce several training schemes, including (1) a discriminator model to filter audio; (2) the mixup method for data augmentation; (3) clustering methods for better training. Our best system achieved an FAD score of 6.99 averaged over all categories.
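
The three-stage pipeline (token prior, VQ-VAE decoder, vocoder) can be sketched as below; the module names and sampling loop are illustrative, not the authors' code:

```python
import torch

@torch.no_grad()
def generate(prior, vq_decoder, vocoder, class_token: int, n_tokens: int):
    # 1) autoregressively sample discrete spectrogram tokens from the prior
    tokens = [class_token]
    for _ in range(n_tokens):
        logits = prior(torch.tensor([tokens]))[0, -1]
        tokens.append(torch.multinomial(logits.softmax(dim=-1), 1).item())
    # 2) decode the tokens to a spectrogram, 3) vocode it to a waveform
    spec = vq_decoder(torch.tensor([tokens[1:]]))
    return vocoder(spec)
```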

System characteristics
System input sound event label
Machine learning method Transformer, TransformerDecoder, TransformerEncoder Discriminator, VQ-VAE
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Data augmentation mixup
Subsystem count 3
System complexity 28224194, 40843458, 44037827, 44037827 parameters
PDF

Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang
University of Surrey, Guildford, United Kingdom

Abstract

Foley sound generation aims to synthesise the background sound for multimedia content, which involves computationally modelling sound effects with specialized techniques. In this work, we propose a diffusion-based generative model for DCASE 2023 challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data scarcity of the Task 7 training set, our model is initially trained with large-scale datasets and then adapted to this DCASE task via transfer learning. We have observed that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by enriching the input label with related text embedding features obtained from a contrastive language-audio pretraining (CLAP) model. In addition, we utilize a filtering strategy to further refine the output, i.e. by selecting the best results from the generated candidate clips in terms of the similarity score between the sound and the target label. The overall system achieves a Fréchet audio distance (FAD) score of 4.765 on average over all seven classes, outperforming the baseline system, which achieves an FAD score of 9.7.
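
The candidate-filtering step can be sketched as ranking generated clips by embedding similarity to the target label, here with precomputed CLAP-style embeddings (names hypothetical):

```python
import numpy as np

def best_candidate(clip_embs: np.ndarray, label_emb: np.ndarray) -> int:
    # cosine similarity between each candidate clip and the label text embedding
    sims = clip_embs @ label_emb / (
        np.linalg.norm(clip_embs, axis=1) * np.linalg.norm(label_emb) + 1e-8)
    return int(np.argmax(sims))
```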

System characteristics
System input sound event label
Machine learning method VQ-VAE,diffusion model
Phase reconstruction method HiFi-GAN
Acoustic features spectrogram
Subsystem count 2
System complexity 1173847474 parameters
PDF