Foley Sound Synthesis


Task description

Coordinators

Keunwoo Choi

Gaudio Lab, Inc.

Jaekwon Im

Gaudio Lab, Inc. / Korea Advanced Institute of Science & Technology (KAIST)

Laurie Heller

Carnegie Mellon University

Brian McFee

New York University

Keisuke Imoto

Doshisha University

Yuki Okamoto

Ritsumeikan University

Mathieu Lagrange

CNRS, Ecole Centrale Nantes, Nantes Université

Shinnosuke Takamichi

The University of Tokyo

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.
We have released the wav files used for the evaluation experiment on Zenodo: https://zenodo.org/record/8091972.

Summary of task

Before Submission

  • Participants build Foley sound synthesis models that generate a variety of sounds for each of the 7 sound classes we defined (see the interface sketch after this list).
    • class index --> [Model] --> 100 sounds (4-second / mono / 16-bit / 22,050 Hz).
    • class: {"DogBark": 0, "Footstep": 1, "GunShot": 2, "Keyboard": 3, "MovingMotorVehicle": 4, "Rain": 5, "Sneeze_Cough": 6}
  • On Track A, participants can use external data and models. On Track B, no external resources are allowed. One can submit to either track or both.
  • An entry consists of two items:
    • A YAML file in a pre-defined format that includes the URL of a Colab notebook (which also follows a pre-defined format)
    • A PDF: the technical report
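As a concrete illustration of the expected input/output contract, here is a minimal sketch of a generation script. The class mapping and output format come from the task description above; the model and all function names are placeholders, not the required implementation.

```python
import numpy as np
import soundfile as sf

# Class mapping and output format from the task description; the model
# below is a random-noise placeholder standing in for a real system.
CLASSES = {"DogBark": 0, "Footstep": 1, "GunShot": 2, "Keyboard": 3,
           "MovingMotorVehicle": 4, "Rain": 5, "Sneeze_Cough": 6}
SR, N_SAMPLES, N_PER_CLASS = 22050, 4 * 22050, 100

def generate(class_index: int) -> np.ndarray:
    """Placeholder model: returns a 4-second mono waveform in [-1, 1]."""
    return np.random.uniform(-1.0, 1.0, N_SAMPLES).astype(np.float32)

for name, idx in CLASSES.items():
    for i in range(N_PER_CLASS):
        # PCM_16 yields the required 16-bit / mono / 22,050 Hz wav files.
        sf.write(f"{name}_{i:03d}.wav", generate(idx), SR, subtype="PCM_16")
```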

After Submission

  • Submissions will be evaluated first by Frechet Audio Distance (FAD), then by a subjective test.
  • Participants are required to contribute to the subjective tests.
    • One evaluator per entry / about 2-3 hours of listening tests / during the week of May 22-27.
  • FAD selects the four finalists; the subjective test then decides the final ranking among them (at that stage, the FAD score is ignored).

Description

Foley sound, in general, refers to sound effects that are created to convey (and sometimes enhance) the sounds produced by events occurring in a narrative (e.g. radio or film). Foley sounds are commonly added to multimedia to enhance the perceptual audio experience. This sound synthesis challenge requires the generation of original audio clips that represent a category of sound, such as footsteps.

The new sounds should fit into the category that is typified by the set of sounds in the development set, yet they should not duplicate any of the provided sounds.
Figure 1: Overview of the Foley sound synthesis system.


Why is it an important task?

First, obtaining a perfectly matched sound effect normally requires time-consuming post-production. By generating sounds that belong to a target category, Foley sound synthesis can make this workflow far more time- and cost-effective. With the rise of virtual environments such as the metaverse, we expect a growing need for the automated generation of increasingly complex and creative sound environments. Second, it can be used to synthesize or augment datasets for a wide variety of DCASE tasks, including sound event detection (SED). SED has drawn great attention, and synthesized datasets are already in use, e.g., the URBAN-SED dataset. A high-quality Foley sound synthesis model could therefore lead to the development of better SED models.

Task setup

This task consists of generating sounds from two conditions: the class ID number and the number of sounds to be generated. Both conditions are integers. Each generated audio clip should be 4 seconds long.

The challenge has two subproblems: development of models with external resources (Track A) and without external resources (Track B). Participants may submit to either track or both, and each track is evaluated independently.

Audio dataset

Development Set

The development set is composed of audio samples from 3 datasets: UrbanSound8K, FSD50K, and BBC Sound Effects. Participants are not allowed to use these 3 datasets as external resources. The development set consists of 4,850 labeled sound excerpts divided into 7 classes. All audio was converted to mono 16-bit 22,050 Hz and, for consistency across the challenge, zero-padded or segmented to a length of 4 seconds. The 7 class IDs and the corresponding categories are as follows.


| Class ID | Category | Number of files |
|----------|----------|-----------------|
| 0 | DogBark | 617 |
| 1 | Footstep | 703 |
| 2 | GunShot | 777 |
| 3 | Keyboard | 800 |
| 4 | MovingMotorVehicle | 581 |
| 5 | Rain | 741 |
| 6 | Sneeze/Cough | 631 |
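For reference, the padding/segmentation described above can be reproduced roughly as follows. This is an illustrative sketch; the organizers' exact segmentation policy may differ.

```python
import numpy as np
import librosa

TARGET_SR = 22050
TARGET_LEN = 4 * TARGET_SR  # 4 seconds

def to_four_seconds(path: str) -> np.ndarray:
    """Load audio as mono 22,050 Hz and zero-pad or truncate to 4 s."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))  # zero-pad short clips
    return y[:TARGET_LEN]                        # truncate long clips
```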

Evaluation Set

The evaluation dataset has the same classes as the development dataset, and its sample rate, number of channels, and bit depth match those of the development dataset audio. Each clip is 4 seconds long, with 100 clips per category. This dataset will be released after the end of the challenge.

Download

Development dataset (4.0 GB)
version 2.0


A portion of the sounds in the dataset were kindly provided by permission of the BBC, under the condition that they are used for this DCASE challenge and for research purposes only. You can check the original source of each sound in DevMeta.csv and EvalMeta.csv, located in DCASE_2023_Challenge_Task_7_Dataset.

List of external data resources allowed (Track A):

For Track A, you can use external resources (except FSD50K and UrbanSound8K), whether they are public or not. For a list of audio datasets, you can refer to the LAION-AI audio dataset list.

List of external data resources allowed (Track B):

| Dataset / Model name | Type | Added | Link |
|----------------------|------|-------|------|
| DCASE2023 Task7 Development Set | dataset (audio) | 2023.03.09 | https://dcase.community/challenge2023/task-foley-sound-synthesis |
| DCASE2023 Task7 Baseline - Pre-Trained HiFi-GAN | model (vocoder) | 2023.03.09 | https://github.com/DCASE2023-Task7-Foley-Sound-Synthesis/dcase2023_task7_baseline |


Task rules

Track A

  • Participants are allowed to use external resources, including public and private datasets and pre-trained models, except that no sounds from FSD50K or UrbanSound8K may be used (for model training, etc.), because those datasets were used to create the evaluation set.
  • Participants are not allowed to submit systems that duplicate any pre-existing sounds (on the web or in private databases), nor systems that simply apply alterations or audio effects to pre-existing sounds. The submitted sounds must be original and must be generated by the code supplied in the Colab notebook.
  • When submitting a technical report, participants are required to specify all the external resources they used.
  • Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
  • Participants are allowed to submit an ensemble system having multiple systems for each sound class.

Track B

  • Participants are NOT allowed to use external resources.
  • Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
  • Participants are allowed to submit an ensemble system having multiple systems for each sound class.

Submission

A submission consists of the following items.

  • Metadata file (*.yaml) with the pre-defined format
  • Technical report (*.pdf)

The audio files will be generated by the organizers using the Colab notebook. The duration, sample rate, bit depth, and number of channels of the generated audio must follow those of the development set (4 seconds, 16-bit, 22,050 Hz, mono), and 100 audio files must be generated per class. Participants need to declare in the metadata file which track the submission is for. The technical report should describe the submission, including its implementation details.

The confidentiality statement for the submitted Colab code and model checkpoint is included in the submission package.

Evaluation

Submissions will be evaluated by Frechet Audio Distance (FAD), followed by a subjective test.

Evaluation Metric

FAD is a reference-free evaluation metric based on the Frechet Inception Distance (FID), a metric widely used for generative models. Because FAD is computed from sets of hidden representations of generated and real samples, it can be used even when no ground-truth reference audio exists. FAD is calculated in the following sequence. First, audio representations are extracted from both the evaluation dataset and the generated samples using VGGish, a classification model trained on AudioSet. Second, each set of representations is fitted with a multivariate Gaussian distribution. Finally, the FAD score of the generated samples is the Frechet distance between these two distributions:

\(F ({\mathcal N}_{r}, {\mathcal N}_{g}) = \| \mu_{r} - \mu_{g} \|^{2} + \mathrm{tr} ( \Sigma_{r} + \Sigma_{g} - 2 \sqrt{\Sigma_{r} \Sigma_{g}} )\)

where \({\mathcal N}_{r} (\mu_{r},\Sigma_{r})\) and \({\mathcal N}_{g} (\mu_{g},\Sigma_{g})\) are the multivariate Gaussians of the VGGish embeddings of the evaluation set and the generated samples, respectively.
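For illustration, the distance above can be computed with NumPy/SciPy once the VGGish embeddings are available. This is a minimal sketch, not the official evaluation code:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_eval: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two sets of
    (n_samples, dim) VGGish embeddings."""
    mu_r, mu_g = emb_eval.mean(axis=0), emb_gen.mean(axis=0)
    sigma_r = np.cov(emb_eval, rowvar=False)
    sigma_g = np.cov(emb_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```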

Subjective Test

Complementary to the quantitative evaluation, we will evaluate performance through a subjective test. The four submissions with the best (lowest) FAD scores will be evaluated. The generated audio excerpts for the subjective test are randomly sampled from the submitted data. Each challenge submission is evaluated by the organizers and one author from each of the other challenge submissions; the test follows an incomplete design in which samples generated by one team's system are never evaluated by anyone on that team. The listening test is conducted separately for each category. Evaluators listen to generated audio samples from the challenge submissions alongside real samples and samples generated by the baseline model. Audio fidelity, category suitability, and diversity are rated, and the three ratings are combined as a weighted average with an audio quality : category fit : diversity ratio of 2:2:1. One evaluator from each entry must be available to spend 2-3 hours doing listening tests during the week of May 22-27.
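The 2:2:1 weighting amounts to the following computation; plugging in the per-metric averages of the Track A winner from the results tables below reproduces its overall score:

```python
def weighted_score(audio_quality: float, category_fit: float,
                   diversity: float) -> float:
    # audio quality : category fit : diversity = 2 : 2 : 1
    return (2 * audio_quality + 2 * category_fit + diversity) / 5

# Track A winner (Yi_SURREY_task7_trackA_1) per-metric averages:
print(round(weighted_score(6.723, 7.578, 6.679), 3))  # 7.056
```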

Ranking

The final ranking is determined by the FAD score and the subjective tests: the four submissions with the best (lowest) FAD scores are selected as finalists, and the finalists are then ranked by the subjective tests alone, regardless of their FAD scores.

Results

Track A

All scores are MOS ratings on a 10-point scale. The overall score is the weighted average of audio quality, category fit, and diversity (2:2:1); rows are ordered by official rank. No diversity ratings were reported for the baseline.

Weighted average score

| Rank | Submission (technical report) | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|-------------------------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Yi_SURREY_task7_trackA_1 (YiSURREY2023) | 7.056 | 7.742 | 6.466 | 6.189 | 7.433 | 7.448 | 6.441 | 7.675 |
| 2 | Chon_Gaudio_task7_trackA_1 (ChonGLI2023) | 6.967 | 7.984 | 6.865 | 7.255 | 6.989 | 6.881 | 6.243 | 6.553 |
| 3 | Scheibler_LINE_task7_trackA_1 (ScheiblerLINE2023) | 6.887 | 7.333 | 6.832 | 7.317 | 7.199 | 6.474 | 5.222 | 7.834 |
| 4 | Guan_HEU_task7_trackA_2 (GuanHEU2023) | 5.157 | 4.877 | 4.450 | 6.413 | 5.479 | 5.822 | 5.201 | 3.856 |
| 6 | DCASE2023_baseline_task7 (DCASE2023baseline2023) | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 |

Audio quality (MOS, 10 steps)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Yi_SURREY_task7_trackA_1 | 6.723 | 7.309 | 6.143 | 5.532 | 7.243 | 7.315 | 6.067 | 7.454 |
| 2 | Chon_Gaudio_task7_trackA_1 | 6.657 | 7.612 | 6.455 | 6.814 | 6.814 | 6.446 | 5.928 | 6.528 |
| 3 | Scheibler_LINE_task7_trackA_1 | 6.355 | 6.479 | 6.263 | 6.771 | 6.886 | 6.131 | 4.780 | 7.180 |
| 4 | Guan_HEU_task7_trackA_2 | 4.670 | 3.800 | 4.164 | 5.800 | 5.339 | 5.365 | 4.972 | 3.250 |
| 6 | DCASE2023_baseline_task7 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 |

Category fit (MOS, 10 steps)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Yi_SURREY_task7_trackA_1 | 7.578 | 8.297 | 6.646 | 6.689 | 8.089 | 8.181 | 6.911 | 8.233 |
| 2 | Chon_Gaudio_task7_trackA_1 | 7.154 | 8.223 | 7.082 | 7.573 | 7.157 | 7.131 | 6.306 | 6.606 |
| 3 | Scheibler_LINE_task7_trackA_1 | 7.327 | 7.479 | 7.192 | 7.896 | 7.861 | 7.054 | 5.150 | 8.655 |
| 4 | Guan_HEU_task7_trackA_2 | 5.293 | 5.142 | 3.836 | 7.482 | 5.232 | 6.315 | 5.656 | 3.389 |
| 6 | DCASE2023_baseline_task7 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 |

Diversity (MOS, 10 steps, weighted 0.5)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Yi_SURREY_task7_trackA_1 | 6.679 | 7.500 | 6.750 | 6.500 | 6.500 | 6.250 | 6.250 | 7.000 |
| 2 | Chon_Gaudio_task7_trackA_1 | 7.214 | 8.250 | 7.250 | 7.500 | 7.000 | 7.250 | 6.750 | 6.500 |
| 3 | Scheibler_LINE_task7_trackA_1 | 7.071 | 8.750 | 7.250 | 7.250 | 6.500 | 6.000 | 6.250 | 7.500 |
| 4 | Guan_HEU_task7_trackA_2 | 5.857 | 6.500 | 6.250 | 5.500 | 6.250 | 5.750 | 4.750 | 6.000 |


Track B

If a team submitted multiple systems, only the team's system with the best (lowest) FAD score was perceptually evaluated.

All scores are MOS ratings on a 10-point scale. The overall score is the weighted average of audio quality, category fit, and diversity (2:2:1); rows are ordered by official rank. No diversity ratings were reported for the baseline.

Weighted average score

| Rank | Submission (technical report) | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|-------------------------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Chang_HYU_task7_trackB_1 (ChangHYU2023) | 6.515 | 5.659 | 7.111 | 6.557 | 7.384 | 6.155 | 7.042 | 5.699 |
| 2 | Jung_KT_task7_trackB_2 (JungKT2023) | 5.534 | 5.321 | 5.033 | 6.022 | 5.614 | 6.021 | 5.902 | 4.826 |
| 3 | Kamath_NUS_task7_trackB_2 (KamathNUS2023) | 4.647 | 4.807 | 4.073 | 5.010 | 4.276 | 5.248 | 4.013 | 5.102 |
| 4 | Lee_MARG_task7_trackB_1 (LeeMARG2023) | 4.427 | 3.273 | 4.843 | 3.941 | 5.409 | 4.942 | 4.210 | 4.374 |
| 18 | DCASE2023_baseline_task7 (DCASE2023baseline2023) | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 |

Audio quality (MOS, 10 steps)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Chang_HYU_task7_trackB_1 | 6.085 | 4.882 | 6.738 | 5.879 | 7.296 | 6.069 | 6.860 | 4.873 |
| 2 | Jung_KT_task7_trackB_2 | 5.082 | 4.432 | 4.933 | 5.579 | 5.139 | 5.623 | 5.600 | 4.270 |
| 3 | Kamath_NUS_task7_trackB_2 | 3.988 | 3.789 | 3.554 | 4.346 | 3.911 | 4.642 | 3.378 | 4.295 |
| 4 | Lee_MARG_task7_trackB_1 | 3.929 | 2.530 | 4.204 | 3.531 | 5.311 | 4.542 | 3.735 | 3.650 |
| 18 | DCASE2023_baseline_task7 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 |

Category fit (MOS, 10 steps)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Chang_HYU_task7_trackB_1 | 6.845 | 6.014 | 7.288 | 7.013 | 7.789 | 6.442 | 7.370 | 6.000 |
| 2 | Jung_KT_task7_trackB_2 | 5.610 | 5.371 | 4.775 | 6.100 | 5.646 | 6.554 | 5.906 | 4.920 |
| 3 | Kamath_NUS_task7_trackB_2 | 4.612 | 4.979 | 3.629 | 5.054 | 4.029 | 5.727 | 3.406 | 5.459 |
| 4 | Lee_MARG_task7_trackB_1 | 4.443 | 3.153 | 4.654 | 3.696 | 5.336 | 5.312 | 4.040 | 4.910 |
| 18 | DCASE2023_baseline_task7 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 |

Diversity (MOS, 10 steps, weighted 0.5)

| Rank | Submission | Avg | DogBark | Footstep | GunShot | Keyboard | MovingMotorVehicle | Rain | Sneeze/Cough |
|------|------------|-----|---------|----------|---------|----------|--------------------|------|--------------|
| 1 | Chang_HYU_task7_trackB_1 | 6.714 | 6.500 | 7.500 | 7.000 | 6.750 | 5.750 | 6.750 | 6.750 |
| 2 | Jung_KT_task7_trackB_2 | 6.286 | 7.000 | 5.750 | 6.750 | 6.500 | 5.750 | 6.500 | 5.750 |
| 3 | Kamath_NUS_task7_trackB_2 | 6.036 | 6.500 | 6.000 | 6.250 | 5.500 | 5.500 | 6.500 | 6.000 |
| 4 | Lee_MARG_task7_trackB_1 | 5.393 | 5.000 | 6.500 | 5.250 | 5.750 | 5.000 | 5.500 | 4.750 |


Complete results and technical reports can be found on the results page.

Baseline system

The task organizers provide a baseline system consisting of a VQ-VAE, PixelSNAIL, and HiFi-GAN. We chose a cascaded model, rather than an end-to-end one, for its simplicity, explainability, and efficiency.
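Schematically, generation in such a cascade proceeds as below. The module interfaces here are hypothetical; see the GitHub repository below for the actual implementation.

```python
# Hypothetical interfaces for the cascaded baseline: PixelSNAIL samples
# class-conditioned discrete codes, the VQ-VAE decoder maps them to a
# mel spectrogram, and HiFi-GAN renders the waveform.
def synthesize(class_index: int, n_sounds: int, vqvae, pixelsnail, hifigan):
    waveforms = []
    for _ in range(n_sounds):
        codes = pixelsnail.sample(condition=class_index)  # discrete code map
        mel = vqvae.decode(codes)                         # 80-band mel spectrogram
        waveforms.append(hifigan(mel))                    # 22,050 Hz waveform
    return waveforms
```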

Repository


Detailed information can be found in the GitHub repository. The code of the baseline is largely adapted from this repository. If you use this code, please cite the task description paper (see the Citation section below) and the following paper:

Publication

Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang. Conditional sound generation using neural discrete time-frequency representation learning. 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2021.


Parameters

  • Audio features:
    • Log mel-band energies (80 bands), 22,050 Hz sampling frequency, frame length 1024 points with 25% hop size (see the feature-extraction sketch after this list)
  • Neural network:
    • Input shape: 80 × 344 (4 seconds)
    • VQ-VAE encoder
      • Strided convolutional layer #1: 4 × 4 conv, ReLU, 4 × 4 conv, ReLU, 3 × 3 conv
      • Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
      • Strided convolutional layer #2: 2 × 2 conv, ReLU, 2 × 2 conv, ReLU, 3 × 3 conv
      • Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
      • Strided convolutional layer #3: 6 × 6 conv, ReLU, 6 × 6 conv, ReLU, 3 × 3 conv
      • Residual block #3: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
      • Strided convolutional layer #4: 8 × 8 conv, ReLU, 8 × 8 conv, ReLU, 3 × 3 conv
      • Residual block #4: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    • VQ-VAE decoder
      • 3 × 3 conv
      • Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
      • Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
      • ReLU
      • Transposed convolutional block: 4 × 4 transposed conv, ReLU, 4 × 4 transposed conv
    • PixelSNAIL
    • HiFi-GAN
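A feature extractor matching the listed parameters might look like the following sketch. A 25% hop of a 1024-point frame is 256 samples; the exact padding and windowing of the baseline may differ slightly, which is why the frame count for a 4-second clip can come out as 344 or 345.

```python
import numpy as np
import librosa

def log_mel(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    """80-band log mel energies, 1024-point frames, 256-sample (25%) hop."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    return np.log(np.maximum(mel, 1e-5))  # floor to avoid log(0)
```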

FAD results for the development dataset

We evaluated the FAD score on the development dataset. In each category, we used 100 ground-truth (natural) sounds and 100 synthesized sounds. Participants can use this repo to compute the FAD score between the sounds generated by their system and those of the evaluation set.



| Class ID | Category | FAD |
|----------|----------|-----|
| 0 | DogBark | 13.614 |
| 1 | Footstep | 6.826 |
| 2 | GunShot | 6.152 |
| 3 | Keyboard | 5.065 |
| 4 | MovingMotorVehicle | 11.239 |
| 5 | Rain | 14.449 |
| 6 | Sneeze/Cough | 3.563 |

Citation

If you are participating in this task or using the baseline code, please cite the following task description paper.

Publication

Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi. Foley Sound Synthesis at the DCASE 2023 Challenge. arXiv e-prints: arXiv:2304.12521, 2023.
