# Foley Sound Synthesis

### Coordinators

• Keunwoo Choi (Gaudio Lab, Inc.)
• Jaekwon Im (Gaudio Lab, Inc. / Korea Advanced Institute of Science & Technology (KAIST))
• Laurie Heller (Carnegie Mellon University)
• Brian McFee (New York University)
• Keisuke Imoto (Doshisha University)
• Yuki Okamoto (Ritsumeikan University)
• Mathieu Lagrange (CNRS, Ecole Centrale Nantes, Nantes Université)
• Shinnosuke Takamichi (The University of Tokyo)

## Before Submission

• Participants build Foley sound synthesis models that generate varied sounds for the 7 pre-defined sound classes.
• class index --> [Model] --> 100 sounds (4-second / mono / 16-bit / 22,050 Hz).
• class:{"DogBark": 0, "Footstep": 1, "GunShot": 2, "Keyboard": 3, "MovingMotorVehicle": 4, "Rain": 5, "Sneeze_Cough": 6}
• In Track A, participants can use external data and pre-trained models from the allowed resource list. In Track B, no external resources are allowed. One can submit to either or both tracks.
• An entry consists of two items:
• A yaml file in a pre-defined format that includes the URL of a Colab notebook (which also follows a pre-defined format)
• A pdf file containing the technical report

## After Submission

• Submissions will be evaluated first by Frechet Audio Distance (FAD), then by a subjective test.
• Participants are required to contribute to the subjective tests.
• One evaluator per entry / About 2-3 hours of listening test / During the week of May 20-26.
• FAD selects eight finalists. The subjective test then decides the final ranking among them (FAD scores are ignored at this stage).

# Description

Foley sound, in general, refers to sound effects that are created to convey (and sometimes enhance) the sounds produced by events occurring in a narrative (e.g. radio or film). Foley sounds are commonly added to multimedia to enhance the perceptual audio experience. This sound synthesis challenge requires the generation of original audio clips that represent a category of sound, such as footsteps.

The new sounds should fit into the category that is typified by the set of sounds in the development set, yet they should not duplicate any of the provided sounds.

## Why is it an important task?

First, obtaining a perfectly matched sound effect traditionally requires time-consuming post-production. By generating sound that belongs to a target sound category, Foley sound synthesis can make the workflow much more time- and cost-effective. With the rise of virtual environments such as the metaverse, we expect a growing need for the automated generation of ever more complex and creative sound environments. Second, it can be utilized for dataset synthesis or augmentation for a wide variety of DCASE tasks, including sound event detection (SED). SED has drawn great attention, and synthesized datasets have already been used, e.g., the URBAN-SED dataset. A high-quality Foley sound synthesis model could lead to the development of better SED models.

This task consists of generating sounds from two conditions. The first condition is the class ID number, and the second is the number of sounds to be generated. Both conditions are integers. The length of each generated audio clip should be 4 seconds.
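The two-integer conditioning interface described above can be sketched as follows. This is a minimal placeholder, not the baseline: the function name `generate` and the silence-returning body are illustrative assumptions; only the interface (class ID and count in, fixed-length mono clips out) comes from the task description.

```python
import numpy as np

SAMPLE_RATE = 22_050
DURATION_S = 4
N_SAMPLES = SAMPLE_RATE * DURATION_S  # 88,200 samples per 4-second clip

def generate(class_id: int, n_sounds: int) -> np.ndarray:
    """Placeholder generator returning `n_sounds` clips of silence.

    A real submission replaces this body with its synthesis model;
    the task only prescribes the two integer conditions and the
    fixed-length mono output.
    """
    assert 0 <= class_id <= 6, "7 classes, indexed 0-6"
    return np.zeros((n_sounds, N_SAMPLES), dtype=np.float32)
```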

The challenge has two tracks: development of models with external resources (Track A) and without external resources (Track B). Participants may submit a system to either or both tracks, and each track is evaluated independently.

# Audio dataset

## Development Set

The development set is composed of audio samples from 3 datasets: UrbanSound8K, FSD50K, and BBC Sound Effects. Participants are not allowed to use these 3 datasets as external resources. The development set consists of 4,850 labeled sound excerpts, which are divided into 7 classes. All audio was converted to mono 16-bit 22,050 Hz audio. In addition, for consistency of the challenge, it was zero-padded or segmented to fit a length of 4 seconds. The 7 class IDs and the corresponding classes of audio are as follows.

| Class ID | Category | Number of files |
|----------|----------|-----------------|
| 0 | DogBark | 617 |
| 1 | Footstep | 703 |
| 2 | GunShot | 777 |
| 3 | Keyboard | 800 |
| 4 | MovingMotorVehicle | 581 |
| 5 | Rain | 741 |
| 6 | Sneeze/Cough | 631 |
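The zero-padding/segmentation step described above (fitting every excerpt to exactly 4 seconds at 22,050 Hz) can be sketched in a few lines. The function name `fit_to_4s` is an illustrative assumption; the pad/truncate behavior follows the dataset description.

```python
import numpy as np

SAMPLE_RATE = 22_050
TARGET_LEN = SAMPLE_RATE * 4  # exactly 4 seconds

def fit_to_4s(audio: np.ndarray) -> np.ndarray:
    """Zero-pad short clips and truncate long ones to exactly 4 s."""
    if len(audio) < TARGET_LEN:
        return np.pad(audio, (0, TARGET_LEN - len(audio)))
    return audio[:TARGET_LEN]
```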

## Evaluation Set

The evaluation dataset has the same classes as the development dataset. The sample rate, the number of channels, and the bit depth of audio correspond to those of audio in the development dataset. The length of each audio is 4 seconds, and there are 100 audio samples per category. This dataset will be released after the end of the challenge.

## List of external data resources allowed (Track A):

• AudioCaps dataset (audio), added 2023.03.01: https://audiocaps.github.io/
• Clotho v2 dataset (audio), added 2023.03.01: https://zenodo.org/record/4783391
• DCASE2023 Task7 Baseline Pre-Trained HiFi-GAN model (vocoder), added 2023.03.01: https://github.com/jik876/hifi-gan
• Official Pre-Trained MelGAN model (vocoder), added 2023.03.01: https://github.com/descriptinc/melgan-neurips

## Track A

• Participants are allowed to use external resources in the list of datasets and pre-trained models provided by the task organizer.
• Participants can request the organizer to add public models or datasets (the corresponding models or datasets must be publicly available).
• When submitting a technical report, participants are required to specify all the external resources they used.
• Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
• Participants are allowed to submit an ensemble system having multiple systems for each sound class.

## Track B

• Participants are NOT allowed to use external resources.
• Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
• Participants are allowed to submit an ensemble system having multiple systems for each sound class.

# Submission

A submission consists of the following items.

• Metadata file (*.yaml) with the pre-defined format
• The URL of the Colab notebook is included here. (A Colab template will be released soon.)
• Technical report (*.pdf)

Audio files: the duration, sample rate, bit depth, and number of channels must match those of the development set (4 seconds, 16-bit, 22,050 Hz, mono). The audio files should be generated using the Colab notebook. The number of audio files generated per class should be 90. Participants need to declare in the metadata file which track the submission is for. The technical report should contain a description of the submission, including implementation details.
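Before submitting, it is worth verifying that every generated WAV file meets the required format. A minimal checker using only the standard-library `wave` module might look like this (the function name `check_clip` is an illustrative assumption):

```python
import wave

def check_clip(path: str) -> None:
    """Raise AssertionError if a WAV file violates the submission format
    (4 seconds, 16-bit, 22,050 Hz, mono)."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "must be mono"
        assert w.getsampwidth() == 2, "must be 16-bit (2 bytes/sample)"
        assert w.getframerate() == 22_050, "must be 22,050 Hz"
        assert w.getnframes() == 22_050 * 4, "must be exactly 4 seconds"
```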

# Evaluation

Submissions will be evaluated by Frechet Audio Distance (FAD), followed by a subjective test.

## Evaluation Metric

FAD is a reference-free evaluation metric based on the Frechet inception distance (FID), a metric widely used for generative models. Because FAD is calculated from sets of hidden representations of generated and real samples, it can be used even when ground-truth reference audio does not exist. FAD is calculated in the following sequence. First, audio representations are extracted from both the evaluation set and the generated samples, using VGGish, a classification model trained on AudioSet. Second, each set of representations is fitted to a multivariate Gaussian distribution. Finally, the FAD score of the generated samples is the Frechet distance between these two distributions:

$$F ({\mathcal N}_{r}, {\mathcal N}_{g}) = \| \mu_{r} - \mu_{g} \|^{2} + \mathrm{tr} ( \Sigma_{r} + \Sigma_{g} - 2 \sqrt{\Sigma_{r} \Sigma_{g}} )$$

where $${\mathcal N}_{r} (\mu_{r},\Sigma_{r})$$ and $${\mathcal N}_{g} (\mu_{g},\Sigma_{g})$$ are multivariate Gaussians fitted to the VGGish embeddings of the evaluation set and the generated samples, respectively.
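Given two sets of embeddings, the formula above can be computed directly with NumPy and SciPy. This is an illustrative sketch, not the official scoring code (the function name `frechet_audio_distance` is an assumption); the official implementation linked from the challenge page should be used for comparable scores.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between two sets of (VGGish-style) embeddings of shape (N, D)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # matrix square root of the covariance product
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Note that lower is better: identical distributions give a FAD of zero.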

## Subjective Test

Complementary to the quantitative evaluation, we will evaluate performance through a subjective test. Generated audio excerpts for the subjective test are randomly sampled from the submitted data. Each challenge submission is evaluated by the organizers and by one author from each of the other challenge submissions. The test follows an incomplete strategy: samples generated by a team's submitted algorithm will not be evaluated by any member of that team. The listening test will be conducted separately for each category. Evaluators will listen to generated audio samples from challenge submissions, alongside real samples and audio samples generated by the baseline model. Both audio fidelity and category suitability will be evaluated. One evaluator from each entry must be available to spend 2-3 hours doing listening tests during the week of May 20-26.

## Ranking

The final ranking is determined by the FAD score and the subjective tests. First, the eight submissions with the best (lowest) FAD scores are selected as finalists. The finalists are then ranked by the subjective tests alone, regardless of their FAD scores.

# Baseline system

The task organizers will provide a baseline system consisting of VQ-VAE, PixelSNAIL, and HiFi-GAN. We chose the baseline system to be a cascaded model, rather than an end-to-end one, due to its simplicity, explainability, and efficiency.

### Repository

Detailed information can be found in the GitHub repository. The baseline code is largely adapted from this repository. If you use this code, please cite the task description paper, which will be published in May, and the following paper:

Publication

Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang. Conditional sound generation using neural discrete time-frequency representation learning. 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2021.

### Parameters

• Audio features:
• Log mel-band energies (80 bands), 22,050 Hz sampling frequency, frame length 1024 points with 25% hop size
• Neural network:
• Input shape: 80 * 344 (4 seconds)
• VQ-VAE encoder
• Strided convolutional layer #1: 4 × 4 conv, ReLU, 4 × 4 conv, ReLU, 3 × 3 conv
• Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #2: 2 × 2 conv, ReLU, 2 × 2 conv, ReLU, 3 × 3 conv
• Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #3: 6 × 6 conv, ReLU, 6 × 6 conv, ReLU, 3 × 3 conv
• Residual block #3: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #4: 8 × 8 conv, ReLU, 8 × 8 conv, ReLU, 3 × 3 conv
• Residual block #4: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• VQ-VAE decoder
• 3 × 3 conv
• Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• ReLU
• Transposed convolutional block: 4 × 4 transposed conv, ReLU, 4 × 4 transposed conv
• PixelSNAIL
• HiFi-GAN
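The audio-feature settings above (80 mel bands, 22,050 Hz, 1024-point frames, 25% hop, i.e. a 256-sample hop) can be reproduced with a minimal NumPy-only extractor. This is a sketch under stated assumptions, not the baseline's code: the baseline presumably uses a standard library implementation (e.g., librosa) whose center padding yields the 344 frames quoted above, whereas this sketch skips padding and yields 341 frames for a 4-second clip.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22_050, 1024, 256, 80  # hop = 25% of the 1024-pt frame

def log_mel(audio: np.ndarray) -> np.ndarray:
    """Minimal log mel-band energy extractor (no center padding)."""
    # slice the signal into overlapping Hann-windowed frames
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(N_FFT)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, 513)

    # build a simple triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(SR / 2), N_MELS + 2))
    bins = np.floor((N_FFT + 1) * mel_pts / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)

    return np.log(power @ fb.T + 1e-10).T  # (80 bands, n_frames)
```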

## FAD results for the development dataset

We evaluated the FAD score on the development dataset. In each category, we used 100 ground-truth (natural) and 100 synthesized sounds. We calculated the FAD score using the official implementation in this repo.

| Class ID | Category | FAD |
|----------|----------|-----|
| 0 | DogBark | 13.411 |
| 1 | Footstep | 8.109 |
| 2 | GunShot | 7.951 |
| 3 | Keyboard | 5.230 |
| 4 | MovingMotorVehicle | 16.108 |
| 5 | Rain | 13.337 |
| 6 | Sneeze/Cough | 3.770 |

# Citation

A task description paper will be published in May. If you are participating in this task or using the baseline code, please cite the task description paper.