# Foley Sound Synthesis

### Coordinators

• Keunwoo Choi (Gaudio Lab, Inc.)
• Jaekwon Im (Gaudio Lab, Inc. / Korea Advanced Institute of Science & Technology (KAIST))
• Laurie Heller (Carnegie Mellon University)
• Brian McFee (New York University)
• Keisuke Imoto (Doshisha University)
• Yuki Okamoto (Ritsumeikan University)
• Mathieu Lagrange (CNRS, Ecole Centrale Nantes, Nantes Université)
• Shinnosuke Takamichi (The University of Tokyo)

## Before Submission

• Participants build Foley sound synthesis models that generate varied sounds for the 7 pre-defined sound classes.
• class index --> [Model] --> 100 sounds (4-second / mono / 16-bit / 22,050 Hz).
• class:{"DogBark": 0, "Footstep": 1, "GunShot": 2, "Keyboard": 3, "MovingMotorVehicle": 4, "Rain": 5, "Sneeze_Cough": 6}
• In Track A, participants can use external data and pre-trained models from the allowed resource list. In Track B, no external resources are allowed. One can submit to either or both tracks.
• An entry consists of two items:
• A yaml file in a pre-defined format that includes the URL of a Colab notebook (which also follows a pre-defined format)
• A pdf file containing the technical report

## After Submission

• Submissions will be evaluated first by Frechet Audio Distance (FAD), then by a subjective test.
• Participants are required to contribute to the subjective tests.
• One evaluator per entry / About 2-3 hours of listening test / During the week of May 20-26.
• FAD selects eight finalists. The subjective test then decides the final ranking among them (FAD scores are ignored at this stage).

# Description

Foley sound, in general, refers to sound effects that are created to convey (and sometimes enhance) the sounds produced by events occurring in a narrative (e.g. radio or film). Foley sounds are commonly added to multimedia to enhance the perceptual audio experience. This sound synthesis challenge requires the generation of original audio clips that represent a category of sound, such as footsteps.

The new sounds should fit into the category that is typified by the set of sounds in the development set, yet they should not duplicate any of the provided sounds.

## Why is it an important task?

First, obtaining a perfectly matched sound effect traditionally requires time-consuming post-production. By generating sound that belongs to a target sound category, Foley sound synthesis can make the workflow much more time- and cost-effective. With the rise of virtual environments such as the metaverse, we expect a growing need for the automated generation of ever more complex and creative sound environments. Second, it can be utilized for dataset synthesis or augmentation for a wide variety of DCASE tasks, including sound event detection (SED). SED has drawn great attention, and synthesized datasets have already been used, e.g., the URBAN-SED dataset. A high-quality Foley sound synthesis model could lead to the development of better SED models.

This task consists of generating sounds from two conditions. The first condition is the class ID number, and the second is the number of sounds to be generated. Both conditions are integers. The length of each generated audio clip should be 4 seconds.
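The two-integer conditioning interface described above can be sketched as follows. This is a minimal placeholder, not the baseline: the function name `generate` and the silence-returning body are illustrative assumptions; only the interface (class ID and count in, fixed-length mono clips out) comes from the task description.

```python
import numpy as np

SAMPLE_RATE = 22_050
DURATION_S = 4
N_SAMPLES = SAMPLE_RATE * DURATION_S  # 88,200 samples per 4-second clip

def generate(class_id: int, n_sounds: int) -> np.ndarray:
    """Placeholder generator returning `n_sounds` clips of silence.

    A real submission replaces this body with its synthesis model;
    the task only prescribes the two integer conditions and the
    fixed-length mono output.
    """
    assert 0 <= class_id <= 6, "7 classes, indexed 0-6"
    return np.zeros((n_sounds, N_SAMPLES), dtype=np.float32)
```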

The challenge has two tracks: development of models with external resources (Track A) and without external resources (Track B). Participants may submit a system to either or both tracks, and each track is evaluated independently.

# Audio dataset

## Development Set

The development set is composed of audio samples from 3 datasets: UrbanSound8K, FSD50K, and BBC Sound Effects. Participants are not allowed to use these 3 datasets as external resources. The development set consists of 4,850 labeled sound excerpts, which are divided into 7 classes. All audio was converted to mono 16-bit 22,050 Hz audio. In addition, for consistency of the challenge, it was zero-padded or segmented to fit a length of 4 seconds. The 7 class IDs and the corresponding classes of audio are as follows.

| Class ID | Category | Number of files |
|----------|----------|-----------------|
| 0 | DogBark | 617 |
| 1 | Footstep | 703 |
| 2 | GunShot | 777 |
| 3 | Keyboard | 800 |
| 4 | MovingMotorVehicle | 581 |
| 5 | Rain | 741 |
| 6 | Sneeze/Cough | 631 |
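The zero-padding/segmentation step described above (fitting every excerpt to exactly 4 seconds at 22,050 Hz) can be sketched in a few lines. The function name `fit_to_4s` is an illustrative assumption; the pad/truncate behavior follows the dataset description.

```python
import numpy as np

SAMPLE_RATE = 22_050
TARGET_LEN = SAMPLE_RATE * 4  # exactly 4 seconds

def fit_to_4s(audio: np.ndarray) -> np.ndarray:
    """Zero-pad short clips and truncate long ones to exactly 4 s."""
    if len(audio) < TARGET_LEN:
        return np.pad(audio, (0, TARGET_LEN - len(audio)))
    return audio[:TARGET_LEN]
```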

## Evaluation Set

The evaluation dataset has the same classes as the development dataset. The sample rate, the number of channels, and the bit depth of audio correspond to those of audio in the development dataset. The length of each audio is 4 seconds, and there are 100 audio samples per category. This dataset will be released after the end of the challenge.

## List of external data resources allowed (Track A):

• AudioCaps dataset (audio), added 2023.03.01: https://audiocaps.github.io/
• Clotho v2 dataset (audio), added 2023.03.01: https://zenodo.org/record/4783391
• DCASE2023 Task7 Baseline Pre-Trained HiFi-GAN model (vocoder), added 2023.03.01: https://github.com/jik876/hifi-gan
• Official Pre-Trained MelGAN model (vocoder), added 2023.03.01: https://github.com/descriptinc/melgan-neurips

## Track A

• Participants are allowed to use external resources in the list of datasets and pre-trained models provided by the task organizer.
• Participants can request the organizer to add public models or datasets (the corresponding models or datasets must be publicly available).
• When submitting a technical report, participants are required to specify all the external resources they used.
• Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
• Participants are allowed to submit an ensemble system having multiple systems for each sound class.

## Track B

• Participants are NOT allowed to use external resources.
• Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
• Participants are allowed to submit an ensemble system having multiple systems for each sound class.

# Submission

A submission consists of the following items.

• Metadata file (*.yaml) with the pre-defined format
• The URL of the Colab notebook is included here. (A Colab template will be released soon.)
• Technical report (*.pdf)

Audio files: the duration, sample rate, bit depth, and number of channels must match those of the development set (4 seconds, 16-bit, 22,050 Hz, mono). The audio files should be generated using the Colab notebook. The number of audio files generated per class should be 90. Participants need to declare in the metadata file which track the submission is for. The technical report should contain a description of the submission, including implementation details.
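Before submitting, it is worth verifying that every generated WAV file meets the required format. A minimal checker using only the standard-library `wave` module might look like this (the function name `check_clip` is an illustrative assumption):

```python
import wave

def check_clip(path: str) -> None:
    """Raise AssertionError if a WAV file violates the submission format
    (4 seconds, 16-bit, 22,050 Hz, mono)."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "must be mono"
        assert w.getsampwidth() == 2, "must be 16-bit (2 bytes/sample)"
        assert w.getframerate() == 22_050, "must be 22,050 Hz"
        assert w.getnframes() == 22_050 * 4, "must be exactly 4 seconds"
```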

# Evaluation

Submissions will be evaluated by Frechet Audio Distance (FAD), followed by a subjective test.

## Evaluation Metric

FAD is a reference-free evaluation metric based on the Frechet inception distance (FID), a metric widely used for generative models. Because FAD is calculated from sets of hidden representations of generated and real samples, it can be used even when ground-truth reference audio does not exist. FAD is calculated in the following sequence. First, audio representations are extracted from both the evaluation set and the generated samples, using VGGish, a classification model trained on AudioSet. Second, each set of representations is fitted to a multivariate Gaussian distribution. Finally, the FAD score of the generated samples is the Frechet distance between these two distributions:

$$F ({\mathcal N}_{r}, {\mathcal N}_{g}) = \| \mu_{r} - \mu_{g} \|^{2} + \mathrm{tr} ( \Sigma_{r} + \Sigma_{g} - 2 \sqrt{\Sigma_{r} \Sigma_{g}} )$$

where $${\mathcal N}_{r} (\mu_{r},\Sigma_{r})$$ and $${\mathcal N}_{g} (\mu_{g},\Sigma_{g})$$ are multivariate Gaussians fitted to the VGGish embeddings of the evaluation set and the generated samples, respectively.
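Given two sets of embeddings, the formula above can be computed directly with NumPy and SciPy. This is an illustrative sketch, not the official scoring code (the function name `frechet_audio_distance` is an assumption); the official implementation linked from the challenge page should be used for comparable scores.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between two sets of (VGGish-style) embeddings of shape (N, D)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # matrix square root of the covariance product
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Note that lower is better: identical distributions give a FAD of zero.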

## Subjective Test

Complementary to the quantitative evaluation, we will evaluate performance through a subjective test. Generated audio excerpts for the subjective test are randomly sampled from the submitted data. Each challenge submission is evaluated by the organizers and by one author from each of the other challenge submissions. The test follows an incomplete strategy: samples generated by a team's submitted algorithm will not be evaluated by any member of that team. The listening test will be conducted separately for each category. Evaluators will listen to generated audio samples from challenge submissions, alongside real samples and audio samples generated by the baseline model. Both audio fidelity and category suitability will be evaluated. One evaluator from each entry must be available to spend 2-3 hours doing listening tests during the week of May 20-26.

## Ranking

The final ranking is determined by the FAD score and the subjective tests. First, the eight submissions with the best (lowest) FAD scores are selected as finalists. The finalists are then ranked by the subjective tests alone, regardless of their FAD scores.

# Baseline system

The task organizers will provide a baseline system consisting of VQ-VAE, PixelSNAIL, and HiFi-GAN. We chose the baseline system to be a cascaded model, rather than an end-to-end one, due to its simplicity, explainability, and efficiency.

### Repository

Detailed information can be found in the GitHub repository. The baseline code is largely adapted from this repository. If you use this code, please cite the task description paper, which will be published in May, and the following paper:

Publication

Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang. Conditional sound generation using neural discrete time-frequency representation learning. 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2021.

### Parameters

• Audio features:
• Log mel-band energies (80 bands), 22,050 Hz sampling frequency, frame length 1024 points with 25% hop size
• Neural network:
• Input shape: 80 * 344 (4 seconds)
• VQ-VAE encoder
• Strided convolutional layer #1: 4 × 4 conv, ReLU, 4 × 4 conv, ReLU, 3 × 3 conv
• Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #2: 2 × 2 conv, ReLU, 2 × 2 conv, ReLU, 3 × 3 conv
• Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #3: 6 × 6 conv, ReLU, 6 × 6 conv, ReLU, 3 × 3 conv
• Residual block #3: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Strided convolutional layer #4: 8 × 8 conv, ReLU, 8 × 8 conv, ReLU, 3 × 3 conv
• Residual block #4: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• VQ-VAE decoder
• 3 × 3 conv
• Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
• ReLU
• Transposed convolutional block: 4 × 4 transposed conv, ReLU, 4 × 4 transposed conv
• PixelSNAIL
• HiFi-GAN
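The audio-feature settings above (80 mel bands, 22,050 Hz, 1024-point frames, 25% hop, i.e. a 256-sample hop) can be reproduced with a minimal NumPy-only extractor. This is a sketch under stated assumptions, not the baseline's code: the baseline presumably uses a standard library implementation (e.g., librosa) whose center padding yields the 344 frames quoted above, whereas this sketch skips padding and yields 341 frames for a 4-second clip.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22_050, 1024, 256, 80  # hop = 25% of the 1024-pt frame

def log_mel(audio: np.ndarray) -> np.ndarray:
    """Minimal log mel-band energy extractor (no center padding)."""
    # slice the signal into overlapping Hann-windowed frames
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(N_FFT)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, 513)

    # build a simple triangular mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(SR / 2), N_MELS + 2))
    bins = np.floor((N_FFT + 1) * mel_pts / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)

    return np.log(power @ fb.T + 1e-10).T  # (80 bands, n_frames)
```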

## FAD results for the development dataset

We evaluated the FAD score on the development dataset. In each category, we used 100 ground-truth (natural) and 100 synthesized sounds. We calculated the FAD score using the official implementation in this repo.

| Class ID | Category | FAD |
|----------|----------|-----|
| 0 | DogBark | 13.411 |
| 1 | Footstep | 8.109 |
| 2 | GunShot | 7.951 |
| 3 | Keyboard | 5.230 |
| 4 | MovingMotorVehicle | 16.108 |
| 5 | Rain | 13.337 |
| 6 | Sneeze/Cough | 3.770 |

# Citation

A task description paper will be published in May. If you are participating in this task or using the baseline code, please cite the task description paper.