The challenge has ended. Full results for this task can be found on the Results page.
Summary of task
Before Submission
- Participants build Foley sound synthesis models that generate various sounds for 7 sound classes we defined:
  class index --> [Model] --> 100 sounds (4-second / mono / 16-bit / 22,050 Hz)
- Class mapping: {"DogBark": 0, "Footstep": 1, "GunShot": 2, "Keyboard": 3, "MovingMotorVehicle": 4, "Rain": 5, "Sneeze_Cough": 6}
- On Track A, participants can use external data/models. On Track B, no external resources are allowed. One can submit to either or both tracks.
- An entry consists of two items:
  - A yaml file: in a pre-defined format that includes the URL of a Colab notebook (which also follows a pre-defined format)
  - A pdf file: the technical report
After Submission
- Submissions will be evaluated first by Frechet Audio Distance (FAD), then by a subjective test.
- Participants are required to contribute to the subjective tests.
- One evaluator per entry / about 2-3 hours of listening test / during the week of May 22-27.
- FAD decides the four finalists. The subjective test decides the final ranking among them (FAD scores are then ignored).
Description
Foley sound, in general, refers to sound effects that are created to convey (and sometimes enhance) the sounds produced by events occurring in a narrative (e.g. radio or film). Foley sounds are commonly added to multimedia to enhance the perceptual audio experience. This sound synthesis challenge requires the generation of original audio clips that represent a category of sound, such as footsteps.
Why is it an important task?
First, obtaining a perfectly matched sound effect otherwise requires time-consuming post-production. By generating sounds that belong to a target category, Foley sound synthesis can make this workflow far more time- and cost-effective. With the rise of virtual environments such as the metaverse, we expect a growing need for the automated generation of increasingly complex and creative sound environments. Second, it can be used for dataset synthesis or augmentation in a wide variety of DCASE tasks, including sound event detection (SED). SED has drawn great attention, and synthesized datasets are already in use, e.g., the URBAN-SED dataset. A high-quality Foley sound synthesis model could therefore lead to the development of better SED models.
Task setup
This task consists of generating sounds from two conditions. The first condition is the class ID number, and the second is the number of sounds to be generated; both are integers. The length of each generated audio clip should be 4 seconds. A sketch of this interface is shown below.
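To make the setup concrete, a submitted system can be viewed as one function of the two integer conditions. This is a minimal sketch in Python; the function name and the noise "model" are hypothetical illustrations, not a required API.

```python
import numpy as np

SAMPLE_RATE = 22050      # Hz, fixed by the task
CLIP_SECONDS = 4         # fixed output length
NUM_CLASSES = 7

def generate(class_id: int, number_of_sounds: int) -> np.ndarray:
    """Hypothetical interface: return `number_of_sounds` mono clips for
    `class_id`, shaped (number_of_sounds, CLIP_SECONDS * SAMPLE_RATE)."""
    assert 0 <= class_id < NUM_CLASSES
    # A real submission runs model inference here; uniform noise is a stand-in.
    return np.random.uniform(
        -1.0, 1.0, size=(number_of_sounds, CLIP_SECONDS * SAMPLE_RATE))

clips = generate(5, 100)  # e.g. 100 "Rain" clips, as required for evaluation
```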
The challenge has two subproblems: development of models with external resources (Track A) and without external resources (Track B). Participants can submit a system for either or both tracks, and each track is evaluated independently.
Audio dataset
Development Set
The development set is composed of audio samples from 3 datasets: UrbanSound8K, FSD50K, and BBC Sound Effects. Participants are not allowed to use these 3 datasets as external resources. The development set consists of 4,850 labeled sound excerpts, divided into 7 classes. All audio was converted to mono, 16-bit, 22,050 Hz. In addition, for consistency of the challenge, it was zero-padded or segmented to fit a length of 4 seconds (a preprocessing sketch follows the table below). The 7 class IDs and the corresponding classes of audio are as follows.
Class ID | Category | Number of files |
---|---|---|
0 | DogBark | 617 |
1 | Footstep | 703 |
2 | GunShot | 777 |
3 | Keyboard | 800 |
4 | MovingMotorVehicle | 581 |
5 | Rain | 741 |
6 | Sneeze/Cough | 631 |
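A rough illustration of this preprocessing, assuming librosa and numpy are available (the organizers' exact pipeline may differ):

```python
import librosa
import numpy as np

TARGET_SR = 22050
TARGET_LEN = 4 * TARGET_SR  # 4 seconds -> 88,200 samples

def load_clip(path: str) -> np.ndarray:
    # Resample to 22,050 Hz and downmix to mono.
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))  # zero-pad short files
    else:
        y = y[:TARGET_LEN]                       # segment long files to 4 s
    return y
```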
Evaluation Set
The evaluation dataset has the same classes as the development dataset. The sample rate, the number of channels, and the bit depth of audio correspond to those of audio in the development dataset. The length of each audio is 4 seconds, and there are 100 audio samples per category. This dataset will be released after the end of the challenge.
Download
A portion of the sounds in the dataset were kindly provided by permission of the BBC, under the condition that the sounds are used for this DCASE challenge and for research purposes only. You can check the original source of each sound in DevMeta.csv and EvalMeta.csv, located in DCASE_2023_Challenge_Task_7_Dataset.
List of external data resources allowed (Track A):
For Track A, you can use external resources (except FSD50K and UrbanSound8K), whether they are public or not. For a list of audio datasets, you can refer to the LAION-AI audio dataset list.
List of external data resources allowed (Track B):
Dataset / Model name | Type | Added | Link |
---|---|---|---|
DCASE2023 Task7 Development Set | dataset (audio) | 2023.03.09 | https://dcase.community/challenge2023/task-foley-sound-synthesis |
DCASE2023 Task7 Baseline - Pre-Trained HiFi-GAN | model (vocoder) | 2023.03.09 | https://github.com/DCASE2023-Task7-Foley-Sound-Synthesis/dcase2023_task7_baseline |
Task rules
Track A
- Participants are allowed to use external resources, including public or private datasets and pre-trained models, except that no sounds from FSD50k or Urbansound8k may be used (for model training, etc.), because those databases were used to create the evaluation set.
- Participants are not allowed to submit systems that duplicate any pre-existing sounds (on the web or in private databases), nor systems that simply apply alterations or audio effects to pre-existing sounds. The sounds must be unique and must be generated by the code supplied in the Colab notebook.
- When submitting a technical report, participants are required to specify all the external resources they used.
- Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
- Participants are allowed to submit an ensemble system having multiple systems for each sound class.
Track B
- Participants are NOT allowed to use external resources.
- Participants are allowed to submit a rule-based system that doesn't have trainable parameters (as long as only sounds from the development set are used).
- Participants are allowed to submit an ensemble system having multiple systems for each sound class.
Submission
A submission consists of the following items.
- Metadata file (*.yaml) in the pre-defined format
  - The URL of the Colab notebook (which must follow the Colab template) is included here.
- Technical report (*.pdf)
The audio files will be generated by the organizers using the Colab notebook. The duration, sample rate, bit depth, and number of channels of the generated audio must follow those of the development set (4 seconds, 16-bit, 22,050 Hz, mono). The number of audio files generated per class should be 100. Participants need to declare in the metafile which track the submission is for. The technical report should contain a description of the submission, including implementation details. A sketch of the required output format is shown below.
The confidential statement for the submitted Colab code and model checkpoint is included in the submission package.
Evaluation
Submissions will be evaluated by Frechet Audio Distance (FAD), followed by a subjective test.
Evaluation Metric
FAD is a reference-free evaluation metric based on the Frechet inception distance (FID), a metric widely used for generative models. Because FAD is calculated from sets of hidden representations of generated and real samples, it can be used even when no ground-truth reference audio exists. FAD is calculated in the following sequence. First, audio representations are extracted from both the evaluation set and the generated samples; we use VGGish, a classification model trained on AudioSet. Second, each set of representations is fitted to a multivariate Gaussian distribution. Finally, the FAD score of the generated samples is calculated as the Frechet distance between these two distributions:
\(F ({\mathcal N}_{r}, {\mathcal N}_{g}) = \| \mu_{r} - \mu_{g} \|^{2} + \mathrm{tr} ( \Sigma_{r} + \Sigma_{g} - 2 \sqrt{\Sigma_{r} \Sigma_{g}} )\)
where \({\mathcal N}_{r} (\mu_{r},\Sigma_{r})\) and \({\mathcal N}_{g} (\mu_{g},\Sigma_{g})\) are the multivariate Gaussians of the VGGish embeddings from the evaluation set and the generated samples, respectively.
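Given the two sets of embeddings, the distance above is straightforward to compute. A sketch using numpy and scipy (the VGGish extraction step is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between two embedding sets, each of shape (n_samples, dim)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    sigma_r = np.cov(emb_real, rowvar=False)
    sigma_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```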
Subjective Test
Complementary to the quantitative evaluation, we will evaluate performance through a subjective test. The four submissions with the best (lowest) FAD scores will be evaluated. Generated audio excerpts for the subjective test are randomly sampled from the submitted data. Each challenge submission is evaluated by the organizers and by one author of each of the other challenge submissions. The test follows an incomplete strategy: samples generated by a team's own algorithm are never evaluated by anyone from that team. The listening test is conducted separately for each category. Evaluators will listen to generated audio samples from challenge submissions, together with real samples and samples generated by the baseline model. Audio fidelity, category suitability, and diversity will be rated. The three ratings are combined as a weighted average with an audio quality : category fit : diversity ratio of 2:2:1. One evaluator from each entry must be available to spend 2-3 hours doing listening tests during the week of May 22-27.
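For example, under this 2:2:1 weighting, hypothetical ratings of 6.0 (audio quality), 5.5 (category fit), and 7.0 (diversity) would combine to (2 × 6.0 + 2 × 5.5 + 1 × 7.0) / 5 = 6.0.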
Ranking
The final ranking is determined by the FAD score and the subjective tests. First, the four submissions with the best (lowest) FAD scores are selected as finalists. They are then ranked by the subjective tests alone, regardless of their FAD scores.
Results
Track A
All scores are MOS ratings on a 10-step scale. "Weighted" is the weighted average of Audio Quality (AQ), Category Fit (CF), and Diversity (Div, weighted 0.5) per the 2:2:1 ratio.

Submission Code | Technical Report | Official Rank | Weighted Avg | Weighted: Dog Bark | Weighted: Footstep | Weighted: Gun Shot | Weighted: Keyboard | Weighted: Moving Motor Vehicle | Weighted: Rain | Weighted: Sneeze/Cough | AQ Avg | AQ: Dog Bark | AQ: Footstep | AQ: Gun Shot | AQ: Keyboard | AQ: Moving Motor Vehicle | AQ: Rain | AQ: Sneeze/Cough | CF Avg | CF: Dog Bark | CF: Footstep | CF: Gun Shot | CF: Keyboard | CF: Moving Motor Vehicle | CF: Rain | CF: Sneeze/Cough | Div Avg | Div: Dog Bark | Div: Footstep | Div: Gun Shot | Div: Keyboard | Div: Moving Motor Vehicle | Div: Rain | Div: Sneeze/Cough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DCASE2023_baseline_task7 | DCASE2023baseline2023 | 6 | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 | | | | | | | | |
Chon_Gaudio_task7_trackA_1 | ChonGLI2023 | 2 | 6.967 | 7.984 | 6.865 | 7.255 | 6.989 | 6.881 | 6.243 | 6.553 | 6.657 | 7.612 | 6.455 | 6.814 | 6.814 | 6.446 | 5.928 | 6.528 | 7.154 | 8.223 | 7.082 | 7.573 | 7.157 | 7.131 | 6.306 | 6.606 | 7.214 | 8.250 | 7.250 | 7.500 | 7.000 | 7.250 | 6.750 | 6.500 |
Yi_SURREY_task7_trackA_1 | YiSURREY2023 | 1 | 7.056 | 7.742 | 6.466 | 6.189 | 7.433 | 7.448 | 6.441 | 7.675 | 6.723 | 7.309 | 6.143 | 5.532 | 7.243 | 7.315 | 6.067 | 7.454 | 7.578 | 8.297 | 6.646 | 6.689 | 8.089 | 8.181 | 6.911 | 8.233 | 6.679 | 7.500 | 6.750 | 6.500 | 6.500 | 6.250 | 6.250 | 7.000 |
Guan_HEU_task7_trackA_2 | GuanHEU2023 | 4 | 5.157 | 4.877 | 4.450 | 6.413 | 5.479 | 5.822 | 5.201 | 3.856 | 4.670 | 3.800 | 4.164 | 5.800 | 5.339 | 5.365 | 4.972 | 3.250 | 5.293 | 5.142 | 3.836 | 7.482 | 5.232 | 6.315 | 5.656 | 3.389 | 5.857 | 6.500 | 6.250 | 5.500 | 6.250 | 5.750 | 4.750 | 6.000 |
Scheibler_LINE_task7_trackA_1 | ScheiblerLINE2023 | 3 | 6.887 | 7.333 | 6.832 | 7.317 | 7.199 | 6.474 | 5.222 | 7.834 | 6.355 | 6.479 | 6.263 | 6.771 | 6.886 | 6.131 | 4.780 | 7.180 | 7.327 | 7.479 | 7.192 | 7.896 | 7.861 | 7.054 | 5.150 | 8.655 | 7.071 | 8.750 | 7.250 | 7.250 | 6.500 | 6.000 | 6.250 | 7.500 |
Track B
In the case that multiple systems were submitted by one team, only the team's system with the best (lowest) FAD score was perceptually evaluated.
Columns are as in the Track A table above.

Submission Code | Technical Report | Official Rank | Weighted Avg | Weighted: Dog Bark | Weighted: Footstep | Weighted: Gun Shot | Weighted: Keyboard | Weighted: Moving Motor Vehicle | Weighted: Rain | Weighted: Sneeze/Cough | AQ Avg | AQ: Dog Bark | AQ: Footstep | AQ: Gun Shot | AQ: Keyboard | AQ: Moving Motor Vehicle | AQ: Rain | AQ: Sneeze/Cough | CF Avg | CF: Dog Bark | CF: Footstep | CF: Gun Shot | CF: Keyboard | CF: Moving Motor Vehicle | CF: Rain | CF: Sneeze/Cough | Div Avg | Div: Dog Bark | Div: Footstep | Div: Gun Shot | Div: Keyboard | Div: Moving Motor Vehicle | Div: Rain | Div: Sneeze/Cough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DCASE2023_baseline_task7 | DCASE2023baseline2023 | 18 | 3.810 | 2.688 | 4.160 | 3.237 | 5.150 | 3.862 | 4.175 | 3.400 | 3.831 | 2.930 | 4.158 | 3.504 | 5.137 | 3.543 | 4.115 | 3.432 | 3.789 | 2.447 | 4.162 | 2.969 | 5.163 | 4.182 | 4.235 | 3.368 | | | | | | | | |
Kamath_NUS_task7_trackB_2 | KamathNUS2023 | 3 | 4.647 | 4.807 | 4.073 | 5.010 | 4.276 | 5.248 | 4.013 | 5.102 | 3.988 | 3.789 | 3.554 | 4.346 | 3.911 | 4.642 | 3.378 | 4.295 | 4.612 | 4.979 | 3.629 | 5.054 | 4.029 | 5.727 | 3.406 | 5.459 | 6.036 | 6.500 | 6.000 | 6.250 | 5.500 | 5.500 | 6.500 | 6.000 |
Chang_HYU_task7_trackB_1 | ChangHYU2023 | 1 | 6.515 | 5.659 | 7.111 | 6.557 | 7.384 | 6.155 | 7.042 | 5.699 | 6.085 | 4.882 | 6.738 | 5.879 | 7.296 | 6.069 | 6.860 | 4.873 | 6.845 | 6.014 | 7.288 | 7.013 | 7.789 | 6.442 | 7.370 | 6.000 | 6.714 | 6.500 | 7.500 | 7.000 | 6.750 | 5.750 | 6.750 | 6.750 |
Jung_KT_task7_trackB_2 | JungKT2023 | 2 | 5.534 | 5.321 | 5.033 | 6.022 | 5.614 | 6.021 | 5.902 | 4.826 | 5.082 | 4.432 | 4.933 | 5.579 | 5.139 | 5.623 | 5.600 | 4.270 | 5.610 | 5.371 | 4.775 | 6.100 | 5.646 | 6.554 | 5.906 | 4.920 | 6.286 | 7.000 | 5.750 | 6.750 | 6.500 | 5.750 | 6.500 | 5.750 |
Lee_MARG_task7_trackB_1 | LeeMARG2023 | 4 | 4.427 | 3.273 | 4.843 | 3.941 | 5.409 | 4.942 | 4.210 | 4.374 | 3.929 | 2.530 | 4.204 | 3.531 | 5.311 | 4.542 | 3.735 | 3.650 | 4.443 | 3.153 | 4.654 | 3.696 | 5.336 | 5.312 | 4.040 | 4.910 | 5.393 | 5.000 | 6.500 | 5.250 | 5.750 | 5.000 | 5.500 | 4.750 |
Complete results and technical reports can be found on the results page.
Baseline system
The task organizers will provide a baseline system consisting of VQ-VAE, PixelSNAIL, and HiFi-GAN. We chose the baseline system to be a cascaded model, rather than an end-to-end one, due to its simplicity, explainability, and efficiency.
Repository
Detailed information can be found in the GitHub repository. The code of the baseline repository is mostly from this repository. If you use this code, please cite the task description paper, which will be published in May, and the following paper:
Parameters
- Audio features (see the feature-extraction sketch after this list):
  - Log mel-band energies (80 bands), 22,050 Hz sampling frequency, frame length 1024 points with 25% hop size
- Neural network:
  - Input shape: 80 × 344 (4 seconds)
  - VQ-VAE encoder
    - Strided convolutional layer #1: 4 × 4 conv, ReLU, 4 × 4 conv, ReLU, 3 × 3 conv
    - Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    - Strided convolutional layer #2: 2 × 2 conv, ReLU, 2 × 2 conv, ReLU, 3 × 3 conv
    - Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    - Strided convolutional layer #3: 6 × 6 conv, ReLU, 6 × 6 conv, ReLU, 3 × 3 conv
    - Residual block #3: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    - Strided convolutional layer #4: 8 × 8 conv, ReLU, 8 × 8 conv, ReLU, 3 × 3 conv
    - Residual block #4: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
  - VQ-VAE decoder
    - 3 × 3 conv
    - Residual block #1: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    - Residual block #2: ReLU, 3 × 3 conv, ReLU, 1 × 1 conv
    - ReLU
    - Transposed convolutional block: 4 × 4 transposed conv, ReLU, 4 × 4 transposed conv
  - PixelSNAIL
    - Parameters inherited from the original GitHub repo
  - HiFi-GAN
    - Parameters inherited from this GitHub repo
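A sketch of the input feature extraction under these settings, using librosa (a 25% hop of a 1024-point frame is 256 samples; the exact frame count depends on padding/centering conventions):

```python
import librosa
import numpy as np

def log_mel(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    """80-band log mel spectrogram: frame length 1024, hop 256 (25%)."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    return np.log(mel + 1e-9)  # log energies; epsilon avoids log(0)

# A 4-second clip (88,200 samples) yields roughly 80 x 344 frames.
print(log_mel(np.zeros(4 * 22050)).shape)
```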
FAD results for the development dataset
We evaluated the FAD score on the development dataset. In each category, we used 100 ground-truth (natural) and 100 synthesized sounds. Participants can utilize this repo to compute the FAD score between the sounds generated by their system and those of the evaluation set.
Class ID | Category | FAD |
---|---|---|
0 | DogBark | 13.614 |
1 | Footstep | 6.826 |
2 | GunShot | 6.152 |
3 | Keyboard | 5.065 |
4 | MovingMotorVehicle | 11.239 |
5 | Rain | 14.449 |
6 | Sneeze/Cough | 3.563 |
Citation
If you are participating in this task or using the baseline code, please cite the following task description paper.