Sound Scene Synthesis

Task description


Mathieu Lagrange

CNRS, Ecole Centrale Nantes, Nantes Université

Junwon Lee
Modan Tailleur
Laurie Heller

Carnegie Mellon University

Keunwoo Choi

Gaudio Lab, Inc.

Brian McFee

New York University

Keisuke Imoto

Doshisha University

Yuki Okamoto

The University of Tokyo

This task expands the scope from Foley sounds to general sound scenes and adds controllability through natural-language text prompts.

If you are interested in the task, you can join us on the dedicated slack channel.


Environmental Sound Scene Synthesis is the task of generating environmental sound from a textual description. Environmental sounds encompass any non-musical sound, including vocal sounds that are not intelligible speech. This next-generation task expands the scope from last year’s Foley sounds to more general sound scenes and adds controllability through natural-language text prompts.

Figure 1: Overview of sound synthesis system.

The organizers will provide: 1) a development set consisting of 60 prompts and corresponding audio embeddings (not raw audio), and 2) a colab template with the baseline. Any external resources for system training are allowed. We mandate that the authors provide extensive information about the datasets used in their technical report and metadata file (*.yaml). If a significant part of the training set is from private sources whose properties shall not be disclosed, the authors are asked to provide at least a rough estimate of the duration of audio of this private part. Before the inference starts, calls to external resources are allowed to install Python modules and download pre-trained models. However, during inference, calls to external resources are not allowed. In other words, the inference stage must be 100% self-contained.

The evaluation will proceed as such:

  1. The participants submit a colab notebook with their contributions.
  2. A debugging phase will be opened two weeks before the challenge deadline, during which time the participants will be able to submit their system to check that their submitted code is running correctly. The authors must submit their code by sharing it to, with a prompt given as a commented line in the colab. The organizers will run the submitted code with the given prompt and check that the generated audio meets the requirements. No content analysis will be performed. We encourage authors to extensively check their system before submission and ask for checking as soon as possible, as the organizers may not be able to handle multiple requests close to the deadline.
  3. Once the deadline has passed, the organizers will query the synthesizers submitted through the official submission website with a new set of prompts. If issues arise, the organizers will contact the authors to fix obvious bugs. No resubmission will be allowed; authors may only recommend bug fixes, which the organizers will apply to the original submission. We advise authors to provide a contact email for someone who is available during the 2 weeks after the deadline. If the submitted code fails and no reply is promptly given, the organizers will discard the submission. The time limit is 24 hours for the generation of 250 four-second clips.
  4. The organizers will rank the submitted synthesizers in terms of Fréchet Audio Distance (FAD), using PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel) embeddings.
  5. The participants will be asked to perceptually rate a sample of the generated sounds from the highest ranked systems.
  6. Results will be published on **July 15th** due to the time needed to perform the subjective evaluation.

Please be aware that, if you plan to submit a paper to the DCASE workshop, the deadline for submitting your paper will be July 4th which is before the publication of the results of the task. This should not prevent you from submitting a paper describing your submitted system for two main reasons:

  1. The ranking of the described system is not an evaluation criterion for reviewing a DCASE workshop paper.
  2. You will be allowed to add information related to the evaluation of your system at the rebuttal phase. However, no major changes to the overall claim of the accepted paper are permitted.

Why is it an important task?

Environmental Sound Synthesis from textual prompts underpins a multitude of applications, notably encompassing sound effects generation for multimedia content creation. Moreover, it serves as a relatively straightforward means to assess the proficiency of generative audio models. Given that nearly any conceivable sound can be generated using a text prompt, proficiency in this task suggests the potential of a similar architecture to tackle more intricate and difficult-to-evaluate generative tasks, such as generating audio from video inputs.

Task setup

This task consists of generating sounds from text prompts. Given a list of text prompts, the generative model should synthesize one 4-second sound per prompt and return the output as a dictionary. Each text prompt is given as a string, and the resulting dictionary should have the string-type prompts as keys and numpy-ndarray audio waveforms as values. Participants in this challenge are allowed to use external resources, along with the development set that the organizers provide, to build their model.
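The expected input/output contract can be sketched as follows. This is only an illustration of the dictionary interface described above: the function name `synthesize` is hypothetical, the 32,000 Hz sample rate is taken from the submission requirements, and low-level noise stands in for a real generative model.

```python
from typing import Dict, List

import numpy as np

SAMPLE_RATE = 32_000  # Hz, as required for submitted audio
DURATION_S = 4        # seconds per generated clip


def synthesize(prompts: List[str]) -> Dict[str, np.ndarray]:
    """Return one 4-second mono waveform per prompt, keyed by the prompt string.

    Hypothetical placeholder: a real system would run its generative model here.
    """
    n_samples = SAMPLE_RATE * DURATION_S
    rng = np.random.default_rng(0)
    outputs: Dict[str, np.ndarray] = {}
    for prompt in prompts:
        # Placeholder signal: quiet Gaussian noise instead of model output.
        noise = 0.01 * rng.standard_normal(n_samples)
        outputs[prompt] = noise.astype(np.float32)
    return outputs
```

Calling `synthesize(["a small dog is barking"])` would return a dictionary with that prompt as its single key and a one-dimensional array of 128,000 samples as its value.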

Audio dataset: Sound Design Reference

The Sound Design Reference audio datasets have been created by a sound engineer, using sounds from Freesound and from private audio datasets. The sound engineer has tried to match a given prompt by editing sounds from those datasets. The prompts comprise a foreground sound, which can be any sound except music and speech, and a background sound from among the categories “water”, “birds”, “traffic”, and two other categories which will not be revealed before systems are submitted. Each foreground is a combination of a foreground source and a verb (e.g. a pig is grunting). Some foregrounds also contain an adjective, where the organizers of the challenge decided that an adjective could add audible information (e.g. a small dog is barking). All created sounds are 4 seconds long.

Development set

The development dataset consists of 60 pairs of prompts and audio embeddings from foreground sounds. Its purpose is not to cover the entire audio spectrum of the evaluation set but to offer participants insight into the types of prompts that may be used during evaluation. No audio will be made publicly available; only the embeddings of the corresponding audio are given for each prompt. Embeddings are available for PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel), MS-CLAP (clap-2023) and VGGish (vggish). Please note that only the PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel) embeddings will be used for FAD evaluation; the other embeddings are provided only as supplementary information to contestants.

The evaluated system is expected to handle any prompt describing sounds other than music and intelligible speech. Given the narrow scope of the provided development set, participants are encouraged to assess their systems on broader text-to-audio (TTA) datasets and tasks as well.


Evaluation set

The evaluation set consists of 250 secret prompts and audio clips. Both will remain secret even after the end of the challenge. The foreground prompts in the development set do not represent the exact sound sources that will be used in the evaluation phase; e.g., if dog is used in the development set and cat is not, either one may appear in the evaluation set. Apart from the backgrounds introduced in the development set, the evaluation dataset will include two additional types of backgrounds.

Task rules

  1. Participants are allowed to use external resources, including public/private datasets and pre-trained models, prior to inference.
  2. Participants are not allowed to submit systems that duplicate any pre-existing sounds (on the web or in private databases). They also may not submit systems that simply apply phase or spectrum alterations or audio effects to pre-existing sounds. The sounds must be unique and must be generated by the code supplied in the colab.
  3. When submitting a technical report and yaml file, participants are required to specify all the external resources they used with detailed specifications.
  4. Participants are allowed to submit a rule-based system that doesn't have trainable parameters.
  5. The amount of assistance the organizers will provide to the participants to get their system running may be limited without notice.
  6. It is mandatory that the participants respond within 24 hours to organizers' queries and feedback. Systems that do not function properly on colab (or take too much time) are the responsibility of the submitter, not the organizers, to debug. If the submitter does not respond with a fix before the deadline, their system will not be entered into the contest. We do not accept any modifications after the submission deadline.
  7. For participants, only 1 submission per team will be accepted. No overlap is allowed among team members, except for one supervisor, who must be identified as such. This policy was added to optimize the evaluation workload, which is significantly heavier than other DCASE tasks.


A submission would consist of the following items.

  • Metadata file (*.yaml) with the pre-defined format
    • The URL for the Colab notebook is included here. Copy the file to make your own one.
    • Colab template
  • Technical report (*.pdf)

The audio files will be generated by the organizers using the Colab notebook. On audio files: the duration, sample rate, bit depth, and the number of channels must follow those of the development set (4 seconds, 32 bit, 32,000 Hz, mono). The technical report should contain a description, including the implementation details of the submission.
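A quick self-check of the required format before submission might look like the sketch below. The helper `check_clip` is hypothetical and not part of the Colab template, and it assumes that "32 bit" refers to 32-bit floating-point samples, which the text above does not state explicitly.

```python
import numpy as np

EXPECTED_SR = 32_000                  # Hz
EXPECTED_SAMPLES = 4 * EXPECTED_SR    # 4 seconds of audio


def check_clip(waveform: np.ndarray, sample_rate: int) -> list:
    """Return a list of format problems; an empty list means the clip passes."""
    problems = []
    if sample_rate != EXPECTED_SR:
        problems.append(f"sample rate is {sample_rate} Hz, expected {EXPECTED_SR} Hz")
    if waveform.ndim != 1:
        problems.append(f"expected a mono 1-D array, got {waveform.ndim} dimensions")
    if waveform.shape[-1] != EXPECTED_SAMPLES:
        problems.append(f"clip has {waveform.shape[-1]} samples, expected {EXPECTED_SAMPLES}")
    # Assumption: "32 bit" is interpreted as float32 samples.
    if waveform.dtype != np.float32:
        problems.append(f"dtype is {waveform.dtype}, expected float32")
    return problems
```

Running this check on every value of the output dictionary is a cheap way to catch duration or sample-rate mismatches before the debugging phase.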

The confidentiality statement for the submitted Colab code and model checkpoint is included in the submission package.

Debug Phase

June 1: System submission opens for organizer review.

June 7: System pre-submission review deadline to get pre-competition organizer review and testing. For systems pre-submitted by June 7th, organizers will conduct a test run of 7 prompts on your system to find out whether it generates sounds within 30 minutes with GPU load in Colab Pro+. If it does not meet these standards, the organizers will notify you (by June 10th), give you the opportunity to fix your problems, and retest your system if required. If you do not respond within 24 hours of notice from the organizers, your opportunity for further organizer review and help will be forfeited, but you may still submit a revised system any time until June 15th.

June 15: System submission deadline. Submit a system that you know works, after establishing that it runs in a timely way.

Note: if you submit after June 7th, no organizer review is guaranteed. If your submitted system does not run in a timely way, your system will not have sounds entered into the contest on June 15th.
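To put the two stated time limits in perspective, the per-clip budget can be worked out directly: 30 minutes for 7 prompts in the debug test run, and 24 hours for 250 clips in the final evaluation. The helper name below is hypothetical, and the figures ignore model loading and other fixed overheads.

```python
def seconds_per_clip(budget_hours: float, n_clips: int) -> float:
    """Average wall-clock time available per clip within a total budget."""
    return budget_hours * 3600.0 / n_clips


# Final evaluation: 24 hours for 250 clips.
final_budget = seconds_per_clip(24, 250)   # 345.6 seconds per clip

# Debug test run: 30 minutes for 7 prompts.
debug_budget = seconds_per_clip(0.5, 7)    # about 257.1 seconds per prompt

print(f"final: {final_budget:.1f} s/clip, debug: {debug_budget:.1f} s/prompt")
```

The debug test run is therefore the tighter of the two constraints on average generation time.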


Submissions will be evaluated by Frechet Audio Distance (FAD), using the evaluation set (Sound Design Reference audio datasets) and the PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel) for feature extraction, followed by a subjective test.

Evaluation Metric

FAD, which is based on the Fréchet inception distance (FID), a widely used metric for generative models, is a reference-free evaluation metric. FAD is calculated in three steps. First, audio representations are extracted from both the evaluation dataset and the generated samples. Second, each set of representations is fitted to a multivariate Gaussian distribution. Finally, the FAD score of the generated samples is calculated as the Fréchet distance between these two distributions:

\(F ({\mathcal N}_{r}, {\mathcal N}_{g}) = \| \mu_{r} - \mu_{g} \|^{2} + \mathrm{tr} ( \Sigma_{r} + \Sigma_{g} - 2 \sqrt{\Sigma_{r} \Sigma_{g}} )\)

where \({\mathcal N}_{r} (\mu_{r},\Sigma_{r})\) and \({\mathcal N}_{g} (\mu_{g},\Sigma_{g})\) are the multivariate Gaussians fitted to the embeddings of the evaluation (reference) set and of the generated samples, respectively.
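The formula can be sketched in a few lines of NumPy, assuming the embeddings have already been extracted as arrays of shape (number of clips, embedding dimension). This is an illustrative implementation, not the code used by fadtk; the function name is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of \(\Sigma_{r} \Sigma_{g}\), which are real and non-negative for covariance matrices.

```python
import numpy as np


def frechet_distance(emb_r: np.ndarray, emb_g: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two sets of embeddings.

    emb_r: reference (evaluation set) embeddings, shape (n_r, dim)
    emb_g: generated-sample embeddings, shape (n_g, dim)
    """
    # Fit a multivariate Gaussian to each set.
    mu_r, mu_g = emb_r.mean(axis=0), emb_g.mean(axis=0)
    sigma_r = np.cov(emb_r, rowvar=False)
    sigma_g = np.cov(emb_g, rowvar=False)

    # tr(sqrt(sigma_r @ sigma_g)) equals the sum of the square roots of the
    # eigenvalues of the product; clip small negative values from rounding.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)
```

With identical inputs the distance is zero up to numerical error, and shifting every embedding by a constant vector adds exactly the squared norm of that shift, since the covariances are unchanged.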

FAD is commonly computed using VGGish embeddings. However, prior work suggests that VGGish-based FAD fails to reliably predict the perceptual quality of generative audio (Gui et al., 2024; Tailleur et al., 2024). Tailleur et al. (2024) showed that, among the embeddings tested, PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel) led to the highest correlation with perceptual ratings on the DCASE 2023 Task 7 dataset. Following those results, PANNs CNN14 Wavegram-Logmel, a classification model trained on AudioSet, is used as the embedding for FAD calculation for DCASE Task 7 2024.


Azalea Gui, Hannes Gamper, Sebastian Braun, and Dimitra Emmanouilidou. Adapting frechet audio distance for generative music evaluation. Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1331–1335, 2024.




Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, and Yuki Okamoto. Correlation of fréchet audio distance with human perception of environmental audio is embedding dependent. arXiv:2403.17508, 2024.




Subjective Test

Complementary to the quantitative evaluation, we will evaluate performance through a subjective test. A small number of submissions with the best (lowest) FAD scores will be evaluated. A subset of the generated audio excerpts will be used for the subjective test. Each challenge submission is blindly evaluated by the other challenge teams. The test will follow an incomplete strategy in which the evaluation of one team's sound samples will not incorporate ratings by anyone from that team. The listening evaluation will be separated for each category. Evaluators will listen to generated audio samples from challenge submissions in random order, intermixed with real samples (made by the organizers) and audio samples generated by the baseline model. Foreground fit, Background fit, and Audio quality will be evaluated. For foreground fit, given the current prompts, the system is expected to output one source, possibly with multiple occurrences. For example, when the prompt says 'a small dog is barking', the dog can bark multiple times, but if there are multiple dogs in the audio, the system will be downgraded in the perceptual evaluation. The weighted average of the three ratings uses a Foreground fit : Background fit : Audio quality ratio of 2:1:1. Each entering team must be available to spend up to a total of 4 hours (which can be split among team members) on listening tests during the time frame of June 28-July 5th. This listening window may be pushed back several days depending on how many systems are submitted, because each system takes time to process. The time window will be updated in mid-June. Only contest participants and organizers will be rating the systems.
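As a worked example of the 2:1:1 weighting, the sketch below combines the three ratings into a single score. The function name and the example rating values are hypothetical; the rating scale itself is not specified here.

```python
def overall_rating(foreground_fit: float, background_fit: float, audio_quality: float) -> float:
    """Weighted average with Foreground fit : Background fit : Audio quality = 2:1:1."""
    weights = (2.0, 1.0, 1.0)
    ratings = (foreground_fit, background_fit, audio_quality)
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)


# Hypothetical ratings: foreground 8, background 4, quality 6.
score = overall_rating(8, 4, 6)  # (2*8 + 4 + 6) / 4 = 6.5
```

Because foreground fit carries half of the total weight, a one-point change in foreground fit moves the overall score twice as much as the same change in either of the other two ratings.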


The final ranking is determined by the FAD score and the subjective tests. First, a few submissions with the best (lowest) FAD scores are selected. These are then ranked by the subjective tests alone, regardless of their FAD scores.
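The two-stage procedure can be illustrated as follows. This is a sketch under assumptions: the number of finalists is not stated in the text (it only says "a few"), so `n_finalists` is a hypothetical parameter, and all system names and scores are made up for the example.

```python
from typing import Dict, List


def final_ranking(
    fad_scores: Dict[str, float],
    subjective_scores: Dict[str, float],
    n_finalists: int,
) -> List[str]:
    """Select finalists by FAD (lower is better), then rank them by subjective score."""
    # Stage 1: keep the systems with the lowest FAD.
    finalists = sorted(fad_scores, key=fad_scores.get)[:n_finalists]
    # Stage 2: order finalists by subjective score only (higher is better).
    return sorted(finalists, key=lambda name: subjective_scores[name], reverse=True)


# Made-up example values.
fad = {"A": 30.0, "B": 50.0, "C": 40.0, "D": 70.0}
subjective = {"A": 5.0, "B": 6.0, "C": 7.0}
ranking = final_ranking(fad, subjective, n_finalists=3)  # ["C", "B", "A"]
```

In this example system A has the lowest FAD but finishes last among the finalists, which shows that FAD only acts as a gate for the subjective test.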

Baseline system

The task organizers will provide a baseline system based on AudioLDM. AudioLDM is a Text-To-Audio (TTA) model that comprises VAE, latent diffusion, and multimodal encoders. AudioLDM is one of the foundational models of TTA.


Detailed information can be found in the GitHub repository. The code of the baseline repository is mostly from this repository. If you use this code, please cite the task description paper, which will be published in May, and the following paper:


Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: text-to-audio generation with latent diffusion models. Proceedings of the 40th International Conference on Machine Learning (ICML), pages 21450–21474, 2023.




Also, because the baseline model generates 10-second audio, which is longer than our 4-second configuration, we provide simple code that extracts the loudest 4-second segment. Refer to the Colab submission template in the Submission section.
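One way to select the loudest segment is to slide a 4-second window over the waveform and keep the window with the highest energy. The sketch below is an assumption about how this could be done, not the code shipped in the Colab template; the function name is hypothetical and loudness is approximated by the sum of squared samples.

```python
import numpy as np


def loudest_segment(waveform: np.ndarray, sample_rate: int, duration_s: float = 4.0) -> np.ndarray:
    """Return the contiguous segment of the given duration with maximal energy."""
    n = int(round(duration_s * sample_rate))
    if waveform.shape[0] <= n:
        # Too short: zero-pad at the end to reach the target length.
        return np.pad(waveform, (0, n - waveform.shape[0]))
    # Cumulative energy lets every window sum be computed in O(1).
    energy = np.concatenate(([0.0], np.cumsum(waveform.astype(np.float64) ** 2)))
    window_energy = energy[n:] - energy[:-n]  # entry i covers samples i .. i+n-1
    start = int(np.argmax(window_energy))
    return waveform[start:start + n]
```

For a 10-second clip at 16,000 Hz, `loudest_segment(clip, 16000)` returns 64,000 samples; resampling to the required output sample rate would be a separate step.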


  • Audio output: 4 second, Log mel-band energies (64 bands), 16,000 Hz sampling frequency, frame length 1024 points with 160 hop size
  • Neural Network: Parameter inherited from the original github repo
    • VAE encoder & decoder
    • CLAP
    • Latent Diffusion (U-Net)
    • HiFi-GAN

FAD results for the development dataset

We evaluated the FAD score of the baseline system on the development dataset. As mentioned previously, the FAD score is calculated using the PANNs CNN14 Wavegram-Logmel (panns-wavegram-logmel) embedding. For your own purposes, you may use any reference set or audio embedding for FAD calculation using the given repository (fadtk, FAD ToolKit). The panns-wavegram-logmel embedding of the fadtk library is used for the FAD score evaluation of the challenge. Refer to the repository for more details.

FAD score (PANNs CNN14 Wavegram-Logmel): 61.2761


If you are participating in this task, using fadtk (FAD ToolKit), or using the baseline code, please cite the following paper.


Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, and Yuki Okamoto. Correlation of fréchet audio distance with human perception of environmental audio is embedding dependent. arXiv:2403.17508, 2024.

