Sound Scene Synthesis


Challenge results

Task description

Environmental Sound Scene Synthesis is the task of generating an environmental sound given a textual description. Environmental sounds encompass any non-musical sounds, including unintelligible (non-speech) vocal sounds. This next-generation task expands the scope from last year's Foley sounds to more general sound scenes, and it adds controllability with natural language in the form of text prompts.

Systems ranking

Evaluation Procedure

Fourteen raters judged 24 prompts per system for 6 systems: 4 contestant systems, a baseline, and a Sound-Designer Reference set constructed by hand-mixing recorded sounds. Four of the raters were contestants, and ten were double-blinded organizers and members of their labs.

There were 4 unique foreground prompts for each of 6 foreground categories: alarms, animals, entrances, humans, tools, and vehicles. Five types of background prompts (birds, crowds, traffic, water, and no background) were interspersed irregularly among the foreground categories, so the average background-match rating within a particular foreground category reflects only a few of the background types.

All raters were uninformed about which system generated each sound. Organizers who gave ratings saw only anonymized system numbers in the data until all results and rankings were finalized. (Organizers who had heard sounds during the generation phase did not participate in the ratings; instead, they generated the anonymized system numbers for the data files.) Contestants rated sounds from all systems; because they might recognize sounds from their own system, their self-ratings were removed. For each contestant and each prompt, the self-rating was replaced with that contestant's average rating of the same prompt across all other systems, so that removing self-ratings could not uniquely raise or lower the average of their own system.
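
A minimal sketch of this replacement step is shown below, assuming the ratings live in a pandas DataFrame with columns rater, system, prompt, and rating; these column names are illustrative, not the organizers' actual data format.

```python
import pandas as pd

def replace_self_ratings(ratings: pd.DataFrame, own_system: dict) -> pd.DataFrame:
    """Replace each contestant's rating of their own system with that contestant's
    mean rating of the same prompt across all other systems.

    ratings: DataFrame with columns ['rater', 'system', 'prompt', 'rating'].
    own_system: maps contestant rater id -> the system they submitted.
    """
    out = ratings.copy()
    for rater, system in own_system.items():
        is_self = (out["rater"] == rater) & (out["system"] == system)
        for idx in out.index[is_self]:
            prompt = out.at[idx, "prompt"]
            # Same rater, same prompt, every system except their own.
            others = out[(out["rater"] == rater)
                         & (out["prompt"] == prompt)
                         & (out["system"] != system)]
            out.at[idx, "rating"] = others["rating"].mean()
    return out
```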

Ranking

The final ranking is determined by a weighted average of the three ratings, with weights in the ratio category fit for foreground sound : category fit for background sound : audio quality of 2:1:1.
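
As a concrete check, the weighting can be reproduced in a few lines of Python from the per-dimension averages reported in the table below.

```python
# Minimal sketch of the 2:1:1 weighting used for the final ranking.

def weighted_score(foreground_fit: float, background_fit: float, audio_quality: float) -> float:
    """Weighted average with foreground fit : background fit : audio quality = 2 : 1 : 1."""
    return (2 * foreground_fit + background_fit + audio_quality) / 4

# Reproduces the weighted average reported for Sun_Samsung_task7_1 below.
print(round(weighted_score(5.752, 5.780, 6.045), 3))  # -> 5.832
```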

Perceptual Evaluation Score

For each system, scores in each rating dimension are listed in the order Average | Alarm | Animal | Entrance | Human | Tool | Vehicle. The technical report code follows the submission code in parentheses.

Sound Designer (Ref.), not ranked
  Weighted Average Score: 9.331 | 9.331 | 8.478 | 8.598 | 9.286 | 9.290 | 8.853
  Foreground Fit:         9.768 | 9.768 | 8.554 | 8.875 | 9.500 | 9.393 | 9.143
  Background Fit:         8.786 | 8.786 | 8.375 | 8.286 | 9.125 | 9.339 | 8.554
  Audio Quality:          9.000 | 9.000 | 8.429 | 8.357 | 9.018 | 9.036 | 8.571

Sun_Samsung_task7_1 (SunSamsung2024), official rank 1
  Weighted Average Score: 5.832 | 5.651 | 5.884 | 5.630 | 5.191 | 5.769 | 6.870
  Foreground Fit:         5.752 | 5.571 | 6.414 | 4.500 | 4.964 | 5.307 | 7.757
  Background Fit:         5.780 | 5.743 | 4.729 | 7.079 | 5.311 | 6.139 | 5.679
  Audio Quality:          6.045 | 5.718 | 5.979 | 6.443 | 5.525 | 6.321 | 6.286

Chung_KT_task7_1 (ChungKT2024), official rank 2
  Weighted Average Score: 4.966 | 4.890 | 5.506 | 4.272 | 4.737 | 4.885 | 5.506
  Foreground Fit:         5.025 | 5.218 | 6.079 | 3.750 | 4.436 | 4.907 | 5.761
  Background Fit:         4.623 | 3.746 | 4.446 | 4.807 | 5.361 | 4.071 | 5.307
  Audio Quality:          5.191 | 5.379 | 5.421 | 4.782 | 4.714 | 5.654 | 5.196

Yi_Surrey_task7_1 (YiSURREY2024), official rank 3
  Weighted Average Score: 4.748 | 4.271 | 5.499 | 4.920 | 5.497 | 4.043 | 4.256
  Foreground Fit:         3.733 | 2.775 | 4.968 | 4.671 | 5.543 | 2.339 | 2.104
  Background Fit:         5.133 | 5.261 | 5.454 | 3.261 | 5.236 | 5.375 | 6.214
  Audio Quality:          6.391 | 6.271 | 6.607 | 7.079 | 5.664 | 6.118 | 6.604

DCASE2024_baseline_task7 (LeeGLI2024), baseline, not ranked
  Weighted Average Score: 3.287 | 3.750 | 1.764 | 4.607 | 2.880 | 4.049 | 2.674
  Foreground Fit:         3.280 | 4.071 | 1.054 | 4.946 | 2.036 | 4.893 | 2.679
  Background Fit:         2.797 | 2.768 | 2.446 | 4.036 | 3.196 | 1.696 | 2.643
  Audio Quality:          3.792 | 4.089 | 2.500 | 4.500 | 4.250 | 4.714 | 2.696

Verma_IITMandi_task7_1 (VermaIITMandi2024), official rank 4
  Weighted Average Score: 2.523 | 2.207 | 2.269 | 3.540 | 2.636 | 1.965 | 2.520
  Foreground Fit:         1.792 | 1.021 | 1.504 | 2.621 | 1.814 | 1.414 | 2.379
  Background Fit:         3.078 | 4.168 | 2.575 | 4.575 | 2.811 | 2.204 | 2.132
  Audio Quality:          3.430 | 2.618 | 3.493 | 4.343 | 4.107 | 2.829 | 3.189



FAD Score

Fréchet Audio Distance (FAD) was calculated between embeddings of the evaluation set (the Sound Designer Reference audio) and embeddings of the sounds generated by the submitted systems.
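
For reference, FAD is the Fréchet distance between two Gaussians fitted to the reference and generated embedding sets (here extracted with PANNs, CLAP, or VGGish). A minimal sketch of that computation, independent of the official evaluation toolkit, is:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets of shape (n_clips, embedding_dim).

    Fits a Gaussian (mu, sigma) to each set and returns
    ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 * (sigma_r @ sigma_g)^(1/2)).
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```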

For each system, FAD is listed for PANNs, CLAP, and VGGish embeddings, computed against the evaluation dataset and the development dataset.

Sound Designer (Ref.): reference set; no FAD values reported.

Sun_Samsung_task7_1 (SunSamsung2024), official rank 1, FAD rank 1
  Evaluation dataset:  PANNs 35.985 | CLAP 257.968 | VGGish 5.424
  Development dataset: PANNs 50.179 | CLAP 333.943 | VGGish 7.558

Chung_KT_task7_1 (ChungKT2024), official rank 2, FAD rank 2
  Evaluation dataset:  PANNs 37.092 | CLAP 192.358 | VGGish 5.051
  Development dataset: PANNs 41.580 | CLAP 269.975 | VGGish 4.524

Yi_Surrey_task7_1 (YiSURREY2024), official rank 3, FAD rank 3
  Evaluation dataset:  PANNs 43.304 | CLAP 149.853 | VGGish 6.800
  Development dataset: PANNs 56.985 | CLAP 295.729 | VGGish 6.253

DCASE2024_baseline_task7 (LeeGLI2024), baseline, not ranked
  Evaluation dataset:  PANNs 57.061 | CLAP 321.415 | VGGish 9.713
  Development dataset: PANNs 55.614 | CLAP 367.668 | VGGish 8.069

Verma_IITMandi_task7_1 (VermaIITMandi2024), official rank 4, FAD rank 4
  Evaluation dataset:  PANNs 53.728 | CLAP 313.398 | VGGish 9.208
  Development dataset: PANNs 52.056 | CLAP 348.012 | VGGish 6.520



System characteristics

Summary of the submitted system characteristics.

1. Sound Designer (Ref.): hand-mixed reference set; no system characteristics reported.

2. Sun_Samsung_task7_1 (SunSamsung2024)
   Audio dataset: AudioCaps, audio-alpaca
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 1,047,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained models: TANGO 2, HiFi-GAN
   Subsystem count: 2

3. Chung_KT_task7_1 (ChungKT2024)
   Audio dataset: AudioCaps, WavCaps
   System input: text prompt, noise
   ML method: CLAP, GAN
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 325,963,838 parameters
   Pre-trained models: CLAP, HiFi-GAN

4. Yi_Surrey_task7_1 (YiSURREY2024)
   Audio dataset: AudioSet
   System input: text prompt
   ML method: VAE, T5, U-Net-based latent diffusion model
   Phase reconstruction: BigvGAN
   Acoustic feature: mel-spectrogram
   System complexity: 265,531,016 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: CLAP

5. DCASE2024_baseline_task7 (LeeGLI2024)
   Audio dataset: DCASE2024 Challenge Task 7 Development Dataset
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 416,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: HiFi-GAN

6. Verma_IITMandi_task7_1 (VermaIITMandi2024)
   Audio dataset: DCASE2024 Challenge Task 7 Development Dataset, Custom Dataset
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 671,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: HiFi-GAN



Technical reports

Sound Scene Synthesis Based on GAN Using Contrastive Learning and Effective Time-Frequency Swap Cross Attention Mechanism

Hae Chun Chung, Jae Hoon Jung
KT Corporation, Seoul, Republic of Korea

Abstract

This technical report outlines the efforts of KT Corporation's Acoustic Processing Project in addressing sound scene synthesis, DCASE 2024 Challenge Task 7. The task's objective is to develop a generative system capable of synthesizing environmental sounds from text descriptions. Our system is designed in three stages: embedding the text description, generating a mel spectrogram conditioned on the text embedding, and converting the mel spectrogram into an audio waveform. Our main focus lies on training the model for the second stage. We employed a generative adversarial network (GAN) and meticulously designed the training process and architecture. We utilized various contrastive losses and introduced a single-double-triple attention mechanism to accurately capture text descriptions and learn high-quality features. To mitigate the increase in GPU memory consumption caused by the expanded attention mechanism, we designed a novel time-frequency swap cross-attention mechanism. Our system achieved an FAD score more than 30% lower than that of the DCASE baseline, demonstrating significant performance improvements in text-to-audio generation.

System characteristics
System input: text prompt, noise
Machine learning method: CLAP, GAN
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
System complexity: 325,963,838 parameters
PDF

Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent

Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto
CNRS, École Centrale Nantes, Nantes Université, Nantes, France; Gaudio Lab, Inc., Seoul, South Korea; KAIST, Daejeon, South Korea; Carnegie Mellon University, Pennsylvania, USA; Doshisha University, Kyoto, Japan; Ritsumeikan University, Kyoto, Japan

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
PDF

Sound Scene Synthesis With AudioLDM and TANGO2 for DCASE 2024 Task7

Xie ZhiDong, Li XinYu, Liu HaiCheng, Zou XiaoYan, Sun Yu
Samsung Research China-Nanjing, Nanjing, China

Abstract

This report describes our submission for DCASE 2024 Challenge Task 7, a system for sound scene synthesis. Our system is based on AudioLDM and TANGO2. Experiments are conducted on the dataset of DCASE 2024 Challenge Task 7. The Fréchet Audio Distance (FAD) between the sound generated by our system and the development set is 60.64.

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
Subsystem count: 2
System complexity: 1,047,000,000 parameters
PDF

Sound Scene Synthesis Based on Fine-Tuned Latent Diffusion Model for DCASE Challenge 2024 Task 7

Sagnik Ghosh, Gaurav Verma, Siddharath Narayan Shakya, Shubham Sharma, Shivesh Singh
Indian Institute of Technology Mandi, Kamand, Mandi, India

Abstract

With the advancements in generative AI, text-to-audio systems have become increasingly popular, transforming audio generation across various domains such as music and speech. These systems enable the generation of high-quality audio from textual descriptions, offering freedom and control when producing a variety of audio. This technical report explores advancements in deep learning applied to sound generation, focusing specifically on environmental sound scene generation. Our approach leverages a Text-to-Audio (TTA) system with Contrastive Language-Audio Pretraining (CLAP), a conditional latent diffusion model (LDM), a Variational Autoencoder (VAE) decoder, and a HiFi-GAN vocoder, where the LDM learns continuous audio representations from CLAP embeddings, enhancing synthesis control through natural language prompts. We also fine-tuned the diffusion model on a custom dataset created from two audio datasets in order to improve generation quality.
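
For orientation, the sketch below outlines how such a CLAP-conditioned latent diffusion pipeline fits together at inference time. The module handles (clap_text_encoder, latent_diffusion, vae_decoder, vocoder) and the sample() call are hypothetical stand-ins for the pretrained components, not the authors' code.

```python
import torch

@torch.no_grad()
def generate_audio(prompt: str, clap_text_encoder, latent_diffusion, vae_decoder, vocoder,
                   num_steps: int = 200, guidance_scale: float = 3.0) -> torch.Tensor:
    # 1. Embed the text prompt with the (frozen) CLAP text encoder.
    text_emb = clap_text_encoder(prompt)                            # (1, d_text)

    # 2. Sample a latent conditioned on the text embedding by running the
    #    reverse diffusion process (classifier-free guidance assumed).
    latent = latent_diffusion.sample(cond=text_emb,
                                     steps=num_steps,
                                     guidance_scale=guidance_scale)  # (1, c, t, f)

    # 3. Decode the latent into a mel-spectrogram with the VAE decoder.
    mel = vae_decoder(latent)                                        # (1, n_mels, frames)

    # 4. Convert the mel-spectrogram to a waveform with the vocoder (HiFi-GAN).
    waveform = vocoder(mel)                                          # (1, n_samples)
    return waveform
```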

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
System complexity: 671,000,000 parameters
PDF

Diffusion Based Sound Scene Synthesis for DCASE Challenge 2024 Task 7

Yi Yuan, Haohe Liu, Xubo Liu, Mark D. Plumbley, Wenwu Wang
University of Surrey, Guildford, United Kingdom

Abstract

Sound scene synthesis aims to generate a variety of environment-related sounds within a specific scene. In this work, we propose a system for DCASE 2024 Challenge Task 7. The proposed system is based on the official baseline model AudioLDM, a diffusion-based text-to-audio generation model. The system is first trained on large-scale datasets and then adapted to this task via transfer learning. Addressing the challenge of having no target audio data, we implemented an automated pipeline to synthesize audio and generate corresponding captions that mirror the semantic structure of the task. Despite the absence of dedicated training and testing sets for this task, our robust audio synthesis model effectively adapts to the given conditions, fulfilling all the task requirements. Our system achieved a Fréchet Audio Distance (FAD) score of 55.1, surpassing the baseline system's FAD score of 61.3 as calculated by the official evaluation toolkit.

System characteristics
System input: text prompt
Machine learning method: VAE, T5, U-Net-based latent diffusion model
Phase reconstruction method: BigvGAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
System complexity: 265,531,016 parameters
PDF