Sound Scene Synthesis


Challenge results

Task description

Environmental Sound Scene Synthesis is the task of generating an environmental sound given a textual description. Environmental sounds encompass any non-musical sounds, including unintelligible (non-speech) vocal sounds. This next-generation task expands the scope from last year's Foley sounds to more general sound scenes, and it adds controllability with natural language in the form of text prompts.

Systems ranking

Evaluation Procedure

Fourteen raters judged 24 prompts per system for 6 systems: 4 contestant systems, a baseline, and a Sound-Designer Reference set constructed by hand-mixing recorded sounds. Four of the raters were contestants, and ten were double-blinded organizers and members of their labs.

There were 4 unique foreground prompts for each of 6 foreground categories: alarms, animals, entrances, humans, tools, and vehicles. Five types of background prompts (birds, crowds, traffic, water, and no background) were interspersed irregularly among the foreground categories, so the average background-match rating within a particular foreground category reflects only a few of the background types.

All raters were uninformed about which system generated each sound. Organizers who gave ratings saw only anonymized system numbers in the data until all results and rankings were finalized. (Organizers who had heard sounds during the generation phase did not participate in the ratings; instead, they generated the anonymized system numbers for the data files.) Contestants rated sounds from all systems; because they might recognize sounds from their own system, their self-ratings were removed. For each contestant and each prompt, the self-rating was replaced with that contestant's average rating of the same prompt across all other systems, so that removing self-ratings could not uniquely raise or lower the average of their own system.
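
A minimal sketch of this replacement step is shown below, assuming the ratings live in a pandas DataFrame with columns rater, system, prompt, and rating; these column names are illustrative, not the organizers' actual data format.

```python
import pandas as pd

def replace_self_ratings(ratings: pd.DataFrame, own_system: dict) -> pd.DataFrame:
    """Replace each contestant's rating of their own system with that contestant's
    mean rating of the same prompt across all other systems.

    ratings: DataFrame with columns ['rater', 'system', 'prompt', 'rating'].
    own_system: maps contestant rater id -> the system they submitted.
    """
    out = ratings.copy()
    for rater, system in own_system.items():
        is_self = (out["rater"] == rater) & (out["system"] == system)
        for idx in out.index[is_self]:
            prompt = out.at[idx, "prompt"]
            # Same rater, same prompt, every system except their own.
            others = out[(out["rater"] == rater)
                         & (out["prompt"] == prompt)
                         & (out["system"] != system)]
            out.at[idx, "rating"] = others["rating"].mean()
    return out
```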

Ranking

The final ranking is determined by a weighted average of the three ratings, with weights in the ratio category fit for foreground sound : category fit for background sound : audio quality of 2:1:1.
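
As a concrete check, the weighting can be reproduced in a few lines of Python from the per-dimension averages reported in the table below.

```python
# Minimal sketch of the 2:1:1 weighting used for the final ranking.

def weighted_score(foreground_fit: float, background_fit: float, audio_quality: float) -> float:
    """Weighted average with foreground fit : background fit : audio quality = 2 : 1 : 1."""
    return (2 * foreground_fit + background_fit + audio_quality) / 4

# Reproduces the weighted average reported for Sun_Samsung_task7_1 below.
print(round(weighted_score(5.752, 5.780, 6.045), 3))  # -> 5.832
```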

Perceptual Evaluation Score

For each system, scores in each rating dimension are listed in the order Average | Alarm | Animal | Entrance | Human | Tool | Vehicle. The technical report code follows the submission code in parentheses.

Sound Designer (Ref.), not ranked
  Weighted Average Score: 9.331 | 9.331 | 8.478 | 8.598 | 9.286 | 9.290 | 8.853
  Foreground Fit:         9.768 | 9.768 | 8.554 | 8.875 | 9.500 | 9.393 | 9.143
  Background Fit:         8.786 | 8.786 | 8.375 | 8.286 | 9.125 | 9.339 | 8.554
  Audio Quality:          9.000 | 9.000 | 8.429 | 8.357 | 9.018 | 9.036 | 8.571

Sun_Samsung_task7_1 (SunSamsung2024), official rank 1
  Weighted Average Score: 5.832 | 5.651 | 5.884 | 5.630 | 5.191 | 5.769 | 6.870
  Foreground Fit:         5.752 | 5.571 | 6.414 | 4.500 | 4.964 | 5.307 | 7.757
  Background Fit:         5.780 | 5.743 | 4.729 | 7.079 | 5.311 | 6.139 | 5.679
  Audio Quality:          6.045 | 5.718 | 5.979 | 6.443 | 5.525 | 6.321 | 6.286

Chung_KT_task7_1 (ChungKT2024), official rank 2
  Weighted Average Score: 4.966 | 4.890 | 5.506 | 4.272 | 4.737 | 4.885 | 5.506
  Foreground Fit:         5.025 | 5.218 | 6.079 | 3.750 | 4.436 | 4.907 | 5.761
  Background Fit:         4.623 | 3.746 | 4.446 | 4.807 | 5.361 | 4.071 | 5.307
  Audio Quality:          5.191 | 5.379 | 5.421 | 4.782 | 4.714 | 5.654 | 5.196

Yi_Surrey_task7_1 (YiSURREY2024), official rank 3
  Weighted Average Score: 4.748 | 4.271 | 5.499 | 4.920 | 5.497 | 4.043 | 4.256
  Foreground Fit:         3.733 | 2.775 | 4.968 | 4.671 | 5.543 | 2.339 | 2.104
  Background Fit:         5.133 | 5.261 | 5.454 | 3.261 | 5.236 | 5.375 | 6.214
  Audio Quality:          6.391 | 6.271 | 6.607 | 7.079 | 5.664 | 6.118 | 6.604

DCASE2024_baseline_task7 (LeeGLI2024), baseline, not ranked
  Weighted Average Score: 3.287 | 3.750 | 1.764 | 4.607 | 2.880 | 4.049 | 2.674
  Foreground Fit:         3.280 | 4.071 | 1.054 | 4.946 | 2.036 | 4.893 | 2.679
  Background Fit:         2.797 | 2.768 | 2.446 | 4.036 | 3.196 | 1.696 | 2.643
  Audio Quality:          3.792 | 4.089 | 2.500 | 4.500 | 4.250 | 4.714 | 2.696

Verma_IITMandi_task7_1 (VermaIITMandi2024), official rank 4
  Weighted Average Score: 2.523 | 2.207 | 2.269 | 3.540 | 2.636 | 1.965 | 2.520
  Foreground Fit:         1.792 | 1.021 | 1.504 | 2.621 | 1.814 | 1.414 | 2.379
  Background Fit:         3.078 | 4.168 | 2.575 | 4.575 | 2.811 | 2.204 | 2.132
  Audio Quality:          3.430 | 2.618 | 3.493 | 4.343 | 4.107 | 2.829 | 3.189



FAD Score

Fréchet Audio Distance (FAD) was calculated between embeddings of the evaluation set (the Sound Designer Reference audio) and embeddings of the sounds generated by the submitted systems.
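
For reference, FAD is the Fréchet distance between two Gaussians fitted to the reference and generated embedding sets (here extracted with PANNs, CLAP, or VGGish). A minimal sketch of that computation, independent of the official evaluation toolkit, is:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets of shape (n_clips, embedding_dim).

    Fits a Gaussian (mu, sigma) to each set and returns
    ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 * (sigma_r @ sigma_g)^(1/2)).
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```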

For each system, FAD is listed for PANNs, CLAP, and VGGish embeddings, computed against the evaluation dataset and the development dataset.

Sound Designer (Ref.): reference set; no FAD values reported.

Sun_Samsung_task7_1 (SunSamsung2024), official rank 1, FAD rank 1
  Evaluation dataset:  PANNs 35.985 | CLAP 257.968 | VGGish 5.424
  Development dataset: PANNs 50.179 | CLAP 333.943 | VGGish 7.558

Chung_KT_task7_1 (ChungKT2024), official rank 2, FAD rank 2
  Evaluation dataset:  PANNs 37.092 | CLAP 192.358 | VGGish 5.051
  Development dataset: PANNs 41.580 | CLAP 269.975 | VGGish 4.524

Yi_Surrey_task7_1 (YiSURREY2024), official rank 3, FAD rank 3
  Evaluation dataset:  PANNs 43.304 | CLAP 149.853 | VGGish 6.800
  Development dataset: PANNs 56.985 | CLAP 295.729 | VGGish 6.253

DCASE2024_baseline_task7 (LeeGLI2024), baseline, not ranked
  Evaluation dataset:  PANNs 57.061 | CLAP 321.415 | VGGish 9.713
  Development dataset: PANNs 55.614 | CLAP 367.668 | VGGish 8.069

Verma_IITMandi_task7_1 (VermaIITMandi2024), official rank 4, FAD rank 4
  Evaluation dataset:  PANNs 53.728 | CLAP 313.398 | VGGish 9.208
  Development dataset: PANNs 52.056 | CLAP 348.012 | VGGish 6.520



System characteristics

Summary of the submitted system characteristics.

1. Sound Designer (Ref.): hand-mixed reference set; no system characteristics reported.

2. Sun_Samsung_task7_1 (SunSamsung2024)
   Audio dataset: AudioCaps, audio-alpaca
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 1,047,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained models: TANGO 2, HiFi-GAN
   Subsystem count: 2

3. Chung_KT_task7_1 (ChungKT2024)
   Audio dataset: AudioCaps, WavCaps
   System input: text prompt, noise
   ML method: CLAP, GAN
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 325,963,838 parameters
   Pre-trained models: CLAP, HiFi-GAN

4. Yi_Surrey_task7_1 (YiSURREY2024)
   Audio dataset: AudioSet
   System input: text prompt
   ML method: VAE, T5, U-Net-based latent diffusion model
   Phase reconstruction: BigvGAN
   Acoustic feature: mel-spectrogram
   System complexity: 265,531,016 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: CLAP

5. DCASE2024_baseline_task7 (LeeGLI2024)
   Audio dataset: DCASE2024 Challenge Task 7 Development Dataset
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 416,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: HiFi-GAN

6. Verma_IITMandi_task7_1 (VermaIITMandi2024)
   Audio dataset: DCASE2024 Challenge Task 7 Development Dataset, Custom Dataset
   System input: text prompt
   ML method: VAE, CLAP, U-Net-based latent diffusion model
   Phase reconstruction: HiFi-GAN
   Acoustic feature: mel-spectrogram
   System complexity: 671,000,000 parameters
   Data augmentation: conditioning augmentation
   Pre-trained model: HiFi-GAN



Technical reports

Sound Scene Synthesis Based on GAN Using Contrastive Learning and Effective Time-Frequency Swap Cross Attention Mechanism

Hae Chun Chung, Jae Hoon Jung
KT Corporation, Seoul, Republic of Korea

Abstract

This technical report outlines the efforts of KT Corporation's Acoustic Processing Project in addressing sound scene synthesis, DCASE 2024 Challenge Task 7. The task's objective is to develop a generative system capable of synthesizing environmental sounds from text descriptions. Our system is designed in three stages: embedding the text description, generating a mel spectrogram conditioned on the text embedding, and converting the mel spectrogram into an audio waveform. Our main focus lies on training the model for the second stage. We employed a generative adversarial network (GAN) and meticulously designed the training process and architecture. We utilized various contrastive losses and introduced a single-double-triple attention mechanism to accurately capture text descriptions and learn high-quality features. To mitigate the increase in GPU memory consumption caused by the expanded attention mechanism, we designed a novel time-frequency swap cross-attention mechanism. Our system achieved an FAD score more than 30% lower than that of the DCASE baseline, demonstrating significant performance improvements in text-to-audio generation.

System characteristics
System input: text prompt, noise
Machine learning method: CLAP, GAN
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
System complexity: 325,963,838 parameters
PDF

Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent

Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto
CNRS, École Centrale Nantes, Nantes Université, Nantes, France; Gaudio Lab, Inc., Seoul, South Korea; KAIST, Daejeon, South Korea; Carnegie Mellon University, Pennsylvania, USA; Doshisha University, Kyoto, Japan; Ritsumeikan University, Kyoto, Japan

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
PDF

Sound Scene Synthesis With AudioLDM and TANGO2 for DCASE 2024 Task7

Xie ZhiDong, Li XinYu, Liu HaiCheng, Zou XiaoYan, Sun Yu
Samsung Research China-Nanjing, Nanjing, China

Abstract

This report describes our submission for DCASE 2024 Challenge Task 7, a system for sound scene synthesis. Our system is based on AudioLDM and TANGO2. Experiments are conducted on the dataset of DCASE 2024 Challenge Task 7. The Fréchet Audio Distance (FAD) between the sound generated by our system and the development set is 60.64.

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
Subsystem count: 2
System complexity: 1,047,000,000 parameters
PDF

Sound Scene Synthesis Based on Fine-Tuned Latent Diffusion Model for DCASE Challenge 2024 Task 7

Sagnik Ghosh, Gaurav Verma, Siddharath Narayan Shakya, Shubham Sharma, Shivesh Singh
Indian Institute of Technology Mandi, Kamand, Mandi, India

Abstract

With the advancements in generative AI, text-to-audio systems have become increasingly popular, transforming audio generation across various domains such as music and speech. These systems enable the generation of high-quality audio from textual descriptions, offering freedom and control when producing a variety of audio. This technical report explores advancements in deep learning applied to sound generation, focusing specifically on environmental sound scene generation. Our approach leverages a Text-to-Audio (TTA) system with Contrastive Language-Audio Pretraining (CLAP), a conditional latent diffusion model (LDM), a Variational Autoencoder (VAE) decoder, and a HiFi-GAN vocoder, where the LDM learns continuous audio representations from CLAP embeddings, enhancing synthesis control through natural language prompts. We also fine-tuned the diffusion model on a custom dataset created from two audio datasets in order to improve generation quality.
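
For orientation, the sketch below outlines how such a CLAP-conditioned latent diffusion pipeline fits together at inference time. The module handles (clap_text_encoder, latent_diffusion, vae_decoder, vocoder) and the sample() call are hypothetical stand-ins for the pretrained components, not the authors' code.

```python
import torch

@torch.no_grad()
def generate_audio(prompt: str, clap_text_encoder, latent_diffusion, vae_decoder, vocoder,
                   num_steps: int = 200, guidance_scale: float = 3.0) -> torch.Tensor:
    # 1. Embed the text prompt with the (frozen) CLAP text encoder.
    text_emb = clap_text_encoder(prompt)                            # (1, d_text)

    # 2. Sample a latent conditioned on the text embedding by running the
    #    reverse diffusion process (classifier-free guidance assumed).
    latent = latent_diffusion.sample(cond=text_emb,
                                     steps=num_steps,
                                     guidance_scale=guidance_scale)  # (1, c, t, f)

    # 3. Decode the latent into a mel-spectrogram with the VAE decoder.
    mel = vae_decoder(latent)                                        # (1, n_mels, frames)

    # 4. Convert the mel-spectrogram to a waveform with the vocoder (HiFi-GAN).
    waveform = vocoder(mel)                                          # (1, n_samples)
    return waveform
```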

System characteristics
System input: text prompt
Machine learning method: VAE, CLAP, U-Net-based latent diffusion model
Phase reconstruction method: HiFi-GAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
System complexity: 671,000,000 parameters
PDF

Diffusion Based Sound Scene Synthesis for DCASE Challenge 2024 Task 7

Yi Yuan, Haohe Liu, Xubo Liu, Mark D. Plumbley, Wenwu Wang
University of Surrey, Guildford, United Kingdom

Abstract

Sound scene synthesis aims to generate a variety of environment-related sounds within a specific scene. In this work, we propose a system for DCASE 2024 Challenge Task 7. The proposed system is based on the official baseline model AudioLDM, a diffusion-based text-to-audio generation model. The system is first trained on large-scale datasets and then adapted to this task via transfer learning. Addressing the challenge of having no target audio data, we implemented an automated pipeline to synthesize audio and generate corresponding captions that mirror the semantic structure of the task. Despite the absence of dedicated training and testing sets for this task, our robust audio synthesis model effectively adapts to the given conditions, fulfilling all the task requirements. Our system achieved a Fréchet Audio Distance (FAD) score of 55.1, surpassing the baseline system's FAD score of 61.3 as calculated by the official evaluation toolkit.

System characteristics
System input: text prompt
Machine learning method: VAE, T5, U-Net-based latent diffusion model
Phase reconstruction method: BigvGAN
Acoustic features: mel-spectrogram
Data augmentation: conditioning augmentation
System complexity: 265,531,016 parameters
PDF