Task description
Environmental Sound Scene Synthesis is the task of generating an environmental sound scene from a textual description. Here, environmental sounds encompass any non-musical sounds, excluding intelligible speech (unintelligible vocal sounds are included). This next-generation task expands the scope from last year’s Foley sounds to more general sound scenes, and adds controllability through natural-language text prompts.
Systems ranking
Evaluation Procedure
Fourteen raters judged 24 prompts per system for six systems: the four contestant systems, the baseline, and a Sound Designer Reference set constructed by hand-mixing recorded sounds. Four of the raters were contestants, and ten were double-blinded organizers and their lab members.
There were four unique foreground prompts for each of six foreground categories: alarms, animals, entrances, humans, tools, and vehicles. Five types of background prompts (birds, crowds, traffic, water, and no background) were interspersed irregularly among the categories, so the average background-fit rating within a given foreground category reflects only a few of the background types.
All raters were blind to which system generated each sound. Organizers who gave ratings saw only anonymized system numbers in the data until all results and rankings were finalized. (The organizers who had heard sounds during the generation phase did not provide ratings; instead, they generated the anonymized system numbers for the data files.) Contestants rated sounds from all systems; to avoid bias in case they recognized sounds from their own system, their self-ratings were removed. For each contestant and each prompt, the self-rating was replaced with that contestant’s average rating of the same prompt across all other systems, so that removing self-ratings could not uniquely raise or lower the average of their own system, as illustrated in the sketch below.
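The self-rating replacement can be expressed compactly. The sketch below is a minimal illustration assuming a long-format ratings table with hypothetical column names (`rater`, `prompt`, `system`, `rating`, `own_system`, the last being empty for organizer raters); it is not the organizers' actual analysis code.

```python
import pandas as pd

def neutralize_self_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Replace each contestant's rating of their own system with that contestant's
    mean rating of the same prompt across all other systems, so that handling
    self-ratings cannot uniquely raise or lower their own system's average."""
    out = ratings.reset_index(drop=True)
    is_self = out["system"] == out["own_system"]

    # Mean rating per (rater, prompt), computed over the other systems only.
    other_means = (
        out.loc[~is_self]
        .groupby(["rater", "prompt"], as_index=False)["rating"]
        .mean()
        .rename(columns={"rating": "other_mean"})
    )

    out = out.merge(other_means, on=["rater", "prompt"], how="left")
    out.loc[is_self, "rating"] = out.loc[is_self, "other_mean"]
    return out.drop(columns="other_mean")
```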
Ranking
The final ranking is determined by a weighted average of the three ratings (category fit for foreground sound, category fit for background sound, and audio quality) with a weight ratio of 2:1:1.
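As a concrete check of the weighting, a minimal sketch: the 2:1:1 ratio means the foreground-fit rating counts twice, and each weighted per-category score in the tables below can be recomputed from the corresponding fit and quality scores.

```python
def weighted_score(foreground_fit: float, background_fit: float, audio_quality: float) -> float:
    """Weighted average of the three ratings with a 2:1:1 ratio."""
    return (2.0 * foreground_fit + background_fit + audio_quality) / 4.0

# Example: the "Alarm" column for the Sound Designer reference in the tables below.
print(weighted_score(9.768, 8.786, 9.000))  # ~9.3305, reported as 9.331
```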
Perceptual Evaluation Score
Weighted Average Score

| Submission Code | Technical Report | Official Rank | Average Score | Alarm | Animal | Entrance | Human | Tool | Vehicle |
|---|---|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | | | 8.793 | 9.331 | 8.478 | 8.598 | 9.286 | 9.290 | 8.853 |
| Sun_Samsung_task7_1 | SunSamsung2024 | 1 | 5.832 | 5.651 | 5.884 | 5.630 | 5.191 | 5.769 | 6.870 |
| Chung_KT_task7_1 | ChungKT2024 | 2 | 4.966 | 4.890 | 5.506 | 4.272 | 4.737 | 4.885 | 5.506 |
| Yi_Surrey_task7_1 | YiSURREY2024 | 3 | 4.748 | 4.271 | 5.499 | 4.920 | 5.497 | 4.043 | 4.256 |
| DCASE2024_baseline_task7 | LeeGLI2024 | | 3.287 | 3.750 | 1.764 | 4.607 | 2.880 | 4.049 | 2.674 |
| Verma_IITMandi_task7_1 | VermaIITMandi2024 | 4 | 2.523 | 2.207 | 2.269 | 3.540 | 2.636 | 1.965 | 2.520 |

Foreground Fit

| Submission Code | Average Score | Alarm | Animal | Entrance | Human | Tool | Vehicle |
|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | 9.768 | 9.768 | 8.554 | 8.875 | 9.500 | 9.393 | 9.143 |
| Sun_Samsung_task7_1 | 5.752 | 5.571 | 6.414 | 4.500 | 4.964 | 5.307 | 7.757 |
| Chung_KT_task7_1 | 5.025 | 5.218 | 6.079 | 3.750 | 4.436 | 4.907 | 5.761 |
| Yi_Surrey_task7_1 | 3.733 | 2.775 | 4.968 | 4.671 | 5.543 | 2.339 | 2.104 |
| DCASE2024_baseline_task7 | 3.280 | 4.071 | 1.054 | 4.946 | 2.036 | 4.893 | 2.679 |
| Verma_IITMandi_task7_1 | 1.792 | 1.021 | 1.504 | 2.621 | 1.814 | 1.414 | 2.379 |

Background Fit

| Submission Code | Average Score | Alarm | Animal | Entrance | Human | Tool | Vehicle |
|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | 8.786 | 8.786 | 8.375 | 8.286 | 9.125 | 9.339 | 8.554 |
| Sun_Samsung_task7_1 | 5.780 | 5.743 | 4.729 | 7.079 | 5.311 | 6.139 | 5.679 |
| Chung_KT_task7_1 | 4.623 | 3.746 | 4.446 | 4.807 | 5.361 | 4.071 | 5.307 |
| Yi_Surrey_task7_1 | 5.133 | 5.261 | 5.454 | 3.261 | 5.236 | 5.375 | 6.214 |
| DCASE2024_baseline_task7 | 2.797 | 2.768 | 2.446 | 4.036 | 3.196 | 1.696 | 2.643 |
| Verma_IITMandi_task7_1 | 3.078 | 4.168 | 2.575 | 4.575 | 2.811 | 2.204 | 2.132 |

Audio Quality

| Submission Code | Average Score | Alarm | Animal | Entrance | Human | Tool | Vehicle |
|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | 9.000 | 9.000 | 8.429 | 8.357 | 9.018 | 9.036 | 8.571 |
| Sun_Samsung_task7_1 | 6.045 | 5.718 | 5.979 | 6.443 | 5.525 | 6.321 | 6.286 |
| Chung_KT_task7_1 | 5.191 | 5.379 | 5.421 | 4.782 | 4.714 | 5.654 | 5.196 |
| Yi_Surrey_task7_1 | 6.391 | 6.271 | 6.607 | 7.079 | 5.664 | 6.118 | 6.604 |
| DCASE2024_baseline_task7 | 3.792 | 4.089 | 2.500 | 4.500 | 4.250 | 4.714 | 2.696 |
| Verma_IITMandi_task7_1 | 3.430 | 2.618 | 3.493 | 4.343 | 4.107 | 2.829 | 3.189 |
FAD Score
Fréchet Audio Distance (FAD) was calculated using embeddings of the reference audio and embeddings of the sounds generated by the submitted systems; scores are reported against both the evaluation set (the Sound Designer Reference audio) and the development dataset. A minimal sketch of the computation is given below, followed by the scores.
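For reference, FAD is the Fréchet distance between two multivariate Gaussians fitted to the two sets of embeddings. The sketch below is a minimal illustration assuming the embeddings (e.g. from PANNs, CLAP, or VGGish) have already been extracted into NumPy arrays; it is not the official evaluation toolkit.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between reference and generated embedding sets of shape (n_clips, dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```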
Eval. columns give FAD against the evaluation dataset; Dev. columns give FAD against the development dataset.

| Submission Code | Technical Report | Official Rank | FAD Rank | FAD (PANNs), Eval. | FAD (CLAP), Eval. | FAD (VGGish), Eval. | FAD (PANNs), Dev. | FAD (CLAP), Dev. | FAD (VGGish), Dev. |
|---|---|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | | | | | | | | | |
| Sun_Samsung_task7_1 | SunSamsung2024 | 1 | 1 | 35.985 | 257.968 | 5.424 | 50.179 | 333.943 | 7.558 |
| Chung_KT_task7_1 | ChungKT2024 | 2 | 2 | 37.092 | 192.358 | 5.051 | 41.580 | 269.975 | 4.524 |
| Yi_Surrey_task7_1 | YiSURREY2024 | 3 | 3 | 43.304 | 149.853 | 6.800 | 56.985 | 295.729 | 6.253 |
| DCASE2024_baseline_task7 | LeeGLI2024 | | | 57.061 | 321.415 | 9.713 | 55.614 | 367.668 | 8.069 |
| Verma_IITMandi_task7_1 | VermaIITMandi2024 | 4 | 4 | 53.728 | 313.398 | 9.208 | 52.056 | 348.012 | 6.520 |
System characteristics
Summary of the submitted system characteristics.
| Submission Code | Technical Report | Audio Dataset | System input | ML method | Phase reconstruction | Acoustic feature | System Complexity (parameters) | Data Augmentation | Pre-trained Model | Subsystem Count |
|---|---|---|---|---|---|---|---|---|---|---|
| Sound Designer (Ref.) | | | | | | | | | | |
| Sun_Samsung_task7_1 | SunSamsung2024 | AudioCaps, audio-alpaca | text prompt | VAE, CLAP, U-Net-based latent diffusion model | HiFi-GAN | mel-spectrogram | 1047000000 | conditioning augmentation | TANGO 2, HiFi-GAN | 2 |
| Chung_KT_task7_1 | ChungKT2024 | AudioCaps, WavCaps | text prompt, noise | CLAP, GAN | HiFi-GAN | mel-spectrogram | 325963838 | | CLAP, HiFi-GAN | |
| Yi_Surrey_task7_1 | YiSURREY2024 | AudioSet | text prompt | VAE, T5, U-Net-based latent diffusion model | BigVGAN | mel-spectrogram | 265531016 | conditioning augmentation | CLAP | |
| DCASE2024_baseline_task7 | LeeGLI2024 | DCASE2024 Challenge Task 7 Development Dataset | text prompt | VAE, CLAP, U-Net-based latent diffusion model | HiFi-GAN | mel-spectrogram | 416000000 | conditioning augmentation | HiFi-GAN | |
| Verma_IITMandi_task7_1 | VermaIITMandi2024 | DCASE2024 Challenge Task 7 Development Dataset, Custom Dataset | text prompt | VAE, CLAP, U-Net-based latent diffusion model | HiFi-GAN | mel-spectrogram | 671000000 | conditioning augmentation | HiFi-GAN | |
Technical reports
Sound Scene Synthesis Based on GAN Using Contrastive Learning and Effective Time-Frequency Swap Cross Attention Mechanism
Hae Chun Chung, Jae Hoon Jung
KT Corporation, Seoul, Republic of Korea
Chung_KT_task7_1
Abstract
This technical report outlines the efforts of KT Corporation's Acoustic Processing Project in addressing sound scene synthesis, DCASE 2024 Challenge Task 7. The task's objective is to develop a generative system capable of synthesizing environmental sounds from text descriptions. Our system is designed in three stages to achieve this goal: embedding the text description, generating a mel spectrogram conditioned on the text embedding, and converting the mel spectrogram into an audio waveform. Our main focus lies on training the model for the second stage. We employed a generative adversarial network (GAN) and meticulously designed the training process and architecture. We utilized various contrastive losses and introduced the single-double-triple attention mechanism to accurately capture text descriptions and learn high-quality features. To mitigate the rise in GPU memory consumption caused by the expanded attention mechanism, we designed a novel time-frequency swap cross-attention mechanism. Our system achieved a FAD score more than 30% lower than that of the DCASE baseline, demonstrating significant performance improvements in text-to-audio generation.
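As a rough sketch of the three-stage structure described in the abstract, the snippet below strings together a text encoder, a conditional mel generator, and a vocoder. All module names are hypothetical placeholders rather than the authors' code, and the contrastive losses and time-frequency swap cross-attention mechanism are internal to the generator and not reproduced here.

```python
import torch

@torch.no_grad()
def synthesize(prompt: str, text_encoder, generator, vocoder, noise_dim: int = 128) -> torch.Tensor:
    """Stage 1: text -> embedding; stage 2: embedding (+ noise) -> mel; stage 3: mel -> waveform.

    text_encoder, generator, and vocoder are hypothetical stand-ins for a CLAP-style
    text encoder, a text-conditioned GAN generator, and a HiFi-GAN vocoder."""
    text_emb = text_encoder(prompt)        # (1, emb_dim) text embedding
    z = torch.randn(1, noise_dim)          # stochastic noise input to the GAN
    mel = generator(text_emb, z)           # (1, n_mels, n_frames) mel-spectrogram
    return vocoder(mel).squeeze(0)         # waveform samples
```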
System characteristics
System input | text prompt, noise |
Machine learning method | CLAP, GAN |
Phase reconstruction method | HiFi-GAN |
Acoustic features | mel-spectrogram |
System complexity | 325963838 parameters |
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent
Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto
CNRS, École Centrale Nantes, Nantes Université, Nantes, France; Gaudio Lab, Inc., Seoul, South Korea; KAIST, Daejeon, South Korea; Carnegie Mellon University, Pennsylvania, USA; Doshisha University, Kyoto, Japan; Ritsumeikan University, Kyoto, Japan
DCASE2024_baseline_task7
System characteristics
System input | text prompt |
Machine learning method | VAE, CLAP, U-Net-based latent diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | mel-spectrogram |
Data augmentation | conditioning augmentation |
Sound Scene Synthesis With AudioLDM and TANGO2 for DCASE 2024 Task7
Xie ZhiDong, Li XinYu, Liu HaiCheng, Zou XiaoYan, Sun Yu
Samsung Research China-Nanjing, Nanjing, China
Sun_Samsung_task7_1
Abstract
This report describes our submission for DCASE 2024 Challenge Task 7, a system for sound scene synthesis. Our system is based on AudioLDM and TANGO2. Experiments are conducted on the dataset of DCASE 2024 Challenge Task 7. The Fréchet Audio Distance (FAD) between the sounds generated by our system and the development set is 60.64.
System characteristics
System input | text prompt |
Machine learning method | VAE, CLAP, U-Net-based latent diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | mel-spectrogram |
Data augmentation | conditioning augmentation |
Subsystem count | 2 |
System complexity | 1047000000 parameters |
Sound Scene Synthesis Based on Fine-Tuned Latent Diffusion Model for DCASE Challenge 2024 Task 7
Sagnik Ghosh, Gaurav Verma, Siddharath Narayan Shakya, Shubham Sharma, Shivesh Singh
Indian Institute of Technology Mandi, Kamand, Mandi, India
Abstract
With the advancements in generative AI, text-to-audio systems have become increasingly popular, transforming audio generation across various domains such as music and speech. These systems enable the generation of high-quality audio from textual descriptions, offering freedom and control when producing a variety of audio. This technical report explores advancements in deep learning applied to sound generation, focusing specifically on environmental sound scene generation. Our approach leverages a Text-to-Audio (TTA) system with Contrastive Language-Audio Pretraining (CLAP), a conditional latent diffusion model (LDM), a Variational Autoencoder (VAE) decoder, and a HiFi-GAN vocoder, where the LDM learns continuous audio representations from CLAP embeddings, enhancing synthesis control through natural language prompts. We also fine-tuned the diffusion model on a custom dataset created from two audio datasets in order to improve generation quality.
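The component chain named in the abstract (CLAP text embedding, conditional latent diffusion, VAE decoder, HiFi-GAN vocoder) can be sketched at inference time as below. The callables are hypothetical placeholders and the diffusion sampler is collapsed into a single call, so this illustrates the chain rather than the authors' fine-tuned implementation.

```python
import torch

@torch.no_grad()
def text_to_audio(prompt: str, clap_text_encoder, ldm_sampler, vae_decoder, vocoder,
                  steps: int = 200, guidance_scale: float = 3.0) -> torch.Tensor:
    """Hypothetical inference chain for a CLAP-conditioned latent diffusion TTA system."""
    cond = clap_text_encoder(prompt)                    # text conditioning embedding
    latent = ldm_sampler(cond, steps=steps,             # iteratively denoise a random latent,
                         guidance_scale=guidance_scale) # guided by the text embedding
    mel = vae_decoder(latent)                           # decode the latent to a mel-spectrogram
    return vocoder(mel).squeeze(0)                      # vocode the mel to a waveform
```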
System characteristics
System input | text prompt |
Machine learning method | VAE, CLAP, U-Net-based latent diffusion model |
Phase reconstruction method | HiFi-GAN |
Acoustic features | mel-spectrogram |
Data augmentation | conditioning augmentation |
System complexity | 671000000 parameters |
Diffusion Based Sound Scene Synthesis for DCASE Challenge 2024 Task 7
Yi Yuan, Haohe Liu, Xubo Liu, Mark D. Plumbley, Wenwu Wang
University of Surrey, Guildford, United Kingdom
Abstract
Sound scene synthesis aims to generate a variety of environment-related sounds within a specific scene. In this work, we propose a system for DCASE 2024 Challenge Task 7. The proposed system is based on the official baseline model AudioLDM, a diffusion-based text-to-audio generation model. The system is first trained on large-scale datasets and then adapted to this task via transfer learning. To address the challenge of having no target audio data, we implemented an automated pipeline to synthesize audio and generate corresponding captions that mirror the semantic structure of the task. Despite the absence of dedicated training and testing sets for this task, our robust audio synthesis model effectively adapts to the given conditions, fulfilling all the task requirements. Our system achieved a Fréchet Audio Distance (FAD) score of 55.1, surpassing the baseline system's FAD score of 61.3 as calculated by the official evaluation toolkit.
System characteristics
System input | text prompt |
Machine learning method | VAE, T5, U-Net-based latent diffusion model |
Phase reconstruction method | BigVGAN |
Acoustic features | mel-spectrogram |
Data augmentation | conditioning augmentation |
System complexity | 265531016 parameters |