Task description
Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Submissions were first evaluated by signal-to-distortion ratio (SDR) and then by a subjective listening test, which determined the final rankings.
A more detailed task description can be found on the task description page.
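For reference, the objective metrics reported below (SDR, SDRi, SI-SDR) can be sketched in a few lines of Python. This is a minimal illustration assuming 1-D NumPy arrays for the estimate, reference, and mixture; the official challenge evaluation code may differ in implementation details.

```python
import numpy as np

def sdr(est, ref, eps=1e-8):
    """Signal-to-distortion ratio in dB (energy of the reference over the residual)."""
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR: rescale the reference to best fit the estimate first."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10(np.sum(target ** 2) / (np.sum((est - target) ** 2) + eps) + eps)

def sdri(est, ref, mix):
    """SDR improvement: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr(est, ref) - sdr(mix, ref)
```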
Systems ranking
Subjective Evaluation Score
If a team submitted multiple systems, only the system with the highest SDR score was evaluated subjectively. The subjective score is the weighted average of two ratings, combined at a 1:1 ratio: REL (relevance between the target audio and the language query) and OVL (overall audio quality of the separated signal).
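For example, applying this 1:1 weighting to the top-ranked system below gives (3.430 + 3.188) / 2 ≈ 3.309, which matches its reported average score of 3.310 up to rounding of the individual ratings.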
Submission Code | Technical Report | Official Rank | Average Score | OVL Score | REL Score
---|---|---|---|---|---
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 1 | 3.310 | 3.430 | 3.188
Guan_HEU_task9_2 | Xiao2024_t9 | 2 | 3.288 | 3.416 | 3.159
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 3 | 3.266 | 3.400 | 3.133
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 4 | 3.260 | 3.386 | 3.134
Chung_KT_task9_1 | Chung2024_t9 | 5 | 3.240 | 3.378 | 3.102
Objective Evaluation Score
Submission Code | Technical Report | SDR Rank | Eval SDR (dB) | Eval SDRi (dB) | Eval SI-SDR (dB) | Val SDR (dB) | Val SDRi (dB) | Val SI-SDR (dB)
---|---|---|---|---|---|---|---|---
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 1 | 8.869 | 8.763 | 7.764 | 8.610 | 8.575 | 7.493
Kim_GIST-AunionAI_task9_3 | Lee2024_t9 | 2 | 8.864 | 8.757 | 7.780 | 8.599 | 8.564 | 7.497
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 3 | 8.842 | 8.736 | 7.820 | 8.467 | 8.432 | 7.403
HanYin_NWPU-JLESS_task9_3 | Yin2024_t9 | 4 | 8.764 | 8.658 | 7.394 | 8.191 | 8.156 | 6.794
Kim_GIST-AunionAI_task9_2 | Lee2024_t9 | 5 | 8.671 | 8.564 | 7.217 | 8.459 | 8.424 | 7.072
Guan_HEU_task9_2 | Xiao2024_t9 | 6 | 8.368 | 8.262 | 6.800 | 8.192 | 8.157 | 6.680
HanYin_NWPU-JLESS_task9_2 | Yin2024_t9 | 7 | 8.186 | 8.080 | 6.499 | 8.007 | 7.972 | 6.459
Kim_GIST-AunionAI_task9_1 | Lee2024_t9 | 8 | 8.059 | 7.953 | 6.510 | 7.750 | 7.715 | 6.161
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 9 | 7.572 | 7.466 | 5.455 | 7.398 | 7.363 | 5.551
HanYin_NWPU-JLESS_task9_1 | Yin2024_t9 | 10 | 7.306 | 7.200 | 5.481 | 7.087 | 7.052 | 5.413
Chung_KT_task9_1 | Chung2024_t9 | 11 | 7.302 | 7.195 | 5.628 | 7.030 | 6.995 | 5.368
Romaniuk_SRPOL_task9_1 | Romaniuk2024_t9 | 12 | 7.245 | 7.138 | 5.294 | 7.021 | 6.986 | 5.291
Chung_KT_task9_2 | Chung2024_t9 | 13 | 7.186 | 7.080 | 5.526 | 7.124 | 7.089 | 5.593
Chung_KT_task9_3 | Chung2024_t9 | 14 | 7.118 | 7.012 | 5.301 | 7.139 | 7.104 | 5.504
Romaniuk_SRPOL_task9_4 | Romaniuk2024_t9 | 15 | 6.478 | 6.372 | 4.513 | 6.282 | 6.247 | 4.620
Romaniuk_SRPOL_task9_3 | Romaniuk2024_t9 | 16 | 6.153 | 6.046 | 3.811 | 6.181 | 6.146 | 4.188
Guan_HEU_task9_1 | Xiao2024_t9 | 17 | 6.022 | 5.916 | 4.115 | 5.937 | 5.902 | 4.191
Baseline | Liu2024_t9 | 18 | 5.799 | 5.693 | 3.873 | 5.708 | 5.673 | 3.862
Guan_HEU_task9_3 | Xiao2024_t9 | 19 | -5.417 | -5.523 | -39.983 | -4.747 | -4.792 | -42.346
System characteristics
Summary of the submitted system characteristics.
Submission Code | Technical Report | Input SR | Data augmentation | ML method | Loss function | Ensemble systems | Total parameters | Training datasets | Used pre-trained models |
---|---|---|---|---|---|---|---|---|---|
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 5 | 467M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
Kim_GIST-AunionAI_task9_3 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 4 | 467M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 32kHz, 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 3 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
HanYin_NWPU-JLESS_task9_3 | Yin2024_t9 | 32kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 1 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Kim_GIST-AunionAI_task9_2 | Lee2024_t9 | 32kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
Guan_HEU_task9_2 | Xiao2024_t9 | 32kHz | N/A | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP, AudioSep |
HanYin_NWPU-JLESS_task9_2 | Yin2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 1 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Kim_GIST-AunionAI_task9_1 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 229M | Clotho, FSD50K, WavCaps | CLAP, Phi-2.0 |
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 229M | Clotho, FSD50K, AudioCaps | CLAP |
HanYin_NWPU-JLESS_task9_1 | Yin2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Chung_KT_task9_1 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Romaniuk_SRPOL_task9_1 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking, separate masks for real and imaginary components | waveform l1 loss | 1 | 229M | Clotho, FSD50K, AudioCaps | CLAP |
Chung_KT_task9_2 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Chung_KT_task9_3 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Romaniuk_SRPOL_task9_4 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, AudioCaps | CLAP |
Romaniuk_SRPOL_task9_3 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, AudioCaps | CLAP |
Guan_HEU_task9_1 | Xiao2024_t9 | 16kHz | GPT-based text augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP |
Baseline | Liu2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP |
Guan_HEU_task9_3 | Xiao2024_t9 | 16kHz | N/A | CLAP, ResUNet-based separation model, time-frequency masking, Latent Diffusion Model | waveform l1 loss, Latent Diffusion Model loss | 1 | 671M | Clotho, FSD50K | CLAP, AudioLDM |
Technical reports
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION ENHANCED BY EXPANDED LANGUAGE-AUDIO CONTRASTIVE LOSS
Hae Chun Chung, Jae Hoon Jung
AI Tech Lab, KT Corporation
Chung_KT_task9_1 Chung_KT_task9_2 Chung_KT_task9_3
Abstract
This technical report outlines the efforts of KT Corporation's Acoustic Processing Project to address language-queried audio source separation (LASS), DCASE 2024 Challenge Task 9. The objective of this work is to separate arbitrary sound sources using a text description of the desired source. We propose three systems, each with the same model architecture but different training methods. These systems use the FLAN-T5 model as the text encoder and the ResUNet model as the separator. To train these systems, we introduced three loss functions: L1 loss in the time domain, multi-scale mel-spectrogram loss in the frequency domain, and contrastive loss, with a loss balancer to stabilize training. Utilizing the Contrastive Language-Audio Pre-training (CLAP) model, we designed three contrastive losses: audio-to-text (A2T-CL), audio-to-audio (A2A-CL), and audio-to-multi (A2M-CL). The first system was trained with A2T-CL, the second with both A2A-CL and A2T-CL, and the third with A2M-CL. These systems achieved signal-to-distortion ratios (SDRs) of 7.030, 7.124, and 7.136, respectively, showing nearly a 30% improvement over the baseline SDR of 5.708 provided by the challenge.
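As a rough illustration of the time-domain plus frequency-domain objective described above, here is a minimal PyTorch sketch of a waveform L1 loss combined with a multi-scale mel-spectrogram loss. The FFT sizes, mel resolution, and weighting are illustrative assumptions, not values from the report, and the contrastive terms and loss balancer are omitted.

```python
import torch
import torchaudio

class MultiScaleMelLoss(torch.nn.Module):
    """L1 distance between log-mel spectrograms computed at several STFT scales."""
    def __init__(self, sample_rate=16000, n_ffts=(512, 1024, 2048), n_mels=64):
        super().__init__()
        self.mels = torch.nn.ModuleList([
            torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft,
                hop_length=n_fft // 4, n_mels=n_mels)
            for n_fft in n_ffts
        ])

    def forward(self, est, ref):
        loss = 0.0
        for mel in self.mels:
            est_mel = torch.log(mel(est) + 1e-5)
            ref_mel = torch.log(mel(ref) + 1e-5)
            loss = loss + torch.nn.functional.l1_loss(est_mel, ref_mel)
        return loss / len(self.mels)

def separation_loss(est_wav, ref_wav, mel_loss, lambda_mel=1.0):
    """Waveform L1 loss plus a frequency-domain multi-scale mel term."""
    l1_time = torch.nn.functional.l1_loss(est_wav, ref_wav)
    return l1_time + lambda_mel * mel_loss(est_wav, ref_wav)
```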
PERFORMANCE IMPROVEMENT OF LANGUAGE-QUERIED AUDIO SOURCE SEPARATION BASED ON CAPTION AUGMENTATION FROM LARGE LANGUAGE MODELS FOR DCASE CHALLENGE 2024 TASK 9
Do Hyun Lee1, Yoonah Song1, Hong Kook Kim1,2,3
1AI Graduate School, Gwangju Institute of Science and Technology, Republic of Korea, 2School of EECS, Gwangju Institute of Science and Technology, Republic of Korea, 3Aunion AI, Co. Ltd, Republic of Korea
Kim_GIST-AunionAI_task9_1 Kim_GIST-AunionAI_task9_2 Kim_GIST-AunionAI_task9_3 Kim_GIST-AunionAI_task9_4
Abstract
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
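A minimal sketch of the caption-augmentation idea, assuming a hypothetical `generate(prompt)` helper that wraps whatever LLM is used; the prompt wording and the number of variants are illustrative, not taken from the report.

```python
def augment_caption(caption, generate, n_variants=3):
    """Ask an LLM for paraphrases of a training caption.

    `generate` is a hypothetical callable that sends a prompt to an LLM and
    returns its text response; swap in whichever API client is available.
    """
    prompt = (
        f"Rewrite the following audio caption in {n_variants} different ways, "
        f"keeping the described sound sources unchanged. "
        f"Return one rewrite per line.\n\nCaption: {caption}"
    )
    response = generate(prompt)
    variants = [line.strip() for line in response.splitlines() if line.strip()]
    return [caption] + variants[:n_variants]
```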
SRPOL submission to DCASE 2024 challenge task 9: Modeling real and imaginary components, Mixit and SDR based loss
Michal Romaniuk, Justyna Krzywdziak
Samsung R&D Institute Poland
Romaniuk_SRPOL_task9_1 Romaniuk_SRPOL_task9_2 Romaniuk_SRPOL_task9_3 Romaniuk_SRPOL_task9_4
Abstract
We present our solution to DCASE 2024 Challenge Task 9 (Language-Queried Audio Source Separation). Our solution is based on the official baseline, with a training dataset comprising FSD50K and Clotho, additionally extended with AudioCaps. We show that the additional data improve results throughout the training process. We explore changing the ratio-masking method from spectrogram amplitude and phase to individual masks for the real and imaginary components. We also investigate how different losses, such as the Mixit loss and an SDR-based loss, affect the training process.
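To illustrate the masking change described above (separate masks for the real and imaginary components of the mixture STFT instead of a magnitude mask with mixture phase), a minimal PyTorch sketch follows; tensor shapes and mask ranges are assumptions, not taken from the report.

```python
import torch

def apply_real_imag_masks(mix_stft, mask_real, mask_imag):
    """Mask the real and imaginary parts of the mixture STFT independently.

    mix_stft: complex tensor (batch, freq, time), e.g. from
              torch.stft(..., return_complex=True)
    mask_real, mask_imag: real-valued masks predicted by the separator,
              one per component (value ranges are model-dependent)
    """
    est = torch.complex(mix_stft.real * mask_real,
                        mix_stft.imag * mask_imag)
    return est  # invert with torch.istft to obtain the separated waveform
```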
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION WITH GPT-BASED TEXT AUGMENTATION AND IDEAL RATIO MASKING
Feiyang Xiao1, Wenbo Wang2, Dongli Xu3, Shuhan Qi4, Kejia Zhang1, Qiaoxi Zhu5, Jian Guan1
1Harbin Engineering University, Harbin, China, 2Harbin Institute of Technology, Harbin, China, 3Independent Researcher, China, 4Harbin Institute of Technology, Shenzhen, China, 5University of Technology Sydney, Ultimo, Australia
Guan_HEU_task9_1 Guan_HEU_task9_2 Guan_HEU_task9_3
Abstract
This technical report details our submission systems for Task 9 (language-queried audio source separation) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. Our four proposed systems utilize the large language model GPT-4 for data augmentation and apply the ideal ratio masking strategy in the latent feature space of the text-to-audio generation model, AudioLDM. Additionally, our systems incorporate the pre-trained language-queried audio source separation model, AudioSep-32K, which leverages extensive pre-training on large-scale data to separate audio sources based on text queries. Experimental results demonstrate that our systems achieve better separation performance compared to the official baseline method on objective metrics. Furthermore, we introduce a novel evaluation metric, audio-text similarity (ATS), which measures the semantic similarity between the separated audio and the text query without requiring a reference target audio signal.
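The audio-text similarity (ATS) metric mentioned above can be sketched as a cosine similarity between CLAP embeddings of the separated audio and the text query. The snippet below is a minimal illustration assuming the embeddings have already been extracted with a CLAP model; the exact formulation in the report may differ.

```python
import numpy as np

def audio_text_similarity(audio_embed, text_embed):
    """Cosine similarity between a CLAP audio embedding and a CLAP text embedding.

    Both inputs are 1-D NumPy arrays; how they are obtained (which CLAP
    checkpoint, pooling, etc.) is left open here.
    """
    a = audio_embed / (np.linalg.norm(audio_embed) + 1e-8)
    t = text_embed / (np.linalg.norm(text_embed) + 1e-8)
    return float(np.dot(a, t))
```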
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION VIA RESUNET WITH DPRNN
Han Yin1, Jisheng Bai1,3,4, Mou Wang2, Jianfeng Chen1
1Northwestern Polytechnical University, Xi’an, China, 2Chinese Academy of Sciences, Beijing, China, 3Nanyang Technological University, Singapore, 4LianFeng Acoustic Technologies Co., Ltd., Xi’an, China
HanYin_NWPU-JLESS_task9_1 HanYin_NWPU-JLESS_task9_2 HanYin_NWPU-JLESS_task9_3 HanYin_NWPU-JLESS_task9_4
Abstract
This report presents our submitted systems for Task 9 of the DCASE 2024 Challenge: language-queried audio source separation (LASS). LASS is the task of separating arbitrary sound sources using textual descriptions of the desired source, also known as "separate what you describe". Specifically, we first incorporate a dual-path recurrent neural network (DPRNN) block into ResUNet, which is significantly beneficial for improving separation performance. We then trained the proposed model on several public datasets, including Clotho, FSD50K, AudioCaps, Auto-ACD, and WavCaps. We trained the model at both 16 kHz and 32 kHz; the 32 kHz model achieved the best separation performance with an SDR of 8.191 dB on the validation set, 2.483 dB higher than the challenge baseline.
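For readers unfamiliar with DPRNN, the sketch below shows a minimal PyTorch dual-path block operating on a (batch, channels, frequency, time) feature map: one bidirectional LSTM scans along the frequency axis and another along the time axis, each with a residual connection. Layer sizes and the exact placement inside ResUNet are assumptions, not taken from the report.

```python
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """Dual-path RNN block: intra-path (frequency) and inter-path (time) BLSTMs."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_fc = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.time_fc = nn.Linear(2 * hidden, channels)

    def forward(self, x):
        b, c, f, t = x.shape
        # Intra path: scan along frequency for every time frame.
        y = x.permute(0, 3, 2, 1).reshape(b * t, f, c)      # (B*T, F, C)
        y = self.freq_fc(self.freq_rnn(y)[0])
        x = x + y.reshape(b, t, f, c).permute(0, 3, 2, 1)   # residual connection
        # Inter path: scan along time for every frequency bin.
        z = x.permute(0, 2, 3, 1).reshape(b * f, t, c)      # (B*F, T, C)
        z = self.time_fc(self.time_rnn(z)[0])
        x = x + z.reshape(b, f, t, c).permute(0, 3, 1, 2)   # residual connection
        return x
```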