Language-Queried Audio Source Separation


Challenge results

Task description

Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using a textual description of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Submissions were first evaluated objectively by signal-to-distortion ratio (SDR); the top-scoring systems then underwent a subjective listening test, which determines the final ranking.
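
For reference, the sketch below (not the official evaluation code) shows how the objective metrics reported further down can be computed for a single reference/estimate pair; SDRi is then the SDR of the separated signal minus the SDR of the unprocessed mixture.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-distortion ratio in dB (simple energy-ratio definition)."""
    noise = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR: project the estimate onto the reference before comparing."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# SDRi (SDR improvement) is the SDR of the separated signal minus the SDR of the
# unprocessed mixture, both measured against the same reference.
```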

A more detailed task description can be found on the task description page.

Systems ranking

Subjective Evaluation Score

If a team submitted multiple systems, only the system with the highest SDR score was evaluated subjectively. The Average Score is the weighted average of the two ratings at a 1:1 ratio, i.e., the simple mean of REL (relevance between the target audio and the language query) and OVL (overall audio quality of the separated signal).

Submission Code    Technical Report    Official Rank    Average Score    OVL Score    REL Score
Kim_GIST-AunionAI_task9_4 Lee2024_t9 1 3.310 3.430 3.188
Guan_HEU_task9_2 Xiao2024_t9 2 3.288 3.416 3.159
HanYin_NWPU-JLESS_task9_4 Yin2024_t9 3 3.266 3.400 3.133
Romaniuk_SRPOL_task9_2 Romaniuk2024_t9 4 3.260 3.386 3.134
Chung_KT_task9_1 Chung2024_t9 5 3.240 3.378 3.102

Objective Evaluation Score

Submission Code    Technical Report    SDR Rank    Evaluation Set (SDR Score / SDRi Score / SI-SDR Score)    Validation (Development) Set (SDR Score / SDRi Score / SI-SDR Score)
Kim_GIST-AunionAI_task9_4 Lee2024_t9 1 8.869 8.763 7.764 8.610 8.575 7.493
Kim_GIST-AunionAI_task9_3 Lee2024_t9 2 8.864 8.757 7.780 8.599 8.564 7.497
HanYin_NWPU-JLESS_task9_4 Yin2024_t9 3 8.842 8.736 7.820 8.467 8.432 7.403
HanYin_NWPU-JLESS_task9_3 Yin2024_t9 4 8.764 8.658 7.394 8.191 8.156 6.794
Kim_GIST-AunionAI_task9_2 Lee2024_t9 5 8.671 8.564 7.217 8.459 8.424 7.072
Guan_HEU_task9_2 Xiao2024_t9 6 8.368 8.262 6.800 8.192 8.157 6.680
HanYin_NWPU-JLESS_task9_2 Yin2024_t9 7 8.186 8.080 6.499 8.007 7.972 6.459
Kim_GIST-AunionAI_task9_1 Lee2024_t9 8 8.059 7.953 6.510 7.750 7.715 6.161
Romaniuk_SRPOL_task9_2 Romaniuk2024_t9 9 7.572 7.466 5.455 7.398 7.363 5.551
HanYin_NWPU-JLESS_task9_1 Yin2024_t9 10 7.306 7.200 5.481 7.087 7.052 5.413
Chung_KT_task9_1 Chung2024_t9 11 7.302 7.195 5.628 7.030 6.995 5.368
Romaniuk_SRPOL_task9_1 Romaniuk2024_t9 12 7.245 7.138 5.294 7.021 6.986 5.291
Chung_KT_task9_2 Chung2024_t9 13 7.186 7.080 5.526 7.124 7.089 5.593
Chung_KT_task9_3 Chung2024_t9 14 7.118 7.012 5.301 7.139 7.104 5.504
Romaniuk_SRPOL_task9_4 Romaniuk2024_t9 15 6.478 6.372 4.513 6.282 6.247 4.620
Romaniuk_SRPOL_task9_3 Romaniuk2024_t9 16 6.153 6.046 3.811 6.181 6.146 4.188
Guan_HEU_task9_1 Xiao2024_t9 17 6.022 5.916 4.115 5.937 5.902 4.191
Baseline Liu2024_t9 18 5.799 5.693 3.873 5.708 5.673 3.862
Guan_HEU_task9_3 Xiao2024_t9 19 -5.417 -5.523 -39.983 -4.747 -4.792 -42.346

System characteristics

Summary of the submitted system characteristics.

Submission Code    Technical Report    Input SR    Data augmentation    ML method    Loss function    Ensemble systems    Total parameters    Training datasets    Used pre-trained models
Kim_GIST-AunionAI_task9_4 Lee2024_t9 16kHz caption augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 5 467M Clotho, FSD50K, WavCaps CLAP, AudioSep, Phi-2.0
Kim_GIST-AunionAI_task9_3 Lee2024_t9 16kHz caption augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 4 467M Clotho, FSD50K, WavCaps CLAP, AudioSep, Phi-2.0
HanYin_NWPU-JLESS_task9_4 Yin2024_t9 32kHz, 16kHz volume augmentation CLAP, ResUNet-based separation model, time-frequency masking, DPRNN waveform l1 loss 3 267M Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps CLAP
HanYin_NWPU-JLESS_task9_3 Yin2024_t9 32kHz volume augmentation CLAP, ResUNet-based separation model, time-frequency masking, DPRNN waveform l1 loss 1 267M Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps CLAP
Kim_GIST-AunionAI_task9_2 Lee2024_t9 32kHz caption augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238M Clotho, FSD50K, WavCaps CLAP, AudioSep, Phi-2.0
Guan_HEU_task9_2 Xiao2024_t9 32kHz N/A CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K CLAP, AudioSep
HanYin_NWPU-JLESS_task9_2 Yin2024_t9 16kHz volume augmentation CLAP, ResUNet-based separation model, time-frequency masking, DPRNN waveform l1 loss 1 267M Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps CLAP
Kim_GIST-AunionAI_task9_1 Lee2024_t9 16kHz caption augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 229M Clotho, FSD50K, WavCaps CLAP, Phi-2.0
Romaniuk_SRPOL_task9_2 Romaniuk2024_t9 16kHz random crop for long audio clips CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 229M Clotho, FSD50K, AudioCaps CLAP
HanYin_NWPU-JLESS_task9_1 Yin2024_t9 16kHz volume augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps CLAP
Chung_KT_task9_1 Chung2024_t9 16kHz N/A FLAN-T5, ResUNet-based separation model, CLAP waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer 1 372.73M AudioCaps, Clotho, WavCaps, FSD50K FLAN-T5, CLAP
Romaniuk_SRPOL_task9_1 Romaniuk2024_t9 16kHz random crop for long audio clips CLAP, ResUNet-based separation model, time-frequency masking, separate masks for real and imaginary components waveform l1 loss 1 229M Clotho, FSD50K, AudioCaps CLAP
Chung_KT_task9_2 Chung2024_t9 16kHz N/A FLAN-T5, ResUNet-based separation model, CLAP waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer 1 372.73M AudioCaps, Clotho, WavCaps, FSD50K FLAN-T5, CLAP
Chung_KT_task9_3 Chung2024_t9 16kHz N/A FLAN-T5, ResUNet-based separation model, CLAP waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer 1 372.73M AudioCaps, Clotho, WavCaps, FSD50K FLAN-T5, CLAP
Romaniuk_SRPOL_task9_4 Romaniuk2024_t9 16kHz random crop for long audio clips CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K, AudioCaps CLAP
Romaniuk_SRPOL_task9_3 Romaniuk2024_t9 16kHz random crop for long audio clips CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K, AudioCaps CLAP
Guan_HEU_task9_1 Xiao2024_t9 16kHz GPT-based text augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K CLAP
Baseline Liu2024_t9 16kHz volume augmentation CLAP, ResUNet-based separation model, time-frequency masking waveform l1 loss 1 238.60M Clotho, FSD50K CLAP
Guan_HEU_task9_3 Xiao2024_t9 16kHz N/A CLAP, ResUNet-based separation model, time-frequency masking, Latent Diffusion Model waveform l1 loss, Latent Diffusion Model loss 1 671M Clotho, FSD50K CLAP, AudioLDM



Technical reports

LANGUAGE-QUERIED AUDIO SOURCE SEPARATION ENHANCED BY EXPANDED LANGUAGE-AUDIO CONTRASTIVE LOSS

Hae Chun Chung, Jae Hoon Jung
AI Tech Lab, KT Corporation

Abstract

This technical report outlines the efforts of KT Corporation's Acoustic Processing Project in addressing language-queried audio source separation (LASS), Task 9 of the DCASE 2024 Challenge. The objective of this work is to separate arbitrary sound sources using a text description of the desired source. We propose three systems, each with the same model architecture but different training methods. These systems use the FLAN-T5 model as the text encoder and the ResUNet model as the separator. To train these systems, we introduced three loss functions: L1 loss in the time domain, multi-scale mel-spectrogram loss in the frequency domain, and contrastive loss, with a loss balancer to stabilize training. Utilizing the Contrastive Language-Audio Pre-training (CLAP) model, we designed three contrastive losses: audio-to-text (A2T-CL), audio-to-audio (A2A-CL), and audio-to-multi (A2M-CL). The first system was trained with A2T-CL, the second with both A2A-CL and A2T-CL, and the third with A2M-CL. These systems achieved signal-to-distortion ratios (SDR) of 7.030, 7.124, and 7.136, respectively, an improvement of nearly 30% over the baseline SDR of 5.708 provided by the challenge.
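
As an illustration of the audio-to-text variant (A2T-CL), the sketch below shows a common InfoNCE-style formulation of a contrastive loss over a batch of CLAP embeddings; it assumes the CLAP embeddings of the separated audio and of the text queries have already been computed, and it is not necessarily the exact formulation used in the report.

```python
import torch
import torch.nn.functional as F

def a2t_contrastive_loss(audio_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style audio-to-text contrastive loss over a batch.

    audio_emb: (batch, dim) CLAP embeddings of the separated audio (assumed precomputed).
    text_emb:  (batch, dim) CLAP embeddings of the corresponding text queries.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy: audio->text over rows, text->audio over columns.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```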


PERFORMANCE IMPROVEMENT OF LANGUAGE-QUERIED AUDIO SOURCE SEPARATION BASED ON CAPTION AUGMENTATION FROM LARGE LANGUAGE MODELS FOR DCASE CHALLENGE 2024 TASK 9

Do Hyun Lee1, Yoonah Song1, Hong Kook Kim1,2,3
1AI Graduate School, Gwangju Institute of Science and Technology, Republic of Korea, 2School of EECS, Gwangju Institute of Science and Technology, Republic of Korea, 3Aunion AI, Co. Ltd, Republic of Korea

Abstract

We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
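
A minimal sketch of this kind of LLM-based caption augmentation is given below; `query_llm` and the prompt wording are hypothetical placeholders, not the prompts identified in the report.

```python
# `query_llm` is a hypothetical stand-in for whatever LLM interface is used;
# the prompt below is illustrative, not the one reported by the authors.

def augment_caption(caption: str, n_variants: int, query_llm) -> list[str]:
    prompt = (
        f"Paraphrase the following audio caption in {n_variants} different ways, "
        f"keeping every sound event it mentions, one paraphrase per line:\n{caption}"
    )
    response = query_llm(prompt)  # returns the raw LLM output as a string
    variants = [line.strip("- ").strip() for line in response.splitlines() if line.strip()]
    return [caption] + variants[:n_variants]  # keep the original caption as well
```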


SRPOL submission to DCASE 2024 challenge task 9: Modeling real and imaginary components, Mixit and SDR based loss

Michal Romaniuk, Justyna Krzywdziak
Samsung R&D Institute Poland

Abstract

We present our solution to DCASE 2024 Challenge Task 9 (Language-Queried Audio Source Separation). Our solution is based on the official baseline, with a training dataset that includes FSD50K and Clotho and is additionally extended with AudioCaps. We show that the additional data improves results throughout the training process. We explore changing the ratio-masking method from a spectrogram amplitude-and-phase mask to individual masks for the real and imaginary components. We also investigate how different losses, such as the MixIT loss and an SDR-based loss, affect the training process.
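
The sketch below contrasts the two masking variants mentioned above for a complex mixture STFT; it is one illustrative reading of "individual masks for real and imaginary components", and the exact mask parameterization in the submitted systems may differ.

```python
import numpy as np

def apply_magnitude_mask(mix_stft: np.ndarray, mag_mask: np.ndarray) -> np.ndarray:
    """Magnitude ratio masking: scale the magnitude and reuse the mixture phase."""
    return mag_mask * np.abs(mix_stft) * np.exp(1j * np.angle(mix_stft))

def apply_real_imag_masks(mix_stft: np.ndarray,
                          real_mask: np.ndarray,
                          imag_mask: np.ndarray) -> np.ndarray:
    """Separate element-wise masks for the real and imaginary parts of the mixture STFT."""
    return real_mask * mix_stft.real + 1j * imag_mask * mix_stft.imag
```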


LANGUAGE-QUERIED AUDIO SOURCE SEPARATION WITH GPT-BASED TEXT AUGMENTATION AND IDEAL RATIO MASKING

Feiyang Xiao1, Wenbo Wang2, Dongli Xu3, Shuhan Qi4, Kejia Zhang1, Qiaoxi Zhu5, Jian Guan1
1Harbin Engineering University, Harbin, China, 2Harbin Institute of Technology, Harbin, China, 3Independent Researcher, China, 4Harbin Institute of Technology, Shenzhen, China, 5University of Technology Sydney, Ultimo, Australia

Abstract

This technical report details our submission systems for Task 9 (language-queried audio source separation) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. Our four proposed systems utilize the large language model GPT-4 for data augmentation and apply the ideal ratio masking strategy in the latent feature space of the text-to-audio generation model, AudioLDM. Additionally, our systems incorporate the pre-trained language-queried audio source separation model, AudioSep-32K, which leverages extensive pre-training on large-scale data to separate audio sources based on text queries. Experimental results demonstrate that our systems achieve better separation performance compared to the official baseline method on objective metrics. Furthermore, we introduce a novel evaluation metric, audio-text similarity (ATS), which measures the semantic similarity between the separated audio and the text query without requiring a reference target audio signal.
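
A minimal sketch of a reference-free audio-text similarity of this kind is shown below, assuming CLAP embeddings of the separated audio and of the text query are already available; the exact definition of ATS in the report may differ (e.g., in scaling or aggregation).

```python
import numpy as np

def audio_text_similarity(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between the CLAP embedding of the separated audio and the
    CLAP embedding of the text query (both assumed precomputed); no reference signal needed."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return float(np.dot(a, t))
```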


LANGUAGE-QUERIED AUDIO SOURCE SEPARATION VIA RESUNET WITH DPRNN

Han Yin1, Jisheng Bai1,3,4, Mou Wang2, Jianfeng Chen1
1Northwestern Polytechnical University, Xi’an, China, 2Chinese Academy of Sciences, Beijing, China, 3Nanyang Technological University, Singapore, 4LianFeng Acoustic Technologies Co., Ltd., Xi’an, China

Abstract

This report presents our submitted systems for Task 9 of the DCASE 2024 Challenge: language-queried audio source separation (LASS). LASS is the task of separating arbitrary sound sources using textual descriptions of the desired source, also known as "separate what you describe". Specifically, we first incorporate a dual-path recurrent neural network (DPRNN) block into ResUNet, which is significantly beneficial for improving separation performance. We then train the proposed model on several public datasets, including Clotho, FSD50K, AudioCaps, Auto-ACD, and WavCaps. We trained the model at both 16 kHz and 32 kHz; the 32 kHz model achieved the best separation performance, with an SDR of 8.191 dB on the validation set, 2.483 dB higher than the challenge baseline.
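
For illustration, the sketch below shows a minimal dual-path RNN block of the kind typically used for this purpose; the exact placement and dimensions inside the authors' ResUNet are not specified here, so the shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """Minimal dual-path RNN block: a bidirectional LSTM over the intra-chunk (local)
    axis followed by one over the inter-chunk (global) axis, each with a linear
    projection, residual connection, and layer normalization."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.intra_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, channels)
        self.intra_norm = nn.LayerNorm(channels)
        self.inter_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, channels)
        self.inter_norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_chunks, chunk_len, channels)
        b, k, s, c = x.shape
        # Intra-chunk (local) modeling along the chunk_len axis.
        intra, _ = self.intra_rnn(x.reshape(b * k, s, c))
        intra = self.intra_proj(intra).reshape(b, k, s, c)
        x = self.intra_norm(x + intra)
        # Inter-chunk (global) modeling along the n_chunks axis.
        inter, _ = self.inter_rnn(x.transpose(1, 2).reshape(b * s, k, c))
        inter = self.inter_proj(inter).reshape(b, s, k, c).transpose(1, 2)
        return self.inter_norm(x + inter)
```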
