Task description
Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Submissions were first evaluated by signal-to-distortion ratio (SDR) and then by a subjective listening test, which determined the final rankings.
A more detailed task description can be found on the task description page.
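For reference, the objective metrics reported below (SDR, SDRi, SI-SDR) can be sketched in a few lines of Python. This is a minimal illustration assuming 1-D NumPy arrays for the estimate, reference, and mixture; the official challenge evaluation code may differ in implementation details.

```python
import numpy as np

def sdr(est, ref, eps=1e-8):
    """Signal-to-distortion ratio in dB (energy of the reference over the residual)."""
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR: rescale the reference to best fit the estimate first."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    return 10 * np.log10(np.sum(target ** 2) / (np.sum((est - target) ** 2) + eps) + eps)

def sdri(est, ref, mix):
    """SDR improvement: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr(est, ref) - sdr(mix, ref)
```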
Systems ranking
Subjective Evaluation Score
If a team submitted multiple systems, only the system with the highest SDR score was evaluated subjectively. The subjective score is the weighted average of two ratings, combined at a 1:1 ratio: REL (relevance between the target audio and the language query) and OVL (overall audio quality of the separated signal).
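For example, applying this 1:1 weighting to the top-ranked system below gives (3.430 + 3.188) / 2 ≈ 3.309, which matches its reported average score of 3.310 up to rounding of the individual ratings.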
Submission Code | Technical Report | Official Rank | Average Score | OVL Score | REL Score
---|---|---|---|---|---
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 1 | 3.310 | 3.430 | 3.188
Guan_HEU_task9_2 | Xiao2024_t9 | 2 | 3.288 | 3.416 | 3.159
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 3 | 3.266 | 3.400 | 3.133
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 4 | 3.260 | 3.386 | 3.134
Chung_KT_task9_1 | Chung2024_t9 | 5 | 3.240 | 3.378 | 3.102
Objective Evaluation Score
Submission Code | Technical Report | SDR Rank | Eval SDR (dB) | Eval SDRi (dB) | Eval SI-SDR (dB) | Val SDR (dB) | Val SDRi (dB) | Val SI-SDR (dB)
---|---|---|---|---|---|---|---|---
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 1 | 8.869 | 8.763 | 7.764 | 8.610 | 8.575 | 7.493
Kim_GIST-AunionAI_task9_3 | Lee2024_t9 | 2 | 8.864 | 8.757 | 7.780 | 8.599 | 8.564 | 7.497
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 3 | 8.842 | 8.736 | 7.820 | 8.467 | 8.432 | 7.403
HanYin_NWPU-JLESS_task9_3 | Yin2024_t9 | 4 | 8.764 | 8.658 | 7.394 | 8.191 | 8.156 | 6.794
Kim_GIST-AunionAI_task9_2 | Lee2024_t9 | 5 | 8.671 | 8.564 | 7.217 | 8.459 | 8.424 | 7.072
Guan_HEU_task9_2 | Xiao2024_t9 | 6 | 8.368 | 8.262 | 6.800 | 8.192 | 8.157 | 6.680
HanYin_NWPU-JLESS_task9_2 | Yin2024_t9 | 7 | 8.186 | 8.080 | 6.499 | 8.007 | 7.972 | 6.459
Kim_GIST-AunionAI_task9_1 | Lee2024_t9 | 8 | 8.059 | 7.953 | 6.510 | 7.750 | 7.715 | 6.161
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 9 | 7.572 | 7.466 | 5.455 | 7.398 | 7.363 | 5.551
HanYin_NWPU-JLESS_task9_1 | Yin2024_t9 | 10 | 7.306 | 7.200 | 5.481 | 7.087 | 7.052 | 5.413
Chung_KT_task9_1 | Chung2024_t9 | 11 | 7.302 | 7.195 | 5.628 | 7.030 | 6.995 | 5.368
Romaniuk_SRPOL_task9_1 | Romaniuk2024_t9 | 12 | 7.245 | 7.138 | 5.294 | 7.021 | 6.986 | 5.291
Chung_KT_task9_2 | Chung2024_t9 | 13 | 7.186 | 7.080 | 5.526 | 7.124 | 7.089 | 5.593
Chung_KT_task9_3 | Chung2024_t9 | 14 | 7.118 | 7.012 | 5.301 | 7.139 | 7.104 | 5.504
Romaniuk_SRPOL_task9_4 | Romaniuk2024_t9 | 15 | 6.478 | 6.372 | 4.513 | 6.282 | 6.247 | 4.620
Romaniuk_SRPOL_task9_3 | Romaniuk2024_t9 | 16 | 6.153 | 6.046 | 3.811 | 6.181 | 6.146 | 4.188
Guan_HEU_task9_1 | Xiao2024_t9 | 17 | 6.022 | 5.916 | 4.115 | 5.937 | 5.902 | 4.191
Baseline | Liu2024_t9 | 18 | 5.799 | 5.693 | 3.873 | 5.708 | 5.673 | 3.862
Guan_HEU_task9_3 | Xiao2024_t9 | 19 | -5.417 | -5.523 | -39.983 | -4.747 | -4.792 | -42.346
System characteristics
Summary of the submitted system characteristics.
Submission Code | Technical Report | Input SR | Data augmentation | ML method | Loss function | Ensemble systems | Total parameters | Training datasets | Used pre-trained models |
---|---|---|---|---|---|---|---|---|---|
Kim_GIST-AunionAI_task9_4 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 5 | 467M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
Kim_GIST-AunionAI_task9_3 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 4 | 467M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
HanYin_NWPU-JLESS_task9_4 | Yin2024_t9 | 32kHz, 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 3 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
HanYin_NWPU-JLESS_task9_3 | Yin2024_t9 | 32kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 1 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Kim_GIST-AunionAI_task9_2 | Lee2024_t9 | 32kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238M | Clotho, FSD50K, WavCaps | CLAP, AudioSep, Phi-2.0 |
Guan_HEU_task9_2 | Xiao2024_t9 | 32kHz | N/A | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP, AudioSep |
HanYin_NWPU-JLESS_task9_2 | Yin2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking, DPRNN | waveform l1 loss | 1 | 267M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Kim_GIST-AunionAI_task9_1 | Lee2024_t9 | 16kHz | caption augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 229M | Clotho, FSD50K, WavCaps | CLAP, Phi-2.0 |
Romaniuk_SRPOL_task9_2 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 229M | Clotho, FSD50K, AudioCaps | CLAP |
HanYin_NWPU-JLESS_task9_1 | Yin2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, Audiocaps, Auto-ACD, WavCaps | CLAP |
Chung_KT_task9_1 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Romaniuk_SRPOL_task9_1 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking, separate masks for real and imaginary components | waveform l1 loss | 1 | 229M | Clotho, FSD50K, AudioCaps | CLAP |
Chung_KT_task9_2 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Chung_KT_task9_3 | Chung2024_t9 | 16kHz | N/A | FLAN-T5, ResUNet-based separation model, CLAP | waveform l1 loss, multi-scale mel-spectrogram loss, contrastive loss, loss balancer | 1 | 372.73M | AudioCaps, Clotho, WavCaps, FSD50K | FLAN-T5, CLAP |
Romaniuk_SRPOL_task9_4 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, AudioCaps | CLAP |
Romaniuk_SRPOL_task9_3 | Romaniuk2024_t9 | 16kHz | random crop for long audio clips | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K, AudioCaps | CLAP |
Guan_HEU_task9_1 | Xiao2024_t9 | 16kHz | GPT-based text augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP |
Baseline | Liu2024_t9 | 16kHz | volume augmentation | CLAP, ResUNet-based separation model, time-frequency masking | waveform l1 loss | 1 | 238.60M | Clotho, FSD50K | CLAP |
Guan_HEU_task9_3 | Xiao2024_t9 | 16kHz | N/A | CLAP, ResUNet-based separation model, time-frequency masking, Latent Diffusion Model | waveform l1 loss, Latent Diffusion Model loss | 1 | 671M | Clotho, FSD50K | CLAP, AudioLDM |
Technical reports
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION ENHANCED BY EXPANDED LANGUAGE-AUDIO CONTRASTIVE LOSS
Hae Chun Chung, Jae Hoon Jung
AI Tech Lab, KT Corporation
Chung_KT_task9_1 Chung_KT_task9_2 Chung_KT_task9_3
Abstract
This technical report outlines the efforts of KT Corporation's Acoustic Processing Project to address language-queried audio source separation (LASS), DCASE 2024 Challenge Task 9. The objective of this work is to separate arbitrary sound sources using a text description of the desired source. We propose three systems, each with the same model architecture but different training methods. These systems use the FLAN-T5 model as the text encoder and the ResUNet model as the separator. To train these systems, we introduced three loss functions: L1 loss in the time domain, multi-scale mel-spectrogram loss in the frequency domain, and contrastive loss, with a loss balancer to stabilize training. Utilizing the Contrastive Language-Audio Pre-training (CLAP) model, we designed three contrastive losses: audio-to-text (A2T-CL), audio-to-audio (A2A-CL), and audio-to-multi (A2M-CL). The first system was trained with A2T-CL, the second with both A2A-CL and A2T-CL, and the third with A2M-CL. These systems achieved signal-to-distortion ratios (SDRs) of 7.030, 7.124, and 7.136, respectively, showing nearly a 30% improvement over the baseline SDR of 5.708 provided by the challenge.
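As a rough illustration of the time-domain plus frequency-domain objective described above, here is a minimal PyTorch sketch of a waveform L1 loss combined with a multi-scale mel-spectrogram loss. The FFT sizes, mel resolution, and weighting are illustrative assumptions, not values from the report, and the contrastive terms and loss balancer are omitted.

```python
import torch
import torchaudio

class MultiScaleMelLoss(torch.nn.Module):
    """L1 distance between log-mel spectrograms computed at several STFT scales."""
    def __init__(self, sample_rate=16000, n_ffts=(512, 1024, 2048), n_mels=64):
        super().__init__()
        self.mels = torch.nn.ModuleList([
            torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft,
                hop_length=n_fft // 4, n_mels=n_mels)
            for n_fft in n_ffts
        ])

    def forward(self, est, ref):
        loss = 0.0
        for mel in self.mels:
            est_mel = torch.log(mel(est) + 1e-5)
            ref_mel = torch.log(mel(ref) + 1e-5)
            loss = loss + torch.nn.functional.l1_loss(est_mel, ref_mel)
        return loss / len(self.mels)

def separation_loss(est_wav, ref_wav, mel_loss, lambda_mel=1.0):
    """Waveform L1 loss plus a frequency-domain multi-scale mel term."""
    l1_time = torch.nn.functional.l1_loss(est_wav, ref_wav)
    return l1_time + lambda_mel * mel_loss(est_wav, ref_wav)
```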
PERFORMANCE IMPROVEMENT OF LANGUAGE-QUERIED AUDIO SOURCE SEPARATION BASED ON CAPTION AUGMENTATION FROM LARGE LANGUAGE MODELS FOR DCASE CHALLENGE 2024 TASK 9
Do Hyun Lee1, Yoonah Song1, Hong Kook Kim1,2,3
1AI Graduate School, Gwangju Institute of Science and Technology, Republic of Korea, 2School of EECS, Gwangju Institute of Science and Technology, Republic of Korea, 3Aunion AI, Co. Ltd, Republic of Korea
Kim_GIST-AunionAI_task9_1 Kim_GIST-AunionAI_task9_2 Kim_GIST-AunionAI_task9_3 Kim_GIST-AunionAI_task9_4
Abstract
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
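A minimal sketch of the caption-augmentation idea, assuming a hypothetical `generate(prompt)` helper that wraps whatever LLM is used; the prompt wording and the number of variants are illustrative, not taken from the report.

```python
def augment_caption(caption, generate, n_variants=3):
    """Ask an LLM for paraphrases of a training caption.

    `generate` is a hypothetical callable that sends a prompt to an LLM and
    returns its text response; swap in whichever API client is available.
    """
    prompt = (
        f"Rewrite the following audio caption in {n_variants} different ways, "
        f"keeping the described sound sources unchanged. "
        f"Return one rewrite per line.\n\nCaption: {caption}"
    )
    response = generate(prompt)
    variants = [line.strip() for line in response.splitlines() if line.strip()]
    return [caption] + variants[:n_variants]
```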
SRPOL submission to DCASE 2024 challenge task 9: Modeling real and imaginary components, Mixit and SDR based loss
Michal Romaniuk, Justyna Krzywdziak
Samsung R&D Institute Poland
Romaniuk_SRPOL_task9_1 Romaniuk_SRPOL_task9_2 Romaniuk_SRPOL_task9_3 Romaniuk_SRPOL_task9_4
Abstract
We present our solution to DCASE 2024 Challenge Task 9 (Language-Queried Audio Source Separation). Our solution is based on the official baseline, with a training dataset comprising FSD50K and Clotho, additionally extended with AudioCaps. We show that the additional data improve results throughout the training process. We explore changing the ratio-masking method from spectrogram amplitude and phase to individual masks for the real and imaginary components. We also investigate how different losses, such as the Mixit loss and an SDR-based loss, affect the training process.
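To illustrate the masking change described above (separate masks for the real and imaginary components of the mixture STFT instead of a magnitude mask with mixture phase), a minimal PyTorch sketch follows; tensor shapes and mask ranges are assumptions, not taken from the report.

```python
import torch

def apply_real_imag_masks(mix_stft, mask_real, mask_imag):
    """Mask the real and imaginary parts of the mixture STFT independently.

    mix_stft: complex tensor (batch, freq, time), e.g. from
              torch.stft(..., return_complex=True)
    mask_real, mask_imag: real-valued masks predicted by the separator,
              one per component (value ranges are model-dependent)
    """
    est = torch.complex(mix_stft.real * mask_real,
                        mix_stft.imag * mask_imag)
    return est  # invert with torch.istft to obtain the separated waveform
```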
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION WITH GPT-BASED TEXT AUGMENTATION AND IDEAL RATIO MASKING
Feiyang Xiao1, Wenbo Wang2, Dongli Xu3, Shuhan Qi4, Kejia Zhang1, Qiaoxi Zhu5, Jian Guan1
1Harbin Engineering University, Harbin, China, 2Harbin Institute of Technology, Harbin, China, 3Independent Researcher, China, 4Harbin Institute of Technology, Shenzhen, China, 5University of Technology Sydney, Ultimo, Australia
Guan_HEU_task9_1 Guan_HEU_task9_2 Guan_HEU_task9_3
Abstract
This technical report details our submission systems for Task 9 (language-queried audio source separation) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. Our four proposed systems utilize the large language model GPT-4 for data augmentation and apply the ideal ratio masking strategy in the latent feature space of the text-to-audio generation model, AudioLDM. Additionally, our systems incorporate the pre-trained language-queried audio source separation model, AudioSep-32K, which leverages extensive pre-training on large-scale data to separate audio sources based on text queries. Experimental results demonstrate that our systems achieve better separation performance compared to the official baseline method on objective metrics. Furthermore, we introduce a novel evaluation metric, audio-text similarity (ATS), which measures the semantic similarity between the separated audio and the text query without requiring a reference target audio signal.
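The audio-text similarity (ATS) metric mentioned above can be sketched as a cosine similarity between CLAP embeddings of the separated audio and the text query. The snippet below is a minimal illustration assuming the embeddings have already been extracted with a CLAP model; the exact formulation in the report may differ.

```python
import numpy as np

def audio_text_similarity(audio_embed, text_embed):
    """Cosine similarity between a CLAP audio embedding and a CLAP text embedding.

    Both inputs are 1-D NumPy arrays; how they are obtained (which CLAP
    checkpoint, pooling, etc.) is left open here.
    """
    a = audio_embed / (np.linalg.norm(audio_embed) + 1e-8)
    t = text_embed / (np.linalg.norm(text_embed) + 1e-8)
    return float(np.dot(a, t))
```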
LANGUAGE-QUERIED AUDIO SOURCE SEPARATION VIA RESUNET WITH DPRNN
Han Yin1, Jisheng Bai1,3,4, Mou Wang2, Jianfeng Chen1
1Northwestern Polytechnical University, Xi’an, China, 2Chinese Academy of Sciences, Beijing, China, 3Nanyang Technological University, Singapore, 4LianFeng Acoustic Technologies Co., Ltd., Xi’an, China
HanYin_NWPU-JLESS_task9_1 HanYin_NWPU-JLESS_task9_2 HanYin_NWPU-JLESS_task9_3 HanYin_NWPU-JLESS_task9_4
Abstract
This report presents our submitted systems for Task 9 of the DCASE 2024 Challenge: language-queried audio source separation (LASS). LASS is the task of separating arbitrary sound sources using textual descriptions of the desired source, also known as "separate what you describe". Specifically, we first incorporate a dual-path recurrent neural network (DPRNN) block into ResUNet, which is significantly beneficial for improving separation performance. We then trained the proposed model on several public datasets, including Clotho, FSD50K, AudioCaps, Auto-ACD, and WavCaps. We trained the model at both 16 kHz and 32 kHz; the 32 kHz model achieved the best separation performance with an SDR of 8.191 dB on the validation set, 2.483 dB higher than the challenge baseline.
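For readers unfamiliar with DPRNN, the sketch below shows a minimal PyTorch dual-path block operating on a (batch, channels, frequency, time) feature map: one bidirectional LSTM scans along the frequency axis and another along the time axis, each with a residual connection. Layer sizes and the exact placement inside ResUNet are assumptions, not taken from the report.

```python
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """Dual-path RNN block: intra-path (frequency) and inter-path (time) BLSTMs."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_fc = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.time_fc = nn.Linear(2 * hidden, channels)

    def forward(self, x):
        b, c, f, t = x.shape
        # Intra path: scan along frequency for every time frame.
        y = x.permute(0, 3, 2, 1).reshape(b * t, f, c)      # (B*T, F, C)
        y = self.freq_fc(self.freq_rnn(y)[0])
        x = x + y.reshape(b, t, f, c).permute(0, 3, 2, 1)   # residual connection
        # Inter path: scan along time for every frequency bin.
        z = x.permute(0, 2, 3, 1).reshape(b * f, t, c)      # (B*F, T, C)
        z = self.time_fc(self.time_rnn(z)[0])
        x = x + z.reshape(b, f, t, c).permute(0, 3, 1, 2)   # residual connection
        return x
```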