Language-Queried Audio Source Separation


Task description

Separate arbitrary audio sources using natural language queries.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Language-queried audio source separation (LASS) is the task of separating arbitrary sound sources using textual descriptions of the desired source, also known as 'separate what you describe'. LASS provides a useful tool for future source separation systems, allowing users to extract audio sources via natural language instructions. Such a system could be useful in many applications, such as automatic audio editing, multimedia content retrieval, and augmented listening. The objective of this challenge is to effectively separate sound sources using natural language queries, thereby advancing the way we interact with and manipulate audio content.

Figure 1: Overview of LASS system.


Audio dataset

Development set

The development set is composed of audio samples from the FSD50K and Clotho v2 datasets. Clotho v2 consists of 6,972 audio samples (~37 hours); each audio clip is labeled with five captions. FSD50K contains over 51k audio clips (~100 hours) manually labeled using 200 classes drawn from the AudioSet Ontology. For each audio clip in the FSD50K dataset, we generated one automatic caption by prompting ChatGPT (GPT-4) with the clip's sound event tags.

Participants are encouraged to train LASS models by constructing synthetic data. For example, given an audio clip A1 and its corresponding caption C, we can select an additional audio clip, A2, to serve as background noise, thereby creating a mixture A3. Given A3 and C as inputs, the LASS system is expected to separate the source A1; a minimal sketch of this pipeline is shown below.

All audio files should be converted to mono 16 kHz audio for training LASS models.
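As an illustration of the preprocessing and mixing steps described above, the sketch below loads two clips, downmixes them to mono at 16 kHz, and adds them to form a training mixture. It assumes librosa and soundfile are installed; the file names are placeholders.

import numpy as np
import librosa
import soundfile as sf

SR = 16000  # target sample rate required by the task

def load_mono_16k(path):
    """Load an audio file, downmix to mono, and resample to 16 kHz."""
    audio, _ = librosa.load(path, sr=SR, mono=True)
    return audio

# Placeholder file names: A1 is the target source, A2 serves as background noise.
a1 = load_mono_16k("a1_target.wav")
a2 = load_mono_16k("a2_background.wav")

# Pad or trim the background so both signals have the same length.
if len(a2) < len(a1):
    a2 = np.pad(a2, (0, len(a1) - len(a2)))
a2 = a2[: len(a1)]

# Additive mixture A3; the caption C of A1 becomes the training query.
a3 = a1 + a2
sf.write("a3_mixture.wav", a3, SR)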

Validation and evaluation sets

We downloaded audio clips uploaded to Freesound between April and October 2023 and sampled 2,200 of them. Each audio file was chunked into a 10-second clip and converted to mono 16 kHz. The tags of each audio file were verified and revised according to the FSD50K sound event categories. We use three data splits for validation and evaluation: validation (synth), evaluation (synth), and evaluation (real).

To form our evaluation set (real), we selected 200 clips containing at least two overlapping sound sources. For each audio clip, we manually annotated its component sources with text descriptions, so that each clip can be used as a mixture from which to extract one or more of the component sources based on a text query. Each audio clip in evaluation (real) was labeled with three such text queries.

The remaining 2,000 clips are divided equally between validation and evaluation; each clip was annotated with three captions describing the clip as a whole. For the validation (synth) and evaluation (synth) sets, we created 3,000 synthetic mixtures with signal-to-noise ratios (SNR) ranging from -15 to 15 dB. Each synthetic mixture has one natural language query and its corresponding target source. We used the revised tag information to ensure that the two audio clips used in each mixture do not share overlapping sound source classes.
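For reference, mixing two clips at a chosen SNR amounts to scaling the interfering clip before adding it. The following sketch assumes both signals are already mono 16 kHz arrays of equal length; the signals shown are random placeholders rather than real clips.

import numpy as np

SR = 16000

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise power ratio equals `snr_db`, then mix."""
    target_power = np.mean(target ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(target_power / (noise_power * 10 ** (snr_db / 10)))
    return target + gain * noise

# Placeholder 10-second signals standing in for two clips with non-overlapping tags.
rng = np.random.default_rng(0)
target = rng.standard_normal(10 * SR)
noise = rng.standard_normal(10 * SR)

snr_db = rng.uniform(-15, 15)  # SNR range used for the synthetic sets
mixture = mix_at_snr(target, noise, snr_db)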

Validation (synth) is released with the development set, while evaluation (synth) and evaluation (real) will be released on 01 June 2024.

Download



Task setup

This task aims to develop a LASS model that can separate a target sound source using natural language queries. Participants in this challenge are allowed to use external resources, in addition to the development set provided by the organizers, to build their models. Evaluation will be conducted using both synthetic and real audio data.

Task rules

  • Participants are free to use external data and pre-trained models, including private data and models.
  • Participants are not allowed to use audio uploaded to Freesound between April and October 2023 for any purpose.
  • Participants are not allowed to use audio files in the validation & evaluation sets for training their models.
  • Participants must specify all external resources utilized in their submission in the technical report.
  • Participants are allowed to submit an ensemble system.

Submission

All participants should submit:

  • A zip file containing the separation results on the evaluation (synth) set as audio files in *.wav format,
  • A zip file containing the separation results on the evaluation (real) set as audio files in *.wav format,
  • Metadata for their submission (*.yaml file), and
  • A technical report for their submission (*.pdf file).

Each separated audio file should have the same file name as the corresponding input mixture file.

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system outputs (the .zip files), and the technical report (the .pdf file). To indicate this connection, you can use the following naming convention:

<author>_<institute>_task9_submission_<submission_index>_<output or metadata or report>.<zip or yaml or pdf>

For example:

author_institute_task9_submission_1_synth_results.zip
author_institute_task9_submission_1_real_results.zip
author_institute_task9_submission_1_metadata.yaml
author_institute_task9_submission_1_report.pdf

The <submission_index> field differentiates your submissions in case you have more than one.

Evaluation

We use the signal-to-distortion ratio (SDR), followed by a subjective listening test. The higher the SDR score, the better the output of the system. For the listening tests, evaluators will listen to separated audio samples from challenge submissions, with access to the text queries and audio mixtures (which do not have ground-truth references). The SDR evaluation will be conducted on the evaluation (synth) set, and the subjective listening test will be conducted on the evaluation (real) set.
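For a rough sense of the metric, a plain SDR between a reference source and an estimate can be computed as the log ratio of reference energy to error energy. The official challenge implementation may differ in details, so treat the snippet below as an illustrative sketch only.

import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB: higher is better."""
    error = reference - estimate
    return 10 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps)
    )

# Toy example: a slightly noisy copy of the reference signal.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)
print(f"SDR: {sdr(ref, est):.2f} dB")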

Listening test

For the listening test, the quality of separated sources will be evaluated on a Likert scale focusing on two key aspects: 1) REL: the relevance between the target audio and the language query; and 2) OVL: the overall sound quality of the separated signal. Each of the 600 examples (200 audio clips × 3 captions) in evaluation (real) will be assessed independently based on the following criteria:

  • Poor (1): OVL: Heavily distorted audio with significant non-target sounds. REL: Barely matches the query.
  • Fair (2): OVL: Noticeable non-target sounds and distortions present. REL: Basic alignment with the query.
  • Average (3): OVL: Identifiable target source with some non-target sounds and minor distortions. REL: Moderate match with the query.
  • Good (4): OVL: Minimal non-target sounds and very few distortions. REL: Clear match with the query.
  • Excellent (5): OVL: Minimal to no distortions or non-target sounds, resembling a clean recording. REL: Near-perfect match with the query.

Ranking

Final rankings are determined by the subjective listening tests. First, the 5 submissions with the highest SDR scores on evaluation (synth) are selected. To ensure a fair evaluation, the organizers will also informally listen to a few randomly selected files from systems that obtained lower SDR scores and may, at their discretion, include those systems in the subjective evaluation. The selected systems are then ranked by the subjective tests on evaluation (real), regardless of their SDR scores. The ranking score is the average of the REL and OVL scores, with both metrics weighted equally (1:1).
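Concretely, the ranking score for a system is the mean of (REL + OVL) / 2 over all rated examples. The snippet below illustrates this with made-up ratings; the values are random placeholders, not real results.

import numpy as np

# Hypothetical Likert ratings (1-5) for one system over the 600 evaluation (real)
# examples; the values here are random placeholders, not real data.
rng = np.random.default_rng(0)
rel = rng.integers(1, 6, size=600)
ovl = rng.integers(1, 6, size=600)

# Equal (1:1) weighting of REL and OVL, averaged over all examples.
final_score = np.mean((rel + ovl) / 2)
print(f"Final subjective score: {final_score:.2f}")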

Baseline system

The task organizers provide a baseline system based on AudioSep, which consists of a CLAP text encoder and a ResUNet separation model. The baseline model is trained on the development set (Clotho and the caption-augmented FSD50K dataset) for 200k steps with a batch size of 16 on one Nvidia A100 GPU (around 1 day). Details of the AudioSep architecture can be found in the original paper.
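To make the conditioning flow concrete, the sketch below uses toy stand-in modules: a text encoder that maps a tokenized query to an embedding, and a separator that uses that embedding to process the mixture. These stubs are illustrative only and are not the AudioSep API; the actual CLAP encoder and ResUNet live in the baseline repository.

import torch
import torch.nn as nn

class TextEncoderStub(nn.Module):
    """Toy stand-in for the CLAP text encoder: tokenized query -> embedding."""
    def __init__(self, vocab_size=1000, embed_dim=512):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, token_ids):
        return self.embedding(token_ids)  # (batch, embed_dim)

class SeparatorStub(nn.Module):
    """Toy stand-in for the ResUNet: mixture + text embedding -> estimated source."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.condition = nn.Linear(embed_dim, 1)  # placeholder for FiLM-style conditioning

    def forward(self, mixture, text_embedding):
        gain = torch.sigmoid(self.condition(text_embedding))  # (batch, 1)
        return mixture * gain  # placeholder for the masked/estimated waveform

text_encoder = TextEncoderStub()
separator = SeparatorStub()

mixture = torch.randn(1, 160000)             # 10 s of mono 16 kHz audio
token_ids = torch.randint(0, 1000, (1, 8))   # toy tokenized language query
estimate = separator(mixture, text_encoder(token_ids))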

We evaluated the SDR score of the baseline system on the validation (synth) set. The baseline system achieved an SDR of 5.708 dB.

Detailed information can be found in the GitHub repository.


The pre-trained weights for the baseline system yielding the above results are freely available on Zenodo.


Citation

If you are participating in this task or using the baseline code, please cite the two papers below.

Publication

Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang. Separate what you describe: Language-queried audio source separation. In Proc. INTERSPEECH 2022, 2022.

Publication

Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, and Wenwu Wang. Separate anything you describe. arXiv preprint arXiv:2308.05037, 2023.
