Spatial Semantic Segmentation of Sound Scenes


Task description

Coordinators

Masahiro Yasuda (NTT Corporation)
Noboru Harada (NTT Corporation)
Binh Thien Nguyen (NTT Corporation)
Daiki Takeuchi (NTT Corporation)
Daisuke Niizumi (NTT Corporation)
Marc Delcroix (NTT Corporation)
Shoko Araki (NTT Corporation)
Tomohiro Nakatani (NTT Corporation)
Yasunori Ohishi (NTT Corporation)
Romain Serizel
Mayank Mishra
Nobutaka Ono (Tokyo Metropolitan University)
Takao Kawamura (Tokyo Metropolitan University)

Sound separation and sound event detection in multichannel spatial sound scenes

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This task, Spatial Semantic Segmentation of Sound Scenes (S5), aims to advance technologies for detecting and separating sound events from multi-channel input signals that mix multiple sound events with spatial information, a fundamental basis of immersive communication. The ultimate goal is to separate a multi-channel mixture into dry sound object signals together with metadata about the object type (sound event class) and temporal localization. However, because several existing challenge tasks already cover some of these subset functions, this year's task focuses on detecting and separating sound events from multi-channel spatial input signals.

This task requires systems to detect and extract sound events from multi-channel spatial input signals. The input signal contains at most three simultaneous target sound events, plus optionally multiple non-target sound events and non-directional background noise. Each output signal should contain one isolated target sound event with a predicted label for the event class. The task overview is shown in Fig. 1.

Figure 1: Spatial Semantic Segmentation of Sound Scenes system.


The task setting is formulated as follows:

Let \(\boldsymbol{Y} = [\boldsymbol{y}^{(1)},\dots, \boldsymbol{y}^{(M)}]^\top \in \mathbb{R}^{M \times T}\) be the multi-channel time-domain mixture signal of length \(T\), recorded with an array of \(M\) microphones, where \(\{\cdot\}^\top\) is the matrix transposition. We denote by \(C=\{c_1, ...,c_K\}\) the set of source labels in the mixture, where the number of sources \(K\) can vary from \(1\) to \(K_\textrm{max}\). The \(m\)-th channel of \(\boldsymbol{Y}\) can be modeled as

$$ \boldsymbol{y}^{(m)}=\sum_{k=1}^{K} \boldsymbol{h}^{(m)}_k*\boldsymbol{s}_k + \bigl[ \sum_{j=1}^{J} \boldsymbol{h}^{(m)}_j*\boldsymbol{s}_j + \boldsymbol{n}^{(m)} \bigr]_{optional} =\sum_{k=1}^{K}\boldsymbol{x}^{(m)}_k + \bigl[\sum_{j=1}^{J}\boldsymbol{x}^{(m)}_j + \boldsymbol{n}^{(m)}\bigr]_{optional}, \label{eq1}\tag{1} $$

where \(\boldsymbol{s}_k, \boldsymbol{s}_j\in\mathbb{R}^T\) are the single-channel dry source signals corresponding to the target event label \(c_k\) and the interfering event label \(c_j\), respectively. \(\boldsymbol{h}^{(m)}_k, \boldsymbol{h}^{(m)}_j\in \mathbb{R}^H\) are the \(m\)-th channels of the length-\(H\) room impulse responses (RIRs) at the spatial positions of \(\boldsymbol{s}_k\) and \(\boldsymbol{s}_j\), respectively. \(\boldsymbol{n}^{(m)}\in\mathbb{R}^T\) is the \(m\)-th channel of the multi-channel noise signal. The wet source \(\boldsymbol{x}^{(m)}_k\in\mathbb{R}^T\) can be split into two components, the direct path and early reflections, \(\boldsymbol{x}^{(m,\textrm{d})}_k\in\mathbb{R}^T\), and the late reverberation, \(\boldsymbol{x}^{(m,\textrm{r})}_k\in\mathbb{R}^T\), as

$$\boldsymbol{x}^{(m)}_k = \boldsymbol{x}^{(m,\textrm{d})}_k + \boldsymbol{x}^{(m,\textrm{r})}_k = \boldsymbol{h}^{(m,\textrm{d})}_k*\boldsymbol{s}_k + \boldsymbol{h}^{(m,\textrm{r})}_k*\boldsymbol{s}_k,$$

where \(\boldsymbol{h}^{(m,\textrm{d})}_k, \boldsymbol{h}^{(m,\textrm{r})}_k \in \mathbb{R}^H\) are the corresponding early and late parts of the RIR, respectively. We denote by \(N\) the number of sound event classes.
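To make the early/late split concrete, here is a minimal single-channel sketch that truncates the RIR a fixed interval after the direct-path peak. The 50 ms boundary and the function name are illustrative assumptions, not part of the task definition:

```python
import numpy as np

def split_wet_source(rir, dry, fs=32000, early_ms=50.0):
    """Split the wet source x_k = h_k * s_k into a direct/early part and a
    late-reverberation part by truncating the RIR.  The boundary (50 ms
    after the direct-path peak) is an illustrative choice."""
    onset = int(np.argmax(np.abs(rir)))                 # direct-path peak
    cut = min(len(rir), onset + int(early_ms * fs / 1000))
    h_early = np.zeros_like(rir)
    h_early[:cut] = rir[:cut]                           # h^{(m,d)}_k
    h_late = rir - h_early                              # h^{(m,r)}_k
    T = len(dry)
    x_early = np.convolve(dry, h_early)[:T]             # x^{(m,d)}_k
    x_late = np.convolve(dry, h_late)[:T]               # x^{(m,r)}_k
    return x_early, x_late
```

By linearity of convolution, the two parts sum back to the full wet source \(\boldsymbol{x}^{(m)}_k\).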

The goal of S5 is to extract all the single-channel sources \(\{\hat{\boldsymbol{x}}^{(m_\textrm{ref},\textrm{d})}_1, \dots,\hat{\boldsymbol{x}}^{(m_\textrm{ref},\textrm{d})}_{\hat{K}} \}\) at a reference microphone \(m_\textrm{ref}\) and their corresponding class labels \(\hat{C}=\{\hat{c}_1, \dots,\hat{c}_{\hat{K}}\}\) from the multi-channel mixture \(\boldsymbol{Y}\).

For brevity, we also drop some indices in the following. Thus, \(\boldsymbol{x}^{(m_\textrm{ref}, \textrm{d})}_k\), \(\hat{\boldsymbol{x}}^{(m_\textrm{ref}, \textrm{d})}_k\), and \(\boldsymbol{y}^{(m_\textrm{ref})}\) are written as \(\boldsymbol{x}_k\), \(\hat{\boldsymbol{x}}_k\), and \(\boldsymbol{y}\), respectively.

Audio dataset

The 18 target sound event classes are:

  • "AlarmClock"
  • "BicycleBell"
  • "Blender"
  • "Buzzer"
  • "Clapping"
  • "Cough"
  • "CupboardOpenClose"
  • "Dishes"
  • "Doorbell"
  • "FootSteps"
  • "HairDryer"
  • "MechanicalFans"
  • "MusicalKeyboard"
  • "Percussion"
  • "Pour"
  • "Speech"
  • "Typing"
  • "VacuumCleaner"

The mixed samples are synthesized with the following formula, equivalent to Eq. (\ref{eq1}):

$$ \boldsymbol{y}^{(m)}=\sum_{k=1}^{K} \boldsymbol{h}^{(m)}_k*\boldsymbol{s}_k + \bigl[ \sum_{j=1}^{J} \boldsymbol{h}^{(m)}_j*\boldsymbol{s}_j + \boldsymbol{n}^{(m)} \bigr]_{optional},$$

where \(\boldsymbol{s}_k\) represents a target event sound, \(\boldsymbol{s}_j\) an interference (non-target) event sound, \(\boldsymbol{h}_k\) and \(\boldsymbol{h}_j\) the corresponding RIRs, and \(\boldsymbol{n}\) the non-directional noise.
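As an illustration, the mixing formula above can be sketched for a single channel as follows. The function and variable names are hypothetical; the actual data generation uses SpatialScaper as described in the Development set section:

```python
import numpy as np

def synthesize_mixture(dry_targets, rirs_t, dry_interf=(), rirs_i=(), noise=None):
    """Single-channel sketch of y = sum_k h_k*s_k [+ sum_j h_j*s_j + n].
    Multichannel data simply repeats this per microphone m."""
    T = len(dry_targets[0])
    y = np.zeros(T)
    wet_targets = []
    for s, h in zip(dry_targets, rirs_t):       # target events
        x = np.convolve(s, h)[:T]
        wet_targets.append(x)
        y += x
    for s, h in zip(dry_interf, rirs_i):        # optional interference events
        y += np.convolve(s, h)[:T]
    if noise is not None:                       # optional non-directional noise
        y += noise[:T]
    return y, wet_targets
```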

Development set

A folder tree contains the following:

For generating the training and validation data:

  • dry source sample files (Anechoic Sound Event 1K, newly recorded by NTT + FSD50K R1 + EARS dataset R2)
  • RIR files (NTT recorded + FOA-MEIR R3)
  • non-directional background noise and interference event sound files (FOA-MEIR R3 + FSD50K R1 + ESC-50 R4 + DISCO R5, as used in Semantic Hearing R6)

For checking the performance:

  • pre-mixed file
  • target sound event files

All files are converted to 32 kHz, 16-bit format.

To set up the complete dataset, follow the instructions in the dcase2025_task4_baseline repository.

During training, the source samples and RIRs are processed using SpatialScaper R7. A provided patch must be applied to the original SpatialScaper code; follow the instructions provided with the baseline code.

Evaluation set

Pre-mixed data will be provided after June 1, 2025.

External data resources

In line with the external data resource policy of past DCASE editions, we allow the use of external datasets and trained models under the following conditions:

  • No test data from either the development or evaluation dataset may be used for training.

  • Datasets, pre-trained models, and pre-trained parameters on the "List of external data resources allowed" may be used. The List will be updated upon request: any dataset, pre-trained model, or pre-trained parameters freely accessible to any other research group before May 15, 2025 can be added to the List.

To add external datasets, pre-trained models, or pre-trained parameters to the List, send a request to the organizers before the evaluation set publishing date. We will update the "List of external data resources allowed" on the web page to give all competitors an equal opportunity to use them. Once the evaluation set is published, no further external sources will be added; the List will be locked on June 1, 2025.

| Dataset name | Type | Added | Link | Comments |
| --- | --- | --- | --- | --- |
| YAMNet | model | 01.04.2025 | https://github.com/tensorflow/models/tree/master/research/audioset/yamnet | |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | model | 01.04.2025 | https://zenodo.org/record/3987831 | |
| OpenL3 | model | 01.04.2025 | https://openl3.readthedocs.io/ | |
| VGGish | model | 01.04.2025 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish | |
| COLA | model | 01.04.2025 | https://github.com/google-research/google-research/tree/master/cola | |
| BYOL-A | model | 01.04.2025 | https://github.com/nttcslab/byol-a | |
| AST: Audio Spectrogram Transformer | model | 01.04.2025 | https://github.com/YuanGongND/ast | |
| PaSST: Efficient Training of Audio Transformers with Patchout | model | 01.04.2025 | https://github.com/kkoutini/PaSST | |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | model | 01.04.2025 | https://github.com/microsoft/unilm/tree/master/beats | |
| AudioSet | audio, video | 01.04.2025 | https://research.google.com/audioset/ | |
| FSD50K | audio | 01.04.2025 | https://zenodo.org/record/4060432 | eval_audio shall not be used |
| ImageNet | image | 01.04.2025 | http://www.image-net.org/ | |
| MUSAN | audio | 01.04.2025 | https://www.openslr.org/17/ | |
| DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset | audio | 01.04.2025 | https://zenodo.org/record/1247102#.Y_oyRIBBx8s | |
| Pre-trained DESED embeddings (PANNs, AST part 1) | model | 01.04.2025 | https://zenodo.org/record/6642806#.Y_oy_oBBx8s | |
| Audio Teacher-Student Transformer | model | 01.04.2025 | https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view | |
| TUT Acoustic scenes dataset | audio | 01.04.2025 | https://zenodo.org/records/45739 | |
| MicIRP | IR | 01.04.2025 | http://micirp.blogspot.com/?m=1 | |
| FOA-MEIR | IR | 01.04.2025 | | The data related to Room IDs 68, 69, 85, and 86 are in the DEV set. |
| EARS | audio | 01.04.2025 | | |
| DISCO | audio | 01.04.2025 | | |
| ESC-50 | audio | 01.04.2025 | | |


Download

DCASE 2025 S5 development set

This contains sound event source and Room Impulse Response (RIR) files for generating training and validation data, as well as test samples and Oracle separation targets for sanity checks of performance.

DCASE 2025 S5 evaluation set

This contains evaluation test mixture samples for the challenge submission.

[Available after June 1, 2025]

Task setup and rules

Participants are required to process sound mixtures, identify the sources in the mixture that belong to the list of target sound classes defined in Section 3, and output the signal of each identified source. The number of target sound events in a mixture ranges from 1 to 3, while the number of interference sound events ranges from 0 to 2. Performance will be evaluated using the class-aware signal-to-distortion ratio improvement (CA-SDRi) metric described in Section 8.1. The additional informative metrics in Section 8.2 will also be calculated.

The task rules are as follows:

  • Participants are allowed to submit up to 4 different systems.
  • Participants are allowed to use external data for system development.
  • Data from other tasks is considered external data.
  • Embeddings extracted from models pre-trained on external data are considered external data.
  • Datasets and models can be added to the list upon request until May 15, 2025 (as long as the corresponding resources are publicly available).
  • External datasets used during training must be listed in the YAML file describing the submission.
  • Manipulation of the provided training data is allowed.
  • Participants are not allowed to use audio from the evaluation sets of the following datasets: DISCO, EARS, ESC-50, FSD50K, and Semantic Hearing.

Submission

All participants should submit:

  • A zip file containing audio files in "*.wav" format with the separation results on the evaluation set (\(EVAL_{test}\)). The separated audio files should follow the submission naming rules.
  • A text file containing the Google Drive link to the zip file of the separated audio.
  • A CSV file containing the calculated scores for the separation results on the development set (\(DEV_{test}\)).
  • Metadata for the submission ("*.yaml" file), and
  • A technical report for the submission ("*.pdf" file).

We allow up to 4 system output submissions per participant/team. Each system's metadata should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system outputs (the .zip files), and the technical report (the .pdf file). To indicate the connection between your files, consider using the following naming convention:

[The detailed naming convention will be announced in the submission example package.]

Evaluation

Evaluation metric

A class-aware metric has been newly defined to evaluate system outputs. This metric evaluates both the separated sound sources and their class labels simultaneously.

The main idea is that the estimated and reference sources are aligned by their labels; the waveform metric is calculated when the label is correctly predicted, i.e., \(c_k \in C \cap \hat{C}\), and penalty values are accumulated for incorrect label predictions. We define the class-aware signal-to-distortion ratio improvement (CA-SDRi) metric as

$$ \textrm{CA-SDRi}\left(\{\hat{\boldsymbol{x}}_1, \dots,\hat{\boldsymbol{x}}_{\hat{K}}\}, \{\boldsymbol{x}_1, \dots,\boldsymbol{x}_K\}, \boldsymbol{y}\right) = \frac{1}{| C \cup \hat{C} |} \sum_{c_k \in C \cup \hat{C}} P_{c_k}, \label{eq2}\tag{2} $$

where \(| C \cup \hat{C} |\) is the cardinality of the union of the reference and estimated label sets. The metric component \(P_{c_k}\) is calculated as

$$ P_{c_k\in C \cup \hat{C}} = \begin{cases} \textrm{SDRi}(\hat{\boldsymbol{x}}_k, \boldsymbol{x}_k, \boldsymbol{y}), &\text{if } c_k \in C \cap \hat{C}\\ \mathcal{P}^\textrm{FN}_{c_k}, &\text{if } c_k \in C \text{ and } c_k \notin \hat{C}\\ \mathcal{P}^\textrm{FP}_{c_k}, &\text{if } c_k \notin C \text{ and } c_k \in \hat{C} \end{cases}, $$

where the SDRi is calculated as

$$ \textrm{SDRi}(\hat{\boldsymbol{x}}_k, \boldsymbol{x}_k, \boldsymbol{y}) = \textrm{SDR}(\hat{\boldsymbol{x}}_k, \boldsymbol{x}_k) - \textrm{SDR}(\boldsymbol{y} , \boldsymbol{x}_k),$$
$$\label{sq:sdr} \textrm{SDR}(\hat{\boldsymbol{x}}, \boldsymbol{x}) = 10\log_{10} \left( \frac{\|\boldsymbol{x}\|^2}{\|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^2} \right). $$

\(\mathcal{P}^\textrm{FN}_{c_k}\) and \(\mathcal{P}^\textrm{FP}_{c_k}\) are the penalty values for false negatives (FN) and false positives (FP), both set to \(0\), meaning that incorrect predictions do not contribute any improvement to the metric.
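A minimal sketch of the CA-SDRi computation is given below. It assumes at most one source per class label, so labels can serve as dictionary keys; the function names are illustrative, not the official scoring code:

```python
import numpy as np

def sdr(est, ref):
    """SDR(x_hat, x) = 10 log10(||x||^2 / ||x - x_hat||^2)."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def ca_sdri(est, ref, y, p_fn=0.0, p_fp=0.0):
    """est/ref: dicts mapping class label -> waveform at the reference mic;
    y: the mixture.  FN/FP penalties default to 0 as in the task definition."""
    labels = set(est) | set(ref)                  # C union C_hat
    total = 0.0
    for c in labels:
        if c in est and c in ref:                 # correct label: SDRi
            total += sdr(est[c], ref[c]) - sdr(y, ref[c])
        elif c in ref:                            # missed event (FN)
            total += p_fn
        else:                                     # spurious event (FP)
            total += p_fp
    return total / len(labels)
```

With the zero penalties, each missed or spurious event dilutes the average without adding any improvement.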

Additional informative metrics

The following informative metrics will be calculated:

  • PESQ (or POLQA) for speech and PEAQ for other signals will be used to provide additional information related to perceptual quality.
  • Accuracy, Recall (True Positive Rate: TPR), Precision and F1 score, as well as False Positive Rate (FPR).
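For the detection metrics, per-mixture true/false positives and false negatives can be derived from the reference and estimated label sets, for example as follows (a hypothetical sketch, not the official scoring code):

```python
def detection_counts(ref_labels, est_labels):
    """Per-mixture TP/FP/FN from reference and estimated label sets;
    aggregate the counts over all mixtures before computing the metrics."""
    ref, est = set(ref_labels), set(est_labels)
    tp = len(ref & est)     # correctly detected classes
    fp = len(est - ref)     # spurious detections
    fn = len(ref - est)     # missed classes
    return tp, fp, fn

def prf(tp, fp, fn):
    """Precision, recall (TPR), and F1 from aggregated counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```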

Additional example separation results processed by the submitted systems will be shown.

Ranking

The ranking of the systems will be determined with the class-aware signal-to-distortion ratio improvement (CA-SDRi) metric defined in Eq. (\ref{eq2}).

Additional informative results, including PESQ and PEAQ scores and some example separation results (wave files), will be presented but not counted toward the ranking.

Results

Available after June 30, 2025

Baseline system

The task organizers provide a baseline system, described in the baseline paper below. The system combines the Masked Modeling Duo (M2D) R8 for audio tagging with ResUNet R9 for separation. Fig. 2 shows two variants of the baseline: (a) ResUNet and (b) ResUNetK.

Figure 2: Proposed spatial semantic segmentation system flows.


The code is available at the following link:


Acknowledgement

This work was partially supported by the JST Strategic International Collaborative Research Program (SICORP), Grant Number JPMJSC2306, Japan.

This work was partially supported by the Agence Nationale de la Recherche (Project Confluence, grant number ANR-23-EDIA-0003).

Citation

If you are participating in this task or using the baseline code, please cite the two papers below.

Task description paper


Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Romain Serizel, Mayank Mishra, Marc Delcroix, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Tomohiro Nakatani, Takao Kawamura, and Nobutaka Ono. Description and discussion on DCASE 2025 challenge task 4: spatial semantic segmentation of sound scenes. 2025. URL: available soon.


Baseline system paper


Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, and Noboru Harada. Baseline systems and evaluation metrics for spatial semantic segmentation of sound scenes. 2025. URL: https://arxiv.org/abs/2503.22088, arXiv:2503.22088.


Reference

[R1] E. Fonseca, X. Favory, J. Pons, F. Font and X. Serra, "FSD50K: An Open Dataset of Human-Labeled Sound Events," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829-852, 2022, doi: 10.1109/TASLP.2021.3133208.

[R2] Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann, "EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation," in Proc. Interspeech 2024.

[R3] Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito, "Echo-aware Adaptation of Sound Event Localization and Detection in Unknown Environments," in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022.

[R4] Karol J. Piczak, "ESC: Dataset for Environmental Sound Classification," In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018, 2015. https://doi.org/10.1145/2733373.2806390

[R5] Nicolas Furnon, "Noise files for the DISCO dataset," 2020. https://github.com/nfurnon/disco

[R6] Bandhav Veluri, Malek Itani, Justin Chan, Takuya Yoshioka, Shyamnath Gollakota, "Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables," in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023.

[R7] I. R. Roman, C. Ick, S. Ding, A. S. Roman, B. McFee and J. P. Bello, "Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms," ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 1221-1225, doi: 10.1109/ICASSP48485.2024.10446118.

[R8] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada and K. Kashino, "Masked Modeling Duo: Towards a Universal Audio Pre-Training Framework," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2391-2406, 2024, doi: 10.1109/TASLP.2024.3389636.

[R9] Q. Kong, K. Chen, H. Liu, X. Du, T. Berg-Kirkpatrick, S. Dubnov, and M. D. Plumbley, "Universal source separation with weakly labeled data," arXiv preprint arXiv:2305.07447, 2023.