Semantic Acoustic Imaging for Sound Event Localization and Detection from Spatial Audio and Audiovisual Scenes


Task description

The goal of Semantic Acoustic Imaging for SELD is to build systems that use either audio-only or audiovisual inputs to map low-channel spatial audio to high-definition acoustic energy maps, generating dynamic semantic masks that encode sound event classes, spatial locations, and pixel-level acoustic intensity.

Information for Task 3 will be shared at: https://dcase.slack.com/archives/C01DMMGQCLF

Description

Given low-channel spatial audio, Semantic Acoustic Imaging for Sound Event Localization and Detection (SAI-SELD) produces high-resolution acoustic energy maps that describe which sound events are active, where they occur, and how their localized acoustic energy is distributed. Unlike conventional SELD, which estimates sparse direction-of-arrival vectors, this task represents the acoustic scene as a dense image-like field. The scientific goal is to move beyond point-based localization toward richer modeling of sound objects with source extent, directional energy distribution, spatial overlap, and diffuseness, while linking spatial audio analysis with semantic segmentation and multimodal learning.

The task addresses a challenging super-resolution problem using the STAIRS26 dataset (derived from STARSS23) at full spatial resolution. During development, participants have access to 32-channel recordings and corresponding high-definition acoustic maps. At evaluation time, however, systems receive only 4-channel input (tetrahedral array and/or first-order ambisonics) and must reconstruct detailed acoustic images from these resolution-limited spatial observations. The outputs are dynamic semantic polygon masks that jointly encode sound event class, source location, and instantaneous energy through pixel intensity. This setting supports both audio-only and audiovisual methods, and encourages the use of modern modeling architectures to recover semantically meaningful acoustic fields. Potential applications include robotic hearing, smart environments, audiovisual scene understanding, and autonomous systems operating in complex real-world scenes.

Figure 1: 360° video frame overlaid with the ground truth acoustic imaging map.


Dataset

This challenge uses the STAIRS26 dataset, a spatial audio benchmark that extends the STARSS23 dataset. STAIRS26 enriches the original data by providing two main components:

  1. 32-Channel Eigenmike Recordings: The raw 32-channel signals serve as a high-resolution complement to the originally released STARSS23 dataset, which was previously only available in 4-channel (tetrahedral and first-order ambisonics) formats.
  2. High-Definition Acoustic Maps: Provided as JSON files, these dense acoustic images capture the sound event class, source location, and localized acoustic energy in a unified representation.

For detailed dataset specifications on the original recording setup, hardware configuration, scene properties, and the recording and annotation procedures, please refer to the dataset paper of STARSS22:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.


STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.


and to the dataset paper of STARSS23:

Publication

Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.


STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Abstract

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.


A technical report on the specifications of the STAIRS26 dataset will be included here soon.

Dataset specifications

The main properties of STAIRS26 are summarized below.

Volume, duration, and split

  • The current release contains approximately 7.5 hours of recordings across 168 development clips, 12 of which (all in the training split) lack 360° video.
  • The file naming and data split conventions are retained from STARSS23, and the original STARSS23 audio and video files are intended to be used together with this dataset.
  • The current release contains only the development audio, video, and labels used for training and validation in the DCASE2026 challenge.

Audio

  • Sampling rate: 24 kHz.
  • Bit depth: 16 bits.
  • Recording format: 32-channel Eigenmike.

Recording format

The array response of the recording format, the Eigenmike 32 spherical microphone array, can be considered known. A theoretical model of the array transfer functions is provided, which describes the directional response of each channel to a source incident from a direction of arrival (DOA) given by azimuth angle \(\phi\) and elevation angle \(\theta\). The positions of the 32 capsules in spherical coordinates \((\phi, \theta, r)\) are:

\begin{align} M_{01} &: ( 0^\circ, \; 21^\circ, \; 4.2\,\mathrm{cm}) \\ M_{02} &: ( 32^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{03} &: ( 0^\circ, \; -21^\circ, \; 4.2\,\mathrm{cm}) \\ M_{04} &: (328^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{05} &: ( 0^\circ, \; 58^\circ, \; 4.2\,\mathrm{cm}) \\ M_{06} &: ( 45^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{07} &: ( 69^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{08} &: ( 45^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{09} &: ( 0^\circ, \; -58^\circ, \; 4.2\,\mathrm{cm}) \\ M_{10} &: (315^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{11} &: (291^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{12} &: (315^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{13} &: ( 91^\circ, \; 69^\circ, \; 4.2\,\mathrm{cm}) \\ M_{14} &: ( 90^\circ, \; 32^\circ, \; 4.2\,\mathrm{cm}) \\ M_{15} &: ( 90^\circ, \; -31^\circ, \; 4.2\,\mathrm{cm}) \\ M_{16} &: ( 89^\circ, \; -69^\circ, \; 4.2\,\mathrm{cm}) \\ M_{17} &: (180^\circ, \; 21^\circ, \; 4.2\,\mathrm{cm}) \\ M_{18} &: (212^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{19} &: (180^\circ, \; -21^\circ, \; 4.2\,\mathrm{cm}) \\ M_{20} &: (148^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{21} &: (180^\circ, \; 58^\circ, \; 4.2\,\mathrm{cm}) \\ M_{22} &: (225^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{23} &: (249^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{24} &: (225^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{25} &: (180^\circ, \; -58^\circ, \; 4.2\,\mathrm{cm}) \\ M_{26} &: (135^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{27} &: (111^\circ, \; 0^\circ, \; 4.2\,\mathrm{cm}) \\ M_{28} &: (135^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \\ M_{29} &: (269^\circ, \; 69^\circ, \; 4.2\,\mathrm{cm}) \\ M_{30} &: (270^\circ, \; 32^\circ, \; 4.2\,\mathrm{cm}) \\ M_{31} &: (270^\circ, \; -32^\circ, \; 4.2\,\mathrm{cm}) \\ M_{32} &: (271^\circ, \; -69^\circ, \; 4.2\,\mathrm{cm}) \end{align}

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

\begin{equation} H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m)) \end{equation}

where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\gamma_m\) is the angle between the microphone direction and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative, with respect to its argument, of the spherical Hankel function of the second kind. The expansion is truncated at order \(n = 30\), which provides negligible modeling error up to 20 kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
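For illustration, the expansion can be evaluated numerically. The following sketch is our own (not one of the provided example routines); it uses SciPy's spherical Bessel functions and the identity \(h_n^{(2)} = j_n - i\,y_n\):

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

def eigenmike_response(mic_dir, doa, freqs, R=0.042, c=343.0, order=30):
    """Directional frequency response H_m of one capsule on a rigid
    spherical baffle, following the truncated expansion above.

    mic_dir, doa : (azimuth, elevation) in radians.
    freqs        : array of frequencies in Hz (all > 0).
    """
    phi_m, theta_m = mic_dir
    phi, theta = doa
    # cos(gamma_m): cosine of the angle between capsule direction and DOA
    cos_gamma = (np.sin(theta_m) * np.sin(theta)
                 + np.cos(theta_m) * np.cos(theta) * np.cos(phi_m - phi))
    kR = 2.0 * np.pi * np.asarray(freqs, dtype=float) * R / c  # omega R / c
    H = np.zeros_like(kR, dtype=complex)
    for n in range(order + 1):
        # derivative of the spherical Hankel function of the second kind:
        # h_n^{(2)}'(x) = j_n'(x) - i * y_n'(x)
        hn2_prime = (spherical_jn(n, kR, derivative=True)
                     - 1j * spherical_yn(n, kR, derivative=True))
        H += (1j ** (n - 1)) / hn2_prime * (2 * n + 1) * eval_legendre(n, cos_gamma)
    return H / kR**2
```

The function name and argument layout are our own choices; only the formula itself comes from the task description.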

Reference Acoustic Maps (JSON Annotations)

For the development set, we provide high-definition acoustic images generated from the 32-channel microphone array recordings. These maps serve as the dense ground-truth representation for training Semantic Acoustic Imaging systems. The maps are generated at a temporal resolution of 10 fps and on a spherical uniform grid of 484 points, using the super-resolution method referenced below. The spherical maps are further interpolated into conventional 2D 360° images in an equirectangular format. Additionally, since the original maps capture contributions from all sources, peaks around the active sources of the target sound classes are cropped using spherical polygonal masks. The isolated power maps, one per active target sound event at each frame, are then registered as individual sound event-based annotations.

Publication

Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.


The acoustic maps are provided within .json files, one per recording. The annotations field contains a list of dictionary entries, where each entry represents an active sound-producing object at a specific frame.

Each annotation entry contains the following structured fields:

  • metadata_frame_index: Indicates the frame where the annotation is active. Important: The temporal resolution (frame rate) is 10 FPS.
  • instance_id: A unique identifier used to track sound sources across multiple frames (consistent with STARSS23 source_id).
  • category_id: An integer mapping to the STARSS23 labels.
  • distance: The distance of the sound source from the microphone, measured in centimeters.
  • segmentation: Contains the full filled polygon(s) representing the source's spatial extent and directional intensity.
    • There may be more than one polygon per annotation (for example, if an object wraps around the edge of the 360-degree video, resulting in a split polygon).
    • Coordinates & Intensity: Each vertex of a polygon is a 3-element array formatted as [x, y, intensity]. The first two values are the x and y spatial coordinates, and the third is the pixel intensity: a standardized acoustic amplitude between 0 and 1, where 0 is the minimum amplitude over the recordings in the training split and 1 the maximum.

The spatial coordinates correspond to an angular grid with one pixel per degree (x ∈ [0, 359], y ∈ [0, 179]); amplitude values are standard floating-point numbers. The x coordinate represents the azimuth angle: x=180 corresponds to the front of the scene (azimuth 0°), x=0 and x=359 correspond to directly behind (azimuth ±180°), and values decrease moving anti-clockwise (leftward in the image). The y coordinate represents the polar angle, with y=0 at the top of the image (north pole, elevation +90°) and y=179 at the bottom (south pole, elevation −90°).
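As a concrete reading of this convention, a pair of hypothetical helper functions (our own sketch, not part of the dataset tooling) converting between angles in degrees and pixel coordinates could look like:

```python
def angles_to_pixels(azimuth_deg, elevation_deg):
    """Map (azimuth, elevation) in degrees to (x, y) pixel coordinates.

    Convention from the annotations: x = 180 is the front (azimuth 0°),
    x decreases as azimuth increases anti-clockwise; y = 0 is the north
    pole (elevation +90°); one pixel per degree.
    """
    x = (180.0 - azimuth_deg) % 360.0
    y = 90.0 - elevation_deg
    return x, y

def pixels_to_angles(x, y):
    """Inverse mapping; azimuth is wrapped to [-180, 180)."""
    azimuth = (180.0 - x + 180.0) % 360.0 - 180.0
    elevation = 90.0 - y
    return azimuth, elevation
```

For example, a source directly in front at the horizon (azimuth 0°, elevation 0°) lands at pixel (180, 90), the centre of the equirectangular image.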

Sound event classes

The semantic labels follow the same 13 target sound event classes used in STARSS23:

  1. Female speech, woman speaking
  2. Male speech, man speaking
  3. Clapping
  4. Telephone
  5. Laughter
  6. Domestic sounds
  7. Walk, footsteps
  8. Door, open or close
  9. Music
  10. Musical instrument
  11. Water tap, faucet
  12. Bell
  13. Knock

These classes are inherited from the STARSS23 annotation scheme.

Resources

A script for generating the standardized acoustic images from the 32-channel audio recordings is available in the associated repository:

  • https://github.com/AudibleLight/starss_representations

A separate visualization script for inspecting the JSON acoustic maps is also available:

  • https://gist.github.com/HuwCheston/d46559748ea3af8e37fda711486fd3bf

Download

The development dataset is distributed as two zip files:

  • 32ch_audio_dev.zip: development audio.
  • labels_dev.zip: generated acoustic-image labels in JSON format.

These can be downloaded from:


Task Setup

The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with baseline results and consistent reporting of results in the technical reports of the submitted systems, prior to the evaluation stage.

Note that even though there are two origins of the underlying STARSS23 data (Sony and TAU), the challenge task considers the dataset as a single entity. Hence, models should not be trained separately for each of the two origins and tested individually. Instead, the clips of the individual training splits should be combined, and the models should be trained and evaluated on the respective combined splits.

The evaluation dataset is released a few weeks before the final submission deadline. That dataset consists of only low-channel audio (4-channel tetrahedral/FOA) and video clips without any metadata, labels, or 32-channel recordings. At the evaluation stage, participants can decide the training procedure (e.g., the amount of training and validation files to be used from the development dataset, the number of ensemble models, etc.) and submit the results of their Semantic Acoustic Imaging performance on the evaluation dataset.

There are two tracks that participants can follow: the audio-only track and the audiovisual track. Participants can choose to submit systems to either of the two tracks, or both. We strongly encourage participants to submit both audio-only and audiovisual models. Submissions for both tracks will output the same dynamic semantic polygon masks, and ranking results will be presented in separate tables for each track.

Track A: Audio-only inference

In the audio-only track, inference of the semantic acoustic images is performed using only the limited 4-channel spatial audio from the STARSS23 dataset. Note that we do not exclude the use of video data or video information during the training of such models. In this sense, the video clips of the development set could be treated as external data and exploited in various ways to improve the performance of the model. However, during inference, only the audio recordings of the evaluation set may be used.

Track B: Audiovisual inference

In the audiovisual track, participants have access to corresponding video data during both training and evaluation. The models in this track are expected to use both the 4-channel audio and the video data during inference to reconstruct the high-resolution acoustic images.

Development dataset

The clips in the development dataset follow the naming convention retained from STARSS23: fold[fold number]_room[room number]_mix[recording number per room].wav

The fold number is used to distinguish between the training and testing splits. The room information is provided to help users understand the performance of their methods with respect to different acoustic conditions.

Evaluation dataset

The evaluation dataset consists of clips without any information regarding their origin (Sony or TAU) or location in the naming convention. They follow the format below: sample[clip number].wav (and corresponding .mp4 for video).

During evaluation, only a 4-channel tetrahedral array subset of the original 32-channel recordings will be provided, similar to the microphone array format of previous challenges. The coordinates and order of those microphones in spherical coordinates \((\phi, \theta, r)\) are:

\begin{align} M_1 &: ( 45^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \\ M_2 &: ( -45^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_3 &: ( 135^\circ, \; -35^\circ, \; 4.2\,\mathrm{cm}) \\ M_4 &: (-135^\circ, \; 35^\circ, \; 4.2\,\mathrm{cm}) \end{align}

External Data Resources and Pretrained Models

Since the development set contains recordings of real scenes, the presence of each class and the density of sound events vary greatly. To enable more effective training of models that detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow the use of external datasets as long as they are publicly available. Examples of external data include sound sample banks, annotated sound event datasets, pre-trained models, and room and array impulse response libraries.

The following rules apply to the use of external data:

  • The external datasets or pre-trained models used should be freely and publicly accessible before 1 April 2026.
  • Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
  • The participants will have to indicate clearly which external data they have used in their system info and technical report.
  • Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
| Dataset name | Type | Added | Link |
| --- | --- | --- | --- |
| TAU-SRIR DB | room impulse responses | 04.04.2022 | https://zenodo.org/records/6408611 |
| 6DOF_SRIRs | room impulse responses | 23.11.2021 | https://zenodo.org/records/6382405 |
| METU SRIRs | room impulse responses | 10.04.2019 | https://zenodo.org/records/2635758 |
| MIRACLE | room impulse responses | 12.10.2023 | https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2 |
| AudioSet | audio, video | 30.03.2017 | https://research.google.com/audioset/ |
| FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
| ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
| Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
| IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
| Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
| SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
| TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
| TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
| PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| wav2vec2.0 | pre-trained model | 20.08.2020 | https://github.com/facebookresearch/fairseq |
| PaSST | pre-trained model | 18.09.2022 | https://github.com/kkoutini/PaSST |
| DTF-AT | pre-trained model | 19.12.2023 | https://github.com/ta012/DTFAT |
| FNAC_AVL | pre-trained model | 25.03.2023 | https://github.com/OpenNLPLab/FNAC_AVL |
| CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
| JSUT | audio | 28.10.2017 | https://sites.google.com/site/shinnosuketakamichi/publication/jsut |
| VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |
| COCO | video | 01.05.2014 | https://cocodataset.org/ |
| 360-Indoor | video | 03.10.2019 | http://aliensunmin.github.io/project/360-dataset/ |
| TorchVision Models and Pre-trained Weights | pre-trained model | 03.04.2017 | https://pytorch.org/vision/stable/models.html |
| YOLOv7 | pre-trained model | 07.07.2022 | https://github.com/WongKinYiu/yolov7 |
| YOLOv8 | pre-trained model | 10.01.2023 | https://github.com/ultralytics/ultralytics |
| Grounding DINO | pre-trained model | 10.03.2023 | https://github.com/IDEA-Research/GroundingDINO |
| MMDetection | pre-trained model | 15.12.2021 | https://github.com/open-mmlab/mmdetection |
| MMPose | pre-trained model | 01.01.2020 | https://github.com/open-mmlab/mmpose |
| MMFlow | pre-trained model | 01.01.2021 | https://github.com/open-mmlab/mmflow |
| Paddle Detection | pre-trained model | 01.01.2019 | https://github.com/PaddlePaddle/PaddleDetection |
| doors Image Dataset | image | 18.02.2022 | https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1 |
| DoorDetect Dataset | image | 27.05.2021 | https://github.com/MiguelARD/DoorDetect-Dataset |
| CLIP | pre-trained model | 05.01.2021 | https://github.com/openai/CLIP |
| CLAP | pre-trained model | 06.03.2022 | https://github.com/LAION-AI/CLAP |
| Depth Anything | pre-trained model | 22.01.2024 | https://github.com/LiheYoung/Depth-Anything |
| PanoFormer | pre-trained model | 04.03.2022 | https://github.com/zhijieshen-bjtu/PanoFormer |
| SoundQ Youtube 360° video list | video | 06.10.2023 | https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv |
| FMA | audio | 02.12.2016 | https://github.com/mdeff/fma |
| Dasheng | pre-trained model | 11.06.2024 | https://github.com/XiaoMi/dasheng |
| DINOv2 | pre-trained model | 18.04.2023 | https://github.com/facebookresearch/dinov2 |
| ONE-PEACE | pre-trained model | 19.05.2023 | https://github.com/OFA-Sys/ONE-PEACE |
| OWL-ViT | pre-trained model | 13.05.2022 | https://huggingface.co/docs/transformers/en/model_doc/owlvit |
| Flickr30k | image | 01.02.2014 | https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset |
| SELDVisualSynth Canvas and Assets | image, video | 27.03.2025 | https://github.com/adrianSRoman/SELDVisualSynth |
| Audio-Visual Scene Reproduction Datasets | audio, image | 30.10.2021 | http://3dkim.com/research/VR/index.html |
| SurrRoom 1.0 Dataset | room impulse responses | 12.05.2023 | https://cvssp.org/data/SurrRoom1_0/ |
| NitroFusion | pre-trained model | 02.12.2024 | https://chendaryen.github.io/NitroFusion.github.io/ |
| PSELDNets | pre-trained model | 11.11.2024 | https://github.com/Jinbo-Hu/PSELDNets |
| HTS-AT | pre-trained model | 02.02.2022 | https://github.com/RetroCirce/HTS-Audio-Transformer |


Example external data use with baseline

Synthetic audio-only and audiovisual data (along with generated ground truth power maps conforming to the task specifications) can be generated using AudibleLight.

AudibleLight is a new Python package for synthetic soundscape synthesis across both ray-traced & real-world measured acoustics. It combines and extends the functionality available in many existing packages, including SpatialScaper, SELDVisualSynth, PyRoomAcoustics, and SoundSpaces.

AudibleLight can be downloaded from the following repository, and usage information is available on the repository page. When using AudibleLight to generate data for this task, a development install is recommended rather than installing from PyPI.


If you use AudibleLight to generate synthetic data for your system, please ensure you cite the following paper:

Publication

Huw Cheston, Adrian Stepien, Juan Azcarreta, Adrian S. Roman, Chuyang Chen, Cagdas Bilen, and Iran R. Roman. Audiblelight: a controllable, end-to-end api for soundscape synthesis across ray-traced & real-world measured acoustics. In DMRN+20: Digital Music Research Network One-Day Workshop 2025. 2025.


Task Rules

  • Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
  • Manipulation of the provided training-test split in the development dataset is not allowed for reporting dev-test results using the development dataset.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
  • The development dataset can be augmented using techniques such as pitch shifting, time stretching, spatial transformations, source separation, respatialization, or re-reverberation of parts.

Submission

During the evaluation phase of the challenge, the results for each recording in the evaluation dataset must be collected in individual JSON files. Each result file must have the same base name as the corresponding audio recording file, but with the .json extension (e.g., sample001.json for sample001.wav).

Each JSON file must contain a single top-level key, annotations, whose value is a list of annotation entries. Each entry is a dictionary representing one active sound source instance at one specific frame. The overall structure is:

{
  "annotations": [
    {
      "metadata_frame_index": 42,
      "category_id":          1,
      "score":                0.812,
      "distance":             245.0,
      "segmentation":         [[[120.0, 88.0, 0.923], [121.0, 89.0, 0.741], ...]]
    },
    ...
  ]
}
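For illustration, a minimal Python sketch (our own hypothetical helper, not an official task script) that writes one result file in this format:

```python
import json

def write_result_file(path, entries):
    """Write one evaluation result file: a single top-level 'annotations'
    key holding a list of per-frame annotation entries."""
    with open(path, "w") as f:
        json.dump({"annotations": entries}, f, indent=2)

# One illustrative entry; all values are made up for this example.
example_entry = {
    "metadata_frame_index": 42,   # 10 FPS, so frame 42 = 4.2 s
    "category_id": 1,             # zero-indexed class id
    "score": 0.812,               # mandatory confidence in [0, 1]
    "distance": 245.0,            # optional, centimetres
    "segmentation": [[[120.0, 88.0, 0.923], [121.0, 89.0, 0.741]]],
}
write_result_file("sample001.json", [example_entry])
```

The file name must match the corresponding recording (here `sample001.wav`).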

Field Descriptions

Each annotation entry must contain the following fields:

  • metadata_frame_index (int): The frame index at which this annotation is active. The temporal resolution is 10 FPS, so frame index k corresponds to time k / 10 seconds. Enumeration begins at zero.
  • category_id (int): A zero-indexed integer identifying the sound event class. Class indices follow the ordering given in the class descriptions above (0 = Female speech, 1 = Male speech, ..., 12 = Knock).

  • score (float): This field is mandatory. The detection confidence score for this instance at this frame, in the range [0, 1]. The official evaluation computes mean Average Precision (mAP) (see Evaluation below). Submissions that omit this field or set all scores to a constant value will produce degenerate AP curves and will be ranked last.

  • distance (float): This field is NOT mandatory. The predicted distance of the sound source from the microphone array centre, measured in centimetres. Distance estimation is not used in the ranking; however, if participants develop methods that estimate it, their results will be reported in the challenge results based on the Relative Distance Error (RDE) metric.

  • segmentation (list of lists of triplets): A list of polygon(s) representing the predicted spatial extent and acoustic energy distribution of the source. Each polygon is a list of [x, y, intensity] triplets. There may be more than one polygon per annotation entry (e.g., when a source spans the azimuth wrap boundary and its representation is split into two fragments). The format exactly mirrors that of the reference labels:

    • x (float): Azimuth coordinate in the equirectangular image, x ∈ [0, 359]. x = 180 is front-facing; values decrease anti-clockwise.
    • y (float): Polar (elevation) coordinate, y ∈ [0, 179]. y = 0 is the north pole.
    • intensity (float): Predicted normalised acoustic amplitude, ∈ [0, 1].

Participants are not required to submit dense filled polygons. Sparse representations consisting of the highest-energy peaks within each source's bounding region are fully supported by the evaluator, which internally renders both prediction and reference through a spherical Gaussian kernel before metric computation. A minimum of one [x, y, intensity] triplet per annotation entry is required. We recommend submitting between 10 and 30 peaks per instance per frame; the provided baseline inference script exports the top-20 peaks by default.
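As an illustration of such a sparse export, a minimal sketch (a hypothetical helper, not the baseline inference script) that selects the k highest-energy pixels of a 180×360 energy map and formats them as triplets:

```python
import numpy as np

def topk_peak_triplets(energy_map, k=20):
    """Select the k highest-energy pixels of an equirectangular energy
    map (shape (180, 360), indexed energy_map[y, x], values in [0, 1])
    and format them as [x, y, intensity] triplets."""
    # flat indices of the k largest values, in descending order
    flat = np.argsort(energy_map, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(flat, energy_map.shape)
    return [[float(x), float(y), float(energy_map[y, x])]
            for x, y in zip(xs, ys)]
```

The resulting list can be used directly as one polygon inside a `segmentation` field, since the evaluator re-renders sparse peaks through its Gaussian kernel.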

Coordinate System Summary

| Field | Image range | Meaning of zero | Notes |
| --- | --- | --- | --- |
| x | [0, 359] | Left edge of image = azimuth ±180° (directly behind) | Front of scene = x=180; anti-clockwise = decreasing x |
| y | [0, 179] | Top of image = north pole (elevation +90°) | South pole (elevation −90°) = y=179 |
| intensity | [0, 1] | Minimum recorded amplitude in training split | 1 is the maximum recorded amplitude |

Temporal Resolution

The evaluation is performed at a temporal resolution of 100 ms (10 FPS). If your system produces predictions at a different frame rate or hop size, you must resample or aggregate your outputs to this resolution before submission. Each metadata_frame_index value must be a non-negative integer corresponding to a distinct 100 ms window.


Baseline Inference Script

A reference inference script (run_inference.py) is provided in the task repository. It takes a trained model checkpoint and a directory of evaluation recordings as input and produces correctly formatted JSON output files directly. Participants may use this script as-is, adapt it to their own model architecture, or implement their own inference pipeline entirely, as long as the output JSON format described above is respected.

General information for all DCASE submissions can be found on the Submission page.

Evaluation

The evaluation jointly assesses sound event detection, spatial localization, acoustic imaging reconstruction, and distance estimation.

Rendering Pipeline

Every segmentation annotation (ground truth and prediction) is converted into a continuous 360×180 equirectangular energy map. For each [x, y, intensity] triplet, a spherical Gaussian blob (standard deviation \(\sigma = 6^\circ\)) is added to the canvas:

$$\mathcal{E}(u, v) = \sum_k e_k \exp\!\left(-\frac{d_{\mathrm{gc}}^2\bigl((u,v),(x_k,y_k)\bigr)}{2\sigma^2}\right)$$

where \(d_{\mathrm{gc}}\) is the great-circle angular distance, and \(e_k\) is the normalized intensity. The kernel accounts for equirectangular distortion by expanding the azimuth half-width near the poles by \(1/\cos(\theta_k)\), ensuring a consistent solid angle.

A binary segmentation mask \(M\) is derived by thresholding at \(\tau = 10\%\) of the peak rendered energy:

$$M(u,v) = \mathbf{1}\!\left[\mathcal{E}(u,v) \geq \tau \cdot \max_{u',v'}\mathcal{E}(u',v')\right]$$

Annotations with a peak energy below \(10^{-9}\) are excluded from evaluation.
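The rendering and binarization steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions (one pixel per degree, pixel centers at integer coordinates, great-circle distance evaluated exactly, which yields the same pole-consistent footprint that the azimuth half-width expansion approximates); it is not the official evaluation code.

```python
import numpy as np

SIGMA = np.deg2rad(6.0)   # Gaussian kernel sigma (6 degrees of arc)
TAU = 0.10                # binarization threshold, fraction of peak energy

def render_energy_map(triplets, width=360, height=180):
    """Render [x, y, intensity] triplets to an equirectangular energy map,
    adding one spherical Gaussian blob per triplet."""
    # Per-pixel azimuth/elevation in radians (x = 180 -> azimuth 0, y = 0 -> +90 deg).
    az = np.deg2rad(180.0 - np.arange(width))[None, :]
    el = np.deg2rad(90.0 - np.arange(height))[:, None]
    canvas = np.zeros((height, width))
    for x_k, y_k, e_k in triplets:
        az_k, el_k = np.deg2rad(180.0 - x_k), np.deg2rad(90.0 - y_k)
        # Great-circle angular distance via the spherical law of cosines.
        cos_d = (np.sin(el) * np.sin(el_k)
                 + np.cos(el) * np.cos(el_k) * np.cos(az - az_k))
        d = np.arccos(np.clip(cos_d, -1.0, 1.0))
        canvas += e_k * np.exp(-d**2 / (2.0 * SIGMA**2))
    return canvas

def binarize(canvas):
    """Binary mask M: threshold at TAU times the peak rendered energy."""
    peak = canvas.max()
    if peak < 1e-9:   # such annotations are excluded from evaluation
        return np.zeros_like(canvas, dtype=bool)
    return canvas >= TAU * peak
```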

Spatial Matching

To compute Pearson \(r\) and RDE metrics, predictions and ground truths are paired per frame and per class:

- Peak locations are extracted from the rendered maps.
- The Hungarian algorithm finds the minimum-cost one-to-one assignment based on great-circle angular distance.
- Matches exceeding \(T_{\mathrm{DOA}} = 20^\circ\) are rejected (becoming false positives/negatives).
- Cross-class matching is strictly prohibited.
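Assuming peaks are given as (azimuth, elevation) pairs in degrees, the matching step can be sketched with SciPy's Hungarian solver; the helper name and call signature are illustrative, not the official evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_peaks(gt_deg, pred_deg, t_doa_deg=20.0):
    """One-to-one matching of (azimuth, elevation) peaks, in degrees.
    Pairs farther apart than t_doa_deg are rejected (they become
    false negatives / false positives)."""
    gt = np.deg2rad(np.asarray(gt_deg, dtype=float))    # shape (G, 2)
    pr = np.deg2rad(np.asarray(pred_deg, dtype=float))  # shape (P, 2)
    az_g, el_g = gt[:, 0][:, None], gt[:, 1][:, None]
    az_p, el_p = pr[:, 0][None, :], pr[:, 1][None, :]
    # Great-circle distance matrix (spherical law of cosines), in degrees.
    cos_d = np.sin(el_g) * np.sin(el_p) + np.cos(el_g) * np.cos(el_p) * np.cos(az_g - az_p)
    dist = np.degrees(np.arccos(np.clip(cos_d, -1.0, 1.0)))
    rows, cols = linear_sum_assignment(dist)  # minimum-cost assignment
    return [(i, j) for i, j in zip(rows, cols) if dist[i, j] <= t_doa_deg]
```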

Metrics

1. Mask mAP — Detection and Localization (Primary Metric)

The primary ranking metric is the Mask mean Average Precision (Mask mAP), computed via the standard COCO evaluation protocol on the binarized rendered masks.

$$\mathrm{Mask\ mAP} = \frac{1}{C} \sum_{c=1}^{C} \left( \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathrm{AP}_c^{(t)} \right)$$

where \(C = 13\) is the number of classes and \(\mathcal{T} = \{0.50, 0.55, \ldots, 0.95\}\) is the set of IoU thresholds, with IoU computed in equirectangular pixel space. We also report Mask AP50 (IoU \(t = 0.50\)). Mask mAP simultaneously penalizes missed detections, false positives, and poor spatial localization.

Note: The score field is mandatory; submitting constant scores produces a degenerate precision-recall curve, and such a system will rank last on this metric.

2. Pearson \(r\) — Energy Field Reconstruction Quality

For each spatially matched pair, the Pearson correlation coefficient is computed over the 2-pixel dilated union of their binary masks (\(\mathcal{U}\)):

$$r = \frac{\displaystyle\sum_{(u,v)\in\mathcal{U}} \bigl(\mathcal{E}(u,v)-\bar{\mathcal{E}}_{\mathcal{U}}\bigr)\bigl(\hat{\mathcal{E}}(u,v)-\hat{\bar{\mathcal{E}}}_{\mathcal{U}}\bigr)}{\displaystyle\sqrt{\sum_{(u,v)\in\mathcal{U}}\!\bigl(\mathcal{E}(u,v)-\bar{\mathcal{E}}_{\mathcal{U}}\bigr)^2 \cdot \sum_{(u,v)\in\mathcal{U}}\!\bigl(\hat{\mathcal{E}}(u,v)-\hat{\bar{\mathcal{E}}}_{\mathcal{U}}\bigr)^2}}$$

This measures spatial shape fidelity independently of absolute intensity scale. Scores are macro-averaged across all matched pairs and classes to yield \(\bar{r}\). Flat maps (range \(< 10^{-7}\)) are excluded.
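A sketch of the per-pair computation, assuming the energy maps and binary masks are NumPy arrays; the dilation radius and flat-map exclusion follow the description above, while the helper name and `None` return convention are illustrative.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def pearson_over_union(e_ref, e_pred, m_ref, m_pred, dilate_px=2):
    """Pearson r between two energy maps over the 2-pixel dilated union of
    their binary masks; returns None for excluded (flat) maps."""
    union = binary_dilation(m_ref | m_pred, iterations=dilate_px)
    a, b = e_ref[union], e_pred[union]
    if np.ptp(a) < 1e-7 or np.ptp(b) < 1e-7:   # flat maps are excluded
        return None
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```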

3. Relative Distance Error (RDE) — Range Estimation Quality (not included in the ranking)

For each matched pair, the relative distance error (RDE) is calculated as a percentage of the reference distance, where \(L_p^{(k)}\) and \(L_r^{(k)}\) denote the predicted and reference source distances of the \(k\)-th matched pair:

$$\Delta_k = \frac{|L_p^{(k)} - L_r^{(k)}|}{L_r^{(k)}} \times 100\%$$

The overall \(\mathrm{RDE}\) is the macro-average across all classes. This asymmetric metric appropriately penalizes near-field errors more heavily than equivalent absolute errors on distant sources.
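The per-pair error and the macro-average can be sketched as follows; the function names are illustrative, and pairs are assumed to be (predicted, reference) distances grouped by class.

```python
# Illustrative sketch of the RDE computation described above.

def relative_distance_error(l_pred, l_ref):
    """Per-pair RDE as a percentage of the reference distance."""
    return abs(l_pred - l_ref) / l_ref * 100.0

def macro_rde(pairs_by_class):
    """Mean RDE within each class, then mean across classes."""
    per_class = [sum(relative_distance_error(p, r) for p, r in pairs) / len(pairs)
                 for pairs in pairs_by_class.values() if pairs]
    return sum(per_class) / len(per_class)
```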

Note: Both \(\bar{r}\) and RDE are computed exclusively over matched true-positive pairs. The Mask mAP metric carries the full penalty for detection failures and false positives.

Parameter and Threshold Summary

| Parameter | Value | Role |
|---|---|---|
| Gaussian kernel \(\sigma\) | \(6^\circ\) of arc | Rendering GT and predictions to energy maps |
| Binary mask threshold \(\tau\) | \(10\%\) of peak | Binarization for Mask mAP IoU computation |
| Angular match threshold \(T_{\mathrm{DOA}}\) | \(20^\circ\) great-circle | Hungarian matching rejection criterion |
| IoU range (Mask mAP) | \(0.50 : 0.05 : 0.95\) | Standard COCO protocol, 10 thresholds |
| Classes \(C\) | 13 | Macro-averaging denominator for all metrics |
| Canvas resolution | \(360 \times 180\) px | One pixel per degree, equirectangular |

Ranking

Overall ranking will be based on the cumulative rank of the metrics mentioned above, sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as \(mAP{:}\,1,\ \bar{r}{:}\,3,\ RDE{:}\,1\), then its cumulative rank is \(1+3+1=5\). Then if system B has \(mAP{:}\,2,\ \bar{r}{:}\,2,\ RDE{:}\,3\) (7), and system C has \(mAP{:}\,3,\ \bar{r}{:}\,1,\ RDE{:}\,2\) (6), the overall order of the systems is A, C, B. If two systems end up with the same cumulative rank, they will be ranked based on the Mask mAP rank (i.e., the system with the higher mAP will be ranked first).
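The cumulative-rank procedure can be sketched as follows; the system and metric names are illustrative.

```python
# Sketch of cumulative ranking with a primary-metric tie-break.

def overall_ranking(per_metric_ranks, primary="mAP"):
    """Order systems by cumulative (summed) per-metric rank, ascending;
    ties are broken by the primary-metric rank (lower rank first)."""
    def key(system):
        ranks = per_metric_ranks[system]
        return (sum(ranks.values()), ranks[primary])
    return sorted(per_metric_ranks, key=key)
```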

Baseline System

The baseline for both Track A (audio-only inference) and Track B (audiovisual inference) is provided in a unified codebase.

The baseline model takes per-frame multimodal inputs consisting of RGB image frames and spatially rendered acoustic feature maps derived from a pre-trained neural acoustic upscaler applied to the 4-channel input audio.

Publication

Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.


This upscaler projects the low-channel spatial audio into a set of equirectangular acoustic band images, which are concatenated with the RGB frame to form a 12-channel input tensor. In the audio-only track, the RGB channels are replaced with zero-valued tensors.

The combined input is processed by an instance segmentation backbone that predicts, for each detected sound event, a bounding region on the equirectangular canvas, a class label, a detection confidence score, a 28×28 acoustic energy map, and a source distance. A frame-level tracker based on IoU-constrained Hungarian matching links detections across time to produce temporally consistent instance identities. At inference, detections are filtered through a multi-stage pipeline (score thresholding, per-class non-maximum suppression, per-frame class cap, and track confirmation) before the energy maps are sparsified and exported in the submission JSON format.
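A hedged sketch of such a filtering stage follows; the thresholds, field names, and axis-aligned box IoU are assumptions rather than the baseline's exact implementation, and track confirmation is omitted for brevity.

```python
# Hypothetical multi-stage detection filter: score thresholding,
# per-class greedy NMS, and a per-frame class cap.

def filter_detections(dets, score_thr=0.3, iou_thr=0.5, per_class_cap=5):
    """Each detection is a dict with 'score', 'cls', and 'box' = (x0, y0, x1, y1)."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    kept = []
    for cls in {d["cls"] for d in dets}:
        cand = sorted((d for d in dets if d["cls"] == cls and d["score"] >= score_thr),
                      key=lambda d: d["score"], reverse=True)
        survivors = []
        for d in cand:  # greedy NMS: keep a box only if no survivor overlaps it
            if all(iou(d["box"], s["box"]) < iou_thr for s in survivors):
                survivors.append(d)
        kept.extend(survivors[:per_class_cap])  # per-frame class cap
    return kept
```

Note that plain axis-aligned IoU ignores the azimuth wrap at the left and right edges of an equirectangular canvas; a real implementation would need to account for that.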

Baseline results on the development set will be reported here soon.

Please refer to the README file in the baseline repository for detailed information on installation, training, and inference.

Citation

If you are participating in this task or using the dataset and baseline code, please consider citing the following papers:

STAIRS26 Dataset and DCASE 2026 Task Description:

High-Definition Acoustic Maps & Baseline (UpLAM):

Publication

Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.


Original STARSS Datasets:

Publication

Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.


Abstract

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.


Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.


AudibleLight (if using synthetic data):

Publication

Huw Cheston, Adrian Stepien, Juan Azcarreta, Adrian S. Roman, Chuyang Chen, Cagdas Bilen, and Iran R. Roman. Audiblelight: a controllable, end-to-end api for soundscape synthesis across ray-traced & real-world measured acoustics. In DMRN+20: Digital Music Research Network One-Day Workshop 2025. 2025.
