The goal of the Semantic Acoustic Imaging for SELD task is to create systems that use either audio-only or audiovisual inputs to map low-channel spatial audio to high-definition acoustic energy maps, generating dynamic semantic masks that encode sound event classes, spatial locations, and pixel-level acoustic intensity.
Description
Given low-channel spatial audio, Semantic Acoustic Imaging for Sound Event Localization and Detection (SAI-SELD) produces high-resolution acoustic energy maps that describe which sound events are active, where they occur, and how their localized acoustic energy is distributed. Unlike conventional SELD, which estimates sparse direction-of-arrival vectors, this task represents the acoustic scene as a dense image-like field. The scientific goal is to move beyond point-based localization toward richer modeling of sound objects with source extent, directional energy distribution, spatial overlap, and diffuseness, while linking spatial audio analysis with semantic segmentation and multimodal learning.
The task addresses a challenging super-resolution problem using the STAIRS26 dataset (derived from STARSS23) at full spatial resolution. During development, participants have access to 32-channel recordings and corresponding high-definition acoustic maps. At evaluation time, however, systems receive only 4-channel input (tetrahedral array and/or first-order ambisonics) and must reconstruct detailed acoustic images from these resolution-limited spatial observations. The outputs are dynamic semantic polygon masks that jointly encode sound event class, source location, and instantaneous energy through pixel intensity. This setting supports both audio-only and audiovisual methods, and encourages the use of modern modeling architectures to recover semantically meaningful acoustic fields. Potential applications include robotic hearing, smart environments, audiovisual scene understanding, and autonomous systems operating in complex real-world scenes.
Dataset
This challenge uses the STAIRS26 dataset, a spatial audio benchmark that extends the STARSS23 dataset. STAIRS26 enriches the original data by providing two main components:
- 32-Channel Eigenmike Recordings: The raw 32-channel signals serve as a high-resolution complement to the originally released STARSS23 dataset, which was previously only available in 4-channel (tetrahedral and first-order ambisonics) formats.
- High-Definition Acoustic Maps: Provided as JSON files, these dense acoustic images capture the sound event class, source location, and localized acoustic energy in a unified representation.
For detailed dataset specifications on the original recording setup, hardware configuration, scene properties, and the recording and annotation procedures, please refer to the dataset paper of STARSS22:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.
and to the dataset paper of STARSS23:
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
A technical report on the specifications of the STAIRS26 dataset will be included here soon.
Dataset specifications
The main properties of STAIRS26 are summarized below.
Volume, duration, and split
- The current release contains approximately 7.5 hours of recordings across 168 development clips (12 clips in the training split do not have 360° video).
- The file naming and data split conventions are retained from STARSS23, and the original STARSS23 audio and video files are intended to be used together with this dataset.
- The current release contains only the development audio, video, and labels used for training and validation in the DCASE2026 challenge.
Audio
- Sampling rate: 24 kHz.
- Bit depth: 16 bits.
- Recording format: 32-channel Eigenmike.
Recording format
The array response of the recording format, the Eigenmike 32 spherical microphone array, can be considered known. A theoretical model of the array transfer functions is provided, describing the directional response of each channel to a source incident from a direction-of-arrival (DOA) given by azimuth angle \(\phi\) and elevation angle \(\theta\).
Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

\[
H_m(\phi, \theta, \omega) = \frac{1}{(\omega R / c)^2} \sum_{n=0}^{30} \frac{i^{\,n-1}}{h_n'^{(2)}(\omega R / c)}\,(2n + 1)\, P_n(\cos\gamma_m),
\]

where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\cos(\gamma_m)\) is the cosine of the angle between the microphone position and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative with respect to the argument of the spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20 kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
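For illustration, the expansion above can be evaluated numerically with standard special-function routines. The sketch below (not the official routines linked above) assumes SciPy's spherical Bessel functions and a user-supplied microphone layout; it returns the complex frequency response of each channel for a given DOA.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

def rigid_sphere_response(freqs, mic_dirs, doa, R=0.042, c=343.0, n_max=30):
    """Directional frequency response of microphones on a rigid spherical baffle.

    freqs:    frequencies in Hz, shape (F,)
    mic_dirs: microphone (azimuth, elevation) in radians, shape (M, 2)
    doa:      (azimuth, elevation) of the incident plane wave, in radians
    """
    kR = np.maximum(2 * np.pi * np.asarray(freqs, float) * R / c, 1e-6)  # avoid the DC singularity

    def unit(az, el):  # unit vectors from spherical angles
        return np.stack([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1)

    cos_gamma = unit(mic_dirs[:, 0], mic_dirs[:, 1]) @ unit(*doa)        # cos of mic-to-DOA angle, (M,)

    H = np.zeros((mic_dirs.shape[0], kR.size), dtype=complex)
    for n in range(n_max + 1):
        # h_n'^(2)(kR) = j_n'(kR) - i * y_n'(kR)
        hn2_prime = spherical_jn(n, kR, derivative=True) - 1j * spherical_yn(n, kR, derivative=True)
        H += np.outer(eval_legendre(n, cos_gamma), (1j ** (n - 1)) * (2 * n + 1) / hn2_prime)
    return H / kR**2
```

Impulse responses per channel can then be obtained by sampling this response on a full FFT frequency grid and applying an inverse FFT with the appropriate conjugate symmetry.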
Reference Acoustic maps (JSON Annotations)
For the development set, we provide high-definition acoustic images generated from the 32-channel microphone array recordings. These maps serve as the dense ground-truth representation for training Semantic Acoustic Imaging systems. The maps are generated at a temporal resolution of 10 fps and on a spherical uniform grid of 484 points, using the super-resolution method referenced below. The spherical maps are further interpolated into conventional 2D 360° images in an equirectangular format. Additionally, since the original maps capture contributions from all sources, peaks around the active sources of the target sound classes are cropped using spherical polygonal masks. The isolated power maps, one per active target sound event at each frame, are then registered as individual sound event-based annotations.
Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.
Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach
The acoustic maps are provided within .json files, one per recording. The annotations field contains a list of dictionary entries, where each entry represents an active sound-producing object at a specific frame.
Each annotation entry contains the following structured fields:
- `metadata_frame_index`: Indicates the frame where the annotation is active. Important: the temporal resolution (frame rate) is 10 FPS.
- `instance_id`: A unique identifier used to track sound sources across multiple frames (consistent with the STARSS23 `source_id`).
- `category_id`: An integer mapping to the STARSS23 labels.
- `distance`: The distance of the sound source from the microphone, measured in centimeters.
- `segmentation`: Contains the full filled polygon(s) representing the source's spatial extent and directional intensity.
  - There may be more than one polygon per annotation (for example, if an object wraps around the edge of the 360-degree video, resulting in a split polygon).
  - Coordinates & intensity: each item making up a polygon is a 3D array formatted as `[x, y, intensity]`. The first two values represent the `x` and `y` spatial coordinates, and the final value represents the pixel intensity (standardized acoustic amplitude between 0 and 1, such that 0 is the minimum amplitude for the recordings within the training split, and 1 the maximum amplitude).
The spatial coordinates generally correspond to an angular grid with one pixel per degree (x ∈ [0, 359], y ∈ [0, 179]). Amplitude values are standard floating-point numbers. The x coordinate represents the azimuth angle: x = 180 corresponds to the front of the scene (azimuth 0°), x = 0 and x = 359 correspond to directly behind (azimuth ±180°), and values decrease moving anti-clockwise (leftward in the image). The y coordinate represents the polar angle, with y = 0 at the top of the image (north pole, elevation +90°) and y = 179 at the bottom (south pole, elevation −90°).
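As an informal illustration of the label format described above, the snippet below loads one annotation file, groups entries by frame, and paints the [x, y, intensity] points of a single entry onto a 360×180 canvas (the file name is hypothetical; field names follow the specification above).

```python
import json
from collections import defaultdict

import numpy as np

def load_frame_annotations(json_path):
    """Group the annotation entries of one recording by frame index (10 FPS)."""
    with open(json_path) as f:
        data = json.load(f)
    by_frame = defaultdict(list)
    for ann in data["annotations"]:
        by_frame[ann["metadata_frame_index"]].append(ann)
    return by_frame

def rasterize_entry(ann, height=180, width=360):
    """Paint the [x, y, intensity] points of one annotation onto an equirectangular canvas."""
    canvas = np.zeros((height, width), dtype=np.float32)
    for polygon in ann["segmentation"]:
        for x, y, intensity in polygon:
            xi, yi = int(round(x)) % width, min(max(int(round(y)), 0), height - 1)
            canvas[yi, xi] = max(canvas[yi, xi], intensity)
    return canvas

# Example: list the entries active at frame 42 of one (hypothetical) development file.
frames = load_frame_annotations("fold4_room23_mix001.json")
for ann in frames.get(42, []):
    peak = float(rasterize_entry(ann).max())
    print(ann["category_id"], ann["instance_id"], ann["distance"], round(peak, 3))
```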
Sound event classes
The semantic labels follow the same 13 target sound event classes used in STARSS23:
- Female speech, woman speaking
- Male speech, man speaking
- Clapping
- Telephone
- Laughter
- Domestic sounds
- Walk, footsteps
- Door, open or close
- Music
- Musical instrument
- Water tap, faucet
- Bell
- Knock
These classes are inherited from the STARSS23 annotation scheme.
Resources
A script for generating the standardized acoustic images from the 32-channel audio recordings is available in the associated repository:
https://github.com/AudibleLight/starss_representations
A separate visualization script for inspecting the JSON acoustic maps is also available:
https://gist.github.com/HuwCheston/d46559748ea3af8e37fda711486fd3bf
Download
The development dataset is distributed as two zip files:
- `32ch_audio_dev.zip`: development audio.
- `labels_dev.zip`: generated acoustic-image labels in JSON format.
These can be downloaded from:
Task Setup
The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with baseline results and consistent reporting of results in the technical reports of the submitted systems, prior to the evaluation stage.
Note that even though there are two origins of the underlying STARSS23 data (Sony and TAU), the challenge task considers the dataset as a single entity. Hence, models should not be trained separately for each of the two origins and tested individually. Instead, the clips of the individual training splits should be combined, and the models should be trained and evaluated on the respective combined splits.
The evaluation dataset is released a few weeks before the final submission deadline. That dataset consists of only low-channel audio (4-channel tetrahedral/FOA) and video clips without any metadata, labels, or 32-channel recordings. At the evaluation stage, participants can decide the training procedure (e.g., the amount of training and validation files to be used from the development dataset, the number of ensemble models, etc.) and submit the results of their Semantic Acoustic Imaging performance on the evaluation dataset.
There are two tracks that participants can follow: the audio-only track and the audiovisual track. Participants can choose to submit systems to either of the two tracks, or both. We strongly encourage participants to submit both audio-only and audiovisual models. Submissions for both tracks will output the same dynamic semantic polygon masks, and ranking results will be presented in separate tables for each track.
Track A: Audio-only inference
In the audio-only track, inference of the semantic acoustic images is performed using only the limited 4-channel spatial audio from the STARSS23 dataset. Note that we do not exclude the use of video data or video information during the training of such models. In this sense, the video clips of the development set could be treated as external data and exploited in various ways to improve the performance of the model. However, during inference, only the audio recordings of the evaluation set may be used.
Track B: Audiovisual inference
In the audiovisual track, participants have access to corresponding video data during both training and evaluation. The models in this track are expected to use both the 4-channel audio and the video data during inference to reconstruct the high-resolution acoustic images.
Development dataset
The clips in the development dataset follow the naming convention retained from STARSS23:
fold[fold number]_room[room number]_mix[recording number per room].wav
The fold number is used to distinguish between the training and testing splits. The room information is provided to help users understand the performance of their methods with respect to different acoustic conditions.
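For convenience, the fold, room, and recording numbers can be parsed directly from the file name; a minimal sketch (with a hypothetical clip name) is shown below.

```python
import re

# Minimal sketch: extract the fold, room, and recording numbers from a development clip name.
NAME_RE = re.compile(r"fold(?P<fold>\d+)_room(?P<room>\d+)_mix(?P<mix>\d+)\.wav")

match = NAME_RE.fullmatch("fold4_room23_mix001.wav")   # hypothetical file name
if match:
    fold, room, mix = (int(match[g]) for g in ("fold", "room", "mix"))
    print(fold, room, mix)   # -> 4 23 1
```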
Evaluation dataset
The evaluation dataset consists of clips without any information regarding their origin (Sony or TAU) or location in the naming convention. They follow the format below:
sample[clip number].wav (and corresponding .mp4 for video).
During evaluation, only a 4-channel tetrahedral array subset of the original 32-channel recordings will be provided, similar to the microphone array format of the previous challenges. The coordinates and order of those microphones in spherical coordinates \((\phi, \theta, r)\) are:
External Data Resources and Pretrained Models
Since the development set contains recordings of real scenes, the presence of each class and the density of sound events vary greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow the use of external datasets as long as they are publicly available. Examples of external data are sound sample banks, annotated sound event datasets, pre-trained models, and room and array impulse response libraries.
The following rules apply to the use of external data:
- The external datasets or pre-trained models used should be freely and publicly accessible before 1 April 2026.
- Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or a message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update the list of external data on the webpage accordingly.
- The participants will have to indicate clearly which external data they have used in their system info and technical report.
- Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
| Dataset name | Type | Added | Link |
|---|---|---|---|
| TAU-SRIR DB | room impulse responses | 04.04.2022 | https://zenodo.org/records/6408611 |
| 6DOF_SRIRs | room impulse responses | 23.11.2021 | https://zenodo.org/records/6382405 |
| METU SRIRs | room impulse responses | 10.04.2019 | https://zenodo.org/records/2635758 |
| MIRACLE | room impulse responses | 12.10.2023 | https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2 |
| AudioSet | audio, video | 30.03.2017 | https://research.google.com/audioset/ |
| FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
| ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
| Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
| IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
| Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
| SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
| TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
| TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
| PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| wav2vec2.0 | pre-trained model | 20.08.2020 | https://github.com/facebookresearch/fairseq |
| PaSST | pre-trained model | 18.09.2022 | https://github.com/kkoutini/PaSST |
| DTF-AT | pre-trained model | 19.12.2023 | https://github.com/ta012/DTFAT |
| FNAC_AVL | pre-trained model | 25.03.2023 | https://github.com/OpenNLPLab/FNAC_AVL |
| CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
| JSUT | audio | 28.10.2017 | https://sites.google.com/site/shinnosuketakamichi/publication/jsut |
| VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |
| COCO | video | 01.05.2014 | https://cocodataset.org/ |
| 360-Indoor | video | 03.10.2019 | http://aliensunmin.github.io/project/360-dataset/ |
| TorchVision Models and Pre-trained Weights | pre-trained model | 03.04.2017 | https://pytorch.org/vision/stable/models.html |
| YOLOv7 | pre-trained model | 07.07.2022 | https://github.com/WongKinYiu/yolov7 |
| YOLOv8 | pre-trained model | 10.01.2023 | https://github.com/ultralytics/ultralytics |
| Grounding DINO | pre-trained model | 10.03.2023 | https://github.com/IDEA-Research/GroundingDINO |
| MMDetection | pre-trained model | 15.12.2021 | https://github.com/open-mmlab/mmdetection |
| MMPose | pre-trained model | 01.01.2020 | https://github.com/open-mmlab/mmpose |
| MMFlow | pre-trained model | 01.01.2021 | https://github.com/open-mmlab/mmflow |
| Paddle Detection | pre-trained model | 01.01.2019 | https://github.com/PaddlePaddle/PaddleDetection |
| doors Image Dataset | image | 18.02.2022 | https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1 |
| DoorDetect Dataset | image | 27.05.2021 | https://github.com/MiguelARD/DoorDetect-Dataset |
| CLIP | pre-trained model | 05.01.2021 | https://github.com/openai/CLIP |
| CLAP | pre-trained model | 06.03.2022 | https://github.com/LAION-AI/CLAP |
| Depth Anything | pre-trained model | 22.01.2024 | https://github.com/LiheYoung/Depth-Anything |
| PanoFormer | pre-trained model | 04.03.2022 | https://github.com/zhijieshen-bjtu/PanoFormer |
| SoundQ Youtube 360° video list | video | 06.10.2023 | https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv |
| FMA | audio | 02.12.2016 | https://github.com/mdeff/fma |
| Dasheng | pre-trained model | 11.06.2024 | https://github.com/XiaoMi/dasheng |
| DINOv2 | pre-trained model | 18.04.2023 | https://github.com/facebookresearch/dinov2 |
| ONE-PEACE | pre-trained model | 19.05.2023 | https://github.com/OFA-Sys/ONE-PEACE |
| OWL-ViT | pre-trained model | 13.05.2022 | https://huggingface.co/docs/transformers/en/model_doc/owlvit |
| Flickr30k | image | 01.02.2014 | https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset |
| SELDVisualSynth Canvas and Assets | image, video | 27.03.2025 | https://github.com/adrianSRoman/SELDVisualSynth |
| Audio-Visual Scene Reproduction Datasets | audio, image | 30.10.2021 | http://3dkim.com/research/VR/index.html |
| SurrRoom 1.0 Dataset | room impulse responses | 12.05.2023 | https://cvssp.org/data/SurrRoom1_0/ |
| NitroFusion | pre-trained model | 02.12.2024 | https://chendaryen.github.io/NitroFusion.github.io/ |
| PSELDNets | pre-trained model | 11.11.2024 | https://github.com/Jinbo-Hu/PSELDNets |
| HTS-AT | pre-trained model | 02.02.2022 | https://github.com/RetroCirce/HTS-Audio-Transformer |
Example external data use with baseline
Synthetic audio-only and audiovisual data (along with generated ground truth power maps conforming to the task specifications) can be generated using AudibleLight.
AudibleLight is a new Python package for synthetic soundscape synthesis across both ray-traced & real-world measured acoustics. It combines and extends the functionality available in many existing packages, including SpatialScaper, SELDVisualSynth, PyRoomAcoustics, and SoundSpaces.
AudibleLight can be downloaded from the following repository. Usage information is available on the repository page. It is recommended to use a development install of AudibleLight when generating data for this task, rather than installing from PyPI.
If you use AudibleLight to generate synthetic data for your system, please ensure you cite the following paper:
Huw Cheston, Adrian Stepien, Juan Azcarreta, Adrian S. Roman, Chuyang Chen, Cagdas Bilen, and Iran R. Roman. Audiblelight: a controllable, end-to-end api for soundscape synthesis across ray-traced & real-world measured acoustics. In DMRN+20: Digital Music Research Network One-Day Workshop 2025. 2025.
AudibleLight: A Controllable, End-to-End API for Soundscape Synthesis Across Ray-Traced & Real-World Measured Acoustics
Task Rules
- Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
- Manipulation of the provided training-test split in the development dataset is not allowed when reporting dev-test results on the development dataset.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
- The development dataset can be augmented, e.g. using pitch shifting, time stretching, spatial transformations, source separation, respatialization, or re-reverberation of parts.
Submission
During the evaluation phase of the challenge, the results for each recording in the
evaluation dataset must be collected in individual JSON files. Each result file must
have the same base name as the corresponding audio recording file, but with the .json
extension (e.g., sample001.json for sample001.wav).
Each JSON file must contain a single top-level key, annotations, whose value is a
list of annotation entries. Each entry is a dictionary representing one active sound
source instance at one specific frame. The overall structure is:
{
"annotations": [
{
"metadata_frame_index": 42,
"category_id": 1,
"score": 0.812,
"distance": 245.0,
"segmentation": [[[120.0, 88.0, 0.923], [121.0, 89.0, 0.741], ...]]
},
...
]
}
Field Descriptions
Each annotation entry must contain the following fields:
- `metadata_frame_index` (int): The frame index at which this annotation is active. The temporal resolution is 10 FPS, so frame index `k` corresponds to time `k / 10` seconds. Enumeration begins at zero.
- `category_id` (int): A zero-indexed integer identifying the sound event class. Class indices follow the ordering given in the class descriptions above (0 = Female speech, 1 = Male speech, ..., 12 = Knock).
- `score` (float): This field is mandatory. The detection confidence score for this instance at this frame, in the range `[0, 1]`. The official evaluation computes mean Average Precision (mAP) (see Evaluation below). Submissions that omit this field or set all scores to a constant value will produce degenerate AP curves and will be ranked last.
- `distance` (float): This field is NOT mandatory. The predicted distance of the sound source from the microphone array centre, measured in centimetres. Distance estimation will not be used in the ranking; however, if participants develop methods that estimate it, distance-estimation results will be presented in the challenge results, based on the Relative Distance Error (RDE) metric.
- `segmentation` (list of lists of triplets): A list of polygon(s) representing the predicted spatial extent and acoustic energy distribution of the source. Each polygon is a list of `[x, y, intensity]` triplets. There may be more than one polygon per annotation entry (e.g., when a source spans the azimuth wrap boundary and its representation is split into two fragments). The format exactly mirrors that of the reference labels:
  - `x` (float): Azimuth coordinate in the equirectangular image, `x ∈ [0, 359]`. `x = 180` is front-facing (azimuth 0°); `x = 0` and `x = 359` are directly behind, and anti-clockwise motion corresponds to decreasing `x`.
  - `y` (float): Polar (elevation) coordinate, `y ∈ [0, 179]`. `y = 0` is the north pole (elevation +90°).
  - `intensity` (float): Predicted normalised acoustic amplitude, in `[0, 1]`.
Participants are not required to submit dense filled polygons. Sparse
representations consisting of the highest-energy peaks within each source's
bounding region are fully supported by the evaluator, which internally renders both
prediction and reference through a spherical Gaussian kernel before metric
computation. A minimum of one [x, y, intensity] triplet per annotation entry is
required. We recommend submitting between 10 and 30 peaks per instance per frame;
the provided baseline inference script exports the top-20 peaks by default.
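The snippet below is a minimal, unofficial sketch of writing such a result file: it checks the mandatory fields described above and wraps the entries under the top-level annotations key (the helper name and the example values are illustrative only).

```python
import json

def write_submission(json_path, detections):
    """Write one result file per evaluation recording (e.g. sample001.json for sample001.wav).

    `detections` is an iterable of dicts with the fields described above; this sketch only
    checks the mandatory fields before wrapping everything under the top-level "annotations" key.
    """
    annotations = []
    for det in detections:
        assert isinstance(det["metadata_frame_index"], int) and det["metadata_frame_index"] >= 0
        assert 0 <= det["category_id"] <= 12
        assert 0.0 <= det["score"] <= 1.0
        assert len(det["segmentation"]) >= 1 and all(len(poly) >= 1 for poly in det["segmentation"])
        annotations.append(det)
    with open(json_path, "w") as f:
        json.dump({"annotations": annotations}, f)

# One annotation entry with a single sparse peak (all values illustrative only).
entry = {
    "metadata_frame_index": 42,
    "category_id": 1,
    "score": 0.812,
    "distance": 245.0,                        # optional field
    "segmentation": [[[120.0, 88.0, 0.923]]],
}
write_submission("sample001.json", [entry])
```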
Coordinate System Summary
| Field | Image range | Meaning of zero | Notes |
|---|---|---|---|
| x | [0, 359] | Left edge of image = azimuth ±180° (directly behind) | Front of scene = x=180; anti-clockwise = decreasing x |
| y | [0, 179] | Top of image = north pole (elevation +90°) | South pole (elevation −90°) = y=179 |
| intensity | [0, 1] | Minimum recorded amplitude in training split | 1 is the maximum recorded amplitude |
Temporal Resolution
The evaluation is performed at a temporal resolution of 100 ms (10 FPS). If your
system produces predictions at a different frame rate or hop size, you must resample
or aggregate your outputs to this resolution before submission. Each
metadata_frame_index value must be a non-negative integer corresponding to a
distinct 100 ms window.
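One simple aggregation strategy (a sketch only, not a task requirement) is to assign each prediction to the 100 ms window containing its timestamp and, if several predictions of the same class and instance fall into the same window, keep the highest-scoring one:

```python
def aggregate_to_10fps(predictions):
    """Map predictions made at an arbitrary hop size onto the 100 ms (10 FPS) grid.

    Each prediction dict is assumed to carry a `timestamp` in seconds plus the submission
    fields; if several predictions of the same class/instance fall into one window, the
    highest-scoring one is kept (one possible strategy among others).
    """
    best = {}
    for pred in predictions:
        frame = max(0, int(pred["timestamp"] * 10))                 # zero-indexed 100 ms windows
        key = (frame, pred["category_id"], pred.get("instance_id"))
        if key not in best or pred["score"] > best[key]["score"]:
            best[key] = {**pred, "metadata_frame_index": frame}
    return [{k: v for k, v in p.items() if k != "timestamp"} for p in best.values()]
```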
Baseline Inference Script
A reference inference script (run_inference.py) is provided in the task repository.
It takes a trained model checkpoint and a directory of evaluation recordings as input
and produces correctly formatted JSON output files directly. Participants may use this
script as-is, adapt it to their own model architecture, or implement their own
inference pipeline entirely, as long as the output JSON format described above is
respected.
General information for all DCASE submissions can be found on the Submission page.
Evaluation
The evaluation jointly assesses sound event detection, spatial localization, acoustic imaging reconstruction, and distance estimation.
Rendering Pipeline
Every segmentation annotation (ground truth and prediction) is converted into a continuous 360×180 equirectangular energy map. For each [x, y, intensity] triplet, a spherical Gaussian blob (standard deviation \(\sigma = 6^\circ\)) is added to the canvas:

\[
E(u, v) \leftarrow E(u, v) + e_k \exp\!\left(-\frac{d_{\mathrm{gc}}\big((u, v), (x_k, y_k)\big)^2}{2\sigma^2}\right),
\]

where \(d_{\mathrm{gc}}\) is the great-circle angular distance between the pixel direction \((u, v)\) and the peak direction \((x_k, y_k)\), and \(e_k\) is the normalized intensity. The kernel accounts for equirectangular distortion by expanding the azimuth half-width near the poles by \(1/\cos(\theta_k)\), ensuring a consistent solid angle.
A binary segmentation mask \(M\) is derived by thresholding at \(\tau = 10\%\) of the peak rendered energy:

\[
M(u, v) = \begin{cases} 1, & E(u, v) \ge \tau \cdot \max_{u', v'} E(u', v'), \\ 0, & \text{otherwise.} \end{cases}
\]
Annotations with a peak energy below \(10^{-9}\) are excluded from evaluation.
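A simplified sketch of this rendering step is given below, using an isotropic great-circle Gaussian (which implicitly yields the azimuth stretching near the poles) and the 10% peak threshold; the official evaluator may differ in implementation details.

```python
import numpy as np

HEIGHT, WIDTH, SIGMA_DEG, TAU = 180, 360, 6.0, 0.10

# Pixel-centre directions of the 360x180 equirectangular canvas, in radians.
AZ, EL = np.meshgrid(np.deg2rad(np.arange(WIDTH) - 180.0),   # x = 180 -> front of the scene
                     np.deg2rad(90.0 - np.arange(HEIGHT)))   # y = 0   -> north pole (+90 deg)

def render_energy_map(triplets):
    """Accumulate one spherical Gaussian blob per [x, y, intensity] triplet."""
    energy = np.zeros((HEIGHT, WIDTH))
    sigma = np.deg2rad(SIGMA_DEG)
    for x, y, intensity in triplets:
        az_k, el_k = np.deg2rad(x - 180.0), np.deg2rad(90.0 - y)
        # Great-circle angular distance between every pixel direction and the peak direction.
        d_gc = np.arccos(np.clip(np.sin(EL) * np.sin(el_k)
                                 + np.cos(EL) * np.cos(el_k) * np.cos(AZ - az_k), -1.0, 1.0))
        energy += intensity * np.exp(-d_gc**2 / (2.0 * sigma**2))
    return energy

def binary_mask(energy):
    """Threshold the rendered map at 10% of its peak energy."""
    peak = energy.max()
    return energy >= TAU * peak if peak > 0 else np.zeros_like(energy, dtype=bool)
```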
Spatial Matching
To compute the Pearson \(r\) and RDE metrics, predictions and ground truths are paired per frame and per class:
- Peak locations are extracted from the rendered maps.
- The Hungarian algorithm finds the minimum-cost one-to-one assignment based on great-circle angular distance.
- Matches exceeding \(T_{\mathrm{DOA}} = 20^\circ\) are rejected (becoming false positives/negatives).
- Cross-class matching is strictly prohibited.
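A minimal sketch of this matching step, using SciPy's Hungarian solver on great-circle distances between per-class peak directions (helper name and input format are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

T_DOA_DEG = 20.0

def match_peaks(pred_dirs, ref_dirs):
    """One-to-one matching of predicted and reference peak directions of one class in one frame.

    Directions are (azimuth, elevation) pairs in degrees; pairs whose great-circle distance
    exceeds 20 degrees are rejected and count as false positives / false negatives.
    """
    if len(pred_dirs) == 0 or len(ref_dirs) == 0:
        return []
    pred, ref = np.deg2rad(np.asarray(pred_dirs, float)), np.deg2rad(np.asarray(ref_dirs, float))
    cost = np.arccos(np.clip(
        np.sin(pred[:, None, 1]) * np.sin(ref[None, :, 1])
        + np.cos(pred[:, None, 1]) * np.cos(ref[None, :, 1]) * np.cos(pred[:, None, 0] - ref[None, :, 0]),
        -1.0, 1.0))
    rows, cols = linear_sum_assignment(cost)                 # minimum-cost assignment (Hungarian)
    return [(i, j) for i, j in zip(rows, cols) if np.degrees(cost[i, j]) <= T_DOA_DEG]
```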
Metrics
1. Mask mAP — Detection and Localization (Primary Metric)
The primary ranking metric is the Mask mean Average Precision (Mask mAP), computed via the standard COCO evaluation protocol on the binarized rendered masks:

\[
\mathrm{Mask\ mAP} = \frac{1}{C \, |\mathcal{T}|} \sum_{c=1}^{C} \sum_{t \in \mathcal{T}} \mathrm{AP}_c(t),
\]

where \(C = 13\) is the number of classes and \(\mathcal{T} = \{0.50, 0.55, \ldots, 0.95\}\) is the set of IoU thresholds in equirectangular pixel space. We also report Mask AP50 (IoU \(t = 0.50\)). Mask mAP simultaneously penalizes missed detections, false positives, and poor spatial localization.
Note: The score field is mandatory; constant scores will result in a degenerate precision-recall curve and rank last.
2. Pearson \(r\) — Energy Field Reconstruction Quality
For each spatially matched pair, the Pearson correlation coefficient is computed over the 2-pixel dilated union of their binary masks (\(\mathcal{U}\)):

\[
r = \frac{\sum_{p \in \mathcal{U}} \big(E_{\mathrm{pred}}(p) - \bar{E}_{\mathrm{pred}}\big)\big(E_{\mathrm{ref}}(p) - \bar{E}_{\mathrm{ref}}\big)}{\sqrt{\sum_{p \in \mathcal{U}} \big(E_{\mathrm{pred}}(p) - \bar{E}_{\mathrm{pred}}\big)^2}\;\sqrt{\sum_{p \in \mathcal{U}} \big(E_{\mathrm{ref}}(p) - \bar{E}_{\mathrm{ref}}\big)^2}}
\]
This measures spatial shape fidelity independently of absolute intensity scale. Scores are macro-averaged across all matched pairs and classes to yield \(\bar{r}\). Flat maps (range \(< 10^{-7}\)) are excluded.
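A compact sketch of this computation for a single matched pair, assuming the rendered maps and binary masks from the rendering step above (the dilation and flat-map checks mirror the description):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def masked_pearson(pred_map, ref_map, pred_mask, ref_mask, dilate_px=2):
    """Pearson correlation of two rendered energy maps over the dilated union of their binary masks."""
    union = binary_dilation(pred_mask | ref_mask, iterations=dilate_px)
    a, b = pred_map[union], ref_map[union]
    if np.ptp(a) < 1e-7 or np.ptp(b) < 1e-7:   # flat maps are excluded from the average
        return None
    return float(np.corrcoef(a, b)[0, 1])
```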
3. Relative Distance Error (RDE) — Range Estimation Quality (not included in the ranking)
For each matched pair, the relative distance error (RDE) is calculated from the predicted and reference distances (both in centimeters):

\[
\mathrm{RDE} = \frac{\lvert d_{\mathrm{pred}} - d_{\mathrm{ref}} \rvert}{d_{\mathrm{ref}}}
\]

The overall \(\mathrm{RDE}\) is the macro-average across all classes. Because the error is normalized by the reference distance, this metric penalizes near-field errors more heavily than equivalent absolute errors on distant sources.
Note: Both \(\bar{r}\) and RDE are computed exclusively over matched true-positive pairs. The Mask mAP metric carries the full penalty for detection failures and false positives.
Parameter and Threshold Summary
| Parameter | Value | Role |
|---|---|---|
| Gaussian kernel \(\sigma\) | \(6°\) of arc | Rendering GT and predictions to energy maps |
| Binary mask threshold \(\tau\) | \(10\%\) of peak | Binarization for Mask mAP IoU computation |
| Angular match threshold \(T_{\mathrm{DOA}}\) | \(20°\) great-circle | Hungarian matching rejection criterion |
| IoU range (Mask mAP) | \(0.50 : 0.05 : 0.95\) | Standard COCO protocol, 10 thresholds |
| Classes \(C\) | 13 | Macro-averaging denominator for all metrics |
| Canvas resolution | \(360 \times 180\) px | One pixel per degree, equirectangular |
Ranking
Overall ranking will be based on the cumulative rank of the ranking metrics described above (Mask mAP and \(\bar{r}\); RDE is not included), sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as \(\mathrm{mAP}: 1, \bar{r}: 2\), then its cumulative rank is \(1 + 2 = 3\). If system B has \(\mathrm{mAP}: 2, \bar{r}: 1\) (3) and system C has \(\mathrm{mAP}: 3, \bar{r}: 3\) (6), then systems A and B are tied ahead of C. If two systems end up with the same cumulative rank, they will be ranked based on the Mask mAP rank (i.e., the system with the higher mAP will be ranked first), so the overall order in this example is A, B, C.
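As an informal illustration of this procedure (with made-up scores, and under the same assumption that the ranked metrics are Mask mAP and \(\bar{r}\)):

```python
# Hypothetical systems with made-up metric values (higher is better for both metrics).
systems = {"A": {"mAP": 0.41, "r": 0.58}, "B": {"mAP": 0.38, "r": 0.61}, "C": {"mAP": 0.35, "r": 0.47}}

def overall_order(systems):
    cumulative = {name: 0 for name in systems}
    for metric in ("mAP", "r"):
        # Rank 1 = best value for this metric; add the rank to each system's cumulative rank.
        for position, name in enumerate(sorted(systems, key=lambda s: -systems[s][metric]), start=1):
            cumulative[name] += position
    # Ascending cumulative rank; ties broken by the Mask mAP rank (higher mAP first).
    return sorted(systems, key=lambda s: (cumulative[s], -systems[s]["mAP"]))

print(overall_order(systems))   # -> ['A', 'B', 'C'] for the values above
```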
Baseline System
The baseline for both Track A (audio-only inference) and Track B (audiovisual inference) is provided in a unified codebase.
The baseline model takes per-frame multimodal inputs consisting of RGB image frames and spatially rendered acoustic feature maps derived from a pre-trained neural acoustic upscaler applied to the 4-channel input audio.
Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.
Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach
This upscaler projects the low-channel spatial audio into a set of equirectangular acoustic band images, which are concatenated with the RGB frame to form a 12-channel input tensor. In the audio-only track, the RGB channels are omitted and replaced with zero-valued tensors.
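A minimal sketch of assembling this per-frame input is shown below; it assumes, hypothetically, nine acoustic band images so that 3 RGB + 9 acoustic = 12 channels (see the baseline code for the actual configuration).

```python
from typing import Optional

import torch

def build_input(acoustic_bands: torch.Tensor, rgb_frame: Optional[torch.Tensor]) -> torch.Tensor:
    """Concatenate RGB and acoustic band images into a 12-channel per-frame tensor.

    acoustic_bands: (9, H, W) equirectangular acoustic band images from the upscaler
                    (nine bands is an assumption here, not a confirmed baseline setting).
    rgb_frame:      (3, H, W) RGB frame, or None in the audio-only track.
    """
    if rgb_frame is None:
        # Audio-only track: the RGB channels are replaced with zeros.
        rgb_frame = torch.zeros(3, *acoustic_bands.shape[1:], dtype=acoustic_bands.dtype)
    return torch.cat([rgb_frame, acoustic_bands], dim=0)  # (12, H, W)
```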
The combined input is processed by an instance segmentation backbone that predicts, for each detected sound event, a bounding region on the equirectangular canvas, a class label, a detection confidence score, a 28×28 acoustic energy map, and a source distance. A frame-level tracker based on IoU-constrained Hungarian matching links detections across time to produce temporally consistent instance identities. At inference, detections are filtered through a multi-stage pipeline (score thresholding, per-class non-maximum suppression, per-frame class cap, and track confirmation) before the energy maps are sparsified and exported in the submission JSON format.
Baseline results on the development set will be reported here soon.
Please refer to the README file in the baseline repository for detailed information on installation, training, and inference:
Citation
If you are participating in this task or using the dataset and baseline code, please consider citing the following papers:
STAIRS26 Dataset and DCASE 2026 Task Description:
High-Definition Acoustic Maps & Baseline (UpLAM):
Adrian S Roman, Iran R Roman, and Juan P Bello. Latent acoustic mapping for direction of arrival estimation: a self-supervised approach. In 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–5. IEEE, 2025.
Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach
Original STARSS Datasets:
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
AudibleLight (if using synthetic data):
Huw Cheston, Adrian Stepien, Juan Azcarreta, Adrian S. Roman, Chuyang Chen, Cagdas Bilen, and Iran R. Roman. Audiblelight: a controllable, end-to-end api for soundscape synthesis across ray-traced & real-world measured acoustics. In DMRN+20: Digital Music Research Network One-Day Workshop 2025. 2025.