The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions while they are active.
Description
Given multichannel audio input, a sound event localization and detection (SELD) system outputs localization estimates of one or more events for each of the target sound classes, whenever such events are detected. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation with visually occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.
While the previous challenges used four-channel audio data, i.e., first-order Ambisonics (FOA) and microphone array data, this challenge tackles SELD with stereo audio data (called stereo SELD), investigating the task in a commonplace audio and media scenario. Since the stereo audio used in the task has top-bottom and front-back angular ambiguity, the task focuses on direction-of-arrival (DOA) estimation of azimuth angles only, along the left-right axis. The challenge continues to tackle distance estimation, as we believe there is still room for novel solutions.
This challenge evaluates stereo SELD models with audio-only input (Track A) or audiovisual input (Track B). Since the field-of-view (FOV) is no longer 360°, Track B poses an interesting new sub-task: onscreen/offscreen classification of the detected events. The evaluation metrics are modified to take this classification task into account. We encourage participants to submit both audio-only SELD systems and audiovisual SELD systems.

Dataset
This challenge uses a stereo audio and video dataset, DCASE2025 Task3 Stereo SELD Dataset, derived from the STARSS23 dataset. The original STARSS23's FOA audio and 360-degree video data have been converted to stereo audio and perspective video data, simulating regular media content.
The STARSS23 dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The original 360-degree videos are spatially and temporally aligned with the microphone array recordings. More details on the recording and annotation procedure can be found in the DCASE2022 Challenge task description and in the dataset paper of STARSS22:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.
and in the dataset paper of STARSS23:
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
To construct the DCASE2025 Task3 Stereo SELD Dataset, we apply the following sampling and conversion procedures to the STARSS23 dataset. We first sample 5-second clips from the original STARSS23 recordings. Then we convert each 5-second FOA audio clip and 360° video clip to stereo audio and perspective video data corresponding to a fixed point-of-view. We first rotate the FOA audio according to the fixed viewing angle, then convert the rotated FOA audio to stereo audio emulating a mid-side (M/S) recording technique. We convert the equirectangular video to a perspective video with the same viewing angle as the audio. We set the horizontal FOV to 100° and the video resolution (width × height) to 640 × 360 pixels, i.e., the 16:9 aspect ratio widely used in media content.
We also rotate the original STARSS23 DOA labels to new DOA labels centered at the fixed viewing angle. The new azimuth labels are folded back from the rear hemisphere to the front, considering front-back ambiguity. The elevation labels are omitted due to top-bottom ambiguity. The distance labels are kept the same as in STARSS23. To obtain the binary onscreen/offscreen event labels, we compare the new DOA labels with the FOV of the perspective video.
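The label conversion above can be sketched as follows. This is an illustrative simplification with names of our own choosing: the actual generator also accounts for the vertical extent of the FOV when deciding onscreen status, whereas this sketch uses azimuth only.

```python
import math

def wrap180(angle):
    """Wrap an angle in degrees to [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def convert_azimuth(azimuth_deg, view_deg):
    """Rotate a STARSS23 azimuth to the fixed viewing angle, then fold
    back-hemisphere angles onto the front (front-back ambiguity)."""
    az = wrap180(azimuth_deg - view_deg)
    if abs(az) > 90.0:
        az = math.copysign(180.0 - abs(az), az)
    return az

def is_onscreen(azimuth_deg, hfov_deg=100.0):
    """Simplified onscreen test using only azimuth and the horizontal FOV."""
    return abs(azimuth_deg) <= hfov_deg / 2.0
```

For example, a source at 120° azimuth with a 0° viewing angle folds to 60° in front, and with the 100° horizontal FOV it would be labeled offscreen.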
Dataset specifications
The specifications of the stereo audio and video dataset can be summarized in the following:
Recording (STARSS22/23 setup):
- Each recording clip is part of a recording session happening in a unique room.
- Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
- 13 target classes are identified in the recordings and strongly annotated by humans.
- Spatial annotations for those active events are captured by an optical tracking system.
- Sound events out of the target classes are considered as interference.
- Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 6) can occur but are rare.
Sampling and conversion:
- In each sampling step, a recording, start frame, and viewing angle are randomly selected.
- A recording is selected with length-weighted random choice to treat all frames of all files equally.
- A start frame is selected uniformly within each recording.
- A horizontal viewing angle is selected uniformly in 360° while the vertical viewing angle is kept at 0° elevation.
- 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav) are not selected to keep the same set between audio-only and audiovisual tracks.
- Several clips don't contain any target sound events after random sampling.
- The class distribution across all frames after random sampling is similar to the STARSS23 one.
- The onscreen / offscreen distribution across all frames is around 1 : 3.
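The length-weighted sampling described above could be sketched as follows; this is illustrative only, with function and parameter names of our own choosing, and lengths measured in frames.

```python
import random

def sample_clip(recording_lengths, clip_frames=50, seed=None):
    """Pick a recording with probability proportional to its length
    (so all frames of all files are treated equally), then a start frame
    and a horizontal viewing angle uniformly.

    recording_lengths: dict mapping recording name -> length in frames.
    """
    rng = random.Random(seed)
    names = list(recording_lengths)
    rec = rng.choices(names, weights=[recording_lengths[n] for n in names])[0]
    start = rng.randrange(0, max(1, recording_lengths[rec] - clip_frames + 1))
    view = rng.uniform(-180.0, 180.0)
    return rec, start, view
```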
Volume, duration, and data split:
- A total of 16 unique rooms captured in the recordings (development set).
- 30,000 clips of 5 sec duration, with a total time of 41.7 hrs (development dataset).
- 23.9 % of the clips are derived from recordings in Tokyo (development dataset).
- 76.1 % of the clips are derived from recordings in Tampere (development dataset).
- A training-testing split is provided for reporting results using the development dataset.
- 2 rooms in Tokyo are for the training split (dev-train-sony).
- 2 rooms in Tokyo are for the testing split (dev-test-sony).
- 7 rooms in Tampere are for the training split (dev-train-tau).
- 5 rooms in Tampere are for the testing split (dev-test-tau).
Audio:
- Sampling rate: 24 kHz.
- Bit depth: 16 bits.
- Stereo format: mid-side (M/S) technique with left-right cardioid stereo patterns.
Video:
- Video format: perspective.
- Video resolution: 640x360.
- Video frames per second (fps): 29.97.
Sound event classes
13 target sound event classes were annotated. The classes loosely follow the AudioSet ontology.
- Female speech, woman speaking
- Male speech, man speaking
- Clapping
- Telephone
- Laughter
- Domestic sounds
- Walk, footsteps
- Door, open or close
- Music
- Musical instrument
- Water tap, faucet
- Bell
- Knock
The content of some of these classes corresponds to events of a limited range of AudioSet-related subclasses. These are detailed here to aid the participants:
- Telephone
- Mostly traditional Telephone Bell Ringing and Ringtone sounds, without musical ringtones.
- Domestic sounds
- Sounds of Vacuum cleaner
- Sounds of water boiler, closer to Boiling
- Sounds of air circulator, closer to Mechanical fan
- Door, open or close
- Combination of Door and Cupboard open or close
- Music
- Background music and Pop music played by a loudspeaker in the room.
- Musical Instrument
- Acoustic guitar
- Marimba, xylophone
- Cowbell
- Piano
- Rattle (instrument)
- Bell
- Combination of sounds from hotel bell and glass bell, closer to Bicycle bell and single Chime.
Some additional notes:
- The speech classes contain speech in a few different languages.
- There are occasionally localized sound events that are not annotated and are considered as interferers, with examples such as computer keyboard, shuffling cards, dishes, pots, and pans.
- There is natural background noise (e.g. HVAC noise) in all recordings, at very low levels in some and at quite high levels in others. Such mostly diffuse background noise should be distinct from other noisy target sources (e.g. vacuum cleaner, mechanical fan) since these are clearly spatially localized.
Audio format description
The stereo audio format is derived from the full-sphere first-order Ambisonics (FOA) format used in the previous iterations of the challenge. For recording details and encoding specifications of the FOA format refer to the previous task descriptions.
The stereo format specification emulates a coincident mid-side (M/S) stereo recording technique using cardioid microphones pointing at +90° (left channel) and -90° (right channel). Since any mid-side stereo recording configuration can be extracted from FOA signals, in this simple M/S case the stereo channels are derived using only the omnidirectional and the left-right dipole components of the FOA signals.
More specifically, for FOA signals following an ACN/SN3D convention ordered as \([W(n), Y(n), Z(n), X(n)]\), the stereo signals \([L(n), R(n)]\) are then simply:
- \(L(n) = W(n) + Y(n)\)
- \(R(n) = W(n) - Y(n)\)
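In code, the conversion is a single linear combination of the first two FOA channels. A minimal NumPy sketch (the function name is ours), assuming a signal array with channels in ACN order:

```python
import numpy as np

def foa_to_stereo(foa):
    """Convert FOA (ACN/SN3D, channels [W, Y, Z, X]) to M/S stereo.

    foa: array of shape (num_samples, 4); returns (num_samples, 2) as [L, R],
    using L = W + Y and R = W - Y.
    """
    w, y = foa[:, 0], foa[:, 1]
    return np.stack([w + y, w - y], axis=1)
```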
Reference labels, directions-of-arrival, source distances, and off/onscreen labels
For each clip in the development dataset, the labels, DOAs, distances, and off/onscreen labels are provided in a plain-text CSV file with the same filename as the clip, in the following format:
[frame number (int)],[active class index (int)],[source number index (int)],[azimuth (int)],[distance (int)],[off/onscreen (0/1)]
Frame, class, and source enumeration begins at zero. Frames correspond to a temporal resolution of 100 msec. Azimuth angles are given in degrees, rounded to the closest integer value, with zero azimuth at the front and \(\phi \in [-90^{\circ}, 90^{\circ}]\). The azimuth angle increases counter-clockwise (\(\phi = 90^{\circ}\) at the left). Note that the azimuth angle is folded back from behind to the front, due to the front-back stereo ambiguity. The elevation angle is omitted due to the top-bottom ambiguity. Distances are provided in centimeters, rounded to the closest integer value.
In the off/onscreen column, 0 means offscreen and 1 means onscreen. An onscreen event appears within the FOV of the perspective video.
Note that the off/onscreen labels are only used in the audiovisual track: participants in the audiovisual track need to provide off/onscreen labels, while participants in the audio-only track do not.
The source index is a unique integer for each source in the scene, and it is provided only as additional information. Note that each unique actor gets assigned one such identifier, but not individual events produced by the same actor; e.g. a clapping event and a laughter event produced by the same person have the same identifier. Independent sources that are not actors (e.g. a loudspeaker playing music in the room) get a 0 identifier.
Note that the source index is only included in the development metadata as additional information that can be exploited during training. It is not required to be estimated or provided by the participants in their results.
Overlapping sound events are indicated with duplicate frame numbers and can belong to the same or different classes. An example sequence could be:
10, 1, 1, -60, 181, 0
11, 1, 1, -60, 181, 0
11, 1, 2, 10, 243, 1
12, 1, 2, 10, 243, 1
13, 1, 2, 10, 243, 1
13, 8, 0, -40, 80, 1
which describes that in frames 10-11, an event of class male speech (class 1) belonging to one actor (source 1) is active at location (-60°, 181 cm), offscreen (off/onscreen 0). At frame 11, a second instance of the same class appears simultaneously at a different location (10°, 243 cm), onscreen (off/onscreen 1), belonging to another actor (source 2), while at frame 13 an additional event of class music (class 8) appears, belonging to a non-actor source (source 0). Frames that contain no sound events are not included in the sequence.
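A metadata file in this format can be read with the standard csv module. A minimal sketch (the helper below is ours, not part of the challenge tooling):

```python
import csv
import io
from collections import defaultdict

def parse_metadata(csv_text):
    """Group metadata rows by frame number.

    Returns a dict mapping frame -> list of
    (class_idx, source_idx, azimuth_deg, distance_cm, onscreen) tuples.
    """
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        frame, cls, src, az, dist, onscreen = (int(v) for v in row)
        frames[frame].append((cls, src, az, dist, onscreen))
    return dict(frames)
```

Applied to the example above, frame 11 would map to two simultaneous events of class 1 with different source indices.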
Download
The development version of the dataset can be downloaded at:
The stereo SELD data generator for the task is also made available:
The data generator makes it possible to construct stereo SELD datasets like the DCASE2025 Task3 Stereo SELD Dataset from real or synthetic FOA SELD datasets. The generator samples a clip randomly and converts its FOA audio / 360° video / metadata to new stereo audio / perspective video / metadata according to a viewing angle.
Task setup
The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with the baseline results and consistent reporting of results at the technical reports of the submitted systems, before the evaluation stage.
- Note that even though there are two origins of the data, Sony and TAU, the challenge task considers the dataset as a single entity. Hence, models should not be trained and tested separately for each of the two origins. Instead, the clips of the individual training splits (dev-train-sony, dev-train-tau) and testing splits (dev-test-sony, dev-test-tau) should be combined (dev-train, dev-test), and models should be trained and evaluated on the respective combined splits.
The evaluation dataset is released a few weeks before the final submission deadline. It consists of only audio and video clips, without any metadata/labels. At the evaluation stage, participants can decide their training procedure, i.e., the amount of training and validation files used from the development dataset, the number of ensemble models, etc., and submit the results of their SELD systems on the evaluation dataset.
There are two tracks that the participants can follow: the audio-only track and the audiovisual track. The participants can choose to submit systems on either of the two tracks, or on both. We encourage participants to submit both audio-only models and audiovisual models.
Submissions for both tracks will be in the same system output format except for off/onscreen labels, and both tracks will be evaluated with the same metrics except for off/onscreen labels. During evaluation, ranking results will be presented in separate tables for each track.
Track A: Audio-only inference
In the audio-only track, inference of the SELD labels is performed with stereo audio input only. Note that we do not exclude the use of video data or video information during training of such models. In this sense, the video clips of the development set can be treated as external data and exploited in various ways to improve the performance of the model. However, during inference only the audio recordings of the evaluation set should be used. Note that the models in this track are not required to provide off/onscreen labels.
Track B: Audiovisual inference
In the audiovisual track, participants have access to perspective video data during training and evaluation. The models in this track are expected to use both audio and video data during inference to produce the SELD labels. Additionally, the models in this track are required to provide off/onscreen labels.
Development dataset
The clips in the development dataset follow the naming convention:
fold[fold number]_room[room number]_mix[recording number per room]_deg[viewing angle in degree]_start[start time in frame].wav
The fold number is currently used only to distinguish between the training and testing splits. The room information is provided to help users of the dataset understand the performance of their method with respect to different conditions.
Each clip is generated by randomly selecting a recording, viewing angle, and start time. The recording number, viewing angle, and start time are provided in the filename to indicate the configuration of the clip. Note that the viewing angle and start time are not sampled at equal intervals but randomly.
For the audiovisual track, video files are provided with the same folder structure and naming convention as the audio files.
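The naming convention can be parsed with a small regular expression. A sketch under the assumption that all fields are non-negative integers; the helper and the example filename are ours:

```python
import re

CLIP_RE = re.compile(
    r"fold(?P<fold>\d+)_room(?P<room>\d+)_mix(?P<mix>\d+)"
    r"_deg(?P<deg>\d+)_start(?P<start>\d+)\.(?:wav|mp4)$"
)

def parse_clip_name(name):
    """Extract fold, room, recording number, viewing angle, and start frame."""
    m = CLIP_RE.search(name)
    if m is None:
        raise ValueError(f"unexpected clip name: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}
```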
External data resources
Since the development set contains recordings of real scenes, the presence of each class and the density of sound events vary greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow the use of external datasets as long as they are publicly available. Examples of external data are sound sample banks, annotated sound event datasets, pre-trained models, and room and array impulse response libraries.
The following rules apply on the use of external data:
- The external datasets or pre-trained models used should be freely and publicly accessible before 15 May 2025.
- Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
- The participants will have to indicate clearly which external data they have used in their system info and technical report.
- Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
Dataset name | Type | Added | Link |
---|---|---|---|
TAU-SRIR DB | room impulse responses | 04.04.2022 | https://zenodo.org/records/6408611 |
6DOF_SRIRs | room impulse responses | 23.11.2021 | https://zenodo.org/records/6382405 |
METU SRIRs | room impulse responses | 10.04.2019 | https://zenodo.org/records/2635758 |
MIRACLE | room impulse responses | 12.10.2023 | https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2 |
AudioSet | audio, video | 30.03.2017 | https://research.google.com/audioset/ |
FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
wav2vec2.0 | pre-trained model | 20.08.2020 | https://github.com/facebookresearch/fairseq |
PaSST | pre-trained model | 18.09.2022 | https://github.com/kkoutini/PaSST |
DTF-AT | pre-trained model | 19.12.2023 | https://github.com/ta012/DTFAT |
FNAC_AVL | pre-trained model | 25.03.2023 | https://github.com/OpenNLPLab/FNAC_AVL |
CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
JSUT | audio | 28.10.2017 | https://sites.google.com/site/shinnosuketakamichi/publication/jsut |
VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |
COCO | video | 01.05.2014 | https://cocodataset.org/ |
360-Indoor | video | 03.10.2019 | http://aliensunmin.github.io/project/360-dataset/ |
TorchVision Models and Pre-trained Weights | pre-trained model | 03.04.2017 | https://pytorch.org/vision/stable/models.html |
YOLOv7 | pre-trained model | 07.07.2022 | https://github.com/WongKinYiu/yolov7 |
YOLOv8 | pre-trained model | 10.01.2023 | https://github.com/ultralytics/ultralytics |
Grounding DINO | pre-trained model | 10.03.2023 | https://github.com/IDEA-Research/GroundingDINO |
MMDetection | pre-trained model | 15.12.2021 | https://github.com/open-mmlab/mmdetection |
MMPose | pre-trained model | 01.01.2020 | https://github.com/open-mmlab/mmpose |
MMFlow | pre-trained model | 01.01.2021 | https://github.com/open-mmlab/mmflow |
Paddle Detection | pre-trained model | 01.01.2019 | https://github.com/PaddlePaddle/PaddleDetection |
doors Image Dataset | image | 18.02.2022 | https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1 |
DoorDetect Dataset | image | 27.05.2021 | https://github.com/MiguelARD/DoorDetect-Dataset |
CLIP | pre-trained model | 05.01.2021 | https://github.com/openai/CLIP |
CLAP | pre-trained model | 06.03.2022 | https://github.com/LAION-AI/CLAP |
Depth Anything | pre-trained model | 22.01.2024 | https://github.com/LiheYoung/Depth-Anything |
PanoFormer | pre-trained model | 04.03.2022 | https://github.com/zhijieshen-bjtu/PanoFormer |
SoundQ Youtube 360° video list | video | 06.10.2023 | https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv |
FMA | audio | 02.12.2016 | https://github.com/mdeff/fma |
Example external data use with baseline
The baseline is trained with a combination of the clips in the development set and additional clips derived from synthetic recordings.
We first synthesize FOA audio data and metadata using SpatialScaper, in which the FOA data are synthesized by convolving isolated sound samples with spatial room impulse responses (SRIRs).
Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, and Juan P. Bello. Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, South Korea, April 2024.
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Abstract
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatialy-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improves as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models.
There are scripts that generate data using the FSD50K sample subset that was hand-picked earlier to conform to the target classes of the challenge, along with exporting labels in the challenge format. You can find the library with usage info here:
This year, we also synthesize 360° videos for the audiovisual track. After synthesizing the FOA audio data and metadata with SpatialScaper, we generate 360° video data with stock backgrounds and object/people images, where the objects move at the coordinates and times indicated in the SpatialScaper metadata. In our preliminary experiments, we used the synthetic FOA and 360° video data to train the audiovisual baseline from a previous challenge. We observed improved metrics across the board when including the new synthetic data, which includes moving objects and real-world backgrounds. More details can be found in this report:
The library with documentation is at:
Finally, we apply the above-mentioned stereo SELD data generator to the synthetic FOA data, 360° video data, and metadata to construct the new clips for stereo SELD tasks.
Task rules
- Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
- Manipulation of the provided training-test split in the development dataset is not allowed for reporting dev-test results using the development dataset.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
- The development dataset can be augmented e.g. using techniques such as pitch shifting or time stretching, rotations, re-spatialization or re-reverberation of parts, etc.
Submission
During the evaluation phase of the challenge, the results for each of the clips in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the respective audio clip, but with the .csv extension. In the audio-only track, each row should contain the following information:
[frame number (int)],[active class index (int)],[azimuth (int)],[distance (int)]
Each result file in the audiovisual track should contain the above four columns and an onscreen column at each row:
[frame number (int)],[active class index (int)],[azimuth (int)],[distance (int)],[off/onscreen (0/1)]
Enumeration of frame and class indices begins at zero. The class indices follow the order of the class list above. The evaluation will be performed at a temporal resolution of 100 msec. In case participants use a different frame or hop length in their system, we expect them to resample their output to the specified resolution with a suitable method before submitting the evaluation results.
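For instance, if a system operates at a 20 ms hop, one simple option is to pool each 100 ms output frame over the five fine frames it spans. The sketch below shows one possible approach (not a prescribed method; the function is ours):

```python
def resample_to_100ms(fine_frames, hop_ms=20, out_hop_ms=100):
    """Pool per-frame event lists at a finer hop into 100 ms output frames.

    fine_frames: list of event lists, one per fine frame; each event is kept
    once per output frame (union pooling over the fine frames it spans).
    """
    step = out_hop_ms // hop_ms
    out = []
    for i in range(0, len(fine_frames), step):
        merged = []
        for frame in fine_frames[i:i + step]:
            for event in frame:
                if event not in merged:
                    merged.append(event)
        out.append(merged)
    return out
```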
In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team for each of the two tracks (audio-only, audiovisual). For each system, meta-information should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. The detailed information regarding the challenge information can be found in the submission page.
General information for all DCASE submissions can be found on the Submission page.
Evaluation
The evaluation is based on metrics that jointly evaluate localization and detection performance, similar to the ones used in the previous challenge. However, this year we add onscreen estimation evaluation for the audiovisual track, and the ranking is based only on the location-dependent F1-score.
Metrics
The metrics are based on true positives (\(TP\)) and false positives (\(FP\)) determined not only by correct or wrong detections, but also by whether: a) the azimuth error \(\Omega = \angle(DOA_p, DOA_r)\) between the DOA of the prediction and the DOA of the matched reference event (if any) is smaller than an angular threshold \(T_{DOA}\); b) the relative distance error \(\Delta = |L_p-L_r|/L_r\) between the distance \(L_p\) of the prediction and the distance \(L_r\) of the matched reference (if any) is smaller than a relative distance error threshold \(T_{RD}\); c) the onscreen prediction \(OS_p\) equals the onscreen status \(OS_r\) of the matched reference (if any). For the evaluation of this challenge, we set the angular threshold to \(T_{DOA} = 20^\circ\) and the relative distance threshold to \(T_{RD} = 1\).
More specifically, for each class \(c\in[1,...,C]\) and each frame:
- \(P_c\) predicted events of class \(c\) are associated with \(R_c\) reference events of class \(c\)
- false negatives are counted for misses: \(FN_c = \max(0, R_c-P_c)\)
- false positives are counted for extraneous predictions: \(FP_{c}=\max(0,P_c-R_c)\)
- \(K_c=\min(P_c,R_c)\) predictions are spatially associated with references based on the Hungarian algorithm minimizing the azimuth error \(\Omega\). These can also be considered the unthresholded true positives \(TP_c = K_c\).
- the application of the spatial and onscreen thresholds moves \(L_c\leq K_c\) predictions further than the thresholds to false positives: \(FP_{c,(\Omega>T_{DOA})\cup(\Delta>T_{RD})\cup(OS_p\neq OS_r)} = L_c\), and \(FP_{c,+} = FP_{c}+FP_{c,(\Omega > T_{DOA})\cup(\Delta> T_{RD})\cup(OS_p\neq OS_r)}\)
- the remaining matched estimates per class are counted as true positives: \(TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)} = K_c-FP_{c,(\Omega> T_{DOA})\cup(\Delta>T_{RD})\cup(OS_p\neq OS_r)}\)
- finally: predictions \(P_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)}+ FP_{c,+}\), while references \(R_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)}+FP_{c,(\Omega> T_{DOA})\cup(\Delta> T_{RD})\cup(OS_p\neq OS_r)}+FN_c\)
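The per-frame counting above can be sketched as follows. This is an illustrative simplification, not the official evaluation code; it assumes predictions and references for one class are given as (azimuth in degrees, distance, onscreen flag) tuples:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

T_DOA, T_RD = 20.0, 1.0  # thresholds from the text

def count_frame(preds, refs):
    """Count (TP, FP, FN) for one class in one frame (illustrative sketch).

    preds/refs: lists of (azimuth_deg, distance, onscreen) tuples.
    """
    P, R = len(preds), len(refs)
    fn = max(0, R - P)   # misses
    fp = max(0, P - R)   # extraneous predictions
    if min(P, R) == 0:
        return 0, fp, fn
    # Hungarian assignment minimizing the azimuth error Omega
    # (wrapping is moot for stereo azimuths but kept for generality)
    cost = np.array([[abs((p[0] - r[0] + 180.0) % 360.0 - 180.0)
                      for r in refs] for p in preds])
    pred_idx, ref_idx = linear_sum_assignment(cost)
    tp = 0
    for i, j in zip(pred_idx, ref_idx):
        omega = cost[i, j]
        delta = abs(preds[i][1] - refs[j][1]) / refs[j][1]
        if omega <= T_DOA and delta <= T_RD and preds[i][2] == refs[j][2]:
            tp += 1   # within all thresholds: true positive
        else:
            fp += 1   # matched but outside a threshold: false positive
    return tp, fp, fn
```

For example, one prediction against two references of the same class yields one miss plus one spatially matched pair that is then checked against the thresholds: `count_frame([(10.0, 2.0, True)], [(12.0, 2.1, True), (80.0, 1.0, False)])` returns `(1, 0, 1)`.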
Based on these counts, we form the per-class location-dependent F1-score \(F_{c,LD}\), which is then macro-averaged: \(F_{LD}= \sum_c F_{c,LD}/C\).
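Given per-class counts pooled over the dataset, the macro-averaging follows the standard F1 definition \(F_{c,LD} = 2\,TP_c/(2\,TP_c+FP_c+FN_c)\); a minimal sketch with hypothetical counts:

```python
import numpy as np

def macro_f1(tp, fp, fn):
    """Macro-averaged location-dependent F1 from per-class counts.

    tp, fp, fn: per-class counts summed over the dataset (hypothetical data).
    """
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Standard F1 per class, guarding against empty classes
    f = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return float(f.mean())

# Two classes: F1 of 0.8 and 0.5, macro-averaged
print(macro_f1([8, 5], [2, 5], [2, 5]))  # 0.65
```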
We evaluate DOA localization accuracy through a class-dependent azimuth error \(DOAE_c\), computed as the mean angular error of the matched true positives per class, and then macro-averaged:
- \(DOAE_c = \sum_k \Omega_k/ K_c = \sum_k \Omega_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Omega_k\) being the angular error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(DOAE = \sum_c DOAE_c/C\).
Distance localization accuracy is evaluated through a class-dependent relative distance error \(RDE_c\), computed as the mean relative distance error of the matched true positives per class, and then macro-averaged:
- \(RDE_c = \sum_k \Delta_k/ K_c = \sum_k \Delta_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Delta_k\) being the relative distance error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(RDE = \sum_c RDE_c/C\).
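Both \(DOAE\) and \(RDE\) follow the same two-stage averaging: a per-class mean over the matched true positives, then a macro-average over classes. A minimal sketch with hypothetical per-match errors:

```python
import numpy as np

def macro_error(per_class_errors):
    """Macro-average matched-TP errors; applies to both DOAE and RDE.

    per_class_errors: one array per class, holding the per-match errors
    (Omega_k for DOAE, Delta_k for RDE) pooled over all frames with at
    least one true positive (hypothetical data).
    """
    class_means = [np.mean(e) for e in per_class_errors if len(e) > 0]
    return float(np.mean(class_means))

# Two classes with hypothetical angular errors of their matched predictions:
print(macro_error([np.array([10.0, 20.0]), np.array([30.0])]))  # (15 + 30) / 2 = 22.5
```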
Finally, onscreen estimation is evaluated through its class-dependent accuracy \(OSA_c\), computed as the ratio between the number of correct onscreen estimates and the number of true positives per class, and then macro-averaged: \(OSA = \sum_c OSA_c/C\).
Note that the DOA and distance localization errors are not thresholded in order to give more varied complementary information to the location-dependent F1-score, presenting localization accuracy outside of the spatial threshold.
Ranking
Since the localization-aware F1-score takes into account all the aspects of the system (i.e., event detection, azimuth and distance localization, and, for the audiovisual track, onscreen estimation), the overall ranking of the task will be done according only to it. Additionally, rankings based on the macro-averaged azimuth error \(DOAE\), the macro-averaged relative distance error \(RDE\), and the macro-averaged onscreen accuracy \(OSA\) will also be published on the result website.
Baseline system
The baselines for both Track A (audio-only inference) and Track B (audiovisual inference) have been unified into a common codebase with a flag to switch from one to the other. Additionally, on/offscreen classification has been integrated into the baseline in the audiovisual track.
The baseline model uses log mel-spectrogram from stereo audio as input features for the audio-only track, while the audiovisual track additionally incorporates per-frame visual features from a pre-trained model. Audio features are processed using a convolutional recurrent neural network model, and audiovisual representations are fused via cross-attention layers. The features are passed to fully-connected layers to make the predictions.
The baseline model follows the multi-ACCDOA format from previous years, predicting up to three simultaneous sound events per class. For each time step and for all 13 classes, the model outputs 3 sets of x, y (i.e., the DOA vector) and distance, plus, for the audiovisual track, an additional on/offscreen binary output. A mean squared error loss is used to train the DOA and distance predictions, and a binary cross-entropy loss is used for the on/offscreen binary output.
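A rough sketch of how such a multi-ACCDOA output could be decoded into active events for the audio-only case. The activity threshold of 0.5 and the atan2 azimuth convention are assumptions for illustration, not necessarily the baseline's exact choices:

```python
import numpy as np

def decode_multi_accdoa(output, thresh=0.5):
    """Decode one multi-ACCDOA frame into (class, azimuth, distance) events.

    output: array of shape (3 tracks, 13 classes, 3) holding (x, y, distance)
    per track and class. An event is considered active when the norm of its
    (x, y) DOA vector exceeds `thresh` (an assumed value); the azimuth
    convention atan2(y, x) is likewise an assumption for illustration.
    """
    events = []
    for track in range(output.shape[0]):
        for cls in range(output.shape[1]):
            x, y, dist = output[track, cls]
            if np.hypot(x, y) > thresh:  # DOA vector norm encodes activity
                azimuth = np.degrees(np.arctan2(y, x))
                events.append((cls, float(azimuth), float(dist)))
    return events

frame = np.zeros((3, 13, 3))
frame[0, 4] = [0.0, 1.0, 2.5]       # class 4 active on track 0, 2.5 m away
print(decode_multi_accdoa(frame))   # [(4, 90.0, 2.5)]
```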
Please refer to the README file in the baseline repository for detailed information:
More details on the multi-ACCDOA format can be found in:
Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, and Yuki Mitsufuji. Multi-ACCDOA: localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.
and on the extended format with distance estimation here:
Results for the development dataset
The evaluation metric scores for the test split of the development dataset are given below. While this challenge uses the localization-aware F1-score considering the spatial and onscreen criteria for ranking in the audiovisual track, the F1-score considering only the spatial thresholds is also shown, so that the audiovisual results can be compared with the audio-only results.
Track A: Audio-only baseline
Dataset | macro F20°/1 (%) | macro F20°/1/on (%) | DOAE (°) | RDE (%) | OSA (%) |
---|---|---|---|---|---|
Stereo | 22.8 | N/A | 24.5 | 41 | N/A |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Track B: Audiovisual baseline
Dataset | macro F20°/1 (%) | macro F20°/1/on (%) | DOAE (°) | RDE (%) | OSA (%) |
---|---|---|---|---|---|
Stereo | 26.8 | 20.0 | 23.8 | 40 | 80 |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Citation
If you are participating in this task or using the dataset and code, please consider citing the following paper:
David Diaz-Guerra, Archontis Politis, Parthasaarathy Sudarsanam, Kazuki Shimada, Daniel A. Krause, Kengo Uchida, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji, and Tuomas Virtanen. Baseline models and evaluation of sound event localization and detection with distance estimation in dcase2024 challenge. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), 41–45. Tokyo, Japan, October 2024.