The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions while they are active.
Description
Given multichannel audio input, a sound event localization and detection (SELD) system outputs localization estimates of one or more events for each of the target sound classes, whenever such events are detected. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation with visually occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.
While the previous challenges used four-channel audio data, i.e., first-order Ambisonics (FOA) and microphone array data, this challenge tackles SELD with stereo audio data (called stereo SELD), investigating the task in a commonplace audio and media scenario. Since the stereo audio used in the task has top-bottom and front-back angular ambiguity, the task focuses on direction-of-arrival (DOA) estimation of azimuth angles only, along the left-right axis. The challenge continues to tackle distance estimation, as we believe there is still room for novel solutions.
This challenge evaluates stereo SELD models with audio-only input (Track A) or audiovisual input (Track B). Since the field-of-view (FOV) is no longer 360°, Track B poses an interesting new sub-task: onscreen/offscreen classification of the detected events. The evaluation metrics are modified to take this classification task into account. We encourage participants to submit both audio-only SELD systems and audiovisual SELD systems.

Dataset
This challenge uses a stereo audio and video dataset, DCASE2025 Task3 Stereo SELD Dataset, derived from the STARSS23 dataset. The original STARSS23's FOA audio and 360-degree video data have been converted to stereo audio and perspective video data, simulating regular media content.
The STARSS23 dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The original 360-degree videos are spatially and temporally aligned with the microphone array recordings. More details on the recording and annotation procedure can be found in the DCASE2022 Challenge task description and in the dataset paper of STARSS22:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.
and in the dataset paper of STARSS23:
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
To construct the DCASE2025 Task3 Stereo SELD Dataset, we apply the following sampling and conversion procedures to the STARSS23 dataset. We first sample 5-second clips from the original STARSS23 recordings. Then we convert each 5-second FOA audio clip and 360° video clip to stereo audio and perspective video data corresponding to a fixed point-of-view. We first rotate the FOA audio according to the fixed viewing angle, then convert the rotated FOA audio to stereo audio emulating a mid-side (M/S) recording technique. We convert the equirectangular video to a perspective video with the same viewing angle as the audio. We set the horizontal FOV to 100° and the video resolution (width × height) to 640 × 360 pixels, i.e., the 16:9 aspect ratio widely used in media content.
We also rotate the original STARSS23 DOA labels to new DOA labels centered at the fixed viewing angle. The new azimuth labels are folded back from the rear hemisphere to the front, considering front-back ambiguity. The elevation labels are omitted due to top-bottom ambiguity. The distance labels are kept the same as in STARSS23. To obtain the binary onscreen/offscreen event labels, we compare the new DOA labels with the FOV of the perspective video.
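The label conversion above can be sketched as follows. This is an illustrative simplification with names of our own choosing: the actual generator also accounts for the vertical extent of the FOV when deciding onscreen status, whereas this sketch uses azimuth only.

```python
import math

def wrap180(angle):
    """Wrap an angle in degrees to [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def convert_azimuth(azimuth_deg, view_deg):
    """Rotate a STARSS23 azimuth to the fixed viewing angle, then fold
    back-hemisphere angles onto the front (front-back ambiguity)."""
    az = wrap180(azimuth_deg - view_deg)
    if abs(az) > 90.0:
        az = math.copysign(180.0 - abs(az), az)
    return az

def is_onscreen(azimuth_deg, hfov_deg=100.0):
    """Simplified onscreen test using only azimuth and the horizontal FOV."""
    return abs(azimuth_deg) <= hfov_deg / 2.0
```

For example, a source at 120° azimuth with a 0° viewing angle folds to 60° in front, and with the 100° horizontal FOV it would be labeled offscreen.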
Dataset specifications
The specifications of the stereo audio and video dataset can be summarized in the following:
Recording (STARSS22/23 setup):
- Each recording clip is part of a recording session happening in a unique room.
- Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
- 13 target classes are identified in the recordings and strongly annotated by humans.
- Spatial annotations for those active events are captured by an optical tracking system.
- Sound events out of the target classes are considered as interference.
- Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 6) can occur but are rare.
Sampling and conversion:
- In each sampling step, a recording, start frame, and viewing angle are randomly selected.
- A recording is selected with length-weighted random choice to treat all frames of all files equally.
- A start frame is selected uniformly within each recording.
- A horizontal viewing angle is selected uniformly in 360° while the vertical viewing angle is kept at 0° elevation.
- 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav) are not selected to keep the same set between audio-only and audiovisual tracks.
- Several clips don't contain any target sound events after random sampling.
- The class distribution across all frames after random sampling is similar to the STARSS23 one.
- The onscreen / offscreen distribution across all frames is around 1 : 3.
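The length-weighted sampling described above could be sketched as follows; this is illustrative only, with function and parameter names of our own choosing, and lengths measured in frames.

```python
import random

def sample_clip(recording_lengths, clip_frames=50, seed=None):
    """Pick a recording with probability proportional to its length
    (so all frames of all files are treated equally), then a start frame
    and a horizontal viewing angle uniformly.

    recording_lengths: dict mapping recording name -> length in frames.
    """
    rng = random.Random(seed)
    names = list(recording_lengths)
    rec = rng.choices(names, weights=[recording_lengths[n] for n in names])[0]
    start = rng.randrange(0, max(1, recording_lengths[rec] - clip_frames + 1))
    view = rng.uniform(-180.0, 180.0)
    return rec, start, view
```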
Volume, duration, and data split:
- A total of 16 unique rooms captured in the recordings (development set).
- 30,000 clips of 5 sec duration, with a total time of 41.7 hrs (development dataset).
- 23.9 % of the clips are derived from recordings in Tokyo (development dataset).
- 76.1 % of the clips are derived from recordings in Tampere (development dataset).
- A training-testing split is provided for reporting results using the development dataset.
- 2 rooms in Tokyo are for the training split (dev-train-sony).
- 2 rooms in Tokyo are for the testing split (dev-test-sony).
- 7 rooms in Tampere are for the training split (dev-train-tau).
- 5 rooms in Tampere are for the testing split (dev-test-tau).
Audio:
- Sampling rate: 24 kHz.
- Bit depth: 16 bits.
- Stereo format: mid-side (M/S) technique with left-right cardioid stereo patterns.
Video:
- Video format: perspective.
- Video resolution: 640x360.
- Video frames per second (fps): 29.97.
Sound event classes
13 target sound event classes were annotated. The classes loosely follow the AudioSet ontology.
- Female speech, woman speaking
- Male speech, man speaking
- Clapping
- Telephone
- Laughter
- Domestic sounds
- Walk, footsteps
- Door, open or close
- Music
- Musical instrument
- Water tap, faucet
- Bell
- Knock
The content of some of these classes corresponds to events of a limited range of AudioSet-related subclasses. These are detailed here to aid the participants:
- Telephone
- Mostly traditional Telephone Bell Ringing and Ringtone sounds, without musical ringtones.
- Domestic sounds
- Sounds of Vacuum cleaner
- Sounds of water boiler, closer to Boiling
- Sounds of air circulator, closer to Mechanical fan
- Door, open or close
- Combination of Door and Cupboard open or close
- Music
- Background music and Pop music played by a loudspeaker in the room.
- Musical Instrument
- Acoustic guitar
- Marimba, xylophone
- Cowbell
- Piano
- Rattle (instrument)
- Bell
- Combination of sounds from hotel bell and glass bell, closer to Bicycle bell and single Chime.
Some additional notes:
- The speech classes contain speech in a few different languages.
- There are occasionally localized sound events that are not annotated and are considered as interferers, with examples such as computer keyboard, shuffling cards, dishes, pots, and pans.
- There is natural background noise (e.g. HVAC noise) in all recordings, at very low levels in some and at quite high levels in others. Such mostly diffuse background noise should be distinct from other noisy target sources (e.g. vacuum cleaner, mechanical fan) since these are clearly spatially localized.
Audio format description
The stereo audio format is derived from the full-sphere first-order Ambisonics (FOA) format used in the previous iterations of the challenge. For recording details and encoding specifications of the FOA format refer to the previous task descriptions.
The stereo format specification emulates a coincident mid-side (M/S) stereo recording technique using cardioid microphones pointing at +90° (left channel) and -90° (right channel). Since any mid-side stereo recording configuration can be extracted from FOA signals, in this simple M/S case the stereo channels are derived using only the omnidirectional and the left-right dipole components of the FOA signals.
More specifically, for FOA signals following an ACN/SN3D convention ordered as \([W(n), Y(n), Z(n), X(n)]\), the stereo signals \([L(n), R(n)]\) are then simply:
- \(L(n) = W(n) + Y(n)\)
- \(R(n) = W(n) - Y(n)\)
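In code, the conversion is a single linear combination of the first two FOA channels. A minimal NumPy sketch (the function name is ours), assuming a signal array with channels in ACN order:

```python
import numpy as np

def foa_to_stereo(foa):
    """Convert FOA (ACN/SN3D, channels [W, Y, Z, X]) to M/S stereo.

    foa: array of shape (num_samples, 4); returns (num_samples, 2) as [L, R],
    using L = W + Y and R = W - Y.
    """
    w, y = foa[:, 0], foa[:, 1]
    return np.stack([w + y, w - y], axis=1)
```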
Reference labels, directions-of-arrival, source distances, and off/onscreen labels
For each clip in the development dataset, the labels, DOAs, distances, and off/onscreen labels are provided in a plain-text CSV file with the same filename as the clip, in the following format:
[frame number (int)],[active class index (int)],[source number index (int)],[azimuth (int)],[distance (int)],[off/onscreen (0/1)]
Frame, class, and source enumeration begins at zero. Frames correspond to a temporal resolution of 100 msec. Azimuth angles are given in degrees, rounded to the closest integer value, with zero azimuth at the front and \(\phi \in [-90^{\circ}, 90^{\circ}]\). The azimuth angle increases counter-clockwise (\(\phi = 90^{\circ}\) at the left). Note that the azimuth angle is folded back from behind to the front, due to the front-back stereo ambiguity. The elevation angle is omitted due to the top-bottom ambiguity. Distances are provided in centimeters, rounded to the closest integer value.
In the off/onscreen column, 0 means offscreen and 1 means onscreen. An onscreen event appears within the FOV of the perspective video.
Note that the off/onscreen labels are only used in the audiovisual track: participants in the audiovisual track need to provide off/onscreen labels, while participants in the audio-only track do not.
The source index is a unique integer for each source in the scene, and it is provided only as additional information. Note that each unique actor gets assigned one such identifier, but not individual events produced by the same actor; e.g. a clapping event and a laughter event produced by the same person have the same identifier. Independent sources that are not actors (e.g. a loudspeaker playing music in the room) get a 0 identifier.
Note that the source index is only included in the development metadata as additional information that can be exploited during training. It is not required to be estimated or provided by the participants in their results.
Overlapping sound events are indicated with duplicate frame numbers and can belong to the same or different classes. An example sequence could be:
10, 1, 1, -60, 181, 0
11, 1, 1, -60, 181, 0
11, 1, 2, 10, 243, 1
12, 1, 2, 10, 243, 1
13, 1, 2, 10, 243, 1
13, 8, 0, -40, 80, 1
which describes that in frames 10-11, an event of class male speech (class 1) belonging to one actor (source 1) is active at location (-60°, 181 cm), offscreen (off/onscreen 0). At frame 11, a second instance of the same class appears simultaneously at a different location (10°, 243 cm), onscreen (off/onscreen 1), belonging to another actor (source 2), while at frame 13 an additional event of class music (class 8) appears, belonging to a non-actor source (source 0). Frames that contain no sound events are not included in the sequence.
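A metadata file in this format can be read with the standard csv module. A minimal sketch (the helper below is ours, not part of the challenge tooling):

```python
import csv
import io
from collections import defaultdict

def parse_metadata(csv_text):
    """Group metadata rows by frame number.

    Returns a dict mapping frame -> list of
    (class_idx, source_idx, azimuth_deg, distance_cm, onscreen) tuples.
    """
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        frame, cls, src, az, dist, onscreen = (int(v) for v in row)
        frames[frame].append((cls, src, az, dist, onscreen))
    return dict(frames)
```

Applied to the example above, frame 11 would map to two simultaneous events of class 1 with different source indices.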
Download
The development version of the dataset can be downloaded at:
The stereo SELD data generator for the task is also made available:
The data generator makes it possible to construct stereo SELD datasets like the DCASE2025 Task3 Stereo SELD Dataset from real or synthetic FOA SELD datasets. The generator samples a clip randomly and converts its FOA audio / 360° video / metadata to new stereo audio / perspective video / metadata according to a viewing angle.
Task setup
The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with the baseline results and consistent reporting of results at the technical reports of the submitted systems, before the evaluation stage.
- Note that even though there are two origins of the data, Sony and TAU, the challenge task considers the dataset as a single entity. Hence, models should not be trained and tested separately for each of the two origins. Instead, the clips of the individual training splits (dev-train-sony, dev-train-tau) and testing splits (dev-test-sony, dev-test-tau) should be combined (dev-train, dev-test), and models should be trained and evaluated on the respective combined splits.
The evaluation dataset is released a few weeks before the final submission deadline. It consists of only audio and video clips, without any metadata/labels. At the evaluation stage, participants can decide their training procedure, i.e., the amount of training and validation files used from the development dataset, the number of ensemble models, etc., and submit the results of their SELD systems on the evaluation dataset.
There are two tracks that the participants can follow: the audio-only track and the audiovisual track. The participants can choose to submit systems on either of the two tracks, or on both. We encourage participants to submit both audio-only models and audiovisual models.
Submissions for both tracks will be in the same system output format except for off/onscreen labels, and both tracks will be evaluated with the same metrics except for off/onscreen labels. During evaluation, ranking results will be presented in separate tables for each track.
Track A: Audio-only inference
In the audio-only track, inference of the SELD labels is performed with stereo audio input only. Note that we do not exclude the use of video data or video information during training of such models. In this sense, the video clips of the development set can be treated as external data and exploited in various ways to improve the performance of the model. However, during inference only the audio recordings of the evaluation set should be used. Note that the models in this track are not required to provide off/onscreen labels.
Track B: Audiovisual inference
In the audiovisual track, participants have access to perspective video data during training and evaluation. The models in this track are expected to use both audio and video data during inference to produce the SELD labels. Additionally, the models in this track are required to provide off/onscreen labels.
Development dataset
The clips in the development dataset follow the naming convention:
fold[fold number]_room[room number]_mix[recording number per room]_deg[viewing angle in degree]_start[start time in frame].wav
The fold number is currently used only to distinguish between the training and testing splits. The room information is provided to help users of the dataset understand the performance of their method with respect to different conditions.
Each clip is generated by randomly selecting a recording, viewing angle, and start time. The recording number, viewing angle, and start time are provided in the filename to indicate the configuration of the clip. Note that the viewing angle and start time are not sampled at equal intervals but randomly.
For the audiovisual track, video files are provided with the same folder structure and naming convention as the audio files.
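The naming convention can be parsed with a small regular expression. A sketch under the assumption that all fields are non-negative integers; the helper and the example filename are ours:

```python
import re

CLIP_RE = re.compile(
    r"fold(?P<fold>\d+)_room(?P<room>\d+)_mix(?P<mix>\d+)"
    r"_deg(?P<deg>\d+)_start(?P<start>\d+)\.(?:wav|mp4)$"
)

def parse_clip_name(name):
    """Extract fold, room, recording number, viewing angle, and start frame."""
    m = CLIP_RE.search(name)
    if m is None:
        raise ValueError(f"unexpected clip name: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}
```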
External data resources
Since the development set contains recordings of real scenes, the presence of each class and the density of sound events vary greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow the use of external datasets as long as they are publicly available. Examples of external data are sound sample banks, annotated sound event datasets, pre-trained models, and room and array impulse response libraries.
The following rules apply on the use of external data:
- The external datasets or pre-trained models used should be freely and publicly accessible before 15 May 2025.
- Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
- The participants will have to indicate clearly which external data they have used in their system info and technical report.
- Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
Dataset name | Type | Added | Link |
---|---|---|---|
TAU-SRIR DB | room impulse responses | 04.04.2022 | https://zenodo.org/records/6408611 |
6DOF_SRIRs | room impulse responses | 23.11.2021 | https://zenodo.org/records/6382405 |
METU SRIRs | room impulse responses | 10.04.2019 | https://zenodo.org/records/2635758 |
MIRACLE | room impulse responses | 12.10.2023 | https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2 |
AudioSet | audio, video | 30.03.2017 | https://research.google.com/audioset/ |
FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
wav2vec2.0 | pre-trained model | 20.08.2020 | https://github.com/facebookresearch/fairseq |
PaSST | pre-trained model | 18.09.2022 | https://github.com/kkoutini/PaSST |
DTF-AT | pre-trained model | 19.12.2023 | https://github.com/ta012/DTFAT |
FNAC_AVL | pre-trained model | 25.03.2023 | https://github.com/OpenNLPLab/FNAC_AVL |
CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
JSUT | audio | 28.10.2017 | https://sites.google.com/site/shinnosuketakamichi/publication/jsut |
VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |
COCO | video | 01.05.2014 | https://cocodataset.org/ |
360-Indoor | video | 03.10.2019 | http://aliensunmin.github.io/project/360-dataset/ |
TorchVision Models and Pre-trained Weights | pre-trained model | 03.04.2017 | https://pytorch.org/vision/stable/models.html |
YOLOv7 | pre-trained model | 07.07.2022 | https://github.com/WongKinYiu/yolov7 |
YOLOv8 | pre-trained model | 10.01.2023 | https://github.com/ultralytics/ultralytics |
Grounding DINO | pre-trained model | 10.03.2023 | https://github.com/IDEA-Research/GroundingDINO |
MMDetection | pre-trained model | 15.12.2021 | https://github.com/open-mmlab/mmdetection |
MMPose | pre-trained model | 01.01.2020 | https://github.com/open-mmlab/mmpose |
MMFlow | pre-trained model | 01.01.2021 | https://github.com/open-mmlab/mmflow |
Paddle Detection | pre-trained model | 01.01.2019 | https://github.com/PaddlePaddle/PaddleDetection |
doors Image Dataset | image | 18.02.2022 | https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1 |
DoorDetect Dataset | image | 27.05.2021 | https://github.com/MiguelARD/DoorDetect-Dataset |
CLIP | pre-trained model | 05.01.2021 | https://github.com/openai/CLIP |
CLAP | pre-trained model | 06.03.2022 | https://github.com/LAION-AI/CLAP |
Depth Anything | pre-trained model | 22.01.2024 | https://github.com/LiheYoung/Depth-Anything |
PanoFormer | pre-trained model | 04.03.2022 | https://github.com/zhijieshen-bjtu/PanoFormer |
SoundQ Youtube 360° video list | video | 06.10.2023 | https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv |
FMA | audio | 02.12.2016 | https://github.com/mdeff/fma |
Example external data use with baseline
The baseline is trained with a combination of the clips in the development set and additional clips derived from synthetic recordings.
We first synthesize FOA audio data and metadata using SpatialScaper, in which the FOA data are synthesized by convolving isolated sound samples with spatial room impulse responses (SRIRs).
Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, and Juan P. Bello. Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, South Korea, April 2024.
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Abstract
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatialy-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improves as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models.
There are scripts that generate data using the FSD50K sample subset that was hand-picked earlier to conform to the target classes of the challenge, along with exporting labels in the challenge format. You can find the library with usage info here:
This year, we also synthesize 360° videos for the audiovisual track. After synthesizing the FOA audio data and metadata with SpatialScaper, we generate 360° video data with stock backgrounds and object/people images, where the objects move at the coordinates and times indicated in the SpatialScaper metadata. In our preliminary experiments, we used the synthetic FOA and 360° video data to train the audiovisual baseline from a previous challenge. We observed improved metrics across the board when including the new synthetic data, which includes moving objects and real-world backgrounds. More details can be found in this report:
The library with documentation is at:
Finally, we apply the above-mentioned stereo SELD data generator to the synthetic FOA data, 360° video data, and metadata to construct the new clips for stereo SELD tasks.
Task rules
- Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
- Manipulation of the provided training-test split in the development dataset is not allowed for reporting dev-test results using the development dataset.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
- The development dataset can be augmented e.g. using techniques such as pitch shifting or time stretching, rotations, re-spatialization or re-reverberation of parts, etc.
Submission
During the evaluation phase of the challenge, the results for each of the clips in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the respective audio clip, but with the .csv extension. In the audio-only track, each row should contain the following information:
[frame number (int)],[active class index (int)],[azimuth (int)],[distance (int)]
Each result file in the audiovisual track should contain the above four columns and an onscreen column at each row:
[frame number (int)],[active class index (int)],[azimuth (int)],[distance (int)],[off/onscreen (0/1)]
Enumeration of frame and class indices begins at zero. The class indices follow the order of the class list above. The evaluation will be performed at a temporal resolution of 100 msec. In case participants use a different frame or hop length in their system, we expect them to resample their output to the specified resolution with a suitable method before submitting the evaluation results.
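For instance, if a system operates at a 20 ms hop, one simple option is to pool each 100 ms output frame over the five fine frames it spans. The sketch below shows one possible approach (not a prescribed method; the function is ours):

```python
def resample_to_100ms(fine_frames, hop_ms=20, out_hop_ms=100):
    """Pool per-frame event lists at a finer hop into 100 ms output frames.

    fine_frames: list of event lists, one per fine frame; each event is kept
    once per output frame (union pooling over the fine frames it spans).
    """
    step = out_hop_ms // hop_ms
    out = []
    for i in range(0, len(fine_frames), step):
        merged = []
        for frame in fine_frames[i:i + step]:
            for event in frame:
                if event not in merged:
                    merged.append(event)
        out.append(merged)
    return out
```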
In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team for each of the two tracks (audio-only, audiovisual). For each system, meta-information should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. The detailed information regarding the challenge information can be found in the submission page.
General information for all DCASE submissions can be found on the Submission page.
Evaluation
The evaluation is based on metrics that jointly evaluate localization and detection performance, similar to the ones used in the previous challenge. However, this year we add onscreen estimation evaluation for the audiovisual track, and the ranking is based only on the location-dependent F1-score.
Metrics
The metrics are based on true positives (\(TP\)) and false positives (\(FP\)) determined not only by correct or wrong detections, but also by whether: a) the azimuth error \(\Omega = \angle(DOA_p, DOA_r)\) between the DOA of the prediction and the DOA of the matched reference event (if any) is smaller than an angular threshold \(T_{DOA}\); b) the relative distance error \(\Delta = |L_p-L_r|/L_r\) between the distance \(L_p\) of the prediction and the distance \(L_r\) of the matched reference (if any) is smaller than a relative distance error threshold \(T_{RD}\); c) the onscreen prediction \(OS_p\) equals the onscreen status \(OS_r\) of the matched reference (if any). For the evaluation of this challenge, we set the angular threshold to \(T_{DOA} = 20^\circ\) and the relative distance threshold to \(T_{RD} = 1\).
More specifically, for each class \(c\in[1,...,C]\) and each frame:
- \(P_c\) predicted events of class \(c\) are associated with \(R_c\) reference events of class \(c\)
- false negatives are counted for misses: \(FN_c = \max(0, R_c-P_c)\)
- false positives are counted for extraneous predictions: \(FP_{c}=\max(0,P_c-R_c)\)
- \(K_c=\min(P_c,R_c)\) predictions are spatially associated with references based on the Hungarian algorithm minimizing the azimuth error \(\Omega\). These can also be considered the unthresholded true positives \(TP_c = K_c\).
- the application of the spatial and onscreen thresholds moves \(L_c\leq K_c\) predictions further than the thresholds to false positives: \(FP_{c,(\Omega>T_{DOA})\cup(\Delta>T_{RD})\cup(OS_p\neq OS_r)} = L_c\), and \(FP_{c,+} = FP_{c}+FP_{c,(\Omega > T_{DOA})\cup(\Delta> T_{RD})\cup(OS_p\neq OS_r)}\)
- the remaining matched estimates per class are counted as true positives: \(TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)} = K_c-FP_{c,(\Omega> T_{DOA})\cup(\Delta>T_{RD})\cup(OS_p\neq OS_r)}\)
- finally: predictions \(P_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)}+ FP_{c,+}\), while references \(R_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})\cap(OS_p= OS_r)}+FP_{c,(\Omega> T_{DOA})\cup(\Delta> T_{RD})\cup(OS_p\neq OS_r)}+FN_c\)
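The per-frame counting above can be sketched as follows. This is an illustrative simplification, not the official evaluation code; it assumes predictions and references for one class are given as (azimuth in degrees, distance, onscreen flag) tuples:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

T_DOA, T_RD = 20.0, 1.0  # thresholds from the text

def count_frame(preds, refs):
    """Count (TP, FP, FN) for one class in one frame (illustrative sketch).

    preds/refs: lists of (azimuth_deg, distance, onscreen) tuples.
    """
    P, R = len(preds), len(refs)
    fn = max(0, R - P)   # misses
    fp = max(0, P - R)   # extraneous predictions
    if min(P, R) == 0:
        return 0, fp, fn
    # Hungarian assignment minimizing the azimuth error Omega
    # (wrapping is moot for stereo azimuths but kept for generality)
    cost = np.array([[abs((p[0] - r[0] + 180.0) % 360.0 - 180.0)
                      for r in refs] for p in preds])
    pred_idx, ref_idx = linear_sum_assignment(cost)
    tp = 0
    for i, j in zip(pred_idx, ref_idx):
        omega = cost[i, j]
        delta = abs(preds[i][1] - refs[j][1]) / refs[j][1]
        if omega <= T_DOA and delta <= T_RD and preds[i][2] == refs[j][2]:
            tp += 1   # within all thresholds: true positive
        else:
            fp += 1   # matched but outside a threshold: false positive
    return tp, fp, fn
```

For example, one prediction against two references of the same class yields one miss plus one spatially matched pair that is then checked against the thresholds: `count_frame([(10.0, 2.0, True)], [(12.0, 2.1, True), (80.0, 1.0, False)])` returns `(1, 0, 1)`.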
Based on these counts, we form the per-class location-dependent F1-score \(F_{c,LD}\), which is then macro-averaged: \(F_{LD}= \sum_c F_{c,LD}/C\).
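Given per-class counts pooled over the dataset, the macro-averaging follows the standard F1 definition \(F_{c,LD} = 2\,TP_c/(2\,TP_c+FP_c+FN_c)\); a minimal sketch with hypothetical counts:

```python
import numpy as np

def macro_f1(tp, fp, fn):
    """Macro-averaged location-dependent F1 from per-class counts.

    tp, fp, fn: per-class counts summed over the dataset (hypothetical data).
    """
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Standard F1 per class, guarding against empty classes
    f = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return float(f.mean())

# Two classes: F1 of 0.8 and 0.5, macro-averaged
print(macro_f1([8, 5], [2, 5], [2, 5]))  # 0.65
```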
We evaluate DOA localization accuracy through a class-dependent azimuth error \(DOAE_c\), computed as the mean angular error of the matched true positives per class, and then macro-averaged:
- \(DOAE_c = \sum_k \Omega_k/ K_c = \sum_k \Omega_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Omega_k\) being the angular error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(DOAE = \sum_c DOAE_c/C\).
Distance localization accuracy is evaluated through a class-dependent relative distance error \(RDE_c\), computed as the mean relative distance error of the matched true positives per class, and then macro-averaged:
- \(RDE_c = \sum_k \Delta_k/ K_c = \sum_k \Delta_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Delta_k\) being the relative distance error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(RDE = \sum_c RDE_c/C\).
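Both \(DOAE\) and \(RDE\) follow the same two-stage averaging: a per-class mean over the matched true positives, then a macro-average over classes. A minimal sketch with hypothetical per-match errors:

```python
import numpy as np

def macro_error(per_class_errors):
    """Macro-average matched-TP errors; applies to both DOAE and RDE.

    per_class_errors: one array per class, holding the per-match errors
    (Omega_k for DOAE, Delta_k for RDE) pooled over all frames with at
    least one true positive (hypothetical data).
    """
    class_means = [np.mean(e) for e in per_class_errors if len(e) > 0]
    return float(np.mean(class_means))

# Two classes with hypothetical angular errors of their matched predictions:
print(macro_error([np.array([10.0, 20.0]), np.array([30.0])]))  # (15 + 30) / 2 = 22.5
```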
Finally, onscreen estimation is evaluated through its class-dependent accuracy \(OSA_c\), computed as the ratio between the number of correct onscreen estimates and the number of true positives per class, and then macro-averaged: \(OSA = \sum_c OSA_c/C\).
Note that the DOA and distance localization errors are not thresholded in order to give more varied complementary information to the location-dependent F1-score, presenting localization accuracy outside of the spatial threshold.
Ranking
Since the localization-aware F1-score takes into account all the aspects of the system (i.e., event detection, azimuth and distance localization, and, for the audiovisual track, onscreen estimation), the overall ranking of the task will be done according only to it. Additionally, rankings based on the macro-averaged azimuth error \(DOAE\), the macro-averaged relative distance error \(RDE\), and the macro-averaged onscreen accuracy \(OSA\) will also be published on the result website.
Baseline system
The baselines for both Track A (audio-only inference) and Track B (audiovisual inference) have been unified into a common codebase with a flag to switch from one to the other. Additionally, on/offscreen classification has been integrated into the baseline in the audiovisual track.
The baseline model uses log mel-spectrogram from stereo audio as input features for the audio-only track, while the audiovisual track additionally incorporates per-frame visual features from a pre-trained model. Audio features are processed using a convolutional recurrent neural network model, and audiovisual representations are fused via cross-attention layers. The features are passed to fully-connected layers to make the predictions.
The baseline model follows the multi-ACCDOA format from previous years, predicting up to three simultaneous sound events per class. For each time step and for all 13 classes, the model outputs 3 sets of x, y (i.e., the DOA vector) and distance, plus, for the audiovisual track, an additional on/offscreen binary output. A mean squared error loss is used to train the DOA and distance predictions, and a binary cross-entropy loss is used for the on/offscreen binary output.
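A rough sketch of how such a multi-ACCDOA output could be decoded into active events for the audio-only case. The activity threshold of 0.5 and the atan2 azimuth convention are assumptions for illustration, not necessarily the baseline's exact choices:

```python
import numpy as np

def decode_multi_accdoa(output, thresh=0.5):
    """Decode one multi-ACCDOA frame into (class, azimuth, distance) events.

    output: array of shape (3 tracks, 13 classes, 3) holding (x, y, distance)
    per track and class. An event is considered active when the norm of its
    (x, y) DOA vector exceeds `thresh` (an assumed value); the azimuth
    convention atan2(y, x) is likewise an assumption for illustration.
    """
    events = []
    for track in range(output.shape[0]):
        for cls in range(output.shape[1]):
            x, y, dist = output[track, cls]
            if np.hypot(x, y) > thresh:  # DOA vector norm encodes activity
                azimuth = np.degrees(np.arctan2(y, x))
                events.append((cls, float(azimuth), float(dist)))
    return events

frame = np.zeros((3, 13, 3))
frame[0, 4] = [0.0, 1.0, 2.5]       # class 4 active on track 0, 2.5 m away
print(decode_multi_accdoa(frame))   # [(4, 90.0, 2.5)]
```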
Please refer to the README file in the baseline repository for detailed information:
More details on the multi-ACCDOA format can be found in:
Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, and Yuki Mitsufuji. Multi-ACCDOA: localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.
and on the extended format with distance estimation here:
Results for the development dataset
The evaluation metric scores for the test split of the development dataset are given below. While this challenge uses the localization-aware F1-score considering the spatial and onscreen criteria for ranking in the audiovisual track, the F1-score considering only the spatial thresholds is also shown, so that the audiovisual results can be compared with the audio-only results.
Track A: Audio-only baseline
Dataset | macro F20°/1 (%) | macro F20°/1/on (%) | DOAE (°) | RDE (%) | OSA (%) |
---|---|---|---|---|---|
Stereo | 22.8 | N/A | 24.5 | 41 | N/A |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Track B: Audiovisual baseline
Dataset | macro F20°/1 (%) | macro F20°/1/on (%) | DOAE (°) | RDE (%) | OSA (%) |
---|---|---|---|---|---|
Stereo | 26.8 | 20.0 | 23.8 | 40 | 80 |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Citation
If you are participating in this task or using the dataset and code, please consider citing the following paper:
David Diaz-Guerra, Archontis Politis, Parthasaarathy Sudarsanam, Kazuki Shimada, Daniel A. Krause, Kengo Uchida, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji, and Tuomas Virtanen. Baseline models and evaluation of sound event localization and detection with distance estimation in dcase2024 challenge. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), 41–45. Tokyo, Japan, October 2024.