The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions while they are active.

The challenge has ended. Full results for this task can be found on the Results page.

Description

Given multichannel audio input, a sound event localization and detection (SELD) system outputs localization estimates of one or more events for each of the target sound classes, whenever such events are detected. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation with visually occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.
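As a rough illustration of the kind of output described above, the sketch below represents each estimate as a small record and groups the records by time frame. All names here (`EventEstimate`, `group_by_frame`, the field names, and the choice of degrees and metres as units) are hypothetical and are not taken from the challenge's submission format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventEstimate:
    """One localized event in one time frame (hypothetical layout)."""
    frame: int            # index of the time frame
    class_index: int      # index of the target sound class
    azimuth_deg: float    # estimated azimuth, in degrees
    elevation_deg: float  # estimated elevation, in degrees
    distance_m: float     # estimated source distance, in metres


def group_by_frame(estimates):
    """Collect estimates per frame; a frame may hold several events,
    including several events of the same class."""
    frames = defaultdict(list)
    for est in estimates:
        frames[est.frame].append(est)
    return dict(frames)
```

Frames without any detected event simply do not appear in the returned dictionary, which mirrors the idea that estimates are produced only when an event is detected.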

Overall, this year's challenge task resembles the previous iteration: SELD models are evaluated with audio-only input (Track A) or audiovisual input (Track B) on manually annotated recordings of real interior sound scenes. However, this year the task introduces distance estimation of the detected events, which makes it significantly more challenging. The evaluation metrics are also modified to take that extra dimension into account. Regarding the audiovisual track, we believe that there is still plenty of room for developing novel solutions that surpass audio-only models. We encourage participants to submit both audio-only and audiovisual SELD systems.
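To make the added distance dimension concrete, the following sketch computes an angular error between an estimated and a reference direction, a relative distance error, and a joint threshold check. It assumes that the relative distance error is the absolute distance difference divided by the reference distance, and that the 20°/1 notation used with the F-score denotes a 20° angular threshold and a relative distance threshold of 1. The function names and the azimuth/elevation convention are illustrative, and the official evaluation code remains the reference.

```python
import math


def angular_error_deg(az_est, el_est, az_ref, el_ref):
    """Angle in degrees between two directions given as azimuth/elevation in degrees."""
    a1, e1, a2, e2 = map(math.radians, (az_est, el_est, az_ref, el_ref))
    # Cosine of the angle between the two unit vectors (spherical law of cosines).
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def relative_distance_error(dist_est, dist_ref):
    """Assumed definition: |estimate - reference| / reference."""
    return abs(dist_est - dist_ref) / dist_ref


def within_thresholds(est, ref, max_angle_deg=20.0, max_rel_dist=1.0):
    """est and ref are (azimuth_deg, elevation_deg, distance) tuples.

    Returns True when both the angular and the relative distance error are
    within the thresholds. Matching the sound class is assumed to have been
    done beforehand.
    """
    angle = angular_error_deg(est[0], est[1], ref[0], ref[1])
    rel_dist = relative_distance_error(est[2], ref[2])
    return angle <= max_angle_deg and rel_dist <= max_rel_dist
```

Under these assumptions, an estimate counts as correctly localized only if it is close to the reference in both direction and distance, which is why adding distance makes the task harder than direction-only localization.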

Figure 1: Overview of sound event localization and detection system.

Dataset

The Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected at two different sites, in Tampere, Finland by the Audio Research Group (ARG) of Tampere University, and in Tokyo, Japan by Sony, using a similar setup and annotation procedure. As in the previous challenges, the dataset is delivered in two spatial recording formats.

The simultaneous 360° video recordings are spatially and temporally aligned with the microphone array recordings. The videos are made available with the participants' consent, after blurring visible faces.

Collection of data from the TAU side has received funding from Google.

Detailed dataset specifications can be found below. More details on the recording and annotation procedure can be found in the DCASE2022 Challenge task description and in the technical report of STARSS22:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.

Coordinators

Name	Affiliation
Archontis Politis	Tampere University
Kazuki Shimada	SONY
Yuki Mitsufuji	SONY
Tuomas Virtanen	Tampere University
Parthasaarathy Sudarsanam	Tampere University
Daniel Krause	Tampere University
Kengo Uchida	SONY
David Diaz-Guerra	Tampere University
Yuichiro Koyama	SONY
Naoya Takahashi	SONY
Takashi Shibuya	SONY
Shusuke Takahashi	SONY

Dataset name	Type	Added	Link
TAU-SRIR DB	room impulse responses	04.04.2022	https://zenodo.org/records/6408611
6DOF_SRIRs	room impulse responses	23.11.2021	https://zenodo.org/records/6382405
METU SRIRs	room impulse responses	10.04.2019	https://zenodo.org/records/2635758
MIRACLE	room impulse responses	12.10.2023	https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2
AudioSet	audio, video	30.03.2017	https://research.google.com/audioset/
FSD50K	audio	02.10.2020	https://zenodo.org/record/4060432
ESC-50	audio	13.10.2015	https://github.com/karolpiczak/ESC-50
Wearable SELD dataset	audio	17.02.2022	https://zenodo.org/record/6030111
IRMAS	audio	08.09.2014	https://zenodo.org/record/1290750
Kinetics 400	audio, video	22.05.2017	https://www.deepmind.com/open-source/kinetics
SSAST	pre-trained model	10.02.2022	https://github.com/YuanGongND/ssast
TAU-NIGENS Spatial Sound Events 2020	audio	06.04.2020	https://zenodo.org/record/4064792
TAU-NIGENS Spatial Sound Events 2021	audio	28.02.2021	https://zenodo.org/record/5476980
PANN	pre-trained model	19.10.2020	https://github.com/qiuqiangkong/audioset_tagging_cnn
wav2vec2.0	pre-trained model	20.08.2020	https://github.com/facebookresearch/fairseq
PaSST	pre-trained model	18.09.2022	https://github.com/kkoutini/PaSST
DTF-AT	pre-trained model	19.12.2023	https://github.com/ta012/DTFAT
FNAC_AVL	pre-trained model	25.03.2023	https://github.com/OpenNLPLab/FNAC_AVL
CSS10 Japanese	audio	05.08.2019	https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset
JSUT	audio	28.10.2017	https://sites.google.com/site/shinnosuketakamichi/publication/jsut
VoxCeleb1	audio, video	26.06.2017	https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html
COCO	video	01.05.2014	https://cocodataset.org/
360-Indoor	video	03.10.2019	http://aliensunmin.github.io/project/360-dataset/
TorchVision Models and Pre-trained Weights	pre-trained model	03.04.2017	https://pytorch.org/vision/stable/models.html
YOLOv7	pre-trained model	07.07.2022	https://github.com/WongKinYiu/yolov7
YOLOv8	pre-trained model	10.01.2023	https://github.com/ultralytics/ultralytics
Grounding DINO	pre-trained model	10.03.2023	https://github.com/IDEA-Research/GroundingDINO
MMDetection	pre-trained model	15.12.2021	https://github.com/open-mmlab/mmdetection
MMPose	pre-trained model	01.01.2020	https://github.com/open-mmlab/mmpose
MMFlow	pre-trained model	01.01.2021	https://github.com/open-mmlab/mmflow
Paddle Detection	pre-trained model	01.01.2019	https://github.com/PaddlePaddle/PaddleDetection
doors Image Dataset	image	18.02.2022	https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1
DoorDetect Dataset	image	27.05.2021	https://github.com/MiguelARD/DoorDetect-Dataset
CLIP	pre-trained model	05.01.2021	https://github.com/openai/CLIP
CLAP	pre-trained model	06.03.2022	https://github.com/LAION-AI/CLAP
Depth Anything	pre-trained model	22.01.2024	https://github.com/LiheYoung/Depth-Anything
PanoFormer	pre-trained model	04.03.2022	https://github.com/zhijieshen-bjtu/PanoFormer
SoundQ Youtube 360° video list	video	06.10.2023	https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv
FMA	audio	02.12.2016	https://github.com/mdeff/fma

Results on the evaluation dataset:
Rank	Submission	Corresponding author	Affiliation	F-score (20°/1)	DOA error (°)	Relative distance error
1	Du_NERCSLIP_task3a_4	Qing Wang	University of Science and Technology of China	54.4 (48.9 - 59.2)	13.6 (12.4 - 15.0)	0.21 (0.18 - 0.23)
2	Yu_HYUNDAI_task3a_3	Hogeon Yu	Hyundai Motor Company	29.8 (25.1 - 34.2)	19.8 (18.3 - 21.6)	0.28 (0.25 - 0.32)
3	Yeow_NTU_task3a_2	Jun Wei Yeow	Nanyang Technological University	26.2 (22.0 - 30.5)	25.1 (23.2 - 27.6)	0.26 (0.22 - 0.28)
4	Guan_CQUPT_task3a_4	Xin Guan	Chongqing University of Posts and Telecommunications	26.7 (22.7 - 31.1)	18.6 (17.4 - 21.8)	0.36 (0.34 - 0.39)
5	Vo_DU_task3a_1	Quoc Thinh Vo	Drexel University	24.7 (20.8 - 28.4)	19.3 (17.7 - 21.3)	0.34 (0.30 - 0.37)
6	Berg_LU_task3a_3	Axel Berg	Lund University, Arm	25.5 (21.8 - 29.6)	23.2 (18.2 - 28.8)	0.39 (0.34 - 0.44)
7	Sun_JLESS_task3a_1	Wenqiang Sun	Northwestern Polytechnical University	28.5 (24.2 - 33.0)	23.8 (21.5 - 25.9)	0.51 (0.49 - 0.53)
8	Qian_IASP_task3a_1	Yuanhang Qian	Wuhan University	22.8 (18.6 - 26.8)	27.2 (24.6 - 29.8)	0.36 (0.31 - 0.42)
9	AO_Baseline_FOA	Parthasaarathy Sudarsanam	Tampere University	18.0 (14.6 - 21.7)	29.6 (24.6 - 33.3)	0.31 (0.28 - 0.36)
10	Zhang_BUPT_task3a_1	Zhicheng Zhang	Beijing University of Posts and Telecommunications	19.0 (16.1 - 21.8)	29.6 (26.6 - 32.9)	0.40 (0.32 - 0.48)
11	Chen_ECUST_task3a_1	Ning Chen	East China University of Science and Technology	15.1 (12.2 - 17.9)	28.3 (25.5 - 30.9)	0.48 (0.39 - 0.59)
12	Li_BIT_task3a_1	Jiahao Li	Beijing Institute of Technology	16.9 (13.4 - 20.5)	33.5 (30.0 - 42.7)	0.51 (0.26 - 1.25)

Dataset	macro F_20°/1 (%)	DOAE (°)	RDE (%)
Ambisonic	13.1 %	36.9°	33 %
Microphone array	9.9 %	38.1°	30 %

Dataset	macro F_20°/1 (%)	DOAE (°)	RDE (%)
Ambisonic	11.3 %	38.4°	36 %
Microphone array	11.8 %	38.5°	29 %
