# Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

### Coordinators

Archontis Politis, Kazuki Shimada, Yuki Mitsufuji, Tuomas Virtanen, Sharath Adavanne, Parthasaarathy Sudarsanam, Daniel Krause, Naoya Takahashi, Shusuke Takahashi, Yuichiro Koyama, Kengo Uchida, Aapo Hakala

The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions while they are active.

# Description

Given multichannel audio input, a sound event localization and detection (SELD) system outputs a temporal activation track for each of the target sound classes, along with one or more corresponding spatial trajectories when the track indicates activity. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation with occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.

This year the challenge task remains similar to the previous iteration, evaluated on manually annotated recordings of real sound scenes. However, it adds the option of working with additional information during training, namely distance labels of the annotated sound events.

To stimulate further developments in SELD research, we also prepare an audiovisual track, in which participants have access to 360° video recordings during training and evaluation. Video data can help resolve difficulties and ambiguities that arise when the acoustic scene is characterized spatiotemporally from audio data alone. For example, with video data, sounds of footsteps can be easily distinguished from other tapping sounds, and speakers visible in the video can provide candidate positions of speaker-related sounds. We encourage participants to submit both audio-only SELD systems and audiovisual SELD systems.

# Dataset

The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected at two different sites, in Tampere, Finland by the Audio Research Group (ARG) of Tampere University, and in Tokyo, Japan by Sony, using a similar setup and annotation procedure. As in the previous challenges, the dataset is delivered in two spatial recording formats.

Compared to the STARSS22 dataset used in DCASE2022, this version keeps all the recordings of STARSS22 and adds about 4hrs of new material captured at Tampere University, distributed between the training and evaluation sets. It further includes simultaneous 360° video recordings for all the audio recordings, and it augments the respective labels with source distance information in addition to the direction-of-arrival.

Collection of data from the TAU side has received funding from Google.

Detailed dataset specifications can be found below. More details on the recording and annotation procedure can be found in last year's task description, and in the technical report of STARSS22:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.

#### STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

##### Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.

[Video: 360° video excerpts from STARSS23 with spatiotemporal labels overlaid on top of the video.]

## Recording formats

The array response of the two recording formats can be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from direction-of-arrival (DOA) given by azimuth angle $$\phi$$ and elevation angle $$\theta$$.

For the first-order ambisonics (FOA):

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sin(\phi) \cos(\theta) \\ H_3(\phi, \theta, f) &=& \sin(\theta) \\ H_4(\phi, \theta, f) &=& \cos(\phi) \cos(\theta) \end{eqnarray}

The FOA format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that the formulas above assume a frequency-independent encoding. With the specific microphone array this holds up to around 9kHz, while at higher frequencies the actual encoded responses gradually deviate from the ideal ones given above.
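The ideal FOA responses above can be evaluated directly. The following is a minimal sketch, assuming angles are given in degrees as in the label files; the function names `foa_steering_vector` and `encode_mono_to_foa` are illustrative and not part of any challenge code.

```python
import math


def foa_steering_vector(azimuth_deg, elevation_deg):
    """Ideal, frequency-independent FOA gains [H1, H2, H3, H4] for one DOA."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return [
        1.0,                              # H1: omnidirectional
        math.sin(phi) * math.cos(theta),  # H2
        math.sin(theta),                  # H3
        math.cos(phi) * math.cos(theta),  # H4
    ]


def encode_mono_to_foa(samples, azimuth_deg, elevation_deg):
    """Scale a mono signal by the four gains to emulate an anechoic FOA source."""
    gains = foa_steering_vector(azimuth_deg, elevation_deg)
    return [[gain * sample for sample in samples] for gain in gains]


# A source on the left (azimuth 90 degrees, elevation 0) excites H1 and H2 only.
left = foa_steering_vector(90, 0)
```

Because the gains are real and frequency-independent, this kind of encoding only approximates the delivered FOA recordings in the frequency range where the ideal model holds.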

For the tetrahedral microphone array (MIC):

The four microphones have the following positions, in spherical coordinates $$(\phi, \theta, r)$$:

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

$$H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m))$$

where $$m$$ is the channel number, $$(\phi_m, \theta_m)$$ are the specific microphone's azimuth and elevation position, $$\omega = 2\pi f$$ is the angular frequency, $$R = 0.042$$m is the array radius, $$c = 343$$m/s is the speed of sound, $$\cos(\gamma_m)$$ is the cosine of the angle between the microphone position and the DOA, $$P_n$$ is the unnormalized Legendre polynomial of degree $$n$$, and $$h_n'^{(2)}$$ is the derivative with respect to the argument of a spherical Hankel function of the second kind. The sum is truncated at order $$n = 30$$, which provides negligible modeling error up to 20kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
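As a rough illustration of how the expansion can be evaluated numerically, the sketch below uses only the Python standard library. It is not the organizers' example routine: the spherical Hankel functions are built with the standard upward recurrence, the function names are illustrative, and the microphone direction is passed in as a parameter rather than taken from the dataset.

```python
import cmath
import math

RADIUS_M = 0.042        # array radius R, from the text
SPEED_OF_SOUND = 343.0  # speed of sound c in m/s, from the text
MAX_ORDER = 30          # truncation order of the expansion, from the text
I_POWERS = (1, 1j, -1, -1j)  # i**k for k modulo 4


def hankel2_derivatives(x, max_order):
    """Derivatives of the spherical Hankel functions of the second kind, n = 0..max_order."""
    h = [1j * cmath.exp(-1j * x) / x,
         -cmath.exp(-1j * x) * (x - 1j) / x ** 2]
    for n in range(1, max_order):
        h.append((2 * n + 1) / x * h[n] - h[n - 1])
    derivs = [-h[1]]  # h_0' = -h_1
    for n in range(1, max_order + 1):
        derivs.append(h[n - 1] - (n + 1) / x * h[n])
    return derivs


def legendre_values(x, max_order):
    """Unnormalized Legendre polynomials P_0(x)..P_max_order(x) by recurrence."""
    p = [1.0, x]
    for n in range(1, max_order):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p


def mic_response(mic_azi_deg, mic_ele_deg, azi_deg, ele_deg, freq_hz):
    """Complex response of one baffled microphone to a DOA; freq_hz must be > 0."""
    phi_m, theta_m, phi, theta = (
        math.radians(v) for v in (mic_azi_deg, mic_ele_deg, azi_deg, ele_deg))
    cos_gamma = (math.sin(theta_m) * math.sin(theta)
                 + math.cos(theta_m) * math.cos(theta) * math.cos(phi - phi_m))
    kr = 2 * math.pi * freq_hz * RADIUS_M / SPEED_OF_SOUND
    derivs = hankel2_derivatives(kr, MAX_ORDER)
    legendre = legendre_values(cos_gamma, MAX_ORDER)
    total = sum(I_POWERS[(n - 1) % 4] / derivs[n] * (2 * n + 1) * legendre[n]
                for n in range(MAX_ORDER + 1))
    return total / kr ** 2
```

Two sanity checks follow from the physics of a rigid sphere: at very low frequencies the magnitude approaches 1 for any DOA, and at a few kHz a source facing the microphone produces a larger magnitude than a source on the opposite side.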

## Sound event classes

13 target sound event classes were annotated. The classes follow loosely the Audioset ontology.

0. Female speech, woman speaking
1. Male speech, man speaking
2. Clapping
3. Telephone
4. Laughter
5. Domestic sounds
6. Walk, footsteps
7. Door, open or close
8. Music
9. Musical instrument
10. Water tap, faucet
11. Bell
12. Knock

The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. These are detailed here to aid the participants:

• Telephone
  • Mostly traditional Telephone Bell Ringing and Ringtone sounds, without musical ringtones.
• Domestic sounds
  • Sounds of Vacuum cleaner
  • Sounds of water boiler, closer to Boiling
  • Sounds of air circulator, closer to Mechanical fan
• Door, open or close
  • Combination of Door and Cupboard open or close
• Music
  • Background music and Pop music played by a loudspeaker in the room.
• Musical Instrument
  • Acoustic guitar
  • Marimba, xylophone
  • Cowbell
  • Piano
  • Rattle (instrument)
• Bell
  • Combination of sounds from hotel bell and glass bell, closer to Bicycle bell and single Chime.

• The speech classes contain speech in a few different languages.
• There are occasionally localized sound events that are not annotated and are considered as interferers, with examples such as computer keyboard, shuffling cards, dishes, pots, and pans.
• There is natural background noise (e.g. HVAC noise) in all recordings, at very low levels in some and at quite high levels in others. Such mostly diffuse background noise should be distinct from other noisy target sources (e.g. vacuum cleaner, mechanical fan) since these are clearly spatially localized.

## Video data

The simultaneous 360° videos are spatially and temporally aligned with the microphone array recordings. The videos are made available with the participants' consent, after blurring visible faces.

## Dataset specifications

The specifications of the dataset can be summarized in the following:

General:

• Recordings are taken in two different sites.
• Each recording clip is part of a recording session happening in a unique room.
• Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
• To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.
• 13 target classes are identified in the recordings and strongly annotated by humans.
• Spatial annotations for those active events are captured by an optical tracking system.
• Sound events out of the target classes are considered as interference.
• Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.

Volume, duration, and data split:

• A total of 16 unique rooms were captured in the recordings, 4 in Tokyo and 12 in Tampere (development set).
• 70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, captured in Tokyo (development dataset).
• 98 recording clips of 40 sec ~ 9 min durations, with a total time of ~5.5hrs, captured in Tampere (development dataset).
• A training-testing split is provided for reporting results using the development dataset.
  • 40 recordings contributed by Sony for the training split, captured in 2 rooms (dev-train-sony).
  • 30 recordings contributed by Sony for the testing split, captured in 2 rooms (dev-test-sony).
  • 50 recordings contributed by TAU for the training split, captured in 7 rooms (dev-train-tau).
  • 48 recordings contributed by TAU for the testing split, captured in 5 rooms (dev-test-tau).
• About 3.5hrs of additional recordings from both sites, captured in different rooms from the development set, will be released later as the evaluation set.

Audio:

• Sampling rate: 24kHz.
• Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).

Video:

• Video 360° format: equirectangular
• Video resolution: 1920x960.
• Video frames per second (fps): 29.97.

## Reference labels, directions-of-arrival, and source distances

For each recording in the development dataset, the labels, DoAs, and distances are provided in a plain text CSV file of the same filename as the recording, in the following format:

[frame number (int)], [active class index (int)], [source number index (int)], [azimuth (int)], [elevation (int)], [distance (int)]


Frame, class, and source enumeration begins at 0. Frames correspond to a temporal resolution of 100msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth $$\phi \in [-180^{\circ}, 180^{\circ}]$$, and elevation $$\theta \in [-90^{\circ}, 90^{\circ}]$$. Note that the azimuth angle is increasing counter-clockwise ($$\phi = 90^{\circ}$$ at the left). Distances are provided in centimeters, rounded to the closest integer value.
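For models that regress positions rather than angles, the labels can be converted to Cartesian coordinates. The sketch below assumes a right-handed frame with x pointing to the front, y to the left, and z up, which is consistent with the angle conventions stated above but is not prescribed by the dataset; the function name is illustrative.

```python
import math


def label_to_cartesian(azimuth_deg, elevation_deg, distance_cm):
    """Convert one label (degrees, degrees, centimeters) to (x, y, z) in meters.

    Assumed axes: x front, y left, z up.
    """
    azimuth = math.radians(azimuth_deg)
    elevation = math.radians(elevation_deg)
    radius_m = distance_cm / 100.0
    x = radius_m * math.cos(elevation) * math.cos(azimuth)
    y = radius_m * math.cos(elevation) * math.sin(azimuth)
    z = radius_m * math.sin(elevation)
    return x, y, z


# A source 1 m away at azimuth 90 degrees lies on the positive y axis (left).
left_source = label_to_cartesian(90, 0, 100)
```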

The source index is a unique integer for each source in the scene, and it is provided only as additional information. Note that each unique actor gets assigned one such identifier, but not individual events produced by the same actor; e.g. a clapping event and a laughter event produced by the same person have the same identifier. Independent sources that are not actors (e.g. a loudspeaker playing music in the room) get a 0 identifier.

Note that the source index and the source distance are only included in the development metadata as additional information that can be exploited during training. They are not required to be estimated or provided by the participants in their results.

Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be as:

10,     1,  1,  -50,  30, 181
11,     1,  1,  -50,  30, 181
11,     1,  2,   10, -20, 243
12,     1,  2,   10, -20, 243
13,     1,  2,   10, -20, 243
13,     8,  0,  -40,   0,  80


which describes that in frames 10-11, an event of class male speech (class 1) belonging to one actor (source 1) is active at location (-50°, 30°, 181cm). At frame 11, a second instance of the same class appears simultaneously at a different location (10°, -20°, 243cm), belonging to another actor (source 2), while at frame 13 an additional event of class music (class 8) appears, belonging to a non-actor source (source 0). Frames that contain no sound events are not included in the sequence.
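A label file in this format can be read with a few lines of standard-library Python. The sketch below groups rows by frame so that overlapping events end up in the same list; the function name and the dictionary keys are illustrative choices, not part of the dataset tools.

```python
import csv
from collections import defaultdict

EXAMPLE = """\
10,     1,  1,  -50,  30, 181
11,     1,  1,  -50,  30, 181
11,     1,  2,   10, -20, 243
12,     1,  2,   10, -20, 243
13,     1,  2,   10, -20, 243
13,     8,  0,  -40,   0,  80
"""


def parse_labels(lines):
    """Group label rows by frame index; `lines` is any iterable of CSV text lines."""
    frames = defaultdict(list)
    for row in csv.reader(lines):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        frame, cls, source, azimuth, elevation, distance = (int(v) for v in row)
        frames[frame].append({
            "class": cls,
            "source": source,
            "azimuth_deg": azimuth,
            "elevation_deg": elevation,
            "distance_cm": distance,
        })
    return dict(frames)


frames = parse_labels(EXAMPLE.splitlines())
```

Frames without events are simply absent from the returned dictionary, mirroring the file format.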

The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with the baseline results and consistent reporting of results at the technical reports of the submitted systems, before the evaluation stage.

• Note that even though there are two origins of the data, Sony and TAU, the challenge task considers the dataset as a single entity. Hence models should not be trained separately for each of the two origins, and tested individually on recordings of each of them. Instead, the recordings of the individual training splits (dev-train-sony, dev-train-tau) and testing splits (dev-test-sony, dev-test-tau) should be combined (dev-train, dev-test) and the models should be trained and evaluated in the respective combined splits.

• The participants can choose to use as input to their models one of the two formats, FOA or MIC, or both simultaneously.

The evaluation dataset will be released a few weeks before the final submission deadline. That dataset consists of only audio and video recordings without any metadata/labels. At the evaluation stage, participants can decide the training procedure, i.e. the amount of training and validation files to be used in the development dataset, the number of ensemble models etc., and submit the results of the SELD performance on the evaluation dataset.

Contrary to the previous year, in this challenge there are two tracks that the participants can follow: the audio-only track and the audiovisual track.

• The participants can choose to submit systems on either of the two tracks, or on both. We encourage participants to submit both audio-only models and audiovisual models.

• During evaluation, ranking results will be presented in separate tables for each track.

Submissions for both tracks will be in the same system output format, and both tracks will be evaluated with the same metrics.

## Track A: Audio-only inference

The audio-only track continues the SELD task setup of the previous year, where inference of the SELD labels is performed with multichannel audio input only. Note that we do not exclude the use of video data or video information during training of such models. In this sense, the STARSS23 video recordings of the development set could be treated as external data and exploited in various ways to improve the performance of the model. However, during inference only the audio recordings of the evaluation set should be used.

## Track B: Audiovisual inference

In the audiovisual track participants have access to 360° video recordings during training and evaluation. The models in this track are expected to be using both audio and video data during inference to produce the SELD labels.

## Development dataset

The recordings in the development dataset follow the naming convention:

fold[fold number]_room[room number]_mix[recording number per room].wav


The fold number at the moment is used only to distinguish between the training and testing split. The room information is provided for the user of the dataset to potentially help understand the performance of their method with respect to different conditions.

For the audiovisual track, video files are provided with the same folder structure and naming convention as the audio files.
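The naming convention can be parsed with a regular expression, for example to group results per room. This is a small sketch; the file name used in the example is hypothetical, and the amount of zero padding in real file names is not assumed.

```python
import re

_NAME_PATTERN = re.compile(r"fold(\d+)_room(\d+)_mix(\d+)\.wav")


def parse_recording_name(filename):
    """Return fold, room, and mix numbers, or None if the name does not match."""
    match = _NAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    fold, room, mix = (int(group) for group in match.groups())
    return {"fold": fold, "room": room, "mix": mix}


# Hypothetical file name that follows the convention above.
info = parse_recording_name("fold3_room4_mix001.wav")
```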

## Evaluation dataset

The evaluation dataset will consist of recordings without any information on the origin (Sony or TAU) or on the location in the naming convention, as below:

mix[recording number].wav


# External data resources

Since the development set contains recordings of real scenes, the presence of each class and the density of sound events varies greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow use of external datasets as long as they are publicly available. External data examples are sound sample banks, annotated sound event datasets, pre-trained models, room and array impulse response libraries.

A typical use case could be in the form of sound event datasets containing the target classes, which can be used to generate additional spatial mixtures. Some possible examples of spatialization are:

• using the theoretical steering vectors of either of the two formats presented earlier to emulate anechoic mixtures, with the possibility of background noise recordings decorrelated and added as diffuse across channels,
• using the theoretical steering vectors of either of the two formats presented earlier and a room simulator to spatialize isolated event samples in reverberant conditions,
• using isolated event samples convolved with measured multichannel room impulse responses of either of the two formats, to emulate spatial sound scenes with real reverberation profiles.

The following rules apply on the use of external data:

• The external datasets or pre-trained models used should be freely and publicly accessible before 15 April 2023.
• Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
• The participants will have to indicate clearly which external data they have used in their system info and technical report.
• Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.

| Dataset name | Type | Date | Link |
| --- | --- | --- | --- |
| FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
| ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
| Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
| IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
| Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
| SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
| TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
| TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
| PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
| VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |

For an example use of external data when training the audio-only baseline model, refer to the respective section of the task description of DCASE2022.

• Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
• Manipulation of the provided training-test split in the development dataset is not allowed.
• Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
• The development dataset can be augmented e.g. using techniques such as pitch shifting or time stretching, rotations, respatialization or re-reverberation of parts, etc.

# Submission

The results for each of the recordings in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the file name of the respective audio recording, but with the .csv extension, and should contain the same information at each row as the reference labels, excluding the source index and the source distance:

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]


Enumeration of frame and class indices begins at zero. The class indices are as ordered in the class descriptions mentioned above. The evaluation will be performed at a temporal resolution of 100msec. In case the participants use a different frame or hop length for their study, we expect them to use a suitable method to resample the information at the specified resolution before submitting the evaluation results.
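A result file in this format can be produced as in the sketch below. The helper names are illustrative, and the mapping from model frames to 100msec label frames is only one simple option (each model frame is assigned to the label frame in which it starts); other resampling strategies may suit a given system better.

```python
import csv
import io


def to_label_frame(model_frame, hop_ms):
    """Map a model frame index with an integer hop in ms to a 100 ms label frame."""
    return (model_frame * hop_ms) // 100


def write_submission(frame_events, stream):
    """Write rows of frame, class, azimuth, elevation, sorted by frame.

    frame_events maps a 100 ms frame index to a list of
    (class_index, azimuth_deg, elevation_deg) tuples.
    """
    writer = csv.writer(stream, lineterminator="\n")
    for frame in sorted(frame_events):
        for class_index, azimuth, elevation in frame_events[frame]:
            writer.writerow(
                [int(frame), int(class_index), int(round(azimuth)), int(round(elevation))])


buffer = io.StringIO()
write_submission({11: [(1, -50.2, 30.0), (1, 10, -20)], 10: [(1, -50, 30)]}, buffer)
```

Writing to a stream keeps the helper testable; passing an opened `.csv` file works the same way.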

In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team for each of the two tracks (audio-only, audiovisual). For each system, meta-information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information regarding the challenge submission can be found in the submission page.

General information for all DCASE submissions can be found on the Submission page.

# Evaluation

The evaluation is based on metrics that jointly evaluate localization and detection performance. They are similar to the ones used in the previous challenge.

## Metrics

The metrics are based on true positives ($$TP$$) and false positives ($$FP$$) determined not only by correct or wrong detections, but also based on if they are closer or further than a distance threshold $$T^\circ$$ (angular in our case) from the reference. For the evaluation of this challenge we take this threshold to be $$T = 20^\circ$$.

More specifically, for each class $$c\in[1,...,C]$$ and each frame or segment:

• $$P_c$$ predicted events of class $$c$$ are associated with $$R_c$$ reference events of class $$c$$
• false negatives are counted for misses: $$FN_c = \max(0, R_c-P_c)$$
• false positives are counted for extraneous predictions: $$FP_{c,\infty}=\max(0,P_c-R_c)$$
• $$K_c$$ predictions are spatially associated with references based on Hungarian algorithm: $$K_c=\min(P_c,R_c)$$. Those can also be considered as the unthresholded true positives $$TP_c = K_c$$.
• the spatial threshold is applied, which moves $$L_c\leq K_c$$ predictions further than the threshold to false positives: $$FP_{c,\geq20^\circ} = L_c$$, and $$FP_c = FP_{c,\infty}+FP_{c,\geq20^\circ}$$
• the remaining matched estimates per class are counted as true positives: $$TP_{c,\leq20^\circ} = K_c-FP_{c,\geq20^\circ}$$
• finally: predictions $$P_c = TP_{c,\leq20^\circ}+ FP_c$$, but references $$R_c = TP_{c,\leq20^\circ}+FP_{c,\geq20^\circ}+FN_c$$
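The counting steps above can be sketched for a single class and a single frame as follows. To stay within the standard library, the sketch replaces the Hungarian algorithm with a brute-force search over all assignments, which gives the same minimum-cost matching for the small numbers of overlapping events found here; it is not the official evaluation code, and the function names are illustrative.

```python
import itertools
import math


def angular_distance_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs in degrees."""
    azi1, ele1, azi2, ele2 = (math.radians(v) for v in (*a, *b))
    cos_d = (math.sin(ele1) * math.sin(ele2)
             + math.cos(ele1) * math.cos(ele2) * math.cos(azi1 - azi2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))


def count_class_frame(preds, refs, threshold_deg=20.0):
    """Count TP, FP, FN for one class in one frame from lists of (azimuth, elevation)."""
    n_pred, n_ref = len(preds), len(refs)
    k = min(n_pred, n_ref)            # K_c: number of matched pairs
    fn = max(0, n_ref - n_pred)       # misses
    fp_inf = max(0, n_pred - n_ref)   # extraneous predictions
    best_errors = []
    if k > 0:
        small, large = (preds, refs) if n_pred <= n_ref else (refs, preds)
        best_cost = math.inf
        # Brute-force minimum-cost assignment (stand-in for the Hungarian algorithm).
        for perm in itertools.permutations(range(len(large)), k):
            errors = [angular_distance_deg(small[i], large[j]) for i, j in enumerate(perm)]
            if sum(errors) < best_cost:
                best_cost, best_errors = sum(errors), errors
    fp_spatial = sum(1 for error in best_errors if error > threshold_deg)
    return {
        "K": k,
        "TP": k - fp_spatial,
        "FP": fp_inf + fp_spatial,
        "FP_spatial": fp_spatial,
        "FN": fn,
        "errors": best_errors,
    }


# Two predictions, one reference: one match within threshold, one extra prediction.
counts = count_class_frame([(0, 0), (100, 0)], [(5, 0)])
```

The returned counts satisfy the identities in the last bullet: the number of predictions equals `TP + FP`, and the number of references equals `TP + FP_spatial + FN`.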

Based on those, we form the location-dependent F1-score ($$F_{\leq 20^\circ}$$) and Error Rate ($$ER_{\leq 20^\circ}$$). We perform macro-averaging of the location-dependent F1-score: $$F_{\leq 20^\circ}= \sum_c F_{c,\leq 20^\circ}/C$$.
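A minimal sketch of the macro-averaged F-score follows. It assumes the usual definition of the F1-score as the harmonic mean of precision $$TP_c/P_c$$ and recall $$TP_c/R_c$$, with $$P_c$$ and $$R_c$$ accumulated as defined above, and it assigns a score of 0 to a class with no predictions and no references; both are assumptions of this sketch rather than statements about the official evaluation code.

```python
def class_f_score(tp, n_pred, n_ref):
    """F1 as the harmonic mean of precision tp/n_pred and recall tp/n_ref."""
    denominator = n_pred + n_ref
    return 2.0 * tp / denominator if denominator else 0.0


def macro_f_score(per_class):
    """Average class-wise F-scores; per_class maps class -> (tp, n_pred, n_ref)."""
    scores = [class_f_score(*counts) for counts in per_class.values()]
    return sum(scores) / len(scores)


# Hypothetical accumulated counts for two classes.
example = macro_f_score({0: (8, 10, 10), 1: (0, 0, 4)})
```

Macro-averaging weights every class equally, so a rare class that is never detected lowers the score as much as a frequent one.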

Additionally, we evaluate localization accuracy through a class-dependent localization error $$LE_c$$, computed as the mean angular error of the matched true positives per class, and then macro-averaged:

• $$LE_c = \sum_k \theta_k/ K_c = \sum_k \theta_k /TP_c$$ for each frame or segment with $$TP_c>0$$, and with $$\theta_k$$ being the angular error between the $$k$$th matched prediction and reference,
• and after averaging across all frames that have any true positives, $$LE_{CD} = \sum_c LE_c/C$$.

Complementary to the localization error, we compute a localization recall metric per class, also macro-averaged:

• $$LR_c = K_c/R_c = TP_c/(TP_c + FN_c)$$, and
• $$LR_{CD} = \sum_c LR_c/C$$.
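The two localization metrics can be sketched as below. Each class is described by a list of per-frame results, where a frame holds the angular errors of its matched pairs and its number of false negatives. Classes without any matches or references are excluded from the respective macro-average here, which is a choice made for this sketch only.

```python
def class_le_lr(frames):
    """Return (LE_c, LR_c) for one class.

    frames is a list of (errors, fn) tuples: the angular errors in degrees of the
    matched pairs in a frame, and the false negatives in that frame.
    """
    frame_means = [sum(errors) / len(errors) for errors, _ in frames if errors]
    matched = sum(len(errors) for errors, _ in frames)
    missed = sum(fn for _, fn in frames)
    le = sum(frame_means) / len(frame_means) if frame_means else None
    lr = matched / (matched + missed) if (matched + missed) else None
    return le, lr


def macro_average(values):
    """Average the values that are defined (not None); None if there are none."""
    valid = [value for value in values if value is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical frames for one class: two frames with matches, one with misses only.
le_c, lr_c = class_le_lr([([10.0, 30.0], 0), ([], 2), ([40.0], 1)])
```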

Note that the localization error and recall are not thresholded in order to give more varied complementary information to the location-dependent F1-score, presenting localization accuracy outside of the spatial threshold.

All metrics are computed in one-second non-overlapping frames. For a more thorough analysis on the joint SELD metrics please refer to:

Publication

Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, and Tuomas Virtanen. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:684–698, 2020. URL: https://ieeexplore.ieee.org/abstract/document/9306885.

#### Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

##### Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

## Ranking

Overall ranking will be based on the cumulative rank of the metrics mentioned above, sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as $$ER:1, F1:1, LE:3, LR: 1$$, then its cumulative rank is $$1+1+3+1=6$$. Then if system B has $$ER:3, F1:2, LE:2, LR:3$$ (10), and system C has $$ER:2, F1:3, LE:1, LR:2$$ (8), then the overall rank of the systems is A,C,B. If two systems end up with the same cumulative rank, then they are assumed to have equal place in the challenge, even though they will be listed alphabetically in the ranking tables.
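The cumulative rank can be computed as in the following sketch. The metric values are invented so that they reproduce the per-metric ranks of systems A, B, and C in the example above; ties within a single metric are broken by sort order here, which the task description does not specify.

```python
# True where a lower value is better.
LOWER_IS_BETTER = {"ER": True, "F1": False, "LE": True, "LR": False}


def cumulative_ranks(results):
    """Sum the per-metric ranks of each system; results maps name -> metric dict."""
    totals = {name: 0 for name in results}
    for metric, lower_is_better in LOWER_IS_BETTER.items():
        ordered = sorted(results, key=lambda name: results[name][metric],
                         reverse=not lower_is_better)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


# Hypothetical metric values chosen to match the ranks in the example.
systems = {
    "A": {"ER": 0.50, "F1": 0.50, "LE": 25.0, "LR": 0.70},
    "B": {"ER": 0.70, "F1": 0.40, "LE": 20.0, "LR": 0.50},
    "C": {"ER": 0.60, "F1": 0.30, "LE": 15.0, "LR": 0.60},
}
totals = cumulative_ranks(systems)
overall = sorted(totals, key=totals.get)
```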

# Baseline system

## Track A: Audio-only baseline

The baseline for the audio-only inference track remains similar to the DCASE2022 baseline. The only difference is the addition of two multi-head self-attention layers to improve further the modeling power of the baseline with a reasonable increase in complexity (about 20%). For more details on the exact architecture please refer to the baseline repository, found below. The input features remain the same: namely, mel-band spectra for the FOA and MIC formats, with mel-band aggregated acoustic intensity vectors for FOA and generalized cross-correlation (GCC) sequences for MIC as spatial features. Additionally, for the MIC format there is the option of the SALSA-lite spatial features, without mel-band aggregation in this case. The multi-ACCDOA output representation is also the same as last year's baseline.

The baseline, along with more details on it, can be found in:

## Track B: Audiovisual baseline

While the audio-only baseline system takes only the audio input, the audiovisual baseline system takes both the audio and a visual input. The visual input is the image corresponding to the start frame of the audio feature sequence. From this image, an object detection module outputs bounding boxes of potential objects. These bounding boxes are transformed to a concatenation of two Gaussian-like vectors, which represent likelihoods of objects being present along the image's horizontal and vertical axes. The Gaussian-like vectors are encoded to a visual embedding by fully-connected layers. Then the audio embeddings from the convolutional blocks and the visual embedding are concatenated. The concatenated feature sequence is fed into the recurrent layer and fully-connected layers to output a multi-ACCDOA sequence. For more details about the features and the network architecture, please check the baseline repository below.

The baseline, along with more details on it, can be found in:

## Results for the development dataset

The evaluation metric scores for the test split of the development dataset are given below. The location-dependent detection metrics are computed within a 20° threshold from the reference.

### Track A: Audio-only baseline

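| Dataset | $$ER_{20^\circ}$$ | $$F_{20^\circ}$$ (micro) | $$F_{20^\circ}$$ (macro) | $$LE_{CD}$$ | $$LR_{CD}$$ |
| --- | --- | --- | --- | --- | --- |
| Ambisonic | 0.57 | 48.7 % | 29.9 % | 22° | 47.7 % |
| Microphone array (GCC) | 0.62 | 44.7 % | 27.8 % | 27° | 44.3 % |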

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

### Track B: Audiovisual baseline

The current audiovisual baseline system shows limited performance compared to the above audio-only baseline system, even though its architecture closely resembles the audio-only baseline when the video feature extraction branch is omitted. This is mainly because the audiovisual baseline is trained only on the STARSS23 development set, while the audio-only baseline uses a combination of the STARSS23 development set recordings and synthetic audio recordings. Additional reasons may be small implementation differences such as input features, architecture changes, and hyperparameters. To demonstrate the benefit of the additional video input, we present results with the same audiovisual baseline on the same training data, i.e., the STARSS23 development set, using either both audio and video input or audio-only input.

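| Dataset | $$ER_{20^\circ}$$ | $$F_{20^\circ}$$ (micro) | $$F_{20^\circ}$$ (macro) | $$LE_{CD}$$ | $$LR_{CD}$$ |
| --- | --- | --- | --- | --- | --- |
| Ambisonic (Audio + Video) | 1.07 | 23.2 % | 14.3 % | 48° | 35.5 % |
| Ambisonic (Audio-only) | 1.00 | 23.6 % | 14.4 % | 60° | 32.7 % |
| Microphone array (Audio + Video) | 1.08 | 20.6 % | 9.8 % | 62° | 29.2 % |
| Microphone array (Audio-only) | 1.03 | 25.0 % | 11.4 % | 77° | 30.4 % |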

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

# Citation

A technical report with more details on the collection, annotation, and specifications of the dataset, along with analysis of the baseline and its properties will be provided soon.

If you are participating in this task or using the dataset and code please consider citing the following papers:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
