# Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

### Coordinators

Archontis Politis, Yuki Mitsufuji, Kazuki Shimada, Tuomas Virtanen, Sharath Adavanne, Parthasaarathy Sudarsanam, Daniel Krause, Naoya Takahashi, Shusuke Takahashi, Yuichiro Koyama

The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions while they are active.

The challenge has ended. Full results for this task can be found in the results page.

# Description

Given multichannel audio input, a sound event detection and localization (SELD) system outputs a temporal activation track for each of the target sound classes, along with one or more corresponding spatial trajectories when the track indicates activity. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation without visual input or with occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.

This year the challenge task changes considerably compared to the previous iterations, as it transitions from computationally generated spatial recordings to manually annotated recordings of real sound scenes. This change brings a number of significant differences in the task setup, detailed below.

This is the fourth iteration of the task in the DCASE Challenge. The first three challenges were based on emulated multichannel recordings, generated from event sample banks spatialized with spatial room impulse responses (SRIRs) captured in various rooms and mixed with spatial ambient noise recorded at the same locations. At every successive iteration, the acoustical conditions were made more complex, in order to bring the task closer to challenging real-world conditions. A table showing basic differences between the previous three challenges follows:

After the continuous development of methods tackling the SELD task in those challenges, a natural step forward is testing the next iteration of systems on real spatial sound scene recordings. A dataset of such recordings was collected and annotated for the challenge. This transition brings a number of differences and changes compared to the previous years; some of them are summarized below:

# Audio dataset

The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland by the Audio Research Group (ARG) of Tampere University, and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. As in the previous challenges, the dataset is delivered in two spatial recording formats.

The recordings were organized in recording sessions, with each session taking place in a unique room. With a few exceptions, the groups of participants, sound-making props, and scene scenarios were also unique for each session. Multiple self-contained recordings (clips) of 30 sec to 6 min were captured in each such session. To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes were loosely scripted.

Collection of data from the TAU side has received funding from Google.

A technical report on the dataset, including details on the challenge setup and the baseline, can be found in:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. Starss22: a dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. 2022. URL: https://arxiv.org/abs/2206.01948, arXiv:2206.01948.

## Recording and annotation procedure

The sound scene recordings were captured with a high-channel-count spherical microphone array (Eigenmike em32 by mh Acoustics), simultaneously with a 360° video recording spatially aligned with the spherical array recording (Ricoh Theta V). Additionally, the main sound sources of interest were equipped with tracking markers, tracked throughout the recording with an Optitrack Flex 13 system arranged around each scene. All scenes were based on human actors performing actions, interacting with each other and with the objects in the scene, and were by design dynamic. Since the actors produced most of the sounds in the scene (but not all), they were additionally equipped with DPA Wireless Go II microphones, providing close-miked recordings of the main events. Each recording corresponded to a scene being acted, usually lasting between 1 and 5 minutes. Recording would start on all microphones and tracking devices before the beginning of the scene, and would stop right after it. A clapper sound would initiate the acting and serve as a reference signal for synchronization between the em32 recording, the Ricoh Theta V video, the DPA wireless microphone recordings, and the Optitrack tracker data. Synchronized clips of all of them would be cropped and stored at the end of each recording session.

By combining information from the wireless microphones, the optical tracking data, and the 360° videos, spatiotemporal annotations were extracted semi-automatically and validated manually. More specifically, the actors wore headbands with markers and were tracked throughout each recording session, and the spatial positions of other human-related sources, such as the mouth, hands, or footsteps, were geometrically extrapolated from those head coordinates. Additional trackers were mounted on other sources of interest (e.g. vacuum cleaner, guitar, water tap, cupboard, door handle). Each actor had a wireless microphone mounted on their lapel, providing a clear recording of all sound events produced by that actor and/or any independent sources closer to that actor than to the rest. The temporal annotation was based primarily on those close-miked recordings: the annotators would mark the sound event activity and label its class by listening to those close-miked signals. Events that were not audible in the overall scene recording of the em32 were not annotated, even if they were audible in the lapel recordings. In ambiguous cases, the annotators could rely on the 360° video to associate an event with a certain actor or source. The final sound event temporal annotations were associated with the tracking data through the class of each sound event and the actor that produced it. All tracked Cartesian coordinates delivered by the tracker were converted to directions-of-arrival (DOAs) with respect to the coordinates of the Eigenmike. Finally, the class, temporal, and spatial annotations were combined and converted to the challenge format. Validation of the annotations was done by observing the activities of each class visualized as markers positioned at their respective DOAs, overlaid on the 360° video from the Ricoh Theta V.

## Recording formats

The array response of the two recording formats can be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from direction-of-arrival (DOA) given by azimuth angle $$\phi$$ and elevation angle $$\theta$$.

For the first-order ambisonics (FOA):

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sin(\phi) \cos(\theta) \\ H_3(\phi, \theta, f) &=& \sin(\theta) \\ H_4(\phi, \theta, f) &=& \cos(\phi) \cos(\theta) \end{eqnarray}

The FOA format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that in the formulas above the encoding is assumed frequency-independent, which holds true up to around 9 kHz with this specific microphone array; at higher frequencies the actual encoded responses deviate gradually from the ideal ones provided above.
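As an illustration, the ideal FOA response above can be evaluated directly. The sketch below (the helper name `foa_steering` is our own, not part of any challenge code) returns the gains of the four channels for a plane wave from a given DOA:

```python
import numpy as np

def foa_steering(azi_deg, ele_deg):
    """Frequency-independent FOA response [H1..H4] for a plane wave
    arriving from the given azimuth/elevation (degrees), per the
    formulas above (W, Y, Z, X channel ordering)."""
    phi, theta = np.radians(azi_deg), np.radians(ele_deg)
    return np.array([
        1.0,                          # H1: omnidirectional (W)
        np.sin(phi) * np.cos(theta),  # H2: Y
        np.sin(theta),                # H3: Z
        np.cos(phi) * np.cos(theta),  # H4: X
    ])

# A source at the left (azimuth 90°, elevation 0°) excites the Y channel fully.
print(foa_steering(90, 0))
```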

For the tetrahedral microphone array (MIC):

The four microphones have the following positions, in spherical coordinates $$(\phi, \theta, r)$$:

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

$$H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m))$$

where $$m$$ is the channel number, $$(\phi_m, \theta_m)$$ are the azimuth and elevation of the specific microphone, $$\omega = 2\pi f$$ is the angular frequency, $$R = 0.042$$m is the array radius, $$c = 343$$m/s is the speed of sound, $$\cos(\gamma_m)$$ is the cosine of the angle between the microphone and the DOA, $$P_n$$ is the unnormalized Legendre polynomial of degree $$n$$, and $$h_n'^{(2)}$$ is the derivative, with respect to the argument, of the spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20 kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
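The expansion can be evaluated numerically with standard special-function routines. A minimal sketch using SciPy follows (the function name and structure are our own illustration, not the example routines referenced above); the spherical Hankel function of the second kind is assembled as $$h_n^{(2)} = j_n - i\,y_n$$:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

R, C = 0.042, 343.0  # array radius (m) and speed of sound (m/s)

def rigid_sphere_response(mic_azi, mic_ele, azi, ele, f, n_max=30):
    """Directional response of one sensor on a rigid spherical baffle,
    per the expansion above. Angles in radians, frequency f in Hz."""
    kR = 2 * np.pi * f * R / C
    # cosine of the angle between the microphone position and the DOA
    cosg = (np.sin(mic_ele) * np.sin(ele)
            + np.cos(mic_ele) * np.cos(ele) * np.cos(mic_azi - azi))
    H = 0j
    for n in range(n_max + 1):
        # derivative of the spherical Hankel function of the second kind
        dhn2 = (spherical_jn(n, kR, derivative=True)
                - 1j * spherical_yn(n, kR, derivative=True))
        H += (1j ** (n - 1)) / dhn2 * (2 * n + 1) * eval_legendre(n, cosg)
    return H / kR ** 2
```

Frequency responses over a grid of DOAs can be collected channel by channel, and impulse responses obtained via an inverse FFT of the resulting spectra.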

## Sound event classes

13 target sound event classes were annotated. The classes loosely follow the Audioset ontology.

1. Female speech, woman speaking
2. Male speech, man speaking
3. Clapping
4. Telephone
5. Laughter
6. Domestic sounds
7. Walk, footsteps
8. Door, open or close
9. Music
10. Musical instrument
11. Water tap, faucet
12. Bell
13. Knock

The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. These are detailed here to aid the participants:

• Telephone
  • Mostly traditional Telephone Bell Ringing and Ringtone sounds, without musical ringtones.
• Domestic sounds
  • Sounds of Vacuum cleaner
  • Sounds of water boiler, closer to Boiling
  • Sounds of air circulator, closer to Mechanical fan
• Door, open or close
  • Combination of Door and Cupboard open or close
• Music
  • Background music and Pop music played by a loudspeaker in the room.
• Musical Instrument
  • Acoustic guitar
  • Marimba, xylophone
  • Cowbell
  • Piano
  • Rattle (instrument)
• Bell
  • Combination of sounds from hotel bell and glass bell, closer to Bicycle bell and single Chime.

• The speech classes contain speech in a few different languages.
• There are occasionally localized sound events that are not annotated and are considered as interferers, with examples such as computer keyboard, shuffling cards, dishes, pots, and pans.
• There is natural background noise (e.g. HVAC noise) in all recordings, at very low levels in some and at quite high levels in others. Such mostly diffuse background noise should be distinguished from the noise-like target sources (e.g. vacuum cleaner, mechanical fan), since those are clearly spatially localized.

## Dataset specifications

The specifications of the dataset can be summarized in the following:

• 70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, contributed by SONY (development dataset).
• 51 recording clips of 1 min ~ 5 min durations, with a total time of ~3hrs, contributed by TAU (development dataset).
• 52 recording clips of 40 sec ~ 5.5 min durations, with a total time of ~2hrs, contributed by SONY+TAU (evaluation dataset).
• A training-test split is provided for reporting results using the development dataset.
• 40 recordings contributed by SONY for the training split, captured in 2 rooms (dev-train-sony).
• 30 recordings contributed by SONY for the testing split, captured in 2 rooms (dev-test-sony).
• 27 recordings contributed by TAU for the training split, captured in 4 rooms (dev-train-tau).
• 24 recordings contributed by TAU for the testing split, captured in 3 rooms (dev-test-tau).
• A total of 11 unique rooms captured in the recordings, 4 from SONY and 7 from TAU (development set).
• Sampling rate 24kHz.
• Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
• Recordings are taken in two different countries and two different sites.
• Each recording clip is part of a recording session happening in a unique room.
• Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
• To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.
• 13 target classes are identified in the recordings and strongly annotated by humans.
• Spatial annotations for those active events are captured by an optical tracking system.
• Sound events out of the target classes are considered as interference.
• Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.

## Reference labels and directions-of-arrival

For each recording in the development dataset, the labels and DoAs are provided in a plain text CSV file of the same filename as the recording, in the following format:

[frame number (int)], [active class index (int)], [source number index (int)], [azimuth (int)], [elevation (int)]

Frame, class, and source enumeration begins at 0. Frames correspond to a temporal resolution of 100msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth $$\phi \in [-180^{\circ}, 180^{\circ}]$$, and elevation $$\theta \in [-90^{\circ}, 90^{\circ}]$$. Note that the azimuth angle is increasing counter-clockwise ($$\phi = 90^{\circ}$$ at the left).

The source index is a unique integer for each source in the scene, and it is provided only as additional information. Note that each unique actor gets assigned one such identifier, but not individual events produced by the same actor; e.g. a clapping event and a laughter event produced by the same person have the same identifier. Independent sources that are not actors (e.g. a loudspeaker playing music in the room) get a 0 identifier. Note that source identifier information is only included in the development metadata and is not required to be provided by the participants in their results.

Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be:

10,     1,  1,  -50,  30
11,     1,  1,  -50,  30
11,     1,  2,   10, -20
12,     1,  2,   10, -20
13,     1,  2,   10, -20
13,     8,  0,  -40,   0

which describes that in frames 10 and 11, an event of class male speech (class 1) belonging to one actor (source 1) is active at direction (-50°, 30°). At frame 11, a second instance of the same class appears simultaneously at a different direction (10°, -20°), belonging to another actor (source 2), while at frame 13 an additional event of class music (class 8) appears, belonging to a non-actor source (source 0). Frames that contain no sound events are not included in the sequence.
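For reference, the metadata format above can be read with a few lines of Python (the helper name `load_metadata` is our own, not part of the challenge tooling):

```python
import csv
from collections import defaultdict

def load_metadata(path):
    """Read a STARSS22-style metadata CSV into a dict mapping
    frame number -> list of (class_idx, source_idx, azimuth, elevation)."""
    events = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.reader(f):
            frame, cls, src, azi, ele = (int(v) for v in row)
            events[frame].append((cls, src, azi, ele))
    return events
```

Applied to the example sequence above, `load_metadata` would return two entries for frame 11 (the two overlapping speech instances) and two for frame 13 (speech and music).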

The development and evaluation version of the dataset can be downloaded at:

The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with the baseline results and consistent reporting of results at the technical reports of the submitted systems, before the evaluation stage.

• Note that even though there are two origins of the data, SONY and TAU, the challenge task considers the dataset as a single entity. Hence, models should not be trained separately for each of the two origins and tested individually on their respective recordings. Instead, the recordings of the individual training splits (dev-train-sony, dev-train-tau) and testing splits (dev-test-sony, dev-test-tau) should be combined (dev-train, dev-test), and the models should be trained and evaluated on the respective combined splits.

• The participants can choose to use as input to their models one of the two formats, FOA or MIC, or both simultaneously.

The evaluation dataset will be released a few weeks before the final submission deadline. That dataset consists of only audio recordings without any metadata/labels. At the evaluation stage, participants can decide the training procedure, i.e. the amount of training and validation files to be used in the development dataset, the number of ensemble models etc., and submit the results of the SELD performance on the evaluation dataset.

## Development dataset

The recordings in the development dataset follow the naming convention:

fold[fold number]_room[room number]_mix[recording number per room].wav

The fold number at the moment is used only to distinguish between the training and testing split. The room information is provided for the user of the dataset to potentially help understand the performance of their method with respect to different conditions.

## Evaluation dataset

The evaluation dataset will consist of recordings without any information on the origin (SONY or TAU) or on the location in the naming convention, as below:

mix[recording number].wav

## External data

Since the development set contains recordings of real scenes, the presence of each class and the density of sound events vary greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow the use of external datasets, as long as they are publicly available. Examples of external data are sound sample banks, annotated sound event datasets, pre-trained models, and room and array impulse response libraries.

A typical use case could be in the form of sound event datasets containing the target classes, which can be used to generate additional spatial mixtures. Some possible examples of spatialization are:

• using the theoretical steering vectors of any of the two formats presented earlier to emulate anechoic mixtures, with the possibility of background noise recordings decorrelated and added as diffuse across channels,
• using the theoretical steering vectors of any of the two formats presented earlier and a room simulator to spatialize isolated event samples in reverberant conditions,
• using isolated event samples convolved with measured multichannel room impulse responses of any of the two formats, to emulate spatial sound scenes with real reverberation profiles.
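As a toy example of the first option, the sketch below spatializes a mono event sample with the theoretical FOA steering vector and adds decorrelated noise across channels as a crude stand-in for diffuse background. This is illustrative only; the helper name `spatialize_foa` and the SNR handling are our own assumptions, not a prescribed generation method:

```python
import numpy as np

rng = np.random.default_rng(0)

def spatialize_foa(mono, azi_deg, ele_deg, snr_db=20.0):
    """Emulate an anechoic FOA mixture: scale a mono event by the
    theoretical FOA steering gains for a static DOA, then add
    independent (decorrelated) noise per channel as pseudo-diffuse
    background at the requested SNR relative to the omni channel."""
    phi, theta = np.radians(azi_deg), np.radians(ele_deg)
    steer = np.array([1.0,
                      np.sin(phi) * np.cos(theta),
                      np.sin(theta),
                      np.cos(phi) * np.cos(theta)])
    sig = steer[:, None] * mono[None, :]          # shape (4, n_samples)
    noise = rng.standard_normal(sig.shape)
    target_rms = np.sqrt(np.mean(sig[0] ** 2) / 10 ** (snr_db / 10))
    noise *= target_rms / (np.std(noise) + 1e-12)
    return sig + noise
```

A more faithful diffuse field would be obtained by decorrelating a real ambience recording per channel rather than using white noise, and moving sources require time-varying steering gains.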

The following rules apply on the use of external data:

• The external datasets or pre-trained models used should be freely and publicly accessible before 15 April 2022.
• Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
• The participants will have to indicate clearly which external data they have used in their system info and technical report.
• Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
The external data announced so far:

| Resource | Type | Available since | Link |
|---|---|---|---|
| FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
| ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
| Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
| IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
| Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
| SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
| TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
| TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
| PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
| VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |

## Example external data use with baseline

The baseline is trained with a combination of the recordings in the development set and synthetic recordings, generated through convolution of isolated sound samples with real spatial room impulse responses (SRIRs) captured in various spaces of Tampere University. The training can be summarized by the following steps:

1. Sound samples for the target classes were sourced from the FSD50K dataset. The samples were selected based on their labels having only one of the classes of interest, and the annotator rating present and predominant.
2. The sound samples were spatialized using the same SRIRs as the ones used to generate the TAU-NIGENS Spatial Sound Events 2020 and TAU-NIGENS Spatial Sound Events 2021 datasets of synthetic sound scenes. The generation was done with a similar procedure and code as in the latter dataset.
3. 1200 one-minute scene recordings were generated for both formats, with a maximum polyphony of 2 and no directional interference.
4. Some additional tuning of event signal energies in the recordings was performed during generation, to better match the event signal energy distribution found in the development dataset.
5. The synthesized recordings were mixed with the real recordings from the development training set.
6. The baseline model was trained on this mixed training set and evaluated on the development testing set.

For reproducibility, we share the generated recordings here, along with a list of FSD files used for the generation:

For participants that would like to use a similar process as above generating their own data with such measured SRIRs, we have published the responses of 9 rooms here:

Additionally, a python version of the generator code to spatialize and layer the sound events and mix ambient noise, as in our synthesized data, is shared here:

The SELD data generator code is functional but still a work in progress; the code is quite "rough" and not well documented yet. We will be taking care of those issues during the development phase of the challenge. For problems or questions on its use, contact Daniel Krause or Archontis Politis from the organizers.

• Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
• Manipulation of the provided training-test split in the development dataset is not allowed.
• Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
• The development dataset can be augmented e.g. using techniques such as pitch shifting or time stretching, respatialization or re-reverberation of parts, etc.

# Submission

The results for each of the recordings in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the file name of the respective audio recording, but with the .csv extension, and should contain the same information at each row as the reference labels, excluding the source index:

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]

Enumeration of frame and class indices begins at zero. The class indices are as ordered in the class descriptions mentioned above. The evaluation will be performed at a temporal resolution of 100 msec. In case the participants use a different frame or hop length in their systems, we expect them to use a suitable method to resample their outputs to the specified resolution before submitting the evaluation results.
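As one possible way to do such resampling, the sketch below pools predictions made at a finer hop into 100 ms label frames, keeping one entry per class per label frame. The helper name and pooling strategy are our own illustration, not a prescribed method; a circular mean is used for azimuth to handle wrap-around at ±180°:

```python
import numpy as np

def resample_to_label_frames(pred, model_hop_s, label_hop_s=0.1):
    """Pool finer-hop predictions into label-rate frames.
    pred: dict mapping model frame index -> list of (class, azi, ele).
    Returns dict mapping label frame index -> list of (class, azi, ele)."""
    buckets = {}
    for m_idx, events in pred.items():
        l_idx = int(m_idx * model_hop_s / label_hop_s)
        for cls, azi, ele in events:
            buckets.setdefault((l_idx, cls), []).append((azi, ele))
    out = {}
    for (l_idx, cls), doas in buckets.items():
        azis = np.radians([d[0] for d in doas])
        # circular mean avoids averaging +179 and -179 into ~0
        azi = np.degrees(np.arctan2(np.sin(azis).mean(), np.cos(azis).mean()))
        ele = np.mean([d[1] for d in doas])
        out.setdefault(l_idx, []).append((cls, int(round(azi)), int(round(ele))))
    return out
```

Note that this simple pooling merges multiple same-class instances within a label frame; systems that track instances separately would need a finer association step.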

In addition to the CSV files, the participants are asked to update the information on their method in the provided file and to submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, meta-information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information regarding the challenge submission can be found in the submission page.

General information for all DCASE submissions can be found on the Submission page.

# Evaluation

The evaluation is based on metrics that assess detection and localization performance jointly, and are similar to the ones used in the previous two challenges, with a few differences detailed below.

## Metrics

The metrics are based on true positives ($$TP$$) and false positives ($$FP$$) determined not only by correct or wrong detections, but also by whether they are closer or further than a distance threshold $$T$$ (angular in our case) from the reference. For the evaluation of this challenge we take this threshold to be $$T = 20^\circ$$.

More specifically, for each class $$c\in[1,...,C]$$ and each frame or segment:

• $$P_c$$ predicted events of class $$c$$ are associated with $$R_c$$ reference events of class $$c$$
• false negatives are counted for misses: $$FN_c = \max(0, R_c-P_c)$$
• false positives are counted for extraneous predictions: $$FP_{c,\infty}=\max(0,P_c-R_c)$$
• $$K_c$$ predictions are spatially associated with references based on Hungarian algorithm: $$K_c=\min(P_c,R_c)$$. Those can also be considered as the unthresholded true positives $$TP_c = K_c$$.
• the spatial threshold is applied, which moves the $$L_c\leq K_c$$ predictions further than the threshold to false positives: $$FP_{c,\geq20^\circ} = L_c$$, and $$FP_c = FP_{c,\infty}+FP_{c,\geq20^\circ}$$
• the remaining matched estimates per class are counted as true positives: $$TP_{c,\leq20^\circ} = K_c-FP_{c,\geq20^\circ}$$
• finally: predictions $$P_c = TP_{c,\leq20^\circ}+ FP_c$$, but references $$R_c = TP_{c,\leq20^\circ}+FP_{c,\geq20^\circ}+FN_c$$
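The counting steps above can be sketched with a Hungarian association over angular distances. This is a simplified per-frame, per-class illustration (the names `angular_error_deg` and `count_frame` are our own, not the official evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def angular_error_deg(azi1, ele1, azi2, ele2):
    """Great-circle angle (degrees) between two DOAs given in degrees."""
    a1, e1, a2, e2 = np.radians([azi1, ele1, azi2, ele2])
    cosd = np.sin(e1) * np.sin(e2) + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2)
    return np.degrees(np.arccos(np.clip(cosd, -1.0, 1.0)))

def count_frame(preds, refs, thresh=20.0):
    """TP/FP/FN counts for one class in one frame, following the steps above.
    preds/refs: lists of (azi, ele) in degrees. Returns (TP, FP, FN, errors)."""
    P, R = len(preds), len(refs)
    fn = max(0, R - P)                      # misses
    fp = max(0, P - R)                      # extraneous predictions
    if P == 0 or R == 0:
        return 0, fp, fn, []
    D = np.array([[angular_error_deg(*p, *r) for r in refs] for p in preds])
    rows, cols = linear_sum_assignment(D)   # Hungarian association
    errors = D[rows, cols]                  # K = min(P, R) matched pairs
    far = int(np.sum(errors > thresh))      # matched but beyond the threshold
    tp = len(errors) - far
    return tp, fp + far, fn, errors.tolist()
```

The matched angular errors returned here also feed the localization error metric defined below, which is computed without applying the threshold.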

Based on those, we form the location-dependent F1-score ($$F_{\leq 20^\circ}$$) and Error Rate ($$ER_{\leq 20^\circ}$$). Contrary to the previous challenges, in which $$F_{\leq 20^\circ}$$ was micro-averaged, in this challenge we perform macro-averaging of the location-dependent F1-score: $$F_{\leq 20^\circ}= \sum_c F_{c,\leq 20^\circ}/C$$.

Additionally, we evaluate localization accuracy through a class-dependent localization error $$LE_c$$, computed as the mean angular error of the matched true positives per class, and then macro-averaged:

• $$LE_c = \sum_k \theta_k/ K_c = \sum_k \theta_k /TP_c$$ for each frame or segment, with $$\theta_k$$ being the angular error between the $$k$$th matched prediction and reference,
• and after averaging across all frames that have any true positives, $$LE_{CD} = \sum_c LE_c/C$$.

Complementary to the localization error, we compute a localization recall metric per class, also macro-averaged:

• $$LR_c = K_c/R_c = TP_c/(TP_c + FN_c)$$, and
• $$LR_{CD} = \sum_c LR_c/C$$.

Note that the localization error and recall are not thresholded in order to give more varied complementary information to the location-dependent F1-score, presenting localization accuracy outside of the spatial threshold.

All metrics are computed in one-second non-overlapping frames. For a more thorough analysis on the joint SELD metrics please refer to:

Publication

Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, and Tuomas Virtanen. Overview and evaluation of sound event localization and detection in dcase 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:684–698, 2020. URL: https://ieeexplore.ieee.org/abstract/document/9306885.

#### Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

##### Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

## Ranking

Overall ranking will be based on the cumulative rank of the metrics mentioned above, sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as $$ER:1, F1:1, LE:3, LR: 1$$, then its cumulative rank is $$1+1+3+1=6$$. Then if system B has $$ER:3, F1:2, LE:2, LR:3$$ (10), and system C has $$ER:2, F1:3, LE:1, LR:2$$ (8), then the overall rank of the systems is A,C,B. If two systems end up with the same cumulative rank, then they are assumed to have equal place in the challenge, even though they will be listed alphabetically in the ranking tables.
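The cumulative-rank computation in the example can be written as a short snippet (the function name is our own):

```python
def cumulative_ranking(per_metric_ranks):
    """Order systems by the sum of their per-metric ranks.
    per_metric_ranks: dict name -> (ER_rank, F_rank, LE_rank, LR_rank).
    Ties share a place; they are listed alphabetically here."""
    totals = {name: sum(ranks) for name, ranks in per_metric_ranks.items()}
    return sorted(totals, key=lambda name: (totals[name], name))

# The example from the text: A -> 6, C -> 8, B -> 10
ranks = {'A': (1, 1, 3, 1), 'B': (3, 2, 2, 3), 'C': (2, 3, 1, 2)}
print(cumulative_ranking(ranks))  # -> ['A', 'C', 'B']
```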

# Results

The SELD task received 63 submissions in total from 19 teams across the world. The main results for these submissions are as follows (the table below includes only the best-performing system per submitting team):

| Rank | Submission | Corresponding author | Affiliation | Error rate (20°) | F-score (20°) | Localization error (°) | Localization recall |
|---|---|---|---|---|---|---|---|
| 1 | Du_NERCSLIP_task3_2 | Jun Du | University of Science and Technology of China | 0.35 (0.30 - 0.41) | 58.3 (53.8 - 64.7) | 14.6 (12.8 - 16.5) | 73.7 (68.7 - 78.2) |
| 5 | Hu_IACAS_task3_3 | Jinbo Hu | Institute of Acoustics, Chinese Academy of Sciences | 0.39 (0.34 - 0.44) | 55.8 (51.2 - 61.1) | 16.2 (14.6 - 17.8) | 72.4 (67.3 - 77.2) |
| 7 | Han_KU_task3_4 | Sung Won Han | Korea University | 0.37 (0.31 - 0.42) | 49.7 (44.4 - 56.6) | 16.5 (14.8 - 18.0) | 70.7 (65.8 - 76.1) |
| 11 | Xie_UESTC_task3_1 | Rong Xie | University of Electronic Science and Technology of China | 0.48 (0.41 - 0.55) | 48.6 (42.5 - 55.4) | 17.6 (16.0 - 19.2) | 73.5 (68.0 - 77.6) |
| 14 | Bai_JLESS_task3_4 | Jisheng Bai | Northwestern Polytechnical University | 0.47 (0.40 - 0.54) | 49.3 (41.8 - 57.1) | 16.9 (15.0 - 18.9) | 67.9 (59.3 - 73.3) |
| 17 | Kang_KT_task3_2 | Sang-Ick Kang | KT Corporation | 0.47 (0.40 - 0.53) | 45.9 (40.1 - 52.6) | 15.8 (13.6 - 18.0) | 59.3 (50.3 - 65.1) |
| 23 | Ko_KAIST_task3_2 | Byeong-Yun Ko | Korea Advanced Institute of Science and Technology | 0.49 (0.42 - 0.55) | 39.9 (33.8 - 46.0) | 17.3 (15.3 - 19.3) | 54.6 (46.5 - 60.5) |
| 27 | Chun_Chosun_task3_3 | Chanjun Chun | Chosun University | 0.59 (0.52 - 0.66) | 31.0 (25.9 - 36.3) | 19.8 (17.3 - 22.6) | 50.7 (42.2 - 56.3) |
| 30 | Scheibler_LINE_task3_1 | Robin Scheibler | LINE Corporation | 0.62 (0.55 - 0.69) | 30.4 (25.2 - 36.3) | 16.7 (14.0 - 19.5) | 49.2 (42.1 - 54.5) |
| 33 | Guo_XIAOMI_task3_2 | Kaibin Guo | Xiaomi | 0.60 (0.53 - 0.67) | 28.2 (22.8 - 34.1) | 23.8 (21.3 - 26.2) | 52.1 (43.4 - 58.1) |
| 33 | Wang_SJTU_task3_2 | Yu Wang | Shanghai Jiao Tong University | 0.67 (0.60 - 0.74) | 27.0 (19.3 - 33.6) | 24.4 (22.0 - 27.1) | 60.3 (53.8 - 65.3) |
| 38 | Park_SGU_task3_4 | Hyung-Min Park | Sogang University | 0.60 (0.53 - 0.67) | 30.6 (25.2 - 36.4) | 21.6 (17.8 - 25.1) | 45.9 (40.3 - 51.0) |
| 42 | FOA_Baseline_task3_1 | Archontis Politis | Tampere University | 0.61 (0.57 - 0.65) | 23.7 (18.7 - 29.4) | 22.9 (21.0 - 26.0) | 51.4 (46.2 - 55.2) |
| 44 | Xie_XJU_task3_1 | Yin Xie | Xinjiang University | 0.66 (0.59 - 0.74) | 25.5 (19.3 - 32.2) | 23.1 (19.9 - 26.4) | 53.1 (42.7 - 59.4) |
| 46 | Kim_KU_task3_2 | Gwantae Kim | Korea University | 0.74 (0.66 - 0.81) | 24.1 (19.8 - 28.9) | 26.6 (23.4 - 29.8) | 55.1 (48.6 - 59.5) |
| 48 | Kapka_SRPOL_task3_4 | Slawomir Kapka | Samsung Research Poland | 0.72 (0.65 - 0.79) | 25.5 (21.3 - 30.4) | 25.4 (21.7 - 29.3) | 49.8 (42.8 - 55.3) |
| 52 | FalconPerez_Aalto_task3_2 | Ricardo Falcon-Perez | Aalto University | 0.73 (0.67 - 0.79) | 21.8 (15.5 - 27.6) | 24.4 (21.7 - 27.1) | 43.1 (35.7 - 48.7) |
| 53 | Wu_NKU_task3_2 | Shichao Wu | Nankai University | 0.69 (0.64 - 0.74) | 17.9 (14.4 - 21.5) | 28.5 (24.5 - 39.7) | 44.5 (38.2 - 48.4) |
| 60 | Zhaoyu_LRVT_task3_1 | Zhaoyu Yan | Lenovo Research | 0.96 (0.88 - 1.00) | 11.2 (8.8 - 13.9) | 31.0 (28.5 - 33.4) | 53.4 (44.4 - 58.9) |
| 65 | Chen_SHU_task3_1 | Zhengyu Chen | Shanghai University | 1.00 (1.00 - 1.00) | 0.3 (0.1 - 0.6) | 60.3 (45.4 - 94.0) | 4.5 (2.9 - 6.3) |

Complete results and technical reports can be found on the task results page.

# Baseline system

Similarly to the previous iterations of the challenge, as the baseline we use a straightforward convolutional recurrent neural network (CRNN) based on SELDnet, but with a few important modifications.

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. URL: https://ieeexplore.ieee.org/abstract/document/8567942, doi:10.1109/JSTSP.2018.2885636.

#### Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

##### Abstract

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

##### Keywords

Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network

## Baseline changes

Compared to the DCASE2021 baseline and the associated published SELDnet version, a few modifications have been integrated into the model in order to incorporate some of the simplest effective improvements demonstrated by participants in the previous year.

In DCASE2021 the baseline adopted the ACCDOA representation, which trains localization and detection jointly with a single unified regression-vector loss; it was successfully demonstrated in DCASE2020 and adopted by many other participants during DCASE2021. In this challenge we maintain the ACCDOA representation, but with a recent extension that makes it suitable for handling simultaneous events of the same class, presented by Shimada et al. in the paper:

Publication

Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, and Yuki Mitsufuji. Multi-accdoa: localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.

#### Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

##### Abstract

Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi one and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi-ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.
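As background for the ACCDOA representation discussed above: each class is predicted as a Cartesian vector whose norm encodes activity and whose direction encodes the DOA. A minimal decoding sketch for a single frame follows; the `decode_accdoa` helper and the 0.5 activity threshold are illustrative assumptions, not the baseline's exact code:

```python
import numpy as np

def decode_accdoa(vectors, threshold=0.5):
    """Decode ACCDOA vectors for one frame.

    vectors: (n_classes, 3) array of Cartesian regression outputs.
    A class is considered active when its vector norm exceeds the
    threshold; the unit-normalized vector is the DOA estimate.
    """
    norms = np.linalg.norm(vectors, axis=-1)
    active = norms > threshold
    doas = np.zeros_like(vectors)
    doas[active] = vectors[active] / norms[active, None]
    return active, doas

# Toy frame with 3 classes: only class 0 is clearly active,
# pointing straight up (+z).
frame = np.array([[0.0, 0.0, 0.9],
                  [0.1, 0.0, 0.1],
                  [0.0, 0.2, 0.0]])
active, doas = decode_accdoa(frame)
print(active)  # [ True False False]
```

The multi-ACCDOA extension in the paper above adds several such vector tracks per class, so that two simultaneous events of the same class can each occupy their own track.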

Another modification is the addition of alternative input spatial features for the MIC format of the dataset: apart from generalized cross-correlation (GCC) vectors, these now include the frequency-normalized inter-channel phase differences defined by Nguyen et al. in their recent work:

Publication

Thi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, and Woon-Seng Gan. SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.

#### SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays

##### Abstract

Polyphonic sound event localization and detection (SELD) has many practical applications in acoustic sensing and monitoring. However, the development of real-time SELD has been limited by the demanding computational requirement of most recent SELD systems. In this work, we introduce SALSA-Lite, a fast and effective feature for polyphonic SELD using microphone array inputs. SALSA-Lite is a lightweight variation of a previously proposed SALSA feature for polyphonic SELD. SALSA, which stands for Spatial Cue-Augmented Log-Spectrogram, consists of multichannel log-spectrograms stacked channelwise with the normalized principal eigenvectors of the spectrotemporally corresponding spatial covariance matrices. In contrast to SALSA, which uses eigenvector-based spatial features, SALSA-Lite uses normalized inter-channel phase differences as spatial features, allowing a 30-fold speedup compared to the original SALSA feature. Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset showed that the SALSA-Lite feature achieved competitive performance compared to the full SALSA feature, and significantly outperformed the traditional feature set of multichannel log-mel spectrograms with generalized cross-correlation spectra. Specifically, using SALSA-Lite features increased localization-dependent F1 score and class-dependent localization recall by 15% and 5%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.

This modification was introduced in order to bring the baseline performance on the microphone array (MIC) format closer to that on the ambisonic (FOA) format; the large gap observed in the DCASE2021 challenge was attributed mainly to the limitations of GCC features in complex multi-source conditions.
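As a rough illustration of this kind of spatial feature, the sketch below scales inter-channel phase differences by -c/(2πf) so that, for a pure delay between channels, the feature approximates a path-length difference in metres. The reference channel, sign convention, and the `nipd_features` helper are assumptions for illustration, not the baseline implementation:

```python
import numpy as np

def nipd_features(stft, freqs_hz, c=343.0):
    """Frequency-normalized inter-channel phase differences (NIPD).

    stft: (n_ch, n_freq, n_frames) complex STFT.
    Returns (n_ch - 1, n_freq, n_frames) real features: the phase
    difference of each channel relative to the first, scaled by
    -c / (2*pi*f), which for a pure delay approximates the
    inter-channel path-length difference in metres.
    """
    ref = stft[0]
    ipd = np.angle(stft[1:] * np.conj(ref))
    scale = -c / (2.0 * np.pi * np.maximum(freqs_hz, 1e-3))
    return scale[None, :, None] * ipd

# Toy check: two channels, a single 1 kHz bin, one frame, and a pure
# 0.1 ms inter-channel delay on the second channel.
tau = 1e-4
freqs = np.array([1000.0])
stft = np.array([[[1.0 + 0j]],
                 [[np.exp(-1j * 2 * np.pi * 1000.0 * tau)]]])
feat = nipd_features(stft, freqs)
# feat[0, 0, 0] is close to c * tau = 0.0343 m.
```

Unlike GCC vectors, such features keep a direct per-frequency phase relation, which is part of why they remain informative when multiple sources overlap.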

## Repository

The baseline, along with more details on it, can be found in:

## Results for the development dataset

The evaluation metric scores for the test split of the development dataset are given below. The location-dependent detection metrics are computed within a 20° angular threshold from the reference direction.

| Dataset | ER20° | F20° (micro) | F20° (macro) | LECD | LRCD |
|---|---|---|---|---|---|
| Ambisonic | 0.71 | 36 % | 21 % | 29.3° | 46 % |
| Microphone array | 0.71 | 36 % | 18 % | 32.2° | 47 % |

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

# Citation

A technical report with more details on the collection, annotation, and specifications of the dataset, along with analysis of the baseline and its properties will be provided soon.

If you are participating in this task or using the dataset and code please consider citing the following papers:

Publication

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. 2022. URL: https://arxiv.org/abs/2206.01948, arXiv:2206.01948.

#### STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Publication

Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, and Tuomas Virtanen. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:684–698, 2020. URL: https://ieeexplore.ieee.org/abstract/document/9306885.

#### Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

##### Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.