The goal of the sound event localization and detection task is to detect occurrences of sound events belonging to specific target classes, track their temporal activity, and estimate their directions-of-arrival or positions during it.
Challenge has ended. Full results for this task can be found in the Results page.
Description
Given multichannel audio input, a sound event localization and detection (SELD) system outputs localization estimates of one or more events for each of the target sound classes, whenever such events are detected. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation with visually occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and acoustic monitoring, among others.
Overall, this year's challenge task resembles the previous iteration, evaluating SELD models with audio-only input (Track A) or audiovisual input (Track B) on manually annotated recordings of real interior sound scenes. However, this year the task introduces distance estimation of the detected events, which makes the task significantly more challenging. The evaluation metrics are also modified to take that extra dimension into account. Regarding the audiovisual track, we believe that there is still plenty of room for developing novel solutions that surpass audio-only models. We encourage participants to submit both audio-only SELD systems and audiovisual SELD systems.
Dataset
The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected at two different sites, in Tampere, Finland, by the Audio Research Group (ARG) of Tampere University, and in Tokyo, Japan, by Sony, using a similar setup and annotation procedure. As in the previous challenges, the dataset is delivered in two spatial recording formats.
The simultaneous 360° videos are spatially and temporally aligned with the microphone array recordings. The videos are made available with the participants' consent, after blurring visible faces.
Collection of data from the TAU side has received funding from Google.
Detailed dataset specifications can be found below. More details on the recording and annotation procedure can be found in the DCASE2022 Challenge task description, in the technical report of STARSS22:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.
and in the dataset paper of STARSS23:
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
360° video excerpts from STARSS23 with spatiotemporal labels overlaid on top of the video.
For spatialized binaural listening of the scene, use headphones on Chrome/Firefox.
Dataset specifications
The specifications of the dataset can be summarized in the following:
General:
- Recordings are taken in two different sites.
- Each recording clip is part of a recording session happening in a unique room.
- Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
- To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.
- 13 target classes are identified in the recordings and strongly annotated by humans.
- Spatial annotations for those active events are captured by an optical tracking system.
- Sound events out of the target classes are considered as interference.
- Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 6) can occur but are rare.
Volume, duration, and data split:
- A total of 16 unique rooms are captured in the recordings, 4 in Tokyo and 12 in Tampere (development set).
- 70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, captured in Tokyo (development dataset).
- 98 recording clips of 40 sec ~ 9 min durations, with a total time of ~5.5hrs, captured in Tampere (development dataset).
- 79 recording clips of 40 sec ~ 7 min durations, with a total time of ~3.5hrs, captured in both sites (evaluation dataset).
- A training-testing split is provided for reporting results using the development dataset.
  - 40 recordings contributed by Sony for the training split, captured in 2 rooms (dev-train-sony).
  - 30 recordings contributed by Sony for the testing split, captured in 2 rooms (dev-test-sony).
  - 50 recordings contributed by TAU for the training split, captured in 7 rooms (dev-train-tau).
  - 48 recordings contributed by TAU for the testing split, captured in 5 rooms (dev-test-tau).
Audio:
- Sampling rate: 24kHz.
- Bit depth: 16 bits.
- Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
Video:
- Video 360° format: equirectangular
- Video resolution: 1920x960.
- Video frames per second (fps): 29.97.
- All audio recordings are accompanied by synchronised video recordings, apart from 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav)
Sound event classes
13 target sound event classes were annotated. The classes follow loosely the Audioset ontology.
- Female speech, woman speaking
- Male speech, man speaking
- Clapping
- Telephone
- Laughter
- Domestic sounds
- Walk, footsteps
- Door, open or close
- Music
- Musical instrument
- Water tap, faucet
- Bell
- Knock
The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. These are detailed here to aid the participants:
- Telephone
  - Mostly traditional Telephone Bell Ringing and Ringtone sounds, without musical ringtones.
- Domestic sounds
  - Sounds of Vacuum cleaner
  - Sounds of water boiler, closer to Boiling
  - Sounds of air circulator, closer to Mechanical fan
- Door, open or close
  - Combination of Door and Cupboard open or close
- Music
  - Background music and Pop music played by a loudspeaker in the room.
- Musical instrument
  - Acoustic guitar
  - Marimba, xylophone
  - Cowbell
  - Piano
  - Rattle (instrument)
- Bell
  - Combination of sounds from hotel bell and glass bell, closer to Bicycle bell and single Chime.
Some additional notes:
- The speech classes contain speech in a few different languages.
- There are occasionally localized sound events that are not annotated and are considered as interferers, with examples such as computer keyboard, shuffling cards, dishes, pots, and pans.
- There is natural background noise (e.g. HVAC noise) in all recordings, at very low levels in some and at quite high levels in others. Such mostly diffuse background noise should be distinct from other noisy target sources (e.g. vacuum cleaner, mechanical fan) since these are clearly spatially localized.
Recording formats
The array response of the two recording formats can be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from direction-of-arrival (DOA) given by azimuth angle \(\phi\) and elevation angle \(\theta\).
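For the first-order ambisonics (FOA):
\[ H_1(\phi, \theta, f) = 1 \]
\[ H_2(\phi, \theta, f) = \sin(\phi)\cos(\theta) \]
\[ H_3(\phi, \theta, f) = \sin(\theta) \]
\[ H_4(\phi, \theta, f) = \cos(\phi)\cos(\theta) \]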
The (FOA) format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that in the formulas above the encoding format is assumed frequency-independent, something that holds true up to around 9kHz with the specific microphone array, while the actual encoded responses start to deviate gradually at higher frequencies from the ideal ones provided above.
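As a minimal Python sketch of the ideal, frequency-independent FOA response, the function below evaluates the four channel gains for a given DOA. It assumes the channels follow the common ACN ordering (W, Y, Z, X) with unit omnidirectional gain; the function name `foa_steering_vector` is hypothetical, and the high-frequency deviations mentioned above are ignored.

```python
import math


def foa_steering_vector(azimuth_deg, elevation_deg):
    """Ideal FOA gains for a plane wave from the given DOA (degrees).

    Assumed channel order: (W, Y, Z, X). Azimuth increases
    counter-clockwise, elevation is zero on the horizontal plane.
    """
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return (
        1.0,                              # W: omnidirectional
        math.sin(phi) * math.cos(theta),  # Y: left-right dipole
        math.sin(theta),                  # Z: up-down dipole
        math.cos(phi) * math.cos(theta),  # X: front-back dipole
    )


if __name__ == "__main__":
    # A source straight ahead excites only W and X.
    print(foa_steering_vector(0, 0))
    # A source at the left (azimuth 90 degrees) excites W and Y.
    print(foa_steering_vector(90, 0))
```

Such ideal gains can be useful for simple spatial augmentation or for checking that a channel-ordering assumption matches the data before training.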
For the tetrahedral microphone array (MIC):
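The four microphones have the following positions, in spherical coordinates \((\phi, \theta, r)\):
\[ M_1: (45^{\circ}, 35^{\circ}, 4.2\,\mathrm{cm}) \]
\[ M_2: (-45^{\circ}, -35^{\circ}, 4.2\,\mathrm{cm}) \]
\[ M_3: (135^{\circ}, -35^{\circ}, 4.2\,\mathrm{cm}) \]
\[ M_4: (-135^{\circ}, 35^{\circ}, 4.2\,\mathrm{cm}) \]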
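Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:
\[ H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2} \sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)} (2n+1) P_n(\cos(\gamma_m)) \]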
where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\cos(\gamma_m)\) is the cosine of the angle between the microphone position and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative with respect to the argument of a spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
Reference labels, directions-of-arrival, and source distances
For each recording in the development dataset, the labels, DoAs, and distances are provided in a plain text CSV file of the same filename as the recording, in the following format:
[frame number (int)], [active class index (int)], [source number index (int)], [azimuth (int)], [elevation (int)], [distance (int)]
Frame, class, and source enumeration begins at 0. Frames correspond to a temporal resolution of 100msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth \(\phi \in [-180^{\circ}, 180^{\circ}]\), and elevation \(\theta \in [-90^{\circ}, 90^{\circ}]\). Note that the azimuth angle is increasing counter-clockwise (\(\phi = 90^{\circ}\) at the left). Distances are provided in centimeters, rounded to the closest integer value.
The source index is a unique integer for each source in the scene, and it is provided only as additional information. Note that each unique actor gets assigned one such identifier, but not individual events produced by the same actor; e.g. a clapping event and a laughter event produced by the same person have the same identifier. Independent sources that are not actors (e.g. a loudspeaker playing music in the room) get a 0 identifier.
Note that the source index is only included in the development metadata as additional information that can be exploited during training. It is not required to be estimated or provided by the participants in their results.
Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be:
10, 1, 1, -50, 30, 181
11, 1, 1, -50, 30, 181
11, 1, 2, 10, -20, 243
12, 1, 2, 10, -20, 243
13, 1, 2, 10, -20, 243
13, 8, 0, -40, 0, 80
which describes that in frames 10-11, an event of class male speech (class 1) belonging to one actor (source 1) is active at location (-50°,30°,181cm). At frame 11, a second instance of the same class appears simultaneously at a different location (10°,-20°,243cm), belonging to another actor (source 2), while at frame 13 an additional event of class music (class 8) appears, belonging to a non-actor source (source 0). Frames that contain no sound events are not included in the sequence.
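The label format can be read with a few lines of Python. The sketch below is illustrative only: the function names `parse_labels` and `frame_start_ms` and the dictionary keys are hypothetical choices, not part of the dataset tooling. It groups rows by frame so that overlapping events appear together, and converts frame indices to milliseconds using the 100 msec resolution.

```python
import csv
import io
from collections import defaultdict

FRAME_MS = 100  # temporal resolution of the labels


def parse_labels(text):
    """Group label rows by frame number.

    Each row is: frame, class, source, azimuth, elevation, distance.
    Returns {frame: [event dict, ...]}; silent frames are absent.
    """
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        frame, cls, source, azi, ele, dist = (int(v) for v in row)
        frames[frame].append({
            "class": cls,
            "source": source,
            "azimuth": azi,       # degrees, counter-clockwise
            "elevation": ele,     # degrees
            "distance_cm": dist,  # centimeters
        })
    return dict(frames)


def frame_start_ms(frame):
    """Start time of a label frame in milliseconds (frames start at 0)."""
    return frame * FRAME_MS


example = """10, 1, 1, -50, 30, 181
11, 1, 1, -50, 30, 181
11, 1, 2, 10, -20, 243
12, 1, 2, 10, -20, 243
13, 1, 2, 10, -20, 243
13, 8, 0, -40, 0, 80
"""

if __name__ == "__main__":
    labels = parse_labels(example)
    for frame in sorted(labels):
        print(frame_start_ms(frame), labels[frame])
```

Grouping by frame makes it straightforward to count simultaneous events per frame, which is what the frame-based evaluation operates on.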
Download
The development and evaluation version of the dataset can be downloaded at:
Task setup
The development dataset is provided with a training/testing split. During the development stage, the testing split can be used for comparison with the baseline results and consistent reporting of results at the technical reports of the submitted systems, before the evaluation stage.
- Note that even though there are two origins of the data, Sony and TAU, the challenge task considers the dataset as a single entity. Hence models should not be trained separately for each of the two origins, and tested individually on recordings of each of them. Instead, the recordings of the individual training splits (dev-train-sony, dev-train-tau) and testing splits (dev-test-sony, dev-test-tau) should be combined (dev-train, dev-test) and the models should be trained and evaluated in the respective combined splits.
- The participants can choose to use as input to their models one of the two formats, FOA or MIC, or both simultaneously.
The evaluation dataset is released a few weeks before the final submission deadline. That dataset consists of only audio and video recordings without any metadata/labels. At the evaluation stage, participants can decide the training procedure, i.e. the amount of training and validation files to be used in the development dataset, the number of ensemble models etc., and submit the results of the SELD performance on the evaluation dataset.
There are two tracks that the participants can follow: the audio-only track and the audiovisual track. The participants can choose to submit systems on either of the two tracks, or on both. We encourage participants to submit both audio-only models and audiovisual models.
Submissions for both tracks will be in the same system output format, and both tracks will be evaluated with the same metrics. During evaluation, ranking results will be presented in separate tables for each track.
Track A: Audio-only inference
The audio-only track continues the SELD task setup of the previous year, where inference of the SELD labels is performed with multichannel audio input only. Note that we do not exclude the use of video data or video information during training of such models. In this sense, the STARSS23 video recordings of the development set could be treated as external data and exploited in various ways to improve the performance of the model. However, during inference only the audio recordings of the evaluation set should be used.
Track B: Audiovisual inference
In the audiovisual track participants have access to 360° video recordings during training and evaluation. The models in this track are expected to be using both audio and video data during inference to produce the SELD labels.
Development dataset
The recordings in the development dataset follow the naming convention:
fold[fold number]_room[room number]_mix[recording number per room].wav
The fold number at the moment is used only to distinguish between the training and testing split. The room information is provided for the user of the dataset to potentially help understand the performance of their method with respect to different conditions.
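As an illustration of the naming convention, the following sketch extracts the fold, room, and recording numbers from a development file name. The helper name `parse_dev_filename` is hypothetical; which fold number corresponds to the training or the testing split is not assumed here and should be taken from the dataset documentation.

```python
import re

# fold[fold number]_room[room number]_mix[recording number per room].wav
DEV_NAME = re.compile(r"^fold(\d+)_room(\d+)_mix(\d+)\.wav$")


def parse_dev_filename(name):
    """Return the fold, room, and recording numbers encoded in a file name."""
    match = DEV_NAME.match(name)
    if match is None:
        raise ValueError(f"not a development recording name: {name!r}")
    fold, room, mix = (int(group) for group in match.groups())
    return {"fold": fold, "room": room, "mix": mix}


if __name__ == "__main__":
    print(parse_dev_filename("fold3_room21_mix001.wav"))
```

Keeping the room number alongside each result makes it easy to break down performance per room, as suggested above.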
For the audiovisual track, video files are provided with the same folder structure and naming convention as the audio files. There are 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav), hence the data are not exactly equivalent when working with audio-only or audiovisual input.
Evaluation dataset
The evaluation dataset consists of recordings without any information on the origin (Sony or TAU) or on the location in the naming convention, as below:
mix[recording number].wav
External data resources
Since the development set contains recordings of real scenes, the presence of each class and the density of sound events varies greatly. To enable more effective training of models to detect and localize all target classes, apart from spatial and spectrotemporal augmentation of the development set, we additionally allow use of external datasets as long as they are publicly available. External data examples are sound sample banks, annotated sound event datasets, pre-trained models, room and array impulse response libraries.
The following rules apply on the use of external data:
- The external datasets or pre-trained models used should be freely and publicly accessible before 15 May 2024.
- Participants should inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email or message in the Slack channel to the task coordinators if you intend to use a dataset or pre-trained model not on the list; we will update a list of external data in the webpage accordingly.
- The participants will have to indicate clearly which external data they have used in their system info and technical report.
- Once the evaluation set is published, no further requests will be taken and no further external sources will be added to the list.
Dataset name | Type | Added | Link |
---|---|---|---|
TAU-SRIR DB | room impulse responses | 04.04.2022 | https://zenodo.org/records/6408611 |
6DOF_SRIRs | room impulse responses | 23.11.2021 | https://zenodo.org/records/6382405 |
METU SRIRs | room impulse responses | 10.04.2019 | https://zenodo.org/records/2635758 |
MIRACLE | room impulse responses | 12.10.2023 | https://depositonce.tu-berlin.de/items/fc34d59c-c524-4a4b-86ae-4da9289f20e2 |
AudioSet | audio, video | 30.03.2017 | https://research.google.com/audioset/ |
FSD50K | audio | 02.10.2020 | https://zenodo.org/record/4060432 |
ESC-50 | audio | 13.10.2015 | https://github.com/karolpiczak/ESC-50 |
Wearable SELD dataset | audio | 17.02.2022 | https://zenodo.org/record/6030111 |
IRMAS | audio | 08.09.2014 | https://zenodo.org/record/1290750 |
Kinetics 400 | audio, video | 22.05.2017 | https://www.deepmind.com/open-source/kinetics |
SSAST | pre-trained model | 10.02.2022 | https://github.com/YuanGongND/ssast |
TAU-NIGENS Spatial Sound Events 2020 | audio | 06.04.2020 | https://zenodo.org/record/4064792 |
TAU-NIGENS Spatial Sound Events 2021 | audio | 28.02.2021 | https://zenodo.org/record/5476980 |
PANN | pre-trained model | 19.10.2020 | https://github.com/qiuqiangkong/audioset_tagging_cnn |
wav2vec2.0 | pre-trained model | 20.08.2020 | https://github.com/facebookresearch/fairseq |
PaSST | pre-trained model | 18.09.2022 | https://github.com/kkoutini/PaSST |
DTF-AT | pre-trained model | 19.12.2023 | https://github.com/ta012/DTFAT |
FNAC_AVL | pre-trained model | 25.03.2023 | https://github.com/OpenNLPLab/FNAC_AVL |
CSS10 Japanese | audio | 05.08.2019 | https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset |
JSUT | audio | 28.10.2017 | https://sites.google.com/site/shinnosuketakamichi/publication/jsut |
VoxCeleb1 | audio, video | 26.06.2017 | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html |
COCO | video | 01.05.2014 | https://cocodataset.org/ |
360-Indoor | video | 03.10.2019 | http://aliensunmin.github.io/project/360-dataset/ |
TorchVision Models and Pre-trained Weights | pre-trained model | 03.04.2017 | https://pytorch.org/vision/stable/models.html |
YOLOv7 | pre-trained model | 07.07.2022 | https://github.com/WongKinYiu/yolov7 |
YOLOv8 | pre-trained model | 10.01.2023 | https://github.com/ultralytics/ultralytics |
Grounding DINO | pre-trained model | 10.03.2023 | https://github.com/IDEA-Research/GroundingDINO |
MMDetection | pre-trained model | 15.12.2021 | https://github.com/open-mmlab/mmdetection |
MMPose | pre-trained model | 01.01.2020 | https://github.com/open-mmlab/mmpose |
MMFlow | pre-trained model | 01.01.2021 | https://github.com/open-mmlab/mmflow |
Paddle Detection | pre-trained model | 01.01.2019 | https://github.com/PaddlePaddle/PaddleDetection |
doors Image Dataset | image | 18.02.2022 | https://universe.roboflow.com/mohammed-naji/doors-6g8eb/dataset/1 |
DoorDetect Dataset | image | 27.05.2021 | https://github.com/MiguelARD/DoorDetect-Dataset |
CLIP | pre-trained model | 05.01.2021 | https://github.com/openai/CLIP |
CLAP | pre-trained model | 06.03.2022 | https://github.com/LAION-AI/CLAP |
Depth Anything | pre-trained model | 22.01.2024 | https://github.com/LiheYoung/Depth-Anything |
PanoFormer | pre-trained model | 04.03.2022 | https://github.com/zhijieshen-bjtu/PanoFormer |
SoundQ Youtube 360° video list | video | 06.10.2023 | https://github.com/aromanusc/SoundQ/blob/main/synth_data_gen/dataset.csv |
FMA | audio | 02.12.2016 | https://github.com/mdeff/fma |
Example external data use with baseline
The baseline is trained with a combination of the recordings in the development set and synthetic recordings, generated through convolution of isolated sound samples with real spatial room impulse responses (SRIRs) captured in various spaces of Tampere University. The scene synthesizer is the same as used in DCASE2022-2023 challenge, modified to export additional distance labels from the TAU-SRIR database of measured RIRs. For more details please refer to the respective section of the task description of DCASE2022. For reproducibility, the exact synthetic data that were generated and used for training this year's baseline are made available:
This year, we do not share the generator code with the additional distance exporting functionality. Instead, we recommend that participants use SpatialScaper, a recent library that integrates the functionality of the earlier SELD scene generator with support for TAU-SRIR DB, while at the same time offering support for additional real SRIRs as well as synthesized RIRs.
Iran R. Roman, Christopher Ick, Sivan Ding, Adrian S. Roman, Brian McFee, and Juan P. Bello. Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, South Korea, April 2024.
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Abstract
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially-localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improves as a direct function of acoustic diversity. These results show that SpatialScaper is valuable to train robust SELD models.
Scripts that generate data using the FSD50K sample subset that was hand picked earlier to conform to the target classes of the challenge, along with exporting labels in the challenge format, are kindly provided by the authors of the library. You can find the library with usage info here:
Task rules
- Use of external data is allowed, as long as they are publicly available. Check the section on external data for more instructions.
- Manipulation of the provided training-test split in the development dataset is not allowed when reporting dev-test results.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
- The development dataset can be augmented e.g. using techniques such as pitch shifting or time stretching, rotations, respatialization or re-reverberation of parts, etc.
Submission
During the evaluation phase of the challenge, the results for each of the recordings in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the file name of the respective audio recording, but with the .csv extension, and should contain the same information at each row as the reference labels, excluding the source index:
[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)],[distance (int)]
Enumeration of frame and class indices begins at zero. The class indices are as ordered in the class descriptions mentioned above. The evaluation will be performed at a temporal resolution of 100msec. In case the participants use a different frame or hop length for their study, we expect them to use a suitable method to resample the information at the specified resolution before submitting the evaluation results.
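A minimal sketch of producing a result file in this format is given below. The helper names `result_filename` and `format_rows` are hypothetical, and the predictions are assumed to be already resampled to the 100 msec frame resolution; angles and distances are rounded to integers with Python's built-in `round`.

```python
from pathlib import Path


def result_filename(audio_name):
    """Same name as the audio recording, with the .csv extension."""
    return Path(audio_name).with_suffix(".csv").name


def format_rows(predictions):
    """Format predictions as submission rows.

    predictions: iterable of (frame, class, azimuth_deg, elevation_deg,
    distance_cm) tuples, one per active event per 100 msec frame.
    Rows are sorted by frame; the source index is not written.
    """
    lines = []
    for frame, cls, azi, ele, dist_cm in sorted(predictions):
        lines.append(
            f"{int(frame)},{int(cls)},{round(azi)},{round(ele)},{round(dist_cm)}"
        )
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    preds = [
        (11, 1, 10.4, -20.3, 243.0),
        (10, 1, -49.6, 29.8, 181.2),
    ]
    print(result_filename("mix001.wav"))
    print(format_rows(preds), end="")
```

Frames without any detected event simply produce no rows, mirroring the reference label files.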
In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team for each of the two tracks (audio-only, audiovisual). For each system, meta-information should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information regarding the challenge submission can be found in the submission page.
General information for all DCASE submissions can be found on the Submission page.
Evaluation
The evaluation is based on metrics evaluating jointly localization and detection performance. Contrary to the previous challenges, there are changes to the metrics this year, to take distance estimation into account. The changes are the following:
- From the 4 earlier metrics (location-dependent F1 score, location-dependent error rate, direction-of-arrival (DOA) localization error, and localization recall), only the location-dependent F1 score and DOA error (DOAE) are used. This decision was taken to simplify a growing number of somewhat complementary metrics, including the new distance estimation, which was making the ranking of the systems quite complicated to perform in a balanced manner.
- A new source distance error is introduced, between the reference event distance and any matched predictions of it. Instead of using an absolute distance error, a relative distance error (RDE) is used with respect to the reference distance.
- The F1 score is spatially thresholded not only on the angular distance of predictions from the reference events, but also on the distances from the references. Instead of imposing a single 3D Cartesian distance threshold from the reference, we impose two separate thresholds: one angular threshold connected to the DOAE, and a distance threshold connected to the RDE. In this way the metric allows separate penalization of DOA estimation performance and distance estimation performance. We deem that appropriate since each of those estimates can target different applications.
- Instead of segment-based evaluation of the metrics, frame-based evaluation is used in this challenge. This decision was taken to simplify hand-crafted rules arising from associating multiple localization estimates over the duration of a segment with positives or negatives, and to avoid the additional complexity that distance estimation would introduce into those rules.
- This year's localization error (DOAE) does not penalize with 180° in undetected classes/events.
Metrics
The metrics are based on true positives (\(TP\)) and false positives (\(FP\)) determined not only by correct or wrong detections, but also by two spatial conditions: a) the angular distance \(\Omega = \angle(DOA_p, DOA_r)\) between the DOA of the prediction and the DOA of the matched reference event (if any) must not exceed an angular threshold \(T_{DOA}\), and b) the relative distance error \(\Delta = |L_p-L_r|/L_r\) between the distance \(L_p\) of the prediction and the distance \(L_r\) of the matched reference (if any) must not exceed a relative distance error threshold \(T_{RD}\). For the evaluation of this challenge we take the angular threshold to be \(T_{DOA} = 20^\circ\) and the relative distance threshold \(T_{RD} = 1\). A relative error is chosen instead of an absolute one because we would like to penalize errors less at farther distances, where distance cues in the received signals become increasingly unreliable, than at closer distances, where a better estimation performance is expected.
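The two spatial quantities can be sketched in Python as follows. This is an illustrative implementation, not the official evaluation code: the function names are hypothetical, the angular distance is computed as the great-circle angle between unit vectors, and a matched pair is treated as within the thresholds when neither quantity exceeds its threshold.

```python
import math


def unit_vector(azimuth_deg, elevation_deg):
    """Cartesian unit vector for a DOA given in degrees."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return (
        math.cos(phi) * math.cos(theta),
        math.sin(phi) * math.cos(theta),
        math.sin(theta),
    )


def angular_distance_deg(doa_p, doa_r):
    """Angle (degrees) between predicted and reference (azimuth, elevation)."""
    u, v = unit_vector(*doa_p), unit_vector(*doa_r)
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


def relative_distance_error(dist_p, dist_r):
    """|L_p - L_r| / L_r, relative to the reference distance."""
    return abs(dist_p - dist_r) / dist_r


def within_thresholds(pred, ref, t_doa=20.0, t_rd=1.0):
    """pred and ref are (azimuth_deg, elevation_deg, distance) tuples."""
    omega = angular_distance_deg(pred[:2], ref[:2])
    delta = relative_distance_error(pred[2], ref[2])
    return omega <= t_doa and delta <= t_rd


if __name__ == "__main__":
    ref = (-50, 30, 181)
    print(angular_distance_deg((-40, 30), ref[:2]))
    print(relative_distance_error(300, ref[2]))
    print(within_thresholds((-40, 30, 300), ref))
```

With \(T_{RD} = 1\), a prediction is rejected on distance only when it is more than twice the reference distance, since an underestimate can never yield a relative error above 1.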
More specifically, for each class \(c\in[1,...,C]\) and each frame:
- \(P_c\) predicted events of class \(c\) are associated with \(R_c\) reference events of class \(c\)
- false negatives are counted for misses: \(FN_c = \max(0, R_c-P_c)\)
- false positives are counted for extraneous predictions: \(FP_{c}=\max(0,P_c-R_c)\)
- \(K_c\) predictions are spatially associated with references using the Hungarian algorithm: \(K_c=\min(P_c,R_c)\). Those can also be considered as the unthresholded true positives \(TP_c = K_c\).
- the application of the spatial thresholds moves the \(L_c\leq K_c\) predictions that exceed the thresholds to false positives: \(FP_{c,(\Omega>T_{DOA})\cup(\Delta>T_{RD})} = L_c\), and \(FP_{c,+} = FP_{c}+FP_{c,(\Omega > T_{DOA})\cup(\Delta> T_{RD})}\)
- the remaining matched estimates per class are counted as true positives: \(TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})} = K_c-FP_{c,(\Omega> T_{DOA})\cup(\Delta>T_{RD})}\)
- finally, the predictions decompose as \(P_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})}+ FP_{c,+}\), while the references decompose as \(R_c = TP_{c,(\Omega\leq T_{DOA})\cap(\Delta\leq T_{RD})}+FP_{c,(\Omega> T_{DOA})\cup(\Delta> T_{RD})}+FN_c\)
Based on those, we form the location-dependent F-score \(F_{c,LD}\). We perform macro-averaging of the location-dependent F-score: \(F_{LD}= \sum_c F_{c,LD}/C\).
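The counting steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official evaluation code: it assumes the Hungarian matching has already been done and that the \(K_c\) matched pairs arrive as `(omega, delta)` tuples, it assumes the F-score follows the usual precision/recall definition with \(P_c\) and \(R_c\) as denominators, and it assumes a class with no predictions and no references scores 0. All function and key names are hypothetical.

```python
def frame_counts(num_pred, num_ref, matched_errors, t_doa=20.0, t_rd=1.0):
    """Count TP/FP/FN for one class in one frame.

    matched_errors holds one (omega_deg, delta) tuple for each of the
    K_c = min(P_c, R_c) matched prediction/reference pairs.
    """
    k = min(num_pred, num_ref)
    if len(matched_errors) != k:
        raise ValueError("expected one error pair per matched prediction")
    fn = max(0, num_ref - num_pred)   # misses
    fp = max(0, num_pred - num_ref)   # extraneous predictions
    # Matched predictions that violate either spatial threshold.
    fp_spatial = sum(
        1 for omega, delta in matched_errors if omega > t_doa or delta > t_rd
    )
    tp = k - fp_spatial
    return {"tp": tp, "fp": fp, "fp_spatial": fp_spatial, "fn": fn}


def f_score(tp, fp, fp_spatial, fn):
    """Location-dependent F-score from counts (assumed 2*TP / (P_c + R_c))."""
    num_pred = tp + fp + fp_spatial   # P_c = TP + FP_{c,+}
    num_ref = tp + fp_spatial + fn    # R_c
    if num_pred + num_ref == 0:
        return 0.0                    # assumption: empty class scores 0
    return 2.0 * tp / (num_pred + num_ref)


def macro_f_score(per_class_counts):
    """Macro-average: mean of the per-class F-scores."""
    scores = [f_score(**counts) for counts in per_class_counts]
    return sum(scores) / len(scores)
```

For example, a frame with three predictions, two references, and matched errors `[(10.0, 0.2), (35.0, 0.1)]` yields one true positive, one extraneous false positive, and one spatial false positive. Note that a spatial false positive appears in both \(P_c\) and \(R_c\), so it lowers precision and recall at the same time.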
We evaluate DOA localization accuracy through a class-dependent DOA error \(DOAE_c\), computed as the mean angular error of the matched true positives per class, and then macro-averaged:
- \(DOAE_c = \sum_k \Omega_k/ K_c = \sum_k \Omega_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Omega_k\) being the angular error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(DOAE = \sum_c DOAE_c/C\).
Finally, distance localization accuracy is evaluated through a class-dependent relative distance error \(RDE_c\), computed as the mean relative distance error of the matched true positives per class, and then macro-averaged:
- \(RDE_c = \sum_k \Delta_k/ K_c = \sum_k \Delta_k /TP_c\) for each frame or segment with \(K_c>0\), and with \(\Delta_k\) being the relative distance error between the \(k\)th matched prediction and reference,
- and after averaging across all frames that have any true positives, \(RDE = \sum_c RDE_c/C\).
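The averaging scheme shared by \(DOAE\) and \(RDE\) can be sketched as below. This is a minimal illustration, not the official implementation: it takes, for each class, a list of frames with each frame holding the unthresholded errors (\(\Omega_k\) or \(\Delta_k\)) of its matched pairs. How classes without any matched pair enter the macro-average is not specified in the text; the sketch assumes they are skipped. Function names are hypothetical.

```python
def class_mean_error(frame_errors):
    """Mean error for one class.

    frame_errors is a list of frames; each frame is a list of errors for
    the matched pairs in that frame (empty if K_c == 0 in that frame).
    """
    # Per-frame mean over the K_c matched pairs, only where K_c > 0.
    frame_means = [sum(errs) / len(errs) for errs in frame_errors if errs]
    if not frame_means:
        return None  # class never matched in any frame
    return sum(frame_means) / len(frame_means)


def macro_average(per_class_values):
    """Mean over classes, skipping classes with no value (assumption)."""
    valid = [v for v in per_class_values if v is not None]
    return sum(valid) / len(valid) if valid else None
```

For instance, a class with matched angular errors `[[10.0, 20.0], [], [30.0]]` over three frames has frame means of 15° and 30°, giving \(DOAE_c = 22.5^\circ\); the empty frame does not contribute.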
Note that the DOA and distance localization errors are not thresholded in order to give more varied complementary information to the location-dependent F1-score, presenting localization accuracy outside of the spatial threshold.
Ranking
Overall ranking will be based on the cumulative rank of the metrics mentioned above, sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as \(F1:1, DOAE:3, RDE: 1\), then its cumulative rank is \(1+3+1=5\). Then if system B has \(F1:2, DOAE:2, RDE:3\) (7), and system C has \(F1:3, DOAE:1, RDE:2\) (6), then the overall rank of the systems is A,C,B. If two systems end up with the same cumulative rank, then they are assumed to have equal place in the challenge, even though they will be listed alphabetically in the ranking tables.
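The cumulative ranking can be reproduced with a few lines of Python. The sketch below uses the three example systems from the paragraph above; the function name and the data layout are hypothetical. Ties in cumulative rank share the same place, and the sort falls back to the system name only to fix the listing order.

```python
def overall_ranking(metric_ranks):
    """Order systems by the sum of their per-metric ranks (lower is better)."""
    totals = {name: sum(ranks.values()) for name, ranks in metric_ranks.items()}
    # Sort by cumulative rank, then alphabetically for tied systems.
    ordered = sorted(totals, key=lambda name: (totals[name], name))
    return ordered, totals


example = {
    "A": {"F1": 1, "DOAE": 3, "RDE": 1},
    "B": {"F1": 2, "DOAE": 2, "RDE": 3},
    "C": {"F1": 3, "DOAE": 1, "RDE": 2},
}
order, totals = overall_ranking(example)
print(order)   # ['A', 'C', 'B']
print(totals)  # {'A': 5, 'B': 7, 'C': 6}
```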
Results
The SELD task received 47 submissions in total from 13 teams across the world. From those, 29 submissions were on the audio-only track, and 18 submissions on the audiovisual track.
Complete results and technical reports can be found on the results page.
The main results for these submissions are as follows (the tables below include only the best-performing system per submitting team):
Track A: Audio-only
Metrics are reported on the evaluation dataset.
Rank | Submission | Corresponding author | Affiliation | F-score (20°/1) | DOA error (°) | Relative distance error |
---|---|---|---|---|---|---|
1 | Du_NERCSLIP_task3a_4 | Qing Wang | University of Science and Technology of China | 54.4 (48.9 - 59.2) | 13.6 (12.4 - 15.0) | 0.21 (0.18 - 0.23) |
2 | Yu_HYUNDAI_task3a_3 | Hogeon Yu | Hyundai Motor Company | 29.8 (25.1 - 34.2) | 19.8 (18.3 - 21.6) | 0.28 (0.25 - 0.32) |
3 | Yeow_NTU_task3a_2 | Jun Wei Yeow | Nanyang Technological University | 26.2 (22.0 - 30.5) | 25.1 (23.2 - 27.6) | 0.26 (0.22 - 0.28) |
4 | Guan_CQUPT_task3a_4 | Xin Guan | Chongqing University of Posts and Telecommunications | 26.7 (22.7 - 31.1) | 18.6 (17.4 - 21.8) | 0.36 (0.34 - 0.39) |
5 | Vo_DU_task3a_1 | Quoc Thinh Vo | Drexel University | 24.7 (20.8 - 28.4) | 19.3 (17.7 - 21.3) | 0.34 (0.30 - 0.37) |
6 | Berg_LU_task3a_3 | Axel Berg | Lund University, Arm | 25.5 (21.8 - 29.6) | 23.2 (18.2 - 28.8) | 0.39 (0.34 - 0.44) |
7 | Sun_JLESS_task3a_1 | Wenqiang Sun | Northwestern Polytechnical University | 28.5 (24.2 - 33.0) | 23.8 (21.5 - 25.9) | 0.51 (0.49 - 0.53) |
8 | Qian_IASP_task3a_1 | Yuanhang Qian | Wuhan University | 22.8 (18.6 - 26.8) | 27.2 (24.6 - 29.8) | 0.36 (0.31 - 0.42) |
9 | AO_Baseline_FOA | Parthasaarathy Sudarsanam | Tampere University | 18.0 (14.6 - 21.7) | 29.6 (24.6 - 33.3) | 0.31 (0.28 - 0.36) |
10 | Zhang_BUPT_task3a_1 | Zhicheng Zhang | Beijing University of Posts and Telecommunications | 19.0 (16.1 - 21.8) | 29.6 (26.6 - 32.9) | 0.40 (0.32 - 0.48) |
11 | Chen_ECUST_task3a_1 | Ning Chen | East China University of Science and Technology | 15.1 (12.2 - 17.9) | 28.3 (25.5 - 30.9) | 0.48 (0.39 - 0.59) |
12 | Li_BIT_task3a_1 | Jiahao Li | Beijing Institution of Technology | 16.9 (13.4 - 20.5) | 33.5 (30.0 - 42.7) | 0.51 (0.26 - 1.25) |
Track B: Audiovisual
Metrics are reported on the evaluation dataset.
Rank | Submission | Corresponding author | Affiliation | F-score (20°/1) | DOA error (°) | Relative distance error |
---|---|---|---|---|---|---|
1 | Du_NERCSLIP_task3b_4 | Qing Wang | University of Science and Technology of China | 55.8 (51.2 - 60.4) | 11.4 (10.4 - 12.5) | 0.25 (0.22 - 0.29) |
2 | Berghi_SURREY_task3b_4 | Davide Berghi | University of Surrey | 39.2 (33.9 - 44.3) | 15.8 (14.2 - 17.4) | 0.29 (0.25 - 0.32) |
3 | Li_SHU_task3b_2 | Yongbo Li | Shanghai University | 34.2 (29.9 - 38.4) | 21.5 (19.8 - 23.4) | 0.28 (0.25 - 0.31) |
4 | Guan_CQUPT_task3b_2 | Xin Guan | Chongqing University of Posts and Telecommunications | 23.2 (19.2 - 27.2) | 18.8 (17.3 - 21.5) | 0.32 (0.28 - 0.37) |
5 | Berg_LU_task3b_3 | Axel Berg | Lund University, Arm | 25.9 (22.1 - 30.1) | 23.2 (18.2 - 28.8) | 0.33 (0.28 - 0.38) |
6 | Chen_ECUST_task3b_1 | Ning Chen | East China University of Science and Technology | 16.3 (13.7 - 19.3) | 25.1 (22.3 - 26.9) | 0.32 (0.27 - 0.39) |
7 | AV_Baseline_MIC | Parthasaarathy Sudarsanam | Tampere University | 16.0 (12.1 - 20.0) | 35.9 (31.8 - 39.6) | 0.30 (0.27 - 0.33) |
Baseline system
The baselines for both Track A (audio-only inference) and Track B (audiovisual inference) have been unified this year into a common codebase with a flag to switch from one to the other. Additionally, distance estimation has been integrated into the baselines. The output representation of the task objectives is still based on the multi-ACCDOA format as in the previous year, but it is extended to a 4-element vector, with the first three elements representing the activity-coupled DOA estimate and the 4th element representing the estimated distance. For more details on this extended ACCDOA representation and on using it to train an earlier version of the baseline, you can refer to this recent pre-print:
Daniel A. Krause, Archontis Politis, and Annamaria Mesaros. Sound event detection and localization with distance estimation. arXiv, 2024.
Sound Event Detection and Localization with Distance Estimation
Abstract
Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection, Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core - a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
Details on the multi-ACCDOA representation can be found here:
Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, and Yuki Mitsufuji. Multi-ACCDOA: localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.
Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training
Abstract
Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi one and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi- ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.
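To make the extended 4-element output concrete, here is a hedged Python sketch of how a single class/track vector could be decoded. It relies on the ACCDOA convention that the length of the Cartesian DOA vector encodes event activity; the activity threshold of 0.5 and the function name are assumptions for illustration, and the baseline's actual decoding (including how tracks of the same class are merged) may differ.

```python
import math


def decode_track(vec4, activity_threshold=0.5):
    """Decode one extended ACCDOA vector (x, y, z, distance).

    Returns None when the track is considered inactive, otherwise a dict
    with the activity value, the unit-norm DOA, and the distance estimate.
    """
    x, y, z, dist = vec4
    # The norm of the first three elements is read as the event activity.
    activity = math.sqrt(x * x + y * y + z * z)
    if activity <= activity_threshold:  # threshold value is an assumption
        return None
    doa = (x / activity, y / activity, z / activity)
    return {"activity": activity, "doa": doa, "distance": dist}
```

With this convention, `decode_track((0.0, 0.9, 0.0, 2.5))` is read as an active event along the positive y axis with an estimated distance of 2.5, whereas a short vector such as `(0.1, 0.1, 0.1, 3.0)` is treated as inactive and its distance element is ignored.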
Track A: Audio-only baseline
The baseline for the audio-only inference track remains similar to the DCASE2023 baseline, with the exception of the extended multi-ACCDOA representation. The input features remain the same: namely, mel-band spectra for the FOA and MIC formats, with mel-band aggregated acoustic intensity vectors for FOA and generalized cross-correlation (GCC) sequences for MIC as spatial features. Additionally, for the MIC format there is the option of the SALSA-lite spatial features, without mel-band aggregation in this case.
Details on the SALSA-lite spatial features can be found in:
Thi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, and Woon-Seng Gan. SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore, May 2022.
SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays
Abstract
Polyphonic sound event localization and detection (SELD) has many practical applications in acoustic sensing and monitoring. However, the development of real-time SELD has been limited by the demanding computational requirement of most recent SELD systems. In this work, we introduce SALSA-Lite, a fast and effective feature for polyphonic SELD using microphone array inputs. SALSA-Lite is a lightweight variation of a previously proposed SALSA feature for polyphonic SELD. SALSA, which stands for Spatial Cue-Augmented Log-Spectrogram, consists of multichannel log-spectrograms stacked channelwise with the normalized principal eigenvectors of the spectrotemporally corresponding spatial covariance matrices. In contrast to SALSA, which uses eigenvector-based spatial features, SALSA-Lite uses normalized inter-channel phase differences as spatial features, allowing a 30-fold speedup compared to the original SALSA feature. Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset showed that the SALSA-Lite feature achieved competitive performance compared to the full SALSA feature, and significantly outperformed the traditional feature set of multichannel log-mel spectrograms with generalized cross-correlation spectra. Specifically, using SALSA-Lite features increased localization-dependent F1 score and class-dependent localization recall by 15% and 5%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
Track B: Audiovisual baseline
While the audio-only baseline system takes only the audio input, the audiovisual baseline system takes both the audio and a visual input. In the previous year's audiovisual baseline, an explicit object detector provided bounding boxes for detected objects, which were further converted into Gaussian spatial vectors and encoded into features before concatenation with the extracted audio features. In this year's AV baseline, a pre-trained ResNet-50 network is used to extract video embeddings, inspired by the work of:
Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, and Philip J.B. Jackson. Fusion of audio and visual embeddings for sound event localization and detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, South Korea, April 2024.
Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection
Abstract
Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.
These video features are then fused with the extracted audio features using cross-modal attention blocks. For more details about the features and the network architecture, please refer to the baseline repository below.
Both baselines and a more detailed description of them can be found in:
Results for the development dataset
The evaluation metric scores for the test split of the development dataset are given below. The location-dependent F-score is computed within a 20° angular threshold and a 100% relative distance threshold from the reference.
Track A: Audio-only baseline
Dataset | macro F20°/1 (%) | DOAE (°) | RDE (%) |
---|---|---|---|
Ambisonic | 13.1 % | 36.9° | 33 % |
Microphone array | 9.9 % | 38.1° | 30 % |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Track B: Audiovisual baseline
Dataset | macro F20°/1 (%) | DOAE (°) | RDE (%) |
---|---|---|---|
Ambisonic | 11.3 % | 38.4° | 36 % |
Microphone array | 11.8 % | 38.5° | 29 % |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Citation
If you are participating in this task or using the dataset and code, please consider citing the following papers:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, and Tuomas Virtanen. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 125–129. Nancy, France, November 2022. URL: https://dcase.community/workshop2022/proceedings.
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Abstract
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset of spatial recordings of real sound scenes collected in various interiors at two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events belonging to 13 target classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. STARSS22 serves as the development and evaluation dataset for Task 3 (Sound Event Localization and Detection) of the DCASE2022 Challenge and it introduces significant new challenges with regard to the previous iterations, which were based on synthetic data. Additionally, the report introduces the baseline system that accompanies the dataset with emphasis on its differences to the baseline of the previous challenge. Baseline results indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6600531.
Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel A. Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, and Yuki Mitsufuji. STARSS23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, 72931–72957. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/hash/e6c9671ed3b3106b71cafda3ba225c1a-Abstract-Datasets_and_Benchmarks.html.
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Abstract
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.