Sound Event Localization and Detection with Directional Interference


Task description

The goal of this task is to recognize individual sound events of specific classes, detect their temporal activity, and estimate their location during it, in the presence of interfering directional events not belonging to the target classes and spatial ambient noise.

Description

Given multichannel audio input, a sound event localization and detection (SELD) system outputs a temporal activation track for each of the target sound classes, along with one or more corresponding spatial trajectories when the track indicates activity. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation without visual input or with occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and audio surveillance, among others.

The task setup remains mostly unchanged from the previous year's DCASE2020 Challenge. The main difference is the emulation of scene recordings with a more natural temporal distribution of target events and, more importantly, the inclusion of directional interferences, meaning sound events outside the target classes that are also point-like in nature. For each reverberant environment and every emulated recording, interferences are spatialized in the same way as the target events, resulting in recordings that are more challenging and closer to real-life conditions.

Figure 1: Overview of sound event localization and detection system.


Audio dataset

The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. Apart from the spatialized sound events of the target classes, diverse sound events not belonging to any of the target classes are also included in the scene. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats, a microphone array one (MIC), and first-order Ambisonics one (FOA). The sound events are spatialized as either stationary sound sources in the room, or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) to the recording point, and a temporal onset and offset time. The isolated sound event recordings used for the synthesis of the sound scenes are obtained from the NIGENS general sound events database. These recordings serve as the development dataset for the DCASE 2021 Sound Event Localization and Detection Task of the DCASE 2021 Challenge.

The RIRs were collected in Finland by staff of Tampere University between 12/2017 - 06/2018, and between 11/2019 - 1/2020. The RIRs, or subsets of them, have also been used in the datasets associated with the two earlier iterations of the challenge: the TAU Spatial Sound Events 2019 development and evaluation datasets, with RIRs from 5 rooms, and the TAU-NIGENS Spatial Sound Events 2020 dataset, with RIRs from 13 rooms. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

A detailed description of the impulse response collection and dataset generation can be found in the following paper:

Publication

Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. URL: https://arxiv.org/abs/2006.01919.

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

Abstract

This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datasets of diverse sound events occurring under realistic acoustic conditions are needed. Compared to the previous challenge, a significantly more complex dataset was created for DCASE 2020. The two key differences are a more diverse range of acoustical conditions, and dynamic conditions, i.e. moving sources. The spatial sound scenes are created using real room impulse responses captured in a continuous manner with a slowly moving excitation source. Both static and moving sound events are synthesized from them. Ambient noise recorded on location is added to complete the generation of scene recordings. A baseline SELD method accompanies the dataset, based on a convolutional recurrent neural network, to provide benchmark scores for the task. The baseline is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.

and in a longer non-peer-reviewed version with additional details in the challenge technical report.

Recording procedure

To construct a realistic dataset, real-life IR recordings were collected using an Eigenmike spherical microphone array. A Genelec G Three loudspeaker was used to playback a maximum length sequence (MLS) around the Eigenmike. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and far-field recording independently at each frequency.

The IRs were recorded at fifteen different indoor locations inside the Tampere University campus at Hervanta, Finland. Additionally, 30 minutes of ambient noise recordings were collected at the same locations with the IR recording setup unchanged. For each space IRs were collected with the source placed along specified trajectories, at various heights. The IR trajectories, source directions and distances differ with the space. Possible azimuths span the whole range of \(\phi\in[-180,180)\), while the elevations span approximately a range between \(\theta\in[-45,45]\) degrees. A summary of the measured spaces is as follows:


  1. Large common area with multiple seating tables and carpet flooring. People chatting and working.
  2. Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
  3. High ceiling corridor with hard flooring. People walking around and chatting.
  4. Corridor with classrooms around and hard flooring. People walking around and chatting.
  5. Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.
  6. (2x) Large lecture halls with inclined floor. Ventilation noise.
  7. (2x) Modern classrooms with multiple seating tables and carpet flooring. Ventilation noise.
  8. (2x) Meeting rooms with hard floor and partially glass walls. Ventilation noise.
  9. (2x) Old-style large classrooms with hard floor and rows of desks. Ventilation noise.
  10. Large open space in underground bomb shelter, with plastic floor and rock walls. Ventilation noise.
  11. Large open gym space. People using weights and gym equipment.

Recording formats

The array response of the two recording formats can be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from direction-of-arrival (DOA) given by azimuth angle \(\phi\) and elevation angle \(\theta\).

For the first-order ambisonics (FOA):

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sin(\phi) * \cos(\theta) \\ H_3(\phi, \theta, f) &=& \sin(\theta) \\ H_4(\phi, \theta, f) &=& \cos(\phi) * \cos(\theta) \end{eqnarray}

The (FOA) format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that in the formulas above the encoding format is assumed frequency-independent, something that holds true up to around 9kHz with the specific microphone array, while the actual encoded responses start to deviate gradually at higher frequencies from the ideal ones provided above.
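The ideal FOA responses above can be evaluated directly. The following is a minimal sketch of those four formulas, not part of the challenge code; the function name `foa_steering_vector` is an assumption made for illustration, and angles are taken in degrees as in the metadata.

```python
import math


def foa_steering_vector(azimuth_deg, elevation_deg):
    """Return the ideal, frequency-independent FOA responses [H1, H2, H3, H4].

    Hypothetical helper: it only evaluates the four formulas given in the text.
    """
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return [
        1.0,                              # H1: omnidirectional
        math.sin(phi) * math.cos(theta),  # H2
        math.sin(theta),                  # H3
        math.cos(phi) * math.cos(theta),  # H4
    ]


front = foa_steering_vector(0, 0)   # source straight ahead
left = foa_steering_vector(90, 0)   # source at the left (azimuth +90 degrees)
```

As stated above, these ideal responses only approximate the encoded signals up to around 9kHz.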

For the tetrahedral microphone array (MIC):

The four microphones have the following positions, in spherical coordinates \((\phi, \theta, r)\):

\begin{eqnarray} M1: &\quad(&45^\circ, &&35^\circ, &4.2\mathrm{cm})\nonumber\\ M2: &\quad(&-45^\circ, &-&35^\circ, &4.2\mathrm{cm})\nonumber\\ M3: &\quad(&135^\circ, &-&35^\circ, &4.2\mathrm{cm})\nonumber\\ M4: &\quad(&-135^\circ, &&35^\circ, &4.2\mathrm{cm})\nonumber \end{eqnarray}

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

\begin{equation} H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m)) \end{equation}

where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\)m is the array radius, \(c = 343\)m/s is the speed of sound, \(\cos(\gamma_m)\) is the cosine of the angle between the microphone and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative with respect to the argument of a spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
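The only direction-dependent quantity in the expansion is \(\cos(\gamma_m)\). The sketch below computes it for each microphone as a dot product of unit vectors, assuming the azimuth/elevation convention used in this description. It does not evaluate the Hankel or Legendre terms, and the names `MIC_POSITIONS_DEG`, `unit_vector` and `cos_gamma` are illustrative assumptions.

```python
import math

# Microphone (azimuth, elevation) positions in degrees, as listed in the text.
MIC_POSITIONS_DEG = [(45, 35), (-45, -35), (135, -35), (-135, 35)]


def unit_vector(azimuth_deg, elevation_deg):
    """Convert azimuth/elevation in degrees to a Cartesian unit vector."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return (
        math.cos(theta) * math.cos(phi),
        math.cos(theta) * math.sin(phi),
        math.sin(theta),
    )


def cos_gamma(mic_deg, doa_deg):
    """Cosine of the angle between a microphone position and a DOA."""
    m = unit_vector(*mic_deg)
    d = unit_vector(*doa_deg)
    return sum(a * b for a, b in zip(m, d))


# cos(gamma_m) for all four channels and a source at azimuth 45, elevation 35.
cosines = [cos_gamma(mic, (45, 35)) for mic in MIC_POSITIONS_DEG]
```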

Sound event classes

To generate the spatial sound scenes the measured room IRs are convolved with dry recordings of sound samples belonging to distinct sound classes. The sound event database of sound samples used for that purpose is the recent NIGENS general sound events database:

The 12 target sound classes of the spatialized events are:

  1. alarm
  2. crying baby
  3. crash
  4. barking dog
  5. female scream
  6. female speech
  7. footsteps
  8. knocking on door
  9. male scream
  10. male speech
  11. ringing phone
  12. piano

Additionally, dry recordings of disparate sounds not belonging to any of those classes are also spatialized in the same way to serve as directional interference. The sounds are sourced from the running engine, burning fire, and general classes of the NIGENS database.

Dataset specifications

The specifications of the dataset can be summarized in the following:

  • 600 one-minute long sound scene recordings with metadata (development dataset).
  • 200 one-minute long sound scene recordings without metadata (evaluation dataset).
  • Sampling rate 24kHz.
  • About 500 sound event samples distributed over the 12 target classes (see here for more details).
  • About 400 sound event samples used as interference events (see here for more details).
  • Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array.
  • Realistic spatialization and reverberation through multichannel RIRs collected in 13 different enclosures.
  • From 1184 to 6480 possible RIR positions across the different rooms.
  • Both static reverberant and moving reverberant sound events.
  • Three possible angular speeds for moving sources of approximately 10, 20, or 40deg/sec.
  • Up to three overlapping sound events possible, temporally and spatially.
  • Simultaneous directional interfering sound events with their own temporal activities, static or moving.
  • Realistic spatial ambient noise collected from each room is added to the spatialized sound events, at varying signal-to-noise ratios (SNR) ranging from noiseless (30dB) to noisy (6dB) conditions.

Each recording corresponds to a single room, and allows overlap of two simultaneous sources, or no overlap. Each event spatialized in the recording has equal probability of being either static or moving, and is randomly assigned one of the room RIR positions, or motion along one of the densely measured RIR trajectories. The moving sound events are synthesized with a slow (10deg/sec), moderate (20deg/sec), or fast (40deg/sec) angular speed. A partitioned time-frequency interpolation scheme of the RIRs extracted from the measurements at regular intervals is used to approximate the time-variant room response corresponding to the target motion.

A schematic example of a scene recording, with 4 target classes, 8 target events, 2 interference events, of which 2 targets and 1 interferer are moving, is depicted below. The maximum number of co-occurring target events (polyphony) in this case is 3.

Figure 2: Sketch of the spatial distribution and temporal distribution of events in an example scene recording.


In the development dataset, eleven out of the thirteen rooms along with the NIGENS event samples are assigned to 6 disjoint sets, and their combinations form 6 distinct splits of 100 recordings each. The splits permit testing and validation across different acoustic conditions. The remaining two rooms are used for the evaluation dataset.

Reference labels and directions-of-arrival

For each recording in the development dataset, the labels and DoAs are provided in a plain text CSV file of the same filename as the recording, in the following format:

[frame number (int)], [active class index (int)], [event number index (int)], [azimuth (int)], [elevation (int)]

Frame, class, and track enumeration begins at 0. Frames correspond to a temporal resolution of 100msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth \(\phi \in [-180^{\circ}, 180^{\circ}]\), and elevation \(\theta \in [-90^{\circ}, 90^{\circ}]\). Note that the azimuth angle is increasing counter-clockwise (\(\phi = 90^{\circ}\) at the left).

The event number index is a unique integer for each event in the recording, enumerating them in the order of appearance. These event identifiers are useful to disentangle directions of co-occurring events through time in the metadata file. The interferers are considered unknown and no activity or direction labels of them are provided with the training datasets.

Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be:

10,     1,  0,  -50,  30
11,     1,  0,  -50,  30
11,     1,  1,   10, -20
12,     1,  1,   10, -20
13,     1,  1,   10, -20
13,     4,  2,  -40,   0

which describes that in frames 10-11 the first instance (event 0) of class crying baby (class 1) is active; at frame 11 a second instance (event 1) of the same class appears simultaneously at a different direction and stays active through frame 13, while at frame 13 an additional event of class 4 appears. Frames that contain no sound events are not included in the sequence. Note that event identifier information is only included in the development metadata and is not required to be provided by the participants in their results.
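The metadata format lends itself to a frame-indexed lookup. The following is a minimal sketch of reading such rows, assuming the five-column layout described above; `load_labels` is a hypothetical helper rather than part of the challenge code, and it reads from a string so that the example is self-contained.

```python
import csv
import io
from collections import defaultdict


def load_labels(text):
    """Map frame index -> list of (class_idx, event_idx, azimuth, elevation)."""
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip empty lines
        frame, cls, event, azi, ele = (int(value.strip()) for value in row)
        frames[frame].append((cls, event, azi, ele))
    return dict(frames)


# The example sequence from the text.
EXAMPLE = """\
10,     1,  0,  -50,  30
11,     1,  0,  -50,  30
11,     1,  1,   10, -20
12,     1,  1,   10, -20
13,     1,  1,   10, -20
13,     4,  2,  -40,   0
"""

labels = load_labels(EXAMPLE)
overlap_frames = [f for f, events in sorted(labels.items()) if len(events) > 1]
```

Frames with more than one entry are the overlapping ones; in the example these are frames 11 and 13.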

In the scenario that a participant would like to use a higher temporal resolution in their estimation than 100msec, we would recommend for an integer number of sub-frames to be used, to simplify processing of the metadata. A simple example routine performing (linear) spherical interpolation of directions is provided here (where e.g. for a sub-frame of 20msec, four interpolated directions are returned between the two input directions spaced at 100msec).
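As an illustration of the sub-frame idea, the sketch below interpolates along the great circle between two directions. It is not the routine referred to above; the function names are assumptions, and nearly identical or antipodal directions simply fall back to the starting direction.

```python
import math


def to_unit(azimuth_deg, elevation_deg):
    """Azimuth/elevation in degrees to a Cartesian unit vector."""
    phi, theta = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(theta) * math.cos(phi),
            math.cos(theta) * math.sin(phi),
            math.sin(theta))


def to_angles(v):
    """Cartesian vector to (azimuth, elevation) in degrees."""
    norm = math.sqrt(sum(c * c for c in v))
    x, y, z = (c / norm for c in v)
    return (math.degrees(math.atan2(y, x)),
            math.degrees(math.asin(max(-1.0, min(1.0, z)))))


def interpolate_directions(start, end, subframes_per_frame):
    """Return the subframes_per_frame - 1 directions strictly between start and end."""
    a, b = to_unit(*start), to_unit(*end)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    omega = math.acos(dot)  # angle between the two directions
    result = []
    for k in range(1, subframes_per_frame):
        t = k / subframes_per_frame
        if math.sin(omega) < 1e-9:
            v = a  # identical or antipodal directions: path undefined, hold start
        else:
            wa = math.sin((1.0 - t) * omega) / math.sin(omega)
            wb = math.sin(t * omega) / math.sin(omega)
            v = tuple(wa * p + wb * q for p, q in zip(a, b))
        result.append(to_angles(v))
    return result


# 100 msec frames split into 20 msec sub-frames: four interpolated directions.
between = interpolate_directions((0, 0), (90, 0), 5)
```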

Download

The development version of the dataset can be downloaded at:


Task setup

In order to allow a fair comparison of methods on the development dataset, participants are required to report results using the following split:

Training folds | Validation fold | Testing fold
1, 2, 3, 4 | 5 | 6

The evaluation dataset is released a few weeks before the final submission deadline. This dataset consists of only audio recordings without any metadata/labels. Participants can decide the training procedure, i.e. the amount of training and validation files in the development dataset, the number of ensemble models etc., and submit the results of the SELD performance on the evaluation dataset.

Development dataset

The recordings in the development dataset follow the naming convention:

fold[fold number]_room[room number per fold]_mix[recording number per room per split].wav

Note that the room number only distinguishes different rooms used inside a split. For example, room1 in the first split is not the same as room1 in the second split. The room information is provided for the user of the dataset to understand the performance of their method with respect to different conditions.
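The naming convention can be parsed to recover the fold and, from it, the role of each recording in the required split (folds 1 to 4 for training, 5 for validation, 6 for testing). The sketch below is one possible way to do this; the helper name and the zero-padded example filename are assumptions, not taken from the dataset.

```python
import re

# Pattern following fold[...]_room[...]_mix[...].wav from the text.
NAME_PATTERN = re.compile(r"fold(\d+)_room(\d+)_mix(\d+)\.wav$")

# Required development split: folds 1-4 train, fold 5 validation, fold 6 test.
SPLIT_OF_FOLD = {1: "train", 2: "train", 3: "train", 4: "train", 5: "val", 6: "test"}


def parse_recording_name(name):
    """Extract fold, room and mix numbers, plus the split, from a file name."""
    match = NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unexpected recording name: {name}")
    fold, room, mix = (int(group) for group in match.groups())
    return {"fold": fold, "room": room, "mix": mix, "split": SPLIT_OF_FOLD[fold]}


# Hypothetical file name; the zero padding is an assumption.
info = parse_recording_name("fold5_room1_mix007.wav")
```

Since room numbers are only unique within a split, any per-room analysis should key on the (fold, room) pair rather than on the room number alone.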

Compared to the TAU-NIGENS Spatial Sound Events 2020 dataset of the previous iteration, there is no indication of polyphony in each recording, as all recordings now allow conditions with the maximum polyphony of 3. However, all lower polyphonies and silences can occur in each recording.

Evaluation dataset

The evaluation dataset will be released in spring before the evaluation phase of the challenge commences.

Task rules

  • Use of external data is not allowed.
  • Manipulation of the provided cross-validation split in the development dataset is not allowed.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
  • Use of the dry event samples from the NIGENS database for additional data generation or augmentation is not allowed.
  • The development dataset can be augmented without the use of external data e.g. using techniques such as pitch shifting or time stretching, respatialization or re-reverberation of parts, etc.

Submission

The results for each of the 200 recordings in the evaluation dataset should be collected in individual CSV files. Each result file should have the same name as the file name of the respective audio recording, but with the .csv extension, and should contain the same information at each row as the reference labels, excluding the event id:

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]

Enumeration of frame and class indices begins at zero. The class indices are as ordered in the class descriptions mentioned above. The evaluation will be performed at a temporal resolution of 100msec. In case the participants use a different frame or hop length for their study, we expect them to use a suitable method to resample the information at the specified resolution before submitting the evaluation results.
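A minimal sketch of producing rows in the submission format is given below, assuming the system already outputs one (frame, class, azimuth, elevation) tuple per active event at the 100msec resolution. The function name is hypothetical, and rounding to the nearest integer degree is an assumption that mirrors how the reference labels are described.

```python
def format_submission_rows(predictions):
    """Format (frame, class_idx, azimuth, elevation) tuples as CSV rows.

    Angles are rounded to integer degrees; note that Python's round()
    rounds exact halves to the nearest even integer.
    """
    rows = []
    for frame, cls, azi, ele in sorted(predictions):
        rows.append(f"{int(frame)},{int(cls)},{round(azi)},{round(ele)}")
    return rows


rows = format_submission_rows([
    (11, 1, 9.7, -20.2),
    (10, 1, -49.6, 30.4),
])
```

Writing `"\n".join(rows)` to a file named after the recording, with the .csv extension, would then give one result file per recording.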

In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, meta-information should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information regarding the challenge submission can be found on the submission page.

General information for all DCASE submissions can be found on the Submission page.

Evaluation

The evaluation will be similar to the one employed in DCASE2020. Metrics evaluating jointly detection and localization performance are used.

Metrics

The first two metrics are the classic sound event detection (SED) metrics of F-score (\(F_{\leq T^\circ}\)) and Error Rate (\(ER_{\leq T^\circ}\)), but are now location-dependent, considering true positives predicted only under a distance threshold \(T^\circ\) (angular in our case) from the reference. For the evaluation of this challenge we take this threshold to be \(T = 20^\circ\).
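The location-dependent criterion can be illustrated with the great-circle distance between a predicted and a reference direction. The sketch below is not the official evaluation code; it only shows the thresholding idea for a single prediction/reference pair, and both function names are assumptions.

```python
import math


def angular_distance_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two azimuth/elevation directions."""
    a1, e1, a2, e2 = (math.radians(v) for v in (az1, el1, az2, el2))
    cos_d = (math.sin(e1) * math.sin(e2)
             + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))


def is_location_dependent_tp(pred, ref, threshold_deg=20.0):
    """pred and ref are (class_idx, azimuth, elevation) tuples.

    A prediction counts as a true positive only if the class matches and
    the angular distance to the reference is within the threshold.
    """
    same_class = pred[0] == ref[0]
    close = angular_distance_deg(pred[1], pred[2], ref[1], ref[2]) <= threshold_deg
    return same_class and close
```

For example, a prediction of class 1 at (10, -20) against a reference of class 1 at (20, -20) is within the threshold, whereas the same prediction against a reference at (40, -20) or against a different class is not.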

The next two metrics are focused on the localization part, but are now classification-dependent, meaning that they are computed only within each class, instead of across all outputs. The first is the localization error \(LE_{\mathrm{CD}}\) expressing average angular distance between predictions and references of the same class. The second is a simple localization recall metric \(LR_{\mathrm{CD}}\) expressing the true positive rate of how many of these localization estimates were detected in a class, out of the total class instances. Unlike the location-dependent detection, these localization metrics do not use any threshold.

All metrics are computed in one-second non-overlapping frames. For a more thorough analysis on the joint SELD metrics please refer to:

Publication

Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, and Tuomas Virtanen. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:684–698, 2020. URL: https://arxiv.org/abs/2009.02792.

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

Abstract

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

Ranking

Overall ranking will be based on the cumulative rank of the four metrics mentioned above, sorted in ascending order. By cumulative rank we mean the following: if system A was ranked individually for each metric as \(ER:1, F1:1, LE:3, LR:1\), then its cumulative rank is \(1+1+3+1=5\). Then if system B has \(ER:3, F1:2, LE:2, LR:3\) (10), and system C has \(ER:2, F1:3, LE:1, LR:2\) (8), then the overall rank of the systems is A,C,B. If two systems end up with the same cumulative rank, then they are assumed to have equal place in the challenge, even though they will be listed alphabetically in the ranking tables.

Baseline system

Similarly to the previous iterations of the challenge, as the baseline we use a straightforward convolutional recurrent neural network (CRNN) based on SELDnet, but with a few important modifications.

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2018. URL: https://ieeexplore.ieee.org/abstract/document/8567942, doi:10.1109/JSTSP.2018.2885636.

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Abstract

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

Keywords

Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network

Baseline changes

Compared to the DCASE2020 baseline and the associated published SELDnet version, a few modifications have been integrated in the model, in order to take into account some of the simplest effective improvements demonstrated by the participants in the previous year.

The most important one is the elimination of the dedicated event classification output branch, by adopting the ACCDOA training target, which unifies the localization and classification losses in a homogeneous regression vector loss, pioneered by Shimada et al. and the third best performing team in DCASE2020. More details can be found in their report:

Publication

Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, and Yuki Mitsufuji. ACCDOA: activity-coupled Cartesian direction of arrival representation for sound event localization and detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Ontario, Canada, June 2021.

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

Abstract

Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: avoiding the necessity of balancing the objectives and model size increase. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.
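To make the representation concrete, the sketch below encodes one class as a Cartesian DOA vector whose length carries the activity, and decodes it by thresholding the vector norm. It illustrates the idea only and is not the baseline implementation; the function names and the 0.5 decoding threshold are assumptions.

```python
import math


def accdoa_encode(active, azimuth_deg, elevation_deg):
    """Target vector for one class: unit DOA vector if active, zero vector otherwise."""
    if not active:
        return (0.0, 0.0, 0.0)
    phi, theta = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(theta) * math.cos(phi),
            math.cos(theta) * math.sin(phi),
            math.sin(theta))


def accdoa_decode(vector, threshold=0.5):
    """Return (active, azimuth_deg, elevation_deg) from a regressed vector.

    The class is considered active when the vector norm exceeds the
    threshold (0.5 here is an assumed value); angles are None otherwise.
    """
    x, y, z = vector
    norm = math.sqrt(x * x + y * y + z * z)
    if norm <= threshold:
        return (False, None, None)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / norm))))
    return (True, azimuth, elevation)


target = accdoa_encode(True, 90, 0)          # active class at the left
decoded = accdoa_decode((0.0, 0.9, 0.0))     # slightly shrunk network output
silent = accdoa_decode((0.1, 0.0, 0.0))      # short vector: class inactive
```

With one such three-dimensional vector per class, a single regression output covers both detection and localization, which is what allows the separate classification branch to be dropped.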

Repository

The baseline, along with more details on it, can be found in:


Baseline results (development dataset)

The evaluation metric scores for the test split of the development dataset are given below. The location-dependent detection metrics are computed within a 20° threshold from the reference.

Dataset | \(ER_{20^\circ}\) | \(F_{20^\circ}\) | \(LE_{\mathrm{CD}}\) | \(LR_{\mathrm{CD}}\)
Ambisonic | 0.66 | 32.8 % | 25.9° | 44.3 %
Microphone array | 0.69 | 26.8 % | 31.6° | 44.5 %

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

If you are participating in this task or using the dataset and code please consider citing the following papers:

Publication

Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, and Tuomas Virtanen. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:684–698, 2020. URL: https://arxiv.org/abs/2009.02792.

Publication

Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. URL: https://arxiv.org/abs/2006.01919.
