Sound Event Localization and Detection


Task description

The goal of this task is to recognize individual sound events, detect their temporal activity, and estimate their location during it.

Description

Given multichannel audio input, a sound event localization and detection (SELD) system outputs a temporal activation track for each of the target sound classes, along with one or more corresponding spatial trajectories when the track indicates activity. This results in a spatio-temporal characterization of the acoustic scene that can be used in a wide range of machine cognition tasks, such as inference on the type of environment, self-localization, navigation without visual input or with occluded targets, tracking of specific types of sound sources, smart-home applications, scene visualization systems, and audio surveillance, among others.

Figure 1: Overview of sound event localization and detection system.


Audio dataset

The TAU-NIGENS Spatial Sound Events 2020 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs) captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone-array format (MIC) and a first-order Ambisonics format (FOA). The sound events are spatialized as either stationary sound sources in the room or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) relative to the recording point, and with onset and offset times. The isolated sound event recordings used for the synthesis of the sound scenes are obtained from the NIGENS general sound events database. These recordings serve as the development dataset for the Sound Event Localization and Detection task of the DCASE 2020 Challenge.

The RIRs were collected in Finland by staff of Tampere University between 12/2017 and 06/2018, and between 11/2019 and 01/2020. The older measurements from five rooms were also used for the earlier development and evaluation datasets TAU Spatial Sound Events 2019, while ten additional rooms were added for this dataset. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Recording procedure

To construct a realistic dataset, real-life IR recordings were collected using an Eigenmike spherical microphone array. A Genelec G Three loudspeaker was used to play back a maximum length sequence (MLS) around the Eigenmike. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and the far-field recording, independently at each frequency.
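The per-frequency least-squares estimation described above can be sketched as follows. This is an assumption of the general approach (a standard H1-type estimator over STFT frames), not the authors' exact measurement code:

```python
# Sketch of per-frequency least-squares IR estimation between a known
# excitation (e.g. an MLS) and a far-field recording.
import numpy as np
from scipy.signal import stft


def estimate_ir_stft(x, y, fs, nfft=1024):
    """Estimate a transfer function between excitation x and recording y,
    solved independently at each STFT frequency bin."""
    _, _, X = stft(x, fs, nperseg=nfft)
    _, _, Y = stft(y, fs, nperseg=nfft)
    frames = min(X.shape[1], Y.shape[1])
    X, Y = X[:, :frames], Y[:, :frames]
    # Least-squares solution per frequency:
    # H(f) = sum_t X*(f,t) Y(f,t) / sum_t |X(f,t)|^2
    H = np.sum(np.conj(X) * Y, axis=1) / (np.sum(np.abs(X) ** 2, axis=1) + 1e-12)
    # Back to a time-domain impulse response
    h = np.fft.irfft(H)
    return H, h
```

For a clean, time-invariant system this estimator recovers the impulse response up to windowing effects of the STFT.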

The IRs were recorded at fifteen different indoor locations inside the Tampere University campus at Hervanta, Finland. Apart from the five spaces measured and used for the same task in DCASE2019, we added ten new spaces. Additionally, 30 minutes of ambient noise recordings were collected at the same locations with the IR recording setup unchanged. Unlike DCASE2019, the new IRs were not measured on a spherical grid of fixed azimuth and elevation resolution at fixed distances; instead, the IR directions and distances vary from space to space. Possible azimuths span the whole range of \(\phi\in[-180,180)\), while the elevations span approximately a range between \(\theta\in[-45,45]\) degrees. A summary of the measured spaces is as follows:


DCASE2019

  1. Large common area with multiple seating tables and carpet flooring. People chatting and working.
  2. Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
  3. High ceiling corridor with hard flooring. People walking around and chatting.
  4. Corridor with classrooms around and hard flooring. People walking around and chatting.
  5. Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.

DCASE2020

  1. (2x) Large lecture halls with inclined floor. Ventilation noise.
  2. (2x) Modern classrooms with multiple seating tables and carpet flooring. Ventilation noise.
  3. (2x) Meeting rooms with hard floor and partially glass walls. Ventilation noise.
  4. (2x) Old-style large classrooms with hard floor and rows of desks. Ventilation noise.
  5. Large open space in underground bomb shelter, with plastic floor and rock walls. Ventilation noise.
  6. Large open gym space. People using weights and gym equipment.

Recording formats

The array response of the two recording formats can be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from direction-of-arrival (DOA) given by azimuth angle \(\phi\) and elevation angle \(\theta\).

For the first-order ambisonics (FOA):

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sin(\phi)\cos(\theta) \\ H_3(\phi, \theta, f) &=& \sin(\theta) \\ H_4(\phi, \theta, f) &=& \cos(\phi)\cos(\theta) \end{eqnarray}

The FOA format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response. Note that the formulas above assume a frequency-independent encoding, which holds up to around 9 kHz for this microphone array; at higher frequencies the actual encoded responses deviate gradually from the ideal ones given above.
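The ideal responses above can be evaluated directly. A minimal sketch, assuming angles in radians and the channel ordering H1 to H4 as listed:

```python
# Ideal (frequency-independent) FOA steering vector from the formulas above.
import numpy as np


def foa_steering(azi, ele):
    """Return the 4-channel FOA response for a plane wave from azimuth `azi`
    and elevation `ele` (radians), in the H1..H4 ordering given above."""
    return np.array([
        1.0,                        # H1: omnidirectional
        np.sin(azi) * np.cos(ele),  # H2
        np.sin(ele),                # H3
        np.cos(azi) * np.cos(ele),  # H4
    ])
```

For example, a source at the front (`azi = ele = 0`) excites only the first and fourth channels.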

For the tetrahedral microphone array (MIC):

The four microphones have the following positions, in spherical coordinates \((\phi, \theta, r)\):

M1: ( 45°, 35°, 4.2cm)

M2: (-45°, -35°, 4.2cm)

M3: (135°, -35°, 4.2cm)

M4: (-135°, 35°, 4.2cm)

Since the microphones are mounted on an acoustically-hard spherical baffle, an analytical expression for the directional array response is given by the expansion:

\begin{equation} H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m)) \end{equation}

where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\gamma_m\) is the angle between the microphone position and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative, with respect to the argument, of the spherical Hankel function of the second kind. The expansion is truncated at \(n=30\), which gives negligible modeling error up to 20 kHz. Example routines that can generate directional frequency and impulse array responses based on the above formula can be found here.
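The expansion above can be implemented with standard special-function routines. A sketch using scipy's spherical Bessel functions, with \(h_n^{(2)} = j_n - i\,y_n\) (a straightforward transcription of the formula, not the linked example routines):

```python
# Rigid-sphere array response per the series expansion above.
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre


def rigid_sphere_response(mic_azi, mic_ele, azi, ele, f, R=0.042, c=343.0, N=30):
    """Directional response of one microphone on a rigid spherical baffle
    for a plane wave from (azi, ele); angles in radians, f in Hz."""
    kR = 2.0 * np.pi * f * R / c
    # cosine of the angle gamma_m between the microphone position and the DOA
    cos_gamma = (np.sin(mic_ele) * np.sin(ele)
                 + np.cos(mic_ele) * np.cos(ele) * np.cos(azi - mic_azi))
    H = 0.0 + 0.0j
    for n in range(N + 1):
        # derivative of the spherical Hankel function of the second kind
        dhn2 = (spherical_jn(n, kR, derivative=True)
                - 1j * spherical_yn(n, kR, derivative=True))
        H += (1j ** (n - 1)) / dhn2 * (2 * n + 1) * eval_legendre(n, cos_gamma)
    return H / kR ** 2
```

At low frequencies the response magnitude approaches unity (the baffled array becomes effectively omnidirectional), which is a quick sanity check for the implementation.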

Sound event classes

To generate the spatial sound scenes, the measured room IRs are convolved with dry recordings of sound samples belonging to distinct sound classes. The sound samples used for that purpose come from the recent NIGENS general sound events database.

The 14 sound classes of the spatialized events are:

  1. alarm
  2. crying baby
  3. crash
  4. barking dog
  5. running engine
  6. female scream
  7. female speech
  8. burning fire
  9. footsteps
  10. knocking on door
  11. male scream
  12. male speech
  13. ringing phone
  14. piano

Dataset specifications

The specifications of the dataset are summarized as follows:

  • 600 one-minute long sound scene recordings (development dataset).
  • Sampling rate 24kHz.
  • About 700 sound event samples spread over 14 classes (see here for more details).
  • Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array.
  • Realistic spatialization and reverberation through RIRs collected in 15 different enclosures.
  • From about 1500 to 3500 possible RIR positions across the different rooms.
  • Both static reverberant and moving reverberant sound events.
  • Three possible angular speeds for moving sources of about 10, 20, or 40deg/sec.
  • Up to two overlapping sound events possible, temporally and spatially.
  • Realistic spatial ambient noise collected from each room is added to the spatialized sound events, at varying signal-to-noise ratios (SNR) ranging from noiseless (30dB) to noisy (6dB).

Each recording corresponds to a single room, and either allows an overlap of two simultaneous sources or contains no overlap. Each event spatialized in the recording has equal probability of being either static or moving, and is assigned randomly one of the room's RIR positions, or motion along one of the predefined trajectories. The moving sound events are synthesized with a slow (10 deg/sec), moderate (20 deg/sec), or fast (40 deg/sec) angular speed. A partitioned time-frequency interpolation scheme of the RIRs, extracted from the measurements at regular intervals, is used to approximate the time-variant room response corresponding to the target motion.

In the development dataset, eleven out of the fifteen rooms, along with the NIGENS event samples, are assigned to 6 disjoint sets, and their combinations form 6 distinct splits of 100 recordings each. The splits permit testing and validation across different acoustic conditions.

Reference labels and directions-of-arrival

For each recording in the development dataset, the labels and DoAs are provided in a plain text CSV file of the same filename as the recording, in the following format:

[frame number (int)], [active class index (int)], [track number index (int)], [azimuth (int)], [elevation (int)]

Frame, class, and track enumeration begins at 0. Frames correspond to a temporal resolution of 100msec. Azimuth and elevation angles are given in degrees, rounded to the closest integer value, with azimuth and elevation being zero at the front, azimuth \(\phi \in [-180^{\circ}, 180^{\circ}]\), and elevation \(\theta \in [-90^{\circ}, 90^{\circ}]\). Note that the azimuth angle is increasing counter-clockwise (\(\phi = 90^{\circ}\) at the left).

The track index indicates instances of the same class in the recording, overlapping or non-overlapping, and it increases for each newly occurring instance. By instance we mean a sound event that is spatialized with a distinct static position in the room, or with a coherent continuous spatial trajectory in the case of moving events. This information is mostly redundant for recordings with no overlap, but it becomes more important when overlap occurs. For example, when two same-class events occur at the same time and the user would like to resample their positions to a higher resolution than 100 msec, the track index can be used directly to disentangle the DoAs for interpolation, without the user having to solve the association problem.

Overlapping sound events are indicated with duplicate frame numbers, and can belong to a different or the same class. An example sequence could be:

10,     1,  0,  -50,  30
11,     1,  0,  -50,  30
11,     1,  1,   10, -20
12,     1,  1,   10, -20
13,     1,  1,   10, -20
13,     4,  0,  -40,   0

which describes that in frames 10-13 the first instance (track 0) of class crying baby (class 1) is active; at frame 11 a second instance (track 1) of the same class appears simultaneously at a different direction, and at frame 13 an additional event of class 4 appears. Frames that contain no sound events are not included in the sequence. Note that track information is only included in the development metadata and is not required to be provided by the participants.
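The metadata format above can be read with a few lines of Python. `load_metadata` below is a hypothetical helper, not part of the official baseline code:

```python
# Minimal parser for the frame-wise metadata CSV format described above.
import csv
from collections import defaultdict


def load_metadata(path):
    """Map frame number -> list of (class_idx, track_idx, azimuth, elevation)."""
    events = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            frame, cls, track, azi, ele = (int(v) for v in row)
            events[frame].append((cls, track, azi, ele))
    return events
```

Applied to the example sequence above, frame 11 maps to two entries (the two crying-baby tracks), and frame 13 to entries of classes 1 and 4.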

If a participant would like to use a higher temporal resolution than 100 msec in their estimation, we recommend using an integer number of sub-frames, to simplify processing of the metadata. A simple example routine performing (linear) spherical interpolation of directions is provided with the baseline code (e.g. for a sub-frame of 20 msec, four interpolated directions are returned between two input directions spaced at 100 msec).
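The interpolation described above can be sketched as spherical linear interpolation (slerp) between consecutive frame directions. `slerp_doa` is a hypothetical helper; the baseline's own routine may differ in conventions:

```python
# Spherical linear interpolation between two DOAs given as (azimuth,
# elevation) in degrees, producing n_sub sub-frame directions.
import numpy as np


def doa_to_vec(azi_deg, ele_deg):
    a, e = np.deg2rad([azi_deg, ele_deg])
    return np.array([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)])


def slerp_doa(doa0, doa1, n_sub):
    """Return n_sub interpolated (azi, ele) pairs between two 100 ms frames."""
    v0, v1 = doa_to_vec(*doa0), doa_to_vec(*doa1)
    omega = np.arccos(np.clip(v0 @ v1, -1.0, 1.0))  # angle between the DOAs
    out = []
    for t in np.linspace(0.0, 1.0, n_sub + 2)[1:-1]:
        if omega < 1e-8:
            v = v0  # coincident directions: nothing to interpolate
        else:
            v = (np.sin((1 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)
        out.append((np.rad2deg(np.arctan2(v[1], v[0])), np.rad2deg(np.arcsin(v[2]))))
    return out
```

For a 20 msec sub-frame, `n_sub=4` yields the four intermediate directions between two 100 msec-spaced labels.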

Download


Task setup

In order to allow a fair comparison of methods on the development dataset, participants are required to report results using the following split:

Training splits: 3, 4, 5, 6
Validation split: 2
Testing split: 1

The evaluation dataset is released a few weeks before the final submission deadline. This dataset consists of audio recordings only, without any metadata/labels. Participants can decide their own training procedure, e.g. the number of training and validation files from the development dataset, the number of ensemble models, etc., and submit the results of the SELD performance on the evaluation dataset.

Development dataset

The recordings in the development dataset follow the naming convention:

fold[split number]_room[room number per split]_mix[recording number per room per split]_ov[number of overlapping sound events].wav

Note that the room number only distinguishes different rooms used inside a split. For example, room1 in the first split is not the same as room1 in the second split. The room or overlap information is provided for the user of the dataset to understand the performance of their method with respect to different conditions.
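As an illustration, the naming convention above can be parsed with a small hypothetical helper:

```python
# Parse the development-set naming convention described above.
import re


def parse_dev_filename(name):
    """Extract (split, room, mix, overlap) from e.g. 'fold1_room1_mix001_ov1.wav'."""
    m = re.match(r"fold(\d+)_room(\d+)_mix(\d+)_ov(\d+)\.wav$", name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return tuple(int(g) for g in m.groups())
```

This makes it easy to group results per room or per overlap condition when analyzing performance.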

Evaluation dataset

The evaluation dataset consists of 200 recordings. Their naming convention, shown below, gives no information on the room or the number of overlapping sound events:

mix[recording number].wav

Submission

The results for each of the recordings in the evaluation dataset should be collected in individual file-wise CSV files. Similarly, we also collect file-wise CSVs for each of the 100 recordings in the testing split of the development dataset. These file-wise CSVs should have the same name as the respective audio recording, but with a .csv extension, and should contain the same information at each row as the reference labels, excluding the track index:

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]

Enumeration of frames and class indices begins at zero. The evaluation will be performed at a temporal resolution of 100msec. In case the participants use a different frame or hop length for their study, we expect them to use a suitable method to extract the information at the specified resolution before submitting the evaluation results. The class indices are as ordered in the class descriptions mentioned above.
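A minimal sketch of writing one file-wise CSV in the required format (`write_submission` is a hypothetical helper; frame and class enumeration start at zero, as specified above):

```python
# Write one file-wise submission CSV in the format
# [frame],[class],[azimuth],[elevation], one row per active event.
import csv


def write_submission(path, events):
    """events: iterable of (frame, class_idx, azimuth, elevation) tuples,
    with angles rounded to integer degrees at 100 msec frame resolution."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for frame, cls, azi, ele in sorted(events):
            writer.writerow([frame, cls, azi, ele])
```

Sorting by frame keeps overlapping events on adjacent rows with duplicate frame numbers, mirroring the reference label layout.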

In addition to the CSV files, the participants are asked to update the information of their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, meta-information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information can be found on the challenge submission page.

General information for all DCASE submissions can be found on the Submission page.

Task rules

  • Use of external data is not allowed.
  • Manipulation of provided cross-validation split for development dataset is not allowed.
  • The development dataset can be augmented without the use of external data (e.g. using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.

Evaluation

Contrary to the SELD task of DCASE 2019, we do not rate the systems in terms of independent sound event detection performance and localization performance. In order to have a more representative evaluation of the task, we introduce modified metrics that consider the joint nature of localization-and-detection.

The first metric is more focused on the detection part, also referred to as location-aware detection, which gives us the error rate (ER) and F-score (F) in one-second non-overlapping segments. We consider a prediction to be correct if the predicted and reference classes are the same, and the angular distance between the predicted and reference DOAs is below 20°. The second metric is more focused on the localization part, also referred to as class-aware localization, which gives us the DOA error (DE) and F-score (DE_F) in one-second non-overlapping segments. Unlike location-aware detection, no distance threshold is used; instead, the DOA error is computed between predictions and references of the same class.
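The 20° criterion relies on the angular (great-circle) distance between predicted and reference DOAs. A sketch of that distance and the resulting true-positive test (hypothetical helper names, not the official metric implementation):

```python
# Angular distance between two directions and the location-aware
# true-positive condition described above.
import numpy as np


def angular_distance(azi1, ele1, azi2, ele2):
    """Angle in degrees between two directions given as (azimuth, elevation)."""
    a1, e1, a2, e2 = np.deg2rad([azi1, ele1, azi2, ele2])
    cos_d = np.sin(e1) * np.sin(e2) + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2)
    return np.rad2deg(np.arccos(np.clip(cos_d, -1.0, 1.0)))


def is_true_positive(pred_cls, ref_cls, pred_doa, ref_doa, threshold=20.0):
    """Location-aware detection: same class and DOA within the threshold."""
    return pred_cls == ref_cls and angular_distance(*pred_doa, *ref_doa) <= threshold
```

Class-aware localization reuses the same `angular_distance`, but averages it over same-class prediction/reference pairs instead of thresholding.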

For a more thorough description and analysis on the joint SELD metrics please refer to:

Publication

Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen. Joint measurement of localization and detection of sound events. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, Oct 2019.


Joint Measurement of Localization and Detection of Sound Events

Abstract

Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent development of sound event detection methods approach also the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmentation of the localization metrics with a condition related to the detection, and conversely, use of location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection only and localization only performance shows that the proposed joint metrics operate in a consistent and logical manner, and characterize adequately both aspects.

Keywords

Sound event detection and localization, performance evaluation


Ranking

Overall ranking will be based on the average rank of the four metrics mentioned above.

Baseline system

As the baseline, we use the recently published SELDnet, a CRNN-based method that uses the confidence of the SED output to estimate one DOA for each sound class. SED is performed as multiclass multilabel classification, whereas DOA estimation is performed as multioutput regression.

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. URL: https://ieeexplore.ieee.org/abstract/document/8567942, doi:10.1109/JSTSP.2018.2885636.


Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Abstract

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

Keywords

Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network


Compared to DCASE 2019 and the published SELDnet version, a few modifications have been integrated into the model to take into account some of the simplest effective improvements demonstrated by the participants in the previous year. Some of them are:

  • instead of raw multichannel magnitude and phase spectrograms as SED features, the more compressed log-mel spectral coefficients are used
  • instead of raw multichannel magnitude and phase spectrograms as localization features, generalized cross-correlation (GCC) features are used for the MIC format, and the acoustic intensity vector for the FOA format
  • the model is trained initially with a SED loss only, then continued with a joint SELD loss
  • the localization part of the joint loss is masked with the ground truth activations of each class, hence not contributing to the training when an event is not active
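The GCC localization features in the list above are commonly computed as GCC-PHAT between microphone pairs. A minimal sketch of the standard formulation (the baseline's exact feature extraction may differ):

```python
# GCC-PHAT between two microphone channels; the peak index indicates the
# delay of x2 relative to x1.
import numpy as np


def gcc_phat(x1, x2, nfft=1024, max_lag=32):
    """Return 2*max_lag+1 GCC-PHAT coefficients centered on zero lag."""
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, nfft)
    # concatenate the tail (negative lags) and head (non-negative lags)
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
```

Restricting to a small `max_lag` keeps only delays physically plausible for the 8.4 cm microphone spacing, which is why truncated GCC vectors make compact localization features.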

Furthermore, the newly introduced joint SELD metrics are used for tuning the model, and instead of (azimuth, elevation) angles the localization regressors output the estimated direction in Cartesian coordinates (x,y,z), as in the original SELDnet publication.

Repository

This repository implements SELDnet and performs cross-validation in the manner we recommend. We also provide scripts to visualize your SELD results and estimate the relevant metric scores before final submission.


Results for the development dataset

The evaluation metric scores for the test split of the development dataset are given below. The location-aware detection metrics are computed with a 20deg threshold from the reference for true positives.

Dataset ER F DE DE_F
Ambisonic 0.84 23.3 % 28° 56.4 %
Microphone array 0.82 24.3 % 28.4° 61.2 %

For a comparison, using the independent detection and localization metrics as done in DCASE2019 would result in the following:

Dataset ER F DE DE_F
Ambisonic 0.59 56.5 % 23.5° 67.3 %
Microphone array 0.53 61.3 % 24° 69 %

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

If you are participating in this task or using the dataset and code please consider citing the following papers:

Publication

Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. A multi-room reverberant dataset for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2019. URL: http://dcase.community/documents/challenge2019/technical_reports/DCASE2019_Adavanne.pdf.


A Multi-room Reverberant Dataset for Sound Event Localization and Detection

Abstract

This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. URL: https://ieeexplore.ieee.org/abstract/document/8567942, doi:10.1109/JSTSP.2018.2885636.
