# Sound Event Localization and Detection

### Coordinators

Sharath Adavanne, Archontis Politis, Tuomas Virtanen

The goal of this task is to jointly localize and recognize individual sound events and their respective temporal onset and offset times.

# Description

Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, their respective onset-offset times, and their directions-of-arrival (DOAs) in azimuth and elevation angles. Effective implementations of such a SELD method enable an automated description of human activities with a spatial dimension, and help machines to interact with the world more seamlessly. More specifically, SELD can be an important module in assisted listening systems, scene information visualization systems, immersive interactive media, and spatial machine cognition for scene-based deployment of services. A straightforward practical application is a robot that recognizes and tracks the sound source of interest. In the current challenge, only static scenes are considered, meaning that each individual sound event instance in the provided recordings is spatially stationary with a fixed location during its entire duration.

# Audio dataset

The task provides two datasets, TAU Spatial Sound Events 2019 - Ambisonic and TAU Spatial Sound Events 2019 - Microphone Array, of an identical sound scene; they differ only in the format of the audio. TAU Spatial Sound Events 2019 - Ambisonic provides four-channel First-Order Ambisonic (FOA) recordings, while TAU Spatial Sound Events 2019 - Microphone Array provides four-channel directional microphone recordings from a tetrahedral array configuration. Both formats are extracted from the same microphone array, and additional information on the spatial characteristics of each format can be found below. Participants can choose either dataset, or both, based on the audio format they prefer. Both datasets consist of a development set and an evaluation set. The development set consists of 400 one-minute recordings sampled at 48000 Hz, divided into four cross-validation splits of 100 recordings each. The evaluation set consists of 100 one-minute recordings. These recordings were synthesized using spatial room impulse responses (IRs) collected from five indoor locations, at 504 unique combinations of azimuth, elevation and distance. To synthesize the recordings, the collected IRs were convolved with the isolated sound events dataset from DCASE 2016 task 2. Finally, to create a realistic sound scene recording, natural ambient noise collected in the IR recording locations was added to the synthesized recordings such that the average SNR of the sound events was 30 dB.

The IRs were collected in Finland by Tampere University between 12/2017 - 06/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

## Recording procedure

The real-life IR recordings were collected using an Eigenmike spherical microphone array. A Genelec G Two loudspeaker was used to play back a maximum length sequence (MLS) around the Eigenmike. The MLS playback level was ensured to be 30 dB greater than the ambient sound level during the recording. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and the far-field recording, independently at each frequency. These IRs were collected in the following directions:

• 36 IRs at every 10° azimuth angle, for 9 elevations from -40° to 40° at 10° increments, at 1 m distance from the Eigenmike, resulting in 324 discrete DOAs.
• 36 IRs at every 10° azimuth angle, for 5 elevations from -20° to 20° at 10° increments, at 2 m distance from the Eigenmike, resulting in 180 discrete DOAs.
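The two measurement grids above can be enumerated to check the stated counts. This is a minimal sketch; the azimuth range starting at -180° is an assumption (taken from the azimuth range expected in the submission format), and the function name `doa_grid` is purely illustrative.

```python
def doa_grid():
    """Enumerate (azimuth, elevation, distance) triples of the IR measurement grid.

    Assumption: azimuths run from -180 to 170 degrees in 10 degree steps.
    """
    grid = []
    # (distance in metres, maximum absolute elevation in degrees)
    for distance_m, el_max in ((1, 40), (2, 20)):
        for elevation in range(-el_max, el_max + 1, 10):
            for azimuth in range(-180, 180, 10):
                grid.append((azimuth, elevation, distance_m))
    return grid


grid = doa_grid()
n_near = sum(1 for _, _, d in grid if d == 1)  # 36 azimuths x 9 elevations
n_far = sum(1 for _, _, d in grid if d == 2)   # 36 azimuths x 5 elevations
print(n_near, n_far, len(grid))
```

The two counts add up to the 504 azimuth-elevation-distance combinations mentioned in the dataset overview.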

The IRs were recorded at five different indoor locations inside the Tampere University campus at Hervanta, Finland. Additionally, we collected 30 minutes of ambient noise recordings from these five locations with the IR recording setup unchanged. The indoor locations are described as follows:

• Language Center - Large common area with multiple seating tables and carpet flooring. People chatting and working.
• Reaktori Building - Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
• Festia Building - High ceiling corridor with hard flooring. People walking around and chatting.
• Tietotalo Building - Corridor with classrooms around and hard flooring. People walking around and chatting.
• Sähkötalo Building - Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.

## Recording format and dataset specifications

The isolated sound events dataset from DCASE 2016 task 2 consists of 11 classes, each with 20 examples. These examples are randomly split into five sets with an equal number of examples for each class. The first four sets are used for synthesizing the four splits of the development dataset, while the remaining set is used for the evaluation dataset. For each split of the dataset, we synthesize 100 recordings. Each of these recordings is generated by randomly choosing sound event examples from the corresponding set and randomly assigning each of them a start time and one of the collected IRs. By convolving each of these sound examples with its assigned IR, we spatially position it at a given distance, azimuth and elevation angle from the Eigenmike. We make sure to use IRs from a single location for all sound events in a recording. Further, half of the recordings in each split are synthesized with up to two temporally overlapping sound events, while the others are synthesized with no overlapping sound events. Finally, the ambient noise collected at the respective IR location is added to the synthesized recording such that the average SNR of the sound events is 30 dB.
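The final noise-mixing step can be illustrated with a short sketch. It assumes a single-channel signal and a single global SNR computed over the whole recording, which is a simplification of the "average SNR of the sound events" used for the dataset; the helper names are hypothetical and not taken from the dataset generation code.

```python
import math


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def scale_noise_to_snr(signal, noise, snr_db=30.0):
    """Scale `noise` so that mean_power(signal) / mean_power(noise) matches snr_db."""
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [gain * v for v in noise]


# Toy data, only to show the arithmetic.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, 0.5, 0.5]
scaled = scale_noise_to_snr(signal, noise, snr_db=30.0)
achieved_snr_db = 10 * math.log10(mean_power(signal) / mean_power(scaled))
mixture = [s + n for s, n in zip(signal, scaled)]
```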

Since the number of channels in the IRs is equal to the number of microphones in the Eigenmike (32), in order to create the TAU Spatial Sound Events 2019 - Microphone Array dataset we use the channels 6, 10, 26, and 22, which correspond to microphone positions (45°, 35°, 4.2 cm), (-45°, -35°, 4.2 cm), (135°, -35°, 4.2 cm) and (-135°, 35°, 4.2 cm). The spherical coordinate system in use is right-handed with the front at (0°, 0°), left at (90°, 0°) and top at (0°, 90°). Further, the TAU Spatial Sound Events 2019 - Ambisonic dataset is obtained by converting the 32-channel microphone signals to FOA, by means of encoding filters based on anechoic measurements of the Eigenmike array response.

For model-based localization approaches, the array response may be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from the DOA given by azimuth angle $$\phi$$ and elevation angle $$\theta$$.

For the first-order ambisonics:

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sqrt{3} \, \sin(\phi) \cos(\theta) \\ H_3(\phi, \theta, f) &=& \sqrt{3} \, \sin(\theta) \\ H_4(\phi, \theta, f) &=& \sqrt{3} \, \cos(\phi) \cos(\theta) \end{eqnarray}
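As an illustration, the ideal FOA steering vector above can be evaluated for any DOA. The following is a minimal sketch using only the Python standard library; the function name `foa_steering_vector` is hypothetical and angles are taken in degrees for convenience.

```python
import math


def foa_steering_vector(azimuth_deg, elevation_deg):
    """Return [H1, H2, H3, H4] of the ideal FOA response for a DOA in degrees."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    return [
        1.0,
        math.sqrt(3) * math.sin(phi) * math.cos(theta),
        math.sqrt(3) * math.sin(theta),
        math.sqrt(3) * math.cos(phi) * math.cos(theta),
    ]


front = foa_steering_vector(0, 0)   # only H1 and H4 are non-zero
left = foa_steering_vector(90, 0)   # only H1 and H2 are non-zero
top = foa_steering_vector(0, 90)    # only H1 and H3 are non-zero
```

The response does not depend on frequency, so the argument $$f$$ of the equations is omitted in the sketch.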

For the tetrahedral array of microphones mounted on a spherical baffle, an analytical expression for the directional array response is given by the expansion:

$$H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m))$$

where $$m$$ is the channel number, $$(\phi_m, \theta_m)$$ are the specific microphone's azimuth and elevation position, $$\omega = 2\pi f$$ is the angular frequency, $$R = 0.042$$ m is the array radius, $$c = 343$$ m/s is the speed of sound, $$\cos(\gamma_m)$$ is the cosine of the angle between the microphone position and the DOA, $$P_n$$ is the unnormalized Legendre polynomial of degree $$n$$, and $$h_n'^{(2)}$$ is the derivative with respect to the argument of a spherical Hankel function of the second kind. The expansion is truncated at order $$n = 30$$, which provides negligible modeling error up to 20 kHz. Note that the ideal Ambisonics format is frequency-independent, something that holds true up to around 9 kHz with the specific microphone array, while at higher frequencies the actual encoded response starts to deviate gradually from the ideal one provided above.
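The geometric quantities entering the expansion, $$\cos(\gamma_m)$$ and the argument $$\omega R/c$$, can be computed as in the sketch below. It does not evaluate the full series (that would additionally require spherical Hankel functions and Legendre polynomials, for instance from a numerical library). The mapping of the coordinate system to Cartesian axes (x front, y left, z up) is an assumption consistent with the right-handed convention described above, and all names are illustrative.

```python
import math

# Channel number -> (azimuth, elevation) in degrees, as listed for the tetrahedral array.
MIC_POSITIONS_DEG = {6: (45, 35), 10: (-45, -35), 26: (135, -35), 22: (-135, 35)}
ARRAY_RADIUS_M = 0.042
SPEED_OF_SOUND = 343.0


def unit_vector(azimuth_deg, elevation_deg):
    """Unit vector with x pointing front, y left and z up (assumed axis mapping)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def cos_gamma(mic_deg, doa_deg):
    """Cosine of the angle between a microphone direction and a DOA."""
    a = unit_vector(*mic_deg)
    b = unit_vector(*doa_deg)
    return sum(x * y for x, y in zip(a, b))


def k_times_r(frequency_hz):
    """Argument omega * R / c of the Hankel functions in the expansion."""
    return 2 * math.pi * frequency_hz * ARRAY_RADIUS_M / SPEED_OF_SOUND


on_axis = cos_gamma(MIC_POSITIONS_DEG[6], (45, 35))       # source in front of mic 6
opposite = cos_gamma(MIC_POSITIONS_DEG[6], (-135, -35))   # source behind mic 6
kr_1khz = k_times_r(1000.0)
```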

In summary, there are 100 recordings in the evaluation dataset, and 100 in each of the four splits of the development dataset. These 100 recordings comprise 10 recordings for each combination of overlap condition (up to two, or no, temporally overlapping sound events) and IR location (10 * 2 * 5 = 100). The only explicit difference between each of the development dataset splits and the evaluation dataset is the isolated sound event examples employed. Although each of the development dataset splits and the evaluation dataset contains IRs from all five locations, the dataset only guarantees a balanced distribution of sound events over the 36 azimuth and 9 elevation angles within each split; it does not guarantee that all IRs collected at a given location are used within a single split. For example, some of the IRs of the Reaktori building might not be in the first split but might occur in any of the other splits.

More details on the collection of the IR recordings and the synthesis of the dataset can be found in:

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, ():1–1, 2018. doi:10.1109/JSTSP.2018.2885636.

#### Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

##### Abstract

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

##### Keywords

Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network

## Reference labels

As labels, for each recording in the development dataset, we provide a CSV file that lists the sound events, their respective temporal onset-offset times, and their azimuth and elevation angles. Since the development dataset provides four cross-validation splits, it can be used as a standalone dataset for future work. If you are preparing a publication based on the DCASE challenge setup and you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. The task coordinators can provide unofficial scoring for a limited number of system outputs.

The development dataset comes with four pre-defined cross-validation folds, as shown in the table below. The splits consist of audio recordings and the corresponding metadata describing the sound events and their respective locations within each recording. Participants are required to report the performance of their method on the testing splits of the four folds. To allow a fair comparison of methods on the development dataset, participants are not allowed to change the defined splits.

| Folds | Training splits | Validation split | Testing split |
|---|---|---|---|
| Fold 1 | 3, 4 | 2 | 1 |
| Fold 2 | 4, 1 | 3 | 2 |
| Fold 3 | 1, 2 | 4 | 3 |
| Fold 4 | 2, 3 | 1 | 4 |
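For scripting the cross-validation, the fold table can be written down as a small data structure. This is only a sketch; the dictionary layout and the helper `check_folds` are illustrative and not part of the provided code.

```python
# Fold number -> split numbers used for training, validation and testing.
FOLDS = {
    1: {"train": [3, 4], "val": [2], "test": [1]},
    2: {"train": [4, 1], "val": [3], "test": [2]},
    3: {"train": [1, 2], "val": [4], "test": [3]},
    4: {"train": [2, 3], "val": [1], "test": [4]},
}


def check_folds(folds):
    """Verify that each fold uses every split once and each split is tested once."""
    for fold in folds.values():
        used = fold["train"] + fold["val"] + fold["test"]
        if sorted(used) != [1, 2, 3, 4]:
            return False
    tested = sorted(s for fold in folds.values() for s in fold["test"])
    return tested == [1, 2, 3, 4]


folds_ok = check_folds(FOLDS)
```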

The evaluation dataset is released a few weeks before the final submission deadline. This dataset consists of only audio recordings without any metadata/labels. Participants can decide the training procedure, i.e. the amount of training and validation files in the development dataset, the number of ensemble models, and submit the results of the SELD performance on the evaluation dataset.

## Development dataset

The recordings in the development dataset follow the naming convention:

split[number]_ir[location number]_ov[number of overlapping sound events]_[recording number per split].wav


The location whose impulse responses were used to synthesize the recording and the number of overlapping sound events in the recording are provided only so that participants can understand the performance of their method under different conditions. We encourage participants to carry out individual studies for such conditions and report them as a publication in the DCASE 2019 workshop. For the challenge, however, we only consider generic methods that do not use the location or the number of overlapping sound events during training or inference.
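For per-condition analysis of results (not for training or inference), the fields of the development naming convention can be extracted with a regular expression. This is a sketch; the example file name is hypothetical and only follows the pattern given above.

```python
import re

# split[number]_ir[location number]_ov[number of overlapping events]_[recording number].wav
DEV_NAME = re.compile(r"^split(\d+)_ir(\d+)_ov(\d+)_(\d+)\.wav$")


def parse_dev_name(filename):
    """Return the numeric fields encoded in a development-set file name."""
    match = DEV_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    split, ir, ov, recording = (int(g) for g in match.groups())
    return {"split": split, "ir": ir, "ov": ov, "recording": recording}


info = parse_dev_name("split1_ir0_ov2_7.wav")  # hypothetical file name
```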

## Evaluation dataset

The evaluation dataset consists of 100 recordings without any information on the location, or the number of overlapping sound events in the naming convention as below:

split[number]_[recording number per split].wav


# Task rules

• Use of external data is not allowed.
• Manipulation of provided cross-validation split for development dataset is not allowed.
• The development dataset can be augmented without the use of external data (e.g. using techniques such as pitch shifting or time stretching).
• Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.

# Submission

The results for each of the 100 recordings in the evaluation dataset should be collected in individual file-wise CSV files. Similarly, we also collect file-wise CSV files for each of the 400 recordings (testing split results of the four folds) in the development dataset. Each file-wise CSV file has the same name as the audio recording, but with the .csv extension, and contains the following information in each row:

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]


An example output file will look like below. If you use the baseline code, then this output is automatically produced for you.

10,1,10,-20
10,1,-50,30
11,1,10,-20
11,1,-50,30
12,1,10,-20
12,1,-50,30
13,1,10,-20
13,1,-50,30
13,2,30,0
112,4,-40,0
113,4,-40,0
114,4,-40,0


The output file describes two instances of class 1 that are active at locations (10°, -20°) and (-50°, 30°) for four consecutive frames, 10 to 13. In frame 13, class 2 is also active in addition to class 1. Finally, class 4 is active in frames 112-114 at location (-40°, 0°).
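To make the row format concrete, the sketch below reads such a file into a per-frame dictionary using only the standard library. The function name and the shortened example text are illustrative; the baseline code has its own reader and writer.

```python
import csv
import io
from collections import defaultdict


def parse_output(text):
    """Map frame number -> list of (class index, azimuth, elevation) tuples."""
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        frame, class_index, azimuth, elevation = (int(v) for v in row)
        frames[frame].append((class_index, azimuth, elevation))
    return dict(frames)


# A shortened version of the example output shown above.
EXAMPLE = """10,1,10,-20
10,1,-50,30
13,1,10,-20
13,1,-50,30
13,2,30,0
112,4,-40,0
"""

events = parse_output(EXAMPLE)
```

Here `events[13]` holds three entries: the two instances of class 1 and the single instance of class 2.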

The evaluation is performed at a hop length of 20 ms, which results in 3000 frames for a 60 s long audio recording. If participants use a different hop length in their study, we expect them to use a suitable post-processing method to extract the information at a 20 ms hop length and submit it as the evaluation results. The class index for each of the 11 classes in the provided dataset is available in the metadata provided with the dataset and the baseline code. The azimuth angles are expected in the range of -180° to 170°, while the elevation angles are expected in the range of -40° to 40°; any value beyond these limits will be clipped to the respective minimum or maximum value.
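The frame arithmetic and the clipping can be sketched as follows. The resampling shown here simply holds the label of the source frame that covers the start time of each 20 ms frame; this is one possible post-processing choice made for illustration, not a method prescribed by the task. Times are kept in integer milliseconds to avoid rounding issues, and all names are hypothetical.

```python
HOP_MS = 20


def num_frames(duration_ms, hop_ms=HOP_MS):
    """Number of evaluation frames in a recording."""
    return duration_ms // hop_ms


def clip_doa(azimuth, elevation):
    """Clip angles (degrees) to the expected azimuth and elevation ranges."""
    return max(-180, min(170, azimuth)), max(-40, min(40, elevation))


def resample_frames(labels, src_hop_ms, dst_hop_ms=HOP_MS, duration_ms=60000):
    """Hold-style resampling of per-frame labels to the evaluation hop length."""
    resampled = []
    for t in range(num_frames(duration_ms, dst_hop_ms)):
        src_index = min((t * dst_hop_ms) // src_hop_ms, len(labels) - 1)
        resampled.append(labels[src_index])
    return resampled


frames_per_minute = num_frames(60000)
clipped = clip_doa(180, -55)
upsampled = resample_frames(["a", "b", "c"], src_hop_ms=40, duration_ms=120)
```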

In addition to the CSV files, the participants are asked to update the information about their method in the provided file and submit a technical report describing the method. We allow up to 4 system output submissions per participant. For each system, meta information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission. Detailed information regarding the challenge submission can be found on the submission page.

# Evaluation

The SELD task is evaluated with individual metrics for SED and DOA estimation. For SED, we use the F-score and error rate (ER) calculated in one-second segments. Detailed information on the SED metrics is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.

#### Metrics for Polyphonic Sound Event Detection

##### Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

For DOA estimation we use two frame-wise metrics: DOA error and frame recall. The DOA error is the average angular error in degrees between the predicted and reference DOAs. For a recording of length $$T$$ time-frames, let $$\mathbf{DOA}^t_R$$ be the list of all reference DOAs at time-frame $$t$$ and $$\mathbf{DOA}^t_E$$ be the list of all estimated DOAs. The DOA error is now defined as

$$DOA\,error = \frac{1}{\sum_{t=1}^{T}{D^t_E}}\sum_{t=1}^{T}{\mathcal{H}(\mathbf{DOA}^t_R, \mathbf{DOA}^t_E)},$$

where $$D^t_E$$ is the number of DOAs in $$\mathbf{DOA}^t_E$$ at the $$t$$-th frame, and $$\mathcal{H}$$ is the Hungarian algorithm for solving the assignment problem, i.e., matching the individual estimated DOAs with the respective reference DOAs. The Hungarian algorithm solves this using pair-wise costs between the individual predicted and reference DOAs, given by the central angle between them,

$$\sigma = \arccos(\sin\lambda_E\sin\lambda_R + \cos\lambda_E\cos\lambda_R\cos(\phi_R-\phi_E))$$

where the reference DOA is represented by the azimuth angle $$\phi_R \in [-\pi, \pi)$$ and elevation angle $$\lambda_R \in [-\pi/2, \pi/2]$$, and the estimated DOA is represented by $$(\phi_E, \lambda_E)$$ in the same ranges as the reference DOA.
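A minimal sketch of the DOA error follows. The central angle treats elevation as latitude and azimuth as longitude, in line with the angle definitions above. Two things here are assumptions made for illustration: the assignment is solved by brute force over permutations instead of the Hungarian algorithm (both give the same minimum cost for the few DOAs per frame considered here), and when the numbers of reference and estimated DOAs differ, only as many pairs as the shorter list are matched. Names are hypothetical.

```python
import itertools
import math


def central_angle(doa_a, doa_b):
    """Central angle in radians between two (azimuth, elevation) pairs in radians."""
    az_a, el_a = doa_a
    az_b, el_b = doa_b
    c = (math.sin(el_a) * math.sin(el_b)
         + math.cos(el_a) * math.cos(el_b) * math.cos(az_a - az_b))
    return math.acos(max(-1.0, min(1.0, c)))  # guard against rounding outside [-1, 1]


def frame_cost(ref, est):
    """Minimum total central angle over one-to-one pairings in a single frame."""
    if not ref or not est:
        return 0.0
    small, large = (ref, est) if len(ref) <= len(est) else (est, ref)
    return min(
        sum(central_angle(a, b) for a, b in zip(small, perm))
        for perm in itertools.permutations(large, len(small))
    )


def doa_error_deg(ref_frames, est_frames):
    """Total matching cost divided by the total number of estimated DOAs, in degrees."""
    total = sum(frame_cost(r, e) for r, e in zip(ref_frames, est_frames))
    n_est = sum(len(e) for e in est_frames)
    return math.degrees(total / n_est) if n_est else 0.0


# Toy example: frame 1 is off by 10 degrees, frame 2 lists the same DOAs in a different order.
ref_frames = [[(0.0, 0.0)], [(0.0, 0.0), (math.pi / 2, 0.0)]]
est_frames = [[(math.radians(10), 0.0)], [(math.pi / 2, 0.0), (0.0, 0.0)]]
error = doa_error_deg(ref_frames, est_frames)
```

In the toy example the total cost of 10° is divided by three estimated DOAs, giving roughly 3.33°.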

In order to account for time frames where the numbers of estimated and reference DOAs are unequal, we report a second metric, frame recall, which is calculated as

$$Frame\,recall =\frac{\sum_{t=1}^{T}{\mathbb{1}(D^t_R = D^t_E)}}{T},$$

where $$D^t_R$$ is the number of DOAs in $$\mathbf{DOA}^t_R$$ at the $$t$$-th frame, and $$\mathbb{1}()$$ is the indicator function, which returns one if the condition $$(D^t_R = D^t_E)$$ is met and zero otherwise. More details regarding the DOA metrics can be found in:

Publication

Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In 2018 26th European Signal Processing Conference (EUSIPCO), 1462–1466. IEEE, 2018.

#### Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

##### Abstract

This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.
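The frame recall defined above only compares the number of reference and estimated DOAs in each frame. A short sketch, with hypothetical names and toy data:

```python
def frame_recall(ref_frames, est_frames):
    """Fraction of frames in which the reference and estimated DOA counts agree."""
    total = len(ref_frames)
    matches = sum(1 for r, e in zip(ref_frames, est_frames) if len(r) == len(e))
    return matches / total


# Per-frame lists of (azimuth, elevation) pairs in degrees; only the counts matter here.
ref_frames = [[(10, -20)], [(10, -20), (-50, 30)], [], [(30, 0)]]
est_frames = [[(12, -18)], [(10, -20)], [], [(30, 0), (90, 0)]]
recall = frame_recall(ref_frames, est_frames)
```

Frames 1 and 3 have matching counts, so the recall in this toy example is 0.5.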

An ideal SELD method will have an error rate of 0, an F-score of 1 (reported in %), a DOA error of 0° and a frame recall of 1 (reported in %). In order to compare the submitted methods, we will rank each method individually on all four metrics, and the final positions will be obtained using the cumulative minimum of the ranks.

PLEASE NOTE: The four cross-validation folds are treated as a single experiment, meaning that metrics are calculated only after training and testing all folds, not as the average of the individual folds nor as the average of individual class performance. Intermediate measures (insertions, deletions, substitutions) from all folds are accumulated before calculating metrics. For more information on why so, please refer to the following paper:

Publication

George Forman and Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. SIGKDD Explor. Newsl., 12(1):49–57, November 2010. URL: http://doi.acm.org/10.1145/1882471.1882479, doi:10.1145/1882471.1882479.

#### Apples-to-apples in Cross-validation Studies: Pitfalls in Classifier Performance Measurement

##### Abstract

Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages. This leads to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in aggregated results over many folds and datasets, without ever a person looking at the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance for how best to measure classification performance under cross-validation. In particular, there are several divergent methods used for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and in one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods leads to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and researchers focused on high class imbalance.
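The pooling across folds described in the note above can be sketched as follows, using the segment-based definitions F = 2TP / (2TP + FP + FN) and ER = (S + D + I) / N from the SED metrics paper cited earlier. The per-fold counts are made-up numbers for illustration only, and the function names are hypothetical.

```python
def pooled_sed_metrics(folds):
    """Accumulate intermediate counts over all folds, then compute F-score and ER."""
    tp = sum(f["tp"] for f in folds)
    fp = sum(f["fp"] for f in folds)
    fn = sum(f["fn"] for f in folds)
    errors = sum(f["S"] + f["D"] + f["I"] for f in folds)  # substitutions, deletions, insertions
    n_ref = sum(f["N"] for f in folds)                     # number of reference events
    return 2 * tp / (2 * tp + fp + fn), errors / n_ref


def fold_averaged_f_score(folds):
    """Average of per-fold F-scores, shown only for contrast with the pooled value."""
    scores = [2 * f["tp"] / (2 * f["tp"] + f["fp"] + f["fn"]) for f in folds]
    return sum(scores) / len(scores)


# Hypothetical counts for two folds of very different difficulty.
folds = [
    {"tp": 90, "fp": 10, "fn": 10, "S": 5, "D": 5, "I": 5, "N": 100},
    {"tp": 10, "fp": 30, "fn": 40, "S": 20, "D": 20, "I": 10, "N": 50},
]
pooled_f, pooled_er = pooled_sed_metrics(folds)
averaged_f = fold_averaged_f_score(folds)
```

With these counts the pooled F-score (about 0.69) differs clearly from the fold average (about 0.56), which is why the choice of aggregation has to be fixed for a fair comparison.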

# Baseline system

As the baseline, we use the recently published SELDnet, a CRNN-based method that uses the confidence of the SED to estimate one DOA for each sound class. The SED is performed as a multi-class multi-label classification, whereas the DOA estimation is performed as a multi-output regression. SELDnet uses the magnitude and phase components of the FFT as input features, SED labels represented as one-hot encoding, and DOAs represented as azimuth and elevation angles in radians. More details about SELDnet can be found in:

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, ():1–1, 2018. doi:10.1109/JSTSP.2018.2885636.


The baseline implementation differs from the implementation in the above publication in that the baseline directly outputs azimuth and elevation angles, rather than the Cartesian components of the DOA vector.

PLEASE NOTE: The SELD task is at a nascent stage. Although the SELDnet architecture used as the baseline has been found to perform effectively on the SELD task for a range of conditions, many approaches remain unexplored in both data-driven and model-based estimation. We believe the proposed task and dataset will help explore new methods, and the results obtained, poorer or better, should be treated as valid research and shared with the community. Knowing what works and what doesn't will benefit future researchers.

## Repository

This repository implements SELDnet and performs cross-validation in the manner we recommend. We also provide scripts to visualize your SELD results and estimate the relevant metric scores before final submission.

## Results for the development dataset

| Dataset | Error rate | F score | DOA error | Frame recall |
|---|---|---|---|---|
| Ambisonic | 0.34 | 79.7 % | 30.8° | 84.3 % |
| Microphone array | 0.37 | 78.5 % | 35.4° | 81.6 % |

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

# Citation

If you are participating in this task or using the dataset and code, please consider citing the following paper:

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, ():1–1, 2018. doi:10.1109/JSTSP.2018.2885636.
