Sound Event Localization and Detection


Task description

The goal of this task is to jointly localize and recognize individual sound events and their respective temporal onset and offset times.

The challenge has ended. Full results for this task can be found on the Results page.

Description

Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, their respective onset-offset times, and their directions-of-arrival (DOAs) in azimuth and elevation angles. Effective implementations of such a SELD method enable an automated description of human activities with a spatial dimension, and help machines to interact with the world more seamlessly. More specifically, SELD can be an important module in assisted listening systems, scene information visualization systems, immersive interactive media, and spatial machine cognition for scene-based deployment of services. A straightforward practical application is a robot that recognizes and tracks the sound source of interest. In the current challenge, only static scenes are considered, meaning that each individual sound event instance in the provided recordings is spatially stationary with a fixed location during its entire duration.

Figure 1: Overview of sound event localization and detection system.


Audio dataset

The task provides two datasets, TAU Spatial Sound Events 2019 - Ambisonic and TAU Spatial Sound Events 2019 - Microphone Array, which contain the same sound scenes and differ only in the format of the audio. The Ambisonic dataset provides four-channel First-Order Ambisonic (FOA) recordings, while the Microphone Array dataset provides four-channel directional microphone recordings from a tetrahedral array configuration. Both formats are extracted from the same microphone array, and additional information on the spatial characteristics of each format can be found below. Participants can choose either one or both of the datasets, depending on the audio format they prefer. Each dataset consists of a development set and an evaluation set. The development set consists of 400 one-minute recordings sampled at 48000 Hz, divided into four cross-validation splits of 100 recordings each. The evaluation set consists of 100 one-minute recordings. These recordings were synthesized using spatial room impulse responses (IRs) collected from five indoor locations, at 504 unique combinations of azimuth-elevation-distance. To synthesize the recordings, the collected IRs were convolved with the isolated sound events dataset from DCASE 2016 task 2. Finally, to create a realistic sound scene recording, natural ambient noise collected in the IR recording locations was added to the synthesized recordings such that the average SNR of the sound events was 30 dB.

The IRs were collected in Finland by Tampere University between 12/2017 and 06/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Recording procedure

The real-life IR recordings were collected using an Eigenmike spherical microphone array. A Genelec G Two loudspeaker was used to play back a maximum length sequence (MLS) around the Eigenmike. The MLS playback level was set to be 30 dB greater than the ambient sound level during the recording. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and the far-field recording, independently at each frequency. These IRs were collected in the following directions:

  • 36 IRs at every 10° azimuth angle, for 9 elevations from -40° to 40° at 10° increments, at 1 m distance from the Eigenmike, resulting in 324 discrete DOAs.
  • 36 IRs at every 10° azimuth angle, for 5 elevations from -20° to 20° at 10° increments, at 2 m distance from the Eigenmike, resulting in 180 discrete DOAs.
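
The two bullet points above can be checked with a short enumeration. This is only an illustrative sketch: the azimuth range of -180° to 170° is taken from the submission format described later on this page, and the names `grid`, `near` and `far` are made up for this example.

```python
# Enumerate the IR measurement grid (angles in degrees, distance in metres).
AZIMUTHS = range(-180, 180, 10)  # 36 azimuths in 10 degree steps (assumed range)

def grid(elevations, distance_m):
    """All (azimuth, elevation, distance) combinations for one distance."""
    return [(az, el, distance_m) for az in AZIMUTHS for el in elevations]

near = grid(range(-40, 50, 10), 1)  # 9 elevations, -40..40, at 1 m
far = grid(range(-20, 30, 10), 2)   # 5 elevations, -20..20, at 2 m
all_positions = near + far

print(len(near), len(far), len(all_positions))  # 324 180 504
```

The total of 504 matches the number of unique azimuth-elevation-distance combinations quoted in the dataset description.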

The IRs were recorded at five different indoor locations inside the Tampere University campus at Hervanta, Finland. Additionally, we collected 30 minutes of ambient noise recordings from these five locations with the IR recording setup unchanged. The indoor locations are described as follows:

  1. Language Center - Large common area with multiple seating tables and carpet flooring. People chatting and working.
  2. Reaktori Building - Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
  3. Festia Building - High ceiling corridor with hard flooring. People walking around and chatting.
  4. Tietotalo Building - Corridor with classrooms around and hard flooring. People walking around and chatting.
  5. Sähkötalo Building - Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.

Recording format and dataset specifications

The isolated sound events dataset from DCASE 2016 task 2 consists of 11 classes, each with 20 examples. These examples are randomly split into five sets with an equal number of examples for each class. The first four sets are used for synthesizing the four splits of the development dataset, while the remaining set is used for the evaluation dataset. For each split of the dataset, we synthesize 100 recordings. Each of these recordings is generated by randomly choosing sound event examples from the corresponding set and randomly assigning each of them a start time and one of the collected IRs. Finally, by convolving each of these sound examples with its assigned IR, we spatially position it at a given distance, azimuth and elevation angle from the Eigenmike. We make sure to use IRs from a single location for all sound events in a recording. Further, half of the recordings in each split are synthesized with up to two temporally overlapping sound events, while the others are synthesized with no overlapping sound events. Finally, the ambient noise collected at the respective IR location is added to the synthesized recording such that the average SNR of the sound events is 30 dB.
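
The synthesis steps described above (convolving an event with a multichannel IR, placing it at an onset, and adding ambient noise at a target SNR) can be sketched roughly as follows. This is not the dataset generation code: the signals are random stand-ins, the function names are hypothetical, and the SNR is computed over the whole recording, which is an assumption since the exact averaging used for the dataset is not specified here.

```python
import numpy as np

def spatialize(event, ir):
    """Convolve a mono event (n,) with a multichannel IR (taps, channels)."""
    return np.stack(
        [np.convolve(event, ir[:, ch]) for ch in range(ir.shape[1])], axis=1
    )

def mix_at_snr(scene, noise, snr_db):
    """Scale `noise` so the scene-to-noise power ratio is snr_db, then add it."""
    scene_power = np.mean(scene ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(scene_power / (noise_power * 10 ** (snr_db / 10)))
    return scene + gain * noise, gain

rng = np.random.default_rng(0)
fs = 48000
event = rng.standard_normal(fs // 2)              # stand-in for an isolated event
decay = np.exp(-np.arange(2048) / 300.0)[:, None]
ir = rng.standard_normal((2048, 4)) * decay       # toy 4-channel IR
scene = np.zeros((2 * fs, 4))                     # 2 s toy recording
onset = int(0.5 * fs)                             # randomly assigned in the dataset
wet = spatialize(event, ir)
scene[onset:onset + len(wet)] += wet
noise = rng.standard_normal(scene.shape)          # stand-in for recorded ambience
mixture, gain = mix_at_snr(scene, noise, snr_db=30.0)
```

In the actual dataset the IR and the ambient noise of one recording come from the same location; the sketch ignores that bookkeeping.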

Since the number of channels in the IRs is equal to the number of microphones in the Eigenmike (32), in order to create the TAU Spatial Sound Events 2019 - Microphone Array dataset we use channels 6, 10, 26, and 22, which correspond to microphone positions (45°, 35°, 4.2 cm), (-45°, -35°, 4.2 cm), (135°, -35°, 4.2 cm) and (-135°, 35°, 4.2 cm). The spherical coordinate system in use is right-handed with the front at (0°, 0°), left at (90°, 0°) and top at (0°, 90°). Further, the TAU Spatial Sound Events 2019 - Ambisonic dataset is obtained by converting the 32-channel microphone signals to FOA by means of encoding filters based on anechoic measurements of the Eigenmike array response.

For model-based localization approaches the array response may be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from the DOA given by azimuth angle \(\phi\) and elevation angle \(\theta\).

For the first-order ambisonics:

\begin{eqnarray} H_1(\phi, \theta, f) &=& 1 \\ H_2(\phi, \theta, f) &=& \sqrt{3} * \sin(\phi) * \cos(\theta) \\ H_3(\phi, \theta, f) &=& \sqrt{3} * \sin(\theta) \\ H_4(\phi, \theta, f) &=& \sqrt{3} * \cos(\phi) * \cos(\theta) \end{eqnarray}
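
As a small illustration of the four first-order ambisonic responses above, the sketch below evaluates them for a DOA given in degrees. The function name `foa_steering` is made up for this example; the channel order simply follows \(H_1\) to \(H_4\) as written.

```python
import math

def foa_steering(azimuth_deg, elevation_deg):
    """Return (H_1, H_2, H_3, H_4) for a source at the given DOA (degrees)."""
    phi = math.radians(azimuth_deg)
    theta = math.radians(elevation_deg)
    s3 = math.sqrt(3.0)
    return (
        1.0,
        s3 * math.sin(phi) * math.cos(theta),
        s3 * math.sin(theta),
        s3 * math.cos(phi) * math.cos(theta),
    )

front = foa_steering(0, 0)   # only H_1 and H_4 are non-zero
left = foa_steering(90, 0)   # only H_1 and H_2 are (numerically) non-zero
top = foa_steering(0, 90)    # only H_1 and H_3 are (numerically) non-zero
```

For any direction, the squares of the three directional channels sum to 3, which is a quick sanity check on an implementation.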

For the tetrahedral array of microphones mounted on spherical baffle, an analytical expression for the directional array response is given by the expansion:

\begin{equation} H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R/c)^2}\sum_{n=0}^{30} \frac{i^{n-1}}{h_n'^{(2)}(\omega R/c)}(2n+1)P_n(\cos(\gamma_m)) \end{equation}

where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the specific microphone's azimuth and elevation position, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\cos(\gamma_m)\) is the cosine of the angle between the microphone and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative with respect to the argument of a spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20 kHz. Note that the Ambisonics format is frequency-independent, something that holds true up to around 9 kHz with the specific microphone array, while the actual encoded response starts to deviate gradually from the ideal one provided above at higher frequencies.
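
The expansion above can be evaluated numerically with nothing but the standard library. The following is a sketch under stated assumptions, not the code used to build the dataset: the spherical Hankel functions and Legendre polynomials are computed with their standard upward recurrences, the Cartesian convention (x front, y left, z up) is inferred from the coordinate system described earlier, and all function names are hypothetical. It is not valid at 0 Hz, where \(\omega R/c = 0\).

```python
import cmath
import math

R = 0.042   # array radius in metres (from the text)
C = 343.0   # speed of sound in m/s (from the text)
MICS_DEG = [(45, 35), (-45, -35), (135, -35), (-135, 35)]  # (azimuth, elevation)

def unit_vector(az_deg, el_deg):
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def hankel2_derivatives(x, order):
    """Derivatives of h_n^(2)(x) for n = 0..order, via upward recurrence."""
    h = [1j * cmath.exp(-1j * x) / x,
         cmath.exp(-1j * x) * (-1.0 / x + 1j / x ** 2)]
    for n in range(1, order):
        h.append((2 * n + 1) / x * h[n] - h[n - 1])
    dh = [-h[1]]                                   # h_0' = -h_1
    for n in range(1, order + 1):
        dh.append(h[n - 1] - (n + 1) / x * h[n])   # h_n' = h_{n-1} - (n+1)/x h_n
    return dh

def legendre(x, order):
    """Unnormalized Legendre polynomials P_0(x)..P_order(x)."""
    p = [1.0, x]
    for n in range(1, order):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p

def array_response(mic_deg, doa_deg, freq_hz, order=30):
    """Evaluate H_m for one microphone, one DOA and one frequency (> 0 Hz)."""
    kr = 2 * math.pi * freq_hz * R / C
    um, ud = unit_vector(*mic_deg), unit_vector(*doa_deg)
    cos_gamma = max(-1.0, min(1.0, sum(a * b for a, b in zip(um, ud))))
    dh = hankel2_derivatives(kr, order)
    p = legendre(cos_gamma, order)
    total = sum((1j ** (n - 1)) / dh[n] * (2 * n + 1) * p[n]
                for n in range(order + 1))
    return total / kr ** 2

print(abs(array_response(MICS_DEG[0], (0, 0), 1000.0)))
```

At low frequencies the response tends towards 1 for every direction, since the baffle is then small compared to the wavelength and the microphones behave almost omnidirectionally.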

In summary, there are 100 recordings in the evaluation dataset and in each of the four splits of the development dataset. For each of the five locations, these 100 recordings comprise 10 recordings with up to two temporally overlapping sound events and 10 recordings with no overlapping sound events (10 * 2 * 5 = 100). The only explicit difference between each of the development dataset splits and the evaluation dataset is the isolated sound event examples employed. Although each of the development dataset splits and the evaluation dataset contains IRs from all five locations, the dataset only guarantees a balanced distribution of sound events across the 36 azimuth and 9 elevation angles within each split; it does not guarantee that all IRs collected at a single location are present in a single split. For example, some of the IRs of the Reaktori building might not be in the first split but might occur in any of the other splits.

More details on the IR recordings collections and synthesis of the dataset can be read in:

Publication

Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. A multi-room reverberant dataset for sound event localization and detection. In Submitted to Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). 2019. URL: https://arxiv.org/abs/1905.08546.


A Multi-room Reverberant Dataset for Sound Event Localization and Detection

Abstract

This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.


Reference labels

As labels, for each recording in the development dataset we provide a CSV file that lists the sound events, their respective temporal onset-offset times, and their azimuth and elevation angles. Since the development dataset provides four cross-validation splits, it can be used as a standalone dataset for future work. If you are preparing a publication based on the DCASE challenge setup and you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. The task coordinators can provide unofficial scoring for a limited number of system outputs.

Download


The dataset was updated on 20 March 2019 to remove labels of sound events that were missing in the audio (version 2). To update an already downloaded version 1 of the dataset, download only the metadata_dev.zip file from version 2.


Task setup

The development dataset comes with four pre-defined cross-validation folds, as shown in the table below. The splits consist of audio recordings and the corresponding metadata describing the sound events and their respective locations within each recording. Participants are required to report the performance of their method on the testing splits of the four folds. To allow a fair comparison of methods on the development dataset, participants are not allowed to change the defined splits.

Fold     Training splits   Validation split   Testing split
Fold 1   3, 4              2                  1
Fold 2   4, 1              3                  2
Fold 3   1, 2              4                  3
Fold 4   2, 3              1                  4
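
For convenience, the fold table can be written down as a small data structure. The dictionary layout and key names below are just one possible encoding chosen for this example.

```python
# Cross-validation setup from the table above: split numbers per role.
FOLDS = {
    1: {"train": (3, 4), "val": (2,), "test": (1,)},
    2: {"train": (4, 1), "val": (3,), "test": (2,)},
    3: {"train": (1, 2), "val": (4,), "test": (3,)},
    4: {"train": (2, 3), "val": (1,), "test": (4,)},
}

def splits_for(fold, role):
    """Return the split numbers used for `role` ('train', 'val' or 'test')."""
    return FOLDS[fold][role]

# Sanity checks: each fold uses every split exactly once,
# and each split is the testing split of exactly one fold.
for roles in FOLDS.values():
    used = roles["train"] + roles["val"] + roles["test"]
    assert sorted(used) == [1, 2, 3, 4]
assert sorted(r["test"][0] for r in FOLDS.values()) == [1, 2, 3, 4]
```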

The evaluation dataset is released a few weeks before the final submission deadline. This dataset consists of only audio recordings without any metadata/labels. Participants can decide the training procedure, i.e. the amount of training and validation files in the development dataset, the number of ensemble models, and submit the results of the SELD performance on the evaluation dataset.

Development dataset

The recordings in the development dataset follow the naming convention:

split[number]_ir[location number]_ov[number of overlapping sound events]_[recording number per split].wav

The location whose impulse responses were used to synthesize the recording and the number of overlapping sound events are included in the file name only so that participants can analyze the performance of their method under different conditions. We encourage participants to study such conditions individually and to report the findings as a publication in the DCASE 2019 workshop. For the challenge, however, we only consider generic methods that do not use the location or the number of overlapping sound events during training or inference.
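
For per-condition analysis of results, the development file names can be parsed with a regular expression. This is a minimal sketch: the helper name and the example file name are hypothetical, and the parsed fields should only be used for analysis, not as model input.

```python
import re

# split[number]_ir[location number]_ov[number of overlapping events]_[recording number].wav
DEV_NAME = re.compile(
    r"^split(?P<split>\d+)_ir(?P<ir>\d+)_ov(?P<ov>\d+)_(?P<rec>\d+)\.wav$"
)

def parse_dev_name(filename):
    """Return the integer fields encoded in a development file name."""
    match = DEV_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a development file name: {filename!r}")
    return {key: int(value) for key, value in match.groupdict().items()}

info = parse_dev_name("split1_ir0_ov2_17.wav")  # hypothetical file name
print(info)  # {'split': 1, 'ir': 0, 'ov': 2, 'rec': 17}
```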

Evaluation dataset

The evaluation dataset consists of 100 recordings whose file names carry no information on the location or the number of overlapping sound events. They follow the naming convention:

split[number]_[recording number per split].wav

Task rules

  • Use of external data is not allowed.
  • Manipulation of provided cross-validation split for development dataset is not allowed.
  • The development dataset can be augmented without the use of external data (e.g. using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.

Submission

The results for each of the 100 recordings in the evaluation dataset should be collected in individual file-wise CSV files. Similarly, we also collect a file-wise CSV for each of the 400 recordings (testing split results of the four folds) in the development dataset. Each file-wise CSV has the same name as the audio recording, but with the .csv extension, and contains the following information in each row.

[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]

An example output file will look like below. If you use the baseline code, then this output is automatically produced for you.

10,1,10,-20
10,1,-50,30
11,1,10,-20
11,1,-50,30
12,1,10,-20
12,1,-50,30
13,1,10,-20
13,1,-50,30
13,2,30,0
112,4,-40,0
113,4,-40,0
114,4,-40,0

The output file describes two instances of class 1 that are active at locations (10°, -20°) and (-50°, 30°) for four consecutive frames, 10 to 13. In frame 13, class 2 is also active at (30°, 0°) in addition to class 1. Finally, class 4 is active in frames 112-114 at location (-40°, 0°).
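
The example output can be read back into a per-frame structure with the standard `csv` module. The sketch below is only an illustration; the function name `read_output` and the dictionary layout are choices made for this example and are not taken from the baseline code.

```python
import csv
import io
from collections import defaultdict

EXAMPLE = """10,1,10,-20
10,1,-50,30
11,1,10,-20
11,1,-50,30
12,1,10,-20
12,1,-50,30
13,1,10,-20
13,1,-50,30
13,2,30,0
112,4,-40,0
113,4,-40,0
114,4,-40,0
"""

def read_output(text):
    """Map frame number -> list of (class index, azimuth, elevation)."""
    frames = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:          # skip blank lines
            continue
        frame, cls, azimuth, elevation = (int(value) for value in row)
        frames[frame].append((cls, azimuth, elevation))
    return dict(frames)

frames = read_output(EXAMPLE)
print(frames[13])  # [(1, 10, -20), (1, -50, 30), (2, 30, 0)]
```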

The evaluation is performed at a hop length of 20 ms, which results in 3000 frames for a 60 s long audio recording. Participants who use a different hop length in their study are expected to apply a suitable post-processing method to extract the information at a 20 ms hop length before submitting it as evaluation results. The class index for each of the 11 classes in the provided dataset is available in the metadata provided with the dataset and in the baseline code. The azimuth angles are expected in the range of -180° to 170° and the elevation angles in the range of -40° to 40°; any value beyond these limits will be clipped to the respective minimum or maximum value.
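
One simple post-processing option for a system that runs at a different hop length is nearest-preceding-frame resampling to the 20 ms grid, combined with the clipping described above. The sketch below makes several assumptions that are not stated on this page: frame numbers start at 0, hop lengths are given as whole milliseconds, and the function names are hypothetical.

```python
EVAL_HOP_MS = 20

def num_eval_frames(duration_s):
    """Number of 20 ms frames in a recording of the given duration."""
    return round(duration_s * 1000) // EVAL_HOP_MS

def clip_doa(azimuth, elevation):
    """Clip angles (degrees) to the ranges expected by the evaluation."""
    return max(-180, min(170, azimuth)), max(-40, min(40, elevation))

def to_eval_rows(rows, src_hop_ms, duration_s):
    """Resample (frame, class, azimuth, elevation) rows to a 20 ms hop."""
    by_frame = {}
    for frame, cls, azimuth, elevation in rows:
        by_frame.setdefault(frame, []).append((cls, *clip_doa(azimuth, elevation)))
    out = []
    for t in range(num_eval_frames(duration_s)):
        src = (t * EVAL_HOP_MS) // src_hop_ms   # source frame covering time t
        for cls, azimuth, elevation in by_frame.get(src, []):
            out.append((t, cls, azimuth, elevation))
    return out

rows_40ms = [(0, 1, 10, -20), (1, 2, 200, -50)]   # toy output at a 40 ms hop
print(to_eval_rows(rows_40ms, src_hop_ms=40, duration_s=0.08))
# [(0, 1, 10, -20), (1, 1, 10, -20), (2, 2, 170, -40), (3, 2, 170, -40)]
```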

In addition to the CSV files, participants are asked to update the information about their method in the provided file and to submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, meta information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission.

Detailed information for the submission can be found on the Submission page.

Evaluation

The SELD task is evaluated with individual metrics for SED and DOA estimation. For SED, we use the F-score and error rate (ER) calculated in one-second segments. A short description of the SED metrics is found here, and the detailed information is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

For DOA estimation we use two frame-wise metrics: DOA error and frame recall. The DOA error is the average angular error in degrees between the predicted and reference DOAs. For a recording of length \(T\) time-frames, let \(\mathbf{DOA}^t_R\) be the list of all reference DOAs at time-frame \(t\) and \(\mathbf{DOA}^t_E\) be the list of all estimated DOAs. The DOA error is now defined as

\begin{equation} DOA\,error = \frac{1}{\sum_{t=1}^{T}{D^t_E}}\sum_{t=1}^{T}{\mathcal{H}(\mathbf{DOA}^t_R, \mathbf{DOA}^t_E)}, \end{equation}

where \(D^t_E\) is the number of DOAs in \(\mathbf{DOA}^t_E\) at the \(t\)-th frame, and \(\mathcal{H}\) is the Hungarian algorithm for solving the assignment problem, i.e., matching the individual estimated DOAs with the respective reference DOAs. The Hungarian algorithm solves this by using as pair-wise cost between an individual predicted and reference DOA the central angle between them,

\begin{equation} \sigma = \arccos(\sin\lambda_{E}\sin\lambda_{R} + \cos\lambda_{E}\cos\lambda_{R}\cos(|\phi_{R}-\phi_{E}|)) \end{equation}

where the reference DOA is represented by the azimuth angle \(\phi_R \in [-\pi, \pi)\) and elevation angle \(\lambda_R \in [-\pi/2, \pi/2]\), and the estimated DOA is represented by \((\phi_E, \lambda_E)\) in the same ranges as the reference DOA.
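
The central angle \(\sigma\) translates directly into code. The sketch below works in degrees for convenience and clamps the cosine to [-1, 1] to guard against rounding; the function name is made up for this example.

```python
import math

def central_angle_deg(az_ref, el_ref, az_est, el_est):
    """Central angle in degrees between a reference and an estimated DOA."""
    phi_r, lam_r, phi_e, lam_e = map(
        math.radians, (az_ref, el_ref, az_est, el_est)
    )
    cos_sigma = (math.sin(lam_e) * math.sin(lam_r)
                 + math.cos(lam_e) * math.cos(lam_r) * math.cos(abs(phi_r - phi_e)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sigma))))

print(round(central_angle_deg(0, 0, 90, 0), 6))      # 90.0
print(round(central_angle_deg(-180, 0, 170, 0), 6))  # 10.0
```

The second call shows that the formula handles the azimuth wrap-around: -180° and 170° are only 10° apart.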

In order to account for time frames where the number of estimated and reference DOAs are unequal, we report the second metric frame recall, which is calculated as,

\begin{equation} Frame\,recall =\frac{\sum_{t=1}^{T}{\mathbb{1}(D^t_R = D^t_E)}}{T}, \end{equation}

where \(D^t_R\) is the number of DOAs in \(\mathbf{DOA}^t_R\) at the \(t\)-th frame, and \(\mathbb{1}()\) is the indicator function, which returns one if the condition \((D^t_R = D^t_E)\) is met and zero otherwise. More details regarding the DOA metrics can be found in:
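
The two DOA metrics can be sketched for small numbers of simultaneous sources as follows. This is not the official evaluation code: instead of the Hungarian algorithm it enumerates all one-to-one matchings with `itertools.permutations`, which gives the same minimum cost for the handful of DOAs per frame considered here, and it assumes that unmatched DOAs in frames with unequal counts contribute no cost. All names are hypothetical.

```python
import itertools
import math

def angle_deg(a, b):
    """Central angle in degrees between two (azimuth, elevation) pairs."""
    (p1, l1), (p2, l2) = [tuple(map(math.radians, x)) for x in (a, b)]
    c = math.sin(l1) * math.sin(l2) + math.cos(l1) * math.cos(l2) * math.cos(abs(p1 - p2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def frame_cost(refs, ests):
    """Minimum total angle over one-to-one matchings (brute force)."""
    if not refs or not ests:
        return 0.0
    small, large = (refs, ests) if len(refs) <= len(ests) else (ests, refs)
    return min(
        sum(angle_deg(s, l) for s, l in zip(small, perm))
        for perm in itertools.permutations(large, len(small))
    )

def doa_metrics(ref_frames, est_frames):
    """Return (DOA error in degrees, frame recall) for two lists of frames."""
    total = sum(frame_cost(r, e) for r, e in zip(ref_frames, est_frames))
    n_est = sum(len(e) for e in est_frames)
    doa_error = total / n_est if n_est else 0.0
    recall = sum(len(r) == len(e) for r, e in zip(ref_frames, est_frames)) / len(ref_frames)
    return doa_error, recall

ref = [[(10, -20), (-50, 30)], [(0, 0)], []]              # toy reference frames
est = [[(-50, 30), (10, -10)], [(0, 0), (90, 0)], []]     # toy estimates
doa_error, frame_recall = doa_metrics(ref, est)
```

In this toy case the first frame contributes 10° (one DOA is off by 10° in elevation), the total is divided by the four estimated DOAs, and two of the three frames have matching DOA counts.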

Publication

Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In 2018 26th European Signal Processing Conference (EUSIPCO), 1462–1466. IEEE, 2018.


Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

Abstract

This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.


An ideal SELD method will have an error rate of zero, an F-score of 1 (reported in %), a DOA error of 0° and a frame recall of 1 (reported in %). In order to compare the submitted methods, we will rank each method individually for all four metrics, and the final positions will be obtained using the cumulative minimum of the ranks.
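
The ranking procedure is only described briefly above. The sketch below shows one possible reading, in which each system is ranked separately on every metric and the systems are then ordered by the sum of their four ranks, lowest first. This interpretation, the arbitrary handling of ties, the function names and the toy scores for systems A, B and C are all assumptions made for illustration; they are not the official ranking code or real results.

```python
# Metric name -> True if a higher value is better.
METRICS = {"er": False, "f": True, "doa": False, "fr": True}

def per_metric_ranks(scores, metric, higher_is_better):
    """Rank systems on one metric (1 = best); ties are broken arbitrarily."""
    ordered = sorted(scores, key=lambda name: scores[name][metric],
                     reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}

def final_order(scores):
    """Order systems by the sum of their per-metric ranks (assumed reading)."""
    totals = {name: 0 for name in scores}
    for metric, higher_is_better in METRICS.items():
        for name, rank in per_metric_ranks(scores, metric, higher_is_better).items():
            totals[name] += rank
    return sorted(totals, key=totals.get), totals

toy_scores = {  # hypothetical systems, not challenge submissions
    "A": {"er": 0.10, "f": 93.0, "doa": 5.0, "fr": 90.0},
    "B": {"er": 0.08, "f": 95.0, "doa": 9.0, "fr": 92.0},
    "C": {"er": 0.20, "f": 88.0, "doa": 4.0, "fr": 85.0},
}
order, totals = final_order(toy_scores)
print(order, totals)  # ['B', 'A', 'C'] {'A': 8, 'B': 6, 'C': 10}
```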

PLEASE NOTE: The four cross-validation folds are treated as a single experiment, meaning that metrics are calculated only after training and testing all folds, not as the average of the individual folds nor as the average of individual class performance. Intermediate measures (insertions, deletions, substitutions) from all folds are accumulated before calculating metrics. For the rationale behind this choice, please refer to the following paper:
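
The difference between accumulating intermediate measures and averaging per-fold metrics can be illustrated with the standard definitions of the SED error rate, ER = (S + D + I) / N, and F-score, F = 2TP / (2TP + FP + FN). The per-fold counts below are made-up toy numbers and the function names are hypothetical; only two folds are shown to keep the arithmetic short.

```python
def error_rate(counts):
    """ER = (substitutions + deletions + insertions) / reference events."""
    return (counts["S"] + counts["D"] + counts["I"]) / counts["N"]

def f_score(counts):
    """F = 2TP / (2TP + FP + FN)."""
    return 2 * counts["TP"] / (2 * counts["TP"] + counts["FP"] + counts["FN"])

def accumulate(folds):
    """Sum the intermediate measures over all folds."""
    return {key: sum(fold[key] for fold in folds) for key in folds[0]}

toy_folds = [  # hypothetical intermediate measures for two folds
    {"N": 100, "S": 5, "D": 5, "I": 0, "TP": 90, "FP": 5, "FN": 10},
    {"N": 50, "S": 10, "D": 20, "I": 0, "TP": 20, "FP": 10, "FN": 30},
]

pooled = accumulate(toy_folds)
pooled_er, pooled_f = error_rate(pooled), f_score(pooled)
mean_er = sum(error_rate(f) for f in toy_folds) / len(toy_folds)
mean_f = sum(f_score(f) for f in toy_folds) / len(toy_folds)
print(round(pooled_er, 3), round(pooled_f, 3))  # 0.267 0.8
print(round(mean_er, 3), round(mean_f, 3))      # 0.35 0.712
```

The pooled values are the ones the challenge setup calls for; the per-fold averages differ because the folds contain different numbers of reference events.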

Publication

George Forman and Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. SIGKDD Explor. Newsl., 12(1):49–57, November 2010. URL: http://doi.acm.org/10.1145/1882471.1882479, doi:10.1145/1882471.1882479.


Apples-to-apples in Cross-validation Studies: Pitfalls in Classifier Performance Measurement

Abstract

Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages. This leads to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in aggregated results over many folds and datasets, without ever a person looking at the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance for how best to measure classification performance under cross-validation. In particular, there are several divergent methods used for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and in one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods leads to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and researchers focused on high class imbalance.


Results

The SELD task received 58 submissions in total from 22 teams across the world. The results for these submissions are as follows.

Columns, in order: submission name, author, affiliation, technical report, official rank, and the metrics on the evaluation dataset: error rate, F-score (%), DOA error (°), frame recall (%).
Kapka_SRPOL_task3_2 Slawomir Kapka Samsung R&D Institute Poland task-sound-event-localization-and-detection-results#Kapka2019 1 0.08 94.7 3.7 96.8
Kapka_SRPOL_task3_4 Slawomir Kapka Samsung R&D Institute Poland task-sound-event-localization-and-detection-results#Kapka2019 2 0.08 94.7 3.7 96.8
Kapka_SRPOL_task3_3 Slawomir Kapka Samsung R&D Institute Poland task-sound-event-localization-and-detection-results#Kapka2019 3 0.10 93.5 4.6 96.0
Cao_Surrey_task3_4 Qiuqiang Kong University of Surrey task-sound-event-localization-and-detection-results#Cao2019 4 0.08 95.5 5.5 92.2
Xue_JDAI_task3_1 Wei Xue JD.COM task-sound-event-localization-and-detection-results#Xue2019 5 0.06 96.3 9.7 92.3
He_THU_task3_2 Liang He Tsinghua University task-sound-event-localization-and-detection-results#He2019 6 0.06 96.7 22.4 94.1
He_THU_task3_1 Liang He Tsinghua University task-sound-event-localization-and-detection-results#He2019 7 0.06 96.6 23.8 94.4
Cao_Surrey_task3_1 Qiuqiang Kong University of Surrey task-sound-event-localization-and-detection-results#Cao2019 8 0.09 95.1 5.5 91.0
Xue_JDAI_task3_4 Wei Xue JD.COM task-sound-event-localization-and-detection-results#Xue2019 9 0.07 95.9 10.0 92.6
Jee_NTU_task3_1 Wen Jie Jee Nanyang Technological University task-sound-event-localization-and-detection-results#Jee2019 10 0.12 93.7 4.2 91.8
Xue_JDAI_task3_3 Wei Xue JD.COM task-sound-event-localization-and-detection-results#Xue2019 11 0.08 95.6 10.1 92.2
He_THU_task3_4 Liang He Tsinghua University task-sound-event-localization-and-detection-results#He2019 12 0.06 96.3 26.1 93.4
Cao_Surrey_task3_3 Qiuqiang Kong University of Surrey task-sound-event-localization-and-detection-results#Cao2019 13 0.10 94.9 5.8 90.4
Xue_JDAI_task3_2 Wei Xue JD.COM task-sound-event-localization-and-detection-results#Xue2019 14 0.09 95.2 9.2 91.5
He_THU_task3_3 Liang He Tsinghua University task-sound-event-localization-and-detection-results#He2019 15 0.08 95.6 24.4 92.9
Cao_Surrey_task3_2 Qiuqiang Kong University of Surrey task-sound-event-localization-and-detection-results#Cao2019 16 0.12 93.8 5.5 89.0
Nguyen_NTU_task3_3 Thi Ngoc Tho Nguyen Nanyang Technological University task-sound-event-localization-and-detection-results#Nguyen2019 17 0.11 93.4 5.4 88.8
MazzonYasuda_NTT_task3_3 Yuma Koizumi NTT Media Intelligence Laboratories task-sound-event-localization-and-detection-results#MazzonYasuda2019 18 0.10 94.2 6.4 88.8
Chang_HYU_task3_3 Chang Joon-Hyuk Hanyang University task-sound-event-localization-and-detection-results#Chang2019 19 0.14 91.9 2.7 90.8
Nguyen_NTU_task3_4 Thi Ngoc Tho Nguyen Nanyang Technological University task-sound-event-localization-and-detection-results#Nguyen2019 20 0.12 93.2 5.5 88.7
Chang_HYU_task3_4 Chang Joon-Hyuk Hanyang University task-sound-event-localization-and-detection-results#Chang2019 21 0.17 90.5 3.1 94.1
MazzonYasuda_NTT_task3_2 Yuma Koizumi NTT Media Intelligence Laboratories task-sound-event-localization-and-detection-results#MazzonYasuda2019 22 0.13 93.0 5.0 88.2
Chang_HYU_task3_2 Chang Joon-Hyuk Hanyang University task-sound-event-localization-and-detection-results#Chang2019 23 0.14 92.3 9.7 95.3
Chang_HYU_task3_1 Chang Joon-Hyuk Hanyang University task-sound-event-localization-and-detection-results#Chang2019 24 0.13 92.8 8.4 91.4
MazzonYasuda_NTT_task3_1 Yuma Koizumi NTT Media Intelligence Laboratories task-sound-event-localization-and-detection-results#MazzonYasuda2019 25 0.12 93.3 7.1 88.1
Ranjan_NTU_task3_3 Rishabh Ranjan Nanyang Technological University task-sound-event-localization-and-detection-results#Ranjan2019 26 0.16 90.9 5.7 91.8
Ranjan_NTU_task3_4 Rishabh Ranjan Nanyang Technological University task-sound-event-localization-and-detection-results#Ranjan2019 27 0.16 90.7 6.4 92.0
Park_ETRI_task3_1 Sooyoung Park Electronics and Telecommunications Research Institute task-sound-event-localization-and-detection-results#Park2019 28 0.15 91.9 5.1 87.4
Nguyen_NTU_task3_1 Thi Ngoc Tho Nguyen Nanyang Technological University task-sound-event-localization-and-detection-results#Nguyen2019 29 0.15 91.1 5.6 89.8
Leung_DBS_task3_2 Shuangran Leung DBSonics task-sound-event-localization-and-detection-results#Leung2019 30 0.12 93.3 25.9 91.1
Park_ETRI_task3_2 Sooyoung Park Electronics and Telecommunications Research Institute task-sound-event-localization-and-detection-results#Park2019 31 0.15 91.8 5.0 87.2
Grondin_MIT_task3_1 Francois Grondin Massachusetts Institute of Technology task-sound-event-localization-and-detection-results#Grondin2019 32 0.14 92.2 7.4 87.5
Leung_DBS_task3_1 Shuangran Leung DBSonics task-sound-event-localization-and-detection-results#Leung2019 33 0.12 93.4 27.2 90.7
Park_ETRI_task3_3 Sooyoung Park Electronics and Telecommunications Research Institute task-sound-event-localization-and-detection-results#Park2019 34 0.15 91.9 7.0 87.4
MazzonYasuda_NTT_task3_4 Yuma Koizumi NTT Media Intelligence Laboratories task-sound-event-localization-and-detection-results#MazzonYasuda2019 35 0.14 92.0 7.3 87.1
Park_ETRI_task3_4 Sooyoung Park Electronics and Telecommunications Research Institute task-sound-event-localization-and-detection-results#Park2019 36 0.15 91.8 7.0 87.2
Ranjan_NTU_task3_1 Rishabh Ranjan Nanyang Technological University task-sound-event-localization-and-detection-results#Ranjan2019 37 0.18 89.9 8.6 90.1
Ranjan_NTU_task3_2 Rishabh Ranjan Nanyang Technological University task-sound-event-localization-and-detection-results#Ranjan2019 38 0.22 86.8 7.8 90.0
ZhaoLu_UESTC_task3_1 Zhao Lu University of Electronic Science and Technology of China task-sound-event-localization-and-detection-results#ZhaoLu2019 39 0.18 89.3 6.8 84.3
Rough_EMED_task3_2 Pi LiHong Tsinghua University task-sound-event-localization-and-detection-results#Rough2019 40 0.18 89.7 9.4 85.5
Nguyen_NTU_task3_2 Thi Ngoc Tho Nguyen Nanyang Technological University task-sound-event-localization-and-detection-results#Nguyen2019 41 0.17 89.7 8.0 77.3
Jee_NTU_task3_2 Wen Jie Jee Nanyang Technological University task-sound-event-localization-and-detection-results#Jee2019 42 0.19 89.1 8.1 85.0
Tan_NTU_task3_1 Ee Leng Tan Nanyang Technological University task-sound-event-localization-and-detection-results#Tan2019 43 0.17 89.8 15.4 84.4
Lewandowski_SRPOL_task3_1 Mateusz Lewandowski Samsung R&D Institute Poland task-sound-event-localization-and-detection-results#Kapka2019 44 0.19 89.4 36.2 87.7
Cordourier_IL_task3_2 Hector Cordourier Maruri Intel Corporation task-sound-event-localization-and-detection-results#Cordourier2019 45 0.22 86.5 20.8 85.7
Cordourier_IL_task3_1 Hector Cordourier Maruri Intel Corporation task-sound-event-localization-and-detection-results#Cordourier2019 46 0.22 86.3 19.9 85.6
Krause_AGH_task3_4 Daniel Krause AGH University of Science and Technology task-sound-event-localization-and-detection-results#Krause2019 47 0.22 87.4 31.0 87.0
DCASE2019_FOA_baseline Sharath Adavanne Tampere University task-sound-event-localization-and-detection-results#Adavanne2019 48 0.28 85.4 24.6 85.7
Perezlopez_UPF_task3_1 Andres Perez-Lopez Centre Tecnologic de Catalunya task-sound-event-localization-and-detection-results#Perezlopez2019 49 0.29 82.1 9.3 75.8
Chytas_UTH_task3_1 Sotirios Panagiotis Chytas University of Thessaly task-sound-event-localization-and-detection-results#Chytas2019 50 0.29 82.4 18.6 75.6
Anemueller_UOL_task3_3 Jorn Anemuller University of Oldenburg task-sound-event-localization-and-detection-results#Anemueller2019 51 0.28 83.8 29.2 84.1
Chytas_UTH_task3_2 Sotirios Panagiotis Chytas University of Thessaly task-sound-event-localization-and-detection-results#Chytas2019 52 0.29 82.3 18.7 75.7
Krause_AGH_task3_2 Daniel Krause AGH University of Science and Technology task-sound-event-localization-and-detection-results#Krause2019 53 0.32 82.9 31.7 85.7
Krause_AGH_task3_1 Daniel Krause AGH University of Science and Technology task-sound-event-localization-and-detection-results#Krause2019 54 0.30 83.0 32.5 85.3
Anemueller_UOL_task3_1 Jorn Anemuller University of Oldenburg task-sound-event-localization-and-detection-results#Anemueller2019 55 0.33 81.3 28.2 84.5
Kong_SURREY_task3_1 Qiuqiang Kong University of Surrey task-sound-event-localization-and-detection-results#Kong2019 56 0.29 83.4 37.6 81.3
Anemueller_UOL_task3_2 Jorn Anemuller University of Oldenburg task-sound-event-localization-and-detection-results#Anemueller2019 57 0.36 79.8 25.0 84.1
DCASE2019_MIC_baseline Sharath Adavanne Tampere University task-sound-event-localization-and-detection-results#Adavanne2019 58 0.30 83.2 38.1 83.4
Lin_YYZN_task3_1 Yifeng Lin Esound corporation task-sound-event-localization-and-detection-results#Lin2019 59 1.03 2.6 21.9 31.6
Krause_AGH_task3_3 Daniel Krause AGH University of Science and Technology task-sound-event-localization-and-detection-results#Krause2019 60 0.35 80.3 52.6 83.6

Complete results and technical reports can be found on the results page.

Awards

This task will offer two awards, not necessarily based on the evaluation set performance ranking. These awards aim to encourage contestants to openly publish their code, and to use novel and problem-specific approaches which leverage knowledge of the audio domain. We also highly encourage student authorship.

Reproducible system award

A reproducible system award of 500 USD will be offered for the highest scoring method that is open-source and fully reproducible. For full reproducibility, the authors must provide all the information needed to run the system and achieve the reported performance. The choice of licence is left to the author, but should ideally be selected among the ones approved by the Open Source Initiative.

Judges’ award

A judges’ award of 500 USD will be offered for the method considered by the judges to be the most interesting or innovative. Criteria considered for this award include, but are not limited to, originality, complexity, student participation, and open-source availability. Single-model approaches are strongly preferred over ensembles; occasionally, small ensembles of different models can be considered if the approach is innovative.

More information can be found on the Award page.


The awards are sponsored by:

Gold sponsor: Sonos
Silver sponsor: Harman
Bronze sponsors: Cochlear.ai, Oticon, Sound Intelligence
Technical sponsor: Inria

Baseline system

As the baseline, we use the recently published SELDnet, a CRNN-based method that uses the confidence of the SED output to estimate one DOA for each sound class. SED is performed as multiclass multilabel classification, whereas DOA estimation is performed as multioutput regression. SELDnet uses the magnitude and phase components of the FFT as input features, with SED labels represented as a one-hot encoding and DOAs represented as azimuth and elevation angles in radians. More details about SELDnet can be found in:

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, ():1–1, 2018. doi:10.1109/JSTSP.2018.2885636.

PDF

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Abstract

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

Keywords

Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network

PDF

One difference between the baseline implementation here and the implementation in the above publication is that the baseline directly outputs azimuth and elevation angles, rather than the Cartesian components of the DOA vector.
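The relation between the two DOA representations mentioned above can be sketched as follows. This is a minimal illustration, not code from the baseline repository. The function names are hypothetical, and the axis convention (azimuth measured in the x-y plane from the x-axis, elevation measured from the x-y plane) is an assumption that should be checked against the dataset documentation.

```python
import math

def sph_to_cart(azimuth, elevation):
    """Unit DOA vector (x, y, z) from azimuth and elevation in radians.

    Assumed convention: azimuth in the x-y plane from the x-axis,
    elevation measured upward from the x-y plane.
    """
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)
    return x, y, z

def cart_to_sph(x, y, z):
    """Azimuth and elevation in radians from a (not necessarily unit) DOA vector."""
    azimuth = math.atan2(y, x)
    elevation = math.atan2(z, math.hypot(x, y))
    return azimuth, elevation

# Round trip for an example direction: 30 degrees azimuth, -20 degrees elevation.
az, el = math.radians(30.0), math.radians(-20.0)
vec = sph_to_cart(az, el)
az2, el2 = cart_to_sph(*vec)
```

The angular form needs two regression outputs per class and the Cartesian form three. Azimuth is discontinuous where it wraps around, whereas the Cartesian components vary continuously with direction.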

PLEASE NOTE: The SELD task is at a nascent stage. Although the SELDnet architecture used as the baseline has been found to perform effectively on the SELD task across a range of conditions, many approaches remain unexplored in both data-driven and model-based estimation. We believe the proposed task and dataset will help explore new methods, and the results obtained, whether poorer or better, should be treated as valid research and shared with the community. Doing so shows future researchers what works and what doesn't.
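As a rough illustration of the target representation described for the baseline (per-class activity labels plus azimuth and elevation in radians), the sketch below encodes a single time frame. It is not taken from the baseline repository. The class list is an illustrative subset, `encode_frame` is a hypothetical helper, and the placeholder DOA used for inactive classes is an assumption.

```python
import math

# Illustrative subset of class names; not the task's full class list.
CLASSES = ["knock", "phone", "speech"]

def encode_frame(active_events, classes=CLASSES, inactive_doa=(0.0, 0.0)):
    """Encode one time frame as SED and DOA regression targets.

    active_events maps a class name to (azimuth_deg, elevation_deg).
    Returns (sed, doa): sed holds one 0/1 activity flag per class, and doa
    holds all azimuths followed by all elevations, in radians. Inactive
    classes receive the placeholder inactive_doa (an assumed convention).
    """
    sed = [0.0] * len(classes)
    azi = [inactive_doa[0]] * len(classes)
    ele = [inactive_doa[1]] * len(classes)
    for name, (azi_deg, ele_deg) in active_events.items():
        idx = classes.index(name)
        sed[idx] = 1.0
        azi[idx] = math.radians(azi_deg)
        ele[idx] = math.radians(ele_deg)
    return sed, azi + ele

# Two overlapping events in the same frame.
sed_target, doa_target = encode_frame(
    {"knock": (90.0, 0.0), "speech": (-30.0, 20.0)}
)
```

With overlapping events, more than one entry of the activity vector is set, which is why detection is treated as a multilabel problem and one DOA is regressed per class.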

Repository

This repository implements SELDnet and performs cross-validation in the manner we recommend. We also provide scripts to visualize your SELD results and estimate the relevant metric scores before final submission.


Results for the development dataset

Dataset            Error rate   F score   DOA error   Frame recall
Ambisonic          0.34         79.9 %    28.5°       85.4 %
Microphone array   0.35         80.0 %    30.8°       84.0 %

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

If you are participating in this task or using the dataset and code, please consider citing the following papers:

Publication

Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. A multi-room reverberant dataset for sound event localization and detection. In Submitted to Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). 2019. URL: https://arxiv.org/abs/1905.08546.

PDF

A Multi-room Reverberant Dataset for Sound Event Localization and Detection

Abstract

This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.

PDF

Publication

Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, ():1–1, 2018. doi:10.1109/JSTSP.2018.2885636.
