The goal of this task is to jointly localize and recognize individual sound events and to detect their respective temporal onset and offset times.
The challenge has ended. Full results for this task can be found on the Results page.
Description
Given a multichannel audio input, the goal of a sound event localization and detection (SELD) method is to output all instances of the sound labels in the recording, their respective onset-offset times, and their directions-of-arrival (DOAs) in azimuth and elevation angles. Effective implementations of such a SELD method enable an automated description of human activities with a spatial dimension, and help machines to interact with the world more seamlessly. More specifically, SELD can be an important module in assisted listening systems, scene information visualization systems, immersive interactive media, and spatial machine cognition for scene-based deployment of services. A straightforward practical application is a robot that recognizes and tracks the sound source of interest. In the current challenge, only static scenes are considered, meaning that each individual sound event instance in the provided recordings is spatially stationary with a fixed location during its entire duration.
Audio dataset
The task provides two datasets, TAU Spatial Sound Events 2019 - Ambisonic and TAU Spatial Sound Events 2019 - Microphone Array, which capture an identical sound scene and differ only in the format of the audio. TAU Spatial Sound Events 2019 - Ambisonic provides four-channel First-Order Ambisonic (FOA) recordings, while TAU Spatial Sound Events 2019 - Microphone Array provides four-channel directional microphone recordings from a tetrahedral array configuration. Both formats are extracted from the same microphone array, and additional information on the spatial characteristics of each format can be found below. Participants can choose one or both of the datasets based on the audio format they prefer. Both datasets consist of a development set and an evaluation set. The development set consists of 400 one-minute recordings sampled at 48000 Hz, divided into four cross-validation splits of 100 recordings each. The evaluation set consists of 100 one-minute recordings. These recordings were synthesized using spatial room impulse responses (IRs) collected at five indoor locations, at 504 unique azimuth-elevation-distance combinations. To synthesize the recordings, the collected IRs were convolved with the isolated sound events dataset from DCASE 2016 task 2. Finally, to create realistic sound scene recordings, natural ambient noise collected at the IR recording locations was added to the synthesized recordings such that the average SNR of the sound events was 30 dB.
The IRs were collected in Finland by Tampere University between 12/2017 and 06/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
Recording procedure
The real-life IR recordings were collected using an Eigenmike spherical microphone array. A Genelec G Two loudspeaker was used to play back maximum length sequences (MLS) around the Eigenmike. The MLS playback level was ensured to be 30 dB greater than the ambient sound level during the recording. The IRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and the far-field recording, independently at each frequency. These IRs were collected in the following directions:
- 36 IRs at every 10° azimuth angle, for 9 elevations from -40° to 40° at 10° increments, at 1 m distance from the Eigenmike, resulting in 324 discrete DOAs.
- 36 IRs at every 10° azimuth angle, for 5 elevations from -20° to 20° at 10° increments, at 2 m distance from the Eigenmike, resulting in 180 discrete DOAs.
The IRs were recorded at five different indoor locations on the Tampere University campus at Hervanta, Finland. Additionally, we also collected 30 minutes of ambient noise recordings at these five locations with the IR recording setup unchanged. The descriptions of the indoor locations are as follows:
- Language Center - Large common area with multiple seating tables and carpet flooring. People chatting and working.
- Reaktori Building - Large cafeteria with multiple seating tables and carpet flooring. People chatting and having food.
- Festia Building - High ceiling corridor with hard flooring. People walking around and chatting.
- Tietotalo Building - Corridor with classrooms around and hard flooring. People walking around and chatting.
- Sähkötalo Building - Large corridor with multiple sofas and tables, hard and carpet flooring at different parts. People walking around and chatting.
Recording format and dataset specifications
The isolated sound events dataset from DCASE 2016 task 2 consists of 11 classes, each with 20 examples. These examples are randomly split into five sets with an equal number of examples per class; the first four sets are used for synthesizing the four splits of the development dataset, while the remaining set is used for the evaluation dataset. For each split of the dataset, we synthesize 100 recordings. Each of these recordings is generated by randomly choosing sound event examples from the corresponding set and randomly assigning each of them a start time and one of the collected IRs. Finally, by convolving each of these assigned sound examples with its respective IR, we spatially position it at a given distance, azimuth, and elevation from the Eigenmike. IRs from a single location are used for all sound events within a given recording. Further, half of the recordings in each split are synthesized with up to two temporally overlapping sound events, while the rest are synthesized with no overlapping sound events. Finally, the ambient noise collected at the respective IR location is added to the synthesized recording such that the average SNR of the sound events is 30 dB.
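As an illustration of this synthesis procedure, the following minimal sketch spatializes a single mono event with a measured multichannel IR and mixes in ambient noise at a target SNR. The function names and file names are hypothetical and are not part of the provided dataset tools.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_event(event_mono, ir_multichannel):
    """Convolve a mono event (N,) with a multichannel IR (L, C) -> (N+L-1, C)."""
    return np.stack(
        [fftconvolve(event_mono, ir_multichannel[:, ch]) for ch in range(ir_multichannel.shape[1])],
        axis=1,
    )

def mix_at_snr(scene, ambience, snr_db=30.0):
    """Scale the ambient noise so that the event scene is snr_db above it."""
    scene_power = np.mean(scene ** 2)
    noise_power = np.mean(ambience ** 2) + 1e-12
    gain = np.sqrt(scene_power / (noise_power * 10 ** (snr_db / 10.0)))
    length = min(len(scene), len(ambience))
    return scene[:length] + gain * ambience[:length]

# Hypothetical usage with 48 kHz signals loaded elsewhere:
# event = load_mono("event_example.wav")               # (N,)
# ir = load_multichannel("ir_loc1_az10_el-20_1m.wav")  # (L, 4)
# noise = load_multichannel("ambience_loc1.wav")       # (M, 4)
# recording = mix_at_snr(spatialize_event(event, ir), noise, snr_db=30.0)
```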
Since the number of channels in the IRs is equal to the number of microphones in the Eigenmike (32), in order to create the TAU Spatial Sound Events 2019 - Microphone Array dataset we use channels 6, 10, 26, and 22, which correspond to the microphone positions (45°, 35°, 4.2cm), (-45°, -35°, 4.2cm), (135°, -35°, 4.2cm) and (-135°, 35°, 4.2cm). The spherical coordinate system in use is right-handed with the front at (0°, 0°), left at (90°, 0°) and top at (0°, 90°). Further, the TAU Spatial Sound Events 2019 - Ambisonic dataset is obtained by converting the 32-channel microphone signals to FOA by means of encoding filters based on anechoic measurements of the Eigenmike array response.
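For reference, a minimal sketch of the spherical-to-Cartesian conversion implied by this right-handed convention (x toward the front, y toward the left, z up); the helper name is ours and not part of the dataset code.

```python
import numpy as np

def sph_to_cart(azimuth_deg, elevation_deg, radius_m=1.0):
    """Right-handed convention: (0°, 0°) is front (+x), (90°, 0°) is left (+y), (0°, 90°) is up (+z)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return radius_m * np.array([np.cos(el) * np.cos(az),
                                np.cos(el) * np.sin(az),
                                np.sin(el)])

# The four selected Eigenmike capsules (channels 6, 10, 26, 22):
mic_positions = {6: (45, 35), 10: (-45, -35), 26: (135, -35), 22: (-135, 35)}
for ch, (az, el) in mic_positions.items():
    print(ch, sph_to_cart(az, el, radius_m=0.042))
```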
For model-based localization approaches, the array response may be considered known. The following theoretical spatial responses (steering vectors) modeling the two formats describe the directional response of each channel to a source incident from a DOA given by azimuth angle \(\phi\) and elevation angle \(\theta\).
For the first-order ambisonics:
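\[
H_1(\phi, \theta, \omega) = 1, \quad
H_2(\phi, \theta, \omega) = \sin(\phi)\cos(\theta), \quad
H_3(\phi, \theta, \omega) = \sin(\theta), \quad
H_4(\phi, \theta, \omega) = \cos(\phi)\cos(\theta),
\]

i.e., an omnidirectional component followed by three orthogonal dipole components (written here assuming SN3D normalization of the FOA channels).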
For the tetrahedral array of microphones mounted on spherical baffle, an analytical expression for the directional array response is given by the expansion:
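\[
H_m(\phi_m, \theta_m, \phi, \theta, \omega) = \frac{1}{(\omega R / c)^2} \sum_{n=0}^{30} \frac{i^{\,n-1}}{h_n'^{(2)}(\omega R / c)} (2n+1) P_n(\cos\gamma_m),
\]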
where \(m\) is the channel number, \((\phi_m, \theta_m)\) are the azimuth and elevation position of the specific microphone, \(\omega = 2\pi f\) is the angular frequency, \(R = 0.042\) m is the array radius, \(c = 343\) m/s is the speed of sound, \(\cos(\gamma_m)\) is the cosine of the angle between the microphone position and the DOA, \(P_n\) is the unnormalized Legendre polynomial of degree \(n\), and \(h_n'^{(2)}\) is the derivative with respect to the argument of the spherical Hankel function of the second kind. The expansion is limited to 30 terms, which provides negligible modeling error up to 20 kHz. Note that the Ambisonics response above is frequency-independent; this holds true up to around 9 kHz with the specific microphone array, with the actual encoded response starting to deviate gradually from the ideal one at higher frequencies.
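As a sketch of how this response can be evaluated numerically (not part of the provided baseline code, and defined only up to a global scaling and time convention), the expansion above can be computed with SciPy's spherical Bessel functions:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

R, C = 0.042, 343.0  # array radius (m), speed of sound (m/s)

def sph_hankel2_deriv(n, x):
    """Derivative of the spherical Hankel function of the second kind, h_n^(2)'(x)."""
    return spherical_jn(n, x, derivative=True) - 1j * spherical_yn(n, x, derivative=True)

def array_response(cos_gamma, freq_hz, n_terms=30):
    """Rigid-sphere microphone response for a plane wave arriving at angle gamma from the capsule."""
    x = 2 * np.pi * freq_hz * R / C
    h = 0j
    for n in range(n_terms + 1):
        h += (1j ** (n - 1)) / sph_hankel2_deriv(n, x) * (2 * n + 1) * eval_legendre(n, cos_gamma)
    return h / x**2

# Example: magnitude of the response for a source 60 degrees away from a capsule, at 4 kHz.
print(abs(array_response(np.cos(np.radians(60.0)), 4000.0)))
```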
In summary, there are 100 recordings in the evaluation dataset and in each of the four splits of the development dataset. These 100 recordings comprise 10 recordings for each combination of overlap condition (no overlap, or up to two temporally overlapping sound events) and the five IR locations (10 * 2 * 5 = 100). The only explicit difference between the development dataset splits and the evaluation dataset is the set of isolated sound event examples employed. Although each development split and the evaluation dataset contain IRs from all five locations, the dataset only guarantees a balanced distribution of sound events across the 36 azimuth and 9 elevation angles within each split; it does not guarantee that the IRs collected at a single location are entirely contained in a single split. For example, some of the IRs of the Reaktori building might not be in the first split but might occur in any of the other splits.
More details on the IR recordings collections and synthesis of the dataset can be read in:
Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. A multi-room reverberant dataset for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 10–14. New York University, NY, USA, October 2019. URL: https://dcase.community/workshop2019/proceedings.
A Multi-room Reverberant Dataset for Sound Event Localization and Detection
Abstract
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset where each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.
Reference labels
As labels, for each recording in the development dataset, we provide a CSV file that lists the sound events, their respective temporal onset and offset times, and their azimuth and elevation angles. Since the development dataset provides four cross-validation splits, it can be used as a standalone dataset for future work. If you are preparing a publication based on the DCASE challenge setup and want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. The task coordinators can provide unofficial scoring for a limited number of system outputs.
Download
Dataset was updated on 20 March 2019 to remove labels of sound events that were missing in the audio (version 2). In order to update an already downloaded version 1 of the dataset, download only the metadata_dev.zip file from version 2.
Dataset was updated on 26 August 2019: Now that the task has ended, we are releasing the reference labels for the evaluation dataset (version 2).
Task setup
The development dataset comes with a pre-defined four-fold cross-validation setup, as shown in the table below. The folds consist of audio recordings and the corresponding metadata describing the sound events and their respective locations within each recording. Participants are required to report the performance of their method on the testing splits of the four folds. In order to allow a fair comparison of methods on the development dataset, participants are not allowed to change the defined splits.
Folds | Training splits | Validation split | Testing split |
---|---|---|---|
Fold 1 | 3, 4 | 2 | 1 |
Fold 2 | 4, 1 | 3 | 2 |
Fold 3 | 1, 2 | 4 | 3 |
Fold 4 | 2, 3 | 1 | 4 |
The evaluation dataset is released a few weeks before the final submission deadline. This dataset consists of only audio recordings, without any metadata/labels. Participants are free to decide the training procedure, i.e., the amount of training and validation files taken from the development dataset and the number of models in an ensemble, and submit the results of the SELD performance on the evaluation dataset.
Development dataset
The recordings in the development dataset follow the naming convention:
split[number]_ir[location number]_ov[number of overlapping sound events]_[recording number per split].wav
The location whose impulse responses were used to synthesize a recording and the number of overlapping sound events in it are provided only so that participants can study the performance of their method under different conditions. We encourage participants to carry out such condition-specific studies and report them in a publication at the DCASE 2019 workshop. For the challenge itself, however, we only consider generic methods that do not use the location or the number of overlapping sound events during training or inference.
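For condition-wise analysis on the development set, the naming convention above can be parsed as in the following sketch; the regular expression is ours and assumes the exact pattern shown above.

```python
import re

DEV_NAME = re.compile(r"split(\d+)_ir(\d+)_ov(\d+)_(\d+)\.wav$")

def parse_dev_filename(filename):
    """Return (split, IR location, number of overlapping events, recording number)."""
    match = DEV_NAME.search(filename)
    if match is None:
        raise ValueError(f"Not a development-set recording name: {filename}")
    return tuple(int(g) for g in match.groups())

print(parse_dev_filename("split1_ir2_ov1_10.wav"))  # (1, 2, 1, 10)
```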
Evaluation dataset
The evaluation dataset consists of 100 recordings without any information on the location or the number of overlapping sound events in the naming convention, as shown below:
split[number]_[recording number per split].wav
Task rules
- Use of external data is not allowed.
- Manipulation of provided cross-validation split for development dataset is not allowed.
- The development dataset can be augmented without the use of external data (e.g. using techniques such as pitch shifting or time stretching).
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
Submission
The results for each of the 100 recordings in the evaluation dataset should be collected in individual file-wise CSVs. Similarly, we also collect file-wise CSVs for each of the 400 recordings in the development dataset (the testing split results of the four folds). Each file-wise CSV has the same name as the corresponding audio recording, but with the .csv extension, and contains the following information in each row.
[frame number (int)],[active class index (int)],[azimuth (int)],[elevation (int)]
An example output file is shown below. If you use the baseline code, this output is produced automatically.
10,1,10,-20
10,1,-50,30
11,1,10,-20
11,1,-50,30
12,1,10,-20
12,1,-50,30
13,1,10,-20
13,1,-50,30
13,2,30,0
112,4,-40,0
113,4,-40,0
114,4,-40,0
The output file above describes two instances of class 1 that are active at locations (10°, -20°) and (-50°, 30°) for four consecutive frames, 10 to 13. In frame 13, in addition to class 1, class 2 is also active. Finally, class 4 is active in frames 112-114 at location (-40°, 0°).
The evaluation is performed at a hop length of 20 ms, which results in 3000 frames for a 60-second recording. If participants use a different hop length in their study, we expect them to apply a suitable post-processing step to produce the information at a 20 ms hop length and submit that as the evaluation result. The class index for each of the 11 classes in the provided dataset is available in the metadata provided with the dataset and the baseline code. Azimuth angles are expected in the range -180° to 170° and elevation angles in the range -40° to 40°; any value beyond these limits will be clipped to the respective minimum or maximum value.
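The following sketch writes frame-wise predictions in the required row format and clips the angles to the expected ranges. The input structure (a list of (frame, class, azimuth, elevation) tuples already at a 20 ms hop) and the output file name are assumptions of this example, not a format imposed by the challenge tools.

```python
import csv

def write_submission_csv(path, rows):
    """rows: iterable of (frame_index, class_index, azimuth_deg, elevation_deg) at a 20 ms hop."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for frame, cls, azi, ele in rows:
            azi = min(max(int(round(azi)), -180), 170)   # azimuth clipped to [-180, 170]
            ele = min(max(int(round(ele)), -40), 40)     # elevation clipped to [-40, 40]
            writer.writerow([int(frame), int(cls), azi, ele])

# Hypothetical file name and predictions:
write_submission_csv("split0_1.csv", [(10, 1, 10, -20), (10, 1, -50, 30), (11, 1, 10, -20)])
```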
In addition to the CSV files, participants are asked to fill in the information about their method in the provided file and to submit a technical report describing the method. We allow up to 4 system output submissions per participant/team. For each system, meta information should be provided in a separate file containing the task-specific information. All files should be packaged into a zip file for submission.
Detailed information for the submission can be found on the Submission page.
Evaluation
The SELD task is evaluated with individual metrics for SED and DOA estimation. For SED, we use the F-score and error rate (ER) calculated in one-second segments. A short description of the SED metrics is found here, and the detailed information is available in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.
Metrics for Polyphonic Sound Event Detection
Abstract
This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.
For DOA estimation we use two frame-wise metrics: DOA error and frame recall. The DOA error is the average angular error in degrees between the predicted and reference DOAs. For a recording of length \(T\) time-frames, let \(\mathbf{DOA}^t_R\) be the list of all reference DOAs at time-frame \(t\) and \(\mathbf{DOA}^t_E\) be the list of all estimated DOAs. The DOA error is now defined as
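\[
\mathrm{DOA\ error} = \frac{1}{\sum_{t=1}^{T} D^t_E} \sum_{t=1}^{T} \mathcal{H}\left(\mathbf{DOA}^t_R, \mathbf{DOA}^t_E\right),
\]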
where \(D^t_E\) is the number of DOAs in \(\mathbf{DOA}^t_E\) at the \(t\)-th frame, and \(\mathcal{H}\) is the Hungarian algorithm for solving the assignment problem, i.e., matching the individual estimated DOAs with the respective reference DOAs. The Hungarian algorithm solves this by estimating the pair-wise costs between the individual predicted and reference DOAs using the central angle between them,
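\[
\sigma\big((\phi_E, \lambda_E), (\phi_R, \lambda_R)\big) = \arccos\big(\sin\lambda_E \sin\lambda_R + \cos\lambda_E \cos\lambda_R \cos(\phi_R - \phi_E)\big) \cdot \frac{180}{\pi},
\]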
where the reference DOA is represented by the azimuth angle \(\phi_R \in [-\pi, \pi)\) and elevation angle \(\lambda_R \in [-\pi/2, \pi/2]\), and the estimated DOA is represented by \((\phi_E, \lambda_E)\) in the same ranges as the reference DOA.
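A minimal sketch of this matching step, using SciPy's optimal assignment solver (scipy.optimize.linear_sum_assignment) in place of a hand-written Hungarian implementation; the DOA lists are assumed here to be (azimuth, elevation) pairs in radians:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def central_angle_deg(doa_a, doa_b):
    """Great-circle angle in degrees between DOAs given as (azimuth, elevation) in radians."""
    az_a, el_a = doa_a
    az_b, el_b = doa_b
    cos_sigma = np.sin(el_a) * np.sin(el_b) + np.cos(el_a) * np.cos(el_b) * np.cos(az_a - az_b)
    return np.degrees(np.arccos(np.clip(cos_sigma, -1.0, 1.0)))

def hungarian_doa_cost(doas_ref, doas_est):
    """Sum of matched angular errors (degrees) between reference and estimated DOA lists."""
    cost = np.array([[central_angle_deg(r, e) for e in doas_est] for r in doas_ref])
    row_ind, col_ind = linear_sum_assignment(cost)
    return cost[row_ind, col_ind].sum()

ref = [(np.radians(10), np.radians(-20)), (np.radians(-50), np.radians(30))]
est = [(np.radians(-48), np.radians(28)), (np.radians(12), np.radians(-19))]
print(hungarian_doa_cost(ref, est))  # small total error: the pairs are matched correctly
```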
In order to account for time frames where the numbers of estimated and reference DOAs are unequal, we report a second metric, frame recall, which is calculated as
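\[
\mathrm{Frame\ recall} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left(D^t_R = D^t_E\right),
\]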
where \(D^t_R\) is the number of DOAs in \(\mathbf{DOA}^t_R\) at the \(t\)-th frame, and \(\mathbb{1}()\) is the indicator function, which outputs one if the condition \(D^t_R = D^t_E\) is met and zero otherwise. More details regarding the DOA metrics can be found in:
Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In 2018 26th European Signal Processing Conference (EUSIPCO), 1462–1466. IEEE, 2018.
Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network
Abstract
This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.
An ideal SELD method will have an error rate of zero, an F-score of 1 (reported in %), a DOA error of 0°, and a frame recall of 1 (reported in %). In order to compare the submitted methods, we rank each method individually on all four metrics, and the final positions are obtained using the cumulative minimum of the ranks.
PLEASE NOTE: The four cross-validation folds are treated as a single experiment, meaning that metrics are calculated only after training and testing all folds, not as the average of the individual folds nor as the average of individual class performances. Intermediate measures (insertions, deletions, substitutions) from all folds are accumulated before calculating the metrics. For more information on why this is done, please refer to the following paper:
George Forman and Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. SIGKDD Explor. Newsl., 12(1):49–57, November 2010. URL: http://doi.acm.org/10.1145/1882471.1882479, doi:10.1145/1882471.1882479.
Apples-to-apples in Cross-validation Studies: Pitfalls in Classifier Performance Measurement
Abstract
Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages. This leads to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in aggregated results over many folds and datasets, without ever a person looking at the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance for how best to measure classification performance under cross-validation. In particular, there are several divergent methods used for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and in one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods leads to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and researchers focused on high class imbalance.
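As an illustration of the accumulation described in the note above, the segment-based error rate is computed from intermediate counts (substitutions S, deletions D, insertions I, and the number of reference events N) summed over all folds, rather than by averaging per-fold error rates. The counts below are purely hypothetical:

```python
# Hypothetical per-fold intermediate counts in one-second segments.
folds = [
    {"S": 12, "D": 30, "I": 18, "N": 400},
    {"S": 15, "D": 25, "I": 20, "N": 410},
    {"S": 10, "D": 35, "I": 16, "N": 395},
    {"S": 14, "D": 28, "I": 22, "N": 405},
]

# Accumulate the intermediate measures over all folds first ...
S = sum(f["S"] for f in folds)
D = sum(f["D"] for f in folds)
I = sum(f["I"] for f in folds)
N = sum(f["N"] for f in folds)

# ... and only then compute the metric, instead of averaging per-fold error rates.
error_rate = (S + D + I) / N
print(round(error_rate, 3))
```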
Results
The SELD task received 58 submissions in total from 22 teams across the world. The results of these submissions on the evaluation dataset are as follows.
Submission name | Author | Affiliation | Technical Report | Official rank | Error Rate | F-score | DOA error | Frame recall |
---|---|---|---|---|---|---|---|---|
Kapka_SRPOL_task3_2 | Slawomir Kapka | Samsung R&D Institute Poland | task-sound-event-localization-and-detection-results#Kapka2019 | 1 | 0.08 | 94.7 | 3.7 | 96.8 | |
Kapka_SRPOL_task3_4 | Slawomir Kapka | Samsung R&D Institute Poland | task-sound-event-localization-and-detection-results#Kapka2019 | 2 | 0.08 | 94.7 | 3.7 | 96.8 | |
Kapka_SRPOL_task3_3 | Slawomir Kapka | Samsung R&D Institute Poland | task-sound-event-localization-and-detection-results#Kapka2019 | 3 | 0.10 | 93.5 | 4.6 | 96.0 | |
Cao_Surrey_task3_4 | Qiuqiang Kong | University of Surrey | task-sound-event-localization-and-detection-results#Cao2019 | 4 | 0.08 | 95.5 | 5.5 | 92.2 | |
Xue_JDAI_task3_1 | Wei Xue | JD.COM | task-sound-event-localization-and-detection-results#Xue2019 | 5 | 0.06 | 96.3 | 9.7 | 92.3 | |
He_THU_task3_2 | Liang He | Tsinghua University | task-sound-event-localization-and-detection-results#He2019 | 6 | 0.06 | 96.7 | 22.4 | 94.1 | |
He_THU_task3_1 | Liang He | Tsinghua University | task-sound-event-localization-and-detection-results#He2019 | 7 | 0.06 | 96.6 | 23.8 | 94.4 | |
Cao_Surrey_task3_1 | Qiuqiang Kong | University of Surrey | task-sound-event-localization-and-detection-results#Cao2019 | 8 | 0.09 | 95.1 | 5.5 | 91.0 | |
Xue_JDAI_task3_4 | Wei Xue | JD.COM | task-sound-event-localization-and-detection-results#Xue2019 | 9 | 0.07 | 95.9 | 10.0 | 92.6 | |
Jee_NTU_task3_1 | Wen Jie Jee | Nanyang Technological University | task-sound-event-localization-and-detection-results#Jee2019 | 10 | 0.12 | 93.7 | 4.2 | 91.8 | |
Xue_JDAI_task3_3 | Wei Xue | JD.COM | task-sound-event-localization-and-detection-results#Xue2019 | 11 | 0.08 | 95.6 | 10.1 | 92.2 | |
He_THU_task3_4 | Liang He | Tsinghua University | task-sound-event-localization-and-detection-results#He2019 | 12 | 0.06 | 96.3 | 26.1 | 93.4 | |
Cao_Surrey_task3_3 | Qiuqiang Kong | University of Surrey | task-sound-event-localization-and-detection-results#Cao2019 | 13 | 0.10 | 94.9 | 5.8 | 90.4 | |
Xue_JDAI_task3_2 | Wei Xue | JD.COM | task-sound-event-localization-and-detection-results#Xue2019 | 14 | 0.09 | 95.2 | 9.2 | 91.5 | |
He_THU_task3_3 | Liang He | Tsinghua University | task-sound-event-localization-and-detection-results#He2019 | 15 | 0.08 | 95.6 | 24.4 | 92.9 | |
Cao_Surrey_task3_2 | Qiuqiang Kong | University of Surrey | task-sound-event-localization-and-detection-results#Cao2019 | 16 | 0.12 | 93.8 | 5.5 | 89.0 | |
Nguyen_NTU_task3_3 | Thi Ngoc Tho Nguyen | Nanyang Technological University | task-sound-event-localization-and-detection-results#Nguyen2019 | 17 | 0.11 | 93.4 | 5.4 | 88.8 | |
MazzonYasuda_NTT_task3_3 | Yuma Koizumi | NTT Media Intelligence Laboratories | task-sound-event-localization-and-detection-results#MazzonYasuda2019 | 18 | 0.10 | 94.2 | 6.4 | 88.8 | |
Chang_HYU_task3_3 | Chang Joon-Hyuk | Hanyang University | task-sound-event-localization-and-detection-results#Chang2019 | 19 | 0.14 | 91.9 | 2.7 | 90.8 | |
Nguyen_NTU_task3_4 | Thi Ngoc Tho Nguyen | Nanyang Technological University | task-sound-event-localization-and-detection-results#Nguyen2019 | 20 | 0.12 | 93.2 | 5.5 | 88.7 | |
Chang_HYU_task3_4 | Chang Joon-Hyuk | Hanyang University | task-sound-event-localization-and-detection-results#Chang2019 | 21 | 0.17 | 90.5 | 3.1 | 94.1 | |
MazzonYasuda_NTT_task3_2 | Yuma Koizumi | NTT Media Intelligence Laboratories | task-sound-event-localization-and-detection-results#MazzonYasuda2019 | 22 | 0.13 | 93.0 | 5.0 | 88.2 | |
Chang_HYU_task3_2 | Chang Joon-Hyuk | Hanyang University | task-sound-event-localization-and-detection-results#Chang2019 | 23 | 0.14 | 92.3 | 9.7 | 95.3 | |
Chang_HYU_task3_1 | Chang Joon-Hyuk | Hanyang University | task-sound-event-localization-and-detection-results#Chang2019 | 24 | 0.13 | 92.8 | 8.4 | 91.4 | |
MazzonYasuda_NTT_task3_1 | Yuma Koizumi | NTT Media Intelligence Laboratories | task-sound-event-localization-and-detection-results#MazzonYasuda2019 | 25 | 0.12 | 93.3 | 7.1 | 88.1 | |
Ranjan_NTU_task3_3 | Rishabh Ranjan | Nanyang Technological University | task-sound-event-localization-and-detection-results#Ranjan2019 | 26 | 0.16 | 90.9 | 5.7 | 91.8 | |
Ranjan_NTU_task3_4 | Rishabh Ranjan | Nanyang Technological University | task-sound-event-localization-and-detection-results#Ranjan2019 | 27 | 0.16 | 90.7 | 6.4 | 92.0 | |
Park_ETRI_task3_1 | Sooyoung Park | Electronics and Telecommunications Research Institute | task-sound-event-localization-and-detection-results#Park2019 | 28 | 0.15 | 91.9 | 5.1 | 87.4 | |
Nguyen_NTU_task3_1 | Thi Ngoc Tho Nguyen | Nanyang Technological University | task-sound-event-localization-and-detection-results#Nguyen2019 | 29 | 0.15 | 91.1 | 5.6 | 89.8 | |
Leung_DBS_task3_2 | Shuangran Leung | DBSonics | task-sound-event-localization-and-detection-results#Leung2019 | 30 | 0.12 | 93.3 | 25.9 | 91.1 | |
Park_ETRI_task3_2 | Sooyoung Park | Electronics and Telecommunications Research Institute | task-sound-event-localization-and-detection-results#Park2019 | 31 | 0.15 | 91.8 | 5.0 | 87.2 | |
Grondin_MIT_task3_1 | Francois Grondin | Massachusetts Institute of Technology | task-sound-event-localization-and-detection-results#Grondin2019 | 32 | 0.14 | 92.2 | 7.4 | 87.5 | |
Leung_DBS_task3_1 | Shuangran Leung | DBSonics | task-sound-event-localization-and-detection-results#Leung2019 | 33 | 0.12 | 93.4 | 27.2 | 90.7 | |
Park_ETRI_task3_3 | Sooyoung Park | Electronics and Telecommunications Research Institute | task-sound-event-localization-and-detection-results#Park2019 | 34 | 0.15 | 91.9 | 7.0 | 87.4 | |
MazzonYasuda_NTT_task3_4 | Yuma Koizumi | NTT Media Intelligence Laboratories | task-sound-event-localization-and-detection-results#MazzonYasuda2019 | 35 | 0.14 | 92.0 | 7.3 | 87.1 | |
Park_ETRI_task3_4 | Sooyoung Park | Electronics and Telecommunications Research Institute | task-sound-event-localization-and-detection-results#Park2019 | 36 | 0.15 | 91.8 | 7.0 | 87.2 | |
Ranjan_NTU_task3_1 | Rishabh Ranjan | Nanyang Technological University | task-sound-event-localization-and-detection-results#Ranjan2019 | 37 | 0.18 | 89.9 | 8.6 | 90.1 | |
Ranjan_NTU_task3_2 | Rishabh Ranjan | Nanyang Technological University | task-sound-event-localization-and-detection-results#Ranjan2019 | 38 | 0.22 | 86.8 | 7.8 | 90.0 | |
ZhaoLu_UESTC_task3_1 | Zhao Lu | University of Electronic Science and Technology of China | task-sound-event-localization-and-detection-results#ZhaoLu2019 | 39 | 0.18 | 89.3 | 6.8 | 84.3 | |
Rough_EMED_task3_2 | Pi LiHong | Tsinghua University | task-sound-event-localization-and-detection-results#Rough2019 | 40 | 0.18 | 89.7 | 9.4 | 85.5 | |
Nguyen_NTU_task3_2 | Thi Ngoc Tho Nguyen | Nanyang Technological University | task-sound-event-localization-and-detection-results#Nguyen2019 | 41 | 0.17 | 89.7 | 8.0 | 77.3 | |
Jee_NTU_task3_2 | Wen Jie Jee | Nanyang Technological University | task-sound-event-localization-and-detection-results#Jee2019 | 42 | 0.19 | 89.1 | 8.1 | 85.0 | |
Tan_NTU_task3_1 | Ee Leng Tan | Nanyang Technological University | task-sound-event-localization-and-detection-results#Tan2019 | 43 | 0.17 | 89.8 | 15.4 | 84.4 | |
Lewandowski_SRPOL_task3_1 | Mateusz Lewandowski | Samsung R&D Institute Poland | task-sound-event-localization-and-detection-results#Kapka2019 | 44 | 0.19 | 89.4 | 36.2 | 87.7 | |
Cordourier_IL_task3_2 | Hector Cordourier Maruri | Intel Corporation | task-sound-event-localization-and-detection-results#Cordourier2019 | 45 | 0.22 | 86.5 | 20.8 | 85.7 | |
Cordourier_IL_task3_1 | Hector Cordourier Maruri | Intel Corporation | task-sound-event-localization-and-detection-results#Cordourier2019 | 46 | 0.22 | 86.3 | 19.9 | 85.6 | |
Krause_AGH_task3_4 | Daniel Krause | AGH University of Science and Technology | task-sound-event-localization-and-detection-results#Krause2019 | 47 | 0.22 | 87.4 | 31.0 | 87.0 | |
DCASE2019_FOA_baseline | Sharath Adavanne | Tampere University | task-sound-event-localization-and-detection-results#Adavanne2019 | 48 | 0.28 | 85.4 | 24.6 | 85.7 | |
Perezlopez_UPF_task3_1 | Andres Perez-Lopez | Centre Tecnologic de Catalunya | task-sound-event-localization-and-detection-results#Perezlopez2019 | 49 | 0.29 | 82.1 | 9.3 | 75.8 | |
Chytas_UTH_task3_1 | Sotirios Panagiotis Chytas | University of Thessaly | task-sound-event-localization-and-detection-results#Chytas2019 | 50 | 0.29 | 82.4 | 18.6 | 75.6 | |
Anemueller_UOL_task3_3 | Jorn Anemuller | University of Oldenburg | task-sound-event-localization-and-detection-results#Anemueller2019 | 51 | 0.28 | 83.8 | 29.2 | 84.1 | |
Chytas_UTH_task3_2 | Sotirios Panagiotis Chytas | University of Thessaly | task-sound-event-localization-and-detection-results#Chytas2019 | 52 | 0.29 | 82.3 | 18.7 | 75.7 | |
Krause_AGH_task3_2 | Daniel Krause | AGH University of Science and Technology | task-sound-event-localization-and-detection-results#Krause2019 | 53 | 0.32 | 82.9 | 31.7 | 85.7 | |
Krause_AGH_task3_1 | Daniel Krause | AGH University of Science and Technology | task-sound-event-localization-and-detection-results#Krause2019 | 54 | 0.30 | 83.0 | 32.5 | 85.3 | |
Anemueller_UOL_task3_1 | Jorn Anemuller | University of Oldenburg | task-sound-event-localization-and-detection-results#Anemueller2019 | 55 | 0.33 | 81.3 | 28.2 | 84.5 | |
Kong_SURREY_task3_1 | Qiuqiang Kong | University of Surrey | task-sound-event-localization-and-detection-results#Kong2019 | 56 | 0.29 | 83.4 | 37.6 | 81.3 | |
Anemueller_UOL_task3_2 | Jorn Anemuller | University of Oldenburg | task-sound-event-localization-and-detection-results#Anemueller2019 | 57 | 0.36 | 79.8 | 25.0 | 84.1 | |
DCASE2019_MIC_baseline | Sharath Adavanne | Tampere University | task-sound-event-localization-and-detection-results#Adavanne2019 | 58 | 0.30 | 83.2 | 38.1 | 83.4 | |
Lin_YYZN_task3_1 | Yifeng Lin | Esound corporation | task-sound-event-localization-and-detection-results#Lin2019 | 59 | 1.03 | 2.6 | 21.9 | 31.6 | |
Krause_AGH_task3_3 | Daniel Krause | AGH University of Science and Technology | task-sound-event-localization-and-detection-results#Krause2019 | 60 | 0.35 | 80.3 | 52.6 | 83.6 |
Complete results and technical reports can be found on the results page.
Awards
This task will offer two awards, not necessarily based on the evaluation set performance ranking. These awards aim to encourage contestants to openly publish their code, and to use novel and problem-specific approaches which leverage knowledge of the audio domain. We also highly encourage student authorship.
Reproducible system award
A reproducible system award of 500 USD will be offered for the highest-scoring method that is open-source and fully reproducible. For full reproducibility, the authors must provide all the information needed to run the system and achieve the reported performance. The choice of licence is left to the author, but it should ideally be selected from among those approved by the Open Source Initiative.
Judges’ award
A judges’ award of 500 USD will be offered for the method considered by the judges to be the most interesting or innovative. Criteria considered for this award include, but are not limited to, originality, complexity, student participation, and open-source availability. Single-model approaches are strongly preferred over ensembles; occasionally, small ensembles of different models can be considered if the approach is innovative.
More information can be found on the Award page.
Baseline system
As the baseline, we use the recently published SELDnet, a CRNN-based method that uses the confidence of the SED output to estimate one DOA for each sound class. SED is obtained as multiclass multilabel classification, whereas DOA estimation is performed as multi-output regression. SELDnet uses the magnitude and phase components of the spectrogram as input features, with SED labels represented as one-hot encodings and DOAs represented as azimuth and elevation angles in radians. More details about SELDnet can be found in:
Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. doi:10.1109/JSTSP.2018.2885636.
Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks
Abstract
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
Keywords
Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network
A difference between the baseline implementation here and the implementation in the above publication is that this baseline directly outputs azimuth and elevation angles, rather than the Cartesian components of the DOA vector.
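To make this two-branch arrangement concrete, here is a minimal, illustrative CRNN sketch in PyTorch with a multilabel SED output and a per-class azimuth/elevation regression output. The layer sizes, feature dimensions, and angle scaling are placeholders of this example and not the official baseline configuration; refer to the repository below for the actual implementation.

```python
import math
import torch
import torch.nn as nn

class MiniSELDNet(nn.Module):
    """Illustrative CRNN: spectrogram features -> SED (multilabel) + per-class DOA regression."""
    def __init__(self, n_feature_channels=8, n_freq_bins=64, n_classes=11):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(n_feature_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 8)),   # pool along frequency only, keeping the time resolution
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 8)),
        )
        self.rnn = nn.GRU(64 * (n_freq_bins // 64), 64, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(128, n_classes)       # per-class activity probabilities
        self.doa_head = nn.Linear(128, 2 * n_classes)   # per-class azimuth and elevation (radians)

    def forward(self, x):                  # x: (batch, feature channels, time, frequency)
        z = self.cnn(x)                    # (batch, 64, time, reduced frequency)
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
        z, _ = self.rnn(z)                 # (batch, time, 128)
        sed = torch.sigmoid(self.sed_head(z))            # multilabel SED output
        doa = math.pi * torch.tanh(self.doa_head(z))     # angles bounded to [-pi, pi]
        return sed, doa

sed, doa = MiniSELDNet()(torch.randn(2, 8, 128, 64))
print(sed.shape, doa.shape)  # torch.Size([2, 128, 11]) torch.Size([2, 128, 22])
```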
PLEASE NOTE: The SELD task is at a nascent stage. Although the SELDnet architecture used as the baseline has been found to perform effectively on the SELD task for a range of conditions, there are still many unexplored approaches in both data-driven and model-based estimation. We believe the proposed task and dataset will help explore new methods, and the results obtained, poorer or better, should be treated as valid research and shared with the community. This will only help future researchers understand what works and what doesn't.
Repository
This repository implements SELDnet and performs cross-validation in the manner we recommend. We also provide scripts to visualize your SELD results and estimate the relevant metric scores before final submission.
Results for the development dataset
Dataset | Error rate | F score | DOA error | Frame recall |
---|---|---|---|---|
Ambisonic | 0.34 | 79.9 % | 28.5° | 85.4 % |
Microphone array | 0.35 | 80.0 % | 30.8° | 84.0 % |
Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.
Citation
If you are participating in this task or using the dataset and code please consider citing the following papers:
Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. A multi-room reverberant dataset for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 10–14. New York University, NY, USA, October 2019. URL: https://dcase.community/workshop2019/proceedings.
A Multi-room Reverberant Dataset for Sound Event Localization and Detection
Abstract
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset where each sound event associated with a spatial coordinate represented using azimuth and elevation angles is provided. These sound events are spatialized using real-life impulse responses collected at multiple spatial coordinates in five different rooms with varying dimensions and material properties. A baseline SELD method employing a convolutional recurrent neural network is used to generate benchmark scores for this reverberant dataset. The benchmark scores are obtained using the recommended cross-validation setup.
Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. doi:10.1109/JSTSP.2018.2885636.
Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks
Abstract
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
Keywords
Direction-of-arrival estimation;Estimation;Task analysis;Azimuth;Microphone arrays;Recurrent neural networks;Sound event detection;direction of arrival estimation;convolutional recurrent neural network