Sound Event Localization and Detection with Directional Interference


Challenge results

Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify the type of each event from a known set of sound classes, and further localize the events in space while they are active.

The focus of the current SELD task is to build systems that are able to handle event polyphony while being robust to ambient noise and reverberation in different acoustic environments/rooms, under static and dynamic spatial conditions (i.e., with moving sources). Additionally, the systems should be robust to interfering noise and to events that are localized in space but do not belong to the target classes, and whose spatiotemporal activity is unknown during training. The task provides two datasets, development and evaluation, recorded in a total of 13 different acoustic environments. Of the two datasets, only the development dataset provides the reference labels. The participants are expected to build and validate systems using the development dataset, report results on a predefined development set split, and finally test their system on the unseen evaluation dataset.

More details on the task setup and evaluation can be found on the task description page.
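
The rankings below use joint SELD metrics: a location-dependent error rate and F-score that count a detection as correct only when its estimated direction-of-arrival lies within 20° of the reference, plus a class-dependent localization error and recall. A minimal sketch of the angular test follows (illustrative only, not the official evaluation code):

```python
# Illustrative sketch, not the official evaluation code: the angular
# distance between a predicted and a reference direction-of-arrival,
# compared against the 20-degree threshold used by ER (20°) and F (20°).
import numpy as np

def angular_distance_deg(azi_ref, ele_ref, azi_est, ele_est):
    """Great-circle angle in degrees between two directions (inputs in degrees)."""
    a1, e1, a2, e2 = np.deg2rad([azi_ref, ele_ref, azi_est, ele_est])
    cos_angle = (np.sin(e1) * np.sin(e2)
                 + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2))
    return np.rad2deg(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# A correctly classified event counts as a true positive only if it lies
# within 20 degrees of the reference direction.
within_threshold = angular_distance_deg(30.0, 10.0, 45.0, 5.0) <= 20.0
```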

Teams ranking

The SELD task received 37 submissions in total from 13 teams across the world. Results from one of the teams were retracted at the team's request. The following table includes only the best performing system per submitting team.

In the tables, "Eval." and "Dev." columns report results on the evaluation and development datasets; ER = error rate (20°), F = F-score (20°), LE = localization error, LR = localization recall.

| Best official system rank | Submission name | Corresponding author | Affiliation | Technical report | Eval. ER (20°) | Eval. F (20°) | Eval. LE (°) | Eval. LR | Dev. ER (20°) | Dev. F (20°) | Dev. LE (°) | Dev. LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Shimada_SONY_task3_3 | Kazuki Shimada | Sony Group Corporation | Shimada_SONY_task3_report | 0.32 | 79.1 | 8.5 | 82.8 | 0.43 | 69.9 | 11.1 | 73.2 |
| 5 | Nguyen_NTU_task3_3 | Thi Ngoc Tho Nguyen | Nanyang Technological University | Nguyen_NTU_task3_report | 0.32 | 78.3 | 10.0 | 78.3 | 0.37 | 73.7 | 11.2 | 74.1 |
| 9 | Parrish_JHU_task3_2 | Nathan Parrish | Johns Hopkins University | Parrish_JHU_task3_report | 0.39 | 73.8 | 12.8 | 76.8 | | | | |
| 10 | Lee_SGU_task3_1 | Sang-Hoon Lee | Sogang University | Lee_SGU_task3_report | 0.40 | 72.9 | 13.2 | 76.5 | 0.46 | 60.9 | 14.4 | 73.3 |
| 14 | Park_ETRI_task3_4 | Sooyoung Park | Electronics and Telecommunications Research Institute | Park_ETRI_task3_report | 0.46 | 67.8 | 12.8 | 72.3 | 0.44 | 69.6 | 13.7 | 74.2 |
| 20 | Zhang_UCAS_task3_1 | Zihao Li | University of Chinese Academy of Sciences | Zhang_UCAS_task3_report | 0.46 | 64.7 | 12.8 | 61.9 | 0.46 | 63.2 | 13.9 | 62.9 |
| 24 | Ko_SKKU_task3_4 | Jonghwan Ko | Sungkyunkwan University | Ko_SKKU_task3_report | 0.58 | 60.3 | 15.1 | 70.7 | 0.38 | 73.4 | 14.7 | 81.0 |
| 28 | Huang_Aalto_task3_1 | Huang Daolang | Aalto University | Huang_Aalto_task3_report | 0.57 | 52.3 | 18.5 | 58.5 | 0.71 | 36.8 | 23.3 | 66.8 |
| 29 | Yalta_HIT_task3_1 | Nelson Yalta | Hitachi, Ltd. | Yalta_HIT_task3_report | 0.72 | 52.5 | 20.1 | 71.1 | 0.60 | 54.0 | 20.3 | 65.3 |
| 31 | Naranjo-Alcazar_UV_task3_2 | Javier Naranjo-Alcazar | Instituto Tecnológico de Informática | Naranjo-Alcazar_UV_task3_report | 0.68 | 37.7 | 25.3 | 53.9 | 0.71 | 31.9 | 27.6 | 46.6 |
| 31 | Politis_TAU_task3_foa | Archontis Politis | Tampere University | Politis_TUNI_task3_report | 0.67 | 37.2 | 23.9 | 45.8 | 0.73 | 30.7 | 24.5 | 44.8 |
| 35 | Bai_NWPU_task3_2 | Jisheng Bai | LianFeng Acoustic Technologies Co., Ltd. | Bai_NWPU_task3_report | 0.79 | 16.4 | 66.5 | 35.5 | 0.76 | 20.7 | 40.1 | 30.8 |
| 37 | Sun_AIAL-XJU_task3_2 | Xinghao Sun | Xinjiang University | Sun_AIAL-XJU_task3_report | 0.95 | 2.7 | 84.5 | 17.4 | 0.52 | 55.3 | 19.1 | 60.9 |

Systems ranking

Performance of all the submitted systems on the evaluation and the development datasets

| Official rank | Submission name | Technical report | Eval. ER (20°) | Eval. F (20°) | Eval. LE (°) | Eval. LR | Dev. ER (20°) | Dev. F (20°) | Dev. LE (°) | Dev. LR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Shimada_SONY_task3_3 | Shimada_SONY_task3_report | 0.32 | 79.1 | 8.5 | 82.8 | 0.43 | 69.9 | 11.1 | 73.2 |
| 1 | Shimada_SONY_task3_2 | Shimada_SONY_task3_report | 0.30 | 79.4 | 8.2 | 79.0 | 0.41 | 69.6 | 10.7 | 68.6 |
| 3 | Shimada_SONY_task3_4 | Shimada_SONY_task3_report | 0.31 | 79.0 | 8.1 | 78.2 | 0.41 | 70.0 | 10.3 | 68.7 |
| 4 | Shimada_SONY_task3_1 | Shimada_SONY_task3_report | 0.33 | 79.0 | 8.5 | 82.6 | 0.43 | 69.6 | 11.3 | 73.2 |
| 5 | Nguyen_NTU_task3_3 | Nguyen_NTU_task3_report | 0.32 | 78.3 | 10.0 | 78.3 | 0.37 | 73.7 | 11.2 | 74.1 |
| 6 | Nguyen_NTU_task3_1 | Nguyen_NTU_task3_report | 0.33 | 78.0 | 10.1 | 79.1 | 0.38 | 74.0 | 11.4 | 75.6 |
| 7 | Nguyen_NTU_task3_2 | Nguyen_NTU_task3_report | 0.33 | 78.0 | 10.2 | 78.7 | 0.38 | 73.8 | 11.2 | 75.0 |
| 8 | Nguyen_NTU_task3_4 | Nguyen_NTU_task3_report | 0.35 | 77.2 | 10.3 | 80.3 | 0.39 | 74.1 | 12.1 | 77.9 |
| 9 | Parrish_JHU_task3_2 | Parrish_JHU_task3_report | 0.39 | 73.8 | 12.8 | 76.8 | | | | |
| 10 | Lee_SGU_task3_1 | Lee_SGU_task3_report | 0.40 | 72.9 | 13.2 | 76.5 | 0.46 | 60.9 | 14.4 | 73.3 |
| 11 | Parrish_JHU_task3_1 | Parrish_JHU_task3_report | 0.41 | 72.6 | 13.5 | 76.4 | | | | |
| 12 | Lee_SGU_task3_4 | Lee_SGU_task3_report | 0.44 | 70.7 | 13.2 | 74.9 | 0.48 | 59.9 | 14.3 | 72.2 |
| 13 | Lee_SGU_task3_2 | Lee_SGU_task3_report | 0.44 | 69.9 | 14.9 | 75.8 | 0.48 | 59.9 | 15.5 | 73.5 |
| 14 | Park_ETRI_task3_4 | Park_ETRI_task3_report | 0.46 | 67.8 | 12.8 | 72.3 | 0.44 | 69.6 | 13.7 | 74.2 |
| 15 | Lee_SGU_task3_3 | Lee_SGU_task3_report | 0.45 | 68.9 | 15.3 | 75.2 | 0.49 | 59.4 | 15.1 | 73.4 |
| 15 | Parrish_JHU_task3_3 | Parrish_JHU_task3_report | 0.45 | 68.8 | 14.8 | 75.1 | | | | |
| 17 | Park_ETRI_task3_2 | Park_ETRI_task3_report | 0.47 | 67.5 | 12.9 | 72.7 | 0.44 | 69.6 | 13.7 | 74.2 |
| 17 | Park_ETRI_task3_3 | Park_ETRI_task3_report | 0.46 | 67.5 | 12.8 | 72.1 | 0.44 | 69.6 | 13.7 | 74.2 |
| 19 | Park_ETRI_task3_1 | Park_ETRI_task3_report | 0.48 | 65.9 | 12.5 | 68.9 | 0.44 | 69.6 | 13.7 | 74.2 |
| 20 | Zhang_UCAS_task3_1 | Zhang_UCAS_task3_report | 0.46 | 64.7 | 12.8 | 61.9 | 0.46 | 63.2 | 13.9 | 62.9 |
| 21 | Zhang_UCAS_task3_3 | Zhang_UCAS_task3_report | 0.48 | 64.0 | 12.8 | 58.9 | 0.47 | 61.3 | 14.0 | 59.0 |
| 22 | Zhang_UCAS_task3_4 | Zhang_UCAS_task3_report | 0.47 | 64.6 | 14.4 | 66.0 | 0.50 | 60.3 | 18.1 | 68.8 |
| 23 | Zhang_UCAS_task3_2 | Zhang_UCAS_task3_report | 0.48 | 63.3 | 14.3 | 64.5 | 0.49 | 60.4 | 16.0 | 64.0 |
| 24 | Ko_SKKU_task3_4 | Ko_SKKU_task3_report | 0.58 | 60.3 | 15.1 | 70.7 | 0.38 | 73.4 | 14.7 | 81.0 |
| 25 | Ko_SKKU_task3_1 | Ko_SKKU_task3_report | 0.64 | 59.0 | 15.6 | 73.0 | 0.39 | 73.2 | 14.9 | 81.3 |
| 25 | Parrish_JHU_task3_4 | Parrish_JHU_task3_report | 0.50 | 62.9 | 15.4 | 67.0 | | | | |
| 27 | Ko_SKKU_task3_2 | Ko_SKKU_task3_report | 0.67 | 58.1 | 15.8 | 73.5 | 0.45 | 70.8 | 15.2 | 81.8 |
| 28 | Huang_Aalto_task3_1 | Huang_Aalto_task3_report | 0.57 | 52.3 | 18.5 | 58.5 | 0.71 | 36.8 | 23.3 | 66.8 |
| 29 | Yalta_HIT_task3_1 | Yalta_HIT_task3_report | 0.72 | 52.5 | 20.1 | 71.1 | 0.60 | 54.0 | 20.3 | 65.3 |
| 30 | Huang_Aalto_task3_2 | Huang_Aalto_task3_report | 0.59 | 49.8 | 19.5 | 57.3 | 0.71 | 35.9 | 24.5 | 67.4 |
| 31 | Naranjo-Alcazar_UV_task3_2 | Naranjo-Alcazar_UV_task3_report | 0.68 | 37.7 | 25.3 | 53.9 | 0.71 | 31.9 | 27.6 | 46.6 |
| 31 | Politis_TAU_task3_foa | Politis_TUNI_task3_report | 0.67 | 37.2 | 23.9 | 45.8 | 0.73 | 30.7 | 24.5 | 44.8 |
| 33 | Naranjo-Alcazar_UV_task3_1 | Naranjo-Alcazar_UV_task3_report | 0.67 | 36.8 | 30.1 | 48.7 | 0.72 | 30.2 | 29.4 | 42.5 |
| 34 | Politis_TAU_task3_mic | Politis_TUNI_task3_report | 0.73 | 27.1 | 30.8 | 40.6 | 0.75 | 23.4 | 30.6 | 37.8 |
| 35 | Bai_NWPU_task3_2 | Bai_NWPU_task3_report | 0.79 | 16.4 | 66.5 | 35.5 | 0.76 | 20.7 | 40.1 | 30.8 |
| 36 | Bai_NWPU_task3_1 | Bai_NWPU_task3_report | 0.81 | 15.0 | 69.6 | 37.5 | 0.76 | 20.7 | 40.1 | 30.8 |
| 37 | Sun_AIAL-XJU_task3_2 | Sun_AIAL-XJU_task3_report | 0.95 | 2.7 | 84.5 | 17.4 | 0.52 | 55.3 | 19.1 | 60.9 |
| 38 | Sun_AIAL-XJU_task3_1 | Sun_AIAL-XJU_task3_report | 0.97 | 1.4 | 84.4 | 10.6 | 0.57 | 52.6 | 19.6 | 58.1 |

System characteristics

Summary of the submitted systems' characteristics.

| Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation |
|---|---|---|---|---|---|---|---|
| 1 | Shimada_SONY_task3_3 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
| 1 | Shimada_SONY_task3_2 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
| 3 | Shimada_SONY_task3_4 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 73578962 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
| 4 | Shimada_SONY_task3_1 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
| 5 | Nguyen_NTU_task3_3 | Nguyen_NTU_task3_report | CRNN, ensemble | 107800000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
| 6 | Nguyen_NTU_task3_1 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 112200000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
| 7 | Nguyen_NTU_task3_2 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 83900000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
| 8 | Nguyen_NTU_task3_4 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 112200000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
| 9 | Parrish_JHU_task3_2 | Parrish_JHU_task3_report | CNN, MHSA | 222000000 | Ambisonic | mel spectra, constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
| 10 | Lee_SGU_task3_1 | Lee_SGU_task3_report | CNN, Transformer | 27797491 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
| 11 | Parrish_JHU_task3_1 | Parrish_JHU_task3_report | CNN, MHSA | 237000000 | Ambisonic | mel spectra, constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
| 12 | Lee_SGU_task3_4 | Lee_SGU_task3_report | CNN, Transformer | 19389427 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
| 13 | Lee_SGU_task3_2 | Lee_SGU_task3_report | CNN, Transformer | 26221567 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
| 14 | Park_ETRI_task3_4 | Park_ETRI_task3_report | Transformer, ensemble | 601251180 | Both | mel spectra, intensity vector | rotation, time stretching |
| 15 | Lee_SGU_task3_3 | Lee_SGU_task3_report | CNN, Transformer | 27798106 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
| 15 | Parrish_JHU_task3_3 | Parrish_JHU_task3_report | CNN, MHSA | 73000000 | Ambisonic | constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
| 17 | Park_ETRI_task3_2 | Park_ETRI_task3_report | Transformer | 172641840 | Both | mel spectra, intensity vector | rotation |
| 17 | Park_ETRI_task3_3 | Park_ETRI_task3_report | Transformer, ensemble | 429807444 | Both | mel spectra, intensity vector | rotation |
| 19 | Park_ETRI_task3_1 | Park_ETRI_task3_report | Transformer | 172641840 | Both | mel spectra, intensity vector | rotation |
| 20 | Zhang_UCAS_task3_1 | Zhang_UCAS_task3_report | CNN, Conformer, ensemble | 88684456 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
| 21 | Zhang_UCAS_task3_3 | Zhang_UCAS_task3_report | CNN, Conformer, ResNet, SEResNet, ensemble | 79306528 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
| 22 | Zhang_UCAS_task3_4 | Zhang_UCAS_task3_report | CNN, Conformer | 12834944 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM, SP |
| 23 | Zhang_UCAS_task3_2 | Zhang_UCAS_task3_report | CNN, Conformer | 12669172 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
| 24 | Ko_SKKU_task3_4 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 12360000 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
| 25 | Ko_SKKU_task3_1 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 9506544 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
| 25 | Parrish_JHU_task3_4 | Parrish_JHU_task3_report | CNN, MHSA | 21000000 | Ambisonic | mel spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
| 27 | Ko_SKKU_task3_2 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 9506544 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
| 28 | Huang_Aalto_task3_1 | Huang_Aalto_task3_report | CNN, Transformer | 2214052 | Both | raw waveform | rotation, time masking, random audio equalization |
| 29 | Yalta_HIT_task3_1 | Yalta_HIT_task3_report | Transformer | 4100000 | Ambisonic | mel spectra, intensity vector | SpecAugment |
| 30 | Huang_Aalto_task3_2 | Huang_Aalto_task3_report | CNN, Transformer | 2251684 | Both | raw waveform | rotation, time masking, random audio equalization |
| 31 | Naranjo-Alcazar_UV_task3_2 | Naranjo-Alcazar_UV_task3_report | CRNN | 590100 | Ambisonic | mel spectra, intensity vector | None |
| 31 | Politis_TAU_task3_foa | Politis_TUNI_task3_report | CRNN | 494000 | Ambisonic | mel spectra, intensity vector | None |
| 33 | Naranjo-Alcazar_UV_task3_1 | Naranjo-Alcazar_UV_task3_report | CRNN | 592020 | Microphone Array | mel spectra, GCC-PHAT | None |
| 34 | Politis_TAU_task3_mic | Politis_TUNI_task3_report | CRNN | 494000 | Microphone Array | mel spectra, GCC-PHAT | None |
| 35 | Bai_NWPU_task3_2 | Bai_NWPU_task3_report | CRNN | 116118 | Microphone Array | mel spectra, GCC-PHAT | random segment augmentation |
| 36 | Bai_NWPU_task3_1 | Bai_NWPU_task3_report | CRNN | 24506660 | Microphone Array | mel spectra, GCC-PHAT | random segment augmentation |
| 37 | Sun_AIAL-XJU_task3_2 | Sun_AIAL-XJU_task3_report | CRNN | 1338286 | Ambisonic | mel spectra, intensity vector | None |
| 38 | Sun_AIAL-XJU_task3_1 | Sun_AIAL-XJU_task3_report | CRNN | 1338286 | Ambisonic | mel spectra, intensity vector | None |



Technical reports

DCASE 2021 TASK 3: SELD SYSTEM BASED ON RESNET AND RANDOM SEGMENT AUGMENTATION

Jisheng Bai¹, Zijun Pu², Jianfeng Chen³
¹LianFeng Acoustic Technologies Co. Ltd., ²Kunming University of Science and Technology, ³Northwestern Polytechnical University

Abstract

This technical report describes our system proposed for the Sound Event Localization & Detection (SELD) task of the DCASE 2021 challenge [1]. In our approach, ResNet architectures are used as the main network for SELD, and a GRU is applied after the ResNet to capture the temporal relationships of the acoustic features. Moreover, a data augmentation method called random segment augmentation is adopted during training. First, the original recordings, each containing a single sound event, are cut into 100 ms pieces. Second, all the pieces are shuffled and randomly combined to generate new recordings. Finally, the proposed system is evaluated on the task 3 development dataset, where it achieves better performance than the baseline.

PDF
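
As a reading aid, the random segment augmentation described above can be sketched as follows; the sample rate, mono input, and function names are assumptions for illustration, not the authors' implementation, and per-piece class labels would have to be tracked alongside the audio.

```python
# A minimal sketch of random segment augmentation, assuming mono
# single-event recordings at a common sample rate (values illustrative).
import numpy as np

def random_segment_augment(recordings, sr=24000, piece_ms=100, rng=None):
    """Cut single-event recordings into 100 ms pieces, shuffle them,
    and concatenate the pieces into a new synthetic recording."""
    rng = rng or np.random.default_rng()
    piece_len = int(sr * piece_ms / 1000)
    pieces = []
    for x in recordings:
        n = len(x) // piece_len          # whole 100 ms pieces in this clip
        if n:
            pieces.extend(np.split(x[:n * piece_len], n))
    rng.shuffle(pieces)
    return np.concatenate(pieces)
```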

SSELDNET: A FULLY END-TO-END SAMPLE-LEVEL FRAMEWORK FOR SOUND EVENT LOCALIZATION AND DETECTION

Huang Daolang, Perez Ricardo
Aalto University

Abstract

Sound event localization and detection (SELD) is a multi-task learning problem that aims to detect different audio events and estimate their corresponding locations. Previously proposed SELD systems have been based on hand-crafted features such as mel spectrograms, which require specific prior knowledge of acoustics. In this report, we investigate the possibility of applying representation learning directly to the raw audio and propose an end-to-end sample-level SELD framework. To improve generalization, we apply three data augmentation tricks: sound field rotation, time masking, and random audio equalization. The proposed system is evaluated on the TAU-NIGENS Spatial Sound Events 2021 development dataset. The experimental results will be submitted to the DCASE 2021 challenge task 3.

PDF

A COMBINATION OF VARIOUS NEURAL NETWORKS FOR SOUND EVENT LOCALIZATION AND DETECTION

Daniel Rho¹, Seungjin Lee¹, JinHyeock Park¹, Taesoo Kim¹, Jiho Chang², Jonghwan Ko¹
¹Sungkyunkwan University, ²Korea Research Institute of Standards and Science

Abstract

This technical report describes our approach to DCASE 2021 task 3: Sound Event Localization and Detection (SELD). We propose a network architecture, a combination of various network layers, which can yield optimal performance for the SELD task. Furthermore, we propose augmentation techniques to boost the performance of our proposed model given the limited training dataset. To further improve performance, several techniques were applied at the training and post-processing stages, such as adaptive gradient clipping, ensemble techniques, and class-wise dynamic thresholds. Evaluation results on the development dataset show that the proposed approach outperforms the existing baseline model of the task.

PDF
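
Among the post-processing techniques listed above, the class-wise thresholds admit a short sketch; the shapes and threshold values below are hypothetical, and how the submission makes them "dynamic" is not detailed in the abstract.

```python
# Hypothetical sketch: binarize frame-wise SED probabilities with a
# separate threshold per class instead of one global threshold.
import numpy as np

def binarize_classwise(sed_probs, thresholds):
    """sed_probs: (frames, classes); thresholds: (classes,)."""
    return sed_probs >= np.asarray(thresholds)[None, :]

probs = np.random.rand(600, 12)             # 600 frames, 12 target classes
per_class_thr = np.linspace(0.3, 0.6, 12)   # e.g. tuned per class on validation data
activity = binarize_classwise(probs, per_class_thr)
```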

SOUND EVENT LOCALIZATION AND DETECTION USING CROSS-MODAL ATTENTION AND PARAMETER SHARING FOR DCASE2021 CHALLENGE

Sang-Hoon Lee, Jung-Wook Hwang, Sang-Buem Seo, Hyung-Min Park
Sogang University

Abstract

In this report, we present our model for DCASE 2021 Challenge Task 3: Sound Event Localization and Detection (SELD) with Directional Interference. The model learns sound event detection (SED) and direction-of-arrival (DoA) estimation at once through multi-task learning for the SELD task. When training the model, features common to both SED and DoA prediction are extracted using a parameter-sharing strategy at the feature level. In addition, the output is estimated by adding an attention layer based on cross-modal attention (CMA) in the transformer decoder so that the system can efficiently learn associations between SED and DoA features. Furthermore, three different prediction rules are presented for the fully connected (FC) networks that provide the SED and DoA results. Experiments were conducted on the TAU-NIGENS Spatial Sound Events 2021 dataset. To produce more training data, the data were augmented with mixup, which sums weighted audio clips; channel rotation, which changes the location of the sound source by rotating input channels; and SpecAugment. Experimental results show that our method provides significantly improved performance over the baseline method.

PDF
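
The mixup variant mentioned above (a weighted sum of audio clips) can be pictured as below; drawing the weight from a Beta distribution is a common convention assumed here, not something the report specifies.

```python
# Mixup sketch for audio clips: mix two clips and their targets with the
# same weight. The Beta-distributed weight is an assumed convention.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # mixing weight in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2   # weighted sum of the waveforms
    y = lam * y1 + (1.0 - lam) * y2   # targets mixed with the same weight
    return x, y
```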

SOUND EVENT LOCALIZATION AND DETECTION USING SQUEEZE-EXCITATION RESIDUAL CNNS

Javier Naranjo-Alcazar¹,², Sergi Perez-Castanos¹, Maximo Cobos¹, Francesc J. Ferri¹, Pedro Zuccarello²
¹Universitat de Valencia, ²Instituto Tecnológico de Informática

Abstract

Sound event localisation and detection (SELD) is a problem in the field of machine listening that aims at the temporal detection and localisation (direction-of-arrival estimation) of sound events within an audio clip, usually of long duration. Due to the amount of data available in the datasets related to this problem, solutions based on deep learning have positioned themselves at the top of the state of the art. Most solutions are based on 2D representations of the audio (different spectrograms) that are processed by a convolutional-recurrent network. The motivation of this submission is to study the squeeze-excitation technique in the convolutional part of the network and how it improves the performance of the system. This study builds on the one carried out by the same team last year. This year, we study how this technique improves performance on each of the dataset formats (last year only the MIC format was studied). This modification shows an improvement in system performance compared to the baseline on the MIC dataset.

PDF
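
For readers unfamiliar with the technique, a generic squeeze-excitation block is sketched below in PyTorch; the submission's residual variants may differ in detail.

```python
# Generic squeeze-and-excitation block (illustrative, not the submission's
# exact residual variant): channel-wise recalibration of feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                   # x: (batch, channels, time, freq)
        s = x.mean(dim=(2, 3))              # squeeze: global average pooling
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))      # excitation: per-channel weights
        return x * s[:, :, None, None]      # recalibrate the feature maps
```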

DCASE 2021 TASK 3: SPECTROTEMPORALLY-ALIGNED FEATURES FOR POLYPHONIC SOUND EVENT LOCALIZATION AND DETECTION

Thi Ngoc Tho Nguyen¹, Karn Watcharasupat¹, Ngoc Khanh Nguyen¹, Douglas L. Jones², Woon Seng Gan¹
¹Nanyang Technological University, ²University of Illinois Urbana-Champaign

Abstract

Sound event localization and detection consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. It is therefore often difficult to optimize these two subtasks jointly. We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival. The feature consists of multichannel log-spectrograms stacked along with the estimated direct-to-reverberant ratio and a normalized version of the principal eigenvector of the spatial covariance matrix at each time-frequency bin of the spectrograms. Experimental results on the DCASE 2021 dataset for sound event localization and detection with directional interference show that deep learning models trained on this new feature outperform the DCASE challenge baseline by a large margin. We combined several models with slightly different architectures, trained on the new feature, to further improve system performance for the DCASE sound event localization and detection challenge.

PDF
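
The principal-eigenvector component of SALSA described above might look like the following sketch; the local snapshot window and the absence of normalization are simplifications of the actual feature.

```python
# Sketch of SALSA's spatial component: the principal eigenvector of the
# spatial covariance matrix at each time-frequency bin. Window size and
# normalization are illustrative simplifications.
import numpy as np

def principal_eigenvectors(stft, half_win=3):
    """stft: complex (channels, frames, bins). Returns (channels, frames, bins)."""
    C, T, F = stft.shape
    out = np.zeros((C, T, F), dtype=complex)
    for t in range(T):
        lo, hi = max(0, t - half_win), min(T, t + half_win + 1)
        for f in range(F):
            x = stft[:, lo:hi, f]            # local multichannel snapshots
            cov = x @ x.conj().T             # spatial covariance (C x C)
            _, v = np.linalg.eigh(cov)       # eigenvalues in ascending order
            out[:, t, f] = v[:, -1]          # principal eigenvector
    return out
```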

SELF-ATTENTION MECHANISM FOR SOUND EVENT LOCALIZATION AND DETECTION

Sooyoung Park, Youngho Jeong, Taejin Lee
Electronics and Telecommunications Research Institute

Abstract

This technical report describes the system submitted to DCASE 2021 Task 3: Sound Event Localization and Detection (SELD) with Directional Interference. The goal of Task 3 is to classify polyphonic events with temporal activity into a given set of classes and to detect their direction in the presence of hidden (unannotated) interfering sound events. Our system uses a Transformer, exploiting the self-attention mechanism that is now successfully applied in many fields. We propose an architecture called Many-to-Many Audio Spectrogram Transformer (M2M-AST) that uses a pure Transformer to reduce the dependency on CNNs and to easily change the output resolution. Applying the architecture to Sound Event Detection (SED) and Direction of Arrival Estimation (DOAE), the two sub-problems that constitute SELD, we show that our system outperforms the baseline system.

PDF

MULTI-SCALE NETWORK FOR SOUND EVENT LOCALIZATION AND DETECTION

Patrick Emmanuel, Nathan Parrish, Mark Horton
Johns Hopkins University

Abstract

This report describes a multi-scale approach to the DCASE 2021 Sound Event Localization and Detection with Directional Interference task. The goal of this task is to detect, classify, and localize in time and space events from twelve sound event classes in varying reverberant acoustic environments in the presence of interfering sources. We train a network that jointly performs detection, localization, and classification using multi-channel magnitude spectral data and intensity vectors derived from first-order ambisonics time series. We implement a network with successive blocks of multi-scale filters to discriminate and extract overlapping classes with different spectral characteristics. We also implement an output format and a permutation-invariant training loss that enable the network to detect, classify, and localize multiple instances of the same class simultaneously. Experiments show that the proposed network outperforms the CRNN baseline networks in classification and localization metrics.

PDF

A DATASET OF DYNAMIC REVERBERANT SOUND SCENES WITH DIRECTIONAL INTERFERERS FOR SOUND EVENT LOCALIZATION AND DETECTION

Archontis Politis¹, Sharath Adavanne¹, Daniel Krause¹, Antoine Deleforge², Prerak Srivastava², Tuomas Virtanen¹
¹Tampere University, ²INRIA

Abstract

This report presents the dataset and baseline of Task 3 of the DCASE 2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical synthesis remains the same as in the previous iteration of the challenge; however, the new dataset brings more challenging conditions of polyphony and overlapping instances of the same class. The most important difference in the new dataset is the introduction of directional interferers: sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world SELD scenario, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation for SELD problems accompanies the dataset and is described herein. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset, excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by the directional interferers.

PDF
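
The ACCDOA representation the baseline employs admits a compact sketch: the per-class target is the Cartesian DOA unit vector scaled by the event activity, so detection and localization share a single output. A minimal version follows (conventions assumed, not the baseline's exact code).

```python
# ACCDOA target sketch: activity-scaled Cartesian DOA vectors, one per
# class and frame. Axis conventions are assumed for illustration.
import numpy as np

def accdoa_target(activity, azimuth, elevation):
    """activity: (frames, classes) in {0, 1}; azimuth/elevation in radians,
    same shape. Returns targets of shape (frames, classes, 3)."""
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    return activity[..., None] * np.stack([x, y, z], axis=-1)

# At inference, a class counts as active when its vector norm exceeds a
# threshold, and the normalized vector gives the DOA estimate.
```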

ENSEMBLE OF ACCDOA- AND EINV2-BASED SYSTEMS WITH D3NETS AND IMPULSE RESPONSE SIMULATION FOR SOUND EVENT LOCALIZATION AND DETECTION

Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji
Sony Group Corporation

Abstract

This report describes our systems submitted to the DCASE 2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on the activity-coupled Cartesian direction-of-arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we perform model ensembling by averaging the outputs of several systems trained under different conditions, such as input features, training folds, and model architectures. We also use the event independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improve over the baseline system on the development dataset.

PDF
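
At its core, the impulse response simulation (IRS) idea reduces to a multichannel convolution; the sketch below assumes a pre-computed simulated RIR and omits the RIR generation and level calibration the actual method would involve.

```python
# IRS core step (sketch): convolve a dry source signal with simulated
# multichannel room impulse responses to synthesize spatial training data.
import numpy as np
from scipy.signal import fftconvolve

def spatialize(source, rir):
    """source: (samples,) dry event; rir: (channels, rir_len) simulated RIRs.
    Returns (channels, samples + rir_len - 1)."""
    return np.stack([fftconvolve(source, h) for h in rir])
```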

SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING ADAPTIVE HYBRID CONVOLUTION AND MULTI-SCALE FEATURE EXTRACTOR

Xinghao Sun, Xiujuan Zhu, Ying Hu
Xinjiang University

Abstract

In this report, we present our method for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge task 3: Sound Event Localization and Detection with Directional Interference (SELDDI). We propose a method based on Adaptive Hybrid Convolution (AHConv) and a multi-scale feature extractor. A standard square convolution shares its weights across fixed-size regions of the feature map, which is limiting. To address this problem, we propose an AHConv mechanism that replaces the square convolution to capture time and frequency dependencies. We also explore a multi-scale feature extractor that can integrate information from very local up to exponentially large receptive fields within a block. To adaptively recalibrate the feature maps after the convolution operation, we design an adaptive attention block that is embedded in the AHConv. On the TAU-NIGENS Spatial Sound Events 2021 development dataset, our systems demonstrate a significant improvement over the baseline system. Only the first-order Ambisonics (FOA) dataset was considered in the experiments.

PDF

THE HITACHI DCASE 2021 TASK 3 SYSTEM: HANDLING DIRECTIVE INTERFERENCE WITH SELF ATTENTION LAYERS

Nelson Yalta, Takashi Sumiyoshi, Yohei Kawaguchi
Hitachi Ltd.

Abstract

This report describes the Hitachi system for the DCASE 2021 Challenge Task 3. Our proposal relies on a single-stage system that employs the transformer encoder (i.e., self-attention layers) as its core idea. We evaluate the effect of applying different transformer configurations to handle directional interference in the presence of multiple sound events. Additionally, the transformer employs residual connections to extract features from the input streams. We trained the model using SpecAugment for data augmentation and performed threshold post-processing for each sound event. Employing the first-order Ambisonics (FOA) signals, the transformer was trained using the activity-coupled Cartesian DOA vector (ACCDOA) representation. This unified training framework shows better performance than training a model for each sub-task independently.

PDF

DATA AUGMENTATION AND CLASS-BASED ENSEMBLED CNN-CONFORMER NETWORKS FOR SOUND EVENT LOCALIZATION AND DETECTION

Yuxuan Zhang, Shuo Wang, Zihao Li, Kejian Guo, Shijin Chen, Yan Pang
University of Chinese Academy of Sciences

Abstract

In this technical report, we describe our system for the DCASE 2021 Task 3: Sound Event Localization and Detection (SELD) challenge. We introduce a Conformer block into the baseline system to make better use of temporal context information for the SELD task. To expand the official training dataset, we use Audio Channel Swapping (ACS), Speed Perturbation (SP), and Time-Frequency Masking (TFM) as augmentation techniques. In addition, we propose a class-based ensemble method to obtain more robust sound event detection (SED) and sound source localization (SSL) estimates for each sound event. Evaluating our best system on the DCASE 2021 Challenge Task 3 development dataset, we achieve relative improvements of approximately 44% and 37% on the SELD scores.

PDF
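
One audio channel swapping (ACS) transform can be sketched for the FOA format, assuming ACN channel order (W, Y, Z, X): rotating the scene azimuth by 90° reduces to swapping and negating first-order channels, with the reference azimuths rotated to match. The exact set of swaps the authors use is not given in the abstract.

```python
# ACS sketch for FOA audio in ACN order [W, Y, Z, X]: rotate the scene
# azimuth by 90 degrees via channel swaps/negation (assumed convention).
import numpy as np

def rotate_foa_90(audio, azimuth_deg):
    """audio: (4, samples) FOA signal; azimuth_deg: reference azimuth labels."""
    w, y, z, x = audio
    # phi -> phi + 90 deg:  Y' = X,  X' = -Y,  W and Z unchanged.
    rotated = np.stack([w, x, z, -y])
    new_azimuth = (azimuth_deg + 90.0 + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return rotated, new_azimuth
```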