Task description
The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events, classify each event into a known set of sound classes, and further localize the events in space while they are active.
The focus of the current SELD task is to build systems that can handle event polyphony while being robust to ambient noise and reverberation in different acoustic environments/rooms, under static and dynamic spatial conditions (i.e., with moving sources). Additionally, the systems should be robust to interfering events that are localized in space but do not belong to the target classes, and whose spatiotemporal activity is unknown during training. The task provides two datasets, development and evaluation, recorded in a total of 13 different acoustic environments. Of the two, only the development dataset provides reference labels. Participants are expected to build and validate systems on the development dataset, report results on a predefined development split, and finally test their systems on the unseen evaluation dataset.
More details on the task setup and evaluation can be found on the task description page.
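As a rough illustration of how the location-dependent metrics in the tables below behave, the following sketch counts a prediction as a true positive only when its class matches a reference event and its DOA lies within 20° of the reference. This is a simplified, frame-level version with illustrative names; the official metric implementation instead uses optimal prediction-to-reference assignment and aggregates over one-second segments.

```python
import numpy as np

def angular_distance_deg(doa1, doa2):
    """Great-circle angle in degrees between two unit DOA vectors."""
    return np.degrees(np.arccos(np.clip(np.dot(doa1, doa2), -1.0, 1.0)))

def match_frame(refs, preds, threshold_deg=20.0):
    """Greedy class-aware matching for one frame. refs/preds are lists of
    (class_id, unit_doa_vector). A prediction counts as a true positive
    only if its class matches an unmatched reference AND its angular
    error is at most the threshold (the "20 degree" rule)."""
    unmatched = list(refs)
    tp, loc_errors = 0, []
    for cls, doa in preds:
        same_class = [r for r in unmatched if r[0] == cls]
        if not same_class:
            continue
        best = min(same_class, key=lambda r: angular_distance_deg(doa, r[1]))
        err = angular_distance_deg(doa, best[1])
        if err <= threshold_deg:
            tp += 1
            loc_errors.append(err)
            unmatched.remove(best)
    fp, fn = len(preds) - tp, len(unmatched)
    return tp, fp, fn, loc_errors
```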
Teams ranking
The SELD task received 37 submissions in total from 13 teams across the world. Results from one team were retracted at their request. The following table includes only the best-performing system per submitting team.
Submission name | Corresponding author | Affiliation | Technical report | Best official system rank | Error Rate (20°), eval. | F-score (20°, %), eval. | Localization error (°), eval. | Localization recall (%), eval. | Error Rate (20°), dev. | F-score (20°, %), dev. | Localization error (°), dev. | Localization recall (%), dev.
---|---|---|---|---|---|---|---|---|---|---|---|---
Shimada_SONY_task3_3 | Kazuki Shimada | Sony Group Corporation | Shimada_SONY_task3_report | 1 | 0.32 | 79.1 | 8.5 | 82.8 | 0.43 | 69.9 | 11.1 | 73.2 | |
Nguyen_NTU_task3_3 | Thi Ngoc Tho Nguyen | Nanyang Technological University | Nguyen_NTU_task3_report | 5 | 0.32 | 78.3 | 10.0 | 78.3 | 0.37 | 73.7 | 11.2 | 74.1 | |
Parrish_JHU_task3_2 | Nathan Parrish | Johns Hopkins University | Parrish_JHU_task3_report | 9 | 0.39 | 73.8 | 12.8 | 76.8 | |||||
Lee_SGU_task3_1 | Sang-Hoon Lee | Sogang University | Lee_SGU_task3_report | 10 | 0.40 | 72.9 | 13.2 | 76.5 | 0.46 | 60.9 | 14.4 | 73.3 | |
Park_ETRI_task3_4 | Sooyoung Park | Electronics and Telecommunications Research Institute | Park_ETRI_task3_report | 14 | 0.46 | 67.8 | 12.8 | 72.3 | 0.44 | 69.6 | 13.7 | 74.2 | |
Zhang_UCAS_task3_1 | Zihao Li | University of Chinese Academy of Sciences | Zhang_UCAS_task3_report | 20 | 0.46 | 64.7 | 12.8 | 61.9 | 0.46 | 63.2 | 13.9 | 62.9 | |
Ko_SKKU_task3_4 | Jonghwan Ko | Sungkyunkwan University | Ko_SKKU_task3_report | 24 | 0.58 | 60.3 | 15.1 | 70.7 | 0.38 | 73.4 | 14.7 | 81.0 | |
Huang_Aalto_task3_1 | Huang Daolang | Aalto University | Huang_Aalto_task3_report | 28 | 0.57 | 52.3 | 18.5 | 58.5 | 0.71 | 36.8 | 23.3 | 66.8 | |
Yalta_HIT_task3_1 | Nelson Yalta | Hitachi, Ltd. | Yalta_HIT_task3_report | 29 | 0.72 | 52.5 | 20.1 | 71.1 | 0.60 | 54.0 | 20.3 | 65.3 | |
Naranjo-Alcazar_UV_task3_2 | Javier Naranjo-Alcazar | Instituto Tecnológico de Informática | Naranjo-Alcazar_UV_task3_report | 31 | 0.68 | 37.7 | 25.3 | 53.9 | 0.71 | 31.9 | 27.6 | 46.6 | |
Politis_TAU_task3_foa | Archontis Politis | Tampere University | Politis_TUNI_task3_report | 31 | 0.67 | 37.2 | 23.9 | 45.8 | 0.73 | 30.7 | 24.5 | 44.8 | |
Bai_NWPU_task3_2 | Jisheng Bai | LianFeng Acoustic Technologies Co., Ltd. | Bai_NWPU_task3_report | 35 | 0.79 | 16.4 | 66.5 | 35.5 | 0.76 | 20.7 | 40.1 | 30.8 | |
Sun_AIAL-XJU_task3_2 | Xinghao Sun | Xinjiang University | Sun_AIAL-XJU_task3_report | 37 | 0.95 | 2.7 | 84.5 | 17.4 | 0.52 | 55.3 | 19.1 | 60.9 |
Systems ranking
Performance of all submitted systems on the evaluation and development datasets.
Submission name | Technical report | Official rank | Error Rate (20°), eval. | F-score (20°, %), eval. | Localization error (°), eval. | Localization recall (%), eval. | Error Rate (20°), dev. | F-score (20°, %), dev. | Localization error (°), dev. | Localization recall (%), dev.
---|---|---|---|---|---|---|---|---|---|---
Shimada_SONY_task3_3 | Shimada_SONY_task3_report | 1 | 0.32 | 79.1 | 8.5 | 82.8 | 0.43 | 69.9 | 11.1 | 73.2 | |
Shimada_SONY_task3_2 | Shimada_SONY_task3_report | 1 | 0.30 | 79.4 | 8.2 | 79.0 | 0.41 | 69.6 | 10.7 | 68.6 | |
Shimada_SONY_task3_4 | Shimada_SONY_task3_report | 3 | 0.31 | 79.0 | 8.1 | 78.2 | 0.41 | 70.0 | 10.3 | 68.7 | |
Shimada_SONY_task3_1 | Shimada_SONY_task3_report | 4 | 0.33 | 79.0 | 8.5 | 82.6 | 0.43 | 69.6 | 11.3 | 73.2 | |
Nguyen_NTU_task3_3 | Nguyen_NTU_task3_report | 5 | 0.32 | 78.3 | 10.0 | 78.3 | 0.37 | 73.7 | 11.2 | 74.1 | |
Nguyen_NTU_task3_1 | Nguyen_NTU_task3_report | 6 | 0.33 | 78.0 | 10.1 | 79.1 | 0.38 | 74.0 | 11.4 | 75.6 | |
Nguyen_NTU_task3_2 | Nguyen_NTU_task3_report | 7 | 0.33 | 78.0 | 10.2 | 78.7 | 0.38 | 73.8 | 11.2 | 75.0 | |
Nguyen_NTU_task3_4 | Nguyen_NTU_task3_report | 8 | 0.35 | 77.2 | 10.3 | 80.3 | 0.39 | 74.1 | 12.1 | 77.9 | |
Parrish_JHU_task3_2 | Parrish_JHU_task3_report | 9 | 0.39 | 73.8 | 12.8 | 76.8 | |||||
Lee_SGU_task3_1 | Lee_SGU_task3_report | 10 | 0.40 | 72.9 | 13.2 | 76.5 | 0.46 | 60.9 | 14.4 | 73.3 | |
Parrish_JHU_task3_1 | Parrish_JHU_task3_report | 11 | 0.41 | 72.6 | 13.5 | 76.4 | |||||
Lee_SGU_task3_4 | Lee_SGU_task3_report | 12 | 0.44 | 70.7 | 13.2 | 74.9 | 0.48 | 59.9 | 14.3 | 72.2 | |
Lee_SGU_task3_2 | Lee_SGU_task3_report | 13 | 0.44 | 69.9 | 14.9 | 75.8 | 0.48 | 59.9 | 15.5 | 73.5 | |
Park_ETRI_task3_4 | Park_ETRI_task3_report | 14 | 0.46 | 67.8 | 12.8 | 72.3 | 0.44 | 69.6 | 13.7 | 74.2 | |
Lee_SGU_task3_3 | Lee_SGU_task3_report | 15 | 0.45 | 68.9 | 15.3 | 75.2 | 0.49 | 59.4 | 15.1 | 73.4 | |
Parrish_JHU_task3_3 | Parrish_JHU_task3_report | 15 | 0.45 | 68.8 | 14.8 | 75.1 | |||||
Park_ETRI_task3_2 | Park_ETRI_task3_report | 17 | 0.47 | 67.5 | 12.9 | 72.7 | 0.44 | 69.6 | 13.7 | 74.2 | |
Park_ETRI_task3_3 | Park_ETRI_task3_report | 17 | 0.46 | 67.5 | 12.8 | 72.1 | 0.44 | 69.6 | 13.7 | 74.2 | |
Park_ETRI_task3_1 | Park_ETRI_task3_report | 19 | 0.48 | 65.9 | 12.5 | 68.9 | 0.44 | 69.6 | 13.7 | 74.2 | |
Zhang_UCAS_task3_1 | Zhang_UCAS_task3_report | 20 | 0.46 | 64.7 | 12.8 | 61.9 | 0.46 | 63.2 | 13.9 | 62.9 | |
Zhang_UCAS_task3_3 | Zhang_UCAS_task3_report | 21 | 0.48 | 64.0 | 12.8 | 58.9 | 0.47 | 61.3 | 14.0 | 59.0 | |
Zhang_UCAS_task3_4 | Zhang_UCAS_task3_report | 22 | 0.47 | 64.6 | 14.4 | 66.0 | 0.50 | 60.3 | 18.1 | 68.8 | |
Zhang_UCAS_task3_2 | Zhang_UCAS_task3_report | 23 | 0.48 | 63.3 | 14.3 | 64.5 | 0.49 | 60.4 | 16.0 | 64.0 | |
Ko_SKKU_task3_4 | Ko_SKKU_task3_report | 24 | 0.58 | 60.3 | 15.1 | 70.7 | 0.38 | 73.4 | 14.7 | 81.0 | |
Ko_SKKU_task3_1 | Ko_SKKU_task3_report | 25 | 0.64 | 59.0 | 15.6 | 73.0 | 0.39 | 73.2 | 14.9 | 81.3 | |
Parrish_JHU_task3_4 | Parrish_JHU_task3_report | 25 | 0.50 | 62.9 | 15.4 | 67.0 | |||||
Ko_SKKU_task3_2 | Ko_SKKU_task3_report | 27 | 0.67 | 58.1 | 15.8 | 73.5 | 0.45 | 70.8 | 15.2 | 81.8 | |
Huang_Aalto_task3_1 | Huang_Aalto_task3_report | 28 | 0.57 | 52.3 | 18.5 | 58.5 | 0.71 | 36.8 | 23.3 | 66.8 | |
Yalta_HIT_task3_1 | Yalta_HIT_task3_report | 29 | 0.72 | 52.5 | 20.1 | 71.1 | 0.60 | 54.0 | 20.3 | 65.3 | |
Huang_Aalto_task3_2 | Huang_Aalto_task3_report | 30 | 0.59 | 49.8 | 19.5 | 57.3 | 0.71 | 35.9 | 24.5 | 67.4 | |
Naranjo-Alcazar_UV_task3_2 | Naranjo-Alcazar_UV_task3_report | 31 | 0.68 | 37.7 | 25.3 | 53.9 | 0.71 | 31.9 | 27.6 | 46.6 | |
Politis_TAU_task3_foa | Politis_TUNI_task3_report | 31 | 0.67 | 37.2 | 23.9 | 45.8 | 0.73 | 30.7 | 24.5 | 44.8 | |
Naranjo-Alcazar_UV_task3_1 | Naranjo-Alcazar_UV_task3_report | 33 | 0.67 | 36.8 | 30.1 | 48.7 | 0.72 | 30.2 | 29.4 | 42.5 | |
Politis_TAU_task3_mic | Politis_TUNI_task3_report | 34 | 0.73 | 27.1 | 30.8 | 40.6 | 0.75 | 23.4 | 30.6 | 37.8 | |
Bai_NWPU_task3_2 | Bai_NWPU_task3_report | 35 | 0.79 | 16.4 | 66.5 | 35.5 | 0.76 | 20.7 | 40.1 | 30.8 | |
Bai_NWPU_task3_1 | Bai_NWPU_task3_report | 36 | 0.81 | 15.0 | 69.6 | 37.5 | 0.76 | 20.7 | 40.1 | 30.8 | |
Sun_AIAL-XJU_task3_2 | Sun_AIAL-XJU_task3_report | 37 | 0.95 | 2.7 | 84.5 | 17.4 | 0.52 | 55.3 | 19.1 | 60.9 | |
Sun_AIAL-XJU_task3_1 | Sun_AIAL-XJU_task3_report | 38 | 0.97 | 1.4 | 84.4 | 10.6 | 0.57 | 52.6 | 19.6 | 58.1 |
System characteristics
Summary of the characteristics of the submitted systems.
Rank | Submission name | Technical report | Model | Model params | Audio format | Acoustic features | Data augmentation
---|---|---|---|---|---|---|---
1 | Shimada_SONY_task3_3 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
1 | Shimada_SONY_task3_2 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
3 | Shimada_SONY_task3_4 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 73578962 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
4 | Shimada_SONY_task3_1 | Shimada_SONY_task3_report | RD3Net, RD3Net with TFRNN, EINV2 with D3block, ensemble | 42647804 | Ambisonic | magnitude spectra, PCEN spectra, IPD, cosIPD, sinIPD | EMDA, rotation, SpecAugment, impulse response simulation |
5 | Nguyen_NTU_task3_3 | Nguyen_NTU_task3_report | CRNN, ensemble | 107800000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
6 | Nguyen_NTU_task3_1 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 112200000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
7 | Nguyen_NTU_task3_2 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 83900000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
8 | Nguyen_NTU_task3_4 | Nguyen_NTU_task3_report | CRNN, MHSA, ensemble | 112200000 | Ambisonic | eigenvector-augmented log spectra, mel spectra | mixup, frequency shifting, random cutout, SpecAugment, channel swapping |
9 | Parrish_JHU_task3_2 | Parrish_JHU_task3_report | CNN, MHSA | 222000000 | Ambisonic | mel spectra, constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
10 | Lee_SGU_task3_1 | Lee_SGU_task3_report | CNN, Transformer | 27797491 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
11 | Parrish_JHU_task3_1 | Parrish_JHU_task3_report | CNN, MHSA | 237000000 | Ambisonic | mel spectra, constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
12 | Lee_SGU_task3_4 | Lee_SGU_task3_report | CNN, Transformer | 19389427 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
13 | Lee_SGU_task3_2 | Lee_SGU_task3_report | CNN, Transformer | 26221567 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
14 | Park_ETRI_task3_4 | Park_ETRI_task3_report | Transformer, ensemble | 601251180 | Both | mel spectra, intensity vector | rotation, time stretching |
15 | Lee_SGU_task3_3 | Lee_SGU_task3_report | CNN, Transformer | 27798106 | Ambisonic | mel spectra, intensity vector | rotation, mixup, SpecAugment |
15 | Parrish_JHU_task3_3 | Parrish_JHU_task3_report | CNN, MHSA | 73000000 | Ambisonic | constant-Q spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
17 | Park_ETRI_task3_2 | Park_ETRI_task3_report | Transformer | 172641840 | Both | mel spectra, intensity vector | rotation |
17 | Park_ETRI_task3_3 | Park_ETRI_task3_report | Transformer, ensemble | 429807444 | Both | mel spectra, intensity vector | rotation |
19 | Park_ETRI_task3_1 | Park_ETRI_task3_report | Transformer | 172641840 | Both | mel spectra, intensity vector | rotation |
20 | Zhang_UCAS_task3_1 | Zhang_UCAS_task3_report | CNN, Conformer, ensemble | 88684456 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
21 | Zhang_UCAS_task3_3 | Zhang_UCAS_task3_report | CNN, Conformer, ResNet, SEResNet, ensemble | 79306528 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
22 | Zhang_UCAS_task3_4 | Zhang_UCAS_task3_report | CNN, Conformer | 12834944 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM, SP |
23 | Zhang_UCAS_task3_2 | Zhang_UCAS_task3_report | CNN, Conformer | 12669172 | Both | mel spectra, intensity vector, GCC-PHAT | ACS, TFM |
24 | Ko_SKKU_task3_4 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 12360000 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
25 | Ko_SKKU_task3_1 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 9506544 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
25 | Parrish_JHU_task3_4 | Parrish_JHU_task3_report | CNN, MHSA | 21000000 | Ambisonic | mel spectra, intensity vector | rotation, wav mixing, frequency masking, time masking |
27 | Ko_SKKU_task3_2 | Ko_SKKU_task3_report | CNN, RNN, Transformer, ensemble | 9506544 | Ambisonic | mel spectra, intensity vector | domain spatial augmentation, SpecAugment |
28 | Huang_Aalto_task3_1 | Huang_Aalto_task3_report | CNN, Transformer | 2214052 | Both | raw waveform | rotation, time masking, random audio equalization |
29 | Yalta_HIT_task3_1 | Yalta_HIT_task3_report | Transformer | 4100000 | Ambisonic | mel spectra, intensity vector | SpecAugment |
30 | Huang_Aalto_task3_2 | Huang_Aalto_task3_report | CNN, Transformer | 2251684 | Both | raw waveform | rotation, time masking, random audio equalization |
31 | Naranjo-Alcazar_UV_task3_2 | Naranjo-Alcazar_UV_task3_report | CRNN | 590100 | Ambisonic | mel spectra, intensity vector | None |
31 | Politis_TAU_task3_foa | Politis_TUNI_task3_report | CRNN | 494000 | Ambisonic | mel spectra, intensity vector | None |
33 | Naranjo-Alcazar_UV_task3_1 | Naranjo-Alcazar_UV_task3_report | CRNN | 592020 | Microphone Array | mel spectra, GCC-PHAT | None |
34 | Politis_TAU_task3_mic | Politis_TUNI_task3_report | CRNN | 494000 | Microphone Array | mel spectra, GCC-PHAT | None |
35 | Bai_NWPU_task3_2 | Bai_NWPU_task3_report | CRNN | 116118 | Microphone Array | mel spectra, GCC-PHAT | random segment augmentation |
36 | Bai_NWPU_task3_1 | Bai_NWPU_task3_report | CRNN | 24506660 | Microphone Array | mel spectra, GCC-PHAT | random segment augmentation |
37 | Sun_AIAL-XJU_task3_2 | Sun_AIAL-XJU_task3_report | CRNN | 1338286 | Ambisonic | mel spectra, intensity vector | None |
38 | Sun_AIAL-XJU_task3_1 | Sun_AIAL-XJU_task3_report | CRNN | 1338286 | Ambisonic | mel spectra, intensity vector | None |
Technical reports
DCASE 2021 TASK 3: SELD SYSTEM BASED ON RESNET AND RANDOM SEGMENT AUGMENTATION
Jisheng Bai1, Zijun Pu2, Jianfeng Chen3
1LianFeng Acoustic Technologies Co. Ltd., 2Kunming University of Science and Technology, 3Northwestern Polytechnical University
Bai_NWPU_task3_2 Bai_NWPU_task3_1
Abstract
This technical report describes our system proposed for the Sound Event Localization & Detection (SELD) task of the DCASE 2021 challenge [1]. In our approach, ResNet architectures are used as the main network for SELD, and a GRU is placed after the ResNet to capture the temporal relationships in the acoustic features. Moreover, a data augmentation method called random segment augmentation is adopted during training: first, the original recordings, which contain only a single sound event, are cut into 100 ms pieces; second, all the pieces are shuffled and randomly combined to generate new recordings. Finally, our proposed system is evaluated on the Task 3 development dataset, where it achieves better performance than the baseline.
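The random segment augmentation described above is simple to sketch. Below is a minimal single-channel version with illustrative names, assuming the dataset's 24 kHz sampling rate; the actual system would also have to recombine the corresponding labels and treat all channels consistently.

```python
import numpy as np

def random_segment_augment(recordings, sr=24000, piece_ms=100, rng=None):
    """Sketch of random segment augmentation: cut single-event recordings
    into 100 ms pieces, shuffle them, and concatenate into a new recording.
    `recordings` is a list of 1-D arrays; labels (omitted here) would be
    recombined in the same shuffled order."""
    rng = rng or np.random.default_rng()
    piece_len = int(sr * piece_ms / 1000)
    pieces = []
    for x in recordings:
        n = len(x) // piece_len
        if n:  # skip recordings shorter than one piece
            pieces.extend(np.split(x[: n * piece_len], n))
    order = rng.permutation(len(pieces))
    return np.concatenate([pieces[i] for i in order])
```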
SSELDNET: A FULLY END-TO-END SAMPLE-LEVEL FRAMEWORK FOR SOUND EVENT LOCALIZATION AND DETECTION
Huang Daolang, Perez Ricardo
Aalto University
Huang_Aalto_task3_2 Huang_Aalto_task3_1
Abstract
Sound event localization and detection (SELD) is a multi-task learning problem that aims to detect different audio events and estimate their corresponding locations. All previously proposed SELD systems were based on hand-extracted features such as mel spectrograms, which require specific prior knowledge of acoustics. In this report, we investigate the possibility of applying representation learning directly to the raw audio and propose an end-to-end, sample-level SELD framework. To improve generalization, we apply three data augmentation tricks: sound field rotation, time masking, and random audio equalization. The proposed system is evaluated on the TAU-NIGENS Spatial Sound Events 2021 development dataset, and the experimental results are submitted to DCASE 2021 challenge Task 3.
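A sample-level front end of the kind described could look like the sketch below: strided 1-D convolutions learn a filterbank directly from the raw multichannel waveform in place of hand-crafted spectrogram features. Channel counts, kernel sizes, and strides here are assumptions, not the authors' configuration.

```python
import torch.nn as nn

class SampleLevelFrontEnd(nn.Module):
    """Illustrative raw-waveform front end: strided 1-D convolutions map
    (batch, channels, samples) to a learned time-frequency-like feature."""
    def __init__(self, in_channels: int = 4, out_channels: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=9, stride=4, padding=4),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, out_channels, kernel_size=9, stride=4, padding=4),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, wav):  # wav: (batch, channels, samples)
        return self.net(wav)  # (batch, out_channels, ~samples / 16)
```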
A COMBINATION OF VARIOUS NEURAL NETWORKS FOR SOUND EVENT LOCALIZATION AND DETECTION
Daniel Rho1, Seungjin Lee1, JinHyeock Park1, Taesoo Kim1, Jiho Chang2, Jonghwan Ko1
1Sungkyunkwan University, 2Korea Research Institute of Standards and Science
Ko_SKKU_task3_4 Ko_SKKU_task3_2 Ko_SKKU_task3_1
Abstract
This technical report describes our approach to DCASE 2021 Task 3: Sound Event Localization and Detection (SELD). We propose a network architecture, a combination of various network layers, that yields optimal performance for the SELD task. Furthermore, we identify which augmentation techniques boost the performance of the proposed model given the limited training dataset. To further improve performance, several techniques were applied at the training and post-processing stages, such as adaptive gradient clipping, ensemble techniques, and class-wise dynamic thresholds. Evaluation results on the development dataset show that the proposed approach outperforms the existing baseline model of the task.
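Of the post-processing techniques listed, class-wise dynamic thresholds are the easiest to illustrate: instead of one global activity threshold, pick a separate threshold per class on held-out data. The sketch below is a generic stand-in with illustrative names and grid, not the report's exact procedure.

```python
import numpy as np

def classwise_thresholds(probs, targets, grid=np.linspace(0.1, 0.9, 17)):
    """Pick a per-class activity threshold on validation data by maximizing
    frame-level F1. probs/targets: arrays of shape (frames, classes)."""
    thresholds = []
    for c in range(probs.shape[1]):
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            pred = probs[:, c] >= t
            tp = np.sum(pred & (targets[:, c] > 0.5))
            fp = np.sum(pred) - tp
            fn = np.sum(targets[:, c] > 0.5) - tp
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return np.array(thresholds)
```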
SOUND EVENT LOCALIZATION AND DETECTION USING CROSS-MODAL ATTENTION AND PARAMETER SHARING FOR DCASE2021 CHALLENGE
Sang-Hoon Lee, Jung-Wook Hwang, Sang-Buem Seo, Hyung-Min Park
Sogang University
Lee_SGU_task3_3 Lee_SGU_task3_4 Lee_SGU_task3_1 Lee_SGU_task3_2
Abstract
In this report, we present our model for DCASE 2021 Challenge Task 3: Sound Event Localization and Detection (SELD) with Directional Interference. The model learns sound event detection (SED) and direction-of-arrival (DoA) estimation at once through multi-task learning. During training, features common to both SED and DoA prediction are extracted using a parameter-sharing strategy at the feature level. In addition, the output is estimated by adding an attention layer based on cross-modal attention (CMA) in the transformer decoder, so that the system can efficiently learn associations between SED and DoA features. Furthermore, three different prediction rules are presented for the fully connected (FC) networks that produce the SED and DoA results. Experiments were conducted on the TAU-NIGENS Spatial Sound Events 2021 dataset. To produce more training data, the data was augmented with mixup (summing weighted audio clips) and channel rotation (changing the source location information by rotating the input channels), in addition to SpecAugment. Experimental results show that our method provides significantly improved performance over the baseline method.
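A minimal sketch of the cross-modal attention idea, with SED features acting as queries over DoA features so each branch can attend to the other's representation. Dimensions and names are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-modal attention between SED and DoA branches: one
    branch's features are the queries, the other's the keys/values."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sed_feats, doa_feats):
        # sed_feats, doa_feats: (batch, frames, dim)
        attended, _ = self.attn(query=sed_feats, key=doa_feats, value=doa_feats)
        return self.norm(sed_feats + attended)  # residual + layer norm
```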
SOUND EVENT LOCALIZATION AND DETECTION USING SQUEEZE-EXCITATION RESIDUAL CNNS
Javier Naranjo-Alcazar1,2, Sergi Perez-Castanos1, Maximo Cobos1, Francesc J. Ferri1, Pedro Zuccarello2
1Universitat de Valencia, 2Instituto Tecnológico de Informática
Naranjo-Alcazar_UV_task3_2, Naranjo-Alcazar_UV_task3_1
Abstract
Sound event localisation and detection (SELD) is a problem in the field of automatic listening that aims at the temporal detection and localisation (direction-of-arrival estimation) of sound events within an audio clip, usually of long duration. Due to the amount of data available for this problem, solutions based on deep learning have positioned themselves at the top of the state of the art. Most solutions are based on 2D representations of the audio (different spectrograms) that are processed by a convolutional-recurrent network. The motivation of this submission is to study the squeeze-excitation technique in the convolutional part of the network and how it improves the performance of the system. This study builds on the one carried out by the same team last year; this year, we study how the technique improves performance on each of the datasets (last year only the MIC dataset was studied). This modification shows an improvement in the performance of the system compared to the baseline using the MIC dataset.
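The squeeze-excitation block under study is the standard one from Hu et al. (2018); a minimal PyTorch version for 2D spectrogram feature maps is sketched below, as a generic illustration rather than the submission's exact residual variant.

```python
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze-and-excitation: global pooling -> bottleneck MLP -> sigmoid
    gates that rescale each channel of the convolutional feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, time, freq)
        w = x.mean(dim=(2, 3))          # squeeze: global average pool
        w = self.fc(w)                  # excitation: channel-wise gates
        return x * w[:, :, None, None]  # rescale the feature map
```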
DCASE 2021 TASK 3: SPECTROTEMPORALLY-ALIGNED FEATURES FOR POLYPHONIC SOUND EVENT LOCALIZATION AND DETECTION
Thi Ngoc Tho Nguyen1, Karn Watcharasupat1, Ngoc Khanh Nguyen1, Douglas L. Jones2, Woon Seng Gan1
1Nanyang Technological University, 2University of Illinois Urbana-Champaign
Nguyen_NTU_task3_1 Nguyen_NTU_task3_3 Nguyen_NTU_task3_2 Nguyen_NTU_task3_4
Abstract
Sound event localization and detection consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. It is therefore often difficult to optimize these two subtasks jointly. We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction of arrival. The feature stacks multichannel log-spectrograms with the estimated direct-to-reverberant ratio and a normalized version of the principal eigenvector of the spatial covariance matrix at each time-frequency bin of the spectrograms. Experimental results on the DCASE 2021 dataset for sound event localization and detection with directional interference show that deep learning models trained on this new feature outperform the DCASE challenge baseline by a large margin. We combined several models with slightly different architectures trained on the new feature to further improve system performance for the challenge.
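A loose sketch of the eigenvector component of this feature is given below; it is not the authors' exact SALSA pipeline (which estimates the covariance more robustly, e.g. by averaging over neighboring frames, and adds direct-to-reverberant ratio estimation and noise-floor gating), just the core per-bin computation.

```python
import numpy as np

def principal_eigenvector_feature(stft, eps=1e-8):
    """Per time-frequency bin: principal eigenvector of the spatial
    covariance matrix, normalized to the first channel. Here the
    covariance is the rank-1 single-snapshot estimate.
    stft: complex array, shape (channels, frames, bins)."""
    C, T, F = stft.shape
    feat = np.zeros((C - 1, T, F))
    for t in range(T):
        for f in range(F):
            R = np.outer(stft[:, t, f], stft[:, t, f].conj())
            _, vecs = np.linalg.eigh(R)   # Hermitian eigendecomposition
            v = vecs[:, -1]               # principal eigenvector
            v = v / (v[0] + eps)          # reference to first channel
            feat[:, t, f] = np.real(v[1:])
    return feat
```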
SELF-ATTENTION MECHANISM FOR SOUND EVENT LOCALIZATION AND DETECTION
Sooyoung Park, Youngho Jeong, Taejin Lee
Electronics and Telecommunications Research Institute
Park_ETRI_task3_1 Park_ETRI_task3_4 Park_ETRI_task3_3 Park_ETRI_task3_2
Abstract
This technical report describes the system submitted to DCASE 2021 Task 3: Sound Event Localization and Detection (SELD) with Directional Interference. The goal of Task 3 is to classify polyphonic events with temporal activity into a given set of classes and to detect their directions in the presence of unannotated interfering sound events. Our system uses a Transformer built on the self-attention mechanism that is now used successfully in many fields. We propose an architecture called Many-to-Many Audio Spectrogram Transformer (M2M-AST) that uses a pure Transformer to reduce the dependency on CNNs and to change the output resolution easily. Applying this architecture to Sound Event Detection (SED) and Direction-of-Arrival Estimation (DOAE), the two subtasks that make up SELD, we show that our system outperforms the baseline system.
MULTI-SCALE NETWORK FOR SOUND EVENT LOCALIZATION AND DETECTION
Patrick Emmanuel, Nathan Parrish, Mark Horton
Johns Hopkins University
Parrish_JHU_task3_3 Parrish_JHU_task3_1 Parrish_JHU_task3_4 Parrish_JHU_task3_2
Abstract
This report describes a multi-scale approach to the DCASE 2021 Sound Event Localization and Detection with Directional Interference task. The goal of this task is to detect, classify, and localize, in time and space, events from twelve sound event classes in varying reverberant acoustic environments in the presence of interfering sources. We train a network that jointly performs detection, localization, and classification using multichannel magnitude spectral data and intensity vectors derived from first-order Ambisonics time series. The network consists of successive blocks of multi-scale filters that discriminate and extract overlapping classes with different spectral characteristics. We also implement an output format and a permutation-invariant training loss that enable the network to detect, classify, and localize multiple instances of the same class simultaneously. Experiments show that the proposed network outperforms the CRNN baseline networks on classification and localization metrics.
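The permutation-invariant training idea can be sketched for the simplest case of two output tracks of the same class: evaluate the loss under both prediction-to-reference assignments and keep the cheaper one. The sketch below covers only the DOA regression part; the report's loss also handles detection and classification.

```python
import torch

def pit_doa_loss(pred, target):
    """Permutation-invariant loss for two same-class output tracks.
    pred, target: tensors of shape (batch, 2, 3) holding Cartesian DOAs."""
    def pair_loss(p, t):
        return torch.mean(torch.sum((p - t) ** 2, dim=-1), dim=-1)  # (batch,)

    loss_id = pair_loss(pred, target)                 # tracks 0->0, 1->1
    loss_sw = pair_loss(pred, target.flip(dims=[1]))  # tracks 0->1, 1->0
    return torch.minimum(loss_id, loss_sw).mean()
```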
A DATASET OF DYNAMIC REVERBERANT SOUND SCENES WITH DIRECTIONAL INTERFERERS FOR SOUND EVENT LOCALIZATION AND DETECTION
Archontis Politis1, Sharath Adavanne1, Daniel Krause1, Antoine Deleforge2, Prerak Srivastava2, Tuomas Virtanen1
1Tampere University, 2INRIA
Politis_TAU_task3_foa Politis_TAU_task3_mic
Abstract
This report presents the dataset and baseline of Task 3 of the DCASE 2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical synthesis remains the same as in the previous iteration of the challenge; however, the new dataset brings more challenging conditions of polyphony and overlapping instances of the same class. The most important difference in the new dataset is the introduction of directional interferers: sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world scenario of SELD, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation for SELD problems accompanies the dataset and is described herein. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by the directional interferers.
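The ACCDOA representation used by the baseline is compact enough to sketch: each class's target per frame is a Cartesian DOA vector whose length encodes activity, so a single regression output covers both detection and localization. The decoding threshold below is an illustrative value, not necessarily the baseline's.

```python
import numpy as np

def accdoa_targets(activity, azimuth, elevation):
    """Encode SELD targets in ACCDOA format: per frame and class, a
    Cartesian DOA vector scaled by event activity.
    activity: (frames, classes) in {0, 1}; azimuth/elevation in radians."""
    x = np.cos(azimuth) * np.cos(elevation)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    doa = np.stack([x, y, z], axis=-1)   # (frames, classes, 3)
    return activity[..., None] * doa

def accdoa_decode(vectors, threshold=0.5):
    """A class is active where its ACCDOA vector norm exceeds the threshold;
    the normalized vector gives the DOA estimate."""
    norms = np.linalg.norm(vectors, axis=-1)
    active = norms > threshold
    doas = vectors / np.maximum(norms[..., None], 1e-9)
    return active, doas
```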
ENSEMBLE OF ACCDOA- AND EINV2-BASED SYSTEMS WITH D3NETS AND IMPULSE RESPONSE SIMULATION FOR SOUND EVENT LOCALIZATION AND DETECTION
Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji
Sony Group Corporation
Shimada_SONY_task3_4 Shimada_SONY_task3_1 Shimada_SONY_task3_3 Shimada_SONY_task3_2
Abstract
This report describes our systems submitted to the DCASE 2021 challenge Task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on the activity-coupled Cartesian direction-of-arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we build model ensembles by averaging the outputs of several systems trained under different conditions, such as input features, training folds, and model architectures. We also use an event-independent network v2 (EINV2)-based system to increase the diversity of the ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multichannel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improve over the baseline system on the development dataset.
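Only the convolution step of IRS lends itself to a generic sketch: spatialize a (roughly) dry source signal with a simulated multichannel RIR to synthesize a new spatial training example. The authors' full method additionally extracts the source signals from the original dataset and simulates RIRs including source movement; none of that is shown here.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_with_rir(source, rirs):
    """Convolve a mono source with a multichannel room impulse response.
    source: (samples,); rirs: (channels, rir_samples).
    Returns a (channels, samples) spatialized signal."""
    return np.stack([fftconvolve(source, rir)[: len(source)] for rir in rirs])
```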
SOUND EVENT LOCALIZATION AND DETECTION BASED ON CRNN USING ADAPTIVE HYBRID CONVOLUTION AND MULTI-SCALE FEATURE EXTRACTOR
Xinghao Sun, Xiujuan Zhu, Ying Hu
Xinjiang University
Sun_AIAL-XJU_task3_2 Sun_AIAL-XJU_task3_1
Abstract
In this report, we present our method for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge Task 3: Sound Event Localization and Detection with Directional Interference (SELDDI). We propose a method based on Adaptive Hybrid Convolution (AHConv) and a multi-scale feature extractor. A square convolution shares its weights across every time-frequency bin within a fixed area of the feature map, which is limiting. To address this problem, we propose an AHConv mechanism in place of square convolution to capture time and frequency dependencies. We also explore a multi-scale feature extractor that can integrate information from very local up to exponentially large receptive fields within a block. To adaptively recalibrate the feature maps after the convolution operation, we designed an adaptive attention block, which is largely embodied in the AHConv. On the TAU-NIGENS Spatial Sound Events 2021 development dataset, our systems demonstrate a significant improvement over the baseline system. Only the first-order Ambisonics (FOA) dataset was considered in our experiments.
THE HITACHI DCASE 2021 TASK 3 SYSTEM: HANDLING DIRECTIVE INTERFERENCE WITH SELF ATTENTION LAYERS
Nelson Yalta, Takashi Sumiyoshi, Yohei Kawaguchi
Hitachi Ltd.
Yalta_HIT_task3_1
Abstract
This report describes the Hitachi system for the DCASE 2021 Challenge Task 3. Our proposal relies on a single-stage system built around transformer encoders (i.e., self-attention layers) as its core idea. We evaluate the effect of different transformer configurations on handling directional interference in the presence of multiple sound events. Additionally, the transformer employs residual connections to extract features from the input streams. We trained the model using SpecAugment for data augmentation and performed per-event threshold post-processing. Employing the first-order Ambisonics (FOA) signals, the transformer was trained using the activity-coupled Cartesian DOA vector (ACCDOA) representation. This unified training framework showed better performance than training a model for each subtask independently.
DATA AUGMENTATION AND CLASS-BASED ENSEMBLED CNN-CONFORMER NETWORKS FOR SOUND EVENT LOCALIZATION AND DETECTION
Yuxuan Zhang, Shuo Wang, Zihao Li, Kejian Guo, Shijin Chen, Yan Pang
University of Chinese Academy of Sciences
Zhang_UCAS_task3_1 Zhang_UCAS_task3_4 Zhang_UCAS_task3_2 Zhang_UCAS_task3_3
Abstract
In this technical report, we describe our system for DCASE 2021 Task 3: the Sound Event Localization and Detection (SELD) challenge. We introduce the Conformer block into the baseline system to make better use of temporal context information. To expand the official training dataset, we use Audio Channel Swapping (ACS), Speed Perturbation (SP), and Time-Frequency Masking (TFM) as augmentation techniques. In addition, we propose a class-based ensemble method to attain more robust sound event detection (SED) and sound source localization (SSL) estimates for each sound event. Evaluating our best-proposed system on the DCASE 2021 Challenge Task 3 development dataset, we achieve approximately 44% and 37% relative improvements in the SELD scores.
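One concrete ACS transform can be sketched for first-order Ambisonics: assuming the ACN channel order (W, Y, Z, X) used by the challenge's FOA format, rotating the sound scene by +90° in azimuth reduces to a channel swap with a single sign flip, with DOA labels rotated to match. This is one member of the family of swaps; it is an illustration, not the authors' full augmentation set.

```python
import numpy as np

def foa_rotate_90(foa):
    """Azimuth rotation by +90° for FOA audio in ACN order (W, Y, Z, X):
    X' = -Y, Y' = X, W and Z unchanged. DOA labels must be rotated
    accordingly (azimuth += 90°). foa: array of shape (4, samples)."""
    w, y, z, x = foa
    return np.stack([w, x, z, -y])
```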