Task description
The Sound Event Localization and Detection (SELD) task covers methods that detect the temporal onset and offset of sound events, classify each event into one of a known set of sound classes, and localize the events in space while they are active.
There are two tracks: an audio-only track (Track A) for systems that use only audio data to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally use simultaneous perspective video data spatially aligned with the audio data.
The task provides two datasets, development and evaluation. Only the development dataset comes with reference labels. Participants are expected to build and validate their systems on the development dataset and test them on the unseen evaluation dataset.
More details on the task setup and evaluation can be found on the task description page.
Teams ranking
The SELD task received 29 submissions in total from 8 teams: 14 submissions on the audio-only Track A and 15 submissions on the audiovisual Track B. One team participated in both tracks, 4 teams participated only in Track A, and 3 teams participated only in Track B.
The following table includes only the best performing system per submitting team.
Note: Baseline results are not yet included in the table. They will be added soon and do not impact the current rankings.
Note: We found and fixed a bug in the macro-averaging computation of mAP during evaluation. The reported results are based on the patched metric.
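As a rough illustration of what macro-averaging means for the mAP columns in the tables on this page, the sketch below averages per-class AP values with equal weight per class. It assumes, following the description given in one of the technical report abstracts further down, that a class without any predictions contributes an AP of zero. The function name, class names, and AP values are hypothetical, and this is not the official evaluation code.

```python
def macro_average(per_class_ap, all_classes):
    """Average AP over all classes with equal weight.

    Assumption: a class that is missing from `per_class_ap`
    (no predictions were made for it) counts as AP = 0.
    """
    if not all_classes:
        raise ValueError("all_classes must not be empty")
    total = sum(per_class_ap.get(cls, 0.0) for cls in all_classes)
    return total / len(all_classes)


# Hypothetical example: four classes, predictions for only two of them.
classes = ["speech", "door", "music", "footsteps"]
ap = {"speech": 0.4, "door": 0.2}

macro = macro_average(ap, classes)
print(round(macro, 4))  # 0.15
```

Under this assumption, a system that predicts only a few classes well is still penalised for every class it never predicts, which is one reason the macro scores can be much lower than per-class results suggest.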
Track A: Audio-only inference
| Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | Macro mAP (evaluation dataset) | Macro Pearson r (evaluation dataset) |
|---|---|---|---|---|---|---|
| Kong_CUHK_task3a_2 | Qiuqiang Kong | The Chinese University of Hong Kong | Kong_CUHK_task3a_report | 1 | 0.108 | 0.3962 | |
| Jeong_Medisensing_task3a_4 | Seunggyu Jeong | Medisensing | Jeong_Medisensing_task3a_report | 2 | 0.0662 | 0.2765 | |
| Wu_WHU_task3a_1 | Gongping Huang | Wuhan University | Yang_WHUiasp_task3_report | 3 | 0.0091 | 0.3185 | |
| Kim_SAMSUNG_task3a_2 | Gwantae Kim | Samsung Electronics | Kim_SAMSUNG_task3_report | 4 | 0.0215 | 0.1394 | |
| Deka_NA_task3a_1 | Partha Pratim Deka | Not Applicable | Deka_NA_task3a_report | 5 | 0.0001 | 0.0274 | |
Track B: Audiovisual inference
| Submission name | Corresponding author | Affiliation | Technical Report | Team Rank | Macro mAP (evaluation dataset) | Macro Pearson r (evaluation dataset) |
|---|---|---|---|---|---|---|
| Yang_WHUiasp_task3b_2 | Gongping Huang | Wuhan University | Yang_WHUiasp_task3_report | 1 | 0.0131 | 0.4002 | |
| Kim_SAMSUNG_task3b_4 | Gwantae Kim | Samsung Electronics | Kim_SAMSUNG_task3_report | 2 | 0.0218 | 0.242 | |
| Kwon_KIST_task3b_4 | Junhyeong Kwon | Korea Institute of Science and Technology | Kwon_KIST_task3b_report | 3 | 0.0116 | 0.3439 | |
| Olejnik_SRP_task3b_1 | Michal Olejnik | Samsung Research Poland | Olejnik_SRP_task3b_report | 4 | 0.0017 | 0.3432 | |
Systems ranking
Detailed performance of all the submitted systems on the evaluation datasets.
Track A: Audio-only inference
| Submission name | Technical Report | Submission Rank | Macro mAP | Macro AP 25 | Macro AP 50 | Macro AP 75 | Macro Pearson r |
|---|---|---|---|---|---|---|---|
| Kong_CUHK_task3a_2 | Kong_CUHK_task3a_report | 1 | 0.108 | 0.2696 | 0.0543 | 0.0001 | 0.3962 | |
| Kong_CUHK_task3a_3 | Kong_CUHK_task3a_report | 2 | 0.1046 | 0.2654 | 0.0484 | 0.0001 | 0.3837 | |
| Kong_CUHK_task3a_4 | Kong_CUHK_task3a_report | 3 | 0.1041 | 0.2639 | 0.0483 | 0.0001 | 0.385 | |
| Kong_CUHK_task3a_1 | Kong_CUHK_task3a_report | 4 | 0.1045 | 0.2651 | 0.0483 | 0.0001 | 0.3836 | |
| Jeong_Medisensing_task3a_4 | Jeong_Medisensing_task3a_report | 5 | 0.0662 | 0.1658 | 0.0327 | 0.0 | 0.2765 | |
| Jeong_Medisensing_task3a_1 | Jeong_Medisensing_task3a_report | 6 | 0.0283 | 0.0685 | 0.0163 | 0.0 | 0.3834 | |
| Jeong_Medisensing_task3a_2 | Jeong_Medisensing_task3a_report | 7 | 0.0668 | 0.1684 | 0.0319 | 0.0 | 0.2451 | |
| Jeong_Medisensing_task3a_3 | Jeong_Medisensing_task3a_report | 8 | 0.0538 | 0.1364 | 0.025 | 0.0 | 0.2582 | |
| Wu_WHU_task3a_1 | Yang_WHUiasp_task3_report | 9 | 0.0091 | 0.0198 | 0.0072 | 0.0001 | 0.3185 | |
| Kim_SAMSUNG_task3a_2 | Kim_SAMSUNG_task3_report | 10 | 0.0215 | 0.0588 | 0.0058 | 0.0 | 0.1394 | |
| Kim_SAMSUNG_task3a_1 | Kim_SAMSUNG_task3_report | 11 | 0.0206 | 0.0583 | 0.0035 | 0.0 | 0.1163 | |
| Kim_SAMSUNG_task3a_3 | Kim_SAMSUNG_task3_report | 12 | 0.0102 | 0.029 | 0.0015 | 0.0 | 0.1687 | |
| Kim_SAMSUNG_task3a_4 | Kim_SAMSUNG_task3_report | 13 | 0.0162 | 0.0441 | 0.0043 | 0.0 | 0.1131 | |
| Deka_NA_task3a_1 | Deka_NA_task3a_report | 14 | 0.0001 | 0.0002 | 0.0 | 0.0 | 0.0274 | |
Track B: Audiovisual inference
| Submission name | Technical Report | Submission Rank | Macro mAP | Macro AP 25 | Macro AP 50 | Macro AP 75 | Macro Pearson r |
|---|---|---|---|---|---|---|---|
| Yang_WHUiasp_task3b_2 | Yang_WHUiasp_task3_report | 1 | 0.0131 | 0.0272 | 0.012 | 0.0001 | 0.4002 | |
| Yang_WHUiasp_task3b_1 | Yang_WHUiasp_task3_report | 2 | 0.013 | 0.0269 | 0.0118 | 0.0001 | 0.4002 | |
| Yang_WHUiasp_task3b_3 | Yang_WHUiasp_task3_report | 3 | 0.0054 | 0.0106 | 0.0056 | 0.0 | 0.5634 | |
| Kim_SAMSUNG_task3b_4 | Kim_SAMSUNG_task3_report | 4 | 0.0218 | 0.0595 | 0.0058 | 0.0 | 0.242 | |
| Yang_WHUiasp_task3b_4 | Yang_WHUiasp_task3_report | 5 | 0.0033 | 0.0061 | 0.0037 | 0.0 | 0.4914 | |
| Kim_SAMSUNG_task3b_2 | Kim_SAMSUNG_task3_report | 6 | 0.0219 | 0.0598 | 0.0058 | 0.0 | 0.1984 | |
| Kwon_KIST_task3b_4 | Kwon_KIST_task3b_report | 7 | 0.0116 | 0.0222 | 0.0118 | 0.0009 | 0.3439 | |
| Kwon_KIST_task3b_2 | Kwon_KIST_task3b_report | 8 | 0.0107 | 0.0212 | 0.0109 | 0.0001 | 0.3439 | |
| Kim_SAMSUNG_task3b_3 | Kim_SAMSUNG_task3_report | 9 | 0.0142 | 0.0388 | 0.0039 | 0.0 | 0.2116 | |
| Kwon_KIST_task3b_3 | Kwon_KIST_task3b_report | 10 | 0.0118 | 0.0228 | 0.0117 | 0.0008 | 0.3224 | |
| Kwon_KIST_task3b_1 | Kwon_KIST_task3b_report | 11 | 0.0117 | 0.0233 | 0.0116 | 0.0002 | 0.3224 | |
| Kim_SAMSUNG_task3b_1 | Kim_SAMSUNG_task3_report | 12 | 0.0209 | 0.0591 | 0.0035 | 0.0 | 0.1912 | |
| Olejnik_SRP_task3b_1 | Olejnik_SRP_task3b_report | 13 | 0.0017 | 0.0038 | 0.0013 | 0.0 | 0.3432 | |
| Olejnik_SRP_task3b_3 | Olejnik_SRP_task3b_report | 14 | 0.0001 | 0.0004 | 0.0 | 0.0 | 0.2433 | |
| Olejnik_SRP_task3b_2 | Olejnik_SRP_task3b_report | 15 | 0.0002 | 0.0005 | 0.0 | 0.0 | 0.2103 | |
System characteristics
Track A: Audio-only inference
| Rank | Submission name | Technical Report | Model | Model params | Audio format | Acoustic features | Data augmentation | External datasets | Pre-trained models |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Kong_CUHK_task3a_2 | Kong_CUHK_task3a_report | CNN, Transformer, Mask2Former-style decoder, external SED score fusion | | Microphone Array | log mel spectra, spherical latent features, AudioMAE SED prior | spatial audio augmentation, waveform augmentation | AudioSet | AudioMAE |
| 2 | Kong_CUHK_task3a_3 | Kong_CUHK_task3a_report | CNN, Transformer, Mask2Former-style decoder, external SED score fusion | | Microphone Array | log mel spectra, spherical latent features, AudioMAE SED prior | spatial audio augmentation, waveform augmentation | AudioSet | AudioMAE |
| 3 | Kong_CUHK_task3a_4 | Kong_CUHK_task3a_report | CNN, Transformer, Mask2Former-style decoder, external SED score fusion | | Microphone Array | log mel spectra, spherical latent features, AudioMAE SED prior | spatial audio augmentation, waveform augmentation | AudioSet | AudioMAE |
| 4 | Kong_CUHK_task3a_1 | Kong_CUHK_task3a_report | CNN, Transformer, Mask2Former-style decoder, external SED score fusion | | Microphone Array | log mel spectra, spherical latent features, AudioMAE SED prior | spatial audio augmentation, waveform augmentation | AudioSet | AudioMAE |
| 5 | Jeong_Medisensing_task3a_4 | Jeong_Medisensing_task3a_report | PanoFormer, SSAST, PSELDNets, re-ranking | 140000000 | Microphone Array, Ambisonic | UpLAM spatial acoustic maps, intensity vector | | | SSAST, PSELDNets, UpLAM |
| 6 | Jeong_Medisensing_task3a_1 | Jeong_Medisensing_task3a_report | PanoFormer, SSAST, re-ranking | 109000000 | Microphone Array | UpLAM spatial acoustic maps | | | SSAST, UpLAM |
| 7 | Jeong_Medisensing_task3a_2 | Jeong_Medisensing_task3a_report | PanoFormer, SSAST, PSELDNets, re-ranking | 140000000 | Microphone Array, Ambisonic | UpLAM spatial acoustic maps, intensity vector | | | SSAST, PSELDNets, UpLAM |
| 8 | Jeong_Medisensing_task3a_3 | Jeong_Medisensing_task3a_report | PanoFormer, CLAP, PSELDNets, re-ranking | 300000000 | Microphone Array, Ambisonic | UpLAM spatial acoustic maps, intensity vector, CLAP audio embedding | | | SSAST, PSELDNets, CLAP, UpLAM |
| 9 | Wu_WHU_task3a_1 | Yang_WHUiasp_task3_report | CNN, ResNet-50, FPN, RPN, ROI heads, multi-task learning | 46001829 | Microphone Array | log mel spectra, UpLAM spatial acoustic maps | | FSD50K, TAU-SRIR DB | UpLAM |
| 10 | Kim_SAMSUNG_task3a_2 | Kim_SAMSUNG_task3_report | Conformer | 73000000 | Microphone Array | log mel spectra, GCC-PHAT | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | |
| 11 | Kim_SAMSUNG_task3a_1 | Kim_SAMSUNG_task3_report | Conformer | 73000000 | Microphone Array | log mel spectra, GCC-PHAT | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | |
| 12 | Kim_SAMSUNG_task3a_3 | Kim_SAMSUNG_task3_report | Conformer | 73000000 | Microphone Array | log mel spectra, GCC-PHAT | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | |
| 13 | Kim_SAMSUNG_task3a_4 | Kim_SAMSUNG_task3_report | Conformer | 73000000 | Microphone Array | log mel spectra, GCC-PHAT | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | |
| 14 | Deka_NA_task3a_1 | Deka_NA_task3a_report | CNN, MHSA | 189237613 | Ambisonic | NGCC-PHAT, log mel spectra | | | NGCC-PHAT |
Track B: Audiovisual inference
| Rank | Submission name | Technical Report | Model | Model params | Audio format | Acoustic features | Visual features | Data augmentation | External datasets | Pre-trained models |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Yang_WHUiasp_task3b_2 | Yang_WHUiasp_task3_report | Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble | 164518814 | Microphone Array | UpLAM spatial acoustic maps, log mel spectra | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling | | UpLAM, ResNet-50, AV-SELD |
| 2 | Yang_WHUiasp_task3b_1 | Yang_WHUiasp_task3_report | Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble | 164518814 | Microphone Array | UpLAM spatial acoustic maps, log mel spectra | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling | | UpLAM, ResNet-50, AV-SELD |
| 3 | Yang_WHUiasp_task3b_3 | Yang_WHUiasp_task3_report | Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble | 164518814 | Microphone Array | UpLAM spatial acoustic maps, log mel spectra | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling | | UpLAM, ResNet-50, AV-SELD |
| 4 | Kim_SAMSUNG_task3b_4 | Kim_SAMSUNG_task3_report | Conformer, Mask R-CNN | 153000000 | Microphone Array | log mel spectra, GCC-PHAT | ResNet-50 features | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | ResNet-50 |
| 5 | Yang_WHUiasp_task3b_4 | Yang_WHUiasp_task3_report | Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning | 79191593 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling | | UpLAM, ResNet-50 |
| 6 | Kim_SAMSUNG_task3b_2 | Kim_SAMSUNG_task3_report | Conformer, Mask R-CNN | 153000000 | Microphone Array | log mel spectra, GCC-PHAT | ResNet-50 features | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | ResNet-50 |
| 7 | Kwon_KIST_task3b_4 | Kwon_KIST_task3b_report | Mask R-CNN | 43700000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | | | UpLAM, ResNet-50 |
| 8 | Kwon_KIST_task3b_2 | Kwon_KIST_task3b_report | Mask R-CNN | 43700000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | | | UpLAM, ResNet-50 |
| 9 | Kim_SAMSUNG_task3b_3 | Kim_SAMSUNG_task3_report | Conformer, Mask R-CNN | 153000000 | Microphone Array | log mel spectra, GCC-PHAT | ResNet-50 features | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | ResNet-50 |
| 10 | Kwon_KIST_task3b_3 | Kwon_KIST_task3b_report | Mask R-CNN | 43700000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | | | UpLAM, ResNet-50 |
| 11 | Kwon_KIST_task3b_1 | Kwon_KIST_task3b_report | Mask R-CNN | 43700000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | | | UpLAM, ResNet-50 |
| 12 | Kim_SAMSUNG_task3b_1 | Kim_SAMSUNG_task3_report | Conformer, Mask R-CNN | 153000000 | Microphone Array | log mel spectra, GCC-PHAT | ResNet-50 features | audio channel swapping, SpecMix | FSD50K, TAU-SRIR DB | ResNet-50 |
| 13 | Olejnik_SRP_task3b_1 | Olejnik_SRP_task3b_report | Mask R-CNN, CNN, FPN, RPN | 44900000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling | FSD50K, FMA, SpatialScaper RIRs, VisualGenome, Gibson Database of 3D Spaces | UpLAM, ResNet-50 |
| 14 | Olejnik_SRP_task3b_3 | Olejnik_SRP_task3b_report | Mask R-CNN, CNN, FPN, RPN | 44900000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling | | UpLAM, ResNet-50 |
| 15 | Olejnik_SRP_task3b_2 | Olejnik_SRP_task3b_report | Mask R-CNN, CNN, FPN, RPN | 44900000 | Microphone Array | UpLAM spatial acoustic maps | ResNet-50 features | azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling | | UpLAM, ResNet-50 |
Technical reports
Occupancy-Based Semantic Acoustic Imaging using NGCC-PHAT and Mel-Spectrogram Feature Fusion
Partha Pratim Deka1
1Not Applicable
Abstract
Semantic Acoustic Imaging (SAI) aims to reconstruct spatially resolved acoustic representations from low-channel microphone recordings, enabling simultaneous sound event localization and semantic understanding. The DCASE 2026 SAI-SELD challenge introduces a particularly challenging setting where high-resolution acoustic maps must be inferred from four-channel spatial audio. In this work, we propose an audio-only framework that combines Neural Generalized Cross Correlation with Phase Transform (NGCC-PHAT) features and Mel-spectrogram representations to jointly capture spatial and spectral characteristics of acoustic scenes. The NGCC-PHAT branch provides robust inter-channel localization cues, while the Mel-spectrogram branch captures event-specific spectral information. These complementary representations are fused and processed using a transformer-based prediction network to generate class-wise acoustic localization maps for the target sound events. During inference, localization peaks are converted into challenge-compliant acoustic map representations for evaluation. Experimental results on the STARSS23 development dataset demonstrate the feasibility of combining spatial correlation features and spectral representations for audio-only semantic acoustic imaging. The proposed framework demonstrates an occupancy-based formulation for semantic acoustic imaging from audio-only observations.
AN EQUIRECTANGULAR ENERGY FIELD WITH PER-LOCATION CLASS COVERAGE FOR SEMANTIC ACOUSTIC IMAGING
Seunggyu Jeong1,2, Seong-Eun Kim1,2
1Medisensing, Seoul, Korea, 2Seoul National University of Science and Technology, Seoul, Korea
Jeong_Medisensing_task3a_1 Jeong_Medisensing_task3a_2 Jeong_Medisensing_task3a_3 Jeong_Medisensing_task3a_4
Abstract
We describe our submission to DCASE 2026 Challenge Task 3, semantic acoustic imaging for sound event localization and detection on the audio-only track. For each frame, the task asks for a set of acoustic-image instances, each a soft energy footprint on the equirectangular sphere together with a sound class, scored by a mask mean average precision (mask mAP). We factor the problem into a localization stage that predicts a dense energy field and a classification stage that labels each extracted instance. The localization stage uses a distilled acoustic-imaging front-end followed by PanoFormer, an equirectangular transformer whose circular padding removes the seam artefacts that a planar decoder produces at the azimuth wrap. Because the official MACRO average counts a class with no predictions as zero, the classification stage assigns a class to every instance from the direction of arrival of a pretrained spatial network, which covers all thirteen classes. A temporal-persistence re-ranking sets the detection confidences. The primary system reaches a development MACRO mask mAP of 0.0515, against 0.0003 for the official baseline. We submit four systems that differ in the classification stage.
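The circular padding mentioned in this abstract can be pictured with a small generic sketch: on an equirectangular grid the leftmost and rightmost columns are neighbours in azimuth, so padding wraps columns around instead of filling with zeros. This is only an illustration of the general idea, not the PanoFormer implementation; the function name and the toy grid are hypothetical.

```python
def circular_pad_azimuth(grid, pad):
    """Pad each row of an equirectangular grid by wrapping its columns.

    `grid` is a list of rows (elevation) of values (azimuth).
    The last `pad` columns are copied to the left edge and the
    first `pad` columns to the right edge, so a filter sliding
    over the padded grid sees a continuous azimuth axis.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    padded = []
    for row in grid:
        if pad > len(row):
            raise ValueError("pad must not exceed the row width")
        left = list(row[len(row) - pad:])   # columns that wrap to the left edge
        right = list(row[:pad])             # columns that wrap to the right edge
        padded.append(left + list(row) + right)
    return padded


# Toy 2 x 4 grid (hypothetical values).
grid = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
print(circular_pad_azimuth(grid, 1))
# [[4, 1, 2, 3, 4, 1], [8, 5, 6, 7, 8, 5]]
```

Elevation is deliberately left unpadded here, because the top and bottom rows of an equirectangular grid are the poles and do not wrap onto each other.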
SEMANTIC ACOUSTIC IMAGING VIA SELD-STYLE DOA REGRESSION AND A PER-CLASS AUDIOVISUAL ENSEMBLE
Gwantae Kim1
1Samsung Electronics
Kim_SAMSUNG_task3a_1 Kim_SAMSUNG_task3a_2 Kim_SAMSUNG_task3a_3 Kim_SAMSUNG_task3a_4 Kim_SAMSUNG_task3b_1 Kim_SAMSUNG_task3b_2 Kim_SAMSUNG_task3b_3 Kim_SAMSUNG_task3b_4
Abstract
Semantic acoustic imaging aims to estimate class-conditioned spatial energy distributions from audio or audiovisual observations. This report describes our submission to DCASE 2026 Task 3, where systems predict dense acoustic-energy maps on the sphere from either 4-channel spatial audio (Track A) or paired audio and 360° video (Track B). Directly regressing dense per-class maps from recordings is difficult because the targets are spatially sparse, the losses are dominated by background regions, and real labeled recordings are limited. We therefore decompose the problem into sound event localization and detection (SELD) followed by a deterministic renderer that converts direction estimates into dense spherical-Gaussian energy fields. The audio model is trained on the provided real recordings and synthetic recordings, with 8-pattern audio-channel swapping (ACS) and SpecMix mixup used to improve spatial robustness. At inference, we explore complementary strategies across the submitted systems: 8-view inverse-aligned ACS test-time ensembling, a multi-checkpoint cluster ensemble, and per-class audiovisual model selection. On the held-out development-test split, the audio system achieves 0.0234 mask mAP / 0.358 Pearson r, and the per-class audiovisual ensemble reaches 0.0332 / 0.446 (+42% mAP, +25% Pearson).
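To make the idea of a deterministic renderer more concrete, the following sketch turns a single direction estimate into a dense energy map on an equirectangular grid, with energy decaying as a Gaussian of the great-circle angle to the source. Everything specific here is an assumption for illustration: the grid size, the width parameter `sigma_deg`, the azimuth and elevation conventions, and the function name are not taken from the report.

```python
import math


def render_spherical_gaussian(az_deg, el_deg, height=18, width=36, sigma_deg=10.0):
    """Render one direction estimate as an energy map on the sphere.

    Assumed conventions (hypothetical): rows run from elevation +90
    (top) to -90 (bottom), columns from azimuth -180 to +180, and
    each pixel is evaluated at its centre.
    """
    az0, el0 = math.radians(az_deg), math.radians(el_deg)
    sigma = math.radians(sigma_deg)
    energy = []
    for i in range(height):
        el = math.radians(90.0 - (i + 0.5) * 180.0 / height)
        row = []
        for j in range(width):
            az = math.radians(-180.0 + (j + 0.5) * 360.0 / width)
            # Cosine of the great-circle angle between pixel and source.
            cos_angle = (math.sin(el) * math.sin(el0)
                         + math.cos(el) * math.cos(el0) * math.cos(az - az0))
            angle = math.acos(max(-1.0, min(1.0, cos_angle)))
            row.append(math.exp(-0.5 * (angle / sigma) ** 2))
        energy.append(row)
    return energy


# A source close to the azimuth seam: energy spills over to the other edge.
energy_map = render_spherical_gaussian(az_deg=175.0, el_deg=5.0)
print(len(energy_map), len(energy_map[0]))  # 18 36
```

Because the falloff depends on the angular distance on the sphere rather than on pixel distance, a source near azimuth ±180 degrees produces energy on both edges of the map, and blobs near the poles stretch horizontally as expected for this projection.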
AUDIO-ONLY SEMANTIC ACOUSTIC IMAGING WITH RECOGNITION-PRIOR SCORE FUSION FOR DCASE2026 TASK 3
Runbang Wang1, Zining Liang2, Yin Cao3, Qiuqiang Kong2
1Nanjing University, China, 2The Chinese University of Hong Kong, Hong Kong SAR, China, 3Institute of Acoustics, Chinese Academy of Sciences, China
Kong_CUHK_task3a_1 Kong_CUHK_task3a_2 Kong_CUHK_task3a_3 Kong_CUHK_task3a_4
Abstract
Semantic acoustic imaging predicts class-labeled acoustic regions from microphone-array recordings, producing a spherical map of where sound events appear in a scene. The audio-only setting creates a difficult input-output mismatch: region boundaries are not observed in the waveform, and the event class, acoustic extent, and confidence of each prediction must be estimated from multichannel acoustic cues. We present a system for DCASE2026 Task 3 that first forms spherical acoustic evidence from raw MIC-format audio and then decodes this evidence into mask candidates with event classes and detector scores. A separate AudioMAE-based recognition prior estimates class activity for two-second windows and aligns the probabilities to detector frames. The final fusion stage uses this prior to re-rank the detector candidates while preserving their masks. On the full-recording development test set, the system obtains 0.1017 mAP, 0.2378 AP50, and 0.7904 Pearson r. For submission, prediction compression reduces the maximum JSON size from 148.98 MB to 19.03 MB, with mAP changing from 0.1017 to 0.1009.
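The re-ranking step described above, where a recognition prior changes candidate scores but leaves the masks untouched, might look roughly like the sketch below. The fusion rule (detector score multiplied by the class prior raised to a power `alpha`), the field names, and the example values are all hypothetical; the abstract does not specify how the scores are combined.

```python
def rerank_with_prior(candidates, class_prior, alpha=1.0):
    """Re-rank detector candidates using a per-class activity prior.

    Each candidate is a dict with keys "cls", "score" and "mask".
    Only the score is changed; the mask is carried over as-is.
    Assumed fusion rule: fused = score * prior ** alpha.
    """
    reranked = []
    for cand in candidates:
        prior = class_prior.get(cand["cls"], 0.0)
        fused = cand["score"] * (prior ** alpha)
        reranked.append({**cand, "score": fused})
    # Highest fused score first.
    reranked.sort(key=lambda c: c["score"], reverse=True)
    return reranked


# Hypothetical candidates for one frame; masks are placeholders.
candidates = [
    {"cls": "speech", "score": 0.6, "mask": "mask_0"},
    {"cls": "door", "score": 0.5, "mask": "mask_1"},
]
prior = {"speech": 0.2, "door": 0.9}

for cand in rerank_with_prior(candidates, prior):
    print(cand["cls"], round(cand["score"], 3), cand["mask"])
```

In this toy case the "door" candidate overtakes "speech" because the prior considers door activity far more likely in that window, while both candidates keep their original masks.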
DECISION-CALIBRATED SEMANTIC ACOUSTIC IMAGING FOR SOUND EVENT LOCALIZATION AND DETECTION
Junhyeong Kwon1, Jongsuk Choi1
1Korea Institute of Science and Technology, Seoul, Korea
Kwon_KIST_task3b_1 Kwon_KIST_task3b_2 Kwon_KIST_task3b_3 Kwon_KIST_task3b_4
Abstract
This report describes our system submitted to the audiovisual track of Task 3 of the DCASE 2026 Challenge on Semantic Acoustic Imaging for SELD. Building on the official UpLAM + Mask R-CNN baseline, we observe that it reconstructs the spatial energy field reasonably well, reaching an audiovisual Pearson correlation of about 0.44 on the development set, yet attains a near-zero detection mAP. We diagnose this as a decision and calibration failure rather than a representation failure: a single global confidence threshold cannot simultaneously retain under-confident true positives and suppress background false positives, and the rank-based scorer punishes both. Our system therefore runs the detector at a very low score threshold to surface the under-confident true positives; re-ranks the detections with a learned matchability model, a post-hoc calibration layer motivated by conformal and risk-controlled prediction that predicts whether each detection will match the ground truth; and emits a grid-sampled energy mask aligned with the challenge soft-IoU scorer. On the STAIRS26 development set this raises macro mAP from the official baseline's 0.0003 to 0.021 and EFRQ from 0.136 to 0.33-0.36. Absolute scores remain low (the task is hard for every system), but the gain over the baseline is large and consistent. We submit a four-system hedge portfolio that crosses matchability and raw ranking with cap-1 and cap-3 selection, and report development numbers transparently, including the recall ceiling imposed by the 4-channel spatial resolution.
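For readers unfamiliar with soft IoU between energy masks, the sketch below shows one common formulation: the sum of element-wise minima divided by the sum of element-wise maxima. This definition is an assumption used for illustration; the exact formula used by the challenge scorer is not given here, and the function name and example masks are hypothetical.

```python
def soft_iou(pred, ref):
    """Soft IoU between two non-negative masks of equal shape.

    Assumed definition: sum(min(p, r)) / sum(max(p, r)).
    Returns 0.0 when both masks are entirely zero.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of rows")
    intersection = 0.0
    union = 0.0
    for p_row, r_row in zip(pred, ref):
        if len(p_row) != len(r_row):
            raise ValueError("masks must have the same number of columns")
        for p, r in zip(p_row, r_row):
            intersection += min(p, r)
            union += max(p, r)
    return intersection / union if union > 0 else 0.0


# Hypothetical 2 x 2 energy masks.
pred = [[1.0, 0.5],
        [0.0, 0.0]]
ref = [[0.5, 0.5],
       [0.5, 0.0]]
print(soft_iou(pred, ref))  # 0.5
```

With binary masks this reduces to the ordinary intersection over union, which is why it is a natural way to score soft energy footprints against thresholds such as the AP 25, AP 50, and AP 75 columns.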
SRP 16-BAND UPLAM WITH MASK R-CNN FOR AUDIOVISUAL SELD
Rafał Foltyniewicz1, Rafał Kaczmarek1, Michał Olejnik1, Iwan Ryzenkow1, Bogdan Jastrzębski1
1Samsung Research Poland
Olejnik_SRP_task3b_1 Olejnik_SRP_task3b_2 Olejnik_SRP_task3b_3
Abstract
We present an audiovisual sound event localization and detection system based on a modified Mask R-CNN with a ResNet-50 Feature Pyramid Network backbone, taking 19-channel inputs composed of 3 RGB video channels and 16 UpLAM acoustic feature maps extracted from the tetrahedral microphone array. The model is trained with progressive backbone unfreezing, class-balanced sampling with inverse-frequency weights (cap=30), focal loss, and data augmentation including azimuth rotation, horizontal flip, RGB dropout, and audio band masking.
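The class-balanced sampling with capped inverse-frequency weights mentioned in this abstract can be sketched as follows. The normalisation is an assumption: here each class weight is the count of the most frequent class divided by the count of that class, clipped at the cap. Only the cap value of 30 comes from the abstract; the function name and label counts are hypothetical.

```python
from collections import Counter


def capped_inverse_frequency_weights(labels, cap=30.0):
    """Per-class sampling weights proportional to inverse frequency.

    Assumed normalisation: the most frequent class gets weight 1.0,
    rarer classes get larger weights, and no weight exceeds `cap`.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    most_common = max(counts.values())
    return {cls: min(most_common / n, cap) for cls, n in counts.items()}


# Hypothetical, heavily imbalanced label list.
labels = ["speech"] * 100 + ["door"] * 10 + ["bell"] * 1
print(capped_inverse_frequency_weights(labels))
# {'speech': 1.0, 'door': 10.0, 'bell': 30.0}
```

Without the cap, the single "bell" example would be drawn 100 times as often as a speech example; clipping the weight limits how strongly a handful of rare samples can dominate training batches.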
ENHANCED UPLAM-MASK R-CNN SYSTEM FOR SEMANTIC ACOUSTIC IMAGING AND SOUND EVENT LOCALIZATION
Yishuo Yang1, Xinwei Wu1, Chunrui Zhao1, Huayang Wang1, Zheng Wen1, Jilu Jin2, Gongping Huang1
1School of Electronic Information, Wuhan University, Wuhan, China, 2CIAIC, Northwestern Polytechnical University, Xi'an, China
Wu_WHU_task3a_1 Yang_WHUiasp_task3b_1 Yang_WHUiasp_task3b_2 Yang_WHUiasp_task3b_3 Yang_WHUiasp_task3b_4
Abstract
This report presents our UpLAM-Mask R-CNN system for DCASE 2026 Task 3: semantic acoustic imaging (SAI) for sound event localization and detection (SELD). The task requires detecting sound event categories while estimating their spatial acoustic energy distributions and source distances. Based on the official UpLAM Mask R-CNN baseline, we develop unified frameworks for both the audio-only and audio-visual tracks. Specifically, we introduce modality-adaptive input switching for Track A and Track B, optimize the UpLAM-based acoustic feature extraction pipeline, preserve acoustic energy magnitude cues throughout the model, and enhance RoI-level energy-map prediction, full-frame energy decoding, and distance regression. During inference, temporal tracking, confidence-based filtering, and region-aware energy-map export are further applied to improve the stability and completeness of the predictions.