Task description

The Sound Event Localization and Detection (SELD) task deals with methods that detect the temporal onset and offset of sound events when active, classify the type of the event from a known set of sound classes, and further localize the events in space when active.

There are two tracks: an audio-only track (Track A) for systems that use only audio data to estimate the SELD labels, and an audiovisual track (Track B) for systems that additionally use simultaneous perspective video data spatially aligned with the audio data.

The task provides two datasets, development and evaluation, and only the development dataset includes reference labels. Participants are expected to build and validate their systems on the development dataset and test them on the unseen evaluation dataset.

More details on the task setup and evaluation can be found on the task description page.

Teams ranking

The SELD task received 29 submissions in total from 8 teams: 14 submissions on the audio-only Track A and 15 submissions on the audiovisual Track B. One team participated in both tracks, 4 teams participated only in Track A, and 3 teams participated only in Track B.

The following table includes only the best performing system per submitting team.

Note: Baseline results are not yet included in the table. They will be added soon and do not impact the current rankings.

Note: We found and fixed a bug in the macro-averaging computation of mAP during evaluation. The reported results are based on the patched metric.
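For readers unfamiliar with macro averaging, the following is a minimal sketch and not the official evaluation code. It assumes per-class AP values are already available and, as one of the technical reports on this page states for the official metric, that a class with no predictions contributes zero. The function name and the class names are hypothetical.

```python
def macro_average(per_class_scores, class_names):
    """Unweighted mean over ALL classes in class_names.

    per_class_scores maps class name -> score. Classes that are missing
    (for example, because the system made no predictions for them) are
    counted as 0.0, so they pull the macro average down.
    """
    if not class_names:
        raise ValueError("class_names must not be empty")
    total = sum(per_class_scores.get(name, 0.0) for name in class_names)
    return total / len(class_names)


# Hypothetical example: two of four classes have predictions.
classes = ["speech", "door", "music", "footsteps"]
ap = {"speech": 0.2, "door": 0.1}
print(macro_average(ap, classes))
```

Under this convention a system that covers only a few classes is penalised even if it is accurate on those classes.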

Track A: Audio-only inference

Rank	Submission Information				Evaluation Dataset
Rank	Submission name	Corresponding author	Affiliation	Technical Report	Team Rank	Macro mAP	Macro Pearson r
	Kong_CUHK_task3a_2	Qiuqiang Kong	The Chinese University of Hong Kong	Kong_CUHK_task3a_report	1	0.108	0.3962
	Jeong_Medisensing_task3a_4	Seunggyu Jeong	Medisensing	Jeong_Medisensing_task3a_report	2	0.0662	0.2765
	Wu_WHU_task3a_1	Gongping Huang	Wuhan University	Yang_WHUiasp_task3_report	3	0.0091	0.3185
	Kim_SAMSUNG_task3a_2	Gwantae Kim	Samsung Electronics	Kim_SAMSUNG_task3_report	4	0.0215	0.1394
	Deka_NA_task3a_1	Partha Pratim Deka	Not Applicable	Deka_NA_task3a_report	5	0.0001	0.0274

Track B: Audiovisual inference

Rank	Submission Information				Evaluation Dataset
Rank	Submission name	Corresponding author	Affiliation	Technical Report	Team Rank	Macro mAP	Macro Pearson r
	Yang_WHUiasp_task3b_2	Gongping Huang	Wuhan University	Yang_WHUiasp_task3_report	1	0.0131	0.4002
	Kim_SAMSUNG_task3b_4	Gwantae Kim	Samsung Electronics	Kim_SAMSUNG_task3_report	2	0.0218	0.242
	Kwon_KIST_task3b_4	Junhyeong Kwon	Korea Institute of Science and Technology	Kwon_KIST_task3b_report	3	0.0116	0.3439
	Olejnik_SRP_task3b_1	Michal Olejnik	Samsung Research Poland	Olejnik_SRP_task3b_report	4	0.0017	0.3432

Systems ranking

Detailed performance of all the submitted systems on the evaluation datasets.

Track A: Audio-only inference

Rank	Submission Information		Evaluation Dataset
Rank	Submission name	Technical Report	Submission Rank	Macro mAP	Macro AP 25	Macro AP 50	Macro AP 75	Macro Pearson r
	Kong_CUHK_task3a_2	Kong_CUHK_task3a_report	1	0.108	0.2696	0.0543	0.0001	0.3962
	Kong_CUHK_task3a_3	Kong_CUHK_task3a_report	2	0.1046	0.2654	0.0484	0.0001	0.3837
	Kong_CUHK_task3a_4	Kong_CUHK_task3a_report	3	0.1041	0.2639	0.0483	0.0001	0.385
	Kong_CUHK_task3a_1	Kong_CUHK_task3a_report	4	0.1045	0.2651	0.0483	0.0001	0.3836
	Jeong_Medisensing_task3a_4	Jeong_Medisensing_task3a_report	5	0.0662	0.1658	0.0327	0.0	0.2765
	Jeong_Medisensing_task3a_1	Jeong_Medisensing_task3a_report	6	0.0283	0.0685	0.0163	0.0	0.3834
	Jeong_Medisensing_task3a_2	Jeong_Medisensing_task3a_report	7	0.0668	0.1684	0.0319	0.0	0.2451
	Jeong_Medisensing_task3a_3	Jeong_Medisensing_task3a_report	8	0.0538	0.1364	0.025	0.0	0.2582
	Wu_WHU_task3a_1	Yang_WHUiasp_task3_report	9	0.0091	0.0198	0.0072	0.0001	0.3185
	Kim_SAMSUNG_task3a_2	Kim_SAMSUNG_task3_report	10	0.0215	0.0588	0.0058	0.0	0.1394
	Kim_SAMSUNG_task3a_1	Kim_SAMSUNG_task3_report	11	0.0206	0.0583	0.0035	0.0	0.1163
	Kim_SAMSUNG_task3a_3	Kim_SAMSUNG_task3_report	12	0.0102	0.029	0.0015	0.0	0.1687
	Kim_SAMSUNG_task3a_4	Kim_SAMSUNG_task3_report	13	0.0162	0.0441	0.0043	0.0	0.1131
	Deka_NA_task3a_1	Deka_NA_task3a_report	14	0.0001	0.0002	0.0	0.0	0.0274

Track B: Audiovisual inference

Rank	Submission Information		Evaluation Dataset
Rank	Submission name	Technical Report	Submission Rank	Macro mAP	Macro AP 25	Macro AP 50	Macro AP 75	Macro Pearson r
	Yang_WHUiasp_task3b_2	Yang_WHUiasp_task3_report	1	0.0131	0.0272	0.012	0.0001	0.4002
	Yang_WHUiasp_task3b_1	Yang_WHUiasp_task3_report	2	0.013	0.0269	0.0118	0.0001	0.4002
	Yang_WHUiasp_task3b_3	Yang_WHUiasp_task3_report	3	0.0054	0.0106	0.0056	0.0	0.5634
	Kim_SAMSUNG_task3b_4	Kim_SAMSUNG_task3_report	4	0.0218	0.0595	0.0058	0.0	0.242
	Yang_WHUiasp_task3b_4	Yang_WHUiasp_task3_report	5	0.0033	0.0061	0.0037	0.0	0.4914
	Kim_SAMSUNG_task3b_2	Kim_SAMSUNG_task3_report	6	0.0219	0.0598	0.0058	0.0	0.1984
	Kwon_KIST_task3b_4	Kwon_KIST_task3b_report	7	0.0116	0.0222	0.0118	0.0009	0.3439
	Kwon_KIST_task3b_2	Kwon_KIST_task3b_report	8	0.0107	0.0212	0.0109	0.0001	0.3439
	Kim_SAMSUNG_task3b_3	Kim_SAMSUNG_task3_report	9	0.0142	0.0388	0.0039	0.0	0.2116
	Kwon_KIST_task3b_3	Kwon_KIST_task3b_report	10	0.0118	0.0228	0.0117	0.0008	0.3224
	Kwon_KIST_task3b_1	Kwon_KIST_task3b_report	11	0.0117	0.0233	0.0116	0.0002	0.3224
	Kim_SAMSUNG_task3b_1	Kim_SAMSUNG_task3_report	12	0.0209	0.0591	0.0035	0.0	0.1912
	Olejnik_SRP_task3b_1	Olejnik_SRP_task3b_report	13	0.0017	0.0038	0.0013	0.0	0.3432
	Olejnik_SRP_task3b_3	Olejnik_SRP_task3b_report	14	0.0001	0.0004	0.0	0.0	0.2433
	Olejnik_SRP_task3b_2	Olejnik_SRP_task3b_report	15	0.0002	0.0005	0.0	0.0	0.2103
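
A consistency check on the two tables above: the reported Macro mAP values agree, to within rounding, with the mean of the Macro AP 25, Macro AP 50, and Macro AP 75 columns. This is an observation from the reported numbers, not a statement of the official metric definition. The sketch below checks a few rows copied from the tables; the function names are hypothetical.

```python
# (submission, Macro mAP, Macro AP 25, Macro AP 50, Macro AP 75),
# copied from the systems ranking tables.
ROWS = [
    ("Kong_CUHK_task3a_2", 0.108, 0.2696, 0.0543, 0.0001),
    ("Jeong_Medisensing_task3a_4", 0.0662, 0.1658, 0.0327, 0.0),
    ("Kim_SAMSUNG_task3a_2", 0.0215, 0.0588, 0.0058, 0.0),
    ("Yang_WHUiasp_task3b_2", 0.0131, 0.0272, 0.012, 0.0001),
    ("Kwon_KIST_task3b_4", 0.0116, 0.0222, 0.0118, 0.0009),
]


def mean_ap(ap25, ap50, ap75):
    """Mean of the three threshold-specific AP values."""
    return (ap25 + ap50 + ap75) / 3.0


def max_deviation(rows):
    """Largest absolute gap between reported mAP and the three-way mean."""
    return max(abs(m - mean_ap(a, b, c)) for _, m, a, b, c in rows)


# Gaps come from the four-decimal rounding of the published values.
print(max_deviation(ROWS))
```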

System characteristics

Track A: Audio-only inference

Rank	Submission name	Technical Report	Model	Model params	Audio format	Acoustic features	Data augmentation	External datasets	Pre-trained models
1	Kong_CUHK_task3a_2	Kong_CUHK_task3a_report	CNN, Transformer, Mask2Former-style decoder, external SED score fusion		Microphone Array	log mel spectra, spherical latent features, AudioMAE SED prior	spatial audio augmentation, waveform augmentation	AudioSet	AudioMAE
2	Kong_CUHK_task3a_3	Kong_CUHK_task3a_report	CNN, Transformer, Mask2Former-style decoder, external SED score fusion		Microphone Array	log mel spectra, spherical latent features, AudioMAE SED prior	spatial audio augmentation, waveform augmentation	AudioSet	AudioMAE
3	Kong_CUHK_task3a_4	Kong_CUHK_task3a_report	CNN, Transformer, Mask2Former-style decoder, external SED score fusion		Microphone Array	log mel spectra, spherical latent features, AudioMAE SED prior	spatial audio augmentation, waveform augmentation	AudioSet	AudioMAE
4	Kong_CUHK_task3a_1	Kong_CUHK_task3a_report	CNN, Transformer, Mask2Former-style decoder, external SED score fusion		Microphone Array	log mel spectra, spherical latent features, AudioMAE SED prior	spatial audio augmentation, waveform augmentation	AudioSet	AudioMAE
5	Jeong_Medisensing_task3a_4	Jeong_Medisensing_task3a_report	PanoFormer, SSAST, PSELDNets, re-ranking	140000000	Microphone Array, Ambisonic	UpLAM spatial acoustic maps, intensity vector			SSAST, PSELDNets, UpLAM
6	Jeong_Medisensing_task3a_1	Jeong_Medisensing_task3a_report	PanoFormer, SSAST, re-ranking	109000000	Microphone Array	UpLAM spatial acoustic maps			SSAST, UpLAM
7	Jeong_Medisensing_task3a_2	Jeong_Medisensing_task3a_report	PanoFormer, SSAST, PSELDNets, re-ranking	140000000	Microphone Array, Ambisonic	UpLAM spatial acoustic maps, intensity vector			SSAST, PSELDNets, UpLAM
8	Jeong_Medisensing_task3a_3	Jeong_Medisensing_task3a_report	PanoFormer, CLAP, PSELDNets, re-ranking	300000000	Microphone Array, Ambisonic	UpLAM spatial acoustic maps, intensity vector, CLAP audio embedding			SSAST, PSELDNets, CLAP, UpLAM
9	Wu_WHU_task3a_1	Yang_WHUiasp_task3_report	CNN, ResNet-50, FPN, RPN, ROI heads, multi-task learning	46001829	Microphone Array	log mel spectra, UpLAM spatial acoustic maps		FSD50K, TAU-SRIR DB	UpLAM
10	Kim_SAMSUNG_task3a_2	Kim_SAMSUNG_task3_report	Conformer	73000000	Microphone Array	log mel spectra, GCC-PHAT	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB
11	Kim_SAMSUNG_task3a_1	Kim_SAMSUNG_task3_report	Conformer	73000000	Microphone Array	log mel spectra, GCC-PHAT	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB
12	Kim_SAMSUNG_task3a_3	Kim_SAMSUNG_task3_report	Conformer	73000000	Microphone Array	log mel spectra, GCC-PHAT	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB
13	Kim_SAMSUNG_task3a_4	Kim_SAMSUNG_task3_report	Conformer	73000000	Microphone Array	log mel spectra, GCC-PHAT	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB
14	Deka_NA_task3a_1	Deka_NA_task3a_report	CNN, MHSA	189237613	Ambisonic	NGCC-PHAT, log mel spectra			NGCC-PHAT

Track B: Audiovisual inference

Rank	Submission name	Technical Report	Model	Model params	Audio format	Acoustic features	Visual features	Data augmentation	External datasets	Pre-trained models
1	Yang_WHUiasp_task3b_2	Yang_WHUiasp_task3_report	Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble	164518814	Microphone Array	UpLAM spatial acoustic maps, log mel spectra	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling		UpLAM, ResNet-50, AV-SELD
2	Yang_WHUiasp_task3b_1	Yang_WHUiasp_task3_report	Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble	164518814	Microphone Array	UpLAM spatial acoustic maps, log mel spectra	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling		UpLAM, ResNet-50, AV-SELD
3	Yang_WHUiasp_task3b_3	Yang_WHUiasp_task3_report	Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning, ensemble	164518814	Microphone Array	UpLAM spatial acoustic maps, log mel spectra	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling		UpLAM, ResNet-50, AV-SELD
4	Kim_SAMSUNG_task3b_4	Kim_SAMSUNG_task3_report	Conformer, Mask R-CNN	153000000	Microphone Array	log mel spectra, GCC-PHAT	ResNet-50 features	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB	ResNet-50
5	Yang_WHUiasp_task3b_4	Yang_WHUiasp_task3_report	Mask R-CNN, FPN, Conformer, ROI heads, multi-task learning	79191593	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling, audio Gaussian noise, class-balanced sampling		UpLAM, ResNet-50
6	Kim_SAMSUNG_task3b_2	Kim_SAMSUNG_task3_report	Conformer, Mask R-CNN	153000000	Microphone Array	log mel spectra, GCC-PHAT	ResNet-50 features	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB	ResNet-50
7	Kwon_KIST_task3b_4	Kwon_KIST_task3b_report	Mask R-CNN	43700000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features			UpLAM, ResNet-50
8	Kwon_KIST_task3b_2	Kwon_KIST_task3b_report	Mask R-CNN	43700000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features			UpLAM, ResNet-50
9	Kim_SAMSUNG_task3b_3	Kim_SAMSUNG_task3_report	Conformer, Mask R-CNN	153000000	Microphone Array	log mel spectra, GCC-PHAT	ResNet-50 features	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB	ResNet-50
10	Kwon_KIST_task3b_3	Kwon_KIST_task3b_report	Mask R-CNN	43700000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features			UpLAM, ResNet-50
11	Kwon_KIST_task3b_1	Kwon_KIST_task3b_report	Mask R-CNN	43700000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features			UpLAM, ResNet-50
12	Kim_SAMSUNG_task3b_1	Kim_SAMSUNG_task3_report	Conformer, Mask R-CNN	153000000	Microphone Array	log mel spectra, GCC-PHAT	ResNet-50 features	audio channel swapping, SpecMix	FSD50K, TAU-SRIR DB	ResNet-50
13	Olejnik_SRP_task3b_1	Olejnik_SRP_task3b_report	Mask R-CNN, CNN, FPN, RPN	44900000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling	FSD50K, FMA, SpatialScaper RIRs, VisualGenome, Gibson Database of 3D Spaces	UpLAM, ResNet-50
14	Olejnik_SRP_task3b_3	Olejnik_SRP_task3b_report	Mask R-CNN, CNN, FPN, RPN	44900000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling		UpLAM, ResNet-50
15	Olejnik_SRP_task3b_2	Olejnik_SRP_task3b_report	Mask R-CNN, CNN, FPN, RPN	44900000	Microphone Array	UpLAM spatial acoustic maps	ResNet-50 features	azimuth rotation, horizontal flip, RGB jitter, audio band masking, audio intensity scaling		UpLAM, ResNet-50

Technical reports

Occupancy-Based Semantic Acoustic Imaging using NGCC-PHAT and Mel-Spectrogram Feature Fusion

Partha Pratim Deka¹

¹Not Applicable

Deka_NA_task3a_1

PDF Code

Abstract

Semantic Acoustic Imaging (SAI) aims to reconstruct spatially-resolved acoustic representations from low channel microphone recordings, enabling simultaneous sound event localization and semantic understanding. The DCASE 2026 SAI-SELD challenge introduces a particularly challenging setting where high-resolution acoustic maps must be inferred from four-channel spatial audio. In this work, we propose an audio-only framework that combines Neural Generalized Cross Correlation with Phase Transform (NGCC-PHAT) features and Mel-spectrogram representations to jointly capture spatial and spectral characteristics of acoustic scenes. The NGCC-PHAT branch provides robust inter-channel localization cues, while the Mel-spectrogram branch captures event-specific spectral information. These complementary representations are fused and processed using a transformer-based prediction network to generate class-wise acoustic localization maps for the target sound events. During inference, localization peaks are converted into challenge-compliant acoustic map representations for evaluation. Experimental results on the STARSS23 development dataset demonstrate the feasibility of combining spatial correlation features and spectral representations for audio-only semantic acoustic imaging. The proposed framework demonstrates an occupancy-based formulation for semantic acoustic imaging from audio-only observations.

AN EQUIRECTANGULAR ENERGY FIELD WITH PER-LOCATION CLASS COVERAGE FOR SEMANTIC ACOUSTIC IMAGING

Seunggyu Jeong¹,², Seong-Eun Kim¹,²

¹Medisensing, Seoul, Korea, ²Seoul National University of Science and Technology, Seoul, Korea

Jeong_Medisensing_task3a_1 Jeong_Medisensing_task3a_2 Jeong_Medisensing_task3a_3 Jeong_Medisensing_task3a_4

PDF

Abstract

We describe our submission to DCASE 2026 Challenge Task 3, semantic acoustic imaging for sound event localization and detection on the audio-only track. For each frame the task asks for a set of acoustic-image instances, each a soft energy footprint on the equirectangular sphere together with a sound class, scored by a mask mean average precision (mask mAP). We factor the problem into a localization stage that predicts a dense energy field and a classification stage that labels each extracted instance. The localization stage uses a distilled acoustic-imaging front-end followed by PanoFormer, an equirectangular transformer whose circular padding removes the seam artefacts that a planar decoder produces at the azimuth wrap. Because the official MACRO average counts a class with no predictions as zero, the classification stage assigns a class to every instance from the direction of arrival of a pretrained spatial network, which covers all thirteen classes. A temporal-persistence re-ranking sets the detection confidences. The primary system reaches a development MACRO mask mAP of 0.0515, against 0.0003 for the official baseline. We submit four systems that differ in the classification stage.
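
As an illustration of the circular padding mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the energy field is stored as rows over elevation and columns over azimuth, so that the left and right edges of the grid are neighbours on the sphere. The function name and layout are hypothetical.

```python
def circular_pad_azimuth(grid, pad):
    """Pad each row by wrapping `pad` columns around the azimuth axis.

    grid: list of rows (elevation), each a list of values over azimuth.
    A planar (zero) padding would treat the two edges as unrelated and
    create a seam; wrapping keeps them adjacent.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    if any(pad > len(row) for row in grid):
        raise ValueError("pad must not exceed the row width")
    if pad == 0:
        # row[-0:] would return the whole row, so handle this case explicitly.
        return [list(row) for row in grid]
    return [row[-pad:] + row + row[:pad] for row in grid]


# One elevation row with four azimuth bins, padded by one bin per side.
print(circular_pad_azimuth([[1, 2, 3, 4]], 1))
```

A convolution applied to the padded grid then sees the true neighbours of the first and last azimuth bins.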

SEMANTIC ACOUSTIC IMAGING VIA SELD-STYLE DOA REGRESSION AND A PER-CLASS AUDIOVISUAL ENSEMBLE

Gwantae Kim¹

¹Samsung Electronics

Kim_SAMSUNG_task3a_1 Kim_SAMSUNG_task3a_2 Kim_SAMSUNG_task3a_3 Kim_SAMSUNG_task3a_4 Kim_SAMSUNG_task3b_1 Kim_SAMSUNG_task3b_2 Kim_SAMSUNG_task3b_3 Kim_SAMSUNG_task3b_4

PDF

Abstract

Semantic acoustic imaging aims to estimate class-conditioned spatial energy distributions from audio or audiovisual observations. This report describes our submission to DCASE 2026 Task 3, where systems predict dense acoustic-energy maps on the sphere from either 4-channel spatial audio (Track A) or paired audio and 360° video (Track B). Directly regressing dense per-class maps from recordings is difficult because the targets are spatially sparse, the losses are dominated by background regions, and real labeled recordings are limited. We therefore decompose the problem into sound event localization and detection (SELD) followed by a deterministic renderer that converts direction estimates into dense spherical-Gaussian energy fields. The audio model is trained on the provided real recordings and synthetic recordings, with 8-pattern audio-channel swapping (ACS) and SpecMix mixup used to improve spatial robustness. At inference, we explore complementary strategies across the submitted systems: 8-view inverse-aligned ACS test-time ensembling, a multi-checkpoint cluster ensemble, and per-class audiovisual model selection. On the held-out development-test split, the audio system achieves 0.0234 mask mAP / 0.358 Pearson r, and the per-class audiovisual ensemble reaches 0.0332 / 0.446 (+42% mAP, +25% Pearson).
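
The abstract does not give the renderer's details. The following is a minimal sketch of how a single direction estimate could be rendered as a spherical-Gaussian energy field on an equirectangular grid. The grid size, the sharpness value, the axis conventions (azimuth in [-180, 180), top row pointing up), and all function names are assumptions made for illustration.

```python
import math


def unit_vector(az_deg, el_deg):
    """Unit vector for an azimuth/elevation pair given in degrees."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def render_spherical_gaussian(az_deg, el_deg, sharpness=20.0,
                              height=18, width=36):
    """Energy exp(sharpness * (cos(angle) - 1)) at every pixel centre.

    `angle` is the great-circle angle between the pixel direction and the
    estimated source direction, so the field is continuous across the
    azimuth wrap and peaks (value 1) exactly at the source direction.
    """
    mu = unit_vector(az_deg, el_deg)
    grid = []
    for i in range(height):
        el = 90.0 - (i + 0.5) * 180.0 / height      # pixel-centre elevation
        row = []
        for j in range(width):
            az = -180.0 + (j + 0.5) * 360.0 / width  # pixel-centre azimuth
            v = unit_vector(az, el)
            cos_angle = sum(a * b for a, b in zip(mu, v))
            row.append(math.exp(sharpness * (cos_angle - 1.0)))
        grid.append(row)
    return grid


# A source directly behind the array straddles the left and right edges.
field = render_spherical_gaussian(az_deg=180.0, el_deg=0.0)
print(len(field), len(field[0]))
```

Because the renderer is deterministic, all learning happens in the direction estimator; the energy map is a pure function of its output.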

AUDIO-ONLY SEMANTIC ACOUSTIC IMAGING WITH RECOGNITION-PRIOR SCORE FUSION FOR DCASE2026 TASK 3

Runbang Wang¹, Zining Liang², Yin Cao³, Qiuqiang Kong²

¹Nanjing University, China, ²The Chinese University of Hong Kong, Hong Kong SAR, China, ³Institute of Acoustics, Chinese Academy of Sciences, China

Kong_CUHK_task3a_1 Kong_CUHK_task3a_2 Kong_CUHK_task3a_3 Kong_CUHK_task3a_4

PDF

Abstract

Semantic acoustic imaging predicts class-labeled acoustic regions from microphone-array recordings, producing a spherical map of where sound events appear in a scene. The audio-only setting creates a difficult input-output mismatch: region boundaries are not observed in the waveform, and the event class, acoustic extent, and confidence of each prediction must be estimated from multichannel acoustic cues. We present a system for DCASE2026 Task 3 that first forms spherical acoustic evidence from raw MIC-format audio and then decodes this evidence into mask candidates with event classes and detector scores. A separate AudioMAE-based recognition prior estimates class activity for two-second windows and aligns the probabilities to detector frames. The final fusion stage uses this prior to re-rank the detector candidates while preserving their masks. On the full-recording development test set, the system obtains 0.1017 mAP, 0.2378 AP50, and 0.7904 Pearson r. For submission, prediction compression reduces the maximum JSON size from 148.98 MB to 19.03 MB, with mAP changing from 0.1017 to 0.1009.
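
The abstract does not state the fusion rule. The following is a minimal sketch of one way window-level class probabilities could be aligned to detector frames and used to re-rank candidates without touching their masks. The multiplicative fusion, the `alpha` exponent, the data layout, and the function names are all assumptions, not the authors' method.

```python
def window_index(frame_time_s, window_s=2.0):
    """Index of the fixed-length window that contains a frame time."""
    return max(0, int(frame_time_s // window_s))


def rerank_with_prior(candidates, prior, window_s=2.0, alpha=1.0):
    """Re-score detector candidates with a per-window class prior.

    candidates: list of dicts with keys time_s, class_id, score, mask.
    prior: list indexed by window, each a dict class_id -> probability.
    Returns new dicts sorted by fused score; masks are passed through.
    """
    if not prior:
        raise ValueError("prior must contain at least one window")
    fused = []
    for cand in candidates:
        # Clamp so frames past the last window reuse the final prior.
        w = min(window_index(cand["time_s"], window_s), len(prior) - 1)
        p = prior[w].get(cand["class_id"], 0.0)
        fused.append({**cand, "score": cand["score"] * p ** alpha})
    return sorted(fused, key=lambda c: c["score"], reverse=True)


# Hypothetical candidates: the prior demotes the first and promotes the second.
cands = [
    {"time_s": 0.5, "class_id": 0, "score": 0.9, "mask": "m0"},
    {"time_s": 2.5, "class_id": 1, "score": 0.6, "mask": "m1"},
]
prior = [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}]
print([c["mask"] for c in rerank_with_prior(cands, prior)])
```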

DECISION-CALIBRATED SEMANTIC ACOUSTIC IMAGING FOR SOUND EVENT LOCALIZATION AND DETECTION

Junhyeong Kwon¹, Jongsuk Choi¹

¹Korea Institute of Science and Technology, Seoul, Korea

Kwon_KIST_task3b_1 Kwon_KIST_task3b_2 Kwon_KIST_task3b_3 Kwon_KIST_task3b_4

PDF

Abstract

This report describes our system submitted to the audiovisual track of Task 3 of the DCASE 2026 Challenge on Semantic Acoustic Imaging for SELD. Building on the official UpLAM + Mask R-CNN baseline, we observe that it reconstructs the spatial energy field reasonably well, reaching an audiovisual Pearson correlation of about 0.44 on the development set, yet attains a near-zero detection mAP. We diagnose this as a decision and calibration failure rather than a representation failure: a single global confidence threshold cannot simultaneously retain under-confident true positives and suppress background false positives, and the rank-based scorer punishes both. Our system therefore runs the detector at a very low score threshold to surface the under-confident true positives; re-ranks the detections with a learned matchability model, a post-hoc calibration layer motivated by conformal and risk-controlled prediction that predicts whether each detection will match the ground truth; and emits a grid-sampled energy mask aligned with the challenge soft-IoU scorer. On the STAIRS26 development set this raises macro mAP from the official baseline's 0.0003 to 0.021 and EFRQ from 0.136 to 0.33-0.36. Absolute scores remain low (the task is hard for every system), but the gain over the baseline is large and consistent. We submit a four-system hedge portfolio that crosses matchability and raw ranking with cap-1 and cap-3 selection, and report development numbers transparently, including the recall ceiling imposed by the 4-channel spatial resolution.
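
The abstract refers to a soft-IoU scorer without defining it. The following is a minimal sketch of one common soft-IoU formulation for non-negative energy masks, the sum of element-wise minima divided by the sum of element-wise maxima. Whether the challenge scorer uses exactly this form is an assumption, and the function name is hypothetical.

```python
def soft_iou(pred, ref):
    """Soft intersection-over-union of two equally shaped 2-D masks.

    pred, ref: lists of rows holding non-negative energy values.
    For binary masks this reduces to the ordinary IoU.
    """
    if len(pred) != len(ref) or any(len(p) != len(r) for p, r in zip(pred, ref)):
        raise ValueError("masks must have the same shape")
    inter = 0.0
    union = 0.0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += min(p, r)
            union += max(p, r)
    # Two empty masks have no overlap to measure; return 0.0 by convention.
    return inter / union if union > 0.0 else 0.0


# A prediction that overshoots the reference energy in one cell.
print(soft_iou([[1.0, 0.5]], [[0.5, 0.5]]))
```

With a score of this kind, a detection counts as a match only if its overlap exceeds the threshold in use, which is why a poorly shaped mask can fail to match even when the direction is right.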

SRP 16-BAND UPLAM WITH MASK R-CNN FOR AUDIOVISUAL SELD

Rafał Foltyniewicz¹, Rafał Kaczmarek¹, Michał Olejnik¹, Iwan Ryzenkow¹, Bogdan Jastrzębski¹

¹Samsung Research Poland

Olejnik_SRP_task3b_1 Olejnik_SRP_task3b_2 Olejnik_SRP_task3b_3

PDF

Abstract

We present an audiovisual sound event localization and detection system based on a modified Mask R-CNN with a ResNet-50 Feature Pyramid Network backbone, taking 19-channel inputs composed of 3 RGB video channels and 16 UpLAM acoustic feature maps extracted from the tetrahedral microphone array. The model is trained with progressive backbone unfreezing, class-balanced sampling with inverse-frequency weights (cap=30), focal loss, and data augmentation including azimuth rotation, horizontal flip, RGB dropout, and audio band masking.
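
As an illustration of class-balanced sampling with capped inverse-frequency weights, the following is a minimal sketch, not the authors' implementation. It assumes weights are normalised so that the most frequent class has weight 1 and that the cap of 30 mentioned in the abstract is applied to the resulting ratio. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def capped_inverse_frequency_weights(labels, cap=30.0):
    """Per-class weight = (count of most frequent class) / (class count),
    clipped at `cap` so very rare classes do not dominate sampling."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    most_common = max(counts.values())
    return {cls: min(most_common / n, cap) for cls, n in counts.items()}


def sample_indices(labels, k, cap=30.0, seed=0):
    """Draw k training indices with replacement using the class weights."""
    weights = capped_inverse_frequency_weights(labels, cap)
    per_item = [weights[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=per_item, k=k)


# Toy label set: "c" is 100 times rarer than "a" but its weight stops at 30.
labels = ["a"] * 100 + ["b"] * 10 + ["c"] * 1
print(capped_inverse_frequency_weights(labels))
```

Without the cap, a class seen once would be drawn as often as the whole majority class, which tends to cause overfitting to a handful of examples.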

ENHANCED UPLAM-MASK R-CNN SYSTEM FOR SEMANTIC ACOUSTIC IMAGING AND SOUND EVENT LOCALIZATION

Yishuo Yang¹, Xinwei Wu¹, Chunrui Zhao¹, Huayang Wang¹, Zheng Wen¹, Jilu Jin², Gongping Huang¹

¹School of Electronic Information, Wuhan University, Wuhan, China, ²CIAIC, Northwestern Polytechnical University, Xi'an, China

Wu_WHU_task3a_1 Yang_WHUiasp_task3b_1 Yang_WHUiasp_task3b_2 Yang_WHUiasp_task3b_3 Yang_WHUiasp_task3b_4

PDF

Abstract

This report presents our UpLAM-Mask R-CNN system for DCASE 2026 Task 3: semantic acoustic imaging (SAI) for sound event localization and detection (SELD). The task requires detecting sound event categories while estimating their spatial acoustic energy distributions and source distances. Based on the official UpLAM Mask R-CNN baseline, we develop unified frameworks for both the audio-only and audio-visual tracks. Specifically, we introduce modality-adaptive input switching for Track A and Track B, optimize the UpLAM-based acoustic feature extraction pipeline, preserve acoustic energy magnitude cues throughout the model, and enhance RoI-level energy-map prediction, full-frame energy decoding, and distance regression. During inference, temporal tracking, confidence-based filtering, and region-aware energy-map export are further applied to improve the stability and completeness of the predictions.
