Task description

This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal is to evaluate sound classification models on diverse, real-world audio that varies widely in character, for example in duration and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based, metadata-based, and multimodal approaches to sound classification, as well as to leverage hierarchical relationships between taxonomy categories.
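
To illustrate how the two-level taxonomy can give partial credit, the following minimal sketch computes set-based hierarchical precision, recall, and F-score. It assumes that second-level category codes follow the `top-sub` pattern used in the class-wise results (for example `m-sp` or `fx-a`) and that each label is expanded to itself plus its top-level ancestor. The function names are hypothetical, and the micro-averaging used here is an assumption: the hP, hR, and hF scores reported on this page may be computed with a different averaging scheme.

```python
def expand(label):
    """Return the label plus its top-level ancestor, e.g. 'm-sp' -> {'m', 'm-sp'}.

    A top-level-only label such as 'fx' expands to {'fx'}.
    Assumption: codes have the form '<top>-<sub>'.
    """
    top = label.split("-", 1)[0]
    return {top, label}


def hierarchical_prf(y_true, y_pred):
    """Micro-averaged set-based hierarchical precision, recall and F-score."""
    overlap = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        t_set, p_set = expand(t), expand(p)
        overlap += len(t_set & p_set)   # nodes shared by truth and prediction
        pred_total += len(p_set)        # nodes claimed by the prediction
        true_total += len(t_set)        # nodes in the ground truth
    hp = overlap / pred_total if pred_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# One sibling confusion, one exact match, one top-level miss.
print(hierarchical_prf(["m-sp", "fx-a", "ss-n"], ["m-si", "fx-a", "sp-s"]))
```

In this sketch, confusing `m-sp` with `m-si` still earns credit for the shared top-level category `m`, whereas confusing `ss-n` with `sp-s` earns none.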

A more detailed task description can be found on the task description page.

Leaderboard

Rank		Submission Information				Metrics
Official rank	System rank	Submission label	Submission name	Audio only	Technical Report	hP	hR	hF
1	1	Primus_CPJKU_task1_3	Ensemble 3	False	Primus_CPJKU_2026	0.842	0.846	0.836
2	5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	False	Huang_WHU_2026	0.825	0.810	0.811
3	7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.816	0.797	0.799
4	12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	False	Zheng_SCUT_2026	0.817	0.782	0.788
5	13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.824	0.775	0.787
6	14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	False	Lin_JKU_2026	0.801	0.785	0.785
7	22	Baseline_UPF_task1_2	HATR baseline (multimodal)	False	Baseline_UPF_2026	0.725	0.696	0.700
8	23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.746	0.709	0.699
9	25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	False	Kil_Medisensing_2026	0.682	0.732	0.692
10	30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	True	Liu_CQUPT_2026	0.657	0.647	0.629
11	35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	False	Sharma_IR_2026	0.526	0.488	0.472
12	36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	False	Han_CSU_2026	0.422	0.370	0.372
13	39	Zhang_XJTLU_task1_3	no-ranker ensemble	False	Zhang_XJTLU_2026	0.162	0.139	0.116

Systems ranking

Rank	Submission Information				Metrics
System rank	Submission label	Submission name	Audio only	Technical Report	hP	hR	hF
1	Primus_CPJKU_task1_3	Ensemble 3	False	Primus_CPJKU_2026	0.842	0.846	0.836
2	Primus_CPJKU_task1_2	Ensemble 2	False	Primus_CPJKU_2026	0.842	0.845	0.836
3	Primus_CPJKU_task1_4	Ensemble 4	False	Primus_CPJKU_2026	0.841	0.844	0.835
4	Primus_CPJKU_task1_1	Ensemble 1	False	Primus_CPJKU_2026	0.838	0.842	0.833
5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	False	Huang_WHU_2026	0.825	0.810	0.811
6	Huang_WHU_task1_1	HGA-EMA-STFT system	False	Huang_WHU_2026	0.813	0.802	0.800
7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.816	0.797	0.799
8	Huang_WHU_task1_4	Six-model 5-fold logit ensemble	False	Huang_WHU_2026	0.820	0.794	0.799
9	Guan_HEU_task1_4	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.818	0.795	0.798
10	Huang_WHU_task1_3	STFT-distill logit ensemble-5	False	Huang_WHU_2026	0.819	0.793	0.798
11	Guan_HEU_task1_2	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.816	0.793	0.797
12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	False	Zheng_SCUT_2026	0.817	0.782	0.788
13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.824	0.775	0.787
14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	False	Lin_JKU_2026	0.801	0.785	0.785
15	Guan_HEU_task1_1	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.799	0.771	0.775
16	Kucukoglu_NYU_task1_4	Three-Modality HATR with Whisper and Mixup	False	Kucukoglu_NYU_2026	0.814	0.764	0.773
17	Kucukoglu_NYU_task1_3	Single HATR Model with Cross-Fold Augmentation	False	Kucukoglu_NYU_2026	0.802	0.762	0.770
18	Lin_JKU_task1_2	Frozen-CLAP ensemble with logit adjustment (balanced)	False	Lin_JKU_2026	0.774	0.778	0.768
19	Lin_JKU_task1_4	Clean concatenation CLAP probe (10k-train only)	False	Lin_JKU_2026	0.781	0.761	0.763
20	Kucukoglu_NYU_task1_2	CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.779	0.763	0.760
21	Lin_JKU_task1_3	Gated multimodal CLAP fusion with pseudo-labels	False	Lin_JKU_2026	0.769	0.765	0.752
22	Baseline_UPF_task1_2	HATR baseline (multimodal)	False	Baseline_UPF_2026	0.725	0.696	0.700
23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.746	0.709	0.699
24	Colotti_TAU_task1_2	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.732	0.706	0.699
25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	False	Kil_Medisensing_2026	0.682	0.732	0.692
26	Kil_Medisensing_task1_2	raw parent specialist posterior stack	False	Kil_Medisensing_2026	0.682	0.731	0.691
27	Colotti_TAU_task1_4	CLAP-MoE Feature-wise Gated Fusion	False	Colotti_TAU_2026	0.724	0.667	0.669
28	Colotti_TAU_task1_3	EnhancedBaseClassifier	False	Colotti_TAU_2026	0.710	0.667	0.664
29	Kil_Medisensing_task1_3	Larger-CLAP classifier with metadata confidence gate	False	Kil_Medisensing_2026	0.652	0.641	0.633
30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	True	Liu_CQUPT_2026	0.657	0.647	0.629
31	Liu_CQUPT_task1_2	Weighted Audio-only CLAP Ensemble with BSD10k Specialist	True	Liu_CQUPT_2026	0.644	0.643	0.623
32	Baseline_UPF_task1_1	HATR baseline (audio)	True	Baseline_UPF_2026	0.647	0.631	0.619
33	Liu_CQUPT_task1_1	Audio-only CLAP Ensemble with EMA and Test-Time Augmentation	True	Liu_CQUPT_2026	0.639	0.640	0.619
34	Kil_Medisensing_task1_4	audio posterior ensemble with conservative argmax decoding	True	Kil_Medisensing_2026	0.590	0.552	0.520
35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	False	Sharma_IR_2026	0.526	0.488	0.472
36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	False	Han_CSU_2026	0.422	0.370	0.372
37	Han_CSU_task1_3	Multi-Embedding System with Hierarchical Proxy Learning_3	False	Han_CSU_2026	0.355	0.316	0.315
38	Han_CSU_task1_4	Multi-Embedding System with Hierarchical Proxy Learning_4	False	Han_CSU_2026	0.331	0.294	0.294
39	Zhang_XJTLU_task1_3	no-ranker ensemble	False	Zhang_XJTLU_2026	0.162	0.139	0.116
40	Zhang_XJTLU_task1_4	external diversity low-risk	False	Zhang_XJTLU_2026	0.154	0.146	0.101
41	Han_CSU_task1_2	Multi-Embedding System with Hierarchical Proxy Learning_2	False	Han_CSU_2026	0.102	0.105	0.097
42	Zhang_XJTLU_task1_2	balanced ranker/base	False	Zhang_XJTLU_2026	0.142	0.138	0.095
43	Zhang_XJTLU_task1_1	aggressive ranker	False	Zhang_XJTLU_2026	0.106	0.127	0.091
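
Comparing the two tables, the official rank in the Leaderboard appears to correspond to each team's best-ranked system. This is inferred from the tables, not stated in the source. A minimal sketch of that derivation follows; the helper names are hypothetical, and the sketch assumes that submission labels have the form `<Author>_<Affiliation>_task1_<n>`.

```python
def team_of(label):
    """Strip the '_task1_<n>' suffix (assumed label format) to get the team key."""
    return label.rsplit("_task", 1)[0]


def official_ranks(system_ranking):
    """Map (system_rank, submission_label) pairs to team-level ranks.

    Returns a list of (official_rank, system_rank, submission_label),
    keeping only the best-ranked system of each team.
    """
    best = {}
    for rank, label in sorted(system_ranking):
        # The first time a team is seen in rank order is its best system.
        best.setdefault(team_of(label), (rank, label))
    ordered = sorted(best.values())
    return [(i + 1, rank, label) for i, (rank, label) in enumerate(ordered)]


# A few rows taken from the systems ranking above.
rows = [
    (1, "Primus_CPJKU_task1_3"),
    (2, "Primus_CPJKU_task1_2"),
    (5, "Huang_WHU_task1_2"),
    (6, "Huang_WHU_task1_1"),
    (7, "Guan_HEU_task1_3"),
]
for entry in official_ranks(rows):
    print(entry)
```

For these rows the sketch yields official ranks 1, 2, and 3 for Primus_CPJKU_task1_3, Huang_WHU_task1_2, and Guan_HEU_task1_3, which agrees with the Leaderboard.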

Class-wise performance

Rank	Submission Information				Class-wise metrics
System rank	Submission label	Submission name	Audio only	Technical Report	hP - Music > Solo percussion (m-sp)	hR - Music > Solo percussion (m-sp)	hF - Music > Solo percussion (m-sp)	hP - Music > Solo instrument (m-si)	hR - Music > Solo instrument (m-si)	hF - Music > Solo instrument (m-si)	hP - Music > Multiple instruments (m-m)	hR - Music > Multiple instruments (m-m)	hF - Music > Multiple instruments (m-m)	hP - Instrument samples > Percussion (is-p)	hR - Instrument samples > Percussion (is-p)	hF - Instrument samples > Percussion (is-p)	hP - Instrument samples > String (is-s)	hR - Instrument samples > String (is-s)	hF - Instrument samples > String (is-s)	hP - Instrument samples > Wind (is-w)	hR - Instrument samples > Wind (is-w)	hF - Instrument samples > Wind (is-w)	hP - Instrument samples > Piano / Keyboard instruments (is-k)	hR - Instrument samples > Piano / Keyboard instruments (is-k)	hF - Instrument samples > Piano / Keyboard instruments (is-k)	hP - Instrument samples > Synths / Electronic (is-e)	hR - Instrument samples > Synths / Electronic (is-e)	hF - Instrument samples > Synths / Electronic (is-e)	hP - Speech > Solo speech (sp-s)	hR - Speech > Solo speech (sp-s)	hF - Speech > Solo speech (sp-s)	hP - Speech > Conversation / Crowd (sp-c)	hR - Speech > Conversation / Crowd (sp-c)	hF - Speech > Conversation / Crowd (sp-c)	hP - Speech > Processed / Synthetic (sp-p)	hR - Speech > Processed / Synthetic (sp-p)	hF - Speech > Processed / Synthetic (sp-p)	hP - Sound effects > Objects / House appliances (fx-o)	hR - Sound effects > Objects / House appliances (fx-o)	hF - Sound effects > Objects / House appliances (fx-o)	hP - Sound effects > Vehicles (fx-v)	hR - Sound effects > Vehicles (fx-v)	hF - Sound effects > Vehicles (fx-v)	hP - Sound effects > Other mechanisms, engines, machines (fx-m)	hR - Sound effects > Other mechanisms, engines, machines (fx-m)	hF - Sound effects > Other mechanisms, engines, machines (fx-m)	hP - Sound effects > Human sounds and actions (fx-h)	hR - Sound effects > Human sounds and actions (fx-h)	hF - Sound effects > Human sounds and actions (fx-h)	hP - Sound effects > Animals (fx-a)	hR - Sound effects > Animals (fx-a)	hF - Sound effects > Animals (fx-a)	hP - Sound effects > Natural elements and explosions (fx-n)	hR - Sound effects > Natural elements and explosions (fx-n)	hF - Sound effects > Natural elements and explosions (fx-n)	hP - Sound effects > Experimental (fx-ex)	hR - Sound effects > Experimental (fx-ex)	hF - Sound effects > Experimental (fx-ex)	hP - Sound effects > Electronic / Design (fx-el)	hR - Sound effects > Electronic / Design (fx-el)	hF - Sound effects > Electronic / Design (fx-el)	hP - Soundscapes > Nature (ss-n)	hR - Soundscapes > Nature (ss-n)	hF - Soundscapes > Nature (ss-n)	hP - Soundscapes > Indoors (ss-i)	hR - Soundscapes > Indoors (ss-i)	hF - Soundscapes > Indoors (ss-i)	hP - Soundscapes > Urban (ss-u)	hR - Soundscapes > Urban (ss-u)	hF - Soundscapes > Urban (ss-u)	hP - Soundscapes > Synthetic / Artificial (ss-s)	hR - Soundscapes > Synthetic / Artificial (ss-s)	hF - Soundscapes > Synthetic / Artificial (ss-s)
1	Primus_CPJKU_task1_3	Ensemble 3	False	Primus_CPJKU_2026	0.940	0.728	0.821	0.908	0.864	0.885	0.839	0.815	0.827	0.807	0.991	0.889	0.834	0.933	0.881	0.600	1.000	0.750	0.963	1.000	0.981	0.858	0.836	0.847	0.920	0.933	0.926	0.955	0.703	0.809	0.854	0.717	0.779	0.850	0.937	0.891	0.879	0.944	0.910	0.825	0.878	0.851	0.909	0.889	0.899	0.831	0.990	0.904	0.734	0.922	0.818	0.536	0.728	0.617	0.904	0.762	0.827	0.932	0.788	0.854	0.757	0.539	0.630	0.847	0.784	0.814	0.881	0.775	0.824
2	Primus_CPJKU_task1_2	Ensemble 2	False	Primus_CPJKU_2026	0.960	0.728	0.828	0.918	0.852	0.884	0.855	0.815	0.835	0.814	0.982	0.890	0.837	0.951	0.890	0.600	1.000	0.750	0.963	1.000	0.981	0.847	0.820	0.833	0.915	0.936	0.925	0.970	0.670	0.792	0.869	0.737	0.798	0.841	0.937	0.886	0.870	0.944	0.905	0.846	0.878	0.862	0.909	0.889	0.899	0.831	0.990	0.904	0.720	0.922	0.809	0.544	0.750	0.630	0.886	0.773	0.826	0.944	0.788	0.859	0.709	0.539	0.613	0.847	0.784	0.814	0.878	0.755	0.812
3	Primus_CPJKU_task1_4	Ensemble 4	False	Primus_CPJKU_2026	0.941	0.744	0.831	0.891	0.864	0.877	0.839	0.822	0.830	0.812	0.972	0.885	0.834	0.933	0.881	0.600	1.000	0.750	0.963	1.000	0.981	0.842	0.836	0.839	0.921	0.941	0.931	0.949	0.637	0.763	0.872	0.689	0.770	0.850	0.937	0.891	0.880	0.955	0.916	0.842	0.863	0.852	0.909	0.889	0.899	0.820	0.990	0.897	0.744	0.922	0.823	0.533	0.728	0.616	0.894	0.773	0.829	0.957	0.788	0.864	0.718	0.546	0.620	0.858	0.803	0.829	0.883	0.779	0.828
4	Primus_CPJKU_task1_1	Ensemble 1	False	Primus_CPJKU_2026	0.941	0.744	0.831	0.877	0.818	0.846	0.845	0.845	0.845	0.812	0.972	0.885	0.785	0.933	0.853	0.600	1.000	0.750	0.963	1.000	0.981	0.825	0.820	0.823	0.913	0.928	0.920	0.952	0.670	0.787	0.872	0.704	0.779	0.865	0.937	0.899	0.871	0.955	0.911	0.872	0.893	0.882	0.896	0.900	0.898	0.831	0.990	0.904	0.729	0.922	0.815	0.543	0.669	0.600	0.867	0.785	0.824	0.944	0.788	0.859	0.734	0.565	0.639	0.855	0.791	0.821	0.878	0.748	0.808
5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	False	Huang_WHU_2026	0.892	0.851	0.871	0.794	0.832	0.813	0.845	0.838	0.841	0.925	0.947	0.936	0.862	0.928	0.894	1.000	1.000	1.000	0.963	1.000	0.981	0.780	0.828	0.804	0.936	0.970	0.953	0.982	0.682	0.805	0.907	0.656	0.761	0.753	0.953	0.841	0.898	0.866	0.882	0.759	0.823	0.790	0.790	0.861	0.824	0.858	0.846	0.852	0.672	0.901	0.770	0.420	0.458	0.438	0.891	0.590	0.710	0.779	0.811	0.795	0.759	0.463	0.575	0.701	0.762	0.730	0.816	0.760	0.787
6	Huang_WHU_task1_1	HGA-EMA-STFT system	False	Huang_WHU_2026	0.932	0.819	0.872	0.878	0.820	0.848	0.818	0.819	0.819	0.875	0.941	0.907	0.851	0.951	0.899	1.000	1.000	1.000	1.000	1.000	1.000	0.820	0.803	0.811	0.892	0.965	0.928	1.000	0.550	0.710	0.887	0.617	0.728	0.748	0.941	0.833	0.927	0.882	0.904	0.736	0.854	0.791	0.765	0.828	0.795	0.863	0.879	0.871	0.674	0.912	0.775	0.435	0.408	0.421	0.838	0.690	0.757	0.782	0.837	0.809	0.625	0.382	0.474	0.601	0.757	0.670	0.762	0.792	0.776
7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.772	0.851	0.810	0.858	0.832	0.845	0.848	0.870	0.859	0.903	0.781	0.838	0.904	0.919	0.911	0.688	0.583	0.631	0.963	1.000	0.981	0.786	0.918	0.847	0.946	0.941	0.943	1.000	0.603	0.752	0.830	0.696	0.758	0.770	0.937	0.846	0.922	0.891	0.906	0.719	0.893	0.797	0.866	0.820	0.842	0.811	0.910	0.857	0.709	0.907	0.796	0.464	0.553	0.505	0.875	0.606	0.716	0.830	0.781	0.805	0.767	0.502	0.607	0.696	0.767	0.730	0.853	0.760	0.804
8	Huang_WHU_task1_4	Six-model 5-fold logit ensemble	False	Huang_WHU_2026	0.906	0.829	0.866	0.827	0.825	0.826	0.858	0.833	0.846	0.899	0.941	0.920	0.880	0.917	0.898	1.000	0.667	0.800	0.963	1.000	0.981	0.836	0.836	0.836	0.935	0.957	0.946	0.967	0.610	0.748	0.889	0.628	0.736	0.764	0.937	0.842	0.929	0.886	0.907	0.747	0.838	0.790	0.764	0.871	0.814	0.875	0.863	0.869	0.681	0.912	0.780	0.422	0.453	0.437	0.792	0.655	0.717	0.780	0.849	0.813	0.720	0.424	0.533	0.633	0.757	0.689	0.804	0.779	0.791
9	Guan_HEU_task1_4	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.833	0.877	0.854	0.823	0.843	0.833	0.870	0.845	0.857	0.902	0.855	0.878	0.904	0.919	0.911	0.688	0.583	0.631	0.963	1.000	0.981	0.781	0.893	0.833	0.946	0.941	0.943	1.000	0.575	0.730	0.821	0.696	0.754	0.780	0.937	0.851	0.899	0.886	0.893	0.725	0.893	0.801	0.850	0.820	0.834	0.813	0.926	0.866	0.692	0.907	0.785	0.456	0.531	0.490	0.901	0.588	0.712	0.858	0.800	0.828	0.786	0.477	0.594	0.642	0.767	0.699	0.887	0.721	0.795
10	Huang_WHU_task1_3	STFT-distill logit ensemble-5	False	Huang_WHU_2026	0.890	0.829	0.859	0.832	0.825	0.829	0.861	0.845	0.853	0.896	0.917	0.907	0.880	0.917	0.898	1.000	0.667	0.800	0.963	1.000	0.981	0.829	0.836	0.832	0.936	0.965	0.950	0.967	0.610	0.748	0.904	0.628	0.741	0.761	0.941	0.841	0.929	0.886	0.907	0.747	0.838	0.790	0.793	0.871	0.830	0.867	0.863	0.865	0.673	0.912	0.775	0.403	0.453	0.427	0.797	0.667	0.726	0.782	0.837	0.809	0.696	0.394	0.503	0.629	0.757	0.687	0.804	0.779	0.791
11	Guan_HEU_task1_2	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.725	0.835	0.776	0.819	0.825	0.822	0.810	0.870	0.839	0.867	0.752	0.805	0.873	0.896	0.884	1.000	0.792	0.884	0.931	1.000	0.964	0.842	0.893	0.866	0.946	0.949	0.948	1.000	0.535	0.697	0.841	0.635	0.724	0.792	0.944	0.862	0.898	0.875	0.886	0.723	0.848	0.780	0.818	0.793	0.805	0.804	0.916	0.857	0.705	0.890	0.787	0.482	0.531	0.505	0.852	0.639	0.730	0.826	0.792	0.809	0.719	0.502	0.591	0.627	0.748	0.682	0.875	0.779	0.824
12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	False	Zheng_SCUT_2026	0.888	0.825	0.856	0.768	0.848	0.806	0.810	0.810	0.810	0.878	0.932	0.904	0.836	0.868	0.851	1.000	0.667	0.800	0.882	1.000	0.937	0.778	0.752	0.765	0.934	0.979	0.956	1.000	0.685	0.813	0.918	0.656	0.765	0.754	0.941	0.837	0.915	0.756	0.828	0.750	0.832	0.789	0.834	0.820	0.827	0.821	0.834	0.827	0.684	0.890	0.774	0.455	0.575	0.508	0.845	0.523	0.646	0.797	0.868	0.831	0.810	0.419	0.552	0.592	0.812	0.685	0.838	0.703	0.765
13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.898	0.837	0.867	0.786	0.836	0.810	0.812	0.718	0.762	0.909	0.917	0.913	0.846	0.826	0.836	1.000	0.458	0.629	1.000	1.000	1.000	0.775	0.873	0.821	0.920	0.949	0.934	0.968	0.637	0.769	0.871	0.615	0.721	0.746	0.955	0.838	0.937	0.846	0.889	0.809	0.832	0.821	0.844	0.824	0.834	0.771	0.916	0.837	0.725	0.879	0.795	0.420	0.567	0.483	0.928	0.558	0.697	0.847	0.689	0.760	0.750	0.479	0.585	0.627	0.793	0.700	0.766	0.819	0.792
14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	False	Lin_JKU_2026	0.884	0.764	0.820	0.782	0.774	0.778	0.901	0.713	0.796	0.815	0.928	0.868	0.757	0.933	0.836	1.000	1.000	1.000	0.941	1.000	0.970	0.701	0.824	0.757	0.942	0.952	0.947	1.000	0.568	0.724	0.850	0.622	0.719	0.768	0.902	0.829	0.858	0.945	0.900	0.698	0.835	0.760	0.872	0.762	0.814	0.814	0.887	0.849	0.684	0.860	0.762	0.453	0.436	0.444	0.916	0.690	0.787	0.736	0.818	0.775	0.688	0.477	0.563	0.607	0.635	0.620	0.758	0.721	0.739
15	Guan_HEU_task1_1	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.852	0.871	0.861	0.745	0.852	0.795	0.860	0.796	0.827	0.893	0.908	0.900	0.873	0.852	0.862	0.688	0.583	0.631	0.931	1.000	0.964	0.763	0.781	0.772	0.932	0.949	0.940	1.000	0.570	0.726	0.830	0.745	0.785	0.733	0.937	0.822	0.906	0.901	0.904	0.723	0.909	0.805	0.823	0.789	0.806	0.809	0.799	0.804	0.672	0.907	0.772	0.396	0.414	0.405	0.845	0.551	0.667	0.782	0.800	0.791	0.850	0.398	0.542	0.601	0.724	0.656	0.875	0.708	0.783
16	Kucukoglu_NYU_task1_4	Three-Modality HATR with Whisper and Mixup	False	Kucukoglu_NYU_2026	0.883	0.859	0.871	0.818	0.825	0.821	0.857	0.713	0.778	0.863	0.923	0.892	0.834	0.870	0.852	1.000	0.458	0.629	0.901	1.000	0.948	0.712	0.824	0.764	0.908	0.962	0.934	0.967	0.625	0.759	0.855	0.663	0.747	0.720	0.948	0.819	0.934	0.786	0.854	0.817	0.863	0.839	0.806	0.818	0.812	0.763	0.869	0.813	0.698	0.856	0.769	0.430	0.619	0.507	0.906	0.461	0.611	0.792	0.689	0.737	0.783	0.410	0.538	0.595	0.793	0.680	0.869	0.748	0.804
17	Kucukoglu_NYU_task1_3	Single HATR Model with Cross-Fold Augmentation	False	Kucukoglu_NYU_2026	0.887	0.875	0.881	0.755	0.833	0.792	0.780	0.706	0.741	0.914	0.903	0.908	0.883	0.785	0.831	1.000	0.458	0.629	0.901	1.000	0.948	0.744	0.738	0.741	0.929	0.949	0.939	0.944	0.703	0.806	0.875	0.628	0.731	0.762	0.955	0.848	0.907	0.806	0.854	0.686	0.863	0.764	0.841	0.801	0.821	0.802	0.873	0.836	0.728	0.884	0.798	0.424	0.647	0.513	0.725	0.488	0.584	0.796	0.731	0.762	0.720	0.424	0.533	0.614	0.781	0.688	0.838	0.689	0.756
18	Lin_JKU_task1_2	Frozen-CLAP ensemble with logit adjustment (balanced)	False	Lin_JKU_2026	0.897	0.744	0.813	0.812	0.740	0.774	0.878	0.718	0.790	0.815	0.928	0.868	0.767	0.917	0.835	0.750	1.000	0.857	0.862	1.000	0.926	0.587	0.834	0.689	0.966	0.868	0.915	0.932	0.608	0.735	0.786	0.635	0.702	0.765	0.691	0.726	0.833	0.957	0.891	0.680	0.875	0.765	0.888	0.682	0.772	0.785	0.947	0.859	0.719	0.871	0.787	0.476	0.494	0.485	0.788	0.736	0.761	0.773	0.811	0.791	0.684	0.535	0.600	0.671	0.558	0.609	0.698	0.748	0.722
19	Lin_JKU_task1_4	Clean concatenation CLAP probe (10k-train only)	False	Lin_JKU_2026	0.860	0.829	0.844	0.792	0.829	0.810	0.800	0.815	0.807	0.847	0.917	0.881	0.774	0.891	0.828	0.688	0.458	0.550	0.833	1.000	0.909	0.831	0.670	0.742	0.921	0.900	0.910	0.905	0.655	0.760	0.747	0.556	0.637	0.743	0.951	0.834	0.916	0.881	0.898	0.739	0.723	0.731	0.855	0.840	0.847	0.787	0.910	0.844	0.736	0.894	0.808	0.473	0.661	0.552	0.797	0.532	0.639	0.823	0.781	0.801	0.652	0.382	0.482	0.629	0.736	0.678	0.818	0.696	0.752
20	Kucukoglu_NYU_task1_2	CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.887	0.875	0.881	0.769	0.818	0.792	0.789	0.736	0.762	0.927	0.908	0.918	0.858	0.852	0.855	0.000	0.125	0.000	1.000	1.000	1.000	0.757	0.852	0.801	0.914	0.949	0.931	0.968	0.645	0.774	0.864	0.574	0.690	0.779	0.948	0.856	0.927	0.825	0.873	0.783	0.848	0.814	0.837	0.824	0.830	0.788	0.936	0.856	0.714	0.884	0.790	0.421	0.575	0.486	0.904	0.532	0.670	0.833	0.726	0.776	0.781	0.539	0.638	0.629	0.781	0.697	0.787	0.787	0.787
21	Lin_JKU_task1_3	Gated multimodal CLAP fusion with pseudo-labels	False	Lin_JKU_2026	0.887	0.750	0.813	0.858	0.809	0.833	0.832	0.750	0.789	0.806	0.961	0.877	0.838	0.905	0.870	0.000	0.250	0.000	0.849	0.961	0.901	0.789	0.857	0.822	0.911	0.970	0.940	1.000	0.583	0.736	0.889	0.651	0.751	0.772	0.944	0.850	0.865	0.931	0.897	0.719	0.869	0.787	0.946	0.762	0.844	0.799	0.936	0.862	0.691	0.888	0.777	0.469	0.503	0.486	0.797	0.683	0.735	0.839	0.830	0.835	0.762	0.373	0.501	0.593	0.697	0.641	0.779	0.733	0.755
22	Baseline_UPF_task1_2	HATR baseline (multimodal)	False	Baseline_UPF_2026	0.502	0.673	0.575	0.748	0.688	0.717	0.823	0.641	0.721	0.461	0.300	0.363	0.745	0.889	0.810	1.000	0.792	0.884	0.875	0.664	0.755	0.797	0.820	0.808	0.833	0.939	0.883	0.948	0.500	0.655	0.821	0.681	0.745	0.692	0.749	0.719	0.910	0.693	0.786	0.496	0.643	0.560	0.710	0.787	0.746	0.824	0.967	0.890	0.810	0.653	0.723	0.645	0.458	0.536	0.656	0.785	0.715	0.645	0.783	0.707	0.497	0.505	0.501	0.500	0.755	0.602	0.738	0.645	0.688
23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.904	0.812	0.855	0.779	0.772	0.776	0.913	0.606	0.729	0.865	0.923	0.893	0.923	0.671	0.777	0.000	0.375	0.000	1.000	0.609	0.757	0.627	0.912	0.743	0.947	0.910	0.928	0.882	0.315	0.464	0.897	0.717	0.797	0.720	0.909	0.804	0.859	0.868	0.863	0.765	0.838	0.800	0.703	0.902	0.790	0.802	0.805	0.804	0.743	0.851	0.793	0.395	0.625	0.484	0.958	0.426	0.590	0.814	0.894	0.852	0.441	0.273	0.337	0.451	0.659	0.535	0.771	0.642	0.701
24	Colotti_TAU_task1_2	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.906	0.754	0.823	0.803	0.763	0.782	0.936	0.634	0.756	0.844	0.932	0.886	0.662	0.567	0.611	0.643	1.000	0.783	0.714	0.570	0.634	0.636	0.742	0.685	0.976	0.711	0.823	0.963	0.328	0.489	0.838	0.587	0.690	0.765	0.898	0.826	0.877	0.841	0.858	0.698	0.838	0.762	0.636	0.902	0.746	0.732	0.852	0.788	0.716	0.845	0.775	0.332	0.606	0.429	0.625	0.495	0.553	0.784	0.719	0.750	0.591	0.324	0.419	0.442	0.724	0.549	0.723	0.596	0.653
25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	False	Kil_Medisensing_2026	0.536	0.750	0.625	0.803	0.628	0.705	0.722	0.634	0.675	0.509	0.314	0.389	0.725	0.944	0.820	0.429	1.000	0.600	0.856	1.000	0.923	0.864	0.826	0.845	0.827	0.970	0.893	0.000	0.083	0.000	0.742	0.727	0.734	0.748	0.753	0.751	0.935	0.711	0.808	0.598	0.558	0.577	0.792	0.824	0.807	0.873	0.984	0.925	0.832	0.823	0.828	0.507	0.472	0.489	0.780	0.808	0.793	0.837	0.863	0.850	0.448	0.819	0.579	0.576	0.825	0.678	0.739	0.522	0.612
26	Kil_Medisensing_task1_2	raw parent specialist posterior stack	False	Kil_Medisensing_2026	0.536	0.750	0.625	0.797	0.628	0.702	0.716	0.623	0.666	0.509	0.314	0.389	0.725	0.944	0.820	0.429	1.000	0.600	0.856	1.000	0.923	0.864	0.826	0.845	0.822	0.970	0.890	0.000	0.083	0.000	0.750	0.727	0.738	0.738	0.760	0.749	0.935	0.711	0.808	0.590	0.543	0.565	0.798	0.824	0.811	0.873	0.984	0.925	0.829	0.812	0.821	0.510	0.486	0.498	0.802	0.808	0.805	0.837	0.863	0.850	0.448	0.819	0.579	0.576	0.825	0.678	0.739	0.522	0.612
27	Colotti_TAU_task1_4	CLAP-MoE Feature-wise Gated Fusion	False	Colotti_TAU_2026	0.902	0.758	0.824	0.666	0.782	0.719	0.820	0.729	0.772	0.834	0.893	0.863	0.802	0.558	0.658	0.375	0.000	0.000	1.000	0.680	0.809	0.660	0.570	0.612	0.926	0.840	0.881	0.929	0.290	0.442	0.808	0.431	0.562	0.715	0.960	0.820	0.835	0.753	0.792	0.663	0.741	0.700	0.815	0.785	0.800	0.805	0.809	0.807	0.725	0.879	0.795	0.409	0.669	0.508	0.659	0.690	0.674	0.763	0.894	0.823	0.438	0.361	0.396	0.417	0.692	0.520	0.684	0.569	0.621
28	Colotti_TAU_task1_3	EnhancedBaseClassifier	False	Colotti_TAU_2026	0.880	0.726	0.796	0.785	0.859	0.820	0.747	0.602	0.667	0.891	0.700	0.784	0.671	0.634	0.652	1.000	1.000	1.000	0.875	0.508	0.643	0.516	0.504	0.510	0.958	0.880	0.917	0.682	0.135	0.225	0.650	0.464	0.542	0.689	0.930	0.792	0.748	0.753	0.751	0.502	0.893	0.642	0.790	0.775	0.782	0.844	0.682	0.755	0.644	0.886	0.746	0.357	0.617	0.452	0.750	0.319	0.448	0.686	0.925	0.788	0.333	0.289	0.310	0.448	0.673	0.538	0.879	0.598	0.712
29	Kil_Medisensing_task1_3	Larger-CLAP classifier with metadata confidence gate	False	Kil_Medisensing_2026	0.455	0.615	0.523	0.733	0.783	0.757	0.795	0.708	0.749	0.365	0.263	0.305	0.766	0.803	0.784	0.000	0.125	0.000	0.911	0.844	0.876	0.877	0.764	0.817	0.821	0.957	0.884	0.473	0.217	0.298	0.889	0.579	0.701	0.605	0.761	0.674	0.853	0.784	0.817	0.549	0.634	0.588	0.812	0.623	0.705	0.742	0.857	0.795	0.851	0.560	0.676	0.603	0.408	0.487	0.558	0.704	0.623	0.629	0.906	0.743	0.478	0.542	0.508	0.489	0.712	0.579	0.750	0.588	0.659
30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	True	Liu_CQUPT_2026	0.522	0.688	0.594	0.755	0.634	0.689	0.766	0.646	0.701	0.424	0.287	0.342	0.680	0.910	0.778	0.000	0.250	0.000	0.750	0.609	0.672	0.724	0.805	0.762	0.824	0.939	0.878	0.943	0.230	0.370	0.892	0.630	0.738	0.664	0.767	0.712	0.910	0.710	0.797	0.551	0.652	0.598	0.742	0.760	0.751	0.794	0.865	0.828	0.760	0.642	0.696	0.580	0.447	0.505	0.610	0.736	0.667	0.696	0.894	0.783	0.309	0.396	0.347	0.476	0.786	0.593	0.744	0.605	0.667
31	Liu_CQUPT_task1_2	Weighted Audio-only CLAP Ensemble with BSD10k Specialist	True	Liu_CQUPT_2026	0.497	0.647	0.562	0.748	0.645	0.693	0.774	0.646	0.704	0.423	0.267	0.327	0.695	0.873	0.774	0.000	0.250	0.000	0.760	0.688	0.722	0.714	0.814	0.761	0.854	0.952	0.901	0.795	0.230	0.357	0.925	0.651	0.764	0.659	0.830	0.735	0.881	0.694	0.776	0.522	0.588	0.553	0.756	0.768	0.762	0.811	0.891	0.849	0.768	0.735	0.751	0.558	0.353	0.432	0.609	0.771	0.681	0.677	0.858	0.757	0.281	0.361	0.316	0.448	0.743	0.559	0.668	0.542	0.598
32	Baseline_UPF_task1_1	HATR baseline (audio)	True	Baseline_UPF_2026	0.484	0.647	0.554	0.750	0.711	0.730	0.811	0.688	0.744	0.436	0.270	0.334	0.742	0.803	0.771	0.000	0.250	0.000	0.844	0.609	0.708	0.736	0.840	0.785	0.783	0.957	0.861	0.973	0.470	0.634	0.774	0.533	0.631	0.634	0.799	0.707	0.947	0.740	0.831	0.457	0.500	0.477	0.718	0.684	0.701	0.804	0.809	0.807	0.812	0.539	0.648	0.385	0.322	0.351	0.597	0.734	0.659	0.555	0.889	0.684	0.468	0.477	0.472	0.485	0.740	0.586	0.676	0.495	0.572
33	Liu_CQUPT_task1_1	Audio-only CLAP Ensemble with EMA and Test-Time Augmentation	True	Liu_CQUPT_2026	0.483	0.621	0.543	0.748	0.645	0.693	0.766	0.646	0.701	0.423	0.267	0.327	0.695	0.873	0.774	0.000	0.250	0.000	0.760	0.688	0.722	0.714	0.814	0.761	0.854	0.952	0.901	0.795	0.230	0.357	0.925	0.651	0.764	0.650	0.802	0.718	0.881	0.694	0.776	0.519	0.588	0.551	0.744	0.768	0.756	0.804	0.891	0.846	0.757	0.718	0.737	0.517	0.339	0.409	0.601	0.771	0.675	0.668	0.858	0.751	0.281	0.361	0.316	0.448	0.743	0.559	0.668	0.542	0.598
34	Kil_Medisensing_task1_4	audio posterior ensemble with conservative argmax decoding	True	Kil_Medisensing_2026	0.861	0.732	0.791	0.918	0.503	0.650	0.719	0.650	0.683	0.979	0.752	0.851	0.692	0.884	0.777	0.000	0.250	0.000	0.000	0.305	0.000	0.600	0.637	0.618	0.949	0.887	0.917	0.375	0.015	0.029	0.926	0.352	0.510	0.544	0.543	0.544	0.534	0.925	0.677	0.824	0.524	0.641	0.475	0.783	0.591	0.752	0.795	0.773	0.688	0.662	0.674	0.000	0.242	0.000	0.569	0.495	0.530	0.955	0.316	0.475	0.219	0.248	0.232	0.245	0.642	0.355	0.741	0.554	0.634
35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	False	Sharma_IR_2026	0.557	0.544	0.550	0.738	0.522	0.611	0.589	0.866	0.701	0.312	0.096	0.146	0.688	0.706	0.697	0.000	0.000	0.000	0.683	0.602	0.640	0.520	0.320	0.397	0.750	0.952	0.839	0.795	0.250	0.380	0.675	0.298	0.414	0.515	0.609	0.558	0.745	0.476	0.581	0.500	0.317	0.388	0.427	0.656	0.517	0.448	0.799	0.574	0.559	0.627	0.591	0.207	0.342	0.258	0.316	0.690	0.433	0.750	0.566	0.645	0.421	0.294	0.346	0.330	0.601	0.426	0.571	0.100	0.171
36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	False	Han_CSU_2026	0.458	0.500	0.478	0.556	0.705	0.622	0.635	0.597	0.616	0.502	0.494	0.498	0.337	0.400	0.366	0.072	0.000	0.000	0.022	0.164	0.039	0.483	0.271	0.348	0.616	0.610	0.613	0.889	0.198	0.323	0.565	0.296	0.388	0.313	0.427	0.361	0.493	0.270	0.349	0.365	0.366	0.365	0.343	0.416	0.376	0.375	0.252	0.301	0.434	0.586	0.498	0.343	0.367	0.355	0.212	0.225	0.218	0.530	0.408	0.461	0.250	0.174	0.205	0.287	0.478	0.359	0.625	0.316	0.420
37	Han_CSU_task1_3	Multi-Embedding System with Hierarchical Proxy Learning_3	False	Han_CSU_2026	0.366	0.446	0.402	0.476	0.645	0.548	0.520	0.456	0.486	0.390	0.338	0.362	0.177	0.204	0.189	0.075	0.125	0.094	0.044	0.141	0.067	0.361	0.205	0.261	0.552	0.518	0.535	0.833	0.138	0.236	0.565	0.281	0.375	0.250	0.328	0.284	0.304	0.188	0.232	0.260	0.277	0.268	0.312	0.379	0.342	0.308	0.238	0.268	0.407	0.554	0.469	0.325	0.331	0.328	0.193	0.206	0.199	0.453	0.382	0.415	0.197	0.174	0.185	0.238	0.377	0.292	0.548	0.336	0.416
38	Han_CSU_task1_4	Multi-Embedding System with Hierarchical Proxy Learning_4	False	Han_CSU_2026	0.318	0.355	0.335	0.475	0.611	0.534	0.622	0.521	0.567	0.373	0.368	0.370	0.231	0.282	0.254	0.069	0.125	0.089	0.022	0.141	0.038	0.385	0.209	0.271	0.465	0.461	0.463	0.729	0.110	0.191	0.295	0.181	0.224	0.254	0.355	0.296	0.302	0.213	0.249	0.273	0.287	0.280	0.269	0.328	0.295	0.324	0.244	0.278	0.324	0.390	0.354	0.344	0.339	0.341	0.237	0.243	0.240	0.446	0.311	0.367	0.196	0.148	0.169	0.254	0.358	0.297	0.410	0.186	0.256
39	Zhang_XJTLU_task1_3	no-ranker ensemble	False	Zhang_XJTLU_2026	0.087	0.173	0.116	0.115	0.188	0.143	0.164	0.118	0.137	0.068	0.101	0.081	0.083	0.039	0.053	0.000	0.125	0.000	0.054	0.258	0.089	0.242	0.090	0.131	0.396	0.026	0.049	0.000	0.015	0.000	0.583	0.028	0.054	0.179	0.366	0.240	0.375	0.216	0.274	0.292	0.110	0.159	0.156	0.223	0.184	0.250	0.262	0.256	0.265	0.166	0.204	0.101	0.294	0.150	0.159	0.241	0.192	0.013	0.007	0.009	0.000	0.007	0.000	0.149	0.142	0.145	0.000	0.000	0.000
40	Zhang_XJTLU_task1_4	external diversity low-risk	False	Zhang_XJTLU_2026	0.114	0.210	0.148	0.122	0.227	0.159	0.167	0.132	0.147	0.087	0.134	0.106	0.167	0.065	0.093	0.000	0.125	0.000	0.000	0.328	0.000	0.235	0.074	0.113	0.550	0.026	0.050	0.000	0.015	0.000	0.583	0.036	0.067	0.184	0.374	0.247	0.375	0.198	0.259	0.205	0.091	0.126	0.168	0.230	0.194	0.000	0.252	0.000	0.230	0.179	0.201	0.100	0.272	0.146	0.094	0.236	0.134	0.042	0.007	0.012	0.000	0.007	0.000	0.130	0.130	0.130	0.000	0.000	0.000
41	Han_CSU_task1_2	Multi-Embedding System with Hierarchical Proxy Learning_2	False	Han_CSU_2026	0.055	0.000	0.000	0.015	0.042	0.022	0.016	0.042	0.023	0.105	0.132	0.117	0.099	0.116	0.107	0.064	0.000	0.000	0.022	0.070	0.034	0.117	0.137	0.126	0.103	0.053	0.070	0.000	0.045	0.000	0.092	0.036	0.052	0.168	0.198	0.182	0.158	0.167	0.162	0.175	0.189	0.182	0.133	0.180	0.153	0.199	0.170	0.184	0.197	0.155	0.174	0.093	0.156	0.116	0.139	0.150	0.144	0.089	0.054	0.067	0.109	0.116	0.112	0.131	0.156	0.142	0.056	0.051	0.054
42	Zhang_XJTLU_task1_2	balanced ranker/base	False	Zhang_XJTLU_2026	0.115	0.210	0.149	0.131	0.246	0.171	0.134	0.139	0.136	0.062	0.090	0.073	0.000	0.021	0.000	0.000	0.125	0.000	0.000	0.258	0.000	0.220	0.059	0.092	0.475	0.026	0.050	0.000	0.015	0.000	0.583	0.028	0.054	0.179	0.374	0.242	0.300	0.198	0.239	0.250	0.101	0.143	0.172	0.223	0.194	0.000	0.240	0.000	0.257	0.166	0.202	0.095	0.256	0.138	0.107	0.243	0.149	0.029	0.007	0.011	0.000	0.007	0.000	0.152	0.149	0.150	0.000	0.000	0.000
43	Zhang_XJTLU_task1_1	aggressive ranker	False	Zhang_XJTLU_2026	0.117	0.188	0.144	0.096	0.151	0.117	0.037	0.132	0.058	0.062	0.086	0.072	0.071	0.039	0.051	0.000	0.125	0.000	0.000	0.141	0.000	0.271	0.152	0.195	0.125	0.005	0.009	0.000	0.007	0.000	0.188	0.008	0.015	0.205	0.401	0.271	0.188	0.220	0.202	0.062	0.073	0.067	0.180	0.201	0.190	0.219	0.262	0.239	0.180	0.151	0.164	0.099	0.239	0.140	0.010	0.208	0.019	0.066	0.021	0.032	0.125	0.007	0.013	0.073	0.089	0.080	0.062	0.007	0.013

BST Top-level performance

Rank	Submission Information				Overall metrics			Class-wise metrics
System rank	Submission label	Submission name	Audio only	Technical Report	P	R	F	P - Music (m)	R - Music (m)	F - Music (m)	P - Instrument samples (is)	R - Instrument samples (is)	F - Instrument samples (is)	P - Speech (sp)	R - Speech (sp)	F - Speech (sp)	P - Sound effects (fx)	R - Sound effects (fx)	F - Sound effects (fx)	P - Soundscapes (ss)	R - Soundscapes (ss)	F - Soundscapes (ss)
1	Primus_CPJKU_task1_3	Ensemble 3	False	Primus_CPJKU_2026	0.908	0.876	0.889	0.951	0.858	0.902	0.873	0.971	0.919	0.955	0.846	0.897	0.864	0.953	0.907	0.898	0.752	0.819
2	Primus_CPJKU_task1_2	Ensemble 2	False	Primus_CPJKU_2026	0.909	0.873	0.887	0.967	0.853	0.906	0.877	0.971	0.921	0.955	0.840	0.894	0.858	0.955	0.904	0.887	0.748	0.811
3	Primus_CPJKU_task1_4	Ensemble 4	False	Primus_CPJKU_2026	0.908	0.875	0.888	0.947	0.868	0.905	0.877	0.971	0.921	0.960	0.823	0.886	0.864	0.955	0.908	0.893	0.757	0.820
4	Primus_CPJKU_task1_1	Ensemble 1	False	Primus_CPJKU_2026	0.903	0.871	0.884	0.935	0.848	0.889	0.857	0.966	0.908	0.961	0.834	0.893	0.866	0.951	0.906	0.898	0.757	0.822
5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	False	Huang_WHU_2026	0.885	0.863	0.872	0.878	0.882	0.880	0.902	0.946	0.924	0.973	0.829	0.895	0.852	0.909	0.880	0.818	0.748	0.781
6	Huang_WHU_task1_1	HGA-EMA-STFT system	False	Huang_WHU_2026	0.881	0.855	0.866	0.921	0.863	0.891	0.893	0.937	0.914	0.958	0.789	0.865	0.857	0.921	0.888	0.778	0.767	0.772
7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.891	0.864	0.875	0.876	0.897	0.886	0.904	0.917	0.910	0.966	0.811	0.882	0.862	0.935	0.897	0.846	0.757	0.799
8	Huang_WHU_task1_4	Six-model 5-fold logit ensemble	False	Huang_WHU_2026	0.888	0.859	0.872	0.909	0.877	0.893	0.905	0.927	0.916	0.965	0.794	0.871	0.854	0.927	0.889	0.806	0.771	0.788
9	Guan_HEU_task1_4	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.888	0.862	0.873	0.880	0.897	0.888	0.906	0.937	0.921	0.959	0.800	0.872	0.858	0.931	0.893	0.839	0.743	0.788
10	Huang_WHU_task1_3	STFT-distill logit ensemble-5	False	Huang_WHU_2026	0.887	0.858	0.870	0.904	0.877	0.891	0.904	0.922	0.913	0.965	0.794	0.871	0.853	0.929	0.890	0.809	0.767	0.787
11	Guan_HEU_task1_2	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.879	0.851	0.862	0.835	0.892	0.863	0.916	0.902	0.909	0.964	0.771	0.857	0.867	0.933	0.899	0.811	0.757	0.783
12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	False	Zheng_SCUT_2026	0.882	0.864	0.872	0.871	0.892	0.881	0.890	0.912	0.901	0.987	0.840	0.907	0.857	0.899	0.877	0.807	0.776	0.791
13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.878	0.852	0.863	0.869	0.848	0.859	0.896	0.922	0.909	0.966	0.806	0.879	0.850	0.917	0.882	0.809	0.767	0.787
14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	False	Lin_JKU_2026	0.856	0.830	0.839	0.891	0.804	0.845	0.812	0.946	0.874	0.957	0.771	0.854	0.853	0.907	0.879	0.764	0.724	0.743
15	Guan_HEU_task1_1	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026	0.877	0.853	0.863	0.850	0.892	0.871	0.908	0.917	0.913	0.960	0.823	0.886	0.846	0.911	0.877	0.822	0.724	0.770
16	Kucukoglu_NYU_task1_4	Three-Modality HATR with Whisper and Mixup	False	Kucukoglu_NYU_2026	0.875	0.851	0.861	0.902	0.858	0.879	0.869	0.941	0.904	0.960	0.829	0.890	0.835	0.899	0.865	0.810	0.729	0.767
17	Kucukoglu_NYU_task1_3	Single HATR Model with Cross-Fold Augmentation	False	Kucukoglu_NYU_2026	0.870	0.839	0.853	0.855	0.868	0.861	0.904	0.873	0.888	0.953	0.817	0.880	0.827	0.911	0.867	0.810	0.729	0.767
18	Lin_JKU_task1_2	Frozen-CLAP ensemble with logit adjustment (balanced)	False	Lin_JKU_2026	0.847	0.821	0.829	0.909	0.784	0.842	0.765	0.951	0.848	0.957	0.771	0.854	0.841	0.887	0.863	0.764	0.710	0.736
19	Lin_JKU_task1_4	Clean concatenation CLAP probe (10k-train only)	False	Lin_JKU_2026	0.873	0.845	0.857	0.865	0.877	0.871	0.895	0.912	0.903	0.946	0.794	0.863	0.840	0.917	0.877	0.817	0.724	0.768
20	Kucukoglu_NYU_task1_2	CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.879	0.855	0.865	0.859	0.863	0.861	0.900	0.917	0.908	0.966	0.800	0.875	0.853	0.913	0.882	0.820	0.781	0.800
21	Lin_JKU_task1_3	Gated multimodal CLAP fusion with pseudo-labels	False	Lin_JKU_2026	0.876	0.845	0.857	0.924	0.833	0.876	0.851	0.946	0.896	0.966	0.806	0.879	0.847	0.929	0.886	0.793	0.710	0.749
22	Baseline_UPF_task1_2	HATR baseline (multimodal)	False	Baseline_UPF_2026	0.783	0.771	0.775	0.732	0.735	0.733	0.774	0.717	0.744	0.922	0.806	0.860	0.831	0.838	0.835	0.657	0.757	0.704
23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.843	0.802	0.817	0.899	0.789	0.841	0.846	0.912	0.878	0.962	0.714	0.820	0.805	0.897	0.849	0.702	0.695	0.699
24	Colotti_TAU_task1_2	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026	0.841	0.770	0.794	0.901	0.760	0.824	0.879	0.917	0.897	0.991	0.606	0.752	0.764	0.911	0.831	0.670	0.657	0.663
25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	False	Kil_Medisensing_2026	0.782	0.772	0.773	0.729	0.725	0.727	0.765	0.746	0.756	0.887	0.720	0.795	0.877	0.850	0.863	0.652	0.819	0.726
26	Kil_Medisensing_task1_2	raw parent specialist posterior stack	False	Kil_Medisensing_2026	0.782	0.772	0.773	0.729	0.725	0.727	0.765	0.746	0.756	0.887	0.720	0.795	0.877	0.850	0.863	0.652	0.819	0.726
27	Colotti_TAU_task1_4	CLAP-MoE Feature-wise Gated Fusion	False	Colotti_TAU_2026	0.827	0.769	0.788	0.819	0.819	0.819	0.905	0.790	0.844	0.964	0.611	0.748	0.802	0.911	0.853	0.644	0.714	0.677
28	Colotti_TAU_task1_3	EnhancedBaseClassifier	False	Colotti_TAU_2026	0.814	0.756	0.777	0.880	0.824	0.851	0.901	0.800	0.848	0.887	0.583	0.703	0.751	0.874	0.808	0.653	0.700	0.676
29	Kil_Medisensing_task1_3	Larger-CLAP classifier with metadata confidence gate	False	Kil_Medisensing_2026	0.771	0.758	0.761	0.704	0.770	0.735	0.781	0.678	0.726	0.882	0.726	0.796	0.843	0.838	0.841	0.647	0.776	0.706
30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	True	Liu_CQUPT_2026	0.775	0.755	0.761	0.720	0.706	0.713	0.732	0.732	0.732	0.933	0.714	0.809	0.845	0.848	0.846	0.644	0.776	0.704
31	Liu_CQUPT_task1_2	Weighted Audio-only CLAP Ensemble with BSD10k Specialist	True	Liu_CQUPT_2026	0.776	0.752	0.760	0.715	0.701	0.708	0.730	0.712	0.721	0.948	0.726	0.822	0.840	0.860	0.850	0.645	0.762	0.699
32	Baseline_UPF_task1_1	HATR baseline (audio)	True	Baseline_UPF_2026	0.768	0.758	0.760	0.719	0.740	0.729	0.750	0.688	0.718	0.914	0.789	0.847	0.826	0.808	0.817	0.629	0.767	0.691
33	Liu_CQUPT_task1_1	Audio-only CLAP Ensemble with EMA and Test-Time Augmentation	True	Liu_CQUPT_2026	0.774	0.751	0.759	0.714	0.696	0.705	0.730	0.712	0.721	0.948	0.726	0.822	0.838	0.858	0.848	0.643	0.762	0.697
34	Kil_Medisensing_task1_4	audio posterior ensemble with conservative argmax decoding	True	Kil_Medisensing_2026	0.777	0.682	0.708	0.894	0.662	0.761	0.820	0.800	0.810	1.000	0.520	0.684	0.715	0.818	0.763	0.456	0.610	0.521
35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	False	Sharma_IR_2026	0.701	0.628	0.653	0.728	0.721	0.724	0.672	0.420	0.517	0.917	0.697	0.792	0.634	0.824	0.717	0.552	0.481	0.514
36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	False	Han_CSU_2026	0.539	0.517	0.521	0.574	0.647	0.608	0.382	0.483	0.427	0.681	0.440	0.535	0.571	0.575	0.573	0.487	0.438	0.461
37	Han_CSU_task1_3	Multi-Embedding System with Hierarchical Proxy Learning_3	False	Han_CSU_2026	0.475	0.452	0.456	0.481	0.574	0.523	0.314	0.371	0.340	0.619	0.371	0.464	0.525	0.530	0.528	0.437	0.414	0.425
38	Han_CSU_task1_4	Multi-Embedding System with Hierarchical Proxy Learning_4	False	Han_CSU_2026	0.441	0.421	0.425	0.472	0.529	0.499	0.301	0.390	0.340	0.491	0.314	0.383	0.510	0.530	0.520	0.431	0.343	0.382
39	Zhang_XJTLU_task1_3	no-ranker ensemble	False	Zhang_XJTLU_2026	0.302	0.204	0.188	0.175	0.260	0.209	0.198	0.161	0.177	0.667	0.034	0.065	0.352	0.516	0.418	0.119	0.048	0.068
40	Zhang_XJTLU_task1_4	external diversity low-risk	False	Zhang_XJTLU_2026	0.354	0.217	0.200	0.189	0.304	0.233	0.212	0.185	0.198	0.875	0.040	0.077	0.361	0.510	0.422	0.135	0.048	0.070
41	Han_CSU_task1_2	Multi-Embedding System with Hierarchical Proxy Learning_2	False	Han_CSU_2026	0.194	0.190	0.187	0.067	0.069	0.068	0.160	0.229	0.188	0.157	0.074	0.101	0.370	0.395	0.382	0.217	0.181	0.197
42	Zhang_XJTLU_task1_2	balanced ranker/base	False	Zhang_XJTLU_2026	0.320	0.207	0.189	0.193	0.314	0.239	0.171	0.137	0.152	0.750	0.034	0.066	0.351	0.500	0.412	0.136	0.052	0.076
43	Zhang_XJTLU_task1_1	aggressive ranker	False	Zhang_XJTLU_2026	0.239	0.191	0.175	0.150	0.240	0.185	0.170	0.151	0.160	0.375	0.017	0.033	0.347	0.478	0.402	0.152	0.067	0.093
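
Note on the overall metrics: in the rows of the table above, the overall P, R and F values match the unweighted (macro) mean of the five class-wise values. This is an observation from the published numbers, not a definition stated on this page. The sketch below recomputes the overall values for Primus_CPJKU_task1_3; the helper name `macro_average` and the dictionary layout are illustrative assumptions.

```python
from statistics import mean

# Class-wise (P, R, F) for Primus_CPJKU_task1_3, copied from the table above.
# Keys are the top-level BST abbreviations used in the column headers.
class_wise = {
    "m":  (0.951, 0.858, 0.902),   # Music
    "is": (0.873, 0.971, 0.919),   # Instrument samples
    "sp": (0.955, 0.846, 0.897),   # Speech
    "fx": (0.864, 0.953, 0.907),   # Sound effects
    "ss": (0.898, 0.752, 0.819),   # Soundscapes
}


def macro_average(scores):
    """Unweighted mean of each metric over classes, rounded to 3 decimals.

    Hypothetical helper: assumes every class contributes equally,
    regardless of how many test clips it contains.
    """
    columns = zip(*scores.values())  # regroup per-class tuples into P, R, F columns
    return tuple(round(mean(col), 3) for col in columns)


overall = macro_average(class_wise)
print(overall)
```

For this row the result agrees with the reported overall P, R and F of 0.908, 0.876 and 0.889. Because the class-wise values are themselves rounded to three decimals, a recomputed mean may differ from the published figure by 0.001 for other rows.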

Development set performance

Rank	Submission Information				BSD10k-v1.2			BSD35k-CS
System rank	Submission label	Submission name	Audio only	Technical Report	hP	hR	hF	hP	hR	hF
1	Primus_CPJKU_task1_3	Ensemble 3	False	Primus_CPJKU_2026	0.833	0.838	0.834
2	Primus_CPJKU_task1_2	Ensemble 2	False	Primus_CPJKU_2026	0.831	0.837	0.832
3	Primus_CPJKU_task1_4	Ensemble 4	False	Primus_CPJKU_2026	0.828	0.832	0.829
4	Primus_CPJKU_task1_1	Ensemble 1	False	Primus_CPJKU_2026	0.830	0.831	0.830
5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	False	Huang_WHU_2026
6	Huang_WHU_task1_1	HGA-EMA-STFT system	False	Huang_WHU_2026
7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026
8	Huang_WHU_task1_4	Six-model 5-fold logit ensemble	False	Huang_WHU_2026
9	Guan_HEU_task1_4	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026
10	Huang_WHU_task1_3	STFT-distill logit ensemble-5	False	Huang_WHU_2026
11	Guan_HEU_task1_2	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026
12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	False	Zheng_SCUT_2026	0.795	0.783	0.787	0.811	0.794	0.798
13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.822	0.805	0.811
14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	False	Lin_JKU_2026	0.795	0.805	0.794
15	Guan_HEU_task1_1	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	False	Guan_HEU_2026
16	Kucukoglu_NYU_task1_4	Three-Modality HATR with Whisper and Mixup	False	Kucukoglu_NYU_2026	0.795	0.787	0.788
17	Kucukoglu_NYU_task1_3	Single HATR Model with Cross-Fold Augmentation	False	Kucukoglu_NYU_2026	0.798	0.790	0.791
18	Lin_JKU_task1_2	Frozen-CLAP ensemble with logit adjustment (balanced)	False	Lin_JKU_2026	0.772	0.805	0.779
19	Lin_JKU_task1_4	Clean concatenation CLAP probe (10k-train only)	False	Lin_JKU_2026	0.788	0.782	0.781
20	Kucukoglu_NYU_task1_2	CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification	False	Kucukoglu_NYU_2026	0.818	0.808	0.811
21	Lin_JKU_task1_3	Gated multimodal CLAP fusion with pseudo-labels	False	Lin_JKU_2026	0.789	0.798	0.790
22	Baseline_UPF_task1_2	HATR baseline (multimodal)	False	Baseline_UPF_2026
23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026
24	Colotti_TAU_task1_2	Audio Classification using Attention-based Cleaned Multimodal Embeddings	False	Colotti_TAU_2026
25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	False	Kil_Medisensing_2026	0.804	0.815	0.805
26	Kil_Medisensing_task1_2	raw parent specialist posterior stack	False	Kil_Medisensing_2026	0.805	0.812	0.804
27	Colotti_TAU_task1_4	CLAP-MoE Feature-wise Gated Fusion	False	Colotti_TAU_2026
28	Colotti_TAU_task1_3	EnhancedBaseClassifier	False	Colotti_TAU_2026
29	Kil_Medisensing_task1_3	Larger-CLAP classifier with metadata confidence gate	False	Kil_Medisensing_2026	0.753	0.717	0.729
30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	True	Liu_CQUPT_2026
31	Liu_CQUPT_task1_2	Weighted Audio-only CLAP Ensemble with BSD10k Specialist	True	Liu_CQUPT_2026
32	Baseline_UPF_task1_1	HATR baseline (audio)	True	Baseline_UPF_2026
33	Liu_CQUPT_task1_1	Audio-only CLAP Ensemble with EMA and Test-Time Augmentation	True	Liu_CQUPT_2026
34	Kil_Medisensing_task1_4	audio posterior ensemble with conservative argmax decoding	True	Kil_Medisensing_2026	0.825	0.806	0.809
35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	False	Sharma_IR_2026
36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	False	Han_CSU_2026	0.854	0.840	0.846	0.806	0.835	0.810
37	Han_CSU_task1_3	Multi-Embedding System with Hierarchical Proxy Learning_3	False	Han_CSU_2026	0.852	0.836	0.843	0.791	0.823	0.798
38	Han_CSU_task1_4	Multi-Embedding System with Hierarchical Proxy Learning_4	False	Han_CSU_2026	0.851	0.841	0.845	0.803	0.827	0.806
39	Zhang_XJTLU_task1_3	no-ranker ensemble	False	Zhang_XJTLU_2026			0.833
40	Zhang_XJTLU_task1_4	external diversity low-risk	False	Zhang_XJTLU_2026			0.832
41	Han_CSU_task1_2	Multi-Embedding System with Hierarchical Proxy Learning_2	False	Han_CSU_2026	0.856	0.835	0.844	0.827	0.851	0.832
42	Zhang_XJTLU_task1_2	balanced ranker/base	False	Zhang_XJTLU_2026			0.834
43	Zhang_XJTLU_task1_1	aggressive ranker	False	Zhang_XJTLU_2026			0.835

System characteristics

Rank	Submission Information			Representations			Method		Data			Complexity
System rank	Submission label	Submission name	Technical Report	Sampling rate	Audio representation	Text representation	Hierarchical setting	Machine learning method	Data augmentation	External data usage	External datasets	MACS (G)	Total params
1	Primus_CPJKU_task1_3	Ensemble 3	Primus_CPJKU_2026		CP-CLAP PaSST, M2D	title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa		transformer, contrastive-learning, LLM		embeddings, LLM			383000000.0
2	Primus_CPJKU_task1_2	Ensemble 2	Primus_CPJKU_2026		CP-CLAP PaSST, M2D	title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa		transformer, contrastive-learning, LLM		embeddings, LLM			383000000.0
3	Primus_CPJKU_task1_4	Ensemble 4	Primus_CPJKU_2026		CLAP, PaSST, M2D-CLAP, CP-CLAP	title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa		transformer, contrastive-learning, LLM		embeddings, LLM			539000000.0
4	Primus_CPJKU_task1_1	Ensemble 1	Primus_CPJKU_2026		BEATs, CP-CLAP, PaSST, CLAP	title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa		transformer, contrastive-learning, LLM		embeddings, LLM			543000000.0
5	Huang_WHU_task1_2	STFT-distill logit ensemble-1	Huang_WHU_2026	16kHz	CLAP, log-STFT, log-mel energies	title, tags, description, CLAP, metadata cleaning	multiple classifiers	MLP, transformer	random crop, time masking	embeddings	BSD35k	28.361	97540455.0
6	Huang_WHU_task1_1	HGA-EMA-STFT system	Huang_WHU_2026	16kHz	CLAP, STFT			MLP, transformer	time masking, random crop	embeddings	BSD35k	12.921	2728920.0
7	Guan_HEU_task1_3	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	Guan_HEU_2026	48kHz	CLAP			MLP		embeddings		1.000	326592.0
8	Huang_WHU_task1_4	Six-model 5-fold logit ensemble	Huang_WHU_2026	16kHz	CLAP, log-STFT, log-mel energies, MFCC	title, tags, description, CLAP, metadata cleaning	multiple classifiers	MLP, CNN, transformer, ensemble	time masking, random crop	embeddings, datasets	BSD35k	30.555	210700090.0
9	Guan_HEU_task1_4	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	Guan_HEU_2026	48kHz	CLAP			MLP		embeddings
10	Huang_WHU_task1_3	STFT-distill logit ensemble-5	Huang_WHU_2026	16kHz	CLAP, log-STFT, log-mel energies	title, tags, description, CLAP, metadata cleaning	multiple classifiers	MLP, CNN, transformer, ensemble	random crop, time masking	embeddings, datasets	BSD35k	28.361	97540455.0
11	Guan_HEU_task1_2	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	Guan_HEU_2026	48kHz	CLAP			MLP		embeddings		11.000	15474693.0
12	Zheng_SCUT_task1_1	Official BSD10k Baseline (Multimodal HATR + CLAP)	Zheng_SCUT_2026	44.1kHz	CLAP			MLP		embeddings			7319513.0
13	Kucukoglu_NYU_task1_1	Multimodal Ensemble System for Heterogeneous Audio Classification	Kucukoglu_NYU_2026	48kHz, 32kHz	CLAP, ConvNeXt, Fine-tuned CLAP		loss function	HATR, ensemble, logit averaging	noise addition, random masking, mixup, balanced resampling	embeddings			1889944.0
14	Lin_JKU_task1_1	Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary)	Lin_JKU_2026	48kHz, 32kHz	CLAP, PaSST	title, tags, description, CLAP, RoBERTa		MLP, transformer, ensemble	time masking, frequency masking, mixup	embeddings		163.900	211645044.0
15	Guan_HEU_task1_1	Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations	Guan_HEU_2026	48kHz	CLAP			MLP		embeddings		8.000	25184262.0
16	Kucukoglu_NYU_task1_4	Three-Modality HATR with Whisper and Mixup	Kucukoglu_NYU_2026	48kHz, 16kHz	CLAP, Whisper-large-v3			Three-modality HATR	mixup, noise addition, random masking	embeddings			1115669.0
17	Kucukoglu_NYU_task1_3	Single HATR Model with Cross-Fold Augmentation	Kucukoglu_NYU_2026	48kHz	CLAP			HATR	cross-fold embedding swap, noise addition, random masking	embeddings			377989.0
18	Lin_JKU_task1_2	Frozen-CLAP ensemble with logit adjustment (balanced)	Lin_JKU_2026	48kHz, 32kHz	CLAP, PaSST	title, tags, description, CLAP, RoBERTa		MLP, transformer, ensemble	time masking, frequency masking, mixup	embeddings		163.900	211645044.0
19	Lin_JKU_task1_4	Clean concatenation CLAP probe (10k-train only)	Lin_JKU_2026	48kHz	CLAP			MLP		embeddings		19.400	538647.0
20	Kucukoglu_NYU_task1_2	CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification	Kucukoglu_NYU_2026	48kHz, 32kHz	CLAP, ConvNeXt		loss function	HATR, ensemble, logit averaging	noise addition, random masking, mixup, balanced resampling, combined augmentation	embeddings			1889944.0
21	Lin_JKU_task1_3	Gated multimodal CLAP fusion with pseudo-labels	Lin_JKU_2026	48kHz	CLAP			MLP		embeddings		19.400	1327639.0
22	Baseline_UPF_task1_2	HATR baseline (multimodal)	Baseline_UPF_2026	48kHz	CLAP			attention fusion, MLP	noise addition, time masking	embeddings
23	Colotti_TAU_task1_1	Audio Classification using Attention-based Cleaned Multimodal Embeddings	Colotti_TAU_2026	48kHz	CLAP	description, tags, sentence transformer, all-mpnet-base-v2, BERT		attention, hyperbolic neural networks		embeddings, model weights		7.900	380342496.0
24	Colotti_TAU_task1_2	Audio Classification using Attention-based Cleaned Multimodal Embeddings	Colotti_TAU_2026	48kHz	CLAP	description, tags, sentence transformer, all-mpnet-base-v2, BERT		attention, hyperbolic neural networks		embeddings, model weights		7.900	380342496.0
25	Kil_Medisensing_task1_1	metadata gate target mask posterior stack	Kil_Medisensing_2026		CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF	title, tags, description, TF-IDF helper gates	second-level classifier, same-parent metadata override, hierarchy-aware posteriors	weighted posterior stacking, metadata target-mask gate		embeddings, datasets
26	Kil_Medisensing_task1_2	raw parent specialist posterior stack	Kil_Medisensing_2026		CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF	title, tags, description, conservative helper gates	parent-local specialists, second-level predictions constrained by hierarchy	weighted posterior stacking, raw-embedding parent-specialist gate		embeddings, datasets
27	Colotti_TAU_task1_4	CLAP-MoE Feature-wise Gated Fusion	Colotti_TAU_2026	48kHz	CLAP	tags, description, CLAP	top-level classifier as router, expert second-level classifiers	MLP, feature-wise gated multimodal fusion	noise addition	embeddings		4.140	369991736.0
28	Colotti_TAU_task1_3	EnhancedBaseClassifier	Colotti_TAU_2026	48kHz	CLAP	tags, description, CLAP, SentenceBERT, sentence transformer		MLP	baseline implemented masking and augmentation	embeddings		9.400	282000000.0
29	Kil_Medisensing_task1_3	Larger-CLAP classifier with metadata confidence gate	Kil_Medisensing_2026		CLAP	title, tags, description, TF-IDF	second-level classifier, same-parent metadata override	MLP		datasets, embeddings	FSD50K	0.011	10704631.0
30	Liu_CQUPT_task1_3	PaSST and CLAP Audio Ensemble	Liu_CQUPT_2026	48kHz, 32kHz	CLAP, PaSST		loss function	ensemble	mixup, feature masking, auxiliary text supervision during training, test-time augmentation	embeddings	AudioSet		254784782.0
31	Liu_CQUPT_task1_2	Weighted Audio-only CLAP Ensemble with BSD10k Specialist	Liu_CQUPT_2026	48kHz	CLAP		loss function	residual gated classifiers, knowledge distillation, weighted ensemble	embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation	embeddings			197911292.0
32	Baseline_UPF_task1_1	HATR baseline (audio)	Baseline_UPF_2026	48kHz	CLAP			MLP	noise addition, time masking	embeddings
33	Liu_CQUPT_task1_1	Audio-only CLAP Ensemble with EMA and Test-Time Augmentation	Liu_CQUPT_2026	48kHz	CLAP		loss function	residual gated classifiers, knowledge distillation, ensemble	embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation	embeddings			152641185.0
34	Kil_Medisensing_task1_4	audio posterior ensemble with conservative argmax decoding	Kil_Medisensing_2026		CLAP, M2D-CLAP		multiple classifiers, hierarchical ridge branch	ridge ensemble, posterior averaging		embeddings
35	Sharma_IR_task1_1	Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification	Sharma_IR_2026	32kHz, 48kHz	CLAP, PANNs		loss function		dimension masking	embeddings		1.036	261098881.0
36	Han_CSU_task1_1	Multi-Embedding System with Hierarchical Proxy Learning_1	Han_CSU_2026	44.1kHz	BEATs, ATST-Frame, fPaSST, PaSST, CLAP, M2D		loss function	MLP	noise addition, time masking	embeddings	BEATs, ATST-Frame, fPaSST, PaSST,	718.773	3717566529.0
37	Han_CSU_task1_3	Multi-Embedding System with Hierarchical Proxy Learning_3	Han_CSU_2026	44.1kHz	ATST-Frame, PaSST, CLAP, M2D		loss function	MLP	noise addition, time masking	embeddings	ATST-Frame, PaSST,	466.859	2332946777.0
38	Han_CSU_task1_4	Multi-Embedding System with Hierarchical Proxy Learning_4	Han_CSU_2026	44.1kHz	BEATs, CLAP		loss function	MLP	noise addition, time masking	embeddings		200.345	676412058.0
39	Zhang_XJTLU_task1_3	no-ranker ensemble	Zhang_XJTLU_2026	48kHz	CLAP	title, tags, description, CLAP, safe metadata scalar features	parent-aware calibration, second-level candidate override	ensemble, supervision with noisy labels	teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components	embeddings, datasets	BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
40	Zhang_XJTLU_task1_4	external diversity low-risk	Zhang_XJTLU_2026	48kHz	CLAP	title, tags, description, CLAP, safe metadata scalar features	parent-aware calibration, second-level candidate override	ensemble, external-data diversity components	teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components	embeddings, datasets	BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
41	Han_CSU_task1_2	Multi-Embedding System with Hierarchical Proxy Learning_2	Han_CSU_2026	44.1kHz	CLAP, M2D		loss function	MLP	noise addition, time masking	embeddings		134.815	222974.0
42	Zhang_XJTLU_task1_2	balanced ranker/base	Zhang_XJTLU_2026	48kHz	CLAP	title, tags, description, CLAP, safe metadata scalar features	parent-aware calibration, second-level candidate override	ensemble, parent-aware candidate reranking	teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components	embeddings, datasets	BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
43	Zhang_XJTLU_task1_1	aggressive ranker	Zhang_XJTLU_2026	48kHz	CLAP	title, tags, description, CLAP, safe metadata scalar features	parent-aware calibration, second-level candidate override	ensemble, candidate-level logistic/gradient reranking, parent-aware override	teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components	embeddings, datasets	BSD35k-CS, ESC-50, UrbanSound8K, FSD50K

Technical reports

ACACIA: Audio Classification using Attention-Based Cleaned Multimodal Embeddings

Francesco Colotti¹, Kerstin Markl¹, Riccardo Casciotti¹, Javier Naranjo²

¹Tampere University, Audio Research Group, Tampere, Finland, ²Instituto Tecnologico de Informatica, Valencia, Spain

Colotti_TAU_task1_4 Colotti_TAU_task1_2 Colotti_TAU_task1_1 Colotti_TAU_task1_3

GISP@HEU's Submission for Task 1: Heterogeneous Audio Classification in the DCASE 2026 Challenge

MESH: MULTI-EMBEDDING SYSTEM WITH HIERARCHICAL PROXY LEARNING FOR DCASE 2026 TASK 1

A MULTI-BRANCH AND HIERARCHY-AWARE SYSTEM FOR BST AUDIO CLASSIFICATION

POSTERIOR STACKING AND CONSERVATIVE METADATA GATING FOR HETEROGENEOUS AUDIO CLASSIFICATION

MULTIMODAL ENSEMBLE SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION

Heterogeneous Audio Classification with Frozen Audio-Language Embeddings

HAF-CLAP: A HIERARCHICAL-AWARE MULTIMODAL CLAP SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION

CP-JKU Submission to Task 1 of the DCASE 2026 Challenge: LLM Prediction Fusion and Pseudo-Label Training for Heterogeneous Audio Classification

Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification

Noisy-Label-Aware Multimodal Ensembling with Inference-Safe Candidate Reranking for Heterogeneous Audio Classification

MULTIMODAL HATR CLASSIFICATION WITH FROZEN CLAP EMBEDDINGS FOR THE BROAD SOUND TAXONOMY
