Heterogeneous Audio Classification


Challenge results

Task description

This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal of this task is to evaluate sound classification models on diverse, real-world audio that varies widely in its nature, including duration and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based, metadata-based, and multimodal approaches to sound classification, as well as to leverage hierarchical relationships between taxonomy categories.

A more detailed task description can be found on the task description page.
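As a rough illustration of the two-level structure described above, the sketch below encodes the 5 top-level and 23 second-level categories using the short class codes that appear in the results tables on this page (for example `fx-a` for Sound effects > Animals). The `path_overlap` helper is hypothetical: it only shows how a prediction in the correct top-level category can earn partial credit, and it is not the official definition of hP, hR, or hF, which is given on the task description page.

```python
# Second-level class codes grouped by top-level code, as listed in the
# class-wise and top-level results tables of this page.
BST = {
    "m": ["m-sp", "m-si", "m-m"],
    "is": ["is-p", "is-s", "is-w", "is-k", "is-e"],
    "sp": ["sp-s", "sp-c", "sp-p"],
    "fx": ["fx-o", "fx-v", "fx-m", "fx-h", "fx-a", "fx-n", "fx-ex", "fx-el"],
    "ss": ["ss-n", "ss-i", "ss-u", "ss-s"],
}

# Reverse lookup: second-level code -> top-level code.
PARENT = {child: top for top, children in BST.items() for child in children}


def label_path(code: str) -> set[str]:
    """Return the set {top-level code, second-level code} for a second-level code."""
    return {PARENT[code], code}


def path_overlap(true_code: str, pred_code: str) -> float:
    """Hypothetical helper: fraction of the true label path matched by the prediction."""
    true_path = label_path(true_code)
    return len(true_path & label_path(pred_code)) / len(true_path)


print(len(BST), len(PARENT))         # 5 23
print(path_overlap("fx-a", "fx-a"))  # 1.0 (exact match)
print(path_overlap("fx-a", "fx-h"))  # 0.5 (right top-level, wrong second-level)
print(path_overlap("fx-a", "ss-n"))  # 0.0 (wrong top-level)
```

A flat accuracy metric would score the second and third predictions identically; a hierarchy-aware score distinguishes them.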

Leaderboard

| Official rank | System rank | Submission label | Submission name | Audio only | Technical Report | hP | hR | hF |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.842 | 0.846 | 0.836 |
| 7 | 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.746 | 0.709 | 0.699 |
| 5 | 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.824 | 0.775 | 0.787 |
| 11 | 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.422 | 0.370 | 0.372 |
| 9 | 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.657 | 0.647 | 0.629 |
| 8 | 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.732 | 0.692 |
| 4 | 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.817 | 0.782 | 0.788 |
| 2 | 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.825 | 0.810 | 0.811 |
| 10 | 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.526 | 0.488 | 0.472 |
| 6 | 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.801 | 0.785 | 0.785 |
| 3 | 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.797 | 0.799 |
| 12 | 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.162 | 0.139 | 0.116 |

Systems ranking

| System rank | Submission label | Submission name | Audio only | Technical Report | hP | hR | hF |
|---|---|---|---|---|---|---|---|
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.842 | 0.846 | 0.836 |
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | False | Primus_CPJKU_2026 | 0.842 | 0.845 | 0.836 |
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | False | Primus_CPJKU_2026 | 0.841 | 0.844 | 0.835 |
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | False | Primus_CPJKU_2026 | 0.838 | 0.842 | 0.833 |
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.825 | 0.810 | 0.811 |
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | False | Huang_WHU_2026 | 0.813 | 0.802 | 0.800 |
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.797 | 0.799 |
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | False | Huang_WHU_2026 | 0.820 | 0.794 | 0.799 |
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.818 | 0.795 | 0.798 |
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | False | Huang_WHU_2026 | 0.819 | 0.793 | 0.798 |
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.793 | 0.797 |
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.817 | 0.782 | 0.788 |
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.824 | 0.775 | 0.787 |
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.801 | 0.785 | 0.785 |
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.799 | 0.771 | 0.775 |
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | False | Kucukoglu_NYU_2026 | 0.814 | 0.764 | 0.773 |
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | False | Kucukoglu_NYU_2026 | 0.802 | 0.762 | 0.770 |
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | False | Lin_JKU_2026 | 0.774 | 0.778 | 0.768 |
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | False | Lin_JKU_2026 | 0.781 | 0.761 | 0.763 |
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.779 | 0.763 | 0.760 |
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | False | Lin_JKU_2026 | 0.769 | 0.765 | 0.752 |
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.746 | 0.709 | 0.699 |
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.732 | 0.706 | 0.699 |
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.732 | 0.692 |
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.731 | 0.691 |
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | False | Colotti_TAU_2026 | 0.724 | 0.667 | 0.669 |
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | False | Colotti_TAU_2026 | 0.710 | 0.667 | 0.664 |
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | False | Kil_Medisensing_2026 | 0.652 | 0.641 | 0.633 |
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.657 | 0.647 | 0.629 |
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | True | Liu_CQUPT_2026 | 0.644 | 0.643 | 0.623 |
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | True | Liu_CQUPT_2026 | 0.639 | 0.640 | 0.619 |
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | True | Kil_Medisensing_2026 | 0.590 | 0.552 | 0.520 |
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.526 | 0.488 | 0.472 |
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.422 | 0.370 | 0.372 |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | False | Han_CSU_2026 | 0.355 | 0.316 | 0.315 |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | False | Han_CSU_2026 | 0.331 | 0.294 | 0.294 |
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.162 | 0.139 | 0.116 |
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | False | Zhang_XJTLU_2026 | 0.154 | 0.146 | 0.101 |
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | False | Han_CSU_2026 | 0.102 | 0.105 | 0.097 |
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | False | Zhang_XJTLU_2026 | 0.142 | 0.138 | 0.095 |
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | False | Zhang_XJTLU_2026 | 0.106 | 0.127 | 0.091 |

Class-wise performance

Columns: System rank, Submission label, Submission name, Audio only, Technical Report, followed by hP, hR, and hF (in that order) for each of the 23 second-level classes, in the following class order:

Music: Solo percussion (m-sp), Solo instrument (m-si), Multiple instruments (m-m)
Instrument samples: Percussion (is-p), String (is-s), Wind (is-w), Piano / Keyboard instruments (is-k), Synths / Electronic (is-e)
Speech: Solo speech (sp-s), Conversation / Crowd (sp-c), Processed / Synthetic (sp-p)
Sound effects: Objects / House appliances (fx-o), Vehicles (fx-v), Other mechanisms, engines, machines (fx-m), Human sounds and actions (fx-h), Animals (fx-a), Natural elements and explosions (fx-n), Experimental (fx-ex), Electronic / Design (fx-el)
Soundscapes: Nature (ss-n), Indoors (ss-i), Urban (ss-u), Synthetic / Artificial (ss-s)

1 Primus_CPJKU_task1_3 Ensemble 3 False Primus_CPJKU_2026 0.940 0.728 0.821 0.908 0.864 0.885 0.839 0.815 0.827 0.807 0.991 0.889 0.834 0.933 0.881 0.600 1.000 0.750 0.963 1.000 0.981 0.858 0.836 0.847 0.920 0.933 0.926 0.955 0.703 0.809 0.854 0.717 0.779 0.850 0.937 0.891 0.879 0.944 0.910 0.825 0.878 0.851 0.909 0.889 0.899 0.831 0.990 0.904 0.734 0.922 0.818 0.536 0.728 0.617 0.904 0.762 0.827 0.932 0.788 0.854 0.757 0.539 0.630 0.847 0.784 0.814 0.881 0.775 0.824
2 Primus_CPJKU_task1_2 Ensemble 2 False Primus_CPJKU_2026 0.960 0.728 0.828 0.918 0.852 0.884 0.855 0.815 0.835 0.814 0.982 0.890 0.837 0.951 0.890 0.600 1.000 0.750 0.963 1.000 0.981 0.847 0.820 0.833 0.915 0.936 0.925 0.970 0.670 0.792 0.869 0.737 0.798 0.841 0.937 0.886 0.870 0.944 0.905 0.846 0.878 0.862 0.909 0.889 0.899 0.831 0.990 0.904 0.720 0.922 0.809 0.544 0.750 0.630 0.886 0.773 0.826 0.944 0.788 0.859 0.709 0.539 0.613 0.847 0.784 0.814 0.878 0.755 0.812
3 Primus_CPJKU_task1_4 Ensemble 4 False Primus_CPJKU_2026 0.941 0.744 0.831 0.891 0.864 0.877 0.839 0.822 0.830 0.812 0.972 0.885 0.834 0.933 0.881 0.600 1.000 0.750 0.963 1.000 0.981 0.842 0.836 0.839 0.921 0.941 0.931 0.949 0.637 0.763 0.872 0.689 0.770 0.850 0.937 0.891 0.880 0.955 0.916 0.842 0.863 0.852 0.909 0.889 0.899 0.820 0.990 0.897 0.744 0.922 0.823 0.533 0.728 0.616 0.894 0.773 0.829 0.957 0.788 0.864 0.718 0.546 0.620 0.858 0.803 0.829 0.883 0.779 0.828
4 Primus_CPJKU_task1_1 Ensemble 1 False Primus_CPJKU_2026 0.941 0.744 0.831 0.877 0.818 0.846 0.845 0.845 0.845 0.812 0.972 0.885 0.785 0.933 0.853 0.600 1.000 0.750 0.963 1.000 0.981 0.825 0.820 0.823 0.913 0.928 0.920 0.952 0.670 0.787 0.872 0.704 0.779 0.865 0.937 0.899 0.871 0.955 0.911 0.872 0.893 0.882 0.896 0.900 0.898 0.831 0.990 0.904 0.729 0.922 0.815 0.543 0.669 0.600 0.867 0.785 0.824 0.944 0.788 0.859 0.734 0.565 0.639 0.855 0.791 0.821 0.878 0.748 0.808
5 Huang_WHU_task1_2 STFT-distill logit ensemble-1 False Huang_WHU_2026 0.892 0.851 0.871 0.794 0.832 0.813 0.845 0.838 0.841 0.925 0.947 0.936 0.862 0.928 0.894 1.000 1.000 1.000 0.963 1.000 0.981 0.780 0.828 0.804 0.936 0.970 0.953 0.982 0.682 0.805 0.907 0.656 0.761 0.753 0.953 0.841 0.898 0.866 0.882 0.759 0.823 0.790 0.790 0.861 0.824 0.858 0.846 0.852 0.672 0.901 0.770 0.420 0.458 0.438 0.891 0.590 0.710 0.779 0.811 0.795 0.759 0.463 0.575 0.701 0.762 0.730 0.816 0.760 0.787
6 Huang_WHU_task1_1 HGA-EMA-STFT system False Huang_WHU_2026 0.932 0.819 0.872 0.878 0.820 0.848 0.818 0.819 0.819 0.875 0.941 0.907 0.851 0.951 0.899 1.000 1.000 1.000 1.000 1.000 1.000 0.820 0.803 0.811 0.892 0.965 0.928 1.000 0.550 0.710 0.887 0.617 0.728 0.748 0.941 0.833 0.927 0.882 0.904 0.736 0.854 0.791 0.765 0.828 0.795 0.863 0.879 0.871 0.674 0.912 0.775 0.435 0.408 0.421 0.838 0.690 0.757 0.782 0.837 0.809 0.625 0.382 0.474 0.601 0.757 0.670 0.762 0.792 0.776
7 Guan_HEU_task1_3 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.772 0.851 0.810 0.858 0.832 0.845 0.848 0.870 0.859 0.903 0.781 0.838 0.904 0.919 0.911 0.688 0.583 0.631 0.963 1.000 0.981 0.786 0.918 0.847 0.946 0.941 0.943 1.000 0.603 0.752 0.830 0.696 0.758 0.770 0.937 0.846 0.922 0.891 0.906 0.719 0.893 0.797 0.866 0.820 0.842 0.811 0.910 0.857 0.709 0.907 0.796 0.464 0.553 0.505 0.875 0.606 0.716 0.830 0.781 0.805 0.767 0.502 0.607 0.696 0.767 0.730 0.853 0.760 0.804
8 Huang_WHU_task1_4 Six-model 5-fold logit ensemble False Huang_WHU_2026 0.906 0.829 0.866 0.827 0.825 0.826 0.858 0.833 0.846 0.899 0.941 0.920 0.880 0.917 0.898 1.000 0.667 0.800 0.963 1.000 0.981 0.836 0.836 0.836 0.935 0.957 0.946 0.967 0.610 0.748 0.889 0.628 0.736 0.764 0.937 0.842 0.929 0.886 0.907 0.747 0.838 0.790 0.764 0.871 0.814 0.875 0.863 0.869 0.681 0.912 0.780 0.422 0.453 0.437 0.792 0.655 0.717 0.780 0.849 0.813 0.720 0.424 0.533 0.633 0.757 0.689 0.804 0.779 0.791
9 Guan_HEU_task1_4 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.833 0.877 0.854 0.823 0.843 0.833 0.870 0.845 0.857 0.902 0.855 0.878 0.904 0.919 0.911 0.688 0.583 0.631 0.963 1.000 0.981 0.781 0.893 0.833 0.946 0.941 0.943 1.000 0.575 0.730 0.821 0.696 0.754 0.780 0.937 0.851 0.899 0.886 0.893 0.725 0.893 0.801 0.850 0.820 0.834 0.813 0.926 0.866 0.692 0.907 0.785 0.456 0.531 0.490 0.901 0.588 0.712 0.858 0.800 0.828 0.786 0.477 0.594 0.642 0.767 0.699 0.887 0.721 0.795
10 Huang_WHU_task1_3 STFT-distill logit ensemble-5 False Huang_WHU_2026 0.890 0.829 0.859 0.832 0.825 0.829 0.861 0.845 0.853 0.896 0.917 0.907 0.880 0.917 0.898 1.000 0.667 0.800 0.963 1.000 0.981 0.829 0.836 0.832 0.936 0.965 0.950 0.967 0.610 0.748 0.904 0.628 0.741 0.761 0.941 0.841 0.929 0.886 0.907 0.747 0.838 0.790 0.793 0.871 0.830 0.867 0.863 0.865 0.673 0.912 0.775 0.403 0.453 0.427 0.797 0.667 0.726 0.782 0.837 0.809 0.696 0.394 0.503 0.629 0.757 0.687 0.804 0.779 0.791
11 Guan_HEU_task1_2 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.725 0.835 0.776 0.819 0.825 0.822 0.810 0.870 0.839 0.867 0.752 0.805 0.873 0.896 0.884 1.000 0.792 0.884 0.931 1.000 0.964 0.842 0.893 0.866 0.946 0.949 0.948 1.000 0.535 0.697 0.841 0.635 0.724 0.792 0.944 0.862 0.898 0.875 0.886 0.723 0.848 0.780 0.818 0.793 0.805 0.804 0.916 0.857 0.705 0.890 0.787 0.482 0.531 0.505 0.852 0.639 0.730 0.826 0.792 0.809 0.719 0.502 0.591 0.627 0.748 0.682 0.875 0.779 0.824
12 Zheng_SCUT_task1_1 Official BSD10k Baseline (Multimodal HATR + CLAP) False Zheng_SCUT_2026 0.888 0.825 0.856 0.768 0.848 0.806 0.810 0.810 0.810 0.878 0.932 0.904 0.836 0.868 0.851 1.000 0.667 0.800 0.882 1.000 0.937 0.778 0.752 0.765 0.934 0.979 0.956 1.000 0.685 0.813 0.918 0.656 0.765 0.754 0.941 0.837 0.915 0.756 0.828 0.750 0.832 0.789 0.834 0.820 0.827 0.821 0.834 0.827 0.684 0.890 0.774 0.455 0.575 0.508 0.845 0.523 0.646 0.797 0.868 0.831 0.810 0.419 0.552 0.592 0.812 0.685 0.838 0.703 0.765
13 Kucukoglu_NYU_task1_1 Multimodal Ensemble System for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.898 0.837 0.867 0.786 0.836 0.810 0.812 0.718 0.762 0.909 0.917 0.913 0.846 0.826 0.836 1.000 0.458 0.629 1.000 1.000 1.000 0.775 0.873 0.821 0.920 0.949 0.934 0.968 0.637 0.769 0.871 0.615 0.721 0.746 0.955 0.838 0.937 0.846 0.889 0.809 0.832 0.821 0.844 0.824 0.834 0.771 0.916 0.837 0.725 0.879 0.795 0.420 0.567 0.483 0.928 0.558 0.697 0.847 0.689 0.760 0.750 0.479 0.585 0.627 0.793 0.700 0.766 0.819 0.792
14 Lin_JKU_task1_1 Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) False Lin_JKU_2026 0.884 0.764 0.820 0.782 0.774 0.778 0.901 0.713 0.796 0.815 0.928 0.868 0.757 0.933 0.836 1.000 1.000 1.000 0.941 1.000 0.970 0.701 0.824 0.757 0.942 0.952 0.947 1.000 0.568 0.724 0.850 0.622 0.719 0.768 0.902 0.829 0.858 0.945 0.900 0.698 0.835 0.760 0.872 0.762 0.814 0.814 0.887 0.849 0.684 0.860 0.762 0.453 0.436 0.444 0.916 0.690 0.787 0.736 0.818 0.775 0.688 0.477 0.563 0.607 0.635 0.620 0.758 0.721 0.739
15 Guan_HEU_task1_1 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.852 0.871 0.861 0.745 0.852 0.795 0.860 0.796 0.827 0.893 0.908 0.900 0.873 0.852 0.862 0.688 0.583 0.631 0.931 1.000 0.964 0.763 0.781 0.772 0.932 0.949 0.940 1.000 0.570 0.726 0.830 0.745 0.785 0.733 0.937 0.822 0.906 0.901 0.904 0.723 0.909 0.805 0.823 0.789 0.806 0.809 0.799 0.804 0.672 0.907 0.772 0.396 0.414 0.405 0.845 0.551 0.667 0.782 0.800 0.791 0.850 0.398 0.542 0.601 0.724 0.656 0.875 0.708 0.783
16 Kucukoglu_NYU_task1_4 Three-Modality HATR with Whisper and Mixup False Kucukoglu_NYU_2026 0.883 0.859 0.871 0.818 0.825 0.821 0.857 0.713 0.778 0.863 0.923 0.892 0.834 0.870 0.852 1.000 0.458 0.629 0.901 1.000 0.948 0.712 0.824 0.764 0.908 0.962 0.934 0.967 0.625 0.759 0.855 0.663 0.747 0.720 0.948 0.819 0.934 0.786 0.854 0.817 0.863 0.839 0.806 0.818 0.812 0.763 0.869 0.813 0.698 0.856 0.769 0.430 0.619 0.507 0.906 0.461 0.611 0.792 0.689 0.737 0.783 0.410 0.538 0.595 0.793 0.680 0.869 0.748 0.804
17 Kucukoglu_NYU_task1_3 Single HATR Model with Cross-Fold Augmentation False Kucukoglu_NYU_2026 0.887 0.875 0.881 0.755 0.833 0.792 0.780 0.706 0.741 0.914 0.903 0.908 0.883 0.785 0.831 1.000 0.458 0.629 0.901 1.000 0.948 0.744 0.738 0.741 0.929 0.949 0.939 0.944 0.703 0.806 0.875 0.628 0.731 0.762 0.955 0.848 0.907 0.806 0.854 0.686 0.863 0.764 0.841 0.801 0.821 0.802 0.873 0.836 0.728 0.884 0.798 0.424 0.647 0.513 0.725 0.488 0.584 0.796 0.731 0.762 0.720 0.424 0.533 0.614 0.781 0.688 0.838 0.689 0.756
18 Lin_JKU_task1_2 Frozen-CLAP ensemble with logit adjustment (balanced) False Lin_JKU_2026 0.897 0.744 0.813 0.812 0.740 0.774 0.878 0.718 0.790 0.815 0.928 0.868 0.767 0.917 0.835 0.750 1.000 0.857 0.862 1.000 0.926 0.587 0.834 0.689 0.966 0.868 0.915 0.932 0.608 0.735 0.786 0.635 0.702 0.765 0.691 0.726 0.833 0.957 0.891 0.680 0.875 0.765 0.888 0.682 0.772 0.785 0.947 0.859 0.719 0.871 0.787 0.476 0.494 0.485 0.788 0.736 0.761 0.773 0.811 0.791 0.684 0.535 0.600 0.671 0.558 0.609 0.698 0.748 0.722
19 Lin_JKU_task1_4 Clean concatenation CLAP probe (10k-train only) False Lin_JKU_2026 0.860 0.829 0.844 0.792 0.829 0.810 0.800 0.815 0.807 0.847 0.917 0.881 0.774 0.891 0.828 0.688 0.458 0.550 0.833 1.000 0.909 0.831 0.670 0.742 0.921 0.900 0.910 0.905 0.655 0.760 0.747 0.556 0.637 0.743 0.951 0.834 0.916 0.881 0.898 0.739 0.723 0.731 0.855 0.840 0.847 0.787 0.910 0.844 0.736 0.894 0.808 0.473 0.661 0.552 0.797 0.532 0.639 0.823 0.781 0.801 0.652 0.382 0.482 0.629 0.736 0.678 0.818 0.696 0.752
20 Kucukoglu_NYU_task1_2 CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.887 0.875 0.881 0.769 0.818 0.792 0.789 0.736 0.762 0.927 0.908 0.918 0.858 0.852 0.855 0.000 0.125 0.000 1.000 1.000 1.000 0.757 0.852 0.801 0.914 0.949 0.931 0.968 0.645 0.774 0.864 0.574 0.690 0.779 0.948 0.856 0.927 0.825 0.873 0.783 0.848 0.814 0.837 0.824 0.830 0.788 0.936 0.856 0.714 0.884 0.790 0.421 0.575 0.486 0.904 0.532 0.670 0.833 0.726 0.776 0.781 0.539 0.638 0.629 0.781 0.697 0.787 0.787 0.787
21 Lin_JKU_task1_3 Gated multimodal CLAP fusion with pseudo-labels False Lin_JKU_2026 0.887 0.750 0.813 0.858 0.809 0.833 0.832 0.750 0.789 0.806 0.961 0.877 0.838 0.905 0.870 0.000 0.250 0.000 0.849 0.961 0.901 0.789 0.857 0.822 0.911 0.970 0.940 1.000 0.583 0.736 0.889 0.651 0.751 0.772 0.944 0.850 0.865 0.931 0.897 0.719 0.869 0.787 0.946 0.762 0.844 0.799 0.936 0.862 0.691 0.888 0.777 0.469 0.503 0.486 0.797 0.683 0.735 0.839 0.830 0.835 0.762 0.373 0.501 0.593 0.697 0.641 0.779 0.733 0.755
22 Colotti_TAU_task1_1 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026 0.904 0.812 0.855 0.779 0.772 0.776 0.913 0.606 0.729 0.865 0.923 0.893 0.923 0.671 0.777 0.000 0.375 0.000 1.000 0.609 0.757 0.627 0.912 0.743 0.947 0.910 0.928 0.882 0.315 0.464 0.897 0.717 0.797 0.720 0.909 0.804 0.859 0.868 0.863 0.765 0.838 0.800 0.703 0.902 0.790 0.802 0.805 0.804 0.743 0.851 0.793 0.395 0.625 0.484 0.958 0.426 0.590 0.814 0.894 0.852 0.441 0.273 0.337 0.451 0.659 0.535 0.771 0.642 0.701
23 Colotti_TAU_task1_2 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026 0.906 0.754 0.823 0.803 0.763 0.782 0.936 0.634 0.756 0.844 0.932 0.886 0.662 0.567 0.611 0.643 1.000 0.783 0.714 0.570 0.634 0.636 0.742 0.685 0.976 0.711 0.823 0.963 0.328 0.489 0.838 0.587 0.690 0.765 0.898 0.826 0.877 0.841 0.858 0.698 0.838 0.762 0.636 0.902 0.746 0.732 0.852 0.788 0.716 0.845 0.775 0.332 0.606 0.429 0.625 0.495 0.553 0.784 0.719 0.750 0.591 0.324 0.419 0.442 0.724 0.549 0.723 0.596 0.653
24 Kil_Medisensing_task1_1 metadata gate target mask posterior stack False Kil_Medisensing_2026 0.536 0.750 0.625 0.803 0.628 0.705 0.722 0.634 0.675 0.509 0.314 0.389 0.725 0.944 0.820 0.429 1.000 0.600 0.856 1.000 0.923 0.864 0.826 0.845 0.827 0.970 0.893 0.000 0.083 0.000 0.742 0.727 0.734 0.748 0.753 0.751 0.935 0.711 0.808 0.598 0.558 0.577 0.792 0.824 0.807 0.873 0.984 0.925 0.832 0.823 0.828 0.507 0.472 0.489 0.780 0.808 0.793 0.837 0.863 0.850 0.448 0.819 0.579 0.576 0.825 0.678 0.739 0.522 0.612
25 Kil_Medisensing_task1_2 raw parent specialist posterior stack False Kil_Medisensing_2026 0.536 0.750 0.625 0.797 0.628 0.702 0.716 0.623 0.666 0.509 0.314 0.389 0.725 0.944 0.820 0.429 1.000 0.600 0.856 1.000 0.923 0.864 0.826 0.845 0.822 0.970 0.890 0.000 0.083 0.000 0.750 0.727 0.738 0.738 0.760 0.749 0.935 0.711 0.808 0.590 0.543 0.565 0.798 0.824 0.811 0.873 0.984 0.925 0.829 0.812 0.821 0.510 0.486 0.498 0.802 0.808 0.805 0.837 0.863 0.850 0.448 0.819 0.579 0.576 0.825 0.678 0.739 0.522 0.612
26 Colotti_TAU_task1_4 CLAP-MoE Feature-wise Gated Fusion False Colotti_TAU_2026 0.902 0.758 0.824 0.666 0.782 0.719 0.820 0.729 0.772 0.834 0.893 0.863 0.802 0.558 0.658 0.375 0.000 0.000 1.000 0.680 0.809 0.660 0.570 0.612 0.926 0.840 0.881 0.929 0.290 0.442 0.808 0.431 0.562 0.715 0.960 0.820 0.835 0.753 0.792 0.663 0.741 0.700 0.815 0.785 0.800 0.805 0.809 0.807 0.725 0.879 0.795 0.409 0.669 0.508 0.659 0.690 0.674 0.763 0.894 0.823 0.438 0.361 0.396 0.417 0.692 0.520 0.684 0.569 0.621
27 Colotti_TAU_task1_3 EnhancedBaseClassifier False Colotti_TAU_2026 0.880 0.726 0.796 0.785 0.859 0.820 0.747 0.602 0.667 0.891 0.700 0.784 0.671 0.634 0.652 1.000 1.000 1.000 0.875 0.508 0.643 0.516 0.504 0.510 0.958 0.880 0.917 0.682 0.135 0.225 0.650 0.464 0.542 0.689 0.930 0.792 0.748 0.753 0.751 0.502 0.893 0.642 0.790 0.775 0.782 0.844 0.682 0.755 0.644 0.886 0.746 0.357 0.617 0.452 0.750 0.319 0.448 0.686 0.925 0.788 0.333 0.289 0.310 0.448 0.673 0.538 0.879 0.598 0.712
28 Kil_Medisensing_task1_3 Larger-CLAP classifier with metadata confidence gate False Kil_Medisensing_2026 0.455 0.615 0.523 0.733 0.783 0.757 0.795 0.708 0.749 0.365 0.263 0.305 0.766 0.803 0.784 0.000 0.125 0.000 0.911 0.844 0.876 0.877 0.764 0.817 0.821 0.957 0.884 0.473 0.217 0.298 0.889 0.579 0.701 0.605 0.761 0.674 0.853 0.784 0.817 0.549 0.634 0.588 0.812 0.623 0.705 0.742 0.857 0.795 0.851 0.560 0.676 0.603 0.408 0.487 0.558 0.704 0.623 0.629 0.906 0.743 0.478 0.542 0.508 0.489 0.712 0.579 0.750 0.588 0.659
29 Liu_CQUPT_task1_3 PaSST and CLAP Audio Ensemble True Liu_CQUPT_2026 0.522 0.688 0.594 0.755 0.634 0.689 0.766 0.646 0.701 0.424 0.287 0.342 0.680 0.910 0.778 0.000 0.250 0.000 0.750 0.609 0.672 0.724 0.805 0.762 0.824 0.939 0.878 0.943 0.230 0.370 0.892 0.630 0.738 0.664 0.767 0.712 0.910 0.710 0.797 0.551 0.652 0.598 0.742 0.760 0.751 0.794 0.865 0.828 0.760 0.642 0.696 0.580 0.447 0.505 0.610 0.736 0.667 0.696 0.894 0.783 0.309 0.396 0.347 0.476 0.786 0.593 0.744 0.605 0.667
30 Liu_CQUPT_task1_2 Weighted Audio-only CLAP Ensemble with BSD10k Specialist True Liu_CQUPT_2026 0.497 0.647 0.562 0.748 0.645 0.693 0.774 0.646 0.704 0.423 0.267 0.327 0.695 0.873 0.774 0.000 0.250 0.000 0.760 0.688 0.722 0.714 0.814 0.761 0.854 0.952 0.901 0.795 0.230 0.357 0.925 0.651 0.764 0.659 0.830 0.735 0.881 0.694 0.776 0.522 0.588 0.553 0.756 0.768 0.762 0.811 0.891 0.849 0.768 0.735 0.751 0.558 0.353 0.432 0.609 0.771 0.681 0.677 0.858 0.757 0.281 0.361 0.316 0.448 0.743 0.559 0.668 0.542 0.598
31 Liu_CQUPT_task1_1 Audio-only CLAP Ensemble with EMA and Test-Time Augmentation True Liu_CQUPT_2026 0.483 0.621 0.543 0.748 0.645 0.693 0.766 0.646 0.701 0.423 0.267 0.327 0.695 0.873 0.774 0.000 0.250 0.000 0.760 0.688 0.722 0.714 0.814 0.761 0.854 0.952 0.901 0.795 0.230 0.357 0.925 0.651 0.764 0.650 0.802 0.718 0.881 0.694 0.776 0.519 0.588 0.551 0.744 0.768 0.756 0.804 0.891 0.846 0.757 0.718 0.737 0.517 0.339 0.409 0.601 0.771 0.675 0.668 0.858 0.751 0.281 0.361 0.316 0.448 0.743 0.559 0.668 0.542 0.598
32 Kil_Medisensing_task1_4 audio posterior ensemble with conservative argmax decoding True Kil_Medisensing_2026 0.861 0.732 0.791 0.918 0.503 0.650 0.719 0.650 0.683 0.979 0.752 0.851 0.692 0.884 0.777 0.000 0.250 0.000 0.000 0.305 0.000 0.600 0.637 0.618 0.949 0.887 0.917 0.375 0.015 0.029 0.926 0.352 0.510 0.544 0.543 0.544 0.534 0.925 0.677 0.824 0.524 0.641 0.475 0.783 0.591 0.752 0.795 0.773 0.688 0.662 0.674 0.000 0.242 0.000 0.569 0.495 0.530 0.955 0.316 0.475 0.219 0.248 0.232 0.245 0.642 0.355 0.741 0.554 0.634
33 Sharma_IR_task1_1 Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification False Sharma_IR_2026 0.557 0.544 0.550 0.738 0.522 0.611 0.589 0.866 0.701 0.312 0.096 0.146 0.688 0.706 0.697 0.000 0.000 0.000 0.683 0.602 0.640 0.520 0.320 0.397 0.750 0.952 0.839 0.795 0.250 0.380 0.675 0.298 0.414 0.515 0.609 0.558 0.745 0.476 0.581 0.500 0.317 0.388 0.427 0.656 0.517 0.448 0.799 0.574 0.559 0.627 0.591 0.207 0.342 0.258 0.316 0.690 0.433 0.750 0.566 0.645 0.421 0.294 0.346 0.330 0.601 0.426 0.571 0.100 0.171
34 Han_CSU_task1_1 Multi-Embedding System with Hierarchical Proxy Learning_1 False Han_CSU_2026 0.458 0.500 0.478 0.556 0.705 0.622 0.635 0.597 0.616 0.502 0.494 0.498 0.337 0.400 0.366 0.072 0.000 0.000 0.022 0.164 0.039 0.483 0.271 0.348 0.616 0.610 0.613 0.889 0.198 0.323 0.565 0.296 0.388 0.313 0.427 0.361 0.493 0.270 0.349 0.365 0.366 0.365 0.343 0.416 0.376 0.375 0.252 0.301 0.434 0.586 0.498 0.343 0.367 0.355 0.212 0.225 0.218 0.530 0.408 0.461 0.250 0.174 0.205 0.287 0.478 0.359 0.625 0.316 0.420
35 Han_CSU_task1_3 Multi-Embedding System with Hierarchical Proxy Learning_3 False Han_CSU_2026 0.366 0.446 0.402 0.476 0.645 0.548 0.520 0.456 0.486 0.390 0.338 0.362 0.177 0.204 0.189 0.075 0.125 0.094 0.044 0.141 0.067 0.361 0.205 0.261 0.552 0.518 0.535 0.833 0.138 0.236 0.565 0.281 0.375 0.250 0.328 0.284 0.304 0.188 0.232 0.260 0.277 0.268 0.312 0.379 0.342 0.308 0.238 0.268 0.407 0.554 0.469 0.325 0.331 0.328 0.193 0.206 0.199 0.453 0.382 0.415 0.197 0.174 0.185 0.238 0.377 0.292 0.548 0.336 0.416
36 Han_CSU_task1_4 Multi-Embedding System with Hierarchical Proxy Learning_4 False Han_CSU_2026 0.318 0.355 0.335 0.475 0.611 0.534 0.622 0.521 0.567 0.373 0.368 0.370 0.231 0.282 0.254 0.069 0.125 0.089 0.022 0.141 0.038 0.385 0.209 0.271 0.465 0.461 0.463 0.729 0.110 0.191 0.295 0.181 0.224 0.254 0.355 0.296 0.302 0.213 0.249 0.273 0.287 0.280 0.269 0.328 0.295 0.324 0.244 0.278 0.324 0.390 0.354 0.344 0.339 0.341 0.237 0.243 0.240 0.446 0.311 0.367 0.196 0.148 0.169 0.254 0.358 0.297 0.410 0.186 0.256
37 Zhang_XJTLU_task1_3 no-ranker ensemble False Zhang_XJTLU_2026 0.087 0.173 0.116 0.115 0.188 0.143 0.164 0.118 0.137 0.068 0.101 0.081 0.083 0.039 0.053 0.000 0.125 0.000 0.054 0.258 0.089 0.242 0.090 0.131 0.396 0.026 0.049 0.000 0.015 0.000 0.583 0.028 0.054 0.179 0.366 0.240 0.375 0.216 0.274 0.292 0.110 0.159 0.156 0.223 0.184 0.250 0.262 0.256 0.265 0.166 0.204 0.101 0.294 0.150 0.159 0.241 0.192 0.013 0.007 0.009 0.000 0.007 0.000 0.149 0.142 0.145 0.000 0.000 0.000
38 Zhang_XJTLU_task1_4 external diversity low-risk False Zhang_XJTLU_2026 0.114 0.210 0.148 0.122 0.227 0.159 0.167 0.132 0.147 0.087 0.134 0.106 0.167 0.065 0.093 0.000 0.125 0.000 0.000 0.328 0.000 0.235 0.074 0.113 0.550 0.026 0.050 0.000 0.015 0.000 0.583 0.036 0.067 0.184 0.374 0.247 0.375 0.198 0.259 0.205 0.091 0.126 0.168 0.230 0.194 0.000 0.252 0.000 0.230 0.179 0.201 0.100 0.272 0.146 0.094 0.236 0.134 0.042 0.007 0.012 0.000 0.007 0.000 0.130 0.130 0.130 0.000 0.000 0.000
39 Han_CSU_task1_2 Multi-Embedding System with Hierarchical Proxy Learning_2 False Han_CSU_2026 0.055 0.000 0.000 0.015 0.042 0.022 0.016 0.042 0.023 0.105 0.132 0.117 0.099 0.116 0.107 0.064 0.000 0.000 0.022 0.070 0.034 0.117 0.137 0.126 0.103 0.053 0.070 0.000 0.045 0.000 0.092 0.036 0.052 0.168 0.198 0.182 0.158 0.167 0.162 0.175 0.189 0.182 0.133 0.180 0.153 0.199 0.170 0.184 0.197 0.155 0.174 0.093 0.156 0.116 0.139 0.150 0.144 0.089 0.054 0.067 0.109 0.116 0.112 0.131 0.156 0.142 0.056 0.051 0.054
40 Zhang_XJTLU_task1_2 balanced ranker/base False Zhang_XJTLU_2026 0.115 0.210 0.149 0.131 0.246 0.171 0.134 0.139 0.136 0.062 0.090 0.073 0.000 0.021 0.000 0.000 0.125 0.000 0.000 0.258 0.000 0.220 0.059 0.092 0.475 0.026 0.050 0.000 0.015 0.000 0.583 0.028 0.054 0.179 0.374 0.242 0.300 0.198 0.239 0.250 0.101 0.143 0.172 0.223 0.194 0.000 0.240 0.000 0.257 0.166 0.202 0.095 0.256 0.138 0.107 0.243 0.149 0.029 0.007 0.011 0.000 0.007 0.000 0.152 0.149 0.150 0.000 0.000 0.000
41 Zhang_XJTLU_task1_1 aggressive ranker False Zhang_XJTLU_2026 0.117 0.188 0.144 0.096 0.151 0.117 0.037 0.132 0.058 0.062 0.086 0.072 0.071 0.039 0.051 0.000 0.125 0.000 0.000 0.141 0.000 0.271 0.152 0.195 0.125 0.005 0.009 0.000 0.007 0.000 0.188 0.008 0.015 0.205 0.401 0.271 0.188 0.220 0.202 0.062 0.073 0.067 0.180 0.201 0.190 0.219 0.262 0.239 0.180 0.151 0.164 0.099 0.239 0.140 0.010 0.208 0.019 0.066 0.021 0.032 0.125 0.007 0.013 0.073 0.089 0.080 0.062 0.007 0.013
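
The class-wise numbers above can be cross-checked against the overall scores. The sketch below is an observation drawn from the published values, not a statement of the official metric definition: each class-wise hF appears to match the harmonic mean of its hP and hR up to rounding, and for the top-ranked system the unweighted mean of the 23 class-wise hF values appears to reproduce the overall hF of 0.836. The function name `f_score` is an illustrative choice.

```python
from statistics import mean


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# (hP, hR, reported hF) triples copied from the class-wise table.
samples = [
    (0.940, 0.728, 0.821),  # Primus_CPJKU_task1_3, m-sp
    (0.908, 0.864, 0.885),  # Primus_CPJKU_task1_3, m-si
    (0.600, 1.000, 0.750),  # Primus_CPJKU_task1_3, is-w
    (0.000, 0.125, 0.000),  # Kucukoglu_NYU_task1_2, is-w
]
# Largest difference between the recomputed and the reported hF.
max_gap = max(abs(f_score(p, r) - f) for p, r, f in samples)

# The 23 class-wise hF values of Primus_CPJKU_task1_3, in table order.
top_system_hf = [
    0.821, 0.885, 0.827,                                     # Music
    0.889, 0.881, 0.750, 0.981, 0.847,                       # Instrument samples
    0.926, 0.809, 0.779,                                     # Speech
    0.891, 0.910, 0.851, 0.899, 0.904, 0.818, 0.617, 0.827,  # Sound effects
    0.854, 0.630, 0.814, 0.824,                              # Soundscapes
]
macro_hf = mean(top_system_hf)

print(max_gap < 0.001)     # True
print(round(macro_hf, 3))  # 0.836
```

Under this reading, a class with very few test items weighs as much as a large one, which would explain why a single class scored at 0.000 can noticeably lower a system's overall hF.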

BST Top-level performance

Columns: System rank, Submission label, Submission name, Audio only, Technical Report, then the overall P, R, and F, followed by P, R, and F (in that order) for each top-level class, in the following class order: Music (m), Instrument samples (is), Speech (sp), Sound effects (fx), Soundscapes (ss).

1 Primus_CPJKU_task1_3 Ensemble 3 False Primus_CPJKU_2026 0.908 0.876 0.889 0.951 0.858 0.902 0.873 0.971 0.919 0.955 0.846 0.897 0.864 0.953 0.907 0.898 0.752 0.819
2 Primus_CPJKU_task1_2 Ensemble 2 False Primus_CPJKU_2026 0.909 0.873 0.887 0.967 0.853 0.906 0.877 0.971 0.921 0.955 0.840 0.894 0.858 0.955 0.904 0.887 0.748 0.811
3 Primus_CPJKU_task1_4 Ensemble 4 False Primus_CPJKU_2026 0.908 0.875 0.888 0.947 0.868 0.905 0.877 0.971 0.921 0.960 0.823 0.886 0.864 0.955 0.908 0.893 0.757 0.820
4 Primus_CPJKU_task1_1 Ensemble 1 False Primus_CPJKU_2026 0.903 0.871 0.884 0.935 0.848 0.889 0.857 0.966 0.908 0.961 0.834 0.893 0.866 0.951 0.906 0.898 0.757 0.822
5 Huang_WHU_task1_2 STFT-distill logit ensemble-1 False Huang_WHU_2026 0.885 0.863 0.872 0.878 0.882 0.880 0.902 0.946 0.924 0.973 0.829 0.895 0.852 0.909 0.880 0.818 0.748 0.781
6 Huang_WHU_task1_1 HGA-EMA-STFT system False Huang_WHU_2026 0.881 0.855 0.866 0.921 0.863 0.891 0.893 0.937 0.914 0.958 0.789 0.865 0.857 0.921 0.888 0.778 0.767 0.772
7 Guan_HEU_task1_3 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.891 0.864 0.875 0.876 0.897 0.886 0.904 0.917 0.910 0.966 0.811 0.882 0.862 0.935 0.897 0.846 0.757 0.799
8 Huang_WHU_task1_4 Six-model 5-fold logit ensemble False Huang_WHU_2026 0.888 0.859 0.872 0.909 0.877 0.893 0.905 0.927 0.916 0.965 0.794 0.871 0.854 0.927 0.889 0.806 0.771 0.788
9 Guan_HEU_task1_4 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.888 0.862 0.873 0.880 0.897 0.888 0.906 0.937 0.921 0.959 0.800 0.872 0.858 0.931 0.893 0.839 0.743 0.788
10 Huang_WHU_task1_3 STFT-distill logit ensemble-5 False Huang_WHU_2026 0.887 0.858 0.870 0.904 0.877 0.891 0.904 0.922 0.913 0.965 0.794 0.871 0.853 0.929 0.890 0.809 0.767 0.787
11 Guan_HEU_task1_2 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.879 0.851 0.862 0.835 0.892 0.863 0.916 0.902 0.909 0.964 0.771 0.857 0.867 0.933 0.899 0.811 0.757 0.783
12 Zheng_SCUT_task1_1 Official BSD10k Baseline (Multimodal HATR + CLAP) False Zheng_SCUT_2026 0.882 0.864 0.872 0.871 0.892 0.881 0.890 0.912 0.901 0.987 0.840 0.907 0.857 0.899 0.877 0.807 0.776 0.791
13 Kucukoglu_NYU_task1_1 Multimodal Ensemble System for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.878 0.852 0.863 0.869 0.848 0.859 0.896 0.922 0.909 0.966 0.806 0.879 0.850 0.917 0.882 0.809 0.767 0.787
14 Lin_JKU_task1_1 Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) False Lin_JKU_2026 0.856 0.830 0.839 0.891 0.804 0.845 0.812 0.946 0.874 0.957 0.771 0.854 0.853 0.907 0.879 0.764 0.724 0.743
15 Guan_HEU_task1_1 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026 0.877 0.853 0.863 0.850 0.892 0.871 0.908 0.917 0.913 0.960 0.823 0.886 0.846 0.911 0.877 0.822 0.724 0.770
16 Kucukoglu_NYU_task1_4 Three-Modality HATR with Whisper and Mixup False Kucukoglu_NYU_2026 0.875 0.851 0.861 0.902 0.858 0.879 0.869 0.941 0.904 0.960 0.829 0.890 0.835 0.899 0.865 0.810 0.729 0.767
17 Kucukoglu_NYU_task1_3 Single HATR Model with Cross-Fold Augmentation False Kucukoglu_NYU_2026 0.870 0.839 0.853 0.855 0.868 0.861 0.904 0.873 0.888 0.953 0.817 0.880 0.827 0.911 0.867 0.810 0.729 0.767
18 Lin_JKU_task1_2 Frozen-CLAP ensemble with logit adjustment (balanced) False Lin_JKU_2026 0.847 0.821 0.829 0.909 0.784 0.842 0.765 0.951 0.848 0.957 0.771 0.854 0.841 0.887 0.863 0.764 0.710 0.736
19 Lin_JKU_task1_4 Clean concatenation CLAP probe (10k-train only) False Lin_JKU_2026 0.873 0.845 0.857 0.865 0.877 0.871 0.895 0.912 0.903 0.946 0.794 0.863 0.840 0.917 0.877 0.817 0.724 0.768
20 Kucukoglu_NYU_task1_2 CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.879 0.855 0.865 0.859 0.863 0.861 0.900 0.917 0.908 0.966 0.800 0.875 0.853 0.913 0.882 0.820 0.781 0.800
21 Lin_JKU_task1_3 Gated multimodal CLAP fusion with pseudo-labels False Lin_JKU_2026 0.876 0.845 0.857 0.924 0.833 0.876 0.851 0.946 0.896 0.966 0.806 0.879 0.847 0.929 0.886 0.793 0.710 0.749
22 Colotti_TAU_task1_1 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026 0.843 0.802 0.817 0.899 0.789 0.841 0.846 0.912 0.878 0.962 0.714 0.820 0.805 0.897 0.849 0.702 0.695 0.699
23 Colotti_TAU_task1_2 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026 0.841 0.770 0.794 0.901 0.760 0.824 0.879 0.917 0.897 0.991 0.606 0.752 0.764 0.911 0.831 0.670 0.657 0.663
24 Kil_Medisensing_task1_1 metadata gate target mask posterior stack False Kil_Medisensing_2026 0.782 0.772 0.773 0.729 0.725 0.727 0.765 0.746 0.756 0.887 0.720 0.795 0.877 0.850 0.863 0.652 0.819 0.726
25 Kil_Medisensing_task1_2 raw parent specialist posterior stack False Kil_Medisensing_2026 0.782 0.772 0.773 0.729 0.725 0.727 0.765 0.746 0.756 0.887 0.720 0.795 0.877 0.850 0.863 0.652 0.819 0.726
26 Colotti_TAU_task1_4 CLAP-MoE Feature-wise Gated Fusion False Colotti_TAU_2026 0.827 0.769 0.788 0.819 0.819 0.819 0.905 0.790 0.844 0.964 0.611 0.748 0.802 0.911 0.853 0.644 0.714 0.677
27 Colotti_TAU_task1_3 EnhancedBaseClassifier False Colotti_TAU_2026 0.814 0.756 0.777 0.880 0.824 0.851 0.901 0.800 0.848 0.887 0.583 0.703 0.751 0.874 0.808 0.653 0.700 0.676
28 Kil_Medisensing_task1_3 Larger-CLAP classifier with metadata confidence gate False Kil_Medisensing_2026 0.771 0.758 0.761 0.704 0.770 0.735 0.781 0.678 0.726 0.882 0.726 0.796 0.843 0.838 0.841 0.647 0.776 0.706
29 Liu_CQUPT_task1_3 PaSST and CLAP Audio Ensemble True Liu_CQUPT_2026 0.775 0.755 0.761 0.720 0.706 0.713 0.732 0.732 0.732 0.933 0.714 0.809 0.845 0.848 0.846 0.644 0.776 0.704
30 Liu_CQUPT_task1_2 Weighted Audio-only CLAP Ensemble with BSD10k Specialist True Liu_CQUPT_2026 0.776 0.752 0.760 0.715 0.701 0.708 0.730 0.712 0.721 0.948 0.726 0.822 0.840 0.860 0.850 0.645 0.762 0.699
31 Liu_CQUPT_task1_1 Audio-only CLAP Ensemble with EMA and Test-Time Augmentation True Liu_CQUPT_2026 0.774 0.751 0.759 0.714 0.696 0.705 0.730 0.712 0.721 0.948 0.726 0.822 0.838 0.858 0.848 0.643 0.762 0.697
32 Kil_Medisensing_task1_4 audio posterior ensemble with conservative argmax decoding True Kil_Medisensing_2026 0.777 0.682 0.708 0.894 0.662 0.761 0.820 0.800 0.810 1.000 0.520 0.684 0.715 0.818 0.763 0.456 0.610 0.521
33 Sharma_IR_task1_1 Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification False Sharma_IR_2026 0.701 0.628 0.653 0.728 0.721 0.724 0.672 0.420 0.517 0.917 0.697 0.792 0.634 0.824 0.717 0.552 0.481 0.514
34 Han_CSU_task1_1 Multi-Embedding System with Hierarchical Proxy Learning_1 False Han_CSU_2026 0.539 0.517 0.521 0.574 0.647 0.608 0.382 0.483 0.427 0.681 0.440 0.535 0.571 0.575 0.573 0.487 0.438 0.461
35 Han_CSU_task1_3 Multi-Embedding System with Hierarchical Proxy Learning_3 False Han_CSU_2026 0.475 0.452 0.456 0.481 0.574 0.523 0.314 0.371 0.340 0.619 0.371 0.464 0.525 0.530 0.528 0.437 0.414 0.425
36 Han_CSU_task1_4 Multi-Embedding System with Hierarchical Proxy Learning_4 False Han_CSU_2026 0.441 0.421 0.425 0.472 0.529 0.499 0.301 0.390 0.340 0.491 0.314 0.383 0.510 0.530 0.520 0.431 0.343 0.382
37 Zhang_XJTLU_task1_3 no-ranker ensemble False Zhang_XJTLU_2026 0.302 0.204 0.188 0.175 0.260 0.209 0.198 0.161 0.177 0.667 0.034 0.065 0.352 0.516 0.418 0.119 0.048 0.068
38 Zhang_XJTLU_task1_4 external diversity low-risk False Zhang_XJTLU_2026 0.354 0.217 0.200 0.189 0.304 0.233 0.212 0.185 0.198 0.875 0.040 0.077 0.361 0.510 0.422 0.135 0.048 0.070
39 Han_CSU_task1_2 Multi-Embedding System with Hierarchical Proxy Learning_2 False Han_CSU_2026 0.194 0.190 0.187 0.067 0.069 0.068 0.160 0.229 0.188 0.157 0.074 0.101 0.370 0.395 0.382 0.217 0.181 0.197
40 Zhang_XJTLU_task1_2 balanced ranker/base False Zhang_XJTLU_2026 0.320 0.207 0.189 0.193 0.314 0.239 0.171 0.137 0.152 0.750 0.034 0.066 0.351 0.500 0.412 0.136 0.052 0.076
41 Zhang_XJTLU_task1_1 aggressive ranker False Zhang_XJTLU_2026 0.239 0.191 0.175 0.150 0.240 0.185 0.170 0.151 0.160 0.375 0.017 0.033 0.347 0.478 0.402 0.152 0.067 0.093

Development set performance

System rank | Submission label | Submission name | Audio only | Technical Report | BSD10k-v1.2 hP | BSD10k-v1.2 hR | BSD10k-v1.2 hF | BSD35k-CS hP | BSD35k-CS hR | BSD35k-CS hF
1 Primus_CPJKU_task1_3 Ensemble 3 False Primus_CPJKU_2026 0.833 0.838 0.834
2 Primus_CPJKU_task1_2 Ensemble 2 False Primus_CPJKU_2026 0.831 0.837 0.832
3 Primus_CPJKU_task1_4 Ensemble 4 False Primus_CPJKU_2026 0.828 0.832 0.829
4 Primus_CPJKU_task1_1 Ensemble 1 False Primus_CPJKU_2026 0.830 0.831 0.830
5 Huang_WHU_task1_2 STFT-distill logit ensemble-1 False Huang_WHU_2026
6 Huang_WHU_task1_1 HGA-EMA-STFT system False Huang_WHU_2026
7 Guan_HEU_task1_3 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026
8 Huang_WHU_task1_4 Six-model 5-fold logit ensemble False Huang_WHU_2026
9 Guan_HEU_task1_4 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026
10 Huang_WHU_task1_3 STFT-distill logit ensemble-5 False Huang_WHU_2026
11 Guan_HEU_task1_2 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026
12 Zheng_SCUT_task1_1 Official BSD10k Baseline (Multimodal HATR + CLAP) False Zheng_SCUT_2026 0.795 0.783 0.787 0.811 0.794 0.798
13 Kucukoglu_NYU_task1_1 Multimodal Ensemble System for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.822 0.805 0.811
14 Lin_JKU_task1_1 Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) False Lin_JKU_2026 0.795 0.805 0.794
15 Guan_HEU_task1_1 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations False Guan_HEU_2026
16 Kucukoglu_NYU_task1_4 Three-Modality HATR with Whisper and Mixup False Kucukoglu_NYU_2026 0.795 0.787 0.788
17 Kucukoglu_NYU_task1_3 Single HATR Model with Cross-Fold Augmentation False Kucukoglu_NYU_2026 0.798 0.790 0.791
18 Lin_JKU_task1_2 Frozen-CLAP ensemble with logit adjustment (balanced) False Lin_JKU_2026 0.772 0.805 0.779
19 Lin_JKU_task1_4 Clean concatenation CLAP probe (10k-train only) False Lin_JKU_2026 0.788 0.782 0.781
20 Kucukoglu_NYU_task1_2 CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification False Kucukoglu_NYU_2026 0.818 0.808 0.811
21 Lin_JKU_task1_3 Gated multimodal CLAP fusion with pseudo-labels False Lin_JKU_2026 0.789 0.798 0.790
22 Colotti_TAU_task1_1 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026
23 Colotti_TAU_task1_2 Audio Classification using Attention-based Cleaned Multimodal Embeddings False Colotti_TAU_2026
24 Kil_Medisensing_task1_1 metadata gate target mask posterior stack False Kil_Medisensing_2026 0.804 0.815 0.805
25 Kil_Medisensing_task1_2 raw parent specialist posterior stack False Kil_Medisensing_2026 0.805 0.812 0.804
26 Colotti_TAU_task1_4 CLAP-MoE Feature-wise Gated Fusion False Colotti_TAU_2026
27 Colotti_TAU_task1_3 EnhancedBaseClassifier False Colotti_TAU_2026
28 Kil_Medisensing_task1_3 Larger-CLAP classifier with metadata confidence gate False Kil_Medisensing_2026 0.753 0.717 0.729
29 Liu_CQUPT_task1_3 PaSST and CLAP Audio Ensemble True Liu_CQUPT_2026
30 Liu_CQUPT_task1_2 Weighted Audio-only CLAP Ensemble with BSD10k Specialist True Liu_CQUPT_2026
31 Liu_CQUPT_task1_1 Audio-only CLAP Ensemble with EMA and Test-Time Augmentation True Liu_CQUPT_2026
32 Kil_Medisensing_task1_4 audio posterior ensemble with conservative argmax decoding True Kil_Medisensing_2026 0.825 0.806 0.809
33 Sharma_IR_task1_1 Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification False Sharma_IR_2026
34 Han_CSU_task1_1 Multi-Embedding System with Hierarchical Proxy Learning_1 False Han_CSU_2026 0.854 0.840 0.846 0.806 0.835 0.810
35 Han_CSU_task1_3 Multi-Embedding System with Hierarchical Proxy Learning_3 False Han_CSU_2026 0.852 0.836 0.843 0.791 0.823 0.798
36 Han_CSU_task1_4 Multi-Embedding System with Hierarchical Proxy Learning_4 False Han_CSU_2026 0.851 0.841 0.845 0.803 0.827 0.806
37 Zhang_XJTLU_task1_3 no-ranker ensemble False Zhang_XJTLU_2026 0.833
38 Zhang_XJTLU_task1_4 external diversity low-risk False Zhang_XJTLU_2026 0.832
39 Han_CSU_task1_2 Multi-Embedding System with Hierarchical Proxy Learning_2 False Han_CSU_2026 0.856 0.835 0.844 0.827 0.851 0.832
40 Zhang_XJTLU_task1_2 balanced ranker/base False Zhang_XJTLU_2026 0.834
41 Zhang_XJTLU_task1_1 aggressive ranker False Zhang_XJTLU_2026 0.835

System characteristics

System rank | Submission label | Submission name | Technical Report | Sampling rate | Audio representation | Text representation | Hierarchical setting | Machine learning method | Data augmentation | External data usage | External datasets | MACS (G) | Total params
1 Primus_CPJKU_task1_3 Ensemble 3 Primus_CPJKU_2026 CP-CLAP PaSST, M2D title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa transformer, contrastive-learning, LLM embeddings, LLM 383000000.0
2 Primus_CPJKU_task1_2 Ensemble 2 Primus_CPJKU_2026 CP-CLAP PaSST, M2D title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa transformer, contrastive-learning, LLM embeddings, LLM 383000000.0
3 Primus_CPJKU_task1_4 Ensemble 4 Primus_CPJKU_2026 CLAP, PaSST, M2D-CLAP, CP-CLAP title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa transformer, contrastive-learning, LLM embeddings, LLM 539000000.0
4 Primus_CPJKU_task1_1 Ensemble 1 Primus_CPJKU_2026 BEATs, CP-CLAP, PaSST, CLAP title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa transformer, contrastive-learning, LLM embeddings, LLM 543000000.0
5 Huang_WHU_task1_2 STFT-distill logit ensemble-1 Huang_WHU_2026 16kHz CLAP, log-STFT, log-mel energies title, tags, description, CLAP, metadata cleaning multiple classifiers MLP, transformer random crop, time masking embeddings BSD35k 28.361 97540455.0
6 Huang_WHU_task1_1 HGA-EMA-STFT system Huang_WHU_2026 16kHz CLAP, STFT MLP, transformer time masking, random crop embeddings BSD35k 12.921 2728920.0
7 Guan_HEU_task1_3 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations Guan_HEU_2026 48kHz CLAP MLP embeddings 1.000 326592.0
8 Huang_WHU_task1_4 Six-model 5-fold logit ensemble Huang_WHU_2026 16kHz CLAP, log-STFT, log-mel energies, MFCC title, tags, description, CLAP, metadata cleaning multiple classifiers MLP, CNN, transformer, ensemble time masking, random crop embeddings, datasets BSD35k 30.555 210700090.0
9 Guan_HEU_task1_4 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations Guan_HEU_2026 48kHz CLAP MLP embeddings
10 Huang_WHU_task1_3 STFT-distill logit ensemble-5 Huang_WHU_2026 16kHz CLAP, log-STFT, log-mel energies title, tags, description, CLAP, metadata cleaning multiple classifiers MLP, CNN, transformer, ensemble random crop, time masking embeddings, datasets BSD35k 28.361 97540455.0
11 Guan_HEU_task1_2 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations Guan_HEU_2026 48kHz CLAP MLP embeddings 11.000 15474693.0
12 Zheng_SCUT_task1_1 Official BSD10k Baseline (Multimodal HATR + CLAP) Zheng_SCUT_2026 44.1kHz CLAP MLP embeddings 7319513.0
13 Kucukoglu_NYU_task1_1 Multimodal Ensemble System for Heterogeneous Audio Classification Kucukoglu_NYU_2026 48kHz, 32kHz CLAP, ConvNeXt, Fine-tuned CLAP loss function HATR, ensemble, logit averaging noise addition, random masking, mixup, balanced resampling embeddings 1889944.0
14 Lin_JKU_task1_1 Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) Lin_JKU_2026 48kHz, 32kHz CLAP, PaSST title, tags, description, CLAP, RoBERTa MLP, transformer, ensemble time masking, frequency masking, mixup embeddings 163.900 211645044.0
15 Guan_HEU_task1_1 Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations Guan_HEU_2026 48kHz CLAP MLP embeddings 8.000 25184262.0
16 Kucukoglu_NYU_task1_4 Three-Modality HATR with Whisper and Mixup Kucukoglu_NYU_2026 48kHz, 16kHz CLAP, Whisper-large-v3 Three-modality HATR mixup, noise addition, random masking embeddings 1115669.0
17 Kucukoglu_NYU_task1_3 Single HATR Model with Cross-Fold Augmentation Kucukoglu_NYU_2026 48kHz CLAP HATR cross-fold embedding swap, noise addition, random masking embeddings 377989.0
18 Lin_JKU_task1_2 Frozen-CLAP ensemble with logit adjustment (balanced) Lin_JKU_2026 48kHz, 32kHz CLAP, PaSST title, tags, description, CLAP, RoBERTa MLP, transformer, ensemble time masking, frequency masking, mixup embeddings 163.900 211645044.0
19 Lin_JKU_task1_4 Clean concatenation CLAP probe (10k-train only) Lin_JKU_2026 48kHz CLAP MLP embeddings 19.400 538647.0
20 Kucukoglu_NYU_task1_2 CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification Kucukoglu_NYU_2026 48kHz, 32kHz CLAP, ConvNeXt loss function HATR, ensemble, logit averaging noise addition, random masking, mixup, balanced resampling, combined augmentation embeddings 1889944.0
21 Lin_JKU_task1_3 Gated multimodal CLAP fusion with pseudo-labels Lin_JKU_2026 48kHz CLAP MLP embeddings 19.400 1327639.0
22 Colotti_TAU_task1_1 Audio Classification using Attention-based Cleaned Multimodal Embeddings Colotti_TAU_2026 48kHz CLAP description, tags, sentence transformer, all-mpnet-base-v2, BERT attention, hyperbolic neural networks embeddings, model weights 7.900 380342496.0
23 Colotti_TAU_task1_2 Audio Classification using Attention-based Cleaned Multimodal Embeddings Colotti_TAU_2026 48kHz CLAP description, tags, sentence transformer, all-mpnet-base-v2, BERT attention, hyperbolic neural networks embeddings, model weights 7.900 380342496.0
24 Kil_Medisensing_task1_1 metadata gate target mask posterior stack Kil_Medisensing_2026 CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF title, tags, description, TF-IDF helper gates second-level classifier, same-parent metadata override, hierarchy-aware posteriors weighted posterior stacking, metadata target-mask gate embeddings, datasets
25 Kil_Medisensing_task1_2 raw parent specialist posterior stack Kil_Medisensing_2026 CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF title, tags, description, conservative helper gates parent-local specialists, second-level predictions constrained by hierarchy weighted posterior stacking, raw-embedding parent-specialist gate embeddings, datasets
26 Colotti_TAU_task1_4 CLAP-MoE Feature-wise Gated Fusion Colotti_TAU_2026 48kHz CLAP tags, description, CLAP top-level classifier as router, expert second-level classifiers MLP, feature-wise gated multimodal fusion noise addition embeddings 4.140 369991736.0
27 Colotti_TAU_task1_3 EnhancedBaseClassifier Colotti_TAU_2026 48kHz CLAP tags, description, CLAP, SentenceBERT, sentence transformer MLP baseline implemented masking and augmentation embeddings 9.400 282000000.0
28 Kil_Medisensing_task1_3 Larger-CLAP classifier with metadata confidence gate Kil_Medisensing_2026 CLAP title, tags, description, TF-IDF second-level classifier, same-parent metadata override MLP datasets, embeddings FSD50K 0.011 10704631.0
29 Liu_CQUPT_task1_3 PaSST and CLAP Audio Ensemble Liu_CQUPT_2026 48kHz, 32kHz CLAP, PaSST loss function ensemble mixup, feature masking, auxiliary text supervision during training, test-time augmentation embeddings AudioSet 254784782.0
30 Liu_CQUPT_task1_2 Weighted Audio-only CLAP Ensemble with BSD10k Specialist Liu_CQUPT_2026 48kHz CLAP loss function residual gated classifiers, knowledge distillation, weighted ensemble embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation embeddings 197911292.0
31 Liu_CQUPT_task1_1 Audio-only CLAP Ensemble with EMA and Test-Time Augmentation Liu_CQUPT_2026 48kHz CLAP loss function residual gated classifiers, knowledge distillation, ensemble embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation embeddings 152641185.0
32 Kil_Medisensing_task1_4 audio posterior ensemble with conservative argmax decoding Kil_Medisensing_2026 CLAP, M2D-CLAP multiple classifiers, hierarchical ridge branch ridge ensemble, posterior averaging embeddings
33 Sharma_IR_task1_1 Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification Sharma_IR_2026 32kHz, 48kHz CLAP, PANNs loss function dimension masking embeddings 1.036 261098881.0
34 Han_CSU_task1_1 Multi-Embedding System with Hierarchical Proxy Learning_1 Han_CSU_2026 44.1kHz BEATs, ATST-Frame, fPaSST, PaSST, CLAP, M2D loss function MLP noise addition, time masking embeddings BEATs, ATST-Frame, fPaSST, PaSST, 718.773 3717566529.0
35 Han_CSU_task1_3 Multi-Embedding System with Hierarchical Proxy Learning_3 Han_CSU_2026 44.1kHz ATST-Frame, PaSST, CLAP, M2D loss function MLP noise addition, time masking embeddings ATST-Frame, PaSST, 466.859 2332946777.0
36 Han_CSU_task1_4 Multi-Embedding System with Hierarchical Proxy Learning_4 Han_CSU_2026 44.1kHz BEATs, CLAP loss function MLP noise addition, time masking embeddings 200.345 676412058.0
37 Zhang_XJTLU_task1_3 no-ranker ensemble Zhang_XJTLU_2026 48kHz CLAP title, tags, description, CLAP, safe metadata scalar features parent-aware calibration, second-level candidate override ensemble, supervision with noisy labels teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components embeddings, datasets BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
38 Zhang_XJTLU_task1_4 external diversity low-risk Zhang_XJTLU_2026 48kHz CLAP title, tags, description, CLAP, safe metadata scalar features parent-aware calibration, second-level candidate override ensemble, external-data diversity components teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components embeddings, datasets BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
39 Han_CSU_task1_2 Multi-Embedding System with Hierarchical Proxy Learning_2 Han_CSU_2026 44.1kHz CLAP, M2D loss function MLP noise addition, time masking embeddings 134.815 222974.0
40 Zhang_XJTLU_task1_2 balanced ranker/base Zhang_XJTLU_2026 48kHz CLAP title, tags, description, CLAP, safe metadata scalar features parent-aware calibration, second-level candidate override ensemble, parent-aware candidate reranking teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components embeddings, datasets BSD35k-CS, ESC-50, UrbanSound8K, FSD50K
41 Zhang_XJTLU_task1_1 aggressive ranker Zhang_XJTLU_2026 48kHz CLAP title, tags, description, CLAP, safe metadata scalar features parent-aware calibration, second-level candidate override ensemble, candidate-level logistic/gradient reranking, parent-aware override teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components embeddings, datasets BSD35k-CS, ESC-50, UrbanSound8K, FSD50K

Technical reports

ACACIA: Audio Classification using Attention-Based Cleaned Multimodal Embeddings

Francesco Colotti1, Kerstin Markl1, Riccardo Casciotti1, Javier Naranjo2
1Tampere University, Audio Research Group, Tampere, Finland, 2Instituto Tecnologico de Informatica, Valencia, Spain

Abstract

This paper presents the ACACIA pipeline, submitted to Task 1 of DCASE 2026, the first edition of the Heterogeneous Audio Classification challenge based on the Broad Sound Taxonomy (BST). Our approach combines complementary acoustic and textual information available in Freesound recordings by jointly exploiting the audio signal together with user-provided tags and textual descriptions. To improve the quality of the textual modality, we propose a preprocessing and cleaning pipeline that removes noisy and non-informative content from descriptions before encoding. The proposed architecture comprises three dedicated branches that independently encode audio, tags, and descriptions, whose representations are fused into a shared multimodal embedding space. To exploit the hierarchical organization of BST, the model performs two classification tasks simultaneously, predicting both top-level and second-level categories, and is trained using the sum of two binary cross-entropy losses. Experimental results show that ACACIA surpasses the official baseline by 6% according to the challenge's hierarchy-aware evaluation metric, demonstrating that the integration of acoustic information with complementary textual metadata provides a robust and effective solution for heterogeneous sound classification in real-world conditions.
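As a rough illustration of the training objective described in this abstract, the sketch below sums one binary cross-entropy term for the top-level head and one for the second-level head. It is not the authors' implementation: the function names, the equal weighting of the two terms, and the toy probabilities are assumptions made for this example.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over per-class probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def joint_hierarchical_loss(top_probs, top_targets, second_probs, second_targets):
    """Sum of the top-level and second-level BCE terms (equal weights assumed)."""
    return (binary_cross_entropy(top_probs, top_targets)
            + binary_cross_entropy(second_probs, second_targets))

# Toy example with 5 top-level and 23 second-level classes (hypothetical values).
top_targets = [1, 0, 0, 0, 0]
top_probs = [0.9, 0.05, 0.02, 0.02, 0.01]
second_targets = [0] * 23
second_targets[2] = 1
second_probs = [0.01] * 23
second_probs[2] = 0.8

loss = joint_hierarchical_loss(top_probs, top_targets, second_probs, second_targets)
print(f"joint loss: {loss:.4f}")
```

Because both terms share the same fused embedding in such a setup, an error at the top level also contributes gradient to the shared representation, which is one plausible reason for training the two heads jointly.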

PDF

GISP@HEU's Submission for Task 1: Heterogeneous Audio Classification in the DCASE 2026 Challenge

Feng Xiaoyu1, Ye Tong1, Yang Xuefeng1, Xiao Feiyang1, Qiaoxi Zhu2, Jian Guan1
1Harbin Engineering University, Harbin, China, 2University of Technology Sydney, Ultimo, Australia

Abstract

This technical report presents our submitted systems for Task 1: Heterogeneous Audio Classification in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2026 Challenge. Our submission consists of four systems, including three individual systems and one ensemble system. System 1 and System 2 adopt a two-stage hierarchical training framework with hyperbolic representation learning, and are built upon different multimodal audio language backbones, i.e., Qwen2-Audio and Qwen3-Omni. System 3 is developed from the official baseline framework, using multimodal embeddings and hierarchical training with clean BSD10k data and filtered BSD35k data. System 4 is an ensemble system consisting of these three systems. Experimental results demonstrate that the proposed approach improves hierarchical classification performance and achieves 81.80% ± 0.21% on the development dataset.

PDF

MESH: MULTI-EMBEDDING SYSTEM WITH HIERARCHICAL PROXY LEARNING FOR DCASE 2026 TASK 1

Sarang Han1, Minsik Jo2, Eunseo Ha2, Minju Chae2, Hyeonguk Kang2, Geonwoo Lee2
1Chosun University, Department of Data Science, Gwangju 61452, Republic of Korea, 2Chosun University, Department of AI Software, Gwangju 61452, Republic of Korea

Abstract

This technical report describes a multi-embedding system with hierarchical proxy learning (MESH) for DCASE 2026 Challenge Task 1. The task is based on the Broad Sound Taxonomy (BST) for hierarchical audio classification, where each sample is assigned to a lower-level category within a top-level class. Fixed audio embeddings were extracted from multiple pretrained models and pooling configurations, and the final embedding set was selected using validation hierarchical F-score (hF) and embedding diversity. The selected audio embeddings were concatenated with fixed CLAP text embeddings, and the classifier was trained through BSD35k-CS pre-training followed by BSD10k-v1.2 fine-tuning. To incorporate the BST hierarchy, a hierarchical proxy loss included distinct proxy sets for top-level and lower-level classes. Final predictions were aggregated with selective kernel-based output fusion and OOF-based ensemble selection, and the best OOF ensemble achieved 84.59% hF among the four submitted systems.

PDF

A MULTI-BRANCH AND HIERARCHY-AWARE SYSTEM FOR BST AUDIO CLASSIFICATION

Beile Ning1, Jiayi Yu1, Zitong Wang1, Yufei Hu1, Wenjun Xu1, Yuanhang Qian1, Zhongxin Bai2, Gongping Huang1
1Wuhan University (WHU), Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), Wuhan, China, 2Harbin Engineering University (HEU), Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), Harbin, China

Abstract

This technical report introduces our system for Task 1 of the DCASE 2026 Challenge. Our system leverages multimodal audio-text representations extracted by CLAP and further improves classification performance through dataset expansion, branch-enhanced acoustic modeling, and hierarchy-aware prediction strategies. To improve data diversity, we selected a subset of the BSD35K dataset and incorporated it into BSD10K to expand the training set. We introduced additional acoustic branches based on Mel-Frequency Cepstral Coefficients (MFCC), log-Mel spectrogram, and log Short-Time Fourier Transform (log-STFT), and adopted different hierarchical strategies and post-processing strategies, with the post-processing strategy further integrated into the system via knowledge distillation. During training, we also applied data augmentation techniques including time masking and random cropping. Our best single system extracts log-STFT features from audio to facilitate model training on the new dataset, and employs post-processing to refine model predictions, achieving a hierarchical F1 score of 80.84% on the BSD10K test set under the same evaluation protocol as the baseline. Furthermore, we constructed two ensemble models based on different backbone architectures, obtaining hierarchical F1 scores of 81.25% and 81.18%, respectively, on the same BSD10K test set as the baseline.

PDF

POSTERIOR STACKING AND CONSERVATIVE METADATA GATING FOR HETEROGENEOUS AUDIO CLASSIFICATION

Minkyu Kil, Seunggyu Jeong, Seong-Eun Kim
Medisensing, Seoul, Korea

Abstract

This technical report describes the Medisensing-SeoulTech submission to DCASE 2026 Task 1, Heterogeneous Audio Classification. The submitted systems classify each evaluation item into one of 23 second-level Broad Sound Taxonomy classes. The final package contains four systems. Systems 1 and 2 are high-scoring weighted posterior stacks built from frozen audio and audio-text embedding heads with conservative same-parent metadata gates. System 3 is a Larger-CLAP residual classifier with a TF-IDF metadata gate, included as a complementary metadata-aware variant with broader label-symbol coverage. System 4 is a complementary audio-only posterior ensemble. Across metadata-assisted systems, metadata can refine the second-level class only inside the audio-predicted top-level parent. No evaluation-set ground-truth labels or manual evaluation-set annotations are used for training, threshold selection, system selection, or reporting.
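The constraint stated in this abstract, that metadata may only refine the second-level class inside the audio-predicted top-level parent, can be sketched as follows. This is an illustrative reading rather than the submitted system: the mini-taxonomy, the class names, the summed-posterior rule for choosing the parent, and the `gate_threshold` value are all hypothetical.

```python
# Hypothetical mini-taxonomy: second-level class -> top-level parent.
# The real BST has 5 top-level and 23 second-level categories.
PARENT_OF = {
    "m-solo": "m", "m-multi": "m",
    "sp-solo": "sp", "sp-crowd": "sp",
    "fx-animal": "fx",
}

def audio_parent(audio_posterior):
    """Pick the top-level parent with the largest summed audio posterior."""
    scores = {}
    for child, prob in audio_posterior.items():
        parent = PARENT_OF[child]
        scores[parent] = scores.get(parent, 0.0) + prob
    return max(scores, key=scores.get)

def gated_prediction(audio_posterior, metadata_posterior, gate_threshold=0.8):
    """Let confident metadata override the audio choice, but only among siblings."""
    parent = audio_parent(audio_posterior)
    siblings = [c for c in audio_posterior if PARENT_OF[c] == parent]
    audio_child = max(siblings, key=lambda c: audio_posterior[c])
    meta_child = max(siblings, key=lambda c: metadata_posterior.get(c, 0.0))
    # Conservative gate: metadata wins only when it is confident enough.
    if metadata_posterior.get(meta_child, 0.0) >= gate_threshold:
        return meta_child
    return audio_child

audio = {"m-solo": 0.10, "m-multi": 0.05, "sp-solo": 0.45,
         "sp-crowd": 0.35, "fx-animal": 0.05}
print(gated_prediction(audio, {"sp-crowd": 0.9}))    # metadata refines within "sp"
print(gated_prediction(audio, {"fx-animal": 0.95}))  # metadata cannot change the parent
```

In this sketch, metadata that points to a class under a different parent is simply ignored, so a wrong or misleading title or tag cannot move the prediction across top-level categories.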

PDF

MULTIMODAL ENSEMBLE SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION

Mehmet Atilay Kucukoglu
New York University, New York, USA

Abstract

This technical report details an ensemble approach and experiments leading to it as a submission to DCASE 2026 Challenge Task 1 on Heterogeneous Audio Classification using the Broad Sound Taxonomy (BST). The official HATR baseline system achieves 79.01% hierarchical F1 score on BSD10k-v1.2. Alternative audio encoders such as BEATs, ConvNeXt, MATPAC, and Whisper performed below CLAP, suggesting that CLAP’s joint audio-text alignment is a powerful approach for this task. Incorporating hierarchical loss, contrastive loss, confidence weighting, and class weighting did not improve the hierarchical F1 score. Among data augmentation methods, cross-modal swap with Gaussian noise achieved the best single model result at 79.13% hF1. Incorporating external data from FSD50K and ESC-50 mapped to the BST taxonomy did not achieve higher F1 scores. End-to-end fine-tuning of the CLAP audio encoder was also investigated but did not surpass the frozen multimodal baseline. The best result was obtained using an ensemble of 5 models incorporating different encoders, loss strategies, and augmentation methods, achieving 81.13% hierarchical F1 score on BSD10k-v1.2, a 2.12 percentage-point improvement over the baseline.

PDF

Heterogeneous Audio Classification with Frozen Audio-Language Embeddings

Pao Lin
Johannes Kepler Universität Linz, Linz, Austria

Abstract

We describe our submission to DCASE 2026 Task 1, Heterogeneous Audio Classification, where each sound is assigned to one of 23 second-level Broad Sound Taxonomy classes and systems are ranked by macro hierarchical-F1 (hF). Like the official baseline, we keep the CLAP audio-text encoder frozen and train small heads on its embeddings. Under matched 5-fold cross-validation on BSD10k-v1.2, our best single model, a gated multimodal head, reaches 78.8 +/- 1.1% hF, matching the published multimodal baseline. Several additions add little once evaluated carefully. Agreement pseudo-labels mined from BSD35k-CS give no gain that clears seed noise, so we treat them as a neutral data addition. A four-member ensemble, with weights tuned on our validation split, reaches 79.4 +/- 0.7% hF over three seeds. Of everything we tried, only adding the free-text description to the metadata helped cleanly (+2.2 hF). Finally, the residual error appears strongly tied to label ambiguity: the classes our system confuses most are the ones annotators were least confident labeling, and most of the lost credit lies in cross-family confusions such as crowd speech vs. urban soundscape. We view this label-ambiguity analysis as the main finding of our submission. We submit four systems with different complexity and robustness profiles.

PDF

HAF-CLAP: A HIERARCHICAL-AWARE MULTIMODAL CLAP SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION

Xiangyu Jing, Yuandong Luo, Chaoyong Huang, Hongqing Liu
Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract

This technical report describes the system submitted by Chongqing University of Posts and Telecommunications – Audio Lab (CQUPT AUL) for DCASE 2026 Task 1: Heterogeneous Audio Classification. The proposed system, termed HAF-CLAP, is built upon LAION-CLAP and exploits both acoustic information and textual metadata under the Broad Sound Taxonomy (BST). To adapt pretrained audio-language representations to the target classification task, HAF-CLAP uses a hierarchical-aware multimodal classification framework with audio-text fusion. Several training and inference strategies are further applied to improve robustness and prediction stability. The final submission combines multiple complementary HAF-CLAP models. Experimental results on our internal validation split show that the submitted system achieves competitive performance under the hierarchical evaluation metric.

PDF

CP-JKU Submission to Task 1 of the DCASE 2026 Challenge: LLM Prediction Fusion and Pseudo-Label Training for Heterogeneous Audio Classification

Paul Primus, Gerhard Widmer
Johannes Kepler University, Institute of Computational Perception, Linz, Austria

Abstract

This report describes our submission to DCASE 2026 Task 1, Heterogeneous Audio Classification. The task requires classifying Freesound audio clips into 23 second-level Broad Sound Taxonomy classes using audio and optional textual metadata. Our system combines pretrained audio-only and audio-text backbones, including PaSST, BEATs, M2D, CP-CLAP, and LAION-CLAP. We use GPT-5.4-mini in two ways: first, to generate metadata summaries for the CLAP text encoders, and second, to estimate class-probability priors from title, tags, and description. These priors are converted into learned class-embedding mixtures and late fused with the backbone representation. To exploit the noisy BSD35k-CS dataset, we use a three-stage training procedure: clean-data training on BSD10k, pseudo-label-based pretraining on BSD35k-CS, and final fine-tuning on BSD10k. The submitted systems are ensembles of selected Stage-3 models and achieve up to 0.834 hierarchical F-score on our BSD10k test split.

PDF

Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification

Nipun Sharma1, Aditya Sharma2
1Independent Researcher, Madhya Pradesh, India, 2Independent Researcher, Bangalore, India

Abstract

This technical report describes the submitted system for DCASE 2026 Task 1, Heterogeneous Audio Classification, which is defined over the Broad Sound Taxonomy with 5 top-level and 23 second-level categories. The proposed approach uses a late-fusion architecture built on frozen foundation-model embeddings, combining CLAP audio embeddings, PANNs audio embeddings, and CLAP text embeddings derived from metadata. Each modality is projected into a shared hidden space and fused with a lightweight Transformer encoder. The resulting sequence is then aggregated via mean-pooling to create a unified representation. Training uses a hierarchical objective that combines fine-level cross-entropy with an auxiliary coarse-level loss, together with label smoothing and class weighting. Evaluation is performed using the hierarchical F-score adopted for the task, with additional reporting of top-level and second-level accuracy. For the final submission, logits from seven selected checkpoints were averaged in a checkpoint ensemble to improve robustness and reduce fold-specific variance. The final system uses this streamlined three-modality configuration.
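The checkpoint ensemble described in this abstract averages logits before taking the final decision. A minimal sketch of that idea in plain Python follows; the function name, the nested-list layout, and the toy numbers are assumptions for illustration and are not taken from the report.

```python
def average_logits(checkpoint_logits):
    """Average logits over checkpoints and pick the top class per clip.

    checkpoint_logits[k][i][c] is the logit that checkpoint k assigns
    to class c for clip i (hypothetical layout).
    """
    n_ckpt = len(checkpoint_logits)
    n_clips = len(checkpoint_logits[0])
    n_classes = len(checkpoint_logits[0][0])

    mean_logits = [
        [sum(ck[i][c] for ck in checkpoint_logits) / n_ckpt for c in range(n_classes)]
        for i in range(n_clips)
    ]
    # Index of the largest averaged logit for each clip.
    predictions = [max(range(n_classes), key=row.__getitem__) for row in mean_logits]
    return mean_logits, predictions


# Toy example: 3 checkpoints, 2 clips, 3 classes (made-up values).
ckpt_a = [[2.0, 1.0, 0.0], [0.0, 0.5, 1.0]]
ckpt_b = [[0.0, 3.0, 0.0], [0.0, 1.5, 1.0]]
ckpt_c = [[1.0, 2.0, 0.0], [0.0, 1.0, 4.0]]
mean_logits, predictions = average_logits([ckpt_a, ckpt_b, ckpt_c])
```

In the toy data, the first checkpoint alone would assign the first clip to class 0, but the averaged logits favor class 1, which shows how averaging can smooth out the quirks of a single checkpoint.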

PDF

Noisy-Label-Aware Multimodal Ensembling with Inference-Safe Candidate Reranking for Heterogeneous Audio Classification

Peihong Zhang, Shengchen Li
Xi'an Jiaotong-Liverpool University, School of Advanced Technology, Suzhou, China

Abstract

We present a noisy-label-aware multimodal system for DCASE 2026 Challenge Task 1, Heterogeneous Audio Classification. The task requires second-level Broad Sound Taxonomy prediction from audio and metadata and is evaluated using macro-averaged hierarchical F-score. Our system addresses the central tension between exploiting large but noisy crowd-sourced supervision and preserving inference safety on the unlabeled evaluation set. It combines LAION-CLAP audio/text embeddings, teacher-weighted soft and hard supervision from BSD35k-CS, a fresh long-run ensemble of 31 checkpointed model families, calibrated probability fusion, and an inference-safe candidate reranker with conservative parent-aware override. All submitted systems are generated from deployable checkpoints rather than replayed development-set artifacts. On a label-blind development-as-evaluation protocol, the primary system obtains 83.5214 hF, improving over our local multimodal baseline by 4.57 absolute hF points.

PDF

MULTIMODAL HATR CLASSIFICATION WITH FROZEN CLAP EMBEDDINGS FOR THE BROAD SOUND TAXONOMY

Han Zheng, Yanxiong Li
South China University of Technology, School of Electronic and Information Engineering, Guangzhou, China

Abstract

This report presents a multimodal audio classifier for DCASE 2026 Task 1, which assigns Freesound recordings to 23 second-level categories of the Broad Sound Taxonomy (BST). The system combines frozen LAION-CLAP embeddings of audio and textual metadata (title, tags, description) with a Hierarchical Audio Tagging and Retrieval (HATR) feed-forward classifier using attention-based fusion. Development performance is measured under five-fold cross-validation on BSD10k-v1.2 and BSD35k-CS (lambda = 0.75). On BSD10k-v1.2, the multimodal model attains macro-averaged hierarchical F-scores of 78.65% ± 0.52 (hF), compared with 75.88% ± 0.47 for the audio-only configuration; on BSD35k-CS, the corresponding values are 79.77% ± 0.60 and 70.19% ± 0.93, with leaf accuracy reaching 85.32% ± 0.51. For the challenge evaluation set, five fold models trained on BSD10k-v1.2 are ensembled to produce predictions for all 3246 released test recordings.

PDF