Task description
This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal of this task is to evaluate sound classification models on diverse, real-world audio that varies widely in its nature, including duration and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based, metadata-based, and multimodal approaches to sound classification, as well as to leverage hierarchical relationships between taxonomy categories.
A more detailed description can be found on the task description page.
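The tables on this page report hierarchical precision, recall, and F-score (hP, hR, hF) without defining them. As a rough illustration only, the sketch below uses the common set-based formulation in which every label is expanded with its top-level ancestor before overlaps are counted. The function names `ancestors` and `hierarchical_prf` are hypothetical, the `top-second` label codes (for example `m-sp`) follow the abbreviations used in the class-wise table, and the micro-averaging here is an assumption; the challenge's official scoring may aggregate differently, for instance per class.

```python
def ancestors(label):
    """Expand a label code into itself plus its top-level category.

    Assumes codes of the form 'top-second', e.g. 'm-sp' -> {'m', 'm-sp'}.
    A bare top-level code such as 'm' expands to {'m'}.
    """
    top, _, _ = label.partition("-")
    return {top, label}


def hierarchical_prf(y_true, y_pred):
    """Micro-averaged hierarchical precision, recall and F-score.

    This is an illustrative sketch, not the official evaluation code.
    """
    overlap = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        t_set, p_set = ancestors(t), ancestors(p)
        overlap += len(t_set & p_set)   # nodes shared by truth and prediction
        pred_total += len(p_set)        # nodes in the predicted path
        true_total += len(t_set)        # nodes in the true path
    hp = overlap / pred_total if pred_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf = 2 * hp * hr / (hp + hr) if (hp + hr) else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    truth = ["m-sp", "fx-a", "ss-u"]
    preds = ["m-sp", "fx-v", "sp-s"]  # exact hit, right parent only, full miss
    print(hierarchical_prf(truth, preds))
```

Under this formulation a prediction in the correct top-level category but the wrong second-level class still earns partial credit, which is one way a system can exploit the taxonomy's structure.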
Leaderboard
| Official rank | System rank | Submission label | Submission name | Audio only | Technical Report | hP | hR | hF |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.842 | 0.846 | 0.836 |
| 7 | 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.746 | 0.709 | 0.699 |
| 5 | 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.824 | 0.775 | 0.787 |
| 11 | 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.422 | 0.370 | 0.372 |
| 9 | 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.657 | 0.647 | 0.629 |
| 8 | 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.732 | 0.692 |
| 4 | 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.817 | 0.782 | 0.788 |
| 2 | 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.825 | 0.810 | 0.811 |
| 10 | 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.526 | 0.488 | 0.472 |
| 6 | 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.801 | 0.785 | 0.785 |
| 3 | 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.797 | 0.799 |
| 12 | 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.162 | 0.139 | 0.116 |
Systems ranking
| System rank | Submission label | Submission name | Audio only | Technical Report | hP | hR | hF |
|---|---|---|---|---|---|---|---|
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.842 | 0.846 | 0.836 |
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | False | Primus_CPJKU_2026 | 0.842 | 0.845 | 0.836 |
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | False | Primus_CPJKU_2026 | 0.841 | 0.844 | 0.835 |
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | False | Primus_CPJKU_2026 | 0.838 | 0.842 | 0.833 |
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.825 | 0.810 | 0.811 |
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | False | Huang_WHU_2026 | 0.813 | 0.802 | 0.800 |
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.797 | 0.799 |
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | False | Huang_WHU_2026 | 0.820 | 0.794 | 0.799 |
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.818 | 0.795 | 0.798 |
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | False | Huang_WHU_2026 | 0.819 | 0.793 | 0.798 |
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.816 | 0.793 | 0.797 |
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.817 | 0.782 | 0.788 |
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.824 | 0.775 | 0.787 |
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.801 | 0.785 | 0.785 |
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.799 | 0.771 | 0.775 |
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | False | Kucukoglu_NYU_2026 | 0.814 | 0.764 | 0.773 |
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | False | Kucukoglu_NYU_2026 | 0.802 | 0.762 | 0.770 |
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | False | Lin_JKU_2026 | 0.774 | 0.778 | 0.768 |
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | False | Lin_JKU_2026 | 0.781 | 0.761 | 0.763 |
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.779 | 0.763 | 0.760 |
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | False | Lin_JKU_2026 | 0.769 | 0.765 | 0.752 |
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.746 | 0.709 | 0.699 |
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.732 | 0.706 | 0.699 |
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.732 | 0.692 |
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | False | Kil_Medisensing_2026 | 0.682 | 0.731 | 0.691 |
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | False | Colotti_TAU_2026 | 0.724 | 0.667 | 0.669 |
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | False | Colotti_TAU_2026 | 0.710 | 0.667 | 0.664 |
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | False | Kil_Medisensing_2026 | 0.652 | 0.641 | 0.633 |
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.657 | 0.647 | 0.629 |
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | True | Liu_CQUPT_2026 | 0.644 | 0.643 | 0.623 |
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | True | Liu_CQUPT_2026 | 0.639 | 0.640 | 0.619 |
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | True | Kil_Medisensing_2026 | 0.590 | 0.552 | 0.520 |
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.526 | 0.488 | 0.472 |
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.422 | 0.370 | 0.372 |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | False | Han_CSU_2026 | 0.355 | 0.316 | 0.315 |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | False | Han_CSU_2026 | 0.331 | 0.294 | 0.294 |
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.162 | 0.139 | 0.116 |
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | False | Zhang_XJTLU_2026 | 0.154 | 0.146 | 0.101 |
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | False | Han_CSU_2026 | 0.102 | 0.105 | 0.097 |
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | False | Zhang_XJTLU_2026 | 0.142 | 0.138 | 0.095 |
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | False | Zhang_XJTLU_2026 | 0.106 | 0.127 | 0.091 |
Class-wise performance
| System rank | Submission label | Submission name | Audio only | Technical Report | hP - Music > Solo percussion (m-sp) | hR - Music > Solo percussion (m-sp) | hF - Music > Solo percussion (m-sp) | hP - Music > Solo instrument (m-si) | hR - Music > Solo instrument (m-si) | hF - Music > Solo instrument (m-si) | hP - Music > Multiple instruments (m-m) | hR - Music > Multiple instruments (m-m) | hF - Music > Multiple instruments (m-m) | hP - Instrument samples > Percussion (is-p) | hR - Instrument samples > Percussion (is-p) | hF - Instrument samples > Percussion (is-p) | hP - Instrument samples > String (is-s) | hR - Instrument samples > String (is-s) | hF - Instrument samples > String (is-s) | hP - Instrument samples > Wind (is-w) | hR - Instrument samples > Wind (is-w) | hF - Instrument samples > Wind (is-w) | hP - Instrument samples > Piano / Keyboard instruments (is-k) | hR - Instrument samples > Piano / Keyboard instruments (is-k) | hF - Instrument samples > Piano / Keyboard instruments (is-k) | hP - Instrument samples > Synths / Electronic (is-e) | hR - Instrument samples > Synths / Electronic (is-e) | hF - Instrument samples > Synths / Electronic (is-e) | hP - Speech > Solo speech (sp-s) | hR - Speech > Solo speech (sp-s) | hF - Speech > Solo speech (sp-s) | hP - Speech > Conversation / Crowd (sp-c) | hR - Speech > Conversation / Crowd (sp-c) | hF - Speech > Conversation / Crowd (sp-c) | hP - Speech > Processed / Synthetic (sp-p) | hR - Speech > Processed / Synthetic (sp-p) | hF - Speech > Processed / Synthetic (sp-p) | hP - Sound effects > Objects / House appliances (fx-o) | hR - Sound effects > Objects / House appliances (fx-o) | hF - Sound effects > Objects / House appliances (fx-o) | hP - Sound effects > Vehicles (fx-v) | hR - Sound effects > Vehicles (fx-v) | hF - Sound effects > Vehicles (fx-v) | hP - Sound effects > Other mechanisms, engines, machines (fx-m) | hR - Sound effects > Other mechanisms, engines, machines (fx-m) | hF - Sound effects > Other mechanisms, engines, machines (fx-m) | hP - Sound effects > Human sounds and actions (fx-h) | hR - Sound effects > Human sounds and actions (fx-h) | hF - Sound effects > Human sounds and actions (fx-h) | hP - Sound effects > Animals (fx-a) | hR - Sound effects > Animals (fx-a) | hF - Sound effects > Animals (fx-a) | hP - Sound effects > Natural elements and explosions (fx-n) | hR - Sound effects > Natural elements and explosions (fx-n) | hF - Sound effects > Natural elements and explosions (fx-n) | hP - Sound effects > Experimental (fx-ex) | hR - Sound effects > Experimental (fx-ex) | hF - Sound effects > Experimental (fx-ex) | hP - Sound effects > Electronic / Design (fx-el) | hR - Sound effects > Electronic / Design (fx-el) | hF - Sound effects > Electronic / Design (fx-el) | hP - Soundscapes > Nature (ss-n) | hR - Soundscapes > Nature (ss-n) | hF - Soundscapes > Nature (ss-n) | hP - Soundscapes > Indoors (ss-i) | hR - Soundscapes > Indoors (ss-i) | hF - Soundscapes > Indoors (ss-i) | hP - Soundscapes > Urban (ss-u) | hR - Soundscapes > Urban (ss-u) | hF - Soundscapes > Urban (ss-u) | hP - Soundscapes > Synthetic / Artificial (ss-s) | hR - Soundscapes > Synthetic / Artificial (ss-s) | hF - Soundscapes > Synthetic / Artificial (ss-s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.940 | 0.728 | 0.821 | 0.908 | 0.864 | 0.885 | 0.839 | 0.815 | 0.827 | 0.807 | 0.991 | 0.889 | 0.834 | 0.933 | 0.881 | 0.600 | 1.000 | 0.750 | 0.963 | 1.000 | 0.981 | 0.858 | 0.836 | 0.847 | 0.920 | 0.933 | 0.926 | 0.955 | 0.703 | 0.809 | 0.854 | 0.717 | 0.779 | 0.850 | 0.937 | 0.891 | 0.879 | 0.944 | 0.910 | 0.825 | 0.878 | 0.851 | 0.909 | 0.889 | 0.899 | 0.831 | 0.990 | 0.904 | 0.734 | 0.922 | 0.818 | 0.536 | 0.728 | 0.617 | 0.904 | 0.762 | 0.827 | 0.932 | 0.788 | 0.854 | 0.757 | 0.539 | 0.630 | 0.847 | 0.784 | 0.814 | 0.881 | 0.775 | 0.824 |
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | False | Primus_CPJKU_2026 | 0.960 | 0.728 | 0.828 | 0.918 | 0.852 | 0.884 | 0.855 | 0.815 | 0.835 | 0.814 | 0.982 | 0.890 | 0.837 | 0.951 | 0.890 | 0.600 | 1.000 | 0.750 | 0.963 | 1.000 | 0.981 | 0.847 | 0.820 | 0.833 | 0.915 | 0.936 | 0.925 | 0.970 | 0.670 | 0.792 | 0.869 | 0.737 | 0.798 | 0.841 | 0.937 | 0.886 | 0.870 | 0.944 | 0.905 | 0.846 | 0.878 | 0.862 | 0.909 | 0.889 | 0.899 | 0.831 | 0.990 | 0.904 | 0.720 | 0.922 | 0.809 | 0.544 | 0.750 | 0.630 | 0.886 | 0.773 | 0.826 | 0.944 | 0.788 | 0.859 | 0.709 | 0.539 | 0.613 | 0.847 | 0.784 | 0.814 | 0.878 | 0.755 | 0.812 |
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | False | Primus_CPJKU_2026 | 0.941 | 0.744 | 0.831 | 0.891 | 0.864 | 0.877 | 0.839 | 0.822 | 0.830 | 0.812 | 0.972 | 0.885 | 0.834 | 0.933 | 0.881 | 0.600 | 1.000 | 0.750 | 0.963 | 1.000 | 0.981 | 0.842 | 0.836 | 0.839 | 0.921 | 0.941 | 0.931 | 0.949 | 0.637 | 0.763 | 0.872 | 0.689 | 0.770 | 0.850 | 0.937 | 0.891 | 0.880 | 0.955 | 0.916 | 0.842 | 0.863 | 0.852 | 0.909 | 0.889 | 0.899 | 0.820 | 0.990 | 0.897 | 0.744 | 0.922 | 0.823 | 0.533 | 0.728 | 0.616 | 0.894 | 0.773 | 0.829 | 0.957 | 0.788 | 0.864 | 0.718 | 0.546 | 0.620 | 0.858 | 0.803 | 0.829 | 0.883 | 0.779 | 0.828 |
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | False | Primus_CPJKU_2026 | 0.941 | 0.744 | 0.831 | 0.877 | 0.818 | 0.846 | 0.845 | 0.845 | 0.845 | 0.812 | 0.972 | 0.885 | 0.785 | 0.933 | 0.853 | 0.600 | 1.000 | 0.750 | 0.963 | 1.000 | 0.981 | 0.825 | 0.820 | 0.823 | 0.913 | 0.928 | 0.920 | 0.952 | 0.670 | 0.787 | 0.872 | 0.704 | 0.779 | 0.865 | 0.937 | 0.899 | 0.871 | 0.955 | 0.911 | 0.872 | 0.893 | 0.882 | 0.896 | 0.900 | 0.898 | 0.831 | 0.990 | 0.904 | 0.729 | 0.922 | 0.815 | 0.543 | 0.669 | 0.600 | 0.867 | 0.785 | 0.824 | 0.944 | 0.788 | 0.859 | 0.734 | 0.565 | 0.639 | 0.855 | 0.791 | 0.821 | 0.878 | 0.748 | 0.808 |
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.892 | 0.851 | 0.871 | 0.794 | 0.832 | 0.813 | 0.845 | 0.838 | 0.841 | 0.925 | 0.947 | 0.936 | 0.862 | 0.928 | 0.894 | 1.000 | 1.000 | 1.000 | 0.963 | 1.000 | 0.981 | 0.780 | 0.828 | 0.804 | 0.936 | 0.970 | 0.953 | 0.982 | 0.682 | 0.805 | 0.907 | 0.656 | 0.761 | 0.753 | 0.953 | 0.841 | 0.898 | 0.866 | 0.882 | 0.759 | 0.823 | 0.790 | 0.790 | 0.861 | 0.824 | 0.858 | 0.846 | 0.852 | 0.672 | 0.901 | 0.770 | 0.420 | 0.458 | 0.438 | 0.891 | 0.590 | 0.710 | 0.779 | 0.811 | 0.795 | 0.759 | 0.463 | 0.575 | 0.701 | 0.762 | 0.730 | 0.816 | 0.760 | 0.787 |
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | False | Huang_WHU_2026 | 0.932 | 0.819 | 0.872 | 0.878 | 0.820 | 0.848 | 0.818 | 0.819 | 0.819 | 0.875 | 0.941 | 0.907 | 0.851 | 0.951 | 0.899 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.820 | 0.803 | 0.811 | 0.892 | 0.965 | 0.928 | 1.000 | 0.550 | 0.710 | 0.887 | 0.617 | 0.728 | 0.748 | 0.941 | 0.833 | 0.927 | 0.882 | 0.904 | 0.736 | 0.854 | 0.791 | 0.765 | 0.828 | 0.795 | 0.863 | 0.879 | 0.871 | 0.674 | 0.912 | 0.775 | 0.435 | 0.408 | 0.421 | 0.838 | 0.690 | 0.757 | 0.782 | 0.837 | 0.809 | 0.625 | 0.382 | 0.474 | 0.601 | 0.757 | 0.670 | 0.762 | 0.792 | 0.776 |
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.772 | 0.851 | 0.810 | 0.858 | 0.832 | 0.845 | 0.848 | 0.870 | 0.859 | 0.903 | 0.781 | 0.838 | 0.904 | 0.919 | 0.911 | 0.688 | 0.583 | 0.631 | 0.963 | 1.000 | 0.981 | 0.786 | 0.918 | 0.847 | 0.946 | 0.941 | 0.943 | 1.000 | 0.603 | 0.752 | 0.830 | 0.696 | 0.758 | 0.770 | 0.937 | 0.846 | 0.922 | 0.891 | 0.906 | 0.719 | 0.893 | 0.797 | 0.866 | 0.820 | 0.842 | 0.811 | 0.910 | 0.857 | 0.709 | 0.907 | 0.796 | 0.464 | 0.553 | 0.505 | 0.875 | 0.606 | 0.716 | 0.830 | 0.781 | 0.805 | 0.767 | 0.502 | 0.607 | 0.696 | 0.767 | 0.730 | 0.853 | 0.760 | 0.804 |
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | False | Huang_WHU_2026 | 0.906 | 0.829 | 0.866 | 0.827 | 0.825 | 0.826 | 0.858 | 0.833 | 0.846 | 0.899 | 0.941 | 0.920 | 0.880 | 0.917 | 0.898 | 1.000 | 0.667 | 0.800 | 0.963 | 1.000 | 0.981 | 0.836 | 0.836 | 0.836 | 0.935 | 0.957 | 0.946 | 0.967 | 0.610 | 0.748 | 0.889 | 0.628 | 0.736 | 0.764 | 0.937 | 0.842 | 0.929 | 0.886 | 0.907 | 0.747 | 0.838 | 0.790 | 0.764 | 0.871 | 0.814 | 0.875 | 0.863 | 0.869 | 0.681 | 0.912 | 0.780 | 0.422 | 0.453 | 0.437 | 0.792 | 0.655 | 0.717 | 0.780 | 0.849 | 0.813 | 0.720 | 0.424 | 0.533 | 0.633 | 0.757 | 0.689 | 0.804 | 0.779 | 0.791 |
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.833 | 0.877 | 0.854 | 0.823 | 0.843 | 0.833 | 0.870 | 0.845 | 0.857 | 0.902 | 0.855 | 0.878 | 0.904 | 0.919 | 0.911 | 0.688 | 0.583 | 0.631 | 0.963 | 1.000 | 0.981 | 0.781 | 0.893 | 0.833 | 0.946 | 0.941 | 0.943 | 1.000 | 0.575 | 0.730 | 0.821 | 0.696 | 0.754 | 0.780 | 0.937 | 0.851 | 0.899 | 0.886 | 0.893 | 0.725 | 0.893 | 0.801 | 0.850 | 0.820 | 0.834 | 0.813 | 0.926 | 0.866 | 0.692 | 0.907 | 0.785 | 0.456 | 0.531 | 0.490 | 0.901 | 0.588 | 0.712 | 0.858 | 0.800 | 0.828 | 0.786 | 0.477 | 0.594 | 0.642 | 0.767 | 0.699 | 0.887 | 0.721 | 0.795 |
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | False | Huang_WHU_2026 | 0.890 | 0.829 | 0.859 | 0.832 | 0.825 | 0.829 | 0.861 | 0.845 | 0.853 | 0.896 | 0.917 | 0.907 | 0.880 | 0.917 | 0.898 | 1.000 | 0.667 | 0.800 | 0.963 | 1.000 | 0.981 | 0.829 | 0.836 | 0.832 | 0.936 | 0.965 | 0.950 | 0.967 | 0.610 | 0.748 | 0.904 | 0.628 | 0.741 | 0.761 | 0.941 | 0.841 | 0.929 | 0.886 | 0.907 | 0.747 | 0.838 | 0.790 | 0.793 | 0.871 | 0.830 | 0.867 | 0.863 | 0.865 | 0.673 | 0.912 | 0.775 | 0.403 | 0.453 | 0.427 | 0.797 | 0.667 | 0.726 | 0.782 | 0.837 | 0.809 | 0.696 | 0.394 | 0.503 | 0.629 | 0.757 | 0.687 | 0.804 | 0.779 | 0.791 |
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.725 | 0.835 | 0.776 | 0.819 | 0.825 | 0.822 | 0.810 | 0.870 | 0.839 | 0.867 | 0.752 | 0.805 | 0.873 | 0.896 | 0.884 | 1.000 | 0.792 | 0.884 | 0.931 | 1.000 | 0.964 | 0.842 | 0.893 | 0.866 | 0.946 | 0.949 | 0.948 | 1.000 | 0.535 | 0.697 | 0.841 | 0.635 | 0.724 | 0.792 | 0.944 | 0.862 | 0.898 | 0.875 | 0.886 | 0.723 | 0.848 | 0.780 | 0.818 | 0.793 | 0.805 | 0.804 | 0.916 | 0.857 | 0.705 | 0.890 | 0.787 | 0.482 | 0.531 | 0.505 | 0.852 | 0.639 | 0.730 | 0.826 | 0.792 | 0.809 | 0.719 | 0.502 | 0.591 | 0.627 | 0.748 | 0.682 | 0.875 | 0.779 | 0.824 |
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.888 | 0.825 | 0.856 | 0.768 | 0.848 | 0.806 | 0.810 | 0.810 | 0.810 | 0.878 | 0.932 | 0.904 | 0.836 | 0.868 | 0.851 | 1.000 | 0.667 | 0.800 | 0.882 | 1.000 | 0.937 | 0.778 | 0.752 | 0.765 | 0.934 | 0.979 | 0.956 | 1.000 | 0.685 | 0.813 | 0.918 | 0.656 | 0.765 | 0.754 | 0.941 | 0.837 | 0.915 | 0.756 | 0.828 | 0.750 | 0.832 | 0.789 | 0.834 | 0.820 | 0.827 | 0.821 | 0.834 | 0.827 | 0.684 | 0.890 | 0.774 | 0.455 | 0.575 | 0.508 | 0.845 | 0.523 | 0.646 | 0.797 | 0.868 | 0.831 | 0.810 | 0.419 | 0.552 | 0.592 | 0.812 | 0.685 | 0.838 | 0.703 | 0.765 |
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.898 | 0.837 | 0.867 | 0.786 | 0.836 | 0.810 | 0.812 | 0.718 | 0.762 | 0.909 | 0.917 | 0.913 | 0.846 | 0.826 | 0.836 | 1.000 | 0.458 | 0.629 | 1.000 | 1.000 | 1.000 | 0.775 | 0.873 | 0.821 | 0.920 | 0.949 | 0.934 | 0.968 | 0.637 | 0.769 | 0.871 | 0.615 | 0.721 | 0.746 | 0.955 | 0.838 | 0.937 | 0.846 | 0.889 | 0.809 | 0.832 | 0.821 | 0.844 | 0.824 | 0.834 | 0.771 | 0.916 | 0.837 | 0.725 | 0.879 | 0.795 | 0.420 | 0.567 | 0.483 | 0.928 | 0.558 | 0.697 | 0.847 | 0.689 | 0.760 | 0.750 | 0.479 | 0.585 | 0.627 | 0.793 | 0.700 | 0.766 | 0.819 | 0.792 |
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.884 | 0.764 | 0.820 | 0.782 | 0.774 | 0.778 | 0.901 | 0.713 | 0.796 | 0.815 | 0.928 | 0.868 | 0.757 | 0.933 | 0.836 | 1.000 | 1.000 | 1.000 | 0.941 | 1.000 | 0.970 | 0.701 | 0.824 | 0.757 | 0.942 | 0.952 | 0.947 | 1.000 | 0.568 | 0.724 | 0.850 | 0.622 | 0.719 | 0.768 | 0.902 | 0.829 | 0.858 | 0.945 | 0.900 | 0.698 | 0.835 | 0.760 | 0.872 | 0.762 | 0.814 | 0.814 | 0.887 | 0.849 | 0.684 | 0.860 | 0.762 | 0.453 | 0.436 | 0.444 | 0.916 | 0.690 | 0.787 | 0.736 | 0.818 | 0.775 | 0.688 | 0.477 | 0.563 | 0.607 | 0.635 | 0.620 | 0.758 | 0.721 | 0.739 |
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.852 | 0.871 | 0.861 | 0.745 | 0.852 | 0.795 | 0.860 | 0.796 | 0.827 | 0.893 | 0.908 | 0.900 | 0.873 | 0.852 | 0.862 | 0.688 | 0.583 | 0.631 | 0.931 | 1.000 | 0.964 | 0.763 | 0.781 | 0.772 | 0.932 | 0.949 | 0.940 | 1.000 | 0.570 | 0.726 | 0.830 | 0.745 | 0.785 | 0.733 | 0.937 | 0.822 | 0.906 | 0.901 | 0.904 | 0.723 | 0.909 | 0.805 | 0.823 | 0.789 | 0.806 | 0.809 | 0.799 | 0.804 | 0.672 | 0.907 | 0.772 | 0.396 | 0.414 | 0.405 | 0.845 | 0.551 | 0.667 | 0.782 | 0.800 | 0.791 | 0.850 | 0.398 | 0.542 | 0.601 | 0.724 | 0.656 | 0.875 | 0.708 | 0.783 |
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | False | Kucukoglu_NYU_2026 | 0.883 | 0.859 | 0.871 | 0.818 | 0.825 | 0.821 | 0.857 | 0.713 | 0.778 | 0.863 | 0.923 | 0.892 | 0.834 | 0.870 | 0.852 | 1.000 | 0.458 | 0.629 | 0.901 | 1.000 | 0.948 | 0.712 | 0.824 | 0.764 | 0.908 | 0.962 | 0.934 | 0.967 | 0.625 | 0.759 | 0.855 | 0.663 | 0.747 | 0.720 | 0.948 | 0.819 | 0.934 | 0.786 | 0.854 | 0.817 | 0.863 | 0.839 | 0.806 | 0.818 | 0.812 | 0.763 | 0.869 | 0.813 | 0.698 | 0.856 | 0.769 | 0.430 | 0.619 | 0.507 | 0.906 | 0.461 | 0.611 | 0.792 | 0.689 | 0.737 | 0.783 | 0.410 | 0.538 | 0.595 | 0.793 | 0.680 | 0.869 | 0.748 | 0.804 |
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | False | Kucukoglu_NYU_2026 | 0.887 | 0.875 | 0.881 | 0.755 | 0.833 | 0.792 | 0.780 | 0.706 | 0.741 | 0.914 | 0.903 | 0.908 | 0.883 | 0.785 | 0.831 | 1.000 | 0.458 | 0.629 | 0.901 | 1.000 | 0.948 | 0.744 | 0.738 | 0.741 | 0.929 | 0.949 | 0.939 | 0.944 | 0.703 | 0.806 | 0.875 | 0.628 | 0.731 | 0.762 | 0.955 | 0.848 | 0.907 | 0.806 | 0.854 | 0.686 | 0.863 | 0.764 | 0.841 | 0.801 | 0.821 | 0.802 | 0.873 | 0.836 | 0.728 | 0.884 | 0.798 | 0.424 | 0.647 | 0.513 | 0.725 | 0.488 | 0.584 | 0.796 | 0.731 | 0.762 | 0.720 | 0.424 | 0.533 | 0.614 | 0.781 | 0.688 | 0.838 | 0.689 | 0.756 |
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | False | Lin_JKU_2026 | 0.897 | 0.744 | 0.813 | 0.812 | 0.740 | 0.774 | 0.878 | 0.718 | 0.790 | 0.815 | 0.928 | 0.868 | 0.767 | 0.917 | 0.835 | 0.750 | 1.000 | 0.857 | 0.862 | 1.000 | 0.926 | 0.587 | 0.834 | 0.689 | 0.966 | 0.868 | 0.915 | 0.932 | 0.608 | 0.735 | 0.786 | 0.635 | 0.702 | 0.765 | 0.691 | 0.726 | 0.833 | 0.957 | 0.891 | 0.680 | 0.875 | 0.765 | 0.888 | 0.682 | 0.772 | 0.785 | 0.947 | 0.859 | 0.719 | 0.871 | 0.787 | 0.476 | 0.494 | 0.485 | 0.788 | 0.736 | 0.761 | 0.773 | 0.811 | 0.791 | 0.684 | 0.535 | 0.600 | 0.671 | 0.558 | 0.609 | 0.698 | 0.748 | 0.722 |
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | False | Lin_JKU_2026 | 0.860 | 0.829 | 0.844 | 0.792 | 0.829 | 0.810 | 0.800 | 0.815 | 0.807 | 0.847 | 0.917 | 0.881 | 0.774 | 0.891 | 0.828 | 0.688 | 0.458 | 0.550 | 0.833 | 1.000 | 0.909 | 0.831 | 0.670 | 0.742 | 0.921 | 0.900 | 0.910 | 0.905 | 0.655 | 0.760 | 0.747 | 0.556 | 0.637 | 0.743 | 0.951 | 0.834 | 0.916 | 0.881 | 0.898 | 0.739 | 0.723 | 0.731 | 0.855 | 0.840 | 0.847 | 0.787 | 0.910 | 0.844 | 0.736 | 0.894 | 0.808 | 0.473 | 0.661 | 0.552 | 0.797 | 0.532 | 0.639 | 0.823 | 0.781 | 0.801 | 0.652 | 0.382 | 0.482 | 0.629 | 0.736 | 0.678 | 0.818 | 0.696 | 0.752 |
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.887 | 0.875 | 0.881 | 0.769 | 0.818 | 0.792 | 0.789 | 0.736 | 0.762 | 0.927 | 0.908 | 0.918 | 0.858 | 0.852 | 0.855 | 0.000 | 0.125 | 0.000 | 1.000 | 1.000 | 1.000 | 0.757 | 0.852 | 0.801 | 0.914 | 0.949 | 0.931 | 0.968 | 0.645 | 0.774 | 0.864 | 0.574 | 0.690 | 0.779 | 0.948 | 0.856 | 0.927 | 0.825 | 0.873 | 0.783 | 0.848 | 0.814 | 0.837 | 0.824 | 0.830 | 0.788 | 0.936 | 0.856 | 0.714 | 0.884 | 0.790 | 0.421 | 0.575 | 0.486 | 0.904 | 0.532 | 0.670 | 0.833 | 0.726 | 0.776 | 0.781 | 0.539 | 0.638 | 0.629 | 0.781 | 0.697 | 0.787 | 0.787 | 0.787 |
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | False | Lin_JKU_2026 | 0.887 | 0.750 | 0.813 | 0.858 | 0.809 | 0.833 | 0.832 | 0.750 | 0.789 | 0.806 | 0.961 | 0.877 | 0.838 | 0.905 | 0.870 | 0.000 | 0.250 | 0.000 | 0.849 | 0.961 | 0.901 | 0.789 | 0.857 | 0.822 | 0.911 | 0.970 | 0.940 | 1.000 | 0.583 | 0.736 | 0.889 | 0.651 | 0.751 | 0.772 | 0.944 | 0.850 | 0.865 | 0.931 | 0.897 | 0.719 | 0.869 | 0.787 | 0.946 | 0.762 | 0.844 | 0.799 | 0.936 | 0.862 | 0.691 | 0.888 | 0.777 | 0.469 | 0.503 | 0.486 | 0.797 | 0.683 | 0.735 | 0.839 | 0.830 | 0.835 | 0.762 | 0.373 | 0.501 | 0.593 | 0.697 | 0.641 | 0.779 | 0.733 | 0.755 |
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.904 | 0.812 | 0.855 | 0.779 | 0.772 | 0.776 | 0.913 | 0.606 | 0.729 | 0.865 | 0.923 | 0.893 | 0.923 | 0.671 | 0.777 | 0.000 | 0.375 | 0.000 | 1.000 | 0.609 | 0.757 | 0.627 | 0.912 | 0.743 | 0.947 | 0.910 | 0.928 | 0.882 | 0.315 | 0.464 | 0.897 | 0.717 | 0.797 | 0.720 | 0.909 | 0.804 | 0.859 | 0.868 | 0.863 | 0.765 | 0.838 | 0.800 | 0.703 | 0.902 | 0.790 | 0.802 | 0.805 | 0.804 | 0.743 | 0.851 | 0.793 | 0.395 | 0.625 | 0.484 | 0.958 | 0.426 | 0.590 | 0.814 | 0.894 | 0.852 | 0.441 | 0.273 | 0.337 | 0.451 | 0.659 | 0.535 | 0.771 | 0.642 | 0.701 |
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.906 | 0.754 | 0.823 | 0.803 | 0.763 | 0.782 | 0.936 | 0.634 | 0.756 | 0.844 | 0.932 | 0.886 | 0.662 | 0.567 | 0.611 | 0.643 | 1.000 | 0.783 | 0.714 | 0.570 | 0.634 | 0.636 | 0.742 | 0.685 | 0.976 | 0.711 | 0.823 | 0.963 | 0.328 | 0.489 | 0.838 | 0.587 | 0.690 | 0.765 | 0.898 | 0.826 | 0.877 | 0.841 | 0.858 | 0.698 | 0.838 | 0.762 | 0.636 | 0.902 | 0.746 | 0.732 | 0.852 | 0.788 | 0.716 | 0.845 | 0.775 | 0.332 | 0.606 | 0.429 | 0.625 | 0.495 | 0.553 | 0.784 | 0.719 | 0.750 | 0.591 | 0.324 | 0.419 | 0.442 | 0.724 | 0.549 | 0.723 | 0.596 | 0.653 |
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.536 | 0.750 | 0.625 | 0.803 | 0.628 | 0.705 | 0.722 | 0.634 | 0.675 | 0.509 | 0.314 | 0.389 | 0.725 | 0.944 | 0.820 | 0.429 | 1.000 | 0.600 | 0.856 | 1.000 | 0.923 | 0.864 | 0.826 | 0.845 | 0.827 | 0.970 | 0.893 | 0.000 | 0.083 | 0.000 | 0.742 | 0.727 | 0.734 | 0.748 | 0.753 | 0.751 | 0.935 | 0.711 | 0.808 | 0.598 | 0.558 | 0.577 | 0.792 | 0.824 | 0.807 | 0.873 | 0.984 | 0.925 | 0.832 | 0.823 | 0.828 | 0.507 | 0.472 | 0.489 | 0.780 | 0.808 | 0.793 | 0.837 | 0.863 | 0.850 | 0.448 | 0.819 | 0.579 | 0.576 | 0.825 | 0.678 | 0.739 | 0.522 | 0.612 |
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | False | Kil_Medisensing_2026 | 0.536 | 0.750 | 0.625 | 0.797 | 0.628 | 0.702 | 0.716 | 0.623 | 0.666 | 0.509 | 0.314 | 0.389 | 0.725 | 0.944 | 0.820 | 0.429 | 1.000 | 0.600 | 0.856 | 1.000 | 0.923 | 0.864 | 0.826 | 0.845 | 0.822 | 0.970 | 0.890 | 0.000 | 0.083 | 0.000 | 0.750 | 0.727 | 0.738 | 0.738 | 0.760 | 0.749 | 0.935 | 0.711 | 0.808 | 0.590 | 0.543 | 0.565 | 0.798 | 0.824 | 0.811 | 0.873 | 0.984 | 0.925 | 0.829 | 0.812 | 0.821 | 0.510 | 0.486 | 0.498 | 0.802 | 0.808 | 0.805 | 0.837 | 0.863 | 0.850 | 0.448 | 0.819 | 0.579 | 0.576 | 0.825 | 0.678 | 0.739 | 0.522 | 0.612 |
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | False | Colotti_TAU_2026 | 0.902 | 0.758 | 0.824 | 0.666 | 0.782 | 0.719 | 0.820 | 0.729 | 0.772 | 0.834 | 0.893 | 0.863 | 0.802 | 0.558 | 0.658 | 0.375 | 0.000 | 0.000 | 1.000 | 0.680 | 0.809 | 0.660 | 0.570 | 0.612 | 0.926 | 0.840 | 0.881 | 0.929 | 0.290 | 0.442 | 0.808 | 0.431 | 0.562 | 0.715 | 0.960 | 0.820 | 0.835 | 0.753 | 0.792 | 0.663 | 0.741 | 0.700 | 0.815 | 0.785 | 0.800 | 0.805 | 0.809 | 0.807 | 0.725 | 0.879 | 0.795 | 0.409 | 0.669 | 0.508 | 0.659 | 0.690 | 0.674 | 0.763 | 0.894 | 0.823 | 0.438 | 0.361 | 0.396 | 0.417 | 0.692 | 0.520 | 0.684 | 0.569 | 0.621 |
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | False | Colotti_TAU_2026 | 0.880 | 0.726 | 0.796 | 0.785 | 0.859 | 0.820 | 0.747 | 0.602 | 0.667 | 0.891 | 0.700 | 0.784 | 0.671 | 0.634 | 0.652 | 1.000 | 1.000 | 1.000 | 0.875 | 0.508 | 0.643 | 0.516 | 0.504 | 0.510 | 0.958 | 0.880 | 0.917 | 0.682 | 0.135 | 0.225 | 0.650 | 0.464 | 0.542 | 0.689 | 0.930 | 0.792 | 0.748 | 0.753 | 0.751 | 0.502 | 0.893 | 0.642 | 0.790 | 0.775 | 0.782 | 0.844 | 0.682 | 0.755 | 0.644 | 0.886 | 0.746 | 0.357 | 0.617 | 0.452 | 0.750 | 0.319 | 0.448 | 0.686 | 0.925 | 0.788 | 0.333 | 0.289 | 0.310 | 0.448 | 0.673 | 0.538 | 0.879 | 0.598 | 0.712 |
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | False | Kil_Medisensing_2026 | 0.455 | 0.615 | 0.523 | 0.733 | 0.783 | 0.757 | 0.795 | 0.708 | 0.749 | 0.365 | 0.263 | 0.305 | 0.766 | 0.803 | 0.784 | 0.000 | 0.125 | 0.000 | 0.911 | 0.844 | 0.876 | 0.877 | 0.764 | 0.817 | 0.821 | 0.957 | 0.884 | 0.473 | 0.217 | 0.298 | 0.889 | 0.579 | 0.701 | 0.605 | 0.761 | 0.674 | 0.853 | 0.784 | 0.817 | 0.549 | 0.634 | 0.588 | 0.812 | 0.623 | 0.705 | 0.742 | 0.857 | 0.795 | 0.851 | 0.560 | 0.676 | 0.603 | 0.408 | 0.487 | 0.558 | 0.704 | 0.623 | 0.629 | 0.906 | 0.743 | 0.478 | 0.542 | 0.508 | 0.489 | 0.712 | 0.579 | 0.750 | 0.588 | 0.659 |
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.522 | 0.688 | 0.594 | 0.755 | 0.634 | 0.689 | 0.766 | 0.646 | 0.701 | 0.424 | 0.287 | 0.342 | 0.680 | 0.910 | 0.778 | 0.000 | 0.250 | 0.000 | 0.750 | 0.609 | 0.672 | 0.724 | 0.805 | 0.762 | 0.824 | 0.939 | 0.878 | 0.943 | 0.230 | 0.370 | 0.892 | 0.630 | 0.738 | 0.664 | 0.767 | 0.712 | 0.910 | 0.710 | 0.797 | 0.551 | 0.652 | 0.598 | 0.742 | 0.760 | 0.751 | 0.794 | 0.865 | 0.828 | 0.760 | 0.642 | 0.696 | 0.580 | 0.447 | 0.505 | 0.610 | 0.736 | 0.667 | 0.696 | 0.894 | 0.783 | 0.309 | 0.396 | 0.347 | 0.476 | 0.786 | 0.593 | 0.744 | 0.605 | 0.667 |
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | True | Liu_CQUPT_2026 | 0.497 | 0.647 | 0.562 | 0.748 | 0.645 | 0.693 | 0.774 | 0.646 | 0.704 | 0.423 | 0.267 | 0.327 | 0.695 | 0.873 | 0.774 | 0.000 | 0.250 | 0.000 | 0.760 | 0.688 | 0.722 | 0.714 | 0.814 | 0.761 | 0.854 | 0.952 | 0.901 | 0.795 | 0.230 | 0.357 | 0.925 | 0.651 | 0.764 | 0.659 | 0.830 | 0.735 | 0.881 | 0.694 | 0.776 | 0.522 | 0.588 | 0.553 | 0.756 | 0.768 | 0.762 | 0.811 | 0.891 | 0.849 | 0.768 | 0.735 | 0.751 | 0.558 | 0.353 | 0.432 | 0.609 | 0.771 | 0.681 | 0.677 | 0.858 | 0.757 | 0.281 | 0.361 | 0.316 | 0.448 | 0.743 | 0.559 | 0.668 | 0.542 | 0.598 |
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | True | Liu_CQUPT_2026 | 0.483 | 0.621 | 0.543 | 0.748 | 0.645 | 0.693 | 0.766 | 0.646 | 0.701 | 0.423 | 0.267 | 0.327 | 0.695 | 0.873 | 0.774 | 0.000 | 0.250 | 0.000 | 0.760 | 0.688 | 0.722 | 0.714 | 0.814 | 0.761 | 0.854 | 0.952 | 0.901 | 0.795 | 0.230 | 0.357 | 0.925 | 0.651 | 0.764 | 0.650 | 0.802 | 0.718 | 0.881 | 0.694 | 0.776 | 0.519 | 0.588 | 0.551 | 0.744 | 0.768 | 0.756 | 0.804 | 0.891 | 0.846 | 0.757 | 0.718 | 0.737 | 0.517 | 0.339 | 0.409 | 0.601 | 0.771 | 0.675 | 0.668 | 0.858 | 0.751 | 0.281 | 0.361 | 0.316 | 0.448 | 0.743 | 0.559 | 0.668 | 0.542 | 0.598 |
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | True | Kil_Medisensing_2026 | 0.861 | 0.732 | 0.791 | 0.918 | 0.503 | 0.650 | 0.719 | 0.650 | 0.683 | 0.979 | 0.752 | 0.851 | 0.692 | 0.884 | 0.777 | 0.000 | 0.250 | 0.000 | 0.000 | 0.305 | 0.000 | 0.600 | 0.637 | 0.618 | 0.949 | 0.887 | 0.917 | 0.375 | 0.015 | 0.029 | 0.926 | 0.352 | 0.510 | 0.544 | 0.543 | 0.544 | 0.534 | 0.925 | 0.677 | 0.824 | 0.524 | 0.641 | 0.475 | 0.783 | 0.591 | 0.752 | 0.795 | 0.773 | 0.688 | 0.662 | 0.674 | 0.000 | 0.242 | 0.000 | 0.569 | 0.495 | 0.530 | 0.955 | 0.316 | 0.475 | 0.219 | 0.248 | 0.232 | 0.245 | 0.642 | 0.355 | 0.741 | 0.554 | 0.634 |
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.557 | 0.544 | 0.550 | 0.738 | 0.522 | 0.611 | 0.589 | 0.866 | 0.701 | 0.312 | 0.096 | 0.146 | 0.688 | 0.706 | 0.697 | 0.000 | 0.000 | 0.000 | 0.683 | 0.602 | 0.640 | 0.520 | 0.320 | 0.397 | 0.750 | 0.952 | 0.839 | 0.795 | 0.250 | 0.380 | 0.675 | 0.298 | 0.414 | 0.515 | 0.609 | 0.558 | 0.745 | 0.476 | 0.581 | 0.500 | 0.317 | 0.388 | 0.427 | 0.656 | 0.517 | 0.448 | 0.799 | 0.574 | 0.559 | 0.627 | 0.591 | 0.207 | 0.342 | 0.258 | 0.316 | 0.690 | 0.433 | 0.750 | 0.566 | 0.645 | 0.421 | 0.294 | 0.346 | 0.330 | 0.601 | 0.426 | 0.571 | 0.100 | 0.171 |
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.458 | 0.500 | 0.478 | 0.556 | 0.705 | 0.622 | 0.635 | 0.597 | 0.616 | 0.502 | 0.494 | 0.498 | 0.337 | 0.400 | 0.366 | 0.072 | 0.000 | 0.000 | 0.022 | 0.164 | 0.039 | 0.483 | 0.271 | 0.348 | 0.616 | 0.610 | 0.613 | 0.889 | 0.198 | 0.323 | 0.565 | 0.296 | 0.388 | 0.313 | 0.427 | 0.361 | 0.493 | 0.270 | 0.349 | 0.365 | 0.366 | 0.365 | 0.343 | 0.416 | 0.376 | 0.375 | 0.252 | 0.301 | 0.434 | 0.586 | 0.498 | 0.343 | 0.367 | 0.355 | 0.212 | 0.225 | 0.218 | 0.530 | 0.408 | 0.461 | 0.250 | 0.174 | 0.205 | 0.287 | 0.478 | 0.359 | 0.625 | 0.316 | 0.420 |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | False | Han_CSU_2026 | 0.366 | 0.446 | 0.402 | 0.476 | 0.645 | 0.548 | 0.520 | 0.456 | 0.486 | 0.390 | 0.338 | 0.362 | 0.177 | 0.204 | 0.189 | 0.075 | 0.125 | 0.094 | 0.044 | 0.141 | 0.067 | 0.361 | 0.205 | 0.261 | 0.552 | 0.518 | 0.535 | 0.833 | 0.138 | 0.236 | 0.565 | 0.281 | 0.375 | 0.250 | 0.328 | 0.284 | 0.304 | 0.188 | 0.232 | 0.260 | 0.277 | 0.268 | 0.312 | 0.379 | 0.342 | 0.308 | 0.238 | 0.268 | 0.407 | 0.554 | 0.469 | 0.325 | 0.331 | 0.328 | 0.193 | 0.206 | 0.199 | 0.453 | 0.382 | 0.415 | 0.197 | 0.174 | 0.185 | 0.238 | 0.377 | 0.292 | 0.548 | 0.336 | 0.416 |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | False | Han_CSU_2026 | 0.318 | 0.355 | 0.335 | 0.475 | 0.611 | 0.534 | 0.622 | 0.521 | 0.567 | 0.373 | 0.368 | 0.370 | 0.231 | 0.282 | 0.254 | 0.069 | 0.125 | 0.089 | 0.022 | 0.141 | 0.038 | 0.385 | 0.209 | 0.271 | 0.465 | 0.461 | 0.463 | 0.729 | 0.110 | 0.191 | 0.295 | 0.181 | 0.224 | 0.254 | 0.355 | 0.296 | 0.302 | 0.213 | 0.249 | 0.273 | 0.287 | 0.280 | 0.269 | 0.328 | 0.295 | 0.324 | 0.244 | 0.278 | 0.324 | 0.390 | 0.354 | 0.344 | 0.339 | 0.341 | 0.237 | 0.243 | 0.240 | 0.446 | 0.311 | 0.367 | 0.196 | 0.148 | 0.169 | 0.254 | 0.358 | 0.297 | 0.410 | 0.186 | 0.256 |
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.087 | 0.173 | 0.116 | 0.115 | 0.188 | 0.143 | 0.164 | 0.118 | 0.137 | 0.068 | 0.101 | 0.081 | 0.083 | 0.039 | 0.053 | 0.000 | 0.125 | 0.000 | 0.054 | 0.258 | 0.089 | 0.242 | 0.090 | 0.131 | 0.396 | 0.026 | 0.049 | 0.000 | 0.015 | 0.000 | 0.583 | 0.028 | 0.054 | 0.179 | 0.366 | 0.240 | 0.375 | 0.216 | 0.274 | 0.292 | 0.110 | 0.159 | 0.156 | 0.223 | 0.184 | 0.250 | 0.262 | 0.256 | 0.265 | 0.166 | 0.204 | 0.101 | 0.294 | 0.150 | 0.159 | 0.241 | 0.192 | 0.013 | 0.007 | 0.009 | 0.000 | 0.007 | 0.000 | 0.149 | 0.142 | 0.145 | 0.000 | 0.000 | 0.000 |
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | False | Zhang_XJTLU_2026 | 0.114 | 0.210 | 0.148 | 0.122 | 0.227 | 0.159 | 0.167 | 0.132 | 0.147 | 0.087 | 0.134 | 0.106 | 0.167 | 0.065 | 0.093 | 0.000 | 0.125 | 0.000 | 0.000 | 0.328 | 0.000 | 0.235 | 0.074 | 0.113 | 0.550 | 0.026 | 0.050 | 0.000 | 0.015 | 0.000 | 0.583 | 0.036 | 0.067 | 0.184 | 0.374 | 0.247 | 0.375 | 0.198 | 0.259 | 0.205 | 0.091 | 0.126 | 0.168 | 0.230 | 0.194 | 0.000 | 0.252 | 0.000 | 0.230 | 0.179 | 0.201 | 0.100 | 0.272 | 0.146 | 0.094 | 0.236 | 0.134 | 0.042 | 0.007 | 0.012 | 0.000 | 0.007 | 0.000 | 0.130 | 0.130 | 0.130 | 0.000 | 0.000 | 0.000 |
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | False | Han_CSU_2026 | 0.055 | 0.000 | 0.000 | 0.015 | 0.042 | 0.022 | 0.016 | 0.042 | 0.023 | 0.105 | 0.132 | 0.117 | 0.099 | 0.116 | 0.107 | 0.064 | 0.000 | 0.000 | 0.022 | 0.070 | 0.034 | 0.117 | 0.137 | 0.126 | 0.103 | 0.053 | 0.070 | 0.000 | 0.045 | 0.000 | 0.092 | 0.036 | 0.052 | 0.168 | 0.198 | 0.182 | 0.158 | 0.167 | 0.162 | 0.175 | 0.189 | 0.182 | 0.133 | 0.180 | 0.153 | 0.199 | 0.170 | 0.184 | 0.197 | 0.155 | 0.174 | 0.093 | 0.156 | 0.116 | 0.139 | 0.150 | 0.144 | 0.089 | 0.054 | 0.067 | 0.109 | 0.116 | 0.112 | 0.131 | 0.156 | 0.142 | 0.056 | 0.051 | 0.054 |
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | False | Zhang_XJTLU_2026 | 0.115 | 0.210 | 0.149 | 0.131 | 0.246 | 0.171 | 0.134 | 0.139 | 0.136 | 0.062 | 0.090 | 0.073 | 0.000 | 0.021 | 0.000 | 0.000 | 0.125 | 0.000 | 0.000 | 0.258 | 0.000 | 0.220 | 0.059 | 0.092 | 0.475 | 0.026 | 0.050 | 0.000 | 0.015 | 0.000 | 0.583 | 0.028 | 0.054 | 0.179 | 0.374 | 0.242 | 0.300 | 0.198 | 0.239 | 0.250 | 0.101 | 0.143 | 0.172 | 0.223 | 0.194 | 0.000 | 0.240 | 0.000 | 0.257 | 0.166 | 0.202 | 0.095 | 0.256 | 0.138 | 0.107 | 0.243 | 0.149 | 0.029 | 0.007 | 0.011 | 0.000 | 0.007 | 0.000 | 0.152 | 0.149 | 0.150 | 0.000 | 0.000 | 0.000 |
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | False | Zhang_XJTLU_2026 | 0.117 | 0.188 | 0.144 | 0.096 | 0.151 | 0.117 | 0.037 | 0.132 | 0.058 | 0.062 | 0.086 | 0.072 | 0.071 | 0.039 | 0.051 | 0.000 | 0.125 | 0.000 | 0.000 | 0.141 | 0.000 | 0.271 | 0.152 | 0.195 | 0.125 | 0.005 | 0.009 | 0.000 | 0.007 | 0.000 | 0.188 | 0.008 | 0.015 | 0.205 | 0.401 | 0.271 | 0.188 | 0.220 | 0.202 | 0.062 | 0.073 | 0.067 | 0.180 | 0.201 | 0.190 | 0.219 | 0.262 | 0.239 | 0.180 | 0.151 | 0.164 | 0.099 | 0.239 | 0.140 | 0.010 | 0.208 | 0.019 | 0.066 | 0.021 | 0.032 | 0.125 | 0.007 | 0.013 | 0.073 | 0.089 | 0.080 | 0.062 | 0.007 | 0.013 |
BST Top-level performance
| Rank | Submission Information | Overall metrics | Class-wise metrics | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| System rank | Submission label | Submission name | Audio only | Technical Report | P | R | F | P - Music (m) | R - Music (m) | F - Music (m) | P - Instrument samples (is) | R - Instrument samples (is) | F - Instrument samples (is) | P - Speech (sp) | R - Speech (sp) | F - Speech (sp) | P - Sound effects (fx) | R - Sound effects (fx) | F - Sound effects (fx) | P - Soundscapes (ss) | R - Soundscapes (ss) | F - Soundscapes (ss) |
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.908 | 0.876 | 0.889 | 0.951 | 0.858 | 0.902 | 0.873 | 0.971 | 0.919 | 0.955 | 0.846 | 0.897 | 0.864 | 0.953 | 0.907 | 0.898 | 0.752 | 0.819 |
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | False | Primus_CPJKU_2026 | 0.909 | 0.873 | 0.887 | 0.967 | 0.853 | 0.906 | 0.877 | 0.971 | 0.921 | 0.955 | 0.840 | 0.894 | 0.858 | 0.955 | 0.904 | 0.887 | 0.748 | 0.811 |
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | False | Primus_CPJKU_2026 | 0.908 | 0.875 | 0.888 | 0.947 | 0.868 | 0.905 | 0.877 | 0.971 | 0.921 | 0.960 | 0.823 | 0.886 | 0.864 | 0.955 | 0.908 | 0.893 | 0.757 | 0.820 |
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | False | Primus_CPJKU_2026 | 0.903 | 0.871 | 0.884 | 0.935 | 0.848 | 0.889 | 0.857 | 0.966 | 0.908 | 0.961 | 0.834 | 0.893 | 0.866 | 0.951 | 0.906 | 0.898 | 0.757 | 0.822 |
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | 0.885 | 0.863 | 0.872 | 0.878 | 0.882 | 0.880 | 0.902 | 0.946 | 0.924 | 0.973 | 0.829 | 0.895 | 0.852 | 0.909 | 0.880 | 0.818 | 0.748 | 0.781 |
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | False | Huang_WHU_2026 | 0.881 | 0.855 | 0.866 | 0.921 | 0.863 | 0.891 | 0.893 | 0.937 | 0.914 | 0.958 | 0.789 | 0.865 | 0.857 | 0.921 | 0.888 | 0.778 | 0.767 | 0.772 |
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.891 | 0.864 | 0.875 | 0.876 | 0.897 | 0.886 | 0.904 | 0.917 | 0.910 | 0.966 | 0.811 | 0.882 | 0.862 | 0.935 | 0.897 | 0.846 | 0.757 | 0.799 |
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | False | Huang_WHU_2026 | 0.888 | 0.859 | 0.872 | 0.909 | 0.877 | 0.893 | 0.905 | 0.927 | 0.916 | 0.965 | 0.794 | 0.871 | 0.854 | 0.927 | 0.889 | 0.806 | 0.771 | 0.788 |
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.888 | 0.862 | 0.873 | 0.880 | 0.897 | 0.888 | 0.906 | 0.937 | 0.921 | 0.959 | 0.800 | 0.872 | 0.858 | 0.931 | 0.893 | 0.839 | 0.743 | 0.788 |
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | False | Huang_WHU_2026 | 0.887 | 0.858 | 0.870 | 0.904 | 0.877 | 0.891 | 0.904 | 0.922 | 0.913 | 0.965 | 0.794 | 0.871 | 0.853 | 0.929 | 0.890 | 0.809 | 0.767 | 0.787 |
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.879 | 0.851 | 0.862 | 0.835 | 0.892 | 0.863 | 0.916 | 0.902 | 0.909 | 0.964 | 0.771 | 0.857 | 0.867 | 0.933 | 0.899 | 0.811 | 0.757 | 0.783 |
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.882 | 0.864 | 0.872 | 0.871 | 0.892 | 0.881 | 0.890 | 0.912 | 0.901 | 0.987 | 0.840 | 0.907 | 0.857 | 0.899 | 0.877 | 0.807 | 0.776 | 0.791 |
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.878 | 0.852 | 0.863 | 0.869 | 0.848 | 0.859 | 0.896 | 0.922 | 0.909 | 0.966 | 0.806 | 0.879 | 0.850 | 0.917 | 0.882 | 0.809 | 0.767 | 0.787 |
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.856 | 0.830 | 0.839 | 0.891 | 0.804 | 0.845 | 0.812 | 0.946 | 0.874 | 0.957 | 0.771 | 0.854 | 0.853 | 0.907 | 0.879 | 0.764 | 0.724 | 0.743 |
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | 0.877 | 0.853 | 0.863 | 0.850 | 0.892 | 0.871 | 0.908 | 0.917 | 0.913 | 0.960 | 0.823 | 0.886 | 0.846 | 0.911 | 0.877 | 0.822 | 0.724 | 0.770 |
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | False | Kucukoglu_NYU_2026 | 0.875 | 0.851 | 0.861 | 0.902 | 0.858 | 0.879 | 0.869 | 0.941 | 0.904 | 0.960 | 0.829 | 0.890 | 0.835 | 0.899 | 0.865 | 0.810 | 0.729 | 0.767 |
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | False | Kucukoglu_NYU_2026 | 0.870 | 0.839 | 0.853 | 0.855 | 0.868 | 0.861 | 0.904 | 0.873 | 0.888 | 0.953 | 0.817 | 0.880 | 0.827 | 0.911 | 0.867 | 0.810 | 0.729 | 0.767 |
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | False | Lin_JKU_2026 | 0.847 | 0.821 | 0.829 | 0.909 | 0.784 | 0.842 | 0.765 | 0.951 | 0.848 | 0.957 | 0.771 | 0.854 | 0.841 | 0.887 | 0.863 | 0.764 | 0.710 | 0.736 |
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | False | Lin_JKU_2026 | 0.873 | 0.845 | 0.857 | 0.865 | 0.877 | 0.871 | 0.895 | 0.912 | 0.903 | 0.946 | 0.794 | 0.863 | 0.840 | 0.917 | 0.877 | 0.817 | 0.724 | 0.768 |
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.879 | 0.855 | 0.865 | 0.859 | 0.863 | 0.861 | 0.900 | 0.917 | 0.908 | 0.966 | 0.800 | 0.875 | 0.853 | 0.913 | 0.882 | 0.820 | 0.781 | 0.800 |
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | False | Lin_JKU_2026 | 0.876 | 0.845 | 0.857 | 0.924 | 0.833 | 0.876 | 0.851 | 0.946 | 0.896 | 0.966 | 0.806 | 0.879 | 0.847 | 0.929 | 0.886 | 0.793 | 0.710 | 0.749 |
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.843 | 0.802 | 0.817 | 0.899 | 0.789 | 0.841 | 0.846 | 0.912 | 0.878 | 0.962 | 0.714 | 0.820 | 0.805 | 0.897 | 0.849 | 0.702 | 0.695 | 0.699 |
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | 0.841 | 0.770 | 0.794 | 0.901 | 0.760 | 0.824 | 0.879 | 0.917 | 0.897 | 0.991 | 0.606 | 0.752 | 0.764 | 0.911 | 0.831 | 0.670 | 0.657 | 0.663 |
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.782 | 0.772 | 0.773 | 0.729 | 0.725 | 0.727 | 0.765 | 0.746 | 0.756 | 0.887 | 0.720 | 0.795 | 0.877 | 0.850 | 0.863 | 0.652 | 0.819 | 0.726 |
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | False | Kil_Medisensing_2026 | 0.782 | 0.772 | 0.773 | 0.729 | 0.725 | 0.727 | 0.765 | 0.746 | 0.756 | 0.887 | 0.720 | 0.795 | 0.877 | 0.850 | 0.863 | 0.652 | 0.819 | 0.726 |
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | False | Colotti_TAU_2026 | 0.827 | 0.769 | 0.788 | 0.819 | 0.819 | 0.819 | 0.905 | 0.790 | 0.844 | 0.964 | 0.611 | 0.748 | 0.802 | 0.911 | 0.853 | 0.644 | 0.714 | 0.677 |
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | False | Colotti_TAU_2026 | 0.814 | 0.756 | 0.777 | 0.880 | 0.824 | 0.851 | 0.901 | 0.800 | 0.848 | 0.887 | 0.583 | 0.703 | 0.751 | 0.874 | 0.808 | 0.653 | 0.700 | 0.676 |
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | False | Kil_Medisensing_2026 | 0.771 | 0.758 | 0.761 | 0.704 | 0.770 | 0.735 | 0.781 | 0.678 | 0.726 | 0.882 | 0.726 | 0.796 | 0.843 | 0.838 | 0.841 | 0.647 | 0.776 | 0.706 |
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | 0.775 | 0.755 | 0.761 | 0.720 | 0.706 | 0.713 | 0.732 | 0.732 | 0.732 | 0.933 | 0.714 | 0.809 | 0.845 | 0.848 | 0.846 | 0.644 | 0.776 | 0.704 |
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | True | Liu_CQUPT_2026 | 0.776 | 0.752 | 0.760 | 0.715 | 0.701 | 0.708 | 0.730 | 0.712 | 0.721 | 0.948 | 0.726 | 0.822 | 0.840 | 0.860 | 0.850 | 0.645 | 0.762 | 0.699 |
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | True | Liu_CQUPT_2026 | 0.774 | 0.751 | 0.759 | 0.714 | 0.696 | 0.705 | 0.730 | 0.712 | 0.721 | 0.948 | 0.726 | 0.822 | 0.838 | 0.858 | 0.848 | 0.643 | 0.762 | 0.697 |
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | True | Kil_Medisensing_2026 | 0.777 | 0.682 | 0.708 | 0.894 | 0.662 | 0.761 | 0.820 | 0.800 | 0.810 | 1.000 | 0.520 | 0.684 | 0.715 | 0.818 | 0.763 | 0.456 | 0.610 | 0.521 |
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | 0.701 | 0.628 | 0.653 | 0.728 | 0.721 | 0.724 | 0.672 | 0.420 | 0.517 | 0.917 | 0.697 | 0.792 | 0.634 | 0.824 | 0.717 | 0.552 | 0.481 | 0.514 |
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.539 | 0.517 | 0.521 | 0.574 | 0.647 | 0.608 | 0.382 | 0.483 | 0.427 | 0.681 | 0.440 | 0.535 | 0.571 | 0.575 | 0.573 | 0.487 | 0.438 | 0.461 |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | False | Han_CSU_2026 | 0.475 | 0.452 | 0.456 | 0.481 | 0.574 | 0.523 | 0.314 | 0.371 | 0.340 | 0.619 | 0.371 | 0.464 | 0.525 | 0.530 | 0.528 | 0.437 | 0.414 | 0.425 |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | False | Han_CSU_2026 | 0.441 | 0.421 | 0.425 | 0.472 | 0.529 | 0.499 | 0.301 | 0.390 | 0.340 | 0.491 | 0.314 | 0.383 | 0.510 | 0.530 | 0.520 | 0.431 | 0.343 | 0.382 |
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.302 | 0.204 | 0.188 | 0.175 | 0.260 | 0.209 | 0.198 | 0.161 | 0.177 | 0.667 | 0.034 | 0.065 | 0.352 | 0.516 | 0.418 | 0.119 | 0.048 | 0.068 |
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | False | Zhang_XJTLU_2026 | 0.354 | 0.217 | 0.200 | 0.189 | 0.304 | 0.233 | 0.212 | 0.185 | 0.198 | 0.875 | 0.040 | 0.077 | 0.361 | 0.510 | 0.422 | 0.135 | 0.048 | 0.070 |
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | False | Han_CSU_2026 | 0.194 | 0.190 | 0.187 | 0.067 | 0.069 | 0.068 | 0.160 | 0.229 | 0.188 | 0.157 | 0.074 | 0.101 | 0.370 | 0.395 | 0.382 | 0.217 | 0.181 | 0.197 |
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | False | Zhang_XJTLU_2026 | 0.320 | 0.207 | 0.189 | 0.193 | 0.314 | 0.239 | 0.171 | 0.137 | 0.152 | 0.750 | 0.034 | 0.066 | 0.351 | 0.500 | 0.412 | 0.136 | 0.052 | 0.076 |
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | False | Zhang_XJTLU_2026 | 0.239 | 0.191 | 0.175 | 0.150 | 0.240 | 0.185 | 0.170 | 0.151 | 0.160 | 0.375 | 0.017 | 0.033 | 0.347 | 0.478 | 0.402 | 0.152 | 0.067 | 0.093 |
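
The overall P, R, and F columns in the top-level table appear to be unweighted (macro) averages of the five class-wise values. This is an inference from the numbers rather than something the page states. The short sketch below checks it for the first row (Primus_CPJKU_task1_3). The helper name `macro_average` and the dictionary layout are illustrative choices, not part of any official evaluation code.

```python
from statistics import mean

# Class-wise (P, R, F) for Primus_CPJKU_task1_3, copied from the table above.
# Keys are the top-level class abbreviations used in the column headers.
class_wise = {
    "m": (0.951, 0.858, 0.902),
    "is": (0.873, 0.971, 0.919),
    "sp": (0.955, 0.846, 0.897),
    "fx": (0.864, 0.953, 0.907),
    "ss": (0.898, 0.752, 0.819),
}


def macro_average(scores):
    """Unweighted mean of each metric across classes, rounded to 3 decimals."""
    precisions, recalls, f_scores = zip(*scores.values())
    return tuple(round(mean(col), 3) for col in (precisions, recalls, f_scores))


overall = macro_average(class_wise)
print(overall)  # compare with the reported overall P, R, F: 0.908, 0.876, 0.889
```

Note that under this reading the overall F is the mean of the class-wise F values, not the harmonic mean of the overall P and R, so the two can differ slightly.
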
Development set performance
| Rank | Submission Information | BSD10k-v1.2 | BSD35k-CS | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| System rank | Submission label | Submission name | Audio only | Technical Report | hP | hR | hF | hP | hR | hF |
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | False | Primus_CPJKU_2026 | 0.833 | 0.838 | 0.834 | |||
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | False | Primus_CPJKU_2026 | 0.831 | 0.837 | 0.832 | |||
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | False | Primus_CPJKU_2026 | 0.828 | 0.832 | 0.829 | |||
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | False | Primus_CPJKU_2026 | 0.830 | 0.831 | 0.830 | |||
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | False | Huang_WHU_2026 | ||||||
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | False | Huang_WHU_2026 | ||||||
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | ||||||
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | False | Huang_WHU_2026 | ||||||
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | ||||||
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | False | Huang_WHU_2026 | ||||||
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | ||||||
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | False | Zheng_SCUT_2026 | 0.795 | 0.783 | 0.787 | 0.811 | 0.794 | 0.798 |
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.822 | 0.805 | 0.811 | |||
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | False | Lin_JKU_2026 | 0.795 | 0.805 | 0.794 | |||
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | False | Guan_HEU_2026 | ||||||
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | False | Kucukoglu_NYU_2026 | 0.795 | 0.787 | 0.788 | |||
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | False | Kucukoglu_NYU_2026 | 0.798 | 0.790 | 0.791 | |||
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | False | Lin_JKU_2026 | 0.772 | 0.805 | 0.779 | |||
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | False | Lin_JKU_2026 | 0.788 | 0.782 | 0.781 | |||
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | False | Kucukoglu_NYU_2026 | 0.818 | 0.808 | 0.811 | |||
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | False | Lin_JKU_2026 | 0.789 | 0.798 | 0.790 | |||
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | ||||||
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | False | Colotti_TAU_2026 | ||||||
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | False | Kil_Medisensing_2026 | 0.804 | 0.815 | 0.805 | |||
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | False | Kil_Medisensing_2026 | 0.805 | 0.812 | 0.804 | |||
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | False | Colotti_TAU_2026 | ||||||
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | False | Colotti_TAU_2026 | ||||||
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | False | Kil_Medisensing_2026 | 0.753 | 0.717 | 0.729 | |||
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | True | Liu_CQUPT_2026 | ||||||
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | True | Liu_CQUPT_2026 | ||||||
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | True | Liu_CQUPT_2026 | ||||||
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | True | Kil_Medisensing_2026 | 0.825 | 0.806 | 0.809 | |||
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | False | Sharma_IR_2026 | ||||||
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | False | Han_CSU_2026 | 0.854 | 0.840 | 0.846 | 0.806 | 0.835 | 0.810 |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | False | Han_CSU_2026 | 0.852 | 0.836 | 0.843 | 0.791 | 0.823 | 0.798 |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | False | Han_CSU_2026 | 0.851 | 0.841 | 0.845 | 0.803 | 0.827 | 0.806 |
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | False | Zhang_XJTLU_2026 | 0.833 | |||||
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | False | Zhang_XJTLU_2026 | 0.832 | |||||
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | False | Han_CSU_2026 | 0.856 | 0.835 | 0.844 | 0.827 | 0.851 | 0.832 |
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | False | Zhang_XJTLU_2026 | 0.834 | |||||
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | False | Zhang_XJTLU_2026 | 0.835 | |||||
System characteristics
| Rank | Submission Information | Representations | Method | Data | Complexity | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| System rank | Submission label | Submission name | Technical Report | Sampling rate | Audio representation | Text representation | Hierarchical setting | Machine learning method | Data augmentation | External data usage | External datasets | MACS (G) | Total params |
| 1 | Primus_CPJKU_task1_3 | Ensemble 3 | Primus_CPJKU_2026 | CP-CLAP, PaSST, M2D | title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa | transformer, contrastive-learning, LLM | embeddings, LLM | 383000000.0 | |||||
| 2 | Primus_CPJKU_task1_2 | Ensemble 2 | Primus_CPJKU_2026 | CP-CLAP, PaSST, M2D | title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa | transformer, contrastive-learning, LLM | embeddings, LLM | 383000000.0 | |||||
| 3 | Primus_CPJKU_task1_4 | Ensemble 4 | Primus_CPJKU_2026 | CLAP, PaSST, M2D-CLAP, CP-CLAP | title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa | transformer, contrastive-learning, LLM | embeddings, LLM | 539000000.0 | |||||
| 4 | Primus_CPJKU_task1_1 | Ensemble 1 | Primus_CPJKU_2026 | BEATs, CP-CLAP, PaSST, CLAP | title, tags, description, GPT processed metadata, CP-CLAP, RoBERTa, LAION-CLAP RoBERTa | transformer, contrastive-learning, LLM | embeddings, LLM | 543000000.0 | |||||
| 5 | Huang_WHU_task1_2 | STFT-distill logit ensemble-1 | Huang_WHU_2026 | 16kHz | CLAP, log-STFT, log-mel energies | title, tags, description, CLAP, metadata cleaning | multiple classifiers | MLP, transformer | random crop, time masking | embeddings | BSD35k | 28.361 | 97540455.0 |
| 6 | Huang_WHU_task1_1 | HGA-EMA-STFT system | Huang_WHU_2026 | 16kHz | CLAP, STFT | MLP, transformer | time masking, random crop | embeddings | BSD35k | 12.921 | 2728920.0 | ||
| 7 | Guan_HEU_task1_3 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | Guan_HEU_2026 | 48kHz | CLAP | MLP | embeddings | 1.000 | 326592.0 | ||||
| 8 | Huang_WHU_task1_4 | Six-model 5-fold logit ensemble | Huang_WHU_2026 | 16kHz | CLAP, log-STFT, log-mel energies, MFCC | title, tags, description, CLAP, metadata cleaning | multiple classifiers | MLP, CNN, transformer, ensemble | time masking, random crop | embeddings, datasets | BSD35k | 30.555 | 210700090.0 |
| 9 | Guan_HEU_task1_4 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | Guan_HEU_2026 | 48kHz | CLAP | MLP | embeddings | ||||||
| 10 | Huang_WHU_task1_3 | STFT-distill logit ensemble-5 | Huang_WHU_2026 | 16kHz | CLAP, log-STFT, log-mel energies | title, tags, description, CLAP, metadata cleaning | multiple classifiers | MLP, CNN, transformer, ensemble | random crop, time masking | embeddings, datasets | BSD35k | 28.361 | 97540455.0 |
| 11 | Guan_HEU_task1_2 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | Guan_HEU_2026 | 48kHz | CLAP | MLP | embeddings | 11.000 | 15474693.0 | ||||
| 12 | Zheng_SCUT_task1_1 | Official BSD10k Baseline (Multimodal HATR + CLAP) | Zheng_SCUT_2026 | 44.1kHz | CLAP | MLP | embeddings | 7319513.0 | |||||
| 13 | Kucukoglu_NYU_task1_1 | Multimodal Ensemble System for Heterogeneous Audio Classification | Kucukoglu_NYU_2026 | 48kHz, 32kHz | CLAP, ConvNeXt, Fine-tuned CLAP | loss function | HATR, ensemble, logit averaging | noise addition, random masking, mixup, balanced resampling | embeddings | 1889944.0 | |||
| 14 | Lin_JKU_task1_1 | Frozen-CLAP weighted ensemble with agreement pseudo-labels (primary) | Lin_JKU_2026 | 48kHz, 32kHz | CLAP, PaSST | title, tags, description, CLAP, RoBERTa | MLP, transformer, ensemble | time masking, frequency masking, mixup | embeddings | 163.900 | 211645044.0 | ||
| 15 | Guan_HEU_task1_1 | Hierarchical Heterogeneous Audio Classification Using Multimodal Audio-Language Models with Hyperbolic Representations | Guan_HEU_2026 | 48kHz | CLAP | MLP | embeddings | 8.000 | 25184262.0 | ||||
| 16 | Kucukoglu_NYU_task1_4 | Three-Modality HATR with Whisper and Mixup | Kucukoglu_NYU_2026 | 48kHz, 16kHz | CLAP, Whisper-large-v3 | Three-modality HATR | mixup, noise addition, random masking | embeddings | 1115669.0 | ||||
| 17 | Kucukoglu_NYU_task1_3 | Single HATR Model with Cross-Fold Augmentation | Kucukoglu_NYU_2026 | 48kHz | CLAP | HATR | cross-fold embedding swap, noise addition, random masking | embeddings | 377989.0 | ||||
| 18 | Lin_JKU_task1_2 | Frozen-CLAP ensemble with logit adjustment (balanced) | Lin_JKU_2026 | 48kHz, 32kHz | CLAP, PaSST | title, tags, description, CLAP, RoBERTa | MLP, transformer, ensemble | time masking, frequency masking, mixup | embeddings | 163.900 | 211645044.0 | ||
| 19 | Lin_JKU_task1_4 | Clean concatenation CLAP probe (10k-train only) | Lin_JKU_2026 | 48kHz | CLAP | MLP | embeddings | 19.400 | 538647.0 | ||||
| 20 | Kucukoglu_NYU_task1_2 | CLAP-ConvNeXt Ensemble for Heterogeneous Audio Classification | Kucukoglu_NYU_2026 | 48kHz, 32kHz | CLAP, ConvNeXt | loss function | HATR, ensemble, logit averaging | noise addition, random masking, mixup, balanced resampling, combined augmentation | embeddings | 1889944.0 | |||
| 21 | Lin_JKU_task1_3 | Gated multimodal CLAP fusion with pseudo-labels | Lin_JKU_2026 | 48kHz | CLAP | MLP | embeddings | 19.400 | 1327639.0 | ||||
| 22 | Colotti_TAU_task1_1 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | Colotti_TAU_2026 | 48kHz | CLAP | description, tags, sentence transformer, all-mpnet-base-v2, BERT | attention, hyperbolic neural networks | embeddings, model weights | 7.900 | 380342496.0 | |||
| 23 | Colotti_TAU_task1_2 | Audio Classification using Attention-based Cleaned Multimodal Embeddings | Colotti_TAU_2026 | 48kHz | CLAP | description, tags, sentence transformer, all-mpnet-base-v2, BERT | attention, hyperbolic neural networks | embeddings, model weights | 7.900 | 380342496.0 | |||
| 24 | Kil_Medisensing_task1_1 | metadata gate target mask posterior stack | Kil_Medisensing_2026 | CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF | title, tags, description, TF-IDF helper gates | second-level classifier, same-parent metadata override, hierarchy-aware posteriors | weighted posterior stacking, metadata target-mask gate | embeddings, datasets | |||||
| 25 | Kil_Medisensing_task1_2 | raw parent specialist posterior stack | Kil_Medisensing_2026 | CLAP, Larger-CLAP, M2D-CLAP, BEATs, ATST, PaSST, CLAP-HF | title, tags, description, conservative helper gates | parent-local specialists, second-level predictions constrained by hierarchy | weighted posterior stacking, raw-embedding parent-specialist gate | embeddings, datasets | |||||
| 26 | Colotti_TAU_task1_4 | CLAP-MoE Feature-wise Gated Fusion | Colotti_TAU_2026 | 48kHz | CLAP | tags, description, CLAP | top-level classifier as router, expert second-level classifiers | MLP, feature-wise gated multimodal fusion | noise addition | embeddings | 4.140 | 369991736.0 | |
| 27 | Colotti_TAU_task1_3 | EnhancedBaseClassifier | Colotti_TAU_2026 | 48kHz | CLAP | tags, description, CLAP, SentenceBERT, sentence transformer | MLP | baseline implemented masking and augmentation | embeddings | 9.400 | 282000000.0 | ||
| 28 | Kil_Medisensing_task1_3 | Larger-CLAP classifier with metadata confidence gate | Kil_Medisensing_2026 | CLAP | title, tags, description, TF-IDF | second-level classifier, same-parent metadata override | MLP | datasets, embeddings | FSD50K | 0.011 | 10704631.0 | ||
| 29 | Liu_CQUPT_task1_3 | PaSST and CLAP Audio Ensemble | Liu_CQUPT_2026 | 48kHz, 32kHz | CLAP, PaSST | loss function | ensemble | mixup, feature masking, auxiliary text supervision during training, test-time augmentation | embeddings | AudioSet | 254784782.0 | ||
| 30 | Liu_CQUPT_task1_2 | Weighted Audio-only CLAP Ensemble with BSD10k Specialist | Liu_CQUPT_2026 | 48kHz | CLAP | loss function | residual gated classifiers, knowledge distillation, weighted ensemble | embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation | embeddings | 197911292.0 | |||
| 31 | Liu_CQUPT_task1_1 | Audio-only CLAP Ensemble with EMA and Test-Time Augmentation | Liu_CQUPT_2026 | 48kHz | CLAP | loss function | residual gated classifiers, knowledge distillation, ensemble | embedding-level mixup, feature masking, noise addition, text dropout, test-time augmentation | embeddings | 152641185.0 | |||
| 32 | Kil_Medisensing_task1_4 | audio posterior ensemble with conservative argmax decoding | Kil_Medisensing_2026 | CLAP, M2D-CLAP | multiple classifiers, hierarchical ridge branch | ridge ensemble, posterior averaging | embeddings | ||||||
| 33 | Sharma_IR_task1_1 | Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification | Sharma_IR_2026 | 32kHz, 48kHz | CLAP, PANNs | loss function | dimension masking | embeddings | 1.036 | 261098881.0 | |||
| 34 | Han_CSU_task1_1 | Multi-Embedding System with Hierarchical Proxy Learning_1 | Han_CSU_2026 | 44.1kHz | BEATs, ATST-Frame, fPaSST, PaSST, CLAP, M2D | loss function | MLP | noise addition, time masking | embeddings | BEATs, ATST-Frame, fPaSST, PaSST, | 718.773 | 3717566529.0 | |
| 35 | Han_CSU_task1_3 | Multi-Embedding System with Hierarchical Proxy Learning_3 | Han_CSU_2026 | 44.1kHz | ATST-Frame, PaSST, CLAP, M2D | loss function | MLP | noise addition, time masking | embeddings | ATST-Frame, PaSST, | 466.859 | 2332946777.0 | |
| 36 | Han_CSU_task1_4 | Multi-Embedding System with Hierarchical Proxy Learning_4 | Han_CSU_2026 | 44.1kHz | BEATs, CLAP | loss function | MLP | noise addition, time masking | embeddings | 200.345 | 676412058.0 | ||
| 37 | Zhang_XJTLU_task1_3 | no-ranker ensemble | Zhang_XJTLU_2026 | 48kHz | CLAP | title, tags, description, CLAP, safe metadata scalar features | parent-aware calibration, second-level candidate override | ensemble, supervision with noisy labels | teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components | embeddings, datasets | BSD35k-CS, ESC-50, UrbanSound8K, FSD50K | ||
| 38 | Zhang_XJTLU_task1_4 | external diversity low-risk | Zhang_XJTLU_2026 | 48kHz | CLAP | title, tags, description, CLAP, safe metadata scalar features | parent-aware calibration, second-level candidate override | ensemble, external-data diversity components | teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components | embeddings, datasets | BSD35k-CS, ESC-50, UrbanSound8K, FSD50K | ||
| 39 | Han_CSU_task1_2 | Multi-Embedding System with Hierarchical Proxy Learning_2 | Han_CSU_2026 | 44.1kHz | CLAP, M2D | loss function | MLP | noise addition, time masking | embeddings | 134.815 | 222974.0 | ||
| 40 | Zhang_XJTLU_task1_2 | balanced ranker/base | Zhang_XJTLU_2026 | 48kHz | CLAP | title, tags, description, CLAP, safe metadata scalar features | parent-aware calibration, second-level candidate override | ensemble, parent-aware candidate reranking | teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components | embeddings, datasets | BSD35k-CS, ESC-50, UrbanSound8K, FSD50K | ||
| 41 | Zhang_XJTLU_task1_1 | aggressive ranker | Zhang_XJTLU_2026 | 48kHz | CLAP | title, tags, description, CLAP, safe metadata scalar features | parent-aware calibration, second-level candidate override | ensemble, candidate-level logistic/gradient reranking, parent-aware override | teacher-weighted soft and hard BSD35k-CS supervision, calibrated ensembling, external-data diversity components | embeddings, datasets | BSD35k-CS, ESC-50, UrbanSound8K, FSD50K | ||
Technical reports
ACACIA: Audio Classification using Attention-Based Cleaned Multimodal Embeddings
Francesco Colotti1, Kerstin Markl1, Riccardo Casciotti1, Javier Naranjo2
1Tampere University, Audio Research Group, Tampere, Finland, 2Instituto Tecnologico de Informatica, Valencia, Spain
Colotti_TAU_task1_4 Colotti_TAU_task1_2 Colotti_TAU_task1_1 Colotti_TAU_task1_3
Abstract
This paper presents the ACACIA pipeline, submitted to Task 1 of DCASE 2026, the first edition of the Heterogeneous Audio Classification challenge based on the Broad Sound Taxonomy (BST). Our approach combines complementary acoustic and textual information available in Freesound recordings by jointly exploiting the audio signal together with user-provided tags and textual descriptions. To improve the quality of the textual modality, we propose a preprocessing and cleaning pipeline that removes noisy and non-informative content from descriptions before encoding. The proposed architecture comprises three dedicated branches that independently encode audio, tags, and descriptions, whose representations are fused into a shared multimodal embedding space. To exploit the hierarchical organization of BST, the model performs two classification tasks simultaneously, predicting both top-level and second-level categories, and is trained using the sum of two binary cross-entropy losses. Experimental results show that ACACIA surpasses the official baseline by 6% according to the challenge's hierarchy-aware evaluation metric, demonstrating that the integration of acoustic information with complementary textual metadata provides a robust and effective solution for heterogeneous sound classification in real-world conditions.
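As a rough illustration of the joint objective mentioned in this abstract, the sketch below sums one binary cross-entropy term over the 5 top-level outputs and one over the 23 second-level outputs. The function names, the one-hot targets, and the unweighted sum are assumptions made for illustration only; the report should be consulted for the actual formulation.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def one_hot(index, size):
    """Return a one-hot target vector (assumed target encoding)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def joint_loss(top_logits, second_logits, top_label, second_label):
    """Sum of the top-level and second-level BCE terms (no weighting assumed)."""
    top_term = bce_with_logits(top_logits, one_hot(top_label, len(top_logits)))
    second_term = bce_with_logits(
        second_logits, one_hot(second_label, len(second_logits))
    )
    return top_term + second_term


# Example: 5 top-level and 23 second-level outputs, all logits at zero.
loss = joint_loss([0.0] * 5, [0.0] * 23, top_label=2, second_label=7)
```

With all logits at zero, every output sits at probability 0.5, so each term equals log 2 regardless of the labels.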
GISP@HEU's Submission for Task 1: Heterogeneous Audio Classification in the DCASE 2026 Challenge
Feng Xiaoyu1, Ye Tong1, Yang Xuefeng1, Xiao Feiyang1, Qiaoxi Zhu2, Jian Guan1
1Harbin Engineering University, Harbin, China, 2University of Technology Sydney, Ultimo, Australia
Guan_HEU_task1_1 Guan_HEU_task1_2 Guan_HEU_task1_3 Guan_HEU_task1_4
Abstract
This technical report presents our submitted systems for Task 1: Heterogeneous Audio Classification in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2026 Challenge. Our submission consists of four systems, including three individual systems and one ensemble system. System 1 and System 2 adopt a two-stage hierarchical training framework with hyperbolic representation learning, and are built upon different multimodal audio-language backbones, i.e., Qwen2-Audio and Qwen3-Omni. System 3 is developed from the official baseline framework, using multimodal embeddings and hierarchical training with clean BSD10k data and filtered BSD35k data. System 4 is an ensemble system consisting of these three systems. Experimental results demonstrate that the proposed approach improves hierarchical classification performance and achieves 81.80% ± 0.21% on the development dataset.
MESH: MULTI-EMBEDDING SYSTEM WITH HIERARCHICAL PROXY LEARNING FOR DCASE 2026 TASK 1
Sarang Han1, Minsik Jo2, Eunseo Ha2, Minju Chae2, Hyeonguk Kang2, Geonwoo Lee2
1Chosun University, Department of Data Science, Gwangju 61452, Republic of Korea, 2Chosun University, Department of AI Software, Gwangju 61452, Republic of Korea
Han_CSU_task1_4 Han_CSU_task1_3 Han_CSU_task1_1 Han_CSU_task1_2
Abstract
This technical report describes a multi-embedding system with hierarchical proxy learning (MESH) for DCASE 2026 Challenge Task 1. The task is based on the Broad Sound Taxonomy (BST) for hierarchical audio classification, where each sample is assigned to a lower-level category within a top-level class. Fixed audio embeddings were extracted from multiple pretrained models and pooling configurations, and the final embedding set was selected using validation hierarchical F-score (hF) and embedding diversity. The selected audio embeddings were concatenated with fixed CLAP text embeddings, and the classifier was trained through BSD35k-CS pre-training followed by BSD10k-v1.2 fine-tuning. To incorporate the BST hierarchy, a hierarchical proxy loss included distinct proxy sets for top-level and lower-level classes. Final predictions were aggregated with selective kernel-based output fusion and OOF-based ensemble selection, and the best OOF ensemble achieved 84.59% hF among the four submitted systems.
A MULTI-BRANCH AND HIERARCHY-AWARE SYSTEM FOR BST AUDIO CLASSIFICATION
Beile Ning1, Jiayi Yu1, Zitong Wang1, Yufei Hu1, Wenjun Xu1, Yuanhang Qian1, Zhongxin Bai2, Gongping Huang1
1Wuhan University (WHU), Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), Wuhan, China, 2Harbin Engineering University (HEU), Intelligent Acoustics and Speech Processing Laboratory (IASP-Lab), Harbin, China
Huang_WHU_task1_3 Huang_WHU_task1_4 Huang_WHU_task1_2 Huang_WHU_task1_1
Abstract
This technical report introduces our system for Task 1 of the DCASE 2026 Challenge. Our system leverages multimodal audio-text representations extracted by CLAP and further improves classification performance through dataset expansion, branch-enhanced acoustic modeling, and hierarchy-aware prediction strategies. To improve data diversity, we selected a subset of the BSD35K dataset and incorporated it into BSD10K to expand the training set. We introduced additional acoustic branches based on Mel-Frequency Cepstral Coefficients (MFCC), log-Mel spectrogram, and log Short-Time Fourier Transform (log-STFT), and adopted different hierarchical strategies and post-processing strategies, with the post-processing strategy further integrated into the system via knowledge distillation. During training, we also applied data augmentation techniques including time masking and random cropping. Our best single system extracts log-STFT features from audio to facilitate model training on the new dataset, and employs post-processing to refine model predictions, achieving a hierarchical F1 score of 80.84% on the BSD10K test set under the same evaluation protocol as the baseline. Furthermore, we constructed two ensemble models based on different backbone architectures, obtaining hierarchical F1 scores of 81.25% and 81.18%, respectively, on the same BSD10K test set as the baseline.
POSTERIOR STACKING AND CONSERVATIVE METADATA GATING FOR HETEROGENEOUS AUDIO CLASSIFICATION
Minkyu Kil, Seunggyu Jeong, Seong-Eun Kim
Medisensing, Seoul, Korea
Kil_Medisensing_task1_4 Kil_Medisensing_task1_1 Kil_Medisensing_task1_2 Kil_Medisensing_task1_3
Abstract
This technical report describes the Medisensing-SeoulTech submission to DCASE 2026 Task 1, Heterogeneous Audio Classification. The submitted systems classify each evaluation item into one of 23 second-level Broad Sound Taxonomy classes. The final package contains four systems. Systems 1 and 2 are high-scoring weighted posterior stacks built from frozen audio and audio-text embedding heads with conservative same-parent metadata gates. System 3 is a Larger-CLAP residual classifier with a TF-IDF metadata gate, included as a complementary metadata-aware variant with broader label-symbol coverage. System 4 is a complementary audio-only posterior ensemble. Across metadata-assisted systems, metadata can refine the second-level class only inside the audio-predicted top-level parent. No evaluation-set ground-truth labels or manual evaluation-set annotations are used for training, threshold selection, system selection, or reporting.
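The same-parent constraint described in this abstract can be sketched as follows: the audio posterior fixes the top-level parent, and metadata may only switch the prediction to a sibling class under that parent. The toy taxonomy, the function name `gated_prediction`, and the 0.9 confidence threshold are hypothetical and are not taken from the report.

```python
# Toy two-level taxonomy (hypothetical labels, not the real BST codes).
PARENT_OF = {"a1": "A", "a2": "A", "b1": "B"}


def gated_prediction(audio_probs, meta_probs, parent_of, threshold=0.9):
    """Let metadata refine the class only inside the audio-predicted parent."""
    audio_best = max(audio_probs, key=audio_probs.get)
    parent = parent_of[audio_best]

    # Keep only metadata candidates that share the audio-predicted parent.
    siblings = {c: p for c, p in meta_probs.items() if parent_of[c] == parent}
    if not siblings:
        return audio_best

    meta_best = max(siblings, key=siblings.get)
    # Conservative gate: override only when metadata is confident enough.
    if siblings[meta_best] >= threshold:
        return meta_best
    return audio_best


audio = {"a1": 0.5, "a2": 0.3, "b1": 0.2}
# Metadata strongly prefers a class under a different parent: no override.
kept = gated_prediction(audio, {"a1": 0.02, "a2": 0.03, "b1": 0.95}, PARENT_OF)
# Metadata strongly prefers a sibling of the audio prediction: override.
switched = gated_prediction(audio, {"a1": 0.03, "a2": 0.95, "b1": 0.02}, PARENT_OF)
```

In this sketch the gate can never change the top-level family chosen by the audio model, which is the property the abstract emphasises.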
MULTIMODAL ENSEMBLE SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION
Mehmet Atilay Kucukoglu
New York University, New York, USA
Abstract
This technical report details an ensemble approach, and the experiments leading to it, submitted to DCASE 2026 Challenge Task 1 on Heterogeneous Audio Classification using the Broad Sound Taxonomy (BST). The official HATR baseline system achieves 79.01% hierarchical F1 score on BSD10k-v1.2. Alternative audio encoders such as BEATs, ConvNeXt, MATPAC, and Whisper performed below CLAP, suggesting that CLAP’s joint audio-text alignment is a powerful approach for this task. Incorporating hierarchical loss, contrastive loss, confidence weighting, and class weighting did not improve the hierarchical F1 score. Among data augmentation methods, cross-modal swap with Gaussian noise achieved the best single-model result at 79.13% hF1. Incorporating external data from FSD50K and ESC-50 mapped to the BST taxonomy did not achieve higher F1 scores. End-to-end fine-tuning of the CLAP audio encoder was also investigated but did not surpass the frozen multimodal baseline. The best result was obtained using an ensemble of 5 models incorporating different encoders, loss strategies, and augmentation methods, achieving 81.13% hierarchical F1 score on BSD10k-v1.2, an improvement of 2.12 percentage points over the baseline.
Heterogeneous Audio Classification with Frozen Audio-Language Embeddings
Pao Lin
Johannes Kepler Universität Linz, Linz, Austria
Lin_JKU_task1_3 Lin_JKU_task1_2 Lin_JKU_task1_4 Lin_JKU_task1_1
Abstract
We describe our submission to DCASE 2026 Task 1, Heterogeneous Audio Classification, where each sound is assigned to one of 23 second-level Broad Sound Taxonomy classes and systems are ranked by macro hierarchical-F1 (hF). Like the official baseline, we keep the CLAP audio-text encoder frozen and train small heads on its embeddings. Under matched 5-fold cross-validation on BSD10k-v1.2, our best single model, a gated multimodal head, reaches 78.8 +/- 1.1% hF, matching the published multimodal baseline. Several additions add little once evaluated carefully. Agreement pseudo-labels mined from BSD35k-CS give no gain that clears seed noise, so we treat them as a neutral data addition. A four-member ensemble, with weights tuned on our validation split, reaches 79.4 +/- 0.7% hF over three seeds. Of everything we tried, only adding the free-text description to the metadata helped cleanly (+2.2 hF). Finally, the residual error appears strongly tied to label ambiguity: the classes our system confuses most are the ones annotators were least confident labeling, and most of the lost credit lies in cross-family confusions such as crowd speech vs. urban soundscape. We view this label-ambiguity analysis as the main finding of our submission. We submit four systems with different complexity and robustness profiles.
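To make the notion of "lost credit" in cross-family confusions concrete, the sketch below scores a single prediction with the common set-based hierarchical precision, recall, and F-score, in which each label is expanded to the set containing the class and its top-level parent. This is an assumption about the general technique: the challenge's exact metric definition and its macro averaging are not reproduced here, and the toy taxonomy is hypothetical.

```python
# Toy two-level taxonomy (hypothetical labels, not the real BST codes).
PARENT_OF = {"a1": "A", "a2": "A", "b1": "B"}


def hierarchical_prf(true_leaf, pred_leaf, parent_of):
    """Set-based hierarchical precision, recall and F-score for one sample."""
    # Expand each label with its ancestor (a single parent in a 2-level tree).
    true_nodes = {true_leaf, parent_of[true_leaf]}
    pred_nodes = {pred_leaf, parent_of[pred_leaf]}

    overlap = len(true_nodes & pred_nodes)
    hp = overlap / len(pred_nodes)
    hr = overlap / len(true_nodes)
    hf = 0.0 if overlap == 0 else 2 * hp * hr / (hp + hr)
    return hp, hr, hf


exact = hierarchical_prf("a1", "a1", PARENT_OF)         # correct class
same_family = hierarchical_prf("a1", "a2", PARENT_OF)   # sibling confusion
cross_family = hierarchical_prf("a1", "b1", PARENT_OF)  # wrong top-level class
```

Under this definition a confusion between siblings keeps partial credit because the parent still matches, whereas a cross-family confusion keeps none.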
HAF-CLAP: A HIERARCHICAL-AWARE MULTIMODAL CLAP SYSTEM FOR HETEROGENEOUS AUDIO CLASSIFICATION
Xiangyu Jing, Yuandong Luo, Chaoyong Huang, Hongqing Liu
Chongqing University of Posts and Telecommunications, Chongqing, China
Liu_CQUPT_task1_2 Liu_CQUPT_task1_3 Liu_CQUPT_task1_1
Abstract
This technical report describes the system submitted by Chongqing University of Posts and Telecommunications – Audio Lab (CQUPT AUL) for DCASE 2026 Task 1: Heterogeneous Audio Classification. The proposed system, termed HAF-CLAP, is built upon LAION-CLAP and exploits both acoustic information and textual metadata under the Broad Sound Taxonomy (BST). To adapt pretrained audio-language representations to the target classification task, HAF-CLAP uses a hierarchical-aware multimodal classification framework with audio-text fusion. Several training and inference strategies are further applied to improve robustness and prediction stability. The final submission combines multiple complementary HAF-CLAP models. Experimental results on our internal validation split show that the submitted system achieves competitive performance under the hierarchical evaluation metric.
CP-JKU Submission to Task 1 of the DCASE 2026 Challenge: LLM Prediction Fusion and Pseudo-Label Training for Heterogeneous Audio Classification
Paul Primus, Gerhard Widmer
Johannes Kepler University, Institute of Computational Perception, Linz, Austria
Primus_CPJKU_task1_4 Primus_CPJKU_task1_1 Primus_CPJKU_task1_2 Primus_CPJKU_task1_3
Abstract
This report describes our submission to DCASE 2026 Task 1, Heterogeneous Audio Classification. The task requires classifying Freesound audio clips into 23 second-level Broad Sound Taxonomy classes using audio and optional textual metadata. Our system combines pretrained audio-only and audio-text backbones, including PaSST, BEATs, M2D, CP-CLAP, and LAION-CLAP. We use GPT-5.4-mini in two ways: first, to generate metadata summaries for the CLAP text encoders, and second, to estimate class-probability priors from title, tags, and description. These priors are converted into learned class-embedding mixtures and late fused with the backbone representation. To exploit the noisy BSD35k-CS dataset, we use a three-stage training procedure: clean-data training on BSD10k, pseudo-label-based pretraining on BSD35k-CS, and final fine-tuning on BSD10k. The submitted systems are ensembles of selected Stage-3 models and achieve up to 0.834 hierarchical F-score on our BSD10k test split.
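One way to read the conversion of class-probability priors into class-embedding mixtures is as a probability-weighted sum of per-class embedding vectors, which is then combined with the backbone features. The sketch below illustrates that reading; the function names, the tiny dimensions, and the use of concatenation for late fusion are assumptions, not details confirmed by the report.

```python
def mix_class_embeddings(prior, class_embeddings):
    """Probability-weighted sum of class embeddings (one row per class)."""
    dim = len(class_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(prior, class_embeddings))
        for d in range(dim)
    ]


def late_fuse(backbone_features, prior, class_embeddings):
    """Concatenate backbone features with the prior-driven mixture (assumed fusion)."""
    return list(backbone_features) + mix_class_embeddings(prior, class_embeddings)


# Two classes with 2-dimensional embeddings; in a real system the embedding
# table would be learned and would have one row per second-level class.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
mixture = mix_class_embeddings([0.5, 0.5], embeddings)
fused = late_fuse([2.0], [0.5, 0.5], embeddings)
```

A confident prior collapses the mixture onto a single class embedding, while an uncertain prior yields a blend, so the downstream classifier can see how ambiguous the metadata is.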
Feature-Centric Late-Fusion Approach for Heterogenous Audio Classification
Nipun Sharma1, Aditya Sharma2
1Independent Researcher, Madhya Pradesh, India, 2Independent Researcher, Bangalore, India
Abstract
This technical report describes the submitted system for DCASE 2026 Task 1, Heterogeneous Audio Classification, which is defined over the Broad Sound Taxonomy with 5 top-level and 23 second-level categories. The proposed approach uses a late-fusion architecture built on frozen foundation-model embeddings, combining CLAP audio embeddings, PANNs audio embeddings, and CLAP text embeddings derived from metadata. Each modality is projected into a shared hidden space and fused with a lightweight Transformer encoder. The resulting sequence is then aggregated via mean-pooling to create a unified representation. Training uses a hierarchical objective that combines fine-level cross-entropy with an auxiliary coarse-level loss, together with label smoothing and class weighting. Evaluation is performed using the hierarchical F-score adopted for the task, with additional reporting of top-level and second-level accuracy. For the final submission, logits from seven selected checkpoints were averaged in a checkpoint ensemble to improve robustness and reduce fold-specific variance. The final system uses this streamlined three-modality configuration.
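The checkpoint ensemble described in this abstract amounts to averaging per-class logits across checkpoints before taking the argmax. A minimal sketch follows, using three toy checkpoints and two classes instead of seven checkpoints and 23 classes; the function names and equal weighting are assumptions.

```python
def average_logits(checkpoint_logits):
    """Average per-class logits over checkpoints (equal weights assumed)."""
    n = len(checkpoint_logits)
    # zip(*...) groups the logits of each class across all checkpoints.
    return [sum(per_class) / n for per_class in zip(*checkpoint_logits)]


def predict(checkpoint_logits):
    """Return the index of the class with the highest averaged logit."""
    avg = average_logits(checkpoint_logits)
    return max(range(len(avg)), key=avg.__getitem__)


# Each inner list holds one checkpoint's logits for the same input clip.
logits = [[1.0, 3.0], [3.0, 1.0], [2.0, 5.0]]
averaged = average_logits(logits)
winner = predict(logits)
```

Averaging logits rather than hard votes lets a checkpoint that is very confident outweigh one that is nearly undecided.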
Noisy-Label-Aware Multimodal Ensembling with Inference-Safe Candidate Reranking for Heterogeneous Audio Classification
Peihong Zhang, Shengchen Li
Xi'an Jiaotong-Liverpool University, School of Advanced Technology, Suzhou, China
Zhang_XJTLU_task1_4 Zhang_XJTLU_task1_1 Zhang_XJTLU_task1_3 Zhang_XJTLU_task1_2
Abstract
We present a noisy-label-aware multimodal system for DCASE 2026 Challenge Task 1, Heterogeneous Audio Classification. The task requires second-level Broad Sound Taxonomy prediction from audio and metadata and is evaluated using macro-averaged hierarchical F-score. Our system addresses the central tension between exploiting large but noisy crowd-sourced supervision and preserving inference safety on the unlabeled evaluation set. It combines LAION-CLAP audio/text embeddings, teacher-weighted soft and hard supervision from BSD35k-CS, a fresh long-run ensemble of 31 checkpointed model families, calibrated probability fusion, and an inference-safe candidate reranker with conservative parent-aware override. All submitted systems are generated from deployable checkpoints rather than replayed development-set artifacts. On a label-blind development-as-evaluation protocol, the primary system obtains 83.5214 hF, improving over our local multimodal baseline by 4.57 absolute hF points.
MULTIMODAL HATR CLASSIFICATION WITH FROZEN CLAP EMBEDDINGS FOR THE BROAD SOUND TAXONOMY
Han Zheng, Yanxiong Li
South China University of Technology, School of Electronic and Information Engineering, Guangzhou, China
Abstract
This report presents a multimodal audio classifier for DCASE 2026 Task 1, which assigns Freesound recordings to 23 second-level categories of the Broad Sound Taxonomy (BST). The system combines frozen LAION-CLAP embeddings of audio and textual metadata (title, tags, description) with a Hierarchical Audio Tagging and Retrieval (HATR) feed-forward classifier using attention-based fusion. Development performance is measured under five-fold cross-validation on BSD10k-v1.2 and BSD35k-CS (lambda = 0.75). On BSD10k-v1.2, the multimodal model attains a macro-averaged hierarchical F-score (hF) of 78.65% ± 0.52, compared with 75.88% ± 0.47 for the audio-only configuration; on BSD35k-CS, the corresponding values are 79.77% ± 0.60 and 70.19% ± 0.93, with leaf accuracy reaching 85.32% ± 0.51. For the challenge evaluation set, five fold models trained on BSD10k-v1.2 are ensembled to produce predictions for all 3246 released test recordings.