This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal of this task is to evaluate sound classification models on diverse, real-world audio that varies widely in its nature, including duration and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based, metadata-based, and multimodal approaches to sound classification, as well as to leverage hierarchical relationships between taxonomy categories.
Description
This task aims to evaluate systems for multi-class audio classification across a small but heterogeneous set of sound categories. To that end, the Broad Sound Taxonomy (BST) [1] is used, which has 5 top-level and 23 second-level categories, and the goal of the classification task is to correctly predict ground truth labels for the second level of the taxonomy (Figure 1). The BST is designed to represent comprehensively any type of sound at a broad level rather than focusing on detailed sound classes (Figure 2). Unlike other taxonomies which feature deep complex hierarchies (e.g. AudioSet), the BST aims to be manageable and user-friendly, accommodating different user profiles such as those in sound design, music production, audio research, and audio practitioners in general. Since April 2025, the BST taxonomy has been deployed in Freesound and it has been used by thousands of users for filtering search results. It can likewise support researchers in downstream applications and the management of large sound collections.
To successfully classify audio using the BST, a classifier must be capable of handling sounds that vary widely in nature, including varying length and a large range of recording/synthesis conditions. In addition to the audio input, such a classifier could further benefit from descriptive text metadata in a multimodal setting, which complements the acoustic signal by providing semantic context. The task encourages participants to develop classifiers that take advantage of the hierarchical relationships between the sound categories of the taxonomy, and promotes classifiers that perform well at both levels of the taxonomy. This is reflected in the evaluation metric, which penalizes classification errors that miss both levels of the hierarchy more heavily than those that miss only the second level. The goal of this task aligns with practical applications that need to handle complex, potentially multi-layered sounds by providing a first approach to their categorization. This task provides insights towards the development of sound classifiers with a broad range of applications in different domains, including but not limited to audio characterization and retrieval.
Research focus
This is the first year that the Heterogeneous Audio Classification task is being organized. In a way, this task replaces the former Task 1 on Acoustic Scene Classification, which ran for all 11 previous DCASE Challenge editions with different focuses every year!
By addressing a broad sound classification problem using the Broad Sound Taxonomy, this task expects to be more approachable and interpretable than other classification problems based on more complex taxonomies. Participants can become familiar with all sound categories, facilitating not only the development of models, but also a deeper qualitative analysis and understanding of the results. In particular, this task setup addresses the following scientific questions:
- To what extent can classification models perform successfully across a highly heterogeneous set of audio categories and handle the broad range of variations present in real-world data?
- How can the two-level hierarchical structure of the taxonomy be leveraged to improve classification performance and minimize classification errors that miss both levels of the hierarchy?
- How can noisy, heterogeneous audio data be leveraged to improve model performance and/or generalization?
- How do audio-only models compare with multimodal models for this task?
To carry out this task, we provide two complementary datasets (BSD10k-v1.2 and BSD35k-CS) that include both audio files and corresponding textual metadata, along with their annotations for top- and second-level BST categories. In addition, BSD10k-v1.2 includes a confidence score for each BST annotation, which can be used as input for the model. BSD35k-CS consists of user-provided audio labels (crowd-sourced and unverified) that may be inconsistent due to real-world usage variability. Both datasets feature heterogeneous content, widely distributed across the categories of the BST taxonomy. Further details about the provided datasets are given in the next section.
Task Setup
Audio datasets
To capture the sound heterogeneity targeted by this task, we use Freesound as the data source. We provide two complementary datasets for system development with single labels from the BST: one smaller and curated, the other larger and noisier. Additionally, the use of external data (and transfer learning) is allowed, but certain conditions apply. More details are provided in the following sections.
Development datasets
- BSD10k-v1.2 (Broad Sound Dataset 10k, v1.2) [2, 3]: This dataset was first introduced at the DCASE 2024 Workshop, and consists of a collection of manually annotated sounds aligned with the second level of the Broad Sound Taxonomy (BST). For this task we use an updated version of the dataset that includes some refinements. The dataset is designed to ensure sufficient diversity within each class. It comprises approximately 11,000 sounds and it is carefully curated to provide well-polished annotations, albeit with an unbalanced distribution across categories. The dataset has a total duration of ~35 hours, with individual sounds up to 30 seconds long. In addition to category labels, the dataset provides annotation confidence scores, descriptive metadata (title, tags, description), and provenance information (Freesound sound ID, uploader username, license). More information is provided in the Zenodo page for this dataset, linked below.
- BSD35k-CS (Broad Sound Dataset 35k, Crowd Sourced) [4]: This dataset contains approximately 35,000 sounds (~150 hours of audio) uploaded to Freesound between April 1st, 2025, and January 27th, 2026. Since April 2025, users uploading sounds to Freesound are required to choose an appropriate BST category for each sound, and we use these as labels for this dataset. Although the taxonomy is designed to be easy to understand, we have observed that variations in user interpretation and human factors, such as unfamiliarity with the taxonomy or lapses in diligence, can still introduce annotation noise. This dataset, therefore, provides user-provided BST annotations for all sounds, but these annotations are not verified and may be inconsistent and noisy. In addition to BST labels, the dataset also includes descriptive metadata and provenance information similar to that of BSD10k. Unlike BSD10k, this dataset does not include annotation confidence scores, as these are not collected when Freesound users upload sounds and select a BST category for them. More information is provided in the Zenodo page for this dataset, linked below.
The datasets include descriptive metadata (title, tags, description) that can be used for training, and provenance information (ID, uploader, license). The ID is the unique identifier coinciding with the Freesound ID, the uploader field can be used for statistical purposes or potentially to batch-exclude data from uploaders whose annotations prove unreliable, and the license field indicates the terms under which the sound and its data can be used.
The two datasets share the same metadata structure and can therefore be easily used in combination. Metadata is provided through a CSV file with the following columns: sound_id, class, class_idx, class_top, confidence, uploader, license, title, tags, and description (see the Zenodo README files for detailed explanations). Note that the confidence column is empty for the BSD35k-CS dataset as this information is not available.
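Because the two CSVs share the same columns, combining them is a plain concatenation. The sketch below illustrates this with made-up in-memory rows (the real files are downloaded from Zenodo; only the column names follow the description above):

```python
import pandas as pd
from io import StringIO

# Tiny in-memory stand-ins for the two metadata CSVs; the sound_id values
# and field contents are invented for illustration only.
bsd10k_csv = StringIO(
    "sound_id,class,class_idx,class_top,confidence,uploader,license,title,tags,description\n"
    "101,fx-v,12,fx,0.9,alice,CC0,Swoosh,whoosh fx,short synth swoosh\n"
)
bsd35k_csv = StringIO(
    "sound_id,class,class_idx,class_top,confidence,uploader,license,title,tags,description\n"
    "202,ss-u,20,ss,,bob,CC-BY,Street,urban field-recording,busy street ambience\n"
)

bsd10k = pd.read_csv(bsd10k_csv)
bsd35k = pd.read_csv(bsd35k_csv)

# A "source" column keeps track of provenance for later filtering.
combined = pd.concat(
    [bsd10k.assign(source="bsd10k"), bsd35k.assign(source="bsd35k-cs")],
    ignore_index=True,
)
# The confidence column is empty (NaN) for all BSD35k-CS rows.
print(combined.loc[combined.source == "bsd35k-cs", "confidence"].isna().all())
```

Keeping the source column also makes it easy to train on one dataset and validate on the other, or to down-weight the noisier crowd-sourced rows.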
Note that data splits are not provided for the development datasets; feel free to create your own splits for system development. For training our baseline system, we follow a 5-fold cross-validation approach and report the average of the evaluation metrics per fold. If you want, you can follow the same strategy and even use the same random seed to replicate the same folds (you'll find all the necessary info in the baseline code).
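The exact fold construction used by the baseline lives in its repository; as a generic sketch, a reproducible stratified 5-fold setup with scikit-learn (labels here are random stand-ins for the `class` column) might look like:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; in practice, use the "class" column of the metadata CSV.
rng = np.random.default_rng(0)
labels = rng.choice(["fx-v", "ss-u", "is-e"], size=60)

# A fixed random_state makes the folds reproducible across runs and machines.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    # ... train on train_idx, evaluate on val_idx, append the fold metric ...
    fold_scores.append(len(val_idx) / len(labels))  # placeholder "metric"

# Report the average of the per-fold metrics, as the baseline does.
print(round(sum(fold_scores) / len(fold_scores), 3))
```

Stratifying on the second-level class keeps the per-fold class distributions comparable despite the unbalanced categories.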
Audio is provided in single-channel, 44.1 kHz, 24-bit format, with maximum duration cropped to 30s. In addition to the audio and its textual metadata, we also provide precomputed embeddings extracted using the LAION-CLAP model, both from an audio input and from a text input. The text embeddings are extracted using a combination of metadata fields as input, namely the sound title, tags, and description. Additional details can be found in the corresponding Zenodo README files for each dataset. The classes featured in the Broad Sound Taxonomy are listed above in Figure 1. A textual description with some examples for each of the classes can be found in the BST Freesound FAQ page.
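As a toy illustration of fusing the two precomputed modalities, one simple option is to L2-normalize each CLAP vector and concatenate them. The vectors below are random stand-ins; the 512-dimensional size matches typical LAION-CLAP output, but check the Zenodo READMEs for the actual shapes and file formats:

```python
import numpy as np

# Random stand-ins for the precomputed LAION-CLAP embeddings shipped with
# the datasets (one audio and one text vector per sound).
n_sounds, dim = 4, 512
audio_emb = np.random.default_rng(1).normal(size=(n_sounds, dim)).astype(np.float32)
text_emb = np.random.default_rng(2).normal(size=(n_sounds, dim)).astype(np.float32)

def l2norm(x):
    # Normalize each row to unit length so neither modality dominates.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Early-fusion multimodal input: one 1024-dim vector per sound.
multimodal = np.concatenate([l2norm(audio_emb), l2norm(text_emb)], axis=1)
print(multimodal.shape)
```

Concatenation is only one fusion strategy; averaging, gating, or late fusion of per-modality classifiers are equally valid starting points.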
Both datasets are already available for download from their Zenodo pages.
Evaluation dataset
The evaluation dataset is used for ranking submissions. It contains ~2,000 Freesound sounds, manually labeled with the corresponding BST category and uploaded after April 1st 2025. This set contains heterogeneous sounds, balanced across the 23 target second-level BST categories. Diversity within each category is also assessed. The dataset does not overlap with the development datasets.
The evaluation dataset, without ground truth labels, will be published one month before the submission deadline so that participants can run inference and submit a CSV file with their predictions for evaluation (see Submission section below). After the challenge, the category labels for the evaluation dataset will be released, so that anyone can evaluate their proposed systems with the official challenge evaluation setup.
External Data Resources and Pretrained Models
The use of both external data and pretrained models is allowed. However, to avoid any overlap with the evaluation dataset, participants should make sure that no Freesound data uploaded after April 1st 2025 is used directly in their training pipeline or indirectly through the use of pretrained models that use Freesound data from that period. Note that none of the widely used Freesound-based datasets and pretrained models (e.g. FSD50k, ESC-50, LAION-Audio-630K, WavCaps, etc.) uses Freesound data uploaded after that cutoff date. Note that the list of external resources used must be clearly indicated in the technical report and submission files.
Task Rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here. Beyond such rules, the only task-specific rule has to do with the use of external data resources and pretrained models, and is described in the subsection immediately above.
Evaluation
As evaluation metrics, we use variations of the standard Precision, Recall and F-score (P, R, F) metrics which take into account the class hierarchy (hP, hR and hF). These metrics assign a stronger penalty to misclassifications that are incorrect at both levels of the taxonomy (i.e. confusing two classes that share a parent is penalized less than confusing two classes with different parents). The main metric will be the macro-averaged hierarchical F-score (i.e., hF computed for each target second-level BST class, and then averaged over all classes).
The original hP, hR and hF are described in the paper cited at the end of this section. Nevertheless, we use a modified version which allows us to parametrically adjust (using the parameter λ) the weight given to classifications that are correct only at the top level of the taxonomy. To give further intuition of how the metric behaves, the table below reports simulated hP, hR and hF scores (λ=0.75) for hypothetical systems with the following characteristics:
- Perfect: a system with perfect prediction capabilities.
- Half wrong: a system which makes wrong predictions half of the time. The wrong predictions are wrong at the second level of the taxonomy, but in some cases might be correct at the top level.
- Half wrong (top level correct): a system which makes wrong predictions half of the time. The wrong predictions are wrong at the second level of the taxonomy, but are always correct at the first level.
- Half wrong (both levels wrong): a system which makes wrong predictions half of the time. The wrong predictions are wrong at both levels of the taxonomy.
- All wrong: a system which makes wrong predictions all the time. The wrong predictions are wrong at the second level of the taxonomy, but in some cases might be correct at the top level.
- All wrong (top level correct): a system which makes wrong predictions all of the time. The wrong predictions are wrong at the second level of the taxonomy, but are always correct at the first level.
- All wrong (both levels wrong): a system which makes wrong predictions all of the time. The wrong predictions are wrong at both levels of the taxonomy.
- Random: a system that makes random predictions. 1/N times (with N being the number of categories at the second level of the taxonomy) will get the prediction right. Wrong predictions might be correct at the top level.
| System name | hP | hR | hF |
|---|---|---|---|
| Perfect | 1.000 | 1.000 | 1.000 |
| Half wrong (top level correct) | 0.584 | 0.688 | 0.632 |
| Half wrong | 0.380 | 0.534 | 0.443 |
| Half wrong (both levels wrong) | 0.338 | 0.503 | 0.404 |
| All wrong (top level correct) | 0.375 | 0.375 | 0.375 |
| Random | 0.097 | 0.117 | 0.105 |
| All wrong | 0.073 | 0.073 | 0.072 |
| All wrong (both levels wrong) | 0.000 | 0.000 | 0.000 |
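The official metric implementation ships with the baseline code; the sketch below is an unofficial reconstruction of a λ-weighted, macro-averaged two-level hierarchical P/R/F that is consistent with the table above (for instance, it yields exactly 0.375 for "All wrong (top level correct)" at λ=0.75). Grouping and averaging details may differ from the official version:

```python
from collections import defaultdict

def hierarchical_prf(y_true, y_pred, parent, lam=0.75):
    """Weighted hierarchical precision/recall/F over a two-level taxonomy.

    y_true, y_pred: sequences of second-level class labels.
    parent: maps each second-level class to its top-level class.
    lam: weight for predictions that are correct only at the top level.
    """
    num_p, den_p = defaultdict(float), defaultdict(float)
    num_r, den_r = defaultdict(float), defaultdict(float)
    for t, p in zip(y_true, y_pred):
        if p == t:
            overlap = 2.0        # both taxonomy levels match
        elif parent[p] == parent[t]:
            overlap = lam        # only the top level matches
        else:
            overlap = 0.0        # wrong at both levels
        num_p[p] += overlap; den_p[p] += 2.0   # grouped by predicted class
        num_r[t] += overlap; den_r[t] += 2.0   # grouped by true class
    classes = sorted(set(den_p) | set(den_r))
    hP = sum(num_p[c] / den_p[c] for c in classes if den_p[c]) / len(classes)
    hR = sum(num_r[c] / den_r[c] for c in classes if den_r[c]) / len(classes)
    hF = 2 * hP * hR / (hP + hR) if hP + hR else 0.0
    return hP, hR, hF

# Tiny example with three second-level classes under two top-level parents.
parent = {"m-sp": "m", "m-si": "m", "fx-v": "fx"}
perfect = hierarchical_prf(["m-sp", "m-si", "fx-v"], ["m-sp", "m-si", "fx-v"], parent)
top_only = hierarchical_prf(["m-sp", "m-si"], ["m-si", "m-sp"], parent)
print(perfect, top_only)
```

With λ=0.75, swapping two sibling classes scores λ/2 = 0.375 per prediction, matching the "All wrong (top level correct)" row of the table.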
To complement the main leaderboard based on hF, we will also report hP, hR and hF for each individual second-level BST class, as well as overall (and per-class) metrics at the top level of the BST hierarchy.
The baseline system code includes an implementation of these evaluation metrics that can be used by task participants. The original definition of hP, hR and hF is from this paper:
Baseline System
As a baseline system, we use variations of the HATR model presented at the DCASE Workshop 2025 [3]. We include a multimodal version and an audio-only version, both non-hierarchical models trained on audio and text representation vectors extracted using the pretrained LAION-CLAP model. The baseline can be found in this code repository.
Hints
Here we provide some hints that may help participants explore the data and inform their experiments and systems.
- Since uploader usernames are provided and different contributors can be identified, it may be useful to explore the data grouped by uploader. There may be uploader-specific patterns in the data; for example, some users are very active and may upload incorrectly labeled samples, so filtering or analyzing the data at the uploader level could be beneficial.
- There may be noise in the textual metadata associated with sounds (titles, tags, and descriptions), as these often contain information unrelated to the content of the sound. A careful analysis or preprocessing of the metadata could therefore help improve model performance.
- Note that BSD10k-v1.2 includes confidence scores for the ground truth labels. These might (or might not) be relevant during training.
- To better understand how models make predictions and handle variability, participants may also experiment with separate models for different top-level categories or focus on interpretability methods to gain insights into model behavior.
- Another potential direction is to introduce hierarchy within the model, e.g. through hierarchical architectures. Some variants of the baseline model (not the ones provided as the official baseline) reinforce the hierarchy via the loss function, although this had minimal effect on accuracy in that setup (which did not evaluate hierarchical metrics); alternative approaches may achieve stronger results.
- It may be worth exploring training with other datasets (mapped to the BST categories), including curated or cleaner sources. On the one hand, mapping dedicated datasets with cleaner audio and metadata (e.g. a wind-instrument dataset mapped to the Winds category) may help establish clearer decision boundaries; on the other hand, large weakly-verified mappings may improve generalization to real-world scenarios. Synthetic data could also be explored.
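Following the first hint above, uploader-level statistics are easy to compute from the metadata with a pandas groupby. The rows below are invented for illustration; with the real CSV, very active uploaders with unusual label patterns may deserve a closer manual look:

```python
import pandas as pd

# Toy metadata rows; a real analysis would use the full metadata CSV.
df = pd.DataFrame({
    "uploader": ["alice", "alice", "alice", "bob", "bob", "carol"],
    "class":    ["fx-v",  "fx-v",  "fx-v",  "ss-u", "is-e", "m-m"],
})

# Per-uploader volume and label diversity: how many sounds each uploader
# contributed and how many distinct BST classes they used.
stats = df.groupby("uploader")["class"].agg(
    n_sounds="size", n_classes="nunique"
)
print(stats)
```

Such statistics can feed simple filtering rules (e.g. capping the number of sounds per uploader in a training fold) or motivate manual review of specific contributors.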
Submission
General information about submission can be found in the submission page. Challenge submission consists of the following:
- System output files for up to four systems per participating team (*.csv)
- Metadata file(s) for each system (*.yaml)
- A technical report explaining in sufficient detail the methods used in the submitted systems (*.pdf)
System output file
The system output should be provided as a single text file in CSV format containing a classification result and, optionally, a classification score for each audio file in the evaluation set. The system output file should look like the following example:
id,predicted_bst_second_level_class,prediction_score
c7e15030,fx-v,0.835
713aa451,ss-u,0.851
924a4665,is-e,0.796
...
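A minimal sketch for producing such a file with the standard library (the ids and scores below are hypothetical; only the header follows the format above):

```python
import csv
import io

# Hypothetical predictions: (evaluation sound id, predicted class, score).
predictions = [
    ("c7e15030", "fx-v", 0.835),
    ("713aa451", "ss-u", 0.851),
]

# Write to an in-memory buffer; in practice use
# open("predictions.csv", "w", newline="") instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "predicted_bst_second_level_class", "prediction_score"])
writer.writerows(predictions)
print(buf.getvalue().splitlines()[0])
```

The score column is optional per the description above, so rows with only id and class should also be acceptable; when in doubt, match the example format exactly.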
Metadata file
For each system entry, it is crucial to provide meta information in a separate YAML file containing the task-specific information. This meta information is essential for fast processing of the submissions and analysis of submitted systems. Participants are strongly advised to fill in the meta information carefully, ensuring all information is correctly provided.
What follows is an example metadata file for reference. Note that this file contains a results section in which we expect participants to report their evaluation results on the development datasets. This section is optional, but we encourage participants to fill it out. The baseline code includes methods to compute and report evaluation data that can be copied into the metadata file, including the implementation of the above-mentioned 5-fold cross-validation strategy.
# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the
  # corresponding author]_task[task number]_[index number of your submission
  # (1-4)]
  label: Font_UPF_task1_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: Example submission system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: ExampleSys

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Font
      firstname: Frederic
      email: frederic.font@upf.edu # Contact email address
      corresponding: true # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: UPF
        institute: Universitat Pompeu Fabra (UPF)
        department: Music Technology Group (MTG) # Optional
        location: Barcelona, Spain

    # Second author
    - lastname: Anastasopoulou
      firstname: Panagiota
      email: panagiota.anastasopoulou@upf.edu
      affiliation:
        abbreviation: UPF
        institute: Universitat Pompeu Fabra (UPF)
        department: Music Technology Group (MTG) # Optional
        location: Barcelona, Spain
# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".

  # URL to the full source code of the system [optional]
  source_code: https://github.com/MTG/dcase2026_task1_baseline

  description:
    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 48kHz

    # Audio representation (audio features or embedding spaces)
    # e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, PANNs,
    # CLAP, EnCodec ...
    audio_representation: CLAP

    # Representation for text input (textual features, metadata items and/or
    # embedding spaces, only relevant in multimodal systems that use metadata
    # as input)
    # e.g. title, tags, description, Word2Vec, CLAP ...
    # Use !!null if the system does not use metadata as input (if the system is
    # audio-only).
    text_representation: title, tags, description, CLAP

    # Data augmentation methods
    # e.g. mixup, freq-mixstyle, dir augmentation, pitch shifting, time
    # rolling, frequency masking, time masking, frequency warping,
    # noise addition, ...
    data_augmentation: noise addition, time masking

    # Machine learning
    # e.g., MLP, (RF-regularized) CNN, RNN, CRNN, Transformer, ...
    machine_learning_method: MLP

    # External data usage method
    # e.g. dataset, embeddings, ...
    external_data_usage: embeddings

    # Method for considering taxonomy hierarchy in the system design
    # e.g. "loss function", "multiple classifiers", ...
    hierarchical_setting: !!null

  # System complexity
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training
    # process in the network summary.
    # For other than neural networks, if parameter count information is not
    # directly available, try estimating the count as
    # accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network.
    # Use numerical value.
    total_parameters: 269992
    MACS: 1.036 G

  # List of external datasets used in the submission. If using embeddings from
  # pre-trained models, there's NO need to include the datasets used to train
  # the embedding models here.
  external_datasets:
    # Below are two examples (NOT used in the baseline system)
    #- name: EfficientAT
    #  url: https://github.com/fschmid56/EfficientAT
    #  total_audio_length: !!null
    #- name: MicIRP
    #  url: http://micirp.blogspot.com/?m=1
    #  total_audio_length: 2 # specify in minutes
# System results [OPTIONAL]
results:
  development_datasets:
    # System results for both development datasets
    # If possible, please provide overall and class-wise results when
    # training/evaluating your systems on each of the development datasets
    # separately. You can follow the baseline code example to do that. Note that
    # development datasets do not include data splits; our baseline example uses
    # 5-fold cross-validation and reports the average performance across the 5
    # folds. You can reproduce the same experiment setup using the code
    # provided with the baseline system and using the same random seed to
    # generate the same folds.
    # Please refer to the baseline code for the calculation of the overall
    # metrics and class-wise (hP, hR, hF).
    # Set parameter lambda=0.75 (which is default in the baseline code).
    # Note that the numbers below are just examples, they do not correspond to
    # the actual performance of any system.
    bsd10k-v1.2:
      overall:
        hP: 0.584
        hR: 0.688
        hF: 0.632
      class_wise:
        m-sp: { hP: 0.799, hR: 0.875, hF: 0.835 }
        m-si: { hP: 0.794, hR: 0.844, hF: 0.818 }
        m-m: { hP: 0.834, hR: 0.944, hF: 0.885 }
        is-p: { hP: 0.799, hR: 0.887, hF: 0.841 }
        is-s: { hP: 0.821, hR: 0.887, hF: 0.853 }
        is-w: { hP: 0.790, hR: 0.881, hF: 0.833 }
        is-k: { hP: 0.836, hR: 0.900, hF: 0.867 }
        is-e: { hP: 0.753, hR: 0.844, hF: 0.796 }
        sp-s: { hP: 0.790, hR: 0.881, hF: 0.833 }
        sp-c: { hP: 0.821, hR: 0.906, hF: 0.862 }
        sp-p: { hP: 0.764, hR: 0.838, hF: 0.799 }
        fx-o: { hP: 0.788, hR: 0.900, hF: 0.841 }
        fx-v: { hP: 0.788, hR: 0.887, hF: 0.835 }
        fx-m: { hP: 0.802, hR: 0.875, hF: 0.837 }
        fx-h: { hP: 0.778, hR: 0.850, hF: 0.812 }
        fx-a: { hP: 0.795, hR: 0.863, hF: 0.828 }
        fx-n: { hP: 0.805, hR: 0.900, hF: 0.850 }
        fx-ex: { hP: 0.797, hR: 0.881, hF: 0.837 }
        fx-el: { hP: 0.753, hR: 0.825, hF: 0.787 }
        ss-n: { hP: 0.806, hR: 0.887, hF: 0.845 }
        ss-i: { hP: 0.840, hR: 0.900, hF: 0.869 }
        ss-u: { hP: 0.817, hR: 0.887, hF: 0.851 }
        ss-s: { hP: 0.849, hR: 0.925, hF: 0.885 }
    bsd35k-cs:
      overall:
        hP: 0.339
        hR: 0.506
        hF: 0.406
      class_wise:
        m-sp: { hP: 0.594, hR: 0.713, hF: 0.648 }
        m-si: { hP: 0.618, hR: 0.725, hF: 0.667 }
        m-m: { hP: 0.566, hR: 0.656, hF: 0.608 }
        is-p: { hP: 0.544, hR: 0.650, hF: 0.592 }
        is-s: { hP: 0.579, hR: 0.675, hF: 0.623 }
        is-w: { hP: 0.589, hR: 0.700, hF: 0.640 }
        is-k: { hP: 0.576, hR: 0.681, hF: 0.625 }
        is-e: { hP: 0.605, hR: 0.700, hF: 0.649 }
        sp-s: { hP: 0.601, hR: 0.688, hF: 0.642 }
        sp-c: { hP: 0.587, hR: 0.706, hF: 0.641 }
        sp-p: { hP: 0.607, hR: 0.719, hF: 0.658 }
        fx-o: { hP: 0.590, hR: 0.694, hF: 0.638 }
        fx-v: { hP: 0.598, hR: 0.719, hF: 0.653 }
        fx-m: { hP: 0.563, hR: 0.644, hF: 0.601 }
        fx-h: { hP: 0.565, hR: 0.669, hF: 0.612 }
        fx-a: { hP: 0.597, hR: 0.688, hF: 0.639 }
        fx-n: { hP: 0.541, hR: 0.631, hF: 0.583 }
        fx-ex: { hP: 0.604, hR: 0.719, hF: 0.656 }
        fx-el: { hP: 0.587, hR: 0.713, hF: 0.644 }
        ss-n: { hP: 0.561, hR: 0.656, hF: 0.605 }
        ss-i: { hP: 0.598, hR: 0.719, hF: 0.653 }
        ss-u: { hP: 0.573, hR: 0.669, hF: 0.617 }
        ss-s: { hP: 0.583, hR: 0.688, hF: 0.631 }
Submission package
To form your submission package, please follow the general submission instructions, which apply to all tasks. All files should be packaged into a zip file for submission. Please ensure a clear connection between the system name in the submitted YAML, the submitted system output, and the technical report! See instructions on how to form your submission label. The submission label's index field should be used to differentiate your submissions in case you have multiple submissions. Here is an example of the package structure, assuming a submission label Font_UPF_task1:
task1/
├── Font_UPF_task1.technical_report.pdf
├── Font_UPF_task1_1/
│   ├── Font_UPF_task1_1.output.csv
│   └── Font_UPF_task1_1.meta.yaml
Citation
[1] The Broad Sound Taxonomy is described in detail in a journal article which is currently under review. The preprint is nevertheless accessible online, made available by the journal:
Panagiota Anastasopoulou, Xavier Serra, and Frederic Font. A general-purpose sound taxonomy for the classification of heterogeneous sound collections. In press. URL: https://www.researchsquare.com/article/rs-7206795/v1.
[2, 3] If using any version of the BSD10k dataset, please cite the original papers:
Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, and Frederic Font. Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2024.
Panagiota Anastasopoulou, Francesco Ardan Dal Rí, Xavier Serra, and Frederic Font. Hierarchical and multimodal learning for heterogeneous sound classification. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2025.
[4] If using BSD35k-CS, please cite the corresponding Zenodo page:
Panagiota Anastasopoulou and Frederic Font Corbera. BSD35k-CS (Broad Sound Dataset 35k, Crowd Sourced). Zenodo, March 2026. URL: https://doi.org/10.5281/zenodo.19187100, doi:10.5281/zenodo.19187100.