This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal of this task is to evaluate sound classification models on diverse, real-world audio that varies widely in its nature, including duration and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based, metadata-based, and multimodal approaches to sound classification, as well as to leverage hierarchical relationships between taxonomy categories.
Description
This task aims to evaluate systems for multi-class audio classification across a small but heterogeneous set of sound categories. To that end, the Broad Sound Taxonomy (BST) [1] is used, which has 5 top-level and 23 second-level categories, and the goal of the classification task is to correctly predict ground truth labels for the second level of the taxonomy (Figure 1). The BST is designed to represent comprehensively any type of sound at a broad level rather than focusing on detailed sound classes (Figure 2). Unlike other taxonomies which feature deep complex hierarchies (e.g. AudioSet), the BST aims to be manageable and user-friendly, accommodating different user profiles such as those in sound design, music production, audio research, and audio practitioners in general. Since April 2025, the BST taxonomy has been deployed in Freesound and it has been used by thousands of users for filtering search results. It can likewise support researchers in downstream applications and the management of large sound collections.
To successfully classify audio using the BST, a classifier must be capable of handling sounds that vary widely in nature, including varying length and a large range of recording/synthesis conditions. In addition to the audio input, such a classifier could further benefit from descriptive text metadata in a multimodal setting, which complements the acoustic signal by providing semantic context. The task encourages participants to develop classifiers that take advantage of the hierarchical relationships between the sound categories of the taxonomy, and promotes classifiers that perform well at both levels of the taxonomy. This is reflected in the evaluation metric, which penalizes classification errors that miss both levels of the hierarchy more heavily than those that miss only the second level. The goal of this task aligns with practical applications that need to handle complex, potentially multi-layered sounds by providing a first approach to their categorization. This task provides insights towards the development of sound classifiers with a broad range of applications in different domains, including but not limited to audio characterization and retrieval.
Research focus
This is the first year that the Heterogeneous Audio Classification task is being organized. In a way, this task replaces the former Task 1 on Acoustic Scene Classification, which ran for all 11 previous DCASE Challenge editions with different focuses every year!
By addressing a broad sound classification problem using the Broad Sound Taxonomy, this task expects to be more approachable and interpretable than other classification problems based on more complex taxonomies. Participants can become familiar with all sound categories, facilitating not only the development of models, but also a deeper qualitative analysis and understanding of the results. In particular, this task setup addresses the following scientific questions:
- To what extent can classification models perform successfully across a highly heterogeneous set of audio categories and handle the broad range of variations present in real-world data?
- How can the two-level hierarchical structure of the taxonomy be leveraged to improve classification performance and minimize classification errors that miss both levels of the hierarchy?
- How can noisy, heterogeneous audio data be leveraged to improve model performance and/or generalization?
- How do audio-only models compare with multimodal models for this task?
To carry out this task, we provide two complementary datasets (BSD10k-v1.2 and BSD35k-CS) that include both audio files and corresponding textual metadata, along with their annotations for top- and second-level BST categories. In addition, BSD10k-v1.2 includes a confidence score for each BST annotation, which can be used as input for the model. BSD35k-CS consists of user-provided audio labels (crowd-sourced and unverified) that may be inconsistent due to real-world usage variability. Both datasets feature heterogeneous content, widely distributed across the categories of the BST taxonomy. Further details about the provided datasets are given in the next section.
Task Setup
Audio datasets
To capture the sound heterogeneity targeted by this task, we use Freesound as the data source. We provide two complementary datasets for system development with single labels from the BST: one smaller and curated, the other larger and noisier. Additionally, the use of external data (and transfer learning) is allowed, but certain conditions apply. More details are provided in the following sections.
Development datasets
- BSD10k-v1.2 (Broad Sound Dataset 10k, v1.2) [2, 3]: This dataset was first introduced at the DCASE 2024 Workshop, and consists of a collection of manually annotated sounds aligned with the second level of the Broad Sound Taxonomy (BST). For this task we use an updated version of the dataset that includes some refinements. The dataset is designed to ensure sufficient diversity within each class. It comprises approximately 11,000 sounds and it is carefully curated to provide well-polished annotations, albeit with an unbalanced distribution across categories. The dataset has a total duration of ~35 hours, with individual sounds up to 30 seconds long. In addition to category labels, the dataset provides annotation confidence scores, descriptive metadata (title, tags, description), and provenance information (Freesound sound ID, uploader username, license). More information is provided in the Zenodo page for this dataset, linked below.
- BSD35k-CS (Broad Sound Dataset 35k, Crowd Sourced) [4]: This dataset contains approximately 35,000 sounds (~150 hours of audio) uploaded to Freesound between April 1st, 2025, and January 27th, 2026. Since April 2025, users uploading sounds to Freesound are required to choose an appropriate BST category for each sound, and we use these as labels for this dataset. Although the taxonomy is designed to be easy to understand, we have observed that variations in user interpretation and human factors, such as unfamiliarity with the taxonomy or lapses in diligence, can still introduce annotation noise. This dataset, therefore, provides user-provided BST annotations for all sounds, but these annotations are not verified and may be inconsistent and noisy. In addition to BST labels, the dataset also includes descriptive metadata and provenance information similar to that of BSD10k. Unlike BSD10k, this dataset does not include annotation confidence scores, as these are not collected when Freesound users upload sounds and select a BST category for them. More information is provided in the Zenodo page for this dataset, linked below.
The datasets include descriptive metadata (title, tags, description) that can be used for training, and provenance information (ID, uploader, license). The ID is the unique identifier coinciding with the Freesound ID, the uploader field can be used for statistical purposes or potentially to batch-exclude data from uploaders whose annotations prove unreliable, and the license field indicates the terms under which the sound and its data can be used.
The two datasets share the same metadata structure and can therefore be easily used in combination. Metadata is provided through a CSV file with the following columns: sound_id, class, class_idx, class_top, confidence, uploader, license, title, tags, and description (see the Zenodo README files for detailed explanations). Note that the confidence column is empty for the BSD35k-CS dataset as this information is not available.
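Because the two CSVs share the same columns, combining them is a plain concatenation. The sketch below illustrates this with made-up in-memory rows (the real files are downloaded from Zenodo; only the column names follow the description above):

```python
import pandas as pd
from io import StringIO

# Tiny in-memory stand-ins for the two metadata CSVs; the sound_id values
# and field contents are invented for illustration only.
bsd10k_csv = StringIO(
    "sound_id,class,class_idx,class_top,confidence,uploader,license,title,tags,description\n"
    "101,fx-v,12,fx,0.9,alice,CC0,Swoosh,whoosh fx,short synth swoosh\n"
)
bsd35k_csv = StringIO(
    "sound_id,class,class_idx,class_top,confidence,uploader,license,title,tags,description\n"
    "202,ss-u,20,ss,,bob,CC-BY,Street,urban field-recording,busy street ambience\n"
)

bsd10k = pd.read_csv(bsd10k_csv)
bsd35k = pd.read_csv(bsd35k_csv)

# A "source" column keeps track of provenance for later filtering.
combined = pd.concat(
    [bsd10k.assign(source="bsd10k"), bsd35k.assign(source="bsd35k-cs")],
    ignore_index=True,
)
# The confidence column is empty (NaN) for all BSD35k-CS rows.
print(combined.loc[combined.source == "bsd35k-cs", "confidence"].isna().all())
```

Keeping the source column also makes it easy to train on one dataset and validate on the other, or to down-weight the noisier crowd-sourced rows.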
Note that data splits are not provided for the development datasets; feel free to create your own splits for system development. For training our baseline system, we follow a 5-fold cross-validation approach and report the average of the evaluation metrics per fold. If you want, you can follow the same strategy and even use the same random seed to replicate the same folds (you'll find all the necessary info in the baseline code).
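The exact fold construction used by the baseline lives in its repository; as a generic sketch, a reproducible stratified 5-fold setup with scikit-learn (labels here are random stand-ins for the `class` column) might look like:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; in practice, use the "class" column of the metadata CSV.
rng = np.random.default_rng(0)
labels = rng.choice(["fx-v", "ss-u", "is-e"], size=60)

# A fixed random_state makes the folds reproducible across runs and machines.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    # ... train on train_idx, evaluate on val_idx, append the fold metric ...
    fold_scores.append(len(val_idx) / len(labels))  # placeholder "metric"

# Report the average of the per-fold metrics, as the baseline does.
print(round(sum(fold_scores) / len(fold_scores), 3))
```

Stratifying on the second-level class keeps the per-fold class distributions comparable despite the unbalanced categories.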
Audio is provided in single-channel, 44.1 kHz, 24-bit format, with maximum duration cropped to 30s. In addition to the audio and its textual metadata, we also provide precomputed embeddings extracted using the LAION-CLAP model, both from an audio input and from a text input. The text embeddings are extracted using a combination of metadata fields as input, namely the sound title, tags, and description. Additional details can be found in the corresponding Zenodo README files for each dataset. The classes featured in the Broad Sound Taxonomy are listed above in Figure 1. A textual description with some examples for each of the classes can be found in the BST Freesound FAQ page.
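As a toy illustration of fusing the two precomputed modalities, one simple option is to L2-normalize each CLAP vector and concatenate them. The vectors below are random stand-ins; the 512-dimensional size matches typical LAION-CLAP output, but check the Zenodo READMEs for the actual shapes and file formats:

```python
import numpy as np

# Random stand-ins for the precomputed LAION-CLAP embeddings shipped with
# the datasets (one audio and one text vector per sound).
n_sounds, dim = 4, 512
audio_emb = np.random.default_rng(1).normal(size=(n_sounds, dim)).astype(np.float32)
text_emb = np.random.default_rng(2).normal(size=(n_sounds, dim)).astype(np.float32)

def l2norm(x):
    # Normalize each row to unit length so neither modality dominates.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Early-fusion multimodal input: one 1024-dim vector per sound.
multimodal = np.concatenate([l2norm(audio_emb), l2norm(text_emb)], axis=1)
print(multimodal.shape)
```

Concatenation is only one fusion strategy; averaging, gating, or late fusion of per-modality classifiers are equally valid starting points.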
Both datasets are already available for download from their Zenodo pages.
Evaluation dataset
The evaluation dataset is used for ranking submissions. It contains ~2,000 Freesound sounds, manually labeled with the corresponding BST category and uploaded after April 1st 2025. This set contains heterogeneous sounds, balanced across the 23 target second-level BST categories. Diversity within each category is also assessed. The dataset does not overlap with the development datasets.
The evaluation dataset, without ground truth labels, will be published one month before the submission deadline so that participants can run inference and submit a CSV file with their predictions for evaluation (see Submission section below). After the challenge, the category labels for the evaluation dataset will be released, so that anyone can evaluate their proposed systems with the official challenge evaluation setup.
External Data Resources and Pretrained Models
The use of both external data and pretrained models is allowed. However, to avoid any overlap with the evaluation dataset, participants should make sure that no Freesound data uploaded after April 1st 2025 is used directly in their training pipeline or indirectly through the use of pretrained models that use Freesound data from that period. Note that none of the widely used Freesound-based datasets and pretrained models (e.g. FSD50k, ESC-50, LAION-Audio-630K, WavCaps, etc.) uses Freesound data uploaded after that cutoff date. Note that the list of external resources used must be clearly indicated in the technical report and submission files.
Task Rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here. Beyond such rules, the only task-specific rule has to do with the use of external data resources and pretrained models, and is described in the subsection immediately above.
Evaluation
As evaluation metrics, we use variations of the standard Precision, Recall and F-score (P, R, F) metrics which take into account the class hierarchy (hP, hR and hF). These metrics assign a stronger penalty to misclassifications that are incorrect at both levels of the taxonomy (i.e. confusing two classes that share a parent is penalized less than confusing two classes with different parents). The main metric will be the macro-averaged hierarchical F-score (i.e., hF computed for each target second-level BST class, and then averaged over all classes).
The original hP, hR and hF are described in the paper cited at the end of this section. Nevertheless, we use a modified version which allows us to parametrically adjust (using the parameter λ) the weight given to classifications that are correct only at the top level of the taxonomy. To give further intuition of how the metric behaves, the table below reports simulated hP, hR and hF scores (λ=0.75) for hypothetical systems with the following characteristics:
- Perfect: a system with perfect prediction capabilities.
- Half wrong: a system which makes wrong predictions half of the time. The wrong predictions are wrong at the second level of the taxonomy, but in some cases might be correct at the top level.
- Half wrong (top level correct): a system which makes wrong predictions half of the time. The wrong predictions are wrong at the second level of the taxonomy, but are always correct at the first level.
- Half wrong (both levels wrong): a system which makes wrong predictions half of the time. The wrong predictions are wrong at both levels of the taxonomy.
- All wrong: a system which makes wrong predictions all the time. The wrong predictions are wrong at the second level of the taxonomy, but in some cases might be correct at the top level.
- All wrong (top level correct): a system which makes wrong predictions all of the time. The wrong predictions are wrong at the second level of the taxonomy, but are always correct at the first level.
- All wrong (both levels wrong): a system which makes wrong predictions all of the time. The wrong predictions are wrong at both levels of the taxonomy.
- Random: a system that makes random predictions. 1/N times (with N being the number of categories at the second level of the taxonomy) will get the prediction right. Wrong predictions might be correct at the top level.
| System name | hP | hR | hF |
|---|---|---|---|
| Perfect | 1.000 | 1.000 | 1.000 |
| Half wrong (top level correct) | 0.584 | 0.688 | 0.632 |
| Half wrong | 0.380 | 0.534 | 0.443 |
| Half wrong (both levels wrong) | 0.338 | 0.503 | 0.404 |
| All wrong (top level correct) | 0.375 | 0.375 | 0.375 |
| Random | 0.097 | 0.117 | 0.105 |
| All wrong | 0.073 | 0.073 | 0.072 |
| All wrong (both levels wrong) | 0.000 | 0.000 | 0.000 |
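The official metric implementation ships with the baseline code; the sketch below is an unofficial reconstruction of a λ-weighted, macro-averaged two-level hierarchical P/R/F that is consistent with the table above (for instance, it yields exactly 0.375 for "All wrong (top level correct)" at λ=0.75). Grouping and averaging details may differ from the official version:

```python
from collections import defaultdict

def hierarchical_prf(y_true, y_pred, parent, lam=0.75):
    """Weighted hierarchical precision/recall/F over a two-level taxonomy.

    y_true, y_pred: sequences of second-level class labels.
    parent: maps each second-level class to its top-level class.
    lam: weight for predictions that are correct only at the top level.
    """
    num_p, den_p = defaultdict(float), defaultdict(float)
    num_r, den_r = defaultdict(float), defaultdict(float)
    for t, p in zip(y_true, y_pred):
        if p == t:
            overlap = 2.0        # both taxonomy levels match
        elif parent[p] == parent[t]:
            overlap = lam        # only the top level matches
        else:
            overlap = 0.0        # wrong at both levels
        num_p[p] += overlap; den_p[p] += 2.0   # grouped by predicted class
        num_r[t] += overlap; den_r[t] += 2.0   # grouped by true class
    classes = sorted(set(den_p) | set(den_r))
    hP = sum(num_p[c] / den_p[c] for c in classes if den_p[c]) / len(classes)
    hR = sum(num_r[c] / den_r[c] for c in classes if den_r[c]) / len(classes)
    hF = 2 * hP * hR / (hP + hR) if hP + hR else 0.0
    return hP, hR, hF

# Tiny example with three second-level classes under two top-level parents.
parent = {"m-sp": "m", "m-si": "m", "fx-v": "fx"}
perfect = hierarchical_prf(["m-sp", "m-si", "fx-v"], ["m-sp", "m-si", "fx-v"], parent)
top_only = hierarchical_prf(["m-sp", "m-si"], ["m-si", "m-sp"], parent)
print(perfect, top_only)
```

With λ=0.75, swapping two sibling classes scores λ/2 = 0.375 per prediction, matching the "All wrong (top level correct)" row of the table.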
To complement the main leaderboard based on hF, we will also report hP, hR and hF for each individual second-level BST class, as well as overall (and per-class) metrics at the top level of the BST hierarchy.
The baseline system code includes an implementation of these evaluation metrics that can be used by task participants. The original definition of hP, hR and hF is from this paper:
Baseline System
As a baseline system, we use variations of the HATR model presented at the DCASE Workshop 2025 [3]. We include a multimodal version and an audio-only version, both non-hierarchical models trained on audio and text representation vectors extracted using the pretrained LAION-CLAP model. The baseline can be found in this code repository.
Hints
Here we provide some hints that may help participants explore the data and inform their experiments and systems.
- Since uploader usernames are provided and different contributors can be identified, it may be useful to explore the data grouped by uploader. There may be uploader-specific patterns in the data; for example, some users are very active and may upload incorrectly labeled samples, so filtering or analyzing the data at the uploader level could be beneficial.
- There may be noise in the textual metadata associated with sounds (titles, tags, and descriptions), as these often contain information unrelated to the content of the sound. A careful analysis or preprocessing of the metadata could therefore help improve model performance.
- Note that BSD10k-v1.2 includes confidence scores for the ground truth labels. These might (or might not) be relevant during training.
- To better understand how models make predictions and handle variability, participants may also experiment with separate models for different top-level categories or focus on interpretability methods to gain insights into model behavior.
- Another potential direction is to introduce hierarchy within the model, e.g. through hierarchical architectures. Some variants of the baseline model (not the ones provided as the official baseline) reinforce the hierarchy via the loss function, although this had minimal effect on accuracy in that setup (which did not evaluate hierarchical metrics); alternative approaches may achieve stronger results.
- It may be worth exploring training with other datasets (mapped to the BST categories), including curated or cleaner sources. On the one hand, mapping dedicated datasets with cleaner audio and metadata (e.g. a wind-instrument dataset mapped to the Winds category) may help establish clearer decision boundaries; on the other hand, large weakly-verified mappings may improve generalization to real-world scenarios. Synthetic data could also be explored.
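Following the first hint above, uploader-level statistics are easy to compute from the metadata with a pandas groupby. The rows below are invented for illustration; with the real CSV, very active uploaders with unusual label patterns may deserve a closer manual look:

```python
import pandas as pd

# Toy metadata rows; a real analysis would use the full metadata CSV.
df = pd.DataFrame({
    "uploader": ["alice", "alice", "alice", "bob", "bob", "carol"],
    "class":    ["fx-v",  "fx-v",  "fx-v",  "ss-u", "is-e", "m-m"],
})

# Per-uploader volume and label diversity: how many sounds each uploader
# contributed and how many distinct BST classes they used.
stats = df.groupby("uploader")["class"].agg(
    n_sounds="size", n_classes="nunique"
)
print(stats)
```

Such statistics can feed simple filtering rules (e.g. capping the number of sounds per uploader in a training fold) or motivate manual review of specific contributors.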
Submission
General information about submission can be found in the submission page. Challenge submission consists of the following:
- System output files for up to four systems per participating team (*.csv)
- Metadata file(s) for each system (*.yaml)
- A technical report explaining in sufficient detail the methods used in the submitted systems (*.pdf)
System output file
The system output should be provided as a single text file in CSV format containing a classification result and, optionally, a classification score for each audio file in the evaluation set. The system output file should look like the following example:
id,predicted_bst_second_level_class,prediction_score
c7e15030,fx-v,0.835
713aa451,ss-u,0.851
924a4665,is-e,0.796
...
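A minimal sketch for producing such a file with the standard library (the ids and scores below are hypothetical; only the header follows the format above):

```python
import csv
import io

# Hypothetical predictions: (evaluation sound id, predicted class, score).
predictions = [
    ("c7e15030", "fx-v", 0.835),
    ("713aa451", "ss-u", 0.851),
]

# Write to an in-memory buffer; in practice use
# open("predictions.csv", "w", newline="") instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "predicted_bst_second_level_class", "prediction_score"])
writer.writerows(predictions)
print(buf.getvalue().splitlines()[0])
```

The score column is optional per the description above, so rows with only id and class should also be acceptable; when in doubt, match the example format exactly.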
Metadata file
For each system entry, it is crucial to provide meta information in a separate YAML file containing the task-specific information. This meta information is essential for fast processing of the submissions and analysis of submitted systems. Participants are strongly advised to fill in the meta information carefully, ensuring all information is correctly provided.
What follows is an example metadata file for reference. Note that this file contains a results section in which we expect participants to report their evaluation results on the development datasets. This section is optional, but we encourage participants to fill it out. The baseline code includes methods to compute and report evaluation data that can be copied into the metadata file, including the implementation of the above-mentioned 5-fold cross-validation strategy.
# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the
  # corresponding author]_task[task number]_[index number of your submission
  # (1-4)]
  label: Font_UPF_task1_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: Example submission system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: ExampleSys

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Font
      firstname: Frederic
      email: frederic.font@upf.edu # Contact email address
      corresponding: true # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: UPF
        institute: Universitat Pompeu Fabra (UPF)
        department: Music Technology Group (MTG) # Optional
        location: Barcelona, Spain

    # Second author
    - lastname: Anastasopoulou
      firstname: Panagiota
      email: panagiota.anastasopoulou@upf.edu
      affiliation:
        abbreviation: UPF
        institute: Universitat Pompeu Fabra (UPF)
        department: Music Technology Group (MTG) # Optional
        location: Barcelona, Spain
# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".

  # URL to the full source code of the system [optional]
  source_code: https://github.com/MTG/dcase2026_task1_baseline

  description:
    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 48kHz

    # Audio representation (audio features or embedding spaces)
    # e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, PANNs,
    # CLAP, EnCodec ...
    audio_representation: CLAP

    # Representation for text input (textual features, metadata items and/or
    # embedding spaces, only relevant in multimodal systems that use metadata
    # as input)
    # e.g. title, tags, description, Word2Vec, CLAP ...
    # Use !!null if the system does not use metadata as input (if the system is
    # audio-only).
    text_representation: title, tags, description, CLAP

    # Data augmentation methods
    # e.g. mixup, freq-mixstyle, dir augmentation, pitch shifting, time
    # rolling, frequency masking, time masking, frequency warping,
    # noise addition, ...
    data_augmentation: noise addition, time masking

    # Machine learning
    # e.g., MLP, (RF-regularized) CNN, RNN, CRNN, Transformer, ...
    machine_learning_method: MLP

    # External data usage method
    # e.g. dataset, embeddings, ...
    external_data_usage: embeddings

    # Method for considering taxonomy hierarchy in the system design
    # e.g. "loss function", "multiple classifiers", ...
    hierarchical_setting: !!null

  # System complexity
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training
    # process in the network summary.
    # For other than neural networks, if parameter count information is not
    # directly available, try estimating the count as
    # accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network.
    # Use numerical value.
    total_parameters: 269992
    MACS: 1.036 G

  # List of external datasets used in the submission. If using embeddings from
  # pre-trained models, there's NO need to include the datasets used to train
  # the embedding models here.
  external_datasets:
    # Below are two examples (NOT used in the baseline system)
    #- name: EfficientAT
    #  url: https://github.com/fschmid56/EfficientAT
    #  total_audio_length: !!null
    #- name: MicIRP
    #  url: http://micirp.blogspot.com/?m=1
    #  total_audio_length: 2 # specify in minutes
# System results [OPTIONAL]
results:
  development_datasets:
    # System results for both development datasets
    # If possible, please provide overall and class-wise results when
    # training/evaluating your systems on each of the development datasets
    # separately. You can follow the baseline code example to do that. Note that
    # development datasets do not include data splits; our baseline example uses
    # 5-fold cross-validation and reports the average performance across the 5
    # folds. You can reproduce the same experiment setup using the code
    # provided with the baseline system and using the same random seed to
    # generate the same folds.
    # Please refer to the baseline code for the calculation of the overall
    # metrics and class-wise (hP, hR, hF).
    # Set parameter lambda=0.75 (which is default in the baseline code).
    # Note that the numbers below are just examples, they do not correspond to
    # the actual performance of any system.
    bsd10k-v1.2:
      overall:
        hP: 0.584
        hR: 0.688
        hF: 0.632
      class_wise:
        m-sp: { hP: 0.799, hR: 0.875, hF: 0.835 }
        m-si: { hP: 0.794, hR: 0.844, hF: 0.818 }
        m-m: { hP: 0.834, hR: 0.944, hF: 0.885 }
        is-p: { hP: 0.799, hR: 0.887, hF: 0.841 }
        is-s: { hP: 0.821, hR: 0.887, hF: 0.853 }
        is-w: { hP: 0.790, hR: 0.881, hF: 0.833 }
        is-k: { hP: 0.836, hR: 0.900, hF: 0.867 }
        is-e: { hP: 0.753, hR: 0.844, hF: 0.796 }
        sp-s: { hP: 0.790, hR: 0.881, hF: 0.833 }
        sp-c: { hP: 0.821, hR: 0.906, hF: 0.862 }
        sp-p: { hP: 0.764, hR: 0.838, hF: 0.799 }
        fx-o: { hP: 0.788, hR: 0.900, hF: 0.841 }
        fx-v: { hP: 0.788, hR: 0.887, hF: 0.835 }
        fx-m: { hP: 0.802, hR: 0.875, hF: 0.837 }
        fx-h: { hP: 0.778, hR: 0.850, hF: 0.812 }
        fx-a: { hP: 0.795, hR: 0.863, hF: 0.828 }
        fx-n: { hP: 0.805, hR: 0.900, hF: 0.850 }
        fx-ex: { hP: 0.797, hR: 0.881, hF: 0.837 }
        fx-el: { hP: 0.753, hR: 0.825, hF: 0.787 }
        ss-n: { hP: 0.806, hR: 0.887, hF: 0.845 }
        ss-i: { hP: 0.840, hR: 0.900, hF: 0.869 }
        ss-u: { hP: 0.817, hR: 0.887, hF: 0.851 }
        ss-s: { hP: 0.849, hR: 0.925, hF: 0.885 }
    bsd35k-cs:
      overall:
        hP: 0.339
        hR: 0.506
        hF: 0.406
      class_wise:
        m-sp: { hP: 0.594, hR: 0.713, hF: 0.648 }
        m-si: { hP: 0.618, hR: 0.725, hF: 0.667 }
        m-m: { hP: 0.566, hR: 0.656, hF: 0.608 }
        is-p: { hP: 0.544, hR: 0.650, hF: 0.592 }
        is-s: { hP: 0.579, hR: 0.675, hF: 0.623 }
        is-w: { hP: 0.589, hR: 0.700, hF: 0.640 }
        is-k: { hP: 0.576, hR: 0.681, hF: 0.625 }
        is-e: { hP: 0.605, hR: 0.700, hF: 0.649 }
        sp-s: { hP: 0.601, hR: 0.688, hF: 0.642 }
        sp-c: { hP: 0.587, hR: 0.706, hF: 0.641 }
        sp-p: { hP: 0.607, hR: 0.719, hF: 0.658 }
        fx-o: { hP: 0.590, hR: 0.694, hF: 0.638 }
        fx-v: { hP: 0.598, hR: 0.719, hF: 0.653 }
        fx-m: { hP: 0.563, hR: 0.644, hF: 0.601 }
        fx-h: { hP: 0.565, hR: 0.669, hF: 0.612 }
        fx-a: { hP: 0.597, hR: 0.688, hF: 0.639 }
        fx-n: { hP: 0.541, hR: 0.631, hF: 0.583 }
        fx-ex: { hP: 0.604, hR: 0.719, hF: 0.656 }
        fx-el: { hP: 0.587, hR: 0.713, hF: 0.644 }
        ss-n: { hP: 0.561, hR: 0.656, hF: 0.605 }
        ss-i: { hP: 0.598, hR: 0.719, hF: 0.653 }
        ss-u: { hP: 0.573, hR: 0.669, hF: 0.617 }
        ss-s: { hP: 0.583, hR: 0.688, hF: 0.631 }
Submission package
To form your submission package, please follow the general submission instructions, which apply to all tasks. All files should be packaged into a zip file for submission. Please ensure a clear connection between the system name in the submitted YAML, the submitted system output, and the technical report! See instructions on how to form your submission label. The submission label's index field should be used to differentiate your submissions in case you have multiple submissions. Here is an example of the package structure, assuming a submission label Font_UPF_task1:
task1/
├── Font_UPF_task1.technical_report.pdf
├── Font_UPF_task1_1/
│   ├── Font_UPF_task1_1.output.csv
│   └── Font_UPF_task1_1.meta.yaml
Citation
[1] The Broad Sound Taxonomy is described in detail in a journal article which is currently under review. The preprint is nevertheless accessible online, made available by the journal:
Panagiota Anastasopoulou, Xavier Serra, and Frederic Font. A general-purpose sound taxonomy for the classification of heterogeneous sound collections. In press. URL: https://www.researchsquare.com/article/rs-7206795/v1.
[2, 3] If using any version of the BSD10k dataset, please cite the original papers:
Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, and Frederic Font. Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2024.
Panagiota Anastasopoulou, Francesco Ardan Dal Rí, Xavier Serra, and Frederic Font. Hierarchical and multimodal learning for heterogeneous sound classification. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2025.
[4] If using BSD35k-CS, please cite the corresponding Zenodo page:
Panagiota Anastasopoulou and Frederic Font Corbera. BSD35k-CS (Broad Sound Dataset 35k, Crowd Sourced). Zenodo, March 2026. URL: https://doi.org/10.5281/zenodo.19187100, doi:10.5281/zenodo.19187100.