This task aims to develop a universal domain-incremental learning (DIL) system that learns to classify audio from different domains sequentially over time without significantly forgetting any of the previously learned domains. Participants will train a model for sound event classification in incremental steps using data from different domains, without access to data from previous domains at each step.
Description
Continual learning aims to develop systems that can accumulate knowledge over time without forgetting previously learned tasks. In this task, continual learning is performed through incremental exposure to sound events from different domains. Models must adapt to each new domain while maintaining stable performance on earlier ones, promoting long‑term learning and generalization.
The classification task consists of 10 sound classes, with examples coming from three different domains. Audio data from different domains is revealed sequentially, and at each stage the system must learn the new domain using only its data (no access to earlier domain data is allowed).
The goal of this challenge is to develop a domain‑agnostic incremental learning system that learns new audio classification domains over time without revisiting past data. Participants train a sound event classifier under this domain‑incremental learning (DIL) scenario, designing models that integrate knowledge from each newly revealed domain while retaining strong performance on all previously learned domains, minimizing catastrophic forgetting and resulting in a robust, domain‑agnostic audio classifier.
Audio dataset
The task will use the DIL-DCASE26 dataset, which contains sound events collected from three different domains. The original datasets from which the sounds were selected include, e.g., AudioSet.
The data is presented as belonging to domains 1, 2, and 3, without reference to its original provenance. The released DIL-DCASE26 data contains audio from domains 2 and 3, while the knowledge of domain 1 is embedded in the provided baseline system.
Sound event classes:
- alarm
- baby_cry
- bark
- engine
- fire
- footsteps
- knock
- telephone_ringing
- piano
- speech
Reference labels are provided with the development data and include the class label and domain. Each clip is annotated with a single label. Additional sounds may be present in the audio, but each clip has one clear target sound for the classification task. For the evaluation data, only audio will be provided.
Task Setup
Development Dataset
The development dataset contains data from 10 sound classes belonging to 2 domains, with 139 minutes of audio from D2 and 275 minutes of audio from D3. The knowledge about D1 is embedded into the trained baseline system provided.
The dataset is available on Zenodo.
Evaluation Dataset
The evaluation dataset contains files from all three domains. The evaluation set will be provided at the corresponding phase of the challenge.
External Data Resources and Pretrained Models
Use of external data is forbidden. System development may use only the data provided in the task. Use of pretrained models is not allowed.
Task Rules
- Use of external data is not allowed. Use of pretrained models/embeddings is not allowed.
- Manipulation of the provided training and development data is allowed (e.g., by mixing data sampled from a probability density function, or by using techniques such as pitch shifting or time stretching).
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.
- The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is forbidden.
- The classification decision must be made independently for each test sample.
Evaluation
Systems will be ranked by overall accuracy.
Accuracy will be calculated separately for each domain, and the overall accuracy is then the average over the three domains. This gives the three domains equal weight in the final ranking, accounting for potential data imbalance. Domain-wise accuracy is calculated as the average of the class-wise accuracies.
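The ranking score described above can be sketched as follows: class-wise accuracies are averaged within each domain, and the resulting domain accuracies are averaged into the final score. This is a minimal illustration, not the official evaluation script; the record format is an assumption made here.

```python
from collections import defaultdict

def overall_accuracy(records):
    """records: iterable of (domain, true_class, predicted_class) tuples.
    Returns (overall accuracy, per-domain accuracy dict)."""
    # Count correct/total predictions per (domain, class) pair.
    counts = defaultdict(lambda: [0, 0])
    for domain, true, pred in records:
        counts[(domain, true)][1] += 1
        if true == pred:
            counts[(domain, true)][0] += 1

    # Domain-wise accuracy: average of the class-wise accuracies.
    per_domain_class_accs = defaultdict(list)
    for (domain, _), (correct, total) in counts.items():
        per_domain_class_accs[domain].append(correct / total)
    domain_acc = {d: sum(a) / len(a) for d, a in per_domain_class_accs.items()}

    # Overall accuracy: average over domains, giving each equal weight.
    overall = sum(domain_acc.values()) / len(domain_acc)
    return overall, domain_acc
```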
Baseline System
The baseline system implements a convolutional neural network based approach. Learning of new domains is based on adjusting the batch normalization parameters (BN layers) to reflect the data distribution of each new domain. The method is based on the work in [1], and uses the domain-agnostic version of the system.
Architecture
The baseline system includes 6 convolutional blocks. Each block contains 2 convolutional layers, each followed by a batch normalization (BN) layer, with layer specifications matching PANNs CNN14. Global pooling is applied to the output of the last convolutional block to obtain a fixed-length feature vector as input to the classifier.
Training
The baseline model is trained from scratch on domain D1; separate domain-specific BN layers are then adapted for domains D2 and D3 in incremental phases.
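An incremental phase of this kind can be sketched as freezing all shared parameters and training only the BN layers of the new domain. This is an illustrative sketch, not the baseline code: the convention that per-domain BN layers live in a `ModuleList` named `bns` indexed by domain is an assumption made here.

```python
import torch.nn as nn

def prepare_incremental_phase(model: nn.Module, domain_idx: int) -> None:
    """Freeze all parameters, then unfreeze only the BN layers of the new
    domain (assumed to be registered as `...bns.<domain_idx>` modules)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d) and name.endswith(f"bns.{domain_idx}"):
            # Re-estimate running statistics on the new domain's data.
            module.reset_running_stats()
            for p in module.parameters():
                p.requires_grad = True
```

An optimizer for the incremental phase would then be built over `[p for p in model.parameters() if p.requires_grad]` only.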
Inference
During inference, the appropriate domain-specific BN layers are selected automatically and used together with the domain-shared layers for classification. Specifically, an input audio clip is forward-passed through the shared layers combined with the domain-specific layers of each domain seen so far, producing one set of class probabilities per domain. The model's uncertainty for each set of probabilities is then computed as its entropy, and the domain-specific layers yielding the minimum entropy, i.e., the lowest uncertainty, are selected for classification.
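The entropy-based selection step can be sketched as follows, assuming the per-domain softmax outputs have already been computed (a minimal illustration, not the baseline implementation):

```python
import numpy as np

def select_domain_by_entropy(probs_per_domain) -> int:
    """probs_per_domain: array-like of shape (num_domains, num_classes),
    holding the softmax outputs obtained with each domain's BN layers.
    Returns the index of the domain whose prediction has minimum entropy."""
    p = np.clip(np.asarray(probs_per_domain, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)  # one entropy value per domain
    return int(np.argmin(entropy))
```

The final class prediction is then the argmax of the probability vector from the selected domain.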
Parameters
- Audio features
- Sampling rate: 32 kHz
- Training samples in the development set are segmented into 4-second signals, while the testing samples have variable lengths.
- Log mel-band energies (64 bands) with lower and upper cut-off frequencies 50 Hz and 14 kHz respectively. The window (Hamming) size is set to 1024 samples and hop size to 320 samples.
- Neural Network
- Architecture:
- CNN blocks 1–6: 2 x [2D convolutional layer (kernel size: 3) + 3 batch normalization layers (D1, D2 and D3) + ReLU], followed by 2 x 2 average pooling + dropout (rate: 20%); filters per block: 64, 128, 256, 512, 1024, 2048
- Global pooling
- Output layer (activation: softmax)
- Learning: 120 epochs (batch size 32), data shuffling between epochs
- Optimizer: Adam (learning rate at initial phase: 0.0001, at incremental phases: 0.00001)
- Scheduler: CosineAnnealingLR
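The per-domain BN design in the architecture above can be sketched as a convolutional layer carrying one BatchNorm2d per domain; the forward pass picks the BN set of the requested domain. This is an illustrative module with names chosen here, not the baseline's actual code.

```python
import torch
import torch.nn as nn

class DomainBNConv(nn.Module):
    """A 3x3 convolution with a separate BatchNorm2d per domain, followed
    by ReLU. The convolution weights are shared across domains; only the
    BN layers are domain-specific."""
    def __init__(self, in_ch: int, out_ch: int, num_domains: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(num_domains))

    def forward(self, x: torch.Tensor, domain_idx: int) -> torch.Tensor:
        return torch.relu(self.bns[domain_idx](self.conv(x)))
```

A full block in the spec above would stack two such layers and append 2 x 2 average pooling and dropout.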
The baseline system is available on GitHub.
Baseline Results
Results of the baseline are calculated using PyTorch in GPU mode. The baseline is trained for 120 epochs and tested on the test split of the development dataset.
| Accuracy (%) | After learning D2 | After learning D3 |
| --- | --- | --- |
| D2 | 54.7 | 54.7 |
| D3 | – | 35.0 |
| Avg | 54.7 | 44.9 |
The baseline model first learns to classify sounds from domain D2, obtaining an accuracy of 54.7% on D2 data. It then incrementally learns domain D3 and obtains an accuracy of 35.0% on D3.
The average accuracy of the baseline model on D2 and D3 is 44.9%. D1 results will be included in the overall average after the challenge deadline.
Note: The reported baseline system performance is not exactly reproducible due to differences in computational setups; however, you should be able to obtain very similar results.
Submission
Official challenge submission consists of:
- System output file(s) (*.csv)
- Metadata file (*.yaml)
- Model files (*.pt, *.pth)
- Technical report explaining in sufficient detail the method (*.pdf)
All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted yaml, submitted system output, and the content (results) in the technical report! Use a clear naming convention (e.g. name your system based on the submission label).
System output file
System output should be presented as a single text file in TSV format, without a header row, containing a classification result for each audio file in the evaluation set. Result items can be in any order. Multiple system outputs can be submitted (maximum 4 systems per participant per subtask).
Each row in a system output file should contain the input filename and the predicted sound class separated with tabs. The output files should have the following format:
Test_file_1.wav[tab]piano
Test_file_10.wav[tab]telephone_ringing
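Writing the output file in this format can be done with the standard library; the file name and the prediction contents below are placeholders.

```python
import csv

# Predicted class per evaluation file (placeholder values).
predictions = {
    "Test_file_1.wav": "piano",
    "Test_file_10.wav": "telephone_ringing",
}

# Tab-separated output, no header row, one result per line.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for filename, label in predictions.items():
        writer.writerow([filename, label])
```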
Metadata
For each system, metadata information should be provided in a separate file, containing the task-specific information. This meta information enables fast processing of the submissions and analysis of submitted systems. Participants are advised to fill in the meta information carefully and make sure all information is correctly provided.
Model files
The participant must submit the results and additional material to be evaluated, strictly following the general guidelines for challenge submission. Additionally, participants in task 7 have to submit the following files:
A [lastname]_[affiliation]_task7_[submission_index]_model.py file which contains the model class the participant used in their experiments and a function defined as follows:
def load_model(submission: int = 1):
    # ...
    return model
The load_model function needs to be implemented by the participant and will return their implemented model with the state dictionary of the corresponding incremental step loaded in.
State dictionary files of the model at every incremental step, named like [lastname]_[affiliation]_task7_[submission_index]_D[domain_num]_dictionary.pth.
For example, Casciotti_TUNI_task7_1_model.py and Casciotti_TUNI_task7_1_D2_dictionary.pth.
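A self-contained sketch of what load_model might look like is given below. TinyModel, the concrete file name, and the extra domain argument are illustrative assumptions made here, not requirements of the task.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for the participant's actual model class."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 10)  # 10 sound classes

def load_model(submission: int = 1, domain: int = 2) -> nn.Module:
    """Instantiate the model and load the state dictionary saved after
    the given incremental step (file name pattern is illustrative)."""
    model = TinyModel()
    path = f"lastname_affiliation_task7_{submission}_D{domain}_dictionary.pth"
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model
```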
Citation
If you are using the baseline system, please cite the following: