This task aims to develop a universal domain-incremental learning (DIL) system that learns to classify audio from different domains sequentially over time without significantly forgetting any of the previously learned domains. Participants will train a model for sound event classification in incremental steps using data from different domains, without access to data from previous domains at each step.
Description
Continual learning aims to develop systems that can accumulate knowledge over time without forgetting previously learned tasks. In this task, continual learning is performed through incremental exposure to sound events from different domains. Models must adapt to each new domain while maintaining stable performance on earlier ones, promoting long‑term learning and generalization.
The classification task consists of 10 sound classes, with examples coming from three different domains. Audio data from different domains is revealed sequentially, and at each stage the system must learn the new domain using only its data (no access to earlier domain data is allowed).
The goal of this challenge is to develop a domain‑agnostic incremental learning system that learns new audio classification domains over time without revisiting past data. Participants train a sound event classifier under this domain‑incremental learning (DIL) scenario, designing models that integrate knowledge from each newly revealed domain while retaining strong performance on all previously learned domains, minimizing catastrophic forgetting and resulting in a robust, domain‑agnostic audio classifier.
Audio dataset
The task will use the DIL-DCASE26 dataset, which contains sound events collected from three different domains. The original datasets from which the sounds were selected include, e.g., AudioSet.
The data is presented as belonging to domains 1, 2, and 3, without reference to its original provenance. The released DIL-DCASE26 data contains audio from domains 2 and 3, while the knowledge of domain 1 is embedded in the provided baseline system.
Sound event classes:
- alarm
- baby_cry
- bark
- engine
- fire
- footsteps
- knock
- telephone_ringing
- piano
- speech
Reference labels are provided with the development data and include the class label and domain. Each clip is annotated with a single label. Additional sounds may be present in the audio, but each clip has one clear target sound for the classification task. For the evaluation data, only audio will be provided.
Task Setup
Development Dataset
The development dataset contains data from 10 sound classes belonging to 2 domains, with 139 minutes of audio from D2 and 275 minutes of audio from D3. The knowledge about D1 is embedded into the trained baseline system provided.
The dataset is available on Zenodo.
Evaluation Dataset
The evaluation dataset contains files from all three domains. The evaluation set will be provided at the corresponding phase of the challenge.
External Data Resources and Pretrained Models
Use of external data is forbidden. System development may use only the data provided in the task. Use of pretrained models is not allowed.
Task Rules
- Use of external data is not allowed. Use of pretrained models/embeddings is not allowed.
- Manipulation of the provided training and development data is allowed (e.g., by mixing data sampled from a probability density function, or by using techniques such as pitch shifting or time stretching).
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it.
- The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is forbidden.
- The classification decision must be made independently for each test sample.
Evaluation
Systems will be ranked by overall accuracy.
Accuracy will be calculated separately for each domain, and the overall accuracy is then the average over the three domains. This gives the three domains equal weight in the final ranking, accounting for potential data imbalance. Domain-wise accuracy is calculated as the average of the class-wise accuracies.
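The ranking score described above can be sketched as follows: class-wise accuracies are averaged within each domain, and the resulting domain accuracies are averaged into the final score. This is a minimal illustration, not the official evaluation script; the record format is an assumption made here.

```python
from collections import defaultdict

def overall_accuracy(records):
    """records: iterable of (domain, true_class, predicted_class) tuples.
    Returns (overall accuracy, per-domain accuracy dict)."""
    # Count correct/total predictions per (domain, class) pair.
    counts = defaultdict(lambda: [0, 0])
    for domain, true, pred in records:
        counts[(domain, true)][1] += 1
        if true == pred:
            counts[(domain, true)][0] += 1

    # Domain-wise accuracy: average of the class-wise accuracies.
    per_domain_class_accs = defaultdict(list)
    for (domain, _), (correct, total) in counts.items():
        per_domain_class_accs[domain].append(correct / total)
    domain_acc = {d: sum(a) / len(a) for d, a in per_domain_class_accs.items()}

    # Overall accuracy: average over domains, giving each equal weight.
    overall = sum(domain_acc.values()) / len(domain_acc)
    return overall, domain_acc
```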
Baseline System
The baseline system implements a convolutional neural network based approach. Learning of new domains is based on adjusting the batch normalization parameters (BN layers) to reflect the data distribution of each new domain. The method is based on the work in [1], and uses the domain-agnostic version of the system.
Architecture
The baseline system includes 6 convolutional blocks. Each block contains 2 convolutional layers, each followed by a batch normalization (BN) layer, with layer specifications matching PANNs CNN14. Global pooling is applied to the output of the last convolutional block to obtain a fixed-length feature vector as input to the classifier.
Training
The baseline model is trained from scratch on domain D1; separate domain-specific BN layers are then adapted for domains D2 and D3 in incremental phases.
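An incremental phase of this kind can be sketched as freezing all shared parameters and training only the BN layers of the new domain. This is an illustrative sketch, not the baseline code: the convention that per-domain BN layers live in a `ModuleList` named `bns` indexed by domain is an assumption made here.

```python
import torch.nn as nn

def prepare_incremental_phase(model: nn.Module, domain_idx: int) -> None:
    """Freeze all parameters, then unfreeze only the BN layers of the new
    domain (assumed to be registered as `...bns.<domain_idx>` modules)."""
    for p in model.parameters():
        p.requires_grad = False
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d) and name.endswith(f"bns.{domain_idx}"):
            # Re-estimate running statistics on the new domain's data.
            module.reset_running_stats()
            for p in module.parameters():
                p.requires_grad = True
```

An optimizer for the incremental phase would then be built over `[p for p in model.parameters() if p.requires_grad]` only.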
Inference
During inference, the appropriate domain-specific BN layers are selected automatically and used together with the domain-shared layers for classification. Specifically, an input audio clip is forward-passed through the shared layers combined with the domain-specific layers of each domain seen so far, producing one set of class probabilities per domain. The model's uncertainty for each set of probabilities is then computed as its entropy, and the domain-specific layers yielding the minimum entropy, i.e., the lowest uncertainty, are selected for classification.
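The entropy-based selection step can be sketched as follows, assuming the per-domain softmax outputs have already been computed (a minimal illustration, not the baseline implementation):

```python
import numpy as np

def select_domain_by_entropy(probs_per_domain) -> int:
    """probs_per_domain: array-like of shape (num_domains, num_classes),
    holding the softmax outputs obtained with each domain's BN layers.
    Returns the index of the domain whose prediction has minimum entropy."""
    p = np.clip(np.asarray(probs_per_domain, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)  # one entropy value per domain
    return int(np.argmin(entropy))
```

The final class prediction is then the argmax of the probability vector from the selected domain.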
Parameters
- Audio features
- Sampling rate: 32 kHz
- Training samples in the development set are segmented into 4-second signals, while the testing samples have variable lengths.
- Log mel-band energies (64 bands) with lower and upper cut-off frequencies 50 Hz and 14 kHz respectively. The window (Hamming) size is set to 1024 samples and hop size to 320 samples.
- Neural Network
- Architecture:
- CNN blocks 1–6: 2 x [2D convolutional layer (kernel size: 3) + 3 batch normalization layers (D1, D2 and D3) + ReLU], followed by 2 x 2 average pooling + dropout (rate: 20%); filters per block: 64, 128, 256, 512, 1024, 2048
- Global pooling
- Output layer (activation: softmax)
- Learning: 120 epochs (batch size 32), data shuffling between epochs
- Optimizer: Adam (learning rate at initial phase: 0.0001, at incremental phases: 0.00001)
- Scheduler: CosineAnnealingLR
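The per-domain BN design in the architecture above can be sketched as a convolutional layer carrying one BatchNorm2d per domain; the forward pass picks the BN set of the requested domain. This is an illustrative module with names chosen here, not the baseline's actual code.

```python
import torch
import torch.nn as nn

class DomainBNConv(nn.Module):
    """A 3x3 convolution with a separate BatchNorm2d per domain, followed
    by ReLU. The convolution weights are shared across domains; only the
    BN layers are domain-specific."""
    def __init__(self, in_ch: int, out_ch: int, num_domains: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(num_domains))

    def forward(self, x: torch.Tensor, domain_idx: int) -> torch.Tensor:
        return torch.relu(self.bns[domain_idx](self.conv(x)))
```

A full block in the spec above would stack two such layers and append 2 x 2 average pooling and dropout.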
The baseline system is available on GitHub.
Baseline Results
Results of the baseline are calculated using PyTorch in GPU mode. The baseline is trained for 120 epochs and tested on the test split of the development dataset.
| Accuracy (%) | After learning D2 | After learning D3 |
| --- | --- | --- |
| D2 | 54.7 | 54.7 |
| D3 | – | 35.0 |
| Avg | 54.7 | 44.9 |
The baseline model first learns to classify sounds from domain D2, obtaining an accuracy of 54.7% on D2 data. It then incrementally learns domain D3 and obtains an accuracy of 35.0% on D3.
The average accuracy of the baseline model on D2 and D3 is 44.9%. D1 results will be included in the overall average after the challenge deadline.
Note: The reported baseline system performance is not exactly reproducible due to differences in computational setups; however, you should be able to obtain very similar results.
Submission
Official challenge submission consists of:
- System output file(s) (*.csv)
- Metadata file (*.yaml)
- Model files (*.pt, *.pth)
- Technical report explaining in sufficient detail the method (*.pdf)
All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted yaml, submitted system output, and the content (results) in the technical report! Use a clear naming convention (e.g. name your system based on the submission label).
System output file
System output should be presented as a single text file in TSV format, without a header row, containing a classification result for each audio file in the evaluation set. Result items can be in any order. Multiple system outputs can be submitted (maximum 4 systems per participant per subtask).
Each row in a system output file should contain the input filename and the predicted sound class separated with tabs. The output files should have the following format:
Test_file_1.wav[tab]piano
Test_file_10.wav[tab]telephone_ringing
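Writing the output file in this format can be done with the standard library; the file name and the prediction contents below are placeholders.

```python
import csv

# Predicted class per evaluation file (placeholder values).
predictions = {
    "Test_file_1.wav": "piano",
    "Test_file_10.wav": "telephone_ringing",
}

# Tab-separated output, no header row, one result per line.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for filename, label in predictions.items():
        writer.writerow([filename, label])
```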
Metadata
For each system, metadata information should be provided in a separate file, containing the task-specific information. This meta information enables fast processing of the submissions and analysis of submitted systems. Participants are advised to fill in the meta information carefully and make sure all information is correctly provided.
Model files
The participant must submit the results and additional material to be evaluated, strictly following the general guidelines for challenge submission. Additionally, participants in task 7 have to submit the following files:
A [lastname]_[affiliation]_task7_[submission_index]_model.py file which contains the model class the participant used in their experiments and a function defined as follows:
def load_model(submission: int = 1):
    # ...
    return model
The load_model function needs to be implemented by the participant and will return their implemented model with the state dictionary of the corresponding incremental step loaded in.
State dictionary files of the model at every incremental step, named like [lastname]_[affiliation]_task7_[submission_index]_D[domain_num]_dictionary.pth.
For example, Casciotti_TUNI_task7_1_model.py and Casciotti_TUNI_task7_1_D2_dictionary.pth.
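A self-contained sketch of what load_model might look like is given below. TinyModel, the concrete file name, and the extra domain argument are illustrative assumptions made here, not requirements of the task.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for the participant's actual model class."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 10)  # 10 sound classes

def load_model(submission: int = 1, domain: int = 2) -> nn.Module:
    """Instantiate the model and load the state dictionary saved after
    the given incremental step (file name pattern is illustrative)."""
    model = TinyModel()
    path = f"lastname_affiliation_task7_{submission}_D{domain}_dictionary.pth"
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model
```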
Citation
If you are using the baseline system, please cite the following: