Low-Complexity Acoustic Scene Classification with Device Information


Task description

The goal of the acoustic scene classification task is to categorize recordings into one of ten predefined acoustic scene classes. This task builds on the Acoustic Scene Classification challenges from previous editions of the DCASE Challenge, with a slight shift in focus. This year, the task emphasizes several key challenges: (1) recording device mismatches, (2) low-complexity constraints, (3) data efficiency, and (4) the development of recording-device-specific models.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Acoustic scene classification systems categorize recordings into one of multiple predefined acoustic scene classes (Figure 1). Similar to previous editions, this year's challenge focuses on two real-world problems: recording device mismatch and low-complexity constraints. Previously, recording device information was not available for audio recordings in the evaluation set, encouraging the development of systems that generalize across (possibly unseen) recording devices. However, in real-world scenarios, the device type (e.g., the specific smartphone the model is running on) may be known, and the model's performance could potentially be improved by conditioning inference on the device type. In this edition, the device ID is provided not only for the development set but also for recordings in the evaluation set; participants are allowed to use separate models for different recording devices. At the same time, participants are encouraged to develop models that generalize to unseen recording devices, since the evaluation set also includes recordings from unknown devices (i.e., devices for which no recordings are available in the development-train split, and therefore no device-specific model can be trained).

Figure 1: Overview of acoustic scene classification system.


Novelties and Research Focus for 2025 Edition

Compared to Task 1 in the DCASE 2024 Challenge, the following aspects change in the 2025 edition:

  • Device Information: Recording device information is now available not only for the development set but also for the evaluation set, allowing participants to fine-tune models for specific recording devices.
  • Training Data: Participants are no longer required to train on all five subsets from DCASE'24 Task 1. Instead, models must be trained only on the 25% subset, encouraging data-efficient approaches such as pre-training.
  • External Resources: No restrictions on external datasets—participants may use any publicly available dataset, provided they announce it to the organizers by May 18, 2025.
  • Inference Code: Participants are required to submit inference code. This promotes open research and allows us to check whether the model operates within the complexity constraints of the challenge.

This new task setup highlights the following scientific questions:

  • Can the device type information be exploited to improve performance compared to previous year’s systems that didn’t have access to the device type?
  • Which machine learning techniques can effectively create specialized models for the different recording devices?
  • Can other acoustic scene datasets (with possibly different scenes, locations, devices) be used to increase the performance on the TAU dataset?

Audio dataset

The development dataset for this task is a subset of the TAU Urban Acoustic Scenes 2022 Mobile development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set. The dataset has exactly the same content as the TAU Urban Acoustic Scenes 2020 Mobile development dataset, but the audio files have a length of 1 second (therefore, there are ten times more files than in the 2020 version).

Recordings were made using four devices that captured audio simultaneously. The primary recording device, referred to as device A, consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm, and Vienna. The dataset was collected by Tampere University of Technology between 05/2018 and 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing, see:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.


A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data


Additionally, 10 mobile devices (S1-S10) are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to mimic realistic recordings. A recording from device A is convolved with the selected impulse response and then processed with a device-specific set of dynamic range compression parameters. The impulse responses are proprietary data and will not be published.
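To make the processing chain concrete, here is a minimal sketch of the simulation steps described above. It is illustrative only: the real impulse responses are proprietary, and the placeholder compression stands in for the unpublished device-specific compressor settings.

import numpy as np
from scipy.signal import fftconvolve

def simulate_device(audio_a: np.ndarray, device_ir: np.ndarray) -> np.ndarray:
    """Sketch: turn a device-A recording into a simulated mobile-device recording."""
    # 1) Convolve the device-A signal with the device impulse response.
    simulated = fftconvolve(audio_a, device_ir, mode="full")[: len(audio_a)]
    # 2) Device-specific dynamic range compression (placeholder: a soft limiter;
    #    the parameters actually used for S1-S10 are not published).
    peak = np.max(np.abs(simulated)) + 1e-9
    return np.tanh(simulated / peak) * peak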

The complete development dataset comprises 40 hours of data from device A and smaller amounts from the other devices; participants are only allowed to use the subset specified in the Task Setup section below. Audio is provided in a single-channel, 44.1 kHz, 24-bit format.

Acoustic scenes (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

Task Setup

Development Dataset

This year's development set reuses the 25% train split and the test split of Task 1 in the DCASE Challenge 2024.

It contains recordings from 10 cities and 9 devices: 3 real devices (A, B, and C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore, all overlap with the data from device A, but not necessarily with each other. The total amount of audio in this year's training set is around 18 hours.

The development data comes with a new predefined train/test split. Only the development-training data (25% split) can be used for model training. The development-test data is for validation and model selection only. Note: Recordings that are part of the TAU Urban Acoustic Scenes 2022 Mobile development dataset and not used in the development-training and development-test splits must not be used for model development.

Links to the CSV files specifying the development-train and development-test splits are given below.

Note that some devices appear only in the test subset. Complete details on the development set and the train/test split are provided in the following table.

Name        Type       Total duration   Total segments   Train segments   Test segments
A           Real       8h               28,820           25,520           3,300
B           Real       1h 26m           5,190            1,900            3,290
C           Real       1h 28m           5,210            1,920            3,290
S1          Simulated  1h 25m           5,140            1,840            3,300
S2          Simulated  1h 25m           5,150            1,850            3,300
S3          Simulated  1h 26m           5,170            1,870            3,300
S4, S5, S6  Simulated  2h 45m           3*3,300          -                3*3,300
Total       -          17h 56m          64,580           34,900           29,680

The development data is available for download here:


The data split files are automatically downloaded when running the baseline system.
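The split files can also be loaded manually, for example with pandas. The file locations and column names below are assumptions based on the baseline setup (split25.csv is the development-train split referenced in the task rules) and should be checked against the downloaded files.

import pandas as pd

# Assumed file locations and columns; adjust to your local copy of the dataset and splits.
meta = pd.read_csv("TAU-urban-acoustic-scenes-2022-mobile-development/meta.csv", sep="\t")
train_split = pd.read_csv("split_setup/split25.csv", sep="\t")

# Restrict the metadata to the 25% development-train subset allowed for training.
train_meta = meta[meta["filename"].isin(train_split["filename"])]
print(len(train_meta))  # expected: 34,900 one-second segments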

Evaluation Dataset

The evaluation dataset is used for ranking submissions. It contains data from 12 cities and 10 acoustic scenes. The devices include those present in the development-train split (A, B, C, S1–S3), as well as five new devices not included in the development set: one real device (D) and four simulated devices (S7–S10). The device ID will be provided for the known devices (A, B, C, S1–S3), while the unseen devices (D, S7–S10) will be marked with the ID “unknown.” The relative proportion of known vs. unknown devices will be aligned between the development-test and evaluation sets. The evaluation data contains approximately 37 hours of audio recorded at different locations than the development data. City information is not provided in the evaluation set.

Scene labels are provided only for the development split. Scene labels for the evaluation dataset will not be released because the evaluation set is used to rank submitted systems. For publications based on the DCASE challenge data, please use the suggested splits of the development set to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

The evaluation data is available for download:


System Complexity Limitations

The computational complexity is limited in terms of model size and MMACs (million multiply-accumulate operations). The limits are modeled after Cortex-M4 devices (e.g., STM32L496@80MHz or Arduino Nano 33@64MHz); the maximum allowed limits are as follows:

  • Maximum memory allowance for model parameters: 128 kB (Kilobyte), counting ALL parameters (including the zero-valued ones), and no requirement on the numerical representation. This allows participants better control over the tradeoff between the number of parameters and their numerical representation. The memory limit translates into 128K parameters when using int8 (signed 8-bit integer) (128,000 parameters * 8 bits per parameter / 8 bits per byte = 128 kB), 64K parameters when using float16 (16-bit float) (64,000 * 16 bits per parameter / 8 bits per byte = 128 kB), or 32K parameters when using float32 (32-bit float) (32,000 * 32 bits per parameter / 8 bits per byte = 128 kB).
  • Maximum number of MACs per inference: 30 MMACs (million MACs). The limit is approximated based on the computing power of the target device class. The analysis segment length for the inference is 1 second. The limit of 30 MMACs mimics the fitting of audio buffers into SRAM (fast-access internal memory) on the target device and allows some headroom for feature calculation (e.g., FFT), assuming that the most commonly used features fit under this limit.
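As a quick sanity check of the parameter-memory accounting in the first bullet above, the following sketch computes the memory footprint for a given parameter count and numerical representation.

def parameter_memory_kb(num_params: int, bits_per_param: int) -> float:
    """Parameter memory in kB (1 kB = 1000 B), following the accounting above."""
    return num_params * bits_per_param / 8 / 1000

print(parameter_memory_kb(128_000, 8))   # int8    -> 128.0 kB
print(parameter_memory_kb(64_000, 16))   # float16 -> 128.0 kB
print(parameter_memory_kb(32_000, 32))   # float32 -> 128.0 kB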

If a pre-trained feature extractor is used during inference (only allowed after approval as an external resource), the network used to generate the embeddings counts toward the calculated model size and complexity!

In this year's challenge, participants can use device-specific models, i.e., separate models for each recording device type in the evaluation set (A, B, C, S1–S3, "unknown"). At inference time, the model must be chosen based only on the recording device ID, and each device-specific model must obey the complexity constraints.
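A minimal sketch of such device-conditional inference is given below. The model registry is illustrative; the official inference API (see the Inference code section) defines how device IDs are actually passed to submitted systems.

import torch
import torch.nn as nn

# Placeholder networks standing in for the fine-tuned device-specific models
# and the general fallback model.
general_model = nn.Linear(64, 10)
device_models = {dev: nn.Linear(64, 10) for dev in ["a", "b", "c", "s1", "s2", "s3"]}

def predict(features: torch.Tensor, device_id: str) -> torch.Tensor:
    """Pick the model based only on the recording-device ID; devices marked
    "unknown" fall back to the general model."""
    model = device_models.get(device_id, general_model)
    with torch.no_grad():
        return model(features).softmax(dim=-1)

probs_known = predict(torch.randn(1, 64), "s2")          # device-specific model
probs_unknown = predict(torch.randn(1, 64), "unknown")   # general model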

Full information about the number of parameters, their numerical representation, and the computational complexity must be provided in the submitted technical report.


External Data Resources and Pretrained Models

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

List of external data resources allowed:

Resource name                                      Type          Added       Link
AudioSet                                           audio, video  04.03.2019  https://research.google.com/audioset/
MicIRP                                             IR            28.03.2023  http://micirp.blogspot.com/?m=1
FSD50K                                             audio         10.03.2022  https://zenodo.org/record/4060432
Multi-Angle, Multi-Distance Microphone IR Dataset  IR            17.05.2024  https://zenodo.org/records/4633508
EfficientAT                                        model         17.05.2024  https://github.com/fschmid56/EfficientAT
PretrainedSED                                      model         31.03.2025  https://github.com/fschmid56/PretrainedSED
PaSST                                              model         17.05.2024  https://github.com/kkoutini/PaSST
WAV2VEC2_XLSR_300M                                 model         08.05.2025  https://docs.pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_XLSR_300M.html
CochlScene                                         audio         17.05.2025  https://zenodo.org/records/7080122
CAS 2023                                           audio         17.05.2025  https://zenodo.org/records/10616533


Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

  • Participants may submit predictions from up to four independent systems.
  • The models' performance on the development-test set must be included in the submission.
  • The models' predictions on the evaluation set must be included in the submission.
  • A link to a repository containing the inference code must be provided in the submission.
  • The model complexity limits apply. See conditions for the model complexity here.

Rules regarding the usage of data and inference:

  • Participants can only use the audio segments specified in the 25% subset and the explicitly allowed external resources for training. For example, it is strictly forbidden to use other parts of the TAU dataset (besides the filenames specified in the split25.csv file) to train the system.
  • The previous statement includes both direct and indirect training data usage. For example, in the case of Knowledge Distillation, not only the student but also the teacher model must obey the previous statement.
  • The development-test data can only be used for validation and model selection; it must not be used for training. For example, it is strictly forbidden to use development-test data to compute normalization statistics or for domain adaptation.
  • Use of external data is allowed under some conditions; see here.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is also forbidden.
  • Classification decisions must be made independently for each test sample.


Evaluation

Participants can submit up to four systems, which will be ranked independently.

The leaderboard ranking score of individual submissions is based on their class-wise macro-averaged accuracies.

We will also calculate multi-class cross-entropy (Log loss). The metric is independent of the operating point (see Python implementation here). The multi-class cross-entropy will not be used in the official rankings.
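For development-set evaluation, both metrics can be computed with scikit-learn, for example. This is only a sketch with toy data, not the official evaluation code.

import numpy as np
from sklearn.metrics import balanced_accuracy_score, log_loss

rng = np.random.default_rng(0)
num_classes = 10

# Toy data standing in for ground-truth labels and predicted class probabilities.
y_true = rng.integers(0, num_classes, size=100)
y_prob = rng.random((100, num_classes))
y_prob /= y_prob.sum(axis=1, keepdims=True)

# Class-wise macro-averaged accuracy = mean of per-class recalls.
macro_accuracy = balanced_accuracy_score(y_true, y_prob.argmax(axis=1))

# Multi-class cross-entropy (log loss), independent of the operating point.
cross_entropy = log_loss(y_true, y_prob, labels=np.arange(num_classes))
print(macro_accuracy, cross_entropy)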


Baseline System

This repository contains the code for the baseline system of the DCASE 2025 Challenge Task 1:


New in 2025: Device-Specific Adaptation

In last year’s baseline, the training loop was designed to train a model that generalizes well across different, possibly unseen recording devices. This year, since device information is available, we introduce a second training step to train specialized models for each device in the development-train split.
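Conceptually, this second step looks roughly like the sketch below. It is a simplified illustration; the actual training loop, hyperparameters, and data handling are in the baseline repository.

import copy
import torch

def finetune_for_device(general_model, device_loader, epochs=5, lr=1e-4):
    """Fine-tune a copy of the general model on data from a single recording device."""
    model = copy.deepcopy(general_model)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, labels in device_loader:  # loader filtered to one device ID
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
    return model

# e.g. one specialized model per device in the development-train split:
# device_models = {d: finetune_for_device(general_model, loaders[d])
#                  for d in ["a", "b", "c", "s1", "s2", "s3"]}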

Some highlights:

  • The training loop is implemented using PyTorch and PyTorch Lightning.
  • Logging is implemented using Weights and Biases.
  • The neural network architecture is a simplified version of CP-Mobile, the architecture used in the top-ranked system of Task 1 in the DCASE 2023 challenge.
  • The model has 61,148 parameters and consumes 29.42 million MACs for the inference of a 1-second audio snippet. MACs are counted using torchinfo. The model's test step converts model parameters to 16-bit floats to meet the memory complexity constraint of 128 kB for model parameters.
  • The baseline implements simple data augmentation mechanisms: time rolling of the waveform and masking of frequency bins and time frames.
  • To enhance generalization across different recording devices, the baseline implements Frequency-MixStyle (sketched below).
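For illustration, here is a hedged sketch of Frequency-MixStyle: per-frequency-bin statistics of the spectrograms are mixed between randomly paired recordings in a batch. Refer to the baseline repository for the exact implementation and hyperparameters.

import torch

def freq_mixstyle(x: torch.Tensor, alpha: float = 0.3, p: float = 0.7) -> torch.Tensor:
    """Sketch for spectrograms shaped (batch, channels, freq, time)."""
    if torch.rand(1).item() > p:                           # apply stochastically during training
        return x
    b = x.size(0)
    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1, 1)).to(x.device)
    mu = x.mean(dim=(1, 3), keepdim=True)                  # statistics per frequency bin
    sig = (x.var(dim=(1, 3), keepdim=True) + 1e-6).sqrt()
    x_norm = (x - mu) / sig
    perm = torch.randperm(b, device=x.device)              # partner recordings in the batch
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix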

A detailed description of the baseline system and the differences from the previous year's implementation can be found in the repository's README.md.

Baseline Complexity

The baseline system has a complexity of 61,148 parameters and 29,419,156 MACs. The table below lists how the parameters and MACs are distributed across the different layers in the network.

According to the challenge rules, the following complexity limits apply (to each device-specific model):

  • max memory for model parameters: 128 kB (Kilobyte)
  • max number of MACs for inference of a 1-second audio snippet: 30 MMACs (million MACs)

Model parameters of the baseline must therefore be converted to 16-bit precision before inference of the test/evaluation set to stick to the complexity limits (61,148 * 16 bits = 61,148 * 2 B = 122,296 B <= 128 kB).

In previous years of the challenge, top-ranked teams used a technique called quantization that converts model parameters to 8-bit precision. In this case, the maximum number of allowed parameters would be 128,000.
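A simplified check of the float16 route described above is shown below; a stand-in model is used, and for submissions the official helper described in the Tools section should be used instead.

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())  # stand-in
model = model.half()  # convert parameters to 16-bit floats for inference

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(param_bytes, "bytes")                      # 2 bytes per parameter in float16
assert param_bytes <= 128_000, "exceeds the 128 kB parameter-memory limit"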

Description             Layer                               Input Shape        Params    MACs
in_c[0]                 Conv2dNormActivation                [1, 1, 256, 65]    88        304,144
in_c[1]                 Conv2dNormActivation                [1, 8, 128, 33]    2,368     2,506,816
stages[0].b1.block[0]   Conv2dNormActivation (pointwise)    [1, 32, 64, 17]    2,176     2,228,352
stages[0].b1.block[1]   Conv2dNormActivation (depthwise)    [1, 64, 64, 17]    704       626,816
stages[0].b1.block[2]   Conv2dNormActivation (pointwise)    [1, 64, 64, 17]    2,112     2,228,288
stages[0].b2.block[0]   Conv2dNormActivation (pointwise)    [1, 32, 64, 17]    2,176     2,228,352
stages[0].b2.block[1]   Conv2dNormActivation (depthwise)    [1, 64, 64, 17]    704       626,816
stages[0].b2.block[2]   Conv2dNormActivation (pointwise)    [1, 64, 64, 17]    2,112     2,228,288
stages[0].b3.block[0]   Conv2dNormActivation (pointwise)    [1, 32, 64, 17]    2,176     2,228,352
stages[0].b3.block[1]   Conv2dNormActivation (depthwise)    [1, 64, 64, 17]    704       331,904
stages[0].b3.block[2]   Conv2dNormActivation (pointwise)    [1, 64, 64, 9]     2,112     1,179,712
stages[1].b4.block[0]   Conv2dNormActivation (pointwise)    [1, 32, 64, 9]     2,176     1,179,776
stages[1].b4.block[1]   Conv2dNormActivation (depthwise)    [1, 64, 64, 9]     704       166,016
stages[1].b4.block[2]   Conv2dNormActivation (pointwise)    [1, 64, 32, 9]     3,696     1,032,304
stages[1].b5.block[0]   Conv2dNormActivation (pointwise)    [1, 56, 32, 9]     6,960     1,935,600
stages[1].b5.block[1]   Conv2dNormActivation (depthwise)    [1, 120, 32, 9]    1,320     311,280
stages[1].b5.block[2]   Conv2dNormActivation (pointwise)    [1, 120, 32, 9]    6,832     1,935,472
stages[2].b6.block[0]   Conv2dNormActivation (pointwise)    [1, 56, 32, 9]     6,960     1,935,600
stages[2].b6.block[1]   Conv2dNormActivation (depthwise)    [1, 120, 32, 9]    1,320     311,280
stages[2].b6.block[2]   Conv2dNormActivation (pointwise)    [1, 120, 32, 9]    12,688    3,594,448
ff_list[0]              Conv2d                              [1, 104, 32, 9]    1,040     299,520
ff_list[1]              BatchNorm2d                         [1, 10, 32, 9]     20        20
ff_list[2]              AdaptiveAvgPool2d                   [1, 10, 32, 9]     -         -
Sum                     -                                   -                  61,148    29,419,156

To give an example of how MACs and parameters are calculated, let's look in detail at the module stages[0].b3.block[1]. It consists of a conv2d, a batch norm, and a ReLU activation function.

  • Parameters: The conv2d parameters are calculated as input_channels * output_channels * kernel_size * kernel_size, resulting in 1 * 64 * 3 * 3 = 576 parameters. Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 64 * 2 = 128 parameters on top, resulting in a total of 704 parameters for this Conv2dNormActivation module.
  • MACs: The MACs of the conv2d are calculated as input_channels * output_channels * kernel_size * kernel_size * output_frequency_bands * output_time_frames, resulting in 1 * 64 * 3 * 3 * 64 * 9 = 331,776 MACs.
    Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 128 MACs on top, resulting in a total of 331,904 MACs for this Conv2dNormActivation module.
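The same numbers can be reproduced with a few lines of arithmetic mirroring the formulas above:

# Worked example for stages[0].b3.block[1]: depthwise 3x3 conv + batch norm (+ ReLU)
in_channels_per_group = 1    # depthwise convolution with 64 groups
out_channels = 64
kernel = 3
out_freq, out_time = 64, 9   # output resolution of this block

conv_params = in_channels_per_group * out_channels * kernel * kernel  # 576
bn_params = 2 * out_channels                                          # 128
print(conv_params + bn_params)                                        # 704 parameters

conv_macs = conv_params * out_freq * out_time                         # 331,776
print(conv_macs + bn_params)                                          # 331,904 MACs (batch norm adds 128)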

Baseline Results

The two tables below present the results obtained with the baseline system. General refers to the results of a single model trained on all devices in the development-train split, in line with the DCASE 2024 Task 1 baseline. Device-specific refers to the results obtained with the device-specific models. To obtain device-specific models, the general model is fine-tuned on data from specific recording devices in the development-train split. The mean and standard deviation across four independent runs are reported.

Note: The exact performance of the baseline system may not be fully reproducible due to differences in setup. However, you should be able to obtain very similar results.

Class-wise Results

The class-wise accuracies take into account only the test items belonging to the considered class.

Model            Airport  Bus    Metro  Metro Station  Park   Public Square  Shopping Mall  Street Pedestrian  Street Traffic  Tram   Macro-Avg. Accuracy
General          38.94    62.28  40.60  50.72          72.03  29.20          56.04          34.76              73.21           49.42  50.72 ± 0.47
Device-specific  44.43    64.81  43.87  48.22          72.75  32.04          53.14          34.43              74.10           51.08  51.89 ± 0.05

Device-wise Results

The device-wise accuracies take into account only the test items recorded by the specific device. As discussed here, devices S4-S6 are used only for testing and not for training the system.

Note: Results for devices S4, S5, and S6 are the same for General and Device-specific, as the general model is used for unknown devices (devices that are not in the training set).

Model            A      B      C      S1     S2     S3     S4     S5     S6     Macro-Avg. Accuracy
General          62.80  52.87  54.23  48.52  47.29  52.86  48.14  47.23  42.60  50.72 ± 0.47
Device-specific  63.98  55.85  59.09  48.68  48.74  52.72  48.14  47.23  42.60  51.89 ± 0.05


Submission

Official challenge submission consists of the following:

  • System output files for up to four systems (*.csv)
  • Metadata file(s) (*.yaml)
  • Link to inference code in Metadata file(s) (field inference_code)
  • A technical report explaining in sufficient detail the method (*.pdf)

The system output should be provided as a single text file (in CSV format, with a header row) containing a classification result for each audio file in the evaluation set. In addition, the results file should contain probabilities for each scene class. Result items can be in any order. Multiple system entries can be submitted (maximum 4 per participant).

For each system entry, it is crucial to provide meta information in a separate file containing the task-specific information. This meta information is essential for fast processing of the submissions and analysis of submitted systems. Participants are strongly advised to fill in the meta information carefully, ensuring all information is correctly provided.

All files should be packaged into a zip file for submission. Please ensure a clear connection between the system name in the submitted yaml, the submitted system output, and the technical report. Instead of a system name, you can also use a submission label. See the instructions on how to form your submission label. The submission label's index field should be used to differentiate your submissions in case you have multiple submissions.

Package structure:

task1/
├── [submission_label].technical_report.pdf
├── [submission_label]_[submission_index (1-4)]/
│   ├── [submission_label]_[submission_index (1-4)].output.csv
│   └── [submission_label]_[submission_index (1-4)].meta.yaml

System output file

The system output should have the following format for each row:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename	scene_label	airport	bus	metro	metro_station	park	public_square	shopping_mall	street_pedestrian	street_traffic	tram
0.wav	bus	0.2512	0.9931	0.1211	0.3211	0.4122	0.4233	0.2344	0.3455	0.1266	0.4577
1.wav	tram	0.2521	0.194	0.1211	0.3211	0.4122	0.4233	0.2344	0.3455	0.1266	0.8577
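A minimal sketch of writing a correctly formatted output file (tab-separated values with the header row shown above); the prediction values are placeholders.

import csv

classes = ["airport", "bus", "metro", "metro_station", "park", "public_square",
           "shopping_mall", "street_pedestrian", "street_traffic", "tram"]

# Placeholder predictions: (filename, list of 10 class probabilities)
predictions = [
    ("0.wav", [0.2512, 0.9931, 0.1211, 0.3211, 0.4122, 0.4233, 0.2344, 0.3455, 0.1266, 0.4577]),
]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filename", "scene_label"] + classes)
    for filename, probs in predictions:
        scene_label = classes[max(range(len(classes)), key=lambda i: probs[i])]
        writer.writerow([filename, scene_label] + probs)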

Metadata file

Example meta information file for the baseline system task1/Schmid_CPJKU_task1_1/Schmid_CPJKU_task1_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Schmid_CPJKU_task1_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2025 baseline system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Schmid
      firstname: Florian
      email: florian.schmid@jku.at           # Contact email address
      corresponding: true                    # Mark true for one of the authors
      # Affiliation information for the author
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)   # Optional
        location: Linz, Austria

    # Second author
    - lastname: Primus
      firstname: Paul
      email: paul.primus@jku.at   
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)  
        location: Linz, Austria      

    # Third author
    - lastname: Heittola
      firstname: Toni
      email: toni.heittola@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            
        location: Tampere, Finland
    
    # Fourth author
    - lastname: Mesaros
      firstname: Annamaria
      email: annamaria.mesaros@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            
        location: Tampere, Finland
    
    # Fifth author
    - lastname: Martín Morató
      firstname: Irene
      email: irene.martinmorato@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences           
        location: Tampere, Finland

    # Sixth author
    - lastname: Widmer
      firstname: Gerhard
      email: gerhard.widmer@jku.at
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)  
        location: Linz, Austria           

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  
  # URL to the inference code of the system [required]
  inference_code: https://github.com/CPJKU/dcase2025_task1_inference
  
  # URL to the full source code (including training) of the system [optional]
  source_code: https://github.com/CPJKU/dcase2024_task1_baseline
  
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 32kHz

    # Acoustic representation
    # one or multiple labels, e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, ...
    acoustic_features: log-mel energies

    # Data augmentation methods
    # e.g. mixup, freq-mixstyle, dir augmentation, pitch shifting, time rolling, frequency masking, time masking, frequency warping, ...
    data_augmentation: freq-mixstyle, pitch shifting, time rolling

    # Machine learning
    # e.g., (RF-regularized) CNN, RNN, CRNN, Transformer, ...
    machine_learning_method: RF-regularized CNN

    # External data usage method
    # e.g. "dataset", "embeddings", "pre-trained model", ...
    external_data_usage: !!null

    # Method for handling the complexity restrictions
    # e.g. "knowledge distillation", "pruning", "precision_16", "weight quantization", "network design", ...
    complexity_management: precision_16, network design

    # System training/processing pipeline stages
    # e.g. "train teachers", "ensemble teachers", "train general student model with knowledge distillation",
    # "device-specific end-to-end fine-tuning", "quantization-aware training"
    pipeline: train general model, device-specific end-to-end fine-tuning 

    # Machine learning framework
    # e.g. keras/tensorflow, pytorch, ...
    framework: pytorch

    # How did you exploit available device information at inference time?
    # e.g., "per-device end-to-end fine-tuning", "device-specific adapters", "device-specific normalization", ...
    device_information: "per-device end-to-end fine-tuning"
    
    # Total number of models used at inference time
    # e.g., one general model and one model for each of A, B, C, S1, S2, S3 in baseline (= 7 models)
    num_models_at_inference: 7
    
    # Degree of parameter sharing between device-specific models
    # Options: "fully shared", "partially shared", "fully device-specific"
    model_weight_sharing: "fully device-specific"

  # System complexity
  # If complexity differs across device-specific models, report values for the most complex model.
  complexity:
    # Total model size in bytes. Calculated as [parameter count]*[bit per parameter]/8
    total_model_size: 122296  # 61,148 * 16 bits = 61,148 * 2 B = 122,296 B for the baseline system

    # Total number of parameters in the most complex device-specific model
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value.
    total_parameters: 61148 

    # MACS - as calculated by torchinfo
    macs: 29419156

  # List of external datasets used in the submission.
  external_datasets:
    # Below are two examples (NOT used in the baseline system)
    #- name: EfficientAT
    #  url: https://github.com/fschmid56/EfficientAT
    #  total_audio_length: !!null
    #- name: MicIRP
    #  url: http://micirp.blogspot.com/?m=1
    #  total_audio_length: 2   # specify in minutes

# System results
results:
  development_dataset:
    # Results on the development-test set for both the general model and the device-specific models.
    #
    # - The `general` block reports results when using a single model for all devices.
    # - The `device_specific` block reports results when using a dedicated model for each known device
    #   (e.g., A, B, C, S1–S3), and falling back to the general model for unknown devices.
    #
    # Providing both results allows for evaluating the benefit of device-specific adaptation.
    # Partial results are acceptable, but full reporting is highly encouraged for comparative analysis.
    
    device_specific:
    # Results using device-specific models for known devices,
    # and the general model for unknown devices.
    
      # Overall metrics
      overall:
        logloss: !!null   # Set to !!null if not computed
        accuracy: 51.89    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:           { accuracy: 44.43, logloss: !!null }
        bus:               { accuracy: 64.81, logloss: !!null }
        metro:             { accuracy: 43.87, logloss: !!null }
        metro_station:     { accuracy: 48.22, logloss: !!null }
        park:              { accuracy: 72.75, logloss: !!null }
        public_square:     { accuracy: 32.04, logloss: !!null }
        shopping_mall:     { accuracy: 53.14, logloss: !!null }
        street_pedestrian: { accuracy: 34.43, logloss: !!null }
        street_traffic:    { accuracy: 74.10, logloss: !!null }
        tram:              { accuracy: 51.08, logloss: !!null }

      # Device-wise metrics
      device_wise:
        a:   { accuracy: 63.98, logloss: !!null }
        b:   { accuracy: 55.85, logloss: !!null }
        c:   { accuracy: 59.09, logloss: !!null }
        s1:  { accuracy: 48.68, logloss: !!null }
        s2:  { accuracy: 48.74, logloss: !!null }
        s3:  { accuracy: 52.72, logloss: !!null }
        s4:  { accuracy: 48.14, logloss: !!null }
        s5:  { accuracy: 47.23, logloss: !!null }
        s6:  { accuracy: 42.60, logloss: !!null }
    
    general: 
    # Results using the general model (used for unknown devices in section 'device-specific') for all devices
    
      # Overall metrics
      overall:
        logloss: !!null   # !!null, if you don't have the corresponding result
        accuracy: 50.72    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:           { accuracy: 38.94, logloss: !!null }
        bus:               { accuracy: 62.28, logloss: !!null }
        metro:             { accuracy: 40.60, logloss: !!null }
        metro_station:     { accuracy: 50.72, logloss: !!null }
        park:              { accuracy: 72.03, logloss: !!null }
        public_square:     { accuracy: 29.20, logloss: !!null }
        shopping_mall:     { accuracy: 56.04, logloss: !!null }
        street_pedestrian: { accuracy: 34.76, logloss: !!null }
        street_traffic:    { accuracy: 73.21, logloss: !!null }
        tram:              { accuracy: 49.42, logloss: !!null }

      # Device-wise metrics
      device_wise:
        a:   { accuracy: 62.80, logloss: !!null }
        b:   { accuracy: 52.87, logloss: !!null }
        c:   { accuracy: 54.23, logloss: !!null }
        s1:  { accuracy: 48.52, logloss: !!null }
        s2:  { accuracy: 47.29, logloss: !!null }
        s3:  { accuracy: 52.86, logloss: !!null }
        s4:  { accuracy: 48.14, logloss: !!null }
        s5:  { accuracy: 47.23, logloss: !!null }
        s6:  { accuracy: 42.60, logloss: !!null }

Package validator

There is an automatic validation tool to help challenge participants prepare a correctly formatted submission package, which in turn will speed up the submission processing in the challenge evaluation stage. Please use it to make sure your submission package follows the given formatting.


Inference code

Participants are required to submit their inference code via a link to a public repository. The submitted code must:

  • Be structured as an installable Python package
  • Implement a predefined inference API

A detailed description of the API, along with an example implementation based on the baseline system, is available here:


Note: The repository also includes code for generating prediction .csv files on the evaluation set, ready for submission.

The link to the repository must be specified in the metadata file (field inference_code).
Submitting code for training the system is optional but encouraged.

Tools

Calculating parameter memory and MACs

Please use the get_torch_macs_memory function implemented for the baseline system in file helpers/complexity.py to measure the memory required by model parameters and the MACs for the inference of a 1-second audio recording. MACs are calculated using torchinfo and model parameter memory is calculated using a custom function that takes into account the data type of model parameters.
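If you want a quick, unofficial check with torchinfo directly (which the helper builds on for MAC counting), the following sketch reads the total MACs and parameter count for a stand-in model. The input size corresponds to the 1-second baseline input shown in the complexity table above; for submissions, use the official get_torch_macs_memory helper.

import torch.nn as nn
from torchinfo import summary

model = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.BatchNorm2d(8), nn.ReLU())  # stand-in
stats = summary(model, input_size=(1, 1, 256, 65), verbose=0)

macs = stats.total_mult_adds      # multiply-accumulate operations for one forward pass
params = stats.total_params       # parameter count; memory depends on the chosen data type
print(macs, params)
assert macs <= 30_000_000, "exceeds the 30 MMACs limit"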

If you need support for Keras or TFLite, use the NeSsi tool. This tool will give you the model MACs and parameter count (the "memory" field in the NeSsi output). Based on the parameter count and the used variable type, you can calculate the memory size of the model parameters.


If you have any issues calculating the complexity of your model, please contact Florian Schmid (florian.schmid@jku.at) by email or use the DCASE community forum or Slack channel.


Citation

If you are using the audio dataset, please cite the following paper:

Publication

Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60. 2020. URL: https://arxiv.org/abs/2005.14623.


Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.




If you are participating in the task, please cite the following paper:

Publication

Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, and Gerhard Widmer. Low-complexity acoustic scene classification with device information in the DCASE 2025 Challenge. 2025. URL: https://arxiv.org/abs/2505.01747, arXiv:2505.01747.


Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge

Abstract

This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge and its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022--2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics -- reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy on this ten-class problem with a device-general model, improving to 51.89% when using the available device information.




If you are using or referencing the baseline system, please cite the following paper:

Publication

Florian Schmid, Tobias Morocutti, Shahed Masoudian, Khaled Koutini, and Gerhard Widmer. Distilling the knowledge of transformers and CNNs with CP-mobile. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), 161–165. 2023.


Distilling the Knowledge of Transformers and CNNs with CP-Mobile

Abstract

Designing lightweight models that require limited computational resources and can operate on edge devices is a major trajectory in deep learning research. In the context of Acoustic Scene Classification (ASC), the DCASE community hosts an annual challenge on low-complexity ASC, contributing to the research on Knowledge Distillation (KD), Model Pruning, Quantization and efficient neural network design. In this work, we propose a system that contributes to the latter by introducing CP-Mobile, a lightweight CNN architecture constructed of residual inverted bottleneck blocks and Global Response Normalization. Furthermore, we improve Knowledge Distillation by showing that ensembling CNNs and Audio Spectrogram Transformers form strong teacher ensembles. Our proposed system improves the results on the TAU Urban Acoustic Scenes 2022 Mobile development dataset by around 5 percentage points in accuracy compared to the top-ranked submission for Task 1 of the DCASE 22 challenge and achieves the top rank in the DCASE 23 challenge.
