The goal of the acoustic scene classification task is to categorize recordings into one of ten predefined acoustic scene classes. This task builds on the Acoustic Scene Classification challenges from previous editions of the DCASE Challenge, with a slight shift in focus. This year, the task emphasizes several key challenges: (1) recording device mismatches, (2) low-complexity constraints, (3) data efficiency, and (4) the development of recording-device-specific models.
Description
Acoustic scene classification systems categorize recordings into one of multiple predefined acoustic scene classes (Figure 1). Similar to previous editions, this year's challenge focuses on two real-world problems: recording device mismatch and low-complexity constraints. Previously, recording device information was not available for audio recordings in the evaluation set, encouraging the development of systems that generalize across (possibly unseen) recording devices. However, in real-world scenarios, the device type (e.g., the specific smartphone the model is running on) may be known, and the model's performance could potentially be improved by conditioning inference on the device type. In this edition, the device ID is provided not only for the development set but also for recordings in the evaluation set; participants are allowed to use separate models for different recording devices. At the same time, participants are encouraged to develop models that generalize to unseen recording devices, since the evaluation set also includes recordings from unknown devices (i.e., devices for which no recordings are available in the development-train split, and therefore no device-specific model can be trained).

Novelties and Research Focus for 2025 Edition
Compared to Task 1 in the DCASE 2024 Challenge, the following aspects change in the 2025 edition:
- Device Information: Recording device information is now available not only for the development set but also for the evaluation set, allowing participants to fine-tune models for specific recording devices.
- Training Data: Participants are no longer required to train on all five subsets from DCASE'24 Task 1. Instead, models must be trained only on the 25% subset, encouraging data-efficient approaches such as pre-training.
- External Resources: No restrictions on external datasets—participants may use any publicly available dataset, provided they announce it to the organizers by May 18, 2025.
- Inference Code: Participants are required to submit inference code. This promotes open research and allows us to check whether the model operates within the complexity constraints of the challenge.
This new task setup highlights the following scientific questions:
- Can the device type information be exploited to improve performance compared to previous year’s systems that didn’t have access to the device type?
- Which machine learning techniques can effectively create specialized models for the different recording devices?
- Can other acoustic scene datasets (with possibly different scenes, locations, devices) be used to increase the performance on the TAU dataset?
Audio dataset
The development dataset for this task is a subset of the TAU Urban Acoustic Scenes 2022 Mobile development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, only two are present in the evaluation set. The dataset has exactly the same content as the TAU Urban Acoustic Scenes 2020 Mobile development dataset, but the audio files have a length of 1 second (therefore, there are ten times more files than in the 2020 version).
Recordings were made using four devices that captured audio simultaneously. The primary recording device, referred to as device A, consists of a Soundman OKM II Klassik/studio A3, an electret binaural microphone, and a Zoom F8 audio recorder using a 48kHz sampling rate and 24-bit resolution. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.
Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm, and Vienna. The dataset was collected by Tampere University of Technology between May and November 2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
For complete details on the data recording and processing, see:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.
Additionally, 10 mobile devices S1-S10 are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is processed through convolution with the selected impulse response, then processed with a selected set of parameters for dynamic range compression (device-specific). The impulse responses are proprietary data and will not be published.
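For illustration only, here is a minimal sketch of the two processing steps (convolution with a device impulse response, followed by dynamic range compression). The actual impulse responses and compression settings are proprietary, so the parameters below are placeholders:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_device(audio: np.ndarray, device_ir: np.ndarray,
                    threshold_db: float = -20.0, ratio: float = 4.0) -> np.ndarray:
    """Rough approximation of the device simulation pipeline (placeholder compression parameters)."""
    # 1) Convolve the device-A recording with the target device's impulse response.
    simulated = fftconvolve(audio, device_ir, mode="full")[: len(audio)]

    # 2) Apply a crude sample-wise dynamic range compression above a dB threshold.
    eps = 1e-10
    level_db = 20.0 * np.log10(np.abs(simulated) + eps)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return simulated * 10.0 ** (gain_db / 20.0)
```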
The complete development dataset comprises 40 hours of data from device A and smaller amounts from the other devices; participants are only allowed to use the subset specified in the Task Setup section below. Audio is provided in a single-channel, 44.1 kHz, 24-bit format.
Acoustic scenes (10):
- Airport (`airport`)
- Indoor shopping mall (`shopping_mall`)
- Metro station (`metro_station`)
- Pedestrian street (`street_pedestrian`)
- Public square (`public_square`)
- Street with medium level of traffic (`street_traffic`)
- Travelling by a tram (`tram`)
- Travelling by a bus (`bus`)
- Travelling by an underground metro (`metro`)
- Urban park (`park`)
Task Setup
Development Dataset
This year's development set reuses the 25% train split and the test split of Task 1 in the DCASE Challenge 2024.
It contains recordings from 10 cities and 9 devices: 3 real devices (A, B, and C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore, all overlap with the data from device A, but not necessarily with each other. The total amount of audio in this year's training set is around 18 hours.
The development data comes with a new predefined train/test split. Only the development-training data (25% split) can be used for model training. The development-test data is for validation and model selection only. Note: Recordings that are part of the TAU Urban Acoustic Scenes 2022 Mobile development dataset and not used in the development-training and development-test splits must not be used for model development.
Links to the CSV files specifying the development-train and development-test splits are given below.
Note that some devices appear only in the test subset. Complete details on the development set and train and test splits are provided in the following table.
| Device | Type | Total duration | Total segments | Train segments | Test segments |
|---|---|---|---|---|---|
| A | Real | 8h | 28,820 | 25,520 | 3,300 |
| B | Real | 1h 26m | 5,190 | 1,900 | 3,290 |
| C | Real | 1h 28m | 5,210 | 1,920 | 3,290 |
| S1 | Simulated | 1h 25m | 5,140 | 1,840 | 3,300 |
| S2 | Simulated | 1h 25m | 5,150 | 1,850 | 3,300 |
| S3 | Simulated | 1h 26m | 5,170 | 1,870 | 3,300 |
| S4, S5, S6 | Simulated | 2h 45m | 3 × 3,300 | - | 3 × 3,300 |
| Total | | 17h 56m | 64,580 | 34,900 | 29,680 |
The development data is available for download here:
The data split files are automatically downloaded when running the baseline system.
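For orientation, here is a minimal sketch of how such split files could be used to filter the development metadata with pandas; the file and column names (e.g., a `filename` column) are assumptions and should be checked against the downloaded CSVs:

```python
import pandas as pd

# File and column names below are assumptions -- verify them against the actual CSVs.
meta = pd.read_csv("TAU-urban-acoustic-scenes-2022-mobile-development/meta.csv", sep="\t")
train_split = pd.read_csv("split25.csv")  # filenames of the 25% development-train split
test_split = pd.read_csv("test.csv")      # filenames of the development-test split

train_meta = meta[meta["filename"].isin(train_split["filename"])]
test_meta = meta[meta["filename"].isin(test_split["filename"])]

# Only train_meta may be used for training; test_meta is for validation/model selection only.
print(len(train_meta), "training segments,", len(test_meta), "test segments")
```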
Evaluation Dataset
The evaluation dataset is used for ranking submissions. It contains data from 12 cities and 10 acoustic scenes. The devices include those present in the development-train split (A, B, C, S1–S3), as well as five new devices not included in the development set: one real device (D) and four simulated devices (S7–S10). The device ID will be provided for the known devices (A, B, C, S1–S3), while the unseen devices (D, S7–S10) will be marked with the ID “unknown.” The relative proportion of known vs. unknown devices will be aligned between the development-test and evaluation sets. The evaluation data contains approximately 37 hours of audio recorded at different locations than the development data. City information is not provided in the evaluation set.
Scene labels are provided only for the development split. Scene labels for the evaluation dataset will not be released because the evaluation set is used to rank submitted systems. For publications based on the DCASE challenge data, please use the suggested splits of the development set to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.
The evaluation data will be released on June 1st.
System Complexity Limitations
The computational complexity is limited in terms of model size and MMACs (million multiply-accumulate operations). The limits are modeled after Cortex-M4 devices (e.g., STM32L496@80MHz or Arduino Nano 33@64MHz); the maximum allowed limits are as follows:
- Maximum memory allowance for model parameters: 128 kB (kilobytes), counting ALL parameters (including zero-valued ones), with no requirement on the numerical representation. This gives participants control over the tradeoff between the number of parameters and their numerical representation. The memory limit translates into 128K parameters when using `int8` (signed 8-bit integer) (128,000 parameters * 8 bits per parameter / 8 bits per byte = 128 kB), 64K parameters when using `float16` (16-bit float) (64,000 parameters * 16 bits per parameter / 8 bits per byte = 128 kB), or 32K parameters when using `float32` (32-bit float) (32,000 parameters * 32 bits per parameter / 8 bits per byte = 128 kB).
- Maximum number of MACs per inference: 30 MMACs (million MACs). The limit is approximated based on the computing power of the target device class. The analysis segment length for inference is 1 second. The 30 MMACs limit mimics fitting the audio buffers into SRAM (fast-access internal memory) on the target device and leaves some headroom for feature calculation (e.g., FFT), assuming that the most commonly used features fit within it.
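To make the translation between parameter count, numerical representation, and memory concrete, here is a small sketch (not part of the official tooling) that measures a PyTorch model's parameter memory from each parameter's dtype and checks it against the 128 kB budget:

```python
import torch.nn as nn

PARAM_MEMORY_LIMIT_BYTES = 128_000  # 128 kB

def parameter_memory_bytes(model: nn.Module) -> int:
    """Memory needed to store ALL parameters, respecting each parameter's dtype."""
    return sum(p.numel() * p.element_size() for p in model.parameters())

# Toy example: converting a model to float16 halves its parameter memory.
model = nn.Sequential(nn.Conv2d(1, 16, 3), nn.BatchNorm2d(16), nn.ReLU()).half()
mem = parameter_memory_bytes(model)
print(f"{mem} B used of {PARAM_MEMORY_LIMIT_BYTES} B allowed:",
      "OK" if mem <= PARAM_MEMORY_LIMIT_BYTES else "over the limit")
```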
If a pre-trained feature extractor is used during inference (after approval!), the network used to generate the embeddings counts towards the calculated model size and complexity!
In this year’s challenge, participants can use device-specific models, i.e., separate models for each recording device type in the evaluation set (A, B, C, S1–S3, "unknown"). At inference time, the model must be chosen based only on the recording device ID, and each device-specific model must obey the complexity constraints.
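One possible way to honor this rule at inference time is sketched below (illustrative code, not the official inference interface): a dictionary of device-specific models keyed by device ID, with the general model as fallback for recordings labeled "unknown".

```python
import torch
import torch.nn as nn

# Illustrative setup: one fine-tuned model per known device plus a general fallback model.
# Every entry must individually satisfy the 128 kB / 30 MMAC complexity limits.
general_model = nn.Sequential(nn.Conv2d(1, 10, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
device_models = {device_id: general_model for device_id in ["a", "b", "c", "s1", "s2", "s3"]}

def predict(features: torch.Tensor, device_id: str) -> torch.Tensor:
    # The model may be selected based only on the provided recording device ID.
    model = device_models.get(device_id, general_model)  # "unknown" falls back to the general model
    model.eval()
    with torch.no_grad():
        return model(features).softmax(dim=-1)

probs = predict(torch.randn(1, 1, 64, 64), device_id="unknown")
```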
Full information about the number of parameters, their numerical representation, and the computational complexity must be provided in the submitted technical report.
External Data Resources and Pretrained Models
The use of external resources (data sets, pretrained models) is allowed under the following conditions:
- The task coordinators have to approve the resource. To this end, please email the task coordinators; if approved, we will update the list of external resources on the webpage accordingly. The list of allowed external resources will be locked on May 18 (no further external resources allowed after that).
- The external resource must be freely accessible to any other research group in the world. The resource must be public and freely available before May 18, 2025.
- The list of external resources used must be clearly indicated in the technical report.
- TUT Urban Acoustic Scenes 2018, TUT Urban Acoustic Scenes 2018 Mobile, TAU Urban Acoustic Scenes 2019, TAU Urban Acoustic Scenes 2019 Mobile, TAU Urban Acoustic Scenes 2020 Mobile, TAU Urban Acoustic Scenes 2020 3Class, and other TAU Urban Acoustic Scenes datasets will not be approved.
List of external data resources allowed:
Resource name | Type | Added | Link |
---|---|---|---|
AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/ |
MicIRP | IR | 28.03.2023 | http://micirp.blogspot.com/?m=1 |
FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
Multi-Angle, Multi-Distance Microphone IR Dataset | IR | 17.05.2024 | https://zenodo.org/records/4633508 |
EfficientAT | model | 17.05.2024 | https://github.com/fschmid56/EfficientAT |
PretrainedSED | model | 31.03.2025 | https://github.com/fschmid56/PretrainedSED |
PaSST | model | 17.05.2024 | https://github.com/kkoutini/PaSST |
Task Rules
There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.
Task-specific rules:
- Participants may submit predictions from up to four independent systems.
- The models' performance on the development-test set must be included in the submission.
- The models' predictions on the evaluation set must be included in the submission.
- A link to a repository containing the inference code must be provided in the submission.
- The model complexity limits apply. See conditions for the model complexity here.
Rules regarding the usage of data and inference:
- Participants can only use the audio segments specified in the 25% subset and the explicitly allowed external resources for training. For example, it is strictly forbidden to use other parts of the TAU dataset (besides the filenames specified in the `split25.csv` file) to train the system.
- The previous statement includes both direct and indirect training data usage. For example, in the case of Knowledge Distillation, not only the student but also the teacher model must obey the previous statement.
- The development-test data can only be used for validation and model selection; it must not be used for training.
- Use of external data is allowed under some conditions; see here.
- Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is also forbidden.
- Classification decisions must be made independently for each test sample.
Evaluation
Participants can submit up to four systems which will be ranked independently.
The leaderboard ranking score of individual submissions is based on their class-wise macro-averaged accuracies.
We will also calculate multi-class cross-entropy (Log loss). The metric is independent of the operating point (see Python implementation here). The multi-class cross-entropy will not be used in the official rankings.
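For reference, class-wise macro-averaged accuracy and multi-class cross-entropy can be computed with scikit-learn roughly as follows (a sketch using standard metrics, not the official evaluation script):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, log_loss

# y_true: integer class indices, y_prob: predicted class probabilities per segment (toy data).
y_true = np.array([0, 1, 2, 1])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.3, 0.5, 0.2]])

# Macro-averaged (class-wise) accuracy: the average of per-class recalls.
macro_acc = balanced_accuracy_score(y_true, y_prob.argmax(axis=1))

# Multi-class cross-entropy (log loss), independent of the operating point.
ce = log_loss(y_true, y_prob, labels=[0, 1, 2])
print(f"macro accuracy: {macro_acc:.3f}, log loss: {ce:.3f}")
```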
Baseline System
The code for the baseline system of DCASE 2025 Challenge Task 1 is provided in a dedicated repository.
New in 2025: Device-Specific Adaptation:
In last year’s baseline, the training loop was designed to train a model that generalizes well across different, possibly unseen recording devices. This year, since device information is available, we introduce a second training step to train specialized models for each device in the development-train split.
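A minimal sketch of what such a second training step could look like is given below; the data handling and hyperparameters are illustrative and not the baseline's actual implementation.

```python
import copy
import torch
from torch.utils.data import DataLoader, Subset

def finetune_for_device(general_model, train_dataset, device_ids, target_device, epochs=5):
    """Fine-tune a copy of the general model on one device's training data (illustrative)."""
    # Select only the development-train segments recorded by the target device.
    indices = [i for i, d in enumerate(device_ids) if d == target_device]
    loader = DataLoader(Subset(train_dataset, indices), batch_size=64, shuffle=True)

    model = copy.deepcopy(general_model)                      # start from the general model
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)     # reduced LR for fine-tuning
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for features, labels in loader:
            optim.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optim.step()
    return model
```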
Some highlights:
- The training loop is implemented using PyTorch and PyTorch Lightning.
- Logging is implemented using Weights and Biases.
- The neural network architecture is a simplified version of CP-Mobile, the architecture used in the top-ranked system of Task 1 in the DCASE 2023 challenge.
- The model has 61,148 parameters and consumes 29.42 million MACs for the inference of a 1-second audio snippet. MACs are counted using torchinfo. The model's test step converts model parameters to 16-bit floats to meet the memory complexity constraint of 128 kB for model parameters.
- The baseline implements simple data augmentation mechanisms: time rolling of the waveform and masking of frequency bins and time frames.
- To enhance the generalization across different recording devices, the baseline implements Frequency-MixStyle (a sketch is given below).
A detailed description of the baseline system and differences to previous year's implementation can be found in the repository's README.md.
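The Frequency-MixStyle augmentation mentioned above can be sketched as follows (an illustrative implementation; the baseline's exact formulation, e.g. the dimensions over which statistics are computed, may differ):

```python
import torch

def freq_mixstyle(x: torch.Tensor, alpha: float = 0.3, p: float = 0.7,
                  eps: float = 1e-6) -> torch.Tensor:
    """Sketch of Frequency-MixStyle for spectrograms of shape (batch, channels, freq, time)."""
    if torch.rand(1).item() > p:  # apply the augmentation only with probability p
        return x

    # Per-frequency statistics, computed over channels and time frames.
    mu = x.mean(dim=(1, 3), keepdim=True)
    sigma = (x.var(dim=(1, 3), keepdim=True) + eps).sqrt()
    x_norm = (x - mu) / sigma

    # Mix the statistics with those of a randomly permuted batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().to(x.device)
    perm = torch.randperm(x.size(0), device=x.device)
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    return x_norm * sigma_mix + mu_mix
```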
Baseline Complexity
The baseline system has a complexity of 61,148 parameters and 29,419,156 MACs. The table below lists how the parameters and MACs are distributed across the different layers in the network.
According to the challenge rules, the following complexity limits apply (to each device-specific model):
- max memory for model parameters: 128 kB (Kilobyte)
- max number of MACs for inference of a 1-second audio snippet: 30 MMACs (million MACs)
Model parameters of the baseline must therefore be converted to 16-bit precision before inference of the test/evaluation set to stick to the complexity limits (61,148 * 16 bits = 61,148 * 2 B = 122,296 B <= 128 kB).
In previous years of the challenge, top-ranked teams used a technique called quantization that converts model parameters to 8-bit precision. In this case, the maximum number of allowed parameters would be 128,000.
| Module | Layer | Input Shape | Params | MACs |
|---|---|---|---|---|
| in_c[0] | Conv2dNormActivation | [1, 1, 256, 65] | 88 | 304,144 |
| in_c[1] | Conv2dNormActivation | [1, 8, 128, 33] | 2,368 | 2,506,816 |
| stages[0].b1.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b1.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 626,816 |
| stages[0].b1.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112 | 2,228,288 |
| stages[0].b2.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b2.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 626,816 |
| stages[0].b2.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112 | 2,228,288 |
| stages[0].b3.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b3.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 331,904 |
| stages[0].b3.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 9] | 2,112 | 1,179,712 |
| stages[1].b4.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 9] | 2,176 | 1,179,776 |
| stages[1].b4.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 9] | 704 | 166,016 |
| stages[1].b4.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 32, 9] | 3,696 | 1,032,304 |
| stages[1].b5.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9] | 6,960 | 1,935,600 |
| stages[1].b5.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320 | 311,280 |
| stages[1].b5.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 6,832 | 1,935,472 |
| stages[2].b6.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9] | 6,960 | 1,935,600 |
| stages[2].b6.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320 | 311,280 |
| stages[2].b6.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 12,688 | 3,594,448 |
| ff_list[0] | Conv2d | [1, 104, 32, 9] | 1,040 | 299,520 |
| ff_list[1] | BatchNorm2d | [1, 10, 32, 9] | 20 | 20 |
| ff_list[2] | AdaptiveAvgPool2d | [1, 10, 32, 9] | - | - |
| Sum | - | - | 61,148 | 29,419,156 |
To give an example of how MACs and parameters are calculated, let's look in detail at the module stages[0].b3.block[1]. It consists of a conv2d, a batch norm, and a ReLU activation function.
- Parameters: The conv2d parameters are calculated as input_channels * output_channels * kernel_size * kernel_size, resulting in 1 * 64 * 3 * 3 = 576 parameters. Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 64 * 2 = 128 parameters on top, resulting in a total of 704 parameters for this Conv2dNormActivation module.
- MACs: The MACs of the conv2d are calculated as input_channels * output_channels * kernel_size * kernel_size * output_frequency_bands * output_time_frames, resulting in 1 * 64 * 3 * 3 * 64 * 9 = 331,776 MACs. Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 128 MACs on top, resulting in a total of 331,904 MACs for this Conv2dNormActivation module.
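These numbers can also be reproduced programmatically. The sketch below rebuilds a module of this shape (depthwise 3×3 convolution with 64 groups, followed by batch norm and ReLU; the stride is assumed to be (1, 2) to match the reported output shape) and counts its parameters and MACs:

```python
import torch.nn as nn

# Depthwise 3x3 convolution: 64 input channels, 64 output channels, 64 groups.
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=(1, 2), padding=1, groups=64, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

conv_params = sum(p.numel() for p in block[0].parameters())  # 1 * 64 * 3 * 3 = 576
bn_params = sum(p.numel() for p in block[1].parameters())    # 64 * 2 = 128 (weight + bias)
print(conv_params + bn_params)                                # 704 parameters

# MACs for the convolution, using the formula from the text:
# input_channels * output_channels * kernel_size^2 * output_frequency_bands * output_time_frames
conv_macs = 1 * 64 * 3 * 3 * 64 * 9                           # 331,776
print(conv_macs + 128)                                        # 331,904 MACs incl. batch norm
```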
Baseline Results
The two tables below present the results obtained with the baseline system. General refers to the results of a single model trained on all devices in the development-train split, aligned with the DCASE 2024 Task 1 baseline. Device-specific refers to the results using the device-specific models. To obtain device-specific models, the general model is fine-tuned on data from specific recording devices in the development-train split. The mean and standard deviation across four independent runs are reported.
Note: The exact performance of the baseline system may not be fully reproducible due to differences in setup. However, you should be able to obtain very similar results.
Class-wise Results
The class-wise accuracies take into account only the test items belonging to the considered class.
| Model | Airport | Bus | Metro | Metro Station | Park | Public Square | Shopping Mall | Street Pedestrian | Street Traffic | Tram | Macro-Avg. Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| General | 38.94 | 62.28 | 40.60 | 50.72 | 72.03 | 29.20 | 56.04 | 34.76 | 73.21 | 49.42 | 50.72 ± 0.47 |
| Device-specific | 44.43 | 64.81 | 43.87 | 48.22 | 72.75 | 32.04 | 53.14 | 34.43 | 74.10 | 51.08 | 51.89 ± 0.05 |
Device-wise Results
The device-wise accuracies take into account only the test items recorded by the specific device. As discussed here, devices S4-S6 are used only for testing and not for training the system.
Note: Results for devices S4, S5, and S6 are the same for General and Device-specific, as the general model is used for unknown devices (devices that are not in the training set).
| Model | A | B | C | S1 | S2 | S3 | S4 | S5 | S6 | Macro-Avg. Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| General | 62.80 | 52.87 | 54.23 | 48.52 | 47.29 | 52.86 | 48.14 | 47.23 | 42.60 | 50.72 ± 0.47 |
| Device-specific | 63.98 | 55.85 | 59.09 | 48.68 | 48.74 | 52.72 | 48.14 | 47.23 | 42.60 | 51.89 ± 0.05 |
Submission
Official challenge submission consists of the following:
- System output files for up to four systems (`*.csv`)
- Metadata file(s) including a link to the inference code (`*.yaml`)
- A technical report explaining the method in sufficient detail (`*.pdf`)
The system output should be provided as a single text file (in CSV format, with a header row) containing a classification result for each audio file in the evaluation set. In addition, the results file should contain probabilities for each scene class. Result items can be in any order. Multiple system entries can be submitted (maximum four per participant).
For each system entry, it is crucial to provide meta information in a separate file containing the task-specific information. This meta information is essential for fast processing of the submissions and analysis of submitted systems. Participants are strongly advised to fill in the meta information carefully, ensuring all information is correctly provided.
All files should be packaged into a zip file for submission. Please ensure a clear connection between the system name in the submitted yaml, the submitted system output, and the technical report! Instead of a system name, you can also use a submission label. See instructions on how to form your submission label. The submission label's index field should be used to differentiate your submissions in case you have multiple submissions.
Package structure:
[submission_label]/[submission_label].output.csv
[submission_label]/[submission_label].meta.yaml
[submission_label].technical_report.pdf
System output file
The system output should have the following format for each row:
[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]
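A minimal sketch of writing such a file with Python's csv module is given below; the header column names are illustrative and should be checked against the package validator once it is released:

```python
import csv

scene_labels = ["airport", "bus", "metro", "metro_station", "park", "public_square",
                "shopping_mall", "street_pedestrian", "street_traffic", "tram"]

# predictions: list of (filename, probability vector over the 10 classes); illustrative data.
predictions = [("audio/eval_0001.wav",
                [0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.03, 0.04, 0.03])]

with open("submission.output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filename", "scene_label"] + scene_labels)  # header row (names assumed)
    for filename, probs in predictions:
        predicted = scene_labels[probs.index(max(probs))]
        writer.writerow([filename, predicted] + [f"{p:.4f}" for p in probs])
```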
Metadata file
The exact structure of the Metadata file will be released on June 1st.
Package validator
There is an automatic validation tool to help challenge participants to prepare a correctly formatted submission package, which in turn will speed up the submission processing in the challenge evaluation stage. Please use it to make sure your submission package follows the given formatting.
The package validator will be released on June 1st.
Inference code
Inference code will be submitted via a link to a repository containing an installable Python Package.
An example inference package for the Baseline will be released on June 1st.
Tools
Calculating parameter memory and MACs
Please use the `get_torch_macs_memory` function implemented for the baseline system in the file `helpers/complexity.py` to measure the memory required by model parameters and the MACs for the inference of a 1-second audio recording. MACs are calculated using `torchinfo`, and model parameter memory is calculated using a custom function that takes into account the data type of the model parameters.
If you need support for Keras or TFLite, use the NeSsi tool. This tool will give the model MACs and parameter count (the "memory" field in the NeSsi output). Based on the parameter count and the variable type used, you can calculate the memory size of the model parameters.
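As a rough illustration of such a measurement with `torchinfo` (this is a sketch, not the baseline's `get_torch_macs_memory` helper, whose exact interface may differ):

```python
import torch.nn as nn
from torchinfo import summary

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

# MACs for the inference of a single input; the input shape here is illustrative.
stats = summary(model, input_size=(1, 1, 256, 65), verbose=0)
print(f"MACs: {stats.total_mult_adds:,}  |  parameters: {stats.total_params:,}")

# Parameter memory depends on the numerical representation (e.g., float16 -> 2 bytes/parameter).
param_memory_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameter memory: {param_memory_bytes:,} B (limit: 128,000 B)")
```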
If you have any issues calculating the complexity of your model, please contact Florian Schmid (florian.schmid@jku.at) by email or use the DCASE community forum or Slack channel.
Citation
If you are using the audio dataset, please cite the following paper:
Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in dcase 2020 challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60. 2020. URL: https://arxiv.org/abs/2005.14623.
If you are participating in the task, please cite the following paper:
Task Paper will be available around May 1st.
If you are using or referencing the baseline system, please cite the following paper:
Florian Schmid, Tobias Morocutti, Shahed Masoudian, Khaled Koutini, and Gerhard Widmer. Distilling the knowledge of transformers and CNNs with CP-mobile. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), 161–165. 2023.