Data-Efficient Low-Complexity Acoustic Scene Classification


Task description

The goal of the acoustic scene classification task is to classify recordings into one of ten predefined acoustic scene classes. This task continues the Acoustic Scene Classification tasks from previous editions of the DCASE Challenge, with a slight shift of focus. This year, the task concentrates on three challenging aspects: (1) recording device mismatch, (2) low-complexity constraints, and (3) the limited availability of labeled training data.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Acoustic scene classification systems categorize recordings into one of multiple predefined acoustic scene classes (Figure 1). The previous editions of the challenge focused on two important problems: (1) recording device mismatch and (2) low-complexity constraints. This year, we want to tackle an additional challenging real-world situation: the limited availability of labeled data. We encourage participants to design systems that maintain high prediction accuracy while using as little labeled data as possible. To this end, participants must develop systems for five increasingly challenging scenarios that progressively limit the available training data. The largest subset corresponds to the full train split used in previous editions of the challenge, and the smallest subset contains only 5% of the audio snippets in the full train set.

Figure 1: Overview of an acoustic scene classification system.


Novelties and Research Focus for 2024 Edition

Compared to Task 1 in the DCASE 2023 Challenge, the following aspects change in the 2024 edition:

  • Training sets of different sizes are provided. These train subsets contain approximately 5%, 10%, 25%, 50%, and 100% of the audio snippets in the train split provided in Task 1 of the DCASE 2023 challenge. A system must only be trained on the specified subset and the explicitly allowed external resources.
  • The final ranking takes into account the evaluation set performances of systems trained on all of the subsets. However, individual ranking lists for the subsets will be provided.
  • All submitted systems must be trained on all subsets. Test split performances and evaluation set predictions must be provided for all systems and all subsets. A maximum of three different systems (see evaluation setup) can be submitted.
  • The model complexity is not part of the ranking; instead, it is limited by hard constraints: the memory requirement for model parameters is restricted to 128 kB, and the maximum number of multiply-accumulate operations (MACs) is restricted to 30 million for the inference of a 1-second audio snippet.

This new task setup highlights the following scientific questions:

  • How does the performance of systems vary with the number of available labeled training samples?
  • What changes are necessary in a system as the number of available data points varies?
  • Can pre-training of low-complexity models on general-purpose audio datasets effectively mitigate the need for large amounts of labeled acoustic scene data?

Audio dataset

The development dataset for this task is the TAU Urban Acoustic Scenes 2022 Mobile development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set. The dataset has exactly the same content as the TAU Urban Acoustic Scenes 2020 Mobile development dataset, but the audio files have a length of 1 second (therefore, there are ten times more files than in the 2020 version).

Recordings were made using four devices that captured audio simultaneously. The primary recording device, referred to as device A, consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm, and Vienna. The dataset was collected by Tampere University of Technology between 05/2018 and 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing, see:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pages 9–13, November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.


A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data


Additionally, 11 mobile devices S1-S11 are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is processed through convolution with the selected impulse response, then processed with a selected set of parameters for dynamic range compression (device-specific). The impulse responses are proprietary data and will not be published.
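For illustration, the processing chain described above can be sketched in a few lines of Python. The real impulse responses and compressor settings are proprietary and not available, so the file names and compression parameters below are hypothetical placeholders rather than the actual simulation pipeline.

```python
# Illustrative sketch of the device-simulation chain (hypothetical IR and settings).
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

audio, sr = sf.read("device_a_recording.wav")   # clean device-A recording (placeholder name)
ir, _ = sf.read("device_ir.wav")                # impulse response of a real device (placeholder)

# 1) Convolve the device-A recording with the device impulse response.
simulated = fftconvolve(audio, ir, mode="full")[: len(audio)]

# 2) Apply a simple device-specific dynamic range compression (placeholder settings).
threshold, ratio = 0.1, 4.0
over = np.abs(simulated) > threshold
simulated[over] = np.sign(simulated[over]) * (
    threshold + (np.abs(simulated[over]) - threshold) / ratio
)

sf.write("device_sim_recording.wav", simulated, sr)
```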

The development dataset comprises 40 hours of data from device A and smaller amounts from the other devices. Audio is provided in a single-channel, 44.1 kHz, 24-bit format.

Acoustic scenes (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

Task Setup

Development Dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, and C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore, all overlap with the data from device A, but not necessarily with each other. The total amount of audio in the development set is 64 hours.

Aligned with the data splits in the 2023 edition of Task 1, the development data comes with the same predefined train/test split in which 70% of the data for each device is included for training, and 30% for testing. As described in detail in the next section, the train segments will be further split into subsets of different sizes for data-efficient evaluation of systems.

Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and train/test split are provided in the following table.

Devices, dataset sizes, and cross-validation setup:

| Name     | Type      | Total duration | Total segments | Train segments | Test segments | Notes                                           |
|----------|-----------|----------------|----------------|----------------|---------------|-------------------------------------------------|
| A        | Real      | 40h            | 144,000        | 102,150        | 3,300         | 38,550 segments not used in train/test split    |
| B        | Real      | 3h             | 10,780         | 7,490          | 3,290         |                                                 |
| C        | Real      | 3h             | 10,770         | 7,480          | 3,290         |                                                 |
| S1 S2 S3 | Simulated | 3h each        | 3 * 10,800     | 3 * 7,500      | 3 * 3,300     |                                                 |
| S4 S5 S6 | Simulated | 3h each        | 3 * 10,800     | -              | 3 * 3,300     | 3 * 7,500 segments not used in train/test split |
| Total    |           | 64h            | 230,350        | 139,620        | 29,680        | 61,050 segments not used in train/test split    |

The development data is available for download:


Subsets for Data-Efficient Evaluation

This year, participants are asked to develop up to three systems for five increasingly challenging scenarios that progressively limit the available training data. To this end, we provide five predefined subsets/splits of the development-train dataset containing 100%, 50%, 25%, 10%, and 5% of the original development-train set. The 100% subset contains all 139,620 segments of the development-train split. The smaller subsets (50%, 25%, 10%, 5%) are created such that they have the following properties:

  • Smaller splits are subsets of larger ones; e.g., all segments of the 5% split are included in the 10% split, and all samples of the 25% split are included in the 50% split. This corresponds to the idea of collecting more data from subset to subset.
  • As many different recording locations as possible are included in subsets to ensure high diversity in audio segments.
  • The distribution of acoustic scenes is kept similar to that in the 100% subset (i.e., approximately uniform).
  • The distribution of recording devices is kept similar to that in the 100% subset (i.e., as shown in the table below).
  • All 1-second audio snippets of a 10-second recording (see audio dataset) are either fully included or not included in a subset.

The table below summarizes the number of segments included in the subsets for each device. Participants are required to train their systems on each of these subsets/splits and submit corresponding predictions on the evaluation set for the final ranking.

Development-train subsets (number of train segments per device):

| Name  | Type      | 100% Subset | 50% Subset | 25% Subset | 10% Subset | 5% Subset |
|-------|-----------|-------------|------------|------------|------------|-----------|
| A     | Real      | 102,150     | 51,100     | 25,520     | 10,190     | 5,080     |
| B     | Real      | 7,490       | 3,780      | 1,900      | 730        | 380       |
| C     | Real      | 7,480       | 3,780      | 1,920      | 790        | 380       |
| S1    | Simulated | 7,500       | 3,720      | 1,840      | 740        | 380       |
| S2    | Simulated | 7,500       | 3,700      | 1,850      | 750        | 380       |
| S3    | Simulated | 7,500       | 3,720      | 1,870      | 760        | 380       |
| Total |           | 139,620     | 69,800     | 34,900     | 13,960     | 6,980     |

Figure 2 illustrates the setup for the five scenarios: Systems are trained on various proportions of the development-train split (the provided subsets), validated on the specified development-test split, which is fixed for all five scenarios, and evaluated on the evaluation set, which will be released on the 1st of June without corresponding scene labels.

The definition of the x% training subsets is provided in the split{x}.csv files. Models for the x% case should be trained on the development-train subset (green) that is specified in split{x}.csv; only the samples specified in split{x}.csv (green) and explicitly allowed external resources can be used for training the system. For example, data from the test split (yellow) must not be used to train the models before generating predictions on the evaluation data.

The development-test split (yellow) remains the same for all five scenarios and is available in test.csv; this split can only be used for validation and model selection.

The remaining samples in the development set that are not included in the development-train and the development-test splits must not be used.

Figure 2: Overview of the five development splits. Participants need to train each submitted system on all five splits.


The definition of train subsets (split{x}.csv files) and the test split (test.csv file) can be downloaded from the link below. The csv files contain two columns: filenames of audio snippets included in the respective split and the corresponding scene label.
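As a quick sanity check, the split files can be loaded and inspected with pandas. This is only a minimal sketch: the column names ("filename", "scene_label"), the separator, and the exact file names are assumptions that should be verified against the downloaded files.

```python
import pandas as pd

# Assumed file and column names; add sep="\t" if the files are tab-separated.
split5 = pd.read_csv("split5.csv")
split10 = pd.read_csv("split10.csv")
test = pd.read_csv("test.csv")

files_5 = set(split5["filename"])

# Smaller splits are nested inside larger ones (split5 ⊂ split10 ⊂ ... ⊂ split100).
assert files_5.issubset(set(split10["filename"]))

# The development-test split is fixed for all five scenarios and must not be
# used for training; it should be disjoint from every train subset.
assert files_5.isdisjoint(set(test["filename"]))

# Scene distribution of the 5% subset (approximately uniform by design).
print(split5["scene_label"].value_counts())
```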


Each team is allowed to submit up to three different systems. Each of these submitted systems has to be trained on all five development-train splits. A system is considered to be the same if architecture and design choices (such as building blocks, features, data augmentation techniques, decision-making, etc.) remain the same. However, basic hyperparameters like the number of update steps, learning rate, batch size, or regularization strength may vary for training on the different development-train splits. The technical report must highlight the differences between the three submitted systems. It must also state if and how the basic hyperparameter configuration was adapted for the different development-train splits.

For example, for a single system, it is allowed to reduce the total number of update steps when training on the 5% subset. Those changes need to be documented in the technical report. However, for instance, if the number of model parameters changes, the system counts as different. If you have doubts about which category a system modification falls into, please contact the task organizers.

Evaluation Dataset

The evaluation dataset is used for ranking submissions. It contains data from 12 cities, 10 acoustic scenes, and 11 devices. There are five new devices compared to the development set: a real device D and simulated devices S7-S11. The evaluation set contains 22 hours of audio, recorded at locations different from those in the development data. Recording device and city information is not provided for the evaluation set. The systems are expected to be robust to different recording devices.

Reference labels are provided only for the development datasets. Reference labels for the evaluation dataset will not be released because the evaluation set is used to rank submitted systems. For publications based on the DCASE challenge data, please use the provided splits of the development set to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

The evaluation data will be released on the 1st of June.

System Complexity Limitations

The computational complexity is limited in terms of model size and MMACs (million multiply-accumulate operations). The limits are modeled after Cortex-M4 devices (e.g., STM32L496@80MHz or Arduino Nano 33@64MHz); the maximum allowed limits are as follows:

  • Maximum memory allowance for model parameters: 128 kB (Kilobyte), counting ALL parameters (including the zero-valued ones), and no requirement on the numerical representation. This allows participants better control over the tradeoff between the number of parameters and their numerical representation. The memory limit translates into 128K parameters when using int8 (signed 8-bit integer) (128,000 parameters * 8 bits per parameter / 8 bits per byte = 128 kB), 64K parameters when using float16 (16-bit float) (64,000 * 16 bits per parameter / 8 bits per byte = 128 kB), or 32K parameters when using float32 (32-bit float) (32,000 * 32 bits per parameter / 8 bits per byte = 128 kB).
  • Maximum number of MACs per inference: 30 MMACs (million MACs). The limit is approximated based on the computing power of the target device class. The analysis segment length for the inference is 1 second. The limit of 30 MMACs mimics fitting audio buffers into SRAM (fast-access internal memory) on the target device and leaves some headroom for feature calculation (e.g., FFT), on the assumption that the most commonly used features fit within this budget.

If a feature/embedding extractor is used during inference (only after approval!), the network used to generate the embeddings counts toward the calculated model size and complexity!

Full information about the number of parameters, their numerical representation, and the computational complexity must be provided in the submitted technical report.
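To make the parameter-memory arithmetic above concrete, here is a minimal Python sketch (using 1 kB = 1000 B, consistent with the examples in the list above); the official complexity figures must still be computed with NeSsi.

```python
# Memory needed to store model parameters for different numerical representations.
# All parameters count, including zero-valued ones.
BYTES_PER_PARAM = {"int8": 1, "float16": 2, "float32": 4}
MEMORY_LIMIT_KB = 128

def parameter_memory_kb(num_params: int, dtype: str) -> float:
    return num_params * BYTES_PER_PARAM[dtype] / 1000

print(parameter_memory_kb(128_000, "int8"))     # 128.0 kB  -> at the limit
print(parameter_memory_kb(64_000, "float16"))   # 128.0 kB  -> at the limit
print(parameter_memory_kb(32_000, "float32"))   # 128.0 kB  -> at the limit
print(parameter_memory_kb(61_148, "float16"))   # 122.296 kB -> the baseline fits
assert parameter_memory_kb(61_148, "float16") <= MEMORY_LIMIT_KB
```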


External Data Resources

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

List of external data resources allowed:

| Resource name | Type         | Added      | Link                                  |
|---------------|--------------|------------|---------------------------------------|
| AudioSet      | audio, video | 04.03.2019 | https://research.google.com/audioset/ |
| MicIRP        | IR           | 28.03.2023 | http://micirp.blogspot.com/?m=1       |
| FSD50K        | audio        | 10.03.2022 | https://zenodo.org/record/4060432     |


Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

  • Participants may submit predictions for up to three systems.
  • Participants have to train each submitted system on all five development-train splits.
  • The models' performance on the development-test set for all five development-train splits must be included in the submission.
  • The models' predictions on the evaluation set for all five development-train splits must be included in the submission.
  • Depending on the number of submitted systems (1, 2, or 3), participants will either submit 5, 10, or 15 evaluation set prediction files. This means at least five evaluation set prediction files (one system trained on the five different development-train subsets) are necessary to be included in the final ranking.
  • A submitted system is considered the same if only basic hyperparameters like the number of update steps, learning rate, batch size, or regularization strength vary; see here.
  • The model complexity limits apply. See conditions for the model complexity here.

Rules regarding the usage of data and inference:

  • Participants can only use the audio segments specified in each subset and the explicitly allowed external resources for training. For example, it is strictly forbidden to use other parts of the development dataset (besides the filenames specified in the split{x}.csv files) to train the system.
  • The previous statement covers both direct and indirect use of training data. For example, in the case of Knowledge Distillation, not only the student but also the teacher model must obey it: participants cannot use a model trained on the 100% subset (split100.csv) as a teacher for a model trained on the 5% subset (split5.csv). This would allow the student model to indirectly access knowledge gained from the split100.csv data and is therefore strictly forbidden.
  • Use of external data is allowed under some conditions; see here.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is also forbidden.
  • Classification decisions must be made independently for each test sample.


Evaluation

Participants can submit up to three systems; however, for each training set size, \(p \in \{5, 10, 25, 50, 100 \}\), only the best-performing system will be used for ranking. This is to encourage participants to develop specialized systems for different development set sizes.

The leaderboard ranking score is computed as follows: first, the class-wise macro-averaged accuracies for all \(P=5\) development subsets and all \(N\) submissions are computed. The accuracy of the n-th submission on the p% subset is denoted as \(ACC_{n,p}\). The scores are then aggregated by choosing the best-performing system for each subset and averaging the resulting accuracies.

$$\textrm{score} := \frac{1}{P}\sum_{p \in \{5, 10, 25, 50, 100\}} \max_{n \in \{1, \dots, N\}} ACC_{n, p}$$
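In code, this aggregation could look like the following sketch (the accuracy values are purely illustrative placeholders, not real results):

```python
import numpy as np

subsets = [5, 10, 25, 50, 100]

# acc[n][p]: class-wise macro-averaged accuracy of submission n trained on the p% subset.
acc = {
    1: {5: 0.42, 10: 0.45, 25: 0.50, 50: 0.53, 100: 0.57},
    2: {5: 0.43, 10: 0.44, 25: 0.51, 50: 0.52, 100: 0.58},
}

# For each subset, pick the best submission, then average over the P = 5 subsets.
score = np.mean([max(acc[n][p] for n in acc) for p in subsets])
print(score)
```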

Besides the final leaderboard (based on the score defined above), we will also provide individual ranking lists for the five different subsets.

We will also calculate multi-class cross-entropy (Log loss). The metric is independent of the operating point (see Python implementation here). The multi-class cross-entropy will not be used in the official rankings.
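For reference, a minimal sketch of this secondary metric using scikit-learn's log_loss is shown below; the official implementation linked above may differ in details such as probability clipping.

```python
import numpy as np
from sklearn.metrics import log_loss

# Three test items and three classes for brevity (the task has ten classes).
y_true = [0, 1, 2]                      # reference class indices
y_prob = np.array([[0.7, 0.2, 0.1],     # predicted class probabilities per item
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])

print(log_loss(y_true, y_prob, labels=[0, 1, 2]))
```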


Baseline System

The baseline model is based on a simplified version of the CNN classifier used in the top-ranked approach in Task 1 of the DCASE Challenge 2023. The baseline system uses Frequency-MixStyle to cope with device generalization. No quantization is used, but inference will be done using floating-point 16-bit precision, allowing a maximum number of 64,000 parameters.

This repository contains the code for the baseline system of the DCASE 2024 Challenge Task 1:


Some details about the baseline:

  • The training loop is implemented using PyTorch and PyTorch Lightning.
  • Logging is implemented using Weights and Biases.
  • The neural network architecture is a simplified version of CP-Mobile, the architecture used in the top-ranked system of Task 1 in the DCASE 2023 challenge.
  • The model has 61,148 parameters and consumes 29.42 million MACs for the inference of a 1-second audio snippet. These numbers are counted using NeSsi. The model's test step converts model parameters to 16-bit floats to meet the memory complexity constraint of 128 kB for model parameters.
  • The baseline implements simple data augmentation mechanisms: time rolling of the waveform and masking of frequency bins and time frames.
  • To enhance the generalization across different recording devices, the baseline implements Frequency-MixStyle.

Baseline Complexity

The baseline system has a complexity of 61,148 parameters and 29,419,156 MACs. The table below lists how the parameters and MACs are distributed across the different layers in the network.

According to the challenge rules, the following complexity limits apply:

  • max memory for model parameters: 128 kB (Kilobyte)
  • max number of MACs for inference of a 1-second audio snippet: 30 MMACs (million MACs)

Model parameters of the baseline must therefore be converted to 16-bit precision before inference on the test/evaluation set in order to comply with the complexity limits (61,148 * 16 bits = 61,148 * 2 B = 122,296 B <= 128 kB).

In previous years of the challenge, top-ranked teams used a technique called quantization that converts model parameters to 8-bit precision. In this case, the maximum number of allowed parameters would be 128,000.

| Description           | Layer                            | Input Shape     | Params | MACs       |
|-----------------------|----------------------------------|-----------------|--------|------------|
| in_c[0]               | Conv2dNormActivation             | [1, 1, 256, 65] | 88     | 304,144    |
| in_c[1]               | Conv2dNormActivation             | [1, 8, 128, 33] | 2,368  | 2,506,816  |
| stages[0].b1.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176  | 2,228,352  |
| stages[0].b1.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704    | 626,816    |
| stages[0].b1.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112  | 2,228,288  |
| stages[0].b2.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176  | 2,228,352  |
| stages[0].b2.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704    | 626,816    |
| stages[0].b2.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112  | 2,228,288  |
| stages[0].b3.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176  | 2,228,352  |
| stages[0].b3.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704    | 331,904    |
| stages[0].b3.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 9]  | 2,112  | 1,179,712  |
| stages[1].b4.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 9]  | 2,176  | 1,179,776  |
| stages[1].b4.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 9]  | 704    | 166,016    |
| stages[1].b4.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 32, 9]  | 3,696  | 1,032,304  |
| stages[1].b5.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9]  | 6,960  | 1,935,600  |
| stages[1].b5.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320  | 311,280    |
| stages[1].b5.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 6,832  | 1,935,472  |
| stages[2].b6.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9]  | 6,960  | 1,935,600  |
| stages[2].b6.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320  | 311,280    |
| stages[2].b6.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 12,688 | 3,594,448  |
| ff_list[0]            | Conv2d                           | [1, 104, 32, 9] | 1,040  | 299,520    |
| ff_list[1]            | BatchNorm2d                      | [1, 10, 32, 9]  | 20     | 20         |
| ff_list[2]            | AdaptiveAvgPool2d                | [1, 10, 32, 9]  | -      | -          |
| Sum                   |                                  |                 | 61,148 | 29,419,156 |

To give an example of how MACs and parameters are calculated, let's look in detail at the module stages[0].b3.block[1]. It consists of a conv2d, a batch norm, and a ReLU activation function.

Parameters: The conv2d parameters are calculated as input_channels * output_channels * kernel_size * kernel_size, resulting in 1 * 64 * 3 * 3 = 576 parameters. Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 64 * 2 = 128 parameters on top, resulting in a total of 704 parameters for this Conv2dNormActivation module.

MACs: The MACs of the conv2d are calculated as input_channels * output_channels * kernel_size * kernel_size * output_frequency_bands * output_time_frames, resulting in 1 * 64 * 3 * 3 * 64 * 9 = 331,776 MACs.
Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 128 MACs on top, resulting in a total of 331,904 MACs for this Conv2dNormActivation module.
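The same bookkeeping can be written down as a short sketch (the counting conventions follow the explanation above; NeSsi remains the authoritative tool for the official numbers):

```python
def conv2d_params(in_ch_per_group: int, out_ch: int, k: int) -> int:
    # Parameters of a (grouped) conv2d without bias.
    return in_ch_per_group * out_ch * k * k

def conv2d_macs(in_ch_per_group: int, out_ch: int, k: int, out_f: int, out_t: int) -> int:
    # One MAC per weight per output position.
    return in_ch_per_group * out_ch * k * k * out_f * out_t

# stages[0].b3.block[1]: depthwise 3x3 conv (64 groups -> 1 input channel per group),
# followed by a batch norm counted as 128 parameters and 128 MACs.
conv_p = conv2d_params(1, 64, 3)           # 576
bn_p = 64 * 2                              # 128
print(conv_p + bn_p)                       # 704 parameters

conv_m = conv2d_macs(1, 64, 3, 64, 9)      # 331,776
bn_m = 128
print(conv_m + bn_m)                       # 331,904 MACs
```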

Baseline Results

The two tables below present the results obtained with the baseline system. The system was trained five times on each subset and tested on the development-test split, and the mean and standard deviation of the performance from these five independent trials are shown.

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Class-wise Results

The class-wise accuracies take into account only the test items belonging to the considered class.

| Subset | Airport | Bus   | Metro | Metro Station | Park  | Public Square | Shopping Mall | Street Pedestrian | Street Traffic | Tram  | Macro-Avg. Accuracy |
|--------|---------|-------|-------|---------------|-------|---------------|---------------|-------------------|----------------|-------|---------------------|
| 5%     | 34.77   | 45.21 | 30.79 | 40.03         | 62.06 | 22.28         | 52.07         | 31.32             | 70.23          | 35.20 | 42.40 ± 0.42        |
| 10%    | 38.50   | 47.99 | 36.93 | 43.71         | 65.43 | 27.05         | 52.46         | 31.82             | 72.64          | 36.41 | 45.29 ± 1.01        |
| 25%    | 41.81   | 61.19 | 38.88 | 40.84         | 69.74 | 33.54         | 58.84         | 30.31             | 75.93          | 51.77 | 50.29 ± 0.87        |
| 50%    | 41.51   | 63.23 | 43.37 | 48.71         | 72.55 | 34.25         | 60.09         | 37.26             | 79.71          | 51.16 | 53.19 ± 0.68        |
| 100%   | 46.45   | 72.95 | 52.86 | 41.56         | 76.11 | 37.07         | 66.91         | 38.73             | 80.66          | 56.58 | 56.99 ± 1.11        |

Device-wise Results

The device-wise accuracies take into account only the test items recorded by the specific device. As discussed here, devices S4-S6 are used only for testing and not for training the system.

| Subset | A     | B     | C     | S1    | S2    | S3    | S4    | S5    | S6    | Macro-Avg. Accuracy |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------------------|
| 5%     | 54.45 | 45.73 | 48.42 | 39.66 | 36.13 | 44.30 | 38.90 | 40.47 | 33.58 | 42.40 ± 0.42        |
| 10%    | 57.84 | 48.60 | 51.13 | 42.16 | 40.30 | 46.00 | 43.13 | 41.30 | 37.26 | 45.29 ± 1.01        |
| 25%    | 62.27 | 53.27 | 55.39 | 47.52 | 46.68 | 51.59 | 47.39 | 46.75 | 41.75 | 50.29 ± 0.87        |
| 50%    | 65.39 | 56.30 | 57.23 | 52.99 | 50.85 | 54.78 | 48.35 | 47.93 | 44.90 | 53.19 ± 0.68        |
| 100%   | 67.17 | 59.67 | 61.99 | 56.28 | 55.69 | 58.16 | 53.05 | 52.35 | 48.58 | 56.99 ± 1.11        |


Submission

Official challenge submission consists of:

  • System output files (*.csv)
  • Metadata file(s) (*.yaml)
  • A technical report explaining in sufficient detail the method (*.pdf)

A more detailed description of the submission format will be published together with the evaluation data.


Tools

NeSsi - Calculating the memory size of the model parameters and MACs

In PyTorch, one can use the torchinfo tool to get the number of model parameters per layer. In Keras, the built-in summary method provides the same information. Based on the parameter counts provided by these tools and the variable types used in your model, you can calculate the memory size of the model parameters (see the previous examples).

The NeSsi script calculates the MACs and the model size for Keras- and PyTorch-based models. It reports the MACs and the parameter count ("memory" field in the NeSsi output) for the provided model; based on the parameter count and the variable type used, you can calculate the memory size of the model parameters.
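A minimal torchinfo sketch for a PyTorch model is shown below; the toy model, input size, and the conversion to kilobytes are only illustrative, and attribute names may vary between torchinfo versions.

```python
import torch.nn as nn
from torchinfo import summary

# Toy stand-in for an actual submission model.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

# Input shape: (batch, channels, frequency bins, time frames) for a 1-second snippet.
stats = summary(model, input_size=(1, 1, 256, 65), verbose=0)
num_params = stats.total_params

# Convert to a memory estimate, e.g. float16 parameters take 2 bytes each.
print(num_params, "parameters ->", num_params * 2 / 1000, "kB at float16")
```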


If you have any issues using NeSsi, please contact Florian Schmid (florian.schmid@jku.at) by email or use the DCASE community forum or Slack channel.


Package validator

An automatic submission package validation tool will be made available together with the evaluation data.


Citation

If you are using the audio dataset, please cite the following paper:

Publication

Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), pages 56–60, 2020. URL: https://arxiv.org/abs/2005.14623.


Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.



If you are using or referencing the baseline system, please cite the following paper:

Publication

Florian Schmid, Tobias Morocutti, Shahed Masoudian, Khaled Koutini, and Gerhard Widmer. Distilling the Knowledge of Transformers and CNNs with CP-Mobile. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), pages 161–165, 2023.


Distilling the Knowledge of Transformers and CNNs with CP-Mobile

Abstract

Designing lightweight models that require limited computational resources and can operate on edge devices is a major trajectory in deep learning research. In the context of Acoustic Scene Classification (ASC), the DCASE community hosts an annual challenge on low-complexity ASC, contributing to the research on Knowledge Distillation (KD), Model Pruning, Quantization and efficient neural network design. In this work, we propose a system that contributes to the latter by introducing CP-Mobile, a lightweight CNN architecture constructed of residual inverted bottleneck blocks and Global Response Normalization. Furthermore, we improve Knowledge Distillation by showing that ensembling CNNs and Audio Spectrogram Transformers form strong teacher ensembles. Our proposed system improves the results on the TAU Urban Acoustic Scenes 2022 Mobile development dataset by around 5 percentage points in accuracy compared to the top-ranked submission for Task 1 of the DCASE 22 challenge and achieves the top rank in the DCASE 23 challenge.
