Data-Efficient Low-Complexity Acoustic Scene Classification


Task description

The goal of the acoustic scene classification task is to classify recordings into one of the ten predefined acoustic scene classes. This task continues the Acoustic Scene Classification tasks from previous editions of the DCASE Challenge, with a slight shift of focus. This year, the task concentrates on three challenging aspects: (1) a recording device mismatch, (2) low-complexity constraints, and (3) the limited availability of labeled data for training.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Acoustic scene classification systems categorize recordings into one of multiple predefined acoustic scene classes (Figure 1). The previous editions of the challenge focused on two important problems: (1) recording device mismatch and (2) low-complexity constraints. This year, we want to tackle an additional challenging real-world situation: the limited availability of labeled data. We encourage participants to design systems that maintain high prediction accuracy while using as little labeled data as possible. To this end, participants must develop systems for five increasingly challenging scenarios that progressively limit the available training data. The largest subset corresponds to the full train split used in previous editions of the challenge, while the smallest subset contains only 5% of the audio snippets in the full train set.

Figure 1: Overview of acoustic scene classification system.


Novelties and Research Focus for 2024 Edition

Compared to Task 1 in the DCASE 2023 Challenge, the following aspects change in the 2024 edition:

  • Training sets of different sizes are provided. These train subsets contain approximately 5%, 10%, 25%, 50%, and 100% of the audio snippets in the train split provided in Task 1 of the DCASE 2023 challenge. A system must only be trained on the specified subset and the explicitly allowed external resources.
  • The final ranking takes into account the evaluation set performances of systems trained on all of the subsets. However, individual ranking lists for the subsets will be provided.
  • All submitted systems must be trained on all subsets. Test split performances and evaluation set predictions must be provided for all systems and all subsets. A maximum of three different systems (see evaluation setup) can be submitted.
  • The model complexity is not part of the ranking; instead, it is limited by hard constraints: the memory allowance for model parameters is restricted to 128 kB, and the maximum number of multiply-accumulate operations (MACs) for the inference of a 1-second audio snippet is restricted to 30 million.

This new task setup highlights the following scientific questions:

  • How does the performance of systems vary with the number of available labeled training samples?
  • What changes are necessary in a system as the number of available data points varies?
  • Can pre-training of low-complexity models on general-purpose audio datasets effectively mitigate the need for large amounts of labeled acoustic scene data?

Audio dataset

The development dataset for this task is the TAU Urban Acoustic Scenes 2022 Mobile development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set. The dataset has exactly the same content as the TAU Urban Acoustic Scenes 2020 Mobile development dataset, but the audio files have a length of 1 second (therefore, there are ten times more files than in the 2020 version).

Recordings were made using four devices that captured audio simultaneously. The primary recording device, referred to as device A, consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm, and Vienna. The dataset was collected by Tampere University of Technology between 05/2018 and 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing, see:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.


A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data


Additionally, 10 mobile devices S1-S10 are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is processed through convolution with the selected impulse response, then processed with a selected set of parameters for dynamic range compression (device-specific). The impulse responses are proprietary data and will not be published.
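
As an illustration of the kind of processing described above, the sketch below convolves a device-A recording with a device impulse response and applies a crude dynamic range compression; the file names are hypothetical, and the real impulse responses and compression settings are proprietary.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

audio, sr = sf.read("device_a_recording.wav")        # clean device-A recording (assumed mono)
ir, _ = sf.read("hypothetical_device_ir.wav")        # impulse response of a consumer device

# Convolve the recording with the device impulse response.
simulated = fftconvolve(audio, ir, mode="full")[: len(audio)]

# Crude stand-in for device-specific dynamic range compression:
# attenuate the signal above a threshold (real compressors use attack/release smoothing).
threshold, ratio = 0.1, 4.0
over = np.abs(simulated) > threshold
simulated[over] = np.sign(simulated[over]) * (threshold + (np.abs(simulated[over]) - threshold) / ratio)

sf.write("simulated_device_recording.wav", simulated, sr)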

The development dataset comprises 40 hours of data from device A and smaller amounts from the other devices. Audio is provided in a single-channel, 44.1 kHz, 24-bit format.

Acoustic scenes (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

Download


Task Setup

Development Dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, and C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore, all overlap with the data from device A, but not necessarily with each other. The total amount of audio in the development set is 64 hours.

Aligned with the data splits in the 2023 edition of Task 1, the development data comes with the same predefined train/test split in which 70% of the data for each device is included for training, and 30% for testing. As described in detail in the next section, the train segments will be further split into subsets of different sizes for data-efficient evaluation of systems.

Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and train/test split are provided in the following table.

Development dataset and cross-validation setup:

| Name | Type | Total duration | Total segments | Train segments | Test segments | Notes |
| ---- | ---- | -------------- | -------------- | -------------- | ------------- | ----- |
| A | Real | 40h | 144,000 | 102,150 | 3,300 | 38,550 segments not used in train/test split |
| B | Real | 3h | 10,780 | 7,490 | 3,290 | |
| C | Real | 3h | 10,770 | 7,480 | 3,290 | |
| S1, S2, S3 | Simulated | 3h each | 3 * 10,800 | 3 * 7,500 | 3 * 3,300 | |
| S4, S5, S6 | Simulated | 3h each | 3 * 10,800 | - | 3 * 3,300 | 3 * 7,500 segments not used in train/test split |
| Total | | 64h | 230,350 | 139,620 | 29,680 | 61,050 segments not used in train/test split |

The development data is available for download:


Subsets for Data-Efficient Evaluation

This year, participants are supposed to develop up to three systems for five increasingly challenging scenarios that progressively limit the available training data. To this end, we provide five pre-defined subsets/splits of the development-train dataset that are 100%, 50%, 25%, 10%, and 5% of the original development-train set's size. The 100% subset contains all segments (139,620) of the development-train split. Smaller subsets (50%, 25%, 10%, 5%) are created in such a way that they have the following properties:

  • Smaller splits are subsets of larger ones; e.g., all segments of the 5% split are included in the 10% split, and all samples of the 25% split are included in the 50% split. This corresponds to the idea of collecting more data from subset to subset.
  • As many different recording locations as possible are included in subsets to ensure high diversity in audio segments.
  • The distribution of acoustic scenes is kept similar to that in the 100% subset (i.e., approximately uniform).
  • The distribution of recording devices is kept similar to that in the 100% subset (i.e., as shown in the table below).
  • All 1-second audio snippets of a 10-second recording (see audio dataset) are either fully included or not included in a subset.

The table below summarizes the number of segments included in the subsets for each device. Participants are required to train their systems on each of these subsets/splits and submit corresponding predictions on the evaluation set for the final ranking.

Development-train subsets:

| Name | Type | 100% subset (train segments) | 50% subset | 25% subset | 10% subset | 5% subset |
| ---- | ---- | ---------------------------- | ---------- | ---------- | ---------- | --------- |
| A | Real | 102,150 | 51,100 | 25,520 | 10,190 | 5,080 |
| B | Real | 7,490 | 3,780 | 1,900 | 730 | 380 |
| C | Real | 7,480 | 3,780 | 1,920 | 790 | 380 |
| S1 | Simulated | 7,500 | 3,720 | 1,840 | 740 | 380 |
| S2 | Simulated | 7,500 | 3,700 | 1,850 | 750 | 380 |
| S3 | Simulated | 7,500 | 3,720 | 1,870 | 760 | 380 |
| Total | | 139,620 | 69,800 | 34,900 | 13,960 | 6,980 |

Figure 2 illustrates the setup for the five scenarios: Systems are trained on various proportions of the development-train split (the provided subsets), validated on the specified development-test split, which is fixed for all five scenarios, and evaluated on the evaluation set, which will be released on the 1st of June without corresponding scene labels.

The definition of the x% training subsets is provided in the split{x}.csv files. Models for the x% case should be trained on the development-train subset (green) that is specified in split{x}.csv; only the samples specified in split{x}.csv (green) and explicitly allowed external resources can be used for training the system. For example, data from the test split (yellow) must not be used to train the models before generating predictions on the evaluation data.

The development-test split (yellow) remains the same for all five scenarios and is available in test.csv; this split can only be used for validation and model selection.

The remaining samples in the development set that are not included in the development-train and the development-test splits must not be used.

Figure 2: Overview of the five development splits. Participants need to train each submitted system on all five splits.


The definition of train subsets (split{x}.csv files) and the test split (test.csv file) can be downloaded from the link below. The csv files contain two columns: filenames of audio snippets included in the respective split and the corresponding scene label.
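
For illustration, here is a minimal sketch of how a split file could be used to select the allowed training data; the directory layout, separator, and column names are assumptions and may need to be adapted to the actual files.

import pandas as pd

# Hypothetical paths; adjust to wherever the dataset and split files are stored.
SPLIT_CSV = "split5.csv"   # definition of the 5% training subset
AUDIO_DIR = "TAU-urban-acoustic-scenes-2022-mobile-development/"

# The split files contain the audio filenames and the corresponding scene labels.
split = pd.read_csv(SPLIT_CSV, sep="\t")
train_files = [AUDIO_DIR + fn for fn in split["filename"]]
train_labels = split["scene_label"].tolist()

print(f"{len(train_files)} training snippets, {split['scene_label'].nunique()} scene classes")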


Each team is allowed to submit up to three different systems. Each of these submitted systems has to be trained on all five development-train splits. A system is considered to be the same if architecture and design choices (such as building blocks, features, data augmentation techniques, decision-making, etc.) remain the same. However, basic hyperparameters like the number of update steps, learning rate, batch size, or regularization strength may vary for training on the different development-train splits. The technical report must highlight the differences between the three submitted systems. It must also state if and how the basic hyperparameter configuration was adapted for the different development-train splits.

For example, for a single system, it is allowed to reduce the total number of update steps when training on the 5% subset. Those changes need to be documented in the technical report. However, for instance, if the number of model parameters changes, the system counts as different. If you have doubts about which category a system modification falls into, please contact the task organizers.

Evaluation Dataset

The evaluation dataset is used for ranking submissions. It contains data from 12 cities, 10 acoustic scenes, and 11 devices. There are five new devices compared to the development set: a real device D and simulated devices S7-S10. The evaluation dataset contains approximately 37 hours of audio, recorded at locations different from those in the development data. Recording device and city information is not provided in the evaluation set. The systems are expected to be robust to different recording devices.

Reference labels are provided only for the development datasets. Reference labels for the evaluation dataset will not be released because the evaluation set is used to rank submitted systems. For publications based on the DCASE challenge data, please use the provided splits of the development set to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

The evaluation data is available for download:


System Complexity Limitations

The computational complexity is limited in terms of model size and MMACs (million multiply-accumulate operations). The limits are modeled after Cortex-M4 devices (e.g., STM32L496@80MHz or Arduino Nano 33@64MHz); the maximum allowed limits are as follows:

  • Maximum memory allowance for model parameters: 128 kB (Kilobyte), counting ALL parameters (including the zero-valued ones), and no requirement on the numerical representation. This allows participants better control over the tradeoff between the number of parameters and their numerical representation. The memory limit translates into 128K parameters when using int8 (signed 8-bit integer) (128,000 parameters * 8 bits per parameter / 8 bits per byte = 128 kB), 64K parameters when using float16 (16-bit float) (64,000 * 16 bits per parameter / 8 bits per byte = 128 kB), or 32K parameters when using float32 (32-bit float) (32,000 * 32 bits per parameter / 8 bits per byte = 128 kB).
  • Maximum number of MACs per inference: 30 MMACs (million MACs). The limit is approximated based on the computing power of the target device class. The analysis segment length for the inference is 1 second. The limit of 30 MMACs mimics the fitting of audio buffers into SRAM (fast-access internal memory) on the target device and leaves some headroom for feature calculation (e.g., FFT), assuming that the most commonly used features fit within this budget.

If a feature extractor is used during inference (only after approval!), the network used to generate the embeddings counts towards the calculated model size and complexity!

Full information about the number of parameters, their numerical representation, and the computational complexity must be provided in the submitted technical report.
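
To make the memory constraint concrete, the check can be sketched as follows; the helper function is ours and not part of any official tool.

def parameter_memory_kb(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the model parameters, in kB (1 kB = 1000 B)."""
    return num_params * bits_per_param / 8 / 1000

# Baseline: 61,148 parameters stored as 16-bit floats
print(parameter_memory_kb(61_148, 16))          # 122.296 kB -> within the 128 kB limit

# Maximum parameter counts implied by the 128 kB limit for common precisions
for bits in (8, 16, 32):
    print(bits, "bits ->", int(128_000 * 8 / bits), "parameters")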


External Data Resources

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

List of external data resources allowed:

| Resource name | Type | Added | Link |
| ------------- | ---- | ----- | ---- |
| AudioSet | audio, video | 04.03.2019 | https://research.google.com/audioset/ |
| MicIRP | IR | 28.03.2023 | http://micirp.blogspot.com/?m=1 |
| FSD50K | audio | 10.03.2022 | https://zenodo.org/record/4060432 |
| Multi-Angle, Multi-Distance Microphone IR Dataset | IR | 17.05.2024 | https://zenodo.org/records/4633508 |
| EfficientAT | model | 17.05.2024 | https://github.com/fschmid56/EfficientAT |
| PaSST | model | 17.05.2024 | https://github.com/kkoutini/PaSST |


Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

  • Participants may submit predictions for up to three systems.
  • Participants have to train each submitted system on all five development-train splits.
  • The models' performance on the development-test set for all five development-train splits must be included in the submission.
  • The models' predictions on the evaluation set for all five development-train splits must be included in the submission.
  • Depending on the number of submitted systems (1, 2, or 3), participants will either submit 5, 10, or 15 evaluation set prediction files. This means at least five evaluation set prediction files (one system trained on the five different development-train subsets) are necessary to be included in the final ranking.
  • A submitted system is considered the same if only basic hyperparameters like the number of update steps, learning rate, batch size, or regularization strength vary; see here.
  • The model complexity limits apply. See conditions for the model complexity here.

Rules regarding the usage of data and inference:

  • Participants can only use the audio segments specified in each subset and the explicitly allowed external resources for training. For example, it is strictly forbidden to use other parts of the development dataset (besides the filenames specified in the split{x}.csv files) to train the system.
  • The previous statement includes both direct and indirect training data usage. For example, in the case of Knowledge Distillation, not only the student but also the teacher model must obey the previous statement. That is, participants cannot use the model trained on split{100}.csv as a teacher for the model that is trained on split{5}.csv. This would allow the student model to indirectly access knowledge gained from the split{100}.csv data and is therefore strictly forbidden.
  • Use of external data is allowed under some conditions; see here.
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is also forbidden.
  • Classification decisions must be made independently for each test sample.


Evaluation

Participants can submit up to three systems; however, for each training set size, \(p \in \{5, 10, 25, 50, 100 \}\), only the best-performing system will be used for ranking. This is to encourage participants to develop specialized systems for different development set sizes.

The leaderboard ranking score is computed as follows: first, the class-wise macro-averaged accuracies for all \(P=5\) development subsets and all \(N\) submissions are computed. The accuracy of the n-th submission on the p% subset is denoted as \(ACC_{n,p}\). The scores are then aggregated by choosing the best-performing system for each subset and averaging the resulting accuracies.

$$\textrm{score} := \frac{1}{P}\sum_{p \in \{5, 10, 25, 50, 100\}} \max_{n \in \{1, \dots, N\}} ACC_{n, p}$$
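
For clarity, the aggregation can be sketched in a few lines; the accuracy values below are made up for illustration.

import numpy as np

# acc[n, p]: macro-averaged accuracy of submission n on the p-th subset
# rows: a team's N submitted systems; columns: the 5%, 10%, 25%, 50%, 100% subsets
acc = np.array([
    [42.4, 45.3, 50.3, 53.2, 57.0],   # system 1 (illustrative numbers)
    [43.1, 44.8, 49.9, 54.0, 56.5],   # system 2
])

score = acc.max(axis=0).mean()        # best system per subset, averaged over the P = 5 subsets
print(score)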

Besides the final leaderboard (based on the score defined above), we will also provide individual ranking lists for the five different subsets.

We will also calculate multi-class cross-entropy (Log loss). The metric is independent of the operating point (see Python implementation here). The multi-class cross-entropy will not be used in the official rankings.
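
A minimal sketch of how this metric could be computed with scikit-learn (the labels and probabilities below are made up; the official implementation may differ):

from sklearn.metrics import log_loss

y_true = ["bus", "tram", "park"]              # reference scene labels
y_prob = [[0.7, 0.2, 0.1],                    # per-class probabilities predicted by the system
          [0.1, 0.8, 0.1],
          [0.2, 0.2, 0.6]]
print(log_loss(y_true, y_prob, labels=["bus", "tram", "park"]))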


Results

| Rank | Submission code | Author | Affiliation | Technical Report | Score |
| ---- | --------------- | ------ | ----------- | ---------------- | ----- |
| 18 | Auzanneau_CEA | Fabrice Auzanneau | CEA List, Palaiseau, France | task-data-efficient-low-complexity-acoustic-scene-classification-results#Auzanneau2024 | 37.38 |
| 14 | BAI_JLESS | Jisheng Bai | School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Bai2024 | 52.39 |
| 4 | Cai_XJTLU | Yiqiang Cai | School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Cai2024 | 56.35 |
| 15 | Chen_GXU | Xuanyan Chen | School of Computer and Electronic Information, Guangxi University (GXU), Guangxi, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Chen2024 | 52.12 |
| 12 | Chen_SCUT | Guoqing Chen | School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Chen2024a | 52.74 |
| 11 | Gao_UniSA | Wei Gao | UniSA STEM, University of South Australia, Adelaide, Australia | task-data-efficient-low-complexity-acoustic-scene-classification-results#Gao2024 | 52.82 |
| 1 | Han_SJTUTHU | Han Bing | Shanghai Jiao Tong University, Shanghai, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Bing2024 | 58.46 |
| 2 | MALACH24_JKU | Nadrchal David | Johannes Kepler University (JKU) Linz, Linz, Austria | task-data-efficient-low-complexity-acoustic-scene-classification-results#David2024 | 57.19 |
| 7 | OO_NTUPRDCSG | Yifei Oo | Nanyang Technological University, Singapore | task-data-efficient-low-complexity-acoustic-scene-classification-results#Oo2024 | 54.83 |
| 6 | Park_KT | JaeHan Park | AI Tech Lab, Korea Telecom Corporation, Seoul, South Korea | task-data-efficient-low-complexity-acoustic-scene-classification-results#Park2024 | 55.40 |
| 17 | DCASE2024 baseline | Florian Schmid | Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria | task-data-efficient-low-complexity-acoustic-scene-classification-results#Schmid2024 | 50.73 |
| 3 | Shao_NEPUMSE | Yun-Fei Shao | School of Mechanical Science and Engineering, Northeast Petroleum University, Daqing, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Shao2024 | 57.15 |
| 13 | Surkov_ITMO | Maxim Surkov | ITMO University, Saint Petersburg, Russia | task-data-efficient-low-complexity-acoustic-scene-classification-results#Surkov2024 | 52.65 |
| 16 | Tan_CISS | Ee-Leng Tan | EEE, Nanyang Technological University, Singapore | task-data-efficient-low-complexity-acoustic-scene-classification-results#Tan2024 | 51.71 |
| 9 | Truchan_LUH | Hubert Truchan | L3S Research Center, Leibniz University Hannover, Hanover, Germany | task-data-efficient-low-complexity-acoustic-scene-classification-results#Truchan2024 | 53.52 |
| 8 | Werning_UPBNT | Alexander Werning | Department of Communications Engineering (NT), Paderborn University (UPB), Paderborn, Germany | task-data-efficient-low-complexity-acoustic-scene-classification-results#Werning2024 | 54.35 |
| 10 | Yan_NPU | Chenhong Yan | School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China | task-data-efficient-low-complexity-acoustic-scene-classification-results#Yan2024 | 52.94 |
| 5 | Yeo_NTU | Sean Yeo | Smart Nation TRANS Lab (SNTL), Nanyang Technological University, Singapore | task-data-efficient-low-complexity-acoustic-scene-classification-results#Yeo2024 | 56.12 |


Complete results and technical reports can be found on the results page.

Baseline System

The baseline model is based on a simplified version of the CNN classifier used in the top-ranked approach in Task 1 of the DCASE 2023 Challenge. The baseline system uses Frequency-MixStyle to cope with device generalization. No quantization is used; instead, inference is performed in 16-bit floating-point precision, which allows a maximum of 64,000 parameters under the 128 kB limit.

This repository contains the code for the baseline system of the DCASE 2024 Challenge Task 1:


Some details about the baseline:

  • The training loop is implemented using PyTorch and PyTorch Lightning.
  • Logging is implemented using Weights and Biases.
  • The neural network architecture is a simplified version of CP-Mobile, the architecture used in the top-ranked system of Task 1 in the DCASE 2023 challenge.
  • The model has 61,148 parameters and requires 29.42 million MACs for the inference of a 1-second audio snippet; these numbers are computed using NeSsi. In the model's test step, the parameters are converted to 16-bit floats to meet the memory complexity constraint of 128 kB for model parameters.
  • The baseline implements simple data augmentation mechanisms: time rolling of the waveform and masking of frequency bins and time frames.
  • To enhance the generalization across different recording devices, the baseline implements Frequency-MixStyle.
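
To make the device-generalization idea concrete, below is a minimal sketch of Frequency-MixStyle as it is commonly implemented; the hyperparameters alpha and p are illustrative, and this is not the exact baseline code. The idea is to mix frequency-wise spectrogram statistics between samples in a batch so that the network becomes less sensitive to device-specific frequency responses.

import torch

def freq_mixstyle(x: torch.Tensor, alpha: float = 0.3, p: float = 0.7, training: bool = True) -> torch.Tensor:
    # x: batch of (log-mel) spectrograms with shape (batch, channels, frequency, time)
    if not training or x.size(0) < 2 or torch.rand(1).item() > p:
        return x                                      # apply only during training, with probability p
    B = x.size(0)
    mu = x.mean(dim=(1, 3), keepdim=True)             # per-frequency mean, shape (B, 1, F, 1)
    sig = x.std(dim=(1, 3), keepdim=True) + 1e-6      # per-frequency standard deviation
    x_norm = (x - mu) / sig                           # remove device-specific frequency coloration
    perm = torch.randperm(B, device=x.device)         # partner sample for each batch element
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1, 1, 1)).to(x.device)
    mu_mix = lam * mu + (1 - lam) * mu[perm]          # mix the statistics of the two samples
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix                  # re-apply the mixed statistics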

Baseline Complexity

The baseline system has a complexity of 61,148 parameters and 29,419,156 MACs. The table below lists how the parameters and MACs are distributed across the different layers in the network.

According to the challenge rules, the following complexity limits apply:

  • max memory for model parameters: 128 kB (Kilobyte)
  • max number of MACs for inference of a 1-second audio snippet: 30 MMACs (million MACs)

Model parameters of the baseline must therefore be converted to 16-bit precision before inference on the test/evaluation set to comply with the complexity limits (61,148 parameters * 16 bits = 61,148 * 2 B = 122,296 B <= 128 kB).

In previous years of the challenge, top-ranked teams used a technique called quantization that converts model parameters to 8-bit precision. In this case, the maximum number of allowed parameters would be 128,000.

| Description | Layer | Input Shape | Params | MACs |
| ----------- | ----- | ----------- | ------ | ---- |
| in_c[0] | Conv2dNormActivation | [1, 1, 256, 65] | 88 | 304,144 |
| in_c[1] | Conv2dNormActivation | [1, 8, 128, 33] | 2,368 | 2,506,816 |
| stages[0].b1.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b1.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 626,816 |
| stages[0].b1.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112 | 2,228,288 |
| stages[0].b2.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b2.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 626,816 |
| stages[0].b2.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 17] | 2,112 | 2,228,288 |
| stages[0].b3.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 17] | 2,176 | 2,228,352 |
| stages[0].b3.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 17] | 704 | 331,904 |
| stages[0].b3.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 64, 9] | 2,112 | 1,179,712 |
| stages[1].b4.block[0] | Conv2dNormActivation (pointwise) | [1, 32, 64, 9] | 2,176 | 1,179,776 |
| stages[1].b4.block[1] | Conv2dNormActivation (depthwise) | [1, 64, 64, 9] | 704 | 166,016 |
| stages[1].b4.block[2] | Conv2dNormActivation (pointwise) | [1, 64, 32, 9] | 3,696 | 1,032,304 |
| stages[1].b5.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9] | 6,960 | 1,935,600 |
| stages[1].b5.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320 | 311,280 |
| stages[1].b5.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 6,832 | 1,935,472 |
| stages[2].b6.block[0] | Conv2dNormActivation (pointwise) | [1, 56, 32, 9] | 6,960 | 1,935,600 |
| stages[2].b6.block[1] | Conv2dNormActivation (depthwise) | [1, 120, 32, 9] | 1,320 | 311,280 |
| stages[2].b6.block[2] | Conv2dNormActivation (pointwise) | [1, 120, 32, 9] | 12,688 | 3,594,448 |
| ff_list[0] | Conv2d | [1, 104, 32, 9] | 1,040 | 299,520 |
| ff_list[1] | BatchNorm2d | [1, 10, 32, 9] | 20 | 20 |
| ff_list[2] | AdaptiveAvgPool2d | [1, 10, 32, 9] | - | - |
| Sum | - | - | 61,148 | 29,419,156 |

To give an example of how MACs and parameters are calculated, let's look in detail at the module stages[0].b3.block[1]. It consists of a conv2d, a batch norm, and a ReLU activation function.

Parameters: The conv2d parameters are calculated as input_channels * output_channels * kernel_size * kernel_size, resulting in 1 * 64 * 3 * 3 = 576 parameters. Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 64 * 2 = 128 parameters on top, resulting in a total of 704 parameters for this Conv2dNormActivation module.

MACs: The MACs of the conv2d are calculated as input_channels * output_channels * kernel_size * kernel_size * output_frequency_bands * output_time_frames, resulting in 1 * 64 * 3 * 3 * 64 * 9 = 331,776 MACs.
Note that input_channels=1 since it is a depth-wise convolution with 64 groups. The batch norm adds 128 MACs on top, resulting in a total of 331,904 MACs for this Conv2dNormActivation module.
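
The same arithmetic written out as a short sketch (the numbers correspond to the example module above):

# stages[0].b3.block[1]: depthwise 3x3 convolution followed by batch norm (and ReLU)
in_ch_per_group, out_ch, k = 1, 64, 3        # depthwise conv: one input channel per group
out_freq, out_time = 64, 9                   # output resolution of this block

conv_params = in_ch_per_group * out_ch * k * k                        # 576
bn_params = 2 * out_ch                                                # 128 (scale and shift per channel)
print(conv_params + bn_params)                                        # 704 parameters

conv_macs = in_ch_per_group * out_ch * k * k * out_freq * out_time    # 331,776
bn_macs = 2 * out_ch                                                  # 128, as counted by the tool
print(conv_macs + bn_macs)                                            # 331,904 MACs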

Baseline Results

The two tables below present the results obtained with the baseline system. The system was trained five times on each subset and tested on the development-test split, and the mean and standard deviation of the performance from these five independent trials are shown.

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Class-wise Results

The class-wise accuracies take into account only the test items belonging to the considered class.

| Subset | Airport | Bus | Metro | Metro Station | Park | Public Square | Shopping Mall | Street Pedestrian | Street Traffic | Tram | Macro-Avg. Accuracy |
| ------ | ------- | --- | ----- | ------------- | ---- | ------------- | ------------- | ----------------- | -------------- | ---- | ------------------- |
| 5% | 34.77 | 45.21 | 30.79 | 40.03 | 62.06 | 22.28 | 52.07 | 31.32 | 70.23 | 35.20 | 42.40 ± 0.42 |
| 10% | 38.50 | 47.99 | 36.93 | 43.71 | 65.43 | 27.05 | 52.46 | 31.82 | 72.64 | 36.41 | 45.29 ± 1.01 |
| 25% | 41.81 | 61.19 | 38.88 | 40.84 | 69.74 | 33.54 | 58.84 | 30.31 | 75.93 | 51.77 | 50.29 ± 0.87 |
| 50% | 41.51 | 63.23 | 43.37 | 48.71 | 72.55 | 34.25 | 60.09 | 37.26 | 79.71 | 51.16 | 53.19 ± 0.68 |
| 100% | 46.45 | 72.95 | 52.86 | 41.56 | 76.11 | 37.07 | 66.91 | 38.73 | 80.66 | 56.58 | 56.99 ± 1.11 |

Device-wise Results

The device-wise accuracies take into account only the test items recorded by the specific device. As discussed here, devices S4-S6 are used only for testing and not for training the system.

| Subset | A | B | C | S1 | S2 | S3 | S4 | S5 | S6 | Macro-Avg. Accuracy |
| ------ | - | - | - | -- | -- | -- | -- | -- | -- | ------------------- |
| 5% | 54.45 | 45.73 | 48.42 | 39.66 | 36.13 | 44.30 | 38.90 | 40.47 | 33.58 | 42.40 ± 0.42 |
| 10% | 57.84 | 48.60 | 51.13 | 42.16 | 40.30 | 46.00 | 43.13 | 41.30 | 37.26 | 45.29 ± 1.01 |
| 25% | 62.27 | 53.27 | 55.39 | 47.52 | 46.68 | 51.59 | 47.39 | 46.75 | 41.75 | 50.29 ± 0.87 |
| 50% | 65.39 | 56.30 | 57.23 | 52.99 | 50.85 | 54.78 | 48.35 | 47.93 | 44.90 | 53.19 ± 0.68 |
| 100% | 67.17 | 59.67 | 61.99 | 56.28 | 55.69 | 58.16 | 53.05 | 52.35 | 48.58 | 56.99 ± 1.11 |


Submission

Official challenge submission consists of the following:

  • System output files (*.csv) (5 in total, one per training split)
  • Metadata file(s) (*.yaml)
  • A technical report explaining in sufficient detail the method (*.pdf)

Each submission entry should contain system output for each provided training split (5 files). The system output per split should be presented as a single text file (in CSV format, with a header row) containing a classification result for each audio file in the evaluation set. In addition, the results file should contain probabilities for each scene class. Result items can be in any order. Multiple system entries can be submitted (maximum 3 per team; see the evaluation setup).

For each system entry, it is crucial to provide meta information in a separate file containing the task-specific information. This meta information is essential for fast processing of the submissions and analysis of submitted systems. Participants are strongly advised to fill in the meta information carefully, ensuring all information is correctly provided.

All files should be packaged into a zip file for submission. Please ensure a clear connection between the system name in the submitted yaml, the submitted system output, and the technical report! Instead of the system name, you can also use a submission label. See the instructions on how to form your submission label. The submission label's index field should be used to differentiate your submissions in case you have multiple submissions.

Package structure:

[submission_label]/[submission_label].output.split_5.csv
[submission_label]/[submission_label].output.split_10.csv
[submission_label]/[submission_label].output.split_25.csv
[submission_label]/[submission_label].output.split_50.csv
[submission_label]/[submission_label].output.split_100.csv
[submission_label]/[submission_label].meta.yaml
[submission_label].technical_report.pdf

System output file

The system output should have the following format for each row:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename	scene_label	airport	bus	metro	metro_station	park	public_square	shopping_mall	street_pedestrian	street_traffic	tram
0.wav	bus	0.2512	0.9931	0.1211	0.3211	0.4122	0.4233	0.2344	0.3455	0.1266	0.4577
1.wav	tram	0.2521	0.194	0.1211	0.3211	0.4122	0.4233	0.2344	0.3455	0.1266	0.8577
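
A minimal sketch of writing such an output file; the file name and the source of the probabilities are placeholders.

import csv

classes = ["airport", "bus", "metro", "metro_station", "park", "public_square",
           "shopping_mall", "street_pedestrian", "street_traffic", "tram"]

# predictions: (filename, per-class probabilities) pairs produced by your system (made-up example)
predictions = [("0.wav", [0.25, 0.99, 0.12, 0.32, 0.41, 0.42, 0.23, 0.35, 0.13, 0.46])]

with open("Lastname_INST_task1_1.output.split_5.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filename", "scene_label"] + classes)        # header row
    for filename, probs in predictions:
        label = classes[probs.index(max(probs))]                  # predicted scene label
        writer.writerow([filename, label] + [f"{p:.4f}" for p in probs])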

Metadata file

Example meta information file baseline system task1/Schmid_CPJKU_task1_1/Schmid_CPJKU_task1_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-3)]
  label: Schmid_CPJKU_task1_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2024 baseline system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Schmid
      firstname: Florian
      email: florian.schmid@jku.at           # Contact email address
      corresponding: true                    # Mark true for one of the authors
      # Affiliation information for the author
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)   # Optional
        location: Linz, Austria

    # Second author
    - lastname: Primus
      firstname: Paul
      email: paul.primus@jku.at   
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)  
        location: Linz, Austria      

    # Third author
    - lastname: Heittola
      firstname: Toni
      email: toni.heittola@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            
        location: Tampere, Finland
    
    # Fourth author
    - lastname: Mesaros
      firstname: Annamaria
      email: annamaria.mesaros@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            
        location: Tampere, Finland
    
    # Fifth author
    - lastname: Martín Morató
      firstname: Irene
      email: irene.martinmorato@tuni.fi
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences           
        location: Tampere, Finland

    # Sixth author
    - lastname: Koutini
      firstname: Khaled
      email: khaled.koutini@jku.at     
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)  
        location: Linz, Austria

    # Seventh author
    - lastname: Widmer
      firstname: Gerhard
      email: gerhard.widmer@jku.at
      affiliation:
        abbreviation: JKU
        institute: Johannes Kepler University (JKU) Linz
        department: Institute of Computational Perception (CP)  
        location: Linz, Austria           

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 32kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 32kHz

    # Acoustic representation
    # one or multiple labels, e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, ...
    acoustic_features: log-mel energies

    # Data augmentation methods
    # e.g. mixup, freq-mixstyle, dir augmentation, pitch shifting, time rolling, frequency masking, time masking, frequency warping, ...
    data_augmentation: freq-mixstyle, pitch shifting, time rolling

    # Machine learning
    # e.g., (RF-regularized) CNN, RNN, CRNN, Transformer, ...
    machine_learning_method: RF-regularized CNN

    # External data usage method
    # e.g. "dataset", "embeddings", "pre-trained model", ...
    external_data_usage: !!null

    # Method for handling the complexity restrictions
    # e.g. "knowledge distillation", "pruning", "precision_16", "weight quantization", "network design", ...
    complexity_management: precision_16, network design

    # System training/processing pipeline stages
    # e.g. "train teachers", "ensemble teachers", "train student using knowledge distillation", "quantization-aware training"
    pipeline: training

    # Machine learning framework
    # e.g. keras/tensorflow, pytorch, ...
    framework: pytorch

    # List all basic hyperparameters that were adapted for the different subsets (or leave !!null in case no adaptations were made)
    # e.g. "lr", "epochs", "batch size", "weight decay", "freq-mixstyle probability", "frequency mask size", "time mask size", 
    #      "time rolling range", "dir augmentation probability", ...
    split_adaptations: !!null

    # List most important properties that make this system different from other submitted systems (or leave !!null if you submit only one system)
    # e.g. "architecture", "model size", "input resolution", "data augmentation techniques", "pre-training", "knowledge distillation", ...
    system_adaptations: !!null

  # System complexity
  complexity:
    # Total model size in bytes. Calculated as [parameter count]*[bit per parameter]/8
    total_model_size: 122296  # 61,148 * 16 bits = 61,148 * 2 B = 122,296 B for the baseline system

    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value.
    total_parameters: 61148 

    # MACS - as calculated by NeSsi
    macs: 29419156

  # List of external datasets used in the submission.
  external_datasets:
    # Below are two examples (NOT used in the baseline system)
    #- name: EfficientAT
    #  url: https://github.com/fschmid56/EfficientAT
    #  total_audio_length: !!null
    #- name: MicIRP
    #  url: http://micirp.blogspot.com/?m=1
    #  total_audio_length: 2   # specify in minutes

  # URL to the source code of the system [optional]
  source_code: https://github.com/CPJKU/dcase2024_task1_baseline

# System results
results:
  development_dataset:
    # System results on the development-test set for all provided data splits (5%, 10%, 25%, 50%, 100%).
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.

    split_5:  # results on 5% subset
      # Overall metrics
      overall:
        logloss: !!null   # !!null, if you don't have the corresponding result
        accuracy: 42.4    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:
          logloss: !!null  # !!null, if you don't have the corresponding result
          accuracy: 34.77
        bus:
          logloss: !!null
          accuracy: 45.21
        metro:
          logloss: !!null
          accuracy: 30.79
        metro_station:
          logloss: !!null
          accuracy: 40.03
        park:
          logloss: !!null
          accuracy: 62.06
        public_square:
          logloss: !!null
          accuracy: 22.28
        shopping_mall:
          logloss: !!null
          accuracy: 52.07
        street_pedestrian:
          logloss: !!null
          accuracy: 31.32
        street_traffic:
          logloss: !!null
          accuracy: 70.23
        tram:
          logloss: !!null
          accuracy: 35.20

      # Device-wise
      device_wise:
        a:
          logloss: !!null
          accuracy: 54.45
        b:
          logloss: !!null
          accuracy: 45.73
        c:
          logloss: !!null
          accuracy: 48.42
        s1:
          logloss: !!null
          accuracy: 39.66
        s2:
          logloss: !!null
          accuracy: 36.13
        s3:
          logloss: !!null
          accuracy: 44.30
        s4:
          logloss: !!null
          accuracy: 38.90
        s5:
          logloss: !!null
          accuracy: 40.47
        s6:
          logloss: !!null
          accuracy: 33.58

    split_10: # results on 10% subset
      # Overall metrics
      overall:
        logloss: !!null
        accuracy: 45.29    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:
          logloss: !!null
          accuracy: 38.50
        bus:
          logloss: !!null
          accuracy: 47.99
        metro:
          logloss: !!null
          accuracy: 36.93
        metro_station:
          logloss: !!null
          accuracy: 43.71
        park:
          logloss: !!null
          accuracy: 65.43
        public_square:
          logloss: !!null
          accuracy: 27.05
        shopping_mall:
          logloss: !!null
          accuracy: 52.46
        street_pedestrian:
          logloss: !!null
          accuracy: 31.82
        street_traffic:
          logloss: !!null
          accuracy: 72.64
        tram:
          logloss: !!null
          accuracy: 36.41

      # Device-wise
      device_wise:
        a:
          logloss: !!null
          accuracy: 57.84                        
        b:
          logloss: !!null
          accuracy: 48.60
        c:
          logloss: !!null
          accuracy: 51.13
        s1:
          logloss: !!null
          accuracy: 42.16
        s2:
          logloss: !!null
          accuracy: 40.30
        s3:
          logloss: !!null
          accuracy: 46.00
        s4:
          logloss: !!null
          accuracy: 43.13
        s5:
          logloss: !!null
          accuracy: 41.30
        s6:
          logloss: !!null
          accuracy: 37.26

    split_25:  # results on 25% subset
      # Overall metrics
      overall:
        logloss: !!null
        accuracy: 50.29    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:
          logloss: !!null
          accuracy: 41.81                           
        bus:
          logloss: !!null
          accuracy: 61.19
        metro:
          logloss: !!null
          accuracy: 38.88
        metro_station:
          logloss: !!null
          accuracy: 40.84
        park:
          logloss: !!null
          accuracy: 69.74
        public_square:
          logloss: !!null
          accuracy: 33.54
        shopping_mall:
          logloss: !!null
          accuracy: 58.84
        street_pedestrian:
          logloss: !!null
          accuracy: 30.31
        street_traffic:
          logloss: !!null
          accuracy: 75.93
        tram:
          logloss: !!null
          accuracy: 51.77

      # Device-wise
      device_wise:
        a:
          logloss: !!null
          accuracy: 62.27                        
        b:
          logloss: !!null
          accuracy: 53.27
        c:
          logloss: !!null
          accuracy: 55.39
        s1:
          logloss: !!null
          accuracy: 47.52
        s2:
          logloss: !!null
          accuracy: 46.68
        s3:
          logloss: !!null
          accuracy: 51.59
        s4:
          logloss: !!null
          accuracy: 47.39
        s5:
          logloss: !!null
          accuracy: 46.75
        s6:
          logloss: !!null
          accuracy: 41.75

    split_50: # results on 50% subset
      # Overall metrics
      overall:
        logloss: !!null
        accuracy: 53.19    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:
          logloss: !!null
          accuracy: 41.51                         
        bus:
          logloss: !!null
          accuracy: 63.23
        metro:
          logloss: !!null
          accuracy: 43.37
        metro_station:
          logloss: !!null
          accuracy: 48.71
        park:
          logloss: !!null
          accuracy: 72.55
        public_square:
          logloss: !!null
          accuracy: 34.25
        shopping_mall:
          logloss: !!null
          accuracy: 60.09
        street_pedestrian:
          logloss: !!null
          accuracy: 37.26
        street_traffic:
          logloss: !!null
          accuracy: 79.71
        tram:
          logloss: !!null
          accuracy: 51.16

      # Device-wise
      device_wise:
        a:
          logloss: !!null
          accuracy: 65.39                        
        b:
          logloss: !!null
          accuracy: 56.30
        c:
          logloss: !!null
          accuracy: 57.23
        s1:
          logloss: !!null
          accuracy: 52.99
        s2:
          logloss: !!null
          accuracy: 50.85
        s3:
          logloss: !!null
          accuracy: 54.78
        s4:
          logloss: !!null
          accuracy: 48.35
        s5:
          logloss: !!null
          accuracy: 47.93
        s6:
          logloss: !!null
          accuracy: 44.90

    split_100:  # results on 100% subset
      # Overall metrics
      overall:
        logloss: !!null
        accuracy: 56.99    # mean of class-wise accuracies

      # Class-wise metrics
      class_wise:
        airport:
          logloss: !!null
          accuracy: 46.45
        bus:
          logloss: !!null
          accuracy: 72.95
        metro:
          logloss: !!null
          accuracy: 52.86
        metro_station:
          logloss: !!null
          accuracy: 41.56
        park:
          logloss: !!null
          accuracy: 76.11
        public_square:
          logloss: !!null
          accuracy: 37.07
        shopping_mall:
          logloss: !!null
          accuracy: 66.91
        street_pedestrian:
          logloss: !!null
          accuracy: 38.73
        street_traffic:
          logloss: !!null
          accuracy: 80.66
        tram:
          logloss: !!null
          accuracy: 56.58

      # Device-wise
      device_wise:
        a:
          logloss: !!null
          accuracy: 67.17                        
        b:
          logloss: !!null
          accuracy: 59.67
        c:
          logloss: !!null
          accuracy: 61.99
        s1:
          logloss: !!null
          accuracy: 56.28
        s2:
          logloss: !!null
          accuracy: 55.69
        s3:
          logloss: !!null
          accuracy: 58.16
        s4:
          logloss: !!null
          accuracy: 53.05
        s5:
          logloss: !!null
          accuracy: 52.35
        s6:
          logloss: !!null
          accuracy: 48.58

Package validator

There is an automatic validation tool to help challenge participants to prepare a correctly formatted submission package, which in turn will speed up the submission processing in the challenge evaluation stage. Please use it to make sure your submission package follows the given formatting.


Tools

NeSsi - Calculating memory size of the model parameters and MACS

In PyTorch, one can use the torchinfo tool to get the model parameters per layer. In Keras, there is a built-in summary method for this. Based on the parameter counts provided by these tools, and taking into account the variable types used in your model, you can calculate the memory size of the model parameters (see the previous examples).

The NeSsi script can calculate the MACs and the model size for Keras and PyTorch models. It reports the MACs and the parameter count (the "memory" field in the NeSsi output) for the given model. Based on the parameter count and the variable type used, you can calculate the memory size of the model parameters.
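
For example, a parameter count obtained with torchinfo can be converted into a memory figure as in the following sketch; the model and input shape are placeholders, and NeSsi should be used for the official numbers.

import torch.nn as nn
from torchinfo import summary

model = nn.Sequential(                       # placeholder model, not the challenge baseline
    nn.Conv2d(1, 8, 3, stride=2, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

info = summary(model, input_size=(1, 1, 256, 65), verbose=0)
params = info.total_params
print(params, "parameters")
print(params * 16 / 8, "bytes when parameters are stored as 16-bit floats")   # compare against the 128 kB limit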


If you have any issues using NeSsi, please contact Florian Schmid (florian.schmid@jku.at) by email or use the DCASE community forum or Slack channel.


Package validator

An automatic submission package validation tool will be made available together with the evaluation data.


Citation

If you are using the audio dataset, please cite the following paper:

Publication

Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60. 2020. URL: https://arxiv.org/abs/2005.14623.


Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.



If you are participating in the task, please cite the following paper:

Publication

Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, and Gerhard Widmer. Data-efficient low-complexity acoustic scene classification in the DCASE 2024 Challenge. 2024. URL: https://arxiv.org/abs/1706.10006.


Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

Abstract

This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.



If you are using or referencing the baseline system, please cite the following paper:

Publication

Florian Schmid, Tobias Morocutti, Shahed Masoudian, Khaled Koutini, and Gerhard Widmer. Distilling the knowledge of transformers and CNNs with CP-mobile. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), 161–165. 2023.


Distilling the Knowledge of Transformers and CNNs with CP-Mobile

Abstract

Designing lightweight models that require limited computational resources and can operate on edge devices is a major trajectory in deep learning research. In the context of Acoustic Scene Classification (ASC), the DCASE community hosts an annual challenge on low-complexity ASC, contributing to the research on Knowledge Distillation (KD), Model Pruning, Quantization and efficient neural network design. In this work, we propose a system that contributes to the latter by introducing CP-Mobile, a lightweight CNN architecture constructed of residual inverted bottleneck blocks and Global Response Normalization. Furthermore, we improve Knowledge Distillation by showing that ensembling CNNs and Audio Spectrogram Transformers form strong teacher ensembles. Our proposed system improves the results on the TAU Urban Acoustic Scenes 2022 Mobile development dataset by around 5 percentage points in accuracy compared to the top-ranked submission for Task 1 of the DCASE 22 challenge and achieves the top rank in the DCASE 23 challenge.
