The goal of acoustic scene classification is to classify a test recording into one of ten predefined acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from the DCASE2022 Challenge, with a modified memory limit and an added measurement of energy consumption. Submissions will be ranked by the weighted average rank of classification accuracy, memory usage, and MACS.

The challenge has ended. Full results for this task can be found on the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

The goal of acoustic scene classification is to classify a test recording into one of ten predefined acoustic scene classes. The task targets acoustic scene classification on devices with low computational and memory allowance, which impose limits on the model complexity, such as the model's number of parameters and its multiply-accumulate operation count. In addition to low complexity, the aim is generalization across a number of different devices. For this purpose, the task uses audio data recorded and simulated with a variety of devices.

Figure 1: Overview of acoustic scene classification system.

Organization of this task is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.

Audio dataset

The development dataset for this task is TAU Urban Acoustic Scenes 2022 Mobile, development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 different devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set. The dataset has exactly the same content as TAU Urban Acoustic Scenes 2020 Mobile, development dataset, but the audio files have a length of 1 second (therefore there are 10 times more files than in the 2020 version).

Recordings were made using four devices that captured audio simultaneously. The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using a 48 kHz sampling rate and 24-bit resolution, referred to as device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm and Vienna. The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.

Additionally, 11 mobile devices S1-S11 are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is convolved with the selected impulse response, then processed with a device-specific set of parameters for dynamic range compression. The impulse responses are proprietary data and will not be published.

The development dataset comprises 40 hours of data from device A, and smaller amounts from the other devices. Audio is provided in a single-channel 44.1kHz 24-bit format.

Acoustic scenes (10):

Airport - airport
Indoor shopping mall - shopping_mall
Metro station - metro_station
Pedestrian street - street_pedestrian
Public square - public_square
Street with medium level of traffic - street_traffic
Travelling by a tram - tram
Travelling by a bus - bus
Travelling by an underground metro - metro
Urban park - park

Reference labels

Reference labels are provided only for the development dataset. Reference labels for the evaluation dataset will not be released. For publications based on the DCASE challenge data, please use the provided training/test setup of the development set to allow comparisons. After the challenge, if you want to evaluate your proposed system with the official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited number of system outputs.

Download

TAU Urban Acoustic Scenes 2022 Mobile, Development dataset (30.5 GB)

TAU Urban Acoustic Scenes 2023 Mobile, Evaluation dataset (13.4 GB)

Task setup

Development dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings, therefore all overlap with the data from device A, but not necessarily with each other. The total amount of audio in the development set is 64 hours.

The dataset is provided with a training/test split in which 70% of the data for each device is included for training, 30% for testing. Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and training/test split are provided in the following table.

Devices		Dataset		Cross-validation setup
Name	Type	Total duration	Total segments	Train segments	Test segments	Notes
A	Real	40h	144 000	102 150	3 300	38 550 segments not used in train/test split
B	Real	3h	10 780	7 490	3 290
C	Real	3h	10 770	7 480	3 290
S1 S2 S3	Simulated	3h each	3 * 10 800	3 * 7 500	3 * 3 300
S4 S5 S6	Simulated	3h each	3 * 10 800	-	3 * 3 300	3 * 7 500 segments not used in train/test split
Total		64h	230 350	139 620	29 680

Participants are required to report the performance of their system using this train/test setup in order to allow a comparison of systems on the development set. Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to the segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names: [scene label]-[city]-[location id]-[segment id]-[device id].wav. Make sure that all the files having the same location id are placed on the same side of the split.
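A location-aware validation split of this kind could be sketched as follows. This is only an illustration, not part of the official setup: the helper names `location_key` and `split_by_location` are hypothetical, and the sketch assumes that the first three hyphen-separated fields of a file name are the scene label, city and location id, and that a location is identified by its city together with its location id.

```python
import random
from collections import defaultdict


def location_key(filename):
    """Return (city, location id) parsed from a dataset file name.

    Only the first three hyphen-separated fields are read, so any
    extra fields after the location id are ignored.
    """
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    _scene, city, location_id = stem.split("-")[:3]
    return city, location_id


def split_by_location(filenames, validation_fraction=0.3, seed=0):
    """Split files so that every location ends up entirely on one side."""
    groups = defaultdict(list)
    for name in filenames:
        groups[location_key(name)].append(name)

    # Shuffle whole locations, not individual files.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_val = max(1, round(len(keys) * validation_fraction))

    val = [f for k in keys[:n_val] for f in groups[k]]
    train = [f for k in keys[n_val:] for f in groups[k]]
    return train, val
```

Because whole locations are assigned to one side, the realized validation fraction in terms of files can differ from `validation_fraction`.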

Evaluation dataset

The evaluation dataset contains data from 12 cities, 10 acoustic scenes, 11 devices. There are five new devices (not available in the development set): a real device D and simulated devices S7-S11. Evaluation data contains 22 hours of audio. The evaluation data contains audio recorded at different locations than the development data.

Device and city information is not provided in the evaluation set. The systems are expected to be robust to different devices.

System complexity requirements

The computational complexity is limited in terms of model size and MMACs (million multiply-accumulate operations). The limits are modeled after Cortex-M4 devices (e.g. STM32L496@80MHz or Arduino Nano 33@64MHz); the maximum allowed limits are as follows:

Maximum memory allowance for model parameters: 128KB (Kilobyte), counting ALL parameters (including the zero-valued ones), and no requirement on the numerical representation. This differs from DCASE2022 which limited model parameters to 128K with fixed variable type (int8). We make this change to allow participants better control of the tradeoff between the number of parameters and their numerical representation. The memory limit translates into 128K parameters when using int8 (signed 8-bit integer) (128000 parameters * 8 bits per parameter / 8 bits per byte = 128KB), or 32K parameters when using float32 (32-bit float) (32000 * 32 bits per parameter / 8 bits per byte = 128KB).
Maximum number of MACS per inference: 30 MMAC (million MACs). The limit is approximated based on the computing power of the target device class. The analysis segment length for the inference is 1 s. The limit of 30 MMAC mimics the fitting of audio buffers into SRAM (fast access internal memory) on the target device and allows some head space for feature calculation (e.g. FFT), assuming that the most commonly used features fit under this.
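The arithmetic behind the two limits can be sketched as below. The function names are hypothetical, and the sketch assumes 1 KB = 1000 bytes, which is the convention used in the worked examples above (128000 int8 parameters = 128KB).

```python
MAX_MODEL_BYTES = 128_000  # 128 KB, with 1 KB taken as 1000 bytes (assumption)
MAX_MACS = 30_000_000      # 30 MMAC per inference on a 1 s segment


def model_size_bytes(param_counts_by_bits):
    """Memory needed for parameters.

    param_counts_by_bits maps a bit width (e.g. 8, 16, 32) to the number
    of parameters stored at that width; zero-valued parameters count too.
    """
    total_bits = sum(bits * count for bits, count in param_counts_by_bits.items())
    return total_bits / 8


def within_limits(param_counts_by_bits, macs):
    """True if both the parameter memory and the MACS are within the limits."""
    return model_size_bytes(param_counts_by_bits) <= MAX_MODEL_BYTES and macs <= MAX_MACS
```

For instance, `model_size_bytes({8: 100_000, 32: 10_000})` describes a mixed-precision model with 100000 int8 and 10000 float32 parameters, which would exceed the limit.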

In case of using learned features (so-called embeddings, like VGGish, OpenL3 or EdgeL3), the network used to generate them counts in the calculated model size and complexity!

Full information about the model size and complexity should be provided in the submitted technical report.

Calculating memory size of the model parameters and MACS

In PyTorch, one can use the torchinfo tool to get the model parameters per layer. In Keras, there is a built-in method summary for this. Based on the parameter counts provided by these tools, and taking into account the variable types used in your model, you can calculate the memory size of the model parameters (see previous examples).

We offer a script for calculating the MACS and the model size for Keras and PyTorch based models. This tool gives the MACS and parameter count ("memory" field in the NeSsi output) for the given model. Based on the parameter count and the variable type used, you can calculate the memory size of the model parameters.

NeSsi, Keras/Pytorch neural network size, operations and parameters counter

If you have any issues using NeSsi, please contact Augusto Ancilotto by email or use the DCASE community forum or Slack channel.

Calculation example / DCASE2022 Task 1, Baseline

Model architecture

Network before adding quantization layers (before converting into TFlite model):

Id	Layer	Shape	Parameters
0	conv2d_1	(None, 40, 51, 16)	800
1	batch_normalization_1	(None, 40, 51, 16)	64
2	activation_1	(None, 40, 51, 16)	0
3	conv2d_2	(None, 40, 51, 16)	12 560
4	batch_normalization_2	(None, 40, 51, 16)	64
5	activation_2	(None, 40, 51, 16)	0
6	max_pooling2d_1	(None, 8, 10, 16)	0
7	dropout_1	(None, 8, 10, 16)	0
8	conv2d_3	(None, 8, 10, 32)	25 120
9	batch_normalization_3	(None, 8, 10, 32)	128
10	activation_3	(None, 8, 10, 32)	0
11	max_pooling2d_2	(None, 2, 1, 32)	0
12	dropout_2	(None, 2, 1, 32)	0
13	flatten_1	(None, 64)	0
14	dense_1	(None, 100)	6 500
15	dropout_3	(None, 100)	0
16	dense_2	(None, 10)	1010

Model complexity

Tensor information (weights excluded, grouped by layer type):

Id	Tensor	Shape	Size in RAM (B)
0	Identity_int8	(1,10)	10
1	conv2d_input_int8	(1, 40, 51, 1)	2 040
2	sequential/activation/Relu	(1, 40, 51, 16)	32 640
3	sequential/activation_1/Relu	(1, 40, 51, 16)	32 640
4	sequential/activation_2/Relu	(1, 8, 10, 32)	2 560
13	sequential/dense/Relu	(1, 100)	100
14	sequential/dense_1/BiasAdd	(1, 10)	10
17	sequential/max_pooling2d/MaxPool	(1, 8, 10, 16)	1 280
18	sequential/max_pooling2d_1/MaxPool	(1, 2, 1, 32)	64
19	conv2d_input	(1, 40, 51, 1)	8 160
20	Identity	(1, 10)	40

Operator (output name)	Tensors in memory (IDs)	Memory use (B)	MACS	Parameters
conv2d_input_int8	[1, 19]	10 200	0	0
sequential/activation/Relu	[1, 2]	34 680	1 599 360	848
sequential/activation_1/Relu	[2, 3]	65 280	25 589 760	12 608
sequential/max_pooling2d/MaxPool	[3, 17]	33 920	32 000	0
sequential/activation_2/Relu	[4, 17]	3 840	2 007 040	25 216
sequential/max_pooling2d_1/MaxPool	[4, 18]	2 624	2 560	0
sequential/dense/Relu	[13, 18]	164	3 200	6 800
sequential/dense_1/BiasAdd	[13, 14]	110	1 000	1 040
Identity_int8	[0, 14]	20	0	0
Identity	[0, 20]	50	0	0
Total			29 234 920	46 512
Max		65 280 B
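The totals in the tables above can be cross-checked with a few lines of Python. This is a sketch of one reading of the tables (memory use of an operator taken as the sum of the sizes of the tensors it keeps in memory), not a description of how NeSsi computes its output; the variable names are illustrative.

```python
# Tensor sizes in bytes, keyed by tensor id (copied from the tensor table).
tensor_bytes = {
    0: 10, 1: 2040, 2: 32640, 3: 32640, 4: 2560, 13: 100,
    14: 10, 17: 1280, 18: 64, 19: 8160, 20: 40,
}

# (operator, tensor ids in memory, MACS, parameters), from the operator table.
operators = [
    ("conv2d_input_int8", [1, 19], 0, 0),
    ("sequential/activation/Relu", [1, 2], 1_599_360, 848),
    ("sequential/activation_1/Relu", [2, 3], 25_589_760, 12_608),
    ("sequential/max_pooling2d/MaxPool", [3, 17], 32_000, 0),
    ("sequential/activation_2/Relu", [4, 17], 2_007_040, 25_216),
    ("sequential/max_pooling2d_1/MaxPool", [4, 18], 2_560, 0),
    ("sequential/dense/Relu", [13, 18], 3_200, 6_800),
    ("sequential/dense_1/BiasAdd", [13, 14], 1_000, 1_040),
    ("Identity_int8", [0, 14], 0, 0),
    ("Identity", [0, 20], 0, 0),
]

# Memory use per operator: sum of the tensors held at that step.
memory_use = {name: sum(tensor_bytes[i] for i in ids) for name, ids, _, _ in operators}

peak_memory = max(memory_use.values())                   # "Max" row
total_macs = sum(macs for _, _, macs, _ in operators)    # "Total" MACS
total_params = sum(p for _, _, _, p in operators)        # "Total" parameters
```

The peak is reached at the second convolution, where two (1, 40, 51, 16) int8 activation tensors are held at once.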

External data resources

Use of external data and transfer learning is allowed under the following conditions:

The used external resource is clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets or pretrained models. The data must be public and freely available before 1st of April 2023.
The list of external data sources used in training must be clearly indicated in the technical report.
Participants must inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the task coordinators; we will update the list of external datasets on the webpage accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources allowed).
It is not allowed to use previous DCASE Challenge task 1 evaluation sets (2016-2022).
It is not allowed to use TUT Urban Acoustic Scenes 2018, TUT Urban Acoustic Scenes 2018 Mobile, TAU Urban Acoustic Scenes 2019, TAU Urban Acoustic Scenes 2019 Mobile, TAU Urban Acoustic Scenes 2020 Mobile, or TAU Urban Acoustic Scenes 2020 3Class. These datasets are partially included in the current setup, and additional usage will lead to overfitting.

List of external data resources allowed:

Resource name	Type	Added	Link
DCASE2013 Challenge - Public Dataset for Scene Classification Task	audio	04.03.2019	https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task	audio	04.03.2019	https://archive.org/details/dcase2013_scene_classification_testset
AudioSet	audio, video	04.03.2019	https://research.google.com/audioset/
OpenL3	model	12.02.2020	https://openl3.readthedocs.io/
EdgeL3	model	12.02.2020	https://edgel3.readthedocs.io/
VGGish	model	12.02.2020	https://github.com/tensorflow/models/tree/master/research/audioset/vggish
SoundNet	model	03.06.2020	http://soundnet.csail.mit.edu/
Urban-SED	audio	31.03.2021	http://urbansed.weebly.com/
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition	model	31.03.2021	https://zenodo.org/record/3987831
Problem Agnostic Speech Encoder (PASE) Model	model	31.03.2021	https://github.com/santi-pdp/pase
YAMNet	model	20.05.2021	https://github.com/tensorflow/models/tree/master/research/audioset/yamnet
FSD50K	audio	10.03.2022	https://zenodo.org/record/4060432
AudioSet with Temporally-Strong Labels	audio, video	10.03.2022	https://research.google.com/audioset/download_strong.html
PaSST	model	13.05.2022	https://github.com/kkoutini/PaSST
MicIRP	IR	28.03.2023	http://micirp.blogspot.com/?m=1
ImageNet	image	28.03.2023	http://www.image-net.org/
Audio Spectrogram Transformer	model	29.03.2023	https://github.com/YuanGongND/ast
Whisper	model	31.03.2023	https://huggingface.co/docs/transformers/model_doc/whisper

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

Use of external data is allowed. See conditions for external data usage here.
It is not allowed to use previous DCASE Challenge task 1 evaluation sets (2016-2022).
Model complexity limits apply. See conditions for the model complexity here.
Manipulation of provided training and development data is allowed (e.g. by mixing data sampled from a pdf or using techniques such as pitch shifting or time stretching).
Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision-making is also forbidden.
The classification decision must be made independently for each test sample.

Submission

Official challenge submission consists of:

System output file (*.csv)
Metadata file (*.yaml)
Technical report explaining in sufficient detail the method (*.pdf)

System output should be presented as a single text-file (in CSV format, with a header row) containing a classification result for each audio file in the evaluation set. In addition, the results file should contain probabilities for each scene class. Result items can be in any order. Multiple system outputs can be submitted (maximum 4 per participant per subtask).

For each system, meta information should be provided in a separate file, containing the task-specific information. This meta information enables fast processing of the submissions and analysis of submitted systems. Participants are advised to fill the meta information carefully while making sure all information is correctly provided.

All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted yaml, submitted system output, and the technical report! Instead of system name you can use a submission label too.

System output file

The system output should have the following format for each row:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename	scene_label	airport	bus	metro	metro_station	park	public_square	shopping_mall	street_pedestrian	street_traffic	tram
0.wav	bus	0.25	0.99	0.12	0.32	0.41	0.42	0.23	0.34	0.12	0.45
1.wav	tram	0.25	0.19	0.12	0.32	0.41	0.42	0.23	0.34	0.12	0.85
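A file in this format could be produced with Python's standard csv module, as in the sketch below. The function name `write_system_output` is hypothetical, and deriving the scene label as the class with the highest probability is an assumption of this sketch, not a stated rule.

```python
import csv

# Class columns in the order used by the example output above.
CLASSES = [
    "airport", "bus", "metro", "metro_station", "park", "public_square",
    "shopping_mall", "street_pedestrian", "street_traffic", "tram",
]


def write_system_output(stream, predictions):
    """Write tab-separated results with a header row.

    predictions: iterable of (filename, {class name: probability}).
    """
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "scene_label"] + CLASSES)
    for filename, probs in predictions:
        # Assumption: the reported label is the most probable class.
        label = max(CLASSES, key=lambda c: probs[c])
        writer.writerow([filename, label] + [f"{probs[c]:.2f}" for c in CLASSES])
```

Usage: open the target file with `open(path, "w", newline="")` and pass the file object as `stream`.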

Metadata file

Example meta information file for the baseline system, task1/Martin_TAU_task1_1/Martin_TAU_task1_1.meta.yaml:

Metadata

# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Martin_TAU_task1_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2022 baseline system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Martín Morató
      firstname: Irene
      email: irene.martinmorato@tuni.fi           # Contact email address
      corresponding: true                         # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Second author
    - lastname: Heittola
      firstname: Toni
      email: toni.heittola@tuni.fi                # Contact email address

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Third author
    - lastname: Mesaros
      firstname: Annamaria
      email: annamaria.mesaros@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

    # Fourth author
    - lastname: Virtanen
      firstname: Tuomas
      email: tuomas.virtanen@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 44.1kHz

    # Acoustic representation
    # one or multiple labels, e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, ...
    acoustic_features: log-mel energies

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    # one or multiple, e.g. GMM, HMM, SVM, MLP, CNN, RNN, CRNN, ResNet, ensemble, ...
    machine_learning_method: CNN

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    # e.g. 2, 3, 4, 5, ...
    ensemble_method_subsystem_count: !!null

    # Decision making methods
    # e.g. "average", "majority vote", "maximum likelihood", ...
    decision_making: !!null

    # External data usage method
    # e.g. "directly", "embeddings", "pre-trained model", ...
    external_data_usage: embeddings

    # Method for handling the complexity restrictions
    # e.g. "weight quantization", "sparsity", "pruning", ...
    complexity_management: weight quantization

    # System training/processing pipeline stages
    # e.g. "pretraining", "training" (from scratch), "pruning", "weight quantization", ...
    pipeline: pretraining, training, adaptation, pruning, weight quantization

    # Machine learning framework
    # e.g. keras/tensorflow, pytorch, matlab, ...
    framework: keras/tensorflow

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total model size in bytes. Calculated as [parameter count]*[bit per parameter]/8
    total_model_size: 46512 # B

    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value.
    total_parameters: 46512

    # Total amount of non-zero parameters in the acoustic model.
    # Calculated with same principles as "total_parameters".
    # Use numerical value.
    total_parameters_non_zero: 46512

    # Model size calculated using NeSsi, as instructed in task description page.
    # Use numerical value
    memory_use: 65280 # B

    # MACS
    # Required for the submission ranking!
    macs: 29234920

  energy_consumption:
    # Energy consumption while training the model. Unit is kWh.
    training: 0.302 #kWh

    # Energy consumption while producing output for all files in the evaluation dataset. Unit is kWh.
    inference: 0.292 #kWh

    # Baseline system's energy consumption while producing output for all files in the evaluation dataset. Unit is kWh.
    # Run baseline code to get this value. Value is used to normalize training and inference values from your system.
    baseline_inference: 0.292 #kWh

  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets
  external_datasets:
    # Below an example how to fill the field as a list of datasets
    # Dataset name
    #- name: TAU Urban Acoustic Scenes 2022 Mobile, Development dataset
    #  # Dataset access url
    #  url: https://zenodo.org/record/6337421
    #  # Total audio length in minutes
    #  total_audio_length: 2400            # minutes

  # URL to the source code of the system [optional]
  source_code: https://github.com/marmoi/dcase2022_task1_baseline

# System results
results:
  development_dataset:
    # System results for the development dataset with the provided cross-validation setup.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.

    # Overall metrics
    overall:
      logloss: 1.575
      accuracy: 42.9    # mean of class-wise accuracies

    # Class-wise metrics
    class_wise:
      airport:
        logloss: 1.534
        accuracy: 39.4
      bus:
        logloss: 1.758
        accuracy: 29.3
      metro:
        logloss: 1.382
        accuracy: 47.9
      metro_station:
        logloss: 1.672
        accuracy: 36.0
      park:
        logloss: 1.448
        accuracy: 58.9
      public_square:
        logloss: 2.265
        accuracy: 20.8
      shopping_mall:
        logloss: 1.385
        accuracy: 51.4
      street_pedestrian:
        logloss: 1.822
        accuracy: 30.1
      street_traffic:
        logloss: 1.025
        accuracy: 70.6
      tram:
        logloss: 1.462
        accuracy: 44.6

    # Device-wise
    device_wise:
      a:
        logloss: 1.109
        accuracy: !!null
      b:
        logloss: 1.439
        accuracy: !!null
      c:
        logloss: 1.374
        accuracy: !!null
      s1:
        logloss: 1.621
        accuracy: !!null
      s2:
        logloss: 1.559
        accuracy: !!null
      s3:
        logloss: 1.531
        accuracy: !!null
      s4:
        logloss: 1.813
        accuracy: !!null
      s5:
        logloss: 1.800
        accuracy: !!null
      s6:
        logloss: 1.931
        accuracy: !!null

Package validator

There is an automatic validation tool to help challenge participants to prepare a correctly formatted submission package, which in turn will speed up the submission processing in the challenge evaluation stage. Please use it to make sure your submission package follows the given formatting.

DCASE2023 Task 1 submission validator

Evaluation

Systems will be ranked by a combination of different criteria:

macro-average accuracy (average of the class-wise accuracies, ACC); system evaluation done by organizers, submissions sorted in descending order;
amount of memory needed by the model (MEM), (values provided by participants), submissions sorted in ascending order;
MACs (MAC), value outputted by NeSsi tool (values provided by participants), submissions sorted in ascending order.
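Macro-average accuracy, as used in the first criterion, can be illustrated with a minimal sketch. The function name is hypothetical and this is not the organizers' evaluation code; it only shows that each class contributes equally regardless of how many test items it has.

```python
from collections import defaultdict


def macro_average_accuracy(true_labels, predicted_labels):
    """Average of the class-wise accuracies over the classes present in true_labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    class_accuracies = [correct[c] / total[c] for c in total]
    return sum(class_accuracies) / len(class_accuracies)
```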

Separate ranks are calculated for these criteria, then the overall rank is calculated as a weighted average as follows:

$0.5*R_{ACC}+0.25*R_{MEM}+0.25*R_{MAC}$.

If two systems end up with the same rank, then they are assumed to have an equal place in the challenge.

Example:

System A has individual ranks for the criteria $R_{ACC}: 1, R_{MEM}:1, R_{MAC}:1$, so its overall rank is $0.5*1+0.25*1+0.25*1=1$.
System B has $R_{ACC}:3, R_{MEM}:3, R_{MAC}:2$, giving an overall rank of $0.5*3+0.25*3+0.25*2=2.75$, and system C has $R_{ACC}:2, R_{MEM}:2, R_{MAC}:3$, giving $0.5*2+0.25*2+0.25*3=2.25$. The final order of the systems is $[A, C, B]$.
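The weighted rank formula can be applied to the three example systems with a short sketch. The function and variable names are illustrative; only the weights 0.5, 0.25 and 0.25 come from the formula above.

```python
def overall_rank(r_acc, r_mem, r_mac):
    """Weighted average of the per-criterion ranks (lower is better)."""
    return 0.5 * r_acc + 0.25 * r_mem + 0.25 * r_mac


# Per-criterion ranks (R_ACC, R_MEM, R_MAC) of the example systems.
systems = {"A": (1, 1, 1), "B": (3, 3, 2), "C": (2, 2, 3)}

scores = {name: overall_rank(*ranks) for name, ranks in systems.items()}
ordering = sorted(scores, key=scores.get)  # ['A', 'C', 'B']
```

Since accuracy carries half of the weight, system C is placed ahead of system B even though B has the better MAC rank.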

The weighted average ranking is used to encourage participants to develop a holistic approach to low-computational resources instead of aiming to just be under the memory and MAC limits.

We will also calculate multiclass cross-entropy (Log loss). The metric is independent of the operating point (see python implementation here). The multiclass cross-entropy will not be used in the official rankings.
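For orientation, multiclass cross-entropy can be sketched in plain Python as below. This is not necessarily identical to the implementation linked above: the function name is hypothetical, and both the renormalization of each row to sum to one and the clipping constant `eps` are assumptions of this sketch.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the correct class.

    predicted_probs: one {class name: probability} dict per item.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        norm = sum(probs.values())  # assumption: rows are renormalized
        p = min(max(probs[label] / norm, eps), 1 - eps)  # avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)
```

A system that outputs a uniform distribution over the ten scene classes would score ln(10), about 2.30, under this definition.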

Measuring the energy consumption

We will collect the energy consumption of both the training and inference phases as additional information. Energy consumption is computed using the CodeCarbon tool. Participants are required to measure the energy consumption of their system and also of the provided baseline system, run on the same hardware setup.

The energy consumption of a code block can be measured in the following way:

from codecarbon import EmissionsTracker
# Initialize tracker
tracker = EmissionsTracker()

# Start tracker 
tracker.start()

# [Code here]

# Stop tracker
tracker.stop()

# Store total energy
used_kwh = tracker._total_energy.kWh

Please see the baseline system for how to use CodeCarbon. In order to account for potential hardware differences across participants, we will normalize the reported energy consumption using the baseline system's consumption measured on the same hardware:

$$\mathrm{E}_{normalized} = \frac{\mathrm{kWh}_{\mathrm{baseline}}}{\mathrm{kWh}_{\mathrm{submission}}}$$

Participants should run the inference with the baseline system locally and report both $\mathrm{kWh}_{\mathrm{baseline}}$ and $\mathrm{kWh}_{\mathrm{submission}}$ in their submission meta data.

Results

Official rank	Submission Information
Official rank	Code	Author	Affiliation	Technical Report	Accuracy with 95% confidence interval	MACS	Memory use
28	AI4EDGE_IPL_task1_1	Carlos Almeida	Eletrotechnical Engineering, Instituto Politécnico de Leiria, Leiria, Portugal	task-low-complexity-acoustic-scene-classification-results#Almeida2023	51.9 (51.7 - 52.2)	25475456	62720
47	AI4EDGE_IPL_task1_2	Carlos Almeida	Eletrotechnical Engineering, Instituto Politécnico de Leiria, Leiria, Portugal	task-low-complexity-acoustic-scene-classification-results#Almeida2023	48.8 (48.5 - 49.1)	29304736	53760
38	AI4EDGE_IPL_task1_3	Carlos Almeida	Eletrotechnical Engineering, Instituto Politécnico de Leiria, Leiria, Portugal	task-low-complexity-acoustic-scene-classification-results#Almeida2023	50.8 (50.5 - 51.1)	26711936	49280
54	Bai_JLESS_task1_1	Yutong Du	Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China	task-low-complexity-acoustic-scene-classification-results#Du2023	47.9 (47.6 - 48.2)	27931612	78252
47	Bai_JLESS_task1_2	Yutong Du	Joint Laboratory of Environmental Sound Sensing, School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China	task-low-complexity-acoustic-scene-classification-results#Du2023	48.8 (48.5 - 49.1)	14130372	60458
17	Cai_TENCENT_task1_1	Weicheng Cai	Tencent Inc., Beijing, China	task-low-complexity-acoustic-scene-classification-results#Cai2023	56.6 (56.3 - 56.9)	28840396	127684
11	Cai_TENCENT_task1_2	Weicheng Cai	Tencent Inc., Beijing, China	task-low-complexity-acoustic-scene-classification-results#Cai2023	56.2 (55.9 - 56.5)	21990724	79942
14	Cai_TENCENT_task1_3	Weicheng Cai	Tencent Inc., Beijing, China	task-low-complexity-acoustic-scene-classification-results#Cai2023	55.8 (55.5 - 56.1)	21990724	79942
11	Cai_TENCENT_task1_4	Weicheng Cai	Tencent Inc., Beijing, China	task-low-complexity-acoustic-scene-classification-results#Cai2023	55.4 (55.1 - 55.7)	19533124	63558
9	Cai_XJTLU_task1_1	Yiqiang Cai	School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China	task-low-complexity-acoustic-scene-classification-results#Cai2023a	51.9 (51.6 - 52.2)	1649349	6828
8	Cai_XJTLU_task1_2	Yiqiang Cai	School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China	task-low-complexity-acoustic-scene-classification-results#Cai2023a	52.5 (52.3 - 52.8)	1649349	6828
6	Cai_XJTLU_task1_3	Yiqiang Cai	School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China	task-low-complexity-acoustic-scene-classification-results#Cai2023a	55.1 (54.8 - 55.4)	3424245	15890
3	Cai_XJTLU_task1_4	Yiqiang Cai	School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, China	task-low-complexity-acoustic-scene-classification-results#Cai2023a	57.0 (56.7 - 57.3)	10219540	54260
15	Fei_vv_task1_1	Hongbo Fei	vivo Mobile Commun co Ltd, Hangzhou, China	task-low-complexity-acoustic-scene-classification-results#Fei2023	55.2 (55.0 - 55.5)	13402932	123636
13	Fei_vv_task1_2	Hongbo Fei	vivo Mobile Commun co Ltd, Hangzhou, China	task-low-complexity-acoustic-scene-classification-results#Fei2023	54.5 (54.2 - 54.8)	7802348	70588
26	Fei_vv_task1_3	Hongbo Fei	vivo Mobile Commun co Ltd, Hangzhou, China	task-low-complexity-acoustic-scene-classification-results#Fei2023	53.2 (53.0 - 53.5)	13402932	123636
23	Fei_vv_task1_4	Hongbo Fei	vivo Mobile Commun co Ltd, Hangzhou, China	task-low-complexity-acoustic-scene-classification-results#Fei2023	51.8 (51.5 - 52.0)	7802348	70588
39	Han_SZU_task1_1	Nengheng Zheng	College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China	task-low-complexity-acoustic-scene-classification-results#Han2023	50.5 (50.3 - 50.8)	29.349M	80845
25	LAM_AEV_task1_1	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-low-complexity-acoustic-scene-classification-results#Pham2023	55.3 (55.1 - 55.6)	29267550	88704
31	LAM_AEV_task1_2	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-low-complexity-acoustic-scene-classification-results#Pham2023	55.0 (54.7 - 55.3)	29267550	88704
20	LAM_AEV_task1_3	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-low-complexity-acoustic-scene-classification-results#Pham2023	55.6 (55.3 - 55.9)	29267550	88704
40	Liang_NTES_task1_1	Zhicong Liang	NetEase, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Liang2023	52.6 (52.3 - 52.8)	29591778	143345
8	MALACH23_JKU_task1_1	Noah Pichler	Johannes Kepler University, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Pichler2023	57.0 (56.7 - 57.3)	14686940	119608
7	MALACH23_JKU_task1_2	Noah Pichler	Johannes Kepler University, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Pichler2023	56.6 (56.3 - 56.8)	10819292	87160
42	MALACH23_JKU_task1_3	Noah Pichler	Johannes Kepler University, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Pichler2023	9.9 (9.7 - 10.1)	572340	116008
43	MALACH23_JKU_task1_4	Noah Pichler	Johannes Kepler University, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Pichler2023	9.8 (9.7 - 10.0)	214420	63592
52	DCASE2023 baseline	Irene Martín Morató	Computing Sciences, Tampere University, Tampere, Finland	task-low-complexity-acoustic-scene-classification-results#BASELINE	44.8 (44.5 - 45.1)	29234920	65280
10	Park_KT_task1_1	Jae Han Park	Computing Sciences, KT Corporation, Seoul, Korea	task-low-complexity-acoustic-scene-classification-results#Kim2023	56.3 (56.0 - 56.6)	19556096	250000
29	Park_KT_task1_2	Jae Han Park	Computing Sciences, KT Corporation, Seoul, Korea	task-low-complexity-acoustic-scene-classification-results#Kim2023	53.0 (52.7 - 53.3)	19556096	250000
26	Park_KT_task1_3	Jae Han Park	Computing Sciences, KT Corporation, Seoul, Korea	task-low-complexity-acoustic-scene-classification-results#Kim2023	54.2 (54.0 - 54.5)	19556096	250000
20	Park_KT_task1_4	Jae Han Park	Computing Sciences, KT Corporation, Seoul, Korea	task-low-complexity-acoustic-scene-classification-results#Kim2023	49.2 (48.9 - 49.5)	617000	30000
5	Schmid_CPJKU_task1_1	Florian Schmid	Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Schmid2023	54.4 (54.1 - 54.6)	1582336	5722
1	Schmid_CPJKU_task1_2	Florian Schmid	Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Schmid2023	58.7 (58.4 - 59.0)	4354304	12310
2	Schmid_CPJKU_task1_3	Florian Schmid	Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Schmid2023	61.4 (61.2 - 61.7)	9638144	30106
3	Schmid_CPJKU_task1_4	Florian Schmid	Institute of Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-low-complexity-acoustic-scene-classification-results#Schmid2023	62.7 (62.4 - 63.0)	16803072	54182
26	Schmidt_FAU_task1_1	Lorenz Schmidt	International Audio Laboratories, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany	task-low-complexity-acoustic-scene-classification-results#Schmidt2023	55.7 (55.4 - 56.0)	28931380	127988
12	Schmidt_FAU_task1_2	Lorenz Schmidt	International Audio Laboratories, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany	task-low-complexity-acoustic-scene-classification-results#Schmidt2023	55.6 (55.3 - 55.9)	19910080	68456
19	Schmidt_FAU_task1_3	Lorenz Schmidt	International Audio Laboratories, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany	task-low-complexity-acoustic-scene-classification-results#Schmidt2023	52.7 (52.4 - 53.0)	9996775	74700
24	Schmidt_FAU_task1_4	Lorenz Schmidt	International Audio Laboratories, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany	task-low-complexity-acoustic-scene-classification-results#Schmidt2023	49.7 (49.4 - 50.0)	4938255	34616
33	Tan_NTU_task1_1	Ee-Leng Tan	EEE, Nanyang Technological University, Singapore, Singapore	task-low-complexity-acoustic-scene-classification-results#Tan2023	47.1 (46.8 - 47.4)	2960384	40960
35	Tan_NTU_task1_2	Ee-Leng Tan	EEE, Nanyang Technological University, Singapore, Singapore	task-low-complexity-acoustic-scene-classification-results#Tan2023	48.5 (48.2 - 48.8)	6462656	54304
37	Tan_NTU_task1_3	Ee-Leng Tan	EEE, Nanyang Technological University, Singapore, Singapore	task-low-complexity-acoustic-scene-classification-results#Tan2023	46.3 (46.1 - 46.6)	6462656	54304
4	Tan_SCUT_task1_1	Jiaxin Tan	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Tan2023a	60.8 (60.6 - 61.1)	13180000	73386
30	Tan_SCUT_task1_2	Jiaxin Tan	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Tan2023a	51.7 (51.4 - 52.0)	13180000	73386
16	Tan_SCUT_task1_3	Jiaxin Tan	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Tan2023a	53.5 (53.2 - 53.8)	13180000	73386
32	Tan_SCUT_task1_4	Jiaxin Tan	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Tan2023a	50.9 (50.6 - 51.2)	13180000	73386
52	Vo_DU_task1_1	Quoc Vo	Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA	task-low-complexity-acoustic-scene-classification-results#Vo2023	45.0 (44.7 - 45.2)	15600000	503316
53	Vo_DU_task1_2	Quoc Vo	Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA	task-low-complexity-acoustic-scene-classification-results#Vo2023	44.8 (44.5 - 45.1)	15600000	503316
50	Vo_DU_task1_3	Quoc Vo	Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA	task-low-complexity-acoustic-scene-classification-results#Vo2023	45.2 (44.9 - 45.5)	15600000	503316
49	Vo_DU_task1_4	Quoc Vo	Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA	task-low-complexity-acoustic-scene-classification-results#Vo2023	45.5 (45.2 - 45.8)	15600000	503316
31	Wang_SCUT_task1_1	Yanxiong Li	South China University of Technology, China, GuangZhou	task-low-complexity-acoustic-scene-classification-results#Wang2023	49.1 (48.9 - 49.4)	8646000	62080
18	Wang_SCUT_task1_2	Yanxiong Li	South China University of Technology, China, GuangZhou	task-low-complexity-acoustic-scene-classification-results#Wang2023	52.9 (52.6 - 53.2)	16746000	81150
46	Wang_SCUT_task1_3	Yanxiong Li	South China University of Technology, China, GuangZhou	task-low-complexity-acoustic-scene-classification-results#Wang2023	47.1 (46.8 - 47.4)	25442000	82280
48	Wang_SCUT_task1_4	Yanxiong Li	South China University of Technology, China, GuangZhou	task-low-complexity-acoustic-scene-classification-results#Wang2023	48.5 (48.2 - 48.8)	20902000	148618
45	XuQianHu_BIT&NUDT_task1_1	Bin Hu	Ministry of Education (Beijing Institute of Technology), Key Laboratory of Brain Health Intelligent Evaluation and Intervention, P.R. China; Beijing Institute of Technology, School of Medical Technology, P.R. China	task-low-complexity-acoustic-scene-classification-results#Yu2023	51.0 (50.7 - 51.3)	23803968	125885
44	XuQianHu_BIT&NUDT_task1_2	Bin Hu	Ministry of Education (Beijing Institute of Technology), Key Laboratory of Brain Health Intelligent Evaluation and Intervention, P.R. China; Beijing Institute of Technology, School of Medical Technology, P.R. China	task-low-complexity-acoustic-scene-classification-results#Yu2023	51.6 (51.3 - 51.9)	28400320	123654
41	XuQianHu_BIT&NUDT_task1_3	Bin Hu	Ministry of Education (Beijing Institute of Technology), Key Laboratory of Brain Health Intelligent Evaluation and Intervention, P.R. China; Beijing Institute of Technology, School of Medical Technology, P.R. China	task-low-complexity-acoustic-scene-classification-results#Yu2023	50.0 (49.8 - 50.3)	13402688	125650
36	XuQianHu_BIT&NUDT_task1_4	Bin Hu	Ministry of Education (Beijing Institute of Technology), Key Laboratory of Brain Health Intelligent Evaluation and Intervention, P.R. China; Beijing Institute of Technology, School of Medical Technology, P.R. China	task-low-complexity-acoustic-scene-classification-results#Yu2023	51.1 (50.9 - 51.4)	11878580	125057
15	Yang_GZHU_task1_1	Liu Yang	School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Weng2023	55.5 (55.3 - 55.8)	23970000	76906
14	Yang_GZHU_task1_2	Liu Yang	School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Weng2023	55.9 (55.6 - 56.2)	23970000	76906
21	Yang_GZHU_task1_3	Liu Yang	School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Weng2023	55.1 (54.8 - 55.4)	23970000	76906
19	Yang_GZHU_task1_4	Liu Yang	School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China	task-low-complexity-acoustic-scene-classification-results#Weng2023	55.2 (54.9 - 55.5)	23970000	76906
49	Zhang_NCUT_task1_1	Menglong Wu	Electronic and Communication Engineering, North China University of Technology, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Zhang2023	43.3 (43.0 - 43.6)	7375000	574464
51	Zhang_NCUT_task1_2	Menglong Wu	Electronic and Communication Engineering, North China University of Technology, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Zhang2023	47.9 (47.6 - 48.2)	28461000	1622016
26	Zhang_SATLab_task1_1	Wei-Qiang Zhang	Department of Electronic Engineering, Tsinghua University, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Bing2023	48.8 (48.5 - 49.1)	3972096	7946
34	Zhang_SATLab_task1_2	Wei-Qiang Zhang	Department of Electronic Engineering, Tsinghua University, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Bing2023	50.0 (49.7 - 50.2)	19466240	46232
27	Zhang_SATLab_task1_3	Wei-Qiang Zhang	Department of Electronic Engineering, Tsinghua University, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Bing2023	51.9 (51.6 - 52.2)	23438336	54178
22	Zhang_SATLab_task1_4	Wei-Qiang Zhang	Department of Electronic Engineering, Tsinghua University, Beijing, China	task-low-complexity-acoustic-scene-classification-results#Bing2023	50.3 (50.0 - 50.6)	7944192	15892

Complete results and technical reports can be found on the results page.

Baseline system

The baseline system is the same one used in DCASE 2022 Task 1. It implements a convolutional neural network (CNN) approach using log mel-band energies extracted from each 1-second signal. The network consists of three CNN layers and one fully connected layer that assigns scene labels to the audio signals. The system is based on the DCASE 2021 Subtask A baseline system. With TFLite quantization, the model size of the baseline is 46.512 KB, and its MACS count is 29.23 M.

Repository

DCASE2022 Task 1 baseline, repository

Parameters

Audio features:
- Log mel-band energies (40 bands), analysis frame 40 ms (50% hop size)
Neural network:
- Input shape: 40 * 51 (1 second)
- Architecture:
  - CNN layer #1: 2D Convolutional layer (filters: 16, kernel size: 7) + Batch normalization + ReLU activation
  - CNN layer #2: 2D Convolutional layer (filters: 16, kernel size: 7) + Batch normalization + ReLU activation, 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
  - CNN layer #3: 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLU activation, 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
  - Flatten
  - Dense layer #1: Dense layer (units: 100, activation: ReLU), Dropout (rate: 30%)
  - Output layer (activation: softmax)
- Learning: 200 epochs (batch size 16), data shuffling between epochs
  - Optimizer: Adam (learning rate 0.001)
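
The input shape and the order of magnitude of the MACS count can be checked with a few lines of arithmetic. The following is a rough sketch, not the official counting procedure. It assumes centre-padded framing (one frame per hop plus one, as in common STFT implementations), "same" padding on all convolutions, and pooling that collapses the remaining time axis to a single value. Batch normalization, biases, and activations are ignored. The function names are hypothetical.

```python
def num_frames(sample_rate, duration_s=1.0, frame_s=0.040, hop_ratio=0.5):
    """Frame count for centre-padded framing (assumption): 1 + samples // hop."""
    hop = int(round(frame_s * hop_ratio * sample_rate))  # 50% hop of a 40 ms frame
    n_samples = int(round(duration_s * sample_rate))
    return 1 + n_samples // hop


def conv_macs(height, width, in_ch, out_ch, kernel=7):
    """MACs of a 'same'-padded 2D convolution (assumption):
    one kernel*kernel*in_ch dot product per output value."""
    return height * width * kernel * kernel * in_ch * out_ch


def baseline_macs(n_mels=40, n_frames=51, n_classes=10):
    """Approximate MACs of the listed architecture (convolutions and dense layers only)."""
    h, w = n_mels, n_frames
    total = conv_macs(h, w, 1, 16)        # CNN layer #1
    total += conv_macs(h, w, 16, 16)      # CNN layer #2
    h, w = h // 5, w // 5                 # max pooling (5, 5)
    total += conv_macs(h, w, 16, 32)      # CNN layer #3
    # Max pooling (4, 100): ceil division, assuming the pooling pads so that
    # the short time axis collapses to one value.
    h, w = -(-h // 4), -(-w // 100)
    flat = h * w * 32                     # Flatten
    total += flat * 100                   # Dense layer #1
    total += 100 * n_classes              # Output layer
    return total


if __name__ == "__main__":
    frames = num_frames(44100)
    print("frames per 1-second clip:", frames)
    print("approximate MACs:", baseline_macs(40, frames))
```

Under these assumptions a 1-second clip yields 51 frames, and the estimate comes to 29,203,560 MACs, close to but not identical with the 29.23 M reported above. The difference is expected, since this sketch skips operations that a full counting tool includes.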

Model selection:

- Approximately 30% of the original training data is assigned to the validation set. The split is done so that the training and validation sets do not have segments from the same location and both sets have data from each city.
- Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected.
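
A location-disjoint split of this kind can be sketched as follows. This is an illustration only, not the procedure used for the baseline. It assumes each item is available as a hypothetical `(city, location_id, filename)` tuple, and it holds out roughly 30% of the locations per city, which only approximates 30% of the data when locations differ in size.

```python
import random
from collections import defaultdict


def split_by_location(items, val_fraction=0.3, seed=0):
    """Split (city, location_id, filename) tuples so that no location is shared
    between the training and validation sets and every city with at least two
    locations appears in both."""
    rng = random.Random(seed)
    by_city = defaultdict(lambda: defaultdict(list))
    for city, location, filename in items:
        by_city[city][location].append(filename)

    train, val = [], []
    for city in sorted(by_city):
        locations = sorted(by_city[city])
        rng.shuffle(locations)
        # Hold out about val_fraction of the locations, but keep at least one
        # location on each side. A city with a single location stays in training.
        n_val = min(max(1, round(val_fraction * len(locations))), len(locations) - 1)
        for i, location in enumerate(locations):
            target = val if i < n_val else train
            target.extend((city, location, f) for f in by_city[city][location])
    return train, val


if __name__ == "__main__":
    # Hypothetical toy data: 2 cities, 10 locations each, 3 segments per location.
    demo = [(c, f"{c}-{l}", f"{c}-{l}-{s}.wav")
            for c in ("helsinki", "lyon") for l in range(10) for s in range(3)]
    train, val = split_by_location(demo)
    print(len(train), "training items,", len(val), "validation items")
```

Grouping by location before splitting is what prevents segments cut from the same recording site from leaking across the two sets.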

Results for the development dataset

Results for the DCASE2022 baseline were calculated using TensorFlow in GPU mode (on an Nvidia Tesla V100 GPU card). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables. Detailed results for the DCASE2022 baseline:

Scene label	Log loss	Device-wise log losses									Accuracy
		A	B	C	S1	S2	S3	S4	S5	S6
Airport	1.534	1.165	1.439	1.475	1.796	1.653	1.355	1.608	1.734	1.577	39.4 %
Bus	1.758	1.073	1.842	1.206	1.790	1.580	1.681	2.202	2.152	2.293	29.3 %
Metro	1.382	0.898	1.298	1.183	2.008	1.459	1.288	1.356	1.777	1.166	47.9 %
Metro station	1.672	1.582	1.641	1.833	2.010	1.857	1.613	1.643	1.627	1.247	36.0 %
Park	1.448	0.572	0.513	0.725	1.615	1.130	1.678	2.314	1.875	2.613	58.9 %
Public square	2.265	1.442	1.862	1.998	2.230	2.133	2.157	2.412	2.831	3.318	20.8 %
Shopping mall	1.385	1.293	1.291	1.354	1.493	1.292	1.424	1.572	1.245	1.497	51.4 %
Street, pedestrian	1.822	1.263	1.731	1.772	1.540	1.805	1.869	2.266	1.950	2.205	30.1 %
Street, traffic	1.025	0.830	1.336	1.023	0.708	1.098	1.147	0.957	0.634	1.489	70.6 %
Tram	1.462	0.973	1.434	1.169	1.017	1.579	1.098	1.805	2.176	1.903	44.6 %
Overall averaged over 10 iterations	1.575 (± 0.018)	1.109	1.439	1.374	1.621	1.559	1.531	1.813	1.800	1.931	42.9 % (± 0.770)

The class-wise and device-wise log losses are calculated taking into account only the test items belonging to the considered class or device (in the class-wise case, splitting the classification task into ten different sub-problems), while the overall log loss is calculated taking into account all test items. Devices S4-S6 are used only for testing, not for training the system.
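
The distinction between overall and subset-wise log loss can be sketched in plain Python. This is a minimal illustration, assuming that predictions are available as dictionaries mapping scene labels to probabilities. The function names and the toy data are hypothetical and are not taken from the baseline code.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.
    Probabilities are clipped to [eps, 1] to avoid log(0)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total += -math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


def subset_log_loss(y_true, probs, keys, key):
    """Log loss over only the items whose key (scene label or device) matches.
    Assumes at least one item matches."""
    idx = [i for i, k in enumerate(keys) if k == key]
    return log_loss([y_true[i] for i in idx], [probs[i] for i in idx])


if __name__ == "__main__":
    # Hypothetical toy predictions for three test items.
    y_true = ["park", "park", "tram"]
    devices = ["A", "S4", "A"]
    probs = [
        {"park": 0.5, "tram": 0.5},
        {"park": 0.25, "tram": 0.75},
        {"park": 0.5, "tram": 0.5},
    ]
    print("overall:", log_loss(y_true, probs))
    print("class 'park':", subset_log_loss(y_true, probs, y_true, "park"))
    print("device 'A':", subset_log_loss(y_true, probs, devices, "A"))
```

Passing the true labels as `keys` gives the class-wise value, and passing the device identifiers gives the device-wise value. The overall value uses every item.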

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

If you are using the audio dataset, please cite the following paper:

Publication

Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60. 2020. URL: https://arxiv.org/abs/2005.14623.

Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.

If you are participating in the task, please cite the following paper:

Publication

Irene Martín-Morató, Francesco Paissan, Alberto Ancilotto, Toni Heittola, Annamaria Mesaros, Elisabetta Farella, Alessio Brutti, and Tuomas Virtanen. Low-complexity acoustic scene classification in DCASE 2022 Challenge. 2022. URL: https://arxiv.org/abs/2206.03835, doi:10.48550/ARXIV.2206.03835.

Low-complexity acoustic scene classification in DCASE 2022 Challenge

Abstract

This paper analyzes the outcome of the Low-Complexity Acoustic Scene Classification task in DCASE 2022 Challenge. The task is a continuation from the previous years. In this edition, the requirement for low-complexity solutions were modified including: a limit of 128 K on the number of parameters, including the zero-valued ones, imposed INT8 numerical format, and a limit of 30 million multiply-accumulate operations at inference time. The provided baseline system is a convolutional neural network which employs post-training quantization of parameters, resulting in 46512 parameters, and 29.23 million multiply-and-accumulate operations, well under the set limits of 128K and 30 million, respectively. The baseline system has a 42.9% accuracy and a log-loss of 1.575 on the development data consisting of audio from 9 different devices. An analysis of the submitted systems will be provided after the challenge deadline.

Coordinators

- Annamaria Mesaros, Tampere University
- Irene Martin Morato, Tampere University
- Francesco Paissan, Bruno Kessler Foundation
- Alberto Ancilotto, Bruno Kessler Foundation
- Elisabetta Farella, Bruno Kessler Foundation
- Alessio Brutti, Bruno Kessler Foundation
- Toni Heittola, Tampere University
- Tuomas Virtanen, Tampere University
