The goal of acoustic scene classification is to classify a test recording into one of the predefined ten acoustic scene classes. This task is a continuation of the Acoustic Scene Classification task from previous DCASE Challenge editions, with some changes that bring new research problems into focus.

The challenge has ended. Full results for this task can be found on the subtask-specific result pages: Task1A Task1B

If you are interested in the task, you can join us on the dedicated Slack channel.

We provide two different setups of the acoustic classification problem:

Low-Complexity Acoustic Scene Classification with Multiple Devices
Subtask A

Classification of data from multiple devices (real and simulated) targeting generalization properties of systems across a number of different devices and focusing on low-complexity solutions.

Audio-Visual Scene Classification
Subtask B

Classification of audio and video data, targeting learning of complementary information from different modalities, focusing on development of complex methods without restrictions on size or approach.

Subtask A

Low-Complexity Acoustic Scene Classification with Multiple Devices

This subtask is concerned with the basic problem of acoustic scene classification, in which it is required to classify a test audio recording into one of ten known acoustic scene classes. This task targets generalization across a number of different devices, and will use audio data recorded and simulated with a variety of devices. The task also targets low complexity solutions for the classification problem in terms of model size.

Figure 1: Overview of acoustic scene classification system.

Audio dataset

The development dataset for this task is TAU Urban Acoustic Scenes 2020 Mobile, development dataset. The dataset contains recordings from 12 European cities in 10 different acoustic scenes using 4 different devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set.

Recordings were made using four devices that captured audio simultaneously. The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder using 48kHz sampling rate and 24-bit resolution, referred to as device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Acoustic scenes (10):

Airport - airport
Indoor shopping mall - shopping_mall
Metro station - metro_station
Pedestrian street - street_pedestrian
Public square - public_square
Street with medium level of traffic - street_traffic
Travelling by a tram - tram
Travelling by a bus - bus
Travelling by an underground metro - metro
Urban park - park

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm and Vienna.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf.

A multi-device dataset for urban acoustic scene classification

Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data

Additionally, 11 mobile devices S1-S11 are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is processed through convolution with the selected Si impulse response, then processed with a selected set of device-specific parameters for dynamic range compression. The impulse responses are proprietary data and will not be published.
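The simulation chain described above (convolution with an impulse response, followed by dynamic range compression) can be illustrated with a minimal sketch. The actual impulse responses and compression parameters are not published, so the impulse response, the `threshold` and `ratio` values, and the simple static compressor below are all assumptions chosen only for illustration; they are not the processing used to create the dataset.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def compress(signal, threshold=0.5, ratio=4.0):
    """Static compressor (assumed form): the part of the magnitude
    above `threshold` is divided by `ratio`."""
    out = []
    for x in signal:
        magnitude = abs(x)
        if magnitude > threshold:
            magnitude = threshold + (magnitude - threshold) / ratio
        out.append(magnitude if x >= 0 else -magnitude)
    return out


def simulate_device(signal, impulse_response, threshold=0.5, ratio=4.0):
    """Device-A audio -> convolution with device IR -> compression."""
    return compress(convolve(signal, impulse_response), threshold, ratio)


# Hypothetical two-tap impulse response applied to a short test signal.
simulated = simulate_device([1.0, 0.0, -1.0], [0.5, 0.25])
```

A real implementation would operate on full recordings with an FFT-based convolution and a time-varying compressor; the nested loop here only keeps the example dependency-free.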

The development dataset comprises 40 hours of data from device A, and smaller amounts from the other devices. Audio is provided in a single-channel 44.1kHz 24-bit format.

Task setup

Development dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Data from devices B, C, and S1-S6 consists of randomly selected segments from the simultaneous recordings, therefore all overlap with the data from device A, but not necessarily with each other. The total amount of audio in the development set is 64 hours.

The dataset is provided with a training/test split in which 70% of the data for each device is included for training, 30% for testing. Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and training/test split are provided in the following table.

Devices		Dataset		Cross-validation setup
Name	Type	Total duration	Total segments	Train segments	Test segments	Notes
A	Real	40h	14400	10215	330	3855 Segments not used in train/test split
B C	Real	3h each	1080	749 + 748	2 * 329
S1 S2 S3	Simulated	3h each	1080	3 * 750	3 * 330
S4 S5 S6	Simulated	3h each	1080	-	3 * 330	3 * 750 segments not used in train/test split
Total		64h	23040	13962	2970

Participants are required to report the performance of their system using this train/test setup in order to allow a comparison of systems on the development set. Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to the segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all the files having the same location id are placed on the same side of the evaluation.
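A location-aware split can be sketched as follows. The file name pattern is taken from the description above; the example file names, the 30% validation fraction, and the decision to treat a location as the combination of scene, city, and location id are assumptions made for illustration.

```python
import random
from collections import defaultdict


def parse_filename(filename):
    """Split '[scene label]-[city]-[location id]-[segment id]-[device id].wav'."""
    stem = filename.rsplit(".", 1)[0]
    scene, city, location_id, segment_id, device_id = stem.split("-")
    return {"scene": scene, "city": city, "location": location_id,
            "segment": segment_id, "device": device_id}


def split_by_location(filenames, validation_fraction=0.3, seed=0):
    """Assign whole locations to either the training or the validation side."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_filename(name)
        # Assumption: a location is identified by scene, city and location id.
        groups[(info["scene"], info["city"], info["location"])].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_validation = max(1, round(len(keys) * validation_fraction))
    validation_keys = set(keys[:n_validation])
    train = [f for k in keys if k not in validation_keys for f in groups[k]]
    validation = [f for k in keys if k in validation_keys for f in groups[k]]
    return train, validation


# Hypothetical file names following the documented pattern.
files = [
    "airport-helsinki-1-10-a.wav",
    "airport-helsinki-1-11-b.wav",
    "bus-lyon-2-20-a.wav",
    "park-paris-3-30-s1.wav",
    "park-paris-3-31-a.wav",
    "tram-vienna-4-40-c.wav",
]
train_files, validation_files = split_by_location(files)
```

Because the split is made over locations rather than individual files, segments recorded simultaneously by different devices at one location always stay together.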

Evaluation dataset

The evaluation dataset contains data from 12 cities, 10 acoustic scenes, 11 devices. There are five new devices (not available in the development set): a real device D and simulated devices S7-S11. Evaluation data contains 22 hours of audio. The evaluation data contains audio recorded at different locations than the development data.

Device and city information is not provided in the evaluation set. The systems are expected to be robust to different devices.

System complexity requirements

A model complexity limit of 128 KB is set for the non-zero parameters. This translates into 32768 parameters when using float32 (32-bit float), which is often the default data type: 32768 parameter values * 32 bits per parameter / 8 bits per byte = 131072 bytes = 128 KB (kibibytes).
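The arithmetic behind the limit can be written out as a short sketch. The function names are hypothetical; only the 128 KB limit and the bits-per-parameter reasoning come from the text above.

```python
LIMIT_BYTES = 128 * 1024  # 128 KB (kibibytes) = 131072 bytes


def max_parameters(limit_bytes=LIMIT_BYTES, bits_per_parameter=32):
    """Largest number of non-zero parameters that fits in the size limit."""
    return limit_bytes * 8 // bits_per_parameter


def model_size_kb(non_zero_parameters, bits_per_parameter=32):
    """Model size in KB for a given non-zero parameter count and data type."""
    return non_zero_parameters * bits_per_parameter / 8 / 1024


budget_float32 = max_parameters(bits_per_parameter=32)  # 32768
budget_float16 = max_parameters(bits_per_parameter=16)  # 65536
fits = model_size_kb(46246, bits_per_parameter=16) <= 128  # True
```

Halving the number of bits per parameter doubles the parameter budget, which is why quantization and sparsity are both viable ways to meet the limit.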

By limiting the size of the model on disk, we allow participants some flexibility in design. For example, some implementations may minimize the number of non-zero parameters of the network in order to comply with this size limit, while others may represent the model parameters with a low number of bits. There is neither a requirement nor a recommendation on which method should be used to minimize the model size.

The computational complexity of the feature extraction stage is not included in this limit because there is no established method for estimating and comparing complexity of different low-level feature extraction implementations. We therefore exclude it in order to keep the complexity estimation straightforward across approaches. Some implementations may use a feature extraction layer as the first layer in the neural network - in this case the limit is applied only to the following layers, in order to exclude the feature calculation as if it were a separate processing block. However, in case of using learned features (so-called embeddings, like VGGish, OpenL3 or EdgeL3), the network used to generate them counts in the calculated model size.

Full information about the model size should be provided in the technical report.

Model size calculation

We offer a script for calculating the model size for Keras based models along with the baseline system. If you have any doubts about how to calculate the model size, please contact toni.heittola@tuni.fi or use the DCASE community forum or Slack channel for visibility.
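For orientation, the counting principle can be sketched without any deep learning framework. This is not the official script; the layer dictionaries, the `skip` flag for feature extraction layers, and the nested-list weight format are assumptions made for illustration.

```python
def count_non_zero(values):
    """Recursively count non-zero numbers in nested lists or tuples."""
    if isinstance(values, (list, tuple)):
        return sum(count_non_zero(v) for v in values)
    return 1 if values != 0 else 0


def model_size_bytes(layers):
    """Sum the size of the non-zero parameters over all counted layers.

    Each layer is a dict with 'weights' (nested lists), 'bits' (bits per
    parameter) and an optional 'skip' flag for feature extraction layers.
    """
    total = 0.0
    for layer in layers:
        if layer.get("skip", False):
            continue  # feature extraction is excluded from the limit
        total += count_non_zero(layer["weights"]) * layer["bits"] / 8
    return total


# Hypothetical toy model: a skipped feature layer and one sparse dense layer.
layers = [
    {"name": "melspectrogram", "weights": [[1.0, 2.0]], "bits": 32, "skip": True},
    {"name": "dense", "weights": [[0.5, 0.0], [0.0, -1.5]], "bits": 16},
]
size = model_size_bytes(layers)  # 2 non-zero values * 16 bits / 8 = 4.0 bytes
```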

Calculation examples

System 1: DCASE2020 Task 1 Baseline, Subtask A 19.14 MB

Total model size: 17.89 MB (Audio embeddings) + 1.254 MB (Acoustic model) = 19.14 MB

Audio embeddings (OpenL3)

Layer	Parameters	Non-zero parameters	Data type	Size (non-zero)	Note
input_1	0	0	float32	0 KB
melspectrogram_1	4 460 800	4 196 335	float32	16.01 MB	Skipped
batch_normalization_1	4	4	float32	16 bytes
conv2d_1	640	640	float32	2.5 KB
batch_normalization_2	256	256	float32	1 KB
activation_1	0	0	float32	0 KB
conv2d_2	36 928	36 928	float32	144.2 KB
batch_normalization_3	256	256	float32	1 KB
activation_2	0	0	float32	0 KB
max_pooling2d_1	0	0	float32	0 KB
conv2d_3	73 856	73 856	float32	288.5 KB
batch_normalization_4	512	512	float32	2 KB
activation_3	0	0	float32	0 KB
conv2d_4	147 584	147 584	float32	576.5 KB
batch_normalization_5	512	512	float32	2 KB
activation_4	0	0	float32	0 KB
max_pooling2d_2	0	0	float32	0 KB
conv2d_5	295 168	295 168	float32	1.126 MB
batch_normalization_6	1024	1024	float32	4 KB
activation_5	0	0	float32	0 KB
conv2d_6	590 080	590 080	float32	2.251 MB
batch_normalization_7	1024	1024	float32	4 KB
activation_6	0	0	float32	0 KB
max_pooling2d_3	0	0	float32	0 KB
conv2d_7	1 180 160	1 180 160	float32	4.502 MB
batch_normalization_8	2048	2048	float32	8 KB
activation_7	0	0	float32	0 KB
audio_embedding_layer	2 359 808	2 359 808	float32	9.002 MB
max_pooling2d_4	0	0	float32	0 KB
flatten_1	0	0	float32	0 KB
Total	4 689 860	4 689 860		17.89 MB (4689860 * 32bit / 8bits per byte / 1024 / 1024)

Acoustic model

Layer	Parameters	Non-zero parameters	Size (non-zero)
dense_1	262 656	262 557	1.002 MB
dense_2	65 664	65 664	256.5 KB
dense_3	387	387	1.512 KB
Total	328 707	328 608	1.254 MB

System 2: DCASE2020 Task 1 Baseline, Subtask B 451.5 KB

Total model size: 0 KB (Audio embeddings) + 451.5 KB (Acoustic model) = 451.5 KB

Acoustic model

Layer	Parameters	Non-zero parameters	Data type	Size (non-zero)
conv2d_1	1600	1600	float32	6.25 KB
batch_normalization_1	128	128	float32	512 bytes
activation_1	0	0	float32	0 KB
max_pooling2d_1	0	0	float32	0 KB
dropout_1	0	0	float32	0 KB
conv2d_2	100 416	100 416	float32	392.2 KB
batch_normalization_2	256	256	float32	1 KB
activation_2	0	0	float32	0 KB
max_pooling2d_2	0	0	float32	0 KB
dropout_2	0	0	float32	0 KB
flatten_1	0	0	float32	0 KB
dense_1	12 900	12 900	float32	50.39 KB
dropout_3	0	0	float32	0 KB
dense_2	303	303	float32	1.184 KB
Total	115 603	115 603		451.5 KB (115603 * 32bit / 8bits per byte / 1024)

System 3: DCASE2021 Task 1 Baseline, Subtask A 90.3 KB

Total model size: 0 KB (Audio embeddings) + 90.3 KB (Acoustic model) = 90.3 KB

Acoustic model

Layer	Parameters	Non-zero parameters	Data type	Size (non-zero)
conv2d_1	800	800	float16	1.56 KB
batch_normalization_1	64	64	float16	128 bytes
activation_1	0	0		0 KB
conv2d_2	12 560	12 560	float16	24.53 KB
batch_normalization_2	64	64	float16	128 bytes
activation_2	0	0		0 KB
max_pooling2d_1	0	0		0 KB
dropout_1	0	0		0 KB
conv2d_3	25 120	25 120	float16	49.06 KB
batch_normalization_3	128	128	float16	256 bytes
activation_3	0	0		0 KB
max_pooling2d_2	0	0		0 KB
dropout_2	0	0		0 KB
flatten_1	0	0		0 KB
dense_1	6 500	6 500	float16	12.69 KB
dropout_3	0	0		0 KB
dense_2	1010	1010	float16	1.97 KB
Total	46 246	46 246		90.3 KB (46246 * 16bit / 8bits per byte / 1024)

Reference labels

Reference labels are provided only for the development datasets. Reference labels for evaluation dataset will not be released. For publications based on the DCASE challenge data, please use the provided training/test setup of the development set, to allow comparisons. After the challenge, if you want to evaluate your proposed system with official challenge evaluation setup, contact the task coordinators. Task coordinators can provide unofficial scoring for a limited amount of system outputs.

Download

TAU Urban Acoustic Scenes 2020 Mobile, Development dataset (30.5 GB)

version 2.0

TAU Urban Acoustic Scenes 2021 Mobile, Evaluation dataset (8.9 GB)

Subtask B

Audio-Visual Scene Classification

This subtask is concerned with classification using audio and video modalities. Since audio-visual machine learning has gained popularity in the last years, we aim to provide a multidisciplinary task that may attract researchers from the machine vision community.

We impose no restrictions on the modality or combinations of modalities used in the system. We encourage participants to also submit single-modality systems (audio-only or video-only methods for scene classification).

Figure 2: Overview of audio-visual scene classification system.

Audio-Visual dataset

The dataset for this task is TAU Urban Audio-Visual Scenes 2021. The dataset contains synchronized audio and video recordings from 12 European cities in 10 different scenes.

The audio part is a subset of TAU Urban Acoustic Scenes 2020. For complete details on the data recording and processing see:

Publication

Shanshan Wang, Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A curated dataset of urban scenes for audio-visual scene analysis. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. accepted. URL: https://arxiv.org/abs/2011.00030.

A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

Abstract

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.4% accuracy compared to 76.8% for the audio-only and 70.0% for the video-only equivalent systems.

Keywords

Audio-visual data, Scene analysis, Acoustic scene, Pattern recognition, Transfer learning

The provided audio is recorded using a Soundman OKM II Klassik/studio A3, electret binaural microphone and a Zoom F8 audio recorder using 48kHz sampling rate and 24-bit resolution. The provided video is recorded using a GoPro Hero5 Session. Faces and licence plates in the video were blurred during the data postprocessing stage.

Data was recorded in the following scenes (10):

Airport - airport
Indoor shopping mall - shopping_mall
Metro station - metro_station
Pedestrian street - street_pedestrian
Public square - public_square
Street with medium level of traffic - street_traffic
Travelling by a tram - tram
Travelling by a bus - bus
Travelling by an underground metro - metro
Urban park - park

Data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm and Vienna.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Task setup

Development dataset

The development set contains audio and video data from 10 cities. The total amount of audio in the development set is 34 hours. The dataset is provided with a training/test split.

Provided files have a length of 10 seconds. A classification decision is required for 1 second segments. Participants are allowed to implement this in any way they want, by splitting the 10-second files into individual 1-second files, or just providing 10 labels independently within the longer file. In the baseline system, the latter method is used. Participants are required to report the performance of their system using this train/test setup in order to allow comparison of systems on the development set.
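The first option, splitting a 10-second file into 1-second segments, can be sketched as below. The function name is hypothetical and the audio is represented as a plain list of samples so that the example does not depend on any audio library; the sampling rate is passed in as a parameter rather than assumed.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.0):
    """Cut a sequence of samples into consecutive, non-overlapping segments.

    Trailing samples that do not fill a whole segment are dropped.
    """
    segment_length = int(round(sample_rate * segment_seconds))
    n_segments = len(samples) // segment_length
    return [
        samples[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments)
    ]


# Toy example: a "10-second" signal at a sampling rate of 8 samples per second.
toy_signal = list(range(80))
segments = split_into_segments(toy_signal, sample_rate=8)
# One classification decision would then be produced per segment.
```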

Participants are allowed to create their own cross-validation folds or a separate validation set. In this case, please pay attention to the segments recorded at the same location. The location identifier can be found in the metadata file provided in the dataset or in the audio file names:

[scene label]-[city]-[location id]-[segment id].wav

Make sure that all files having the same location id are placed on the same side of the evaluation.

Evaluation dataset

The evaluation set contains data from 12 cities (2 cities unseen in the development set). Evaluation data contains 20 hours of material (audio and video). Provided files have a length of 1 second. Classification decision is required at file level.

Reference labels

Download

TAU Urban Audio Visual Scenes 2021, Development dataset (107.7 GB)

TAU Urban Audio Visual Scenes 2021, Evaluation dataset (61.3 GB)

External data resources

Use of external data and transfer learning is allowed in all subtasks under the following conditions:

The used external resource is clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets or trained models. The data must be public and freely available before 1st of April 2021.
The list of external data sources used in training must be clearly indicated in the technical report.
Participants inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the task coordinators; we will update the list of external datasets on the webpage accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources allowed).
It is not allowed to use TUT Urban Acoustic Scenes 2018, TAU Urban Acoustic Scenes 2019, TAU Urban Acoustic Scenes 2019 Mobile, TAU Urban Acoustic Scenes 2020 Mobile, or TAU Urban Acoustic Scenes 2020 3Class. These datasets are partially included in the current setup, and additional usage will lead to overfitting. Please note that for the DCASE2021 challenge, the audio of subtask 1B overlaps completely with the audio for subtask 1A (device A).

List of external data resources allowed:

Dataset name	Type	Added	Link
LITIS Rouen audio scene dataset	audio	04.03.2019	https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task	audio	04.03.2019	https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task	audio	04.03.2019	https://archive.org/details/dcase2013_scene_classification_testset
AudioSet	audio, video	04.03.2019	https://research.google.com/audioset/
OpenL3	model	12.02.2020	https://openl3.readthedocs.io/
EdgeL3	model	12.02.2020	https://edgel3.readthedocs.io/
VGGish	model	12.02.2020	https://github.com/tensorflow/models/tree/master/research/audioset/vggish
SoundNet	model	03.06.2020	http://soundnet.csail.mit.edu/
CIFAR-100	image	1.3.2021	https://www.cs.toronto.edu/~kriz/cifar.html
CIFAR-10	image	31.3.2021	https://www.cs.toronto.edu/~kriz/cifar.html
ImageNet	image	1.3.2021	http://www.image-net.org/
Resnet50	model	1.3.2021	https://pytorch.org/hub/pytorch_vision_resnet/
EfficientNet	model	1.3.2021	https://github.com/lukemelas/EfficientNet-PyTorch
Indoor	image	18.3.2021	http://web.mit.edu/torralba/www/indoor.html
Places365	image	18.3.2021	http://places2.csail.mit.edu/download.html
Urban-SED	audio	31.3.2021	http://urbansed.weebly.com/
Pytorch CIFAR Models	model	31.3.2021	https://github.com/chenyaofo/pytorch-cifar-models
Places365-CNNs	model	31.3.2021	https://github.com/CSAILVision/places365
Pytorch pretrained Models on ImageNet	model	31.3.2021	https://pytorch.org/vision/stable/models.html
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition	model	31.3.2021	https://zenodo.org/record/3987831
Problem Agnostic Speech Encoder (PASE) Model	model	31.3.2021	https://github.com/santi-pdp/pase
PyTorch Image Models	model	14.5.2021	https://github.com/rwightman/pytorch-image-models
CLIP	model	14.5.2021	https://github.com/openai/CLIP
YAMNet	model	20.5.2021	https://github.com/tensorflow/models/tree/master/research/audioset/yamnet

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

Use of external data is allowed. See conditions for external data usage here.
In subtask A, the model size limit applies. See conditions for the model size here.
Manipulation of provided training and development data is allowed (e.g. by mixing data sampled from a probability density function (pdf) or using techniques such as pitch shifting or time stretching).
Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden.
Classification decision must be done independently for each test sample.

Submission

Participants can choose to participate in only one subtask or both. For Subtask B, any combination of modalities can be submitted (audio-only, video-only, audio-video).

Official challenge submission consists of:

System output file (*.csv)
Metadata file (*.yaml)
Technical report explaining in sufficient detail the method (*.pdf)

System output should be presented as a single text-file (in CSV format, with a header row) containing a classification result for each audio file in the evaluation set. In addition, the results file should contain probabilities for each scene class. Result items can be in any order. Multiple system outputs can be submitted (maximum 4 per participant per subtask).

For each system, meta information should be provided in a separate file, containing the task-specific information. This meta information enables fast processing of the submissions and analysis of submitted systems. Participants are advised to fill the meta information carefully while making sure all information is correctly provided.

All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted yaml, submitted system output, and the technical report! Instead of system name you can use a submission label too.

System output file

Both subtasks follow the same system output file format.

Row format:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename	scene_label	airport	bus	metro	metro_station	park	public_square	shopping_mall	street_pedestrian	street_traffic	tram
0.wav	bus	0.25	0.99	0.12	0.32	0.41	0.42	0.23	0.34	0.12	0.45
1.wav	tram	0.25	0.19	0.12	0.32	0.41	0.42	0.23	0.34	0.12	0.85
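A minimal sketch of writing a file in this layout with Python's standard `csv` module is shown below. The function name is hypothetical, and two choices are assumptions rather than requirements from the task description: the predicted label is taken as the class with the highest probability, and probabilities are written with four decimals.

```python
import csv
import io

SCENE_LABELS = [
    "airport", "bus", "metro", "metro_station", "park", "public_square",
    "shopping_mall", "street_pedestrian", "street_traffic", "tram",
]


def write_system_output(handle, results):
    """Write a header row and one tab-separated row per evaluated file.

    `results` is an iterable of (filename, probabilities) pairs, with the
    probabilities given in the same order as SCENE_LABELS.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "scene_label"] + SCENE_LABELS)
    for filename, probabilities in results:
        # Assumption: the reported label is the most probable class.
        best = max(range(len(SCENE_LABELS)), key=lambda i: probabilities[i])
        row = [filename, SCENE_LABELS[best]]
        row += [f"{p:.4f}" for p in probabilities]
        writer.writerow(row)


# Usage with an in-memory buffer; a real submission would open a .csv file.
buffer = io.StringIO()
write_system_output(buffer, [("0.wav", [0.01] * 9 + [0.91])])
```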

Metadata file

Subtask A

Example meta information file for Subtask A baseline system task1/Martin_TAU_task1a_1/Martin_TAU_task1a_1.meta.yaml:

Subtask A / Metadata

# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Martin_TAU_task1a_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2021 baseline system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Martín Morató
      firstname: Irene
      email: irene.martinmorato@tuni.fi           # Contact email address
      corresponding: true                         # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Second author
    - lastname: Heittola
      firstname: Toni
      email: toni.heittola@tuni.fi                # Contact email address

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Third author
    - lastname: Mesaros
      firstname: Annamaria
      email: annamaria.mesaros@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

    # Fourth author
    - lastname: Virtanen
      firstname: Tuomas
      email: tuomas.virtanen@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 44.1kHz

    # Acoustic representation
    # one or multiple labels, e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, ...
    acoustic_features: log-mel energies

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    # one or multiple, e.g. GMM, HMM, SVM, MLP, CNN, RNN, CRNN, ResNet, ensemble, ...
    machine_learning_method: CNN

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    # e.g. 2, 3, 4, 5, ...
    ensemble_method_subsystem_count: !!null

    # Decision making methods
    # e.g. average, majority vote, maximum likelihood, ...
    decision_making: !!null

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: embeddings

    # Method for handling the complexity restrictions
    # e.g. weight quantization, sparsity, ...
    complexity_management: weight quantization

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters: 46246

    # Total amount of non-zero parameters in the acoustic model.
    # Calculated with same principles as "total_parameters".
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters_non_zero: 46246

    # Model size calculated as instructed in task description page.
    # Use numerical value, unit is KB
    model_size: 90.3 # KB

  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets
  external_datasets:
    # Dataset name
    - name: TAU Urban Acoustic Scenes 2020 Mobile, Development dataset

      # Dataset access url
      url: https://doi.org/10.5281/zenodo.3819968

      # Total audio length in minutes
      total_audio_length: 3840            # minutes

  # URL to the source code of the system [optional]
  source_code: https://github.com/marmoi/dcase2021_task1a_baseline

# System results
results:
  development_dataset:
    # System results for development dataset with the provided cross-validation setup.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.

    # Overall metrics
    overall:
      logloss: 1.461
      accuracy: 46.9    # mean of class-wise accuracies

    # Class-wise metrics
    class_wise:
      airport:
        logloss: 1.497
        accuracy: 31.1
      bus:
        logloss: 1.475
        accuracy: 40.1
      metro:
        logloss: 1.457
        accuracy: 48.1
      metro_station:
        logloss: 2.060
        accuracy: 29.6
      park:
        logloss: 1.217
        accuracy: 63.6
      public_square:
        logloss: 1.738
        accuracy: 36.0
      shopping_mall:
        logloss: 1.136
        accuracy: 61.3
      street_pedestrian:
        logloss: 1.522
        accuracy: 47.1
      street_traffic:
        logloss: 1.145
        accuracy: 68.0
      tram:
        logloss: 1.360
        accuracy: 44.3

    # Device-wise
    device_wise:
      a:
        logloss: !!null
        accuracy: 63.9
      b:
        logloss: !!null
        accuracy: 52.2
      c:
        logloss: !!null
        accuracy: 56.3
      s1:
        logloss: !!null
        accuracy: 44.2
      s2:
        logloss: !!null
        accuracy: 43.9
      s3:
        logloss: !!null
        accuracy: 44.5
      s4:
        logloss: !!null
        accuracy: 38.5
      s5:
        logloss: !!null
        accuracy: 40.6
      s6:
        logloss: !!null
        accuracy: 38.2
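
The comment on the overall `accuracy` field says it is the mean of the class-wise accuracies. A minimal Python sketch, with the class-wise values copied from the Subtask A baseline results above, checks that this holds for the reported numbers. That the overall log loss also matches the class-wise mean is an observation about these particular figures (it would follow if every class has the same number of test items), not something the meta file states. The variable and function names are illustrative.

```python
# Class-wise baseline results copied from the Subtask A meta file example:
# scene -> (logloss, accuracy in percent)
class_wise = {
    "airport": (1.497, 31.1),
    "bus": (1.475, 40.1),
    "metro": (1.457, 48.1),
    "metro_station": (2.060, 29.6),
    "park": (1.217, 63.6),
    "public_square": (1.738, 36.0),
    "shopping_mall": (1.136, 61.3),
    "street_pedestrian": (1.522, 47.1),
    "street_traffic": (1.145, 68.0),
    "tram": (1.360, 44.3),
}


def mean(values):
    """Plain arithmetic mean, so every class gets equal weight."""
    values = list(values)
    return sum(values) / len(values)


mean_logloss = mean(ll for ll, _ in class_wise.values())
mean_accuracy = mean(acc for _, acc in class_wise.values())

print(f"logloss:  {mean_logloss:.3f}")   # reported overall value: 1.461
print(f"accuracy: {mean_accuracy:.1f}")  # reported overall value: 46.9
```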

Subtask B

Example meta information file for subtask B baseline system task1/Wang_TAU_task1b_1/Wang_TAU_task1b_1.meta.yaml:

Subtask B / Metadata

# Submission information
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Wang_TAU_task1b_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2021 baseline system

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Wang
      firstname: Shanshan
      email: shanshan.wang@tuni.fi                # Contact email address
      corresponding: true                         # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Second author
    - lastname: Heittola
      firstname: Toni
      email: toni.heittola@tuni.fi                # Contact email address

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Third author
    - lastname: Mesaros
      firstname: Annamaria
      email: annamaria.mesaros@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

    # Fourth author
    - lastname: Virtanen
      firstname: Tuomas
      email: tuomas.virtanen@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:
    # Audio input / channels
    # one or multiple: e.g. mono, binaural, left, right, mixed, ...
    input_channels: mono

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 48.0kHz

    # Acoustic representation
    # one or multiple labels, e.g. MFCC, log-mel energies, spectrogram, CQT, raw waveform, ...
    acoustic_features: log-mel energies

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    audio_embeddings: OpenL3
    visual_embeddings: OpenL3

    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    # one or multiple, e.g. GMM, HMM, SVM, MLP, CNN, RNN, CRNN, ResNet, ensemble, ...
    machine_learning_method: CNN

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    # e.g. 2, 3, 4, 5, ...
    ensemble_method_subsystem_count: !!null

    # How information from the modalities is combined
    # e.g. audio only, video only, early fusion, late fusion
    modality_combination: early fusion

    # Decision making methods
    # e.g. average, majority vote, maximum likelihood, ...
    decision_making: maximum likelihood

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: embeddings

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total amount of parameters used in the model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value.
    total_parameters: 14553134
    # Breakdown of the total (14553134):
    #   audio-visual model parameters: 34186
    #   audio model parameters: 338634
    #   video model parameters: 338634
    #   audio embedding extraction from OpenL3: 9150660
    #   video embedding extraction from OpenL3: 4691020


    # Amount of parameters used in the acoustic model. Indicate it the same way as total_parameters.
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters_audio: 9489294

    # Amount of parameters used in the visual model. Indicate it the same way as total_parameters.
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters_visual: 5029654

  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets
  external_datasets:
    # Dataset name
    - name: TAU Urban Audio-Visual Scenes 2021, Development dataset

      # Dataset access url
      url: https://zenodo.org/record/4477542#.YK3yipMza3A

      # Total audio length in minutes
      total_audio_length: 2040            # minutes

  # URL to the source code of the system [optional]
  source_code: https://github.com/shanwangshan/TAU-urban-audio-visual-scenes

# System results
results:
  development_dataset:
    # System results for development dataset with the provided cross-validation setup.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.

    # Overall metrics
    overall:
      logloss: 0.658
      accuracy: 77.0    # mean of class-wise accuracies

    # Class-wise metrics
    class_wise:
      airport:
        logloss: 0.963
        accuracy: 66.8
      bus:
        logloss: 0.396
        accuracy: 85.9
      metro:
        logloss: 0.541
        accuracy: 80.4
      metro_station:
        logloss: 0.565
        accuracy: 80.8
      park:
        logloss: 0.710
        accuracy: 77.2
      public_square:
        logloss: 0.732
        accuracy: 71.1
      shopping_mall:
        logloss: 0.839
        accuracy: 72.6
      street_pedestrian:
        logloss: 0.877
        accuracy: 72.7
      street_traffic:
        logloss: 0.296
        accuracy: 89.6
      tram:
        logloss: 0.659
        accuracy: 73.1
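
The comments in the `complexity` section of the Subtask B meta file list the components behind the parameter counts. The short Python sketch below re-adds those figures to show how `total_parameters`, `total_parameters_audio` and `total_parameters_visual` relate. The grouping of components into an audio branch and a visual branch is inferred from the numbers rather than stated in the file, and the variable names are illustrative.

```python
# Component parameter counts, copied from the comments in the meta file.
audio_embedding = 9150660     # audio embedding extraction from OpenL3
video_embedding = 4691020     # video embedding extraction from OpenL3
audio_model = 338634          # audio model
video_model = 338634          # video model
audio_visual_model = 34186    # audio-visual model

# Assumption: each per-modality total is the embedding network plus that modality's model.
total_audio = audio_embedding + audio_model
total_visual = video_embedding + video_model

# The overall total adds the joint audio-visual model on top of both branches.
total = total_audio + total_visual + audio_visual_model

print(total_audio)    # 9489294, matches total_parameters_audio
print(total_visual)   # 5029654, matches total_parameters_visual
print(total)          # 14553134, matches total_parameters
```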

Package validator

This is an automatic validation tool to help challenge participants to prepare a correctly formatted submission package, which in turn will speed up the submission processing in the challenge evaluation stage. Please use this to make sure your submission package follows the given formatting.

DCASE2021 Task 1 submission validator
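
As a rough illustration of the kind of check a validator might perform, the submission label format described in the meta file comments can be tested with a regular expression. This is a sketch and not the official validator: the pattern and the function name `is_valid_label` are assumptions, written loosely enough to accept labels that appear in the results tables (for example ones containing `+`, `&`, hyphens or accented letters).

```python
import re

# Hypothetical pattern for Task 1 labels:
#   [last name]_[institute abbreviation]_task1a|task1b_[index 1-4]
# Name and abbreviation may contain anything except underscores and whitespace.
LABEL_PATTERN = re.compile(r"[^_\s]+_[^_\s]+_task1[ab]_[1-4]")


def is_valid_label(label: str) -> bool:
    """Return True if the whole label follows the assumed naming scheme."""
    return LABEL_PATTERN.fullmatch(label) is not None


for label in ["Wang_TAU_task1b_1", "Jeng_CHT+NSYSU_task1a_3", "Wang_TAU_task1b_5"]:
    print(label, is_valid_label(label))
```

The last label is rejected because the index falls outside the 1-4 range allowed per submission.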

Evaluation

Systems will be ranked by multiclass cross-entropy (log loss). The metric is independent of the operating point (see Python implementation here).

As an additional metric, we will calculate macro-average accuracy (average of the class-wise accuracies). The accuracy will not be used in the official rankings.
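
The two metrics can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the official evaluation code: the function names, the clipping constant `eps` and the toy data are all made up for the example, and system output is assumed to be a dictionary of class probabilities per test item.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability that the system assigns to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when the true class gets zero probability.
        total += -math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


def macro_average_accuracy(y_true, y_pred):
    """Average of the class-wise accuracies, each class weighted equally."""
    per_class = {}
    for t, p in zip(y_true, y_pred):
        correct, count = per_class.get(t, (0, 0))
        per_class[t] = (correct + (t == p), count + 1)
    return sum(c / n for c, n in per_class.values()) / len(per_class)


# Toy data: three test items, two scene classes.
y_true = ["park", "park", "tram"]
probs = [
    {"park": 0.8, "tram": 0.2},
    {"park": 0.4, "tram": 0.6},
    {"park": 0.1, "tram": 0.9},
]
# The predicted label is the class with the highest probability.
y_pred = [max(p, key=p.get) for p in probs]

print(round(multiclass_log_loss(y_true, probs), 3))
print(macro_average_accuracy(y_true, y_pred))  # 0.75
```

Log loss uses the full probability output rather than a thresholded decision, which is why it does not depend on an operating point, whereas accuracy only looks at the top-scoring class.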

Results

Subtask A

Official rank	Code	Author	Affiliation	Technical Report	Logloss	Accuracy with 95% confidence interval
21	Byttebier_IDLab_task1a_1	Brecht Desplanques	ELIS, Ghent University - imec, Ghent, Belgium	task-acoustic-scene-classification-results-a#Byttebier2021	0.936	68.6 (67.6 - 69.6)
18	Byttebier_IDLab_task1a_2	Brecht Desplanques	ELIS, Ghent University - imec, Ghent, Belgium	task-acoustic-scene-classification-results-a#Byttebier2021	0.914	67.5 (66.5 - 68.6)
23	Byttebier_IDLab_task1a_3	Brecht Desplanques	ELIS, Ghent University - imec, Ghent, Belgium	task-acoustic-scene-classification-results-a#Byttebier2021	0.944	68.5 (67.5 - 69.6)
17	Byttebier_IDLab_task1a_4	Brecht Desplanques	ELIS, Ghent University - imec, Ghent, Belgium	task-acoustic-scene-classification-results-a#Byttebier2021	0.905	68.8 (67.8 - 69.8)
49	Cao_SCUT_task1a_1	Wenchang Cao	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-acoustic-scene-classification-results-a#Cao2021	1.136	66.7 (65.7 - 67.7)
56	Cao_SCUT_task1a_2	Wenchang Cao	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-acoustic-scene-classification-results-a#Cao2021	1.200	64.6 (63.5 - 65.6)
50	Cao_SCUT_task1a_3	Wenchang Cao	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-acoustic-scene-classification-results-a#Cao2021	1.137	67.2 (66.1 - 68.2)
53	Cao_SCUT_task1a_4	Wenchang Cao	School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China	task-acoustic-scene-classification-results-a#Cao2021	1.147	66.1 (65.1 - 67.1)
85	Ding_TJU_task1a_1	Biyun Ding	School of Electrical and Information Engineering, Tianjin University, Tianjin, China	task-acoustic-scene-classification-results-a#Ding2021	1.544	53.0 (51.9 - 54.1)
70	Ding_TJU_task1a_2	Biyun Ding	School of Electrical and Information Engineering, Tianjin University, Tianjin, China	task-acoustic-scene-classification-results-a#Ding2021	1.326	51.1 (50.0 - 52.2)
61	Ding_TJU_task1a_3	Biyun Ding	School of Electrical and Information Engineering, Tianjin University, Tianjin, China	task-acoustic-scene-classification-results-a#Ding2021	1.226	49.1 (48.0 - 50.2)
67	Ding_TJU_task1a_4	Biyun Ding	School of Electrical and Information Engineering, Tianjin University, Tianjin, China	task-acoustic-scene-classification-results-a#Ding2021	1.296	51.4 (50.3 - 52.5)
64	Fan_NWPU_task1a_1	MengFan Cui	Northwestern Polytechnic University, China	task-acoustic-scene-classification-results-a#Cui2021	1.261	68.3 (67.3 - 69.3)
97	Galindo-Meza_ITESO_task1a_1	Carlos Alberto Galindo-Meza	Departamento de Electronica, Sistemas e Informatica, Instituto Tecnologico de Estudios Superiores de Occidente, Jalisco, Mexico	task-acoustic-scene-classification-results-a#Galindo-Meza2021	2.221	53.9 (52.8 - 55.0)
42	Heo_Clova_task1a_1	Heo Hee-Soo	Naver Corporation, Seongnam, South Korea	task-acoustic-scene-classification-results-a#Hee-Soo2021	1.087	67.0 (66.0 - 68.0)
20	Heo_Clova_task1a_2	Heo Hee-Soo	Naver Corporation, Seongnam, South Korea	task-acoustic-scene-classification-results-a#Hee-Soo2021	0.930	66.9 (65.9 - 67.9)
34	Heo_Clova_task1a_3	Heo Hee-Soo	Naver Corporation, Seongnam, South Korea	task-acoustic-scene-classification-results-a#Hee-Soo2021	1.045	70.0 (69.0 - 71.0)
12	Heo_Clova_task1a_4	Heo Hee-Soo	Naver Corporation, Seongnam, South Korea	task-acoustic-scene-classification-results-a#Hee-Soo2021	0.871	70.1 (69.1 - 71.1)
86	Horváth_HIT_task1a_1	Kristóf Horváth	Hitachi Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-a#Horvth2021	1.597	51.4 (50.3 - 52.5)
92	Horváth_HIT_task1a_2	Kristóf Horváth	Hitachi Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-a#Horvth2021	2.031	53.3 (52.2 - 54.4)
76	Horváth_HIT_task1a_3	Kristóf Horváth	Hitachi Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-a#Horvth2021	1.460	51.6 (50.5 - 52.7)
95	Horváth_HIT_task1a_4	Kristóf Horváth	Hitachi Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-a#Horvth2021	2.065	49.2 (48.1 - 50.3)
78	Jeng_CHT+NSYSU_task1a_1	Hui Hsin Jeng	Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan	task-acoustic-scene-classification-results-a#Jeng2021	1.469	55.0 (53.9 - 56.1)
84	Jeng_CHT+NSYSU_task1a_2	Hui Hsin Jeng	Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan	task-acoustic-scene-classification-results-a#Jeng2021	1.543	51.3 (50.2 - 52.4)
79	Jeng_CHT+NSYSU_task1a_3	Hui Hsin Jeng	Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan	task-acoustic-scene-classification-results-a#Jeng2021	1.470	56.3 (55.2 - 57.4)
33	Jeong_ETRI_task1a_1	Youngho Jeong	Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea	task-acoustic-scene-classification-results-a#Jeong2021	1.041	66.0 (64.9 - 67.0)
25	Jeong_ETRI_task1a_2	Youngho Jeong	Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea	task-acoustic-scene-classification-results-a#Jeong2021	0.952	67.0 (65.9 - 68.0)
30	Jeong_ETRI_task1a_3	Youngho Jeong	Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea	task-acoustic-scene-classification-results-a#Jeong2021	1.023	66.7 (65.7 - 67.7)
63	Jeong_ETRI_task1a_4	Youngho Jeong	Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea	task-acoustic-scene-classification-results-a#Jeong2021	1.228	66.1 (65.1 - 67.2)
72	Kek_NU_task1a_1	Xing Yong Kek	Faculty of Science, Agriculture & Engineering, Newcastle University, Singapore	task-acoustic-scene-classification-results-a#Kek2021	1.355	66.8 (65.7 - 67.8)
57	Kek_NU_task1a_2	Xing Yong Kek	Faculty of Science, Agriculture & Engineering, Newcastle University, Singapore	task-acoustic-scene-classification-results-a#Kek2021	1.207	63.5 (62.4 - 64.6)
38	Kim_3M_task1a_1	Bongjun Kim	3M, Saint Paul, United States	task-acoustic-scene-classification-results-a#Kim2021	1.076	61.5 (60.4 - 62.6)
39	Kim_3M_task1a_2	Bongjun Kim	3M, Saint Paul, United States	task-acoustic-scene-classification-results-a#Kim2021	1.077	61.6 (60.5 - 62.6)
37	Kim_3M_task1a_3	Bongjun Kim	3M, Saint Paul, United States	task-acoustic-scene-classification-results-a#Kim2021	1.076	62.0 (61.0 - 63.1)
40	Kim_3M_task1a_4	Bongjun Kim	3M, Saint Paul, United States	task-acoustic-scene-classification-results-a#Kim2021	1.078	61.3 (60.2 - 62.3)
46	Kim_KNU_task1a_1	Seokjin Lee	School of Electronics Engineering, School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, Republic of Korea	task-acoustic-scene-classification-results-a#Kim2021a	1.115	64.7 (63.6 - 65.7)
28	Kim_KNU_task1a_2	Seokjin Lee	School of Electronics Engineering, School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, Republic of Korea	task-acoustic-scene-classification-results-a#Kim2021a	1.010	63.8 (62.8 - 64.9)
55	Kim_KNU_task1a_3	Seokjin Lee	School of Electronics Engineering, School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, Republic of Korea	task-acoustic-scene-classification-results-a#Kim2021a	1.188	61.3 (60.3 - 62.4)
52	Kim_KNU_task1a_4	Seokjin Lee	School of Electronics Engineering, School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, Republic of Korea	task-acoustic-scene-classification-results-a#Kim2021a	1.143	62.9 (61.8 - 64.0)
8	Kim_QTI_task1a_1	Byeonggeun Kim	Qualcomm AI Research, Qualcomm Korea YH, Seoul, Korea	task-acoustic-scene-classification-results-a#Kim2021b	0.793	75.0 (74.0 - 76.0)
1	Kim_QTI_task1a_2	Byeonggeun Kim	Qualcomm AI Research, Qualcomm Korea YH, Seoul, Korea	task-acoustic-scene-classification-results-a#Kim2021b	0.724	76.1 (75.1 - 77.0)
2	Kim_QTI_task1a_3	Byeonggeun Kim	Qualcomm AI Research, Qualcomm Korea YH, Seoul, Korea	task-acoustic-scene-classification-results-a#Kim2021b	0.735	76.1 (75.2 - 77.1)
5	Kim_QTI_task1a_4	Byeonggeun Kim	Qualcomm AI Research, Qualcomm Korea YH, Seoul, Korea	task-acoustic-scene-classification-results-a#Kim2021b	0.764	75.2 (74.3 - 76.2)
14	Koutini_CPJKU_task1a_1	Khaled Koutini	Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-acoustic-scene-classification-results-a#Koutini2021	0.883	70.9 (69.9 - 71.9)
10	Koutini_CPJKU_task1a_2	Khaled Koutini	Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-acoustic-scene-classification-results-a#Koutini2021	0.842	71.8 (70.8 - 72.8)
9	Koutini_CPJKU_task1a_3	Khaled Koutini	Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-acoustic-scene-classification-results-a#Koutini2021	0.834	72.1 (71.1 - 73.1)
11	Koutini_CPJKU_task1a_4	Khaled Koutini	Computational Perception (CP), Johannes Kepler University (JKU) Linz, Linz, Austria	task-acoustic-scene-classification-results-a#Koutini2021	0.847	71.8 (70.9 - 72.8)
90	Lim_CAU_task1a_1	Soyoung Lim	Statistics Dept., Chung-Ang University, Seoul, South Korea	task-acoustic-scene-classification-results-a#Lim2021	1.956	67.5 (66.5 - 68.5)
91	Lim_CAU_task1a_2	Soyoung Lim	Statistics Dept., Chung-Ang University, Seoul, South Korea	task-acoustic-scene-classification-results-a#Lim2021	2.010	67.9 (66.9 - 69.0)
80	Lim_CAU_task1a_3	Soyoung Lim	Statistics Dept., Chung-Ang University, Seoul, South Korea	task-acoustic-scene-classification-results-a#Lim2021	1.479	68.5 (67.5 - 69.5)
93	Lim_CAU_task1a_4	Soyoung Lim	Statistics Dept., Chung-Ang University, Seoul, South Korea	task-acoustic-scene-classification-results-a#Lim2021	2.039	65.8 (64.7 - 66.8)
16	Liu_UESTC_task1a_1	Yingzi Liu	School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China	task-acoustic-scene-classification-results-a#Liu2021	0.900	68.8 (67.8 - 69.8)
15	Liu_UESTC_task1a_2	Yingzi Liu	School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China	task-acoustic-scene-classification-results-a#Liu2021	0.895	68.2 (67.2 - 69.2)
13	Liu_UESTC_task1a_3	Yingzi Liu	School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China	task-acoustic-scene-classification-results-a#Liu2021	0.878	69.6 (68.6 - 70.6)
87	Liu_UESTC_task1a_4	Yingzi Liu	School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China	task-acoustic-scene-classification-results-a#Liu2021	1.626	42.0 (40.9 - 43.1)
99	Madhu_CET_task1a_1	Aswathy Madhu	Electronics & Communication, College of Engineering Trivandrum, Thiruvananthapuram, Kerala, India	task-acoustic-scene-classification-results-a#Madhu2021	3.950	9.7 (9.0 - 10.3)
	DCASE2021 baseline	Irene Martín Morató	Computing Sciences, Tampere University, Tampere, Finland	task-acoustic-scene-classification-results-a#BASELINE	1.730	45.6 (44.5 - 46.7)
51	Naranjo-Alcazar_ITI_task1a_1	Javier Naranjo-Alcazar	Computer Science, Universitat de Valencia, Burjassot, Spain; Instituto Tecnológico de Informática, Valencia, Spain	task-acoustic-scene-classification-results-a#Naranjo-Alcazar2021_t1a	1.140	60.2 (59.2 - 61.3)
73	Pham_AIT_task1a_1	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-a#Pham2021	1.368	67.5 (66.4 - 68.5)
54	Pham_AIT_task1a_2	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-a#Pham2021	1.187	68.4 (67.4 - 69.4)
94	Pham_AIT_task1a_3	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-a#Pham2021	2.058	69.6 (68.6 - 70.6)
65	Phan_UIUC_task1a_1	Duc Phan	ECE, University of Illinois, Urbana-Champaign, Illinois, US	task-acoustic-scene-classification-results-a#Phan2021	1.272	63.3 (62.3 - 64.4)
71	Phan_UIUC_task1a_2	Duc Phan	ECE, University of Illinois, Urbana-Champaign, Illinois, US	task-acoustic-scene-classification-results-a#Phan2021	1.335	63.3 (62.3 - 64.4)
60	Phan_UIUC_task1a_3	Duc Phan	ECE, University of Illinois, Urbana-Champaign, Illinois, US	task-acoustic-scene-classification-results-a#Phan2021	1.223	65.3 (64.3 - 66.4)
66	Phan_UIUC_task1a_4	Duc Phan	ECE, University of Illinois, Urbana-Champaign, Illinois, US	task-acoustic-scene-classification-results-a#Phan2021	1.292	65.3 (64.3 - 66.4)
24	Puy_VAI_task1a_1	Gilles Puy	valeo.ai, Paris, France	task-acoustic-scene-classification-results-a#Puy2021	0.952	66.6 (65.6 - 67.6)
27	Puy_VAI_task1a_2	Gilles Puy	valeo.ai, Paris, France	task-acoustic-scene-classification-results-a#Puy2021	0.974	65.4 (64.4 - 66.5)
22	Puy_VAI_task1a_3	Gilles Puy	valeo.ai, Paris, France	task-acoustic-scene-classification-results-a#Puy2021	0.939	66.2 (65.1 - 67.2)
88	Qiao_NCUT_task1a_1	Ziling Qiao	Electronic and Communication Engineering, North China University of Technology, Beijing, China	task-acoustic-scene-classification-results-a#Qiao2021	1.630	52.2 (51.1 - 53.3)
32	Seo_SGU_task1a_1	Ji-Hwan Kim	Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea	task-acoustic-scene-classification-results-a#Seo2021	1.030	70.3 (69.3 - 71.3)
41	Seo_SGU_task1a_2	Ji-Hwan Kim	Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea	task-acoustic-scene-classification-results-a#Seo2021	1.080	71.4 (70.4 - 72.4)
35	Seo_SGU_task1a_3	Ji-Hwan Kim	Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea	task-acoustic-scene-classification-results-a#Seo2021	1.065	71.3 (70.3 - 72.3)
44	Seo_SGU_task1a_4	Ji-Hwan Kim	Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea	task-acoustic-scene-classification-results-a#Seo2021	1.087	71.8 (70.8 - 72.8)
77	Singh_IITMandi_task1a_1	Arshdeep Singh	SCEE, Indian Institute of Technology, Mandi, Mandi, India	task-acoustic-scene-classification-results-a#Singh2021	1.464	47.2 (46.1 - 48.3)
83	Singh_IITMandi_task1a_2	Arshdeep Singh	SCEE, Indian Institute of Technology, Mandi, Mandi, India	task-acoustic-scene-classification-results-a#Singh2021	1.515	44.7 (43.6 - 45.8)
82	Singh_IITMandi_task1a_3	Arshdeep Singh	SCEE, Indian Institute of Technology, Mandi, Mandi, India	task-acoustic-scene-classification-results-a#Singh2021	1.509	46.1 (45.0 - 47.2)
81	Singh_IITMandi_task1a_4	Arshdeep Singh	SCEE, Indian Institute of Technology, Mandi, Mandi, India	task-acoustic-scene-classification-results-a#Singh2021	1.488	46.8 (45.7 - 47.9)
43	Sugahara_RION_task1a_1	Reiko Sugahara	RION CO., LTD., Tokyo, Japan	task-acoustic-scene-classification-results-a#Sugahara2021	1.087	63.8 (62.8 - 64.9)
36	Sugahara_RION_task1a_2	Reiko Sugahara	RION CO., LTD., Tokyo, Japan	task-acoustic-scene-classification-results-a#Sugahara2021	1.070	65.2 (64.2 - 66.3)
31	Sugahara_RION_task1a_3	Reiko Sugahara	RION CO., LTD., Tokyo, Japan	task-acoustic-scene-classification-results-a#Sugahara2021	1.024	65.3 (64.3 - 66.4)
68	Sugahara_RION_task1a_4	Reiko Sugahara	RION CO., LTD., Tokyo, Japan	task-acoustic-scene-classification-results-a#Sugahara2021	1.297	64.7 (63.7 - 65.8)
48	Verbitskiy_DS_task1a_1	Sergey Verbitskiy	Deepsound, Novosibirsk, Russia	task-acoustic-scene-classification-results-a#Verbitskiy2021	1.127	61.4 (60.3 - 62.4)
29	Verbitskiy_DS_task1a_2	Sergey Verbitskiy	Deepsound, Novosibirsk, Russia	task-acoustic-scene-classification-results-a#Verbitskiy2021	1.019	64.5 (63.4 - 65.5)
26	Verbitskiy_DS_task1a_3	Sergey Verbitskiy	Deepsound, Novosibirsk, Russia	task-acoustic-scene-classification-results-a#Verbitskiy2021	0.966	67.3 (66.3 - 68.4)
19	Verbitskiy_DS_task1a_4	Sergey Verbitskiy	Deepsound, Novosibirsk, Russia	task-acoustic-scene-classification-results-a#Verbitskiy2021	0.924	68.1 (67.1 - 69.1)
6	Yang_GT_task1a_1	Chao-Han Huck Yang	School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA	task-acoustic-scene-classification-results-a#Yang2021	0.768	73.1 (72.1 - 74.0)
4	Yang_GT_task1a_2	Chao-Han Huck Yang	School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA	task-acoustic-scene-classification-results-a#Yang2021	0.764	72.9 (71.9 - 73.9)
3	Yang_GT_task1a_3	Chao-Han Huck Yang	School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA	task-acoustic-scene-classification-results-a#Yang2021	0.758	72.9 (71.9 - 73.8)
7	Yang_GT_task1a_4	Chao-Han Huck Yang	School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA	task-acoustic-scene-classification-results-a#Yang2021	0.774	72.8 (71.8 - 73.8)
69	Yihao_speakin_task1a_1	Chen Yihao	SpeakIn Technology, Shanghai, China	task-acoustic-scene-classification-results-a#Yihao2021	1.311	51.9 (50.8 - 53.0)
59	Yihao_speakin_task1a_2	Chen Yihao	SpeakIn Technology, Shanghai, China	task-acoustic-scene-classification-results-a#Yihao2021	1.222	55.2 (54.1 - 56.3)
96	Yihao_speakin_task1a_3	Chen Yihao	SpeakIn Technology, Shanghai, China	task-acoustic-scene-classification-results-a#Yihao2021	2.105	53.5 (52.4 - 54.6)
47	Zhang_BUPT&BYTEDANCE_task1a_1	Jiawang Zhang	AI-Lab Speech & Audio Team, Beijing University of Posts and Telecommunications & ByteDance, Shanghai, China	task-acoustic-scene-classification-results-a#Zhang2021	1.124	63.0 (62.0 - 64.1)
45	Zhang_BUPT&BYTEDANCE_task1a_2	Jiawang Zhang	AI-Lab Speech & Audio Team, Beijing University of Posts and Telecommunications & ByteDance, Shanghai, China	task-acoustic-scene-classification-results-a#Zhang2021	1.113	63.2 (62.2 - 64.3)
98	Zhang_BUPT&BYTEDANCE_task1a_3	Jiawang Zhang	AI-Lab Speech & Audio Team, Beijing University of Posts and Telecommunications & ByteDance, Shanghai, China	task-acoustic-scene-classification-results-a#Zhang2021	3.359	52.2 (51.1 - 53.3)
89	Zhang_BUPT&BYTEDANCE_task1a_4	Jiawang Zhang	AI-Lab Speech & Audio Team, Beijing University of Posts and Telecommunications & ByteDance, Shanghai, China	task-acoustic-scene-classification-results-a#Zhang2021	1.946	59.0 (57.9 - 60.1)
75	Zhao_Maxvision_task1a_1	Na Zhao	Algorithm, Maxvision, Wuhan, China	task-acoustic-scene-classification-results-a#Zhao2021	1.440	61.2 (60.2 - 62.3)
74	Zhao_Maxvision_task1a_2	Na Zhao	Algorithm, Maxvision, Wuhan, China	task-acoustic-scene-classification-results-a#Zhao2021	1.412	63.5 (62.4 - 64.6)
62	Zhao_Maxvision_task1a_3	Na Zhao	Algorithm, Maxvision, Wuhan, China	task-acoustic-scene-classification-results-a#Zhao2021	1.227	63.5 (62.5 - 64.6)
58	Zhao_Maxvision_task1a_4	Na Zhao	Algorithm, Maxvision, Wuhan, China	task-acoustic-scene-classification-results-a#Zhao2021	1.215	62.8 (61.8 - 63.9)

Complete results and technical reports can be found on the subtask A results page.

Subtask B

Official rank	Code	Author	Affiliation	Technical Report	Logloss	Accuracy with 95% confidence interval
23	Boes_KUL_task1b_1	Wim Boes	ESAT, KU Leuven, Leuven, Belgium	task-acoustic-scene-classification-results-b#Boes2021	0.653	74.5 (74.2 - 74.8)
25	Boes_KUL_task1b_2	Wim Boes	ESAT, KU Leuven, Leuven, Belgium	task-acoustic-scene-classification-results-b#Boes2021	0.683	76.0 (75.7 - 76.3)
26	Boes_KUL_task1b_3	Wim Boes	ESAT, KU Leuven, Leuven, Belgium	task-acoustic-scene-classification-results-b#Boes2021	0.701	76.3 (76.0 - 76.6)
24	Boes_KUL_task1b_4	Wim Boes	ESAT, KU Leuven, Leuven, Belgium	task-acoustic-scene-classification-results-b#Boes2021	0.681	76.0 (75.6 - 76.3)
35	Diez_Noismart_task1b_1	Itxasne Diez	Getxo, Basque Country, Spain	task-acoustic-scene-classification-results-b#Diez2021	1.061	65.2 (64.8 - 65.5)
38	Diez_Noismart_task1b_2	Itxasne Diez	Getxo, Basque Country, Spain	task-acoustic-scene-classification-results-b#Diez2021	1.096	64.4 (64.1 - 64.8)
34	Diez_Noismart_task1b_3	Itxasne Diez	Getxo, Basque Country, Spain	task-acoustic-scene-classification-results-b#Diez2021	1.060	64.7 (64.4 - 65.1)
8	Du_USTC_task1b_1	Jun Du	NELSLIP, University of Science and Technology of China, Hefei, China	task-acoustic-scene-classification-results-b#Wang2021	0.241	92.9 (92.7 - 93.1)
7	Du_USTC_task1b_2	Jun Du	NELSLIP, University of Science and Technology of China, Hefei, China	task-acoustic-scene-classification-results-b#Wang2021	0.238	92.7 (92.5 - 92.9)
6	Du_USTC_task1b_3	Jun Du	NELSLIP, University of Science and Technology of China, Hefei, China	task-acoustic-scene-classification-results-b#Wang2021	0.222	93.2 (93.0 - 93.4)
5	Du_USTC_task1b_4	Jun Du	NELSLIP, University of Science and Technology of China, Hefei, China	task-acoustic-scene-classification-results-b#Wang2021	0.221	93.2 (93.0 - 93.4)
37	Fedorishin_UB_task1b_1	Dennis Fedorishin	Computer Science, Center for Unified Biometrics and Sensors, University at Buffalo, New York, USA	task-acoustic-scene-classification-results-b#Fedorishin2021	1.077	67.2 (66.8 - 67.5)
33	Fedorishin_UB_task1b_2	Dennis Fedorishin	Computer Science, Center for Unified Biometrics and Sensors, University at Buffalo, New York, USA	task-acoustic-scene-classification-results-b#Fedorishin2021	1.028	68.7 (68.4 - 69.1)
20	Hou_UGent_task1b_1	Yuanbo Hou	Ghent University, Gent, Belgium	task-acoustic-scene-classification-results-b#Hou2021	0.555	81.5 (81.2 - 81.8)
29	Hou_UGent_task1b_2	Yuanbo Hou	Ghent University, Gent, Belgium	task-acoustic-scene-classification-results-b#Hou2021	0.771	81.8 (81.6 - 82.1)
19	Hou_UGent_task1b_3	Yuanbo Hou	Ghent University, Gent, Belgium	task-acoustic-scene-classification-results-b#Hou2021	0.523	84.0 (83.7 - 84.3)
16	Hou_UGent_task1b_4	Yuanbo Hou	Ghent University, Gent, Belgium	task-acoustic-scene-classification-results-b#Hou2021	0.416	85.6 (85.3 - 85.8)
18	Naranjo-Alcazar_UV_task1b_1	Javier Naranjo-Alcazar	Computer Science, Universitat de Valencia, Burjassot, Spain; Instituto Tecnológico de Informática, Valencia, Spain	task-acoustic-scene-classification-results-b#Naranjo-Alcazar2021_t1b	0.495	86.5 (86.3 - 86.8)
22	Naranjo-Alcazar_UV_task1b_2	Javier Naranjo-Alcazar	Computer Science, Universitat de Valencia, Burjassot, Spain; Instituto Tecnológico de Informática, Valencia, Spain	task-acoustic-scene-classification-results-b#Naranjo-Alcazar2021_t1b	0.640	83.2 (82.9 - 83.4)
32	Naranjo-Alcazar_UV_task1b_3	Javier Naranjo-Alcazar	Computer Science, Universitat de Valencia, Burjassot, Spain; Instituto Tecnológico de Informática, Valencia, Spain	task-acoustic-scene-classification-results-b#Naranjo-Alcazar2021_t1b	1.006	66.8 (66.5 - 67.1)
12	Okazaki_LDSLVision_task1b_1	Soichiro Okazaki	Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-b#Okazaki2021	0.312	91.6 (91.4 - 91.8)
13	Okazaki_LDSLVision_task1b_2	Soichiro Okazaki	Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-b#Okazaki2021	0.320	93.2 (93.0 - 93.3)
11	Okazaki_LDSLVision_task1b_3	Soichiro Okazaki	Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-b#Okazaki2021	0.303	93.5 (93.3 - 93.7)
9	Okazaki_LDSLVision_task1b_4	Soichiro Okazaki	Lumada Data Science Lab., Hitachi, Ltd., Tokyo, Japan	task-acoustic-scene-classification-results-b#Okazaki2021	0.257	93.5 (93.3 - 93.7)
45	Peng_CQU_task1b_1	Wang Peng	Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China	task-acoustic-scene-classification-results-b#Peng2021	1.395	68.2 (67.9 - 68.5)
40	Peng_CQU_task1b_2	Wang Peng	Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China	task-acoustic-scene-classification-results-b#Peng2021	1.172	67.8 (67.5 - 68.1)
41	Peng_CQU_task1b_3	Wang Peng	Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China	task-acoustic-scene-classification-results-b#Peng2021	1.172	67.8 (67.5 - 68.1)
43	Peng_CQU_task1b_4	Wang Peng	Intelligent Information Technology and System Lab, CHONGQING UNIVERSITY, Chongqing, China	task-acoustic-scene-classification-results-b#Peng2021	1.233	68.5 (68.1 - 68.8)
44	Pham_AIT_task1b_1	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-b#Pham2021	1.311	73.0 (72.7 - 73.3)
21	Pham_AIT_task1b_2	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-b#Pham2021	0.589	88.3 (88.1 - 88.6)
17	Pham_AIT_task1b_3	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-b#Pham2021	0.434	88.4 (88.2 - 88.7)
28	Pham_AIT_task1b_4	Lam Pham	Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria	task-acoustic-scene-classification-results-b#Pham2021	0.738	91.5 (91.3 - 91.7)
39	Triantafyllopoulos_AUD_task1b_1	Andreas Triantafyllopoulos	audEERING GmbH, Gilching, Germany	task-acoustic-scene-classification-results-b#Triantafyllopoulos2021	1.157	58.4 (58.1 - 58.8)
27	Triantafyllopoulos_AUD_task1b_2	Andreas Triantafyllopoulos	audEERING GmbH, Gilching, Germany	task-acoustic-scene-classification-results-b#Triantafyllopoulos2021	0.735	73.6 (73.3 - 73.9)
30	Triantafyllopoulos_AUD_task1b_3	Andreas Triantafyllopoulos	audEERING GmbH, Gilching, Germany	task-acoustic-scene-classification-results-b#Triantafyllopoulos2021	0.785	73.7 (73.3 - 74.0)
31	Triantafyllopoulos_AUD_task1b_4	Andreas Triantafyllopoulos	audEERING GmbH, Gilching, Germany	task-acoustic-scene-classification-results-b#Triantafyllopoulos2021	0.872	70.3 (70.0 - 70.7)
36	Wang_BIT_task1b_1	Yuxiang Wang	Electronic engineering, Beijing Institute of Technology, Beijing, China	task-acoustic-scene-classification-results-b#Wang2021a	1.061	74.1 (73.7 - 74.4)
42	Wang_BIT_task1b_2	Shuang Liang	School of Information and Electronics, Beijing Institute of Technology, Beijing, China	task-acoustic-scene-classification-results-b#Liang2021	1.180	62.4 (62.0 - 62.7)
	DCASE2021 baseline	Shanshan Wang	Computing Sciences, Tampere University, Tampere, Finland	task-acoustic-scene-classification-results-b#BASELINE	0.662	77.1 (76.8 - 77.5)
15	Yang_THU_task1b_1	Yujie Yang	Tsinghua University, Shenzhen, China	task-acoustic-scene-classification-results-b#Yang2021	0.332	90.8 (90.6 - 91.1)
14	Yang_THU_task1b_2	Yujie Yang	Tsinghua University, Shenzhen, China	task-acoustic-scene-classification-results-b#Yang2021	0.321	90.8 (90.6 - 91.0)
10	Yang_THU_task1b_3	Yujie Yang	Tsinghua University, Shenzhen, China	task-acoustic-scene-classification-results-b#Yang2021	0.279	92.1 (91.9 - 92.3)
3	Zhang_IOA_task1b_1	Pengyuan Zhang	Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China	task-acoustic-scene-classification-results-b#Wang2021b	0.201	93.5 (93.3 - 93.7)
4	Zhang_IOA_task1b_2	Pengyuan Zhang	Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China	task-acoustic-scene-classification-results-b#Wang2021b	0.205	93.6 (93.4 - 93.8)
1	Zhang_IOA_task1b_3	Pengyuan Zhang	Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China	task-acoustic-scene-classification-results-b#Wang2021b	0.195	93.8 (93.6 - 93.9)
2	Zhang_IOA_task1b_4	Pengyuan Zhang	Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing, China	task-acoustic-scene-classification-results-b#Wang2021b	0.199	93.9 (93.7 - 94.1)

Complete results and technical reports can be found on the subtask B results page.

Baseline systems

A baseline system is provided for each subtask, implementing a state-of-the-art approach to the classification problem.

Subtask A

The baseline system implements a convolutional neural network (CNN) based approach using log mel-band energies extracted for each 10-second signal. The network consists of three CNN layers and one fully connected layer to assign scene labels to the audio signals. The system is based on the DCASE 2020 Subtask B baseline system.

After training, the model parameters are quantized to float16. The resulting model size is 90.3 KB.

Repository

DCASE2021 Task 1A baseline, repository

Parameters

Audio features:
- Log mel-band energies (40 bands), analysis frame 40 ms (50% hop size)
Neural network:
- Input shape: 40 * 500 (10 seconds)
- Architecture:
  - CNN layer #1: 2D Convolutional layer (filters: 16, kernel size: 7) + Batch normalization + ReLU activation
  - CNN layer #2: 2D Convolutional layer (filters: 16, kernel size: 7) + Batch normalization + ReLU activation, 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
  - CNN layer #3: 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLU activation, 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
  - Flatten
  - Dense layer #1: Dense layer (units: 100, activation: ReLU), Dropout (rate: 30%)
  - Output layer (activation: softmax)
- Learning: 200 epochs (batch size 16), data shuffling between epochs
- Optimizer: Adam (learning rate 0.001)
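
The stated model size can be cross-checked from the layer list above with a short parameter count. This is a sketch under assumptions not spelled out in the parameter list: convolutions use "same" padding (otherwise the listed pooling sizes would not fit the feature map), batch normalization stores four values per channel, and every stored value occupies 2 bytes after float16 quantization. The helper names are hypothetical and not taken from the baseline repository.

```python
# Parameter count for the Subtask A baseline CNN (assumes "same" padding).

def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square 2D convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

def batchnorm_params(channels):
    """Scale, shift, moving mean and moving variance per channel."""
    return 4 * channels

def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units

# 40 ms frames with 50% hop give a 20 ms step: 10 s / 20 ms = 500 frames.
n_bands = 40
n_frames = 10_000 // 20

# Under "same" padding only the pooling layers change the spatial size.
height, width = n_bands // 5, n_frames // 5    # after pool (5, 5): 8 x 100
height, width = height // 4, width // 100      # after pool (4, 100): 2 x 1
flattened = height * width * 32                # 32 filters in CNN layer #3

total_params = (
    conv2d_params(1, 16, 7) + batchnorm_params(16)
    + conv2d_params(16, 16, 7) + batchnorm_params(16)
    + conv2d_params(16, 32, 7) + batchnorm_params(32)
    + dense_params(flattened, 100)
    + dense_params(100, 10)
)
size_kb = total_params * 2 / 1024  # float16 = 2 bytes per parameter

print(total_params)        # 46246
print(round(size_kb, 1))   # 90.3
```

Under these assumptions the count comes to 46246 parameters, which at 2 bytes each matches the 90.3 KB reported for the quantized model.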

Model selection:

- Approximately 30% of the original training data is assigned to the validation set. The split is done such that the training and validation sets do not contain segments from the same location, and both sets contain data from each city.
- Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected.
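
A location-aware split of this kind can be sketched as follows. This is a simplified illustration, not the baseline's actual splitting code: the `(segment_id, city, location_id)` tuple layout, the `split_by_location` helper, and the greedy per-city selection are all assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_location(items, val_fraction=0.3, seed=0):
    """Split segments so that no recording location is shared between sets.

    items: list of (segment_id, city, location_id) tuples (hypothetical layout).
    Returns (train_ids, val_ids).
    """
    rng = random.Random(seed)

    # Group segment ids by city, then by recording location.
    by_city = defaultdict(lambda: defaultdict(list))
    for segment_id, city, location_id in items:
        by_city[city][location_id].append(segment_id)

    train_ids, val_ids = [], []
    for city in sorted(by_city):
        locations = by_city[city]
        loc_ids = sorted(locations)
        rng.shuffle(loc_ids)
        n_city = sum(len(segments) for segments in locations.values())
        target = round(val_fraction * n_city)
        n_val = 0
        for i, loc in enumerate(loc_ids):
            is_last = i == len(loc_ids) - 1
            # Whole locations go to validation until the target is reached;
            # the last location always stays in training so that the city
            # is represented there too.
            if n_val < target and not is_last:
                val_ids.extend(locations[loc])
                n_val += len(locations[loc])
            else:
                train_ids.extend(locations[loc])
    return train_ids, val_ids

# Toy data: 3 cities, 10 locations per city, 5 segments per location.
items = [
    (f"{city}-{loc}-{seg}", city, f"{city}-{loc}")
    for city in ["city_a", "city_b", "city_c"]
    for loc in range(10)
    for seg in range(5)
]
train_ids, val_ids = split_by_location(items)
print(len(train_ids), len(val_ids))
```

Cities with a single recording location end up in the training set only in this sketch, so the "data from each city" condition holds only when every city has at least two locations.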

Results for the development dataset

System	Log loss	Accuracy	Description
DCASE2021 Task 1 Baseline, Subtask A	1.473 (± 0.051)	47.7 % (± 0.9)	Log mel-band energies as features, three layers of 2D CNN and one fully connected layer as classifier. After training, the model parameters are quantized to float16.
DCASE2020 Task 1 Baseline, Subtask A	1.365 (± 0.032)	54.1 % (± 1.4)	OpenL3 as audio embeddings, two fully connected layers as classifiers
DCASE2019 Task 1 Baseline	1.578 (± 0.029)	46.5 % (± 1.2)	Log mel-band energies as features, two layers of 2D CNN and one fully connected layer as classifier

Results for the DCASE2021 baseline were calculated using TensorFlow in GPU mode (using an Nvidia Tesla V100 GPU card). Because results produced with a GPU card are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables. Detailed results for the DCASE2021 baseline:

Scene label	Log loss	Device-wise log-losses									Accuracy
		A	B	C	S1	S2	S3	S4	S5	S6
Airport	1.429	1.156	1.196	1.457	1.450	1.187	1.446	1.953	1.505	1.502	40.5 %
Bus	1.317	0.796	1.488	0.908	1.569	0.997	1.277	1.939	1.377	1.503	47.1 %
Metro	1.318	0.761	1.030	0.963	2.002	1.522	1.173	1.200	1.437	1.770	51.9 %
Metro station	1.999	1.814	2.079	2.368	2.058	2.339	1.781	1.921	1.917	1.715	28.3 %
Park	1.166	0.458	1.022	0.381	1.130	0.845	1.206	2.342	1.298	1.814	69.0 %
Public square	2.139	1.542	1.708	1.804	2.254	1.866	2.146	3.012	2.716	2.202	25.3 %
Shopping mall	1.091	0.946	0.830	1.091	1.302	1.293	1.196	1.140	0.976	1.042	61.3 %
Street, pedestrian	1.827	1.178	1.310	1.454	1.789	1.656	1.883	3.146	2.068	1.956	38.7 %
Street, traffic	1.338	0.854	1.154	1.368	1.104	1.325	1.356	1.747	0.764	2.365	62.0 %
Tram	1.105	0.674	1.116	1.016	0.866	1.378	0.750	0.942	1.776	1.424	53.0 %
Overall averaged over 10 iterations	1.473 (± 0.051)	1.018	1.294	1.282	1.552	1.441	1.421	1.934	1.583	1.729	47.7 % (± 0.9)

As discussed here, devices S4-S6 are used only for testing, not for training the system.
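
The overall row of the table above appears consistent with an unweighted (macro) average of the class-wise values. This reading is inferred from the numbers themselves rather than stated on this page; the short check below recomputes both overall figures from the class-wise log loss and accuracy columns.

```python
# Class-wise values copied from the detailed results table above.
class_log_loss = {
    "Airport": 1.429, "Bus": 1.317, "Metro": 1.318, "Metro station": 1.999,
    "Park": 1.166, "Public square": 2.139, "Shopping mall": 1.091,
    "Street, pedestrian": 1.827, "Street, traffic": 1.338, "Tram": 1.105,
}
class_accuracy = {
    "Airport": 40.5, "Bus": 47.1, "Metro": 51.9, "Metro station": 28.3,
    "Park": 69.0, "Public square": 25.3, "Shopping mall": 61.3,
    "Street, pedestrian": 38.7, "Street, traffic": 62.0, "Tram": 53.0,
}

# Macro average: every scene class contributes equally.
overall_log_loss = sum(class_log_loss.values()) / len(class_log_loss)
overall_accuracy = sum(class_accuracy.values()) / len(class_accuracy)

print(f"{overall_log_loss:.3f}")   # 1.473
print(f"{overall_accuracy:.1f}")   # 47.7
```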

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

The baseline system for Subtask B is based on OpenL3 using both audio and video branches. The audio and video embeddings are extracted according to the original OpenL3 publication, after which each branch is trained additionally for the scene classification task using a single modality. The trained subnetworks (audio and video subnetworks) are then connected using two fully-connected feed-forward layers of size 128 and 10.

Repository

DCASE2021 Task 1B baseline, repository

Parameters

Audio embeddings using L3 network:
- Content type: “env”
- Input representation: “mel256”
- Embedding size: “512”
- Hop size: 0.1
Video embeddings using L3 network:
- Content type: “env”
- Input representation: “mel256”
- Embedding size: “512”
Audio sub-networks:
- Input shape: 1 * 512 (embedding of 1 second)
- Architecture:
  - Dense layer #1: Dense layer (512) + Batch normalization + ReLU activation + Dropout (rate: 20%)
  - Dense layer #2: Dense layer (128) + Batch normalization + ReLU activation + Dropout (rate: 20%)
  - Dense layer #3: Dense layer (64) + Batch normalization + ReLU activation + Dropout (rate: 20%)
  - Dense layer #4 (output layer): Dense layer (10)
- Optimizer: Adam (learning rate 0.0001, weight_decay=0.0001)
- Learning: 200 epochs (batch size 64), data shuffling between epochs
- Loss: cross entropy loss
Video sub-networks:
- Identical to the audio subnetwork, except that the input is the video embeddings
Early Fusion Audio-Visual networks:
- Input shape: 1 * 1024 (concatenate audio and video embeddings)
- Architecture: Same as audio and video subnetworks described above
- Optimizer: Adam (learning rate 0.0001, weight_decay=0.0001)
- Learning: 200 epochs (batch size 64), data shuffling between epochs
- Loss: cross entropy loss
Audio-Visual networks (baseline):
- Requires pretrained weights
- Audio pretrained weights (taken from the audio subnetwork)
- Video pretrained weights (taken from the video subnetwork)
- Input shape: 1 * 512 (concatenate the output of linear layer #2 from audio and video subnetworks)
- Architecture:
  - Dense layer #1: Dense layer (128)
  - Dense layer #2 (output layer): Dense layer (10)
- Optimizer: Adam (learning rate 0.0001, weight_decay=0.0001)
- Learning: 200 epochs (batch size 64), data shuffling between epochs
- Loss: cross entropy loss

Model selection:

- Approximately 10% of the original training data is assigned to the validation set. The split is done such that the training and validation sets do not contain segments from the same location.
- Model performance is evaluated on the validation set after each epoch, and the best-performing model is selected.

Results for the development dataset

	Baseline (audio-visual)		Audio subnetwork		Video subnetwork		Early fusion (audio-visual)
Scene class	Log loss	Accuracy	Log loss	Acc	Log loss	Acc	Log loss	Acc
Airport	0.963	66.8%	0.977	66.9%	2.450	54.0%	2.117	56.5%
Bus	0.396	85.9%	0.628	78.0%	0.563	85.7%	0.284	91.8%
Metro	0.541	80.4%	1.106	60.7%	1.124	72.8%	0.461	87.7%
Metro station	0.565	80.8%	1.316	58.0%	0.495	85.2%	0.319	90.3%
Park	0.710	77.2%	0.960	73.5%	1.859	73.5%	0.705	83.2%
Public square	0.732	71.1%	1.284	54.3%	1.606	61.2%	1.073	70.6%
Shopping mall	0.839	72.6%	1.384	54.9%	2.454	45.4%	1.097	77.2%
Street pedestrian	0.877	72.7%	1.285	57.4%	1.921	58.1%	1.557	64.5%
Street traffic	0.296	89.6%	0.516	84.7%	1.336	70.7%	0.324	90.8%
Tram	0.659	73.1%	1.026	62.9%	2.677	42.8%	1.697	62.1%
Overall	0.658	77.0%	1.048	65.1%	1.648	64.9%	0.963	77.4%

Citation

If you are participating in subtask A or using the baseline code, please cite the following paper:

Publication

Irene Martín-Morató, Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems. 2021. arXiv:2105.13734.

Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems

Abstract

This paper presents the details of Task 1A Acoustic Scene Classification in the DCASE 2021 Challenge. The task consisted of classification of data from multiple devices, requiring good generalization properties, using low-complexity solutions. The provided baseline system is based on a CNN architecture and post-training parameters quantization. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7%, with a log loss of 1.473. Details on the challenge results will be added after the challenge deadline.

If you are using the audio dataset for subtask A, please cite the following paper:

Publication

Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60. 2020. URL: https://arxiv.org/abs/2005.14623.

Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Abstract

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.

If you are participating in subtask B, please cite the following paper:

Publication

Shanshan Wang, Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions. 2021. arXiv:2105.13675.

Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.

If you are using the audio-visual dataset or baseline code for subtask B, please cite the following paper:

A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

Keywords

Audio-visual data, Scene analysis, Acoustic scene, Pattern recognition, Transfer learning

Coordinators

Annamaria Mesaros, Tampere University
Irene Martin Morato, Tampere University
Shanshan Wang, Tampere University
Toni Heittola, Tampere University
Tuomas Virtanen, Tampere University
