Acoustic scene classification


Task description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded.

This task comprises two different subtasks that involve system development for two different situations:

Subtask A: Acoustic Scene Classification with Multiple Devices

Classification of data from multiple devices (real and simulated), targeting generalization properties of systems across a number of different devices.

Subtask B: Low-Complexity Acoustic Scene Classification

Classification of data into three higher-level classes while focusing on low-complexity solutions.

Subtask A: Acoustic Scene Classification with Multiple Devices

This subtask addresses the basic problem of acoustic scene classification: classifying a test audio recording into one of ten known acoustic scene classes. It targets generalization properties of systems across a number of different devices, and uses audio data recorded and simulated with a variety of devices.

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is TAU Urban Acoustic Scenes 2020 Mobile. The dataset contains recordings from 12 European cities in 10 different acoustic scenes, made with 4 different devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set.

Recordings were made using four devices that captured audio simultaneously. The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution; it is referred to as device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Acoustic scenes (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm and Vienna.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.


Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data


Additionally, 11 mobile devices (S1-S11) are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is convolved with the impulse response of the selected device Si, then processed with a device-specific set of dynamic range compression parameters. The impulse responses are proprietary data and will not be published.
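The simulation pipeline described above (convolution with a device impulse response, followed by dynamic range compression) can be sketched roughly as follows. The impulse response and the compressor settings here are placeholders: the real impulse responses are proprietary, and the exact compression parameters are device specific and unpublished.

```python
import numpy as np

def simulate_device(audio, impulse_response, threshold_db=-20.0, ratio=4.0):
    """Convolve a device-A recording with a device impulse response,
    then apply a simple static hard-knee dynamic range compressor.
    Illustrative only: the real IRs and compression parameters used
    to create devices S1-S11 are not published."""
    # 1) Simulate the device's acoustic channel.
    convolved = np.convolve(audio, impulse_response)[: len(audio)]
    # 2) Attenuate the level above the threshold according to the ratio.
    eps = 1e-10
    level_db = 20.0 * np.log10(np.abs(convolved) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)
    return convolved * 10.0 ** (gain_db / 20.0)

# Example with a synthetic decaying IR (placeholder for a real one):
rng = np.random.default_rng(0)
audio = rng.standard_normal(48000)  # 1 s at 48 kHz
ir = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
simulated = simulate_device(audio, ir)
```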

The development dataset comprises 40 hours of data from device A, and smaller amounts from the other devices. Audio is provided in single channel 44.1kHz 24-bit format.

Task setup

Development dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Data from devices B, C and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore all of it overlaps with the data from device A, but not necessarily with data from the other devices. The total amount of audio in the development set is 64 hours.

The dataset is provided with a training/test split in which 70% of the data for each device is included for training, 30% for testing. Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and training/test split are provided in the following table.

Name        Type       Total duration  Total segments  Train segments  Test segments  Notes
A           Real       40h             14400           10215           330            3855 segments not used in train/test split
B, C        Real       3h each         1080            750             330
S1, S2, S3  Simulated  3h each         1080            750             330
S4, S5, S6  Simulated  3h each         1080            -               330            750 segments not used in train/test split
Total                  64h             23040           13965           2970

Participants are required to report the performance of their system using this train/test setup in order to allow comparison of systems on the development set. Participants are allowed to create their own cross-validation folds or a separate validation set; in this case, pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset, or from the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the split.
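A location-aware split can be sketched as follows, assuming file names follow the pattern above. The helper names are illustrative, not part of the official toolkit.

```python
import collections
import random

def location_id(filename):
    """Parse the recording location from a file name of the form
    [scene label]-[city]-[location id]-[segment id]-[device id].wav"""
    scene, city, location, segment, device = filename[: -len(".wav")].split("-")
    return f"{city}-{location}"

def split_by_location(filenames, train_fraction=0.7, seed=0):
    """Split files so that no recording location appears on both sides."""
    by_location = collections.defaultdict(list)
    for name in filenames:
        by_location[location_id(name)].append(name)
    locations = sorted(by_location)
    random.Random(seed).shuffle(locations)
    n_train = int(round(train_fraction * len(locations)))
    train = [f for loc in locations[:n_train] for f in by_location[loc]]
    test = [f for loc in locations[n_train:] for f in by_location[loc]]
    return train, test

# Hypothetical file names for illustration:
files = ["airport-lisbon-1000-40000-a.wav",
         "airport-lisbon-1000-40001-b.wav",
         "tram-helsinki-2001-50000-a.wav",
         "park-milan-1013-53000-s1.wav"]
train, test = split_by_location(files)
```

Note that both segments from location lisbon-1000 always end up on the same side, regardless of the random seed.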

Evaluation dataset

The evaluation dataset will contain data from 12 cities, 10 acoustic scenes, 11 devices. There will be five new devices (not available in the development set): real device D and simulated devices S7-S11. Evaluation data will contain 33 hours of audio. The evaluation data contains audio recorded at different locations than the development data.

Device and city information will not be provided in the evaluation set. The systems are expected to be robust to different devices.

Download


Subtask B: Low-Complexity Acoustic Scene Classification

This subtask is concerned with classification of audio into three major classes: indoor, outdoor and transportation. The task targets low-complexity solutions for the classification problem in terms of model size, and uses audio recorded with a single device (device A).

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is TAU Urban Acoustic Scenes 2020 3Class. The dataset contains recordings from 12 European cities in 10 different acoustic scenes.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.


The 10 acoustic scenes are grouped into three major classes as follows:

  • Indoor scenes - indoor: airport, indoor shopping mall, and metro station
  • Outdoor scenes - outdoor: pedestrian street, public square, street with medium level of traffic, and urban park
  • Transportation related scenes - transportation: travelling by a bus, travelling by a tram, and travelling by an underground metro

This dataset contains data recorded with a single device (device A). Audio is provided in binaural, 48kHz 24-bit format.

Task setup

Development dataset

The development set contains data from 10 cities. The total amount of audio in the development set is 40 hours.

The dataset is provided with a training/test split. Participants are required to report the performance of their system using this train/test setup in order to allow comparison of systems on the development set.

Participants are allowed to create their own cross-validation folds or a separate validation set; in this case, pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset, or from the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the split. In this dataset, the device id is always A.

Evaluation dataset

The evaluation set will contain data from 12 cities (2 cities unseen in the development set). Evaluation data will contain 30 hours of audio.

System complexity requirements

Classifier complexity for this setup is limited to 500 KB for the non-zero parameters. This translates into 128 K parameters when using float32 (32-bit float), which is often the default data type (128 000 parameter values × 32 bits per parameter / 8 bits per byte = 512 000 bytes = 500 KB).

By limiting the size of the model on disk, we allow participants some flexibility in design: some implementations may prefer to minimize the number of non-zero parameters of the network in order to comply with the size limit, while others may represent the model parameters with a low number of bits. There is no requirement or recommendation on which method should be used to minimize the model size.

In order to apply the limit strictly to the classifier size, the parameter count will exclude the first active layer of the network if this layer is a feature extraction layer (e.g. a Kapre layer in Keras, or tf.signal.* operations in TensorFlow). If feature extraction is done separately, all layers/parameters of the neural network are counted. Layers not used in the classification (inference) stage, such as batch normalization layers, are also excluded from the model size calculation. If the system uses embeddings (e.g. VGGish, OpenL3 or EdgeL3), the network used to generate the embeddings counts towards the number of parameters.

The computational complexity of the feature extraction stage is not included in the system complexity estimation within this task. We acknowledge that feature extraction is an integral part of the system complexity, but since there is no established method for estimating and comparing complexity of different feature extraction implementations, we estimate the complexity through the size of the classifier models, in order to keep the complexity estimation straightforward across different approaches.

Full information about the model size should be provided in the technical report.
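As an illustration of the rules above, a minimal size check might count the non-zero values of the parameter arrays of the layers included in the calculation. This is only a sketch; the Keras script shipped with the baseline system is the reference implementation.

```python
import numpy as np

KB = 1024  # the 500 KB limit corresponds to 512 000 bytes

def model_size_kb(weight_arrays, bytes_per_param=4):
    """Size of the non-zero parameters of a model, in KB.
    `weight_arrays` is a list of parameter arrays (e.g. kernel and bias
    of each counted layer); skipped layers (feature extraction, batch
    normalization) are simply left out of the list. Illustrative only."""
    non_zero = sum(int(np.count_nonzero(w)) for w in weight_arrays)
    return non_zero * bytes_per_param / KB

# Sanity check: 128 000 float32 parameters hit the 500 KB limit exactly.
limit_check = model_size_kb([np.ones(128_000, dtype=np.float32)])
```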

Model size calculation

We offer a script for calculating the model size for Keras based models along with the baseline system. If you have any doubts about how to calculate the model size, please contact toni.heittola@tuni.fi or write to the DCASE forum for visibility.

Calculation examples

Total model size: 17.87 MB (Audio embeddings) + 1.254 MB (Acoustic model) = 19.12 MB

Audio embeddings (OpenL3):

Layer                  Parameters  Non-zero parameters  Size (non-zero)  Note
input_1                        0           0            0 KB
melspectrogram_1       4 460 800   4 196 335            16.01 MB         Skipped
batch_normalization_1          4           4            16 bytes         Skipped
conv2d_1                     640         640            2.5 KB
batch_normalization_2        256         256            1 KB             Skipped
activation_1                   0           0            0 KB
conv2d_2                  36 928      36 928            144.2 KB
batch_normalization_3        256         256            1 KB             Skipped
activation_2                   0           0            0 KB
max_pooling2d_1                0           0            0 KB
conv2d_3                  73 856      73 856            288.5 KB
batch_normalization_4        512         512            2 KB             Skipped
activation_3                   0           0            0 KB
conv2d_4                 147 584     147 584            576.5 KB
batch_normalization_5        512         512            2 KB             Skipped
activation_4                   0           0            0 KB
max_pooling2d_2                0           0            0 KB
conv2d_5                 295 168     295 168            1.126 MB
batch_normalization_6       1024        1024            4 KB             Skipped
activation_5                   0           0            0 KB
conv2d_6                 590 080     590 080            2.251 MB
batch_normalization_7       1024        1024            4 KB             Skipped
activation_6                   0           0            0 KB
max_pooling2d_3                0           0            0 KB
conv2d_7               1 180 160   1 180 160            4.502 MB
batch_normalization_8       2048        2048            8 KB             Skipped
activation_7                   0           0            0 KB
audio_embedding_layer  2 359 808   2 359 808            9.002 MB
max_pooling2d_4                0           0            0 KB
flatten_1                      0           0            0 KB
Total                  4 684 224   4 684 224            17.87 MB

Acoustic model:

Layer    Parameters  Non-zero parameters  Size (non-zero)
dense_1     262 656     262 557           1.002 MB
dense_2      65 664      65 664           256.5 KB
dense_3         387         387           1.512 KB
Total       328 707     328 608           1.254 MB

Total model size: 0 KB (Audio embeddings) + 450.1 KB (Acoustic model) = 450.1 KB

Acoustic model:

Layer                  Parameters  Non-zero parameters  Size (non-zero)  Note
conv2d_1                    1600        1600            6.25 KB
batch_normalization_1        128         128            512 bytes        Skipped
activation_1                   0           0            0 KB
max_pooling2d_1                0           0            0 KB
dropout_1                      0           0            0 KB
conv2d_2                 100 416     100 416            392.2 KB
batch_normalization_2        256         256            1 KB             Skipped
activation_2                   0           0            0 KB
max_pooling2d_2                0           0            0 KB
dropout_2                      0           0            0 KB
flatten_1                      0           0            0 KB
dense_1                   12 900      12 900            50.39 KB
dropout_3                      0           0            0 KB
dense_2                      303         303            1.184 KB
Total                    115 219     115 219            450.1 KB

Download


External data resources

Use of external data is allowed in all subtasks under the following conditions:

  • The used external resource is clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets, trained models or impulse responses. The data must be public and freely available before 1st of April 2020.

  • Participants submit at least one system without external training data so that we can study the contribution of such resources. This condition applies only when external audio datasets are used; it does not apply when the external data consists of pretrained models or embeddings. The list of external data sources used in training must be clearly indicated in the technical report.

  • Participants inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the task coordinators; we will update the list of external datasets on the webpage accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources allowed).

  • It is not allowed to use TUT Urban Acoustic Scenes 2018, TAU Urban Acoustic Scenes 2019 or TAU Urban Acoustic Scenes 2019 Mobile. These datasets are partially included in the current setup, and additional usage will lead to overfitting.

List of external data resources allowed:

Dataset name Type Added Link
LITIS Rouen audio scene dataset audio 04.03.2019 https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task audio 04.03.2019 https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task audio 04.03.2019 https://archive.org/details/dcase2013_scene_classification_testset
Dares G1 audio 04.03.2019 http://www.daresounds.org/
AudioSet audio 04.03.2019 https://research.google.com/audioset/
OpenL3 model 12.02.2020 https://openl3.readthedocs.io/
EdgeL3 model 12.02.2020 https://edgel3.readthedocs.io/
VGGish model 12.02.2020 https://github.com/tensorflow/models/tree/master/research/audioset/vggish


Submission

Participants can choose to participate in only one subtask or both.

Official challenge submission consists of:

  • System output file (*.csv)
  • Technical report explaining in sufficient detail the method (*.pdf)
  • Metadata file (*.yaml)

System output should be presented as a single text file (in CSV format, with a header row) containing the classification result for each audio file in the evaluation set. In addition, the results file should contain the probability for each scene class. Result items can be in any order. Row format:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename    scene_label airport bus     metro   metro_station   park    public_square   shopping_mall   street_pedestrian   street_traffic  tram
0.wav       bus         0.25    0.99    0.12    0.32            0.41    0.42            0.23            0.34                0.12            0.45
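Writing the output file can be sketched with the standard csv module. The helper below is illustrative only, assuming per-class probabilities are already available for each evaluation file.

```python
import csv

# Scene classes in the column order required by the output format.
SCENES = ["airport", "bus", "metro", "metro_station", "park",
          "public_square", "shopping_mall", "street_pedestrian",
          "street_traffic", "tram"]

def write_output(path, predictions):
    """Write a tab-separated results file with a header row.
    `predictions` maps a file name to a dict of scene -> probability;
    the reported scene label is the class with the highest probability.
    A minimal sketch of the required format, not an official tool."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["filename", "scene_label"] + SCENES)
        for filename, probs in predictions.items():
            label = max(SCENES, key=lambda s: probs[s])
            writer.writerow([filename, label] + [probs[s] for s in SCENES])

# Hypothetical probabilities for a single evaluation file:
probs = {s: 0.05 for s in SCENES}
probs["bus"] = 0.55
write_output("output.csv", {"0.wav": probs})
```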

Multiple system outputs can be submitted (maximum 4 per participant per subtask). For each system, meta information should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission.

Please make a clear connection between the system name in the submitted yaml, submitted system output, and the technical report!

Detailed information for submission can be found on the submission page.

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Use of external data is allowed. See conditions for external data usage here.
  • In subtask B, model size limit applies. See conditions for the model size here.
  • Manipulation of provided training and development data is allowed (e.g. by mixing, by sampling data from a probability density function, or by using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden. Separately published leaderboard data is considered as evaluation data as well.
  • The classification decision must be made independently for each test sample.

Evaluation

Systems will be ranked by macro-average accuracy (average of the class-wise accuracies).

As a secondary metric we will use multiclass cross-entropy (log loss), in order to have a metric that is independent of the operating point (see the Python implementation here).
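Both metrics can be sketched in a few lines. These are illustrative reimplementations, not the official scoring code.

```python
import numpy as np

def macro_average_accuracy(y_true, y_pred):
    """Mean of the class-wise accuracies (balanced accuracy).
    `y_true` and `y_pred` hold integer class indices per sample."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass cross-entropy: `probs[i, c]` is the predicted
    probability of class c for sample i."""
    p = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.mean(np.log(p)))

# Two classes, class 0 half right and class 1 fully right:
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
acc = macro_average_accuracy(y_true, y_pred)  # (1/2 + 2/2) / 2 = 0.75
```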

Baseline system

The baseline system provides a state-of-the-art approach for the classification problem in each subtask. It is built on the dcase_util toolbox and includes all the functionality needed for dataset handling, acoustic feature extraction and storage, acoustic model training and storage, and evaluation. The modular structure of the system enables participants to modify it to their needs.

Repository


Subtask A

The baseline system is a modification of the baselines from previous DCASE challenge editions of the acoustic scene classification task, built on the same skeleton. It replaces the mel energy features with OpenL3 embeddings, and replaces the CNN architecture with two fully connected feed-forward layers (of size 512 and 128), as in the original OpenL3 publication:

Publication

J. Cramer, H-.H. Wu, J. Salamon, and J. P. Bello. Look, listen and learn more: design choices for deep audio embeddings. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 3852–3856. Brighton, UK, May 2019. URL: https://ieeexplore.ieee.org/document/8682475.


Abstract

A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.

Keywords

audio signal processing;learning (artificial intelligence);signal classification;audio collections;labeled datasets;self-supervised learning;audio-visual correspondence;downstream audio classification tasks;downstream audio classifiers;audio-informed choices;deep learning;deep audio embeddings;net embedding model;net design choices;VGGish embeddings;SoundNet embeddings;Videos;Task analysis;Training;Computational modeling;Data models;Training data;Spectrogram;Audio classification;machine listening;deep audio embeddings;deep learning;transfer learning


Parameters

  • Audio embeddings:
    • OpenL3 embedding
    • Analysis window size: 1 second, Hop size: 100 ms
    • Embeddings parameters:
      • Input representation: mel256
      • Content type: music
      • Embedding size: 512
  • Neural network:
    • Architecture: Two fully connected layers (512 and 128 hidden units)
    • Learning: 200 epochs (batch size 64), data shuffling between epochs
    • Optimizer: Adam (learning rate 0.001)
  • Model selection:
    • Approximately 30% of the original training data is assigned to validation set, split done such that training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and best performing model is selected

Results for the development dataset

Results are calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

The system is compared against the 2019 baseline system in order to put its performance into context with respect to the problem:

System                                Accuracy        Log loss         Description
DCASE2019 Task 1 Baseline             46.5 % (± 1.2)  1.578 (± 0.029)  Log mel-band energies as features; two 2D CNN layers and one fully connected layer as classifier
DCASE2020 Task 1 Baseline, Subtask A  54.6 % (± 1.5)  1.356 (± 0.055)  OpenL3 audio embeddings; two fully connected layers as classifier

Detailed results for the DCASE2020 baseline:

Scene label         Accuracy        Device-wise accuracies (%)                                Log loss
                                    A     B     C     S1    S2    S3    S4    S5    S6
Airport             43.6 %          63.0  49.7  49.7  53.3  34.5  39.7  36.1  38.8  27.9      1.693
Bus                 61.5 %          84.5  64.8  81.2  64.2  64.5  45.5  59.4  36.7  52.4      1.019
Metro               54.6 %          69.1  45.8  63.3  45.5  50.6  53.0  50.9  41.5  72.1      1.263
Metro station       54.3 %          62.1  48.5  46.4  49.7  54.5  68.8  54.2  53.6  50.9      1.227
Park                71.0 %          92.1  97.9  90.6  68.8  80.0  67.6  49.4  54.8  37.6      1.026
Public square       43.0 %          63.6  37.0  60.0  41.8  37.9  52.7  49.4  31.2  13.0      1.658
Shopping mall       56.9 %          63.0  67.3  65.5  57.6  61.2  39.7  43.3  57.3  57.0      1.203
Street, pedestrian  28.8 %          54.2  34.8  33.3  25.5  35.8  26.1  25.5  2.1   21.8      2.335
Street, traffic     78.6 %          80.3  85.5  86.1  83.0  70.0  79.4  81.8  89.1  52.4      0.809
Tram                53.6 %          66.7  56.1  63.6  60.6  53.3  56.4  57.6  47.9  20.3      1.326
Average             54.6 % (± 1.5)  69.9  58.7  64.0  55.0  54.2  52.9  50.8  45.3  40.5      1.356 (± 0.055)

As discussed here, devices S4-S6 are used only for testing not for training the system.

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

In subtask B, the baseline system is similar to the DCASE2019 baseline. The system implements a convolutional neural network (CNN) based approach: log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals. The model size of the system is 450 KB.

Parameters

  • Audio features:
    • Log mel-band energies (40 bands), analysis frame 40 ms (50% hop size)
  • Neural network:
    • Input shape: 40 * 500 (10 seconds)
    • Architecture:
      • CNN layer #1
        • 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLu activation
        • 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
      • CNN layer #2
        • 2D Convolutional layer (filters: 64, kernel size: 7) + Batch normalization + ReLu activation
        • 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
      • Flatten
      • Dense layer #1
        • Dense layer (units: 100, activation: ReLu )
        • Dropout (rate: 30%)
      • Output layer (activation: softmax)
    • Learning: 200 epochs (batch size 16), data shuffling between epochs
    • Optimizer: Adam (learning rate 0.001)
  • Model selection:
    • Approximately 30% of the original training data is assigned to validation set, split done such that training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and best performing model is selected
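As a sanity check, the parameter counts in the model size calculation example above can be derived directly from this architecture, assuming 'same' convolution padding (so only the pooling layers shrink the 40 × 500 input):

```python
# Parameter count of the subtask B baseline CNN, derived from the
# architecture above (kernels 7x7, pools (5,5) and (4,100)).
def conv2d_params(filters, kernel, in_channels):
    """Weights plus one bias per filter for a 2D convolution."""
    return filters * (kernel * kernel * in_channels) + filters

def dense_params(units, in_features):
    """Weights plus one bias per unit for a fully connected layer."""
    return units * in_features + units

conv1 = conv2d_params(32, 7, 1)           # 1 600
conv2 = conv2d_params(64, 7, 32)          # 100 416
# Feature map: 40x500 -> pool (5,5) -> 8x100 -> pool (4,100) -> 2x1
flat = 2 * 1 * 64                         # 128 features after flatten
dense1 = dense_params(100, flat)          # 12 900
dense2 = dense_params(3, 100)             # 303 (three output classes)
total = conv1 + conv2 + dense1 + dense2   # 115 219, matching the table
```

At 4 bytes per float32 parameter this gives 460 876 bytes, i.e. the 450.1 KB reported for the model.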

Results for the development dataset

Results are calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

The system is compared against the subtask A baseline system and a minified version of it, in order to show performance at different model size levels. The modified version of the subtask A baseline replaces the OpenL3 audio embeddings with EdgeL3 embeddings, a sparsified version of OpenL3 intended for low-complexity applications. More about EdgeL3 embeddings can be found here:

Publication

S. Kumari, D. Roy, M. Cartwright, J. P. Bello, and A. Arora. EdgeL3: compressing L3-Net for mote scale urban noise monitoring. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 877–884. May 2019. doi:10.1109/IPDPSW.2019.00145.

EdgeL3: Compressing L3-Net for Mote Scale Urban Noise Monitoring

Abstract

Urban noise sensing in deeply embedded devices at the edge of the Internet of Things (IoT) is challenging not only because of the lack of sufficiently labeled training data but also because device resources are quite limited. Look, Listen, and Learn (L3), a recently proposed state-of-the-art transfer learning technique, mitigates the first challenge by training self-supervised deep audio embeddings through binary Audio-Visual Correspondence (AVC), and the resulting embeddings can be used to train a variety of downstream audio classification tasks. However, with close to 4.7 million parameters, the multi-layer L3-Net CNN is still prohibitively expensive to be run on small edge devices, such as "motes" that use a single microcontroller and limited memory to achieve long-lived self-powered operation. In this paper, we comprehensively explore the feasibility of compressing the L3-Net for mote-scale inference. We use pruning, ablation, and knowledge distillation techniques to show that the originally proposed L3-Net architecture is substantially overparameterized, not only for AVC but for the target task of sound classification as evaluated on two popular downstream datasets. Our findings demonstrate the value of fine-tuning and knowledge distillation in regaining the performance lost through aggressive compression strategies. Finally, we present EdgeL3, the first L3-Net reference model compressed by 1-2 orders of magnitude for real-time urban noise monitoring on resource-constrained edge devices, that can fit in just 0.4 MB of memory through half-precision floating point representation.

Keywords

audio signal processing;convolutional neural nets;environmental science computing;Internet of Things;learning (artificial intelligence);neural net architecture;noise pollution;signal classification;real-time urban noise monitoring;resource-constrained edge devices;IoT;AVC;downstream audio classification tasks;single microcontroller;mote-scale inference;knowledge distillation techniques;sound classification;compression strategies;binary audio-visual correspondence;self-supervised deep audio embeddings;EdgeL3;downstream datasets;mote scale urban noise monitoring;urban noise sensing;embedded devices;Internet of Things;look listen and learn;transfer learning technique;multilayer L3-Net CNN;L3-Net architecture;memory size 0.4 MByte;Task analysis;Sensors;Training;Convolution;Monitoring;Visualization;Feature extraction;edge network;pruning;convolutional neural nets;deep learning;audio embedding;transfer learning;finetuning;knowledge distillation

Systems:

System                                         Accuracy        Log loss         Audio embedding  Acoustic model  Total size
DCASE2020 Task 1 Baseline, Subtask A           89.8 % (± 0.3)  0.266 (± 0.006)  17.87 MB         145.2 KB        19.12 MB
  (OpenL3 + MLP, 2 layers: 512 and 128 units)
Modified DCASE2020 Task 1 Baseline, Subtask A  88.9 % (± 0.3)  0.298 (± 0.003)  840.6 KB         145.2 KB        985.8 KB
  (EdgeL3 + MLP, 2 layers: 64 units each)
DCASE2020 Task 1 Baseline, Subtask B           87.3 % (± 0.7)  0.437 (± 0.045)  -                450.1 KB        450 KB
  (Log mel-band energies + CNN, 2 CNN layers and 1 fully connected)

Detailed results for the DCASE2020 Task 1 Baseline, Subtask B:

Class label     Accuracy        Log loss
Indoor          82.0 %          0.680
Outdoor         88.5 %          0.365
Transportation  91.5 %          0.282
Average         87.3 % (± 0.7)  0.437 (± 0.045)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.
