Acoustic scene classification


Task description

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded.

This task comprises two different subtasks that involve system development for two different situations:

Subtask A: Acoustic Scene Classification with Multiple Devices

Classification of data from multiple devices (real and simulated), targeting generalization properties of systems across a number of different devices.

Subtask B: Low-Complexity Acoustic Scene Classification

Classification of data into three higher-level classes while focusing on low-complexity solutions.

Subtask A: Acoustic Scene Classification with Multiple Devices

This subtask addresses the basic problem of acoustic scene classification: classifying a test audio recording into one of ten known acoustic scene classes. It targets generalization properties of systems across a number of different devices, and uses audio data recorded and simulated with a variety of devices.

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is TAU Urban Acoustic Scenes 2020 Mobile. The dataset contains recordings from 12 European cities in 10 different acoustic scenes, made with 4 different devices. Additionally, synthetic data for 11 mobile devices was created based on the original recordings. Of the 12 cities, two are present only in the evaluation set.

Recordings were made using four devices that captured audio simultaneously. The main recording device consists of a Soundman OKM II Klassik/studio A3 electret binaural microphone and a Zoom F8 audio recorder, using a 48 kHz sampling rate and 24-bit resolution; it is referred to as device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device D is a GoPro Hero5 Session.

Acoustic scenes (10):

  • Airport - airport
  • Indoor shopping mall - shopping_mall
  • Metro station - metro_station
  • Pedestrian street - street_pedestrian
  • Public square - public_square
  • Street with medium level of traffic - street_traffic
  • Travelling by a tram - tram
  • Travelling by a bus - bus
  • Travelling by an underground metro - metro
  • Urban park - park

Audio data was recorded in Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm and Vienna.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.


Abstract

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task. As in previous years of the challenge, the task is defined for classification of short audio samples into one of predefined acoustic scene classes, using a supervised, closed-set classification setup. The newly recorded TUT Urban Acoustic Scenes 2018 dataset consists of ten different acoustic scenes and was recorded in six large European cities, therefore it has a higher acoustic variability than the previous datasets used for this task, and in addition to high-quality binaural recordings, it also includes data recorded with mobile devices. We also present the baseline system consisting of a convolutional neural network and its performance in the subtasks using the recommended cross-validation setup.

Keywords

Acoustic scene classification, DCASE challenge, public datasets, multi-device data


Additionally, 11 mobile devices (S1-S11) are simulated using the audio recorded with device A, impulse responses recorded with real devices, and additional dynamic range compression, in order to simulate realistic recordings. A recording from device A is convolved with the impulse response of the selected device Si, then processed with a device-specific set of dynamic range compression parameters. The impulse responses are proprietary data and will not be published.
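The simulation pipeline described above (convolution with a device impulse response, followed by dynamic range compression) can be sketched roughly as follows. The impulse response and the compressor settings here are placeholders: the real impulse responses are proprietary, and the exact compression parameters are device specific and unpublished.

```python
import numpy as np

def simulate_device(audio, impulse_response, threshold_db=-20.0, ratio=4.0):
    """Convolve a device-A recording with a device impulse response,
    then apply a simple static hard-knee dynamic range compressor.
    Illustrative only: the real IRs and compression parameters used
    to create devices S1-S11 are not published."""
    # 1) Simulate the device's acoustic channel.
    convolved = np.convolve(audio, impulse_response)[: len(audio)]
    # 2) Attenuate the level above the threshold according to the ratio.
    eps = 1e-10
    level_db = 20.0 * np.log10(np.abs(convolved) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)
    return convolved * 10.0 ** (gain_db / 20.0)

# Example with a synthetic decaying IR (placeholder for a real one):
rng = np.random.default_rng(0)
audio = rng.standard_normal(48000)  # 1 s at 48 kHz
ir = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
simulated = simulate_device(audio, ir)
```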

The development dataset comprises 40 hours of data from device A, and smaller amounts from the other devices. Audio is provided in single channel 44.1kHz 24-bit format.

Task setup

Development dataset

The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Data from devices B, C and S1-S6 consists of randomly selected segments from the simultaneous recordings; therefore all of it overlaps with the data from device A, but not necessarily with data from the other devices. The total amount of audio in the development set is 64 hours.

The dataset is provided with a training/test split in which 70% of the data for each device is included for training, 30% for testing. Some devices appear only in the test subset. In order to create a perfectly balanced test set, a number of segments from various devices are not included in this split. Complete details on the development set and training/test split are provided in the following table.

Name        Type       Total duration  Total segments  Train segments  Test segments  Notes
A           Real       40h             14400           10215           330            3855 segments not used in train/test split
B, C        Real       3h each         1080            750             330
S1, S2, S3  Simulated  3h each         1080            750             330
S4, S5, S6  Simulated  3h each         1080            -               330            750 segments not used in train/test split
Total                  64h             23040           13965           2970

Participants are required to report the performance of their system using this train/test setup in order to allow comparison of systems on the development set. Participants are allowed to create their own cross-validation folds or a separate validation set; in this case, pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset, or from the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the split.
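A location-aware split can be sketched as follows, assuming file names follow the pattern above. The helper names are illustrative, not part of the official toolkit.

```python
import collections
import random

def location_id(filename):
    """Parse the recording location from a file name of the form
    [scene label]-[city]-[location id]-[segment id]-[device id].wav"""
    scene, city, location, segment, device = filename[: -len(".wav")].split("-")
    return f"{city}-{location}"

def split_by_location(filenames, train_fraction=0.7, seed=0):
    """Split files so that no recording location appears on both sides."""
    by_location = collections.defaultdict(list)
    for name in filenames:
        by_location[location_id(name)].append(name)
    locations = sorted(by_location)
    random.Random(seed).shuffle(locations)
    n_train = int(round(train_fraction * len(locations)))
    train = [f for loc in locations[:n_train] for f in by_location[loc]]
    test = [f for loc in locations[n_train:] for f in by_location[loc]]
    return train, test

# Hypothetical file names for illustration:
files = ["airport-lisbon-1000-40000-a.wav",
         "airport-lisbon-1000-40001-b.wav",
         "tram-helsinki-2001-50000-a.wav",
         "park-milan-1013-53000-s1.wav"]
train, test = split_by_location(files)
```

Note that both segments from location lisbon-1000 always end up on the same side, regardless of the random seed.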

Evaluation dataset

The evaluation dataset will contain data from 12 cities, 10 acoustic scenes, 11 devices. There will be five new devices (not available in the development set): real device D and simulated devices S7-S11. Evaluation data will contain 33 hours of audio. The evaluation data contains audio recorded at different locations than the development data.

Device and city information will not be provided in the evaluation set. The systems are expected to be robust to different devices.

Download


Subtask B: Low-Complexity Acoustic Scene Classification

This subtask is concerned with classification of audio into three major classes: indoor, outdoor and transportation. The task targets low-complexity solutions for the classification problem in terms of model size, and uses audio recorded with a single device (device A).

Figure 1: Overview of acoustic scene classification system.


Audio dataset

The dataset for this task is TAU Urban Acoustic Scenes 2020 3Class. The dataset contains recordings from 12 European cities in 10 different acoustic scenes.

The dataset was collected by Tampere University of Technology between 05/2018 - 11/2018. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


For complete details on the data recording and processing see

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.


The 10 acoustic scenes are grouped into three major classes as follows:

  • Indoor scenes - indoor: airport, indoor shopping mall, and metro station
  • Outdoor scenes - outdoor: pedestrian street, public square, street with medium level of traffic, and urban park
  • Transportation related scenes - transportation: travelling by a bus, travelling by a tram, and travelling by an underground metro

This dataset contains data recorded with a single device (device A). Audio is provided in binaural, 48kHz 24-bit format.

Task setup

Development dataset

The development set contains data from 10 cities. The total amount of audio in the development set is 40 hours.

The dataset is provided with a training/test split. Participants are required to report the performance of their system using this train/test setup in order to allow comparison of systems on the development set.

Participants are allowed to create their own cross-validation folds or a separate validation set; in this case, pay attention to segments recorded at the same location. The location identifier can be found in the metadata file provided with the dataset, or from the audio file names:

[scene label]-[city]-[location id]-[segment id]-[device id].wav

Make sure that all files having the same location id are placed on the same side of the split. In this dataset, the device id is always A.

Evaluation dataset

The evaluation set will contain data from 12 cities (2 cities unseen in the development set). Evaluation data will contain 30 hours of audio.

System complexity requirements

Classifier complexity for this setup is limited to 500 KB for the non-zero parameters. This translates into 128 K parameters when using float32 (32-bit float), which is often the default data type (128 000 parameter values × 32 bits per parameter / 8 bits per byte = 512 000 bytes = 500 KB).

By limiting the size of the model on disk, we allow participants some flexibility in design: some implementations may prefer to minimize the number of non-zero parameters of the network in order to comply with the size limit, while others may represent the model parameters with a low number of bits. There is no requirement or recommendation on which method should be used to minimize the model size.

In order to apply the limit strictly to the classifier size, the parameter count will exclude the first active layer of the network if this layer is a feature extraction layer (e.g. a Kapre layer in Keras, or tf.signal.* operations in TensorFlow). If feature extraction is done separately, all layers/parameters of the neural network are counted. Layers not used in the classification (inference) stage, such as batch normalization layers, are also excluded from the model size calculation. If the system uses embeddings (e.g. VGGish, OpenL3 or EdgeL3), the network used to generate the embeddings counts towards the number of parameters.

The computational complexity of the feature extraction stage is not included in the system complexity estimation within this task. We acknowledge that feature extraction is an integral part of the system complexity, but since there is no established method for estimating and comparing complexity of different feature extraction implementations, we estimate the complexity through the size of the classifier models, in order to keep the complexity estimation straightforward across different approaches.

Full information about the model size should be provided in the technical report.
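As an illustration of the rules above, a minimal size check might count the non-zero values of the parameter arrays of the layers included in the calculation. This is only a sketch; the Keras script shipped with the baseline system is the reference implementation.

```python
import numpy as np

KB = 1024  # the 500 KB limit corresponds to 512 000 bytes

def model_size_kb(weight_arrays, bytes_per_param=4):
    """Size of the non-zero parameters of a model, in KB.
    `weight_arrays` is a list of parameter arrays (e.g. kernel and bias
    of each counted layer); skipped layers (feature extraction, batch
    normalization) are simply left out of the list. Illustrative only."""
    non_zero = sum(int(np.count_nonzero(w)) for w in weight_arrays)
    return non_zero * bytes_per_param / KB

# Sanity check: 128 000 float32 parameters hit the 500 KB limit exactly.
limit_check = model_size_kb([np.ones(128_000, dtype=np.float32)])
```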

Model size calculation

We offer a script for calculating the model size for Keras based models along with the baseline system. If you have any doubts about how to calculate the model size, please contact toni.heittola@tuni.fi or write to the DCASE forum for visibility.

Calculation examples

Total model size: 17.87 MB (Audio embeddings) + 1.254 MB (Acoustic model) = 19.12 MB

Audio embeddings (OpenL3):

Layer                  Parameters  Non-zero parameters  Size (non-zero)  Note
input_1                        0           0            0 KB
melspectrogram_1       4 460 800   4 196 335            16.01 MB         Skipped
batch_normalization_1          4           4            16 bytes         Skipped
conv2d_1                     640         640            2.5 KB
batch_normalization_2        256         256            1 KB             Skipped
activation_1                   0           0            0 KB
conv2d_2                  36 928      36 928            144.2 KB
batch_normalization_3        256         256            1 KB             Skipped
activation_2                   0           0            0 KB
max_pooling2d_1                0           0            0 KB
conv2d_3                  73 856      73 856            288.5 KB
batch_normalization_4        512         512            2 KB             Skipped
activation_3                   0           0            0 KB
conv2d_4                 147 584     147 584            576.5 KB
batch_normalization_5        512         512            2 KB             Skipped
activation_4                   0           0            0 KB
max_pooling2d_2                0           0            0 KB
conv2d_5                 295 168     295 168            1.126 MB
batch_normalization_6       1024        1024            4 KB             Skipped
activation_5                   0           0            0 KB
conv2d_6                 590 080     590 080            2.251 MB
batch_normalization_7       1024        1024            4 KB             Skipped
activation_6                   0           0            0 KB
max_pooling2d_3                0           0            0 KB
conv2d_7               1 180 160   1 180 160            4.502 MB
batch_normalization_8       2048        2048            8 KB             Skipped
activation_7                   0           0            0 KB
audio_embedding_layer  2 359 808   2 359 808            9.002 MB
max_pooling2d_4                0           0            0 KB
flatten_1                      0           0            0 KB
Total                  4 684 224   4 684 224            17.87 MB

Acoustic model:

Layer    Parameters  Non-zero parameters  Size (non-zero)
dense_1     262 656     262 557           1.002 MB
dense_2      65 664      65 664           256.5 KB
dense_3         387         387           1.512 KB
Total       328 707     328 608           1.254 MB

Total model size: 0 KB (Audio embeddings) + 450.1 KB (Acoustic model) = 450.1 KB

Acoustic model:

Layer                  Parameters  Non-zero parameters  Size (non-zero)  Note
conv2d_1                    1600        1600            6.25 KB
batch_normalization_1        128         128            512 bytes        Skipped
activation_1                   0           0            0 KB
max_pooling2d_1                0           0            0 KB
dropout_1                      0           0            0 KB
conv2d_2                 100 416     100 416            392.2 KB
batch_normalization_2        256         256            1 KB             Skipped
activation_2                   0           0            0 KB
max_pooling2d_2                0           0            0 KB
dropout_2                      0           0            0 KB
flatten_1                      0           0            0 KB
dense_1                   12 900      12 900            50.39 KB
dropout_3                      0           0            0 KB
dense_2                      303         303            1.184 KB
Total                    115 219     115 219            450.1 KB

Download


External data resources

Use of external data is allowed in all subtasks under the following conditions:

  • The used external resource is clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets, trained models or impulse responses. The data must be public and freely available before 1st of April 2020.

  • Participants submit at least one system without external training data so that we can study the contribution of such resources. This condition applies only when external audio datasets are used; it does not apply when the external data consists of pretrained models or embeddings. The list of external data sources used in training must be clearly indicated in the technical report.

  • Participants inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the task coordinators; we will update the list of external datasets on the webpage accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources allowed).

  • It is not allowed to use TUT Urban Acoustic Scenes 2018, TAU Urban Acoustic Scenes 2019 or TAU Urban Acoustic Scenes 2019 Mobile. These datasets are partially included in the current setup, and additional usage will lead to overfitting.

List of external data resources allowed:

Dataset name Type Added Link
LITIS Rouen audio scene dataset audio 04.03.2019 https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
DCASE2013 Challenge - Public Dataset for Scene Classification Task audio 04.03.2019 https://archive.org/details/dcase2013_scene_classification
DCASE2013 Challenge - Private Dataset for Scene Classification Task audio 04.03.2019 https://archive.org/details/dcase2013_scene_classification_testset
Dares G1 audio 04.03.2019 http://www.daresounds.org/
AudioSet audio 04.03.2019 https://research.google.com/audioset/
OpenL3 model 12.02.2020 https://openl3.readthedocs.io/
EdgeL3 model 12.02.2020 https://edgel3.readthedocs.io/
VGGish model 12.02.2020 https://github.com/tensorflow/models/tree/master/research/audioset/vggish


Submission

Participants can choose to participate in only one subtask or both.

Official challenge submission consists of:

  • System output file (*.csv)
  • Technical report explaining in sufficient detail the method (*.pdf)
  • Metadata file (*.yaml)

System output should be presented as a single text file (in CSV format, with a header row) containing the classification result for each audio file in the evaluation set. In addition, the results file should contain the probability for each scene class. Result items can be in any order. Row format:

[filename (string)][tab][scene label (string)][tab][airport probability (float)][tab][bus probability (float)][tab][metro probability (float)][tab][metro_station probability (float)][tab][park probability (float)][tab][public_square probability (float)][tab][shopping_mall probability (float)][tab][street_pedestrian probability (float)][tab][street_traffic probability (float)][tab][tram probability (float)]

Example output:

filename    scene_label airport bus     metro   metro_station   park    public_square   shopping_mall   street_pedestrian   street_traffic  tram
0.wav       bus         0.25    0.99    0.12    0.32            0.41    0.42            0.23            0.34                0.12            0.45
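Writing the output file can be sketched with the standard csv module. The helper below is illustrative only, assuming per-class probabilities are already available for each evaluation file.

```python
import csv

# Scene classes in the column order required by the output format.
SCENES = ["airport", "bus", "metro", "metro_station", "park",
          "public_square", "shopping_mall", "street_pedestrian",
          "street_traffic", "tram"]

def write_output(path, predictions):
    """Write a tab-separated results file with a header row.
    `predictions` maps a file name to a dict of scene -> probability;
    the reported scene label is the class with the highest probability.
    A minimal sketch of the required format, not an official tool."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["filename", "scene_label"] + SCENES)
        for filename, probs in predictions.items():
            label = max(SCENES, key=lambda s: probs[s])
            writer.writerow([filename, label] + [probs[s] for s in SCENES])

# Hypothetical probabilities for a single evaluation file:
probs = {s: 0.05 for s in SCENES}
probs["bus"] = 0.55
write_output("output.csv", {"0.wav": probs})
```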

Multiple system outputs can be submitted (maximum 4 per participant per subtask). For each system, meta information should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission.

Please make a clear connection between the system name in the submitted yaml, submitted system output, and the technical report!

Detailed information for submission can be found on the submission page.

Task rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements can be found here.

Task specific rules:

  • Use of external data is allowed. See conditions for external data usage here.
  • In subtask B, model size limit applies. See conditions for the model size here.
  • Manipulation of provided training and development data is allowed (e.g. by mixing, by sampling data from a probability density function, or by using techniques such as pitch shifting or time stretching).
  • Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system; the use of statistics about the evaluation data in the decision making is also forbidden. Separately published leaderboard data is considered as evaluation data as well.
  • The classification decision must be made independently for each test sample.

Evaluation

Systems will be ranked by macro-average accuracy (average of the class-wise accuracies).

As a secondary metric we will use multiclass cross-entropy (log loss), in order to have a metric that is independent of the operating point (see the Python implementation here).
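Both metrics can be sketched in a few lines. These are illustrative reimplementations, not the official scoring code.

```python
import numpy as np

def macro_average_accuracy(y_true, y_pred):
    """Mean of the class-wise accuracies (balanced accuracy).
    `y_true` and `y_pred` hold integer class indices per sample."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass cross-entropy: `probs[i, c]` is the predicted
    probability of class c for sample i."""
    p = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.mean(np.log(p)))

# Two classes, class 0 half right and class 1 fully right:
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
acc = macro_average_accuracy(y_true, y_pred)  # (1/2 + 2/2) / 2 = 0.75
```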

Baseline system

The baseline system provides a state-of-the-art approach for the classification problem in each subtask. It is built on the dcase_util toolbox and includes all the functionality needed for dataset handling, acoustic feature extraction and storage, acoustic model training and storage, and evaluation. The modular structure of the system enables participants to modify it to their needs.

Repository


Subtask A

The baseline system is a modification of the baselines from previous DCASE challenge editions of the acoustic scene classification task, built on the same skeleton. It replaces the mel energy features with OpenL3 embeddings, and replaces the CNN architecture with two fully connected feed-forward layers (of size 512 and 128), as in the original OpenL3 publication:

Publication

J. Cramer, H-.H. Wu, J. Salamon, and J. P. Bello. Look, listen and learn more: design choices for deep audio embeddings. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 3852–3856. Brighton, UK, May 2019. URL: https://ieeexplore.ieee.org/document/8682475.


Abstract

A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.

Keywords

audio signal processing;learning (artificial intelligence);signal classification;audio collections;labeled datasets;self-supervised learning;audio-visual correspondence;downstream audio classification tasks;downstream audio classifiers;audio-informed choices;deep learning;deep audio embeddings;net embedding model;net design choices;VGGish embeddings;SoundNet embeddings;Videos;Task analysis;Training;Computational modeling;Data models;Training data;Spectrogram;Audio classification;machine listening;deep audio embeddings;deep learning;transfer learning


Parameters

  • Audio embeddings:
    • OpenL3 embedding
    • Analysis window size: 1 second, Hop size: 100 ms
    • Embeddings parameters:
      • Input representation: mel256
      • Content type: music
      • Embedding size: 512
  • Neural network:
    • Architecture: Two fully connected layers (512 and 128 hidden units)
    • Learning: 200 epochs (batch size 64), data shuffling between epochs
    • Optimizer: Adam (learning rate 0.001)
  • Model selection:
    • Approximately 30% of the original training data is assigned to validation set, split done such that training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and best performing model is selected

Results for the development dataset

Results are calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

The system is compared against the 2019 baseline system in order to put its performance into context with respect to the problem:

System                                Accuracy        Log loss         Description
DCASE2019 Task 1 Baseline             46.5 % (± 1.2)  1.578 (± 0.029)  Log mel-band energies as features; two 2D CNN layers and one fully connected layer as classifier
DCASE2020 Task 1 Baseline, Subtask A  54.6 % (± 1.5)  1.356 (± 0.055)  OpenL3 audio embeddings; two fully connected layers as classifier

Detailed results for the DCASE2020 baseline:

Scene label         Accuracy        Device-wise accuracies (%)                                Log loss
                                    A     B     C     S1    S2    S3    S4    S5    S6
Airport             43.6 %          63.0  49.7  49.7  53.3  34.5  39.7  36.1  38.8  27.9      1.693
Bus                 61.5 %          84.5  64.8  81.2  64.2  64.5  45.5  59.4  36.7  52.4      1.019
Metro               54.6 %          69.1  45.8  63.3  45.5  50.6  53.0  50.9  41.5  72.1      1.263
Metro station       54.3 %          62.1  48.5  46.4  49.7  54.5  68.8  54.2  53.6  50.9      1.227
Park                71.0 %          92.1  97.9  90.6  68.8  80.0  67.6  49.4  54.8  37.6      1.026
Public square       43.0 %          63.6  37.0  60.0  41.8  37.9  52.7  49.4  31.2  13.0      1.658
Shopping mall       56.9 %          63.0  67.3  65.5  57.6  61.2  39.7  43.3  57.3  57.0      1.203
Street, pedestrian  28.8 %          54.2  34.8  33.3  25.5  35.8  26.1  25.5  2.1   21.8      2.335
Street, traffic     78.6 %          80.3  85.5  86.1  83.0  70.0  79.4  81.8  89.1  52.4      0.809
Tram                53.6 %          66.7  56.1  63.6  60.6  53.3  56.4  57.6  47.9  20.3      1.326
Average             54.6 % (± 1.5)  69.9  58.7  64.0  55.0  54.2  52.9  50.8  45.3  40.5      1.356 (± 0.055)

As discussed here, devices S4-S6 are used only for testing not for training the system.

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Subtask B

In subtask B, the baseline system is similar to the DCASE2019 baseline. The system implements a convolutional neural network (CNN) based approach: log mel-band energies are first extracted for each 10-second signal, and a network consisting of two CNN layers and one fully connected layer is trained to assign scene labels to the audio signals. The model size of the system is 450 KB.

Parameters

  • Audio features:
    • Log mel-band energies (40 bands), analysis frame 40 ms (50% hop size)
  • Neural network:
    • Input shape: 40 * 500 (10 seconds)
    • Architecture:
      • CNN layer #1
        • 2D Convolutional layer (filters: 32, kernel size: 7) + Batch normalization + ReLu activation
        • 2D max pooling (pool size: (5, 5)) + Dropout (rate: 30%)
      • CNN layer #2
        • 2D Convolutional layer (filters: 64, kernel size: 7) + Batch normalization + ReLu activation
        • 2D max pooling (pool size: (4, 100)) + Dropout (rate: 30%)
      • Flatten
      • Dense layer #1
        • Dense layer (units: 100, activation: ReLu )
        • Dropout (rate: 30%)
      • Output layer (activation: softmax)
    • Learning: 200 epochs (batch size 16), data shuffling between epochs
    • Optimizer: Adam (learning rate 0.001)
  • Model selection:
    • Approximately 30% of the original training data is assigned to validation set, split done such that training and validation sets do not have segments from the same location and both sets have data from each city
    • Model performance after each epoch is evaluated on the validation set, and best performing model is selected
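As a sanity check, the parameter counts in the model size calculation example above can be derived directly from this architecture, assuming 'same' convolution padding (so only the pooling layers shrink the 40 × 500 input):

```python
# Parameter count of the subtask B baseline CNN, derived from the
# architecture above (kernels 7x7, pools (5,5) and (4,100)).
def conv2d_params(filters, kernel, in_channels):
    """Weights plus one bias per filter for a 2D convolution."""
    return filters * (kernel * kernel * in_channels) + filters

def dense_params(units, in_features):
    """Weights plus one bias per unit for a fully connected layer."""
    return units * in_features + units

conv1 = conv2d_params(32, 7, 1)           # 1 600
conv2 = conv2d_params(64, 7, 32)          # 100 416
# Feature map: 40x500 -> pool (5,5) -> 8x100 -> pool (4,100) -> 2x1
flat = 2 * 1 * 64                         # 128 features after flatten
dense1 = dense_params(100, flat)          # 12 900
dense2 = dense_params(3, 100)             # 303 (three output classes)
total = conv1 + conv2 + dense1 + dense2   # 115 219, matching the table
```

At 4 bytes per float32 parameter this gives 460 876 bytes, i.e. the 450.1 KB reported for the model.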

Results for the development dataset

Results are calculated using TensorFlow in GPU mode (on an Nvidia Titan XP GPU). Because results produced on a GPU are generally non-deterministic, the system was trained and tested 10 times; the mean and standard deviation of the performance over these 10 independent trials are shown in the results tables.

The system is compared against the subtask A baseline system and a minified version of it, in order to show performance at different model size levels. The modified version of the subtask A baseline replaces the OpenL3 audio embeddings with EdgeL3 embeddings, a sparsified version of OpenL3 intended for low-complexity applications. More about EdgeL3 embeddings can be found here:

Publication

S. Kumari, D. Roy, M. Cartwright, J. P. Bello, and A. Arora. EdgeL3: compressing L3-Net for mote scale urban noise monitoring. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 877–884. May 2019. doi:10.1109/IPDPSW.2019.00145.

EdgeL3: Compressing L3-Net for Mote Scale Urban Noise Monitoring

Abstract

Urban noise sensing in deeply embedded devices at the edge of the Internet of Things (IoT) is challenging not only because of the lack of sufficiently labeled training data but also because device resources are quite limited. Look, Listen, and Learn (L3), a recently proposed state-of-the-art transfer learning technique, mitigates the first challenge by training self-supervised deep audio embeddings through binary Audio-Visual Correspondence (AVC), and the resulting embeddings can be used to train a variety of downstream audio classification tasks. However, with close to 4.7 million parameters, the multi-layer L3-Net CNN is still prohibitively expensive to be run on small edge devices, such as "motes" that use a single microcontroller and limited memory to achieve long-lived self-powered operation. In this paper, we comprehensively explore the feasibility of compressing the L3-Net for mote-scale inference. We use pruning, ablation, and knowledge distillation techniques to show that the originally proposed L3-Net architecture is substantially overparameterized, not only for AVC but for the target task of sound classification as evaluated on two popular downstream datasets. Our findings demonstrate the value of fine-tuning and knowledge distillation in regaining the performance lost through aggressive compression strategies. Finally, we present EdgeL3, the first L3-Net reference model compressed by 1-2 orders of magnitude for real-time urban noise monitoring on resource-constrained edge devices, that can fit in just 0.4 MB of memory through half-precision floating point representation.

Keywords

audio signal processing;convolutional neural nets;environmental science computing;Internet of Things;learning (artificial intelligence);neural net architecture;noise pollution;signal classification;real-time urban noise monitoring;resource-constrained edge devices;IoT;AVC;downstream audio classification tasks;single microcontroller;mote-scale inference;knowledge distillation techniques;sound classification;compression strategies;binary audio-visual correspondence;self-supervised deep audio embeddings;EdgeL3;downstream datasets;mote scale urban noise monitoring;urban noise sensing;embedded devices;Internet of Things;look listen and learn;transfer learning technique;multilayer L3-Net CNN;L3-Net architecture;memory size 0.4 MByte;Task analysis;Sensors;Training;Convolution;Monitoring;Visualization;Feature extraction;edge network;pruning;convolutional neural nets;deep learning;audio embedding;transfer learning;finetuning;knowledge distillation

Systems:

System                                         Accuracy        Log loss         Audio embedding  Acoustic model  Total size
DCASE2020 Task 1 Baseline, Subtask A           89.8 % (± 0.3)  0.266 (± 0.006)  17.87 MB         145.2 KB        19.12 MB
  (OpenL3 + MLP, 2 layers: 512 and 128 units)
Modified DCASE2020 Task 1 Baseline, Subtask A  88.9 % (± 0.3)  0.298 (± 0.003)  840.6 KB         145.2 KB        985.8 KB
  (EdgeL3 + MLP, 2 layers: 64 units each)
DCASE2020 Task 1 Baseline, Subtask B           87.3 % (± 0.7)  0.437 (± 0.045)  -                450.1 KB        450 KB
  (Log mel-band energies + CNN, 2 CNN layers and 1 fully connected)

Detailed results for the DCASE2020 Task 1 Baseline, Subtask B:

Class label     Accuracy        Log loss
Indoor          82.0 %          0.680
Outdoor         88.5 %          0.365
Transportation  91.5 %          0.282
Average         87.3 % (± 0.7)  0.437 (± 0.045)

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Citation

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 9–13. November 2018. URL: https://arxiv.org/abs/1807.09840.
