Acoustic-Based Traffic Monitoring


Task description

This task aims to design an acoustic-based traffic monitoring solution that counts the number of passing vehicles per vehicle type (car or commercial vehicle) and per direction of travel (left or right).

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

Traffic monitoring solutions are an essential part of smart city development, used to monitor the usage and condition of roadway infrastructure and to detect anomalies. These systems are designed based on several different sensors grouped into two main categories: intrusive and non-intrusive sensors. Examples of intrusive sensors, which are embedded in the road, are induction loops and vibration or magnetic sensors. Examples of non-intrusive systems, mounted over or on the side of the road, are radar, cameras, infrared or acoustic sensors, and off-road mobile devices, e.g. aircraft or satellites. Acoustic sensors offer several advantages over other sensors, e.g. low cost, power efficiency, ease of installation, and robustness to adverse weather and low-visibility conditions, which make them a desirable choice either standalone or in combination with other sensors.

Given the difficulty of collecting and labeling real-world traffic data, in this challenge we would like to investigate the effect of using synthetic data generated by a traffic simulator on system performance. Hence, in addition to real-world traffic sound recordings, we provide a tool to synthesize realistic vehicle pass-by events in various traffic conditions, based on the open-source pyroadacoustics road acoustics simulator.

Figure 1: Overview of acoustic-based traffic monitoring system.


The participants are encouraged to investigate the performance gain from using synthetic data generated via the traffic simulator, or from other data augmentation techniques, to compensate for the limited amount of available training data. Together with the audio recordings, we will also release meta-data such as:

  • Sensor location id, to allow for the development of different models per location (site).
  • Timestamp of the segment as “day of week” and “hour of day”, to allow models to learn varying traffic conditions naturally occurring in each site.
  • Geometry information on the position of the sensor array relative to the traffic lanes.
  • Ground truth number of passing vehicles, aggregated per minute into 4 categories based on the direction of travel and the class of the vehicle, namely car or commercial vehicle (cv): car/right-to-left, car/left-to-right, cv/right-to-left and cv/left-to-right. Depending on the location, the annotations are extracted from magnetic coils installed below the road surface, from radar or cameras installed at the side of the road, or from human labels for locations with low traffic density. The source of the ground truth will not be revealed to the participants.
  • Maximum speed of vehicles at the specific location.
  • Highest number of pass-by vehicles for peak hours in each direction, both for weekdays and weekends per location.

Audio dataset

Recording procedure

The released audio data has been collected using Bosch urban traffic monitoring sensors. A linear microphone array installed on the side of the road, parallel to the direction of travel, provides audio input to the traffic monitoring system. Real-world data is collected in urban environments with various traffic densities, from country roads with at most 5 vehicles per minute to intercity roads with as many as 30 vehicles per direction per minute. All real-world data acquisitions use the same uniform linear microphone array with 4 capsules and an aperture of 24 cm. We have acquired audio segments across Germany and the United States between November 2019 and December 2022. A subset of the data will be released for this task. Labels for the data have been collected via various sensors, e.g. coils, radar and cameras, as well as human annotators. The source of the labels will not be released to the challenge participants.

Synthetic Data

Given the difficulties posed by the collection and labeling of realistic traffic data, we propose incorporating synthetic data during model training. Participants can use a data generation system based on the open-source pyroadacoustics simulator. It is designed to simulate the sound produced by a source moving on a road with arbitrary trajectory and speed, and received by a set of stationary microphones with arbitrary array geometry and microphone directivity pattern. The signal produced by pyroadacoustics includes the direct sound, the reflection from the asphalt surface, air absorption and the Doppler effect. The acoustic traffic simulation is performed in two stages: first, the (stationary) source signal produced by the rolling tires and the car engine is generated using the Harmonoise model; then, the produced signal is fed to pyroadacoustics, which simulates the source motion and acoustic propagation. More details regarding the data generation process can be found here.
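
To make the second stage concrete, below is a minimal numpy sketch (not the pyroadacoustics API) of rendering a moving source at a stationary microphone: the Doppler effect emerges from a time-varying delay line, and a white-noise signal stands in for the Harmonoise-style source. All names and parameter values are illustrative assumptions.

    import numpy as np

    def render_passby(src, fs, mic_pos, p0, velocity, c=343.0):
        """Illustrative direct-path render of a moving source:
        1/r spreading plus Doppler via a variable-length delay line.
        Simplified: the delay is evaluated at reception time."""
        n = len(src)
        t = np.arange(n) / fs
        pos = p0[None, :] + t[:, None] * velocity[None, :]   # linear trajectory
        dist = np.linalg.norm(pos - mic_pos[None, :], axis=1)
        delay = dist / c * fs                                # delay in samples
        read_idx = np.arange(n) - delay                      # time-varying read pointer
        out = np.interp(read_idx, np.arange(n), src, left=0.0, right=0.0)
        return out / np.maximum(dist, 1.0)                   # spherical spreading

    fs = 16000
    noise = np.random.randn(10 * fs)            # stand-in for a Harmonoise source
    mic = np.array([0.0, 4.0, 1.2])             # 4 m from the lane, 1.2 m high
    sig = render_passby(noise, fs, mic,
                        p0=np.array([-70.0, 0.0, 0.5]),
                        velocity=np.array([14.0, 0.0, 0.0]))  # ~50 km/h pass-by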

Reference labels

We frame the traffic monitoring task (counting, vehicle type and travel direction detection) as a regression task, with the following ground truth labels for each one-minute audio segment:

  • car_left: number of passenger vehicles going left to right per minute
  • car_right: number of passenger vehicles going right to left per minute
  • cv_left: number of commercial vehicles going left to right per minute
  • cv_right: number of commercial vehicles going right to left per minute

The annotation file has the following columns:

dow: the day of the week the audio was recorded. This is an integer between 0 and 6.

hour: the hour the audio was recorded.

minute: the minute the audio was recorded.

car_left: number of passenger cars going from left to right in the one-minute segment.

car_right: number of passenger cars going from right to left in the one-minute segment.

cv_left: number of commercial vehicles going from left to right in the one-minute segment.

cv_right: number of commercial vehicles going from right to left in the one-minute segment.

path: path of the audio file.

split: the data split (train, test, val).
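
As a concrete example, here is a minimal pandas sketch for loading one site's annotations; the file path is illustrative, and the column names follow the list above.

    import pandas as pd

    ann = pd.read_csv("loc1/train.csv")        # illustrative path

    # Assemble the four regression targets per one-minute segment,
    # in the same column order used elsewhere in this task.
    targets = ann[["car_left", "car_right", "cv_left", "cv_right"]].to_numpy()
    audio_paths = ann["path"].tolist()

    # dow/hour/minute can serve as extra model inputs (see Meta-data below).
    time_features = ann[["dow", "hour", "minute"]].to_numpy()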

Figure 2: Number of one minute segments (Count) vs. number of vehicles passing by (# vehicles), shown per site and label.


Meta-data

The meta-data provided for each site contains information regarding the microphone array geometry and the traffic conditions. The attributes corresponding to the geometry are the height at which the microphone array is installed (array-height) and the distance of the array to the side of the street (distance-to-street-side). The attributes corresponding to the traffic conditions are the maximum speed of vehicles on the street (max-pass-by-speed) and the maximum number of pass-by cars in a one-minute interval (max-traffic-density). Participants are encouraged to incorporate this meta-data in synthetic sample generation as well as in model training.
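
For illustration, a short sketch of reading these attributes; the key names follow the attribute names in the text above, while the file path and units are assumptions.

    import json

    with open("loc1/meta.json") as f:                    # illustrative path
        meta = json.load(f)

    array_height = meta["array-height"]                  # assumed meters
    street_distance = meta["distance-to-street-side"]    # assumed meters
    max_speed = meta["max-pass-by-speed"]
    max_density = meta["max-traffic-density"]            # vehicles per minute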

Development dataset

The development dataset contains real data recordings from six different locations as well as the simulation data. Below is a summary of the development dataset:

  • Train: 7294 one-minute training samples of real data collected from 6 sites (loc1 to loc6) with various traffic conditions.
  • Validation: 7705 one-minute validation samples of real data collected from the same 6 sites with various traffic conditions.
  • Simulation: 1224 one-minute segments of simulated data generated via the pyroadacoustics simulator.
  • Engine sound: we also release synthetically generated engine sounds, created with enginesound, that are used to generate the simulation data.

The released real data for each location (loc1 to loc6) follows the same folder structure: site meta information (meta.json), a train data folder (train/*.flac) with labels (train.csv), and a validation data folder (val/*.flac) with labels (val.csv).
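
A short sketch of walking this layout; the dataset root name is an assumption.

    from pathlib import Path

    root = Path("dcase_traffic")                # assumed dataset root
    for site in sorted(root.glob("loc*")):
        meta_file = site / "meta.json"
        train_csv, val_csv = site / "train.csv", site / "val.csv"
        train_audio = sorted((site / "train").glob("*.flac"))
        val_audio = sorted((site / "val").glob("*.flac"))
        print(site.name, len(train_audio), len(val_audio))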

Evaluation dataset

Evaluation data will be provided from the same 6 sites but from different days than the development dataset, containing 9887 one-minute audio segments per location.

Download


Task setup

The audio-based traffic monitoring dataset consists of a public development set and a private test set. Both the audio recordings and the corresponding annotations will be released for the development data at the start of the challenge. However, we will only release the audio recordings for the private test set, at the start of the evaluation period. The private test annotations will be released after the challenge has ended.

Task rules

  • Allowed: Participants may submit up to three systems.
  • Allowed: In each submission, participants can have one model for all the sites or one model per site.
  • Restricted: In case of training different models per site, all the models should have the same architecture.
  • Allowed: Data augmentation in development dataset is allowed and encouraged.
  • Allowed: Using synthetic data (generated via the provided simulation code or other methods) for model training is allowed and highly encouraged.
  • Allowed: Participants are allowed and encouraged to use the provided meta-data such as time of the day, day of the week and common traffic patterns of each site in model training.
  • Allowed: Participants are allowed to use external public and freely accessible datasets.
  • Not allowed: Participants are not allowed to manually annotate the private test set.
  • Not allowed: The test dataset cannot be used to train the submitted systems.
  • Required: open-sourcing the system's source code
    • For submissions to be considered in ranking, we require a link to open-sourced source code.
    • This can be specified in the metadata YAML file included in the submission.
    • The code should be hosted on any public code hosting service such as GitHub or Bitbucket.
    • The source code needs to be well documented including instructions for reproducing the results of the submitted system(s).
    • Only submissions that include reproducible open-sourced code will be considered for the ranking.

Submission

The results for each of the 6 sites in the evaluation dataset should be collected in individual CSV files. Each result file should be named <GROUP NAME>_<METHOD>_<locX>.csv. It should contain the 4 outputs corresponding to the reference labels, keyed by the audio file path: path, car_left, car_right, cv_left, cv_right. Here is an example of the output:

"path", "car_left", "car_right", "cv_left", "cv_right"  
test/00000.flac, 3.458767, 1.875205, 0.0, 0.14362139  
test/00001.flac, 2.0522563, 1.8457798, 0.0, 0.0  
test/00002.flac, 0.5575373, 0.02693855, 0.0, 0.0
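
For example, a hypothetical helper that writes one site's predictions in this format; the function name and argument shapes are assumptions.

    import pandas as pd

    def write_submission(paths, preds, group="GROUP", method="METHOD", loc="loc1"):
        """preds: (N, 4) array ordered car_left, car_right, cv_left, cv_right."""
        df = pd.DataFrame(preds, columns=["car_left", "car_right",
                                          "cv_left", "cv_right"])
        df.insert(0, "path", paths)                  # file path as row key
        df.to_csv(f"{group}_{method}_{loc}.csv", index=False)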

Evaluation and Ranking

Metrics

Since the task is a regression problem, we use both a distance-based and a correlation-based metric to evaluate system performance from different perspectives. The distance-based metric directly computes regression errors, while the correlation-based metric neutralizes scaling factors and captures the global up-and-down trend. A well-performing system is expected to capture both the local variation and the global trend of the traffic counts. A sketch of both metrics in code follows the list below.

  • Kendall's Tau Rank Correlation measures the ordinal association between two quantities. It is often preferred over metrics such as Pearson's or Spearman's correlation because it is more robust to outliers. Kendall's Tau ranges from -1 to 1.
  • Root Mean Square Error (RMSE) is a common metric in regression tasks that measures prediction errors using the Euclidean distance. RMSE ranges from 0 to +Inf.
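
A minimal sketch of both metrics for one (site, class) pair, using scipy; the inputs are per-minute counts over all evaluated segments.

    import numpy as np
    from scipy.stats import kendalltau

    def evaluate(y_true, y_pred):
        """y_true, y_pred: per-minute vehicle counts over all segments."""
        tau, _ = kendalltau(y_true, y_pred)        # rank correlation, in [-1, 1]
        rmse = float(np.sqrt(np.mean((np.asarray(y_true)
                                      - np.asarray(y_pred)) ** 2)))
        return tau, rmse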

Ranking

To rank the submitted systems, we conduct a total of 6×4×2 comparisons across locations, classes, and metrics. Each comparison independently ranks all submissions, and scores are assigned according to the ranking order (top ranks receive higher scores than low ranks). The average score across all comparisons represents the final system performance and determines the task winner.
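
The sketch below illustrates this scheme under stated assumptions; the exact score assignment is not specified by the task, so the scoring rule here is only one plausible reading.

    import numpy as np
    from scipy.stats import rankdata

    def final_scores(metric_table):
        """metric_table: (n_submissions, 48) with one column per
        site x class x metric comparison, oriented so lower is better
        (e.g. negate Kendall's Tau first). Returns average scores."""
        n_sub, n_cmp = metric_table.shape
        scores = np.zeros(n_sub)
        for k in range(n_cmp):
            ranks = rankdata(metric_table[:, k], method="average")  # 1 = best
            scores += n_sub - ranks + 1          # top rank -> highest score
        return scores / n_cmp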

Baseline system

The baseline system is a Convolutional Recurrent Neural Network (CRNN) architecture consisting of two branches. The 4-channel raw audio input is 1) converted to Generalized Cross-Correlation with Phase Transform (GCC-PHAT) features that feed a convolutional encoder and time-distributed fully connected (FC) layers; and 2) filtered via a learnable Gabor filterbank that similarly feeds a convolutional encoder and time-distributed FC layers. The two branches are concatenated and passed to another time-distributed block, and a gated recurrent unit (GRU) models the temporal dependencies. Finally, an FC layer with four neurons produces the regression output: the vehicle counts for the four categories car_left, car_right, cv_left, cv_right.
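
To illustrate the input features of the first branch, here is a simplified GCC-PHAT sketch for one microphone pair; the baseline computes these frame-wise, and the parameter values here are assumptions.

    import numpy as np

    def gcc_phat(x1, x2, n_fft=1024, max_lag=32):
        """Cross-correlation with phase transform for one channel pair."""
        X1, X2 = np.fft.rfft(x1, n=n_fft), np.fft.rfft(x2, n=n_fft)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n_fft)
        # reorder so the output covers lags -max_lag .. +max_lag
        return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))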

Repository


Results with the development dataset

Site  Metric              car_left  car_right  cv_left  cv_right
loc1  Kendall's Tau Corr  0.470     0.478      0.231    0.189
loc1  RMSE                2.449     2.687      0.732    0.777
loc2  Kendall's Tau Corr  0.446     0.221      0.135    -0.026
loc2  RMSE                3.308     3.560      0.468    0.610
loc3  Kendall's Tau Corr  0.619     0.593      0.102    0.272
loc3  RMSE                1.629     1.209      0.308    0.199
loc4  Kendall's Tau Corr  0.456     0.248      0.000    0.438
loc4  RMSE                1.698     2.210      0.548    0.728
loc5  Kendall's Tau Corr  0.484     0.575      0.092    0.108
loc5  RMSE                0.662     0.607      0.491    0.676
loc6  Kendall's Tau Corr  0.825     0.736      0.711    0.648
loc6  RMSE                1.672     1.950      0.535    0.441

Citation

If you are participating in this task or using the baseline code, please cite the following papers.

Publication

Stefano Damiano and Toon van Waterschoot. Pyroadacoustics: a Road Acoustics Simulator Based on Variable Length Delay Lines. In Proceedings of the 25th International Conference on Digital Audio Effects (DAFx20in22), 216–223. Vienna, Austria, September 2022.

Publication

Stefano Damiano, Luca Bondi, Shabnam Ghaffarzadegan, Andre Guntoro, and Toon van Waterschoot. Can synthetic data boost the training of deep acoustic vehicle counting networks? In Proceedings of the 2024 International Conference on Acoustics, Speech and Signal Processing (ICASSP) (accepted). Seoul, South Korea, April 2024.
