Urban Sound Tagging with Spatiotemporal Context


Task description

Description

This task aims to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags for 10-second recordings from an urban acoustic sensor network for which we know when and where the recordings were taken. This task is motivated by the real-world problem of building machine listening tools to aid in the monitoring, analysis, and mitigation of urban noise pollution.

Often in machine listening research, we work with datasets scraped from the internet, disconnected from real applications, and devoid of relevant metadata such as when and where the data were recorded. However, this is not the case in many real-world sensing applications. In these scenarios, this spatiotemporal context (STC) metadata may inform us as to what sounds we may expect to hear in a recording. For example, in NYC you are more likely to hear an ice cream truck by the park at 3pm on a Saturday in July than by a busy street at rush hour on a Tuesday in January, when you are more likely to hear honking, engines, and sirens. If you knew there was a thunderstorm that Saturday afternoon in July, though, you would be less likely to expect an ice cream truck. Knowing about this thunderstorm would also help you disambiguate between the noise of heavy rain and that of a large walk-behind saw.

In this task, in addition to the recording, we provide identifiers for the New York City block (location) where the recording was taken as well as when the recording was taken, quantized to the hour. We encourage all submissions to exploit both this provided information and any external data (e.g. weather, land usage, traffic, Twitter) that can further help inform your system to predict tags.

To incentivize the use of STC metadata, only systems that exploit STC metadata will be included in the official task leaderboard. Awards will only be given to systems which have incorporated STC.

Figure 1. Overview of a system for audio tagging with spatiotemporal context.

Audio dataset

Recording procedure

The provided audio has been acquired using the SONYC acoustic sensor network for urban noise pollution monitoring [1]. SONYC (Sounds of New York City) is a research project aimed at monitoring, analyzing, and mitigating urban noise pollution. To this end, SONYC has designed an acoustic sensor for noise pollution monitoring and has deployed over 50 different sensors in the Manhattan, Brooklyn, and Queens boroughs of New York, with the highest concentration around New York University's Manhattan campus. Collectively, these sensors have recorded over 100M 10-second audio clips since its launch in 2016, of which we provide a small labeled subset. The data was initially sampled by selecting the nearest neighbors on VGGish features of recordings known to have classes of interest [2], and was subsequently sampled using an active learning approach which aimed at sampling diverse samples around and above the decision threshold. All recordings are 10 seconds and were recorded with identical microphones at identical gain settings.

[1] J. P. Bello, C. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, H. Doraiswamy, "SONYC: A System for the Monitoring, Analysis and Mitigation of Urban Noise Pollution", Communications of the ACM (CACM), 2018.

[2] M. Cartwright, A. E. M. Mendez, J. Cramer, V. Lostanlen, G. Dove, H. Wu, J. Salamon, O. Nov, J. P. Bello, "SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network", Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.

Reference labels

Through consultation with the New York Department of Environmental Protection (DEP) and the New York noise code, we constructed a small, two-level urban sound taxonomy (see Figure 2) consisting of 8 coarse-level and 23 fine-level sound categories, e.g., the coarse alert signals category contains four fine-level categories: reverse beeper, car alarm, car horn, siren. This taxonomy is not intended to provide a framework for exhaustive description of urban sounds. Instead, it was scoped to provide actionable information to the DEP, while also being understandable and manageable for novice annotators. The chosen sound categories map to categories of interest in the noise code; they were limited to those that seem likely discernible by novice annotators; and we kept the number of categories small enough so that they can all be visible at once in an annotation interface.

Figure 2. Hierarchical taxonomy of urban sound tags in the DCASE Urban Sound Tagging task. Rectangular and round boxes respectively denote coarse and fine tags.

To annotate the sensor recordings, we launched an annotation campaign on the Zooniverse citizen-science platform [3]. We presented volunteers on the platform with instructions explaining the task and a field guide describing the SONYC-UST classes, and we asked them to annotate the presence of all of the fine-level classes in a recording. For every coarse-level class (e.g., alert signal) we also included a fine-level other/unknown class (e.g., other/unknown alert signal) with the goal of capturing an annotator's uncertainty in a fine-level tag while still annotating the coarse-level class. If an annotator marked a sound class as present in the recording, they were also asked to annotate the proximity of the sound event (near, far, not sure). Volunteers could annotate as many recordings as were available. We required that each recording be annotated by three different annotators.

The SONYC research team also created an agreed-upon verified subset of annotations that will be considered ground-truth for model validation and evaluation. For additional details regarding both the crowdsourcing procedure and our multi-stage verification procedure, please see [2] and [3].

[3] M. Cartwright, G. Dove, A. E. Méndez Méndez, J. P. Bello, O. Nov, "Crowdsourcing multi-label audio annotation tasks with citizen scientists", Proceedings of the Conference on Human Factors in Computing Systems (CHI), 2019.

Spatiotemporal Context Information (STC)

A central component of this year's task is the inclusion of spatiotemporal context information (STC). To maintain privacy, we quantized the spatial information to the level of a city block, and we quantized the temporal information to the level of an hour. We also limited the occurrence of recordings with positive human voice annotations to one per hour per sensor. For the spatial information, we have provided borough and block identifiers, as used in NYC's parcel number system known as Borough, Block, Lot (BBL). This is a common identifier that is used in various NYC datasets, making it easy to relate the sensor data to other city data such as PLUTO and, more generally, NYC Open Data. For ease of use with other datasets, we've also included the latitude and longitude coordinates of the center of the block. Below are distributions of the recordings in time and space.

Figure 3. Location of the SONYC-UST v2 sensors. Green is `train` split, blue is `validate` split.
Figure 4. Temporal distribution of SONYC-UST v2 recordings.
Figure 5. Temporal distribution of SONYC-UST v2 recordings.

Development and evaluation datasets

Annotation format

The metadata corresponding to the public development set is included in the dataset as a CSV file named annotations.csv. Each row in the file represents one multi-label annotation of a recording: it could be the annotation of a single citizen science volunteer, a single SONYC team member, or the agreed-upon ground truth by the SONYC team (see the annotator_id column description for more information). The multi-label annotation is encoded as the entries in 29 columns, corresponding to alphabetically sorted tags: amplified speech, car alarm, car horn, chainsaw, and so forth. In each entry, the numbers 0 and 1 respectively denote the absence and the presence of the corresponding tag. In addition, each row also includes three identification numbers (ID): one for the human who performed the annotation, one for the sensor in the SONYC network, and one for the ten-second recording itself. This year, we are also including spatial information in four additional columns: borough, block, latitude, longitude; and temporal information in four additional columns: year, week, day, hour. The columns of the CSV are described below.

Columns

split

The data split. (train, validate)

sensor_id

The ID of the sensor the recording is from. These have been anonymized to have no relation to geolocation.

audio_filename

The filename of the audio recording

annotator_id

The anonymous ID of the annotator. If this value is positive, it is a citizen science volunteer from the Zooniverse platform. If it is negative, it is a SONYC team member. If it is 0, then it is the ground truth agreed-upon by the SONYC team.

year

The year the recording is from.

week

The week of the year the recording is from.

day

The day of the week the recording is from, with Monday as the start (i.e. 0=Monday).

hour

The hour of the day the recording is from

borough

The NYC borough in which the sensor is located (1=Manhattan, 3=Brooklyn, 4=Queens). This corresponds to the first digit in the 10-digit NYC parcel number system known as Borough, Block, Lot (BBL).

block

The NYC block in which the sensor is located. This corresponds to digits 2 through 6 of the 10-digit NYC parcel number system known as Borough, Block, Lot (BBL).

latitude

The latitude coordinate of the block in which the sensor is located.

longitude

The longitude coordinate of the block in which the sensor is located.

<coarse_id>-<fine_id>_<fine_name>_presence

Columns of this form indicate the presence of a fine-level class. 1 if present, 0 if not present. If -1, then the class was not labeled in this annotation because the annotation was performed by a SONYC team member who only annotated one coarse group of classes at a time when annotating the verified subset.

<coarse_id>_<coarse_name>_presence

Columns of this form indicate the presence of a coarse-level class. 1 if present, 0 if not present. If -1, then the class was not labeled in this annotation because the annotation was performed by a SONYC team member who only annotated one coarse group of classes at a time when annotating the verified subset. These columns are computed from the fine-level class presence columns and are presented here for convenience when training on only coarse-level classes.

<coarse_id>-<fine_id>_<fine_name>_proximity

Columns of this form indicate the proximity of a fine-level class. After indicating the presence of a fine-level class, citizen science annotators were asked to indicate the proximity of the sound event to the sensor. Only the citizen science volunteers performed this task, and therefore this data is not included in the verified annotations. This column may take on one of the following four values: (near, far, notsure, -1). If -1, then the proximity was not annotated because either the annotation was not performed by a citizen science volunteer, or the citizen science volunteer did not indicate the presence of the class.
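The column conventions above can be illustrated with a minimal sketch that reads annotation rows, classifies each annotator from the sign of annotator_id, and skips presence entries marked -1. The inline CSV excerpt is made up for illustration and contains only a few columns; the two presence column names are built from the 2_machinery-impact and 2-X_other-unknown-impact-machinery tags mentioned in this document, and the helper names are hypothetical.

```python
import csv
import io


def annotator_type(annotator_id):
    """Map an annotator_id to the kind of annotation it denotes."""
    annotator_id = int(annotator_id)
    if annotator_id > 0:
        return "volunteer"     # Zooniverse citizen science volunteer
    if annotator_id < 0:
        return "sonyc_member"  # individual SONYC team member
    return "verified"          # agreed-upon ground truth (annotator_id == 0)


def presence_labels(row):
    """Return {column: 0 or 1} for *_presence columns, skipping -1 (not labeled)."""
    labels = {}
    for column, value in row.items():
        if column.endswith("_presence"):
            value = int(float(value))
            if value != -1:
                labels[column] = value
    return labels


# Made-up excerpt; the real annotations.csv has many more columns.
SAMPLE = (
    "split,sensor_id,audio_filename,annotator_id,"
    "2-X_other-unknown-impact-machinery_presence,2_machinery-impact_presence\n"
    "train,0,a.wav,12,1,1\n"
    "train,0,a.wav,-3,-1,-1\n"
    "validate,5,b.wav,0,0,1\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
types = [annotator_type(row["annotator_id"]) for row in rows]
labels = [presence_labels(row) for row in rows]
```

Here the second row yields an empty label dictionary, because a SONYC team member did not annotate that coarse group in that pass.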

Download

The dataset (both development and evaluation) can be downloaded here:

SONYC Urban Sound Tagging (12.8 GB)
version 2.2.0


Task setup

The Urban Sound Tagging dataset comprises a public development set and a private evaluation set. For the public development set, we release both the audio recordings and the corresponding human annotations at the start of the challenge. For the private evaluation set, we will release the audio recordings at the start of the evaluation period, but we will not disclose human annotations until after the challenge has concluded.

Development dataset

The development dataset contains all of the recordings and annotations from DCASE 2019 Task 5 (both development and evaluation sets), plus almost 15000 additional recordings with crowdsourced annotations. All of the recordings are from 2016–2019 (early 2019) and are grouped into a train split (13538 recordings / 35 sensors) and validate split (4308 recordings / 9 sensors). The train and validate splits are disjoint with respect to the sensor from which each recording came. 716 of the recordings also have annotations which have been verified by the SONYC team (denoted by annotator_id==0). Note that recordings with verified annotations are not limited to the validation sensor split (see Figure 6). Participants may use this information however they please, but note that the baseline model is trained with crowdsourced annotations of the train sensor split, and validation metrics are reported with the verified annotations of the validate sensor split (see Figure 7).
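As a rough sketch of the selection described above, the following keeps crowdsourced rows (annotator_id > 0) from the train split and verified rows (annotator_id == 0) from the validate split. The function name and the example rows are hypothetical; the rows only carry the columns needed for the filter.

```python
def select_baseline_rows(rows):
    """Split annotation rows the way the baseline description suggests.

    rows: iterable of dicts with at least 'split' and 'annotator_id' keys.
    Returns (train_rows, validate_rows).
    """
    train_rows = [
        row for row in rows
        if row["split"] == "train" and int(row["annotator_id"]) > 0
    ]
    validate_rows = [
        row for row in rows
        if row["split"] == "validate" and int(row["annotator_id"]) == 0
    ]
    return train_rows, validate_rows


# Made-up rows for illustration.
example_rows = [
    {"split": "train", "annotator_id": "12", "audio_filename": "a.wav"},
    {"split": "train", "annotator_id": "0", "audio_filename": "a.wav"},
    {"split": "train", "annotator_id": "-2", "audio_filename": "a.wav"},
    {"split": "validate", "annotator_id": "0", "audio_filename": "b.wav"},
    {"split": "validate", "annotator_id": "7", "audio_filename": "b.wav"},
]

train_rows, validate_rows = select_baseline_rows(example_rows)
```

Verified annotations in the train split are dropped by this filter, which mirrors the note that the baseline only uses crowdsourced annotations for training.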

Evaluation dataset

The evaluation data is distributed as the test split in SONYC-UST v2.2. It contains 669 recordings from 48 sensors. It may contain recordings from sensors in either the validate or the train splits (and also possibly unseen sensors). However, all of the evaluation data will be displaced in time, occurring after any of the recordings in the development set (mid to late 2019). In addition, all of the annotations in the evaluation set will be verified by the SONYC team, using the same procedure as the verified annotations in the development set.

Figure 6. The distribution of all SONYC-UST v2 recordings and annotations by sensor split, recording split, annotation type, and time.
Figure 7. The distribution of the recordings and annotations used in training the baseline model.

External data resources

Dataset name Type Added Link
AudioSet audio 01.03.2020 https://research.google.com/audioset/
OpenL3 model 01.03.2020 https://github.com/marl/openl3
VGGish model 01.03.2020 https://github.com/tensorflow/models/tree/master/research/audioset/vggish
PLUTO metadata 01.03.2020 https://www1.nyc.gov/site/planning/data-maps/open-data.page
NYC OpenData metadata 01.03.2020 https://opendata.cityofnewyork.us
NYC OpenData - NYPD Arrests Data metadata 01.03.2020 https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u
NYC OpenData - Motor Vehicle Collisions metadata 01.03.2020 https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
NYC OpenData - 311 Service Requests metadata 01.03.2020 https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

Submission

The output files should be in CSV format with the following columns.

audio_filename

The filename of the audio recording

<coarse_id>-<fine_id>_<fine_name>

Columns of this form should indicate the probability of the presence of a fine-level class (floating point values between 0 and 1). If the system does not produce probabilities and only detections, 1 and 0 can be used for predicted positives and negatives, respectively. Numeric values outside of the range [0,1] or non-numeric values (including empty strings) will throw an error during evaluation and the results will not be accepted. Also note that we accept "X" level predictions, which are meant to indicate the presence of the respective coarse tag that is either not captured by the available fine tags or is uncertain. For example, if a system confidently detects the presence of the coarse tag machinery-impact but not any of the corresponding fine tags, the value of 2-X_other-unknown-impact-machinery would be 1. Feel free to provide these if your system models these probabilities.

<coarse_id>_<coarse_name>

Columns of this form should indicate the probability of the presence of a coarse-level class (floating point values between 0 and 1). If the system does not produce probabilities and only detections, 1 and 0 can be used for predicted positives and negatives, respectively.

An example of the output in addition to the accompanying metadata file is included in the submission package template as well as the Submission page. Note that evaluations at the coarse and fine levels will be performed independently on the columns corresponding to the respective tags. Therefore, participants should provide system outputs for both the coarse and fine tags if they wish to be evaluated on both levels.
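Before submitting, it may help to check locally that every prediction is a number in [0, 1]. The sketch below is not the official evaluation code; it is a hypothetical pre-check that treats every column other than audio_filename as a prediction column and reports non-numeric or out-of-range values.

```python
import csv
import io


def check_submission(csv_text):
    """Return a list of (line, column, problem) tuples; empty means no problem found."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column == "audio_filename":
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append((line_no, column, "not numeric"))
                continue
            # This comparison is also False for NaN, so NaN is reported.
            if not 0.0 <= number <= 1.0:
                problems.append((line_no, column, "outside [0, 1]"))
    return problems


HEADER = "audio_filename,2_machinery-impact,2-X_other-unknown-impact-machinery\n"
good = check_submission(HEADER + "a.wav,0.9,1\n")
bad = check_submission(HEADER + "a.wav,1.5,\n")
```

In the second call, the value 1.5 is flagged as out of range and the empty string as non-numeric, the two error cases named in the column description.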

For more information about the submission process and platform, please refer to the Submission page.

Task rules

The rules of the DCASE Urban Sound Tagging task are relatively permissive. The only two practices that are strictly forbidden are the manual annotation of the private evaluation set and the inclusion of private external data in the train or validate subsets. We do, however, explicitly authorize the inclusion of publicly released external data in the train or validate subsets. For instance, we highly encourage the use of external spatiotemporal data such as NYC OpenData. We also authorize participants to download AudioSet and FSD and use them to learn general-purpose machine listening features in a supervised way. However, we do not allow participants to collect training data from YouTube or FreeSound in an arbitrary way, in the absence of a sustainable and reproducible open data initiative. This restriction aims at preserving the reproducibility of research in the development of innovative systems for machine listening.

We provide a thorough list of rules below.

  • Forbidden: manual annotation of the evaluation set

    • Participants are not allowed to perform any manual annotation, be it expert or crowdsourced, of the private evaluation set.
  • Forbidden: use of private external data

    • Participants are not allowed to use private external data for training, validating, and/or selecting their model.

    • The term "private external data" includes all data that either have a proprietary license or are hosted on private or corporate servers, such as YouTube, Soundcloud, Dropbox, Google Drive, and so forth.

  • Restricted: use of open external data

    • Participants are allowed to use open external data for training, validating, and/or selecting their model.

    • The term "open external data" includes all datasets that are released under a free license, such as Creative Commons; which have a digital object identifier (DOI); and which are hosted on a dedicated repository for open science, such as NYC OpenData, Zenodo, DataDryad, and IEEE DataPort.

  • Restricted: use of pre-trained models

    • Participants are allowed to use pretrained models, such as self-supervised audiovisual embeddings, as feature extractors for their system.

    • If these pretrained models are novel contributions, we highly encourage participants to release them under an open source license, for the benefit of the machine listening community at large.

  • Authorized: manual inspection of the public development set

    • While not required, participants are allowed to employ additional human labor, either expert or crowdsourced, to refine the annotation of recordings in the public development set.

    • We kindly ask participants who undertake a systematic re-annotation of the train set, in part or in full, to contact us so that we can consider including their annotations in the public development set for a possible upcoming edition of the Urban Sound Tagging challenge.

    • As stated above, although we encourage human inspection of the public development set, we strictly forbid the human inspection of the private evaluation set.

  • Encouraged: use of additional metadata

    • Participants are highly encouraged to use the additional metadata (both STC and non-STC) of each annotation-recording pair, such as recording location, recording time, identifying number (ID) of each sensor, annotator IDs, and perceived proximity, in the development of their system.

    • In addition to spatiotemporal context, three additional examples of authorized use of metadata are: using sensor IDs to control the heterogeneity of recording conditions; using annotator IDs to model the reliability of each annotator; and using perceived proximity as a latent variable in the acoustic model.

    • However, at the time of prediction, systems must only rely on a single-channel audio input along with any combination of the provided spatial and temporal metadata (STC, i.e., year, week, day, hour, borough, block, latitude, longitude). Note that sensor_id is not an allowed input since the model must be able to generalize to new unseen sensors.

  • Required (for ranking): use of STC metadata

    • Participants are highly encouraged to use the additional STC metadata of each annotation-recording pair, such as recording location, recording time, identifying number (ID) of each sensor, annotator IDs, and perceived proximity, in the development of their system.

    • Systems must exploit STC to be included in the official leaderboard (ranking). Submissions that don't use STC will still be evaluated, but won't be included in the ranking.

  • Required (for ranking): open-sourcing of system source code

    • For the sake of the integrity of the challenge, as well as the contribution to the collective knowledge of the research community, we require that all systems be verifiable and reproducible. Therefore, we require that all submissions that wish to be recognized competitively be accompanied by a link to a public source code repository.

    • This can be specified in the metadata YAML file included in the submission.

    • The source code must be hosted on Github, Bitbucket, SourceForge, or any other public code hosting service.

    • The source code should be well documented and include instructions for reproducing the results of the submitted system.

    • Only submissions that include reproducible open-sourced code will be considered for the top ten ranking systems.

    • Closed-source submissions will still be accepted but not included in the final ranking. Open-sourced submissions are highly encouraged nonetheless.

Evaluation

Summary:
There will be two separate rankings, one for performance on the coarse-grained labels and one for performance on the fine-grained labels. In both cases, macro-averaged AUPRC scores will be used for ranking. In addition, as a secondary metric we will also report micro-averaged AUPRC. Please see below for full details about the evaluation procedure.

NOTE: while all submissions will be evaluated, only submissions that make use of spatiotemporal context (STC) will be included in the rankings.

Full description:
The Urban Sound Tagging challenge is a task of multilabel classification. To evaluate and rank participants, we ask them to submit a CSV file following a similar layout as the publicly available CSV file of the development set: in it, each row should represent a different ten-second snippet, and each column should represent an urban sound tag.

The area under the precision-recall curve (AUPRC) is the classification metric that we employ to rank participants. To compute this curve, we threshold the confidence of every tag in every snippet by some fixed threshold \(\tau\), thus resulting in a one-hot encoding of predicted tags. Then, we count the total number of true positives (TP), false positives (FP), and false negatives (FN) between prediction and consensus ground truth over the entire evaluation dataset.

The Urban Sound Tagging challenge provides two leaderboards of participants, according to two distinct metrics: fine-grained AUPRC and coarse-grained AUPRC. In each of the two levels of granularity, we vary \(\tau\) between 0 and 1 and compute TP, FP, and FN for each coarse category. Then, we compute micro-averaged precision \(P = \text{TP} / (\text{TP} + \text{FP})\) and recall \(R = \text{TP} / (\text{TP} + \text{FN})\), giving an equal importance to every sample. We repeat the same operation for all values of \(\tau\) in the interval \([0, 1]\) that result in different values of P and R. Lastly, we use the trapezoidal rule to estimate the AUPRC.
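The thresholding and trapezoidal integration can be sketched as follows for a flat list of pooled (micro-averaged) binary targets and confidences. This is an illustrative sketch, not the official evaluation code: precision is TP / (TP + FP), recall is taken as TP / (TP + FN), a tag counts as predicted when its confidence is at least \(\tau\), and the curve is anchored at recall 0 with the precision of the highest threshold. The last two choices are assumptions, and the official code may treat the endpoints differently.

```python
def precision_recall_points(y_true, y_score):
    """Return (recall, precision) pairs, one per distinct threshold, highest first."""
    points = []
    for tau in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= tau and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= tau and t == 0)
        fn = sum(1 for t, s in zip(y_true, y_score) if s < tau and t == 1)
        # tp + fp >= 1 because the sample whose score equals tau is predicted.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((recall, precision))
    return points


def auprc(y_true, y_score):
    """Estimate the area under the precision-recall curve with the trapezoidal rule."""
    points = precision_recall_points(y_true, y_score)
    # Assumption: start the curve at recall 0 with the first point's precision.
    points = [(0.0, points[0][1])] + points
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


score_mixed = auprc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
score_perfect = auprc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

With a perfect ranking of positives above negatives, precision stays at 1 until recall reaches 1, so the estimated area is 1.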

The computations can be summarized by the following expressions defined for each coarse category, where \(t_0\) and \(y_0\) correspond to the presence of an incomplete tag in the ground truth and prediction (respectively), and \(t_k\) and \(y_k\) (for \(k \in \{1, \ldots, K\}\)) correspond to the presence of fine tag \(k\) in the ground truth and prediction (respectively).

$$\text{TP} = \left(1 - \prod_{k=0}^K (1-t_k) \right) \left(1 - \prod_{k=0}^K (1-y_k) \right)$$
$$\text{FP} = \left(\prod_{k=0}^K (1-t_k) \right) \left(1 - \prod_{k=0}^K (1-y_k) \right)$$
$$\text{FN} = \left(1 - \prod_{k=0}^K (1-t_k) \right) \left(\prod_{k=0}^K (1-y_k) \right)$$
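A direct transcription of these three expressions into code may make them easier to check. In this sketch, t and y are lists of 0/1 values for one coarse category in one recording, with index 0 holding the incomplete ("other/unknown") tag and indices 1 to K the fine tags; the function name is hypothetical.

```python
def coarse_counts(t, y):
    """Return (TP, FP, FN) for one coarse category, following the expressions above.

    t: ground-truth indicators [t_0, t_1, ..., t_K]
    y: thresholded predictions  [y_0, y_1, ..., y_K]
    """
    truth_absent = 1  # product over k = 0..K of (1 - t_k)
    for t_k in t:
        truth_absent *= 1 - t_k
    pred_absent = 1   # product over k = 0..K of (1 - y_k)
    for y_k in y:
        pred_absent *= 1 - y_k

    tp = (1 - truth_absent) * (1 - pred_absent)
    fp = truth_absent * (1 - pred_absent)
    fn = (1 - truth_absent) * pred_absent
    return tp, fp, fn


# A fine tag is present in the truth, and only "other/unknown" is predicted:
# any tag on either side makes the coarse category present, so this is a TP.
hit = coarse_counts([0, 1, 0], [1, 0, 0])
false_alarm = coarse_counts([0, 0, 0], [0, 1, 0])
miss = coarse_counts([1, 0, 0], [0, 0, 0])
```

Each product is 1 only when no tag of the category is set, so the coarse category counts as present as soon as any fine or incomplete tag is present.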

For samples with complete ground truth (i.e., in the absence of the incomplete fine tag in the ground truth for the coarse category at hand), evaluating urban sound tagging at a fine level of granularity is also relatively straightforward. Indeed, for samples with complete ground truth, the computation of TP, FP, and FN amounts to pairwise conjunctions between predicted fine tags and corresponding ground truth fine tags, without any coarsening. Each fine tag produces either one TP (if it is present and predicted), one FP (if it is absent yet predicted), or one FN (if it is present yet not predicted). Then, we apply one-hot integer encoding to these boolean values, and sum them up at the level of coarse categories before micro-averaging across coarse categories over the entire evaluation dataset. In this case, the sum (TP+FP+FN) is equal to the number of tags in the fine-grained taxonomy, i.e. 23. Furthermore, the sum (TP+FN) is equal to the number of truly present tags in the sample at hand.

The situation becomes considerably more complex when the incomplete fine tag is present in the ground truth, because this presence hinders the possibility of precisely counting the number of false alarms in the coarse category at hand. We propose a pragmatic solution to this problem; the guiding idea behind our solution is to evaluate the prediction at the fine level only when possible, and fall back to the coarse level if necessary.

For example, if a small engine is present in the ground truth and absent in the prediction but an "other/unknown" engine is predicted, then it's a true positive in the coarse-grained sense, but a false negative in the fine-grained sense. However, if a small engine is absent in the ground truth and present in the prediction, then the outcome of the evaluation will depend on the completeness of the ground truth for the coarse category of engines. If this coarse category is complete (i.e. if the tag "engine of uncertain size" is absent from the ground truth), then we may evaluate the small engine tag at the fine level, and count it as a false positive. Conversely, if the coarse category of engines is incomplete (i.e. the tag "engine of uncertain size" is present in the ground truth), then we fall back to coarse-level evaluation for the sample at hand, and count the small engine prediction as a true positive, in aggregation with potential predictions of medium engines and large engines.

The computations can be summarized by the following expressions defined for each coarse category, where \(t_0\) and \(y_0\) correspond to the presence of an incomplete tag in the ground truth and prediction (respectively), and \(t_k\) and \(y_k\) (for \(k \in \{1, \ldots, K\}\)) correspond to the presence of fine tag \(k\) in the ground truth and prediction (respectively).

$$\text{TP} = \left(\sum_{k=1}^K t_k y_k \right) + t_0 \left( 1 - \prod_{k=1}^K t_k y_k\right) \left(1 - \prod_{k=0}^K (1-y_k) \right) $$
$$\text{FP} = (1-t_0) \left( \left(\sum_{k=1}^K (1-t_k)y_k \right) + y_0 \left( \prod_{k=1}^K (1-t_k) \right) \left( 1 - \prod_{k=1}^K y_k \right) \right) $$
$$\text{FN} = \left(\sum_{k=1}^K t_k(1-y_k) \right) + t_0 \left( \prod_{k=1}^K (1-t_k) \right) \left(\prod_{k=0}^K (1-y_k) \right)$$
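The fine-level expressions can be transcribed literally in the same way, which makes it possible to replay the engine examples from the text. As a sketch under the same conventions as the formulas, index 0 of t and y is the incomplete tag and indices 1 to K are the fine tags; the function name is hypothetical and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def fine_counts(t, y):
    """Return (TP, FP, FN) for one coarse category at the fine level.

    Literal transcription of the three expressions above.
    t: ground-truth indicators [t_0, t_1, ..., t_K]
    y: thresholded predictions  [y_0, y_1, ..., y_K]
    """
    t0, y0 = t[0], y[0]
    fine_t, fine_y = t[1:], y[1:]

    tp = sum(tk * yk for tk, yk in zip(fine_t, fine_y)) + t0 * (
        1 - prod(tk * yk for tk, yk in zip(fine_t, fine_y))
    ) * (1 - prod(1 - yk for yk in y))

    fp = (1 - t0) * (
        sum((1 - tk) * yk for tk, yk in zip(fine_t, fine_y))
        + y0 * prod(1 - tk for tk in fine_t) * (1 - prod(fine_y))
    )

    fn = sum(tk * (1 - yk) for tk, yk in zip(fine_t, fine_y)) + t0 * prod(
        1 - tk for tk in fine_t
    ) * prod(1 - yk for yk in y)

    return tp, fp, fn


# Engine category with tags [uncertain size, small, medium, large].
# Small engine in the truth, only "other/unknown" predicted: fine-level FN.
missed_small = fine_counts([0, 1, 0, 0], [1, 0, 0, 0])
# Small engine predicted, truth complete and empty: FP.
complete_case = fine_counts([0, 0, 0, 0], [0, 1, 0, 0])
# Small engine predicted, truth incomplete: falls back to a coarse-level TP.
incomplete_case = fine_counts([1, 0, 0, 0], [0, 1, 0, 0])
```

The three calls reproduce the outcomes described in the small engine example: a false negative, a false positive, and a true positive, respectively.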

As a secondary metric, we report the micro-averaged F-score of the system, after fixing the value of the threshold to 0.5. This score is the harmonic mean between precision and recall: \(F = 2 \cdot P \cdot R / (P + R)\). We only provide the F-score metric for purposes of post-hoc error analysis and do not use it at the time of producing the official leaderboard.
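For reference, a minimal sketch of the F-score at a fixed threshold is given below for a flat list of pooled binary targets and confidences. Counting a confidence exactly equal to the threshold as a positive prediction is an assumption, as is returning 0 when there are no true positives.

```python
def micro_f_score(y_true, y_score, threshold=0.5):
    """F = 2 * P * R / (P + R) after thresholding pooled confidences."""
    tp = fp = fn = 0
    for t, s in zip(y_true, y_score):
        predicted = s >= threshold  # assumption: ties count as positive
        if predicted and t == 1:
            tp += 1
        elif predicted and t == 0:
            fp += 1
        elif not predicted and t == 1:
            fn += 1
    if tp == 0:
        return 0.0  # assumption: define F as 0 when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


f_example = micro_f_score([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this example, precision is 2/3 and recall is 1, so the harmonic mean is 0.8.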

We provide evaluation code that computes these metrics for participants to use for evaluating their system output in the source code repository containing the baseline model. The evaluation code accepts the output format we expect for submission, so participants can use this to help ensure that their system output is formatted correctly, as well as assessing the performance of their system as they develop it. We encourage participants to use this code as a starting point for manipulating the dataset and for evaluating their system outputs.

Ranking

There will be two separate rankings, one for performance on the coarse-grained labels and one for performance on the fine-grained labels. In both cases, macro-averaged AUPRC scores will be used for ranking. In addition, as a secondary metric we will also report micro-averaged AUPRC. Please see the Evaluation section for more details about the evaluation procedure.

NOTE: while all submissions will be evaluated, only submissions that make use of spatiotemporal context (STC) will be included in the rankings.

Baseline system

For the baseline model, we use a multi-label multi-layer perceptron model, using a single hidden layer of size 128 (with ReLU non-linearities), and using AutoPool to aggregate frame-level predictions. The model takes in as input:

  • Audio content, via OpenL3 embeddings (content_type="env", input_repr="mel256", and embedding_size=512), using a window size and hop size of 1.0 second (with centered windows), giving us eleven 512-dimensional embeddings for each clip in our dataset.

  • Spatial context, via latitude and longitude values, giving us 2 values for each clip in our dataset.

  • Temporal context, via hour of the day, day of the week, and week of the year, each encoded as a one-hot vector, giving us 24 + 7 + 52 = 83 values for each clip in our dataset.

We Z-score normalize the embeddings, latitude, and longitude values, and concatenate all of the inputs (at each time step), resulting in an input size of 512 + 2 + 83 = 597. We use the weak tags for each audio clip as the targets for each clip. For the train data (which has no verified target), we count a positive for a tag if at least one annotator has labeled the audio clip with that tag (i.e. minority vote). Note that while some of the audio clips in the train set have verified annotations, we only use the crowdsourced annotations. For audio clips in the validate set, we only use annotations that have been manually verified.

We train the model using stochastic gradient descent to minimize binary cross-entropy loss, using L2 regularization (a.k.a. weight decay) with a factor of 1e-5. For training models to predict tags at the fine level, we modify the loss such that if "unknown/other" is annotated for a particular coarse tag, the loss for the fine tags corresponding to this coarse tag is masked out. We train for up to 100 epochs; to mitigate overfitting, we use early stopping with a patience of 20 epochs using loss on the validate set.
We train one model to predict fine-level tags, with coarse-level tag predictions obtained by taking the maximum probability over fine-tags predictions within a coarse category. We train another model only to predict coarse-level tags.
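The input layout, the minority vote, and the fine-to-coarse aggregation can be sketched in a few lines of plain Python. This is not the baseline code: the helper names are hypothetical, the embedding and coordinates are assumed to be Z-score normalized already, and the week value is assumed to be 1-based with any week beyond 52 sharing the last slot.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def temporal_features(hour, day, week):
    """Concatenate one-hot hour (24), day of week (7), and week of year (52)."""
    # Assumption: week is 1-based; clamp into the 52 available slots.
    week_index = max(0, min(week, 52) - 1)
    return one_hot(hour, 24) + one_hot(day, 7) + one_hot(week_index, 52)


def frame_input(norm_embedding, norm_latitude, norm_longitude, hour, day, week):
    """Build one per-frame input: 512 + 2 + 83 = 597 values."""
    return (
        list(norm_embedding)
        + [norm_latitude, norm_longitude]
        + temporal_features(hour, day, week)
    )


def minority_vote(annotations):
    """Positive if at least one annotator marked the tag as present."""
    return 1 if any(a == 1 for a in annotations) else 0


def coarse_from_fine(fine_probabilities):
    """Coarse probability as the maximum over the fine tags of that category."""
    return max(fine_probabilities)


# Made-up values for illustration: Saturday (day=5) at 15:00 in week 27.
features = frame_input([0.0] * 512, 0.1, -0.2, hour=15, day=5, week=27)
```

The same spatial and temporal values are repeated for each of the eleven frames of a clip, since only the embedding changes over time.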

Repository

The code for the baseline and evaluation can be found in our source code repository:


We encourage participants to use this code as a starting point for manipulating the dataset and for evaluating their system outputs.

Results for the development dataset

Fine-level model

Fine-level evaluation:

  • Micro AUPRC: 0.7329
  • Micro F1-score (@0.5): 0.6149
  • Macro AUPRC: 0.5278
  • Coarse Tag AUPRC:

    Coarse Tag Name AUPRC
    engine 0.6429
    machinery-impact 0.5098
    non-machinery-impact 0.4474
    powered-saw 0.5194
    alert-signal 0.8238
    music 0.3151
    human-voice 0.9073
    dog 0.0568

Coarse-level evaluation:

  • Micro AUPRC: 0.8391
  • Micro F1-score (@0.5): 0.6736
  • Macro AUPRC: 0.6370

  • Coarse Tag AUPRC:

    Coarse Tag Name AUPRC
    engine 0.8191
    machinery-impact 0.6502
    non-machinery-impact 0.4474
    powered-saw 0.7960
    alert-signal 0.8837
    music 0.4720
    human-voice 0.9711
    dog 0.0568

Coarse-level model

Coarse-level evaluation:

  • Micro AUPRC: 0.8352
  • Micro F1-score (@0.5): 0.7389
  • Macro AUPRC: 0.6323
  • Coarse Tag AUPRC:

    Coarse Tag Name AUPRC
    engine 0.8500
    machinery-impact 0.6021
    non-machinery-impact 0.4192
    powered-saw 0.7200
    alert-signal 0.8518
    music 0.6145
    human-voice 0.9593
    dog 0.0463

Feedback and questions

For questions and comments on this task, please refer to our Google Groups page.

Citation

If you are participating in this task or using the dataset or code, please consider citing the following paper:

Publication

Mark Cartwright, Ana Elisa Mendez Mendez, Jason Cramer, Vincent Lostanlen, Graham Dove, Ho-Hsiang Wu, Justin Salamon, Oded Nov, and Juan Bello. SONYC urban sound tagging (SONYC-UST): a multilabel dataset from an urban acoustic sensor network. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 35–39. October 2019. URL: http://dcase.community/documents/workshop2019/proceedings/DCASE2019Workshop_Cartwright_4.pdf.

SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network

Abstract

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the Sounds of New York City (SONYC) acoustic sensor network. Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes. In this work, we describe the collection of this dataset, metrics used to evaluate tagging systems, and the results of a simple baseline model.

Keywords

sound tagging, DCASE challenge, public datasets, acoustic sensor network, noise pollution
