Urban Sound Tagging

Task description

The goal of urban sound tagging (UST) is to predict whether each of 23 sources of noise pollution is present or absent in a 10-second scene, as recorded by an acoustic sensor network.

Problem Statement

With the Urban Sound Tagging challenge, SONYC provides a realistic use case for the development and evaluation of innovative systems in machine listening. In coordination with the Department of Environmental Protection, we have established a set of 23 noise "tags", each of them referring to a source of noise, and many of which are frequent causes of noise complaints in New York City. We propose to formulate urban sound tagging as a multilabel classification problem.

The primary goal of the challenge is to write a computer program which, given a ten-second recording from some urban environment, returns whether each of these 23 sources is audible in the recording or not. This is a relatively ambitious goal, because it requires resolving ambiguities between closely related percepts, such as distinguishing motorcycle engines from car engines and truck engines. Even human annotators do not always achieve this distinction. In the following, we refer to this problem as "fine-grained" classification.

A secondary goal of the challenge is to retrieve, rather than fine-grained tags among a list of 23, coarse-grained tags among a list of eight. For example, small engines, medium engines, and large engines are three fine-grained tags which we gather under the coarse category of "engine".

The relationship between coarse-grained and fine-grained tags is hierarchical, so that it is possible to derive coarse-grained labeling from fine-grained labeling, but not the other way around. In practice, although the coarse taxonomy may conflate dissimilar sounds (e.g. car horn and car alarm), it enjoys a considerably better inter-annotator agreement than the fine taxonomy.
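The one-way derivation of coarse labels from fine labels can be sketched as follows. This is a minimal illustration, not the organizers' code: the tag names come from the taxonomy of this task, while the dictionary `COARSE_TO_FINE` (deliberately restricted to two coarse categories) and the function `coarsen` are hypothetical names introduced here.

```python
# Hypothetical excerpt of the taxonomy: coarse tag -> its complete fine tags.
COARSE_TO_FINE = {
    "engine": [
        "small-sounding-engine",
        "medium-sounding-engine",
        "large-sounding-engine",
    ],
    "alert-signal": ["car-horn", "car-alarm", "siren", "reverse-beeper"],
}


def coarsen(fine_labels):
    """Map {fine tag: 0 or 1} to {coarse tag: 0 or 1}.

    A coarse tag is present as soon as one of its fine tags is present.
    """
    return {
        coarse: int(any(fine_labels.get(tag, 0) for tag in fine_tags))
        for coarse, fine_tags in COARSE_TO_FINE.items()
    }


fine = {"car-horn": 1, "siren": 0, "large-sounding-engine": 0}
print(coarsen(fine))  # {'engine': 0, 'alert-signal': 1}
```

The reverse mapping is not available: knowing that "alert-signal" is present does not say whether the source was a car horn, a car alarm, a siren, or a reverse beeper.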

The reason why we simultaneously formulate urban sound tagging at two different levels of granularity is to investigate the tradeoff between training machine listening models with specific, potentially erroneous labels (fine-grained multilabel classification); versus with approximate, likely correct labels (coarse-grained multilabel classification).

For each ten-second snippet, a 23-dimensional binary vector encodes the ground truth: absence and presence of a given class respectively correspond to the values 0 and 1. In other words, a perfectly accurate system for urban sound tagging would only return binary values. In practice, we allow, and even encourage, participants to return unquantized values, ranging continuously between 0 and 1. Quantizing those values amounts to rounding them to the nearest integer.
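As a small illustration of this quantization step, the sketch below thresholds continuous scores at 0.5. The function name `quantize` is hypothetical, and rounding a score of exactly 0.5 upwards is an assumption made here, since the text does not specify how ties are handled.

```python
def quantize(scores, threshold=0.5):
    """Turn continuous scores in [0, 1] into binary tags.

    Scores equal to the threshold are mapped to 1 (an assumed tie rule).
    """
    return [1 if score >= threshold else 0 for score in scores]


print(quantize([0.1, 0.5, 0.93]))  # [0, 1, 1]
```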


The city of New York, like many others, has a "noise code". For reasons of comfort and public health, jackhammers can only operate on weekdays; pet owners are held accountable for their animals' noises; ice cream trucks may play their jingles while in motion, but should remain quiet once they've parked; blasting a car horn is restricted to situations of imminent danger. The noise code presents a plan of legal enforcement and thus mitigation of harmful and disruptive types of sounds.

In an effort towards reducing urban noise pollution, the engagement of citizens is crucial, yet by no means sufficient on its own. Indeed, the rate of complaints that are transmitted, in a given neighborhood, through a municipal service such as 3-1-1, is not necessarily proportional to the level of noise pollution in that neighborhood. In the case of New York City, the Department of Environmental Protection is in charge of attending to the subset of noise complaints which are caused by static sources, including construction and traffic. Unfortunately, statistical evidence demonstrates that, although harmful levels of noise predominantly affect low-income and unemployed New Yorkers, these residents are the least likely to take the initiative of filing a complaint to the city officials. Such a gap between reported exposure and actual exposure raises the challenge of improving fairness, accountability, and transparency in public policies against noise pollution.


SONYC (Sounds of New York City) is an independent research project for mitigating urban noise pollution. One of its aims is to map the spatiotemporal distribution of noise at the scale of a megacity like New York, in real time, and throughout multiple years. To this end, SONYC has designed an acoustic sensor for noise pollution monitoring. This sensor combines a relatively high accuracy in sound acquisition with a relatively low production cost. Between 2015 and 2019, over 50 different sensors have been assembled and deployed in various areas of New York City. Collectively, these sensors have gathered the equivalent of 37 years of audio data.

Every year, the SONYC acoustic sensor network records millions of such audio snippets. This automated procedure of data acquisition, in its own right, gives some insight into the overall rumble of New York City through time and space. However, as of today, each SONYC sensor merely returns an overall sound pressure level (SPL) in its immediate vicinity, without breaking it down into specific components. From a perceptual standpoint, not all sources of outdoor noise are equally unpleasant. For this reason, determining whether a given acoustic scene is in violation of the noise code requires, more than an SPL estimate in decibels, a list of all active sources in the scene. In other words, in the context of automated noise pollution monitoring, computational methods for detection and classification of acoustic scenes and events (DCASE) appear necessary.

Following the submission deadline, we plan to write a report summarizing the most significant findings of all participants. In the long term, some of the most successful techniques in the challenge could inspire the development of an embedded solution for automatic urban sound tagging, enabling low-cost and scalable monitoring, analysis, and mitigation of urban noise.

Audio dataset

Recording procedure

The provided audio has been acquired using the SONYC acoustic sensor network for urban noise pollution monitoring. Over 50 different sensors have been deployed in New York City, and these sensors have collectively gathered the equivalent of 37 years of audio data, of which we provide a small subset. The data was sampled by selecting the nearest neighbors on VGGish features of recordings known to have classes of interest. All recordings are 10 seconds and were recorded with identical microphones at identical gain settings. To maintain privacy, the recordings in this release have been distributed in time and location, and the time and location of the recordings are not included in the metadata.

Reference labels

The process of collecting accurate ground truth data is an essential component in the fair evaluation of all information retrieval systems. Nevertheless, in the scope of DCASE challenges, such a process is particularly tedious, and urban sound tagging is no exception. Indeed, various heterogeneous sources of noise pollution may overlap within the same acoustic scene. Furthermore, the distances between a given sensor and all audible sources may typically vary between 10 meters and 100 meters. Because of absorption and reverberation, identifying all sources may remain difficult for the inexpert ear. To account for possible human errors in the annotation of audio data, we make sure that at least three humans listen to each recording independently.

These humans are citizen scientists, that is, individuals who volunteered their time to annotate soundscapes from SONYC sensors. We recruited these volunteers on Zooniverse, a web platform for citizen science. Although the citizen scientists of SONYC likely did not receive any formal training in ecoacoustics, they did go through a specific tutorial on how to annotate soundscapes on the Zooniverse browser interface. They also received a field guide of urban sounds for reference. Furthermore, ahead of the release of the Urban Sound Tagging dataset, we had performed a controlled study on the tradeoff between reliability and redundancy in crowdsourcing audio annotations from multiple human subjects. The findings of this preliminary study allowed us to write an experimental plan for maximizing the informativeness of our data collection campaign, given a fixed volume of human labor.

Coarse-level and fine-level taxonomies

We illustrate the relationship of hierarchical containment between coarse-grained and fine-grained taxonomy in the diagram below.

Figure 1. Hierarchical taxonomy of urban sound tags in the DCASE Urban Sound Tagging task. Rectangular and round boxes respectively denote coarse and fine tags.

Note that the fine-grained taxonomy distinguishes whether human voices are heard talking, shouting, or amplified through a megaphone; whether a rotating saw is small (handheld) or large (walk-behind); and whether a vehicle alert sound is an anti-theft alarm or a reverse beeper.

Although these distinctions are relatively subtle in machine listening, and often regarded as superfluous, they are essential to efficient noise pollution monitoring, because they map to different paragraphs of the noise code. Yet, it is not always possible, even for expert ears, to resolve these distinctions. Conversely, the coarse-grained taxonomy is slightly less convenient for mapping the spatiotemporal impact of, say, construction noise, because some coarse tags incorporate sources both from construction-related and construction-unrelated acoustic environments.

During our annotation campaign, we gave annotators the possibility to express their uncertainty in their choice of tagging. To this end, we supplemented the list of 23 tags above with 6 tags of the form "other/unknown", each corresponding to a different coarse category. We refer to these 6 tags as incomplete tags, as opposed to the 23 complete tags. For every coarse category that has an incomplete tag, the fine-grained reference annotation may contain one or several complete fine-grained tags belonging to that category, as well as the incomplete fine-grained tag. In the following, we present a metric for fine-grained multilabel classification with potentially incomplete ground truth.

The urban sound tagging challenge has two leaderboards: one for coarse-grained classification and another for fine-grained classification. The evaluation of coarse-grained multilabel classification is relatively standard, because there is no need to account for the presence of incomplete tags. Conversely, the evaluation of fine-grained multilabel classification occasionally requires falling back to the coarse grain, for those samples and coarse categories in which the ground truth annotation is incomplete.

For example, suppose that a ten-second snippet contains an engine sound, but it's impossible to tell whether this engine is medium (car engine) or large (truck engine). In the reference annotation, this snippet contains the incomplete tag "engine of uncertain size", and none of the complete tags for the coarse category of engines. Any machine listening system which detects the presence of an engine, whatever its size, will be considered to produce a correct prediction.

Taxonomy format

The label taxonomy is described in an included YAML file dcase-ust-taxonomy.yaml. We replicate it below for reference.

1. engine
    1: small-sounding-engine
    2: medium-sounding-engine
    3: large-sounding-engine
    X: engine-of-uncertain-size
2. machinery-impact
    1: rock-drill
    2: jackhammer
    3: hoe-ram
    4: pile-driver
    X: other-unknown-impact-machinery
3. non-machinery-impact
    1: non-machinery-impact
4. powered-saw
    1: chainsaw
    2: small-medium-rotating-saw
    3: large-rotating-saw
    X: other-unknown-powered-saw
5. alert-signal
    1: car-horn
    2: car-alarm
    3: siren
    4: reverse-beeper
    X: other-unknown-alert-signal
6. music
    1: stationary-music
    2: mobile-music
    3: ice-cream-truck
    X: music-from-uncertain-source
7. human-voice
    1: person-or-small-group-talking
    2: person-or-small-group-shouting
    3: large-crowd
    4: amplified-speech
    X: other-unknown-human-voice
8. dog
    1: dog-barking-whining

Annotation format

In the metadata corresponding to the public development set, we adopt a one-hot encoding of urban sound tags. Each annotator-snippet pair is a row vector with 29 different entries, corresponding to alphabetically sorted tags: amplified speech, car alarm, car horn, chainsaw, and so forth. In each entry, the numbers 0 and 1 respectively denote the absence or the presence of the corresponding tag. As a complement to this one-hot encoding, we append three identification numbers (ID): one for the human who performed the annotation, one for the sensor in the SONYC network, and one for the ten-second snippet itself. We concatenate all available crowdsourced annotations into a table in which each row corresponds to a different annotator-snippet pair. We format this table into a text file with comma-separated values (CSV), included in the dataset as annotations.csv. Each row in the file represents one multi-label annotation of a recording: it could be the annotation of a single citizen science volunteer, a single SONYC team member, or the agreed-upon ground truth by the SONYC team (see the annotator_id column description for more information). We recommend that participants use a dedicated library to parse this CSV annotation file, rather than attempt to parse tabular data themselves. We describe the columns below.



The data split. (train, validate)


The ID of the sensor the recording is from. These have been anonymized to have no relation to geolocation.


The filename of the audio recording


The anonymous ID of the annotator. If this value is positive, it is a citizen science volunteer from the Zooniverse platform. If it is negative, it is a SONYC team member (only present for validate set). If it is 0, then it is the ground truth agreed-upon by the SONYC team.


Columns of this form indicate the presence of a fine-level class. 1 if present, 0 if not present. If -1, then the class was not labeled in this annotation because the annotation was performed by a SONYC team member who only annotated one coarse group of classes at a time when annotating the validate set.


Columns of this form indicate the presence of a coarse-level class. 1 if present, 0 if not present. If -1, then the class was not labeled in this annotation because the annotation was performed by a SONYC team member who only annotated one coarse group of classes at a time when annotating the validate set. These columns are computed from the fine-level class presence columns and are presented here for convenience when training on only coarse-level classes.


Columns of this form indicate the proximity of a fine-level class. After indicating the presence of a fine-level class, citizen science annotators were asked to indicate the proximity of the sound event to the sensor. Only the citizen science volunteers performed this task, and therefore this data is included in the train subset but not the validate subset. This column may take on one of the following four values: (near, far, notsure, -1). If -1, then the proximity was not annotated because either the annotation was not performed by a citizen science volunteer, or the citizen science volunteer did not indicate the presence of the class.
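The column conventions above can be handled with a standard CSV parser, as in the following sketch. It uses Python's built-in csv module on a toy table rather than the real annotations.csv. Only the column name annotator_id is taken from the text; the names `audio_filename` and `1_engine_presence`, as well as the helper functions, are assumptions made for illustration.

```python
import csv
import io

# Toy stand-in for annotations.csv. Apart from annotator_id, the column
# names are assumptions made for this example.
TOY_CSV = """audio_filename,annotator_id,1_engine_presence
a.wav,12,1
a.wav,0,1
b.wav,-3,-1
b.wav,0,0
"""


def ground_truth_rows(handle):
    """Yield the rows holding the agreed-upon ground truth (annotator_id == 0)."""
    for row in csv.DictReader(handle):
        if int(row["annotator_id"]) == 0:
            yield row


def presence(row, column):
    """Return 0 or 1, or None when the class was not labeled (value -1)."""
    value = int(row[column])
    return None if value == -1 else value


all_rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
truth = list(ground_truth_rows(io.StringIO(TOY_CSV)))
print([r["audio_filename"] for r in truth])  # ['a.wav', 'b.wav']
```

Mapping -1 to None keeps "not labeled" distinct from "absent", so that unlabeled entries are not accidentally counted as negatives.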


Task setup

Just like other datasets in the DCASE challenge, the Urban Sound Tagging dataset comprises a public development set and a private evaluation set. On the public development set, we release both the audio snippets and the corresponding human annotations. Conversely, on the private evaluation set, we only release the audio snippets and do not disclose human annotations except to the coordinators of the DCASE challenge.

Development dataset

The development dataset in this release contains a train split (2351 recordings) and a validate split (443 recordings). The train and validate splits are disjoint with respect to the sensor from which each recording came, and were chosen such that the distribution of labels provided by citizen scientists was similar for each split. The sensors in the evaluation set will also be disjoint from those in the train set.

The purpose of the development set is to allow participants to develop computational systems for multilabel classification in a supervised manner. We have partitioned this development set into two disjoint subsets: a train set and a validate set. In this partition, we have made sure that the frequency of occurrence of each tag is roughly preserved between the train set and the validate set. Furthermore, for every given sensor in the SONYC network, we allocated all available annotated recordings which originate from this sensor into either the train set or the validate set, but never both. This sensor-conditional split guarantees that, for participants, evaluating their system on the validate subset reflects its ability to generalize to new recording conditions, rather than its degree of overfitting to the recording conditions in the train set.

Evaluation dataset

An evaluation set will be released in May 2019. The forthcoming evaluation set may contain sensors from the validate split, but the evaluation recordings will be displaced in time, occurring after any of the recordings in the validate split.

The purpose of the private evaluation set is to perform a comparative evaluation of all competing systems in the Urban Sound Tagging challenge. In this context, it is vital to have a highly reliable estimate of the ground truth and minimize annotation error. Therefore, as organizers of the Urban Sound Tagging challenge, we carefully annotated all ten-second snippets in the private evaluation set ourselves, in addition to crowdsourcing annotations on Zooniverse. As a result, audio snippets in the evaluation set have both three independent annotations from citizen scientists and two independent annotations from machine listening experts. Lastly, we applied a third round of multi-expert review over the mismatches between expert annotations, thus progressively converging towards a consensus. In the following, we denote by "ground truth" the outcome of said consensus.

Task rules

In comparison with previously held challenges, the rules of the DCASE Urban Sound Tagging task are relatively permissive. The only two practices that are strictly forbidden are the manual annotation of the private evaluation set and the inclusion of private external data in the train subset or validate subset. Nevertheless, we explicitly authorize the inclusion of publicly released external data in the train subset or validate subset. For instance, we authorize participants to download AudioSet and FSD and use them to learn general-purpose machine listening features in a supervised way. However, we do not allow participants to collect training data from YouTube or FreeSound in an arbitrary way, in the absence of a sustainable and reproducible open data initiative. This restriction aims at preserving the reproducibility of research in the development of innovative systems for machine listening.

We provide a thorough list of rules below.

  • Forbidden: manual annotation of the evaluation set

    • Participants are not allowed to perform any manual annotation, be it expert or crowdsourced, of the private evaluation set.
  • Forbidden: use of private external data

    • Participants are not allowed to use private external data for training, validating, and/or selecting their model.

    • The term "private external data" includes all data that either have a proprietary license or are hosted on private or corporate servers, such as YouTube, Soundcloud, Dropbox, Google Drive, and so forth.

  • Restricted: use of open external data

    • Participants are allowed to use open external data for training, validating, and/or selecting their model.
    • The term "open external data" includes all datasets that are released under a free license, such as Creative Commons; which have a digital object identifier (DOI); and which are hosted on a dedicated repository for open science, such as Zenodo, DataDryad, and IEEE DataPort.
    • During the development phase, we recommend that participants inform the task organizers of the external datasets they plan to use. These suggested datasets will be listed on the task page, so that all competitors know about them and have equal opportunity to use them.
    • Once the private evaluation set is published, the list of external datasets allowed is locked, and no external sources are allowed anymore.
  • Restricted: use of pre-trained models

    • Participants are allowed to use pretrained models, such as self-supervised audiovisual embeddings, as feature extractors for their system.
    • If these pretrained models are novel contributions, we highly encourage participants to release them under an open source license, for the benefit of the machine listening community at large.
  • Authorized: use of additional metadata

    • While not required, participants are allowed to use the additional metadata of each annotation-snippet pair, such as identifying number (ID) of each sensor, annotator IDs, and perceived proximity, in the development of their system. However, we remind all participants that, at the time of prediction, their system must solely rely on a single-channel audio input.
    • Three examples of authorized use of metadata are: using sensor IDs to control the heterogeneity of recording conditions; using annotator IDs to model the reliability of each annotator; and using perceived proximity as a latent variable in the acoustic model.
  • Authorized: manual inspection of the public development set

    • While not required, participants are allowed to employ additional human labor, either expert or crowdsourced, to refine the annotation of snippets in the public development set.
    • We kindly ask participants who undertake a systematic re-annotation of the train set, in part or in full, to contact us, so that we can consider including their annotations in the public development set for a possible upcoming edition of the Urban Sound Tagging challenge.
    • As stated above, although we encourage human inspection of the public development set, we strictly forbid the human inspection of the private evaluation set.
  • Required (for ranking): open-sourcing of system source code

    • For the sake of the integrity of the challenge, as well as the contribution to the collective knowledge of the research community, we require that all systems be verifiable and reproducible. Therefore, we require that all submissions that wish to be recognized competitively be accompanied by a link to a public source code repository.
    • This can be specified in the metadata YAML file included in the submission.
    • The source code must be hosted on GitHub, Bitbucket, SourceForge, or any other public code hosting service.
    • The source code should be well documented and include instructions for reproducing the results of the submitted system.
    • Only submissions that include reproducible open-sourced code will be considered for the top ten ranking systems.
    • Closed-source submissions will still be accepted but not included in the final ranking. Open-sourced submissions are highly encouraged nonetheless.

Submission Format

The output files should be in CSV format with the following columns.


The filename of the audio recording


Columns of this form should indicate the probability of the presence of a fine-level class (floating point values between 0 and 1). If the system does not produce probabilities and only detections, 1 and 0 can be used for predicted positives and negatives, respectively. Numeric values outside of the range [0,1] or non-numeric values (including empty strings) will throw an error during evaluation and the results will not be accepted. Also note that we accept "X" level predictions, which are meant to indicate the presence of the respective coarse tag when it is either not captured by the available fine tags or uncertain. For example, if a system confidently detects the presence of the coarse tag machinery-impact but not any of the corresponding fine tags, the value of 2-X_other-unknown-impact-machinery would be 1. Feel free to provide these if your system models these probabilities.


Columns of this form should indicate the probability of the presence of a coarse-level class (floating point values between 0 and 1). If the system does not produce probabilities and only detections, 1 and 0 can be used for predicted positives and negatives, respectively.

An example of the output in addition to the accompanying metadata file is included in the submission package template. Note that evaluations at the coarse and fine levels will be performed independently on the columns corresponding to the respective tags. Therefore, participants should provide system outputs for both the coarse and fine tags if they wish to be evaluated on both levels.
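Since out-of-range or non-numeric values cause a submission to be rejected, it may help to check an output file before submitting it. The sketch below is one possible pre-submission check, not the official evaluation code. The function `validate_submission` and the filename column name `audio_filename` are assumptions; the tag column name is the example given in the column description above.

```python
import csv
import io


def validate_submission(handle, filename_column="audio_filename"):
    """Return a list of (line_number, column, value) for every invalid cell.

    A cell is invalid if it is not numeric or lies outside [0, 1].
    """
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(csv.DictReader(handle), start=2):
        for column, value in row.items():
            if column == filename_column:
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append((line_number, column, value))
                continue
            # NaN fails this comparison as well, so it is also reported.
            if not 0.0 <= number <= 1.0:
                problems.append((line_number, column, value))
    return problems


TOY_OUTPUT = (
    "audio_filename,2-X_other-unknown-impact-machinery\n"
    "a.wav,0.7\n"
    "b.wav,1.2\n"
    "c.wav,\n"
)
problems = validate_submission(io.StringIO(TOY_OUTPUT))
print(problems)
```

In this toy file, the check reports the value 1.2 on line 3 and the empty string on line 4.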



The Urban Sound Tagging challenge is a task of multilabel classification. To evaluate and rank participants, we ask them to submit a CSV file following a similar layout as the publicly available CSV file of the development set: in it, each row should represent a different ten-second snippet, and each column should represent an urban sound tag.

The area under the precision-recall curve (AUPRC) is the classification metric that we employ to rank participants. To compute this curve, we threshold the confidence of every tag in every snippet by some fixed threshold \(\tau\), thus resulting in a one-hot encoding of predicted tags. Then, we count the total number of true positives (TP), false positives (FP), and false negatives (FN) between prediction and consensus ground truth over the entire evaluation dataset.

The Urban Sound Tagging challenge provides two leaderboards of participants, according to two distinct metrics: fine-grained AUPRC and coarse-grained AUPRC. In each of the two levels of granularity, we vary \(\tau\) between 0 and 1 and compute TP, FP, and FN for each coarse category. Then, we compute micro-averaged precision \(P = \text{TP} / (\text{TP} + \text{FP})\) and recall \(R = \text{TP} / (\text{TP} + \text{FN})\), giving an equal importance to every sample. We repeat the same operation for all values of \(\tau\) in the interval \([0, 1]\) that result in different values of P and R. Lastly, we use the trapezoidal rule to estimate the AUPRC.
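The thresholding and trapezoidal integration can be illustrated with the simplified sketch below. It is not the official evaluation code: it treats every (snippet, tag) pair as one binary decision instead of counting per coarse category, it assumes at least one positive in the ground truth, and it starts the curve at recall 0 and precision 1, which is a convention chosen here. The function names are hypothetical.

```python
def precision_recall_points(truth, scores):
    """Return (recall, precision) pairs, one per distinct threshold tau.

    truth: list of 0/1 labels, pooled over all snippets and tags.
    scores: list of confidences in [0, 1], aligned with truth.
    """
    positives = sum(truth)  # assumed to be greater than zero
    points = [(0.0, 1.0)]  # assumed starting point of the curve
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(truth, scores) if s >= tau and t == 1)
        fp = sum(1 for t, s in zip(truth, scores) if s >= tau and t == 0)
        fn = positives - tp
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return points


def auprc(points):
    """Estimate the area under the curve with the trapezoidal rule."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


truth = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auprc(precision_recall_points(truth, scores)))
```

Only the distinct score values are used as thresholds, because any threshold between two consecutive scores yields the same TP, FP, and FN counts.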

At the coarse level of granularity, the computations can be summarized by the following expressions defined for each coarse category, where \(t_0\) and \(y_0\) correspond to the presence of an incomplete tag in the ground truth and prediction (respectively), and \(t_k\) and \(y_k\) (for \(k \in \{1, \ldots, K\}\)) correspond to the presence of fine tag \(k\) in the ground truth and prediction (respectively).

$$\text{TP} = \left(1 - \prod_{k=0}^K (1-t_k) \right) \left(1 - \prod_{k=0}^K (1-y_k) \right)$$
$$\text{FP} = \left(\prod_{k=0}^K (1-t_k) \right) \left(1 - \prod_{k=0}^K (1-y_k) \right)$$
$$\text{FN} = \left(1 - \prod_{k=0}^K (1-t_k) \right) \left(\prod_{k=0}^K (1-y_k) \right)$$
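The three expressions above can be transcribed directly, as in the sketch below. It relies on the identity that \(1 - \prod_{k=0}^K (1-t_k)\) equals 1 exactly when at least one \(t_k\) equals 1. The function name `coarse_counts` and the list layout (index 0 holds the incomplete tag) are choices made for this illustration.

```python
def coarse_counts(t, y):
    """Return (TP, FP, FN) for one coarse category of one snippet.

    t, y: lists of 0/1 values for k = 0..K in the ground truth and in the
    thresholded prediction; index 0 is the incomplete tag.
    """
    truth_any = int(any(t))  # 1 - prod(1 - t_k)
    pred_any = int(any(y))   # 1 - prod(1 - y_k)
    tp = truth_any * pred_any
    fp = (1 - truth_any) * pred_any
    fn = truth_any * (1 - pred_any)
    return tp, fp, fn


# Small engine in the ground truth, "engine of uncertain size" predicted:
print(coarse_counts([0, 1, 0, 0], [1, 0, 0, 0]))  # (1, 0, 0)
```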

For samples with complete ground truth (i.e., in the absence of the incomplete fine tag in the ground truth for the coarse category at hand), evaluating urban sound tagging at a fine level of granularity is also relatively straightforward. Indeed, for samples with complete ground truth, the computation of TP, FP, and FN amounts to pairwise conjunctions between predicted fine tags and corresponding ground truth fine tags, without any coarsening. Each fine tag produces either one TP (if it is present and predicted), one FP (if it is absent yet predicted), one FN (if it is present yet not predicted), or one true negative (TN, if it is absent and not predicted). Then, we apply one-hot integer encoding to these boolean values, and sum them up at the level of coarse categories before micro-averaging across coarse categories over the entire evaluation dataset. In this case, the sum (TP+FP+FN+TN) is equal to the number of tags in the fine-grained taxonomy, i.e. 23. Furthermore, the sum (TP+FN) is equal to the number of truly present tags in the sample at hand.

The situation becomes considerably more complex when the incomplete fine tag is present in the ground truth, because this presence hinders the possibility of precisely counting the number of false alarms in the coarse category at hand. We propose a pragmatic solution to this problem; the guiding idea behind our solution is to evaluate the prediction at the fine level only when possible, and fall back to the coarse level if necessary.

For example, if a small engine is present in the ground truth and absent in the prediction but an "other/unknown" engine is predicted, then it's a true positive in the coarse-grained sense, but a false negative in the fine-grained sense. However, if a small engine is absent in the ground truth and present in the prediction, then the outcome of the evaluation will depend on the completeness of the ground truth for the coarse category of engines. If this coarse category is complete (i.e. if the tag "engine of uncertain size" is absent from the ground truth), then we may evaluate the small engine tag at the fine level, and count it as a false positive. Conversely, if the coarse category of engines is incomplete (i.e. the tag "engine of uncertain size" is present in the ground truth), then we fall back to coarse-level evaluation for the sample at hand, and count the small engine prediction as a true positive, in aggregation with potential predictions of medium engines and large engines.

At the fine level of granularity, the computations can be summarized by the following expressions defined for each coarse category, where \(t_0\) and \(y_0\) correspond to the presence of an incomplete tag in the ground truth and prediction (respectively), and \(t_k\) and \(y_k\) (for \(k \in \{1, \ldots, K\}\)) correspond to the presence of fine tag \(k\) in the ground truth and prediction (respectively).

$$\text{TP} = \left(\sum_{k=1}^K t_k y_k \right) + t_0 \left( 1 - \prod_{k=1}^K t_k y_k\right) \left(1 - \prod_{k=0}^K (1-y_k) \right) $$
$$\text{FP} = (1-t_0) \left( \left(\sum_{k=1}^K (1-t_k)y_k \right) + y_0 \left( \prod_{k=1}^K (1-t_k) \right) \left( 1 - \prod_{k=1}^K y_k \right) \right) $$
$$\text{FN} = \left(\sum_{k=1}^K t_k(1-y_k) \right) + t_0 \left( \prod_{k=1}^K (1-t_k) \right) \left(\prod_{k=0}^K (1-y_k) \right)$$
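To make the fallback behavior concrete, the sketch below transcribes the three expressions above term by term and applies them to the engine examples discussed earlier. It is an illustration rather than the official evaluation code; the function name `fine_counts` and the list layout (index 0 holds the incomplete tag) are assumptions, and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def fine_counts(t, y):
    """Return (TP, FP, FN) for one coarse category of one snippet.

    t, y: lists of 0/1 values for k = 0..K in the ground truth and in the
    thresholded prediction; index 0 is the incomplete tag.
    """
    t0, y0 = t[0], y[0]
    tk, yk = t[1:], y[1:]
    pairs = list(zip(tk, yk))

    tp = sum(a * b for a, b in pairs) + t0 * (
        1 - prod(a * b for a, b in pairs)
    ) * (1 - prod(1 - b for b in y))
    fp = (1 - t0) * (
        sum((1 - a) * b for a, b in pairs)
        + y0 * prod(1 - a for a in tk) * (1 - prod(yk))
    )
    fn = sum(a * (1 - b) for a, b in pairs) + t0 * prod(
        1 - a for a in tk
    ) * prod(1 - b for b in y)
    return tp, fp, fn


# Engine category: index 0 = uncertain size, then small, medium, large.
print(fine_counts([0, 0, 0, 0], [0, 1, 0, 0]))  # complete truth: (0, 1, 0)
print(fine_counts([1, 0, 0, 0], [0, 1, 0, 0]))  # incomplete truth: (1, 0, 0)
print(fine_counts([0, 1, 0, 0], [1, 0, 0, 0]))  # missed small engine: (0, 0, 1)
```

The first two calls differ only in the completeness of the ground truth: the same small engine prediction counts as a false positive when the category is complete, and as a true positive when the evaluation falls back to the coarse level.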

As a secondary metric, we report the micro-averaged F-score of the system, after fixing the value of the threshold to 0.5. This score is the harmonic mean between precision and recall: \(F = 2 \cdot P \cdot R / (P + R)\). We only provide the F-score metric for purposes of post-hoc error analysis and do not use it at the time of producing the official leaderboard.
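As a worked instance of this formula, the sketch below computes the F-score from TP, FP, and FN counts pooled over the dataset. The function name is hypothetical, and returning 0 when there are no true positives is an assumed convention for the case where precision or recall is zero or undefined.

```python
def micro_f_score(tp, fp, fn):
    """Harmonic mean of micro-averaged precision and recall."""
    if tp == 0:
        return 0.0  # assumed convention; the formula is undefined here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# With TP=2, FP=1, FN=1: P = R = 2/3, hence F = 2/3.
print(micro_f_score(2, 1, 1))
```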

We provide evaluation code that computes these metrics for participants to use for evaluating their system output in the source code repository containing the baseline model. The evaluation code accepts the output format we expect for submission, so participants can use this to help ensure that their system output is formatted correctly, as well as assessing the performance of their system as they develop it. We encourage participants to use this code as a starting point for manipulating the dataset and for evaluating their system outputs.

Baseline system

For the baseline model, we simply use a multi-label logistic regression model. Because of the size of the dataset, we opted for a simple and shallow model for our baseline. Our model takes VGGish embeddings as its input representation, which by default use a window size and hop size of 0.96 seconds, giving us ten 128-dimensional embeddings for each clip in our dataset. We use the weak tags for each audio clip as the targets for each clip. For the train data (which has no verified target), we simply count a positive for a tag if at least one annotator has labeled the audio clip with that tag.
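The "at least one annotator" rule for building train targets can be sketched as follows. This is an illustration of the rule as stated, not the baseline's actual code; the function name `aggregate_any` and the input layout (a list of clip identifier and tag dictionary pairs) are assumptions.

```python
def aggregate_any(annotations):
    """Merge several annotations per clip into one target per clip.

    annotations: list of (clip_id, {tag: 0 or 1}) pairs, one per annotator.
    A tag is positive for a clip if at least one annotator marked it.
    """
    targets = {}
    for clip_id, tags in annotations:
        clip_targets = targets.setdefault(clip_id, {})
        for tag, value in tags.items():
            clip_targets[tag] = max(clip_targets.get(tag, 0), value)
    return targets


annotations = [
    ("a.wav", {"car-horn": 0, "siren": 1}),
    ("a.wav", {"car-horn": 1, "siren": 0}),
    ("b.wav", {"car-horn": 0, "siren": 0}),
]
print(aggregate_any(annotations))
```

Here both tags end up positive for a.wav, even though each was marked by a single annotator, and both remain negative for b.wav.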

We trained the model using stochastic gradient descent to minimize binary cross-entropy loss. When training models to predict tags at the fine level, the loss is modified so that, if "unknown/other" is annotated for a particular coarse tag, the loss for the fine tags corresponding to this coarse tag is masked out. We use early stopping based on the loss on the validate set to mitigate overfitting.
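
The masking step can be illustrated with a small sketch. This is not the baseline implementation: the helper names `fine_mask` and `masked_bce`, the example tag names, and the choice to average the loss over unmasked tags are all assumptions made for this example.

```python
import math


def fine_mask(fine_to_coarse, unknown_coarse):
    """Return 0 for fine tags whose coarse tag was annotated "unknown/other", else 1."""
    return [0 if coarse in unknown_coarse else 1 for coarse in fine_to_coarse]


def masked_bce(probs, targets, mask, eps=1e-7):
    """Binary cross-entropy averaged over unmasked tags only (assumed reduction)."""
    total, kept = 0.0, 0
    for p, t, m in zip(probs, targets, mask):
        if not m:
            continue  # masked tags contribute nothing to the loss
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        kept += 1
    return total / kept if kept else 0.0


# Two fine tags under "engine" are masked; only the "dog" tag is scored.
mask = fine_mask(["engine", "engine", "dog"], {"engine"})
loss = masked_bce([0.9, 0.1, 0.5], [0, 1, 1], mask)
```

With this masking, confident but wrong predictions on the two masked engine tags do not affect the loss, which reduces to the cross-entropy of the single unmasked tag.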

One model was trained to predict fine-level tags, with coarse-level tag predictions obtained by taking the maximum probability over fine-tag predictions within a coarse category. Another model was trained only to predict coarse-level tags.
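
A minimal sketch of the max-over-fine-tags rule follows. The function name `coarse_from_fine` and the tag names in the usage example are illustrative assumptions, not identifiers taken from the baseline code.

```python
def coarse_from_fine(fine_probs, fine_to_coarse):
    """Coarse-level probability = maximum fine-level probability in that category."""
    coarse = {}
    for fine_tag, prob in fine_probs.items():
        category = fine_to_coarse[fine_tag]
        coarse[category] = max(coarse.get(category, 0.0), prob)
    return coarse


# Illustrative tags and probabilities only.
fine_probs = {"small-engine": 0.2, "large-engine": 0.7, "dog-barking": 0.1}
fine_to_coarse = {"small-engine": "engine", "large-engine": "engine", "dog-barking": "dog"}
coarse_probs = coarse_from_fine(fine_probs, fine_to_coarse)
```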

For inference, we predict tags at the frame level and simply take the average of output tag probabilities as the clip-level tag probabilities.
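
The frame-to-clip aggregation amounts to a per-tag mean over frames, as in this short sketch. The function name `clip_probabilities` and the nested-list input layout (one list of tag probabilities per frame) are assumptions for this example.

```python
def clip_probabilities(frame_probs):
    """Average frame-level tag probabilities into clip-level probabilities.

    frame_probs: list of frames, each a list of per-tag probabilities.
    """
    num_frames = len(frame_probs)
    return [sum(per_tag) / num_frames for per_tag in zip(*frame_probs)]


# Two frames, two tags (illustrative values only).
clip_level = clip_probabilities([[0.2, 0.8], [0.4, 0.6]])
```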


The code for the baseline and evaluation can be found in our source code repository:

Again, we encourage participants to use this code as a starting point for manipulating the dataset and for evaluating their system outputs.

Baseline Results

The results for the baseline systems can be found below:

Fine-level model

Fine-level evaluation:

  • Micro AUPRC: 0.6717253550113078
  • Micro F1-score (@0.5): 0.5015353121801432
  • Macro AUPRC: 0.427463730110938
  • Coarse Tag AUPRC:

    Coarse Tag Name   AUPRC
    engine   0.7122944027927718
    machinery-impact   0.19788462073882798
    non-machinery-impact   0.36403054299960413
    powered-saw   0.3855391333457478
    alert-signal   0.6359773072562782
    music   0.21516455980970542
    human-voice   0.8798293427878373
    dog   0.0289899311567318

Coarse-level evaluation:

  • Micro AUPRC: 0.7424913328250053
  • Micro F1-score (@0.5): 0.5065590312815338
  • Macro AUPRC: 0.5297273551638281

  • Coarse Tag AUPRC:

    Coarse Tag Name   AUPRC
    engine   0.8594524913674696
    machinery-impact   0.28532090723421905
    non-machinery-impact   0.36403054299960413
    powered-saw   0.7200903371047481
    alert-signal   0.7536308641644877
    music   0.282907929536143
    human-voice   0.9433958377472215
    dog   0.0289899311567318

Coarse-level model

Coarse-level evaluation:

  • Micro AUPRC: 0.761602033798918
  • Micro F1-score (@0.5): 0.6741035856573705
  • Macro AUPRC: 0.5422528970239988
  • Coarse Tag AUPRC:

    Coarse Tag Name   AUPRC
    engine   0.8552225117097685
    machinery-impact   0.3595869306870976
    non-machinery-impact   0.36067068831072385
    powered-saw   0.6779980935124421
    alert-signal   0.8126810682348001
    music   0.2988632647455638
    human-voice   0.94516997783423
    dog   0.02783064115736446

Feedback and Questions

For questions and comments on this task, please refer to our Google Groups page.


If you are participating in this task or using the dataset or code, please consider citing the following papers:


Juan P. Bello, Claudio Silva, Oded Nov, R. Luke Dubois, Anish Arora, Justin Salamon, Charles Mydlarz, and Harish Doraiswamy. SONYC: a system for monitoring, analyzing, and mitigating urban noise pollution. Communications of the ACM, 62(2):68–77, Feb 2019. doi:10.1145/3224204.


SONYC: A System for Monitoring, Analyzing, and Mitigating Urban Noise Pollution


Noise is unwanted or harmful sound from environmental sources, including traffic, construction, industrial, and social activity. Noise pollution is one of the topmost quality-of-life concerns for urban residents in the U.S., with more than 70 million people nationwide exposed to noise levels beyond the limit the U.S. Environmental Protection Agency (EPA) considers harmful [12]. Such levels have proven effects on health, including sleep disruption, hypertension, heart disease, and hearing loss [5, 11, 12]. In addition, there is evidence of harmful effects on educational performance, with studies showing noise pollution causing learning and cognitive impairment in children, resulting in decreased memory capacity, reading skills, and test scores.