For sound event detection tasks

The main evaluation metric is the segment-based error rate.

A detailed description of the metrics and of their calculation procedure is presented in:


Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. doi:10.3390/app6060162.


Metrics for Polyphonic Sound Event Detection


This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

Segment-based metrics

Segment-based evaluation is done on a fixed time grid, using one-second segments to compare the ground truth and the system output.

In each segment \(k\) we count:

  • true positives \(TP\): events indicated as active by both the ground truth and the system output;
  • false positives \(FP\): events indicated as active by the system output but inactive by the ground truth;
  • false negatives \(FN\): events indicated as inactive by the system output but active by the ground truth;
  • substitutions \(S\): events indicated as active by the system output under a wrong label; one substitution is equivalent to one false positive and one false negative, meaning that the system did not detect the correct event (a false negative for the correct class) but detected something else (a false positive for another class);
  • insertions \(I\): false positives remaining after subtracting the substitutions;
  • deletions \(D\): false negatives remaining after subtracting the substitutions;
  • reference events \(N\): number of events active in the ground truth in segment \(k\).
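The counts above can be sketched for a single segment as follows. This is a minimal illustration, not the toolbox implementation: the function name `segment_counts` and the representation of a segment as two sets of active labels are assumptions made here. The number of substitutions is taken as the smaller of the false positive and false negative counts, so that each substitution pairs one false positive with one false negative.

```python
def segment_counts(reference, estimated):
    """Intermediate statistics for one segment.

    reference, estimated: sets of event labels active in the segment
    (hypothetical representation chosen for this sketch).
    """
    tp = len(reference & estimated)   # active in both
    fp = len(estimated - reference)   # active only in the system output
    fn = len(reference - estimated)   # active only in the ground truth
    s = min(fp, fn)                   # each substitution pairs one FP with one FN
    i = fp - s                        # false positives left after substitutions
    d = fn - s                        # false negatives left after substitutions
    n = len(reference)                # reference events in this segment
    return {"TP": tp, "FP": fp, "FN": fn, "S": s, "I": i, "D": d, "N": n}


counts = segment_counts({"speech", "car"}, {"speech", "dog", "door"})
print(counts)
```

In this example the system finds "speech", misses "car" and reports two extra labels, so one false positive is paired with the missed event as a substitution and the other counts as an insertion.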

Error rate

The error rate is calculated as described in [Poliner2007], over all test data, based on the total number of insertions, deletions and substitutions:

\begin{equation*} ER=\frac{\sum {S(k)}+\sum{D(k)}+\sum{I(k)}} {\sum N(k)} \end{equation*}
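As a rough sketch of how the sums in this formula accumulate, the snippet below takes per-segment false positive, false negative and reference counts and returns the overall error rate. The function name `segment_error_rate` and the tuple layout are hypothetical, and raising an error when there are no reference events is a choice made for this example only.

```python
def segment_error_rate(segments):
    """Overall segment-based error rate.

    segments: iterable of (FP, FN, N) tuples, one per segment
    (hypothetical layout chosen for this sketch).
    """
    total_s = total_d = total_i = total_n = 0
    for fp, fn, n in segments:
        s = min(fp, fn)       # substitutions in this segment
        total_s += s
        total_d += fn - s     # deletions in this segment
        total_i += fp - s     # insertions in this segment
        total_n += n
    if total_n == 0:
        raise ValueError("error rate is undefined without reference events")
    return (total_s + total_d + total_i) / total_n


er = segment_error_rate([(2, 1, 2), (0, 1, 3), (1, 0, 0)])
print(er)  # 0.8
```

Because insertions are counted in the numerator but not in the denominator, the error rate can exceed 1 when the system output contains many false positives.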


F-score

The F-score is calculated over all test data, based on the total number of false positives, false negatives and true positives:

\begin{equation*} \label{eq-fscore} F=\frac{2P \cdot R}{P+R}, \quad \text{where} \quad P=\frac {\sum TP(k)} {\sum TP(k)+\sum FP(k) },\quad R=\frac{\sum TP(k)} { \sum TP(k)+\sum FN(k) } \\ \end{equation*}
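A small sketch of this calculation is given below. The counts are summed over all segments before precision and recall are computed, rather than averaging per-segment scores. The function name `f_score` is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this example, not something the formula specifies.

```python
def f_score(tp, fp, fn):
    """F-score from accumulated true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumption: 0.0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0      # assumption: 0.0 when undefined
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-segment (TP, FP, FN) counts, summed before computing the score.
segments = [(1, 2, 1), (2, 0, 1), (0, 1, 0)]
tp = sum(seg[0] for seg in segments)
fp = sum(seg[1] for seg in segments)
fn = sum(seg[2] for seg in segments)
f = f_score(tp, fp, fn)
print(round(f, 4))  # 0.5455
```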

Event-based metrics

Event-based evaluation considers true positives, false positives and false negatives with respect to event instances.

Definition: An event in the system output is considered correctly detected if its temporal position overlaps with the temporal position of an event with the same label in the ground truth. A tolerance is allowed for the onset and the offset: 200 ms for the onset, and 200 ms or half of the event length for the offset.
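The definition can be sketched as a simple predicate. The function and parameter names below are hypothetical, and two points are assumptions of this sketch rather than statements from the text: the offset tolerance is taken as the larger of 200 ms and half the length of the reference event, and events are represented as `(onset, offset, label)` tuples in seconds.

```python
def is_correctly_detected(ref, est, t_collar=0.2, fraction_of_length=0.5):
    """Check one system event against one reference event.

    ref, est: (onset, offset, label) tuples, times in seconds.
    """
    ref_on, ref_off, ref_label = ref
    est_on, est_off, est_label = est
    if ref_label != est_label:
        return False
    onset_ok = abs(ref_on - est_on) <= t_collar
    # Assumption: use the larger of the fixed collar and half the reference length.
    offset_tolerance = max(t_collar, fraction_of_length * (ref_off - ref_on))
    offset_ok = abs(ref_off - est_off) <= offset_tolerance
    return onset_ok and offset_ok


reference = (1.0, 5.0, "speech")
print(is_correctly_detected(reference, (1.15, 6.5, "speech")))  # True
print(is_correctly_detected(reference, (1.5, 5.0, "speech")))   # False, onset too late
print(is_correctly_detected(reference, (1.0, 5.0, "dog")))      # False, wrong label
```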

We count for all sequences:

  • true positives \(TP\): events in the system output that are correctly detected according to the definition;
  • false positives \(FP\): events in the system output that are not correct according to the definition;
  • false negatives \(FN\): events in the ground truth that have not been correctly detected according to the definition;
  • substitutions \(S\): events in the system output that have a correct temporal position but an incorrect class label;
  • insertions \(I\): events in the system output that are neither correct nor substitutions;
  • deletions \(D\): events in the ground truth that are neither correctly detected nor substituted;
  • reference events \(N\): number of events in the ground truth.

Error rate

\begin{equation*} ER=\frac{S+D+I}{N} \end{equation*}


F-score

\begin{equation*} F=\frac{2P \cdot R}{P+R}, \quad \text{where} \quad P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN} \end{equation*}
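As an illustration with made-up counts (not results from any system), suppose an evaluation yields \(TP=6\), \(FP=4\), \(FN=3\) and \(S=2\). Following the definitions above, \(I=FP-S=2\), \(D=FN-S=1\), and every reference event is either correctly detected or not, so \(N=TP+FN=9\). The event-based metrics are then:

\begin{equation*} ER=\frac{2+1+2}{9}=\frac{5}{9}\approx 0.56, \quad P=\frac{6}{6+4}=0.6, \quad R=\frac{6}{6+3}\approx 0.67, \quad F=\frac{2 \cdot 0.6 \cdot \frac{6}{9}}{0.6+\frac{6}{9}}=\frac{12}{19}\approx 0.63 \end{equation*}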



G. Poliner and D. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2007(1):048317, 2007.


A Discriminative Model for Polyphonic Piano Transcription


We present a discriminative model for polyphonic piano transcription. Support vector machines trained on spectral features are used to classify frame-level note instances. The classifier outputs are temporally constrained via hidden Markov models, and the proposed system is used to transcribe both synthesized and real piano recordings. A frame-level transcription accuracy of 68% was achieved on a newly generated test set, and direct comparisons to previous approaches are provided.