Challenge has ended. Full results for this task can be found here

Description

The task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warnings due to their industry relevance and to the underuse of audio in this context. The results will help define new grounds for large-scale sound event detection and show the benefit of audio for self-driving cars, smart cities and related areas.

Audio dataset

The task employs a subset of “AudioSet: An Ontology And Human-Labeled Dataset For Audio Events” by Google. AudioSet consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (fewer than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

The subset consists of 17 sound events divided into two categories: “Warning” and “Vehicle”. The list below shows each class and next to it the approximate number of 10-second clips. Note that each clip may correspond to more than one sound event.

Warning sounds:

Train horn (441)
Air horn, truck horn (407)
Car alarm (273)
Reversing beeps (337)
Ambulance (siren) (624)
Police car (siren) (2,399)
Fire engine, fire truck (siren) (2,399)
Civil defense siren (1,506)
Screaming (744)

Vehicle sounds:

Bicycle (2,020)
Skateboard (1,617)
Car (25,744)
Car passing by (3,724)
Bus (3,745)
Truck (7,090)
Motorcycle (3,291)
Train (2,301)

Annotations

To collect the data, Google worked with human annotators who listened to, analyzed, and verified the sounds they heard within YouTube clips. To accumulate examples for all classes faster, Google relied on available YouTube metadata and content-based search to nominate candidate video segments that were likely to contain the target sound.

A detailed description of the data annotation procedure is available in:

Publication

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.

Audio Set: An ontology and human-labeled dataset for audio events

Abstract

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets -- principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

The subset consists of two partitions: development and evaluation. The format for each partition list is the following:

CID, start_seconds, end_seconds, positive_labels

For example:

-jj2tyuf6-A,  80.000,  90.000,  "Train horn,Air horn, truck horn", "/m/0284vy3,/m/05x_td"

The first column, -jj2tyuf6-A, is the YouTube ID of the video from which the 10-second clip was extracted. The second and third columns, t=80 sec to t=90 sec, give the clip boundaries within the full video. The remaining columns list the sound classes present in the clip, first by name ("Train horn" and "Air horn, truck horn") and then by identifier (/m/0284vy3 and /m/05x_td).
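As an illustration, one line of a partition list could be parsed as sketched below. This is a minimal sketch, not part of the official tooling: the `ClipEntry` type and the `parse_partition_line` function are hypothetical names, and the sketch assumes every line follows the quoting shown in the example above.

```python
import csv
from typing import List, NamedTuple


class ClipEntry(NamedTuple):
    """One row of a partition list (hypothetical helper type)."""
    video_id: str
    start_seconds: float
    end_seconds: float
    label_ids: List[str]


def parse_partition_line(line: str) -> ClipEntry:
    # skipinitialspace=True lets the quoted fields follow ", " separators.
    row = next(csv.reader([line], skipinitialspace=True))
    # Label names may contain commas themselves ("Air horn, truck horn"),
    # so the identifiers in the last column are the safer field to split.
    label_ids = [mid.strip() for mid in row[-1].split(",") if mid.strip()]
    return ClipEntry(row[0], float(row[1]), float(row[2]), label_ids)


entry = parse_partition_line(
    '-jj2tyuf6-A,  80.000,  90.000,  '
    '"Train horn,Air horn, truck horn", "/m/0284vy3,/m/05x_td"'
)
```

Here `entry.label_ids` holds the two class identifiers, and `entry.end_seconds - entry.start_seconds` gives the clip duration.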

Note that AudioSet does not come with time boundaries for each sound class within the 10-second clip and thus annotations are considered weak labels.

Download

** Development dataset **

Large-scale weakly supervised sound event detection for smart cars, Development dataset

** Evaluation dataset **

Large-scale weakly supervised sound event detection for smart cars, Evaluation dataset (.zip, 863 MB), password "DCASE_2017_evaluation_set"

Task setup

The challenge consists of detecting sound events within web videos using weakly labeled training data. The detection within a 10-second clip should be performed at two levels:

Without timestamps (similar to audio tagging).
With timestamps (start-end time).

The training set has weak labels denoting the presence of a given sound event in the video’s soundtrack and no timestamps are provided. For testing and evaluation, strong labels with timestamps are provided for the purpose of evaluating performance.

Development dataset

The development dataset is divided into two partitions: training and testing. The training partition is unbalanced and has at least 30 clips per sound event, whereas the testing partition has about 30 clips per sound event. Note that a 10-second clip may correspond to more than one sound event.

Evaluation dataset

Evaluation set without ground truth will be released one month before the submission deadline. Full ground truth will be published after the DCASE 2017 challenge and workshop are concluded. The evaluation set has at least 60 files per sound event. A fraction of these clips will have strong labels for the purpose of evaluation.

Submission

Detailed information on the challenge submission can be found on the submission page.

System output should be presented as a single text-file (in CSV format) containing classification result for each audio file in the evaluation set. Result items can be in any order. Format:

[filename (string)][tab][event onset time in seconds (float)][tab][event offset time in seconds (float)][tab][event label (string)]

Both subtasks, with and without timestamps, will be evaluated using the same output format, but the evaluation without timestamps will ignore the timestamps. The system output file can be the same for both subtasks, or two different versions can be provided.

Multiple system outputs can be submitted (maximum 4 per participant). If submitting multiple systems, the individual text-files should be packaged into a zip file for submission. Please carefully mark the connection between the submitted files and the corresponding system or system parameters (for example by naming the text file appropriately).

If no event is detected for the particular audio signal, the system should still output a row containing only the file name, to indicate that the file was processed. This is used to verify that participants processed all evaluation files.
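The output format described above could be produced as sketched below. This is a minimal sketch under stated assumptions: the function name `write_system_output`, the file names, and the three-decimal time formatting are illustrative choices, not requirements stated by the challenge.

```python
import csv
import io


def write_system_output(detections, all_files, stream):
    """Write one tab-separated row per detected event.

    detections: dict mapping filename -> list of (onset, offset, label).
    all_files:  every evaluation file name, so that files without any
                detection still get a row containing only the file name.
    """
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for filename in all_files:
        events = detections.get(filename, [])
        if not events:
            writer.writerow([filename])  # file was processed, nothing detected
            continue
        for onset, offset, label in events:
            writer.writerow([filename, f"{onset:.3f}", f"{offset:.3f}", label])


buffer = io.StringIO()
write_system_output(
    {"clip_a.wav": [(0.0, 4.2, "Train horn"), (3.0, 10.0, "Car")]},
    ["clip_a.wav", "clip_b.wav"],  # hypothetical file names
    buffer,
)
output_text = buffer.getvalue()
```

Iterating over the full file list, rather than over the detections, is what guarantees a row for every processed file.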

Task rules

These are the general rules valid for all tasks. The same rules and additional information on technical report and submission requirements can be found here. Task specific rules are highlighted with green.

Participants are not allowed to use external data for system development. Data from another task is considered external data.
Other elements related to the video, such as the rest of the audio from which the 10-second clip was extracted, the video frames, and the metadata, are also considered external data.
System outputs that do not respect the challenge rules will be evaluated on request, but they will not be officially included in the challenge rankings.
Participants are not allowed to use the embeddings provided by AudioSet or any other features that indirectly use external data.
Only weak labels and none of the strong labels (timestamps) can be used for training the submitted system.
We strongly suggest using an approach that addresses the use of weak labels.
Manipulation of provided training and development data is allowed.
The development dataset can be augmented without use of external data (e.g. by mixing in data sampled from a probability density function, or by using techniques such as pitch shifting or time stretching).
Participants are not allowed to make subjective judgments of the evaluation data, nor to annotate it. The evaluation dataset cannot be used to train the submitted system.
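As one example of augmentation that stays within the provided data, two development clips could be mixed as sketched below. This is only a sketch: the function `mix_clips` and its `gain_b` parameter are hypothetical, samples are assumed to be floats in the range [-1, 1], and giving the mixture the union of both weak label sets is an assumption rather than something the rules prescribe.

```python
def mix_clips(samples_a, labels_a, samples_b, labels_b, gain_b=0.5):
    """Mix two clips sample by sample and merge their weak labels."""
    length = min(len(samples_a), len(samples_b))
    mixed = [samples_a[i] + gain_b * samples_b[i] for i in range(length)]
    # Rescale only if the mixture would clip.
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    # Assumption: every event present in either clip is present in the mix.
    return mixed, set(labels_a) | set(labels_b)


mixed, labels = mix_clips([0.5, -0.5, 1.0], {"Car"}, [1.0, 1.0, 1.0], {"Bus"})
```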

Evaluation

A - Sound event detection without timestamps (or Audio Tagging).
F-score (Precision and Recall) will be calculated for the detection of sound events within the 10-second clip. Ranking of submitted systems will be based on F-score.

B - Sound event detection with timestamps.
Segment-based error rate will be calculated in one-second segments over the entire test set. Additionally, segment-based F-score will be calculated. Ranking of submitted systems will be based on segment-based error rate.
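The segment-based metrics can be illustrated with the simplified sketch below. It follows the definitions in the metrics paper cited in this section: in each one-second segment, substitutions are S = min(FN, FP), deletions are D = max(0, FN - FP), insertions are I = max(0, FP - FN), and ER = (S + D + I) / N, where N is the number of active reference events summed over all segments. The function names and the rounding of event boundaries to whole segments are assumptions of this sketch; it is not the sed_eval implementation used for the official ranking.

```python
import math


def active_labels_per_segment(events, num_segments, resolution=1.0):
    """events: list of (onset, offset, label) tuples for one clip."""
    segments = [set() for _ in range(num_segments)]
    for onset, offset, label in events:
        first = max(int(math.floor(onset / resolution)), 0)
        last = min(int(math.ceil(offset / resolution)), num_segments)
        for k in range(first, last):
            segments[k].add(label)
    return segments


def segment_based_scores(reference, estimated, clip_length=10.0, resolution=1.0):
    """reference/estimated: dict filename -> list of (onset, offset, label)."""
    num_segments = int(math.ceil(clip_length / resolution))
    tp = fp = fn = subs = dels = ins = n_ref = 0
    for filename, ref_events in reference.items():
        ref_segs = active_labels_per_segment(ref_events, num_segments, resolution)
        est_segs = active_labels_per_segment(
            estimated.get(filename, []), num_segments, resolution)
        for ref, est in zip(ref_segs, est_segs):
            seg_fp = len(est - ref)
            seg_fn = len(ref - est)
            tp += len(ref & est)
            fp += seg_fp
            fn += seg_fn
            subs += min(seg_fn, seg_fp)
            dels += max(0, seg_fn - seg_fp)
            ins += max(0, seg_fp - seg_fn)
            n_ref += len(ref)
    error_rate = (subs + dels + ins) / n_ref if n_ref else 0.0
    denom = 2 * tp + fp + fn
    f_score = 2 * tp / denom if denom else 0.0
    return error_rate, f_score


error_rate, f_score = segment_based_scores(
    {"clip": [(0.0, 4.0, "Car"), (2.0, 6.0, "Train horn")]},
    {"clip": [(0.0, 3.0, "Car"), (2.0, 6.0, "Bus")]},
)
```

Because insertions are counted against the number of reference events, the error rate can exceed 1.0 when a system outputs many false positives.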

Ranking of submitted systems will be done independently for each of the two metrics. Detailed information on metrics calculation is available in:

Publication

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016. URL: http://www.mdpi.com/2076-3417/6/6/162, doi:10.3390/app6060162.

Metrics for Polyphonic Sound Event Detection

Abstract

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of presented metrics.

Toolbox

A short description of metrics can be found here.

Evaluation for Subtask A is done using a script available in the data repository linked in the "Download" section, inside the folder named "evaluation". Evaluation for Subtask B is done using the sed_eval toolbox; a wrapper that facilitates its usage is also provided in the same data repository, inside the folder named "TaskB_evaluation".

sed_eval - Evaluation toolbox for Sound Event Detection

When using the toolbox directly, use the following parameter for the sed_eval.sound_event.SegmentBasedMetrics evaluator to align it with the baseline system:

For Subtask B: one-second segment size (time_resolution=1.0)

Results

Rank	Code	Author	Affiliation	Technical Report	Subtask A (audio tagging): F1 on evaluation dataset	Subtask B (sound event detection): ER on evaluation dataset	Subtask B (sound event detection): F1 on evaluation dataset
	Adavanne_TUT_task4_1	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	task-large-scale-sound-event-detection-results#Adavanne2017	45.5	0.8100	47.9
	Adavanne_TUT_task4_2	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	task-large-scale-sound-event-detection-results#Adavanne2017	46.6	0.8000	48.3
	Adavanne_TUT_task4_3	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	task-large-scale-sound-event-detection-results#Adavanne2017	44.5	0.8200	48.9
	Adavanne_TUT_task4_4	Sharath Adavanne	Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland	task-large-scale-sound-event-detection-results#Adavanne2017	26.3	0.7900	49.0
	Chou_SINICA_task4_1	Szu-Yu Chou	Research Center for IT innovation, Academia Sinica, Taipei, Taiwan	task-large-scale-sound-event-detection-results#Chou2017	47.6	0.8300	42.4
	Chou_SINICA_task4_2	Szu-Yu Chou	Research Center for IT innovation, Academia Sinica, Taipei, Taiwan	task-large-scale-sound-event-detection-results#Chou2017	49.0
	Chou_SINICA_task4_3	Szu-Yu Chou	Research Center for IT innovation, Academia Sinica, Taipei, Taiwan	task-large-scale-sound-event-detection-results#Chou2017	47.9
	Chou_SINICA_task4_4	Szu-Yu Chou	Research Center for IT innovation, Academia Sinica, Taipei, Taiwan	task-large-scale-sound-event-detection-results#Chou2017	49.0
	DCASE2017 baseline	Benjamin Elizalde	Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA	task-large-scale-sound-event-detection-results#Badlani2017	18.2	0.9300	28.4
	Kukanov_UEF_task4_1	Ivan Kukanov	School of Computing, University of Eastern Finland, Joensuu, Finland	task-large-scale-sound-event-detection-results#Kukanov2017	39.6
	Lee_KAIST_task4_1	Jongpil Lee	Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, Korea	task-large-scale-sound-event-detection-results#Lee2017	40.3	0.8200	39.4
	Lee_KAIST_task4_2	Jongpil Lee	Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, Korea	task-large-scale-sound-event-detection-results#Lee2017	47.3	0.7800	42.6
	Lee_KAIST_task4_3	Jongpil Lee	Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, Korea	task-large-scale-sound-event-detection-results#Lee2017	47.2	0.7800	44.2
	Lee_KAIST_task4_4	Jongpil Lee	Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, Korea	task-large-scale-sound-event-detection-results#Lee2017	47.1	0.7500	47.1
	Lee_SNU_task4_1	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	task-large-scale-sound-event-detection-results#Lee2017a	52.3	0.6700	54.4
	Lee_SNU_task4_2	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	task-large-scale-sound-event-detection-results#Lee2017a	52.3	0.6700	54.4
	Lee_SNU_task4_3	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	task-large-scale-sound-event-detection-results#Lee2017a	52.6	0.6700	55.4
	Lee_SNU_task4_4	Kyogu Lee	Music and Audio Research Group, Seoul National University, Seoul, Korea	task-large-scale-sound-event-detection-results#Lee2017a	52.1	0.6600	55.5
	Salamon_NYU_task4_1	Justin Salamon	Music and Audio Research Laboratory, New York University, New York City, USA; Center of Urban Science and Progress, New York University, New York City, USA	task-large-scale-sound-event-detection-results#Salamon2017	46.0	0.8200	46.2
	Salamon_NYU_task4_2	Justin Salamon	Music and Audio Research Laboratory, New York University, New York City, USA; Center of Urban Science and Progress, New York University, New York City, USA	task-large-scale-sound-event-detection-results#Salamon2017	45.3	0.8500	45.6
	Salamon_NYU_task4_3	Justin Salamon	Music and Audio Research Laboratory, New York University, New York City, USA; Center of Urban Science and Progress, New York University, New York City, USA	task-large-scale-sound-event-detection-results#Salamon2017	44.9	0.7700	45.9
	Salamon_NYU_task4_4	Justin Salamon	Music and Audio Research Laboratory, New York University, New York City, USA; Center of Urban Science and Progress, New York University, New York City, USA	task-large-scale-sound-event-detection-results#Salamon2017	38.1	0.7700	45.9
	Toan_NCU_task4_1	Toan Vu	Department of Computer Science and Information Engineering, National Central University, Taiwan	task-large-scale-sound-event-detection-results#Vu2017	48.5
	Toan_NCU_task4_2	Toan Vu	Department of Computer Science and Information Engineering, National Central University, Taiwan	task-large-scale-sound-event-detection-results#Vu2017	46.5	0.9400	43.0
	Toan_NCU_task4_3	Toan Vu	Department of Computer Science and Information Engineering, National Central University, Taiwan	task-large-scale-sound-event-detection-results#Vu2017		0.9000	42.7
	Toan_NCU_task4_4	Toan Vu	Department of Computer Science and Information Engineering, National Central University, Taiwan	task-large-scale-sound-event-detection-results#Vu2017		0.8700	41.6
	Tseng_Bosch_task4_1	Juncheng Billy Li	Research and Technology Center, Robert Bosch LLC, Pittsburgh, USA	task-large-scale-sound-event-detection-results#Tseng2017	35.0
	Tseng_Bosch_task4_2	Juncheng Billy Li	Research and Technology Center, Robert Bosch LLC, Pittsburgh, USA	task-large-scale-sound-event-detection-results#Tseng2017	35.1
	Tseng_Bosch_task4_3	Juncheng Billy Li	Research and Technology Center, Robert Bosch LLC, Pittsburgh, USA	task-large-scale-sound-event-detection-results#Tseng2017	35.2
	Tseng_Bosch_task4_4	Juncheng Billy Li	Research and Technology Center, Robert Bosch LLC, Pittsburgh, USA	task-large-scale-sound-event-detection-results#Tseng2017	35.2
	Xu_CVSSP_task4_1	Yong Xu	Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	task-large-scale-sound-event-detection-results#Xu2017	54.4	0.7300	51.8
	Xu_CVSSP_task4_2	Yong Xu	Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	task-large-scale-sound-event-detection-results#Xu2017	55.6	0.7800	47.5
	Xu_CVSSP_task4_3	Yong Xu	Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	task-large-scale-sound-event-detection-results#Xu2017	54.2	1.0100	52.1
	Xu_CVSSP_task4_4	Yong Xu	Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK	task-large-scale-sound-event-detection-results#Xu2017	52.8	0.8000	50.4

Complete results and technical reports can be found here.

Baseline system

The baseline system for the task is provided. It implements a basic approach for sound event detection and gives participants a comparison point while developing their systems. The baseline systems for all tasks share the same code base and implement a quite similar approach for all tasks. The baseline system will download the needed datasets and produce the results below when run with the default parameters.

The baseline system is based on a multilayer perceptron architecture using log mel-band energies as features. A 5-frame context is used, resulting in a feature vector length of 200. Using these features, a neural network containing two dense layers of 50 hidden units per layer and 20% dropout is trained for 200 epochs for each class. The detection decision is based on the network output layer containing sigmoid units that can be active at the same time. A detailed description is available in the baseline system documentation. The baseline system includes evaluation of results using F-score for Subtask A, and segment-based error rate and segment-based F-score for Subtask B.

The baseline system is implemented using Python (versions 2.7 and 3.6). Participants are allowed to build their system on top of the given baseline system. The system has all the functionality needed for dataset handling, storing and accessing features and models, and evaluating the results, which makes adapting it to one's needs rather easy. The baseline system is also a good starting point for entry-level researchers.

Python implementation

Large-scale weakly supervised sound event detection for smart cars, Development dataset

Results for development dataset

Evaluation setup

System is trained using the Training set and tested using the Testing set (488 clips).
Python 2.7.13 used

System parameters

Frame size: 40 ms (with 50% hop size)
Feature vector: 40 log mel-band energies in 5 consecutive frames = 200 values
MLP: 2 layers x 50 hidden units, 20% dropout, 200 epochs, learning rate 0.001, sigmoid output layer
Trained and tested on full audio
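The feature parameters listed above imply the arithmetic sketched below: a 40 ms frame with 50% hop gives a 20 ms hop, and stacking 5 consecutive frames of 40 log mel-band energies gives 200 values per input vector. The function `stack_context` is a hypothetical helper, and the handling of clip edges (no padding here) and the alignment of the context window are assumptions; the baseline documentation describes the actual behavior.

```python
def stack_context(frames, context=5):
    """Concatenate each run of `context` consecutive frames into one vector.

    frames: list of per-frame feature lists (e.g. 40 log mel-band energies).
    """
    stacked = []
    for start in range(len(frames) - context + 1):
        window = frames[start:start + context]
        stacked.append([value for frame in window for value in frame])
    return stacked


frame_ms, hop_ms = 40, 20                       # 40 ms frames, 50% hop
n_frames = 1 + (10_000 - frame_ms) // hop_ms    # frames in a 10-second clip
# Placeholder features: frame t holds 40 copies of the value t.
frames = [[float(t)] * 40 for t in range(n_frames)]
vectors = stack_context(frames)
```

With these settings a full 10-second clip yields 499 frames and, without padding, 495 stacked vectors of length 200.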

Subtask A, Audio tagging (Ranked based on Micro-averaging or Instance-Based)

F-score	10.9 %
Precision	7.8 %
Recall	17.5 %

Subtask A, Audio tagging (Macro-averaging or Class-Based)

F-score	13.1 %
Precision	12.2 %
Recall	14.1 %
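The difference between the two averaging modes reported for Subtask A can be sketched as follows. Micro-averaging (instance-based) pools true positives, false positives and false negatives over all classes before computing the F-score, whereas macro-averaging (class-based) computes one F-score per class and averages them. The function names are hypothetical, and taking the macro F-score as the plain mean of per-class F-scores is an assumption of this sketch; the official evaluation script may differ in detail.

```python
def f_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tagging_f_scores(reference, estimated, labels):
    """reference/estimated: dict clip name -> set of labels tagged in it."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per class
    for clip, ref in reference.items():
        est = estimated.get(clip, set())
        for label in labels:
            if label in ref and label in est:
                counts[label][0] += 1
            elif label in est:
                counts[label][1] += 1
            elif label in ref:
                counts[label][2] += 1
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f_score(*pooled)
    macro = sum(f_score(*c) for c in counts.values()) / len(labels)
    return micro, macro


micro, macro = tagging_f_scores(
    {"c1": {"Car"}, "c2": {"Car"}, "c3": {"Car", "Bus"}},
    {"c1": {"Car"}, "c2": {"Car"}, "c3": {"Car"}},
    ["Car", "Bus"],
)
```

In this toy case the frequent class is tagged perfectly and the rare class is missed entirely, so the micro-averaged score stays high while the macro-averaged score drops to 0.5.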

Subtask B, Sound event detection (Ranked based on Micro-averaging or Instance-Based)

Segment-based overall metrics
ER	1.02
F-score	13.8 %
Precision	19.2 %
Recall	10.9 %

Contributors

Rohan Badlani

Birla Institute of Technology & Science

Ankit Shah

Carnegie Mellon University

Citation

If you are using the dataset or baseline code, or want to refer to the challenge task, please cite the following paper:

Publication

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 85–92. November 2017.

DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System

Abstract

DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.

Keywords

Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events

When citing the challenge task and results, please cite the following paper:

Publication

A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen. Sound event detection in the DCASE 2017 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019. In press. doi:10.1109/TASLP.2019.2907016.

Sound event detection in the DCASE 2017 Challenge

Abstract

Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3 having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks, and the still widely used mel frequency based representations, with only few approaches standing out as radically different. Analysis of the systems behavior reveals that task-specific optimization has a big role in producing good performance; however, often this optimization closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant difference in performance between the top systems and the baseline for all tasks.

Keywords

Event detection;Task analysis;Training;Acoustics;Speech processing;Glass;Hidden Markov models;Sound event detection;weak labels;pattern recognition;jackknife estimates;confidence intervals

Coordinators

Benjamin Elizalde

Carnegie Mellon University

Bhiksha Raj

Carnegie Mellon University

Emmanuel Vincent

INRIA
