Task description
A detailed task description is available on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 3: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.
A detailed description of the metrics used can be found here.
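As a quick reference, the segment-based ER and F1 reported in the tables below can be reproduced from binary activity matrices roughly as follows. This is only a sketch of the standard definitions (per-segment substitutions, deletions, and insertions); the official metric implementation should be treated as authoritative:

```python
import numpy as np

def segment_based_metrics(ref, sys):
    """Segment-based ER and F1 from binary activity matrices of shape
    (n_segments, n_classes); ref[s, c] = 1 iff class c is annotated as
    active in segment s, sys[s, c] = 1 iff the system marks it active."""
    ref = np.asarray(ref, dtype=bool)
    sys = np.asarray(sys, dtype=bool)
    tp = int(np.sum(ref & sys))
    fp_seg = np.sum(~ref & sys, axis=1)        # false positives per segment
    fn_seg = np.sum(ref & ~sys, axis=1)        # false negatives per segment
    # Error-rate decomposition: within each segment, an FP/FN pair counts
    # as one substitution; the leftovers are insertions or deletions.
    subs = np.sum(np.minimum(fp_seg, fn_seg))
    dels = np.sum(np.maximum(0, fn_seg - fp_seg))
    ins = np.sum(np.maximum(0, fp_seg - fn_seg))
    er = (subs + dels + ins) / np.sum(ref)
    f1 = 2 * tp / (2 * tp + fp_seg.sum() + fn_seg.sum())
    return er, f1
```

For the rankings below, these scores are accumulated over all segments of the evaluation recordings.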
Systems ranking
The first metric pair is segment-based on the evaluation dataset, the second is event-based (onset-only) on the evaluation dataset, and the third is segment-based on the development dataset. F1 scores are given in percent; empty cells mean the result was not reported by the authors.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | ER (event / eval) | F1 (event / eval) | ER (segment / dev) | F1 (segment / dev) |
|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | 5.1248 | 4.8 | 0.9100 | 31.0 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 0.8887 | 37.9 | 7.5286 | 4.7 | 0.8500 | 34.3 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | 1.7303 | 6.3 | 0.9100 | 23.7 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.0730 | 22.5 | 3.3496 | 4.2 | | |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 1.1056 | 20.8 | 3.1804 | 2.9 | 0.8100 | 34.8 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 0.9635 | 33.3 | 2.0445 | 4.2 | | |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | 1.8700 | 3.6 | 0.7600 | 38.5 |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | 1.8483 | 2.9 | 0.8400 | 38.1 |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | 2.8819 | 7.3 | | 38.1 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | 3.1469 | 3.4 | | |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | 2.4283 | 8.1 | | |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | 1.2886 | 1.8 | 1.2450 | 18.1 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | 1.0634 | 1.5 | 0.8304 | 31.6 |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | 12.0766 | 3.7 | | |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | 2.9518 | 6.7 | | |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | 2.0949 | 6.3 | 0.8150 | 49.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | 3.0879 | 6.0 | 0.7300 | 47.6 |
Teams ranking
Table including only the best-performing system from each submitting team.
The first metric pair is segment-based on the evaluation dataset, the second is event-based (onset-only) on the evaluation dataset, and the third is segment-based on the development dataset. F1 scores are given in percent; empty cells mean the result was not reported by the authors.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | ER (event / eval) | F1 (event / eval) | ER (segment / dev) | F1 (segment / dev) |
|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | 5.1248 | 4.8 | 0.9100 | 31.0 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | 1.7303 | 6.3 | 0.9100 | 23.7 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | 1.8700 | 3.6 | 0.7600 | 38.5 |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | 1.8483 | 2.9 | 0.8400 | 38.1 |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | 2.8819 | 7.3 | | 38.1 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | 3.1469 | 3.4 | | |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | 2.4283 | 8.1 | | |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | 1.2886 | 1.8 | 1.2450 | 18.1 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | 1.0634 | 1.5 | 0.8304 | 31.6 |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | 12.0766 | 3.7 | | |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | 2.9518 | 6.7 | | |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | 2.0949 | 6.3 | 0.8150 | 49.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | 3.0879 | 6.0 | 0.7300 | 47.6 |
Class-wise performance
Home
All values are segment-based metrics on the evaluation dataset; the first ER/F1 pair is the class-based average.

| Technical Report | Code | Name | ER (class avg) | F1 (class avg) | ER: Cupboard | F1: Cupboard | ER: Cutlery | F1: Cutlery | ER: Dishes | F1: Dishes | ER: Drawer | F1: Drawer | ER: Glass jingling | F1: Glass jingling | ER: Object impact | F1: Object impact | ER: Object rustling | F1: Object rustling | ER: Object snapping | F1: Object snapping | ER: People walking | F1: People walking | ER: Washing dishes | F1: Washing dishes | ER: Water tap running | F1: Water tap running |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.9887 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.2785 | 0.2 | 0.5973 | 0.8 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 1.0682 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 2.0127 | 0.2 | 0.7372 | 0.7 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.9783 | 0.2 | 1.0385 | 0.0 | 1.0571 | 0.0 | 1.0744 | 0.2 | 0.9811 | 0.1 | 1.0000 | 0.0 | 1.1574 | 0.1 | 0.6786 | 0.6 | 1.0000 | 0.0 | 1.0833 | 0.2 | 1.0190 | 0.0 | 0.6724 | 0.6 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.9262 | 0.1 | 2.1538 | 0.0 | 1.9714 | 0.0 | 1.7851 | 0.1 | 2.5094 | 0.1 | 2.0667 | 0.1 | 1.6294 | 0.2 | 1.5714 | 0.0 | 3.0476 | 0.1 | 1.9479 | 0.1 | 1.4747 | 0.2 | 1.0307 | 0.3 |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 4.2003 | 0.0 | 1.0385 | 0.0 | 6.7143 | 0.0 | 1.3636 | 0.0 | 1.0000 | 0.0 | 27.1333 | 0.0 | 1.9949 | 0.3 | 1.0000 | 0.0 | 2.9048 | 0.0 | 1.0000 | 0.0 | 1.0063 | 0.0 | 1.0478 | 0.0 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 1.5296 | 0.1 | 1.0000 | 0.0 | 1.5429 | 0.0 | 1.3802 | 0.1 | 1.4528 | 0.0 | 1.0000 | 0.0 | 2.2741 | 0.3 | 1.3393 | 0.0 | 3.0000 | 0.0 | 1.5208 | 0.1 | 1.3671 | 0.2 | 0.9488 | 0.4 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 1.5768 | 0.1 | 1.0385 | 0.0 | 1.6857 | 0.0 | 1.4050 | 0.1 | 1.8113 | 0.0 | 1.0000 | 0.0 | 2.1269 | 0.3 | 1.5893 | 0.0 | 2.6667 | 0.0 | 1.5312 | 0.1 | 1.4494 | 0.2 | 1.0410 | 0.4 |
| Gorin2016 | Gorin_task3_1 | act | 1.0834 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1653 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0863 | 0.1 | 0.8929 | 0.6 | 1.0000 | 0.0 | 1.0104 | 0.1 | 1.8101 | 0.4 | 0.9522 | 0.5 |
| Kong2016 | Kong_task3_1 | QK | 1.1803 | 0.2 | 1.0385 | 0.0 | 1.2857 | 0.0 | 1.2479 | 0.1 | 1.0755 | 0.0 | 0.9333 | 0.3 | 1.4569 | 0.2 | 1.4821 | 0.2 | 1.2381 | 0.0 | 0.9792 | 0.1 | 1.3481 | 0.1 | 0.8976 | 0.5 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.6394 | 0.1 | 1.6538 | 0.0 | 1.9429 | 0.1 | 1.5950 | 0.1 | 1.3396 | 0.0 | 2.1333 | 0.1 | 1.3147 | 0.2 | 1.9821 | 0.1 | 2.1429 | 0.0 | 1.2500 | 0.0 | 1.5190 | 0.1 | 1.1604 | 0.1 |
| Lai2016 | Liu_task3_1 | BW#3 | 1.2249 | 0.2 | 1.1538 | 0.1 | 1.2286 | 0.0 | 1.2810 | 0.1 | 1.0377 | 0.0 | 1.0667 | 0.0 | 1.4822 | 0.3 | 1.0357 | 0.4 | 2.3333 | 0.0 | 1.0417 | 0.0 | 1.1139 | 0.1 | 0.6997 | 0.7 |
| Dai2016 | Pham_task3_1 | | 1.0055 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 0.9848 | 0.0 | 1.0179 | 0.0 | 1.0000 | 0.0 | 1.0208 | 0.0 | 1.0000 | 0.0 | 1.0375 | 0.2 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 1.0449 | 0.0 | 1.0769 | 0.1 | 1.3143 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1333 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 0.9693 | 0.3 |
| Schroeder2016 | Schroeder_task3_1 | | 2.2534 | 0.1 | 1.0000 | 0.0 | 1.2571 | 0.1 | 7.5455 | 0.2 | 1.0000 | 0.0 | 3.2000 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 5.7848 | 0.3 | 1.0000 | 0.0 |
| Ubskii2016 | Ubskii_task3_1 | | 1.4109 | 0.2 | 1.4231 | 0.1 | 1.0571 | 0.0 | 2.1818 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.8223 | 0.3 | 2.0000 | 0.4 | 1.0000 | 0.0 | 1.7500 | 0.2 | 1.3608 | 0.2 | 0.9249 | 0.7 |
| Vu2016 | Vu_task3_1 | | 1.3479 | 0.1 | 1.0000 | 0.0 | 1.6286 | 0.0 | 1.1322 | 0.1 | 1.4340 | 0.0 | 1.0000 | 0.0 | 1.5939 | 0.2 | 2.5536 | 0.0 | 1.0000 | 0.0 | 1.8542 | 0.2 | 1.0570 | 0.0 | 0.5734 | 0.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 1.0797 | 0.1 | 1.1154 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0189 | 0.0 | 1.0000 | 0.0 | 1.5025 | 0.4 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.7658 | 0.3 | 0.4744 | 0.8 |
Residential Area
All values are segment-based metrics on the evaluation dataset; the first ER/F1 pair is the class-based average.

| Technical Report | Code | Name | ER (class avg) | F1 (class avg) | ER: Bird singing | F1: Bird singing | ER: Car passing by | F1: Car passing by | ER: Children shouting | F1: Children shouting | ER: Object banging | F1: Object banging | ER: People speaking | F1: People speaking | ER: People walking | F1: People walking | ER: Wind blowing | F1: Wind blowing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 1.0159 | 0.2 | 1.1332 | 0.6 | 0.5634 | 0.8 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1228 | 0.0 | 1.0000 | 0.0 | 1.2917 | 0.3 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 1.1661 | 0.1 | 1.5884 | 0.6 | 1.0188 | 0.0 | 1.2000 | 0.1 | 1.2727 | 0.0 | 1.0000 | 0.0 | 1.0205 | 0.0 | 1.0625 | 0.0 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 1.3188 | 0.2 | 0.9637 | 0.3 | 0.4836 | 0.7 | 1.1333 | 0.0 | 1.0000 | 0.0 | 2.6667 | 0.0 | 1.1096 | 0.1 | 1.8750 | 0.2 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 2.7125 | 0.1 | 1.2034 | 0.4 | 0.9531 | 0.4 | 5.4667 | 0.0 | 5.2727 | 0.0 | 2.3860 | 0.0 | 1.6849 | 0.1 | 2.0208 | 0.1 |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 2.0883 | 0.2 | 1.2107 | 0.5 | 0.8357 | 0.4 | 2.8667 | 0.0 | 3.6364 | 0.0 | 2.6667 | 0.1 | 1.5479 | 0.1 | 1.8542 | 0.1 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 1.3496 | 0.2 | 1.3341 | 0.5 | 0.8075 | 0.5 | 1.1333 | 0.0 | 1.5455 | 0.0 | 2.5439 | 0.0 | 1.0411 | 0.0 | 1.0417 | 0.2 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 1.2472 | 0.2 | 1.2857 | 0.6 | 0.7653 | 0.5 | 1.0667 | 0.0 | 1.3636 | 0.0 | 2.3333 | 0.0 | 1.0411 | 0.0 | 0.8750 | 0.3 |
| Gorin2016 | Gorin_task3_1 | act | 1.3456 | 0.3 | 1.0944 | 0.6 | 1.1502 | 0.6 | 1.0000 | 0.0 | 1.0000 | 0.0 | 3.1754 | 0.1 | 1.0822 | 0.3 | 0.9167 | 0.4 |
| Kong2016 | Kong_task3_1 | QK | 1.1055 | 0.2 | 1.2131 | 0.5 | 0.7042 | 0.7 | 1.0667 | 0.0 | 1.1818 | 0.0 | 1.4211 | 0.0 | 1.0685 | 0.0 | 1.0833 | 0.0 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.6154 | 0.1 | 1.1695 | 0.3 | 1.4789 | 0.2 | 2.2000 | 0.1 | 1.0909 | 0.0 | 2.3158 | 0.1 | 1.2192 | 0.1 | 1.8333 | 0.0 |
| Lai2016 | Liu_task3_1 | BW#3 | 1.7348 | 0.1 | 1.0266 | 0.4 | 0.7324 | 0.5 | 1.5333 | 0.0 | 2.0909 | 0.0 | 2.4561 | 0.0 | 1.1164 | 0.1 | 3.1875 | 0.0 |
| Dai2016 | Pham_task3_1 | | 0.9808 | 0.1 | 1.0024 | 0.1 | 0.8216 | 0.4 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0417 | 0.0 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 1.0576 | 0.1 | 1.4673 | 0.4 | 0.8263 | 0.5 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1096 | 0.0 | 1.0000 | 0.0 |
| Schroeder2016 | Schroeder_task3_1 | | 1.0164 | 0.2 | 0.9952 | 0.5 | 0.6995 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.3860 | 0.0 | 1.0342 | 0.0 | 1.0000 | 0.1 |
| Ubskii2016 | Ubskii_task3_1 | | 1.0218 | 0.2 | 1.0508 | 0.4 | 0.5164 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.5439 | 0.0 | 1.0000 | 0.0 | 1.0417 | 0.0 |
| Vu2016 | Vu_task3_1 | | 1.1772 | 0.2 | 1.2567 | 0.5 | 0.6854 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.5789 | 0.0 | 1.2192 | 0.0 | 1.5000 | 0.2 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9892 | 0.2 | 1.2131 | 0.4 | 0.6761 | 0.6 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0351 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 |
System characteristics
ER and F1 are segment-based metrics on the evaluation dataset; empty cells mean the attribute was not reported.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | Input | Features | Classifier |
|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | binaural | mel energy | RNN |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 0.8887 | 37.9 | binaural | mel energy + TDOA | RNN |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | monophonic | MFCC | GMM |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.0730 | 22.5 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 1.1056 | 20.8 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 0.9635 | 33.3 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | monophonic | MFCC | Random forests |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | monophonic | mel energy | CNN |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | monophonic | MFCC | DNN |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | | | Random |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | monophonic | MFCC | Fusion |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | monophonic | MFCC | DNN |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | monophonic | GCC | Random forests |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | monophonic | GFB | GMM-HMM |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | monophonic | MFCC | Fusion |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | monophonic | mel energy | RNN |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | monophonic | spectrogram | GRNN |
Technical reports
Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features
Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola and Tuomas Virtanen
Department of Signal Processing, Tampere University of Technology, Tampere, Finland
Adavanne_task3_1 Adavanne_task3_2
Abstract
In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real-life sound recordings typically contain many overlapping sound events, which are hard to recognize from single-channel audio alone. Human listeners successfully recognize mixtures of overlapping sound events by using pitch cues and by exploiting the stereo (multichannel) signal available at their ears to spatially localize these events. Traditionally, SED systems have used only single-channel audio; motivated by human listening, we propose to extend them to multichannel audio. The proposed SED system is compared against a state-of-the-art single-channel method on the development subset of the TUT sound events detection 2016 database [1]. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | mel energy; mel energy + TDOA |
Classifier | RNN |
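The TDOA feature used in Adavanne_task3_2 is commonly estimated with the GCC-PHAT method; a minimal numpy sketch (the authors' exact front end may differ) could look like this:

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs):
    """Estimate the time difference of arrival (in seconds) between two
    channels with GCC-PHAT: the cross-power spectrum is whitened to unit
    magnitude, so the inverse FFT peaks at the dominant inter-channel lag."""
    n = len(x) + len(y)                      # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                   # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift  # negative lag: x leads y
    return lag / fs
```

Computed per frame and frequency band, such lags can be appended to the mel energies as spatial features.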
Sound Event Detection for Real Life Audio DCASE Challenge
Wei Dai1, Juncheng Li2, Phuong Pham3, Samarjit Das2 and Shuhui Qu4
1Carnegie Mellon University, Pittsburgh, USA, 2Robert Bosch Research and Technology Center, USA, 3University of Pittsburgh, Pittsburgh, USA, 4Stanford University, Stanford, USA
Pham_task3_1
Abstract
We explore a logistic regression classifier (LogReg) and a deep neural network (DNN) on Task 3 of the DCASE 2016 Challenge, i.e., sound event detection in real-life audio. Our models use Mel-frequency cepstral coefficients (MFCCs) and their deltas and accelerations as detection features. The error rate metric favors the simple logistic regression model with a high activation threshold in both segment- and event-based contexts. On the other hand, the DNN model outperforms the baseline in the frame-based context.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
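The delta and acceleration features mentioned in the abstract are standard regression-based derivatives of the MFCC trajectories. A minimal sketch of that step (the MFCCs themselves would come from an audio front end; the context width here is an illustrative choice):

```python
import numpy as np

def deltas(feat, width=2):
    """First-order delta features over a (n_frames, n_coeffs) feature
    matrix, using the standard regression formula with `width` frames of
    context on each side and edge padding at the boundaries."""
    taps = np.arange(1, width + 1)
    denom = 2 * np.sum(taps ** 2)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    n = len(feat)
    for t in taps:
        out += t * (padded[width + t:width + t + n] -
                    padded[width - t:width - t + n])
    return out / denom
```

Accelerations are the same operator applied twice, and the final feature matrix stacks all three blocks, e.g. `np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])`.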
Experimentation on The DCASE Challenge 2016: Task 1 - Acoustic Scene Classification and Task 3 - Sound Event Detection in Real Life Audio
Benjamin Elizalde1, Anurag Kumar1, Ankit Shah2, Rohan Badlani3, Emmanuel Vincent4, Bhiksha Raj1 and Ian Lane1
1Carnegie Mellon University, Pittsburgh, USA, 2NIT Surathkal, India, 3BITS, Pilani, India, 4Inria, France
Elizalde_task3_1 Elizalde_task3_2 Elizalde_task3_3 Elizalde_task3_4
Abstract
Audio carries substantial information about the contents of our environment. In a recording, sound events can occur in isolation, such as a car passing by or footsteps, or as a collection of sound events, often collectively referred to as a scene, such as a busy street or a park. The 2016 DCASE challenge aims to foster standardized development in both of these areas. In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization, and other heuristics specific to each task. Our performance on both tasks improved on the baseline published by DCASE. For Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Random forests |
DCASE 2016 Sound Event Detection System Based on Convolutional Neural Network
Abstract
This report describes a sound event detection system submitted to the DCASE 2016 challenge. In this work, a convolutional neural network is used for detecting and classifying polyphonic events in a long temporal context of filter bank acoustic features. Given the small amount of training resources, data augmentation was explored. The system achieves an average 7.7% relative error rate improvement, but is still unable to detect short events with limited training data.
System characteristics
Input | monophonic |
Sampling rate | 16kHz |
Features | mel energy |
Classifier | CNN |
DCASE2016 Baseline System
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | GMM |
Deep Neural Network Baseline for DCASE Challenge 2016
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, deep neural networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel filter bank features are used. Two kinds of Mel banks, the same-area bank and the same-height bank, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel-frequency cepstral coefficients (MFCC) + Gaussian mixture model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using the constant-Q transform (CQT) + non-negative matrix factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an equal error rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
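The "same-area" versus "same-height" Mel banks compared in the abstract differ only in how each triangular filter is normalized. A rough numpy sketch (the parameter values here are illustrative, not taken from the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=1024, sr=44100, same_height=True):
    """Triangular Mel filters on FFT bins. same_height=True gives every
    filter a unit peak; False rescales each filter to unit area instead
    (the two variants the abstract compares)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, center, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for k in range(lo, center):          # rising edge of the triangle
            fb[i, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):          # falling edge
            fb[i, k] = (hi - k) / max(hi - center, 1)
        if not same_height:
            area = fb[i].sum()
            if area > 0:
                fb[i] /= area                # same-area normalization
    return fb
```

The log energies `np.log(fb @ power_spectrum + eps)` would then be the 40-dimensional Mel features fed to the DNN.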
Random System Performance in Task 3
Christian Kroos and Mark Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, United Kingdom
Kroos_task3_1
Abstract
In this report we briefly describe the creation of a random, data-blind system to provide a random baseline for Task 3 of the DCASE 2016 challenge. Particular attention is paid to the results for two sound events occurring in the residential area scene, one very rare, the other very frequent.
System characteristics
Classifier | Random |
DCASE Report for Task 3: Sound Event Detection in Real Life Audio
Ying-Hui Lai1,2, Chun-Hao Wang3, Shi-Yan Hou3, Bang-Yin Chen3, Yu Tsao1 and Yi-Wen Liu3
1Research Center for Information Technology, Academia Sinica, Taipei, Taiwan, 2Department of Electrical Engineering, Yuan Ze University, Taoyuan City, Taiwan, 3National Tsing Hua University, Hsinchu City, Taiwan
Liu_task3_1
Abstract
Our team has built an acoustic event classifier using only short-time features. Signals were first de-noised by a log minimum mean square error (logMMSE) procedure. Then, Mel-frequency cepstral coefficients (MFCCs) extracted from the de-noised signal every 20 ms were used to train two classifiers based on support vector machines (SVMs) and neural networks (NNs), respectively. Optimal parameters for the classifiers were exhaustively searched to maximize the frame-wise recognition accuracy in cross-validation. Frame-wise recognition rates of 93.0% and 91.8% were thus obtained from the SVM and NN, respectively, for the home events (and 86.2% and 85.7%, respectively, for the residential events). To process the evaluation data, the same signal processing procedures were applied so that both classifiers produce a classification result for every frame. Whenever the SVM and NN give different answers, we resort to the confusion matrices obtained during the supervised learning phase so that a final answer can be produced based on a maximum a posteriori (MAP) principle. Finally, a heuristic smoothing procedure was applied to the jointly decided recognition results so that the event onsets and offsets could be determined.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Fusion |
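The MAP-style arbitration between the SVM and NN described in the abstract can be sketched from the training-time confusion matrices. This is one plausible reading of the rule, assuming the two classifiers are conditionally independent given the true class; the authors' exact procedure may differ:

```python
import numpy as np

def fuse_predictions(p_svm, p_nn, conf_svm, conf_nn):
    """Resolve a frame where two classifiers disagree, using their
    training-time confusion matrices; conf[i, j] counts frames of true
    class i predicted as class j."""
    if p_svm == p_nn:
        return p_svm
    # P(prediction | true class), estimated from the confusion-matrix rows
    lik_svm = conf_svm[:, p_svm] / conf_svm.sum(axis=1)
    lik_nn = conf_nn[:, p_nn] / conf_nn.sum(axis=1)
    prior = conf_svm.sum(axis=1) / conf_svm.sum()    # class prior
    # MAP estimate of the true class given both (disagreeing) predictions
    return int(np.argmax(prior * lik_svm * lik_nn))
```

A final smoothing pass over the fused frame labels would then yield event onsets and offsets, as the abstract describes.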
Car-Forest: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection
Huy Phan1,2, Lars Hertel1, Marco Maass1, Philipp Koch1 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Phan_task3_1
Abstract
This report describes our submissions to Task 2 and Task 3 of the DCASE 2016 challenge [1]. The systems aim at detecting overlapping audio events in continuous streams, where the detectors are based on random decision forests. The proposed forests are jointly trained for classification and regression. Initially, the training is classification-oriented, to encourage the trees to select discriminative features that separate positive audio segments from negative ones in overlapping mixtures. The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and thereby model the temporal structure of audio events. One random decision forest is trained specifically for each event category of interest. Experimental results on the development data show that our systems outperform the DCASE 2016 challenge baselines with absolute gains of 64.4% and 8.0% on Task 2 and Task 3, respectively.
System characteristics
Input | monophonic |
Sampling rate | 16kHz |
Features | GCC |
Classifier | Random forests |
Performance Comparison of GMM, HMM and DNN Based Approaches for Acoustic Event Detection Within Task 3 of The DCASE 2016 Challenge
Jens Schröder1,2, Jörn Anemüller2,3 and Stefan Goetze1,2
1Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg, Germany, 2Cluster of Excellence, Hearing4all, Germany, 3Department of Medical Physics and Acoustics, University of Oldenburg, Oldenburg, Germany
Schroeder_task3_1
Abstract
This contribution reports on the performance of systems for polyphonic acoustic event detection (AED) compared within the framework of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE'16) challenge. State-of-the-art Gaussian mixture model (GMM) and GMM-hidden Markov model (GMM-HMM) approaches are applied using Mel-frequency cepstral coefficient (MFCC) and Gabor filterbank (GFB) features, along with a non-negative matrix factorization (NMF) based system. Furthermore, tandem and hybrid deep neural network (DNN)-HMM systems are adopted. All HMM systems, which are usually of the multiclass type, i.e., systems that output just one label per time segment from a set of possible classes, are extended to binary classification systems composed of single binary classifiers discriminating between target and non-target classes, and are thus capable of multi-labeling. These systems are evaluated on the residential area data of Task 3 of the DCASE'16 challenge. It is shown that the DNN-based system performs worse than the traditional systems on this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM approach.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | GFB |
Classifier | GMM-HMM |
Sound Event Detection in Real-Life Audio
Dmitrii Ubskii and Alexei Pugachev
Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia
Ubskii_task3_1
Abstract
In this paper, an acoustic event detection system is proposed. The system fuses several classifiers (GMM, DNN, LSTM) using another classifier (DNN) in an attempt to achieve better results. The proposed system yields an F1 score of up to 21% for the indoor subset of the provided data and up to 44% for the outdoor subset.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Fusion |
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task3_1
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using recurrent neural networks (RNNs). Experiments show that our models achieve superior performance compared with the baselines.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | RNN |
Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection
Matthias Zöhrer and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria
Zoehrer_task3_1
Abstract
We present two resource-efficient frameworks for acoustic scene classification and acoustic event detection. In particular, we combine gated recurrent neural networks (GRNNs) and linear discriminant analysis (LDA) for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on the DCASE 2016 Task 1 development data, a relative improvement of 8.34% over the baseline GMM system. By applying GRNNs to the DCASE2016 real event detection data using an MSE objective, we obtain a segment-based error rate (ER) of 0.73, a relative improvement of 19.8% over the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis. In particular, we evaluate the effects of a hybrid, i.e., generative-discriminative, objective function.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | GRNN |
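For readers unfamiliar with gated recurrent networks, the core recurrence of a GRU-style cell can be sketched as follows; this is the generic GRU update, not necessarily the exact GRNN variant used in the submission:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One time step of a gated recurrent unit: the update gate z blends
    the previous state h with a candidate state, and the reset gate r
    controls how much of h feeds the candidate."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])      # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])      # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand
```

Run over a sequence of spectrogram frames, the final (or per-frame) state h feeds the event-activity output layer.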