Sound event detection in real life audio



Task description

A detailed task description can be found on the task description page.

Challenge results

Here you can find complete information on the submissions for Task 3: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.

A detailed description of the metrics used can be found here.
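The segment-based error rate (ER) and F1 scores reported below follow the standard polyphonic sound event detection definitions, with intermediate statistics pooled over fixed-length (one-second) segments. As a rough sketch of how these numbers are computed, not the official evaluation code (that is the sed_eval toolbox), the following assumes reference and system outputs are given as binary segment/class activity matrices:

```python
import numpy as np

def segment_metrics(ref, sys):
    """Segment-based ER and F1 for polyphonic SED.

    ref, sys: binary arrays of shape (n_segments, n_classes) marking
    which classes are active in each (e.g. one-second) segment.
    """
    ref = np.asarray(ref, bool)
    sys = np.asarray(sys, bool)
    tp = (ref & sys).sum(axis=1)      # correct detections per segment
    fn = (ref & ~sys).sum(axis=1)     # missed events per segment
    fp = (~ref & sys).sum(axis=1)     # false alarms per segment
    # Per-segment substitutions, deletions and insertions
    s = np.minimum(fn, fp)
    d = np.maximum(0, fn - fp)
    i = np.maximum(0, fp - fn)
    er = (s.sum() + d.sum() + i.sum()) / ref.sum()
    f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return er, f1
```

Note that the tables below report F1 as a percentage, whereas this sketch returns it as a fraction.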

Systems ranking

Columns: technical report | submission code | system name | segment-based ER / F1 (overall, evaluation dataset) | event-based ER / F1 (overall, onset-only, evaluation dataset) | segment-based ER / F1 (overall, development dataset). Blank cells are results not reported by the authors.
Adavanne2016 Adavanne_task3_1 adavanne_IID 0.8051 47.8 5.1248 4.8 0.9100 31.0
Adavanne2016 Adavanne_task3_2 adavanne_IITD 0.8887 37.9 7.5286 4.7 0.8500 34.3
Heittola2016 DCASE2016 baseline DCASE2016_baseline 0.8773 34.3 1.7303 6.3 0.9100 23.7
Elizalde2016 Elizalde_task3_1 CMU_G_v3 1.0730 22.5 3.3496 4.2
Elizalde2016 Elizalde_task3_2 CMU_G_v4 1.1056 20.8 3.1804 2.9 0.8100 34.8
Elizalde2016 Elizalde_task3_3 CMU_G+P_v3 0.9635 33.3 2.0445 4.2
Elizalde2016 Elizalde_task3_4 CMU_G+P_v4 0.9613 33.6 1.8700 3.6 0.7600 38.5
Gorin2016 Gorin_task3_1 act 0.9799 41.1 1.8483 2.9 0.8400 38.1
Kong2016 Kong_task3_1 QK 0.9557 36.3 2.8819 7.3 38.1
Kroos2016 Kroos_task3_1 RandB 1.1488 16.8 3.1469 3.4
Lai2016 Liu_task3_1 BW#3 0.9287 34.5 2.4283 8.1
Dai2016 Pham_task3_1 0.9583 11.6 1.2886 1.8 1.2450 18.1
Phan2016 Phan_task3_1 CaR-FOREST 0.9644 23.9 1.0634 1.5 0.8304 31.6
Schroeder2016 Schroeder_task3_1 1.3092 33.6 12.0766 3.7
Ubskii2016 Ubskii_task3_1 0.9971 39.6 2.9518 6.7
Vu2016 Vu_task3_1 0.9124 41.9 2.0949 6.3 0.8150 49.8
Zoehrer2016 Zoehrer_task3_1 0.9056 39.6 3.0879 6.0 0.7300 47.6

Teams ranking

The table includes only the best-performing system per submitting team.

Columns: technical report | submission code | system name | segment-based ER / F1 (overall, evaluation dataset) | event-based ER / F1 (overall, onset-only, evaluation dataset) | segment-based ER / F1 (overall, development dataset). Blank cells are results not reported by the authors.
Adavanne2016 Adavanne_task3_1 adavanne_IID 0.8051 47.8 5.1248 4.8 0.9100 31.0
Heittola2016 DCASE2016 baseline DCASE2016_baseline 0.8773 34.3 1.7303 6.3 0.9100 23.7
Elizalde2016 Elizalde_task3_4 CMU_G+P_v4 0.9613 33.6 1.8700 3.6 0.7600 38.5
Gorin2016 Gorin_task3_1 act 0.9799 41.1 1.8483 2.9 0.8400 38.1
Kong2016 Kong_task3_1 QK 0.9557 36.3 2.8819 7.3 38.1
Kroos2016 Kroos_task3_1 RandB 1.1488 16.8 3.1469 3.4
Lai2016 Liu_task3_1 BW#3 0.9287 34.5 2.4283 8.1
Dai2016 Pham_task3_1 0.9583 11.6 1.2886 1.8 1.2450 18.1
Phan2016 Phan_task3_1 CaR-FOREST 0.9644 23.9 1.0634 1.5 0.8304 31.6
Schroeder2016 Schroeder_task3_1 1.3092 33.6 12.0766 3.7
Ubskii2016 Ubskii_task3_1 0.9971 39.6 2.9518 6.7
Vu2016 Vu_task3_1 0.9124 41.9 2.0949 6.3 0.8150 49.8
Zoehrer2016 Zoehrer_task3_1 0.9056 39.6 3.0879 6.0 0.7300 47.6

Class-wise performance

Home

Columns: technical report | submission code | system name | segment-based ER / F1 (class-based average) | ER / F1 per class, in order: cupboard, cutlery, dishes, drawer, glass jingling, object impact, object rustling, object snapping, people walking, washing dishes, water tap running.
Adavanne2016 Adavanne_task3_1 adavanne_IID 0.9887 0.1 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.2785 0.2 0.5973 0.8
Adavanne2016 Adavanne_task3_2 adavanne_IITD 1.0682 0.1 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 2.0127 0.2 0.7372 0.7
Heittola2016 DCASE2016 baseline DCASE2016_baseline 0.9783 0.2 1.0385 0.0 1.0571 0.0 1.0744 0.2 0.9811 0.1 1.0000 0.0 1.1574 0.1 0.6786 0.6 1.0000 0.0 1.0833 0.2 1.0190 0.0 0.6724 0.6
Elizalde2016 Elizalde_task3_1 CMU_G_v3 1.9262 0.1 2.1538 0.0 1.9714 0.0 1.7851 0.1 2.5094 0.1 2.0667 0.1 1.6294 0.2 1.5714 0.0 3.0476 0.1 1.9479 0.1 1.4747 0.2 1.0307 0.3
Elizalde2016 Elizalde_task3_2 CMU_G_v4 4.2003 0.0 1.0385 0.0 6.7143 0.0 1.3636 0.0 1.0000 0.0 27.1333 0.0 1.9949 0.3 1.0000 0.0 2.9048 0.0 1.0000 0.0 1.0063 0.0 1.0478 0.0
Elizalde2016 Elizalde_task3_3 CMU_G+P_v3 1.5296 0.1 1.0000 0.0 1.5429 0.0 1.3802 0.1 1.4528 0.0 1.0000 0.0 2.2741 0.3 1.3393 0.0 3.0000 0.0 1.5208 0.1 1.3671 0.2 0.9488 0.4
Elizalde2016 Elizalde_task3_4 CMU_G+P_v4 1.5768 0.1 1.0385 0.0 1.6857 0.0 1.4050 0.1 1.8113 0.0 1.0000 0.0 2.1269 0.3 1.5893 0.0 2.6667 0.0 1.5312 0.1 1.4494 0.2 1.0410 0.4
Gorin2016 Gorin_task3_1 act 1.0834 0.2 1.0000 0.0 1.0000 0.0 1.1653 0.2 1.0000 0.0 1.0000 0.0 1.0863 0.1 0.8929 0.6 1.0000 0.0 1.0104 0.1 1.8101 0.4 0.9522 0.5
Kong2016 Kong_task3_1 QK 1.1803 0.2 1.0385 0.0 1.2857 0.0 1.2479 0.1 1.0755 0.0 0.9333 0.3 1.4569 0.2 1.4821 0.2 1.2381 0.0 0.9792 0.1 1.3481 0.1 0.8976 0.5
Kroos2016 Kroos_task3_1 RandB 1.6394 0.1 1.6538 0.0 1.9429 0.1 1.5950 0.1 1.3396 0.0 2.1333 0.1 1.3147 0.2 1.9821 0.1 2.1429 0.0 1.2500 0.0 1.5190 0.1 1.1604 0.1
Lai2016 Liu_task3_1 BW#3 1.2249 0.2 1.1538 0.1 1.2286 0.0 1.2810 0.1 1.0377 0.0 1.0667 0.0 1.4822 0.3 1.0357 0.4 2.3333 0.0 1.0417 0.0 1.1139 0.1 0.6997 0.7
Dai2016 Pham_task3_1 1.0055 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 0.9848 0.0 1.0179 0.0 1.0000 0.0 1.0208 0.0 1.0000 0.0 1.0375 0.2
Phan2016 Phan_task3_1 CaR-FOREST 1.0449 0.0 1.0769 0.1 1.3143 0.0 1.0000 0.0 1.0000 0.0 1.1333 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 0.9693 0.3
Schroeder2016 Schroeder_task3_1 2.2534 0.1 1.0000 0.0 1.2571 0.1 7.5455 0.2 1.0000 0.0 3.2000 0.1 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 5.7848 0.3 1.0000 0.0
Ubskii2016 Ubskii_task3_1 1.4109 0.2 1.4231 0.1 1.0571 0.0 2.1818 0.2 1.0000 0.0 1.0000 0.0 1.8223 0.3 2.0000 0.4 1.0000 0.0 1.7500 0.2 1.3608 0.2 0.9249 0.7
Vu2016 Vu_task3_1 1.3479 0.1 1.0000 0.0 1.6286 0.0 1.1322 0.1 1.4340 0.0 1.0000 0.0 1.5939 0.2 2.5536 0.0 1.0000 0.0 1.8542 0.2 1.0570 0.0 0.5734 0.8
Zoehrer2016 Zoehrer_task3_1 1.0797 0.1 1.1154 0.1 1.0000 0.0 1.0000 0.0 1.0189 0.0 1.0000 0.0 1.5025 0.4 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.7658 0.3 0.4744 0.8

Residential Area

Columns: technical report | submission code | system name | segment-based ER / F1 (class-based average) | ER / F1 per class, in order: bird singing, car passing by, children shouting, object banging, people speaking, people walking, wind blowing.
Adavanne2016 Adavanne_task3_1 adavanne_IID 1.0159 0.2 1.1332 0.6 0.5634 0.8 1.0000 0.0 1.0000 0.0 1.1228 0.0 1.0000 0.0 1.2917 0.3
Adavanne2016 Adavanne_task3_2 adavanne_IITD 1.1661 0.1 1.5884 0.6 1.0188 0.0 1.2000 0.1 1.2727 0.0 1.0000 0.0 1.0205 0.0 1.0625 0.0
Heittola2016 DCASE2016 baseline DCASE2016_baseline 1.3188 0.2 0.9637 0.3 0.4836 0.7 1.1333 0.0 1.0000 0.0 2.6667 0.0 1.1096 0.1 1.8750 0.2
Elizalde2016 Elizalde_task3_1 CMU_G_v3 2.7125 0.1 1.2034 0.4 0.9531 0.4 5.4667 0.0 5.2727 0.0 2.3860 0.0 1.6849 0.1 2.0208 0.1
Elizalde2016 Elizalde_task3_2 CMU_G_v4 2.0883 0.2 1.2107 0.5 0.8357 0.4 2.8667 0.0 3.6364 0.0 2.6667 0.1 1.5479 0.1 1.8542 0.1
Elizalde2016 Elizalde_task3_3 CMU_G+P_v3 1.3496 0.2 1.3341 0.5 0.8075 0.5 1.1333 0.0 1.5455 0.0 2.5439 0.0 1.0411 0.0 1.0417 0.2
Elizalde2016 Elizalde_task3_4 CMU_G+P_v4 1.2472 0.2 1.2857 0.6 0.7653 0.5 1.0667 0.0 1.3636 0.0 2.3333 0.0 1.0411 0.0 0.8750 0.3
Gorin2016 Gorin_task3_1 act 1.3456 0.3 1.0944 0.6 1.1502 0.6 1.0000 0.0 1.0000 0.0 3.1754 0.1 1.0822 0.3 0.9167 0.4
Kong2016 Kong_task3_1 QK 1.1055 0.2 1.2131 0.5 0.7042 0.7 1.0667 0.0 1.1818 0.0 1.4211 0.0 1.0685 0.0 1.0833 0.0
Kroos2016 Kroos_task3_1 RandB 1.6154 0.1 1.1695 0.3 1.4789 0.2 2.2000 0.1 1.0909 0.0 2.3158 0.1 1.2192 0.1 1.8333 0.0
Lai2016 Liu_task3_1 BW#3 1.7348 0.1 1.0266 0.4 0.7324 0.5 1.5333 0.0 2.0909 0.0 2.4561 0.0 1.1164 0.1 3.1875 0.0
Dai2016 Pham_task3_1 0.9808 0.1 1.0024 0.1 0.8216 0.4 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.0417 0.0
Phan2016 Phan_task3_1 CaR-FOREST 1.0576 0.1 1.4673 0.4 0.8263 0.5 1.0000 0.0 1.0000 0.0 1.0000 0.0 1.1096 0.0 1.0000 0.0
Schroeder2016 Schroeder_task3_1 1.0164 0.2 0.9952 0.5 0.6995 0.7 1.0000 0.0 1.0000 0.0 1.3860 0.0 1.0342 0.0 1.0000 0.1
Ubskii2016 Ubskii_task3_1 1.0218 0.2 1.0508 0.4 0.5164 0.7 1.0000 0.0 1.0000 0.0 1.5439 0.0 1.0000 0.0 1.0417 0.0
Vu2016 Vu_task3_1 1.1772 0.2 1.2567 0.5 0.6854 0.7 1.0000 0.0 1.0000 0.0 1.5789 0.0 1.2192 0.0 1.5000 0.2
Zoehrer2016 Zoehrer_task3_1 0.9892 0.2 1.2131 0.4 0.6761 0.6 1.0000 0.0 1.0000 0.0 1.0351 0.0 1.0000 0.0 1.0000 0.0

System characteristics

Columns: technical report | submission code | system name | segment-based ER / F1 (overall) | input | features | classifier.
Adavanne2016 Adavanne_task3_1 adavanne_IID 0.8051 47.8 binaural mel energy RNN
Adavanne2016 Adavanne_task3_2 adavanne_IITD 0.8887 37.9 binaural mel energy + TDOA RNN
Heittola2016 DCASE2016 baseline DCASE2016_baseline 0.8773 34.3 monophonic MFCC GMM
Elizalde2016 Elizalde_task3_1 CMU_G_v3 1.0730 22.5 monophonic MFCC Random forests
Elizalde2016 Elizalde_task3_2 CMU_G_v4 1.1056 20.8 monophonic MFCC Random forests
Elizalde2016 Elizalde_task3_3 CMU_G+P_v3 0.9635 33.3 monophonic MFCC Random forests
Elizalde2016 Elizalde_task3_4 CMU_G+P_v4 0.9613 33.6 monophonic MFCC Random forests
Gorin2016 Gorin_task3_1 act 0.9799 41.1 monophonic mel energy CNN
Kong2016 Kong_task3_1 QK 0.9557 36.3 monophonic MFCC DNN
Kroos2016 Kroos_task3_1 RandB 1.1488 16.8 Random
Lai2016 Liu_task3_1 BW#3 0.9287 34.5 monophonic MFCC Fusion
Dai2016 Pham_task3_1 0.9583 11.6 monophonic MFCC DNN
Phan2016 Phan_task3_1 CaR-FOREST 0.9644 23.9 monophonic GCC Random forests
Schroeder2016 Schroeder_task3_1 1.3092 33.6 monophonic GFB GMM-HMM
Ubskii2016 Ubskii_task3_1 0.9971 39.6 monophonic MFCC Fusion
Vu2016 Vu_task3_1 0.9124 41.9 monophonic mel energy RNN
Zoehrer2016 Zoehrer_task3_1 0.9056 39.6 monophonic spectrogram GRNN

Technical reports

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

Abstract

In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real-life sound recordings typically contain many overlapping sound events, making them hard to recognize from single-channel audio alone. Human listeners successfully recognize mixtures of overlapping sound events by using pitch cues and by exploiting the stereo (multichannel) signals available at their ears to spatially localize these events. Traditionally, SED systems have used only single-channel audio; motivated by the human listener, we propose to extend them to multichannel audio. The proposed SED system is compared against the state-of-the-art single-channel method on the development subset of the TUT sound events detection 2016 database [1]. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
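The second submitted variant (adavanne_IITD) adds the time difference of arrival (TDOA) between the binaural channels as a spatial feature. The report does not include code; a common way to estimate TDOA, assumed here purely for illustration, is the generalized cross-correlation with phase transform (GCC-PHAT):

```python
import numpy as np

def gcc_phat(x, y, fs=1.0):
    """Estimate the delay of x relative to y (in samples when fs=1)
    via generalized cross-correlation with phase transform."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    r = X * np.conj(Y)
    r /= np.abs(r) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(r, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

In a binaural SED front end this would typically be computed per analysis frame and appended to the spectral features.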

System characteristics
Input binaural
Sampling rate 44.1kHz
Features mel energy; mel energy + TDOA
Classifier RNN

Sound Event Detection for Real Life Audio DCASE Challenge

Abstract

We explore a logistic regression classifier (LogReg) and a deep neural network (DNN) on Task 3 of the DCASE 2016 Challenge, i.e., sound event detection in real life audio. Our models use Mel-frequency cepstral coefficients (MFCCs) and their deltas and accelerations as detection features. The error rate metric favors the simple logistic regression model with a high activation threshold in both segment- and event-based contexts. On the other hand, the DNN model outperforms the baseline in the frame-based context.
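Deltas and accelerations are first- and second-order frame-to-frame derivatives of the MFCC trajectories. A minimal sketch of the standard regression-window delta (the window half-width n=2 is an assumption; the report does not state it):

```python
import numpy as np

def delta(feat, n=2):
    """Regression-window delta of a (n_frames, n_coeffs) feature matrix:
    d_t = sum_{k=1..n} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames replicated for padding."""
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = np.zeros_like(feat, dtype=float)
    for k in range(1, n + 1):
        out += k * (padded[n + k:len(feat) + n + k]
                    - padded[n - k:len(feat) + n - k])
    return out / denom

# Accelerations are simply the delta of the deltas:
# acc = delta(delta(mfcc))
```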

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier DNN

Experimentation on The DCASE Challenge 2016: Task 1 - Acoustic Scene Classification and Task 3 - Sound Event Detection in Real Life Audio

Abstract

Audio carries substantial information about the contents of our environment. In a recording, sound events can occur in isolation, such as a car passing by or footsteps, and/or there can be a collection of sound events, often collectively referred to as scenes, such as a busy street or a park. The 2016 DCASE challenge aims to foster standardized development in both of these areas. In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization, and other heuristics specific to each task. Our performance on both tasks improved on the baselines published by DCASE. For Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier Random forests

DCASE 2016 Sound Event Detection System Based on Convolutional Neural Network

Abstract

This report describes a sound event detection system submitted to the DCASE 2016 challenge. A convolutional neural network is used to detect and classify polyphonic events over a long temporal context of filter bank acoustic features. Given the small amount of training data, data augmentation was explored. The system achieves an average relative error rate improvement of 7.7%, but is still unable to detect short events with limited training data.
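The report mentions data augmentation without detail; as a hedged sketch, two generic augmentations for framed features are shown below (random circular time shift plus additive Gaussian noise, both assumptions, not necessarily what the authors used):

```python
import numpy as np

def augment(feat, rng, max_shift=8, noise_std=0.1):
    """Return a perturbed copy of a (n_frames, n_bands) feature matrix.

    max_shift and noise_std are illustrative hyperparameters."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(feat, shift, axis=0)            # circular time shift
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return out
```

Each training epoch would then see a slightly different version of every recording, which helps when the training set is small.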

System characteristics
Input monophonic
Sampling rate 16kHz
Features mel energy
Classifier CNN

DCASE2016 Baseline System

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier GMM

Deep Neural Network Baseline for DCASE Challenge 2016

Abstract

The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, Deep Neural Networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel filter bank features are used. Two kinds of Mel banks, same-area and same-height, are discussed; experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel Frequency Cepstral Coefficients (MFCC) + Gaussian Mixture Model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using Constant Q Transform (CQT) + Non-negative Matrix Factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an Equal Error Rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
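The same-area versus same-height distinction can be sketched as follows: build triangular filters on the mel scale, then either leave every peak at height 1 ("same height") or rescale each filter to unit area ("same area"). The 40-band, 16 kHz, 512-point-FFT setup below is illustrative, not taken from the report:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_banks(n_filters=40, n_fft=512, sr=16000, same_area=False):
    """Triangular mel filterbank of shape (n_filters, n_fft//2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):            # rising slope
            fbank[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):            # falling slope, peak = 1
            fbank[i, k] = (hi - k) / max(hi - ctr, 1)
    if same_area:
        # Normalize each filter to unit (discrete) area
        fbank /= fbank.sum(axis=1, keepdims=True)
    return fbank
```

With same-area normalization the narrow low-frequency filters become taller than the wide high-frequency ones, which changes the energy balance of the resulting features.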

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier DNN

Random System Performance in Task 3

Abstract

In this report we briefly describe the creation of a random, data-blind system that provides a random baseline for Task 3 of the DCASE 2016 challenge. Particular attention is paid to the results for two sound events occurring in the residential area scene, one very rare, the other very frequent.

System characteristics
Classifier Random

DCASE Report for Task 3: Sound Event Detection in Real Life Audio

Abstract

Our team has built an acoustic event classifier using solely short-time features. Signals were first de-noised by a log minimum mean-square error (logMMSE) procedure. Mel-frequency cepstral coefficients (MFCCs), extracted from the de-noised signal every 20 ms, were then used to train two classifiers based on support vector machines (SVM) and neural networks (NN), respectively. Optimal parameters for the classifiers were exhaustively searched to maximize the frame-wise recognition accuracy in cross-validation. Frame-wise recognition rates of 93.0% and 91.8% were thus obtained from the SVM and NN, respectively, for the home events (and 86.2% and 85.7%, respectively, for the residential events). To process the evaluation data, the same signal processing procedures were applied so that both classifiers produce a classification result for every frame. Whenever the SVM and NN give different answers, we resort to the confusion matrices obtained during the supervised learning phase so that a final answer can be produced based on a maximum a posteriori (MAP) principle. Finally, a heuristic smoothing procedure was applied to the jointly decided recognition results so that the event onsets and offsets could be determined.
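The MAP disagreement rule described above can be sketched as follows: treat each validation confusion matrix as an estimate of P(prediction | true class) and pick the class maximizing the joint likelihood of the two observed predictions (a uniform class prior is assumed here; variable names are illustrative):

```python
import numpy as np

def map_fuse(svm_pred, nn_pred, conf_svm, conf_nn):
    """Resolve a disagreement between two classifiers.

    conf_*[t, p] = validation count of true class t predicted as p.
    Returns argmax_t P(svm_pred | t) * P(nn_pred | t), i.e. a MAP
    decision under a uniform class prior."""
    if svm_pred == nn_pred:
        return svm_pred
    # Row-normalize counts into P(pred | true)
    p_svm = conf_svm / conf_svm.sum(axis=1, keepdims=True)
    p_nn = conf_nn / conf_nn.sum(axis=1, keepdims=True)
    return int(np.argmax(p_svm[:, svm_pred] * p_nn[:, nn_pred]))
```

Intuitively, if the NN's confusion matrix shows it often labels class 0 as class 2, an NN vote for class 2 carries evidence for class 0 as well.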

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier Fusion

CaR-FOREST: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection

Abstract

This report describes our submissions to Task 2 and Task 3 of the DCASE 2016 challenge [1]. The systems aim at detecting overlapping audio events in continuous streams, with detectors based on random decision forests. The proposed forests are trained jointly for classification and regression. Initially, the training is classification-oriented, encouraging the trees to select discriminative features from overlapping mixtures in order to separate positive audio segments from negative ones. A regression phase then lets the positive audio segments vote for the event onsets and offsets, thereby modeling the temporal structure of audio events. One random decision forest is trained for each event category of interest. Experimental results on the development data show that our systems outperform the DCASE 2016 challenge baselines with absolute gains of 64.4% and 8.0% on Task 2 and Task 3, respectively.
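The regression phase can be pictured as Hough-style voting: each frame classified as positive also predicts its distance back to the event onset, and the predicted onsets are accumulated into a histogram whose peak is taken as the detected onset. A toy numpy sketch of the voting step only (the forest itself is omitted; the distances would come from its regression targets):

```python
import numpy as np

def vote_onset(pos_frames, dist_to_onset, n_frames):
    """Accumulate onset votes from positively classified frames.

    pos_frames:    indices of frames classified as positive
    dist_to_onset: regressed frames-since-onset for each positive frame
    Returns the frame index receiving the most votes."""
    votes = np.zeros(n_frames)
    for t, d in zip(pos_frames, dist_to_onset):
        onset = t - int(round(d))
        if 0 <= onset < n_frames:
            votes[onset] += 1
    return int(np.argmax(votes))
```

Because many frames vote for the same onset, isolated noisy predictions are outvoted, which is what lets the method localize events inside continuous streams.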

System characteristics
Input monophonic
Sampling rate 16kHz
Features GCC
Classifier Random forests

Performance Comparison of GMM, HMM and DNN Based Approaches for Acoustic Event Detection Within Task 3 of The DCASE 2016 Challenge

Abstract

This contribution reports on the performance of systems for polyphonic acoustic event detection (AED) compared within the framework of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE'16) challenge. State-of-the-art Gaussian mixture model (GMM) and GMM-hidden Markov model (HMM) approaches are applied using Mel-frequency cepstral coefficient (MFCC) and Gabor filterbank (GFB) features, alongside a non-negative matrix factorization (NMF) based system. Furthermore, tandem and hybrid deep neural network (DNN)-HMM systems are adopted. All HMM systems, which are usually of multiclass type, i.e., systems that output just one label per time segment from a set of possible classes, are extended to binary classification systems composed of single binary classifiers that discriminate between target and non-target classes and are thus capable of multi-labeling. These systems are evaluated on the residential area data of Task 3 from the DCASE'16 challenge. It is shown that the DNN-based system performs worse than the traditional systems for this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM approach.

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features GFB
Classifier GMM-HMM

Sound Event Detection in Real-Life Audio

Abstract

In this paper, an acoustic event detection system is proposed. The system fuses several classifiers (GMM, DNN, LSTM) using another classifier (DNN) in an attempt to achieve better results. The proposed system yields an F1 score of up to 21% for the indoor subset of the provided data and up to 44% for the outdoor subset.

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features MFCC
Classifier Fusion

Acoustic Scene and Event Recognition Using Recurrent Neural Networks

Abstract

The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using recurrent neural networks (RNNs). Experiments show that our models achieved superior performance compared with the baselines.

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features mel energy
Classifier RNN

Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection

Abstract

We present two resource-efficient frameworks for acoustic scene classification and acoustic event detection. In particular, we combine gated recurrent neural networks (GRNNs) and linear discriminant analysis (LDA) for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on DCASE 2016 Task 1 development data, a relative improvement of 8.34% over the baseline GMM system. By applying GRNNs to the DCASE2016 real-life event detection data using an MSE objective, we obtain a segment-based error rate (ER) of 0.73, a relative improvement of 19.8% over the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis. In particular, we evaluate the effects of a hybrid, i.e., generative-discriminative, objective function.

System characteristics
Input monophonic
Sampling rate 44.1kHz
Features spectrogram
Classifier GRNN