Task description
A detailed task description is available on the task description page.
Challenge results
Here you can find complete information on the submissions for Task 3: results on the evaluation and development sets (when reported by the authors), class-wise results, technical reports, and BibTeX citations.
A detailed description of the metrics used can be found here.
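As a quick reference, the segment-based ER and F1 reported in the tables below can be reproduced from binary activity matrices roughly as follows. This is only a sketch of the standard definitions (per-segment substitutions, deletions, and insertions); the official metric implementation should be treated as authoritative:

```python
import numpy as np

def segment_based_metrics(ref, sys):
    """Segment-based ER and F1 from binary activity matrices of shape
    (n_segments, n_classes); ref[s, c] = 1 iff class c is annotated as
    active in segment s, sys[s, c] = 1 iff the system marks it active."""
    ref = np.asarray(ref, dtype=bool)
    sys = np.asarray(sys, dtype=bool)
    tp = int(np.sum(ref & sys))
    fp_seg = np.sum(~ref & sys, axis=1)        # false positives per segment
    fn_seg = np.sum(ref & ~sys, axis=1)        # false negatives per segment
    # Error-rate decomposition: within each segment, an FP/FN pair counts
    # as one substitution; the leftovers are insertions or deletions.
    subs = np.sum(np.minimum(fp_seg, fn_seg))
    dels = np.sum(np.maximum(0, fn_seg - fp_seg))
    ins = np.sum(np.maximum(0, fp_seg - fn_seg))
    er = (subs + dels + ins) / np.sum(ref)
    f1 = 2 * tp / (2 * tp + fp_seg.sum() + fn_seg.sum())
    return er, f1
```

For the rankings below, these scores are accumulated over all segments of the evaluation recordings.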
Systems ranking
The first metric pair is segment-based on the evaluation dataset, the second is event-based (onset-only) on the evaluation dataset, and the third is segment-based on the development dataset. F1 scores are given in percent; empty cells mean the result was not reported by the authors.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | ER (event / eval) | F1 (event / eval) | ER (segment / dev) | F1 (segment / dev) |
|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | 5.1248 | 4.8 | 0.9100 | 31.0 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 0.8887 | 37.9 | 7.5286 | 4.7 | 0.8500 | 34.3 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | 1.7303 | 6.3 | 0.9100 | 23.7 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.0730 | 22.5 | 3.3496 | 4.2 | | |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 1.1056 | 20.8 | 3.1804 | 2.9 | 0.8100 | 34.8 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 0.9635 | 33.3 | 2.0445 | 4.2 | | |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | 1.8700 | 3.6 | 0.7600 | 38.5 |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | 1.8483 | 2.9 | 0.8400 | 38.1 |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | 2.8819 | 7.3 | | 38.1 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | 3.1469 | 3.4 | | |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | 2.4283 | 8.1 | | |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | 1.2886 | 1.8 | 1.2450 | 18.1 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | 1.0634 | 1.5 | 0.8304 | 31.6 |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | 12.0766 | 3.7 | | |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | 2.9518 | 6.7 | | |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | 2.0949 | 6.3 | 0.8150 | 49.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | 3.0879 | 6.0 | 0.7300 | 47.6 |
Teams ranking
Table including only the best-performing system from each submitting team.
The first metric pair is segment-based on the evaluation dataset, the second is event-based (onset-only) on the evaluation dataset, and the third is segment-based on the development dataset. F1 scores are given in percent; empty cells mean the result was not reported by the authors.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | ER (event / eval) | F1 (event / eval) | ER (segment / dev) | F1 (segment / dev) |
|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | 5.1248 | 4.8 | 0.9100 | 31.0 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | 1.7303 | 6.3 | 0.9100 | 23.7 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | 1.8700 | 3.6 | 0.7600 | 38.5 |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | 1.8483 | 2.9 | 0.8400 | 38.1 |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | 2.8819 | 7.3 | | 38.1 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | 3.1469 | 3.4 | | |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | 2.4283 | 8.1 | | |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | 1.2886 | 1.8 | 1.2450 | 18.1 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | 1.0634 | 1.5 | 0.8304 | 31.6 |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | 12.0766 | 3.7 | | |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | 2.9518 | 6.7 | | |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | 2.0949 | 6.3 | 0.8150 | 49.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | 3.0879 | 6.0 | 0.7300 | 47.6 |
Class-wise performance
Home
All values are segment-based metrics on the evaluation dataset; the first ER/F1 pair is the class-based average.

| Technical Report | Code | Name | ER (class avg) | F1 (class avg) | ER: Cupboard | F1: Cupboard | ER: Cutlery | F1: Cutlery | ER: Dishes | F1: Dishes | ER: Drawer | F1: Drawer | ER: Glass jingling | F1: Glass jingling | ER: Object impact | F1: Object impact | ER: Object rustling | F1: Object rustling | ER: Object snapping | F1: Object snapping | ER: People walking | F1: People walking | ER: Washing dishes | F1: Washing dishes | ER: Water tap running | F1: Water tap running |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.9887 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.2785 | 0.2 | 0.5973 | 0.8 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 1.0682 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 2.0127 | 0.2 | 0.7372 | 0.7 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.9783 | 0.2 | 1.0385 | 0.0 | 1.0571 | 0.0 | 1.0744 | 0.2 | 0.9811 | 0.1 | 1.0000 | 0.0 | 1.1574 | 0.1 | 0.6786 | 0.6 | 1.0000 | 0.0 | 1.0833 | 0.2 | 1.0190 | 0.0 | 0.6724 | 0.6 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.9262 | 0.1 | 2.1538 | 0.0 | 1.9714 | 0.0 | 1.7851 | 0.1 | 2.5094 | 0.1 | 2.0667 | 0.1 | 1.6294 | 0.2 | 1.5714 | 0.0 | 3.0476 | 0.1 | 1.9479 | 0.1 | 1.4747 | 0.2 | 1.0307 | 0.3 |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 4.2003 | 0.0 | 1.0385 | 0.0 | 6.7143 | 0.0 | 1.3636 | 0.0 | 1.0000 | 0.0 | 27.1333 | 0.0 | 1.9949 | 0.3 | 1.0000 | 0.0 | 2.9048 | 0.0 | 1.0000 | 0.0 | 1.0063 | 0.0 | 1.0478 | 0.0 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 1.5296 | 0.1 | 1.0000 | 0.0 | 1.5429 | 0.0 | 1.3802 | 0.1 | 1.4528 | 0.0 | 1.0000 | 0.0 | 2.2741 | 0.3 | 1.3393 | 0.0 | 3.0000 | 0.0 | 1.5208 | 0.1 | 1.3671 | 0.2 | 0.9488 | 0.4 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 1.5768 | 0.1 | 1.0385 | 0.0 | 1.6857 | 0.0 | 1.4050 | 0.1 | 1.8113 | 0.0 | 1.0000 | 0.0 | 2.1269 | 0.3 | 1.5893 | 0.0 | 2.6667 | 0.0 | 1.5312 | 0.1 | 1.4494 | 0.2 | 1.0410 | 0.4 |
| Gorin2016 | Gorin_task3_1 | act | 1.0834 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1653 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0863 | 0.1 | 0.8929 | 0.6 | 1.0000 | 0.0 | 1.0104 | 0.1 | 1.8101 | 0.4 | 0.9522 | 0.5 |
| Kong2016 | Kong_task3_1 | QK | 1.1803 | 0.2 | 1.0385 | 0.0 | 1.2857 | 0.0 | 1.2479 | 0.1 | 1.0755 | 0.0 | 0.9333 | 0.3 | 1.4569 | 0.2 | 1.4821 | 0.2 | 1.2381 | 0.0 | 0.9792 | 0.1 | 1.3481 | 0.1 | 0.8976 | 0.5 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.6394 | 0.1 | 1.6538 | 0.0 | 1.9429 | 0.1 | 1.5950 | 0.1 | 1.3396 | 0.0 | 2.1333 | 0.1 | 1.3147 | 0.2 | 1.9821 | 0.1 | 2.1429 | 0.0 | 1.2500 | 0.0 | 1.5190 | 0.1 | 1.1604 | 0.1 |
| Lai2016 | Liu_task3_1 | BW#3 | 1.2249 | 0.2 | 1.1538 | 0.1 | 1.2286 | 0.0 | 1.2810 | 0.1 | 1.0377 | 0.0 | 1.0667 | 0.0 | 1.4822 | 0.3 | 1.0357 | 0.4 | 2.3333 | 0.0 | 1.0417 | 0.0 | 1.1139 | 0.1 | 0.6997 | 0.7 |
| Dai2016 | Pham_task3_1 | | 1.0055 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 0.9848 | 0.0 | 1.0179 | 0.0 | 1.0000 | 0.0 | 1.0208 | 0.0 | 1.0000 | 0.0 | 1.0375 | 0.2 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 1.0449 | 0.0 | 1.0769 | 0.1 | 1.3143 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1333 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 0.9693 | 0.3 |
| Schroeder2016 | Schroeder_task3_1 | | 2.2534 | 0.1 | 1.0000 | 0.0 | 1.2571 | 0.1 | 7.5455 | 0.2 | 1.0000 | 0.0 | 3.2000 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 5.7848 | 0.3 | 1.0000 | 0.0 |
| Ubskii2016 | Ubskii_task3_1 | | 1.4109 | 0.2 | 1.4231 | 0.1 | 1.0571 | 0.0 | 2.1818 | 0.2 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.8223 | 0.3 | 2.0000 | 0.4 | 1.0000 | 0.0 | 1.7500 | 0.2 | 1.3608 | 0.2 | 0.9249 | 0.7 |
| Vu2016 | Vu_task3_1 | | 1.3479 | 0.1 | 1.0000 | 0.0 | 1.6286 | 0.0 | 1.1322 | 0.1 | 1.4340 | 0.0 | 1.0000 | 0.0 | 1.5939 | 0.2 | 2.5536 | 0.0 | 1.0000 | 0.0 | 1.8542 | 0.2 | 1.0570 | 0.0 | 0.5734 | 0.8 |
| Zoehrer2016 | Zoehrer_task3_1 | | 1.0797 | 0.1 | 1.1154 | 0.1 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0189 | 0.0 | 1.0000 | 0.0 | 1.5025 | 0.4 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.7658 | 0.3 | 0.4744 | 0.8 |
Residential Area
All values are segment-based metrics on the evaluation dataset; the first ER/F1 pair is the class-based average.

| Technical Report | Code | Name | ER (class avg) | F1 (class avg) | ER: Bird singing | F1: Bird singing | ER: Car passing by | F1: Car passing by | ER: Children shouting | F1: Children shouting | ER: Object banging | F1: Object banging | ER: People speaking | F1: People speaking | ER: People walking | F1: People walking | ER: Wind blowing | F1: Wind blowing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 1.0159 | 0.2 | 1.1332 | 0.6 | 0.5634 | 0.8 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1228 | 0.0 | 1.0000 | 0.0 | 1.2917 | 0.3 |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 1.1661 | 0.1 | 1.5884 | 0.6 | 1.0188 | 0.0 | 1.2000 | 0.1 | 1.2727 | 0.0 | 1.0000 | 0.0 | 1.0205 | 0.0 | 1.0625 | 0.0 |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 1.3188 | 0.2 | 0.9637 | 0.3 | 0.4836 | 0.7 | 1.1333 | 0.0 | 1.0000 | 0.0 | 2.6667 | 0.0 | 1.1096 | 0.1 | 1.8750 | 0.2 |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 2.7125 | 0.1 | 1.2034 | 0.4 | 0.9531 | 0.4 | 5.4667 | 0.0 | 5.2727 | 0.0 | 2.3860 | 0.0 | 1.6849 | 0.1 | 2.0208 | 0.1 |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 2.0883 | 0.2 | 1.2107 | 0.5 | 0.8357 | 0.4 | 2.8667 | 0.0 | 3.6364 | 0.0 | 2.6667 | 0.1 | 1.5479 | 0.1 | 1.8542 | 0.1 |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 1.3496 | 0.2 | 1.3341 | 0.5 | 0.8075 | 0.5 | 1.1333 | 0.0 | 1.5455 | 0.0 | 2.5439 | 0.0 | 1.0411 | 0.0 | 1.0417 | 0.2 |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 1.2472 | 0.2 | 1.2857 | 0.6 | 0.7653 | 0.5 | 1.0667 | 0.0 | 1.3636 | 0.0 | 2.3333 | 0.0 | 1.0411 | 0.0 | 0.8750 | 0.3 |
| Gorin2016 | Gorin_task3_1 | act | 1.3456 | 0.3 | 1.0944 | 0.6 | 1.1502 | 0.6 | 1.0000 | 0.0 | 1.0000 | 0.0 | 3.1754 | 0.1 | 1.0822 | 0.3 | 0.9167 | 0.4 |
| Kong2016 | Kong_task3_1 | QK | 1.1055 | 0.2 | 1.2131 | 0.5 | 0.7042 | 0.7 | 1.0667 | 0.0 | 1.1818 | 0.0 | 1.4211 | 0.0 | 1.0685 | 0.0 | 1.0833 | 0.0 |
| Kroos2016 | Kroos_task3_1 | RandB | 1.6154 | 0.1 | 1.1695 | 0.3 | 1.4789 | 0.2 | 2.2000 | 0.1 | 1.0909 | 0.0 | 2.3158 | 0.1 | 1.2192 | 0.1 | 1.8333 | 0.0 |
| Lai2016 | Liu_task3_1 | BW#3 | 1.7348 | 0.1 | 1.0266 | 0.4 | 0.7324 | 0.5 | 1.5333 | 0.0 | 2.0909 | 0.0 | 2.4561 | 0.0 | 1.1164 | 0.1 | 3.1875 | 0.0 |
| Dai2016 | Pham_task3_1 | | 0.9808 | 0.1 | 1.0024 | 0.1 | 0.8216 | 0.4 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0417 | 0.0 |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 1.0576 | 0.1 | 1.4673 | 0.4 | 0.8263 | 0.5 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.1096 | 0.0 | 1.0000 | 0.0 |
| Schroeder2016 | Schroeder_task3_1 | | 1.0164 | 0.2 | 0.9952 | 0.5 | 0.6995 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.3860 | 0.0 | 1.0342 | 0.0 | 1.0000 | 0.1 |
| Ubskii2016 | Ubskii_task3_1 | | 1.0218 | 0.2 | 1.0508 | 0.4 | 0.5164 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.5439 | 0.0 | 1.0000 | 0.0 | 1.0417 | 0.0 |
| Vu2016 | Vu_task3_1 | | 1.1772 | 0.2 | 1.2567 | 0.5 | 0.6854 | 0.7 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.5789 | 0.0 | 1.2192 | 0.0 | 1.5000 | 0.2 |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9892 | 0.2 | 1.2131 | 0.4 | 0.6761 | 0.6 | 1.0000 | 0.0 | 1.0000 | 0.0 | 1.0351 | 0.0 | 1.0000 | 0.0 | 1.0000 | 0.0 |
System characteristics
ER and F1 are segment-based metrics on the evaluation dataset; empty cells mean the attribute was not reported.

| Technical Report | Code | Name | ER (segment / eval) | F1 (segment / eval) | Input | Features | Classifier |
|---|---|---|---|---|---|---|---|
| Adavanne2016 | Adavanne_task3_1 | adavanne_IID | 0.8051 | 47.8 | binaural | mel energy | RNN |
| Adavanne2016 | Adavanne_task3_2 | adavanne_IITD | 0.8887 | 37.9 | binaural | mel energy + TDOA | RNN |
| Heittola2016 | DCASE2016 baseline | DCASE2016_baseline | 0.8773 | 34.3 | monophonic | MFCC | GMM |
| Elizalde2016 | Elizalde_task3_1 | CMU_G_v3 | 1.0730 | 22.5 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_2 | CMU_G_v4 | 1.1056 | 20.8 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_3 | CMU_G+P_v3 | 0.9635 | 33.3 | monophonic | MFCC | Random forests |
| Elizalde2016 | Elizalde_task3_4 | CMU_G+P_v4 | 0.9613 | 33.6 | monophonic | MFCC | Random forests |
| Gorin2016 | Gorin_task3_1 | act | 0.9799 | 41.1 | monophonic | mel energy | CNN |
| Kong2016 | Kong_task3_1 | QK | 0.9557 | 36.3 | monophonic | MFCC | DNN |
| Kroos2016 | Kroos_task3_1 | RandB | 1.1488 | 16.8 | | | Random |
| Lai2016 | Liu_task3_1 | BW#3 | 0.9287 | 34.5 | monophonic | MFCC | Fusion |
| Dai2016 | Pham_task3_1 | | 0.9583 | 11.6 | monophonic | MFCC | DNN |
| Phan2016 | Phan_task3_1 | CaR-FOREST | 0.9644 | 23.9 | monophonic | GCC | Random forests |
| Schroeder2016 | Schroeder_task3_1 | | 1.3092 | 33.6 | monophonic | GFB | GMM-HMM |
| Ubskii2016 | Ubskii_task3_1 | | 0.9971 | 39.6 | monophonic | MFCC | Fusion |
| Vu2016 | Vu_task3_1 | | 0.9124 | 41.9 | monophonic | mel energy | RNN |
| Zoehrer2016 | Zoehrer_task3_1 | | 0.9056 | 39.6 | monophonic | spectrogram | GRNN |
Technical reports
Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features
Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola and Tuomas Virtanen
Department of Signal Processing, Tampere University of Technology, Tampere, Finland
Adavanne_task3_1 Adavanne_task3_2
Abstract
In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task. Real-life sound recordings typically contain many overlapping sound events, which are hard to recognize from single-channel audio alone. Human listeners successfully recognize mixtures of overlapping sound events by using pitch cues and by exploiting the stereo (multichannel) signal available at their ears to spatially localize these events. Traditionally, SED systems have used only single-channel audio; motivated by human listening, we propose to extend them to multichannel audio. The proposed SED system is compared against a state-of-the-art single-channel method on the development subset of the TUT sound events detection 2016 database [1]. The proposed method improves the F-score by 3.75% while reducing the error rate by 6%.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Features | mel energy; mel energy + TDOA |
Classifier | RNN |
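The TDOA feature used in Adavanne_task3_2 is commonly estimated with the GCC-PHAT method; a minimal numpy sketch (the authors' exact front end may differ) could look like this:

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs):
    """Estimate the time difference of arrival (in seconds) between two
    channels with GCC-PHAT: the cross-power spectrum is whitened to unit
    magnitude, so the inverse FFT peaks at the dominant inter-channel lag."""
    n = len(x) + len(y)                      # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                   # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift  # negative lag: x leads y
    return lag / fs
```

Computed per frame and frequency band, such lags can be appended to the mel energies as spatial features.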
Sound Event Detection for Real Life Audio DCASE Challenge
Wei Dai1, Juncheng Li2, Phuong Pham3, Samarjit Das2 and Shuhui Qu4
1Carnegie Mellon University, Pittsburgh, USA, 2Robert Bosch Research and Technology Center, USA, 3University of Pittsburgh, Pittsburgh, USA, 4Stanford University, Stanford, USA
Pham_task3_1
Abstract
We explore a logistic regression classifier (LogReg) and a deep neural network (DNN) on Task 3 of the DCASE 2016 Challenge, i.e., sound event detection in real-life audio. Our models use Mel-frequency cepstral coefficients (MFCCs) and their deltas and accelerations as detection features. The error rate metric favors the simple logistic regression model with a high activation threshold in both segment- and event-based contexts. On the other hand, the DNN model outperforms the baseline in the frame-based context.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
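The delta and acceleration features mentioned in the abstract are standard regression-based derivatives of the MFCC trajectories. A minimal sketch of that step (the MFCCs themselves would come from an audio front end; the context width here is an illustrative choice):

```python
import numpy as np

def deltas(feat, width=2):
    """First-order delta features over a (n_frames, n_coeffs) feature
    matrix, using the standard regression formula with `width` frames of
    context on each side and edge padding at the boundaries."""
    taps = np.arange(1, width + 1)
    denom = 2 * np.sum(taps ** 2)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    n = len(feat)
    for t in taps:
        out += t * (padded[width + t:width + t + n] -
                    padded[width - t:width - t + n])
    return out / denom
```

Accelerations are the same operator applied twice, and the final feature matrix stacks all three blocks, e.g. `np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])`.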
Experimentation on The DCASE Challenge 2016: Task 1 - Acoustic Scene Classification and Task 3 - Sound Event Detection in Real Life Audio
Benjamin Elizalde1, Anurag Kumar1, Ankit Shah2, Rohan Badlani3, Emmanuel Vincent4, Bhiksha Raj1 and Ian Lane1
1Carnegie Mellon University, Pittsburgh, USA, 2NIT Surathkal, India, 3BITS, Pilani, India, 4Inria, France
Elizalde_task3_1 Elizalde_task3_2 Elizalde_task3_3 Elizalde_task3_4
Abstract
Audio carries substantial information about the contents of our environment. In a recording, sound events can occur in isolation, such as a car passing by or footsteps, or as a collection of sound events, often collectively referred to as a scene, such as a busy street or a park. The 2016 DCASE challenge aims to foster standardized development in both of these areas. In this paper we present our work on Task 1, Acoustic Scene Classification, and Task 3, Sound Event Detection in Real Life Recordings. Our experiments cover low-level and high-level features, classifier optimization, and other heuristics specific to each task. Our performance on both tasks improved on the baseline published by DCASE. For Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.48 compared to the baseline of 0.91.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Random forests |
DCASE 2016 Sound Event Detection System Based on Convolutional Neural Network
Abstract
This report describes a sound event detection system submitted to the DCASE 2016 challenge. In this work, a convolutional neural network is used for detecting and classifying polyphonic events in a long temporal context of filter bank acoustic features. Given the small amount of training resources, data augmentation was explored. The system achieves an average 7.7% relative error rate improvement, but is still unable to detect short events with limited training data.
System characteristics
Input | monophonic |
Sampling rate | 16kHz |
Features | mel energy |
Classifier | CNN |
DCASE2016 Baseline System
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | GMM |
Deep Neural Network Baseline for DCASE Challenge 2016
Abstract
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging. Since 2006, deep neural networks (DNNs) have been widely applied to computer vision, speech recognition, and natural language processing tasks. In this paper, we provide DNN baselines for the DCASE Challenge 2016. For feature extraction, 40 Mel filter bank features are used. Two kinds of Mel banks, the same-area bank and the same-height bank, are discussed. Experimental results show that the same-height bank is better than the same-area bank. DNNs with the same structure are applied to all four tasks in the DCASE Challenge 2016. In Task 1 we obtained an accuracy of 76.4% using Mel + DNN against 72.5% using Mel-frequency cepstral coefficients (MFCC) + Gaussian mixture model (GMM). In Task 2 we obtained an F value of 17.4% using Mel + DNN against 41.6% using the constant-Q transform (CQT) + non-negative matrix factorization (NMF). In Task 3 we obtained an F value of 38.1% using Mel + DNN against 26.6% using MFCC + GMM. In Task 4 we obtained an equal error rate (EER) of 20.9% using Mel + DNN against 21.0% using MFCC + GMM. The DNN therefore improves on the baseline in Task 1 and Task 3, is similar to the baseline in Task 4, and is worse than the baseline in Task 2. This indicates that DNNs can be successful in many of these tasks, but may not always work.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | DNN |
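The "same-area" versus "same-height" Mel banks compared in the abstract differ only in how each triangular filter is normalized. A rough numpy sketch (the parameter values here are illustrative, not taken from the paper):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=1024, sr=44100, same_height=True):
    """Triangular Mel filters on FFT bins. same_height=True gives every
    filter a unit peak; False rescales each filter to unit area instead
    (the two variants the abstract compares)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, center, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for k in range(lo, center):          # rising edge of the triangle
            fb[i, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):          # falling edge
            fb[i, k] = (hi - k) / max(hi - center, 1)
        if not same_height:
            area = fb[i].sum()
            if area > 0:
                fb[i] /= area                # same-area normalization
    return fb
```

The log energies `np.log(fb @ power_spectrum + eps)` would then be the 40-dimensional Mel features fed to the DNN.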
Random System Performance in Task 3
Christian Kroos and Mark Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Surrey, United Kingdom
Kroos_task3_1
Abstract
In this report we briefly describe the creation of a random, data-blind system to provide a random baseline for Task 3 of the DCASE 2016 challenge. Particular attention is paid to the results for two sound events occurring in the residential area scene, one very rare, the other very frequent.
System characteristics
Classifier | Random |
DCASE Report for Task 3: Sound Event Detection in Real Life Audio
Ying-Hui Lai1,2, Chun-Hao Wang3, Shi-Yan Hou3, Bang-Yin Chen3, Yu Tsao1 and Yi-Wen Liu3
1Research Center for Information Technology, Academia Sinica, Taipei, Taiwan, 2Department of Electrical Engineering, Yuan Ze University, Taoyuan City, Taiwan, 3National Tsing Hua University, Hsinchu City, Taiwan
Liu_task3_1
Abstract
Our team has built an acoustic event classifier using only short-time features. Signals were first de-noised by a log minimum mean square error (logMMSE) procedure. Then, Mel-frequency cepstral coefficients (MFCCs) extracted from the de-noised signal every 20 ms were used to train two classifiers based on support vector machines (SVMs) and neural networks (NNs), respectively. Optimal parameters for the classifiers were exhaustively searched to maximize the frame-wise recognition accuracy in cross-validation. Frame-wise recognition rates of 93.0% and 91.8% were thus obtained from the SVM and NN, respectively, for the home events (and 86.2% and 85.7%, respectively, for the residential events). To process the evaluation data, the same signal processing procedures were applied so that both classifiers produce a classification result for every frame. Whenever the SVM and NN give different answers, we resort to the confusion matrices obtained during the supervised learning phase so that a final answer can be produced based on a maximum a posteriori (MAP) principle. Finally, a heuristic smoothing procedure was applied to the jointly decided recognition results so that the event onsets and offsets could be determined.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Fusion |
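The MAP-style arbitration between the SVM and NN described in the abstract can be sketched from the training-time confusion matrices. This is one plausible reading of the rule, assuming the two classifiers are conditionally independent given the true class; the authors' exact procedure may differ:

```python
import numpy as np

def fuse_predictions(p_svm, p_nn, conf_svm, conf_nn):
    """Resolve a frame where two classifiers disagree, using their
    training-time confusion matrices; conf[i, j] counts frames of true
    class i predicted as class j."""
    if p_svm == p_nn:
        return p_svm
    # P(prediction | true class), estimated from the confusion-matrix rows
    lik_svm = conf_svm[:, p_svm] / conf_svm.sum(axis=1)
    lik_nn = conf_nn[:, p_nn] / conf_nn.sum(axis=1)
    prior = conf_svm.sum(axis=1) / conf_svm.sum()    # class prior
    # MAP estimate of the true class given both (disagreeing) predictions
    return int(np.argmax(prior * lik_svm * lik_nn))
```

A final smoothing pass over the fused frame labels would then yield event onsets and offsets, as the abstract describes.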
Car-Forest: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection
Huy Phan1,2, Lars Hertel1, Marco Maass1, Philipp Koch1 and Alfred Mertins1
1Institute for Signal Processing, University of Luebeck, Luebeck, Germany, 2Graduate School for Computing in Medicine and Life Sciences, University of Luebeck, Luebeck, Germany
Phan_task3_1
Abstract
This report describes our submissions to Task 2 and Task 3 of the DCASE 2016 challenge [1]. The systems aim at detecting overlapping audio events in continuous streams, where the detectors are based on random decision forests. The proposed forests are jointly trained for classification and regression. Initially, the training is classification-oriented, to encourage the trees to select discriminative features that separate positive audio segments from negative ones in overlapping mixtures. The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and thereby model the temporal structure of audio events. One random decision forest is trained specifically for each event category of interest. Experimental results on the development data show that our systems outperform the DCASE 2016 challenge baselines with absolute gains of 64.4% and 8.0% on Task 2 and Task 3, respectively.
System characteristics
Input | monophonic |
Sampling rate | 16kHz |
Features | GCC |
Classifier | Random forests |
Performance Comparison of GMM, HMM and DNN Based Approaches for Acoustic Event Detection Within Task 3 of The DCASE 2016 Challenge
Jens Schröder1,2, Jörn Anemüller2,3 and Stefan Goetze1,2
1Fraunhofer Institute for Digital Media Technology IDMT, Oldenburg, Germany, 2Cluster of Excellence, Hearing4all, Germany, 3Department of Medical Physics and Acoustics, University of Oldenburg, Oldenburg, Germany
Schroeder_task3_1
Abstract
This contribution reports on the performance of systems for polyphonic acoustic event detection (AED) compared within the framework of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE'16) challenge. State-of-the-art Gaussian mixture model (GMM) and GMM-hidden Markov model (GMM-HMM) approaches are applied using Mel-frequency cepstral coefficient (MFCC) and Gabor filterbank (GFB) features, along with a non-negative matrix factorization (NMF) based system. Furthermore, tandem and hybrid deep neural network (DNN)-HMM systems are adopted. All HMM systems, which are usually of the multiclass type, i.e., systems that output just one label per time segment from a set of possible classes, are extended to binary classification systems composed of single binary classifiers discriminating between target and non-target classes, and are thus capable of multi-labeling. These systems are evaluated on the residential area data of Task 3 of the DCASE'16 challenge. It is shown that the DNN-based system performs worse than the traditional systems on this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM approach.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | GFB |
Classifier | GMM-HMM |
Sound Event Detection in Real-Life Audio
Dmitrii Ubskii and Alexei Pugachev
Chair of Speech Information Systems, ITMO University, St. Petersburg, Russia
Ubskii_task3_1
Abstract
In this paper, an acoustic event detection system is proposed. The system fuses several classifiers (GMM, DNN, LSTM) using another classifier (DNN) in an attempt to achieve better results. The proposed system yields an F1 score of up to 21% for the indoor subset of the provided data and up to 44% for the outdoor subset.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | Fusion |
Acoustic Scene and Event Recognition Using Recurrent Neural Networks
Toan H. Vu and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Vu_task3_1
Abstract
The DCASE2016 challenge is designed particularly for research in environmental sound analysis. It consists of four tasks spanning problems such as acoustic scene classification and sound event detection. This paper reports our results on all the tasks using recurrent neural networks (RNNs). Experiments show that our models achieve superior performance compared with the baselines.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | mel energy |
Classifier | RNN |
Gated Recurrent Networks Applied To Acoustic Scene Classification and Acoustic Event Detection
Matthias Zöhrer and Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria
Zoehrer_task3_1
Abstract
We present two resource-efficient frameworks for acoustic scene classification and acoustic event detection. In particular, we combine gated recurrent neural networks (GRNNs) and linear discriminant analysis (LDA) for efficiently classifying environmental sound scenes of the IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE2016). Our system reaches an overall accuracy of 79.1% on the DCASE 2016 Task 1 development data, a relative improvement of 8.34% over the baseline GMM system. By applying GRNNs to the DCASE2016 real event detection data using an MSE objective, we obtain a segment-based error rate (ER) of 0.73, a relative improvement of 19.8% over the baseline GMM system. We further investigate semi-supervised learning applied to acoustic scene analysis. In particular, we evaluate the effects of a hybrid, i.e., generative-discriminative, objective function.
System characteristics
Input | monophonic |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | GRNN |
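For readers unfamiliar with gated recurrent networks, the core recurrence of a GRU-style cell can be sketched as follows; this is the generic GRU update, not necessarily the exact GRNN variant used in the submission:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One time step of a gated recurrent unit: the update gate z blends
    the previous state h with a candidate state, and the reset gate r
    controls how much of h feeds the candidate."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])      # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])      # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand
```

Run over a sequence of spectrogram frames, the final (or per-frame) state h feeds the event-activity output layer.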