Task description
This task evaluated the performance of sound event detection systems in multisource conditions similar to everyday life, where sound sources are rarely heard in isolation. Participants used 1 hour and 32 minutes of audio in 24 recordings to train their systems. The challenge evaluation was done on 29 minutes of audio in 8 recordings.
A more detailed task description can be found on the task description page.
Challenge results
A detailed description of the metrics used can be found here.
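For orientation, the sketch below shows one way to compute the segment-based error rate (ER) and F1 reported in the tables from binary activity matrices over one-second segments. It is a simplified reimplementation for illustration only; the official evaluation used the sed_eval toolbox, and the names here are illustrative.

```python
import numpy as np

def segment_based_metrics(ref, est):
    """Segment-based ER and F1.

    ref, est: binary arrays of shape (n_segments, n_classes) marking
    which event classes are active in each one-second segment.
    """
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum(axis=1)  # per-segment false positives
    fn = np.logical_and(ref == 1, est == 0).sum(axis=1)  # per-segment false negatives

    # Per-segment substitutions, deletions and insertions
    s = np.minimum(fn, fp).sum()
    d = np.maximum(0, fn - fp).sum()
    i = np.maximum(0, fp - fn).sum()
    n = ref.sum()  # total number of active reference events

    er = (s + d + i) / n
    f1 = 2 * tp / (2 * tp + fp.sum() + fn.sum())  # multiply by 100 for the tables' %
    return er, f1
```

Here ER counts a substitution wherever a false positive and a false negative coincide in a segment, plus the remaining deletions and insertions, normalized by the number of active reference events.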
Systems ranking
All metrics are segment-based and computed overall; ER is the error rate and F1 is given in %. Results are reported on the evaluation (eval) and development (dev) datasets.

Technical Report | Code | Name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev)
---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task3_1 | Ash_1 | 0.7914 | 41.7 | 0.2500 | 79.3 | |
Adavanne2017 | Adavanne_TUT_task3_2 | Ash_2 | 0.8061 | 42.9 | 0.2400 | 79.1 | |
Adavanne2017 | Adavanne_TUT_task3_3 | Ash_3 | 0.8544 | 41.4 | 0.2000 | 80.3 | |
Adavanne2017 | Adavanne_TUT_task3_4 | Ash_4 | 0.8716 | 36.2 | 0.2400 | 76.9 | |
Chen2017 | Chen_UR_task3_1 | Chen | 0.8575 | 30.9 | 0.8100 | 37.0 | |
Dang2017 | Dang_NCU_task3_1 | andang3 | 0.9529 | 42.6 | 0.5900 | 55.4 | |
Dang2017 | Dang_NCU_task3_2 | andang3 | 0.9468 | 42.8 | 0.5900 | 55.4 | |
Dang2017 | Dang_NCU_task3_3 | andang3 | 1.0318 | 44.2 | 0.5900 | 55.4 | |
Dang2017 | Dang_NCU_task3_4 | andang3 | 1.1028 | 43.5 | 0.6200 | 53.7 | |
Feroze2017 | Feroze_IST_task3_1 | Khizer | 1.0942 | 42.6 | 0.7600 | 47.4 | |
Feroze2017 | Feroze_IST_task3_2 | Khizer | 1.0312 | 39.7 | 0.7600 | 47.4 | |
Heittola2017 | DCASE2017 baseline | Baseline | 0.9358 | 42.8 | 0.6900 | 56.7 | |
Hou2017 | Hou_BUPT_task3_1 | MMS_HYB | 1.0446 | 29.3 | 0.6000 | 58.9 | |
Hou2017 | Hou_BUPT_task3_2 | BGRU_HYB | 0.9248 | 34.1 | 0.6600 | 53.9 | |
Kroos2017 | Kroos_CVSSP_task3_1 | J-NEAT-E | 0.8979 | 44.9 | 0.7300 | 49.2 | |
Kroos2017 | Kroos_CVSSP_task3_2 | J-NEAT-P | 0.8911 | 41.6 | 0.7200 | 50.5 | |
Kroos2017 | Kroos_CVSSP_task3_3 | SLFFN | 1.0141 | 43.8 | 0.6900 | 56.5 | |
Jeong2017 | Lee_SNU_task3_1 | MICNN_1 | 0.9260 | 42.0 | 0.5100 | 67.0 | |
Jeong2017 | Lee_SNU_task3_2 | MICNN_2 | 0.8673 | 27.9 | 0.5100 | 67.0 | |
Jeong2017 | Lee_SNU_task3_3 | MICNN_3 | 0.8080 | 40.8 | 0.5100 | 67.0 | |
Jeong2017 | Lee_SNU_task3_4 | MICNN_4 | 0.8985 | 43.6 | 0.5100 | 67.0 | |
Li2017 | Li_SCUT_task3_1 | LiSCUTt3_1 | 0.9920 | 40.3 | 0.7100 | 55.5 | |
Li2017 | Li_SCUT_task3_2 | LiSCUTt3_2 | 0.9523 | 41.0 | 0.6900 | 54.5 | |
Li2017 | Li_SCUT_task3_3 | LiSCUTt3_3 | 1.0043 | 43.4 | 0.7100 | 55.8 | |
Li2017 | Li_SCUT_task3_4 | LiSCUTt3_4 | 0.9878 | 33.9 | 0.7100 | 52.8 | |
Lu2017 | Lu_THU_task3_1 | bigru_da | 0.8251 | 39.6 | 0.6100 | 56.7 | |
Lu2017 | Lu_THU_task3_2 | bigru_da | 0.8306 | 39.2 | 0.6100 | 56.7 | |
Lu2017 | Lu_THU_task3_3 | bigru_da | 0.8361 | 38.0 | 0.6100 | 56.7 | |
Lu2017 | Lu_THU_task3_4 | bigru_da | 0.8373 | 38.3 | 0.6100 | 56.7 | |
Wang2017 | Wang_NTHU_task3_1 | NTHU_AHG | 0.9749 | 40.8 | 0.7700 | 43.6 | |
Xia2017 | Xia_UWA_task3_1 | UWA_T3_1 | 0.9523 | 43.5 | 0.6600 | 56.9 | |
Xia2017 | Xia_UWA_task3_2 | UWA_T3_1 | 0.9437 | 41.1 | 0.6500 | 56.0 | |
Xia2017 | Xia_UWA_task3_3 | UWA_T3_1 | 0.8740 | 41.7 | 0.6400 | 56.0 | |
Yu2017 | Yu_FZU_task3_1* | DRF | 1.1963 | 3.9 | 0.8200 | 38.2 | |
Zhou2017 | Zhou_PKU_task3_1 | MC-LSTM-1 | 0.8526 | 39.1 | 0.6600 | 54.5 | |
Zhou2017 | Zhou_PKU_task3_2 | MC-LSTM-2 | 0.8526 | 37.3 | 0.6400 | 54.4 |
Teams ranking
The table below includes only the best performing system from each submitting team.
Technical Report | Code | Name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev)
---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task3_1 | Ash_1 | 0.7914 | 41.7 | 0.2500 | 79.3 | |
Chen2017 | Chen_UR_task3_1 | Chen | 0.8575 | 30.9 | 0.8100 | 37.0 | |
Dang2017 | Dang_NCU_task3_2 | andang3 | 0.9468 | 42.8 | 0.5900 | 55.4 | |
Feroze2017 | Feroze_IST_task3_2 | Khizer | 1.0312 | 39.7 | 0.7600 | 47.4 | |
Heittola2017 | DCASE2017 baseline | Baseline | 0.9358 | 42.8 | 0.6900 | 56.7 | |
Hou2017 | Hou_BUPT_task3_2 | BGRU_HYB | 0.9248 | 34.1 | 0.6600 | 53.9 | |
Kroos2017 | Kroos_CVSSP_task3_2 | J-NEAT-P | 0.8911 | 41.6 | 0.7200 | 50.5 | |
Jeong2017 | Lee_SNU_task3_3 | MICNN_3 | 0.8080 | 40.8 | 0.5100 | 67.0 | |
Li2017 | Li_SCUT_task3_2 | LiSCUTt3_2 | 0.9523 | 41.0 | 0.6900 | 54.5 | |
Lu2017 | Lu_THU_task3_1 | bigru_da | 0.8251 | 39.6 | 0.6100 | 56.7 | |
Wang2017 | Wang_NTHU_task3_1 | NTHU_AHG | 0.9749 | 40.8 | 0.7700 | 43.6 | |
Xia2017 | Xia_UWA_task3_3 | UWA_T3_1 | 0.8740 | 41.7 | 0.6400 | 56.0 | |
Yu2017 | Yu_FZU_task3_1* | DRF | 1.1963 | 3.9 | 0.8200 | 38.2 | |
Zhou2017 | Zhou_PKU_task3_1 | MC-LSTM-1 | 0.8526 | 39.1 | 0.6600 | 54.5 |
Class-wise performance
Segment-based ER and F1 (%) per event class on the evaluation dataset. Blank F1 cells indicate that no F1 score was reported for that class.

Technical Report | Code | Name | Brakes squeaking ER | Brakes squeaking F1 | Car ER | Car F1 | Children ER | Children F1 | Large vehicle ER | Large vehicle F1 | People speaking ER | People speaking F1 | People walking ER | People walking F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task3_1 | Ash_1 | 1.0000 | | 0.7674 | 54.6 | 1.2000 | 0.0 | 1.0678 | 49.3 | 1.0408 | 0.0 | 1.0331 | 38.7
Adavanne2017 | Adavanne_TUT_task3_2 | Ash_2 | 0.9773 | 4.4 | 0.7674 | 54.7 | 2.8000 | 0.0 | 1.4181 | 45.3 | 1.2367 | 1.9 | 0.8398 | 52.6
Adavanne2017 | Adavanne_TUT_task3_3 | Ash_3 | 1.0000 | | 0.7758 | 52.0 | 3.2667 | 0.0 | 1.4576 | 48.0 | 1.4286 | 3.3 | 0.9144 | 52.8
Adavanne2017 | Adavanne_TUT_task3_4 | Ash_4 | 1.0000 | | 0.8496 | 51.4 | 1.0000 | | 1.4011 | 37.7 | 1.0000 | | 1.5580 | 28.8
Chen2017 | Chen_UR_task3_1 | Chen | 1.0000 | | 0.8538 | 51.8 | 1.0000 | | 0.9887 | 14.6 | 1.0082 | 0.0 | 1.0663 | 1.0
Dang2017 | Dang_NCU_task3_1 | andang3 | 0.8409 | 27.5 | 0.8022 | 59.1 | 1.2667 | 6.6 | 1.8079 | 33.6 | 1.0980 | 21.1 | 1.8287 | 35.0
Dang2017 | Dang_NCU_task3_2 | andang3 | 0.8182 | 30.8 | 0.8036 | 59.0 | 1.2000 | 6.9 | 1.8079 | 32.8 | 1.1102 | 22.3 | 1.7928 | 35.2
Dang2017 | Dang_NCU_task3_3 | andang3 | 0.7045 | 46.6 | 0.8482 | 59.4 | 1.2000 | 18.2 | 2.3785 | 31.5 | 1.1265 | 36.4 | 1.9503 | 34.8
Dang2017 | Dang_NCU_task3_4 | andang3 | 0.9318 | 12.8 | 0.7187 | 65.2 | 2.5111 | 1.7 | 1.6836 | 42.0 | 1.9020 | 21.8 | 2.1381 | 34.5
Feroze2017 | Feroze_IST_task3_1 | Khizer | 0.7955 | 37.5 | 0.7479 | 61.8 | 4.0889 | 0.0 | 2.0000 | 43.3 | 1.9592 | 14.3 | 1.5166 | 39.1
Feroze2017 | Feroze_IST_task3_2 | Khizer | 0.8750 | 25.2 | 0.7618 | 58.1 | 3.6222 | 0.0 | 1.8023 | 43.7 | 1.8204 | 10.4 | 1.4171 | 35.1
Heittola2017 | DCASE2017 baseline | Baseline | 0.9205 | 16.5 | 0.7674 | 61.5 | 2.6667 | 0.0 | 1.4407 | 42.7 | 1.2980 | 8.6 | 1.4448 | 33.5
Hou2017 | Hou_BUPT_task3_1 | MMS_HYB | 0.9886 | 2.2 | 0.7507 | 50.9 | 4.8222 | 0.0 | 1.7571 | 36.9 | 1.5020 | 0.0 | 1.1851 | 13.7
Hou2017 | Hou_BUPT_task3_2 | BGRU_HYB | 1.0000 | | 0.9373 | 52.7 | 2.8889 | 0.0 | 1.2712 | 32.8 | 1.1469 | 0.0 | 1.1657 | 16.3
Kroos2017 | Kroos_CVSSP_task3_1 | J-NEAT-E | 1.0000 | | 0.8677 | 62.4 | 4.2222 | 0.0 | 1.0226 | 0.0 | 1.4163 | 14.7 | 0.8508 | 51.3
Kroos2017 | Kroos_CVSSP_task3_2 | J-NEAT-P | 1.0000 | | 0.8621 | 47.1 | 2.7333 | 1.6 | 1.4463 | 43.1 | 1.4041 | 33.1 | 0.9558 | 50.0
Kroos2017 | Kroos_CVSSP_task3_3 | SLFFN | 0.9545 | 8.7 | 0.7939 | 58.9 | 4.1333 | 1.1 | 1.7458 | 42.0 | 1.6163 | 16.1 | 1.1050 | 49.5
Jeong2017 | Lee_SNU_task3_1 | MICNN_1 | 1.0000 | | 0.9234 | 61.2 | 1.0000 | | 2.5311 | 41.1 | 1.1837 | 6.5 | 1.0138 | 0.0
Jeong2017 | Lee_SNU_task3_2 | MICNN_2 | 1.0000 | | 0.9248 | 45.5 | 1.0000 | | 1.3672 | 25.8 | 1.0000 | | 1.0000 | |
Jeong2017 | Lee_SNU_task3_3 | MICNN_3 | 1.0000 | | 0.9234 | 61.2 | 1.0000 | | 1.3672 | 25.8 | 1.0000 | 0.8 | 1.0000 | |
Jeong2017 | Lee_SNU_task3_4 | MICNN_4 | 1.1023 | 7.6 | 0.9234 | 61.2 | 2.7556 | 1.6 | 1.8983 | 45.3 | 1.3020 | 30.2 | 1.0000 | |
Li2017 | Li_SCUT_task3_1 | LiSCUTt3_1 | 0.9432 | 10.8 | 0.7591 | 60.2 | 4.0222 | 0.0 | 1.7345 | 43.3 | 1.5224 | 7.4 | 1.3343 | 32.4
Li2017 | Li_SCUT_task3_2 | LiSCUTt3_2 | 1.0000 | | 0.7019 | 62.2 | 3.9111 | 0.0 | 1.4520 | 45.4 | 1.4082 | 8.0 | 1.4613 | 31.9
Li2017 | Li_SCUT_task3_3 | LiSCUTt3_3 | 1.0568 | 0.0 | 0.6783 | 66.9 | 3.4889 | 0.0 | 1.9322 | 37.8 | 1.4531 | 12.7 | 1.6685 | 34.5
Li2017 | Li_SCUT_task3_4 | LiSCUTt3_4 | 1.0682 | 17.5 | 0.9109 | 27.5 | 2.5111 | 0.0 | 1.6723 | 43.9 | 1.8653 | 15.5 | 0.8646 | 56.3
Lu2017 | Lu_THU_task3_1 | bigru_da | 1.0000 | | 0.7855 | 45.0 | 1.0444 | 0.0 | 1.5424 | 33.6 | 1.0980 | 8.2 | 1.0000 | 54.2
Lu2017 | Lu_THU_task3_2 | bigru_da | 1.0000 | | 0.8008 | 44.4 | 1.0444 | 0.0 | 1.5876 | 33.6 | 1.0735 | 7.7 | 1.0166 | 53.4
Lu2017 | Lu_THU_task3_3 | bigru_da | 1.0000 | | 0.8120 | 41.9 | 1.0889 | 0.0 | 1.5424 | 33.9 | 1.0612 | 8.5 | 1.0221 | 52.7
Lu2017 | Lu_THU_task3_4 | bigru_da | 1.0000 | | 0.8273 | 40.2 | 1.0444 | 0.0 | 1.4802 | 34.8 | 1.0776 | 9.0 | 1.0083 | 54.5
Wang2017 | Wang_NTHU_task3_1 | NTHU_AHG | 1.0000 | | 0.8315 | 58.7 | 2.4222 | 1.8 | 2.0678 | 22.8 | 1.6367 | 17.3 | 1.3094 | 43.0
Xia2017 | Xia_UWA_task3_1 | UWA_T3_1 | 1.0000 | | 0.7604 | 59.1 | 1.1556 | 18.8 | 2.1299 | 41.9 | 1.2408 | 17.8 | 1.6022 | 38.0
Xia2017 | Xia_UWA_task3_2 | UWA_T3_1 | 1.0000 | | 0.7214 | 58.1 | 3.8000 | 0.0 | 1.6497 | 42.1 | 1.5673 | 13.1 | 1.2265 | 43.1
Xia2017 | Xia_UWA_task3_3 | UWA_T3_1 | 1.0000 | | 0.7242 | 57.7 | 1.0444 | 20.3 | 1.7797 | 40.7 | 1.3755 | 6.6 | 1.3011 | 39.5
Yu2017 | Yu_FZU_task3_1* | DRF | 1.2159 | 0.0 | 1.2925 | 6.3 | 15.6444 | 4.6 | 1.2938 | 1.7 | 1.3306 | 1.2 | 1.0304 | 1.1
Zhou2017 | Zhou_PKU_task3_1 | MC-LSTM-1 | 1.0455 | 0.0 | 0.7674 | 54.9 | 1.1333 | 0.0 | 1.7345 | 37.2 | 1.0694 | 6.4 | 1.2790 | 34.0
Zhou2017 | Zhou_PKU_task3_2 | MC-LSTM-2 | 1.0227 | 0.0 | 0.8245 | 47.0 | 1.5333 | 0.0 | 1.3220 | 49.8 | 1.0163 | 10.8 | 1.3315 | 32.7
System characteristics
Segment-based ER and F1 (%), computed overall on the evaluation dataset.

Technical Report | Code | Name | ER (eval) | F1 (eval) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making
---|---|---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task3_1 | Ash_1 | 0.7914 | 41.7 | mono | 44.1kHz | | log-mel energies | CRNN | threshold
Adavanne2017 | Adavanne_TUT_task3_2 | Ash_2 | 0.8061 | 42.9 | binaural | 44.1kHz | | log-mel energies | CRNN | threshold
Adavanne2017 | Adavanne_TUT_task3_3 | Ash_3 | 0.8544 | 41.4 | binaural | 44.1kHz | | multi-scale log-mel energies | CRNN | threshold
Adavanne2017 | Adavanne_TUT_task3_4 | Ash_4 | 0.8716 | 36.2 | binaural | 44.1kHz | | spectrogram | CRNN | threshold
Chen2017 | Chen_UR_task3_1 | Chen | 0.8575 | 30.9 | mono | 44.1kHz | | log-mel energies | CNN | median filtering
Dang2017 | Dang_NCU_task3_1 | andang3 | 0.9529 | 42.6 | mono | 44.1kHz | | log-mel energies | CRNN | majority vote
Dang2017 | Dang_NCU_task3_2 | andang3 | 0.9468 | 42.8 | mono | 44.1kHz | | log-mel energies | CRNN | majority vote
Dang2017 | Dang_NCU_task3_3 | andang3 | 1.0318 | 44.2 | mono | 44.1kHz | | log-mel energies | CRNN | majority vote
Dang2017 | Dang_NCU_task3_4 | andang3 | 1.1028 | 43.5 | mono | 44.1kHz | | log-mel energies | CRNN | majority vote
Feroze2017 | Feroze_IST_task3_1 | Khizer | 1.0942 | 42.6 | mixed | 44.1kHz | | Perceptual Linear Predictive | NN | morphological operations
Feroze2017 | Feroze_IST_task3_2 | Khizer | 1.0312 | 39.7 | mixed | 44.1kHz | | Perceptual Linear Predictive | NN | morphological operations
Heittola2017 | DCASE2017 baseline | Baseline | 0.9358 | 42.8 | mono | 44.1kHz | | log-mel energies | MLP | median filtering
Hou2017 | Hou_BUPT_task3_1 | MMS_HYB | 1.0446 | 29.3 | mono | 44.1kHz | | log-mel energies | combination [MLP; BGRU] | median filtering
Hou2017 | Hou_BUPT_task3_2 | BGRU_HYB | 0.9248 | 34.1 | mono | 44.1kHz | | raw audio data | BGRU | median filtering
Kroos2017 | Kroos_CVSSP_task3_1 | J-NEAT-E | 0.8979 | 44.9 | mono | 44.1kHz | | scattering transform, clustering | Neuroevolution | threshold
Kroos2017 | Kroos_CVSSP_task3_2 | J-NEAT-P | 0.8911 | 41.6 | mono | 44.1kHz | | scattering transform, clustering | Neuroevolution | threshold
Kroos2017 | Kroos_CVSSP_task3_3 | SLFFN | 1.0141 | 43.8 | mono | 44.1kHz | | scattering transform, clustering | ANN | threshold
Jeong2017 | Lee_SNU_task3_1 | MICNN_1 | 0.9260 | 42.0 | binaural | 44.1kHz | channel swapping | log-mel energies | CNN | adaptive thresholding
Jeong2017 | Lee_SNU_task3_2 | MICNN_2 | 0.8673 | 27.9 | binaural | 44.1kHz | channel swapping | log-mel energies | CNN | adaptive thresholding
Jeong2017 | Lee_SNU_task3_3 | MICNN_3 | 0.8080 | 40.8 | binaural | 44.1kHz | channel swapping | log-mel energies | CNN | adaptive thresholding
Jeong2017 | Lee_SNU_task3_4 | MICNN_4 | 0.8985 | 43.6 | binaural | 44.1kHz | channel swapping | log-mel energies | CNN | adaptive thresholding
Li2017 | Li_SCUT_task3_1 | LiSCUTt3_1 | 0.9920 | 40.3 | mono | 44.1kHz | | DNN(MFCC) | Bi-LSTM | Top output probability
Li2017 | Li_SCUT_task3_2 | LiSCUTt3_2 | 0.9523 | 41.0 | mono | 44.1kHz | | DNN(MFCC) | Bi-LSTM | Top output probability
Li2017 | Li_SCUT_task3_3 | LiSCUTt3_3 | 1.0043 | 43.4 | mono | 44.1kHz | | DNN(MFCC) | DNN | Top output probability
Li2017 | Li_SCUT_task3_4 | LiSCUTt3_4 | 0.9878 | 33.9 | mono | 44.1kHz | | DNN(MFCC) | Bi-LSTM | Top output probability
Lu2017 | Lu_THU_task3_1 | bigru_da | 0.8251 | 39.6 | mixed | 44.1kHz | pitch shifting, time stretching | MFCC, pitch | RNN, ensemble | median filtering
Lu2017 | Lu_THU_task3_2 | bigru_da | 0.8306 | 39.2 | mixed | 44.1kHz | pitch shifting, time stretching | MFCC, pitch | RNN, ensemble | median filtering
Lu2017 | Lu_THU_task3_3 | bigru_da | 0.8361 | 38.0 | mixed | 44.1kHz | pitch shifting, time stretching | MFCC, pitch | RNN, ensemble | median filtering
Lu2017 | Lu_THU_task3_4 | bigru_da | 0.8373 | 38.3 | mixed | 44.1kHz | pitch shifting, time stretching | MFCC, pitch | RNN, ensemble | median filtering
Wang2017 | Wang_NTHU_task3_1 | NTHU_AHG | 0.9749 | 40.8 | mono, binaural | 44.1kHz | | MFCC, TDOA | RNN | post processing technique
Xia2017 | Xia_UWA_task3_1 | UWA_T3_1 | 0.9523 | 43.5 | mono | 44.1kHz | | log-mel energies | MLP | Class wise distance evaluation (CW)
Xia2017 | Xia_UWA_task3_2 | UWA_T3_1 | 0.9437 | 41.1 | mono | 44.1kHz | | log-mel energies | CNN | median filtering
Xia2017 | Xia_UWA_task3_3 | UWA_T3_1 | 0.8740 | 41.7 | mono | 44.1kHz | | log-mel energies | CNN | Class wise distance evaluation (CW)
Yu2017 | Yu_FZU_task3_1* | DRF | 1.1963 | 3.9 | mono | 16kHz | | mel energies | Deep Random Forest | sliding median filtering
Zhou2017 | Zhou_PKU_task3_1 | MC-LSTM-1 | 0.8526 | 39.1 | right, diff | 44.1kHz | | log-mel energies | LSTM | median filtering
Zhou2017 | Zhou_PKU_task3_2 | MC-LSTM-2 | 0.8526 | 37.3 | right, mean, diff | 44.1kHz | | log-mel energies | LSTM | median filtering
Technical reports
A Report on Sound Event Detection with Different Binaural Features
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Adavanne_TUT_task3_1 Adavanne_TUT_task3_2 Adavanne_TUT_task3_3 Adavanne_TUT_task3_4
Abstract
In this paper, we compare the performance of binaural audio features against single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using a stacked convolutional and recurrent neural network, and the evaluation is reported using the standard metrics of error rate and F-score. The studied binaural features are seen to consistently perform equal to or better than the single-channel features with respect to the error rate metric.
System characteristics
Input | mono; binaural |
Sampling rate | 44.1kHz |
Features | log-mel energies; multi-scale log-mel energies; spectrogram |
Classifier | CRNN |
Decision making | threshold |
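As context for the feature rows above, log-mel energies of the kind used by this and several other submissions can be extracted roughly as follows. This is a sketch under assumptions: the frame size, hop length, and number of mel bands are illustrative, not the submission's exact settings.

```python
import librosa
import numpy as np

def binaural_logmel(path, sr=44100, n_fft=2048, hop=1024, n_mels=40):
    """Extract per-channel log-mel energies from a binaural recording."""
    y, _ = librosa.load(path, sr=sr, mono=False)   # shape: (2, n_samples)
    feats = []
    for channel in y:
        mel = librosa.feature.melspectrogram(
            y=channel, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        feats.append(np.log(mel + 1e-10))          # log-mel energies
    # (2, n_mels, n_frames); a mono system would average the channels first
    return np.stack(feats)
```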
DCASE2017 Sound Event Detection Using Convolutional Neural Network
Yukun Chen, Yichi Zhang and Zhiyao Duan
Electrical and Computer Engineering, University of Rochester, NY, US
Chen_UR_task3_1
Abstract
The DCASE2017 Challenge Task 3 is to develop a sound event detection system for real life audio. In our setup, we merge the two channels into one, compute Mel-band energies as the spectral representation, and train a model based on a convolutional neural network (CNN). The method achieves a 0.81 error rate on average over the four cross-validation folds, demonstrating the practicability of using CNNs for sound event detection.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | median filtering |
Deep Learning for DCASE2017 Challenge
An Dang, Toan Vu and Jia-Ching Wang
Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
Dang_NCU_task3_1 Dang_NCU_task3_2 Dang_NCU_task3_3 Dang_NCU_task3_4
Abstract
This paper reports our results on all tasks of the DCASE 2017 challenge: acoustic scene classification, detection of rare sound events, sound event detection in real life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are built on two widely used neural network architectures: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | majority vote |
Comparison of Baseline System with Perceptual Linear Predictive Feature Using Neural Network for Sound Event Detection in Real Life Audio
Khizer Feroze and Abdur-Rehaman Maud
Electrical Engineering, Institute of Space Technology, Islamabad, Pakistan
Feroze_IST_task3_1 Feroze_IST_task3_2
Abstract
For sound event detection of polyphonic sounds, we compare the performance of the perceptual linear predictive (PLP) feature with Mel frequency cepstral coefficients (MFCC) using a neural network classifier. The results are further compared with the performance of the baseline system of DCASE 2017 (task 3). Our results show that using the PLP-based classifier, the individual error rate (ER) for each event is improved compared to the baseline system: ER is improved by 10% for the car event, 23% for the large vehicle event, and 26% for the people walking event, and some improvements are also observed in other events.
System characteristics
Input | mixed |
Sampling rate | 44.1kHz |
Features | Perceptual Linear Predictive |
Classifier | NN |
Decision making | morphological operations |
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
Abstract
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptron and log mel-energies, but differ in the structure of the output layer and the decision making process, as well as the evaluation of system output using task specific metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP |
Decision making | median filtering |
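The baseline's decision stage, binarizing frame-wise class probabilities with a threshold and then median filtering the activity, can be sketched as below. The threshold and filter length are illustrative assumptions, not the baseline's published settings.

```python
import numpy as np
from scipy.signal import medfilt

def probs_to_activity(probs, threshold=0.5, kernel=27):
    """Binarize frame-wise class probabilities, then median filter each
    class's activity track to suppress spurious short detections.

    probs: array of shape (n_frames, n_classes); kernel must be odd.
    """
    binary = (probs > threshold).astype(float)
    smoothed = np.stack(
        [medfilt(binary[:, c], kernel) for c in range(binary.shape[1])],
        axis=1)
    return smoothed
```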
Sound Event Detection in Real Life Audio Using Multi-Model System
Yuanbo Hou and Shengchen Li
Embedded Artificial Intelligence Laboratory, Beijing University of Posts and Telecommunications, Beijing, China
Hou_BUPT_task3_1 Hou_BUPT_task3_2
Abstract
In this paper, we present a polyphonic sound event detection (SED) system based on a multi-model approach. In the proposed multi-model system, we use one model based on deep neural networks (DNN) to detect the car sound event, and five models based on bi-directional gated recurrent unit recurrent neural networks (BGRU-RNN) to detect the other sound events: brakes squeaking, children, large vehicle, people speaking, and people walking. Since different classes of sound events have different audio characteristics, we use a different model to detect each class. The proposed multi-model system is trained and tested on the IEEE DCASE2017 Challenge: Sound Event Detection in Real Life Audio (Task 3) development dataset, yielding up to 58.92% F-score and a 0.60 error rate on segment-based metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies; raw audio data |
Classifier | combination [MLP; BGRU]; BGRU |
Decision making | median filtering |
Audio Event Detection Using Multiple-Input Convolutional Neural Network
Il-Young Jeong1,2, Subin Lee1,2, Yoonchang Han1 and Kyogu Lee2
1Cochlear.ai, Seoul, Korea, 2Music and Audio Research Group, Seoul National University, Seoul, Korea
Lee_SNU_task3_1 Lee_SNU_task3_2 Lee_SNU_task3_3 Lee_SNU_task3_4
Abstract
This paper describes the model and training framework of our submission for DCASE 2017 task 3: sound event detection in real life audio. Our model follows a convolutional neural network architecture, but uses two inputs derived from the short- and long-term audio signal. In the training stage, we calculated validation errors more frequently than once per epoch, with adaptive thresholds. We also used class-wise early stopping to find the best model for each class. The proposed model shows meaningful improvements in cross-validation experiments compared to the baseline system using a simple neural network.
System characteristics
Input | binaural |
Sampling rate | 44.1kHz |
Data augmentation | channel swapping |
Features | log-mel energies |
Classifier | CNN |
Decision making | adaptive thresholding |
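Channel swapping, the augmentation listed above, simply mirrors the left and right channels of binaural examples to double the training data. A minimal sketch, with the tensor layout as an assumption:

```python
import numpy as np

def channel_swap(batch):
    """Binaural channel-swapping augmentation.

    batch: features of shape (n_examples, 2, n_mels, n_frames).
    Returns the batch plus copies with left/right channels exchanged.
    """
    swapped = batch[:, ::-1]  # reverse the channel axis
    return np.concatenate([batch, swapped], axis=0)
```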
Neuroevolution for Sound Event Detection in Real Life Audio: A Pilot Study
Christian Kroos and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Kroos_CVSSP_task3_1 Kroos_CVSSP_task3_2 Kroos_CVSSP_task3_3
Abstract
Neuroevolution techniques combine genetic algorithms with artificial neural networks, some of them evolving network topology along with the network weights. One of these latter techniques is the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. For this pilot study we devised an extended variant (joint NEAT, J-NEAT), introducing co-evolution, and applied it to sound event detection in real life audio (task 3) in the DCASE 2017 challenge. Our research question was whether small networks could be evolved that would be able to compete with the much larger networks now typical for classification and detection tasks. We used the wavelet-based deep scattering transform and k-means clustering across the resulting scales (not across samples) to provide J-NEAT with a compact representation of the acoustic input. Results show that J-NEAT is capable of evolving small networks that match the performance of the baseline system in terms of the segment-based error metrics, while exhibiting a substantially better event-related error rate. The evolved networks were, however, narrowly outperformed by a comparable, experimenter-designed minimal single-layer feed-forward network. We discuss the question of evolving versus learning for supervised tasks.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | scattering transform, clustering |
Classifier | Neuroevolution; ANN |
Decision making | threshold |
The SEIE-SCUT Systems for IEEE AASP Challenge on DCASE 2017: Deep Learning Techniques for Audio Representation and Classification
Yanxiong Li and Xianku Li
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Li_SCUT_task3_1 Li_SCUT_task3_2 Li_SCUT_task3_3 Li_SCUT_task3_4
Abstract
In this report, we present our work on three tasks of the IEEE AASP challenge on DCASE 2017, i.e. task 1: Acoustic Scene Classification (ASC), task 2: detection of rare sound events in artificially created mixtures, and task 3: sound event detection in real life recordings. Tasks 2 and 3 belong to the same problem, i.e. Sound Event Detection (SED). We adopt deep learning techniques to extract a Deep Audio Feature (DAF) and classify various acoustic scenes or sound events. Specifically, a Deep Neural Network (DNN) is first built for generating the DAF from Mel-Frequency Cepstral Coefficients (MFCCs), and then a Recurrent Neural Network (RNN) of Bi-directional Long Short Term Memory (Bi-LSTM) fed by the DAF is built for ASC and SED. Evaluated on the development datasets of DCASE 2017, our systems are superior to the corresponding baselines for tasks 1 and 2, and our system for task 3 performs as well as the baseline in terms of the predominant metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | DNN(MFCC) |
Classifier | Bi-LSTM; DNN |
Decision making | Top output probability |
Bidirectional GRU for Sound Event Detection
Rui Lu1 and Zhiyao Duan2
1Department of Automation, Tsinghua University, Beijing, China, 2Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY USA
Lu_THU_task3_1 Lu_THU_task3_2 Lu_THU_task3_3 Lu_THU_task3_4
Abstract
Sound event detection (SED) aims to detect the temporal boundaries of sound events in audio streams. Sound recordings in real life situations typically have many overlapping events, making this detection task much more difficult than classification and non-overlapping detection. Recently, multi-label recurrent neural networks (RNNs) have become the mainstream solution for this polyphonic sound event detection problem. However, similar to many other deep learning approaches, the relative scarcity of carefully labeled data has limited the capacity of RNNs. In this paper, we first present a multi-label bi-directional recurrent neural network to model the temporal evolution of sound events. Then we propose the use of data augmentation to overcome the problem of data scarcity and explore the augmentation strategies that achieve better performance. We evaluate our approach on the development subset of the DCASE2017 task 3 dataset. Combined with data augmentation and an ensemble technique, we reduce the error rate by over 11% compared to the officially published baseline system.
System characteristics
Input | mixed |
Sampling rate | 44.1kHz |
Data augmentation | pitch shifting, time stretching |
Features | MFCC, pitch |
Classifier | RNN, ensemble |
Decision making | median filtering |
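Pitch-shifting and time-stretching augmentations of the kind listed above can be generated with librosa. A sketch under stated assumptions: the shift and stretch amounts below are illustrative, not the values tuned in the submission.

```python
import librosa

def augment(y, sr=44100):
    """Return pitch-shifted and time-stretched copies of a signal y."""
    variants = []
    for n_steps in (-2, -1, 1, 2):  # shift in semitones
        variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps))
    for rate in (0.9, 1.1):         # stretch factor (<1 slower, >1 faster)
        variants.append(librosa.effects.time_stretch(y, rate=rate))
    return variants
```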
Sound Event Detection From Real-Life Audio by Training a Long Short-Term Memory Network with Mono and Stereo Features
Chun-Hao Wang, Jun-Kai You and Yi-Wen Liu
Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan
Wang_NTHU_task3_1
Abstract
In this paper, we trained and evaluated an acoustic sound event classifier that uses a combination of stereo and mono features. For stereo features, we treated the time difference of arrival (TDOA) as a random variable and calculated its probability density function. For mono features, Mel-frequency cepstral coefficients (MFCCs) and their 1st and 2nd derivatives were extracted. A recurrent neural network (RNN) with long short-term memory (LSTM) was constructed to perform multi-label classification. Training on the 4-fold validation dataset given by the DCASE2017 challenge [5], we chose model parameters based on the best average performance. The proposed TDOA plus MFCC features combined with the RNN-LSTM model achieved a segment-based error rate of 0.77.
System characteristics
Input | mono, binaural |
Sampling rate | 44.1kHz |
Features | MFCC, TDOA |
Classifier | RNN |
Decision making | post processing technique |
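One common way to estimate the TDOA feature mentioned above from a stereo frame pair is GCC-PHAT; the sketch below illustrates the general technique, not necessarily the authors' exact implementation.

```python
import numpy as np

def gcc_phat_tdoa(x, y, n_fft=2048):
    """Estimate the delay (in samples) between two channels via GCC-PHAT.

    x, y: time-domain frames from the left and right channels.
    """
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting
    cc = np.fft.irfft(cross, n=n_fft)
    cc = np.concatenate((cc[-n_fft // 2:], cc[:n_fft // 2]))  # center zero lag
    return int(np.argmax(np.abs(cc))) - n_fft // 2
```

Collecting this delay over many frames gives the empirical distribution of TDOA that the paper models as a probability density function.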
Class Wise Distance Based Acoustic Event Detection
Xianjun Xia1, Roberto Togneri1, Ferdous Sohel2 and David Huang1
1School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, Australia, 2School of Engineering and Information Technology, Murdoch University, Perth, Australia
Xia_UWA_task3_1 Xia_UWA_task3_2 Xia_UWA_task3_3
Abstract
In this paper, we propose a class-wise distance based approach for a neural network based acoustic event detection system. The neural network output probabilities are updated by calculating the distance between the acoustic features of each frame and the class-wise reference of each event class. The detected acoustic segments are then re-evaluated segmentally using the class-wise distances. Cross-validation detection results on the development set of DCASE2017 show the effectiveness of the proposed method, achieving a 4% absolute reduction in segment-based error rate compared to the baseline system.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP; CNN |
Decision making | Class wise distance evaluation (CW); median filtering |
Sound Event Detection Using Deep Random Forest
Chun-Yan Yu, Huang Liu and Zi-Ming Qi
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
Yu_FZU_task3_1*
* The system was re-submitted after the deadline. The revised submission yielded a substantially lower ER, with the difference in performance attributed to a software bug in the original submission. More details can be found in the technical report.
Abstract
In this paper, we present our work on Task 3, Sound Event Detection in Real Life Audio [1]. The systems aim at detecting overlapping audio events, with detectors based on the deep random forest, a decision tree ensemble approach. Because random forests have a natural deficiency in detecting and classifying polyphonic events, the systems use a one-vs-the-rest (OvR) multiclass/multilabel strategy, fitting one deep random forest per event class. On the development dataset, the system obtained an error rate of 0.82 and an F-score of 38.2%.
System characteristics
Input | mono |
Sampling rate | 16kHz |
Features | mel energies |
Classifier | Deep Random Forest |
Decision making | sliding median filtering |
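The one-vs-the-rest strategy described in the abstract, one binary detector per event class, can be sketched with scikit-learn. As an assumption for illustration, a plain random forest stands in for the deep (cascaded) random forest of the submission:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class OvRForestSED:
    """One-vs-the-rest SED: one forest per event class, each predicting
    the frame-wise activity of its own class."""

    def __init__(self, n_classes, n_estimators=200):
        self.models = [RandomForestClassifier(n_estimators=n_estimators)
                       for _ in range(n_classes)]

    def fit(self, X, Y):
        # X: (n_frames, n_features); Y: (n_frames, n_classes), binary
        for c, model in enumerate(self.models):
            model.fit(X, Y[:, c])
        return self

    def predict(self, X):
        return np.stack([m.predict(X) for m in self.models], axis=1)
```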
Sound Event Detection in Multichannel Audio LSTM Network
Jianchao Zhou
Institute of Computer Science & Technology, Peking University, Beijing, China
Zhou_PKU_task3_1 Zhou_PKU_task3_2
Abstract
In this paper, a polyphonic sound event detection system is proposed. The system uses log mel-band energy features with a long short-term memory (LSTM) recurrent neural network. Human listeners successfully recognize overlapping sound events using two ears; motivated by this, we propose to extend the system to use multichannel audio data. From the two channels of the original stereo audio signal, we construct three different channel signals and use different fusion strategies to extend our system. Experiments show that our system achieves superior performance compared with the baselines.
System characteristics
Input | right, diff; right, mean, diff |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | LSTM |
Decision making | median filtering |
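The input variants listed above (right, mean, diff) can be derived from the stereo signal as in the following sketch; the function name and layout are illustrative assumptions.

```python
import numpy as np

def channel_variants(stereo):
    """Construct mono variants from a stereo signal of shape (2, n_samples):
    the right channel, the channel mean, and the channel difference."""
    left, right = stereo
    return {
        "right": right,
        "mean": 0.5 * (left + right),
        "diff": left - right,
    }
```

Per the system characteristics table, MC-LSTM-1 combines the right and diff signals, while MC-LSTM-2 additionally uses the mean.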