Task description
The task evaluates systems for the large-scale detection of sound events using weakly labeled training data. It employs a subset of the AudioSet dataset consisting of 17 sound event classes from two categories ("Warning sounds" and "Vehicle sounds").
A detailed task description can be found on the task description page.
Challenge results
A detailed description of the metrics used can be found here.
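The subtask B results below report the segment-based error rate (ER) and F1 described in that metrics documentation. As a rough reference, here is a minimal NumPy sketch of how segment-based ER and F1 are typically computed from binary activity matrices; the one-second segment length and the toy data are assumptions, and the official evaluation used the challenge's own tooling.

```python
import numpy as np

def segment_based_metrics(ref, est):
    """Segment-based ER and F1 from binary activity matrices.

    ref, est: arrays of shape (n_segments, n_classes); 1 means the class
    is active in that segment (segments are typically one second long).
    """
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum()
    fn = np.logical_and(ref == 1, est == 0).sum()
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)

    # Error rate counts substitutions, deletions and insertions per segment.
    fn_seg = np.logical_and(ref == 1, est == 0).sum(axis=1)
    fp_seg = np.logical_and(ref == 0, est == 1).sum(axis=1)
    subs = np.minimum(fn_seg, fp_seg).sum()
    dels = (fn_seg - np.minimum(fn_seg, fp_seg)).sum()
    ins = (fp_seg - np.minimum(fn_seg, fp_seg)).sum()
    er = (subs + dels + ins) / (ref.sum() + 1e-12)
    return er, f1

# Toy example: 4 one-second segments, 3 classes.
ref = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]])
est = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]])
print(segment_based_metrics(ref, est))
```

Note that this sketch returns F1 as a fraction, whereas the tables below report it as a percentage.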
Systems ranking
Subtask A - Audio tagging
Overall F1, precision, and recall (%) on the evaluation and development datasets.

Technical Report | Submission code | Submission name | F1 (eval) | Precision (eval) | Recall (eval) | F1 (dev) | Precision (dev) | Recall (dev)
---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_1 | Ash_1 | 45.5 | 57.2 | 37.9 | ||||
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 46.6 | 58.0 | 38.9 | 43.2 | 47.5 | 39.6 | |
Adavanne2017 | Adavanne_TUT_task4_3 | Ash_3 | 44.5 | 55.8 | 37.1 | ||||
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 26.3 | 33.2 | 21.8 | ||||
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 47.6 | 43.8 | 52.2 | ||||
Chou2017 | Chou_SINICA_task4_2 | FCNN_SM_2 | 49.0 | 51.9 | 46.4 | ||||
Chou2017 | Chou_SINICA_task4_3 | FCNN_SM_3 | 47.9 | 48.4 | 47.4 | ||||
Chou2017 | Chou_SINICA_task4_4 | FCNN_SM_3 | 49.0 | 53.8 | 45.0 | ||||
Badlani2017 | DCASE2017 baseline | Baseline | 18.2 | 15.0 | 23.1 | 10.9 | 7.8 | 17.5 | |
Kukanov2017 | Kukanov_UEF_task4_1 | K-CRNN-MFoM | 39.6 | 47.6 | 33.9 | 33.5 | 35.1 | 32.0 | |
Lee2017 | Lee_KAIST_task4_1 | SDCNN_MAC | 40.3 | 31.3 | 56.7 | 35.3 | 25.6 | 56.7 | |
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 47.3 | 48.0 | 46.6 | 41.2 | 37.6 | 45.7 | |
Lee2017 | Lee_KAIST_task4_3 | MLMS3_MAC | 47.2 | 49.6 | 45.0 | 38.7 | 37.3 | 40.2 | |
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 47.1 | 48.5 | 45.9 | 40.1 | 37.4 | 43.2 | |
Lee2017a | Lee_SNU_task4_1 | EMSI1 | 52.3 | 77.1 | 39.6 | 47.6 | 68.3 | 36.5 | |
Lee2017a | Lee_SNU_task4_2 | EMSI2 | 52.3 | 77.1 | 39.6 | 47.5 | 66.7 | 36.8 | |
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 52.6 | 69.7 | 42.3 | 57.0 | 70.3 | 47.9 | |
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 52.1 | 77.4 | 39.3 | 48.9 | 70.3 | 37.4 | |
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 46.0 | 50.7 | 42.1 | 45.9 | 44.7 | 47.0 | |
Salamon2017 | Salamon_NYU_task4_2 | Salamon_2 | 45.3 | 46.8 | 43.8 | 44.0 | 39.9 | 49.0 | |
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 44.9 | 62.8 | 35.0 | 45.5 | 53.7 | 39.4 | |
Salamon2017 | Salamon_NYU_task4_4 | Salamon_4 | 38.1 | 73.9 | 25.7 | 38.0 | 63.0 | 27.2 | |
Vu2017 | Toan_NCU_task4_1 | ToanVu1 | 48.5 | 54.7 | 43.6 | 51.8 | 54.2 | 49.5 | |
Vu2017 | Toan_NCU_task4_2 | ToanVu2 | 46.5 | 47.3 | 45.6 | 49.5 | 45.2 | 54.6 | |
Tseng2017 | Tseng_Bosch_task4_1 | Bosch1 | 35.0 | 34.1 | 36.0 | 29.5 | 26.8 | 32.7 | |
Tseng2017 | Tseng_Bosch_task4_2 | Bosch2 | 35.1 | 34.0 | 36.2 | 29.0 | 26.5 | 31.9 | |
Tseng2017 | Tseng_Bosch_task4_3 | Bosch3 | 35.2 | 31.6 | 39.7 | 33.1 | 27.9 | 40.6 | |
Tseng2017 | Tseng_Bosch_task4_4 | Bosch4 | 35.2 | 33.9 | 36.7 | 31.2 | 28.0 | 35.3 | |
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 54.4 | 57.8 | 51.3 | 61.9 | 59.4 | 64.7 | |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 55.6 | 61.4 | 50.8 | ||||
Xu2017 | Xu_CVSSP_task4_3 | Surrey3AB | 54.2 | 58.9 | 50.2 | ||||
Xu2017 | Xu_CVSSP_task4_4 | Surrey4AB | 52.8 | 53.5 | 52.1 |
Subtask B - Sound event detection
Segment-based overall error rate (ER) and F1 (%) on the evaluation and development datasets.

Technical Report | Submission code | Submission name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev)
---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_1 | Ash_1 | 0.8100 | 47.9 | 0.8400 | 38.8 | |
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 0.8000 | 48.3 | 0.8400 | 38.1 | |
Adavanne2017 | Adavanne_TUT_task4_3 | Ash_3 | 0.8200 | 48.9 | 0.8400 | 38.6 | |
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 0.7900 | 49.0 | 0.8100 | 41.1 | |
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 0.8300 | 42.4 | |||
Badlani2017 | DCASE2017 baseline | Baseline | 0.9300 | 28.4 | 1.0200 | 13.8 | |
Lee2017 | Lee_KAIST_task4_1 | SDCNN_MAC | 0.8200 | 39.4 | 0.8800 | 28.1 | |
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 0.7800 | 42.6 | 0.8600 | 30.8 | |
Lee2017 | Lee_KAIST_task4_3 | MLMS3_MAC | 0.7800 | 44.2 | 0.8600 | 31.3 | |
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 0.7500 | 47.1 | 0.8400 | 34.2 | |
Lee2017a | Lee_SNU_task4_1 | EMSI1 | 0.6700 | 54.4 | 0.7200 | 45.9 | |
Lee2017a | Lee_SNU_task4_2 | EMSI2 | 0.6700 | 54.4 | 0.8300 | 42.9 | |
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 0.6700 | 55.4 | 0.7000 | 47.7 | |
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 0.6600 | 55.5 | 0.7100 | 47.1 | |
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 0.8200 | 46.2 | 0.8400 | 40.3 | |
Salamon2017 | Salamon_NYU_task4_2 | Salamon_2 | 0.8500 | 45.6 | 0.8605 | 39.3 | |
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 0.7700 | 45.9 | 0.7607 | 41.0 | |
Salamon2017 | Salamon_NYU_task4_4 | Salamon_4 | 0.7700 | 45.9 | 0.7607 | 41.0 | |
Vu2017 | Toan_NCU_task4_2 | ToanVu2 | 0.9400 | 43.0 | 0.9300 | 40.9 | |
Vu2017 | Toan_NCU_task4_3 | ToanVu3 | 0.9000 | 42.7 | 0.9000 | 39.9 | |
Vu2017 | Toan_NCU_task4_4 | ToanVu4 | 0.8700 | 41.6 | 0.8900 | 37.9 | |
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 0.7300 | 51.8 | 0.7200 | 49.7 | |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 0.7800 | 47.5 | |||
Xu2017 | Xu_CVSSP_task4_3 | Surrey3AB | 1.0100 | 52.1 | |||
Xu2017 | Xu_CVSSP_task4_4 | Surrey4AB | 0.8000 | 50.4 |
Teams ranking
The tables below include only the best-performing system per submitting team.
Subtask A - Audio tagging
Technical Report | Submission code | Submission name | F1 (eval) | Precision (eval) | Recall (eval) | F1 (dev) | Precision (dev) | Recall (dev)
---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 46.6 | 58.0 | 38.9 | 43.2 | 47.5 | 39.6 | |
Chou2017 | Chou_SINICA_task4_3 | FCNN_SM_3 | 47.9 | 48.4 | 47.4 | ||||
Badlani2017 | DCASE2017 baseline | Baseline | 18.2 | 15.0 | 23.1 | 10.9 | 7.8 | 17.5 | |
Kukanov2017 | Kukanov_UEF_task4_1 | K-CRNN-MFoM | 39.6 | 47.6 | 33.9 | 33.5 | 35.1 | 32.0 | |
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 47.3 | 48.0 | 46.6 | 41.2 | 37.6 | 45.7 | |
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 52.6 | 69.7 | 42.3 | 57.0 | 70.3 | 47.9 | |
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 46.0 | 50.7 | 42.1 | 45.9 | 44.7 | 47.0 | |
Vu2017 | Toan_NCU_task4_1 | ToanVu1 | 48.5 | 54.7 | 43.6 | 51.8 | 54.2 | 49.5 | |
Tseng2017 | Tseng_Bosch_task4_3 | Bosch3 | 35.2 | 31.6 | 39.7 | 33.1 | 27.9 | 40.6 | |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 55.6 | 61.4 | 50.8 |
Subtask B - Sound event detection
Technical Report | Submission code | Submission name | ER (eval) | F1 (eval) | ER (dev) | F1 (dev)
---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 0.7900 | 49.0 | 0.8100 | 41.1 | |
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 0.8300 | 42.4 | |||
Badlani2017 | DCASE2017 baseline | Baseline | 0.9300 | 28.4 | 1.0200 | 13.8 | |
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 0.7500 | 47.1 | 0.8400 | 34.2 | |
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 0.6600 | 55.5 | 0.7100 | 47.1 | |
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 0.7700 | 45.9 | 0.7607 | 41.0 | |
Vu2017 | Toan_NCU_task4_4 | ToanVu4 | 0.8700 | 41.6 | 0.8900 | 37.9 | |
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 0.7300 | 51.8 | 0.7200 | 49.7 |
Class-wise performance
Subtask A - Audio tagging
Class-wise F1 (%) on the evaluation dataset. Warning sound classes span "Air horn, truck horn" through "Train horn"; vehicle sound classes span "Bicycle" through "Truck".

Technical Report | Overall F1 (eval) | Submission code | Submission name | Air horn, truck horn | Ambulance (siren) | Car alarm | Civil defense siren | Fire engine, fire truck (siren) | Police car (siren) | Reversing beeps | Screaming | Train horn | Bicycle | Bus | Car | Car passing by | Motorcycle | Skateboard | Train | Truck
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Adavanne2017 | 45.5 | Adavanne_TUT_task4_1 | Ash_1 | 7.8 | 0.0 | 0.0 | 78.6 | 51.7 | 48.9 | 8.5 | 77.9 | 21.8 | 39.0 | 21.1 | 67.2 | 0.0 | 54.7 | 79.5 | 71.8 | 50.0 | |
Adavanne2017 | 46.6 | Adavanne_TUT_task4_2 | Ash_2 | 43.2 | 0.0 | 0.0 | 82.3 | 50.8 | 45.0 | 0.0 | 78.7 | 27.9 | 37.5 | 23.9 | 68.5 | 0.0 | 60.2 | 80.2 | 70.6 | 53.7 | |
Adavanne2017 | 44.5 | Adavanne_TUT_task4_3 | Ash_3 | 16.8 | 0.0 | 0.0 | 80.7 | 54.0 | 48.9 | 0.0 | 70.5 | 9.1 | 32.4 | 7.6 | 68.3 | 0.0 | 62.3 | 80.5 | 66.7 | 52.5 | |
Adavanne2017 | 26.3 | Adavanne_TUT_task4_4 | Ash_4 | 0.0 | 0.0 | 0.0 | 54.9 | 53.6 | 28.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 63.7 | 0.0 | 0.0 | 2.2 | 4.3 | 19.1 | |
Chou2017 | 47.6 | Chou_SINICA_task4_1 | FCNN_SM_1 | 55.5 | 48.8 | 48.4 | 80.3 | 56.2 | 57.3 | 37.5 | 84.0 | 68.8 | 39.6 | 36.7 | 64.6 | 28.9 | 52.8 | 78.8 | 68.7 | 46.8 | |
Chou2017 | 49.0 | Chou_SINICA_task4_2 | FCNN_SM_2 | 50.7 | 37.8 | 47.3 | 82.0 | 57.1 | 60.1 | 33.3 | 80.8 | 69.5 | 42.1 | 36.9 | 67.0 | 32.9 | 58.7 | 79.3 | 68.5 | 52.9 | |
Chou2017 | 47.9 | Chou_SINICA_task4_3 | FCNN_SM_3 | 55.1 | 60.3 | 57.8 | 81.6 | 57.3 | 47.0 | 36.9 | 84.0 | 68.8 | 40.8 | 35.3 | 66.0 | 32.1 | 56.6 | 76.4 | 67.5 | 52.7 | |
Chou2017 | 49.0 | Chou_SINICA_task4_4 | FCNN_SM_3 | 48.2 | 36.6 | 45.6 | 82.8 | 58.1 | 61.7 | 33.9 | 80.8 | 69.5 | 37.4 | 36.4 | 67.4 | 29.3 | 57.8 | 79.8 | 67.9 | 53.5 | |
Badlani2017 | 18.2 | DCASE2017 baseline | Baseline | 0.0 | 0.0 | 0.0 | 48.0 | 19.4 | 38.8 | 0.0 | 0.0 | 14.2 | 4.2 | 0.0 | 30.0 | 0.0 | 14.3 | 0.0 | 7.8 | 0.0 | |
Kukanov2017 | 39.6 | Kukanov_UEF_task4_1 | K-CRNN-MFoM | 0.0 | 3.8 | 0.0 | 80.7 | 0.5 | 55.1 | 0.0 | 45.4 | 27.9 | 21.4 | 10.8 | 57.1 | 0.0 | 63.5 | 57.4 | 61.8 | 41.7 | |
Lee2017 | 40.3 | Lee_KAIST_task4_1 | SDCNN_MAC | 34.1 | 52.7 | 22.8 | 70.2 | 48.4 | 52.9 | 45.9 | 79.4 | 59.8 | 33.5 | 31.7 | 46.0 | 35.4 | 57.3 | 76.6 | 68.4 | 33.6 | |
Lee2017 | 47.3 | Lee_KAIST_task4_2 | MLMS5_MAC | 30.0 | 50.0 | 15.8 | 82.2 | 59.3 | 53.4 | 33.3 | 79.2 | 72.0 | 48.4 | 34.5 | 60.9 | 15.1 | 61.7 | 74.7 | 72.0 | 42.5 | |
Lee2017 | 47.2 | Lee_KAIST_task4_3 | MLMS3_MAC | 26.2 | 36.8 | 5.6 | 78.2 | 56.7 | 57.7 | 30.2 | 78.2 | 53.9 | 33.7 | 40.0 | 62.5 | 21.1 | 63.9 | 78.5 | 70.0 | 48.0 | |
Lee2017 | 47.1 | Lee_KAIST_task4_4 | MLMS8_MAC | 24.1 | 38.4 | 10.9 | 80.0 | 53.9 | 57.0 | 30.2 | 78.0 | 54.8 | 35.7 | 37.2 | 63.8 | 18.2 | 64.3 | 74.3 | 69.8 | 44.3 | |
Lee2017a | 52.3 | Lee_SNU_task4_1 | EMSI1 | 58.8 | 3.8 | 35.3 | 89.5 | 59.6 | 43.6 | 26.4 | 72.4 | 50.3 | 23.1 | 9.4 | 78.9 | 0.0 | 66.7 | 82.3 | 84.8 | 39.5 | |
Lee2017a | 52.3 | Lee_SNU_task4_2 | EMSI2 | 58.8 | 3.8 | 35.3 | 89.5 | 59.6 | 43.6 | 26.4 | 72.4 | 50.3 | 23.1 | 9.4 | 78.9 | 0.0 | 66.7 | 82.3 | 84.8 | 39.5 | |
Lee2017a | 52.6 | Lee_SNU_task4_3 | EMSI3 | 58.8 | 0.0 | 37.6 | 87.2 | 53.7 | 52.0 | 31.0 | 74.1 | 64.0 | 24.6 | 8.8 | 74.6 | 0.0 | 65.5 | 81.5 | 85.2 | 45.2 | |
Lee2017a | 52.1 | Lee_SNU_task4_4 | EMSI4 | 57.8 | 3.8 | 35.3 | 87.5 | 54.2 | 44.4 | 29.6 | 71.6 | 50.3 | 25.9 | 6.5 | 79.1 | 0.0 | 66.0 | 81.5 | 85.2 | 39.7 | |
Salamon2017 | 46.0 | Salamon_NYU_task4_1 | Salamon_1 | 52.0 | 36.6 | 31.5 | 79.0 | 55.8 | 57.1 | 48.5 | 66.2 | 68.4 | 38.9 | 21.1 | 31.5 | 2.4 | 59.5 | 61.7 | 63.0 | 31.6 | |
Salamon2017 | 45.3 | Salamon_NYU_task4_2 | Salamon_2 | 0.5 | 41.4 | 39.1 | 80.0 | 56.6 | 44.3 | 36.1 | 65.5 | 74.7 | 36.2 | 20.2 | 61.9 | 0.0 | 56.4 | 65.3 | 67.7 | 36.3 | |
Salamon2017 | 44.9 | Salamon_NYU_task4_3 | Salamon_3 | 45.9 | 16.4 | 18.4 | 81.1 | 48.6 | 60.4 | 32.1 | 62.7 | 59.4 | 27.3 | 6.2 | 70.9 | 2.8 | 63.6 | 55.8 | 59.2 | 30.6 | |
Salamon2017 | 38.1 | Salamon_NYU_task4_4 | Salamon_4 | 26.8 | 0.0 | 2.9 | 81.5 | 47.3 | 39.6 | 20.0 | 32.8 | 38.8 | 11.1 | 0.0 | 75.7 | 0.0 | 56.2 | 41.4 | 49.4 | 17.5 | |
Vu2017 | 48.5 | Toan_NCU_task4_1 | ToanVu1 | 47.0 | 57.1 | 38.6 | 82.9 | 54.1 | 55.8 | 51.5 | 0.8 | 69.9 | 28.6 | 31.1 | 70.5 | 35.2 | 60.5 | 63.9 | 73.5 | 42.9 | |
Vu2017 | 46.5 | Toan_NCU_task4_2 | ToanVu2 | 54.8 | 46.3 | 51.0 | 67.9 | 57.0 | 44.6 | 61.0 | 66.7 | 67.3 | 31.1 | 31.2 | 66.1 | 24.1 | 58.5 | 65.9 | 73.1 | 43.1 | |
Tseng2017 | 35.0 | Tseng_Bosch_task4_1 | Bosch1 | 44.4 | 36.4 | 14.1 | 69.7 | 46.4 | 49.8 | 4.2 | 47.1 | 33.3 | 20.0 | 17.1 | 61.1 | 17.9 | 37.5 | 35.6 | 31.0 | 34.4 | |
Tseng2017 | 35.1 | Tseng_Bosch_task4_2 | Bosch2 | 44.4 | 36.4 | 16.3 | 69.7 | 46.1 | 49.5 | 6.9 | 57.6 | 41.3 | 20.2 | 17.1 | 60.5 | 17.9 | 37.5 | 37.5 | 26.5 | 34.4 | |
Tseng2017 | 35.2 | Tseng_Bosch_task4_3 | Bosch3 | 42.9 | 35.9 | 21.2 | 70.1 | 40.5 | 46.8 | 15.1 | 46.6 | 41.8 | 18.0 | 19.7 | 54.8 | 20.6 | 36.9 | 36.9 | 46.1 | 36.3 | |
Tseng2017 | 35.2 | Tseng_Bosch_task4_4 | Bosch4 | 43.4 | 36.4 | 16.3 | 69.7 | 46.4 | 48.2 | 16.1 | 49.7 | 40.5 | 20.0 | 17.1 | 60.7 | 17.9 | 37.5 | 36.9 | 30.0 | 34.4 | |
Xu2017 | 54.4 | Xu_CVSSP_task4_1 | Surrey1AB | 54.3 | 59.5 | 78.8 | 83.9 | 63.2 | 62.2 | 65.8 | 86.6 | 80.2 | 39.1 | 32.5 | 71.5 | 41.8 | 64.5 | 71.6 | 80.1 | 46.4 | |
Xu2017 | 55.6 | Xu_CVSSP_task4_2 | Surrey2AB | 63.7 | 35.6 | 72.9 | 86.4 | 65.7 | 63.8 | 60.3 | 91.2 | 73.6 | 40.5 | 39.7 | 72.9 | 27.1 | 63.5 | 74.5 | 79.2 | 52.3 | |
Xu2017 | 54.2 | Xu_CVSSP_task4_3 | Surrey3AB | 59.5 | 52.7 | 72.5 | 85.0 | 53.2 | 43.0 | 65.9 | 88.9 | 74.9 | 44.2 | 41.7 | 73.0 | 39.1 | 69.4 | 73.1 | 80.5 | 46.5 | |
Xu2017 | 52.8 | Xu_CVSSP_task4_4 | Surrey4AB | 63.1 | 58.9 | 70.9 | 81.5 | 62.1 | 57.6 | 66.7 | 82.3 | 76.5 | 28.6 | 36.6 | 69.4 | 32.8 | 65.0 | 72.0 | 75.1 | 44.0 |
Subtask B - Sound event detection
Segment-based ER and F1 (%) on the evaluation dataset, overall and per class; for each class the ER column is followed by the corresponding F1 column. Warning sound classes span "Air horn, truck horn" through "Train horn"; vehicle sound classes span "Bicycle" through "Truck".

Technical Report | Submission code | Submission name | ER (overall) | F1 (overall) | ER / Air horn, truck horn | F1 / Air horn, truck horn | ER / Ambulance (siren) | F1 / Ambulance (siren) | ER / Car alarm | F1 / Car alarm | ER / Civil defense siren | F1 / Civil defense siren | ER / Fire engine, fire truck (siren) | F1 / Fire engine, fire truck (siren) | ER / Police car (siren) | F1 / Police car (siren) | ER / Reversing beeps | F1 / Reversing beeps | ER / Screaming | F1 / Screaming | ER / Train horn | F1 / Train horn | ER / Bicycle | F1 / Bicycle | ER / Bus | F1 / Bus | ER / Car | F1 / Car | ER / Car passing by | F1 / Car passing by | ER / Motorcycle | F1 / Motorcycle | ER / Skateboard | F1 / Skateboard | ER / Train | F1 / Train | ER / Truck | F1 / Truck
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_1 | Ash_1 | 0.8100 | 47.9 | 1.0700 | 0.0 | 1.0000 | 1.0300 | 2.8 | 0.3400 | 83.0 | 0.9700 | 54.4 | 1.1700 | 42.3 | 0.9700 | 6.8 | 1.3200 | 51.2 | 0.9600 | 47.3 | 1.5300 | 24.6 | 1.0100 | 8.1 | 1.3300 | 53.0 | 1.1000 | 3.3 | 0.9200 | 50.8 | 0.8500 | 62.6 | 0.7800 | 62.0 | 0.9900 | 44.4 | ||
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 0.8000 | 48.3 | 1.0000 | 0.6 | 1.0000 | 1.3200 | 0.4 | 0.4300 | 80.4 | 1.2000 | 44.5 | 1.2000 | 31.5 | 1.0000 | 0.9100 | 58.1 | 1.1500 | 45.3 | 1.3000 | 26.0 | 1.0400 | 10.3 | 1.0700 | 56.4 | 1.0000 | 0.8200 | 57.2 | 1.0400 | 57.7 | 0.8300 | 62.9 | 1.0000 | 46.9 | ||||
Adavanne2017 | Adavanne_TUT_task4_3 | Ash_3 | 0.8200 | 48.9 | 1.1600 | 2.5 | 1.0000 | 1.0000 | 0.3100 | 84.8 | 0.9900 | 49.8 | 1.0800 | 38.1 | 1.0700 | 27.9 | 1.0500 | 55.7 | 0.9900 | 48.8 | 1.7300 | 25.3 | 1.0900 | 20.4 | 1.2800 | 54.1 | 1.0000 | 0.7100 | 60.1 | 1.0700 | 59.7 | 0.7100 | 63.5 | 1.5500 | 44.6 | ||||
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 0.7900 | 49.0 | 1.1000 | 38.3 | 1.0000 | 1.0000 | 0.3200 | 84.0 | 1.1300 | 50.9 | 1.1800 | 31.2 | 1.0200 | 29.3 | 0.9900 | 54.5 | 1.0900 | 47.0 | 1.3200 | 32.5 | 1.1900 | 32.5 | 1.2200 | 55.1 | 1.0000 | 0.9200 | 54.8 | 0.8800 | 62.4 | 0.7800 | 60.5 | 1.0400 | 46.4 | ||||
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 0.8300 | 42.4 | 0.8800 | 32.3 | 0.9000 | 25.7 | 0.8600 | 30.7 | 0.5900 | 73.5 | 1.3900 | 41.0 | 1.1500 | 47.9 | 1.0100 | 29.8 | 0.8700 | 51.9 | 0.9000 | 40.9 | 1.4400 | 15.4 | 1.3200 | 17.2 | 0.9600 | 47.3 | 1.1200 | 3.2 | 0.9400 | 46.4 | 1.3900 | 46.5 | 0.9300 | 46.3 | 1.6700 | 35.3 | |
Badlani2017 | DCASE2017 baseline | Baseline | 0.9300 | 28.4 | 1.0000 | 1.0000 | 1.0000 | 0.6400 | 67.4 | 0.9800 | 16.5 | 1.0100 | 34.0 | 1.0000 | 1.0000 | 0.9800 | 3.9 | 0.9900 | 2.5 | 1.0000 | 1.7500 | 46.0 | 1.0000 | 0.9700 | 6.1 | 1.0000 | 0.9900 | 1.9 | 1.0000 | ||||||||||
Lee2017 | Lee_KAIST_task4_1 | SDCNN_MAC | 0.8200 | 39.4 | 0.9500 | 14.6 | 1.0000 | 0.9500 | 9.0 | 0.5000 | 75.1 | 1.0000 | 39.4 | 1.0100 | 34.3 | 0.8900 | 21.0 | 0.9100 | 31.2 | 0.9000 | 24.8 | 1.1000 | 15.6 | 1.0400 | 4.1 | 1.3900 | 50.7 | 1.0000 | 0.8000 | 44.7 | 0.9400 | 42.4 | 0.8300 | 37.4 | 0.9400 | 23.5 | |||
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 0.7800 | 42.6 | 0.9000 | 23.9 | 0.9900 | 5.3 | 0.9400 | 11.1 | 0.4300 | 77.7 | 1.0700 | 42.0 | 1.0300 | 37.8 | 0.8600 | 24.5 | 0.8800 | 34.4 | 0.9100 | 27.1 | 1.1900 | 13.6 | 1.0300 | 7.2 | 1.2100 | 53.5 | 1.0000 | 0.7600 | 48.5 | 0.9200 | 47.0 | 0.8100 | 43.1 | 0.9400 | 37.0 | ||
Lee2017 | Lee_KAIST_task4_3 | MLMS3_MAC | 0.7800 | 44.2 | 0.9600 | 13.9 | 1.0000 | 1.0 | 0.9400 | 10.6 | 0.4200 | 79.1 | 0.9900 | 43.1 | 0.9500 | 42.7 | 0.8500 | 26.0 | 0.8600 | 46.4 | 0.9500 | 27.7 | 1.1900 | 17.6 | 1.0500 | 4.4 | 1.2300 | 54.8 | 1.0000 | 0.7500 | 49.0 | 0.8300 | 52.3 | 0.7800 | 46.8 | 0.9600 | 33.4 | ||
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 0.7500 | 47.1 | 0.9500 | 15.1 | 1.0000 | 0.5 | 0.9000 | 17.6 | 0.3900 | 79.6 | 0.9700 | 44.4 | 0.9400 | 42.5 | 0.7800 | 38.6 | 0.8900 | 53.7 | 0.9100 | 33.3 | 1.4000 | 17.9 | 1.0000 | 4.6 | 1.1700 | 55.9 | 1.0000 | 0.7400 | 54.4 | 0.7900 | 56.1 | 0.7500 | 55.5 | 0.9300 | 37.1 | ||
Lee2017a | Lee_SNU_task4_1 | EMSI1 | 0.6700 | 54.4 | 0.6600 | 53.9 | 1.0000 | 5.0 | 0.8300 | 31.1 | 0.2900 | 85.4 | 0.9100 | 52.7 | 0.9300 | 37.5 | 0.9000 | 22.6 | 0.7800 | 51.8 | 0.9100 | 38.2 | 0.9300 | 22.6 | 1.0600 | 2.0 | 0.7300 | 66.0 | 1.0000 | 0.6300 | 59.2 | 0.7200 | 63.6 | 0.5100 | 75.4 | 0.9300 | 32.4 | ||
Lee2017a | Lee_SNU_task4_2 | EMSI2 | 0.6700 | 54.4 | 0.6600 | 53.9 | 1.0000 | 0.5 | 0.8300 | 31.1 | 0.2900 | 85.4 | 0.9100 | 52.7 | 0.9300 | 37.5 | 0.9000 | 22.6 | 0.7800 | 51.8 | 0.9100 | 38.2 | 0.9300 | 22.6 | 1.0600 | 2.0 | 0.7300 | 0.7 | 1.0000 | 0.6300 | 59.2 | 0.7200 | 63.6 | 0.5100 | 75.4 | 0.9300 | 32.4 | ||
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 0.6700 | 55.4 | 0.6600 | 54.1 | 1.0000 | 0.7700 | 39.8 | 0.3000 | 84.6 | 0.8700 | 54.8 | 0.9200 | 38.6 | 0.8600 | 29.4 | 0.8000 | 56.6 | 0.9400 | 40.2 | 0.9000 | 31.2 | 1.0400 | 1.3 | 0.7300 | 66.7 | 1.0000 | 0.6200 | 59.6 | 0.7500 | 64.0 | 0.5300 | 74.4 | 0.9300 | 33.8 | |||
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 0.6600 | 55.5 | 0.6700 | 53.2 | 1.0000 | 0.5 | 0.7800 | 38.2 | 0.3000 | 84.7 | 0.8600 | 54.4 | 0.9100 | 39.1 | 0.8800 | 26.4 | 0.7800 | 55.8 | 0.9800 | 37.8 | 0.8800 | 31.2 | 1.0400 | 1.3 | 0.7300 | 67.0 | 1.0000 | 0.6100 | 61.2 | 0.7300 | 64.1 | 0.5200 | 74.9 | 0.9200 | 33.5 | ||
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 0.8200 | 46.2 | 0.9500 | 39.8 | 0.9700 | 18.2 | 0.9200 | 20.3 | 0.3900 | 80.6 | 1.0100 | 53.1 | 1.0700 | 42.2 | 0.7400 | 48.0 | 0.9800 | 44.2 | 0.9500 | 47.3 | 1.8000 | 24.8 | 1.1500 | 11.5 | 1.3100 | 53.5 | 1.0900 | 1.1 | 0.9000 | 50.6 | 0.9900 | 50.8 | 0.8300 | 56.5 | 1.3100 | 23.7 | |
Salamon2017 | Salamon_NYU_task4_2 | Salamon_2 | 0.8500 | 45.6 | 1.0000 | 38.8 | 1.0500 | 17.4 | 0.9300 | 22.5 | 0.3700 | 81.6 | 1.0600 | 52.0 | 1.1600 | 37.2 | 1.0000 | 33.0 | 1.0400 | 46.2 | 0.9600 | 55.5 | 2.1200 | 21.6 | 1.1900 | 10.0 | 1.4400 | 50.9 | 1.0600 | 0.0 | 1.0100 | 45.7 | 0.9200 | 50.8 | 0.7600 | 59.8 | 1.2000 | 26.1 | |
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 0.7700 | 45.9 | 0.8700 | 33.1 | 1.0100 | 4.3 | 0.9600 | 8.4 | 0.3500 | 81.9 | 0.9100 | 50.1 | 0.9300 | 42.5 | 0.8500 | 27.6 | 0.8700 | 40.7 | 0.8200 | 45.6 | 1.2700 | 16.7 | 1.0300 | 1.7 | 0.9900 | 59.3 | 1.0100 | 0.0 | 0.6900 | 55.1 | 0.8900 | 39.1 | 0.8000 | 49.5 | 1.0100 | 22.7 | |
Salamon2017 | Salamon_NYU_task4_4 | Salamon_4 | 0.7700 | 45.9 | 0.8700 | 33.1 | 1.0100 | 4.3 | 0.9600 | 8.4 | 0.3500 | 81.9 | 0.9100 | 50.1 | 0.9300 | 42.5 | 0.8500 | 27.6 | 0.8700 | 40.7 | 0.8200 | 45.6 | 1.2700 | 16.7 | 1.0300 | 1.7 | 0.9900 | 59.3 | 1.0100 | 0.0 | 0.6900 | 55.1 | 0.8900 | 39.1 | 0.8000 | 49.5 | 1.0100 | 22.7 | |
Vu2017 | Toan_NCU_task4_2 | ToanVu2 | 0.9400 | 43.0 | 0.9100 | 38.3 | 1.1300 | 42.8 | 0.8500 | 38.6 | 0.6800 | 68.1 | 1.4200 | 45.3 | 0.9700 | 34.7 | 0.9400 | 43.9 | 1.0200 | 41.1 | 1.0400 | 44.4 | 2.3300 | 21.0 | 2.4000 | 26.4 | 0.9000 | 48.4 | 1.8900 | 22.0 | 1.0000 | 52.0 | 1.1900 | 41.2 | 0.8300 | 54.5 | 1.3100 | 33.3 | |
Vu2017 | Toan_NCU_task4_3 | ToanVu3 | 0.9000 | 42.7 | 0.9100 | 36.5 | 1.1300 | 40.1 | 0.8100 | 39.8 | 0.6300 | 69.5 | 1.3800 | 44.9 | 0.9800 | 31.9 | 0.8900 | 45.0 | 0.9300 | 40.2 | 0.9300 | 45.1 | 2.1400 | 21.0 | 2.0900 | 27.8 | 0.8900 | 48.5 | 1.8600 | 20.4 | 0.9400 | 53.1 | 1.1100 | 41.3 | 0.8100 | 52.4 | 1.3100 | 32.4 | |
Vu2017 | Toan_NCU_task4_4 | ToanVu4 | 0.8700 | 41.6 | 0.9000 | 34.1 | 1.1000 | 37.1 | 0.8200 | 36.0 | 0.5700 | 70.5 | 1.3100 | 45.5 | 0.9700 | 32.7 | 0.8200 | 45.5 | 0.9000 | 35.0 | 0.8500 | 43.2 | 1.8600 | 21.1 | 1.7500 | 26.3 | 0.8800 | 48.4 | 1.7600 | 22.0 | 0.9400 | 52.8 | 1.0200 | 41.1 | 0.8300 | 43.7 | 1.2600 | 32.1 | |
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 0.7300 | 51.8 | 0.9000 | 47.6 | 0.9100 | 29.7 | 0.6700 | 53.3 | 0.2900 | 85.8 | 0.8500 | 55.9 | 0.8700 | 45.0 | 0.7900 | 48.1 | 0.7800 | 65.5 | 0.9700 | 53.7 | 1.1900 | 34.9 | 1.1400 | 21.6 | 0.7600 | 48.7 | 1.4000 | 18.1 | 0.8900 | 59.7 | 0.8600 | 58.7 | 0.6700 | 65.9 | 1.0900 | 35.2 | |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 0.7800 | 47.5 | 0.8700 | 48.4 | 0.9600 | 20.4 | 0.6600 | 58.6 | 0.3500 | 83.2 | 0.9800 | 52.4 | 0.9000 | 43.8 | 1.0100 | 39.7 | 0.8200 | 56.8 | 0.9900 | 53.3 | 1.4200 | 27.6 | 1.2500 | 23.2 | 0.8000 | 45.2 | 1.5100 | 17.8 | 0.9900 | 57.4 | 0.9600 | 50.1 | 0.7600 | 60.0 | 1.1800 | 23.1 | |
Xu2017 | Xu_CVSSP_task4_3 | Surrey3AB | 1.0100 | 52.1 | 1.0600 | 45.0 | 1.1800 | 57.7 | 1.2500 | 53.9 | 0.3500 | 82.7 | 0.9400 | 60.2 | 0.9200 | 55.7 | 1.2500 | 51.7 | 1.6200 | 53.5 | 1.6700 | 50.2 | 2.4700 | 25.0 | 2.4900 | 29.3 | 0.8400 | 59.7 | 4.6800 | 24.1 | 1.1900 | 57.9 | 1.2600 | 52.8 | 0.9300 | 65.5 | 1.3400 | 44.5 | |
Xu2017 | Xu_CVSSP_task4_4 | Surrey4AB | 0.8000 | 50.4 | 0.9700 | 45.9 | 0.9400 | 34.2 | 0.8900 | 50.7 | 0.3100 | 85.0 | 0.8600 | 56.2 | 0.9100 | 44.0 | 0.9200 | 52.4 | 1.0400 | 61.0 | 0.8000 | 65.1 | 1.9900 | 24.2 | 1.7100 | 28.9 | 0.9000 | 48.0 | 1.8100 | 21.0 | 1.0100 | 58.2 | 1.0100 | 55.4 | 0.8000 | 65.1 | 1.3000 | 38.3 |
System characteristics
Subtask A - Audio tagging
Scores are overall F1 (%) on the evaluation dataset; an empty cell means the characteristic was not reported.

Technical Report | Submission code | Submission name | F1 (eval) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making
---|---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_1 | Ash_1 | 45.5 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 46.6 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_3 | Ash_3 | 44.5 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 26.3 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 47.6 | mono | 44.1kHz | | spectrogram | CNN | majority vote
Chou2017 | Chou_SINICA_task4_2 | FCNN_SM_2 | 49.0 | mono | 44.1kHz | | spectrogram | CNN | majority vote
Chou2017 | Chou_SINICA_task4_3 | FCNN_SM_3 | 47.9 | mono | 44.1kHz | | spectrogram | CNN | majority vote
Chou2017 | Chou_SINICA_task4_4 | FCNN_SM_3 | 49.0 | mono | 44.1kHz | | spectrogram | CNN | majority vote
Badlani2017 | DCASE2017 baseline | Baseline | 18.2 | mono | 44.1kHz | | log-mel energies | MLP | median filtering
Kukanov2017 | Kukanov_UEF_task4_1 | K-CRNN-MFoM | 39.6 | mono | 44.1kHz | | log-mel energies | CRNN-MFoM | median filtering
Lee2017 | Lee_KAIST_task4_1 | SDCNN_MAC | 40.3 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 47.3 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_3 | MLMS3_MAC | 47.2 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 47.1 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017a | Lee_SNU_task4_1 | EMSI1 | 52.3 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | mean probability
Lee2017a | Lee_SNU_task4_2 | EMSI2 | 52.3 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | mean probability
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 52.6 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | weighted mean probability
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 52.1 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | weighted mean probability
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 46.0 | mono | 44.1kHz | pitch shifting | log-mel energies | CRNN | raw output
Salamon2017 | Salamon_NYU_task4_2 | Salamon_2 | 45.3 | mono | 44.1kHz | pitch shifting | log-mel energies | CRNN | raw output
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 44.9 | mono | 44.1kHz | pitch shifting, dynamic range compression | log-mel energies | ensemble | raw output
Salamon2017 | Salamon_NYU_task4_4 | Salamon_4 | 38.1 | mono | 44.1kHz | pitch shifting, dynamic range compression | log-mel energies | ensemble | raw output
Vu2017 | Toan_NCU_task4_1 | ToanVu1 | 48.5 | mono | 22050 Hz | | log-mel energies | DenseNet |
Vu2017 | Toan_NCU_task4_2 | ToanVu2 | 46.5 | mono | 22050 Hz | | log-mel energies | DenseNet | median filtering
Tseng2017 | Tseng_Bosch_task4_1 | Bosch1 | 35.0 | mono | 44.1kHz | | log-mel energies | ensemble | max pooling
Tseng2017 | Tseng_Bosch_task4_2 | Bosch2 | 35.1 | mono | 44.1kHz | | log-mel energies | ensemble | max pooling
Tseng2017 | Tseng_Bosch_task4_3 | Bosch3 | 35.2 | mono | 44.1kHz | | log-mel energies | ensemble | max pooling
Tseng2017 | Tseng_Bosch_task4_4 | Bosch4 | 35.2 | mono | 44.1kHz | | log-mel energies | ensemble | max pooling
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 54.4 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 55.6 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_3 | Surrey3AB | 54.2 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_4 | Surrey4AB | 52.8 | mono | 44.1kHz | | log-mel energies | CRNN |
Subtask B - Sound event detection
Scores are segment-based overall ER and F1 (%) on the evaluation dataset; an empty cell means the characteristic was not reported.

Technical Report | Submission code | Submission name | ER (eval) | F1 (eval) | Input | Sampling rate | Data augmentation | Features | Classifier | Decision making
---|---|---|---|---|---|---|---|---|---|---
Adavanne2017 | Adavanne_TUT_task4_1 | Ash_1 | 0.8100 | 47.9 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_2 | Ash_2 | 0.8000 | 48.3 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_3 | Ash_3 | 0.8200 | 48.9 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Adavanne2017 | Adavanne_TUT_task4_4 | Ash_4 | 0.7900 | 49.0 | mono | 44.1kHz | | log-mel energies | CRNN | thresholding
Chou2017 | Chou_SINICA_task4_1 | FCNN_SM_1 | 0.8300 | 42.4 | mono | 44.1kHz | | spectrogram | CNN | majority vote
Badlani2017 | DCASE2017 baseline | Baseline | 0.9300 | 28.4 | mono | 44.1kHz | | log-mel energies | MLP | median filtering
Lee2017 | Lee_KAIST_task4_1 | SDCNN_MAC | 0.8200 | 39.4 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_2 | MLMS5_MAC | 0.7800 | 42.6 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_3 | MLMS3_MAC | 0.7800 | 44.2 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017 | Lee_KAIST_task4_4 | MLMS8_MAC | 0.7500 | 47.1 | mono | 44.1kHz | | raw waveforms | CNN | thresholding
Lee2017a | Lee_SNU_task4_1 | EMSI1 | 0.6700 | 54.4 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | mean probability
Lee2017a | Lee_SNU_task4_2 | EMSI2 | 0.6700 | 54.4 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | mean probability
Lee2017a | Lee_SNU_task4_3 | EMSI3 | 0.6700 | 55.4 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | weighted mean probability
Lee2017a | Lee_SNU_task4_4 | EMSI4 | 0.6600 | 55.5 | mono | 44.1kHz | | log-mel energies | CNN, ensemble | weighted mean probability
Salamon2017 | Salamon_NYU_task4_1 | Salamon_1 | 0.8200 | 46.2 | mono | 44.1kHz | pitch shifting | log-mel energies | CRNN | raw output
Salamon2017 | Salamon_NYU_task4_2 | Salamon_2 | 0.8500 | 45.6 | mono | 44.1kHz | pitch shifting | log-mel energies | CRNN | raw output
Salamon2017 | Salamon_NYU_task4_3 | Salamon_3 | 0.7700 | 45.9 | mono | 44.1kHz | pitch shifting, dynamic range compression | log-mel energies | ensemble | raw output
Salamon2017 | Salamon_NYU_task4_4 | Salamon_4 | 0.7700 | 45.9 | mono | 44.1kHz | pitch shifting, dynamic range compression | log-mel energies | ensemble | raw output
Vu2017 | Toan_NCU_task4_2 | ToanVu2 | 0.9400 | 43.0 | mono | 22050 Hz | | log-mel energies | DenseNet | median filtering
Vu2017 | Toan_NCU_task4_3 | ToanVu3 | 0.9000 | 42.7 | mono | 22050 Hz | | log-mel energies | DenseNet | median filtering
Vu2017 | Toan_NCU_task4_4 | ToanVu4 | 0.8700 | 41.6 | mono | 22050 Hz | | log-mel energies | DenseNet | median filtering
Xu2017 | Xu_CVSSP_task4_1 | Surrey1AB | 0.7300 | 51.8 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_2 | Surrey2AB | 0.7800 | 47.5 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_3 | Surrey3AB | 1.0100 | 52.1 | mono | 44.1kHz | | log-mel energies | CRNN |
Xu2017 | Xu_CVSSP_task4_4 | Surrey4AB | 0.8000 | 50.4 | mono | 44.1kHz | | log-mel energies | CRNN |
Technical reports
Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network
Sharath Adavanne and Tuomas Virtanen
Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
Adavanne_TUT_task4_1 Adavanne_TUT_task4_2 Adavanne_TUT_task4_3 Adavanne_TUT_task4_4
Sound Event Detection Using Weakly Labeled Dataset with Stacked Convolutional and Recurrent Neural Network
Abstract
This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording given only the list of sound events present in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energies as the input audio feature and the weak labels provided in the dataset as targets for the weak label prediction layer. Strong labels are generated by replicating the weak labels as many times as there are frames in the input audio feature and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by weighting the losses computed in the two prediction layers differently. The proposed method is evaluated on a publicly available dataset of 155 hours with 17 sound event classes. The method achieves its best error rate of 0.84 for strong labels and F-score of 43.3% for weak labels on the unseen test split.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | thresholding |
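As a rough illustration of the two-prediction-layer idea described in the abstract, here is a hedged PyTorch sketch in which a CRNN has a frame-level (strong) head and a clip-level (weak) head, with the weak labels replicated over frames and a weighting factor balancing the two losses. The layer sizes and the value of `alpha` are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeakStrongCRNN(nn.Module):
    """CRNN with a frame-level (strong) head and a clip-level (weak) head."""
    def __init__(self, n_mels=40, n_classes=17):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 5)),
        )
        self.gru = nn.GRU(32 * (n_mels // 20), 64, batch_first=True, bidirectional=True)
        self.strong_head = nn.Linear(128, n_classes)   # per-frame event activity
        self.weak_head = nn.Linear(128, n_classes)     # per-clip tags

    def forward(self, x):                          # x: (batch, 1, frames, n_mels)
        z = self.cnn(x)                            # (batch, 32, frames, n_mels // 20)
        z = z.permute(0, 2, 1, 3).flatten(2)       # (batch, frames, features)
        z, _ = self.gru(z)
        strong = torch.sigmoid(self.strong_head(z))            # (batch, frames, classes)
        weak = torch.sigmoid(self.weak_head(z.mean(dim=1)))    # (batch, classes)
        return strong, weak

def weak_strong_loss(strong, weak, weak_labels, alpha=0.5):
    """Weak labels replicated over frames act as surrogate strong labels;
    alpha controls how much the network learns from each prediction layer."""
    strong_targets = weak_labels.unsqueeze(1).expand_as(strong)
    return alpha * F.binary_cross_entropy(weak, weak_labels) + \
           (1 - alpha) * F.binary_cross_entropy(strong, strong_targets)
```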
DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System
Abstract
The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using a multilayer perceptron and log mel-band energies, but differ in the structure of the output layer and the decision-making process, as well as in the evaluation of system output using task-specific metrics.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | MLP |
Decision making | median filtering |
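A hedged sketch of a baseline-style pipeline as characterized above, log mel-band energies followed by a small MLP with sigmoid outputs; the frame sizes, layer widths, and context-window length are illustrative assumptions rather than the official baseline parameters.

```python
import numpy as np
import librosa
import torch.nn as nn

def log_mel_energies(path, sr=44100, n_mels=40):
    """Frame-wise log mel-band energies for one audio clip."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=1024, n_mels=n_mels)
    return np.log(mel + 1e-10).T                   # (frames, n_mels)

# A small MLP over a context window of five stacked frames, one sigmoid
# output per event class; clip-level tags come from thresholding the
# averaged frame probabilities (the threshold value is an assumption).
mlp = nn.Sequential(
    nn.Linear(40 * 5, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 17), nn.Sigmoid(),
)
```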
FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification
Szu-Yu Chou1, Jyh-Shing Jang2 and Yi-Hsuan Yang1
1Research Center for IT innovation, Academia Sinica, Taipei, Taiwan, 2Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
Chou_SINICA_task4_1 Chou_SINICA_task4_2 Chou_SINICA_task4_3 Chou_SINICA_task4_4
FrameCNN: A Weakly-Supervised Learning Framework for Frame-Wise Acoustic Event Detection and Classification
Abstract
In this paper, we describe our contribution to the challenge of detection and classification of acoustic scenes and events (DCASE2017). We propose FrameCNN, a novel weakly supervised learning framework that improves the performance of convolutional neural networks (CNNs) for acoustic event detection by attending to details of each sound at various temporal levels. Most existing weakly supervised frameworks replace the fully connected network with global average pooling after the final convolution layer. Such a method tends to identify only a few discriminative parts, leading to sub-optimal localization and classification accuracy. The key idea of our approach is to explicitly classify the sound of each frame given the corresponding label. The idea is general and can be applied to any network to achieve sound event detection and to improve sound event classification. In acoustic scene classification (Task 1), our approach obtained an average accuracy of 99.2% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 74.8%. In the large-scale weakly supervised sound event detection for smart cars (Task 4), we obtained an F-score of 53.8% for sound event audio tagging (subtask A), compared to the baseline of 19.8%, and an F-score of 32.8% for sound event detection (subtask B), compared to the baseline of 11.4%.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | spectrogram |
Classifier | CNN |
Decision making | majority vote |
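The frame-wise supervision idea in the abstract, keeping per-frame predictions and letting the clip label supervise every frame in addition to the globally pooled output, can be sketched roughly as follows. This is a hedged illustration, not the authors' code; the architecture and the equal loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWiseCNN(nn.Module):
    """1-D CNN over spectrogram frames that outputs a score per frame and class."""
    def __init__(self, n_bins=128, n_classes=17):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_bins, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, n_classes, kernel_size=1),
        )

    def forward(self, spec):                       # spec: (batch, n_bins, frames)
        return torch.sigmoid(self.conv(spec))      # (batch, classes, frames)

def framecnn_style_loss(frame_probs, clip_labels):
    """Clip-level loss on the globally pooled output plus a frame-level loss
    in which the clip label is copied to every frame."""
    clip_probs = frame_probs.mean(dim=2)                        # global average pooling
    frame_targets = clip_labels.unsqueeze(2).expand_as(frame_probs)
    return F.binary_cross_entropy(clip_probs, clip_labels) + \
           F.binary_cross_entropy(frame_probs, frame_targets)
```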
Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection
Ivan Kukanov1, Ville Hautamäki1 and Kong Aik Lee2
1School of Computing, University of Eastern Finland, Joensuu, Finland, 2Institute for Infocomm Research, A*Star, Singapore
Kukanov_UEF_task4_1
Recurrent Neural Network and Maximal Figure of Merit for Acoustic Event Detection
Abstract
In this report, we describe the systems submitted to the DCASE 2017 challenge. In particular, we explored a convolutional recurrent neural network (CRNN) for acoustic scene classification (Task 1). For the weakly supervised sound event detection (Task 4), we utilized a CRNN with the maximal figure-of-merit embedded into the binary cross-entropy objective function (CRNN-MFoM). On the development dataset, the CRNN model achieves an average 14.7% relative accuracy improvement on the classification Task 1, and the CRNN-MFoM improves the F1-score from 10.9% to 33.5% on the detection Task 4 compared to the baseline system.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN-MFoM |
Decision making | median filtering |
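Maximal figure-of-merit training optimizes a smooth surrogate of the evaluation metric instead of (or alongside) plain cross-entropy. The sketch below shows one generic differentiable F1 surrogate of that kind; it is an assumption-laden stand-in, not necessarily the exact MFoM formulation used in the report.

```python
import torch

def soft_f1_loss(probs, labels, eps=1e-8):
    """Differentiable surrogate for (1 - F1), computed per class and averaged.

    probs, labels: tensors of shape (batch, n_classes) with values in [0, 1].
    """
    tp = (probs * labels).sum(dim=0)
    fp = (probs * (1 - labels)).sum(dim=0)
    fn = ((1 - probs) * labels).sum(dim=0)
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return (1 - soft_f1).mean()

# The report embeds the figure-of-merit into the binary cross-entropy
# objective; a combined loss would look roughly like
#   loss = bce(probs, labels) + soft_f1_loss(probs, labels)
```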
Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection
Jongpil Lee, Jiyoung Park and Juhan Nam
Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, Korea
Lee_KAIST_task4_1 Lee_KAIST_task4_2 Lee_KAIST_task4_3 Lee_KAIST_task4_4
Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection
Abstract
This paper describes our method submitted to the large-scale weakly supervised sound event detection for smart cars in the DCASE 2017 Challenge. It is based on two techniques that have previously been suggested for music auto-tagging. One is training sample-level deep convolutional neural networks that use raw waveforms as feature extractors. The other is aggregating features from multi-scale versions of the CNNs and making final predictions from them. With this approach, we achieved our best results of 44.3% F-score on subtask A (audio tagging) and 0.84 error rate on subtask B (sound event detection). Finally, we visualize the hierarchically learned filters from the challenge dataset in each layer of the raw-waveform model to explain how they discriminate the events.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | raw waveforms |
Classifier | CNN |
Decision making | thresholding |
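A hedged sketch of the sample-level raw-waveform front end the abstract refers to, a stack of small-kernel 1-D convolutions with pooling; the filter counts, number of blocks, and pooling strides are assumptions.

```python
import torch
import torch.nn as nn

def sample_level_block(c_in, c_out):
    """Small-kernel convolution followed by stride-3 pooling."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(3),
    )

class SampleLevelCNN(nn.Module):
    def __init__(self, n_classes=17):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=3, stride=3)     # "frame" layer
        self.blocks = nn.Sequential(*[sample_level_block(64, 64) for _ in range(6)])
        self.head = nn.Linear(64, n_classes)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        z = self.blocks(self.frontend(wav))        # (batch, 64, reduced_length)
        return torch.sigmoid(self.head(z.mean(dim=2)))
```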
Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input
Donmoon Lee1,2, Subin Lee1,2, Yoonchang Han2 and Kyogu Lee1
1Music and Audio Research Group, Seoul National University, Seoul, Korea, 2Cochlear.ai, Seoul, Korea
Lee_SNU_task4_1 Lee_SNU_task4_2 Lee_SNU_task4_3 Lee_SNU_task4_4
Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection Using Multiple Scale Input
Abstract
In this paper, we use an ensemble of convolutional neural network models with various analysis windows to detect audio events in the automotive environment. When detecting the presence of audio events, a global-input model that uses the entire audio clip works better. On the other hand, segmented-input models work better at finding the accurate position of an event. Experimental results on weakly labeled audio data confirm the performance trade-off between the two tasks depending on the length of the input audio. By combining the predictions of the various models, the proposed system achieved 0.4762 in clip-based F1-score and 0.7167 in segment-based error rate.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN, ensemble |
Decision making | mean probability; weighted mean probability |
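The decision stage listed above, mean or weighted-mean combination of clip probabilities from models with different analysis windows, reduces to a few lines of NumPy; the weights, threshold, and stand-in predictions below are placeholders, not the submitted values.

```python
import numpy as np

# One (n_clips, n_classes) array of sigmoid outputs per ensemble member;
# random stand-ins here instead of real model predictions.
probs = [np.random.rand(100, 17) for _ in range(4)]
weights = np.array([0.4, 0.3, 0.2, 0.1])                # placeholder weights

mean_prob = np.mean(probs, axis=0)                      # plain mean probability
weighted_mean = np.tensordot(weights, np.stack(probs), axes=1)  # weighted mean
tags = (weighted_mean > 0.5).astype(int)                # threshold is an assumption
```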
DCASE 2017 Submission: Multiple Instance Learning for Sound Event Detection
Justin Salamon1,2, Brian McFee1,3 and Peter Li1
1Music and Audio Research Laboratory, New York University, New York City, USA, 2Center of Urban Science and Progress, New York University, New York City, USA, 3Center for Data Science, New York University, New York City, USA
Salamon_NYU_task4_1 Salamon_NYU_task4_2 Salamon_NYU_task4_3 Salamon_NYU_task4_4
DCASE 2017 Submission: Multiple Instance Learning for Sound Event Detection
Abstract
This extended abstract describes the design and implementation of a multiple instance learning model for sound event detection. The submitted systems use a convolutional-recurrent neural network (CRNN) architecture to learn strong (temporally localized) predictors from weakly labeled data. Four variants of the proposed methods were submitted to DCASE 2017, Task 4.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Data augmentation | pitch shifting; pitch shifting, dynamic range compression |
Features | log-mel energies |
Classifier | CRNN; ensemble |
Decision making | raw output |
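In the multiple instance learning view, each clip is a bag of frames with only the bag label known, and a pooling function maps frame-level predictions to a clip-level prediction that the weak label can supervise. A hedged sketch of two common pooling choices follows; the submission's exact pooling is described in the technical report.

```python
import torch

def mil_pool(frame_probs, mode="max"):
    """Aggregate frame-level probabilities (batch, frames, classes) to clip level."""
    if mode == "max":          # a positive bag only needs one positive instance
        return frame_probs.max(dim=1).values
    if mode == "softmax":      # softer pooling that favors confident frames
        weights = torch.softmax(frame_probs, dim=1)
        return (weights * frame_probs).sum(dim=1)
    raise ValueError(mode)

# Training applies binary cross-entropy between mil_pool(frame_probs) and the
# weak clip labels, so temporal localization is learned only implicitly.
```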
Large-Scale Weakly Supervised Sound Event Detection
Shaoyen Tseng1 and Juncheng Billy Li2
1Department of Electrical Engineering, University of Southern California, Los Angeles, USA, 2Research and Technology Center, Robert Bosch LLC, Pittsburgh, USA
Tseng_Bosch_task4_1 Tseng_Bosch_task4_2 Tseng_Bosch_task4_3 Tseng_Bosch_task4_4
Large-Scale Weakly Supervised Sound Event Detection
Abstract
State-of-the-art audio event detection (AED) systems rely fully on supervised learning based on strongly labeled data. The dependence on strong labels severely limits the scalability of AED work. Large-scale manually annotated datasets are difficult and expensive to collect [1], whereas weakly labeled data can be much easier to acquire. With weakly labeled data, we only need to determine whether an event is present or absent in a recording. This not only makes manual labeling significantly easier but also makes it possible to automatically infer labels from online multimedia or audio meta-information (titles, tags, etc.) [2]. This work employs a subset of Google's AudioSet [3], which is a large collection of weakly labeled YouTube video excerpts. The subset focuses on transportation and warning sounds and consists of 17 sound events divided into two categories: Warning and Vehicle. We perform experiments on three sets of features, including standard Mel-frequency cepstral coefficients (MFCC), log-Mel spectrograms, and pre-trained embeddings extracted from a deep convolutional network. Our system employs multiple instance learning (MIL) [4] approaches to deal with weak labels by bagging them into positive or negative bags. We apply four models, including a deep neural network (DNN), a recurrent neural network (RNN), and a convolutional deep neural network. Using the late-fusion approach, we improve the baseline audio tagging (subtask A) F1-score of 13.1% by 18.1%. The embeddings extracted by the convolutional neural network significantly boost the performance of all the models.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | ensemble |
Decision making | max pooling |
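The late-fusion step mentioned in the abstract, combining posteriors from heterogeneous feature/model streams with a max-pooling decision, can be sketched as follows; the stream names, the threshold, and the stand-in predictions are illustrative assumptions.

```python
import numpy as np

# Posteriors from heterogeneous feature/model streams, each (n_clips, n_classes);
# the stream names and random values are purely illustrative.
posteriors = {
    "mfcc_dnn": np.random.rand(50, 17),
    "logmel_rnn": np.random.rand(50, 17),
    "embedding_cnn": np.random.rand(50, 17),
}

fused = np.maximum.reduce(list(posteriors.values()))    # max-pooling late fusion
tags = (fused >= 0.5).astype(int)                       # threshold is an assumption
```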
Deep Learning for DCASE2017 Challenge
Toan Vu, An Dang and Jia-Ching Wang
Department of Computer Science and Information Engineering, National Central University, Taiwan
Toan_NCU_task4_1 Toan_NCU_task4_2 Toan_NCU_task4_3 Toan_NCU_task4_4
Deep Learning for DCASE2017 Challenge
Abstract
This paper reports our results on all tasks of the DCASE 2017 challenge: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. Our proposed methods are built on two widely used neural network families, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Experiments show that our proposed methods outperform the baseline.
System characteristics
Input | mono |
Sampling rate | 22050 Hz |
Features | log-mel energies |
Classifier | DenseNet |
Decision making | median filtering |
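Median filtering, the decision-making step listed for these submissions, simply smooths the thresholded frame-wise activity of each class before events are read off. A minimal sketch, with the threshold and window length as assumptions:

```python
import numpy as np
from scipy.signal import medfilt

def smooth_decisions(frame_probs, threshold=0.5, win=27):
    """Threshold frame probabilities, then median-filter each class track."""
    binary = (frame_probs > threshold).astype(float)     # (frames, classes)
    return np.stack([medfilt(binary[:, c], win)
                     for c in range(binary.shape[1])], axis=1)
```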
Surrey-CVSSP System for DCASE2017 Challenge Task4
Abstract
In this technical report, we present a set of methods for Task 4 of the Detection and Classification of Acoustic Scenes and Events 2017 (DCASE2017) challenge. This task evaluates systems for the large-scale detection of sound events using weakly labeled training data. The data are YouTube video excerpts focusing on transportation and warning sounds due to their industrial applications. There are two subtasks: audio tagging and sound event detection from weakly labeled data. A convolutional neural network (CNN) and a gated recurrent unit (GRU) based recurrent neural network (RNN) are adopted as our basic framework. We propose a learnable gating activation function for selecting informative local features. An attention-based scheme is used for localizing specific events in a weakly supervised mode. A new batch-level balancing strategy is also proposed to tackle the data imbalance problem. Fusion of posteriors from different systems is found effective for improving performance. In summary, we obtain a 61% F-value for the audio tagging subtask and a 0.72 error rate (ER) for the sound event detection subtask on the development set, whereas the official multilayer perceptron (MLP) based baseline obtained only a 13.1% F-value for audio tagging and an ER of 1.02 for sound event detection.
System characteristics
Input | mono |
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
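Two ingredients of the abstract, the learnable gating activation and attention-based localization, can be sketched roughly as follows; this is a hedged PyTorch illustration of the general ideas, not the CVSSP implementation, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One convolution provides features while a parallel convolution provides a
    learned sigmoid gate that selects informative time-frequency regions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.feat = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.gate = nn.Conv2d(c_in, c_out, 3, padding=1)

    def forward(self, x):
        return self.feat(x) * torch.sigmoid(self.gate(x))

def attention_pooling(frame_probs, frame_scores):
    """Clip-level prediction from frame probabilities weighted by a softmax
    attention track; both inputs have shape (batch, frames, classes)."""
    attention = torch.softmax(frame_scores, dim=1)
    return (attention * frame_probs).sum(dim=1)
```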