Task description
This task evaluates systems for general-purpose audio tagging with an increased number of categories, using data with annotations of varying reliability. It provides insight towards the development of broadly applicable sound event classifiers that cover a larger and more diverse set of categories.
A more detailed task description can be found on the task description page or on the competition page on Kaggle.
IMPORTANT NOTE 1: the task results shown on this page only include the submissions made through the DCASE submission system. Some entries on the official Kaggle leaderboard therefore do not appear here, and because of these missing entries the ranking of some teams on this page may differ from the Kaggle leaderboard.
IMPORTANT NOTE 2: some of the DCASE submissions failed to run evaluation on the private leaderboard because they broke the runtime or memory constraints set by Kaggle's Kernels-only competition rules. These submissions are listed here with a private leaderboard score of 0.0. We are working on computing private leaderboard scores for these submissions as well, even though they remain disqualified from the official ranking.
Systems ranking
Submission code | Name | Tech. report | mAP@3 (Private leaderboard)* | mAP@3 (Public leaderboard)
---|---|---|---|---
DCASE2018 baseline | Baseline | Fonseca2018 | 0.6943 | 0.7049 | |
Jeong_COCAI_task2_1 | Cochlear.ai_1 | Jeong2018 | 0.9538 | 0.9751 | |
Jeong_COCAI_task2_2 | Cochlear.ai_2 | Jeong2018 | 0.9506 | 0.9751 | |
Jeong_COCAI_task2_3 | Cochlear.ai_3 | Jeong2018 | 0.9405 | 0.9729 | |
Nguyen_NTU_task2_1 | NTU_ensemble8 | Nguyen2018 | 0.9496 | 0.9635 | |
Nguyen_NTU_task2_2 | NTU_labelsmoothing | Nguyen2018 | 0.9251 | 0.9413 | |
Nguyen_NTU_task2_3 | NTU_bgnormalization | Nguyen2018 | 0.9213 | 0.9297 | |
Nguyen_NTU_task2_4 | NTU_en8_augment_test | Nguyen2018 | 0.9478 | 0.9601 | |
Wilhelm_UKON_task2_1 | CNN on Raw-Audio and Spectrogram | Wilhelm2018 | 0.9435 | 0.9662 | |
Wilhelm_UKON_task2_2 | CNN on Raw-Audio and Spectrogram | Wilhelm2018 | 0.9416 | 0.9568 | |
Kim_GIST_WisenetAI_task2_1 | ConResNet | Kim2018 | 0.9151 | 0.9585 | |
Kim_GIST_WisenetAI_task2_2 | ConResNet | Kim2018 | 0.9133 | 0.9585 | |
Kim_GIST_WisenetAI_task2_3 | ConResNet | Kim2018 | 0.9139 | 0.9579 | |
Kim_GIST_WisenetAI_task2_4 | ConResNet | Kim2018 | 0.9174 | 0.9563 | |
Xu_Aalto_task2_1 | Multi-level attention model on fine-tuned AudioSet features | Xu2018 | 0.9065 | 0.9363 | |
Xu_Aalto_task2_2 | Multi-level attention model on fine-tuned AudioSet features | Xu2018 | 0.9081 | 0.9319 | |
Chakraborty_IBM_Task2_1 | 3 CNN L1 Stacked Fused Spectral XGBoost L2 | Chakraborty2018 | 0.9328 | 0.9480 | |
Chakraborty_IBM_Task2_2 | 2 CNN results geometrically averaged | Chakraborty2018 | 0.9320 | 0.9452 | |
Chakraborty_IBM_Task2_judges_award | VGG Style CNN with 3 channel input | Chakraborty2018 | 0.9079 | 0.9258 | |
Han_NPU_task2_1 | 2ModEnsem | Han2018 | 0.8723 | 0.9181 | |
Zhesong_PKU_task2_1 | BCNN_WaveNet | Yu2018 | 0.8807 | 0.9197 | |
Hanyu_BUPT_task2 | CRNN | Hanyu2018 | 0.7877 | 0.8029 | |
Wei_Kuaiyu_task2_1 | Kuaiyu tagging system | WEI2018 | 0.9409 | 0.9690 | |
Wei_Kuaiyu_task2_2 | Kuaiyu tagging system | WEI2018 | 0.9423 | 0.9673 | |
Colangelo_RM3_task2_1 | DCASE2018 Task2 CRNN RM3 | Colangelo2018 | 0.6978 | 0.7309 | |
Shan_DBSonics_task2_1 | Shan DBSonics approach | Ren2018 | 0.9405 | 0.9734 | |
Kele_NUDT_task2_1 | DCASE2018 Meta-learning system | Kele2018 | 0.9498 | 0.9779 | |
Kele_NUDT_task2_2 | DCASE2018 Meta-learning system | Kele2018 | 0.9441 | 0.9662 | |
Agafonov_ITMO_task2_1 | Fusion of 4 CNN | Agafonov2018 | 0.9174 | 0.9502 | |
Agafonov_ITMO_task2_2 | Fusion of 4 CNN | Agafonov2018 | 0.9275 | 0.9491 | |
Wilkinghoff_FKIE_task2_1 | CNN Ensemble based on Multiple Features | Wilkinghoff2018 | 0.9414 | 0.9563 | |
Pantic_ETF_task2_1 | Ensemble of convolutional neural networks for general purpose audio tagging | Pantic2018 | 0.9419 | 0.9563 | |
Khadkevich_FB_task2_1 | 2 average pooling | Khadkevich2018 | 0.9188 | 0.9131 | |
Khadkevich_FB_task2_2 | 2 max pooling | Khadkevich2018 | 0.9178 | 0.9103 | |
Iqbal_Surrey_task2_1 | Stacked CNN-CRNN (4 Models) | Iqbal2018 | 0.9484 | 0.9568 | |
Iqbal_Surrey_task2_2 | Stacked CNN-CRNN (8 Models) | Iqbal2018 | 0.9512 | 0.9612 | |
Baseline_Surrey_task2_1 | Surrey baseline CNN 8 layers | Kong2018 | 0.9034 | 0.9203 | |
Baseline_Surrey_task2_2 | Surrey baseline CNN 4 layers | Kong2018 | 0.8622 | 0.8854 | |
Dorfer_CPJKU_task2_1 | CNN - Iterative Self-Verification | Dorfer2018 | 0.9518 | 0.9563 |
* Unless stated otherwise, all reported mAP@3 scores are computed using the ground truth for the private leaderboard.
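For reference, mAP@3 rewards a correct label within the top three ranked predictions with the reciprocal of its rank. A minimal sketch of the computation (assuming one ground-truth label per clip, as in this task; the function names are ours, not part of the official evaluation code):

```python
import numpy as np

def ap_at_3(true_label, ranked_predictions):
    """AP@3 for a single clip: with one ground-truth label, this
    reduces to 1/rank if the label is in the top 3, else 0."""
    for rank, label in enumerate(ranked_predictions[:3], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0

def map_at_3(true_labels, ranked_predictions):
    """Mean of the per-clip AP@3 scores."""
    return float(np.mean([ap_at_3(t, p)
                          for t, p in zip(true_labels, ranked_predictions)]))

# Example: a correct label at rank 2 contributes 0.5.
print(map_at_3(["bark", "meow"], [["bark", "cough", "meow"],
                                  ["cough", "meow", "bark"]]))  # 0.75
```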
Teams ranking
The table below includes only the best-performing system from each submitting team.
Submission code | Name | Tech. report | mAP@3 (Private leaderboard) | mAP@3 (Public leaderboard)
---|---|---|---|---
DCASE2018 baseline | Baseline | Fonseca2018 | 0.6943 | 0.7049 | |
Chakraborty_IBM_Task2_1 | 3 CNN L1 Stacked Fused Spectral XGBoost L2 | Chakraborty2018 | 0.9328 | 0.9480 | |
Han_NPU_task2_1 | 2ModEnsem | Han2018 | 0.8723 | 0.9181 | |
Baseline_Surrey_task2_1 | Surrey baseline CNN 8 layers | Kong2018 | 0.9034 | 0.9203 | |
Jeong_COCAI_task2_1 | Cochlear.ai_1 | Jeong2018 | 0.9538 | 0.9751 | |
Dorfer_CPJKU_task2_1 | CNN - Iterative Self-Verification | Dorfer2018 | 0.9518 | 0.9563 | |
Shan_DBSonics_task2_1 | Shan DBSonics approach | Ren2018 | 0.9405 | 0.9734 | |
Wilkinghoff_FKIE_task2_1 | CNN Ensemble based on Multiple Features | Wilkinghoff2018 | 0.9414 | 0.9563 | |
Nguyen_NTU_task2_1 | NTU_ensemble8 | Nguyen2018 | 0.9496 | 0.9635 | |
Zhesong_PKU_task2_1 | BCNN_WaveNet | Yu2018 | 0.8807 | 0.9197 | |
Wilhelm_UKON_task2_1 | CNN on Raw-Audio and Spectrogram | Wilhelm2018 | 0.9435 | 0.9662 | |
Hanyu_BUPT_task2 | CRNN | Hanyu2018 | 0.7877 | 0.8029 | |
Kele_NUDT_task2_1 | DCASE2018 Meta-learning system | Kele2018 | 0.9498 | 0.9779 | |
Agafonov_ITMO_task2_2 | Fusion of 4 CNN | Agafonov2018 | 0.9275 | 0.9491 | |
Kim_GIST_WisenetAI_task2_4 | ConResNet | Kim2018 | 0.9174 | 0.9563 | |
Pantic_ETF_task2_1 | Ensemble of convolutional neural networks for general purpose audio tagging | Pantic2018 | 0.9419 | 0.9563 | |
Khadkevich_FB_task2_1 | 2 average pooling | Khadkevich2018 | 0.9188 | 0.9131 | |
Iqbal_Surrey_task2_2 | Stacked CNN-CRNN (8 Models) | Iqbal2018 | 0.9512 | 0.9612 | |
Xu_Aalto_task2_2 | Multi-level attention model on fine-tuned AudioSet features | Xu2018 | 0.9081 | 0.9319 | |
Colangelo_RM3_task2_1 | DCASE2018 Task2 CRNN RM3 | Colangelo2018 | 0.6978 | 0.7309 | |
Wei_Kuaiyu_task2_2 | Kuaiyu tagging system | WEI2018 | 0.9423 | 0.9673 |
Class-wise performance
The table below shows the mAP@3 scores computed per class.
Submission code | Name | Tech. report | mAP@3 | acoustic guitar | applause | bark | bass drum | burping or eructation | bus | cello | chime | clarinet | computer keyboard | cough | cowbell | double bass | drawer open or close | electric piano | fart | finger snapping | fireworks | flute | glockenspiel | gong | gunshot or gunfire | harmonica | hi-hat | keys jangling | knock | meow | microwave oven | oboe | saxophone | scissors | shatter | snare drum | squeak | tambourine | tearing | telephone | trumpet | violin or fiddle | writing
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DCASE2018 baseline | Baseline | Fonseca2018 | 0.6943 | 0.6713 | 0.9744 | 0.8623 | 0.5797 | 0.7051 | 0.4833 | 0.8447 | 0.7708 | 0.9778 | 0.5397 | 0.7222 | 0.5637 | 0.6458 | 0.0625 | 0.7372 | 0.6181 | 0.7222 | 0.5449 | 0.8939 | 0.5694 | 0.8500 | 0.1373 | 0.8827 | 0.5208 | 0.7536 | 0.8854 | 0.8681 | 0.5764 | 0.8922 | 0.8447 | 0.2667 | 0.6528 | 0.3155 | 0.1806 | 0.8229 | 0.9773 | 0.6496 | 0.8611 | 0.7356 | 0.6319 | |
Jeong_COCAI_task2_1 | Cochlear.ai_1 | Jeong2018 | 0.9538 | 0.9398 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.8917 | 0.9659 | 0.8750 | 1.0000 | 0.9683 | 1.0000 | 0.9853 | 1.0000 | 0.8681 | 0.9744 | 1.0000 | 1.0000 | 0.8077 | 1.0000 | 0.8681 | 0.9833 | 0.9412 | 0.9815 | 1.0000 | 0.9348 | 0.9635 | 0.9792 | 0.9722 | 1.0000 | 0.9451 | 0.8250 | 1.0000 | 1.0000 | 0.6458 | 1.0000 | 1.0000 | 0.7479 | 0.9833 | 1.0000 | 0.9375 | |
Jeong_COCAI_task2_2 | Cochlear.ai_2 | Jeong2018 | 0.9506 | 0.9444 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.8917 | 0.9659 | 0.8681 | 1.0000 | 0.9683 | 1.0000 | 0.9804 | 1.0000 | 0.8681 | 0.9744 | 1.0000 | 1.0000 | 0.7692 | 1.0000 | 0.8681 | 0.9833 | 0.9510 | 0.9815 | 1.0000 | 0.9058 | 0.9635 | 0.9792 | 0.9514 | 1.0000 | 0.9318 | 0.8000 | 1.0000 | 1.0000 | 0.6319 | 0.9844 | 1.0000 | 0.7393 | 0.9833 | 1.0000 | 0.9583 | |
Jeong_COCAI_task2_3 | Cochlear.ai_3 | Jeong2018 | 0.9405 | 0.9398 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.7500 | 0.9508 | 0.8819 | 1.0000 | 0.9444 | 1.0000 | 1.0000 | 1.0000 | 0.8403 | 1.0000 | 1.0000 | 1.0000 | 0.7564 | 1.0000 | 0.8264 | 0.9833 | 0.8922 | 0.9444 | 1.0000 | 0.8406 | 0.9167 | 0.9792 | 0.9375 | 1.0000 | 0.9527 | 0.5750 | 1.0000 | 1.0000 | 0.6181 | 1.0000 | 1.0000 | 0.7692 | 1.0000 | 1.0000 | 0.9167 | |
Nguyen_NTU_task2_1 | NTU_ensemble8 | Nguyen2018 | 0.9496 | 0.8981 | 1.0000 | 0.9710 | 1.0000 | 1.0000 | 0.9250 | 0.9621 | 0.8264 | 1.0000 | 0.9286 | 1.0000 | 0.9804 | 1.0000 | 0.8889 | 0.9808 | 1.0000 | 1.0000 | 0.7949 | 1.0000 | 0.8264 | 0.9667 | 0.9085 | 0.9444 | 1.0000 | 0.8913 | 0.9583 | 0.9792 | 0.9792 | 0.9804 | 0.9886 | 0.6000 | 1.0000 | 1.0000 | 0.6736 | 1.0000 | 1.0000 | 0.7949 | 1.0000 | 1.0000 | 0.9583 | |
Nguyen_NTU_task2_2 | NTU_labelsmoothing | Nguyen2018 | 0.9251 | 0.8380 | 1.0000 | 0.9783 | 0.9565 | 1.0000 | 0.9000 | 0.9356 | 0.7708 | 1.0000 | 0.9762 | 0.9792 | 0.9412 | 0.9844 | 0.8958 | 0.8846 | 0.9792 | 1.0000 | 0.7564 | 0.9432 | 0.7917 | 0.9667 | 0.9183 | 0.9012 | 0.9792 | 0.8261 | 0.9583 | 0.9792 | 0.8681 | 0.9804 | 0.9943 | 0.6417 | 0.9375 | 0.9643 | 0.5625 | 0.9792 | 0.9091 | 0.7991 | 0.9611 | 0.9808 | 0.9097 | |
Nguyen_NTU_task2_3 | NTU_bgnormalization | Nguyen2018 | 0.9213 | 0.8796 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.8250 | 0.9583 | 0.8958 | 0.9593 | 0.8492 | 1.0000 | 0.9559 | 0.9844 | 0.7778 | 0.9423 | 1.0000 | 1.0000 | 0.6538 | 1.0000 | 0.7431 | 0.9167 | 0.8268 | 0.9198 | 1.0000 | 0.8623 | 0.9583 | 0.9792 | 0.9583 | 0.9804 | 0.9394 | 0.5000 | 0.9792 | 0.9643 | 0.6458 | 1.0000 | 0.9773 | 0.7991 | 0.9833 | 0.9693 | 0.9583 | |
Nguyen_NTU_task2_4 | NTU_en8_augment_test | Nguyen2018 | 0.9478 | 0.8981 | 1.0000 | 0.9493 | 1.0000 | 1.0000 | 0.9250 | 0.9621 | 0.8264 | 1.0000 | 0.9286 | 1.0000 | 0.9804 | 1.0000 | 0.8889 | 0.9808 | 1.0000 | 1.0000 | 0.8077 | 1.0000 | 0.8472 | 0.9500 | 0.8987 | 0.9444 | 1.0000 | 0.8841 | 0.9583 | 0.9792 | 0.9792 | 0.9804 | 0.9886 | 0.5750 | 1.0000 | 1.0000 | 0.6597 | 1.0000 | 1.0000 | 0.7778 | 1.0000 | 1.0000 | 0.9583 | |
Wilhelm_UKON_task2_1 | CNN on Raw-Audio and Spectrogram | Wilhelm2018 | 0.9435 | 0.8657 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.9417 | 0.9394 | 0.8681 | 0.9889 | 1.0000 | 1.0000 | 1.0000 | 0.9323 | 0.9097 | 1.0000 | 0.9306 | 1.0000 | 0.7949 | 1.0000 | 0.8472 | 0.9611 | 0.9510 | 0.9630 | 0.9792 | 0.9130 | 0.9479 | 1.0000 | 0.9792 | 0.9706 | 0.9356 | 0.9083 | 0.9514 | 0.9107 | 0.5903 | 1.0000 | 0.9773 | 0.7692 | 0.9833 | 0.9828 | 0.9375 | |
Wilhelm_UKON_task2_2 | CNN on Raw-Audio and Spectrogram | Wilhelm2018 | 0.9416 | 0.8519 | 1.0000 | 0.9710 | 0.9565 | 1.0000 | 0.9167 | 0.9773 | 0.8472 | 0.9889 | 0.9762 | 1.0000 | 1.0000 | 0.9219 | 0.9514 | 0.9615 | 0.9514 | 1.0000 | 0.8013 | 0.9886 | 0.9167 | 0.9444 | 0.9412 | 0.9444 | 0.9688 | 0.8768 | 0.9375 | 1.0000 | 0.9583 | 0.9657 | 0.9242 | 0.9250 | 0.9514 | 0.8929 | 0.6319 | 0.9844 | 0.9318 | 0.8077 | 0.9444 | 0.9943 | 0.9792 | |
Kim_GIST_WisenetAI_task2_1 | ConResNet | Kim2018 | 0.9151 | 0.8843 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.6750 | 0.9545 | 0.8681 | 1.0000 | 0.8254 | 0.9792 | 1.0000 | 1.0000 | 0.8125 | 0.9423 | 0.9167 | 0.9383 | 0.7500 | 0.9167 | 0.9306 | 0.9333 | 0.8562 | 0.9198 | 0.9792 | 0.8623 | 0.9688 | 0.9583 | 0.9722 | 0.9559 | 0.9602 | 0.5167 | 0.8611 | 1.0000 | 0.5486 | 1.0000 | 0.9470 | 0.6795 | 0.9833 | 1.0000 | 0.7639 | |
Kim_GIST_WisenetAI_task2_2 | ConResNet | Kim2018 | 0.9133 | 0.8750 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.6750 | 0.9545 | 0.8611 | 1.0000 | 0.8016 | 0.9722 | 1.0000 | 1.0000 | 0.8333 | 0.9423 | 0.8958 | 0.9444 | 0.7372 | 0.9167 | 0.9236 | 0.9444 | 0.8497 | 0.8951 | 0.9792 | 0.8623 | 0.9688 | 0.9583 | 0.9514 | 0.9510 | 0.9564 | 0.5250 | 0.8681 | 1.0000 | 0.5208 | 1.0000 | 0.9773 | 0.6923 | 0.9833 | 1.0000 | 0.7639 | |
Kim_GIST_WisenetAI_task2_3 | ConResNet | Kim2018 | 0.9139 | 0.8750 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.6750 | 0.9545 | 0.8611 | 1.0000 | 0.8492 | 0.9792 | 1.0000 | 1.0000 | 0.8542 | 0.9423 | 0.9167 | 0.9444 | 0.7308 | 0.9167 | 0.9097 | 0.9444 | 0.8693 | 0.8951 | 0.9792 | 0.8188 | 0.9688 | 0.9583 | 0.9514 | 0.9510 | 0.9564 | 0.5167 | 0.8611 | 1.0000 | 0.5208 | 1.0000 | 0.9773 | 0.6923 | 0.9833 | 1.0000 | 0.7431 | |
Kim_GIST_WisenetAI_task2_4 | ConResNet | Kim2018 | 0.9174 | 0.8981 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.6750 | 0.9545 | 0.8889 | 1.0000 | 0.8730 | 1.0000 | 0.9853 | 1.0000 | 0.8056 | 0.9231 | 0.9167 | 0.9753 | 0.7628 | 0.9280 | 0.9514 | 0.9444 | 0.8693 | 0.9259 | 0.9688 | 0.8406 | 0.9688 | 0.9583 | 0.9722 | 0.9559 | 0.9545 | 0.5417 | 0.8681 | 1.0000 | 0.5347 | 1.0000 | 0.9242 | 0.6795 | 0.9833 | 1.0000 | 0.7431 | |
Xu_Aalto_task2_1 | Multi-level attention model on fine-tuned AudioSet features | Xu2018 | 0.9065 | 0.8611 | 1.0000 | 0.9348 | 0.9710 | 1.0000 | 0.9667 | 0.9205 | 0.8264 | 0.9778 | 0.8810 | 0.9514 | 1.0000 | 0.9688 | 0.8750 | 1.0000 | 0.7986 | 0.9815 | 0.7821 | 0.9432 | 0.8125 | 0.9222 | 0.9183 | 0.9383 | 0.8385 | 0.8478 | 0.9635 | 0.9792 | 0.9306 | 0.9804 | 0.8939 | 0.6500 | 0.8889 | 0.8571 | 0.4097 | 0.9375 | 1.0000 | 0.7821 | 0.9056 | 0.9770 | 0.7917 | |
Xu_Aalto_task2_2 | Multi-level attention model on fine-tuned AudioSet features | Xu2018 | 0.9081 | 0.8565 | 1.0000 | 0.9348 | 0.9783 | 1.0000 | 0.9333 | 0.9545 | 0.8611 | 0.9852 | 0.9048 | 0.9514 | 1.0000 | 0.9531 | 0.8750 | 1.0000 | 0.8264 | 0.9815 | 0.7949 | 0.9432 | 0.7917 | 0.9222 | 0.9118 | 0.9630 | 0.8229 | 0.8478 | 0.9479 | 0.9792 | 0.9097 | 0.9853 | 0.8939 | 0.5667 | 0.9167 | 0.8690 | 0.4028 | 0.9375 | 1.0000 | 0.7949 | 0.9111 | 0.9770 | 0.8264 | |
Chakraborty_IBM_Task2_1 | 3 CNN L1 Stacked Fused Spectral XGBoost L2 | Chakraborty2018 | 0.9328 | 0.9120 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.7917 | 0.9621 | 0.8542 | 1.0000 | 0.8968 | 1.0000 | 0.9853 | 0.9844 | 0.8750 | 0.9167 | 1.0000 | 1.0000 | 0.7115 | 0.9621 | 0.8125 | 0.9500 | 0.9118 | 0.9074 | 0.9323 | 0.7971 | 0.9635 | 0.9792 | 1.0000 | 0.9804 | 0.9716 | 0.6167 | 0.9583 | 0.9643 | 0.6458 | 1.0000 | 0.9773 | 0.7650 | 0.9778 | 0.9943 | 0.9097 | |
Chakraborty_IBM_Task2_2 | 2 CNN results geometrically averaged | Chakraborty2018 | 0.9320 | 0.9074 | 1.0000 | 0.9348 | 1.0000 | 1.0000 | 0.8750 | 0.9545 | 0.8125 | 0.9889 | 0.9762 | 1.0000 | 1.0000 | 0.9844 | 0.9167 | 0.9744 | 1.0000 | 1.0000 | 0.7308 | 0.9773 | 0.8542 | 0.9500 | 0.9150 | 0.9136 | 0.9479 | 0.8623 | 0.9844 | 0.9583 | 0.9722 | 0.9853 | 0.9621 | 0.5667 | 1.0000 | 0.9821 | 0.4306 | 1.0000 | 0.9773 | 0.7393 | 0.9667 | 0.9943 | 0.8194 | |
Chakraborty_IBM_Task2_judges_award | VGG Style CNN with 3 channel input | Chakraborty2018 | 0.9079 | 0.8333 | 1.0000 | 0.8841 | 0.9783 | 1.0000 | 0.8917 | 0.9508 | 0.8542 | 0.9889 | 0.9048 | 0.9792 | 0.9853 | 0.9688 | 0.8958 | 0.9038 | 0.9306 | 1.0000 | 0.6603 | 0.9508 | 0.8542 | 0.9000 | 0.8758 | 0.9198 | 0.8958 | 0.8116 | 0.9844 | 0.9375 | 1.0000 | 0.9412 | 0.9413 | 0.6333 | 0.8750 | 0.9464 | 0.4028 | 0.9479 | 0.8561 | 0.7265 | 0.9500 | 0.9923 | 0.8194 | |
Han_NPU_task2_1 | 2ModEnsem | Han2018 | 0.8723 | 0.8056 | 1.0000 | 0.9130 | 0.9130 | 1.0000 | 0.6750 | 0.9280 | 0.7986 | 0.9778 | 0.7937 | 1.0000 | 0.9118 | 0.9375 | 0.8264 | 0.8333 | 0.9514 | 0.9383 | 0.6410 | 0.8864 | 0.7708 | 0.9667 | 0.7516 | 0.9012 | 0.9375 | 0.8768 | 0.8958 | 0.8472 | 0.7917 | 0.9412 | 0.8826 | 0.6083 | 0.8472 | 0.8750 | 0.5764 | 0.9792 | 0.9470 | 0.6239 | 0.9444 | 0.9674 | 0.8750 | |
Zhesong_PKU_task2_1 | BCNN_WaveNet | Yu2018 | 0.8807 | 0.8287 | 0.9808 | 0.9348 | 0.9783 | 0.9808 | 0.7750 | 0.9773 | 0.7847 | 0.9741 | 0.7222 | 0.9583 | 0.8971 | 0.9792 | 0.8958 | 0.8141 | 0.8264 | 0.9753 | 0.6090 | 0.9053 | 0.7153 | 0.8944 | 0.8301 | 0.9198 | 0.9635 | 0.8406 | 0.9062 | 0.9792 | 0.8819 | 0.9559 | 0.8939 | 0.4583 | 0.8681 | 0.9048 | 0.5139 | 0.9844 | 0.8864 | 0.6709 | 0.9333 | 0.9828 | 0.8333 | |
Hanyu_BUPT_task2 | CRNN | Hanyu2018 | 0.7877 | 0.7685 | 1.0000 | 0.9348 | 0.7174 | 0.9423 | 0.7583 | 0.9129 | 0.7500 | 0.9444 | 0.7698 | 0.9167 | 0.9265 | 0.8438 | 0.6806 | 0.7756 | 0.8056 | 0.5988 | 0.7949 | 0.9205 | 0.5278 | 0.8833 | 0.5131 | 0.7901 | 0.7344 | 0.7971 | 0.7135 | 0.7708 | 0.6389 | 0.9216 | 0.9318 | 0.2667 | 0.8819 | 0.6071 | 0.3542 | 0.8906 | 1.0000 | 0.5256 | 0.7889 | 0.8774 | 0.6458 | |
Wei_Kuaiyu_task2_1 | Kuaiyu tagging system | WEI2018 | 0.9409 | 0.9583 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.8250 | 0.9773 | 0.8403 | 1.0000 | 0.9683 | 1.0000 | 1.0000 | 1.0000 | 0.8542 | 0.8846 | 1.0000 | 1.0000 | 0.8205 | 0.9659 | 0.8819 | 0.9667 | 0.9346 | 0.9136 | 1.0000 | 0.8913 | 0.9844 | 0.9792 | 0.9167 | 0.9657 | 0.9773 | 0.7333 | 0.9792 | 0.9821 | 0.5208 | 1.0000 | 0.9545 | 0.7436 | 0.9278 | 1.0000 | 0.8750 | |
Wei_Kuaiyu_task2_2 | Kuaiyu tagging system | WEI2018 | 0.9423 | 0.9583 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.8250 | 0.9773 | 0.8681 | 1.0000 | 0.9683 | 1.0000 | 1.0000 | 1.0000 | 0.8542 | 0.8846 | 1.0000 | 1.0000 | 0.8205 | 0.9659 | 0.8819 | 0.9667 | 0.9379 | 0.9321 | 1.0000 | 0.8913 | 0.9844 | 0.9792 | 0.9167 | 0.9657 | 0.9754 | 0.7333 | 0.9792 | 0.9821 | 0.5278 | 1.0000 | 0.9545 | 0.7521 | 0.9500 | 1.0000 | 0.8542 | |
Colangelo_RM3_task2_1 | DCASE2018 Task2 CRNN RM3 | Colangelo2018 | 0.6978 | 0.7361 | 0.9808 | 0.6812 | 0.8841 | 0.7885 | 0.6750 | 0.7538 | 0.7778 | 0.8000 | 0.5873 | 0.7083 | 0.9216 | 0.7656 | 0.6875 | 0.5641 | 0.7847 | 0.5741 | 0.5064 | 0.7500 | 0.6528 | 0.7444 | 0.6895 | 0.5247 | 0.8438 | 0.7029 | 0.9323 | 0.8472 | 0.6042 | 0.3480 | 0.6742 | 0.5833 | 0.7431 | 0.8750 | 0.3403 | 0.9010 | 0.8030 | 0.4145 | 0.7111 | 0.5632 | 0.6250 | |
Shan_DBSonics_task2_1 | Shan DBSonics approach | Ren2018 | 0.9405 | 0.9120 | 1.0000 | 0.9710 | 1.0000 | 1.0000 | 0.7167 | 0.9508 | 0.8264 | 1.0000 | 0.9206 | 1.0000 | 1.0000 | 1.0000 | 0.8750 | 0.9038 | 0.9792 | 1.0000 | 0.8013 | 0.9886 | 0.8264 | 0.9667 | 0.9020 | 0.9568 | 1.0000 | 0.8768 | 0.9635 | 0.9583 | 1.0000 | 0.9559 | 0.9716 | 0.6750 | 0.9514 | 0.9286 | 0.6875 | 1.0000 | 1.0000 | 0.7906 | 0.9833 | 0.9943 | 0.9583 | |
Kele_NUDT_task2_1 | DCASE2018 Meta-learning system | Kele2018 | 0.9498 | 0.9120 | 1.0000 | 0.9783 | 0.9783 | 1.0000 | 0.8083 | 0.9735 | 0.8750 | 1.0000 | 0.9762 | 1.0000 | 0.9706 | 1.0000 | 0.9306 | 0.8974 | 1.0000 | 1.0000 | 0.8846 | 0.9773 | 0.7917 | 0.9500 | 0.8954 | 0.9815 | 1.0000 | 0.9130 | 0.9635 | 1.0000 | 1.0000 | 0.9853 | 0.9886 | 0.7250 | 0.9514 | 0.9821 | 0.6389 | 0.9844 | 1.0000 | 0.8162 | 1.0000 | 0.9943 | 0.9514 | |
Kele_NUDT_task2_2 | DCASE2018 Meta-learning system | Kele2018 | 0.9441 | 0.8750 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.7250 | 0.9659 | 0.9167 | 1.0000 | 1.0000 | 1.0000 | 0.9706 | 1.0000 | 0.9167 | 0.8654 | 1.0000 | 1.0000 | 0.8654 | 0.9886 | 0.7986 | 0.9667 | 0.8889 | 0.9630 | 1.0000 | 0.9130 | 0.9635 | 0.9792 | 0.9722 | 0.9804 | 0.9830 | 0.7500 | 0.9583 | 0.9821 | 0.6250 | 0.9844 | 0.9773 | 0.7949 | 1.0000 | 0.9943 | 0.9097 | |
Agafonov_ITMO_task2_1 | Fusion of 4 CNN | Agafonov2018 | 0.9174 | 0.8704 | 0.9615 | 0.9710 | 1.0000 | 1.0000 | 0.8250 | 0.9091 | 0.9097 | 1.0000 | 0.8492 | 1.0000 | 0.9706 | 0.9844 | 0.8542 | 0.9615 | 0.9167 | 0.9815 | 0.7564 | 0.9886 | 0.9097 | 0.9444 | 0.8693 | 0.9198 | 0.9219 | 0.8986 | 0.9427 | 0.9792 | 0.9375 | 0.9608 | 0.9015 | 0.6083 | 0.8681 | 0.9405 | 0.5000 | 0.9844 | 1.0000 | 0.7436 | 0.8722 | 1.0000 | 0.9792 | |
Agafonov_ITMO_task2_2 | Fusion of 4 CNN | Agafonov2018 | 0.9275 | 0.8981 | 0.9808 | 0.9783 | 1.0000 | 1.0000 | 0.7750 | 0.9508 | 0.9097 | 1.0000 | 0.8810 | 1.0000 | 0.9853 | 1.0000 | 0.8472 | 0.9423 | 0.9375 | 0.9815 | 0.7500 | 0.9659 | 0.8750 | 0.9611 | 0.9085 | 0.9321 | 0.9375 | 0.8768 | 0.9635 | 0.9792 | 0.9792 | 0.9804 | 0.9205 | 0.6250 | 0.8611 | 0.9583 | 0.5556 | 0.9792 | 1.0000 | 0.7521 | 0.9167 | 1.0000 | 0.9792 | |
Wilkinghoff_FKIE_task2_1 | CNN Ensemble based on Multiple Features | Wilkinghoff2018 | 0.9414 | 0.9167 | 1.0000 | 0.9783 | 1.0000 | 1.0000 | 0.8667 | 0.9773 | 0.8125 | 0.9889 | 1.0000 | 1.0000 | 0.9706 | 1.0000 | 0.9097 | 0.9744 | 1.0000 | 1.0000 | 0.7436 | 0.9886 | 0.8542 | 0.9667 | 0.9706 | 0.9321 | 0.9688 | 0.9130 | 0.9427 | 0.9792 | 0.9583 | 1.0000 | 0.9451 | 0.8167 | 0.9306 | 0.9821 | 0.5069 | 0.9792 | 0.9470 | 0.7692 | 0.9833 | 1.0000 | 0.8611 | |
Pantic_ETF_task2_1 | Ensemble of convolutional neural networks for general purpose audio tagging | Pantic2018 | 0.9419 | 0.9259 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.8417 | 0.9773 | 0.8750 | 0.9852 | 0.8810 | 1.0000 | 1.0000 | 0.9844 | 0.9167 | 0.9615 | 0.9792 | 1.0000 | 0.7692 | 0.9659 | 0.8403 | 1.0000 | 0.9020 | 0.9630 | 0.9688 | 0.9348 | 0.9479 | 1.0000 | 1.0000 | 0.9510 | 0.9527 | 0.7250 | 0.9792 | 0.9226 | 0.6736 | 0.9844 | 0.9773 | 0.7778 | 0.9667 | 0.9943 | 0.9583 | |
Khadkevich_FB_task2_1 | 2 average pooling | Khadkevich2018 | 0.9188 | 0.8981 | 1.0000 | 0.9710 | 0.9565 | 1.0000 | 0.9417 | 0.9545 | 0.8472 | 0.9889 | 0.9524 | 1.0000 | 1.0000 | 0.9219 | 0.8681 | 0.9359 | 0.9375 | 0.8889 | 0.9167 | 0.9659 | 0.8333 | 0.9667 | 0.8399 | 0.9815 | 0.8906 | 0.8261 | 0.9635 | 0.9792 | 0.8542 | 0.9804 | 0.9470 | 0.5333 | 0.9167 | 0.9107 | 0.4931 | 1.0000 | 0.9697 | 0.7265 | 0.9444 | 0.9579 | 0.9375 | |
Khadkevich_FB_task2_2 | 2 max pooling | Khadkevich2018 | 0.9178 | 0.9306 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.9750 | 0.9470 | 0.8403 | 0.9667 | 1.0000 | 1.0000 | 0.9706 | 0.9010 | 0.8403 | 0.8205 | 0.9306 | 0.9074 | 0.9167 | 0.9545 | 0.7986 | 0.9444 | 0.9085 | 0.9630 | 0.8906 | 0.8478 | 0.9531 | 0.9792 | 0.8889 | 0.9657 | 0.9223 | 0.5250 | 0.9375 | 0.9286 | 0.5556 | 1.0000 | 0.9545 | 0.7393 | 0.9500 | 0.9540 | 0.9375 | |
Iqbal_Surrey_task2_1 | Stacked CNN-CRNN (4 Models) | Iqbal2018 | 0.9484 | 0.8889 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.7833 | 0.9545 | 0.9167 | 1.0000 | 0.9762 | 0.9583 | 0.9706 | 1.0000 | 0.8750 | 0.9808 | 0.9792 | 1.0000 | 0.7949 | 0.9848 | 0.7431 | 0.9500 | 0.9281 | 0.9198 | 0.9844 | 0.8333 | 0.9635 | 0.9792 | 1.0000 | 0.9853 | 1.0000 | 0.8083 | 0.9583 | 1.0000 | 0.6597 | 1.0000 | 1.0000 | 0.8333 | 1.0000 | 1.0000 | 0.9583 | |
Iqbal_Surrey_task2_2 | Stacked CNN-CRNN (8 Models) | Iqbal2018 | 0.9512 | 0.9028 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.7833 | 0.9432 | 0.8958 | 1.0000 | 0.9762 | 0.9583 | 1.0000 | 1.0000 | 0.8958 | 0.9808 | 1.0000 | 1.0000 | 0.8205 | 0.9886 | 0.7569 | 0.9667 | 0.9216 | 0.9198 | 0.9844 | 0.8406 | 0.9635 | 0.9792 | 1.0000 | 1.0000 | 0.9943 | 0.8250 | 0.9583 | 1.0000 | 0.6944 | 1.0000 | 1.0000 | 0.8205 | 1.0000 | 1.0000 | 0.9792 | |
Baseline_Surrey_task2_1 | Surrey baseline CNN 8 layers | Kong2018 | 0.9034 | 0.8889 | 1.0000 | 0.9565 | 0.9348 | 1.0000 | 0.7417 | 0.9545 | 0.8750 | 0.9889 | 0.8810 | 0.9167 | 1.0000 | 0.9688 | 0.8194 | 0.8654 | 0.8681 | 0.9815 | 0.7115 | 0.9773 | 0.7014 | 0.9667 | 0.8922 | 0.8395 | 0.9479 | 0.8478 | 0.9844 | 0.9514 | 0.9167 | 0.9510 | 0.9489 | 0.6750 | 0.7917 | 0.9405 | 0.4306 | 0.9792 | 0.9470 | 0.6880 | 0.9333 | 0.9885 | 0.8958 | |
Baseline_Surrey_task2_2 | Surrey baseline CNN 4 layers | Kong2018 | 0.8622 | 0.8889 | 1.0000 | 0.9493 | 0.9348 | 0.9551 | 0.6500 | 0.9318 | 0.7500 | 0.9556 | 0.8095 | 0.9167 | 0.9559 | 0.9844 | 0.6875 | 0.9231 | 0.6944 | 0.9444 | 0.6667 | 0.9394 | 0.6806 | 0.9111 | 0.8725 | 0.7593 | 0.9219 | 0.7174 | 0.9375 | 0.7917 | 0.8472 | 0.9510 | 0.9072 | 0.5250 | 0.5903 | 0.9821 | 0.3819 | 0.9844 | 0.8864 | 0.6752 | 0.9444 | 0.9808 | 0.8125 | |
Dorfer_CPJKU_task2_1 | CNN - Iterative Self-Verification | Dorfer2018 | 0.9518 | 0.9213 | 1.0000 | 0.9565 | 1.0000 | 1.0000 | 0.9000 | 0.9773 | 0.8958 | 1.0000 | 1.0000 | 1.0000 | 0.9853 | 1.0000 | 0.9097 | 0.9423 | 0.9444 | 0.9815 | 0.8077 | 0.9886 | 0.8542 | 0.9667 | 0.9510 | 0.9383 | 0.9688 | 0.9275 | 0.9635 | 0.9792 | 0.9792 | 1.0000 | 1.0000 | 0.7083 | 0.9306 | 0.9821 | 0.6250 | 1.0000 | 1.0000 | 0.7821 | 0.9833 | 1.0000 | 0.9236 |
System characteristics
Input characteristics
Code | Tech. report | mAP@3 | Acoustic features | Data augmentation | External data | Re-labeling | Sampling rate
---|---|---|---|---|---|---|---
DCASE2018 baseline | Fonseca2018 | 0.6943 | log-mel energies | | | | 44.1kHz
Jeong_COCAI_task2_1 | Jeong2018 | 0.9538 | log-mel energies, waveform | mixup | | automatic | 16k, 32k, 44.1kHz
Jeong_COCAI_task2_2 | Jeong2018 | 0.9506 | log-mel energies, waveform | mixup | | automatic | 16k, 32kHz
Jeong_COCAI_task2_3 | Jeong2018 | 0.9405 | log-mel energies | mixup | | automatic | 32kHz
Nguyen_NTU_task2_1 | Nguyen2018 | 0.9496 | log-mel energies | block mixing, randomly erase/cutout, time stretching, pitch shifting | | automatic | 44.1kHz
Nguyen_NTU_task2_2 | Nguyen2018 | 0.9251 | log-mel energies | block mixing, randomly cutout, time stretching, pitch shifting | | automatic | 44.1kHz
Nguyen_NTU_task2_3 | Nguyen2018 | 0.9213 | log-mel energies | block mixing, randomly cutout, time stretching, pitch shifting | | automatic | 44.1kHz
Nguyen_NTU_task2_4 | Nguyen2018 | 0.9478 | log-mel energies | block mixing, randomly erase/cutout, time stretching, pitch shifting | | automatic | 44.1kHz
Wilhelm_UKON_task2_1 | Wilhelm2018 | 0.9435 | log-mel energies, raw audio | cropping, padding, time shifting, same class blending, different class blending | | | 44.1kHz
Wilhelm_UKON_task2_2 | Wilhelm2018 | 0.9416 | log-mel energies, raw audio | cropping, padding, time shifting, same class blending, different class blending | | | 44.1kHz
Kim_GIST_WisenetAI_task2_1 | Kim2018 | 0.9151 | MFCC, delta of MFCC, delta-delta of MFCC | time stretching, time shifting, additive wgn | | | 44.1kHz
Kim_GIST_WisenetAI_task2_2 | Kim2018 | 0.9133 | MFCC, delta of MFCC, delta-delta of MFCC | time stretching, time shifting, additive wgn | | | 44.1kHz
Kim_GIST_WisenetAI_task2_3 | Kim2018 | 0.9139 | MFCC, delta of MFCC, delta-delta of MFCC | time stretching, time shifting, additive wgn | | | 44.1kHz
Kim_GIST_WisenetAI_task2_4 | Kim2018 | 0.9174 | MFCC, delta of MFCC, delta-delta of MFCC | time stretching, time shifting, additive wgn | | | 44.1kHz
Xu_Aalto_task2_1 | Xu2018 | 0.9065 | log-mel energies | pitch shifting, mixup | Google AudioSet VGGish model | | 44.1kHz
Xu_Aalto_task2_2 | Xu2018 | 0.9081 | log-mel energies | pitch shifting, mixup | Google AudioSet VGGish model | | 44.1kHz
Chakraborty_IBM_Task2_1 | Chakraborty2018 | 0.9328 | spectrogram, spectral summaries | chunking, mixup, time-shift | | | 22050Hz
Chakraborty_IBM_Task2_2 | Chakraborty2018 | 0.9320 | spectrogram | chunking, mixup, time-shift | | | 22050Hz
Chakraborty_IBM_Task2_judges_award | Chakraborty2018 | 0.9079 | spectrogram with delta and delta-delta augmentation | chunking, mixup | | | 22050Hz
Han_NPU_task2_1 | Han2018 | 0.8723 | MFCC | | | | 44.1kHz
Zhesong_PKU_task2_1 | Yu2018 | 0.8807 | MFCC & raw audio | trim silence | | automatic | 44.1kHz & 16kHz
Hanyu_BUPT_task2 | Hanyu2018 | 0.7877 | log-mel energies | | | automatic | 44.1kHz
Wei_Kuaiyu_task2_1 | WEI2018 | 0.9409 | log-mel energies | mixup, random erasing | | | 44.1kHz
Wei_Kuaiyu_task2_2 | WEI2018 | 0.9423 | log-mel energies | mixup, random erasing | | | 44.1kHz
Colangelo_RM3_task2_1 | Colangelo2018 | 0.6978 | log-mel energies | | | | 44.1kHz
Shan_DBSonics_task2_1 | Ren2018 | 0.9405 | log-mel energies, MFCC | time stretching, pitch shift, reverb, dynamic range compression | VEGAS, SoundNet | | 44.1kHz
Kele_NUDT_task2_1 | Kele2018 | 0.9498 | log-mel energies | mixup | ImageNet-based pre-trained model | | 44.1kHz
Kele_NUDT_task2_2 | Kele2018 | 0.9441 | log-mel energies | mixup | ImageNet-based pre-trained model | | 44.1kHz
Agafonov_ITMO_task2_1 | Agafonov2018 | 0.9174 | log-mel energies | time stretching, pitch shifting | | | 16kHz
Agafonov_ITMO_task2_2 | Agafonov2018 | 0.9275 | log-mel energies | time stretching, pitch shifting | | | 16kHz
Wilkinghoff_FKIE_task2_1 | Wilkinghoff2018 | 0.9414 | PLP, MFCC, mel-spectrogram, raw data | mix-up, cutout, dropout, vertical shifts | | | 24kHz
Pantic_ETF_task2_1 | Pantic2018 | 0.9419 | CQT, mel-spectrogram | mixup, random erasing, width shift, zoom | pre-trained model | | 44.1kHz
Khadkevich_FB_task2_1 | Khadkevich2018 | 0.9188 | log-mel energies | | | | 16kHz
Khadkevich_FB_task2_2 | Khadkevich2018 | 0.9178 | log-mel energies | | | | 16kHz
Iqbal_Surrey_task2_1 | Iqbal2018 | 0.9484 | log-mel energies | mixup | | automatic | 32kHz
Iqbal_Surrey_task2_2 | Iqbal2018 | 0.9512 | log-mel energies | mixup | | automatic | 32kHz
Baseline_Surrey_task2_1 | Kong2018 | 0.9034 | log-mel energies | | | | 32kHz
Baseline_Surrey_task2_2 | Kong2018 | 0.8622 | log-mel energies | | | | 32kHz
Dorfer_CPJKU_task2_1 | Dorfer2018 | 0.9518 | Perceptual weighted power spectrogram, Logarithmic-filtered log-spectrogram | mixup | | automatic | 32kHz
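Log-mel energies are by far the most common input representation in the table above. A minimal sketch of such a front end using librosa (the window, hop and mel-band settings are illustrative assumptions, not taken from any particular report):

```python
import librosa

def logmel_features(path, sr=44100, n_fft=2048, hop_length=512, n_mels=64):
    """Load a clip and return log-mel energies of shape (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log-scale the mel energies
```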
Machine learning characteristics
Code | Tech. report | mAP@3 | Classifier | Ensemble subsystems | Decision making | System complexity
---|---|---|---|---|---|---
DCASE2018 baseline | Fonseca2018 | 0.6943 | CNN | | | 658100
Jeong_COCAI_task2_1 | Jeong2018 | 0.9538 | CNN | 30 | geometric mean | 414200805
Jeong_COCAI_task2_2 | Jeong2018 | 0.9506 | CNN | 20 | geometric mean | 276133870
Jeong_COCAI_task2_3 | Jeong2018 | 0.9405 | CNN | 5 | geometric mean | 55461000
Nguyen_NTU_task2_1 | Nguyen2018 | 0.9496 | CNN | 8 | geometric mean | 5679784
Nguyen_NTU_task2_2 | Nguyen2018 | 0.9251 | CNN | | | 652705
Nguyen_NTU_task2_3 | Nguyen2018 | 0.9213 | CNN | | | 769297
Nguyen_NTU_task2_4 | Nguyen2018 | 0.9478 | CNN | 8 | geometric mean | 5679784
Wilhelm_UKON_task2_1 | Wilhelm2018 | 0.9435 | CNN | 5 | geometric mean | 15683245
Wilhelm_UKON_task2_2 | Wilhelm2018 | 0.9416 | CNN | | geometric mean | 3136649
Kim_GIST_WisenetAI_task2_1 | Kim2018 | 0.9151 | CNN, ensemble | 10 | mean probability | 7197609
Kim_GIST_WisenetAI_task2_2 | Kim2018 | 0.9133 | CNN, ensemble | 10 | mean probability | 7197609
Kim_GIST_WisenetAI_task2_3 | Kim2018 | 0.9139 | CNN, ensemble | 10 | mean probability | 7197609
Kim_GIST_WisenetAI_task2_4 | Kim2018 | 0.9174 | CNN, ensemble | 10 | mean probability | 7197609
Xu_Aalto_task2_1 | Xu2018 | 0.9065 | CNN, DNN, multi-level attention | 20 | geometric mean | 19119200
Xu_Aalto_task2_2 | Xu2018 | 0.9081 | CNN, DNN, multi-level attention | 120 | geometric mean | 114715200
Chakraborty_IBM_Task2_1 | Chakraborty2018 | 0.9328 | CNN, XGBoost, ensemble | | geometric averaging, XGBoost classifier | 29270658
Chakraborty_IBM_Task2_2 | Chakraborty2018 | 0.9320 | CNN | | geometric averaging | 27603245
Chakraborty_IBM_Task2_judges_award | Chakraborty2018 | 0.9079 | CNN | | geometric averaging | 1464813
Han_NPU_task2_1 | Han2018 | 0.8723 | CNN | 2 | | 24810
Zhesong_PKU_task2_1 | Yu2018 | 0.8807 | Bilinear-CNN & WaveNet | 2 | | 658100
Hanyu_BUPT_task2 | Hanyu2018 | 0.7877 | CRNN | | mean probability | 980000
Wei_Kuaiyu_task2_1 | WEI2018 | 0.9409 | CNN | 2 | rank averaging | 9764770
Wei_Kuaiyu_task2_2 | WEI2018 | 0.9423 | CNN | 2 | rank averaging | 9764770
Colangelo_RM3_task2_1 | Colangelo2018 | 0.6978 | CRNN | | | 2446185
Shan_DBSonics_task2_1 | Ren2018 | 0.9405 | CNN | 4 | geometric mean | 6000000
Kele_NUDT_task2_1 | Kele2018 | 0.9498 | ensemble | 5 | probability vote |
Kele_NUDT_task2_2 | Kele2018 | 0.9441 | ensemble | 5 | probability vote |
Agafonov_ITMO_task2_1 | Agafonov2018 | 0.9174 | ensemble | 4 | average fusion | 1880684
Agafonov_ITMO_task2_2 | Agafonov2018 | 0.9275 | ensemble | 4 | geometric mean fusion | 12312447
Wilkinghoff_FKIE_task2_1 | Wilkinghoff2018 | 0.9414 | CNN, ensemble | 25 | logistic regression, neural network | 349396065
Pantic_ETF_task2_1 | Pantic2018 | 0.9419 | CNN | 11 | logistic regression | 94075101
Khadkevich_FB_task2_1 | Khadkevich2018 | 0.9188 | CNN | | |
Khadkevich_FB_task2_2 | Khadkevich2018 | 0.9178 | CNN | | |
Iqbal_Surrey_task2_1 | Iqbal2018 | 0.9484 | GCNN, GCRNN | 4 | geometric mean | 81701540
Iqbal_Surrey_task2_2 | Iqbal2018 | 0.9512 | CNN, GCNN, CRNN, GCRNN | 8 | geometric mean | 125787722
Baseline_Surrey_task2_1 | Kong2018 | 0.9034 | VGGish 8-layer CNN with global max pooling | | | 4691274
Baseline_Surrey_task2_2 | Kong2018 | 0.8622 | AlexNetish 4-layer CNN with global max pooling | | | 4309450
Dorfer_CPJKU_task2_1 | Dorfer2018 | 0.9518 | CNN, ensemble | 3 | average | 8369492
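Geometric-mean fusion is the most common decision-making scheme in the table above. A sketch of the idea under assumed array shapes (not any team's exact implementation):

```python
import numpy as np

def geometric_mean_fusion(model_probs, eps=1e-12):
    """Fuse per-model class probabilities, each of shape (n_clips, n_classes),
    by averaging in the log domain and renormalizing each row."""
    log_mean = np.mean([np.log(p + eps) for p in model_probs], axis=0)
    fused = np.exp(log_mean)
    return fused / fused.sum(axis=1, keepdims=True)

def top3_labels(fused, class_names):
    """Ranked top-3 labels per clip, as required for the mAP@3 submission."""
    order = np.argsort(-fused, axis=1)[:, :3]
    return [[class_names[j] for j in row] for row in order]
```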
Technical reports
AUDIO TAGGING USING LABELED DATASET WITH NEURAL NETWORKS
Iurii Agafonov and Evgeniy Shuranov
Speech Information Systems (ITMO), ITMO University, Saint Petersburg, Russia. Database Collection and Processing Group (STC), Speech Technology Center, Saint Petersburg, Russia.
Agafonov_ITMO_task2_2
Abstract
In this paper, an audio tagging system is proposed. The system uses a fusion of five Convolutional Neural Network (CNN) classifiers and one Convolutional Recurrent Neural Network (CRNN) classifier in an attempt to achieve better results. The proposed system reaches a score of 0.95 on the public leaderboard.
System characteristics
Sampling rate | 16kHz |
Data augmentation | time stretching, pitch shifting |
Features | log-mel energies |
Classifier | ensemble |
Decision making | geometric mean fusion |
Ensemble subsystems | 4 |
Complexity | 12312447 parameters |
DIVERSIFIED SYSTEM OF DEEP CONVOLUTIONAL NEURAL NETWORKS WITH STACKED SPECTRAL FUSION FOR AUDIO TAGGING
Ria Chakraborty
Cognitive Business Decision Services (IBM), International Business Machines India, Kolkata, India.
Chakraborty_IBM_Task2_judges_award
Abstract
This paper outlines a diversified system of deep convolutional neural networks with stacked fusion of spectral features for DCASE 2018 Task 2 [1], Freesound general-purpose audio tagging. The primary objective of this research has been to develop a solution that delivers decent performance and can be deployed within reasonable resource constraints. The two best-performing submissions are the results of only two and three different CNNs respectively, with their outputs combined by a boosted tree algorithm with fused spectral features. This paper describes the submissions made under the team name Gyat, the different feature representations and data augmentations they leverage, and the marginal benefits each brings to the table. Experimental results show that the proposed systems and preprocessing methods effectively learn acoustic characteristics from the audio recordings, and that the ensemble models further reduce the error rate significantly, exhibiting mAP@3 scores of 0.945 and 0.947 respectively on the public leaderboard. The baseline score for this task is 0.704 (public LB score).
System characteristics
Sampling rate | 22050Hz |
Data augmentation | chunking, mixup |
Features | spectrogram with delta and delta-delta augmentation |
Classifier | CNN |
Decision making | geometric averaging |
Complexity | 1464813 parameters |
CONVOLUTIONAL RECURRENT NEURAL NETWORK FOR AUDIO EVENTS CLASSIFICATION
Federico Colangelo, Federica Battisti, Alessandro Neri and Marco Carli
Department of Engineering (RM3), Universita degli studi Roma Tre, Rome, Italy.
Colangelo_RM3_task2_1
Abstract
Audio event recognition is becoming a hot topic both in research and in industry. Nowadays, thanks to the availability of cheap sensors, the acquisition of high-quality audio is much easier. However, new challenges arise: the large number of inputs requires adequate means for coding, transmitting, and storing the recorded data. Moreover, to build systems that can act based on their surroundings (e.g. autonomous cars), automatic tools for detecting specific audio events are needed. In this paper, the effectiveness of an architecture based on a combination of convolutional and recurrent neural networks for general-purpose audio event detection is evaluated. Specifically, the architecture is evaluated in the context of the DCASE challenge on general-purpose audio tagging, in order to provide a clear comparison with architectures based on different principles.
System characteristics
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Complexity | 2446185 parameters |
TRAINING GENERAL-PURPOSE AUDIO TAGGING NETWORKS WITH NOISY LABELS AND ITERATIVE SELF-VERIFICATION
Matthias Dorfer and Gerhard Widmer
Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria.
Dorfer_CPJKU_task2_1
Abstract
This work describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our solution is based on a fully convolutional neural network that predicts one out of 41 possible audio class labels when given an audio spectrogram excerpt as an input. What makes this classification dataset and the task in general special is the fact that only 3,700 of the 9,500 provided training examples are delivered with manually verified ground truth labels. The remaining non-verified observations are expected to contain a substantial amount of label noise (30-35%). We propose to address this issue by a simple, iterative self-verification process, which gradually shifts unverified labels into the verified, trusted training set. The decision criterion for self-verifying a training example is the prediction consensus of a previous snapshot of the network on multiple short sliding window excerpts of the training example at hand. This procedure requires a carefully chosen cross-validation setup, as large enough neural networks are able to learn an entire dataset by heart, even in the face of noisy label data. On the unseen test data, an ensemble of three networks trained with this self-verification approach achieves a mean average precision (MAP@3) of 0.951. This is the second best out of 558 submissions to the corresponding Kaggle challenge.
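The decision criterion described above might be sketched as follows (our reading of the abstract, not the authors' code; `predict_proba` stands for a snapshot network whose predictions are assumed to be averaged over sliding-window excerpts already):

```python
import numpy as np

def self_verification_round(predict_proba, X_unverified, noisy_labels,
                            min_confidence=0.9):
    """One iteration of the self-verification idea: keep an unverified
    example only if the snapshot model agrees with its noisy label at
    high confidence. Returns a boolean mask of examples to move into
    the trusted training set. The threshold is an assumed value."""
    probs = predict_proba(X_unverified)        # (n_clips, n_classes)
    predicted = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= min_confidence
    return (predicted == noisy_labels) & confident
```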
System characteristics
Sampling rate | 32.0kHz |
Data augmentation | mixup |
Features | Perceptual weighted power spectrogram, Logarithmic-filtered log-spectrogram |
Classifier | CNN, ensemble |
Decision making | average |
Ensemble subsystems | 3 |
Re-labeling | automatic |
Complexity | 8369492 parameters |
GENERAL-PURPOSE TAGGING OF FREESOUND AUDIO WITH AUDIOSET LABELS: TASK DESCRIPTION, DATASET, AND BASELINE
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis and Xavier Serra
Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Spain. Machine Perception Team (GOOGLE), Google Research, New York, USA.
DCASE2018 baseline
Abstract
This paper describes Task 2 of the DCASE 2018 Challenge, titled “General-purpose audio tagging of Freesound content with AudioSet labels”. This task was hosted on the Kaggle platform as “Freesound General-Purpose Audio Tagging Challenge”. The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.
System characteristics
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Complexity | 658100 parameters |
CIAIC-GATFC SYSTEM FOR DCASE2018 CHALLENGE TASK2
Xueyu Han, Di Li and Qing Liu
Center of Intelligence Acoustics and Immersive Communication (NPU), Northwestern Polytechnical University, Xi'an, China.
Han_NPU_task2_1
Abstract
In this report, we present our method to tackle the problem of general-purpose automatic audio tagging described in DCASE 2018 challenge Task 2. Two convolutional neural network (CNN) models with different inputs and different architectures were trained respectively. Outputs from the two CNN models were then fused together to give a final decision. In particular, the distribution of training samples among the 41 categories was unequal. Therefore, we present a data augmentation method that guarantees an equal number of training samples per category. A relative 21.4% improvement over the DCASE baseline system [1] is achieved on the public Kaggle leaderboard.
System characteristics
Sampling rate | 44.1kHz |
Features | MFCC |
Classifier | CNN |
Ensemble subsystems | 2 |
Complexity | 24810 parameters |
A SYSTEM FOR DCASE 2018 CHALLENGE USING CRNN WITH MEL FEATURES
Zhang Hanyu and Li Shengchen
Embedded Artificial Intelligence Group (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.
Hanyu_BUPT_task2
Abstract
For the general-purpose audio tagging task (General-purpose audio tagging of Freesound content with AudioSet labels) of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), we propose a method to classify 41 different acoustic events using a Convolutional Recurrent Neural Network (CRNN) with log Mel spectrograms. First, the waveform of the audio recordings is transformed into a log Mel spectrogram and MFCCs. Convolutional layers are then applied to the log Mel spectrogram and Mel-frequency cepstral coefficients to extract high-level features. The features are fed into a Recurrent Neural Network (RNN) for classification. On the official development set of the challenge, the best mAP is 0.8613, a 6.83% improvement over the baseline.
System characteristics
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CRNN |
Decision making | mean probability |
Re-labeling | automatic |
Complexity | 980000 parameters |
STACKED CONVOLUTIONAL NEURAL NETWORKS FOR GENERAL-PURPOSE AUDIO TAGGING
Turab Iqbal, Qiuqiang Kong, Mark Plumbley and Wenwu Wang
Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, Guildford, UK.
Iqbal_Surrey_task2_2
Abstract
This technical report describes the methods used for classifying sound events as part of Task 2 of the DCASE 2018 challenge. The data used in this task requires a number of considerations, including how to handle variable-length audio samples and the presence of noisy labels. We propose a number of neural network architectures that learn from mel-spectrogram inputs. These baseline models involve the use of preprocessing techniques, data augmentation, and pseudo-labeling in order to improve their performance. They are then ensembled using a popular technique known as stacking. On the test set used for evaluation, compared to the baseline mean average precision score of 0.704, our system achieved a score of 0.961.
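As a rough illustration of the stacking step mentioned above (not the authors' implementation; the choice of logistic regression as the second-level model and the array shapes are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_predictions(train_probs, y_train, test_probs):
    """Second-level model trained on concatenated first-level class
    probabilities. train_probs/test_probs are lists of (n, n_classes)
    arrays, one per base model (ideally out-of-fold for training)."""
    X_train = np.hstack(train_probs)
    X_test = np.hstack(test_probs)
    meta = LogisticRegression(max_iter=1000)
    meta.fit(X_train, y_train)
    return meta.predict_proba(X_test)
```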
System characteristics
Sampling rate | 32kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, GCNN, CRNN, GCRNN |
Decision making | geometric mean |
Ensemble subsystems | 8 |
Re-labeling | automatic |
Complexity | 125787722 parameters |
AUDIO TAGGING SYSTEM FOR DCASE 2018: FOCUSING ON LABEL NOISE, DATA AUGMENTATION AND ITS EFFICIENT LEARNING
Il-Young Jeong and Hyungui Lim
COCAI, Cochlear.ai, Seoul, South Korea.
Jeong_COCAI_task2_3
Abstract
In this technical report, we expound on the techniques and models applied to our submission for DCASE 2018: General-purpose audio tagging of Freesound content with AudioSet labels. We aim to focus primarily on how to train a deep-learning model efficiently under strong augmentation and label noise. First, we constructed a single-block DenseNet architecture and a multi-head softmax classifier for efficient learning with mixup augmentation. For the label noise, we tried batch-wise loss masking, which eliminates the loss of outliers in a mini-batch. We also tried an ensemble of various models, trained using different sampling rates or audio representations.
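Mixup, used by this and many other submissions, trains on convex combinations of pairs of examples and their labels. A minimal sketch (the alpha value is an assumption):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two training examples and their one-hot label vectors
    with a Beta-distributed mixing coefficient (Zhang et al., 2018)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```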
System characteristics
Sampling rate | 32kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | geometric mean |
Ensemble subsystems | 5 |
Re-labeling | automatic |
Complexity | 55461000 parameters |
NUDT SOLUTION FOR AUDIO TAGGING TASK OF DCASE 2018 CHALLENGE
Xu Kele, Zhu Boqing, Wang Dezhi, Peng Yuxing, Wang Huaimin, Zhang Lilun and Li Bo
College of Meteorology and Oceanography (NUDT), National University of Defense Technology, Changsha, China. Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China. Department of Automation (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.
Kele_NUDT_task2_2
Abstract
In this technical report, we describe our solution for the general-purpose audio tagging task, one of the subtasks in the DCASE 2018 challenge. For the solution, we employed both deep learning methods and shallow learners based on statistical features. For single models, only deep learning approaches are investigated, and different deep neural network architectures are tested with different kinds of input, ranging from the raw signal and log-scaled Mel-spectrograms (log Mel) to Mel-Frequency Cepstral Coefficients (MFCC). For log Mel and MFCC, the delta and delta-delta information is also used to formulate three-channel features. Inception, ResNet, ResNeXt and Dual Path Networks (DPN) are selected as the neural network architectures, while mixup is used for data augmentation. Using ResNeXt, our best single convolutional neural network architecture provides an mAP@3 of 0.967 on the public Kaggle leaderboard. Moreover, to improve the accuracy further, we also propose a meta-learning-based ensemble method. By exploiting the diversity between different architectures, the meta-learning-based model can provide higher prediction accuracy and robustness in comparison to a single model. Using the proposed meta-learning method, our solution achieves an mAP@3 of 0.977 (rank 1 of 555) on the public Kaggle leaderboard, while the baseline gives an mAP@3 of 0.70.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | ensemble |
Decision making | probability vote |
Ensemble subsystems | 5 |
ACOUSTIC SCENE AND EVENT DETECTION SYSTEMS SUBMITTED TO DCASE 2018 CHALLENGE
Maksim Khadkevich
AML (FB), Facebook, Menlo Park, CA, USA.
Khadkevich_FB_task2_2
Abstract
In this technical report we describe the systems that have been submitted to the DCASE 2018 challenge. Feature extraction and the convolutional neural network (CNN) architecture are outlined. For tasks 1c and 2 we describe the transfer learning approach that has been applied. Model training and inference are finally presented.
System characteristics
Sampling rate | 16kHz |
Features | log-mel energies |
Classifier | CNN |
GIST_WISENETAI AUDIO TAGGER BASED ON CONCATENATED RESIDUAL NETWORK FOR DCASE 2018 CHALLENGE TASK 2
Nam Kyun Kim, Jeong Hyeon Yang, Hong Kook Kim, Jeong Eun Lim, Jin Soo Park and Ji Hyun Park
School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. Algorithm R&D Team (Hanwha Techwin), Hanwha Techwin, Sungnam, Korea.
Kim_GIST_WisenetAI_task2_4
Abstract
In this report, we describe the method and performance of an acoustic event tagger applied to Task 2 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge, which evaluates systems for general-purpose audio tagging with an increased number of categories and using data with annotations of varying reliability. The proposed audio tagger, called GIST_WisenetAI and developed in a collaboration between GIST and Hanwha Techwin, is based on a concatenated residual network (ConResNet). In particular, the proposed ConResNet is composed of two types of convolutional neural network (CNN) residual networks (CNN-ResNet): a 2D CNN-ResNet and a 1D CNN-ResNet, using a sequence of mel-frequency cepstral coefficients (MFCCs) and their statistics, respectively, as input features. In order to improve audio tagging performance, k different ConResNets are trained using k-fold cross-validation and then linearly combined to generate an ensemble classifier. In this task, 9,473 audio samples for training/validation are divided into 10 folds, and 9,400 audio samples are given for testing. Consequently, the proposed method provides a mean average precision at top 3 (MAP@3) of 0.958, measured through the Kaggle platform.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | time stretching, time shifting, additive wgn |
Features | MFCC, delta of MFCC, delta-delta of MFCC |
Classifier | CNN,ensemble |
Decision making | mean probability |
Ensemble subsystems | 10 |
Complexity | 7197609 parameters |
DCASE 2018 CHALLENGE SURREY CROSS-TASK CONVOLUTIONAL NEURAL NETWORK BASELINE
Qiuqiang Kong, Iqbal Turab, Xu Yong, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK.
Baseline_Surrey_task2_2
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection and 5) multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains implementations of convolutional neural networks (CNNs), including the AlexNetish and VGGish networks from the image processing area. We investigated how the performance varies from task to task when the configuration of the neural networks is kept the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, while on Task 1 the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4 and an F1 score of 87.75% on Task 5.
System characteristics
Sampling rate | 32kHz |
Features | log-mel energies |
Classifier | AlexNetish 4 layer CNN with global max pooling |
Complexity | 4309450 parameters |
DCASE 2018 TASK 2: ITERATIVE TRAINING, LABEL SMOOTHING, AND BACKGROUND NOISE NORMALIZATION FOR AUDIO EVENT TAGGING
Thi Ngoc Tho Nguyen, Ngoc Khanh Nguyen, Douglas L. Jones and Woon Seng Gan
Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. Electrical and Computer Engineering (UIUC), University of Illinois Urbana-Champaign, Illinois, USA. SWAT, SWAT, Singapore.
Nguyen_NTU_task2_4
Abstract
This paper describes the approach behind our submissions for DCASE 2018 Task 2: general-purpose audio tagging of Freesound content with AudioSet labels. To tackle the problem of diverse recording environments, we propose to use background noise normalization. To tackle the problem of noisy labels, we propose to use pseudo-labels for automatic label verification and label smoothing to reduce over-fitting. We train several convolutional neural networks with data augmentation and different input sizes for the automatic label verification process. The procedure is promising for improving the quality of datasets for audio classification. On the public leaderboard for the competition, our single model and an ensemble of 8 models score 0.941 and 0.963 respectively.
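Label smoothing, as used here to soften potentially noisy targets, can be sketched as follows (the smoothing factor is an assumed value, not taken from the report):

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """Mix one-hot targets with the uniform distribution so the model
    is never pushed toward fully confident (possibly wrong) labels."""
    n_classes = y_onehot.shape[-1]
    return y_onehot * (1.0 - epsilon) + epsilon / n_classes
```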
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | block mixing, randomly erase/cutout, time stretching, pitch shifting |
Features | log-mel energies |
Classifier | CNN |
Decision making | geometric mean |
Ensemble subsystems | 8 |
Re-labeling | automatic |
Complexity | 5679784 parameters |
ENSEMBLE OF CONVOLUTIONAL NEURAL NETWORKS FOR GENERAL PURPOSE AUDIO TAGGING
Bogdan Pantic
Signals and Systems Department (ETF), School of Electrical Engineering, Belgrade, Serbia.
Pantic_ETF_task2_1
Abstract
This work describes our solution for the general-purpose audio tagging task of the DCASE 2018 challenge. We propose an ensemble of several Convolutional Neural Networks (CNNs) with different properties. Logistic regression is used as a meta-classifier to produce the final predictions. Experiments demonstrate that the ensemble outperforms each CNN individually. Finally, the proposed system achieves a Mean Average Precision (MAP) score of 0.956 on the public leaderboard, a significant improvement over the baseline.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup, random erasing, width shift, zoom |
Features | CQT, mel-spectrogram |
Classifier | CNN |
Decision making | Logistic regression |
Ensemble subsystems | 11 |
Complexity | 94075101 parameters |
AUTOMATIC AUDIO TAGGING WITH 1D AND 2D CONVOLUTIONAL NEURAL NETWORKS
Siyuan Shan and Yi Ren
DB Sonics, Beijing, China.
Shan_DBSonics_task2_1
Abstract
In this work, we ensemble four different models for the audio tagging task. The first two models are 2D convolutional neural networks (CNNs) that take a Mel spectrogram and an MFCC spectrogram respectively as input features, while the last two models are 1D CNN architectures that take the raw waveform as input. Data augmentation techniques, including time stretching, pitch shifting, reverb and dynamic range compression, are also employed for better generalization, and transfer learning from two external datasets is adopted. These components together contribute to our final recognition performance reported on Kaggle.
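Two of the four augmentations named above, sketched with librosa (the parameter ranges are illustrative assumptions, not the authors' settings):

```python
import numpy as np
import librosa

def augment_waveform(y, sr):
    """Randomly time-stretch and pitch-shift a waveform; reverb and
    dynamic range compression would need an external tool (e.g. sox)."""
    rate = np.random.uniform(0.8, 1.2)        # stretch factor
    n_steps = np.random.uniform(-2.0, 2.0)    # pitch shift in semitones
    y = librosa.effects.time_stretch(y, rate=rate)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
```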
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | time stretching, pitch shift, reverb, dynamic range compression |
Features | log-mel energies, MFCC |
Classifier | CNN |
Decision making | geometric mean |
Ensemble subsystems | 4 |
Complexity | 6000000 parameters |
A REPORT ON AUDIO TAGGING WITH DEEPER CNN, 1D-CONVNET AND 2D-CONVNET
Qingkai WEI, Yanfang LIU and Xiaohui RUAN
Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC.
Wei_Kuaiyu_task2_2
A REPORT ON AUDIO TAGGING WITH DEEPER CNN, 1D-CONVNET AND 2D-CONVNET
Qingkai WEI, Yanfang LIU and Xiaohui RUAN
Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC.
Abstract
General-purpose audio tagging is a newly proposed task in DCASE 2018 which can provide insight towards broadly-applicable sound event classifiers. In this paper, two systems (named 1D-ConvNet and 2D-ConvNet) with small kernel sizes, multiple functional modules and deeper convolutional neural networks (CNNs) are developed to improve performance on this task. Different audio features are used: raw waveforms for the 1D-ConvNet, while frequency-domain features such as MFCC, log-mel spectrogram, multi-resolution log-mel spectrogram and spectrogram are compared as the 2D-ConvNet input. Trained and evaluated on the DCASE 2018 Challenge Task 2 dataset, the best single 1D-ConvNet and 2D-ConvNet models achieve Kaggle public leaderboard scores of 0.877 and 0.961 respectively. In addition, an ensemble with rank-averaged predictions scores 0.968 on the public leaderboard, ranking 5/556.
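Rank averaging, the decision rule used here, replaces each model's raw scores with their ranks across clips before averaging, which makes models with differently calibrated outputs comparable. A minimal sketch, not the authors' exact code:

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(prob_list):
    """Average the per-class rank (normalized to [0, 1]) of each clip's
    score across models instead of the raw probabilities."""
    n_clips = len(prob_list[0])
    ranks = [rankdata(p, axis=0) / n_clips for p in prob_list]
    return np.mean(ranks, axis=0)
```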
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup, random erasing |
Features | log-mel energies |
Classifier | CNN |
Decision making | rank averaging |
Ensemble subsystems | 2 |
Complexity | 9764770 parameters |
COMBINING HIGH-LEVEL FEATURES OF RAW AUDIO AND SPECTROGRAMS FOR AUDIO TAGGING
Benjamin Wilhelm and Marcel Lederle
Computer and Information Science (UKON), University of Konstanz, Constance, Germany.
Wilhelm_UKON_task2_2
COMBINING HIGH-LEVEL FEATURES OF RAW AUDIO AND SPECTROGRAMS FOR AUDIO TAGGING
Benjamin Wilhelm and Marcel Lederle
Computer and Information Science (UKON), University of Konstanz, Constance, Germany.
Abstract
We introduce a method for general-purpose audio tagging that combines high-level features computed from the spectrogram and the raw audio data. We use convolutional neural networks with one-dimensional and two-dimensional convolutions to extract these high-level features and combine them with a densely connected neural network to make predictions. Our method places in the top two percent of the Freesound General-Purpose Audio Tagging Challenge.
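A minimal Keras sketch of the two-branch idea: a 1D-convolutional branch on the raw waveform and a 2D-convolutional branch on the spectrogram, with their pooled features concatenated into a dense classification head. All layer sizes and input shapes here are placeholders, not the paper's architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

raw_in = layers.Input(shape=(32000, 1))     # raw-audio branch (placeholder length)
spec_in = layers.Input(shape=(64, 128, 1))  # log-mel spectrogram branch

x1 = layers.Conv1D(16, 9, strides=4, activation="relu")(raw_in)
x1 = layers.GlobalMaxPooling1D()(x1)        # high-level raw-audio features

x2 = layers.Conv2D(16, (3, 3), activation="relu")(spec_in)
x2 = layers.GlobalMaxPooling2D()(x2)        # high-level spectrogram features

merged = layers.Concatenate()([x1, x2])     # combine both feature sets
out = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(41, activation="softmax")(out)  # 41 Task 2 classes

model = tf.keras.Model([raw_in, spec_in], out)
```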
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | cropping, padding, time shifting, same class blending, different class blending |
Features | log-mel energies, raw audio |
Classifier | CNN |
Decision making | geometric mean |
Complexity | 3136649 parameters |
GENERAL-PURPOSE AUDIO TAGGING BY ENSEMBLING CONVOLUTIONAL NEURAL NETWORKS BASED ON MULTIPLE FEATURES
Kevin Wilkinghoff
Communication Systems (FKIE), Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany.
Wilkinghoff_FKIE_task2_1
GENERAL-PURPOSE AUDIO TAGGING BY ENSEMBLING CONVOLUTIONAL NEURAL NETWORKS BASED ON MULTIPLE FEATURES
Kevin Wilkinghoff
Communication Systems (FKIE), Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany.
Abstract
This paper describes an audio tagging system which participated in Task 2 “General-purpose audio tagging of Freesound content with AudioSet labels” of the “Detection and Classification of Acoustic Scenes and Events (DCASE)” Challenge 2018. The system is an ensemble of five convolutional neural networks based on Mel-frequency Cepstral Coefficients, Perceptual Linear Prediction features, Mel-spectrograms and the raw audio data. To ensemble all models, score-based fusion via logistic regression is performed with another neural network. Experimental evaluations show that ensembling the models significantly improves upon the performance of the individual models.
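Of the augmentations listed below, mix-up is the least self-explanatory: random pairs of training examples and their labels are blended with a Beta-distributed coefficient. A minimal sketch; the alpha value is illustrative:

```python
import numpy as np

def mixup(batch_x, batch_y, alpha=0.2, rng=None):
    """Blend random pairs of examples and their one-hot labels with a
    Beta(alpha, alpha)-distributed mixing coefficient."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(batch_x))
    mixed_x = lam * batch_x + (1.0 - lam) * batch_x[perm]
    mixed_y = lam * batch_y + (1.0 - lam) * batch_y[perm]
    return mixed_x, mixed_y
```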
System characteristics
Sampling rate | 24kHz |
Data augmentation | mix-up, cutout, dropout, vertical shifts |
Features | PLP, MFCC, mel-spectrogram, raw data |
Classifier | CNN, ensemble |
Decision making | logistic regression, neural network |
Ensemble subsystems | 25 |
Complexity | 349396065 parameters |
THE AALTO SYSTEM BASED ON FINE-TUNED AUDIOSET FEATURES FOR DCASE2018 TASK2 – GENERAL PURPOSE AUDIO TAGGING
Zhicun Xu, Peter Smit and Mikko Kurimo
Department of Signal Processing and Acoustics (Aalto), Aalto University, Espoo, Finland.
Xu_Aalto_task2_2
THE AALTO SYSTEM BASED ON FINE-TUNED AUDIOSET FEATURES FOR DCASE2018 TASK2 – GENERAL PURPOSE AUDIO TAGGING
Zhicun Xu, Peter Smit and Mikko Kurimo
Department of Signal Processing and Acoustics (Aalto), Aalto University, Espoo, Finland.
Abstract
In this paper, we present a neural network system for DCASE 2018 Task 2, general-purpose audio tagging. We fine-tuned the Google AudioSet feature-generation model with different settings for the given 41 classes, on top of a fully connected layer with 100 units. We then used the fine-tuned models to generate 128-dimensional features for each 0.96 s of audio. We tried different neural network structures, including LSTM and multi-level attention models; in our experiments, the multi-level attention model showed its superiority over the others. Truncating silent parts, repeating and splitting the audio into fixed lengths, pitch-shifting augmentation, and mixup techniques all improved the results by a reasonable amount. The proposed system achieved a MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and places in the top 7% of the public leaderboard.
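A sketch of a single attention module of the kind stacked in a multi-level attention model: per-frame class predictions are averaged with attention weights normalized over time. The 10-frame input and layer sizes are assumptions; the full model stacks several such modules at different depths:

```python
import tensorflow as tf
from tensorflow.keras import layers

frames = layers.Input(shape=(10, 128))  # 128-d AudioSet embeddings, one per 0.96 s frame

cla = layers.Dense(41, activation="sigmoid")(frames)  # frame-level class predictions
att = layers.Dense(41)(frames)
att = layers.Softmax(axis=1)(att)       # attention weights normalized over time

# Clip-level prediction: attention-weighted average of the frame predictions.
clip_pred = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([cla, att])

model = tf.keras.Model(frames, clip_pred)
```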
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | pitch shifting, mixup |
Features | log-mel energies |
Classifier | CNN, DNN, multi-level attention |
Decision making | geometric mean |
Ensemble subsystems | 120 |
Complexity | 114715200 parameters |
MODEL OF B-CNN AND WAVENET IN TASK 2
Zhesong Yu
Institute of Computer Science & Technology (PKU), Peking University, Beijing, China.
Zhesong_PKU_task2_1
MODEL OF B-CNN AND WAVENET IN TASK 2
Zhesong Yu
Institute of Computer Science & Technology (PKU), Peking University, Beijing, China.
Abstract
The system has two parts: a B-CNN operating on MFCC features and a WaveNet operating on raw audio. B-CNN is a model proposed for fine-grained image classification, and WaveNet is a model used for audio generation. The purpose of this paper is to examine the performance of these two models on this audio classification task.
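For context, the bilinear (B-CNN) pooling referenced above forms the outer product of two feature maps at every location, sums over locations, and normalizes. A minimal NumPy sketch of that pooling step, with shapes assumed rather than taken from the paper:

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """B-CNN pooling: location-wise outer product of two feature maps,
    summed over locations, then signed-sqrt and L2-normalized."""
    # feat_a: (locations, da), feat_b: (locations, db)
    phi = (feat_a.T @ feat_b).ravel()           # pooled outer products, flattened
    phi = np.sign(phi) * np.sqrt(np.abs(phi))   # signed square root
    return phi / (np.linalg.norm(phi) + 1e-12)  # L2 normalization
```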
System characteristics
Sampling rate | 44.1kHz & 16kHz |
Features | MFCC & raw audio |
Classifier | Bilinear-CNN & WaveNet |
Ensemble subsystems | 2 |
Re-labeling | automatic |
Complexity | 658100 parameters |