General-purpose audio tagging of Freesound content with AudioSet labels


Challenge results

Task description

This task evaluates systems for general-purpose audio tagging with an increased number of categories and using data with annotations of varying reliability. This task will provide insight towards the development of broadly-applicable sound event classifiers that consider an increased and diverse number of categories.

A more detailed task description can be found on the task description page or on the competition page on Kaggle.

IMPORTANT NOTE: the task results shown on this page only include the submissions that were made using the DCASE submission system. Therefore, there might be some entries in the official Kaggle leaderboard that do not appear here. Be aware that, because of the missing entries, the ranking of some teams on this page might differ from the Kaggle leaderboard.

Systems ranking

Rank Submission code Name Tech. report mAP@3 (Private leaderboard)* mAP@3 (Public leaderboard)
DCASE2018 baseline Baseline Fonseca2018 0.6943 0.7049
Jeong_COCAI_task2_1 Cochlear.ai_1 Jeong2018 0.9538 0.9751
Jeong_COCAI_task2_2 Cochlear.ai_2 Jeong2018 0.9506 0.9751
Jeong_COCAI_task2_3 Cochlear.ai_3 Jeong2018 0.9405 0.9729
Nguyen_NTU_task2_1 NTU_ensemble8 Nguyen2018 0.9496 0.9635
Nguyen_NTU_task2_2 NTU_labelsmoothing Nguyen2018 0.9251 0.9413
Nguyen_NTU_task2_3 NTU_bgnormalization Nguyen2018 0.9213 0.9297
Nguyen_NTU_task2_4 NTU_en8_augment_test Nguyen2018 0.9478 0.9601
Wilhelm_UKON_task2_1 CNN on Raw-Audio and Spectrogram Wilhelm2018 0.9435 0.9662
Wilhelm_UKON_task2_2 CNN on Raw-Audio and Spectrogram Wilhelm2018 0.9416 0.9568
Kim_GIST_task2_1 ConResNet Kim2018 0.9151 0.9585
Kim_GIST_task2_2 ConResNet Kim2018 0.9133 0.9585
Kim_GIST_task2_3 ConResNet Kim2018 0.9139 0.9579
Kim_GIST_task2_4 ConResNet Kim2018 0.9174 0.9563
Xu_Aalto_task2_1 Multi-level attention model on fine-tuned AudioSet features Xu2018 0.9065 0.9363
Xu_Aalto_task2_2 Multi-level attention model on fine-tuned AudioSet features Xu2018 0.9081 0.9319
Chakraborty_IBM_Task2_1 3 CNN L1 Stacked Fused Spectral XGBoost L2 Chakraborty2018 0.9328 0.9480
Chakraborty_IBM_Task2_2 2 CNN results geometrically averaged Chakraborty2018 0.9320 0.9452
Chakraborty_IBM_Task2_judges_award VGG Style CNN with 3 channel input Chakraborty2018 0.9079 0.9258
Han_NPU_task2_1 2ModEnsem Han2018 0.8723 0.9181
Zhesong_PKU_task2_1 BCNN_WaveNet Yu2018 0.8807 0.9197
Hanyu_BUPT_task2 CRNN Hanyu2018 0.7877 0.8029
Wei_Kuaiyu_task2_1 Kuaiyu tagging system WEI2018 0.9409 0.9690
Wei_Kuaiyu_task2_2 Kuaiyu tagging system WEI2018 0.9423 0.9673
Colangelo_RM3_task2_1 DCASE2018 Task2 CRNN RM3 Colangelo2018 0.6978 0.7309
Shan_DBSonics_task2_1 Shan DBSonics approach Ren2018 0.9405 0.9734
Kele_NUDT_task2_1 DCASE2018 Meta-learning system Kele2018 0.9498 0.9779
Kele_NUDT_task2_2 DCASE2018 Meta-learning system Kele2018 0.9441 0.9662
Agafonov_ITMO_task2_1 Fusion of 4 CNN Agafonov2018 0.9174 0.9502
Agafonov_ITMO_task2_2 Fusion of 4 CNN Agafonov2018 0.9275 0.9491
Wilkinghoff_FKIE_task2_1 CNN Ensemble based on Multiple Features Wilkinghoff2018 0.9414 0.9563
Pantic_ETF_task2_1 Ensemble of convolutional neural networks for general purpose audio tagging Pantic2018 0.9419 0.9563
Khadkevich_FB_task2_1 2 average pooling Khadkevich2018 0.9188 0.9131
Khadkevich_FB_task2_2 2 max pooling Khadkevich2018 0.9178 0.9103
Iqbal_Surrey_task2_1 Stacked CNN-CRNN (4 Models) Iqbal2018 0.9484 0.9568
Iqbal_Surrey_task2_2 Stacked CNN-CRNN (8 Models) Iqbal2018 0.9512 0.9612
Baseline_Surrey_task2_1 Surrey baseline CNN 8 layers Kong2018 0.9034 0.9203
Baseline_Surrey_task2_2 Surrey baseline CNN 4 layers Kong2018 0.8622 0.8854
Dorfer_CPJKU_task2_1 CNN - Iterative Self-Verification Dorfer2018 0.9518 0.9563

* Unless stated otherwise, all reported mAP@3 scores are computed using the ground truth for the private leaderboard.
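
Since each clip in this task has exactly one ground-truth label, mAP@3 reduces to the reciprocal of the rank at which the correct label appears among a system's top three predictions, averaged over clips. A minimal sketch in Python (for illustration only; not the official evaluation code):

```python
def map_at_3(truths, predictions):
    """mAP@3 with one ground-truth label per clip.

    truths: list of correct labels, one per audio clip.
    predictions: list of up-to-3 predicted labels per clip,
        ordered from most to least confident.
    """
    score = 0.0
    for truth, preds in zip(truths, predictions):
        for rank, label in enumerate(preds[:3], start=1):
            if label == truth:
                score += 1.0 / rank  # precision credit decays with rank
                break
    return score / len(truths)

# Correct at rank 1, correct at rank 2, missed: (1 + 0.5 + 0) / 3 = 0.5
print(map_at_3(["bark", "meow", "flute"],
               [["bark", "cough", "gong"],
                ["cough", "meow", "bark"],
                ["oboe", "cello", "knock"]]))
```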

Teams ranking

Table including only the best performing system per submitting team.

Rank Submission code Name Tech. report mAP@3 (Private leaderboard) mAP@3 (Public leaderboard)
DCASE2018 baseline Baseline Fonseca2018 0.6943 0.7049
Chakraborty_IBM_Task2_1 3 CNN L1 Stacked Fused Spectral XGBoost L2 Chakraborty2018 0.9328 0.9480
Han_NPU_task2_1 2ModEnsem Han2018 0.8723 0.9181
Baseline_Surrey_task2_1 Surrey baseline CNN 8 layers Kong2018 0.9034 0.9203
Jeong_COCAI_task2_1 Cochlear.ai_1 Jeong2018 0.9538 0.9751
Dorfer_CPJKU_task2_1 CNN - Iterative Self-Verification Dorfer2018 0.9518 0.9563
Shan_DBSonics_task2_1 Shan DBSonics approach Ren2018 0.9405 0.9734
Wilkinghoff_FKIE_task2_1 CNN Ensemble based on Multiple Features Wilkinghoff2018 0.9414 0.9563
Nguyen_NTU_task2_1 NTU_ensemble8 Nguyen2018 0.9496 0.9635
Zhesong_PKU_task2_1 BCNN_WaveNet Yu2018 0.8807 0.9197
Wilhelm_UKON_task2_1 CNN on Raw-Audio and Spectrogram Wilhelm2018 0.9435 0.9662
Hanyu_BUPT_task2 CRNN Hanyu2018 0.7877 0.8029
Kele_NUDT_task2_1 DCASE2018 Meta-learning system Kele2018 0.9498 0.9779
Agafonov_ITMO_task2_2 Fusion of 4 CNN Agafonov2018 0.9275 0.9491
Kim_GIST_task2_4 ConResNet Kim2018 0.9174 0.9563
Pantic_ETF_task2_1 Ensemble of convolutional neural networks for general purpose audio tagging Pantic2018 0.9419 0.9563
Khadkevich_FB_task2_1 2 average pooling Khadkevich2018 0.9188 0.9131
Iqbal_Surrey_task2_2 Stacked CNN-CRNN (8 Models) Iqbal2018 0.9512 0.9612
Xu_Aalto_task2_2 Multi-level attention model on fine-tuned AudioSet features Xu2018 0.9081 0.9319
Colangelo_RM3_task2_1 DCASE2018 Task2 CRNN RM3 Colangelo2018 0.6978 0.7309
Wei_Kuaiyu_task2_2 Kuaiyu tagging system WEI2018 0.9423 0.9673

Class-wise performance

The table below shows the mAP@3 scores computed per class.

Rank Submission code Name Tech. report mAP@3 acoustic guitar applause bark bass drum burping or eructation bus cello chime clarinet computer keyboard cough cowbell double bass drawer open or close electric piano fart finger snapping fireworks flute glockenspiel gong gunshot or gunfire harmonica hi-hat keys jangling knock meow microwave oven oboe saxophone scissors shatter snare drum squeak tambourine tearing telephone trumpet violin or fiddle writing
DCASE2018 baseline Baseline Fonseca2018 0.6943 0.6713 0.9744 0.8623 0.5797 0.7051 0.4833 0.8447 0.7708 0.9778 0.5397 0.7222 0.5637 0.6458 0.0625 0.7372 0.6181 0.7222 0.5449 0.8939 0.5694 0.8500 0.1373 0.8827 0.5208 0.7536 0.8854 0.8681 0.5764 0.8922 0.8447 0.2667 0.6528 0.3155 0.1806 0.8229 0.9773 0.6496 0.8611 0.7356 0.6319
Jeong_COCAI_task2_1 Cochlear.ai_1 Jeong2018 0.9538 0.9398 1.0000 0.9783 1.0000 1.0000 0.8917 0.9659 0.8750 1.0000 0.9683 1.0000 0.9853 1.0000 0.8681 0.9744 1.0000 1.0000 0.8077 1.0000 0.8681 0.9833 0.9412 0.9815 1.0000 0.9348 0.9635 0.9792 0.9722 1.0000 0.9451 0.8250 1.0000 1.0000 0.6458 1.0000 1.0000 0.7479 0.9833 1.0000 0.9375
Jeong_COCAI_task2_2 Cochlear.ai_2 Jeong2018 0.9506 0.9444 1.0000 0.9783 1.0000 1.0000 0.8917 0.9659 0.8681 1.0000 0.9683 1.0000 0.9804 1.0000 0.8681 0.9744 1.0000 1.0000 0.7692 1.0000 0.8681 0.9833 0.9510 0.9815 1.0000 0.9058 0.9635 0.9792 0.9514 1.0000 0.9318 0.8000 1.0000 1.0000 0.6319 0.9844 1.0000 0.7393 0.9833 1.0000 0.9583
Jeong_COCAI_task2_3 Cochlear.ai_3 Jeong2018 0.9405 0.9398 1.0000 0.9783 1.0000 1.0000 0.7500 0.9508 0.8819 1.0000 0.9444 1.0000 1.0000 1.0000 0.8403 1.0000 1.0000 1.0000 0.7564 1.0000 0.8264 0.9833 0.8922 0.9444 1.0000 0.8406 0.9167 0.9792 0.9375 1.0000 0.9527 0.5750 1.0000 1.0000 0.6181 1.0000 1.0000 0.7692 1.0000 1.0000 0.9167
Nguyen_NTU_task2_1 NTU_ensemble8 Nguyen2018 0.9496 0.8981 1.0000 0.9710 1.0000 1.0000 0.9250 0.9621 0.8264 1.0000 0.9286 1.0000 0.9804 1.0000 0.8889 0.9808 1.0000 1.0000 0.7949 1.0000 0.8264 0.9667 0.9085 0.9444 1.0000 0.8913 0.9583 0.9792 0.9792 0.9804 0.9886 0.6000 1.0000 1.0000 0.6736 1.0000 1.0000 0.7949 1.0000 1.0000 0.9583
Nguyen_NTU_task2_2 NTU_labelsmoothing Nguyen2018 0.9251 0.8380 1.0000 0.9783 0.9565 1.0000 0.9000 0.9356 0.7708 1.0000 0.9762 0.9792 0.9412 0.9844 0.8958 0.8846 0.9792 1.0000 0.7564 0.9432 0.7917 0.9667 0.9183 0.9012 0.9792 0.8261 0.9583 0.9792 0.8681 0.9804 0.9943 0.6417 0.9375 0.9643 0.5625 0.9792 0.9091 0.7991 0.9611 0.9808 0.9097
Nguyen_NTU_task2_3 NTU_bgnormalization Nguyen2018 0.9213 0.8796 1.0000 0.9565 1.0000 1.0000 0.8250 0.9583 0.8958 0.9593 0.8492 1.0000 0.9559 0.9844 0.7778 0.9423 1.0000 1.0000 0.6538 1.0000 0.7431 0.9167 0.8268 0.9198 1.0000 0.8623 0.9583 0.9792 0.9583 0.9804 0.9394 0.5000 0.9792 0.9643 0.6458 1.0000 0.9773 0.7991 0.9833 0.9693 0.9583
Nguyen_NTU_task2_4 NTU_en8_augment_test Nguyen2018 0.9478 0.8981 1.0000 0.9493 1.0000 1.0000 0.9250 0.9621 0.8264 1.0000 0.9286 1.0000 0.9804 1.0000 0.8889 0.9808 1.0000 1.0000 0.8077 1.0000 0.8472 0.9500 0.8987 0.9444 1.0000 0.8841 0.9583 0.9792 0.9792 0.9804 0.9886 0.5750 1.0000 1.0000 0.6597 1.0000 1.0000 0.7778 1.0000 1.0000 0.9583
Wilhelm_UKON_task2_1 CNN on Raw-Audio and Spectrogram Wilhelm2018 0.9435 0.8657 1.0000 0.9783 1.0000 1.0000 0.9417 0.9394 0.8681 0.9889 1.0000 1.0000 1.0000 0.9323 0.9097 1.0000 0.9306 1.0000 0.7949 1.0000 0.8472 0.9611 0.9510 0.9630 0.9792 0.9130 0.9479 1.0000 0.9792 0.9706 0.9356 0.9083 0.9514 0.9107 0.5903 1.0000 0.9773 0.7692 0.9833 0.9828 0.9375
Wilhelm_UKON_task2_2 CNN on Raw-Audio and Spectrogram Wilhelm2018 0.9416 0.8519 1.0000 0.9710 0.9565 1.0000 0.9167 0.9773 0.8472 0.9889 0.9762 1.0000 1.0000 0.9219 0.9514 0.9615 0.9514 1.0000 0.8013 0.9886 0.9167 0.9444 0.9412 0.9444 0.9688 0.8768 0.9375 1.0000 0.9583 0.9657 0.9242 0.9250 0.9514 0.8929 0.6319 0.9844 0.9318 0.8077 0.9444 0.9943 0.9792
Kim_GIST_task2_1 ConResNet Kim2018 0.9151 0.8843 1.0000 0.9783 1.0000 1.0000 0.6750 0.9545 0.8681 1.0000 0.8254 0.9792 1.0000 1.0000 0.8125 0.9423 0.9167 0.9383 0.7500 0.9167 0.9306 0.9333 0.8562 0.9198 0.9792 0.8623 0.9688 0.9583 0.9722 0.9559 0.9602 0.5167 0.8611 1.0000 0.5486 1.0000 0.9470 0.6795 0.9833 1.0000 0.7639
Kim_GIST_task2_2 ConResNet Kim2018 0.9133 0.8750 1.0000 0.9783 1.0000 1.0000 0.6750 0.9545 0.8611 1.0000 0.8016 0.9722 1.0000 1.0000 0.8333 0.9423 0.8958 0.9444 0.7372 0.9167 0.9236 0.9444 0.8497 0.8951 0.9792 0.8623 0.9688 0.9583 0.9514 0.9510 0.9564 0.5250 0.8681 1.0000 0.5208 1.0000 0.9773 0.6923 0.9833 1.0000 0.7639
Kim_GIST_task2_3 ConResNet Kim2018 0.9139 0.8750 1.0000 0.9783 1.0000 1.0000 0.6750 0.9545 0.8611 1.0000 0.8492 0.9792 1.0000 1.0000 0.8542 0.9423 0.9167 0.9444 0.7308 0.9167 0.9097 0.9444 0.8693 0.8951 0.9792 0.8188 0.9688 0.9583 0.9514 0.9510 0.9564 0.5167 0.8611 1.0000 0.5208 1.0000 0.9773 0.6923 0.9833 1.0000 0.7431
Kim_GIST_task2_4 ConResNet Kim2018 0.9174 0.8981 1.0000 0.9783 1.0000 1.0000 0.6750 0.9545 0.8889 1.0000 0.8730 1.0000 0.9853 1.0000 0.8056 0.9231 0.9167 0.9753 0.7628 0.9280 0.9514 0.9444 0.8693 0.9259 0.9688 0.8406 0.9688 0.9583 0.9722 0.9559 0.9545 0.5417 0.8681 1.0000 0.5347 1.0000 0.9242 0.6795 0.9833 1.0000 0.7431
Xu_Aalto_task2_1 Multi-level attention model on fine-tuned AudioSet features Xu2018 0.9065 0.8611 1.0000 0.9348 0.9710 1.0000 0.9667 0.9205 0.8264 0.9778 0.8810 0.9514 1.0000 0.9688 0.8750 1.0000 0.7986 0.9815 0.7821 0.9432 0.8125 0.9222 0.9183 0.9383 0.8385 0.8478 0.9635 0.9792 0.9306 0.9804 0.8939 0.6500 0.8889 0.8571 0.4097 0.9375 1.0000 0.7821 0.9056 0.9770 0.7917
Xu_Aalto_task2_2 Multi-level attention model on fine-tuned AudioSet features Xu2018 0.9081 0.8565 1.0000 0.9348 0.9783 1.0000 0.9333 0.9545 0.8611 0.9852 0.9048 0.9514 1.0000 0.9531 0.8750 1.0000 0.8264 0.9815 0.7949 0.9432 0.7917 0.9222 0.9118 0.9630 0.8229 0.8478 0.9479 0.9792 0.9097 0.9853 0.8939 0.5667 0.9167 0.8690 0.4028 0.9375 1.0000 0.7949 0.9111 0.9770 0.8264
Chakraborty_IBM_Task2_1 3 CNN L1 Stacked Fused Spectral XGBoost L2 Chakraborty2018 0.9328 0.9120 1.0000 0.9565 1.0000 1.0000 0.7917 0.9621 0.8542 1.0000 0.8968 1.0000 0.9853 0.9844 0.8750 0.9167 1.0000 1.0000 0.7115 0.9621 0.8125 0.9500 0.9118 0.9074 0.9323 0.7971 0.9635 0.9792 1.0000 0.9804 0.9716 0.6167 0.9583 0.9643 0.6458 1.0000 0.9773 0.7650 0.9778 0.9943 0.9097
Chakraborty_IBM_Task2_2 2 CNN results geometrically averaged Chakraborty2018 0.9320 0.9074 1.0000 0.9348 1.0000 1.0000 0.8750 0.9545 0.8125 0.9889 0.9762 1.0000 1.0000 0.9844 0.9167 0.9744 1.0000 1.0000 0.7308 0.9773 0.8542 0.9500 0.9150 0.9136 0.9479 0.8623 0.9844 0.9583 0.9722 0.9853 0.9621 0.5667 1.0000 0.9821 0.4306 1.0000 0.9773 0.7393 0.9667 0.9943 0.8194
Chakraborty_IBM_Task2_judges_award VGG Style CNN with 3 channel input Chakraborty2018 0.9079 0.8333 1.0000 0.8841 0.9783 1.0000 0.8917 0.9508 0.8542 0.9889 0.9048 0.9792 0.9853 0.9688 0.8958 0.9038 0.9306 1.0000 0.6603 0.9508 0.8542 0.9000 0.8758 0.9198 0.8958 0.8116 0.9844 0.9375 1.0000 0.9412 0.9413 0.6333 0.8750 0.9464 0.4028 0.9479 0.8561 0.7265 0.9500 0.9923 0.8194
Han_NPU_task2_1 2ModEnsem Han2018 0.8723 0.8056 1.0000 0.9130 0.9130 1.0000 0.6750 0.9280 0.7986 0.9778 0.7937 1.0000 0.9118 0.9375 0.8264 0.8333 0.9514 0.9383 0.6410 0.8864 0.7708 0.9667 0.7516 0.9012 0.9375 0.8768 0.8958 0.8472 0.7917 0.9412 0.8826 0.6083 0.8472 0.8750 0.5764 0.9792 0.9470 0.6239 0.9444 0.9674 0.8750
Zhesong_PKU_task2_1 BCNN_WaveNet Yu2018 0.8807 0.8287 0.9808 0.9348 0.9783 0.9808 0.7750 0.9773 0.7847 0.9741 0.7222 0.9583 0.8971 0.9792 0.8958 0.8141 0.8264 0.9753 0.6090 0.9053 0.7153 0.8944 0.8301 0.9198 0.9635 0.8406 0.9062 0.9792 0.8819 0.9559 0.8939 0.4583 0.8681 0.9048 0.5139 0.9844 0.8864 0.6709 0.9333 0.9828 0.8333
Hanyu_BUPT_task2 CRNN Hanyu2018 0.7877 0.7685 1.0000 0.9348 0.7174 0.9423 0.7583 0.9129 0.7500 0.9444 0.7698 0.9167 0.9265 0.8438 0.6806 0.7756 0.8056 0.5988 0.7949 0.9205 0.5278 0.8833 0.5131 0.7901 0.7344 0.7971 0.7135 0.7708 0.6389 0.9216 0.9318 0.2667 0.8819 0.6071 0.3542 0.8906 1.0000 0.5256 0.7889 0.8774 0.6458
Wei_Kuaiyu_task2_1 Kuaiyu tagging system WEI2018 0.9409 0.9583 1.0000 0.9783 1.0000 1.0000 0.8250 0.9773 0.8403 1.0000 0.9683 1.0000 1.0000 1.0000 0.8542 0.8846 1.0000 1.0000 0.8205 0.9659 0.8819 0.9667 0.9346 0.9136 1.0000 0.8913 0.9844 0.9792 0.9167 0.9657 0.9773 0.7333 0.9792 0.9821 0.5208 1.0000 0.9545 0.7436 0.9278 1.0000 0.8750
Wei_Kuaiyu_task2_2 Kuaiyu tagging system WEI2018 0.9423 0.9583 1.0000 0.9783 1.0000 1.0000 0.8250 0.9773 0.8681 1.0000 0.9683 1.0000 1.0000 1.0000 0.8542 0.8846 1.0000 1.0000 0.8205 0.9659 0.8819 0.9667 0.9379 0.9321 1.0000 0.8913 0.9844 0.9792 0.9167 0.9657 0.9754 0.7333 0.9792 0.9821 0.5278 1.0000 0.9545 0.7521 0.9500 1.0000 0.8542
Colangelo_RM3_task2_1 DCASE2018 Task2 CRNN RM3 Colangelo2018 0.6978 0.7361 0.9808 0.6812 0.8841 0.7885 0.6750 0.7538 0.7778 0.8000 0.5873 0.7083 0.9216 0.7656 0.6875 0.5641 0.7847 0.5741 0.5064 0.7500 0.6528 0.7444 0.6895 0.5247 0.8438 0.7029 0.9323 0.8472 0.6042 0.3480 0.6742 0.5833 0.7431 0.8750 0.3403 0.9010 0.8030 0.4145 0.7111 0.5632 0.6250
Shan_DBSonics_task2_1 Shan DBSonics approach Ren2018 0.9405 0.9120 1.0000 0.9710 1.0000 1.0000 0.7167 0.9508 0.8264 1.0000 0.9206 1.0000 1.0000 1.0000 0.8750 0.9038 0.9792 1.0000 0.8013 0.9886 0.8264 0.9667 0.9020 0.9568 1.0000 0.8768 0.9635 0.9583 1.0000 0.9559 0.9716 0.6750 0.9514 0.9286 0.6875 1.0000 1.0000 0.7906 0.9833 0.9943 0.9583
Kele_NUDT_task2_1 DCASE2018 Meta-learning system Kele2018 0.9498 0.9120 1.0000 0.9783 0.9783 1.0000 0.8083 0.9735 0.8750 1.0000 0.9762 1.0000 0.9706 1.0000 0.9306 0.8974 1.0000 1.0000 0.8846 0.9773 0.7917 0.9500 0.8954 0.9815 1.0000 0.9130 0.9635 1.0000 1.0000 0.9853 0.9886 0.7250 0.9514 0.9821 0.6389 0.9844 1.0000 0.8162 1.0000 0.9943 0.9514
Kele_NUDT_task2_2 DCASE2018 Meta-learning system Kele2018 0.9441 0.8750 1.0000 0.9565 1.0000 1.0000 0.7250 0.9659 0.9167 1.0000 1.0000 1.0000 0.9706 1.0000 0.9167 0.8654 1.0000 1.0000 0.8654 0.9886 0.7986 0.9667 0.8889 0.9630 1.0000 0.9130 0.9635 0.9792 0.9722 0.9804 0.9830 0.7500 0.9583 0.9821 0.6250 0.9844 0.9773 0.7949 1.0000 0.9943 0.9097
Agafonov_ITMO_task2_1 Fusion of 4 CNN Agafonov2018 0.9174 0.8704 0.9615 0.9710 1.0000 1.0000 0.8250 0.9091 0.9097 1.0000 0.8492 1.0000 0.9706 0.9844 0.8542 0.9615 0.9167 0.9815 0.7564 0.9886 0.9097 0.9444 0.8693 0.9198 0.9219 0.8986 0.9427 0.9792 0.9375 0.9608 0.9015 0.6083 0.8681 0.9405 0.5000 0.9844 1.0000 0.7436 0.8722 1.0000 0.9792
Agafonov_ITMO_task2_2 Fusion of 4 CNN Agafonov2018 0.9275 0.8981 0.9808 0.9783 1.0000 1.0000 0.7750 0.9508 0.9097 1.0000 0.8810 1.0000 0.9853 1.0000 0.8472 0.9423 0.9375 0.9815 0.7500 0.9659 0.8750 0.9611 0.9085 0.9321 0.9375 0.8768 0.9635 0.9792 0.9792 0.9804 0.9205 0.6250 0.8611 0.9583 0.5556 0.9792 1.0000 0.7521 0.9167 1.0000 0.9792
Wilkinghoff_FKIE_task2_1 CNN Ensemble based on Multiple Features Wilkinghoff2018 0.9414 0.9167 1.0000 0.9783 1.0000 1.0000 0.8667 0.9773 0.8125 0.9889 1.0000 1.0000 0.9706 1.0000 0.9097 0.9744 1.0000 1.0000 0.7436 0.9886 0.8542 0.9667 0.9706 0.9321 0.9688 0.9130 0.9427 0.9792 0.9583 1.0000 0.9451 0.8167 0.9306 0.9821 0.5069 0.9792 0.9470 0.7692 0.9833 1.0000 0.8611
Pantic_ETF_task2_1 Ensemble of convolutional neural networks for general purpose audio tagging Pantic2018 0.9419 0.9259 1.0000 0.9565 1.0000 1.0000 0.8417 0.9773 0.8750 0.9852 0.8810 1.0000 1.0000 0.9844 0.9167 0.9615 0.9792 1.0000 0.7692 0.9659 0.8403 1.0000 0.9020 0.9630 0.9688 0.9348 0.9479 1.0000 1.0000 0.9510 0.9527 0.7250 0.9792 0.9226 0.6736 0.9844 0.9773 0.7778 0.9667 0.9943 0.9583
Khadkevich_FB_task2_1 2 average pooling Khadkevich2018 0.9188 0.8981 1.0000 0.9710 0.9565 1.0000 0.9417 0.9545 0.8472 0.9889 0.9524 1.0000 1.0000 0.9219 0.8681 0.9359 0.9375 0.8889 0.9167 0.9659 0.8333 0.9667 0.8399 0.9815 0.8906 0.8261 0.9635 0.9792 0.8542 0.9804 0.9470 0.5333 0.9167 0.9107 0.4931 1.0000 0.9697 0.7265 0.9444 0.9579 0.9375
Khadkevich_FB_task2_2 2 max pooling Khadkevich2018 0.9178 0.9306 1.0000 0.9565 1.0000 1.0000 0.9750 0.9470 0.8403 0.9667 1.0000 1.0000 0.9706 0.9010 0.8403 0.8205 0.9306 0.9074 0.9167 0.9545 0.7986 0.9444 0.9085 0.9630 0.8906 0.8478 0.9531 0.9792 0.8889 0.9657 0.9223 0.5250 0.9375 0.9286 0.5556 1.0000 0.9545 0.7393 0.9500 0.9540 0.9375
Iqbal_Surrey_task2_1 Stacked CNN-CRNN (4 Models) Iqbal2018 0.9484 0.8889 1.0000 0.9565 1.0000 1.0000 0.7833 0.9545 0.9167 1.0000 0.9762 0.9583 0.9706 1.0000 0.8750 0.9808 0.9792 1.0000 0.7949 0.9848 0.7431 0.9500 0.9281 0.9198 0.9844 0.8333 0.9635 0.9792 1.0000 0.9853 1.0000 0.8083 0.9583 1.0000 0.6597 1.0000 1.0000 0.8333 1.0000 1.0000 0.9583
Iqbal_Surrey_task2_2 Stacked CNN-CRNN (8 Models) Iqbal2018 0.9512 0.9028 1.0000 0.9565 1.0000 1.0000 0.7833 0.9432 0.8958 1.0000 0.9762 0.9583 1.0000 1.0000 0.8958 0.9808 1.0000 1.0000 0.8205 0.9886 0.7569 0.9667 0.9216 0.9198 0.9844 0.8406 0.9635 0.9792 1.0000 1.0000 0.9943 0.8250 0.9583 1.0000 0.6944 1.0000 1.0000 0.8205 1.0000 1.0000 0.9792
Baseline_Surrey_task2_1 Surrey baseline CNN 8 layers Kong2018 0.9034 0.8889 1.0000 0.9565 0.9348 1.0000 0.7417 0.9545 0.8750 0.9889 0.8810 0.9167 1.0000 0.9688 0.8194 0.8654 0.8681 0.9815 0.7115 0.9773 0.7014 0.9667 0.8922 0.8395 0.9479 0.8478 0.9844 0.9514 0.9167 0.9510 0.9489 0.6750 0.7917 0.9405 0.4306 0.9792 0.9470 0.6880 0.9333 0.9885 0.8958
Baseline_Surrey_task2_2 Surrey baseline CNN 4 layers Kong2018 0.8622 0.8889 1.0000 0.9493 0.9348 0.9551 0.6500 0.9318 0.7500 0.9556 0.8095 0.9167 0.9559 0.9844 0.6875 0.9231 0.6944 0.9444 0.6667 0.9394 0.6806 0.9111 0.8725 0.7593 0.9219 0.7174 0.9375 0.7917 0.8472 0.9510 0.9072 0.5250 0.5903 0.9821 0.3819 0.9844 0.8864 0.6752 0.9444 0.9808 0.8125
Dorfer_CPJKU_task2_1 CNN - Iterative Self-Verification Dorfer2018 0.9518 0.9213 1.0000 0.9565 1.0000 1.0000 0.9000 0.9773 0.8958 1.0000 1.0000 1.0000 0.9853 1.0000 0.9097 0.9423 0.9444 0.9815 0.8077 0.9886 0.8542 0.9667 0.9510 0.9383 0.9688 0.9275 0.9635 0.9792 0.9792 1.0000 1.0000 0.7083 0.9306 0.9821 0.6250 1.0000 1.0000 0.7821 0.9833 1.0000 0.9236

System characteristics

Input characteristics

Rank Code Tech. report mAP@3 Acoustic features Data augmentation External data Re-labeling Sampling rate
DCASE2018 baseline Fonseca2018 0.6943 log-mel energies 44.1kHz
Jeong_COCAI_task2_1 Jeong2018 0.9538 log-mel energies, waveform mixup automatic 16k,32k,44.1kHz
Jeong_COCAI_task2_2 Jeong2018 0.9506 log-mel energies, waveform mixup automatic 16k,32kHz
Jeong_COCAI_task2_3 Jeong2018 0.9405 log-mel energies mixup automatic 32kHz
Nguyen_NTU_task2_1 Nguyen2018 0.9496 log-mel energies block mixing, randomly erase/cutout, time stretching, pitch shifting automatic 44.1kHz
Nguyen_NTU_task2_2 Nguyen2018 0.9251 log-mel energies block mixing, randomly cutout, time stretching, pitch shifting automatic 44.1kHz
Nguyen_NTU_task2_3 Nguyen2018 0.9213 log-mel energies block mixing, randomly cutout, time stretching, pitch shifting automatic 44.1kHz
Nguyen_NTU_task2_4 Nguyen2018 0.9478 log-mel energies block mixing, randomly erase/cutout, time stretching, pitch shifting automatic 44.1kHz
Wilhelm_UKON_task2_1 Wilhelm2018 0.9435 log-mel energies, raw audio cropping, padding, time shifting, same class blending, different class blending 44.1kHz
Wilhelm_UKON_task2_2 Wilhelm2018 0.9416 log-mel energies, raw audio cropping, padding, time shifting, same class blending, different class blending 44.1kHz
Kim_GIST_task2_1 Kim2018 0.9151 MFCC, delta of MFCC, delta-delta of MFCC time stretching, time shifting, additive wgn 44.1kHz
Kim_GIST_task2_2 Kim2018 0.9133 MFCC, delta of MFCC, delta-delta of MFCC time stretching, time shifting, additive wgn 44.1kHz
Kim_GIST_task2_3 Kim2018 0.9139 MFCC, delta of MFCC, delta-delta of MFCC time stretching, time shifting, additive wgn 44.1kHz
Kim_GIST_task2_4 Kim2018 0.9174 MFCC, delta of MFCC, delta-delta of MFCC time stretching, time shifting, additive wgn 44.1kHz
Xu_Aalto_task2_1 Xu2018 0.9065 log-mel energies pitch shifting, mixup Google AudioSet VGGish model 44.1kHz
Xu_Aalto_task2_2 Xu2018 0.9081 log-mel energies pitch shifting, mixup Google AudioSet VGGish model 44.1kHz
Chakraborty_IBM_Task2_1 Chakraborty2018 0.9328 spectrogram, spectral summaries chunking, mixup, time-shift 22050Hz
Chakraborty_IBM_Task2_2 Chakraborty2018 0.9320 spectrogram chunking, mixup, time-shift 22050Hz
Chakraborty_IBM_Task2_judges_award Chakraborty2018 0.9079 spectrogram with delta and delta-delta augmentation chunking, mixup 22050Hz
Han_NPU_task2_1 Han2018 0.8723 MFCC 44.1kHz
Zhesong_PKU_task2_1 Yu2018 0.8807 MFCC & raw audio trim silence automatic 44.1kHz & 16kHz
Hanyu_BUPT_task2 Hanyu2018 0.7877 log-mel energies automatic 44.1kHz
Wei_Kuaiyu_task2_1 WEI2018 0.9409 log-mel energies mixup, random erasing 44.1kHz
Wei_Kuaiyu_task2_2 WEI2018 0.9423 log-mel energies mixup, random erasing 44.1kHz
Colangelo_RM3_task2_1 Colangelo2018 0.6978 log-mel energies 44.1kHz
Shan_DBSonics_task2_1 Ren2018 0.9405 log-mel energies, MFCC time stretching, pitch shift, reverb, dynamic range compression VEGAS, SoundNet 44.1kHz
Kele_NUDT_task2_1 Kele2018 0.9498 log-mel energies mixup ImageNet-based pre-trained model 44.1kHz
Kele_NUDT_task2_2 Kele2018 0.9441 log-mel energies mixup ImageNet-based pre-trained model 44.1kHz
Agafonov_ITMO_task2_1 Agafonov2018 0.9174 log-mel energies time stretching, pitch shifting 16kHz
Agafonov_ITMO_task2_2 Agafonov2018 0.9275 log-mel energies time stretching, pitch shifting 16kHz
Wilkinghoff_FKIE_task2_1 Wilkinghoff2018 0.9414 PLP, MFCC, mel-spectrogram, raw data mix-up, cutout, dropout, vertical shifts 24kHz
Pantic_ETF_task2_1 Pantic2018 0.9419 CQT, mel-spectrogram mixup, random erasing, width shift, zoom pre-trained model 44.1kHz
Khadkevich_FB_task2_1 Khadkevich2018 0.9188 log-mel energies 16kHz
Khadkevich_FB_task2_2 Khadkevich2018 0.9178 log-mel energies 16kHz
Iqbal_Surrey_task2_1 Iqbal2018 0.9484 log-mel energies mixup automatic 32kHz
Iqbal_Surrey_task2_2 Iqbal2018 0.9512 log-mel energies mixup automatic 32kHz
Baseline_Surrey_task2_1 Kong2018 0.9034 log-mel energies 32kHz
Baseline_Surrey_task2_2 Kong2018 0.8622 log-mel energies 32kHz
Dorfer_CPJKU_task2_1 Dorfer2018 0.9518 Perceptual weighted power spectrogram, Logarithmic-filtered log-spectrogram mixup automatic 32.0kHz



Machine learning characteristics

Rank Code Tech. report mAP@3 Classifier Ensemble subsystems Decision making System complexity
DCASE2018 baseline Fonseca2018 0.6943 CNN 658100
Jeong_COCAI_task2_1 Jeong2018 0.9538 CNN 30 geometric mean 414200805
Jeong_COCAI_task2_2 Jeong2018 0.9506 CNN 20 geometric mean 276133870
Jeong_COCAI_task2_3 Jeong2018 0.9405 CNN 5 geometric mean 55461000
Nguyen_NTU_task2_1 Nguyen2018 0.9496 CNN 8 geometric mean 5679784
Nguyen_NTU_task2_2 Nguyen2018 0.9251 CNN 652705
Nguyen_NTU_task2_3 Nguyen2018 0.9213 CNN 769297
Nguyen_NTU_task2_4 Nguyen2018 0.9478 CNN 8 geometric mean 5679784
Wilhelm_UKON_task2_1 Wilhelm2018 0.9435 CNN 5 geometric mean 15683245
Wilhelm_UKON_task2_2 Wilhelm2018 0.9416 CNN geometric mean 3136649
Kim_GIST_task2_1 Kim2018 0.9151 CNN, ensemble 10 mean probability 7197609
Kim_GIST_task2_2 Kim2018 0.9133 CNN, ensemble 10 mean probability 7197609
Kim_GIST_task2_3 Kim2018 0.9139 CNN, ensemble 10 mean probability 7197609
Kim_GIST_task2_4 Kim2018 0.9174 CNN, ensemble 10 mean probability 7197609
Xu_Aalto_task2_1 Xu2018 0.9065 CNN, DNN, multi-level attention 20 geometric mean 19119200
Xu_Aalto_task2_2 Xu2018 0.9081 CNN, DNN, multi-level attention 120 geometric mean 114715200
Chakraborty_IBM_Task2_1 Chakraborty2018 0.9328 CNN, xgboost, ensemble geometric averaging, xgboost classifier 29270658
Chakraborty_IBM_Task2_2 Chakraborty2018 0.9320 CNN geometric averaging 27603245
Chakraborty_IBM_Task2_judges_award Chakraborty2018 0.9079 CNN geometric averaging 1464813
Han_NPU_task2_1 Han2018 0.8723 CNN 2 24810
Zhesong_PKU_task2_1 Yu2018 0.8807 Bilinear-CNN & WaveNet 2 658100
Hanyu_BUPT_task2 Hanyu2018 0.7877 CRNN mean probability 980000
Wei_Kuaiyu_task2_1 WEI2018 0.9409 CNN 2 rank averaging 9764770
Wei_Kuaiyu_task2_2 WEI2018 0.9423 CNN 2 rank averaging 9764770
Colangelo_RM3_task2_1 Colangelo2018 0.6978 CRNN 2446185
Shan_DBSonics_task2_1 Ren2018 0.9405 CNN 4 geometric mean 6000000
Kele_NUDT_task2_1 Kele2018 0.9498 ensemble 5 probability vote
Kele_NUDT_task2_2 Kele2018 0.9441 ensemble 5 probability vote
Agafonov_ITMO_task2_1 Agafonov2018 0.9174 ensemble 4 average fusion 1880684
Agafonov_ITMO_task2_2 Agafonov2018 0.9275 ensemble 4 geometric mean fusion 12312447
Wilkinghoff_FKIE_task2_1 Wilkinghoff2018 0.9414 CNN, ensemble 25 logistic regression, neural network 349396065
Pantic_ETF_task2_1 Pantic2018 0.9419 CNN 11 Logistic regression 94075101
Khadkevich_FB_task2_1 Khadkevich2018 0.9188 CNN
Khadkevich_FB_task2_2 Khadkevich2018 0.9178 CNN
Iqbal_Surrey_task2_1 Iqbal2018 0.9484 GCNN, GCRNN 4 geometric mean 81701540
Iqbal_Surrey_task2_2 Iqbal2018 0.9512 CNN, GCNN, CRNN, GCRNN 8 geometric mean 125787722
Baseline_Surrey_task2_1 Kong2018 0.9034 VGGish 8 layer CNN with global max pooling 4691274
Baseline_Surrey_task2_2 Kong2018 0.8622 AlexNetish 4 layer CNN with global max pooling 4309450
Dorfer_CPJKU_task2_1 Dorfer2018 0.9518 CNN, ensemble 3 average 8369492

Technical reports

AUDIO TAGGING USING LABELED DATASET WITH NEURAL NETWORKS

Iurii Agafonov and Evgeniy Shuranov
Speech Information Systems (ITMO), ITMO University, Saint-Petersburg, Russia. Database Collection and Processing Group (STC), Speech Technology Center, Saint-Petersburg, Russia.

Abstract

In this paper, an audio tagging system is proposed. The system uses a fusion of 5 Convolutional Neural Network (CNN) classifiers and 1 Convolutional Recurrent Neural Network (CRNN) classifier in an attempt to achieve better results. The proposed system reaches a score of 0.95 on the public leaderboard.
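
The fusion step mentioned above is, per the system characteristics below, a geometric mean of the individual classifiers' class-probability outputs. A sketch of that operation, with made-up array shapes standing in for real model outputs:

```python
import numpy as np

# Hypothetical per-classifier outputs: probs[i][j] holds the class
# probabilities classifier i assigns to audio clip j.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(41), size=(6, 100))  # 6 classifiers, 100 clips, 41 classes

# Geometric mean across classifiers, then renormalize per clip.
fused = np.exp(np.log(probs + 1e-12).mean(axis=0))
fused /= fused.sum(axis=1, keepdims=True)

top3 = np.argsort(-fused, axis=1)[:, :3]  # ranked labels for an mAP@3 submission
```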

System characteristics
Sampling rate 16kHz
Data augmentation time stretching, pitch shifting
Features log-mel energies
Classifier ensemble
Decision making geometric mean fusion
Ensemble subsystems 4
Complexity 12312447 parameters
PDF

DIVERSIFIED SYSTEM OF DEEP CONVOLUTIONAL NEURAL NETWORKS WITH STACKED SPECTRAL FUSION FOR AUDIO TAGGING

Ria Chakraborty
Cognitive Business Decision Services (IBM), International Business Machines, Kolkata, India.

Abstract

This paper outlines a diversified system of deep convolutional neural networks with stacked fusion of spectral features for the DCASE 2018 Task 2 [1], Freesound general-purpose audio tagging. The primary objective of this research has been to develop a solution which can deliver decent performance and be deployed within reasonable resource constraints. The two best performing submissions are the results of only two and three different CNNs, with their results combined by a boosted tree algorithm with fused spectral features. This paper describes the submissions made under the team name Gyat, leveraging different feature representations and data augmentations along with the marginal benefits they bring to the table. Experimental results show that the proposed systems and preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting MAP@3 scores of 0.945 and 0.947 respectively on the public leaderboard. The baseline score for this task is 0.704 (public leaderboard).
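
The stacked fusion described above can be sketched as a level-2 boosted-tree meta-classifier trained on concatenated level-1 CNN probabilities plus spectral summaries. The shapes, hyperparameters and random inputs below are placeholders, not the authors' setup:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_clips, n_classes = 500, 41

# Level-1 features: class probabilities from 3 CNNs plus hypothetical
# spectral summary statistics, concatenated per clip.
cnn_probs = [rng.dirichlet(np.ones(n_classes), size=n_clips) for _ in range(3)]
spectral = rng.normal(size=(n_clips, 16))
X = np.hstack(cnn_probs + [spectral])
y = rng.integers(0, n_classes, size=n_clips)

# Level-2 boosted-tree meta-classifier (in practice trained on
# out-of-fold level-1 predictions to avoid leakage).
meta = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
meta.fit(X, y)
top3 = np.argsort(-meta.predict_proba(X), axis=1)[:, :3]
```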

System characteristics
Sampling rate 22050Hz
Data augmentation chunking, mixup
Features spectrogram with delta and delta-delta augmentation
Classifier CNN
Decision making geometric averaging
Complexity 1464813 parameters
PDF

CONVOLUTIONAL RECURRENT NEURAL NETWORK FOR AUDIO EVENTS CLASSIFICATION

Federico Colangelo, Federica Battisti, Alessandro Neri and Marco Carli
Department of Engineering (RM3), Universita degli studi Roma Tre, Rome, Italy.

Abstract

Audio event recognition is becoming a hot topic in both research and industry. Nowadays, thanks to the availability of cheap sensors, the acquisition of high-quality audio is much easier. However, new challenges arise: the large number of inputs requires adequate means for coding, transmitting, and storing the recorded data. Moreover, to build systems that can act based on their surroundings (e.g. autonomous cars), automatic tools for detecting specific audio events are needed. In this paper, the effectiveness of an architecture based on a combination of convolutional and recurrent neural networks for general purpose audio event detection is evaluated. Specifically, the architecture is evaluated in the context of the DCASE challenge on general purpose audio tagging, in order to provide a clear comparison with architectures based on different principles.

System characteristics
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Complexity 2446185 parameters
PDF

TRAINING GENERAL-PURPOSE AUDIO TAGGING NETWORKS WITH NOISY LABELS AND ITERATIVE SELF-VERIFICATION

Matthias Dorfer and Gerhard Widmer
Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria.

Abstract

This work describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our solution is based on a fully convolutional neural network that predicts one out of 41 possible audio class labels when given an audio spectrogram excerpt as an input. What makes this classification dataset and the task in general special is the fact that only 3,700 of the 9,500 provided training examples are delivered with manually verified ground truth labels. The remaining non-verified observations are expected to contain a substantial amount of label noise (30-35%). We propose to address this issue by a simple, iterative self-verification process, which gradually shifts unverified labels into the verified, trusted training set. The decision criterion for self-verifying a training example is the prediction consensus of a previous snapshot of the network on multiple short sliding window excerpts of the training example at hand. This procedure requires a carefully chosen cross-validation setup, as large enough neural networks are able to learn an entire dataset by heart, even in the face of noisy label data. On the unseen test data, an ensemble of three networks trained with this self-verification approach achieves a mean average precision (MAP@3) of 0.951. This is the second best out of 558 submissions to the corresponding Kaggle challenge.
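
The core loop of the self-verification procedure can be sketched as follows; `model.predict_proba`, `clip.windows` and `clip.noisy_label` are hypothetical stand-ins for a trained network snapshot and a sliding-window cropper, and the consensus threshold is an assumed value:

```python
def self_verify(model, unverified, threshold=0.95, n_windows=5):
    """One self-verification pass: promote unverified clips whose label
    the current model snapshot consistently confirms across several
    sliding-window excerpts of the clip."""
    promoted, remaining = [], []
    for clip in unverified:
        probs = model.predict_proba(clip.windows(n_windows))  # (n_windows, 41)
        consensus = probs.mean(axis=0)
        # Promote only if the windows agree with the provided (noisy) label.
        if consensus.argmax() == clip.noisy_label and consensus.max() > threshold:
            promoted.append(clip)
        else:
            remaining.append(clip)
    return promoted, remaining
```

Each promoted clip would then join the trusted training set before the next training round, and the pass is repeated with the retrained network.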

System characteristics
Sampling rate 32.0kHz
Data augmentation mixup
Features Perceptual weighted power spectrogram, Logarithmic-filtered log-spectrogram
Classifier CNN, ensemble
Decision making average
Ensemble subsystems 3
Re-labeling automatic
Complexity 8369492 parameters
PDF

GENERAL-PURPOSE TAGGING OF FREESOUND AUDIO WITH AUDIOSET LABELS: TASK DESCRIPTION, DATASET, AND BASELINE

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis and Xavier Serra
Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Spain. Machine Perception Team (GOOGLE), Google Research, New York, USA.

Abstract

This paper describes Task 2 of the DCASE 2018 Challenge, titled “General-purpose audio tagging of Freesound content with AudioSet labels”. This task was hosted on the Kaggle platform as “Freesound General-Purpose Audio Tagging Challenge”. The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.
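
The baseline's log-mel energy input (shared by most submissions in the tables above) can be computed along these lines; the file name, frame sizes and mel-band count below are generic assumptions, not necessarily the baseline's exact configuration:

```python
import librosa

y, sr = librosa.load("clip.wav", sr=44100)  # hypothetical input file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=64
)
log_mel = librosa.power_to_db(mel)          # log-mel energies, shape (64, frames)
log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)  # per-clip normalization
```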

System characteristics
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Complexity 658100 parameters
PDF

CIAIC-GATFC SYSTEM FOR DCASE2018 CHALLENGE TASK2

Xueyu Han, Di Li and Qing Liu
Center of Intelligent Acoustics and Immersive Communication (NPU), Northwestern Polytechnical University, Xi'an, China.

Abstract

In this report, we present our method to tackle the problem of general-purpose automatic audio tagging described in DCASE 2018 challenge Task 2. Two convolutional neural network (CNN) models with different inputs and different architectures were trained separately. Outputs from the two CNN models were then fused together to give a final decision. In particular, the distribution of training samples among the 41 categories was unequal. Therefore, we present a data augmentation method which guarantees that the number of training samples per category is equal. A relative 21.4% improvement over the DCASE baseline system [1] is achieved on the public Kaggle leaderboard.
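
The balancing idea described above (equalizing the number of training samples per category) can be sketched as oversampling every class up to the majority-class count; a minimal illustration, not the authors' augmentation code:

```python
import numpy as np

def balance_by_oversampling(sample_ids, labels, rng=np.random.default_rng(0)):
    """Resample indices so every class appears as often as the largest one."""
    labels = np.asarray(labels)
    target = max(np.sum(labels == c) for c in np.unique(labels))
    balanced = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        balanced.extend(rng.choice(idx, size=target, replace=True))
    rng.shuffle(balanced)
    return [sample_ids[i] for i in balanced]
```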

System characteristics
Sampling rate 44.1kHz
Features MFCC
Classifier CNN
Ensemble subsystems 2
Complexity 24810 parameters
PDF

A SYSTEM FOR DCASE 2018 CHALLENGE USING CRNN WITH MEL FEATURES

Zhang Hanyu and Li Shengchen
Embedded Artificial Intelligence Group (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.

Abstract

For the general-purpose audio tagging task (General-purpose audio tagging of Freesound content with AudioSet labels) of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), we propose a method to classify 41 different acoustic events using a Convolutional Recurrent Neural Network (CRNN) with log mel spectrogram. First, the waveform of the audio recordings is transformed to log mel spectrogram and MFCC. The convolutional layers are then applied on the log mel spectrogram and mel-frequency cepstral coefficients to extract high-level features. The features are fed into the Recurrent Neural Network (RNN) for classification. On the official development set of the challenge, the best MAP is 0.8613, an improvement of 6.83% over the baseline.
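
A minimal CRNN of the kind described, with 2D convolutions over a log-mel excerpt followed by a recurrent layer, might look as follows in Keras; all layer sizes here are assumptions, not the authors' configuration:

```python
from tensorflow.keras import layers, models

# Input: a log-mel excerpt of 128 frames x 64 mel bands, one channel.
model = models.Sequential([
    layers.Input(shape=(128, 64, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Reshape((32, 16 * 64)),   # (time steps, flattened frequency features)
    layers.GRU(64),                  # recurrent summary over the time axis
    layers.Dense(41, activation="softmax"),  # one of the 41 categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```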

System characteristics
Sampling rate 44.1kHz
Features log-mel energies
Classifier CRNN
Decision making mean probability
Re-labeling automatic
Complexity 980000 parameters
PDF

STACKED CONVOLUTIONAL NEURAL NETWORKS FOR GENERAL-PURPOSE AUDIO TAGGING

Turab Iqbal, Qiuqiang Kong, Mark Plumbley and Wenwu Wang
Centre for Vision, Speech and Signal Processing (Surrey), University of Surrey, Guildford, UK.

Abstract

This technical report describes the methods used for classifying sound events as part of Task 2 of the DCASE 2018 challenge. The data used in this task requires a number of considerations, including how to handle variable-length audio samples and the presence of noisy labels. We propose a number of neural network architectures that learn from mel-spectrogram inputs. These baseline models involve the use of preprocessing techniques, data augmentation, and pseudo-labeling in order to improve their performance. They are then ensembled using a popular technique known as stacking. On the test set used for evaluation, compared to the baseline mean average precision score of 0.704, our system achieved a score of 0.961.
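
Stacking, as used here, trains a meta-classifier on the base models' predictions (out-of-fold in practice, to avoid leakage). A compact sklearn sketch with random placeholder data instead of real model outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_clips, n_classes, n_models = 400, 41, 4

# Out-of-fold class probabilities from each base model, concatenated per clip.
base_probs = [rng.dirichlet(np.ones(n_classes), size=n_clips) for _ in range(n_models)]
X_meta = np.hstack(base_probs)
y = rng.integers(0, n_classes, size=n_clips)

meta = LogisticRegression(max_iter=1000)
meta.fit(X_meta, y)                 # learns how much to trust each model/class
fused = meta.predict_proba(X_meta)  # final per-clip class probabilities
```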

System characteristics
Sampling rate 32kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, GCNN, CRNN, GCRNN
Decision making geometric mean
Ensemble subsystems 8
Re-labeling automatic
Complexity 125787722 parameters
PDF

AUDIO TAGGING SYSTEM FOR DCASE 2018: FOCUSING ON LABEL NOISE, DATA AUGMENTATION AND ITS EFFICIENT LEARNING

Il-Young Jeong and Hyungui Lim
COCAI, Cochlear.ai, Seoul, South Korea.

Abstract

In this technical report, we expound on the techniques and models applied to our submission for DCASE 2018: General-purpose audio tagging of Freesound content with AudioSet labels. We focus primarily on how to train a deep-learning model efficiently in the presence of strong augmentation and label noise. First, we adopted a single-block DenseNet architecture and a multi-head softmax classifier for efficient learning with mixup augmentation. For the label noise, we tried batch-wise loss masking, which eliminates the loss of outliers in a mini-batch. We also tried an ensemble of various models, trained using different sampling rates or audio representations.
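
Batch-wise loss masking, as described, discards the largest per-sample losses in each mini-batch on the assumption that they belong to mislabeled clips. A numpy sketch; the drop fraction is an assumed value:

```python
import numpy as np

def masked_batch_loss(probs, labels, drop_frac=0.1):
    """Mean cross-entropy over a mini-batch, ignoring the `drop_frac`
    highest-loss samples (treated as likely label-noise outliers)."""
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    n_keep = max(1, int(round(len(labels) * (1 - drop_frac))))
    kept = np.sort(per_sample)[:n_keep]  # discard the largest losses
    return kept.mean()
```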

System characteristics
Sampling rate 32kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making geometric mean
Ensemble subsystems 5
Re-labeling automatic
Complexity 55461000 parameters
PDF

NUDT SOLUTION FOR AUDIO TAGGING TASK OF DCASE 2018 CHALLENGE

Xu Kele, Zhu Boqing, Wang Dezhi, Peng Yuxing, Wang Huaimin, Zhang Lilun and Li Bo
College of Meteorology and Oceanography (NUDT), National University of Defense Technology, Changsha, China. Department of Computer Science (NUDT), National University of Defense Technology, Changsha, China. Department of Automation (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.

Abstract

In this technical report, we describe our solution for the general-purpose audio tagging task, one of the subtasks in the DCASE 2018 challenge. For the solution, we employed both deep learning methods and statistical feature-based shallow learners. For single models, only deep learning approaches are investigated, and different deep neural network architectures are tested with different kinds of input, ranging from the raw signal and log-scaled mel-spectrograms (log mel) to Mel Frequency Cepstral Coefficients (MFCC). For log mel and MFCC, the delta and delta-delta information are also used to formulate three-channel features. Inception, ResNet, ResNeXt and Dual Path Networks (DPN) are selected as the neural network architectures, while mixup is used for the data augmentation. Using ResNeXt, our best single convolutional neural network architecture provides an mAP@3 of 0.967 on the public Kaggle leaderboard. Moreover, to improve the accuracy further, we also propose a meta-learning-based ensemble method. By exploiting the diversity between different architectures, the meta-learning-based model can provide higher prediction accuracy and robustness in comparison to a single model. Using the proposed meta-learning method, our solution achieves an mAP@3 of 0.977 (rank 1 of 555) on the public Kaggle leaderboard, while the baseline gives an mAP@3 of 0.70.
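
Mixup, the augmentation named above (and used by many entries in these tables), forms convex combinations of example pairs and of their label vectors. A short sketch with an assumed mixing parameter:

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
    """Mix each example with a randomly chosen partner from the batch.

    x: batch of features, e.g. log-mel patches, shape (batch, ...).
    y_onehot: one-hot labels, shape (batch, n_classes).
    """
    lam = rng.beta(alpha, alpha)          # mixing coefficient for this batch
    perm = rng.permutation(len(x))        # random partner assignment
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```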

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier ensemble
Decision making probability vote
Ensemble subsystems 5
PDF

ACOUSTIC SCENE AND EVENT DETECTION SYSTEMS SUBMITTED TO DCASE 2018 CHALLENGE

Maksim Khadkevich
AML (FB), Facebook, Menlo Park, CA, USA.

Abstract

In this technical report we describe systems that have been submitted to DCASE 2018 challenge. Feature extraction and convolutional neural network (CNN) architecture are outlined. For tasks 1c and 2 we describe transfer learning approach that has been applied. Model training and inference are finally presented.

System characteristics
Sampling rate 16kHz
Features log-mel energies
Classifier CNN
PDF

GIST_WISENETAI AUDIO TAGGER BASED ON CONCATENATED RESIDUAL NETWORK FOR DCASE 2018 CHALLENGE TASK 2

Nam Kyun Kim, Jeong Hyeon Yang, Hong Kook Kim, Jeong Eun Lim, Jin Soo Park and Ji Hyun Park
School of Electrical Engineering and Computer Science (GIST), Gwangju Institute of Science and Technology, Gwangju, Korea. Algorithm R&D Team (Hanwha Techwin), Hanwha Techwin, Sungnam, Korea.

Abstract

In this report, we describe the method and performance of an acoustic event tagger applied to Task 2 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge, where the task evaluates systems for general-purpose audio tagging with an increased number of categories and using data with annotations of varying reliability. The proposed audio tagger, which is called GIST_WisenetAI and was developed in collaboration between GIST and Hanwha Techwin, is based on a concatenated residual network (ConResNet). In particular, the proposed ConResNet is composed of two types of convolutional neural network (CNN) residual networks (CNN-ResNet): a 2D CNN-ResNet and a 1D CNN-ResNet using a sequence of mel-frequency cepstrum coefficients (MFCCs) and their statistics, respectively, as input features. In order to improve the performance of audio tagging, k different ConResNets are trained using k-fold cross-validation, and then they are linearly combined to generate an ensemble classifier. In this task, 9,473 audio samples for training/validation are divided into 10 folds, and 9,400 audio samples are given for testing. Consequently, the proposed method provides a mean average precision at top 3 (MAP@3) of 0.958, as measured through the Kaggle platform.
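
The MFCC-plus-deltas input described above can be assembled with librosa and stacked into three channels, analogous to an RGB image for the 2D CNN-ResNet. The file name and parameter values here are illustrative assumptions:

```python
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=44100)        # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
delta = librosa.feature.delta(mfcc)               # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)     # second-order differences

# Stack as three channels for a 2D CNN input.
features = np.stack([mfcc, delta, delta2], axis=-1)  # (40, frames, 3)
```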

System characteristics
Sampling rate 44.1kHz
Data augmentation time stretching, time shifting, additive wgn
Features MFCC, delta of MFCC, delta-delta of MFCC
Classifier CNN, ensemble
Decision making mean probability
Ensemble subsystems 10
Complexity 7197609 parameters
PDF

DCASE 2018 CHALLENGE SURREY CROSS-TASK CONVOLUTIONAL NEURAL NETWORK BASELINE

Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP) (Surrey), University of Surrey, Guildford, UK.

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge is a well-known IEEE AASP challenge consisting of several audio classification and sound event detection tasks. The DCASE 2018 challenge includes five tasks: 1) acoustic scene classification, 2) audio tagging of Freesound, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection and 5) multi-channel audio tagging. In this paper we open-source the Python code for all of Tasks 1-5 of the DCASE 2018 challenge. The baseline source code contains the implementation of convolutional neural networks (CNNs), including the AlexNetish and the VGGish networks from the image processing area. We researched how the performance varies from task to task when the configuration of the neural networks is the same. The experiments show that the deeper VGGish network performs better than AlexNetish on Tasks 2-5, except on Task 1 where the VGGish and AlexNetish networks perform similarly. With the VGGish network, we achieve an accuracy of 0.680 on Task 1, a mean average precision (mAP) of 0.928 on Task 2, an area under the curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on Task 4 and an F1 score of 87.75% on Task 5.

System characteristics
Sampling rate 32kHz
Features log-mel energies
Classifier AlexNetish 4 layer CNN with global max pooling
Complexity 4309450 parameters
PDF

DCASE 2018 TASK 2: ITERATIVE TRAINING, LABEL SMOOTHING, AND BACKGROUND NOISE NORMALIZATION FOR AUDIO EVENT TAGGING

Thi Ngoc Tho Nguyen, Ngoc Khanh Nguyen, Douglas L. Jones and Woon Seng Gan
Electrical and Electronic Engineering (NTU), Nanyang Technological University, Singapore. Electrical and Computer Engineering (UIUC), University of Illinois Urbana-Champaign, Illinois, USA. SWAT, Singapore.

Abstract

This paper describes the approach behind our submissions for DCASE 2018 Task 2: general-purpose audio tagging of Freesound content with AudioSet labels. To tackle the problem of diverse recording environments, we propose to use background noise normalization. To tackle the problem of noisy labels, we propose to use pseudo-labels for automatic label verification and label smoothing to reduce over-fitting. We train several convolutional neural networks with data augmentation and different input sizes for the automatic label verification process. The procedure is promising for improving the quality of datasets for audio classification. On the public leaderboard for the competition, our single model and an ensemble of 8 models score 0.941 and 0.963 respectively.
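
Label smoothing, one of the two label-noise countermeasures named above, replaces the one-hot target with a slightly flattened distribution. A sketch; the smoothing value is an assumption:

```python
import numpy as np

def smooth_labels(labels, n_classes=41, eps=0.1):
    """Turn integer labels into smoothed target distributions:
    1 - eps on the annotated class, eps spread over the rest."""
    targets = np.full((len(labels), n_classes), eps / (n_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets
```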

System characteristics
Sampling rate 44.1kHz
Data augmentation block mixing, randomly erase/cutout, time stretching, pitch shifting
Features log-mel energies
Classifier CNN
Decision making geometric mean
Ensemble subsystems 8
Re-labeling automatic
Complexity 5679784 parameters
PDF

ENSEMBLE OF CONVOLUTIONAL NEURAL NETWORKS FOR GENERAL PURPOSE AUDIO TAGGING

Bogdan Pantic
Signals and Systems Department (ETF), School of Electrical Engineering, Belgrade, Serbia.

Abstract

This work describes our solution for the general purpose audio tagging task of the DCASE 2018 challenge. We propose an ensemble of several Convolutional Neural Networks (CNNs) with different properties. Logistic regression is used as a meta-classifier to produce final predictions. Experiments demonstrate that the ensemble outperforms each CNN individually. Finally, the proposed system achieves a Mean Average Precision (MAP) score of 0.956 on the public leaderboard, which is a significant improvement compared to the baseline.
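
Among this entry's augmentations (listed under the system characteristics below), random erasing blanks a random time-frequency rectangle of the input spectrogram. A sketch with assumed size limits:

```python
import numpy as np

def random_erase(spec, max_frac=0.3, rng=np.random.default_rng(0)):
    """Zero out a random time-frequency rectangle of a spectrogram."""
    n_mels, n_frames = spec.shape
    h = rng.integers(1, max(2, int(n_mels * max_frac)))
    w = rng.integers(1, max(2, int(n_frames * max_frac)))
    top = rng.integers(0, n_mels - h + 1)
    left = rng.integers(0, n_frames - w + 1)
    out = spec.copy()
    out[top:top + h, left:left + w] = 0.0
    return out
```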

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, random erasing, width shift, zoom
Features CQT, mel-spectrogram
Classifier CNN
Decision making Logistic regression
Ensemble subsystems 11
Complexity 94075101 parameters
PDF

AUTOMATIC AUDIO TAGGING WITH 1D AND 2D CONVOLUTIONAL NEURAL NETWORKS

Siyuan Shan and Yi Ren
DB Sonics, Beijing, China.

Abstract

In this work, we ensemble four different models for the audio tagging task. The first two models are 2D convolutional neural networks (CNNs) that take the mel spectrogram and the MFCC spectrogram, respectively, as input features, while the last two are 1D CNN architectures that take the raw waveform as input. Data augmentation techniques, including time stretching, pitch shifting, reverb and dynamic range compression, are also employed for better generalization. Transfer learning from two external datasets is also adopted. These components together contribute to our final recognition performance reported on Kaggle.
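
The first two waveform augmentations named above map directly onto librosa utilities; reverb and dynamic range compression typically require external tools (e.g. sox) and are omitted from this sketch. The file name and parameter values are illustrative:

```python
import librosa

y, sr = librosa.load("clip.wav", sr=44100)              # hypothetical input file
stretched = librosa.effects.time_stretch(y, rate=1.1)   # ~10% faster, same pitch
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up two semitones
```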

System characteristics
Sampling rate 44.1kHz
Data augmentation time stretching, pitch shift, reverb, dynamic range compression
Features log-mel energies, MFCC
Classifier CNN
Decision making geometric mean
Ensemble subsystems 4
Complexity 6000000 parameters
PDF

A REPORT ON AUDIO TAGGING WITH DEEPER CNN, 1D-CONVNET AND 2D-CONVNET

Qingkai WEI, Yanfang LIU and Xiaohui RUAN
Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd, Beijing, PRC.

Abstract

General-purpose audio tagging is a newly proposed task in DCASE 2018, which can provide insight towards broadly-applicable sound event classifiers. In this paper, two systems (named 1D-ConvNet and 2D-ConvNet) with small kernel sizes, multiple functional modules and deeper CNNs (convolutional neural networks) are developed to improve performance on this task. Different audio features are used: raw waveforms for the 1D-ConvNet; frequency-domain features such as MFCC, log-mel spectrogram, multi-resolution log-mel spectrogram and spectrogram are compared as the 2D-ConvNet input. Using the DCASE 2018 Challenge Task 2 dataset for training and evaluation, the best single 1D-ConvNet and 2D-ConvNet models are chosen, whose Kaggle public leaderboard scores are 0.877 and 0.961 respectively. In addition, an ensemble rank-averaging prediction gets a score of 0.968 on the public leaderboard, ranking 5/556.
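
Rank averaging, the ensembling scheme named above, replaces each model's raw scores with their within-class ranks before averaging, which makes differently calibrated models comparable. One common variant, sketched in numpy (the authors' exact scheme may differ):

```python
import numpy as np

def rank_average(score_matrices):
    """Average the per-class rank of each clip's score across models.

    score_matrices: list of (n_clips, n_classes) prediction arrays, one
    per model, possibly on very different (uncalibrated) score scales.
    """
    ranked = []
    for s in score_matrices:
        # Double argsort turns scores into 0-based ranks within each class column.
        ranked.append(np.argsort(np.argsort(s, axis=0), axis=0).astype(float))
    return np.mean(ranked, axis=0)
```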

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, random erasing
Features log-mel energies
Classifier CNN
Decision making rank averaging
Ensemble subsystems 2
Complexity 9764770 parameters
PDF

COMBINING HIGH-LEVEL FEATURES OF RAW AUDIO AND SPECTROGRAMS FOR AUDIO TAGGING

Benjamin Wilhelm and Marcel Lederle
Computer and Information Science (UKON), University of Konstanz, Constance, Germany.

Abstract

We introduce a method for general-purpose audio tagging that combines high-level features computed from the spectrogram and the raw audio data. We use convolutional neural networks with one-dimensional and two-dimensional convolutions to extract these high-level features and combine them with a densely connected neural network to make predictions. Our method performs in the top two percent of submissions to the Freesound General-Purpose Audio Tagging Challenge.
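
The described combination of a 1D network on raw audio and a 2D network on the spectrogram, merged into a densely connected classifier, might look as follows in Keras; every layer size here is an assumption, not the authors' architecture:

```python
from tensorflow.keras import layers, models

# Branch 1: 1D convolutions directly on a raw-audio excerpt.
raw_in = layers.Input(shape=(32000, 1))
x1 = layers.Conv1D(16, 9, strides=4, activation="relu")(raw_in)
x1 = layers.Conv1D(32, 9, strides=4, activation="relu")(x1)
x1 = layers.GlobalMaxPooling1D()(x1)

# Branch 2: 2D convolutions on a log-mel spectrogram excerpt.
spec_in = layers.Input(shape=(128, 64, 1))
x2 = layers.Conv2D(32, 3, activation="relu")(spec_in)
x2 = layers.Conv2D(64, 3, activation="relu")(x2)
x2 = layers.GlobalMaxPooling2D()(x2)

# Densely connected classifier on the combined high-level features.
merged = layers.concatenate([x1, x2])
out = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(41, activation="softmax")(out)
model = models.Model([raw_in, spec_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```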

System characteristics
Sampling rate 44.1kHz
Data augmentation cropping, padding, time shifting, same class blending, different class blending
Features log-mel energies, raw audio
Classifier CNN
Decision making geometric mean
Complexity 3136649 parameters
PDF

GENERAL-PURPOSE AUDIO TAGGING BY ENSEMBLING CONVOLUTIONAL NEURAL NETWORKS BASED ON MULTIPLE FEATURES

Kevin Wilkinghoff
Communication Systems (FKIE), Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany.

Abstract

This paper describes an audio tagging system which participated in Task 2 “General-purpose audio tagging of Freesound content with AudioSet labels” of the “Detection and Classification of Acoustic Scenes and Events (DCASE)” Challenge 2018. The system is an ensemble consisting of five convolutional neural networks based on Mel-frequency Cepstral Coefficients, Perceptual Linear Prediction features, Mel-spectrograms and the raw audio data. For ensembling all models, score-based fusion via Logistic Regression is performed with another neural network. In experimental evaluations, it is shown that ensembling the models significantly improves upon the performances obtained with the individual models.

System characteristics
Sampling rate 24kHz
Data augmentation mix-up, cutout, dropout, vertical shifts
Features PLP, MFCC, mel-spectrogram, raw data
Classifier CNN, ensemble
Decision making logistic regression, neural network
Ensemble subsystems 25
Complexity 349396065 parameters
PDF

THE AALTO SYSTEM BASED ON FINE-TUNED AUDIOSET FEATURES FOR DCASE2018 TASK2: GENERAL PURPOSE AUDIO TAGGING

Zhicun Xu, Peter Smit and Mikko Kurimo
Department of Signal Processing and Acoustics (Aalto), Aalto University, Espoo, Finland.

Abstract

In this paper, we present a neural network system for DCASE 2018 Task 2, general purpose audio tagging. We fine-tuned the Google AudioSet feature generation model with different settings for the given 41 classes on top of a fully connected layer with 100 units. Then we used the fine-tuned models to generate 128-dimensional features for each 0.960 s of audio. We tried different neural network structures including LSTM and multi-level attention models. In our experiments, the multi-level attention model has shown its superiority over the others. Truncating the silent parts, repeating and splitting the audio into fixed lengths, pitch shifting augmentation, and mixup techniques have all improved the results by a reasonable amount. The proposed system achieved a MAP@3 score of 0.936, which outperforms the baseline result of 0.704 and places in the top 7% of the public leaderboard.
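
The attention pooling at the heart of the multi-level attention model can be sketched as a learned weighted average over segment-level embeddings. A single-head numpy illustration, with random arrays standing in for the fine-tuned features and learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
segments = rng.normal(size=(10, 128))  # ten 0.960 s segments x 128-dim embeddings

# Hypothetical learned parameters: attention vector and classifier weights.
w_att = rng.normal(size=128)
w_cls = rng.normal(size=(128, 41))

scores = segments @ w_att
att = np.exp(scores - scores.max())
att /= att.sum()                       # softmax attention over segments

clip_embedding = att @ segments        # attention-weighted average
logits = clip_embedding @ w_cls        # clip-level class scores
```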

System characteristics
Sampling rate 44.1kHz
Data augmentation pitch shifting, mixup
Features log-mel energies
Classifier CNN, DNN, multi-level attention
Decision making geometric mean
Ensemble subsystems 120
Complexity 114715200 parameters
PDF

MODEL OF B-CNN AND WAVENET IN TASK 2

Zhesong Yu
Institute of Computer Science and Technology (PKU), Peking University, Beijing, China.

Abstract

The system has two parts: one is a bilinear CNN (B-CNN) applied to MFCCs, and the other is a WaveNet applied to raw audio. B-CNN is a model originally proposed for fine-grained image classification, and WaveNet is a model used for music generation. The purpose of this paper is simply to examine the performance of these two models on this audio classification task.
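
Bilinear pooling, which gives the B-CNN its name, pools the outer product of two feature maps over time into a fixed-size descriptor. A numpy sketch with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two feature streams over the same 100 frames (e.g. from two CNN branches).
f1 = rng.normal(size=(100, 32))
f2 = rng.normal(size=(100, 32))

# Outer product per frame, averaged over time: a (32 x 32) bilinear feature.
bilinear = np.einsum("ti,tj->ij", f1, f2) / f1.shape[0]
feature = bilinear.reshape(-1)           # flattened descriptor for a classifier

# Signed square root + L2 normalization, as is standard for B-CNN features.
feature = np.sign(feature) * np.sqrt(np.abs(feature))
feature /= np.linalg.norm(feature) + 1e-12
```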

System characteristics
Sampling rate 44.1kHz & 16kHz
Features MFCC & raw audio
Classifier Bilinear-CNN & WaveNet
Ensemble subsystems 2
Re-labeling automatic
Complexity 658100 parameters
PDF