Task description
This task evaluates systems for multi-label audio tagging using a small set of manually-labeled data and a larger set of noisy-labeled data, under a large-vocabulary setting. The task is intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision.
A more detailed task description can be found on the task description page or on the competition page on Kaggle.
IMPORTANT NOTE: the task results shown on this page include only the submissions made through the DCASE submission system. Therefore, some entries that appear in the official Kaggle leaderboard do not appear here, and the two rankings do not match.
IMPORTANT NOTE 2: Some of the submitted systems failed to run when presented with the data of the private test set (mainly because of kernels taking longer to compute than the maximum time allowed). For these systems, Kaggle does not provide us a score for the private LB set, and they are disqualified from the official ranking. Also, Kaggle only provides us private LB scores for the two selected submissions per team. Hence, the third system that some teams submitted to DCASE does not have a private LB score attached (only a public LB score). We have asked the authors of these systems to provide us with private LB scores so that we can show them in the tables below. Disqualified systems are shown highlighted in red in the tables below.
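All rankings below use label-weighted label-ranking average precision (lwlrap) as the evaluation metric. For reference, here is a minimal NumPy sketch of the metric following its standard definition (the implementation details are ours, not the official scoring code):

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_labels) binary ground-truth matrix.
    scores: (n_samples, n_labels) real-valued system scores.
    """
    precisions = np.zeros(truth.shape, dtype=float)
    for i in range(truth.shape[0]):
        pos = np.flatnonzero(truth[i])          # indices of the true labels
        if pos.size == 0:
            continue
        order = np.argsort(-scores[i])          # label indices, best score first
        hits = np.isin(order, pos)              # True where a ranked label is correct
        prec = np.cumsum(hits) / (np.arange(order.size) + 1)
        precisions[i, order[hits]] = prec[hits]  # precision at each true label's rank
    n_pos = truth.sum(axis=0)                   # positives per label
    per_label = precisions.sum(axis=0) / np.maximum(n_pos, 1)
    weights = n_pos / n_pos.sum()               # weight labels by their frequency
    return float(np.sum(per_label * weights))
```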
Systems ranking
| Submission code | Kaggle team name | Name | Tech. report | lwlrap (public LB) | lwlrap (private LB*) |
|---|---|---|---|---|---|
| Zhang_THU_task2_2 | THUEE | THUEE | Zhang2019 | 0.7392 | 0.7577 |
| Zhang_THU_task2_1 | THUEE | THUEE | Zhang2019 | 0.7423 | 0.7575 |
| Boqing_NUDT_task2_1 | TEMP | Multi-label Audio tagging system 1 | Boqing2019 | 0.7253 | 0.7240 |
| Boqing_NUDT_task2_3 | TEMP | Multi-label Audio tagging system 3 | Boqing2019 | 0.7119 | 0.6777 |
| Boqing_NUDT_task2_2 | TEMP | Multi-label Audio tagging system 2 | Boqing2019 | 0.7235 | 0.7232 |
| Zhang_BIsmart_task2_3 | 3x6min \| DCASE2019 Task2 | Teacher-Student V3 | Zhang2019b | 0.7126 | 0.7144 |
| Zhang_BIsmart_task2_2 | 3x6min \| DCASE2019 Task2 | Teacher-Student V2 | Zhang2019b | 0.7298 | 0.7338 |
| Zhang_BIsmart_task2_1 | 3x6min \| DCASE2019 Task2 | Teacher-Student V1 | Zhang2019b | 0.7304 | 0.7338 |
| Kong_SURREY_task2_1 | cvssp_baseline | CVSSP cross-task CNN baseline | Kong2019 | 0.5803 | 0.0000 |
| BOUTEILLON_NOORG_task2_2 | Eric Bouteillon | BOUTEILLON Warm-up pipeline and spec-mix 2.2 | Bouteillon2019 | 0.7331 | 0.7419 |
| BOUTEILLON_NOORG_task2_1 | Eric Bouteillon | BOUTEILLON Warm-up pipeline and spec-mix 2.1 | Bouteillon2019 | 0.7389 | 0.7519 |
| Akiyama_OU_task2_2 | [kaggler-ja/AIMS] OUmed | resnet34_envnet_ensemble Raw-Audio and Spectrogram | Akiyama2019 | 0.7474 | 0.7579 |
| Akiyama_OU_task2_1 | [kaggler-ja/AIMS] OUmed | resnet34_envnet_ensemble on Raw-Audio and Spectrogram | Akiyama2019 | 0.7504 | 0.7577 |
| Sun_BNU_task2_1 | Penghao | CNN+MeanTeacher | Sun2019 | 0.6320 | 0.6443 |
| Ebbers_UPB_task2_3 | Janek Ebbers | DCASE2019 UPB system 3 | Ebbers2019 | 0.7071 | 0.0000 |
| Ebbers_UPB_task2_2 | Janek Ebbers | DCASE2019 UPB system 2 | Ebbers2019 | 0.7262 | 0.7456 |
| Ebbers_UPB_task2_1 | Janek Ebbers | DCASE2019 UPB system 1 | Ebbers2019 | 0.7305 | 0.7552 |
| HongXiaoFeng_BUPT_task2_1 | HongXiaoFeng | HongXiaoFeng_BUPT_task2_1 | Hong2019 | 0.6991 | 0.7152 |
| HongXiaoFeng_BUPT_task2_2 | HongXiaoFeng | HongXiaoFeng_BUPT_task2_2 | Hong2019 | 0.6991 | 0.7149 |
| Kharin_MePhI_task2_1 | Alexander Khar | Kharin_noisy_annealing | Kharin2019 | 0.6637 | 0.6819 |
| Koutini_CPJKU_task2_1 | CP-JKU | CP JKU 1 | Koutini2019 | 0.7282 | 0.7351 |
| Koutini_CPJKU_task2_2 | CP-JKU | CP JKU 2 | Koutini2019 | 0.7254 | 0.7374 |
| Fonseca_UPF_task2_1 | Challenge Baseline | DCASE2019 baseline system | Fonseca2019 | 0.5370 | 0.5379 |
| PaischerPrinz_CPJKU_task2_1 | CPJKUStudents | CPJKU Students submission | Paischer2019 | 0.7222 | 0.7033 |
| PaischerPrinz_CPJKU_task2_2 | CPJKUStudents | CPJKU Students submission | Paischer2019 | 0.7216 | 0.7099 |
| PaischerPrinz_CPJKU_task2_3 | CPJKUStudents | CPJKU Students submission | Paischer2019 | 0.7158 | 0.7018 |
| Liu_Kuaiyu_task2_1 | Kuaiyu | Kuaiyu Tagging System | Liu2019 | 0.7348 | 0.7414 |
| Liu_Kuaiyu_task2_2 | Kuaiyu | Kuaiyu Tagging System | Liu2019 | 0.7311 | 0.7366 |
* Unless stated otherwise, all reported scores are computed using the ground truth for the private leaderboard.
Teams ranking
This table includes only the best-performing system per submitting team.
| Submission code | Kaggle team name | Name | Tech. report | lwlrap (public LB) | lwlrap (private LB) |
|---|---|---|---|---|---|
| Zhang_THU_task2_2 | THUEE | THUEE | Zhang2019 | 0.7392 | 0.7577 |
| Boqing_NUDT_task2_1 | TEMP | Multi-label Audio tagging system 1 | Boqing2019 | 0.7253 | 0.7240 |
| Zhang_BIsmart_task2_2 | 3x6min \| DCASE2019 Task2 | Teacher-Student V2 | Zhang2019b | 0.7298 | 0.7338 |
| Kong_SURREY_task2_1 | cvssp_baseline | CVSSP cross-task CNN baseline | Kong2019 | 0.5803 | 0.0000 |
| BOUTEILLON_NOORG_task2_1 | Eric Bouteillon | BOUTEILLON Warm-up pipeline and spec-mix 2.1 | Bouteillon2019 | 0.7389 | 0.7519 |
| Akiyama_OU_task2_2 | [kaggler-ja/AIMS] OUmed | resnet34_envnet_ensemble Raw-Audio and Spectrogram | Akiyama2019 | 0.7474 | 0.7579 |
| Sun_BNU_task2_1 | Penghao | CNN+MeanTeacher | Sun2019 | 0.6320 | 0.6443 |
| Ebbers_UPB_task2_1 | Janek Ebbers | DCASE2019 UPB system 1 | Ebbers2019 | 0.7305 | 0.7552 |
| HongXiaoFeng_BUPT_task2_1 | HongXiaoFeng | HongXiaoFeng_BUPT_task2_1 | Hong2019 | 0.6991 | 0.7152 |
| Kharin_MePhI_task2_1 | Alexander Khar | Kharin_noisy_annealing | Kharin2019 | 0.6637 | 0.6819 |
| Koutini_CPJKU_task2_2 | CP-JKU | CP JKU 2 | Koutini2019 | 0.7254 | 0.7374 |
| Fonseca_UPF_task2_1 | Challenge Baseline | DCASE2019 baseline system | Fonseca2019 | 0.5370 | 0.5379 |
| PaischerPrinz_CPJKU_task2_2 | CPJKUStudents | CPJKU Students submission | Paischer2019 | 0.7216 | 0.7099 |
| Liu_Kuaiyu_task2_1 | Kuaiyu | Kuaiyu Tagging System | Liu2019 | 0.7348 | 0.7414 |
System characteristics
Input characteristics
| Submission code | Tech. report | lwlrap (public LB) | lwlrap (private LB) | Acoustic features | Data augmentation | Use of noisy subset | Sampling rate |
|---|---|---|---|---|---|---|---|
| Zhang_THU_task2_2 | Zhang2019 | 0.7392 | 0.7577 | log-mel energies, CQT | mixup, SpecAugment | using provided labels | 44.1kHz |
| Zhang_THU_task2_1 | Zhang2019 | 0.7423 | 0.7575 | log-mel energies, CQT | mixup, SpecAugment | using provided labels | 44.1kHz |
| Boqing_NUDT_task2_1 | Boqing2019 | 0.7253 | 0.7240 | log-mel energies | SpecAugment | using provided labels | 44.1kHz |
| Boqing_NUDT_task2_3 | Boqing2019 | 0.7119 | 0.6777 | log-mel energies | SpecAugment | using provided labels | 44.1kHz |
| Boqing_NUDT_task2_2 | Boqing2019 | 0.7235 | 0.7232 | log-mel energies | SpecAugment | using provided labels | 44.1kHz |
| Zhang_BIsmart_task2_3 | Zhang2019b | 0.7126 | 0.7144 | log-mel energies | frequency masking, time masking, time reversal, mixup | using provided labels, automatic re-labeling | 32kHz |
| Zhang_BIsmart_task2_2 | Zhang2019b | 0.7298 | 0.7338 | log-mel energies, PCEN | frequency masking, time masking, time reversal, mixup | using provided labels, automatic re-labeling | 44.1kHz |
| Zhang_BIsmart_task2_1 | Zhang2019b | 0.7304 | 0.7338 | log-mel energies, PCEN | frequency masking, time masking, time reversal, mixup | using provided labels, automatic re-labeling | 44.1kHz |
| Kong_SURREY_task2_1 | Kong2019 | 0.5803 | 0.0000 | log-mel energies | | using provided labels | 32kHz |
| BOUTEILLON_NOORG_task2_2 | Bouteillon2019 | 0.7331 | 0.7419 | log-mel energies | spec-mix | using provided labels | 44.1kHz |
| BOUTEILLON_NOORG_task2_1 | Bouteillon2019 | 0.7389 | 0.7519 | log-mel energies | spec-mix | using provided labels | 44.1kHz |
| Akiyama_OU_task2_2 | Akiyama2019 | 0.7474 | 0.7579 | log-mel energies, waveform | mixup, cutout, random gain, flip, highpass | semi-supervised, multitask learning | 44.1kHz |
| Akiyama_OU_task2_1 | Akiyama2019 | 0.7504 | 0.7577 | log-mel energies, waveform | mixup, cutout, random gain, flip, highpass | semi-supervised, multitask learning | 44.1kHz |
| Sun_BNU_task2_1 | Sun2019 | 0.6320 | 0.6443 | log-mel energies | resample, gaussian noise | | 44.1kHz |
| Ebbers_UPB_task2_3 | Ebbers2019 | 0.7071 | 0.0000 | log-mel energies | mixup, frequency warping, frequency masking, time masking | automatic re-labeling | 44.1kHz |
| Ebbers_UPB_task2_2 | Ebbers2019 | 0.7262 | 0.7456 | log-mel energies | mixup, frequency warping, frequency masking, time masking | automatic re-labeling | 44.1kHz |
| Ebbers_UPB_task2_1 | Ebbers2019 | 0.7305 | 0.7552 | log-mel energies | mixup, frequency warping, frequency masking, time masking | automatic re-labeling | 44.1kHz |
| HongXiaoFeng_BUPT_task2_1 | Hong2019 | 0.6991 | 0.7152 | log-mel energies | mixup | semi-supervised learning | 44.1kHz |
| HongXiaoFeng_BUPT_task2_2 | Hong2019 | 0.6991 | 0.7149 | log-mel energies | mixup | semi-supervised learning | 44.1kHz |
| Kharin_MePhI_task2_1 | Kharin2019 | 0.6637 | 0.6819 | log-mel energies | random crops | using provided labels | 44.1kHz |
| Koutini_CPJKU_task2_1 | Koutini2019 | 0.7282 | 0.7351 | log-mel energies | mixup | using provided labels | 44.1kHz |
| Koutini_CPJKU_task2_2 | Koutini2019 | 0.7254 | 0.7374 | log-mel energies | mixup | using provided labels | 44.1kHz |
| Fonseca_UPF_task2_1 | Fonseca2019 | 0.5370 | 0.5379 | log-mel energies | | using provided labels | 44.1kHz |
| PaischerPrinz_CPJKU_task2_1 | Paischer2019 | 0.7222 | 0.7033 | log-mel energies, perceptually weighted mel, perceptually weighted CQT | mixup | using provided labels | 44.1kHz, 32kHz |
| PaischerPrinz_CPJKU_task2_2 | Paischer2019 | 0.7216 | 0.7099 | log-mel energies, perceptually weighted mel, perceptually weighted CQT | mixup | using provided labels | 44.1kHz, 32kHz |
| PaischerPrinz_CPJKU_task2_3 | Paischer2019 | 0.7158 | 0.7018 | log-mel energies, perceptually weighted mel, perceptually weighted CQT | mixup | using provided labels | 44.1kHz, 32kHz |
| Liu_Kuaiyu_task2_1 | Liu2019 | 0.7348 | 0.7414 | log-mel energies | mixup | using provided labels | 44.1kHz |
| Liu_Kuaiyu_task2_2 | Liu2019 | 0.7311 | 0.7366 | log-mel energies | mixup | using provided labels | 44.1kHz |
Machine learning characteristics
| Submission code | Tech. report | lwlrap (public LB) | lwlrap (private LB) | Classifier | Ensemble subsystems | Decision making | System complexity (parameters) |
|---|---|---|---|---|---|---|---|
| Zhang_THU_task2_2 | Zhang2019 | 0.7392 | 0.7577 | CNN, RNN, ensemble | 15 | geometric mean | 17000000 |
| Zhang_THU_task2_1 | Zhang2019 | 0.7423 | 0.7575 | CNN, RNN, ensemble | 15 | geometric mean | 17000000 |
| Boqing_NUDT_task2_1 | Boqing2019 | 0.7253 | 0.7240 | CNN | 5 | arithmetic mean | 16800000 |
| Boqing_NUDT_task2_3 | Boqing2019 | 0.7119 | 0.6777 | CNN | | arithmetic mean | 2300000 |
| Boqing_NUDT_task2_2 | Boqing2019 | 0.7235 | 0.7232 | CNN | 5 | arithmetic mean | 16800000 |
| Zhang_BIsmart_task2_3 | Zhang2019b | 0.7126 | 0.7144 | CNN | | arithmetic mean | 5500000 |
| Zhang_BIsmart_task2_2 | Zhang2019b | 0.7298 | 0.7338 | CNN | 13 | arithmetic mean | 71500000 |
| Zhang_BIsmart_task2_1 | Zhang2019b | 0.7304 | 0.7338 | CNN | 12 | arithmetic mean | 66000000 |
| Kong_SURREY_task2_1 | Kong2019 | 0.5803 | 0.0000 | CNN | | arithmetic mean | 4686144 |
| BOUTEILLON_NOORG_task2_2 | Bouteillon2019 | 0.7331 | 0.7419 | CNN | | arithmetic mean | 5250000 |
| BOUTEILLON_NOORG_task2_1 | Bouteillon2019 | 0.7389 | 0.7519 | CNN | 2 | arithmetic mean | 143000000 |
| Akiyama_OU_task2_2 | Akiyama2019 | 0.7474 | 0.7579 | CNN, ensemble | 95 | weighted average | 21800000 |
| Akiyama_OU_task2_1 | Akiyama2019 | 0.7504 | 0.7577 | CNN, ensemble | 170 | weighted average | 21800000 |
| Sun_BNU_task2_1 | Sun2019 | 0.6320 | 0.6443 | CNN | | arithmetic mean | 20700000 |
| Ebbers_UPB_task2_3 | Ebbers2019 | 0.7071 | 0.0000 | CRNN | | arithmetic mean | 2600000 |
| Ebbers_UPB_task2_2 | Ebbers2019 | 0.7262 | 0.7456 | CRNN | 3 | arithmetic mean | 7900000 |
| Ebbers_UPB_task2_1 | Ebbers2019 | 0.7305 | 0.7552 | CRNN | 6 | arithmetic mean | 15900000 |
| HongXiaoFeng_BUPT_task2_1 | Hong2019 | 0.6991 | 0.7152 | CNN, CRNN, ensemble | 27 | geometric mean | 300000000 |
| HongXiaoFeng_BUPT_task2_2 | Hong2019 | 0.6991 | 0.7149 | CNN, ensemble | 26 | geometric mean | 280000000 |
| Kharin_MePhI_task2_1 | Kharin2019 | 0.6637 | 0.6819 | CNN | | arithmetic mean | 4700000 |
| Koutini_CPJKU_task2_1 | Koutini2019 | 0.7282 | 0.7351 | CNN, Receptive Field Regularization | 39 | arithmetic mean | 90000000 |
| Koutini_CPJKU_task2_2 | Koutini2019 | 0.7254 | 0.7374 | CNN, Receptive Field Regularization | 24 | arithmetic mean | 90000000 |
| Fonseca_UPF_task2_1 | Fonseca2019 | 0.5370 | 0.5379 | CNN | | arithmetic mean | 3300000 |
| PaischerPrinz_CPJKU_task2_1 | Paischer2019 | 0.7222 | 0.7033 | CNN | 5 | arithmetic mean | 33700000 |
| PaischerPrinz_CPJKU_task2_2 | Paischer2019 | 0.7216 | 0.7099 | CNN | 6 | arithmetic mean | 48600000 |
| PaischerPrinz_CPJKU_task2_3 | Paischer2019 | 0.7158 | 0.7018 | CNN | 5 | arithmetic mean | 47300000 |
| Liu_Kuaiyu_task2_1 | Liu2019 | 0.7348 | 0.7414 | CNN | 5 | geometric mean | 55000000 |
| Liu_Kuaiyu_task2_2 | Liu2019 | 0.7311 | 0.7366 | CNN | 2 | geometric mean | 30000000 |
Technical reports
MULTITASK LEARNING AND SEMI-SUPERVISED LEARNING WITH NOISY DATA FOR AUDIO TAGGING
Osamu Akiyama and Junya Sato
Faculty of Medicine (OU), Osaka University, Osaka, Japan.
Abstract
This paper describes our submission to DCASE 2019 challenge Task 2, "Audio tagging with noisy labels and minimal supervision" [1]. This task is multi-label audio classification with 80 classes. The training data is composed of a small amount of reliably labeled data (curated data) and a larger amount of data with unreliable labels (noisy data). Additionally, the data distributions of the curated and the noisy data differ. To tackle these difficulties, we propose three strategies. The first is multitask learning using noisy data. The second is semi-supervised learning (SSL) using input data with a different distribution from the labeled input data. The third is an ensemble method that averages models learned with different time windows. Using these methods, we achieved a score of 0.750 in label-weighted label-ranking average precision (lwlrap), which is in the top 1% on the public leaderboard (LB).
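The final ensembling step, a weighted average of per-model predictions, can be sketched as follows (the weighting scheme is illustrative, not the authors' exact procedure; weights might, for example, be proportional to each model's validation lwlrap):

```python
import numpy as np

def weighted_average(preds, weights):
    """preds: list of (n_clips, n_labels) score arrays; weights: one float per model."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize so the output stays in score range
    return np.tensordot(w, np.stack(preds), axes=1)
```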
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup, cutout, random gain, flip, highpass |
Features | log-mel energies, waveform |
Classifier | CNN, ensemble |
Decision making | weighted average |
Ensemble subsystems | 170 |
Complexity | 21800000 parameters |
Training time | 17h (1 x Tesla P-100) |
MULTI-LABEL AUDIO TAGGING WITH NOISY LABELS AND VARIABLE LENGTH
Zhu Boqing, Xu Kele, Wang Dezhi and Mathurin ACHE
College of Computer (NUDT), National University of Defense Technology, Changsha, China. College of Meteorology and Oceanography (NUDT), National University of Defense Technology, Changsha, China. Paris, France.
Submissions: Boqing_NUDT_task2_1, Boqing_NUDT_task2_3, Boqing_NUDT_task2_2
Abstract
This paper describes our approach for DCASE 2019 Task 2: Audio tagging with noisy labels and minimal supervision. This challenge uses a smaller set of manually labeled data and a larger set of noisy-labeled data to enable systems to perform multi-label audio tagging under minimal supervision conditions. We aim to tag the audio clips with convolutional neural networks under limited computation and storage resources. To tackle the problem of noisy labels, we propose a data generation method named Dominate Mixup. It restrains the impact of incorrect labels during back-propagation and is suitable for multi-class classification problems. In response to the variable length of the audio clips, we use an efficient learning method with cyclical audio length, which allows us to learn more patterns from widely diverse sound events. On the public leaderboard of the competition, our single model and a simple ensemble of 5 models score 0.711 and 0.725, respectively.
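Dominate Mixup is the authors' variant of mixup; its specifics are in the report. For reference, plain mixup on a batch of audio features with multi-hot labels looks like this (a generic sketch, not the authors' implementation):

```python
import numpy as np

def mixup_batch(x, y, alpha=0.4, rng=np.random.default_rng()):
    """Plain mixup: convex-combine a batch x (B, ...) and its multi-hot labels y (B, C)
    with a shuffled copy of itself. alpha controls the Beta distribution of the mix."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```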
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | SpecAugment |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Ensemble subsystems | 5 |
Complexity | 16800000 parameters |
Training time | 25h (1 x GeForce RTX 2080Ti) |
SPECMIX: A SIMPLE DATA AUGMENTATION TO LEVERAGE CLEAN AND NOISY SET FOR EFFICIENT AUDIO TAGGING
Eric Bouteillon
NOORG, No Organization, France.
Abstract
This paper presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging that the author names SpecMix. These techniques were applied to our audio tagging system submitted to the Freesound Audio Tagging 2019 challenge, carried out within DCASE 2019 Task 2 [3]. The purpose of this challenge is to predict the audio labels for every test clip using machine learning techniques trained on a small amount of reliable, manually-labeled data and a larger quantity of noisy web audio data, in a multi-label audio tagging task with a large vocabulary setting.
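A SpecMix-style augmentation replaces time and frequency bands of one spectrogram with another's and mixes the labels in proportion to the replaced area. A minimal sketch under those assumptions (band sizes and details are illustrative, not the author's exact parameters):

```python
import numpy as np

def specmix(spec_a, spec_b, y_a, y_b, max_f=16, max_t=32, rng=np.random.default_rng()):
    """Replace random frequency and time bands of spec_a (F, T) with the same bands
    of spec_b; mix labels by the fraction of area replaced. Assumes F > max_f, T > max_t."""
    F, T = spec_a.shape
    out = spec_a.copy()
    mask = np.zeros((F, T), dtype=bool)
    f0 = rng.integers(0, F - max_f)
    mask[f0:f0 + rng.integers(1, max_f), :] = True    # frequency band from spec_b
    t0 = rng.integers(0, T - max_t)
    mask[:, t0:t0 + rng.integers(1, max_t)] = True    # time band from spec_b
    out[mask] = spec_b[mask]
    lam = 1.0 - mask.mean()                           # fraction of spec_a kept
    return out, lam * y_a + (1.0 - lam) * y_b
```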
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | spec-mix |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Ensemble subsystems | 2 |
Complexity | 143000000 parameters |
Training time | 72h (1 x rtx2080ti) |
CONVOLUTIONAL RECURRENT NEURAL NETWORK AND DATA AUGMENTATION FOR AUDIO TAGGING WITH NOISY LABELS AND MINIMAL SUPERVISION
Janek Ebbers and Reinhold Haeb-Umbach
Communications Engineering (UPB), Paderborn University, Paderborn, Germany.
Submissions: Ebbers_UPB_task2_3, Ebbers_UPB_task2_2, Ebbers_UPB_task2_1
Abstract
This report presents our audio tagging system for the DCASE 2019 Challenge Task 2. Our proposed neural network architecture consists of a convolutional front end using log-mel energies as input features, a recurrent neural network sequence encoder that outputs a single vector for the whole sequence, and finally a fully connected classifier network that outputs an activity probability for each of the 80 event classes. Due to the limited amount of available data, we use various data augmentation techniques to prevent overfitting and improve generalization. Our best system achieves a label-weighted label-ranking average precision (lwlrap) of 73.0% on the public test set, an absolute improvement of 19.3% over the baseline.
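A PyTorch sketch of the described architecture shape (conv front end, recurrent sequence encoder reduced to a single vector, fully connected classifier); layer sizes are illustrative, not the authors' configuration:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front end on log-mel input, GRU encoder to one vector, FC head (80 classes)."""
    def __init__(self, n_mels=128, n_classes=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                     # x: (B, 1, T, n_mels)
        z = self.cnn(x)                       # (B, C, T', M')
        B, C, T, M = z.shape
        z = z.permute(0, 2, 1, 3).reshape(B, T, C * M)
        _, h = self.rnn(z)                    # h: (2, B, 128), last hidden states
        h = torch.cat([h[0], h[1]], dim=-1)   # single vector per sequence
        return self.fc(h)                     # class logits (apply sigmoid for tagging)
```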
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup, frequency warping, frequency masking, time masking |
Features | log-mel energies |
Classifier | CRNN |
Decision making | arithmetic mean |
Ensemble subsystems | 6 |
Complexity | 15900000 parameters |
Training time | 10h (6 x GTX 980) |
AUDIO TAGGING WITH NOISY LABELS AND MINIMAL SUPERVISION
Eduardo Fonseca, Frederic Font, Manoj Plakal and Daniel P. W. Ellis
Machine Perception Team (GOOGLE), Google Research, New York, USA. Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Barcelona, Spain.
Submissions: Fonseca_UPF_task2_1
Abstract
This paper introduces Task 2 of the DCASE2019 Challenge, titled “Audio tagging with noisy labels and minimal supervision”. This task was hosted on the Kaggle platform as “Freesound Audio Tagging 2019”. The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set due to the fact that they come from different web audio sources. This can correspond to a realistic scenario given by the difficulty of gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.
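For reference, a typical log-mel front end of the kind used by the baseline and most submissions (parameter values are illustrative, not necessarily the baseline's exact configuration):

```python
import librosa
import numpy as np

def logmel(path, sr=44100, n_fft=2048, hop=512, n_mels=96):
    """Load a clip and compute a log-scaled mel spectrogram of shape (n_mels, n_frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)
```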
System characteristics
Sampling rate | 44.1kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Complexity | 3300000 parameters |
Training time | 8h (1 x Tesla V-100) |
MULTI-LABEL AUDIO TAGGING SYSTEM FOR FREESOUND 2019: FOCUSING ON NETWORK ARCHITECTURES, LABEL NOISY AND LOSS FUNCTIONS
Xiaofeng Hong and Gang Liu
Pattern Recognition and Intelligent System Laboratory (PRIS Lab) (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.
Submissions: HongXiaoFeng_BUPT_task2_1, HongXiaoFeng_BUPT_task2_2
Abstract
In this technical report, we describe the solutions applied in our submission for DCASE 2019 Task 2. We focus on model architectures that can efficiently tag audio with multiple, noisy labels. We use multi-label models based on convolutional and recurrent networks to unify the detection of audio events. A graph representation is also utilized to take audio event co-occurrence into account, which is reflected in the loss functions. We also tried semi-supervised learning to exploit the noisy data. Finally, we tried an ensemble of CNNs and a CRNN, trained on cross-validation folds. Compared to the baseline score of 0.537, we achieved a score of 0.700 on the public leaderboard.
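The final step, combining fold and model predictions by geometric mean, can be sketched as follows (a generic sketch, not the authors' code; working in log space avoids underflow with many ensemble members):

```python
import numpy as np

def geometric_mean(preds, eps=1e-8):
    """Combine a list of (n_clips, n_labels) probability arrays by geometric mean."""
    logp = np.log(np.clip(np.stack(preds), eps, 1.0))
    return np.exp(logp.mean(axis=0))
```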
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, ensemble |
Decision making | geometric mean |
Ensemble subsystems | 26 |
Complexity | 280000000 parameters |
Training time | 10h (1 x TITAN Xp) |
DCASE 2019 CHALLENGE NOISY_ANNEALING SYSTEM TECHNICAL REPORT
Alexander Kharin
laboratory of bionanophotonics (MePhI), National Research Nuclear University MePhI, Moscow, Moscow, Russia.
Submissions: Kharin_MePhI_task2_1
Abstract
A multi-layer convolutional neural network followed by a dense layer, with 4.7 million parameters, was trained on mel-spectrograms of the audio data. Such a large number of parameters combined with a small dataset (~9k samples before augmentation) makes the model vulnerable to overfitting. Augmentation of the audio files (i.e., cropping of spectrograms) was not found to be a very effective way to reduce overfitting. The following approaches proved reasonable: the standard K-fold technique, training on 5 folds and averaging the results, and so-called 'noisy data annealing'. The latter consists of sequentially training the model on the curated set for several epochs (30 in our case), followed by training on the poorly-labeled but larger dataset for 5 epochs. After several such cycles we observe a significant reduction in overfitting (lwlrap scores: 0.61 for the base model, 0.66 for the noisy-data-annealed model). The improvement is caused by a partial 'reset' of the trainable parameters during training on the poorly-labeled set: the more set-specific a parameter is, the higher its 'reset' rate, so the annealing enhances the significance of features that do not contribute to overfitting and reduces the impact of highly dataset-specific features.
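A schematic of the described schedule (epoch counts taken from the abstract; `train_epoch` is a placeholder for one ordinary supervised epoch, and the cycle count is illustrative):

```python
def noisy_annealing(model, curated_loader, noisy_loader, train_epoch,
                    cycles=4, curated_epochs=30, noisy_epochs=5):
    """Alternate long passes over the curated set with short passes over the noisy set."""
    for _ in range(cycles):
        for _ in range(curated_epochs):
            train_epoch(model, curated_loader)
        for _ in range(noisy_epochs):
            train_epoch(model, noisy_loader)   # partial 'reset' of set-specific weights
    return model
```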
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | random crops |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Complexity | 4700000 parameters |
Training time | 5h (1 x TESLA K80) |
CROSS-TASK LEARNING FOR AUDIO TAGGING, SOUND EVENT DETECTION AND SPATIAL LOCALIZATION: DCASE 2019 BASELINE SYSTEMS
Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP) (SURREY), University of Surrey, Guildford, England.
Submissions: Kong_SURREY_task2_1
Abstract
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after the convolutional layers is a good model for the majority of the DCASE 2019 tasks.
System characteristics
Sampling rate | 32kHz |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Complexity | 4686144 parameters |
Training time | 2h (1 x GTX Titan Xp) |
CP-JKU SUBMISSIONS TO DCASE’19: ACOUSTIC SCENE CLASSIFICATION AND AUDIO TAGGING WITH RECEPTIVE-FIELD-REGULARIZED CNNS
Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria.
Submissions: Koutini_CPJKU_task2_1, Koutini_CPJKU_task2_2
Abstract
In this report, we detail the CP-JKU submissions to the DCASE 2019 challenge, Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural network architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of ResNet and DenseNet architectures to best fit the various audio processing tasks that use spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels, to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year's submissions is to provide the best-performing single-model submission, using our proposed approaches.
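Receptive field adjustment relies on the standard recurrence for stacked convolution and pooling layers; a small generic helper to compute it (not the authors' code):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers given as (kernel_size, stride)
    pairs, via the recurrence: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. conv3x3, stride-2 pool, then two more conv3x3 layers
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))  # -> 12
```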
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN, Receptive Field Regularization |
Decision making | arithmetic mean |
Ensemble subsystems | 24 |
Complexity | 90000000 parameters |
Training time | 18h (1 x 1080ti) |
STACKED CONVOLUTIONAL NEURAL NETWORKS FOR AUDIO TAGGING WITH NOISE LABELS
Yanfang Liu and Qingkai Wei
Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd., Beijing, China.
Abstract
This technical report describes the system we used to participate in Task 2 of the DCASE 2019 challenge. The task is to predict the tags of audio recordings using a small number of manually-verified labels and a much larger number of noisy labels. For this task, we propose several convolutional neural networks that learn from log-mel spectrogram features. To improve performance, different techniques are involved: preprocessing, data augmentation, loss functions, and cross-validation. The prediction results are then ensembled using the geometric mean. On the test set used for evaluation, our system achieved a score of 0.734.
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup |
Features | log-mel energies |
Classifier | CNN |
Decision making | geometric mean |
Ensemble subsystems | 2 |
Complexity | 30000000 parameters |
Training time | 10h (1 x Titan xp) |
AUDIO TAGGING WITH CONVOLUTIONAL NEURAL NETWORKS TRAINED WITH NOISY DATA
Fabian Paischer, Katharina Prinz and Gerhard Widmer
Institute of Computational Perception (CPJKU), Johannes Kepler University, Linz, Linz, Austria.
Submissions: PaischerPrinz_CPJKU_task2_1, PaischerPrinz_CPJKU_task2_2, PaischerPrinz_CPJKU_task2_3
Abstract
This report describes our submission to the 2019 DCASE Challenge, Task 2. The task at hand is to predict one or more audio tags, out of the 80 available tags, for audio clips of different lengths originating from two different datasets. For training, a total of 4970 audio clips are provided with trustworthy labels, whereas 19815 samples contain a substantial amount of label noise with an unknown noise ratio. To tackle this task, we propose two different convolutional neural network (CNN) architectures trained on different features to capture different aspects of the data. Stochastic Weight Averaging is used to improve generalisation. By averaging the predictions of all five networks, we obtain an ensemble that provides the likelihood of each of the 80 labels being present in an input audio clip. On the unseen data of the public Kaggle leaderboard, our system reaches a label-weighted label-ranking average precision (lwlrap) of 0.722.
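Stochastic Weight Averaging can be sketched with PyTorch's built-in utilities (the epoch counts and learning rates below are illustrative, not the authors' settings):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, loader, loss_fn, epochs=30, swa_start=20):
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    swa_model = AveragedModel(model)
    swa_sched = SWALR(opt, swa_lr=0.005)
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # accumulate the running weight average
            swa_sched.step()
    update_bn(loader, swa_model)                # recompute BatchNorm statistics
    return swa_model
```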
System characteristics
Sampling rate | 44.1kHz, 32kHz |
Data augmentation | Mixup Augmentation |
Features | log-mel energies, perceptually weighted mel, perceptually weighted CQT |
Classifier | CNN |
Decision making | arithmetic mean |
Ensemble subsystems | 5 |
Complexity | 47300000 parameters |
Training time | 192h (2 x NVIDIA GeForce GTX 1080) |
Audio Tagging with Minimal Supervision Based on Mean Teacher for DCASE 2019 Challenge
Jun He, Penghao Rao, Bo Sun and Lejun Yu
College of Information Science and Technology (BNU), Beijing Normal University, Beijing, China.
Submissions: Sun_BNU_task2_1
Abstract
In this report, we describe a mean teacher based audio tagging system and its performance, applied to Task 2 of the DCASE 2019 challenge, where the task evaluates systems for audio tagging with noisy labels and minimal supervision. The proposed system is based on a VGG16 network with an attention mechanism and gated CNN. The following data augmentation techniques are used to increase model robustness: a) scaling the signal in time by a factor of 0.75 to 1.5, b) adding Gaussian white noise at 20 dB to 40 dB. Samples with noisy labels are regarded as unlabeled and are utilized with a semi-supervised method, namely mean teacher. The proposed system is trained using 5-fold cross-validation, and the final result is the arithmetic mean of the five models. The method achieves an lwlrap score of 0.631, measured through the Kaggle platform.
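The core of the mean teacher method is the exponential-moving-average update of the teacher's weights from the student's. A minimal PyTorch sketch (the decay value is illustrative; batch-norm buffers would typically be synchronized as well):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.999):
    """Teacher weights become an exponential moving average of the student weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(ema_decay).add_(s, alpha=1.0 - ema_decay)
```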
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | resample, gaussian noise |
Features | log-mel energies |
Classifier | CNN |
Decision making | arithmetic mean |
Complexity | 20700000 parameters |
Training time | 24h (1 x GTX 1080 Ti) |
THUEE SYSTEM FOR DCASE 2019 CHALLENGE TASK 2
Kexin He, Yuhan Shen and Weiqiang Zhang
Department of Electronic Engineering (THU), Tsinghua University, Beijing, China.
Submissions: Zhang_THU_task2_2, Zhang_THU_task2_1
Abstract
In this report, we describe our submission for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge: Audio tagging with noisy labels and minimal supervision. Our methods are mainly based on two types of deep learning models: Convolutional Recurrent Neural Networks (CRNN) and DenseNet. To prevent overfitting, we adopted data augmentation using the mixup strategy and SpecAugment. Besides, we designed a staged loss function to train our models using both curated and noisy data. We also used various acoustic features, including log-mel energies and perceptual Constant-Q transform (p-CQT), and tried an ensemble of multiple subsystems to enhance the generalization capability of our system. Our final system achieved an lwlrap score of 0.742 on the public leaderboard in Kaggle.
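A minimal sketch of SpecAugment-style masking on a log-mel spectrogram (band counts and widths are illustrative, not the authors' settings):

```python
import numpy as np

def spec_augment(spec, n_f=2, max_f=12, n_t=2, max_t=24, rng=np.random.default_rng()):
    """Zero out random frequency and time bands of a (n_mels, n_frames) spectrogram."""
    F, T = spec.shape
    out = spec.copy()
    for _ in range(n_f):                       # frequency masks
        w = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(F - w, 1))
        out[f0:f0 + w, :] = 0.0
    for _ in range(n_t):                       # time masks
        w = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(T - w, 1))
        out[:, t0:t0 + w] = 0.0
    return out
```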
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | mixup, SpecAugment |
Features | log-mel energies, CQT |
Classifier | CNN, RNN, ensemble |
Decision making | geometric mean |
Ensemble subsystems | 15 |
Complexity | 17000000 parameters |
Training time | 10h (1 x Tesla P-100) |
DCASE 2019 TASK 2: SEMI-SUPERVISED NETWORKS WITH HEAVY DATA AUGMENTATIONS TO BATTLE AGAINST LABEL NOISE IN AUDIO TAGGING TASK
Jihang Zhang and Jie Wu
Data Analyst (BIsmart), getBIsmart, Irvine, California, USA. Data Scientist (ENGINE), ENGINE | Transformation, London, UK.
Submissions: Zhang_BIsmart_task2_3, Zhang_BIsmart_task2_2, Zhang_BIsmart_task2_1
Abstract
This technical report describes a system used for DCASE 2019 Task 2: Audio tagging with noisy labels and minimal supervision. Building a large-scale multi-label dataset normally requires an extensive amount of manual effort, especially for a general-purpose audio tagging system. To tackle this problem, we use a semi-supervised teacher-student convolutional neural network (CNN) to leverage the substantial set of noisy labels and the small set of curated labels in the dataset. To further regularize the system, we exploit multiple data augmentation methods, including SpecAugment [1], mixup [2], and an innovative time-reversal augmentation approach. Moreover, a combination of binary Focal [3] and ArcFace [4] losses is used to increase the accuracy of the pseudo labels produced by the semi-supervised network and to accelerate the training process. Adaptive test-time augmentation (TTA) based on the lengths of the audio samples is used as a final approach to improve the system. We choose the single system that generates the submission file Zhang_BIsmart_task2_3.output.csv as the candidate model considered for the Judges' Award. The other two systems use an ensemble approach to further improve performance.
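The binary focal loss used in the loss combination can be sketched as follows (gamma and alpha are the usual defaults from the focal loss paper, illustrative here; the ArcFace component is omitted):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for multi-label tagging (one sigmoid per class)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true outcome
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()  # down-weights easy examples
```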
System characteristics
Sampling rate | 44.1kHz |
Data augmentation | frequency masking, time masking, time reversal, mixup |
Features | log-mel energies, PCEN |
Classifier | CNN |
Decision making | arithmetic mean |
Ensemble subsystems | 12 |
Complexity | 66000000 parameters |
Training time | 60h (1 x GeForce RTX 2070) + 5h (1 x Tesla P100) |
Other resources generated in the Kaggle competition
The table below shows additional resources that were created and made accessible by Kaggle participants during the competition but were not submitted to the DCASE Challenge. Note that teams who submitted to the DCASE Challenge are deliberately omitted from this table, as their generated resources are referenced in the sections above.