Audio tagging with noisy labels and minimal supervision


Challenge results

Task description

This task evaluates systems for multi-label audio tagging using a small set of manually-labeled data, and a larger set of noisy-labeled data, under a large vocabulary setting. This task will provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

A more detailed task description can be found on the task description page or on the competition page on Kaggle.

IMPORTANT NOTE: the task results shown on this page only include the submissions that were made using the DCASE submission system. Therefore, there are entries appearing in the official Kaggle leaderboard that do not appear here, and the two rankings do not match.

IMPORTANT NOTE 2: Some of the submitted systems failed to run when presented with the data of the private test set (mainly because kernels took longer to compute than the maximum time allowed). For these systems, Kaggle does not provide us with a score for the private LB set, and they are disqualified from the official ranking. Also, Kaggle only provides us with private LB scores for the two selected submissions per team. Hence, the third system that some teams submitted to DCASE does not have a private LB score attached (only public LB). We have asked the authors of these systems to provide us with private LB scores so we can show them in the tables below. Disqualified systems are shown in the tables below highlighted in red.

Systems ranking

Rank Submission code Kaggle team name Name Tech. report lwlrap (public LB) lwlrap (private LB*)
Zhang_THU_task2_2 THUEE THUEE Zhang2019 0.7392 0.7577
Zhang_THU_task2_1 THUEE THUEE Zhang2019 0.7423 0.7575
Boqing_NUDT_task2_1 TEMP Multi-label Audio tagging system 1 Boqing2019 0.7253 0.7240
Boqing_NUDT_task2_3 TEMP Multi-label Audio tagging system 3 Boqing2019 0.7119 0.6777
Boqing_NUDT_task2_2 TEMP Multi-label Audio tagging system 2 Boqing2019 0.7235 0.7232
Zhang_BIsmart_task2_3 3x6min DCASE2019 Task2 | Teacher-Student V3 Zhang2019b 0.7126 0.7144
Zhang_BIsmart_task2_2 3x6min DCASE2019 Task2 | Teacher-Student V2 Zhang2019b 0.7298 0.7338
Zhang_BIsmart_task2_1 3x6min DCASE2019 Task2 | Teacher-Student V1 Zhang2019b 0.7304 0.7338
Kong_SURREY_task2_1 cvssp_baseline CVSSP cross-task CNN baseline Kong2019 0.5803 0.0000
BOUTEILLON_NOORG_task2_2 Eric Bouteillon BOUTEILLON Warm-up pipeline and spec-mix 2.2 Bouteillon2019 0.7331 0.7419
BOUTEILLON_NOORG_task2_1 Eric Bouteillon BOUTEILLON Warm-up pipeline and spec-mix 2.1 Bouteillon2019 0.7389 0.7519
Akiyama_OU_task2_2 [kaggler-ja/AIMS] OUmed resnet34_envnet_ensemble Raw-Audio and Spectrogram Akiyama2019 0.7474 0.7579
Akiyama_OU_task2_1 [kaggler-ja/AIMS] OUmed resnet34_envnet_ensemble on Raw-Audio and Spectrogram Akiyama2019 0.7504 0.7577
Sun_BNU_task2_1 Penghao CNN+MeanTeacher Sun2019 0.6320 0.6443
Ebbers_UPB_task2_3 Janek Ebbers DCASE2019 UPB system 3 Ebbers2019 0.7071 0.0000
Ebbers_UPB_task2_2 Janek Ebbers DCASE2019 UPB system 2 Ebbers2019 0.7262 0.7456
Ebbers_UPB_task2_1 Janek Ebbers DCASE2019 UPB system 1 Ebbers2019 0.7305 0.7552
HongXiaoFeng_BUPT_task2_1 HongXiaoFeng HongXiaoFeng_BUPT_task2_1 Hong2019 0.6991 0.7152
HongXiaoFeng_BUPT_task2_2 HongXiaoFeng HongXiaoFeng_BUPT_task2_2 Hong2019 0.6991 0.7149
Kharin_MePhI_task2_1 Alexander Khar Kharin_noisy_annealing Kharin2019 0.6637 0.6819
Koutini_CPJKU_task2_1 CP-JKU CP JKU 1 Koutini2019 0.7282 0.7351
Koutini_CPJKU_task2_2 CP-JKU CP JKU 2 Koutini2019 0.7254 0.7374
Fonseca_UPF_task2_1 Challenge Baseline DCASE2019 baseline system Fonseca2019 0.5370 0.5379
PaischerPrinz_CPJKU_task2_1 CPJKUStudents CPJKU Students submission Paischer2019 0.7222 0.7033
PaischerPrinz_CPJKU_task2_2 CPJKUStudents CPJKU Students submission Paischer2019 0.7216 0.7099
PaischerPrinz_CPJKU_task2_3 CPJKUStudents CPJKU Students submission Paischer2019 0.7158 0.7018
Liu_Kuaiyu_task2_1 Kuaiyu Kuaiyu Tagging System Liu2019 0.7348 0.7414
Liu_Kuaiyu_task2_2 Kuaiyu Kuaiyu Tagging System Liu2019 0.7311 0.7366

* Unless stated otherwise, all reported scores are computed using the ground truth for the private leaderboard.
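The lwlrap columns above report label-weighted label-ranking average precision. As a rough illustration of how such a score can be computed, the sketch below averages, over every (clip, true label) pair, the precision of the ranked label list at the position of that true label. The function name `lwlrap` and the toy inputs are assumptions made for this example, ties are broken by class index, and this is not the official evaluation code.

```python
def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision (illustrative sketch).

    truth:  list of per-clip lists of 0/1 ground-truth flags, one per class.
    scores: list of per-clip lists of predicted scores, one per class.
    """
    total_precision = 0.0
    total_positives = 0
    for labels, preds in zip(truth, scores):
        # Class indices sorted from highest to lowest predicted score.
        order = sorted(range(len(preds)), key=lambda c: preds[c], reverse=True)
        hits = 0
        for rank, cls in enumerate(order, start=1):
            if labels[cls]:
                hits += 1
                # Precision at the rank of this true label.
                total_precision += hits / rank
                total_positives += 1
    # Every true label contributes equally, so frequent classes weigh more.
    return total_precision / total_positives if total_positives else 0.0


# Toy example: two clips, three classes.
toy_truth = [[1, 0, 0], [0, 1, 1]]
toy_scores = [[0.9, 0.1, 0.2], [0.8, 0.7, 0.1]]
toy_value = lwlrap(toy_truth, toy_scores)
```

In the toy example the three true labels obtain precisions 1, 1/2 and 2/3, so the resulting value is 13/18.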

Teams ranking

Table including only the best performing system per submitting team.

Rank Submission code Kaggle team name Name Tech. report lwlrap (public LB) lwlrap (private LB)
Zhang_THU_task2_2 THUEE THUEE Zhang2019 0.7392 0.7577
Boqing_NUDT_task2_1 TEMP Multi-label Audio tagging system 1 Boqing2019 0.7253 0.7240
Zhang_BIsmart_task2_2 3x6min DCASE2019 Task2 | Teacher-Student V2 Zhang2019b 0.7298 0.7338
Kong_SURREY_task2_1 cvssp_baseline CVSSP cross-task CNN baseline Kong2019 0.5803 0.0000
BOUTEILLON_NOORG_task2_1 Eric Bouteillon BOUTEILLON Warm-up pipeline and spec-mix 2.1 Bouteillon2019 0.7389 0.7519
Akiyama_OU_task2_2 [kaggler-ja/AIMS] OUmed resnet34_envnet_ensemble Raw-Audio and Spectrogram Akiyama2019 0.7474 0.7579
Sun_BNU_task2_1 Penghao CNN+MeanTeacher Sun2019 0.6320 0.6443
Ebbers_UPB_task2_1 Janek Ebbers DCASE2019 UPB system 1 Ebbers2019 0.7305 0.7552
HongXiaoFeng_BUPT_task2_1 HongXiaoFeng HongXiaoFeng_BUPT_task2_1 Hong2019 0.6991 0.7152
Kharin_MePhI_task2_1 Alexander Khar Kharin_noisy_annealing Kharin2019 0.6637 0.6819
Koutini_CPJKU_task2_2 CP-JKU CP JKU 2 Koutini2019 0.7254 0.7374
Fonseca_UPF_task2_1 Challenge Baseline DCASE2019 baseline system Fonseca2019 0.5370 0.5379
PaischerPrinz_CPJKU_task2_2 CPJKUStudents CPJKU Students submission Paischer2019 0.7216 0.7099
Liu_Kuaiyu_task2_1 Kuaiyu Kuaiyu Tagging System Liu2019 0.7348 0.7414
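
For readers who want to reproduce a per-team table from the full systems list, the following sketch keeps the highest-scoring submission per team. It assumes that the team can be recovered by stripping the `_task2_N` suffix from the submission code and that "best" means the highest private LB score, which is consistent with the rows shown here but is not stated on the page. Only a few rows are copied in for illustration.

```python
import re

# (submission code, private LB lwlrap) pairs copied from the systems ranking.
rows = [
    ("Zhang_THU_task2_2", 0.7577),
    ("Zhang_THU_task2_1", 0.7575),
    ("Koutini_CPJKU_task2_1", 0.7351),
    ("Koutini_CPJKU_task2_2", 0.7374),
    ("Ebbers_UPB_task2_3", 0.0000),
    ("Ebbers_UPB_task2_1", 0.7552),
]


def best_per_team(rows):
    """Return {team: (submission code, score)} with the top score per team."""
    best = {}
    for code, score in rows:
        # Assumption: the team prefix is everything before "_task2_<n>".
        team = re.sub(r"_task2_\d+$", "", code)
        if team not in best or score > best[team][1]:
            best[team] = (code, score)
    return best


team_table = best_per_team(rows)
```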

System characteristics

Input characteristics

Rank Submission code Tech. report lwlrap (public LB) lwlrap (private LB) Acoustic features Data augmentation Use of noisy subset Sampling rate
Zhang_THU_task2_2 Zhang2019 0.7392 0.7577 log-mel energies, CQT mixup, SpecAugment using provided labels 44.1kHz
Zhang_THU_task2_1 Zhang2019 0.7423 0.7575 log-mel energies, CQT mixup, SpecAugment using provided labels 44.1kHz
Boqing_NUDT_task2_1 Boqing2019 0.7253 0.7240 log-mel energies SpecAugment using provided labels 44.1kHz
Boqing_NUDT_task2_3 Boqing2019 0.7119 0.6777 log-mel energies SpecAugment using provided labels 44.1kHz
Boqing_NUDT_task2_2 Boqing2019 0.7235 0.7232 log-mel energies SpecAugment using provided labels 44.1kHz
Zhang_BIsmart_task2_3 Zhang2019b 0.7126 0.7144 log-mel energies frequency masking, time masking, time reversal, mixup using provided labels, automatic re-labeling 32kHz
Zhang_BIsmart_task2_2 Zhang2019b 0.7298 0.7338 log-mel energies, PCEN frequency masking, time masking, time reversal, mixup using provided labels, automatic re-labeling 44.1kHz
Zhang_BIsmart_task2_1 Zhang2019b 0.7304 0.7338 log-mel energies, PCEN frequency masking, time masking, time reversal, mixup using provided labels, automatic re-labeling 44.1kHz
Kong_SURREY_task2_1 Kong2019 0.5803 0.0000 log-mel energies using provided labels 32kHz
BOUTEILLON_NOORG_task2_2 Bouteillon2019 0.7331 0.7419 log-mel energies spec-mix using provided labels 44.1kHz
BOUTEILLON_NOORG_task2_1 Bouteillon2019 0.7389 0.7519 log-mel energies spec-mix using provided labels 44.1kHz
Akiyama_OU_task2_2 Akiyama2019 0.7474 0.7579 log-mel energies, waveform mixup, cutout, random gain, flip, highpass semisupervised, multitask learning 44.1kHz
Akiyama_OU_task2_1 Akiyama2019 0.7504 0.7577 log-mel energies, waveform mixup, cutout, random gain, flip, highpass semisupervised, multitask learning 44.1kHz
Sun_BNU_task2_1 Sun2019 0.6320 0.6443 log-mel energies resample, gaussian noise 44.1kHz
Ebbers_UPB_task2_3 Ebbers2019 0.7071 0.0000 log-mel energies mixup, frequency warping, frequency masking, time masking automatic re-labeling 44.1kHz
Ebbers_UPB_task2_2 Ebbers2019 0.7262 0.7456 log-mel energies mixup, frequency warping, frequency masking, time masking automatic re-labeling 44.1kHz
Ebbers_UPB_task2_1 Ebbers2019 0.7305 0.7552 log-mel energies mixup, frequency warping, frequency masking, time masking automatic re-labeling 44.1kHz
HongXiaoFeng_BUPT_task2_1 Hong2019 0.6991 0.7152 log-mel energies mixup Semi-Supervised Learning 44.1kHz
HongXiaoFeng_BUPT_task2_2 Hong2019 0.6991 0.7149 log-mel energies mixup Semi-Supervised Learning 44.1kHz
Kharin_MePhI_task2_1 Kharin2019 0.6637 0.6819 log-mel energies random crops using provided labels 44.1kHz
Koutini_CPJKU_task2_1 Koutini2019 0.7282 0.7351 log-mel energies mixup using provided labels 44.1kHz
Koutini_CPJKU_task2_2 Koutini2019 0.7254 0.7374 log-mel energies mixup using provided labels 44.1kHz
Fonseca_UPF_task2_1 Fonseca2019 0.5370 0.5379 log-mel energies using provided labels 44.1kHz
PaischerPrinz_CPJKU_task2_1 Paischer2019 0.7222 0.7033 log-mel energies, perceptually weighted mel, perceptually weighted CQT Mixup Augmentation using provided labels 44.1kHz, 32kHz
PaischerPrinz_CPJKU_task2_2 Paischer2019 0.7216 0.7099 log-mel energies, perceptually weighted mel, perceptually weighted CQT Mixup Augmentation using provided labels 44.1kHz, 32kHz
PaischerPrinz_CPJKU_task2_3 Paischer2019 0.7158 0.7018 log-mel energies, perceptually weighted mel, perceptually weighted CQT Mixup Augmentation using provided labels 44.1kHz, 32kHz
Liu_Kuaiyu_task2_1 Liu2019 0.7348 0.7414 log-mel energies mixup using provided labels 44.1kHz
Liu_Kuaiyu_task2_2 Liu2019 0.7311 0.7366 log-mel energies mixup using provided labels 44.1kHz



Machine learning characteristics

Rank Submission code Tech. report lwlrap (public LB) lwlrap (private LB) Classifier Ensemble subsystems Decision making System complexity (parameters)
Zhang_THU_task2_2 Zhang2019 0.7392 0.7577 CNN, RNN, ensemble 15 geometric mean 17000000
Zhang_THU_task2_1 Zhang2019 0.7423 0.7575 CNN, RNN, ensemble 15 geometric mean 17000000
Boqing_NUDT_task2_1 Boqing2019 0.7253 0.7240 CNN 5 arithmetic mean 16800000
Boqing_NUDT_task2_3 Boqing2019 0.7119 0.6777 CNN arithmetic mean 2300000
Boqing_NUDT_task2_2 Boqing2019 0.7235 0.7232 CNN 5 arithmetic mean 16800000
Zhang_BIsmart_task2_3 Zhang2019b 0.7126 0.7144 CNN arithmetic mean 5500000
Zhang_BIsmart_task2_2 Zhang2019b 0.7298 0.7338 CNN 13 arithmetic mean 71500000
Zhang_BIsmart_task2_1 Zhang2019b 0.7304 0.7338 CNN 12 arithmetic mean 66000000
Kong_SURREY_task2_1 Kong2019 0.5803 0.0000 CNN arithmetic mean 4686144
BOUTEILLON_NOORG_task2_2 Bouteillon2019 0.7331 0.7419 CNN arithmetic mean 5250000
BOUTEILLON_NOORG_task2_1 Bouteillon2019 0.7389 0.7519 CNN 2 arithmetic mean 143000000
Akiyama_OU_task2_2 Akiyama2019 0.7474 0.7579 CNN, ensemble 95 weighted average 21800000
Akiyama_OU_task2_1 Akiyama2019 0.7504 0.7577 CNN, ensemble 170 weighted average 21800000
Sun_BNU_task2_1 Sun2019 0.6320 0.6443 CNN arithmetic mean 20700000
Ebbers_UPB_task2_3 Ebbers2019 0.7071 0.0000 CRNN arithmetic mean 2600000
Ebbers_UPB_task2_2 Ebbers2019 0.7262 0.7456 CRNN 3 arithmetic mean 7900000
Ebbers_UPB_task2_1 Ebbers2019 0.7305 0.7552 CRNN 6 arithmetic mean 15900000
HongXiaoFeng_BUPT_task2_1 Hong2019 0.6991 0.7152 CNN, CRNN, ensemble 27 geometric mean 300000000
HongXiaoFeng_BUPT_task2_2 Hong2019 0.6991 0.7149 CNN, ensemble 26 geometric mean 280000000
Kharin_MePhI_task2_1 Kharin2019 0.6637 0.6819 CNN arithmetic mean 4700000
Koutini_CPJKU_task2_1 Koutini2019 0.7282 0.7351 CNN, Receptive Field Regularization 39 arithmetic mean 90000000
Koutini_CPJKU_task2_2 Koutini2019 0.7254 0.7374 CNN, Receptive Field Regularization 24 arithmetic mean 90000000
Fonseca_UPF_task2_1 Fonseca2019 0.5370 0.5379 CNN arithmetic mean 3300000
PaischerPrinz_CPJKU_task2_1 Paischer2019 0.7222 0.7033 CNN 5 arithmetic mean 33700000
PaischerPrinz_CPJKU_task2_2 Paischer2019 0.7216 0.7099 CNN 6 arithmetic mean 48600000
PaischerPrinz_CPJKU_task2_3 Paischer2019 0.7158 0.7018 CNN 5 arithmetic mean 47300000
Liu_Kuaiyu_task2_1 Liu2019 0.7348 0.7414 CNN 5 geometric mean 55000000
Liu_Kuaiyu_task2_2 Liu2019 0.7311 0.7366 CNN 2 geometric mean 30000000

Technical reports

MULTITASK LEARNING AND SEMI-SUPERVISED LEARNING WITH NOISY DATA FOR AUDIO TAGGING

Osamu Akiyama and Junya Sato
Faculty of Medicine (OU), Osaka University, Osaka, Japan.

Abstract

This paper describes our submission to the DCASE 2019 challenge Task 2 "Audio tagging with noisy labels and minimal supervision" [1]. This task is a multi-label audio classification with 80 classes. The training data is composed of a small amount of reliably labeled data (curated data) and a larger amount of data with unreliable labels (noisy data). Additionally, there is a difference in data distribution between curated data and noisy data. To tackle this difficulty, we propose three strategies. The first is multitask learning using noisy data. The second is semi-supervised learning (SSL) using input data with a different distribution from labeled input data. The third is an ensemble method that averages models learned with different time windows. By using these methods, we achieved a score of 0.750 with label-weighted label-ranking average precision (lwlrap), which is in the top 1% on the public leaderboard (LB).

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, cutout, random gain, flip, highpass
Features log-mel energies, waveform
Classifier CNN, ensemble
Decision making weighted average
Ensemble subsystems 170
Complexity 21800000 parameters
Training time 17h (1 x Tesla P-100)

MULTI-LABEL AUDIO TAGGING WITH NOISY LABELS AND VARIABLE LENGTH

Zhu Boqing, Xu Kele, Wang Dezhi and Mathurin ACHE
College of Computer (NUDT), National University of Defense Technology, Changsha, Changsha, China. College of Meteorology and Oceanography (NUDT), National University of Defense Technology, Changsha, Changsha, China. None, None, Paris, France.

Abstract

This paper describes our approach for DCASE 2019 Task 2: Audio tagging with noisy labels and minimal supervision. This challenge uses a smaller set of manually labeled data and a larger set of noise-labeled data to enable the system to perform multi-label audio tagging tasks under minimal supervision conditions. We aim to tag the audio clips with a convolutional neural network under limited computation and storage resources. To tackle the problem of noisy label data, we propose a data generation method named Dominate Mixup. It can restrain the impact of incorrect labels during back propagation and it’s suitable for multi-class classification problems. In response to the variable length of audio clips, we conduct an efficient learning method with cyclical audio length, which allows us to learn more patterns from widely diverse sound events. On the public leaderboard for the competition, our single model and simple ensemble of 5 models score 0.711 and 0.725 respectively.

System characteristics
Sampling rate 44.1kHz
Data augmentation SpecAugment
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Ensemble subsystems 5
Complexity 16800000 parameters
Training time 25h (1 x GeForce RTX 2080Ti)

SPECMIX: A SIMPLE DATA AUGMENTATION TO LEVERAGE CLEAN AND NOISY SET FOR EFFICIENT AUDIO TAGGING

Eric Bouteillon
NOORG, No Organization, France.

Abstract

This paper presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system as well as a novel data augmentation technique for multi-label audio tagging, named SpecMix by the author. These new techniques were applied to our audio tagging system submitted to the Freesound Audio Tagging 2019 challenge carried out within the DCASE 2019 Task 2 challenge [3]. The purpose of this challenge is to predict the audio labels for every test clip using machine learning techniques trained on a small amount of reliable, manually-labeled data, and a larger quantity of noisy web audio data in a multi-label audio tagging task with a large vocabulary setting.

System characteristics
Sampling rate 44.1kHz
Data augmentation spec-mix
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Ensemble subsystems 2
Complexity 143000000 parameters
Training time 72h (1 x rtx2080ti)

CONVOLUTIONAL RECURRENT NEURAL NETWORK AND DATA AUGMENTATION FOR AUDIO TAGGING WITH NOISY LABELS AND MINIMAL SUPERVISION

Janek Ebbers and Reinhold Haeb-Umbach
Communications Engineering (UPB), Paderborn University, Paderborn, Germany.

Abstract

This report presents our Audio Tagging system for the DCASE 2019 Challenge Task 2. Our proposed neural network architecture consists of a convolutional front end using log-mel-energies as input features, a recurrent neural network sequence encoder outputting a single vector for a whole sequence and finally a fully connected classifier network outputting an activity probability for each of the 80 event classes. Due to the limited amount of available data we use various data augmentation techniques to prevent overfitting and improve generalization. Our best system achieves a label-weighted label-ranking average precision (lwlrap) of 73.0% on the public test set which is an absolute improvement of 19.3% over the baseline.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, frequency warping, frequency masking, time masking
Features log-mel energies
Classifier CRNN
Decision making arithmetic mean
Ensemble subsystems 6
Complexity 15900000 parameters
Training time 10h (6 x GTX 980)

AUDIO TAGGING WITH NOISY LABELS AND MINIMAL SUPERVISION

Eduardo Fonseca, Frederic Font, Manoj Plakal and Daniel P. W. Ellis
Machine Perception Team (GOOGLE), Google Research, New York, USA. Music Technology Group (UPF), Universitat Pompeu Fabra, Barcelona, Barcelona, Spain.

Abstract

This paper introduces Task 2 of the DCASE2019 Challenge, titled “Audio tagging with noisy labels and minimal supervision”. This task was hosted on the Kaggle platform as “Freesound Audio Tagging 2019”. The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set due to the fact that they come from different web audio sources. This can correspond to a realistic scenario given by the difficulty of gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.

System characteristics
Sampling rate 44.1kHz
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Complexity 3300000 parameters
Training time 8h (1 x Tesla V-100)

MULTI-LABEL AUDIO TAGGING SYSTEM FOR FREESOUND 2019: FOCUSING ON NETWORK ARCHITECTURES, LABEL NOISY AND LOSS FUNCTIONS

Xiaofeng Hong and Gang Liu
Pattern Recognition and Intelligent System Laboratory (PRIS Lab) (BUPT), Beijing University of Posts and Telecommunications, Beijing, China.

Abstract

In this technical report, we propose our solutions applied to our submission for DCASE2019 Task2. We focus on the model architectures which can efficiently tag the audio with multi-label and noisy label. Furthermore, we use multi-label models based on convolutional network and recurrent network to unify detection of audio events. Graph representation is also utilized to take the audio event co-occurrence into account which is reflected in the loss functions. We also tried Semi-Supervised Learning to use the noisy data. Finally, we tried an ensemble of CNNs and CRNN, trained by using cross validation folds. Compared to the baseline score of 0.537, we achieved a score of 0.700 on the public leaderboard.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, ensemble
Decision making geometric mean
Ensemble subsystems 26
Complexity 280000000 parameters
Training time 10h (1 x TITAN Xp)

DCASE 2019 CHALLENGE NOISY_ANNEALING SYSTEM TECHNICAL REPORT

Alexander Kharin
Laboratory of Bionanophotonics (MePhI), National Research Nuclear University MePhI, Moscow, Moscow, Russia.

Abstract

A multi-layer convolutional neural network followed by a dense layer, with 4.7 million parameters, was trained on mel-spectrograms of the audio data. Such a large number of parameters and a small dataset (~9k samples without augmentation) make the model vulnerable to overfitting. Augmentation of the audio files (i.e. cropping of spectrograms) was not found to be a very effective way to get rid of overfitting. The following approaches were found to be reasonable: the standard K-fold technique, with training on 5 folds and averaging of the results, and so-called ‘noisy data annealing’. That method relies on sequential training of the model on the general set for several epochs (30 in our case), followed by training on the poorly labeled but larger dataset for 5 epochs. After several cycles we can observe a significant reduction of the overfitting (lwlrap scores of 0.61 for the base model and 0.66 for the noisy-data-annealed model). Such an increase is caused by a partial ‘reset’ of the trainable parameters during training on the poorly-labelled set. The more set-specific the parameters are, the higher the ‘reset’ rate is, so such annealing enhances the significance of non-overfitting-responsible features and reduces the impact of highly dataset-specific features.
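
The alternating schedule described in this abstract can be pictured with a small generator. This is only a sketch of the epoch ordering: it assumes that the "general set" corresponds to the curated subset, the subset names are placeholders, and the number of cycles is a free parameter because the abstract only says "several cycles".

```python
def annealing_schedule(cycles, curated_epochs=30, noisy_epochs=5):
    """Yield (epoch index, subset name) pairs for noisy data annealing.

    Each cycle trains on the curated subset for `curated_epochs` epochs and
    then on the larger, noisily labeled subset for `noisy_epochs` epochs.
    """
    epoch = 0
    for _ in range(cycles):
        for _ in range(curated_epochs):
            yield epoch, "curated"
            epoch += 1
        for _ in range(noisy_epochs):
            yield epoch, "noisy"
            epoch += 1


# Two cycles give 2 * (30 + 5) = 70 epochs in total.
schedule = list(annealing_schedule(cycles=2))
```

A training loop would iterate over `schedule` and pick the data loader that matches the subset name for each epoch.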

System characteristics
Sampling rate 44.1kHz
Data augmentation random crops
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Complexity 4700000 parameters
Training time 5h (1 x TESLA K80)

CROSS-TASK LEARNING FOR AUDIO TAGGING, SOUND EVENT DETECTION AND SPATIAL LOCALIZATION: DCASE 2019 BASELINE SYSTEMS

Qiuqiang Kong, Yin Cao, Turab Iqbal, Wenwu Wang and Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP) (SURREY), University of Surrey, Guildford, England.

Abstract

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event detection in domestic environments, and 5) urban sound tagging. In this paper, we propose generic cross-task baseline systems based on convolutional neural networks (CNNs). The motivation is to investigate the performance of a variety of models across several audio recognition tasks without exploiting the specific characteristics of the tasks. We looked at CNNs with 5, 9, and 13 layers, and found that the optimal architecture is task-dependent. For the systems we considered, we found that the 9-layer CNN with average pooling after convolutional layers is a good model for a majority of the DCASE 2019 tasks.

System characteristics
Sampling rate 32kHz
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Complexity 4686144 parameters
Training time 2h (1 x GTX Titan Xp)

CP-JKU SUBMISSIONS TO DCASE’19: ACOUSTIC SCENE CLASSIFICATION AND AUDIO TAGGING WITH RECEPTIVE-FIELD-REGULARIZED CNNS

Khaled Koutini, Hamid Eghbal-zadeh and Gerhard Widmer
Institute of Computational Perception (JKU), Johannes Kepler University Linz, Linz, Austria.

Abstract

In this report, we detail the CP-JKU submissions to the DCASE2019 challenge Task 1 (acoustic scene classification) and Task 2 (audio tagging with noisy labels and minimal supervision). In all of our submissions, we use fully convolutional deep neural network architectures that are regularized with Receptive Field (RF) adjustments. We adjust the RF of variants of Resnet and Densenet architectures to best fit the various audio processing tasks that use the spectrogram features as input. Additionally, we propose novel CNN layers such as Frequency-Aware CNNs, and new noise compensation techniques such as Adaptive Weighting for Learning from Noisy Labels to cope with the complexities of each task. We prepared all of our submissions without the use of any external data. Our focus in this year’s submissions is to provide the best-performing single-model submission, using our proposed approaches.

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN, Receptive Field Regularization
Decision making arithmetic mean
Ensemble subsystems 24
Complexity 90000000 parameters
Training time 18h (1 x 1080ti)

STACKED CONVOLUTIONAL NEURAL NETWORKS FOR AUDIO TAGGING WITH NOISE LABELS

Yanfang Liu and Qingkai Wei
Kuaiyu, Beijing Kuaiyu Electronics Co., Ltd., Beijing, China.

Abstract

This technical report describes the system we used to participate in Task 2 of the DCASE 2019 challenge. The task is to predict the tags of audio recordings using a small number of manually-verified labels and a much larger number of noisy labels. In this task, we propose several convolutional neural networks to learn from log-mel spectrogram features. To improve the performance, different techniques are involved, including preprocessing, data augmentation, loss functions and cross-validation. The prediction results are then ensembled using the geometric mean. On the test set used for evaluation, our system achieved a score of 0.734.
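
The "Decision making" column in the tables distinguishes arithmetic-mean and geometric-mean ensembling. The sketch below contrasts the two on per-class probabilities from several models. The function names and the `eps` floor are illustrative choices, not taken from any of the reports.

```python
import math


def arithmetic_mean(predictions):
    """Average per-class probabilities across models (list of lists)."""
    n_models = len(predictions)
    return [sum(per_class) / n_models for per_class in zip(*predictions)]


def geometric_mean(predictions, eps=1e-12):
    """Geometric mean across models, computed in the log domain.

    `eps` is an assumed floor that keeps log() defined for zero outputs.
    """
    n_models = len(predictions)
    return [
        math.exp(sum(math.log(max(p, eps)) for p in per_class) / n_models)
        for per_class in zip(*predictions)
    ]


# Two models, two classes.
model_outputs = [[0.9, 0.1], [0.4, 0.4]]
mean_arith = arithmetic_mean(model_outputs)  # roughly [0.65, 0.25]
mean_geo = geometric_mean(model_outputs)     # roughly [0.60, 0.20]
```

The geometric mean is pulled towards the lowest individual score, so a class needs reasonable support from every model to rank highly.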

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup
Features log-mel energies
Classifier CNN
Decision making geometric mean
Ensemble subsystems 2
Complexity 30000000 parameters
Training time 10h (1 x Titan xp)

AUDIO TAGGING WITH CONVOLUTIONAL NEURAL NETWORKS TRAINED WITH NOISY DATA

Fabian Paischer, Katharina Prinz and Gerhard Widmer
Institute of Computational Perception (CPJKU), Johannes Kepler University, Linz, Linz, Austria.

Abstract

This report is a description of our submission to the 2019 DCASE Challenge, Task 2. The task at hand is to predict one or more audio tags, out of the 80 available tags, for the audio clips of different lengths, originating from two different datasets. For training, a total number of 4970 audio clips is provided with trustworthy labels, whereas 19815 samples contain a substantial amount of label noise with unknown noise ratio. To tackle this task, we propose two different convolutional neural network (CNN) architectures trained on different features to capture different aspects of the data. Stochastic Weight Averaging is used in order to improve generalisation. By averaging over the predictions of all five networks, we obtain an ensemble that provides us with the likelihood of 80 different labels being present in an input audio clip. On the unseen data of the Public Kaggle Leaderboard, our system reaches a Label Weighted Label Ranking Average Precision (Lwlrap) of 0.722.
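
Stochastic Weight Averaging, mentioned in this abstract, replaces the final weights with the average of several weight snapshots collected during training. The sketch below shows only that averaging step, with weights stored as plain dictionaries of float lists. The data layout and the function name are assumptions for illustration and do not reflect the authors' code; in practice, batch normalization statistics usually need to be recomputed after averaging.

```python
def average_weights(snapshots):
    """Element-wise average of weight snapshots.

    snapshots: list of dicts mapping a parameter name to a flat list of
    floats. All snapshots are assumed to share the same names and sizes.
    """
    n = len(snapshots)
    averaged = {}
    for name in snapshots[0]:
        columns = zip(*(snapshot[name] for snapshot in snapshots))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Two snapshots of a tiny model with one weight vector and one bias.
checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
swa_weights = average_weights(checkpoints)
```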

System characteristics
Sampling rate 44.1kHz, 32kHz
Data augmentation Mixup Augmentation
Features log-mel energies, perceptually weighted mel, perceptually weighted CQT
Classifier CNN
Decision making arithmetic mean
Ensemble subsystems 5
Complexity 47300000 parameters
Training time 192h (2 x NVIDIA GeForce GTX 1080)

Audio Tagging with Minimal Supervision Based on Mean Teacher for DCASE 2019 Challenge

Jun He, Penghao Rao, Bo Sun and Lejun Yu
College of Information Science and Technology (BNU), Beijing Normal University, Beijing, Beijing, China.

Abstract

In this report, we describe the mean teacher based audio tagging system and its performance on Task 2 of the DCASE 2019 challenge, where the task evaluates systems for audio tagging with noisy labels and minimal supervision. The proposed system is based on a VGG16 network with attention mechanism and gated CNN. The following data augmentation techniques are used to increase model robustness: a) scaling the signal by 0.75 to 1.5 times, b) adding Gaussian white noise with 20dB to 40dB. Samples with noisy labels are regarded as unlabeled and are utilized with a semi-supervised method, namely mean teacher. The proposed system is trained using 5-fold cross-validation, and the final result is the arithmetic mean of the five models. Finally, the method provides an lwlrap score of 0.631, which is measured through the Kaggle platform.
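
Two ingredients named in this abstract lend themselves to a short sketch: the mean teacher weight update, where the teacher tracks an exponential moving average of the student weights, and additive Gaussian white noise at a chosen level. The code assumes that the 20dB to 40dB range refers to a signal-to-noise ratio, and the smoothing factor `alpha=0.99` is a placeholder. Neither detail is confirmed by the abstract.

```python
import math
import random


def ema_update(teacher, student, alpha=0.99):
    """Move each teacher weight a small step towards the student weight."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def noise_std_for_snr(signal, snr_db):
    """Standard deviation of white noise that yields the requested SNR (dB)."""
    signal_power = sum(v * v for v in signal) / len(signal)
    return math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))


def add_white_noise(signal, snr_low=20.0, snr_high=40.0, rng=random):
    """Add Gaussian white noise at an SNR drawn uniformly from the range."""
    snr_db = rng.uniform(snr_low, snr_high)
    std = noise_std_for_snr(signal, snr_db)
    return [v + rng.gauss(0.0, std) for v in signal]


# A unit-power square wave: 20 dB SNR corresponds to a noise std of 0.1.
square_wave = [1.0, -1.0, 1.0, -1.0]
std_at_20_db = noise_std_for_snr(square_wave, 20.0)
noisy_wave = add_white_noise(square_wave, rng=random.Random(0))
```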

System characteristics
Sampling rate 44.1kHz
Data augmentation resample, gaussian noise
Features log-mel energies
Classifier CNN
Decision making arithmetic mean
Complexity 20700000 parameters
Training time 24h (1 x GTX 1080 Ti)

THUEE SYSTEM FOR DCASE 2019 CHALLENGE TASK 2

Kexin He, Yuhan Shen and Weiqiang Zhang
Department of Electronic Engineering (THU), Tsinghua University, Beijing, China.

Abstract

In this report, we describe our submission for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge: Audio tagging with noisy labels and minimal supervision. Our methods are mainly based on two types of deep learning models: Convolutional Recurrent Neural Network (CRNN) and DenseNet. In order to prevent overfitting, we adopted data augmentation using the mixup strategy and SpecAugment. Besides, we designed a staged loss function to train our models using both curated and noisy data. We also used various acoustic features, including log-mel energies and perceptual Constant-Q transform (p-CQT), and tried an ensemble of multiple subsystems to enhance the generalization capability of our system. Our final system achieved an lwlrap score of 0.742 on the public leaderboard in Kaggle.
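
Mixup appears in the data augmentation column of most top systems. A minimal sketch of the basic idea follows: two training examples and their multi-hot label vectors are blended with a weight drawn from a Beta distribution. The value `alpha=0.4` and the flat-list feature layout are assumptions for illustration and are not taken from this report.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two examples and their label vectors with one random weight.

    Returns the mixed features, the mixed labels, and the weight `lam`.
    """
    lam = rng.betavariate(alpha, alpha)  # lam lies in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy feature vectors with one-hot labels for two classes.
mixed_x, mixed_y, lam = mixup(
    [1.0, 1.0], [1.0, 0.0],
    [0.0, 0.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

Because the labels are blended with the same weight as the inputs, the mixed target stays a soft multi-label vector rather than a hard 0/1 assignment.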

System characteristics
Sampling rate 44.1kHz
Data augmentation mixup, SpecAugment
Features log-mel energies, CQT
Classifier CNN, RNN, ensemble
Decision making geometric mean
Ensemble subsystems 15
Complexity 17000000 parameters
Training time 10h (1 x Tesla P-100)

DCASE 2019 TASK 2: SEMI-SUPERVISED NETWORKS WITH HEAVY DATA AUGMENTATIONS TO BATTLE AGAINST LABEL NOISE IN AUDIO TAGGING TASK

Jihang Zhang and Jie Wu
Data Analyst (BIsmart), getBIsmart, Irvine, California, USA. Data Scientist (ENGINE), ENGINE | Transformation, London, UK.

Abstract

This technical report describes a system used for DCASE 2019 Task 2: Audio tagging with noisy labels and minimal supervision. Building a large-scale multi-label dataset normally requires an extensive amount of manual effort, especially for a general-purpose audio tagging system. To tackle the problem, we use a semi-supervised teacher-student convolutional neural network (CNN) to leverage the substantial noisy labels and the small set of curated labels in the dataset. To further regularize the system, we exploit multiple data augmentation methods, including SpecAugment [1], mixup [2], and an innovative time reversal augmentation approach. Moreover, a combination of binary Focal [3] and ArcFace [4] losses is used to increase the accuracy of pseudo labels produced by the semi-supervised network, and to accelerate the training process. Adaptive test time augmentation (TTA) based on the lengths of audio samples is used as a final approach to improve the system. We choose a single system that generates the submission file Zhang_BIsmart_task2_3.output.csv to be the candidate model considered for the Judges’ Award. The other two systems use an ensemble approach to further improve the performance.
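
The augmentations listed for this system (frequency masking, time masking, time reversal) all operate directly on a spectrogram. The sketch below shows one plausible form of each on a spectrogram stored as a list of frequency rows. The mask widths, the zero fill value and the function names are assumptions, and the authors' implementation may differ.

```python
import random


def time_reverse(spec):
    """Flip the spectrogram along the time axis."""
    return [row[::-1] for row in spec]


def mask_frequency(spec, max_width, rng=random, fill=0.0):
    """Replace a random band of consecutive frequency rows with `fill`."""
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))
    start = rng.randint(0, n_freq - width)
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def mask_time(spec, max_width, rng=random, fill=0.0):
    """Replace a random run of consecutive time frames with `fill`."""
    n_time = len(spec[0])
    width = rng.randint(0, min(max_width, n_time))
    start = rng.randint(0, n_time - width)
    return [
        [fill if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]


# Toy spectrogram: 3 frequency bins by 4 time frames.
toy_spec = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
reversed_spec = time_reverse(toy_spec)
```

All three functions return new lists, so the same clip can be augmented differently in every epoch without modifying the stored features.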

System characteristics
Sampling rate 44.1kHz
Data augmentation frequency masking, time masking, time reversal, mixup
Features log-mel energies, PCEN
Classifier CNN
Decision making arithmetic mean
Ensemble subsystems 12
Complexity 66000000 parameters
Training time 60h (1 x GeForce RTX 2070) + 5h (1 x Tesla P100)


Other resources generated in the Kaggle competition

The table below shows additional resources that were created and made accessible by Kaggle participants during the competition but that were not submitted to the DCASE Challenge. Note that teams who submitted to the DCASE Challenge are deliberately omitted from this table, as their generated resources are referenced in the sections above.

Team name Kaggle ranking Kaggle score Code Report
Ruslan Baikulov 1 0.75980 https://github.com/lRomul/argus-freesound kaggle writeup
the art of ensemble 2 0.75913 https://github.com/qrfaction/2nd-Freesound-Audio-Tagging-2019 kaggle writeup
Dmitriy Danevskiy 3 0.75892 https://github.com/ex4sperans/freesound-classification kaggle writeup
Miguel Pinto 6 0.75421 https://github.com/mnpinto/audiotagging2019 Medium blog post
[kaggler-ja] Shirogane 7 0.75302 https://www.kaggle.com/hidehisaarai1213/freesound-7th-place-solution kaggle writeup
4 people 9 0.74835 https://www.kaggle.com/theoviel/9th-place-modeling-kernel technical report
VFA 13 0.73993 - kaggle writeup
和你一起虚度时光 19 0.73399 - kaggle writeup
[dsmlkz] Dombra Power 21 0.73371 - kaggle writeup
daisukelab 38 0.72308 https://github.com/daisukelab/freesound-audio-tagging-2019 technical report
Audio4Fun 77 0.70251 - kaggle writeup
Robert Bracco 210 0.54651 - kaggle writeup
- - - Tutorial on Medium - How to Participate in a Kaggle Competition with Zero Code https://towardsdatascience.com/f017918d2f08