Sound Event Detection with Soft Labels


Challenge results

Task description

The task evaluates systems for the detection of sound events trained on softly labeled data, in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled data. The systems must provide not only the event class but also the time boundaries of the multiple events present in real-life audio recordings. The main goal of the task is to investigate whether using soft labels brings any improvement in performance.

A more detailed task description can be found on the task description page.
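
The defining ingredient of the task is that annotations are probabilities rather than binary flags. As a minimal sketch of how such soft targets can be consumed directly by a standard binary cross-entropy loss without any thresholding, consider the following (the batch size, segment count, and 11-class setup are illustrative assumptions, not the task specification):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 8 clips, 60 one-second segments,
# 11 event classes (all illustrative, not the task specification).
logits = torch.randn(8, 60, 11)        # model outputs
soft_targets = torch.rand(8, 60, 11)   # annotator agreement in [0, 1]

# BCE accepts probabilistic targets directly, so soft labels need no
# thresholding at training time; gradients scale with annotator confidence.
soft_loss = F.binary_cross_entropy_with_logits(logits, soft_targets)

# Hard-label training would instead binarize, discarding that confidence:
hard_targets = (soft_targets >= 0.5).float()
hard_loss = F.binary_cross_entropy_with_logits(logits, hard_targets)
print(soft_loss.item(), hard_loss.item())
```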

Systems ranking

Submission label | Name | Technical report | Evaluation dataset: F1_MO, ER_m, F1_m, F1_M | Development dataset: F1_MO, ER_m, F1_m, F1_M
(F1_MO: macro-average F1 with optimum threshold per class; ER_m: micro-average error rate; F1_m: micro-average F1; F1_M: macro-average F1)
Bai_JLESS_task4b_1 Two_D+ Yin2023 58.21 0.345 79.84 45.98 50.79 0.402 75.28 39.21
Bai_JLESS_task4b_2 Two_D+_en Yin2023 59.77 0.325 80.84 40.29 54.30 0.387 77.21 42.05
Bai_JLESS_task4b_3 One_en Yin2023 58.00 0.328 80.76 37.28 52.25 0.394 76.90 40.58
Bai_JLESS_task4b_4 All_SED_sys Yin2023 60.74 0.320 81.01 37.33 56.16 0.360 78.63 42.45
Cai_NCUT_task4b_1 NCUT_1 Zhang2023 43.60 0.367 77.86 35.71 43.50 0.439 74.84 39.57
Cai_NCUT_task4b_2 NCUT_2 Zhang2023 43.58 0.376 77.96 36.45 44.49 0.443 73.38 35.60
Cai_NCUT_task4b_3 NCUT_3 Zhang2023 42.14 0.346 79.01 33.36 44.47 0.432 73.89 34.86
Liu_NJUPT_task4b_1 NJUPT_1 Liu2023 19.82 0.786 34.03 5.62 69.28 0.684 32.53 18.18
Liu_NJUPT_task4b_2 NJUPT_2 Liu2023 20.83 0.886 22.27 4.51 72.13 0.724 25.53 11.28
Liu_NJUPT_task4b_3 NJUPT_3 Liu2023 22.53 0.688 40.85 5.69 71.53 0.713 29.64 18.19
Liu_NJUPT_task4b_4 NJUPT_4 Liu2023 22.46 0.739 37.56 5.82 74.15 0.751 26.89 19.30
Liu_SRCN_task4b_1 CRNN_t4b Jin2023 44.69 0.419 75.95 31.03 43.98 0.500 71.04 34.73
Liu_SRCN_task4b_2 AST_t4b Jin2023 52.03 0.320 80.89 31.74 49.70 0.430 72.90 28.80
DCASE2023 baseline Baseline_task4b 43.44 0.484 74.13 35.28 42.87 0.487 70.34 35.83
Min_KAIST_task4b_1 STRF_aug Min2023 48.95 0.361 78.05 29.19 45.81 0.445 72.78 36.12
Min_KAIST_task4b_2 STRF_aug_e Min2023 48.72 0.351 78.27 28.68 45.37 0.447 72.53 35.20
Min_KAIST_task4b_3 STRF_AST Min2023 45.21 0.397 74.77 21.94 45.41 0.461 70.20 27.82
Min_KAIST_task4b_4 STRF_AST_e Min2023 46.24 0.390 75.23 21.98 44.27 0.453 71.00 29.83
Nhan_VNUHCMUS_task4b_1 STTeam Nhan2023 47.17 1.000 nan 0.00 46.71 0.450 72.43 37.32
Xu_SJTU_task4b_1 sjtu_baseline Xuenan2023 46.13 0.371 78.05 32.29 55.79 0.386 78.15 42.96
Xu_SJTU_task4b_2 fc_beat Xuenan2023 50.88 0.362 77.80 24.41 59.88 0.369 77.80 28.52
Xu_SJTU_task4b_3 scene_ens Xuenan2023 51.13 0.329 80.80 35.58 69.85 0.246 86.13 57.91
Xu_SJTU_task4b_4 time_beat Xuenan2023 46.99 0.396 75.04 24.87 57.25 0.354 78.95 37.29

Teams ranking

The table includes only the best-ranked submission per submitting team.

Submission label | Name | Technical report | Evaluation dataset: F1_MO, ER_m, F1_m, F1_M | Development dataset: F1_MO, ER_m, F1_m, F1_M
Bai_JLESS_task4b_4 All_SED_sys Yin2023 60.74 0.320 81.01 37.33 56.16 0.360 78.63 42.45
Cai_NCUT_task4b_1 NCUT_1 Zhang2023 43.60 0.367 77.86 35.71 43.50 0.439 74.84 39.57
Liu_NJUPT_task4b_1 NJUPT_1 Liu2023 24.24 0.991 2.71 0.75 63.43 0.193 72.91 59.76
Liu_SRCN_task4b_2 AST_t4b Jin2023 52.03 0.320 80.89 31.74 49.70 0.430 72.90 28.80
DCASE2023 baseline Baseline_task4b Martin2023 43.44 0.484 74.13 35.28 42.87 0.487 70.34 35.83
Min_KAIST_task4b_1 STRF_aug Min2023 48.95 0.361 78.05 29.19 45.81 0.445 72.78 36.12
Nhan_VNUHCMUS_task4b_1 STTeam Nhan2023 47.17 1.000 nan 0.00 46.71 0.450 72.43 37.32
Xu_SJTU_task4b_3 scene_ens Xuenan2023 51.13 0.329 80.80 35.58 69.85 0.246 86.13 57.91

System characteristics

Code | Technical report | F1_MO (evaluation dataset) | Data augmentation | System | Features
Bai_JLESS_task4b_1 Yin2023 58.21 mixup Conformer log-mel energies
Bai_JLESS_task4b_2 Yin2023 59.77 mixup Conformer log-mel energies
Bai_JLESS_task4b_3 Yin2023 58.00 mixup Conformer log-mel energies
Bai_JLESS_task4b_4 Yin2023 60.74 mixup Conformer log-mel energies
Cai_NCUT_task4b_1 Zhang2023 43.60 mixup CRNN log-mel energies
Cai_NCUT_task4b_2 Zhang2023 43.58 mixup SK-RCRNN,CRNN log-mel energies
Cai_NCUT_task4b_3 Zhang2023 42.14 mixup SK-RCRNN,CRNN,RCRNN log-mel energies
Liu_NJUPT_task4b_1 Liu2023 19.82 MViT mel energies
Liu_NJUPT_task4b_2 Liu2023 20.83 MViT mel energies
Liu_NJUPT_task4b_3 Liu2023 22.53 MViT mel energies
Liu_NJUPT_task4b_4 Liu2023 22.46 MViT mel energies
Liu_SRCN_task4b_1 Jin2023 44.69 mixup CRNN spectrogram
Liu_SRCN_task4b_2 Jin2023 52.03 mixup CRNN spectrogram
DCASE2023 baseline 43.44 CRNN mel energies
Min_KAIST_task4b_1 Min2023 48.95 CRNN, STRFaugNet log-mel energies
Min_KAIST_task4b_2 Min2023 48.72 CRNN, STRFaugNet, ensemble log-mel energies
Min_KAIST_task4b_3 Min2023 45.21 CRNN, STRFaugNet log-mel energies
Min_KAIST_task4b_4 Min2023 46.24 CRNN, STRFaugNet, ensemble log-mel energies
Nhan_VNUHCMUS_task4b_1 Nhan2023 47.17 specaugment, wavaugment Self Attention CRNN mel-spectrogram
Xu_SJTU_task4b_1 Xuenan2023 46.13 CRNN mel energies
Xu_SJTU_task4b_2 Xuenan2023 50.88 CRNN mel energies
Xu_SJTU_task4b_3 Xuenan2023 51.13 CRNN mel energies
Xu_SJTU_task4b_4 Xuenan2023 46.99 CRNN mel energies



Technical reports

DCASE 2023 Challenge Task4 Technical Report

Yongbin Jin1, Minjun Chen1, Jun Shao1, Yangyang Liu1, Bo Peng1 and Jie Chen2
1Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China, 2Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China

Abstract

We describe our submitted systems for DCASE 2023 Task 4 in this technical report: Sound Event Detection with Weak Labels and Synthetic Soundscapes (Subtask A), and Sound Event Detection with Soft Labels (Subtask B). We focus on constructing a CRNN model that fuses embeddings extracted by the BEATs or AST pre-trained models, and we use frequency dynamic convolution (FDY-CRNN) and channel-wise selective kernel attention (SKA) to obtain an adaptive receptive field. To obtain multiple models of different architectures for an ensemble, we also fine-tune multiple BEATs models on the SED dataset. To further exploit the weakly labeled and unlabeled subsets of the DESED dataset, we pseudo-label these subsets through multiple iterations of self-training. We also use a small portion of audio files from the AudioSet dataset, processed with the same self-training procedure. We train these models under two settings, one optimizing the PSDS1 score and the other optimizing the PSDS2 score. Our proposed systems achieve polyphonic sound event detection scores (PSDS) of 0.570 (PSDS scenario 1) and 0.889 (PSDS scenario 2) on the development dataset of Subtask A, and a macro-average F1 score with optimum threshold per class (F1_MO) of 49.70 on the development dataset of Subtask B.
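
As a rough illustration of the embedding-fusion idea described above, here is a minimal sketch of a CRNN whose recurrent layers consume CNN features concatenated with time-aligned embeddings from a frozen pre-trained encoder. All dimensions and the toy architecture are assumptions, not the authors' implementation, which additionally uses FDY convolution and SKA:

```python
import torch
import torch.nn as nn

class FusionCRNN(nn.Module):
    """Sketch of a CRNN that fuses CNN features with frozen pre-trained
    embeddings (e.g. from BEATs or AST) before the recurrent layers."""

    def __init__(self, n_mels=64, emb_dim=768, n_classes=11):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d((1, 4)),                  # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d((1, 4)),
        )
        cnn_dim = 64 * (n_mels // 16)
        self.gru = nn.GRU(cnn_dim + emb_dim, 128,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, n_classes)

    def forward(self, mel, emb):
        # mel: (B, T, n_mels); emb: (B, T, emb_dim) from the frozen encoder
        x = self.cnn(mel.unsqueeze(1))             # (B, C, T, F')
        x = x.permute(0, 2, 1, 3).flatten(2)       # (B, T, C*F')
        x = torch.cat([x, emb], dim=-1)            # fuse along feature axis
        x, _ = self.gru(x)
        return self.head(x)                        # frame-wise class logits

model = FusionCRNN()
mel = torch.randn(2, 100, 64)    # hypothetical 100-frame batch
emb = torch.randn(2, 100, 768)   # time-aligned pre-trained embeddings
print(model(mel, emb).shape)     # torch.Size([2, 100, 11])
```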

System characteristics
Input mono
Classifier CRNN
Acoustic features spectrogram
Data augmentation mixup
PDF

Sound Event Detection System Using a Modified MViT for DCASE 2023 Challenge Task 4b

Shutao Liu1, Peihong Zhang2, Fulin Yang2, Chenyang Zhu1, Shengchen Li3 and Xi Shao1
1Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China, 2Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China, 3Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China

Abstract

In this report, we describe our submissions for Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge: Sound Event Detection with Soft Labels. We use an MViT model based on frequency dynamic convolution. While preserving MViT's advantages in multi-scale feature extraction, frequency dynamic convolution is used to overcome the translation invariance that the MViT model inherits from image feature extraction, improving the model's ability to extract frequency-dimension features. Without using any external datasets or pre-trained models, our system was trained only on the provided soft-label dataset; the final F1_m and F1_MO scores are 80.52 and 63.43, respectively, both higher than the baseline system.
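
The following is a simplified sketch of the frequency dynamic convolution mechanism the abstract refers to: several basis kernels are mixed with frequency-dependent attention weights, so the effective filter varies along the frequency axis instead of being translation-invariant. The kernel count, attention design, and hyperparameters are assumptions, not the submission's exact configuration:

```python
import torch
import torch.nn as nn

class FreqDynamicConv2d(nn.Module):
    """Simplified sketch of frequency dynamic convolution: K basis kernels
    are mixed with weights that depend on the frequency position, so the
    filters applied to low and high frequency bands can differ."""

    def __init__(self, in_ch, out_ch, k=4, kernel_size=3):
        super().__init__()
        self.basis = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(k))
        # Frequency-wise attention: pool over time, then predict a softmax
        # over the K basis kernels per frequency bin.
        self.attn = nn.Conv1d(in_ch, k, kernel_size=1)

    def forward(self, x):                    # x: (B, C, T, F)
        ctx = x.mean(dim=2)                  # (B, C, F) average over time
        w = self.attn(ctx).softmax(dim=1)    # (B, K, F) per-bin kernel weights
        outs = torch.stack([conv(x) for conv in self.basis], dim=1)  # (B,K,C',T,F)
        return (outs * w[:, :, None, None, :]).sum(dim=1)            # (B,C',T,F)

layer = FreqDynamicConv2d(1, 16)
print(layer(torch.randn(2, 1, 100, 64)).shape)  # torch.Size([2, 16, 100, 64])
```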

System characteristics
Input mono
Classifier MViT
Acoustic features mel energies
PDF

Application of Spectro-Temporal Receptive Field for DCASE 2023 Challenge Task4 B

Deokki Min, Hyeonuk Nam and Park Yong-Hwa
Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

Abstract

The Spectro-Temporal Receptive Field (STRF) is a linear function that describes the relationship between a sound stimulus and the neural response of the primary auditory cortex (A1). By convolving a sound spectrogram with an STRF, the response of A1 cells can be predicted. A1 is known to estimate the spectro-temporal modulation information of input sound, and, reflecting this characteristic of A1, the STRF is designed to capture both spectral and temporal modulation information. In this work, we use STRFs as CNN kernels and construct a two-branch deep learning model. One branch, named STRFNet, has STRF kernels in its first CNN layer and extracts neuroscience-inspired spectro-temporal modulation information. The other branch is a CRNN with more layers than the baseline, which extracts the general time-frequency information of the input spectrogram. The two-branch model is named STRFaugNet, and it outperforms the baseline by 6.9%.
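
To make the STRF-as-CNN-kernel idea concrete, here is a hedged sketch that builds a small bank of Gabor-style spectro-temporal kernels and installs them as fixed first-layer convolution filters. The Gabor parameterization (`rate`, `scale`) and all sizes are illustrative assumptions, not the paper's chosen STRF form or values:

```python
import torch
import torch.nn as nn

def gabor_strf(size=11, rate=2.0, scale=0.5):
    """Gabor-style spectro-temporal kernel; `rate` (temporal modulation)
    and `scale` (spectral modulation) are illustrative parameters."""
    half = size // 2
    t, f = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                          torch.arange(-half, half + 1, dtype=torch.float32),
                          indexing="ij")
    envelope = torch.exp(-(t ** 2 + f ** 2) / (2.0 * half ** 2))
    carrier = torch.cos(rate * t + scale * f)
    return envelope * carrier

# Build a small bank of STRF kernels and install them as fixed weights
# of a first convolution layer over (time, frequency) spectrogram input.
bank = torch.stack([gabor_strf(rate=r, scale=s)
                    for r in (1.0, 2.0) for s in (0.25, 0.5)])  # (4, 11, 11)
conv = nn.Conv2d(1, 4, kernel_size=11, padding=5, bias=False)
with torch.no_grad():
    conv.weight.copy_(bank.unsqueeze(1))   # (out=4, in=1, 11, 11)
conv.weight.requires_grad_(False)          # keep the STRF kernels fixed

spec = torch.randn(1, 1, 100, 64)          # (B, 1, T, F) log-mel input
print(conv(spec).shape)                    # torch.Size([1, 4, 100, 64])
```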

System characteristics
Input mono
Classifier CRNN, STRFaugNet; CRNN, STRFaugNet, ensemble
Acoustic features log-mel energies
Decision making average
PDF

Sound Event Detection with Soft Labels Using Self-Attention Mechanisms for Global Scene Feature Extraction

Tri-Do Nhan1, Biyani Param2 and Yuxuan Zhang3
1Computing Sciences, University of Science, Vietnam National University, 2Computing Sciences, BITS Pilani Goa Campus, India, 3Computing Sciences, Nanyang Technological University, NTU Singapore

Abstract

This paper presents our approach to Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, which focuses on Sound Event Detection with Soft Labels. Our proposed method builds upon a CRNN backbone model and leverages data augmentation techniques to improve model robustness. Furthermore, we introduce self-attention mechanisms to capture global context information and enhance the model's ability to predict soft-label segments more accurately. Our experiments demonstrate that incorporating soft labels and self-attention mechanisms results in significant performance gains over traditional methods on data varying across different scenarios.
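
A minimal sketch of the self-attention idea: the frame sequence produced by a CRNN attends to itself, so each frame's prediction can draw on global scene context. The dimensions and single-block design below are assumptions, not the submission's exact configuration:

```python
import torch
import torch.nn as nn

class AttentiveCRNNHead(nn.Module):
    """Sketch: self-attention over the frame sequence produced by a CRNN,
    letting every frame's prediction use global scene context."""

    def __init__(self, feat_dim=256, n_heads=4, n_classes=11):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        ctx, _ = self.attn(frames, frames, frames)  # every frame attends to all
        frames = self.norm(frames + ctx)            # residual connection
        return self.head(frames)                    # frame-wise logits

head = AttentiveCRNNHead()
print(head(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 11])
```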

System characteristics
Input mono
Classifier Self Attention CRNN
Acoustic features mel-spectrogram
Data augmentation specaugment, wavaugment
PDF

Sound Event Detection by Aggregating Pre-Trained Embeddings From Different Layers

Xu Xuenan, Ma Ziyang, Yang Fei, Yang Guanrou, Wu Mengyue and Chen Xie
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract

This technical report is the system description of the X-Lance team submission to the DCASE 2023 Task 4b challenge: sound event detection with soft labels. Our submissions focus on incorporating informative audio representations from self-supervised learning. Embeddings from different layers of the pre-trained models are aggregated as the input to our model. Since the occurrence of sound events across scenes is imbalanced, we train our models for each scene using all the audio files. Finally, models of different architectures trained on different scenes are ensembled with learned weights.
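
The layer aggregation the abstract describes can be sketched as a learned softmax-weighted sum over the hidden states of all encoder layers, a common probing recipe for self-supervised audio models; the layer count and embedding width below are assumptions:

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Learned softmax-weighted sum over hidden states from all layers of
    a frozen pre-trained encoder; layer count is an assumption."""

    def __init__(self, n_layers=12):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, hidden_states):              # (n_layers, B, T, dim)
        w = self.layer_weights.softmax(dim=0)      # one weight per layer
        return (w[:, None, None, None] * hidden_states).sum(dim=0)

agg = LayerAggregator()
states = torch.randn(12, 2, 100, 768)   # hypothetical per-layer outputs
print(agg(states).shape)                # torch.Size([2, 100, 768])
```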

System characteristics
Input mono
Classifier CRNN
Acoustic features mel energies
Decision making average; weighted
PDF

How Information on Soft Labels and Hard Labels Mutually Benefits Sound Event Detection Tasks

Han Yin, Jisheng Bai, Siwei Huang and Jianfeng Chen
Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China

Abstract

This technical report describes our submission to DCASE 2023 Task 4 Subtrack B: sound event detection (SED) with soft labels. We propose different architectures to explore how soft and hard labels can jointly improve SED performance, and we use temporal mixup for data augmentation and k-fold cross-validation to address the problem of limited training data. Our systems are built upon the Convolutional Recurrent Neural Network (CRNN) proposed by the baseline and the Conformer structure. We conduct extensive ablation experiments to compare the advantages and disadvantages of different information fusion strategies.
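
As a hedged illustration of the mixup augmentation mentioned above, here is a generic mixup over clips and their frame-level soft labels; the report's "temporal mixup" variant may differ in detail, and `alpha` is an illustrative hyperparameter:

```python
import torch

def mixup(features, targets, alpha=0.2):
    """Generic mixup: convexly combine pairs of spectrograms and their
    frame-level soft labels with a Beta-sampled coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * targets + (1 - lam) * targets[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 100, 64)   # (batch, frames, mel bins)
y = torch.rand(8, 100, 11)    # frame-level soft labels
mx, my = mixup(x, y)
print(mx.shape, my.shape)
```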

System characteristics
Input mono
Classifier Conformer
Acoustic features log-mel energies
Data augmentation mixup
PDF

Sound Event Detection Based on Soft Label

Haiyue Zhang1, Liangxiao Zuo2, Jingxuan Chen2, Xichang Cai3 and Menglong Wu3
1Electronic science and technology, North China University of Technology, Beijing, China, 2Communication Engineering, North China University of Technology, Beijing, China, 3College of Information, North China University of Technology, Beijing, China

Abstract

This report focuses on an in-depth study of DCASE 2023 Task 4b. In contrast to previous tasks, this task provides a dataset with soft labels, aiming to explore how soft labels can improve the performance of the baseline system. The report employs two effective enhancement methods. First, to balance the dataset, the original dataset is expanded: given the significant differences in sound events within the dataset, an augmentation method is employed to generate additional samples and equalize the dataset. Second, to further enhance system performance, a model ensemble approach is utilized: by combining the predictions of multiple models, their individual strengths can be effectively exploited to improve overall performance.
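
The ensemble step can be sketched as prediction-level averaging of frame-wise class probabilities across models, optionally with weights. The shapes and the three-model example are illustrative, not the submission's exact setup:

```python
import torch

def ensemble_average(prob_list, weights=None):
    """Average the frame-wise class probabilities of several models
    (uniform weights unless explicit weights are given)."""
    probs = torch.stack(prob_list)                  # (n_models, B, T, C)
    if weights is None:
        return probs.mean(dim=0)
    w = torch.tensor(weights, dtype=probs.dtype)
    return (w[:, None, None, None] * probs).sum(dim=0) / w.sum()

# Hypothetical outputs of three models (e.g. CRNN, SK-RCRNN, RCRNN):
outs = [torch.rand(2, 100, 11) for _ in range(3)]
print(ensemble_average(outs).shape)                 # torch.Size([2, 100, 11])
```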

System characteristics
Input mono
Classifier CRNN; SK-RCRNN,CRNN; SK-RCRNN,CRNN,RCRNN
Acoustic features log-mel energies
Data augmentation mixup
Decision making average
PDF