Task description
This task evaluates systems for the detection of sound events trained on softly labeled data, possibly in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled audio. Systems must provide not only the event class but also the time boundaries of the multiple events present in real-life audio recordings. The main goal of the task is to investigate whether using soft labels improves performance.
A more detailed task description can be found on the task description page.
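As a rough illustration of what distinguishes soft labels from hard labels, the sketch below thresholds per-frame class probabilities; the variable names and values are illustrative, not taken from the challenge data:

```python
import numpy as np

# Hypothetical soft labels: per-frame probabilities for one class,
# e.g. aggregated from multiple annotators (values are illustrative).
soft_labels = np.array([0.0, 0.2, 0.6, 0.9, 0.7, 0.3, 0.0])

# Hard (strong) labels are obtained by thresholding, which discards the
# annotator-confidence information that soft labels retain.
hard_labels = (soft_labels >= 0.5).astype(int)  # [0, 0, 1, 1, 1, 0, 0]

# Training on soft labels typically means regressing the probabilities
# directly, e.g. with binary cross-entropy against the soft targets.
```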
Systems ranking
Metric abbreviations: F1_MO = macro-average F1 with per-class optimum threshold; ER_m = micro-average error rate; F1_m = micro-average F1; F1_M = macro-average F1.

| Submission label | Name | Technical Report | F1_MO (Eval) | ER_m (Eval) | F1_m (Eval) | F1_M (Eval) | F1_MO (Dev) | ER_m (Dev) | F1_m (Dev) | F1_M (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|
| Bai_JLESS_task4b_1 | Two_D+ | Yin2023 | 58.21 | 0.345 | 79.84 | 45.98 | 50.79 | 0.402 | 75.28 | 39.21 |
| Bai_JLESS_task4b_2 | Two_D+_en | Yin2023 | 59.77 | 0.325 | 80.84 | 40.29 | 54.30 | 0.387 | 77.21 | 42.05 |
| Bai_JLESS_task4b_3 | One_en | Yin2023 | 58.00 | 0.328 | 80.76 | 37.28 | 52.25 | 0.394 | 76.90 | 40.58 |
| Bai_JLESS_task4b_4 | All_SED_sys | Yin2023 | 60.74 | 0.320 | 81.01 | 37.33 | 56.16 | 0.360 | 78.63 | 42.45 |
| Cai_NCUT_task4b_1 | NCUT_1 | Zhang2023 | 43.60 | 0.367 | 77.86 | 35.71 | 43.50 | 0.439 | 74.84 | 39.57 |
| Cai_NCUT_task4b_2 | NCUT_2 | Zhang2023 | 43.58 | 0.376 | 77.96 | 36.45 | 44.49 | 0.443 | 73.38 | 35.60 |
| Cai_NCUT_task4b_3 | NCUT_3 | Zhang2023 | 42.14 | 0.346 | 79.01 | 33.36 | 44.47 | 0.432 | 73.89 | 34.86 |
| Liu_NJUPT_task4b_1 | NJUPT_1 | Liu2023 | 19.82 | 0.786 | 34.03 | 5.62 | 69.28 | 0.684 | 32.53 | 18.18 |
| Liu_NJUPT_task4b_2 | NJUPT_2 | Liu2023 | 20.83 | 0.886 | 22.27 | 4.51 | 72.13 | 0.724 | 25.53 | 11.28 |
| Liu_NJUPT_task4b_3 | NJUPT_3 | Liu2023 | 22.53 | 0.688 | 40.85 | 5.69 | 71.53 | 0.713 | 29.64 | 18.19 |
| Liu_NJUPT_task4b_4 | NJUPT_4 | Liu2023 | 22.46 | 0.739 | 37.56 | 5.82 | 74.15 | 0.751 | 26.89 | 19.30 |
| Liu_SRCN_task4b_1 | CRNN_t4b | Jin2023 | 44.69 | 0.419 | 75.95 | 31.03 | 43.98 | 0.500 | 71.04 | 34.73 |
| Liu_SRCN_task4b_2 | AST_t4b | Jin2023 | 52.03 | 0.320 | 80.89 | 31.74 | 49.70 | 0.430 | 72.90 | 28.80 |
| DCASE2023 baseline | Baseline_task4b | | 43.44 | 0.484 | 74.13 | 35.28 | 42.87 | 0.487 | 70.34 | 35.83 |
| Min_KAIST_task4b_1 | STRF_aug | Min2023 | 48.95 | 0.361 | 78.05 | 29.19 | 45.81 | 0.445 | 72.78 | 36.12 |
| Min_KAIST_task4b_2 | STRF_aug_e | Min2023 | 48.72 | 0.351 | 78.27 | 28.68 | 45.37 | 0.447 | 72.53 | 35.20 |
| Min_KAIST_task4b_3 | STRF_AST | Min2023 | 45.21 | 0.397 | 74.77 | 21.94 | 45.41 | 0.461 | 70.20 | 27.82 |
| Min_KAIST_task4b_4 | STRF_AST_e | Min2023 | 46.24 | 0.390 | 75.23 | 21.98 | 44.27 | 0.453 | 71.00 | 29.83 |
| Nhan_VNUHCMUS_task4b_1 | STTeam | Nhan2023 | 47.17 | 1.000 | nan | 0.00 | 46.71 | 0.450 | 72.43 | 37.32 |
| Xu_SJTU_task4b_1 | sjtu_baseline | Xuenan2023 | 46.13 | 0.371 | 78.05 | 32.29 | 55.79 | 0.386 | 78.15 | 42.96 |
| Xu_SJTU_task4b_2 | fc_beat | Xuenan2023 | 50.88 | 0.362 | 77.80 | 24.41 | 59.88 | 0.369 | 77.80 | 28.52 |
| Xu_SJTU_task4b_3 | scene_ens | Xuenan2023 | 51.13 | 0.329 | 80.80 | 35.58 | 69.85 | 0.246 | 86.13 | 57.91 |
| Xu_SJTU_task4b_4 | time_beat | Xuenan2023 | 46.99 | 0.396 | 75.04 | 24.87 | 57.25 | 0.354 | 78.95 | 37.29 |
Teams ranking
The table includes only the best-ranking system from each submitting team.
| Submission label | Name | Technical Report | F1_MO (Eval) | ER_m (Eval) | F1_m (Eval) | F1_M (Eval) | F1_MO (Dev) | ER_m (Dev) | F1_m (Dev) | F1_M (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|
| Bai_JLESS_task4b_4 | All_SED_sys | Yin2023 | 60.74 | 0.320 | 81.01 | 37.33 | 56.16 | 0.360 | 78.63 | 42.45 |
| Cai_NCUT_task4b_1 | NCUT_1 | Zhang2023 | 43.60 | 0.367 | 77.86 | 35.71 | 43.50 | 0.439 | 74.84 | 39.57 |
| Liu_NJUPT_task4b_1 | NJUPT_1 | Liu2023 | 24.24 | 0.991 | 2.71 | 0.75 | 63.43 | 0.193 | 72.91 | 59.76 |
| Liu_SRCN_task4b_2 | AST_t4b | Jin2023 | 52.03 | 0.320 | 80.89 | 31.74 | 49.70 | 0.430 | 72.90 | 28.80 |
| DCASE2023 baseline | Baseline_task4b | Martin2023 | 43.44 | 0.484 | 74.13 | 35.28 | 42.87 | 0.487 | 70.34 | 35.83 |
| Min_KAIST_task4b_1 | STRF_aug | Min2023 | 48.95 | 0.361 | 78.05 | 29.19 | 45.81 | 0.445 | 72.78 | 36.12 |
| Nhan_VNUHCMUS_task4b_1 | STTeam | Nhan2023 | 47.17 | 1.000 | nan | 0.00 | 46.71 | 0.450 | 72.43 | 37.32 |
| Xu_SJTU_task4b_3 | scene_ens | Xuenan2023 | 51.13 | 0.329 | 80.80 | 35.58 | 69.85 | 0.246 | 86.13 | 57.91 |
System characteristics
| Code | Technical Report | F1_MO (Eval) | Data augmentation | System | Features |
|---|---|---|---|---|---|
| Bai_JLESS_task4b_1 | Yin2023 | 58.21 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_2 | Yin2023 | 59.77 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_3 | Yin2023 | 58.00 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_4 | Yin2023 | 60.74 | mixup | Conformer | log-mel energies |
| Cai_NCUT_task4b_1 | Zhang2023 | 43.60 | mixup | CRNN | log-mel energies |
| Cai_NCUT_task4b_2 | Zhang2023 | 43.58 | mixup | SK-RCRNN, CRNN | log-mel energies |
| Cai_NCUT_task4b_3 | Zhang2023 | 42.14 | mixup | SK-RCRNN, CRNN, RCRNN | log-mel energies |
| Liu_NJUPT_task4b_1 | Liu2023 | 19.82 | | MViT | mel energies |
| Liu_NJUPT_task4b_2 | Liu2023 | 20.83 | | MViT | mel energies |
| Liu_NJUPT_task4b_3 | Liu2023 | 22.53 | | MViT | mel energies |
| Liu_NJUPT_task4b_4 | Liu2023 | 22.46 | | MViT | mel energies |
| Liu_SRCN_task4b_1 | Jin2023 | 44.69 | mixup | CRNN | spectrogram |
| Liu_SRCN_task4b_2 | Jin2023 | 52.03 | mixup | CRNN | spectrogram |
| DCASE2023 baseline | | 43.44 | | CRNN | mel energies |
| Min_KAIST_task4b_1 | Min2023 | 48.95 | | CRNN, STRFaugNet | log-mel energies |
| Min_KAIST_task4b_2 | Min2023 | 48.72 | | CRNN, STRFaugNet, ensemble | log-mel energies |
| Min_KAIST_task4b_3 | Min2023 | 45.21 | | CRNN, STRFaugNet | log-mel energies |
| Min_KAIST_task4b_4 | Min2023 | 46.24 | | CRNN, STRFaugNet, ensemble | log-mel energies |
| Nhan_VNUHCMUS_task4b_1 | Nhan2023 | 47.17 | specaugment, wavaugment | Self Attention CRNN | mel-spectrogram |
| Xu_SJTU_task4b_1 | Xuenan2023 | 46.13 | | CRNN | mel energies |
| Xu_SJTU_task4b_2 | Xuenan2023 | 50.88 | | CRNN | mel energies |
| Xu_SJTU_task4b_3 | Xuenan2023 | 51.13 | | CRNN | mel energies |
| Xu_SJTU_task4b_4 | Xuenan2023 | 46.99 | | CRNN | mel energies |
Technical reports
DCASE 2023 Challenge Task4 Technical Report
Yongbin Jin¹, Minjun Chen¹, Jun Shao¹, Yangyang Liu¹, Bo Peng¹ and Jie Chen²
¹Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China; ²Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China
Liu_SRCN_task4b_1 Liu_SRCN_task4b_2
Abstract
We describe our submitted systems for DCASE 2023 Task 4 in this technical report: Sound Event Detection with Weak Labels and Synthetic Soundscapes (Subtask A) and Sound Event Detection with Soft Labels (Subtask B). We focus on constructing a CRNN model that fuses embeddings extracted by the BEATs or AST pre-trained models, and we use frequency dynamic convolution (FDY-CRNN) and channel-wise selective kernel attention (SKA) to obtain adaptive receptive fields. To obtain multiple models of different architectures for an ensemble, we also fine-tune multiple BEATs models on the SED dataset. To make further use of the weakly labeled and unlabeled subsets of the DESED dataset, we pseudo-label these subsets through multiple iterations of self-training. We also use a small part of the audio files from the AudioSet dataset, and this data follows the same self-training procedure. We train these models under two settings, one optimizing the PSDS1 score and the other optimizing the PSDS2 score. Our proposed systems achieve polyphonic sound event detection scores (PSDS) of 0.570 (PSDS scenario 1) and 0.889 (PSDS scenario 2) on the development dataset of Subtask A, and a macro-average F1 score with optimum threshold per class (F1_MO) of 49.70 on the development dataset of Subtask B.
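As a rough sketch of the iterative self-training described above (not the authors' actual code; the model interface and confidence threshold are assumptions), one round of pseudo-labeling could look like this:

```python
import torch

def self_training_round(model, unlabeled_loader, threshold=0.5):
    """One pseudo-labeling pass: predict on unlabeled clips and keep
    confident frame-level predictions as training targets."""
    model.eval()
    pseudo = []
    with torch.no_grad():
        for feats in unlabeled_loader:            # (batch, frames, mels)
            probs = torch.sigmoid(model(feats))   # (batch, frames, classes)
            mask = probs.max(dim=-1).values > threshold  # confident frames
            pseudo.append((feats, (probs > threshold).float(), mask))
    # Pseudo-labeled data is merged with labeled data before the next
    # training round; repeating this gives the "multiple iterations".
    return pseudo
```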
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | spectrogram |
Data augmentation | mixup |
Sound Event Detection System Using a Modified MViT for DCASE 2023 Challenge Task 4b
Shutao Liu¹, Peihong Zhang², Fulin Yang², Chenyang Zhu¹, Shengchen Li³ and Xi Shao¹
¹Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China; ²Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China; ³Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China
Liu_NJUPT_task4b_1 Liu_NJUPT_task4b_2 Liu_NJUPT_task4b_3 Liu_NJUPT_task4b_4
Abstract
In this report, we describe our submissions for Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge: Sound Event Detection with Soft Labels. We use an MViT model based on frequency dynamic convolution. While preserving MViT's advantage of multi-scale feature extraction, frequency dynamic convolution counteracts the translation invariance that the MViT model inherits from image feature extraction, improving the model's ability to extract features along the frequency dimension. Without using any external datasets or pre-trained models, our system is trained only on the provided soft-label dataset; its final F1_m and F1_MO scores are 80.52 and 63.43, respectively, both higher than the baseline system.
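A minimal sketch of the frequency dynamic convolution idea mentioned above, assuming the common formulation in which per-frequency attention weights combine several basis kernels (this simplification is ours, not the authors' exact layer):

```python
import torch
import torch.nn as nn

class FreqDynamicConv2d(nn.Module):
    """Simplified frequency dynamic convolution: per-frequency attention
    over K basis kernels, so the effective kernel varies along frequency."""
    def __init__(self, in_ch, out_ch, kernel_size=3, n_basis=4):
        super().__init__()
        self.basis = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(n_basis))
        self.attn = nn.Conv1d(in_ch, n_basis, kernel_size=1)

    def forward(self, x):                  # x: (batch, ch, freq, time)
        ctx = x.mean(dim=-1)               # pool over time -> (batch, ch, freq)
        w = torch.softmax(self.attn(ctx), dim=1)       # (batch, K, freq)
        outs = torch.stack([conv(x) for conv in self.basis], dim=1)
        # weight each basis response per frequency bin, then combine
        return (outs * w[:, :, None, :, None]).sum(dim=1)
```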
System characteristics
Input | mono |
Classifier | MViT |
Acoustic features | mel energies |
Application of Spectro-Temporal Receptive Field for DCASE 2023 Challenge Task4 B
Deokki Min, Hyeonuk Nam and Park Yong-Hwa
Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Min_KAIST_task4b_1 Min_KAIST_task4b_2 Min_KAIST_task4b_3 Min_KAIST_task4b_4
Abstract
The Spectro-Temporal Receptive Field (STRF) is a linear function that describes the relationship between a sound stimulus and the neural response of the primary auditory cortex (A1). By convolving a sound spectrogram with an STRF, it is possible to predict the response of A1 cells. A1 is known to estimate the spectro-temporal modulation content of input sound, and STRFs are designed to capture both this spectral and temporal modulation information. In this work, we use STRFs as CNN kernels and construct a two-branch deep learning model. One branch, named STRFNet, uses STRFs as the kernels of its first CNN layer and extracts neuroscience-inspired spectro-temporal modulation information. The other branch is a CRNN with deeper layers than the baseline, which extracts general time-frequency information from the input spectrogram. The two-branch model is named STRFaugNet, and it outperforms the baseline by 6.9%.
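STRFs are commonly modeled as Gabor-like functions over the time-frequency plane; the sketch below generates such a kernel for use as a fixed first-layer CNN filter. The parameterization is an assumption for illustration, not the authors' exact design:

```python
import numpy as np

def strf_kernel(n_freq=11, n_time=11, spec_mod=0.25, temp_mod=0.25, phase=0.0):
    """Gabor-like STRF kernel: a 2-D sinusoid (spectro-temporal modulation)
    under a Gaussian envelope. Modulation rates are in cycles per bin
    (illustrative parameterization)."""
    f = np.arange(n_freq) - n_freq // 2
    t = np.arange(n_time) - n_time // 2
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-(F**2 + T**2) / (2 * (n_freq / 4) ** 2))
    carrier = np.cos(2 * np.pi * (spec_mod * F + temp_mod * T) + phase)
    return envelope * carrier  # usable as a fixed first-layer CNN kernel
```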
System characteristics
Input | mono |
Classifier | CRNN, STRFaugNet; CRNN, STRFaugNet, ensemble |
Acoustic features | log-mel energies |
Decision making | average |
Sound Event Detection with Soft Labels Using Self-Attention Mechanisms for Global Scene Feature Extraction
Tri-Do Nhan¹, Biyani Param² and Yuxuan Zhang³
¹Computing Sciences, University of Science, Vietnam National University; ²Computing Sciences, BITS Pilani Goa Campus, India; ³Computing Sciences, Nanyang Technological University, Singapore
Nhan_VNUHCMUS_task4b_1
Abstract
This paper presents our approach to Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, which focuses on Sound Event Detection with Soft Labels. Our proposed method builds upon a CRNN backbone model and leverages data augmentation techniques to improve model robustness. Furthermore, we introduce self-attention mechanisms to capture global context information and enhance the model's ability to predict soft-label segments more accurately. Our experiments demonstrate that incorporating soft labels and self-attention mechanisms results in significant performance gains compared to traditional methods on data varying across different scenarios.
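A minimal sketch of adding self-attention on top of a CRNN for global scene context, with layer sizes and the module interface assumed rather than taken from the report:

```python
import torch
import torch.nn as nn

class GlobalContextSED(nn.Module):
    """Illustrative sketch: multi-head self-attention over frame features
    from a CNN front end, giving each frame access to global scene context
    before per-frame event classification (architecture details assumed)."""
    def __init__(self, n_feats=128, n_classes=10, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_feats, n_heads, batch_first=True)
        self.gru = nn.GRU(n_feats, n_feats, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_feats, n_classes)

    def forward(self, frames):              # frames: (batch, time, n_feats)
        ctx, _ = self.attn(frames, frames, frames)  # global context per frame
        out, _ = self.gru(frames + ctx)             # recurrent modelling
        return torch.sigmoid(self.head(out))        # per-frame soft scores
```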
System characteristics
Input | mono |
Classifier | Self Attention CRNN |
Acoustic features | mel-spectrogram |
Data augmentation | specaugment, wavaugment |
Sound Event Detection by Aggregating Pre-Trained Embeddings From Different Layers
Xu Xuenan, Ma Ziyang, Yang Fei, Yang Guanrou, Wu Mengyue and Chen Xie
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Xu_SJTU_task4b_1 Xu_SJTU_task4b_2 Xu_SJTU_task4b_3 Xu_SJTU_task4b_4
Abstract
This technical report is the system description of the X-Lance team's submission to the DCASE 2023 Task 4b challenge: sound event detection with soft labels. Our submissions focus on incorporating informative audio representations from self-supervised learning. Embeddings from different layers of the pre-trained models are aggregated to form the input of our model. Since the occurrence of sound events in different scenes is imbalanced, we train our models for each scene using all the audio files. Finally, models of different architectures trained on different scenes are ensembled with learned weights.
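The layer aggregation described above can be sketched as a learned softmax-weighted sum over the hidden states of a frozen pre-trained model; the shapes and names below are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Sketch: aggregate embeddings from several layers of a frozen
    pre-trained model with learned scalar weights (names hypothetical)."""
    def __init__(self, n_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_embeddings):
        # layer_embeddings: (n_layers, batch, time, dim), e.g. hidden
        # states from every block of a self-supervised audio model.
        w = torch.softmax(self.logits, dim=0)
        return (w[:, None, None, None] * layer_embeddings).sum(dim=0)
```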
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | mel energies |
Decision making | average; weighted |
How Information on Soft Labels and Hard Labels Mutually Benefits Sound Event Detection Tasks
Han Yin, Jisheng Bai, Siwei Huang and Jianfeng Chen
Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China
Bai_JLESS_task4b_1 Bai_JLESS_task4b_2 Bai_JLESS_task4b_3 Bai_JLESS_task4b_4
Abstract
This technical report describes our submission to DCASE 2023 Task 4 Subtrack B: sound event detection (SED) with soft labels. We propose different architectures to explore how soft and hard labels can jointly improve SED performance. We use temporal mixup for data augmentation and k-fold cross-validation to mitigate the scarcity of training data. Our systems are built upon the Convolutional Recurrent Neural Network (CRNN) proposed by the baseline and the Conformer structure. We conduct extensive ablation experiments to compare the advantages and disadvantages of different information fusion strategies.
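For reference, standard mixup adapted to frame-level SED targets looks like the sketch below; the report's temporal mixup variant may differ in detail:

```python
import numpy as np

def mixup(features, labels, alpha=0.2):
    """Mix two spectrograms and their frame-level (soft) labels with the
    same coefficient, sampled from a Beta distribution (generic sketch)."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(features))
    mixed_x = lam * features + (1 - lam) * features[idx]
    mixed_y = lam * labels + (1 - lam) * labels[idx]
    return mixed_x, mixed_y
```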
System characteristics
Input | mono |
Classifier | Conformer |
Acoustic features | log-mel energies |
Data augmentation | mixup |
Sound Event Detection Based on Soft Label
Haiyue Zhang¹, Liangxiao Zuo², Jingxuan Chen², Xichang Cai³ and Menglong Wu³
¹Electronic Science and Technology, North China University of Technology, Beijing, China; ²Communication Engineering, North China University of Technology, Beijing, China; ³College of Information, North China University of Technology, Beijing, China
Cai_NCUT_task4b_1 Cai_NCUT_task4b_2 Cai_NCUT_task4b_3
Abstract
This report presents an in-depth study of DCASE 2023 Task 4b. In contrast to previous tasks, this task provides a dataset with soft labels, aiming to explore how soft labels can improve the performance of the baseline system. We employ two main enhancement methods. First, to balance the dataset, we expand the original data: given the significant differences between sound events in the dataset, an augmentation method generates additional samples to equalize the class distribution. Second, to further enhance system performance, we use a model ensemble; combining the predictions of multiple models exploits their individual strengths to improve overall performance.
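The ensemble step can be illustrated by averaging frame-level probabilities across models, as in the sketch below (the actual combination scheme in the report may differ):

```python
import numpy as np

def ensemble_predictions(prob_list, weights=None):
    """Average frame-level class probabilities from several models;
    a plain mean realizes the ensembling described above (the weighting
    scheme is an assumption, not taken from the report)."""
    probs = np.stack(prob_list)                  # (models, frames, classes)
    if weights is None:
        return probs.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (w[:, None, None] * probs).sum(axis=0) / w.sum()
```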
System characteristics
Input | mono |
Classifier | CRNN; SK-RCRNN,CRNN; SK-RCRNN,CRNN,RCRNN |
Acoustic features | log-mel energies |
Data augmentation | mixup |
Decision making | average |