Task description
This task evaluates systems for the detection of sound events trained on softly labeled data, possibly in addition to other types of data such as weakly labeled, unlabeled, or strongly labeled audio. Systems must provide not only the event class but also the time boundaries of the multiple events present in real-life audio recordings. The main goal of the task is to investigate whether using soft labels improves performance.
A more detailed task description can be found on the task description page.
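As a rough illustration of what distinguishes soft labels from hard labels, the sketch below thresholds per-frame class probabilities; the variable names and values are illustrative, not taken from the challenge data:

```python
import numpy as np

# Hypothetical soft labels: per-frame probabilities for one class,
# e.g. aggregated from multiple annotators (values are illustrative).
soft_labels = np.array([0.0, 0.2, 0.6, 0.9, 0.7, 0.3, 0.0])

# Hard (strong) labels are obtained by thresholding, which discards the
# annotator-confidence information that soft labels retain.
hard_labels = (soft_labels >= 0.5).astype(int)  # [0, 0, 1, 1, 1, 0, 0]

# Training on soft labels typically means regressing the probabilities
# directly, e.g. with binary cross-entropy against the soft targets.
```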
Systems ranking
Metric abbreviations: F1_MO = macro-average F1 with per-class optimum threshold; ER_m = micro-average error rate; F1_m = micro-average F1; F1_M = macro-average F1.

| Submission label | Name | Technical Report | F1_MO (Eval) | ER_m (Eval) | F1_m (Eval) | F1_M (Eval) | F1_MO (Dev) | ER_m (Dev) | F1_m (Dev) | F1_M (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|
| Bai_JLESS_task4b_1 | Two_D+ | Yin2023 | 58.21 | 0.345 | 79.84 | 45.98 | 50.79 | 0.402 | 75.28 | 39.21 |
| Bai_JLESS_task4b_2 | Two_D+_en | Yin2023 | 59.77 | 0.325 | 80.84 | 40.29 | 54.30 | 0.387 | 77.21 | 42.05 |
| Bai_JLESS_task4b_3 | One_en | Yin2023 | 58.00 | 0.328 | 80.76 | 37.28 | 52.25 | 0.394 | 76.90 | 40.58 |
| Bai_JLESS_task4b_4 | All_SED_sys | Yin2023 | 60.74 | 0.320 | 81.01 | 37.33 | 56.16 | 0.360 | 78.63 | 42.45 |
| Cai_NCUT_task4b_1 | NCUT_1 | Zhang2023 | 43.60 | 0.367 | 77.86 | 35.71 | 43.50 | 0.439 | 74.84 | 39.57 |
| Cai_NCUT_task4b_2 | NCUT_2 | Zhang2023 | 43.58 | 0.376 | 77.96 | 36.45 | 44.49 | 0.443 | 73.38 | 35.60 |
| Cai_NCUT_task4b_3 | NCUT_3 | Zhang2023 | 42.14 | 0.346 | 79.01 | 33.36 | 44.47 | 0.432 | 73.89 | 34.86 |
| Liu_NJUPT_task4b_1 | NJUPT_1 | Liu2023 | 19.82 | 0.786 | 34.03 | 5.62 | 69.28 | 0.684 | 32.53 | 18.18 |
| Liu_NJUPT_task4b_2 | NJUPT_2 | Liu2023 | 20.83 | 0.886 | 22.27 | 4.51 | 72.13 | 0.724 | 25.53 | 11.28 |
| Liu_NJUPT_task4b_3 | NJUPT_3 | Liu2023 | 22.53 | 0.688 | 40.85 | 5.69 | 71.53 | 0.713 | 29.64 | 18.19 |
| Liu_NJUPT_task4b_4 | NJUPT_4 | Liu2023 | 22.46 | 0.739 | 37.56 | 5.82 | 74.15 | 0.751 | 26.89 | 19.30 |
| Liu_SRCN_task4b_1 | CRNN_t4b | Jin2023 | 44.69 | 0.419 | 75.95 | 31.03 | 43.98 | 0.500 | 71.04 | 34.73 |
| Liu_SRCN_task4b_2 | AST_t4b | Jin2023 | 52.03 | 0.320 | 80.89 | 31.74 | 49.70 | 0.430 | 72.90 | 28.80 |
| DCASE2023 baseline | Baseline_task4b | | 43.44 | 0.484 | 74.13 | 35.28 | 42.87 | 0.487 | 70.34 | 35.83 |
| Min_KAIST_task4b_1 | STRF_aug | Min2023 | 48.95 | 0.361 | 78.05 | 29.19 | 45.81 | 0.445 | 72.78 | 36.12 |
| Min_KAIST_task4b_2 | STRF_aug_e | Min2023 | 48.72 | 0.351 | 78.27 | 28.68 | 45.37 | 0.447 | 72.53 | 35.20 |
| Min_KAIST_task4b_3 | STRF_AST | Min2023 | 45.21 | 0.397 | 74.77 | 21.94 | 45.41 | 0.461 | 70.20 | 27.82 |
| Min_KAIST_task4b_4 | STRF_AST_e | Min2023 | 46.24 | 0.390 | 75.23 | 21.98 | 44.27 | 0.453 | 71.00 | 29.83 |
| Nhan_VNUHCMUS_task4b_1 | STTeam | Nhan2023 | 47.17 | 1.000 | nan | 0.00 | 46.71 | 0.450 | 72.43 | 37.32 |
| Xu_SJTU_task4b_1 | sjtu_baseline | Xuenan2023 | 46.13 | 0.371 | 78.05 | 32.29 | 55.79 | 0.386 | 78.15 | 42.96 |
| Xu_SJTU_task4b_2 | fc_beat | Xuenan2023 | 50.88 | 0.362 | 77.80 | 24.41 | 59.88 | 0.369 | 77.80 | 28.52 |
| Xu_SJTU_task4b_3 | scene_ens | Xuenan2023 | 51.13 | 0.329 | 80.80 | 35.58 | 69.85 | 0.246 | 86.13 | 57.91 |
| Xu_SJTU_task4b_4 | time_beat | Xuenan2023 | 46.99 | 0.396 | 75.04 | 24.87 | 57.25 | 0.354 | 78.95 | 37.29 |
Teams ranking
The table includes only the best-ranking system from each submitting team.
| Submission label | Name | Technical Report | F1_MO (Eval) | ER_m (Eval) | F1_m (Eval) | F1_M (Eval) | F1_MO (Dev) | ER_m (Dev) | F1_m (Dev) | F1_M (Dev) |
|---|---|---|---|---|---|---|---|---|---|---|
| Bai_JLESS_task4b_4 | All_SED_sys | Yin2023 | 60.74 | 0.320 | 81.01 | 37.33 | 56.16 | 0.360 | 78.63 | 42.45 |
| Cai_NCUT_task4b_1 | NCUT_1 | Zhang2023 | 43.60 | 0.367 | 77.86 | 35.71 | 43.50 | 0.439 | 74.84 | 39.57 |
| Liu_NJUPT_task4b_1 | NJUPT_1 | Liu2023 | 24.24 | 0.991 | 2.71 | 0.75 | 63.43 | 0.193 | 72.91 | 59.76 |
| Liu_SRCN_task4b_2 | AST_t4b | Jin2023 | 52.03 | 0.320 | 80.89 | 31.74 | 49.70 | 0.430 | 72.90 | 28.80 |
| DCASE2023 baseline | Baseline_task4b | Martin2023 | 43.44 | 0.484 | 74.13 | 35.28 | 42.87 | 0.487 | 70.34 | 35.83 |
| Min_KAIST_task4b_1 | STRF_aug | Min2023 | 48.95 | 0.361 | 78.05 | 29.19 | 45.81 | 0.445 | 72.78 | 36.12 |
| Nhan_VNUHCMUS_task4b_1 | STTeam | Nhan2023 | 47.17 | 1.000 | nan | 0.00 | 46.71 | 0.450 | 72.43 | 37.32 |
| Xu_SJTU_task4b_3 | scene_ens | Xuenan2023 | 51.13 | 0.329 | 80.80 | 35.58 | 69.85 | 0.246 | 86.13 | 57.91 |
System characteristics
| Code | Technical Report | F1_MO (Eval) | Data augmentation | System | Features |
|---|---|---|---|---|---|
| Bai_JLESS_task4b_1 | Yin2023 | 58.21 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_2 | Yin2023 | 59.77 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_3 | Yin2023 | 58.00 | mixup | Conformer | log-mel energies |
| Bai_JLESS_task4b_4 | Yin2023 | 60.74 | mixup | Conformer | log-mel energies |
| Cai_NCUT_task4b_1 | Zhang2023 | 43.60 | mixup | CRNN | log-mel energies |
| Cai_NCUT_task4b_2 | Zhang2023 | 43.58 | mixup | SK-RCRNN, CRNN | log-mel energies |
| Cai_NCUT_task4b_3 | Zhang2023 | 42.14 | mixup | SK-RCRNN, CRNN, RCRNN | log-mel energies |
| Liu_NJUPT_task4b_1 | Liu2023 | 19.82 | | MViT | mel energies |
| Liu_NJUPT_task4b_2 | Liu2023 | 20.83 | | MViT | mel energies |
| Liu_NJUPT_task4b_3 | Liu2023 | 22.53 | | MViT | mel energies |
| Liu_NJUPT_task4b_4 | Liu2023 | 22.46 | | MViT | mel energies |
| Liu_SRCN_task4b_1 | Jin2023 | 44.69 | mixup | CRNN | spectrogram |
| Liu_SRCN_task4b_2 | Jin2023 | 52.03 | mixup | CRNN | spectrogram |
| DCASE2023 baseline | | 43.44 | | CRNN | mel energies |
| Min_KAIST_task4b_1 | Min2023 | 48.95 | | CRNN, STRFaugNet | log-mel energies |
| Min_KAIST_task4b_2 | Min2023 | 48.72 | | CRNN, STRFaugNet, ensemble | log-mel energies |
| Min_KAIST_task4b_3 | Min2023 | 45.21 | | CRNN, STRFaugNet | log-mel energies |
| Min_KAIST_task4b_4 | Min2023 | 46.24 | | CRNN, STRFaugNet, ensemble | log-mel energies |
| Nhan_VNUHCMUS_task4b_1 | Nhan2023 | 47.17 | specaugment, wavaugment | Self Attention CRNN | mel-spectrogram |
| Xu_SJTU_task4b_1 | Xuenan2023 | 46.13 | | CRNN | mel energies |
| Xu_SJTU_task4b_2 | Xuenan2023 | 50.88 | | CRNN | mel energies |
| Xu_SJTU_task4b_3 | Xuenan2023 | 51.13 | | CRNN | mel energies |
| Xu_SJTU_task4b_4 | Xuenan2023 | 46.99 | | CRNN | mel energies |
Technical reports
DCASE 2023 Challenge Task4 Technical Report
Yongbin Jin¹, Minjun Chen¹, Jun Shao¹, Yangyang Liu¹, Bo Peng¹ and Jie Chen²
¹Intelligence Service Lab, Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China; ²Intelligence SW Team, Samsung Research China-Nanjing, Nanjing, China
Liu_SRCN_task4b_1 Liu_SRCN_task4b_2
Abstract
We describe our submitted systems for DCASE 2023 Task 4 in this technical report: Sound Event Detection with Weak Labels and Synthetic Soundscapes (Subtask A) and Sound Event Detection with Soft Labels (Subtask B). We focus on constructing a CRNN model that fuses embeddings extracted by the BEATs or AST pre-trained models, and we use frequency dynamic convolution (FDY-CRNN) and channel-wise selective kernel attention (SKA) to obtain adaptive receptive fields. To obtain multiple models of different architectures for an ensemble, we also fine-tune multiple BEATs models on the SED dataset. To make further use of the weakly labeled and unlabeled subsets of the DESED dataset, we pseudo-label these subsets through multiple iterations of self-training. We also use a small part of the audio files from the AudioSet dataset, and this data follows the same self-training procedure. We train these models under two settings, one optimizing the PSDS1 score and the other optimizing the PSDS2 score. Our proposed systems achieve polyphonic sound event detection scores (PSDS) of 0.570 (PSDS scenario 1) and 0.889 (PSDS scenario 2) on the development dataset of Subtask A, and a macro-average F1 score with optimum threshold per class (F1_MO) of 49.70 on the development dataset of Subtask B.
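As a rough sketch of the iterative self-training described above (not the authors' actual code; the model interface and confidence threshold are assumptions), one round of pseudo-labeling could look like this:

```python
import torch

def self_training_round(model, unlabeled_loader, threshold=0.5):
    """One pseudo-labeling pass: predict on unlabeled clips and keep
    confident frame-level predictions as training targets."""
    model.eval()
    pseudo = []
    with torch.no_grad():
        for feats in unlabeled_loader:            # (batch, frames, mels)
            probs = torch.sigmoid(model(feats))   # (batch, frames, classes)
            mask = probs.max(dim=-1).values > threshold  # confident frames
            pseudo.append((feats, (probs > threshold).float(), mask))
    # Pseudo-labeled data is merged with labeled data before the next
    # training round; repeating this gives the "multiple iterations".
    return pseudo
```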
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | spectrogram |
Data augmentation | mixup |
Sound Event Detection System Using a Modified MViT for DCASE 2023 Challenge Task 4b
Shutao Liu¹, Peihong Zhang², Fulin Yang², Chenyang Zhu¹, Shengchen Li³ and Xi Shao¹
¹Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu, P.R. China; ²Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China; ³Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R. China
Liu_NJUPT_task4b_1 Liu_NJUPT_task4b_2 Liu_NJUPT_task4b_3 Liu_NJUPT_task4b_4
Abstract
In this report, we describe our submissions for Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge: Sound Event Detection with Soft Labels. We use an MViT model based on frequency dynamic convolution. While preserving MViT's advantage of multi-scale feature extraction, frequency dynamic convolution counteracts the translation invariance that the MViT model inherits from image feature extraction, improving the model's ability to extract features along the frequency dimension. Without using any external datasets or pre-trained models, our system is trained only on the provided soft-label dataset; its final F1_m and F1_MO scores are 80.52 and 63.43, respectively, both higher than the baseline system.
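A minimal sketch of the frequency dynamic convolution idea mentioned above, assuming the common formulation in which per-frequency attention weights combine several basis kernels (this simplification is ours, not the authors' exact layer):

```python
import torch
import torch.nn as nn

class FreqDynamicConv2d(nn.Module):
    """Simplified frequency dynamic convolution: per-frequency attention
    over K basis kernels, so the effective kernel varies along frequency."""
    def __init__(self, in_ch, out_ch, kernel_size=3, n_basis=4):
        super().__init__()
        self.basis = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(n_basis))
        self.attn = nn.Conv1d(in_ch, n_basis, kernel_size=1)

    def forward(self, x):                  # x: (batch, ch, freq, time)
        ctx = x.mean(dim=-1)               # pool over time -> (batch, ch, freq)
        w = torch.softmax(self.attn(ctx), dim=1)       # (batch, K, freq)
        outs = torch.stack([conv(x) for conv in self.basis], dim=1)
        # weight each basis response per frequency bin, then combine
        return (outs * w[:, :, None, :, None]).sum(dim=1)
```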
System characteristics
Input | mono |
Classifier | MViT |
Acoustic features | mel energies |
Application of Spectro-Temporal Receptive Field for DCASE 2023 Challenge Task4 B
Deokki Min, Hyeonuk Nam and Park Yong-Hwa
Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Min_KAIST_task4b_1 Min_KAIST_task4b_2 Min_KAIST_task4b_3 Min_KAIST_task4b_4
Abstract
The Spectro-Temporal Receptive Field (STRF) is a linear function that describes the relationship between a sound stimulus and the neural response of the primary auditory cortex (A1). By convolving a sound spectrogram with an STRF, it is possible to predict the response of A1 cells. A1 is known to estimate the spectro-temporal modulation content of input sound, and STRFs are designed to capture both this spectral and temporal modulation information. In this work, we use STRFs as CNN kernels and construct a two-branch deep learning model. One branch, named STRFNet, uses STRFs as the kernels of its first CNN layer and extracts neuroscience-inspired spectro-temporal modulation information. The other branch is a CRNN with deeper layers than the baseline, which extracts general time-frequency information from the input spectrogram. The two-branch model is named STRFaugNet, and it outperforms the baseline by 6.9%.
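STRFs are commonly modeled as Gabor-like functions over the time-frequency plane; the sketch below generates such a kernel for use as a fixed first-layer CNN filter. The parameterization is an assumption for illustration, not the authors' exact design:

```python
import numpy as np

def strf_kernel(n_freq=11, n_time=11, spec_mod=0.25, temp_mod=0.25, phase=0.0):
    """Gabor-like STRF kernel: a 2-D sinusoid (spectro-temporal modulation)
    under a Gaussian envelope. Modulation rates are in cycles per bin
    (illustrative parameterization)."""
    f = np.arange(n_freq) - n_freq // 2
    t = np.arange(n_time) - n_time // 2
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-(F**2 + T**2) / (2 * (n_freq / 4) ** 2))
    carrier = np.cos(2 * np.pi * (spec_mod * F + temp_mod * T) + phase)
    return envelope * carrier  # usable as a fixed first-layer CNN kernel
```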
System characteristics
Input | mono |
Classifier | CRNN, STRFaugNet; CRNN, STRFaugNet, ensemble |
Acoustic features | log-mel energies |
Decision making | average |
Sound Event Detection with Soft Labels Using Self-Attention Mechanisms for Global Scene Feature Extraction
Tri-Do Nhan¹, Biyani Param² and Yuxuan Zhang³
¹Computing Sciences, University of Science, Vietnam National University; ²Computing Sciences, BITS Pilani Goa Campus, India; ³Computing Sciences, Nanyang Technological University, Singapore
Nhan_VNUHCMUS_task4b_1
Abstract
This paper presents our approach to Task 4b of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, which focuses on Sound Event Detection with Soft Labels. Our proposed method builds upon a CRNN backbone model and leverages data augmentation techniques to improve model robustness. Furthermore, we introduce self-attention mechanisms to capture global context information and enhance the model's ability to predict soft-label segments more accurately. Our experiments demonstrate that incorporating soft labels and self-attention mechanisms results in significant performance gains compared to traditional methods on data varying across different scenarios.
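A minimal sketch of adding self-attention on top of a CRNN for global scene context, with layer sizes and the module interface assumed rather than taken from the report:

```python
import torch
import torch.nn as nn

class GlobalContextSED(nn.Module):
    """Illustrative sketch: multi-head self-attention over frame features
    from a CNN front end, giving each frame access to global scene context
    before per-frame event classification (architecture details assumed)."""
    def __init__(self, n_feats=128, n_classes=10, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_feats, n_heads, batch_first=True)
        self.gru = nn.GRU(n_feats, n_feats, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_feats, n_classes)

    def forward(self, frames):              # frames: (batch, time, n_feats)
        ctx, _ = self.attn(frames, frames, frames)  # global context per frame
        out, _ = self.gru(frames + ctx)             # recurrent modelling
        return torch.sigmoid(self.head(out))        # per-frame soft scores
```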
System characteristics
Input | mono |
Classifier | Self Attention CRNN |
Acoustic features | mel-spectrogram |
Data augmentation | specaugment, wavaugment |
Sound Event Detection by Aggregating Pre-Trained Embeddings From Different Layers
Xu Xuenan, Ma Ziyang, Yang Fei, Yang Guanrou, Wu Mengyue and Chen Xie
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Xu_SJTU_task4b_1 Xu_SJTU_task4b_2 Xu_SJTU_task4b_3 Xu_SJTU_task4b_4
Abstract
This technical report is the system description of the X-Lance team's submission to the DCASE 2023 Task 4b challenge: sound event detection with soft labels. Our submissions focus on incorporating informative audio representations from self-supervised learning. Embeddings from different layers of the pre-trained models are aggregated to form the input of our model. Since the occurrence of sound events in different scenes is imbalanced, we train our models for each scene using all the audio files. Finally, models of different architectures trained on different scenes are ensembled with learned weights.
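The layer aggregation described above can be sketched as a learned softmax-weighted sum over the hidden states of a frozen pre-trained model; the shapes and names below are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Sketch: aggregate embeddings from several layers of a frozen
    pre-trained model with learned scalar weights (names hypothetical)."""
    def __init__(self, n_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_embeddings):
        # layer_embeddings: (n_layers, batch, time, dim), e.g. hidden
        # states from every block of a self-supervised audio model.
        w = torch.softmax(self.logits, dim=0)
        return (w[:, None, None, None] * layer_embeddings).sum(dim=0)
```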
System characteristics
Input | mono |
Classifier | CRNN |
Acoustic features | mel energies |
Decision making | average; weighted |
How Information on Soft Labels and Hard Labels Mutually Benefits Sound Event Detection Tasks
Han Yin, Jisheng Bai, Siwei Huang and Jianfeng Chen
Northwestern Polytechnical University, Marine Science and Technology, Joint Laboratory of Environmental Sound Sensing, Xi'an, China
Bai_JLESS_task4b_1 Bai_JLESS_task4b_2 Bai_JLESS_task4b_3 Bai_JLESS_task4b_4
Abstract
This technical report describes our submission to DCASE 2023 Task 4 Subtrack B: sound event detection (SED) with soft labels. We propose different architectures to explore how soft and hard labels can jointly improve SED performance. We use temporal mixup for data augmentation and k-fold cross-validation to mitigate the scarcity of training data. Our systems are built upon the Convolutional Recurrent Neural Network (CRNN) proposed by the baseline and the Conformer structure. We conduct extensive ablation experiments to compare the advantages and disadvantages of different information fusion strategies.
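For reference, standard mixup adapted to frame-level SED targets looks like the sketch below; the report's temporal mixup variant may differ in detail:

```python
import numpy as np

def mixup(features, labels, alpha=0.2):
    """Mix two spectrograms and their frame-level (soft) labels with the
    same coefficient, sampled from a Beta distribution (generic sketch)."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(features))
    mixed_x = lam * features + (1 - lam) * features[idx]
    mixed_y = lam * labels + (1 - lam) * labels[idx]
    return mixed_x, mixed_y
```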
System characteristics
Input | mono |
Classifier | Conformer |
Acoustic features | log-mel energies |
Data augmentation | mixup |
Sound Event Detection Based on Soft Label
Haiyue Zhang¹, Liangxiao Zuo², Jingxuan Chen², Xichang Cai³ and Menglong Wu³
¹Electronic Science and Technology, North China University of Technology, Beijing, China; ²Communication Engineering, North China University of Technology, Beijing, China; ³College of Information, North China University of Technology, Beijing, China
Cai_NCUT_task4b_1 Cai_NCUT_task4b_2 Cai_NCUT_task4b_3
Abstract
This report presents an in-depth study of DCASE 2023 Task 4b. In contrast to previous tasks, this task provides a dataset with soft labels, aiming to explore how soft labels can improve the performance of the baseline system. We employ two main enhancement methods. First, to balance the dataset, we expand the original data: given the significant differences between sound events in the dataset, an augmentation method generates additional samples to equalize the class distribution. Second, to further enhance system performance, we use a model ensemble; combining the predictions of multiple models exploits their individual strengths to improve overall performance.
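The ensemble step can be illustrated by averaging frame-level probabilities across models, as in the sketch below (the actual combination scheme in the report may differ):

```python
import numpy as np

def ensemble_predictions(prob_list, weights=None):
    """Average frame-level class probabilities from several models;
    a plain mean realizes the ensembling described above (the weighting
    scheme is an assumption, not taken from the report)."""
    probs = np.stack(prob_list)                  # (models, frames, classes)
    if weights is None:
        return probs.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (w[:, None, None] * probs).sum(axis=0) / w.sum()
```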
System characteristics
Input | mono |
Classifier | CRNN; SK-RCRNN,CRNN; SK-RCRNN,CRNN,RCRNN |
Acoustic features | log-mel energies |
Data augmentation | mixup |
Decision making | average |