13:10 | Keynote I | Chair: Nobutaka Ono
|
15:50 | Oral Session I | Chair: Irene Martin Morato
Acoustic Scene Classification Across Multiple Devices Through Incremental Learning of Device-Specific Domains
Manjunath Mulimani (Tampere University), Annamaria Mesaros (Tampere University)
|
Abstract
In this paper, we propose using a domain-incremental learning approach for coping with different devices in acoustic scene classification. While the typical way to handle mismatched training data is through domain adaptation or specific regularization techniques, incremental learning offers a different approach. With this technique, it is possible to learn the characteristics of new devices on-the-go, adding to a previously trained model. This also means that new device data can be introduced at any time, without a need to retrain the original model. In terms of incremental learning, we propose a combination of domain-specific Low-Rank Adaptation (LoRA) parameters and running statistics of Batch Normalization (BN) layers. LoRA adds low-rank decomposition matrices to a convolutional layer with a few trainable parameters for each new device, while domain-specific BN is used to boost performance. Experiments are conducted on the TAU Urban Acoustic Scenes 2020 Mobile development dataset, containing 9 different devices; we train the system using the 40h of data available for the main device, and incrementally learn the domains of the other 8 devices based on 3h of data available for each. We show that the proposed approach outperforms other fine-tuning-based methods, and is outperformed only by joint learning with all data from all devices.
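The combination described above can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration (the class names, rank value, and wiring are my assumptions, not the authors' code) of attaching low-rank adapters and per-device BatchNorm statistics to a frozen convolutional backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAConv2d(nn.Module):
    """Frozen base convolution plus a trainable low-rank residual B @ A."""
    def __init__(self, base_conv: nn.Conv2d, rank: int = 4):
        super().__init__()
        self.base = base_conv
        for p in self.base.parameters():
            p.requires_grad = False          # the original model stays fixed
        out_c, in_c, kh, kw = base_conv.weight.shape
        # Only these low-rank factors are trained for each new device.
        self.A = nn.Parameter(torch.randn(rank, in_c * kh * kw) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_c, rank))

    def forward(self, x):
        delta_w = (self.B @ self.A).view_as(self.base.weight)
        return F.conv2d(x, self.base.weight + delta_w, self.base.bias,
                        stride=self.base.stride, padding=self.base.padding)

class DomainBN(nn.Module):
    """Separate BatchNorm running statistics for each device domain."""
    def __init__(self, num_features: int, num_domains: int):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)])

    def forward(self, x, domain_id: int):
        return self.bns[domain_id](x)
```

In such a setup, only the A/B factors and the per-device BatchNorm layers would be trained when a new device arrives; the base convolution weights stay untouched, which is what allows new device data to be introduced without retraining the original model.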
|
Leveraging Self-Supervised Audio Representations for Data-Efficient Acoustic Scene Classification
Yiqiang Cai (Xi'an Jiaotong-Liverpool University), Shengchen Li (Xi'an Jiaotong-Liverpool University), Xi Shao (Nanjing University of Posts and Telecommunications)
|
Abstract
Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system that leverages self-supervised audio representations extracted from general-purpose audio datasets. We use BEATs, an audio SSL model pre-trained on AudioSet, to extract general-purpose representations. Extensive experiments demonstrate that these self-supervised audio representations help achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling SSL models fine-tuned with different strategies yields a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
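As a rough illustration of the distillation step, the sketch below (an assumed setup; the temperature and weighting are placeholders, not the paper's recipe) blends hard-label cross-entropy with soft targets produced by an SSL-pretrained teacher:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, log_p_teacher, reduction="batchmean",
                  log_target=True) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```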
|
Frequency Tracking Features for Data-Efficient Deep Siren Identification
Stefano Damiano (KU Leuven), Thomas Dietzen (KU Leuven), Toon van Waterschoot (KU Leuven)
|
Abstract
The identification of siren sounds in urban soundscapes is a crucial safety aspect for smart vehicles and has been widely addressed by means of neural networks that ensure robustness to both the diversity of siren signals and the strong and unstructured background noise characterizing traffic. Convolutional neural networks analyzing spectrogram features of incoming signals achieve state-of-the-art performance when enough training data capturing the diversity of the target acoustic scenes is available. In practice, data is usually limited, and algorithms should adapt to unseen acoustic conditions without requiring extensive datasets for re-training. In this work, given the harmonic nature of siren signals, characterized by a periodically evolving fundamental frequency, we propose a low-complexity feature extraction method based on frequency tracking using a single-parameter adaptive notch filter. The features are then used to design a small-scale convolutional network suitable for training with limited data. The evaluation results indicate that the proposed model consistently outperforms the traditional spectrogram-based model when limited training data is available, achieves better cross-domain generalization, and has a smaller size.
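A single-parameter adaptive notch filter of the kind mentioned above can be sketched in a few lines. The code below is an illustrative simplification (the filter form, step sizes, and function name are my assumptions, not the authors' implementation): one coefficient is adapted so that the notch follows the dominant harmonic, and the resulting frequency trajectory is the feature handed to the classifier.

```python
import numpy as np

def track_frequency(x, fs, rho=0.95, mu=1e-3):
    """Return a per-sample estimate (Hz) of the dominant frequency of x."""
    a = 0.0                       # single adaptive coefficient, a = -2*cos(w0)
    s1 = s2 = 0.0                 # state of the all-pole section
    freqs = np.zeros(len(x))
    for n, xn in enumerate(x):
        # Constrained notch H(z) = (1 + a z^-1 + z^-2) / (1 + rho a z^-1 + rho^2 z^-2)
        s0 = xn - rho * a * s1 - (rho ** 2) * s2       # all-pole part
        e = s0 + a * s1 + s2                           # notch output (error)
        a -= mu * e * s1                               # gradient step on a
        a = float(np.clip(a, -2.0, 2.0))               # keep arccos argument valid
        s2, s1 = s1, s0
        freqs[n] = np.arccos(np.clip(-a / 2.0, -1.0, 1.0)) * fs / (2.0 * np.pi)
    return freqs
```

Feeding this one-dimensional frequency trajectory, rather than a full spectrogram, to a small CNN is the kind of low-complexity front end the abstract describes.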
|
Acoustic-Based Traffic Monitoring with Neural Network Trained by Matching Loss for Ranking Function
Tomohiro Takahashi (Tokyo Metropolitan University), Natsuki Ueno (Kumamoto University), Yuma Kinoshita (Tokai University), Yukoh Wakabayashi (Toyohashi University of Technology), Nobutaka Ono (Tokyo Metropolitan University), Makiho Sukekawa (NEXCO-EAST ENGINEERING Company Limited), Seishi Fukuma (NEXCO-EAST ENGINEERING Company Limited), Hiroshi Nakagawa (NEXCO-EAST ENGINEERING Company Limited)
|
Abstract
In this study, we propose an effective loss function for training neural networks (NNs) in acoustic-based traffic monitoring. This task involves estimating the number of vehicles from a fixed duration of acoustic input, such as one minute. Since the distribution of the number of passing vehicles depends on the road and can deviate significantly from a normal distribution, using Mean Square Error (MSE) as the loss function may not always lead to efficient learning. To address this, we introduce a matching loss for the ranking function into the loss function. This enhances learning by increasing the rank correlation between true and estimated vehicle counts. We evaluated the effectiveness of this loss function on the development dataset of the DCASE 2024 Challenge Task 10 under various input feature and network architecture conditions. The results demonstrate that the proposed loss function significantly improves Kendall's Tau Rank Correlation (KTRC) and Root Mean Square Error (RMSE), highlighting its potential for improving acoustic-based traffic monitoring systems.
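One plausible way to combine a rank-aware term with a regression loss, in the spirit of the matching-loss idea described above, is sketched below (this is an illustrative RankNet-style pairwise formulation under my own assumptions, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def mse_plus_ranking_loss(pred, target, beta: float = 1.0):
    """pred, target: 1-D tensors of estimated / true vehicle counts in a batch."""
    mse = F.mse_loss(pred, target)
    dp = pred.unsqueeze(1) - pred.unsqueeze(0)      # dp[i, j] = pred_i - pred_j
    dt = target.unsqueeze(1) - target.unsqueeze(0)  # dt[i, j] = target_i - target_j
    sign = torch.sign(dt)
    mask = sign != 0                                # ignore pairs with tied targets
    if not mask.any():
        return mse
    # Logistic pairwise penalty when the predicted order contradicts the true order.
    rank = F.softplus(-sign[mask] * dp[mask]).mean()
    return mse + beta * rank
```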
|
The Language of Sound Search: Examining User Queries in Audio Search Engines
Benno Weck (Music Technology Group, Universitat Pompeu Fabra), Frederic Font (Music Technology Group, Universitat Pompeu Fabra)
|
Abstract
This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems; this dataset is also shared with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.
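For readers who want to reproduce the basic length comparison between the two query sources, a small illustrative sketch follows (the file names and the whitespace-token length measure are assumptions, not the study's exact pipeline):

```python
from statistics import mean, median

def load_queries(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def summarise(queries):
    lengths = [len(q.split()) for q in queries]
    return {
        "n": len(queries),
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        # fraction of queries longer than three whitespace-separated tokens
        "share_long": sum(l > 3 for l in lengths) / len(lengths),
    }

# Hypothetical file names; one query per line.
print("survey:   ", summarise(load_queries("survey_queries.txt")))
print("freesound:", summarise(load_queries("freesound_queries.txt")))
```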
|