Workshop schedule

Wednesday 23rd October

Day 1 Location: Shinagawa Season Terrace
12:00

Registration

The registration desk opens at 12:00 and will remain open until 17:30.
13:00

Welcome

13:10

Keynote I


Chair: Nobutaka Ono
Nancy F. Chen
Institute for Infocomm Research (I2R), Agency for Science, Technology, and Research (A*STAR)
K1
Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Abstract

We will share MeraLion (Multimodal Empathetic Reasoning and Learning In One Network), our generative AI effort within Singapore’s National Multimodal Large Language Model Programme. Speech and audio information is rich, providing a more comprehensive understanding of spatial and temporal reasoning, as well as of social dynamics, that goes beyond the semantics derived from text alone. Cultural nuances and multilingual peculiarities add another layer of complexity to understanding human interactions. In addition, we will draw on use cases in education to highlight research endeavors, technology deployment experience and application opportunities.

14:20

Challenge Task Spotlights


Chair: Annamaria Mesaros

15:20

Coffee break

15:50

Oral Session I


Chair: Irene Martin Morato
Acoustic Scene Classification Across Multiple Devices Through Incremental Learning of Device-Specific Domains
Manjunath Mulimani (Tampere University), Annamaria Mesaros (Tampere University)
Abstract

In this paper, we propose using a domain-incremental learning approach for coping with different devices in acoustic scene classification. While the typical way to handle mismatched training data is through domain adaptation or specific regularization techniques, incremental learning offers a different approach. With this technique, it is possible to learn the characteristics of new devices on-the-go, adding to a previously trained model. This also means that new device data can be introduced at any time, without a need to retrain the original model. In terms of incremental learning, we propose a combination of domain-specific Low-Rank Adaptation (LoRA) parameters and running statistics of Batch Normalization (BN) layers. LoRA adds low-rank decomposition matrices to a convolutional layer with a few trainable parameters for each new device, while domain-specific BN is used to boost performance. Experiments are conducted on the TAU Urban Acoustic Scenes 2020 Mobile development dataset, containing 9 different devices; we train the system using the 40h of data available for the main device, and incrementally learn the domains of the other 8 devices based on 3h of data available for each. We show that the proposed approach outperforms other fine-tuning-based methods, and is outperformed only by joint learning with all data from all devices.
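For readers unfamiliar with applying LoRA to convolutional layers, a minimal PyTorch sketch of the general idea (not the authors' implementation) is shown below; the rank, layer choices, and zero initialization of the up-projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRAConv2d(nn.Module):
    """Frozen base conv plus a low-rank, device-specific adapter (illustrative sketch)."""
    def __init__(self, base_conv: nn.Conv2d, rank: int = 4):
        super().__init__()
        self.base = base_conv
        for p in self.base.parameters():
            p.requires_grad = False          # base model stays frozen
        # Low-rank decomposition: down-projection followed by 1x1 up-projection.
        self.lora_down = nn.Conv2d(base_conv.in_channels, rank,
                                   kernel_size=base_conv.kernel_size,
                                   stride=base_conv.stride,
                                   padding=base_conv.padding, bias=False)
        self.lora_up = nn.Conv2d(rank, base_conv.out_channels,
                                 kernel_size=1, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter initially contributes nothing

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))
```

A new set of adapter (and BN) parameters would be kept per device, so learning a new domain never overwrites the original model.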

Leveraging Self-Supervised Audio Representations for Data-Efficient Acoustic Scene Classification
Yiqiang Cai (Xi'an Jiaotong-Liverpool University), Shengchen Li (Xi'an Jiaotong-Liverpool University), Xi Shao (Nanjing University of Posts and Telecommunications)
Abstract

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system by leveraging self-supervised audio representations extracted from general-purpose audio datasets. We introduce BEATs, an audio SSL pre-trained model, to extract the general representations from AudioSet. Through extensive experiments, it has been demonstrated that the self-supervised audio representations can help to achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling the SSL models fine-tuned with different strategies contributes to a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
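As a rough illustration of the knowledge-distillation step mentioned above, here is a generic temperature-scaled distillation loss in PyTorch; the temperature and weighting are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence from the teacher."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```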

Frequency Tracking Features for Data-Efficient Deep Siren Identification
Stefano Damiano (KU Leuven), Thomas Dietzen (KU Leuven), Toon van Waterschoot (Department of Electrical Engineering (ESAT-STADIUS/ETC))
Abstract

The identification of siren sounds in urban soundscapes is a crucial safety aspect for smart vehicles and has been widely addressed by means of neural networks that ensure robustness to both the diversity of siren signals and the strong and unstructured background noise characterizing traffic. Convolutional neural networks analyzing spectrogram features of incoming signals achieve state-of-the-art performance when enough training data capturing the diversity of the target acoustic scenes is available. In practice, data is usually limited and algorithms should be robust to adapt to unseen acoustic conditions without requiring extensive datasets for re-training. In this work, given the harmonic nature of siren signals, characterized by a periodically evolving fundamental frequency, we propose a low-complexity feature extraction method based on frequency tracking using a single-parameter adaptive notch filter. The features are then used to design a small-scale convolutional network suitable for training with limited data. The evaluation results indicate that the proposed model consistently outperforms the traditional spectrogram-based model when limited training data is available, achieves better cross-domain generalization and has a smaller size.
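A minimal sketch of single-parameter adaptive-notch-filter frequency tracking is given below, assuming a standard constrained-poles-and-zeros notch with an LMS-style update; the exact filter structure and adaptation rule used in the paper may differ.

```python
import numpy as np

def track_fundamental(x, fs, rho=0.95, mu=1e-3, a0=-1.0):
    """Track one dominant frequency with an adaptive IIR notch filter.

    The notch H(z) = (1 + a z^-1 + z^-2) / (1 + rho*a z^-1 + rho^2 z^-2) places its
    zero at the tracked frequency; a = -2*cos(w0) is the single adapted parameter,
    updated to minimize the notch output power.
    """
    a, s1, s2 = a0, 0.0, 0.0
    freqs = np.empty(len(x))
    for n, xn in enumerate(x):
        s0 = xn - rho * a * s1 - rho ** 2 * s2   # all-pole section
        e = s0 + a * s1 + s2                     # notch output (error signal)
        a = np.clip(a - mu * e * s1, -2.0, 2.0)  # LMS-style step on the notch parameter
        s2, s1 = s1, s0
        freqs[n] = np.arccos(np.clip(-a / 2.0, -1.0, 1.0)) * fs / (2 * np.pi)
    return freqs
```

The sequence of tracked frequencies (rather than a full spectrogram) would then feed a small convolutional classifier.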

Acoustic-Based Traffic Monitoring with Neural Network Trained by Matching Loss for Ranking Function
Tomohiro Takahashi (Tokyo Metropolitan University), Natsuki Ueno (Kumamoto University), Yuma Kinoshita (Tokai University), Yukoh Wakabayashi (Toyohashi University of Technology), Nobutaka Ono (Tokyo Metropolitan University), Makiho Sukekawa (NEXCO-EAST ENGINEERING Company Limited), Seishi Fukuma (NEXCO-EAST ENGINEERING Company Limited), Hiroshi Nakagawa (NEXCO-EAST ENGINEERING Company Limited)
Abstract

In this study, we propose an effective loss function for training neural networks (NNs) in acoustic-based traffic monitoring. This task involves estimating the number of vehicles from a fixed duration of acoustic input, such as one minute. Since the distribution of the number of passing vehicles depends on the road and can deviate significantly from a normal distribution, using Mean Square Error (MSE) as the loss function may not always lead to efficient learning. To address this, we introduce a matching loss for the ranking function into the loss function. This enhances learning by increasing the rank correlation between true and estimated vehicle counts. We evaluated the effectiveness of this loss function on the development dataset of the DCASE 2024 Challenge Task 10 under various input feature and network architecture conditions. The results demonstrate that the proposed loss function significantly improves Kendall's Tau Rank Correlation (KTRC) and Root Mean Square Error (RMSE), highlighting its potential for improving acoustic-based traffic monitoring systems.
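The general idea of augmenting a regression loss with a ranking term can be sketched as follows; this pairwise hinge formulation is an illustrative stand-in, not the specific matching loss for the ranking function proposed in the paper.

```python
import torch

def mse_plus_pairwise_rank_loss(pred, target, lam=1.0, margin=0.0):
    """MSE plus a pairwise hinge term penalizing discordant orderings
    between predicted and true vehicle counts within a batch."""
    mse = torch.mean((pred - target) ** 2)
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # pred_j - pred_i for all pairs
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # target_j - target_i
    sign = torch.sign(dt)
    # Penalize pairs whose predicted ordering disagrees with the true ordering.
    rank = torch.clamp(margin - sign * dp, min=0.0)
    mask = (dt != 0).float()
    rank = (rank * mask).sum() / mask.sum().clamp(min=1.0)
    return mse + lam * rank
```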

The Language of Sound Search: Examining User Queries in Audio Search Engines
Benno Weck (Music Technology Group, Universitat Pompeu Fabra (UPF)), Frederic Font (Music Technology Group - Universitat Pompeu Fabra)
Abstract

This study examines textual, user-written search queries within the context of sound search engines, encompassing various applications such as foley, sound effects, and general audio retrieval. Current research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems. To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs. The survey was designed to collect queries for an unrestricted, hypothetical sound search engine, resulting in a dataset that captures user intentions without the constraints of existing systems. This dataset is also made available for sharing with the research community. In contrast, the Freesound query logs encompass approximately 9 million search requests, providing a comprehensive view of real-world usage patterns. Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints. Both datasets predominantly feature keyword-based queries, with few survey participants using full sentences. Key factors influencing survey queries include the primary sound source, intended usage, perceived location, and the number of sound sources. These insights are crucial for developing user-centred, effective text-based audio retrieval systems, enhancing our understanding of user behaviour in sound search contexts.

17:30

End of day 1

18:00 - 20:00

Thursday 24th October

Day 2 Location: Shinagawa Season Terrace
9:30

Keynote II


Chair: Yohei Kawaguchi
Bourhan Yassin
K2
The Future of Bioacoustics and AI for Large-Scale Biodiversity Monitoring
Abstract

Sound is an invaluable tool in the discovery and preservation of species, offering insights that other data collection methodologies often overlook. In the first half of this keynote, we will explore the power of acoustic monitoring, particularly in biodiversity conservation, where AI and sound classification techniques enable near real-time identification of species vocalizations. By leveraging feature embeddings to streamline data processing, these methods allow for accurate species detection and classification, reducing the complexity of handling large numbers of raw audio files. In the second half, we will focus on the future of ground data collection through the use of Unmanned Aerial Vehicles (UAVs). UAVs present a powerful opportunity to scale ground truth data collection, providing continuous monitoring of rapidly changing ecosystems. This enhanced data gathering, when combined with AI, enables the development of more precise biodiversity indicators and predictions. Together, these innovations expand our ability to monitor ecosystems and protect wildlife at unprecedented scales.

10:30

Coffee break

11:00

Poster Session I

List of posters

12:20

Lunch

13:40

Oral Session II


Chair: Keisuke Imoto

AdaProj: Adaptively Scaled Angular Margin Subspace Projections for Anomalous Sound Detection with Auxiliary Classification Tasks
Kevin Wilkinghoff (MERL)
Abstract

The state-of-the-art approach for semi-supervised anomalous sound detection is to first learn an embedding space by using auxiliary classification tasks based on meta information or self-supervised learning, and then estimate the distribution of normal data. In this work, AdaProj, a novel loss function for training the embedding model, is presented. In contrast to commonly used angular margin losses, which project data of each class as close as possible to their corresponding class centers, AdaProj learns to project data onto class-specific subspaces while still ensuring an angular margin between classes. By doing so, the resulting distributions of the embeddings belonging to normal data are not required to be as restrictive as with other loss functions, allowing a more detailed view of the data. In experiments conducted on the DCASE2022 and DCASE2023 anomalous sound detection datasets, it is shown that using AdaProj to learn an embedding space significantly outperforms other commonly used loss functions.
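A toy sketch of scoring embeddings by their distance to learnable class-specific subspaces is given below; it illustrates the general notion only and is not the published AdaProj loss (the subspace dimension, orthonormalization, and cross-entropy-style training are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceProjectionHead(nn.Module):
    """Toy per-class subspace head: score each embedding by its distance to the
    projection onto a learnable class subspace (illustrative only)."""
    def __init__(self, num_classes, embed_dim, subspace_dim=16):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_classes, embed_dim, subspace_dim))

    def forward(self, z, labels):
        z = F.normalize(z, dim=-1)                       # (B, D) unit embeddings
        q, _ = torch.linalg.qr(self.bases)               # orthonormal class bases (C, D, k)
        coords = torch.einsum("bd,cdk->bck", z, q)       # coordinates in each subspace
        recon = torch.einsum("bck,cdk->bcd", coords, q)  # projections back in R^D
        dists = ((z.unsqueeze(1) - recon) ** 2).sum(-1)  # (B, C) squared distances
        # Treat negative distances as logits: the correct class subspace should be closest.
        return F.cross_entropy(-dists, labels)
```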

Improving Domain Generalisation with Diversity-Based Sampling
Andrea Napoli (University of Southampton), Paul White (ISVR)
Abstract

Domain shifts are a major obstacle to the deployment of automated bioacoustic monitoring tools to new recording environments or habitats. Invariance regularisation is one approach for dealing with these shifts, in which the feature distributions of data from different domains are encouraged to match (by minimising some measure of statistical distance). However, in a deep learning setup, the statistical distance is only computed over small minibatches of data at a time. Inevitably, small samples have poor representation of their underlying distributions, resulting in extremely noisy distance estimates. In this paper, we propose that promoting wider distribution coverage, by inducing diversity in each sampled minibatch, would improve these estimates, and hence the generalisation power of the trained model. We describe two options for diversity-based data samplers, based on the k-determinantal point process (k-DPP) and the k-means++ algorithm, which can function as drop-in replacements for a standard random sampler. We then test these on a domain shift task based on humpback whale detection, where we find both options improve the performance of two invariance regularisation algorithms, as well as standard training via empirical risk minimisation (ERM).
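A drop-in diversity-based sampler in the spirit of the k-means++ option might look like the following sketch, which picks each minibatch index with probability proportional to its squared distance from the items already chosen; the choice of feature space and the lack of any class balancing are assumptions.

```python
import numpy as np

def kmeanspp_minibatch(features, batch_size, rng=None):
    """Select a diverse set of indices via k-means++-style seeding over feature vectors."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(features)
    chosen = [int(rng.integers(n))]
    d2 = np.sum((features - features[chosen[0]]) ** 2, axis=1)
    while len(chosen) < batch_size:
        probs = d2 / d2.sum() if d2.sum() > 0 else np.full(n, 1.0 / n)
        idx = int(rng.choice(n, p=probs))
        chosen.append(idx)
        d2 = np.minimum(d2, np.sum((features - features[idx]) ** 2, axis=1))
    return np.asarray(chosen)
```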

Integrating Continuous and Binary Relevances in Audio-Text Relevance Learning
Huang Xie (Tampere University), Khazar Khorrami (Tampere University), Okko Räsänen (Tampere University), Tuomas Virtanen (Tampere University)
Abstract

Audio-text relevance learning refers to learning the shared semantic properties of audio samples and textual descriptions. The standard approach uses binary relevances derived from pairs of audio samples and their human-provided captions, categorizing each pair as either positive or negative. This may result in suboptimal systems due to varying levels of relevance between audio samples and captions. In contrast, a recent study used human-assigned relevance ratings, i.e., continuous relevances, for these pairs but did not obtain performance gains in audio-text relevance learning. This work introduces a relevance learning method that utilizes both human-assigned continuous relevance ratings and binary relevances using a combination of a listwise ranking objective and a contrastive learning objective. Experimental results demonstrate the effectiveness of the proposed method, showing improvements in language-based audio retrieval, a downstream task in audio-text relevance learning. In addition, we analyze how properties of the captions or audio clips contribute to the continuous audio-text relevances provided by humans or learned by the machine.
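One plausible way to combine a contrastive objective on binary relevances with a listwise ranking objective on continuous ratings is sketched below; the temperatures, weighting, and ListNet-style formulation are assumptions, not the authors' exact objectives.

```python
import torch
import torch.nn.functional as F

def combined_relevance_loss(sim, cont_relevance, tau=0.05, lam=1.0):
    """Contrastive term on matched (diagonal) audio-caption pairs plus a listwise
    term that matches the ranking induced by continuous relevance ratings."""
    n = sim.size(0)
    targets = torch.arange(n, device=sim.device)
    contrastive = 0.5 * (F.cross_entropy(sim / tau, targets) +
                         F.cross_entropy(sim.t() / tau, targets))
    # ListNet-style objective: match top-one probability distributions.
    p_true = F.softmax(cont_relevance, dim=1)
    listwise = F.kl_div(F.log_softmax(sim / tau, dim=1), p_true,
                        reduction="batchmean")
    return contrastive + lam * listwise
```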

Estimated Audio–Caption Correspondences Improve Language-Based Audio Retrieval
Paul Primus (Johannes Kepler University), Florian Schmid (Johannes Kepler University), Gerhard Widmer (Johannes Kepler University)
Abstract

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio–caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio–caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio–caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.
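The second-stage idea of learning from estimated correspondences can be illustrated with a simple soft-target objective, sketched below under assumed temperatures and a symmetric audio-to-text/text-to-audio formulation.

```python
import torch
import torch.nn.functional as F

def soft_target_retrieval_loss(sim_student, sim_teacher, tau_s=0.05, tau_t=0.05):
    """Second stage: replace one-hot matching targets with correspondences
    estimated by first-stage models, in both retrieval directions."""
    targets_a2t = F.softmax(sim_teacher / tau_t, dim=1)
    targets_t2a = F.softmax(sim_teacher.t() / tau_t, dim=1)
    loss_a2t = F.kl_div(F.log_softmax(sim_student / tau_s, dim=1),
                        targets_a2t, reduction="batchmean")
    loss_t2a = F.kl_div(F.log_softmax(sim_student.t() / tau_s, dim=1),
                        targets_t2a, reduction="batchmean")
    return 0.5 * (loss_a2t + loss_t2a)
```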

Guided Captioning of Audio
Irene Martin (Tampere University), James O Afolaranmi (Tampere University), Annamaria Mesaros (Tampere University)
Abstract

This work introduces a guided captioning system that aims to produce captions focused on different audio content, depending on a guiding text. We show that keyword guidance results in more diverse captions, even though the usual captioning metrics do not reflect this. We design a system that can be trained using keywords automatically extracted from reference annotations, and which is provided with one keyword at test time. When trained with 5 keywords, the produced captions contain the exact guidance keyword 70% of the time and comprise over 3600 unique sentences for the Clotho dataset. In contrast, a baseline without any keywords produces 700 unique captions on the same test set.

15:20

Coffee break

15:50

Poster Session II

List of posters

17:10

End of day 2

17:30
18:30 - 21:00

Friday 25th October

Day 3 Location: Shinagawa Season Terrace
9:30

Keynote III


Chair: Yohei Kawaguchi
Jenelle Feather
Flatiron Institute’s Center for Computational Neuroscience
K3
Successes and Failures of Machine Learning Models of Sensory Systems
Abstract

The environment is full of rich sensory information, and our brain can parse this input, understand a scene, and learn from the resulting representations. The past decade has given rise to computational models that transform sensory inputs into representations useful for complex behaviors, such as speech recognition and image classification. These models can improve our understanding of biological sensory systems and serve as a test bed for technology that aids sensory impairments, provided that model representations resemble those in the brain. In this talk, I will detail comparisons of model representations with those of biological systems. In the first study, I will discuss how modifications to a model’s training environment improve its ability to predict auditory fMRI responses. In the second part of the talk, I will present behavioral experiments using model metamers to reveal divergent invariances between human observers and current computational models of auditory and visual perception. By investigating the similarities and differences between computational models and biological systems, we aim to improve current models and better explain how our brain utilizes robust representations for perception and cognition.

10:30

Coffee break

11:00

Poster Session III

List of posters

12:20

Lunch

13:40

Oral Session III


Chair: Tatsuya Komatsu
From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems
Constance Douwes (Inria), Romain Serizel (Université de Lorraine)
Abstract

The massive use of machine learning models, particularly neural networks, has raised serious concerns about their environmental impact. Indeed, over the last few years we have seen an explosion in the computing costs associated with training and deploying these systems. It is, therefore, crucial to understand their energy requirements in order to better integrate them into the evaluation of models, which has so far focused mainly on performance. In this paper, we study several neural network architectures that are key components of sound event detection systems, using an audio tagging task as an example. We measure the energy consumption for training and testing small to large architectures and establish complex relationships between the energy consumption, the number of floating-point operations, the number of parameters, and the GPU/memory utilization.
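As a rough illustration of how such energy figures can be obtained, the sketch below samples GPU power with pynvml and integrates it over time; this is a generic measurement approach under assumed sampling settings, not necessarily the tooling used by the authors.

```python
import time
import threading
import pynvml  # NVIDIA Management Library bindings (assumed available)

class GPUEnergyMeter:
    """Rough GPU energy estimate: sample instantaneous power and integrate over time."""
    def __init__(self, device_index=0, interval_s=0.1):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.interval = interval_s
        self.energy_j = 0.0
        self._stop = threading.Event()

    def _run(self):
        while not self._stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
            self.energy_j += watts * self.interval
            time.sleep(self.interval)

    def __enter__(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        pynvml.nvmlShutdown()

# Usage (hypothetical training loop):
#     with GPUEnergyMeter() as meter:
#         train_one_epoch(model, loader)
#     print(f"approx. {meter.energy_j / 3600:.2f} Wh")
```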

Self Training and Ensembling Frequency Dependent Networks with Coarse Prediction Pooling and Sound Event Bounding Boxes
Hyeonuk Nam (KAIST), Deokki Min (Korea Advanced Institute of Science and Technology (KAIST)), Seung-Deok Choi (KAIST), Inhan Choi (KAIST), Yong-Hwa Park (KAIST)
Abstract

To tackle sound event detection (SED), we propose frequency-dependent networks (FreDNets), which heavily leverage frequency-dependent methods. We apply frequency warping and FilterAugment, which are frequency-dependent data augmentation methods. The model architecture consists of three branches: an audio teacher-student transformer (ATST) branch, a BEATs branch, and a CNN branch including either partial dilated frequency dynamic convolution (PDFD conv) or squeeze-and-excitation (SE) with time-frame frequency-wise SE (tfwSE). To train on MAESTRO labels, which have coarse temporal resolution, we applied max pooling to the predictions for the MAESTRO dataset. Using the best ensemble model, we applied self-training to obtain pseudo-labels for the DESED weak set, the unlabeled set, and AudioSet. AudioSet pseudo-labels, filtered to focus on high-confidence labels, are used to train on the DESED dataset only. We used change-detection-based sound event bounding boxes (cSEBBs) as post-processing for the ensemble models in self-training and for the submission models. The resulting FreDNet was ranked 2nd in DCASE 2024 Challenge Task 4.

Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets
Florian Schmid (Johannes Kepler University), Paul Primus (Johannes Kepler University), Tobias Morocutti (Johannes Kepler University), Jonathan Greif (Johannes Kepler University), Gerhard Widmer (Johannes Kepler University)
Abstract

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

Learning Multi-Target TDOA Features for Sound Event Localization and Detection
Axel Berg (Arm), Johanna Engman (Lund University), Jens Gulin (Sony), Kalle Åström (Lund University), Magnus Oskarsson (Lund University)
Abstract

Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
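For reference, the classic (non-learned) GCC-PHAT feature that NGCC-PHAT builds on can be computed as in the sketch below; parameter names and the interpolation-free peak picking are simplifications.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None, n_fft=None):
    """Classic GCC-PHAT: the cross-power spectrum whitened by its magnitude yields
    a sharp cross-correlation peak at the TDOA between two channels."""
    n = n_fft or (len(x) + len(y))
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # negative to positive lags
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau, cc
```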

Synthetic Training Set Generation Using Text-to-Audio Models for Environmental Sound Classification
Francesca Ronchini (Politecnico di Milano), Luca Comanducci (Politecnico di Milano), Fabio Antonacci (Politecnico di Milano)
Abstract

In recent years, text-to-audio models have revolutionized the field of automatic audio generation. This paper investigates their application in generating synthetic datasets for training data-driven models. Specifically, this study analyzes the performance of two environmental sound classification systems trained with data generated from text-to-audio models. We considered three scenarios: a) augmenting the training dataset with data generated by text-to-audio models; b) using a mixed training dataset combining real and synthetic text-driven generated data; and c) using a training dataset composed entirely of synthetic audio. In all cases, the performance of the classification models was tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, with consistent performance when replacing a subset of the recorded dataset. However, the performance of the audio recognition models drops when relying entirely on generated audio.

15:20

Coffee break

15:50

Town Hall Discussion

16:30

Closing Remarks