DCASE2025 Workshop Program

Workshop on Detection and Classification of Acoustic Scenes and Events
29 and 30–31 October 2025, Barcelona, Spain

Preliminary Schedule

These times are approximate and still subject to change.

Wednesday 29th of October

Day 1 Location: UPF Campus del Poblenou, Barcelona
8:30

BioDCASE programme

Details: BioDCASE Workshop Schedule
18:00

End of BioDCASE programme

20:00

Thursday 30th of October

Day 2 Location: UPF Campus del Poblenou, Barcelona
8:30

Registration

9:00

Welcome session

9:30

Keynote 1

Simone Graetzer
Ph.D., MIOA, MASA; Senior Research Fellow in the Acoustics Research Centre, University of Salford; Co-Lead in EPSRC Noise Network Plus; Co-Investigator in the EPSRC CDT in Sound Futures
K1
Enhancing and Predicting Speech Intelligibility in Noise
Abstract

The Clarity Project is a UK Research and Innovation funded research project on speech intelligibility enhancement and prediction for Hearing Aid (HA) signal processing involving four UK universities and several project partners. Since 2019, we have been running Machine Learning (ML) challenges to facilitate the development of novel ML-based approaches. The challenges focus on speech-in-noise listening because this is the situation in which HA users report the most dissatisfaction, and we run both enhancement and prediction challenges as both enhancement and prediction algorithms are fundamental to the development of HA technology. We evaluate speech enhancement challenge submissions by running listening tests involving HA wearers. The enhanced HA signals and listening test scores can be fed into the prediction challenges. For each challenge, we provide tools, datasets and baseline systems. Our data and code are made open-source to lower barriers that might prevent researchers from considering hearing loss and hearing aids. Recently, we released the results of our third prediction challenge, and we are now gathering feedback on how to ensure sustainability beyond our current funding. You can find out more at https://claritychallenge.org/.

More info
10:45

Coffee break

11:10

Poster session 1 spotlights

Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge
Schmid, Florian and Primus, Paul and Heittola, Toni and Mesaros, Annamaria and Martin-Morato, Irene and Widmer, Gerhard
Abstract

This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge, along with its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022–2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics—reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy with a device-agnostic model, improving to 51.89% when incorporating device-specific fine-tuning. The task attracted 31 submissions from 12 teams, with 11 teams outperforming the baseline. The top-performing submission achieved an accuracy gain of more than 8 percentage points over the baseline on the evaluation set.

PDF
Description and Discussion on DCASE 2025 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Nishida, Tomoya and Harada, Noboru and Niizumi, Daisuke and Albertini, Davide and Sannino, Roberto and Pradolini, Simone and Augusti, Filippo and Imoto, Keisuke and Dohi, Kota and Purohit, Harsh and Endo, Takashi and Kawaguchi, Yohei
Abstract

This paper introduces the task description for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 2, titled “First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring.” Building on the DCASE 2024 Challenge Task 2, this task is structured as a first-shot problem within a domain generalization framework. The primary objective of the first-shot approach is to facilitate the rapid deployment of ASD systems for new machine types without requiring machine-specific hyperparameter tunings. For DCASE 2025 Challenge Task 2, sounds from previously unseen machine types have been collected and provided as the evaluation dataset. We received 119 submissions from 35 teams, and an analysis of these submissions has been made in this paper. Analysis showed that various approaches can all be competitive, such as fine-tuning pre-trained models, using frozen pre-trained models, and training small models from scratch, when combined with appropriate cost functions, anomaly score normalization, and use of clean machine and noise sounds.

PDF
Towards Spatial Audio Understanding Via Question Answering
Sudarsanam, Parthasaarathy and Politis, Archontis
Abstract

In this paper, we introduce a novel framework for spatial audio understanding of first-order ambisonic (FOA) signals through a question answering (QA) paradigm, aiming to extend the scope of sound event localization and detection (SELD) towards spatial scene understanding and reasoning. First, we curate and release fine-grained spatio-temporal textual descriptions for the STARSS23 dataset using a rule-based approach, and further enhance linguistic diversity using large language model (LLM)-based rephrasing. We also introduce a QA dataset aligned with the STARSS23 scenes, covering various aspects such as event presence, localization, spatial, and temporal relationships. To increase language variety, we again leverage LLMs to generate multiple rephrasings per question. Finally, we develop a baseline spatial audio QA model that takes FOA signals and natural language questions as input and provides answers regarding various occurrences, temporal, and spatial relationships of sound events in the scene formulated as a classification task. Despite being trained solely with scene-level question answering supervision, our model achieves performance that is comparable to a fully supervised sound event localization and detection model trained with frame-level spatiotemporal annotations. The results highlight the potential of language-guided approaches for spatial audio understanding and open new directions for integrating linguistic supervision into spatial scene analysis.

PDF
Bioacoustics on Tiny Hardware at the BioDCASE 2025 Challenge
Carmantini, Giovanni and Benhamadi, Yasmine and Carreau, Matthieu and Kwak, Minkyung and Morandi, Ilaria and Förstner, Friedrich and Hladik, Pierre-Emmanuel and Lagrange, Mathieu and Linhart, Pavel and Petrusková, Tereza and Lostanlen, Vincent and Kahl, Stefan
Abstract

The BioDCASE initiative aims to encourage the invention of methods for detection and classification of acoustic scenes and events (DCASE) within the domain of bioacoustics. We have contributed to the first edition of the BioDCASE challenge by means of a task named 'bioacoustics on tiny hardware'. The motivation for this task resides in the growing need for operating bioacoustic event detection algorithms on low-power autonomous recording units (ARUs). Participants were tasked with developing a detector of bird vocalizations from the yellowhammer (Emberiza citrinella), given two hours of audio as a training set. The detector had to run within the resource constraints of an ESP32-S3 microcontroller unit. By evaluating the submitted models on a withheld dataset, we conducted an independent benchmark that assessed both classification performance and resource efficiency through multiple metrics: average precision, inference time, and memory usage. Our reported results confirm that recent advances in 'tiny machine learning' (TinyML) have transformative potential for computational bioacoustics. For more information, please visit: https://biodcase.github.io/challenge2025/task3

PDF
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Berghi, Davide and Jackson, Philip
Abstract

In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audio-visual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, achieved second rank in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.

PDF
Stereo Sound Event Localization and Detection with Onscreen/Offscreen Classification
Shimada, Kazuki and Politis, Archontis and Roman, Iran and Sudarsanam, Parthasaarathy and Diaz-Guerra, David and Pandey, Ruchi and Uchida, Kengo and Koyama, Yuichiro and Takahashi, Naoya and Shibuya, Takashi and Takahashi, Shusuke and Virtanen, Tuomas and Mitsufuji, Yuki
Abstract

This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification (i.e., classifying whether a detected sound event originates from a source within the limited FOV or outside of it). This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models' ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.

PDF
Sound Event Detection using Time-frequency Bounding Boxes with a Self-Supervised Audio Spectrogram Transformer
Zhu, Zhi and Sato, Yoshinao
Abstract

Time-frequency bounding boxes of sound events in audio spectrograms capture essential information across various domains. Nevertheless, previous studies of sound event detection (SED) have focused on detecting temporal boundaries. Although object detection models can be adapted for time-frequency SED, they are intended for image data that exhibit significantly different features compared to audio spectrograms. To address this modality gap, this study employs an audio spectrogram transformer (AST) as the backbone of a detection transformer (DETR). A dataset of whistle sounds from defective wind turbine blades was used for model training and evaluation. The proposed model outperformed a faster region-based convolutional neural network with a residual network backbone, which is an established object detection model. The integration of deformable attention with a multiscale feature pyramid significantly contributed to improving the performance. These results demonstrate the effectiveness of deformable DETR models with an AST backbone for time-frequency SED. Our findings will contribute to advancing time-frequency SED, an area that remains underexplored.

PDF
Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes
Yasuda, Masahiro and Binh Thien, Nguyen and Harada, Noboru and Serizel, Romain and Mishra, Mayank and Delcroix, Marc and Araki, Shoko and Takeuchi, Daiki and Niizumi, Daisuke and Ohishi, Yasunori and Nakatani, Tomohiro and Kawamura, Takao and Ono, Nobutaka
Abstract

Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata describing the object type (sound event class) and spatial information, including direction. However, because several existing challenge tasks already provide some of these component functions, this year's task focuses on detecting and separating sound events from multi-channel spatial input signals. This paper outlines the S5 task setting of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 4 and the DCASE2025 Task 4 Dataset, newly recorded and curated for this task. We also discuss the performance and characteristics of the S5 systems submitted to DCASE 2025 Challenge Task 4 based on experimental results.

PDF
Exploiting Stereo Spatial Properties with ReCoOP Framework for Joint Sound Event Detection and Localization
Banerjee, Mohor and Nagisetty, Srikanth and Teo, Han Boon
Abstract

The integration of intelligent systems into daily environments increases the need for a robust understanding of the acoustic scene. Applications such as assistive technologies, audio navigation, and public safety rely on accurate localization and detection of sound events (SELD). Commercially, embedding spatial audio intelligence into smart devices, vehicles, healthcare tools, and surveillance systems, particularly where visual input is limited, has generated significant interest. Traditional signal processing methods struggle to meet the localization and classification demands of compact, microphone-limited devices. As stereo and multichannel audio become prevalent, developing SELD systems capable of joint direction-of-arrival (DoA) estimation and real-world event detection is essential. In response, we propose ReCoOP (ResNet-Conformer with ONE-PEACE), a deep learning framework that combines a ResNet-Conformer backbone with stereo spatial features and contextual embeddings. The system incorporates Interaural Level and Phase Differences, Generalized Cross-Correlation, alongside Mel spectrograms, to model spatial cues, while global semantics are captured through pre-trained ONE-PEACE embeddings. ReCoOP features specialized modules for direction and distance estimation, with outputs fused via a joint head. Evaluated on the DCASE2025 Task 3 dataset, our approach improves performance by approximately 17.8% over the baseline, securing 3rd place in the DCASE 2025 Task 3 challenge.

PDF
12:00

Poster session 1

13:00

Lunch

14:30

Poster session 2 spotlights

Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types
Yin, Zeyu and Cai, Yiqiang and Lyu, Xinyang and Deng, Pingsong and Li, Shengchen
Abstract

Audio Question Answering (AQA) challenges a system to integrate acoustic perception with natural-language reasoning, yet how much each modality actually matters remains unclear. We propose a controlled modality-weight study on the DCASE 2025 Task 5 benchmark to quantify this balance. Building on a dual-tower BEATS+BERT architecture, we introduce a scalar fusion hyper-parameter that linearly mixes audio and text embeddings. We evaluate model performance across six distinct question types and use statistical analysis to characterize how accuracy shifts as modality weights change. Our results reveal a clear asymmetry: while text alone supports strong performance on many questions, audio contributes significantly only to tasks that require perceptual grounding. Some tasks benefit most from a balanced fusion of both modalities, whereas for others, increased audio weight can even reduce accuracy. This protocol yields a quantitative map of which question categories depend primarily on audio, on text, or on a balanced fusion of the two, providing practical guidance for future audio-language model design.
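The scalar fusion described in this abstract can be illustrated with a minimal sketch (not the authors' implementation): a single weight alpha linearly mixes projected audio and text embeddings, and sweeping it probes how much each modality contributes. The embedding dimensionality and the random tensors below are placeholders.

```python
import torch

def fuse_embeddings(audio_emb: torch.Tensor, text_emb: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly mix audio and text embeddings with a scalar weight alpha in [0, 1].

    alpha = 1.0 uses audio only, alpha = 0.0 uses text only. Both inputs are assumed
    to be projected to the same dimensionality beforehand.
    """
    return alpha * audio_emb + (1.0 - alpha) * text_emb

# Illustrative sweep over the fusion weight (stand-ins for BEATs / BERT embeddings).
audio_emb = torch.randn(8, 768)
text_emb = torch.randn(8, 768)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    fused = fuse_embeddings(audio_emb, text_emb, alpha)
    print(alpha, fused.shape)  # in the study, accuracy per question type would be reported here
```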

PDF
Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds
Cauzinille, Jules and Miron, Marius and Pietquin, Olivier and Hagiwara, Masato and Marxer, Ricard and Rey, Arnaud and Favre, Benoit
Abstract

Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models' properties with linear probing on time-averaged representations. We then extend the approach to account for the effect of time-wise information with other downstream architectures. Finally, we study the implication of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.

PDF
Comparison of Foundation Model Pre-Training Strategies and Architectures for Urban Garden Recordings
Koutsogeorgos, Parmenion and Härmä, Aki
Abstract

Environmental audio recordings captured via passive acoustic monitoring include various sounds such as bird vocalisations, weather phenomena, and human activity. Although abundant and easy to collect, these recordings often contain noise, are location-specific, and lack comprehensive annotations, posing challenges to traditional supervised methods. This paper compares self-supervised pre-training techniques and architectures for developing foundation models to learn transferable feature representations from environmental audio data. The reported experiments use the GardenFiles23 dataset, which consists of two years of stereo recordings and metadata from an urban garden. Pre-training tasks include masked spectrogram reconstruction, in which random patches of mel-spectrogram inputs are masked and the model learns to predict them, and a novel contrastive learning task, in which the model learns to align the two channels of stereo recordings that are masked in a complementary manner, meaning that the masked patches in one channel are unmasked in the other. Two architectures are compared: a Self-Supervised Audio Spectrogram Transformer (SSAST) and a State-Space Model variant (Mamba), which theoretically offers linear-time sequence modelling and improved efficiency. Embeddings are assessed on three downstream tasks: bird detection, time-of-day prediction, and weather metadata prediction. Results indicate that masked reconstruction provides stable convergence and superior bird detection performance, while contrastive learning generates richer embeddings that are beneficial for temporal and weather predictions. Overall, SSAST consistently outperforms Mamba with short input sequences.
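The complementary masking used for the contrastive pre-training task can be sketched as follows (a simplified illustration on patch grids, not the GardenFiles23 pipeline): a random boolean patch mask hides each patch in exactly one of the two stereo channels, so the model can align what one channel hides with what the other still shows.

```python
import numpy as np

def complementary_mask(left: np.ndarray, right: np.ndarray, patch: int = 16, p: float = 0.5, rng=None):
    """Mask non-overlapping patches of two mel-spectrogram channels in a complementary way:
    every patch that is masked (zeroed) in the left channel stays visible in the right, and vice versa."""
    if rng is None:
        rng = np.random.default_rng()
    n_mel, n_frames = left.shape
    grid = rng.random((n_mel // patch, n_frames // patch)) < p   # True -> masked in the left channel
    mask = np.kron(grid, np.ones((patch, patch), dtype=bool))[:n_mel, :n_frames]
    return np.where(mask, 0.0, left), np.where(mask, right, 0.0)

# Toy stereo mel-spectrograms (128 mel bins x 256 frames).
left = np.random.rand(128, 256)
right = np.random.rand(128, 256)
l_masked, r_masked = complementary_mask(left, right, rng=np.random.default_rng(0))
print(l_masked.shape, (l_masked == 0).mean(), (r_masked == 0).mean())
```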

PDF
Universal Incremental Learning for Few-Shot Bird Sound Classification
Mulimani, Manjunath and Mesaros, Annamaria
Abstract

Incremental learning aims to continually learn new input tasks while overcoming the forgetting of previously learned ones. Existing incremental learning methods for audio classification tasks assume that the incoming task either contains new classes from the same domain or the same classes from a new domain, referred to as class-incremental learning (CIL) and domain-incremental learning (DIL), respectively. In this work, we propose a universal incremental learning (UIL) method for few-shot bird sound classification, in which the incoming task contains new or a combination of new and previously seen bird classes from a new domain. Our method uses generalizable audio embeddings from a pre-trained model, which is trained on focal recordings, to develop an incremental learner that solves few-shot bird sound classification tasks from diverse soundscape datasets. These datasets are selected from BIRB (Benchmark for Information Retrieval in Bioacoustics), a large-scale bird sounds benchmark, and used to demonstrate the performance of the proposed method. Results show that our method adapts to the incoming tasks effectively with minimal forgetting of previously seen tasks.

PDF
Hierarchical and Multimodal Learning for Heterogeneous Sound Classification
Anastasopoulou, Panagiota and Dal Rí, Francesco and Serra, Xavier and Font, Frederic
Abstract

This paper investigates multimodal and hierarchical classification strategies to enhance performance in real-world sound classification tasks, centered on the two-level structure of the Broad Sound Taxonomy. We propose a framework that enables the system to consider high-level sound categories when refining its predictions at the subclass level, thereby aligning with the natural hierarchy of sound semantics. To improve accuracy, we integrate acoustic features with semantic cues extracted from crowdsourced textual metadata of the sounds, such as titles, tags, and descriptions. During training, we utilize and compare pre-trained embeddings across these modalities, enabling better generalization across acoustically heterogeneous yet semantically related categories. Our experiments show that the use of text-audio embeddings improves classification. We also observe that, although hierarchical supervision does not significantly impact accuracy, it leads to more coherent and perceptually structured latent representations. These improvements in classification performance and representation quality make the system more suitable for downstream tasks such as sound retrieval, description, and similarity search.

PDF
Cross-Modal Attention Architectures for Language-Based Audio Retrieval
Calvet, Oscar and Torre Toledano, Doroteo
Abstract

We present our submission to Task 6 of the DCASE 2025 Challenge on language-based audio retrieval, where our team ranked second overall, with our best single system achieving the fifth-highest score among all individual submissions. Our approach investigates multiple cross-modal architectures, including both standard dual-encoders and attention-based models that leverage fine-grained interactions between audio and text embeddings. All models are trained with contrastive learning on a combination of large-scale captioned audio datasets, using PaSST and RoBERTa as backbone encoders. While each system achieves competitive results on its own, we observe consistent improvements when combining them in an ensemble, suggesting that the architectures capture complementary audio-text relationships. We support this finding with initial representational analyses, which point to differences in how these models structure the shared embedding space. Our results highlight the benefits of architectural diversity in modeling semantic similarity across modalities.

PDF
Latent Multi-view Learning for Robust Environmental Sound Representations
Ding, Sivan and Wilkins, Julia and Fuentes, Magdalena and Bello, Juan Pablo
Abstract

Self-supervised learning (SSL) approaches, such as contrastive and generative methods, have advanced environmental sound representation learning using unlabeled data. However, how these approaches can complement each other within a unified framework remains relatively underexplored. In this work, we propose a multi-view learning framework that integrates contrastive principles into a generative pipeline to capture sound source and device information. Our method encodes compressed audio latents into view-specific and view-common subspaces, guided by two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation. We evaluate our method on an urban sound sensor network dataset for sound source and sensor classification, demonstrating improved downstream performance over traditional SSL techniques. Additionally, we investigate the model's potential to disentangle environmental sound attributes within the structured latent space under varied training configurations.

PDF
Robust Detection of Overlapping Bioacoustic Sound Events
Mahon, Louis and Hoffman, Benjamin and Cusimano, Maddie and Hagiwara, Masato and James, Logan and Woolley, Sarah and Effenberger, Felix and Keen, Sara and Liu, Jen-yu and Pietquin, Olivier
Abstract

We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started, and fuses the two sets of bounding boxes with a graph-matching algorithm. We also release a new dataset of temporally-strong labels of zebra finch vocalizations designed to have high overlap. Experiments on eight datasets, including our new dataset, show Voxaboxen outperforms natural baselines and existing methods, and is robust to vocalization overlap.
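A toy sketch of the fusion idea described above (not the released Voxaboxen code): assuming the forward and backward passes each yield candidate (start, end) intervals in seconds, matched pairs can be found with a bipartite assignment on temporal IoU and then averaged.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def interval_iou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(fwd, bwd, iou_thresh=0.5):
    """Match onset-derived boxes (fwd) to offset-derived boxes (bwd) and average matched pairs."""
    if not fwd or not bwd:
        return []
    cost = np.array([[1.0 - interval_iou(f, b) for b in bwd] for f in fwd])
    rows, cols = linear_sum_assignment(cost)
    fused = []
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= iou_thresh:
            fused.append((0.5 * (fwd[r][0] + bwd[c][0]), 0.5 * (fwd[r][1] + bwd[c][1])))
    return fused

# Two overlapping vocalizations recovered by both passes.
fwd = [(0.10, 0.60), (0.50, 1.20)]
bwd = [(0.12, 0.58), (0.55, 1.18)]
print(fuse_boxes(fwd, bwd))
```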

PDF
A Lightweight Temporal Attention Module for Frequency Dynamic Sound Event Detection
Zhang, Yuliang
Abstract

Recent advances in Sound Event Detection (SED) have leveraged frequency-dynamic convolution to address the shift-variant nature of audio representations in the frequency domain. However, most existing methods overlook the temporal importance of individual frames during early feature extraction, which is critical for accurate event boundary detection. In this paper, we propose a lightweight temporal attention module integrated into convolutional SED architectures. The module computes temporal weights by compressing the frequency axis and applying per-frame attention using one of three strategies: MLP (frame-wise), Conv1D (local context), and MultiHead Attention (global context). These weights are injected either before or after the convolutional operation to enhance time-sensitive representations. Through comprehensive ablation experiments on the DCASE2021 Task4 dataset, we show that introducing temporal attention consistently improves model performance. Specifically, averaged over 10 independent runs, the proposed temporal attention module increases PSDS1 from 0.4241 to 0.4383 on FDY-CRNN, from 0.4327 to 0.4395 on DFD-CRNN, and from 0.4376 to 0.4452 on MDFD-CRNN. These improvements demonstrate that even lightweight attention mechanisms targeting temporal saliency can significantly enhance the event boundary modeling capabilities of frequency-dynamic SED systems.
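A minimal PyTorch sketch of the frame-wise (MLP) variant described above, assuming a CNN feature map of shape (batch, channels, time, frequency); the layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Frame-wise temporal attention: compress the frequency axis, score each frame with a
    small MLP, and reweight the feature map along time (illustrative configuration)."""
    def __init__(self, channels: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        frame_desc = x.mean(dim=3)                                       # (batch, channels, time)
        scores = self.mlp(frame_desc.transpose(1, 2))                    # (batch, time, 1)
        weights = torch.sigmoid(scores).transpose(1, 2).unsqueeze(-1)    # (batch, 1, time, 1)
        return x * weights                                               # inject before/after a conv block

x = torch.randn(4, 64, 156, 128)   # e.g. a CRNN feature map
print(TemporalAttention(64)(x).shape)
```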

PDF
Whale-VAD: Whale Vocalisation Activity Detection
Geldenhuys, Christiaan and Tonitz, Günther and Niesler, Thomas
Abstract

We present a lightweight sound event detection (SED) system focused on the discovery of whale calls in marine audio recordings. Our proposed system uses a hybrid CNN-BiLSTM architecture with an added residual bottleneck and depthwise convolutions to perform coherent per-frame whale call event detection. We discover that, for this task, the inclusion of spectral phase information among the input features notably improves performance. We also evaluate the effectiveness of negative batch undersampling and the inclusion of a focal loss term. As part of the 2025 BioDCASE challenge (Task 2), we compare our system to ResNet-18 and YOLOv11 models, as well as to our own baseline. All models are trained exclusively on the same subset of the publicly available ATBFL dataset. Our proposed whale call event detector improves on the development set performance of all models, including the top-performing YOLOv11, achieving an F1-score of 0.44.

PDF
Importance-Weighted Domain Adaptation for Sound Source Tracking
Zhong, Bingxiang and Dietzen, Thomas
Abstract

In recent years, deep learning has significantly advanced sound source localization (SSL). However, training such models requires large labeled datasets, and real recordings are costly to annotate in particular if sources move. While synthetic data using simulated room impulse responses (RIRs) and noise offers a practical alternative, models trained on synthetic data suffer from domain shift in real environments. Unsupervised domain adaptation (UDA) can address this by aligning synthetic and real domains without relying on labels from the latter. The few existing UDA approaches however focus on static SSL and do not account for the problem of sound source tracking (SST), which presents two specific domain adaptation challenges. First, variable-length input sequences create mismatches in feature dimensionality across domains. Second, the angular coverages of the synthetic and the real data may not be well aligned either due to partial domain overlap or due to batch size constraints, which we refer to as directional diversity mismatch. To address these, we propose a novel UDA approach tailored for SST based on two key features. We employ the final hidden state of a recurrent neural network as a fixed-dimensional feature representation to handle variable-length sequences. Further, we use importance-weighted adversarial training to tackle directional diversity mismatch by prioritizing synthetic samples similar to the real domain. Experimental results demonstrate that our approach successfully adapts synthetic-trained models to real environments, improving SST performance.
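A minimal sketch of two building blocks mentioned above: a gradient reversal layer and a domain loss with per-sample importance weights on synthetic data. The weights are assumed here to be supplied externally (e.g., from an auxiliary discriminator); the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def weighted_domain_loss(features, domain_labels, weights, domain_clf, lam=1.0):
    """Adversarial domain loss in which synthetic samples are re-weighted so that those
    resembling the real domain contribute more to feature alignment."""
    reversed_feat = GradReverse.apply(features, lam)
    logits = domain_clf(reversed_feat).squeeze(-1)        # (batch,)
    return F.binary_cross_entropy_with_logits(logits, domain_labels.float(), weight=weights)

# Toy usage: features could be the final hidden state of a recurrent tracker.
feat = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 2, (16,))     # 0 = synthetic, 1 = real
w = torch.ones(16)                      # per-sample importance weights (uniform here)
clf = torch.nn.Linear(128, 1)
weighted_domain_loss(feat, labels, w, clf).backward()
```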

PDF
A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Language Audio Models
Harish, Dilip and Abeßer, Jakob
Abstract

Reaching a semantic understanding of complex acoustic scenes requires computational models to capture the temporal-spatial sound source composition as well as individual sound events. This is a great challenge for computational models due to the large variety of everyday sound events and the extensive temporal-spectral overlap in real-life acoustic scenes. In this work, we aim to evaluate the acoustic scene understanding capabilities of two large audio-language models (LALMs). As a challenging scenario, we use the USM dataset, which features synthetic urban soundscapes with 2-6 overlapping sound sources per mixture. Our main contribution is a novel three-layer evaluation protocol, which includes four analysis tasks for low-level sound event understanding (sound event tagging), mid-level understanding and reasoning (sound polyphony estimation, sound source loudness ranking), as well as high-level scene understanding (audio captioning). We apply standardized metrics to assess the models' performances for each task. The proposed multi-layer protocol allows for a fine-grained analysis of model behavior across soundscapes of various complexity levels. Our results indicate that despite their remarkable controllability using textual instructions, the ability of state-of-the-art LALMs to understand acoustic scenes is still limited as the performance on individual analysis tasks degrades with increasing sound polyphony.

PDF
Supervised Detection of Baleen Whale Calls on Edge-Compute
van Toor, Astrid
Abstract

This paper presents an edge-optimised approach for baleen whale call detection, addressing both the detection requirements of BioDCASE 2025 Task 2 and deployment constraints similar to Task 3. Common machine learning models contain 4+ million training parameters and use architectures unsuitable for real-time edge deployment. In contrast, our model contains just 35,571 parameters (159KB) and operates efficiently on a 64-bit ARM Cortex-A53 with 512MB RAM. We applied an edge-optimised feature extraction pipeline and a custom CNN model architecture designed for real-time inference in offshore deployments. Classifying on 11.8-second detection windows, our precision-focused approach achieves 72% precision for blue whale ABZ calls and 80% for fin whale burst pulse calls, though downsweep detection lags at 18% precision. After applying a temporal head for call-specific identification as per the BioDCASE challenge requirements, ABZ call precision drops to 65% and burst pulse calls to 4%, while downsweep calls improve to 29%. Acknowledging the difficulties in call-specific identification, this work highlights the feasibility and potential of edge-optimised architectures for baleen whale detection in real-world monitoring scenarios where computational resources and power consumption are severely constrained, while addressing common challenges and next steps to improve the results.

PDF
15:20

Poster session 2

16:20

Coffee break

16:50

Discussion panel

20:00

Friday 31st of October

Day 3 Location: UPF Campus del Poblenou, Barcelona
8:30

Registration

9:00

Welcome session

9:30

Keynote 2

Gaëtan Hadjeres
Staff Research Scientist at Sony AI
K2
The Sound Effect Foundation Model: Beyond Text-to-Audio Generation
Abstract

We introduce the Sound Effect Foundation Model, Sony AI's generative approach to enhance sound effect creation and manipulation. By leveraging professional high-quality datasets focused exclusively on sound effects, our model generates high-fidelity audio with precise controls—extending beyond traditional text-to-audio capabilities. This model is easily extensible in order to fulfill professional creators' needs and workflows. Key features range from sound variation and infilling for seamless audio repairs to the creation of personalized audio characters, and much more. Via bespoke user interfaces and professional software integration, we show that our approach suggests novel workflows while enhancing existing ones, and hopefully adds AI generative models to the toolbox of professional creators.

More info
10:45

Coffee break

11:10

Poster session 3 spotlights

Efficient State-Space Model for Audio Anomaly Detection with Domain Adaptation
Emon, Jakaria and Anon, Taharim Rahman
Abstract

This paper presents a two-stage, embedding-centric framework for Unsupervised Anomalous Sound Detection (UASD), specifically addressing the challenges of first-shot generalization and computational efficiency. Our approach utilizes an efficient state-space model (SSM) backbone. Pre-training of this backbone is accelerated using a pixel unshuffle (space-to-depth) input transformation for spectrograms, which reduces training time by approximately 87.5% while preserving representation quality. Subsequently, the pre-trained model is fine-tuned with a specialized anomaly head that fuses multi-level features, combined with pseudo outlier exposure and domain-adversarial adaptation employing a gradient reversal layer. Our system demonstrates superior performance over the DCASE 2025 autoencoder baseline. Machine-specific models achieve a harmonic mean total score of 0.722. This work establishes the efficacy of SSMs for this task and offers a scalable, robust solution for UASD in dynamic acoustic environments.
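The space-to-depth step can be illustrated with PyTorch's built-in PixelUnshuffle; the downscale factor and spectrogram shape below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

# Space-to-depth: fold 2x2 spatial blocks into the channel axis, so the backbone sees a
# spectrogram with half the height and width but 4x more channels (same information).
unshuffle = nn.PixelUnshuffle(downscale_factor=2)

mel = torch.randn(8, 1, 128, 256)   # (batch, channels, mel bins, frames)
packed = unshuffle(mel)             # -> (8, 4, 64, 128)
print(packed.shape)
```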

PDF
Correlation-Based Filtering for Unsupervised Anomalous Sound Detection
Bürli, Andrin and Hamdan, Sami
Abstract

Unsupervised anomalous sound detection (ASD) under domain shift remains a key challenge for real-world deployment. We introduce a two-stage “first-shot” pipeline for DCASE 2025 Task 2 that leverages optional clean-only or noise-only supplemental recordings to improve robustness to unseen background noises. First, a correlation-based filter is trained separately on clean or noise data, separating each test mixture $x=C+N+A$ into a cleaner signal $x'=C+A$. Second, a mel-spectrogram autoencoder, augmented with SMOTE and mixup on $x'$, detects anomalies. On the development set, our method achieves a high SI-SDR for the separation task and improves the detection metrics for three out of seven components compared to the baseline. These results validate that assuming statistical independence between machine sound, background noise, and anomalies can enhance first-shot ASD. Future work will explore automated correlation estimation and integration with more advanced anomaly detection methods for the second stage.

PDF
Handling Domain Shifts for Anomalous Sound Detection: A Review of DCASE-Related Work
Wilkinghoff, Kevin and Fujimura, Takuya and Imoto, Keisuke and Le Roux, Jonathan and Tan, Zheng-Hua and Toda, Tomoki
Abstract

When detecting anomalous sounds in complex environments, one of the main difficulties is that trained models must be sensitive to subtle differences in monitored target signals, while many practical applications also require them to be insensitive to changes in acoustic domains. Examples of such domain shifts include changing the type of microphone or the location of acoustic sensors, which can have a much stronger impact on the acoustic signal than subtle anomalies themselves. Moreover, users typically aim to train a model only on source domain data, which they may have a relatively large collection of, and they hope that such a trained model will be able to generalize well to an unseen target domain by providing only a minimal number of samples to characterize the acoustic signals in that domain. In this work, we review and discuss recent publications focusing on this domain generalization problem for anomalous sound detection in the context of the DCASE challenges on acoustic machine condition monitoring.

PDF
Adjusting Bias in Anomaly Scores via Variance Minimization for Domain-Generalized Discriminative Anomalous Sound Detection
Matsumoto, Masaaki and Fujimura, Takuya and Huang, WenChin and Toda, Tomoki
Abstract

We propose an anomaly score rescaling method based on variance minimization for domain-generalized anomalous sound detection (ASD). Current state-of-the-art ASD methods face significant challenges due to pronounced domain shifts, which lead to inconsistent anomaly score distributions across domains. One promising existing approach to address this issue is to rescale anomaly scores based on local data density in the embedding space. To enable more flexible and adaptive rescaling, our proposed method introduces weighting parameters into the rescaling process and analytically optimizes them based on the score variance minimization. Experimental evaluations on the DCASE 2021-2024 ASD datasets demonstrate that our proposed method achieves significant improvements on the DCASE 2022-2024 datasets. We also confirm that the proposed method obtains weighting parameters that lead to high ASD performance.
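As an illustration of the variance-minimization idea (a simplified single-parameter version, not necessarily the paper's parameterization): if a raw score $s$ is shifted by a weighted local kNN distance $d$, the weight minimizing $\mathrm{Var}(s - w d)$ is $w^{*} = \mathrm{Cov}(s, d)/\mathrm{Var}(d)$.

```python
import numpy as np

def rescale_scores(scores: np.ndarray, knn_dist: np.ndarray):
    """Subtract a weighted local-density statistic from the anomaly scores, with the
    weight chosen in closed form to minimize the variance of the rescaled scores:
        w* = Cov(s, d) / Var(d)
    """
    w = np.cov(scores, knn_dist, bias=True)[0, 1] / np.var(knn_dist)
    return scores - w * knn_dist, w

rng = np.random.default_rng(0)
d = rng.gamma(2.0, 1.0, size=500)              # local kNN distances in an embedding space
s = 0.8 * d + rng.normal(0.0, 0.2, size=500)   # raw scores biased by local density
rescaled, w = rescale_scores(s, d)
print(round(w, 3), round(np.var(s), 3), round(np.var(rescaled), 3))
```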

PDF
ASDKit: A Toolkit for Comprehensive Evaluation of Anomalous Sound Detection Methods
Fujimura, Takuya and Wilkinghoff, Kevin and Imoto, Keisuke and Toda, Tomoki
Abstract

In this paper, we introduce ASDKit, a toolkit for the anomalous sound detection (ASD) task. Our aim is to facilitate ASD research by providing an open-source framework that collects and carefully evaluates various ASD methods. First, ASDKit provides training and evaluation scripts for a wide range of ASD methods, all handled within a unified framework. For instance, it includes the autoencoder-based official DCASE baseline, representative discriminative methods, and self-supervised learning-based methods. Second, it supports comprehensive evaluation on the DCASE 2020–2024 datasets, enabling careful assessment of ASD performance, which is highly sensitive to factors such as datasets and random seeds. In our experiments, we re-evaluate various ASD methods using ASDKit and identify consistently effective techniques across multiple datasets and trials. We also demonstrate that ASDKit reproduces state-of-the-art-level performance on the considered datasets.

PDF
LiB-TRAD: A Lithium Battery Thermal Runaway Acoustic Dataset for Anomaly Detection
Wang, Xiaoliang and Ming, Ao and Chen, Meixin and Jin, Hao
Abstract

Efficient detection of lithium battery thermal runaway is a critical factor in promoting the large-scale application of lithium batteries in energy storage and electric transportation. Traditional methods rely heavily on contact-based techniques such as temperature, current, voltage, impedance, or structural deformation monitoring, which have limitations in terms of cost, real-time performance, and scalability. In contrast, acoustic detection, with its non-contact nature, low cost, and suitability for large-scale monitoring, is emerging as a promising alternative. While previous studies have demonstrated the effectiveness of machine learning-based acoustic methods for thermal runaway detection, there is still a lack of an open acoustic dataset covering the entire process of lithium battery thermal runaway. To address this, this paper introduces the first lithium battery acoustic dataset containing both normal and thermal runaway events, annotated with abnormal events. We further evaluate several baseline models and state-of-the-art acoustic event detection models using this dataset. Experimental results show that this dataset holds strong potential for thermal runaway anomaly detection and provides a valuable data foundation and benchmark for future research.

PDF
Audio-Based Pedestrian Detection in the Presence of Vehicular Noise
Kim, Yonghyun and Han, Chaeyeon and Sarode, Akash and Posner, Noah and Guhathakurta, Subhrajit and Lerch, Alexander
Abstract

Audio-based pedestrian detection is a challenging task and has, thus far, only been explored in noise-limited environments. We present a new dataset, results, and a detailed analysis of the state-of-the-art in audio-based pedestrian detection in the presence of vehicular noise. In our study, we conduct three analyses: (i) cross-dataset evaluation between noisy and noise-limited environments, (ii) an assessment of the impact of noisy data on model performance, highlighting the influence of acoustic context, and (iii) an evaluation of the model's predictive robustness on out-of-domain sounds. The new dataset is a comprehensive 1321-hour roadside collection of traffic-rich soundscapes; each recording includes 16 kHz audio synchronized with frame-level pedestrian annotations and 1 fps video thumbnails.

PDF
Deployment of AI-based Sound Analysis Algorithms in Real-time Acoustic Sensors: Challenges and a Use Case
Sagasti, Amaia and Artís, Pere and Font, Frederic and Serra, Xavier
Abstract

Real-time acoustic sensing involves significant challenges in capturing, processing, and transmitting audio. Integrating AI models on resource-constrained devices further complicates development. This paper presents an end-to-end solution addressing these challenges: SENS, the Smart Environmental Noise System, is a low-cost sensor designed for real-time acoustic monitoring. Built on a Raspberry Pi platform, SENS captures sound continuously and processes it locally using custom-developed software based on small and efficient artificial intelligence algorithms. With a current focus on urban environments, SENS calculates acoustic parameters, including sound pressure level (SPL), and makes predictions of the perceptual sound attributes of 'pleasantness' and 'eventfulness' (ISO 12913), along with detecting the presence of specific sound sources such as vehicles, birds, and human activity. To safeguard privacy, all processing occurs directly on the device in real time, ensuring that no audio recordings are permanently stored or transferred. Additionally, the system transmits the analysis results through the wireless network to a remote server. Demonstrating its practical applicability, a network of five SENS devices has been deployed in an urban area for over three months, validating SENS as a powerful tool for analyzing and understanding soundscapes, recognizing patterns, and detecting acoustic events. The proposed flexible and reproducible technology allows reconfiguration for different applications and represents an innovative step in real-time and AI-based noise monitoring.

PDF
Synthetic data enables context-aware bioacoustic sound event detection
Hoffman, Benjamin and Robinson, David and Miron, Marius and Baglione, Vittorio and Canestrari, Daniela and Elias, Damian and Trapote, Eva and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Pietquin, Olivier
Abstract

We propose a methodology for training foundation models that enhances their in-context learning capabilities within the domain of bioacoustic signal processing. We use synthetically generated training data, introducing a domain-randomization-based pipeline that constructs diverse acoustic scenes with temporally strong labels. We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection. Our second contribution is a public benchmark of 13 diverse few-shot bioacoustics tasks. Our model outperforms previously published methods, and improves relative to other training-free methods by 64%. We demonstrate that this is due to increases in model size and data scale, as well as algorithmic improvements. We make our trained model available via an API, to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
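A toy sketch of the domain-randomization idea (not the authors' pipeline): short event clips are scattered over a noise background at random onsets, and the onset/offset pairs are kept as temporally strong labels.

```python
import numpy as np

def make_scene(events, sr=16000, duration=10.0, snr_db=10.0, rng=None):
    """Mix short event clips into a noise background at random onsets and return the
    scene plus temporally strong (onset, offset) labels in seconds."""
    if rng is None:
        rng = np.random.default_rng()
    scene = rng.normal(0.0, 1.0, int(duration * sr)) * 10 ** (-snr_db / 20)
    labels = []
    for clip in events:
        onset = rng.uniform(0, duration - len(clip) / sr)
        start = int(onset * sr)
        scene[start:start + len(clip)] += clip
        labels.append((onset, onset + len(clip) / sr))
    return scene, labels

# Toy "vocalizations": short 2 kHz tones standing in for real clips.
rng = np.random.default_rng(0)
events = [np.sin(2 * np.pi * 2000 * np.arange(int(0.3 * 16000)) / 16000) for _ in range(3)]
scene, labels = make_scene(events, rng=rng)
print([(round(a, 2), round(b, 2)) for a, b in labels])
```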

PDF
A Revisit of Audio Evaluation through Human Impressions: Defining and Modeling a Multidimensional Perceptual Task
Nishijima, Hiroshi and Saito, Daisuke and Minematsu, Nobuaki
Abstract

Current audio evaluation paradigms predominantly rely on technical metrics or single-dimensional subjective scores. These methods inadequately capture the multifaceted nature of human auditory perception. This paper reframes audio evaluation as a multidimensional perceptual task. We formally define subjective impression as a computational problem with measurable dimensions. To this end, we introduce a new dataset of 4,110 environmental sounds from FSD50K. It is annotated with five perceptual dimensions: pleasantness, clarity, brightness, calmness, and immersion. Our analysis reveals both independence and meaningful correlations within this perceptual space. A notable finding is the strong relationship between pleasantness and calmness. Furthermore, we demonstrate the feasibility of automated impression prediction. Our baseline models use fine-tuned BEATs representations and achieve a mean squared error below 0.7. This value corresponds to an average deviation of less than one point on a seven-point scale. This work provides the foundation for a human-centered evaluation of audio generation systems and sound design. It enables assessment based on nuanced perceptual qualities rather than technical fidelity alone.

PDF
MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection
Purohit, Harsh and Nishida, Tomoya and Dohi, Kota and Endo, Takashi and Kawaguchi, Yohei
Abstract

This paper proposes a method for generating machine-type-specific anomalies to evaluate the relative performance of unsupervised anomalous sound detection (UASD) systems across different machine types, even in the absence of real anomaly sound data. Conventional keyword-based data augmentation methods often produce unrealistic sounds due to their reliance on manually defined labels, limiting scalability as machine types and anomaly patterns diversify. Advanced generative models, such as MIMII-Gen, show promise but typically depend on anomalous training data, making them less effective when diverse anomalous examples are unavailable. To address these limitations, we propose a novel synthesis approach leveraging large language models (LLMs) to interpret textual descriptions of faults and automatically select audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. We validate this approach by evaluating a UASD system trained only on normal sounds from five machine types, using both real and synthetic anomaly data. Experimental results reveal consistent trends in relative detection difficulty across machine types between synthetic and real anomalies. This finding supports our hypothesis and highlights the effectiveness of the proposed LLM-based synthesis approach for relative evaluation of UASD systems.

PDF
Perceptual Detection of Packet Loss-Induced Audio Artifacts in Black-Box Wireless Music Systems
Guimarães, Victória and Bentes, Luiz and Pires, Ana and Freitas, Rosiane
Abstract

Audible degradations introduced by wireless audio transmission, such as clicks, dropouts, and glitches, can significantly compromise the perceived quality of music. These artifacts are typically caused by packet loss and often occur as short, perceptually salient events. In this work, we approach their detection as a binary sound event classification task based solely on perceptual analysis of the audio signal. The study focuses on black-box scenarios, where only the resulting audio is available for analysis, without any access to packet-level metadata, network diagnostics, or internal system information. We introduce BlueData, a dataset of music recordings labeled as clean or degraded. Degradation labels were assigned through listening tests under controlled Bluetooth transmission impairments, reflecting the presence of perceptual artifacts. A range of classical machine learning classifiers were trained using handcrafted acoustic features. Among them, models such as XGBoost and CatBoost achieved AUC scores close to 0.97, while K-Nearest Neighbors (KNN) reached the highest recall for the degraded class, with 85.09%. These results demonstrate the effectiveness of lightweight and interpretable models in identifying transmission-induced perceptual degradations directly from the audio signal and position BlueData as a relevant dataset for research in perceptual quality monitoring under black-box conditions.

PDF
ToyADMOS2025: The Evaluation Dataset for the DCASE2025T2 First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring
Harada, Noboru and Niizumi, Daisuke and Ohishi, Yasunori and Takeuchi, Daiki and Yasuda, Masahiro
Abstract

Recently, various applications have been explored that utilize machine learning to detect anomalies in machinery solely by listening to operational sounds. This paper introduces the newly recorded ToyADMOS dataset, “ToyADMOS2025”, for the DCASE 2025 Challenge Task 2: First-shot anomalous sound detection for machine condition monitoring (DCASE2025T2). New machine types, such as AutoTrash, HomeCamera, ToyPet, and ToyRCCar, were newly recorded as a part of the Additional training and Evaluation datasets. This paper also shows benchmark results of the First-shot baseline implementation (with a simple autoencoder and selective Mahalanobis modes) on the ToyADMOS2025.

PDF
12:00

Poster session 3

13:00

Lunch

14:30

Poster session 4 spotlights

Towards Audio-based Zero-Shot Action Recognition in Kitchen Environments
Gebhard, Alexander and Triantafyllopoulos, Andreas and Tsangko, Iosif and Schuller, Björn W.
Abstract

Human actions often generate sounds that can be recognized to infer their cause. In action recognition, actions can usually be broken down to a combination of verbs and nouns, of which there exists a very large number of enumerations. Contemporary datasets, like EPIC-KITCHENS, cover a wide gamut of the potential action space, but not its entirety. Arguably, the holistic characterization of human actions through the sounds they generate requires the use of zero-shot learning (ZSL). In this contribution, we explore the feasibility of ZSL for recognizing a) nouns, b) verbs, or c) actions on EPIC-KITCHENS. To achieve this, we use linguistic intermediation, by generating descriptions of each word corresponding to our classes using a pre-trained large language model (LLAMA-2). Our results show that human action recognition from sounds is possible in zero-shot fashion, as we consistently obtain results over chance.

PDF
Region-Specific Audio Tagging for Spatial Sound
Zhao, Jinzheng and Xu, Yong and Liu, Haohe and Berghi, Davide and Qian, Xinyuan and Kong, Qiuqiang and Zhao, Junqi and Plumbley, Mark and Wang, Wenwu
Abstract

Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task which labels sound events in a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. Then we extend state-of-the-art audio tagging systems such as pre-trained audio neural networks (PANNs) and audio spectrogram transformer (AST) to the proposed region-specific audio tagging task. Experimental results on both the simulated and the real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating the directional features is beneficial for omnidirectional tagging.

PDF
Analysing Human-Generated Captions for Audio and Visual Scenes
Martin, Irene and Sudarsanam, Parthasaarathy and Virtanen, Tuomas
Abstract

This work investigates how humans describe audio and visual content by analysing single-sentence captions for each modality. While prior research has focused on improving captioning models and their evaluation, less attention has been paid to how linguistic features differ across modalities. We analyse the distribution of parts of speech and domain-specific vocabulary and examine how a structure-based method and neural network-based model classify captions as audio-based or image-based. The structure-based approach reveals how audio captions include verbs related to sound production (e.g., heard, speaking, playing), while image captions use verbs describing physical actions (e.g., sitting, walking, holding). We also study how the input captions influence neural network predictions using gradient-based attribution. Attribution scores from integrated gradients reveal that words like growling, sounded, howling, and chirp strongly support audio classification, while words like grouped, cupcakes, and participates are linked to image captions.

PDF
An Entropy-Guided Curriculum Learning Strategy for Data-Efficient Acoustic Scene Classification under Domain Shift
Zhang, Peihong and Liu, Yuxuan and Li, Zhixin and Sang, Rui and Tan, Yizhou and Cai, Yiqiang and Li, Shengchen
Abstract

Acoustic Scene Classification (ASC) faces challenges in generalizing across recording devices, particularly when labeled data is limited. The DCASE 2024 Challenge Task 1 highlights this issue by requiring models to learn from small labeled subsets recorded on a few devices. These models then need to generalize to recordings from previously unseen devices under strict complexity constraints. While techniques such as data augmentation and the use of pre-trained models are well-established for improving model generalization, optimizing the training strategy represents a complementary yet less-explored path that introduces no additional architectural complexity or inference overhead. Among various training strategies, curriculum learning offers a promising paradigm by structuring the learning process from easier to harder examples. In this work, we propose an entropy-guided curriculum learning strategy to address the domain shift problem in data-efficient ASC. Specifically, we quantify the uncertainty of device domain predictions for each training sample by computing the Shannon entropy of the device posterior probabilities estimated by an auxiliary domain classifier. Using entropy as a proxy for domain invariance, the curriculum begins with high-entropy samples and gradually incorporates low-entropy, domain-specific ones to facilitate the learning of generalizable representations. Experimental results on multiple DCASE 2024 ASC baselines demonstrate that our strategy effectively mitigates domain shift, particularly under limited labeled data conditions. Our strategy is architecture-agnostic and introduces no additional inference cost, making it easily integrable into existing ASC baselines and offering a practical solution to domain shift.
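A minimal sketch of the entropy-based ordering (the staging scheme below is illustrative): Shannon entropy is computed from the device posteriors of an auxiliary domain classifier, and training samples are presented from high to low entropy.

```python
import numpy as np

def entropy_curriculum(domain_posteriors: np.ndarray, num_stages: int = 3):
    """Order training samples by the Shannon entropy of their device-domain posteriors
    (high entropy = device-ambiguous, presented first) and split them into curriculum stages."""
    p = np.clip(domain_posteriors, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)   # per-sample Shannon entropy
    order = np.argsort(-entropy)             # descending: most device-ambiguous samples first
    return np.array_split(order, num_stages)

# Toy posteriors over 4 recording devices from an auxiliary domain classifier.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(alpha=np.full(4, 0.5), size=1000)
stages = entropy_curriculum(posteriors)
print([len(s) for s in stages])
```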

PDF
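A minimal sketch of the entropy-guided ordering described above, assuming an auxiliary domain classifier has already produced device posterior probabilities for every training sample; the curriculum stage fractions are illustrative.

```python
import numpy as np

def entropy_curriculum_order(device_posteriors):
    """Order training samples from high to low Shannon entropy of the device
    posteriors p(d|x) produced by an auxiliary domain classifier.

    device_posteriors: (num_samples, num_devices) array of probabilities.
    Returns indices so that domain-invariant (high-entropy) samples come first.
    """
    p = np.clip(device_posteriors, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)     # Shannon entropy per sample
    return np.argsort(-entropy)                  # descending entropy

# Example: expose 30%, 60%, then 100% of the ordered samples in successive
# curriculum stages (the stage fractions are an assumption for this sketch).
order = entropy_curriculum_order(np.random.dirichlet(np.ones(9), size=1000))
stages = [order[: int(f * len(order))] for f in (0.3, 0.6, 1.0)]
```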
Sound Event Classification meets Data Assimilation with Distributed Fiber-Optic Sensing
Tonami, Noriyuki and Yajima, Yoshiyuki and Kohno, Wataru and Mishima, Sakiko and Kondo, Reishi and Hino, Tomoyuki
Abstract

Distributed Fiber-Optic Sensing (DFOS) is a promising technique for large-scale acoustic monitoring. However, its wide variation in installation environments and sensor characteristics causes spatial heterogeneity. This heterogeneity makes it difficult to collect representative training data. It also degrades the generalization ability of learning-based models, such as fine-tuning methods, under a limited amount of training data. To address this, we formulate Sound Event Classification (SEC) as "data assimilation" in an embedding space. Instead of training models, we infer sound event classes by combining pretrained audio embeddings with simulated DFOS signals. Simulated DFOS signals are generated by applying various frequency responses and noise patterns to microphone data, which allows for diverse prior modeling of DFOS conditions. Our method achieves out-of-domain (OOD) robust classification without requiring model training. The proposed method achieved accuracy improvements of 6.42, 14.11, and 3.47 percentage points compared with conventional zero-shot and two types of fine-tuning methods, respectively. By employing the simulator in the framework of data assimilation, the proposed method also enables precise estimation of physical parameters from observed DFOS signals.

PDF
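A toy sketch of the training-free classification idea above: simulated DFOS signals are generated by applying randomly drawn frequency responses and noise to labelled microphone clips, and an observed DFOS signal is assigned the label of its most similar simulated embedding. The simulator, the `embed` callable (standing in for a pretrained audio embedding model), and the nearest-neighbour rule are simplifying assumptions, not the authors' system.

```python
import numpy as np

def simulate_dfos(mic_audio, rng):
    """Toy DFOS simulator: apply a random smooth frequency response and
    additive noise to microphone audio (a crude stand-in for DFOS physics)."""
    spec = np.fft.rfft(mic_audio)
    response = np.interp(np.linspace(0, 1, spec.size),
                         np.linspace(0, 1, 8), rng.uniform(0.2, 1.0, 8))
    filtered = np.fft.irfft(spec * response, n=mic_audio.size)
    return filtered + rng.normal(scale=0.05 * np.std(mic_audio), size=mic_audio.size)

def classify_by_assimilation(dfos_obs, mic_bank, labels, embed, rng, n_sim=16):
    """Infer the sound event class of an observed DFOS signal without training:
    embed simulated DFOS versions of labelled microphone clips and return the
    label of the nearest simulated embedding (cosine similarity)."""
    obs_emb = embed(dfos_obs)
    best_label, best_sim = None, -np.inf
    for clip, label in zip(mic_bank, labels):
        for _ in range(n_sim):
            sim_emb = embed(simulate_dfos(clip, rng))
            cos = np.dot(obs_emb, sim_emb) / (
                np.linalg.norm(obs_emb) * np.linalg.norm(sim_emb) + 1e-12)
            if cos > best_sim:
                best_sim, best_label = cos, label
    return best_label
```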
Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN
He, Yuxuan and Raake, Alexander and Abeßer, Jakob
Abstract

Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing, aiming to classify the location of an environmental audio recording. Recent advances focus on improving ASC model efficiency, particularly in resource-constrained environments. Convolutional neural networks (CNNs) remain the dominant approach due to their high performance, with recent focus on 1D kernels, such as in the Time-Frequency Separate Network (TF-SepNet), for reducing model complexity and computational cost. However, TF-SepNet performs feature extraction using a fixed receptive field in both time and frequency dimensions, which restricts its ability to capture multiscale contextual patterns. In this study, we investigate the integration of multiscale feature extraction modules into TF-SepNet, with the aim of improving model efficiency by balancing accuracy and complexity. We propose three architectures, TFSepDCD-Net, TFSepSPP-Net, and TFSepASPP-Net, each with two architectural variants based on TF-SepNet: one replaces its max pooling layers, and the other replaces its final convolutional layer. Each architecture has three configurations corresponding to different model sizes—small, medium, and large—to explore the tradeoff between accuracy and model complexity. Our experiments show that incorporating multiscale modules allows smaller models to achieve comparable or even superior accuracy to larger baselines. These findings highlight the potential of multiscale representations for improving the efficiency of CNN-based ASC systems, especially in 1D separate architectures like TF-SepNet.

PDF
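To make the multiscale idea concrete, here is a PyTorch sketch of an ASPP-style block with parallel 1D convolutions at different dilation rates, the kind of module that could replace a fixed-receptive-field layer in a TF-SepNet-like backbone; channel sizes and dilation rates are illustrative, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class ASPP1d(nn.Module):
    """Parallel 1D convolutions with different dilation rates (multiscale
    context), fused by a 1x1 convolution. All hyperparameters are
    illustrative assumptions for this sketch."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):               # x: (batch, channels, time_or_freq)
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))

# Example: a 64-channel feature map with 250 frames keeps its shape
y = ASPP1d(64)(torch.randn(8, 64, 250))   # -> (8, 64, 250)
```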
Discriminative Anomalous Sound Detection Using Pseudo Labels, Target Signal Enhancement, and Ensemble Feature Extractors
Fujimura, Takuya and Kuroyanagi, Ibuki and Toda, Tomoki
Abstract

We propose discriminative anomalous sound detection (ASD) systems designed to handle unlabeled data, noisy environments, and first-shot conditions. First, since discriminative methods suffer from significant performance degradation under unlabeled conditions, we generate pseudo labels to effectively train the discriminative feature extractors. Second, to improve noise robustness, we introduce a target signal enhancement (TSE) model as a pre-processing step. The TSE model is trained utilizing a small amount of clean machine sounds, together with a larger amount of noisy machine sounds. Third, to increase robustness across various machine types in first-shot conditions, we employ diverse architectures as feature extractors and ensemble their anomaly scores. Experimental results show that our systems achieve official scores of 64.85% and 59.99% on the DCASE 2025 development and evaluation sets, respectively, where the score is calculated as the harmonic mean of the AUC and partial AUC ($p = 0.1$) over all machine types and domains.

PDF
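For reference, the official score quoted above (the harmonic mean of AUC and partial AUC with p = 0.1 over machine types and domains) can be computed along the following lines; this is a sketch based on the stated metric definition, using scikit-learn's partial-AUC option, and is not necessarily identical to the challenge's evaluation code.

```python
import numpy as np
from scipy.stats import hmean
from sklearn.metrics import roc_auc_score

def official_score(results, p=0.1):
    """results: list of (y_true, y_score) pairs, one per machine type/domain,
    with y_true = 1 for anomalous clips and 0 for normal clips.
    Returns the harmonic mean of AUC and partial AUC (FPR <= p) over all pairs."""
    metrics = []
    for y_true, y_score in results:
        auc = roc_auc_score(y_true, y_score)
        pauc = roc_auc_score(y_true, y_score, max_fpr=p)   # partial AUC
        metrics.extend([auc, pauc])
    return hmean(np.array(metrics))
```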
Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
Nihal, Md Ragib Amin and Yen, Benjamin and Ashizawa, Takeshi and Nakadai, Kazuhiro
Abstract

Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like cross-correlation and Dynamic Time Warping assume simple drift patterns and provide no reliability measures. Meanwhile, recent deep learning models typically treat alignment as a binary classification task, overlooking inter-channel dependencies and uncertainty estimation. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. We extend BEATs encoders with cross-attention layers to model temporal relationships between channels. We also develop a confidence-weighted scoring function that uses the full prediction distribution instead of binary thresholding. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with an average MSE of 0.30 across test datasets, compared to 0.58 for the deep learning baseline. On individual datasets, we achieved 0.14 MSE on ARU data (77% reduction) and 0.45 MSE on zebra finch data (18% reduction). The framework supports probabilistic temporal alignment, moving beyond point estimates. While validated in a bioacoustic context, the approach is applicable to a broader range of multi-channel audio tasks where alignment confidence is critical. Code available at: https://github.com/Ragib-Amin-Nihal/BEATsCA

PDF
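The confidence-weighted scoring idea can be illustrated with a small numerical sketch: instead of thresholding or taking the argmax of a per-offset prediction distribution, the estimate is the probability-weighted mean offset and the distribution's entropy yields a confidence value. The offset grid and the entropy-based confidence definition are assumptions for illustration, not the authors' exact scoring function.

```python
import numpy as np

def confidence_weighted_offset(logits, candidate_offsets):
    """logits: unnormalised scores over candidate time offsets for a channel pair.
    Returns (expected_offset, confidence), where confidence is 1 minus the
    normalised entropy of the softmax distribution (1 = peaked, 0 = flat)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    expected = float(np.sum(p * candidate_offsets))   # soft alignment estimate
    entropy = -np.sum(p * np.log(p + 1e-12))
    confidence = 1.0 - entropy / np.log(len(p))       # normalised to [0, 1]
    return expected, confidence

# Example: candidate offsets from -50 ms to +50 ms in 1 ms steps
offsets = np.arange(-50, 51, dtype=float)
est, conf = confidence_weighted_offset(np.random.randn(101), offsets)
```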
Assessing a Domain-Adaptive Deployment Workflow for Selective Audio Recording in Wildlife Acoustic Monitoring
Azziz, Julia and Lema, Josefina and Anzibar, Maximiliano and Ziegler, Lucía and Steinfeld, Leonardo and Rocamora, Martín
Abstract

Passive acoustic monitoring is a valuable tool for wildlife research, but scheduled recording often results in large volumes of audio, much of which may not be of interest. Selective audio recording, where audio is only saved when relevant activity is detected, offers an effective alternative. In this work, we leverage a low-cost embedded system that implements selective recording using an on-device classification model and evaluate its deployment for penguin vocalization detection. To address the domain shift between training and deployment conditions (e.g. environment, recording device), we propose a lightweight domain adaptation strategy based on fine-tuning the model with a small amount of location-specific data. We replicate realistic deployment scenarios using data from two geographically distinct locations, Antarctica and Falkland Islands, and assess the impact of fine-tuning on classification and selective recording performance. Our results show that fine-tuning with location-specific data substantially improves generalization ability and reduces both false positives and false negatives in selective recording. These findings highlight the value of integrating model fine-tuning into field monitoring workflows, in order to improve the reliability of acoustic data collection.

PDF
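The lightweight domain adaptation step described above amounts to fine-tuning a deployed classifier on a small amount of location-specific data; the PyTorch sketch below freezes a backbone and updates only a small classification head. The model, head, data loader, and training settings are placeholders, not the embedded system used in the paper.

```python
import torch
import torch.nn as nn

def finetune_on_location(backbone: nn.Module, head: nn.Module, loader,
                         epochs=5, lr=1e-3, device="cpu"):
    """Lightweight domain adaptation: freeze `backbone`, fine-tune only the
    small classification `head` on a few labelled clips recorded at the
    deployment location. All components here are illustrative placeholders."""
    backbone.to(device).eval()
    head.to(device).train()
    for param in backbone.parameters():
        param.requires_grad_(False)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()              # target call present / absent
    for _ in range(epochs):
        for features, labels in loader:           # small location-specific dataset
            features, labels = features.to(device), labels.to(device)
            with torch.no_grad():
                emb = backbone(features)
            loss = loss_fn(head(emb).squeeze(-1), labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```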
On the Role of Training Class Distribution in Zero-Shot Audio Classification
Dogan, Duygu and Xie, Huang and Heittola, Toni and Virtanen, Tuomas
Abstract

Zero-shot learning (ZSL) enables the classification of audio samples into classes that are not seen during training by transferring semantic information learned from seen classes to unseen ones. Thus, the ability of zero-shot models to generalize to unseen classes is inherently affected by the training data. While most audio ZSL studies focus on improving model architectures, the effect of training class distribution in the audio embedding space has not been well explored. In this work, we investigate how the distribution of training classes in audio embedding space, both internally and in relation to unseen classes, affects zero-shot classification performance. We design two controlled experimental setups to understand the impact of training classes: (i) a similarity-based configuration, where we experiment with varying acoustic similarity between training classes and unseen test classes, and (ii) a diversity-based configuration, where the training sets are constructed with different levels of coverage in the audio embedding space. We conduct our experiments on a subset of AudioSet, evaluating zero-shot classification performance under different training class configurations. Our experiments demonstrate that both higher acoustic similarity between training and test classes and higher acoustic diversity among training classes improve zero-shot classification accuracy.

PDF
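For background on the evaluation setup above, zero-shot audio classification typically scores an audio embedding against semantic embeddings of candidate class labels in a shared space; the sketch below assumes such encoders exist (they are not shown) and simply applies cosine similarity.

```python
import numpy as np

def zero_shot_classify(audio_embedding, class_embeddings, class_names):
    """Pick the unseen class whose semantic embedding is most similar to the
    audio embedding (cosine similarity in a shared space). The encoders that
    produce these embeddings are assumed and not shown here."""
    a = audio_embedding / (np.linalg.norm(audio_embedding) + 1e-12)
    c = class_embeddings / (np.linalg.norm(class_embeddings, axis=1,
                                           keepdims=True) + 1e-12)
    scores = c @ a                        # one similarity per candidate class
    return class_names[int(np.argmax(scores))], scores

# Example with random stand-in embeddings for three unseen classes
rng = np.random.default_rng(0)
label, scores = zero_shot_classify(rng.normal(size=128),
                                   rng.normal(size=(3, 128)),
                                   ["dog bark", "siren", "rain"])
```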
CochlScene Pre‑Training and Device‑Aware Distillation for Low-Complexity Acoustic Scene Classification
Karasin, Dominik and Olariu, Ioan-Cristian and Schöpf, Michael and Szymańska, Anna
Abstract

Acoustic Scene Classification (ASC) aims to categorize short audio clips into pre-defined scene classes. The DCASE 2025 Challenge Task 1 evaluates ASC systems on the TAU Urban Acoustic Scenes 2022 Mobile dataset, under strict low complexity constraints (128 kB memory, 30 MMACs), with only 25% of labels available and device IDs provided at inference. In this work, we present an ASC system that exploits device type information and pretraining on an external ASC dataset to improve the classification performance. In addition, we conduct an ablation study to quantify the impact of each component in our pipeline. Our approach centers on a compact CP-Mobile student model distilled via Bayesian ensemble averaging from different combinations of CP-ResNet and BEATs teachers. We evaluate domain-specific pre-training on the CochlScene dataset on both student and teachers to compensate for label scarcity. Additionally, we apply a rich data augmentation suite, of which Device Impulse Response augmentation was particularly effective. Finally, we exploit device IDs to fine-tune specialized student and teacher models per recording device. On the TAU Urban 2022 development-test dataset, our system achieved a macro-averaged accuracy of 60.5%, representing an 8.61 percentage point improvement over the DCASE baseline, securing us the top rank in the DCASE 2025 Task 1 Challenge.

PDF
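A hedged sketch of the distillation step described above: the student is trained against the averaged softened predictions of several teacher models plus the hard labels. The uniform teacher averaging, temperature, and loss weighting are illustrative assumptions rather than the submitted system's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, kd_weight=0.5):
    """Knowledge distillation against an ensemble of teachers.

    student_logits: (batch, classes); teacher_logits_list: list of
    (batch, classes) tensors from different teachers; labels: (batch,) indices."""
    # Average the teachers' softened probabilities (simple ensemble averaging)
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return kd_weight * kd + (1.0 - kd_weight) * ce
```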
On Temporal Guidance and Iterative Refinement in Audio Source Separation
Morocutti, Tobias and Greif, Jonathan and Primus, Paul and Schmid, Florian and Widmer, Gerhard
Abstract

Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline—audio tagging followed by label-conditioned source separation—but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator’s output from previous iterations. These advancements lead to significant improvements in both audio tagging and source separation performance, as demonstrated by our system's second-place finish in Task 4 of the DCASE Challenge 2025. Our implementation and model checkpoints are available in our GitHub repository: https://github.com/theMoro/dcase25task4.

PDF
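The iterative refinement mechanism can be pictured as a simple loop in which the separator's previous estimate is fed back as an additional input; the `separator` callable and its conditioning interface below are placeholders, not the authors' architecture.

```python
import torch

def iterative_separation(separator, mixture, class_embedding,
                         sed_activity, n_iters=3):
    """Recursively refine a separated source: at each step the separator is
    conditioned on the class embedding, the frame-level SED activity, and its
    own previous estimate (zeros at the first iteration).

    `separator` is assumed to be a callable
    separator(mixture, class_embedding, sed_activity, previous_estimate) -> estimate.
    """
    estimate = torch.zeros_like(mixture)
    for _ in range(n_iters):
        estimate = separator(mixture, class_embedding, sed_activity, estimate)
    return estimate
```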
Self-Guided Target Sound Extraction and Classification Through Universal Sound Separation Model and Multiple Clues
Kwon, Younghoo and Lee, Dongheon and Kim, Dohwan and Choi, Jung-Woo
Abstract

This paper introduces a multi-stage self-directed framework designed to address the spatial semantic segmentation of sound scenes (S5) task in the DCASE 2025 Task 4 challenge. This framework integrates models focused on three distinct tasks: Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE). Initially, USS breaks down a complex audio mixture into separate source waveforms. Each of these separated waveforms is then processed by an SC block, generating two critical pieces of information: the waveform itself and its corresponding class label. These serve as inputs for the TSE stage, which isolates the source that matches this information. Since these inputs are produced within the system, the extraction target is identified autonomously, removing the necessity for external guidance. The extracted waveform can be looped back into the classification task, creating a cycle of iterative refinement that progressively enhances both separability and labeling accuracy. We thus call our framework a multi-stage self-guided system due to these self-contained characteristics. On the official evaluation dataset, the proposed system achieves an 11.00 dB class-aware signal-to-distortion ratio improvement (CA-SDRi) and 55.8% accuracy in label prediction, outperforming the ResUNetK baseline by 4.4 dB and 4.3%, respectively, and achieving first place among all submissions.

PDF
15:20

Poster session 4

16:20

Coffee break

16:50

Awards and closing

17:30

End of workshop