DCASE Workshop 2021 Proceedings

The proceedings of the DCASE2021 Workshop have been published as an electronic publication:

Font, Frederic; Mesaros, Annamaria; Ellis, Daniel P. W.; Fonseca, Eduardo; Fuentes, Magdalena; Elizalde, Benjamin (eds.), Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), Nov. 2021.

ISBN (Electronic): 978-84-09-36072-7
DOI: 10.5281/zenodo.5770113


Link PDF
Total cites: 779 (updated 30.11.2023)
Abstract

Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods rely on existing datasets for optimization and/or evaluation. Given the limited information held by AAC datasets, it is very likely that AAC methods learn only the information contained in the utilized datasets. In this paper we present a first approach for continuously adapting an AAC method to new information, using a continual learning method. In our scenario, a pre-optimized AAC method is applied to unseen general audio signals and can update its parameters in order to adapt to the new information, given a new reference caption. We evaluate our method using a freely available, pre-optimized AAC method and two freely available AAC datasets. We compare our proposed method against three scenarios: two in which the method is trained on one of the datasets and evaluated on the other, and a third in which it is trained on one dataset and fine-tuned on the other. The obtained results show that our method achieves a good balance between distilling new knowledge and not forgetting the previous one.

Cites: 9 ( see at Google Scholar )

PDF
Abstract

Anomalous Sound Detection (ASD) is a popular topic in deep learning and has attracted the attention of numerous researchers due to its practical applications within industry. Under unsupervised conditions, how to better discover the inherent consistency of normal sound clips becomes a key issue in ASD. In this paper, we propose a novel training framework that jointly trains two different feature extractors using a contrastive loss to obtain a better representation of normal sounds in the latent space. We evaluate our framework on the development dataset of DCASE 2021 Challenge Task 2. Our framework is a combination of two baseline systems from the challenge: 1) an AutoEncoder-based model and 2) a MobileNetV2-based model. Our approach trains two models, whereas during inference only model 2) is used. Experimental results indicate that the MobileNetV2-based model trained under our proposed training framework exceeds the baseline model in terms of the official score metric. Since we participated in the challenge and submitted a system trained with the proposed framework and some data augmentation methods, we also analyze the results of DCASE 2021 Challenge Task 2 and discuss the effect of the median filter as a data augmentation technique. Notably, our proposed approach achieves first place for anomaly detection on the machine type "Fan", with an AUC of 90.68 and a pAUC of 79.99.
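
A minimal sketch of the kind of contrastive objective described above, assuming two encoders that each map the same batch of normal clips to embeddings; the exact loss formulation and encoder pairing used by the authors may differ.

import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.1):
    # z_a, z_b: (batch, dim) embeddings of the same clips from the two extractors.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching rows/columns are positives; other clips in the batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))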

Cites: 1 ( see at Google Scholar )

PDF
Abstract

This paper describes our submission to the DCASE 2021 Challenge. Unlike most other approaches, our work focuses on training a lightweight, well-performing model that can be used in real-world applications. Compared to the baseline, our model contains only 600k parameters, resulting in a size of 2.7 MB on disk, making it viable for applications on low-resource devices such as mobile phones. As a novelty, our approach uses unsupervised data augmentation (UDA) as the primary consistency criterion, which we show can achieve performance competitive with the more common mean-teacher paradigm. On the validation set, our single model peaks at a PSDS-1 of 36.91 and a PSDS-2 of 57.17, outperforming the baseline by 2.7 and 5.0 absolute points, respectively. The best submitted ensemble system, using a 5-way fusion, achieves a PSDS-1 of 38.23 and a PSDS-2 of 62.29 on the validation dataset. Our system ranks 7th in the official DCASE2021 Task 4 challenge ranking and is the best-performing model without post-processing, while also having the fewest parameters (3.4 M) by a large margin. Post-challenge evaluation reveals that, by applying simple median post-processing, our approach achieves performance comparable to the 5th place. The source code is available at https://github.com/bibiaaaa/SmallRice_DCASE2021Challenge.
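
A simplified sketch of a UDA-style consistency term for frame-level SED, assuming the model returns frame-wise logits; the augmentations, sharpening, and thresholds of the actual system may differ.

import torch
import torch.nn.functional as F

def uda_consistency(model, x_weak, x_strong, threshold=0.7):
    # Pseudo targets from the weakly augmented view, with gradients blocked.
    with torch.no_grad():
        p_weak = torch.sigmoid(model(x_weak))
        mask = (p_weak > threshold) | (p_weak < 1.0 - threshold)  # confident frames only
    p_strong = torch.sigmoid(model(x_strong))
    if mask.sum() == 0:
        return p_strong.new_zeros(())
    # Push predictions on the strongly augmented view towards the pseudo targets.
    return F.binary_cross_entropy(p_strong[mask], p_weak[mask])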

Cites: 5 ( see at Google Scholar )

PDF
Abstract

We introduce several novel knowledge distillation techniques for training a single shallow model of three recurrent layers for acoustic event detection (AED). These techniques allow us to train a generic shallow student model without many convolutional layers, ensembling, or custom modules. Gradual incorporation of pseudolabeled data, using strong and weak pseudolabels to train our student model, event masking in the loss function, and a custom SpecAugment procedure with event-dependent time masking all contribute to a strong event-based F1-score of 42.7%, which matches the top submission score, compared to 34.7% when training with a generic knowledge distillation method. For comparison to state-of-the-art performance, we use the ensemble model of the top submission in the challenge as a fixed teacher model.

Cites: 2 ( see at Google Scholar )

PDF
Abstract

In this paper, we describe our multi-resolution mean teacher systems for DCASE 2021 Task 4: Sound event detection and separation in domestic environments. Aiming to take advantage of the different lengths and spectral characteristics of each target category, we follow the multi-resolution feature extraction approach that we introduced for last year's edition. It is found that each one of the proposed Polyphonic Sound Detection Score (PSDS) scenarios benefits from either a higher temporal resolution or a higher frequency resolution. Additionally, the combination of several time-frequency resolutions through model fusion is able to improve the PSDS results in both scenarios. Furthermore, a class-wise analysis of the PSDS metric is provided, indicating that the detection of each event category is optimized with different resolution points or model combinations.
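
A small sketch of how such multi-resolution features can be computed, assuming librosa and illustrative (n_fft, hop_length) pairs rather than the exact settings of the paper.

import librosa

def multi_resolution_mels(path, resolutions=((2048, 1024), (1024, 512), (512, 256))):
    # Each (n_fft, hop_length) pair trades frequency resolution for temporal resolution.
    y, sr = librosa.load(path, sr=16000)
    feats = []
    for n_fft, hop in resolutions:
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=128)
        feats.append(librosa.power_to_db(mel))     # log-mel spectrogram
    return feats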

Cites: 4 ( see at Google Scholar )

PDF
Abstract

In this paper we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments, where it achieved the fourth rank. Our presented solution is an advancement of our system used in the previous edition of the task. We use a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models which are trained using strong pseudo labels provided by the FBCRNN. Our advancement over our earlier model is threefold. First, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training. Second, we perform multiple iterations of self-training for both the FBCRNN and the tag-conditioned SED models. Third, while we used only tag-conditioned CNNs as our SED model in the previous edition, here we explore sophisticated tag-conditioned SED model architectures, namely bidirectional CRNNs and bidirectional convolutional transformer neural networks (CTNNs), and combine them. With metric- and class-specific tuning of median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves polyphonic sound event detection scores (PSDS) of 0.455 for scenario 1 and 0.684 for scenario 2 on the public evaluation set, as well as a collar-based F1-score of 0.596, outperforming the baselines and our model from the previous edition by far. Source code is publicly available at https://github.com/fgnt/pb_sed.
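
A minimal sketch of class-wise median filtering of frame-level probabilities, with hypothetical filter lengths; in the actual system the lengths are tuned per metric and per class.

import numpy as np
from scipy.ndimage import median_filter

def classwise_median_filter(probs, filter_lengths):
    # probs: (n_frames, n_classes) event probabilities; filter_lengths: one odd window per class.
    smoothed = np.empty_like(probs)
    for c, length in enumerate(filter_lengths):
        smoothed[:, c] = median_filter(probs[:, c], size=length, mode='nearest')
    return smoothed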

Cites: 14 ( see at Google Scholar )

PDF
Abstract

Acoustic scene classification (ASC) has seen tremendous progress from the combined use of convolutional neural networks (CNNs) and signal processing strategies. In this paper, we investigate the use of two common feature representations within the audio understanding domain, the raw waveform and the Mel-spectrogram, and measure their degree of complementarity when using both representations for feature fusion. We introduce a new model paradigm for acoustic scene classification that fuses features learned from Mel-spectrograms and raw waveforms in separate feature extraction branches. Our experimental results show that the proposed fusion model significantly outperforms the baseline audio-only sub-network on the DCASE 2021 Challenge Task 1B (a 5.7% increase in accuracy and a 12.7% reduction in loss). We further show that the learned features of raw waveforms and Mel-spectrograms are indeed complementary and that there is a consistent improvement in classification performance over models trained on Mel-spectrograms or waveforms alone.

Cites: 5 ( see at Google Scholar )

PDF
Abstract

The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), the data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximation and Projections (UMAPs) for log-STFT, log-Mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our investigation, we look for two beneficial qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate the diagnosis and development of further representation and detection approaches. In particular, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available.
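
A rough sketch of the kind of UMAP inspection described above, with random stand-in data in place of the actual log-STFT, log-Mel, or L3 clip representations.

import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))         # stand-in for clip-level representations
labels = rng.integers(0, 6, size=500)   # stand-in for section/domain metadata

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='euclidean', random_state=0)
emb_2d = reducer.fit_transform(X)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=3, cmap='tab10')
plt.title('UMAP of clip-level representations')
plt.show()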

Cites: 2 ( see at Google Scholar )

PDF
Abstract

Music plays an important role in human cultures and constitutes an integral part of urban soundscapes. In order to make sense of these soundscapes, machine listening models should be able to detect and classify street music. Yet, the lack of well-curated resources for training and evaluating these models currently hinders their development. We present MONYC, an open dataset of 1.5k music clips as recorded by the sensors of the Sounds of New York City (SONYC) project. MONYC contains audio data and spatiotemporal metadata, i.e., coarse sensor location and timestamps. In addition, we provide multilabel genre tags from four annotators as well as four binary tags: whether the music is live or recorded; loud or quiet; single-instrument or multi-instrument; and whether non-musical sources are also present. The originality of MONYC is that it reveals how music manifests itself in a real-world setting among social interactions in an urban context. We perform a detailed qualitative analysis of MONYC, show its spatiotemporal trends, and discuss the scope of research questions that it can answer in the future.

PDF
Abstract

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allows the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.

Cites: 29 ( see at Google Scholar )

PDF
Abstract

Audio captioning is a multi-modal task focusing on generating a natural sentence to describe the content of an audio clip. This paper proposes a solution for automated audio captioning based on weakly supervised pre-training and word selection methods. Our solution addresses two problems in automated audio captioning: data insufficiency and word selection indeterminacy. As the amount of training data is limited, we collect a large-scale weakly labeled dataset from the Web using heuristic methods. We then pre-train encoder-decoder models on this dataset, followed by fine-tuning on the Clotho dataset. To address the word selection indeterminacy problem, we use keywords extracted from the captions of similar audio clips and audio tags produced by pre-trained audio tagging models to guide caption generation. The proposed system achieves the best SPIDEr score of 0.310 in the DCASE 2021 Challenge Task 6.

Cites: 13 ( see at Google Scholar )

PDF
Abstract

This paper proposes a new large-scale dataset called “ToyADMOS2” for anomaly detection in machine operating sounds (ADMOS). As with our previous ToyADMOS dataset, we collected a large number of operating sounds of miniature machines (toys) under normal and anomalous conditions by deliberately damaging them, but extended the collection in this case by providing a controlled depth of damage in the anomaly samples. Since typical application scenarios of ADMOS require robust performance under domain-shift conditions, the ToyADMOS2 dataset is designed for evaluating systems under such conditions. The released dataset consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks and fault diagnosis of machines with moving tasks. Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different models and parts configurations, operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27k samples of normal machine-operating sounds and over 8k samples of anomalous sounds recorded with five to eight microphones. The dataset is freely available for download at https://github.com/nttcslab/ToyADMOS2-dataset and https://doi.org/10.5281/zenodo.4580270.

Cites: 118 ( see at Google Scholar )

PDF
Abstract

The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprising over 23,000 labelled Freesound clips. Unlike past datasets such as FSDKaggle2018 and FSDnoisy18K, ARCA23K facilitates the study of label noise in a more controlled manner. We describe the entire process of creating the dataset such that it is fully reproducible, meaning researchers can extend our work with little effort. We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise. Experiments are carried out in which we study the impact of label noise in terms of classification performance and representation learning.

Cites: 4 ( see at Google Scholar )

PDF
Abstract

Sound recorded from beehives is important for understanding a colony's state. This fact is used in the we4bee project, where beehives are equipped with sensors (among them microphones), distributed to educational institutions, and set up to record colony characteristics at the communication level. Due to data protection laws, we have to ensure that no human speech is recorded alongside the bees' sounds. However, detecting the presence of speech is challenging, since the frequencies of human speech and the humming of bees largely overlap. Despite having access to only a limited amount of labeled data, in this initial study we show how to solve this problem using Siamese networks. We find that using common convolutional neural networks in a Siamese setting can strongly improve the ability to detect human speech in recordings obtained from beehives. By adding train-time augmentation techniques, we are able to reach a recall of up to 100%, resulting in a reliable technique that adheres to privacy regulations. Our results are useful for research projects that require written permits for acquiring data, which impedes the collection of samples. Further, all steps, including pre-processing, are computed on the GPU and can be used in an end-to-end pipeline, which allows for quick prototyping.

Cites: 1 ( see at Google Scholar )

PDF
Abstract

We present the task description and discuss the results of DCASE 2021 Challenge Task 2. In 2020, we organized an unsupervised anomalous sound detection (ASD) task, identifying whether a given sound was normal or anomalous without anomalous training data. In 2021, we organized an advanced unsupervised ASD task under domain-shift conditions, which focuses on an inevitable problem in the practical use of ASD systems. The main challenge of this task is to detect unknown anomalous sounds where the acoustic characteristics of the training and testing samples are different, i.e., domain-shifted. This problem frequently occurs due to changes in seasons, manufactured products, and/or environmental noise. We received 75 submissions from 26 teams, and several novel approaches were developed in this challenge. Based on the analysis of the evaluation results, we found two types of remarkable approaches adopted by the top-5 winning teams: 1) ensembles of "outlier exposure" (OE)-based detectors and "inlier modeling" (IM)-based detectors, and 2) approaches based on IM-based detection using features learned in a machine-identification task.
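
For reference, the AUC/pAUC scoring used for a single machine section can be sketched as follows, with synthetic scores as stand-ins; the official ranking score harmonically averages these values over all machines, sections, and domains.

import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import hmean

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(100), np.ones(100)]                   # 0 = normal, 1 = anomalous
scores = np.r_[rng.normal(0, 1, 100), rng.normal(1, 1, 100)]  # anomaly scores

auc = roc_auc_score(y_true, scores)
pauc = roc_auc_score(y_true, scores, max_fpr=0.1)   # partial AUC, p = 0.1
print(hmean([auc, pauc]))                           # harmonic mean of the two metrics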

Cites: 79 ( see at Google Scholar )

PDF
Abstract

How to handle multi-device audio inputs with a single, efficiently designed acoustic scene classification system is a practical research topic. In this work, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing information useful for classification. Moreover, we introduce an efficient architecture, BC-ResNet-ASC, a modified version of the baseline architecture with a limited receptive field. BC-ResNet-ASC outperforms the baseline architecture even though it contains a small number of parameters. Through three model compression schemes, pruning, quantization, and knowledge distillation, we further reduce model complexity while mitigating the performance degradation. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0 KB of non-zero parameters. The proposed method won first place in DCASE 2021 Challenge Task 1A.
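
A rough sketch of the residual-normalization idea, i.e. an identity shortcut around an instance-normalization branch; the exact normalization axes and shortcut weighting follow the paper and may differ from this stand-in.

import torch
import torch.nn as nn

class ResidualNorm(nn.Module):
    # The normalized branch suppresses device-specific statistics, while the
    # shortcut (scaled by lam) retains part of the original information.
    def __init__(self, num_channels, lam=0.1):
        super().__init__()
        self.lam = lam
        self.inorm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, x):              # x: (batch, channels, freq, time)
        return self.lam * x + self.inorm(x)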

Cites: 21 ( see at Google Scholar )

PDF
Abstract

Automated Audio Captioning (AAC) is the task of automatically creating captions that explain given audio data using machine learning techniques. Our solutions to this problem were tested in the DCASE 2021 audio captioning challenge, in which a model is required to generate natural language descriptions of a given audio signal. We use pre-trained models trained on AudioSet, a large-scale dataset of manually annotated audio events. The large amount of audio event data helps capture important audio feature representations. To make use of the features learned from AudioSet, we utilize a CNN14 or ResNet54 network pre-trained on AudioSet, which achieved state-of-the-art performance in audio pattern recognition. Our proposed sequence-to-sequence model consists of a CNN14 or ResNet54 encoder and a Transformer decoder. Experiments show that the proposed model can achieve SPIDEr scores of 0.246 and 0.285 on audio captioning. We further experiment with three different audio features: the log-mel spectrogram, the constant-Q transform spectrogram, and the gammatone filter spectrogram.

Cites: 2 ( see at Google Scholar )

PDF
Abstract

This paper presents an ensemble approach based on two unsupervised anomalous sound detection (ASD) methods for machine condition monitoring under domain-shifted conditions in DCASE 2021 Challenge Task 2. The first ASD method is based on a conformer-based sequence-level autoencoder with section ID regression and a self-attention architecture. We utilize data augmentation techniques such as SpecAugment to boost performance and combine a simple scorer module for each section and each domain to address the domain-shift problem. The second ASD method is based on a binary classification model using metric learning, which uses task-irrelevant outliers as pseudo-anomalous data while controlling the centroids of normal and outlier data in a feature space. As a countermeasure against the domain-shift problem, we perform data augmentation based on Mixup with data from the target domain, resulting in stable performance for each section. An ensemble approach is applied to each method, and the two resulting ensembled methods are further ensembled to maximize the ASD performance. The results of DCASE 2021 Challenge Task 2 demonstrate that our proposed method achieves a score of 63.745%, the harmonic mean of the area under the curve (AUC) and partial AUC (p = 0.1) over all machines, sections, and domains.
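
A minimal sketch of Mixup between source- and target-domain training examples, as a stand-in for the augmentation mentioned above; the pairing strategy and alpha value are illustrative.

import numpy as np

def mixup(x_source, x_target, alpha=0.2):
    # Blend two inputs with a Beta-distributed coefficient; the caller mixes
    # the corresponding labels (if any) with the same lam.
    lam = np.random.beta(alpha, alpha)
    return lam * x_source + (1.0 - lam) * x_target, lam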

Cites: 7 ( see at Google Scholar )

PDF
Abstract

Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder is then used to generate the caption. However, training an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text, even when trained with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.

Cites: 24 ( see at Google Scholar )

PDF
Abstract

We present our submission to the DCASE2021 Challenge Task 2, which aims to promote research in anomalous sound detection. We found that blending the predictions of various anomaly detectors, rather than relying on well-known domain adaptation techniques alone, gave us the best performance under domain shifted conditions. Our submission is composed of two self-supervised classifier models, a probabilistic model we call NF-CDEE, and an ensemble of the three -- the latter obtained the top rank in the DCASE2021 Challenge Task 2.

Cites: 35 ( see at Google Scholar )

PDF
Abstract

This paper presents the details of Task 1A, Low-Complexity Acoustic Scene Classification with Multiple Devices, in the DCASE 2021 Challenge. The task targeted the development of low-complexity solutions with good generalization properties. The provided baseline system is based on a CNN architecture and post-training quantization of parameters. The system is trained using all the available training data, without any specific technique for handling device mismatch, and obtains an overall accuracy of 47.7%, with a log loss of 1.473. The task received 99 submissions from 30 teams, and most of the submitted systems outperformed the baseline. The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy and a log loss under 0.8. The acoustic scene classification task remained a popular task in the challenge, despite the increasing difficulty of the setup.

Cites: 56 ( see at Google Scholar )

PDF
Abstract

Describing soundscapes in sentences allows better understanding of the acoustic scene than a single label indicating the acoustic scene class or a set of audio tags indicating the sound events active in the audio clip. In addition, the richness of natural language allows a range of possible descriptions for the same acoustic scene. In this work, we address the diversity obtained when collecting descriptions of soundscapes using crowdsourcing. We study how much the collection of audio captions can be guided by the instructions given in the annotation task, by analysing the possible bias introduced by auxiliary information provided in the annotation process. Our study shows that even when given hints on the audio content, different annotators describe the same soundscape using different vocabulary. In automatic captioning, hints provided as audio tags represent grounding textual information that facilitates guiding the captioning output towards specific concepts. We also release a new dataset of audio captions and audio tags produced by multiple annotators for a subset of the TAU Urban Acoustic Scenes 2018 dataset, suitable for studying guided captioning.

Cites: 21 ( see at Google Scholar )

PDF
Abstract

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. Moreover, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the "exposure bias" problem induced by the "teacher forcing" training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each component of the proposed system contributes to the final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely impact the quality of the generated captions.

Cites: 40 ( see at Google Scholar )

PDF
Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.

Cites: 53 ( see at Google Scholar )

PDF
Abstract

Few-shot bioacoustic event detection is a novel area of research that emerged from a need in monitoring biodiversity and animal behaviour: annotating long recordings for which experts can usually provide only very few annotations, because the task is specialized and labour-intensive. This paper presents an overview of the first evaluation of few-shot bioacoustic sound event detection, organised as a task of the DCASE 2021 Challenge. A set of datasets consisting of mammal and bird multi-species recordings in the wild, along with class-specific temporal annotations, was compiled for the challenge, for the purpose of training learning-based approaches and for evaluating the submissions on a few-shot labelled dataset. This paper describes the task in detail, the datasets that were used for both development and evaluation of the submitted systems, how system performance was ranked, and the characteristics of the best-performing submissions. Some common strategies used by the participating teams are discussed, including input features, model architectures, transfer of prior knowledge, use of public datasets, and data augmentation. Ranking for the challenge was based on overall performance on the evaluation set; however, in this paper we also present results on each of the subsets of the evaluation set. This new analysis reveals submissions that performed better on specific subsets and gives insight into the characteristics of the subsets that can influence performance.

Cites: 27 ( see at Google Scholar )

PDF
Abstract

The use of multiple and semantically correlated sources can provide complementary information that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks that are trained in isolation on audio and visual data, respectively, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the fusion of information from the audio and visual streams is performed at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method using the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains synchronized audio and video recordings from 12 European cities in 10 different scene classes. The proposed model has been shown to provide an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters) in the evaluation results of the DCASE 2021 Challenge.

Cites: 4 ( see at Google Scholar )

PDF
Abstract

This paper details our work towards leveraging state-of-the-art ASR techniques for the task of automated audio captioning. Our model architecture comprises a convolution-augmented Transformer (Conformer) encoder and a Transformer decoder to generate natural language descriptions of acoustic signals in an end-to-end manner. To overcome the limited availability of captioned audio samples for model training, we incorporate AudioSet tags and audio embeddings obtained from pretrained audio neural networks (PANNs) as auxiliary inputs to our model. We train our model on audio samples from the Clotho and AudioCaps datasets, and test on the Clotho dataset's validation and evaluation splits. Experimental results indicate that our trained models significantly outperform the baseline system from the DCASE 2021 Challenge Task 6.

Cites: 14 ( see at Google Scholar )

PDF
Abstract

Sound event localization and detection (SELD) is an emerging research topic that aims to unify the tasks of sound event detection and direction-of-arrival estimation. As a result, SELD inherits the challenges of both tasks, such as noise, reverberation, interference, polyphony, and non-stationarity of sound sources. Furthermore, SELD often faces the additional challenge of assigning correct correspondences between the detected sound classes and directions of arrival for multiple overlapping sound events. Previous studies have shown that unknown interferences in reverberant environments often cause major degradation in the performance of SELD systems. To further understand the challenges of the SELD task, we performed a detailed error analysis on two of our SELD systems, which both ranked second in the team category of the DCASE SELD Challenge, one in 2020 and one in 2021. Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest. In addition, the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

Cites: 6 ( see at Google Scholar )

PDF
Abstract

We adapted methods from the speaker recognition literature to acoustic event detection (or audio tagging) and applied representational similarity analysis, a cognitive neuroscience technique, to gain a deeper understanding of model performance. Experiments with a feed-forward time-delay neural network (TDNN) architecture were carried out using the FSDKaggle2018 dataset. We compared different system optimizations such as speed and reverb augmentation, different input features (spectrograms, mel-filterbanks, MFCCs and cochleagrams), as well as updates to the network architecture (increased or decreased temporal context and model capacity, as well as dropout and batch normalization). Most system configurations were able to outperform the original published baseline, and through a combination of optimizations (data augmentation in particular) our system was able to outperform a harder baseline derived from a pre-trained model trained on many times more data. Additional experiments applying representational similarity analysis to the network embeddings allowed us to understand which acoustic features the different systems used to perform the task.
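
A compact sketch of representational similarity analysis between two systems, assuming clip-level embeddings are available for the same set of stimuli; the dissimilarity and correlation measures here are common choices, not necessarily the ones used in the paper.

from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings):
    # Representational dissimilarity matrix in condensed form:
    # pairwise correlation distance between stimulus embeddings.
    return pdist(embeddings, metric='correlation')

def rsa_similarity(emb_a, emb_b):
    # Correlate the two systems' RDMs over the same stimuli.
    rho, _ = spearmanr(rdm(emb_a), rdm(emb_b))
    return rho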

PDF
Abstract

In this paper, we propose a system for audio-visual scene classification based on a multi-modal ensemble of three feature types: (1) log-mel spectrogram audio features extracted by CNN variants from the audio modality; (2) frame-wise image features extracted by CNN variants from the video modality; and (3) additional frame-wise image features extracted by OpenAI CLIP models, which are trained on a large-scale dataset of web-crawled text and paired images under a contrastive learning framework. We trained the above three models separately and formed an ensemble weighted by the class-wise confidences of each model's outputs. As a result, our ensemble system reached a log loss of 0.149 (official baseline: 0.658) and an accuracy of 96.1% (official baseline: 77.0%) on the TAU Audio-Visual Urban Scenes 2021 dataset used in DCASE2021 Challenge Task 1B.

Cites: 4 ( see at Google Scholar )

PDF
Abstract

In this paper we provide two methods that improve the detection of sound events in domestic environments. First, motivated by the broad categorization of domestic sounds as foreground or background events according to their spectro-temporal structure, we propose to learn a foreground-background classifier jointly with the sound event classifier in a multi-task fashion to improve the generalization of the latter. Second, while the semi-supervised learning setup adopted for training sound event detection systems with synthetic labeled data and unlabeled or partially labeled real data aims to learn representations that are invariant across both domains, there is still a gap in performance when testing such systems in real environments. To further reduce this data mismatch, we propose a domain adaptation strategy that aligns the empirical distributions of the feature representations of active and inactive frames of synthetic and real recordings via optimal transport. We show that these two approaches lead to enhanced detection performance in terms of the event-based macro F1-score on the DESED dataset.

Cites: 2 ( see at Google Scholar )

PDF
Abstract

Over the past few years, convolutional neural networks (CNNs) have been established as the core architecture for audio classification and detection. In particular, a hybrid model that combines a recurrent neural network or a self-attention mechanism with CNNs to deal with longer-range contexts has been widely used. Recently, Transformers, which are pure attention-based architectures, have achieved excellent performance in various fields, showing that CNNs are not essential. In this paper, we investigate the reliance on CNNs for sound event localization and detection by introducing the Many-to-Many Audio Spectrogram Transformer (M2M-AST), a pure attention-based architecture. We adopt multiple classification tokens in the Transformer architecture to easily handle various output resolutions. We empirically show that the proposed M2M-AST outperforms the conventional hybrid model on TAU-NIGENS Spatial Sound Events 2021 dataset.

Cites: 11 ( see at Google Scholar )

PDF
Abstract

This report presents the dataset and baseline of Task 3 of the DCASE 2021 Challenge on Sound Event Localization and Detection (SELD). The acoustical synthesis remains the same as in the previous iteration of the challenge; however, the new dataset brings more challenging conditions of polyphony and overlapping instances of the same class. The most important difference is the introduction of directional interferers, meaning sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world scenario of SELD, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation of SELD problems accompanies the dataset and is shown to outperform the previous one. The new dataset is shown to be significantly more challenging for both baselines according to all considered metrics. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by directional interferers.

Cites: 64 ( see at Google Scholar )

PDF
Abstract

Previous DCASE challenges contributed to an increase in the performance of acoustic scene classification systems. State-of-the-art classifiers demand significant processing capabilities and memory, which is challenging for resource-constrained mobile or IoT edge devices. Thus, these models are more likely to be deployed on more powerful hardware to classify audio recordings previously uploaded (or streamed) from low-power edge devices. In such a scenario, the edge device may apply lossy audio compression to reduce the transmission data rate. This paper explores the effect of lossy audio compression on classification performance using a DCASE 2020 challenge contribution [1]. We found that classification accuracy can decrease by up to 57% compared to classifying original (uncompressed) audio. We further demonstrate how applying lossy audio compression during model training can improve the classification accuracy of compressed audio signals, even for audio codecs and codec bitrates not included in the training process. [1] H. Hu, C.-H. H. Yang, X. Xia, X. Bai, X. Tang, Y. Wang, S. Niu, L. Chai, J. Li, H. Zhu, et al., "Device-robust acoustic scene classification based on two-stage categorization and data augmentation," arXiv preprint arXiv:2007.08389, DCASE 2020, 2020.

Cites: 3 ( see at Google Scholar )

PDF
Abstract

micarraylib is a Python library to load, standardize, and aggregate datasets collected with different microphone array hardware. The goal is to create larger datasets by aggregating existing and mostly incompatible microphone array data and encoding it into standard B-format ambisonics. These larger datasets can be used to develop novel sound event localization and detection (SELD) algorithms. micarraylib streamlines the download, loading, resampling, aggregation, and signal processing of datasets collected with commonly used and custom microphone array hardware. We provide an API to standardize the 3D coordinates of each microphone array capsule, visualize the placement of microphone arrays in specific spatial configurations, and encode time-series data collected with different microphone arrays into B-format ambisonics. Finally, we also show that the data aggregates can be used to reconstruct a microphone capsule's time-series data using the information from other capsules in the data aggregate. micarraylib will allow for the easy addition of more datasets and microphone array hardware as they become available in the future. All original software written for this paper is released under an open-source license.

Cites: 2 ( see at Google Scholar )

PDF
Abstract

The Detection and Classification of Acoustic Scenes and Events Challenge 2021 Task 4 uses a heterogeneous dataset that includes both recorded and synthetic soundscapes. Until recently, only target sound events were considered when synthesizing the soundscapes. However, recorded soundscapes often contain a substantial amount of non-target events that may affect performance. In this paper, we focus on the impact of these non-target events in the synthetic soundscapes. First, we investigate to what extent using non-target events during the training phase, the validation phase, or neither helps the system correctly detect target events. Second, we analyze to what extent adjusting the signal-to-noise ratio between target and non-target events at training time improves sound event detection performance. The results show that using both target and non-target events for only one of the phases (validation or training) helps the system to properly detect sound events, outperforming the baseline (which uses non-target events in both phases). The paper also reports the results of a preliminary study on evaluating the system on clips that contain only non-target events. This opens questions for future work on the non-target subset and on the acoustic similarity between target and non-target events, which might confuse the system.
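
A small sketch of scaling a non-target event to a desired signal-to-noise ratio relative to a target event before mixing; the actual soundscape synthesis pipeline of the task is more involved.

import numpy as np

def scale_to_snr(target, non_target, snr_db):
    # Scale non_target so that 10*log10(P_target / P_non_target) equals snr_db.
    p_target = np.mean(target ** 2) + 1e-12
    p_non = np.mean(non_target ** 2) + 1e-12
    gain = np.sqrt(p_target / (p_non * 10.0 ** (snr_db / 10.0)))
    return non_target * gain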

Cites: 19 ( see at Google Scholar )

PDF
Abstract

Labeling audio material to train classifiers comes with a large amount of human labor. In this paper, we propose an active learning method for sound event classification, where a human annotator is asked to manually label sound segments up to a certain labeling budget. The sound event classifier is incrementally re-trained on pseudo-labeled sound segments and manually labeled segments. The segments to be labeled during the active learning process are selected based on the model uncertainty of the classifier, which we propose to estimate using Monte Carlo dropout, a technique for Bayesian inference in neural networks. Evaluation results on the UrbanSound8K dataset show that the proposed active learning method, which uses pre-trained audio neural network (PANN) embeddings as input features, outperforms two baseline methods based on medoid clustering, especially for low labeling budgets.
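
A minimal sketch of the Monte Carlo dropout uncertainty estimate described above, assuming a PyTorch classifier with dropout layers; in a full implementation only the dropout modules would be switched to train mode.

import torch

def mc_dropout_uncertainty(model, x, n_passes=20):
    model.train()                      # keep dropout stochastic at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_passes)])
    mean_p = probs.mean(dim=0)
    # Predictive entropy of the averaged prediction; the segments with the
    # highest entropy are the candidates sent to the annotator.
    entropy = -(mean_p * torch.log(mean_p + 1e-12)).sum(dim=-1)
    return mean_p, entropy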

Cites: 1 ( see at Google Scholar )

PDF
Abstract

Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examined the effects of replacing the RNN blocks with self-attention layers. We studied the influence of stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and the effect of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (task 3) development data set shows a significant improvement in all employed metrics compared to the baseline CRNN accompanying the task.
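
A bare-bones sketch of the kind of self-attention block that can replace a recurrent block in a CRNN; the number of blocks, heads, positional embeddings, and normalization placement studied in the paper are not reproduced here.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, dim=128, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):               # x: (batch, time, dim) CNN features
        attn_out, _ = self.mha(x, x, x)
        return self.norm(x + attn_out)  # residual connection + layer normalization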

Cites: 12 ( see at Google Scholar )

PDF
Abstract

Underspecification and fairness in machine learning (ML) applications have recently become prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but are now becoming increasingly used in real-world systems where fairness and reliability are critical aspects. In this work, we argue for the need for a more holistic evaluation process for ASC models through disaggregated evaluations. This entails taking into account performance differences across several factors, such as city, location, and recording device. Although these factors play a well-understood role in the performance of ASC models, most works report single evaluation metrics computed over all the different strata of a particular dataset. We argue that metrics computed on specific sub-populations of the underlying data contain valuable information about the expected real-world behavior of proposed systems, and their reporting could improve the transparency and trustworthiness of such systems. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems exhibited by several standard ML architectures when trained on two widely used ASC datasets. Our evaluation shows that all examined architectures exhibit large biases across all factors taken into consideration, in particular with respect to the recording location. Additionally, different architectures exhibit different biases even though they are trained with the same experimental configurations.
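
A tiny illustration of disaggregated evaluation with pandas, using made-up predictions and metadata columns as stand-ins for the actual evaluation data.

import pandas as pd

results = pd.DataFrame({
    "y_true": ["park", "metro", "park", "bus"],
    "y_pred": ["park", "tram", "street_pedestrian", "bus"],
    "city":   ["barcelona", "helsinki", "helsinki", "lyon"],
    "device": ["a", "b", "a", "s1"],
})
results["correct"] = results["y_true"] == results["y_pred"]
# Accuracy per sub-population instead of a single overall number.
print(results.groupby("city")["correct"].mean())
print(results.groupby("device")["correct"].mean())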

Cites: 4 ( see at Google Scholar )

PDF
Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. The task attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems performed better than the baseline. The common techniques among the top systems are the use of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost performance. More importantly, multi-modal methods using both audio and video are employed by all of the top 5 teams. The best system achieved a log loss of 0.195 and an accuracy of 93.8%, compared to the baseline system with a log loss of 0.662 and an accuracy of 77.1%.

Cites: 18 ( see at Google Scholar )

PDF
Abstract

We present a neural network-based sound event detection system that outputs sound events and their time boundaries in audio signals. The network can be trained efficiently with a small amount of strongly labeled synthetic data and a large amount of weakly labeled or unlabeled real data. Based on the mean-teacher framework for semi-supervised learning with RNNs and Transformers, the proposed system employs multi-scale CNNs with efficient channel attention, which can capture diverse features and pay more attention to the important regions of the feature maps. The model parameters are learned with multiple consistency criteria, including interpolation consistency, shift consistency, and clip-level consistency, to improve generalization and representation power. For different evaluation scenarios, we explore different pooling functions and search for the best layer. To further improve the performance, we use data augmentation and posterior-level score fusion. We demonstrate the performance of our proposed method through experimental evaluation using the DCASE2021 Task 4 dataset. On the validation set, our ensemble system achieves a PSDS-scenario-1 of 40.72% and a PSDS-scenario-2 of 80.80%, significantly outperforming the baseline scores of 34.2% and 52.7%, respectively. On the DCASE2021 challenge's evaluation set, our ensemble system ranks 7th among the 28 teams and 14th among the 80 submissions.
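
For reference, the teacher update at the core of the mean-teacher framework can be sketched as an exponential moving average of the student's weights; the decay value here is illustrative.

import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.999):
    # teacher and student share the same architecture; the teacher tracks an
    # exponential moving average of the student's parameters.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)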

Cites: 1 ( see at Google Scholar )

PDF
Abstract

Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.

Cites: 9 ( see at Google Scholar )

PDF
Abstract

Systems based on sub-cluster AdaCos yield state-of-the-art performance on the DCASE 2020 dataset for anomalous sound detection. In contrast to the previous year, the dataset belonging to task 2 'Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions' of the DCASE challenge 2021 contains not only source domains with 1000 normal training samples for each machine but also so-called target domains with different acoustic conditions for which only 3 normal training samples are available. To address this additional problem, a novel anomalous sound detection system based on sub-cluster AdaCos for the DCASE challenge 2021 is presented. The system is trained to extract embeddings whose distributions are estimated in different ways for source and target domains, and utilize the resulting negative log-likelihoods as anomaly scores. In experimental evaluations, it is shown that the presented system significantly outperforms both baseline systems on the source and target domains of the development set. On the evaluation set of the challenge, the proposed system is ranked third among all 27 teams' submissions.

Cites: 8 ( see at Google Scholar )

PDF
Abstract

Sound event localization and detection (SELD), which jointly performs sound event detection (SED) and sound source localization (SSL), simultaneously detects the type and occurrence time of sound events as well as their corresponding direction-of-arrival (DoA) angles. In this paper, we propose a method based on Adaptive Hybrid Convolution (AHConv) and a multi-scale feature extractor. Square convolution shares weights across square regions of the feature maps, which limits its feature extraction ability. To address this problem, we propose an AHConv mechanism, instead of square convolution, to capture dependencies along the time and frequency dimensions, respectively. We also explore a multi-scale feature extractor that can integrate information from a very local to an exponentially enlarged receptive field within the block. To adaptively recalibrate the feature maps after the convolution operation, we designed an adaptive attention block that is embedded in both the AHConv and the multi-scale feature extractor. On the TAU-NIGENS Spatial Sound Events 2021 development dataset, our systems demonstrate a significant improvement over the baseline system. Only the first-order Ambisonics (FOA) dataset was considered in this experiment.

PDF
Abstract

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. Current models are generally based on the neural encoder-decoder architecture, and their decoders mainly use acoustic information extracted by a CNN-based encoder. However, they have ignored semantic information that could help the AAC model to generate meaningful descriptions. This paper proposes a novel approach for automated audio captioning based on incorporating semantic and acoustic information. Specifically, our audio captioning model consists of two sub-modules. (1) The pre-trained keyword encoder utilizes a pre-trained ResNet38 to initialize its parameters, and is then trained using extracted keywords as labels. (2) The multi-modal attention decoder adopts an LSTM-based decoder that contains semantic and acoustic attention modules. Experiments demonstrate that our proposed model achieves state-of-the-art performance on the Clotho dataset. Our code can be found at https://github.com/WangHelin1997/DCASE2021_Task6_PKU.

Cites: 23 ( see at Google Scholar )

PDF
Abstract

A sound scene in a real environment is generally composed of different types of sound events, and the time-frequency scales of these events are diverse. Thus, it is important to design a proper mechanism to extract multi-scale features for sound event detection (SED). In order to improve the discriminative ability between different types of sound events, we propose a multi-scale SED network based on split attention. We design a Multi-Scale (MS) module to extract fine-grained and coarse-level features in parallel. A Channel Shuffle (CS) operation is introduced to enhance cross-channel information communication among the features with different scales. Also, a Split Attention (SA) module is designed to learn several sub-features separately, followed by an attention mechanism that generates the corresponding importance coefficients for each sub-feature. Experiments on the DCASE2021 Task 4 dataset demonstrate the effectiveness of our proposed multi-scale network.
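
A short sketch of a ShuffleNet-style channel shuffle operation, as a stand-in for the CS operation mentioned above.

import torch

def channel_shuffle(x, groups):
    # Permute channels across groups so information can flow between per-group branches.
    b, c, t, f = x.shape
    x = x.view(b, groups, c // groups, t, f)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, t, f)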

PDF
Abstract

Understanding the reasons behind the predictions of deep neural networks is a pressing concern, as it can be critical in several application scenarios. In this work, we present a novel interpretable model for polyphonic sound event detection. It tackles one of the limitations of our previous work, i.e., the difficulty of properly dealing with a multi-label setting. The proposed architecture incorporates a prototype layer and an attention mechanism. The network learns a set of local prototypes in the latent space, each representing a patch of the input representation. Besides, it learns attention maps for positioning the local prototypes and reconstructing the latent space. The predictions are then based solely on the attention maps. Thus, the explanations provided are the attention maps and the corresponding local prototypes. Moreover, the prototypes can be reconstructed back to the audio domain for inspection. The results obtained in urban sound event detection are comparable to those of two opaque baselines, but with fewer parameters, while offering interpretability.

Cites: 2 ( see at Google Scholar )

PDF